Scale API Backend to 8,000+ Req/Min with AWS

About the authors

Shubham Navale
Architect

Shubham Navale has been working for more than 5 years in developing web apps and mobile apps in MEAN and MERN stack. He likes to learn new te...

Sunny Dodhiya
Associate Architect

Sunny Dodhiya is a Senior DevOps Engineer at Nitor Infotech, specializing in AWS Cloud technologies and serverless architectures with over 5 y...

Cloud and DevOps | 28 Jan 2026 | 25 min |

Picture this: it’s Black Friday at a bustling retail chain, and your shiny kiosk platform, originally built for the modest hum of everyday shoppers, suddenly faces a stampede of 1200 requests per minute. Something similar happened to us.

A scalable API is an application interface that can absorb increased traffic and query volume while maintaining performance and avoiding failures. It scales with your business requirements, serving 100 users or 100,000 equally well. Key characteristics include fast response times under high load, automatic addition of resources during traffic surges, and scaling back down to save money during quiet periods.

Scalable APIs employ technologies such as load balancers, caching, serverless architecture (AWS Lambda), and horizontal scaling (adding more servers rather than larger ones).

Our kiosk platform was originally designed for moderate traffic but started failing badly once it crossed around 1200 requests per minute. Login APIs stopped responding, inventory syncs timed out, and large transactional flows became unpredictable during peak hours. The business SLA was clear: every request had to be completed within 60 seconds, and the old architecture could not meet this requirement beyond a certain scale.

To address this, the backend was completely redesigned using AWS Step Functions, SQS priority queues, Lambda, and Amazon ElastiCache. The new design stabilized the system. It also increased capacity to about 3500 requests per minute for standard APIs and up to 8400 requests per minute for cache‑served APIs, while keeping all calls within the 60‑second performance window.

Do you want to know how we did this? Let’s take a look at the architectural overview.

Architectural Overview

Fig: Architectural Overview

The system starts with a kiosk‑based client application that communicates with the backend through AWS API Gateway over HTTPS. API Gateway acts as the single entry point. It is responsible for authentication, routing, and enforcing throttling to protect downstream services.

From there, every request flows through two major components before reaching the main processing layer, called the Mixalot Server.

The Prioritizer decides the order in which requests are processed, so that high‑priority traffic is handled first and the system is not overloaded during spikes.
The Cache Mechanism stores frequently accessed or static data, so that repeat requests can be served directly from the cache without re‑invoking the full workflow.

Both components are deeply integrated with Amazon CloudWatch, which collects logs, metrics, and operational data for the entire pipeline. This observability layer is critical for detecting bottlenecks, tracking system health, and validating performance improvements over time.

Together, these pieces form a robust, scalable backend that can handle kiosk traffic reliably while still giving operators full visibility into how the system behaves in production.

With the core architecture established, the critical piece that enabled us to handle 8,000 RPM was our request prioritization system. Here’s how it works.

Request Prioritization Flow

Fig: Request Prioritization Flow

The first major building block in the new architecture is the Prioritizer, which is responsible for ensuring that high‑priority requests are processed first and in strict sequence. This prevents important operations such as logins or critical transactions from getting stuck behind slower, bulk processes.

1. Receiving Requests Through the SQS Priority Queue

All incoming requests from the client app arrive at the backend through API Gateway and are assigned a unique key that is used to correlate the response later. These requests are then placed into an Amazon SQS priority queue, which feeds a Priority Trigger Lambda function whenever a new message appears. The trigger Lambda is responsible for forwarding the request to the Mixalot Server and enforcing the correct processing order.
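As a rough illustration of the correlation step, the sketch below builds a queue message that carries a unique key alongside the request payload. The field names (`correlation_key`, `payload`), the use of a UUID, and the in-memory `deque` standing in for the SQS queue are assumptions made for this example; the article does not describe the actual message format.

```python
import json
import uuid
from collections import deque


def build_queue_message(request_body: dict) -> tuple:
    """Return (correlation_key, message_body) for one incoming request."""
    # Assumption: a random UUID serves as the unique key.
    correlation_key = str(uuid.uuid4())
    message_body = json.dumps(
        {"correlation_key": correlation_key, "payload": request_body}
    )
    return correlation_key, message_body


# In-memory stand-in for the priority queue (not a real SQS client).
priority_queue = deque()

key, body = build_queue_message({"api": "login", "kiosk_id": "K-001"})
priority_queue.append(body)
```

The same key is what the response side would later use to look up the result, so it has to travel with the message from the moment the request is accepted.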

2. Starting the Response Fetching Loop

In parallel with the Priority Trigger Lambda, a Fetch Response Lambda function starts running and continuously checks for a response associated with the same unique key. This Lambda acts as a timed loop that repeatedly looks for the response in Amazon ElastiCache, runs for up to 20 seconds, and exits as soon as the data becomes available.
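The timed loop can be sketched as a small polling function. This is a minimal illustration, not the authors' Lambda code: `lookup` is a hypothetical callable standing in for an ElastiCache read, and the 0.1-second polling interval is an assumption, since the article only gives the 20-second upper bound.

```python
import time
from typing import Callable, Optional


def poll_for_response(
    lookup: Callable[[str], Optional[str]],
    key: str,
    timeout_s: float = 20.0,   # polling window from the article
    interval_s: float = 0.1,   # assumed pause between checks
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll lookup(key) until it returns a value or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        value = lookup(key)
        if value is not None:
            return value       # exit as soon as the data is available
        if clock() >= deadline:
            return None        # polling window exhausted
        sleep(interval_s)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting for real time to pass.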

3. Backend Processing and Response Storage

When the priority queue triggers the Lambda, the request is sent to Mixalot, which returns either a success response or an error. The Priority Lambda stores this result in ElastiCache using the unique key, with a time‑to‑live (TTL) of 120 seconds for reliability. This ensures that the Fetch Response Lambda can immediately locate the correct response without having to call the backend again.
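To make the TTL behavior concrete, here is a small in-memory stand-in for a key-value cache with per-entry expiry. `TtlStore` and `store_backend_result` are hypothetical names for this sketch; a real deployment would rely on the cache engine's own expiry rather than this class.

```python
import time
from typing import Callable, Dict, Optional, Tuple

RESPONSE_TTL_S = 120  # TTL stated in the article


class TtlStore:
    """In-memory stand-in for a cache whose entries expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str, ttl_s: float) -> None:
        self._entries[key] = (self._clock() + ttl_s, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired entries are dropped lazily on read
            return None
        return value


def store_backend_result(store: TtlStore, key: str, result: str) -> None:
    """Write the backend's success or error result under the unique key."""
    store.set(key, result, RESPONSE_TTL_S)
```

The 120-second TTL is much longer than the 20-second polling window, so a result that arrives late is still cleaned up automatically even if nobody reads it.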

4. Returning the Response to the Client

As soon as the response appears in ElastiCache, the Fetch Response Lambda retrieves it and sends the final response back to API Gateway. At the same time, it deletes the corresponding cache entry to avoid stale data being reused in future calls. This closes the loop for a normal, successful request cycle.

5. Handling Timeouts

If no response appears in ElastiCache during the 20‑second polling window, the Fetch Response Lambda exits with a timeout condition. API Gateway then returns a message such as “Request processing timed out.” This way, the client always receives a definitive status even if Mixalot fails to respond in time.
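Steps 4 and 5 can be pictured together as one function that runs once polling has ended. In this sketch a plain dict stands in for ElastiCache, and the HTTP status codes (200 and 504) are assumptions; the article only specifies the timeout message.

```python
from typing import Dict, Optional

TIMEOUT_MESSAGE = "Request processing timed out."


def finish_request(cache: Dict[str, str], key: str) -> dict:
    """Turn the cache state after polling into a client-facing response."""
    # pop() reads and deletes in one step, so the entry cannot be reused.
    body: Optional[str] = cache.pop(key, None)
    if body is None:
        # Assumed status code for the timeout case.
        return {"statusCode": 504, "body": TIMEOUT_MESSAGE}
    return {"statusCode": 200, "body": body}
```

Either branch returns a complete response, which matches the goal that the client never waits without an answer.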

While high-priority requests flow through the main pipeline, lower-priority traffic requires a different approach to prevent system overload.

Primary Prioritizer for Lower‑Priority Traffic

Fig: Primary Prioritizer for Lower‑Priority Traffic

The Primary Prioritizer handles lower‑priority API requests using a similar pattern but with an important constraint: it must never compete with or overload the high‑priority queue. Its execution is therefore controlled by the current load on the priority queue.

1. Purpose and Behavior

The Primary Prioritizer also sends requests to Mixalot, listens for responses in ElastiCache, and returns results to the client. However, it only runs when the number of pending messages in the priority queue is below a configurable threshold (in this case, 100 messages). If the queue is at or above that level, the Primary Prioritizer intentionally pauses to make sure that critical operations always receive precedence.

2. Step Function Logic

The flow is orchestrated using AWS Step Functions and begins with a Lambda step called Prioritizer, which checks the current length of the priority queue.

If the queue length is at least 100, the Step Function enters a short loop where it repeatedly checks the depth for up to five seconds.
If the queue never drops below the threshold in this window, the workflow moves to a Terminate state and exits gracefully.
If the queue length does drop below 100, the Step Function proceeds to the next step.
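The three branches above amount to a bounded wait on queue depth. The sketch below expresses that logic as an ordinary function for readability; in the actual workflow it would be spread across Step Functions states. `get_queue_depth` is a hypothetical callable standing in for a queue-length lookup, and the 0.5-second re-check interval is an assumption.

```python
import time
from typing import Callable

QUEUE_THRESHOLD = 100  # pending-message limit from the article
MAX_WAIT_S = 5.0       # length of the re-check window from the article


def wait_for_capacity(
    get_queue_depth: Callable[[], int],
    threshold: int = QUEUE_THRESHOLD,
    max_wait_s: float = MAX_WAIT_S,
    interval_s: float = 0.5,  # assumed pause between depth checks
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "PROCEED" once depth < threshold, else "TERMINATE" after max_wait_s."""
    deadline = clock() + max_wait_s
    while True:
        if get_queue_depth() < threshold:
            return "PROCEED"
        if clock() >= deadline:
            return "TERMINATE"
        sleep(interval_s)
```

A depth of exactly 100 counts as "at the threshold", so it keeps the lower-priority request waiting.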

3. Sending the Request for Processing

Once the threshold condition is satisfied, the request is sent to a Primary SQS queue using an action like SendMessageToPrimary. The Primary SQS queue triggers its own Primary Trigger Lambda, while the Step Function also invokes the Fetch Response Lambda to wait for the result in ElastiCache. Both paths run in parallel, using the same response‑polling pattern as the high‑priority flow.

4. Storing and Returning the Response

The Primary Trigger Lambda forwards the request to Mixalot and stores the response in ElastiCache under the unique key, again with a TTL of 120 seconds. The Fetch Response Lambda loops for up to 20 seconds, returning the response to API Gateway and cleaning up the cache on success, or returning a timeout‑style response if nothing is found. This gives the system predictable behavior even when downstream processing is slow.

Prioritization alone wasn’t enough to handle the volume; we needed to reduce the actual load on our backend systems. That’s where our caching strategy became crucial.

Cache Mechanism

Fig: Cache Mechanism

The Cache Mechanism is responsible for avoiding unnecessary repeated calls to the Primary Step Function. It allows data that has already been computed once to be returned directly from ElastiCache on subsequent requests.

1. Identifying the Serial Number

Every request is associated with a unique serial number. When API Gateway receives a request, it checks whether this serial number already exists and then adjusts the flow accordingly:

If the serial number does not exist, the system treats the call as a fresh request.
If the serial number does exist, the system first checks whether there is corresponding data in ElastiCache.

2. Case 1 – Fresh Request (Serial Number Does Not Exist)

For a completely new request, Gateway forwards the call to the Primary Step Function, which invokes Mixalot and returns a success or error response.

If the response is an error, it is returned directly to the client, and nothing is stored in cache.
If the response is successful, a unique serial key provided by the AMM system is used to store the data in ElastiCache with a predefined TTL.

The success response is then returned to the client, and any follow‑up request with the same serial number can be served instantly from the cache.

3. Case 2 – Repeat Request (Serial Number Exists)

If the serial number is already known, Gateway checks ElastiCache for an existing entry.

If data exists in cache, the Cache Handler Lambda retrieves it and returns it immediately to Gateway without any additional backend calls.
If the cache entry is missing—because it has expired or been deleted—the request is sent again to the Primary Step Function for reprocessing. The result is then either returned as an error or stored back in ElastiCache and returned to the client.

This completes a full refresh cycle for stale or missing data and ensures the cache stays consistent.
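The two cases can be condensed into a single read path. This is a simplified sketch under several assumptions: a set tracks known serial numbers (the article does not say where they are recorded), a dict stands in for ElastiCache with expiry omitted, and `run_primary_flow` is a hypothetical callable representing the Primary Step Function.

```python
from typing import Callable, Dict, Set


def handle_request(
    serial: str,
    known_serials: Set[str],
    cache: Dict[str, str],
    run_primary_flow: Callable[[str], dict],
) -> dict:
    """Serve from cache when possible; otherwise run the primary flow."""
    # Case 2, cache hit: the serial is known and its data is still cached.
    if serial in known_serials and serial in cache:
        return {"ok": True, "data": cache[serial], "source": "cache"}

    # Case 1 (fresh request) or Case 2 with a missing entry: reprocess.
    result = run_primary_flow(serial)
    known_serials.add(serial)
    if result["ok"]:
        cache[serial] = result["data"]  # only successful responses are cached
    return {**result, "source": "backend"}
```

Because errors are never written to the cache, a failed request is retried against the backend on the next call instead of serving a cached failure.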

Of course, caching is only effective if the data stays fresh. Our cache invalidation strategy ensures users never receive stale information.

Cache Invalidation Mechanism

Fig: Cache Invalidation Mechanism

The last major component is the Cache Invalidation Mechanism. It guarantees that outdated data does not remain in the system when underlying state changes. This is especially important for workflows such as inventory updates or price changes.

1. Receiving an Invalidation Request

The process begins when the client application sends a cache invalidation request that includes the serial number of the cached entry that must be removed. This request passes through API Gateway, which exposes a lightweight endpoint dedicated to invalidation actions.

2. Executing the Cache Invalidator Lambda

Once Gateway receives the request, it invokes a Cache Invalidator Lambda function. This Lambda is intentionally simple:

It extracts the serial number from the request.
It connects to Amazon ElastiCache.
It deletes the cache entry associated with that serial number.

There are no complex branches or Step Functions involved; the Lambda focuses purely on targeted cache removal.

3. Clearing and Refreshing Data

After the Lambda deletes the entry, the cache is considered invalidated for that serial number. Any future request with the same serial number will not find an entry in ElastiCache, and the system naturally falls back to the Primary Step Function flow to generate and store a fresh response. In effect, invalidation resets the state for that specific piece of data and forces reconstruction when needed.
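A handler of this kind can be very short. The sketch below assumes the serial number arrives as a `serial_number` field in a JSON request body and uses a dict in place of ElastiCache; both the field name and the response shape are hypothetical.

```python
import json
from typing import Dict


def invalidate(event: dict, cache: Dict[str, str]) -> dict:
    """Delete the cache entry for the serial number carried in the request."""
    # Assumption: the request body is JSON with a "serial_number" field.
    serial = json.loads(event["body"])["serial_number"]
    removed = cache.pop(serial, None) is not None
    return {
        "statusCode": 200,
        "body": json.dumps({"serial_number": serial, "removed": removed}),
    }
```

Deleting a key that is already gone is treated as success here, which keeps the endpoint safe to call more than once for the same serial number.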

Now that you understand the new architecture, let’s look at why we needed to rebuild in the first place. Here’s what our original system was delivering.

Performance Metrics of the Old Architecture

Fig: Performance Metrics of the Old Architecture

Before implementing the new design, a series of high‑load tests was run on the old architecture to see how it behaved as traffic increased. The test methodology was simple: send 600, 1200, 1800, 2400, and 3000 requests within a fixed 60‑second window and measure whether each API could complete all its requests within that window.

An API was considered “performant” only if all requests for that load finished within 60 seconds. Anything slower or incomplete was treated as a failure.
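Stated as code, the pass criterion is an all-or-nothing check. In this sketch each request is represented by the time in seconds, measured from the start of the test, at which it completed, or `None` if it never completed; that representation and the function name are assumptions for illustration.

```python
from typing import Optional, Sequence

WINDOW_S = 60.0  # test window from the article


def is_performant(
    completion_times_s: Sequence[Optional[float]],
    window_s: float = WINDOW_S,
) -> bool:
    """True only if every request completed, and did so within the window."""
    return all(t is not None and t <= window_s for t in completion_times_s)


print(is_performant([0.5, 12.0, 59.9]))  # True
print(is_performant([0.5, 61.0]))        # False: one request was too slow
print(is_performant([0.5, None]))        # False: one request never completed
```

A single slow or missing request fails the whole load level, which is why the results read as pass or fail per API rather than as a success percentage.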

These numbers tell a story, but the real insights came from analyzing what was actually breaking under load.

Observations

Up to 1200 requests per minute, all APIs completed successfully within the 60‑second window. This meant that the old architecture could handle light to moderate traffic.
At 1800 requests, the system started to break down: the login API could not complete all requests, and several other APIs (status, manager data, inventory, item) exceeded the 60‑second threshold.
At 2400 and 3000 requests, none of the APIs could complete the required number of calls in 60 seconds.

This was particularly serious because the client’s requirement was to support 3000 requests within 60 seconds, and the old system consistently collapsed beyond 1200 requests, only 40% of the required capacity.

The tests clearly showed that scalability limits, latency spikes, and timeouts made the old design unsuitable for production growth, and they justified the need for a new, step‑function‑based and cache‑driven workflow.

Identifying bottlenecks is just the first step. In another case, we eliminated a 15-day customer process, compressing it to 5 seconds. Same principle: find the constraint, architect around it.


Okay, back to the analytics. Armed with these insights, we implemented our new serverless architecture. The results were dramatic. Let’s take a look.

Performance Metrics of the New Architecture

Fig: Performance Metrics of the New Architecture

After the redesign, the same style of load tests was repeated to measure how the new system behaved under pressure. As before, the tests used a 60‑second window and checked if each API could complete its full set of requests in time.

1. Strong Performance up to 3500 Requests

The load was increased gradually from 100 to 3500 requests per 60 seconds across different stages (100, 500, 1000, 1500, 2000, 2500, 2600, 2700, 2800, 3000, and 3500). Across all these levels, every API successfully completed its workload within the 60‑second requirement. This is a dramatic improvement over the old system, which started failing beyond 1200 requests and showed API failures at 1800.

2. Cached APIs Scaling to 8400 Requests

A separate group of APIs is served entirely from cache, including invalidate_data, products_and_promotions, and manager_data. Because these no longer depend on heavy database calls or Step Function workflows, they can respond very quickly.

To validate this, these cached APIs were tested at much higher volumes ranging from 600 to 8400 requests per 60 seconds. Even at 8400 requests in a single minute, all of them finished within the 60‑second window. This demonstrates the full power of the caching layer for read‑heavy operations.

3. Summary of Improvements

The overall improvement can be summarized as follows:

Aspect	Old Architecture	New Architecture
Stable up to	1200 requests	3500 requests
Starts failing at	1800 requests	Not observed (tested up to 3500)
Max tested load	3000 (failed)	3500 (passed)
Cached API scalability	Not applicable	Up to 8400 requests in 60 sec

By introducing caching, optimized request routing, prioritization, and parallel processing, the system now handles roughly 3× the load for standard APIs (3500 vs. 1200 requests per minute) and 7× the load for cached APIs (8400 vs. 1200), while staying within the same 60‑second SLA. The transformation in these metrics demonstrates what’s possible with properly architected AWS serverless infrastructure.
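The multipliers follow directly from the figures in the table, using the old architecture's stable limit as the baseline. The variable names below are chosen for this example only.

```python
old_stable_rpm = 1200    # old architecture, highest load that passed
new_standard_rpm = 3500  # new architecture, standard APIs
new_cached_rpm = 8400    # new architecture, cache-served APIs

standard_gain = new_standard_rpm / old_stable_rpm
cached_gain = new_cached_rpm / old_stable_rpm

print(f"standard APIs: {standard_gain:.2f}x")  # standard APIs: 2.92x
print(f"cached APIs:   {cached_gain:.2f}x")    # cached APIs:   7.00x
```

Note that 3500 was the highest load tested for standard APIs, so 2.92× is a lower bound on the improvement rather than a measured ceiling.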

How many requests can your backend handle right now? If you don’t know the answer, you’re one traffic spike away from finding out the hard way. We load-test, identify bottlenecks, and build AWS serverless solutions that scale automatically, powered by our Cloud Engineering expertise. Contact us today!
