How Netflix Handles 500M Daily API Requests
A deep dive into how Netflix routes, load balances, and protects 500 million daily API requests using Zuul, Ribbon, Eureka, and Hystrix — and what you can steal for your own systems.
Netflix serves over 220 million subscribers across 190 countries. On any given day, their API gateway processes upwards of 500 million requests.
Not 500 million database queries.
Not 500 million page loads.
500 million discrete API calls — device registrations, content metadata fetches, playback license requests, recommendation refreshes, billing checks.
At that scale, the question is never:
"How do we handle this request?"
The question becomes:
"How do we handle this request when three downstream services are degraded, one region is partially unavailable, and 40,000 devices wake up simultaneously because a popular show dropped at midnight?"
That is a very different engineering problem.
The Problem With a Single Gateway
Before we talk about what Netflix built, understand what they were trying to escape.
In 2008, Netflix ran a monolithic Java application. Every feature — streaming, billing, recommendations, device management — lived in one codebase, deployed together, scaled together, and failed together.
That created several problems:
- A bad recommendation deployment could take down billing.
- A memory leak in streaming metadata could crash device registration.
- You couldn't scale one component independently.
In August 2008, a database corruption incident in Netflix's own data center left the company unable to ship DVDs for three days.
That outage exposed how fragile a tightly coupled, vertically scaled architecture was. It pushed Netflix to move to AWS and to begin decomposing its monolith into microservices.
By 2012, Netflix had split the system into hundreds of independent services.
They solved the coupling problem.
They created a routing problem.
What Zuul Actually Is
Zuul is Netflix’s edge service.
Every external request — from every Netflix app on every device — enters Netflix infrastructure through Zuul before touching anything else.
Think of it like the front door of an enormous building.
That front door has to:
- Authenticate users
- Decide routing destinations
- Balance traffic
- Enforce rate limits
- Record metrics
- Handle failures gracefully
All in milliseconds.
At an average of roughly 5,800 requests per second (500 million spread across 86,400 seconds), with peaks far above that.
Zuul 2 is built on top of Netty, a non-blocking I/O framework for Java. (Zuul 1 ran on a blocking servlet model.)
Why does that matter?
Blocking I/O scales poorly because every in-flight request holds a thread that sits idle while it waits for a response.
Non-blocking I/O lets one thread manage thousands of concurrent connections.
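To make the difference concrete, here is a minimal sketch in plain Java. It is not Netty and it performs no real network I/O: timers stand in for slow backend responses, and the class and method names are hypothetical. What it shows is a single worker thread keeping many waits in flight, because registering a wait returns immediately instead of parking the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: one "event loop" thread services many simulated slow calls. */
final class SingleThreadWaitDemo {

    /** Starts `count` simulated backend calls on ONE thread and returns how many completed. */
    static int completeAll(int count, long delayMillis) {
        ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor();
        try {
            List<CompletableFuture<Integer>> pending = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int requestId = i;
                CompletableFuture<Integer> response = new CompletableFuture<>();
                // Scheduling returns at once; the loop thread is not held while the "backend" is slow.
                loop.schedule(() -> { response.complete(requestId); }, delayMillis, TimeUnit.MILLISECONDS);
                pending.add(response);
            }
            int completed = 0;
            for (CompletableFuture<Integer> response : pending) {
                response.join(); // the caller waits here; the loop thread never blocks
                completed++;
            }
            return completed;
        } finally {
            loop.shutdownNow();
        }
    }
}
```

Calling `SingleThreadWaitDemo.completeAll(1000, 50)` keeps 1,000 waits of 50 ms each in flight on one thread. A blocking design would need either 1,000 threads or 1,000 sequential waits to do the same work.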
High-Level Architecture
 Client Device
       │
       ▼
┌─────────────┐
│  Zuul Edge  │  ← authentication, rate limiting, routing
└─────────────┘
       │
       ▼
┌─────────────┐
│   Ribbon    │  ← client-side load balancing
└─────────────┘
       │
       ▼
┌─────────────────────────────────────────────┐
│      Microservices (hundreds of them)       │
│  Recommendations │ Streaming │ Billing ...  │
└─────────────────────────────────────────────┘
Zuul's Filter Chain
The core idea behind Zuul is its filter system.
Every request moves through a chain of filters.
Each filter owns one responsibility.
Filter Types
PRE Filters
Run before routing.
Used for:
- Authentication
- Rate limiting
- Request metadata injection
ROUTING Filters
Handle forwarding requests to backend services.
This is where Ribbon participates.
POST Filters
Run after backend responses.
Used for:
- Logging
- Metrics
- Response headers
ERROR Filters
Run whenever failures happen.
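The ordering of the four filter types can be sketched without any Zuul dependency. Everything below is hypothetical (`SimpleFilter`, `NamedFilter`, `SimpleFilterChain`) and only models the flow described above: pre filters, then routing, then post filters, with error filters stepping in when an earlier stage throws. It is a simplified model, not Zuul's actual dispatch code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical filter contract, loosely modelled on the four types above. */
interface SimpleFilter {
    String type();                 // "pre", "route", "post" or "error"
    int order();                   // lower values run first within a type
    void run(List<String> trace);  // records its work, or throws to signal failure
}

/** Filter that appends its name to the trace, or fails on demand. */
final class NamedFilter implements SimpleFilter {
    private final String type;
    private final int order;
    private final String name;
    private final boolean fails;

    NamedFilter(String type, int order, String name, boolean fails) {
        this.type = type;
        this.order = order;
        this.name = name;
        this.fails = fails;
    }

    public String type() { return type; }
    public int order() { return order; }

    public void run(List<String> trace) {
        if (fails) {
            throw new IllegalStateException(name + " failed");
        }
        trace.add(name);
    }
}

final class SimpleFilterChain {
    private final List<SimpleFilter> filters = new ArrayList<>();

    void register(SimpleFilter filter) { filters.add(filter); }

    /** Runs every registered filter of one type, in ascending order. */
    private void runType(String type, List<String> trace) {
        filters.stream()
                .filter(f -> f.type().equals(type))
                .sorted(Comparator.comparingInt(SimpleFilter::order))
                .forEach(f -> f.run(trace));
    }

    /** Handles one request and returns the names of the filters that ran. */
    List<String> handle() {
        List<String> trace = new ArrayList<>();
        try {
            runType("pre", trace);
            runType("route", trace);
        } catch (RuntimeException e) {
            runType("error", trace); // a pre or route failure hands over to error filters
        }
        runType("post", trace);      // post filters still run so logging and metrics are kept
        return trace;
    }
}
```

Registration order does not matter in this sketch. Each filter declares its own type and order, which is what lets teams add a cross-cutting concern without touching routing code.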
Authentication Pre-Filter Example
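// Uses the Zuul 1 filter API (servlet-based).
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;
import javax.servlet.http.HttpServletRequest;

public class AuthenticationFilter extends ZuulFilter {
    // TokenService is an application-specific collaborator, not part of Zuul.
    private final TokenService tokenService;

    public AuthenticationFilter(TokenService tokenService) {
        this.tokenService = tokenService;
    }

    @Override
    public String filterType() {
        return "pre"; // run before routing
    }

    @Override
    public int filterOrder() {
        return 1; // lower values run earlier among "pre" filters
    }

    @Override
    public boolean shouldFilter() {
        return true; // apply to every request
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        HttpServletRequest request = ctx.getRequest();
        String token = request.getHeader("Authorization");
        // Reject missing or invalid tokens without forwarding to a backend.
        if (token == null || !tokenService.isValid(token)) {
            ctx.setSendZuulResponse(false);
            ctx.setResponseStatusCode(401);
            return null;
        }
        // Pass the authenticated user's ID to downstream services.
        ctx.addZuulRequestHeader(
            "X-User-Id",
            tokenService.getUserId(token)
        );
        return null;
    }
}
Netflix open-sourced Zuul in 2013.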
Zuul 2 (the Netty rewrite) followed in 2018.
The important takeaway isn't the library itself — it's the filter model. Separating cross-cutting concerns from routing logic scales beautifully.
Service Discovery With Eureka
Zuul knows it must call the Recommendations Service.
But which instance?
At Netflix scale, every service runs across dozens or hundreds of instances.
This becomes a service discovery problem.
Netflix solved it using Eureka.
Eureka is a service registry.
Each service registers itself at startup.
Example:
POST /eureka/apps/RECOMMENDATIONS
{
  "instance": {
    "hostName": "ec2-54-234-12-88.compute-1.amazonaws.com",
    "app": "RECOMMENDATIONS",
    "ipAddr": "54.234.12.88",
    "port": 8080,
    "status": "UP",
    "dataCenterInfo": {
      "name": "Amazon",
      "metadata": {
        "availability-zone": "us-east-1a"
      }
    }
  }
}
Each service instance sends heartbeats every 30 seconds.
If heartbeats stop:
- The instance's lease expires (after 90 seconds by default) and it is evicted from the registry
- Traffic stops routing to it
Clients keep a local registry cache.
That means:
Even if Eureka fails, traffic continues routing using previously known service state.
Netflix intentionally optimized for availability over consistency.
A stale registry is survivable.
No routing is catastrophic.
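The two behaviours above, lease expiry and the client-side cache, fit in a few lines. This is a sketch under assumptions, not Eureka's implementation: `LeaseRegistry` and `CachingDiscoveryClient` are hypothetical names, time is passed in explicitly to keep the example deterministic, and the 90-second lease is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical registry: an instance stays listed only while its heartbeats keep arriving. */
final class LeaseRegistry {
    private final long leaseMillis;
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();

    LeaseRegistry(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Registers the instance or renews its lease. */
    void heartbeat(String instanceId, long nowMillis) {
        lastHeartbeat.put(instanceId, nowMillis);
    }

    /** Evicts instances whose lease has expired, then returns the survivors (sorted by ID). */
    List<String> liveInstances(long nowMillis) {
        lastHeartbeat.values().removeIf(lastSeen -> nowMillis - lastSeen > leaseMillis);
        return new ArrayList<>(lastHeartbeat.keySet());
    }
}

/** Hypothetical client that keeps routing from its last good snapshot if the registry is down. */
final class CachingDiscoveryClient {
    private final LeaseRegistry registry;
    private List<String> cached = new ArrayList<>();

    CachingDiscoveryClient(LeaseRegistry registry) {
        this.registry = registry;
    }

    /** Refreshes the cache when the registry is reachable; otherwise keeps the stale copy. */
    void refresh(long nowMillis, boolean registryReachable) {
        if (registryReachable) {
            cached = registry.liveInstances(nowMillis);
        }
    }

    List<String> instances() {
        return new ArrayList<>(cached);
    }
}
```

The trade-off is visible in `refresh`: during a registry outage the client may still list an instance that has died, but it never ends up with an empty routing table.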
Client-Side Load Balancing — Ribbon
Traditional load balancing works like this:
Client → Load Balancer → Backend
Netflix instead uses client-side load balancing.
Meaning:
The client itself chooses which backend instance to call.
Ribbon uses:
- Eureka service registry
- Response time statistics
- Zone awareness
Default strategy:
Zone-aware weighted round robin
Goals:
- Prefer same AWS availability zone
- Reduce latency
- Reduce cross-AZ cost
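The zone preference can be sketched as a filtering step that runs before any weighting. The `ZonePreference` class and its `Instance` type are hypothetical, and the fallback rule (use every instance when the local zone has none) is an assumption for illustration, not Ribbon's exact policy.

```java
import java.util.ArrayList;
import java.util.List;

final class ZonePreference {

    /** Hypothetical instance descriptor: an ID plus the availability zone it runs in. */
    static final class Instance {
        final String id;
        final String zone;

        Instance(String id, String zone) {
            this.id = id;
            this.zone = zone;
        }
    }

    /** Returns same-zone instances when any exist; otherwise falls back to all instances. */
    static List<Instance> candidates(List<Instance> all, String clientZone) {
        List<Instance> sameZone = new ArrayList<>();
        for (Instance instance : all) {
            if (instance.zone.equals(clientZone)) {
                sameZone.add(instance);
            }
        }
        // Crossing zones costs latency and money, but it beats having nowhere to route.
        return sameZone.isEmpty() ? new ArrayList<>(all) : sameZone;
    }
}
```

A load-balancing rule would then choose among the returned candidates, so zone preference narrows the pool without deciding the final instance.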
Ribbon tracks:
- Active requests
- Average response time
- Failure count
Slow instances receive less traffic instead of being immediately removed.
Simplified Ribbon Logic
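// Simplified sketch of response-time weighting; not Ribbon's actual source.
// getReachableServers() and serverStatsMap are assumed to come from the
// enclosing load balancer.
public Server choose(Object key) {
    List<Server> servers = getReachableServers();
    if (servers.isEmpty()) {
        return null; // no reachable instance; the caller must handle this
    }
    // Weight each server by the inverse of its average response time,
    // so slower instances are picked less often.
    Map<Server, Double> weights = new HashMap<>();
    double totalWeight = 0;
    for (Server server : servers) {
        ServerStats stats = serverStatsMap.get(server);
        // Floor the average so a server with no samples yet cannot divide by zero.
        double weight = 1.0 / Math.max(stats.getResponseTimeAvg(), 1.0);
        totalWeight += weight;
        weights.put(server, weight);
    }
    // Pick a point in [0, totalWeight) and walk the servers until it is covered.
    double rand = Math.random() * totalWeight;
    for (Server server : servers) {
        rand -= weights.get(server);
        if (rand <= 0) {
            return server;
        }
    }
    // Floating-point rounding can leave a tiny remainder; fall back to the last server.
    return servers.get(servers.size() - 1);
}
Ribbon has since entered maintenance mode.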
Today, similar ideas live in Spring Cloud LoadBalancer and modern service mesh systems.
Handling Failures — Hystrix
At 500M requests/day:
Failures are guaranteed.
The real problem:
Preventing cascading failures.
Without protection:
- Recommendations fail
- Requests wait on a 30-second timeout
- Threads pile up
- Thread pool exhausts
- Entire system degrades
Netflix solved this using the Circuit Breaker Pattern through Hystrix.
Circuit Breaker States
CLOSED
Normal traffic.
OPEN
Too many failures.
Requests fail immediately.
HALF-OPEN
Trial requests are allowed.
If healthy → close circuit.
If unhealthy → reopen.
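The three states map onto a small state machine. The sketch below is a deliberate simplification and not Hystrix's implementation: it trips on a count of consecutive failures, whereas Hystrix evaluates an error percentage over a rolling window. The `SimpleCircuitBreaker` name and its parameters are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```java
/** Hypothetical circuit breaker that trips after N consecutive failures. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long sleepWindowMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    /** Returns true if the call may proceed; an OPEN circuit rejects immediately. */
    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= sleepWindowMillis) {
            state = State.HALF_OPEN; // sleep window elapsed: let a trial request through
        }
        return state != State.OPEN;
    }

    /** A success closes the circuit and resets the failure count. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed trial, or too many failures in a row, opens the circuit. */
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMillis = nowMillis;
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller checks `allowRequest` before each backend call and serves a fallback when it returns false. That check is what turns a 30-second wait into an immediate answer.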
Hystrix Example
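import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import java.util.List;

// Methods of a service class; recommendationsClient and cacheService are its fields.
@HystrixCommand(
    fallbackMethod = "getDefaultRecommendations",
    commandProperties = {
        // Open the circuit once 50% of recent requests have failed.
        @HystrixProperty(
            name = "circuitBreaker.errorThresholdPercentage",
            value = "50"
        ),
        // Stay open for 5 seconds before allowing a trial request.
        @HystrixProperty(
            name = "circuitBreaker.sleepWindowInMilliseconds",
            value = "5000"
        ),
        // Treat any call slower than 1 second as a failure.
        @HystrixProperty(
            name = "execution.isolation.thread.timeoutInMilliseconds",
            value = "1000"
        )
    }
)
public List<Movie> getRecommendations(String userId) {
    return recommendationsClient.fetch(userId);
}

// Invoked on failure, timeout, or open circuit; must match the primary method's signature.
public List<Movie> getDefaultRecommendations(String userId) {
    return cacheService.getLastKnownRecommendations(userId);
}
Fallbacks matter.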
Netflix rarely shows failures.
Instead:
- Cached recommendations
- Popular titles
- Graceful degradation
This principle is called:
Best available response
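One way to express "best available response" is an ordered list of sources, each tried only if the one before it failed or came back empty. This is a sketch of the idea rather than Netflix's code; `BestAvailableResponse` and the source ordering are hypothetical.

```java
import java.util.List;
import java.util.function.Supplier;

final class BestAvailableResponse {

    /**
     * Tries each source in order and returns the first non-empty result.
     * A source that throws is skipped. If every source fails, lastResort is returned.
     */
    static <T> List<T> firstAvailable(List<Supplier<List<T>>> sources, List<T> lastResort) {
        for (Supplier<List<T>> source : sources) {
            try {
                List<T> result = source.get();
                if (result != null && !result.isEmpty()) {
                    return result;
                }
            } catch (RuntimeException e) {
                // This source is unavailable; degrade to the next one.
            }
        }
        return lastResort;
    }
}
```

With sources ordered as live recommendations, cached recommendations, then popular titles, the user always sees a populated row, and only its freshness degrades.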
Rate Limiting at the Edge
Not all requests are legitimate.
Some come from:
- Scrapers
- Credential stuffing attacks
- Misconfigured clients
Netflix handles this in Zuul.
Using:
Token Bucket Algorithm
How it works:
- Clients receive tokens
- Requests consume tokens
- Tokens refill over time
- Empty bucket → HTTP 429
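The four steps above can be sketched as a small class. This is an illustrative implementation of the general token bucket algorithm, not Netflix's filter code. `TokenBucket`, its parameters, and the `statusFor` helper are hypothetical, and the clock is passed in so the behaviour is easy to reason about.

```java
/** Hypothetical per-client token bucket. */
final class TokenBucket {
    private final double capacity;         // maximum burst size
    private final double refillPerSecond;  // sustained request rate
    private double tokens;
    private long lastRefillMillis;

    TokenBucket(double capacity, double refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full so a new client can burst immediately
        this.lastRefillMillis = nowMillis;
    }

    /** Refills based on elapsed time, then spends one token if available. */
    synchronized boolean tryConsume(long nowMillis) {
        double elapsedSeconds = Math.max(0, nowMillis - lastRefillMillis) / 1000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillMillis = Math.max(lastRefillMillis, nowMillis);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    /** Maps the decision to an HTTP status: 429 Too Many Requests when the bucket is empty. */
    static int statusFor(boolean allowed) {
        return allowed ? 200 : 429;
    }
}
```

Capacity and refill rate are independent knobs. A large capacity with a low refill rate suits clients that burst and then go quiet, while a small capacity with a steady refill rate suits clients that should stay predictable.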
Benefits:
The bucket tolerates short bursts, and real users naturally send traffic in bursts.
Example:
Opening Netflix app:
- Device registration
- User fetch
- Homepage fetch
- Recommendation refresh
Then traffic stabilizes.
Different clients get different limits:
- Mobile Apps → burst-friendly
- Smart TVs → predictable limits
- Developer APIs → strict quotas
- Internal Services → separate rules
Regional Failover
Netflix runs across multiple AWS regions.
Example:
- us-east-1
- us-west-2
- eu-west-1
Traffic normally routes geographically.
But if failures rise:
Traffic shifts using AWS Route 53 weighted routing.
Netflix practices failure regularly using:
Chaos Monkey
Randomly kills production instances.
Chaos Kong
Simulates regional outages.
Yes.
In production.
During business hours.
Why?
Because production is the only environment that truly matters.
Chaos engineering is dangerous without observability.
Before chaos testing, build metrics, tracing, dashboards, and alerting systems.
End-to-End Request Flow
Client sends request to netflix.com
│
├── AWS Route 53 → nearest healthy region
│
├── AWS ELB
│
├── Zuul Edge Node
│     ├── PRE: authenticate token
│     ├── PRE: rate limiting
│     ├── PRE: attach user context
│     └── PRE: canary/stable routing
│
├── Ribbon chooses backend
│
├── API Service
│     ├── Recommendations (Hystrix)
│     ├── Metadata (Hystrix)
│     ├── Personalization (Hystrix)
│     └── fallback if failures occur
│
├── Zuul POST Filters
│     ├── Logging
│     ├── Metrics
│     └── Response headers
│
└── Response returned
Typical latency targets:
- P50 → under 100ms
- P99 → under 500ms
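To check targets like these you need percentiles, not averages. The sketch below uses the nearest-rank method on a batch of recorded latencies. `LatencyPercentiles` is a hypothetical helper; production systems usually rely on a metrics library with streaming histograms instead of sorting raw samples.

```java
import java.util.Arrays;

final class LatencyPercentiles {

    /** Nearest-rank percentile: the smallest sample that at least p% of samples do not exceed. */
    static long percentile(long[] samplesMillis, int p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be between 1 and 100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** True when both targets from the list above are met. */
    static boolean meetsTargets(long[] samplesMillis) {
        return percentile(samplesMillis, 50) < 100 && percentile(samplesMillis, 99) < 500;
    }
}
```

The P99 figure is the one that exposes tail latency: a handful of slow requests barely moves the median, but it can push P99 past its target.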
What You Can Actually Use
Eureka
Service registry.
Good for:
- Microservices
- Availability-first systems
Alternatives:
- Consul
- etcd
Ribbon
Client-side load balancing.
Modern alternative:
- Spring Cloud LoadBalancer
Hystrix
Circuit breaker.
Modern replacement:
- Resilience4j
Zuul
API gateway.
Alternative:
- Spring Cloud Gateway
The patterns matter more than the tools.
Borrow these ideas:
- Circuit breakers
- Graceful degradation
- Timeouts
- Fallbacks
- Service discovery
- Edge authentication
- Smart routing
The Real Lesson
Netflix architecture is not impressive because it handles 500 million requests.
It is impressive because it handles them while things are failing.
Services crash.
Deployments happen.
Regions degrade.
Traffic spikes.
Chaos is intentional.
The system survives because it is designed for failure first.
Every service assumes failure.
Every timeout has a fallback.
Every fallback is tested.
That is the real engineering philosophy:
Build for failure first.
The happy path takes care of itself.