How Netflix Handles 500M Daily API Requests

A deep dive into how Netflix routes, load balances, and protects 500 million daily API requests using Zuul, Ribbon, Eureka, and Hystrix — and what you can steal for your own systems.


Netflix serves over 220 million subscribers across 190 countries. On any given day, their API gateway processes upwards of 500 million requests.

Not 500 million database queries.
Not 500 million page loads.

500 million discrete API calls — device registrations, content metadata fetches, playback license requests, recommendation refreshes, billing checks.

At that scale, the question is never:

"How do we handle this request?"

The question becomes:

"How do we handle this request when three downstream services are degraded, one region is partially unavailable, and 40,000 devices wake up simultaneously because a popular show dropped at midnight?"

That is a very different engineering problem.


The Problem With a Single Gateway

Before we talk about what Netflix built, understand what they were trying to escape.

In 2008, Netflix ran a monolithic Java application. Every feature — streaming, billing, recommendations, device management — lived in one codebase, deployed together, scaled together, and failed together.

That created several problems:

  • A bad recommendation deployment could take down billing.
  • A memory leak in streaming metadata could crash device registration.
  • You couldn't scale one component independently.

In August 2008, a major database corruption in Netflix's own data center accelerated the company's migration toward the cloud and microservices.

For three days, Netflix could not ship DVDs. The incident exposed the risk of single points of failure in a tightly coupled architecture, and Netflix began moving to AWS and decomposing its monolith.

By 2012, Netflix had split the system into hundreds of independent services.

They solved the coupling problem.

They created a routing problem.


What Zuul Actually Is

Zuul is Netflix’s edge service.

Every external request, from every Netflix app on every device, passes through Zuul before it reaches any backend service.

Think of it like the front door of an enormous building.

That front door has to:

  • Authenticate users
  • Decide routing destinations
  • Balance traffic
  • Enforce rate limits
  • Record metrics
  • Handle failures gracefully

All in milliseconds.

At an average of about 5,800 requests per second (500 million requests ÷ 86,400 seconds in a day), with peaks well above that average.

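Zuul 2, the current version, is built on top of Netty, an asynchronous, non-blocking I/O framework for Java. (Zuul 1 used the blocking servlet model.)

Why does that matter?

Blocking I/O scales poorly because each in-flight request holds a thread while it waits for a response, so concurrency is capped by the size of the thread pool.

Non-blocking I/O lets one thread manage thousands of concurrent connections, doing work only when a connection has data ready.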
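Netty's event loops are built on Java NIO selectors. The sketch below shows the underlying idea with plain java.nio rather than Netty's API: a single thread registers several non-blocking channels with one Selector and services whichever are ready. In-process pipes stand in for client connections, an assumption made so the example needs no network.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SelectorLoopSketch {
    // Reads one message from each of `channelCount` pipes using a single thread.
    public static List<String> readAll(int channelCount) throws IOException {
        List<String> messages = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            for (int i = 0; i < channelCount; i++) {
                Pipe pipe = Pipe.open();
                pipe.source().configureBlocking(false);
                // Each key carries its own accumulator for partial reads.
                pipe.source().register(selector, SelectionKey.OP_READ, new ByteArrayOutputStream());
                pipe.sink().write(ByteBuffer.wrap(("msg-" + i).getBytes(StandardCharsets.UTF_8)));
                pipe.sink().close(); // reader sees the message, then end-of-stream
            }
            ByteBuffer buf = ByteBuffer.allocate(256);
            int open = channelCount;
            while (open > 0) {
                selector.select(); // blocks until at least one channel is readable
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    Pipe.SourceChannel source = (Pipe.SourceChannel) key.channel();
                    ByteArrayOutputStream acc = (ByteArrayOutputStream) key.attachment();
                    buf.clear();
                    int n = source.read(buf); // non-blocking: returns what is available now
                    if (n > 0) {
                        acc.write(buf.array(), 0, n);
                    } else if (n < 0) { // end-of-stream: this message is complete
                        messages.add(new String(acc.toByteArray(), StandardCharsets.UTF_8));
                        source.close(); // closing the channel also cancels its key
                        open--;
                    }
                }
            }
        }
        return messages;
    }
}
```

No thread ever sits blocked on one particular connection: `select()` reports which channels are ready, and each read returns whatever bytes are available. Netty layers buffering, handler pipelines, and groups of such event loops on top of this pattern.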


High-Level Architecture

Client Device
       │
       ▼
┌─────────────┐
│  Zuul Edge  │ ← authentication, rate limiting, routing
└─────────────┘
       │
       ▼
┌─────────────┐
│   Ribbon    │ ← client-side load balancing (a library running inside Zuul)
└─────────────┘
       │
       ▼
┌─────────────────────────────────────────────┐
│ Microservices (hundreds of them)            │
│ Recommendations │ Streaming │ Billing ...   │
└─────────────────────────────────────────────┘

Zuul's Filter Chain

The core idea behind Zuul is its filter system.

Every request moves through a chain of filters.

Each filter owns one responsibility.

Filter Types

PRE Filters

Run before routing.

Used for:

  • Authentication
  • Rate limiting
  • Request metadata injection

ROUTING Filters

Forward requests to backend services.

This is where Ribbon participates.

POST Filters

Run after the backend responds.

Used for:

  • Logging
  • Metrics
  • Response headers

ERROR Filters

Run when an error occurs in any other phase.
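The chain itself is easy to picture in code. The sketch below is a simplified, hypothetical dispatcher (FilterChainSketch, Filter, and Action are illustrative names, not Zuul's classes), loosely modeled on Zuul 1's flow: filters are grouped by type, sorted by order, and run PRE then ROUTING; if either throws, ERROR filters take over; POST filters run last either way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class FilterChainSketch {
    // Hypothetical, simplified filter contract (not Zuul's actual API).
    public interface Filter {
        String type(); // "pre", "route", "post", or "error"
        int order();   // lower values run first within a type
        void run(Map<String, Object> ctx) throws Exception;
    }

    // Lets a filter body be written as a lambda.
    public interface Action {
        void apply(Map<String, Object> ctx) throws Exception;
    }

    public static Filter of(String type, int order, Action action) {
        return new Filter() {
            public String type() { return type; }
            public int order() { return order; }
            public void run(Map<String, Object> ctx) throws Exception { action.apply(ctx); }
        };
    }

    private final List<Filter> filters = new ArrayList<>();

    public void register(Filter filter) { filters.add(filter); }

    // Runs every filter of one type in ascending order.
    private void runType(String type, Map<String, Object> ctx) throws Exception {
        List<Filter> matching = new ArrayList<>();
        for (Filter f : filters) {
            if (f.type().equals(type)) matching.add(f);
        }
        matching.sort(Comparator.comparingInt(Filter::order));
        for (Filter f : matching) f.run(ctx);
    }

    // PRE, then ROUTE; if either throws, ERROR filters run instead. POST always runs last.
    public void handle(Map<String, Object> ctx) {
        try {
            runType("pre", ctx);
            runType("route", ctx);
        } catch (Exception e) {
            ctx.put("error", e);
            try { runType("error", ctx); } catch (Exception ignored) { /* continue to POST */ }
        }
        try { runType("post", ctx); } catch (Exception ignored) { /* a real gateway would log this */ }
    }
}
```

Because each concern is its own filter, adding rate limiting means registering one more "pre" filter with an order value; the routing code does not change.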


Authentication Pre-Filter Example

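// Zuul 1 (servlet-based) filter API. TokenService is a hypothetical
// interface standing in for whatever validates tokens in your system.
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;
import javax.servlet.http.HttpServletRequest;

public class AuthenticationFilter extends ZuulFilter {

    private final TokenService tokenService;

    public AuthenticationFilter(TokenService tokenService) {
        this.tokenService = tokenService;
    }

    @Override
    public String filterType() {
        return "pre"; // runs before routing
    }

    @Override
    public int filterOrder() {
        return 1; // lower numbers run earlier within the "pre" phase
    }

    @Override
    public boolean shouldFilter() {
        return true; // apply to every request
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        HttpServletRequest request = ctx.getRequest();

        String token = request.getHeader("Authorization");

        // Reject missing or invalid tokens without forwarding to any backend.
        if (token == null || !tokenService.isValid(token)) {
            ctx.setSendZuulResponse(false);
            ctx.setResponseStatusCode(401);
            return null;
        }

        // Pass the caller's identity downstream so services don't re-parse the token.
        ctx.addZuulRequestHeader("X-User-Id", tokenService.getUserId(token));

        return null;
    }
}

// Hypothetical token-validation contract used above.
interface TokenService {
    boolean isValid(String token);
    String getUserId(String token);
}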

Netflix open-sourced Zuul in 2013.

Zuul 2 (the Netty rewrite) followed in 2018.

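The important takeaway isn't the library itself; it's the filter model. Keeping cross-cutting concerns such as authentication, rate limiting, and metrics in separate filters, apart from routing logic, lets each one be added, reordered, or changed independently as the edge grows.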


Service Discovery With Eureka

Zuul knows it must call the Recommendations Service.

But which instance?

At Netflix scale, every service runs across dozens or hundreds of instances.

This becomes a service discovery problem.

Netflix solved it using Eureka.

Eureka is a service registry.

Each service registers itself at startup.

Example (a simplified registration payload):

POST /eureka/apps/RECOMMENDATIONS
 
{
  "instance": {
    "hostName": "ec2-54-234-12-88.compute-1.amazonaws.com",
    "app": "RECOMMENDATIONS",
    "ipAddr": "54.234.12.88",
    "port": 8080,
    "status": "UP",
    "dataCenterInfo": {
      "name": "Amazon",
      "metadata": {
        "availability-zone": "us-east-1a"
      }
    }
  }
}

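By default, each service instance sends a heartbeat every 30 seconds.

If heartbeats stop for longer than the lease duration (90 seconds by default):

  • The Eureka server evicts the instance from the registry
  • Traffic stops routing to it once clients refresh their cached copy of the registry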
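The heartbeat-and-lease mechanism fits in a few lines. This is a hypothetical in-memory sketch, not Eureka's code: each registration or heartbeat records a timestamp, and a periodic sweep evicts instances whose last heartbeat is older than the lease duration. Time is passed in explicitly so the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LeaseRegistrySketch {
    private final long leaseDurationMillis;
    // instance id -> time of its most recent registration or heartbeat
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public LeaseRegistrySketch(long leaseDurationMillis) {
        this.leaseDurationMillis = leaseDurationMillis;
    }

    public void register(String instanceId, long nowMillis) {
        lastHeartbeat.put(instanceId, nowMillis);
    }

    // A heartbeat renews the lease only for instances that are still registered.
    public void renew(String instanceId, long nowMillis) {
        lastHeartbeat.replace(instanceId, nowMillis);
    }

    // Periodic sweep: drop every instance whose lease has lapsed.
    public void evictExpired(long nowMillis) {
        lastHeartbeat.values().removeIf(last -> nowMillis - last > leaseDurationMillis);
    }

    public List<String> instances() {
        List<String> ids = new ArrayList<>(lastHeartbeat.keySet());
        Collections.sort(ids);
        return ids;
    }
}
```

With 30-second heartbeats and a 90-second lease, an instance has to miss about three consecutive heartbeats before eviction. Real Eureka also has a self-preservation mode that suspends eviction when renewals drop sharply across the board (a sign the network, not the instances, is failing); the sketch leaves it out.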

Clients keep a local registry cache.

That means:

Even if Eureka fails, traffic continues routing using previously known service state.

Netflix intentionally optimized for availability over consistency.

A stale registry is survivable.

No routing is catastrophic.
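The client-side cache is what makes that trade-off work. A minimal sketch, assuming a hypothetical fetch function in place of a real call to the registry server: each refresh replaces the snapshot on success and keeps the old one on failure, so lookups never depend on the registry being reachable right now.

```java
import java.util.List;
import java.util.function.Supplier;

public class CachedRegistryClient {
    // Hypothetical stand-in for a call to the registry server.
    private final Supplier<List<String>> fetchFromServer;
    // Last successful snapshot; starts empty until the first good fetch.
    private volatile List<String> lastKnown = List.of();

    public CachedRegistryClient(Supplier<List<String>> fetchFromServer) {
        this.fetchFromServer = fetchFromServer;
    }

    // Called periodically (Eureka clients refresh every 30 seconds by default).
    public void refresh() {
        try {
            List<String> fresh = fetchFromServer.get();
            if (fresh != null) {
                lastKnown = List.copyOf(fresh);
            }
        } catch (RuntimeException e) {
            // Registry unreachable: keep serving the stale snapshot (availability over consistency).
        }
    }

    public List<String> instances() {
        return lastKnown;
    }
}
```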


Client-Side Load Balancing — Ribbon

Traditional load balancing works like this:

Client → Load Balancer → Backend

Netflix instead uses client-side load balancing.

Meaning:

The client itself chooses which backend instance to call.

Ribbon uses:

  • Eureka service registry
  • Response time statistics
  • Zone awareness

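Common strategy:

Zone-aware filtering followed by round robin (Ribbon's ZoneAvoidanceRule, the default in Spring Cloud Netflix). A response-time-weighted rule, WeightedResponseTimeRule, is also available.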

Goals:

  • Prefer same AWS availability zone
  • Reduce latency
  • Reduce cross-AZ cost
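The zone preference can be sketched as a filter in front of a plain round robin. This is a simplified illustration with a hypothetical Server record (Java 16+), not Ribbon's actual zone logic, which also accounts for zone health and load: keep instances in the caller's own zone when any exist, otherwise fall back to all zones rather than failing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ZonePreferringRoundRobin {
    // Hypothetical server descriptor: address plus availability zone.
    public record Server(String address, String zone) {}

    private final String localZone;
    private final AtomicInteger counter = new AtomicInteger();

    public ZonePreferringRoundRobin(String localZone) {
        this.localZone = localZone;
    }

    public Server choose(List<Server> servers) {
        if (servers.isEmpty()) {
            return null;
        }
        // Prefer instances in the caller's own zone (lower latency, no cross-AZ transfer)...
        List<Server> candidates = new ArrayList<>();
        for (Server s : servers) {
            if (s.zone().equals(localZone)) {
                candidates.add(s);
            }
        }
        // ...but use every zone rather than fail when the local zone has none.
        if (candidates.isEmpty()) {
            candidates = servers;
        }
        // Plain round robin among the candidates; floorMod survives counter overflow.
        int i = Math.floorMod(counter.getAndIncrement(), candidates.size());
        return candidates.get(i);
    }
}
```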

Ribbon tracks:

  • Active requests
  • Average response time
  • Failure count

Slow instances receive less traffic instead of being immediately removed.

Simplified Ribbon Logic

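// Simplified response-time weighting in the spirit of Ribbon's
// WeightedResponseTimeRule (not its actual code). Server and ServerStats are
// Ribbon types; getReachableServers() and serverStatsMap are assumed to come
// from the surrounding load balancer. Needs java.util.* and
// java.util.concurrent.ThreadLocalRandom imports.
public Server choose(Object key) {
    List<Server> servers = getReachableServers();
    if (servers.isEmpty()) {
        return null; // no healthy instance to route to
    }

    Map<Server, Double> weightMap = new HashMap<>();
    double totalWeight = 0;

    for (Server server : servers) {
        ServerStats stats = serverStatsMap.get(server);
        // Faster servers get larger weights. Floor the average at 1 ms so a
        // server with no samples yet (average 0) doesn't divide by zero.
        double avgMillis = (stats == null) ? 1.0 : Math.max(1.0, stats.getResponseTimeAvg());
        double weight = 1.0 / avgMillis;

        totalWeight += weight;
        weightMap.put(server, weight);
    }

    // Pick a random point in [0, totalWeight) and walk the cumulative weights.
    double rand = ThreadLocalRandom.current().nextDouble() * totalWeight;

    for (Server server : servers) {
        rand -= weightMap.get(server);
        if (rand < 0) {
            return server;
        }
    }

    // Floating-point rounding can leave rand just above 0; use the last server.
    return servers.get(servers.size() - 1);
}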

Ribbon has since moved to maintenance mode, and Spring Cloud removed its Ribbon integration in the 2020.0 release train.

Today, similar ideas live in Spring Cloud LoadBalancer and modern service mesh systems.


Handling Failures — Hystrix

At 500M requests/day:

Failures are guaranteed.

The real problem:

Preventing cascading failures.

Without protection:

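  1. The Recommendations service slows down or fails
  2. Every call waits out a 30-second timeout
  3. Threads pile up waiting on those calls
  4. The shared thread pool is exhausted
  5. Unrelated requests stall, and the entire system degrades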

Netflix addressed this with Hystrix, which wraps each dependency call in a timeout, an isolated thread pool (a bulkhead), and the Circuit Breaker Pattern.

Circuit Breaker States

CLOSED

Normal traffic.

OPEN

Too many failures.

Requests fail immediately.

HALF-OPEN

Trial requests are allowed.

If healthy → close circuit.

If unhealthy → reopen.
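The three states form a small state machine. A minimal sketch, not Hystrix's implementation: the three constructor parameters play the same roles as Hystrix's errorThresholdPercentage, requestVolumeThreshold, and sleepWindowInMilliseconds settings, but this version counts requests cumulatively rather than over Hystrix's rolling window (10 seconds by default), and callers pass the current time explicitly so the behavior is easy to test.

```java
public class CircuitBreakerSketch {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int errorThresholdPercent; // trip at or above this failure rate
    private final int minimumRequests;       // don't judge on tiny samples
    private final long sleepWindowMillis;    // how long to stay OPEN before a trial

    private State state = State.CLOSED;
    private int requests = 0;
    private int failures = 0;
    private long openedAtMillis = 0;

    public CircuitBreakerSketch(int errorThresholdPercent, int minimumRequests, long sleepWindowMillis) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.minimumRequests = minimumRequests;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    // Should a call be attempted now? If not, the caller goes straight to its fallback.
    public synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= sleepWindowMillis) {
            state = State.HALF_OPEN; // let exactly one trial request through
            return true;
        }
        return state == State.CLOSED;
    }

    public synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED; // trial succeeded: resume normal traffic
            resetCounts();
            return;
        }
        requests++;
    }

    public synchronized void recordFailure(long nowMillis) {
        if (state == State.HALF_OPEN) {
            trip(nowMillis); // trial failed: reopen and wait another window
            return;
        }
        requests++;
        failures++;
        if (requests >= minimumRequests && failures * 100 >= errorThresholdPercent * requests) {
            trip(nowMillis);
        }
    }

    public synchronized State state() {
        return state;
    }

    private void trip(long nowMillis) {
        state = State.OPEN;
        openedAtMillis = nowMillis;
        resetCounts();
    }

    private void resetCounts() {
        requests = 0;
        failures = 0;
    }
}
```

While the circuit is OPEN, `allowRequest` returns false without touching the failing dependency, which is what keeps threads from piling up in step 3 of the failure sequence above.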


Hystrix Example

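// Requires hystrix-javanica, which applies @HystrixCommand through an aspect.
// RecommendationsClient, CacheService, and Movie are hypothetical types.
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import java.util.List;

public class RecommendationService {

    private final RecommendationsClient recommendationsClient;
    private final CacheService cacheService;

    public RecommendationService(RecommendationsClient recommendationsClient, CacheService cacheService) {
        this.recommendationsClient = recommendationsClient;
        this.cacheService = cacheService;
    }

    @HystrixCommand(
        fallbackMethod = "getDefaultRecommendations",
        commandProperties = {
            // Open the circuit once 50% or more of recent requests fail...
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
            // ...stay open for 5 seconds before letting a trial request through...
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "5000"),
            // ...and abandon any single call that takes longer than 1 second.
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "1000")
        }
    )
    public List<Movie> getRecommendations(String userId) {
        return recommendationsClient.fetch(userId);
    }

    // Same signature as the command: serve last known results instead of an error.
    public List<Movie> getDefaultRecommendations(String userId) {
        return cacheService.getLastKnownRecommendations(userId);
    }
}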

Fallbacks matter.

Netflix rarely shows users a raw error.

Instead:

  • Cached recommendations
  • Popular titles
  • Graceful degradation

This principle is called:

Best available response
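A best available response can be expressed as an ordered list of progressively more degraded sources. The sketch below is a hypothetical helper, not part of Hystrix: it returns the first source that succeeds with a non-null result, for example live recommendations, then a cached copy, then a static list of popular titles.

```java
import java.util.function.Supplier;

public class FallbackChain {
    // Returns the first non-null result from the sources, tried in order.
    // Any RuntimeException counts as "this source is unavailable".
    @SafeVarargs
    public static <T> T firstAvailable(Supplier<T>... sources) {
        for (Supplier<T> source : sources) {
            try {
                T result = source.get();
                if (result != null) {
                    return result;
                }
            } catch (RuntimeException e) {
                // Fall through to the next, more degraded option.
            }
        }
        throw new IllegalStateException("no source produced a response");
    }
}
```

Ordering matters: the last entry should be something that essentially cannot fail, such as an in-memory list, so the user always gets a page.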


Rate Limiting at the Edge

Not all requests are legitimate.

Some come from:

  • Scrapers
  • Credential stuffing attacks
  • Misconfigured clients

Netflix handles this in Zuul.

Using:

Token Bucket Algorithm

How it works:

  • Clients receive tokens
  • Requests consume tokens
  • Tokens refill over time
  • Empty bucket → HTTP 429

Benefit:

Real users naturally burst traffic, and a token bucket absorbs short bursts up to its capacity while still capping the average rate.

Example:

Opening the Netflix app triggers:

  • Device registration
  • User fetch
  • Homepage fetch
  • Recommendation refresh

Then traffic stabilizes.
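The token bucket steps map to a small class. A minimal sketch, assuming one bucket per client and a caller-supplied clock; the capacity and refill rate in the usage below are illustrative, not Netflix's values. A capacity of 4 would let the four startup calls above through at once, after which requests are admitted at the refill rate.

```java
public class TokenBucket {
    private final double capacity;       // largest burst allowed at once
    private final double refillPerMilli; // sustained rate, in tokens per millisecond
    private double tokens;
    private long lastRefillMillis;

    public TokenBucket(double capacity, double refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMilli = refillPerSecond / 1000.0;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillMillis = nowMillis;
    }

    // Returns true if the request may proceed; false is where the gateway would send HTTP 429.
    public synchronized boolean tryConsume(long nowMillis) {
        if (nowMillis > lastRefillMillis) {
            // Add tokens for the time elapsed since the last call, never beyond capacity.
            tokens = Math.min(capacity, tokens + (nowMillis - lastRefillMillis) * refillPerMilli);
            lastRefillMillis = nowMillis;
        }
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

For example, `new TokenBucket(4, 2, 0)` admits four requests at time 0, rejects a fifth, and after 600 ms has refilled 1.2 tokens, enough for one more request.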

Different clients get different limits:

  • Mobile Apps → burst-friendly
  • Smart TVs → predictable limits
  • Developer APIs → strict quotas
  • Internal Services → separate rules

Regional Failover

Netflix runs across multiple AWS regions.

Example:

  • us-east-1
  • us-west-2
  • eu-west-1

Traffic normally routes geographically.

But if failures rise:

Traffic shifts using AWS Route 53 weighted routing.

Netflix practices failure regularly using:

Chaos Monkey

Randomly kills production instances.

Chaos Kong

Simulates regional outages.

Yes.

In production.

During business hours.

Why?

Because only production has real traffic, real dependencies, and real scale; a failure rehearsed anywhere else may not behave the same way.

Chaos engineering is dangerous without observability.

Before chaos testing, build metrics, tracing, dashboards, and alerting systems.


End-to-End Request Flow

Client sends request to netflix.com
│
├── AWS Route 53 → nearest healthy region
│
├── AWS ELB
│
├── Zuul Edge Node
│   ├── PRE: authenticate token
│   ├── PRE: rate limiting
│   ├── PRE: attach user context
│   └── PRE: canary/stable routing
│
├── Ribbon chooses backend
│
├── API Service
│   ├── Recommendations (Hystrix)
│   ├── Metadata (Hystrix)
│   ├── Personalization (Hystrix)
│   └── fallback if failures occur
│
├── Zuul POST Filters
│   ├── Logging
│   ├── Metrics
│   └── Response headers
│
└── Response returned

Typical latency targets:

  • P50 → under 100ms
  • P99 → under 500ms
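Checking a service against targets like these means computing percentiles from latency samples. A minimal sketch using the nearest-rank definition; it sorts raw samples, which is fine for illustration, whereas production metrics pipelines usually rely on histograms so they don't have to store every measurement.

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
    public static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }
}
```

A service meets the targets above when `percentile(samples, 50) < 100` and `percentile(samples, 99) < 500` over a representative window.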

What You Can Actually Use

Eureka

Service registry.

Good for:

  • Microservices
  • Availability-first systems

Alternatives:

  • Consul
  • etcd

Ribbon

Client-side load balancing.

Modern alternative:

  • Spring Cloud LoadBalancer

Hystrix

Circuit breaker.

Modern replacement:

  • Resilience4j

Zuul

API gateway.

Alternative:

  • Spring Cloud Gateway

The patterns matter more than the tools.

Borrow these ideas:

  • Circuit breakers
  • Graceful degradation
  • Timeouts
  • Fallbacks
  • Service discovery
  • Edge authentication
  • Smart routing

The Real Lesson

Netflix's architecture is not impressive because it handles 500 million requests.

It is impressive because it handles them while things are failing.

Services crash.

Deployments happen.

Regions degrade.

Traffic spikes.

Chaos is intentional.

The system survives because it is designed for failure first.

Every service assumes failure.

Every timeout has a fallback.

Every fallback is tested.

That is the real engineering philosophy:

Build for failure first.

The happy path takes care of itself.