Scaling Limits & Capacity Planning for API Gateways

Every API gateway has hard ceilings baked into its runtime: file descriptor limits, connection pool slots, event-loop saturation points, and garbage-collector pause windows. Ignoring those ceilings produces failures that look like upstream problems — p99 latency spikes, cryptic 503 storms, or silent request drops — until you trace them back to an exhausted worker thread pool or a full connection queue at the edge. This page maps the concrete limits for Kong, Envoy, Tyk, and NGINX, then walks through the modeling, configuration, and autoscaling patterns needed to stay inside those limits at scale. It sits inside API Gateway Fundamentals & Architecture, which establishes the control-plane/data-plane separation and deployment topology context you should read first.

Architectural Baseline

Before modeling capacity, engineers need a clear mental model of where resources are consumed in the request path. Each stage accumulates latency and consumes a distinct class of resource; the bottleneck shifts depending on traffic pattern and payload size.

Three distinct resource domains govern gateway capacity:

File descriptors and sockets. Each active connection (inbound and upstream) consumes one file descriptor. Many Linux distributions default the per-process soft limit to 1024; production deployments need ulimit -n at 65535 or higher, and the system-wide fs.file-max must be at least as large.
Worker threads / event-loop concurrency. NGINX and Kong run a fixed set of worker processes, typically one per CPU core; Envoy runs one worker thread per core; Tyk runs on the Go runtime, where GOMAXPROCS caps how many OS threads execute goroutines at once. The concurrency model determines how contended resources (mutexes, connection pools) behave under burst traffic.
Upstream connection pools. The gateway holds persistent connections to each backend. Pool exhaustion forces request queuing, and queue overflow produces 503s — often before any upstream is actually overloaded.

Throughput Modeling and Connection Thresholds

Capacity modeling starts by computing the gateway’s theoretical maximum sustainable throughput (MST) for each resource class:

MST_connections = worker_processes × worker_connections   # NGINX/Kong
MST_connections = thread_count × max_connections_per_thread   # Envoy

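By Little's law, the number of requests in flight equals the arrival rate multiplied by the mean time each request spends in the gateway. For a target rate of target_rps and a mean request duration of D ms:

max_in_flight = target_rps × (D / 1000)

A gateway serving 10,000 req/s with a 50 ms mean upstream latency must sustain 10,000 × 0.05 = 500 concurrent upstream requests. Over HTTP/1.1 that means 500 upstream connections across all workers, well above most default pool sizes. Substitute the P99 duration for D if the pool must absorb tail latency without queuing.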
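
The worked example above can be checked with a few lines of Python. This is a minimal sketch: the function names and the 1.25 headroom factor are illustrative assumptions (the factor mirrors the multiplier used in this page's checklist), not part of any gateway's API.

```python
import math

def in_flight_requests(target_rps: float, mean_duration_ms: float) -> float:
    """Little's law: concurrency = arrival rate x mean time in system."""
    return target_rps * mean_duration_ms / 1000.0

def connections_per_worker(target_rps: float, mean_duration_ms: float,
                           workers: int, headroom: float = 1.25) -> int:
    """Upstream connections each worker needs over HTTP/1.1
    (one request per connection), with an assumed headroom factor."""
    total = in_flight_requests(target_rps, mean_duration_ms) * headroom
    return math.ceil(total / workers)

if __name__ == "__main__":
    print(in_flight_requests(10_000, 50))                 # total concurrent requests
    print(connections_per_worker(10_000, 50, workers=8))  # per-worker share
```

Dividing by the worker count matters because pool settings such as NGINX's keepalive apply per worker, while the Little's law figure is a gateway-wide total.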

Kong (3.x) — connection and pool configuration

Kong builds on NGINX, so both layers need tuning. The NGINX worker layer handles inbound sockets; Kong’s upstream objects control connections toward backends.

# /etc/kong/nginx.conf (Kong 3.x)
worker_processes auto;          # one worker per available CPU core
worker_rlimit_nofile 65535;     # per-worker file descriptor ceiling

events {
  worker_connections 16384;     # inbound connections per worker
  use epoll;                    # Linux: use epoll for O(1) event dispatch
  multi_accept on;
}

# Kong upstream object — /etc/kong/kong.yaml (declarative)
upstreams:
  - name: backend_pool
    algorithm: least-connections
    slots: 10000                # balancer hash ring slots; raise for > 100 targets
    healthchecks:
      active:
        healthy:
          interval: 5
          successes: 2
        unhealthy:
          interval: 2
          http_failures: 3

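The slots value sizes the balancer’s slot ring, which bounds how finely traffic can be spread across targets; it does not cap connections. Upstream connection reuse is governed by the keepalive settings on the generated upstream block, which cap idle pooled connections per worker rather than total connections: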

# In the generated upstream block (Kong 3.x: tune via upstream_keepalive_pool_size,
# upstream_keepalive_max_requests, and upstream_keepalive_idle_timeout in kong.conf)
upstream backend_pool {
  keepalive 200;          # idle keepalive connections per worker
  keepalive_requests 500; # requests per connection before recycling
  keepalive_timeout 30s;
}

Envoy (1.32+) — circuit breaker thresholds

Envoy’s circuit breaker thresholds — covered in depth in High Availability Topologies — double as connection-pool caps. Each upstream cluster carries its own independent set of thresholds:

# envoy.yaml — upstream cluster (Envoy 1.32+)
clusters:
  - name: backend_pool
    connect_timeout: 0.5s
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 4096        # active TCP connections to this cluster
          max_pending_requests: 2048   # requests waiting for a connection
          max_requests: 8192           # total in-flight requests (H2 streams)
          max_retries: 64              # concurrent retries
          track_remaining: true        # expose remaining capacity in stats
    upstream_connection_options:
      tcp_keepalive:
        keepalive_probes: 3
        keepalive_time: 30
        keepalive_interval: 10
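    # Cluster-level http2_protocol_options is deprecated; newer configs set the same
    # value under typed_extension_protocol_options (HttpProtocolOptions).
    http2_protocol_options:
      max_concurrent_streams: 512     # per connection (H2 multiplexing)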

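Monitor envoy_cluster_upstream_cx_overflow_total and envoy_cluster_upstream_rq_pending_overflow_total to detect saturation before it affects client-visible error rates. Depending on the scrape pipeline, these counters may appear without the _total suffix (Envoy’s own /stats/prometheus endpoint emits envoy_cluster_upstream_cx_overflow), so confirm the exact names in your Prometheus before writing alerts.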

Middleware Chain Latency Budgets

The middleware chain is the single largest source of per-request CPU cost that capacity models undercount. Authentication, rate limiting and throttling checks, payload validation, and protocol translation each consume wall-clock time and, in synchronous plugin models, block the event loop during execution.

A production latency budget (example for 50 ms P99 target):

Stage	Budget	Notes
TLS termination	3–5 ms	Falls to < 1 ms with session resumption (TLS 1.3 0-RTT)
JWT validation	1–3 ms	Requires in-process JWKS cache; cold fetch = 50–200 ms
Rate-limit check	3–6 ms	Redis round-trip; rises to 15–20 ms across AZ boundaries
Routing resolution	1–3 ms	Trie lookup scales with path length, not route count; regex-heavy rules cost more
Upstream transit	20–35 ms	Irreducible network latency; drives most P99 variance
Response transform	2–5 ms	JSON rewrite in Lua or WASM; rises with payload size
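
To check whether a budget like the one above actually fits the P99 target, sum the best-case and worst-case columns. The sketch below is illustrative: the dictionary keys and function names are assumptions, and the numbers are copied from the table.

```python
# (low_ms, high_ms) per stage, taken from the budget table above
BUDGET_MS = {
    "tls_termination": (3, 5),
    "jwt_validation": (1, 3),
    "rate_limit_check": (3, 6),
    "routing_resolution": (1, 3),
    "upstream_transit": (20, 35),
    "response_transform": (2, 5),
}

def budget_totals(budget):
    """Return (best_case_ms, worst_case_ms) across all stages."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

def worst_case_overshoot(budget, target_ms):
    """Milliseconds by which the worst case exceeds the target (0 if it fits)."""
    _, high = budget_totals(budget)
    return max(0, high - target_ms)

if __name__ == "__main__":
    print(budget_totals(BUDGET_MS))             # best and worst case totals
    print(worst_case_overshoot(BUDGET_MS, 50))  # overshoot against a 50 ms target
```

With these figures the worst case exceeds the 50 ms target, which suggests that not every stage can sit at the top of its range at the same time; upstream transit is the obvious candidate to constrain.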

Profiling Kong plugins

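Kong’s latency headers (X-Kong-Proxy-Latency, X-Kong-Upstream-Latency, and X-Kong-Response-Latency, controlled by the headers directive in kong.conf) report aggregate gateway and upstream time, not per-plugin time. To attribute wall time to a single plugin, instrument its handler directly: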

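-- Custom Lua plugin: capture plugin wall time (Kong 3.x plugin SDK)
local plugin = {
  PRIORITY = 1000,   -- execution order relative to other plugins
  VERSION  = "1.0.0",
}

function plugin:access(conf)
  ngx.update_time()            -- ngx.now() returns a cached clock; refresh it first
  local start = ngx.now()
  -- ... plugin logic ...
  ngx.update_time()            -- refresh again so the delta reflects real elapsed time
  kong.log.inspect(("plugin_ms=%.2f"):format((ngx.now() - start) * 1000))
end

return plugin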

For Tyk (5.x), enable the per-middleware timing log by setting "log_level": "debug" in tyk.conf and filtering for "middleware" entries in the structured JSON output:

{
  "level": "debug",
  "msg": "middleware execution",
  "middleware": "RateLimitAndQuotaCheck",
  "elapsed_ms": 4.7,
  "api_id": "my-api",
  "org_id": "acme"
}

When any plugin consistently exceeds its budget, the remediation path is: (1) add a local cache for external lookups, (2) move blocking or non-critical work out of the request path into ngx.timer.at callbacks (Kong/NGINX), (3) compile the logic into a native Envoy HTTP filter in C++ or Rust, or (4) offload the check to an external authorization service over gRPC.

Memory Management and GC Tuning

NGINX / Kong

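NGINX’s core allocates from per-request and per-connection memory pools rather than a garbage-collected heap. Kong’s Lua code does run under LuaJIT’s collector, but in practice memory pressure shows up first as shared-memory zone exhaustion (a full lua_shared_dict or a saturated rate-limiting store) rather than as GC pauses: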

# kong.conf / nginx.conf tuning (Kong 3.x)
lua_shared_dict kong                5m;
lua_shared_dict kong_locks          2m;
lua_shared_dict kong_healthchecks  32m;  # raise for > 500 upstream targets
lua_shared_dict kong_rate_limiting_counters 64m;  # raise for high-cardinality key spaces

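When a lua_shared_dict fills up, new writes forcibly evict the least recently used entries, and a write fails outright only when eviction cannot free enough space; neither case is reported to the client. Track dict fill level with the kong_memory_lua_shared_dict_bytes and kong_memory_lua_shared_dict_total_bytes gauges from Kong’s Prometheus plugin.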

Tyk (5.x) — Go runtime

# /etc/tyk/environment (systemd unit EnvironmentFile)
GOMAXPROCS=8             # match vCPU count; default is all logical CPUs
GOMEMLIMIT=3GiB          # Go 1.19+: soft limit; GC runs more aggressively as the heap nears it
GOGC=80                  # reduce to 80 to trigger GC earlier, lower peak RSS

Tyk’s middleware pipeline allocates per-request []byte buffers for body inspection and transformation. Set "enable_bundle_downloader": false if bundles are pre-cached to eliminate cold startup allocation spikes.
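
A common guideline, also used in this page's checklist, is to set GOMEMLIMIT to about 80% of the container memory limit so the collector reacts before the kernel's OOM killer does. The helper below is a hypothetical sketch for generating that value in a deploy script; the function name and the default percentage are assumptions.

```python
def gomemlimit_for(container_limit_mib: int, percent: int = 80) -> str:
    """Return a GOMEMLIMIT string (MiB suffix) at `percent` of the container limit."""
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer arithmetic avoids float rounding and always rounds down.
    return f"{container_limit_mib * percent // 100}MiB"

if __name__ == "__main__":
    # For a 4 GiB container limit
    print("GOMEMLIMIT=" + gomemlimit_for(4096))
```

Rounding down is deliberate: the remaining share has to cover non-heap memory such as goroutine stacks and cgo allocations, which GOMEMLIMIT only partly accounts for.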

Envoy (1.32+) — C++ allocator

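Envoy’s official builds link tcmalloc statically. Some operators report lower fragmentation with jemalloc under multi-GB workloads (figures of up to 30% are workload-dependent), but preloading a different allocator only takes effect on a binary built without the bundled one:

# Envoy startup: preload jemalloc (Envoy 1.32+)
# Only effective if envoy was built without statically linked tcmalloc
# (for example with Bazel's --define tcmalloc=disabled); verify before relying on it.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  envoy -c /etc/envoy/envoy.yaml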

Watch envoy_server_memory_allocated and envoy_server_memory_physical_size. When physical exceeds allocated by more than 40%, fragmentation is the likely cause.
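
The 40% rule of thumb above is simple arithmetic over the two gauges. A minimal sketch, assuming the values have already been scraped into plain integers (the function names are illustrative):

```python
def fragmentation_ratio(allocated_bytes: int, physical_bytes: int) -> float:
    """Fraction by which physical memory exceeds allocated memory."""
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    return (physical_bytes - allocated_bytes) / allocated_bytes

def fragmentation_suspected(allocated_bytes: int, physical_bytes: int,
                            threshold: float = 0.40) -> bool:
    """True when physical exceeds allocated by more than `threshold`."""
    return fragmentation_ratio(allocated_bytes, physical_bytes) > threshold

if __name__ == "__main__":
    # 2 GB allocated, 3 GB resident: 50% overhead
    print(fragmentation_suspected(2_000_000_000, 3_000_000_000))
```

A single sample can mislead right after a traffic burst, so it is usually worth evaluating the ratio over a sustained window before acting on it.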

Advanced Configuration Knobs

xDS warm-up and cluster churn

Envoy’s xDS control plane applies cluster changes without restarting, but each new cluster must complete a health-check warm-up before receiving traffic. During CI/CD pipeline deployments with rapid config churn, cluster_manager.warming_clusters staying above zero creates 30–60 second windows where newly routed services appear invisible. To reduce warm-up time, set cluster.health_checks[*].no_traffic_interval to a low value (2–5 s) and ensure DNS pre-resolution is complete before xDS pushes that cluster definition.

HTTP/2 multiplexing and pool sizing

HTTP/2 multiplexes many logical streams over a single TCP connection, which changes pool-sizing math significantly. Envoy’s max_concurrent_streams: 512 (per connection) means a pool of 10 upstream connections can carry up to 5,120 in-flight gRPC requests simultaneously. This looks efficient until a lost TCP segment or an exhausted connection-level flow-control window stalls all 512 streams sharing that connection. Tune http2_protocol_options.initial_stream_window_size and initial_connection_window_size to bound per-stream and per-connection buffering.
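
The stream arithmetic above can be inverted to size a pool from expected concurrency. The sketch below is an illustration only: the utilisation factor (how full each connection is allowed to run) is an assumption introduced here to limit the blast radius of a stalled connection, not an Envoy setting.

```python
import math

def pool_stream_capacity(connections: int, max_concurrent_streams: int) -> int:
    """Maximum in-flight streams a pool can carry."""
    return connections * max_concurrent_streams

def h2_connections_needed(in_flight: int, max_concurrent_streams: int,
                          utilisation: float = 0.5) -> int:
    """Connections required if each one runs at `utilisation` of its stream cap."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    usable = max(1, math.floor(max_concurrent_streams * utilisation))
    return math.ceil(in_flight / usable)

if __name__ == "__main__":
    print(pool_stream_capacity(10, 512))      # theoretical ceiling
    print(h2_connections_needed(5120, 512))   # connections at 50% stream utilisation
```

Running each connection below its stream cap trades a few extra sockets for fewer requests affected when one connection stalls.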

For Kong fronting gRPC backends, the upstream keepalive pool must be sized for streams, not sockets:

# Kong — gRPC upstream pool (Kong 3.x)
upstream grpc_backend {
  keepalive 50;               # fewer sockets; H2 multiplexing does the rest
  keepalive_requests 10000;   # longer lived: avoids H2 connection ramp-up cost
  keepalive_timeout 120s;
}

Tyk rate-limiter storage tuning

Tyk’s distributed rate-limit storage uses a sliding-window counter in Redis. At high cardinality (> 500,000 unique API keys), each counter set requires two Redis commands per request. With storage.enable_cluster: true set against a Redis Cluster, Tyk will hash-slot keys across nodes — but the storage.hosts list must include all nodes in that Redis deployment, not just primaries, or a missed failover routes all traffic to a single node.

{
  "storage": {
    "type": "redis",
    "enable_cluster": true,
    "hosts": {
      "redis-node-0:6379": "",
      "redis-node-1:6379": "",
      "redis-node-2:6379": ""
    },
    "optimisation_max_idle": 100,
    "optimisation_max_active": 500
  }
}

Comparative Implementation Table

Gateway	Max-connections config	Rate-limit store	Memory pressure signal	Scale-out trigger
Kong 3.x	`worker_connections` (NGINX) + `keepalive`	`lua_shared_dict` (local) or Redis	`kong_rate_limiting_counters` dict fill level	CPU > 70% or active connections approaching `worker_connections`
Envoy 1.32+	`circuit_breakers.thresholds.max_connections`	External rate-limit service (gRPC)	`memory_physical_size` / `memory_allocated` ratio	`upstream_rq_pending_overflow` rate
Tyk 5.x	`max_idle_connections_per_host` in `tyk.conf`	Redis (required for distributed RL)	Go `runtime.ReadMemStats` HeapSys	Goroutine count + heap growth rate
NGINX Plus	`keepalive` + `worker_connections`	Shared zone (`limit_req_zone`)	Zone write errors in error.log	Connection queue depth > threshold

Observability-Driven Autoscaling

Static replica counts fail under bursty traffic. The goal is autoscaling that reacts to the gateway’s own saturation signals rather than lagging CPU percentages.

KEDA with Envoy metrics (Kubernetes)

# ScaledObject — KEDA 2.x + Envoy 1.32+ (Kubernetes)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: envoy-gateway-scaler
  namespace: gateway
spec:
  scaleTargetRef:
    name: envoy-gateway
  pollingInterval: 15
  cooldownPeriod:  60
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metricType: Value      # compare the ratio itself; the default AverageValue divides it by replica count
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: envoy_upstream_cx_active_ratio
        threshold: "0.75"    # scale when > 75% of pool capacity is used
        query: >
          sum(envoy_cluster_upstream_cx_active{cluster_name="backend_pool"})
          /
          sum(envoy_cluster_circuit_breakers_default_remaining_cx{cluster_name="backend_pool"}
              + envoy_cluster_upstream_cx_active{cluster_name="backend_pool"})
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: envoy_upstream_rq_pending_overflow_rate
        threshold: "1"       # scale out once overflows average 1/s over the 1m window
        query: >
          rate(envoy_cluster_upstream_rq_pending_overflow_total{cluster_name="backend_pool"}[1m])
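
The first trigger's query reduces to a simple ratio: active connections divided by active plus remaining circuit-breaker capacity. The sketch below reproduces that arithmetic for sanity-checking values by hand; the function name is illustrative.

```python
def pool_utilisation(active_cx: int, remaining_cx: int) -> float:
    """Share of circuit-breaker connection capacity currently in use."""
    capacity = active_cx + remaining_cx
    if capacity == 0:
        return 0.0  # no capacity reported yet; treat as idle
    return active_cx / capacity

if __name__ == "__main__":
    # max_connections = 4096 with 3072 active leaves 1024 remaining
    print(pool_utilisation(3072, 1024))
```

The remaining-capacity gauge is only published when track_remaining is enabled on the cluster's thresholds, so the ratio reads as zero capacity without it.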

Kong HPA with Prometheus adapter

# HorizontalPodAutoscaler — Kong 3.x + Prometheus adapter (Kubernetes 1.28+)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kong-gateway-hpa
  namespace: kong
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kong-gateway
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: kong_nginx_connections_active
        target:
          type: AverageValue
          averageValue: "800"   # per pod; raise worker_connections before raising this
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

Set the CPU threshold below 70% so each pod has headroom to handle the latency spike that occurs while new pods warm up their JWKS caches and upstream keepalive pools.
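
To see how the targets above translate into replica counts, it helps to replay the scaling rule that the Kubernetes documentation gives for the HPA: desired = ceil(current × metric / target), skipped when the ratio is within a tolerance (10% by default). This is a simplified sketch that ignores stabilization windows and pod readiness; the function name is an assumption.

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA rule for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas          # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

if __name__ == "__main__":
    # 4 pods averaging 1000 active connections against a target of 800
    print(desired_replicas(4, 1000, 800, min_replicas=2, max_replicas=20))
```

When several metrics are configured, the HPA evaluates each one and uses the largest result, so the connection metric and the CPU metric act as independent scale-out triggers.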

Operational Gotchas

1. File descriptor inheritance gaps. Setting worker_rlimit_nofile in nginx.conf is not enough if the systemd service unit does not also set LimitNOFILE. The lower of the two wins. Check with cat /proc/$(pgrep -f nginx | head -1)/limits | grep "open files".

2. Envoy upstream warm-up blocking traffic. When Envoy adds a new upstream via xDS, it withholds traffic until that upstream completes its health check warm-up (cluster_manager.warming_clusters stays > 0). Under rapid config churn (CI/CD pipeline deployments), this produces 30–60 s windows where new routes are invisible. Set cluster.health_checks[*].no_traffic_interval to a low value and pre-seed DNS to reduce warm-up time.

3. Tyk Redis connection exhaustion. Tyk’s distributed rate limiter draws connections from a pooled Redis client. If storage.optimisation_max_active is left uncapped or set too high, a traffic spike can open thousands of connections per gateway against a Redis instance with the default maxclients limit of 10000, and the total multiplies with the number of gateway replicas. Set storage.optimisation_max_active and storage.optimisation_max_idle explicitly.

4. Kong lua_shared_dict silent evictions. The default kong_rate_limiting_counters dict is 12 MB, which fits roughly 200,000 rate-limit counters. Deployments with high-cardinality API keys (particularly multi-tenant deployments using per-user keys) silently lose counters to LRU eviction, allowing bursts above the configured limit. Raise to 256 MB and track dict fill level with kong_memory_lua_shared_dict_bytes from the Prometheus plugin.

5. NGINX worker_connections vs upstream keepalive double-counting. worker_connections caps both inbound and upstream connections combined. A worker with worker_connections 4096 handling 2000 active client connections has only 2096 slots left for upstream sockets. Size accordingly: worker_connections ≥ expected_inbound + expected_upstream_keepalive + 256 headroom.

6. Circuit breaker retry amplification. max_retries: 64 lets up to 64 retries run concurrently against a cluster. Combined with aggressive per-route retry policies, a 503 cascade from one upstream multiplies the load on the failing backend (each failed request is re-sent up to num_retries times), accelerating the failure rather than shedding load. Replace the fixed cap with a retry budget (retry_budget.budget_percent of no more than 10% of active requests) and use retry_back_off.base_interval >= 250 ms.
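
Gotchas 5 and 6 both come down to small sizing formulas. The sketch below encodes them under stated assumptions: the function names are illustrative, and the floor of 3 concurrent retries mirrors the minimum that Envoy's retry budget applies by default.

```python
import math

def min_worker_connections(expected_inbound: int, expected_upstream: int,
                           headroom: int = 256) -> int:
    """worker_connections floor: inbound + upstream sockets + fixed headroom."""
    return expected_inbound + expected_upstream + headroom

def retry_allowance(active_requests: int, budget_percent: float = 10.0,
                    min_retry_concurrency: int = 3) -> int:
    """Concurrent retries permitted under a percentage-based retry budget."""
    by_budget = math.floor(active_requests * budget_percent / 100.0)
    return max(min_retry_concurrency, by_budget)

if __name__ == "__main__":
    print(min_worker_connections(2000, 2096))  # inbound + upstream + headroom
    print(retry_allowance(500))                # 10% of 500 active requests
```

Unlike a fixed max_retries cap, the budget shrinks with traffic, so a backend that is already shedding load is not hit with a constant retry volume.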

Production Configuration Checklist

ulimit -n and systemd LimitNOFILE both set to ≥ 65535 on gateway hosts
worker_connections (Kong/NGINX) sized as: (max_inbound + max_upstream_keepalive) × 1.25
Envoy circuit_breakers.thresholds.max_connections set per cluster, not left at default 1024
Envoy track_remaining: true enabled so Prometheus can expose pool headroom
Kong lua_shared_dict kong_rate_limiting_counters sized for key-space cardinality (≥ 256 MB for > 1M API keys)
Tyk storage.optimisation_max_active and storage.optimisation_max_idle both explicitly capped
GOMEMLIMIT set on Tyk to 80% of container memory limit to reduce the risk of OOM kills
JWKS cache warm-up verified before declaring pods Ready (use a custom readinessProbe that hits a protected route)
KEDA / HPA scale-out triggered on connection pool utilization, not just CPU percentage
HPA minReplicas ≥ 2 with pod anti-affinity to prevent single-node colocation
Circuit breaker retry budgets capped at 10% of concurrent requests to prevent amplification
Synthetic load test run at 120% of projected peak QPS before each major deployment
Latency budget documented per middleware stage and baselined in CI with k6 threshold assertions
New pods receive no traffic during a rolling deploy until their readinessProbe passes
HTTP/2 max_concurrent_streams tuned per upstream to prevent head-of-line blocking
Tyk Redis Cluster storage.hosts includes all nodes (primaries and replicas) for failover coverage

FAQ

What is the default connection limit in Envoy and how do I raise it?

Envoy defaults max_connections and max_pending_requests to 1024 per upstream cluster. Raise both in that cluster’s circuit_breakers.thresholds block and monitor envoy_cluster_upstream_cx_overflow_total to confirm headroom. Size max_connections from expected concurrency (request rate × upstream response time) plus burst headroom; high-throughput deployments with slow upstreams commonly land between 4,096 and 16,384.

How do I set a latency budget for each middleware plugin in Kong?

Kong’s X-Kong-Proxy-Latency and X-Kong-Upstream-Latency response headers report aggregate gateway and upstream time, so per-plugin budgets need timing instrumentation inside each plugin handler. Profile plugins under synthetic load using wrk or k6, then enforce budgets by moving non-critical work out of the request path with ngx.timer.at callbacks or by replacing heavyweight plugins with lightweight Wasm filters.

When should I switch from vertical scaling to horizontal scaling for a gateway?

Switch when CPU steal time exceeds 5% on the host, when the connection queue depth grows faster than it drains, or when p99 latency doubles while CPU utilization is below 70% — a signal of lock contention rather than raw CPU shortage. Horizontal scaling with consistent hashing, as described in High Availability Topologies, preserves session affinity without sticky sessions.
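
The three signals above can be folded into a single check that runs against periodic samples. This is a hedged sketch: the GatewaySample fields are hypothetical names for values you would pull from your own monitoring, and comparing two consecutive queue-depth samples is a crude stand-in for "grows faster than it drains".

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GatewaySample:
    cpu_steal_pct: float      # host CPU steal time, percent
    cpu_util_pct: float       # gateway CPU utilization, percent
    queue_depth_prev: int     # connection queue depth at previous sample
    queue_depth_now: int      # connection queue depth at this sample
    p99_ms: float             # current p99 latency
    p99_baseline_ms: float    # p99 latency under normal load

def scale_out_reasons(s: GatewaySample) -> List[str]:
    """Return the horizontal-scaling signals that fire for this sample."""
    reasons = []
    if s.cpu_steal_pct > 5.0:
        reasons.append("cpu_steal")
    if s.queue_depth_now > s.queue_depth_prev:
        reasons.append("queue_growing")
    if s.p99_ms >= 2 * s.p99_baseline_ms and s.cpu_util_pct < 70.0:
        reasons.append("contention")
    return reasons

if __name__ == "__main__":
    sample = GatewaySample(cpu_steal_pct=1.0, cpu_util_pct=55.0,
                           queue_depth_prev=40, queue_depth_now=40,
                           p99_ms=120.0, p99_baseline_ms=50.0)
    print(scale_out_reasons(sample))
```

In practice each signal should hold across several consecutive samples before it triggers a topology change, since any one of them can fire briefly during a deploy.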

Load Testing an API Gateway with k6 — a runbook for finding the connection-pool and route-table knee under load
High Availability Topologies — cross-zone replica placement and failover strategies that pair with autoscaling
Security Boundaries & Zero Trust — mTLS handshake costs and certificate rotation that affect TLS termination budgets
Caching & Response Optimization — response caching as a load-shedding mechanism under capacity pressure
Gateway Selection Criteria — capability matrix for choosing Kong, Envoy, Tyk, or NGINX based on throughput requirements
Authentication Proxying & Token Validation — JWKS cache sizing and JWT validation overhead that affect middleware latency budgets
