API Gateway Fundamentals & Architecture

Modern distributed systems break apart the monolith but immediately face a harder problem: every external client now talks to dozens of services, each with its own authentication scheme, versioning contract, and failure domain. Without a dedicated ingress layer that owns security boundaries and zero-trust enforcement, rate limiting, and protocol translation between HTTP/REST, gRPC, and WebSocket, teams end up duplicating cross-cutting logic in every service and losing the operational visibility they need to debug production incidents at scale.

An API gateway solves this by acting as the single architectural boundary between external consumers and the backend mesh — terminating TLS, evaluating auth tokens, enforcing rate limiting and throttling policy, and routing traffic to versioned upstream pools. Getting the gateway’s placement, data model, and configuration lifecycle right is the foundational decision that controls how resilient, observable, and evolvable your platform becomes over time.

Design invariants that hold regardless of gateway product or cloud:

The data plane must be stateless and horizontally scalable; session state belongs in an external store, not in the proxy process.
Control-plane configuration must be versioned and applied atomically; partial states during rolling updates are a common source of routing incidents.
TLS termination, authentication, and coarse-grained rate limiting must happen at the edge before any byte reaches a backend service.
Observability (traces, metrics, structured logs) must be emitted at the gateway layer, not reconstructed downstream.
Health checks on the gateway’s upstream pools must be decoupled from client-facing endpoints to prevent false positives during transient network partitions.
The middleware chain executing inside the gateway — auth, transformation, caching, CORS — must have a defined, deterministic execution order documented and enforced in configuration.

Primary gateway products this guide covers: Kong Gateway 3.x, Envoy Proxy 1.32+, NGINX Plus R31, Tyk Gateway 5.x, AWS API Gateway, Apigee X.

Control Plane vs Data Plane: The Fundamental Separation

Every production gateway architecture enforces a strict boundary between the control plane (which compiles routing rules, distributes certificates, and manages plugin configuration) and the data plane, which executes those decisions on live traffic within a sub-millisecond latency budget.

The control plane never touches a live request. It operates in management time: it receives declarative configuration (via Kubernetes CRDs, an Admin API, or a GitOps pipeline), compiles that configuration into binary route tables or xDS snapshots, validates the result, and distributes it to every running data plane replica atomically. In Envoy-based systems, this protocol is called xDS (discovery service), with LDS delivering listener config, RDS delivering route config, and CDS/EDS delivering upstream cluster and endpoint data.

The data plane executes in request time: it reads the distributed config from local memory, evaluates matching rules, runs the plugin pipeline, and forwards the request to the selected upstream. Because it never makes network calls to the control plane on the hot path, a control plane outage degrades configuration freshness but does not drop live traffic — provided the data plane has a valid cached copy.

# Envoy 1.32+: xDS bootstrap; the data plane connects to a control plane (e.g., Istio's istiod, formerly Pilot)
# The data plane loads this once at startup; live config arrives over gRPC streams.
node:
  id: "gateway-node-01"
  cluster: "edge-gateway"

dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: xds_control_plane
  lds_config:
    resource_api_version: V3
    ads: {}
  cds_config:
    resource_api_version: V3
    ads: {}

static_resources:
  clusters:
    - name: xds_control_plane
      connect_timeout: 1s
      type: STRICT_DNS
      http2_protocol_options: {}
      load_assignment:
        cluster_name: xds_control_plane
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: istiod.istio-system.svc.cluster.local
                      port_value: 15010

admin:
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }

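For teams that do not run a full service mesh, Kong’s declarative configuration workflow (decK) provides a comparable single-source-of-truth model without the xDS machinery. deck sync diffs the declared state against the Admin API and applies the resulting changes as a series of API calls, so the update is not strictly atomic across entities. Gateway nodes then pick up the new configuration from the shared Postgres database in traditional mode, or receive it from the control plane in hybrid mode.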
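
The control plane / data plane separation described in this section can be illustrated with a minimal Python sketch. It is not how Envoy or Kong implement it; `Snapshot` and `DataPlane` are hypothetical names, and the validation rule (a snapshot must contain at least one route) is an assumption chosen for brevity. The point is that the hot path reads only local memory and that a rejected push leaves the last valid configuration in place.

```python
import threading
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable, fully compiled configuration (hypothetical shape)."""
    version: int
    routes: Tuple[Tuple[str, str], ...]  # (path prefix, service name)


class DataPlane:
    """Serves requests from the last valid snapshot held in local memory."""

    def __init__(self) -> None:
        self._active: Optional[Snapshot] = None
        self._lock = threading.Lock()

    def apply(self, candidate: Snapshot) -> bool:
        """Validate first, then swap the reference in a single step."""
        if not candidate.routes:
            return False  # invalid push: keep serving the previous snapshot
        with self._lock:
            if self._active is not None and candidate.version <= self._active.version:
                return False  # stale or duplicate push
            self._active = candidate  # readers see the old or the new config, never a mix
        return True

    def route(self, path: str) -> Optional[str]:
        """Hot path: reads local memory only, never calls the control plane."""
        snapshot = self._active
        if snapshot is None:
            return None
        for prefix, service in snapshot.routes:
            if path.startswith(prefix):
                return service
        return None
```

If the control plane becomes unreachable after a successful `apply`, `route` keeps answering from the cached snapshot, which is the behaviour the section relies on.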

Request Lifecycle and the Routing Decision Model

Every inbound request traverses a fixed pipeline before a byte reaches a backend service. Understanding this pipeline in sequence lets teams reason about exactly where latency is introduced, where authentication failures are generated, and where transformation side effects occur.

The canonical request lifecycle for a Kong 3.x data plane node:

TCP accept + TLS handshake. The listener terminates the client connection. Session resumption via TLS session tickets avoids the full handshake cost on reconnects.
HTTP/2 or HTTP/1.1 framing. The request is parsed into headers and a body buffer.
Route matching. The router evaluates the request against all registered routes in priority order — matching on Host, path prefix/regex, HTTP method, and header predicates. The first match wins.
Plugin chain execution (access phase). Enabled plugins run in their configured priority order: key-auth → rate-limiting → request-transformer → proxy-cache. Each plugin can terminate the request (returning a 4xx/5xx) or mutate headers and body before forwarding.
Upstream selection. The matching service’s upstream load-balancer selects a target using the configured algorithm (round-robin, least-connections, or consistent hashing). Health state gates the selection.
Proxying + header injection. Kong injects X-Forwarded-For, X-Real-IP, and X-Kong-Request-Id headers, then opens (or reuses from the keepalive pool) a connection to the upstream.
Response path plugin execution (header_filter + body_filter phases). Plugins like response-transformer and proxy-cache run on the response path, allowing body rewrites and cache population.
Log phase. Logging plugins serialize the request/response context to structured JSON and emit asynchronously — never on the critical path.

# Kong 3.x: declarative route with plugin execution chain
# `deck sync` diffs this file against the running configuration and applies the changes
_format_version: "3.0"

services:
  - name: payment-service
    url: http://payment-v1.internal:8080
    connect_timeout: 500
    read_timeout: 10000
    write_timeout: 10000
    retries: 2

routes:
  - name: payment-v2-route
    service: payment-service
    paths: ["/v2/payments"]
    methods: ["GET", "POST"]
    strip_path: true
    # Header predicate: only match requests carrying a valid tenant header
    headers:
      X-Tenant-ID: ["acme", "globex"]

plugins:
  # Plugin execution order is determined by priority (higher = earlier)
  - name: jwt           # priority 1450 in Kong 3.x; runs first
    route: payment-v2-route
    config:
      key_claim_name: kid
      claims_to_verify: ["exp", "nbf"]

  - name: rate-limiting  # priority 910 in Kong 3.x
    route: payment-v2-route
    config:
      minute: 500
      hour: 5000
      policy: redis
      redis_host: redis.infra.svc.cluster.local
      redis_port: 6379

  - name: request-transformer  # priority 801
    route: payment-v2-route
    config:
      add:
        # Header values are static strings; shell command substitution such as $(date ...) is not evaluated here
        headers: ["X-Gateway-Version:kong-3.x"]
      remove:
        headers: ["X-Internal-Debug"]
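
The route-matching and access-phase steps of the lifecycle can be sketched in Python as follows. This is a simplified model, not Kong's router: `Request`, `Route`, `handle`, and the two plugin functions are hypothetical, header matching is case-sensitive here, and the priority numbers are illustrative values only. It shows first-match routing on method, path prefix, and header predicates, followed by plugins running from highest to lowest priority, with any plugin able to terminate the request.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Request:
    method: str
    path: str
    headers: Dict[str, str]


@dataclass
class Route:
    name: str
    path_prefix: str
    methods: Tuple[str, ...]
    headers: Dict[str, Tuple[str, ...]] = field(default_factory=dict)

    def matches(self, req: Request) -> bool:
        return (
            req.method in self.methods
            and req.path.startswith(self.path_prefix)
            and all(req.headers.get(k) in allowed for k, allowed in self.headers.items())
        )


# A plugin returns a status code to terminate the request, or None to continue.
Plugin = Callable[[Request], Optional[int]]


def require_auth(req: Request) -> Optional[int]:
    return None if "Authorization" in req.headers else 401


def tag_version(req: Request) -> Optional[int]:
    req.headers["X-Gateway-Version"] = "kong-3.x"
    return None


def handle(req: Request, routes: List[Route],
           plugins: List[Tuple[int, str, Plugin]]) -> Tuple[int, List[str]]:
    """Return (status, names of plugins that ran)."""
    route = next((r for r in routes if r.matches(req)), None)  # first match wins
    if route is None:
        return 404, []
    executed: List[str] = []
    for _priority, name, plugin in sorted(plugins, key=lambda p: p[0], reverse=True):
        executed.append(name)
        status = plugin(req)
        if status is not None:
            return status, executed  # short-circuit: later plugins never run
    return 200, executed
```

Registering the plugins in any order gives the same execution order, because ordering comes from the priority value rather than from declaration order.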

For canary traffic splits, upstreams declare weighted targets so the routing decision model fans traffic across service versions at the data plane level — no application code change required:

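# The service must reference this upstream by name (host: payment-upstream)
# for the weighted targets to take effect.
upstreams: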
  - name: payment-upstream
    algorithm: round-robin
    healthchecks:
      active:
        http_path: /health
        healthy:
          interval: 5      # seconds between probes of healthy targets
          successes: 2
        unhealthy:
          interval: 5      # seconds between probes of unhealthy targets
          http_failures: 3
    targets:
      - target: payment-v1.internal:8080
        weight: 90
      - target: payment-v2.internal:8081
        weight: 10
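
To make the effect of the 90/10 weights concrete, here is a small Python sketch of weighted target selection. It uses the smooth weighted round-robin technique as a stand-in; it is not taken from Kong's balancer, and the function name is hypothetical. Over any window whose length equals the sum of the weights, each target is chosen in proportion to its weight.

```python
from typing import Dict, List


def smooth_weighted_round_robin(weights: Dict[str, int], picks: int) -> List[str]:
    """Return the sequence of targets chosen over `picks` selections."""
    current = {target: 0 for target in weights}
    total = sum(weights.values())
    order: List[str] = []
    for _ in range(picks):
        # Every target earns credit equal to its weight on each round.
        for target, weight in weights.items():
            current[target] += weight
        # The target with the most credit is chosen and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        order.append(chosen)
    return order
```

With the targets from the upstream definition, 100 selections send 90 requests to `payment-v1.internal:8080` and 10 to `payment-v2.internal:8081`, spread through the sequence rather than sent as one burst at the end.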

Policy Enforcement: Auth, Rate Limiting, and Transformation

The gateway is the natural enforcement point for all cross-cutting policy because it sees every request before any backend logic executes. Three policy families matter most in production:

Authentication and Authorization

Gateways offload token validation from backend services, eliminating per-service JWT parsing libraries and their version skew. The gateway validates the token’s signature, expiry, and required claims, then forwards only trusted identity headers (X-Consumer-Username, X-Consumer-Groups) to upstream services, which can skip re-validation. For JWT validation in Kong plugins, the gateway caches JWKS responses to avoid per-request calls to the identity provider.

mTLS at the gateway edge goes further, requiring client certificates so that the gateway can authenticate machine identities before HTTP-layer auth even runs — a prerequisite for zero-trust architectures where credentials alone are insufficient.

Rate Limiting

Rate limiting and throttling strategies at the gateway operate on three scopes: global (total RPS against a route), per-consumer (identified by API key or JWT sub), and per-IP (for unauthenticated public endpoints). Redis-backed counters enable consistent limiting across gateway replicas. Without a shared backend, each replica counts only the requests it sees, so with N replicas behind a load balancer a client can be admitted at up to N times the configured limit.
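
The difference between per-replica and shared counters can be shown with a small Python simulation. It models a single fixed window with an in-memory counter standing in for Redis; `FixedWindowCounter` and `admitted` are hypothetical names, and requests are assumed to be spread evenly across replicas.

```python
from typing import Dict, List


class FixedWindowCounter:
    """Counts requests per key within one window (window expiry is omitted)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: Dict[str, int] = {}

    def allow(self, key: str) -> bool:
        used = self.counts.get(key, 0)
        if used >= self.limit:
            return False
        self.counts[key] = used + 1
        return True


def admitted(requests: int, replicas: int, limit: int, shared: bool) -> int:
    """Number of requests let through when load is balanced round-robin."""
    store = FixedWindowCounter(limit)
    counters: List[FixedWindowCounter] = [
        store if shared else FixedWindowCounter(limit) for _ in range(replicas)
    ]
    return sum(counters[i % replicas].allow("consumer-a") for i in range(requests))
```

With a limit of 10 and three replicas, 60 requests from one consumer result in 30 admissions with local counters and 10 with a shared counter.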

Request and Response Transformation

The middleware chain’s request and response transformation layer rewrites headers, injects correlation IDs, strips internal headers before forwarding externally, and performs protocol translation between gRPC and REST where backends speak different protocols. Tyk’s virtual endpoint middleware runs a JavaScript function at the gateway without compiling a plugin; the function builds the response itself instead of proxying the request, which suits payload reshaping and lightweight aggregation:

# Tyk 5.x: virtual endpoint for body transformation
# Runs a JS function that builds the response at the gateway; the request is not proxied upstream
extended_paths:
  virtual:
    - response_function_name: "transformPayload"
      function_source_type: "file"
      function_source_uri: "middleware/transform_payload.js"
      path: "/v1/orders"
      method: "POST"
      use_session: false

Deployment Topologies

How the gateway is placed in the network determines its blast radius, latency budget, and operational complexity.

Topology	Description	Blast radius	Latency add	Best fit
Centralized ingress	Single gateway tier handles all north-south traffic	High — gateway outage drops all external traffic	1–3 ms	Small teams, monolithic-to-microservices migration
Per-namespace ingress	One gateway instance per Kubernetes namespace or team boundary	Medium — isolated to the namespace	1–3 ms	Platform teams with multi-team ownership
Sidecar proxy	Sidecar proxy per pod (Envoy in Istio, linkerd2-proxy in Linkerd), controlled by a service mesh control plane	Low (failures isolated to the pod)	0.5–1 ms	East-west service mesh (Istio, Linkerd)
Hybrid edge + mesh	Centralized gateway for north-south, sidecar mesh for east-west	Low — independent failure domains	2–5 ms total	Enterprise platforms with zero-trust east-west requirements

For a detailed capability comparison of Kong, Tyk, and Envoy across these topologies, see Kong vs Tyk vs Envoy for Microservices.

# NGINX Plus R31: upstream pool for centralized ingress
# keepalive 32 = up to 32 idle upstream connections cached per worker process for this group
upstream payment_pool {
  zone payment_pool 64k;     # shared memory zone for health state (NGINX Plus)
  least_conn;
  server 10.0.1.10:8080;
  server 10.0.1.11:8080;
  keepalive 32;
  keepalive_timeout 60s;
  keepalive_requests 1000;
}

limit_req_zone $binary_remote_addr zone=api_global:10m rate=100r/s;

server {
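  listen 443 ssl;
  http2 on;                                     # replaces the deprecated `listen ... http2` form (NGINX 1.25.1+)
  ssl_certificate     /etc/nginx/tls/tls.crt;   # placeholder path
  ssl_certificate_key /etc/nginx/tls/tls.key;   # placeholder path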
  ssl_protocols TLSv1.3;
  ssl_session_tickets on;

  location /v2/payments {
    limit_req zone=api_global burst=50 nodelay;
    proxy_pass http://payment_pool;
    proxy_http_version 1.1;
    proxy_set_header Connection "";            # enable keepalives
    proxy_set_header X-Request-ID $request_id;
    proxy_connect_timeout 500ms;
    proxy_read_timeout 10s;
  }
}

Observability and Operational Telemetry

A gateway without comprehensive telemetry is operationally blind. Three signal types are essential:

Distributed traces using the W3C traceparent header allow a single request to be followed from the client through the gateway, through every upstream service call, and back. The gateway continues the trace when the client sends a valid traceparent and otherwise starts a new one with a root span; either way it forwards a traceparent on every proxied request, and OpenTelemetry-instrumented upstream services extend the trace by creating child spans. Kong’s OpenTelemetry plugin exports to any OTLP-compatible backend:

# Kong 3.x: OpenTelemetry plugin — emits OTLP traces per request
plugins:
  - name: opentelemetry
    config:
      endpoint: "http://otel-collector.infra.svc:4318/v1/traces"
      resource_attributes:
        service.name: "kong-gateway"
        deployment.environment: "production"
      propagation:
        default_format: w3c
      queue:                        # span batching is configured through the plugin queue
        max_entries: 2048
        max_batch_size: 512
        max_coalescing_delay: 5     # seconds

Structured access logs must include: request_id, route_name, upstream_uri, upstream_latency_ms, response_status, consumer_id, rate_limit_remaining, and cache_status. Unstructured log formats prevent log-based alerting on specific upstream latency thresholds or per-consumer error rates.

Gateway metrics (Prometheus format) should expose: kong_http_requests_total (labelled by service, route, and status code), kong_request_latency_ms and kong_kong_latency_ms (histograms from which p50/p95/p99 are derived), kong_upstream_latency_ms, and kong_datastore_reachable. Alert on: p99 latency > 500 ms, 5xx rate > 1%, and connection exhaustion (active connections reported by kong_nginx_connections_total approaching the worker_connections limit).

Failure Modes and Resilience Patterns

Circuit Breakers

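Envoy separates two mechanisms that are often both described as circuit breaking. Circuit breakers cap the connections, pending requests, and retries a cluster may have in flight. Outlier detection ejects an upstream endpoint from the load-balancing pool when it exceeds a consecutive failure threshold. Ejected endpoints enter a penalty period during which no traffic is sent to them, allowing them to recover without continued hammering. The max_ejection_percent guard stops a burst of failures from ejecting most of the pool and concentrating all traffic on the few hosts that remain: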

# Envoy 1.32+: outlier detection + circuit breaker on an upstream cluster
clusters:
  - name: payment_upstream
    connect_timeout: 500ms
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST

    # Circuit breaker: limits concurrent load per priority level
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1000
          max_pending_requests: 500
          max_requests: 2000
          max_retries: 10           # cap on concurrent retries to this cluster; limits retry storms

    # Outlier detection: ejects unhealthy endpoints
    outlier_detection:
      consecutive_5xx: 5            # 5 consecutive 5xx → eject
      interval: 10s
      base_ejection_time: 30s       # minimum penalty duration
      max_ejection_percent: 50      # never eject more than half the pool
      enforcing_consecutive_5xx: 100
      enforcing_success_rate: 0     # disable success-rate ejection (noisy at low RPS)
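
The ejection rule configured above can be modelled in a few lines of Python. This is a simplified sketch, not Envoy's implementation: `OutlierDetector` is a hypothetical class, the time-based return of ejected hosts after `base_ejection_time` is left out, and the percentage cap is rounded down.

```python
from typing import Dict, Iterable, List, Set


class OutlierDetector:
    """Ejects hosts after N consecutive 5xx responses, up to a pool-wide cap."""

    def __init__(self, hosts: Iterable[str], consecutive_5xx: int = 5,
                 max_ejection_percent: int = 50) -> None:
        self.hosts: List[str] = list(hosts)
        self.threshold = consecutive_5xx
        self.max_ejected = len(self.hosts) * max_ejection_percent // 100
        self.streak: Dict[str, int] = {h: 0 for h in self.hosts}
        self.ejected: Set[str] = set()

    def record(self, host: str, status: int) -> None:
        if status < 500:
            self.streak[host] = 0  # any non-5xx response resets the streak
            return
        self.streak[host] += 1
        if (
            self.streak[host] >= self.threshold
            and host not in self.ejected
            and len(self.ejected) < self.max_ejected  # never exceed the cap
        ):
            self.ejected.add(host)

    def healthy(self) -> List[str]:
        return [h for h in self.hosts if h not in self.ejected]
```

With four hosts and a 50% cap, a third failing host stays in rotation even after five consecutive 5xx responses, which keeps at least half the pool available.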

Retry Budgets and Thundering Herd Prevention

Retries without a global budget amplify load on a degraded upstream. Envoy’s max_retries in the circuit breaker caps the number of retries that may be in flight to the cluster at once, which makes it act as an upstream-wide retry budget: when the cap is reached, further retries are skipped and the client receives the failure from the attempt that was not retried. Retry storms become bounded, visible errors, and the upstream gets room to recover.

Retry policies must use exponential backoff with full jitter. Uniform backoff intervals cause synchronized retry bursts. Configure Envoy’s retry policy on the route level, not the upstream level, to allow per-route control:

# Envoy 1.32+: route-level retry policy with jitter
routes:
  - match:
      prefix: "/v2/payments"
    route:
      cluster: payment_upstream
      retry_policy:
        retry_on: "5xx,reset,connect-failure,retriable-4xx"
        num_retries: 3
        per_try_timeout: 2s
        retry_back_off:
          base_interval: 100ms
          max_interval: 2s         # caps exponential growth
        # retriable_status_codes: [503, 429]
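
The backoff behaviour can be illustrated with a generic full-jitter calculation in Python. This is the textbook form of the technique, not a copy of Envoy's internal formula; `backoff_delays` is a hypothetical helper, and the defaults mirror the `base_interval` and `max_interval` values shown above. Each retry waits a random time between zero and an exponentially growing ceiling that is capped at the maximum interval.

```python
import random
from typing import List, Optional


def backoff_delays(num_retries: int, base: float = 0.1, cap: float = 2.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay in seconds per retry attempt, using full jitter."""
    rng = rng or random.Random()
    delays: List[float] = []
    for attempt in range(num_retries):
        ceiling = min(cap, base * (2 ** attempt))  # 0.1, 0.2, 0.4, ... capped at 2.0
        delays.append(rng.uniform(0.0, ceiling))   # full jitter: anywhere in [0, ceiling]
    return delays
```

Because every client draws its own random delay, retries from many clients spread across the interval instead of arriving in synchronized bursts.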

Common Pitfalls

Embedding business logic in the gateway. Transformation plugins that implement domain rules — not just header manipulation — create a hidden dependency that couples gateway upgrades to application releases.
TLS session ticket rotation mismatch. When multiple gateway replicas use different session ticket keys, clients connecting to a different replica after a reconnect pay the full handshake cost, spiking CPU under load. Rotate keys via a shared secret distributed by the control plane.
Control plane update without staged rollout. Pushing a deck sync or xDS snapshot across all nodes simultaneously during peak traffic causes a brief routing gap if the new config contains errors. Use deck diff to review the pending changes before syncing, and apply declarative configs via canary rollout in CI/CD.
Unbounded client_max_body_size. Large request bodies buffered in the gateway’s memory during transform operations can exhaust worker memory. NGINX defaults to 1m, but Kong ships with the limit disabled (nginx_http_client_max_body_size = 0). Set an explicit limit (e.g., client_max_body_size 10m) and return 413 for oversized payloads.
Missing X-Forwarded-For trust chain. Unless trusted_ips in Kong or xff_num_trusted_hops in Envoy matches the real proxy chain, the gateway derives the client IP either from its immediate peer (typically a load balancer) or from an XFF entry that the client can forge. Both outcomes break IP-based rate limiting.
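
The trusted-hop idea behind the X-Forwarded-For pitfall can be sketched in Python. The function below is hypothetical and simplified: it counts a fixed number of trusted proxies back from the connecting peer, and falling back to the peer address when the header is too short is an assumption rather than the documented behaviour of any specific gateway.

```python
from typing import List


def client_ip(xff_header: str, peer_addr: str, trusted_hops: int) -> str:
    """Pick the client address by skipping `trusted_hops` proxies from the right."""
    chain: List[str] = [part.strip() for part in xff_header.split(",") if part.strip()]
    chain.append(peer_addr)  # the address that actually opened the connection
    index = len(chain) - 1 - trusted_hops
    if index < 0:
        return peer_addr  # header shorter than expected: trust only the peer
    return chain[index]
```

For a header of `6.6.6.6, 203.0.113.7, 10.0.0.5` received from peer `10.0.0.9` with two trusted proxies, the function returns `203.0.113.7`. The leftmost entry, which a client can set to anything, is ignored.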

Implementation Blueprint

Component	Pattern	Key configuration parameters
TLS termination	Edge termination with session tickets	`ssl_session_tickets on`, TLSv1.3 only, ECDHE cipher suites
Route matching	Priority-ordered prefix + header predicates	Route `priority`, `headers` match conditions, `strip_path`
Authentication	JWT RS256 + JWKS cache	`key_claim_name: kid`, `claims_to_verify: [exp, nbf]`, JWKS TTL 300 s
Rate limiting	Redis-backed fixed-window counters, per-consumer + per-IP	`policy: redis`, `limit_by: consumer`, `minute`/`hour` buckets
Traffic splitting	Weighted upstream targets	`weight: 90/10`, active health checks, `algorithm: round-robin`
Protocol translation	gRPC-to-REST at gateway	Kong `grpc-gateway` plugin or Envoy `grpc_json_transcoder` filter
Observability	OTLP traces + Prometheus metrics + structured JSON logs	`traceparent` propagation, OTLP endpoint, p99 latency alert
Resilience	Outlier detection + retry budget + circuit breaker	`consecutive_5xx: 5`, `max_retries: 10`, `base_ejection_time: 30s`
Config distribution	Declarative GitOps (`deck sync` / xDS snapshot)	Staging diff review, rollback on health check failure
CORS	Per-route origin allowlist	Kong `cors` plugin (`origins`, `max_age` for preflight caching)

Technical Validation Checklist

Use this checklist when commissioning a new gateway deployment or auditing an existing one:

TLS 1.3 enforced on all listeners; TLS 1.0 and 1.1 disabled
Session tickets enabled and ticket keys rotated via a shared control-plane secret across all replicas
All routes require an authentication plugin (JWT, mTLS, or API key) — no route is unauthenticated unless explicitly documented
Rate limiting uses a shared Redis or cluster-aware backend; local policy is only used in single-instance dev environments
X-Forwarded-For trust chain is configured; the gateway reads the real client IP from the correct XFF position
Outlier detection is configured on all upstream clusters with max_ejection_percent ≤ 50%
Retry policy uses exponential backoff with jitter; max_retries budget is set at the upstream level
client_max_body_size (NGINX) or max_request_headers_kb + buffer limits are set on all listeners
OpenTelemetry plugin configured with W3C traceparent propagation; traces visible in the APM backend
Structured access logs include request_id, consumer_id, upstream_latency_ms, and response_status
Declarative config is version-controlled; deck diff runs in CI before every deck sync
Health check endpoints on all upstream targets have been validated against the gateway’s active health checker
Gateway data plane nodes are distributed across at least two availability zones with a load balancer in front
Data plane nodes cache their last valid configuration locally so a control-plane outage does not drop live traffic
Gateway selection criteria have been documented, including the rationale for the chosen product against alternatives

FAQ

What is the primary architectural difference between an API gateway and a traditional load balancer?

A load balancer operates primarily at Layer 4 (TCP/UDP) or basic Layer 7 for traffic distribution, while an API gateway provides advanced Layer 7 routing, protocol translation, security policy enforcement, authentication, and developer lifecycle management. The full capability breakdown is at How API Gateways Differ from Load Balancers.

How do you prevent an API gateway from becoming a single point of failure?

Deploy the data plane as a stateless, horizontally scalable cluster across multiple availability zones, implement local configuration caching for control plane outages, and use DNS-based failover with health-checked endpoints. See High Availability Topologies for multi-AZ and active-active blueprints.

When should routing logic reside in the gateway versus a service mesh?

Gateways should handle north-south traffic (external clients to internal services): external authentication, coarse-grained rate limiting, and protocol translation. Service meshes manage east-west traffic (service-to-service): mTLS, intra-cluster retries, and fine-grained traffic policies. Overlapping responsibilities across both layers cause configuration drift and compound latency.

What are the typical scaling limits for a production API gateway?

Scaling limits depend on connection multiplexing efficiency, TLS handshake CPU cost, and worker thread allocation. Typical production limits range from 50,000 to 150,000 concurrent connections per node before CPU saturation from TLS and regex routing. See Scaling Limits & Capacity Planning for concrete benchmarks and horizontal scaling strategies.

Related:

Gateway Selection Criteria — capability matrix for choosing between Kong, Tyk, Envoy, NGINX, and cloud-managed gateways
API Gateway vs Service Mesh — deciding between an edge gateway and a sidecar mesh for north-south versus east-west traffic
Gateway Observability & Operations — distributed tracing, SLOs and error budgets, and structured logging for the running gateway
High Availability Topologies — multi-AZ, active-active, and disaster recovery deployment blueprints
Scaling Limits & Capacity Planning — per-node connection budgets, worker sizing, and horizontal scaling thresholds
Advanced Routing & API Versioning — URI vs header vs content-negotiation versioning strategies and backward-compatibility contracts
Authentication Proxying & Token Validation — offloading JWT validation and identity forwarding to the gateway layer
