REST API — Field Guide

Core Concepts

📐 REST Constraints

REST (Representational State Transfer) is an architectural style defined by six constraints: Client-Server (UI decoupled from data storage), Stateless (each request carries all the context needed to process it; no server-side session), Cacheable (responses declare their cacheability), Uniform Interface (resources identified by URIs, manipulated via representations), Layered System (the client can't tell whether it's talking to the origin or an intermediary), and Code on Demand (optional; the server can send executable code such as JS). Violating statelessness is the most common mistake: keeping session state in a single server's memory breaks horizontal scaling.

stateless · uniform interface · cacheable

🔧 HTTP Methods & Semantics

Each method carries precise semantics: GET — retrieve, safe + idempotent. POST — create or trigger action, neither safe nor idempotent. PUT — full replace, idempotent. PATCH — partial update, not guaranteed idempotent unless designed so. DELETE — remove, idempotent. HEAD — like GET but no body, used for metadata checks. OPTIONS — capabilities/CORS preflight. Safe means no server state change. Idempotent means N identical requests = same result as 1. These are contracts, not enforcement — the server must uphold them.

GET safe+idempotent · POST neither · PUT idempotent

📊 HTTP Status Codes

Status codes communicate outcome: 1xx informational (100 Continue, 101 Switching Protocols). 2xx success: 200 OK, 201 Created (with Location header), 202 Accepted (async), 204 No Content (DELETE/PATCH with no body). 3xx redirects: 301 permanent, 302/307 temporary, 304 Not Modified (caching). 4xx client errors: 400 Bad Request, 401 Unauthorized (no/bad auth), 403 Forbidden (authed but no permission), 404 Not Found, 409 Conflict, 422 Unprocessable Entity, 429 Too Many Requests. 5xx server errors: 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout.

401 vs 403 · 201 + Location · 202 for async

🗂️ Resource Design

Resources are nouns, not verbs. URIs identify resources; methods express intent. Good: GET /orders/{id}, POST /orders, DELETE /orders/{id}/items/{itemId}. Bad: POST /getOrder, GET /deleteUser. Use plural nouns for collections (/users, /products). Nest sub-resources only when the child cannot exist without the parent and nesting depth stays ≤ 2 levels — deeper paths become brittle. For actions that don't map to CRUD cleanly, use a sub-resource noun: POST /payments/{id}/refunds rather than POST /refundPayment.

nouns not verbs · plural collections · max 2 levels deep

🔢 API Versioning

Three main strategies: URI versioning (/v1/orders): explicit and easy to route, but it pollutes URLs and sits awkwardly with REST's uniform interface, because the same resource gets a different URI per version. Header versioning (Accept: application/vnd.myapi.v2+json): clean URLs, but harder to test in a browser. Query param (?version=2): easy to add, but easily dropped or ignored by proxies and not RESTfully clean. URI versioning wins in practice for public APIs because it's visible and cacheable. Key rule: never break a published version. Additive changes (new optional fields, new endpoints) are backward-compatible; removing fields or changing types requires a new version.

/v1/ most common · never break published · additive = safe

📄 Pagination Patterns

Offset/Limit (?offset=40&limit=20): simple, supports random access, but unstable when rows are inserted or deleted mid-pagination (items skipped or duplicated). Page/Size (?page=3&size=20): same trade-offs as offset. Cursor-based (opaque token encoding the last-seen position): stable, efficient for large datasets and infinite scroll, but no random page jump. Preferred at scale. Keyset (?after_id=1234): similar to cursor but uses a visible sortable key; very fast with an index. Always include next/prev links (HATEOAS) so clients don't hardcode offsets, and include a total count (X-Total-Count header or response envelope) where it is cheap to compute; an exact count over a very large table can cost more than fetching the page itself.

cursor for scale · offset unstable · include links

⚡ Caching

HTTP caching is built in: Cache-Control (max-age, no-store, no-cache, private/public), ETag (entity tag — server fingerprint of resource), Last-Modified. Conditional requests: If-None-Match: <etag> or If-Modified-Since → 304 if unchanged. GET and HEAD are cacheable by default; POST can be if explicitly declared. PUT/DELETE must invalidate caches. Vary header tells CDNs to cache separately per Accept-Language, Accept-Encoding, etc. ETags enable optimistic concurrency on updates: send If-Match: <etag> with a PUT; server returns 412 Precondition Failed if the resource changed since the client read it.

ETag for concurrency · 304 Not Modified · Cache-Control

🔗 HATEOAS

Hypermedia As The Engine Of Application State — the highest REST maturity level (Richardson Level 3). Responses include links describing available actions, so clients don't hardcode URLs or know the API structure upfront. Example: a GET /orders/{id} response includes "_links": {"cancel": {"href": "/orders/42/cancellations", "method": "POST"}}. Practically: few public APIs implement full HATEOAS; most stop at Level 2 (resources + verbs). But including next/prev in paginated responses and Location on 201 are lightweight HATEOAS wins that pay off immediately.

Richardson Level 3 · self-describing · rarely fully implemented
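A minimal sketch of how a handler might attach state-dependent links to an order representation. The function name, the "PENDING" status value, and the dict shape of an order are assumptions for illustration; only the "_links" layout follows the example in the text.

```python
def order_representation(order: dict) -> dict:
    """Return the order plus a "_links" section describing available actions."""
    order_id = order["id"]
    links = {"self": {"href": f"/orders/{order_id}", "method": "GET"}}
    # Only advertise cancellation while it is actually possible
    # (assumed rule: only PENDING orders can be cancelled).
    if order["status"] == "PENDING":
        links["cancel"] = {"href": f"/orders/{order_id}/cancellations", "method": "POST"}
    return {**order, "_links": links}


print(order_representation({"id": 42, "status": "PENDING"}))
```

The client decides whether to show a "cancel" button by checking for the link, not by re-implementing the server's state rules.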

🚦 Rate Limiting

Rate limiting protects APIs from abuse and overload. Common response: 429 Too Many Requests with a Retry-After header. Widely used informational headers (a de facto convention rather than a formal standard): X-RateLimit-Limit (max requests), X-RateLimit-Remaining, X-RateLimit-Reset (commonly an epoch timestamp). Algorithms: Token Bucket (bursty traffic allowed, smooth average), Leaky Bucket (constant output rate), Fixed Window (simple but allows a spike at window boundaries), Sliding Window (smoother, no boundary spike). Apply limits per API key, per user, or per IP depending on the API's trust model.

429 + Retry-After · token bucket · sliding window
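To make the token bucket concrete, here is a small in-process sketch. The class name, parameters, and the injectable clock are assumptions for illustration; a real deployment behind several instances would need shared state rather than a per-process object.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled continuously at `rate` tokens/second."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity        # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                  # caller would respond 429 + Retry-After


bucket = TokenBucket(capacity=5, rate=1.0)
print([bucket.allow() for _ in range(6)])
```

The capacity sets the burst size and the rate sets the long-run average, which is why this algorithm tolerates bursts that a leaky bucket would smooth out.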

Gotchas & Failure Modes

Using GET for state-changing operations

GET requests are logged, cached, and bookmarked by browsers and proxies. Using GET for actions like /deleteUser?id=5 means those actions can be replayed unexpectedly, recorded in access logs, and cached. Always use POST/PUT/PATCH/DELETE for mutations.

Ignoring idempotency on retries

Network failures cause clients to retry requests. If POST /payments is not idempotent, a retry after a timeout creates a duplicate charge. Use idempotency keys (Idempotency-Key: <uuid> header) on non-idempotent operations. The server deduplicates within a time window and returns the original response for duplicate keys.

Returning 200 with an error body

A common anti-pattern: 200 OK with {"status": "error", "message": "Not found"}. This breaks HTTP clients, monitoring tools, and caches that rely on status codes. Always return the correct HTTP status. Include a structured error body with a machine-readable code field and a human-readable message; following RFC 7807 (Problem Details, since obsoleted by RFC 9457) is good practice.

Exposing internal model in the API

Mapping DB columns directly to API fields creates tight coupling: every schema migration potentially breaks clients. API resources are a separate model. Use a DTO/view layer. This also avoids leaking sensitive internal fields (soft-delete flags, internal IDs, audit columns).

Missing or wrong Content-Type

Omitting Content-Type: application/json on requests with a body causes many frameworks to ignore or misparse the body. On responses, a missing Content-Type prevents correct client parsing. Always set both Content-Type and Accept explicitly. For file uploads use multipart/form-data, not application/json.

No backward-compatibility strategy

Adding a required field to a POST body, removing a response field, or changing a field type all break existing clients, often silently. Treat every published API contract as immutable. Run contract tests (Pact, Spring Cloud Contract) against consumer expectations before deploying. Deprecate with a Deprecation header and a Sunset date before removal.

When to Use / When Not To

✓ Use REST When

Building public or partner-facing APIs consumed by diverse clients (web, mobile, third-party)
Standard CRUD operations over HTTP where caching and statelessness matter
Services that need to be discoverable, documented (OpenAPI), and version-controlled independently
Systems integrating with existing HTTP infrastructure (proxies, CDNs, API gateways)

✗ Don't Use REST When

Real-time bidirectional communication — use WebSockets or SSE instead
Complex, graph-like queries where over/under-fetching is a problem — GraphQL is a better fit
High-performance internal microservice RPC — gRPC (HTTP/2 + Protobuf) is faster and type-safe
Streaming large binary payloads where chunked HTTP or object storage is more appropriate

Quick Reference & Comparisons

HTTP Methods Reference

GET	Retrieve resource. Safe + idempotent. Response cacheable. No body.
POST	Create resource or trigger action. Not safe, not idempotent. 201 Created + Location on success.
PUT	Full replace. Idempotent. 200 OK or 204 No Content. Client sends complete representation.
PATCH	Partial update. Not inherently idempotent. Use JSON Patch (RFC 6902) or Merge Patch (RFC 7396).
DELETE	Remove resource. Idempotent. 204 No Content. Repeated deletes: 404 or 204 both acceptable.
HEAD	Same as GET but no body. Use to check existence or get headers (Content-Length, ETag) cheaply.
OPTIONS	Returns allowed methods. Used for CORS preflight. Response includes Allow and Access-Control-* headers.

Status Code Quick Reference

200 OK	Generic success for GET, PUT, PATCH with body.
201 Created	POST that created a resource. Must include Location header pointing to new resource.
202 Accepted	Request accepted but processing async. Include a job/status URL in response.
204 No Content	Success with no response body. Common for DELETE and PATCH.
304 Not Modified	Conditional GET hit cache. ETag/Last-Modified still fresh. No body sent.
400 Bad Request	Malformed request syntax, invalid parameters. Include field-level validation errors.
401 Unauthorized	No valid authentication provided. Include WWW-Authenticate header.
403 Forbidden	Authenticated but not authorized. Don't expose resource existence here.
404 Not Found	Resource doesn't exist. Can also use 404 to hide 403 for sensitive resources.
409 Conflict	State conflict — duplicate create, optimistic lock failure, business rule violation.
412 Precondition Failed	If-Match or If-None-Match header condition failed. Used with ETags for optimistic concurrency.
422 Unprocessable Entity	Syntactically valid but semantically invalid (e.g., end date before start date).
429 Too Many Requests	Rate limit exceeded. Include Retry-After and X-RateLimit-* headers.
500 Internal Server Error	Unhandled server error. Never expose stack traces. Log internally, return generic message.
503 Service Unavailable	Temporarily overloaded or down. Include Retry-After. Used during maintenance/deploys.

Key Request & Response Headers

Content-Type	Media type of the body. Request: what I'm sending. Response: what I'm returning. Always set.
Accept	Client's preferred response media type. Enables content negotiation.
Authorization	Credentials in a scheme such as Bearer, Basic, or API-Key. Prefer Authorization over custom headers.
ETag	Response fingerprint. Client sends back as If-Match (update) or If-None-Match (cache check).
Cache-Control	Caching directives: max-age=<seconds>, no-store, no-cache, private, public, must-revalidate.
Location	URI of newly created resource (201) or redirect target (3xx).
Retry-After	Seconds (or HTTP date) until client may retry. Use with 429 and 503.
Idempotency-Key	Client-generated UUID for deduplication of non-idempotent requests (payments, emails).
X-Request-ID	Correlation ID for distributed tracing. Echo it in response for log correlation.
Deprecation / Sunset	Deprecation marks a version as deprecated; Sunset (RFC 8594) gives the date after which it may stop being served. Together they signal API version end-of-life to clients.

Pagination Patterns

Offset / Limit	Simple. ?offset=40&limit=20. Supports random page jump. Unstable: inserts/deletes skew results. Bad on large offsets (DB scans all preceding rows).
Page / Size	Same as offset but user-facing. ?page=3&size=20. Same trade-offs. Avoid for large tables.
Cursor-based	Opaque encoded pointer (base64 timestamp+id). Stable across mutations. Fast. No random page jump. Preferred for infinite scroll and event feeds.
Keyset	?after_id=1234. Requires sortable unique key. Very fast with index. Transparent (not opaque). Good for time-series data.
Response envelope	Always include: { "data": [...], "pagination": { "next_cursor": "...", "has_more": true } } or use the Link header (RFC 8288, which obsoletes RFC 5988) for HATEOAS alignment.

API Versioning Strategies

URI versioning (/v1/)	Most common. Explicit, easy to route, test, and cache. Pollutes URL namespace. Breaking: resource URIs change between versions. Best for public APIs.
Header versioning	Accept: application/vnd.api.v2+json or custom X-API-Version header. Clean URLs. Hard to test in browser. Proxy/CDN routing is complex. Best for internal or partner APIs.
Query param (?version=2)	Easy to add. Often ignored by proxies. Not cache-friendly. Suitable for simple tools where header control is hard.
Additive changes (no bump)	New optional fields in response, new optional request params, new endpoints — always backward-compatible. No version bump needed.
Breaking changes	Removing fields, changing types, renaming fields, making optional required. Always bump major version. Maintain old version for deprecation period (6–12 months min).

REST vs GraphQL vs gRPC

Aspect	REST	GraphQL	gRPC
Protocol	HTTP/1.1 or HTTP/2	HTTP/1.1 or HTTP/2	HTTP/2 only
Data format	JSON (usually)	JSON	Protobuf (binary)
Schema / contract	OpenAPI (optional)	Strongly typed schema (SDL)	Strongly typed .proto file
Query flexibility	Fixed endpoints per resource	Client specifies exact fields	Fixed RPCs per method
Over-fetching	Common	Eliminated by design	Minimal (schema-driven)
Streaming	SSE / chunked	Subscriptions	Native bi-directional streaming
Caching	HTTP cache built-in	Hard (POST by default)	No HTTP caching; app-level only
Browser support	Native	Native	Needs grpc-web proxy
Error handling	HTTP status codes	200 + errors[] array	gRPC status codes
Best for	Public APIs, CRUD, mobile	Complex frontends, BFF layer	Internal microservices, low latency
Tooling maturity	Excellent (Postman, curl)	Good (Apollo, Altair)	Good (grpcurl, BloomRPC)

Interview Q & A


Senior Engineer — Execution Depth

S-01 What does 'stateless' mean in REST and why does it matter for scalability?

Stateless means each HTTP request must contain all the information needed to process it, with no server-side session state. The server treats each request as independent.

Why it matters for scalability: any server instance can handle any request because there's no affinity to a particular instance. This enables:
- Horizontal scaling: spin up N replicas behind a load balancer
- Resilience: a crashed instance loses no session data
- Caching: responses can be cached by proxies without ambiguity

Common violation: storing user session data in server memory. Fix: externalize the session to Redis, or go fully stateless with JWTs that carry claims in the token.

Statelessness increases per-request overhead (the client resends auth and preferences every time), but the scaling and resilience benefits almost always outweigh this cost.

At Staff level, connect this to infrastructure: statelessness is what lets you run behind Kubernetes HPA without sticky sessions, use CDN edge caching, and deploy blue/green without draining sessions. The trade-off conversation becomes: where do you draw the stateless boundary? Auth tokens in headers = stateless; but the token validation still hits a shared key store (JWKS endpoint or Redis token blacklist). True statelessness is a spectrum — design for minimal shared mutable state, not zero state.

S-02 What's the difference between PUT and PATCH? When would you choose each?

PUT replaces the entire resource with the provided representation. If you omit a field, it's set to null/default. It's idempotent: sending the same PUT twice produces the same state.

PATCH applies a partial update: only the fields included are changed. It's not inherently idempotent (e.g., "increment counter by 1" is a PATCH that isn't), though most PATCH implementations are.

Choose PUT when the client always sends the complete resource (e.g., replacing a configuration object).
Choose PATCH when you want to update specific fields without knowing or sending the full resource (e.g., updating only a user's email).

Two PATCH formats:
- JSON Merge Patch (RFC 7396): simple key-value overlay; null means delete the field.
- JSON Patch (RFC 6902): an array of operations (add, remove, replace, move, copy, test); more expressive.
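The merge rules of JSON Merge Patch are short enough to sketch directly. The function below follows the RFC 7396 algorithm over plain Python dicts; the sample user document is made up for illustration.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7396) and return the patched value."""
    if not isinstance(patch, dict):
        return patch                       # a non-object patch replaces the target wholesale
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)          # null means "remove this member"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


user = {"name": "Ada", "email": "old@example.com", "prefs": {"theme": "dark", "lang": "en"}}
patch = {"email": "new@example.com", "prefs": {"lang": None}}
print(merge_patch(user, patch))
```

Applying the same merge patch twice yields the same document, which is why this format is idempotent in practice. One consequence of the null rule: Merge Patch cannot set a field to an explicit null, which is a reason to reach for JSON Patch instead.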

S-03 Explain the difference between 401 and 403. When do you return each?

401 Unauthorized: the request lacks valid authentication. The client is not identified, so you can't determine what they're allowed to do. The response must include a WWW-Authenticate header indicating the auth scheme.

403 Forbidden: the client is authenticated (you know who they are) but doesn't have permission to access the resource.

Decision tree:
- No/invalid credentials → 401
- Valid credentials, wrong role/scope → 403 (or 404 to hide existence)
- Valid credentials, resource doesn't exist → 404

Security note: returning 404 instead of 403 for sensitive resources prevents information leakage, because an attacker learns neither that the resource exists nor that they lack permission. Use this pattern for confidential resources.
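The decision tree can be written as a single function. This is a sketch under simplifying assumptions: the function name is hypothetical, authorization is reduced to "the caller owns the resource", and a missing resource is modelled as owner=None.

```python
from typing import Optional


def access_status(user: Optional[str], owner: Optional[str], hide_existence: bool = False) -> int:
    """Status code for reading a resource owned by `owner` (None = resource missing)."""
    if user is None:
        return 401                              # unauthenticated: also send WWW-Authenticate
    if owner is None:
        return 404                              # authenticated, nothing there
    if user != owner:
        return 404 if hide_existence else 403   # authenticated but not permitted
    return 200


print(access_status(None, "alice"), access_status("bob", "alice"), access_status("bob", None))
```

Note that authentication is checked first, so an anonymous caller always sees 401 regardless of whether the resource exists.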

S-04 What is idempotency and how do you make POST endpoints idempotent?

An operation is idempotent if applying it N times produces the same result as applying it once. GET, PUT, and DELETE are idempotent by HTTP contract. POST is not.

Why it matters: clients retry on network failures. If POST /payments isn't idempotent, a retry after a timeout creates a duplicate charge.

Making POST idempotent with idempotency keys:
1. Client generates a UUID and sends it as an Idempotency-Key: <uuid> header
2. Server stores {key → response} in a fast store (Redis) with a TTL (e.g., 24 hours)
3. On receipt: if the key has been seen before, return the stored response immediately; if not, process and store the result atomically
4. Return the same response (including status code) for duplicate requests

The key reservation must be atomic: use Redis SET key value NX EX <ttl> to prevent races where two simultaneous requests with the same key both proceed.
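The four steps can be sketched with an in-memory dictionary standing in for Redis. Everything here is illustrative: the class, the create_payment handler, and the response shape are assumptions, and the store is single-process and not thread-safe, so it does not provide the atomic reservation that SET ... NX gives you.

```python
import time
import uuid


class IdempotencyStore:
    """Single-process stand-in for the {key -> response} store with a TTL."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._responses = {}              # key -> (expires_at, status, body)

    def execute(self, key: str, handler):
        """Run handler() once per key; replay the stored response for duplicates."""
        now = self._clock()
        cached = self._responses.get(key)
        if cached is not None and cached[0] > now:
            _, status, body = cached
            return status, body, True     # True = replayed, handler not called
        status, body = handler()
        self._responses[key] = (now + self._ttl, status, body)
        return status, body, False


charges = []

def create_payment():
    # Hypothetical handler: records a charge and returns (status, body).
    charges.append({"amount_cents": 500})
    return 201, {"payment_id": len(charges)}

store = IdempotencyStore()
key = str(uuid.uuid4())                        # the client-generated Idempotency-Key
first = store.execute(key, create_payment)
retry = store.execute(key, create_payment)     # simulated retry after a timeout
print(first, retry, len(charges))
```

The retry receives the original status and body, and only one charge is recorded.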

S-05 How would you design a pagination API for a large dataset? What are the trade-offs?

Offset/Limit (?offset=40&limit=20):
- Pro: random page access, simple to implement
- Con: unstable (inserts/deletes shift rows between pages), slow at high offsets (DB must scan all preceding rows), inconsistent results for real-time data

Cursor-based pagination:
- Server returns an opaque cursor (base64 of timestamp + last ID)
- Client passes ?cursor=<token> on the next request
- Pro: stable, consistent, O(log n) seek with an index, works for infinite scroll
- Con: no random page jump; depending on how the cursor is encoded, it may become invalid if the row it points to is deleted

Keyset pagination (?after_id=1234&after_created=2024-01-01T12:00:00Z):
- Similar to cursor but uses readable columns; very fast with a composite index
- Good for time-ordered data (logs, events)

My recommendation: use cursor-based for anything > 10k rows or with real-time updates. Always return has_more, a total count where cheap, and next/prev links.

At Staff level: discuss the index strategy. Cursor pagination requires an index on the sort column(s). Multi-column cursors need composite indexes. Also address: how do you handle cursor invalidation when rows are deleted? What do you return when the requested page no longer exists? Consider: for search results, offset may be fine (search indices handle it); for event feeds, keyset is correct.
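A minimal sketch of cursor pagination over a (created_at, id) sort key, using SQLite from the standard library. The events table, its columns, and the function names are assumptions for illustration; the cursor is base64-encoded JSON, which is opaque to clients but not tamper-proof.

```python
import base64
import json
import sqlite3


def encode_cursor(created_at: str, row_id: int) -> str:
    return base64.urlsafe_b64encode(json.dumps([created_at, row_id]).encode()).decode()


def decode_cursor(cursor: str):
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return created_at, row_id


def fetch_page(conn, limit: int, cursor=None) -> dict:
    where, params = "", []
    if cursor is not None:
        created_at, row_id = decode_cursor(cursor)
        # Seek strictly past the last row seen; id breaks ties on equal timestamps.
        where = "WHERE created_at > ? OR (created_at = ? AND id > ?)"
        params = [created_at, created_at, row_id]
    rows = conn.execute(
        f"SELECT id, created_at FROM events {where} ORDER BY created_at, id LIMIT ?",
        params + [limit + 1],                  # fetch one extra row to detect has_more
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = encode_cursor(rows[-1][1], rows[-1][0]) if has_more else None
    return {"data": rows, "pagination": {"next_cursor": next_cursor, "has_more": has_more}}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-01T12:00:00Z"), (2, "2024-01-01T12:00:00Z"), (3, "2024-01-01T12:00:01Z")],
)
page = fetch_page(conn, limit=2)
print(page["data"], page["pagination"]["has_more"])
```

Because the cursor stores sort-key values rather than a row position, a deleted row does not break the seek: the next page simply starts after those values. In production this query wants a composite index on (created_at, id), and a malformed cursor should be rejected with 400 rather than allowed to raise.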

S-06 How does HTTP caching work with ETags and Cache-Control?

Cache-Control directives set cacheability:
- max-age=3600 (cache for 1 hour)
- no-store (never cache)
- no-cache (cache but always revalidate)
- private (browser only)
- public (shared caches OK)

ETag is an opaque fingerprint of the resource (hash of content or version). The server includes it in GET responses; the client stores it.

Conditional requests:
- If-None-Match: <etag>. The client asks "return the body only if changed." The server returns 304 Not Modified (no body, saves bandwidth) or 200 with the new body.
- If-Match: <etag>. The client says "update only if the resource still matches this ETag." The server returns 412 Precondition Failed if it was modified by someone else.

ETags for optimistic concurrency:
1. GET /resource → response has ETag: "v42"
2. Client edits, sends PUT /resource with If-Match: "v42"
3. If another writer updated it in between, server returns 412, so no update is lost
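Both uses of the ETag can be sketched as pure functions. The names are hypothetical and the comparison is deliberately simplified: it handles a single strong ETag only, ignoring "*", comma-separated lists, and weak validators. Requiring If-Match and answering 428 Precondition Required when it is missing is a design choice of this sketch, not something the text above mandates.

```python
import hashlib


def make_etag(body: bytes) -> str:
    # Content hash as the fingerprint; ETag values are quoted strings.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_get(body: bytes, if_none_match=None):
    """Cache revalidation: 304 with no body when the client's copy is current."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


def conditional_put(current: bytes, new: bytes, if_match=None):
    """Optimistic concurrency: apply the write only if the client saw the latest version."""
    if if_match is None:
        return 428, current           # this sketch refuses unconditional writes
    if if_match != make_etag(current):
        return 412, current           # someone else changed it; nothing is overwritten
    return 200, new


status, headers, _ = conditional_get(b'{"id": 42}')
print(status, conditional_get(b'{"id": 42}', headers["ETag"])[0])
```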

S-07 How would you design error responses? What should an error body contain?

HTTP status code communicates the class of error. The body provides actionable detail. RFC 7807 Problem Details is the standard:

{
  "type": "https://api.example.com/errors/validation",
  "title": "Validation Failed",
  "status": 422,
  "detail": "One or more fields failed validation.",
  "instance": "/orders/abc",
  "errors": [
    {"field": "quantity", "message": "must be greater than 0"}
  ]
}

Key properties:
- type: URI identifying the error type (machine-readable)
- title: human-readable and stable (doesn't change per request)
- detail: specific to this request
- instance: URI of the specific occurrence (can link to logs)
- errors: extension array for validation failures

Never expose: stack traces, internal class names, SQL errors, server paths. Always log the correlation ID (X-Request-ID) so support can trace the request.
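A small helper makes it easy to produce this shape consistently. The function name and signature are assumptions; the field names and the application/problem+json media type come from the Problem Details specification.

```python
import json


def problem(status: int, title: str, type_uri: str = "about:blank",
            detail=None, instance=None, **extensions):
    """Build (status, headers, body) for a Problem Details error response."""
    body = {"type": type_uri, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    body.update(extensions)               # extension members such as "errors"
    return status, {"Content-Type": "application/problem+json"}, json.dumps(body)


status, headers, payload = problem(
    422,
    "Validation Failed",
    type_uri="https://api.example.com/errors/validation",
    detail="One or more fields failed validation.",
    instance="/orders/abc",
    errors=[{"field": "quantity", "message": "must be greater than 0"}],
)
print(status, headers["Content-Type"])
```

Routing every error through one helper like this is also a convenient place to guarantee that stack traces and internal names never reach the body.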

S-08 What is content negotiation and how does it work?

Content negotiation lets the client and server agree on the representation format without hardcoding it.

Client signals preference via the Accept header:
Accept: application/json, application/xml;q=0.8, */*;q=0.5
(q values are quality factors, 0–1, default 1.0)

Server responds with:
- The best matching format, declared in Content-Type: application/json
- 406 Not Acceptable if no match is possible

Language negotiation: Accept-Language: en-US, fr;q=0.8
Encoding: Accept-Encoding: gzip, br (the server can compress the response)

Why it matters: one endpoint serves multiple consumers (JSON for web, XML for legacy SOAP clients, CSV for analytics pipelines) without separate URLs. Also used for API versioning: Accept: application/vnd.myapi.v2+json.
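Server-side selection can be sketched as follows. This is a simplified take on Accept handling: it understands q values and the */* and type/* wildcards, but ignores other media-type parameters. The function names are hypothetical, and a None result is where a server would return 406.

```python
def parse_accept(header: str):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        q = 1.0                                   # default quality factor
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media, q))
    return prefs


def _specificity(media_range: str, offered: str) -> int:
    """2 = exact match, 1 = type/* match, 0 = */* match, -1 = no match."""
    if media_range == offered:
        return 2
    if media_range == "*/*":
        return 0
    rtype, _, rsub = media_range.partition("/")
    if rsub == "*" and rtype == offered.partition("/")[0]:
        return 1
    return -1


def negotiate(accept: str, available):
    """Return the best entry of `available` for this Accept header, or None (→ 406)."""
    prefs = parse_accept(accept or "*/*")
    best, best_q = None, 0.0
    for offered in available:
        matches = [(_specificity(m, offered), q) for m, q in prefs]
        matches = [m for m in matches if m[0] >= 0]
        if not matches:
            continue
        _, q = max(matches)                       # the most specific range decides q
        if q > best_q:                            # ties keep the server's preferred order
            best, best_q = offered, q
    return best


print(negotiate("application/json, application/xml;q=0.8, */*;q=0.5",
                ["text/csv", "application/xml"]))
```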

S-09 What is CORS and how does preflight work?

CORS (Cross-Origin Resource Sharing) is a browser security mechanism. Browsers block cross-origin requests unless the server explicitly allows them.

Simple requests (GET, HEAD, or POST with only safelisted headers and content types): the browser sends the request directly; the server must return an Access-Control-Allow-Origin header.

Preflighted requests (other methods such as PUT or DELETE, or non-safelisted headers such as Authorization or custom headers): the browser sends an OPTIONS request first:

OPTIONS /api/orders
Origin: https://app.example.com
Access-Control-Request-Method: DELETE
Access-Control-Request-Headers: Authorization

Server responds:

Access-Control-Allow-Origin: https://app.example.com
Access-Control-Allow-Methods: GET, POST, PUT, DELETE
Access-Control-Allow-Headers: Authorization, Content-Type
Access-Control-Max-Age: 86400   ← cache preflight for up to 24h (browsers may apply a lower cap)

The browser then sends the actual request. Access-Control-Allow-Credentials: true is needed if cookies/auth headers must be sent cross-origin. It cannot be combined with Access-Control-Allow-Origin: *.
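The server side of the preflight exchange can be sketched as a function from request headers to a response. The allowlists and the function name are assumptions for illustration. Answering a disallowed preflight with 403 is one possible choice; browsers only look for the presence of matching Access-Control-Allow-* headers.

```python
ALLOWED_ORIGINS = {"https://app.example.com"}
ALLOWED_METHODS = {"GET", "POST", "PUT", "DELETE"}
ALLOWED_HEADERS = {"authorization", "content-type"}   # compared case-insensitively


def preflight(headers: dict):
    """Answer an OPTIONS preflight; returns (status, response_headers)."""
    origin = headers.get("Origin")
    method = headers.get("Access-Control-Request-Method")
    requested = {
        h.strip().lower()
        for h in headers.get("Access-Control-Request-Headers", "").split(",")
        if h.strip()
    }
    if (origin not in ALLOWED_ORIGINS or method not in ALLOWED_METHODS
            or not requested <= ALLOWED_HEADERS):
        return 403, {}                    # no CORS headers, so the browser blocks the call
    return 204, {
        "Access-Control-Allow-Origin": origin,        # echo the origin, never "*" with credentials
        "Access-Control-Allow-Methods": ", ".join(sorted(ALLOWED_METHODS)),
        "Access-Control-Allow-Headers": ", ".join(sorted(ALLOWED_HEADERS)),
        "Access-Control-Max-Age": "86400",
        "Vary": "Origin",                 # the response differs per origin, so caches must key on it
    }


status, response_headers = preflight({
    "Origin": "https://app.example.com",
    "Access-Control-Request-Method": "DELETE",
    "Access-Control-Request-Headers": "Authorization",
})
print(status, response_headers["Access-Control-Allow-Methods"])
```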

S-10 How do you handle API versioning when you need to introduce a breaking change?

Breaking changes require a version bump. Process:
1. Release v2 alongside v1: never replace, always add
2. Announce deprecation on v1: add a Deprecation header and Sunset: Wed, 31 Dec 2025 00:00:00 GMT to v1 responses (the Sunset header is defined in RFC 8594)
3. Publish a migration guide: an exact diff of what changed, with code examples
4. Monitor v1 traffic: identify active callers using API keys / request logs
5. Reach out to high-volume callers before sunset
6. Return 410 Gone after the sunset date, never 404 (clients need to know the resource existed and was intentionally removed)

What counts as breaking: removing fields, changing field types, renaming fields, making optional params required, changing auth scheme, changing status codes. Not breaking: adding new optional fields to responses, adding new optional request params, adding new endpoints, loosening validation.

Staff Engineer — Design & Cross-System Thinking

ST-01 How would you design an idempotent payment API that handles retries, partial failures, and audit requirements?

Core mechanism: idempotency keys + saga pattern

Request flow:
1. Client generates Idempotency-Key: uuid-v4 (per payment attempt, not per session)
2. Server atomically runs INSERT INTO idempotency_keys (key, status) VALUES (<key>, 'PROCESSING') ON CONFLICT DO NOTHING; on conflict, another request with the same key is in flight or done
3. Execute payment: call the payment processor, write to the payments table
4. Update the idempotency record: {key, status='DONE', response_body, response_status}
5. Return the response; subsequent retries get the stored response

Handling partial failures:
- Payment processor times out before confirmation: store status='UNCERTAIN'
- Reconciliation job polls the processor for the final state and updates the record
- Client retrying gets 202 Accepted with a status check URL until resolved

Audit trail:
- Append-only payment_events table: every state transition (CREATED, AUTHORIZED, CAPTURED, REFUNDED, FAILED)
- Include actor, timestamp, idempotency_key, processor_response on each event
- Never update; read by replaying events or projecting to the payments table

Race condition prevention:
- Redis SET key 'PROCESSING' NX EX 30 as a distributed lock before the DB write
- Expires after 30s so a crashed server doesn't hold the lock forever

Principal-level: frame this as a distributed transaction problem. Idempotency key is one component of an at-least-once delivery system. Discuss: how does this compose across services? If payment service calls fraud service then billing service, you need a saga coordinator or outbox pattern. Outbox: write payment event + payment record in one DB transaction; a relay process publishes to message bus. This decouples processing from notification without distributed transactions.
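The outbox idea can be sketched with SQLite standing in for the service database. Table names, the event payload, and the publish callback are assumptions for illustration; the point is that the payment row and its event commit in one local transaction, and a separate relay delivers events at least once.

```python
import json
import sqlite3


def init(conn):
    conn.executescript("""
        CREATE TABLE payments (id TEXT PRIMARY KEY, amount_cents INTEGER, status TEXT);
        CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                             payload TEXT NOT NULL,
                             published INTEGER NOT NULL DEFAULT 0);
    """)


def record_payment(conn, payment_id: str, amount_cents: int):
    event = {"type": "payment.captured", "payment_id": payment_id, "amount_cents": amount_cents}
    with conn:   # one transaction: both rows commit, or neither does
        conn.execute("INSERT INTO payments VALUES (?, ?, 'CAPTURED')", (payment_id, amount_cents))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay(conn, publish) -> int:
    """Publish pending events in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))   # a crash after this line re-sends on the next run
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


conn = sqlite3.connect(":memory:")
init(conn)
record_payment(conn, "p1", 500)
sent = []
print(relay(conn, sent.append), sent)
```

Because delivery is at-least-once, consumers of the bus need their own deduplication, which is where the idempotency key reappears downstream.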

ST-02 How do you design a rate limiting system that works across a distributed API fleet?

Challenge: a single server can rate-limit locally cheaply, but with N instances behind a load balancer, local counters are siloed, so each instance allows the full quota.

Centralized counter (Redis):
- Sliding window with Redis sorted sets, or a fixed window with atomic INCR + EXPIRE
- Key: ratelimit:{api_key}:{window_start} → count
- Fast (sub-ms) and consistent, but Redis is now a bottleneck and single point of failure
- Mitigation: Redis Cluster, read replicas for quota checks

Token bucket in Redis (atomic Lua script):

-- Atomically check and decrement a simplified token bucket.
-- KEYS[1] = bucket key; ARGV[1] = capacity; ARGV[2] = window in seconds.
-- The bucket refills all at once when the key expires (no gradual refill).
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local tokens = tonumber(redis.call('GET', key))
if tokens == nil then
  -- First request in this window: create the bucket with one token spent.
  redis.call('SET', key, capacity - 1, 'EX', window)
  return 1
elseif tokens > 0 then
  redis.call('DECR', key)  -- DECR keeps the existing TTL
  return 1
end
return 0

Approximate local + sync:
- Each instance tracks locally and periodically syncs with Redis
- Trades perfect accuracy for resilience; over-allows briefly during the sync interval
- Acceptable for most APIs; unacceptable for billing/quota-critical limits

Response headers (always include the X-RateLimit-* set; add Retry-After on 429):
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 743
X-RateLimit-Reset: 1704067200
Retry-After: 57

Limit dimensions: per API key, per user, per IP, per endpoint, per tier (free/pro/enterprise). Implement as a middleware/filter at the API gateway level, not per service.
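For comparison with the bucket approach, here is a sliding-window log limiter that also produces the response headers listed above. It is an in-process sketch with hypothetical names; in a fleet the per-key timestamp log would live in a shared store such as a Redis sorted set.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests per key in any `window` seconds."""

    def __init__(self, limit: int, window: float, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)    # api_key -> timestamps of allowed requests

    def check(self, api_key: str):
        now = self.clock()
        hits = self.hits[api_key]
        while hits and hits[0] <= now - self.window:
            hits.popleft()                # drop requests that have left the window
        allowed = len(hits) < self.limit
        if allowed:
            hits.append(now)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - len(hits)),
        }
        if not allowed:
            # Seconds until the oldest request in the window expires.
            wait = hits[0] + self.window - now if hits else self.window
            headers["Retry-After"] = str(max(1, math.ceil(wait)))
        return allowed, headers


limiter = SlidingWindowLimiter(limit=2, window=60)
print([limiter.check("key-123")[0] for _ in range(3)])
```

The log costs memory proportional to the limit per key, which is the usual reason to approximate it with two fixed-window counters at high limits.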

ST-03 How would you evolve a public API to support a new data model without breaking existing consumers?

Strategy: expand-contract pattern (parallel run)

Phase 1, Expand (backward-compatible):
- Add new optional fields to responses (well-behaved JSON consumers ignore unknown fields, per Postel's Law)
- Accept new optional request fields with defaults that match old behavior
- New endpoints can coexist with old ones

Phase 2, Migrate (dual write):
- Internal logic writes to both old and new data models
- New response includes both old fields (for compatibility) and new fields
- Monitor which callers are using old vs new fields via analytics

Phase 3, Contract (sunset old):
- Add Deprecation + Sunset headers to responses from deprecated endpoints; mark deprecated fields in the spec
- Remove old fields after the sunset date; return 410 for removed endpoints

Contract testing (Pact):
- Consumer-driven: consumers publish their expectations (pacts)
- Provider CI runs against the pact broker; the build fails if the provider breaks a pact
- Catches breaking changes before they reach production

Field-level deprecation in OpenAPI:

userId:
  type: integer
  deprecated: true
  description: "Use userUuid instead. Removed after 2026-06-01."

ST-04 How do you secure a public REST API beyond just adding an auth token?

Defense in layers:

Transport: TLS 1.2+ everywhere. HSTS header. Certificate pinning for mobile clients where you control the client binary.

Authentication: OAuth2 + JWT for user-facing, API keys (stored hashed in the DB, never in plain text) for machine-to-machine. Rotate keys without downtime using key IDs in headers.

Authorization:
- Scopes on tokens (read:orders, write:payments)
- Resource-level: a user can only access their own resources; always filter by authenticated identity, not a passed-in user_id param (IDOR prevention)
- Rate limiting per API key + per IP

Input validation:
- Validate all inputs against a strict schema (size limits, type, format, regex)
- Reject unknown fields or strip them; this prevents mass assignment
- Limit JSON body size (e.g., 1MB max) to prevent large-payload DoS

Output filtering:
- Never expose internal IDs, stack traces, or DB column names in errors
- Use opaque IDs (UUIDs or encoded IDs); sequential IDs enable enumeration attacks

Audit logging:
- Log every mutating request with auth principal, timestamp, resource, diff
- Store separately from application logs: immutable, tamper-evident

API gateway features: TLS termination, auth, rate limiting, IP allowlisting, request/response logging. Centralise these at the gateway rather than per-service.

Principal Engineer — Architecture & Org-Scale Thinking

P-01 How do you design an API strategy for a platform that serves both internal microservices and external third-party developers?

The core tension: internal APIs optimize for developer experience and velocity; external APIs optimize for stability, discoverability, and trust.

Dual-layer architecture:
- Internal APIs (BFFs and service-to-service): gRPC for performance-critical paths, REST for simpler CRUD. Versioning via Protobuf evolution rules. Deployed continuously. Contracts enforced by Pact/schema registries.
- External APIs (developer platform): REST with the OpenAPI spec as the contract. Semantic versioning with long deprecation windows (6–12 months). API key management, usage analytics, developer portal (docs, sandbox, SDKs).

API Gateway as the seam:
- External requests hit the gateway first: auth, rate limiting, request translation
- Gateway routes to internal services; internal callers bypass it (or use an internal gateway)
- This decouples external API stability from internal service topology

Design governance:
- API design review board: new external APIs reviewed against the style guide before release
- OpenAPI-first: spec written before implementation; code generated from the spec
- Breaking change policy: documented, enforced, with automated Pact checks in CI

Developer experience investment:
- Interactive docs (Swagger UI / Redoc) generated from the spec
- SDKs in top languages auto-generated from the OpenAPI spec (openapi-generator)
- Sandbox environment with realistic test data
- Webhook delivery with retry, delivery logs, and replay UI

Metrics that matter for a platform: API adoption rate, SDK version distribution (how many consumers are still on deprecated versions), time-to-first-successful-call (DX metric), error rate per consumer, SLA breach rate.

Push further: how do you enforce API standards at scale without becoming a bottleneck? Automated linting (Spectral rules on OpenAPI spec in CI). Paved roads: provide starter templates, base classes, and interceptors that implement auth, logging, rate limit headers correctly — teams use the paved road and get compliance for free. The governance model shifts from review-gate to enable-and-verify.

P-02 A key external API is experiencing severe performance degradation under peak load. Walk through your diagnosis and architectural response. Principal

Immediate triage (minutes):
- Check error rate, latency P50/P95/P99, and saturation (CPU, connections, DB pool) at each layer: CDN → API gateway → service → DB
- Identify whether degradation is global or per-endpoint/consumer
- Pull the slow query log; check connection pool exhaustion; check downstream dependency health (DB, cache, third-party APIs)
- Tighten rate limits if specific consumers are causing the spike

Short-term stabilization (hours):
- Scale horizontally if CPU/connection-bound (add instances)
- Enable aggressive caching at the gateway for GET endpoints that can tolerate staleness
- Shed load: return 503 with Retry-After for non-critical endpoints
- Circuit break unhealthy downstream dependencies

Root cause categories and fixes:
- N+1 queries behind REST responses: batch or eager-load related rows; add projection endpoints (or GraphQL) where clients over-fetch; add read replicas; add response-level caching (Redis)
- Missing indexes on filter params: profile slow queries, add targeted indexes
- Hot partition: if sharding by customer ID and one customer is massive, rethink the sharding key or add dedicated capacity
- Thundering herd on cache miss (cache stampede): use mutex/lock-based cache population or probabilistic early expiration (XFetch algorithm)
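
Probabilistic early expiration is easier to reason about in code. The following is a minimal single-process sketch of the XFetch idea: each read may volunteer to recompute slightly before the entry expires, with the probability rising as expiry approaches and scaled by how long the last recompute took. The class name, the injectable `clock` and `rng` parameters, and the in-memory dict are assumptions for illustration; a real deployment would keep the value, recompute time, and expiry in the shared cache (e.g. Redis).

```python
import math
import random
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class XFetchCache:
    """In-memory cache with probabilistic early recomputation (XFetch-style)."""

    def __init__(
        self,
        beta: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self._beta = beta      # > 1.0 favours earlier recomputation
        self._clock = clock
        self._rng = rng
        # key -> (value, seconds the last recompute took, absolute expiry)
        self._store: Dict[Hashable, Tuple[Any, float, float]] = {}

    def get(self, key: Hashable, ttl: float, recompute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None:
            value, delta, expiry = entry
            # 1.0 - rng() lies in (0, 1], so log() is defined and <= 0;
            # the resulting head start is therefore >= 0 seconds.
            head_start = -delta * self._beta * math.log(1.0 - self._rng())
            if now + head_start < expiry:
                return value  # still fresh enough for this caller
        started = self._clock()
        value = recompute()
        delta = self._clock() - started
        self._store[key] = (value, delta, self._clock() + ttl)
        return value


cache = XFetchCache()
report = cache.get("daily-report", ttl=30.0, recompute=lambda: {"rows": 42})
```

Because slow recomputations produce a larger head start, expensive keys are refreshed earlier, and the randomness spreads refreshes out so that many callers do not all miss at the same instant.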

Architectural response (weeks):
- Read path: CQRS, separating the read model (denormalized, cached) from the write model
- Write path: async where possible (202 + event); offload heavy processing to queues
- Contract with consumers: add pagination where endpoints return unbounded results
- SLA tiering: separate clusters for critical vs. background traffic; bulkhead pattern

System Design Scenarios

Design a Paginated Event Feed API

Problem

You're building a REST API for a financial trading platform. It must expose a transaction history endpoint that serves up to 500 million records per customer, supports filtering by date range and transaction type, and is used by a mobile app (infinite scroll), a web dashboard (page navigation), and a batch export system (millions of records).

Constraints

Must respond in < 200ms at P99 for the mobile client
Batch export can export up to 5 million records but must not block the OLTP DB
Real-time inserts happen continuously; pagination must be stable
Multiple consumers with different UX needs

Key Discussion Points

Separate pagination strategies by consumer: The mobile app (infinite scroll) gets cursor-based pagination, which is stable across inserts and costs O(log n) per page with an index on (account_id, created_at, id). Return { "data": [...], "next_cursor": "<base64>", "has_more": true }. The web dashboard can use the same cursor API but present page numbers as a UX affordance, mapping page clicks to cursor hops internally.
Batch export — read replica + streaming cursor: Never run batch export against the OLTP primary. Route to a read replica. Use server-side streaming: endpoint accepts ?format=csv&from=2024-01-01&to=2024-12-31, streams chunked response with Transfer-Encoding: chunked. Process in keyset batches of 10k rows using (created_at, id) > (last_created_at, last_id). Alternatively, trigger an async export job (202 Accepted + Location: /exports/{jobId}), write to S3, return a presigned download URL when done.
Index design: composite index (account_id, created_at DESC, id DESC) covers all common filter patterns. Date range filter with account_id prefix is highly selective. Partial index on transaction_type if type filtering is frequent.
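Caching: cursor responses are effectively immutable for a given cursor value as long as history is append-only and pages run from newer to older, so they are safe to cache with a long TTL (keyed per account, since the data is private). Only the latest page (no cursor) needs a short TTL or no-cache.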
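
The cursor scheme from the discussion points can be sketched as follows. This is an illustration under stated assumptions, not a production design: it uses an in-memory SQLite database (which needs SQLite 3.15+ for row-value comparisons), a hypothetical `transactions` table, and an unsigned base64 JSON cursor. A real API would likely sign or encrypt the cursor so clients cannot tamper with it.

```python
import base64
import json
import sqlite3
from typing import Optional, Tuple


def encode_cursor(created_at: str, row_id: int) -> str:
    raw = json.dumps([created_at, row_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> Tuple[str, int]:
    created_at, row_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return created_at, row_id


def fetch_page(db: sqlite3.Connection, account_id: int, limit: int = 50,
               cursor: Optional[str] = None) -> dict:
    """Return one page, newest first, using keyset (seek) pagination."""
    where, params = "account_id = ?", [account_id]
    if cursor:
        # Seek strictly past the last row of the previous page.
        where += " AND (created_at, id) < (?, ?)"
        params += list(decode_cursor(cursor))
    rows = db.execute(
        f"SELECT id, created_at, amount FROM transactions WHERE {where} "
        "ORDER BY created_at DESC, id DESC LIMIT ?",
        params + [limit + 1],  # one extra row tells us whether more exist
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = encode_cursor(rows[-1][1], rows[-1][0]) if has_more else None
    data = [{"id": r[0], "created_at": r[1], "amount": r[2]} for r in rows]
    return {"data": data, "next_cursor": next_cursor, "has_more": has_more}


# Hypothetical schema and seed data; ids 2 and 3 share a timestamp on purpose.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, "
           "account_id INTEGER, created_at TEXT, amount INTEGER)")
db.execute("CREATE INDEX idx_tx ON transactions "
           "(account_id, created_at DESC, id DESC)")
db.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", [
    (1, 1, "2024-01-01T00:00:01Z", 100),
    (2, 1, "2024-01-01T00:00:02Z", 200),
    (3, 1, "2024-01-01T00:00:02Z", 300),
    (4, 1, "2024-01-01T00:00:03Z", 400),
    (5, 1, "2024-01-01T00:00:04Z", 500),
])

page1 = fetch_page(db, account_id=1, limit=2)
```

Including `id` in the cursor breaks ties between rows with the same timestamp, which is what keeps pages from skipping or repeating rows. Rows inserted after the first page was served are newer than the cursor, so they never shift later pages.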

🚩 Red Flags

Using offset/limit on a 500M row table — O(n) scan for high offsets kills performance
Running batch exports against the primary DB — competing with OLTP workload
Returning unbounded result sets — no limit means a client can request all 500M rows
Not including has_more or next_cursor — client can't know when it has all results
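
For the batch export path, the alternative to offset/limit and to unbounded result sets is a generator that walks the table in keyset batches and yields CSV chunks as it goes. The sketch below assumes a hypothetical `transactions` table in SQLite standing in for the read replica; the function name and the small batch size in the usage line are illustrative only.

```python
import csv
import io
import sqlite3
from typing import Iterator


def export_csv_chunks(db: sqlite3.Connection, account_id: int,
                      batch_size: int = 10_000) -> Iterator[str]:
    """Yield CSV text chunk by chunk, reading keyset batches oldest first."""
    yield "id,created_at,amount\r\n"
    last_created_at, last_id = "", 0  # sorts before any real row
    while True:
        rows = db.execute(
            "SELECT id, created_at, amount FROM transactions "
            "WHERE account_id = ? AND (created_at, id) > (?, ?) "
            "ORDER BY created_at, id LIMIT ?",
            (account_id, last_created_at, last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        yield buf.getvalue()
        # Remember the last row so the next batch seeks instead of scanning.
        last_created_at, last_id = rows[-1][1], rows[-1][0]


# Hypothetical replica contents for the demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, "
           "account_id INTEGER, created_at TEXT, amount INTEGER)")
db.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", [
    (1, 1, "2024-01-01T00:00:01Z", 100),
    (2, 1, "2024-01-01T00:00:02Z", 200),
    (3, 1, "2024-01-01T00:00:02Z", 300),
    (4, 1, "2024-01-01T00:00:03Z", 400),
    (5, 1, "2024-01-01T00:00:04Z", 500),
])

chunks = list(export_csv_chunks(db, account_id=1, batch_size=2))
```

A web framework would hand this generator to its streaming response type so memory stays bounded by one batch regardless of export size.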

Design a Rate Limiting System for a Public API Platform

Problem

Your company runs a public REST API platform used by 10,000 external developers. You need to implement rate limiting that: enforces per-API-key quotas (requests/minute and requests/month), handles burst traffic gracefully, is consistent across a 50-instance API fleet, and degrades gracefully if the rate limiting system itself fails.

Constraints

50 API gateway instances behind a load balancer, no sticky sessions
Rate limits: 100 req/min (short) + 10,000 req/month (long) per API key
P99 latency budget for the rate limit check: < 5ms
System must not take down the API if the rate limiter fails

Key Discussion Points

Dual-window counters in Redis: Short window: INCR ratelimit:{key}:min:{unix_minute} with EXPIRE 60. Long window: INCR ratelimit:{key}:month:{year_month} with EXPIRE 2678400 (31 days). Lua script combines both checks atomically — single round-trip. Redis latency is typically < 1ms on the same rack; 5ms budget is achievable.
Token bucket for burst handling: A token bucket allows controlled bursts up to the bucket capacity (e.g., 200 requests in the first 10 seconds of a minute, if the capacity is set that high) while the refill rate bounds sustained throughput, and it avoids the sharp reset spike that a fixed window produces at the minute boundary. Implement in Redis with (last_refill_time, current_tokens) per key and a Lua script that refills tokens proportional to elapsed time on each request.
Fail-open with local fallback: If Redis is unreachable, fail-open (allow the request) rather than fail-closed (deny all). Log the failure. Implement a local in-memory approximate counter as a backstop; it won't be globally consistent, but it limits abuse during a Redis outage. Circuit breaker pattern: if the Redis error rate exceeds 50% for 10 seconds, switch to local mode and alert on-call.
Response headers: always return X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset (epoch of next window reset), and for 429 responses Retry-After. These headers let well-behaved clients implement client-side throttling and back-off before hitting the limit.
Quota enforcement at the gateway: centralise rate limiting at the API gateway (Kong, Envoy, AWS API Gateway) rather than per-service — single enforcement point, no code duplication. Per-service limits are additive overhead, not a replacement.
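
The token bucket refill logic described above is small enough to show directly. The sketch below keeps state in a Python dict purely for illustration; in the design discussed here the same arithmetic would run inside a Redis Lua script so that the read-refill-decrement sequence is atomic across gateway instances. The class name, the capacity of 20, and the explicit `now` argument are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Bucket:
    tokens: float
    last_refill: float


class TokenBucketLimiter:
    """Per-key token bucket: capacity bounds bursts, refill rate bounds throughput."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets: Dict[str, Bucket] = {}

    def allow(self, key: str, now: float) -> Tuple[bool, int]:
        """Return (allowed, whole tokens remaining) for one request at time `now`."""
        bucket = self._buckets.setdefault(key, Bucket(self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - bucket.last_refill)
        bucket.tokens = min(self.capacity,
                            bucket.tokens + elapsed * self.refill_per_sec)
        bucket.last_refill = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True, int(bucket.tokens)
        return False, 0


# 100 requests/minute sustained, with bursts of up to 20 (illustrative numbers).
limiter = TokenBucketLimiter(capacity=20, refill_per_sec=100 / 60)
burst = [limiter.allow("key-123", now=0.0)[0] for _ in range(25)]
```

The remaining-token count maps naturally onto X-RateLimit-Remaining, and the time until one token is refilled is a reasonable value for Retry-After on a 429.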

🚩 Red Flags

Per-instance local counters — 50 instances each allow 100 req/min = 5,000 req/min effective
Fail-closed on Redis outage — takes down the entire API when rate limiter fails
Fixed window without handling boundary spikes — 100 requests in last second of minute N + 100 in first second of minute N+1 = 200 in 2 seconds
Not returning rate limit headers — clients have no way to implement polite back-off
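
The fail-open behaviour that avoids the second red flag can be sketched as a thin wrapper around the shared limiter. This is a simplified illustration: it trips after a number of consecutive failures rather than on an error-rate window, `remote_check` is a hypothetical callable standing in for the Redis-backed check, and the local counter is per-instance and therefore only approximate.

```python
import time
from typing import Callable, Dict


class FailOpenLimiter:
    """Use the shared limiter when healthy; fall back to a local counter when not."""

    def __init__(self, remote_check: Callable[[str], bool],
                 local_limit_per_min: int, failure_threshold: int = 5,
                 cooldown_sec: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._remote_check = remote_check
        self._local_limit = local_limit_per_min
        self._threshold = failure_threshold
        self._cooldown = cooldown_sec
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0  # while now < this, skip the remote entirely
        self._window = -1
        self._counts: Dict[str, int] = {}

    def allow(self, key: str) -> bool:
        now = self._clock()
        if now >= self._open_until:
            try:
                allowed = self._remote_check(key)
                self._failures = 0
                return allowed
            except ConnectionError:
                self._failures += 1
                if self._failures >= self._threshold:
                    # Trip the breaker: stop calling the remote for a while.
                    self._open_until = now + self._cooldown
        return self._allow_local(key, now)

    def _allow_local(self, key: str, now: float) -> bool:
        window = int(now // 60)
        if window != self._window:  # new minute: drop old counts
            self._window, self._counts = window, {}
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self._local_limit


def unreachable(key: str) -> bool:
    raise ConnectionError("rate limit store unreachable")


limiter = FailOpenLimiter(unreachable, local_limit_per_min=100)
still_serving = limiter.allow("key-123")
```

With the remote store down, requests keep flowing and each instance still caps a single key locally, so the outage degrades accuracy rather than availability.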

Migrate a Monolith REST API to Microservices Without Breaking Consumers

Problem

A 5-year-old monolith exposes 200 REST endpoints used by 3 mobile apps (iOS, Android), a web frontend, and 15 partner integrations. You need to decompose it into microservices over 18 months while keeping all existing consumers working without forced upgrades.

Constraints

Zero downtime during migration; no consumer-forced upgrades
Partners have 6-month contractual SLAs on API stability
Some endpoints have undocumented behavior that partners depend on
Team of 12 engineers; migration must be phased

Key Discussion Points

Strangler Fig pattern: don't rewrite — incrementally extract. An API gateway (or routing layer) sits in front of both monolith and new microservices. Traffic for extracted endpoints routes to the new service; unextracted endpoints still hit the monolith. Consumers see the same base URL throughout. No big-bang cutover.
Reverse-engineer undocumented contracts before migrating: Record production traffic (shadow mode) for 30 days. Generate contract tests from real request/response pairs (Pact, Traffic Parrot). These become your regression suite and the "truth" for undocumented behavior. Run these against every deployment.
Phased extraction order: extract by domain bounded context, starting with the most independently testable, least-coupled domains first (e.g., catalog/read-only data before orders/payments). Each extraction: (1) build new service, (2) run in shadow mode alongside monolith comparing responses, (3) gradually shift traffic (1% → 10% → 100% with feature flag), (4) deprecate monolith path.
API gateway as the stability layer: Gateway handles: routing, auth, rate limiting, request/response transformation. If a new microservice changes its response shape, the gateway can transform it back to the v1 contract — consumers never see the internal change. This buys time to migrate consumers at their own pace.
Communication to partners: 90-day advance notice for any route changes. Maintain a changelog and deprecation calendar. Provide a staging environment that mirrors production 2 weeks ahead so partners can test before production changes.
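
The gradual traffic shift in the extraction steps can be sketched as a routing decision at the gateway. The example below is a hedged illustration rather than any particular gateway's configuration: the rollout table, the service names, and the choice of hashing a consumer ID into a stable bucket are all assumptions.

```python
import hashlib
from typing import Dict

# Hypothetical rollout table: path prefix -> percent of consumers on the new service.
ROLLOUT: Dict[str, int] = {"/catalog": 10, "/orders": 0}


def bucket_for(consumer_id: str) -> int:
    """Map a consumer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(consumer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(path: str, consumer_id: str, rollout: Dict[str, int] = ROLLOUT) -> str:
    """Pick the backend for a request; unextracted paths always hit the monolith."""
    for prefix, percent in rollout.items():
        if path == prefix or path.startswith(prefix + "/"):
            if bucket_for(consumer_id) < percent:
                return "new-service"
            return "monolith"
    return "monolith"


backend = route("/catalog/items/42", consumer_id="partner-007")
```

Hashing the consumer ID rather than picking randomly per request means a given partner sees consistent behaviour, and raising the percentage only ever adds consumers to the new path. Rolling back is a matter of setting the percentage to 0.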

🚩 Red Flags

Big-bang rewrite with a cutover date — almost always fails or requires emergency rollback
Not generating contract tests from production traffic — undocumented behavior will break partners
Extracting tightly coupled domains first (e.g., billing before authentication) — maximizes coordination cost
No API gateway — direct consumer-to-microservice routing means consumers must update base URLs
Forcing consumer upgrades as part of the migration — violates the SLA and partner trust
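
Shadow-mode comparison, which guards against breaking undocumented behavior, comes down to diffing the monolith's response against the new service's response for the same recorded request. The following is a minimal sketch of such a comparison; the set of volatile fields to ignore and the sample payloads are assumptions, and tools such as Pact do considerably more than this.

```python
from typing import Any, List

# Hypothetical fields expected to differ on every call.
VOLATILE = {"request_id", "timestamp"}


def diff(old: Any, new: Any, path: str = "$") -> List[str]:
    """List the places where the new response deviates from the old one."""
    if isinstance(old, dict) and isinstance(new, dict):
        out: List[str] = []
        for key in sorted(set(old) | set(new)):
            if key in VOLATILE:
                continue
            if key not in new:
                out.append(f"{path}.{key}: missing in new")
            elif key not in old:
                out.append(f"{path}.{key}: added in new")
            else:
                out.extend(diff(old[key], new[key], f"{path}.{key}"))
        return out
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [f"{path}: length {len(old)} != {len(new)}"]
        out = []
        for i, (a, b) in enumerate(zip(old, new)):
            out.extend(diff(a, b, f"{path}[{i}]"))
        return out
    # Compare types too: "10.00" vs 10.0 is exactly the kind of change
    # that breaks a partner relying on undocumented behavior.
    if type(old) is not type(new) or old != new:
        return [f"{path}: {old!r} != {new!r}"]
    return []


monolith = {"id": 1, "total": "10.00", "items": [{"sku": "a"}], "request_id": "x"}
extracted = {"id": 1, "total": 10.0, "items": [{"sku": "a"}],
             "request_id": "y", "currency": "USD"}

mismatches = diff(monolith, extracted)
```

An empty mismatch list across the recorded traffic is the signal that an extracted endpoint is ready for the next traffic-shift step.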