// full-duplex · persistent TCP · real-time · scaling · security · senior → principal
The client opens with an HTTP request carrying an Upgrade: websocket header and a random base64 nonce (Sec-WebSocket-Key). The server responds 101 Switching Protocols, concatenates the nonce with a fixed GUID, hashes the result with SHA-1, and returns the base64-encoded digest as Sec-WebSocket-Accept. After this, the TCP connection is repurposed: HTTP is gone, and WebSocket frames flow on the same socket.
The handshake is intentional: it lets WebSockets traverse firewalls and proxies that understand HTTP, and the nonce exchange prevents HTTP caches from accidentally serving WebSocket data.
The WebSocket API surfaces four events:
- onopen: handshake complete, safe to send
- onmessage: data frame received
- onerror: transport error (always followed by onclose)
- onclose: connection closed; inspect code and reason
Common close codes: 1000 Normal closure · 1001 Going away (page navigate) · 1006 Abnormal closure (no Close frame sent — TCP dropped) · 1008 Policy violation · 1011 Server error.
A readyState of CONNECTING (0), OPEN (1), CLOSING (2), or CLOSED (3) tracks connection phase.
wss:// (WebSocket Secure) is WebSocket over TLS, the same transport security that HTTPS uses. Plain ws:// exposes frame data and allows proxies to inject or modify content.
Authentication is the hard part. The browser's WebSocket constructor does not support custom headers — you cannot pass an Authorization header. Three patterns:
- Cookie auth: if the user is already cookie-authenticated (same-origin), the browser
sends cookies on the upgrade request automatically — simplest.
- Token in query string: wss://host/ws?token=... — works but tokens appear in server
logs; use short-lived tokens.
- First-message auth: connect first, then immediately send an auth message; server
enforces a timeout and closes unauthenticated connections.
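The first-message pattern can be sketched as a small server-side gate. This is a minimal illustration under stated assumptions, not a framework API: `receive` stands in for whatever coroutine your server library uses to read the next frame, `verify_token` is a hypothetical lookup, and the `{"type": "auth", "token": ...}` message shape is invented for the example.

```python
import asyncio
import json

POLICY_VIOLATION = 1008  # close code for failed or missing authentication


async def authenticate(receive, verify_token, timeout=5.0):
    """Wait for the first message and treat it as the auth message.

    Returns (user, None) on success, or (None, close_code) when the
    caller should close the connection.
    """
    try:
        # The deadline stops anonymous connections from sitting open.
        raw = await asyncio.wait_for(receive(), timeout)
    except asyncio.TimeoutError:
        return None, POLICY_VIOLATION

    try:
        message = json.loads(raw)
    except (TypeError, ValueError):
        return None, POLICY_VIOLATION

    # Hypothetical message shape: {"type": "auth", "token": "..."}
    if not isinstance(message, dict) or message.get("type") != "auth":
        return None, POLICY_VIOLATION

    user = verify_token(message.get("token"))
    if user is None:
        return None, POLICY_VIOLATION
    return user, None
```

The caller closes the socket with the returned code when it is not `None`, and only then starts processing application messages.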
bufferedAmount grows without bound (browser).
Server side: check channel.isWritable() (Netty) or use reactive streams backpressure (Spring WebFlux). If a client is slow, drop non-critical messages, apply a rate limit, or close the connection rather than letting the buffer grow unboundedly.
Browser side: check ws.bufferedAmount before each send — if it's large, the network is lagging and you're queuing up data that hasn't left the browser yet.
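One way to implement "drop non-critical messages" on the server is a bounded per-connection queue that discards the oldest pending message when a slow client falls behind. The sketch below is illustrative only: `LatestWinsQueue` and its methods are hypothetical names, and a real server would drain the queue from its write loop whenever the socket is writable.

```python
from collections import deque


class LatestWinsQueue:
    """Bounded outbound queue: when full, the oldest message is dropped."""

    def __init__(self, max_depth):
        # A deque with maxlen silently evicts from the left on append.
        self._items = deque(maxlen=max_depth)
        self.dropped = 0  # expose as a metric so drops are never silent

    def push(self, message):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(message)

    def drain(self):
        """Return all pending messages in order and empty the queue."""
        pending = list(self._items)
        self._items.clear()
        return pending
```

A lagging dashboard client then sees slightly stale data instead of causing unbounded memory growth, and the `dropped` counter makes the loss visible in monitoring.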
reconnecting-websocket handle the mechanics; your app logic must handle the state recovery.
new WebSocket(url) ignores any headers you pass. Authorization: Bearer ... is not possible on the initial upgrade request from a browser. Use query-string tokens (short-lived, single-use), cookie auth, or a post-connect auth message.
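A short-lived, single-use token can be sketched as an in-memory ticket store. Everything here is an assumption made for illustration: `TicketStore`, its method names, and the 30-second TTL are hypothetical, and a multi-server deployment would need a shared store rather than a local dict.

```python
import secrets
import time


class TicketStore:
    """Issues one-time tickets that expire after ttl_seconds."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._tickets = {}  # ticket -> (user_id, expiry time)

    def issue(self, user_id):
        """Call from an authenticated REST endpoint just before connecting."""
        ticket = secrets.token_urlsafe(32)
        self._tickets[ticket] = (user_id, self._clock() + self._ttl)
        return ticket

    def redeem(self, ticket):
        """Call during the upgrade request; returns the user id or None."""
        # pop() removes the ticket, so a replayed ticket always fails.
        entry = self._tickets.pop(ticket, None)
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if self._clock() <= expires_at else None
```

Because the ticket is useless after one redemption or after the TTL, a copy that leaks into an access log has little value to an attacker.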
ulimit -n appropriately, and track connection count as a first-class metric.

| Opcode | Meaning |
|---|---|
| 0x0 | Continuation frame — subsequent frame of a multi-frame message |
| 0x1 | Text frame — payload must be valid UTF-8 |
| 0x2 | Binary frame — arbitrary bytes, no encoding constraint |
| 0x8 | Close frame — optional 2-byte status code + UTF-8 reason in payload |
| 0x9 | Ping — control frame; peer must respond with Pong |
| 0xA | Pong — response to Ping; may also be sent unsolicited |

| Property | Detail |
|---|---|
| Max frame payload | 2^63 - 1 bytes (theoretical); keep messages under 64KB in practice for browser compatibility |
| Masking | Client→server frames must be masked (4-byte XOR key). Server→client must NOT be masked. |
| Close status 1000 | Normal closure — both sides finished cleanly |
| Close status 1001 | Going Away — server shutting down or browser navigating away |
| Close status 1006 | Abnormal closure — TCP dropped without a Close frame (network failure, process killed) |
| Close status 1011 | Internal server error — server encountered an unexpected condition |
| Sec-WebSocket-Version | Must be 13 (RFC 6455). The only version in production use. |
| Sec-WebSocket-Protocol | Optional sub-protocol negotiation (e.g., 'stomp', 'graphql-ws', 'chat') |
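
The opcode, length, and masking rules in the tables above can be made concrete by building a client-to-server frame by hand. This is a minimal sketch for illustration (single unfragmented frame, no extensions); the function names are hypothetical, and production code should use a WebSocket library.

```python
import os
import struct
from typing import Optional

OP_TEXT = 0x1


def mask_payload(payload: bytes, key: bytes) -> bytes:
    """XOR each byte with the 4-byte key; applying it twice unmasks."""
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))


def build_client_frame(payload: bytes, opcode: int = OP_TEXT,
                       key: Optional[bytes] = None) -> bytes:
    """Encode one final (FIN=1) masked frame, as clients must send."""
    if key is None:
        key = os.urandom(4)  # a fresh random key per frame
    if len(key) != 4:
        raise ValueError("masking key must be 4 bytes")

    first = 0x80 | opcode  # FIN bit set, RSV bits clear
    n = len(payload)
    if n < 126:
        header = struct.pack("!BB", first, 0x80 | n)
    elif n < (1 << 16):
        # 126 signals a 16-bit extended length
        header = struct.pack("!BBH", first, 0x80 | 126, n)
    else:
        # 127 signals a 64-bit extended length
        header = struct.pack("!BBQ", first, 0x80 | 127, n)
    return header + key + mask_payload(payload, key)
```

With the masking key `37 fa 21 3d`, the text payload `Hello` encodes to the bytes `81 85 37 fa 21 3d 7f 9f 4d 51 58`, which matches the masked example in RFC 6455.
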
| Setting | Purpose |
|---|---|
| setMessageSizeLimit | Max inbound message size in bytes. Default 64KB. Increase for large payloads. |
| setSendBufferSizeLimit | Per-session outbound buffer. Default 512KB. If full, session is closed. |
| setSendTimeLimit | Max time (ms) to send a message to a client. Default 10s. Slow clients are closed. |
| setHeartbeatValue | STOMP heartbeat: {server-interval, client-interval} in ms. For each direction the two sides settle on the larger of the sender's and receiver's values; 0 disables that direction. |
| setAllowedOrigins / setAllowedOriginPatterns | CORS-equivalent for the upgrade request. Default: same-origin only. |
| withSockJS() | Adds SockJS fallback (long-polling / EventSource) for clients that can't upgrade. Adds overhead. |
| TaskScheduler (heartbeat) | Required bean when heartbeats are enabled. Without it, heartbeats silently don't run. |
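
| Nginx directive | Purpose |
|---|---|
| proxy_http_version 1.1 | Required: the Upgrade mechanism WebSockets use exists only in HTTP/1.1. Nginx has traditionally proxied to upstreams over HTTP/1.0 by default, and HTTP/2 does not support Upgrade. |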
| proxy_set_header Upgrade $http_upgrade | Forwards the Upgrade header to the upstream server. |
| proxy_set_header Connection 'upgrade' | Signals to the upstream that this is an upgrade request. |
| proxy_read_timeout | How long Nginx waits for data before closing the connection. Default 60s — set to 3600s+ for long-lived sockets. |
| proxy_send_timeout | How long Nginx waits to send data to the upstream. Also extend for long-lived connections. |
| ip_hash / sticky | Enable sticky sessions so reconnects route to the same upstream. Required for stateful WebSocket servers. |
| Dimension | WebSockets | Server-Sent Events (SSE) | Long Polling | HTTP/2 Push |
|---|---|---|---|---|
| Direction | Full-duplex — client and server send freely | Server → client only; client sends via separate HTTP | Server → client only; client re-polls | Server → client only; client sends via separate HTTP |
| Protocol | Custom framed protocol over TCP after HTTP upgrade | Long-lived streamed HTTP response (HTTP/1.1 chunked or an HTTP/2 stream) | Standard HTTP request/response cycle | HTTP/2 server push (PUSH_PROMISE) |
| Auto-reconnect | No — must implement with exponential backoff | Yes — browser reconnects automatically; EventSource handles it | Implicit — client reissues request after each response | No — part of a single HTTP/2 connection lifecycle |
| Sticky sessions needed | Yes — connection is stateful and pinned to one server | No — each reconnect can hit a different server; stateless | No — each request is independent | No — tied to HTTP/2 connection which can be load-balanced |
| Proxy/firewall support | Issues with HTTP proxies that don't understand Upgrade | Excellent — looks like slow HTTP; works everywhere | Excellent — standard HTTP request | Limited — many proxies don't support HTTP/2 push |
| Latency | Very low — persistent socket, no handshake per message | Low — persistent response stream, no reconnect overhead for delivery | Medium — includes polling delay and new TCP/TLS per cycle without keep-alive | Very low when supported |
| Best for | Chat, games, live collaboration, bidirectional signaling | Live feeds, notifications, dashboards where server pushes only | Simple polling when WebSockets are blocked; legacy clients | Resource hints, pre-loading assets — not general messaging |
The client sends an HTTP/1.1 GET request with the headers Upgrade: websocket, Connection: Upgrade, Sec-WebSocket-Key: <base64 random 16 bytes>, and Sec-WebSocket-Version: 13.
The server responds with 101 Switching Protocols and a Sec-WebSocket-Accept header computed as base64(sha1(key + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11")). The GUID is fixed in the spec — it's there to prevent a misconfigured HTTP server from accidentally accepting a WebSocket upgrade.
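The Sec-WebSocket-Accept computation described above is small enough to write out directly. This is a sketch for illustration; the function name `accept_value` is hypothetical, and real servers get this from their WebSocket library.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_value(sec_websocket_key: str) -> str:
    """Return base64(sha1(key + GUID)) for the Sec-WebSocket-Accept header."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

For the sample key in RFC 6455, `dGhlIHNhbXBsZSBub25jZQ==`, this returns `s3pPLMBiTxaQ9kYGzzhZRbK+xOo=`. A client compares the header it receives against the same computation and fails the connection on a mismatch.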
After 101, the TCP connection is no longer used for HTTP. Both sides switch to the WebSocket framing protocol on the same socket. No new connection is created — the upgrade repurposes the existing TCP stream.
The handshake being HTTP is intentional: it lets WebSockets work through firewalls on port 443 that allow HTTPS, and it gives intermediaries (proxies, load balancers) a familiar signal to route the connection correctly.
The nonce exchange (Sec-WebSocket-Key / Accept) is not for security; it's a cache-buster. Without it, a naïve HTTP cache could interpret WebSocket frames as cached HTTP responses and serve garbage. The fixed GUID ensures the Accept value is only derivable by a server that knows the WebSocket spec, preventing accidental upgrades by servers that just echo headers. Real security comes from TLS (wss://): the handshake crypto does nothing for confidentiality or authentication.
The browser's WebSocket constructor does not accept custom headers: there is no option to pass Authorization: Bearer <token>. This is a spec limitation.
Three patterns, ordered by preference:
Cookie auth — if the user is already authenticated via cookie on the same origin, the browser automatically includes cookies in the upgrade request. The server validates the session cookie. No special handling needed. Works only for same-origin.
Token in the first message — connect without authentication, then immediately send an auth message as the first WebSocket frame. The server enforces a strict timeout (e.g., 5s): if no valid auth message arrives, it closes with 1008 Policy Violation. Prevents anonymous connections from sitting open.
Short-lived token in query string: wss://host/ws?token=<jwt>. Tokens appear in server access logs and browser history. Mitigate by generating single-use, short-TTL tokens on a REST endpoint immediately before connecting. The token exchange endpoint is authenticated via cookie or normal Authorization header.
When the connection drops unexpectedly, the browser fires onerror (if there's a transport error) immediately followed by onclose with code 1006 (abnormal closure: no Close frame was sent, the TCP connection just died). The browser does not automatically reconnect.
A robust reconnection implementation:
1. Exponential backoff with jitter — first retry at ~1s, double each attempt up to a cap (~30s), add random jitter (±25%) to prevent thundering herds when a server restarts and many clients reconnect simultaneously.
2. Reconnect on 1006, 1001, 1011 — all indicate unclean closure. Don't reconnect on 1000 (normal close) or 1008 (policy violation) — those are intentional.
3. Re-authenticate on reconnect — the new connection has no memory of the old one; session state must be re-established.
4. Re-subscribe to channels — if using pub/sub, the server has no record of what the new connection was subscribed to; the client must re-declare subscriptions in onopen.
5. Track missed messages: if ordering matters, send a last_received_seq to the server on reconnect so it can replay what was missed.
Libraries (e.g., ws) handle Ping/Pong automatically if configured. The key parameters are the ping interval and the ping timeout (pingInterval and pingTimeout in libraries that expose them under those names). Set the interval shorter than the load balancer's idle timeout: typically 25–45s if the LB drops connections after 60s.
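The backoff and close-code rules from the reconnection list above can be sketched as two small functions. The names and defaults are illustrative assumptions taken from the numbers in the list (about 1s base, 30s cap, ±25% jitter); the actual reconnect loop, re-authentication, and re-subscription are left to the client.

```python
import random

# Close codes that warrant a retry; 1000 and 1008 are intentional closes.
RECONNECT_CODES = frozenset({1001, 1006, 1011})


def should_reconnect(close_code: int) -> bool:
    return close_code in RECONNECT_CODES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.25, rng=random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Doubles from `base` up to `cap`, then scales by a random factor in
    [1 - jitter, 1 + jitter] so clients do not reconnect in lockstep.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    delay = min(cap, base * 2 ** min(attempt, 30))
    factor = 1.0 + jitter * (2.0 * rng() - 1.0)
    return delay * factor
```

A client would reset `attempt` to zero once a connection has stayed open for a while, so a single blip does not leave it stuck at the 30s cap.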
Also monitor OS-level socket state with ss -tnp | grep CLOSE_WAIT. CLOSE_WAIT sockets that linger mean the peer has closed but the server application never closed its end, which usually points to close handling missing from application code.
In Netty: channel.isWritable() returns false when the high-water mark is breached. In Spring WebFlux: use reactive streams backpressure, for example Flux.create with an overflow strategy. In Node.js ws: check bufferedAmount, or wait for the send() callback, before writing more.
Strategies by use case:
- Drop non-critical messages: for live tickers, dashboards, or telemetry — if
the client is lagging, drop the oldest pending message and send only the latest.
The client sees slightly stale data, not a crash.
- Rate-limit per connection: enforce a per-session send queue with a max depth.
When the queue is full, either drop or close the connection with a warning.
- Close slow clients: for real-time systems where staleness is harmful (e.g.,
game state), disconnect clients that can't keep up. They reconnect with a fresh
state snapshot.
- Apply backpressure upstream: in reactive pipelines, propagate backpressure
back to the data source: stop reading from Kafka/DB until the client catches up.
Enforce these limits in framework configuration (setSendBufferSizeLimit, setSendTimeLimit), not as application logic. Make connection closure under back-pressure a first-class, logged, monitored event, not a silent failure.
SSE when: the server only pushes and the client rarely sends. It is a long-lived HTTP response with the text/event-stream content type. The browser auto-reconnects, sends Last-Event-ID so the server can resume, and it works through HTTP/2 multiplexing (no sticky sessions). Proxy support is excellent: it looks like a slow HTTP response.
WebSockets when: the client sends data frequently enough that opening a new HTTP request per send is too expensive (chat input, game controls, collaborative editing ops), or when you need full-duplex with sub-50ms latency in both directions.
Long polling when: both WebSockets and SSE are blocked by network policy (rare now, but real in enterprise environments), or when you need a simpler fallback for old clients. Long polling is inefficient — it creates a new TCP connection per message without keep-alive, and each response triggers a new request with full HTTP headers.
When the choice is wrong: using WebSockets for a notification system where the server pushes events every few minutes and the client never sends. SSE would be simpler, more scalable (no sticky sessions), and auto-reconnecting. Using SSE for a collaborative editor is the opposite mistake: the round-trip latency of a separate POST for each keystroke vs. a WebSocket frame is significant, and the implementation is more complex.
2. Send a Close frame with a Retry-After-equivalent reconnect hint in the reason string (or via a custom message before closing). This signals clients to reconnect immediately rather than backing off.
3. Reconnect hint includes server exclusion — pass a flag in the close reason that tells the client to avoid this server's address for N minutes. Implement in the load balancer by removing it from rotation before draining starts.
4. Set a drain deadline — some clients will not respond to Close frames (backgrounded mobile, misbehaving clients). Force-close all remaining connections after a timeout (e.g., 60s). Accept that a small percentage will see 1006.
5. Stagger the rolling restart — restart one server at a time with a cooldown (wait until reconnect load is absorbed before starting the next drain). Monitor aggregate connection count and reconnection rate as gating signals.
Reconnection amplification: each closed connection generates at least one reconnect request. 40k connections draining ≈ 40k reconnects in ~10s. Design load balancer and server capacity for 2× normal connection rate during maintenance windows.
On reconnect, the client sends its last_seen_message_id. The server queries the database for messages in the client's channels after that ID and flushes them over the WebSocket before switching to live delivery.
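The replay-then-live handoff can be sketched as a pure function. This is an illustration under assumptions not stated in the text: message IDs are monotonically increasing integers, `stored` is the ordered result of the database query, and `live_buffer` holds messages that arrived on the live feed while the replay was being prepared. The function name `catch_up` is hypothetical.

```python
def catch_up(stored, last_seen_id, live_buffer):
    """Return the messages to flush on reconnect, in order, without duplicates.

    stored:       messages from the database, ordered by ascending "id"
    last_seen_id: the last ID the client confirmed before disconnecting
    live_buffer:  messages captured from the live feed during the replay
    """
    missed = [m for m in stored if m["id"] > last_seen_id]

    # Anything already covered by the replay must not be sent twice.
    high_water = missed[-1]["id"] if missed else last_seen_id
    fresh = [m for m in live_buffer if m["id"] > high_water]

    return missed + fresh
```

Buffering the live feed during the replay closes the gap in which a message could be committed after the query ran but before live delivery resumed.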