// MVCC · indexes · query planner · isolation levels · WAL · replication · partitioning · senior → principal
UPDATE writes a new row version (tuple) and marks the old one as dead. Every DELETE marks a tuple dead without removing it. Readers see the version of the row that was committed before their transaction started — reads never block writes and writes never block reads.
Each tuple carries xmin (the transaction that created it) and xmax (the transaction that deleted or updated it). The snapshot a transaction uses determines which xmin/xmax values are visible to it.
Dead tuples are the cost: they accumulate on disk until VACUUM reclaims the space. A table under heavy updates grows bloated until autovacuum catches up.
=, <, >, BETWEEN, LIKE 'foo%'. Covers 95%+ of use cases.
Hash: only =. Slightly faster than B-tree for pure equality, but not WAL-logged before PG10 — avoid in older versions.
GIN (Generalized Inverted Index): for composite values where you search within the value. Full-text search (tsvector), JSONB containment (@>), arrays (&&).
GiST (Generalized Search Tree): geometric types, PostGIS geometries, ranges, full-text (ts_rank). Lossy — may return false positives, rechecked in heap.
BRIN (Block Range Index): extremely small; records min/max per block range. Works only when table data is physically sorted by the indexed column (e.g., an append-only created_at). Terrible for random-ordered columns.
Partial index: CREATE INDEX ON orders (user_id) WHERE status = 'pending' — indexes only a subset of rows. Tiny and fast when queries always include the WHERE clause.
ANALYZE (row count, column distinct values, most common values, histogram buckets). Stale statistics = bad plans.
EXPLAIN: shows the plan with estimated rows and cost. EXPLAIN (ANALYZE, BUFFERS): executes the query and shows actual rows, actual time, and buffer hits/misses — the only reliable way to know what's slow.
Key plan nodes: Seq Scan (full table), Index Scan (b-tree seek + heap fetch), Index Only Scan (covering index, no heap), Bitmap Heap Scan (batched heap access for many index hits), Hash Join, Nested Loop, Merge Join.
work_mem controls the memory available per sort/hash operation. Set too low → spills to disk → slow. Set too high → OOM under concurrent queries.
ERROR 40001.
checkpoint_completion_target=0.9 spreads checkpoint writes over 90% of the checkpoint interval.
WAL also drives replication: standbys receive WAL stream and replay it — physical (byte-for-byte identical) replication. Logical replication decodes WAL into row-level changes (INSERT/UPDATE/DELETE) for selective, cross-version replication.
WHERE clauses use partition pruning to scan only relevant children. Maintenance (VACUUM, index rebuilds) runs on smaller partitions.
Range: PARTITION BY RANGE (created_at) — one partition per month. Best for time-series data; old partitions detach cleanly.
List: PARTITION BY LIST (region) — one partition per region. Good for low-cardinality categorical splits.
Hash: PARTITION BY HASH (user_id) — distributes rows evenly. Good for sharding by ID when no natural range exists.
Partitioning adds overhead: the planner must evaluate each partition, INSERT routes through the partition key, foreign keys to the parent table are not supported.
pg_stat_activity for long-running transactions and age(relfrozenxid) in pg_class for bloat risk. One idle-in-transaction session can block autovacuum across the entire database.
pg_stat_activity as autovacuum: VACUUM (to prevent wraparound)) when the horizon approaches. If it can't keep up, PostgreSQL shuts down to prevent data corruption. Monitor age(datfrozenxid) in pg_database — alert above 1.5B.
VACUUM marks index entries as dead but reclaims pages only during VACUUM FULL (which takes an ACCESS EXCLUSIVE lock — avoid in production). Monitor index bloat with pgstattuple. Use REINDEX CONCURRENTLY (PG12+) to rebuild a bloated index without locking.
ANALYZE), the selectivity estimate is wrong (correlation mismatch), or random_page_cost is calibrated for spinning disk on an SSD server. Set random_page_cost=1.1 on SSD-backed servers to correctly model random I/O as nearly as cheap as sequential.
SELECT * FROM a WHERE id NOT IN (SELECT id FROM b) returns zero rows if any row in b.id is NULL — because NULL != anything is UNKNOWN, not FALSE, and NOT IN requires all comparisons to be FALSE. Use NOT EXISTS or LEFT JOIN / WHERE b.id IS NULL instead. This SQL behavior trips up even experienced developers.
| B-tree | Default. Supports =, <, >, BETWEEN, LIKE 'prefix%'. Handles NULLs. Multi-column indexes support leading-column prefix queries. |
| Hash | Only =. Faster point lookups than B-tree at high cardinality. WAL-safe since PG10. Not useful for range queries. |
| GIN | Full-text search (tsvector), JSONB @> / ?, arrays &&/@>. Slow to build, fast to query. Use gin_pending_list_limit to control build buffering. |
| GiST | Geometric types, PostGIS, range types, full-text ranking. Lossy — may return false positives, rechecked against the heap. |
| BRIN | Tiny index (128 bytes per range). Only useful when column values are physically sorted (e.g., append-only created_at). Terrible on random data. |
| Partial | Index only rows matching a WHERE predicate. Tiny and fast. Query must include the predicate for the planner to use it. |
| Covering (INCLUDE) | B-tree with extra columns in INCLUDE clause. Enables Index Only Scans by putting SELECT columns in the index leaf pages. |
| Expression | Index on an expression: CREATE INDEX ON users (lower(email)). Planner uses it only if the query WHERE clause uses the same expression. |
| READ UNCOMMITTED | Not implemented in PostgreSQL — treated as READ COMMITTED. Dirty reads are never allowed. |
| READ COMMITTED | Default. Each statement sees its own snapshot. Non-repeatable reads possible. Safe for most OLTP reads. |
| REPEATABLE READ | Snapshot at first statement. No non-repeatable reads or phantoms (PostgreSQL goes beyond the SQL standard here). Serialization failures possible on write conflicts. |
| SERIALIZABLE | SSI — full serializability. Application must retry on ERROR 40001 (serialization_failure). Higher CPU overhead. Use for financial transactions, inventory, anything where write skew matters. |
| shared_buffers | PostgreSQL buffer cache. Start at 25% of RAM. Increasing past 40% rarely helps — OS page cache also caches. |
| work_mem | Memory per sort/hash operation per query node. Low → disk spill. High → OOM under concurrency. Set at session level for analytics queries. |
| maintenance_work_mem | Memory for VACUUM, CREATE INDEX, ALTER TABLE. Set higher (1–2 GB) for index builds on large tables. |
| effective_cache_size | Hint to the planner about total available cache (shared_buffers + OS cache). Does not allocate memory. Set to 75% of RAM. |
| random_page_cost | Planner cost of random I/O relative to seq. Default 4.0 (HDD). Use 1.1–1.5 for SSDs — makes the planner prefer index scans correctly. |
| max_connections | Caps OS processes. Keep low (100–200) and use PgBouncer. Each connection uses ~5–10 MB of shared memory overhead. |
| wal_level | minimal / replica / logical. Must be 'replica' for streaming replication, 'logical' for logical replication or CDC. |
| checkpoint_completion_target | Fraction of checkpoint interval to spread writes over. Default 0.5; set to 0.9 to reduce I/O spikes. |
| cost=X..Y | Estimated startup cost .. total cost. Arbitrary units. Compare plans; don't compare across servers. |
| rows=N | Planner's estimated row count. Deviation from actual rows indicates stale statistics. |
| actual time=X..Y | Only in EXPLAIN ANALYZE. Wall time in ms for first row .. last row. Per-loop; multiply by loops. |
| Buffers: hit=N read=N | Only with BUFFERS option. Hit = from shared_buffers (fast). Read = from disk or OS cache (slow). |
| Seq Scan | Full table scan. Fine for small tables or large fraction of rows. Bad if selectivity is high. |
| Index Only Scan | Reads all needed columns from index leaf pages — no heap access. Fastest scan type. Requires covering index. |
| Bitmap Heap Scan | Collects many TIDs from index, sorts them, then fetches heap pages in order. Used when many rows match — reduces random I/O. |
| Nested Loop | For each outer row, probe inner relation. Fast for small outer sets with selective inner index. Slow for large cross products. |
| Dimension | PostgreSQL | MySQL (InnoDB) | MongoDB | CockroachDB |
|---|---|---|---|---|
| Concurrency | MVCC — readers never block writers | MVCC — similar to PostgreSQL | Document-level locking (WiredTiger MVCC) | Serializable MVCC across distributed nodes |
| ACID | Full ACID per transaction | Full ACID (InnoDB) | Multi-doc ACID since 4.0 | Full distributed ACID via Raft consensus |
| Schema | Strict schema; ALTER TABLE for changes | Strict schema; online DDL with caveats | Flexible BSON documents | Strict SQL schema; online schema changes |
| JSON support | JSONB — indexed, queryable, full operators | JSON column (not indexed as flexibly) | Native document store | JSONB (PostgreSQL-compatible) |
| Full-text search | Built-in tsvector/tsquery | Basic FULLTEXT index | Text index (Atlas Search for advanced) | Limited (extensions not supported) |
| Replication | Streaming (physical) + logical, built-in | Binary log replication; group replication | Replica sets (Raft-based); sharding via mongos | Native distributed replication (Raft consensus per range) |
| Horizontal write scaling | Manual sharding / Citus extension | Manual sharding; Vitess popular | Native sharding (mongos) | Native; distributed key-value underneath |
| Geospatial | PostGIS extension — industry gold standard | Geometry types; limited 2D support | 2dsphere index; good for geo queries | Basic geometry (no PostGIS) |
| Best for | Complex OLTP, analytics, general purpose | Web OLTP, read-heavy, strong ecosystem | Document-centric, flexible schemas | Global multi-region distributed SQL |
xmin = current transaction ID. 2. Sets xmax on the old tuple = current transaction ID (marks it deleted).
Readers use a snapshot — a list of transaction IDs visible to them. The snapshot rules: a tuple is visible if its xmin committed before the snapshot and its xmax is either not yet committed or not in the snapshot. This means old and new versions coexist on disk — readers see the version that was current when their transaction started.
Dead tuples are old versions no longer visible to any active snapshot. They waste space and slow down index scans (they appear in indexes but require a heap fetch to confirm visibility).
VACUUM scans the table, identifies dead tuples (those with xmax below the oldest active transaction's snapshot), marks their space as reusable, and removes their index entries. It does not compact the file by default (that's VACUUM FULL). Regular VACUUM runs concurrently with reads and writes.idle in transaction connection can pin the horizon for hours, causing unbounded bloat. Monitor pg_stat_activity for sessions in idle in transaction state with old xact_start. Set idle_in_transaction_session_timeout (PG9.6+) to automatically terminate these sessions. For analytical queries that legitimately run for hours, run them on a read replica, not the primary.ERROR 40001).
SERIALIZABLE: uses Serializable Snapshot Isolation (SSI). Detects read-write dependencies between concurrent transactions that would produce a non-serializable execution. Aborts one transaction with ERROR 40001 — serialization_failure rather than allowing write skew. Application must retry on this error. Overhead is higher but correctness is guaranteed for any concurrent access pattern.
Choosing: READ COMMITTED for most web OLTP; REPEATABLE READ for long transactions reading a consistent view (reports, exports); SERIALIZABLE for financial transactions, inventory deductions, or any "check then act" pattern.SELECT FOR UPDATE (pessimistic locking) can prevent write skew without SERIALIZABLE isolation — it serializes the specific rows being checked. The right tool depends on contention level and retry tolerance.seq_page_cost = 1.0 (baseline) - random_page_cost = 4.0 (default) — random I/O costs 4× sequential - cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost
When the planner chooses Seq Scan over an existing index: it estimates that the fraction of rows returned is large enough that fetching them via random index lookups (each requiring a heap page fetch) costs more than scanning the whole table sequentially.
Wrong plan causes: - Stale statistics: n_distinct, MCV list, histogram are outdated — run ANALYZE. - Wrong random_page_cost: on SSDs, random I/O is nearly as cheap as sequential.
Set random_page_cost = 1.1 to make the planner correctly prefer index scans.
- Correlation mismatch: data inserted in random order won't benefit from a BRIN index
or have good correlation for index-friendly access patterns.
- Missing ANALYZE after bulk load: newly loaded tables have no statistics. - High work_mem for the query causing a hash join to be preferred over a nested loop.EXPLAIN (ANALYZE, BUFFERS). Look at rows=N (estimated) vs actual rows=M (actual). A factor of 10× or more mismatch means statistics are stale or the planner's model doesn't fit the data. For JSONB or array columns with complex access patterns, the planner has no statistics — consider partial indexes or expression indexes that mirror your query predicates. For tables with millions of rows and highly selective queries, the planner rarely makes the wrong call with good statistics; bad plans on large tables almost always trace back to stale stats or a wrong random_page_cost on SSDs.Use GIN when you need to search within a composite value — the opposite of B-tree, which searches by the whole value. Primary GIN use cases: - tsvector full-text search (@@ operator) - JSONB containment (@>) and key existence (?, ?|, ?&) - Array overlap/containment (&&, @>, <@)
Full-text search setup: ```sql -- Add a tsvector column (or use a generated column in PG12+) ALTER TABLE articles ADD COLUMN fts tsvector
GENERATED ALWAYS AS (
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''))
) STORED;
CREATE INDEX articles_fts_idx ON articles USING gin(fts);
-- Query SELECT id, title FROM articles WHERE fts @@ plainto_tsquery('english', 'postgresql indexing') ORDER BY ts_rank(fts, plainto_tsquery('english', 'postgresql indexing')) DESC LIMIT 20; ``
GIN builds an inverted index: for each lexeme, it stores all row IDs containing it. Lookups are fast; writes are slower than B-tree (batched via pending list, flushed atgin_pending_list_limit). For write-heavy tables, considerfastupdate=off` if GIN pending list is causing I/O spikes.
maintenance_work_mem high and build during low traffic. The pending list (gin_pending_list_limit, default 4 MB) buffers writes before merging into the main index structure; heavy write rates can trigger frequent merges. If you're doing JSONB querying and your documents are large and varied, profile whether GIN index size stays manageable — GIN indexes can be 2–5× the size of the data for high-cardinality JSONB. For simple key lookup in JSONB (data->>'user_id'), a B-tree expression index is smaller and faster than GIN.fsync). This means:
Crash recovery: on restart after a crash, PostgreSQL replays WAL from the last checkpoint forward. Any change that has a WAL record is guaranteed to be recoverable, even if the data page was never flushed to disk. The database reaches a consistent state without any data loss for committed transactions.
Physical (streaming) replication: the primary ships WAL segments to standby servers. Each standby replays the WAL, maintaining a byte-for-byte copy of the primary. Standbys can serve read queries (hot standby). Replication lag = WAL on primary not yet replayed on standby.
Logical replication: WAL Decoder (wal_level = logical) reads the WAL and converts it into row-level changes (INSERT/UPDATE/DELETE). Enables selective table replication, cross-major-version replication, and CDC (Change Data Capture) for downstream systems (Debezium reads PostgreSQL's logical replication slot).pg_basebackup), then continuously archive WAL segments to object storage (S3). To recover to any point in time, restore the base backup and replay WAL up to the desired LSN or timestamp. This gives you a RPO measured in seconds (WAL archive interval) rather than the backup frequency. For any production database, PITR should be the baseline DR strategy — not just nightly backups. Test your restore procedure regularly; untested PITR is not recovery, it's hope.Autovacuum is PostgreSQL's background daemon that runs VACUUM and ANALYZE on tables automatically when they accumulate enough dead tuples (autovacuum_vacuum_scale_factor × row count + autovacuum_vacuum_threshold dead tuples). It also triggers ANALYZE for statistics refresh on modified tables.
What happens when it falls behind: - Dead tuples accumulate → table bloat → larger sequential scans, more I/O - Indexes accumulate dead entries → index bloat → slower index scans - xmin horizon stalls → transaction ID wraparound risk (the most dangerous outcome) - PostgreSQL runs emergency autovacuum (to prevent wraparound) which is more aggressive
and has higher priority but can still be outrun on extremely write-heavy tables
Tuning for high-write tables: sql ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- vacuum after 1% dead tuples (not default 20%)
autovacuum_vacuum_cost_delay = 2, -- less I/O throttling (ms, default 20)
autovacuum_vacuum_cost_limit = 800 -- more work per cycle (default 200)
); Also increase autovacuum_max_workers (default 3) for databases with many large tables.
pg_stat_user_tables.n_dead_tup, last_autovacuum, and last_autoanalyze — these are your first line of observability. If n_dead_tup consistently grows between autovacuum runs, the tuning is still wrong.xmin is "in the past" relative to the current transaction.
The problem: after ~2.1 billion transactions, XIDs wrap around. A row with xmin that was in the past becomes "in the future" — it becomes invisible. The entire database would appear empty. PostgreSQL prevents this by freezing old tuples: replacing their xmin with a special FrozenTransactionId that is always considered "in the past." VACUUM does this when a row is old enough (vacuum_freeze_min_age transactions old).
Emergency threshold: when the oldest unfrozen XID is within 40M transactions of wraparound, PostgreSQL runs autovacuum in emergency mode. If it can't finish in time, PostgreSQL shuts down with the message "database is not accepting commands to avoid wraparound data loss."
Prevention: monitor age(datfrozenxid) in pg_database. Alert above 1.5 billion. Ensure autovacuum runs regularly on all tables. Don't suppress autovacuum on large tables without an alternative.age() metric should be on every PostgreSQL dashboard. When the age approaches 1.8–2.0 billion, the situation is urgent: run VACUUM FREEZE manually on the table with the oldest relfrozenxid (check pg_class). In extreme cases with hours to wraparound, you can temporarily increase autovacuum_vacuum_cost_delay = 0 at the database level to let autovacuum run at full speed at the cost of I/O impact on other workloads.Start from the queries, not the columns. For each query pattern: 1. Identify the predicate columns (WHERE, JOIN ON) — these must be in the index. 2. Identify the sort columns (ORDER BY) — B-tree indexes can satisfy ORDER BY, eliminating a sort step. 3. Identify the output columns (SELECT) — if they fit in the index, you get an Index Only Scan (no heap fetch).
Multi-column index column order matters: - Equality predicates first, range predicates last (B-tree can use a range only on
the last column in a multi-column index).
- (status, created_at) serves WHERE status = 'pending' ORDER BY created_at well.
Reverse order serves neither well.
- A query using only created_at cannot use (status, created_at) — but can use
(created_at, status). Design for your most selective queries first.
Covering indexes for hot queries: Add INCLUDE (col1, col2) to include non-predicate output columns in index leaf pages. Index Only Scans avoid heap entirely — massive speedup for queries selecting a few columns from large tables.
Partial indexes for selective data: CREATE INDEX ON orders (user_id) WHERE status = 'active' is tiny and fast for an orders table where 99% of rows are closed. The query must include WHERE status = 'active'.
Avoid over-indexing: each index is updated on every write. A table with 12 indexes pays a significant write penalty. Remove indexes with zero scans (pg_stat_user_indexes where idx_scan = 0) after a representative period.
EXPLAIN ANALYZE output and query workload telemetry, not intuition. For existing production systems, use pg_stat_statements to identify your top N queries by total time (not just per-call — a fast query called millions of times per day may dominate). Then run EXPLAIN ANALYZE for each and look for Seq Scans on large tables and sort steps that could be eliminated by an index. The most impactful single change for most OLTP systems is adding a covering index for the 3–5 most-called queries. In Postgres 15+, pg_store_plans (extension) stores actual plans, not just queries — helpful for finding plan regressions after autovacuum or ANALYZE.DROP or DETACH a partition is instant, DELETE on a huge table is slow and bloating - Queries reliably filter on the partition key, enabling partition pruning - You want separate storage, autovacuum tuning, or tablespaces per partition
Range partitioning for time-series data is the most common and valuable use case: sql CREATE TABLE events (
id bigint, created_at timestamptz, payload jsonb
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); Use pg_partman to automate partition creation and retention.
Operational trade-offs: - Pruning requires the partition key in the WHERE clause — queries without it scan all partitions (worse than a single large table). - Cross-partition queries (aggregations, sorts) are more expensive — each partition is scanned separately and results merged. - Foreign keys referencing the parent table are not supported in PostgreSQL. - Global unique constraints are not supported — unique constraints must include the partition key. - Index maintenance: each partition has its own indexes. REINDEX CONCURRENTLY must be run per partition.DELETE WHERE created_at < '2023-01-01' on 500M rows, which runs for hours, generates massive WAL, bloats the table, and spikes replication lag. Always ask "do I need the data lifecycle management benefit?" before committing to partitioning — it carries significant schema constraints.hot_standby = on). Failover promotes a standby to primary. Works at the storage level — same PostgreSQL major version required.
Replication slots: guarantee WAL is retained on the primary until the standby has consumed it. Prevents standby falling too far behind and being unable to catch up. Risk: if the standby disconnects long-term, WAL accumulates and fills the disk. Monitor pg_replication_slots.confirmed_flush_lsn and the lag in bytes.
Logical replication: decodes WAL into row-level changes (INSERT/UPDATE/DELETE) and sends only selected tables to a subscriber. Subscriber can be a different major version, a different database, or a downstream analytics system. Used for: zero-downtime major version upgrades, selective replication, CDC with Debezium.
Choosing: - Primary + read replicas for load distribution → streaming - Failover / HA → streaming with a standby promoted on failure (Patroni manages this) - Major version upgrade (PG14 → PG16) without downtime → logical (replicate to new version, switchover) - CDC to Kafka or a data warehouse → logical (via Debezium or pg_logical) - Cross-cloud or cross-region async replica → streamingsynchronous_commit = on) means the primary waits for at least one standby to acknowledge before returning to the client. RPO = 0, but latency increases by one round-trip to the standby. For financial or audit data, sync replication is usually worth the latency cost.pg_stat_statements (extension) aggregates queries by normalized form. Query by total_exec_time (total time across all calls) to find high-impact queries, not just the slowest individual call.
sql SELECT query, calls, mean_exec_time, total_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
Step 2 — Get the plan: run EXPLAIN (ANALYZE, BUFFERS) against the query on a replica or during low traffic. Look for: - Seq Scan on large tables (missing index) - Estimated rows far from actual rows (stale statistics → run ANALYZE) - High Buffers: read (reading from disk rather than cache — I/O bound) - Sort step that could be eliminated by an index
Step 3 — Check for locks: if the query is fast in isolation but slow in production, lock contention may be the cause. sql SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE wait_event IS NOT NULL;
Step 4 — Fix without downtime: - Add an index: CREATE INDEX CONCURRENTLY — does not lock the table. - Update statistics: ANALYZE tablename — fast, concurrent. - Rewrite the query: avoid NOT IN with subqueries, use EXISTS or JOIN instead. - Fix random_page_cost if index scans are being avoided on SSD.
Step 5 — Verify: compare EXPLAIN ANALYZE before and after. Check pg_stat_statements for reduction in mean_exec_time and total_exec_time.auto_explain (extension) to log execution plans for slow queries automatically: auto_explain.log_min_duration = 1000 logs plans for any query taking over 1 second, without requiring manual EXPLAIN runs.pg_rewind).
Monitoring: replication lag in bytes and seconds, etcd health, Patroni API status, checkpoint frequency, lock wait events.synchronous_standby_names = '*' (all standbys must confirm). If any standby is lagging or unreachable, all writes block. Use ANY 1 (standby1, standby2) to require only one of the named standbys — writes continue as long as one is healthy. Design for partial standby failure, not just primary failure.ALTER TABLE ADD COLUMN col type NOT NULL DEFAULT value in PostgreSQL < 11 rewrites the entire table (ACCESS EXCLUSIVE lock for hours). In PG11+ it's instant for constant defaults — but only for simple scalar constants, not function calls.
Safe approach (all versions):
Step 1 — Add nullable column, no default, no lock: sql ALTER TABLE orders ADD COLUMN metadata jsonb; -- ACCESS EXCLUSIVE lock held for milliseconds, then released
Step 2 — Backfill in small batches (background job, not a single UPDATE): sql UPDATE orders SET metadata = '{}' WHERE id BETWEEN :lo AND :hi AND metadata IS NULL; -- Commit each batch. Keep batches small (1000–10000 rows). -- Pause between batches to let autovacuum keep up and avoid replication lag. A single UPDATE of 500M rows generates enormous WAL, pins replication lag, bloats the table, and may take hours while holding a transaction. Batching avoids all of this.
Step 3 — Add NOT NULL constraint as NOT VALID (PG12+): sql ALTER TABLE orders ADD CONSTRAINT orders_metadata_not_null
CHECK (metadata IS NOT NULL) NOT VALID;
-- Instant — does not scan existing rows.
Step 4 — Validate the constraint (concurrent, no write lock): sql ALTER TABLE orders VALIDATE CONSTRAINT orders_metadata_not_null; -- Scans the table, but holds only SHARE UPDATE EXCLUSIVE (concurrent reads/writes OK).
Step 5 — (Optional) If you truly need a NOT NULL column (PG12+), after validation you can drop the CHECK constraint and convert — or live with the check constraint, which the query planner understands as equivalent.ALTER TABLE statement that combines column add + backfill — usually discovered at 3 AM when the migration has been running for 4 hours on the wrong environment. For index creation on large tables, always use CREATE INDEX CONCURRENTLY — it takes longer but doesn't lock. The tooling ecosystem (Flyway, Liquibase, golang-migrate) does not prevent you from writing locking migrations — the discipline is the engineer's responsibility. Make zero-downtime migration patterns part of your team's standard library.total_exec_time. The order history page likely runs one or two critical queries — find them by correlation with the degradation timeline.shared_buffers and OS cache — at 300 GB it doesn't, so formerly fast Seq Scans now read from disk.pg_stat_user_tables.last_autoanalyze. If the table grew 6× without ANALYZE running frequently enough, the planner's row estimates are wildly wrong — it may be choosing a Seq Scan because it thinks there are 1M rows when there are 50M.ANALYZE orders; takes seconds, is fully concurrent, and may immediately fix the plan. The planner will see correct cardinality and switch to index scans.user_id or created_at with no covering index, add one without locking. An Index Only Scan serving the whole SELECT from the index eliminates all heap I/O.n_dead_tup and last_autovacuum. A fast-growing table with default autovacuum thresholds (20% scale factor) may have accumulated hundreds of millions of dead tuples, bloating the table and making every scan scan more pages than necessary.random_page_cost is still 4.0, the planner systematically undervalues index scans. Set to 1.1 — this alone can flip dozens of bad plans to index scans cluster-wide.pg_upgrade is fast but requires downtime (database offline during upgrade) and produces a period where rollback means restoring a backup — too slow for the rollback SLA.pg_dump --schema-only, apply to PG16) before setting up replication.pg_replication_slot_lag); (4) promote PG16 as primary; (5) flip application connection strings to PG16; (6) verify; (7) remove maintenance mode.tenant_id to every table. All 5,000 tenants share tables. Simplest operationally — one schema to migrate, one set of indexes. Requires Row-Level Security (RLS) to enforce isolation at the database level (CREATE POLICY ... USING (tenant_id = current_setting('app.tenant_id')::uuid)). Cross-tenant analytics is a single query.SET LOCAL app.tenant_id = 'xyz'). Unauthorized access returns empty results rather than an error — test explicitly that the policy is enforced. BYPASSRLS role for analytics service. Requires careful index design: every index on tenant-scoped tables must lead with tenant_id.tenant_id. Large-tenant partitions can have their own storage and autovacuum settings. Small tenants share partitions. Adds complexity but keeps the operational model unified.tenant_id. Composite indexes (tenant_id, <filter_col>, <sort_col>). Without tenant_id as the leading column, every query scans all tenants' data (Seq Scan or full index scan with filter).BYPASSRLS role or set app.tenant_id to a sentinel value that the RLS policy allows for analytics. Partition pruning can speed up per-tenant sub-aggregations.pg_partman-style batch migration). Cross-tenant analytics requires UNION ALL across schemas or a dedicated aggregation pipeline.tenant_id as the leading column in indexes — queries scan all tenants' rows, performance degrades with tenant count