Mid PostgreSQL Interview Questions

Curated mid-level PostgreSQL interview questions for developers targeting mid-level positions. 40 questions available.

PostgreSQL Interview Questions & Answers

Welcome to our comprehensive collection of PostgreSQL interview questions and answers. This page contains expertly curated interview questions covering all aspects of PostgreSQL, from fundamental concepts to advanced topics. Whether you're preparing for an entry-level position or a senior role, you'll find questions tailored to your experience level.

Our PostgreSQL interview questions are designed to help you:

  • Understand core concepts and best practices in PostgreSQL
  • Prepare for technical interviews at all experience levels
  • Master both theoretical knowledge and practical application
  • Build confidence for your next PostgreSQL interview

Each question includes detailed answers and explanations to help you understand not just what the answer is, but why it's correct. We cover topics ranging from basic PostgreSQL concepts to advanced scenarios that you might encounter in senior-level interviews.

Use the filters below to find questions by difficulty level (Entry, Junior, Mid, Senior, Expert) or focus specifically on code challenges. Each question is carefully crafted to reflect real-world interview scenarios you'll encounter at top tech companies, startups, and MNCs.

Questions

40 questions
Q1:

What is PostgreSQL’s MVCC and how does it enable concurrent reads and writes?

Mid

Answer

MVCC gives each transaction a snapshot of the database. PostgreSQL stores multiple row versions with transaction IDs, allowing readers and writers to work without blocking each other.
Quick Summary: MVCC (Multi-Version Concurrency Control) lets readers and writers proceed concurrently without blocking each other. When a row is updated, PostgreSQL creates a new version (tuple) with a new transaction ID, keeping the old version for concurrent readers. Each transaction sees a consistent snapshot of the data as of when it started. No read locks needed - readers never block writers and vice versa.
Q2:

What is a tuple in PostgreSQL?

Mid

Answer

A tuple is a row stored in a PostgreSQL table. Updates create new tuple versions while old ones become dead tuples until vacuumed.
Quick Summary: In PostgreSQL, a tuple is a row version in a table. Every UPDATE creates a new tuple (new row version) rather than modifying in place. Old tuples stay visible to concurrent transactions and are later cleaned up by VACUUM. Each tuple has system columns: xmin (transaction that created it), xmax (transaction that deleted it), ctid (physical location). Multiple tuple versions enable MVCC.
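The system columns above can be inspected directly. A minimal sketch, assuming a hypothetical `accounts` table:

```sql
-- Every table exposes hidden system columns alongside its own.
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 100);

-- xmin: XID that created this tuple version; xmax: XID that deleted it (0 if live);
-- ctid: physical location as (page, item).
SELECT ctid, xmin, xmax, * FROM accounts;

-- An UPDATE produces a new tuple version, usually at a new ctid.
UPDATE accounts SET balance = 150 WHERE id = 1;
SELECT ctid, xmin, xmax, * FROM accounts;
```

Running the two SELECTs back to back shows the ctid and xmin change after the update, making the new-version-per-update behavior visible.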
Q3:

What is a transaction ID and why is wraparound dangerous?

Mid

Answer

Each transaction gets a unique XID. Because XIDs are 32-bit, they eventually wrap around; without freezing, old committed rows could suddenly appear to be "in the future" and become invisible. VACUUM prevents wraparound by freezing tuples.
Quick Summary: Transaction ID (XID) is a 32-bit integer assigned to every transaction. PostgreSQL uses XIDs to determine tuple visibility in MVCC. Wraparound danger: after ~2 billion transactions, XID wraps around to 0. Old rows could suddenly appear to be "in the future" and become invisible - catastrophic data loss. Autovacuum prevents this by freezing old tuples (replacing real XIDs with FrozenXID that's always visible).
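You can watch how close tables are to the freeze threshold using the standard catalogs; a sketch:

```sql
-- age(relfrozenxid) = transactions elapsed since the table was last fully frozen.
-- Autovacuum forces an anti-wraparound freeze as this approaches
-- autovacuum_freeze_max_age (200 million by default).
SELECT relname,
       age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```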
Q4:

What is VACUUM FULL and how does it differ from regular VACUUM?

Mid

Answer

VACUUM FULL rewrites the table, reclaiming space and compacting rows but requires an exclusive lock. Regular VACUUM only removes dead tuples and updates visibility maps without heavy locking.
Quick Summary: Regular VACUUM marks dead tuples as reusable space without compacting the file or returning space to the OS. VACUUM FULL rewrites the entire table to a new file, reclaiming disk space (table shrinks). VACUUM FULL requires an exclusive lock (all reads and writes block during operation) - not safe on high-traffic tables. Use VACUUM FULL rarely, or use pg_repack as a non-blocking alternative.
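The difference is easy to demonstrate on a hypothetical `orders` table:

```sql
-- Regular VACUUM: marks dead tuples reusable; no exclusive lock; file size
-- generally stays the same.
VACUUM (VERBOSE) orders;

-- VACUUM FULL: rewrites the table into a new file; takes an ACCESS EXCLUSIVE
-- lock, blocking all reads and writes until it finishes.
VACUUM FULL orders;

-- Compare on-disk size before and after to see reclaimed space.
SELECT pg_size_pretty(pg_total_relation_size('orders'));
```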
Q5:

What is the visibility map in PostgreSQL?

Mid

Answer

The visibility map tracks pages containing only visible tuples, enabling index-only scans without reading table pages.
Quick Summary: The visibility map is a per-table file with one bit per page tracking if all tuples on that page are visible to all transactions. Index-only scans use the visibility map to avoid fetching the heap page to verify tuple visibility. VACUUM updates the visibility map. Pages with all-visible tuples allow index-only scans, dramatically improving read performance for covering indexes.
Q6:

What is an index-only scan?

Mid

Answer

An index-only scan satisfies a query using only index data when all required columns are in the index, avoiding table reads.
Quick Summary: Index-only scan reads all needed columns directly from the index without accessing the heap (table). Possible when: all SELECT columns are in the index, and the visibility map shows the page as all-visible (so PostgreSQL doesn't need to check heap for tuple visibility). Much faster than regular index scan (no heap fetch). Create covering indexes: include all frequently queried columns.
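A covering index makes index-only scans possible; a sketch with a hypothetical `users` table (INCLUDE requires PostgreSQL 11+):

```sql
-- INCLUDE adds payload columns to the index leaf pages so the query
-- never needs to touch the heap.
CREATE INDEX idx_users_email ON users (email) INCLUDE (full_name);

-- If the visibility map marks the pages all-visible, EXPLAIN reports
-- "Index Only Scan" for this query.
EXPLAIN SELECT email, full_name FROM users WHERE email = 'a@example.com';
```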
Q7:

What is the difference between B-Tree, GIN, and GiST indexes?

Mid

Answer

B-Tree handles equality and range queries. GIN is optimized for arrays, JSONB, and full-text search. GiST supports ranges, geometric types, and custom index logic.
Quick Summary: B-Tree: default, balanced tree, handles equality and range queries, sorted order - use for most queries. GIN (Generalized Inverted Index): multi-value items - use for JSONB, arrays, full-text search (indexes all keys/elements). GiST (Generalized Search Tree): geometric data, range types, full-text search - use for PostGIS, EXCLUDE constraints. Each handles different query patterns.
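A typical GIN use case, assuming a hypothetical `events` table with a jsonb `payload` column:

```sql
-- GIN indexes every key and value inside the jsonb document.
CREATE INDEX idx_events_payload ON events USING gin (payload);

-- The @> containment operator can use the GIN index.
SELECT * FROM events WHERE payload @> '{"type": "signup"}';
```

B-Tree would be useless here because the query filters inside the document, not on the column's whole value.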
Q8:

When would you use a BRIN index?

Mid

Answer

BRIN is ideal for very large tables where data is naturally ordered, storing summaries of value ranges to minimize storage.
Quick Summary: BRIN (Block Range INdex) stores min/max values per range of disk pages - very small index, very fast to create, low maintenance overhead. Only useful when column values correlate with physical storage order (sequential inserts of timestamps, monotonic IDs). Not selective for random data. Ideal for time-series tables where you INSERT rows in timestamp order - BRIN can eliminate large chunks of the table.
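A sketch for a hypothetical append-only `metrics` table, where rows arrive in timestamp order:

```sql
-- BRIN stores only min/max per block range: tiny index, cheap to maintain.
CREATE INDEX idx_metrics_ts ON metrics USING brin (recorded_at);

-- Effective only because inserts were roughly in recorded_at order, so each
-- block range covers a narrow time window and most ranges can be skipped.
SELECT count(*) FROM metrics
WHERE recorded_at >= now() - interval '1 day';
```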
Q9:

What is the purpose of pg_stat_activity?

Mid

Answer

pg_stat_activity displays running queries, wait events, and session details for debugging locks and performance issues.
Quick Summary: pg_stat_activity is a system view showing one row per database connection with: pid (process ID), query (SQL being run), state (active, idle, idle in transaction), wait_event (what it's waiting on), application_name, client_addr. Essential for diagnosing: blocking queries (find locks), long-running transactions, idle-in-transaction connections holding locks, and overall connection usage.
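A practical diagnostic query built on this view (pg_blocking_pids requires PostgreSQL 9.6+):

```sql
-- Active sessions with their blockers, wait state, and runtime.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,   -- PIDs holding locks we wait on
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```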
Q10:

What is a CTE and why use it?

Mid

Answer

A CTE organizes complex queries into readable blocks and supports recursion and reusable query logic.
Quick Summary: CTE (Common Table Expression) is a named temporary result set defined with WITH, usable in the main query. WITH active_users AS (SELECT * FROM users WHERE active = true) SELECT * FROM active_users WHERE age > 18. Improves readability for complex queries. Recursive CTEs traverse hierarchical data. In PostgreSQL 12+, CTEs are inlined by default (the optimizer can push conditions inside).
Q11:

What is the difference between a CTE and a subquery?

Mid

Answer

CTEs improve readability and reusability, while subqueries are nested directly. CTEs may materialize results depending on PostgreSQL version.
Quick Summary: CTE: named, reusable within a query, can be recursive, executes once and result is referenced multiple times (in some cases). Subquery: embedded in the parent query, potentially re-executed for each row (correlated subquery). Pre-PostgreSQL 12, CTEs were optimization fences (result materialized). Now both can be optimized similarly. Use CTEs for readability; subqueries when optimizer needs more flexibility.
Q12:

How do recursive CTEs work?

Mid

Answer

Recursive CTEs contain an anchor part and a recursive part that repeats until no new rows are returned.
Quick Summary: Recursive CTEs traverse hierarchical or graph-like data. Structure: WITH RECURSIVE cte AS (base_case UNION ALL recursive_case WHERE termination_condition). Base case: initial rows (root nodes). Recursive case: references the CTE to get next level. Use for: org charts (employee hierarchy), category trees, graph traversal, bill of materials. Add a depth counter or LIMIT to prevent infinite recursion.
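The structure above looks like this in practice, assuming a hypothetical `employees(id, name, manager_id)` table:

```sql
WITH RECURSIVE org AS (
    -- Base case: top-level managers (no manager).
    SELECT id, name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
  UNION ALL
    -- Recursive case: each pass pulls the direct reports of the previous level.
    SELECT e.id, e.name, e.manager_id, org.depth + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id
    WHERE org.depth < 10          -- depth guard against cycles
)
SELECT * FROM org ORDER BY depth, id;
```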
Q13:

What are dead tuples and why do they accumulate?

Mid

Answer

Dead tuples are obsolete row versions created by updates and deletes. They accumulate because PostgreSQL never overwrites rows.
Quick Summary: Dead tuples are old row versions left behind by UPDATE and DELETE operations in MVCC. When you update a row, PostgreSQL creates a new tuple version but keeps the old one visible for concurrent transactions. Once no active transaction needs the old version, it becomes a "dead tuple". Dead tuples accumulate, bloating tables and slowing queries. VACUUM removes dead tuples and marks space reusable.
Q14:

How does PostgreSQL handle deadlocks?

Mid

Answer

PostgreSQL detects cycles in lock waits and aborts one of the deadlocked transactions so the others can proceed.
Quick Summary: PostgreSQL checks for deadlocks after a transaction has waited on a lock for deadlock_timeout (1 second by default). It builds a wait-for graph and looks for cycles. When a deadlock is found, PostgreSQL aborts the transaction that detected it with a "deadlock detected" error. Prevent deadlocks: always acquire locks in a consistent order, keep transactions short, use NOWAIT or FOR UPDATE SKIP LOCKED.
Q15:

What is a tablespace?

Mid

Answer

A tablespace maps database storage to filesystem locations, enabling distribution of data across disks.
Quick Summary: Tablespace defines the physical disk location where database objects are stored. CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd'. Create tables in it: CREATE TABLE big_table (...) TABLESPACE fast_ssd. Use to: place hot tables on fast SSDs, put large archive tables on slower cheap storage, separate indexes from their tables for parallel I/O, and spread load across multiple disks.
Q16:

What are PostgreSQL extensions?

Mid

Answer

Extensions add new features such as PostGIS, pg_stat_statements, and hstore without modifying core PostgreSQL.
Quick Summary: PostgreSQL extensions add functionality without modifying the core. Install: CREATE EXTENSION extension_name. Popular extensions: pg_stat_statements (query statistics), PostGIS (geospatial), pgcrypto (encryption), uuid-ossp (UUID generation), pg_trgm (trigram similarity for LIKE queries), hstore (key-value in a column), timescaledb (time-series). Extensions ship with PostgreSQL or via PGXN.
Q17:

What is pg_stat_statements used for?

Mid

Answer

pg_stat_statements collects query execution statistics like total time and frequency, useful for performance tuning.
Quick Summary: pg_stat_statements extension tracks execution statistics for every SQL query: total execution time, number of calls, rows returned, buffer hits/misses. Query: SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC. Essential for finding which queries are consuming the most database time. Reset with pg_stat_statements_reset().
Q18:

What is WAL in PostgreSQL?

Mid

Answer

WAL logs all changes before writing data files to ensure durability and enable crash recovery.
Quick Summary: WAL (Write-Ahead Log) records every database change before applying it to data files. On crash, PostgreSQL replays WAL to recover to a consistent state. WAL also enables: replication (WAL is streamed to standbys), point-in-time recovery (restore to any past moment), and logical decoding. WAL files are in pg_wal directory. wal_level setting controls detail level.
Q19:

What is a WAL checkpoint?

Mid

Answer

A checkpoint flushes dirty pages to disk, reducing recovery time but increasing I/O load if too frequent.
Quick Summary: WAL checkpoint writes all dirty buffers (modified data pages in shared_buffers) to disk files. This ensures recovery can start from a recent point rather than replaying all WAL from the beginning. Checkpoints happen every checkpoint_timeout (5 min default) or when WAL reaches max_wal_size. Frequent checkpoints: faster recovery but more I/O. Checkpoint duration visible in pg_stat_bgwriter.
Q20:

What is WAL archiving?

Mid

Answer

WAL archiving stores old WAL segments for point-in-time recovery and disaster recovery.
Quick Summary: WAL archiving copies WAL segment files to a separate location (another disk, S3, network share) using the archive_command parameter. Used for: Point-In-Time Recovery (PITR) - restore to any moment by replaying archived WAL, and building standbys from a base backup + WAL archives. Essential for DR strategy when you need recovery beyond the current WAL on disk.
Q21:

What are HOT updates?

Mid

Answer

HOT updates modify rows without updating indexes when non-indexed columns change, reducing index bloat.
Quick Summary: HOT (Heap Only Tuple) update: when an updated row fits in the same table page AND no indexed columns changed, PostgreSQL creates a chain: old tuple points to new tuple in the same page. Index entries still point to the old tuple which redirects to new. Benefit: indexes don't need to be updated (major write amplification reduction). HOT updates only work when fillfactor leaves room in the page.
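Two knobs worth knowing here, sketched against a hypothetical `sessions` table:

```sql
-- Leave 20% free space per page so new row versions can stay on the same page.
ALTER TABLE sessions SET (fillfactor = 80);

-- Compare total vs HOT updates to see how often index writes were avoided.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'sessions';
```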
Q22:

What is index bloat and how is it fixed?

Mid

Answer

Index bloat occurs when indexes accumulate dead entries. Fix with REINDEX or VACUUM FULL.
Quick Summary: Index bloat: dead index entries accumulate after many updates/deletes - index grows without being reclaimed by regular VACUUM. Identify with pgstattuple extension or VACUUM VERBOSE. Fix without downtime: REINDEX CONCURRENTLY (PostgreSQL 12+) rebuilds the index while allowing reads and writes. Or pg_repack. Regular autovacuum helps prevent bloat but can lag on very high-write tables.
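The non-blocking rebuild looks like this (index names are hypothetical; CONCURRENTLY forms require PostgreSQL 12+):

```sql
-- Rebuild one bloated index without blocking reads or writes.
REINDEX INDEX CONCURRENTLY idx_orders_customer;

-- Or rebuild every index on a table.
REINDEX TABLE CONCURRENTLY orders;
```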
Q23:

What is logical replication?

Mid

Answer

Logical replication sends row-level changes to subscribers and supports selective table replication.
Quick Summary: Logical replication replicates data changes at the row level using logical decoding of the WAL. Subscribers receive INSERT, UPDATE, DELETE events for specific tables. Supports: replicating to different PostgreSQL versions, selective table replication, cross-version upgrades, and CDC (Change Data Capture) for downstream systems. More flexible than physical replication but can't replicate DDL automatically.
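A minimal publication/subscription setup, with hypothetical table and connection details:

```sql
-- On the publisher (requires wal_level = logical):
CREATE PUBLICATION orders_pub FOR TABLE orders, order_items;

-- On the subscriber (tables must already exist with a matching schema,
-- since DDL is not replicated):
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.example.com dbname=shop user=repl'
    PUBLICATION orders_pub;
```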
Q24:

Difference between physical and logical replication?

Mid

Answer

Physical replicates data files, while logical replicates row changes. Physical is faster; logical is more flexible.
Quick Summary: Physical replication: copies the exact binary data files (byte-for-byte replica of the primary). Fast, all objects replicated, standby must be same PostgreSQL version. Logical replication: replicates row-level changes for selected tables/publications. More flexible (different schema versions, selective tables, cross-version). Physical for HA standby; logical for migrations, selective replication, and CDC.
Q25:

What is synchronous replication?

Mid

Answer

Synchronous replication waits for confirmation from at least one replica before committing.
Quick Summary: Synchronous replication: primary waits for the standby to confirm it received and wrote the WAL before acknowledging the commit to the client (synchronous_standby_names). Guarantees zero data loss on failover (RPO=0). Higher latency per transaction. Use for financial data where data loss is unacceptable. Configure with synchronous_commit = on and a synchronous standby.
Q26:

What is asynchronous replication?

Mid

Answer

Asynchronous replication commits without waiting for replicas, improving performance at the cost of durability.
Quick Summary: Asynchronous replication: primary commits without waiting for standby acknowledgment. Lower write latency but standby may lag behind. If primary fails, recently committed transactions may not be on standby (RPO > 0). Default replication mode in PostgreSQL. Acceptable for most apps where a small amount of data loss is tolerable in exchange for lower latency.
Q27:

What is the role of autovacuum?

Mid

Answer

Autovacuum removes dead tuples, prevents wraparound, and updates visibility maps.
Quick Summary: Autovacuum is a background daemon that automatically runs VACUUM and ANALYZE when tables have enough dead tuples or changed data. Prevents XID wraparound, reclaims dead tuple space, and keeps statistics fresh. Tune with: autovacuum_vacuum_scale_factor, autovacuum_vacuum_threshold, autovacuum_vacuum_cost_delay. On high-write tables, tune per-table with ALTER TABLE SET (autovacuum_*).
Q28:

What are prepared statements?

Mid

Answer

Prepared statements compile SQL once and reuse execution plans, improving speed and preventing SQL injection.
Quick Summary: Prepared statements parse and plan a SQL query once, then execute multiple times with different parameters. PREPARE stmt AS SELECT * FROM users WHERE id = $1; EXECUTE stmt(42). Benefits: plan reuse (faster for repeated queries), protection against SQL injection (parameters are separate from query). In application code, most drivers use prepared statements automatically (parameterized queries).
Q29:

What is query planning and cost estimation?

Mid

Answer

The planner evaluates possible execution paths and selects the lowest-cost plan using statistics.
Quick Summary: PostgreSQL query planner estimates the cost of different execution strategies (seq scan, index scan, hash join, nested loop) using table statistics from pg_statistic. It chooses the plan with lowest estimated cost. Cost estimates depend on: row count estimates, column value distribution, physical I/O cost settings (seq_page_cost, random_page_cost). Stale stats or bad estimates lead to poor plan choices.
Q30:

Why does PostgreSQL choose sequential scans sometimes?

Mid

Answer

Sequential scans occur when tables are small, statistics are outdated, or index selectivity is poor.
Quick Summary: Seq scan can be faster than index scan when: the query retrieves a large fraction of the table (reading index then fetching most pages is worse than just reading pages sequentially), the table fits in shared_buffers (already cached), or random_page_cost is set high (sequential I/O is cheaper). Lower random_page_cost (SSD: 1.1 vs HDD: 4.0) encourages PostgreSQL to use indexes more.
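You can see the effect of cost settings directly; a sketch with a hypothetical `orders` table:

```sql
-- On SSDs, random page reads cost nearly the same as sequential ones;
-- lowering random_page_cost makes index scans look cheaper to the planner.
SET random_page_cost = 1.1;   -- session-level, for experimentation

-- Re-run EXPLAIN before and after to see whether the plan flips from
-- Seq Scan to Index Scan.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```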
Q31:

What is ANALYZE and why is it important?

Mid

Answer

ANALYZE updates statistics used by the optimizer. Outdated stats lead to poor performance.
Quick Summary: ANALYZE samples the table data and stores column statistics (histogram, most-common values, null fraction, correlation) in pg_statistic. The query planner uses these to estimate how many rows match a WHERE condition. Inaccurate stats (on high-change tables) lead to bad plan choices. Run manually after bulk loads; autovacuum handles routine analysis. (The visibility map is updated by VACUUM, not ANALYZE.)
Q32:

What is keyset pagination?

Mid

Answer

Keyset pagination uses indexed values to fetch the next rows efficiently, avoiding slow OFFSET scans.
Quick Summary: Keyset pagination (seek method) uses the last seen value as a WHERE filter: WHERE id > last_id LIMIT 10. Much faster than OFFSET pagination for large offsets because PostgreSQL can use an index and jump directly to the start position rather than scanning and discarding rows. Downside: you can't jump to page N directly, only go forward/backward from a cursor position.
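Side by side, with a hypothetical `posts` table (`:last_id` is a client-side bind placeholder for the last id seen on the previous page):

```sql
-- OFFSET pagination: scans and discards 100000 rows before returning 10.
SELECT * FROM posts ORDER BY id LIMIT 10 OFFSET 100000;

-- Keyset pagination: the primary-key index seeks straight to the position.
SELECT * FROM posts WHERE id > :last_id ORDER BY id LIMIT 10;
```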
Q33:

What is constraint exclusion?

Mid

Answer

Constraint exclusion allows PostgreSQL to skip irrelevant partitions in partitioned tables.
Quick Summary: Constraint exclusion allows PostgreSQL to skip partitions (child tables) that can't possibly contain rows matching the WHERE clause. For partitioned tables: SELECT * FROM measurements WHERE date > '2024-01-01' skips partitions with older data. Works by comparing WHERE conditions against partition check constraints. Modern declarative partitioning uses partition pruning (more efficient).
Q34:

What are generated columns?

Mid

Answer

Generated columns compute values from other columns, either stored physically or computed on demand.
Quick Summary: Generated columns compute their value from other columns automatically. STORED: value is computed and saved to disk on each INSERT/UPDATE (takes space but fast to read). VIRTUAL: computed on read (supported only from PostgreSQL 18; earlier versions offer STORED only). Example: full_name TEXT GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED. Always up-to-date, no trigger needed.
Q35:

Difference between BEFORE and AFTER triggers?

Mid

Answer

BEFORE triggers validate or modify data before writes. AFTER triggers run after data is written.
Quick Summary: BEFORE trigger runs before the operation - can modify the NEW row or cancel the operation (return NULL). AFTER trigger runs after the operation completes - sees the row exactly as the operation wrote it. BEFORE triggers are used for: data validation, transforming input data, setting audit fields. AFTER triggers are used for: logging changes, maintaining denormalized data, sending notifications after a successful operation.
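A classic BEFORE trigger, stamping an audit column on a hypothetical `documents` table (EXECUTE FUNCTION syntax requires PostgreSQL 11+):

```sql
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at := now();   -- BEFORE triggers may rewrite the incoming row
    RETURN NEW;                -- returning NULL would silently skip the write
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_touch
BEFORE UPDATE ON documents
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```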
Q36:

How does concurrent index creation work?

Mid

Answer

CREATE INDEX CONCURRENTLY builds indexes without blocking writes, though it takes longer to complete.
Quick Summary: CREATE INDEX CONCURRENTLY builds an index without taking a lock that blocks writes. Regular CREATE INDEX blocks writes (though reads continue) for the duration of the build. CONCURRENTLY: takes longer (two table scans), can't run inside a transaction block, and a failed build leaves behind an INVALID index that must be dropped. Best practice for production: always use CONCURRENTLY for new indexes on live tables.
Q37:

What is a partial index and when is it useful?

Mid

Answer

Partial indexes apply only to rows satisfying a condition, reducing index size and improving efficiency.
Quick Summary: Partial index indexes only rows matching a WHERE condition, making it smaller and faster to maintain. Useful when: queries consistently filter on a specific value (WHERE status = 'pending'), most of the table data is irrelevant to queries (soft-deleted rows), or you want a partial unique constraint (unique among non-deleted rows). The query must match the partial index WHERE to use it.
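Both common patterns, sketched with hypothetical `jobs` and `users` tables:

```sql
-- Index only the rows queries actually filter for: a pending-jobs queue.
CREATE INDEX idx_jobs_pending ON jobs (created_at)
WHERE status = 'pending';

-- Partial unique constraint: emails unique among non-deleted users only.
CREATE UNIQUE INDEX idx_users_email_live ON users (email)
WHERE deleted_at IS NULL;
```

A query such as `SELECT ... FROM jobs WHERE status = 'pending' ORDER BY created_at` matches the first index's predicate and can use it.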
Q38:

What is the Query Execution Engine?

Mid

Answer

The execution engine runs the query plan, performing scans, joins, filtering, and computations.
Quick Summary: The Query Execution Engine takes the planner's chosen plan and actually executes it. It uses operator nodes (SeqScan, IndexScan, HashJoin, NestLoop, Sort, Aggregate) arranged as a tree - data flows from leaf nodes up through operators to produce the final result. Rows flow one at a time through the pipeline (the Volcano/iterator model). EXPLAIN ANALYZE shows each node's estimated vs actual rows and execution time.
Q39:

What are temporary tables?

Mid

Answer

Temporary tables exist only during a session and are used for intermediate data processing.
Quick Summary: Temporary tables exist only for the duration of a session (or transaction). CREATE TEMP TABLE temp_data (...). Automatically dropped when session ends. Not visible to other sessions. Useful for: staging intermediate results, breaking complex queries into steps, storing data for batch processing. Each session gets its own temp table namespace - no naming conflicts. Temp tables can have indexes.
Q40:

What causes buffer cache misses and how do you prevent them?

Mid

Answer

Buffer misses occur when needed pages are not in memory. Prevent with tuning shared_buffers, indexing correctly, and reducing full scans.
Quick Summary: Buffer cache miss: requested page is not in shared_buffers, must read from disk (expensive). Prevent by: sizing shared_buffers to fit your working set (start at 25% of RAM), using pg_prewarm to load critical tables/indexes on startup, keeping hot data small (archive old data), and monitoring with pg_statio_user_tables (heap_blks_hit vs heap_blks_read ratio - aim for >99%).
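The monitoring and prewarming steps above can be sketched as (index name hypothetical):

```sql
-- Cache hit ratio per table; aim for well above 99% on hot tables.
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(100.0 * heap_blks_hit
             / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;

-- Warm a critical index into shared_buffers after a restart.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('idx_users_email');
```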
