Amazon MongoDB Interview Questions

Curated, Amazon-level MongoDB interview questions for developers targeting positions at Amazon. 125 questions available.

MongoDB Interview Questions & Answers

Welcome to our comprehensive collection of MongoDB interview questions and answers. This page contains expertly curated interview questions covering all aspects of MongoDB, from fundamental concepts to advanced topics. Whether you're preparing for an entry-level position or a senior role, you'll find questions tailored to your experience level.

Our MongoDB interview questions are designed to help you:

  • Understand core concepts and best practices in MongoDB
  • Prepare for technical interviews at all experience levels
  • Master both theoretical knowledge and practical application
  • Build confidence for your next MongoDB interview

Each question includes detailed answers and explanations to help you understand not just what the answer is, but why it's correct. We cover topics ranging from basic MongoDB concepts to advanced scenarios that you might encounter in senior-level interviews.

Use the filters below to find questions by difficulty level (Entry, Junior, Mid, Senior, Expert) or focus specifically on code challenges. Each question is carefully crafted to reflect real-world interview scenarios you'll encounter at top tech companies, startups, and MNCs.

Questions (125)
Q1:

What is MongoDB?

Entry

Answer

MongoDB is a NoSQL, document-oriented database that stores data as JSON-like documents. It is schema-flexible and designed for scalable modern applications.
Quick Summary: MongoDB is a NoSQL document database that stores data as JSON-like documents (BSON). Unlike relational databases, there are no tables with fixed schemas - each document can have different fields. It's designed for flexibility, horizontal scaling, and developer productivity. Used widely for catalogs, user profiles, content, and real-time analytics.
Q2:

What is a document in MongoDB?

Entry

Answer

A document is a JSON-like object containing key-value pairs. It is the basic unit of data stored inside collections in MongoDB.
Quick Summary: A document is a single record in MongoDB stored as BSON (Binary JSON). It contains key-value pairs like a JSON object. Documents can have nested objects and arrays. Example: a user document might have name, email, address (nested object), and orders (array of objects) - all in one document. Maximum document size is 16MB.
Q3:

What is a collection?

Entry

Answer

A collection is a group of documents similar to a table in relational databases but without a fixed schema.
Quick Summary: A collection is a group of documents in MongoDB - roughly equivalent to a table in SQL. But unlike SQL tables, collections have no enforced schema by default - documents in the same collection can have different fields. Collections are created automatically when you insert the first document. You query and index at the collection level.
Q4:

What is a database in MongoDB?

Entry

Answer

A database is a container for collections. Each application typically uses one or more databases within the MongoDB server.
Quick Summary: A database in MongoDB is a container for collections. One MongoDB server can host multiple databases. Each database has its own set of files on disk. You switch databases with "use dbname". Common practice: one database per application. Unlike SQL, creating a database just requires inserting data - no explicit CREATE DATABASE needed.
Q5:

What is BSON?

Entry

Answer

BSON is a binary format used by MongoDB to store documents. It supports more data types than JSON, such as Date and ObjectId.
Quick Summary: BSON (Binary JSON) is the binary format MongoDB uses to store documents. It extends JSON with additional types: Date, ObjectId, Binary data, 32/64-bit integers, Decimal128. BSON is faster to encode/decode than JSON and supports more data types. When you work with MongoDB drivers, you use JSON-like syntax but the data is stored as BSON internally.
Q6:

What is an ObjectId?

Entry

Answer

ObjectId is the default unique identifier for documents. It includes timestamp and machine-specific information to ensure global uniqueness.
Quick Summary: ObjectId is MongoDB's default primary key type - a 12-byte unique identifier automatically generated for the _id field. It encodes: 4-byte timestamp, 5-byte random value (unique per machine/process), 3-byte incrementing counter. This makes ObjectIds roughly sortable by creation time, unique across distributed systems, and generated client-side without DB round-trips.
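Because the first 4 bytes are a timestamp, you can recover a document's approximate creation time from its _id alone. A minimal Node.js sketch (the ObjectId hex string here is a hypothetical example value):

```javascript
// Decode the creation timestamp embedded in a MongoDB ObjectId.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
function objectIdTimestamp(hexId) {
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// Hypothetical ObjectId value:
const created = objectIdTimestamp("507f1f77bcf86cd799439011");
console.log(created.toISOString()); // the creation time encoded in the id
```

This is why sorting on _id roughly sorts by insertion time without any extra index.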
Q7:

What is a schema in MongoDB?

Entry

Answer

MongoDB is schema-flexible, allowing documents with different structures. Schema rules can be enforced using validators when needed.
Quick Summary: MongoDB is schemaless by default - no schema definition required. But you can enforce structure using Schema Validation (JSON Schema rules defined on the collection). This lets you have flexible schemas during development but add validation rules as the app matures. Most applications use Mongoose (Node.js) or similar ODM to define schemas at the application layer.
Q8:

What is the purpose of the find() method?

Entry

Answer

The find() method retrieves documents based on filters and supports projection, sorting, and pagination.
Quick Summary: find() queries a collection and returns a cursor of matching documents. Usage: db.users.find({age: {$gt: 18}}). The cursor is lazy - documents are fetched in batches as you iterate. You can chain .sort(), .limit(), .skip(), .project() to shape the results. Without arguments, find() returns all documents in the collection.
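Conceptually, the chained cursor helpers behave like filtering, sorting, and slicing a result list. A plain JavaScript sketch of that shaping (the data is hypothetical; a real cursor fetches lazily in batches rather than materializing everything):

```javascript
// In-memory sketch of how find().sort().skip().limit() shape results.
const users = [
  { name: "Ana", age: 31 }, { name: "Bo", age: 25 },
  { name: "Cy", age: 42 }, { name: "Di", age: 19 },
];

const page = users
  .filter(u => u.age > 18)        // find({age: {$gt: 18}})
  .sort((a, b) => b.age - a.age)  // .sort({age: -1})
  .slice(1, 1 + 2);               // .skip(1).limit(2)

console.log(page.map(u => u.name)); // second and third oldest matches
```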
Q9:

What is the difference between find() and findOne()?

Entry

Answer

find() returns multiple documents as a cursor, while findOne() returns only the first matching document.
Quick Summary: find() returns a cursor with all matching documents - you iterate through them. findOne() returns the first matching document directly (not a cursor), or null if none found. Use findOne() when you only need one result and don't want to deal with cursor iteration. It's slightly more efficient when you genuinely only need one document.
Q10:

What does the updateOne() function do?

Entry

Answer

updateOne() updates the first matching document using operators like $set, $inc, or $push.
Quick Summary: updateOne() updates the first document matching a filter. Takes two args: filter (which docs to match) and update (what to change). Use $set to change specific fields without replacing the whole document. Returns an object with matchedCount and modifiedCount. If you want to update all matching documents, use updateMany() instead.
Q11:

What is a deleteOne() operation?

Entry

Answer

deleteOne() removes the first document matching the filter condition.
Quick Summary: deleteOne() removes the first document matching a filter. db.users.deleteOne({_id: id}) deletes exactly one user. Returns deletedCount. If multiple documents match the filter, only the first found is deleted. For deleting all matching documents, use deleteMany(). Always double-check your filter before running delete operations in production.
Q12:

What is field projection in MongoDB?

Entry

Answer

Projection specifies which fields to include or exclude when fetching documents, improving efficiency.
Quick Summary: Field projection controls which fields are returned in query results - reduces data transfer and memory usage. In find(), the second argument is the projection: {name: 1, email: 1} returns only name and email. {password: 0} excludes the password field. You can't mix include and exclude in the same projection (except for _id which can always be excluded).
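An inclusion projection effectively "picks" fields server-side. A small sketch of the semantics in plain JavaScript (the document and helper name are hypothetical):

```javascript
// Sketch of an inclusion projection: keep only listed fields, plus _id.
function project(doc, fields) {
  const out = { _id: doc._id }; // _id is included unless explicitly excluded
  for (const f of fields) if (f in doc) out[f] = doc[f];
  return out;
}

const user = { _id: 1, name: "Alice", email: "a@x.io", password: "secret" };
console.log(project(user, ["name", "email"])); // password never leaves the server
```

In real usage this is `db.users.find({}, {name: 1, email: 1})` - the trimming happens in the database, so less data crosses the network.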
Q13:

What is an index in MongoDB?

Entry

Answer

An index improves search performance on fields. Without indexes, MongoDB performs collection scans.
Quick Summary: An index in MongoDB is a data structure (B-tree) that speeds up queries by allowing MongoDB to find documents without scanning the entire collection. Without an index, every query does a full collection scan (COLLSCAN). Create an index on frequently queried fields: db.users.createIndex({email: 1}). Too many indexes slow down writes.
Q14:

What is a primary key in MongoDB?

Entry

Answer

Every document has a unique _id field, which acts as the primary key. MongoDB generates an ObjectId if not provided.
Quick Summary: Every MongoDB document has an _id field that serves as the primary key - it must be unique within the collection. By default, MongoDB auto-generates an ObjectId for _id. You can provide your own _id value (string, int, etc.) but it must be unique. The _id field is always indexed automatically.
Q15:

What is a replica set?

Entry

Answer

A replica set is a group of MongoDB servers with redundancy and automatic failover, consisting of one primary and multiple secondaries.
Quick Summary: A replica set is a group of MongoDB servers that maintain the same dataset. One is the primary (handles writes), the rest are secondaries (replicate from primary, can serve reads). If the primary fails, secondaries elect a new primary automatically (failover). Provides high availability and data redundancy. Minimum 3 nodes recommended for proper elections.
Q16:

What is sharding in MongoDB?

Entry

Answer

Sharding distributes large datasets across multiple servers for horizontal scaling.
Quick Summary: Sharding distributes data across multiple servers (shards) to handle datasets too large for one machine or write throughput too high for one server. Each shard holds a subset of the data determined by the shard key. A mongos router directs queries to the right shard(s). Config servers store the metadata about which chunks live on which shard.
Q17:

What is MongoDB Atlas?

Entry

Answer

MongoDB Atlas is the fully managed cloud service for MongoDB, providing automated scaling, backups, and monitoring.
Quick Summary: MongoDB Atlas is MongoDB's fully managed cloud database service. It runs on AWS, Azure, or GCP. Atlas handles provisioning, backups, monitoring, scaling, security patches, and upgrades automatically. Provides Atlas Search (full-text), Atlas Data Lake, Atlas Charts, and online archive. It's the recommended way to run MongoDB in production - no ops overhead.
Q18:

What is the difference between MongoDB and a relational database?

Entry

Answer

MongoDB stores flexible JSON-like documents, while relational databases use structured tables and predefined schemas.
Quick Summary: MongoDB vs relational: MongoDB stores data as flexible documents (no fixed schema), relational uses tables with fixed columns. MongoDB doesn't support joins natively (use $lookup or embed data). MongoDB scales horizontally via sharding; relational typically scales vertically. Relational is better for complex transactions and structured data. MongoDB wins for flexible, hierarchical, and rapidly evolving schemas.
Q19:

What is the purpose of the $set operator?

Entry

Answer

$set updates or adds fields without replacing the entire document.
Quick Summary: $set updates specific fields of a document without replacing the whole thing. db.users.updateOne({_id: id}, {$set: {name: "Alice", age: 30}}). Only the specified fields change; other fields stay intact. Passing a plain document instead (via replaceOne() or the legacy update()) replaces the entire document, losing all other fields; modern updateOne() rejects update documents that lack atomic operators. Always use $set for partial updates.
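The difference between a $set update and a whole-document replacement can be sketched with plain objects (the helper names and data here are hypothetical, illustrating the server-side semantics):

```javascript
// Why $set matters: partial update vs whole-document replacement.
function applySet(doc, fields) {          // like updateOne(filter, {$set: fields})
  return { ...doc, ...fields };
}
function applyReplace(doc, replacement) { // like replaceOne(filter, replacement)
  return { _id: doc._id, ...replacement };
}

const user = { _id: 1, name: "Alice", age: 29, role: "admin" };
console.log(applySet(user, { age: 30 }));     // name and role survive
console.log(applyReplace(user, { age: 30 })); // name and role are gone
```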
Q20:

What does the $inc operator do?

Entry

Answer

$inc increases or decreases numeric values atomically. Useful for counters or scores.
Quick Summary: $inc atomically increments (or decrements) a numeric field by the given amount. db.products.updateOne({_id: id}, {$inc: {stock: -1, views: 1}}). Decrements if the value is negative. Atomic - safe for concurrent updates (no read-modify-write race condition). Commonly used for counters, inventory tracking, and vote counts.
Q21:

What is a capped collection and when should it be used?

Junior

Answer

A capped collection is a fixed-size collection where MongoDB overwrites old documents when full. It maintains insertion order and supports high-speed writes, useful for logs and metrics.
Quick Summary: Capped collections have a fixed maximum size (in bytes) and optionally a max document count. When full, oldest documents are automatically overwritten by new ones (circular buffer). No deletes needed. Use for: logs, event streams, caches where only recent data matters. Insert order is maintained. Downside: can't delete individual documents, limited update operations.
Q22:

What is the difference between $push and $addToSet?

Junior

Answer

$push adds an element to an array even if it already exists. $addToSet adds it only if it is not present, preventing duplicates.
Quick Summary: $push appends a value to an array even if it already exists - can create duplicates. $addToSet adds a value only if it doesn't already exist in the array - like a set in math. Use $addToSet when maintaining unique values (tags, categories, user IDs). Use $push when order matters or duplicates are allowed (event log entries).
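The two behaviors can be sketched as plain array operations (helper names are hypothetical; on the server both are atomic single-document updates):

```javascript
// Sketch of $push vs $addToSet semantics on an array field.
function push(arr, value) {               // $push: always appends
  return [...arr, value];
}
function addToSet(arr, value) {           // $addToSet: appends only if absent
  return arr.includes(value) ? arr : [...arr, value];
}

const tags = ["mongodb", "nosql"];
console.log(push(tags, "nosql"));     // duplicate allowed
console.log(addToSet(tags, "nosql")); // unchanged
console.log(addToSet(tags, "bson"));  // appended
```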
Q23:

What is an embedded document and when is embedding recommended?

Junior

Answer

An embedded document stores related data inside a parent document. Embedding improves read performance and is recommended for one-to-few relationships.
Quick Summary: Embedded documents store related data together in one document (address inside a user doc). Recommended when data is accessed together, relationship is one-to-one or one-to-few, and child data doesn't grow unboundedly. Referencing stores the related document's _id and uses $lookup for joins. Use referencing for many-to-many, frequently changing data, or data shared across documents.
Q24:

What is data referencing in MongoDB?

Junior

Answer

Referencing links documents across collections using IDs. It is used when datasets are large, loosely connected, or when avoiding duplication.
Quick Summary: Data referencing stores the _id of a related document instead of embedding the data. Like a foreign key in SQL. Used when: data is large, shared across many documents, or independently accessed. Requires a separate query or $lookup to fetch the referenced data. Trade-off: two queries or slower $lookup vs embedded doc simplicity.
Q25:

What is the purpose of the aggregation pipeline?

Junior

Answer

The aggregation pipeline processes documents through stages such as $match, $group, $project, and $lookup for analytics and transformations.
Quick Summary: The aggregation pipeline processes documents through a series of stages to transform and analyze data. Common stages: $match (filter), $group (aggregate by field), $sort, $project (reshape), $lookup (join), $unwind (flatten arrays), $limit, $skip. Each stage passes its output to the next. More powerful than find() for analytics and data transformation.
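A two-stage pipeline - $match then $group with $sum - can be sketched in plain JavaScript to show what each stage does (the orders data is hypothetical):

```javascript
// In-memory sketch of: [{$match: {status: "paid"}},
//                       {$group: {_id: "$userId", spent: {$sum: "$total"}}}]
const orders = [
  { status: "paid",    userId: "u1", total: 40 },
  { status: "paid",    userId: "u2", total: 25 },
  { status: "pending", userId: "u1", total: 99 },
  { status: "paid",    userId: "u1", total: 10 },
];

const spent = {};
for (const o of orders.filter(o => o.status === "paid")) { // $match
  spent[o.userId] = (spent[o.userId] ?? 0) + o.total;      // $group + $sum
}
console.log(spent); // total paid per user
```

Each stage's output feeds the next, so $match early in the pipeline means later stages process fewer documents.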
Q26:

What is $lookup used for?

Junior

Answer

$lookup performs a left outer join between collections, enriching documents with related data.
Quick Summary: $lookup performs a left outer join between collections in an aggregation pipeline. It matches documents from the "from" collection based on a localField/foreignField pair and adds matched docs as an array in the output. Similar to SQL JOIN. Performance tip: $lookup is expensive - consider embedding if you always access data together.
Q27:

What is the difference between insertOne and insertMany?

Junior

Answer

insertOne inserts a single document. insertMany inserts multiple documents in one operation and improves performance.
Quick Summary: insertOne() inserts a single document and returns the inserted document's _id. insertMany() inserts an array of documents in one operation - faster than calling insertOne() in a loop (one network round-trip). insertMany() by default stops on first error (ordered mode). Set {ordered: false} to continue inserting remaining documents even if some fail.
Q28:

What is the purpose of TTL indexes?

Junior

Answer

TTL indexes automatically delete documents after a specified time, useful for sessions, logs, and temporary data.
Quick Summary: TTL (Time To Live) indexes automatically delete documents after a specified number of seconds. Created with expireAfterSeconds: db.sessions.createIndex({createdAt: 1}, {expireAfterSeconds: 3600}) deletes documents after 1 hour. MongoDB runs a background cleanup process every 60 seconds. Use for: sessions, cache entries, temporary data, audit logs with retention policies.
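As a mongosh sketch (the collection and field names here are hypothetical):

```javascript
// mongosh: expire session documents one hour after their createdAt time.
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })

// Documents must store the indexed field as a BSON Date for TTL to apply:
db.sessions.insertOne({ userId: "u1", createdAt: new Date() })
```

Note that deletion is not instant - documents become eligible after the interval and are removed by the periodic background sweep.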
Q29:

What is the explain function and why is it useful?

Junior

Answer

explain() shows how a query is executed, including index usage and performance details. It helps diagnose slow queries.
Quick Summary: explain() shows how MongoDB executes a query - which index was used (IXSCAN vs COLLSCAN), how many documents were examined, execution time, and query plan. Use explain("executionStats") for detailed stats. Essential for performance debugging - if you see COLLSCAN on a frequently run query, you need an index. Always run explain() on new queries in development.
Q30:

What is a write concern?

Junior

Answer

Write concern defines how strictly MongoDB should confirm a write, ranging from w:1 (primary only) to w:majority for higher durability.
Quick Summary: Write concern controls how many replica set members must acknowledge a write before MongoDB considers it successful. w:1: only the primary acknowledges (the default before MongoDB 5.0). w:majority: a majority of members must acknowledge - safer, slower, and the default since MongoDB 5.0. w:0: fire and forget. Higher write concern = stronger durability guarantee but higher latency. Choose based on your data loss tolerance.
Q31:

What is a read preference?

Junior

Answer

Read preference decides which nodes serve read requests, such as primary, secondary, or nearest, enabling load balancing.
Quick Summary: Read preference controls which replica set member handles read operations. primary: all reads from primary (consistent, default). primaryPreferred: primary if available, else secondary. secondary: always read from secondaries (may be slightly stale). secondaryPreferred: secondaries when available. nearest: lowest network latency. Use secondaries to distribute read load but accept eventual consistency.
Q32:

What is journaling in MongoDB?

Junior

Answer

Journaling writes operations to a journal file before applying them to data files, preventing data loss in crashes.
Quick Summary: Journaling writes every write operation to an on-disk journal (write-ahead log) before applying it to data files. If MongoDB crashes mid-write, it replays the journal on restart to recover to a consistent state. Journaling is enabled by default in modern MongoDB. Without journaling, a crash between the write and fsync can corrupt data files.
Q33:

What is $regex used for?

Junior

Answer

$regex performs pattern matching on string fields and is useful for partial text searches.
Quick Summary: $regex filters documents where a string field matches a regular expression. db.users.find({name: {$regex: "^alice", $options: "i"}}) finds users whose name starts with "alice" (case-insensitive). Performance warning: regex queries without a text index or leading wildcard can't use indexes and cause full collection scans. Anchor patterns to the start (^) when possible.
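The matching semantics are ordinary regular expressions; {$regex: "^alice", $options: "i"} behaves like this plain JavaScript check (sample strings are hypothetical):

```javascript
// Same semantics as {name: {$regex: "^alice", $options: "i"}}.
const pattern = /^alice/i;

console.log(pattern.test("Alice Smith")); // true  - anchored prefix matches
console.log(pattern.test("malice"));      // false - substring elsewhere does not
```

The ^ anchor is what lets MongoDB walk the index like a range scan instead of testing every document.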
Q34:

What is the difference between save and update?

Junior

Answer

save() replaces an entire document if one with the same _id exists, or inserts it if not. update() modifies only the specified fields using update operators.
Quick Summary: save() is a legacy method - deprecated since MongoDB 4.2 and not available in mongosh. Previously: if the document had an _id matching an existing document, it replaced the whole document; otherwise it inserted. update() (now updateOne/updateMany) modifies specific fields. Always use insertOne/updateOne/replaceOne explicitly - they're clearer about intent and safer than the old save(), which could silently replace entire documents.
Q35:

What is sharding key selection and why is it important?

Junior

Answer

A good sharding key ensures balanced data distribution, high cardinality, and avoids write hotspots, which affects scaling performance.
Quick Summary: The shard key determines how data is distributed across shards. A good shard key has high cardinality (many distinct values), even write distribution (avoid hotspots), and is included in most queries. Bad choices: monotonically increasing keys (like timestamps or ObjectId) cause all writes to go to one shard. Hash sharding distributes ObjectIds evenly across shards.
Q36:

How does MongoDB handle schema flexibility while still allowing schema validation?

Mid

Answer

MongoDB is schema-flexible but supports validation using $jsonSchema. This allows flexible documents while enforcing structure for critical fields.
Quick Summary: MongoDB supports schema flexibility by default but lets you add validation via JSON Schema rules on a collection. You specify required fields, field types, value ranges, and patterns. Validation happens on insert and update. Use validationAction: "warn" during migration (logs violations without rejecting) or "error" to enforce strictly. This balances flexibility with data integrity.
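A mongosh sketch of attaching a validator (the collection name and rules here are hypothetical):

```javascript
// mongosh: attach a JSON Schema validator to a collection.
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["email", "age"],
      properties: {
        email: { bsonType: "string", pattern: "^.+@.+$" },
        age: { bsonType: "int", minimum: 0 }
      }
    }
  },
  validationAction: "error" // "warn" logs violations without rejecting writes
})
```

Inserts and updates that violate the rules are rejected (or logged, with "warn"), while all other fields remain schema-flexible.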
Q37:

What are the main differences between embedding and referencing in MongoDB?

Mid

Answer

Embedding stores related data in one document for fast reads, while referencing links documents across collections to reduce duplication and document size.
Quick Summary: Embedding: store related data in one document. Pro: one read, atomic updates, no joins. Con: document size limit, data duplication. Referencing: store _id, fetch separately. Pro: no duplication, smaller documents, shared data. Con: requires extra query or $lookup. Rule: embed when data is accessed together and is one-to-few. Reference when data is shared, large, or frequently updated independently.
Q38:

How do compound indexes improve query performance?

Mid

Answer

Compound indexes index multiple fields together, allowing MongoDB to speed up queries and sorting based on index prefix rules.
Quick Summary: Compound indexes cover multiple fields in a specific order. db.orders.createIndex({userId: 1, createdAt: -1}) supports queries filtering by userId and sorting by createdAt descending. This is much faster than two separate indexes because MongoDB traverses one B-tree. The order of fields in the index matters - place equality fields first, then sort fields, then range fields.
Q39:

What is an index prefix and why does it matter in compound indexing?

Mid

Answer

MongoDB can only use the initial fields of a compound index. If a query does not include the prefix field, the index cannot be used.
Quick Summary: Index prefix means a compound index {a, b, c} supports queries on {a}, {a, b}, or {a, b, c} but NOT on {b} or {c} alone. MongoDB can only use a compound index from the leftmost field forward. If you frequently query by {b} alone, you need a separate index. Designing indexes with the right field order avoids creating redundant indexes.
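A mongosh sketch of the prefix rule on a compound index (index and field names are hypothetical):

```javascript
// mongosh: a compound index and which queries can use it.
db.orders.createIndex({ userId: 1, status: 1, createdAt: -1 })

db.orders.find({ userId: 7 })                 // uses index (prefix {userId})
db.orders.find({ userId: 7, status: "paid" }) // uses index ({userId, status})
db.orders.find({ status: "paid" })            // cannot use it - no leftmost prefix
```

Confirm with explain(): the third query shows COLLSCAN unless a separate index on status exists.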
Q40:

What is the purpose of an aggregation pipeline’s $facet stage?

Mid

Answer

$facet allows running multiple aggregations in parallel on the same input, useful for dashboards requiring different metrics from one dataset.
Quick Summary: $facet runs multiple sub-pipelines on the same input in parallel, each producing a different result in the output. Useful for building faceted search results - one sub-pipeline counts by category, another by price range, another for the actual results. All computed in one aggregation pass instead of multiple queries.
Q41:

What is $unwind and why is it used?

Mid

Answer

$unwind expands array fields into multiple documents so pipeline stages can analyze individual elements.
Quick Summary: $unwind deconstructs an array field - for each element in the array, it outputs a separate document. Example: a product document with a sizes array [S, M, L] becomes three documents, one per size. Necessary when you want to group, filter, or sort by individual array elements in an aggregation pipeline.
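The transformation is essentially a flat-map over the array field, sketched here in plain JavaScript (the product document is hypothetical):

```javascript
// Sketch of {$unwind: "$sizes"}: one output document per array element.
const product = { _id: 1, name: "Tee", sizes: ["S", "M", "L"] };

const unwound = product.sizes.map(size => ({
  _id: product._id,
  name: product.name,
  sizes: size, // the array field is replaced by a single element
}));
console.log(unwound.length); // three documents, one per size
```

After unwinding, a $group stage can count or aggregate by individual size values.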
Q42:

What is a covered query in MongoDB?

Mid

Answer

A covered query is satisfied entirely from an index without touching the collection, improving performance by reducing disk access.
Quick Summary: A covered query is satisfied entirely by the index - MongoDB never reads the actual documents. This is the fastest possible query execution. For a query to be covered: all queried fields must be in the index, all projected fields must be in the index, and _id must be excluded from the projection (unless also in the index). Verify with explain() - look for "totalDocsExamined: 0".
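A mongosh sketch of constructing a covered query (index and field names are hypothetical):

```javascript
// mongosh: make a query fully covered by an index.
db.users.createIndex({ email: 1, name: 1 })

// Filter and projection use only indexed fields, and _id is excluded:
db.users.find({ email: "a@x.io" }, { _id: 0, email: 1, name: 1 })
  .explain("executionStats")
// A covered plan reports executionStats.totalDocsExamined: 0.
```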
Q43:

What is index cardinality and how does it affect performance?

Mid

Answer

Higher cardinality means more unique values, making indexes more selective and improving query performance.
Quick Summary: Index cardinality is the number of distinct values an indexed field has. High cardinality (user email, userId) = index is very selective = fast queries. Low cardinality (boolean, status with 3 values) = index is not selective = MongoDB may skip it and prefer a collection scan. Always index high-cardinality fields. Low-cardinality fields work better as second fields in compound indexes.
Q44:

What are multi-key indexes?

Mid

Answer

Multi-key indexes allow indexing array fields by indexing each element, enabling fast queries over arrays.
Quick Summary: Multi-key indexes are created on array fields. MongoDB creates an index entry for every element in the array. This allows efficient queries on array contents: find all users where tags contains "mongodb". MongoDB automatically detects and creates a multi-key index when you index an array field. Limitation: a compound index can have at most one multi-key field.
Q45:

What is the difference between $in and $nin in performance?

Mid

Answer

$in can use indexes efficiently while $nin generally causes collection scans because it excludes values.
Quick Summary: $in queries documents where a field value is in a provided array. MongoDB uses the index efficiently if the array is small. $nin is "not in" - much slower because it can't use indexes effectively for negative conditions (has to scan all non-matching values). Avoid $nin on large collections. Use a whitelist ($in) approach instead of blacklist ($nin) when possible.
Q46:

What is write concern and why is it important?

Mid

Answer

Write concern specifies how many nodes must acknowledge a write. Higher levels improve durability but increase latency.
Quick Summary: Write concern defines durability guarantees. w:1: primary wrote to memory. w:majority: majority of replica set persisted the write - survives primary failure. j:true: write is persisted to journal before acknowledging (survives crashes). For financial data or anything you can't lose: use {w: "majority", j: true}. Higher concern = higher latency.
Q47:

What is read concern in MongoDB?

Mid

Answer

Read concern determines the consistency level of reads, such as local, majority, or snapshot for transactions.
Quick Summary: Read concern controls data freshness and isolation for read operations. local: returns data that may not be majority-committed (default). majority: returns data acknowledged by majority of replicas - won't roll back. linearizable: guarantees reading the most recent majority-committed data (slowest). snapshot: consistent point-in-time view for transactions. Choose based on consistency requirements.
Q48:

How does MongoDB ensure durability during crashes?

Mid

Answer

MongoDB uses journaling to write operations to journal files before applying them, ensuring recovery after crashes.
Quick Summary: MongoDB ensures durability through: journaling (write-ahead log persists writes before applying), WiredTiger checkpoints (periodic full data snapshots), and replica set replication (data copied to multiple nodes). On crash, MongoDB replays the journal from the last checkpoint to restore to a consistent state. With write concern majority + journaling enabled, committed writes survive node failures.
Q49:

What are write-ahead logs (journal files) and how do they work?

Mid

Answer

Journal files store operations sequentially for atomicity and crash recovery. MongoDB replays journals after restarts.
Quick Summary: Write-ahead logging (WAL / journal): before MongoDB applies any data change, it first writes the operation to the journal file on disk. If the server crashes mid-write, MongoDB replays the journal on startup to complete or roll back the incomplete operation. This ensures the data files are never left in a partially-written, inconsistent state after a crash.
Q50:

What is a MongoDB transaction and when is it needed?

Mid

Answer

MongoDB transactions allow multi-document ACID operations, needed when updating related data across collections.
Quick Summary: MongoDB transactions provide ACID guarantees across multiple documents and collections (since MongoDB 4.0 for replica sets, 4.2 for sharded clusters). Use when you need to update multiple documents atomically - e.g., transfer money between two accounts. Transactions have performance overhead - they hold locks and use snapshot isolation. Design schemas to minimize transaction needs.
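The classic transfer case can be sketched with the Node.js driver's session.withTransaction() helper (connection handling, database, and collection names are hypothetical; running this requires a replica set):

```javascript
// Node.js driver sketch: move funds between two accounts atomically.
const { MongoClient } = require("mongodb");

async function transfer(client, fromId, toId, amount) {
  const accounts = client.db("bank").collection("accounts");
  const session = client.startSession();
  try {
    // withTransaction retries on transient errors and commits on success.
    await session.withTransaction(async () => {
      await accounts.updateOne(
        { _id: fromId }, { $inc: { balance: -amount } }, { session });
      await accounts.updateOne(
        { _id: toId }, { $inc: { balance: amount } }, { session });
    });
  } finally {
    await session.endSession();
  }
}
```

Both updates commit or neither does; without the session option on each operation, the writes would run outside the transaction.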
Q51:

What is $merge used for?

Mid

Answer

$merge writes aggregation results into a target collection, supporting upserts and replacements useful for ETL workflows.
Quick Summary: $merge writes aggregation pipeline results to a collection (either inserting or merging into existing documents). More flexible than $out (which replaces the whole collection). You can specify what to do when a matching document exists: replace, merge, keep existing, fail, or run a custom pipeline. Use for building materialized views or pre-aggregated reports.
Q52:

What challenges arise when using transactions?

Mid

Answer

Transactions add latency, reduce concurrency, and require replica set or sharded clusters. They must be used sparingly.
Quick Summary: Challenges with MongoDB transactions: performance overhead (locks held, snapshot maintained for duration), limited to 60 seconds by default, increased conflict and abort rate under high concurrency, cross-shard transactions add latency. Best practice: keep transactions short, minimize documents touched, prefer schema design that avoids transactions (embedding, atomic update operators).
Q53:

How does sharding work in MongoDB?

Mid

Answer

Sharding distributes data across shards based on a shard key. mongos routes queries and config servers store metadata.
Quick Summary: Sharding distributes collection data across shards based on the shard key. MongoDB splits data into chunks (default 128MB ranges). The config server replica set stores the chunk-to-shard mapping. mongos routers use this map to direct queries to the right shard(s). A query on the shard key hits one shard; a query without it hits all shards (scatter-gather).
Q54:

What is the role of the mongos router?

Mid

Answer

mongos routes application queries to the correct shards and abstracts the distributed cluster from clients.
Quick Summary: mongos is the routing layer for a sharded cluster. Client applications connect to mongos (not directly to shards). mongos queries the config servers for the chunk map, determines which shard(s) hold the relevant data, fans out queries to those shards, merges results, and returns to the client. It's stateless and you can run multiple mongos instances for high availability.
Q55:

What makes a good shard key?

Mid

Answer

A good shard key must offer high cardinality, distribute writes evenly, and match query patterns to avoid hotspots.
Quick Summary: A good shard key: high cardinality (many distinct values), writes distributed across all shards (no hotspot), frequently appears in queries (query isolation to one shard), not monotonically increasing. Hash shard keys distribute writes evenly but lose range query efficiency. Compound shard keys can balance writes and query isolation. Bad key = uneven distribution = one shard gets all the load.
Q56:

What are chunk migrations in MongoDB?

Mid

Answer

Chunks are ranges of shard key values that move between shards to balance data. The balancer manages migrations.
Quick Summary: When data becomes unevenly distributed across shards, the balancer moves chunks between shards to rebalance. A chunk migration copies the chunk data from source to destination shard, then updates the config server routing table, then removes the data from the source. Migrations happen in the background but consume I/O and network bandwidth - can impact performance.
Q57:

What is the purpose of the balancer?

Mid

Answer

The balancer ensures even data distribution across shards by moving chunks when imbalance occurs.
Quick Summary: The balancer is a background process that ensures chunks are distributed evenly across shards. When shard chunk counts are imbalanced (difference exceeds a threshold), the balancer migrates chunks from the most-loaded shard to the least-loaded. You can schedule balancing windows to avoid running during peak hours and minimize performance impact.
Q58:

What causes chunk migration performance issues?

Mid

Answer

Large documents, poor shard keys, heavy writes, and slow inter-shard networks can slow migrations.
Quick Summary: Chunk migration performance issues: migrations copy data over the network, consuming bandwidth. During migration, writes to migrating chunks are paused briefly for the final sync. If you have a poor shard key causing constant imbalance, the balancer migrates continuously. Jumbo chunks (too large to split) can't be migrated, causing permanent imbalance on one shard.
Q59:

What is a change stream in MongoDB?

Mid

Answer

Change streams provide real-time events for inserts, updates, and deletes. Useful for microservices and cache invalidation.
Quick Summary: Change streams provide real-time notifications of data changes in MongoDB (inserts, updates, deletes, DDL). They use the oplog under the hood. Consume with a watch() call. Resumable - you save a resume token and restart from a specific point after failure. Use for: triggering downstream actions (invalidate cache, send notification), event sourcing, real-time dashboards.
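A minimal mongosh sketch of consuming a change stream with a resume token; the collection name and the cache-invalidation use case are assumptions:

```javascript
// Watch inserts and updates; "updateLookup" returns the full post-update
// document, not just the changed fields.
const stream = db.orders.watch(
  [{ $match: { operationType: { $in: ["insert", "update"] } } }],
  { fullDocument: "updateLookup" }
);

let resumeToken;
while (stream.hasNext()) {
  const event = stream.next();
  resumeToken = event._id;        // persist this token durably
  printjson(event.fullDocument);  // e.g. invalidate a cache entry here
}

// After a crash, resume from the saved token instead of re-reading history:
// db.orders.watch([], { resumeAfter: resumeToken })
```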
Q60:

What is $graphLookup and when is it useful?

Mid

Answer

$graphLookup performs recursive lookups, useful for hierarchical structures like org charts or categories.
Quick Summary: $graphLookup performs recursive lookups to traverse graph or tree-like data. Given a starting document, it recursively fetches documents connected via a specified field. Use for: org charts, friend-of-friend networks, category hierarchies, file system trees. More efficient than multiple application-side queries for graph traversal. Set maxDepth to limit recursion.
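For example, walking an org chart upward from one employee; the employees collection and its name/managerId fields are hypothetical:

```javascript
db.employees.aggregate([
  { $match: { name: "Asha" } },
  { $graphLookup: {
      from: "employees",
      startWith: "$managerId",        // first hop: this employee's manager
      connectFromField: "managerId",  // then follow managerId recursively
      connectToField: "_id",
      as: "reportingChain",
      maxDepth: 5                     // bound the recursion
  }}
])
```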
Q61:

How do you detect slow queries in MongoDB?

Mid

Answer

Use slow query logs, profiler, and explain() to identify high scan ratios and missing indexes.
Quick Summary: Detect slow queries with: MongoDB profiler (set db.setProfilingLevel(1, {slowms: 100}) to log queries slower than 100ms to system.profile collection), mongotop (shows per-collection read/write time), mongostat (server-wide stats), Atlas Performance Advisor, and the currentOp command to see queries running right now. Follow with explain() on slow queries to find missing indexes.
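A typical mongosh workflow, assuming a hypothetical orders collection:

```javascript
// 1. Log every operation slower than 100 ms to system.profile.
db.setProfilingLevel(1, { slowms: 100 })

// 2. Review the most recent slow operations.
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()

// 3. Explain a suspect query: a high docsExamined-to-nReturned ratio
//    or a COLLSCAN stage points to a missing index.
db.orders.find({ status: "pending" }).explain("executionStats")
```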
Q62:

What is the role of the WiredTiger storage engine?

Mid

Answer

WiredTiger provides document-level locking, compression, checkpoints, and high concurrency performance.
Quick Summary: WiredTiger is the default MongoDB storage engine since MongoDB 3.2. It provides: document-level concurrency control (multiple writers don't block each other), compression (snappy by default - saves 50-80% disk space), checkpointing (consistent snapshots every 60 seconds), and write-ahead logging for crash recovery. Replaced the old MMAPv1 engine which used collection-level locking.
Q63:

How does WiredTiger compression improve storage?

Mid

Answer

Compression reduces disk usage and improves I/O performance by reading and writing fewer bytes.
Quick Summary: WiredTiger compresses data using snappy (default - fast, moderate compression), zlib (slower, better compression), or zstd (MongoDB 4.2+ - best balance). Compression is applied to both data and indexes on disk. This reduces storage costs significantly and can improve I/O performance since less data is read from/written to disk. CPU cost of compression is usually worth the I/O savings.
Q64:

What are checkpoints in WiredTiger?

Mid

Answer

Checkpoints flush in-memory data to disk periodically, ensuring durable restart points.
Quick Summary: WiredTiger checkpoints write a consistent snapshot of all in-memory data to disk every 60 seconds (or when the journal reaches 2GB). Checkpoints create a new consistent data file state. On crash recovery, MongoDB restores from the last checkpoint and then replays the journal to apply changes made after that checkpoint. This limits recovery time to the last 60 seconds of journal data.
Q65:

What causes collection-level locking and how to avoid it?

Mid

Answer

In WiredTiger, normal reads and writes take document-level locks, so collection-level contention comes mainly from DDL operations like createIndex and collMod. Run those off-peak or as rolling operations on large collections.
Quick Summary: WiredTiger uses document-level locking so multiple operations can write to the same collection concurrently without blocking each other. Collection-level locking only happens for certain operations like createIndex or collMod. Avoid these in production on large collections. In older MMAPv1, collection-level locking caused severe write contention under concurrent load.
Q66:

What is a working set and why is it important?

Mid

Answer

The working set is frequently accessed data and indexes. Performance drops if it exceeds available RAM.
Quick Summary: Working set is the data and indexes MongoDB actively uses. When the working set fits in WiredTiger's cache (by default the larger of 50% of RAM minus 1 GB, or 256 MB), reads are served from memory. When it exceeds available cache, MongoDB pages data to/from disk - causing I/O spikes and slow reads. Size your RAM so the frequently accessed working set fits. Monitor using serverStatus().wiredTiger.cache metrics.
Q67:

What is index intersection?

Mid

Answer

MongoDB can combine multiple indexes to satisfy a query, useful when no single index covers all fields.
Quick Summary: Index intersection allows MongoDB to use two separate indexes to satisfy a single query (instead of requiring a compound index). MongoDB ANDs the results from both indexes. In practice, a well-designed compound index almost always outperforms index intersection. Check explain() output - if you see "AND_HASH" or "AND_SORTED" stages, MongoDB is using index intersection.
Q68:

Why do large documents degrade performance?

Mid

Answer

Large documents slow reads and writes, increase RAM usage, and reduce replication and migration performance.
Quick Summary: Large documents hurt performance: they consume more cache space (fewer docs fit in RAM), take longer to transfer over network, and slow down reads even when you only need a few fields (unless you use projection). If you regularly read only part of a document, consider splitting into multiple documents or using projection to avoid fetching unused fields.
Q69:

What is the difference between primary and secondary reads?

Mid

Answer

Primary reads are strongly consistent, while secondary reads are eventually consistent and used for load balancing.
Quick Summary: Primary reads: always fresh, consistent, but all reads go to one node (can be bottleneck). Secondary reads: distributed across replica set members, reduces primary load, but data may be slightly behind (replication lag). Use secondaries for read-heavy analytics workloads or reporting where slight staleness is acceptable. Never read from secondaries for data that needs to be immediately consistent.
Q70:

What is replication lag and why does it occur?

Mid

Answer

Lag occurs when secondaries apply changes slower than primary. Causes include heavy writes and slow hardware.
Quick Summary: Replication lag is the delay between a write on the primary and its application on a secondary. Caused by: secondary hardware being slower, heavy write load, network latency, large write operations. Monitor with rs.printSecondaryReplicationInfo(). High lag means secondaries are stale - reads from them return old data and they're slower to take over if primary fails.
Q71:

What is the oplog and how does it support replication?

Mid

Answer

The oplog is a capped collection storing recent operations. Secondaries replay oplog entries to stay in sync.
Quick Summary: The oplog (operations log) is a capped collection on each replica set member that records all write operations. Secondaries continuously tail the primary's oplog and apply operations in order. Replication lag grows when secondaries can't keep up. Change streams use the oplog. Oplog size matters: if a secondary falls too far behind, the oplog might not contain the missing entries.
Q72:

What is majority write concern and why use it?

Mid

Answer

Majority write concern ensures writes are replicated to most nodes, preventing data loss after failovers.
Quick Summary: Majority write concern (w: "majority") ensures the write is acknowledged by the majority of replica set members before returning success. If the primary fails and a new primary is elected, a majority-acknowledged write is guaranteed to be present on the new primary. Without majority concern, writes acknowledged only by the primary can be rolled back during failover.
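In mongosh this looks like the following (the document shape is illustrative):

```javascript
// Return success only after a majority of data-bearing members have the
// write; wtimeout bounds the wait so a degraded set surfaces an error.
db.orders.insertOne(
  { orderId: 123, status: "paid" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)
```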
Q73:

How do you optimize MongoDB for high write throughput?

Mid

Answer

Use good shard keys, bulk writes, avoid unnecessary indexes, keep documents small, and tune WiredTiger cache.
Quick Summary: High write throughput optimization: use bulk writes (bulkWrite() - fewer round trips), avoid per-document indexes (indexes slow writes), use unordered bulk inserts (continue on error, parallel), distribute writes across shards with a good shard key, avoid transactions where possible (they add overhead), use write concern w:1 if you can tolerate some risk, and benchmark WiredTiger cache size.
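A sketch combining unordered bulk writes with a pre-aggregated counter; collection and field names are assumptions:

```javascript
db.events.bulkWrite(
  [
    { insertOne: { document: { type: "click", ts: new Date() } } },
    { insertOne: { document: { type: "view",  ts: new Date() } } },
    { updateOne: {                          // $inc a counter up front
        filter: { _id: "counter:clicks" },  // instead of counting raw
        update: { $inc: { total: 1 } },     // events later
        upsert: true
    }}
  ],
  { ordered: false }  // continue past individual errors, allow parallelism
)
```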
Q74:

What is $project in aggregation?

Mid

Answer

$project selects, removes, or transforms fields, helping control output structure and performance.
Quick Summary: $project in aggregation reshapes documents - include specific fields (field: 1), exclude fields (field: 0), rename fields (newName: "$oldName"), and add computed fields using expressions. Reduces document size early in the pipeline to minimize memory used by subsequent stages. Similar to SQL SELECT - define exactly what fields you want in the output.
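For instance (field names assumed):

```javascript
db.orders.aggregate([
  { $match: { status: "shipped" } },       // filter first, then reshape
  { $project: {
      _id: 0,                              // exclude _id
      customer: "$customerId",             // rename a field
      total: 1,                            // include unchanged
      itemCount: { $size: "$items" }       // computed field
  }}
])
```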
Q75:

How does MongoDB handle multi-document ACID transactions internally?

Mid

Answer

MongoDB uses two-phase commit, snapshot isolation, and transaction logs to ensure atomic multi-document operations.
Quick Summary: MongoDB multi-document transactions use snapshot isolation (read your own writes, consistent view of data as of transaction start). Internally: WiredTiger takes a snapshot at transaction start, all reads see the snapshot, writes are buffered and committed atomically. On commit, WiredTiger checks for write-write conflicts - if another transaction modified the same document, one is aborted and must retry.
Q76:

How does MongoDB handle concurrency using document-level locking?

Senior

Answer

MongoDB relies on WiredTiger’s document-level concurrency control: operations on different documents proceed in parallel, and conflicts on the same document are detected at commit time rather than blocked up front, avoiding collection-level contention.
Quick Summary: WiredTiger uses document-level optimistic concurrency. Multiple readers and writers can proceed in parallel without blocking each other. Conflicts are detected at commit time (if two transactions write to the same document, one gets a WriteConflict error and retries). This is far better than the old MMAPv1 collection-level locking that serialized all writes to a collection.
Q77:

What is snapshot isolation and how does MongoDB achieve it?

Senior

Answer

MongoDB provides snapshot isolation for transactions using timestamps, oplog ordering, and WiredTiger MVCC to maintain a consistent point-in-time view throughout the transaction.
Quick Summary: Snapshot isolation gives each transaction a consistent point-in-time view of the data. In MongoDB, WiredTiger creates a version snapshot at transaction start. Reads within the transaction see that snapshot - unaffected by concurrent writes from other transactions. Writes are only visible after commit. This prevents dirty reads and non-repeatable reads without locking readers.
Q78:

What role does WiredTiger’s write-ahead logging play in durability?

Senior

Answer

WiredTiger writes operations to WAL before flushing data pages. After crashes, MongoDB replays the WAL to restore data, ensuring strong durability guarantees.
Quick Summary: WiredTiger writes every change to the journal (write-ahead log) before applying it to data files. Each journal entry records the operation fully. On crash, MongoDB opens from the last checkpoint (data snapshot) and replays all journal entries that came after. This guarantees durability - a committed, journaled write is never lost even in sudden power failure.
Q79:

How do you diagnose performance issues using MongoDB’s profiler?

Senior

Answer

The profiler captures slow queries, execution times, scan metrics, and index usage, helping identify unindexed operations, inefficient sorts, and pipeline bottlenecks.
Quick Summary: MongoDB profiler records query execution details to system.profile. Enable with db.setProfilingLevel(1, {slowms: 100}) to capture queries over 100ms. Analyze: look at millis (execution time), docsExamined vs nreturned (high ratio = bad index), keysExamined, and the queryPlanner stage. Atlas Profiler provides this in a UI. Regularly review slow query logs in production.
Q80:

Why do $lookup operations cause performance concerns in large systems?

Senior

Answer

$lookup performs cross-collection joins. Without proper indexing, it triggers large scans and increases CPU and memory usage.
Quick Summary: $lookup (join) in a sharded cluster can't be pushed down to shards if the "from" collection is on different shards - all data must come to the mongos for merging. This causes massive data movement and memory pressure. Fix: embed frequently joined data, pre-aggregate with scheduled pipelines, use Atlas Search for full document lookup, or ensure joined collections share the same shard key.
Q81:

How does MongoDB’s balancer decide when to migrate chunks?

Senior

Answer

The balancer monitors shard chunk distribution via config servers and migrates chunks when imbalance thresholds are exceeded.
Quick Summary: The balancer triggers migration when the chunk count difference between the most and least loaded shards exceeds a threshold (varies by total chunk count). It migrates chunks from the most loaded to least loaded shards until balanced. Balancing uses collection-level locks during certain migration phases. Schedule balancing windows (balancerStart/Stop) to avoid peak traffic hours.
Q82:

How do you avoid shard hotspots?

Senior

Answer

Avoid monotonically increasing keys and choose high-cardinality shard keys or hashed keys for even write distribution.
Quick Summary: Avoid shard hotspots by choosing a shard key with even write distribution. Don't use monotonically increasing keys (timestamps, ObjectId) as shard keys - all new writes go to the last chunk on one shard. Solutions: use hashed shard key (distributes writes randomly), use compound shard key combining a high-write field with a hash component, or pre-split chunks before loading data.
Q83:

What are jumbo chunks and why are they problematic?

Senior

Answer

Jumbo chunks grow too large to split or migrate, blocking balancing and degrading performance in sharded clusters.
Quick Summary: Jumbo chunks exceed the maximum chunk size (default 128MB) and can't be automatically split or migrated. Caused by: all documents in a chunk share the same shard key value (low cardinality key). The balancer can't rebalance jumbo chunks, causing permanent hotspot on one shard. Fix: choose a higher-cardinality shard key. For existing jumbo chunks, use refineCollectionShardKey or manual splitting.
Q84:

How does MongoDB internally manage oplog entries during replication?

Senior

Answer

The primary writes operations to the oplog; secondaries tail the oplog and apply changes in timestamp order for consistent replication.
Quick Summary: The oplog is a capped collection. Each entry records a write operation (with document data for idempotent replay). Secondaries tail the primary oplog and apply entries. If a secondary falls too far behind (oplog is overwritten before secondary catches up), the secondary goes into RECOVERING state and needs full resync. Size oplog large enough to cover expected lag and maintenance windows.
Q85:

What is replication rollback and when does it occur?

Senior

Answer

Rollback happens when a primary steps down before its oplog entries replicate. When the node rejoins the replica set, it rolls back the unreplicated writes to match the majority view.
Quick Summary: Replication rollback occurs when a primary fails, a new primary is elected, and the old primary rejoins with writes that were never replicated to the majority. Those writes are rolled back (written to a rollback folder for manual recovery). To prevent rollback: use write concern majority. With w:1 writes, any unacknowledged write not yet replicated to the new primary is lost.
Q86:

How does MongoDB ensure consistency in a sharded cluster?

Senior

Answer

Config servers maintain chunk metadata; mongos routes queries based on metadata, and majority write concern ensures cluster-wide consistent writes.
Quick Summary: In a sharded cluster, consistency per shard is handled by each shard's replica set. Cross-shard consistency: single-document operations on one shard are always atomic. Multi-document cross-shard transactions use a two-phase commit protocol via the transaction coordinator. For non-transactional operations across shards, you get per-shard consistency but no global atomicity.
Q87:

How does MongoDB handle distributed transactions across shards?

Senior

Answer

MongoDB uses two-phase commit across shards to ensure atomic multi-shard updates, preventing partial writes.
Quick Summary: Distributed transactions across shards use a two-phase commit coordinated by a transaction coordinator (runs on the mongos or shard). Phase 1: all shards prepare and lock resources. Phase 2: all shards commit or abort. This adds significant latency and lock contention. Keep cross-shard transactions short, or redesign schema to avoid them - keep related data on the same shard via shard key design.
Q88:

What is the impact of large indexes on performance?

Senior

Answer

Large indexes consume memory, slow writes, and increase disk I/O, requiring careful index design for efficiency.
Quick Summary: Large indexes consume RAM in the WiredTiger cache. All indexes should fit in memory for optimal performance. Large indexes also slow down writes (each write must update all indexes). Monitor index size with db.collection.stats(). Remove unused indexes regularly. Covered queries read only the index, so they are fastest when the index stays resident in memory. Use db.collection.aggregate([{$indexStats:{}}]) to find unused indexes.
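For example, listing indexes with zero recorded accesses since the last restart (collection name assumed):

```javascript
db.orders.aggregate([
  { $indexStats: {} },
  { $match: { "accesses.ops": 0 } },           // never used by any query
  { $project: { name: 1, "accesses.ops": 1 } } // candidates for removal
])
```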
Q89:

What are hidden indexes and when should you use them?

Senior

Answer

Hidden indexes allow evaluating the impact of index removal without affecting the planner, useful for safe index tuning.
Quick Summary: Hidden indexes are maintained but not used by the query planner. You can hide an index to test the performance impact of removing it without actually dropping it. If queries stay fast, the index was unused and you can safely drop it. If performance degrades, unhide it. Introduced in MongoDB 4.4 to safely evaluate index removal in production.
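The workflow in mongosh (collection and index names are illustrative):

```javascript
// Hide the index: the planner ignores it, but MongoDB keeps maintaining
// it, so unhiding is instant if latency regresses.
db.orders.hideIndex("status_1")

// ...observe query performance for a representative period...

db.orders.unhideIndex("status_1")  // roll back, or once confident:
// db.orders.dropIndex("status_1")
```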
Q90:

How does MongoDB choose an execution plan when multiple indexes exist?

Senior

Answer

MongoDB tests candidate plans during a trial phase and caches the best plan to avoid repeated plan selection.
Quick Summary: MongoDB uses the query planner to select the best index. When multiple indexes could satisfy a query, the planner runs a "tournament" - it executes candidate plans in parallel for a trial period and picks the winner (fewest works to return the first batch of results). The winner is cached in the plan cache. Provide hints (db.collection.find().hint()) to force a specific index.
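A hint forces one candidate plan so you can compare it against the planner's choice (index and field names assumed):

```javascript
db.orders.find({ status: "pending", region: "EU" })
  .hint({ status: 1, region: 1 })   // force this index
  .explain("executionStats")        // compare totalDocsExamined, millis
```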
Q91:

What is a plan cache eviction and when does it happen?

Senior

Answer

Plan cache evicts entries after metadata changes or when query patterns deviate significantly, forcing re-evaluation of plans.
Quick Summary: Plan cache eviction happens when: the collection's data distribution changes significantly (index stats become stale), the index is dropped, the collection is rebuilt, or MongoDB restarts. After eviction, the planner re-evaluates all candidate plans next time the query runs. Unexpected plan cache evictions can cause sudden performance changes as a previously good plan is reevaluated.
Q92:

How do frequent updates cause document movement and why is it bad?

Senior

Answer

Document growth beyond allocated space triggers relocation, causing fragmentation and increased index maintenance.
Quick Summary: In the legacy MMAPv1 engine, a document that grew past its allocated space was relocated on disk, and every index entry pointing at it had to be rewritten - causing fragmentation and extra I/O. WiredTiger does not update in place; it rewrites the document on each update, so frequent growth still means write amplification, cache churn, and index maintenance. Mitigation: design the schema so documents reach a stable size early and avoid unbounded-growth patterns such as ever-expanding arrays.
Q93:

How does collation impact index usage?

Senior

Answer

Queries must match index collation; otherwise MongoDB cannot use the index and falls back to collection scans.
Quick Summary: Collation defines language-specific string comparison rules (case sensitivity, accent sensitivity, sort order). Indexes are collation-aware - an index with a specific collation can only be used by queries with the same collation. A query with collation won't use a standard index. Create the index with the same collation your queries use, or the query planner falls back to a collection scan.
Q94:

What is the effect of large aggregation pipelines on memory?

Senior

Answer

Large pipelines may spill to disk when memory is insufficient, drastically slowing performance.
Quick Summary: Aggregation pipelines that produce large intermediate results use memory. MongoDB limits each blocking stage (such as $sort or $group) to 100MB of RAM by default. If exceeded, the pipeline fails unless you add allowDiskUse: true (allows spilling to disk - slower). Optimize: add $match and $project early to reduce document size before expensive stages, use indexes in $match, avoid $unwind on large arrays early in the pipeline.
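A sketch of a memory-heavy group that filters and projects early and permits spilling (collection and field names assumed):

```javascript
db.events.aggregate(
  [
    { $match: { ts: { $gte: ISODate("2024-01-01") } } }, // shrink input early
    { $project: { userId: 1 } },                         // drop unused fields
    { $group: { _id: "$userId", events: { $sum: 1 } } }, // blocking stage
    { $sort: { events: -1 } }                            // blocking stage
  ],
  { allowDiskUse: true }  // spill to disk instead of failing at the limit
)
```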
Q95:

Why do unbounded array growth patterns degrade performance?

Senior

Answer

Growing arrays increase document size, cause relocations, and produce heavy index rewrites, slowing reads and writes.
Quick Summary: Unbounded arrays that grow indefinitely cause documents to grow past the 16MB limit, hurt working set efficiency (you load the whole document to read one array element), make updates expensive (updating an element in a 10,000-item array requires rewriting the array), and cause multi-key index bloat. Design pattern: use a separate collection for array items with a reference back to the parent.
Q96:

How does replication guarantee ordering?

Senior

Answer

Oplog timestamps and majority write acknowledgment ensure secondaries apply operations in the same sequence as the primary.
Quick Summary: MongoDB replication maintains ordering via the oplog - a capped collection where entries are ordered by a timestamp+counter (oplog timestamp). Secondaries apply oplog entries in strict order. This guarantees writes are replayed in the same order they happened on the primary. The oplog is the single source of truth for replication ordering across all secondary members.
Q97:

What is the difference between majority and linearizable reads?

Senior

Answer

Majority reads reflect replicated data, while linearizable reads guarantee strict ordering by requiring primary confirmation.
Quick Summary: Majority read concern returns data that has been committed to a majority of replica set members - guarantees the data won't roll back. Linearizable read concern goes further - it guarantees you read the absolute latest committed write, waiting for all in-flight writes to complete first. Linearizable is single-document only, much slower, but provides the strongest consistency for critical reads.
Q98:

How does MongoDB handle versioned schema migrations?

Senior

Answer

Migrations are performed safely using batch updates or tools like Mongock, with applications supporting both old and new versions temporarily.
Quick Summary: MongoDB supports schema evolution via its flexible document model - add new fields without migrating existing documents. Strategies: use schema versioning field (add "schemaVersion: 2" to documents and handle both versions in code), lazy migration (upgrade documents on first read/write), bulk migration scripts for breaking changes. Schema validation can be updated live with collMod.
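A lazy-migration sketch in mongosh-style JavaScript; the users collection, the v1-to-v2 name split, and the helper itself are all hypothetical:

```javascript
function getUser(id) {
  let doc = db.users.findOne({ _id: id });
  if (doc && (doc.schemaVersion ?? 1) < 2) {
    // v1 stored a single "name"; v2 splits it into first/last.
    const [first, ...rest] = doc.name.split(" ");
    db.users.updateOne({ _id: id }, {
      $set: { firstName: first, lastName: rest.join(" "), schemaVersion: 2 },
      $unset: { name: "" }
    });
    doc = db.users.findOne({ _id: id }); // return the upgraded document
  }
  return doc;
}
```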
Q99:

What is the role of the config server replica set?

Senior

Answer

Config servers store cluster metadata. If the config server cluster fails, routing and chunk management halt.
Quick Summary: Config servers store all metadata for a sharded cluster: which collections are sharded, the shard key, chunk ranges, and which chunk lives on which shard. Run as a replica set (CSRS). The mongos routers cache this metadata. If config servers are unavailable, mongos can still serve reads from cache but no new chunk migrations or shard key changes can occur. Config servers must be highly available.
Q100:

What are yield points during query execution?

Senior

Answer

Yield points allow MongoDB to pause long operations to let other operations acquire locks, preventing system stalls.
Quick Summary: Yield points are moments during a long-running query when MongoDB pauses to check for interrupts and allow other operations to run. Without yields, a long collection scan could hold resources indefinitely. MongoDB yields automatically at regular intervals (after a configurable number of documents examined or amount of elapsed time). This prevents any single query from starving other operations, but it means cursors can see changes made during their execution.
Q101:

What is the role of index filters?

Senior

Answer

Index filters restrict query planner index usage, helping force specific indexes for performance tuning.
Quick Summary: Index filters let you specify which indexes the planner can use for a given query shape, overriding the planner's automatic choice. Set with planCacheSetFilter. Useful when the planner consistently picks a suboptimal plan and query hints aren't practical. Unlike hints they apply to every query of that shape, but they are held in memory and do not persist across server restarts. Use sparingly - usually fixing the query or index design is better.
Q102:

What is a rolling index build?

Senior

Answer

Rolling index builds rebuild indexes on secondaries first, then switch primaries safely, ensuring zero downtime.
Quick Summary: A rolling index build creates the index on one replica set member at a time so the set stays operational: take each secondary out of the set in turn (restart it as a standalone), build the index, let it rejoin and catch up; finally step down the primary and repeat on it. This avoids the performance hit of building on all nodes simultaneously and is the standard approach for indexing large collections in production without downtime.
Q103:

What are the trade-offs of using $near for geospatial queries?

Senior

Answer

$near provides distance-ordered results but can be CPU-heavy and require specialized indexes for performance.
Quick Summary: $near and $geoNear for geospatial queries require a 2dsphere or 2d index. Performance trade-offs: they sort results by distance (expensive for large result sets), can't be combined efficiently with other indexes (geospatial index used, other filters applied after), and $near doesn't work in aggregation pipelines ($geoNear must be first stage). Limit results to a reasonable maxDistance and document count.
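A bounded example (collection, field, and coordinates are illustrative):

```javascript
// $near requires a geospatial index on the queried field.
db.places.createIndex({ location: "2dsphere" })

db.places.find({
  location: {
    $near: {
      $geometry: { type: "Point", coordinates: [77.59, 12.97] },
      $maxDistance: 2000   // metres: bounds the distance-sorted result set
    }
  }
}).limit(20)               // cap the document count as well
```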
Q104:

Why do sharded clusters struggle with scatter-gather queries?

Senior

Answer

Scatter-gather queries hit all shards, increasing latency and limiting scalability. Good shard keys minimize this pattern.
Quick Summary: Scatter-gather queries hit every shard because the query doesn't include the shard key. mongos fans out to all shards, each returns results, mongos merges them. For sorting, all shards sort and return top N results, mongos re-sorts and picks top N. This is expensive and gets worse as shard count grows. Fix: include the shard key in queries to target a single shard (targeted query).
Q105:

How do you design MongoDB schema for high write throughput systems?

Senior

Answer

Use small documents, minimize indexes, design evenly distributed shard keys, and use bucketing patterns to reduce write amplification.
Quick Summary: High write throughput schema design: keep documents small (faster to write, more fit in cache), minimize index count (each write updates all indexes), use bulk writes, avoid transactions, use a good shard key to distribute writes. Consider the bucket pattern for time-series (batch multiple measurements into one document to reduce write count). Pre-aggregate counters with $inc instead of inserting individual events.
Q106:

How does MongoDB guarantee global consistency in multi-shard, multi-region deployments?

Expert

Answer

MongoDB uses majority write concern, oplog ordering, causal consistency, and region-aware replica set tags to ensure global consistency across multi-region, multi-shard deployments.
Quick Summary: MongoDB doesn't provide global linearizable consistency across shards in multi-region setups without explicit configuration. Use causal consistency sessions (session.advanceClusterTime) to ensure reads see causally consistent data. For global consistency: configure zone sharding with majority write concern across regions, use global clusters in Atlas, and accept that cross-region reads have latency cost for strong consistency guarantees.
Q107:

How does MongoDB internally manage oplog truncation and what risks exist if oplog is too small?

Expert

Answer

MongoDB truncates old oplog entries automatically. If the oplog is too small, secondaries cannot catch up, causing rollback or forcing an initial sync that increases downtime risk.
Quick Summary: The oplog is a capped collection - when it fills up, oldest entries are truncated. If a secondary falls behind more than the oplog window (oplog can hold N hours of operations), the secondary can't catch up via replication and needs a full resync (initial sync). Risk: a resync during heavy load is very slow. Size oplog to cover your longest expected maintenance window.
Q108:

What architectural patterns ensure minimal replication lag in high-write clusters?

Expert

Answer

Low-latency storage, optimized shard keys, small document writes, and region-aware replica placement minimize lag. Flow control tuning prevents primaries from overwhelming secondaries.
Quick Summary: Minimize replication lag: use faster hardware on secondaries (match primary specs), avoid large write operations that take long to apply on secondaries, pre-build indexes during off-peak, size oplog appropriately, use secondaries with same network proximity to primary as possible, monitor lag with rs.printSecondaryReplicationInfo(), and avoid writes that cause large document moves or multi-document operations without transactions.
Q109:

How do you design a multi-shard transaction strategy to avoid large distributed rollbacks?

Expert

Answer

Keep transactions small, align operations with shard keys, avoid multi-collection writes, and reduce batch size to prevent distributed rollback overhead.
Quick Summary: To avoid large distributed transactions: design schemas so related data lives on the same shard (same shard key prefix), use event sourcing or saga patterns for multi-shard workflows instead of transactions, break large operations into smaller single-shard transactions with compensating logic, and use the transactional outbox pattern to coordinate across service boundaries without cross-shard transactions.
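The saga idea above can be sketched without any database at all - run single-shard steps in order and, on failure, apply the compensating actions for the steps that already succeeded, in reverse. All names here are illustrative, not a MongoDB API:

```python
def run_saga(steps):
    """steps: list of (action, compensate) pairs, each a zero-arg callable."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):   # undo completed steps, newest first
                undo()
            return False
    return True

log = []

def fail_credit():
    raise RuntimeError("credit step failed")  # simulate a failing second step

steps = [
    (lambda: log.append("debit"), lambda: log.append("undo-debit")),
    (fail_credit,                 lambda: log.append("undo-credit")),
]
ok = run_saga(steps)
print(ok, log)  # False ['debit', 'undo-debit']
```

In a real system each action would be a single-shard transaction and each compensation an idempotent reversal, with the saga state persisted so it survives process crashes.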
Q110:

How does WiredTiger’s checkpointing mechanism influence crash recovery?

Expert

Answer

Checkpoints flush memory pages to disk. On crash, MongoDB replays WAL only after the last checkpoint, reducing recovery time and ensuring durability.
Quick Summary: WiredTiger writes a consistent snapshot of the data to disk roughly every 60 seconds (a checkpoint). On crash, recovery starts from the last checkpoint and replays journal entries written after it, bounding recovery work to at most about 60 seconds of operations. If the journal is also lost (disk failure), recovery falls back to the last checkpoint and loses up to ~60 seconds of writes - hence use replica sets.
Q111:

What leads to cache pressure in WiredTiger and how do you alleviate it?

Expert

Answer

Cache pressure arises from oversized working sets. Solutions include increasing WT cache, reducing document size, removing heavy indexes, and archiving cold data.
Quick Summary: WiredTiger cache pressure (cache full, high eviction rate) happens when: working set exceeds cache size, large documents prevent efficient caching, too many indexes all needing to be in memory. Fix: increase WiredTiger cache size (storage.wiredTiger.engineConfig.cacheSizeGB), add RAM, reduce index count, use projection to avoid reading large documents fully, or shard to distribute data across more RAM.
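For reference, the cache-size option mentioned above lives in the storage section of mongod.conf; the value here is only an example, not a recommendation:

```yaml
# mongod.conf fragment: cap the WiredTiger cache (illustrative value)
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 16
```

Leave headroom for the filesystem cache and connection overhead rather than giving WiredTiger all available RAM.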
Q112:

How do you detect and fix logical inconsistencies across replicas?

Expert

Answer

Use the dbHash command, validate(), and CDC-based comparison to detect mismatches. Fix via initial sync, logical rebuild, or selective re-sync.
Quick Summary: Detect replica inconsistencies by running db.runCommand({dbHash: 1}) on each member and comparing the per-collection hashes. Newer versions also provide the dbCheck command for online consistency validation between primary and secondaries. Logical inconsistencies caused by application bugs (wrong data written) can't be fixed by MongoDB and require application-level reconciliation. Always use majority write concern to prevent rollback-induced inconsistencies.
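Once you have collected the dbHash output from each member (e.g. via your driver), comparing the per-collection hashes is plain dictionary work; the hash values below are fabricated for illustration:

```python
# Per-collection hashes as reported by dbHash on two members (fabricated).
primary_hashes   = {"users": "ab12", "orders": "cd34"}
secondary_hashes = {"users": "ab12", "orders": "ff99"}

mismatched = sorted(c for c in primary_hashes
                    if primary_hashes[c] != secondary_hashes.get(c))
print(mismatched)  # ['orders'] -> diverged; candidate for selective re-sync
```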
Q113:

Why is two-phase commit expensive in MongoDB, and when should you avoid it?

Expert

Answer

Two-phase commit requires cross-shard coordination and oplog tracking, increasing latency and resource usage. Avoid unless strict multi-document atomicity is required.
Quick Summary: Two-phase commit (2PC) across shards in MongoDB is expensive because: it requires locking resources on multiple shards simultaneously, increases latency (two round trips across network to all participating shards), increases conflict and abort rate under high concurrency, and mongos must coordinate all phases. Avoid cross-shard transactions by designing data locality - put data that changes together on the same shard.
Q114:

What are resumable change streams and why are they critical for event-driven architectures?

Expert

Answer

Change streams resume from a resumeToken or clusterTime, ensuring fault-tolerant event processing with no data loss or duplication.
Quick Summary: Resumable change streams let you restart a change stream after a failure from exactly where it left off, using a resume token. Save the resume token with each processed event; on restart, pass it via startAfter (or resumeAfter). This guarantees at-least-once processing. Critical for event-driven architectures where missing events would cause data inconsistency downstream.
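A sketch of the at-least-once loop, with an in-memory list standing in for the change stream (with a real driver you would pass the saved token as the startAfter option); all names are illustrative:

```python
events = [{"_id": f"tok{i}", "op": i} for i in range(5)]  # _id = resume token

def consume(stream, saved_token, processed):
    start = 0
    if saved_token is not None:   # resume strictly after the saved token
        start = next(i for i, e in enumerate(stream)
                     if e["_id"] == saved_token) + 1
    for event in stream[start:]:
        processed.append(event["op"])  # 1) handle the event
        saved_token = event["_id"]     # 2) only then checkpoint its token
    return saved_token

processed = []
token = consume(events[:3], None, processed)  # "crash" after three events
token = consume(events, token, processed)     # restart resumes after tok2
print(processed)  # [0, 1, 2, 3, 4] -- no gaps
```

Checkpointing after processing means a crash between the two steps replays one event, which is why downstream handlers must be idempotent.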
Q115:

How do you scale analytic workloads without impacting OLTP performance?

Expert

Answer

Use dedicated analytics secondaries, hidden nodes, or offload data to OLAP systems via CDC. Use $merge pipelines for incremental materialization.
Quick Summary: Scale analytics without impacting OLTP: use secondary reads with readPreference: secondary for analytics queries (doesn't hit the primary), use Atlas Data Lake to run analytics on archived data, pre-aggregate with scheduled pipelines that write to separate summary collections, use change streams to maintain real-time aggregates, or replicate data to a separate analytics system (Redshift, BigQuery) via Kafka/Debezium.
Q116:

How does MongoDB internally manage write conflicts under snapshot isolation?

Expert

Answer

MongoDB uses timestamp-based MVCC. Conflicting writes trigger transaction abort to maintain isolation guarantees.
Quick Summary: Under snapshot isolation, if two concurrent transactions write to the same document, the second to commit gets a WriteConflict error and must retry. WiredTiger detects this at commit time by checking if the document was modified after the transaction's snapshot timestamp. This is optimistic concurrency - no locks taken for reads, conflicts detected at commit.
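Commit-time conflict detection implies clients should wrap transactions in a retry loop. A minimal sketch with the conflict simulated by a local exception (a real driver surfaces it as a retryable error, e.g. with the TransientTransactionError label):

```python
class WriteConflict(Exception):
    """Stand-in for the server's WriteConflict / transient transaction error."""

def with_retry(txn_fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn(attempt)      # each attempt sees a fresh snapshot
        except WriteConflict:
            if attempt == max_attempts:
                raise                   # give up after the last attempt

def flaky_txn(attempt):
    if attempt < 3:
        raise WriteConflict()           # another writer committed first
    return "committed"

result = with_retry(flaky_txn)
print(result)  # committed (succeeds on the 3rd attempt)
```

Production drivers offer a built-in equivalent (e.g. withTransaction-style helpers), so prefer those over hand-rolled loops.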
Q117:

What are the internals of the balancer decision algorithm in sharded clusters?

Expert

Answer

The balancer evaluates chunk counts, data size, and cluster history. It throttles migrations and uses metadata from config servers for safe relocation.
Quick Summary: The balancer periodically compares data distribution across shards and triggers migration when the imbalance exceeds a threshold. It picks chunks to move based on size and migration history. Use the balancerCollectionStatus command to check balancing state, and adjust the chunk size setting to control when splits occur. The config server coordinates all migration metadata.
Q118:

How do you optimize heavy $lookup workloads in distributed clusters?

Expert

Answer

Use embedding, pre-joins, shard-aligned lookups, reduced cardinality, and foreign-key indexes. Denormalization often replaces expensive $lookup patterns.
Quick Summary: Optimize $lookup in sharded clusters: if both collections share the same shard key, the lookup can be pushed down to shards without data movement. Pre-join data using scheduled aggregation pipelines that write denormalized results to a separate collection ($merge as a materialized view). Ensure the foreign field is indexed, and switch from normalized schemas to embedded documents where lookup patterns are performance-critical.
Q119:

How does MongoDB prevent data loss during node failover?

Expert

Answer

Primary acknowledges writes only after majority replication. Elections ensure nodes with consistent oplogs become primary, preventing divergence.
Quick Summary: Data loss prevention during failover: use w:majority write concern so acknowledged writes are on a majority of nodes. Enable journaling. Size your replica set with 3+ members for a proper majority. Monitor replication lag - excessive lag means secondaries might not have recent writes when failover occurs. Test failover regularly. Atlas automates replica set management and monitors for failover events.
Q120:

What are the biggest risks of sharding too early or too late?

Expert

Answer

Sharding early adds unnecessary complexity; sharding late causes heavy balancing, hotspots, and downtime. Ideal timing depends on write throughput and dataset size.
Quick Summary: Sharding too early: added complexity and operational overhead before you need it, harder to change shard key later. Sharding too late: migrating a huge existing collection to sharding is painful and time-consuming, poor shard key choice under pressure. Right time: when a single replica set reaches its limits (write throughput, storage, or working set exceeds available RAM). Start with a solid shard key design before you need sharding.
Q121:

How do bucket patterns optimize time-series data in MongoDB?

Expert

Answer

Buckets group time-series events into ranges, reducing document count and index overhead. Native time-series collections use similar internal bucketing.
Quick Summary: Bucket pattern for time-series: instead of one document per sensor reading, group N readings (e.g., one hour of data) into one document with an array. This dramatically reduces document count, improves compression, and means fewer index entries. Example: {sensorId, hour, readings: [{minute: 0, val: 22.5}, ...]}. MongoDB 5.0+ has native time-series collections that implement this pattern automatically.
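A minimal sketch of the bucketing step in plain Python - folding per-reading documents into one document per (sensorId, hour); field names follow the example above and are only illustrative:

```python
from collections import defaultdict

# One source document per reading (ts in seconds, fabricated sample data).
readings = [
    {"sensorId": "s1", "ts": 60,   "val": 22.5},
    {"sensorId": "s1", "ts": 120,  "val": 22.7},
    {"sensorId": "s1", "ts": 3630, "val": 23.1},
]

buckets = defaultdict(list)
for r in readings:
    hour = r["ts"] // 3600                         # bucket boundary
    buckets[(r["sensorId"], hour)].append(
        {"minute": (r["ts"] % 3600) // 60, "val": r["val"]})

docs = [{"sensorId": s, "hour": h, "readings": rs}
        for (s, h), rs in buckets.items()]
print(len(docs))  # 2 bucket documents instead of 3 raw readings
```

In MongoDB you would implement the append with an upsert ($push into the matching bucket), typically with a cap on readings per bucket to keep documents small.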
Q122:

How does MongoDB optimize read patterns using index intersection?

Expert

Answer

MongoDB combines multiple indexes to satisfy a query when no compound index exists. This improves performance compared to full scans but is slower than a single optimal compound index.
Quick Summary: MongoDB can combine results from two separate indexes using index intersection (AND_HASH/AND_SORTED plan stages). The planner may choose this over a single compound index if it estimates fewer total keys examined. In practice, a well-designed compound index usually outperforms intersection due to lower overhead. Use explain() to verify - if intersection is being used, consider adding a compound index instead.
Q123:

What strategies help prevent write amplification in high-throughput clusters?

Expert

Answer

Reduce document size, avoid unnecessary indexes, use targeted updates, limit large arrays, and distribute writes evenly with good shard keys.
Quick Summary: Prevent write amplification: minimize index count (each index is written on every document insert/update), use bulk writes, avoid small frequent updates (batch them), use append-only patterns where possible, size WiredTiger cache appropriately so writes don't immediately evict hot data, and use compression to reduce physical write size. Monitor with serverStatus().wiredTiger for cache and I/O metrics.
Q124:

How do multi-threaded aggregation queries maintain correctness in MongoDB?

Expert

Answer

In sharded clusters, aggregation runs in parallel across shards; intermediate outputs are merged deterministically without violating pipeline semantics.
Quick Summary: MongoDB parallelizes aggregation mainly by distribution: each shard runs the early pipeline stages over its own data, and the partial results are merged on mongos (or a designated merging shard) before later stages run. Correctness is preserved because each read operates on a consistent snapshot, and merging stages such as $group and $sort combine the partials before downstream stages see them. The parallelism is transparent - results match a single-threaded execution.
Q125:

How do you handle schema evolution in long-lived MongoDB clusters?

Expert

Answer

Use additive schema changes, background migrations, versioned schemas, and applications that accept both old and new fields until migration completes.
Quick Summary: Schema evolution in long-lived clusters: add a schemaVersion field to documents, handle multiple versions in application code, migrate lazily (upgrade on read/write), or run batch migration scripts during maintenance windows. Use JSON Schema validation with validationAction: "warn" during the transition (logs violations without rejecting writes). Keep old field names until migration completes, and communicate schema changes to all teams consuming the collection.
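The lazy (upgrade-on-read) approach can be sketched as a pure function that normalizes any stored version to the current shape; the field names and versions are made up for illustration:

```python
CURRENT_VERSION = 2

def upgrade(doc):
    """Normalize any stored document version to the current shape."""
    if doc.get("schemaVersion", 1) == 1:
        # v1 stored a single "name"; v2 splits it into first/last
        first, _, last = doc.pop("name", "").partition(" ")
        doc["firstName"], doc["lastName"] = first, last
        doc["schemaVersion"] = CURRENT_VERSION
    return doc   # write the doc back on the next update to persist the upgrade

migrated = upgrade({"name": "Ada Lovelace"})
print(migrated)  # {'firstName': 'Ada', 'lastName': 'Lovelace', 'schemaVersion': 2}
```

Calling upgrade on every read keeps all code paths working against one shape while the stored data migrates gradually.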
