
Sharding in MongoDB


MongoDB is a NoSQL database management system. It is an open-source program that is gaining popularity day by day because it imposes fewer restrictions than SQL databases. The databases created in MongoDB need to be stored and accessed efficiently, and one of the major strengths of MongoDB is that it lets us do this even as the data grows. Scalability is one of its best features. In this topic, we are going to learn about sharding in MongoDB.

Sharding in MongoDB

Sharding is the method MongoDB uses to manage huge amounts of data. It divides a large database into smaller, more manageable parts called shards. Let us understand this with an example:

Suppose there is a very bad dictionary that contains hundreds of thousands of words but does not list them in alphabetical order. Think of the words as the data stored inside a database. To find a particular word, we would have to go through each and every word in the dictionary, so even the simple task of looking up a word would be almost impossible. If we shard the dictionary, we divide the words into several categories. A normal dictionary is a good example of a sharded database: its words are divided in alphabetical order, which makes finding any word quick and easy.

Sharding a database works in much the same way. A single database that holds a very large amount of data is difficult for one server to store and query efficiently, so it is divided into several parts called shards. Each shard is itself a database, only smaller and easier to manage, and it holds a subset of the data.

Sharded cluster components

A sharded cluster is made up of several components. We have already briefly mentioned one of them, shards. Let us discuss shards in more detail, along with the other components of a sharded cluster.

Shards

A "shard" in MongoDB is a smaller, more manageable portion of a bigger database. When a database becomes too big for a single system to manage, we can divide the data into smaller databases. These smaller databases are called shards.

Again looking at the example above, a dictionary is a kind of database that contains the meanings of a huge set of words. A proper example of shards in this context would be to divide the dictionary into 26 smaller dictionaries, each one containing only the words that start with a specific letter of the alphabet. For example, one shard (small dictionary) contains only words that start with 'A', another shard contains only words that start with 'B', and so on.

Now, if you want to look up a particular word, you don't have to go through the entire dictionary. You can go directly to that particular shard and find the word much more quickly. This is the basic idea behind sharding: by breaking a larger database into smaller, more manageable parts (shards), you can improve efficiency and speed.
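The dictionary analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not how MongoDB stores data: the function names and the word list are made up for the example, and each "shard" is just an in-memory list.

```python
from collections import defaultdict

def build_shards(words):
    """Group words into one 'shard' (list) per first letter."""
    shards = defaultdict(list)
    for word in words:
        shards[word[0].lower()].append(word)
    return shards

def find_unsharded(words, target):
    """Scan the whole word list; return how many words were inspected."""
    for inspected, word in enumerate(words, start=1):
        if word == target:
            return inspected
    return len(words)

def find_sharded(shards, target):
    """Scan only the shard for the target's first letter."""
    shard = shards.get(target[0].lower(), [])
    for inspected, word in enumerate(shard, start=1):
        if word == target:
            return inspected
    return len(shard)

# Hypothetical, unordered "bad dictionary".
words = ["banana", "cherry", "apple", "blueberry", "avocado", "cranberry", "apricot"]
shards = build_shards(words)

print(find_unsharded(words, "apricot"))  # 7: every word is inspected
print(find_sharded(shards, "apricot"))   # 3: only the words in shard 'a'
```

Even in this tiny example, the sharded lookup inspects fewer than half as many words, and the gap grows with the size of the dictionary.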

Mongos

In MongoDB, mongos instances are query routers: they receive queries from clients and route them to the appropriate shards in a sharded cluster.

Let's continue with the dictionary example to explain the role of mongos in MongoDB's sharded setup. We have 26 smaller dictionaries in our sharded dictionary, each of which contains words that begin with a specific letter of the alphabet. Now, imagine you want to find the definition of a word, but you don't know which mini-dictionary (shard) it's in. You need a way to ask for a word and be directed to the right mini-dictionary. This is where mongos comes in. The mongos acts like a query router, a kind of "librarian" for the database. When a user makes a request to find data (like asking for the definition of a word), the request goes to the mongos. The mongos knows which shard (mini-dictionary) holds the data you're looking for, and it routes your request to the correct shard.

Config Servers

In MongoDB, config servers are a critical component of a sharded cluster. They store the cluster's metadata and configuration settings. Metadata is data about the data in the database, and in this case, it includes information about the organization and location of the data in the sharded cluster.

Continuing with the dictionary analogy, let's understand what config servers are in MongoDB. In our sharded dictionary, we have 26 smaller dictionaries, each containing words that start with a specific letter of the alphabet. Now, there needs to be a system that keeps track of which words are in which mini-dictionary. We need to know that 'Apple' is in the 'A' mini-dictionary, 'Banana' is in the 'B' mini-dictionary, and so on.

This is where config servers come in. They keep track of the cluster's state, including which data is stored on which shard. This is crucial for the mongos (the "librarian") to know where to direct queries. When a user sends a query, the mongos uses the metadata from the config servers, which it caches locally, to determine where the requested data is stored. The metadata tells the mongos which shard or shards hold the data, and the mongos then routes the query to the appropriate place.
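The cooperation between the "librarian" and the metadata can be sketched as a lookup table plus a routing function. The following is a minimal sketch under stated assumptions: `RANGE_TABLE`, the shard names, and `route` are hypothetical stand-ins for the config server metadata and the mongos routing step, not MongoDB's real data structures or API.

```python
import bisect

# Hypothetical stand-in for config server metadata: each entry is
# (inclusive lower bound of a key range, shard that owns that range).
# The list is sorted by lower bound.
RANGE_TABLE = [
    ("a", "shard0"),  # keys from "a" up to (not including) "i"
    ("i", "shard1"),  # keys from "i" up to (not including) "r"
    ("r", "shard2"),  # keys from "r" onward
]

LOWER_BOUNDS = [lower for lower, _ in RANGE_TABLE]

def route(word):
    """Return the name of the shard that owns the word, as a router would."""
    key = word.lower()
    # bisect_right gives the position of the first lower bound greater
    # than the key; the range just before that position owns the key.
    index = bisect.bisect_right(LOWER_BOUNDS, key) - 1
    if index < 0:
        raise KeyError(f"no range covers {word!r}")
    return RANGE_TABLE[index][1]

print(route("Apple"))   # shard0
print(route("mango"))   # shard1
print(route("zebra"))   # shard2
```

The router never looks at the words themselves; it only consults the range table, which is why keeping that metadata accurate matters so much.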

Shard keys

A shard key is the field, or set of fields, that MongoDB uses to distribute a collection's documents (the database's data) across the shards. The database divides the shard key values into ranges and assigns each range to a specific shard. For example, if your shard key is a date field, one range might cover the dates in January, the next range the dates in February, and so on. A well-chosen shard key results in a balanced distribution of data.

Choosing the right shard keys is a critical decision because it may affect the system's performance, capacity, and efficiency of operations. Let us understand with an example.

Let us suppose we have opened an e-commerce website. This website takes orders from customers and delivers them to the customers' countries. Its database has the fields customerID, orderID, country, date, and amount. Let us consider some of these fields as candidate shard keys:

orderID: If we choose orderID as our shard key and order IDs are assigned sequentially, we might end up with unbalanced shards. All new orders receive higher order IDs than the existing ones, so they are all directed to the shard that holds the highest range. That shard becomes overused while the others are underused.

country: Similarly, if we choose country as our shard key, a few countries might account for most of the orders. The shards holding those countries become overused while the others are underused. The main aim is to distribute the workload evenly across the shards.

customerID: If we choose customerID, and the orders come from many different customers, the data can be distributed much more evenly, because each range of customerID values can be directed to a different shard.

In the scenario above, customerID would be the more appropriate shard key because it has a high degree of entropy (variation): its values are well distributed and do not lean towards one particular value.

Of course, scenarios vary, and the shard key must be chosen carefully for each one. In general, a good shard key has values that are well distributed rather than skewed, which helps ensure that the data is spread evenly across the shards.
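The difference between the candidate keys can be made concrete with a small simulation. This is a sketch under assumptions, not a measurement of a real cluster: the three ranges, the shard names, and the synthetic orders (including the formula used to spread customerID values) are all invented for the example.

```python
from collections import Counter

# Hypothetical range assignment: (inclusive lower bound, shard name),
# sorted by lower bound. The last range is open-ended.
RANGES = [(1, "shard0"), (1001, "shard1"), (2001, "shard2")]

def shard_for(key):
    """Return the shard whose range contains the key."""
    owner = RANGES[0][1]
    for lower, shard in RANGES:
        if key >= lower:
            owner = shard
    return owner

def distribution(keys):
    """Count how many keys land on each shard."""
    return Counter(shard_for(key) for key in keys)

# 300 synthetic new orders: orderID keeps increasing, while customerID
# is spread over the existing customers 1..3000.
new_orders = [
    {"orderID": 3001 + i, "customerID": (i * 37) % 3000 + 1}
    for i in range(300)
]

by_order_id = distribution(order["orderID"] for order in new_orders)
by_customer_id = distribution(order["customerID"] for order in new_orders)

print(dict(by_order_id))     # every new order lands on shard2
print(dict(by_customer_id))  # orders are spread over all three shards
```

With orderID as the key, one shard absorbs all new writes. With customerID, the same orders are shared among the shards.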

Chunks

In MongoDB, "chunks" are fragments of sharded data. When you define a shard key for a collection, MongoDB uses it to divide the collection into chunks. By default, each chunk has a maximum size of 64 megabytes (this was the default before MongoDB 6.0; newer versions default to 128 megabytes). When a chunk grows beyond this limit, MongoDB automatically splits it into smaller chunks, and a background process called the balancer moves chunks between shards. This keeps the data uniformly distributed across all shards.

Each chunk covers a contiguous range of shard key values. If your shard key is a number, for instance, one chunk might contain all documents with values from 1 to 100, another chunk all documents with values from 101 to 200, and so on.

In the e-commerce example above, customerID was taken as the shard key. Let us suppose this field contains unique values from 1 to 100,000. MongoDB uses this shard key to divide the orders collection into chunks. One chunk could cover customerID values from 1 to 10,000, another from 10,001 to 20,000, and so on. These chunks are then distributed across the different shards in your MongoDB cluster.

If the number of customers keeps increasing, some chunks may grow beyond the 64-megabyte mark. When that happens, MongoDB automatically splits them into smaller chunks.
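Splitting can be pictured as cutting an oversized range in half until every piece fits under the limit. The sketch below is a simplified illustration and not MongoDB's actual split algorithm: the chunk representation, the function names, the choice of the median key as the split point, and the assumed 16 KB document size are all hypothetical.

```python
MAX_CHUNK_BYTES = 64 * 1024 * 1024  # the 64 MB limit used in the text

def chunk_size(chunk):
    """Total size in bytes of the documents in a chunk."""
    return sum(size for _, size in chunk["docs"])

def split_chunk(chunk):
    """Split a chunk in two at its median shard key value.

    A chunk is a dict with 'min' (inclusive), 'max' (exclusive) and
    'docs', a list of (shard_key, size_in_bytes) sorted by shard_key.
    """
    docs = chunk["docs"]
    split_key = docs[len(docs) // 2][0]
    left = {"min": chunk["min"], "max": split_key,
            "docs": [d for d in docs if d[0] < split_key]}
    right = {"min": split_key, "max": chunk["max"],
             "docs": [d for d in docs if d[0] >= split_key]}
    return left, right

def split_until_small(chunk, limit=MAX_CHUNK_BYTES):
    """Keep splitting until every chunk is within the limit."""
    if chunk_size(chunk) <= limit or len(chunk["docs"]) < 2:
        return [chunk]
    left, right = split_chunk(chunk)
    if not left["docs"]:
        # All documents share one key value, so the chunk cannot be split.
        return [chunk]
    return split_until_small(left, limit) + split_until_small(right, limit)

# One oversized chunk: customerID 1..10,000, assumed 16 KB per document.
big_chunk = {
    "min": 1,
    "max": 10_001,
    "docs": [(customer_id, 16 * 1024) for customer_id in range(1, 10_001)],
}

chunks = split_until_small(big_chunk)
print([(c["min"], c["max"]) for c in chunks])
# [(1, 2501), (2501, 5001), (5001, 7501), (7501, 10001)]
```

The original chunk holds about 156 MB, so it is split twice, and each of the four resulting chunks holds about 39 MB.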

Advantages of sharding

We have covered the components of a sharded cluster and the factors that affect it. Let us now discuss the advantages that sharding brings:

Scalability: Scalability refers to the ability of a system to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. In the context of databases, scalability typically refers to the system's capacity to accommodate more users, increase processing power, or hold more data as demand grows. Sharding distributes data across multiple systems, which allows the database to hold more data and serve more read and write requests by spreading the load among the shards.

High Availability: High availability refers to a system that stays operational continuously for a very long period. It is usually measured as a percentage of uptime, with a 100% system being always up. In a sharded cluster, each shard is typically deployed as a replica set, so its data is replicated across several servers. If one server in a shard fails, another member of that replica set can take over. Even if an entire shard becomes unavailable, the remaining shards can still serve the data they hold.

Performance Improvement: By distributing the data across multiple shards, sharding can help to reduce the number of operations each shard must handle. Each shard only manages a subset of the data, which can lead to increased performance.

Conclusion

  • Sharding in MongoDB is a method designed to manage massive datasets efficiently. It involves dividing a large database into smaller, more manageable parts known as "shards."

  • Sharded clusters consist of several components, with the primary one being "shards." Shards are smaller, more compact portions of a large database. They make data access more efficient and faster by splitting it into manageable chunks.

  • In a sharded setup, "mongos" instances act as query routers, directing client queries to the appropriate shards in a sharded cluster. They play a critical role in ensuring efficient data retrieval.

  • MongoDB uses "config servers" to store metadata and configuration settings for sharded clusters. These servers keep track of where data is stored in the cluster, helping mongos route queries effectively.

  • Shard keys are specific fields used to distribute a collection's documents across shards. Careful selection of shard keys is crucial, as it impacts system performance and data distribution among shards.

  • Sharding offers several advantages, including scalability, high availability, and performance improvement. It enables databases to handle increased workloads, ensures continuous operation in case of failures, and enhances query performance by distributing data across shards.

