MongoDB Sharding Guide
This article is part of our Academy Course titled MongoDB – A Scalable NoSQL DB.
In this course, you will get introduced to MongoDB. You will learn how to install it and how to operate it via its shell. Moreover, you will learn how to programmatically access it via Java and how to leverage Map Reduce with it. Finally, more advanced concepts like sharding and replication will be explained. Check it out here!
1. Introduction
Sharding is a technique used to split a large amount of data across multiple server instances. Nowadays, data volumes grow exponentially, and a single physical server is often not able to store and manage such a mass of data. Sharding helps to solve this issue by dividing the whole data set into smaller parts and distributing them across a large number of servers (or shards). Collectively, all shards make up the whole data set or, in terms of MongoDB, a single logical database.
2. Configuration
MongoDB supports sharding out of the box using a sharded cluster configuration, which consists of: config servers (three for production deployments), one or more shards (replica sets for production deployments), and one or more query routing processes.
- shards store the data: for high availability and data consistency reasons, each shard should be a replica set (we will talk more about replication in Part 5. MongoDB Replication Guide)
- query routers (mongos processes) interface with client applications and direct operations to the appropriate shard or shards (a sharded cluster can contain more than one query router to balance the load)
- config servers store the cluster’s metadata: the mapping of the cluster’s data set to the shards. The query routers use this metadata to target operations to specific shards. For high availability, a production sharded cluster should have three config servers, but for testing purposes a cluster with a single config server may be deployed
MongoDB shards (partitions) data on a per-collection basis. The sharded collections in a sharded cluster should be accessed through the query routers (mongos processes) only: a direct connection to a shard gives access to only a fraction of the cluster’s data. Additionally, every database has a so-called primary shard that holds all the unsharded collections of that database.
Config servers are special mongod instances that store the metadata for a single sharded cluster. Each cluster must have its own config servers. Config servers use a two-phase commit protocol to ensure consistency and reliability. All config servers must be available to deploy a sharded cluster or to make any changes to the cluster metadata; the cluster becomes inoperable without its metadata.
3. Sharding (Partitioning) Schemes
MongoDB distributes data across shards at the collection level by partitioning a collection’s data by the shard key (for more details, please refer to official documentation).
The shard key is an indexed simple or compound field that exists in every document in the collection. Using either range-based partitioning or hash-based partitioning, MongoDB splits the shard key values into chunks and distributes those chunks evenly across the shards. To do that, MongoDB uses two background processes:
- splitting: a background process that keeps chunks from growing too large. When a particular chunk grows beyond the configured chunk size (by default, 64 MB), MongoDB splits the chunk in half. Inserts and updates of the sharded collection may trigger splits.
- balancer: a background process that manages chunk migrations. The balancer runs on every query router in a cluster. When the distribution of a sharded collection in a cluster is uneven, the balancer moves chunks from the shard with the largest number of chunks to the shard with the fewest chunks until balance is reached.
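The balancer’s strategy above can be sketched as follows. This is a simplified illustration, not MongoDB’s actual implementation; the function name and the “difference of at most one chunk” stopping rule are assumptions made for the example.

```javascript
// Simplified sketch: repeatedly move one chunk from the most-loaded shard
// to the least-loaded one until the counts differ by at most one chunk.
function balance(chunkCounts) {
  const counts = { ...chunkCounts };
  const moves = [];
  for (;;) {
    const shards = Object.keys(counts);
    // Find the shards with the most and the fewest chunks.
    const most = shards.reduce((a, b) => (counts[a] >= counts[b] ? a : b));
    const fewest = shards.reduce((a, b) => (counts[a] <= counts[b] ? a : b));
    if (counts[most] - counts[fewest] <= 1) break; // balanced enough
    counts[most] -= 1; // migrate one chunk
    counts[fewest] += 1;
    moves.push({ from: most, to: fewest });
  }
  return { counts, moves };
}
```

For example, `balance({ shard0000: 6, shard0001: 0 })` performs three migrations and ends with three chunks on each shard.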
3.1. Range-based sharding (partitioning)
In range-based sharding (partitioning), MongoDB splits the data set into ranges determined by the shard key values. For example, if the shard key is a numeric field, MongoDB partitions the field’s whole value range into smaller, non-overlapping intervals (chunks).
Although range-based sharding (partitioning) supports more efficient range queries, it may result in uneven data distribution.
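Range-based lookup can be sketched as follows; the chunk boundaries and shard names are made up for illustration, and chunks cover half-open intervals [min, max):

```javascript
// Sketch of range-based partitioning: the shard key's value range is split
// into non-overlapping chunks, and a document goes to the chunk whose
// range contains its key. Boundaries and shard names are illustrative.
const chunks = [
  { min: -Infinity, max: 100, shard: "shard0000" },
  { min: 100, max: 200, shard: "shard0000" },
  { min: 200, max: Infinity, shard: "shard0001" },
];

function shardForKey(key) {
  // Each chunk covers [min, max): min inclusive, max exclusive.
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  return chunk.shard;
}
```

Note why range queries are efficient here: all keys between 120 and 180 fall into the same chunk, so a range query touches a single shard. The downside is that monotonically increasing keys pile up in the last chunk, producing the uneven distribution mentioned above.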
3.2. Hash-based sharding (partitioning)
In hash-based sharding (partitioning), MongoDB computes a hash of the shard key value and then uses these hashes to split the data between shards. Hash-based partitioning ensures an even distribution of data but leads to inefficient range queries.
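The idea can be sketched as follows; the hash function below is a toy stand-in for MongoDB’s real hashed index, and the modulo routing is a simplification of chunk assignment:

```javascript
// Sketch of hash-based partitioning: a hash of the shard key (not the raw
// value) decides placement, so adjacent keys usually land on different
// shards. toyHash is illustrative only, not MongoDB's hash function.
function toyHash(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple rolling hash
  }
  return h;
}

function shardForHashedKey(key, shards) {
  return shards[toyHash(key) % shards.length];
}
```

Because placement depends on the hash rather than the key’s order, consecutive key values scatter across shards, which is exactly why a range query must consult many shards.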
4. Adding / Removing Shards
MongoDB allows adding shards to and removing shards from a running sharded cluster. When new shards are added, the cluster becomes unbalanced, since the new shards have no chunks. While MongoDB begins migrating data to the new shards immediately, it can take a while before the cluster regains balance.
Conversely, when shards are being removed from the cluster, the balancer migrates all chunks from those shards to other shards. Once the migration is complete, the shards can be safely removed.
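The draining step can be sketched as follows; this is an illustration of the idea only (function name and placement policy are assumptions), not MongoDB’s actual migration logic:

```javascript
// Sketch of draining a shard before removal: every chunk is migrated off
// the shard being removed, each to the remaining shard with the fewest
// chunks; only then is the shard safe to remove.
function drainShard(chunkCounts, removed) {
  const counts = { ...chunkCounts };
  const remaining = Object.keys(counts).filter((s) => s !== removed);
  while (counts[removed] > 0) {
    // Pick the remaining shard with the fewest chunks as the target.
    const target = remaining.reduce((a, b) => (counts[a] <= counts[b] ? a : b));
    counts[removed] -= 1;
    counts[target] += 1;
  }
  return counts;
}
```

For example, draining a shard holding four chunks spreads them over the remaining shards until the drained shard holds zero chunks.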
5. Configuring Sharded Cluster
Now that we understand the MongoDB sharding basics, we are going to deploy a small sharded cluster from scratch using a simplified (test-like) configuration without replica sets and with a single config server instance.
The deployment of a sharded cluster begins with running a mongod server process in config server mode using the --configsvr command line argument. Each config server requires its own data directory to store a complete copy of the cluster’s metadata, which should be provided using the --dbpath command line argument (which we have already seen in Part 1. MongoDB Installation – How to install MongoDB). That being said, let us perform the following steps:
- Create a data folder (we need to do that only once):
mkdir configdb
- Run MongoDB server:
bin/mongod --configsvr --dbpath configdb
By default, config servers listen on port 27019. All config servers should be up and running before starting up a sharded cluster. In our deployment, the single config server runs on a host named ubuntu.
The next step is to run the instances of the mongos processes. Those are lightweight and only need to know the location of the config servers. For that, the --configdb command line argument is used, followed by a comma-separated list of config server names (in our case, only a single config server on host ubuntu):
bin/mongos --configdb ubuntu
The default port the mongos process listens on is 27017, the same as a usual MongoDB server.
The final step is to add some regular MongoDB server instances (shards) to the sharded cluster. Let us start two MongoDB server instances on different ports, 27000 and 27001 respectively.
- Run first MongoDB server instance (shard) on port 27000
mkdir data-shard1
bin/mongod --dbpath data-shard1 --port 27000
- Run second MongoDB server instance (shard) on port 27001
mkdir data-shard2
bin/mongod --dbpath data-shard2 --port 27001
Once the standalone MongoDB server instances are up and running, it is time to add them to the sharded cluster with the help of the MongoDB shell. Up to now, we have used the MongoDB shell to connect to standalone MongoDB servers. In a sharded cluster, the mongos processes are the entry points for all clients, so we are going to connect to the one we have just run (on the default port 27017):
bin/mongo --port 27017 --host ubuntu
The MongoDB shell provides a useful command, sh.addShard(), for adding shards (please refer to Shell Sharding Command Helpers for more details). Let us issue this command against the two standalone MongoDB server instances we have run before:
sh.addShard( "ubuntu:27000" )
sh.addShard( "ubuntu:27001" )
Let us check the current sharded cluster configuration by issuing another very helpful MongoDB shell command, sh.status().
At this point the sharded cluster infrastructure is configured and we can move on with sharding database collections.
5.1. Shard tags
MongoDB allows associating specific ranges of a shard key with a specific shard or subset of shards by assigning tags. A single shard may have multiple tags, and multiple shards may share the same tag, but any given shard key range may have only one assigned tag. Overlapping ranges are not allowed, nor is tagging the same range more than once.
For example, let us assign a tag to each of the shards in our sharded cluster by using the sh.addShardTag() command:
sh.addShardTag( "shard0001", "bestsellers" )
sh.addShardTag( "shard0000", "others" )
To find all shards associated with a particular tag, a query should be issued against the config database. For example, to find all shards tagged as “bestsellers”, the following commands should be typed in the MongoDB shell:
use config
db.shards.find( { tags: "bestsellers" } )
Another collection in the config database, named tags, contains all tag definitions and may be queried in the regular way for all available tags.
Respectively, tags can be removed from a shard using the sh.removeShardTag() command. For example:
sh.removeShardTag( "shard0000", "others" )
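Conceptually, a tag restricts which shards may receive chunks from a given key range. The sketch below illustrates that rule using the “bestsellers”/“others” tags from above; the key ranges and the routing function are made up for the example and do not reflect MongoDB’s internals:

```javascript
// Sketch of tag-aware placement: a chunk's key range maps to a tag, and
// only shards carrying that tag are eligible to hold the chunk.
// Tag names match the example above; ranges are illustrative.
const shardTags = {
  shard0000: ["others"],
  shard0001: ["bestsellers"],
};

const tagRanges = [
  { min: 0, max: 1000, tag: "bestsellers" }, // [min, max)
  { min: 1000, max: Infinity, tag: "others" },
];

function eligibleShards(key) {
  const range = tagRanges.find((r) => key >= r.min && key < r.max);
  return Object.keys(shardTags).filter((s) => shardTags[s].includes(range.tag));
}
```

With this setup, the balancer would only ever place chunks from the “bestsellers” range on shard0001, which is how tags let you pin hot data to specific hardware.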
6. Sharding databases and collections
As we already know, MongoDB performs sharding at the collection level. But before a collection can be sharded, sharding must be enabled for the collection’s database. Enabling sharding for a database does not redistribute data but makes it possible to shard the collections in that database.
For demonstration purposes, we are going to reuse the bookstore example from Part 3. MongoDB and Java Tutorial and make the books collection shardable. Let us reconnect to the mongos instance, additionally providing the database name:
bin/mongo --port 27017 --host ubuntu bookstore
(or just issue the command use bookstore in the existing MongoDB shell session) and insert a couple of documents into the books collection.
db.books.insert( { "title" : "MongoDB: The Definitive Guide", "published" : "2013-05-23", "categories" : [ "Databases", "NoSQL", "Programming" ], "publisher" : { "name" : "O'Reilly" } } )
db.books.insert( { "title" : "MongoDB Applied Design Patterns", "published" : "2013-03-19", "categories" : [ "Databases", "NoSQL", "Patterns", "Programming" ], "publisher" : { "name" : "O'Reilly" } } )
db.books.insert( { "title" : "MongoDB in Action", "published" : "2011-12-16", "categories" : [ "Databases", "NoSQL", "Programming" ], "publisher" : { "name" : "Manning" } } )
db.books.insert( { "title" : "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", "published" : "2012-08-18", "categories" : [ "Databases", "NoSQL" ], "publisher" : { "name" : "Addison Wesley" } } )
As we mentioned before, sharding is not yet enabled for either the bookstore database or the books collection, so the whole dataset ends up on the primary shard (as mentioned in the Introduction section). Let us enable sharding for the bookstore database by issuing the command sh.enableSharding( "bookstore" ) in the MongoDB shell.
We are getting very close to having our sharded cluster actually do some work and to sharding real collections. Each collection should have a shard key (please refer to the Sharding (Partitioning) Schemes section) in order to be partitioned across multiple shards. Let us define a hash-based shard key (please refer to the Hash-based sharding (partitioning) section) for the books collection, based on the document _id field.
db.books.ensureIndex( { "_id": "hashed" } )
With that, let us tell MongoDB to shard the books collection using sh.shardCollection()
command (please see Sharding commands and command helpers for more details).
sh.shardCollection( "bookstore.books", { "_id": "hashed" } )
And with that, the sharded cluster is up and running! In the next section we are going to look closely at the commands available specifically to manage sharded deployments.
7. Sharding commands and command helpers
The MongoDB shell provides command helpers under the sh context variable to simplify sharding management and deployment. The ones we have used throughout this part include sh.addShard(), sh.status(), sh.addShardTag(), sh.removeShardTag(), sh.enableSharding() and sh.shardCollection().
8. What’s next
In this part we have covered the basics of MongoDB’s sharding (partitioning) capabilities. For more complete and comprehensive details, please refer to the official documentation. In the next part we are going to take a look at MongoDB replication.