MongoDB Indexes

When you're developing an application, it's common to only work with a small quantity of data. The amount of data your production system encounters may not be available or may simply be unmanageable on development systems. Developing with a limited amount of data may mask eventual performance problems. Your queries respond quickly with a few hundred megabytes of sample data, but query latency may skyrocket when users query over a few hundred gigabytes. This article looks at using indexes in your MongoDB collections to address performance problems, as well as how to optimize your indexes.

What are Indexes?

You can think of an index in MongoDB, or its relational counterparts, like a book index. If you're looking for a specific topic in a textbook without an index, your only option is to flip through every page to see if that page matches your interests. Depending on the size of the book, this can get tedious and time-consuming. MongoDB has to do the same thing when you query a collection: each document in the collection has to be examined and matches are added to the result set.

The search changes drastically with an index. The index is a tree-like structure containing the values you specify. To keep going with our textbook example, a topic index might look something like this (topics and page numbers are illustrative):

    Aggregation ........ 211-230
    Indexes ............ 87-104
    Replication ........ 183-210

As you can see, an index makes finding what you're looking for much faster and simpler. Let's look at how indexes impact an application with some real-world data.
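
Before diving into real data, here is a minimal Python sketch of the idea (not MongoDB internals - MongoDB uses B-tree indexes, and the field name and documents below are invented): a full scan examines every document, while an index maps values directly to the matching documents.

```python
# A toy "collection" of documents. Field names and data are invented
# for illustration; this is not how MongoDB stores anything.
docs = [
    {"author": "alice", "text": "first post"},
    {"author": "bob", "text": "hello"},
    {"author": "alice", "text": "second post"},
]

def full_scan(collection, author):
    """Examine every document, like a query on a non-indexed field."""
    return [d for d in collection if d["author"] == author]

def build_index(collection, field):
    """Map each value of `field` to the positions of matching documents."""
    index = {}
    for pos, doc in enumerate(collection):
        index.setdefault(doc[field], []).append(pos)
    return index

def indexed_find(collection, index, author):
    """Jump straight to the matching documents via the index."""
    return [collection[pos] for pos in index.get(author, [])]

index = build_index(docs, "author")
# Both approaches return the same documents; the indexed lookup just
# avoids touching the non-matching ones.
assert full_scan(docs, "alice") == indexed_find(docs, index, "alice")
```

MongoDB's B-trees also keep values sorted, which is what makes range queries and sorted results cheap - something this hash-map sketch doesn't capture.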

Applying Indexes

To get a suitable amount of data to work with, I pulled in some content from Twitter and added it to a collection in MongoDB. (You can find the script here.) Overall, I ended up putting about 50,000 tweets into the collection. Each tweet has the following structure:

{
    "iso_language_code": "en",
    "to_user_name": null,
    "to_user_id_str": "0",
    "profile_image_url_https": "https://si0.twimg.com/profile_images/2361109664/ndnl4mo5wytiyjq57qhn_normal.jpeg",
    "from_user_id_str": "499977106",
    "source": "<a href=\"http://twitter.com/\">web</a>",
    "text": "RT @openshift: New Blog: For you techies - @NamDuong_rh put together the best tips and tricks of the week in his weekly review - http://t.co/i6pK729T",
    "from_user_name": "Jimmy Guerrero",
    "profile_image_url": "http://a0.twimg.com/profile_images/2361109664/ndnl4mo5wytiyjq57qhn_normal.jpeg",
    "id": NumberLong("235068500878716929"),
    "to_user": null,
    "entities": {
        "user_mentions": [{
            "id": 17620820,
            "indices": [3, 13],
            "id_str": "17620820",
            "screen_name": "openshift",
            "name": "OpenShift by Red Hat"
        }, {
            "id": 293693321,
            "indices": [43, 55],
            "id_str": "293693321",
            "screen_name": "NamDuong_rh",
            "name": "Nam Duong"
        }],
        "hashtags": [],
        "urls": [{
            "url": "http://t.co/i6pK729T",
            "indices": [129, 149],
            "expanded_url": "http://red.ht/R6vpz3",
            "display_url": "red.ht/R6vpz3"
        }]
    },
    "id_str": "235068500878716929",
    "from_user": "paasdude",
    "from_user_id": 499977106,
    "to_user_id": 0,
    "geo": null,
    "created_at": "Mon, 13 Aug 2012 17:41:12 +0000",
    "metadata": {
        "result_type": "recent"
    }
}

If I run a simple query for tweets from OpenShift's (in)famous Jimmy Guerrero (user name paasdude), the query has to scan all of the documents in the collection. We can see this by appending explain() to the query:

db.tweets.find({'from_user':'paasdude'}).explain();

{
    "cursor" : "BasicCursor",
    "nscanned" : 51748,
    "nscannedObjects" : 51748,
    "n" : 35,
    "millis" : 40,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
    }
}

The results of explain() describe the details of how MongoDB executes the query. Some of the relevant fields are:

  • cursor: A result of BasicCursor indicates a non-indexed query. If we had used an indexed query, the cursor would have a type of BtreeCursor.
  • nscanned and nscannedObjects: The distinction between these two similar fields is subtle but important. nscannedObjects is the total number of documents scanned by the query. nscanned is the total number of items examined, including index entries as well as documents. Depending on the query, nscanned can therefore be greater than nscannedObjects.
  • n: The number of matching objects.
  • millis: Query execution time, in milliseconds.

The point here should be clear - a simple query results in the database having to scan every document in the collection. The query only took 40ms, but we're also only dealing with about 50,000 documents. This duration will increase as the collection size increases.

What happens when an index is added to the from_user field? The difference is dramatic. We can add an index with the ensureIndex(...) command (the 1 specifies an ascending index; -1 would specify descending):

db.tweets.ensureIndex({'from_user' : 1});

With the field indexed, our query execution is dramatically different:

db.tweets.find({'from_user':'paasdude'}).explain();

{
    "cursor" : "BtreeCursor from_user_1",
    "nscanned" : 35,
    "nscannedObjects" : 35,
    "n" : 35,
    "millis" : 3,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "from_user" : [
            [
                "paasdude",
                "paasdude"
            ]
        ]
    }
}

The query still matches 35 results, but only 35 objects were scanned and the query took 3ms. This is quite an improvement over 40ms to scan more than 50,000 documents!

Indexes can clearly improve performance in your MongoDB application, but there are some guidelines to keep in mind when indexing your collections. These guidelines are covered below.

Index Guidelines

Like everything else in the NoSQL world, there is no silver bullet for performance issues, and indexes are no exception. This section outlines some considerations to keep in mind, starting with what to index.

Indexes in MongoDB are similar to their relational counterparts. You'll get the most value out of indexing the most distinct and frequently queried fields on your collections. For example, some ORM frameworks will automatically create indexes on the type of object stored in the collection. If you're only storing a single type of object per collection (which I recommend), these indexes provide no value to the query engine, since all values are the same. Only index distinct and frequently queried fields. This is also relevant when creating and updating indexes.
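
Continuing the toy dict-as-index sketch from earlier (field names invented, not MongoDB internals), here's why a low-cardinality field makes a poor index: when every document has the same value, the index collapses into a single entry that points at every document, so a lookup still touches everything.

```python
# Sketch: index a field where every document has the same value.
# The field name "type" and the data are invented for illustration.
docs = [{"type": "tweet", "n": i} for i in range(1000)]

index = {}
for pos, doc in enumerate(docs):
    index.setdefault(doc["type"], []).append(pos)

# The index has exactly one entry, and that entry points at every
# document - using it is a full scan in disguise.
assert len(index) == 1
assert len(index["tweet"]) == len(docs)
```

A distinct field (like a user name) splits the documents into many small buckets, so each lookup touches only a small fraction of the collection - that's where an index earns its keep.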

When it comes to indexes, it's likely you are primarily concerned with query performance. However, indexes have to be maintained. Each time a document is inserted or updated, associated index entries must also be updated. Similarly, index entries have to be found and deleted when documents are removed. This index maintenance can impact write performance. You're best served creating indexes for your most frequently used queries.
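
The write overhead is easy to see in the same toy model (field names invented; real MongoDB maintains B-trees, not dicts): with two indexes on a collection, every insert performs two extra bookkeeping updates in addition to storing the document.

```python
# Sketch of index maintenance cost: each insert must update every index.
docs = []
# Two toy single-field indexes; field names are invented for illustration.
indexes = {"from_user": {}, "created_at": {}}

def insert(doc):
    pos = len(docs)
    docs.append(doc)  # the write you asked for
    # One extra update per index - this is the maintenance overhead
    # that grows with the number of indexes on the collection.
    for field, idx in indexes.items():
        idx.setdefault(doc[field], []).append(pos)

insert({"from_user": "paasdude", "created_at": "2012-08-13"})
insert({"from_user": "paasdude", "created_at": "2012-08-14"})

# Both index structures were kept in sync on every insert.
assert indexes["from_user"]["paasdude"] == [0, 1]
```

Deletes pay a similar tax: each index entry for the removed document has to be found and cleaned up, which is why indexing every field "just in case" hurts write-heavy workloads.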

Finally, you'll want to periodically review your indexes. As application usage changes, you may find your previously valuable indexes are just slowing things down. The problem is that MongoDB doesn't provide great visibility into index statistics. To address this, Jason Wilder created an excellent set of scripts to make sense of the information MongoDB provides. You can find his mongodb-tools here: https://github.com/jwilder/mongodb-tools.

Conclusion

Index management is a central part of your MongoDB application. Knowing what to index based on your application's usage and the potential performance impact should drive your decisions. MongoDB's explain() function, as well as Jason Wilder's tools, can provide valuable insight and guidance.
