MongoDB for Developers with Python Transcripts
Chapter: High-performance MongoDB
Lecture: MongoDB's performance knobs
0:01
You've heard MongoDB is fast, really fast, and you've gone through setting up your documents and modeling things,
0:08
you inserted, you imported your data, and you're ready to go; and you run a query and it comes back, so okay, I want to find all the service histories
0:16
that have a certain price, greater than such and such, how many are there? Apparently there are 989, but it took almost a second to answer that question.
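As a rough sketch of what a query like this could look like from Python with PyMongo: the field path `service_history.price`, the database and collection names, and the price threshold are all assumptions for illustration, not values taken from the lecture. Note that this filter counts cars that have at least one matching service record; counting the nested records themselves would need an aggregation.

```python
import time


def build_price_filter(min_price):
    """Filter matching documents with a nested service record above min_price.

    'service_history.price' is an assumed field path for embedded records.
    """
    return {"service_history.price": {"$gt": min_price}}


def timed(fn):
    """Run fn() and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def count_cars_with_expensive_service(collection, min_price):
    """Count matching parent documents using PyMongo's count_documents."""
    return collection.count_documents(build_price_filter(min_price))


# Usage against a running server (hypothetical names, not executed here):
#   from pymongo import MongoClient
#   cars = MongoClient()["dealership"]["cars"]
#   count, ms = timed(lambda: count_cars_with_expensive_service(cars, 15000))
#   print(count, "matches in", round(ms, 1), "ms")
```

Timing the call from the application side like this includes network and driver overhead, so it is only a coarse way to compare before and after.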
0:23
This is a new version of the database, which we are going to talk about shortly. Instead of having just a handful of cars and service histories
0:31
that we maybe entered in our little play-around app, it has a quarter million cars with a million service histories, something to that effect.
0:39
And the fact that we were able to answer this query of how many sort of nested documents had this property
0:45
in less than a second, on one hand that's kind of impressive, but to be honest, it feels like MongoDB is just dragging,
0:52
this is not very special, this is not great. So this is what you get out of the box, if you just follow what we've done so far
1:00
this is how MongoDB is going to perform. However, in this chapter, we're going to make this better, a lot better.
1:07
How much? Well, let's see. We're going to make it fast: here's that same query after applying just some of the techniques of this chapter.
1:14
Notice now it runs in one millisecond, not 706 milliseconds. So we've made our MongoDB just take off,
1:22
it's running over 700 times faster than what the default MongoDB does. Well, how do we do it, how do we make this fast?
1:31
Let's have a look at the various knobs that we can turn to control MongoDB performance. Some of which we're going to cover in this course,
1:39
and some are well beyond the scope of what we're doing, but it's still great to know about them. The first knob is indexes, so it turns out
1:45
that there are not too many indexes added to MongoDB by default, in fact, the only index that gets set up is on _id
1:53
which is basically an index as well as a uniqueness constraint, but other than that, there are no indexes,
1:58
and it might be a little nonintuitive when you first hear about this,
2:03
but indexes and manually tuning and tweaking and understanding the indexes in document databases is far more important
2:11
than understanding indexes in a third normal form designed relational database. So why would that be? That seems really odd.
2:19
So think about a third normal form database, you've broken everything up into little tiny tables that link back to each other
2:25
and they often have foreign key constraints traversing all of these relationships,
2:29
well, those foreign key constraints go back to primary keys on the main tables, those are indexed, every time you have one of those relationships
2:36
it usually at least on one end has an index on that thing. In document databases, because we take some of those external tables
2:44
and we embed them in documents, those subdocuments while they kind of logically play the same role
2:50
there is no concept of an index being added to those. So we have fewer tables, but we still have basically the same number of relationships
2:58
and because of the way documents work, we actually have fewer indexes than we do in say a relational database.
3:05
So we're going to see that working with, understanding, and exploring indexes is super, super important
3:10
and that's going to be the most important thing that we do. In fact, the MongoDB folks, one of their things they do is
3:17
they sell like services, consulting and what not to help their customers and you could hire them, say hey I got this big cluster and it's slow
3:25
can you help me make it faster? The single most dramatic thing that they do, the thing that is almost always the problem, is incorrect use of indexes.
3:35
So we're going to talk about how to use, discover and explore indexes for sure.
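A minimal sketch of adding an index and checking whether a query uses it, assuming PyMongo and the same hypothetical `service_history.price` field path. The exact shape of `explain()` output varies by server version, so the helper below just walks the winning plan looking for stage names; the two sample plans are hand-written and simplified, not real server output.

```python
def ensure_price_index(collection):
    """Create an ascending index on the nested price field (assumed path).

    PyMongo's create_index returns the index name and does nothing new
    if an identical index already exists.
    """
    return collection.create_index([("service_history.price", 1)])


def collect_stages(node):
    """Recursively gather every 'stage' name found in a plan fragment."""
    stages = []
    if isinstance(node, dict):
        if isinstance(node.get("stage"), str):
            stages.append(node["stage"])
        for value in node.values():
            stages.extend(collect_stages(value))
    elif isinstance(node, list):
        for item in node:
            stages.extend(collect_stages(item))
    return stages


def uses_index(explain_output):
    """True if the winning plan contains an index scan (IXSCAN) stage."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "IXSCAN" in collect_stages(winning)


# Hand-written, simplified examples of the shape explain() can return.
unindexed_plan = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
indexed_plan = {
    "queryPlanner": {
        "winningPlan": {"stage": "FETCH", "inputStage": {"stage": "IXSCAN"}}
    }
}

# Usage against a running server (hypothetical names, not executed here):
#   ensure_price_index(cars)
#   plan = cars.find({"service_history.price": {"$gt": 15000}}).explain()
#   print(uses_index(plan))
```

A `COLLSCAN` stage means the server read every document to answer the query, which is the situation the lecture describes as the default.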
3:39
Next is document design, all that discussion about to embed or not to embed,
3:44
how should you relate documents, this is sort of the beginning of this conversation,
3:48
it turns out the document design has dramatic implications across the board
3:53
and we did talk quite a bit about this, but we'll touch on it again in this chapter. Query style, how are you writing your queries,
4:02
is there a way that you could maybe restructure a query, or ask the question differently and end up with
4:09
a higher-performance query; maybe one version misses an index and the other uses a better index, or something to this effect.
4:17
Projections and subsets are also something that we can control, remember when we talked about the JavaScript API
4:24
we saw that you could limit your set of returned responses and this can be super helpful for performance;
4:30
you could write a query where it returns 5 MB of data but if you restrict that to just the few fields that you actually care about
4:37
maybe it's only kilobytes instead of 5 MB, it could be really dramatic, depending on how large and nested your documents might be.
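To make the projection idea concrete, here is a small sketch. The field names and the sample document are made up for illustration. `build_projection` produces the kind of dictionary PyMongo's `find` accepts as its second argument, and the local size comparison only approximates what would travel over the network.

```python
import json


def build_projection(fields, include_id=False):
    """Build a projection that returns only the named fields.

    MongoDB includes _id unless it is explicitly excluded with 0.
    """
    projection = {name: 1 for name in fields}
    if not include_id:
        projection["_id"] = 0
    return projection


def project_locally(doc, fields):
    """Imitate a projection on an in-memory dict, for size comparison only."""
    return {name: doc[name] for name in fields if name in doc}


# Made-up document with a large nested list, as in the embedded design.
sample_car = {
    "make": "ExampleMake",
    "model": "ExampleModel",
    "year": 2017,
    "service_history": [
        {"price": 100 + i, "description": "x" * 200} for i in range(50)
    ],
}

full_size = len(json.dumps(sample_car))
small_size = len(json.dumps(project_locally(sample_car, ["make", "model"])))
print("full:", full_size, "bytes; projected:", small_size, "bytes")

# Usage against a running server (hypothetical names, not executed here):
#   cars.find({"year": 2017}, build_projection(["make", "model"]))
# With MongoEngine, assuming a Car document class:
#   Car.objects(year=2017).only("make", "model")
```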
4:44
We're going to talk about how we can do this, especially from MongoEngine. These are the knobs that we're going to turn in this course,
4:50
these are the things that will work even if you have a single individual database, so you should always think about these things,
4:57
some of them happen on the database side, document design, indexes,
5:00
and the other two, query style and projections, happen in your application as it interacts with the database,
5:05
but MongoDB, being a NoSQL database, allows for other types of interactions, other configurations and network topologies and so on.
5:12
So, one of the things that it supports is something called replication, now replication is largely responsible for redundancy and failover.
5:20
Instead of just having one server I could have three servers, and they could work in triplicate, basically one is what's called the primary,
5:27
and you read and write from this database, and the other two are just there ready to spring into action,
5:32
always getting themselves in sync with the primary, and if the primary goes down, one of the others will step in to be the primary
5:37
and they will sort of fix themselves as what used to be the primary comes back. There is no performance benefit from that at all.
5:44
However, there are ways to configure your connection to say allow me to read not just from the primary one, but also from the secondary,
5:51
so you can configure replication for a performance boost, but mostly this is a durability thing.
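One way this read configuration can be expressed is through the connection string. The sketch below builds a URI with the standard `replicaSet` and `readPreference` options; the host names and the replica set name `rs0` are placeholders, not values from the course.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, read_preference="secondaryPreferred"):
    """Build a MongoDB URI that allows reads from secondaries.

    'secondaryPreferred' reads from a secondary when one is available and
    falls back to the primary otherwise. Secondary reads can lag slightly
    behind the primary.
    """
    options = urlencode(
        {"replicaSet": replica_set, "readPreference": read_preference}
    )
    return "mongodb://" + ",".join(hosts) + "/?" + options


# Placeholder hosts for a three-member replica set.
uri = replica_set_uri(
    ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
    "rs0",
)
print(uri)

# Usage with a real replica set (not executed here):
#   from pymongo import MongoClient
#   client = MongoClient(uri)
```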
5:56
The other type of network configuration you can do is what's called sharding.
6:00
This is where, instead of putting all your data on one individual server, you might spread it across 10 or 20 servers,
6:07
one 20th on each, hopefully evenly balanced across all of them, and then when you issue a query,
6:13
the cluster can either figure out, if the query is based on the shard key, which server to point it at and let that one
6:18
handle the query across the smaller set of data, or if it's general like show me all the things with greater than this for the price,
6:24
it might need to fan that out to all 20 servers, but it would run in parallel on 20 machines. So sharding is all about speeding up performance,
6:33
especially write performance, but also queries as well, so you can get tons of scalability out of sharding,
6:39
and you can even combine these: when I said there are 20 shards, each one of those could actually be a replica set,
6:44
so there is a lot of stuff you could do with network topology and clustering and sharding and scaling and so on.
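The difference between a query that includes the shard key and one that does not can be illustrated with a toy model in plain Python. This is only a simulation of the routing idea described above, not how MongoDB implements it; the `vin` shard key, the four shards, and the generated cars are all invented for the example.

```python
import zlib

NUM_SHARDS = 4  # the lecture imagines 10 or 20; 4 keeps the demo small


def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a shard by hashing the shard key value (toy hashing scheme)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def build_cluster(docs, shard_key, num_shards=NUM_SHARDS):
    """Distribute documents across shards according to their shard key."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc[shard_key], num_shards)].append(doc)
    return shards


def find_by_shard_key(shards, shard_key, value):
    """Targeted query: only the one shard that can hold the value is searched."""
    index = shard_for(value, len(shards))
    hits = [doc for doc in shards[index] if doc[shard_key] == value]
    return hits, 1  # (results, number of shards touched)


def find_by_predicate(shards, predicate):
    """Fan-out query: every shard is searched and the results are merged."""
    hits = []
    for shard in shards:
        hits.extend(doc for doc in shard if predicate(doc))
    return hits, len(shards)


# Invented data: 100 cars keyed by a made-up 'vin' field.
cars = [{"vin": f"VIN{i:04d}", "price": 1000 * i} for i in range(100)]
cluster = build_cluster(cars, "vin")

by_key, touched_by_key = find_by_shard_key(cluster, "vin", "VIN0042")
by_price, touched_by_price = find_by_predicate(cluster, lambda d: d["price"] > 90000)
print("by shard key:", len(by_key), "hit, shards touched:", touched_by_key)
print("by price:", len(by_price), "hits, shards touched:", touched_by_price)
```

In this model the fan-out loop visits shards one after another; in a real cluster each shard would work on its slice at the same time, which is where the parallel speedup comes from.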
6:49
We're not turning those knobs in this course, I'll show you how to make individual pieces fast, the same idea applies to these replicas and shards,
6:55
just on a much grander scale if you want to go look at them.