MongoDB for Developers with Python Transcripts
Chapter: High-performance MongoDB
Lecture: Adding indexes in mongoenegine
0:08 I think it's preferable to do this in MongoEngine because that means simply pushing your code into production will ensure
0:15 that the database has all the right indexes set up for to operate correctly. You theoretically could end up with too many,
0:22 if you have one in code and then you take it out but you can always manage that from the shell,
0:27 this way at least the indexes that are required will be there. I dropped all the indexes again, let's go back through our questions here
0:34 and see how we're doing. It says how many owners, how many cars, this is just based on the natural sort however it's in the database
0:42 there's really nothing to do here, but this one, find the 10 thousandth car by owner, let's look at that;
0:49 that is going to basically be this name, we'll use test, it doesn't really matter what we put here
0:58 if we put explain, this should come back as column scan or something like that,
1:02 yeah, no indexes, okay, so how long did it take to answer that question? Find the 10 thousandth owner by name,
1:13 it didn't say by name, I'll go and add by name, well that took 300 milliseconds, well that seems bad and look we're actually using sorting,
1:25 we're actually using paging skip and limit those types of things here, but in order for that to mean anything, we have to sort it,
1:32 it's really the sort that we're running into. Maybe I should change this, like so,
1:39 sort like so, we could just put one, I guess it's the way we're sorting it, so here you can see down there the sort pattern name is one
1:50 and guess what, we're still doing column scan. Any time you want to do a filter by, a greater than, an equality,
1:57 or you want to do a sort, you need an index. Let's go over to the owner here, this is the owner class and let's add the ability to sort it by name or
2:09 equivalently also do a filter like find exactly by name, so we're going to come down here we're going to add another thing to this meta section,
2:17 and we're going to add indexes, and indexes are a list of indexes, now this is going to be simple strings or they can be complex subdictionaries,
2:32 for composite indexes or uniqueness constraints, things like that, but for name all we need is name. Let's run this, first of all, let's go over here
2:42 and notice, if I go to owners and refresh, no name, let's run this code, find the 10 thousandth owner by name, 19 milliseconds, that's pretty good,
2:56 let me run it one more time, 15 yeah okay, so that seems pretty stable, and let's go over here and do a refresh, hey look there's one by name;
3:04 we can see it went from what was that, something like 300 milliseconds to 15 milliseconds, so that's good.
3:12 How many cars are owned by the 10 thousandth owner, so that's 3 milliseconds, but let's go ahead and have a look at this question anyway.
3:20 How many cars are owned by the 10 thousandth owner, so here's this function right here that we're calling
3:26 it doesn't quite fit into a lambda expression, so we put it up here so we want to go and find the owner by id,
3:31 that should be indexed right, that should be indexed right there because it's the id, the id always says an index,
3:37 and now we are saying the id is in this set, so we're doing two queries, but both of them are hitting the id thing,
3:45 so those should both be indexed and 3 milliseconds, well that really seems to indicate that that's the case.
3:51 How many owners own the 10 thousandth car, that is right here. So we'll go find the car, ask how many owners own it.
4:00 Now this one is interesting, so remember when we're doing this basically this in query, let's do a quick print of car id here,
4:12 so if we go back over to this, we say let's go over to the owners save your documents, so this is going to be car ids,
4:22 it's going to have an object id of that, all right, so run this, zero records, apparently this person owns nothing,
4:34 but notice it's taking 77 milliseconds, we could do our explain again here and column scan, yet again, not the most amazing.
4:44 So what we want is we want to have an index on car ids, right because column scan, not good, I think it's not really telling us in our store example
4:54 but for the find it definitely should be. So we can come back to our owner over here, let's add also like an index on car_ids,
5:03 If we'd run this once again, just the act of restarting it should regenerate the database, how long did it take over here—
5:10 a little late now isn't it, because I did the explain, I can look at this one, how many cars, how many owners does the 10 thousandth car have,
5:20 66 milliseconds, if we look at it now— how many owners own the 10 thousandth car, 1.9 milliseconds, so 33 times faster by adding that index, excellent,
5:35 find the 50 thousandth owner by name, that's already done. Alright we already have an index on owners name so that goes nice and quick,
5:46 and how is this doing, one millisecond perfect, this one is super bad, the cars with expensive service 712 milliseconds,
5:53 alright so here, we're looking at service history and then we're navigating that .relationship, that hierarchy,
6:01 with the double underscore, going to the price, greater than, less than, equal it doesn't matter,
6:06 we're basically working with this value here, this subdocument. Let's go over to the car and make that work,
6:12 now the car doesn't yet have any indexes but it will in a second, so what we want to do is represent that here
6:18 and in the the raw way of discussing this with MongoDB we use . (dot) not double underscore, so . represents the hierarchy here.
6:26 Let's run that again, notice expensive service, 712, cars with expensive service, instead of 712 we have 2.4 milliseconds,
6:40 now notice that first time I ran it there is was a pause, the second time it was like immediate,
6:46 and that's because it basically was recreating that index and that pause time was how long that index took to create.
6:52 So here we have cars with expensive service, now we're getting into something more interesting, look at this one with spark plugs,
6:59 we're querying on two things, we're querying on the history and the service, let's actually put this over in the shell so we can look at it.
7:20 I've got to convert this over, do the dots there, this is going to be the dollar greater operator, colon, like so,
7:31 all right, so we're comparing that service history.price and this one, again because you can't put dots in normal json,
7:40 do the dot here and quotes, and this one is just spark plugs, alright, let's run this, okay 22 milliseconds,
7:53 how long is it taking over here— 20 milliseconds, so that's actually pretty good and the reason I think it's pretty good is
8:00 we already have an index on this half and so it has to just basically sort the result, let's find out. Winning plan, index on this one, yes, exactly
8:15 so this one is just going to be crank across there but we're going to use at least this index here, this by price so that gets part of the query there.
8:26 Now maybe we want to be able to do a query just based on the description show me all the spark plugs, well that's a column scan,
8:34 so let's go back and add over here one for the description. Now how do I know what goes in this part,
8:45 see I have a service history here, if we actually look at the service record object it has a price and description, right
8:53 so we know that that results in this hierarchy of service history.price, service history.description.
8:58 If we'd run this again, it will regenerate those and let's go over here and run this, and let's see, now we're doing index scan on price,
9:10 what else do we got, rejected plans, okay so we got this and query and it looks like we're still using the— yes, oh my goodness,
9:25 how about that for a mistake, comma, so what did that do that created, in Python you can wrap these lines and that just created this,
9:34 and obviously, that's not what we want, that comma is super important there. So let me go over here and drop this nonsense thing,
9:42 try this again, I can see it's building index right now, okay, once again we can explain this, okay great,
9:52 so now we're using price and actually we use the description this time and you can see the rejected plan is the one that would have used the price,
10:05 so we're using description, not price, and how long does it take to run that query— 7.9 milliseconds, that's better
10:14 but what would be even better still is if we could do the description and price as a single thing. How do we do that?
10:23 This gets to be a little trickier, if we look at the query we're running, we're first asking for the price and then the description,
10:31 so we can actually create a composite index here as well, and we do that by putting a little dictionary, saying fields
10:40 and putting a list of the names of the fields and you can bet those go like this,
10:49 now this turns out to be really important, the order that you put them here price and the description versus description price, for sorting,
10:57 not so much for matching, run it one more time, alright, expensive cars with spark plugs, here we go, look at that, less than one millisecond,
11:11 so we added one index, it took it from like 66 milliseconds down to 15, and then, we added the description one, it turns out that was a better index
11:22 and it took it from 15 to 9, we added the composite index, and we took it from 9 to half a millisecond, a 0.6 milliseconds, that is really cool.
11:32 Notice over here, this got faster, let's go back and look at what that is. Load cars, so this is the one we are optimizing
11:41 and what are we doing here— let me wrap this so you can see, we're doing a count, okay, we're doing a count
11:47 and so it's basically having the database do all the work but there's zero serialization. Now in this one, we're actually calling list
11:56 so we're deserializing, we're actually pulling all of those records back and let's just go over here and see how many there are,
12:09 well that's not super interesting, to have just one, is it, alright, that's good, but let's actually make this just this,
12:24 let's drop this spark plug thing and just see how many cars there are with this, okay there we go, now we have some data to work with,
12:34 65 thousand cars had 15 thousand dollar service or higher, after all, this is a Ferrari dealership, right.
12:40 Now, it turns out it's a really bad idea to pull back that many cars, let me stop this, let's limit that to just a thousand here as well.
12:55 Okay, so we're pulling back thousand cars because we're limited to this and we're pulling back a thousand cars here.
13:04 But notice, this car name and id versus the entire car so let's go over here cars with expensive service, car name and id,
13:14 so notice the time, so to pull back and serialize those thousand records took actually a while, so it took one basically a second,
13:22 and if we don't ask for all the other pieces, if we just say give me just the make, the model and the id,
13:30 here we're using the only keywords, it says don't pull back the other things just give me the these three fields when you create them,
13:38 it makes it basically ten times faster, let's turn this down to a 100 and see, maybe get a little more realistic set of data.
13:45 Okay, there we go, a 100 milliseconds down to 14 milliseconds, so it turns out that the deserialization step in MongoEngine is a little bit expensive
13:56 so if you like blast a million cars into that list, it's going to take a little bit. If we can express like I only want to pull back these items,
14:06 than it turns out to be quite a bit faster, in this case not quite faster, but definitely faster. Let's round this out here and finish this up.
14:18 Here we're asking for the highly rated, highly priced cars, we're asking like hey for all the people that come and spend a lot of money
14:27 how did they feel about it? And then also what cars had a low price and also a low rating, so maybe we could have just somehow changed our service
14:38 for these sort of cheaper like oil change type people. It turns out that that one is quite fast,
14:43 this one we could do some work and fixing one will really fix the other so we have this customer rating thing, we probably want to have an index on,
14:53 and we already have one on the price, so I think that that's why it's pretty quick actually.
14:58 Go over here, and we don't yet have one on the price, on the rating rather, so we can do that and see if things get better,
15:08 not too much, it didn't really make too much of a difference, it's probably better to use the price than it is the rating,
15:17 because we're kind of doing that together, so we're also going to go down here and have the price and customer rating,
15:22 one of these composite indexes, once again, and maybe if we change price one more time,
15:30 rating and price— it doesn't seem like we're getting much better, so down here this is about as fast as we can get, 16 milliseconds
15:41 and this is less than one millisecond, so that's really good. The final thing is, we are looking for high mileage cars,
15:48 so let's go down here and say find where the mileage of the car is greater than 140 thousand miles, do we have an index on that,
15:56 you can bet the answer is no. Now we could go to the shell and see that, but no we don't have one, so let's go up here and add one more,
16:05 and this is in fact the only index we have here in this thing that is on like just plain field, not one of these nested ones like this;
16:14 so maybe we also want to be able to select by year, so we could have one for year as well. I'm going to add those in.
16:22 Now this high mileage car goes from a hundred and something milliseconds down to six, maybe one more time just to make sure,
16:29 yep, 5, 6, seems pretty stable around there. So we've gone and we've added these indexes to our models, our MongoEngine documents by adding indexes
16:41 and we can have flat ones like this, or we have these here, and we also can have composite ones or richer things,
16:53 if we create a little dictionary and we have fields and things like that. Similarly an owner, we didn't have as many things we were after
17:01 but we did want to find them by their name and by car id, so we had those two indexes, honestly this is just a simpler document than the cars.
17:09 So with these things added here, we can run this one more time and see how we're doing that code all runs really quick,
17:15 if we kind of scan through here, there's nothing that stands out like super bad, 5 milliseconds, half, 18, 6, half, 1, 3, 1, let's say,
17:27 this one, I really wish we could do better, it just turns out there is like so many records there
17:33 that if we run that here you can see that the whole thing runs in one millisecond, super, super fast, we can't make it any faster than that.
17:42 The slowness is basically the allocation, assignment, verification of 100 car objects.
17:49 I'd like to see a little better serialization time out of MongoEngine, if you have some part of your code that has to load tons of these things
17:57 and it's super performance critical, you could drop down to PyMongo, talk to it directly and probably in the case where you're doing that
18:06 you don't need to pull back many, many objects, but also you can see that if we limit what we ask for down here,
18:13 that goes back to 14 miliseconds which is really great, here we're looking at a lot of events, this is like 16 thousand
18:22 or no, 65 thousand, that's quite a bit, this one is really fast, this one is really fast, so I feel like from an index perspective
18:29 we've done quite a good job, how do we know we're done? I guess this is the final question, this has been a bit of a long—
18:36 how do we know we're done with this performance bit? We know we're done when all of these numbers come by
18:44 and they're all within reason of what we're willing to take. Here I have set this up as these are the explicit queries
18:52 we're going to ask and then we'll just time them, like your real application does not work that way.
18:57 How do you know what questions is your applications asking and how long it's taking. So you want to set up profiling, so you can come over here
19:06 and definitely google how to do profiling in MongoDB, so we can came over here and let's just say, db set profiling level
19:14 and you can use this function to say I'm looking for slow queries and to me slow means 10 milliseconds, 20 milliseconds something like that,
19:24 it will generate a table called system.profile and you can just go look in there and see what queries are slow, clear it out,
19:34 run your app, see what shows up in there add a bunch of indexes, make them fast, clear that table, then turn around and run your app again,
19:44 and just until stuff stops showing up in there, you can basically find the slowest one, make it faster, clear out the profile
19:52 and just iterate on that process, and that will effectively like gather up all of the meaningful queries that your app is going to do,
20:00 and then you can go through the same process here to figure out what indexes you need to create.