MongoDB for Developers with Python Transcripts
Chapter: Modeling and document design
Lecture: To embed or not to embed
0:01 When it comes down to modeling with document databases
0:04 you apply a lot of the same thinking as you do with relational databases
0:07 about what the entity should be, and so on.
0:10 However, there's one fundamental question that you often ask
0:13 that really does take some thought, maybe working through
0:18 some of the guidelines, and that is to embed or not to embed related items.
0:23 So in our previous example, you saw that we had a book
0:26 and the book had ratings embedded within it,
0:28 but we could just as well have the ratings be a separate collection
0:30 or the ratings could have even gone into the user object
0:33 with a reference back to the book, instead of the reverse.
0:36 So should we embed the ratings, and if we do,
0:40 does it go in books, does it go in users, or does it not go there at all?
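To make the three options concrete, here is a rough sketch of what each shape could look like as plain Python dictionaries. The field names (`title`, `ratings`, `user_id`, `book_id`, `value`) and the ids are assumptions for illustration, not the course's actual schema.

```python
# Hypothetical field names and ids, for illustration only.

# Option 1: ratings embedded in the book document.
book_embedded = {
    "_id": "book-1",
    "title": "Example Book",
    "ratings": [
        {"user_id": "user-1", "value": 5},
        {"user_id": "user-2", "value": 3},
    ],
}

# Option 2: ratings in their own collection, referencing book and user.
book_plain = {"_id": "book-1", "title": "Example Book"}
ratings_collection = [
    {"_id": "rating-1", "book_id": "book-1", "user_id": "user-1", "value": 5},
    {"_id": "rating-2", "book_id": "book-1", "user_id": "user-2", "value": 3},
]

# Option 3: ratings embedded in the user, with a reference back to the book.
user_embedded = {
    "_id": "user-1",
    "name": "Example User",
    "ratings": [{"book_id": "book-1", "value": 5}],
}
```

The same two ratings appear in options 1 and 2; what changes is which document you have to load to see them.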
0:43 So what I'm going to do, is I'm going to give you some guidelines,
0:46 these are soft rules, we don't have like a really prescriptive way of doing things
0:51 like third normal form here, but some of the thinking there does help;
0:54 so let's get into the rules.
0:56 First of all, the question you want to ask is: is that embedded data
0:59 wanted eighty percent of the time that you get the original object?
1:02 Do I usually want the rating information when I have the book?
1:08 If it would have resulted in me doing a join in a traditional database
1:11 or going back and doing a second query to Mongo to pull that data out,
1:14 it's very beneficial to have that rating data embedded in the book.
1:19 We designed it that way, so let's suppose that most of our query patterns
1:22 and most of the way our application works is
1:25 we want to list the number of ratings, the average rating,
1:29 things like this; we want to surface that almost all the time,
1:32 so we want that embedded data when we get a book.
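As a small sketch of why embedding helps here: once the book document is in hand, the count and average come straight out of it with no second query. The document shape and the `rating_summary` helper are assumptions made up for this example.

```python
from statistics import mean


def rating_summary(book):
    """Summarize the ratings embedded in a single book document."""
    values = [r["value"] for r in book.get("ratings", [])]
    return {
        "count": len(values),
        # No ratings yet means there is no average to report.
        "average": mean(values) if values else None,
    }


# Hypothetical book document with embedded ratings.
book = {
    "_id": "book-1",
    "title": "Example Book",
    "ratings": [
        {"user_id": "user-1", "value": 5},
        {"user_id": "user-2", "value": 3},
        {"user_id": "user-3", "value": 4},
    ],
}

summary = rating_summary(book)
```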
1:35 So that would guide us to embed the data. If this is not true,
1:40 if you only very rarely want that data,
1:42 then you most likely will not want to embed it;
1:45 there's a serious performance cost for what you might think of as dead weight,
1:48 the other embedded stuff that comes along with the object
1:51 that you don't care about most of the time.
1:54 You can do things like suppress those items coming back,
1:57 so you can basically suppress the ratings field with a projection,
2:00 but if you are doing that, it's probably a sign like
2:02 hey, maybe I shouldn't really be designing it this way.
2:04 A lot of considerations, but here's the first rule:
2:07 do you want the embedded data most of the time?
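Suppressing the embedded data looks roughly like the sketch below. PyMongo's `find_one` takes a filter and then a projection, and a projection value of `0` excludes that field. The `InMemoryBooks` class is only a stand-in written for this example so it runs without a server (it supports exclusion projections and nothing else); the field names are assumptions.

```python
class InMemoryBooks:
    """Stand-in for a PyMongo collection; handles exclusion projections only."""

    def __init__(self, docs):
        self._docs = docs

    def find_one(self, filter, projection=None):
        excluded = {k for k, v in (projection or {}).items() if v == 0}
        for doc in self._docs:
            if all(doc.get(k) == v for k, v in filter.items()):
                return {k: v for k, v in doc.items() if k not in excluded}
        return None


def find_book_without_ratings(books, book_id):
    # Same call shape as a real PyMongo collection:
    # first argument is the filter, second is the projection.
    return books.find_one({"_id": book_id}, {"ratings": 0})


books = InMemoryBooks([
    {"_id": "book-1", "title": "Example Book",
     "ratings": [{"user_id": "user-1", "value": 5}]},
])

slim_book = find_book_without_ratings(books, "book-1")
```

If most of your calls end up looking like `find_book_without_ratings`, that is the signal from the first rule that the ratings probably belong somewhere else.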
2:11 Next, how often do you want the embedded data without the containing document?
2:15 The way things are structured now, I cannot get the ratings
2:19 without getting the books, I cannot get individual
2:22 ratings without getting all of the ratings.
2:24 So if what I wanted to do was on the user profile page
2:27 show here are all of my individual ratings as a user
2:31 listed on my like favorites page, or things I've rated or something like this,
2:36 that's actually a little bit challenging the way things are written.
2:39 We can definitely do it, and if there's just one
2:41 query where we do it that way, it's totally fine,
2:43 but this is one of the tensions: you can't get the ratings without getting the books,
2:47 you can't get individual ratings without getting all the other ratings
2:50 from that particular book; there's no way in MongoDB
2:53 to actually suppress that, I don't think; you can suppress the other fields
2:56 using a projection, but with a plain include/exclude projection you get all the ratings, or none of the ratings.
3:00 So how often is it necessary to get a rating without getting a book itself?
3:04 Right, if that's something you want to do often
3:07 or it's a very very hot spot in your application
3:09 maybe again you do not want to embed it,
3:11 if you want the object without the containing document.
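One way to pull a single user's ratings out of the embedded arrays is an aggregation pipeline that unwinds the array and filters it. The sketch below builds such a pipeline with the standard `$match`, `$unwind`, and `$project` stages, and pairs it with a pure-Python equivalent so the shape of the result is visible without a server. Field names and both helper functions are assumptions for illustration.

```python
def user_ratings_pipeline(user_id):
    """Aggregation stages to run against the books collection."""
    return [
        # Only books this user has rated.
        {"$match": {"ratings.user_id": user_id}},
        # One output document per embedded rating.
        {"$unwind": "$ratings"},
        # Keep just this user's ratings.
        {"$match": {"ratings.user_id": user_id}},
        {"$project": {"_id": 0, "book_id": "$_id", "title": 1,
                      "value": "$ratings.value"}},
    ]


def user_ratings_in_memory(books, user_id):
    """Pure-Python equivalent: every candidate book has to be walked."""
    return [
        {"book_id": book["_id"], "title": book["title"], "value": r["value"]}
        for book in books
        for r in book.get("ratings", [])
        if r["user_id"] == user_id
    ]


books = [
    {"_id": "book-1", "title": "First", "ratings": [
        {"user_id": "user-1", "value": 5}, {"user_id": "user-2", "value": 2}]},
    {"_id": "book-2", "title": "Second", "ratings": [
        {"user_id": "user-1", "value": 4}]},
]

mine = user_ratings_in_memory(books, "user-1")
# With a real PyMongo collection: collection.aggregate(user_ratings_pipeline("user-1"))
```

It works, but the database still has to touch every book the user rated, which is the tension the second rule is pointing at.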
3:14 Another really important question to answer is:
3:17 is the embedded data a bounded set?
3:19 If it is just a single nested item, fine, that's no problem,
3:22 if it's a list or an array, like we have in the context of ratings,
3:25 how big could the ratings get,
3:28 how many ratings might a book have reasonably speaking;
3:31 if there's ten ratings, it's probably totally fine
3:34 to have the rating data embedded in the book,
3:36 it's nicely self-contained, you get a little atomicity
3:39 and some nice features from having it embedded there.
3:41 If there's a hundred ratings, maybe it's good,
3:45 if there's a thousand ratings, if there's an unbounded number of ratings,
3:48 you do not want to embed it. So is it a bounded set, first of all,
3:53 and related to that, is the bounded set small,
3:55 because every time you get the book back
3:58 you're pulling all of that stuff off disk, possibly out of memory,
4:01 over the network, with deserialization or serialization
4:04 depending on the side that you're working with.
4:06 So that comes with a cost, and in fact,
4:08 MongoDB puts a limit on the size of these documents,
4:12 you're not allowed to have a document larger than 16 MB,
4:17 in fact, if you try to take a document that's larger than 16 MB
4:20 and save it into MongoDB, even if you pull it back,
4:23 add something that makes it a little bit bigger, and you call save,
4:26 it's going to totally fail and say no, no, no, this is over the limit.
4:29 So this should not be thought of as a safe upper bound,
4:33 this should be thought of as the absolute limit;
4:36 if you've got a document that's ten megabytes,
4:38 it doesn't mean like wow, we're only halfway there, this is amazing or great,
4:41 no, that's a huge performance cost to pull 10 MB over
4:46 every time you need a little bit of something out of there.
4:48 So really, you should aim for a much, much, much smaller thing
4:51 than the upper limit of 16 MB, but the point here is
4:53 there is actually a limit where if this embedded data outgrows that 16 MB
4:59 you just cannot save it back to the database,
5:02 that's a "will no longer operate" problem;
5:04 "is the bound small" is more of a performance trade-off type of problem, right,
5:08 but you want to think about these very, very carefully,
5:10 average size of a document is definitely something worth keeping in mind.
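To keep an eye on document size, a rough sketch like the one below can help. It uses the JSON-encoded length as a stand-in for the stored size, which is an assumption: the true BSON size will differ somewhat. The 16 MB hard limit comes from the lecture; the 256 KB "comfortable" budget is a made-up number for illustration, not a MongoDB setting.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # hard limit mentioned in the lecture
COMFORT_BUDGET_BYTES = 256 * 1024      # hypothetical soft budget, pick your own


def approx_size_bytes(doc):
    """Rough size estimate: JSON length in bytes, not the exact BSON size."""
    return len(json.dumps(doc).encode("utf-8"))


def size_report(doc, budget_bytes=COMFORT_BUDGET_BYTES):
    size = approx_size_bytes(doc)
    return {
        "approx_bytes": size,
        "over_budget": size > budget_bytes,            # performance concern
        "over_hard_limit": size > MAX_DOCUMENT_BYTES,  # cannot be saved at all
    }


def make_book(rating_count):
    """Build a hypothetical book with the given number of embedded ratings."""
    return {
        "_id": "book-1",
        "title": "Example Book",
        "ratings": [{"user_id": f"user-{i}", "value": 5}
                    for i in range(rating_count)],
    }


small = size_report(make_book(10))
large = size_report(make_book(100_000))
```

The book with 100,000 ratings is still well under the hard limit, but it is far past the soft budget, which is the "is the bound small" concern rather than the "will no longer operate" one.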
5:14 How varied are your queries?
5:17 Do you have like a web app and it asks like maybe ten really common questions
5:21 and you very much know the structure,
5:24 like these are the types of queries my app asks,
5:26 these are the really hot pages and here's what I want to optimize for,
5:29 or is this more of like a BI-type thing where people and analysts come along
5:34 and they can ask like almost any sort of reporting question whatsoever;
5:38 it turns out the more focused your queries are,
5:41 the more likely you are to embed data in other things, right,
5:44 if you know that you typically use these things together,
5:47 then embedding them often makes a lot of sense.
5:49 If you're not really sure about the use case,
5:51 it's hard to answer the above questions,
5:53 do you want the data eighty percent of the time? I have no idea,
5:55 there's all sorts of queries, some of the time, right,
5:58 and so the more varied your queries, the more likely you are going to
6:00 tend towards normalized data, not the embedded data model.
6:06 And finally, related to how varied your queries are:
6:09 are you working with an integration database that lives at the center
6:14 and is almost used for inter-process, inter-application communication,
6:17 or is it a very focused application database?
6:19 We're going to dig into that idea next.