MongoDB for Developers with Python Transcripts
Chapter: Modeling and document design
Lecture: To embed or not to embed

0:01 When it comes to modeling with document databases, you apply a lot of the same thinking as you do with relational databases
0:08 about what the entities should be, and so on. However, there's one fundamental question that you often ask
0:14 that really does take some thought, maybe working through some guidelines, and that is whether to embed or not to embed related items.
0:24 So in our previous example, you saw that we had a book and the book had ratings embedded within it,
0:29 but we could just as well have the ratings be a separate collection, or the ratings could have even gone into the user object
0:34 with a reference back to the book, instead of the reverse. So should we embed the ratings, and if we do,
0:41 do they go in books, do they go in users, or do they not go there at all? So what I'm going to do is give you some guidelines;
0:47 these are soft rules, we don't have a really prescriptive way of doing things
0:52 like third normal form here, but some of that thinking does help. So let's get into the rules.
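To make the three options concrete, here is a minimal sketch of how the same rating could be stored in each layout, written as plain Python dictionaries of the kind pymongo works with. The field names (`user_id`, `book_id`, `value`) and the id values are assumptions for illustration, not the schema used in the course.

```python
# Three hypothetical ways to store the same rating. All field names and ids
# here are illustrative assumptions, not the schema used in the course.

# Option 1: ratings embedded in the book document.
book_with_ratings = {
    "_id": "book-1",
    "title": "Example Book",
    "ratings": [
        {"user_id": "user-7", "value": 4},
    ],
}

# Option 2: ratings in their own collection, referencing both sides.
rating_standalone = {
    "_id": "rating-1",
    "book_id": "book-1",
    "user_id": "user-7",
    "value": 4,
}

# Option 3: ratings embedded in the user, pointing back to the book.
user_with_ratings = {
    "_id": "user-7",
    "name": "Example User",
    "ratings": [
        {"book_id": "book-1", "value": 4},
    ],
}
```

The information is the same in all three; what changes is which document you have to load to reach it.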
0:57 First of all, the question you want to ask is: is that embedded data wanted eighty percent of the time that you get the original object?
1:03 Do I usually want the rating information when I have the book? If getting it would have meant doing a join in a traditional database
1:12 or going back and doing a second query to Mongo to pull that data out, it's very beneficial to have that rating data embedded in the book.
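The "second query" cost can be sketched like this. It assumes the non-embedded layout, where ratings live in their own collection and point at the book through a hypothetical `book_id` field; `books` and `ratings` are assumed to be pymongo collections (anything with compatible `find_one` and `find` methods will do).

```python
from typing import Any, Optional


def load_book_with_ratings(books: Any, ratings: Any, book_id: str) -> Optional[dict]:
    """Load a book plus its ratings when the ratings are NOT embedded.

    `book_id` is an assumed reference field on each rating document.
    """
    # Round trip 1: the book itself.
    book = books.find_one({"_id": book_id})
    if book is None:
        return None
    # Round trip 2: the ratings that reference this book.
    book["ratings"] = list(ratings.find({"book_id": book_id}))
    return book
```

With the ratings embedded, the first `find_one` alone would already return them, which is the benefit this rule is pointing at.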
1:20 We designed it that way, so let's suppose that most of our query patterns, and most of the way our application works, is that
1:26 we want to list the number of ratings, the average rating, things like this; we want to surface that almost all the time,
1:33 so we want that embedded data when we get a book. That would guide us to embed the data. If this is not true, if you only very rarely want that data,
1:43 then you most likely will not want to embed it; there's a serious performance cost for what you might think of as dead weight,
1:49 other embedded stuff that comes along with the object that you don't care about most of the time.
1:55 You can do things like suppress those items coming back, so you can basically suppress the ratings object,
2:01 but if you are doing that, it's probably a sign that, hey, maybe I shouldn't really be designing it this way.
2:05 A lot of considerations, but here's the first rule: do you want the embedded data most of the time?
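Suppressing the embedded items is done with a projection. A minimal sketch, assuming `books` is a pymongo collection and that `isbn` and `ratings` are the field names (both are assumptions here):

```python
from typing import Any, Optional


def find_book_without_ratings(books: Any, isbn: str) -> Optional[dict]:
    """Fetch one book but ask the server to leave out the ratings array.

    `books` is assumed to be a pymongo Collection (or anything with a
    compatible find_one method); `isbn` and `ratings` are assumed field names.
    """
    # The second argument is the projection: 0 means "exclude this field".
    return books.find_one({"isbn": isbn}, {"ratings": 0})
```

If most of your calls end up looking like this one, that is the sign mentioned above that the ratings may not belong inside the book.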
2:12 Next, how often do you want the embedded data without the containing document? The way things are structured now, I cannot get the ratings
2:20 without getting the books, and I cannot get individual ratings without getting all of the ratings. So if what I wanted to do on the user profile page
2:28 was show all of my individual ratings as a user, listed on my favorites page, or things I've rated, or something like this,
2:37 that's actually a little bit challenging the way things are written. We can definitely do it, and if there's just one
2:42 query where we do it that way, it's totally fine, but this is one of the tensions: you can't get the ratings without getting the books,
2:48 and you can't get individual ratings without getting all the other ratings from that particular book. With a plain include-or-exclude projection in MongoDB
2:54 there's no way to actually suppress that, I don't think; you can suppress the other fields,
2:57 but for the ratings field you get all the ratings or none of the ratings.
3:01 So how often is it necessary to get a rating without getting the book itself? If that's something you want to do often,
3:08 or it's a very, very hot spot in your application, then again, maybe you do not want to embed it, if you want the object without the containing document.
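For the "all of my ratings" page, one possible approach when ratings stay embedded in books is an aggregation pipeline that unwinds the array. This is only a sketch: the field names `ratings.user_id` and `title` are assumptions, and the function just builds the pipeline so it can be passed to a pymongo collection's `aggregate` method.

```python
def ratings_by_user_pipeline(user_id: str) -> list:
    """Build an aggregation pipeline listing one user's ratings across books.

    Assumes each book has a `ratings` array whose items carry a `user_id`.
    """
    return [
        # Keep only books this user has rated at all.
        {"$match": {"ratings.user_id": user_id}},
        # Turn each element of the ratings array into its own document.
        {"$unwind": "$ratings"},
        # Drop the other users' ratings that came along with each book.
        {"$match": {"ratings.user_id": user_id}},
        # Return just the book title and the single matching rating.
        {"$project": {"_id": 0, "title": 1, "rating": "$ratings"}},
    ]
```

Usage would look like `books.aggregate(ratings_by_user_pipeline("user-7"))`. The server still has to read every matching book with its full ratings array to do this, so if this page is a hot spot, the tension described above remains.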
3:15 Another really important question to answer is: is the embedded data a bounded set? If it is just a single nested item, fine, that's no problem.
3:23 If it's a list or an array, like we have in the context of ratings, how big could the ratings get,
3:29 how many ratings might a book have, reasonably speaking? If there are ten ratings, it's probably totally fine
3:35 to have the rating data embedded in the book; it's nice and self-contained, and you get a little atomicity and some other nice features from having it embedded there.
3:42 If there are a hundred ratings, maybe it's still good. If there are a thousand ratings, or if there's an unbounded number of ratings,
3:49 you do not want to embed it. So first of all, is it a bounded set, and related to that, is the bounded set small,
3:56 because every time you get the book back you're pulling all of that stuff off disk, or possibly out of memory,
4:02 and over the network, with serialization or deserialization depending on the side that you're working with. So that comes with a cost, and in fact,
4:09 MongoDB puts a limit on the size of these documents: you're not allowed to have a document larger than 16 MB.
4:18 In fact, if you try to take a document that's larger than 16 MB and save it into MongoDB, even if you pull a document back,
4:24 add something that makes it a little bit bigger than the limit, and call save, it's going to totally fail and say no, no, no, this is over the limit.
4:30 So this should not be thought of as a safe upper bound; this should be thought of as the absolute limit.
4:37 If you've got a document that's ten megabytes, it doesn't mean, wow, we're only halfway there, this is amazing or great;
4:42 no, that's a huge performance cost to pull 10 MB over every time you need a little bit of something out of there.
4:49 So really, you should aim for something much, much, much smaller than the upper limit of 16 MB, but the point here is
4:54 there actually is a limit: if this embedded data outgrows that 16 MB, you just cannot save it back to the database.
5:03 That's a "will no longer operate" problem; "is the bound small" is more of a performance trade-off type of problem.
5:09 But you want to think about these very, very carefully; average document size is definitely something worth keeping in mind.
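One way to keep an eye on document size during development is a rough check like the following. It is only a sketch: it uses the length of the JSON text as a stand-in for the real BSON size, which is not exact, and the 256 KB budget is an arbitrary assumption chosen to be far below the hard limit.

```python
import json

# MongoDB's hard per-document limit (16 MB, i.e. 16 * 1024 * 1024 bytes).
MAX_BSON_BYTES = 16 * 1024 * 1024


def approx_doc_bytes(doc: dict) -> int:
    """Rough size estimate: UTF-8 length of the JSON form, NOT exact BSON size."""
    return len(json.dumps(doc, default=str).encode("utf-8"))


def is_over_budget(doc: dict, budget_bytes: int = 256 * 1024) -> bool:
    """True when the document exceeds a self-imposed budget (assumed 256 KB)."""
    return approx_doc_bytes(doc) > budget_bytes


# Hypothetical book with ten small embedded ratings.
book = {
    "title": "Example Book",
    "ratings": [{"user_id": f"user-{i}", "value": 4} for i in range(10)],
}
print(approx_doc_bytes(book), is_over_budget(book))
```

The point of the separate budget is the one made above: the 16 MB figure is where saves stop working, while the performance cost shows up long before that.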
5:15 How varied are your queries? Do you have a web app that asks maybe ten really common questions, and you very much know the structure,
5:25 like these are the types of queries my app asks, these are the really hot pages, and here's what I want to optimize for?
5:30 Or is this more of a BI type of thing, where analysts come along and can ask almost any sort of reporting question whatsoever?
5:39 It turns out the more focused your queries are, the more likely you are to embed data in other things;
5:45 if you know that you typically use these things together, then embedding them often makes a lot of sense. If you're not really sure about the use case,
5:52 it's hard to answer the questions above. Do you want the data eighty percent of the time? I have no idea,
5:56 there are all sorts of queries, some of the time. And so the more varied your queries, the more likely you are to
6:01 tend towards the normalized data model, not the embedded data model. And finally, related to how varied your queries are:
6:10 are you working with an integration database that lives at the center and is used almost for inter-process, inter-application communication,
6:18 or is it a very focused application database? We're going to dig into that idea next.

