MongoDB with Async Python Transcripts
Chapter: Modeling with Documents
Lecture: To Embed or not to Embed?

0:00 So in MongoDB, one of the really big questions is to embed or not to embed.
0:08 When you have a relationship, does that relationship belong inside of another document
0:13 or does it involve some other collection that you traverse that relationship with? So I'll give you a couple of guidelines here.
0:22 But remember, the way you want to think about these relationships that are embedded,
0:27 for example, packages and releases, is that they are a pre-computed join. Instead of saying we're going to join on the package and release tables and get the
0:36 results, we're just going to always have those stored on disk so we have instant
0:41 access to that combination of data. So the first thing we want to ask about
0:46 this possibility of embedded data, again the releases in this example, is: is that embedded data wanted most of the time?
0:56 There is an overhead of having that data embedded into the other object.
1:01 For example, if I just say, give me the package with the ID beanie, I'm not just going to
1:06 read the top level information like when it was released, who is the maintainer, I'm also going to get all of its releases.
1:14 And that involves taking data across the network, deserializing them and so on.
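To get a feel for that overhead, here is a minimal sketch in plain Python. The document shape, field names, and release notes are hypothetical, and JSON length is only a rough stand-in for the real BSON size that MongoDB would send over the network.

```python
import json

# Hypothetical package document with its releases embedded
# (the "pre-computed join" described above).
package_embedded = {
    "_id": "beanie",
    "maintainer": "someone@example.com",
    "releases": [
        {"version": f"1.{minor}.0", "notes": "example release notes " * 20}
        for minor in range(200)
    ],
}

# The same top-level data with the releases left out, as if they
# lived in a separate collection and were fetched only on demand.
package_only = {k: v for k, v in package_embedded.items() if k != "releases"}


def payload_bytes(doc: dict) -> int:
    """Rough payload size as UTF-8 JSON; an approximation, not the BSON size."""
    return len(json.dumps(doc).encode("utf-8"))


embedded_size = payload_bytes(package_embedded)
top_level_size = payload_bytes(package_only)

print(f"with releases:    {embedded_size:,} bytes")
print(f"without releases: {top_level_size:,} bytes")
```

Every read of the package by its ID pays for the larger payload, whether or not the caller looks at the releases.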
1:19 So that has a cost to it, and you want to consider, do I normally want this data?
1:26 Because if you rarely want it, just occasionally you might need it, you don't want to pay that cost, all right?
1:32 You want that as some separate thing you can go look up with a separate query. Think about it in reverse as well.
1:40 How often do you want that embedded data without the containing record, without the containing document?
1:47 So is it super important that I get just the details about one release, but I don't care
1:52 what the package is, I don't care about the other releases that might also be bundled in there?
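As a rough illustration of wanting one release on its own when releases are embedded, the sketch below assumes the whole package document has already been fetched and picks the release out in application code. The document shape and the `find_release` helper are hypothetical, not part of any MongoDB driver.

```python
from typing import Optional

# Hypothetical package document as it might come back from a query,
# with every release embedded in it.
package = {
    "_id": "beanie",
    "releases": [
        {"version": "1.4.0", "size_kb": 120},
        {"version": "1.5.0", "size_kb": 125},
        {"version": "1.6.0", "size_kb": 131},
    ],
}


def find_release(package_doc: dict, version: str) -> Optional[dict]:
    """Scan the embedded releases and return the first one matching `version`."""
    for release in package_doc.get("releases", []):
        if release["version"] == version:
            return release
    return None


release = find_release(package, "1.5.0")
print(release)
```

The point of the sketch is that the application already holds all the releases just to use one of them.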
1:57 It's certainly possible to get just one release with a query, but you always are going to bring at least that other embedded data.
2:07 So for example, if I were to write a query in MongoDB, I could say, give me the release that is whatever I'm looking for, release 1.5, no problem.
2:17 But there's no way to limit what you get back to just that one release. At a minimum, you're gonna get all the releases
2:23 and have to identify that in code. So that's a little bit tricky. If I wanted the releases separately and individually,
2:31 that's not a good choice to embed them. You also wanna think about document size, how big an individual record is.
2:39 This is something you consider in relational databases, like how many columns are you going to put into one row. But with document databases,
2:47 the hierarchy can be much, much larger. So there's actually a hard limit on how big a document can be in MongoDB, 16 megabytes. That is not ideal.
3:00 This is not something you should aim for. Like, well, I only have 14 megs, so we're good. No, you wanna stay far below that if possible, right?
3:07 This is just a limit where MongoDB will cut you off and say, look, you got a problem. We're not gonna let you save this record anymore.
3:14 So is this embedded thing in our case releases, is it a bounded set? Because if it's unbounded, it could grow beyond the 16 megabyte limit.
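A small sketch of how an unbounded embedded list eats into that limit. The entry shape and helper names are hypothetical, and the size is estimated from JSON rather than measured as BSON, so treat the numbers as ballpark only.

```python
import json

# MongoDB's hard per-document limit: 16 MB.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


def estimate_bytes(doc: dict) -> int:
    """Approximate document size via UTF-8 JSON (not the exact BSON size)."""
    return len(json.dumps(doc).encode("utf-8"))


def fraction_of_limit(doc: dict, limit: int = MAX_DOCUMENT_BYTES) -> float:
    """How much of the size limit this document would use, as a 0..1+ ratio."""
    return estimate_bytes(doc) / limit


# Hypothetical embedded entry; each new one makes the parent document bigger.
entry = {"at": "2024-01-01T00:00:00Z", "detail": "x" * 100}

for count in (10, 1_000, 100_000):
    doc = {"_id": "example", "entries": [entry] * count}
    print(f"{count:>7} entries: {fraction_of_limit(doc):.2%} of the limit")
```

A bounded list levels off at some fraction of the limit; an unbounded one just keeps climbing toward it.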
3:24 Imagine you had a CMS and a page was modeled in MongoDB. Would you wanna put the visits to that page into the MongoDB record?
3:33 No way, because on a popular site, that thing's just gonna keep growing and growing and you're gonna spend all your time
3:39 pulling back analytics that you don't care about, right? And if that grew past 16 megs,
3:43 you wouldn't even be able to save or edit the page anymore. That'd be bad. But again, 16 megabytes is not aspirational. You want it much, much smaller.
3:52 Maybe 10, 20K might be some kind of upper bound you wanna think of for a lot of your records. Another question you need to ask
4:04 is how varied are your queries? Remember, these embedded documents are kind of like pre-computed joins.
4:12 And if you know, well, I'm gonna ask this kind of question and a lot of times I want this data and that data back,
4:18 you can really carefully structure your documents to match those specialized queries. But the more different types of questions you ask,
4:27 the more angles you ask it from, the more you start to violate or break down number one and two here, right? As you ask different questions from the data
4:37 from different angles and perspectives, the chances that you want that data 80% of the time go down and down and down.
4:45 The chances that you might want the contained document alone, so just one release without the other stuff, go up.
4:52 So that puts pressure on saying don't embed, right? So the more varied your queries are against the same bit of data,
4:59 the more likely you're gonna treat it more relationally and less embedded. Finally, you might ask, do I have an integration database
5:08 or an application database? We'll talk about that next.