MongoDB with Async Python Transcripts
Chapter: Modeling with Documents
Lecture: To Embed or not to Embed?
0:00
So in MongoDB, one of the really big questions is to embed or not to embed.
0:08
When you have a relationship, does that relationship belong inside of another document
0:13
or does it involve some other collection that you traverse that relationship with? So I'll give you a couple of guidelines here.
0:22
But remember, the way you want to think about these relationships that are embedded,
0:27
for example, packages and releases, is that each one is a pre-computed join. Instead of saying we're going to join on the package and release tables and get the
0:36
results, we're just going to always have those stored on disk so we have instant
0:41
access to that combination of data. So the first thing we want to ask about
0:46
this possibility of embedded data, again the releases in this example, is: is that embedded data wanted most of the time?
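As a rough sketch of the "pre-computed join" idea, here is the same package-and-releases data modeled both ways in plain Python. The field names and values (`maintainer`, `package_id`, `version`) are made up for illustration and are not the course's actual schema.

```python
# Hypothetical data: all field names and values are illustrative assumptions.
packages = [{"_id": "beanie", "maintainer": "example-maintainer"}]
releases = [
    {"package_id": "beanie", "version": "1.4"},
    {"package_id": "beanie", "version": "1.5"},
    {"package_id": "motor", "version": "3.0"},
]


def join_at_query_time(package_id):
    """Relational style: find the package, then gather its releases on every read."""
    package = next(p for p in packages if p["_id"] == package_id)
    matching = [{"version": r["version"]} for r in releases if r["package_id"] == package_id]
    return {**package, "releases": matching}


# Document style: the result of that join is what gets stored, ready to read.
embedded_package = {
    "_id": "beanie",
    "maintainer": "example-maintainer",
    "releases": [{"version": "1.4"}, {"version": "1.5"}],
}

joined = join_at_query_time("beanie")
print(joined == embedded_package)
```

The embedded document is simply the joined result saved ahead of time, which is why reading it needs no second lookup.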
0:56
There is an overhead of having that data embedded into the other object.
1:01
For example, if I just say, give me the package with the ID beanie, I'm not just going to
1:06
read the top level information like when it was released, who is the maintainer, I'm also going to get all of its releases.
1:14
And that involves taking data across the network, deserializing them and so on.
1:19
So that has a cost to it, and you want to consider, do I normally want this data?
1:26
Because if you rarely want it, just occasionally you might need it, you don't want to pay that cost, all right?
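To get a feel for that cost, here is a minimal sketch that compares the size of a package with and without its embedded releases. It uses JSON length as a stand-in for what travels over the network (MongoDB actually uses BSON, so real numbers will differ), and the document shape is an assumption for illustration.

```python
import json

# Hypothetical package with 50 embedded releases; shape and sizes are made up.
package = {
    "_id": "beanie",
    "maintainer": "example-maintainer",
    "releases": [{"version": f"1.{n}", "notes": "x" * 200} for n in range(50)],
}


def approx_size(doc):
    """Rough size in bytes, using JSON as a proxy for the wire format."""
    return len(json.dumps(doc).encode("utf-8"))


# What you would read if you only needed the top-level details.
top_level_only = {k: v for k, v in package.items() if k != "releases"}

full_bytes = approx_size(package)
summary_bytes = approx_size(top_level_only)
print(f"with releases: {full_bytes} bytes, without: {summary_bytes} bytes")
```

If most reads only need the top-level fields, nearly all of those bytes are paid for and then thrown away.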
1:32
You want that as some separate thing you can go look up with a separate query. Think about it in reverse as well.
1:40
How often do you want that embedded data without the containing record, without the containing document?
1:47
So is it super important that I get just the details about one release, but I don't care
1:52
what the package is, I don't care about the other releases that might also be bundled in there?
1:57
It's certainly possible to find just one release with a query, but you are always going to bring back at least the other embedded data.
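Here is a minimal sketch of that situation in plain Python, with no database involved. The function `find_packages_with_release` is a hypothetical stand-in that mimics how a filter such as `{"releases.version": "1.5"}` matches whole package documents; the data is made up.

```python
# Hypothetical in-memory "collection"; names and versions are illustrative.
packages = [
    {"_id": "beanie", "releases": [{"version": "1.4"}, {"version": "1.5"}, {"version": "1.6"}]},
    {"_id": "motor", "releases": [{"version": "3.0"}]},
]


def find_packages_with_release(version):
    """Mimics a filter like {"releases.version": version}: whole documents match."""
    return [p for p in packages if any(r["version"] == version for r in p["releases"])]


def pick_release(package, version):
    """The extra step in application code: dig the one release out of the array."""
    return next(r for r in package["releases"] if r["version"] == version)


matches = find_packages_with_release("1.5")
release = pick_release(matches[0], "1.5")
print(len(matches[0]["releases"]), release)
```

The match succeeds, but the document that comes back still carries every release, and the code has to pick out the one it wanted.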
2:07
So for example, if I were to write a query in MongoDB, I could say, give me the release that is whatever I'm looking for, release 1.5, no problem.
2:17
But a plain query like that won't limit what you get back to just that one release. By default, you're gonna get all the releases
2:23
and have to pick out the one you want in code. So that's a little bit tricky. If I wanted the releases separately and individually,
2:31
embedding them is not a good choice. You also wanna think about document size, how big an individual record is.
2:39
This is something you consider in relational databases, like how many columns are you going to put into one row. But with document databases,
2:47
the hierarchy can be much, much larger. So there's actually a hard limit on how big a document can be in MongoDB, 16 megabytes. That is not ideal.
3:00
This is not something you should aim for. Like, well, I only have 14 megs, so we're good. No, you wanna stay far below that if possible, right?
3:07
This is just a limit where MongoDB will cut you off and say, look, you got a problem. We're not gonna let you save this record anymore.
3:14
So is this embedded thing, in our case releases, a bounded set? Because if it's unbounded, it could grow beyond the 16 megabyte limit.
3:24
Imagine you had a CMS and a page was modeled in MongoDB. Would you wanna put the visits to that page into the MongoDB record?
3:33
No way, because on a popular site, that thing's just gonna keep growing and growing and you're gonna spend all your time
3:39
pulling back analytics that you don't care about, right? And if that grew beyond 16 megs,
3:43
you wouldn't even be able to save or edit the page anymore. That'd be bad. But again, 16 megabytes is not aspirational. You want it much, much smaller.
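To put rough numbers on the page-visits example, here is a back-of-the-envelope sketch. It measures size with JSON rather than BSON, so it is only an approximation, and the `page` and `visit` shapes are assumptions for illustration.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # MongoDB's hard per-document limit


def approx_size(doc):
    """Rough size in bytes; real documents are stored as BSON, not JSON."""
    return len(json.dumps(doc).encode("utf-8"))


# Hypothetical CMS page with visits embedded in it.
page = {"_id": "home", "title": "Home", "visits": []}
visit = {"ip": "203.0.113.7", "at": "2024-01-01T00:00:00Z", "agent": "example-browser/1.0"}

base_bytes = approx_size(page)
page["visits"].append(visit)
bytes_per_visit = approx_size(page) - base_bytes

# Roughly how many embedded visits fit before saves start failing.
visits_until_limit = (MAX_DOCUMENT_BYTES - base_bytes) // bytes_per_visit
print(f"about {bytes_per_visit} bytes per visit, limit after roughly {visits_until_limit} visits")
```

An unbounded list like this has a finite number of entries before the page can no longer be saved, and a popular site would get there on its own schedule, not yours.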
3:52
Maybe 10, 20K might be some kind of upper bound you wanna think of for a lot of your records. Another question you need to ask
4:04
is how varied are your queries? Remember, these embedded documents are kind of like pre-computed joins.
4:12
And if you know, well, I'm gonna ask this kind of question and a lot of times I want this data and that data back,
4:18
you can really carefully structure your documents to match those specialized queries. But the more different types of questions you ask,
4:27
the more angles you ask it from, the more you start to violate or break down number one and two here, right? As you ask different questions of the data
4:37
from different angles and perspectives, the chances that you want that data 80% of the time go down and down and down.
4:45
The chances that you might want the contained document alone, so just one release without the other stuff, go up.
4:52
So that puts pressure on saying don't embed, right? So the more varied your queries are against the same bit of data,
4:59
the more likely you're gonna treat it more relationally and less embedded. Finally, you might ask, do I have an integration database
5:08
or an application database? We'll talk about that next.
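The guidelines from this lecture can be sketched as a rough checklist. This is only an illustration: the `EmbedCandidate` class and `reasons_not_to_embed` function are hypothetical, the 80% and 20K figures come from the numbers mentioned above, and the 20% "wanted alone" threshold is an assumption picked for the example.

```python
from dataclasses import dataclass


@dataclass
class EmbedCandidate:
    wanted_with_parent: float  # fraction of parent reads that also need this data
    wanted_alone: float        # fraction of reads that need one item without the parent
    bounded: bool              # is there a natural cap on how many items exist?
    typical_doc_bytes: int     # rough size of the parent with the data embedded


def reasons_not_to_embed(c, wanted_threshold=0.8, alone_threshold=0.2,
                         soft_limit_bytes=20 * 1024):
    """Collect the guidelines a candidate violates; thresholds are assumptions."""
    reasons = []
    if c.wanted_with_parent < wanted_threshold:
        reasons.append("embedded data is not wanted most of the time")
    if c.wanted_alone > alone_threshold:
        reasons.append("items are often wanted without the containing document")
    if not c.bounded:
        reasons.append("the set is unbounded and could hit the 16 MB limit")
    if c.typical_doc_bytes > soft_limit_bytes:
        reasons.append("the document is larger than a comfortable size")
    return reasons


# Hypothetical inputs for the two examples discussed in the lecture.
releases = EmbedCandidate(wanted_with_parent=0.9, wanted_alone=0.05,
                          bounded=True, typical_doc_bytes=8_000)
page_visits = EmbedCandidate(wanted_with_parent=0.1, wanted_alone=0.6,
                             bounded=False, typical_doc_bytes=5_000_000)

print(reasons_not_to_embed(releases))
print(len(reasons_not_to_embed(page_visits)))
```

An empty list is a reasonable signal to embed; the more reasons that come back, the more the data wants its own collection. The more varied the queries, the further the first two fractions drift in the wrong direction.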