MongoDB with Async Python Transcripts
Chapter: Performance Tuning
Lecture: Document Design from a Performance Perspective

0:00 Let's chat about document design just one more time and focus in on a few issues.
0:06 We did talk about this when we talked about modeling and I said, "Hey, it's important for performance to get this right."
0:11 But now that you see sometimes you want to project in or out certain pieces of data,
0:17 you want to have an index that traverses this hierarchy in this way to ask those questions, you might think about document design with some fresh eyes.
0:25 So let's look at what we've done here. We have our package and our package has a list of embedded objects that are releases.
0:32 We saw that we can be insanely fast about querying those: 0.1 milliseconds to
0:39 sort through a quarter million releases interspersed over 5,000 documents.
0:45 So on one hand that tells you, oh, we're not actually suffering hardly at all from a response perspective in terms of querying.
0:54 So when you think about this, that is not the issue, although it might've seemed like it would have been. The issue is when I pull back a package,
1:02 if there are a lot of releases, I'm gonna be taking all that data with me by default. So you need to ask again, how often do you need these?
1:12 I put them embedded here so that we'd have some really good examples for this course. I'm right on the fence of whether this is a good idea
1:20 or not, it may be, it probably is, but maybe not. For example, by embedding it, we have to have that second analytics field
1:28 that we gotta keep in sync, which is a little sketchy. It's not terrible, as long as you don't do it too much,
1:33 but it is a consequence of this, right? So for those reasons, it's probably a good idea, but maybe, maybe not.
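To make the "keep in sync" cost concrete, here is a minimal sketch using plain dicts. The field names `release_count` and `last_updated` are hypothetical stand-ins for the analytics field mentioned above, not the course's actual schema. The point is only that every write to the embedded list also has to touch the derived fields, ideally in one update.

```python
from datetime import datetime, timezone


def add_release(package: dict, release: dict) -> dict:
    """Append a release and refresh the derived fields in the same step."""
    package["releases"].append(release)
    # Hypothetical derived fields: they must change whenever `releases` changes.
    package["release_count"] = len(package["releases"])
    package["last_updated"] = release["created_date"]
    return package


def release_update_doc(release: dict) -> dict:
    """Build one MongoDB update document that changes the list and the
    derived fields together, so they cannot drift apart between two writes."""
    return {
        "$push": {"releases": release},
        "$inc": {"release_count": 1},
        "$set": {"last_updated": release["created_date"]},
    }


package = {"_id": "example-package", "releases": [], "release_count": 0, "last_updated": None}
release = {"major_ver": 1, "minor_ver": 0, "created_date": datetime(2024, 1, 1, tzinfo=timezone.utc)}

add_release(package, release)
update = release_update_doc(release)
```

Assuming a driver such as PyMongo, a dict like `update` would be passed as the second argument to `update_one`; the in-memory `add_release` just mirrors what that single update does.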
1:40 We were able to use projections to avoid worrying about it when we didn't need that data. Again, how many releases are possible?
1:48 Is this a set of 10 or a set of 10,000 embedded objects? Also, the more there are, the less likely you want to embed all of them,
1:57 especially if it's gonna go past that 16 megabyte limit per document. So should these be in a separate collection?
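As a rough illustration of the data-transfer point, the sketch below builds a made-up package document with 10 and then 10,000 embedded releases and compares the payload with and without the releases. The document shape and field names are assumptions for illustration, and JSON length is only a stand-in for the real BSON size.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's per-document size limit


def make_package(release_count: int) -> dict:
    """Hypothetical package document with a list of embedded releases."""
    return {
        "_id": "example-package",  # assumed ID, for illustration only
        "summary": "An example package",
        "maintainer_ids": ["user-1", "user-2"],
        "releases": [
            {"major_ver": 1, "minor_ver": n, "build_ver": 0, "comment": "release notes"}
            for n in range(release_count)
        ],
    }


def apply_exclusion(doc: dict, projection: dict) -> dict:
    """Mimic an exclusion projection such as {"releases": 0} on an in-memory dict."""
    excluded = {field for field, flag in projection.items() if flag == 0}
    return {key: value for key, value in doc.items() if key not in excluded}


def approx_size(doc: dict) -> int:
    """Rough payload size in bytes (JSON, not BSON, so treat it as an estimate)."""
    return len(json.dumps(doc).encode("utf-8"))


for count in (10, 10_000):
    package = make_package(count)
    full = approx_size(package)
    slim = approx_size(apply_exclusion(package, {"releases": 0}))
    print(f"{count} releases: full={full} bytes, projected={slim} bytes, "
          f"fraction of limit={full / MAX_DOC_BYTES:.4f}")
```

The projected size stays flat while the full document grows with every release. With PyMongo, the same `{"releases": 0}` dict can be passed as the projection argument of `find_one` so the server never sends the list at all.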
2:05 Additionally, we also have the maintainer IDs, which are the IDs of the users who maintain the package. Now, you never, never is a strong word,
2:15 you almost never, ever, I have never, ever seen a normalization many-to-many relationship table.
2:21 So like a package_to_users table that just has the package name
2:27 and the username, the user ID. Never, never seen that; you don't need it in a document database.
2:34 In this case, what we decided was inside the package, we're going to put the ID which is
2:39 a small bit of data, not huge, the ID of every maintainer, and there won't be that many maintainers of a package, it won't grow dramatically.
2:47 So this should be totally fine in terms of the scenarios below. The one question you might ask is, does this belong on the package?
2:57 Or does it belong on the user? Right, on the user, we could say, okay, the user has a list of maintained packages as a list of strings.
3:07 And then anytime we show a package, we could query the user table where this ID is in the user's maintained packages list, right?
3:20 That would give us back a list of users. So either way you go, we decided as much as we need to release this, we probably want
3:26 to have the information about the users who maintain it as well.
3:31 So I put it on the package, not the user side, but it could go on either side of that many to many relationship. Right, there it is.
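The two sides of that many-to-many relationship can be sketched with in-memory lists standing in for the collections. The IDs and the `maintained_packages` field name are made up for illustration; the filter dicts show what the equivalent MongoDB queries might look like, assuming those field names.

```python
# Option A: maintainer IDs live on the package (the choice made in the course).
packages = [
    {"_id": "pkg-a", "maintainer_ids": ["u1", "u2"]},
    {"_id": "pkg-b", "maintainer_ids": ["u2"]},
]

# Option B: package IDs live on the user (hypothetical alternative).
users = [
    {"_id": "u1", "maintained_packages": ["pkg-a"]},
    {"_id": "u2", "maintained_packages": ["pkg-a", "pkg-b"]},
]


def maintainers_from_package(package_id: str) -> list[str]:
    """Option A: read the ID list straight off the package document."""
    package = next(p for p in packages if p["_id"] == package_id)
    return sorted(package["maintainer_ids"])


def maintainers_from_users(package_id: str) -> list[str]:
    """Option B: find every user whose list contains the package ID."""
    return sorted(u["_id"] for u in users if package_id in u["maintained_packages"])


# Equivalent query filters against a users collection:
# Option A fetches the user documents for IDs already stored on the package.
filter_a = {"_id": {"$in": maintainers_from_package("pkg-a")}}
# Option B matches users whose array field contains the package ID.
filter_b = {"maintained_packages": "pkg-a"}

print(maintainers_from_package("pkg-a"), maintainers_from_users("pkg-a"))
```

Both options answer the same question without a separate join table. The difference is which document grows and which query you run when showing a package.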
3:39 So again, thinking about document design is more about the data transfer and the type
3:45 of queries you can answer, not so much about the query speed. As we saw with releases, we can still ask super fast questions about them.

