MongoDB for Developers with Python Transcripts
Chapter: Modeling and document design
Lecture: A real world example
0:01 So let's look inside the application that you're using right now
0:03 to take this course as an example.
0:06 So at the time of this recording, here's what the Talk Python training
0:10 website database looks like for courses and users.
0:14 So, first let's focus on the course side of things,
0:17 there's a couple of interesting ideas here,
0:19 one, we have an id which is not an object id, why is it not an object id,
0:24 well, it was actually migrated from a relational database initially,
0:27 this was using SQLAlchemy, and it was easier to keep this id here as a number
0:33 rather than switch to MongoDB's object id,
0:36 it's also easier to refer to it in other areas,
0:39 like say in the e-commerce system, where I can just put the id in;
0:42 I don't have very much space in the message
0:46 that can go into the e-commerce system, based on their API,
0:49 so a short number is much easier than like 24 characters,
0:51 so we're using the non standard id which is generated in the app
0:55 but for these types of things, that is really no big deal,
0:58 for the users, I think we might be using object ids.
1:01 We have some fairly flat things here: we have the url and the title
1:04 and when it was published, things like that,
1:07 so this is the Learn Python by Building Ten Apps Jumpstart Course
1:10 and you can see a lot of the initial ideas here,
1:13 and the initial pieces of data are totally straightforward
1:16 and they would look exactly the same in a relational database.
1:19 However, there are two things that are very different
1:21 that I want to draw your attention to;
1:24 the first is not actually the embedded stuff, but this duration in seconds,
1:27 when I created the MongoDB version of this web app,
1:31 I realized one of the things I do all the time on the home page,
1:36 on the course listing page, and many many places,
1:39 is I say how long is the course, this course is 6.5 hours,
1:42 I think this one is 7.1 hours or something to that effect.
1:46 Using quick math you can figure that out from the duration in seconds.
1:48 So there was actually a pretty serious bottleneck
1:51 where I'd have to go and in this case pull back 12 chapters
1:55 and then from the chapters I could get the lectures
1:58 and from the lectures I could get how long each individual one was,
2:01 I'd add that all up and then I could print out that number.
2:05 And then I would do that on, say, the course catalog page,
2:08 where there were like ten courses; I would have to go through so many of these chapters
2:13 and then their subsequent lectures, and that was a huge, huge bottleneck.
2:15 So what I decided to do was in the application,
2:18 any time I save or update the course, I'm going to compute this on save
2:22 which is extremely rare, and then I'm going to stash this here,
2:26 so this is actually computed from the chapters
2:29 which are computed from the lectures themselves,
2:31 and this is data duplication, but you'll find that a little bit of data duplication goes a long way;
2:36 I find that in most apps there are usually one or two little pieces like this that
2:40 just unlock a lot of performance
2:42 because actually computing this turns out to be really really computationally expensive,
2:47 but storing it here on this object made it super fast.
2:50 So this is one thing, this data duplication
2:53 which I try to stay away from as much as I can
2:55 but the trade-off here was so worth it.
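[Editor's sketch] The compute-on-save idea described above might look roughly like this in Python. This is a minimal sketch, not the site's actual code: the field names (duration_in_sec, lectures, chapters) and the helper names are assumptions for illustration, and the documents are plain dicts rather than real database models.

```python
# Hypothetical document shapes; all field names here are assumptions.
def chapter_duration(chapter):
    """Sum the raw per-lecture durations embedded in one chapter."""
    return sum(lecture["duration_in_sec"] for lecture in chapter["lectures"])


def prepare_course_for_save(course, chapters):
    """Recompute the duplicated totals just before the (rare) save."""
    total = 0
    for chapter in chapters:
        chapter["duration_in_sec"] = chapter_duration(chapter)
        total += chapter["duration_in_sec"]
    course["duration_in_sec"] = total
    return course


chapters = [
    {"_id": 1001, "lectures": [{"duration_in_sec": 202}, {"duration_in_sec": 398}]},
    {"_id": 1002, "lectures": [{"duration_in_sec": 600}]},
]
course = prepare_course_for_save({"_id": 1, "chapters": [1001, 1002]}, chapters)

# Pages that list courses can now read one stored number instead of
# walking every chapter and lecture.
hours = round(course["duration_in_sec"] / 3600, 1)
```

The cost moves to the write path, which is acceptable here because courses are saved far less often than they are displayed.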
2:57 Now, the other part we want to focus on is down here,
2:59 we said I'd like to associate these chapter ids with a particular course,
3:03 now if this was a relational database,
3:05 I might have a course to chapter normalization table, right,
3:09 it'd have the course id and the chapter id
3:11 and I'd do a query with some kind of join on that;
3:14 you almost never, ever see that in MongoDB and document databases.
3:19 Usually, at least the ids are embedded on one side of that one-to-many relationship,
3:23 so here we have the course, the course has some chapters,
3:27 so we're just storing the ids here.
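[Editor's sketch] To make the contrast concrete, here is a small illustration of folding join-table rows into a list of ids stored on the course document. The row data, the "chapters" field name, and the helper function are all hypothetical.

```python
# Relational style: one row per (course_id, chapter_id) pair in a join table.
course_chapter_rows = [(1, 1001), (1, 1002), (2, 2001)]


def embed_chapter_ids(course_id, rows):
    """Build a course document that carries its chapter ids directly."""
    chapter_ids = [chapter for course, chapter in rows if course == course_id]
    return {"_id": course_id, "chapters": chapter_ids}


course_doc = embed_chapter_ids(1, course_chapter_rows)
```

With the ids on the course itself, loading the course already tells you which chapters to fetch, with no join step in between.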
3:29 Now, we also have the chapters, you can see chapter 1001 goes right here
3:35 and this one is a little bit more interesting,
3:37 we've got again our duration in seconds
3:40 which is another computed value: if you look at the individual lectures,
3:44 they've got duration in seconds, and that's the real raw number.
3:48 So this is another duplication, because at many, many levels
3:51 I need to show the time of a chapter,
3:53 and that was turning out to be computationally expensive at many levels,
3:56 so again, these two places, this is the one bit of duplicated data
4:00 and you will see that this is more common
4:03 in a document database than in a relational one.
4:05 So here we've got our chapter which has this soft relationship
4:08 from the course over to the id,
4:10 we also have the course id down there below it,
4:12 so it's kind of this bidirectional relationship.
4:15 Then we have lectures, and lectures are interesting in that
4:18 almost every time that we get a hold of a chapter
4:22 we care about its lectures; we usually want to display them in a list, and
4:27 any time that I get a lecture, this is the thing like you're watching right now,
4:30 this is the lecture, right, an individual video let's say,
4:32 any time you have one of those, you almost always need the other ones,
4:36 at least the ones before and after it. If you look in this particular player,
4:40 you'll see there are forward and backward buttons within the course
4:45 so that you can skip ahead or skip back; those are the other lectures.
4:48 So what I find is that grouping the chapter along with the lectures into one blob
4:51 makes it super fast, and I almost always want the other lectures
4:57 when I have one lecture, and if I have the lecture,
4:59 I usually need to display the chapter title, and things like that.
5:02 Anyway, so these are really well suited to be put together in this embedded style,
5:06 so I don't have a lectures table; I have courses
5:09 and I have chapters, and then the chapters are embedding the lectures,
5:12 and we also saw that little bit of data duplication.
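[Editor's sketch] Because the lectures live inside the chapter document, finding the lectures before and after the current one needs no further query once the chapter is loaded. A minimal sketch, assuming hypothetical field names and lecture ids, and handling only neighbors within a single chapter (the real player also moves across chapters):

```python
def neighbors(chapter, lecture_id):
    """Return the (previous, next) lectures around lecture_id in one chapter.

    Either side is None at the edge of the chapter. Raises StopIteration
    if the lecture id is not in this chapter.
    """
    lectures = chapter["lectures"]
    index = next(i for i, lec in enumerate(lectures) if lec["_id"] == lecture_id)
    previous = lectures[index - 1] if index > 0 else None
    following = lectures[index + 1] if index + 1 < len(lectures) else None
    return previous, following


chapter = {
    "_id": 1001,
    "title": "Welcome",  # hypothetical title
    "lectures": [
        {"_id": 10101, "title": "Intro", "duration_in_sec": 120},
        {"_id": 10102, "title": "Doing the exercises", "duration_in_sec": 202},
        {"_id": 10103, "title": "Setup", "duration_in_sec": 300},
    ],
}
before, after = neighbors(chapter, 10102)
```

The chapter title needed by the player is on the same document, so one read covers the title, the lecture list, and the skip buttons.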
5:15 So you can see down here is an individual embedded lecture,
5:18 here's one that talks about doing the exercises
5:20 in this course and it's apparently 202 seconds,
5:24 so I hope this look behind the scenes has helped you understand
5:28 how you might model this stuff, you can look at the course page
5:30 and the player and think about some of the trade-offs,
5:33 I don't know that this is perfect, but it is absolutely working well for the web app.
5:37 Let's look at one more thing.
5:39 Down here we have the users, and we have a couple of items
5:41 that we're going to focus on when we get to the users.
5:44 I have blurred some out; we're using object id now for the user id,
5:46 and I covered the password and things like that,
5:49 but we've got some flat stuff like whether or not you're opting out of email,
5:52 what your user name is, what your email address is, things like that.
5:55 And then, I have this concept of an origin,
5:58 so if you come from like some particular marketing source
6:01 it might record like hey this person created their account
6:04 and they originally came from Facebook,
6:06 this person originally came from the podcast or something like that,
6:08 so that's pretty interesting, we also have the courses that you are taking,
6:11 so right here, this particular person, this is me,
6:14 so I gave myself basically all the courses,
6:17 these are the ids of the courses that I am a student in,
6:20 so again, there's not a users table, a courses table, and a user-courses
6:23 sort of normalization table. It's very common that when I as a user
6:29 am loaded from the database, I very often need to know about the courses.
6:32 Now I can't easily embed the course into the user, right,
6:36 that'd be like insane levels of duplication,
6:38 but the closest thing I can do is get this list
6:40 and then go back and do another query:
6:42 give me all the courses where the course id is in this list of owned courses.
6:46 So basically, with two queries I have everything I need.
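[Editor's sketch] The two-query pattern might be written like this with PyMongo. It assumes `db` is a PyMongo database object with `users` and `courses` collections, and that the list of course ids is stored under a field called "owned_courses"; those names are guesses for illustration, not the site's confirmed schema.

```python
def courses_for_user(db, user_id):
    """Load a user's courses in two queries instead of a join.

    `db` is assumed to be a PyMongo Database; the collection and
    field names are hypothetical.
    """
    # Query 1: fetch only the list of course ids from the user document.
    user = db.users.find_one({"_id": user_id}, {"owned_courses": 1})
    if user is None:
        return []

    # Query 2: fetch every course whose _id appears in that list.
    course_ids = user.get("owned_courses", [])
    return list(db.courses.find({"_id": {"$in": course_ids}}))
```

The `$in` operator matches documents whose field equals any value in the given list, so the second query returns all of the user's courses in one round trip.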
6:49 We also have the bundle id and some other things going on here.
6:52 So that embedded course id, that's actually a list.
6:55 One more thing to look at down here is this preferences object;
6:58 this is a short name, a somewhat short name,
7:02 this is the preferences for your player
7:05 so when you're in the video player, you can choose different qualities,
7:08 you can turn on captions or you can turn off captions,
7:12 subtitles, transcripts basically and you can choose a playback speed,
7:15 it could be like 0.75 up to two or three or something crazy like that.
7:19 One of the primary actions a user does on this site is to go through the course,
7:25 each course might have 150 lectures
7:28 so as a user, you come in, you look around a little bit,
7:31 and then you go through 150 lectures,
7:33 so this preferences thing needs to be pulled back frequently.
7:36 And since we've got to get the user anyway, embedding them together means
7:39 it's basically instant access any time I'm in the player
7:42 to figure out how to preconfigure the player
7:46 to render your video the way that you like it.
7:48 So this is an embedded item, but not an embedded list
7:51 just an embedded preference object.
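[Editor's sketch] An embedded preferences object means the player settings arrive with the user document itself. Here is a minimal sketch of reading them with fallbacks; the field names ("preferences", "quality", "captions", "speed") and the default values are assumptions, not the site's actual schema.

```python
# Hypothetical defaults applied when a user has not chosen a setting.
DEFAULT_PREFERENCES = {"quality": "auto", "captions": False, "speed": 1.0}


def player_settings(user):
    """Merge the user's embedded preferences over the defaults."""
    return {**DEFAULT_PREFERENCES, **user.get("preferences", {})}


user = {
    "_id": "some-object-id",  # placeholder, not a real ObjectId
    "email_opt_out": False,
    "preferences": {"captions": True, "speed": 1.5},  # single embedded object
}
settings = player_settings(user)
```

Since the user is loaded on every request in the player anyway, reading the preferences costs no extra query.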
7:53 So there you have it, a look inside Talk Python Training
7:57 at least as it was when we recorded this,
7:59 so hopefully this helps you think through some of the challenges
8:03 of building a more realistic app.