MongoDB with Async Python Transcripts
Chapter: PyPI Beanie
Lecture: Review: The Data Model

0:00 Let's start off by talking through the data exactly as we have it in MongoDB. Remember when we talked about modeling in the previous chapter,
0:10 we gave a sense of how that might go, but let's actually look inside of Mongo at the data and see what we're working with, okay?
0:16 So over here, I've opened up the free edition of Studio 3T, and this is the connection that we created when we talked about this tool the first time.
0:26 And over here, you can see we have our PyPI data, and the most important parts are these three collections. Recall that MongoDB doesn't have tables,
0:35 it has collections because, well, it's not tabular data. That's actually the whole point. We've already played a little bit with this user data.
0:43 So let's have a quick look at that. And we can just pick an arbitrary one here. This is apparently some info@2capture.com, whatever that is, account.
0:51 So basically all of the MongoDB documents have an underscore ID (_id), which in Python
0:58 is just an id property or field of the Beanie or Pydantic class, right? So you don't have to worry about the underscore in Python,
1:06 but that's how it is in the document. And unless you do something special, it's an object ID. You can change it to be something more unique
1:15 if there is a unique aspect of that account. So, for example, potentially we could use the email address,
1:21 but if we want to let people change their email address, you don't necessarily want them to change their primary key. Possible, but not ideal.
1:28 So we're not using their email, even though that may be unique. All right, so we got their name, email,
1:33 password hash here, the created date and login date, as well as a profile image. If they have one, in this case, we don't have one.
1:43 We also have this location for them, this state and country. I don't know that we actually have that data out of PyPI,
1:48 but it's just something we're modeling to show you kind of how you might do that with an embedded document. The next one we have is very, very simple.
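Before moving on, the user document just described can be sketched in Python. This is only an illustration under assumptions: plain dataclasses stand in for the Beanie or Pydantic classes, the field names (password_hash, last_login, profile_image_url) are guesses at what the document holds, and the sample values are made up. It shows two things from the walkthrough: the location is an embedded document inside the user, and the Python-side id becomes _id in the stored document.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import Optional


@dataclass
class Location:
    # Embedded document: lives inside the user document, not in its own collection.
    state: Optional[str] = None
    country: Optional[str] = None


@dataclass
class User:
    # Field names here are assumptions, not the course's actual schema.
    name: str
    email: str
    password_hash: str
    created_date: datetime
    last_login: datetime
    location: Location = field(default_factory=Location)
    profile_image_url: Optional[str] = None
    id: Optional[str] = None  # stand-in for the ObjectId MongoDB generates


def to_document(user: User) -> dict:
    """Build the dict shape MongoDB would store, renaming id to _id."""
    doc = asdict(user)  # nested dataclasses become nested dicts
    doc["_id"] = doc.pop("id")
    return doc


# Made-up sample data, for illustration only.
user = User(
    name="Example User",
    email="user@example.com",
    password_hash="<hash>",
    created_date=datetime(2020, 1, 1),
    last_login=datetime(2020, 1, 2),
    location=Location(state="Oregon", country="USA"),
)
doc = to_document(user)
```

Because the email is an ordinary field rather than the key, it can change later without touching _id.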
1:56 Release analytics has one record because, the way we're modeling the packages with the releases
2:03 being embedded in there, it's really hard to count how many total releases there are. So this is
2:09 a place where you have a tiny bit of data duplication to open up possibilities for much
2:14 more productive sort of embedding for the 95% use case. And then finally we have packages.
2:22 Again, we have all of our top pieces here and check out this one. The name of the package in PyPI cannot change and it has to be unique.
2:31 So we don't need to have a separate object ID plus a uniqueness constraint on some package name.
2:39 We can just make the package name, the string itself, be the ID, which is pretty cool. Again, like everything, we have a created and last updated date.
2:47 If you go view this package on PyPI itself, and we just put in its ID here, this part, this project description,
2:56 the easiest way to quickly do such and such, and has all the code samples and the tables and all that. That is exactly here, Python module,
3:06 the easiest way to, and it just goes on and on. Here's the markdown code parts and the tables, which is why you can see like,
3:13 there's this huge long bit of text, and we don't wanna have code, you can't fold that over and so on. Like you can't, normally there's a way to say,
3:21 like click here and collapse this chunk. So as you know, it's just too much. We're not doing that right now.
3:26 So anyway, that's what this warning is about. But the description is basically the README for the PyPI page. Then we have the homepage and the package URL.
3:34 This is the page we just pulled up basically. Who the author is, the email, and then if there's a license specified,
3:42 sometimes there is, sometimes there isn't. And then we have the releases. And this is an embedded array or an embedded list.
3:49 Here's the array part and here's the embedded object. So we have the version made up of those three pieces.
3:55 We have the created date, the comments, the file download, the size of that download. And you can just see all of those here.
4:01 And finally, if there's any maintainers, we'll put the maintainer IDs in here if we have found that relationship.
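The package document described above can be sketched the same way. Again this is a hedged illustration, not the course's code: dataclasses stand in for Beanie classes, the field names (major_ver, minor_ver, build_ver, maintainer_ids and so on) are assumptions, and "example-package" is a made-up name. The points it shows come from the walkthrough: the package name is the id, releases are an embedded list of objects, and maintainers are stored as a list of IDs.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Release:
    # Embedded object inside the package's releases array.
    major_ver: int
    minor_ver: int
    build_ver: int
    created_date: datetime
    comment: Optional[str] = None
    url: Optional[str] = None  # where the file is downloaded from
    size: int = 0              # size of that download in bytes


@dataclass
class Package:
    id: str  # the package name itself serves as _id, so no separate ObjectId
    created_date: datetime
    last_updated: datetime
    description: str = ""
    home_page: Optional[str] = None
    package_url: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[str] = None
    license: Optional[str] = None
    releases: List[Release] = field(default_factory=list)
    maintainer_ids: List[str] = field(default_factory=list)


def latest_version(package: Package) -> Optional[Tuple[int, int, int]]:
    """Return the highest (major, minor, build) among the embedded releases."""
    versions = [(r.major_ver, r.minor_ver, r.build_ver) for r in package.releases]
    return max(versions) if versions else None


# Made-up sample data, for illustration only.
pkg = Package(
    id="example-package",
    created_date=datetime(2020, 1, 1),
    last_updated=datetime(2021, 6, 1),
    releases=[
        Release(1, 0, 0, datetime(2020, 1, 1)),
        Release(1, 2, 0, datetime(2021, 6, 1), size=2048),
    ],
)
latest = latest_version(pkg)
```

Everything needed to display a package, including its releases, comes back in one document.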
4:09 When I downloaded this data from PyPI, I didn't download all of it. I don't know if it's terabytes or many, many gigabytes.
4:17 I don't know how much data it is, but it's a lot. I just downloaded the top 5,000. So some of this data might not be 100% all tied together,
4:25 but that's the model that we're using. Some of the places might be empty like this, and sometimes they might be filled out.
4:31 But the most important thing that we're gonna focus on is this packages collection here, because it represents the main thing that you care about
4:41 when you go to our API. But we also have our users, and we have our release analytics to allow us to answer really simple questions like,
4:50 over here, how many releases are there? This is easy on a MongoDB query, this is easy on a MongoDB query,
4:57 but this one, because of the way we've embedded it, turns out to be a bit of a challenge. So we're kind of storing that data separately
5:04 in the database. That's our data model that we're working with. So we're going to take those JSON, BSON definitions and turn those into Beanie classes.
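To see why the release count is the awkward one, here is a minimal sketch using plain Python dicts in place of real MongoDB documents. The package names, the total_releases field, and the add_release helper are all hypothetical. Counting packages is just counting documents, but counting releases means walking every package's embedded array, so a small separate record keeps the running total.

```python
from typing import Dict, List

# Made-up package documents with embedded release arrays.
packages: List[Dict] = [
    {"_id": "package-a", "releases": [{"version": "1.0.0"}, {"version": "1.1.0"}]},
    {"_id": "package-b", "releases": [{"version": "0.1.0"}]},
    {"_id": "package-c", "releases": []},
]


def count_releases_by_scanning(docs: List[Dict]) -> int:
    """The slow way: visit every package and add up its embedded releases."""
    return sum(len(doc["releases"]) for doc in docs)


# The single release-analytics record: a small, deliberate duplication of data.
release_analytics: Dict[str, int] = {
    "total_releases": count_releases_by_scanning(packages)
}


def add_release(package: Dict, release: Dict, analytics: Dict[str, int]) -> None:
    """Append a release to a package and keep the stored total in step."""
    package["releases"].append(release)
    analytics["total_releases"] += 1


add_release(packages[2], {"version": "0.0.1"}, release_analytics)

package_count = len(packages)                         # easy: one entry per document
release_count = release_analytics["total_releases"]   # easy: read the stored total
```

The trade-off is that every write that adds a release also has to update the counter, in exchange for a cheap read.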

