MongoDB with Async Python Transcripts
Chapter: PyPI Beanie
Lecture: Review: The Data Model
0:00
Let's start off by talking through the data exactly as we have it in MongoDB. Remember when we talked about modeling in the previous chapter,
0:10
we gave a sense of how that might go, but let's actually look inside of Mongo at the data and see what we're working with, okay?
0:16
So over here, I've opened up the free edition of Studio 3T, and this is the connection that we created when we talked about this tool the first time.
0:26
And over here, you can see we have our PyPI data, and the most important parts are these three collections. Recall that MongoDB doesn't have tables,
0:35
it has collections because, well, it's not tabular data. That's actually the whole point. We've already played a little bit with this user data.
0:43
So let's have a quick look at that. And we can just pick an arbitrary one here. This is apparently some info@2capture.com, whatever that is, account.
0:51
So basically all of the MongoDB documents have an underscore ID (_id), which in Python
0:58
is just an id property or field of the Beanie or Pydantic class, right? So you don't have to worry about the underscore in Python,
1:06
but that's how it is in the document. And unless you do something special, it's an object ID. You can change it to be something more unique
1:15
if there is a unique aspect of that account. So, for example, potentially we could use the email address,
1:21
but if we want to let people change their email address, you don't necessarily want them to change their primary key. Possible, but not ideal.
1:28
So we're not using their email, even though that may be unique. All right, so we got their name, email,
1:33
password hash here, the created date and login date, as well as a profile image. If they have one, in this case, we don't have one.
1:43
We also have this location for them, this state and country. I don't know that we actually have that data out of PyPI,
1:48
but it's just something we're modeling to show you kind of how you might do that with an embedded document. The next one we have is very, very simple.
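To make the shape of that user document concrete, here is a minimal plain-Python sketch using dataclasses. It is not the Beanie model built later in the course; the field names (hash_password, last_login, profile_image_url) are assumptions based on the fields described above, and a string stands in for the ObjectId. It shows the embedded location living inside the user and the "_id" key mapping to a plain id attribute.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class Location:
    # Embedded document: stored inside the user, not in its own collection.
    state: Optional[str] = None
    country: Optional[str] = None


@dataclass
class User:
    id: str  # stand-in for the ObjectId stored under "_id" (assumption)
    name: str
    email: str
    hash_password: Optional[str]  # hypothetical field name
    created_date: datetime
    last_login: datetime  # hypothetical field name
    profile_image_url: Optional[str]  # hypothetical field name
    location: Location


def user_from_document(doc: Dict[str, Any]) -> User:
    """Build a User from a MongoDB-style dict, mapping "_id" to id."""
    loc = doc.get("location") or {}
    return User(
        id=str(doc["_id"]),
        name=doc["name"],
        email=doc["email"],
        hash_password=doc.get("hash_password"),
        created_date=doc["created_date"],
        last_login=doc["last_login"],
        profile_image_url=doc.get("profile_image_url"),
        location=Location(state=loc.get("state"), country=loc.get("country")),
    )
```

A missing profile image or location simply comes back as None, which mirrors the "if they have one" case in the document.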
1:56
Release analytics has one record because the way we're modeling the packages with the releases
2:03
being embedded in there makes it really hard to count how many total releases there are. So this is
2:09
a place where you have a tiny bit of data duplication to open up possibilities for much
2:14
more productive sort of embedding for the 95% use case. And then finally we have packages.
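The trade-off behind release analytics can be sketched with plain dictionaries. This is an illustration only, not how the course code does it: the key total_releases and the sample package data are assumptions. Counting embedded releases means walking every package, while the single analytics record keeps a duplicated total that is cheap to read.

```python
from typing import Any, Dict, List

# Hypothetical sample data: releases are embedded inside each package.
packages: List[Dict[str, Any]] = [
    {"_id": "pkg-a", "releases": [{"major_ver": 1}, {"major_ver": 2}]},
    {"_id": "pkg-b", "releases": [{"major_ver": 1}]},
    {"_id": "pkg-c", "releases": []},
]


def count_releases_by_scanning(docs: List[Dict[str, Any]]) -> int:
    """Count releases the slow way: visit every package document."""
    return sum(len(doc.get("releases", [])) for doc in docs)


# The single analytics record duplicates the total so it can be read directly.
release_analytics: Dict[str, int] = {
    "total_releases": count_releases_by_scanning(packages)
}


def add_release(
    package: Dict[str, Any], release: Dict[str, Any], analytics: Dict[str, int]
) -> None:
    """Append an embedded release and keep the duplicated counter in sync."""
    package.setdefault("releases", []).append(release)
    analytics["total_releases"] += 1
```

The cost of the duplication is that every code path that adds a release also has to update the counter.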
2:22
Again, we have all of our top pieces here and check out this one. The name of the package in PyPI cannot change and it has to be unique.
2:31
So we don't need to have a separate object ID plus a uniqueness constraint on some package name.
2:39
We can just make the package name, the string itself, be the ID, which is pretty cool. Again, like everything, we have a created and last updated date.
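MongoDB requires _id to be unique within a collection, which is why using the package name as the ID removes the need for a separate uniqueness constraint. The toy in-memory class below is only an illustration of that idea; InMemoryCollection and DuplicateKey are hypothetical names, not part of any MongoDB driver or Beanie.

```python
from typing import Any, Dict, Optional


class DuplicateKey(Exception):
    """Raised when a document with the same _id already exists."""


class InMemoryCollection:
    """Toy stand-in for a collection: documents are keyed by their _id."""

    def __init__(self) -> None:
        self._docs: Dict[Any, Dict[str, Any]] = {}

    def insert_one(self, doc: Dict[str, Any]) -> None:
        key = doc["_id"]
        if key in self._docs:
            # Uniqueness comes for free because _id is the key.
            raise DuplicateKey(key)
        self._docs[key] = doc

    def find_by_id(self, key: Any) -> Optional[Dict[str, Any]]:
        return self._docs.get(key)


packages = InMemoryCollection()
packages.insert_one({"_id": "demo-package", "summary": "A made-up package"})
```

Looking a package up by name is then a direct primary-key lookup rather than a query against a secondary field.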
2:47
If you go view this package on PyPI itself, and we just put in its ID here, this part, this project description,
2:56
the easiest way to quickly do such and such, and it has all the code samples and the tables and all that. That is exactly here: Python module,
3:06
the easiest way to, and it just goes on and on. Here's the markdown code parts and the tables, which is why you can see like,
3:13
there's this huge long bit of text, and you can't fold that over and so on. Normally there's a way to say,
3:21
like click here and collapse this chunk. So as you know, it's just too much. We're not doing that right now.
3:26
So anyway, that's what this morning is about. But the description is basically the readme for the PyPI page, homepage, package URL.
3:34
This is the page we just pulled up basically. Who the author is, the email, and then if there's a license specified,
3:42
sometimes there is, sometimes there isn't. And then we have the releases. And this is an embedded array or an embedded list.
3:49
Here's the array part and here's the embedded object. So we have the version made up of those three pieces.
3:55
We have the created date, the comments, the file download, the size of that download. And you can just see all of those here.
4:01
And finally, if there are any maintainers, we'll put the maintainer IDs in here if we have found that relationship.
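Putting the package document together, a plain-Python sketch of its shape might look like the following. Again this uses dataclasses rather than Beanie, and every field name here (major_ver, minor_ver, build_ver, maintainer_ids, and so on) is an assumption inferred from the description above, not the actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Release:
    # The version is made up of three pieces (names are assumptions).
    major_ver: int
    minor_ver: int
    build_ver: int
    created_date: datetime
    comment: Optional[str] = None
    url: Optional[str] = None  # file download location
    size: int = 0  # size of the download in bytes

    @property
    def version(self) -> str:
        return f"{self.major_ver}.{self.minor_ver}.{self.build_ver}"


@dataclass
class Package:
    id: str  # the package name itself, stored under "_id"
    created_date: datetime
    last_updated: datetime
    description: Optional[str] = None  # the README shown on the PyPI page
    home_page: Optional[str] = None
    package_url: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[str] = None
    license: Optional[str] = None  # sometimes specified, sometimes not
    # Embedded list of release objects, stored inside the package document.
    releases: List[Release] = field(default_factory=list)
    # IDs of user documents, only where the relationship is known.
    maintainer_ids: List[str] = field(default_factory=list)
```

Note the two different styles of relationship: releases are embedded objects, while maintainers are stored as IDs that point at documents in the users collection.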
4:09
When I downloaded this data from PyPI, I didn't download all of it. I don't know if it's terabytes or many, many gigabytes.
4:17
I don't know how much data it is, but it's a lot. I just downloaded the top 5,000. So some of this data might not be 100% all tied together,
4:25
but that's the model that we're using. Some of the places might be empty like this, and sometimes they might be filled out.
4:31
But the most important thing that we're gonna focus on is this packages collection here, because it represents the main thing that you care about
4:41
when you go to our API. But we also have our users and we have our release analytics to allow us to answer really simple questions like,
4:50
over here, how many releases are there? This is easy on a MongoDB query, this is easy on a MongoDB query,
4:57
but this one, because of the way we've embedded it, turns out to be a bit of a challenge. So we're kind of storing that data separately
5:04
in the database. That's our data model that we're working with. So we're going to take those JSON, BSON definitions and turn those into Beanie classes.