Full Web Apps with FastAPI Transcripts
Chapter: Storing data with SQLAlchemy
Lecture: Populating the database with PyPI data

Login or purchase this course to watch this video and the rest of the course contents.
0:00 We've seen we created the tables, but we haven't put any data in there, and where are we going to get the data? Anyway,
0:06 wouldn't it be nice to have somewhat realistic data? The data we used before was super fake.
0:11 We just generated some integers and said there's this many things. We put like, this is the summary, instead of actual summary information.
0:19 Well, you're in luck because I've given, I've gotten and given to you the actual data from PyPI. So you're familiar with this section,
0:26 the code all over here. But now, let's have a look over here. When I grabbed all this data, I got the top 100 packages and all of their information,
0:37 and I saved them to a bunch of JSON files, so check this out. We go in here, we've got ampq, we've got boto,
0:44 we've got beautifulsoup, we've got the awscli and so on. And if I quick-view these,
0:49 you can see they've got an author and emails and classifiers like their licenses and
0:54 notice, they're not simple. Like I want to say this is a BSD license but I've got to go to the info, to the classifier to
0:59 get the license and split it into pieces and find the BSD part. Say, oh, it has a BSD got it, right. So there's gonna be some work to read this data,
1:08 but it's no problem. Here is the home page, here's the license written on a better way, here's the summary and so on.
1:13 So what I want to do is read all of this data into the database. So I'm gonna go over here and create a folder called bin
1:21 and This is where I put all these little utilities that like, load up the database or run little migrations or maintenance scripts and so on.
1:29 I'm going to drop into here something called load_data because it is not at all worthwhile for you to watch me pulling this all together,
1:36 OK? So we're gonna use a couple of libraries we haven't talked about yet.
1:39 This really cool library called progressbar2 that does a progress bar while stuff is happening in the terminal. That's great,
1:46 unfortunately, it named itself progressbar, and the package is called progressbar2.
1:50 Not, not great in that regard but we've gotta put progressbar2 there and we install that requirement. And over here, this will go away in a second,
1:59 perfect. Similarly annoying, we wanna work with thing that's fantastic called dateutil.parser, but it comes from Python-dateutil so the names don't
2:08 give you a lot of guidance on, you know, what the underlying package to install is. But I'm walking you through it. It's all good.
2:17 Some of these we're not actually using here. And this is release and user, OK. so what we're gonna do is we're gonna go through, I'll just clean
2:26 this up so it runs fine and all that, you don't have to watch. What we're gonna do is load up all of those files,
2:32 those JSON files, and we're gonna parse through them to find out all the
2:36 users that are mentioned within those and then we're gonna create corresponding users to the ones
2:40 we discovered. And then we're gonna go create the packages and then associate the users
2:46 with their package, who created what. We're gonna do language imports. And we're gonna do some license imports and we'll just print out what happened.
2:54 So we're gonna run that real quick, and then we'll have data in our database.
3:00 All right, that thing I dropped in here actually was importing a little more data
3:02 that we needed, so we're not gonna worry about it like the language and maintainers, we didn't create the classes in SQLAlchemy before,
3:08 so there's nowhere for them to go. So what I'm gonna do is I'm gonna run this, it'll initialize that connection, create one of these sessions,
3:13 and it'll check, has data already been imported? If there's no data whatsoever, it's gonna go do the import, then print out a little summary.
3:21 Let's run it. You'll see down to the bottom the little progress bar thing going across, which is cool. If I press, this, it's gonna run the website,
3:27 which is not what we want. We wanna go here, right click, run. It goes, you can see a little progress bar zipping by at the bottom.
3:35 Finally, we have 84 users, 96 packages and 5400 releases. I know said there was a hundred but I think for
3:45 some reason, a couple couldn't be downloaded or something like that. All right, so now if we go back and we connect again over here,
3:54 we click around. Look at that, all of our data is here. Super cool, right? So we've got our users. Here's Armin Ronacher from Flask,
4:01 he doesn't have a password apparently. Kenneth Reitz, say from requests. You should know these names.
4:07 Here's Ned Batchelder. Probably because of coverage.py and so on. Dingo Software Foundation apparently maintains Django,
4:13 but these are all the really popular projects right here. Celery Project, right. That's expected from the top 100 packages off of PyPI,
4:20 right? So here we go, look at this. We have data, we've got our releases and they go back to their packages, we can see that over here somewhere.
4:29 Notice we're using the actual string name of the package as the id, which is pretty cool. argparse, asyncio and so on. We're in a good spot,
4:38 we've got our database all put together, and we've got our collection set all ready to go and actually view these in the
4:43 website. But remember, we still have to write the queries to do it. At least we have the data to write the queries for.


Talk Python's Mastodon Michael Kennedy's Mastodon