Full Web Apps with FastAPI Transcripts
Chapter: Storing data with SQLAlchemy
Lecture: Populating the database with PyPI data
0:00 We've seen we created the tables,
0:01 but we haven't put any data in there,
0:03 and where are we going to get the data? Anyway,
0:05 wouldn't it be nice to have somewhat realistic data?
0:08 The data we used before was super fake.
0:10 We just generated some integers and said there's this many things. We put like,
0:14 this is the summary, instead of actual summary information.
0:18 Well, you're in luck because I've given, I've gotten and given to you the actual
0:23 data from PyPI. So you're familiar with this section,
0:25 the code all over here. But now, let's have a look over here.
0:29 When I grabbed all this data,
0:31 I got the top 100 packages and all of their information,
0:36 and I saved them to a bunch of JSON files,
0:37 so check this out. We go in here,
0:39 we've got ampq, we've got boto,
0:43 we've got beautifulsoup, we've got the awscli and so on.
0:47 And if I quick-view these,
0:48 you can see they've got an author and emails and classifiers like their licenses and
0:53 notice, they're not simple. Like I want to say this is a BSD license
0:55 but I've got to go to the info, to the classifier to
0:58 get the license and split it into pieces and find the BSD part.
1:01 Say, oh, it has a BSD got it, right.
1:04 So there's gonna be some work to read this data,
1:07 but it's no problem. Here is the home page,
1:08 here's the license written on a better way,
1:10 here's the summary and so on.
1:12 So what I want to do is read all of this data into the database.
1:17 So I'm gonna go over here and create a folder called bin
1:20 and This is where I put all these little utilities that
1:22 like, load up the database or run little migrations or maintenance scripts and so on.
1:28 I'm going to drop into here something called load_data because it is not at all
1:31 worthwhile for you to watch me pulling this all together,
1:35 OK? So we're gonna use a couple of libraries we haven't talked about yet.
1:38 This really cool library called progressbar2 that does a progress bar while stuff
1:43 is happening in the terminal. That's great,
1:45 unfortunately, it named itself progressbar,
1:47 and the package is called progressbar2.
1:49 Not, not great in that regard but we've gotta put progressbar2 there and
1:53 we install that requirement. And over here,
1:56 this will go away in a second,
1:58 perfect. Similarly annoying, we wanna work with thing that's fantastic called dateutil.parser,
2:03 but it comes from python-dateutil so the names don't
2:07 give you a lot of guidance on,
2:10 you know, what the underlying package to install is.
2:13 But I'm walking you through it. It's all good.
2:16 Some of these we're not actually using here.
2:19 And this is release and user,
2:21 OK. so what we're gonna do is we're gonna go through, I'll just clean
2:25 this up so it runs fine and all that, you don't have to watch.
2:27 What we're gonna do is load up all of those files,
2:31 those JSON files, and we're gonna parse through them to find out all the
2:35 users that are mentioned within those and then we're gonna create corresponding users to the ones
2:39 we discovered. And then we're gonna go create the packages and then associate the users
2:45 with their package, who created what. We're gonna do language imports.
2:50 And we're gonna do some license imports and we'll just print out what happened.
2:53 So we're gonna run that real quick,
2:55 and then we'll have data in our database.
2:59 All right, that thing I dropped in here actually was importing a little more data
3:01 that we needed, so we're not gonna worry about it like the language and maintainers,
3:04 we didn't create the classes in SQLAlchemy before,
3:07 so there's nowhere for them to go.
3:08 So what I'm gonna do is I'm gonna run this,
3:10 it'll initialize that connection, create one of these sessions,
3:12 and it'll check, has data already been imported?
3:16 If there's no data whatsoever, it's gonna go do the import,
3:18 then print out a little summary.
3:20 Let's run it. You'll see down to the bottom the little progress bar thing going
3:23 across, which is cool. If I press,
3:24 this, it's gonna run the website,
3:26 which is not what we want.
3:27 We wanna go here, right click, run.
3:31 It goes, you can see a little progress bar zipping by at the bottom.
3:34 Finally, we have 84 users,
3:36 96 packages and 5400 releases. I know said there was a hundred but I think for
3:44 some reason, a couple couldn't be downloaded or something like that.
3:47 All right, so now if we go back and we connect again over here,
3:53 we click around. Look at that,
3:54 all of our data is here. Super cool,
3:56 right? So we've got our users.
3:57 Here's Armin Ronacher from Flask,
4:00 he doesn't have a password apparently.
4:01 Kenneth Reitz, say from requests. You should know these names.
4:06 Here's Ned Batchelder. Probably because of coverage.py and so on.
4:10 Dingo Software Foundation apparently maintains Django,
4:12 but these are all the really popular projects right here.
4:15 Celery Project, right. That's expected from the top 100 packages off of PyPI,
4:19 right? So here we go,
4:20 look at this. We have data,
4:22 we've got our releases and they go back to their packages,
4:26 we can see that over here somewhere.
4:28 Notice we're using the actual string name of the package as the id,
4:32 which is pretty cool. argparse,
4:34 asyncio and so on. We're in a good spot,
4:37 we've got our database all put together,
4:39 and we've got our collection set all ready to go and actually view these in the
4:42 website. But remember, we still have to write the queries to do it. At least we have the data to write the queries for.