Hello, and welcome to Just Enough Python for Data Scientists.
I'm really glad you're interested in this course.
It's one that I've wanted to create for a really long time.
I see people coming into the Python space from non-traditional computer science or software engineering roles.
Maybe they're coming in through astronomy or biology or psychology research.
It doesn't really matter where they're coming from.
There are all these people using Python who are not traditionally computer scientists or programmers.
They don't consider themselves a programmer first.
Maybe that's you.
How do you introduce yourself at a party?
Do you say, I'm a psychology researcher?
Or do you say, I'm a data scientist?
Or do you say, I'm a software engineer?
Well, if it's the first one, and possibly the second, this course is for you.
The idea here is we're going to cover enough Python language, Python tooling, and software engineering practices to help you really level up your data science work.
Maybe you're just fine writing Jupyter notebooks and getting some work done, but you have that feeling in the back of your mind like, hmm, probably should be more professional about this, or I bet there's a better way to do this, but we'll just get it working now.
If that's how you feel, then welcome to the course, because we're going to go through all those techniques, tools, and ways of working that will make your programs and your notebooks way more professional.
So if you're ready to level up your data science by getting just enough Python and software engineering to really get going and build professional reports, APIs, and so on, I'm excited to have you here.
Let's dive in.
I want to tell you about the topics of this course, but I actually want to tell you what is not a topic of this course before we get to it.
Why?
I feel like a lot of people who are coming into programming and data science have this tendency to boil the ocean.
They see all these different things that they need to learn.
They see people talking about how, well, Google runs their code on Kubernetes and that kind of infrastructure.
So we need to use Kubernetes.
And oh, I see this machine learning library is important.
So I'm going to go learn all of machine learning.
No, don't do that.
There's what you need to learn to get started and what you should ignore throughout your initial journey as you're getting your foundation set in programming and data science.
And then as you get to those more advanced or more specialized areas, maybe focus on that.
But don't try to do that from the beginning without a very clear need.
So briefly, I want to go through things I think you should safely avoid to get started.
And it'll really help you focus on foundational things, productive steps, tools, and so on.
Number one is cloud computing and deployment, DevOps, all that stuff.
Avoid cloud platforms like AWS and Azure.
Don't worry about deployment pipelines.
This is probably not something you're going to need right away.
Now, of course, if your boss comes to you and says, I need you to deploy this thing on AWS, well, guess what?
You're learning AWS.
But in general, don't preemptively worry about cloud computing or Linux or DevOps, all those kinds of things at first.
Similarly, big data and distributed frameworks, Spark, Hadoop, et cetera, don't need that right away.
Advanced object-oriented programming, or heck, even basic object-oriented programming.
You don't need to dive deeply into classes, inheritance, design patterns, or other advanced object-oriented programming concepts, and definitely not metaclasses.
If you hear about those things, you probably don't need to worry about it.
You might use a class that comes from a library that uses inheritance and design patterns.
It's different to consume these things rather than understand and create them from scratch.
So yes, you'll use them, but you don't need to know much about them to use them.
Deep learning frameworks, machine learning frameworks.
Don't go teaching yourself TensorFlow, PyTorch, or other specialized machine learning libraries.
For now, typically there's APIs you can call or simpler things that you can do.
Again, if your focus is this, then maybe learn one of those.
but for most data scientists, it's not.
So don't worry about it.
Extensive algorithm theory, optimization, that kind of stuff.
Don't worry about that.
The libraries that you use in your day-to-day work in data science, they've been highly, highly optimized.
Things like Pandas and Polars and NumPy and Matplotlib, they've already done all the hard work for you.
Typically, your job is to get some data, clean it up, hand it off to this library, hand it off to another, get a picture.
So don't worry about optimizing stuff too much in the beginning.
Testing, unit testing, don't worry about it.
I might be a bit of a contrarian.
You might hear how important it is to test your code.
That's what professionals do.
Yes, eventually, maybe, when you build a library you're sharing with 10,000 people.
But in the beginning, don't worry about it.
Async, parallelism, threading.
Again, kind of like object-oriented programming.
Someday it will be important, but it's not right now.
A lot of the libraries you use already do these things for you behind the scenes, and you don't have to worry about it.
They add a simple veneer or facade over top of some complex algorithms and asynchronous programming.
I'm thinking Polars, Dask, that kind of stuff.
It looks just like regular code to use.
It does really advanced parallelism.
You don't have to know how it works.
Advanced source control using Git.
If things like squashing or rebasing sound complicated to you, because they probably are, don't worry about it.
I'll show you a handful of source control and Git techniques and commands you'll need to use to get work done in a team environment, even for yourself.
But you probably don't even really need to learn terminal-based Git.
I'll show you some really nice tools that will handle that for us.
So don't stress about Git.
We'll cover the few essential things you need.
The rest of it, learn it later.
Finally, Python has these things called decorators.
They're very powerful.
They change the way your code runs when you put a little "at something" on top of it.
You write an @ sign and then a word or two, and that will change dramatically how your code runs.
For example, if I want to create an API for a web application, I just say @get and give it a URL.
And what was normally a regular function now becomes something hosted on the web.
You'll probably use these in your code.
Don't worry about creating them.
They are hard to get completely right.
And for a long time, you can skip creating them, just like everything on this list.
It's something you'll get to eventually when you need it, but you don't need to worry about creating them now.
Use them.
They're super easy to use, fairly hard to create.
So take advantage of the ones that many people, libraries and frameworks already offer you.
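To make that concrete, here's a tiny sketch of consuming a decorator that a library (in this case, the standard library) already offers you. This example isn't from the course; it just shows the @ syntax in action.

```python
import functools

# @lru_cache is a decorator someone else wrote. One line above the
# function dramatically changes how it runs: results get cached,
# so the naive recursive Fibonacci below becomes fast.
@functools.lru_cache
def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(35))  # 9227465, computed almost instantly thanks to the cache
```

You don't have to know how the caching works internally to benefit from it; that's the consume-don't-create idea.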
So there it is.
This is my list.
Not exhaustive, but gives you an idea of my philosophy for this course.
There's a lot of things you can focus on that are super advanced that will take a long time to get right, a long time to master, and they won't provide that much value over top of the basic Python and data science foundations that we are going to cover throughout this course.
Well, you now know what not to cover, what not to pay attention to.
What are we going to focus on?
We're going to start off talking about the Python language.
Maybe you already know Python, in which case you could skip this chapter.
But we're going to talk about a few of the core language features.
Like I just talked about with what you can skip, there's a lot of things in the language you don't need to worry about in the beginning.
So we're going to talk about just the essentials that you need to know, creating variables, writing functions, testing code, testing conditions, doing loops, and that sort of thing.
Next, we're going to talk about writing clean code.
How do you write code that is easy for other people on your team to read or you to go back and read after it's been a few months and you've written it?
How do you write maintainable code?
We're going to save a special chapter for functions.
Functions are a very important part of programming in Python, and unlike classes, you cannot ignore them.
They are critically important.
So much so that we're going to have a dedicated chapter, even though they would fit into the clean code section as well.
Next, I'll talk about organizing code so it can be reused.
Typically, data scientists write a lot of code in a Jupyter notebook, and it gets locked in there.
It's extremely hard to take some little bit of code you write there and use it in another place without simply copying it from one notebook to the other.
Not a good idea.
So we're going to talk about how we can do that in the context of data science specifically and with regard to notebooks.
We're going to cover the few things that you do need to know about source control and Git.
So that will be super helpful, I think.
We'll see how to do top-tier professional debugging using real tools, not just what's baked into Jupyter Notebooks and JupyterLab, which is something, but it's not that great.
We're going to see some really awesome tools to do debugging with.
We're also going to spend some time exploring libraries and tools we can leverage to create reproducible software.
Software that will run reliably for collaborators and for people who just want to reproduce your work on their machines, no matter what their setup is.
It even adds a degree of longevity to your code, so that in 5 or 10 or 15 years, the code that you write will still run exactly the same as the day that you wrote it.
So this reproducibility stuff is super important.
Finally, in this day and age, it would be remiss to not talk about AI a little bit.
So we're going to close out this course by seeing how we can use agentic AI to help us analyze data, improve our code, create notebooks and analysis for us, all that kind of thing.
I'm going to put that at the end because these foundation ideas are super important and the AI is going to try to apply them to our code for us, but we don't want it to just do our work.
We want it to do our work faster, but we can still understand and maintain and change it.
So at the end, we'll close it out with some awesome stuff about AI.
There's what we're going to cover.
I hope you're excited.
I think this is going to be an incredible foundation for you.
So we're going to enjoy going through it together, I'm sure.
Now, before we go on, I want you to make sure you download and star and fork the GitHub repository.
Of course, we're talking about Git in chapter six.
So maybe you're not super up to speed on GitHub.
That's fine.
Don't worry about it.
But you should have a GitHub account.
If you don't have one after this chapter is done, go and create a GitHub account.
It's completely free.
And GitHub is where software and data science happens, especially in the Python space.
So you'll want to have an account there.
And then star and consider forking the repository and downloading it.
So if you do use Git, then go ahead and Git clone this.
But if you're new and you don't yet have Git set up and don't know how to work with it, there's a green button that says Code; you can just click that and download a zip version, which is the same thing, but without the history, basically.
So get the code.
You can see it at the URL here.
And you can also always get this from the video player.
In the top left, there's a little GitHub icon on the course page.
There's a link to the GitHub repository and so on.
So you can always find this.
But do make use of this.
It's really important.
For a lot of the parts, there might be starter code that you need to work from and so on.
And that will all be included here.
We are going to write a lot of code live and together, because I think that's a super important aspect of learning programming.
Everything you see me create and write during this course will show up here exactly as it appears on the screen, so that should be helpful as well.
Finally, who am I?
Who is this person here telling you about data science?
Well, my name is Michael Kennedy.
I've been in the Python space for a long, long time, and I've been a professional programmer since the last century.
How about that?
Well over 25 years now.
So you can find my personal website, my essays, my blog posts, and that sort of thing at mkennedy.codes.
I'm the host and creator of the most popular Python podcast called Talk Python To Me, as well as the weekly news show, Python Bytes, which I co-host with Brian Okken.
And I'm the founder and principal author here at Talk Python Training, where you're likely taking this course.
These are all the things I've done and created.
I also happen to be a Python Software Foundation Fellow, which is a recognition from the broader Python community of someone who's made significant contributions through the stuff that you see above on this list and others as well.
So that's me.
Thank you so much for taking the course.
We're going to have a great time together.
Let's talk about the Python language.
So that's our first main topic of the course.
And as I said in the introduction, we're going to focus on the aspects of the Python language that are the ones you really need, not all the advanced stuff.
Sure, it sounds impressive to talk about all the cool things you're doing with async and await and threading and metaclasses and metaprogramming.
That's not where the productivity is.
We're going to just get a solid foundation, make sure you have everything you need covered to really be ready to do the rest of the course.
If you're fairly experienced with Python, feel free to skip this chapter and go on to the next.
It's not that long.
You can also watch it if you like.
But we're going to talk about a few core language concepts that will be a solid foundation for the rest of the course.
The first thing that we're going to talk about is variables and types.
We have two of them here to start with, x and pi.
Now, x we set to be the value of 2, which is an integer.
And pi, well, it's a truncation or an abbreviation of the real pi, but 3.14159 and so on is a floating point number.
That's right, it has a decimal point.
So these are just values that we set.
This is the simplest thing you can do in Python.
We just, we don't declare the variable in any special way.
We just say we have this thing called x and its value is 2.
We can also go on here and we can say we're going to create another variable y, which involves math or calculations or other operations on existing variables such as x.
So in this case, we're going to set y to be x squared.
And notice, now we're saying explicitly that y is an integer.
Why are we doing this?
It's not required.
These are called type hints, but they are very helpful when you're reading your code, and they're also helpful to your editors and even to AI agents and LLMs, to know what is expected here.
So for example, if we said y is pi squared, that would be a floating point, not an int, and we would get some kind of warning in our editor and possibly in other tools that inspect our code.
We'll talk more about them later.
So you can optionally set a type for your variable like y:int, but as you can see with the first two, you don't have to.
We can set the value to be an expression rather than just a constant.
We can also see, with z here, that we want it to be y to the power of y cubed.
And then we get a tremendously large number down here, as you can see.
We can also have things that are not numbers, like strings.
Now, how do you say the type of a string?
In Python, it's str.
In other languages, it's spelled out in full: s-t-r-i-n-g.
Here we say this text is a string, and the value of it is that, hey, guess what?
z is really, really big.
It's true.
It's really, really big.
So these are variables in Python.
I typically don't give them types like you see here, unless they're being used in functions, which we're going to talk about in the next chapter.
So sometimes you give them types, sometimes you don't.
Either way, this is how they work.
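Here's the whole variables example in one runnable sketch. The exact slide values are reconstructed from the description, so treat the z expression in particular as an approximation:

```python
# Plain assignments: no special declaration needed.
x = 2
pi = 3.14159  # a float: it has a decimal point

# Optional type hint: not required, but helps readers, editors, and AI tools.
y: int = x ** 2  # 4

# Exponentiation is right-associative, so this is y ** (y ** 3):
# a tremendously large integer.
z = y ** y ** 3

text: str = f"Hey, guess what? z is really, really big: {z}"
print(text)
```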
Next up is loops.
Loops are super important.
So here we have what's called a list of Fibonacci numbers.
1, 1, 2, 3, 5, 8, and so on.
We'll play with this sequence throughout the course, I'm sure.
You can see there's brackets and then the values separated by commas.
So this is not just a single value like 1 or 2, but it is a list of them.
And this list can actually grow to be larger if it needs to be.
We could add the next Fibonacci number.
So we're declaring this Fibonacci to be a list.
We're not explicitly setting the type, but that's what it is set to.
Now, what if we want to go through each item in the list?
There's other types of collections, and the same pattern applies for them as well.
Well, what we're going to use is something called a for in loop.
So for something, for variable in the collection.
So the variable name is fib, and the collection is the one we had above, which is Fibonacci.
And then for each time through there, we're going to do something with it.
In this case, we're going to print one, one, then two, then three.
And we're setting the end to be a comma instead of a line break, which would normally be what happens.
So it goes across the screen instead of down with commas separating in the output as well.
So this is the most foundational type of loop in Python.
It's the one you're going to use 80, 90% of the time.
There's also something called a while loop, which is while a condition is true.
It's not used nearly as frequently.
Some other languages have a plain for loop without the "in": you initialize a variable, test a condition, and increment, like for i = 0; i < n; i++, that kind of thing.
Well, these for-in loops have a way to give you that index too: what position you're at in the iteration of your loop.
So we just have to add this enumerate operation here.
So we're going to enumerate the Fibonacci sequence rather than just go through its values.
And we're going to say the start is 1.
So typically things start at 0 in Python.
So what that means is over here we have an IDX, which is this index, the count through the loop, and then the value just like we had before.
So we can say the idx-th Fibonacci is such-and-such.
The "-th" is a little bit of a shortcut here, but we see the first, then the second, and then the third.
And after that, the fourth, fifth, sixth, seventh; that starts to sound right.
And it prints that number out.
We're also doing something called an f string.
If you haven't seen that before, it says f in front of the little quotes.
And that means pull the values out of the variable.
So you'll see if we run this, we would get the first Fibonacci is one, the fifth Fibonacci is five, the sixth is eight, and so on.
So you can see it's taking the variables of the loop and sticking them into the string like this.
That's pretty much what you need to know about loops.
There's a couple other things like break and continue keywords, but this is the gist of it.
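Put together, the loop patterns above look like this (a sketch; the printed wording is my approximation of the slide):

```python
fibonacci = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# Basic for-in loop: end=', ' prints across the screen instead of down.
for fib in fibonacci:
    print(fib, end=', ')
print()

# enumerate() adds the position; start=1 so counting begins at 1, not 0.
for idx, fib in enumerate(fibonacci, start=1):
    print(f"The {idx}th Fibonacci is {fib}.")
```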
Python has a couple really, really important data structures.
We have a list, we have a dictionary, and we have a set.
There are other types of more advanced data structures, but again, you can probably ignore them for the most part.
We've already seen the list in our previous example, but let's talk about working with it directly and not just looping over it.
So the list is an ordered collection of things.
Put stuff in the list, and it's going to stay in that order.
It knows how many of them there are.
You access them by their position in the list.
If we want to fiddle with these Fibonacci numbers a bit more, we could come here and say, I would like the last two numbers.
For the last one, we use brackets to index into the list and pass in minus one; that'll give us 89.
And then we could get the one before it with minus two, and that would be 55.
And if we're going to add the next item to our list, the way the Fibonacci sequence works is that the next one is the sum of the previous two.
So here we could say Fibonacci.append that, and you can see we get 144 on our list.
We can also ask how long is the data structure.
This doesn't just work on lists.
It works on many things that are collection-like.
Most things that you can loop over can also ask how long they are using this len operator.
So we're going to put that in this variable called count and the count turns out to be 12.
And then maybe we want to get the last one.
Typically, we don't index into these things with negative numbers; instead, we go from the start by how far in they are.
In this case, we're going to use zero-based indexing.
So the first one will be zero, the second one will be one, the third one will be two, and so on.
So if we have the count, we want to get the one at the end, we would say count minus one, because everything in Python is zero based.
In this case, we get 144.
Yes, we could pass in negative one, but I really want to focus on this.
Everything starts at zero for indexes, zero, one, two, and so on.
It's sort of a holdover from the C programming language, where in C, these things are sort of pointers into memory.
And you say, well, how far offset into this space is the first one?
Well, the first one's right at the beginning, so zero.
And then the next one is one offset and the next one.
So it's historically, that's why these are zero-based.
it makes a lot of things easier that way.
So lists are super important.
You're going to use them all the time.
Make sure to get to know them.
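As a sketch, here are those list operations in one place:

```python
fibonacci = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

last = fibonacci[-1]            # negative indexes count from the end: 89
second_to_last = fibonacci[-2]  # 55

# The next Fibonacci number is the sum of the previous two.
fibonacci.append(second_to_last + last)  # appends 144

count = len(fibonacci)       # len() works on most collection-like things: 12
print(fibonacci[count - 1])  # zero-based, so count - 1 is the last item: 144
```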
The second most important data structure you're going to need to know about is something called a dictionary.
Terminology, some languages call these hash maps.
I think dictionary is a pretty good name for it.
It's probably the most common.
And the idea is we have some way of identifying a larger piece of data.
So in this case, we have some email addresses and then a complex data structure that has to do with the person who owns that email address.
So if you look here, we have sam at sammy.com.
This is what's called the key.
This is the identifier, the thing we look at up by.
Then we have a colon to say, and here is the value afterwards.
And then this whole thing between the two curly braces there, that is the full value of this thing.
You see the sammy.com appears again, because that's just part of the information that we're going to use.
We also have Zoe and others.
And then if we want to go into this lookup and we want to get a value out of it, we say, well, let's pass in the key and we'll get the thing back.
So we're going to get a user by passing in this email address.
And you can see the user data structure below in the gray.
It says the name is Zoe Zink.
The email is zoe at gmail.com.
Her age is 31.
Okay.
So that's super cool.
Now, if you ask for a key that isn't in the dictionary, you're going to get what's called a KeyError, an exception.
This will basically stop your program from running and say, ah, something went terribly, terribly wrong.
So you might not want to ask for it this way, as we'll see.
So once we have that user pulled back, this is also a dictionary, and we can ask for its age value.
Here we get an age, which is an integer.
Like I said, if this age doesn't exist here, it's going to cause a crash.
So you can use this safer style, say user.get, where you pass in the key, age, and a value that might come back if that's not there.
So we could say, I want the age of this person I got.
If the age is not in their data structure, just give us the zero value instead of crashing.
Maybe you test, you know, is their age greater than zero, or you could use negative one or something like that.
In this case, Zoe's age is set, so we get 31 back.
These dictionaries are insanely fast, and we're going to explore that in a code demo in just a little bit.
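Here's the dictionary example as a runnable sketch. Only Zoe's details appear on the slide; Sam's name and age are made up for illustration:

```python
# Keys are email addresses; values are nested dictionaries of user data.
users = {
    "sam@sammy.com": {"name": "Sam Smith", "email": "sam@sammy.com", "age": 40},
    "zoe@gmail.com": {"name": "Zoe Zink", "email": "zoe@gmail.com", "age": 31},
}

user = users["zoe@gmail.com"]  # raises KeyError if the key is missing

age = user["age"]         # 31, but also a KeyError risk if "age" is absent
age = user.get("age", 0)  # safer: returns the default 0 instead of crashing
print(age)  # 31
```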
The final of the big three data structures is set.
You're going to use this one the least, but when you need it, it is insanely powerful.
The idea is if you have a collection, there might be duplication in it.
If you look at these ratings, you can see we have one, two, seven, then two again, then nine, then a whole bunch of sevens, and then one again.
You might want to ask the question like, how many different values of ratings did we have?
Not how many times did seven appear, but was seven there at all?
These could be strings.
It could be how many distinct first names appear in a group, even if the name Michael appears three times, that kind of thing.
So we can create a set here, and we can pass into this set anything that is iterable, anything you can put into a for in loop.
And the set object will go through them and create one of these sets, and we're naming this one distinct ratings.
And so what it's going to do is look at this collection and say, well, we only want to add a value to it if we haven't seen it before.
You can also write sets directly like this with curly braces, kind of like we did for dictionaries, but with just value, comma, value, comma, value.
And it's exactly the same thing.
That's more efficient, but you would almost never write it this way if your data has duplication, right?
You would use the set to remove the duplication.
Regardless, you're going to end up with the same values, 1, 2, 9, and 7.
Those are the values that distinctly appear, the numbers that appear in the ratings.
The literal set of ratings is the same thing, right?
It doesn't matter how you write it.
Now we can also add to this, not append like we do with the list, but add to the set down here.
And notice we can take this and we can add one.
We can add one again.
We already had one, and so that's not going to change anything about it because one was already there.
But we can also add five, which previously didn't appear in the ratings, so now it does down here at the bottom.
So sets are really awesome ways to collect distinct values of something that you haven't seen before when you're processing.
It's a really common data science type of question, so sets are a little extra important in data science, I think.
So that's it.
That's the big three data structures you need to know about in Python.
List, dictionary, and set.
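Here's the whole set example as a sketch; the exact ratings list is assumed from the description:

```python
ratings = [1, 2, 7, 2, 9, 7, 7, 7, 1]

# set() accepts anything iterable and keeps only distinct values.
distinct_ratings = set(ratings)  # {1, 2, 9, 7}

# You could also write it literally with curly braces: {1, 2, 9, 7}.

# add(), not append(): adding an existing value changes nothing.
distinct_ratings.add(1)
distinct_ratings.add(5)  # a new value, so now it's included
print(sorted(distinct_ratings))  # [1, 2, 5, 7, 9]
```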
Are you ready to write some code?
I'd say it is about time.
So the first thing we're going to do is just get started with the code from GitHub and so on.
Now, I've already cloned this from GitHub.
You could get the same effect by just downloading and unzipping the zip file as well.
And right now it's empty.
As we do work, it'll fill up.
So yours will look different than that, of course.
So we're going to use several editors, at least three different editors.
We're going to use PyCharm, we're going to use a VS Code variant, and we're going to use JupyterLab.
I want to start out in PyCharm.
I think it's the most straightforward, but if you want to use another one, no problem.
Especially VS Code, it's pretty interchangeable.
You can see I already have it here, but let's just go through the more explicit process.
I'll click Open, go over to where the folder is, and then Open on the folder.
So that's one of the ways we can do this.
And here you can see we've got the two files that you saw, plus hidden files; the dot prefix makes them hidden on macOS.
That doesn't happen on Windows, but it does on Linux and macOS.
So we don't need to mess with this.
What I'm going to do is make a little bit of organization for us.
I'll make a folder called code, and then inside it a folder called 02_python.
So in here we can have our code focused on each chapter.
And as you can imagine, we'll have a chapter three, chapter four and so on, because some of these will get somewhat complex.
Now, what I want to do is, let's just do a really quick hello world sort of thing, make sure everything's working.
So we'll create a new Python file just called test_python.py.
And in here we can just print out, hey, my Python works.
And for now, I'll let PyCharm just grab the most, whatever base Python it grabs.
We're gonna talk about virtual environments and isolation and picking the latest version of Python, all those things in a little bit, but whatever PyCharm needs to get going, we'll just let it do that for now.
So let's just run this and see, hey, my Python works.
Okay, super.
So it looks like we have Python, looks like it's working.
All right, now I've copied over this file called speed_test, because there's a lot of code written here, and the code itself is not that relevant.
It's the performance that we want to pay attention to.
Also, that thing I said about letting PyCharm just pick whatever base Python it wanted?
It picked 3.9, and that was a bad choice.
Some of the code we're writing requires, I think it was 3.11, nothing that new, but newer than that one.
So I had to go and actually pick a newer version of Python.
But let's look at this code real quick.
So here we have a count.
And notice this really nice syntax: you can put these underscores in, kind of like commas, as digit grouping.
Now, Python will let you put them anywhere; it just ignores them.
But you should get a warning if you don't group them correctly, say by the wrong number of digits.
Anyway, that helps you instantly realize, oh yeah, that's 1 million.
So we have 1 million people that we're going to randomly generate.
We're going to create a dictionary, and we're going to have a list.
I'm gonna create a dictionary and we're gonna have a list.
Okay, so this is a list, an empty list that we're adding to.
And then this is a dictionary comprehension.
Here you can see the key, colon value, and then a loop across it.
So it's like a condensed loop that builds a dictionary.
I know that's super important.
What's important is that given the million people, we're gonna time two different actions.
The first action is how do I look up one of these people in a list?
So here's the situation.
I have a data set, and I need to go through the data set and find one of the items in there based on some criteria.
The criteria here is we've picked a person that is in that list, and we're going to compare their email addresses.
That person is somewhere about in the middle, exactly in the middle, actually.
So this represents in the long term, the average performance time, 'cause some will be before and some will be after, the average should work out to be in the middle.
So here we've got this loop that says, I'm gonna go through, find the person and then compute how long that took.
So if you subtract two date times, you get a what's called a time delta that has a duration in total seconds.
So we're gonna write out the milliseconds it took to find the person using a list.
Then we're gonna do a similar thing where we go to the dictionary and we say, get us that person by the key we created, which is their email address.
Then we're gonna print out how long that took.
First of all, this is cleaner code than all of that looping code above.
So that's great.
But also, there might be a performance difference.
Then there's the speedup.
I'll tell you now: the dictionary lookup is faster, so speedup is the right framing rather than slowdown.
So we're going to take how long this one took and compare it to how long the first one took.
So we'll say like do a list lookup was like five times slower or something like that.
So let's run it, and it's building the list to work with.
That takes it a moment, and then it runs the lookups for each.
It took 30 milliseconds.
That's fast.
Let's not fool ourselves.
Computers are insane.
We went through and searched a million records, and found this target person by email in 30 milliseconds.
Not bad, right?
How about the dictionary?
One one-thousandth of one millisecond.
What?
There's a million records there.
It's a thousandth of a millisecond.
Is that a microsecond?
It is: a thousandth of a millisecond is one microsecond.
But I can tell you that that's 30,000 times faster.
30,000 times.
Not percent, times faster.
If you're doing data science and you're doing this type of processing with a list and it could be restructured to ask the same question with the dictionary, you could either get answers 30,000 times faster, or you could process 30,000 times larger data sets in the same amount of time.
If it's only this lookup, that's the consideration there.
So that is stunningly fast.
It's even faster than I expected it to be.
I knew it'd be way faster, but I thought it'd be like a thousand times faster, not 30,000 times faster.
Anyway, this final sort of wrap up of these data structures, the idea here is, look, we have lists, we have dictionaries.
They can both serve the same purpose.
They hold data.
You can search them and get data out, but they're built for certain purposes.
This is for grouping stuff that goes in an order and you wanna keep track of them and find it by index.
Like I wanna find the 12th one of these.
Dictionaries, you give it a key and you can get stuff back from that dictionary by key, in this case, the target user's email address, unimaginably fast, a thousandth of a millisecond for a million records.
So pretty darn impressive, I will say.
Understanding the three important data structures is really, really valuable.
List, dictionary, set, and understanding when you wanna use them.
List for grouping stuff, getting it by position.
Dictionaries for looking up by key.
And sets for distinct or uniqueness.
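To make that comparison concrete, here is a minimal sketch of the experiment described above. The names and the data-set size are illustrative, and it uses time.perf_counter rather than subtracting datetimes as shown on screen:

```python
import time

# Build a synthetic data set (smaller than the video's one million people
# so it runs instantly; raise N to 1_000_000 to see the dramatic gap).
N = 100_000
people = [{"name": f"user{i}", "email": f"user{i}@example.com"} for i in range(N)]

# Dictionary comprehension: key is the email, value is the person record.
people_by_email = {p["email"]: p for p in people}

target_email = people[N // 2]["email"]  # middle item approximates the average case

# 1) Linear scan through the list.
start = time.perf_counter()
found_in_list = None
for person in people:
    if person["email"] == target_email:
        found_in_list = person
        break
list_ms = (time.perf_counter() - start) * 1000

# 2) Hashed lookup in the dictionary.
start = time.perf_counter()
found_in_dict = people_by_email[target_email]
dict_ms = (time.perf_counter() - start) * 1000

# A set answers the "have I seen this before?" question just as fast.
seen_emails = {p["email"] for p in people}

print(f"List lookup: {list_ms:.3f} ms")
print(f"Dict lookup: {dict_ms:.6f} ms")
print(f"Speedup: {list_ms / dict_ms:,.0f}x")
```

The exact speedup varies by machine, but the point holds: the dictionary (and set) lookups stay essentially constant as the data grows, while the list scan grows linearly.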
|
|
show
|
2:22 |
When you say I'm working with Python or I'm using the Python language or something along those lines, that means different things to different people.
For some people, what that means is you're literally using the Python syntax.
So for loops, for-in loops, if statements, the variable definitions, that kind of stuff that we've already talked about.
That's one part of Python.
There's another really critical part of Python called the standard library.
The standard library is a set of many, many libraries that come with Python.
They're automatically included.
If you have Python, you have the standard library, and it gives you a bunch of behaviors and things far beyond just the language syntax.
For example, in that speed test, we're using list and dictionary.
That comes from the standard library.
But we were also using things like the random library for randomly choosing stuff out of a collection.
That's super valuable that we don't have to figure out how to do randomness ourself.
It's just built into Python.
So that's level two of what it means to be using Python.
But the third and now probably the most important part of it is actually this thing called PyPI, the Python Package Index.
And over here, you can see at the time of screenshotting, this is 657,000 libraries or projects that you can use in your code.
So you may already know about this, but if you don't, or if you haven't thought deeply about it, this is a stunning amount of resources you have to build Python libraries, Python applications.
So when people say, yeah, I'm using Python, they might mean the syntax.
Obviously they do, but that might not be the most significant part.
It might mean the standard library, but often what they mean is a collection of focused libraries from PyPI.
Like I'm doing machine learning, so I'm using PyTorch and NumPy and a bunch of other things that automatically give me these amazing capabilities.
So this is a massive asset.
I would say the biggest asset that Python has is all of these libraries that then you can bring in to your Python syntax and work with other things in the standard library and so on.
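For a taste of that second level, here's a tiny example of the standard library's random module, the one used above for randomly choosing things out of a collection (the names are just examples):

```python
import random  # part of the standard library: nothing extra to install

people = ["Ada", "Grace", "Alan", "Katherine"]

winner = random.choice(people)   # pick one element at random
trio = random.sample(people, 3)  # pick three distinct elements at random

print(winner, trio)
```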
|
|
show
|
4:03 |
So PyPI and all the libraries there are incredible.
If you go over there, you'll see that many of those libraries have multiple releases.
Some of them have been releasing versions of their library for years, 10, 15 years.
As that time has passed, new features and capabilities have been added.
Less commonly, but still true: sometimes there are problems where some functionality is taken away or a function is renamed.
A real notable one is Pydantic, which moved from Pydantic 1 to 2 and renamed some of the core functions you might work with to make them clearer as an improvement.
But if your code was written against version 1, it's not going to work against version 2.
Those things it was written against are gone.
So let's consider this in a practical example.
We have two projects.
They're both Django projects.
Now, project 1 here was built a couple years ago using Django 4.2.
They didn't explicitly say that.
They just said, I'm using Django.
And at the time, that was version 4.2.
So they wrote their code against whatever Django 4.2 was.
They're also using a library called requests, which allows you to call external APIs, that sort of thing.
And this one was using version 1.20.
There's another project out there, a more modern recent one.
This one uses Django 5.2, and it's using requests 2.32.
Those are pretty big version number changes there.
There might be things in 5 that are not in 4 for Django.
Certainly true.
More significantly, there might be things that were taken away that used to be in 4 but are no longer in 5.
I don't think that's true, but for some libraries, it certainly is.
So here's the problem.
We can install these into our Active Python.
Unless we take explicit action, that's going to be System Python.
Well, you can only install one version of any given library.
You can say, I want to install requests.
Great.
I want to install Django.
Great.
Which version do you get?
If both Project 1 and Project 2 are using that set of libraries, then depending on the order you set them up, I can guarantee you one of them is getting a version of that library that is not what it was built for.
And it may be one that they will not work with.
So this is a big problem.
And this is why we have something called virtual environments.
So it's kind of like a copy of Python.
And instead of saying, we're going to install globally for my machine, you can say each project over here, it's a local copy of Python.
And into that local copy, we can install exactly the right version.
requests 1.20.0 and Django 4.2.0.
Project two, well, it doesn't use those.
No, it uses requests 2.32.4 and Django 5.2.0.
And it gets those as well, but things like the standard library and the core runtime and stuff can be symlinked over here so that we don't even actually copy all of Python.
We just sort of tell Python it lives somewhere else.
It will find different versions of the libraries per project.
If you're a JavaScript type person, it's a little bit like node_modules.
This is something super important as soon as you start leveraging libraries and packages from PyPI, because as soon as you do, there's a chance that you're gonna want to have two different versions for two different projects that you happen to have worked on.
And even if you don't, there's some reproducibility stuff that we'll talk about later.
So this idea of virtual environments solves this problem of one program is built against one version of a library and the other is built against another and they both need to have exactly what they are built for.
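As a minimal sketch of that idea using only the built-in venv module (uv, covered next, does this faster; the folder names here are illustrative):

```shell
# Each project gets its own isolated copy of Python.
python3 -m venv project1-env
python3 -m venv project2-env

# Each environment has its own interpreter and its own site-packages,
# so project 1 could pin Django 4.2 while project 2 pins Django 5.2,
# e.g.: project1-env/bin/pip install "Django==4.2"
project1-env/bin/python -c "import sys; print(sys.prefix)"
project2-env/bin/python -c "import sys; print(sys.prefix)"
```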
|
|
show
|
1:38 |
Python comes with a library called pip.
Now pip is the Python package manager and it is a local utility that you use to interact with the remote PyPI packages and data and to download them, install them, update them, that sort of thing.
However, over the last couple years, this library called uv created by the team at Astral has more or less consumed the Python packaging space.
Here you can see from some of the performance things how much faster uv is.
It's remarkably a better experience.
And you can see down here it says it replaces pip, pip-tools, pipx, poetry, pyenv, twine, virtualenv, and more.
So there's a lot that this library does, but the part we'll really focus on is that it's a better implementation of pip. If we install this once on our machine, then anytime you would see pip install some-library, we'll just type uv pip install some-library. Just put uv as a prefix and you get a much better experience. The steps to install it are super easy; there's a link right here on uv, and once you get it installed, it's good to go.
Just use their installer; don't use some of the other shortcuts, because then uv can manage itself in addition to Python packages.
I've already got it set up on my machine and I'm not gonna walk you through installing it.
I know you got that.
|
|
show
|
5:35 |
All right, we're back in our project already loaded up, still loaded up in PyCharm.
Now notice there's a virtual environment thing down here, but that's just because I already had one of them laying around.
Let's just assume that we didn't have any virtual environment stuff going on.
And let's use uv to get started and install it.
So what we could do is come down to the terminal here and type uv venv, giving it the command and the location, the folder that we want to create into. But let's do it outside of any editor, just to give you a fresh start. So I'm going to use my favorite terminal; it doesn't really matter what terminal you've got.
This one is a warp from warp.dev.
It's awesome, but if you used a built-in one, that's fine.
You see, I've got some nice customizations here and we'll see that in effect in a minute, but it says like on main, for example, because we're in a Git repository and the branch that's checked out is main, which is kind of cool to see.
So if I look here, we can see this.
And if we look a little bit deeper, looks just like we saw in PyCharm, right?
So what I want to do is use uv to create a virtual environment.
So there's a couple of things we can do that's pretty cool with uv.
So there is the uv venv command, but there's also this really nice Python management capability.
So we could say uv python list, with the Python preference set to only managed.
Now you might wonder, like, why didn't I have a recent version of Python as my system Python?
Because I'd rather have no system Python.
I don't ever use that thing.
What I want to do is I want to use uv, which will actually download and install versions of Python for us.
For example, if I wanted to install one here, I could say uv python install 3.13.2.
You can see it took 930 milliseconds to install that version of Python.
And then I can create projects from that.
So if I go back here and list them again, we now have that one installed as well.
So super cool.
And this is how any project that I start with, I'm just going to go and create a virtual environment with exactly the version of Python that I want, which is usually the latest.
You can see right now there's a release candidate for 3.14, but this one itself is the latest.
So what I can do is I can just come over here into this folder where our project is, and I say uv venv.
And I want the folder to be called this.
I could say --python 3.12.1, and it will either use the one it's already got or it'll download and then install it.
Don't even need to have Python installed for that to work.
But if I leave it empty, it'll give me the latest one that I have available.
Let's hit that.
You can see it created using CPython 3.13.5.
Put it here, and it says you can activate it with this command.
Activating virtual environments is so common that I have a shortcut, an alias called ae, for activate environment.
So if I say which ae, it shows the command; the dot basically means source.
So: source venv/bin/activate.
So I could type what uv showed me, or I could just type ae once that alias is created, right?
This is something you'll want to do often.
So I'll say source that.
And notice that this thing now popped up in my prompt saying, you are in a virtual environment called that with this version.
So I can ask which python; on Windows, it's where python.
And it says, it's the one right here, not the system Python.
No, no, no.
We've already talked about the package clashing and there's other issues as well.
We don't want any of that stuff.
We want this one.
And so now we've got this new folder, which is exactly what we want for working with our own project.
So we can come back here and we can type PyCharm.
And it'll open back up to there.
And since it had already picked a virtual environment before, it's not going to find the new one.
So I've got to go and add it: click here.
If I'd done it in the right order, there's a good chance it would have picked it up, but it didn't, did it?
So we'll pick that one.
And now you can see down here, if I just go run, let's just run this one, this hello world one, you can see that the command that's being given is to run that virtual environment Python against, go to the end, against a very long path name to our test Python.
Okay.
So yay, my Python version works, and it's the one from here.
That was a little bit long, a little bit drawn out, but I want to make sure you all have it exactly right and you fully appreciate how awesome uv is.
I know a lot of data science folks use Conda, and Conda's really great too, but with the advent of wheels and uv, there's less of a reason to leverage Conda these days and just sort of stick with uv.
And you can still use Conda, it's fine, but like I said, there's less motivation to do that, because wheels take away a lot of what Conda was previously required for.
|
|
show
|
7:59 |
So we have our project all set up.
We're using our virtual environment.
Let's go install something.
So one of the libraries that I really like is called HTTPX.
I talked about requests before, and HTTPX is a lot like requests, except it supports both synchronous and asynchronous APIs.
This is super important when you're doing web calls, and it's one of the places where Python asynchronous programming really shines.
I know I told you to ignore async, but you can't ignore it forever.
If we install this one, everything is set up to basically be ready to add that capability in the future, for when you get there and realize, actually, this thing really could benefit from that.
Notice here it says install using pip install HTTPX.
No.
We want to come over here and notice that we have the virtual environment activated for us.
That's cool.
If it's not, then you find the venv folder and source its activate script, or maybe create yourself an alias or a batch command that activates the environment.
But we already got that set.
So down here, instead of pip install httpx, it's uv pip install httpx.
Look how fast that was.
It got all of those libraries.
It figured out the right versions for my machine and my version of Python, downloaded them or used cached versions (a lot of times you can't even tell the difference), and then installed them.
Beautiful.
So say uv pip list, and it shows all the packages that are here.
This is the one we installed, but these others come in as dependencies: httpx depends on some of them, which potentially depend on others, and I don't know the exact order.
So now that we have that installed, what we want to do is create a program that uses it.
Let's just call it first_chars, for first characters.
And the idea is we're going to write a little bit of code.
You enter a URL and we'll pull down the HTML of that page or whatever lives there and show you the first few characters.
It doesn't work on binary stuff or at least it's not going to look good.
So what we're gonna do is we'll have URL and we don't need to set a type.
Like we don't need to say it's a string, right?
We could, but I'm not going to, I'm gonna keep it simple.
And what I'm gonna do is I'm gonna say input.
This is like a prompt to the user.
What URL will we download?
And let's just print the URL really quick.
Now notice I can click this and it'll run, show up down there, that's cool.
But if I'm over here in a different file and I click run, since I'm working on a multi-file project, it doesn't run the one I want.
So I can right-click on this and say run, and it'll stick up here.
So no matter where I am when I say run, it's always going to run this program.
Let's just do that.
So we'll say, what URL do you want to download?
Google.com.
Okay, Google.com it is, it got right from there.
Cool.
Now, obviously, we could do validation on the link: does it start with http, or something like that?
Or if it doesn't, maybe just put that on for the user, so I could type google.com and not https://google.com.
You know what, let's do it, it's not that hard.
So: if not url.startswith('http'), prepend it.
It's not perfect, but maybe it's better.
So now if I run it and type google.com, let's see, does it work?
Fair enough, it does.
Opens twice for whatever reason, because I guess it really wants to go to Google.
Okay, great.
So now we don't need to do this.
Let's go ahead and use our library.
So in order to use library, we have to add an import statement.
All right, let's go to the top and we'll say import httpx up here, like that.
And down here, I'll just say response = httpx.get(url).
Notice PyCharm knows all about the details of it.
Even pull up the parameter list with control or command P depending on your operating system.
But all I care about is the URL.
And then we want to make sure that that worked.
Maybe we got a 404.
We don't care about that.
So you can say raise for status, which only allows you to continue if everything worked, like a 200 or 201 or something like that status code.
Then we'll have the text equals response.text.
And see, it's a property, not a function.
So that means we do not put parentheses on it.
And then let's print out, we'll do a little report here.
The first, let's say, count equals 250.
First count letters from URL are, we'll do a little print just to get a new line, and we can print out text.
Now, if we do this, this is going to be all of it.
We can do something called slicing on collections, where we can say go from zero to count, like text[0:count], and it'll give us the first sub-collection of that.
In this case, it's a string.
Let's run it and let's put in google.com.
Oh, see, this says moved permanently.
The server is saying, hey, this is really where this should be, because I didn't put the ending slash.
So with HTTPX, what you can do is pass follow_redirects=True.
Now, if we run this again, it won't consider that an error.
It'll just say, yeah, great, we'll follow it.
So here it is: doctype html, itemscope, something, something.
And if we go over here, the meta content is:
Search the world's information.
Google has all this stuff.
How awesome is that?
Here's just a real simple example of, one, why Python is awesome.
But two, why we want a virtual environment.
We want to use this library.
If we go over here and say uv pip list, the version that we're using is 0.28.1.
We need to make sure that we have some stability on that version, or at least compatibility across versions.
Maybe get gets renamed to get_html.
I mean, of course it won't, but imagine there was a change there: this code would stop working.
Or the follow_redirects behavior is different: the code would stop working.
So we have our virtual environment, so this project can be isolated.
We used uv because uv is much better than the other options out there.
Now, if you look here, you can see PyCharm has package information for it here.
Apparently we could upgrade to this dev release, but this is the latest stable release, which is kind of interesting to know, but you can click over here and it basically pulls up the PyPI page for each of these.
And you can install new packages and search them and so on.
If you like that, go for it.
I personally prefer to just work on the terminal.
It could be this terminal, or it could be the awesome warp terminal here.
Whichever one you want to use, it's the way I like to work.
It's a little more universal.
It doesn't really matter what editor you're using.
uv support is 100% here in the terminal, while uv support in PyCharm is only partial.
And they're working on making it better, but it's not 100% there.
Hopefully you thought that was a fun example.
We have many more code demos to come, but yeah, here's a short, sweet, but also pretty useful (given the number of lines) code sample using external packages from PyPI and a virtual environment created by uv.
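Put together, the little demo roughly looks like this. This is a sketch rather than the exact file from the video: the helper names are mine, and httpx is a third-party package you'd install with uv pip install httpx.

```python
def normalize_url(url: str) -> str:
    """Prepend a scheme if the user typed plain google.com."""
    if not url.startswith("http"):
        url = "https://" + url
    return url


def first_chars(text: str, count: int = 250) -> str:
    """Slicing: text[0:count] gives the first `count` characters."""
    return text[:count]


def download_first_chars(url: str, count: int = 250) -> str:
    """Fetch a page and return the first few characters of its body."""
    import httpx  # third-party, from PyPI: uv pip install httpx

    response = httpx.get(normalize_url(url), follow_redirects=True)
    response.raise_for_status()  # continue only on a 2xx status
    return first_chars(response.text, count)  # .text is a property, no parentheses


# Example run (needs network access):
# url = input("What URL do you want to download? ")
# print(f"The first 250 letters from {url} are:\n")
# print(download_first_chars(url))
```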
|
|
|
26:35 |
|
show
|
0:41 |
Well, we've talked about writing code.
Let's talk about writing clean code.
This chapter focuses on professionalism in your code and a lot of ideas taken from years of experience that helps you work better and shows you how to bring those into your code.
If you haven't worked on a professional team or a large open source project or something along those lines, chances are there's a few things to take away from this.
because those environments really reward these practices; they're almost required.
But when you're working on projects, even just for yourself, they're still super valuable, as we'll see.
|
|
show
|
1:30 |
Why do you care?
Why do we care if our code works?
Isn't that good enough?
Does it matter if it's pretty or it looks professional?
It gets the answer that we need and it gets the job done.
Good, right?
Yes, but there are a lot of benefits to focusing on idiomatic Python, which is often referred to as Pythonic code, and some of the broader software engineering techniques that people apply when they're writing code.
When we really focus on how code looks and how it's structured, we end up with fewer bugs.
It's easier to onboard new teammates into that project, or if it's an open source project, have external collaborators come in because the code will look like, well, what they expect and not some custom version of it.
And it's a little bit crazy, but it's even better if you have AI agents working on your code with you: not quite vibe coding, but having AI help you write code.
If you have idiomatic code that is clean and well-structured, it's easier for the AI to understand it.
And, by the way, a little hat tip toward the last chapter: if you have ugly code that is not clean and not idiomatic, AI can also help fix it, because it knows what it's looking for.
Readability, fewer bugs, better collaboration.
Plus, you'll just enjoy working with this code more if you follow some of the practices here.
|
|
show
|
5:23 |
In Python, clean code and idiomatic code starts with something called PEP 8, Python Enhancement Proposal 8.
So this is one of the very early decrees in Python land saying we really should have some guidance on how people should write code.
There's a whole documentation, a whole doc you can read here about it.
You can see the URL there in the browser.
But I want to highlight a few.
They might sound picky and so on, but you'll see that you don't really need to worry about it as much as this document would make it seem.
Certainly when this document was written, it needed to be more manual than it is today.
So number one is use four spaces, never tabs, never two spaces, don't mix tabs and spaces.
Okay, so spaces.
Limit all lines to a maximum of 79 characters for code readability.
This one, I think, is showing its age.
But with regard to the spaces, notice you never probably saw me go space, space, space, space in PyCharm because modern editors do that for us.
You just hit tab and it means four spaces.
You create a for loop and you hit enter, it will automatically do the indentation for you.
Surround top-level functions and class definitions with two blank lines, and use one blank line within classes.
Okay.
Next, there's naming.
This is actually pretty significant.
Lowercase with underscores for function names and variable names.
You saw me do that in our example.
And if you have a class, it's cap words for class names.
We haven't created classes or talked about them, but if you do, it will be like that.
One of the giveaways that you're using a class someone else created is you'll see cap words style naming.
For example, in the HTTPX library, there's something called AsyncClient, capital A, capital C.
Guess what?
That's a class.
That's a really good example of how adhering to particular naming styles instantly communicates something to seasoned Python developers and data scientists.
They know that if they see something named cap word style, it must be a class.
They don't have to go look at the source code or the documentation.
They're like, that name tells me that's a class; and if they see something_with_underscores, that name tells me it's a function.
There are a few exceptions, but they're quite rare.
Next: place all imports (you saw me import httpx) into three groups: ones that come from the standard library, third-party ones, and then ones where you're using one of your own files inside another (which is common in large applications), with a blank line between each group.
Woo.
Nitpicky.
Binary operations like 7 + 2 need spaces around the plus, just one on each side, and on and on and on.
So there's a lot of guidance in there.
And this sounds very pedantic; making sure the spaces are exactly right seems like such a hassle, right?
Like I said, the editors these days know about things.
So all I have to do in PyCharm or VS Code is say: please reformat this document.
Boom.
All these things are applied to it.
Well, with the exception of say, like changing variable names, because that could be a breaking change.
But like, for example, the grouping of the imports is done automatically for us.
The spaces around the operators done for us, the indent done, and so on and so on and so on.
The character count, it would do that too, but I tell all of my editors otherwise; you know, I have a 32-inch 4K monitor.
It's completely ridiculous to adhere to standards that assume you're in a typical 80-by-something terminal editor, so you've got to have 79 characters.
So you can say extend that number so it's not too constricting and so on, right?
But in general, you just say reformat document, and the editors take it from there.
We also have some awesome tools: Ruff, from the same people that make uv, and the original, Black, from Łukasz Langa, who created this library that takes a whole bunch of different tools around reformatting Python code and enforcing some of these ideas, and many, many more, and standardizes them.
So we can run Ruff or Black on the terminal and process all the files in a particular project.
And what's more, PyCharm has Ruff and Black built in as formatting options, as does VS Code.
I think VS Code actually says, oh, you want to reformat a Python document?
Which one of these do you want to pick?
And then we know how to reformat for you.
So there's this nice combination of these things.
Like, I want to use Ruff with PyCharm, or I want to use Black with VS Code, and so on.
And it just knows about all these.
So you don't really have to worry about how picky and detailed this stuff is.
Your editor plus one of these tools (we'll play with Ruff later) basically handles it for you.
But just having an awareness, like knowing that style of variable naming you should follow or class names and using that to understand what you're working with is super helpful.
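To see several of those PEP 8 rules in one place, here's a tiny made-up module; every name in it is invented for illustration:

```python
import random  # standard-library imports form the first group
# (third-party imports would be a second group, your own modules a third)

MAX_ROLLS = 3  # module-level constants are UPPER_CASE


class DiceCup:  # classes use CapWords, so readers instantly know it's a class
    """Two blank lines separate top-level definitions."""

    def roll_once(self):  # functions and methods: lowercase_with_underscores
        sides = 4 + 2  # single spaces around binary operators: 4 + 2, not 4+2
        return random.randint(1, sides)


def total_of(cup, count=MAX_ROLLS):
    """Indentation is four spaces, never tabs."""
    rolls = [cup.roll_once() for _ in range(count)]
    return sum(rolls)
```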
|
|
show
|
2:17 |
So let's quickly talk about some common pitfalls and then we'll work through an example in code.
There are certain things that are very common and especially problematic that you might run into.
So for example, you might have global variables, right?
We're going to play with Conway's game of life.
And here, one of the things in the algorithm is a live count.
Well, the way this code is written, it's just in a cell, which may or may not be a bad thing.
This alive_count is now a global variable.
If you use the name alive_count somewhere else, it could potentially be a problem depending on the order in which you run the cells, or it can make your code harder to test or harder to refactor into smaller pieces.
Things like that.
Here's another: old_grid.
You probably didn't mean for that to be a global variable; it just is, because it's needed for the loop that follows.
So what can we do potentially to not have global variables?
Well, they're not inherently evil, but you want to minimize the number of them that you have.
Way worse, I think, are long functions or long notebook cells.
Look at this bad boy.
Okay, here's a function.
We'll talk about how to create our own and so on pretty soon.
But here's a function, game of life.
And literally the entire game of life is written in one function.
Well, the point of functions is to create little blocks of behaviors or reusable bits of code.
Well, if you have a huge function that is the entire application, like what's the point of it, right?
Now we can run it and sure enough, it is the game of life and it really does work, but this is not good code, even though it's code that gets the job done, right?
So there's many things happening here.
For example, one of them is displaying a round of the game.
One of them is running each iteration.
One is the math that is associated with a particular iteration and so on and so on.
So, for example, we should break that up into smaller functions here, or if this was a notebook, at minimum try to put those things into different cells if possible.
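For instance, that one giant function could be split into small, single-purpose functions, something like this. This is a sketch with invented names, using a simple non-wrapping neighbor rule, not the exact code from the video:

```python
def count_live_neighbors(grid, row, col):
    """Count live cells among the 8 surrounding positions (edges don't wrap)."""
    rows, cols = len(grid), len(grid[0])
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c]:
                count += 1
    return count


def next_generation(grid):
    """The math for one iteration: apply Conway's rules to every cell."""
    new_grid = []
    for r, row in enumerate(grid):
        new_row = []
        for c, alive in enumerate(row):
            neighbors = count_live_neighbors(grid, r, c)
            # Birth on exactly 3 neighbors; survival on 2 or 3.
            new_row.append(neighbors == 3 or (alive and neighbors == 2))
        new_grid.append(new_row)
    return new_grid


def render(grid):
    """Displaying a round of the game, as text."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)
```

Each function now has one job: counting neighbors, advancing one iteration, and displaying a round, which is exactly the separation the monolithic version was missing.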
|
|
show
|
5:14 |
Another really common pitfall is something called magic numbers or magic values, if it were like a string or something like that, right?
What are these?
Well, let's just look at an example and you'll get the idea super quick.
So let's suppose we're doing something with regard to a game or some kind of simulation.
And we have this code here.
There's a couple issues, not just the magic aspect.
Also, variable names are terrible here.
We've got this y and this o.
So y is a random number between 1 and 12, times 2, and then o is just a random number between 1 and 12.
And apparently we're testing if y is greater than o and returning that.
Hmm. What is the 12? What is the 2? What is the y? What is the o?
We can improve this by removing these magic numbers.
The 12 and the 2, in this case, are magic numbers.
They're a constant number, a constant value, that are completely unclear to what's going on, right?
Like, why the 12?
If we rename that to what it represents, the number of sides of a die, and we make that a variable, so dice_sides = 12, it's like, oh, random between 1 and the number of dice sides.
that's like a dice roll.
So obvious.
But if it's a 12, oh no, it's not obvious.
Not at all.
What is the two?
Well, apparently that's a player boost in our algorithm.
Like maybe the player has a better chance of winning than an opponent or some game character, right?
So that's what that's about.
Other things we probably should be doing: renaming y to player and o to opponent, or something along those lines, and maybe renaming result to player_wins, so it reads player_wins = player > opponent, and returning that.
There's a lot of stuff, even in just this couple of lines of code, that is basically no more effort to write but dramatically better at helping you understand it: not just your teammates, but you in two months when you come back, like, what was that?
Oh gosh, what is this thing doing again?
Like if you name it real well like this, you just read it like English and boom, it's golden.
You're golden.
It's beautiful.
So really, really nice.
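With those renames applied, the round might look like this (a sketch; the constant and variable names follow the suggestions above):

```python
import random

DICE_SIDES = 12   # was the magic 12: the number of sides on the die
PLAYER_BOOST = 2  # was the magic 2: the player's advantage multiplier


def play_round():
    """One round: the boosted player's roll versus the opponent's roll."""
    player = random.randint(1, DICE_SIDES) * PLAYER_BOOST
    opponent = random.randint(1, DICE_SIDES)
    player_wins = player > opponent
    return player_wins
```

Want a 15-sided die later? Change DICE_SIDES once and every use updates.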
Another idea I'd like to plant in your mind: people say code comments are good.
If your code has comments, you're writing good code.
If your code doesn't have comments, you're writing bad code.
Sometimes, but often I think that's the opposite of true.
It's not just not great advice.
I think it's wrong advice.
And let me just try to give you an example here.
Suppose you look at play round, the first version with the magic numbers, you're like, oh gosh, this is hard to understand.
So let me do this.
Let me write a comment that describes what's happening here.
The comment's going to say: this play_round function is going to simulate a dice roll.
In this case, it's a 12-sided dice, and then we're also going to modify the player with a boost.
The default player boost is two.
And then if the player rolls a dice roll higher than the opponent, then they're going to win the game.
Does that comment make the code clear?
No.
It just gives you two things to read.
What if somewhere along the line you decide we need a 15-sided die?
Well, did you update the comment? Probably not. The tooling won't help you find it, like, oh, this 12 is used here and there. So now your comment says you've got a 12-sided die, but the magic number is a 15. No; in this case, a code comment is bad, and I want to give you a particular way to think about it. The code is bad, so the comment is trying to cover up the badness of the code. The code comment is like a deodorant for bad code.
Deodorant is not bad.
People need deodorant, but wouldn't it be better just to be in a state where you don't need deodorant?
If your code is not bad, then you don't need a comment to help people understand it.
You just write code so readable, like the lower version, that you just read it.
And do you want to change it to a 15-sided die?
You change the definition of the variable that now represents the magic number.
Change that to 15, and everywhere it was used now automatically updates. Fewer bugs.
But you also don't need to apologize for the code and then also maintain the comment and have something to read and then have the code to read.
Just put a little effort into writing better code.
So this is a really clear example of professionalism in code.
Not much effort; just the tiniest amount of effort goes a really, really long way.
So what is the fix?
Well, you saw the final code, but how do you get there?
Something called refactoring.
Now, refactoring is built into PyCharm and VS Code and other tools.
And it's an automated way to make these changes without potentially missing a case or something like that.
So here we could just highlight the 12 and say, extract this to a variable called dice_sides, and highlight the two and say, extract that to a variable called player_boost.
Literally, those UI actions will result in the code below.
What about our huge function, where the entire program was one big block?
Like, read through that and tell me what that does without the comments and without the printing of the header.
What does this code do?
I don't know.
It's really not that helpful, is it?
But if we break it into smaller pieces, again, we'll focus on functions later, but just to give you a sense of how this might look, we could break it into smaller functions, right?
So now we have game of life.
I can see the parameters and then you can just read through like, now if we just focus in on this top part, all the other pieces were like the building blocks.
We just focus here.
How easy is this to understand now?
Well, it's called the game of life.
That's a hint.
And you can see the parameters like number of generations.
Okay, good.
Initialize the grid.
That's the first thing we do.
The next thing we do is we go through all the generations and we get the alive count.
and then we display that current setup.
Then we calculate the next generation and we take a break and we do it again.
All of a sudden, this monster of a thing becomes pretty trivial, right?
If you break this stuff up, it's so much easier to understand.
So another one of these professionalism things.
It benefits you more than anyone else and it also happens to be a benefit to many other things.
So let's see this in action here.
We'll talk about refactoring functions later.
Let's just focus on this play_round one, which had the magic number problem and the variable naming problem as well.
So check this out.
Here's one of our magic numbers.
We now know this represents the number of dice sides.
So, dice_sides.
So what I can do is I can either right click or hit control T and say, introduce a variable or a constant.
I'll just say variable for now.
And it says, do you want to get all the 12s? You know I do. What are we going to call it? We're going to call it dice_sides. Notice it understood all the code and made sure we use that correctly all over the place. And then here we'll say introduce a variable; this will be player_boost. Notice the naming that appears, like that. All of a sudden it's so much better. But while we have this, let's keep going and fix this variable. This will be a rename, and we're going to call it player. Notice it changed it down here as well. And I'm going to go over here, highlight that, hit Ctrl+T, and say rename; this will be opponent, like that. And then finally, let's rename result to player_wins. Look at that: I basically didn't write any code at all. I just used the UI and said, I want that to be a variable.
It said, great, what's the name of it?
I want that to be renamed.
Great, what's the new name?
And we can run it.
You can see it just prints out true or false depending on whether it wins or not.
So true, true.
The player does have a tendency to win, but this is a little extreme.
Oh, there's a false, right?
Because of the player boost.
True, false, and so on.
Works like a champ.
So this refactoring is great.
And you definitely want to embrace it.
The refactoring in PyCharm is far better than the refactoring in VS Code, but VS Code does have some of these features as well.
So we go to the refactoring, you can see all these refactoring options, variables, constants, methods, changing signatures, moving stuff between class hierarchies and all that kind of stuff.
But the common ones are in both editors.
Let's talk about reformatting our code automatically, in ways we don't have to do by hand.
So you saw me actually improve the code with the magic number refactoring, but a lot of times it's just like those PEP 8 inconsistencies and a whole bunch of other stuff, right?
I mentioned there were two main tools for doing this: Black and Ruff.
Ruff is from the uv folks, and it's sort of equally incredible: like uv itself, it's super, super fast, as you can see from its comparisons right there.
10 to 100 times faster.
So what we're going to do is we're going to install this.
And the way we're going to install it is we're going to say uv tool install ruff.
So uv will manage this for us.
So we can just come over here: uv tool install ruff.
Now I already have this.
It should say it's already there.
So now we have the ruff command, and we can ask which version it is.
Okay.
So that's great.
Now we can use this command, which we could either have it built into our editors, in which case, like, let's say we can go here and I could say reformat this document as a command.
Say reformat.
It says reformat code or pick a file, but that's just one.
I want to apply this global formatting to not everything, because I want to have some of the examples left alone for now.
But let's go to this one, and I have this reformatted section.
So I want to reformat not just my Python code, but check this out.
My Jupyter Notebook that I'll be viewing in JupyterLab.
So not only will Ruff format my Python files, but it will do notebooks, and I don't have to have plugins for my notebook tools because I just want to have this done globally to my entire project.
If I've got 100 notebooks, make them correct.
Now, before I do this, I do want to point out that this has some defaults and you may not want to adhere to some rules.
You might not care about them.
And at a minimum, you probably want more than 79 characters.
So I'm going to create a file over here at the top of my project called ruff.toml.
And I'm going to copy this from another location.
So in here, I'll paste a bunch of rules.
So it'll say, hey, look, your line length is 120.
And anytime you have quotes, prefer single quotes over double quotes.
So all strings are consistent.
Don't try to reprocess things like the virtual environment folder, which is full of external packages.
Just leave it alone, right?
What version of Python are you on?
Python 3.13.
All right.
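The course's actual ruff.toml isn't shown in the transcript, but based on the settings described (120-character lines, single quotes, skipping the virtual environment, Python 3.13), a plausible sketch looks like this; the exact keys and values here are assumptions, not a copy of the course file:

```toml
# Hypothetical ruff.toml matching the settings described above
line-length = 120
target-version = "py313"
exclude = [".venv"]

[format]
quote-style = "single"
```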
So there's a bunch of stuff you can set here. But once that's at the top of your project, even if you run it from a subdirectory down here, say via open in terminal, the fact that the ruff.toml is defined up at this level means it will take effect.
Okay.
So let's look at a few things that are going to be fixed here.
So again, this is not the final, final version.
Functions will make this a lot better.
But notice, for example, here we've got a couple of spaces.
Too many spaces for that comment.
Those are double quotes when all the other strings are single quotes.
These should be grouped together into a single block because PEP 8 says these are all standard library things.
Too many spaces there.
Let's see, what else have we got down here?
This is fine, but this is an F string, which has no data.
So that's not right; there's no reason for that. And do we need to break the lines like this? I don't know. Well, let's go ahead and run it. Similarly, I'll just show you the changes for the notebook. Oh, we also need to install Jupyter in order for that to work. But notice right here, all those spaces and so on. So let's go back to our terminal and just say ruff format in the reformatted section, and watch: right there, boom, fixed. The spacing is all better. Were there other weird oddities? Let's go up here. The spaces here are good. There was some inconsistent string usage. This one had a bunch of spaces; all sorts of stuff. This was broken across like three lines but didn't conform correctly. We still have the magic number issue going on here, but we're just focusing on the reformatting for now. Okay, so how awesome is this? This notebook has one cell because it's just a starting place for us to do something awesome later, but however many cells you have, it'll reformat them beautifully.
And if we go over here to this other one, we've got this changed as well, though it didn't put that back. I think I have to pass another flag to make that happen, but it fixed our weird formatting: we had it like that, and it put it back to this. And let's see.
Yeah, it looks like everything was fixed.
For example, if I put a bunch of spaces there and run ruff format, see, it cleans them all up.
Another thing we can do is we can run ruff check and it'll tell you about violations.
So for example, there's our F string that we don't really need.
This is the F string with no placeholders, right?
So if I do this, --fix, see this should go away as well, right there.
So there's two things, there's check some of the rules and just format, all right?
Let's see, yep, put those spaces back.
It didn't take away that, did it?
All right, well, it's not perfect.
There's a ton of rules I've turned off, but there's a ton of rules that are turned on as well.
Finally, if you're in PyCharm, you can go over here to plugins, go to the marketplace, and type ruff.
You can see I've installed the Ruff plugin here.
And if we look at its settings, it has options like: when you save a file, well, let's just say automatically run Ruff when I save.
Optimize the imports on save, and so on.
These all seem good.
This is the built-in one that got installed.
So for example, let me just go up here, and I'll make this double quotes again.
And I'll put a weird space there.
If I just press save, just command S, boom, automatically fixed.
That's fixed.
That was fixed.
Apparently it doesn't run that rule the way I got it set up, which is fine.
But pretty awesome. Same thing for VS Code: you can integrate Ruff there as well.
Writing clean code is super important.
We already talked about that, but I've saved the most important one for last, functions.
So we're going to talk about how to create and organize your code with functions: regular ones, ones with type hints and type definitions, and a special kind that is especially important for data science, called generators, which allow you to do stream processing of data.
So imagine you have a CSV with a million rows and you want to work your way through that data.
One option is to load all million and then start working on them.
Another is to load one row, process it, load another row, process it, that kind of thing.
That second style is a generator.
They're very easy to write, and they're very efficient for working on arbitrarily large data sets.
Functions are central to programming, and I'm sure you've seen them.
However, they're not leveraged enough in data science as far as I'm concerned.
Here's a really cool set of notebooks.
So this is like an astronomy tutorial exploration type of thing.
And this particular notebook that we're looking at here understands and processes clusters of galaxies using Simple Spectral Access protocol processing.
I don't really do a lot of this.
I'm not an astronomer.
But this is a really cool notebook.
You can check it out.
The link's at the bottom.
And if you look at it, you'll see it's 5,513 lines.
That's a lot.
But another thing you will notice is there's only one function, and it's maybe 10, 15 lines long, in this entire thing.
So you'll see that there's actually some challenges to putting code that way.
Like, for example, you have to have all the details defined before you can get to the final processing.
You can't have the high-level story first and then the details, because of the way in which the code executes, basically.
So we're going to talk about how to break down notebooks and to a smaller degree how to use functions, especially coming up in the next chapter, to allow us to extract out the unnecessary details and just leave the core data analysis, the core presentation in your notebooks.
Luckily, functions in Python are super simple.
We create them with the def keyword.
So if you see def, that means we're defining a function.
It's not the only way to define a function.
We also have something called lambda functions, or lambda expressions.
But generally, functions are defined using this def keyword.
So you say def, the name of the function, and you use the lowercase, the snake case style.
So lowercase print underscore header.
And then in parentheses, you have all the arguments.
This is what some languages call a void function.
It takes no arguments and it returns no value.
Or does it? At least as far as it's concerned, it doesn't.
What this does is it simply prints out a little bit of text.
Maybe if we were doing the game of life, we might put this at the top and say, here's the version of this game and the name of it and so on.
If we run it, we get output much as you might expect, assuming the version is 0.11, and this is great. So it's super easy to write these functions, especially if they don't take any arguments. One thing that is a little bit interesting about Python: even if you do not return any value from a function (notice there are print statements but no return keyword at the bottom saying here's the value this evaluated to), when you run it, you still get a return value. If you omit the return, you just get None, which is a special value in Python that says there's nothing here.
Okay.
So very, very simple way.
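A sketch of that kind of function; the header text and version number are assumptions for illustration:

```python
def print_header() -> None:
    # A 'void' function: it prints, but never uses the return keyword
    print('Game of Life')
    print('version 0.11')

result = print_header()
print(result)  # a function with no return still gives back None
```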
Now this is somewhat useful, but way more useful is I pass in some information and I get out some information.
So here's an implementation for the Fibonacci numbers where we pass in how many Fibonacci numbers we want.
And I get back, it returns a list of those numbers by calculating and I'm using this simple little algorithm here, right?
The Fibonacci number is the sum of the prior two, start with zero and one, and then off you go to the races.
So that's really cool.
You can see we got these numbers, we're returning the numbers, and we print it out.
So we're passing in this n.
Now, if I'm reading this without going to the details of the function, what is it supposed to be?
Is that gonna be an integer?
Is it gonna be a floating point?
We don't know.
Well, if we put type hints in here, which we optionally can do, we can say n is an integer, and what comes back is a list of integers.
Well, that gives us a little more information.
Plus, it gives our editors more information to make sure that we're doing things right, and even agentic AI can look at that and go, oh, I know what's happening to a better degree.
So the return value here, these numbers, have to be a list of integers, which, of course, they are.
If we run this, we get back the first five Fibonacci numbers as a list: 1, 1, 2, 3, 5, or however we get used to counting Fibonacci numbers. So this is super cool, and I do prefer to put type hints on my functions. That way, when I'm using them, I know what's supposed to go there, and if I misuse them, say I pass in a string, my editor, like PyCharm or VS Code, will say, whoa, whoa, whoa, you're doing it wrong. If you don't put the type hints in, you don't get any of that safety net or assistance.
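The course's implementation isn't reproduced in the transcript, but given the described algorithm and the output shown (1, 1, 2, 3, 5), it might look roughly like this:

```python
def fibonacci(n: int) -> list[int]:
    # Each Fibonacci number is the sum of the prior two,
    # starting from 0 and 1.
    numbers: list[int] = []
    previous, current = 0, 1
    while len(numbers) < n:
        previous, current = current, previous + current
        numbers.append(previous)
    return numbers

print(fibonacci(5))  # [1, 1, 2, 3, 5]
```

The type hints (n: int, -> list[int]) are exactly what lets the editor flag a call like fibonacci('five').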
So functions are great.
They help us take super large, hard to understand blocks of code and turn them into narratives.
We've already talked about this a little bit, but here is the game of life broken into reusable pieces.
We've got a function for initializing the grid.
It takes the rows and columns and it returns a list of list of integers, which turns out to basically be a grid, rows and columns sized, right?
So pretty easy, we can count the alive cells.
That is, even though it was one line of code, that sum(sum(row) for row in grid) does not necessarily tell you what's going on.
But count alive cells, all of a sudden that sort of adds to the narrative, right?
And then display grid and calculate next generation.
Those are quite a bit more complicated.
So those are reusable, testable pieces, but that's not the most important thing.
Previously, we first initialized the grid, then started looping over the generations, then started displaying them, all inline.
The most important thing here is how this breaks the code into easy-to-understand pieces, and it lets us talk about the high-level bits first, and then the details.
What do I mean by that?
Just look here.
Forget the stuff below.
Those are the specific details if you need to know them.
But we go right from the top of the file.
We say, how do you run the game of life?
You initialize the grid.
You loop over the iterations.
You get the active cells.
You display the game.
Calculate the next grid.
You do it again.
You get to see that first.
And as a high-level thing, and if you care about how to display the grid, you go down to that function.
If you care about how you count the live cells, you go to that function.
But if you don't, you don't have all that stuff there, right?
So this is to me just so much more clear and maintainable and reusable, not just that it follows some guidelines like, well, things should be in functions and they should be small.
Yes.
But if we go one step further and order it so that the game of life high level bits are put first and then all the details, it really creates kind of a narrative in code.
And remember, code comments can be good, but code comments often are deodorant for bad code, hard to understand code, complex code.
In this case, you might want to put a comment here, like we say, run the Game of Life simulation.
We don't need to actually have very much description of what's going on here because it's almost English.
It's not, but quite close.
I think any beginner programmer or up could read through that and give you a sense of probably what the steps are.
No comments needed.
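Putting that ordering into code, a sketch of the structure being described might look like this; the function bodies here are simplified stand-ins, not the course's actual implementations:

```python
import random

def game_of_life(rows: int = 5, columns: int = 5, generations: int = 3) -> None:
    # High-level story first; the details live in the helpers below.
    grid = initialize_grid(rows, columns)
    for _ in range(generations):
        alive = count_alive_cells(grid)
        display_grid(grid, alive)
        grid = calculate_next_generation(grid)

def initialize_grid(rows: int, columns: int) -> list[list[int]]:
    return [[random.randint(0, 1) for _ in range(columns)] for _ in range(rows)]

def count_alive_cells(grid: list[list[int]]) -> int:
    return sum(sum(row) for row in grid)

def display_grid(grid: list[list[int]], alive: int) -> None:
    print(f'alive cells: {alive}')

def calculate_next_generation(grid: list[list[int]]) -> list[list[int]]:
    # Placeholder: a real version applies Conway's rules here.
    return grid

game_of_life()
```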
Let's take a moment and put this into action.
We're going to take our Game of Life, which, as I said, we'd gotten pretty far along, but it still needed functions, coming in the next chapter.
Well, this is that next chapter, where we're going to do the functions.
So let's go through and apply some of those reorganizing into functions and writing functions aspect here.
Now, notice I've done a couple of things, by the way.
The surviving counts value is a constant.
In Python, you can't actually have a true constant, right?
I could come down here, and, let's change IS_ALIVE, I could set that to be seven.
And this code will run, quote, compile in the sense that Python does.
But notice that the editors and Ruff and so on say: this is Final, aka constant, don't mess with it.
So that's typing.Final, and it's quite handy to have there.
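A minimal sketch of what that looks like; the names and values here are assumptions based on the discussion:

```python
from typing import Final

IS_ALIVE: Final[int] = 1
SURVIVING_COUNTS: Final[tuple[int, int]] = (2, 3)

# Python would happily run a reassignment like the one below,
# but editors, type checkers, and Ruff will flag it:
# IS_ALIVE = 7
```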
But down here we have some game parameters, and what I want to do is I want to start by first putting everything into one big function, because that will allow us to move the details below.
If I don't do that, well, then we still have to go little detail by little detail, and then the big picture.
So we can just say def game of life like this.
And in the original Game of Life, all these things that are passed in here, these are actually the parameters.
So let me just put a copy like that.
We'll take away the final because now they're parameters.
And we're going to put commas right here if I can avoid clicking on the wrong piece there.
And functions allow you to set default values.
So here you can say columns is 50.
And what that means is you can pass a value, but if you don't, then it just gets that default value of 50, and likewise delay gets 0.1, and so on.
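Default values in a nutshell; this toy function just returns its parameters so you can see how they resolve (the signature is a hypothetical stand-in, not the course's):

```python
def game_settings(columns: int = 50, rows: int = 25, delay: float = 0.1) -> tuple[int, int, float]:
    # Returns the resolved parameters so the defaults are visible.
    return columns, rows, delay

print(game_settings())            # (50, 25, 0.1)  all defaults
print(game_settings(columns=80))  # (80, 25, 0.1)  override just one
```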
So there we go.
We have this def game of life parentheses.
And then we can just go and highlight everything and hit tab to indent it.
And we can delete those.
Okay, great.
So now the next thing that we need to do is we need to initialize the grid.
So we can just take this down here and go to the very bottom.
And we can say def initialize_grid. I'll show you a better way in a second, but let's just take some steps along the way, baby steps. Then we put this in, and look, I misspelled it, but guess what?
PyCharm knows how to fix it for me.
Thanks, PyCharm.
And it needs columns and rows, so columns is an int, and rows is an int.
Great.
So then we come back up here.
Instead of this, what we're going to do, and I'll leave this here for a second:
We'll say initialize grid.
Notice in PyCharm, I can just type IG.
This also applies in VS Code.
And it says it's going to take the columns and the rows.
Okay, so don't need that anymore, right?
This is simple.
Oh, I made one mistake here.
We need to also get the grid back from initialize grid.
So let's jump down here and we'll say return grid, right?
So now we've got this grid.
It's no longer a global variable.
It just belongs to this function.
And we can use it later on, for example, right here.
Now, the next thing we want to do is here we're counting, getting the alive count, right?
Counting the alive cells.
So I could comment this, copy it to the bottom, and go on and go below.
But notice we have to have these values later.
We have to have these parameters, really just the grid, being passed in.
So check this out.
If I highlight this and I right click and say refactor or hit control T or command T, whatever it is, I can say extract method or I can just hit alt command M with my key maps.
Check here.
So it varies by operating system.
So I could say count alive cells.
Now notice it says a parameter is going to be the grid.
Here we go.
And then the output variable, the return variable is a live count.
So I just hit okay, and boom, where did that go?
It went to the bottom, beautiful, right there.
It wrote that code for us and made sure that the value being passed in is good and that the return value is good.
So what is the return value?
It's an int, we're gonna put that in there.
And if we want, we could say this is a list of list of int.
Don't have to, but I like to just so I know what we're working with here.
So this refactoring is super powerful, right?
So this part here, all of this, now this gets to be more valuable 'cause look at all this junk.
All this is kind of, ah, what's going on here?
So we can go and say, extract a method again.
This is gonna be display grid.
Look at all the stuff that gets passed in there.
You know what? That's okay; it has no return value.
So we say display_grid, and down here it wrote this function that took all of these pieces.
Right?
And we even get this sort of usage, where is this used?
If I click this, it takes me back.
In this case, just to the one place, but if it's used in many places, you'd see it all over.
Okay.
Now what's left is another pretty intense section, but this is just the iteration piece.
This means run one generation.
All right, let's go down here and take this whole section like this, this remaining bit.
And what is this?
This is to just calculate a single generation.
So let's go and refactor extract method.
I noticed passing in rows and columns.
I want to get those a different way.
We should be able to get them from the grid itself, but let's just go run one generation.
Deal with that in a second.
So here's the grid, got the columns and rows, beautiful.
We don't need this update.
Now notice we're passing in rows and columns, but the grid itself actually knows this.
So we can say rows, len of the grid because it's a list of list of integers.
And the columns is gonna be a len of a single row.
So let's get the first one.
And then we can say, let's remove those.
Now notice they're gray because they're not being used.
And I can probably have PyCharm remove the parameter and fix it where the call is as well, which it does here.
So if we go back up, notice it's no longer passing those here.
So it kind of keeps it consistent everywhere, which is great.
So now we've got all of this working.
I think we're in good shape.
Look at this.
Here's our high level story.
How do you run the game of life?
You initialize the grid.
You go through all the generations.
For each generation, you count the live cells, you display them, you run one generation, and then you take a break.
Now this is actually only being used in display_grid, so I think we could move it down.
Hold down Command (Control on operating systems that are not macOS), and these become hyperlinks, so we can navigate here, and we can just make the alive count a call to this function inside there.
Okay, and let's go back up and take this out.
So keep that even a little bit cleaner.
We display the grid, we run one generation, which updates the grid.
We do that as many times as specified, and we're done.
How awesome is that?
This is a much better story here, I think, to see the game of life.
And you can just come down here and say, look, I don't even care about these details.
I don't want to see them.
I want to hide it away unless I'm looking at the details.
And here's one more tip for you for PyCharm.
So I can say there's a region, and this will be game constants.
And down here I can say hash end region.
And then that also collapses there.
Super sweet.
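The folding markers being described are editor conventions, not Python syntax; as plain comments they're harmless anywhere (the constants here are hypothetical examples):

```python
# region game constants
GENERATIONS = 100  # PyCharm (and VS Code) fold everything between
DELAY = 0.1        # '# region' and '# endregion' into one collapsible block
# endregion
```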
So we've got our game looking really good.
Now, if I run it, it's not gonna run, at least within PyCharm, the way you would hope.
Let's find out.
Why didn't it run?
We get no output.
Did we break it?
Yes, but only sort of.
We'll come back and fix that next.
We have two more quick little issues to deal with to get our game of life fully working.
So if I run it, I get no output.
That's not great.
And of course, I ran it by right clicking run.
Life functions, still no output.
Why is that?
Because when Python now imports this code, it doesn't see just this.
It only sees the definition of a function that, if called, would run the game of life.
So we need to add one thing down here at the bottom.
Now, we could just say game of life like this.
And again, we don't have to pass parameters because they have default.
So if I run it, you'll see stuff happening.
It's not good.
We'll fix that.
But there's a slightly more standard, common technique in Python that allows us to use this as a library if we wanted to break it apart.
And in order to do that, we have to come down here and say main and hit tab.
And PyCharm has a, what's called a template for that, a live template.
So we have to add this one test.
If dunder name, which is a built-in thing that all files have, is dunder main, run it.
That means Python is running this file directly.
In which case, if you're running this file directly, run the game of life.
But if the name is not that, that means it's being used as a library somewhere else that probably has its own goals and wants to run doing its own thing.
So don't run it, just let it get imported so these details, these functions are defined, okay?
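That test looks like this; the body of game_of_life is elided here since the full implementation lives in the course file:

```python
def game_of_life(generations: int = 100, delay: float = 0.1) -> None:
    ...  # the simulation itself (elided)

if __name__ == '__main__':
    # True only when Python runs this file directly; when the file
    # is imported as a library, the guard is False and nothing runs.
    game_of_life()
```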
So now if I run it, it still works.
Well, you call that working?
What is this?
If I hit stop, PyCharm says the TERM environment variable is not set.
So PyCharm just doesn't like running this kind of interactive output.
So what are we going to do?
Well, we can go to the top of this file and just grab this run statement here, whatever the arguments are.
Now, of course, I could go into my terminal, and notice here I am.
In what working directory? Where are we?
We are in chapter three.
So let's go back to 04 now into functions.
And in here with the virtual environment active, in this one, I could just say Python life functions.
Or if I don't really care about going in the directory, I could just paste in, run the full path of the Python interpreter, the full path of the file.
Doesn't matter.
What we're gonna get here is we're gonna get the Game of Life running.
There we go.
Interactively as we would expect.
So here we told it to do 100 generations with a tenth of a second delay, and it's running to the end.
Boom, Game of Life is done.
How cool is that?
It's so cool.
All right, we'll run it one more time just to see we'll get different output.
But this is our Game of Life.
It operates identical to the way it did previously, but we've refactored it.
What that means is we've changed its internal structure without changing its behavior.
So we've made it much nicer to maintain, to work with, to understand, but it's still the game of life.
We did that primarily with functions.
All right, data scientists, back to notebooks.
This idea of using functions to get a higher level storytelling aspect for our code here also applies to notebooks.
Not quite the same way, but we'll take a game of life in a notebook, not a very good version of it, but one that does work.
And then we're going to make it look much nicer and do some storytelling as well using functions.
So let's jump over and do that now.
First of all, we're going to need Jupyter installed.
Remember, if we go down here and we say uv pip list, we do not see Jupyter there.
Just a little side note.
Notice there's only, what is that, six, seven libraries.
Let's install Jupyter using uv: uv pip install jupyterlab.
Now, remember, if you see a pip install statement in instructions, just put uv in front of it.
It's gonna be better as long as you have uv installed.
Here we go.
So downloading, installed, boom, done.
I don't know how many pieces got installed, but it's a lot: 94 packages.
That's a lot.
Okay, so now that we've got that done, notice PyCharm is indexing to try to understand all the new library features available, but there was a warning here that Jupyter was not installed.
Really, it was the IPython support that comes with Jupyter that wasn't installed, but it is now.
Okay, now first, if we look at this, you see it's literally just one giant cell, basically everything copied in here. But if I run it and scroll down, you can see, well, it doesn't run very well with PyCharm's interactive output there.
So let's do this.
Let's just come down here.
Okay, let's fire up JupyterLab down here.
in the terminal.
Great, you can see that we've got our life functions that we just wrote seven minutes ago, and we've got this Jupyter notebook, and by default, it's not looking great.
So let's just go and run it again.
Now you can see outside of PyCharm, we're getting a super nice output.
So like before, we have our notebook running, but is it good?
No, it's not good.
It's not telling a story like notebooks should, and it suffers from all the same problems we saw before: you've got to read through all the details, like, okay, what does this do again? Okay, well, the variable names are decent, so that helps, but oh my gosh, still not great. Sure enough, though, it finished down there. So let's do a few things. Let's go here, and I'll hit A to add a cell above, type M to turn it to Markdown, Enter to put in some content, and then let's paste in a little descriptive bit, Shift+Enter to run it.
And we've got Conway's Game of Life, a little bit of welcome to the story.
What are we gonna do?
We're gonna run this.
Now, what else can we do to make this better?
First of all, we don't need this description anywhere down there anymore, because this is way better than a one line comment.
The next thing I wanna do is, let's do, put our imports on a separate cell.
We'll take those out.
We can save that, and I want to hit Escape and then A to add a cell above.
And then we'll put import, I'll just say necessary imports and put those in there and run them.
And what's cool about this is we can go over here and collapse that cell.
So no matter how many imports we have, they're gone.
Oh, this is reminiscent of the code region collapsing that PyCharm does.
With Jupyter Notebooks, we can collapse those.
Let's reorganize this a little bit.
Let's say, well, we want to get things like just setting up the grid.
We want to move those out.
And I also kind of want to isolate this part.
And it's really important to focus on the parameters of our simulation.
So let's take this up here.
And we're going to go hit B to create a cell down there.
And paste those there.
This was fine as code.
But what I really think would be better is, do we actually need to have rows here and here, and then the description?
Now we can just say like this.
We can come down here and put that as a comment.
And we'll do that for all of these.
There we go.
And we can now remove that, run this cell here.
So right at the top, we have Conway's Game of Life and then the parameters and then the imports.
We're making progress here.
So the next thing we need to do is the sort of initialization.
That was a previous step as well.
So I'm going to cut that out and say create something above.
And first let's do a little bit of a header.
You'll see there's some cool things we can do with this.
And I'll say, and we'll just move this bit of code here.
We'll say set up the grid and display characters, right?
So this is the stuff that you'll see down here, the dots and the circles and so on.
So this part, excellent.
Do you need to see those details?
Probably not, so we can collapse that down as well. All right, so now I think it's really coming together. So here's our introduction; the parameters, right front and center with a little description, super important; then the imports. People reading this don't care whether we're using the IPython clear_output function or random. The imports have to come above the things that use them, but they're not that relevant.
Here's how we set up the grid.
If we care, we can expand that, but probably not.
Central to this algorithm here is the actual processing for a single generation.
So I kind of want to focus on that next.
So let's add a markdown cell.
Okay, so a little header to say what's coming.
And then down here, we're going to have some code.
And where was that again?
Here, calculate the next generation. So what we want is all of this. I want to move this into a function; here's where functions can be useful in notebooks. So we come over here, def, and here we're going to take the old grid and return a new one. So we'll say new grid; we'll just rename grid to new_grid inside, that new_rows becomes part of new_grid, and then return the new grid. Okay, great. I'm going to close this just for a second. Let's go ahead and run that to define it, but let's close that up. And then, where did I take it from? Right here: run one generation, old grid to new grid. So what we'll say here instead is grid equals run_one, and if I hit Tab, run_one_generation, and we pass the grid. Now down here, let's actually take all of this; it can be just display grid.
This was a current grid.
That doesn't exist yet, but guess what?
It's about to.
So let's add a cell above.
Change it to markdown and do a little intro thing.
And then put in our code here.
And this is going to need, what did I call it?
I said display current grid.
And so we're not using the global grid; instead, we're going to use this current_grid we're passing in, so we get a little less coupling between those things. And if we run it, this function now exists. Again, though, this isn't super important, so let's just collapse that. Let's tighten this up a bit, put it like that. So here we can go and add one more Markdown cell.
This is gonna be, put that there and let's run it.
Aha, the display current grid misspelled it, there we go.
I think I messed up the auto complete.
You all probably saw that but somehow I didn't.
Okay, let's just go over here and say run all cells.
And look at that, here we go.
Our game of life simulation is running just like we expect, honestly, just like it was before.
Here's how life played out for everyone, that's cool.
But if we go back and look at our notebook, look how much nicer it looks.
We've done a little effort to put in some commentary and introduction.
We've also done things like move the key parameters of the simulation right to the front.
we've collapsed out the various pieces that don't matter, like the imports and what characters we're gonna display things with, as well as how do we display it at all.
But the two important pieces are left, like here, how do we compute a single generation, as well as the actual simplified, easy to understand version of the game of life with the output at the end.
I think it's really great.
And by the way, these little collapsed pieces, If I hit save here, we close this out and we restart our Jupyter lab.
Go back to the notebook.
It remembers that kind of stuff, right?
It knows that these stay collapsed and the ones that are expanded are expanded.
The ones that are collapsed are collapsed.
So it's a really nice way to save this presentation and have it open up in the way that you want people to perceive it.
Let's wrap up this chapter by talking about one more super cool aspect of functions, a type of function that is really powerful called generators.
So before we do look, you can see that our notebook is looking way better here inside of PyCharm as well.
So that's great.
But what we're going to do is we're going to create one more file called generators.
So let's go back to the Fibonacci numbers, right?
So we'll have our def, and at the end, we'll return numbers.
So let's just go and run and make sure this works.
And how about we run the correct file?
Here we go.
Sure enough, first five Fibonacci numbers.
What if I put the first 500 or 5,000?
Well, the Fibonacci numbers are an infinite sequence.
What I would like to do is be able to work with all the Fibonacci numbers, infinitely many of them, or at least as many as I want until I'm doing some sort of test or something like that.
How do I do this?
I can't put infinity here.
If I do, we're going to fill up the numbers until we run out of RAM and crash, and we're going to run out of time.
But there's something really wild in Python that we can do, called generators, as I've pointed out. So check this out. To begin, we're going to leave the number here, but we're not going to gather up the values, and let me remove this for a second as well. Instead, we're going to go through the algorithm and each time say: here's one of the items, do some more work, oh, here's another one. And what you'll see happen is super interesting. So we'll say yield, and there's no return value. And so I'll print, let's say, fib, and no, not gen, fib_gen, like that. Okay, fib_gen, and let's put five. All right, again, we get a generator object. Weird, okay. So watch this: we say for n in fib_gen(5), print n.
Let's run it.
Sure enough, you can see it's printing out the numbers.
Okay, but watch this if I put a breakpoint here.
This is craziness.
So let's just do it for the first one.
Put a breakpoint here.
So the regular non generator version.
So I can step into my code, comes over here and says, Okay, we got numbers, I'm gonna run this.
And notice the numbers are starting right up here.
The numbers are piling up; we're spending all our time in the Fibonacci function, right?
Just cruising along.
And then the numbers come back and then we print them out.
Cool.
So let's come down to this one.
We step into our code, it seems like the same thing.
We're back here, but next and current.
We step, step.
Watch what happens when we hit yield.
We leave this function, go back to where we're working with it, do whatever went with the value, and then we come back in here, and let me do this a few times to emphasize what's going on here.
We come back, notice we don't restart the function.
It's like two and three.
It does one step of the algorithm, lets you work with the answer, and restarts, resumes this function.
These generators are crazy things, but you get the same output as before.
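The pause-and-resume behavior described here is easy to see with a tiny generator and explicit next() calls. This is a minimal sketch of my own, not the course code:

```python
def counter():
    # A generator: calling it runs no code, it just creates a generator object.
    print("starting")
    yield 1          # pause here and hand 1 to the caller
    print("resumed")  # only runs when the caller asks for the next value
    yield 2

gen = counter()       # nothing printed yet -- the body hasn't started
first = next(gen)     # runs up to the first yield, printing "starting"
second = next(gen)    # resumes right after the first yield, printing "resumed"
```

Each next() call advances the function to its next yield, exactly the step-over-and-come-back behavior visible in the debugger.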
And I told you these were infinite numbers.
So let's just say, so that will just say while true, we're going to forever run that.
Well, like I said, if I run this up here, it's gonna take forever until it runs out of RAM.
This one, though, let's just say if n is greater than 1500, we're going to break out.
We'll print, done.
Watch this.
While true, crazy stuff, boom.
It just processed an infinite series of numbers until it got to a big one, said, all right, I don't actually wanna go through the rest of them.
But it's the consumer who decides how many of the infinite numbers you get to go through.
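Here's a sketch of that idea: an infinite Fibonacci generator where the consumer, not the function, decides when to stop. The names are illustrative, not the exact ones typed in the video:

```python
def fib_gen():
    # An infinite Fibonacci generator: one value per resume, no list in memory.
    current, nxt = 0, 1
    while True:
        current, nxt = nxt, current + nxt
        yield current

collected = []
for n in fib_gen():
    if n > 1500:          # the consumer decides how many values to take
        break
    collected.append(n)
```

The while True loop never ends on its own, yet the program terminates instantly, because only the values actually requested are ever computed.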
One more thing, what if we only want the even ones?
We can have another generator and these are composable.
So we could say even_numbers, taking some number list, we'll call it some_numbers.
Watch this.
So we could say for n in some numbers, if n is even, so n mod two equal equal zero, yield n.
So this also can take a finite or infinite set of numbers and process them, okay?
Let me rename this so we don't get all these warnings.
The ones saying you're shadowing the n variable.
There we go.
So it says you're going through some thing you could loop over, like a list or a generator, and you're just getting the even ones back.
So watch this.
We'll have even_fibs, which is going to be even_numbers of the Fibonacci generator, and then we'll print even_fibs.
So this is an infinite sequence that is processing one by one, the infinite sequence of the Fibonacci numbers.
Crazy.
Now let's run it again and see what our answers are.
Look at that.
How awesome is that?
I just, it is so amazing what is happening here, folks.
we have an infinite sequence.
We're feeding it to this other generator that knows how to process arbitrarily large, one at a time numbers, and it's passing them back if they match the criteria.
This is just super, super neat.
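A minimal sketch of composing generators like this, with assumed names matching the discussion:

```python
def even_numbers(numbers):
    # Works on any iterable -- a finite list or an infinite generator alike.
    for n in numbers:
        if n % 2 == 0:
            yield n

def fib_gen():
    # Infinite Fibonacci sequence, one value at a time.
    current, nxt = 0, 1
    while True:
        current, nxt = nxt, current + nxt
        yield current

even_fibs = even_numbers(fib_gen())   # still lazy: nothing computed yet
first_four = [next(even_fibs) for _ in range(4)]
```

Chaining the two generators costs almost nothing: values flow through one at a time, and the filter only sees numbers the consumer actually pulls.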
So how do you know if you have a generator, not a regular function?
Well, two ways, technically in Python.
Instead of having the return keyword, like we do, I'll just go up here.
Instead of having the return keyword, you have a yield keyword.
Sometimes yield from if you're doing like hierarchical type stuff.
But the yield keyword appears, that means you have a generator.
So super, super neat.
There's a few gotchas that are not really worth going into, but super neat.
The one other way is we can have a sequence.
We'll do one more of these for you.
So I'll have thirds, bibs, so divisible by three.
So what we can do is we can come over here and use parentheses and use a generator comprehension.
I can say f for f in fib_gen, if f mod three equal equal zero.
And we could do the same.
When you see a statement like this, if it has parentheses, it's a generator expression. If it has square brackets, it's a list comprehension, which is the original form these came in.
But if we do it like this, it's effectively the same as these even numbers type thing.
Okay, let's run it one more time.
Let's make it a little bit bigger.
Apparently there's only one that's divisible by three, which is kind of wild.
Here we go.
So we got a couple out, right?
but super interesting that we can do them through these comprehensions and one-liners, or we can do them through generator functions like this.
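The brackets-versus-parentheses distinction can be sketched like this (a small example of my own):

```python
nums = range(20)

squares_list = [n * n for n in nums]   # square brackets: a list, built immediately
squares_gen = (n * n for n in nums)    # parentheses: a generator, computed on demand

# The generator expression produces values lazily, one at a time.
lazy_first_three = [next(squares_gen) for _ in range(3)]
```

Same syntax inside, completely different evaluation: the list exists all at once, while the generator only does work when something pulls from it.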
Generators are really powerful in data science 'cause they allow you to stream data through filtering and processing and grouping type of operations without loading up the data sets, right?
I mean, you might think, oh, I've got a 10 gigabyte CSV file.
How possibly could I process that?
Well, we're processing a literally infinite series of numbers, so that's more than 10 gigabytes if we were to try to write it down somewhere. Pretty amazing. You can do the same type of thing, though: when you open a file, you can iterate over it, yield it out line by line, and you're off to the races.
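That file-streaming idea might look something like this sketch; the function name and the search logic are hypothetical, just to show the pattern:

```python
def matching_lines(path, needle):
    # Stream a (possibly huge) text file one line at a time;
    # only the current line is ever held in memory.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if needle in line:
                yield line.rstrip("\n")
```

You could feed the result straight into another generator, a filter, or a grouping step, and the whole pipeline stays constant-memory no matter how big the file is.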
In this chapter, we're going to talk about organizing and reusing data science code.
Why does it matter to organize our code?
Well, we did a lot with functions, and I think there's a lot of inspiration you can take from that section.
But we're going to go way farther.
Think about the data science libraries that you use day-to-day in your area of study.
Maybe you're an astronomer and you use AstroPy, or you're in genetics and use a genetics library that's used commonly amongst that field.
A lot of that code probably started inside of notebooks, but it can't stay there.
You can't reasonably share libraries as notebooks.
If you want to have that code usable by other people, you need to convert it into a Python package.
You need to publish it to PyPI.
then they can pip install your library and maybe it will become super popular like AstroPy is for astronomers.
The other angle of that is Python and data science in production.
Typically, we don't publish notebooks to production environments.
If I've got a machine learning library that I want to use to make recommendations into an existing web app, I don't embed the notebook running inside that web app, I write Python code in an application that is available to that web app, maybe as an API, maybe it's embedded into it.
Not many places actually use notebooks in production to run machine learning code. They might run notebooks in production for reports, but not to make applications go, things like Papermill from Netflix notwithstanding.
So what we're going to talk about here will also help bring your code closer and make it more ready to be moved to production already.
And finally, it'll just take our ability to do storytelling with notebooks to another level to make it a little bit clearer and cleaner than what we did with our game of life example in the functions chapter.
Reusable code, code that's closer to go to production, code that is closer to being turned into a Python package that you could ship to the world.
All of these things are good, and I think you'll find that your code's a little more understandable, reasonable, testable, and so on with what you learned in this chapter.
So let's look at a notebook where it's pretty simple to understand.
It's nothing too complicated, but it can serve as a blueprint for helping us organize code.
So you saw when we did the game of life, we used functions within our code to make things a little bit nicer.
Like how do we display the grid for its current iteration and things like that.
And we had to put them into our notebook because they had to be defined before we used them later.
Well, there's a lot of functions like that that are really not that relevant to our code, but they're necessary to make things go.
So what we're going to do is push those functions into potentially multiple other Python modules, not packages, no pip install, just a notebook and a Python file, or a pair or set of them.
And that'll help us really keep focused on the important parts of what we want to put into our notebook.
And that code that is in the reusable Python module itself would be pretty ready to be dropped into some sort of production environment potentially.
So look at this notebook we got here.
We're going to have this external math_f library, F as in Fibonacci, keeping going with that idea.
And then we can import it and use functions from it.
So Fibonacci gen, multiples of, and so on.
That way, our section down here remains completely focused on our actual area of study.
In this case, what is the relationship between prime numbers and Fibonacci numbers?
You know, something from number theory, that sort of thing.
So we're going to go and create something along these lines that allows us to take these research or utility type of functions that we're not super interested in making part of our presentation because people just say, here are the Fibonacci numbers.
Like you don't need to understand, you don't need to see the implementation of that, right?
You just need to have access to them and so on.
We're going to dive into this and it's going to be a lot of fun.
Now let's create our notebook that we want over here.
We'll say new Jupyter notebook.
We'll call it math research.
And let's go ahead and convert that to markdown and put the title and things like that to get started.
Okay, so here's a little introduction to our notebook.
And I just put a note we're going to need to import.
Okay, here's a sketched out version of what we want to get started with.
Spelling mistake included.
So what we need to do is we need to write a couple of math functions.
Now, again, I could put them here.
I could define a function for the Fibonacci's, define a function for other things, but people don't need to see that.
They just want to see what the relationship is and the things we're trying to pull out.
So instead of putting them here, I'm going to go create a separate file.
called math_f.
Now we've already written the Fibonacci number.
So I'll just copy that over from before.
The one thing I didn't do is specify what the type here is.
Let's go ahead and put that in here.
This is going to be a typing.Generator. I think it generates an int. Oh, whoops: Generator of int, None, None.
So it says it's a generator, and the things that it generates as part of its collection are integers.
So that's one thing we want to use.
But the other is we're going to be doing number theory type research.
I want to say maybe there's a relationship of Fibonacci numbers and the ones that are divisible by a particular prime.
So I'm going to write another generator.
is multiple of, and we'll give it the number.
And let's give it the collection like that.
So this is going to be a typing dot iterable of int.
And this is going to be an int.
And what is it gonna return?
Well, it's also gonna be a generator.
Could say it's an iterable, but generators a little more specific.
Ooh, and that's a lot of type information there, isn't it?
especially when you consider how insanely simple this is.
So we'll say for n in collection, if n mod number equal equal zero, yield n.
How about that?
Amazing.
So given us potentially infinite set of numbers, something we can loop over, turn that into a generator, a sequence of just the ones that are multiples of this.
Now, if we pass a prime number, then we're exploring the relationship of those.
Okay, that hopefully is enough to at least be motivating, that there's something interesting here.
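Putting those pieces together, a math_f module along these lines might look like the following. The exact names and the argument order of multiples_of are my best reading of the transcript, not a verbatim copy of the file:

```python
from typing import Generator, Iterable

def fibonacci() -> Generator[int, None, None]:
    # Infinite Fibonacci sequence as a generator of ints.
    current, nxt = 0, 1
    while True:
        current, nxt = nxt, current + nxt
        yield current

def multiples_of(collection: Iterable[int], number: int) -> Generator[int, None, None]:
    # Pass through only the values from `collection` divisible by `number`.
    # Accepts finite or infinite iterables, since it never materializes them.
    for n in collection:
        if n % number == 0:
            yield n
```

The type hints look heavy for such tiny functions, but they tell both the reader and PyCharm exactly what flows in and out.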
So notice that we have, now we can type import.
If we type, if we do this in PyCharm, we get much better tooling support, right?
Auto complete shows that it's not being used and so on.
But let's just run it to make sure that works.
And we could even do, down here, print(dir(math_f)).
And that'll show us everything that's in there, which is Fibonacci generator.
And since we only have one, let me actually go back and change that to just Fibonacci.
Now, if we rerun this, we just have, we've got to restart the whole thing.
Now we have Fibonacci and is multiple of; you can't simply re-import a library to pick up changes.
So looks like everything is working perfectly there.
We should be able to use these functions down below.
So down in this section, we're going to do our research.
Let's come up with a prime and 17.
Why not?
One that's not too ginormous.
And we want all the Fibonaccis. That's going to be, instead of using a function or a cell to compute this, we'll just go to our library and get these as a sequence.
I'll call them based fibs, the ones that are based on that number.
And we'll say math f dot is multiple of, or let's actually rename this.
Can we refactor rename?
Let's try that.
Just call it multiples of.
Did it work?
I think it might have.
Oh, that is so sweet.
Okay, so multiples of all the Fibonacci numbers.
And what is the number that goes in here?
The prime number.
Again, we could just put 17 in here, but this is a little bit of that magic number sort of thing.
So this is a generator.
This is a generator here that comes out, but then we can start looping over.
So let's say the first 10 that we're interested in, we're gonna collect those up.
Remember, that's an infinite sequence.
So this becomes an infinite sequence, and we're gonna need to do a little work to stop.
I'll show you in the last chapter something epic about making this better, but let's just write it in a most obvious way.
So we want to have the index and the Fibonacci number.
So we're going to get that by enumerating the based Fibs, which is the generator taking the generator, right?
Amazing.
Say first 10 dot append Fib.
And then we'll say if IDX is greater than 10 break.
So what are we doing here?
This means we want to store the number.
And the idx is going to be the first, second, third, fourth one, really zero, one, two, three. And once we've had enough of them, we stop, based not on what value the Fibonacci is, but on how many we've gotten.
We could also do this without that by saying the length of the first 10, but we call that every time through the loop.
Probably a little more expensive.
That doesn't matter.
We're going to now have this set here.
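That take-the-first-N pattern can be sketched as a small helper. Note this version checks the count before appending, which sidesteps the off-by-one an append-then-check loop can have; the name first_n is my own:

```python
def first_n(values, count):
    # Collect the first `count` items from any iterable, then stop --
    # safe even when `values` is an infinite generator.
    collected = []
    for idx, value in enumerate(values):
        if idx >= count:
            break
        collected.append(value)
    return collected
```

Using the index from enumerate instead of calling len(collected) each pass is the same "how many, not which value" idea described above.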
All right, so let's run that.
We need to rerun it because I renamed it, right?
This is tricky.
I renamed it, but the import doesn't stick.
So let's do that.
Then we'll run them all.
Okay.
That refactor rename wasn't as amazing as I thought, but it's good enough.
Let's convert this to markdown.
And we'll say, and then we can just print them out.
Now, watch.
We could do it like this.
and there they are awesome so already this is super cool here i think we've focused on our research we're not worried about like these little utility functions no we put those over here and we got the answers we're looking for right 17 is 2 34 is 2 times 17 and so on But let's do a little bit nicer formatting.
We already have the numbers, but I want to turn this into a list of strings, which has things like digit grouping and so on.
So we can create a little f-string, and we can say n, colon, comma, which will do digit grouping, for n in first_10, oops, with the f-string on the other side. And then we can print, maybe rename that to tn.
Run it again.
There we go.
We get our digit grouping like we hoped.
And it just emphasizes how few of the Fibonacci numbers are actually multiples of this prime.
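The digit-grouping format spec looks like this in isolation (the numbers are just examples):

```python
big_numbers = [1597, 28657, 14930352]

# The ',' in the format spec inserts thousands separators.
formatted = [f"{n:,}" for n in big_numbers]
```

That one character in the f-string is all it takes to turn 14930352 into a readable 14,930,352 in the output.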
Maybe we're onto something.
Maybe it's amazing.
I don't know.
But I think this is pretty cool.
Now, let's just go and launch this.
Make sure I hit save.
Launch this up in a Jupyter notebook.
So open in Terminal, JupyterLab.
Open up our math research.
And I think this looks great.
The presentation is a little better here than in PyCharm, at least with dark mode.
So organizing code or importing our math functions and so on.
Let's see what the relationship is.
And here we can just focus on exactly what we're doing.
I don't totally love how this looks, but I think it's certainly better than having to write up Fibonacci's and all this other stuff first.
We can start looking at, well, what are they?
Let's display those, right?
So we'll get a little nice text conversion and print these out this way.
So I'm pretty happy with this.
And then we could even come in here and collapse that out so we don't have to see too much about it, right?
And we don't necessarily have to look at that either.
Yeah, maybe like this, right?
Now we're really focused in on that one piece there.
And let's just rerun all the cells to see.
Yeah, looks great.
It's all working to me, as far as I'm concerned.
So yeah, hopefully this gives you a sense of it. Yes, you could put everything in a cell, straight in there, not even as a function. But sometimes adding functions in cells lets you change the order, reuse them in different places, and avoid global variables, which is great.
Or maybe go a little farther and you put them over here into external code that's not in a notebook, that these things could typically be used across multiple notebooks.
They could be used in production, though this is like not really that type of thing, but still super cool.
I think the way this notebook came out is really clear, really focused. Heck, you could even write unit tests against that, if I hadn't told you not to worry about unit tests too much in the beginning. Eventually it's a good idea, just not the first thing to do when you're starting out.
Hopefully, you like this.
This inspires you.
I think it's pretty cool.
What if we wanted to share our code, not just maybe across notebooks or within our team, but with the world as a Python package?
Now, I want to be clear.
I don't think you need to worry about trying to create and publish packages right away.
There's actually a lot of work.
You, you know, get your GitHub repos, you got your work with external folks reviewing their pull requests, and there's a lot going on here.
But I want to show you how we can go from what we've built all the way to PyPI without very much work.
So it inspires you like, oh, I see what I could do at some point if I built something I am ready to share.
So we're going to really quickly take that little MathF library we've created and basically publish it to PyPI.
Basically, not really, but go right up to the step of almost uploading it, but not quite.
All right, in order to create something we can put into PyPI, we have to create what is technically known as a Python package.
So I'm going to create a folder.
That's where we need to start there.
So I'm going to create a folder.
I'm going to call this mathfib.
So just so it has a different name than this.
All right.
And then in here, we're going to create, now it's just a folder.
It's not anything.
If I create a Python file named dunder init (__init__.py), notice the icon right here has changed, because this is now a package or sub-package. I'm also going to go here, take that information, and put it into the dunder init. That's not the only way you can do it, but it's the simplest way. Okay, so now if I open this in a terminal, in the directory where this math_f library is, and I run Python: import mathfib, and we've got Fibonacci and multiples of right there. Very cool, right? So hey, it's a package, but it's not yet a package we can publish to PyPI.
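To see the __init__.py mechanic in isolation, here's a self-contained sketch that builds a tiny package on the fly and imports it. The mathfib name matches the video; the double function is just a stand-in:

```python
import importlib
import pathlib
import sys
import tempfile

# A folder becomes an importable package once it contains an __init__.py.
tmp = pathlib.Path(tempfile.mkdtemp())
pkg = tmp / "mathfib"
pkg.mkdir()
(pkg / "__init__.py").write_text(
    "def double(n):\n"
    "    return 2 * n\n"
)

# Make the temp directory visible to the import system, then import the package.
sys.path.insert(0, str(tmp))
mathfib = importlib.import_module("mathfib")
```

Anything defined (or imported) in __init__.py becomes available directly on the package, which is why putting the functions there is the simplest layout.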
So let me see; what we need to do is create what's called a, is there a file type for this?
I need to create a pyproject.toml.
I don't see it.
So we're just gonna create as a blank file, pyproject.toml, there we go.
And now we need to fill this out with all the structured stuff that you need to make a package.
Let's see if I can get a little AI help on this one.
I'll say, ""Please create a package definition for the math fib package so that I can publish this to PyPI.
This is the Junie agentic coding assistant from JetBrains. So it's created a little plan it's working through.
Examine the contents, check the related math F, understand it might be included.
No, it's not, and so on.
Let's let it do its thing.
There we go.
Awesome.
And is it completely done?
No, it's getting there though.
Awesome.
It says it's done.
Let's see what it's done.
First of all, it created a readme for us.
We can look at the readme like this.
mathfib, a Python package.
Okay, understood it's Fibonacci.
That's kind of fun.
And now we're not doing that.
How do we install this?
With a uv pip install.
There we go.
So here's how I use it.
Even comes up with a little example of that.
Hmm, not terrible.
I guess we're doing MIT and that's fun.
But look, it needed a readme so it could have this.
And I think we're in good shape here.
Yeah, we obviously have to put our details in, you know. And Python 3.13, I could probably get away with requiring a little bit less.
But now we've got this pyproject.toml, which allows the build tools to create what's called a wheel, a packaged-up version of our little local package.
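A minimal pyproject.toml along these lines might look like the following. Every value here is a placeholder sketch of my own (including the choice of hatchling as the build backend), not the exact file the assistant generated:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "mathfib"
version = "0.1.0"
description = "Fibonacci helper functions"
readme = "README.md"
requires-python = ">=3.13"
license = { text = "MIT" }
```

With a file like this next to the package folder, the standard build tools have everything they need to produce a source distribution and a wheel.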
Okay, so we have our pyproject.toml, which tells the build tools what we need to package this up, and we could upload it to PyPI.
Once again, we have uv to the rescue.
So let's go over here and open up this folder right here in the terminal.
And we can say, just as simple as, if we're in the same folder as the pyproject.toml, we can simply say uv build.
Look what it did: it built, in dist, a mathfib source distribution and a mathfib platform-independent wheel (none-any), which is what you need to ship and install. All right, so here it is. This is what we need to publish, or at least to have something we could publish. Now the next step is to go back to the same place, and we would say, put it up nice and high: uv publish.
And it says, great, how are you going to do this?
Well, you have to have an account at pypi.org.
And then I believe it has to always be tokens these days.
That's how I've been doing it.
So you say token, and then you enter your API key.
If I were to enter my API key, since there's not a mathfib package on PyPI, there would be one under my account, and people could start pip installing it.
So we're canceling out of there, but see how easy this is to do.
You take a template for pyproject.toml, or you get an LLM to make one for you, you give it a README, and off it goes. You probably want a GitHub repository to go with it. And let's just see one more thing here.
And let's just see one more thing here.
So let me copy this path, the absolute path.
And let's go over here.
Go to the desktop and say, which Python?
No Python found.
Okay, that's good.
So I'm gonna create a virtual environment here, which is activated.
Now, if I ask which Python is this one, if I say uv pip list, there's nothing here.
How about this?
uv pip install.
Look at that.
I've now installed this package from here.
uv pip list math fib, Python, import math fib.
Before we go on, I just realized I made a quick, a very small mistake.
I need to put that package inside a subfolder next to the pyproject.toml; it wasn't building right.
So we'll build one more time.
And then we can go over here to our virtual environment and say uv pip install and pass the whole wheel file like this.
See it works correctly.
We'll run Python.
I'll import math fib.
Look at that, amazing.
Now we can see what's inside of our library.
We've got our fibonacci and multiples_of. Let's just print this.
We'll say n for n in multiples_of.
One, two, three, four, five, six, seven, eight.
That's the sequence to look through, and we want it to be multiples of, let's say, even numbers.
We're gonna close that off.
Look at that.
We're checking for multiples of two.
I mean, it's not that advanced, but it does show that our math package is working.
Pretty awesome, actually.
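The math_fib package itself isn't shown line by line in the video, so here's a rough sketch of what those two building-block functions might look like; the exact names and signatures are assumptions for illustration.

```python
# Hypothetical sketch of the math_fib package's two helpers:
# fib() builds Fibonacci numbers, multiples_of() filters a sequence.

def fib(count: int) -> list[int]:
    """Return the first `count` Fibonacci numbers."""
    values = []
    current, nxt = 0, 1
    for _ in range(count):
        values.append(current)
        current, nxt = nxt, current + nxt
    return values


def multiples_of(numbers, factor: int) -> list[int]:
    """Keep only the numbers evenly divisible by `factor`."""
    return [n for n in numbers if n % factor == 0]


# Multiples of 2 among the first 10 Fibonacci numbers:
print(multiples_of(fib(10), 2))  # -> [0, 2, 8, 34]
```

That list comprehension is essentially the expression typed into the REPL in the demo.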
Now, you saw me type uv pip install and pass the wheel.
If this were published to PyPI, which, like I said, it would have been if I had entered my token,
I would just say uv pip install math-fib.
And just like that, my library is available to the world.
People can use it.
They can work with different versions as we publish them.
Super cool.
So again, not something I expect you'll be working with directly right away, but I wanted to kind of show the whole story arc of why we would organize code into separate modules, why we might break it out like we did.
So this goes far beyond how you can share a notebook if the goal is to share these building block functions and functionality.
Okay.
|
|
|
30:15 |
|
show
|
0:35 |
Let's talk about source control and Git and GitHub in particular.
Now you might first be thinking, well, I work by myself, so I don't really need source control.
I promise you, you'll find it super valuable.
Even if you're working by yourself, even if you don't have a full-time job, you're just getting into the field or something like that.
There are tons of reasons to use source control, even on your own.
So we're going to quickly go through how to use Git specifically with Jupyter Notebooks and data science type of code.
|
|
show
|
0:49 |
Now, as we go through these topics, you might be thinking, Michael, we're not really going super in-depth into all of this.
There's still more Git things I need to learn.
Well, yes, and we have an entire course on Git and source control and branching and contributing to open source and forks and all those kinds of workflows.
And you can find it right here at talkpython.fm.
If you really want to dive into this, I recommend you check out this course.
It's really one of my favorite ones that we offer.
The goal of this chapter is just to give you enough, right? I mean, that's the title of the course, and staying true to that, just enough Git and source control experience to get started and be productive. And if you need to learn more, well, you know, consider this course or other places you can learn about Git and source control.
|
|
show
|
3:11 |
When you're working on a team, research team, software team, collaborators around the globe, whatever it is, you need to exchange your edits and your contributions to the project, be that notebooks or Python files or data files, that sort of thing.
The best way to do that is to use source control.
Sure, you could create a shared drive like Dropbox or Google Drive, but as soon as two people edit the same file and Google or Dropbox tries to sync it, it's going to say, well, there's a conflict and I don't know what to do.
Two people edited this file at the same time.
Sorry, you're out of luck.
Whereas with source control, you can merge them.
You can see the differences.
You can understand the history incredibly, incredibly well.
And it's really the way that software is built these days, using source control and Git in particular.
That's great if you're on a team, but maybe you're working by yourself thinking, well, that's for the people who are on teams.
It's not for me.
No, it's for you.
Have you ever created a thing where you said, let me make a copy of that file before I edit again, just in case?
Or you've got a project and you zip up the folder and you put a date on it just in case you got to get back to it.
Or you don't do any of those things and you're just hesitant to make a change.
You're like, oh, I would like to explore reworking this code in this way or changing it in that way.
But what if I mess it up?
With source control, you don't care if you mess it up.
You don't care if you break it.
You can always go back to the way it was before.
And you can use fine-grained tools to compare them.
The other place that when you're by yourself that using source control is super valuable is across machines.
Sitting here in my office, I have my live stream recording set up here.
And over there on the other side, I've got a Mac mini that is my main workstation.
And I have a MacBook Air that I use anytime I'm out of the office.
All of those have many of the same projects.
And I need to synchronize even just with myself across those machines.
So what do I do?
I'll push it to Git, push it to GitHub, go to the other machine, do a git pull, make my changes there, push them back when I'm ready.
It's kind of like there's three of me working independently as far as the source control and that sort of thing is concerned.
So lots and lots of reasons to have source control.
And if you're by yourself, the best, most significant, unappreciated advantage is the ability to be fearless, making changes to your code, trying new things, adding features, saying, let me break this, but with the idea that we'll change to make it better in the future, because if it's saved to source control, you can just get it back.
It's no longer a dangerous thing.
It's just, how do I wanna spend my time?
All right, so we're gonna get you going on how to use source control for data science in this chapter.
|
|
show
|
4:44 |
What we're going to see from here on out is just enough of a roadmap so that you can find your way to get started working with source control, collaborating with others and being productive.
There's more to learn and there are some more advanced techniques, especially not so much commands, but workflows like, oh, I have an open source project and I want to contribute to it.
So I'm going to fork it and create a pull request and then make changes to that pull request and then resubmit them like that kind of stuff.
We're not going to cover.
Not here.
But we're going to give you enough, just enough, hopefully, to be really productive without spending too much time on Git and source control.
So what we're going to do is we're going to talk about the six important commands.
And I put six in quotes because there are really more like six concepts here;
sometimes there are multiple commands to accomplish one of them.
I will show you the commands that you actually see on the terminal and the ideas behind them.
And then we're going to talk about how to use primarily UI tooling to make them happen.
First one is clone.
And this is a little bit like creating a copy of a GitHub or Git repository.
I want to work on some software.
I'm going to do it on my machine as if the files are just sitting there because they will be.
So what you're going to do is clone from the remote repository locally.
Git is a disconnected system.
You just make a copy and then you start working on it and then you decide later if you want to save those changes back or share them in any way.
So this clone is effectively give me a copy for my machine that's self-contained and I can work on.
Number two says: what's the status of the project?
Are there files that are new?
Are there files that are modified or deleted?
Which ones are those, and so on.
So it tells you basically what are the changes since the last time I saved them.
Speaking of saving them, you have a two-step system with Git.
You can say, here are some files that belong to this project, right?
Maybe you have a directory full of a bunch of files, but you say, these other ones, I don't actually want them in my project.
They're just here locally.
They shouldn't be saved.
But these others, I want them in the project.
You would add them.
That doesn't save them anywhere or start tracking the history of them.
It just tells Git, these are things you should pay attention to.
And then when you're ready to make saves, you call commit.
And commit might as well be save, right?
So you commit it, but locally to your local version.
If you want to take your local changes and share them broadly, either back to GitHub just to save them for yourself, or maybe push them somewhere to the Git server, like GitHub or somewhere else, so that team members or other people on an open source project can see them, you need to push those changes.
And the people, or you, who want to get those changes back have to pull them.
So push and pull are this synchronization thing.
Save my changes up, get the changes back to me.
That might have happened in the project.
If you want to know what has happened in the project over time, like a history, how has this file been changed or what files have been changed by who and when, you would do a git log.
And finally, if you want to work with something called branches, which are parallel copies that can change independently of the others, like I might have a feature branch where we're saying we're gonna add a new API endpoint to an API and I'm gonna make one of those branches And so I can just, me and whoever's working on it, can just focus on doing that feature without messing up or influencing the rest of the team.
Because if it's not done and we want to ship the API, maybe it'll cause a problem.
So we can create a branch that lets us work separately, and then we can merge those back.
So the way to work with branches is we check out a branch and maybe merge that branch as well.
So these are the six important concepts.
There's a bunch of other things going on, plus the workflows that I talked about, and I would not really stress too much about them.
I think you can go all day long with these here.
And just to emphasize, all the words you see on the screen, like clone or commit or log, those are actually the terminal CLI commands you give to Git.
So you would say git add with a new file, or you would say git status to see the status, and so on.
I'll show you some nice UI ways so that you don't even have to be in the terminal.
But these concepts, these commands, are taken directly from the commands you would give to Git, like you see below.
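To make those six concepts concrete, here's a sketch you could run in a throwaway directory. It's all local (no clone, push, or pull, since those need a remote server like GitHub), and the file names and commit messages are just made up for illustration:

```shell
# Create a throwaway repository to try the commands in.
repo=$(mktemp -d)
cd "$repo"
git init -q .                        # normally you'd start with: git clone <url>
git config user.email "you@example.com"
git config user.name "You"

echo "print('hello')" > analysis.py
git status --short                   # shows analysis.py as untracked (??)
git add analysis.py                  # tell Git to pay attention to this file
git commit -q -m "first commit"      # save it, locally only
git log --oneline                    # the history: one commit so far

git checkout -q -b feature           # create a branch and switch to it
echo "# notes" > notes.md
git add notes.md
git commit -q -m "add notes"
git checkout -q -                    # switch back to the original branch
git merge -q feature                 # fold the branch's changes back in
git log --oneline                    # now two commits
```

On a real project you'd finish with git push to share those commits, and git pull to receive everyone else's.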
|
|
show
|
1:26 |
Let's dive into clone visually first.
So we have some kind of hosting server.
Now, it's possible to just have Git the software installed.
It's free, open source, easy to do, and create a local Git repository that doesn't have a server.
But if you're going to save files somewhere, if you're going to collaborate with anyone or even synchronize them across your own computer, you're going to need to put them somewhere.
The most popular place by far for Python and data science has to be GitHub.
So if you don't have a GitHub account now, you should definitely get one.
Go make one, it's free.
If you have a repository over there, then you can start working on it on your local machine by cloning it.
So we go over there and we want to get it on our computer so that we can edit the files, make changes, or just look at it.
We can clone or make a copy of it down to our machine here.
And at the time when we call clone, they should be identical.
But of course, other people might be making changes or you could be making changes on other computers to this.
And while you are doing that, this is just happening locally.
There's no persistent connection back to the server.
You've made a copy and now you're editing away on it.
So keep in mind, everything you do and get is disconnected in this sense.
|
|
show
|
3:00 |
The next one you're going to want to know is git status.
So when you've made some changes on your computer, you might want to know, well, what files have changed?
Has anything changed?
Are there unsaved changes?
So we can run git status.
Here you can see we have three files in this project.
File one is in the git repository, but it's unchanged.
Whenever we called commit to save it last time, nothing has happened to it.
It's still just hanging out there.
But you can see that there are two different kinds of changes here.
You can see it says changes to be committed.
That's new file, file two.
And we have an untracked file, file three.
So this is the multi-stage thing that I was talking about.
Git allows you to have files in your project that are ignored or just not yet added.
So maybe I have a file that's got my password in it.
I don't want to put that up in source control in public.
So I don't want to commit or add that, but I do maybe want to add file two.
So that's why that distinction is there.
And you can see that neither of these are saved, but Git's intention, if you said commit my changes, would be to save the changes to file two and not yet to file three.
If we want file three also included, then we would add it.
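For things like that password file, the usual approach is a .gitignore file in the root of the repository; Git then stops listing those files as untracked at all. A typical data-science-flavored starting point (these entries are common examples, not from the video) might be:

```gitignore
# Secrets and local configuration -- never commit these
.env
secrets.toml

# Virtual environments
.venv/

# Python build artifacts
__pycache__/
dist/

# Jupyter checkpoints and local data you don't want tracked
.ipynb_checkpoints/
data/raw/
```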
Now, you may want to look at it like this, but something I'd like to emphasize, especially for beginners, is most of the time you don't do source control on the command prompt or on the terminal.
You do it through your UI tools.
There are dedicated Git tools like SourceTree that I highly recommend.
It's a free tool.
It runs on Mac, Windows, and Linux.
It comes from Bitbucket, people who have been at this for a long, long time.
and it gives you a really nice visual view of the same thing.
So here you can see that we have file two, it's got the little plus, that means it's staged but not saved.
And then we've got file three, which is unstaged.
It's like, do you want to keep track of this or do you not want to keep track of this?
We can also see this over in PyCharm.
PyCharm has, you've probably already seen it, like different colors for the files and that means different things.
So green means it's a new file, Red means it's an untracked file.
White means it's unchanged.
And the other one you saw a lot was blue, which means it's modified or edited.
And those modifications are not yet saved to Git.
Remember, all of this has nothing to do with anything on the server.
All of this is just what is the status of your cloned or local Git repository that's hanging out on your computer.
Things could be happening up on GitHub or not.
But if you go to GitHub and look, You'll see none of this information.
It's only locally until you synchronize those changes.
|
|
show
|
1:37 |
Now, if all of your changes are saved locally, how do you get them back to the server?
Well, you push those changes back to the server.
We have these four different types of changes that we might be considering or states of our files in Git.
But once they're committed, we can push those changes up to the server.
So if we've done a bunch of work and we're ready to make it permanent and put it into the server, then we're going to share those via push.
So git push, or there's usually some kind of command equivalent of that in your UI.
Now, if we want to get the changes, this is something you do very often as well.
Maybe multiple times before you push your changes, we were going to do a git pull.
Now, why would you do pull many times?
Well, maybe I'm working on a project for a day and there's other people making changes.
I would maybe like to say, all right, if I were to push those up, push these changes up, would they continue to work with what everyone else has done?
So I can do a pull and get the latest version, keep working, make some more commits, do another pull, keep working, make some more commits.
And then when I'm all ready, I'll push those changes back.
So that's kind of my workflow when I'm working on a team.
When I'm doing by myself, it's more of a one-to-one thing.
But keep in mind that there's other people working on their local copies of the Git repository they've cloned, and they're doing push-pull to synchronize these changes back and forth.
Sometimes you'll get new changes, sometimes you won't, but this is not just happening for you.
It's happening for everyone in this disconnected way.
|
|
show
|
2:02 |
If you want to see what's happened in the project over time, at least on a particular branch, you can run git log.
Now, this is the git log command running in the CPython code base, which is on GitHub.
So I cloned it and ran git log on the master branch.
And you can see that I've made some kind of change because this was my fork.
So I merged basically the work from everyone else into my fork.
And then we had Zachary Ware who updated Zlib to solve some security problem on Windows.
And then Julien Palard, back in April, fixed a superfluous backtick in front of a role, and so on.
So you can just sort of see the changes going on here.
And again, like everything, you can run this on the terminal, something I very, very, very rarely do.
Instead, it's often just built right into your UI tools.
So for example, in VS Code here, you can, this is a different project, so it'll have different messages, but you can see the timeline at the bottom.
It says, we bumped the version to prepare for release, imports: don't repeat, docs were fixed, and so on.
So we can see that chain right there.
And of course, if you interact with that on the left, that will pull up those changes and show you all the things going on there.
Similarly, PyCharm has a really nice one that emphasizes the branching even more.
So now we can see the different messages.
And on the right, you can see there's a couple of files that are blue that were modified.
If you double click them, it'll show you the changes.
This is just so much more powerful than just a simple log.
It really lets you like dive into the details and explore, right?
So that's why I prefer this over just using Git log, but it's the same idea.
And it does emphasize that even though the concept of git log is important to you, you don't necessarily have to type git log as a command.
The tools that we're using have it built in.
|
|
show
|
1:18 |
Let's talk about the three that have to do with branching.
Branch, checkout, and merge.
git branch will create a new branch.
git checkout will check out an existing one.
And git merge will combine the changes from one into another.
You can see those really nicely here in this particular view from source tree.
These are our local branches.
We've got fixed code styles, master, missing response, and prod.
And you can also see them visually and how they relate over time, which is super cool. And if you click on them and interact with them, it'll tell you stuff about them, right? You can also see that at the top we have master, origin/master, origin/HEAD, but if you scroll down, you would also see some of those tags for the other branches in this git log history. And we have the branches that are on the origin; often this is GitHub or wherever you started from. There might be branches on the origin that we haven't synced and pulled down.
Like maybe there's 100 different branches on the origin and we were only working with a couple.
So we would just have those here.
And similarly, we might create local branches we have not yet pushed up to GitHub or wherever we're going.
So we've got these two different views, the local and the server-side view of branches.
|
|
show
|
1:56 |
Let's talk about what happens when you clone a project.
So here's something creatively named Proj that we're apparently working on.
And it's got these three files.
We already saw that in our UI example before.
And this is a Git repository.
So you can see when we check it out, it has file one, two, and three.
Now, if it's going to have all of this history and it's going to be disconnected from the server, how does it know?
Like, how does it know what the history of the file is or how that's been changed over time.
Well, if we hit Shift-Command-Dot on macOS, or on Windows you go to the Explorer options and say show hidden items, you will see a hidden .git folder.
And in there, we've got a whole bunch of details, not going to go into them, but this folder contains all the copies and changes made to all the files along with code that can run and before and after commits under these hooks and that sort of thing.
So this Git folder is where all of the information and history and branches and that kind of stuff is stored.
So be careful with it.
If you want to keep your Git history and be able to work with it, don't throw this away.
Also, if you're going to make a copy for whatever reason and move this thing around and you want to take the Git aspect with it, make sure you copy this hidden Git folder, not just the three file one, two, and three.
If you want to explore this idea in depth, I did do an interview with Rob Richardson about understanding the Git folder, all the pieces that are in there, how it works.
So if you really want to get geeky about it and dive into it, check out talkpython.fm/311.
Listen to us talk about all the details of Git.
It's pretty cool, but not worth diving into right away.
But if you get intrigued and want to follow up on it, here's a good way to do it.
|
|
show
|
9:37 |
All right, let's bring all this together and make it practical using our repository we have here.
So we can just take this one, the course repository, and use it right here locally on my computer to play with.
Now, I already have this downloaded, as you know, you've seen me working with it.
But let's assume that we don't.
Let's just start from scratch.
So however you like to do it, I prefer to start on the terminal or command prompt.
So over here, I'm going to set a few things up, and then I'll open it either in PyCharm, in a Cursor or VS Code type of thing, or in SourceTree, one of the tools I want to use to work with it.
Okay.
So I'm going to git clone, and we'll paste that in there, and it's going to create a folder with the same name as the repository.
So we can CD in there.
And it looks just like we would expect, right?
And to get started, I want to make sure we have our virtual environment.
So I'm going to use uv to create a virtual environment.
I have this nice little shortcut that will run it with the right version.
It'll download it and activate it.
You could just type those out or I'll just say VE and V, which will run it.
And you can see that it's already activated right there.
So we should be good to go.
Now let's go and open this up in PyCharm.
You can see now that the virtual environment preceded PyCharm seeing this project.
So it said, oh, look, we found this virtual environment for you.
Fantastic.
Kind of like I was describing when we talked about virtual environments.
Okay, so over here, we've got a couple of things going on.
Let's create a new folder called version control.
Now, it doesn't matter where we make these changes, but I'm going to kind of put them into their own location so we don't mess up things.
Now, if we're in this section here, you see there's no files.
But if I use the touch command, I can create an empty file.
Notice down here, this one is red.
Okay, so that means it's not tracked by Git.
There's a nice little commit thing that says there's unversioned files.
Great.
I can also, you've already seen this, I can create a file and say, a sample, let's just call it file two.
And you'll see PyCharm asks: do you want to add it?
Do you want Git to keep track of this file?
So that automatically skips one step here.
And I could just print out, hello world.
Switching back over, you can just see we got our hello, how about this?
Hello, Git.
So we can go either to our terminal; here we could say git status.
And it'll tell us we have this new file, which moved one step down the process because PyCharm added it.
And it has changes applied to that file, which are not yet staged or tracked, which is an interesting thing.
And we also have this one file that's fully not tracked.
And PyCharm makes this a little bit easier, right?
We've just got over here, our changes and our one file.
So let's suppose I want to save this file here.
We'll just say this is our first Git commit together.
Before I hit that, let's go over here.
You can see first steps at creating a package.
That was chapter five stuff five days ago.
It's been a weekend.
So if I commit those changes, you can see, look, they're gone, committed. Refresh again; still nothing, because that was only local.
It's not until I push. So I could have pushed this button, or I could hit the command to say push, or I could go the painful way up here, but I'll just hit this hotkey and say push, and it'll say, okay, here are the changes that will push up to the server.
Wait for it.
There we go.
Now you can see, Oh, this is our first commit together.
And we look at it and see our print hello Git. Excellent.
I could even subscribe to notifications about it.
Really, really cool.
And now if we go back here and I make some changes, maybe Git should be lowercase.
Now you can see it's blue.
If we go back, double click it, you can see it actually shows you what's changed that bit right there.
Okay, so this is PyCharm.
We can actually see the history as well down here.
And you can see all the, let me get my head out of the way.
You can see all the different changes that have been happening over time, right?
And I could double click that, and it'll show you what the status was there, or the changes. Let's see.
Find some that we've modified there.
How did we change the pyproject.toml?
Oh, I had to add that little bit there.
Let's put PyCharm away for a second though.
And let's look at it in VS Code, because I know some of you will be working over there.
I'm gonna use Cursor, but Cursor is just an enhanced VS Code type of thing.
Let's put that little bit away.
So over on the left, you can see, oh, here's our code.
That's got a little indicator.
Something's going down over here.
And these are the changes.
Again, you can see updated, modified.
You can see we've got our changes over here again.
If we go to the source control section, we can see something real similar.
Here are the changes, and this is an update to an existing file right here.
And this is a, the update is a new file, right?
This is a modified file.
So you can come over here and actually see the diff again as you interact with it.
See this working tree, but it's just like before with PyCharm, you can sort of hop around the history of this thing, what we did with generators and so on.
And then let's suppose we would make these changes here.
We're gonna push them in.
I could stage them to commit them separately, or I could just say, here are changes from VS Code-like editors.
And it says there are no staged changes.
We're gonna commit them directly, sort of collapsing the two-step thing, and say, you know what, always do that.
That didn't push them, didn't sync.
In VS Code, sync means do a pull and then a push.
So off it goes.
We've got our changes from VS Code Editor, which are those two files, of course.
Okay, the last one worth looking at here will be to just see what this looks like in SourceTree and then maybe make it do something.
So I'm going to drop this into SourceTree, which I've already installed.
I don't have any branches or anything like that.
So this is just the, there's not that much of an interesting branch story, but you can see we've got some stuff in the origin and we've got the same one here.
But if I created one, obviously it would show up.
Again, you can explore it, but let's make some just quick changes.
Let's make one more change.
I'll add another file.
This will be some-other-file.md.
And we'll just say # Other file.
And let's add another markdown here.
And let's go over to SourceTree, and we can actually go over here and stage the markdown.
Now it's added, if you wanna see how that works.
And now it says, hey, you have uncommitted changes.
And this is really great.
We can open this up and you can see this one has been staged, but not yet committed.
This one, we don't know what's going on with it yet, but we can add it.
And I'm not sure why it shows as binary, it's Markdown, but that's okay.
And we go ahead and push these changes from here, or just commit them locally.
We'll call it, something from SourceTree.
And then we can push those changes up to the main branch.
Excellent.
So here's what I recommend for you.
There are some things that are better to do on the terminal or command prompt.
For example, cloning projects to get the structure set up so you can create virtual environments and things like that; the editors react better if, the first time they see the project, things like the virtual environments are already set up.
Maybe start there to just get things checked out.
But then after that, whenever you're in your editor, just edit away in the tools, that really deeply integrated into both VS Code and PyCharm automatically.
So I would just, for the most part, just leverage this stuff here.
You don't really need to go to the terminal, all right?
Same thing for VS Code.
But there are certain times you just need to be a little bit more specific.
I want to see the branches exactly.
I want to explore more things.
SourceTree is a really nice tool to have in your toolbox.
It does quite a bit more detailed Git stuff if you need it, without making you drop into the terminal to make that happen.
So I like to use a combination of all three.
Terminal to get started, my editors almost all the time, and periodically dropping into SourceTree for some more advanced Git things.
|
|
|
8:04 |
|
show
|
0:57 |
In this short, quick chapter, I just want to talk about debugging and focus in on some of the debugging tools that we have, the ones in notebooks and Jupyter notebooks, as well as the ones in the other tools we've already been using and beyond.
So debugging is a super important thing that we can do.
If you find yourself writing print statements, you know, print, the value is this, the value is that, now the value is this, now we're in this location, you're probably doing it wrong.
You could probably really benefit from just choosing a really nice debugger.
Notebooks in some ways suffer less from this because you can just run a cell and then type a variable name and it'll actually show you.
But there are still plenty of times that a debugger is going to help you a bunch.
Okay, so in this chapter, we're going to play around with debugging some of the code that we have.
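To make that print-statement point concrete, here's a tiny sketch. Instead of sprinkling prints, you can drop a breakpoint() call (built into Python since 3.7) wherever you want to pause and inspect variables; PyCharm, VS Code, and pdb all honor it. The running_total function is just a made-up example:

```python
import os

# Setting PYTHONBREAKPOINT=0 makes breakpoint() a no-op so this script
# can also run non-interactively; delete this line when actually debugging.
os.environ["PYTHONBREAKPOINT"] = "0"


def running_total(values):
    total = 0
    for v in values:
        # Instead of print("total is now", total), pause here and
        # inspect `total` and `v` live in the debugger:
        breakpoint()
        total += v
    return total


print(running_total([1, 2, 3]))  # -> 6
```

When a debugger is attached, execution pauses at the breakpoint() line each trip through the loop, and you can watch the values change without a single print statement.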
|
|
show
|
0:33 |
Now, before we dive in and actually start playing with the editors and debuggers and so on: JupyterLab comes with a debugger (I don't believe the plain Jupyter Notebook version has one).
You can see a little bug icon in the pane on the right here.
And you can do things like step into the code, step out of the code, pause it, inspect the variables.
I find the experience to not be super great.
That said, it exists, so you can use it.
|
|
show
|
2:32 |
Even though JupyterLab does have a debugger, I implore you to use a real editor for this purpose.
In my mind, PyCharm stands above the others, above VS Code and the others, quite a bit when it comes to its debugger.
The debugger is really powerful with conditional breakpoints and all sorts of things you can do.
One of the best features is how it overlays onto the screen the current status of the variables.
So here we have a Jupyter Notebook cell we're debugging.
We've run it for a while, and the current value is 2,584.
And you can see, without even going into the little sidebar window pieces, it says base_fib is 2,584.
The index IDX is 2.
Right there in the UI, in the editor, you can see that first_10 is a list with 0 and 34.
They have different colors.
Why is it a different color?
Because the 34 is a changed value, but the zero was already there from the last time we inspected it.
There's just so much going on that makes it incredibly, incredibly powerful.
So I implore you to consider a real editor, not just notebooks and print statements.
You can also do something similar.
It's not quite as good, but it's decent in VS Code.
I don't like the VS Code debugger as much, but it's still pretty good as well.
And it does overlay some of the values, like I said, I really, really like.
Not as cleanly, but that's okay.
You can even use Cursor, which I think is really pretty awesome for when you get into a problem and you're like, gosh, I just can't quite figure out what is going on here.
Since Cursor basically is VS Code, you can debug the same way and you see some of the output.
Though I don't believe, at the time of recording, that Cursor overlays the values into the editor.
But what you get as a trade-off is you get a really, really intelligent and context-aware agentic AI.
And you can say, hey, why is this value turning out to be this way?
Or it seems like there's a bug here.
And describe it and say, could you help me solve it?
And it's surprisingly good at those things.
So whatever type of editor you choose, one of the last three, PyCharm, VS Code, or one of the AI editors, they're all pretty awesome.
And they're all way better than trying to debug in a notebook.
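To make the upcoming demos concrete, here's a sketch of a Fibonacci cell like the one being debugged in this chapter (the exact notebook code isn't shown, so the function and variable names here are assumptions):

```python
# A Fibonacci cell to practice debugging on. Set a breakpoint inside the
# while loop and watch idx and first_values change as you step.

def fib(n):
    """Return the n-th Fibonacci number (fib(0) == 0)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

first_values = []
idx = 0
while idx <= 10:              # a good line for a breakpoint
    first_values.append(fib(idx))
    idx += 1

print(first_values)           # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

The 2,584 you'll see on screen in the PyCharm demo is just a larger Fibonacci number (fib(18)), reached by letting the loop run further.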
|
|
show
|
1:15 |
All right, let's start with the JupyterLab edition.
Fire that up right here in Chapter 7 debugging.
We're going to open up the debug window here.
See there's nothing here yet.
Now, in order for us to use the debugging features, we have to enable it by clicking that little bug there.
So now debugging is enabled.
And you can see we can start to set breakpoints here.
So let's just set one right there.
And I will run the cell and let's start stepping over.
Notice sometimes it doesn't show the variables and honestly, I don't know why that's happening, but whatever, we'll just keep stepping through.
And we expand this one, come down here, run it the rest of the way.
We come in here and run this one.
Now there we go, we got the values to show up there.
We should be able to see number text, which is a list.
There's stuff in there, presumably.
Like I said, there is a debugger.
It is not the most amazing experience.
But if this is all you got, then turn on the debugger in JupyterLab, and it's something.
It's not bad.
|
|
show
|
2:47 |
Now let's contrast that with something like PyCharm or VS Code.
So we come down here and we even collapse that if we want, make sure we run it just to see that everything's good.
That actually, by the way, started up a Jupyter server when I pressed that button there.
But let's set a breakpoint exactly the same spot as before.
And notice each cell has a little bug, so I'll click the bug.
And we got just too much of that window.
Here we go.
Okay, great.
So we have our control elements right there.
I can step over, I can step into just my code or into all code.
Let's just step over here a little bit.
Notice first 10 is now 34.
Step over here and like, suppose that I wanna change the index for a second.
We've got one, two, three values in here.
Let's just go in and we can edit, set the value.
So now the index is 30.
Notice it's changed up here as well as down there.
And it's gonna do this test.
If it's greater than 10, we're out and we'll step out.
And off it goes.
Go ahead and run this one, see the output.
And look, we only have these four numbers, because we were controlling the program flow there by looking at the values and changing them.
See our Jupyter variables over here as well.
So pretty neat.
And I think it's fairly night and day compared to the one over in JupyterLab.
And again, VS Code, cursor equivalents, they're pretty similar.
I'll show you one more thing, just one cool little trick here.
Suppose we're trying to debug some kind of problem here, and we know that the error is really only after 15 million.
So we could go put a breakpoint right here and right-click on that and say the condition is fib is greater than, notice we get autocomplete in everything from here, right?
15 million.
So then we could go and debug the cell and we won't be dealing with like the first four or five values, let it run a little bit.
And look, it stops when index is four or on the fifth element, because whatever that number is, it's like 1.1 billion, bigger than all the previous ones, right?
The ones before it were less than 15 million.
So we can have these conditional breakpoints, they can even do log messages.
There's all kinds of super advanced stuff baked in here that is not apparent right on the surface, but I will tell you it's quite something.
Pretty neat stuff with what we can do here with PyCharm.
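If you're ever stuck without an editor that supports conditional breakpoints, you can get a similar effect with Python's built-in breakpoint() guarded by the same condition (a sketch; the 15 million threshold mirrors the PyCharm example above, and the variable names are made up):

```python
# Emulating a conditional breakpoint in plain Python: only stop once the
# value of interest crosses the threshold.
fib_prev, fib_cur = 0, 1
idx = 0
while fib_cur < 2_000_000_000:
    if fib_cur > 15_000_000:
        # breakpoint()   # uncomment to drop into pdb right here
        break
    fib_prev, fib_cur = fib_cur, fib_prev + fib_cur
    idx += 1

print(idx, fib_cur)   # first Fibonacci value past 15 million
```

This is clunkier than a real conditional breakpoint, but it works anywhere, including a plain terminal session.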
|
|
|
49:01 |
|
show
|
1:42 |
Let's talk about reproducibility and setting up our data science projects so that other collaborators can come along and be as successful as possible.
Ideally, we can set things up in a way so that no matter who works on the project, and no matter what operating system or version they're working with, they're going to get the same experience you had and intended when you built out the project.
This is something super important in data science.
There's a reproducibility crisis in science in general.
A lot of this does have to do with the data science tooling.
We have more and more of science in general being done through things like Jupyter Notebooks and Python and libraries through PyPI, that sort of thing.
That's great.
But when done wrong, it is a problem.
I want to be clear, reproducibility in science in particular is not just limited to computing tools, but surely if we have different versions of libraries, it makes it harder to rely on it.
Or worse, if people who try to reproduce results can't even run them again.
So this is not just about science.
It's about creating reproducibility for whatever you're doing data science for, for yourself over time.
So whether you come back in five years, whether it's for collaborators, or whether this is open source, there are lots of reasons we care about making our code run almost exactly the same everywhere and making it easy for people to fall into the pit of success.
They use the same tools, the same versions, everything so that they get the same results.
|
|
show
|
5:11 |
Before we get into the techniques and the tools that we're going to use, let's just think through a little bit of some of the key sources of variability.
And this could be variability in the sense of, well, I get a slightly different answer, or it could be variability in like, I can't run your project.
What's going on?
It doesn't compile or it doesn't open or whatever.
One of them is changing data.
Changing data obviously is going to be a problem.
Maybe you're getting data from an API, maybe using pandas.read_html to pull a table out of an HTML page, some website or some web portal, all of those things.
Obviously, if that data goes away, that's a huge problem.
If it changes, also a problem.
There might be breaking changes in critical libraries that you're using.
Maybe something like NumPy or Polars or Astropy or whatever it is decided, we have to have a new version, and in this new version, we're making things better.
That's a good thing generally, but it might mean that your code no longer executes as written.
So that's not a good thing.
We'll talk about how to fix those as well.
Now, NumPy is probably going to be pretty stable, but look, if you're thinking 10 years, 20 years in the future, all bets are off.
Changes in Python version.
Right now, at the time of recording, the latest version of Python is 3.13.5.
What about in the future when Python 20 is out?
There very likely will be breaking changes.
I can tell you I've had code that has stopped running because Python itself changed.
Sure, they deprecated the things.
They put out warnings that something's going to be wrong.
But a library that I used depended on another library that depended on a feature inside of Python that was removed.
And I didn't realize I was using it.
But guess what?
My program stopped running.
And I was super confused.
It was really a super big hassle.
So this can be a big problem.
Operating systems change.
They change their requirements.
They change what things they will execute.
So if somebody's trying something on Windows 10 or 11 or 15 versus Ubuntu 24 versus Ubuntu 18, those can be sources of significant variability.
Apps might work on one and not the other.
There's plenty of Python libraries that work on Linux and macOS that do not work on Windows.
For example, uvloop.
So how do we solve these problems?
Well, that's what this chapter is about.
We're going to talk about how we can download our data and keep it in GitHub or Dropbox or some safe place.
All right.
Maybe you work at a university and there's some sort of storage area for your important data or your research data.
I don't know.
But don't depend on the data from external resources if you hope to be able to reproduce this information.
So maybe if you're getting data from an API, instead of just getting it and pumping it straight into Pandas to work with, maybe you save it to a file and then load that file in Pandas.
Maybe save it to a Parquet file or something efficient and fast.
And your notebook can look and say, does this file exist?
No.
Then let's download it.
But if it does exist, let's just use the local one and keep it stable.
All right.
That can be one key step.
Or get the latest data and always update the local file.
But maybe check that into GitHub if it's not too huge.
This is probably the simplest one.
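As a sketch of that download-once-then-reuse pattern (using stdlib json so the example is self-contained and testable; in practice you'd swap in pandas.read_parquet / DataFrame.to_parquet, and the file name here is made up):

```python
import json
from pathlib import Path

DATA_FILE = Path("data/observations.json")   # hypothetical local cache

def load_data(fetch):
    """Return the records, calling fetch() only if no local copy exists."""
    if DATA_FILE.exists():
        # Stable, reproducible local copy: same bytes every run
        return json.loads(DATA_FILE.read_text())
    records = fetch()                          # e.g. an API call, first run only
    DATA_FILE.parent.mkdir(parents=True, exist_ok=True)
    DATA_FILE.write_text(json.dumps(records)) # cache for every later run
    return records
```

The first run hits the external source; every run after that reads the local file, so your results don't drift when the upstream data changes or disappears.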
Another one that's pretty straightforward is we've already talked about virtual environments and versions of libraries.
My theoretical example where NumPy makes a breaking change and your code doesn't run anymore.
Well, one thing that's awesome is PyPI keeps all the old versions around for us.
So if we do what's called pinning, we're explicit when we install: we don't just say install NumPy.
We say install NumPy 1.2.7, whatever version.
And then uv or pip or conda or whatever you're using will go and find that version and download it.
And it will be the same.
It's immutable once it's published.
So that's a really nice way.
We're going to see uv has some awesome tools for that.
We'll also use the same version of Python always.
And there's different ways in which we can do that.
uv also comes as a really awesome option here.
Always run on the same machine? This is super hard. Maybe sometimes you're on a Mac, maybe sometimes you're on Windows or Linux. You've got collaborators who only use Linux; you have other collaborators who will not use Linux. What do you do? Well, we're going to see that there are some really cool tools. How do we do this? Well, we're going to use uv to solve two of the four problems, and we're going to use Docker to solve kind of a bunch of these, and maybe another two as well.
So these are both really awesome tools.
Docker can be intimidating.
We're going to do just enough Docker and you'll see that it's actually pretty straightforward and easy.
So if you don't know Docker, don't be afraid.
We're just going to dip our toes into this world, but it's going to make a huge difference in terms of reproducibility.
|
|
show
|
10:04 |
Let's first focus on making sure that we're always using the exact same versions of our libraries.
So we are over here in our project, created a folder, 08, reproducibility.
And in this section in math research in source, this is where we're working on our code.
We'll make sense of all these pieces and how they fit together in a moment.
But let's just suppose I'm working on this notebook and I have that and I consider this to be the root of my source project.
This is what I'm working on.
Everything else is just kind of there to put things together as we'll see.
So how do I make things reliable?
So for example, this is using Jupyter.
So I could say uv pip install jupyterlab and it'll say, great, it's already there.
What version?
I don't know.
Whatever we got when it was installed.
Now I could specify, let's do uv pip list so I can actually use a legit version here.
So you can see that version right there, but notice there's pigments, server, LSP, et cetera.
So I could be real careful here and say the version is exactly this one.
And I could even go so far as say upgrade if there's some kind of upgrades, all right?
So that's handy.
And uv will actually go and upgrade the dependencies it looks like, which is pretty awesome.
However, it doesn't specify what we built with for the dependencies.
Maybe when we used it, we used that one.
And now the new one is this, if we run that command.
Oh boy, even though JupyterLab didn't change, right?
So we want some way to say this one, and not just this one, but every one of these in this list; we want them to be exactly what we're working with now.
How do we do that?
Well, there's a couple of things we can do.
You may have seen this convention before.
So if you go over here, you have a blank file called requirements.txt.
And then in here, you'll see uv pip install or just pip install -r requirements.
Am I in the right place?
It looks like no.
I could say uv pip install -r requirements.txt.
And if we have JupyterLab written in there, notice that autocomplete, how epic is that?
So if I run this, it'll say, great, you've already got them.
And let's make sure, let's maybe say, we want to make sure that we have NumPy and that we have pandas and Polars.
Now let's just do NumPy for a minute, okay?
So now if I run this again, you'll see, oh, it's like, oh, we don't have NumPy.
We have to install that.
That's great.
What version of NumPy?
That one maybe.
That autocomplete is incredible.
So maybe it's that one.
What about its dependencies, right?
This one actually just installed that.
But you saw JupyterLab has an insane amount of them.
So we want to be able to specify like, these are the things we're working with.
But we also want to constrain their versions.
And so that brings us to this tool called pip compile.
You saw over here, when we talked about uv, that it has tools to replace pip-tools; one of its functions is something called pip compile.
So what we're going to do instead of directly working in the requirements file is we're going to define something that we enter our top level things.
Like I imagine I'm working with JupyterLab and NumPy, but in reality, I'm working with everything in that pip list.
So I'm going to make a new file called requirements-dot-something, and that suffix you just make up.
Sometimes it's called requirements.in; I'm gonna call it requirements.piptools.
So it reminds me, this is the source input to a pip-tools-style file.
And I'm gonna put those two things in here.
Now let's check these into source so we can see the changes real quick.
I don't need to push them to GitHub.
I just wanna save it so we can use our diff tools.
So what that pip-tools command will do is it'll look at these and it will generate this with every single version locked down exactly as we want.
So check this out.
So I'm over here in that same folder; if I run tree again, you can see we've got our requirements.piptools and our requirements.txt.
Let me look at it, it looks just like that, okay?
So what we're gonna do is gonna run kind of a long command.
uv pip compile, we're going to pass in the input file, tell it to see if there's any updates.
So we could update them if they're already there.
This is a nice feature.
When you run this, come in and say, please update everything and generate a new output.
And the output is going to be this requirements file that everyone's used to working with.
I want to be clear, there are three or four different ways to do this with different lock files and different projects and so on.
But this is a tried and true traditional way.
A lot of people know what a requirements.txt file is.
So let's run this and see what happens.
Woo.
So this is actually the new requirements.txt file.
Look at it.
It's insane.
Oh my goodness.
It's even more than I expected.
So it says this file was auto-generated by UV by running the command I just showed you.
Okay.
So if we go over here and look at it now, look at that.
We can do the diff, and actually, I don't even know if it can identify the changes, right?
It's like, we completely regenerated this.
And this says every single version is exactly pinned to what it is now.
So in the future, it's uv pip install -r requirements.txt.
You've got to say -r because I'm giving it the file.
So there it goes.
It goes, look, we've got exactly the same version.
And if for some reason this changes, maybe to a lower version, and I rerun it, you can see it'll exactly synchronize to whatever is written in this requirements file. That's not what we really want here, so I'll roll it back.
So that's super cool here that we got a place where we just declare what we want, but now this completely locks it down.
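Pulling the walkthrough together, the whole workflow is just an input file plus two commands (file names match this demo; the -o output flag is one way to name the generated file, and the exact flags in the recording may differ slightly):

```shell
# requirements.piptools -- top-level dependencies only, one per line:
#   jupyterlab
#   numpy

# Compile a fully pinned requirements.txt (--upgrade refreshes existing pins)
uv pip compile requirements.piptools --upgrade -o requirements.txt

# Synchronize the virtual environment to exactly those pinned versions
uv pip install -r requirements.txt
```

You edit only the .piptools file by hand; the requirements.txt is always regenerated, never edited.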
If you look real quickly through this file, it's really handy.
It says, why is this even here?
Well, Jupyter Server uses it, and apparently HTTPX, don't know why that's there yet.
So we can come over here and it'll say, well, why is attrs here?
Because of jsonschema and referencing.
And you find jsonschema: it's here because of Jupyter Server, and so on and so on.
So it's not just what is here and what is its versions.
It starts with what we give it.
And then it figures out what is the current version.
What are all the other things that uses and let's lock that down.
And anytime you want to update it, like let's suppose, let me save these.
Let's go over here and say we're using pandas as well.
Maybe this is gonna be part of our project.
Technically not using pandas, but just play along for a second, okay?
So if I rerun this, uv pip install, actually that's not the one I want, uv pip compile, and use this upgrade flag; even without it, it'll still generate it.
When we go back here and look at the changes, it says, look, we have NumPy for multiple reasons now.
It was previously, we just explicitly set it.
Now it's also pandas and somewhere down here, it's added pandas into the project.
Okay.
Now you might notice it a little squiggly there because if you look through this, you won't see pandas like pandas is not here.
So we need to say uv pip install, run it again, and it will synchronize again.
All right, so this way is super, super flexible, simple and nice for pinning down exactly all the versions of every library that you could possibly use.
Not just ones you explicitly state like often is done, but the transitive closure of all the dependencies that they have.
And you also have the mechanism to update and upgrade them.
I don't have a really super way to show you that, I suppose; I'd need a new library to be released right while I'm talking, which is unlikely.
Anyway, these are super cool because they'll pin down our project.
All right, final requirements for our math reproducibility demo.
Now I'll do just one more thing while we're talking.
I'm gonna replicate this in the top level of this project here.
So I'm gonna say new file, which is gonna be requirements.piptools.
And in here for what we're actually doing, so you can run this code anytime.
I wanna check out this project and know what I want.
Remember we did a demo with HCPX and we've used JupyterLab.
I believe those are the only two things here.
So open in terminal, or I could just go to the terminal, make sure the virtual environment is active and I'd say uv pip compile like this.
Then I'd say uv pip install like this.
Now this command is so common that I actually have a shortcut.
I call it pip install requirements.
So I just type pir, and it does the same thing.
And you can see that here as well.
So this is super cool because if now when people check out the project and want to run the demo code, they can see.
Of course I could have put this there before, but I wanted it to be part of this conversation.
Add a requirements file for the whole course, and we'll push all these changes up.
|
|
show
|
1:44 |
The next thing we want to address is how are people running the same version of Python and how are they running the same operating system?
Not just, oh, make sure you use Ubuntu 24.10 or whatever it is.
No, I want to use Ubuntu 24.04 with this kernel patch, with these things installed, and nothing but those things, in the same versions, installed.
Like we're talking really, really close to exactly the same machine, not just something called Linux or something called macOS.
No, really close.
And we're going to do that with Docker.
Also, how do we run the same version of Python exactly?
Well, when we set up these Docker systems, we can explicitly specify which version of Python that system should have.
So we'll do sort of two things, solve two things together.
It'll be Docker plus uv, because we use uv to install a specific version of Python inside of Docker. Again, if you're not familiar with Docker, don't stress about it. What it is: a way to build very lightweight Linux machines where you just state the steps to set up the machine from a base, like a base install of Linux or whatever operating system you happen to be choosing, and it will script the creation of those and then run them in a really lightweight way.
It's kind of like virtual machines, but way lighter weight if you're familiar with those.
So I'll talk you through this.
There's just a couple of steps that we got to do.
You don't even really need to know Docker so much as you need to know Linux, which is a different deal.
But again, just a couple of commands.
|
|
show
|
8:05 |
So we're going to start out with a simple Docker file.
I'm not going to type everything in.
I'm just going to talk you through it.
A lot of this is you find an example or you find the instructions for a certain step and you paste them into the Docker file in the right place.
Seriously, you go look up how to set up Node, which is a requirement of the Jupyter server for certain activities.
And then you either enter those commands into the server yourself, or you save them in a Dockerfile, which will run them in the server.
So the way it works is we're going to have two different Docker builds.
The reason is Docker is a layered system.
So it looks at the first thing you specify and it builds that, then the next step, then the next step, then the next step.
And if something changes, say on step three, it only reruns from that step on, not everything.
So breaking this into two pieces allows us to say, here's a Linux machine set up basically the way we want it.
And then here's one working for our project exactly with our source files as they are from GitHub right now.
Let's go over here and all you need to see is two things.
You say from and what version of whatever container you want.
We're going to use the base Ubuntu image and 24.04, a long-term support.
We could put latest here.
Sometimes you'll see latest, but that doesn't provide the same level of stability.
So, you know, make your decisions there.
We'll sort of set some environment variables; things like non-interactive are especially important because it might ask you a question.
Are you sure you wanna do this?
Well, there's no one there listening when this gets set up.
So it has to be explicit.
A lot of times you'll say, install this thing, -y as in yes, don't ask me.
So for our project, we're gonna use a Jupyter server, Jupyter notebook, JupyterLab, and JupyterLab extensions and widgets need Node.js on the server side to work.
So these settings right here, make sure that it's like, if you went to check on how do I set this up?
They would say, put that into your machine; take away the RUN command, and this is what the instructions look like.
So for Docker, we just say, run the commands on this machine, whatever Jupyter said I need to do.
You always wanna make sure you have updates.
If there's a security release for Ubuntu and for some reason they haven't patched it, or you haven't pulled the new image with the patch, we're gonna want to double check and make sure things are safe.
Better safe than sorry.
This part here just configures apt so it knows about Node.
And here we install node, apt install node.
A few other things, like some utilities we can use to track down errors.
I like to add a little bit extra to my Docker container.
Some people have a minimalist perspective.
Like I want it as small as possible.
If that's you, don't do this.
But for me, I like to be able to connect to my server and have a really nice experience.
Like you can see my shell over here has lots of fancy stuff that I can do.
And a lot of that's done through Oh My Zsh.
So install this foundation: install Zsh in Docker.
Make sure you got the certificates 'cause otherwise API calls sometimes fail.
Install uv into the machine and set some more paths so that uv works.
Now, the next thing we need to do is build this Docker container.
Take this definition and basically build out a Linux machine based on Ubuntu 24 exactly with these specifications.
Now, I did realize I made a quick mistake here.
I need to use curl to get node, and I had this statement below, so I just organized things a little bit wrong.
So make sure I'm going to get curl installed before we can use curl to get node.
We also have to run the update a second time so that it actually pulls from that source when we say install node, that it knows what that is.
How do we build this?
Well, we go to the terminal and we're right here with our Linux base.
And we could go in there and build this directly with Docker.
And that would be fine.
But we can do a little bit better, especially when we have these layered systems. And with our Jupyter running inside of Docker, we're going to want to maybe expose some ports so we can talk to it from the outside, and potentially share some folders from your machine over to the Docker container, for collaboration or persistence.
There's a lot of things like that that will get a lot easier if we use this tool called Docker Compose.
So we have this compose.yml that defines what that is.
And the way it works is you basically tell it how to build the Docker container.
And in this case, just what you want to name it when it gets built, because you refer to it by name, just like we did up here.
You'd say from, what's the container name?
Its container name is gonna be this.
All right, so it just makes things a little more reproducible.
And the things you might pass as command line arguments to Docker just become part of the specification.
So what we're gonna do is we're gonna go and build this.
And we're in the same folder as the compose file.
So I'll say docker-compose build.
You can see that Warp is auto-suggesting that through its little AI autocomplete thing, but you can type it if you don't use Warp.
And we say Docker Compose build.
And what it's gonna do is it's just going to make sure that we have this version of Ubuntu downloaded as a Docker image, and then we'll start applying changes one line at a time.
And you'll see that go by here.
We're gonna run Docker Compose build.
You can notice it's downloaded Ubuntu.
Now it's running apt update.
And you can just see it doing step by step what you would have to type into a Linux virtual machine or for that matter, an actual real bare metal Linux machine that you might have for like a laptop for research.
The first time you run it, it's going to take a while.
But that's not the normal experience.
You'll see it's way faster than this.
And there it is.
Did it say how long it took?
260 seconds, almost five minutes, four minutes and 20 seconds.
That might seem like an exorbitant amount of time, like a ridiculous amount of time.
But here's the thing.
With Docker, once it's done this, every single line you see here is now cached in the system.
So for example, see here we're installing uv right after update certificates.
If we make a change between those lines and we add a new entry, it's only going to rerun that little tiny bit there.
So for example, if I come down here and I said after this, let's suppose where you say, hey, we're going to need to have Git as well.
Whoops, we should have installed that.
Now we run Docker build again.
Now it's installing Git; Git's done; then it redoes the stuff after, because something could have changed as a consequence.
And that one took four seconds.
Better: if nothing's changed, it's nearly instant. Look at this, bam, done, 0.3 seconds it said, to figure that out. So really, really fast once you get it built and customized just the way you like. And what we're going to do is use this Linux system that we built here, the one we named linux-math-base, and base our particular setup for our math research on top of it.
So once this thing is built, it just stays there and we're going to let it be stable unless we decide we need to rerun it to update something.
And then we're going to work with our math research project and use Docker to base it on all of the foundation we created with this Dockerfile, on top of our customized Python-and-Node-based Linux machine.
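Condensing the walkthrough, the base image Dockerfile looks roughly like this (a sketch: the Node.js setup line and the uv installer follow those projects' published instructions, and the exact package list is illustrative, not verbatim from the course files):

```dockerfile
FROM ubuntu:24.04

# No human is watching the build, so never prompt for input
ENV DEBIAN_FRONTEND=noninteractive

# Security updates first, then curl (needed below), certificates, git, zsh
RUN apt-get update && apt-get upgrade -y && \
    apt-get install -y --no-install-recommends curl ca-certificates git zsh

# Node.js: needed server-side by some JupyterLab extensions and widgets.
# The setup script adds the apt source; a second update/install pulls Node in.
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - && \
    apt-get update && apt-get install -y nodejs

# Install uv and put it on the PATH for later layers
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"
```

Because Docker caches each RUN line as a layer, editing a line near the bottom only reruns the steps from that line down, which is why the first build takes minutes and later builds take seconds.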
|
|
show
|
7:52 |
Now we want to apply the same ideas to package up our Python project as a Jupyter server that we can run and interact with over the web through notebooks in a super reproducible way.
So let's look down here into this next section.
So we're going to have a Docker file and a source file.
This is all of our source details that we got here.
For our Dockerfile, I'm gonna say it depends upon linux-math-base:latest.
That's this one defined right here.
So the image that we set up for all the prerequisites basically.
And then we have to do a couple of steps in order to set up our app to run as a Jupyter Notebook.
The first thing we wanna do is choose where on our server it's gonna go, where on our Linux machine it's gonna go.
Now this is not interactive.
We don't need to worry about putting stuff into the right user folders and not cluttering up the root drive and things like that.
So what I'm gonna do is say RUN mkdir /app, just like that.
So right at the root, there's gonna be a /app.
That's where our project's gonna go.
We're also gonna have a /venv where we run Python from.
We can set the working dir, the WORKDIR, to be /app as well.
So now it's kind of like all subsequent commands are happening in this folder.
Okay, so this is like, I'm getting ready to copy my files over.
This one over here installed uv, but we want to make sure we have the latest version of uv.
So I'm going to say run uv self update.
This just makes sure that uv is the latest things in case there's some kind of bug fix or something along those lines.
And we might not need to rebuild the base image often, but this one is going to build more frequently, because it'll be driven by changes in our source files.
And we need to create a virtual environment.
So how do you do that?
Say uv venv /venv, and that would do it, but this will let uv pick any old version of Python.
One of our goals is to have exactly the same version of Python for everybody all the time.
So here with uv, we can say Python 3.13.5.
And uv will see there's no version of Python at all on this Linux machine; it'll download the version that we specify here, install it, and then create a virtual environment from it.
So that's a really nice reproducibility thing.
This was one of our major sources of variability solved right there.
Now, the next thing we need to do is install all the dependencies.
So remember we already went through that process to say, these are the exact versions of everything we need to run this project.
So we're gonna say copy requirements to /app.
Now you might say just copy all the files and then install them.
But I'll talk you through why this is an advantage to copy just this file first.
So then we can run uv pip install -r requirements.txt.
We're already in this folder, since we made it the working dir, so it should be able to find that just fine.
Let's just run this and see if it's working out.
How are things going here?
So now, going back to Docker compose, let's go ahead and first put this as a new service.
And it's got a lot of the same details here.
So I'm gonna just copy those over and it's gonna be math research.
And let's just keep the name consistent this time across the board like that.
So there's other things we're gonna wanna do, but not yet.
Let's just see that this builds correctly.
Docker compose build.
So it did the first one super fast.
Now it's working on our math research.
See it created the virtual environment and installed Python in three seconds.
Incredible.
There we go.
So it looks like it might work.
We could actually come in here.
We could say docker compose run -it math-research zsh.
So we're going to run just the shell, and look at this.
We have our requirements.
I'll say source.
There we go.
So that's the one that we're working with, this version here, and everything got installed at those versions. I haven't memorized them, but I guarantee if you check, those are exactly the ones that we pinned. So we can actually go in here, into our running Docker container, and notice this nice little shell. If I just type up arrow, it goes through the history; if I type uv and then up arrow, I get the history of uv commands.
This is all the Z shell stuff.
That's super nice that I said would be better if you had.
So when you go in there and interact with it, you get a better experience.
Okay.
So this is a building.
Let's do just a couple more steps to make this complete.
So if you looked in that app folder, all we had is this file.
We want all of them.
So I'm going to say copy SRC to slash app.
That SRC here is this one.
And it's going to go there.
Let's save this, run it again.
If we go back and interact with it, now we can see our math file, our Jupyter notebook, and so on.
Great.
Looks like that's pretty close to running.
We just need to run this at this point, I think.
So the final thing that we're going to need to do is run JupyterLab: we can say uv run jupyter lab and tell it to use this configuration file, which I put down here into the config folder.
It says to listen on a certain port and on the broader network, because the Docker container acts like its own separate server.
So we got to change a few things to make that run correctly.
And yeah, we build this when we Docker compose or run it, it should try to run our project.
Let's see.
All right, it built super quick because we only changed a few things at the end of the file. But now, the moment of truth. Are you ready? docker compose up. This means build the images if they're not there, but if they are there, just use the ones that exist, and then start up the systems in the order in which they were specified.
So it's attaching to math research.
Oh, there's one more setting that we got to do in our config here, it looks like.
It looks like it's complaining that it's running as root.
Now in general, this might be bad.
It's not as bad in Docker, but it turns out some of these things get quite a bit more complicated if you don't run as root.
So for just this like local Docker thing, we're going to say, allow it to run as root.
It looks like that is specified already.
So what is missing?
I already have this all specified over here in this config file, but it looks like I had mistyped this.
It should be /app/config, because that folder gets copied over.
Let's try one more time.
Rebuild it.
Anytime there's a change, you got to rebuild it.
Now we can say up.
Awesome.
There it is.
Our notebook server is up and running, and it has all of the libraries that we specified and our source files like our notebooks that we've copied over.
Okay, everything looks like it's working great.
We've got our project, our container exposure set up here in Docker Compose.
We can run Docker Compose up.
Looks like everything's working, but if I click on the URL here, maybe it's not as good as it could be.
That's not a great experience there.
The thing is, by default everything is locked down in Docker.
So we need to go over here in our Docker Compose and say: this service will allow a certain port to be open, so that we can talk to it, kind of like a little miniature server on our computer or wherever we happen to run it.
So let's set a couple of things here.
First of all, we could set the working dir to slash app, just to double check that that's all set up.
But we can also go to ports.
This is the one that matters.
And we can map from port 9000 locally to 9000 on the container there.
Now PyCharm says, "Eh, you probably should put that in quotes."
Sometimes I see it with quotes, sometimes without, but whatever.
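Put together, the compose service described here might look roughly like this sketch (the service name and port follow the transcript; the rest of the file is assumed):

```yaml
services:
  math-research:        # service name follows the transcript
    working_dir: /app
    ports:
      - "9000:9000"     # host:container — quoted to avoid YAML parsing quirks
```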
Now let's try to do this again.
We'll shut down the server, rebuild it.
Remember, for even the smallest change you have to rebuild every time; then we can docker compose up again.
Wait for it now, let's try clicking here.
Woo hoo hoo, look at that, that is so epic.
Okay, so what do we got?
Let's see, well, no, we don't care about this.
So we have our config hidden away, but check that out.
This is our notebook.
And it's our Fibonacci math research that we wanted to make reproducible.
Super, super cool.
You know, one thing we could do actually, I think is we could probably remove these after the requirements get installed.
We'll play with that in a minute.
But look at this.
Let's just run this one, make sure it works.
Ooh, one successful, right?
And now we can run our primes and let's just make like the smallest change.
Let's make this, I'll make it five.
I know it still says 10, but let's just make this here.
And I run it, we should see one, two, three, four, five, like that.
Which ones are they?
Boom, there they are.
Guess I missed one in my little list I put out.
Super, super cool.
And we'll see something about this in a minute.
Don't read that part.
How neat is that?
So we've got our notebook running.
We could expose this on the internet through some server if we really wanted to, as long as you're not worried about people getting access to it or anything like that.
There's a lot more you could do, a lot more complex things in terms of setup and users and whatnot, but we can just use Docker to build this exactly reproducible system in which we can run our code.
So one thing to note is I put the five here and if I put save, is it worthwhile?
And I say docker compose build math-research --no-cache, just to force it to rebuild this math section.
And I run it again.
And we're going to need to connect to it again, because this auth token here has changed.
We go in, we go to our research.
Notice everything's lost.
This is not persistent.
If you want it to be persistent, we're going to talk about how to do that.
But every time you build it, this is like a fresh start from whatever you provide to it.
So, awesome, we have this running exactly reproducibly.
Let's review.
So we've got the build working, but if we look at like the Linux build, it's got the same environment settings.
It's downloaded and installed the requirements.
Now we're not exactly forcing apt to pick a certain version of like say Wget and stuff.
We could, I don't really think that's necessary.
But again, you know, how mental do you want to go on this, right?
Anyway, let's just exactly specify: we're going to have a base Linux machine based on this version of Ubuntu, with that version of Node, those exact things, right?
And then you jump over here to the math one, and it's gonna be with whatever files we put there running exactly Python 3.13.5 with exactly the requirements that we pinned, off it goes.
Let's make one real quick change here after this is all done.
a RUN rm with a glob matching the requirements files.
So let's remove those.
Let's just do a build one more time here.
Looks like it works.
We'll say Docker compose up.
Again, got to re...
Whoops, stop moving around.
Oh, it's always freaking out if you have this one all left over.
Say close that.
Stop running away.
I need you.
So there.
All right, let's just go look.
Oh, look, there it is.
The requirements files are gone.
We don't need to keep them around once they've done their pip install work, and we don't need them.
So might as well remove them.
So people looking at this don't see like unnecessary files hanging around unless you wanted them to see the versions by going into it.
But probably not, I don't know, it's up to you.
Now imagine you want to have a place where output files are generated, and if users running that code make a change, they need it to be able to survive a Docker Compose build.
We saw that even if we make a change, it is saved until you rebuild the image, then it throws everything away and starts over.
So the operating system or the file system of that container is transient.
It can go anytime as far as the user's concerned.
So if we want to create something permanent, check this out: right here at the end, I added a little section to explore the idea of setting up a location where files can exist independently of, or outside of, the container build lifecycle.
So maybe we want to track how many times this thing has been run, and we want that to persist forever. Think about the reproducibility here: this is sort of our working area, but the idea is that on the Docker container we're going to have /data as a folder that we allow to be persistent. So /data, plus whatever subdirectories or files, will live forever across builds.
Now, I also want to be able to run this locally: run the notebook here in Jupyter, or run it in Docker. In Docker I can guarantee /data will exist, but locally we'll just make a little local folder if we're not in Docker. Either way, if the file exists, we open it and see how many times it's been run before; otherwise it's been run zero times. Then it says, hey, we ran one more time, and it saves the file.
And so in the output, you can see it says this is run 10.
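The run-counter logic narrated here can be sketched as a small Python function. This is a stand-in for the notebook cell, not the course's exact code; the file name run_count.txt is an assumption.

```python
from pathlib import Path


def record_run(data_dir: Path) -> int:
    """Increment and persist a run counter in data_dir; return the new count.

    A sketch of the notebook cell described above; the file name
    'run_count.txt' is an assumption.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    counter = data_dir / "run_count.txt"
    # Zero runs so far if the counter file has never been written
    count = int(counter.read_text()) if counter.exists() else 0
    count += 1
    counter.write_text(str(count))
    return count


# Inside Docker, /data is the persistent volume; locally, fall back to ./data
base = Path("/data") if Path("/data").exists() else Path("data")
```

Because the counter lives under /data (the mounted volume), it survives image rebuilds, which is exactly what the demo shows.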
Okay, so we're gonna create a folder slash data.
And we want this to exist independent of whatever the Docker running instances.
So we can do that over here in the compose file.
Another reason this thing is so nice.
And this is called volumes, like drives.
And so what we can say, from this docker-compose perspective, is that the working directory's math-research/data folder, a data folder here on the host, is going to appear as /data on the server, inside the container.
So when we talk to /data, it really means look here, on whatever machine is running the Docker file.
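In the compose file, that host-to-container mapping might look like this sketch (the relative host path is an assumption based on the narration):

```yaml
services:
  math-research:
    volumes:
      # host path (relative to the compose file) : path inside the container
      - ./math-research/data:/data
```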
So let's build it again, keep this going.
And we say docker compose build. Give it a moment.
It looks like everything's the same, except for now we have this data folder and we don't need this run count.
That's not how it's supposed to be.
Okay.
So look, we have a data folder here and right now it's empty, but let's go back, open this up.
Again, I don't care about that.
Go in here and let's just run that line.
Now, that is not the error.
It's not really an error.
It just hadn't run yet.
So look, it says this is run one.
And that is because there was no file.
It set it to zero and then to one, but it wrote it.
So go look over here.
Now, look, there's a run count with value one.
This is outside of Docker.
We run it again.
I'll run it a bunch of times.
Run two, run three, run four, and so on.
We check out our file.
Now it's four, sure enough.
Let's shut down Jupyter.
Let's leave.
Close the server and just do something silly to make this change.
So, you know, I'm silly.
I can just do this.
I could say Docker compose build, math research, no cache, right?
Before we saw that cause a problem where it wiped everything away.
Awesome.
Well, now we just got to do Docker Compose up.
Cool.
Now notice this file is still here.
I can even change this to eight if we want and save it.
Now if we go click this to open it one more time (still don't care about privacy or whatever that is), go down here and run it. Oh, I've got to run it in the right order; let's just run all. This is run nine. Run nine, because that file, that whole /data folder, is a place that can be persistent outside the container. So this can be really tricky with notebooks, or with Docker.
I just want to kind of cover it a little bit because it can be mysterious.
Like I know it's nice for a while, but then it stops working because like we can't save files or whatever.
Here's a cool way to handle both: I could drop stuff in here, and other parts of our app inside the container could look in there and use it.
We could add new data there and it would live on, updating as if we were working directly with /data in the container.
So here, through this volumes thing, is how we add persistence outside of the Docker container.
There's Docker volumes we can create, there's other types of things, there's many options here, but here's a real simple way in which we can do that.
Let's do one more really quick thing that'll make these Docker builds potentially super fast compared to without it, if you're using many dependencies.
We're already kind of using a lot.
So check this out.
I'll do just the math research container rebuild with no cache.
Watch how long it takes on this uv, not this one, not this one, this part.
Look how long it takes to install these 'cause it's downloading, downloading.
It's pretty fast, but it did take five seconds.
Can we do better?
The answer is yes.
On our local machine, this goes faster and faster 'cause uv will cache those downloads and only get new ones if the versions change.
That's really neat.
We can actually go in here and leverage a feature of Docker and right there on our uv pip install.
Now, hat tip to Hynek.
I believe he's the one who turned me onto this.
So we can say run and the command, but we can also add this mount option here.
And it says, we would like to persist this folder across Docker builds.
So for example, uv writes its downloads to that folder, /root/.cache/uv, and then in another build it asks: hey, has that already been downloaded?
The answer can be yes.
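The BuildKit cache-mount trick described here looks roughly like this on the install line (a sketch; the requirements file name is assumed from earlier in the lesson):

```dockerfile
# Persist uv's download cache across image builds: wheels downloaded in a
# previous build are reused instead of being fetched from PyPI again.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install -r requirements.txt
```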
Let's find out if this makes a difference.
Remember, it was five seconds before and it's not gonna be any faster this time.
Why?
because there's no persistence.
It hasn't seen it before, but the second time, right?
See that's downloaded like a hundred megs of stuff or more.
Okay, let's do a build, but not with --no-cache this time; guess we can't use that here.
Instead, let me just make a minor change.
I'll just say, right here: RUN echo hello version kick.
So that'll make it do a rebuild there.
So let's do the build.
Look at that.
Now our uv pip install took 1.3 seconds.
Let me get it to run again, put a bunch of those, see if it's any faster one more time.
Nope, that's about how long it takes to install them.
But notice it was roughly five times, 4.8 times or whatever faster than it was before.
and it didn't even need to download.
So it actually puts a whole lot less stress on the PyPI infrastructure, like pypi.org and so on.
Doesn't have to ship you 100 megs of binaries every time you run this build, if you already got them, right?
So both be kind to PyPI and to yourself and make this run faster.
Super easy to do.
We've made it to the last chapter of the course, and this chapter is pretty awesome, as you will see.
Up till now, we've covered core skills that you should have.
Understanding the Python language, working with functions, refactoring concepts, source control, even debugging.
In this chapter, it's a little bit different.
We're covering a core tool that is emerging that can make your data science work tremendously more productive and allow you to gain skills and apply techniques much faster than you otherwise would.
So we're going to talk about AI coding and in particular, agentic AI coding, which if you haven't seen the difference, it's a really big deal.
So we're going to have a lot of fun working through how we can apply AI and LLMs to our data science work and the code that we built so far through this course.
Now, I know some of you out there already have played with AI and LLMs for your coding.
You may have had a lot of success.
You might have found that it wasn't that great.
So I want to talk about the different types of AI that we can employ, how we can do it, and some different tools to apply them.
But put together in the right way, these tools are something else.
If you haven't seen it before, you will be stunned.
Not just impressed, stunned at what is possible.
And as this little yellow sub note points out here, this is probably going to change five minutes after I record it.
So take this as inspiration, apply it to the world as it is today, because in a few months, maybe it's really different.
I don't know.
Now, when people think of AI, they often think of ChatGPT and Google Gemini.
These are really super powerful and really neat tools.
They generally don't apply.
Well, at least at the moment, ChatGPT doesn't really apply in this situation.
If you've gone to chat and you said, hey, create me a function that does X, Y, and Z, it will do it.
But what it's not great as saying, here is a project, a Python project or data science project with seven CSVs, 25 Python files that relate to each other, two notebooks, and I would like you to refactor the code so that this notebook can move some of its functionality over to this other library, taking into account the entire project structure.
Gemini is making a little bit more progress towards that, but at the time of the recording, it doesn't really have that either.
More broadly, the chat-based systems, while amazing for a lot of things, are not really appropriate for true work against your project in the way that the agentic ones are.
Okay, so you really want to look at agentic AI. These tools don't just look at what you provide them and then come up with an answer; they use tools, and they look at different files. So maybe you ask it to do some work, and it'll say: okay, let me look at your project. I've seen all the files, I've seen your instructions on how you've put it together. Now I'm going to try to make a change here. But oh, let me run the unit tests to make sure they're working after I made that change. Oh, something's not right.
Let me go back and fix it.
Great.
Now I'll document that in the markdown file.
Like it's the use of these tools and the iteration that is truly different.
In order to get that experience, you need an agentic coding tool.
So we have a couple of options.
You've seen me use PyCharm a lot during this course, and I am a massive fan of PyCharm.
They just recently released this thing called Junie.
Now I want to be clear.
They previously had something called JetBrains AI.
They're not the same thing: you uninstall one and you can install the other; they're completely separate products, both from JetBrains. You want the agentic one, Junie; that is where the magic lives. So if you want to stay within the JetBrains toolchain and ecosystem, Junie is an awesome thing to add to the JetBrains IDEs. You can also use GitHub Copilot. I'm not a super fan of this; for a long time they had no agentic option.
I believe they just recently came out with something.
Keep going.
If you want to go to the high end, right now, at the time of recording, I believe that Claude Code is peak agentic AI, but it's super expensive.
You can see that it plugs into both VS Code and the various JetBrains IDEs like PyCharm.
And it's super powerful, especially if you use their top level model.
But not that cheap.
So it's also very terminal based.
And it's up to you, you can check this one, I highly recommend it, but it's not what we're going to be using.
We are instead going to use something called Cursor.
Cursor is a fork of VS Code that adds agentic AI capabilities, and I highly recommend it.
I've seen amazing things done with this.
And I have done amazing things with it.
By the way, you can also plug Claude Code into Cursor.
There's a lot of mixing that can go on here, but we'll see what we can do with cursor.
If you haven't seen this before, or you've tried AI and you're like, well, it just made a bunch of mistakes and it didn't work very well.
That may be true, but generally what I find when people say that AI hallucinates a lot, or it makes a lot of mistakes, they're using the cheap, small, free model, not the paid high-end models that are like eight times more accurate and so on.
So we'll see what we can do with a relatively modest subscription to cursor.
All right, let's jump in and do some fun stuff with Cursor.
You can install it at cursor.com.
Obviously, I have the Pro tier, not the Super Ultra tier.
So just give you a sense of how much do you got to pay to get these features.
For me, it's so incredibly worth it.
I just recently switched to the yearly billing, but I was paying $20 a month to try it out.
I can't do the free trial one, but I'm telling you, use the higher models.
you get dramatically different results.
We're going to look at two different projects.
We're going to first start with something that we've been playing with the whole time, that math research project that we've been iterating and building up through the entire course.
And that'll give us an example or a sense of how do we take an existing project that we wrote from scratch and then apply some of these AI concepts to them.
Then we're going to start an entirely new data science project from scratch with just a CSV file and cursor.
And we'll see how we can use that to quickly build out some notebooks in the same manner that we've been teaching through this whole course.
Proper factoring, separation of utility libraries, all that kind of stuff.
Documentation, putting together our notebooks in ways that do storytelling, all that kind of stuff.
We're going to start from scratch and do that.
For our first example, we have the project loaded up here in Cursor.
And I now have this agentic AI chapter.
And I'm going to open up our math research.
So here's our project.
And this is basically like VS Code.
We can run it all.
Let it install that real quick.
Now we'll let it run it all again.
There we go.
Looks like it's working.
Let's just change this real quick to see if we can get it to do something different.
I say 15, well, I guess we have a lot of these.
Yeah, sure enough, there we go.
So we'll put that back to 10.
Excellent.
So how do we get started?
Maybe collapse like that, and this looks like VS Code or PyCharm, sort of.
But if we expand here on the right-hand side, let me shift myself to the other.
You can see we have this chat section.
Now it may show something like Auto, and who knows, it might even say Ask.
I'm not sure what it's going to say.
But what you want to make sure of is this: that mode is just chat.
You don't want to do that.
You want to switch to agentic code where it can run tools and act as basically like a junior developer.
And you don't want to let it pick what model to run.
You want to pick the high-end models.
So I'm going to pick Claude 4 Sonnet.
If you wanted to, you could turn on Max mode and choose higher-level models, but I'd have to turn on extra pricing, which I'm not doing.
So we're going to choose Claude 4 Sonnet, which is a good one at the time of recording.
Again, agentic is the magic.
Then we have context here.
So I could change this around.
I could say I would like to pick files and folders.
Let's see.
Maybe not that much.
So we've got math research, and this is the AI one.
Barely see the path in the little autocomplete there.
So we can say, no matter where we are, this is the file that we're talking about.
And we could also add Math F, or we could just give it this whole folder potentially, however we want, and give it, say, look, these are the files I really want you to especially focus on.
So what can we do?
We've already broken this up pretty well, but there's a couple of things that I think we could do that'll be pretty neat.
Let's start out by having the AI just inspect our code and say, I'm interested in clean coding practices, and see if there's any bugs or anything like that.
Don't make any changes.
Just give me a look.
So let's try this.
So I'm going to say, please look through the math research IPYNB file.
I don't want you to make any changes yet.
Just read it and make a plan.
So it'll start out by just telling us what we could do, and then we could work with it further.
I want you to look for any bugs, performance or clarity improvements, and suggest other changes that it thinks might be an improvement that I can't think of.
Let's hit go and see what it does.
All right, it's reading, it's checking it out, it's looking at the different cells, determining which are markdown and why they're there.
Okay, let's have a look and see what it's saying.
That's a lot that it can suggest.
It's even given us a little checklist.
This is more than I expected, honestly, in terms of what it found.
So what does it say?
It says issues found.
There's an off by one error.
Now, I didn't notice this before.
It says the check if idx > 10 actually collects 12 items (0 through 11), not 10 as intended.
Wow, okay.
So apparently there's a bug here, right?
That's pretty interesting.
Performance, inefficient list building.
It says going through the entire set here in this loop with the break, all of this.
Actually, we could use this thing called itertools.islice to operate on generators and say, just give me 10 from the generator, in one line.
That's the kind of thing that an experienced Python developer would not miss.
Maybe if they do a lot of this kind of work, I probably wouldn't in the future, but I think that that is pretty awesome.
It says like this append and basically this entire line, this entire thing there could be one line, which is incredible.
Unnecessary intermediate list.
So let's just have it fix.
Let's have it fix the bug.
I'll have it do this one.
I'm going to tell it, please update the notebook.
It should be this selected bit of code right in the middle to use iSlice.
How's it going to do?
Yeah, it's found the right bit of code that it was talking about.
Great, I'll update it.
Give it a second.
Okay, it actually took it a little while to get everything just right.
But look at the suggestion.
Now it's like, hey, we made this change here for you and move this piece that we were using down there into its own variable.
And it said this whole loop right now is just gonna be calling iSlice.
And so do you wanna make that change?
I'll say yes.
Look at that, that whole loop and all that collection came down to just exactly that.
It's asking if it can run Python.
I'm gonna tell it that in the future it's perfectly welcome to run Python.
It's a little dangerous, but it needs to do that a lot to make sure the code it wrote works.
So look at that.
It fixed the off by one error and it made it more efficient.
It also claims it's more readable, which I think that is generally true.
And it improved the variable naming, which I'm often a big fan of.
So that is very, very cool.
Let's see about what else we can do.
I'll ask if it can suggest one of the other improvements, because honestly, it's been going so long, I forgot what its original recommendations were.
We took care of two or three of them.
You can see it's remembering what it's supposed to be doing.
Like it's, oh, we fixed off by one.
We could fix some of the naming partially done.
Let's see what it comes up with.
All right, so look at the state of notebook.
I can suggest we focus on improving the library over here.
Okay, why?
The notebook depends upon it, so their quality matters.
It addresses specific issues.
In particular, it says I used typing.Generator, and I should have used Generator from collections.abc (the abstract base classes module).
And it's missing docstrings.
Okay.
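A minimal sketch of what that fix looks like, assuming a simple generator function; the body here is a stand-in, not the course's exact code:

```python
from collections.abc import Generator


def fibonacci() -> Generator[int, None, None]:
    """Yield Fibonacci numbers indefinitely.

    Annotated with collections.abc.Generator; the typing.Generator
    alias is deprecated since Python 3.9.
    """
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b
```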
So I'll tell it: great, go ahead and fix the type deprecation and add the docstrings.
So it says, okay, look, here's a bunch of changes.
Great, I fixed it and I'll accept them again.
If this is in source control, then I can just go over here and see the difference, right?
See them side by side, how these have changed.
So I can be fearless like I talked about.
I'm not worried that it's gonna mess something up because I'll just roll it back if I do.
Do I love this much information when it was so simple before?
No, but as something that helps, pretty good.
Now check out how well it did.
It said generate a sequence of Fibonacci numbers, but an infinite one.
And notice that it did update our generator to be this one from ABC, super cool.
And it didn't change any of the code.
And here again, for the other generator, the docstring says it yields only multiples of a given number.
Given some example, return value, what kinds of errors it could potentially raise.
How sweet is this?
I may have added that as well.
Let's look, see if it added this.
Yeah, it added this error checking force.
I did tell it to make improvements, but maybe not that much, not against it.
And I'm gonna allow list changing directory, that's fine.
Let it finish it up here.
Let's see what it's gonna do.
Look, it's running tests here.
I'm gonna allow list rough, so it's trying to do auto cleanup using rough.
We talked about how cool that is.
And I actually told cursor, use rough to reformat your code, the code that you write when you're done.
I'll show you where I did that, but I'm going to allow list that one ever.
So move up here.
Excellent.
Let me show you the final improved version.
And it tells you what it did.
It improved the typing, improved code quality.
Did it write tests?
I think it just did a little bit of a test for itself.
Let's do one more thing, right?
Impact: the code evolved into well-documented, robust, modern Python.
Love it.
And it talks about all the things that it did.
This is great.
Let's do one final thing to apply here.
And then maybe we'll call it good enough.
Maybe I can do two things real quick.
First, I'll just see if it'll help us add a little bit of summary information, like what are Fibonacci numbers?
What are prime numbers?
This may be great.
This may be bad.
But I'm real quick going to say clean up code via cursor.
Not there.
Save that and see what it's going to come up with.
Understanding the mathematical problem. Great, it used a level-two markdown header; that's cool. I find when working with these that it doesn't really do a great job of converting cells to markdown, so you might just be better off switching the cell type yourself. You can talk it into solving that problem, but it's sometimes easier to just choose markdown.
So let's see what it put. It said: understanding the mathematical problem. Fibonacci numbers are these; the research question we're investigating is an interesting mathematical relationship, specifically which Fibonacci numbers are multiples of prime numbers. Yeah, that is super neat. It talks about number theory, computational mathematics, and so on. So I like it; I think that's pretty neat.
It actually got off to a bit of a rough start locating the right thing here, but it did really quickly identify a bug: we had 12 Fibonaccis instead of the 10 we said it was. And it also identified some readability and performance improvements through the use of islice. So I think this is really good. Maybe we'd refactor a little bit by moving this up, and we could ask it to do that, or we could just come up and go, you know what, run that again and then collapse it.
Right?
So I think this is pretty promising and you can go through and ask it more questions.
What do you think you need to improve?
We'll do one more thing in another video here, but given our simple little example, I think it's done a pretty decent job.
Now also to be clear, it does less well on notebooks than it does on pure Python files.
This is another benefit, another benefit of separating some of your code into Python scripts.
I think you'll see that as you work with it.
Let's do one more thing here with this.
It would be kind of nice if we had some unit tests.
Now, I told you not to worry about writing unit tests, and I'm not going to suggest that you go study up on testing to a great degree.
But maybe cursor can help us come up with some tests that could be useful for us.
There's only these two functions.
So let's see what it can do.
Now, I could just keep typing down here.
However, this is pretty long.
You see it's used up 23% of its context and it gets slower and slower.
So we can do a new one of these.
Then it'll probably be a little more focused.
Notice it has the active tab.
So it knows I'm talking about that file right there.
Although if there weren't copies of it all over the place, it would be pretty much able to find it wherever.
Let's see how it does on this.
And let's see what library it chooses.
I'll help you create some unit tests.
Sure.
Let me read it and understand what I'm supposed to test.
Okay, so it found the two functions, and now it's going to create some of those tests.
It's testing the Fibonacci sequence functionality, and it's testing the multiples of.
So let's go over here and see that we got this test_math file.
I'll keep it.
It says import pytest.
We don't have that installed.
Let's see what's going to happen.
It's going to try to run it.
Let me run the test.
It's going to notice there is no pytest.
It says great.
I need to install pytest.
And notice it's using uv pip, not regular pip.
I'm going to allow list that behavior so it doesn't have to ask again.
Notice it ran the tests and now they pass.
This is part of the agentic stuff.
It said, I'm going to write some tests.
I'm going to use pytest.
Oh, you don't have pytest.
Let me make sure that you're using uv.
So let me use uv to install pytest and I'll try again.
That is really, really powerful.
Let's just look real quick at what it's done here.
It's testing the first 10 numbers.
It knows what to expect.
It's testing that it is a generator.
It's testing a few things.
I mean, it's kind of over the top, right?
Like the continue is good.
Maybe this one is not really necessary because we tested the first 10.
So I'll delete that out.
Basic case of multiples, 3, 1.
And it's kind of going over the top just a little bit here.
What happens if there's a negative number?
It raises that exception that it said it did with the value error and so on.
So we could probably ask it to clean this up and simplify it.
I don't really want to make too big a deal about going into this and so on.
I just want to show you how comprehensive this thing can be.
It looked over here.
It understood what was going on.
And then it created the right test cases and everything.
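To make that concrete, here's a minimal sketch of the kind of pytest file Cursor generated. The function bodies and names (`fibonacci`, `multiples_of`) are my reconstructions based on the transcript, not the actual course code:

```python
from itertools import islice

# Hypothetical reconstructions of the two functions Cursor found; the real
# signatures in the course code may differ.
def fibonacci():
    """Yield Fibonacci numbers forever (a generator, as one test verifies)."""
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b

def multiples_of(n, numbers):
    """Return the items of `numbers` that are multiples of n; n must be positive."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [x for x in numbers if x % n == 0]

# pytest discovers functions named test_*; plain asserts are all you need.
def test_first_ten_fibonacci():
    assert list(islice(fibonacci(), 10)) == [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

def test_multiples_of_basic():
    assert multiples_of(3, [1, 2, 3, 4, 5, 6, 9]) == [3, 6, 9]

def test_negative_n_raises():
    # In real pytest code you'd use pytest.raises(ValueError) here.
    try:
        multiples_of(-1, [1, 2, 3])
        assert False, "expected ValueError"
    except ValueError:
        pass
```

Note the use of `islice` to take the first ten values from an infinite generator without materializing a list, the same improvement Cursor suggested earlier.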
I might ask it to, say, not do so many tests.
You could have it clean them up, or you could just delete them yourself, or whatever.
But this is a pretty interesting idea.
Notice that it ran it, then formatted the code with Ruff.
It used pytest, not unittest, which is the built-in default one that comes with Python.
Why did it do those?
We're going to see.
We're going to see.
So super cool.
I think this gives you a sense of what's possible to just open up an existing project and start talking to Cursor about it.
All right, last thing before we move on to our from scratch analysis example.
How in the world did it know I was using uv instead of regular pip?
How did it know that it should use Ruff, and ruff format and ruff check, which are two complementary but different ways to improve our code?
How did it know which one of those or any of them to do?
So this brings us over to something called cursor rules.
So I'll say cursor settings and down here somewhere, rules and memories.
You can let it learn about you, which is kind of cool.
But this is the key part. Let me see if I can get this in here in a way that works for you to read.
All right.
So what you can do with these cursor rules is you go and tell cursor, here are the standards I would like to abide by.
And so when you're working on my code, I don't care what the world does.
I want you to focus in on and make sure you apply these rules.
For example, use the latest Python syntax.
Use Python rather than say node where possible.
Use async await if it's an async thing.
Add proper error handling, which it did.
But look at this.
Run ruff format and ruff check --fix on any Python files you have edited.
That's pretty cool.
Never use system Python.
Run them with uv run.
Somewhere else, I told it I'm using uv: we use uv for dependency management; use uv pip install rather than pip install. So when, up here, you saw this single "Oh, I need pytest," it did this to install pytest to make sure it works. That's why. Let me see if it did one other thing that's pretty epic. Let's go over to source control here. Okay, it sometimes does this; it doesn't always. I think maybe because I have it so focused in this area, but sometimes it knows that it has to put pytest into the requirements.
I think this might be a project setting that I've used before, but not a global one.
So now we can say uv pip compile and make sure it updates our requirements.txt as well with whatever pytest needs.
No, that one, but this one right there.
Okay.
Take some time.
Don't overdo it.
But take some time and go down and put in these rules so that it works the way that you want.
So rather than doing something frustrating and go, oh, we're going to use React Native.
You're like, no, we're not going to do that.
We're going to use Python and we'll use it this way.
Whatever it is, maybe you wanted to use React Native.
I don't know.
But put those in there.
You can also create these cursor rules files, which you can put in your project, which add on to your user general rules.
So you could say for this project, we're using JupyterLab, this version in this way, whatever you want.
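For reference, a user-level rules block like the one shown might look something like this. The wording below is my paraphrase of the on-screen rules, not a copy of them:

```markdown
- Use the latest Python syntax.
- Prefer Python over Node where possible; use async/await for async code.
- Add proper error handling.
- Run `ruff format` and `ruff check --fix` on any Python files you have edited.
- Never use system Python; run scripts with `uv run`.
- We use uv for dependency management: use `uv pip install` rather than `pip install`.
```

Project-level rules files layer on top of these, so per-project choices, like which JupyterLab version you're on, can live with the project.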
Now, let's see the true power of cursor in agentic AI.
We're going to start from nothing but a CSV file, and we're going to end up with some pretty awesome outcomes.
All right.
Now, we're going to use this video game sales CSV file, which has sales data for about 16,500 games, all of which have sold over 100,000 copies.
I'm going to link to it here and I'll put a link in the code for this particular chapter so you can click over here and get to it.
I'm not going to embed the CSV directly because Kaggle wants you to download it from their site.
And I'm not sure what the redistribution ability is here.
So let's keep everything respectful with Kaggle and you can go download it from this URL at the bottom.
Or just watch me run it.
All right, here we are in our other subfolder, Game Sales Analysis.
You may have been wondering what that was.
Well, this is it.
Now, before we get into it, I've downloaded and copied in that CSV.
Let me just show you something cool.
So maybe we don't want this in source control for whatever reason.
Notice VS Code, or Cursor, thinks it's something we want to commit.
But I'm going to right-click on this and say Add to.gitignore.
And if we go to the very bottom, you can see I can even just put ignore.
Ignore that file.
I also told it to ignore runCount from the previous project.
Remember, that persistence thing with Docker.
So this is a way we can say, look, don't put this into source control.
So now if I look at changes, it just says update your gitignore.
OK.
I'll say the reason I made that change is to ignore the Kaggle CSV.
Let me go over here, and it turns into a grayed-out sort of thing.
This is not in source control.
Good.
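If you'd rather do it by hand, that right-click just appends a line to `.gitignore`. The filename below is an assumption on my part; Kaggle's video game sales dataset usually downloads as `vgsales.csv`:

```gitignore
# Keep the Kaggle download out of source control
vgsales.csv

# The persistent counter file from the earlier Docker example
runCount
```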
Here's a link telling you how to get it and stuff if you want.
So what we want to do is I want to use this particular file along with cursor to first try to understand the data.
I'll say, yeah, sure, you can go ahead and install the Rainbow CSV extension so it can highlight these things.
That's cool.
So let's go over here.
Now I could go over here, right, and click this and I could hit the chat or I could have the active tab, but watch this.
If I get rid of this and I go and right click on there and say, add file to a new cursor chat.
So then it's like, aha, now we're talking about this thing.
Now, instead of starting with writing code, I'm literally on a brand new project.
I have a bunch of data in a CSV file, and I need to understand it.
It doesn't know anything about the Kaggle project.
It just knows the CSV.
So I'm going to ask it this.
So this is a CSV file of video game sales.
Please read it and give me an overview.
Put it into a datasummary.md file in this directory.
So basically create a markdown report of what's going on.
Let's see what it even finds.
What can it do?
It's writing a markdown report for us.
And it's got quite a bit of stuff going on here.
Let's see how it goes.
It has top findings, key insights.
Do we necessarily want to trust it?
Probably not, but we can use some of these things to help us get started and have Cursor write some additional code via a notebook to do that.
So let's put that away for a second.
So here we have a little preview of what it did.
Let's say it gives us a little data structure.
We've got the rank, the name.
So this is the structure of the data and the different categories.
The key findings are we have a top performer.
Number one is Wii Sports.
Top three are all Nintendo, that's interesting.
Ranges, platform distribution, genre.
So there's quite a bit going on.
It talks about the limitations.
Apparently the data stops in 2016, so it can't talk about today.
But still, I think this is pretty amazing.
We've got this markdown file saved here, and we can just use it as part of this.
It's getting to understand the data.
But now it's time to create a notebook and start writing some code.
So I'm going to say, "Excellent. Please create a new Jupyter notebook to start our research. Include the key highlights from your summary in this notebook."
So I had it write down what it's discovered, and now it's going to try to create a notebook that will help us understand these things.
It looks like it's writing a new notebook.
That's cool.
All right, well, that took a minute, but imagine you had to sit down and write down all of these things, discover them, come up with the various plots or data frames to show them.
So let's see what it came up with and we'll run it.
It says, perfect, I've created a comprehensive Jupyter notebook called this.
It has key features, data overview, highlight their data summary, top performers, the complete setup, the library imports, and so on.
Notice it's trying to use matplotlib and Seaborn.
I don't think we have those installed, so that's going to be interesting.
initial exploration, da da da da da.
Let's go ahead and just say run all.
So it's going to run, and I'm sure there's going to be a problem.
For example, what is matplotlib?
What is seaborn?
We don't know.
Error.
Now, I obviously could go fix this, but let's just tell it.
I'll say, look, there's an error running in the notebook.
It'll look and probably discover it's missing some dependencies.
Let's see what it does with them.
So it tried to run pip list.
I don't think pip list exists.
So it ran uv pip list instead.
Looks like it installed pandas, installed matplotlib and seaborn and is running a little Python file or string to make sure everything got installed.
It says it's ready to go, but it didn't put any changes.
I'll just tell it.
I'll just save these to the pip-tools file.
See how it does with that.
All right, it's kind of going sideways with that.
So I'll just take those and put them into this one.
We'll rebuild and reinstall that.
That's all good.
So let's go ahead and run this and just see where I'll, first I'll run all, see what we'll get.
It's gonna work this time.
Oh yeah, look at that.
Cursor away, cursor away.
So here's what it's come up with. It said we're going to use pandas, NumPy, Matplotlib, and Seaborn.
Set that up and it said, we're going to read our CSV file.
That is right there.
Load it up and do a quick look at the shape of it.
That's really cool.
It checks for missing values.
And it says how many are missing a year or publisher.
Okay, not too bad.
Look at some stats.
Here are our top 10 best-selling games.
Apparently Wii Sports, Super Mario Brothers, and Mario Kart.
Wow, Nintendo is really raking it in.
And we've got Tetris, and Wii Play.
Gosh, it's almost all, almost all Nintendo, it's crazy.
You can see these by region and so on.
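Those first exploration steps boil down to a few pandas calls. Here's a sketch using a tiny in-memory stand-in for the Kaggle data; in the video, `pd.read_csv` loads the real file, and the column names below follow the dataset's usual schema, which is an assumption on my part:

```python
import pandas as pd

# A tiny stand-in for the real CSV data
df = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii", "Tetris"],
    "Publisher": ["Nintendo", "Nintendo", "Nintendo", None],
    "Year": [2006, 1985, 2008, 1989],
    "Global_Sales": [82.74, 40.24, 35.82, 30.26],
})

# Quick look at the shape of the data: (rows, columns)
print(df.shape)

# Check for missing values, e.g. games missing a year or publisher
print(df.isna().sum())

# Top sellers, the same way the notebook ranks its top 10
top = df.sort_values("Global_Sales", ascending=False).head(3)
print(top[["Name", "Global_Sales"]])
```

The generated notebook does exactly this kind of thing, just with the full 16,500-row file and nicer presentation around it.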
I want to try to get it to simplify things for us.
So let's say a new chat.
Let's try this.
We'll give it a little confidence boost.
Nice work, to tell it that it's on the right track.
One final request, maybe not final, but one more request.
Please apply coding best practices to break this notebook into more readable granular code.
Let's see what it does with that.
This is the organizing-with-functions idea; breaking things out into separate Python scripts to isolate functionality is great.
A lot of this is what we've actually discussed here. It says it'll help us refactor this notebook following coding best practices.
It sees duplicate cell content and large cells that do multiple things, all the things I told you you shouldn't do.
Let me fix them.
So it's gonna go work on that for us.
So right now I think it's just focused on the notebook itself, but you can see it's writing some of these functions in here.
Let's see if I can find where it's putting them.
There we go.
So it's got some of these functions in here to kind of hide away this functionality, a lot like I discussed, right?
Like, okay, instead of just importing and setting all these values, let's keep things a little more focused.
And it's creating some helper functions farther down, like right here.
That's great.
It has a little to-do thing, like what it's up to.
Breaking down these, remove duplicate, adding data validation functions, and so on.
So it's got a plan, and it's working it through.
All right, it's broken it all down.
It's now working to remove some of the duplication.
That's cool.
But I can keep those so we don't have to look at them.
Now this is not the final product.
This is just one step towards isolating our code so that we can hide away like these utility functions and things like that.
All right, it's done.
It took it a little while now.
That was a decent amount of work, but let's see what it did.
More modular design with helper functions, comprehensive docstrings, better error handling, better code: it broke larger cells into single-responsibility cells and removed duplication.
It had entire sections repeated.
Obviously that's not good.
And it lists out all of the various best practices that it applied.
That's super cool.
So let's go down here, and let me just make sure we didn't get any more raw cells, 'cause that's always a hassle.
There's probably a cursor rules thing I could put in here to fix that, but let's not worry about it.
Okay, let's run it all, see how we do.
Get it to restart so we can see the numbers.
There we go. Okay, that one worked, two worked.
Now we still have these functions in here, so let's hang on to that idea for a second.
Here are the helper functions it's created for us, and then using them for analysis, right?
So we have pretty much the same output as we had before, although it described what it's doing for us down at the bottom.
Let's go and ask it one more thing.
I don't like how it put all of this in here.
So instead, what I'm going to say is, great, we have a bunch of functions.
Can you pull them into Python scripts, organizing them into multiple files as needed?
For example, validation or market analysis might be different than data quality and so on.
And let's let it go on that and see where we end up.
So that's an excellent idea.
I'm very happy that it approves of what we're doing.
Actually, that's kind of neat.
So it's creating a data loader library.
Let's see what it did there.
I'll make a little room for us to watch.
Gives it a doc string, which is cool.
It says set up the display options.
That's not something we necessarily need to see.
Load the data file.
Could be more complex data or whatever, but nice.
I'm not loving the presentation aspect inside the data loader, but that's okay.
Now it's created a data analysis file.
Analyze missing data.
Yeah, that's cool.
We don't actually need to see that.
Distribution, publishers.
Maybe some of these stay as cells, but if we want to just show the results, it's nice to have them here.
Plus we can test them.
It's got a lot of its to-dos done. All right, it thinks it's done. What did it say it did? It took it a while to grind on this, but that's okay.
Better than us doing it.
It's a lot faster than us doing it.
I've successfully extracted all the functions and organized them into modules.
Here's how.
We have the data loader, which has this function, configure pandas and a robust data loading, in case the file's not there or whatever.
And then my data analysis has all of these various pieces here.
Utils, we talked about that.
And even created a dunder init here, which has like documentation for this whole project, turned it into a package, which is a little unexpected, but kind of cool.
It's all right with me.
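To give a feel for the kind of data loader module it produced, here's a minimal sketch of the robust-loading idea. The real generated code uses pandas and its own function names, so treat everything below, including `load_game_data`, as an illustrative assumption:

```python
import csv
from pathlib import Path

def load_game_data(path: str) -> list[dict]:
    """Load the sales CSV as a list of row dicts, failing clearly if the file is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        # Robust error handling: a clear message now beats a confusing traceback later
        raise FileNotFoundError(f"Data file not found: {csv_path}")
    with csv_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

The payoff is that the notebook then just imports this function and keeps its cells focused on analysis instead of plumbing.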
And yeah, I think things are pretty good, but the real test is like, do we have anything worth looking at?
Does it run?
So run all of it.
Okay, I'm having a hard time reading this with all the stuff going on.
So let's do like a final look at this project from Jupyter Notebook.
But I'm pretty pleased. Now, I want to set some perspective here.
And let's go over here to our project and just open it up directly in JupyterLab.
See what the final outcome looks like here.
Notice our virtual environment is active, so I can say JupyterLab.
You can see our markdown file that I gave you for this stuff, but more importantly, all the files that it created and the notebook that it created.
So let's just look through this and see what you think.
Notebook analyzes comprehensive video game sales, data containing this many entries across this time frame.
Here are some key insights, top game leaders.
We maybe want to ask for graphs and pictures of these sort of things.
Research questions we could further explore.
I think those are pretty awesome.
It summarizes the notebook structure.
You could delete this once you get a sense of it or leave it there for people.
And let's just go ahead and do a run all cells to make sure they're working.
So here you can see it's importing the various libraries it created and got everything set up.
We don't need to look at this, right?
We can collapse that down.
Same thing for the display.
Dataset may be relevant, so that could be relevant.
I want to keep that there.
Here's a nice little first few rows, but we don't need to see how we do it.
Just see them there.
I'm pretty happy with this.
Now, you might be thinking to yourself, Michael, sure, it's fun that AI created this.
And yeah, there are some interesting things here, but it took it like 20 minutes.
And it's not perfect.
I bet if you look through here, I bet you could find some way it could have done it better.
like it might have grouped things together that shouldn't be grouped together or separated them when they should be together.
Maybe it moved some of the analysis into a file here when it would be better left as a cell.
Here's how I want you to think of this.
Think of this as if you went to a fairly junior data science person and say, your job this week, or your job the next couple of days is to take this data, come up with some preliminary findings, help me structure what is the data like, what are the interesting questions, and just kind of get the whole process started.
That person would, one, take several days minimum to do that, and two, they also would make small mistakes.
I think one of the mistakes we make is a lot of people expect AI to be perfect because it's computer code, and if it makes a single mistake, it's junk.
"I'm never using it, it's junk, it did terrible." I mean, this thing is pretty decent. It's a pretty decent start. Is it perfect?
No. It looks like there might be a problem here because, again, this cell is marked as Python, but I'm going to switch it to Markdown and run it again. Now it's fine, right?
So no, it's not perfect, but it would also not be perfect if most people worked on it as a preliminary take. So when you think about these things, think of it as a really, really fast and pretty good assistant to help you get going, one that's actually pretty skilled in particular libraries. You could ask it, how can I use Polars to optimize this function? Or, please look for bugs in this code, and so on. For me, going from this CSV file to having this analysis and this understanding already put together 30 minutes later is pretty remarkable.
And for me, that's a huge bonus.
Do you want to just ship this as a final product or report?
Probably not.
But it really gets you super far down the line really, really quickly.
It lets you explore ideas, take tangents, add new libraries you're not particularly familiar with, and so on, to come up with new types of analysis and ask and answer new questions way quicker.
Just keep that sort of perspective in mind, right?
The AIs are awesome, especially the agentic ones.
They're not perfect, but they're pretty awesome.
But people are also not perfect, and they're pretty awesome as well.
So if you set the expectations right, I think you'll find this to be an incredible productivity boost.
And again, the way I open this chapter, this is the way it is now.
In three months, six months, two years, it could be completely different.
Things are changing fast, but the agentic AIs and the agentic AI tools like Cursor and Claude Code are remarkably capable at helping you understand data science and build out data science analysis.
You know, one final thing, I think, before we put this away.
I imagine you've already got the sense of how cool this is, what is maybe possible to start, but our notebook has no pictures.
Let me ask Cursor here to see if it will create some kind of graphs or pictures for us.
So I'll tell it, hey, we could really benefit from having some Seaborn or Matplotlib visualizations.
Help me out here.
Again, I'm doing this from a new chat because the previous one, I don't really need to carry that information over.
It gives it a better chance to stay focused.
So this is gonna add visualizations for top selling games, regional sales breakdown.
Yeah, that sounds good.
All right, it's added a bunch and let's see what it says for its summary.
I've successfully enhanced your notebook with comprehensive visualizations using both Matplotlib and Seaborn: top performers, horizontal bar charts showing the top 10, stacked bars for the regional breakdown, and a pie chart for regional sales as well.
And at the end, it finished out with a platform lifecycle analysis, which is pretty cool.
And it talked about the choices it made for the visualizations.
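A horizontal bar chart like the top-10 one it built is only a few lines of Matplotlib. Here's a hedged sketch with illustrative numbers, not the generated notebook code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also works without a display
import matplotlib.pyplot as plt

# Illustrative numbers only (global sales in millions, roughly matching the dataset)
games = ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii"]
sales = [82.74, 40.24, 35.82]

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(games, sales, color="steelblue")
ax.invert_yaxis()  # biggest seller on top
ax.set_xlabel("Global Sales (millions)")
ax.set_title("Top-Selling Video Games")
fig.tight_layout()
fig.savefig("top_games.png")
```

In a notebook you'd just let the figure display inline instead of saving it; the point is how little code each of these plots really takes.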
Let's go and just run that over here as well.
All right, here's one of our plots.
Which one is this?
Global sales in terms of millions and this.
It's like matplotlib probably here.
Okay, that's super cool.
Oh, it looks like it's having some kind of trouble here.
I'm just gonna give it this and tell it we're having trouble.
See how it'll do.
So again, is it perfect?
No, but you say, look, there's this issue.
And it's like, fine, let me see if I can figure that out.
All right, it thinks it's fixed.
I just kind of like looking at over here in JupyterLab.
It's a little easier for me to see for some reason.
Here we go.
It looks like it fixed all of its issues.
Cool.
So I'll go back up here.
Here's our top 10 selling games as a bar chart.
Come down here.
Look at this.
We've got our regional sales and we've got this stacked bar chart for regional sales breakdown from the different games broken apart by region.
So that's pretty cool.
North America really dominates the Wii Sports.
Here's the different platforms like Wii, Xbox 360 and so on.
A little bit of a warning in the future that might need to change.
We could go back and fix that.
but really nice looking graph.
Nice choice of colors here for the game distribution by genre.
Publisher sales.
Obviously, we know that Nintendo is crushing it.
And look at these final ones here.
This is the analysis over time.
So advanced, the life cycle stuff.
Look at this.
Wow, 2008, 2009 really had a lot of releases, didn't it?
Really nice pictures.
Again, are they perfect?
Probably not.
But are they an awesome starting point?
Oh, 100%.
Nice little heat map.
I love it.
I really love it.
Okay, what's down here?
Some more.
This is a really good-looking lifecycle of the top platforms.
Sales over time.
Here you can see this is like PS2.
PlayStation 3 starts to take over where PS2 is going down; you sort of see the lifecycle of those different ones. So now I feel like our analysis is pretty complete. We have really nice pictures that are a really good start toward building out a final, professional presentation. Hopefully this whole presentation has inspired you to look at some of this agentic AI. Be it Cursor or Claude Code or whatever, it can really do a lot of neat things for you.
Kind of play around and get used to how it works, what it expects, and what's the right size of problem to ask it about.
But once you get it going, it's really going to help you be more productive.