Every organization, big or small, accumulates vast amounts of data, often stored in databases or spreadsheets.
But did you know that data isn't just numbers? It's a treasure trove of insights waiting to be uncovered.
In this course, I'll guide you on your journey with data using Python.
We'll explore the tools and methodologies that industry leaders employ to load, clean, analyze, visualize, and draw meaningful insights from their data.
We'll be using tools like Jupyter, Pandas, Plotly, Scikit-learn, and more.
I'm thrilled to embark on this exciting adventure with you.
If you've been seeking a comprehensive guide on harnessing the power of Python for data processing, you're in the perfect spot.
Let's dive in and unleash the potential of your data.
In this video, I'm going to show you how to set up a local environment so you can run Jupyter locally and access the notebooks.
I've checked out my repo into a directory and what I need to do is make a virtual environment.
I'm going to demo this from a Unix machine. If you're running this on Windows, the commands to create the virtual environment are similar, but the command to activate it is different.
So here's how you would do it on a Unix platform, Mac or Linux.
I'm going to say python3 -m venv and then the name of the virtual environment.
I'm just going to make it in the local directory called env.
At this point, I need to activate this on Unix systems.
I'm going to say source env/bin/activate, and you can see that my prompt has changed.
If I were on Windows, I would say env\Scripts\activate.
There's a batch script that I would run to do that.
At this point, all I need to do is install my libraries.
If you look in the directory here, there's a requirements.txt file.
I'm going to say pip install -r requirements.txt.
After that is done, I should have Jupyter.
On Unix systems, I can say which jupyter, and that tells me that I've installed it in my virtual environment.
On Windows, you would say where jupyter.
At this point, I'm going to launch Jupyter by typing jupyter lab.
At this point, you should see something like this.
Here are the notebooks.
We can click on student01 and you can come down here and you can run this cell.
You can see that that worked.
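For reference, here are the commands from this video collected in one place (Unix shown; the Windows variants are noted in the comments). This assumes the repo has a requirements.txt at its root, as described above.

```
python3 -m venv env              # create the virtual environment
source env/bin/activate          # Windows: env\Scripts\activate
pip install -r requirements.txt  # install the course libraries
which jupyter                    # Windows: where jupyter
jupyter lab                      # launch JupyterLab
```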
In this video I'm going to show how to run the notebooks from Codespace inside of GitHub.
What you're going to do is go to the repo and come over here on code and you're going to say Create Codespace on main.
Just click that button.
What this is going to do is provision a cloud server.
I've already set it up so it will install the appropriate libraries for you.
So we'll just wait for a while for this to set up.
Okay, at this point it looks like it has started.
It's put us into VS Code.
This is running again on GitHub's servers in the cloud.
Bump up the font so it's a little bit easier to see.
What we're going to do is we're just going to go here to notebooks and I'm going to click the student data one.
It's going to open up the notebook up here.
So here we have the notebook up here.
I'm going to scroll down to a cell where I have code.
Click on the cell.
I'm going to hold down ctrl and enter to run it.
And it says up here Type to choose a kernel source.
You can see it's thinking about this.
We will say create a Jupyter kernel.
I'm just going to select that button down there, the Python 3, and it should make one.
Down here at the bottom it says it's connecting to a kernel.
You can see that it's now running.
Codespaces makes it really easy for you to get set up quickly.
All you have to do is click that green button and wait for it to go.
In this video I want to give you a quick overview about how to use Jupyter.
Here's Jupyter running locally.
In short, each of these little boxes is what's called a cell.
I can navigate through them with the up and down arrow.
I can create a new cell by hitting B.
B stands for below.
Create a cell below.
I can delete that cell by hitting DD.
I can create a new cell by hitting A for above.
This is a modal environment.
Right now I'm in what's called command mode, so I'm just navigating through the cells.
If I want to edit this, I need to go into edit mode.
I hit enter and in JupyterLab you can see that the box around the code is now editable.
I can come in here and type code import sys.
Like that.
When I want to run this I just hold down control and hit enter.
That's kind of it.
You need to know how to create cells and you need to know how to run them.
A couple other things.
When a cell is running, this circle up here will be dark.
Also this will be filled in with an asterisk.
When it's finished it will put in a number here.
You can do other things.
You can make new cells that are markdown cells as well.
We're not going to be really getting into that here, but if I'm in command mode and I hit M this changes this to a markdown cell.
Now if I say import sys here and run it, it renders it as text instead of running it as code.
If you happen to convert something to markdown and you want to change it back, you can hit Y and that will change it back to a code cell.
If you need to do anything else, you've got a bunch of commands at the top here.
One of the ones that hopefully you won't need to use but you might is interrupting the kernel.
That stops the current running cell.
If for whatever reason it won't stop, you can restart the kernel by hitting 0 twice, or you can hit this menu item up here.
This was a very brief introduction, but as you watch me use Jupyter in the course, you'll see some of the commands and things I do, and I'll explain how I'm doing that through the course.
In this video, I'm going to show how to use Jupyter in VS Code.
This is similar to JupyterLab.
We have cells here.
I can arrow down between them when I'm in command mode, and I can just hit the up and down arrow.
If I want to go into edit mode, I just hit enter.
You can see that the outline goes in there.
I have a cursor, and now I can edit this.
When I want to run it, I hold down control and hit enter.
So this is the same command here.
You can also hit B to create a cell below.
I can hit D twice to delete that.
If I want to restart the kernel, I can hit this restart button here.
In VS Code, I can also hover between cells, and there's a button here to create a new cell for either code or for markdown.
But I can also, as we saw with JupyterLab, I can just hit M or Y to toggle between markdown cells and code cells.
In this video, I gave you a brief introduction to running Jupyter notebooks inside of VS Code.
Again, the commands are very similar to JupyterLab.
Ctrl+Enter to run a cell, A and B to create cells above and below, and Enter to go into a cell to edit it.
Hey there, I just want to remind you that there are labs so check those out.
You will get a lot more out of the course if you try the labs.
Science tells us that if you just listen to me doing the demos and talking about the code, you'll learn a little bit.
But if you actually use your fingers, you'll use a different part of your brain and you will actually remember this more.
So I highly encourage you to check out the labs.
For every notebook, there's a lab for you to practice on.
In this module we're going to learn about how to use pandas to read zip files and then we're going to look at the PyArrow data type that is new in pandas 2.
I'll also show you some things that I like to do with data when I get a new dataset.
Let's get started.
Our goal for this project is to understand some student data.
What we're going to do is show how to load it from a zip file, look at some summary statistics, explore correlations, look how to explore categorical columns, and make some visualizations about this data.
So the data we're going to be looking at is from University of California, Irvine's machine learning repository.
This is a data set of student performance from Portugal.
Let's load our libraries.
I'm loading the pandas library, and I'm also loading some libraries from the Python standard library to help me fetch files from the internet and read zip files.
The data is located at the University of California, Irvine in the zip file.
And if you look inside of the zip file, there are various files inside of it.
We are worried about the student-mat.csv file.
So what I'm going to do is I'm going to download the zip file using curl.
You'll note in this cell at the front of the cell, I have an exclamation point indicating that I am running an external command.
So curl is not a Python command, but I have curl installed on my Mac machine, and this is using curl to download that data.
Once I've got this zip file, I have it locally.
I can look at it and see that it has the same files.
Now pandas has the ability to read a CSV inside a zip file if there's only one CSV in the zip file.
In this case, there are multiple CSV files inside of it.
So I'm going to have to use this command here, combining the zipfile library with pandas to pull out the file that I want.
Let's run that.
It looked like that worked.
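Here's a minimal sketch of that loading step; the local file name and the member name inside the zip are assumptions based on the UCI student performance dataset described above.

```python
import zipfile

import pandas as pd

# Downloaded above with an external command (the ! prefix in Jupyter runs curl).
zip_path = "student.zip"  # assumed local file name

with zipfile.ZipFile(zip_path) as zf:
    print(zf.namelist())  # several files live inside, so read_csv can't open the zip directly
    with zf.open("student-mat.csv") as f:
        df = pd.read_csv(f, sep=";")  # the UCI student files are semicolon-delimited
```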
I've stored the resulting data frame in this df variable.
Let's look at that.
This is a data frame.
We're going to be seeing this a lot in this course.
A data frame represents a table of data.
Down the left-hand side in bold, you see the index.
In this case, it's numeric.
Pandas puts that in for us if we didn't specify one.
There are 395 rows and 33 columns.
So we're only seeing the first five rows and the last five rows, and we're only seeing the first 10 columns and the last 10 columns.
You can see that there's an ellipsis in the middle separating the first 10 columns from the last 10 columns, and another ellipsis separating the first five rows from the last five rows.
Now, once you have a data frame in pandas, there are various things you can do with it.
One of them might be to look at the memory usage.
I'm going to look at the memory usage from this data frame.
And it looks like it's using 454 kilobytes of memory.
Now, one of the things that pandas 2 introduced is this pyarrow backend.
So I'm going to reload the file using dtype_backend equal to pyarrow and engine equal to pyarrow.
It looks like that worked.
Let's look at our memory usage now.
And we see that our memory usage has gone to 98 kilobytes.
Prior to pandas 2, pandas would back the data using numpy arrays.
And NumPy arrays didn't have a type for storing string data.
So it was not really optimized for storing string data.
Pandas 2, if you use pyarrow as a backend, does have a string type that we can leverage.
And that's probably where we're seeing the memory savings.
Now, we are getting that memory savings by saying dtype_backend is pyarrow.
So instead of using NumPy, the dtype_backend parameter says use pyarrow to store the data.
The other parameter there, engine equal to pyarrow, is what is used to parse the CSV file.
The pyarrow library is multi-threaded and presumably can parse files faster than the native pandas parser.
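A sketch of the two options side by side, assuming the CSV has been extracted locally:

```python
# Default: NumPy-backed storage and the C parser.
df_numpy = pd.read_csv("student-mat.csv", sep=";")

# pandas 2 with pyarrow installed: Arrow-backed storage and the Arrow parser.
df_arrow = pd.read_csv(
    "student-mat.csv",
    sep=";",
    dtype_backend="pyarrow",  # how the data is stored once loaded
    engine="pyarrow",         # which parser reads the file
)

# The string-heavy columns account for most of the difference.
print(df_numpy.memory_usage(deep=True).sum())
print(df_arrow.memory_usage(deep=True).sum())
```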
Okay, the next thing I want to do is I want to run this microbenchmark here.
And that's going to tell us how long it takes to read this file using pyarrow as the engine.
And it says it takes six milliseconds.
Let's run it without using pyarrow and see how long that takes.
Now, %%timeit is not Python code.
This is cell magic, something that's unique to Jupyter that allows us to do a microbenchmark.
Basically, it's going to run the code inside the cell a number of times and report how long it took.
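For example, a cell like this (the magic has to be the first line of the cell, and the file name here is an assumption) times the default parser; a second cell with engine="pyarrow" gives the comparison:

```python
%%timeit
pd.read_csv("student-mat.csv", sep=";")  # defaults to the C engine
```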
Interestingly, in this case, it looks like we are not getting a performance benefit from using the pyarrow engine to read the CSV file.
It looks like it's a little bit slower.
When you're running a benchmark with Python, make sure you benchmark it with what you will be using in production, the size of the data that you will be using in production.
In this case, we saw that using that pyarrow engine actually didn't help us.
It ran a little bit slower.
But the numbers here are so small that it's not really a big deal.
If you were going from minutes to seconds, that could be a huge savings.
Another thing that you can do with Jupyter is you can put a question mark after a method or a function and you can pull up the documentation here.
You see that read_csv has something like 40 different parameters.
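The cell itself is just this (the trailing question mark is Jupyter/IPython syntax, not plain Python):

```python
pd.read_csv?  # opens the docstring in Jupyter's help pane
```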
If we scroll down a little bit, I think we'll find engine in here.
Let's see if we can find it.
And there it is right here.
So let's scroll down a little bit more.
There is documentation about engine.
So let's read that.
Here it is.
It says that this is the parser engine to use.
The C and pyarrow engines are faster, while the Python engine is currently more feature complete.
The pandas developers have taken it upon themselves to write a CSV parser that will read 99.99% of CSVs in existence.
The pyarrow parser is not quite as feature complete, but can run faster on certain data sets.
To summarize, we've learned that we can use pandas to read CSV files.
We also learned that pandas 2 has some optimizations to make it use less memory.
In this section, we're going to look at summary statistics for that student data that we just loaded.
Let's get going.
Here's the summary statistics.
This is taken from that University of California, Irvine website.
We've got multiple columns in here describing a student.
And at the bottom here, we've got grades.
This data set was used to look into what features impact how a student performs on their grades.
And we see that there's a G1, G2, and G3, which are the grades.
Now I'm not really going to get into modeling in this section here, but we will look at some of the summary statistics.
So the first thing I generally do when I've got a data set is I'm going to look at the types of the data.
And with Pandas, we can say .dtypes.
This is going to return what's called a Pandas series.
And in the index of this series, we see the columns, and on the right-hand side, we see the types.
In this case, you'll notice that in brackets, we have PyArrow indicating that we are using PyArrow as the back-end, and we have optimized storage there.
We also see that there's int64s.
So those are integer numbers that are backed by PyArrow.
They're using 8 bytes to represent the integer numbers.
And we're not seeing any other types other than strings and integers here.
Another thing I like to do with Pandas is do this describe method.
I was once teaching this describe method to some of my students when I was doing some corporate training, and when I did it, someone went like this and hit themselves in the head, and I asked them, what?
What happened?
Did I say something wrong?
And they said, no, but we just spent the last three weeks implementing this same describe functionality for our SQL database.
So this is one of the nice things about Pandas.
It has a bunch of built-in functionality that makes it really easy.
Describe is one line of code, and you get a lot of output from it.
So this is returning a Pandas data frame.
Pandas is going to reuse a data frame and a series all over the place.
In this case, the index is no longer numeric.
In the bold on the left-hand side, we can see count, mean, std, min.
That's the index.
You can think of those as row labels.
Along the top, we have the column names.
These correspond to the original column names, but these are the numeric columns.
So for each numeric column, we have summary statistics.
Count has a specific meaning in Pandas.
Generally, when you think of count, you think of this as how many rows we have.
In Pandas, count doesn't really mean that.
It means how many rows don't have missing values.
You just need to keep that in mind when you're looking at that count value.
Mean, that's your average.
Standard deviation is an indication of how much your data varies.
We have the minimum value.
At the bottom, we have the maximum value.
In between there, we have the quartiles.
I like to go through this data and look at the minimum values and the maximum values to make sure that those make sense.
Maybe look at the median value, which would be the 50th percentile.
Compare that to the mean to get a sense of how normal or how skewed our data is.
Also, look at those counts to see if we have missing values as well.
In this case, it looks like most of our data is 5 or below.
We do have some going up to 22 or 75, but most of it is not very high.
It doesn't look like we have any negative values.
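The two inspection steps described here are one line each:

```python
df.dtypes      # column types; int64[pyarrow] and string[pyarrow] with the Arrow backend
df.describe()  # count (non-missing), mean, std, min, quartiles, max for the numeric columns
```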
Now, remember, we just looked at that Dtypes attribute, which said that we are using 8-byte integers to store this information.
Most of these values don't need 8 bytes to store them.
In fact, all of them could be represented with 8 bits of memory.
We could use pandas to convert these integer columns to use 8 bits instead of 8 bytes for each number.
That would use one-eighth the amount of memory.
We could shrink this data even further than we got by using PyArrow without any loss of fidelity in our data.
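A sketch of that shrinking step, assuming every integer column really does fit in the int8 range, which describe suggested:

```python
# Find the Arrow-backed 64-bit integer columns and cast them to 8-bit integers.
int_cols = [col for col in df.columns if str(df[col].dtype) == "int64[pyarrow]"]
df_small = df.astype({col: "int8[pyarrow]" for col in int_cols})

print(df.memory_usage(deep=True).sum(), df_small.memory_usage(deep=True).sum())
```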
There are a bunch of other things that we can do.
One of the methods is the quantile method.
I'm going to run that.
This actually failed.
Let's scroll down and look at the error here.
It says, arrow not implemented.
It says, function quantile has no kernel matching input type strings.
The issue here is we have non-numeric columns.
To get around that, we can specify this parameter, numeric_only equal to True.
This is going to give us back a series.
Why did this give us back a series?
Because this is an aggregation method.
You can think of our original data as 2 dimensions.
We are taking the quantile, the 99th percent quantile.
That is taking each of those columns and telling us what's the 99th percentile of that.
It's collapsing it to a single value.
Because we have 2 dimensions, we're going to collapse each of those columns to a single row.
Pandas is going to flip that and represent that as a series where each column goes in the index and the 99th percentile goes into the value.
You'll see that Pandas uses data frames and series all over the place.
You need to get used to these data structures.
The quantile method has various parameters that you can pass into it.
In Jupyter, I can hold down shift and hit tab to pull up that documentation.
You can see that this Q parameter, the first parameter, accepts a float or an array-like or a sequence-like parameter.
In this case, instead of passing in 0.99, a scalar value like I did above, I'm going to pass in a list.
Let's say I want the first percentile, the 30th percentile, the 50th percentile, the 80th percentile, and the 99th.
When we do that, instead of getting back a series, we're now going to get back a Pandas data frame.
But if you look in the index here, the index is the quantiles that we asked for.
This illustrates that power of Pandas that you can do relatively complicated things with very little amount of code.
Also, you need to be aware that this is kind of confusing in that you can call the same method and it might return a one-dimensional object or it might return a two-dimensional object depending on what you're passing into it.
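Both calls, for comparison:

```python
# A single q aggregates each numeric column to one value and returns a Series.
df.quantile(0.99, numeric_only=True)

# A list of qs returns a DataFrame with the requested quantiles in the index.
df.quantile([0.01, 0.3, 0.5, 0.8, 0.99], numeric_only=True)
```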
In this section, we looked at summary statistics of our data.
Once you've loaded your data into a data frame, you're going to want to summarize it to understand what's going on there.
That describe method is very useful.
Then there are various other aggregation summaries that we can do as well.
I showed one of those, which is the quantile method.
I want to explore correlations.
Correlations are the relationships between two numeric columns.
And this is a good way to understand if one value is going up, does the other value go up or down or does it have no impact on it.
So let's see how we can do that with pandas.
I'm going to say df.corr and I'm going to pass in that numeric_only=True because otherwise it's going to complain about the non-numeric columns.
And look at what this returns.
It's a data frame.
In the index we have all the numeric columns and in the columns we have all the numeric columns.
In the values here we have what's called the Pearson correlation coefficient.
This is a number between negative one and one.
A value of one means that as one value goes up the other value goes up in a linear fashion.
If you were to scatter plot that you would see a line going up and to the right.
A correlation of negative one means that if you scatter plotted it you'd see a line going down and to the right.
A correlation of zero means that as one value is going up the other value might go up or down.
You might see a flat line but you also might see alternating values.
As one value increases the other value may or may not increase.
They don't have a relationship to each other.
Now humans are not optimized for looking at big tables of data like this.
Generally what I want to do when I have this correlation table is to look for the highest values and the lowest values.
But I might want to look for values around zero and it's kind of hard to pick those out.
If you look, you might notice that along the diagonal we do see a bunch of ones, and that's because the correlation of a column with itself is one: as the column goes up, the column goes up.
So you do see that value there but we're actually not interested in that value.
We want to look at the off diagonal values.
So let me give you some hints on how we can do this.
One of the things that pandas allows us to do is add a style.
So I'm going to use this style attribute and off of that I can say background gradient.
Let me note one more thing here.
This is showing how to use what's called chaining in pandas.
I'm actually doing multiple operations to the same data frame here and I put parentheses around it.
What that allows me to do is put each step on its own line and that makes it read like a recipe.
I'm first going to do this then I'm going to do this then I'm going to do this.
Do I need parentheses?
No I don't.
If I didn't use parentheses I would have to put all of that code on one line and it gets really hard to read.
So I recommend that when you write your chain, you put a parenthesis at the front and a parenthesis at the end, and then just put each operation on its own line.
It's going to make your life a lot easier.
Okay so what we've done is we've added this background gradient.
The default gradient here is a blue gradient.
It goes from white to blue, dark blue.
Again along that diagonal you do see the dark blue but this is actually not a good gradient.
What we want to use when we're doing a heat map of a correlation is to use a color map that is diverging.
Meaning it goes from one color and then hopefully passes through like a light or white color and goes to another color.
That way we can look for one color for the negative values and the other color for the positive values.
So let's see if we can do that.
I'm going to specify a diverging color map.
That's RdBu, the red-blue color map.
And it looks like we are seeing those diverging values now.
Now there is one issue with this.
The issue is that if you look for the reddest values I'm seeing pretty red values for example around negative 0.23.
That's not negative one and I would like my red values to actually be at negative one because I also want my white values to be around zero.
If I look at my white values it looks like they're around 0.42 right now.
Note that the blue values are at one.
Again that's because that diagonal by definition is going to be one.
So pandas has an option for us to do that.
We can specify these vmin and vmax values to specify where those get pinned.
And when we do that we actually get a proper coloring here.
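Written as a chain, the heat map described here looks roughly like this:

```python
(
    df
    .corr(numeric_only=True)  # Pearson correlation matrix
    .style
    .background_gradient(cmap="RdBu", vmin=-1, vmax=1)  # diverging map, ends pinned at -1 and 1
)
```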
Now this makes it really easy to find the reddest values and I can see that failures have a large negative correlation with the grade.
Again we do have that diagonal there but we want to look at the off diagonal values for correlations.
And over there at grades we can see that grades are pretty highly correlated with each other.
Probably makes sense that if you did good on the first test you probably did good on the second test etc.
Another thing that you can do with the correlation is you can change the method.
I can say instead of doing the Pearson correlation coefficient which is the default one I can do a Spearman correlation.
A Spearman correlation does not assume a linear relationship; it's also called a rank correlation.
So you might see a relationship that, if you did a scatter plot, curves.
That could still have a Spearman correlation of one: as the rank of one goes up, the rank of the other goes up, but it's not a linear correlation.
So oftentimes I do like to do a Spearman correlation instead of the Pearson correlation which is the default value.
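Switching methods is one keyword argument:

```python
df.corr(method="spearman", numeric_only=True)  # rank-based instead of linear
```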
In this section I showed you how to look at correlations.
I showed you one of my pet peeves: I often see, on social media and other places, people showing these correlation heat maps where they'll throw a color map on them, but they don't pin those values.
So make sure you use a diverging color map when you're coloring this and make sure you pin those values so that the negative value is pinned at negative one and that light value goes at zero.
In this section, I'm going to take you through what I like to do with categorical columns.
So let's get going.
First of all, let's just select what our categorical columns are.
In Pandas 1, we would do it this way.
We would say select_dtypes with object.
Again, that's because Pandas 1 didn't have a native way to represent strings, and so it used Python strings, which are objects in NumPy parlance.
In Pandas 2, we do have that ability.
So if we do say string here, we get back a data frame and all of the columns here are string columns.
Now, I want to summarize these.
I can't use those same summary statistics that I did use with describe up above, but I can do some other things and I'll show you those.
Alternatively, we could say select_dtypes with string[pyarrow], and that gives us the same result in this case.
Is there any value to that?
Not necessarily.
It's a little bit more typing.
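The selections described above, side by side:

```python
df.select_dtypes("object")           # pandas 1 habit; with the Arrow backend this comes back empty
df.select_dtypes("string")           # the string columns under pandas 2
df.select_dtypes("string[pyarrow]")  # same result here, just more typing
```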
I want to show you my go-to method.
So when we're doing a lot of these operations, I like to think as Pandas as a tool belt.
It has 400 different attributes that you can do on a data frame and 400 different things that you can do to a series.
Do you have to memorize all of those?
No, you don't.
But I want to show you common ones and you can think of them as tools.
You put them in your tool belt and then we use these chains to build up these operations.
Your go-to when you're dealing with string or categorical data is going to be the value_counts method.
Let's look at that.
Let's assume that I want to look at this famsize column, which is the size of the family.
You can see the column over here, but let's explore that a little bit more.
So all I'm going to do is say, let's take my data frame, pull off that famsize column, and then do a value_counts on that.
What this returns is a Pandas series.
Now, let me just explain what's being output here because it might be a little bit confusing.
At the top, we see famsize, and that is the name of the column, or the name of the series in this case.
Then on the left-hand side, we see GT3 and LE3.
Those are the values and they are in the index.
The actual values of the series 281 and 114 are on the right-hand side.
At the bottom, we see name.
Name is count.
So that is derived from doing value counts there.
We also see the dtype.
It says this is an int64.
So the type of the series is a PyArrow int64.
Let's do the same thing for higher.
We'll do value counts, and you can see that we get back a series with those counts in that.
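Those two calls, for reference (column names as they appear in the dataset):

```python
df["famsize"].value_counts()  # values go in the index, counts in the values
df["higher"].value_counts()
```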
Now, if we want to compare two categorical or string columns with each other, pandas has a built-in function to do that called crosstab, or cross tabulation.
What that is going to give us is a data frame, and we'll see in this case we have sex in the index and higher in the columns, and then it gives us the count of each of those.
This has various options.
Again, we can put our cursor there, hold down shift and hit tab four times there to pull up the documentation.
So there's a lot of things we can do.
Turns out Pandas has pretty good documentation.
So check that out if you want to.
I'm not going to go over all that right now.
But an example is we can say normalize=True.
Now, instead of having the counts there, we have the percentages, and this is normalized over all of the values in there.
If I want to format that and convert that into a percent, we can say style.format, and now I'm getting percents there.
I can say I want to normalize this across the index.
So what does that do?
It says I want to take each row and normalize each row.
So we're going down the index and normalizing each row.
I think that normalizing across the index is a little bit weird.
To me, this seems backwards.
To me, it seems like we're normalizing across the columns instead of the index.
But if we want to normalize down a column, then we would say normalize columns there, and we're normalizing down the columns that way.
Pandas has some warts.
I'll be the first to admit it.
And oftentimes, when we are doing aggregation operations, if we want to sum across the columns, we would say axis is equal to columns, and we would sum across the columns.
In this case, this normalize here seems a little bit backwards, but we'll just deal with it.
It is what it is.
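A sketch of the cross tabulation and the normalize options discussed above:

```python
pd.crosstab(df["sex"], df["higher"])                     # raw counts
pd.crosstab(df["sex"], df["higher"], normalize=True)     # fraction of the whole table
pd.crosstab(df["sex"], df["higher"], normalize="index")  # each row sums to 1
(
    pd.crosstab(df["sex"], df["higher"], normalize="columns")  # each column sums to 1
    .style
    .format("{:.1%}")  # show as percentages
)
```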
In this section, I showed you how I would look at string data.
Generally, I'm going to take that value counts and quantify what is in there.
Oftentimes, we can see whether we have low cardinality, if we have few unique values, or if we have all unique values, we can see that relatively quickly.
If I want to compare two categorical values, I'm going to use that cross tabulation to do that.
In this section I want to show you some visualizations that you can do really easily with pandas.
So if I've got a numeric column, I like to do a histogram on it.
So I'm going to say, let's take the health column, which is this numeric value from 1 to 5.
And this is the health of the student.
And all I do is pull off the column and then say .hist.
Now I am saying figsize is equal to (8, 3).
figsize is a matplotlib-ism.
This is leveraging matplotlib.
Now you do see a space in here around 2.5.
The issue here is that by default we are using 10 bins here and these values only go up to 5.
So I might want to come in here and say bins is equal to 5 and change that.
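So the call ends up as:

```python
df["health"].hist(bins=5, figsize=(8, 3))  # one bin per 1-5 rating
```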
Oftentimes people say they want to look at a table of data.
And again, humans aren't really optimized for that.
If I gave you a table of the health column and said like, what does this have in it?
It's hard for you to really understand that too much.
But if you plot it, if you visualize it using a histogram, it makes sense.
And that's a great way to understand what's going on with your data.
Let's just take another numeric column.
We'll take the final grade and do a histogram of that.
In this case, I'm going to say bins is equal to 20 because this value goes up to 20.
This is really interesting to me.
You can see that there's a peak there at 0, indicating that you do have a large percent of people who fail.
And then it looks like around 10, you have another peak.
That's probably your average student.
So this is illustrating not a bell curve, so to speak, but the distribution of grades, which I think is interesting.
And it tells a story just by looking at this.
Again, could we tell this by looking at the column of data?
It would be really hard to do.
But giving that plot there makes it relatively easy.
If I have two numeric columns and I want to compare them, I like to use a scatter plot.
We're going to plot one value in the x-axis and another value in the y-axis.
Pandas makes it really easy to do this as well.
What we're going to do is say df and then .plot, which is an attribute on the data frame.
And from that plot, we can do various plots here.
So one of those is scatter.
In fact, there's also a hist there as well.
So hist is on the data frame and on a series directly, but both are also available from the plot accessor.
In order to use the scatter plot, we need to say what column we want to plot in the x-direction and what column we want to plot in the y-direction.
So we're going to plot the mother's education in the x-direction and their final grade in the y-direction.
And I'm just going to change the size of that so it's 8 by 3.
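The scatter call looks like this; the column names (Medu for the mother's education, G3 for the final grade) come from the UCI dataset rather than being spelled out in the video.

```python
df.plot.scatter(x="Medu", y="G3", figsize=(8, 3))
```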
Here's our plot.
When I look at this plot, a couple of things stand out to me immediately.
One is that we see these columns: values at regular intervals here.
So this tells me that we have gradations that are at some level, which kind of makes sense.
Our grade is at the whole number level.
You don't have like a 15.2 or a 15.1.
You just have 15, 16, 17, et cetera.
Makes it very clear when you see the scatter plot.
The other one is that we're seeing columns there.
And so you can think of the mother's education, it is a numeric value, but it's also somewhat categorical in that it's lined up in columns.
So I'm going to show you some tricks to tease that apart and understand what's going on here.
If you just look at this plot on its own, it's hard to tell where the majority of the data is.
So I'm going to show you how we can find out what's going on behind this plot.
One of my favorite tricks with a scatter plot is to adjust the alpha.
Now, if I just see a bunch of dark values there, what I want to do is I want to lower that alpha, which is the transparency, until I start to see some separation there.
I think that looks pretty good.
I might even go a little bit lower.
You can see that I'm now starting to see some faded values here.
So by looking at this, this tells a different story to me than this value up here.
This is telling me that we have more values at 4.
How do I know that we have more values at 4?
Because it's darker there when we lowered the alpha.
We're not really seeing that so much on this plot.
What's another thing we can do?
Another thing that we can do is add jitter.
Basically, we'll add a random amount to the data to spread it apart and let us see what's going on inside of that.
So I'm going to add jitter in the x direction to spread apart that mother's education value.
I'm going to use NumPy to do that, and this is going to use the assign method.
The assign method lets us create or update columns on our data frame.
I'm going to say let's make a new column called EduJit, and it's going to consist of the mother's education plus some random amount.
I'm using NumPy random to generate some random values there.
In this case, the amount is 0.5.
I don't want my random values to overlap values from another value, so I'm keeping them within a certain width.
Then I'm going to say on that new data frame, let's plot that.
Let me just show you that this is pretty easy to debug once you have these chains here.
You can actually say here's my data frame, and then I want to make a new column.
There is my new column.
It popped over there on the end.
Now once I have that, I'm going to plot the new column in the X direction and plot the grade in the Y direction.
We get something that looks like this.
This also tells us a different story than this one up here.
I think this is a much better plot, letting us see where the majority of the data is.
Now I have inlined that Jitter functionality right here, but it's pretty easy to make a function to do that.
I'm going to write a function down here in this next one called Jitter.
Then to leverage that, I'm going to say, okay, EduJit is now this result over here.
Now let's explain what's going on here.
On the right-hand side of a parameter in assign, up above here, you can see that we passed in a series and we're adding some amount to it.
This is a Pandas series up here.
Down here, this is a lambda function.
We can pass in a lambda function on the right-hand side.
What happens when we pass in a lambda function?
When you have a lambda function inside of assign, Pandas is going to pass in the current state of the data frame as the first parameter to that lambda function.
Generally, you will want that lambda function to return a series because you want that to be what the column is.
Now do you have to use lambdas?
No, you don't have to use lambdas.
You can use normal functions as well.
Oftentimes, it is nice to use lambdas because you want that logic directly there inside.
When you're looking at your code, the logic's right there.
If you were to repeatedly use the same lambda all over the place, then I might recommend moving that out to a function so you only have to write it one place.
Let's run that and make sure that that works.
That looks like that works as well.
If this jitter was useful, what I would do is make a helpers file, and I would stick that jitter into the helpers file so I can leverage that.
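Here's one way to write that jitter helper and use it in the chain; the exact column and helper names are illustrative, not the literal ones from the notebook.

```python
import numpy as np


def jitter(ser, amount=0.5):
    """Add a small uniform offset so overlapping points spread apart."""
    return ser + np.random.uniform(-amount / 2, amount / 2, len(ser))


(
    df
    # assign passes the current DataFrame to the lambda; return a Series for the new column
    .assign(edu_jit=lambda df_: jitter(df_["Medu"]))
    .plot.scatter(x="edu_jit", y="G3", alpha=0.05, figsize=(8, 3))
)
```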
I also want to look at how to visualize string data.
What I'm going to do is just tack a .plot.bar onto my value_counts.
When we do a bar plot in Pandas, what it does is it takes the index and it puts it in the x-axis.
Then each of those values for those index values, it plots those as bar plots.
Once you understand that, it makes it really easy to do bar plots.
Let's see what happens when we run .plot.bar.
We should see mother and father and other go into the x-axis.
We do see that.
This is a little bit hard to read because I have to tweak my head to the side.
Generally, when I'm making these bar plots, I prefer them to be horizontal.
To make a horizontal bar plot, I just say barh.
There we go.
There's our visualization of that.
We can see that most of the guardians are actually the mother in this case.
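The whole thing is one short chain:

```python
df["guardian"].value_counts().plot.barh()  # categories in the index become the bars
```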
In this section, we looked at how to visualize your data.
I'm a huge fan of visualization because I think it tells stories that you wouldn't get otherwise.
Once you understand how to make these visualizations in Pandas, it's going to make your life really easy.
Okay, I hope you enjoyed this module.
We looked at loading some data and doing some basic exploratory data analysis, trying to understand what's going on with our data.
These are steps and tools that I will use every time I load data.
So I want you to make sure that you understand these, but also that you start practicing them, because they'll apply to most data sets that are in tabular form.
In this section, we're going to look at working with Excel data.
I'm going to show you how to load data from Excel, and then mess around with it in Pandas, and then export it back to Excel as well.
Excel tends to be one of those things that rules the world.
A lot of data is stored in Excel, and oftentimes bosses want reports back in Excel as well.
So Pandas makes it relatively painless to ingest data from Excel and export it to Excel as well.
More specifically, what we'll be doing is we'll load Excel files and then we'll find missing values, we'll summarize some of the data, we'll look at a correlation matrix, and we'll create some scatter plots.
And then we'll export the data back to Excel.
I'm going to cheat a little bit.
I'm going to load in a CSV file and then create an Excel file from that to simulate the Excel file that we're going to read from.
And then we'll show how to do some fancy exporting later on as well.
I'm importing my pandas library here and again I'm using data from University of California, Irvine.
This is a dataset about how much money people make.
Let's read that in.
In this case, if you look at what I just read in here, you can see that this is actually not the data.
This is information about the data.
This is just a file with information about the data, about what's going on there.
At the bottom of this you can see that there are the columns and some descriptions about the columns.
I'm going to take the column descriptions there and I'm going to use some Python code to split that out and make this names variable.
That's what it looks like.
It turns out that the data inside of this doesn't have column names, so I'm going to want to throw those column names on there.
Let's use read_csv to get that data.
It turns out pandas can read from a URL.
This is a single file, but I need to pass in the names there.
Let's look at our data frame after we do that.
Here is our data frame.
That looks pretty good.
I'm just going to make a directory here.
I've already made it.
I'm going to export this data as an Excel file.
I just showed you how to write an Excel file.
You just use to_excel.
That's a very basic way of doing it.
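A sketch of that load-and-export step. The URL and column names come from the UCI adult income dataset description; the label name "income" and the output path are my assumptions.

```python
import pandas as pd

names = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

df = pd.read_csv(url, names=names)  # the data file has no header row
df.to_excel("data/adult.xlsx")      # requires openpyxl or xlsxwriter to be installed
```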
Again, later on I'll show you how to do a little bit more complicated Excel exporting.
Let's read our Excel data now.
So one of the things you'll want to make sure you have installed is this OpenPyXL.
Pandas is going to leverage that, but if you don't have that, your imports might fail.
So I'm going to import my Pandas data.
Let's look at what this data looks like.
So this is our export.
This is a relatively straightforward Excel file.
Let's load this in in Pandas.
Instead of using read_csv, I'm going to use read_excel.
I'm going to say dtype_backend is pyarrow because I want to have pyarrow representing my data.
Here is my data frame here.
Now one of the things that might stick out to you is that first column that says unnamed 0.
What's going on there?
Well, if we look at our data, you can actually see we have that A column which is the index.
So I already have an index in there.
When I read my Excel file, Pandas didn't see that as an index.
It just saw it as a column.
So I'm going to tell it, with index_col, that column 0 is the index column.
Let's run that and see what happens when we do that.
Okay, that's looking a little bit better.
Let's also just check our dtypes there.
It looks like we do have pyarrow types.
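The read call described here, with an assumed file path:

```python
df = pd.read_excel(
    "data/adult.xlsx",
    index_col=0,              # the first column in the sheet is the saved index, not data
    dtype_backend="pyarrow",  # keep the Arrow-backed types
)
df.dtypes
```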
Very basic to read an Excel file.
You can use read_excel.
Again, there are various options for this.
You can pull those up in Jupyter and check out various options to tweak how you import Excel data.
Another thing that's very common to do with data is to look for missing values.
So let's see how we can do that.
This is another tool to stick in your tool belt.
One of the 400 things that you can do with a Pandas DataFrame is this isna.
And when we do that, we get back a DataFrame, but if you look at this DataFrame, in the values of it, we have true-false values.
So there's going to be a true any place where the value is missing.
And if I look at this, it looks like I'm not seeing any trues just from spot-checking it, but really, as I've said multiple times, humans are not optimized for looking at tables of data.
They're optimized for finding patterns and visualizations or seeing things that pop out.
So if you feel this urge to look through a big table of data, your spidey sense should go off, telling you that instead of doing that, you should use a computer to either visualize the data or filter it down to what you want.
In this case, we might want to filter to see if those values actually are missing or quantify those.
One of the many things that we can do with a DataFrame is this any method.
any is an aggregation, and what it's going to do is aggregate the DataFrame above.
In this case, are any of the values truthy in each of the columns?
So this is going to collapse that, give us back a series, and it looks like there aren't any missing values.
Alternatively, this is a cool thing that you can do.
You can say sum here, and in this case, this counts how many missing values there are.
You can also do mean and multiply that by 100.
That gives you the percent of missing values.
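The three aggregations in one place:

```python
df.isna().any()          # True for any column with at least one missing value
df.isna().sum()          # number of missing values per column
df.isna().mean() * 100   # percent missing per column
```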
In this case, this is not very interesting, but let me just show you another example here.
I'm going to say df, and let's say I want to know what is the count of folks whose age is greater than 50.
So I'm going to come down here and say age, and then say .gt(50), and this is a Boolean array.
It has true-false values in it, so if I wanted to find the count of those, I can just say let's sum that.
So there are 6,460 people whose age is greater than 50.
If I want to know what percent that is, I can say mean and then .mul(100), and this didn't work: it didn't have a mul here.
That's because mean is an aggregation; in this case I'm working on one dimension and aggregating it to a scalar value, so I'm now entering into the Python realm.
This is a 0.19, so in this case, I would come in here and say times 100.
So 20% of our people are greater than 50.
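The same Boolean trick written out:

```python
over_50 = df["age"].gt(50)  # Boolean Series, one True/False per row
over_50.sum()               # how many rows have age > 50
over_50.mean() * 100        # what percent of rows have age > 50
```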
In this section, I showed you how to quantify how many values are missing.
That can be really important, especially if you're doing machine learning, because a lot of machine learning algorithms don't like to have missing values.
Also, if you have survey data, you might want to understand what percent is missing.
If you're reporting on that, and 99% of it is missing, it might not be really interesting to report on.
I'm going to show how to explore some of the object columns again.
Let's jump into that code.
So again, what I can do is say select_dtypes and then say string.
This will give me all of the string columns.
In Pandas 1, you would say object there, but because we have those PyArrow types, we can say string here.
Remember what I said previously, value counts is your friend here.
So let's explore education.
I'm going to say education.value_counts.
And here's a summary of that.
If I wanted to visualize that, look how easy this is.
I'm going to say .plot.barh to do a horizontal bar plot, and we can visualize that really easily there.
So we see that most of these are high school graduates.
We have some college graduates, some masters, etc.
If I want to filter columns, I want to get the columns that have education in them.
One of the things I can do is use this filter operation, and here are the columns that have education in them.
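The filter call looks like this:

```python
df.filter(like="education")  # columns whose name contains "education"
# filter also takes regex="..." or items=[...] for other ways of matching
```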
Note that this value_counts also works with numbers as well.
So if we take the age column, we might want to summarize the age.
Again, I'd probably do a histogram here, but we can do a value_counts on that.
We can sort the index there.
In fact, if we do a plot and we do a bar on that, we're kind of getting the histogram by doing that.
So this is a very manual way of doing a histogram here.
Again, I would probably just do age.hist to get a similar thing here.
If we want to bump up the bins, we'd say bins equal to 20, and maybe we say figsize, so it doesn't come off the screen, (8, 3).
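The manual version and the histogram shortcut, side by side:

```python
df["age"].value_counts().sort_index().plot.bar()  # counts per age, one bar per value
df["age"].hist(bins=20, figsize=(8, 3))           # the usual shortcut
```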
In this section, we looked at pulling out those object columns.
Again, value_counts is your friend to summarize those.
We also saw that one of the things that you can do is you can use filter to limit what columns you're pulling out as well with a regular expression.
Filter has a bunch of other options as well.
Again, I recommend that you pull that documentation up in Jupyter and see how to use it in other contexts.
Okay, so I want to look at the numeric columns as well, understand those.
I'm doing this, it might seem a little repetitive, but the point here is that these are standard procedures that I'm going to do to a lot of data sets.
So let's do a correlation here.
We can see our code that we had from before.
We're going to say I want to do correlation and use a diverging color map from red to blue, and we're going to pin those values from negative 1 to 1.
We can look for the bluest values and the reddest values.
In this case there doesn't appear to be very many strong correlations other than the correlation of a column with itself.
Let's do a scatter plot to see if we can get some insight into what's going on here.
So to do a scatter plot you need to say .plot.scatter on a data frame, and then we specify the column that we want in the x and y direction here.
So I'm going to say in the x direction, education number, in the y the capital gain, and again we're seeing something similar that we saw previously in our other scatter plots.
We're seeing these things line up in columns.
So one of the first things I'm going to do is I'm going to adjust my alpha.
My general standard practice is just to lower that until we start seeing some of these values fade away.
I might even go even lower than that, but I think this tells a slightly different story than what we're seeing above here.
We can also use our jitter code.
If I put it in the helpers file, now I can say import helpers, and I can say, okay, let's make education be a column that is jittered from the education-num column, and then let's plot the education column there.
Okay and so I think this tells a different story.
I probably would lower the alpha even more, so maybe let's bring it down to 0.2 or 0.1.
I might even go lower than this.
How do I tell how much alpha is enough?
There's no hard rule of thumb.
My take on this is if I'm seeing really dark concentrated values, I kind of want to lower that until I start seeing some transition in that.
So maybe let's go 0.05 here, and what I'm seeing here is around 9 and 10 and 13.
It's still really dark, and so maybe I'll even go to 0.01.
So I think this is telling a different story than what we saw above.
We're seeing that education around 9.
Again, if we want to evaluate this, we can look at df education-num, and if I try to do education-num with attribute access, I'm going to get an issue here.
It's actually trying to do education minus num, so to pull that column off I need to use the square-bracket indexing syntax here.
And then with that I can do a histogram here, and that validates what I said before that around 9 we're seeing the majority of our data or a concentration, and we are seeing that.
We're also seeing 13 here, which is probably high school graduation, and probably a lot of junior high grads as well, so that is a way to understand just the single column there, but not the relationship between columns.
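Putting the pieces from this section together; the column names come from the UCI adult dataset, and helpers.jitter is assumed to be the jitter function moved into a local helpers.py as described earlier.

```python
import helpers  # assumed local module containing the jitter function

(
    df
    .assign(education=lambda df_: helpers.jitter(df_["education-num"]))
    .plot.scatter(x="education", y="capital-gain", alpha=0.01, figsize=(8, 3))
)

# The hyphen means df.education-num parses as subtraction, so use brackets.
df["education-num"].hist(bins=20, figsize=(8, 3))
```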
We just demonstrated how to visualize and understand that relationship between numeric columns.
Again, correlation is your friend.
Oftentimes you might not see a strong correlation.
Visualization using scatter plots can also show you relationships between those columns as well, or at least the distributions.
Okay, let's explore how to write Excel.
First of all, you're going to want to make sure you have this XlsxWriter library installed.
Again, Pandas is going to leverage that under the covers, but it won't install it by default.
And then I've just got some code here to show various features of writing this.
So I'm going to say, let's make an Excel writer.
And then I'm going to call toExcel.
I'm actually calling toExcel twice.
The first one I'm saying onSheet1, just put all the data.
Next one's saying onSheetLT30.
I'm going to query everything that has an age of less than 30.
And then after that, I'm going to say, okay, let's get the workbook from this, which represents basically the Excel file.
I'm going to change the format to the workbook down below.
I also want Sheet1, and I'm going to change Sheet1 down below.
You can see that I'm looping over my columns, and on my worksheet, Sheet1, I'm changing the format from those column headers.
And then also for my workbook, I'm saying on all of the cells, set text_wrap to true.
And I'm also saying for worksheet Sheet1, set autofit.
So size those columns based on the length of their contents.
And then we're going to close our file there.
So a little bit more code to do that.
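A sketch of that export; the file path, colors, and format details are illustrative rather than the exact ones from the video.

```python
with pd.ExcelWriter("data/export.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Sheet1")                  # all the data
    df.query("age < 30").to_excel(writer, sheet_name="LT30")  # just the under-30 rows

    workbook = writer.book
    sheet1 = writer.sheets["Sheet1"]

    # Highlight the column headers on Sheet1.
    header_fmt = workbook.add_format({"bold": True, "bg_color": "#FFCC99", "text_wrap": True})
    for col_idx, name in enumerate(df.columns):
        sheet1.write(0, col_idx + 1, name, header_fmt)  # +1 skips the index column

    sheet1.autofit()  # size columns to their contents (recent xlsxwriter versions)
```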
Let's open this up and see what it looks like after we do that.
Okay, here's our file.
You can see in Sheet1, we have highlighted those column headers with our code to loop over those.
And you can see we also made this LT30.
All these ages in here should be less than 30.
And if we look on the left-hand side, that age column looks like that is the case.
In this section we looked at loading and exporting Excel.
We also again did some of our common operations to the columns just to understand the data and see what's going on inside of that.
Okay, now we want to look at how to merge data.
Oftentimes we have data sitting over here, we have data sitting over here, and we need to combine those.
It turns out that much like you can do in a database, you can also do this in Pandas.
You can combine that data.
Okay, so I'm going to show us how to combine data using pandas.
I'm also going to show you some features of pandas that we can use to validate merge and debug what's going on here, and then at the end we'll export this to Excel.
So let's run our imports here.
And let's load our data.
This is coming from Kaggle.
This is data about New York City.
So let's read that into this AB data frame here.
And this is Airbnb data about locations in New York City.
Now I also have temperature about New York City, so let's load the temperature data as well, and it's a data frame that looks like this.
So it has latitude and longitude and temperature there, and you can see our data up here has latitude and longitude as well.
Our goal is to merge this and have temperature data with our apartment data.
Okay, let's look at merging.
It turns out that one of the 400 different things you can do is merge.
And again, you can come in here and you can pull up the documentation for merging and see that there are a bunch of parameters here.
So I'm just going to try and do this naively here.
We'll say ab.merge with temps.
And when I try and do that, I get a failure.
And the message is actually useful.
It says there's no common columns to merge on.
So by default, pandas is going to look for common column names and it's going to merge on those.
Let's use some pandas to inspect what are the common column names.
And it looks like none of them are common.
So here's the column names from our Airbnb data.
And here's the column names from our temperature data.
So one of them has latitude and longitude spelled out.
The other one has lat and lon.
So in order to merge those, what we're going to say is ab.merge(temps), and then we specify left_on and right_on.
In this case, the left is referring to ab, the frame we called merge on, and the right is referring to temps.
So left_on takes columns from ab, which has latitude and longitude spelled out, and right_on takes lat and lon from temps.
And when we do that, we get something that looks like this.
If we scroll to the right, we can see that we now have lat, lon, and temperature in addition to our Airbnb data.
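So the call ends up as (ab and temps being the two DataFrames loaded above):

```python
merged = ab.merge(
    temps,
    left_on=["latitude", "longitude"],  # columns in ab
    right_on=["lat", "lon"],            # columns in temps
)
```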
When you want to merge something in pandas, you can use the merge method.
Again, by default, it's going to look for common column names.
But if you don't have common column names, you can specify the column names.
Also, if you want to merge by the index, you can tell it to do that as well.
One of the things you might want to do with merged data is to validate it and see if it was successful.
This is a challenge with SQL when you try and merge something.
Oftentimes it's unclear, did the join work?
What's going on there?
Pandas has some functionality to make this a little easier.
Okay, so I'm going to demonstrate this with just a different dataset here.
I've got the left and the right dataset.
In this cell, I'm showing the display function which Jupyter provides you.
This allows us just to output multiple things in a single cell.
So we've got this data frame left that has names and pets, and we have a data frame right that has names and ages.
So let's look at the different merges that we can do.
We can do an inner merge.
And so here is an inner merge.
You can see that Ravi and Jose are in both of my datasets, but Sally is only in one.
So the inner merge doesn't include Sally.
I can do a left merge, which is going to say, take everyone from the left data frame.
In this case, it should have Ravi and Jose in it.
I can do a right merge, which should include Sally.
And we can do an outer, which is going to include everyone from both.
And we can also do a cross merge.
And a cross merge is something you need to be careful with.
Basically, it's going to say for every row in one data frame, merge it with every row in the other data frame.
So you can see that for the Ravi from the left, we have Ravi, Jose, and Sally from the right-hand side, and for each Jose in the left, we also have Ravi, Jose, and Sally from the other side.
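With the small left and right frames from this demo, which share one common column, the join types look like this:

```python
left.merge(right, how="inner")  # only keys present in both frames
left.merge(right, how="left")   # every row from left, matched where possible
left.merge(right, how="right")  # every row from right
left.merge(right, how="outer")  # every row from both
left.merge(right, how="cross")  # every row from left paired with every row from right
```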
One of the super convenient parameters that comes with this is the validate parameter.
And here I'm going to say I want to validate that this is a one-to-one merge.
And when I run this, I get an error.
And it says that the merge keys are not unique in the left data set.
And if we look at that left data set, again, it looks like this.
And you can see that Jose is repeated there.
So it is impossible to do a one-to-one merge here because Jose is repeated.
So if we want to do this, what it's really going to be is a many-to-one merge, and we can specify that with an m-to-one.
And you can see that that comes out without a complaint.
Pandas provides a nice functionality in this validate parameter.
And you can specify '1:1' for a one-to-one merge, 'm:1' for many-to-one, '1:m', or 'm:m'.
Note that validation on 'm:m' doesn't really check anything, but you can put it in there as a hint to your reader about what kind of merge you're expecting to perform.
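Here is a hedged sketch of the validate parameter with made-up left and right frames (names and values are illustrative, not the exact course data):

```python
import pandas as pd

left = pd.DataFrame({'name': ['Ravi', 'Jose', 'Jose'],
                     'pet': ['cat', 'dog', 'fish']})
right = pd.DataFrame({'name': ['Ravi', 'Jose', 'Sally'],
                      'age': [31, 28, 40]})

# 'Jose' repeats on the left, so a strict one-to-one check raises MergeError.
try:
    left.merge(right, on='name', how='inner', validate='1:1')
except pd.errors.MergeError as err:
    print(err)

# Declaring the merge as many-to-one matches the actual shape and passes.
ok = left.merge(right, on='name', how='inner', validate='m:1')
print(ok)
```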
|
|
show
|
2:36 |
In this part, I want to show you a really cool debugging trick that I think is super useful in Pandas, especially when you start building up these chains.
One of the methods in Pandas is the pipe method.
And what the pipe method allows you to do is pass in an arbitrary Python function, and Pandas is going to call that Python function passing in the current state of the data frame as the first parameter.
In this case, I've defined a function called limit, and it has two other parameters, n rows and n columns, and it's just going to do some slicing using iloc to do that.
So not particularly interesting, other than we are leveraging that with pipe instead of calling iloc directly there.
Now, let me show another way of using pipe.
I like to make these little debug helpers here.
So I've got a function up here called debug.
Note that the first parameter is df, and then I have an optional extra there.
And here's my chain down here.
Note that I've got merge in there, but before I do the merge, I'm going to do this debug call, and I'm going to say extra is before, and then after I do the merge, I'm going to say extra is after.
Let's run that, and you can see that it outputs the before shape and the after shape.
So before we had 16 columns, after we have 19 columns.
You can see that the number of rows does not change.
This is super useful if you're doing complicated merges, and you can see what happens to the number of rows and columns when you're doing that.
So let's do a little bit more here.
I'm going to do a merge, and then after that, I'm going to group by the neighborhood and get the mean values of all the neighborhoods.
Okay, and it looks like that failed.
It did not work.
Let's scroll up and look at the output here.
We do see the before and after, but we see a complaint here, and it's complaining about this mean here.
It says that you can't aggregate on a mean on non-numeric columns or string columns there.
So in Pandas 2, if we want to do this mean, we need to tack in that numeric only as true.
So let's run that again here, and now we can see that we have the before, after, and then we have the summary there, and you can see that the shape does change.
The shape had 48,000 rows before, and after we do the group by, there's only five neighborhood groups, so we have five rows after doing that, and we only have 13 numeric columns.
This pipe trick is super useful.
I use it all the time.
So in this case, I'm using it to debug the shape, but you can use it for other things as well.
It's limited by what you can stick in a function.
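A minimal sketch of that debug-with-pipe pattern; the debug helper and the ab/temps names are assumptions that mirror the discussion above:

```python
def debug(df, extra=''):
    # Print the current shape without breaking the chain, then hand the frame back.
    print(f'{extra}: shape={df.shape}')
    return df

result = (ab
          .pipe(debug, extra='before merge')
          .merge(temps,
                 left_on=['latitude', 'longitude'],
                 right_on=['lat', 'lon'])
          .pipe(debug, extra='after merge'))
```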
|
|
show
|
2:19 |
I want to show you how to clean up the columns a little bit, so let's look at that.
Okay, here's my chain up here.
Let me just comment this out and we'll walk through it.
So one of the things that people say is it's hard to understand what's going on with a chain, and generally when I'm making these chains, I'm building them up from scratch.
So I might say, here's my data frame here, and then maybe I'll just stick in this debugging info before, and note that this debug returns the df; if I scroll up here and look at the debug function that I defined up here, you can see that it returns df.
When I'm using it with pipe, I can continue operating on that data frame.
So this is just having a side effect of printing out that output.
Now let's do our merge.
Okay, it looks like our merge returned this output over here.
I can pipe in how big that is, and now let's look at the columns of that.
Here are the columns after I've done the merge.
One of the things that I like to do is I like to explicitly list out the columns.
So I'm going to use .loc to do that, and loc if you're not familiar with it, is a little bit weird.
It's not a method to be specific, it's a property, and we index off of the property, so we use the square bracket syntax here.
Also, this is a little bit interesting as well.
If you look at what we're indexing with, in this case, we're actually indexing with a tuple.
We've got a colon here.
This is the row selector, so this is saying take all of the rows, and then we've got a comma, and this is the column selector.
So these are the columns that we're taking.
So let's run that, and you can see that we've limited it to 17 columns.
So this might seem like a small thing, but this is one of those practices that I've found is actually very useful.
You want to be flexible in what you receive, but you want to be strict in what you output.
So we can use that loc just to add a hint to the end user.
These are the columns that are coming out of this.
This makes it really easy when you come to the code.
You want to know what's coming out of it.
You can see at that last step of the chain.
So I recommend, especially if you're doing long pipelines or you're doing things for machine learning, as a last step, just put in that loc and make sure that you are explicit about what columns come out of that.
Especially if your data is changing in the future and new columns come in, you want to be explicit about what columns come out.
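As a sketch, ending a chain with an explicit .loc column list might look like this (column names are illustrative, not the exact Airbnb schema):

```python
final = (ab
         .merge(temps,
                left_on=['latitude', 'longitude'],
                right_on=['lat', 'lon'])
         # The ':' row selector keeps every row; the list is the explicit column contract.
         .loc[:, ['name', 'latitude', 'longitude', 'temp']])
```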
|
|
show
|
0:56 |
Okay, let's export this to Excel, and we're just going to use our friend ExcelWriter here.
I've got my chain up here.
I'm going to say, take ab, and I'm piping debug.
Then we're going to merge it, and then I'm going to pipe again, some more debug, and then I'm going to specify the columns.
That is my Excel data frame.
And then I'm going to say, to Excel, and I'm also going to make another sheet called Brooklyn, and I'm going to filter just the Brooklyn values there.
So let's run that.
It's taking a while.
Oftentimes, exporting to Excel is a little bit slower.
Let's open up our data, and this is what it looks like.
Here's all of the data.
You can see that we've got those 40,000 rows here, and then here's the Brooklyn data.
So the Brooklyn data only has about 20,000 rows or so.
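A hedged sketch of that export; the file name, sheet names, and the neighbourhood_group column are assumptions about the data:

```python
import pandas as pd

with pd.ExcelWriter('listings.xlsx') as writer:
    # One sheet with everything, one filtered down to Brooklyn.
    final.to_excel(writer, sheet_name='All data', index=False)
    (final
     .query('neighbourhood_group == "Brooklyn"')
     .to_excel(writer, sheet_name='Brooklyn', index=False))
```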
|
|
show
|
0:30 |
We just learned about how to merge data in Pandas.
Again, we want to specify those columns that we want to merge on, and we'll probably want to validate that.
So we can use Pandas code to validate that, but we can also use additional debugging code to look at the shape and size and make sure that what's coming out of that makes sense.
Then I gave that hint about being explicit about those columns.
Again, this is not something that you have to do, it's actually more typing, but it's going to make your code more robust and easier to use.
|
|
|
24:06 |
|
show
|
0:38 |
Now I want to look at summarizing data.
So we're going to take a retail dataset and summarize it.
This is something that is super useful for reporting, and these are going to be, again, tools in your tool belt that will make it really easy for you to make little reports for your boss or to answer questions that they might ask about your data.
More specifically, we'll talk about using Feather to speed up data loading, and then we'll show how to do grouping.
We'll plot those aggregations, we'll limit that, and we'll show how to use built-in functions for aggregation, and we'll also manipulate visualizations.
|
|
show
|
0:33 |
Okay, let's load our data.
This is coming from UCI, University of California, Irvine Machine Learning Repository.
There is a spreadsheet there that has sales data.
Let's load that.
Seems like it's taking a long time to load.
Again, note that we are reading Excel data.
Oftentimes when we're loading and writing Excel data, those tend to be slow operations because that code is often written in pure Python.
Okay, that looks like it was successful and we've got 500,000 rows after doing that.
|
|
show
|
0:49 |
In this section I want to show a way to speed up loading your data.
So I'm going to write this data to a format called feather and I just want to time how long it takes to read the feather data compared to how long it takes to read the Excel data.
I'll sit here and let that Excel data load again.
So this time it took 26.7 seconds here, so I'm going to change this to 26.7 and this is going to be 20 milliseconds for reading the feather data set.
So quite a huge improvement there.
Again, if you want to have your data stored in other formats such as feather or parquet, you're going to save a large amount of time over reading it from Excel.
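The caching pattern looks roughly like this (file names are illustrative; Feather support comes from the pyarrow package):

```python
import pandas as pd

# Pay the slow Excel read once, then cache to Feather.
sales = pd.read_excel('online_retail.xlsx')
sales.to_feather('online_retail.feather')

# Later sessions reload from Feather in a fraction of the time.
sales = pd.read_feather('online_retail.feather')
```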
|
|
show
|
3:48 |
I want to show how to do some exploratory data analysis on this sales data, so let's jump into that.
Again, first what I like to do is do a describe here that gives me summary statistics.
If we look at this, we can see that we've got quantity of things that we sold.
We have a date, we have a unit price, and we have a customer ID.
Again, what I like to look at generally is counts.
Our counts look like they're pretty similar except for customer ID.
It looks like some of our customer IDs are missing.
Note that unit price goes negative as well.
This might be a little weird, but apparently these are refunds.
Also, it looks like quantity is going negative as well.
We've got negative 80,995 and we have positive 80,995.
That's a little weird just looking at that, that we have this weird large number and weird small number and they're the same, they're just opposite signs.
It's probably the case where someone bought a bunch of things and then return them.
That might also indicate that there's maybe test data in this dataset that they tried something out and then undid that.
That might be something that I might want to check out further.
I'm not going to do that here, but just by looking at that, that is an example of ideas that I get by looking at these summaries and things to explore.
Let's visualize the unit price.
Again, I'm just going to throw a histogram on that.
If you look at that, this is not particularly interesting per se.
It looks like our unit price is around zero, but it goes out to 40,000 here and negative 10,000.
What's going on there?
We do know that we have values going from those ranges, because if we didn't, the histogram wouldn't show that.
Maybe I'll bump up the bins a little bit and say bins is equal to 30.
It looks like the vast majority of our data is around that small value there.
Let's see if we can dive into this a little bit more.
I'm going to say, let's look at sales where unit price is less than zero.
We see that we have two sales.
The description is 'Adjust bad debt'.
Let's look at sales where quantity is less than zero.
There's a lot of entries there.
There's actually 10,000 rows where quantity is less than zero.
Let's look at a customer.
Maybe we can say, I want to look at customer 17548.
You can see 17548 is up here.
They bought 12 pink paisley tissues.
Here is the purchases for that customer.
There's a lot of negative quantities there, but there are also some positive quantities there.
We've summarized some of our numeric columns.
Let's look at our string columns here.
It looks like invoice number is a string, stock code is a string, description, and country are strings.
Country makes sense that that's a string.
Again, our friend value counts, we can come in here and quantify that value counts.
Stock code, maybe not quite so clear that it's a string.
If you look at that, it looks numeric there.
Let's try and understand what's going on there.
Here is stock code.
Now, by doing this, again, value counts is our friend.
We can see that we have things like letters tacked onto the end of that.
That makes sense that that is a string.
Just showed you how to do some basic exploratory data analysis.
Again, start with those summaries.
Then sometimes we get interesting insights and we start digging into those from that.
Showed a little bit of examples of that by looking at those negative values and digging into customer IDs, that sort of thing.
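A short sketch of that exploration loop, assuming standard retail column names (Quantity, UnitPrice, Description, CustomerID) that may differ in the actual notebook:

```python
print(sales.describe())                  # summary statistics for the numeric columns

sales.UnitPrice.plot.hist(bins=30)       # distribution of unit price

# Dig into the oddities the summary surfaced.
refunds = sales[sales.Quantity < 0]
print(refunds.Description.value_counts().head())

print(sales[sales.CustomerID == 17548])  # all purchases for one customer
```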
|
|
show
|
2:43 |
Okay, so let's start looking at aggregation.
Let's assume that your boss says I want to look at sales by year.
What does that even mean?
Well, it means that we want to group things by year and then summarize the sales for each of those years.
Turns out that this is relatively easy to do in Pandas.
Once you understand the syntax, Pandas will allow you to do these group by operations with relative ease.
The first thing I'm going to want to do here is make a total value here.
So I'm going to take my quantity and multiply it by the unit price and then I have a total value over there.
Next thing I want to do is get a year entry.
Now I don't have a year entry right now, so I'm going to want to make a new column that has a year entry.
How can I do that?
Well, I have this invoice date that is a date object.
If you look at the dtype of it, because it is a date object, it has this dt accessor which allows us to access various attributes of that.
This is really cool.
Once you've converted something to a date, you can do various operations with that.
So in this case, we want to pull off the year and you can see that I can make a year column over there.
Now at that point, once I have the year column, I can say I want to group by year.
Now Pandas is lazy when you do a group by.
It doesn't do anything until you summarize that.
So what do we want to do?
Let's just say let's take the sum of every numeric column for each year.
When we do that, we get our friend not implemented error.
This is an error that we see often in Pandas 2 because we're using the arrow back end, but it's complaining about strings not supporting sum.
So how do we get around that?
We're going to say numeric only is true.
Here is our summarization by year.
You can see that for the year 2010, we have a total of 748,000 pounds (this data is from Great Britain).
For 2011, it looks like we have almost 9 million.
Now one of the nice things I can do here once I have this little chain here, so this is what I had before, I can say let's pull off that total column.
There's a series and now let's do a bar plot of that.
We can visualize that relatively easily.
So there is our visualization of sales from 2010 to 2011.
In this example, I gave a walkthrough of how we would summarize by year.
First of all, we want to figure out what we want to summarize by and we might need to make a column to summarize by that column.
Once you've made a column with what you want to summarize by, you can say group by with that column in it and then you need to provide the aggregation.
In our case, the aggregation was the sum and we summed up everything for that year column.
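Putting those steps together, a sketch of the yearly summary might look like this; column names such as Quantity, UnitPrice, and InvoiceDate are assumptions, and InvoiceDate needs to be a datetime for .dt.year to work:

```python
yearly = (sales
          .assign(total=lambda df: df.Quantity * df.UnitPrice,
                  year=lambda df: df.InvoiceDate.dt.year)
          .groupby('year')
          .sum(numeric_only=True))    # strings can't be summed, so numeric only

yearly.total.plot.bar()               # one bar per year
```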
|
|
show
|
6:05 |
Let's assume that your boss comes in and says, thanks, I also want to see summaries by country.
How would we do that?
I've got a chain here that does it.
Again, I'm going to show you how I'd build this up.
First of all, we would want to get our totals and our years.
Now I'm going to say, let's group by country.
We already have that column in there.
When we do that, it is lazy.
It's not going to do anything until we tell it to aggregate.
In this case, we'll say, sum the numeric values.
And now we can see for each country, those are the sums.
I'm going to pull off the total here.
And we have that.
Now I'm going to visualize that.
We can do a bar plot.
And this lets us clearly see what that looks like.
Now let's assume that we want to look at everything but the UK.
The UK is blowing everything else out.
So how would we do that?
Well, the first thing that we could do is we could put a query in here and say, I want to have all the rows without UK.
And again, I can come in here and comment this out and sort of walk through this chain here, what's going on.
So this is our data without the UK.
And you can see that instead of 500,000 rows, we only have 46,000 rows now.
And then we'll add our columns.
And then we'll do our group by.
And that's lazy.
Let's take the sum.
There we go.
Let's pull off the total.
OK, and let's plot that.
Remember, a bar plot is going to take the index and put in the x-axis and then plot a bar for each of those values.
OK, let's sort this and change it to horizontal.
So I'm not going to walk through all of this.
But here's what we had before.
If we want to sort that, we say sort values.
If we want to do a horizontal bar plot, we say barh instead.
We might want to change the figure size to make it so these aren't overlapping.
We could make this a little bit taller if we really cared about that.
Let me show you another technique that we do.
And instead of doing that, we might want to say, OK, let's look at the first n values and everything else let's put into an other bucket.
So let's say here's countries.
And here's the counts for those countries.
And let's say we want to look at the top 10 countries.
So how could we do that?
I'm going to say, OK, let's take the sales and pull off countries and then look at the top counts for each of those countries.
And then that's going to, well, let me run that here.
And if you look at top here, this is the value counts.
And it is in order.
If you pull off the index, those countries are in order.
So we could pull off the first 10 of those.
And that would be the top 10 countries.
So that's what's going on down here.
I'm saying top n, those are the top 10 countries.
Now down here, I'm updating the country column.
And I'm saying, OK, I want the country column to be the country column.
And then we're using this where command.
This where method is a little bit weird.
But basically what you do is you pass in a Boolean operation here where this Boolean operation is true.
You keep the value of this.
Where it's not true, you put in this value.
So it's a little bit weird.
It's like an else statement.
If it's true, keep the value else put in this other value.
Let's just look at that.
So you can see that we have United Kingdom and France.
We're not seeing other here because there's a lot of United Kingdom in there.
If we pull off that, we're still not seeing that.
But let's do a value counts there.
And now we see that other is right there.
OK, this is actually super useful.
I actually make a little helper here, put this in a function so we can do something like this where we say limit n.
So you pass in a data frame into that as the first parameter.
Because you have that, you can use this for pipe.
But you can also use this for assign.
And then you pass in the column that you want to limit.
And I could also pass in how many values I want to limit and other as well.
But here we're going to say let's just limit country.
And here's the top 20 countries with everything else in other.
So here's how I'd use this in a chain here.
Let's just walk through this here.
I want you to get used to understanding how these things work.
Here's our original data frame here.
You can see that we have United Kingdom in the first couple rows there.
I'm going to filter out United Kingdom.
So we have 541,000 rows.
Now we have 46,000 rows here.
The next thing I'm going to do is I'm going to make some new columns or override columns.
I'm going to update the country column.
So instead of being country, I'm going to limit it to the top 20 countries.
So I'm going to use a lambda here.
And inside of the lambda, I'm going to say let's call limit_n with df and country.
Why can't I use limit directly here?
Because if you look at my function up here, I need to specify a column.
And I can't do that just by passing in the column here.
I need to basically wrap that with a lambda that dispatches to limit n with the column that I want.
Let's just run that.
We should also get a total in a year column.
There is a total in a year column.
We're not really seeing any update here because France was in the top 20.
But we should see that down below here.
Let's group by country.
That's lazy.
It doesn't do anything until we do an aggregation.
We'll do a sum aggregation.
And if you look at this, we should see an other in here.
There it is.
There is an other.
Again, if I comment this out, this will be a lot longer here.
So by changing that, we are limiting that to just the top 20.
And then we'll pull off the total column.
Let's sort those values.
And then let's plot that.
In this example, we built up operations to summarize by country.
And we also showed you a nice technique to limit the number of countries, folding the remaining countries into an 'Other' value.
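Here is a sketch of that top-N-plus-Other helper and the chain around it; the column names and the limit_n signature are approximations of what's described above, not the exact course code:

```python
def limit_n(df, col, n=20, other='Other'):
    # Keep the n most frequent values of col, lump everything else into `other`.
    top = df[col].value_counts().index[:n]
    return df[col].where(df[col].isin(top), other)

by_country = (sales
              .query('Country != "United Kingdom"')
              .assign(Country=lambda df: limit_n(df, 'Country', n=20),
                      total=lambda df: df.Quantity * df.UnitPrice)
              .groupby('Country')
              .sum(numeric_only=True)
              .total
              .sort_values())

by_country.plot.barh(figsize=(6, 8))
```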
|
|
show
|
3:35 |
Your boss is liking all this stuff that you're able to give them now.
They're asking for sales by month.
Okay, so here is our original data, and we're going to say let's make that total column, and I'm going to show you a new way to get that value, the month, without actually making a column.
What I'm going to do is say pd.Grouper, and I'm going to say I want to group by this invoice date.
But look at this, I say freq is equal to M, so what is this going to do?
It's going to complain.
It's going to say, nope, I can't do that.
This is only valid with a DatetimeIndex, but it got an instance of a plain Index.
This is actually a Pandas 2 bug; hopefully they'll be fixing this soon.
So I'm going to look at the dtypes of sales here, and you can see that we have the date; it's this timestamp, a PyArrow time.
I'm actually going to convert this to a pandas time, so let's do that here; I'm going to say astype.
If you look at this, we don't really see much difference here, but if we look at the dtypes now, you can see this is datetime64.
So hopefully in a soon-to-be-released version of pandas this will be fixed, but I'm going to add that total column there, and then we're going to group by, and I'm going to say that column, which is the date column that we just changed to a NumPy date, and this freq here, M, is the month frequency.
So let's do that.
That's going to be lazy; it's going to give us that group by object, and then we're going to say I want to summarize the numeric columns.
Now look at the index here: instead of having a month, what we have is the end of each month.
So this is really cool; with relatively little code (again, I did have to change the type because PyArrow doesn't support that) I was able to summarize by month.
I'm just going to look at the memory usage of our old data here; there's not a difference in memory usage, it's just that one's using NumPy and the other one's using PyArrow.
Okay, so one of the things I like to do with this, once I have that, so here's what we had, is to throw on a plot here to visualize it.
So look at this, it's going to do a line plot; we haven't seen line plots yet.
But here's our data; this is a series, and when we do just .plot, by default it's going to do a line plot.
It's going to put the index on the x-axis, in this case the index is dates, and then it's going to draw a line for those values.
So it's really easy to make a line plot in pandas; just call .plot there.
This makes it really clear that November is where we have the most sales.
But this is aggregated at the month level; watch this, I can change this freq value here from M to W, and now we are looking at an aggregation of the weekly values.
In fact, I can change it to D and aggregate at the day level, which is kind of cool, and I can do 3D and aggregate at every three-day interval.
So pandas makes it really flexible to aggregate at different date intervals.
We call these the offset aliases, and we use them with pd.Grouper.
So the pd.Grouper syntax is a little weird, but once you get used to it, it's able to do very powerful things.
|
|
show
|
5:30 |
Okay, let's do something a little bit more complicated.
Your boss is like, I really like what you do.
Let's look at sales by month, by your top end countries.
Okay, so let's walk through this.
We're actually grouping by two things now.
We want month and countries.
So I'm going to say let's group by, and I'm going to say PD grouper, and I'm going to also passing country in there.
So I'm passing in a list, we're going to group by the day frequency, and then country there.
And that's lazy, it doesn't do anything until we aggregate.
Now, if you look at this, we have the end of each day, and for each day we have each country.
And then if we got the total there, we get something that looks like this.
This is kind of weird looking.
Remember I said that a series is one dimension.
This does not look like one dimension starting to look like a data frame.
However, it is in monospace font.
This is a series.
What's going on here?
In the left-hand side, we've got invoice date and country.
Both of those are actually the index.
The value of the series is a number on the right-hand side.
So we call this a multi-index or hierarchical index.
This is what happens when we group by two or more items, we get a hierarchical index.
These can be somewhat of a challenge to work with, so let me show you some best practices for that.
So one of the things I like to do with this is to unstack this.
So here's our data, and let me show you what unstack does.
What unstack is going to do is it's going to take the innermost index and rotate it up into the columns.
So watch what happens here.
Okay, this is what we had before.
We're going to take country and rotate it up.
And there we go.
Now in the index, we have invoice date.
Now we have a data frame, and in the columns, we have all of the countries.
Now there are a lot of missing values.
Now we see a lot of those NAs, because this is sparse data.
Well, what we can do is we can fill those values with zero, and we get something that looks like this.
Okay, so 38 columns, 305 rows.
One of the cool things that we can do now is we can plot that.
So there's what's going on there.
Is this the world's greatest plot?
No, not necessarily.
People would call this a spaghetti plot.
There's a lot of data that we're plotting here.
You can see this pink line here.
That's probably United Kingdom, which is the majority of our data.
So this plot is not particularly useful.
So what can we do?
Let's remove United Kingdom.
Okay, now it's looking like just a bunch of colors there.
Let's move the legend to the side.
So I'm going to put in this line right here.
This is going to push the legend off to the side.
Okay, there's still a lot of colors.
This is hard to see.
Here's a technique that visualization experts use.
If you want to emphasize a single thing, color it and make everything else gray.
So let's see if we can apply that here.
In this case, I'm going to emphasize Spain.
Everything here is similar to what we had before, but I'm going to say, let's pipe this set colors here before we call plot.
So let's look at what set colors does.
You can see there's a global variable called colors, which is a list.
I get the data frame as the first parameter.
I get Spain as the country, and I'm going to loop through my columns.
And if my column is not equal to that, I'm going to append the normal, which is this gray color.
After I do that, at the very end, I'm going to append the highlight color, which is that red color.
And I'm going to append my country to my columns here.
I'm actually reordering the data frame and making a global colors that's aligned with the colors there.
Why am I reordering the data frame?
Because the last column is going to be plotted on top, and I want that to be the colored column.
Okay, so once I've done that, it actually is not going to do probably what we think it will.
So it doesn't spit anything out.
Why doesn't it spit anything out?
Because we're assigning to AX here.
So if I spit out AX, here is AX.
So this is the data.
We want to plot this.
So let's call this piped plot, and let's see what happens when we do that.
So we're going to take a data frame here, and we're going to say, plot the data frame, and we're going to use the colors from that global variable.
And we're going to add a title here.
This is going to return a Matplotlib axes object.
That's the AX here.
I'm going to, with that Matplotlib axes, move the legend to the side.
So (1, 1) means move it to the right-hand side at the top.
And then I'm going to make it have two columns.
I'm also going to set the Y label to US dollars, and I'm going to return the data frame as the output of that.
So this is assigned to ax here.
It really shouldn't be called ax; it should be named for the data frame it holds.
We'll just call that final here.
But if we look at the output down here, we can see that here is our plot.
So here is Spain in red, and we can clearly see what's going on with Spain.
Again, the other columns are not super bright, but we're focusing on Spain.
If we wanted to focus, let's say on Finland, we could just come in here and change this to Finland.
And that's what's going on with Finland.
We just showed you how to do a more complicated group by group by two things.
We showed how to do unstack to rearrange the data, and then we showed a visualization technique to actually draw attention to what we want to focus on.
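Here is a condensed sketch of that two-level group by, the unstack, and the gray-plus-highlight plot; column names, colors, and the Spain choice mirror the discussion above but are otherwise assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

pivoted = (sales
           .assign(InvoiceDate=lambda df: df.InvoiceDate.astype('datetime64[ns]'),
                   total=lambda df: df.Quantity * df.UnitPrice)
           .query('Country != "United Kingdom"')
           .groupby([pd.Grouper(key='InvoiceDate', freq='D'), 'Country'])
           .sum(numeric_only=True)
           .total
           .unstack()        # rotate Country from the index up into the columns
           .fillna(0))

highlight = 'Spain'
# Plot every other country in gray, then the highlighted country (last column) on top.
ordered = [c for c in pivoted.columns if c != highlight] + [highlight]
colors = ['#cccccc'] * (len(ordered) - 1) + ['#d62728']

ax = pivoted[ordered].plot(color=colors, legend=False,
                           title=f'Daily sales with {highlight} highlighted')
ax.set_ylabel('Sales')
plt.show()
```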
|
|
show
|
0:25 |
I hope you enjoyed this section.
We learned about summarization.
This is super powerful.
Once you understand how to do this, you can do some really cool things with pandas.
Your boss is also going to ask you to do a lot of things because they know that you have these superpowers.
We also looked at visualization.
Oftentimes I will combine grouping with visualization because visualization makes it really easy to tell that story.
|
|
|
23:26 |
|
show
|
0:37 |
In this example, I'm going to walk you through cleaning up data.
Oftentimes you'll come across data that is not in a place where you can do machine learning, modeling with it, or you can do proper analysis, so we need to clean it up.
Let's look at what we need to do.
The dataset we're going to look at is a heart disease dataset.
Our goals are to load multiple files.
I'll show how to do that.
We will look at the columns and the types of those columns, and then I'll show you how to refactor a lot of this cleanup, especially when it gets repetitive, and we'll talk about a complex issue that you might run into.
|
|
show
|
0:46 |
Let's load our libraries, we're going to load NumPy and Pandas, we're also going to load the Glob module, the Glob module lets you load multiple files.
So if you look at this code down here, we are loading these processed star.data files, there's multiple of those, and I want to combine them into a single data frame.
So this code down here will load all of the files, I've got a list comprehension inside of that, and then we're using the pd.concat function to concatenate those along the index, essentially stacking them on top of each other.
Note that I said ignore index is true, that makes it so the index values do not get repeated.
Okay, that looks like we've loaded our data.
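As a rough sketch (the glob pattern and read_csv options are assumptions; the real notebook may also pass column names and the PyArrow dtype backend):

```python
import glob
import pandas as pd

paths = glob.glob('data/processed*.data')

raw = pd.concat(
    [pd.read_csv(path, header=None) for path in paths],  # one frame per file
    ignore_index=True,   # renumber the rows so index values don't repeat
)
```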
|
|
show
|
2:46 |
In this next section, we're going to do a quick overview of our columns.
We're going to understand what columns we have and what are the types of those columns.
Let's run the describe method.
This gives me summary statistics.
It looks like I've got 920 non-blank entries for each of these numeric values.
Look at my minimum values, my maximum values.
It looks like these are all integer-like values.
They don't go very high.
It's probably likely that they're being stored as 64-bit integers, and from the looks of this, they could all be 8-bit integers.
Let's look at our string columns.
We've got two string columns.
You can see that these actually look numeric, but they've got question marks in it, so we'll show how to deal with those.
Let's look at object columns in here, and we've got a few object columns as well.
Let's examine our dtypes, and we can see that we have a bunch of object columns in here, which is a little concerning given that we are using PyArrow.
If you look at our code up above here, we are using the PyArrow backend, and the engine is PyArrow.
Where did these object columns come from?
These probably came from us reading multiple files, and some of the files had missing values in them.
Some of them did not, and so we ended up getting different types in them.
Because we don't have proper PyArrow types in here, I do want to clean these up, and we'll show the process for doing that.
This data comes with a little descriptive file here, and if you look at this, the goal is...
Scroll down a little bit, we can find the goal.
It says in this section 4, the database has 76 attributes, but we're using 14 of them, and the goal field refers to the presence of heart disease.
So it's an integer from no presence 0 to 4.
And then we have a bunch of other attributes down here.
Here's the 14 that are used, and then there's some documentation about that down here.
Interesting that section 9 says that missing values have the value of minus 9.
We saw some question marks in there, so we might have to dive into that a little bit.
In this section, I showed a quick diving into my data.
We did summary statistics, and then we selected the different types.
We did notice that we had an issue with object types being in there.
Again, you do want to check those dtypes, and make sure that all of your columns have the proper types.
We also noticed that some of our numeric types probably don't need to be 64-bit integers.
They can probably be shrunken, and we can save a little bit of space if we need to.
|
|
show
|
0:43 |
I'm going to start going through the columns.
We'll start off with the age column.
So here's our data frame.
You can see age is in that first column there.
Let's do a describe on that.
It looks like we are going from 28 to 77.
That looks okay.
Let's do a histogram to understand what's going on there.
Seems perfectly plausible.
Let's look at the D type of that.
The dtype is a PyArrow double.
Let's see if we can get away with casting it to an int8.
PyArrow will complain if this doesn't work.
It looks like that does work.
I think I'm pretty good here leaving this as an int8 and I don't have much more to do with this column.
|
|
show
|
1:17 |
In this section, I'm going to show you how to clean up the sex column.
First of all, let's just look at the sex column.
You'll note that I am keeping that previous conversion from age in there.
I'm going to build up a chain.
I'm going to pull off the sex column and then look at the value counts.
Again, if you want to walk through this chain or recipe, here's the original data frame.
I'm converting age to an int8.
That doesn't change anything visibly here.
I'm going to plot the sex column.
This is ones and zeros.
Let's just validate that and we see that it is ones and zeros.
What I want to do is I want to replace one with male and zero with female.
Let's try and run that.
When I do that, Arrow isn't very happy.
Arrow doesn't like to change the type under the covers for us.
This would be going from an integer column to a string column.
Arrow's not very happy about that.
Instead, what I'm going to do is I'm first going to change it to a string type and then I'm going to replace those values with male and female.
Let's run that and see if that works.
It looks like it did work.
Let's just validate that by doing value counts and that looks like that worked.
We showed how to convert the sex type.
We had to convert the numbers to strings and then we can convert the string versions of the numbers into the strings that we want them to be.
|
|
show
|
0:49 |
In this section we're going to look at chest pain and how to clean up that column.
We should have four values in here.
Typical angina, atypical angina, non-anginal pain, and asymptomatic.
So let's look at those values.
It looks like those are PyArrow doubles.
And I'm going to see if I can convert those to integers.
It looks like that worked.
So in this case converting those to integers did do the job.
One thing that you could look at, if it's an issue, is how much memory you're using.
And we can see that each time we convert these from doubles or int64s to int8s, we're saving 7/8 of that memory because we're using much less space to store these numbers.
|
|
show
|
2:25 |
The next column that we're going to look at is the resting blood pressure column.
Now I've got my chain that I built up.
Again, I'm just going to keep building this up.
This is where I was after our last column cleanup.
So let's pull off that resting blood pressure column.
It looks like it is an object column.
Let's describe that.
We don't get super useful information here.
We get a count, the unique entries.
There's 103 unique entries.
The top entry is 104.
The bottom entry is 94.
But we saw that there are also values that aren't numeric in there.
So to understand those, I'm going to do value counts.
That's our friend.
You can see that there is a question mark in there.
It looks like we've got floating point numbers in there.
And we also have integer values in there.
That probably explains why we have that object type in there.
So here's what I'm going to do.
I'm going to say, let's replace question mark with none.
And then let's convert that whole thing to an integer.
In this case, I'm going to do an unsigned 8-bit integer.
And I get an error.
It says it could not convert 95 with the type string to an unsigned 8-bit integer.
Again, we're seeing that PyArrow is a little bit picky about type conversion.
So I'm going to have to jump through a little hoop to clean this up.
So instead of going directly from the object type to PyArrow, I'm first going to convert this to a string type.
And then from that, I'm going to convert that to a PyArrow type.
Let's run that and see if that works.
And it looks like we got an error here.
What's our error?
It says it could not convert 138 as a uint8.
Okay, I'm probably going to have to jump through a few more hoops here.
Let's see if we can do this.
So I'm going to convert from a string to a float and then to an int16.
And it looks like that works.
So let's just do a summary statistics now.
And it looks like our values go from 0 to 230.
We do have some missing values.
You can see that we have 861 as the count.
Remember that count is the number of non-missing values.
In this section, I showed changing those types and cleaning them up.
We did see that PyArrow is a little bit picky about changing types.
Sometimes we have to go through a little bit more conversions to get them to convert to the correct types.
|
|
show
|
3:07 |
Our next column is the serum cholesterol column.
So we've got our existing code here.
I'm going to tack on chol and then a describe on the end of that.
And we can see that, again, we are not getting a lot of useful information from describe because the column type is kind of messed up.
Let's do a value counts on that.
We can see that we've got, looks like two zeros, which is a little bit weird.
We also have a question mark and we have a 220.
This is telling me that we have string types mixed in with numeric types.
And we probably have two different types of numeric types, probably integers and floating point values.
So when we concatenate these together, we did not get a nice clean type.
It got a little bit confused.
So this is probably going to be the same thing that we did up above with the blood pressure value.
So what I'm going to do is I'm going to refactor this and I'm going to make a function here called remove question.
And you can see that it takes a data frame as the first parameter.
Then it takes a column name and then it takes a dtype, a final dtype.
And I'm just defaulting that to int8[pyarrow].
And then I'm going to take the chain that I had up above.
If we look at the code up above here, I'm just taking that chain and sticking it into this function here, replacing the final type with that defaulted type.
And then what I'm going to do is I'm going to come down here to my assign and I'm going to replace that chain with a call to lambda.
Why do I need to use lambda here?
Why can't I call this function directly?
I can't call this function directly because in order to use a function with assign, it only can take one parameter and our function does not take one parameter.
We have to pass in the data frame and the column.
So we wrap that in lambda.
So we are able to call it with just the data frame and then the lambda inside the lambda sticks in the column that we want.
Let's test that out.
I'm going to do that to the previous column to make sure that still works.
And I'm also going to do that with the cholesterol.
And then we'll look at the value counts of cholesterol.
Looks like the value counts of cholesterol did work.
Let's do a histogram of that.
And there we go.
Oh, this histogram is interesting.
It looks like we have a bunch of zeros and then we have a bunch of data along the way.
To me, I love these visualizations because it lets me see what's going on.
It seems that we have a lot of people that have a value of zero for cholesterol.
Again, zero is probably not a valid number for cholesterol, but there's also probably values that are missing as well.
Let me just validate that here.
So I'm going to say isNA.
We saw that we can do this already.
And then I'm going to say, let's do a sum of the isNAs there.
There are 30 values that are missing in addition to that.
So if this were my data set, what I would do at this point is talk to the subject matter expert, someone who knows about this data, and figure out is a zero the same as a missing value and take the appropriate action.
I might go and change all of those zeros to missing or vice versa, depending on what my end goal was.
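Here is a hedged sketch of that refactor. It is not the exact chain from the video: instead of replace-then-cast it leans on pd.to_numeric to coerce the '?' markers to missing, then steps through a float dtype before landing on the target PyArrow type. The column names (trestbps, chol) follow the usual UCI naming and may differ in the actual notebook:

```python
import pandas as pd

def remove_question(df, col, dtype='int8[pyarrow]'):
    # '?' marks a missing value; coerce it to NA, go through float first
    # (PyArrow is picky about casting strings straight to integers),
    # then cast to the requested final dtype.
    return (pd.to_numeric(df[col], errors='coerce')
            .astype('float64[pyarrow]')
            .astype(dtype))

cleaned = raw.assign(
    trestbps=lambda df: remove_question(df, 'trestbps', 'int16[pyarrow]'),
    chol=lambda df: remove_question(df, 'chol', 'int16[pyarrow]'),
)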
|
|
show
|
1:05 |
Okay, we're going to keep going through.
I'm going to look at fasting blood sugar.
I've got my chain here; I'm just going to pull off that column and do a describe on it.
It looks again like we have that same problem: describe isn't giving us super useful values.
I'm going to look at the value counts here, and I see 0 and 0.0, 1 and 1.0, and a question mark, so we need to convert these types.
This looks like a boolean value, so I'm going to replace this with a boolean instead of the types that I was using previously for the other columns.
Because I have the function there, it makes it really easy; I just specify the dtype.
Let's do that and look at the value counts, and that looks like it did work.
We also could come in here and say dropna is False to see the missing values, and you can see that 90 of those are missing.
Again, this would be something where I would want to go back to a subject matter expert and ask them why these values are missing.
If there is not an entry in here, is that the same as a false?
That would be a question that I want to make sure I have an answer to.
|
|
show
|
1:27 |
Okay, we're gonna go to our next column resting echocardiographic results.
This is supposed to have three values: value zero, normal; value one, having a wave abnormality; and value two, showing probable or definite left ventricular hypertrophy.
Probably said some of those words wrong, but let's go through our process here again.
We're going to look at our value counts, which tells us that our types are messed up.
I'm going to do remove question.
This looks like an int8 should be fine.
And let's validate that.
That looks like that works.
What we're seeing is that because we made this function and we have this process it makes it really easy to clean up.
It is sort of annoying that these column types are messed up like this.
One thing that we could do is we could provide types to our function that is reading those CSV files.
But either way we're gonna have to go through and clean up the types.
I kind of like to have something where I just load in the raw data and then clean up after the fact.
Could I load in each individually and clean up each individual one and then concatenate those after fact?
Certainly I could do that as well.
I'm not really adamant about whether one of those is better than the other.
In fact, this code would work with a single file.
However, it probably isn't quite as necessary with a single file because we're not going to get those mixed types.
So that might be a reason why we might want to consider cleaning each file up individually before combining them together.
|
|
show
|
0:14 |
Let's look at maximum heart rate achieved.
Okay, start off with value counts.
Looks like we've got some question marks and some floats in there.
I'm going to convert that to an int16 and look at the value counts.
That looks like it did work.
|
|
show
|
0:14 |
Our next column is exercise-induced angina.
Start off with value counts.
And what we're going to do, if you look at the results there, it looks like ones and zeros.
I'm going to convert that to a Boolean.
Looks like that worked.
|
|
show
|
0:22 |
Our next column is Old Peak, which is depression induced by exercise.
Let's look at that.
We'll pull off our value counts.
These look like floating point numbers, so I'm going to convert this to a PyArrow float.
You can see that I specified that dtype there.
Let's run that and look at the value counts after we do that.
Looks like we're good.
|
|
show
|
0:19 |
Our next column is the slope column.
So there are three values for this 1, 2, or 3.
Start off with our friend value counts.
And it looks like we're going to convert this to an int8.
Let's try that out.
It looks like that worked.
|
|
show
|
0:17 |
Our next column is the CA column number of major vessels colored by fluoroscopy.
Again, start off with value counts.
And this looks like it's an int.
I'm going to call it an int8.
Let's do value counts after we convert it.
It looks like that worked.
|
|
show
|
0:38 |
Next column is the THAL column.
This should have three values 3, 6, or 7.
Let's do the value counts.
Okay, and that is messed up.
In this case, I'm not going to use our magic remove question function, but rather I'm going to try and convert these to strings.
Let's see if we can do that.
Okay, and that looks like that worked.
It looks like everything is a string in there, so I replaced the string floating point versions with string integer versions and then replaced those with the string values we want.
|
|
show
|
1:07 |
Let's look at the num column.
This should have a value of 0 or 1.
It looks like it has a value of 0, 1, 2, 3, 4.
Let's go back up to our documentation and look at this.
So this is column 58, the predicted attribute.
The description here says value 0 or 1.
But if we go up here above, it says it can be 0, 1, 2, 3, 4.
Which is this goal column: value 0 (no presence) up to 4.
So it looks like we are seeing those values 0 through 4.
The next thing I'm going to do is look at the data type of that.
That actually looks like it's an int64, so that's pretty good.
I'm just going to go ahead and run this.
I'm just going to convert it to an int8.
In this case, I don't need to use my special function there.
I can just stick that into the as type.
So here's our chain that looks like it does a relatively good job of cleaning up our code.
|
|
show
|
0:49 |
In this section we're going to look at memory usage and compare memory usage from before and after.
Here is our original memory usage, 396 kilobytes.
Let's run through our code and do our memory usage and we get 34 kilobytes.
When we compare that we're using 11 times less memory by going through this process.
Again there's no guarantee that you will use 11 times less memory and this is partly because we're using all these object columns so those consume a little bit more memory.
But we did save a bit of memory by going through this.
If we were to go back to the Pandas 1 types of columns, we would use even more memory.
So this is pretty good, pretty good cleanup.
We haven't lost any information but we've cleaned up the types.
|
|
show
|
4:18 |
After I've gone through this whole process, what I want to do now is make this into a function so that I can use this code really easily.
Let me show you how to do that.
Here's our original data, and we're just loading our original data frame.
This is the raw data.
I do like to have the raw data around so I can clean it up.
Then here is my remove question function.
We have that before, and I like to make this little tweak function.
The tweak function is going to take in the raw data and it's just going to return our chain.
Let's run that and make sure that that works.
It looks like that did work.
Once you've done this, this is really nice.
What I like to do is I like to take this code and move it to the top of my notebook.
I have the code that loads the raw data.
When I come to my notebook, I can just load the raw data, and then I have the code right below it that cleans it up.
I don't have to run through 50 cells to do that.
I'm ready to go with both the raw data and the cleaned up data.
Why do I like to work with raw data?
I like to work with raw data because in my experience, the boss or whoever asked me to do this process comes back and asks me some questions.
If I don't have the raw data, it's hard to trace through and explain what went on with a particular example, and the question is typically, why did this row do this?
Well, if I have the raw data, I can trace through that chain and see what happened.
Now, if you look at this, if you squint at this, it looks like we have a bunch of lambdas here.
Can we clean up this code a little bit?
Let's try and do that.
What I'm going to do is I'm going to take a dictionary called types and just map the columns to those types.
Then I'm going to use a dictionary comprehension in here and say, let's stick in this dictionary comprehension inside of the assign.
This is the dictionary comprehension with the curly braces, and the double star (**) unpacks it and sticks it into our assign, as in each key from the dictionary comprehension becomes a parameter for the assign method.
Let's do that and see if that works.
It looks like it works.
However, if you look closer, it looks like all of these columns here have the same value.
That's disconcerting.
What is going on there?
The issue here is that Python, if you look at this, we are looping over column and D type in here in a loop.
When you stick this into a lambda, lambda sees column over here, and it's just going to use the last column from the loop, which is annoying.
How do we get around that?
We get around that by doing this little step here.
When we create the lambda, we pass in the column.
Why does that work?
Well, now we are passing in the column when the lambda is created.
We are not evaluating the column when the lambda is executed.
When we execute this for loop, it hasn't evaluated the lambdas, it's just created the lambdas.
It's going to take that last value of column unless we pass that indirectly, and then that is set when the lambda is created, not when the lambda is evaluated.
That's a little tweak that you might need to do if you want to refactor that.
Let's run that and see if that works.
That looks like it does work.
You can see that our values are not all the same.
This example is just showing that we could also use this syntax here and do a look.
We're pulling off those columns and we're just going to convert all of those columns to floating point values.
Could we do that?
We could.
Why don't these all end up looking like floating point values?
Because then afterwards, we're going to use this astype to specify the types as well.
My preference here is probably to use the commented out version because then I don't have to put a secondary astype in there.
Note that if I comment this out, we get a failure because PyArrow really wants us to convert from strings to numbers before we do that astype.
In this section, I showed you how to make a function and gave you some of the benefits behind that.
Make sure you're doing this with your code.
I don't care if you chain.
I think chaining is going to make your code better, but I do think a really good practice is once you've got that code to clean up your data, move it to a function and put that at the top.
That's going to make your life using notebooks a lot easier.
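Pulling those pieces together, a sketch of the refactor might look like this. The types mapping and column names are illustrative, it reuses the remove_question sketch from earlier, and the col=col, dtype=dtype defaults are the late-binding fix just described:

```python
types = {
    'age': 'int8[pyarrow]',
    'trestbps': 'int16[pyarrow]',
    'chol': 'int16[pyarrow]',
    # ...one entry per column to clean; names and dtypes here are illustrative
}

def tweak_heart(raw):
    # Bind col and dtype as default arguments so each lambda keeps *its* pair,
    # not whatever the loop variables end up as when the comprehension finishes.
    return raw.assign(
        **{col: (lambda df, col=col, dtype=dtype: remove_question(df, col, dtype))
           for col, dtype in types.items()}
    )

heart = tweak_heart(raw)
```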
|
|
show
|
0:06 |
Okay, this was a little bit long, but hopefully you got the point.
This is the process that I would go through to clean up some data.
|
|
|
21:52 |
|
show
|
0:31 |
I want to show you how to work with time series data.
This is really fun, I think, in Pandas.
It lets us manipulate the data relatively easily and start visualizing it and asking questions and understanding what's going on there.
Okay, so in this section, we're going to show how to parse dates.
We're going to clean up some data.
We're going to fix some of these strings that are numbers.
I'm going to make functions.
Again, we'll talk a little bit about time zones, resampling, rolling operations and then we'll do some plotting as well.
|
|
show
|
0:51 |
So this is using an air quality data set from University of California, Irvine.
We've got a few columns in it.
Let's just load in that data.
And I got an error when I loaded in the data.
What's going on there?
Says it found non-unique column index.
Apparently we have repeating column names.
Let's try this again.
And I'm going to comment out this engine='pyarrow' and see if that works.
Okay, and this did work.
It looks like the PyArrow engine is a little picky.
On the end here you can see that there are two unnamed columns, and I'm assuming that PyArrow didn't like that.
So we're going to dive through the data here and see if we can clean this up a little bit.
|
|
show
|
0:52 |
I'm going to show you how to start cleaning up the data.
Okay, so one of the things I might want to know is, are there missing values?
I'm going to sum these up across the columns, not the rows.
And what will this give me?
This will give me back a series that should have the same index as this data frame, but it should give me the count of how many values are missing in each of those rows.
So let's run that.
And you see that a lot of these are missing all of the values.
Okay, let's look at the shape.
I'm going to use some pandas to filter a little bit.
I'm going to say, I want you to return me the rows where the number of values missing is less than 17.
The total number in this is 9,471.
It looks like we have 9,357.
So this filters out the rows where all of the values are missing.
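A sketch of that row filter; aqi is a stand-in name for the air quality frame, and 17 is the column count mentioned above:

```python
# Count missing values per row; a count of 17 means the whole row is blank.
missing_per_row = aqi.isna().sum(axis='columns')

aqi = aqi[missing_per_row < 17]   # keep only rows that contain some real data
```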
|
|
show
|
2:03 |
Okay, I'm going to show you how to do some date parsing now.
Oftentimes when you load a CSV file, its dates will be strings and we want to convert them into dates.
So let's go back to where we were from our last section here.
You can see that we've got a date column and we've got a time column.
So I'm going to make a new column called Date and it's going to be a lambda.
It's going to take my date column, add a space to it and then add the time column and the time column is going to replace the periods with colons.
Let's scroll over to the end.
We should see that new date column.
And I'll just pull those out.
You can see them this way.
Okay, so let's see if we can convert that to a date time.
There is a PD2 date time function in pandas.
Let's run that.
And when we run that, we get an error.
Let's look at our problem here.
It gives us a semi-useful message here.
It says that this format 13032004 doesn't match month, day, year.
And it looks like our issue is month and days are in the wrong order there.
Pandas is trying to do US-centric date parsing.
This isn't US-centric.
Let's come in here and say day first is true and run it again.
And that looks like that does work.
Alternatively, we could specify the format explicitly.
Let's try doing that.
We're going to say %d is the format.
That looks like that works as well.
We could probably use this format, and instead of replacing the periods with colons, we could just put periods in the format string.
That should probably work as well.
Let's look at our dtypes now.
And now we have a datetime64[ns] down here for our date column.
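A hedged sketch of that parsing step; the Date and Time column names and the new timestamp name are assumptions, and the exact format string depends on the raw file, so dayfirst=True is the safer illustration:

```python
import pandas as pd

aqi = aqi.assign(
    timestamp=lambda df: pd.to_datetime(
        df.Date + ' ' + df.Time.str.replace('.', ':', regex=False),
        dayfirst=True,   # the dates are day/month/year, not US month/day/year
    )
)
print(aqi.timestamp.dtype)   # datetime64[ns]
```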
|
|
show
|
2:36 |
I'm going to rename my columns as well.
You can see these columns have parentheses in them.
This is just going to make them kind of a pain to work with.
They also have periods.
So I'm going to use columns that are a little bit more descriptive.
We'll just use rename here to rename the columns, and then we'll check the columns to see if that works.
We still have those unnamed columns in there.
Let's look at the unnamed column value counts.
There's nothing in that one, and there's nothing in that one.
So I'm going to update my chain here, and I'm going to explicitly specify the columns that I want, excluding those unnamed columns.
Why do I explicitly specify the columns?
Why don't I say drop?
Because I want to have my code focus on the columns and the data that I want, not the data that I don't want.
I want someone who comes to this code to look at it and say, OK, these are the columns that are in there.
Not have to worry about which columns are in there or reason about which columns were removed.
It's easier to just drop the columns, but it's not as good for the person reading the code, who only sees the columns that were dropped.
In this section, we're going to dive into some of our other columns and make sure that the types are correct.
Let's look at the string columns here.
Here are the string columns, and if you look at these, these don't really look like strings.
They look more like they are numbers.
So what are we going to do?
It looks like there's commas in there.
And those commas probably confused pandas such that it made them strings.
OK, so we've got carbon monoxide, benzene, temp, relative humidity, and absolute humidity.
Let's go through and see if we can clean that up.
You can see that I've added this pipe down here.
Remember, pipe is a method.
You can pass in any function you want into it, and this is really flexible.
So let's see how we're using pipe here.
So the pipe is basically letting us use assign.
And why didn't I use assign directly?
I didn't use assign directly because I have these weird column names, and I don't want to refer to those original weird column names.
I want to refer to these nice clean ones.
If I use pipe here, I get the current state of the data frame with the cleaned up column names when I refer to it.
So this is a way to let me quickly refer to the updated column names.
What are we going to do with those?
We're going to replace the commas with periods, and then we're going to cast those to floats.
Let's run that and make sure that it works.
It looks like that did work.
Let's inspect the types of that, and it looks like our types worked.
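A sketch of that rename-and-clean step; the rename mapping and the clean column names here are assumptions standing in for the ones used in the video:

    renames = {"CO(GT)": "co", "C6H6(GT)": "benzene", "T": "temp",
               "RH": "rel_humidity", "AH": "abs_humidity"}   # assumed original names

    cleaned = (
        raw
        .rename(columns=renames)
        .loc[:, list(renames.values())]          # explicitly keep only the columns we want
        .pipe(lambda df_: df_.assign(
            # pipe lets us refer to the renamed columns here
            **{c: df_[c].str.replace(",", ".", regex=False).astype(float)
               for c in ["co", "benzene", "temp", "rel_humidity", "abs_humidity"]}))
    )
    cleaned.dtypes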
|
|
show
|
0:52 |
Our next section is making this chain that we just created into a function.
I've already talked about making functions, but again, I think this is super important.
That's why I keep repeating it.
So here's loading our raw data up here.
And down here is a function that I've created.
I've just taken that chain and wrapped it into this function.
And let's run that.
That looks like it worked.
Now, what I like to do is move this to the top.
And I'm going to actually restart my kernel here.
And we'll rerun this and run this down here.
And you can see that that worked.
With two cells executed, I have loaded my data and I've actually cleaned up my data.
And I tested it.
I restarted my kernel and it worked.
Highly recommend you write these functions.
I also highly recommend that you put them at the top of your notebook once you have them.
It's just going to make your life a lot easier.
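The pattern I mean looks roughly like this; tweak_air, the file name, and the read options are hypothetical stand-ins:

    import pandas as pd

    def tweak_air(raw: pd.DataFrame) -> pd.DataFrame:
        """Wrap the whole cleaning chain in one function."""
        return (
            raw
            # .rename(...)
            # .loc[:, ...]
            # .pipe(...)  the chain built in the previous sections goes here
        )

    raw = pd.read_csv("AirQualityUCI.csv", sep=";")   # assumed file name and separator
    air = tweak_air(raw)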
|
|
show
|
0:57 |
In this section, we're going to look at our time a little bit more and figure out where that time is.
So here's our cleaned up data.
If you look at it, it says it's a datetime64[ns].
I'm going to stick that into UTC time.
And when I do that, you can see that this says UTC down here.
Alternatively, I could convert this to a PyArrow timestamp.
Turns out that using pd.to_datetime doesn't give us PyArrow.
It actually gives us a NumPy date time.
So if we wanted to be pedantic about using a PyArrow time, we could.
That also supports tz_localize.
If you want to pull up the documentation for this, you can come up here and hit shift tab a few times.
And this says you have a naive date time and you want to make it time zone aware.
To be pedantic, UTC is not a time zone, but this is how we convert to UTC time.
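A small sketch of both options, assuming the cleaned frame is called air and the column is called date; per the video, the Arrow-backed timestamp also supports tz_localize:

    # NumPy-backed datetime: localize the naive timestamps to UTC
    air["date"].dt.tz_localize("UTC")

    # Arrow-backed timestamp, if you want to be pedantic about staying in PyArrow
    air["date"].astype("timestamp[ns][pyarrow]").dt.tz_localize("UTC")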
|
|
show
|
1:29 |
Okay, let's look at how to convert to a different time zone.
I want to convert to the Italian time zone, so I'm going to import the pytz library.
This is a time zone library.
This has a nice attribute called all_timezones.
Let's look at the Europe time zones.
Okay, it looks like there is a Europe/Rome in there.
Once I have a column, let's try and localize it.
I'm going to try and stick it into Rome time, and it complains and says this is a non-existent time.
How is it possible to have non-existent times?
With daylight saving you actually lose an hour or gain an hour, so there is a point where the clock either overlaps or skips ahead.
When it overlaps you have an ambiguous time, and when it skips you have a time that's impossible.
And it turns out that we have an impossible time here.
How do we get around this?
We actually need to convert this to UTC time first before we convert it to Italian time.
So let's run that, and when we run it we can see that it converts.
Okay, let's try that with our PyArrow time, and it looks like it works with PyArrow time as well.
Just a small note here: CET, to be pedantic, is not a time zone.
It does work for localization, pandas understands it, but time zone people will tell you that it is not a time zone; a time zone should be associated with a location.
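A sketch of that conversion path, assuming the same air frame and date column:

    import pytz

    # list the available European zones; Europe/Rome is in there
    [tz for tz in pytz.all_timezones if tz.startswith("Europe")]

    # localizing a naive timestamp straight to Rome can hit a non-existent DST time,
    # so go through UTC first and then convert
    rome = air["date"].dt.tz_localize("UTC").dt.tz_convert("Europe/Rome")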
|
|
show
|
3:23 |
In this next section, I'm going to show you how to do some plotting.
So here's our cleaned up data.
I'm going to make this date into Italian time.
So when I do that, you should see this offset here.
I'm going to comment this out and just let you see the difference here.
You can see that this is our naive time.
This is our Italian time.
What I'm going to do next, I'm going to stick this into the index.
So here is the index.
It's just the default index.
Now you can see that our Italian time went in there.
And let's pull off some columns here.
So I'm using loc.
Loc is interesting for a couple reasons.
One is that we don't call it with parentheses the way we call set_index with parentheses.
With loc, we use an indexing operation, square brackets, on it.
The other thing that's interesting about this is that you can pass in two things to index on, which is kind of weird.
You don't index on two things in lists or dictionaries.
But in this, you can pass in a row selector.
In this case, we're passing in just a colon, which is a slice, which means take all of the rows.
And then these are the columns.
These are the column selectors.
And we're going to take those two columns.
I'm going to convert this to our PyArrow time zone, and we'll just double check that it works with that as well.
It looks like that did work.
Okay, let's plot this.
So again, let's go back to what we had before here.
Here is our code.
We've pulled out these two columns.
We're going to do a plot.
Plotting is relatively easy in Pandas once you figure out what's going on.
When you just call plot, what it's going to do is it's going to stick the index in the x-axis, and it's going to take each column and plot that as a line.
So here we go.
We've got two columns, so we should see two lines.
Our index is a date, so it should stick the date in the x-axis.
There we go.
There is our plot.
Now, is this the world's greatest plot?
No, not necessarily.
It looks like it's got a lot of negative 200s going on there.
And this is one of the reasons I like plotting, not because this is a great plot per se, but it shows me some of the issues in this data that pop out really clearly when I plot it.
If I were to just look at a table of this data, it would be a little bit harder for me to see this information.
Okay, I'm going to change my location here.
So here's what my location was before.
I was saying take all the rows.
Let's just zoom in a little bit.
Let's zoom in through April and May and see what's going on there.
So there's April and May.
You can see, again, it's dropping off to this negative 200 value.
What's going on with that 200 value?
Again, I would go to a subject matter expert and ask them. For now, I'm going to replace negative 200 because it probably doesn't make sense for a value like that to be in there.
I'm going to replace it with a missing value.
So that's what we get when we replace that with a missing value.
That's looking a little bit better.
It's still a little bit hard to see, but it's a little bit clearer there.
If I want to zoom in a little bit more, let's say I want to go in to April 8th through April 13th, I can do this and get something that looks like this.
This illustrates one of the cool things about Pandas.
When you stick a date into the index, you can use that loc and you can use substrings and it will pull off portions of the date, which is really flexible and nice.
In this section, I showed you how to do a plot.
Basically, if you want to do a plot of time series data, stick the date into the index and every column will be aligned.
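Here's a condensed sketch of that plotting chain; co and benzene are assumed names standing in for the two columns pulled in the video:

    import numpy as np

    plot_df = (
        air
        .assign(date=air["date"].dt.tz_localize("UTC").dt.tz_convert("Europe/Rome"))
        .set_index("date")                      # put the date in the index
        .loc[:, ["co", "benzene"]]              # row selector, then column selector
        .replace(-200, np.nan)                  # -200 looks like a missing-value sentinel
    )

    plot_df.plot()                              # index on the x-axis, one line per column
    plot_df.loc["2004-04":"2004-05"].plot()     # zoom with partial-string date slicing
    plot_df.loc["2004-04-08":"2004-04-13"].plot()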
|
|
show
|
3:26 |
In this section I want to look at missing values.
So here's our plot from last section here.
Let's show what we can do with that.
One of the things we can do is interpolate that.
Let's try that.
And I got an error here.
It says I got an invalid fill error.
So what's going on here?
Interpolate is supposed to take a value and connect it to the next value if there's missing values in there.
It turns out that the PyArrow version of data doesn't like to do interpolation.
So I'm going to convert this back to NumPy values, which will show you how interpolation works.
I'm going to say astype float and then we'll do interpolate.
And you can see for example here between 9 and 10 we have those values connected.
There it is without the interpolation.
There's also a small portion over here after 11 but it's a little bit hard to see that.
Another option is we can do what's called a forward fill.
Forward fill is going to take the last known value and it's going to push it forward.
Let's do that.
And you can see that it's pushing this value forward.
We do get a warning here.
This warning is a little bit weird.
It's saying I'm falling back on a non-PyArrow path, which is weird that this works for forward fill but it doesn't work for interpolate.
I wish Pandas was consistent here.
There's also a back fill.
The back fill is going to pull the last value and pull it back.
Let's run that.
And it's a little bit hard to tell the difference here.
This one is pushing this value forward.
This back value is pulling this value back.
A note on back fill.
If you're doing machine learning, you want to stay away from back fill, and you probably want to stay away from interpolate as well, because those are working with values from the future.
Interpolate is connecting the previous value and the next seen value.
The next seen value is a value possibly from the future.
A back fill is taking possibly future values and pulling them back.
So if you are doing machine learning with these you want to not use these because oftentimes they're considered cheating and they will lead to models that won't work or will give you bad results.
Another thing you can do is you can replace missing values.
Here I'm just saying replace that with 22.
Is 22 the right value?
No, not necessarily.
But you can see that 22 pops up here and here.
So what is the correct value to replace it with?
Again, the answer is it depends.
You probably should talk to a subject matter expert and see what to replace it with.
Here I'm saying replace minus 200 with nan and then look at this.
I'm using our friend pipe and I'm saying let's just fill in the missing values with the mean of the current data frame.
Why am I using lambda instead of just fillna?
Because pipe expects a data frame as the first parameter, and fillna does not expect a data frame as the first parameter.
So here we see that the mean value is getting stuck in there over here.
It's a little bit hard to see but the blue value is getting the mean value as well.
In this section I showed you how to deal with missing values.
Pandas has a lot of flexibility there.
Again, talk to a subject matter expert and make sure you're doing the right thing.
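Here are the options from this section in one place, assuming plot_df from the previous sketch:

    import numpy as np

    ser = plot_df["co"].astype(float)   # back to NumPy floats so interpolate works

    ser.interpolate()                   # connect known values across the gaps
    ser.ffill()                         # push the last known value forward
    ser.bfill()                         # pull the next known value back (uses the future)
    ser.fillna(22)                      # replace missing values with a constant

    # fill with the column means, using pipe so the lambda receives the data frame
    plot_df.pipe(lambda df_: df_.fillna(df_.mean()))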
|
|
show
|
2:30 |
In this section we're going to explore resampling.
Here's our interpolated chart from before.
Now if I pull off this plot, let's look at what's in the index here.
And it looks like this is every hour.
If I want to change the frequency of that, it's really easy if I have a date in the index.
I just call resample.
So I'm going to say resample to the two-hour frequency.
Now this is lazy.
Resample by itself doesn't give you data back.
To get anything back, I need to do an aggregation here.
So I'm going to do the mean as the aggregation.
When you look at this now, now we have every two hours.
So if we were to tack on a plot to this, I'll just take this code up here and stick this on down here.
You can see that that smooths that out.
I could come in here and say every three hours if I wanted to, or every five hours.
Really convenient that we can just stick those numbers in front of that.
In fact, we can do something like this.
We can say every two hours and 37 minutes.
And you can see that this is two hours and 37 minutes past that, and that pattern continues.
So here I'll just stick a plot onto that.
I can also resample to the day level.
Here's every day.
And we can do five days, six days, etc.
Instead of doing resample, another thing that I can use is use what's called a grouper.
The code here is a little bit different.
I'm commenting out our old code so you can see what I changed here.
I'm not setting the index.
We're keeping the same index.
So the result is going to look a little bit different.
I'm going to pull off these columns.
In this case, I'm including the date column.
Then I'm going to cast my column types.
I commented out that float because I can't say as type float because I have a date in there.
I'm going to do my interpolation.
Now instead of doing resample, I'm going to do group by, but with pd.Grouper.
The nice thing about using grouper instead of resample is that I can actually use group by with multiple groupings.
In this case, I'm only using one grouping, but I could do a hierarchical grouping if I want to.
I'm saying take the date column and do a day frequency and then the aggregation there is a mean.
At this point, I'm going to do my loc, which is similar to what I had up above there, and then I'll do a plot.
In this section, we explored resampling.
Super powerful thing that you can do with pandas.
We have those offset aliases that allow us to very quickly change the resample frequency.
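A sketch of both approaches, assuming plot_df with the date in the index:

    import pandas as pd

    plot_df.resample("2h").mean().plot()        # every two hours
    plot_df.resample("2h37min").mean()          # offset aliases compose freely
    plot_df.resample("D").mean().plot()         # daily

    # the groupby + pd.Grouper equivalent, keeping date as a column instead
    (plot_df
     .reset_index()
     .groupby(pd.Grouper(key="date", freq="D"))
     .mean()
     .plot())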
|
|
show
|
1:45 |
Let's look at doing rolling averages.
This is something you'll probably want to be able to do as well.
Here's a plot.
This is going from April 2004 to May 2005.
Now it's a little bit hard to see, this looks kind of like a caterpillar.
Let me show you a cool thing that we can do.
One thing that we could do is we could say let's resample this to the day level.
And here's the resampled day level version of that.
In addition to doing that, we can say, okay, let's do a rolling version of that.
So this is what I had before.
And now what I'm going to do is I'm going to say I want to roll that.
So when I do rolling, it's lazy, it doesn't do anything, but what it's going to do is give me a window.
And this is a window that's going to move or slide along my data, and then I can apply an aggregation as it slides.
So it doesn't do anything until I apply the aggregation.
Let's apply the aggregation.
So the aggregation is mean here, and you can see that the first few entries don't have any values because I need seven of them.
But basically this is taking the mean of the previous six values with the current value.
Let's plot that and see what it looks like.
And it looks like this.
So this is the rolling seven day average of this data up here.
This lets us spot trends really easily.
One of the places you see this is if you have data that's kind of jagged.
It's super common for COVID data.
We saw this a lot with visualizations where you would have daily data, and then visualization people would put a rolling seven day average on top of that to see trends, because sometimes things weren't measured on the weekend or whatnot.
And so the rolling version would let you see a trend really easily.
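A minimal sketch of the rolling seven-day average, again assuming plot_df from before:

    daily = plot_df.resample("D").mean()

    # sliding window of the current day plus the previous six, then aggregate
    daily.rolling(7).mean().plot()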
|
|
show
|
0:16 |
Okay, the last thing I'm going to do is I'm just going to fix my function based on this visualization that I did.
So here is our function.
I'm just going to fix the minus 200 values that we didn't handle before.
I'm going to replace those with missing values.
Let's just run that to make sure that it works.
And it looks like that did work.
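In code, the updated function looks roughly like this; tweak_air is a hypothetical name for the function built earlier:

    import numpy as np

    def tweak_air(raw):
        return (
            raw
            # ...the cleaning chain from before goes here...
            .replace(-200, np.nan)   # treat the -200 sentinel as missing
        )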
|
|
show
|
0:21 |
This just goes to show that working with data is an iterative process. It's unlike a lot of engineering tasks, where it's like, make a button that does this, and you can say, okay, I did the button, it's done.
Oftentimes when we're working with data, we find things that we didn't know before, we have to go through and maybe repeat the process.
So if you work with these chains and you make these functions, it's going to make it really easy to see what you've done before and clean that up.
|
|
|
11:37 |
|
show
|
0:25 |
In this section we're going to show you how to take text data and manipulate it, and then we're going to make a machine learning model that will make predictions from your text data.
I'm going to show you how to load the data, we'll look at basic string manipulation, and then we'll get into some natural language processing techniques.
We'll show how to remove stop words, and how to make some metrics to understand the data, and then we'll make a classification model and show how to make predictions from that.
|
|
show
|
1:31 |
We're going to load our libraries, and I've got some data that I've downloaded here.
So this is reviews of movies.
Here's the README from that.
So this is the large movie review data set.
This is organized as a file system.
It's got training data with positive and negative samples, and the reviews look something like this.
So here's our code to load the data here.
This is a little bit more involved, but I've got a directory here, and inside of that there's a positive directory and there's a negative directory.
I have this function up here that will traverse those directories and get us our data frames here and then concatenate those.
Once I've got those, I'm going to drop the index and I'm going to change some types on those.
Let's run that and look at a sample of that.
So we can see that we have this review text here.
Here's what our data frame looks like.
We have 602 rows and four columns.
So we've got an ID, a rating, a sentiment, and the text.
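A sketch of a directory-walking loader like the one described, assuming the standard aclImdb layout and its id_rating.txt file names; the actual function in the video may differ:

    from pathlib import Path
    import pandas as pd

    def load_reviews(path, sentiment):
        rows = []
        for file in Path(path).glob("*.txt"):
            id_, rating = file.stem.split("_")      # file names look like 123_8.txt
            rows.append({"id": int(id_), "rating": int(rating),
                         "sentiment": sentiment,
                         "text": file.read_text(encoding="utf-8")})
        return pd.DataFrame(rows)

    reviews = pd.concat(
        [load_reviews("aclImdb/train/pos", "positive"),
         load_reviews("aclImdb/train/neg", "negative")],
        ignore_index=True)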
|
|
show
|
0:55 |
In this section I'm going to show how to manipulate strings using pandas.
So the key here is once you've got a column or a series that is a string, you can see that the type of this is a string, on that there is an str attribute.
So on the str attribute there are various things that you can do.
These methods that are attached to the str attribute should be very familiar to you if you're familiar with working with Python strings.
But the nice thing about working with pandas is that instead of working with an individual string you will be working on a list of strings, everything in the column at once.
Here if I want to capitalize everything I can say capitalize that and that will capitalize the first letter in there.
Again this is off of that str attribute and I'm just going to show you everything that's in there.
These look very similar to what you would see in a Python string.
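For example, assuming the reviews frame from the previous sketch:

    reviews["text"].str.capitalize()      # capitalize the first letter of every review
    reviews["text"].str.lower()
    reviews["text"].str.contains("great")

    # everything available on the str accessor
    [m for m in dir(reviews["text"].str) if not m.startswith("_")]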
|
|
show
|
2:43 |
In this section, I'm going to show you how to remove stop words.
Stop words are words that don't add value to your text.
Oftentimes when we're doing natural language processing, we want to get rid of stop words.
Things like a, the, things that occur a lot but don't really mean anything or add value.
We're going to use the spaCy library to do that.
Make sure you install that.
After you install it, you need to download some English files so that it understands how to process English.
This is the command to download the small English model here.
Then you can validate that your spaCy install worked.
You can see that I have downloaded that small one.
I'm going to load spaCy and then I'm going to say load that small English model.
Now I'm going to remove the stop words.
I'm going to use apply here and say, okay, here's the remove text.
We're going to apply this function here.
And we pass in this NLP object.
What this is going to do if we look at it is it's going to get a document from that, which understands what's going on with the text.
Then I'm going to loop over the tokens in the document here.
And if it's not a stop word, I'm going to stick that in there.
So let's run that.
I'm also using the time cell magic at the top.
This is going to take a while.
This is using apply, which is slow.
It's also working with strings, which tend to be slow as well.
But there's not really a way to vectorize this and make it much quicker.
So we'll just deal with that.
Okay, so this takes about 30 seconds.
You can see that I've got, it looks like some HTML in here.
So I might want to further replace some of that HTML.
And I could put in code like this to do further manipulation there.
Let's just load the original data so you can compare the two data sets and see that the stop words are being removed.
Okay, so that's looking better.
Here is the original data you can see for a movie that gets no respect.
It got changed to movie gets respect, sure, lot, memorable quotes.
You can see the bottom one here.
I saw this at the premiere in Melbourne.
Saw premiere in Melbourne.
Do you need to remove stop words?
No, you don't, but this is something that's going to make your models perform better because there's a lot of noise in those stop words.
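A sketch of the stop-word removal, assuming the small English model has been downloaded and the column names follow the earlier sketch:

    import spacy

    # after running: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def remove_stop(text, nlp=nlp):
        doc = nlp(text)                                    # parse the review
        return " ".join(tok.text for tok in doc if not tok.is_stop)

    # apply is slow for strings, but there's no easy way to vectorize this
    reviews["no_stop"] = reviews["text"].apply(remove_stop)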
|
|
show
|
1:43 |
Next thing we're going to do is look at TF-IDF.
That means term frequency inverse document frequency.
This is a way to show how important words are.
We look at the relationship between how often a word appears in a document versus how many of the documents contain that word.
If you think about this, if a word only occurs in a small subset of those documents, but it occurs a lot in those, that's probably an important word, especially if we've removed those stop words.
If you have words that are important, those tend to describe that document.
We're going to use Scikit-Learn to do that.
So make sure you've installed Scikit-Learn.
Scikit-Learn is a machine learning library and it has this thing called TF-IDF vectorizer, term frequency inverse document frequency vectorizer, and this works with pandas.
So what we're going to do is we're going to apply our removal of stop words, and then we're going to call fit transform on the removed stop words.
This will give us this object I'm calling sparse.
This is a sparse vector.
Okay, you can see that this is a numpy array.
It's got 602 rows and around 13,000 columns.
Why does it have 13,000 columns?
Because there are a lot of words, and each column scores how much a given word matters in each document.
So let's look at what the features are.
We can ask the vectorizer to get the features.
I'm actually going to stick those into a data frame and then I'm going to concatenate that to my original data frame.
Here's my original data frame and you can see that we have all of these features tacked on to the end of it.
Finally, let's look at our value counts of our sentiment and we've got 301 of each positive and negative reviews.
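A sketch of that vectorizing step, assuming the no_stop column from the previous sketch:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    sparse = vectorizer.fit_transform(reviews["no_stop"])

    # turn the scores into a data frame and tack it onto the original
    tfidf = pd.DataFrame(sparse.toarray(),
                         columns=vectorizer.get_feature_names_out())
    combined = pd.concat([reviews, tfidf], axis="columns")

    combined["sentiment"].value_counts()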
|
|
show
|
2:40 |
In this section we're going to make a model that predicts whether something is positive or negative.
I'm going to use the XGBoost model.
Make sure you have installed XGBoost.
We're going to import that.
All I'm going to do here is say X is equal to this tfidf data frame, and our y is going to be equal to whether something is positive.
So what's our tfidf?
That is our data frame that we stuck onto the end of the other one.
So let's look at x.
x looks like this.
It's a bunch of numbers in here.
So this is basically how important 10 is in document 3.
You can see that zone appeared in this document but a lot of these are zeros.
This is sparse because not all of the reviews have all the columns.
Okay what does y look like?
Y is just a series whether something is positive or negative.
And what am I going to do here?
I'm going to use scikit-learn to split our data into a training set and a testing set.
Why do we want a training set and testing set?
Well we want to see how our model would perform on data that it hasn't seen before.
So what we do is we hold out some subset of that, we train a model, and then with the subset we held out we evaluate our model.
We already know what the true positive negative labels are but we see how well our model predicts those based on data that it hasn't seen, giving us some sort of feel of how it might perform in the real world.
Okay so we've split up our data let's train our model.
It's really easy to train a model you just say fit.
So we're going to fit it with the x and the y.
The x are the features the y is the label whether something was positive or negative.
That takes a while because we've got a lot of columns but it looks like we did get a model that came out of that.
And then let's evaluate it.
One way to evaluate it is to use the score.
We're going to pass in the testing data, the data that it hasn't seen, and we're going to pass in the real labels for that and this is going to give us back an accuracy.
It looks like it got 78% right.
Is 78% good?
Well the answer to that is it depends.
It might be good it might not be good.
This is saying that basically four-fifths of the reviews that you classify as positive or negative are correct.
Now if you have a situation where you're making a model that predicts maybe fraud, and fraud is not very common like you could imagine fraud might occur in like one in a thousand, well you could make a model that's highly accurate.
You just predict not fraud.
It's 99.9% accurate.
So accuracy in and of itself might not be a sufficient metric to determine whether something's good, but it's good to give us a baseline.
This is better than flipping a coin.
It's nearly 80% accurate.
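Here is the workflow as a sketch, assuming tfidf and reviews from the earlier steps; the split settings are my own choices, not values from the video:

    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    X = tfidf                                   # the TF-IDF features
    y = reviews["sentiment"] == "positive"      # True/False label

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)

    model.score(X_test, y_test)     # accuracy on data the model has not seen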
|
|
show
|
1:40 |
In this section I'm going to show you if you had a new review, how would you make a prediction for it?
Okay, so let's make a review here.
I'm going to say xnew.
This is a data frame.
It has some review text in it.
What am I going to do?
I'm going to pull out the text.
I'm going to call my removeStop.
I'm going to stick that into my vectorizer.
And I'm just going to call transform.
And then I'm going to stick that into a data frame.
And that should give me my xnew.
Let's run that.
There is my xnew.
So I've got three reviews.
I hated this movie.
This was the best movie.
I think I know how I felt about this movie.
Both good, but weird parts.
Okay, so let's make a prediction.
And to make a prediction, all we have to do is take our data frame and call predict on it on that XGB model.
And so this says that the first one was zero.
I hated this movie, so that's a negative review.
Next one was best movie.
It got a one positive review.
This one says I think I know how I felt.
Both good, but weird parts.
And this said that that is a positive review.
One of the cool things about this is we can say predict probability.
This is the probability that something is positive or negative.
So the first one, this is the negative column right here.
And this is the positive column.
So this is 90% negative.
This next one is 71% positive.
And this one here is 54% positive.
How do I know that the left side is negative and the right side is positive?
Because remember we're predicting whether something is positive in our Y.
And so in this case here are our classes.
False here corresponds to this column here, and true corresponds to the right column over there.
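A sketch of scoring new reviews, reusing vectorizer, remove_stop, and model from the earlier sketches; the review texts are paraphrased from the video:

    import pandas as pd

    X_new_raw = pd.DataFrame({"text": ["I hated this movie.",
                                       "This was the best movie.",
                                       "Both good, but weird parts."]})

    X_new = pd.DataFrame(
        vectorizer.transform(X_new_raw["text"].apply(remove_stop)).toarray(),
        columns=vectorizer.get_feature_names_out())

    model.predict(X_new)            # 0/False = negative, 1/True = positive
    model.predict_proba(X_new)      # left column P(negative), right column P(positive)
    model.classes_                  # confirms which column is which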
|
|
|
20:16 |
|
show
|
0:21 |
In this video we're going to look at machine learning a little bit more.
We'll look at data preparation, model creation, and model evaluation.
Our goal for this section is to make a machine learning model.
I'll show you how to load the data, explore it, we'll prep the data for machine learning, and then we'll make a model and tune the model and evaluate the model.
|
|
show
|
2:00 |
We're going to look at a heart disease data set.
This comes from again, University of California, Irvine Machine Learning Repository.
Let's load our imports here and let's load our data.
Our data frame looks like this.
We're predicting whether someone has heart disease or not.
We actually saw this data set previously.
Once we have our data, we can go through it and start looking at it.
Like I can say, pull up the fbs column and look at the value counts for that.
We can say, what are the D types for this?
It looks like they're all doubles.
Here is my tweak heart.
This probably looks familiar to what we saw before.
Let's run this code and let's look at the output of that.
There's the output of that.
Let me just walk through this code.
This is a chain here.
The first thing we're doing is converting types, and then we're making a sex column.
We're replacing one with male and zero with female.
For thal, we are replacing the numbers with normal, fixed, and reversible, and we're changing that to a categorical type.
Slope, we're also changing that to text values there.
There's our output after doing that.
Let's just look at our memory usage after doing that.
Here's our original memory usage and here's our cleaned up memory usage.
It's gone down.
We're using about 40 percent of the original memory usage.
At this point, we've loaded our data and cleaned it up a little bit.
Now, we're not done yet.
We need to do a little bit more processing of the data to get it ready for machine learning.
Many machine learning models don't work with non-numeric data.
It turns out that models like XGBoost are more flexible and they will work with non-numeric data, which is nice.
Also, a lot of machine learning models don't work with missing values like NAN.
Turns out XGBoost does work with NAN, so that makes it a little easier to shove a model into XGBoost.
But if you want to compare it to other models like, say, logistic regression or linear regression, you would have to do further processing on your data to make it work with those models.
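A rough sketch of that kind of cleanup function; the value encodings shown here (1 = male for sex, 3/6/7 for thal, 1/2/3 for slope) are the usual UCI encodings and should be treated as assumptions, as should the raw_heart name:

    import pandas as pd

    def tweak_heart(raw):
        return raw.assign(
            sex=raw["sex"].replace({1: "male", 0: "female"}).astype("category"),
            thal=raw["thal"].replace({3: "normal", 6: "fixed", 7: "reversible"})
                            .astype("category"),
            slope=raw["slope"].replace({1: "upsloping", 2: "flat", 3: "downsloping"})
                              .astype("category"),
        )

    heart = tweak_heart(raw_heart)
    heart.memory_usage(deep=True).sum()   # compare against the raw frame's usage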
|
|
show
|
5:00 |
In this section we're going to explore the data a little bit more and just try and understand what's going on in there.
Oftentimes this is an important step before making a machine learning model.
The better you can understand the data, the better your model will be.
Okay, so what we're going to try and predict is whether someone has heart disease.
That's this num column here.
And so I'm going to group by num and look at the mean values for all the numeric entries in there.
And I'm going to color that.
I'm just going to see if there are things that stand out here.
So remember zero means no heart disease.
Everything else means that there is some heart disease.
It looks like zero is red for age, meaning people without heart disease apparently tend to be younger.
Their CP value is lower.
Their trestbps value is lower.
It looks like their cholesterol value is actually higher here.
The restecg value, it looks like that is lower as well.
The thalach column is higher, the oldpeak is lower, and the ca is lower as well.
By coloring these that makes it really easy to see.
If we didn't color this, let me just show you what it would look like.
This is nice data, but it's just hard to see what's going on there.
So I do like sticking that background gradient on that.
I said axis equals index.
So we are coloring each of the columns down the index.
That's what that means.
Here's an alternate view of that.
I'm just transposing that.
That's what this T is right here.
Then I'm sticking that style in there as well.
We might also want to consider doing a correlation.
So I'm going to do a Spearman correlation coefficient here.
Let's see if there's anything that correlates.
What I'm looking for is whether something correlates with NUM here.
You can see that CP has a pretty strong correlation.
CA has a pretty strong correlation.
thalach has a negative correlation.
So maybe we want to come in here and scatter that and see what's going on here.
So I might say let's look at the relationship between num and thalach.
This has a slight negative correlation.
It's a little bit hard to see that.
This is really dark.
So I might clean this up a little bit.
Remember we did have jittering before.
So that's one way that we can look at that.
We can also adjust the alpha there.
So this I think this tells a completely different story than this up here.
Yeah, you can see that it looks like if you're zero you tend to be higher and goes down lower.
It also shows that we have a lot more zero entries than we do the other ones.
Not really clear from looking at this that there are a lot more zero entries there.
Let's look at the scatter plot between NUM and CP.
This is I would say less than useful.
It looks like it's just points on a grid.
So I'm going to jitter both the X and the Y here.
Here's the code to do that and let's scatter that and see what's going on there.
It looks like as these values go up for heart disease, there is more heart disease.
There aren't many lower values for that.
Let's also look at a categorical relationship here.
We're just going to group by sex and then we're going to look at the various values for those.
So it looks like in this case like male, higher value for NUM here.
Again, that's your heart disease.
So presumably males, at least in this sample, are more likely to have heart disease.
Because that was encoded as a categorical, we didn't see that male pop up over here.
If we left it as a binary zero or one, we would see that represented.
Let me show you how I could do that here.
I could take heart and then say assign sex equals heart.sex == "male".
So now we have the sex there, and then we can say let's do a corr here of this.
That gives an error because PyArrow doesn't like non-numeric values.
So I'm going to say numeric_only is equal to true.
If we look at sex here, we should see a relationship, and I'm not seeing sex pop in there.
So I'm going to convert this to an integer with astype, and we'll do int8 PyArrow.
And now we have sex pop in there.
Okay, so there is a slight slight positive correlation there with NUM.
Meaning that as sex goes to male, the NUM tends to go up as well.
NUM going up indicates heart disease.
In this section I showed some exploration.
It's often very useful to explore your data.
Look at those relationships, especially between that target variable that you're trying to predict.
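Some of the exploration steps above, sketched in code; heart comes from the earlier sketch, and the jitter and alpha values are my own choices:

    import numpy as np

    # mean of each numeric column per target value, colored so patterns stand out
    (heart
     .groupby("num")
     .mean(numeric_only=True)
     .style.background_gradient(axis="index"))

    # Spearman correlation, with sex converted to a number so it shows up
    (heart
     .assign(sex=(heart["sex"] == "male").astype("int8[pyarrow]"))
     .corr(method="spearman", numeric_only=True))

    # jittered scatter of num against thalach
    (heart
     .assign(num=heart["num"].astype(float) + np.random.uniform(-0.3, 0.3, len(heart)))
     .plot.scatter(x="num", y="thalach", alpha=0.3))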
|
|
show
|
4:58 |
Okay, we're going to get our data ready for prediction now.
Let's look at our columns.
We want to make sure that we have certain columns and that they're in the right order.
Let's look at our target here.
This is num, again, zero meaning no heart disease, above zero meaning heart disease.
We can look at the value counts of those.
So we have a lot with no heart disease.
And there's about 400 that do have heart disease, but they're spread among those other values.
Are there missing values in here?
That's another thing that we need to be aware of.
You can see that there are missing values.
Let's quantify that.
Which columns have missing values?
We just do an any on that.
That collapses that.
We can do a sum to count them.
You can see that CA has a lot of missing values.
We can do a mean to quantify that.
So you can see that 66% of CA is missing.
We might want to make sure that num doesn't have missing values.
You can see that it doesn't here.
We can also do a value counts and say drop NA if there were missing values that would show up there.
Okay, let's look at the D types.
And our D types are mostly PyArrow.
We do have a categorical value in there as well.
Let's look at the int64 types.
There aren't any int64 types.
Now let's look at the number types.
These are all the values that are numbers.
That's looking okay.
What I'm going to do now is I'm going to use our XGBoost model.
XGBoost basically makes a decision tree that's going to predict a value and then it makes subsequent decision trees that correct the value.
I like to compare this to golfing.
If you think of a decision tree, a single decision tree can hit the ball once and it might be some amount off from the hole.
XGBoost is like a model that can hit the ball once.
That's the first tree.
And then subsequent trees look at the error, how far the ball is off the hole, and try and fix that.
And so you can make multiple trees.
Those are called boosters that try and get the ball in the hole or in essence make the correct prediction.
That's why this is a super useful model because it tends to get the ball pretty close to the hole.
Okay, so what am I going to do?
I'm going to make X.
And X is a common variable name that we use for our data here.
Let's look at this chain that I have here.
I'm going to take all of my object columns and I'm going to convert those to categoricals.
And I'm going to take all of my number columns and convert those to just normal numpy floats.
This is a cool trick that we can do with pandas.
So this right here, what I've highlighted, is a data frame.
If I stick a star star in front of it, it's going to unpack it.
If I unpack inside and assign, it's going to replace the columns.
So this is basically replacing object columns with categoricals.
Same thing here.
This is taking all the number columns and converting those to float.
Now after I've done that, I'm going to say let's make sex a category, fbs a float, exang a float, and slope a category.
And then I'm going to drop the number column.
So that is my X.
It's everything but the number column.
And then our Y is going to be those labels, those number labels there.
Let's split that up.
I'm going to use scikit-learn to split it up.
Remember, we want to split up our data.
We want to train our model on some portion of data.
And then we want to have a holdout, some data that it hasn't seen that we have the real labels for.
And we can evaluate how our model performs with that data that it hasn't seen.
Why do we need a holdout?
Well, if you evaluate your model on data that's seen, it's really easy to make a model that performs well.
You just memorize the data.
But in the real world, that's not going to work very well, memorizing the data, because presumably the data is going to come in a little bit different and you won't have exact matches.
In fact, if you had exact matches, you wouldn't have to use machine learning because you could just memorize the data and you could use an if statement to check whether they're an exact match or not.
Okay, so we've split up our data.
And what I'm going to do is make an XGBoost classifier.
I'm going to say enable categoricals.
So because we have those categoricals, we're going to enable that.
And we need to set the tree method to hist so it can do categorical classification.
And then we just call fit with our training data.
Let's let that run.
And it looked like that worked.
Let's evaluate our model on our testing data.
It looks like it gets 58% correct.
Is 58% the right number?
It might be, it might not be.
We know that it's better than guessing.
So further evaluation might be useful to understand what's going on there.
Let's look at how it did on the training data.
And you can see that it actually did really well on the training data, indicating that we might be overfitting.
What does overfitting mean?
It means that your model is memorizing or extracting too much information from what it was trained on and it's not generalizing as well as it could be.
My experience has shown that XGBoost does tend to slightly overfit out of the box, even though it tends to get pretty good results.
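The data preparation and the out-of-the-box model, as a sketch; the split settings are my own choices:

    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    X = (heart
         .assign(**heart.select_dtypes("object").astype("category"),
                 **heart.select_dtypes("number").astype(float))
         .drop(columns="num"))
    y = heart["num"].astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    model = xgb.XGBClassifier(enable_categorical=True, tree_method="hist")
    model.fit(X_train, y_train)

    model.score(X_test, y_test), model.score(X_train, y_train)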
|
|
show
|
6:01 |
Okay, we're going to tune the model using a library called Hyperopt.
The idea here is that there are hyperparameters that control how the model performs, and we're going to change some of those hyperparameters to make it so it's not memorizing the data as much.
If you think about a decision tree, you can make a decision tree that goes really deep, and it basically memorizes all the features, and you can trace a path through to an individual.
That seems useful, and actually is useful to describe something, but when we want to make predictions, we actually don't want it to be super fit or super complex like that.
We want to simplify it a little bit.
So one of the main levers that we have for simplifying our model is how deep our tree can go.
And remember, XGBoost has a bunch of trees.
One of the things that we can do is tweak those tree levels and make the model perform differently.
I've got some code here that leverages this Hyperopt library.
The nice thing about this Hyperopt library is that it allows us to define a space of parameters that we want to explore, and then it looks at that space and it sees where it's performing well, and it sort of keeps checking out around that space.
Oftentimes, these hyperparameters are floating point numbers, so it'd be annoying to specify all the numbers between 0 and 1 in terms of like 0.001, 0.002, etc.
So this Hyperopt library just says, here's a distribution of what those numbers are, and it starts exploring those, and when it finds good things, it kind of focuses on those.
Every once in a while, it will do an exploration, where it will try and find some other value, but if it doesn't have good results, it will exploit the ones that have good results.
You can see in this, I'm actually not changing all the hyperparameters at once.
I'm doing them in what I'm calling rounds.
So I've grouped a bunch of hyperparameters that perform similarly.
This first one is scale positive weight.
This is going to change the weights for unbalanced data.
The next one is looking at our tree.
So the depth of the tree and min child weight.
Min child weight is used to determine when to make a split.
This next section here is for sampling.
So how many rows and columns of our data do we use when we make trees?
These next ones, reg alpha and reg lambda and gamma, these are regularization hyperparameters, and the last one is the learning rate.
Basically, if you're thinking about golfing, this is how hard you're hitting.
And sometimes you want to actually not hit quite as hard, and you get to a better result faster.
Why do I want to do it in rounds rather than everything all at once?
Well, you might want to do everything all at once.
If you think about it, if you've got 10 parameters and you're trying to optimize all of them, it's hard to optimize all of them.
So I've grouped them into smaller sections so I don't have quite as big of a space to explore.
And the grouping is meant to look at hyperparameters that do similar things.
Is this perfect?
No, it's not.
But it tends to give quicker results because you don't have that combinatoric explosion that you would have if you looked at all of them all at once.
So if you don't have a weekend to spend searching this space out, I suggest using something like this for quick and easy results.
Okay, let's kick that off.
You can see that this is progressing here.
This is one of the rounds, and you can see that there's this loss score.
The best loss is negative 0.5.
And this is just exploring the space, trying to lower that loss.
Okay, we're going to do, in this case, we're going to do 20 evaluations in this round.
So we're at 14 of 20.
It's taking about two seconds per evaluation.
And we've got multiple rounds, so this is going to take a couple minutes to run.
You can see in our next round, our loss has gone down a little bit, indicating that we've tweaked the hyperparameter such that it's doing better.
And here it got even better.
So that's good.
Okay, at this point, we're done.
And here are the parameters.
So what I'm going to do is I'm going to copy this, and I'm going to paste this down here below.
One thing to be aware of is there is some randomness in this.
So there's no guarantee that if you ran this multiple times, you would get the exact same values here, especially these values that are floating point numbers.
Hopefully the values that are integer-like, you would get the same values.
But when you combine those with the floating point numbers, there's no guarantee that you get the exact same values.
I do like to just copy and paste them so that when I come back to this, I can rerun the model with the same parameters.
Okay, so once we've done that, I'm going to now say, okay, let's make a model here.
And I'm going to use these parameters that I just specified.
And I'm going to say, let's have 2,500 estimators, but 50 early stopping rounds.
What this means is I can hit the ball 2,500 times, but if after 50 times, the past 50 times, if I haven't improved, then stop hitting the ball because you're not getting better.
Okay, and I got an error here.
It says that the max depth expected an integer, but it wasn't an integer.
So sadly, this kicked out a non-integer value here.
I'm just going to convert that to an integer and try it again.
And there's our result.
It looks like it didn't need to make 2,500.
It only made 62 trees there.
Let's look at the score of that.
And there's our score of our original.
So in this case, our score didn't really go up.
In fact, it went down a little bit.
But if you look at our model, our model is not overfitting so much on the data.
Is it acceptable if the accuracy goes down?
It might be acceptable if the accuracy goes down.
Again, in this case, we are not overfitting as much.
So I actually feel a little bit better about this model, even though the accuracy went down.
There are other metrics we can look at to see if we improve the model by other metrics, not just accuracy.
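Here's a minimal Hyperopt sketch for one of those rounds; the search ranges and the 20-evaluation budget are my own stand-ins, not the exact values from the video:

    import xgboost as xgb
    from hyperopt import fmin, hp, tpe, Trials
    from sklearn.metrics import accuracy_score

    def objective(params):
        # hyperopt hands back floats, so coerce the integer-like ones
        params = {**params, "max_depth": int(params["max_depth"])}
        model = xgb.XGBClassifier(enable_categorical=True, tree_method="hist", **params)
        model.fit(X_train, y_train)
        # hyperopt minimizes, so return the negative accuracy
        return -accuracy_score(y_test, model.predict(X_test))

    space = {
        "max_depth": hp.quniform("max_depth", 2, 10, 1),
        "min_child_weight": hp.loguniform("min_child_weight", -2, 3),
    }
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=20, trials=Trials())
    print(best)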
|
|
show
|
1:47 |
In this section I'm going to look at a confusion matrix to try and understand how our model is performing.
So confusion matrix looks something like this.
At the bottom you have the predicted label and along the side you have the true label.
So this is our stepwise tuned model.
What we want to see is along the diagonal here that we have a lot of dark values.
You can see that for predictions of 4 it didn't do very well.
It seems to do pretty well for predictions of 0.
How do we interpret this prediction of 4?
Well you can see that three of those are predicted as 3, one of them is predicted as 2, and three of them are predicted as 1.
So it didn't predict anything as 4.
You can see for 0 it did a pretty good job of predicting 0.
Of the ones that were truly 0 but weren't predicted as 0, 12 of them were predicted as 1 and 4 of them were predicted as 2.
Now let's look at this with the training data.
Here's what it looks like with the training data.
So with the training data it did actually get a 1 here, but it does have some misclassifications there.
Let's compare this with our out-of-the-box model.
Here's our out-of-the-box model.
Let's just look at this 89, 26, 10, and 9.
87, 26, 10, and 9.
So you can see like for example our model is doing better with predicting label 1, but it's doing a little bit worse on the other labels.
If you look at the training data you can see that it matches up along the diagonal.
It looks like if we want to optimize label 1 we should use our step tuned model because that does a better job of predicting 1 than the other model.
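A quick way to draw these matrices with scikit-learn, reusing the fitted model and the splits from earlier:

    from sklearn.metrics import ConfusionMatrixDisplay

    # rows are the true labels, columns are the predicted labels
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
    ConfusionMatrixDisplay.from_estimator(model, X_train, y_train)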
|
|
show
|
0:09 |
So this was a brief introduction to making machine learning models.
Once you have the capability to do this, this is a super useful tool to be able to add business value.
|
|
|
8:48 |
|
show
|
0:13 |
In this video we're going to show how to work with databases using Python.
More specifically, we're going to connect to a SQL database, we're going to query it, we're going to get some statistics with it, and we're going to visualize our data.
|
|
show
|
1:32 |
Now I'm going to be using the SQLite library that comes with Python these days.
I've got some data here and I'm loading that data.
This is from a ski resort and I'm going to read that from a CSV file and make a data frame from that.
Let's just go through what's going on here.
This should make sense now that we are familiar with pandas.
We are going to convert the date column to a date and put that in the Denver time zone.
We're pulling off certain columns.
We're getting a month column.
We're getting a year column.
We're making a season column.
The season column is a little bit more involved.
It says if the month is less than five, then we're taking the year minus one and we're adding a dash and we're taking the current year.
Basically you have a ski season and it runs from like November to like May time frame.
This is the code to create that ski season.
If it's during the summertime, we're saying off season there.
Let's run that.
Let's look at Alta.
That's what it looks like.
You can see that we have 1989 through 1990 season that we've crafted there.
Let's look at our D types.
Our D types look pretty good.
We do have some NumPy objects here.
You see that season is an object here.
It's not using our PyArrow backend for those.
If we wanted to, we could go in and tweak those and make sure that those were PyArrow types as well.
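A sketch of that loading chain; the file name, the NOAA-style column names (DATE and so on), and the month cutoffs for the season are assumptions:

    import numpy as np
    import pandas as pd

    alta = (
        pd.read_csv("alta.csv")
        .assign(
            DATE=lambda df_: pd.to_datetime(df_["DATE"]).dt.tz_localize("America/Denver"),
            MONTH=lambda df_: df_["DATE"].dt.month,
            YEAR=lambda df_: df_["DATE"].dt.year,
            SEASON=lambda df_: np.select(
                [df_["MONTH"] < 5, df_["MONTH"] >= 11],
                [(df_["YEAR"] - 1).astype(str) + "-" + df_["YEAR"].astype(str),
                 df_["YEAR"].astype(str) + "-" + (df_["YEAR"] + 1).astype(str)],
                default="off season"),
        )
    )
    alta.dtypes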
|
|
show
|
0:55 |
Okay, in this section I'm going to show you how to connect to a database.
I'm going to use SQLAlchemy to connect to a database.
A nice thing about SQLAlchemy is that it provides a consistent interface.
Theoretically, you could swap out the connection string and connect to other databases.
Now, in practice, some databases have features that other databases don't support, but a lot of times, for the 80 to 90% use case, this works quite well.
Make sure you install SQLAlchemy.
I'm going to make a connection here to a SQLite database; I'm saying I want to make a file-based connection with SQLAlchemy.
I'm going to say create an engine, and so I'm going to create an engine that says I can connect to SQLite, and this is going to be a file-based connection.
If I was using Postgres, I would need to specify a URL with a username and a password and possibly a database name, and it's similar for other databases like Oracle or SQL Server.
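A minimal sketch of creating that engine; alta.db is a hypothetical file name:

    from sqlalchemy import create_engine

    # file-based SQLite; other databases use a URL with username/password/host,
    # e.g. "postgresql://user:password@host/dbname"
    engine = create_engine("sqlite:///alta.db")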
|
|
show
|
0:31 |
Once I've got a connection, we can start manipulating a database, doing things with it.
Let's create a table.
I'm going to create a table in SQLite.
I'm going to say connect to SQLite, and look at this, I'm using pandas here, and I'm going to say, hey pandas, I want you to make a table called ALTA, and if it exists in there, I want you to replace it.
Don't stick the index in there, and then we'll close our connection.
Once you have data in pandas, it's relatively easy to stick it into a SQL table.
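Roughly, the write looks like this, letting pandas manage the connection through the engine from the previous sketch:

    # create (or replace) the alta table from the data frame, without the index
    alta.to_sql("alta", engine, if_exists="replace", index=False)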
|
|
show
|
1:19 |
In this video, I'm going to show you how to query a database.
We have our table created here.
Now I'm going to connect to it.
I've imported SQLAlchemy, I've created my engine, and my query is select star from Alta.
I'm going to use pandas read SQL to do that.
Read SQL, if you want to, you can pull up the documentation for that.
You can see that we pass in a SQL string and we can pass in this connection which is a SQLAlchemy connection.
Once I've done that, the result is a pandas data frame.
I'm pulling out my SQL query as pandas.
Let's look at the types of that.
This is using NumPy types.
Let's try and do that again with PyArrow types.
We're going to set the dtype backend to PyArrow, and we'll look at the types.
You can see that the date did not work.
It said that this is a PyArrow string.
If you come up here, this says that this is an object.
The date conversion did not quite do what we expected it to.
Let's use SQLite to inspect what's going on inside of SQL.
This is a query that will give us the schema.
You can see that date says it's a timestamp.
Sadly, pandas is not converting the timestamp there.
If you want to get a timestamp working, you'd have to manually convert that.
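A sketch of the query and the schema check, assuming the engine and table from the previous sketches:

    import pandas as pd

    df = pd.read_sql("SELECT * FROM alta", engine, dtype_backend="pyarrow")
    df.dtypes      # note: the date column may still come back as a string

    # peek at the schema SQLite actually stored
    pd.read_sql("SELECT sql FROM sqlite_master WHERE name = 'alta'", engine)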
|
|
show
|
1:57 |
In this section we're going to look at doing another query that's not just selecting star from a database.
Let's run this query. I'm just going to say select star from Alta where year is 1990, and it looks like we have the values from 1990 there.
I mentioned previously, when I was teaching about describe, that someone was upset because they implemented describe on their database and it took them like three weeks to do that.
I'd prefer to just pull the data out and use pandas to do that.
Let's see how we can do that.
Okay, so I'm going to read my data here: here is my data, select star from Alta, and I just do a describe on it.
Since I've pulled it into pandas, I can get those summary statistics relatively easily.
Now, if I wanted to do a describe query in SQL, it would look something like this.
This isn't even a complete describe query, but it looks something like this, and we can run that.
This is doing it for just one column, the snow depth column, which is how much snow is on the ground.
You can imagine that doing that for multiple columns is sort of a pain.
Now note that I can't even do standard deviation in SQLite; some databases have support for that, but I can do something like this, which is kind of cool.
I can say let's create an aggregate called standard deviation, which is this Python standard deviation class, and then I can actually come in here, run SQL, and tell it to use that standard deviation.
In this example, I showed how we can pull data out and process it in pandas.
Hopefully that makes sense.
Can you stick processing into a database?
Yeah, you can. I showed how to calculate the standard deviation in SQLite.
In practice that tends to be not so much of a problem, because people don't switch out their database too much, but oftentimes you do want to push calculation into your database, and you can use Python to do that.
It's not going to be particularly fast, but it is an option if you need it, at least for SQLite.
|
|
show
|
1:54 |
One of the downsides of databases is they don't have visualization.
And as you've seen, I'm a huge fan of visualization.
So once we pull out our data, we can visualize it with pandas relatively easily.
So I'm going to select the date and TOBS, this is temperature when someone went out and measured the snow levels.
It looks something like this.
Let's see if we can visualize this.
So I'm going to say, let's group by date and do it at the month frequency.
And then we're going to get the mean values and we're going to plot that.
And we got an error here.
And what is our error?
It says you can only use grouper if you have a date time in there, but it didn't get a date time.
So let's see if we can convert date to a date time.
If we do this, it says that it is an object.
Sadly, that's not really what we want.
We don't want it to be an object.
If we say UTC is equal to true, then it comes out as a date time.
This is one of those issues that I don't like with pandas.
And that sometimes it converts things to dates.
If you look at the type of this, I'm just going to say iloc 0 to pull off the first date here.
This is a timestamp object.
If you look at this one, this is a timestamp object in UTC.
This should work.
I'm going to actually convert that to Denver.
And then what I'll do is I'm going to say, now let's group that by date and let's plot that.
So here's the grouping by date.
Looks like that did work.
We can visualize that.
Let's just zoom in from 2010 to 2012.
So again, this is the temperature and you can see that it's cold during the winter.
And then it warms up in the summer, gets cold, warms up, etc.
We see that cyclical pattern.
And we can't really visualize with the database unless we stick on a BI tool on it.
Pandas lets us do this relatively easily.
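A sketch of that chain, assuming the DATE and TOBS columns and the engine from before:

    import pandas as pd

    temps = pd.read_sql("SELECT DATE, TOBS FROM alta", engine, dtype_backend="pyarrow")

    (temps
     .assign(DATE=pd.to_datetime(temps["DATE"], utc=True).dt.tz_convert("America/Denver"))
     .groupby(pd.Grouper(key="DATE", freq="M"))   # monthly; newer pandas prefers "ME"
     .mean()
     .loc["2010":"2012"]
     .plot())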
|
|
show
|
0:27 |
In this video, I showed how to connect to databases.
Sometimes you might have issues when you pull out columns.
They might not be the correct type, so you're going to have to use some pandas skills to fix that.
Hopefully, you're feeling comfortable with that.
We also showed that once you've got your data out, you can do your Pandas things.
So things that are really easy in Pandas, but are painful in SQL, you can do.
So that would include summary statistics.
It would also include visualization as well.
|
|
|
13:26 |
|
show
|
0:10 |
In this video I'm going to show you how to create plots with Plotly.
Plotly is a nice library that allows you to do interaction.
And I'm also going to show you how to make dashboards.
|
|
show
|
0:22 |
Let's load our data.
The data set that I'm going to be using is meteorological data from a ski resort in Utah nearby where I live and what we're going to do is we're going to explore this data a little bit, clean it up, and then we're going to make plots of how much snow is on the ground during the ski season so people can visualize how good the ski season was.
|
|
show
|
1:44 |
In this video, we're going to clean up the data.
So here's the data that we just loaded.
This is a data frame that has meteorological information about a ski resort.
It's got data from 1990 to 2018.
And we're mostly concerned with some of the attributes that you can't see here.
And this is not showing all of the columns that I want to see.
Just bring those out over here.
We've got date and then we've got a lot of attributes that are hidden in here.
The ones that we'll be concerned with is the snow one.
That's how much snow fell on a given day.
Snow depth, that's how much snow is on the ground.
We've got these temperature columns: Tmax, Tmin, and Tobs.
Tmax is the maximum temperature during the day and Tmin is the minimum.
And then there's Tobs, which is the temperature of observation, recorded when they went out and measured the snow depth.
And then another one is precipitation, PRCP.
That's how much water fell.
So that's different than snow.
Snow is how many inches of snow fell.
PRCP is how many inches of water fell.
Okay, so here's my code to clean it up.
I've got a chain here.
Let me just talk about what this chain is doing.
We are converting the date column to a datetime.
I'm pulling out certain columns and then I'm making a month column, a year column, and a season column.
The season column is a little bit more complex, but basically we can see that a season runs from the end of a year to the next year.
And that's the logic for doing that.
Okay, at this point, I'll store that in a variable called ALTA.
Let's just make sure that ALTA exists.
And there we go.
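Here's a rough sketch of what that chain might look like; the raw frame and the NOAA-style column names are assumptions on my part, not the exact course code:

```python
import pandas as pd

# A minimal sketch of the cleanup chain described above, assuming the `raw`
# frame loaded earlier; the column names (DATE, SNOW, SNWD, TMAX, TMIN, TOBS,
# PRCP) are assumptions about the file.
def add_season(df):
    # A ski season spans the end of one year into the next, so October-December
    # counts toward the season that starts that year (the October cutoff is a guess).
    start = df.YEAR.where(df.MONTH >= 10, df.YEAR - 1)
    return df.assign(SEASON=start.astype(str) + "-" + (start + 1).astype(str))

alta = (raw
    .assign(DATE=lambda df: pd.to_datetime(df.DATE))
    .loc[:, ["DATE", "SNOW", "SNWD", "TMAX", "TMIN", "TOBS", "PRCP"]]
    .assign(MONTH=lambda df: df.DATE.dt.month,
            YEAR=lambda df: df.DATE.dt.year)
    .pipe(add_season)
)
alta
```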
|
|
show
|
2:01 |
In this video, we're going to explore how we can make plots with Plotly instead of Matplotlib.
Plotly is running in the browser.
Matplotlib, what it does is it goes out and generates an image on the server.
So, Matplotlib, when you tell it to make a plot, will go out and generally make like a PNG image and then bring it back to you.
Plotly runs all this and renders all this in the browser.
So, if you're looking for interactivity, the architecture of Plotly lends itself to better interactivity where you can hover over and things don't have quite as much lag as they would with Matplotlib.
Let's import Plotly.
Plotly has this express module that has a simpler interface than just Plotly itself.
I'm going to tell Pandas that I want to use Plotly as the backend.
When I do that, I can plot using Pandas, but instead of getting Matplotlib output, I'm going to get Plotly coming out of that.
So, here is a line plot.
We're sticking the date in the x-axis, and then we're plotting how much snow is on the ground.
So, we can see the pattern going on here that we have snow on the ground during the winter and then during the summer it melts.
Cool thing about Plotly is it is interactive, so I can hover over these and zoom into them and see what's going on here.
You can see that we do have quite a lot of information that is missing from this.
It probably needs to be cleaned up a little bit more, but we get the basic idea.
So, I'm going to do a little pipe here.
I'm just going to query for the 2010 to 2011 season, and then let's just do the plot of that.
You can see that the season starts somewhere around November and ends around April.
You can see how much snow is on the ground that it's building up over this time.
2011 was an abnormally high snow year, but you would see this fall off really quickly here in the May-June timeframe.
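As a minimal sketch, switching the backend and making those two plots might look like this (the alta frame and its DATE, SNWD, and SEASON columns are carried over as assumptions from the cleanup step):

```python
import pandas as pd

# A minimal sketch, assuming the cleaned `alta` frame from the previous video.
pd.options.plotting.backend = "plotly"   # route .plot() calls to Plotly instead of Matplotlib

# Interactive line plot of snow depth over time.
alta.plot.line(x="DATE", y="SNWD")

# Zoom in on a single season with query, then plot just that slice.
(alta
 .query('SEASON == "2010-2011"')
 .plot.line(x="DATE", y="SNWD")
)
```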
|
|
show
|
2:28 |
In this video, I'm going to show you how to do a bar plot with Plotly.
Let's go through what we're trying to do here.
I've got a little chain.
So I'm taking off the snow column.
This is how much snow fell during a given day.
I'm going to pipe that into pd.cut.
This is going to put it into bins.
So we're going to take this continuous variable and put it into 10 bins.
I'll do a value counts of that.
So how many times did each bin occur?
And then we'll do a bar plot of that.
And when we do that, we get an error.
If we scroll down, we see that it says, object of type interval is not JSON serializable.
Scroll down a little bit more and see if we can make sense of that.
Remember, pandas is generally going to put the index on the x-axis.
And that is an interval or categorical.
And it doesn't work with Plotly.
It needs to be JSON serializable.
So let's mess with that a little bit.
Basically, I'm going to convert that to a string.
So I've got a little chain here that will do that.
And again, let me just walk through this chain so you can see what's going on.
So here's what we had.
We had our cutting.
I'll do our value counts.
That's what we had before.
Let's reset the index.
This is going to push the index into a column.
Right now, we have the snow index, which is the bins.
We're going to put that into its own column.
There we go.
It's the capital SNOW column; that's how much snow fell, now in bins.
I'm going to just rename that to a column called bin, with "in" for inches.
And I'm going to just capitalize count.
OK, that looks like that worked.
And then let's convert that bin column to a string.
And at this point, we should be able to plot that.
And that looks like that worked.
So what we see is there is a lot of snowfall in that 0 to 3.6 inch bin.
And then the other ones look like they're outliers.
Another thing I can do with pd.cut, which is kind of cool, is specify the bin edges.
So here I'm going to say minus 1 to 0, 0 to 1, 1 to 5, et cetera.
Let's look at that.
And we'll look at the output here.
The rest of this is the same.
That's kind of cool.
We can easily specify those bin edges.
So here we see that a lot of snow was in that 0 range.
So let's use our Pandas skills to filter that and say we only want snow that was greater than 0.
So we're getting rid of that tall bar there.
And this lets us zoom in to the other values.
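Putting that whole binning chain together might look roughly like this; the bin edges and the bin_in/count column names are illustrative choices, not the exact course code:

```python
import pandas as pd

# A minimal sketch of the binning chain, assuming the Plotly backend set
# earlier, the `alta` frame, and its assumed SNOW column.
(alta
 .SNOW
 .loc[lambda s: s > 0]                          # keep only days when snow actually fell
 .pipe(pd.cut, bins=[0, 1, 5, 10, 15, 20, 30])  # explicit bin edges in inches
 .value_counts()
 .sort_index()                                  # keep the bins in order
 .reset_index()                                 # push the interval index into a column
 .set_axis(["bin_in", "count"], axis="columns")
 .astype({"bin_in": str})                       # intervals aren't JSON serializable; Plotly wants strings
 .plot.bar(x="bin_in", y="count")
)
```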
|
|
show
|
3:40 |
In this video I'm going to demonstrate scatter plotting with Plotly.
So this works very similarly to Matplotlib.
I'm going to say do a scatter plot, specify which columns I want on the x and y axes and which column I want to use to color it, and there is our scatter plot.
So what is this showing us?
It's showing us each month how much snow is on the ground.
Is this the world's greatest scatter plot?
No, not necessarily.
I am seeing things line up on columns.
That's because month, even though it is numeric, is basically like a category here.
I'd probably want to change the alpha.
Let's try and change the alpha of this.
Turns out that alpha doesn't work with Plotly, sadly.
And I believe it's opacity here.
Let's try this with opacity.
And there we go.
So I would also probably jitter this using our jittering code that we saw in other places, but you get the idea here.
A nice feature of Plotly is that if you hover over a point, it will tell you what the value is of that.
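A minimal sketch of that scatter, written directly against Plotly Express and using opacity instead of alpha (the MONTH, SNWD, and YEAR column names, and which column drives the color, are assumptions):

```python
import plotly.express as px

# A minimal sketch; column names and the color choice are assumptions about
# the cleaned frame. Note Plotly uses `opacity` where Matplotlib uses `alpha`.
px.scatter(alta, x="MONTH", y="SNWD", color="YEAR", opacity=0.5)
```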
Okay, the next thing I'm going to do is a scatter plot of the temperature of observation versus snow depth, colored by month: what the temperature is, how much snow is on the ground, and what month it is.
Here's our plot.
Let's see what's going on here.
We're going from purple to yellow.
So as we are purple, it looks like we have a low temperature.
We have a lot of snow on the ground.
As we start going to like a pinkish, we're getting a higher temperature, less snow on the ground.
You can see that somewhere over here, it gets to this month six-ish where we get pink and orange.
We have high temperatures, no snow on the ground.
You can see it sort of comes out there and then it starts coming back to orange.
As we get back to orange, we are getting into this fall time frame where it gets colder and we start to get snow on the ground.
So we're sort of seeing this sort of cycle where it gets colder, snow builds up, it gets warmer, snow melts, and then summer there is no snow on the ground and then it comes back.
I would like to use a cyclical color map to demonstrate that.
I'm not aware of how to do that through pandas, but we should be able to do it with Plotly itself.
Let's see how to do this using just Plotly code.
Here's the code to do this with Plotly.
Let's look at the plot.
This makes it a little bit more clear.
I use Twilight right here, which is a cyclical color map.
It goes from white to purple to this dark color to a reddish color and back to white.
So we're seeing in the cold of winter, that's this whitish pinkish color.
We have cold temperatures, snow on the ground.
As we move into February, it warms up a little bit, but snow builds up.
And then we're seeing the darker blues as we get into summer.
It starts to melt.
We have less snow on the ground.
And then we get into when it's warm enough, there is no snow on the ground.
The snow is melted.
And then you can see that this is sort of transitioning back into colder temperatures.
We're starting to get snow build up.
I think this is an interesting scatter plot.
I wish that there was a way to do this from pandas directly.
Let's just look at the code here.
Basically, we're saying make this figure.
And the data in the X is going to be the temperature of observation.
The data in the Y is going to be the snow depth.
And we're passing in this marker dictionary where we can specify the color scale using the Twilight color scale.
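A rough sketch of that pure-Plotly figure with a cyclical color scale might look like this (the TOBS, SNWD, and MONTH column names are assumptions carried over from the cleaned frame):

```python
import plotly.express as px
import plotly.graph_objects as go

# A minimal sketch of the pure-Plotly scatter with a cyclical color map.
fig = go.Figure(
    data=go.Scatter(
        x=alta.TOBS,                                  # temperature at observation
        y=alta.SNWD,                                  # snow depth
        mode="markers",
        marker=dict(
            color=alta.MONTH,                         # color each point by month
            colorscale=px.colors.cyclical.Twilight,   # cyclical scale, so December wraps back into January
            opacity=0.5,
            colorbar=dict(title="Month"),
        ),
    )
)
fig.show()
```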
|
|
show
|
1:43 |
In this video I'm going to demonstrate making a dashboard using the Dash framework.
This is a project that was started by the people who made Plotly.
You'll notice at the very top I use the %%writefile magic with a dash demo file name.
I'm not really running this from Jupyter; I'm writing the code in Jupyter, but it's actually being written out to a file.
I've got all of my imports up here, I've got my tweak code, and then at the bottom I'm loading my data, I'm tweaking it.
And here's the key parts down here.
I'm making this app.
So this is a Dash app, and then I'm making a figure, this is a Plotly figure, and then with my app I'm specifying the layout.
So I'm saying the layout is going to be an HTML div, and the children in that will be an H1 tag, and then some markdown, and then a graph, and you'll notice that in the graph I refer to the figure that I specified up above.
Then down below I have this little snippet.
If this script is executed directly, I want to run the Dash server.
Let's run that.
I'm going to execute that from the terminal.
And this is what the output of this looks like.
You can see here's an H1 tag at the top.
We can see that here's some markdown code, and down here is our Plotly plot.
Again, the nice thing about this Plotly plot, we do have interactivity.
We can zoom into that.
We can reset it if we want to.
We can export that as an image as well.
In this video I showed you how to make a basic Dash dashboard.
It's relatively easy.
You'll need to create a file, and then you make this Dash app.
You add your Plotly figures into it, and you can add markdown into it or HTML, and then you just kick off the Dashboard server, and you can hit the endpoint and view the dashboard.
|
|
show
|
1:10 |
In this video, I'm going to show you how to make a dashboard with a widget.
So here's our code.
A lot of this looks the same as what we had before.
The key difference of this is that we added this drop-down section down below.
And you can see that the drop-down has options.
We're just taking those directly from our pandas data frame.
And we're giving that an ID.
And then down below that, you can see that we have this function called update line graph.
There's a decorator.
It says app callback.
And we're saying that the output is to line graph.
That's specifying the ID of the figure up above.
And the input is coming from the years ID, which is the drop-down.
So the year will come in as max year in this function down below.
And what are we returning?
We are returning a plot.
This is a plotly plot, which will come as the output and will be tied in to the graph as the figure up above.
Let's run it and see what happens.
Here's the 1990 season.
You can see that the max goes up to like 110 there.
Let's look at the 1993 season.
You can see that the max goes up to 98 here.
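A rough sketch of wiring the dropdown to the graph with a callback might look like this; the IDs, column names, and season labels are assumptions, not the exact course code:

```python
# A minimal sketch of a Dash app with a dropdown driving the figure.
from dash import Dash, dcc, html, Input, Output
import pandas as pd
import plotly.express as px

# Hypothetical file holding the cleaned data, including the SEASON column.
alta = pd.read_csv("alta.csv", parse_dates=["DATE"])
seasons = sorted(alta.SEASON.unique())                 # e.g. "1990-1991", "1991-1992", ...

app = Dash(__name__)
app.layout = html.Div(children=[
    html.H1("Alta Snow Dashboard"),
    dcc.Graph(id="line-graph"),
    dcc.Dropdown(id="years",
                 options=[{"label": s, "value": s} for s in seasons],
                 value=seasons[0]),
])

@app.callback(Output("line-graph", "figure"), Input("years", "value"))
def update_line_graph(season):
    # Re-plot snow depth for whichever season the dropdown selects.
    subset = alta.query("SEASON == @season")
    return px.line(subset, x="DATE", y="SNWD", title=f"Snow depth, {season}")

if __name__ == "__main__":
    app.run(debug=True)
```

You would run this the same way as before, from the terminal, and the graph redraws whenever you pick a different season.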
|
|
show
|
0:08 |
In this video we showed how to connect a widget to a dashboard.
This is relatively easy once you use the decorator callback syntax to connect the input and the output.
|
|
|
1:16 |
|
show
|
1:16 |
All right, everyone, this marks the end of our course.
We've covered a lot of ground in our Python and pandas and data journey together.
Let's do a quick retrospective of what we learned.
We looked at student data and explored loading and manipulating it.
We looked at income data and how we can mess with that as well.
We talked about joining and merging data.
We showed how pandas has various tricks that we can do to evaluate that.
And we also showed how we can do things like stick in a pipe call to check how we're doing as we go along our process.
We looked at doing aggregation, cleaning up data.
We've looked at time series processing.
We looked at dealing with text data, some basic NLP constructs.
We even made some machine learning models.
We looked at how to make simple dashboards and we looked at how to connect with databases.
Hope this course has provided you with a skillset to cover a variety of data analysis tasks using Python and pandas.
Remember, learning doesn't stop here.
One of the most important things that you can do to continue learning is to practice.
My advice to you would be to find data and practice these techniques on your own data.
Thanks for being a part of this journey.
Hope to see you in a future course.
|