3:50

0:25
Welcome to Polars for Power Users, Transform Your Data Analysis Game.<br>
My name is Christopher, and I will be your guide.<br>
This course is all about Polars, one of the newer data analysis libraries for Python.<br>
If you've reached the limit of what you can do with Excel, or realized just how fragile working in a spreadsheet can be, Polars might be your answer.<br>
Using it, and tools like it, can greatly increase your capabilities when working with data.
0:35
This is a course on Polars, which is a Python library.<br>
Not surprisingly, that means I'm going to assume you know some Python.<br>
Also, many of the examples along the way show you how to translate an operation from Excel into Polars code.<br>
If you don't know Excel well, you'll be fine, but some familiarity with the concepts there will likely make this course easier to understand.<br>
Speaking of that Python thing, if you do need a refresher, maybe take a look at Talk Python's course, Python for the Absolute Beginner, before diving into this one.<br>
Come back later if you need.<br>
I'll be right here waiting.
1:08
There are several different libraries out there for Python that use the dataframe concept as a base.<br>
The most popular of these is pandas.<br>
If you've done some pandas before, you'll be able to pick Polars up rather quickly.<br>
The overall concepts are quite similar, but there are some differences.<br>
First off, Polars doesn't have an index column.<br>
Or another way to look at it is the only index in Polars is the row number.<br>
That means you can do iloc-like things, but not loc-like things.<br>
Second, Polars uses Python objects for expressions.<br>
In Pandas, the square bracket slicing concepts have been massively overloaded to the point where some code doesn't really look like Python anymore.<br>
Polars achieves the same thing with actual Python objects.<br>
The downside is this can mean a bit more typing.<br>
The upside is the code looks more like Python than black magic.<br>
Third, and this is the big one, performance.<br>
Polars is actually a dual library.<br>
It's available in Python and Rust.<br>
In fact, the Python part is just a thin veneer on top of the Rust library.<br>
This makes Polars really, really fast.<br>
In test after test, Polars outperforms pandas.
1:42
Almost all the demos in this course are going to be done with the Python REPL.<br>
The REPL sessions have been saved as text files with the extension .repl and are available to you in the course's GitHub repository.<br>
If you're following along and want to copy and paste instead of typing it all in, you'll want to clone the repo on the screen here.<br>
Well, I guess that was rude of me.<br>
I probably should have properly introduced myself.<br>
My name's Christopher Trudeau.<br>
My day job is helping companies with their tech stacks and technical processes, but I'm also the co-host of the Real Python podcast and author of Django in Action.<br>
On screen here, below my ugly mug, is my social info.<br>
Feel free to reach out if you've got a question.<br>
Episode 402 of Talk Python To Me is an interview with Richie Vink, one of the creators of Polars.<br>
If you want a little more background on how this library came to be, and what is behind its design philosophy, then this is worth a listen.<br>
Alright, enough of the chit-chat.<br>
Let's get you going.<br>
When building this course, I used Python 3.13 and Polars 1.24.<br>
I don't use anything 3.13 specific, so an earlier supported version of Python should be fine.<br>
Since Polars is a third-party library, you'll need to install it on your system.<br>
Best practice for this is to always use a virtual environment.<br>
If you've never done that before, you probably want to check out Python for the Absolute Beginner before you proceed and learn how to do virtual environments.<br>
Once you've got your virtual environment installed and activated, fetching Polars is as simple as a call to pip install.<br>
Of course, if you've switched to UV, you can use that as well.<br>
Okay, you ready?<br>
In the next lesson, I'll kick it off by showing you how to create the core component of Polars, a data frame.
23:10

2:38
In the previous lesson, I gave an overview of the course.<br>
This lesson is your first introduction to the concepts and objects in polars, including data frames, expressions, and contexts.<br>
In the first part of this lesson, I'll be using some sample data that I copied from Wikipedia.<br>
This table shows five of the largest land animals out there, including their average mass, the largest known mass for that animal, and their length in meters.<br>
For my American friends, a meter is roughly equivalent to a yard, and I'll be converting that to feet as part of the lesson.<br>
The first step is to take that content and put it in the core Polars object, called a data frame.<br>
Off to the REPL to play with the data.<br>
First off, I need to import Polars.<br>
Notice how I'm aliasing it to PL.<br>
This is common practice.<br>
You don't have to do it this way, but most people do, and it echoes a similar practice with other scientific and numeric libraries like NumPy and Pandas.<br>
Now that I've got the library imported, I need some data.<br>
Let me create a dictionary.<br>
My eventual goal is to represent that table I just showed you.<br>
To do that, I'm using a key in the dictionary as a column with the value containing a list of the cell contents of that column.<br>
The second column is the average mass, then the max, and finally the length.<br>
With a dictionary in this format, I can construct the core Polars object, the data frame.<br>
To better understand data frames, let's evaluate it in the REPL.<br>
The data frame is a row-column abstraction, just like the table I showed you earlier.<br>
Just above the printout of this table, you see the word shape.<br>
This indicates the size of the table, with a (5, 4) tuple indicating that it has 5 rows and 4 columns.<br>
The second thing to notice is the name of the columns, which correspond to the keys in the data dictionary.<br>
Later, when I start doing operations on this data frame, these names will be how I reference the columns.<br>
Below the column names is the column's data type.<br>
These are similar but not identical to Python data types.<br>
I'll cover them later in more detail.<br>
For now, know that str means string and f64 is a double-sized float.<br>
Polars automatically determined the data type based on the content of the column when constructing the data frame.<br>
This mostly works, and in a future lesson I'll show you what to do when it doesn't.<br>
The rest of the data frame is just the rows of data.<br>
If you're coming from pandas, note the one key difference here: there is no explicit index.
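The on-screen REPL session isn't reproduced in this transcript, so here is a rough sketch of the steps being narrated. The animal names, column names, and figures are placeholders, not the exact table from the video.

```python
import polars as pl  # the conventional alias

# Placeholder version of the animal table being described
data = {
    "animal": ["Elephant", "Rhino", "Hippo", "Giraffe", "Gaur"],
    "avg_mass": [6.0, 2.0, 1.5, 1.0, 0.9],    # tonnes
    "max_mass": [10.0, 4.5, 4.5, 2.0, 1.5],   # tonnes
    "avg_length": [6.6, 3.8, 3.5, 5.2, 3.3],  # metres
}

animals = pl.DataFrame(data)
animals  # shows shape (5, 4), column names, dtypes, and the rows
```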
1:22
The shape info shown with the table above is also available as an attribute.<br>
The schema attribute contains information about the columns.<br>
Here, instead of str, you see the full word string, and likewise with float64.<br>
Note the capitalization.<br>
These are Polars classes, not the Python equivalents.<br>
For the most part, you can convert back and forth, but for now, understand that they are subtly different.<br>
DataFrames have methods that will give you information about their contents.<br>
The describe method gives you statistics about each column.<br>
count tells you how many items are in each column.<br>
null_count counts the number of null values in a column.<br>
And the rest are statistical information: mean, standard deviation, minimum and maximum values, and the 25th, 50th, and 75th percentiles in between.<br>
Be careful to note that the rows in this table aren't really related to each other.<br>
Just because Rhino is the max string doesn't mean their average mass is 6 tons.<br>
The 6 tons comes from the elephant, which is the largest value in the average mass column.<br>
Same goes for the largest values for max mass and length.<br>
Try to keep this in mind.<br>
I can't tell you how many times I've mistaken this summary for the max row and then started jumbling my data together.
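As a sketch, the attribute and method calls just described look roughly like this (continuing with the placeholder animals frame from above):

```python
animals.shape       # (5, 4) -- rows, columns
animals.schema      # column names mapped to Polars dtypes (String, Float64, ...)
animals.describe()  # count, null_count, mean, std, min, max, and percentiles per column
```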
0:53
I mentioned that unlike pandas, Polars doesn't have an explicit index, but you can still reference rows using square bracket notation.<br>
Let's try some of that.<br>
The hippo is the third row of data.<br>
Like with most square bracket stuff in Python, it's zero indexed.<br>
With a data frame, you can do the typical things you do with square brackets, including negative indexing.<br>
And slicing.<br>
The head method on the frame returns the top 5 rows.<br>
Seeing as there's only 5 rows in our table, that's everything.<br>
5 is the default, but you can pass a number to head.<br>
And that specifies how many rows should come back.<br>
What's a head without a tail?<br>
Same idea, but showing the bottom rows instead.
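Here is a rough sketch of the indexing calls being demonstrated; the exact values returned depend on the real table from the video.

```python
animals[2]       # third row, zero indexed -- the hippo in the narration
animals[-1]      # negative indexing works too
animals[1:3]     # slicing returns a smaller data frame
animals.head()   # first 5 rows -- the default
animals.head(2)  # first 2 rows
animals.tail(2)  # last 2 rows
```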
1:22
The most common object in Polars is definitely the data frame, which, as you've seen, represents a series of rows and columns, like with a spreadsheet.<br>
Polars also has an object called a series, which you can think of like a column on its own.<br>
It is sort of a list, but with metadata like a title and data type.<br>
When I showed you the animal data frame, each column had a data type.<br>
You saw strings and floats.<br>
Polars also supports integers, fixed decimal numbers, dates, times, and much more.<br>
So you've got all this data, now what if you want to do something with it?<br>
Polars uses expressions which are an object-oriented way of performing operations on a data frame or a series.<br>
For example, you could construct a new data frame containing the average length in feet by multiplying the average length in meters by 3.28.<br>
In fact, I'll show you how to do just that shortly.<br>
Expressions are for calculating something new.<br>
Contexts are for accessing part of a data frame.<br>
Back to my meters to feet example, I could use just the length column on its own to do that.<br>
There are several different methods for getting a new context from a data frame.<br>
The three most common are select, which chooses one or more columns, with_columns, which copies a data frame and adds columns to the copy, and filter, which returns a subset of the rows of a data frame.
1:25
Let's head back into the REPL, using the same animal data to see expressions and contexts in practice.<br>
On the screen, I've already got the same animals data frame as before.<br>
Let's start by fetching a single column from it.<br>
The select method creates a new data frame with a copy of the original's data, but only including the selected columns.<br>
When you pass in a string to select, you get the named column, which in the example below is the name of our animals.<br>
Here, I've selected the average mass instead.<br>
You can pass multiple arguments to select to fetch multiple columns.<br>
Remember, each time you do this, you're getting a new data frame with the data copied from the original.<br>
The animals data frame hasn't been touched here at all.<br>
Passing a string argument to the select method is a shortcut.<br>
To do more complex things, you can give it an expression.<br>
The expression that specifies a column is the Polars col function, which takes the name of a column.<br>
This is the same as using the string shortcut earlier.<br>
On its own, you'll want to save yourself some typing and just use the string.<br>
When you want to chain operations together, you'll need to use the full form.
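A sketch of what those select calls look like, with my placeholder column names standing in for the real ones:

```python
animals.select("animal")               # single column by name
animals.select("animal", "avg_mass")   # multiple columns
animals.select(pl.col("avg_mass"))     # the same thing via the col expression
```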
2:16
Let's take a look at just what that col thing is.<br>
The col function is an expression factory, meaning it returns an expression object.<br>
Note that it knows nothing about our data frame.<br>
I could just as easily write col("apple").<br>
Polars won't care.<br>
Of course, if I tried to apply the apple column expression to a data frame without such a column, I'd get an error.<br>
Let's do something a little more complicated, a situation where the shortcut would be insufficient.<br>
Let's look at this piece by piece.<br>
First, I'm using the select method, which means I'll be fetching one or more columns.<br>
When you use a named argument to select, it creates a new column with that name, which in this case is average length feet.<br>
The argument itself is an expression that says ""Start with the average length column, then multiply it by 3.28"".<br>
There are 3.28 feet in a meter.<br>
The end result is a brand new dataframe consisting of a single column named average length feet with five new values, each of which correspond to the imperial equivalent to the metric values in the original dataframe.<br>
Let's dig into that a little more.<br>
This is the same expression I used in the select.<br>
Just now it's on its own.<br>
Expression objects have overloaded math operators, so when you perform math on an expression, it returns a new expression that will perform that operation for you.<br>
You can sort of make that out in the REPL evaluation, where it shows the average length column multiplied by a dynamic value, and that's where the info gets cut off.<br>
Since this is an object, I can store it in a variable like any other object, which of course means I can use it later.<br>
This is the same result as before, but this time using the stored expression.<br>
Again, the expression has nothing to do with the data frame until you apply it, so you can reuse this object over and over, even with different data frames.<br>
And like I said before, none of this has any effect on the original data frame.
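A sketch of the calculation being described; the column and conversion factor follow the narration, while the names themselves are my guesses:

```python
# A named argument to select becomes the new column's name
animals.select(avg_length_feet=pl.col("avg_length") * 3.28)

# The expression is just an object, so it can be stored and reused
feet = pl.col("avg_length") * 3.28
animals.select(avg_length_feet=feet)  # same result, reusing the stored expression
```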
0:51
If I wanted a copy of the dataframe with the new column appended, I could use select and name all four columns, adding the fifth calculated one.<br>
But that's a lot of typing, so instead...<br>
The with_columns method is like select, but instead of returning just the selected columns, it returns a copy of the dataframe with your new column appended to the end.<br>
And note how I reused the expression to do this.<br>
Once again, with_columns doesn't affect the original data frame, so if you want to keep it, you need to overwrite your variable.<br>
And now animals has five columns.<br>
Of course, this isn't anything special.<br>
I could have stuck it in a differently named variable as well if I wanted to.
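Roughly, the with_columns step looks like this, reusing the feet expression from the sketch above:

```python
# Returns a copy with the calculated column appended;
# overwrite the variable if you want to keep the result
animals = animals.with_columns(avg_length_feet=feet)
animals.columns  # now five column names
```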
1:39
Okay, so that's two ways of dealing with columns.<br>
What about some rows?<br>
The filter method returns a copy of the dataframe containing only those rows that meet the given criteria.<br>
The criteria are specified using an expression and a comparison operator.<br>
This works because Polars has overloaded comparisons for the expression objects, in this case returning all of the animals with an average mass greater than or equal to 2 tons.<br>
Expressions also have a shortcut for specifying multiple criteria.<br>
In this case, I've got all the rows where average mass is between 1 and 2 tons.<br>
is_between is inclusive, so you get both the 1 ton giraffe and the 2 ton rhino.<br>
I'd never have guessed that giraffes were that heavy.<br>
They're all spindly, but I guess you don't get to be 17 feet tall without having some heft.<br>
You can pass multiple arguments to filter.<br>
This shows rows where the average mass is less than 6 tons, and the maximum mass is between 1 and 5 tons.<br>
Comparisons are just expressions themselves.<br>
Let's look at one.<br>
Like with multiplication before, operating on an expression gives you a new expression.
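A sketch of the filters being narrated, again using my placeholder column names:

```python
animals.filter(pl.col("avg_mass") >= 2)              # comparison expression
animals.filter(pl.col("avg_mass").is_between(1, 2))  # inclusive at both ends
animals.filter(                                      # multiple criteria, ANDed together
    pl.col("avg_mass") < 6,
    pl.col("max_mass").is_between(1, 5),
)
```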
4:58
To reiterate, expressions are objects.<br>
If you do something like multiply an expression, it isn't really multiplication.<br>
The expression object has overloaded the multiplication operator, which stores the idea of multiplication as an operation; when the expression is applied, it gets vectorized, meaning the multiplication happens to every cell in the column.<br>
Why do I keep harping on this?<br>
Well, for more complex math you can't just use the math module.<br>
If you do, you'll be applying log to an expression object, and math's log doesn't accept expression objects.<br>
That doesn't mean you can't do fancier stuff, it just means you have to use methods on the expressions to perform their vectorized equivalents.<br>
Back into the REPL to see this in practice.<br>
For this lesson, I want some more data, so I'll create a new dictionary.<br>
This data frame has a thousand rows, with the num column containing the values from 1 to 1000, while double and triple contain 2 and 3 times the value of num.<br>
With our animal data, there were only 5 rows.<br>
Once you've got more data, the display here shows the first 5 and the last 5, separating them with an ellipsis.<br>
Always remember to look at the shape to see the true size of your data frame.<br>
And you'll remember, head returns the first five rows, and this time around, that isn't all of them, just the first five.<br>
Let's briefly review by creating a new data frame consisting of a copy of the num column and a four times column as well.<br>
Before I type it in, take a second and think about what that would look like.<br>
Alright, that's your second.<br>
Here it is.<br>
I hope you got it right.<br>
In this select, I combined the use of the string column name shortcut along with a named argument.<br>
That named argument is particularly important in this case.<br>
Let me show you why.<br>
If you don't give select a name for the new column, it uses the column the calculation is based upon as the default name.<br>
In this case, that's problematic, because now you're attempting to have two columns with the same name.<br>
That's what the exception means when it says the name num is duplicate.<br>
There are two ways around this.<br>
You can be explicit with a named argument like I did above, or you can use the alias method on an expression.<br>
Calling alias renames the column from the default.<br>
When using select I usually prefer a named argument as it's a little more readable, but there are more complex situations where alias is the way to go.<br>
For now you can use either.<br>
That was a bit of a tangent.<br>
I said I'd show you some math, so let me show you some math.<br>
I've grabbed square root from the built-in math library, and of course if you pass it a number you get its square root.<br>
But if I try to use that with an expression, I get an error.<br>
Remember what is happening here: an expression isn't a number, it's an object.<br>
And the math library's square root function doesn't know what to do with that.<br>
Thankfully, square root is implemented as a method on expressions.<br>
And there's the result of the num column's square root.<br>
I was a bit of a bad boy here, not aliasing the column.<br>
It worked fine as nothing else is named num, but it might be considered confusing for this data frame.<br>
num now contains the square root of the original data frame's column.<br>
It's best practice to rename things even when you don't have to, so it's clear what's inside of them.<br>
Let's look at some more operations.<br>
That's the sum of the column, which is also available as a method on the data frame itself.<br>
Here you get the sums of all of the columns.<br>
Likewise with mean, the number of items, the max, and the min.<br>
So far, each of these has been on one or more columns.<br>
There are also methods for rows.<br>
max_horizontal returns a series object containing the largest value in each row of the data frame, while min_horizontal returns the smallest value, and sum_horizontal adds numeric columns together.
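Here is a rough reconstruction of this segment's REPL work. The column names match the narration; the exact form shown on screen may differ.

```python
nums = pl.DataFrame({
    "num": range(1, 1001),
    "double": [n * 2 for n in range(1, 1001)],
    "triple": [n * 3 for n in range(1, 1001)],
})

nums.select("num", four_times=pl.col("num") * 4)             # named argument names the new column
nums.select("num", (pl.col("num") * 4).alias("four_times"))  # alias does the same job
# nums.select("num", pl.col("num") * 4) would raise a duplicate-name error

# math.sqrt(pl.col("num")) raises a TypeError; use the expression method instead
nums.select(pl.col("num").sqrt())

nums.select(pl.col("num").sum())  # sum of one column
nums.sum()                        # sums of every column
nums.mean()

nums.max_horizontal()  # largest value in each row, as a Series
nums.min_horizontal()
nums.sum_horizontal()
```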
2:05
Now that you've seen a few results that return a series object, let's look at those a little more carefully.<br>
Like with a data frame, you can construct a series object directly.<br>
It takes different arguments though.<br>
The first argument is a name for the series.<br>
Remember, these are kind of like a column.<br>
And then an iterable of values.<br>
Which in this case is the numbers from 1 to 1000 multiplied by 4.<br>
Let me take a look at it.<br>
Similar to a data frame, you see the first five and last five values.<br>
It also has a shape.<br>
But note the shape is different.<br>
It only has one dimension.<br>
This is a subtle difference.<br>
You can have a data frame made up of one column, but it still has a row and column count in its shape tuple.<br>
A series, on the other hand, only has a row count.<br>
The i64 in brackets there is the data type, i being for integer.<br>
You can pass a series to the with_columns method, which of course gives you a new data frame.<br>
In this case, I overwrote the original one, and so now I've got quad inside of it.<br>
If you've got a data frame and want a series, you can use the to_series method.<br>
The argument says which column to return as a series, and like with square brackets, it's zero indexed.<br>
Like with data frames, series have math and summary methods: min, max, total, and mean.<br>
You can also search inside of a series.<br>
The index_of method returns the row number of the requested value, 4 meaning the fifth row in the series.<br>
reverse returns a new series in the reverse order of the original.
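A sketch of the series work being described, continuing from the nums frame above. I'm assuming the series was named quad; index_of is a relatively recent Polars method, so an up-to-date install is assumed.

```python
quad = pl.Series("quad", [n * 4 for n in range(1, 1001)])
quad.shape  # (1000,) -- only one dimension

nums = nums.with_columns(quad)  # a Series can be handed straight to with_columns
nums.to_series(3)               # pull column 3 (zero indexed) back out as a Series

quad.min(), quad.max(), quad.sum(), quad.mean()
quad.index_of(20)  # 4 -- the fifth row holds the value 20
quad.reverse()     # a new Series in the opposite order
```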
2:43
Sometimes you want to know if something is unique or if it has duplicates.<br>
The is_unique method returns a Boolean mask.<br>
That's a series containing true or false values where true indicates the corresponding value matches the criteria and false means it doesn't.<br>
In this case, the criteria is whether the value is unique in the series.<br>
If is_unique isn't enough for you, you can count how many duplicates there are.<br>
unique_counts returns a new series with a value for each row of the original, counting how many times that row appears in the series.<br>
As our series is unique from top to bottom, you get 1000 values of 1.<br>
A better example would have had some repeating content, so let's construct a better example.<br>
I've added a mod 10 operation to our original data frame, which cycles from 1 to 9, then 0, and back again.<br>
The n_unique method returns the number of unique values in the column.<br>
Seeing as our numbers there range from 0 to 9, there are 10 different things in the column.<br>
If you want to know how many appearances there are of each value, you use value_counts instead.<br>
Well, this is something new.<br>
value_counts returns a struct, which is a data type that contains multiple things in order.<br>
It uses curly braces to denote the idea.<br>
The first number in each set here is the thing being counted, and the second number is how many there are of each value.<br>
Since there are a thousand rows, each possible response from mod 10 shows up a hundred times.<br>
The normalize argument changes value_counts to return percentages instead of absolute counts, 0.1 being 10% for each value.<br>
Note the order of values is not predetermined, and you shouldn't expect it to be sorted.<br>
These two calls to value_counts returned our values in a different order.<br>
If you want the same content as columns rather than as a struct, you can use select for two columns, one calling unique and the other unique_counts.<br>
That's a little easier to read and deal with.<br>
There are other ways of dealing with structs, but I'll leave that for a future lesson.
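A rough sketch of the uniqueness calls just described, continuing from the quad series and nums frame above:

```python
quad.is_unique()      # Boolean mask: True where a value appears exactly once
quad.unique_counts()  # how many times each value appears -- all 1s for this series

# A column with repeats makes a better demo
nums = nums.with_columns(mod=pl.col("num") % 10)
nums.select(pl.col("mod").n_unique())                    # 10 distinct values
nums.select(pl.col("mod").value_counts())                # one struct column of value/count pairs
nums.select(pl.col("mod").value_counts(normalize=True))  # proportions instead of counts

# The same information as two ordinary columns
nums.select(pl.col("mod").unique(maintain_order=True), pl.col("mod").unique_counts())
```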
0:58
Let's review what you've learned in this chapter.<br>
A dataframe object describes a series of rows and columns of data.<br>
You can construct a dataframe by passing in a dictionary, with the keys being the column names and the values being iterables of the cell contents for that column.<br>
To perform operations in Polars, you use expressions, which are themselves Python objects.<br>
If you don't want to perform an operation on the whole dataframe, you can create a context and operate on that instead.<br>
You saw three different methods for creating contexts.<br>
select for fetching columns, with_columns for copying a data frame and adding new column content, and filter for getting at a subset of rows.<br>
Although you can hardcode data in a dictionary like I did in this lesson, most of the time you'll be using an external source, like a CSV or Excel file.<br>
That's coming up soon.<br>
But before that, a quick tangent on data types in the next lesson.
23:13

2:49
In the previous lesson, I gave you an intro to the core parts of Polars, data frames, expressions, and contexts.<br>
In doing so, I pointed out that columns have data types.<br>
This lesson does a deeper dive on just what those data types are.<br>
This lesson is a bit of a tangent, providing background on data types and why your computer works the way it does.<br>
Being a tangent, you can get away with skipping it if you just want to dive into Polars.<br>
If you've done some NumPy or some pandas before, you'll find that Polars uses NumPy-compatible data types.<br>
Even if you haven't, and this is meaningless to you, you can still get away with skipping this lesson with one key takeaway.<br>
There are some subtle differences between Python data types and Polars data types, but you'll only bump into them in edge cases.<br>
Still with me?<br>
Great, let's dive into the details.<br>
As I said, Polars uses NumPy-compatible data types.<br>
So if you peel that down a layer, where does NumPy get its types from?<br>
Well, NumPy is written in C, a lower-level language, where the data types are directly correlated to how the information is stored in your computer's memory.<br>
The names may be a little different from the actual types found in a C compiler, but the concepts are the same.<br>
NumPy has both integers and floats, but it is specific about the storage used to keep them.<br>
so the numeric types have a suffix indicating the storage size.<br>
Int 16 is an integer stored in 16 bits.<br>
More on this in a second.<br>
As I mentioned, this is subtly different from Python, and that difference varies from type to type.<br>
Python's floats map directly to float32 and float64, but the integers are different.<br>
Way back in Python 2, an integer mapped to the underlying storage, but in Python 3, they added limitless integers, so now the only limit to the size of an integer is your machine's memory.<br>
That's where those subtle differences come in.<br>
In theory, you can store a Python integer that can't be put into Polars.<br>
In practice, that has to be a honkin' big number, and you're not going to run into it all that often.<br>
Let's spend our tangential time here exploring a bit where these limits come from.<br>
Everything in your computer is a number.<br>
And I mean everything.<br>
I know you're looking at text right now on the screen, but that's actually a mapping between a graphical representation of characters, which itself is a collection of numbers, and a numeric representation in either ASCII or Unicode.<br>
That means things can be a bit more complicated than you might imagine, since there can be multiple ways of storing any given thing.<br>
Just a second ago, I said ASCII and Unicode.<br>
Those are two different standards for storing text, and what numbers are used to represent a character may be different between them.<br>
Technically, there is some overlap between those standards, but let's not get too tangential inside of my tangents.
2:53
To better understand how your computer is storing things, you need to understand binary number systems.<br>
To do that, let's break down the decimal number system you learned to count with back when you were using your fingers.<br>
Take the number 9531.<br>
You can break that down into parts that correspond to each position in the number.<br>
The 9 is for 9000, the 5 is for 500, etc.<br>
Each of those parts is a power of 10.<br>
9 thousands, 5 hundreds, 3 tens, and 1 one.<br>
I can further rewrite those pieces using the power notation.<br>
Each digit in a decimal number is the next power of 10.<br>
This time let me start from the right-hand side.<br>
1 is 10 to the power of 0.<br>
10 is 10 to the power of 1.<br>
And the power increments by 1 as you move to the left.<br>
Why'd I go through all this?<br>
Well, other number systems exist which are based on the same idea.<br>
But instead of each position meaning a power of 10, they can be a power of something else.<br>
Computer processors and memory are made up of transistors.<br>
A transistor is a very small electronic device that can be thought of as a switch.<br>
There are three poles on it, two for the in and out flow of electricity, and one to control whether or not the flow is flowing.<br>
If the electricity is flowing, then the switch is on, which could be thought of as one.<br>
And if the electricity isn't flowing, then the switch is off, which is thought of as zero.<br>
That's it.<br>
That's a computer.<br>
Everything in it is represented by these ones and zeros.<br>
I mentioned processors and memory, but even physical media uses the same on-off concept.<br>
On a hard drive with a platter, the platter is storing little segments of magnet that either are or are not carrying a charge.<br>
If you're old enough to have played with CDs, there it's something similar, but with little tiny divots.<br>
If the divot's there, then one, and if it's not, zero.<br>
You can combine these on-off ideas into a binary number.<br>
Instead of each position being a number from 0 to 9, in this case, each position can only be a 0 or a 1.<br>
In Python, this kind of number gets denoted with a 0b prefix to distinguish it from other numbers you might be using.<br>
The idea here is exactly the same as our powers of 10, but this time it's powers of 2.<br>
Reading this from right to left, same as the last example of powers of 10, and you get 1 times 2 to the power of 0, plus 0 times 2 to the power of 1, 0 times 2 to the 2, and finally 1 times 2 to the 3.<br>
If you convert that back to powers of 10, you get 8 plus 0 plus 1, giving you 9.<br>
So the binary value 1001 is the decimal value 9, and since you can go back and forth, any decimal number you want can be stored in binary in your computer.
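You can check this in the Python REPL; this small sketch just re-does the arithmetic from the narration:

```python
0b1001          # the binary literal evaluates to 9
bin(9)          # '0b1001'
int("1001", 2)  # parsing a binary string also gives 9
1 * 2**3 + 0 * 2**2 + 0 * 2**1 + 1 * 2**0  # 8 + 0 + 0 + 1 = 9
```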
1:47
Let's map this to computer storage.<br>
Consider your computer's memory.<br>
It's just a big collection of those switches.<br>
Each digit of binary, or each switch if you want, is called a bit.<br>
And of course, physical reality means you can't have an infinite collection of bits.<br>
That means that hardware designers have to consider the limits of the machine and how many bits to use to store a value.<br>
You also have to know this to know how to read the content.<br>
If I have 32 bits in a row, is that one number, four numbers, or 32 single-bit values?<br>
This is where history drives how we think about these things.<br>
Early machines used an 8-bit value to store things, and this grouping came to be known as a byte.<br>
As hardware progressed, and different machines started to use more bits for a single value, convention kept an 8-bit byte, while a single value could be a grouping of those bytes.<br>
These things are typically still powers of 2, so 8, 16, 32, and 64 bits became common storage sizes.<br>
The more bits you used, the bigger the integer value you can store.<br>
2 to the power of 8 is 256, so if you want to be able to store 0, that means you can represent 1 less than 256, giving you a range of 0 to 255 in a byte.<br>
2 to the 16 means you can store up to 65,535.<br>
2 to the 32 gives you more than 4 billion.<br>
And 2 to the 64 gives you a very large 20-digit number.<br>
And because these are all powers of 2, each progression up this ladder is the square of the number before it.<br>
Remember when I said you can't map all Python integers to polars?<br>
Well, as long as your integer is less than 1.8 times 10 to the 19, you're good.
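The same powers of two, worked out in the REPL:

```python
2**8   # 256                         -> a byte holds 0..255
2**16  # 65_536                      -> 0..65_535
2**32  # 4_294_967_296               -> a bit over 4 billion
2**64  # 18_446_744_073_709_551_616  -> roughly 1.8 x 10**19
```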
2:41
Now you've seen that the limits of reality impose a maximum size on an integer related to how many bits you use to store it.<br>
Well, what happens when you exceed that?<br>
Forget binary for a second.<br>
What happens when you have the largest possible 3-digit decimal number and add 1?<br>
Well, you get a 4-digit number.<br>
If you have nowhere to store the 4th digit, it just goes away, and you're left with 000.<br>
This is known as overflow.<br>
If instead of adding 1 to 999, you added 2, your result would be 001.<br>
Whatever you add to 999, you get the three leftmost digits of the four-digit answer.<br>
This is exactly what happens in your processor.<br>
The number overflows, the value is truncated, and the processor raises a signal to indicate that overflow happened, so your program has the option to do something with that information.<br>
As I mentioned before, this no longer happens with integers in Python.<br>
They've built a new storage mechanism that dynamically increases the amount of memory needed to store your value, essentially adding bytes to the left as necessary.<br>
But if you're not using Python's integer, overflow can be a problem for your calculations, as you'd end up with the wrong answer.<br>
If you stick with 64-bit integers, that typically isn't a problem because you've got so much room, but it still has a limit.<br>
So far, I've only been talking about positive numbers.<br>
If you want to store negative numbers, you have to change how the binary values represent an integer.<br>
But you've still got to use binary, as that's all your computer can do.<br>
It switches all the way down.<br>
One way of solving this problem is to take the number and map parts of it to positive and parts of it to negative, by taking the same amount of storage space and splitting it in half.<br>
So instead of a byte representing from 0 to 255, you have it represent from minus 128 to positive 127.<br>
There are actually a few different ways of performing the math to do this mapping, and I won't go into it here; just realize that the bit pattern that would have represented 128 instead gets mapped to a negative number.<br>
It's nice to be able to have negative numbers, but this means you have two overflow situations now.<br>
As a number gets more and more negative, eventually it will flip over and become positive, and likewise in the other direction, if it gets too positive, it flips over and becomes negative.<br>
Depending on what you're doing with integers, you might not need negatives and instead want the extra storage space.<br>
This is solved by having both kinds of integers available.<br>
A signed integer supports negative numbers but has a smaller absolute value, while unsigned integers only support positive numbers and have a larger absolute value.<br>
Everything in computing is a trade-off.
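Plain Python arithmetic is enough to illustrate the wraparound and the signed/unsigned ranges described above:

```python
# Three decimal digits hold 0..999; adding past the top wraps around
(999 + 1) % 1000  # 0
(999 + 2) % 1000  # 1

# An 8-bit signed integer holds -128..127; the unsigned version holds 0..255
-(2**7), 2**7 - 1  # (-128, 127)
2**8 - 1           # 255
```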
1:29
So far, I've been describing bits in hardware.<br>
Let's take things up an abstraction level to a programming language.<br>
C is the grandparent of most of the popular coding languages out there.<br>
Ever wonder why a for loop looks similar in Python, Java, C#, and all the rest?<br>
Well, because they're all descendants of C.<br>
C is pretty close to the hardware, and its data types map to those bits and bytes I've been going on about.<br>
You may recall that this is a Polars course.<br>
Well, Polars has a bunch of different ways to represent an integer, which map to those underlying C types.<br>
Their naming convention uses a suffix to indicate the number of bits used for storage.<br>
Int8 is a signed 8-bit integer, while uint8 is the unsigned equivalent.<br>
And as you might expect, 16, 32, and 64-bit versions are supported.<br>
Because C compilers exist for a wide variety of hardware, the terminology in the C language can be a bit messy.<br>
It tends to use short and u-short for smaller signed and unsigned integers, with long and u-long for 32 or sometimes 64 bits.<br>
The inconsistency comes from the width of the hardware's integer representation.<br>
So if your machine used 16 bits for integers by default, that was a normal int and a long was 32, while later, with 32-bit machines, long meant 64.<br>
This is why Polars and other libraries use the number suffix instead to avoid this confusion.
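A quick sketch of those Polars integer types in use:

```python
pl.Series("a", [1, 2, 3], dtype=pl.Int8)   # signed 8-bit: -128 to 127
pl.Series("b", [1, 2, 3], dtype=pl.UInt8)  # unsigned 8-bit: 0 to 255
pl.Series("c", [1, 2, 3]).dtype            # Int64 by default
```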
2:02
Integers are whole numbers, so what if you want some stuff to the right of the decimal point?<br>
IEEE spec 754 defines floating point numbers.<br>
Like with integers, you can use a varying number of bits to represent these, but by the time they came into common practice, 32 and 64 bits were common, and this is what most machines support.<br>
Like with negative numbers, the bits in a float don't map directly to their decimal equivalent.<br>
In fact, the float gets chopped up into parts.<br>
For a 32-bit float, you have 1 bit indicating the sign of the number, 8 bits for the exponent, and 23 bits for the mantissa.<br>
A mantissa is a series of bits where each bit is a negative power of 2, or to think of that another way, the mantissa is a sum of powers of one half.<br>
Taken together, the mantissa represents a number between 0 and 1, which you combine with the exponent part to give a pretty big range.<br>
In fact, a 32-bit float allows you to go from the really tiny fraction of 10 to the minus 44 all the way up to the equivalent of a 38-digit number.<br>
And because of the sign bit, you can express these as either positive or negative values.<br>
The dynamic range of a floating point is huge, but it has a drawback.<br>
Because the mantissa is a sum of fractions, it isn't just the digits to the right of the decimal point.<br>
That means there are decimal numbers that just can't be represented.<br>
If you open up your Python REPL and type in 0.2 plus 0.1, you'll see this in action.<br>
When the math is done on these values, you get a bit of extra cruft in the mantissa, giving you an error.<br>
Granted, it's a really small error, but a lot of small errors combined over time can give you a problem in the long run.<br>
Polars supports a couple of floating point sizes, which actually map directly to what Python can support.<br>
Like with the integers, it uses suffix to be explicit about their storage size.<br>
Float32 uses 32 bits, also known as the regular old float in C, and Float64 is also known as double.
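The rounding error and the Polars float dtypes, sketched in the REPL:

```python
0.1 + 0.2         # 0.30000000000000004 -- the leftover cruft in the mantissa
0.1 + 0.2 == 0.3  # False

pl.Series("x", [1.5]).dtype                    # Float64, the default
pl.Series("y", [1.5], dtype=pl.Float32).dtype  # Float32 on request
```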
0:54
You've already seen how floating point can lead to small errors, and in some cases this needs to be avoided.<br>
For example, you should never use floats to represent money.<br>
Eventually, the little errors will compound and your balances will be off.<br>
Instead, you should use fixed decimal representations.<br>
These are essentially like combining two integers, using some bits for the part to the left of the decimal point, and some bits for the part to the right of the decimal point.<br>
The trade-off here is twofold.<br>
First, you don't get as much range as with a float, and second, your hardware isn't specialized to deal with these kinds of numbers.<br>
One of the reasons floats are so common is that your CPU has instructions explicitly for them, and so doing floating point math tends to be a lot faster than the fixed decimal equivalent.<br>
Python and Polars both support fixed decimal.<br>
They use a 128-bit number to do so.
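Python's standard-library decimal module illustrates the fixed-decimal idea (Polars has its own Decimal dtype for the same purpose):

```python
from decimal import Decimal

Decimal("0.1") + Decimal("0.2")  # Decimal('0.3') -- exact, unlike 0.1 + 0.2 with floats
```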
3:57
Way back, what seems like hours ago, I mentioned that everything in your computer is a number.<br>
You even sometimes map your numbers to other kinds of numbers.<br>
Text is handled in a similar fashion, where, within a certain context, a number represents a character value.<br>
This mapping has its own history, and if you go back far enough, there was even competition between different kinds of computers for different text mappings.<br>
The winner of that competition was the ASCII table.<br>
It's changed over time, but now in the 8-bit byte world, it has 256 codes with each code mapping to a character.<br>
For example, capital A is decimal 65, while its small version is decimal 97.<br>
Not all characters are letters.<br>
For example, there are mappings for the tab character, a little ding noise, the new line, and more.<br>
If you really want to blow your mind, the numbers are in here as well.<br>
That's right, ASCII 49 maps to the character for the digit 1.<br>
ASCII is rather Western-centric, and 256 characters isn't enough for all the characters in all the languages in the world.<br>
For example, Kanji has tens of thousands of characters all on its own.<br>
To better represent a wider range of characters, another standard was created, known as Unicode.<br>
For backwards compatibility, the first 256 characters in Unicode are direct mappings to ASCII.<br>
That way, old programs would still work in the newer system.<br>
Unicode is a complicated beast, and there are several different ways of encoding it.<br>
The UTF-8 encoding is a variable length mechanism, the 8 meaning 8-bit bytes.<br>
So for the ASCII-compatible characters, you can still use a single byte.<br>
But for the wider world, you can have up to 4 bytes to specify a single character.<br>
Python's been around for a while, and Python 2 used ASCII by default.<br>
Eventually, it added Unicode capabilities.<br>
Part of the Python 3 modernization was to use UTF-8 throughout instead.<br>
In case all of this seemed a little too simple, Python further complicates this by having no real concept of a character.<br>
You and I might talk about a single letter colloquially, but in fact, that's just a short string.<br>
Inside Python, this isn't a big deal, but be aware, in some other programming languages, there is a distinction between the two concepts.<br>
In fact, that's where the term string comes from.<br>
It's a string of characters in a row.<br>
Polars and NumPy both have their own implementation of strings.<br>
This is compatible with Python's, so you'll never notice the difference.<br>
But if you use the built-in type function on a polar string, you'll see that what comes back is its own class.<br>
Unicode is pretty much the way moving forwards.<br>
If you've ever been surfing the web, particularly message boards, and seen a funky little dot where someone was using an apostrophe or a question mark, this is the result of a mismatch between the text encoding on the person's machine and the one the website uses.<br>
Although Unicode has become the way to do things, there are other ways out there, occasionally causing translation problems between the systems.<br>
If you're in the Python REPL, you can poke around a bit to see how strings are represented.<br>
The ord function returns the ASCII mapping of a character.<br>
On screen here, you can see the small a and the capital A I was talking about.<br>
Python strings have an encode method that returns the raw byte representation of an encoding.<br>
By calling encode UTF-8 on the happy emoji, you can see it is made up of four bytes.<br>
While if you do the same on the Canadian flag, you'll see it takes more bytes to represent it.<br>
Remember when I said there are up to four bytes in a character in Unicode?<br>
Well, that is strictly true, but some of the things you might think of as a character are actually a compound.<br>
Unicode documentation tries to avoid the word character because then you have to distinguish between the four-byte limit and a grapheme or glyph like my flag here.<br>
This is also how you apply a variety of skin tones on emojis by combining multiple segments of Unicode.<br>
Text can be complicated.
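The poking around described above looks roughly like this in the REPL:

```python
ord("a")  # 97
ord("A")  # 65
chr(49)   # '1' -- the character for the digit one

"😀".encode("utf-8")        # b'\xf0\x9f\x98\x80' -- four bytes
len("🇨🇦".encode("utf-8"))  # 8 -- the flag is two 4-byte code points combined
```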
2:01
You've seen a few different kinds of numbers and a couple kinds of text.<br>
In the words of Ron Popeil: but wait, there's more.<br>
How about storing a date and or time?<br>
You could just use integers for each part of the value, but that would take up a lot of space.<br>
Instead, programming languages have data types that map numbers to dates and times.<br>
In Python, that's done through the date-time object.<br>
Python's version can store a date and time up to December 31st in the year 9999 at 11:59 p.m.<br>
It can also go all the way back to January 1st on year 1, but it can't go back further than that.<br>
So if you're building a program that deals with historic dates before the common era, you need a different format than Python's default date-time object.<br>
Polars has its own formats for dates and times.<br>
It has a date-specific one that uses a 32-bit signed integer, where 0 maps to January 1st, 1970.<br>
This date is kind of special in computing.<br>
It's known as the Unix Epoch, or just the Epoch, and it exists because that's how many Unix systems represented the date.<br>
In fact, the Y2K problem is going to happen again shortly in 2038, when 32-bit Unix systems will overflow.<br>
Unlike the Unix time problem, where 32 bits are used for both the date and time, in Polars, these 32 bits are used just for the date.<br>
That gives you a range of up to 2 billion days before and after January 1st, 1970.<br>
If you need a date and a time, Polars uses a 64-bit signed integer, also centered on January 1st, 1970.<br>
But it has a varying degree of accuracy.<br>
A single tick can be a milli-, micro-, or nanosecond, and of course the minimum and maximum datetime then changes based on which resolution you choose.<br>
It defaults to microseconds if not otherwise specified.<br>
This gives you a lot of flexibility either for very precise times or for a very large range.
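A small sketch comparing Python's limits with the Polars date types; the dtype reprs shown in the comments are approximate.

```python
from datetime import date, datetime

datetime.min, datetime.max  # year 1 through 9999-12-31 23:59:59.999999

pl.Series("d", [date(1970, 1, 1)]).dtype       # Date -- 32-bit days relative to the epoch
pl.Series("ts", [datetime(1970, 1, 1)]).dtype  # Datetime, microsecond precision by default
```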
1:45
Phew, that was a lot, wasn't it?<br>
I didn't say it would be a short tangent.<br>
The challenge with all these formats is that the data you're working with might be in a different format from what Polars uses.<br>
If you're reading data in from a file, the format of the file is going to dictate what is stored and how.<br>
Text files are, well, text, which means if there are numbers inside, something is going to have to recognize that and convert them.<br>
Binary formats tend to be richer, but then you may have an incompatibility.<br>
Consider those dates I was just talking about.<br>
As a really simple example, Canadians and Americans write dates differently.<br>
Does 0102 mean January 2nd or February 1st?<br>
Guessing about the date type and potentially converting it is typically done by the library reading the file.<br>
For example, CSV may have quotes around text if it has commas in it.<br>
A number without a decimal is assumed an integer, and a number with a decimal is assumed a float.<br>
The library you use to read a CSV may even have controls to force this conversion one way or the other.<br>
All of this means that when you're doing data science, you may have to spend a bunch of time manipulating the data you've read in.<br>
Your reader might not recognize a date string as a date and store it directly as text.<br>
American postal codes are five-digit numbers, sometimes with leading zeros.<br>
If those are seen as integers, the leading zero goes away, so in that case you might want those numbers stored as text instead.<br>
Polars has a variety of functions for reading data, and those functions have ways of controlling this conversion.<br>
And if that doesn't work, you can further perform conversion operations within your data frame to get things into the space that you need to do your work.
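As a sketch of the kinds of controls being described, here is a hypothetical read call; the file name and column names are made up for illustration.

```python
customers = pl.read_csv(
    "customers.csv",                      # made-up file
    schema_overrides={"zip": pl.String},  # keep "01234" as text so the leading zero survives
    try_parse_dates=True,                 # attempt to recognise date strings as dates
)
```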
0:55
Let's recap.<br>
Everything in your computer is a number.<br>
Data types are mappings between those raw numeric values and other representations.<br>
There are a lot of possible representations.<br>
For example, strings, dates, integers, both signed and unsigned, floating point numbers, and more.<br>
The data types in Python and Polars are mostly compatible, but there are some subtle differences that can be a challenge when dealing with edge cases.<br>
All of the numeric types in Polars have a suffix to tell you how much storage will be used, and therefore what the limits of them are.<br>
For example, the int8 type is a signed 8-bit integer.<br>
That's my less-than-tiny little tangent.<br>
You may recall, several weeks ago, in lesson 2, I said it would soon be time to learn how to read data files into Polars.<br>
Now that you know about data types, you're ready to do just that in the next lesson.
16:21

3:05
In the previous lesson, I went on a tiny little tangent about how computers store information.<br>
In this lesson, I'll dive back into the Arctic waters, this time to show you how to populate Polars using external data sources like CSV and Excel.<br>
The first bit of data reading I'll show you is a CSV file.<br>
That's a text-based format where commas separate the values in a row.<br>
In this case, it's the CSV library that is responsible for determining the data types of each of those columns.<br>
Typical CSV readers only handle two things, text and numbers.<br>
Call that three if you want to make a distinction between integers and floats.<br>
Polars comes with a whole bunch of functions for reading data in from external sources.<br>
This family of functions all starts with the name read, and as you might guess, read_csv is for getting data from CSV files.<br>
For the purpose of demonstration, I've got a file called sales.csv.<br>
It has a thousand lines of data, and here on the screen you can see the first few rows.<br>
This is mocked up sales data, with each row showing an invoice identifier, the company that made the purchase, the date, and then what was purchased, along with how much that cost.<br>
The sales.csv file is available in the GitHub repository that goes with the course, if you want to play around with it.<br>
Let's head off into the REPL and read some data.<br>
I have already done my import; let's use read_csv to get some data.<br>
The simplest version of this function call takes a file name.<br>
The return value from read_csv is a data frame.<br>
You can tell that this is sample data as it has exactly 1000 rows.<br>
In the data frame here, you'll find the same structure as the table I just showed you in the slides.<br>
Running count says that I've got a thousand rows.<br>
This is the same information as I could get out of the shape, except for a subtle difference.<br>
If there were missing values in a column, the count wouldn't be one thousand.<br>
So this tells me not only are there a thousand rows, but I've got a thousand pieces of data for each column in that row.<br>
Let's examine some of our data.<br>
Say you wanted to see all the orders for the month of March.<br>
You do that with a filter.<br>
I haven't shown you how to filter a date yet.<br>
I'll go over it in detail in a moment.<br>
But for now, know that you use the dt attribute to treat a column like a date.<br>
And once you've done that, you can use the month method on that column to get the month.<br>
Well, that's no fun.<br>
It should have worked.<br>
Let's see what happened here.<br>
The exception is an invalid operation error, and the month operation, that's the method I called, can't be used for this string type.<br>
Let's look at our data frame again.<br>
Sure enough, the purchase date column has a data type of string.
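A rough reconstruction of this segment; the column name purchase_date is my guess based on the narration.

```python
sales = pl.read_csv("sales.csv")
sales          # shape (1000, ...) with the columns described above
sales.count()  # 1000 in every column means nothing is missing

# The attempt from the narration; it raises an InvalidOperationError because
# the purchase date column came in as a string, not a date
sales.filter(pl.col("purchase_date").dt.month() == 3)
```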
1:50
The CSV reader didn't convert our purchase date into a date object.<br>
It kept it as text.<br>
There are a couple different ways of dealing with this.<br>
The first is the cast operation.<br>
This is an expression method that causes Polars to treat the expression using a different data type.<br>
Let me modify that same filter from before, this time casting the purchase date.<br>
That's a lot of typing, but it worked.<br>
The resulting data frame only has 74 rows and the 10 shown on screen here are all from the month of March.<br>
Let's hope the other 64 are as well.<br>
You'll recall from earlier that this kind of comparison creates a mask.<br>
Let's look at the mask itself.<br>
Selecting our comparison expression results in a column of booleans.<br>
A row is true if the purchase date is March, and false otherwise.<br>
Remember, none of this affected the data frame.<br>
I could create a new data frame using this expression and the with_columns method.<br>
Instead, though, wouldn't it be great if you could just tell the CSV reader to actually understand dates?<br>
I like how they named this.<br>
It's an honest argument.<br>
They'll try to convert it to a date, but no promises.<br>
Thankfully for our CSV data, it succeeded in parsing them.<br>
If it had failed, you'd still have a string column and you could convert that content manually as needed.<br>
In a later lesson, I'll walk you through a case study that includes dealing with data that can't be read in in the format you want and needs to be converted.<br>
For now, let's move from CSV to Excel.
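The two approaches just described, sketched out. The cast works for ISO-style date strings, and try_parse_dates is the honestly named argument from the narration; the column name is still my guess.

```python
# Option 1: cast the string column to a date on the fly
sales.filter(pl.col("purchase_date").cast(pl.Date).dt.month() == 3)

# The comparison on its own is just a Boolean mask
sales.select(pl.col("purchase_date").cast(pl.Date).dt.month() == 3)

# Option 2: ask the reader to attempt the conversion up front
sales = pl.read_csv("sales.csv", try_parse_dates=True)
```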
1:31
Excel is the ubiquitous spreadsheet program.<br>
Its underlying data format is binary, and it can handle far more data types than our regular old CSV.<br>
Polars supports several different libraries for reading Excel files.<br>
The default is a library called FastExcel, which even though it's the default for Polars, Polars doesn't mark it as a dependency, so you have to install it separately.<br>
A quick note about Excel reading libraries.<br>
Some of the older ones only read the literal values of a cell, so if you have a formula, what you'll get is the string contents of that formula.<br>
In that case, you might have to do modifications to the spreadsheet, like copying a column and pasting it as values to get the data you want.<br>
Thankfully, FastExcel is a fairly capable library and will evaluate formulas for you.<br>
The Polars function for reading Excel is aptly named read_excel, which, after just a bit of housekeeping, I'll demonstrate.<br>
As I mentioned, the FastExcel library is separate from Polars, so you'll have to pip install it to use read_excel.<br>
Like with the CSV example, I have a sample sales file as well.<br>
Although the results are the same, the structure of the file is a little different.<br>
First off, Excel knows what a date is, so I won't have that casting problem that I had before.<br>
Second, to mimic the real world a little more, this time the total column is actually a calculation, being the quantity times the price.<br>
Alright, the house is now tidy. Let's head into the REPL and load some Excel.
1:54
I think you've seen this line before.<br>
And reading the Excel file is almost identical to reading the CSV.<br>
Consistency is a good thing.<br>
Let's look at the resulting data frame.<br>
And there you go.<br>
Notice that the purchase date column actually has a date data type, and the total column has an integer data type.<br>
Polars doesn't support formulas as part of the data frame, so FastExcel has run the calculation and returned the result to the data frame.<br>
Like with the CSV, let's check out the count.<br>
Same as before.<br>
That's good, nothing missing.<br>
Now let's look at our data in overview.<br>
It is a good habit to run describe after loading something in.<br>
The count and null count rows tell you whether you're missing anything.<br>
There are a bunch of nulls in the result here, as some of the statistical operations are meaningless for some of the columns.<br>
You can't perform standard deviation on a string.<br>
Let's filter out March.<br>
Same idea as before, but this time no need to cast.<br>
The column is already a date.<br>
Note that the data is in a somewhat random order.<br>
You can fix that by chaining a sort call onto the result.<br>
That's better, all the sales data for March in order of purchase date.<br>
You can also sort by multiple columns by passing in a list.<br>
Now, the three things purchased on March 1st are also ordered by quantity.
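Sketched out, this segment looks roughly like the following; file and column names are guesses based on the narration.

```python
# fastexcel is a separate install: python -m pip install fastexcel
sales = pl.read_excel("sales.xlsx")
sales.count()
sales.describe()

march = sales.filter(pl.col("purchase_date").dt.month() == 3)  # no cast needed this time
march.sort("purchase_date")
march.sort(["purchase_date", "quantity"])  # orders within a date by quantity
```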
2:09
Remember from the math lesson that there are operations you can run on the data frame.<br>
Running sum tells you the total quantity of things purchased, the total price, which is kind of weird, and the total spend.<br>
Polars doesn't know the semantic meaning of columns, so although the sum of prices is strange, Polars just thinks of it as an integer and gives you a value.<br>
The nulls here are because summing a set of strings is not a thing you can do.<br>
Let's look at some of the other operations.<br>
Average price might be something you want to know, and average quantity is a little strange sounding until you think of it as average order size.<br>
Each of these operations is returning a new data frame as a result, so you can describe a result just like describing the original.<br>
You might want to select some columns and describe them like this if you have a lot of columns, or in the case where a lot of the columns are things where statistical information isn't useful.<br>
In the math lesson, I introduced you to the value counts call.<br>
Counting the number of mod 10 entries is a little abstract, but doing it for our products actually makes sense.<br>
Each row here contains a struct, one for each unique kind of product, which is then combined with the number of instances of that product.<br>
Note that this isn't the quantity sold, just the number of orders of a type.<br>
There were more than 226 pens sold, but only 226 orders bought some number of pens.<br>
When I showed you the structures in the math lesson, I promised to show you how to chop that up.<br>
Well, here you go.<br>
You can chain the unnest call to split up a structure.<br>
The argument to unnest is the name of the column of the thing being deconstructed.<br>
In fact, it can be any column specifier, which includes the good old col function.<br>
In our case, there was only one column, but you could have more.
|
|
show
|
1:36 |
Okay, you've got your data.<br>
Let's add some stuff to it.<br>
You've already seen the with_columns call.<br>
This takes the data frame and returns a copy with one or more additional columns.<br>
The lit expression function is short for literal and allows you to set the same value to every row.<br>
Let's look at the result.<br>
And there you go.<br>
USA a thousand times over.<br>
Data frames may not support formulas the way Excel does, but you can still do the equivalent calculations.<br>
There are only two guarantees in life, and this is one of them.<br>
Since I overwrote the original data frame with the result, I now have a new column with the amount of sales tax.<br>
Okay, that's a column.<br>
What if you want to add a row?<br>
Since our dataframe has a date in it, I'm going to need a date or datetime object.<br>
Next, like way back when, I'm going to construct a dictionary.<br>
Previously, I used a dictionary to specify all of the rows in a new dataframe.<br>
I'm kind of doing the same thing, specifying the rows that I'm going to insert.<br>
In this case, I'm only inserting a single row, but I still use lists, just lists with a single item inside of them.
|
|
show
|
0:25 |
The concat call combines a series of dataframes.<br>
So to add data to our existing one, I'll create a new dataframe and concat the two of them together.<br>
Like with all our calls, concat returns a new dataframe, so I've overwritten our original.<br>
And there, at the bottom, is our new sale.
|
|
show
|
1:16 |
The column names in this data frame are inconsistent.<br>
Let's do something about that.<br>
You see the same thing by looking at the data frame table above, but sometimes the columns attribute is handy.<br>
In fact, store this little nugget away in your databank.<br>
It'll help you in the case study.<br>
One way to rename a column is to copy it and remove the old one.<br>
Another is to use the replace_column call.<br>
The replace_column call takes a series though, so the first thing I need to do is get country as a series.<br>
There it is, and now I'll create a new series with a lowercase country.<br>
In previous lessons I created a series by passing in an iterator.<br>
Here I've done it by passing in another series.<br>
Remember, the first argument to the series constructor is the name of the data.<br>
Now I'll do the replacement.<br>
And there you go.<br>
Replace column number 7, zero-indexed of course, with the new lowercase-c country series.
|
|
show
|
2:00 |
If you were using Excel, you could click on any cell to edit its value.<br>
Polars lets you do something similar.<br>
If you're coming from Pandas land, this is similar to what you might use iloc for.<br>
You can get at a row using square brackets, including negative indexing.<br>
This is the row I inserted just a short while ago.<br>
You can get at the value in a row by passing a row and column pair into the square brackets.<br>
I actually don't like this, as this isn't how you typically do this in Python, but it's how Pandas does it, so Polars tries to make it easier for people coming from that library.<br>
Not only can you read values, but you can also write them.<br>
Which, as you would expect, has modified the row.<br>
This makes me a little uncomfortable as it's an inconsistency.<br>
Most operations in Polars create a new data frame.<br>
Cell-based editing like this does not, so be careful.<br>
The row-column-pair mechanism in square brackets also supports slices.<br>
This is the second and third row, zero-indexed, and the first three columns.<br>
You can even pass a list instead of a slice.<br>
Same idea, but this time the first, second, and sixth columns.<br>
Since it's easy to count wrong, you can also use names.<br>
If you want to specify the columns but not slice the rows, you do this by passing in an empty slice.<br>
An empty slice being a colon on its lonesome without any numbers.<br>
There are limits, though.<br>
You can't use a slice for assignment.<br>
Assignment only works on a single cell.
|
|
show
|
0:35 |
Time to recap.<br>
You've seen how to use the read_csv and read_excel functions to read external data sources and get a Polars data frame.<br>
The read_excel function supports a few different Excel libraries.<br>
The default, and likely the best choice, is fastexcel, which you need to install separately.<br>
The read functions do their best to choose the right data type based on the file, but it isn't always correct.<br>
You may need to use the cast method in an expression to operate using the data type that you want.<br>
I've only scratched the surface of working with Excel.<br>
In the next lesson, I'll show you even more.
|
|
|
16:51 |
|
show
|
2:08 |
In the previous lesson, I showed you the read_csv and read_excel functions for creating a data frame from external files.<br>
In this lesson, I'll continue along the Excel path and show you how to perform some of the more common things you do in Excel inside a data frame instead.<br>
Like with the last lesson, I'll be using the sales.xlsx file.<br>
This is a reminder of what that looks like.<br>
Actually, let's go right past the reminder to Excel itself.<br>
Inside of Excel, you might want to do a few different things with a date, requiring you to grab part of the date information.<br>
You'd use a formula to do this.<br>
I'll start with the month.<br>
And of course, I copy this down into the sheet.<br>
Going back to the top.<br>
How about the day?<br>
Similar idea.<br>
and the year.<br>
You know how I spent a bunch of time on data types?<br>
Well, the same concept is inside of Excel.<br>
If I try to pass a non-date column into the year function, I get a value error.<br>
This is Excel telling you that the data types aren't compatible.<br>
So that's day, month, and year.<br>
What about quarter?<br>
Well, Excel doesn't have a formula for that, so you have to do that manually.<br>
Dividing the month by 3 to get the quarter and rounding up to get rid of the fractions.<br>
And there's all my date information.<br>
Let's try to do the same kinds of things in Polars.
|
|
show
|
2:45 |
You've already seen how columns in a data frame have a data type, and I gave you a glimpse in how to use data type specific methods as part of an expression.<br>
You saw me use the dt attribute to treat a column like a date and get at its month method.<br>
Well, these kinds of accessor attributes exist for a bunch of data types.<br>
The two most common you'll use are dt and str, giving you access to date time methods and string methods.<br>
You might be wondering why the need for accessors.<br>
I'm not 100% sure of the answer, but I can take a good guess.<br>
Remember, expressions are independent of the data frame.<br>
You can construct them on their own.<br>
It isn't until they get used that Polars knows if they can be used.<br>
Of course, you could provide all data type methods on all columns, but then you might get a conflict between two data types having operations with the same names.<br>
Using accessors like this gives a little hierarchy to keep like things together.<br>
The downside, and in fact this is one of Polars' drawbacks, is it takes more typing than some other equivalent frameworks.<br>
The other thing that can seem a little strange, at least for a Python programmer, is the name of some of the data type operations.<br>
They don't follow the same naming convention as their Python equivalents.<br>
Granted, Python itself isn't consistent, so that's not really anything to get worried about.<br>
This name mismatch is because Polars is built in Rust and is available as a standalone Rust library, so some of the naming conventions are Rust rather than Python.<br>
You'll see what I mean when you encounter starts_with in just a little bit.<br>
Let's go back to the REPL.<br>
This seems somewhat familiar, and so does that.<br>
Let's use the DT accessor.<br>
In an earlier lesson, you saw me use it as a condition, but you don't have to.<br>
In this case, I've used DT to get at its year method, extracting the year into its own column.<br>
So if I wanted the date fully parsed, I could use with columns, get the year, and then add the month, the day, and without a funky formula, the quarter as well.<br>
It's available as a method.<br>
And as I overwrote the data frame, you now have the date information parsed out as integer columns.
|
|
show
|
1:55 |
Let's play with some more accessor methods.<br>
The dt accessor has a to_string method, which allows you to convert a date back to a string.<br>
This method also takes optional formatting arguments, but I won't show you that as year-month-day is the one true choice.<br>
I'll fight you.<br>
The str accessor has all sorts of string methods, pretty much those you're accustomed to on the string object in Python.<br>
The split method returns a list of values, splitting on the dash in the date in this case.<br>
What if I wanted to get at those parts?<br>
Unfortunately, there is no unnest equivalent method for a list, but I can convert it to a struct.<br>
Granted, not without being warned that I'm a bad boy, but I can do it.<br>
The warning in this case is telling me that I should be giving names to the parts of the struct, otherwise when I go to do things with it, I might have a problem.<br>
It's just a warning though.<br>
But if I want to unnest it, I have to heed the warning and give it names.<br>
That's a lot of typing.<br>
First, it split the date on the dash, then it turned the resulting list into a struct using the field names, year, month, day.<br>
Then, I unnest that same date column I'm operating on to get a data frame with the year, month, and day.<br>
Do note that these are still strings, though.<br>
That was all very convoluted, but it does demonstrate the power of the STR accessor.
|
|
show
|
1:31 |
Dates should, well, be dates, so let's go back to the land of DT.<br>
In the Data Types Tangent lesson, I briefly explained the Unix epoch, which is January 1st, 1970.<br>
A lot of time stuff in computers is a counter since that date.<br>
Calling the epoch method without any arguments gives the number of microseconds since the Unix epoch.<br>
There are many DT methods.<br>
Let's look at a few of them.<br>
The century.<br>
The number of the day of the year, known as the ordinal.<br>
The number of the week in the year.<br>
And the day of the week as a numeric value.<br>
Note that this is one of those rare things in computing that isn't zero-indexed.<br>
It's a number from 1 to 7, with 1 being Monday.<br>
This isn't a Polars-specific thing, but an ISO-standard thing.<br>
This is the day the current month starts on, and the day that the month ends on.<br>
This gives you an easy way to get the number of days in any given month.<br>
It even accounts for leap years.<br>
Speaking of leap years, 2019 wasn't one, so that's 1,000 falses.
|
|
show
|
1:08 |
Bored with dates?<br>
Let's go back to some more string stuff.<br>
The to_uppercase method returns an uppercase version of the string.<br>
This is one of those places where the name isn't Pythonic.<br>
In Python, this would just be called upper.<br>
This is mildly annoying, as it sometimes means you can't find what you're looking for in the docs.<br>
But seeing as the library really is a Rust library at heart, we should just be glad we get it in the Python world at all.<br>
The opposite of to_uppercase is to_lowercase.<br>
And to_titlecase gives you a capital first letter for each word, then lowercase from there on.<br>
As you might have guessed, the len_chars method returns the length of the string counting characters.<br>
While len_bytes returns the number of bytes used for storage.<br>
For our column, everything is ASCII, so the two functions return the same.<br>
If one of the products had a Canadian flag emoji in its name, you'd get different numbers.
|
|
show
|
2:14 |
Inside Excel, you can turn on the Quick Filter feature, and then your header row turns into a selection widget.<br>
Using this, you can hide the rows in the spreadsheet that don't match your filter.<br>
You already know you can accomplish the same thing with the filter method in Polars, but let's dig into that some more.<br>
I'm back in the REPL using the same Excel sheet as my source data.<br>
Let's start by looking for orders where more than 10 things were bought.<br>
739 of our invoices are for orders with a quantity of 10 or more.<br>
You can pass more than one condition to the filter method.<br>
When you do, it works like an AND, in this case filtering both on customers named Viva and orders of 10 or more.<br>
You'll recall that a filter is a masking column.<br>
So if I select on the condition instead of filtering it, you get a series of trues and falses, matching those rows that meet the criteria or not.<br>
Let me store that away.<br>
And I'll pass it into the filter method.<br>
Hmm, couldn't have been that easy, could it?<br>
Turns out that masks need to be a series, not a single column dataframe.<br>
You can fix that by calling the to_series method.<br>
Checking that it's a series this time.<br>
And now I can pass it to Filter.<br>
Let me do that again with the Quantity condition.<br>
There's my mask.<br>
And just like before, I can pass them both to the Filter method.
|
|
show
|
0:37 |
Now, why would I do all that when I can just use them straight?<br>
Well, when you have mask objects, you can combine them using binary operators.<br>
By doing so, you can do more complex things than just anding the two conditions together.<br>
Of course, having said that, I went and anded them together anyways, but that's just to prove my point.<br>
The last two calls have the same result.<br>
Change the binary operator from AND to OR and you get something different: 742 rows this time.
|
|
show
|
1:17 |
Combining filters and expressions is powerful, giving you a lot of control over your data.<br>
Say the first character in the invoice code was meaningful somehow, and you want to see just those that start with S.<br>
Here I used the str accessor to get at the starts_with method, spelled differently from Python, and used that to filter all the S invoices.<br>
I like this so much I'm going to run it again and keep it this time.<br>
Next, I can run queries on this to count the product sales with value counts.<br>
The beauty of expressions being their own objects, and almost every operation returning a data frame, is that I can operate on the thing I just created, or I can chain it all together.<br>
Here, instead of using the "with s" dataframe, I just created it from scratch.<br>
It's a lot of typing, but you get the point.<br>
I can use references, or I can do the whole thing.
|
|
show
|
2:05 |
Let's try some more filters, this time with dates.<br>
To do comparisons with dates, you need a date object or a datetime object.<br>
It's really no different than any other filter, just using the Python date object instead.<br>
If you don't want to import the date class, there is another way around it.<br>
Remember, the lit call means literal, so this is constructing a literal string, then using the string's str accessor's to_date method to parse the date, and after that, doing a comparison like any other.<br>
It's a fair amount of extra typing, but if you really don't want to import the Python date class for some reason, you don't have to.<br>
Let's build something a little more complex.<br>
Looking for books that were purchased in the month of November, then sorting the result, giving me 23 invoices.<br>
Skinning this particular feline in an alternate fashion, still looking for books, filtering by month instead, and then sorting like before.<br>
This is definitely less to type.<br>
The format of the previous version would allow you to do smaller date ranges than the entire month, and would allow you to select date periods that crossed month boundaries.<br>
The second method is less to type, and only works because our data is of a single year.<br>
If there were multiple years, you would get the Novembers in those multiple years as well, which would give you a different result in that case.<br>
So save some typing, but it might be a little more risky depending on your data.
|
|
show
|
1:11 |
Time for the end of lesson review.<br>
Each column in a data frame has a data type, and if you want to perform data type specific operations on the values in the column, you use a data type accessor.<br>
The two most common ones are DT for dates and STR for strings.<br>
Like with an Excel quick filter, you can use Polars filter method to fetch a subset of rows.<br>
Filters themselves are a Boolean mask stored as a series object, and once you've got such a mask, you can do more complex filtering by combining masks with binary operators.<br>
Some filtering is done by constructing comparison operations in an expression, but additionally, some data types support methods that also can help you filter.<br>
For example, the StartsWith method can help you filter based on strings that begin with a particular letter or phrase.<br>
So far, you've been performing operations on a data frame or a subset of rows.<br>
Polars also lets you perform aggregation operations on groups of rows as well.<br>
If you want to sum all the orders in January grouped by product type, you do that through aggregation, which I'll demonstrate in the next lesson.
|
|
|
14:39 |
|
show
|
1:57 |
In the previous lesson, I showed you a whole lot of methods available through the DT and STR data accessors.<br>
In this lesson, I'll be showing you how to do aggregate calculations, both on entire data frames as well as groups of data.<br>
An aggregate is a method that typically produces a single result based on a calculation over a group of data.<br>
You've already seen some of the aggregates available over the entire data frame, like sum, min, max, mean, as well as the methods that aggregate across columns on rows, sum_horizontal, min_horizontal, and max_horizontal.<br>
Sometimes, instead of aggregating on an entire column, you want to aggregate on like data within a column.<br>
You could do this manually by performing multiple filter operations, and then aggregating on the result of each, or you can use the group_by functionality that lets you do this directly.<br>
For the purposes of the next demo, I'll once again be using the Sales Excel sheet.<br>
In fact, let's head off to Excel and play with it now.<br>
Say I wanted to count the total quantity of shirts sold.<br>
I could do that with the SUMIF function.<br>
The first argument is the part that this is going to operate on, which is the product column for us.<br>
The second part is the condition, so matching shirt.<br>
And the third is the column to do the summing on, which is the quantity.<br>
There you go, 6,249 shirts sold.<br>
And of course, you would then do the same for the other products.<br>
One of the downsides of this mechanism is you have to know which products you're looking for.<br>
Finding unique things in Excel is possible, but man is it messy.<br>
I'll leave that for you to Google if you want to see what I mean.<br>
Let's head off to the REPL and see how to do the same thing in Polars.
|
|
show
|
1:59 |
As usual, I've created a data frame based on our sales spreadsheet.<br>
And now, just a quick review of the data frame level aggregates, some of which you've seen before.<br>
Sum, with nulls for things that aren't summable.<br>
Mean, which interestingly includes an average purchase date.<br>
Even weirder is that it converts it to a datetime.<br>
Polars actually has a good selection of time series data methods, including things like data extrapolation, but that's beyond the scope of this course.<br>
The aggregate functions are available as expressions as well, which means you can construct a summary data frame.<br>
Here I've got both the mean and the standard deviation.<br>
Remember that you have to name the columns here, or you'll get an error seeing as the source column for both is the same.<br>
And there you go.<br>
Granted, you could get the exact same data and more by calling describe, but you get the idea.<br>
A common scenario is to want the average of several columns.<br>
For example, on quantity and on price.<br>
This time I could skip the aliasing since they're different columns.<br>
You probably should alias anyways so that you can distinguish the data in the result, but I'm lazy, didn't want to type that out.<br>
Speaking of typing too much out, remember the shortcut of using a string to specify a column name?<br>
Well, you can specify multiple in a list and when you do that you can apply an expression to the list.<br>
Same result as before but with less typing.<br>
I like it.
|
|
show
|
2:05 |
Now, let's look at aggregating based on groups of data.<br>
You do this through the group_by method, which takes a column name to perform the grouping on, and then you chain this with an agg method which specifies what operation to perform.<br>
Let me start by counting the instances of product.<br>
Ignore the deprecation warning for a second, I'll come back to it.<br>
I've grouped on product, which means Polars finds all of the unique instances in the product category.<br>
Book, poster, shirt, and pen.<br>
Then the agg call says to count the number of each.<br>
Like in previous lessons, this is counting instances of the invoices, not the quantities.<br>
The result is similar to what you have done with a value count call earlier, but without the need to unnest it afterwards.<br>
Now, back to that deprecation thing.<br>
Polars wants you to use len instead.<br>
I don't know why, but let me do as I was told.<br>
And there you go, same thing, but without the warning.<br>
So just what does groupby do?<br>
In a similar vein to expressions, calling group_by returns a GroupBy object.<br>
Unlike an expression, this is associated with a data frame.<br>
but like an expression, this hasn't done anything yet.<br>
To get useful info, you need to call agg on your GroupBy object.<br>
The agg call can take multiple arguments, aggregating on each.<br>
Like before, this call groups on product, but now it's performing two different aggregations.<br>
First, counting the number of rows for each group, and second summing the corresponding quantity column.<br>
This is essentially the equivalent of that SUMIF formula in Excel, except you don't have to know what unique values are there in order to get all of the results.
|
|
show
|
1:34 |
Since all of these calls return data frames, you can mix and match filters with group_by to examine patterns in a subset of your data.<br>
The filter here looks for just those purchases by the Viva company.<br>
Then groups by product.<br>
Then I'll count the instances.<br>
Sum the quantity, aliasing the column for clarity, and get the mean of the quantity, which is why I had to make sure to alias.<br>
And there you go, sales information for the Viva company.<br>
You can also group on multiple columns.<br>
Company and product in this case.<br>
Like before, summing the quantity column.<br>
For variety, this time I grabbed the standard deviation.<br>
When you group by two items, you get unique pairs, in this case company product pairs.<br>
The results here are a bit of a mess.<br>
For clarity, it probably would have been good to chain a sort on the end.
|
|
show
|
2:18 |
Let's leave that be though and deal with the other messiness.<br>
There are a lot of nulls in our result.<br>
Depending on what you're doing, that might be fine or it might be something you want to fix.<br>
The standard deviation calculation requires two or more items to work, hence all of the nulls.<br>
Poking at the data also begs the question as to why there are negative values in the quantity column.<br>
It could be dirty data or it could be a product return.<br>
Let's go with returns for now and leave the idea of dealing with dirty data for later.<br>
That's the same as before, but I've captured it into its own data frame to demonstrate a couple of different ways of dealing with null.<br>
The first is fill null.<br>
Fill null takes an argument that specifies what to replace null values with.<br>
The most common strategy is the one I've used, which just replaces them with zero.<br>
You can also just pass in a value instead.<br>
So I could have used an argument of zero and saved some typing, but then you wouldn't have seen the strategy keyword.<br>
Let's look at some other strategies.<br>
Min replaces null with the smallest value in the column, max with the largest, and mean with the mean.<br>
You get the idea.<br>
There are two other strategies as well, which are to use the value before the null or use the value after the null.<br>
Instead of replacing nulls, you can get rid of them.<br>
Before I chop anything, just note here that there are currently 731 rows in values.<br>
The drop nulls call removes any row with a null value.<br>
The end result here is 209 rows, so the 500 and change with null in them are just gone.
|
|
show
|
1:34 |
One of Excel's power tools is the pivot table.<br>
If you've never used this before, it takes vertical data and transforms it so that it's horizontal.<br>
This is kind of like a group by.<br>
If you pivot around company and product, you get a single row for each company and a column for each type of product the company has.<br>
You can then perform aggregation calculations across those columns.<br>
Let me quickly show you how this is done.<br>
First off, I need a new worksheet to put my pivot table inside of.<br>
Now I switch back to the datasheet, move to the Insert tab, and choose Pivot Table.<br>
It's pre-selected the range for me, and now I pick Sheet 1 for the result.<br>
On the right hand side, I choose those columns on which I'm aggregating, drag the rows up to what I'm pivoting into columns, and this gives me a pivot table, showing the number of books, pens, posters, and shirts for each company.<br>
With the data transformed, I can add new formulas to look at purchasing trends.<br>
Of course, I wouldn't be bringing all this up if there wasn't a way to do it in Polars.<br>
So, go go Gadget REPL.<br>
That's a dated reference.<br>
Google Inspector Gadget.<br>
I'll be here when you get back.
|
|
show
|
2:15 |
Pretty sure you're familiar with this by now.<br>
To pivot in Polars, you use the pivot method.<br>
The arguments to it are almost identical to the same things you filled in on the Excel wizard.<br>
The first argument to pivot is the column to slice into parts, each unique value getting its own new column.<br>
The index argument is what to group the parts by, and the values argument is the numeric content to go in the pivoted columns.<br>
Finally, the aggregate function argument is what math to do on those numeric values to produce a result.<br>
In this case, I've chosen sum.<br>
It also supports min, max, first, last, mean, median, and len.<br>
The result is our pivot table.<br>
Note that the columns are in a random order.<br>
You can fix that with the sort_columns argument.<br>
This time I'll sort the columns, and while I'm at it, I'll sort not just the columns but also the rows, and deal with the nulls.<br>
There you go.<br>
For products, sorting might not be that important, but if you got sales data by month, you definitely want a particular order.<br>
You can specify multiple values for your calculation by providing a list.<br>
The resulting table names the columns by combining each value with each unique item from product, including quantity book, quantity pen, total book, total pen, etc.
|
|
show
|
0:57 |
Let's review what you learned in this lesson.<br>
The DataFrame object has several aggregate operations that can be performed across its contents, like finding the sum or mean of values.<br>
If you want to find aggregate information on subsets of data, you can use group_by.<br>
This works by performing aggregate operations on the grouped data.<br>
And you can perform more than one aggregate at a time.<br>
Sometimes your DataFrame will contain null values.<br>
If you don't want nulls, you can use fill null to replace them, or drop nulls to get rid of them altogether.<br>
Like Excel, Polars gives you the ability to create a pivot table.<br>
Using Pivot, you can transform grouped rows into columns, allowing you to do cross-row analysis in a new data frame.<br>
So far, I've been throwing out all of our work, losing it when I close the REPL.<br>
Polars not only has the ability to read multiple types of data files, you can also write them.
|
|
|
24:59 |
|
show
|
2:32 |
In the previous lesson, I introduced you to group by aggregations.<br>
This lesson is all about files, including reading files that are more complicated than you've seen so far, merging files together, and writing files out to save your work.<br>
One of the pros of Excel is it's very flexible.<br>
It lets you put anything anywhere.<br>
One of the cons of Excel is it's very flexible, and it lets you put anything anywhere.<br>
This can be problematic if you're trying to do data analysis on a file, whether it's in Excel itself or reading that Excel file into Polars.<br>
In the examples so far, you've just read in a single worksheet from your Excel file, but Excel can have multiple sheets, and you can control the reader to account for this.<br>
Depending on the format of the sheet, you may only want a subset of it.<br>
The reader can help you with this as well.<br>
I'll see your GoGoGadget REPL and raise you a Grandizer Go.<br>
Wow, that just went from dated to esoteric as well.<br>
Okay, this time I'm going to be using a different Excel sheet called complicated.xlsx.<br>
Let me read it in.<br>
Well, it lives up to its name.<br>
It's complicated, alright.<br>
The exception says the sheet is empty.<br>
You know what?<br>
Let's go back to where I should have started and look at the Excel sheet itself.<br>
With apologies to Adam Savage, well, there's your problem.<br>
Not surprisingly, given the exception, our first worksheet is empty.<br>
There are multiple sheets in this workbook.<br>
Fortunately, the reader is capable of accessing other sheets.<br>
Let's go back to the REPL to try that out.<br>
There are two arguments you can pass to read_excel to get at different sheets.<br>
I'll start with sheet_id, which takes a one-indexed number.<br>
That's better.<br>
At least there's some data this time around.<br>
There's a lot here though.<br>
Say I just want the subset of data that corresponds to our usual example.<br>
I can use the columns argument to specify a subset of the columns to include.<br>
While I'm at it, I'll also use sheet_name instead of sheet_id to get at our sales data worksheet.<br>
That's a little better.
|
|
show
|
2:21 |
But, well, complicated is complicated.<br>
Let's try another sheet.<br>
Let me scroll back up here.<br>
And just what is this nonsense?<br>
Lots of empty and the column names are weird.<br>
Let's go back to Excel to figure this one out.<br>
Let me click the Levels sheet.<br>
See at the top here, somebody has included some notes or something above our data.<br>
As Excel doesn't have to be structured, this kind of thing happens all the time.<br>
You'll often find aggregate data above the data contents itself.<br>
What I'm actually after in this case starts on row 3, so back to the REPL, this time we'll grab just that.<br>
Much better.<br>
You'll recall that you had to install FastExcel to read in Excel files.<br>
Polars actually supports several different Excel reading libraries.<br>
This gives a degree of flexibility, but it does mean that the calling interface is a little muddled.<br>
The different readers have different capabilities, so instead of allowing arguments that don't always work, Polars has a generic argument that gets passed to the reader.<br>
Then, it's the underlying library's job to reject you if you get it wrong.<br>
One feature, available in FastExcel, but not in the other libraries, allows you to change what row you're reading from.<br>
To do so, you need to provide an arguments dictionary to the generic read_options parameter.<br>
In this case, the dictionary has a key of header_row with a value of 2 to get at the third row.<br>
Let's do that again, but this time restrict the columns.<br>
Previously, to do this, you saw me use the range A:G.<br>
In addition to ranges, you can also provide a comma separated list.<br>
This time I stored it away.<br>
The result contains columns A and C through H.
|
|
show
|
1:58 |
Unfortunately, I'm still not done.<br>
Notice the data type on the %ToTarget column.<br>
Whoever created the Excel sheet didn't store the percentage as a number and format it.<br>
Instead, they stored it as a string.<br>
Why?<br>
Who knows?<br>
The real world is messy and real data is dirty.<br>
Let's play around a bit and see if this can be fixed.<br>
Using replace_all got rid of the percent sign, but this still isn't a number. Note what I haven't done here: I haven't tried to merge it into the data frame yet. I find I do this quite frequently when I've got dirty data: I do some trial and error, producing a new data frame until I'm happy with it, and then, once I am happy, merge it in.<br>
This saves me having to start from scratch if I muck something up along the way.<br>
With the percentage sign gone, you can reach back in your memory several lessons ago and cast the string to an integer.<br>
While I'm at it, I'll also divide it by 100 to make it an actual percentage.<br>
There, that's better.<br>
The other weirdness with this file is that the month, day, and year are stored as separate values.<br>
It would be grand to have a date instead.<br>
Let's do a little more experimentation and see if that can be done.<br>
The date function is a constructor for date objects and it can take the names of columns as a source.<br>
That was almost too easy.<br>
Now that I've got the trial and error out of the way, I could use a well-constructed select to combine the previous experiment with this experiment, or a with_columns to add it to my new data.<br>
I'll leave that to you as an exercise.
|
|
show
|
1:24 |
Instead, let me cover one more common case of dirty data.<br>
Rather than read it in, I'll just create an example from scratch.<br>
American zip codes, those are postal codes for the rest of the world, are five-digit numbers.<br>
They really should be stored as strings, but sometimes they aren't.<br>
Then somebody comes along and enters a zip code for Maine, and you've got a leading zero.<br>
If you're in Python, this is a problem, as starting with a leading zero means an octal number.<br>
If you're in Excel, the leading zero will just get dropped.<br>
So zip code data in an Excel sheet often is incorrect, containing four-digit numbers when they should be five.<br>
I've created a data frame with my zip code problem.<br>
Now, let's cast it to a string, then use pad_start to make sure that everything is 5 characters long.<br>
And there you have it.<br>
String-based, correct-length zip codes.<br>
A significant part of a data scientist's job is cleaning the data before doing anything with it.<br>
The real world is dirty, dirty, dirty.
|
|
show
|
1:39 |
I mentioned that the read_excel call supports multiple libraries.<br>
Polars currently supports Calamine, which is a Rust library that you've already used.<br>
That's the FastExcel package.<br>
You saw how to use the read options argument with FastExcel.<br>
Anything you pass into that argument gets sent along to the underlying load_sheet_by_name call, or its ID-based equivalent.<br>
If you look up the fastexcel docs, anything in load_sheet_by_name can be used in the read_options dictionary.<br>
fastexcel is one of the newer additions, and it's the default.<br>
It is a generally better library and faster than the others.<br>
Before fastexcel was integrated, there was also openpyxl, which is still there if you want it.<br>
With this one, you can't use read_options, as the underlying call structure is different.<br>
Instead, you use the engine_options argument, which gets passed to the library's constructor.<br>
Two arguments that might be useful are rich_text and keep_links, which change how formatted text and links get handled.<br>
I'd stick with fastexcel, unless you find a feature in openpyxl that you need and isn't available to you otherwise.<br>
In case two choices weren't enough, you can also use XLSX2CSV.<br>
That's a mouthful.<br>
This engine converts an Excel file to CSV underneath and then calls the CSV reader.<br>
Like with openpyxl, the optional configuration is done through the engine_options argument.<br>
The library supports configuration for specifying the date and time formats, how string quoting is done, how hyperlinks get handled, and more.
|
|
show
|
1:58 |
If you've got data in multiple sheets or multiple files, in Excel you might copy and paste from one to the other.<br>
In Polars, you can't do that.<br>
I briefly covered concat earlier.<br>
Let's drill down a little more on what combining data from data frames looks like.<br>
For this exercise, I've got a new Excel sheet called Extra 50, which contains 50 new rows to combine with our original sales Excel file.<br>
Off to the REPL once more.<br>
Insert your own cheesy 80s cartoon reference here.<br>
Back with our usual sales data file.<br>
And here I've read the new extra 50 into its own data frame.<br>
And there it is.<br>
It's important to pay attention to the shape of the two files.<br>
If they aren't the same shape, you can't concat them.<br>
You'll get an error.<br>
These are the same shape, so let me concat them.<br>
Hmm, shape on its own isn't enough.<br>
The columns have to match as well.<br>
The error here tells you that there is a column named Total and another named Amount, which don't match.<br>
Here's the schema of Sales.<br>
And Extra.<br>
Sure enough, the schemas don't match.<br>
Never fear, Polars has a rename method on the data frame.<br>
It takes a dictionary where the key is the old name and the value is the new one.<br>
That looks better.<br>
Like with most calls, this didn't change anything.<br>
It spit out a new data frame.<br>
Since I'm happy with it, I'll do this again and store it away.<br>
Now to concat.<br>
And there you go.<br>
The shape is now 1,050 rows long, as expected.
|
|
show
|
2:05 |
Adding rows to the bottom of the data frame is one of the easier use cases.<br>
What if you've got cross-reference data between files?<br>
In Excel, you'd copy your cross-reference data into a new worksheet, then use the VLOOKUP formula to combine data in the main sheet.<br>
Say I want to add the customer level information to my sales data.<br>
This sheet contains the lookup for each customer's level.<br>
Let's go to Excel and add this information to our sales data.<br>
Behind the scene, I've opened another Excel file called Customer Levels and copied the data to the clipboard.<br>
Let me paste that into a new sheet.<br>
Now, back to the sales data, I'll add a column for the level information.<br>
And now to do the lookup.<br>
The first argument is the cell to get the information to cross-reference.<br>
That's our company name.<br>
The second argument is where the lookup is happening, which is the A and B columns of the other sheet.<br>
The next argument is the column to cross-reference, which is our second column in the level sheet.<br>
And the final argument is false.<br>
True would give the closest match.<br>
I only want exact matches.<br>
I've got gold for real cube.<br>
Let's go back to the sheet and check it out.<br>
Yep, that looks like it worked.<br>
Back to the main sheet.<br>
And I'll copy the formula down.<br>
Moving back to the top, notice the missing data here for Zuxo, or however you're supposed to pronounce that.<br>
That means there's no Zuxo in the lookup.<br>
Remember this situation.<br>
You're going to encounter it in polars as well.
|
|
show
|
3:05 |
Okay, I'm back in the REPL with the usual sales data loaded.<br>
And now I've got the customer levels as well.<br>
There it is.<br>
First off, look at the columns.<br>
If I want to cross-reference with the other sheet, I need the names to be the same.<br>
Now I've got consistency for the company column.<br>
The equivalent of VLOOKUP in Polars is the join method.<br>
If you're a database person, yes, this is named after an SQL join.<br>
That wasn't too bad, was it?<br>
The join call takes at least two arguments, the first being the data frame to join with, and the on argument specifies what column to base the join upon.<br>
I think I'm good.<br>
Let me just store this away.<br>
You ever get that little suspicious feeling?<br>
Hmm, that's a problem.<br>
The inner data frame has three fewer entries than the sales data.<br>
I have a hunch.<br>
Right, remember how there's no lookup data for Zuxo?<br>
Well, the inner join resulted in its removal from the original.<br>
The join call supports several different ways of joining.<br>
Inner means to return rows that are common to both sets of data.<br>
What you want in this case instead is left, which keeps the first set and joins any matching rows from the second.<br>
That's better.<br>
Yep, Zuxo is there with null for the level value.<br>
That makes sense as there was no data to join it with.<br>
Okay, let me go back a couple of steps and show you a shortcut.<br>
Reloaded the customer data.<br>
And instead of renaming the columns like I did before, you can add arguments to the join call.<br>
This says what column in the first dataframe is to be used as the left part of the join, and what column in the second dataframe to use as the right part of the join, saving you that extra renaming step.<br>
The join call supports seven different mechanisms for merging data, so depending on your case, dig into the docs for the others.<br>
Personally, I find, whether in SQL or otherwise, it's the left join I need 99% of the time.
|
|
show
|
1:47 |
As a nice companion to read_excel, Polars has write_excel for writing a data frame to a file.<br>
Like with the Reader, you can write multiple sheets, and there are even mechanisms for providing styling.<br>
Also like with reading, writing is done with a third-party library.<br>
This one is called XlsxWriter.<br>
Which of course means you need to pip install xlsxwriter into your virtual environment before going any further.<br>
Usual suspects here?<br>
Who is Keyser Söze?<br>
At least my dated references have skipped forward a decade this time.<br>
To practice with writing, first, let's filter some data.<br>
This is your Viva customer data.<br>
Now, writing is very, very complicated.<br>
Pay close attention.<br>
Did you catch that?<br>
I actually wouldn't recommend doing it this way, as the result is a file named dataframe.xlsx in the same directory as your script, but you can do it this way if you want.<br>
It's definitely fast.<br>
Let's take a quick look at what it produced.<br>
Because I'm anal retentive and can't stand having my data files in the same directory as my scripts, I have, behind the scenes, moved dataframe.xlsx to a directory named processed.<br>
More on that in a later lesson, but for now, just know that the write_excel call that I made didn't put the file where it is in the repo.<br>
Looking at this file, you can see that you've got the contents of the data frame, and that might be enough for you.<br>
It is a little messy though, isn't it?<br>
Let's try it again, this time with a little more style.
|
|
show
|
2:58 |
To have more control on your output file, you need to create a workbook object.<br>
The best way to do that is within a context.<br>
That way you can perform multiple actions on the workbook, and it closes and saves automatically when you leave the context.<br>
The argument to workbook is the name of the file that I want.<br>
This time I'm explicitly putting it in the process directory.<br>
Then I keep a reference to the context manager to work with it.<br>
First I call the write_excel method, passing in the workbook object.<br>
This means you won't get that default file anymore.<br>
The worksheet argument says what sheet to write the data to.<br>
You can use this for multiple sheets, or if you only have one and want to name it.<br>
The column totals argument adds a sum feature at the bottom of your table, which is kind of cool.<br>
It's a common enough thing to do, and you don't have to do much to take advantage of it.<br>
Turning on autofit resizes the columns in the sheet to the width of the data.<br>
This is the equivalent of double-clicking on the column header boundary for each column in the sheet.<br>
And this one is for my American friends.<br>
Got a bizarre date format that is against the international standard, which is neither biggest-to-smallest nor smallest-to-biggest?<br>
No problem.<br>
Use dtype_formats.<br>
And all the Polars columns with a date data type will get formatted in your strange little way.<br>
You be you.<br>
The header format argument allows you to provide styling to the header row of the sheet.<br>
Here I've made the text bold and a funky purplish thing called Byzantium.<br>
Put that all together and you get an Excel file.<br>
Since this is done in a context block, you can perform multiple calls to write_excel with the same workbook, allowing you to mix and match your dataframes onto different sheets.<br>
There are also arguments for what position to put your dataframe in, so you could combine dataframes on the same sheet.<br>
You can also control row heights, formulas, do conditional formatting, and a whole lot more.<br>
If Excel is your desired report format, you can produce sheets that look pretty much any way you like.<br>
Let's take a quick look at the result.<br>
And this is much better than before.<br>
Nice and formatted with some extra little bells and whistles.<br>
Byzantium's kind of a pretty color.
|
|
show
|
2:14 |
Before wrapping up this lesson on file operations, I want to briefly talk about file limits.<br>
Both Excel and Numbers on macOS limit how many rows you can have.<br>
In Numbers, the limit is exactly a million, while Excel lets you go a few thousand rows past that.<br>
If you attempt to load a really large CSV file with more rows than that, the spreadsheet program will load as much as it can, then give you an error.<br>
It also isn't the fastest operation.<br>
One more reason to transform your data analysis game with Polars: you can deal with much larger files, and in less time.<br>
Just to make my point, I've got a CSV file named "larger".<br>
Let me load it now.<br>
That did nothing for 4 or 5 seconds, then gave me a warning, and sure enough if I scroll down to the bottom...<br>
Yep, not everything made it here.<br>
Let's go see how the polar bear fares.<br>
Alright, let's grab that file.<br>
A tiny bit faster than Excel, huh?<br>
Just to prove it worked.<br>
That's 1,100,000 rows in less than a second.<br>
It isn't just fast to load.<br>
Let's play a bit.<br>
Summing all the columns comes back almost instantaneously.<br>
How about something more complex?<br>
Speedy little bear, huh?<br>
Multiple aggregation calculations on a sheet Excel couldn't load, and all in a matter of microseconds.<br>
For something like this, the load time of a Python script would be the bottleneck rather than the calculation itself.
|
|
show
|
0:58 |
Time to recap.<br>
Although you saw a concat before, this time around I showed you a more real-world example where the dataframes not having matching schemas was something that needed to be dealt with.<br>
To reconcile data from one dataframe to the other, you can use the SQL-inspired join method.<br>
It supports multiple ways of joining data, most of which are similar to SQL joins like left, inner, right, and more.<br>
In this lesson, you saw how to use additional arguments to the read_excel call in order to control where in the workbook you're reading from, and also saw its companion write_excel through the use of the XlsxWriter library.<br>
At the tail end of this lesson, you got a little hint at just how speedy the polar bear can be.<br>
Would you believe me if I told you so far you've been using the slow methods?<br>
Interestingly enough, the lazy polar bear is faster than its cousin.<br>
The next lesson is all about how to be lazy.
|
|
|
7:32 |
|
show
|
3:57 |
In the previous lesson, I did a deep dive on file input and output with Polars.<br>
In this lesson, I'll introduce you to lazy evaluation and how that can speed up your queries.<br>
Up until now, everything you've done has been with the entire dataset loaded into memory in a data frame.<br>
If you ever need to deal with really large datasets, this can be a limitation.<br>
Enter my favorite Polars feature, the lazy frame.<br>
This is why I've switched to it from that other bear.<br>
A lazy frame is like a data frame, but as its name implies, it does lazy evaluation of queries.<br>
Lazy evaluation provides a couple of advantages, the first of which is you allow Polars to evaluate all your operations together, giving it the opportunity to optimize them.<br>
As a simple example, say you're only running operations on a subset of rows.<br>
By doing the filter before the operation, you can keep fewer things in memory at a time and only perform the operations on those rows that need it.<br>
The most common way to create a lazy frame is to use this scan family of methods.<br>
These are like the read methods, but they don't load the whole file in at once.<br>
Not all file formats that are supported by read have scan equivalents.<br>
For example, since Excel is actually a compressed file format, scanning a part at a time isn't really possible.<br>
As such, there is no scan_excel.<br>
You can still read it and convert the data frame to a lazy one, but as a lot of the optimization happens at the scan step, this will only be beneficial if you're doing a truckload of operations.<br>
You perform actions on your lazy frame similar to the regular data frame, but the actual calculation doesn't happen until you request it.<br>
You do that by calling the collect method, which then returns the result.<br>
I'll give you one guess as to where I'm going next.<br>
Ready?<br>
Did you guess right?<br>
How about the first line?<br>
Did I surprise you there, or were you expecting it?<br>
I promise the second line will be different.<br>
Instead of reading, let's scan.<br>
As I said, there's no scan_excel, so I've reverted back to using the CSV file.<br>
And like with the read call, I've used the try_parse_dates argument to process the date data properly.<br>
The scan call returns a lazy frame.<br>
Now you use it to chain operations together.<br>
I'll store the result of the chain in a variable called query.<br>
Starting with the lazy frame itself.<br>
Then filter for just the company Viva.<br>
group by product, aggregate on the count, the sum of quantity, the mean of quantity, and then close all those parentheses off.<br>
Let's look at the query object.<br>
It's a lazy frame.<br>
Every action on a lazy frame returns a lazy frame.<br>
To get at the actual result, you call collect.<br>
This would be the spot to say something clever about call and collect.<br>
But you young ones and your fancy cell phones probably don't even know what long distance charges are.<br>
And if you've heard of call and collect, it's only from watching prison TV shows.<br>
Where was I?<br>
Oh yeah, I called collect.<br>
The result is our aggregated data frame.<br>
On one hand, this is no big deal.<br>
You've seen this same result before.<br>
On the other hand, it's kind of cool that it's the same as before, because lazy evaluation with chained operations gives you the same result.
|
|
show
|
2:01 |
Polars provides information about the query attached to the lazy frame.<br>
You can see that information by calling the explain method.<br>
That's a little hard to read.<br>
Notice the \n escape sequences.<br>
I'm going to print it instead.<br>
That's sort of better.<br>
I won't lie to you and tell you I know how to read this.<br>
You kind of get the sense, and it has words like aggregate and calls to sum and mean, but there's some funky symbols in there like pi, and this isn't 3.14, but part of relational algebra and has to do with how the optimizer works.<br>
The technology here is well understood.<br>
Similar tech operates behind the scenes when you do complex queries on a database.<br>
The ability to do this kind of optimization is one of several reasons that Polars is so fast.<br>
In addition to the explain method, you can also call show_graph to get a graphical version of the output.<br>
This call requires that both Matplotlib and GraphViz are installed.<br>
Matplotlib is no big deal as it's just another pip install away.<br>
GraphViz, on the other hand, is not a Python library, so installation will be machine dependent.<br>
It's kind of neat that you have the ability to graphically display the content, but one, as I said, I don't really understand the text inside of it, so grouping it in lines and boxes isn't going to help me, and two, once the boxes start to get over full, they put little dot dot dots inside and clip the content, so you may not be able to see everything anyway.<br>
As such, this isn't a feature that I use myself.<br>
If you've already got a data frame, you can call its lazy method to get a lazy frame accessor to it.<br>
As I mentioned up top, this would be one way of getting a lazy frame after loading an Excel file.<br>
But a big part of that performance boost comes from skipping those rows that aren't of interest and not keeping them all in memory at the same time.<br>
As such, starting from a data frame first probably means you're not going to get a lot faster unless you're doing a really large number of queries.
|
|
show
|
0:43 |
In the sample code folder for the course, I have a directory called timing with two programs in it.<br>
The first, bear.py, loads our larger CSV file and times how long that takes.<br>
The second does the same operations but on a lazy frame instead.<br>
Due to the wonders of machine inconsistency, caching, and just general complexity, my timings were all over the shop.<br>
Most of the time, lazy.py was about 10% faster than bear.py, but not always.<br>
Even though larger.csv is too big for Excel, it still is under 100MB.<br>
I wasn't really stressing Polars enough to notice a huge speedup, but this little 10% improvement gives you a flavor of the possibility.
|
|
show
|
0:51 |
I'm almost feeling too lazy to do a review, but let's log through it.<br>
See what I did there?<br>
Polars does its best to optimize all of your operations, but with lazy evaluation, it has that much more of a chance to do so.<br>
With a lazy frame, operations don't happen until you call the collect method.<br>
This delayed execution means Polars can heavily optimize all the actions you've requested.<br>
This combined with being able to scan past the parts of a file that aren't needed for your calculation means the possibility of a performance boost.<br>
If you know more than I do about relational algebra, then you can examine the query plan to see what Polars is doing to optimize your actions.<br>
If you stuck with me so far, I've more or less covered the key parts of Polars.<br>
In the next lesson, I'm going to glue it all together with a case study that uses data sourced from the real world.
|
|
|
30:31 |
|
show
|
2:43 |
Theory is all well and good in theory, but the rubber hits the road when your shoulder's to the grindstone.<br>
Sorry, I'm using up my quota for cliches before the weekend.<br>
In the previous lessons, you've learned about Polars.<br>
In this lesson, I'll walk you through trying to create some reports based on real-world data.<br>
I've got three datasets containing world economic data.<br>
GDP, population, and research and development spend.<br>
All of this comes from the same collection available at DataHub.<br>
The GDP data has a country name and a code and a year and the total GDP in American dollars.<br>
The value column is overly precise, which I thought was strange at first, but I suspect this is just a result of converting to USD without rounding.<br>
Nothing like having a number in the billions with six decimal places of accuracy.<br>
Just what is 0.92345 of a dollar anyways?<br>
Get ready for a theme.<br>
Real world data is dirty.<br>
Next comes our population data.<br>
The structure of this file is similar.<br>
This time the values are a little more sane.<br>
And finally the R&D spend data.<br>
This file has a lot of columns, but I'm going to just concentrate on the direct spend by the government.<br>
Your mission, if you choose to accept it.<br>
Now see I've regressed several decades in my TV references.<br>
No wait, that could be a movie.<br>
See, I look like I'm current and hip.<br>
I just have to reference things that have gone full circle.<br>
Anyway, your mission is to produce three reports.<br>
The first requires some data analysis on the GDP values showing changes in values over time, and must include the best and worst years of data.<br>
Once you've done that, use the population data to calculate GDP per capita.<br>
And finally, add government research spend as a percentage of GDP.<br>
Before getting going, a quick note about directory structure.<br>
In the sample code, I've created a directory called case study, and the code for the work in this lesson is in there.<br>
When you're doing data science-y things in real life, it's good practice to keep your original data files together.<br>
I typically put this in a directory named raw.<br>
Then, have a separate directory for your output.<br>
That way, if you're writing reusable scripts, you're less likely to accidentally overwrite your original data.<br>
I tend to call my results folder processed.<br>
I didn't come up with this convention myself.<br>
I've copied the names from others, and I honestly don't know how consistent that convention is in the real world.<br>
The important thing is to separate your source from your results.<br>
Then secondly, being consistent is helpful in case you have to come back later.<br>
One last time into the breach.<br>
Say it with me.<br>
R-E-P-L.
|
|
show
|
1:20 |
Although I'm using three new files, I still have the same old starting point.<br>
First off, I'll read the GDP CSV data into a data frame.<br>
Let's take a look at it.<br>
Almost 14,000 rows.<br>
You can see a lot of repetition here as there are many years of data for each country.<br>
Note that the GDP dollars in the value column are stored as a float, and because they're bigger numbers, Polars is showing them using exponential representation.<br>
There's an order of magnitude difference between the values for Afghanistan at the top and Zimbabwe at the bottom.<br>
I don't like to type a lot, and spaces in column names always annoy me.<br>
This is really a pandas habit for a shortcut that doesn't exist in Polars, but I'm going to rename the columns anyhow.<br>
In addition to shortening some of the names, value is a bit vague, so I've changed it to what it is, the country's GDP.<br>
Note that I didn't actually save the result.<br>
I'm going to be doing this a lot.<br>
Try it out first, make sure I like the result, then do it again to overwrite the data frame.<br>
Alright, I've kept my changes now.
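Roughly what that looks like; the original column names below are my guesses at the CSV headers, and the short names are just the style I'd pick.<br>

```python
import polars as pl

gdp = pl.read_csv("raw/gdp.csv")   # placeholder path

# Map the long CSV headers to short, space-free names.
gdp = gdp.rename({
    "Country Name": "name",
    "Country Code": "code",
    "Year": "year",
    "Value": "gdp",
})
```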
|
|
show
|
2:52 |
Before doing any work, it's good practice to poke around a bit, look for dirty data, or anything else that might cause a problem.<br>
A good place to start poking around is with Describe.<br>
A few things can be learned here.<br>
First, the count is consistent across the columns.<br>
That's good, it means nothing is empty.<br>
Second, the null count is zero across the board, which means you've got data for everything.<br>
Next, having a standard deviation for year and GDP rather than null tells you those columns really are numeric.<br>
That's the good news.<br>
One little warning sign does pop out to me though.<br>
The smallest GDP value is really, really small.<br>
Remember, GDP is the total value of goods produced by a country.<br>
11,502 is below the poverty line for individuals in G7 nations, let alone the sum total of a country.<br>
Let's take a look at that one a little bit more.<br>
To examine it, I've filtered on anything less than 20k.<br>
There are three values, all for the same place.<br>
A quick look at Wikipedia shows that this is around the time that Georgia separated from the Soviet Union, so it's going from being state-like data to being country-like data, which, combined with the value of the ruble at the time, might explain the value.<br>
To sanity check, let's look at all of Georgia's data.<br>
Okay, so the rest of the data seems reasonable.<br>
Depending on your knowledge of the data source, you could just chop the outliers, or since the rest of it's okay, I'll naively assume that maybe it's all accurate.<br>
The difference between 1989 and 1990 is six orders of magnitude, but what's a factor of hundreds of thousands between friends?<br>
That's the low end.<br>
Let's try the high end for other outliers.<br>
That's a mighty big number.<br>
Let's hunt down where it comes from.<br>
To do that, you can filter on the number.<br>
You could copy and paste it, which would be easier.<br>
Or you can do it the hard way.<br>
Fetch it from within the data frame.<br>
The item call grabs a value from a data frame.<br>
Like I said, this is the hard way.<br>
But if you needed to get at this kind of content in a program, you wouldn't be able to just copy and paste.<br>
So sometimes the hard way is the only way.<br>
Let me stash that away.<br>
Got it.<br>
Now let's filter.<br>
Well, that's interesting.<br>
There's an entry for the sum total.<br>
Depending on what you were doing with your report, you might want to get rid of this as world isn't really a country.<br>
If you were headed for a pie chart or the like, you definitely want this gone.<br>
As our desired outcome is just a report table, I'm going to leave it alone.
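The poking around in this clip boils down to a handful of calls, something like the following, using the placeholder column names from before.<br>

```python
# Summary stats: counts, null counts, min/max, standard deviation.
gdp.describe()

# The suspiciously small GDP values, then all of Georgia for context.
gdp.filter(pl.col("gdp") < 20_000)
gdp.filter(pl.col("name") == "Georgia")

# The "hard way": pull the biggest value out with item(), then filter on it.
biggest = gdp.select(pl.col("gdp").max()).item()
gdp.filter(pl.col("gdp") == biggest)
```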
|
|
show
|
1:57 |
That was the overall picture.<br>
How about grouping by country?<br>
The shape here tells you that there are 262 entries.<br>
A quick google tells me there are 195 countries, depending on your definition of a country, and man, I'm not going to touch that with a 305cm pole.<br>
Does 262 seem like too much?<br>
Given the data goes back to the 1960s and countries change over time, this actually feels reasonable to me.<br>
Let's store this away and do some processing on it.<br>
Be careful here.<br>
This isn't the last row.<br>
This is the max for each column.<br>
Now it turns out that the Zimbabwe row does have a value of 64, but that is sheer coincidence.<br>
If you look back at the table before, you realize you've got differing amounts of GDP data per country.<br>
Some have 64 years worth, and some have less.<br>
That means our report will have to have some holes in it.<br>
And this is the min.<br>
The smallest value is four years of data, and again, that doesn't correspond to Afghanistan.<br>
Afghanistan is just the first country alphabetically.<br>
And by the way, its code isn't ABW.<br>
That's Aruba.<br>
This is what max and min do, give you the biggest and smallest values per column.<br>
I know I'm harping on about this, but I've messed up my analysis multiple times by forgetting this fact, and I'm trying to learn from my mistakes.<br>
Now that you know there's a range of year data, with four years being the least and 64 years being the most, like I said, our report's going to have a lot of holes in it.<br>
Let's find out how many countries have less than the full set of data.<br>
Our original shape was 262 countries long, while this one is 128, so a little less than half our countries don't have the full set of year values.
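A sketch of that grouping, again with placeholder names; remember that max() and min() here work per column, which is exactly the trap mentioned above.<br>

```python
# One row per country, with a "len" column counting its rows of GDP data.
per_country = gdp.group_by("name").len()

per_country.shape   # 262 countries in the narration
per_country.max()   # biggest value in each column, not a single "biggest row"
per_country.min()

# Countries that don't have the full 64 years of data.
per_country.filter(pl.col("len") < 64)
```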
|
|
show
|
3:53 |
Now that you've got a sense of the data, let's head towards the objective.<br>
What are the min and max GDP values per country, and which year did those happen within?<br>
There's a lot of data to mess around with.<br>
Rather than attack it all at once, I'm going to experiment on a single country to get my queries right.<br>
Then I'll apply that same logic to the whole dataset.<br>
Seeing as St. Martin is at the top of the table there, and it doesn't have a lot of data, I'm going to start with it.<br>
That, and it makes me nostalgic.<br>
I've been there a couple times, it's a lovely little island.<br>
As I'm recording this, there's snow outside my window, so some Caribbean sun feels like just the right state of mind.<br>
That's all of our MAF data, four years, and not even contiguous.<br>
Stored it away.<br>
Now to make an attempt at our goal.<br>
Well, that's a bit of a lot, isn't it?<br>
If I only wanted the worst GDP for a single country, I could use a filter, then call min, but seeing as I'm going to want the data across a lot of entries, I'm going to need something like group by.<br>
Instead of group by, I'm using a slightly different approach.<br>
One, so I can show it to you, and two, because it will solve another problem in the future.<br>
What this expression says is to examine the GDP column, then from it get the value in the GDP column that is smallest.<br>
That's the argmin part.<br>
What the over call does is limit the execution to a group.<br>
In this case, that's the country code group, giving us essentially a group by but through a column instead.<br>
The advantage of this mechanism is instead of getting a summary data frame, I can stick this value into a row.<br>
For the four rows of data for St. Martin, each will get a value containing the same number, which will be the minimum GDP for all of those four years.<br>
It is a bit lengthy and a lot to wrap your head around, but it did the trick.<br>
2021 was St. Martin's worst GDP year, and that is showing up in each row.<br>
Now, why did I do it this way?<br>
Well...<br>
Not only do I want the worst GDP, I also want the year that it happened in.<br>
This expression is almost identical, except for the column it's operating on.<br>
This one is operating on the year column.<br>
It still looks for the smallest GDP, but the output value is the year in which that happened.<br>
Not bad, huh?<br>
Since our experiment worked, I'll try this with the full dataset.<br>
Worst GDP amount?<br>
The year of the worst GDP?<br>
Changing argmin to argmax gets the best GDP and the year in which that happened.<br>
The data looks reasonable.<br>
Let's rerun it and keep it this time.<br>
I probably should have put that in a different variable in case I mucked something up, but once in a while you can live dangerously.
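One possible shape for those expressions, using the placeholder column names; the get/arg_min combination is my reading of what's described on screen, so treat it as a sketch rather than the repo's exact code.<br>

```python
gdp = gdp.with_columns(
    # Worst GDP per country, and the year it happened in.
    pl.col("gdp").get(pl.col("gdp").arg_min()).over("code").alias("worst_gdp"),
    pl.col("year").get(pl.col("gdp").arg_min()).over("code").alias("worst_year"),
    # Swapping arg_min for arg_max gives the best GDP and its year.
    pl.col("gdp").get(pl.col("gdp").arg_max()).over("code").alias("best_gdp"),
    pl.col("year").get(pl.col("gdp").arg_max()).over("code").alias("best_year"),
)
```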
|
|
show
|
1:05 |
Let's spot check by looking at the beauty of St. Martin.<br>
Looks like the same data as the experiment from before, which gives me some confidence.<br>
Let's try with a bit more data.<br>
Okay, but what if I want to see all of it?<br>
The short form version here is something you can control.<br>
The config object in Polars controls your configuration.<br>
The set_tbl_rows method allows you to specify how many rows show up in the summary table.<br>
Setting it to -1 gives you all of them.<br>
And there it is, all 64 rows of Afghanistan.<br>
Before looking into anything else, let me put the config back the way it was.<br>
Setting a config value to None resets it to the default.<br>
I find that a little weird, but I guess it's better than requiring a separate method.
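For reference, the calls look something like this, still with the placeholder names.<br>

```python
import polars as pl

pl.Config.set_tbl_rows(-1)    # -1 means show every row when printing
print(gdp.filter(pl.col("name") == "Afghanistan"))

pl.Config.set_tbl_rows(None)  # None puts the setting back to its default
```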
|
|
show
|
3:25 |
Our report isn't bad, but it could be a little better.<br>
It has the information that was requested, but the repetition of showing the best and worst years in every row feels a little clunky.<br>
Instead, let's build a pivot table so each country has a row and each column is the GDP information.<br>
Once more, let's start by experimenting with a smaller amount of data.<br>
There's the pivot for St. Martin.<br>
Remember back when I introduced you to pivot tables and I said "sort_columns" was useful sometimes?<br>
Well, when your pivoted columns are years, you want them sorted, otherwise they'll be in a random order.<br>
The pivot looks good.<br>
Let's do it for the whole enchilada.<br>
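The pivot call is roughly the following; the index columns are my guess at what's left after the renaming.<br>

```python
# One row per country, one column per year, GDP as the values.
pivoted = gdp.pivot(
    on="year",
    index=["code", "name"],
    values="gdp",
    sort_columns=True,   # keep the year columns in order
)
```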
Okay, nothing broke.<br>
Let's store it to play with it.<br>
Now to double check the data a bit.<br>
Yep, St. Martin looks right from what I can see.<br>
There are a whole bunch of nulls, but that's because there's holes in our data.<br>
There are so many columns in our table that at the moment you can't see them all.<br>
Let's make another change to the config to fix that.<br>
The set_tbl_cols method controls how many columns show up in the summary.<br>
Now I'll view it again.<br>
Hmm, did I say it would fix it?<br>
Wasn't very helpful, was it?<br>
I'm not a fan of how Polars handles horizontal wrapping.<br>
There is another way to get a glimpse of the data though.<br>
First, let me reset the config.<br>
You could call set_tbl_cols with None, or if you've got a lot of config changes and you want to reset them all at once, you can use restore_defaults instead.<br>
It's about the same amount of typing when you've got only one change, but if you're doing a bunch, that can be handy.<br>
Now, for that glimpse at the data I was talking about.<br>
Let me scroll back up here.<br>
Now you might understand why I kept using the word glimpse.<br>
The glimpse method returns columns as rows.<br>
For this query I only had one row, but if you had several, the first few values are shown per column.<br>
As I head back down to the bottom, you can see the four years for which there is GDP data.<br>
Don't ask me what those dollar signs are.<br>
I have no idea.<br>
And let me tell you, the people who wrote it don't either.<br>
Glimpse is a feature copied from a library from the R programming language.<br>
The Polars folks more or less implemented exactly what was done there.<br>
There are no comments about it in the code, and a colleague of mine asked about it on the Polars Discord server and got back, well, because they did it that way.<br>
If you're also an R developer and you know why the money sign is there, drop me a line.<br>
And in case the dollar sign isn't confusing enough, there's another R library with a glimpse routine that uses ampersand instead.<br>
This is starting to sound like the horse butt-size explanation of train track width.<br>
Google that if you haven't heard about it.<br>
It isn't true, but it is fun.
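To pull the last few steps together, the config and glimpse calls look roughly like this.<br>

```python
pl.Config.set_tbl_cols(-1)     # try to print every column (the wrapping isn't pretty)
print(pivoted)

pl.Config.restore_defaults()   # reset all config tweaks in one go

# glimpse() flips the display: one line per column, first few values shown.
pivoted.filter(pl.col("code") == "MAF").glimpse()
```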
|
|
show
|
2:33 |
Having used Glimpse, I'm confident in our pivot table.<br>
Now, let's add the best year information into that data frame.<br>
In the pivot, you've got all of the GDP years as columns.<br>
The way of getting the biggest value is to use max_horizontal.<br>
That works, but note that I didn't actually include all the possible columns in my list of arguments to max_horizontal.<br>
I could type them all out, but there's 64 of them.<br>
Time to try another approach.<br>
There are all of our columns, including the 64 years.<br>
Length tells me I have 66 in total.<br>
Saves me having to count them.<br>
Seeing as max_horizontal takes an iterable of column names, all I have to do is slice this to get the right names.<br>
That's the slice with the columns I want, so now I'll go back and apply this.<br>
And this time, instead of hardcoding the column names, I used the slice.<br>
And there you go.<br>
Big and messy, but it includes our new best GDP column.<br>
Seeing as that worked, let me add the column to the data frame.<br>
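Here's the shape of that; which slice you need depends on your column order, so the [2:] below just matches the two index columns assumed earlier.<br>

```python
# Skip the code/name columns, keep the 64 year columns.
year_cols = pivoted.columns[2:]

pivoted = pivoted.with_columns(
    pl.max_horizontal(year_cols).alias("best_gdp")
)
```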
I'll leave the creation of the worst year column as an exercise for you.<br>
That's teacher speak for I'm too lazy.<br>
The good news about the pivot approach is you get all the country data in a row, and you've seen how you can use max and min horizontal to get best and worst values.<br>
The original request also asks you to note the best and worst years.<br>
Unfortunately, that's not a simple task with this format.<br>
With the pivot table version of the report, the value you'd want is the column name.<br>
I found a couple of attempts online to do this, but each one had drawbacks, and the best looking one had a comment saying, this no longer works.<br>
If your boss absolutely wanted the best and worst info, I'd suggest using a group by and putting it in different tables altogether.<br>
If your report is in an Excel file, you don't have to have everything showing on the same page.<br>
They could be separate sheets.<br>
Speaking of output, let's write out our pivot table as a result.
|
|
show
|
2:22 |
That leads us to task number two, determining the GDP per capita.<br>
First off, let's read in the population file.<br>
Hmm, that's not fun.<br>
The exception here states that it found a floating point value in an integer column.<br>
CSV files don't have type information, so the reader is guessing.<br>
It bases that guess on the first X number of rows, where X is something you can configure.<br>
If you encounter a conflicting data type after that guess has been made, you get an error.<br>
Seeing as this is population data, I'm not sure what half a person is.<br>
That's not true.<br>
I know exactly what half a person is.<br>
It's dirty data.<br>
Let's see what can be done about this.<br>
The error message gives you a couple hints that you could try, but I'm going to do something else.<br>
Setting infer schema to false tells the reader not to guess at the data types.<br>
The end result is that everything is a string.<br>
That's okay, string data can always be converted.<br>
Here, I've added a column with the name value, small v, based on a cast to integer.<br>
Yep, that didn't work either.<br>
The problem here is the same as the error on reading the file in.<br>
There are 18 float values in our data that should be integers.<br>
Cast is kind of restrictive.<br>
It won't truncate the value for you.<br>
Let's force the issue.<br>
By casting to float first, the string gets converted to a number.<br>
Then, by casting to integer after, it will get rid of those pesky half people.<br>
And there you go, a new column with the integer data type.<br>
The year column is also a string.<br>
Let's fix that too.<br>
There's our double casting trick again, this time for the population value.<br>
And converting the year.<br>
And that's not bad.<br>
Let's store it for further processing.
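Condensed into one snippet, and with my placeholder column names, that whole dance looks something like this.<br>

```python
# Don't guess types: everything comes in as strings.
pop = pl.read_csv("raw/population.csv", infer_schema=False)

pop = pop.with_columns(
    # string -> float -> int: the float cast parses "123.5", the int cast drops the half person
    pl.col("Value").cast(pl.Float64).cast(pl.Int64).alias("population"),
    pl.col("Year").cast(pl.Int64).alias("year"),
)
```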
|
|
show
|
0:39 |
You'll recall that I don't like long names, and the current version of the schema also has columns I no longer want.<br>
Let's clean that up.<br>
Starting with a call to rename.<br>
That looks better.<br>
Now let's get rid of the original string columns that got cast.<br>
The drop call removes a column.<br>
Doing it again for the year.<br>
And you've got some cleaner data.
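Something along these lines; the exact names being renamed and dropped are my guesses from the narration.<br>

```python
pop = pop.rename({"Country Code": "code"})

# Drop the original string columns now that the cast versions exist,
# plus the country name we won't need for the join.
pop = pop.drop("Country Name", "Value", "Year")
```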
|
|
show
|
1:32 |
Now that the data is in shape, it's time to join it to the GDP table.<br>
Remember to use a left join so that all of the GDP data is kept.<br>
If there are years only in the population data, you want those ignored; years missing from the population data will just show up with nulls.<br>
This looks like it worked, so I'll update the data frame.<br>
Now, time to add the calculation to determine the GDP per capita.<br>
That's merely a matter of doing division.<br>
Looks like that worked.<br>
Let me store it away.<br>
Let's do a quick sanity check.<br>
You'll recall that Georgia had some really small GDP values.<br>
I guess that they had more than 11,502 people at the time, and so you should end up with some pretty small per capita GDPs.<br>
Let's look at the other extreme.<br>
If you haven't seen the underscore before, Python allows you to put underscores where you would put commas in your numbers to make them easier to read.<br>
Monaco seems to be the place to be.<br>
Let's write this out and check off requirement number two.
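Task two, end to end, is roughly the following; names and paths are still my placeholders.<br>

```python
# Left join keeps every GDP row; population-only rows are ignored,
# and GDP years with no population data just get nulls.
gdp_pop = gdp.join(pop, on=["code", "year"], how="left")

gdp_pop = gdp_pop.with_columns(
    (pl.col("gdp") / pl.col("population")).alias("gdp_per_capita")
)

gdp_pop.write_csv("processed/gdp_per_capita.csv")   # placeholder output name
```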
|
|
show
|
1:45 |
The last objective is to add government R&D spend as a percentage of GDP.<br>
Let me grab the R&D file.<br>
You'll recall from the slides that there's a lot of information in this one, but I only care about a few columns, so I'm going to restrict the data frame to just what I need.<br>
Let's take a look.<br>
There are just over 1300 lines.<br>
Looks like there are some nulls in the data as well.<br>
I want to fix the column names.<br>
Instead of renaming things, since there are only a couple of columns that I'm interested in, this time I'm going to do a select and save it instead.<br>
As I plan on using this data for a join, I don't need the country name, so it just seemed easier to do this as a query rather than renaming and dropping columns.<br>
Since this looks good, I'll overwrite the rd variable.<br>
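The select-and-rename in one go looks roughly like this; the source column names are purely my placeholders, since the real file has far more columns.<br>

```python
rd = pl.read_csv("raw/rd_spend.csv")   # placeholder path

# Keep only what's needed for the join and the calculation,
# renaming as part of the select.
rd = rd.select(
    pl.col("Country Code").alias("code"),
    pl.col("Year").alias("year"),
    pl.col("Government direct spend").alias("rd_spend_thousands"),
)
```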
Let's do a bit of poking.<br>
Polars, where you're encouraged to poke the bear.<br>
Looks like there are some nulls in this data.<br>
This isn't necessarily a bad thing, it just means that you get a null in the percentage spend calculation.<br>
Like with GDP, let's look at the extremes of the data, starting with the min.<br>
180 seems very little for a country to spend.<br>
What about the max?<br>
The US only spent $130 million.
|
|
show
|
0:33 |
That's definitely not right.<br>
This is why you sanity check your data.<br>
I went back to the source, looked at their published schema, and it turns out the spend is in thousands of dollars.<br>
Let's add that info to get the magnitude right.<br>
Calc worked.<br>
I'll add it permanently.<br>
Now to double check.<br>
130 billion seems much more likely.<br>
That's better.
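The fix is a one-liner, something like this with the placeholder names.<br>

```python
# The source publishes spend in thousands of dollars, so scale it up.
rd = rd.with_columns(
    (pl.col("rd_spend_thousands") * 1_000).alias("rd_spend")
)
```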
|
|
show
|
2:23 |
With the data in hand, I'm ready to join, but first I'll get rid of the thousands spend column to avoid any confusion.<br>
Now the actual join.<br>
I've been brazen and just stored the result.<br>
Here's hoping I didn't screw it up.<br>
There's the result.<br>
A whole lot of nulls in the summary.<br>
Let's drill down a bit.<br>
I'm starting to get a little worried.<br>
Let's look at all of it.<br>
Scrolling back up a bit.<br>
Ah, that's the issue.<br>
Looks like the R&D data is only for a subset of our years.<br>
On one hand, that means the nulls weren't an error.<br>
On the other, if I actually need that data, I might have to go find another source.<br>
If I'd paid a little more attention to the describe call up top, I would have seen the min and max years were 1996 and 2016.<br>
Could have saved me the rise in blood pressure from all those null values.<br>
Let me scroll back down.<br>
Seeing as those nulls are correct, the R&D spend's in place, the last thing I need to do is add the spend as a percentage of GDP.<br>
Seeing as it's still configured to display all rows, let's just look at the US.<br>
Looks like that worked.<br>
It's hard to keep the whole picture in my head, so I'm going to describe the results.<br>
The min spend was 0.6%, while the max was 0.9%.<br>
That's in line with my expectations, so it feels like I have a report.<br>
One last thing to do.<br>
And there you go, my new full GDP report in CSV format.
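Pulled together, the last task looks something like this, following the same placeholder names as above.<br>

```python
# Join the R&D spend onto the GDP/population frame,
# then express it as a percentage of GDP.
report = gdp_pop.join(rd.drop("rd_spend_thousands"), on=["code", "year"], how="left")

report = report.with_columns(
    (pl.col("rd_spend") / pl.col("gdp") * 100).alias("rd_pct_of_gdp")
)

report.write_csv("processed/full_gdp_report.csv")   # placeholder output name
```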
|
|
show
|
1:29 |
Well, as you've seen, the real world is messy.<br>
Even with three data sets sourced from the same place, I had to do different things to each file.<br>
Once you've imported new data, the first thing you need to do is verify that you've got the expected data types.<br>
You may have to do some casting, or in bad cases, string parsing to get things in the shape you want.<br>
You should also check for outliers in your data.<br>
There may be things you don't expect.<br>
Depending on the work you're doing, you might need to remove the outliers altogether, Or, like with the R&D example, it might make you realize the data you have isn't in the form you expect.<br>
With Polars being so fast, it's relatively painless to switch back and forth between multiple data frames.<br>
As such, it's best to run some experiments on your calculations first, making sure that they do what you want.<br>
Then, modify your data once you're sure.<br>
There's an exception to this, which is when dealing with very large data.<br>
Even then, I might be inclined to manually chop the data file into smaller pieces, play with those to debug my code, then apply it to the whole set once I'm confident.<br>
Polars is fairly configurable, and by making method calls to the config object, you can control things like how many rows and columns are shown in the summary print outputs.<br>
Adjusting these values while experimenting will let you see more or less of your data frame.<br>
Well, there you have it.<br>
You've survived Polars and gone through a case study.<br>
The next lesson is the last.<br>
In it, I'll give a review of the course and point you to possible next steps in your data wrangling journey.
|
|
|
8:22 |
|
show
|
5:17 |
Welcome to the final lesson in Polars for Power Users, Transform Your Data Analysis Game.<br>
This lesson contains a short review of the course and some suggestions for where you can go next.<br>
DataFrames are an abstraction of rows and columns of data, kind of like a spreadsheet in memory, and Polars is one of several Python libraries that use them.<br>
Polars is fast and flexible and can meet most of your data analysis needs.<br>
Polars has three core concepts, the dataframe itself, expressions, and contexts.<br>
An expression is a series of operations you perform within a context, while a context is a way of getting at a subset of the data in a data frame.<br>
When you run expressions on a data frame context, you get back a new copy of the data frame with the operations performed.<br>
Expressions themselves are Python objects and are independent of the data frames they can operate on.<br>
This means you can reuse them, and when coding you can take advantage of type checking and other IDE features as you're just coding with objects.<br>
Polars supports several different functions for reading data in from external sources.<br>
The read family of functions fetches the data from a source and constructs a new data frame from the information.<br>
Remember though, not all data source formats are equivalent, and different formats support different kinds of data types.<br>
Because of this, you may have to perform operations to cast or transform the initial result to get it into the form you desire.<br>
One of the key contexts you can perform on a data frame is the filter method, which allows you to fetch a subset of its rows.<br>
Filters can be specified as masks, which are a series of booleans, true for rows where a condition applies and false otherwise.<br>
You can combine masks with binary operators and pass them to the filter method to create complex conditions.<br>
Sometimes you want to perform calculations across a data frame.<br>
This process is known as aggregation.<br>
DataFrames have several aggregation methods built in, including sum, mean, min, max, and more.<br>
If you want to aggregate on subsets of data, you use groupby.<br>
This groups rows with like data together, then lets you run one or more aggregation operations on those subgroups.<br>
You can also transform your data with a pivot table, turning groups of rows into a single row with a series of pivoted columns.<br>
If you've got data in multiple dataframes, you can combine them together using the CONCAT call.<br>
Note that the width and data types of the combined frames need to be the same.<br>
To augment the data in one dataframe with data looked up in another, you can use the JOIN call.<br>
This works like in SQL and supports left and right joins as well as others.<br>
With the addition of the third-party library XLSX Writer, you can write your dataframes out to an Excel file.<br>
This library supports multiple sheet workbooks and formatting, so you can create quite complex reports.<br>
My favorite Polars feature has to be the lazy frame.<br>
This allows lazy evaluation of expressions on a data frame-like class.<br>
You can call multiple operations on the lazy frame, but they don't execute until you call the collect method.<br>
This allows for optimization of your queries and can have a significant performance impact.<br>
Now that you've got a taste of Polars, where else should you go to up your data analysis game?<br>
Let me point you at a few possibilities.<br>
First off, the Polars docs are really good.<br>
The home site gives a nice overview.<br>
Polars is open source, but there is a company behind it, so the site has information about both the library itself as well as how it all fits together.<br>
The documentation for Polars is divided into two parts.<br>
First is the user guide, which provides an overview and covers the main topics.<br>
Then the second source is the API, which has every last method call and object in the library.<br>
If you're looking to learn something not covered in the course, I'd start with the user guide.<br>
If you want to see all the methods and attributes available on all Polars objects, then go to the API instead.<br>
Polars is a relatively new kid on the DataFrame block.<br>
The most popular DataFrame library out there, and one of the older ones in Python, is another kind of bear, pandas.<br>
As it's been around longer, the library does have more in it than Polars, but I very seldom find something that I need that can't be done in both.<br>
The concepts in Pandas are similar to those in Polars, but the syntax relies more heavily on operator overloading and generally feels less Pythonic to me.<br>
This may depend on where you come from.<br>
As my background is the software engineering side, I tend to get stuck looking at Pandas code and thinking, how did they even do that?<br>
If you don't care about that, it can mean less typing.<br>
The good news is, you don't have to stick with one or the other.<br>
They're both Python libraries, and Polars has methods for converting to and from Pandas data frames, so you can go back and forth if you need to in your code.<br>
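The round trip is just a couple of calls; to_pandas typically needs pyarrow installed alongside pandas.<br>

```python
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"code": ["CAN", "MEX"], "gdp": [1.0, 2.0]})

df = pl.from_pandas(pdf)   # pandas -> Polars
back = df.to_pandas()      # Polars -> pandas
```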
If you'd like to add pandas to your data analysis toolkit, one place to start would be in the Talk Python course titled Excel to Python.<br>
This course has a similar structure to the one you just took, showing you things in Excel and their equivalent in pandas.<br>
You might even recognize some of the data files in the course.<br>
That would be because I borrowed them to use in this one.<br>
Chris Moffitt is the instructor for that course, and he does a great job walking you through everything and focusing on practical examples.<br>
It's definitely worth checking out.
|
|
show
|
1:58 |
One level down from Polars and Pandas is NumPy.<br>
It's the de facto high-performance math package for Python.<br>
It lets you do all sorts of scientific computing operations, and it's written in a low-level language, so it's much faster than equivalent operations in pure Python.<br>
Amongst other things, NumPy supports multidimensional arrays, linear algebra, and Fourier transforms.<br>
In fact, the multidimensional array support is what inspired libraries like Pandas to create higher-level abstractions for data analysis use cases.<br>
If you're doing a lot of number crunching, it's definitely worth knowing what NumPy can do.<br>
In the previous lesson, you created some reports which were very spreadsheet-like.<br>
Reading rows and columns of data can be fun and all, but pretty pictures can help you better understand your data.<br>
Matplotlib is a visualization library that can help you do just that.<br>
With it, you can build static, animated, and interactive visualizations.<br>
And it includes a reusable styling mechanism so you can create a consistent look and feel across your output.<br>
There are tons of graphs in this library, and it integrates well with Polars, Pandas, NumPy, and more.<br>
I'll admit I have a bit of a love-hate relationship with Matplotlib.<br>
It has made a few odd design choices that as a Python programmer get my hackles up, but it is really, really powerful, and it's been fully embraced by the data science community.<br>
This is definitely a tool you want in your tool chest.<br>
There are more data frame libraries out there than just Polars and Pandas, and as each has their benefits, there is value in being able to go back and forth.<br>
Although Polars provides a few translation methods, the Narwhals library provides a seamless layer making it simpler to use the best data frame tool for the job you have in mind.<br>
Narwhals supports all your favorite data frame libraries, and it works by having its own Narwhal class.<br>
This is a wrapper within which you can call into the supported data frame libraries, like Polars, and when you've performed your operations, transition back out of the wrapper to deal with output or transform it to another library as needed.
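From memory of the Narwhals docs, the wrap-and-unwrap flow looks roughly like this; double check their site for the current API before relying on it.<br>

```python
import narwhals as nw
import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})

ndf = nw.from_native(df)                                    # wrap whatever backend you were handed
ndf = ndf.with_columns((nw.col("x") * 2).alias("doubled"))  # expressions mirror the Polars style
result = ndf.to_native()                                    # back to a plain Polars DataFrame
```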
|
|
show
|
1:07 |
For simplicity, I stuck with the REPL throughout this course.<br>
REPLs are great for experimenting, but they're a little too impermanent.<br>
Jupyter notebooks are like REPLs on steroids.<br>
Just like with the REPL, you have an interactive session.<br>
Unlike with the REPL, you can go back and edit a cell and have the change cascade through the results.<br>
Notebooks allow the inclusion of markdown and integrate well with Matplotlib, so you can make some fancy-looking reports with very little extra effort.<br>
One of the downsides of Jupyter is it stores the notebook as JSON.<br>
This is fine if you're only using the notebook tool, but it can be problematic if multiple people are editing a notebook and you're trying to use a code repo to manage those joint changes.<br>
One answer to that is Marimo.<br>
I haven't actually played with this library yet, but it's an up-and-comer on the scene.<br>
It provides the same kind of functionality as Jupyter, but the underlying storage mechanism is Python code, meaning your code development and sharing tools will work well with it.<br>
Congratulations, you've completed the Talk Python Polars course.<br>
I hope it was useful for you.<br>
Thanks for your attention.
|