Move from Excel to Python with Pandas Transcripts
Chapter: Data I/O (input and output)
Lecture: Reading Excel and converting data types
Login or
purchase this course
to watch this video and the rest of the course contents.
0:00
so since our data has come in and we have the columns we want,
0:03
the other thing we need to take a look at are the data types.
0:07
And what is interesting about this is we can see there a few things that don't
0:12
quite look right. So the ZIP code is coming in as an integer,
0:16
and we have this month, day in year that is coming in as an imager
0:19
But we'd really like there's a day type and then we also have this percent
0:23
to target, which is coming in as an object.
0:26
We may want to do some mathematical functions on that,
0:28
so we need to find a way to convert that.
0:30
So let's walk through how we're going to clean up this data using their read Excel
0:34
function. The first thing we're gonna clean up are the dates.
0:37
So got this handy function to help with formatting,
0:41
so it's a little easier for you to follow,
0:43
so we're going to read in the data.
0:44
But this time we're going to use a have used the parse dates argument,
0:51
tell it to take a look at calls 34 and five,
0:55
and turned those into a date,
0:57
Time column and name it update date.
0:59
So if we look at our DF levels now can see it doesn't have the month
1:04
day in year, like we did in the past,
1:07
just has one calm that says update date as we defined here in our dictionary.
1:12
And then we can also check and see that we have a date time,
1:17
which is going to be really helpful for us for future analysis.
1:21
Now there's another problem. We can see what this data said.
1:24
Let's take a look at this.
1:27
Happens all the time with data sets,
1:29
ZIP codes in the U. S.
1:30
Air always five digits. And so this example here has 442 But it's missing the
1:36
leading zero. And that's because zip code got red in as an image or 64
1:42
which is not what we want.
1:43
We want to make sure we preserve that leading zero.
1:47
So we need to tell it that zip,
1:48
coast and object and not a number well,
1:53
it's try reading it in, and the way we do that is to use the
1:58
D type parameter. So now take a look at this.
2:04
It has the leading zero like we would expect.
2:07
So what the D type argument does it.
2:09
It can take a dictionary of all the call names and what type you want it
2:13
to be. So we're explicitly overriding the process that pandas goes through to try and
2:19
determine the type and telling it that zip code needs to be an object.
2:23
And we can check that and see the ZIP code is an object,
2:26
So everything's looking pretty good there.
2:29
So then the final calm that we want to take a look at is this percent
2:32
to target. So it's coming through as an object,
2:35
but it looks like it should be a number,
2:37
and the reasons coming in that way is the data has a percent sign.
2:42
They're so Pandas doesn't know that it's actually a number.
2:46
So let's figure out how to actually convert that to a numeric format.
2:52
So the way we need to do this is we need to write a simple function
2:56
to clean the data. We're going to call that convert percent.
3:00
The argument to this is a Pandas data frame.
3:03
Siri's first thing we need to do is replace that percentage sign with a blank space
3:10
and then we can convert that to afloat and we want to divided by 100.
3:16
So that comes through is a decimal.
3:19
So if we want to see if that worked for us,
3:24
you can use the apply function,
3:26
which will run a function on that data frame Siri's.
3:31
And in this case, it works.
3:33
So it said that 100% now comes through is a 1 90 s 900.9.
3:39
It's afloat 64 so we can do the math that we may want to do on
3:44
it later on in our analysis.
3:47
So the question now is, How do we get that to apply?
3:50
Every time we read in the file,
3:52
what we can do that easily.
3:54
Let's continue to copy the code,
3:56
and there is a another argument called converters,
4:01
which will take a dictionary as well,
4:03
and we pass in the column name and then we can pass in a function.
4:09
Whenever pandas reads in F I'll,
4:11
it will open up a specific sheet name,
4:14
go to a specific header, choose only the columns we want and parse the date
4:18
from the individual components into a daytime object.
4:22
Also, keep the ZIP code as an object and then apply this convert percent formula
4:27
to the percent of target. And now,
4:29
if we take a look at it,
4:31
we can see that we have the numeric format and that our data types are numeric
4:37
4% the target object for zip code level and company.
4:40
And we have a daytime for update date,
4:42
and this is a really useful approach that you'll need to take A.
4:46
You bring in data to always make sure that you get it in the right format
4:49
so that you can do your future analysis on the data.