Move from Excel to Python with Pandas Transcripts
Chapter: Data I/O (input and output)
Lecture: Reading Excel and converting data types
0:00 so since our data has come in and we have the columns we want, the other thing we need to take a look at are the data types.
0:08 And what is interesting about this is we can see there a few things that don't quite look right. So the ZIP code is coming in as an integer,
0:17 and we have this month, day in year that is coming in as an imager But we'd really like there's a day type and then we also have this percent
0:24 to target, which is coming in as an object. We may want to do some mathematical functions on that, so we need to find a way to convert that.
0:31 So let's walk through how we're going to clean up this data using their read Excel function. The first thing we're gonna clean up are the dates.
0:38 So got this handy function to help with formatting, so it's a little easier for you to follow, so we're going to read in the data.
0:45 But this time we're going to use a have used the parse dates argument, tell it to take a look at calls 34 and five, and turned those into a date,
0:58 Time column and name it update date. So if we look at our DF levels now can see it doesn't have the month day in year, like we did in the past,
1:08 just has one calm that says update date as we defined here in our dictionary. And then we can also check and see that we have a date time,
1:18 which is going to be really helpful for us for future analysis. Now there's another problem. We can see what this data said. Let's take a look at this.
1:28 Happens all the time with data sets, ZIP codes in the U. S. Air always five digits. And so this example here has 442 But it's missing the
1:37 leading zero. And that's because zip code got red in as an image or 64 which is not what we want. We want to make sure we preserve that leading zero.
1:48 So we need to tell it that zip, coast and object and not a number well, it's try reading it in, and the way we do that is to use the
1:59 D type parameter. So now take a look at this. It has the leading zero like we would expect. So what the D type argument does it.
2:10 It can take a dictionary of all the call names and what type you want it
2:14 to be. So we're explicitly overriding the process that pandas goes through to try and
2:20 determine the type and telling it that zip code needs to be an object. And we can check that and see the ZIP code is an object,
2:27 So everything's looking pretty good there. So then the final calm that we want to take a look at is this percent
2:33 to target. So it's coming through as an object, but it looks like it should be a number,
2:38 and the reasons coming in that way is the data has a percent sign. They're so Pandas doesn't know that it's actually a number.
2:47 So let's figure out how to actually convert that to a numeric format. So the way we need to do this is we need to write a simple function
2:57 to clean the data. We're going to call that convert percent. The argument to this is a Pandas data frame.
3:04 Siri's first thing we need to do is replace that percentage sign with a blank space
3:11 and then we can convert that to afloat and we want to divided by 100. So that comes through is a decimal. So if we want to see if that worked for us,
3:25 you can use the apply function, which will run a function on that data frame Siri's. And in this case, it works.
3:34 So it said that 100% now comes through is a 1 90 s 900.9. It's afloat 64 so we can do the math that we may want to do on it later on in our analysis.
3:48 So the question now is, How do we get that to apply? Every time we read in the file, what we can do that easily. Let's continue to copy the code,
3:57 and there is a another argument called converters, which will take a dictionary as well,
4:04 and we pass in the column name and then we can pass in a function. Whenever pandas reads in F I'll, it will open up a specific sheet name,
4:15 go to a specific header, choose only the columns we want and parse the date from the individual components into a daytime object.
4:23 Also, keep the ZIP code as an object and then apply this convert percent formula to the percent of target. And now, if we take a look at it,
4:32 we can see that we have the numeric format and that our data types are numeric 4% the target object for zip code level and company.
4:41 And we have a daytime for update date, and this is a really useful approach that you'll need to take A.
4:47 You bring in data to always make sure that you get it in the right format so that you can do your future analysis on the data.