Move from Excel to Python with Pandas Transcripts
Chapter: Data I/O (input and output)
Lecture: Reading Excel and converting data types
0:00 so since our data has come in and we have the columns we want,
0:03 the other thing we need to take a look at are the data types.
0:07 And what is interesting about this is we can see there a few things that don't
0:12 quite look right. So the ZIP code is coming in as an integer,
0:16 and we have this month, day in year that is coming in as an imager
0:19 But we'd really like there's a day type and then we also have this percent
0:23 to target, which is coming in as an object.
0:26 We may want to do some mathematical functions on that,
0:28 so we need to find a way to convert that.
0:30 So let's walk through how we're going to clean up this data using their read Excel
0:34 function. The first thing we're gonna clean up are the dates.
0:37 So got this handy function to help with formatting,
0:41 so it's a little easier for you to follow,
0:43 so we're going to read in the data.
0:44 But this time we're going to use a have used the parse dates argument,
0:51 tell it to take a look at calls 34 and five,
0:55 and turned those into a date,
0:57 Time column and name it update date.
0:59 So if we look at our DF levels now can see it doesn't have the month
1:04 day in year, like we did in the past,
1:07 just has one calm that says update date as we defined here in our dictionary.
1:12 And then we can also check and see that we have a date time,
1:17 which is going to be really helpful for us for future analysis.
1:21 Now there's another problem. We can see what this data said.
1:24 Let's take a look at this.
1:27 Happens all the time with data sets,
1:29 ZIP codes in the U. S.
1:30 Air always five digits. And so this example here has 442 But it's missing the
1:36 leading zero. And that's because zip code got red in as an image or 64
1:42 which is not what we want.
1:43 We want to make sure we preserve that leading zero.
1:47 So we need to tell it that zip,
1:48 coast and object and not a number well,
1:53 it's try reading it in, and the way we do that is to use the
1:58 D type parameter. So now take a look at this.
2:04 It has the leading zero like we would expect.
2:07 So what the D type argument does it.
2:09 It can take a dictionary of all the call names and what type you want it
2:13 to be. So we're explicitly overriding the process that pandas goes through to try and
2:19 determine the type and telling it that zip code needs to be an object.
2:23 And we can check that and see the ZIP code is an object,
2:26 So everything's looking pretty good there.
2:29 So then the final calm that we want to take a look at is this percent
2:32 to target. So it's coming through as an object,
2:35 but it looks like it should be a number,
2:37 and the reasons coming in that way is the data has a percent sign.
2:42 They're so Pandas doesn't know that it's actually a number.
2:46 So let's figure out how to actually convert that to a numeric format.
2:52 So the way we need to do this is we need to write a simple function
2:56 to clean the data. We're going to call that convert percent.
3:00 The argument to this is a Pandas data frame.
3:03 Siri's first thing we need to do is replace that percentage sign with a blank space
3:10 and then we can convert that to afloat and we want to divided by 100.
3:16 So that comes through is a decimal.
3:19 So if we want to see if that worked for us,
3:24 you can use the apply function,
3:26 which will run a function on that data frame Siri's.
3:31 And in this case, it works.
3:33 So it said that 100% now comes through is a 1 90 s 900.9.
3:39 It's afloat 64 so we can do the math that we may want to do on
3:44 it later on in our analysis.
3:47 So the question now is, How do we get that to apply?
3:50 Every time we read in the file,
3:52 what we can do that easily.
3:54 Let's continue to copy the code,
3:56 and there is a another argument called converters,
4:01 which will take a dictionary as well,
4:03 and we pass in the column name and then we can pass in a function.
4:09 Whenever pandas reads in F I'll,
4:11 it will open up a specific sheet name,
4:14 go to a specific header, choose only the columns we want and parse the date
4:18 from the individual components into a daytime object.
4:22 Also, keep the ZIP code as an object and then apply this convert percent formula
4:27 to the percent of target. And now,
4:29 if we take a look at it,
4:31 we can see that we have the numeric format and that our data types are numeric
4:37 4% the target object for zip code level and company.
4:40 And we have a daytime for update date,
4:42 and this is a really useful approach that you'll need to take A.
4:46 You bring in data to always make sure that you get it in the right format
4:49 so that you can do your future analysis on the data.