Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 4: Understanding Grouping and Aggregation Retail Data
Lecture: Exploratory Data Analysis (EDA) in Pandas with describe, histograms, and value_counts

0:00 I want to show how to do some exploratory data analysis on this sales data, so let's jump into that. Again, first what I like to do is do
0:09 a describe here that gives me summary statistics. If we look at this, we can see that we've got quantity of things that we sold. We have a date,
0:17 we have a unit price, and we have a customer ID. Again, what I like to look at generally is counts.
0:24 Our counts look like they're pretty similar except for customer ID. It looks like some of our customer IDs are missing.
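As a rough sketch of the check described above: the snippet below builds a tiny, made-up stand-in for the sales data (the column names Quantity, UnitPrice, and CustomerID and all values are assumptions, not the course's actual data), then compares the count row from describe with a per-column tally of missing values.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; names and values are invented.
sales = pd.DataFrame({
    "Quantity": [6, -6, 12, 3],
    "UnitPrice": [2.55, 2.55, 0.85, 4.25],
    "CustomerID": [17850.0, 17850.0, None, 13047.0],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = sales.describe()

# The count row only tallies non-missing values, so a lower count hints at gaps.
print(summary.loc["count"])

# Counting missing values per column confirms which column has the gaps.
missing = sales.isna().sum()
print(missing)
```

A column whose count is lower than the others is the one to inspect for missing values.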
0:31 Note that unit price goes negative as well. This might be a little weird, but apparently these are refunds.
0:38 Also, it looks like quantity is going negative as well. We've got negative 80,995 and we have positive 80,995.
0:46 That's a little weird just looking at that, that we have this weird large number and weird small number and they're the same,
0:52 they're just opposite signs. It's probably the case where someone bought a bunch of things and then returned them.
0:59 That might also indicate that there's maybe test data in this dataset that they tried something out and then undid that.
1:07 That might be something that I might want to check out further. I'm not going to do that here, but just by looking at that,
1:13 that is an example of ideas that I get by looking at these summaries and things to explore. Let's visualize the unit price.
1:22 Again, I'm just going to throw a histogram on that. If you look at that, this is not particularly interesting per se.
1:28 It looks like our unit price is around zero, but it goes out to 40,000 here and negative 10,000. What's going on there?
1:37 We do know that we have values going from those ranges, because if we didn't, the histogram wouldn't show that.
1:43 Maybe I'll bump up the bins a little bit and say bins is equal to 30. It looks like the vast majority of our data is around that small value there.
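In a notebook, the plot itself would typically come from something like calling .hist(bins=30) on the unit price column. The sketch below is a stand-in that shows the same binning as numbers instead of a picture, using NumPy's histogram function on invented prices (the values are assumptions chosen to mimic a few extreme outliers around a cluster of small prices).

```python
import numpy as np
import pandas as pd

# Hypothetical unit prices: mostly small values plus two extreme outliers.
unit_price = pd.Series([1.25, 2.55, 0.85, 4.95, 3.75, 2.10, 40000.0, -10000.0])

# Split the full range of the data into 30 equal-width bins and count rows per bin.
counts, edges = np.histogram(unit_price, bins=30)

# Print only the bins that contain data, to show where the values pile up.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    if count:
        print(f"{left:>10.2f} to {right:>10.2f}: {count}")
```

Because the bins have to span the outliers, nearly all of the rows land in a single bin near zero, which is why the plot looks like one tall bar.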
1:56 Let's see if we can dive into this a little bit more. I'm going to say, let's look at sales where unit price is less than zero.
2:05 We see that we have two sales. The description is adjust bad debt. Let's look at sales where quantity is less than zero.
2:18 There's a lot of entries there. There's actually 10,000 rows where quantity is less than zero. Let's look at a customer.
2:26 Maybe we can say, I want to look at customer 17548. You can see 17548 is up here. They bought 12 pink paisley tissues.
2:38 Here are the purchases for that customer. There's a lot of negative quantities there, but there are also some positive quantities there.
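The filters described above can be sketched with boolean indexing. This is a minimal illustration on an invented table: the column names and every row are assumptions made up for the example, with only the customer number and the descriptions echoing what is mentioned in the video.

```python
import pandas as pd

# Hypothetical rows; names and values are invented for illustration.
sales = pd.DataFrame({
    "Description": ["PINK PAISLEY TISSUES", "Adjust bad debt",
                    "WHITE METAL LANTERN", "PINK PAISLEY TISSUES"],
    "Quantity": [12, 1, -4, -12],
    "UnitPrice": [0.29, -100.0, 3.39, 0.29],
    "CustomerID": [17548.0, None, 17548.0, 17548.0],
})

# Rows where the unit price is negative.
negative_price = sales[sales["UnitPrice"] < 0]

# Rows where the quantity is negative (likely returns).
negative_qty = sales[sales["Quantity"] < 0]

# Every row for a single customer, positive and negative quantities alike.
one_customer = sales[sales["CustomerID"] == 17548]

print(negative_price)
print(len(negative_qty))
print(one_customer)
```

Each comparison produces a True/False series, and indexing the DataFrame with it keeps only the rows marked True.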
2:51 We've summarized some of our numeric columns. Let's look at our string columns here. It looks like invoice number is a string,
3:03 stock code is a string, description, and country are strings. Country makes sense that that's a string. Again, our friend value counts,
3:11 we can come in here and quantify that with value counts. Stock code, maybe not quite so clear that it's a string.
3:17 If you look at that, it looks numeric there. Let's try and understand what's going on there. Here is stock code.
3:23 Now, by doing this, again, value counts is our friend. We can see that we have things like letters tacked onto the end of that.
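Here is a small sketch of that idea, assuming columns named StockCode and Country and using invented values. It counts how often each value appears, then uses a regular expression to pick out stock codes that contain a letter, which is one way to see why the column cannot be stored as a number.

```python
import pandas as pd

# Hypothetical string columns; names and values are invented for illustration.
sales = pd.DataFrame({
    "StockCode": ["85123A", "71053", "84406B", "85123A", "22752"],
    "Country": ["United Kingdom", "United Kingdom", "France",
                "United Kingdom", "Germany"],
})

# value_counts() tallies each distinct value, most frequent first.
country_counts = sales["Country"].value_counts()
stock_counts = sales["StockCode"].value_counts()
print(country_counts)
print(stock_counts)

# Flag codes that contain at least one letter anywhere in the string.
has_letter = sales["StockCode"].str.contains(r"[A-Za-z]")
print(sales.loc[has_letter, "StockCode"].unique())
```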
3:29 It makes sense that that is a string. I just showed you how to do some basic exploratory data analysis. Again, start with those summaries.
3:37 Then sometimes we get interesting insights and we start digging into those from that. I showed a few examples of that
3:44 by looking at those negative values and digging into customer IDs, that sort of thing.

