Move from Excel to Python with Pandas Transcripts
Chapter: Intro to Pandas
Lecture: Demo: Understanding initial data

0:00 Okay, now let's go ahead and read the Excel file into our Jupyter notebook.
0:04 I'm going to go through the process of launching the Notebook one more time.
0:07 So let's "conda activate" our work environment.
0:11 The next thing we need to do is go into the directory where the files are
0:14 So I placed them in a sales analysis directory,
0:17 and now I'm going to run Jupyter notebook.
0:20 And here's my notebook. The two files that are already there were created by
0:25 Cookiecutter, but I'm gonna go ahead and create a new one so we can walk
0:28 through that process. Click on New Python 3 notebook and remember,
0:33 one of the first things we need to do is make sure to change the title.
0:36 It comes in as an untitled notebook,
0:39 so you can see that here as well as in the URL.
0:41 So let's call this sales analysis exploration.
0:47 That's a really important thing to do so that you're in a good habit of organizing
0:51 your data. I am going to create a markdown cell and press Shift+Enter so that it
1:03 gets rendered. This is a good habit to get into so that you understand why
1:07 you did this notebook, what the data sources were, and how you wanted to use
1:13 this to answer a business problem.
1:15 So now let's get into actually writing some Python code.
1:20 We put our imports at the top,
1:22 and I'm just going to use pathlib to access the files and then pandas in
1:27 a second to read in that file.
1:31 So what I've done here is referenced the sample sales file in relation to the current
1:38 working directory. And it is in a subdirectory called raw.
1:42 So I define that input file,
1:44 and then I'm going to read that file in using the "pd.read_excel()" function in pandas,
1:48 and nothing happens. But you can see that the cell number incremented here.
1:54 So there was something that happened behind the scenes.
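The steps above can be sketched as follows. This is a minimal sketch of the pattern, not the exact notebook code: the file name "sample-sales.xlsx" is an assumption (the transcript only calls it the sample sales file), while the "raw" subdirectory comes from the lecture.

```python
from pathlib import Path

import pandas as pd

# Build the path relative to the current working directory.
# The file name "sample-sales.xlsx" is an assumption; the "raw"
# subdirectory is the one mentioned in the lecture.
input_file = Path.cwd() / "raw" / "sample-sales.xlsx"

# Reading the file produces no visible output in the notebook;
# the result is stored in the df variable (short for DataFrame).
if input_file.exists():
    df = pd.read_excel(input_file)
```

Typing just `df` in a cell afterwards renders the DataFrame, which is why the read itself appears to do nothing.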
1:57 If we want to see what a variable looks like,
1:59 we just type df (for data frame). And now we see the data frame representation that looks very
2:06 similar to the Excel file. So let me go through a couple things that you
2:10 will typically do the first time you read a file into pandas.
2:15 You can use the head() command to look at the top five rows.
2:18 You can use df.tail() to see the bottom five.
2:22 This is really helpful. Almost every time you read in the data,
2:25 you're gonna look at what comes in at the top and what comes in at the bottom.
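As a quick sketch of head and tail, using a small made-up DataFrame standing in for the sales data:

```python
import pandas as pd

# A small illustrative DataFrame standing in for the sales data.
df = pd.DataFrame({"account": range(10), "quantity": range(10, 20)})

top = df.head()     # first 5 rows by default
bottom = df.tail()  # last 5 rows by default

# Both accept an explicit row count, e.g. df.head(3).
```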
2:27 Remember, we talked about columns.
2:32 So if you want to look at what the columns are,
2:35 type df.columns, and you can see that it has a list of all the
2:39 columns. pandas calls it an Index,
2:40 and that's gonna be important later for us to access our data.
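A small sketch of what df.columns returns, again on an illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"account": [1], "name": ["A"], "quantity": [5]})

# .columns returns a pandas Index object, not a plain list,
# but it can be converted with list() when needed.
cols = df.columns
```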
2:45 The other thing that I like to do is the shape command - "df.shape".
2:50 And so this tells us how many rows and columns we have.
2:53 So we have 1000 rows and 7 columns in the data.
2:56 So this is a really compact way to understand your data and really important thing to
3:01 do as you go through and manipulate data to make sure that you are keeping all
3:06 the data together, not dropping things inadvertently.
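The shape check described above can be sketched like this, with a small stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": range(4), "b": range(4), "c": range(4)})

# .shape is a (rows, columns) tuple -- handy to re-check after each
# manipulation step to confirm no rows were dropped inadvertently.
rows, cols = df.shape
```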
3:10 The other useful command is df.info(),
3:13 which shows you all the columns,
3:15 how many non-null values are in each column, and what data type they are.
3:20 This is really important as we start to manipulate the data because some of the analysis
3:25 can't be done if the data is not in the correct data type.
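A sketch of that summary on made-up data; df.info() prints its report, while .dtypes and .count() expose the same details programmatically:

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [1, 2, None],   # becomes float64 because of the missing value
    "name": ["a", "b", "c"],    # stored as object (string) dtype
})

# df.info() prints the column names, non-null counts, and dtypes.
df.info()

dtypes = df.dtypes
non_null = df.count()
```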
3:30 The final command I'm gonna show is df.describe(), which gives a quick summary of all
3:41 the numeric columns. So this is a really handy way to get a feel for
3:46 the overall structure of your data.
3:48 It tells you how many instances of the data you have.
3:51 It does some basic math: the mean, standard deviation, the min, max, and the various
3:57 percentiles. And this is all a very standard process that I go through almost every
4:02 time I load in data, to start getting in my mind what the shape of
4:07 the data is, what the structure is, before I do further analysis.
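As a final sketch, here is describe() on a tiny numeric example; by default it summarizes only the numeric columns, skipping text columns like the name:

```python
import pandas as pd

df = pd.DataFrame({"quantity": [10, 20, 30, 40], "name": list("abcd")})

# describe() reports count, mean, std, min, the quartiles, and max
# for each numeric column; non-numeric columns are excluded by default.
summary = df.describe()
```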