Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 7: Working with Movie Review Text Data in Pandas
Lecture: Using scikit-learn to calculate Tfidf for Pandas text
0:00
Next thing we're going to do is look at TF-IDF. That means term frequency inverse document frequency.
0:06
This is a way to show how important words are. We look at the relationship between how often a word appears in a document versus, across the whole set of documents,
0:16
how many documents have that word. If you think about this, if a word only occurs in a small subset of those
0:23
documents, but it occurs a lot in those, that's probably an important word,
0:26
especially if we've removed those stop words. If you have words that are important, those tend to describe that document.
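To make that idea concrete, here is a minimal sketch of the textbook calculation (term frequency times the log of inverse document frequency) in plain Python. The toy corpus and the `tfidf` helper are hypothetical and not from the course data, and scikit-learn applies a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the course's movie reviews).
docs = [
    "great acting great plot",
    "boring plot",
    "great fun",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(word, tokens):
    """Textbook TF-IDF: share of the document times log(N / document frequency)."""
    tf = tokens.count(word) / len(tokens)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf


# Score every word in the second document ("boring plot").
scores = {word: tfidf(word, tokenized[1]) for word in set(tokenized[1])}
print(scores)
```

In this toy corpus "boring" appears in only one document while "plot" appears in two, so "boring" gets the higher score for the second document.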
0:34
We're going to use Scikit-Learn to do that.
0:36
So make sure you've installed Scikit-Learn. Scikit-Learn is a machine learning library and it has this thing called
0:40
TfidfVectorizer, the term frequency inverse document frequency vectorizer, and this works with pandas.
0:47
So what we're going to do is we're going to apply our
0:50
removal of stop words, and then we're going to call fit_transform on the text with the stop words removed. This will give us this object
0:56
I'm calling sparse. This is a sparse matrix. Okay, you can see that this is a SciPy sparse matrix rather than a dense NumPy array. It's got about 600 rows and about 13,000 columns.
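A minimal sketch of that step, assuming the cleaned reviews live in a pandas Series. The `reviews` data here is a hypothetical stand-in for the course's stop-word-removed text, so the shape is much smaller than in the video.

```python
import pandas as pd
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the reviews after stop-word removal.
reviews = pd.Series([
    "great acting great plot",
    "boring plot",
    "great fun",
])

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns one row per document.
sparse = vectorizer.fit_transform(reviews)

print(type(sparse))      # a SciPy sparse matrix type
print(issparse(sparse))  # True
print(sparse.shape)      # (number of documents, number of distinct words)
```

Storing the result sparsely matters because most words don't appear in most reviews, so almost every entry is zero.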
1:07
Why does it have 13,000 columns? Because there are a lot of words, and each column is basically a score
1:13
for one word, which is zero when that word doesn't occur in a document. So let's look at what the features are.
1:19
We can ask the vectorizer to get the features. I'm actually going to stick those into a data frame and
1:25
then I'm going to concatenate that to my original data frame. Here's my original data frame and you can see
1:31
that we have all of these features tacked on to the end of it.
1:36
Finally, let's look at the value counts of our sentiment column, and we've got 301 each of positive and negative reviews.
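A small sketch of that check, assuming the labels sit in a column named `sentiment`. The tiny frame below is hypothetical and only illustrates the call, not the course's 301/301 split.

```python
import pandas as pd

# Hypothetical labels; the real data has 301 reviews of each class.
df = pd.DataFrame({"sentiment": ["pos", "neg", "pos", "neg"]})

# Count how many rows carry each label.
counts = df["sentiment"].value_counts()
print(counts)
```

Equal counts per label mean the classes are balanced.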