Data Science Jumpstart with 10 Projects Transcripts
Chapter: Project 7: Working with Movie Review Text Data in Pandas
Lecture: Using scikit-learn to calculate Tfidf for Pandas text

0:00 Next thing we're going to do is look at TF-IDF. That means term frequency inverse document frequency.
0:06 This is a way to show how important words are. We look at the relationship between how often a word appears in a document versus
0:16 how many documents have that word. If you think about this, if a word only occurs in a small subset of those
0:23 documents, but it occurs a lot in those, that's probably an important word,
0:26 especially if we've removed those stop words. If you have words that are important, those tend to describe that document.
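
To make the idea above concrete, here is a minimal sketch of the textbook weighting (term frequency times the log of inverse document frequency) in plain Python. The three-document corpus and the names `docs`, `doc_freq`, and `tfidf` are made up for illustration and are not from the course. Note that scikit-learn's vectorizer uses a smoothed variant of this formula and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the course's movie review data).
docs = [
    "great movie great acting",
    "terrible movie",
    "great plot",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each word at least once.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Return {word: tf * idf} for one document, using the unsmoothed formula."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


scores = [tfidf(tokens) for tokens in tokenized]

# In the second document, "terrible" appears in only one document overall,
# so it scores higher than "movie", which appears in two.
print(scores[1])
```

A word that appears in every document would get a weight of zero here, which matches the intuition that it does not help tell documents apart.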
0:34 We're going to use scikit-learn to do that.
0:36 So make sure you've installed scikit-learn. scikit-learn is a machine learning library, and it has a class called
0:40 TfidfVectorizer, the term frequency-inverse document frequency vectorizer, and this works with pandas.
0:47 So what we're going to do is apply our
0:50 removal of stop words, and then we're going to call fit_transform on the text with the stop words removed. This will give us this object
0:56 I'm calling sparse. This is a sparse matrix. You can see that this is a SciPy sparse matrix rather than a dense NumPy array. It's got 600 rows and 13,000 columns.
1:07 Why does it have 13,000 columns? Because there are a lot of words, and each column holds the TF-IDF weight of one word,
1:13 which is zero when that word does not occur in a document. So let's look at what the features are.
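
The step described above might look roughly like the sketch below. The three short reviews are a hypothetical stand-in for the course's review text after stop-word removal, and the variable names are assumptions apart from `sparse`, which follows the lecture. `TfidfVectorizer.fit_transform` returns a SciPy sparse matrix with one row per document and one column per distinct word.

```python
import pandas as pd
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the review text with stop words already removed.
reviews = pd.Series([
    "great movie great acting",
    "terrible movie boring plot",
    "great plot",
])

vectorizer = TfidfVectorizer()

# Learn the vocabulary and compute the TF-IDF weights in one call.
sparse = vectorizer.fit_transform(reviews)

print(issparse(sparse))  # the result is a SciPy sparse matrix
print(sparse.shape)      # (number of documents, number of distinct words)
```

With the real data the shape would be on the order of the 600 rows and 13,000 columns mentioned in the lecture. The stored values are fractional weights rather than 0/1 flags.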
1:19 We can ask the vectorizer to get the features. I'm actually going to stick those into a data frame and
1:25 then I'm going to concatenate that to my original data frame. Here's my original data frame and you can see
1:31 that we have all of these features tacked on to the end of it.
1:36 Finally, let's look at the value counts of our sentiment column, and we've got 301 each of positive and negative reviews.
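
Checking the class balance can be sketched as follows. The four-row data frame and the `sentiment` column name are hypothetical. On the course data, the same call is what would report 301 for each label.

```python
import pandas as pd

# Hypothetical labels; the course data has 301 of each.
df = pd.DataFrame({
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Count how many rows carry each sentiment label.
counts = df["sentiment"].value_counts()
print(counts)
```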

