Fundamentals of Dask Transcripts
Chapter: Dask Bag
Lecture: Manipulating data
0:00
So we've seen how to create Dask Bags and how to get JSON data, for example, into Dask Bags.
0:06
And as we've written here, bag objects have the standard functional API found in projects like the Python standard library, toolz, and PySpark.
0:14
So they include 'map' and 'filter' and 'groupby'. And we're going to see all of these things in action now.
0:19
So the first thing to note is that operations on bag objects create new bags. So without further ado, let's look at several common operations.
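To make that first point concrete, here is a minimal sketch, assuming a small in-memory bag of numbers built with `from_sequence` (the data and variable names are made up for illustration, not taken from the lecture notebook). Each operation returns a new bag and leaves the one it was called on untouched.

```python
import dask
import dask.bag as db

# Use the single-threaded scheduler; that is plenty for tiny in-memory data.
dask.config.set(scheduler="synchronous")

# Hypothetical toy data, split into two partitions.
numbers = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=2)

# Each call returns a new, lazy Bag; nothing is computed yet.
evens = numbers.filter(lambda n: n % 2 == 0)
doubled = evens.map(lambda n: n * 2)

# compute() runs the work and returns a plain Python list.
result = doubled.compute()
original = numbers.compute()

print(type(evens).__name__)
print(result)
print(original)
```

Nothing runs until `compute()` (or `take`) is called, which is why the chain of calls above is cheap to build.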
0:30
So 'filter' is an important one. The reason filter is so important is that when you have your data,
0:35
you may want to look at certain values of interest and you may want to filter
0:39
your records, for example. So we have the Bag "b" from before and what we're going to do is filter it according to the age in the record.
0:49
Right? So we use a 'lambda function' here to do that, which we pass to the filter method.
0:53
And then we use 'take' with the argument 5 in order to look at the first five records that remain after this filter. And there we go.
1:02
We have Harold, for example, who's 42, then we have Jack, 66, etc. So we have the first five records of people who are older than 25.
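The filter step just described might look roughly like the sketch below. The records here are hypothetical stand-ins: only Harold (42) and Jack (66) come from the lecture, the other ages are invented, and the `"name"` and `"age"` keys are assumed rather than taken from the course's JSON files.

```python
import dask
import dask.bag as db

# Use the single-threaded scheduler; that is plenty for tiny in-memory data.
dask.config.set(scheduler="synchronous")

# Hypothetical records; only Harold's and Jack's ages appear in the lecture.
records = [
    {"name": "Harold", "age": 42},
    {"name": "Jack", "age": 66},
    {"name": "Emmett", "age": 19},
    {"name": "Jonah", "age": 31},
    {"name": "Eugenia", "age": 24},
    {"name": "Sterling", "age": 57},
    {"name": "Rudolph", "age": 38},
    {"name": "Erlin", "age": 22},
]

# One partition, because take() only reads the first partition by default.
b = db.from_sequence(records, npartitions=1)

# Keep only the records whose age is greater than 25.
older = b.filter(lambda record: record["age"] > 25)

# take(5) computes and returns the first five matches as a tuple.
first_five = older.take(5)
for record in first_five:
    print(record)
```

Note that `take` returns a tuple and triggers computation on its own, so no separate `compute()` call is needed here.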
1:12
So that demonstrates how to filter the 'Dask Bag'. We can also 'map' functions across bags.
1:19
For example, you may want to get all the first names from our JSON data. And the way we do this is we 'map' the function which extracts the first
1:28
name. We map that across the entire bag. So that's what we do here and we take the first 10. So we're gonna get the first 10 names here.
1:36
And we see we have Harold, Jack, Emmett, Jonah, Eugenia, Sterling, Rudolph, Erlin, Lawrence and Valentine. Wow, that was a mouthful.
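As a rough sketch of that map step: the ten first names below are the ones read out in the lecture, but the record layout (a `"first_name"` key) is an assumption made for illustration, since the real records come from the course's JSON files.

```python
import dask
import dask.bag as db

# Use the single-threaded scheduler; that is plenty for tiny in-memory data.
dask.config.set(scheduler="synchronous")

# First names as read out in the lecture; the record layout is hypothetical.
first_names = [
    "Harold", "Jack", "Emmett", "Jonah", "Eugenia",
    "Sterling", "Rudolph", "Erlin", "Lawrence", "Valentine",
]
records = [{"first_name": name} for name in first_names]

# One partition, because take() only reads the first partition by default.
b = db.from_sequence(records, npartitions=1)

# Map a function that extracts the first name across the entire bag.
x = b.map(lambda record: record["first_name"])

# Look at the first 10 results.
first_ten = x.take(10)
print(first_ten)
```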
1:42
But we got those first 10 there. Okay. Another common operation for data professionals,
1:48
data analysts, data scientists, citizen data scientists, is a 'groupby', which you may recall using in 'pandas' all the time.
1:55
So essentially a 'groupby' allows you to group data by some property or function. So what we're going to do here: if you recall, 'x' is a bag of
2:06
all the first names of people in the records. And we're going to group by the length and then compute. Now, what this will do is
2:12
it will return a list of pairs: the length of the names, and then the names that have that length, essentially. So let's see that. Great.
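The grouping just described could be sketched as follows, assuming 'x' is rebuilt here from the ten first names heard earlier in the lecture (the real 'x' covers the whole dataset, so its groups will differ). The order of the groups in the result is not something to rely on, so the sketch turns the result into a dict before looking anything up.

```python
import dask
import dask.bag as db

# Use the single-threaded scheduler; that is plenty for tiny in-memory data.
dask.config.set(scheduler="synchronous")

# Stand-in for the lecture's bag of first names.
x = db.from_sequence(
    ["Harold", "Jack", "Emmett", "Jonah", "Eugenia",
     "Sterling", "Rudolph", "Erlin", "Lawrence", "Valentine"],
    npartitions=2,
)

# Group the names by their length; compute() returns a list of
# (length, [names with that length]) pairs.
grouped = x.groupby(len).compute()

# Group order is not guaranteed, so index by key instead.
by_length = dict(grouped)
for length in sorted(by_length):
    print(length, sorted(by_length[length]))
```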
2:22
So we have a list where we have six and then all the names or the first names that have six characters in them.
2:30
Then 4, 8, 7 and 9. So one thing to note about the Dask 'groupby' operation is that it can be slow. So I just want to say a bit about an
2:40
alternative which is called 'foldby'. Okay. And I encourage you to check out the Dask documentation on 'foldby' and
2:49
I'll tell you briefly what some of the documentation says. So the 'groupby' method is straightforward according to the documentation,
2:55
but forces a full shuffle of the data, which is expensive. Now 'foldby' is slightly harder to use but faster.
3:01
So go and check out the docs and see for your particular use case whether you'd like to use 'groupby' or 'foldby'.
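To give a feel for the trade-off, here is a minimal sketch of 'foldby' that counts how many names have each length, next to the same count done with 'groupby'. It assumes the same ten stand-in first names as before; the counting task itself is picked for illustration and is not from the lecture. With 'foldby' you supply a 'binop' that folds one element into a running value within a partition, and a 'combine' that merges the per-partition results.

```python
import dask
import dask.bag as db

# Use the single-threaded scheduler; that is plenty for tiny in-memory data.
dask.config.set(scheduler="synchronous")

# Stand-in for the lecture's bag of first names.
x = db.from_sequence(
    ["Harold", "Jack", "Emmett", "Jonah", "Eugenia",
     "Sterling", "Rudolph", "Erlin", "Lawrence", "Valentine"],
    npartitions=2,
)

# foldby: count names per length without a full shuffle.
counts = x.foldby(
    key=len,                                 # group by name length
    binop=lambda total, name: total + 1,     # fold one name into a count
    initial=0,                               # starting count per key
    combine=lambda left, right: left + right,  # merge partition counts
    combine_initial=0,
).compute()
counts_by_length = dict(counts)

# The same answer via groupby, which shuffles the data first.
groupby_counts = {
    length: len(names) for length, names in x.groupby(len).compute()
}

print(counts_by_length)
print(groupby_counts == counts_by_length)
```

The extra 'binop' and 'combine' functions are the "slightly harder to use" part; in exchange, each partition is reduced locally before anything is merged.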