Fundamentals of Dask Transcripts
Chapter: Dask Bag
Lecture: Manipulating data

0:00 So we've seen how to create Dask Bags and how to get JSON data, for example, into Dask Bags.
0:06 And as we've written here, bag objects have a standard functional API found in projects like the Python standard library, toolz, and PySpark.
0:14 So they include 'map' and 'filter' and 'groupby'. And we're going to see all of these things in action now.
0:19 So the first thing to note is that operations on bag objects create new bags. So without further ado, let's look at several common operations.
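A minimal sketch of that point, assuming a small in-memory bag built with `db.from_sequence` rather than the course's JSON data. The names `numbers` and `doubled` are illustrative only, and the synchronous scheduler is chosen just to keep the sketch simple.

```python
import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch behaves the same everywhere.
dask.config.set(scheduler="synchronous")

# A tiny illustrative bag (not the course data).
numbers = db.from_sequence([1, 2, 3, 4], npartitions=2)

# map does not modify `numbers`; it returns a new, lazy Bag.
doubled = numbers.map(lambda n: n * 2)

print(type(doubled).__name__)  # the result is itself a Bag
print(numbers.compute())       # the original bag is unchanged
print(doubled.compute())       # the new bag holds the mapped values
```

Nothing is computed until `compute` (or `take`) is called, so chaining operations only builds up new bag objects.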
0:30 So 'filter' is an important one. The reason filter is so important is that when you have your data,
0:35 you may want to look at certain values of interest and you may want to filter
0:39 your records, for example. So we have the Bag "b" from before and what we're going to do is filter it according to the age in the record.
0:49 Right? So we use a 'lambda function' here to do that, which we pass to the filter method.
0:53 And then we use 'take' with the argument 5 in order to look at the first five records that remain after this filter. And there we go.
1:02 We have Harold for example, who's 42, then we have Jack 66, etc. So we have the first five records of people who are older than 25.
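The filter step can be sketched roughly as follows. The record layout (a `first_name` field and an `age` field) and the ages for Emmett and Jonah are assumptions made up for illustration; only Harold (42) and Jack (66) come from the lecture.

```python
import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch behaves the same everywhere.
dask.config.set(scheduler="synchronous")

# Hypothetical records; the real course data comes from JSON files.
records = [
    {"first_name": "Harold", "age": 42},
    {"first_name": "Jack", "age": 66},
    {"first_name": "Emmett", "age": 19},  # made-up age
    {"first_name": "Jonah", "age": 31},   # made-up age
]
b = db.from_sequence(records, npartitions=2)

# Keep only the records of people older than 25.
older = b.filter(lambda record: record["age"] > 25)

# take returns a tuple of the first matching records.
first_two = older.take(2)
print(first_two)
```

By default `take` only looks in the first partition, which is why this sketch asks for two records from a partition that holds two matches.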
1:12 So that demonstrates how to filter a Dask Bag. We can also 'map' functions across bags.
1:19 For example, you may want to get all the first names from our JSON data. And the way we do this is we 'map' the function which extracts the first
1:28 name. We map that across the entire bag. So that's what we do here, and we take the first 10. So we're gonna get the first 10 names here.
1:36 And we see we have Harold, Jack, Emmett, Jonah, Eugenia, Sterling, Rudolph, Erlin, Lawrence and Valentine. Wow, that was a mouthful.
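As a rough sketch of the mapping step, assuming each record stores the first name under a hypothetical `first_name` key (the actual layout of the course's JSON records is not shown here):

```python
import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch behaves the same everywhere.
dask.config.set(scheduler="synchronous")

# Hypothetical records using first names mentioned in the lecture.
records = [
    {"first_name": "Harold"},
    {"first_name": "Jack"},
    {"first_name": "Emmett"},
    {"first_name": "Jonah"},
    {"first_name": "Eugenia"},
    {"first_name": "Sterling"},
]
b = db.from_sequence(records, npartitions=2)

# Map a function that extracts the first name across the whole bag.
names = b.map(lambda record: record["first_name"])

first_three = names.take(3)
print(first_three)
```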
1:42 But we got those first 10 there. Okay. Another common operation for data professionals,
1:48 data analysts, data scientists, citizen data scientists, is a 'groupby', which you may recall using in 'pandas' all the time.
1:55 So essentially a 'groupby' allows you to group data by some property or function. So what we're going to do here is, if you recall, 'x' is a bag of
2:06 all the first names of people in the records. And we're going to group by the length and then compute. Now, what this will do is
2:12 it will return a list of the lengths of the names, each with the names that have that length, essentially. So let's see that. Great.
2:22 So we have a list where we have six and then all the names or the first names that have six characters in them.
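A small sketch of grouping names by length, using a handful of first names from the lecture as stand-in data. Passing `shuffle="tasks"` is an assumption made here to avoid the on-disk shuffle; the lecture's own call may simply be `groupby(len)`.

```python
import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch behaves the same everywhere.
dask.config.set(scheduler="synchronous")

# Stand-in for the bag of first names (only a few names, for illustration).
names = db.from_sequence(
    ["Harold", "Jack", "Emmett", "Eugenia", "Sterling", "Valentine"],
    npartitions=2,
)

# Group by name length; each result is a (length, [names...]) pair.
grouped = names.groupby(len, shuffle="tasks").compute()

# The order of groups is not guaranteed, so collect them into a dict.
by_length = {length: sorted(group) for length, group in grouped}
print(by_length)
```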
2:30 Then 4, 8, 7 and 9. So one thing to note about the Dask 'groupby' operation: it can be slow. So I just want to say a bit about an
2:40 alternative, which is called 'foldby'. Okay. And I encourage you to check out the Dask documentation on 'foldby', and
2:49 I'll tell you briefly what some of the documentation says. So the 'groupby' method is straightforward according to the documentation,
2:55 but forces a full shuffle of the data, which is expensive. Now 'foldby' is slightly harder to use but faster.
3:01 So go and check out the docs and see for your particular use case whether you'd like to use 'groupby' or 'foldby'.
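To give a feel for 'foldby', here is a hedged sketch that counts names per length instead of collecting them. The data and the choice of a count as the reduction are illustrative assumptions, not the lecture's own example.

```python
import dask
import dask.bag as db

# Assumption: run single-threaded so the sketch behaves the same everywhere.
dask.config.set(scheduler="synchronous")

# Stand-in for the bag of first names (only a few names, for illustration).
names = db.from_sequence(
    ["Harold", "Jack", "Emmett", "Eugenia", "Sterling", "Valentine"],
    npartitions=2,
)

# key: how to group; binop: fold one element into a running total;
# combine: merge the per-partition totals for the same key.
counts = names.foldby(
    key=len,
    binop=lambda total, _name: total + 1,
    initial=0,
    combine=lambda left, right: left + right,
    combine_initial=0,
)

count_by_length = dict(counts.compute())
print(count_by_length)
```

Because each partition is reduced to one running value per key before results are combined, the full groups never need to be moved between partitions. This works when the per-group result can be expressed as a reduction, as with the count here.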

