Getting started with Dask Transcripts
Chapter: Using the Dask DataFrame
Lecture: Limitations of Dask DataFrame
0:00
Dask DataFrame is fantastic and incredibly powerful, but it does have some limitations.
0:06
It does not implement the entire Pandas API because not all Pandas operations are suited for a parallel and distributed environment. For example,
0:16
operations that require data shuffling. As we know, a Dask DataFrame consists of multiple Pandas DataFrames, and each has its own index starting from zero. Pandas
0:25
indexing operations like set_index and reset_index are slower in Dask because they may need the
0:31
data to be sorted, which requires a lot of time-consuming shuffling and synchronization of
0:36
data among workers. Moving data across different machines has network and communication costs.
0:42
To avoid sorting, you can presort the index and make logical partitions. Finally, to learn more, go through the resources shared here.