Getting started with Dask Transcripts
Chapter: Using the Dask DataFrame
Lecture: Limitations of Dask DataFrame

0:00 Dask DataFrame is fantastic and incredibly powerful, but it does have some limitations.
0:06 It does not implement the entire Pandas API, because not all Pandas operations are suited to a parallel and distributed environment. One example is
0:16 operations that require data shuffling. As we know, a Dask DataFrame consists of multiple Pandas DataFrames. Each has an index starting from zero. Pandas
0:25 indexing operations like set_index and reset_index are slower in Dask because they may need the
0:31 data to be sorted, which requires a lot of time-consuming shuffling and synchronization of
0:36 data among workers. Moving data across different machines has network and communication costs.
0:42 To avoid sorting, you can presort the index and make logical partitions. Finally, to learn more, go through the resources shared here.

