Getting started with Dask Transcripts
Chapter: What is big data?
Lecture: Big data?

0:01 In the beginning, Matthew talked about how Dask is used to scale your data
0:06 science workflows, but let's pause for a second and consider what scaling to larger datasets
0:12 actually means. We can broadly divide datasets into three categories: small, medium, and large. A small dataset can be loaded into your local RAM
0:21 and you can do analysis on it comfortably using the tools you know and love. Medium
0:25 datasets are datasets that exceed your RAM capacity but still fit on your local disk. Normally our tools would fail because of insufficient memory.
0:34 You can take advantage of parallel computing by loading only one part of the data into
0:39 RAM at a time. Dask helps you scale up. Large datasets, however,
0:43 do not fit on your physical local drive and only distributed cloud computing is a viable
0:49 solution. Dask helps you scale out to fleets of machines in the cloud as well. We typically call these large datasets Big Data.
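
The idea of loading only one part of the data into RAM at a time can be sketched in plain Python. This is a minimal illustration using only the standard library, not how Dask is implemented; the helper names `iter_chunks` and `chunked_mean` and the sample data are hypothetical. Dask automates this kind of chunking and can also process the chunks in parallel.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from an iterator (hypothetical helper)."""
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(lines, column, chunk_size=2):
    """Compute the mean of one CSV column while holding only one chunk in memory."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for chunk in iter_chunks(reader, chunk_size):
        # Only this chunk's rows are in memory; earlier chunks are discarded.
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# A tiny in-memory stand-in for a file that would not fit in RAM.
sample = io.StringIO("value\n1\n2\n3\n4\n5\n")
result = chunked_mean(sample, "value", chunk_size=2)
print(result)
```

With a real file, you would pass an open file handle instead of the `io.StringIO` object, and a much larger `chunk_size`.
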
0:57 Another way to think about Big Data is any data that can't be processed with traditional
1:03 methods, any data that needs extra engineering to handle it efficiently or to even begin working with it. When we talk about Big Data,
1:11 it's worth mentioning the four V's that define it. The first is Volume: technology is enabling the collection of more and more data. Velocity is
1:21 the second one, as data is being generated at a speed like never before,
1:25 thanks again to technology and social media, as well as the democratization of internet access for
1:31 larger and larger groups of people. Veracity: collecting data always raises concerns about accuracy,
1:37 authenticity, and bias. As data grows, so do the reliability concerns. Variety: data is available in so many different formats, from text,
1:47 to images, to videos, to social content and more. New formats are invented every year.
1:54 I encourage you to pause and look at this infographic prepared by the IBM Big Data & Analytics Hub.

