Getting started with Dask Transcripts
Chapter: What is big data?
Lecture: Big data?
0:01
In the beginning, Matthew talked about how Dask is used to scale your data
0:06
science workflows, but let's pause for a second and consider what scaling to larger datasets
0:12
actually means. We can broadly divide datasets into three categories: small, medium, and large. A small dataset can be loaded into your local RAM
0:21
and you can do analysis on it comfortably using the tools you know and love. Medium
0:25
datasets exceed your RAM capacity but still fit on your local disk. Normally, our tools would fail on them because of insufficient memory.
0:34
You can take advantage of parallel computing by loading only one part of the data into
0:39
RAM at a time. This is how Dask helps you scale up. Large datasets, however,
0:43
do not fit on your physical local drive and only distributed cloud computing is a viable
0:49
solution. Dask helps you scale out to fleets of machines in the cloud as well. We typically call these large datasets Big Data.
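To make the three categories concrete, here is a minimal Python sketch that is not part of the lecture. The names `classify_dataset`, `chunked_mean`, and `fake_file` are hypothetical, and the tiny in-memory file stands in for a CSV that would not fit in RAM. It uses only the standard library to illustrate the idea of holding one part of the data in memory at a time; Dask builds on a similar idea by splitting data into partitions, but this is not Dask's implementation.

```python
import csv
import io
from itertools import islice


def classify_dataset(size_bytes, ram_bytes, disk_bytes):
    """Label a dataset using the lecture's three categories (hypothetical helper)."""
    if size_bytes <= ram_bytes:
        return "small"   # fits in local RAM
    if size_bytes <= disk_bytes:
        return "medium"  # exceeds RAM but fits on local disk
    return "large"       # exceeds local disk


def chunked_mean(rows, column, chunk_size):
    """Mean of one column, holding at most `chunk_size` rows in memory.

    `rows` must be an iterator of dict-like records, e.g. csv.DictReader.
    """
    total = 0.0
    count = 0
    while True:
        # Pull the next chunk from the iterator; earlier chunks are discarded.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Assumption: a tiny in-memory stand-in for a file too big for RAM.
fake_file = io.StringIO("value\n1\n2\n3\n4\n5\n")
reader = csv.DictReader(fake_file)
mean_value = chunked_mean(reader, "value", chunk_size=2)

GB = 10**9
category = classify_dataset(size_bytes=50 * GB, ram_bytes=16 * GB, disk_bytes=500 * GB)
```

With these made-up sizes, a 50 GB dataset on a machine with 16 GB of RAM and a 500 GB disk lands in the medium category, which is the case where chunk-at-a-time processing applies.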
0:57
Another way to think about Big Data is any data that can't be processed with traditional
1:03
methods: any data that needs extra engineering to handle efficiently, or even to begin working with. When we talk about Big Data,
1:11
it's worth mentioning the four V's that define it. The first is Volume: technology is enabling the collection of more and more data. Velocity is
1:21
the second: data is being generated faster than ever before,
1:25
thanks again to technology, to social media, and to the democratization of internet access for
1:31
larger and larger groups of people. The third is Veracity: collecting data always raises concerns about accuracy,
1:37
authenticity, and bias. As data grows, so do the reliability concerns. The fourth is Variety: data is available in so many different formats, from text
1:47
to images, to videos, to social content and more. New formats are invented every year.
1:54
I encourage you to pause and look at this infographic prepared by the IBM Big Data & Analytics Hub.