Write Pythonic Code Like a Seasoned Developer Transcripts
Chapter: Dictionaries
Lecture: Stop using lists for everything
0:01
The first area I want to cover around dictionaries is using dictionaries for performance. So when you have a collection of data,
0:09
the most natural choice, the thing Python developers reach for first, has got to be a list. It's super common, super useful and very flexible.
0:18
Let's look at two different algorithms one using a list and one using a dictionary and we'll compare the relative performance.
0:25
So here we have a bit of code and we'll look at this in detail in PyCharm in just a moment, but here is the basic way it works.
0:32
So we start out with a list of data, here you can see in our example we are going to be using half a million items,
0:38
so we are calling it data_list and it's going to contain some rich objects with multiple values or measurements; the details don't actually matter.
0:46
Then, suppose for some reason we have done some computation and we've come up with a hundred pieces of data we would like to go look up in our list,
0:54
so we are going to go loop over each of those 100 and then we are going to find the item in the list,
0:59
we can't just use the "item in list" in operator, because we actually have to filter on a particular value,
1:07
what we did instead is wrote this method called find_point_by_id_in_list and it just walks through the list, it compares the id
1:14
it's looking for against the ids that it finds in the list, as soon as it finds the match, it returns that one,
1:18
which is assumed to be unique, and then the loop appends it to interesting_points. So this is one version of the algorithm.
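As a rough sketch of the list-based version just described: the function name find_point_by_id_in_list comes from the lecture, but the Point type, its fields, and the sample ids below are hypothetical stand-ins so the example runs on its own.

```python
from collections import namedtuple

# Hypothetical stand-in for the course's data point; only `id` matters here.
Point = namedtuple("Point", "id value")


def find_point_by_id_in_list(data_list, point_id):
    # Walk the list until an id matches; ids are assumed to be unique.
    for point in data_list:
        if point.id == point_id:
            return point
    return None  # no item had that id


data_list = [Point(id=i, value=i * 10) for i in range(1000)]
interesting_ids = [3, 500, 999]  # made-up ids for illustration

interesting_points = []
for point_id in interesting_ids:
    interesting_points.append(find_point_by_id_in_list(data_list, point_id))
```

Each call may have to scan most of the list before it finds a match, which is what makes this version slow on half a million items.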
1:25
The other one is well, maybe we could do a little bit better if we actually used a dictionary.
1:30
And then we could index - the key for the dictionary could be the id.
1:33
So if we wrote it like this, here we have a dictionary of half a million items, again the same dynamically discovered 100 interesting points,
1:43
and instead of doing this lookup by walking through the items, we can actually map the id to the objects that we are looking for
1:50
so we can just index into the dictionary. Obviously, indexing into the dictionary is going to be faster, but possibly the computation,
1:58
the building of the dictionary itself, might be way slower; we are only going to do these lookups 100 times, but we have half a million items, right,
2:05
it's much more complicated to generate a dictionary with half a million items than it is a list.
2:11
So let's see this in action and see what the verdict is. So here we have our data point, our data point is a named tuple,
2:19
it could have been a custom class but named tuple is sufficient and it has five values: id, an "x y" for two dimensional coordinates,
2:27
a temperature and a quality on the measurement of the temperature. You can see I have collapsed some areas of the code
2:33
because they don't really matter, these little print outs I think they kind of make it hard to read.
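The data point described a moment ago could be declared roughly like this. The type name DataPoint and the exact field names are assumptions; the lecture only says there are five values (an id, x and y coordinates, a temperature, and a quality).

```python
import collections

# Field names are assumed; the transcript names the five values only loosely.
DataPoint = collections.namedtuple("DataPoint", "id x y temp quality")

# A made-up measurement, just to show attribute access by name.
p = DataPoint(id=1, x=10, y=20, temp=72.5, quality="A")
print(p.id, p.temp)
```

A named tuple gives readable attribute access without writing a custom class, which is why it is sufficient here.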
2:36
In PyCharm you can highlight these and hit Command+. and turn them into little collapsible regions,
2:41
so I did that so that you can focus on the algorithm and not the little details. Here we have our data_list that we are going to work with,
2:48
and we are going to use a random seed of zero so we always do exactly the same thing, but randomly, so that we have perfectly repeatable results
2:57
and then down here for each item of this range of half a million, each time through the loop we are going to randomly construct
3:02
one of these data points and put it into our list. Next, we do a little reordering on the list just to make sure
3:07
that we don't just randomly access it in order, since we are using auto incrementing ids, next we are going to create our set of interesting ids
3:15
that we are going to go search through our list, and then later through our dictionary. Really we would use some kind of algorithm
3:21
and we would find interesting items we need to go look up, but in this case we are just going to randomly do it,
3:26
but there are a few Pythonic things going on here. One: notice this statement here with the curly braces, and then one item to the left of the "for",
3:34
that means what we are building here is a set using something called a set comprehension and each item in the set
3:41
is going to be a random number between zero and the length of that list which is half a million, so quite a large range there.
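Putting the setup steps together, a scaled-down sketch might look like the following. The value ranges, the field names, and the list size of 5,000 (the lecture uses 500,000) are assumptions chosen so the example runs quickly. The upper bound of len(data_list) - 1 is also a choice made here so that every generated id actually exists in the list.

```python
import collections
import random

DataPoint = collections.namedtuple("DataPoint", "id x y temp quality")

random.seed(0)  # fixed seed: the "random" data is identical on every run

data_list = []
for d_id in range(5000):  # auto-incrementing ids
    data_list.append(DataPoint(
        id=d_id,
        x=random.randint(0, 1000),
        y=random.randint(0, 1000),
        temp=random.randint(-10, 50),
        quality=random.random(),
    ))

# Reorder the list so the ids are no longer stored in ascending order.
random.shuffle(data_list)

# Set comprehension: curly braces with a single expression left of "for".
interesting_ids = {random.randint(0, len(data_list) - 1) for _ in range(100)}
```

Because a set drops duplicates, interesting_ids can end up with slightly fewer than 100 entries.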
3:48
And we are just going to range across zero to 100. The other thing to look at is we don't actually care about
3:54
the index coming out of the range, we just want to run this 100 times. In Python, when you are looping across something like this range here
4:02
or you are possibly unpacking a tuple and you care about only a few of the values, not all of them,
4:08
it's Pythonic to use the underscore for the variable name to say "I must put something here but I actually have no concern what it is."
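Both uses of the underscore mentioned here can be shown in a few lines. The tuple contents below are made up for illustration.

```python
# Repeat something three times without caring about the loop counter.
greetings = ["hello" for _ in range(3)]

# Unpack a tuple but keep only the values of interest.
measurement = (7, 1.5, 2.5, 68, 0.9)  # hypothetical (id, x, y, temp, quality)
point_id, _, _, temp, _ = measurement
```

The underscore is an ordinary variable name; using it is a convention that tells the reader the value is intentionally ignored.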
4:16
So, our interesting ids are interesting from a Pythonic perspective, but now we have the set of approximately 100 ids,
4:23
assuming that there are no conflicts or duplication there, and the next thing we are going to do is we are going to come along here,
4:30
we are going to start a little timer and figure out at the end what the total elapsed seconds were,
4:35
and during that time, we want to go and actually pull out the interesting points that correspond to the interesting ids.
4:41
So we are going to go for each interesting id, remember, it's about a 100, we are going to say "find the point in the list like so,
4:48
and then add it", and if we look quickly at this, you can see we just go through each item on the list
4:53
and if the item matches the id we are looking for, we are done, otherwise, we didn't find it.
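A minimal sketch of timing only the lookup part, assuming a datetime-based timer (the lecture's exact timing code is not shown in the transcript) and a list of 20,000 simplified points instead of half a million:

```python
import datetime
import random
from collections import namedtuple

Point = namedtuple("Point", "id temp")  # simplified, hypothetical data point


def find_point_by_id_in_list(data_list, point_id):
    for point in data_list:
        if point.id == point_id:
            return point  # found it, we are done
    return None  # otherwise, we didn't find it


random.seed(0)
data_list = [Point(i, random.randint(-10, 50)) for i in range(20000)]
random.shuffle(data_list)
interesting_ids = {random.randint(0, len(data_list) - 1) for _ in range(100)}

# Only the lookups sit between the two clock readings.
t0 = datetime.datetime.now()
interesting_points = []
for point_id in interesting_ids:
    interesting_points.append(find_point_by_id_in_list(data_list, point_id))
dt = datetime.datetime.now() - t0

print("Found {} points in {:.4f} seconds".format(
    len(interesting_points), dt.total_seconds()))
```

The absolute numbers depend on the machine and the list size, so they will not match the figures quoted in the lecture.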
4:59
So, just to get a base line, I am going to assume that this is slower, let's go and run it and see what happens.
5:09
Remember, it's only the locating data in the list part that is actually timed. All right, so this took 7.9 seconds
5:17
and here you can see there is a whole bunch of data points it found, if we run it again, we get 8.4 seconds. So it's somewhere around 8 seconds.
5:27
All right, so let's take this algorithm here and adapt it for our dictionary.
5:32
So I've got a little place holder to sort of drop in the timing and so on, you don't have to watch me type that,
5:36
so the first thing we want to do is create a dictionary, before we had data_list, now we are going to have data_dict,
5:41
we can create this using a dictionary comprehension, so that would be a very Pythonic thing to do
5:46
and we want to map, for each item in the list, the id to the actual object. So, we create set and dictionary comprehensions like so
5:54
but the difference is we have a "key:value" for the dictionaries where we just have the value for sets. You kind of have to write this in reverse,
6:01
I am going to name the elements we are going to look at "d", so I am going to say "d.id: d for d in data_list", right,
6:12
so this is going to create a dictionary of all half a million items and mapping the id to the actual value. So now, let's start a little timer,
6:19
and next we want to locate the items in the dictionary, so again, we'll say interesting_points, let's clear that;
6:26
"for id", we call it "d_id" so it doesn't conflict with the id built-in, so "for d_id in interesting_ids" we want to do a lookup,
6:36
we'll say "the data element is", now we have a dictionary and we can look up things by ids, so that is super easy,
6:42
we just say it like so, assuming that none of the ids is missing, something like that
6:46
and then we'll just say "interesting_points.append(d)". Oops, almost made a mistake there, let's index with "d_id", not the built-in id; that of course won't work.
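The dictionary version typed in this part of the video might look roughly like this. The loop variable is spelled d_id here, which is an assumption about what appeared on screen, and the Point type and sample ids are hypothetical.

```python
from collections import namedtuple

Point = namedtuple("Point", "id temp")  # simplified, hypothetical data point

data_list = [Point(i, 20 + i % 5) for i in range(1000)]
interesting_ids = {3, 500, 999}  # made-up ids for illustration

# Dictionary comprehension: "key: value" to the left of "for".
data_dict = {d.id: d for d in data_list}

interesting_points = []
for d_id in interesting_ids:
    d = data_dict[d_id]  # direct lookup; raises KeyError if the id is missing
    interesting_points.append(d)

# If ids might be missing, dict.get returns None instead of raising.
missing = data_dict.get(-1)
```

Indexing with the built-in id function instead of d_id would raise a KeyError, since the function object is not one of the keys.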
6:59
All right, so let's run it again and see how it works, so we are going to run, it's still going to run the other slow version,
7:03
I'll skip that in the video, wow, look at that, 8 seconds, and this is 0.000069 seconds. So that's less than 1 millisecond, by a wide margin.
7:18
That is a non-trivial speed up, let's see how much of a speed up that is, then the other thing to consider as well, maybe the speed up was huge
7:26
but the cost of computing the dictionary was more than offsetting the gains we had, let's try.
7:33
Wow, the speedup that we received was not one time faster, two times faster, or ten times faster,
7:39
if this is data that we are going to go back into and back into, we would create this dictionary and sort of reuse it,
7:46
where we get a speedup of about 128,000 times, with an algorithm that is actually easier to use than writing our silly list lookup,
7:54
and it took literally one line of a dictionary comprehension, that's a beautiful combination of how dictionaries work for performance,
8:01
bringing together these Pythonic ideas like dictionary comprehensions and so on, it made our algorithm both easier and dramatically faster.
8:10
What if we had to create this dictionary just one time to do this work?
8:15
Maybe we should move this down and actually count the creation of the dictionary as part of the computational time,
8:21
so let's see what we get if we run it that way. Look at that, 8 seconds versus 0.2 seconds, so even though it took a while to create that dictionary
8:30
it still took almost no time relative to our way more inefficient algorithm using lists, we've got a 37 times speedup if every single time
8:38
we call this function or we do this operation we would have to recreate the dictionary, it's still dramatically better and of course simpler as well.
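To see the second comparison on a smaller scale, here is a hedged sketch that counts building the dictionary as part of the timed work. It uses time.perf_counter and a list of 20,000 simplified points, both of which are choices made for this example rather than the lecture's code.

```python
import random
import time
from collections import namedtuple

Point = namedtuple("Point", "id temp")  # simplified, hypothetical data point

random.seed(0)
data_list = [Point(i, random.randint(-10, 50)) for i in range(20000)]
random.shuffle(data_list)
interesting_ids = {random.randint(0, len(data_list) - 1) for _ in range(100)}

# Version 1: scan the list once per id.
t0 = time.perf_counter()
from_list = [next(p for p in data_list if p.id == i) for i in interesting_ids]
list_seconds = time.perf_counter() - t0

# Version 2: build the dictionary inside the timed region, then index into it.
t0 = time.perf_counter()
data_dict = {p.id: p for p in data_list}
from_dict = [data_dict[i] for i in interesting_ids]
dict_seconds = time.perf_counter() - t0

print("list: {:.4f} s, dict including build: {:.4f} s".format(
    list_seconds, dict_seconds))
```

Building the dictionary is a single pass over the data, while the list version makes up to one pass per id, so the gap widens as the number of lookups grows.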
8:46
Let's review that in a graphic. So here we have two basically equivalent algorithms, we have a bunch of data we are storing in a list,
8:55
half a million items, and then we are going to loop over them and we are going to try to pull some items out,
8:59
by some particular property of the things contained in the list, well if you are in that situation, dictionaries are amazing for it
9:05
and as you saw they are stunningly fast. If we don't count the creation of the dictionary,
9:11
the bottom algorithm was about 130,000 times faster than the top algorithm. So I am sure you all thought, well, a dictionary is probably faster,
9:19
but did you think it would be 130,000 times faster? That's really cool, right? It basically means that it becomes free to do that lookup,
9:27
and even if we had to recreate the dictionary every time, it's still 37 times faster, which is an amazing speedup.