Write Pythonic Code Like a Seasoned Developer Transcripts
Chapter: Dictionaries
Lecture: Stop using lists for everything

0:01 The first area I want to cover around dictionaries is using dictionaries for performance. So when you have a collection of data,

0:09 the most natural choice, the thing Python developers reach for first, has got to be a list. It's super common, super useful and very flexible.

0:18 Let's look at two different algorithms one using a list and one using a dictionary and we'll compare the relative performance.

0:25 So here we have a bit of code and we'll look at this in detail in PyCharm in just a moment, but here is the basic way it works.

0:32 So we start out with a list of data, here you can see in our example we are going to be using half a million items,

0:38 so we are calling it data_list and it's going to contain some rich objects with multiple values or measurements - the details don't actually matter.

0:46 Then, suppose for some reason we have done some computation and we've come up with a hundred pieces of data we would like to go look up in our list,

0:54 so we are going to go loop over each of those 100 and then we are going to find the item in the list,

0:59 we can't just use the "item in list" in operator, because we actually have to filter on a particular value,

1:07 so what we did instead was write this function called find_point_by_id_in_list; it just walks through the list and compares the id

1:14 it's looking for against the ids that it finds in the list. As soon as it finds a match, it returns that one,

1:18 since ids are assumed to be unique, and then we append it to interesting_points. So this is one version of the algorithm.
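
A minimal sketch of the list-based version just described. The function name find_point_by_id_in_list comes from the lecture; the function body, the two-field Point type, and the tiny sample data are assumptions made only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the lecture's richer data object.
Point = namedtuple("Point", "id value")


def find_point_by_id_in_list(data_list, point_id):
    # Walk the list and return the first item whose id matches.
    for p in data_list:
        if p.id == point_id:
            return p
    return None  # nothing matched


data_list = [Point(i, i * 10) for i in range(10)]
interesting_ids = [3, 7]

interesting_points = []
for point_id in interesting_ids:
    interesting_points.append(find_point_by_id_in_list(data_list, point_id))

print(interesting_points)
```

Every call may have to scan the whole list, so the cost of each lookup grows with the size of the list.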

1:25 The other one is well, maybe we could do a little bit better if we actually used a dictionary.

1:30 And then we could index - the key for the dictionary could be the id.

1:33 So if we wrote it like this, here we have a dictionary of half a million items, again the same dynamically discovered 100 interesting points,

1:43 and instead of doing this lookup by walking through the items, we can actually map the id to the objects that we are looking for

1:50 so we can just index into the dictionary. Obviously, indexing into the dictionary is going to be faster, but possibly the computation,

1:58 the building of the dictionary itself might be way slower; if we are going to do this 100 times and we have half a million items, right,

2:05 it's much more complicated to generate a dictionary with half a million items than it is a list.
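
The dictionary version might look roughly like this. It is a sketch only: the Point type and the sample data are hypothetical, and the dictionary is built here with a plain loop to make the extra construction step visible.

```python
from collections import namedtuple

# Hypothetical stand-in for the lecture's richer data object.
Point = namedtuple("Point", "id value")

data_list = [Point(i, i * 10) for i in range(10)]

# Extra up-front work: one pass over the data to key each item by its id.
data_dict = {}
for p in data_list:
    data_dict[p.id] = p

interesting_ids = [3, 7]

interesting_points = []
for point_id in interesting_ids:
    # Index straight into the dictionary instead of walking a list.
    interesting_points.append(data_dict[point_id])

print(interesting_points)
```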

2:11 So let's see this in action and see what the verdict is. So here we have our data point, our data point is a named tuple,

2:19 it could have been a custom class but a named tuple is sufficient, and it has five values: an id, an "x" and a "y" for two-dimensional coordinates,

2:27 a temperature and a quality on the measurement of the temperature. You can see I have collapsed some areas of the code

2:33 because they don't really matter, these little print outs I think they kind of make it hard to read.
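
The data point described above could be declared roughly as follows. The five values come from the lecture, but the exact field names (temp, quality) and the sample values are assumptions.

```python
from collections import namedtuple

# Field names are assumed; the lecture names the five values only in prose.
DataPoint = namedtuple("DataPoint", "id x y temp quality")

d = DataPoint(id=1, x=10.5, y=20.25, temp=72.0, quality="A")

print(d.id, d.temp)  # fields are readable by name
print(d[1], d[2])    # and, being a tuple, by position
```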

2:36 In PyCharm you can highlight these and hit Command+. and turn them into little collapsible regions,

2:41 so I did that so that you can focus on the algorithm and not the little details. Here we have our data_list that we are going to work with,

2:48 and we are going to use a random seed of zero so we always do exactly the same thing, but randomly, so that we have perfectly repeatable results

2:57 and then down here for each item of this range of half a million, each time through the loop we are going to randomly construct

3:02 one of these data points and put it into our list. Next, we do a little reordering on the list just to make sure

3:07 that we don't just randomly access it in order, since we are using auto incrementing ids, next we are going to create our set of interesting ids

3:15 that we are going to go search through our list, and then later through our dictionary. Really we would use some kind of algorithm

3:21 and we would find interesting items we need to go look up, but in this case we are just going to randomly do it,

3:26 but there are a few Pythonic things going on here, one - notice this statement here with the curly braces, and then one item to the left of the "for",

3:34 that means what we are building here is a set using something called a set comprehension and each item in the set

3:41 is going to be a random number between zero and the length of that list which is half a million, so quite a large range there.

3:48 And we are just going to range across zero to 100. The other thing to look at is we don't actually care about

3:54 the index coming out of the range, we just want to run this 100 times. In Python, when you are looping across something like this range here

4:02 or you are possibly unpacking a tuple and there are only a few of the values, not all the values, that you care about,

4:08 it's Pythonic to use the underscore for the variable name to say "I must put something here but I actually have no concern what it is."
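
Putting the setup steps together, a sketch might look like this. The field names, the value ranges passed to the random functions, and the use of random.randrange are assumptions. randrange is chosen here because it excludes the upper bound, so every generated id is guaranteed to exist in the data.

```python
import random
from collections import namedtuple

DataPoint = namedtuple("DataPoint", "id x y temp quality")  # assumed field names

random.seed(0)  # fixed seed: random-looking but repeatable results

data_list = []
for d_id in range(500_000):
    x = random.randint(0, 1000)      # assumed value ranges
    y = random.randint(0, 1000)
    temp = random.randint(-10, 50)
    quality = random.random()
    data_list.append(DataPoint(d_id, x, y, temp, quality))

# Reorder so the auto-incrementing ids are no longer stored in order.
random.shuffle(data_list)

# Set comprehension: curly braces with a single item to the left of "for".
# The underscore says the loop index itself is never used.
interesting_ids = {random.randrange(len(data_list)) for _ in range(100)}

# The same convention when unpacking: keep x and y, ignore the rest.
_, x, y, _, _ = data_list[0]
print(len(interesting_ids), x, y)
```

Because a set discards duplicates, interesting_ids can end up with slightly fewer than 100 entries.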

4:16 So, our interesting ids are interesting from a Pythonic perspective, but now we have the set of approximately 100 ids,

4:23 assuming that there are no conflicts or duplication there, and the next thing we are going to do is come along here,

4:30 start a little timer, figure out at the end what the total seconds elapsed were,

4:35 and during that time, we want to go and actually pull out the interesting points that correspond to the interesting ids.

4:41 So we are going to go for each interesting id, remember, it's about a 100, we are going to say "find the point in the list like so,

4:48 and then add it", and if we look quickly at this, you can see we just go through each item on the list

4:53 and if the item matches the id we are looking for, we are done, otherwise, we didn't find it.
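
The timed list lookup could be sketched like this. Timing with datetime is an assumption about how the timer is written, and the data set is deliberately smaller than the lecture's half a million items so the sketch finishes quickly.

```python
import datetime
import random
from collections import namedtuple

DataPoint = namedtuple("DataPoint", "id x y temp quality")  # assumed field names


def find_point_by_id_in_list(data_list, d_id):
    # Linear scan: compare each item's id until one matches.
    for d in data_list:
        if d.id == d_id:
            return d
    return None


random.seed(0)
# 50,000 items rather than the lecture's 500,000, to keep the run short.
data_list = [DataPoint(i, 0, 0, 0, 0) for i in range(50_000)]
random.shuffle(data_list)
interesting_ids = {random.randrange(len(data_list)) for _ in range(100)}

# Only the lookup loop sits between the two timestamps.
t0 = datetime.datetime.now()
interesting_points = []
for d_id in interesting_ids:
    interesting_points.append(find_point_by_id_in_list(data_list, d_id))
dt = datetime.datetime.now() - t0

print(f"List lookup took {dt.total_seconds():.4f} sec")
```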

4:59 So, just to get a base line, I am going to assume that this is slower, let's go and run it and see what happens.

5:09 Remember, it's only the locating data in the list part that is actually timed. All right, so this took 7.9 seconds

5:17 and here you can see there is a whole bunch of data points it found; if we run it again, we get 8.4 seconds. So it's somewhere around 8 seconds.

5:27 All right, so let's take this algorithm here and adapt it for our dictionary.

5:32 So I've got a little place holder to sort of drop in the timing and so on, you don't have to watch me type that,

5:36 so the first thing we want to do is create a dictionary, before we had data_list, now we are going to have data_dict,

5:41 we can create this using a dictionary comprehension, so that would be a very Pythonic thing to do

5:46 and we want to map, for each item in the list, the id to the actual object. So, we create set and dictionary comprehensions like so

5:54 but the difference is we have a "key: value" for the dictionaries where we just have the value for sets. You kind of have to write this in reverse,

6:01 I am going to name the elements we are going to look at "d", so I am going to say "d.id" maps to "d", "for d in data_list", right,

6:12 so this is going to create a dictionary of all half a million items and mapping the id to the actual value. So now, let's start a little timer,

6:19 and next we want to locate the items in the dictionary, so again, we'll say interesting_points, let's clear that;

6:26 "for id", but we call it "d_id" so it doesn't conflict with the id built-in, so "for d_id in interesting_ids" we want to do a lookup,

6:36 we'll say "the data element is", now we have a dictionary and we can look up things by id, so that is super easy,

6:42 we just say it like so, assuming that none of the ids is missing, something like that

6:46 and then we'll just say "interesting_points.append(d)". Oops, almost made a mistake there, let's say "d_id" not the built-in, that of course won't work.
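
The dictionary version typed above could be sketched as follows. As before, the field names and the reduced data size are assumptions, and the dictionary is built before the timer starts so only the lookups are measured.

```python
import datetime
import random
from collections import namedtuple

DataPoint = namedtuple("DataPoint", "id x y temp quality")  # assumed field names

random.seed(0)
# 50,000 items rather than the lecture's 500,000, to keep the run short.
data_list = [DataPoint(i, 0, 0, 0, 0) for i in range(50_000)]
random.shuffle(data_list)
interesting_ids = {random.randrange(len(data_list)) for _ in range(100)}

# Dictionary comprehension: "key: value" to the left of "for".
data_dict = {d.id: d for d in data_list}

t0 = datetime.datetime.now()
interesting_points = []
for d_id in interesting_ids:   # d_id avoids shadowing the built-in id()
    d = data_dict[d_id]        # raises KeyError if an id were missing
    interesting_points.append(d)
dt = datetime.datetime.now() - t0

print(f"Dict lookup took {dt.total_seconds():.6f} sec")
```

If missing ids were possible, data_dict.get(d_id) would return None instead of raising.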

6:59 All right, so let's run it again and see how it works, so we are going to run, it's still going to run the other slow version,

7:03 I'll skip that in the video, wow, look at that, 8 seconds, and this is 0.000069 seconds. So that's less than 1 millisecond, by a wide margin.

7:18 That is a non-trivial speed up, let's see how much of a speed up that is, then the other thing to consider as well, maybe the speed up was huge

7:26 but the cost of computing the dictionary was more than offsetting the gains we had, let's try.

7:33 Wow, the speedup that we received was not one time faster, two times faster, or ten times faster,

7:39 if this is data that we are going to go back into and back into, we would create this dictionary and sort of reuse it,

7:46 where we get a speed up of 128,000 times and an algorithm that is actually easier to use than writing our silly list lookup

7:54 and it took literally one line of a dictionary comprehension, that's a beautiful combination of how dictionaries work for performance,

8:01 bringing together these Pythonic ideas like dictionary comprehensions and so on, it made our algorithm both easier and dramatically faster.

8:10 What if we had to create this dictionary just one time to do this work?

8:15 Maybe we should move this down and actually count the creation of the dictionary as part of the computational time,

8:21 so let's see what we get if we run it that way. Look at that, 8 seconds versus 0.2 seconds, so even though it took a while to create that dictionary

8:30 it still took almost no time relative to our way more inefficient algorithm using lists, we've got a 37 times speedup if every single time

8:38 we call this function or we do this operation we would have to recreate the dictionary, it's still dramatically better and of course simpler as well.
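
To compare the two approaches when the dictionary has to be rebuilt each time, the comprehension can be moved inside the timed region. This is a sketch under the same assumptions as before (assumed field names, datetime-based timing, a smaller data set); the actual timings will vary by machine.

```python
import datetime
import random
from collections import namedtuple

DataPoint = namedtuple("DataPoint", "id x y temp quality")  # assumed field names


def find_point_by_id_in_list(data_list, d_id):
    for d in data_list:
        if d.id == d_id:
            return d
    return None


random.seed(0)
# 50,000 items rather than the lecture's 500,000, to keep the run short.
data_list = [DataPoint(i, 0, 0, 0, 0) for i in range(50_000)]
random.shuffle(data_list)
interesting_ids = sorted({random.randrange(len(data_list)) for _ in range(100)})

# Version 1: scan the list once per id.
t0 = datetime.datetime.now()
from_list = [find_point_by_id_in_list(data_list, d_id) for d_id in interesting_ids]
list_secs = (datetime.datetime.now() - t0).total_seconds()

# Version 2: building the dictionary now counts toward the measured time.
t0 = datetime.datetime.now()
data_dict = {d.id: d for d in data_list}
from_dict = [data_dict[d_id] for d_id in interesting_ids]
dict_secs = (datetime.datetime.now() - t0).total_seconds()

print(f"list: {list_secs:.4f} sec, dict (including build): {dict_secs:.4f} sec")
```

Both versions return the same points; only the amount of work needed to find them differs.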

8:46 Let's review that in a graphic. So here we have two basically equivalent algorithms, we have a bunch of data we are storing in a list,

8:55 half a million items, and then we are going to loop over them and we are going to try to pull some items out,

8:59 by some particular property of the things contained in the list, well if you are in that situation, dictionaries are amazing for it

9:05 and as you saw they are stunningly fast. If we don't count the creation of the dictionary,

9:11 the bottom algorithm was 130,000 times faster than the top algorithm. So I am sure you all thought, well, the dictionary is probably faster,

9:19 but did you think it would be 130,000 times faster? That's really cool, right? It basically means that lookup becomes free,

9:27 and even if we had to recreate the dictionary every time, it's still 37 times faster, which is an amazing speedup.
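
One way to see where the gap comes from is to count the work directly. A list scan compares ids one by one until it hits the match, so each lookup costs work proportional to the length of the list. A dictionary hashes the key and goes more or less straight to the entry, which on average costs about the same regardless of size. The sketch below counts comparisons for the list scan; the sizes and names are assumptions chosen for illustration.

```python
import random

random.seed(0)

item_count = 50_000  # smaller than the lecture's 500,000, to keep the run short
ids = list(range(item_count))
random.shuffle(ids)

targets = random.sample(range(item_count), 100)

# Count how many id comparisons the linear scan performs in total.
comparisons = 0
for target in targets:
    for candidate in ids:
        comparisons += 1
        if candidate == target:
            break

average = comparisons / len(targets)
print(f"{comparisons} comparisons in total, about {average:.0f} per lookup")
```

With the ids shuffled, a scan typically has to look through a large share of the list before it finds its target, and that cost is paid again for every one of the 100 lookups.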
