Write Pythonic Code Like a Seasoned Developer Transcripts
Lecture: Stop using lists for everything
0:01 The first area I want to cover around dictionaries
0:03 is using dictionaries for performance.
0:06 So when you have a collection of data,
0:08 the most natural choice, the thing Python developers reach for first,
0:12 has got to be a list.
0:14 It's super common, super useful and very flexible.
0:17 Let's look at two different algorithms
0:20 one using a list and one using a dictionary
0:22 and we'll compare the relative performance.
0:24 So here we have a bit of code
0:26 and we'll look at this in detail in PyCharm in just a moment,
0:28 but here is the basic way it works.
0:31 So we start out with a list of data, here you can see in our example
0:34 we are going to be using half of million items,
0:37 so we are calling a data list and it's going to contain some rich objects
0:40 with multiple values or measurements - details aren't actually affected.
0:45 Then, suppose for some reason we have done some computation
0:48 and we've come up with a hundred pieces of data
0:51 we would like to go look up in our list,
0:53 so we are going to go loop over each of those 100
0:55 and then we are going to find the item in the list,
0:58 we can't just use interesting "item in list" in operator,
1:02 because we actually have to filter on a particular value,
1:06 what we did instead is wrote this method called find_point_by_id_in_list
1:10 and it just walks through the list, it compares the id
1:13 it's looking for against the ids that it finds in the list,
1:15 as soon as it finds the match, it returns that one,
1:17 it's assuming to be unique and then it appends it to this interesting_points.
1:21 So this is one version of the algorithm.
1:24 The other one is well, maybe we could do a little bit better
1:27 if we actually used a dictionary.
1:29 And then we could index - the key for the dictionary could be the id.
1:32 So if we wrote it like this, here we have a dictionary of half a million of items,
1:37 again the same dynamically discovered 100 interesting points,
1:42 and instead of doing this lookup by walking through the items,
1:45 we can actually map the id to the objects that we are looking for
1:49 so we can just index into the dictionary.
1:52 Obviously, indexing into the dictionary is going to be faster,
1:54 but possibly the computation,
1:57 the building of the dictionary itself might be way slower,
2:01 if we are going to do this a 100 times, we have half a million items, right,
2:04 it's much more complicated to generate a dictionary with half a million items
2:08 than it is a list.
2:10 So let's see this in action and see what the verdict is.
2:14 So here we have our data point, our data point is a named tuple,
2:18 it could have been a custom class but named tuple is sufficient
2:21 and it has five values: id, an "x y" for two dimensional coordinates,
2:26 a temperature and a quality on the measurement of the temperature.
2:30 You can see I have collapsed some areas of the code
2:32 because they don't really matter, these little print outs
2:34 I think they kind of make it hard to read.
2:35 In PyCharm you can highlight these and hit Command+.
2:37 and turn them into little collapsible regions,
2:40 so I did that so that you can focus on the algorithm and not the little details.
2:44 Here we have our data_list that we are going to work with,
2:47 and we are going to use a random seed of zeros
2:50 so we always do exactly the same thing, but randomly,
2:53 so that we have perfectly repeatable results
2:56 and then down here for each item of this range of half a million
2:58 each time through the list we are going to randomly construct
3:01 one of these data points and put it into our list.
3:04 Next, we do a little reordering on the list just to make sure
3:06 that we don't just randomly access it in order,
3:08 since we are using auto incrementing ids,
3:12 next we are going to create our set of interesting ids
3:14 that we are going to go search through our list,
3:16 and then later through our dictionary.
3:18 Really we would use some kind of algorithm
3:20 and we would find interesting items we need to go look up,
3:23 but in this case we are just going to randomly do it,
3:25 but there is a few Pythonic things going on here,
3:27 one - notice this statement here with the curly braces,
3:30 and then one item left to the "for",
3:33 that means what we are building here is a set using something
3:36 called a set comprehension and each item in the set
3:40 is going to be a random number between zero and the length of that list
3:43 which is half a million, so quite a large range there.
3:47 And we are just going to range across zero to 100.
3:50 The other thing to look at is we don't actually care about
3:53 the index coming out of the range, we just want to run this a 100 times.
3:57 In Python, when you are looping across something like this range set here
4:01 or you are possibly unpacking a tuple and there is only a few of the values,
4:06 not all the values you care about,
4:07 it's Pythonic to use the underscore for the variable name to say
4:11 "I must put something here but I actually have no concern what it is."
4:15 So, our interesting ids are interesting from a Pythonic perspective,
4:19 but now we have the set of approximately 100 ids,
4:22 assuming that there is no conflicts or duplication there,
4:27 and next thing we are going to do is we are going to come along here,
4:29 we are going to start a little timer, figure at the end what the total seconds pass were,
4:34 and during that time, we want to go and actually pull out the interesting points
4:38 that correspond to the interesting ids.
4:40 So we are going to go for each interesting id,
4:42 remember, it's about a 100, we are going to say "find the point in the list like so,
4:47 and then add it", and if we look quickly at this,
4:50 you can see we just go through each item on the list
4:52 and if the item matches the id we are looking for, we are done,
4:56 otherwise, we didn't find it.
4:58 So, just to get a base line, I am going to assume that this is slower,
5:01 let's go and run it and see what happens.
5:08 Remember, it's only the locating data in the list part that is actually timed.
5:13 All right, so this took 7.9 seconds
5:16 and here you can see there is a whole bunch of data points it found,
5:19 if we run it again, we get 8.4 seconds.
5:23 So it's somewhere around 7 to 8 seconds.
5:26 All right, so let's take this algorithm here and adapt it for our dictionary.
5:31 So I've got a little place holder to sort of drop in the timing and so on,
5:34 you don't have to watch me type that,
5:35 so the first thing we want to do is create a dictionary,
5:38 before we had data_list, now we are going to have data_dict,
5:40 we can create this using a dictionary comprehension,
5:43 so that would be a very Pythonic thing to do
5:45 and we want to map for each item in the dictionary the id to the actual object.
5:49 So, we create set and dictionary comprehensions like so
5:53 but the difference is we have a "key:value" for the dictionaries
5:57 where we just have the value for sets.
5:59 You kind of have to write this in reverse,
6:00 I am going to name the elements we are going to look at "d",
6:03 so I am going to say "d.id", maps the "d for d in data_list", right,
6:11 so this is going to create a dictionary of all half a million items
6:13 and mapping the id to the actual value.
6:16 So now, let's start a little timer,
6:18 and next we want to locate the items in the dictionary,
6:21 so again, we'll say interesting_points, let's clear that;
6:25 "for id", we call it "d.id" so it doesn't conflict with the id built in,
6:31 so "for d.id in interesting ids" we want to do a lookup,
6:35 we'll say "the data element is", now we have a dictionary
6:38 and we can look up things by ids, so that is super easy,
6:41 we just say it like so, assuming that there is none of the id
6:44 that is missing, something like that
6:45 and then we'll just say "interesting_points.append(d)"
6:53 Oops, almost made a mistake there,
6:54 let's say "d.id" not the built in, that of course won't work.
6:58 All right, so let's run it again and see how it works,
7:00 so we are going to run, it's still going to run the other slow version,
7:02 I'll skip that in the video,
7:05 wow, look at that, 8 seconds, and this is 0.000069 seconds.
7:13 So that's less than 1 millisecond, by a wide margin.
7:17 That is a non-trivial speed up, let's see how much of a speed up that is,
7:22 then the other thing to consider as well,
7:24 maybe the speed up was huge
7:25 but the cost of computing the dictionary was
7:28 more than offsetting the gains we had, let's try.
7:32 Wow, the speedup that we received was not one time faster,
7:36 two times faster, or ten times faster,
7:38 if this is data that we are going to go back into and back into,
7:41 we would create this dictionary and sort of reuse it,
7:45 where we get a speed up of a 128 000 times faster
7:49 and an algorithm that is actually easier to use than writing our silly list lookup
7:53 and it took literally one line of a dictionary comprehension,
7:57 that's a beautiful combination of how dictionaries work for performance,
8:00 bringing together these Pythonic ideas like dictionary comprehensions and so on,
8:05 it made our algorithm both easier and dramatically faster.
8:09 What if we had to create this dictionary just one time to do this work?
8:14 Maybe we should move this down and actually count the creation of the dictionary
8:18 as part of the computational time,
8:20 so let's see what we get if we run it that way.
8:23 Look at that, 8 seconds versus 0.2 seconds,
8:26 so even though it took a while to create that dictionary
8:29 it still took almost no time relative to our way more inefficient algorithm using lists,
8:34 we've got a 37 times speedup if every single time
8:37 we call this function or we do this operation we would have to recreate the dictionary,
8:41 it's still dramatically better and of course simpler as well.
8:45 Let's review that in a graphic.
8:47 So here we have two basically equivalent algorithms,
8:51 we have a bunch of data we are storing in a list,
8:54 half a million items, and then we are going to loop over them
8:56 and we are going to try to pull some items out,
8:58 by some particular property of the things contained in the list,
9:01 well if you are in that situation, dictionaries are amazing for it
9:04 and as you saw they are stunningly fast.
9:08 If we don't count the creation of the dictionary,
9:10 we had a 130 000 times faster the bottom algorithm to the top algorithm.
9:15 So I am sure you all thought well dictionary is probably faster,
9:18 but did you think it would be a 130 000 times faster,
9:21 that's really cool, right? It basically means that becomes free to do that lookup,
9:26 and even if we had to recreate the dictionary every time,
9:29 it's still 37 times faster, which is an amazing speedup.