Python 3, an illustrated tour Transcripts
Lecture: Walk-through: Unicode
0:00 In this video, we're going to look at unicode test,
0:03 let's open that up in the editor that you want.
0:06 I'm going to run it. You can run it from your command line
0:09 by just invoking Python on the file,
0:12 or in PyCharm you can right-click and choose Run.
0:15 You should get three errors here. Let's go to the first error.
0:20 On line 10 we get a NameError, so here's line 10.
0:24 And in this function, it's called test 1.
0:27 It says the following line is from Yu Lou Chun by Dao Ren Bai Yun.
0:31 There's a link to Project Gutenberg there. It says to convert the line to utf-8 bytes
0:36 stored in the variable utf8_txt.
0:40 So up here we have a unicode string and we're going to convert that to bytes.
0:43 Let's see how we do that.
0:45 We're going to make a variable called utf8_txt that is equal to the result,
0:50 and on txt, we need to call the encode method.
0:53 So we're going to encode the string and we're going to encode it as utf-8 bytes
1:00 so we can say utf-8, and that should give us a new variable that actually is bytes.
1:10 Let's run this and see if it works.
1:13 Note that our test here is just asserting that the last five characters are these bytes.
1:17 It's also asserting the length of the bytes.
1:23 Okay, so we have one that's passed now.
1:25 So the thing to remember is that if you have a string, a unicode string
1:29 if you want to change it into bytes that process is called encoding,
1:34 you don't decode a string, you decode bytes back into a string.
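
As a minimal sketch of the encoding step just described: the sample string below is a hypothetical stand-in (the exercise file uses a line from Project Gutenberg instead), and the name utf8_txt follows the exercise's wording.

```python
# Hypothetical sample text, not the line used in the exercise file.
txt = "你好, world"

# str.encode turns a unicode string into bytes using the named encoding.
utf8_txt = txt.encode("utf-8")

print(type(utf8_txt))  # <class 'bytes'>
print(len(txt))        # 9 characters
print(len(utf8_txt))   # 13 bytes: the two Chinese characters take 3 bytes each

# In Python 3 the directions are one-way: str only encodes, bytes only decode.
print(hasattr(txt, "decode"))       # False
print(hasattr(utf8_txt, "encode"))  # False
```

The character count and the byte count differ because utf-8 uses a variable number of bytes per character.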
1:40 Okay, here's another line or probably the same line
1:44 convert the line to big5, another Chinese encoding and store it in big5_txt.
1:53 So big5_txt = txt, and remember that txt is a string here,
2:00 and we want to encode that string as big5.
2:06 Let's run that and see if that works.
2:11 Okay, it looks like it worked, we have 2 passed.
2:14 One thing to note is that the length of the big5 encoding
2:18 is 74 bytes for that same string, whereas above,
2:21 when it's utf-8 encoded, it's 111 bytes.
2:25 So there are some compromises that utf-8 makes
2:28 but in general, utf-8 is one of the most widely used encodings on the internet.
2:33 So it's a pretty good encoding to use
2:35 even though it might be a little bigger than other encodings.
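
To see the size difference on a smaller scale, here is a rough sketch that encodes the same string both ways. The three-character sample is a hypothetical stand-in for the line in the exercise, so the byte counts here are not the 74 and 111 from the test.

```python
# Hypothetical sample: three common traditional Chinese characters.
sample = "中文字"

big5_bytes = sample.encode("big5")
utf8_bytes = sample.encode("utf-8")

# big5 uses 2 bytes for each of these characters, utf-8 uses 3.
print(len(big5_bytes))  # 6
print(len(utf8_bytes))  # 9
```

The trade-off is coverage: utf-8 can represent any unicode character, while big5 only covers the characters in its own table.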
2:39 Okay, test three: the following is utf-8 bytes; decode it into a variable called result.
2:45 So we have some bytes here and we're going to make a variable called result
2:49 and we're going to take our unknown bytes and we're going to decode it.
2:53 Again, we don't encode bytes, bytes are already encoded for us.
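
A small sketch of the decoding direction, assuming a made-up bytes value (the bytes in the exercise file are different). The name unknown mirrors the variable mentioned in the walk-through.

```python
# Hypothetical utf-8 bytes: the word "Python" followed by a snake emoji.
unknown = b"Python \xf0\x9f\x90\x8d"

# bytes.decode turns bytes back into a unicode string.
result = unknown.decode("utf-8")

print(result)        # Python 🐍
print(len(unknown))  # 11 bytes
print(len(result))   # 8 characters: the emoji is 4 bytes but 1 character
```

You have to know (or be told) which encoding the bytes are in; the bytes themselves don't carry that information.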
3:00 Okay, let's run this and make sure it works
3:07 It looks like we're good to go.
3:11 So let's just for fun put a little breakpoint here
3:16 and see if we can see what unknown is.
3:19 I'll move the breakpoint down one level here
3:27 Okay, here's result.
3:29 And if you look at result, it says that snake makes your head
3:34 and then it says spin upside down.
3:37 Okay, cool. Thanks for watching this video.
3:39 Hopefully, you have a better understanding of unicode and bytes
3:42 and the conversion between those two.
3:45 Again, if you have a unicode string, you encode it to bytes,
3:49 and if you have bytes, you decode them to a unicode string.
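
As a closing sketch of the two directions together, using a hypothetical sample string: a round trip only gives back the original text when you decode with the same encoding that was used to encode.

```python
# Hypothetical sample text.
original = "中文字"

# Encode to bytes, then decode with the same encoding: we get the string back.
round_trip = original.encode("big5").decode("big5")
print(round_trip == original)  # True

# Decoding with a different encoding can fail outright.
try:
    original.encode("big5").decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError:
    # These big5 bytes are not valid utf-8, so Python raises an error.
    mismatch_failed = True

print(mismatch_failed)  # True
```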