Python 3, an Illustrated Tour Transcripts
Chapter: Strings
Lecture: Walk-through: Unicode

0:00 In this video, we're going to look at the unicode test. Let's open that up in the editor of your choice.
0:07 I'm going to run it. You can run it from your command line by just invoking Python on the file, or in PyCharm you can right-click and say run.
0:16 You should get three errors here. Let's go to the first error. On line 10 we get a NameError, so here's line 10.
0:25 And in this function, it's called test 1. It says the following line is from Yu Lou Chun by Dao Ren Bai Yun.
0:32 There's a link to Project Gutenberg there. It says convert the line to utf-8 bytes stored in the variable utf8_txt.
0:41 So up here we have a unicode string and we're going to convert that to bytes. Let's see how we do that.
0:46 We're going to make a variable called utf8_txt, set it equal to txt, and on txt we need to call the encode method.
0:54 So we're going to encode the string and we're going to encode it as utf-8 bytes
1:01 so we can say utf-8, and that should give us a new variable that actually is bytes. Let's run this and see if it works.
1:14 Note that our test here is just asserting that the last five characters are these bytes. It's also asserting the length of the bytes.
1:24 Okay, so we have one that's passed now. So the thing to remember is that if you have a string, a unicode string
1:30 if you want to change it into bytes, that process is called encoding. You don't decode a string; you decode bytes back into a string.
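As a minimal sketch of the encoding step in test 1: the sample string below is a hypothetical stand-in, since the actual line from the exercise file isn't reproduced in this transcript. Calling `encode` on a string with an encoding name returns a `bytes` object.

```python
# Hypothetical sample text; the exercise uses a longer line of Chinese.
txt = "你好"

# Encoding goes from str (unicode) to bytes.
utf8_txt = txt.encode("utf-8")

print(type(utf8_txt))  # <class 'bytes'>
print(len(txt))        # 2 characters
print(len(utf8_txt))   # 6 bytes
print(utf8_txt)        # b'\xe4\xbd\xa0\xe5\xa5\xbd'
```

Note that the length of the string counts characters, while the length of the encoded result counts bytes.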
1:41 Okay, here's another line, or probably the same line: convert the line to big5, another Chinese encoding, and store it in big5_txt.
1:54 So big5_txt = txt, so we have a string here and we want to encode that string as big5. Let's run that and see if that works.
2:12 Okay, it looks like it worked, we have 2 passed. One thing to note is that the big5 encoding of that same string is 74 bytes, versus above,
2:22 when it's utf-8 encoded it's 111 bytes. So there are some compromises that utf-8 makes
2:29 but in general, utf-8 is one of the most widely used encodings on the internet. So it's a pretty good encoding to use
2:36 even though it might be a little bigger than other encodings. Okay, test three: the following is utf-8 bytes, decode it into a variable result.
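To illustrate the size difference noted for test 2, here is a small sketch comparing the two encodings on a hypothetical two-character sample (not the line from the exercise). Big5 uses two bytes for each of these characters and utf-8 uses three, which is consistent with the 74 versus 111 bytes mentioned above.

```python
# Hypothetical sample text; the exercise uses a longer line.
txt = "你好"

big5_txt = txt.encode("big5")
utf8_txt = txt.encode("utf-8")

print(len(big5_txt))  # 4 bytes, two per character
print(len(utf8_txt))  # 6 bytes, three per character
```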
2:46 So we have some bytes here and we're going to make a variable called result and we're going to take our unknown bytes and we're going to decode it.
2:54 Again, we don't encode bytes, bytes are already encoded for us. Okay, let's run this and make sure it works. It looks like we're good to go.
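The decoding step in test 3 can be sketched like this. The bytes below are a hypothetical example, not the bytes from the exercise file; the variable names `unknown` and `result` follow the ones used in the video.

```python
# Hypothetical utf-8 bytes; the exercise supplies its own.
unknown = b"caf\xc3\xa9"

# Decoding goes from bytes back to str (unicode).
result = unknown.decode("utf-8")

print(type(result))   # <class 'str'>
print(result)         # café
print(len(unknown))   # 5 bytes
print(len(result))    # 4 characters
```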
3:12 So let's just for fun put a little breakpoint here and see if we can see what unknown is. I'll move the breakpoint down one level here.
3:28 Okay, here's result. And if you look at result, it says that snake makes your head and then it says spin upside down.
3:38 Okay, cool. Thanks for watching this video. Hopefully, you have a better understanding of unicode and bytes and the conversion between those two.
3:46 Again, if you have a unicode string, you encode it to bytes, and if you have bytes, you decode them to a unicode string.
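As a quick recap of that rule, here is a small round-trip sketch using a hypothetical sample string. It also checks that in Python 3 a string has no `decode` method and bytes have no `encode` method, so the direction can't be mixed up.

```python
# Hypothetical sample text for the round trip.
txt = "你好"

data = txt.encode("utf-8")    # str -> bytes
back = data.decode("utf-8")   # bytes -> str

print(back == txt)               # True
print(hasattr(txt, "decode"))    # False: you don't decode a string
print(hasattr(data, "encode"))   # False: you don't encode bytes
```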

