Python 3, an Illustrated Tour Transcripts
Chapter: Strings
Lecture: Walk-through: Unicode
0:00
In this video, we're going to look at unicode test. Let's open that up in the editor of your choice.
0:07
I'm going to run it. You can run it from your command line by just invoking Python on the file, or in PyCharm you can right-click and say Run.
0:16
You should get three errors here. Let's go to the first error. On line 10 we get a NameError, so here's line 10.
0:25
And in this function, it's called test 1. It says the following line is from Yu Lou Chun by Dao Ren Bai Yun.
0:32
There's a link to Project Gutenberg there. It says: convert the line to utf-8 bytes stored in the variable utf8_txt.
0:41
So up here we have a unicode string and we're going to convert that to bytes. Let's see how we do that.
0:46
We're going to make a variable called utf8_txt, and to set it we take txt and call the encode method on it.
0:54
So we're going to encode the string and we're going to encode it as utf-8 bytes
1:01
so we can say utf-8, and that should give us a new variable that actually is bytes. Let's run this and see if it works.
1:14
Note that our test here is just asserting that the last five bytes of the result match these bytes. It's also asserting the length of the bytes.
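The kind of check described here can be sketched as follows. This is only an illustration: the string `sample` and the byte values are made up for the example and are not the line used in the exercise file.

```python
# Hypothetical sample text, not the line from the exercise.
sample = "你好"

# Encode the str to UTF-8 bytes.
utf8_bytes = sample.encode("utf-8")

# Slicing bytes gives bytes, so the tail can be compared to a bytes literal.
tail = utf8_bytes[-3:]

# len() on bytes counts bytes, not characters.
print(len(sample), len(utf8_bytes), tail)
```

Each of these two characters takes three bytes in UTF-8, so the byte length is three times the character count.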
1:24
Okay, so we have one that's passed now. So the thing to remember is that if you have a string, a unicode string
1:30
if you want to change it into bytes, that process is called encoding. You don't decode a string; you decode bytes back into a string.
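Here is a minimal sketch of that direction rule in Python 3. The variable names and the sample text are assumptions for illustration only.

```python
# A str holds unicode characters; encoding turns it into bytes.
text = "naïve café"
data = text.encode("utf-8")

# In Python 3, str has no decode method, so going the wrong way fails.
try:
    text.decode("utf-8")
    wrong_way_failed = False
except AttributeError:
    wrong_way_failed = True

# Decoding the bytes gives back the original string.
round_trip = data.decode("utf-8")
print(type(data).__name__, wrong_way_failed, round_trip == text)
```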
1:41
Okay, here's another line, or probably the same line: convert the line to big5, an encoding for Chinese text, and store it in big5_txt.
1:54
So big5_txt = txt.encode. We have a string here, and we want to encode that string as big5. Let's run that and see if that works.
2:12
Okay, it looks like it worked, we have 2 passed. One thing to note is that the length of the big5 encoding is 74 bytes on that same string, versus above,
2:22
when it's utf-8 encoded it's 111 bytes. So there are some compromises that utf-8 makes
2:29
but in general, utf-8 is one of the most widely used encodings on the internet. So it's a pretty good encoding to use
2:36
even though it might be a little bigger than other encodings. Okay, test three: the following is utf-8 bytes, decode it into a variable called result.
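The size difference between the two encodings can be sketched with a short example. The string `sample` is an assumption chosen for illustration, not the line from the exercise. Common Chinese characters take two bytes each in big5 and three bytes each in utf-8, which matches the 74 versus 111 bytes seen above.

```python
# Hypothetical two-character sample, not the exercise text.
sample = "中文"

utf8_bytes = sample.encode("utf-8")
big5_bytes = sample.encode("big5")

# Same characters, different byte counts per encoding.
print(len(sample), len(utf8_bytes), len(big5_bytes))

# Each encoding decodes back to the same string,
# as long as the matching codec name is used.
same_text = utf8_bytes.decode("utf-8") == big5_bytes.decode("big5")
```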
2:46
So we have some bytes here and we're going to make a variable called result and we're going to take our unknown bytes and we're going to decode it.
2:54
Again, we don't encode bytes; bytes are already encoded for us. Okay, let's run this and make sure it works. It looks like we're good to go.
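A small sketch of decoding, assuming some made-up bytes rather than the ones in the exercise. It also shows that the codec has to match: the same bytes cannot be decoded as ASCII.

```python
# Hypothetical UTF-8 bytes: a snake emoji followed by ASCII text.
raw = b"\xf0\x9f\x90\x8d Python"

# Decoding bytes gives a str.
result = raw.decode("utf-8")

# bytes has no encode method in Python 3.
bytes_can_encode = hasattr(raw, "encode")

# Decoding with the wrong codec raises UnicodeDecodeError.
try:
    raw.decode("ascii")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(result, bytes_can_encode, wrong_codec_failed)
```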
3:12
So let's just for fun put a little breakpoint here and see if we can see what unknown is. I'll move the breakpoint down one level here.
3:28
Okay, here's result. And if you look at result, it says that snake makes your head and then it says spin upside down.
3:38
Okay, cool. Thanks for watching this video. Hopefully, you have a better understanding of unicode and bytes and the conversion between those two.
3:46
Again, if you have a unicode string, you encode it to bytes, and if you have bytes, you decode them to a unicode string.