Python 3, an Illustrated Tour Transcripts
Chapter: Strings
Lecture: Unicode
0:00
In this video we're going to talk about unicode. There are a few terms that we need to know before we can understand unicode and how Python handles it.
0:09
So let's talk about these terms. The first term is a character and a character is a single letter
0:15
something that you would type and it would print on the screen. There's a little bit of ambiguity between character and glyph:
0:21
glyph is the visual representation of said character. So if we think of the character A in the English alphabet
0:30
A is a single letter and there's a visual representation of A actually uppercase or lowercase.
0:37
So the glyph would be the representation of it, a is the actual character. There's also what's called a code point
0:45
and a code point is a numeric description of a character. And each character or glyph has a unique numeric description.
0:54
Typically this is represented as a hex number and this is also where unicode comes from.
1:00
This is a universal code that represents this character or glyph. Another term that we need to know is an encoding
1:07
an encoding is a mapping from a byte stream to a code point. We'll understand this a little bit more later, but basically, you can think of a code point
1:17
as a universal way of understanding something and when we want to tell someone else about it or tell a computer or send it over the network,
1:25
we encode that character into some encoding, so typical encodings will include ASCII or utf-8
1:33
there are other encodings as well, we'll look at a few of them. Here's an example. So there's a character called Omega and it has a glyph
1:41
and it looks sort of like a horseshoe Ω you might be familiar with it if you've done some physics,
1:46
it has a code point, so the code point, we put a capital U in front of it it just stands for unicode,
1:54
and the code point is 2126. Note that that is a hex number. There are also a couple of encodings represented here,
2:00
so one encoding is the byte string consisting of e2 84 and a6, this is the utf-8 encoding for the Omega character or glyph
2:11
or the 2126 unicode code point. There's also a utf-16 encoding: ff, fe, and then &! at the end. Note that these are two different encodings
2:26
and their byte streams look different. Here's how we do it in Python. One thing to be aware of in Python 3
2:32
is that all strings in Python are unicode strings. We'll talk a little bit about how they're represented internally,
2:36
but if I have the glyph, I have a way to type it I can just type it into a string. I can also copy and paste it from a webpage or whatnot.
2:45
If I don't have the glyph or I don't want to type it but I do have the code point I can insert that
2:50
by putting a \u or \U depending on how long the hex string is. If the hex string is 4 characters, then I use a \u with a lowercase u;
3:02
if the hex string is longer than 4 characters then I'm going to put an upper case U
3:07
and I'm going to left pad it with zeros until I get to 8 characters. I can also use the unicode name and in this case the name is OHM SIGN
3:17
and I put a \N and then I put the name in curly braces. A fourth way to get this unicode string is by passing in this number here
3:29
and 8486 is the decimal version of 2126. So if I pass that into the chr function that will give me a character from an ordinal number
3:38
and that's the unicode ordinal. Note that I can print this out to the screen and it will print out the Omega character
3:46
and I can test if all these characters are indeed equal or equivalent to one another and they are.
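The four ways of building the string described above can be sketched as follows. This is a minimal illustration rather than the code from the slide; the variable names are made up for the example, and the pasted-glyph form is left out so the snippet does not depend on which omega-like character gets copied.

```python
# Four spellings of the same one-character string, U+2126 (OHM SIGN).
by_short_escape = "\u2126"       # lowercase \u takes exactly 4 hex digits
by_long_escape = "\U00002126"    # uppercase \U takes exactly 8 hex digits
by_name = "\N{OHM SIGN}"         # \N{...} takes the unicode name
by_ordinal = chr(8486)           # 8486 is the decimal form of hex 2126

# All four compare equal because they hold the same code point.
all_equal = by_short_escape == by_long_escape == by_name == by_ordinal
print(all_equal, hex(ord(by_short_escape)))
```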
3:51
Another thing that you might want to be aware of in Python is a module included in Python called unicodedata.
3:56
And if you import unicodedata, you can pass a single-character string into its name function and it will tell you what the name is.
4:04
So in this case, we have the Ω character in there and unicodedata.name says that the unicode name of this is OHM SIGN.
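A short sketch of that lookup, using the standard library unicodedata module. The lookup call going the other way is an addition for illustration and is not mentioned in the video.

```python
import unicodedata

omega = "\u2126"

# name() takes a single-character string and returns its unicode name.
name = unicodedata.name(omega)

# lookup() goes the other way: from a name back to the character.
back = unicodedata.lookup(name)

print(name, back == omega)
```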
4:14
Let's look at another example really quickly. There's a character called superscript two and that's, if you're familiar with math, like when you write x²:
4:24
the squared would be the glyph; the number 2 raised up slightly higher is the superscript two.
4:30
It has a unicode code point, in this case, it's hex B2 (178 in decimal), and we can see two encodings here,
4:38
here's a utf-8 encoding and we can also see a Windows 1252 encoding. Now, where'd you get these code points?
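For the superscript two example, the two encodings can be reproduced with a sketch like this. The name cp1252 is Python's codec name for Windows 1252; the variable names are only for illustration.

```python
sup2 = "\u00b2"                      # SUPERSCRIPT TWO, decimal 178

utf8_bytes = sup2.encode("utf-8")    # two bytes in utf-8
win_bytes = sup2.encode("cp1252")    # a single byte in Windows 1252

# Same character, two different byte sequences.
print(ord(sup2), utf8_bytes, win_bytes)
```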
4:45
Where do you understand what the master data is? If you want to find them out, you can go to a website called unicode.org.
4:54
There's a consortium there that occasionally releases new mappings, but they have charts that you can download that map letters
5:03
to unicode character codes or code points. Here's an example of one of the charts. You'll see something like this.
5:11
This is for the Emoji chart and you can see that there is along the top, we've got a hex number here
5:19
and then we've got another hex number here on the left-hand side. And when you concatenate those two you get this hex number at the bottom here,
5:28
and that is the code point for the smiley face here. And then the next one is the sort of normal face,
5:36
and then there's a frowny face and a crying face and a surprised face. The chart also contains a table that looks like this
5:43
that has the code point name and glyph all in one place here. Right here we have the code point 1F600, we have the glyph which is the smiley face
5:54
and we have the actual name, which isn't smiley, but grinning face. Note that it's capitalized and there is a space between the words.
6:00
One thing to note is that the code point for this 1F600 is longer than four characters. So in order to represent that using the code point
6:10
we need to put a capital U and then we need to pad that with three zeros to get 8 characters in that case. We can also use the name with a \N.
6:20
If we have access to the glyph or keyboard that types Emoji we can put that directly into a string.
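A small sketch of the escape forms for the grinning face, assuming a Python whose unicode database knows this character (any recent Python 3 does). The variable names are hypothetical.

```python
# 1F600 has five hex digits, so it needs \U padded to eight digits.
by_code_point = "\U0001F600"
by_name = "\N{GRINNING FACE}"

same = by_code_point == by_name

# It is still one character, even though the code point is above FFFF.
print(same, len(by_code_point), hex(ord(by_code_point)))
```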
6:26
Note that here I've also got the utf-8 version of the encoding of grinning face. If I have that byte stream encoded as utf-8 bytes,
6:34
I can decode it back to unicode using the decode method and the appropriate encoding that it was encoded as
6:41
and when I say decode from utf-8, I will get back the unicode string for that. Let's talk about how things are stored in Python.
6:49
Everything internally is stored as two or four bytes and there's internal encodings, these are called UCS2 and UCS4,
6:59
and how your Python was compiled determines how your unicode strings are stored. (Since Python 3.3, CPython instead picks 1, 2, or 4 bytes per character for each string.)
7:05
So one thing to be aware of because all strings in Python 3 are unicode strings, and these are stored as UCS2 or UCS4 byte strings internally,
7:17
there's typically a 2X to 4X increase in the size of memory needed to store strings in Python 3 versus Python 2.
7:25
In practice, that doesn't really make so much of a difference on modern machines unless you're dealing with huge files,
7:31
but just something to be aware of. Also note that bytes in Python 3 are not the same as Python 2 strings.
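One way to see how Python 3 bytes differ from strings is to index and iterate over them. This is a minimal sketch with made-up variable names.

```python
data = b"abc"

first = data[0]          # indexing bytes gives an int, not a 1-byte string
as_ints = list(data)     # iterating gives the integer value of each byte

# bytes can be rebuilt from a list of integers in the range 0..255.
rebuilt = bytes([97, 98, 99])

print(first, as_ints, rebuilt == data)
```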
7:38
So bytes in Python 3 are simply arrays of integers. Let's talk about encodings a little bit more, encodings map bytes to code points.
7:49
A common misconception is that an encoding is a unicode number. So utf-8 is an encoding. It is not a code point.
7:59
This is an encoding of a code point, just to be pedantic about that, utf-8 is an encoding of characters, it is not unicode per se.
8:08
Unicode is always encoded to bytes, and the reverse is always true: bytes are decoded into unicode.
8:16
Note that you can't take unicode and decode it, you encode it. Also, the same with bytes— you can't take bytes and encode them,
8:24
they are already encoded, you can only decode them to unicode. Here's an example here. We have the string with Omega in it.
8:33
And I created it with the code point and then if I wanted to encode that as utf-16, I say encode, I call the encode method on that
8:43
and I pass in the encoding utf-16 and it returns back a byte string, again, note that c is a unicode string
8:52
and the result of that is a byte string coming out of that. If I want to encode c as utf-8, I simply call the encode method and pass in utf-8.
9:02
Note that these encodings are different, utf-16 and utf-8 have different encodings.
9:07
Now, once I have these bytes, I can go back and get the original string from it. So I don't encode bytes, I always decode bytes
9:15
and here I'm taking the utf-8 bytes and decoding them calling the decode method on them to get back a unicode string.
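The encode and decode round trip can be sketched like this. The exact utf-16 bytes depend on the machine's byte order because Python's utf-16 codec writes a byte order mark followed by native-order data, so the sketch only checks their length. The hasattr checks illustrate that the opposite-direction methods do not exist.

```python
c = "\u2126"

utf8_bytes = c.encode("utf-8")     # b'\xe2\x84\xa6'
utf16_bytes = c.encode("utf-16")   # 2-byte BOM + 2 bytes for the character

# Decoding with the matching encoding gives back the original string.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16")

# str only encodes, bytes only decodes.
str_has_decode = hasattr(c, "decode")
bytes_has_encode = hasattr(utf8_bytes, "encode")

print(utf8_bytes, len(utf16_bytes), str_has_decode, bytes_has_encode)
```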
9:24
Here's a chart that just shows what we do if we have a unicode string, we always encode it to a byte string,
9:32
likewise if we have a byte string, we always decode it. We can't do the opposite, the byte string doesn't even have an encode method,
9:39
likewise, the unicode string doesn't have a decode method. There are some errors you can get when you're dealing with unicode,
9:45
here's a pretty common one here. I've got the Omega sign here in a variable called c. And if I try to encode that as ASCII,
9:53
I'm going to get a unicode encode error. And the problem is that the ASCII character set doesn't have an encoding for this character.
10:02
And so that's what this error means: the 'ascii' codec can't encode character '\u2126' in position 0.
10:11
This is a pretty common error when you start dealing with unicode. So again, what this error means is that you have a string
10:18
and you're trying to encode it to a byte encoding that doesn't have a representation for that.
10:23
There are some encodings that have representations for all of unicode, so utf-8 is a good choice, but ASCII does not,
10:30
it only has a limited number of characters that it can encode. Here, we'll try to encode this Omega character again:
10:37
we'll call encode with Windows 1252, a common encoding that was found in Windows during the last century, and we'll get the same error here.
10:47
Well, similar error, we are getting unicode encode error and that it can't be encoded into Windows 1252.
10:54
On the other hand, if we try and encode it into cp949, this is a Korean encoding, we get a byte string.
11:01
So this Korean encoding has the ability to support the Omega character. Now be careful, once you have bytes encoded, you need to decode them typically.
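The encode errors described above can be reproduced with a sketch along these lines. The codec names ascii, cp1252 and cp949 are Python's names for these encodings; the loop and variable names are just for illustration.

```python
c = "\u2126"

# Encodings with no representation for this character raise UnicodeEncodeError.
failed = []
for encoding in ("ascii", "cp1252"):
    try:
        c.encode(encoding)
    except UnicodeEncodeError:
        failed.append(encoding)

# The Korean code page does have a representation, so this returns bytes.
kor = c.encode("cp949")

print(failed, kor)
```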
11:11
Typically, you only encode them to send them over the wire or to save them as a file or send them over the network, that sort of thing.
11:18
But when you're dealing with them, you want them as unicode strings. So a lot of times, you'll get data and you'll need to decode it to be able to deal with it.
11:25
Here we have the variable core which has the bytes for the Omega sign encoded in Korean. Now if we have those bytes and we call decode
11:32
and we say I want to decode these bytes assuming that they were in utf-8 I'm going to get an error here, that's a unicode decode error.
11:43
And this says I got bytes and I'm trying to decode them as utf-8, but there aren't utf-8 bytes that make sense here. So this is a unicode decode error,
11:56
typically what this means is you have bytes and you are decoding them from the wrong encoding.
12:00
Note that we encoded as Korean, we need to decode from Korean as well. Now even more nefarious is this example down here.
12:07
We have the Korean bytes, and we're decoding them but we're decoding them as Turkish.
12:13
And apparently the combination of Korean bytes is valid Turkish bytes, but it's not the Omega sign, it's a different sign.
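A sketch of both decoding problems. The video only says "Turkish", so the choice of cp1254 (the Windows Turkish code page) is an assumption here, as are the variable names.

```python
omega = "\u2126"
kor = omega.encode("cp949")   # bytes encoded with the Korean code page

# Wrong encoding, and the bytes are not valid utf-8: an exception.
try:
    kor.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Wrong encoding, but the bytes happen to be valid: no exception,
# just the wrong characters. cp1254 is an assumed stand-in for "Turkish".
wrong = kor.decode("cp1254")

# Matching encoding: the original character comes back.
right = kor.decode("cp949")

print(utf8_failed, wrong == omega, right == omega)
```

The silent case is the one to watch for, since nothing in the program signals that the text is wrong.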
12:21
This is known as mojibake, that's a Japanese term that means messed up characters, I believe. And so this is a little bit more problematic,
12:30
you've decoded your characters, but you have the wrong characters, because you decoded them in the wrong encoding,
12:38
so be careful about your encoding, you want to be explicit here and you want to make sure that your encoding and decoding
12:46
match up with the same encoding. Here's a chart that represents the various things you can do with characters and the conversions
12:52
that you can do on the single character. Note that if we have a string here, this box right here is various ways to represent ASCII character T.
13:02
We can convert that to an integer by calling ord on it and we can go back by calling chr.
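The ord and chr conversion for the character T looks like this in a minimal sketch:

```python
code = ord("T")     # string of length one -> integer code point
char = chr(code)    # integer code point -> string of length one

print(code, hex(code), char)
```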
13:09
We can also get bytes by calling bytes with the encoding that we want and we can put our bytes into a file
13:18
if we open the file in binary write mode. If we have a string and we want to write it to a file, we just need to open it with the w mode.
13:28
There are a couple errors that you might see. You might try and open a file for writing with bytes and you'll get an error, that's the type error,
13:37
you have to use a string and not bytes if you're opening to write it in text mode. Similarly, if I have a string and I open the file in binary mode
13:47
I'm getting an error that says string does not support the buffer interface. So these are errors that you might see with an ASCII character.
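The file-mode rules can be sketched as follows. The temporary directory and file name are hypothetical, and the exact wording of the TypeError message varies between Python versions, so the sketch only records that a TypeError occurred.

```python
import os
import tempfile

text = "T"
raw = bytes(text, "ascii")   # same result as text.encode("ascii")

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

# Matching pairs work: str with text mode, bytes with binary mode.
with open(path, "w", encoding="ascii") as f:
    f.write(text)
with open(path, "wb") as f:
    f.write(raw)

# Mismatched pairs raise TypeError.
errors = []
with open(path, "w", encoding="ascii") as f:
    try:
        f.write(raw)
    except TypeError:
        errors.append("bytes written in text mode")
with open(path, "wb") as f:
    try:
        f.write(text)
    except TypeError:
        errors.append("str written in binary mode")

print(errors)
```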
13:54
This chart shows some of the errors that you might see with unicode characters. Here we've got the string here which has Ω
14:01
and we can see that we can convert it to an integer. We can also encode it as bytes, in this case we're encoding it as utf-8 bytes.
14:10
Now note that if I try and decode this sequence as Windows 1252 that will pass, but I'll get a messed up mojibake.
14:20
So again, we need to make sure that this decoding has the same encoding as the encoding call, which was utf-8.
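A sketch of that mismatch: the utf-8 bytes decode without error as Windows 1252 (cp1252 in Python), but each byte turns into its own unrelated character. Variable names are illustrative.

```python
omega = "\u2126"
as_int = ord(omega)                     # 8486

utf8_bytes = omega.encode("utf-8")      # three bytes

# Decoding with a different encoding succeeds but gives the wrong text:
# one character per byte instead of a single omega.
garbled = utf8_bytes.decode("cp1252")

# Decoding with the encoding that was used to encode restores the string.
restored = utf8_bytes.decode("utf-8")

print(as_int, len(garbled), restored == omega)
```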
14:30
We also see some of the other errors that we have if we try and encode with a different encoding that's not supported,
14:36
we might get a unicode encode error. So Windows 1252 or ASCII, those both give us errors
14:43
and note that we can't call decode on a string, we can only encode a string. So those are some of the things that you need to be aware of.
14:51
Typically, if you get these unicode encode errors, that means that you're trying to call encode and you're using the wrong encoding there.
14:59
So try and figure out what your encoding is. A common encoding these days is utf-8. Okay, we've been through a lot in this video.
15:05
Hopefully, you have some understanding of how Python handles unicode and how we can change these unicode strings into byte strings
15:13
to serialize or send over the wire. Hopefully, you also understand some of the errors you might run into and how to deal with those errors.
15:22
If you're sure what your encoding is, that can eliminate a lot of the issues that you might run into.