Python 3, an Illustrated Tour Transcripts
Chapter: Strings
Lecture: Unicode in Files
0:01
In this video we'll discuss unicode in files in Python 3. We talked about unicode in Python 3 and that Python 3 handles unicode natively
0:10
and the strings are natively unicode. One thing to be aware of is that when Python reads in a text file, it's going to use the system encoding
0:19
to determine what the encoding is on that file. So you can run this command right here, locale.getpreferredencoding with False passed in,
0:29
and it will tell you what the encoding is on your system. Typically, on most systems that's UTF-8; if that's not the case, you should be aware of that.
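As a minimal sketch of the check described above, the snippet below asks the locale module for the preferred encoding. The variable name default_encoding is only illustrative, and the value printed depends on your platform and settings.

```python
import locale

# Ask for the encoding Python typically falls back to when open() is
# called in text mode without an explicit encoding argument.
# Passing False means the call does not change the process locale.
default_encoding = locale.getpreferredencoding(False)

# The result varies by system (for example "UTF-8" or "cp1252").
print(default_encoding)
```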
0:37
And in any case, you should be explicit about what your files are encoded in. Here's an example of being explicit with writing output.
0:46
I have a Unicode string that has the ohm (Ω) character in it. Again, ASCII can't handle this, but the cp949 encoding can; that's a Korean encoding.
0:57
And so I'm going to make a file called ohm.kor and I'm going to write to it. Note that I'm using the 'w' mode, I'm not saying binary,
1:07
because I'm writing out a string to it. So if you're writing out text, you open the file in plain read or write mode, not in binary mode.
1:16
And then I specify the encoding being explicit here and I'm saying that I'm going to encode this string as the Korean cp949 encoding
1:26
and then with my file, I can call write and write my data out. Now, this is a case where if I tried to read the file without specifying the encoding
1:36
the encoding on my system again is UTF-8, and if I simply try to open the file for reading and read from it, I'll get a UnicodeDecodeError
1:43
saying that the utf-8 codec can't decode that byte sequence. That's because there is some combination of bytes in the Korean-encoded data
1:50
that utf-8 doesn't know how to decode but if I specify my encoding here and I'm explicit
1:56
then I can read that data back and get back my original string. Now, this used Korean; typically, most files you're going to see these days are UTF-8.
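The round trip described above can be sketched roughly as follows. This is an illustration under a few assumptions, not the exact code from the video: the string contents, the temporary directory, and the names data, path, decode_failed and result are all made up for the example. The failing read passes encoding="utf-8" explicitly so that it behaves the same as a system whose default is UTF-8.

```python
import os
import tempfile

# A string containing the ohm sign, which ASCII cannot represent.
data = "Ohm (\N{OHM SIGN})"

# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "ohm.kor")

# Text mode ("w", not "wb") with an explicit encoding.
with open(path, "w", encoding="cp949") as fout:
    fout.write(data)

# Reading with the wrong encoding fails, because the cp949 bytes
# are not a valid UTF-8 sequence.
try:
    with open(path, encoding="utf-8") as fin:
        fin.read()
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Reading with the matching encoding returns the original string.
with open(path, encoding="cp949") as fin:
    result = fin.read()

print(decode_failed, result == data)
```

The only difference between the failing read and the working one is the encoding argument, which is the point of being explicit.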
2:06
So this just shows us an example of being explicit. By being explicit, we can get around these encoding issues. If we happen to have binary data,
2:14
note that binary data is what we send over the wire or what we write to files. If we have binary data, we don't specify the encoding here.
2:23
So here I'm saying I'm going to write a binary file and I'm specifying the encoding and Python throws an error and it says
2:30
the binary mode doesn't take an encoding argument. Again, we want to be explicit here and remember that binary is what we send over the wire
2:41
or over the network, or write to a file, and that is already encoded, so you don't need to specify an encoding; it's a sequence of bytes.
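A small sketch of the binary case, with the same caveats as before: the path and the variable names are hypothetical. It first shows that open() rejects an encoding argument in binary mode, then writes bytes that were encoded by hand.

```python
import os
import tempfile

# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "ohm.bin")

# Binary mode plus an encoding argument is rejected with a ValueError.
message = None
try:
    open(path, "wb", encoding="utf-8")
except ValueError as err:
    message = str(err)

# Bytes are already encoded, so we encode the string ourselves
# and write the result without an encoding argument.
payload = "Ohm (\N{OHM SIGN})".encode("cp949")
with open(path, "wb") as fout:
    fout.write(payload)

# Reading in binary mode gives back the same bytes, undecoded.
with open(path, "rb") as fin:
    raw = fin.read()

print(message)
print(raw == payload)
```

Decoding raw with the cp949 codec afterwards would recover the original text; the file object itself never needs to know the encoding.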
2:50
If you open something for binary it's just going to lay down that sequence of bytes. So I hope you understand a little bit more
2:56
about reading and writing files that have alternate encodings in Python. One of the best practices of Python is being explicit.
3:04
So when you're writing a text file you want to be explicit about what encoding you're using, especially if you're using characters
3:12
that aren't ASCII or aren't commonly used.