Python 3, an Illustrated Tour Transcripts
Chapter: Strings
Lecture: Unicode in Files

0:01 In this video we'll discuss Unicode in files in Python 3. We talked about how Python 3 handles Unicode natively
0:10 and how its strings are natively Unicode. One thing to be aware of is that when Python reads in a text file, by default it's going to use the system encoding
0:19 to determine what the encoding is on that file. So you can run this command right here, locale.getpreferredencoding with False passed in,
0:29 and it will tell you what the encoding is on your system. Typically, on most systems that's UTF-8; if that's not the case, you should be aware of that.
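As a minimal sketch of the command mentioned here (the variable name system_encoding is just for illustration), the check could look like this:

```python
import locale

# Passing False asks for the preferred encoding without calling setlocale(),
# so the process-wide locale settings are left untouched.
system_encoding = locale.getpreferredencoding(False)

# Often a UTF-8 variant, but the value depends on the platform and its settings.
print(system_encoding)
```

The printed value is what open() falls back to for text files when no encoding argument is given.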
0:37 And in any case, you should be explicit about what your files are encoded in. Here's an example of being explicit with writing output.
0:46 I have a Unicode string that has the ohm (Ω) character in it. Again, ASCII can't handle this, but the cp949 encoding can; that's a Korean encoding.
0:57 And so I'm going to make a file called ohm.kor and I'm going to write to it. Note that I'm using the 'w' mode; I'm not saying binary,
1:07 because I'm writing out a string to it. So if you're writing out text, you open the file in plain read or write mode, not in binary mode.
1:16 And then I specify the encoding being explicit here and I'm saying that I'm going to encode this string as the Korean cp949 encoding
1:26 and then with my file, I can call write and write my data out. Now, this is a case where, if I try to read the file without specifying the encoding,
1:36 Python falls back to the encoding on my system, which again is UTF-8. If I simply open the file for reading and read from it, I'll get a UnicodeDecodeError
1:43 saying that the utf-8 codec can't decode that byte sequence. That's because there is some combination of bytes in the Korean byte sequence
1:50 that utf-8 doesn't know how to decode but if I specify my encoding here and I'm explicit
1:56 then I can read that data back and get back my original string. Now, this used Korean, but most files you're going to see these days are UTF-8.
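The write, the failed read, and the explicit read described above can be sketched roughly as follows. The temporary directory, the file name, and the variable names are assumptions for this sketch, and the Greek capital omega stands in for the ohm character on the slide:

```python
import os
import tempfile

# Stand-in for the lecture's ohm character; it cannot be encoded as ASCII.
data = '\N{GREEK CAPITAL LETTER OMEGA}'

# Hypothetical location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'ohm.kor')

# Text mode ('w', not 'wb') with the encoding stated explicitly.
with open(path, 'w', encoding='cp949') as fout:
    fout.write(data)

# Reading the same bytes as UTF-8 fails, because they are not valid UTF-8.
try:
    with open(path, encoding='utf-8') as fin:
        fin.read()
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(err)

# Being explicit about the encoding on the way back in recovers the string.
with open(path, encoding='cp949') as fin:
    result = fin.read()

print(result == data)
```

The UTF-8 read is spelled out here so the sketch behaves the same regardless of the system encoding; on a machine whose default is UTF-8, leaving the encoding argument off would fail in the same way.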
2:06 So this just shows us an example of being explicit; by being explicit, we can get around these encoding issues. If we happen to have binary data,
2:14 note that binary data is what we send over the wire or what we write to files. If we have binary data, we don't specify the encoding here.
2:23 So here I'm saying I'm going to write a binary file and I'm specifying the encoding and Python throws an error and it says
2:30 that binary mode doesn't take an encoding argument. Again, we want to be explicit here and remember that binary is what we send over the wire,
2:41 over the network, or write to a file, and that it is already encoded, so you don't need to specify an encoding; it's a sequence of bytes.
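A small sketch of the binary case, assuming a hypothetical file name and illustrative variable names: passing an encoding together with a binary mode raises an error, and the way around it is to encode the string yourself and write the resulting bytes.

```python
import os
import tempfile

# Hypothetical location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'ohm.bin')

# Binary mode combined with an encoding argument is rejected by open().
try:
    open(path, 'wb', encoding='cp949')
    rejected = False
except ValueError as err:
    rejected = True
    print(err)

# In binary mode we write bytes, so the encoding step is done explicitly first.
payload = '\N{GREEK CAPITAL LETTER OMEGA}'.encode('cp949')
with open(path, 'wb') as fout:
    fout.write(payload)

# Reading in binary mode hands back the same bytes, with no decoding applied.
with open(path, 'rb') as fin:
    raw = fin.read()

print(raw == payload)
```

Decoding those bytes back into a string remains a separate, explicit step, for example raw.decode('cp949').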
2:50 If you open something for binary it's just going to lay down that sequence of bytes. So I hope you understand a little bit more
2:56 about reading and writing files that have alternate encodings in Python. One of the best practices of Python is being explicit.
3:04 So when you're writing a text file you want to be explicit about what encoding you're using, especially if you're using characters
3:12 that aren't ASCII or aren't commonly used.

