Python 3, an Illustrated Tour Transcripts
Lecture: Unicode in Files
0:01 In this video we'll discuss unicode in files in Python 3.
0:04 We talked about unicode in Python 3
0:06 and that Python 3 handles unicode natively
0:09 and the strings are natively unicode.
0:12 One thing to be aware of is that when Python reads in a text file,
0:16 unless you tell it otherwise, it's going to use the system's preferred encoding
0:18 to decode that file.
0:22 So you can run this command right here
0:24 locale.getpreferredencoding with False passed in
0:28 and it will tell you what the encoding is on your system.
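As a minimal sketch of the command mentioned here, the following asks the locale module for the preferred encoding. The variable name `encoding` is only illustrative, and the value printed depends on the machine, so no particular output is assumed.

```python
import locale

# Passing False asks for the preferred encoding without
# changing the current locale settings as a side effect.
encoding = locale.getpreferredencoding(False)

# The result varies by system; on many it is a UTF-8 variant.
print(encoding)
```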
0:31 Typically, on most systems that's utf-8,
0:33 if that's not the case, you should be aware of that.
0:36 And in any case, you should be explicit about what your files are encoded in.
0:41 Here's an example of being explicit with writing output.
0:45 I have a unicode string that has the ohm (Ω) character in it.
0:49 Again, ASCII can't handle this,
0:52 but the cp949 encoding can, that's a Korean encoding.
0:56 And so I'm going to make a file called ohm.kor and I'm going to write to it,
1:02 note that I'm using the 'w' mode, I'm not saying binary
1:06 because I'm writing out a string to it.
1:10 So if you're writing out text,
1:12 you open the file in plain read or write mode, not in binary mode.
1:15 And then I specify the encoding being explicit here
1:18 and I'm saying that I'm going to encode this string
1:21 as the Korean cp949 encoding
1:25 and then with my file, I can call write and write my data out.
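The write step described here might look roughly like the sketch below. The temporary directory, the file name, and the variable names `data`, `path`, and `fout` are assumptions made for illustration; the slide itself may use a different path.

```python
import os
import tempfile

# A Unicode string containing the ohm character.
data = "Ω"

# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "ohm.kor")

# Text mode ("w", not "wb") with an explicit encoding:
# Python encodes the string to cp949 bytes on write.
with open(path, "w", encoding="cp949") as fout:
    fout.write(data)
```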
1:29 Now, this is a case where if I tried to read the file
1:33 without specifying the encoding
1:35 the encoding on my system again is utf-8
1:37 and if I'd simply try and open the file for reading and read from it,
1:40 I'll get a UnicodeDecodeError
1:42 saying that the utf-8 codec can't decode that byte sequence.
1:45 That's because there is some combination
1:47 of bytes in the Korean-encoded sequence
1:49 that utf-8 doesn't know how to decode
1:52 but if I specify my encoding here and I'm explicit
1:55 then I can read that data back and get back my original string.
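Here is a small sketch of both reads, assuming a file written as cp949. The wrong read passes `encoding="utf-8"` explicitly to stand in for a system whose default is utf-8, so the example behaves the same everywhere. The path and variable names are illustrative only.

```python
import os
import tempfile

# Set up: write the ohm character using the Korean cp949 encoding.
path = os.path.join(tempfile.mkdtemp(), "ohm.kor")
with open(path, "w", encoding="cp949") as fout:
    fout.write("Ω")

# Reading with the wrong encoding fails while decoding.
try:
    with open(path, encoding="utf-8") as fin:
        fin.read()
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print(err)

# Reading with the matching encoding returns the original string.
with open(path, encoding="cp949") as fin:
    result = fin.read()
print(result)
```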
1:58 Now, this used Korean,
2:01 typically, most files you're going to see these days are utf-8.
2:05 So this just shows us an example of being explicit;
2:07 by being explicit, we can get around these encoding issues.
2:10 If we happen to have binary data,
2:13 note that binary data is what we send over the wire
2:16 or what we write to files.
2:19 If we have binary data, we don't specify the encoding here.
2:22 So here I'm saying I'm going to write a binary file
2:25 and I'm specifying the encoding
2:27 and Python throws an error and it says
2:29 the binary mode doesn't take an encoding argument.
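The error described here can be reproduced with a sketch like the one below. The path and the `message` variable are hypothetical; the point is only that combining a binary mode with an `encoding` argument is rejected.

```python
import os
import tempfile

# Hypothetical path; open() rejects the arguments before using it.
path = os.path.join(tempfile.mkdtemp(), "ohm.bin")

message = None
try:
    # Binary mode plus an encoding argument is not allowed.
    open(path, "wb", encoding="utf-8")
except ValueError as err:
    message = str(err)

print(message)
```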
2:32 Again, we want to be explicit here
2:36 and remember that binary is what we send over the wire,
2:40 over the network or into a file, and that is already encoded
2:44 so you don't need to specify an encoding,
2:47 it's a sequence of bytes.
2:49 If you open something in binary mode
2:51 it's just going to lay down that sequence of bytes.
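To illustrate, this sketch encodes a string to bytes first and then writes those bytes in binary mode with no encoding argument. The choice of utf-8, the file name, and the variable names are assumptions for the example.

```python
import os
import tempfile

# Encode the text ourselves; the result is a bytes object.
payload = "Ω".encode("utf-8")

# Hypothetical path in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "ohm.bin")

# Binary write: the bytes are laid down exactly as given.
with open(path, "wb") as fout:
    fout.write(payload)

# Binary read returns bytes; decoding is a separate, explicit step.
with open(path, "rb") as fin:
    raw = fin.read()
text = raw.decode("utf-8")
```

Because the file is opened in binary mode, nothing is encoded or decoded by `open` itself; the caller decides which encoding applies to the bytes.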
2:53 So I hope you understand a little bit more
2:55 about reading and writing files that have alternate encodings in Python.
3:00 One of the best practices of Python is being explicit.
3:03 So when you're writing a text file
3:06 you want to be explicit about what encoding you're using,
3:09 especially if you're using characters
3:11 that aren't ASCII or aren't commonly used.