Saturday, October 29, 2011

Encodings, Part 2

So... Different places started using different encodings to represent the characters in their languages. There were several ways to represent the same characters, but none of them worked with each other.

Enter Unicode. Unicode is a largely successful attempt to represent every character, ever. It's largely what's used in localization. Still, there are some things that you need to keep in mind, because there are several ways to encode Unicode as bytes.

You might hear the terms "UTF-8," "UTF-16," "Unicode," "Little Endian" or "Big Endian." Even UTF-32 exists now. What's it all mean?

UTF-16 is the most straightforward way to represent Unicode, at least the first, well, 16 bits of it (the Basic Multilingual Plane), which covers most characters in everyday use. Anything beyond that takes two 16-bit units, called a surrogate pair. UTF-32 is the most straightforward way to represent all current Unicode characters, and then some: every character gets a fixed 4 bytes. The endian-ness of these encodings depends (or at least depended) on how computers work internally, that is, whether the most or least significant byte of a number gets stored first. That's where something called a Byte Order Mark (BOM) comes into play. It's the first 2 (in the case of UTF-16) or 4 (in the case of UTF-32) bytes of a file, and it lets you know what order the bytes come in, or the "endian-ness." UTF-8 can have a BOM of 3 bytes, but since UTF-8 has no byte-order issue, it just lets the computer know that the file is UTF-8 and not ASCII (7 bits per character) or one of the old 8-bit "ANSI" code pages (one byte per character).
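To make the BOM idea concrete, here's a rough sketch in Python of checking the first few bytes of some data for a BOM. The function name `sniff_bom` and the label strings are made up for this example; the BOM constants come from Python's standard `codecs` module. Treat it as an illustration, not a full encoding detector, since plenty of files have no BOM at all.

```python
import codecs

# Longer BOMs go first: the UTF-32 little endian BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 little endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 little endian"),
    (codecs.BOM_UTF32_BE, "UTF-32 big endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 little endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big endian"),
]


def sniff_bom(data: bytes) -> str:
    """Return a label for the BOM at the start of data (hypothetical helper)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "no BOM found"


print(sniff_bom("hi".encode("utf-8-sig")))                       # UTF-8
print(sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))  # UTF-16 big endian
print(sniff_bom(b"hi"))                                          # no BOM found
```

The endian-ness itself is easy to see, too: the letter "A" in big endian UTF-16 is the bytes 00 41, and in little endian UTF-16 it's 41 00.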

UTF-8 is usually the smallest in terms of size, at least for text that is mostly ASCII (for East Asian text, UTF-16 can actually be smaller). UTF-32 is the largest but most straightforward. The encodings you'll see the most in localization are UTF-8 and UTF-16. If you need more information, let me know in the comments! Or try Google. I've given you plenty of terms to get started. :)
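If you want to see the size difference for yourself, here's a small Python sketch that counts the bytes each encoding needs for the same text. The helper name `encoded_sizes` and the sample strings are just for illustration. It uses the little endian variants so that no BOM gets added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text takes in each encoding, without a BOM."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


print(encoded_sizes("hello"))   # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("日本語"))   # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
```

So the winner depends on the text: plain English is half the size in UTF-8, while the three Japanese characters come out smaller in UTF-16.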
