
Saturday, October 29, 2011

Encodings, Part 2

So... Different places started using different encodings to represent the characters in their languages. Each scheme worked fine on its own, but none of them worked with each other.

Enter Unicode. Unicode is a largely successful effort to assign a number to every character, ever, and it's what most localization work runs on today. Still, there are some things that you need to keep in mind. There are several ways to represent Unicode.

You might hear the terms "UTF-8," "UTF-16," "Unicode," "Little Endian" or "Big Endian." Even UTF-32 exists now. What's it all mean?

UTF-16 is the most straightforward way to represent the first 65,536 Unicode characters (the Basic Multilingual Plane), which covers most of the characters in everyday use; anything beyond that takes a pair of 16-bit units. UTF-32 is the most straightforward way to represent all current Unicode characters, and then some: one 32-bit number per character, always. The endian-ness of these encodings depends (or at least depended) on how computers work internally, which byte naturally gets read first, etc. That's where something called a Byte Order Mark (BOM) comes into play. It's the first two (in the case of UTF-16) or four (in the case of UTF-32) bytes of a file, and it lets you know what order the bytes come in, or the "endian-ness." UTF-8 can have a BOM of three bytes, but since UTF-8 has no byte-order question, it just lets the computer know that the file is UTF-8 and not ANSI/ASCII (the old stuff that used 8 bits, or one byte, per character).
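Those BOM byte sequences are easy to see for yourself. Here's a small sketch in Python (my choice of language, not the post's) using the standard codecs module, which defines them as constants:

```python
import codecs

# The BOM byte sequences described above, as defined in Python's codecs module.
print(codecs.BOM_UTF8)      # 3 bytes: b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_LE)  # 2 bytes: b'\xff\xfe' (little endian)
print(codecs.BOM_UTF16_BE)  # 2 bytes: b'\xfe\xff' (big endian)
print(codecs.BOM_UTF32_LE)  # 4 bytes: b'\xff\xfe\x00\x00'
print(codecs.BOM_UTF32_BE)  # 4 bytes: b'\x00\x00\xfe\xff'

# Encoding with the generic "utf-16" codec prepends a BOM automatically,
# so whoever decodes the bytes can tell which endian-ness they're reading.
data = "A".encode("utf-16")
assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Note how the UTF-16 and UTF-32 BOMs come in mirror-image pairs: the byte order of the mark itself is what tells you the byte order of everything after it.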

UTF-8 is usually the smallest in terms of size (for English text, each character is just one byte); UTF-32 is the largest but most straightforward. The encodings you'll see the most in localization are UTF-8 and UTF-16. If you need more information, let me know in the comments! Or try Google. I've given you plenty of terms to get started. :)
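To make the size comparison concrete, here's a quick Python sketch (the language is my choice, not the post's) encoding the same five-letter word all three ways. The "-le" codec variants skip the BOM, so the byte counts reflect just the characters:

```python
text = "Hello"  # five characters, all in the old one-byte range

utf8  = text.encode("utf-8")      # 1 byte per ASCII-range character
utf16 = text.encode("utf-16-le")  # 2 bytes per character (for the BMP)
utf32 = text.encode("utf-32-le")  # 4 bytes per character, always

print(len(utf8), len(utf16), len(utf32))  # 5 10 20
```

The flip side: a Japanese kana like あ takes 3 bytes in UTF-8 but only 2 in UTF-16, so "smallest" really depends on what text you're encoding.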

Saturday, October 22, 2011

Encodings, Part 1

What are you reading right now? Letters? Words? Ideas? There really is no wrong answer, unless you happen to be a computer.

If you're a computer, you're reading numbers. Lots and lots of numbers. Everything in computers is numbers, and text is no exception.

How do those numbers turn into text? It all starts out with a font. A font, basically, is a group of small pictures, pictures representing how to draw each character. In the computer, a character is stored as a number, and that number tells the system which picture in the font should be drawn.
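You can poke at that number-to-character idea directly. A tiny sketch in Python (my choice of language; the post doesn't use one), where ord() and chr() convert between the two views:

```python
# Each character is stored as a number; ord() and chr() convert both ways.
print(ord("A"))   # 65: the number the computer actually stores for 'A'
print(chr(65))    # 'A': the character that number maps back to
print(ord("a"))   # 97: 'a' is a different character, so a different number
```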

Okay, so where do encodings come in, and what are they exactly?

Computers began in the English-speaking world. Apple and Microsoft, the two big players in the computer world, are both American companies. Like a lot of companies that start out in the United States, they didn't really think globally at first. A character was represented by a small number that would fit within one byte (computers in those days weren't really able to handle anything bigger anyway). One byte can handle the numbers from 0 to 255. That's more than enough for all characters in English ('A' and 'a' being two different characters, etc.), but not nearly enough for some other languages, like, say, Japanese.
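Here's a sketch of that one-byte limit in Python (again, my choice of language, and latin-1 is just one example of an old one-byte encoding): English text fits, Japanese doesn't.

```python
# Every English letter's number is below 256, so a one-byte encoding works:
english = "cat".encode("latin-1")  # latin-1 is a one-byte encoding
print(english)                     # b'cat': 3 characters, 3 bytes

# Japanese characters have much larger numbers, so the same attempt fails:
try:
    "ねこ".encode("latin-1")       # "cat" in Japanese
except UnicodeEncodeError as err:
    print("one byte isn't enough:", err)
```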

This is where things got complicated. And I'll pick it up from here next week.