Saturday, October 29, 2011

Encodings, Part 2

So... different places started using different encodings to represent the characters in their languages. There were several ways to represent the same characters, but none of them worked with each other.

Enter Unicode. Unicode is a largely successful way to represent every character, ever. It's largely what's used in localization. Still, there are some things that you need to keep in mind. There are several ways to represent Unicode.

You might hear the terms "UTF-8," "UTF-16," "Unicode," "Little Endian" or "Big Endian." Even UTF-32 exists now. What's it all mean?

UTF-16 is the most straightforward way to represent Unicode, at least the first, well, 16 bits of it (the first 65,536 code points), which covers most characters in common use. Characters beyond that range take two 16-bit units, called a surrogate pair. UTF-32 is the most straightforward way to represent all current Unicode characters, and then some: every character takes exactly four bytes. The endian-ness of these encodings depends (or at least depended) on how computers work internally, that is, which byte of a multi-byte number gets stored first. That's where something called a Byte Order Mark (BOM) comes into play. It's the first 2 (in the case of UTF-16) or 4 (in the case of UTF-32) bytes of the text, and it lets you know what order the bytes come in, or the "endian-ness." UTF-8 can have a BOM of 3 bytes, but since UTF-8 has only one byte order, it just lets the computer know that the text is UTF-8 and not ASCII or one of the old "ANSI" code pages (the old stuff that used one byte per character).
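If you want to see a BOM for yourself, here is a small sketch in Python (my choice of language, nothing above depends on it). It relies on the standard `codecs` module and on the fact that Python's generic "utf-16" and "utf-32" codecs write a BOM when encoding. The variable names are just for illustration.

```python
import codecs

text = "A"

# Python's generic "utf-16" and "utf-32" codecs prepend a BOM when encoding;
# "utf-8-sig" is the UTF-8 variant that writes the optional 3-byte signature.
utf16 = text.encode("utf-16")
utf32 = text.encode("utf-32")
utf8_sig = text.encode("utf-8-sig")

# A UTF-16 BOM is 2 bytes and a UTF-32 BOM is 4 bytes, in either byte order.
assert utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert utf32[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
assert utf8_sig[:3] == codecs.BOM_UTF8

# The same letter with an explicit byte order and no BOM:
little = text.encode("utf-16-le")  # low byte first
big = text.encode("utf-16-be")     # high byte first
print(little.hex(), big.hex())     # prints: 4100 0041
```

The last line is the whole endian-ness story in miniature: same character, same two bytes, opposite order.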

UTF-8 is usually the smallest in terms of size, at least for text that is mostly English or another Latin-script language. UTF-32 is the largest but most straightforward. The encodings you'll see the most in localization are UTF-8 and UTF-16. If you need more information, let me know in the comments! Or try Google. I've given you plenty of terms to get started. :)
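Here is a quick way to check the size claims yourself, again as a Python sketch. The sample strings and the `encoded_sizes` helper are made up for illustration. One thing it shows is that the size ranking depends on the text: UTF-8 wins for English, but Japanese characters take three bytes each in UTF-8 and only two in UTF-16.

```python
samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
}

def encoded_sizes(text):
    """Return the number of bytes the text needs in each encoding (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

for name, text in samples.items():
    print(name, encoded_sizes(text))
```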

Saturday, October 22, 2011

Encodings, Part 1

What are you reading right now? Letters? Words? Ideas? There really is no wrong answer, unless you happen to be a computer.

If you're a computer, you're reading numbers. Lots and lots of numbers. Everything in computers is numbers, and text is no exception.

How do those numbers turn into text? It all starts out with a font. A font, basically, is a group of small pictures, pictures representing how to draw each character. In the computer, a character is stored as a number, which works like an index to a specific picture in the font, or, in other words, which character should be drawn.

Okay, so where do encodings come in, and what are they exactly?

Computers began in the English-speaking world. Apple and Microsoft, the two big players in the computer world, are both American companies. Like a lot of companies that start out in the United States, they didn't really think globally at first. A character was represented by a small number that would fit within one byte (computers in those days weren't really able to handle anything bigger anyway). One byte can handle the numbers from 0 to 255. That's more than enough for all characters in English ('A' and 'a' being two different characters, etc.), but not nearly enough for some other languages, like, say, Japanese.
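To make the one-byte limit concrete, here is a tiny Python sketch. It uses today's Unicode numbers (what `ord()` returns) purely for illustration, and `fits_in_one_byte` is a helper I made up.

```python
# Every character has a number; ord() shows it and chr() goes the other way.
print(ord("A"), ord("a"))  # prints: 65 97

# One byte holds 256 distinct values, 0 through 255.
assert 2 ** 8 == 256

def fits_in_one_byte(char):
    """True if the character's number is small enough to store in one byte."""
    return ord(char) <= 255

print(fits_in_one_byte("a"))   # prints: True
print(fits_in_one_byte("日"))  # prints: False
```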

This is where things got complicated. And I'll pick it up from here next week.

Saturday, October 15, 2011

Translation Memories: Pros and Cons

To say that translation memories are popular in the translation/localization industry would be a gross understatement. Chances are, if you have worked with or in the industry in any way, you have heard "TM," "translation memory" or some other term referring to translation memories.

Almost every translation system or tool that claims to increase productivity uses a translation memory. Some make claims that their memory is somehow better than another tool's, but the basics are the same.

TMs are exactly what their name describes: They are a way for a computer to remember what has been translated before. This is done on a segment level (or on a sentence level, basically, though there are exceptions). If you come across the same segment later, or a similar enough one, your translation tool will be able to look it up in the translation memory and either offer it up as a suggestion or simply plug it in as the translation.
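For the curious, the basic idea can be sketched in a few lines of Python. This is a toy, not how any particular tool works: the `memory` contents, the `lookup` function and the 0.75 threshold are all made up, and I'm using the standard library's `difflib` as a stand-in for whatever similarity measure a real tool uses.

```python
from difflib import SequenceMatcher

# A toy translation memory: source segment -> stored translation.
memory = {
    "Save your changes.": "Enregistrez vos modifications.",
    "Open the file.": "Ouvrez le fichier.",
}

def lookup(segment, memory, threshold=0.75):
    """Return (translation, score) for the best match, or (None, score) if nothing is close enough."""
    best_source, best_score = None, 0.0
    for source in memory:
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        return memory[best_source], best_score
    return None, best_score

print(lookup("Open the file.", memory))   # exact match
print(lookup("Open the files.", memory))  # similar enough to be offered as a suggestion
print(lookup("Quit", memory))             # nothing close enough
```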

There are a lot of benefits to having a translation memory, when it is managed correctly. For example, if a client has an update to a file, they might ask you to only translate what has changed and paste it into the previously translated file. If you have a translation memory, you can just stick the whole file into the tool, and it will fill in what has already been translated, leaving you to only translate what has changed. No need to search for the changes, find the corresponding place in the previous translation, paste, etc.

There might also be a lot of occurrences of the same segment in the same document. Translation memories can be updated as you work, so if you get to a segment that you've seen before, the translation memory can fill it in for you. This is also the case with slogans, etc., where the chance of repetition is high. Translate once, then reuse as needed.

With advantages such as these, productivity can and does go up. However, translation memories have their drawbacks, several of which are often ignored. You should know that I'm not making these up: in about a year of working in the industry as a localization engineer, I've come across each of these drawbacks at least once.

The first drawback that I'll mention has to do with translator laziness or complacency. It's not really the translator's fault, either. It's something of a trap. Most translation memory tools have the option to fill in "fuzzy" (inexact) matches if their degree of similarity to the segment in question is above a certain threshold. It's very easy, as a translator, to see that a translation has already been inserted and move on. After all, your time is valuable. However, the meaning of a sentence can change without the sentence changing much at all. Some theoretical examples: "You must do that." vs. "You must not do that."; "John has 1 dog." vs. "John has 3 dogs."; etc. If a translator misses one change, it gets put into the translation memory. Then you have an incorrect translation all ready to be used again. That's not even mentioning exact matches that may be out of context, so their translations should be different but aren't, because it's not usually considered worth a translator's time to look at something that's already translated at a 100% match.
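You can get a feel for how close those example pairs look to a computer with a few lines of Python. This uses the standard library's `difflib`, which is only a stand-in: real translation tools have their own ways of scoring matches, so treat the numbers as illustrative.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity between two segments, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

pairs = [
    ("You must do that.", "You must not do that."),
    ("John has 1 dog.", "John has 3 dogs."),
]

for old, new in pairs:
    print(f"{similarity(old, new):.0%}  {old!r} -> {new!r}")
```

Both pairs score high on this measure, even though the meaning has flipped or the facts have changed. That is exactly the kind of match that is tempting to wave through.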

The second drawback can be seen as an extension of the first, except that this time it's not the translator at all who's at fault, but a side effect of what may have happened previously. That drawback is error propagation. If a segment is translated incorrectly, it will be incorrect all over the place. I've seen this happen literally hundreds of times within a very technical set of files. The worst part? Sometimes it's very difficult to fix. Because the same source segment can be translated different ways given different contexts, some translation memory tools will save any changes as a different entry into the translation memory. Depending on the features of the tool, the incorrect translation may pop up again. Sure, you can look in the memory itself and delete the offending segment, but translation memories can get big, so it can be difficult to manage them sometimes.

The third drawback (and the last that I'm going to mention) is unrealistic expectations. While this is not really the fault of translation memories, it affects people who use them. I offer the following examples, though there are a lot of ways expectations can be unrealistic.

Example #1: Almost all translation memories work on a segment level. They don't deal with anything smaller than that. A lot of companies will charge/pay depending on the degree of matches in a translation memory. Why pay for something that's already translated, right? Well, some people seem to think that if they change "a few words" in some sentences, they will only be charged a lower match rate on those words. However, because translation memories work at a segment level, if the match rate (how much of the segment is similar to the entry in the translation memory) is low enough to cross a pricing threshold, the new price applies to every word in that segment, not just the words that have been changed. I've seen a few clients get into arguments because "only a few words" had been changed.

Example #2: Turnaround times. Some clients, and I'm talking both about the end client and the translation companies that use freelancers, seem to think that because you have a translation memory tool, all translations can get done lightning fast, regardless of how repetitive the text is, or how long it is, or if you've ever translated anything like it before (i.e. if you have segments in your translation memory from similar texts). Translation is still work, and sometimes, regardless of what tools you're using, it takes time.
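To put some numbers on Example #1, here is a rough Python sketch of segment-level pricing. Everything in it is hypothetical: the price bands, the per-word rate and the `segment_cost` function are made up, and real rate sheets differ from company to company. The point is only that the band is chosen once per segment and then applied to every word in it.

```python
# Hypothetical price bands: (minimum match rate, fraction of the full per-word rate).
BANDS = [(1.00, 0.0), (0.85, 0.4), (0.0, 1.0)]

def segment_cost(word_count, match_rate, full_rate=0.10):
    """Bill every word in the segment at the band its match rate falls into."""
    for minimum, fraction in BANDS:
        if match_rate >= minimum:
            return word_count * full_rate * fraction
    raise ValueError("match_rate must be 0.0 or higher")

# A 20-word segment where "a few words" changed and the match dropped to 70%:
# all 20 words are billed at the full rate, not just the changed ones.
print(segment_cost(20, 0.70))
print(segment_cost(20, 0.90))
print(segment_cost(20, 1.00))
```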

So translation memories can be huge time savers. They can make sure that your wording is consistent in similar segments, and they can give suggestions based on previous translations. However, they are not miracle workers and in some cases can make things worse. Missing fuzzy matches can allow errors to sneak in, and sometimes errors can be propagated to other sections, or even completely different documents. You have to be careful when using a translation memory, and you have to make sure to manage clients' expectations.

Saturday, October 8, 2011

Concatenation is evil.

While I was a student, I participated in a group project to localize a web application. We had to find our own client and everything. We found a nice client with a cool application, and they made things very straightforward for us. They gave us a list of strings that was easily prepared for translation.

There was one problem, though. String concatenation. There was a lot of it. What is string concatenation? It's when you have two strings and you stick them together. One common example: "You have " + (some number) + " item(s) in your shopping cart." The more there is of this, the more difficult it is to localize.

String concatenation occurs almost exclusively in software and websites. In fact, I can't think of any exceptions. It's really easy for programmers (or developers, or whatever they want to call themselves) to fall into the trap of string concatenation. In the example listed above, the number would be calculated using some code somewhere in the application. The easiest way to put the number into the sentence is concatenation. So why does concatenation pose such a problem to translation and localization?

First, let me briefly mention internal strings vs. external strings. Internal strings are strings that are put directly into the code. This might be in the JavaScript of a website or in the computer code of a program. Depending on the complexity of the application, internal strings are almost impossible to localize. It can be done, but it takes longer and costs more. Why? Because you have to sort through all of the code, decide what is relevant for translation and hope that you don't mess up the code in the meantime.

Most people who want their applications localized know this. If not, they will soon. So they externalize their strings. What does this mean? It means you get some sort of list of (identifier) + (string).
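In code, externalized strings often boil down to a lookup table per language. Here is a minimal Python sketch of that idea. The identifiers, the `STRINGS` table and the `get_string` function are invented for illustration, and real projects usually keep these tables in separate resource files rather than in the code.

```python
# Hypothetical string tables, one per language, keyed by identifier.
STRINGS = {
    "en": {"cart.title": "Shopping cart", "cart.checkout": "Check out"},
    "es": {"cart.title": "Carrito de compras"},
}

def get_string(identifier, language, fallback="en"):
    """Look up a string by identifier, falling back to English if it is untranslated."""
    table = STRINGS.get(language, {})
    if identifier in table:
        return table[identifier]
    return STRINGS[fallback][identifier]

print(get_string("cart.title", "es"))     # translated
print(get_string("cart.checkout", "es"))  # not translated yet, falls back to English
```

The translator only ever sees the table, never the code around it.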

This is why concatenation is a problem. You may find something like this:
string.beginning = "You have "
string.end = " item(s) in your shopping cart."

If you speak another language, you may already see the problem. If not, here's a basic explanation. Different languages have different grammar rules. Sometimes, this affects word order. Some languages' grammar rules might force the sentence's structure to be "Your shopping cart contains (number) item(s)." or something like that. Other languages might be different. You could, in theory, translate "You have " as "Your shopping cart contains ", but that has its own problems. For one thing, string concatenation often comes paired with string reuse. "You have " might be used somewhere else in the program. It would also pollute your translation memory (more on that in a different post in the future). The list might also look like this:
string.1 = "You have "
string.2 = "something completely unrelated to the contents of your shopping cart"
...
string.453 = "item(s) in your shopping cart."

In this case, context is completely lost. Sometimes the strings might not even appear in the right order in the list. So how do we avoid problems with string concatenation? The answer is simple: We somehow convince our clients to avoid string concatenation. Avoid it like you would avoid tall trees during a thunderstorm.
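Here is a small Python sketch of how concatenation plus reuse goes wrong. The `string.inbox` entry and the `inbox_message` function are hypothetical, standing in for "somewhere else in the program," and the "translation" is just reworded English so that everyone can read it.

```python
def cart_message(strings, count):
    return strings["string.beginning"] + str(count) + strings["string.end"]

def inbox_message(strings, count):
    # Hypothetical second use of the same fragment elsewhere in the program.
    return strings["string.beginning"] + str(count) + strings["string.inbox"]

original = {
    "string.beginning": "You have ",
    "string.end": " item(s) in your shopping cart.",
    "string.inbox": " new message(s).",
}

# A "translation" that rewords the fragments to suit the cart sentence.
reworded = dict(original)
reworded["string.beginning"] = "Your shopping cart contains "
reworded["string.end"] = " item(s)."

print(cart_message(reworded, 3))   # prints: Your shopping cart contains 3 item(s).
print(inbox_message(reworded, 3))  # prints: Your shopping cart contains 3 new message(s).
```

The cart sentence is fine, but the reused fragment has quietly broken the inbox sentence, and the translator never saw it happen.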

Unfortunately, getting rid of string concatenation will also get rid of string reuse. String reuse is a nightmare for localization, too, so we're happy to see it go, but it is popular among some programmers, because they don't have to type as much, and programming is all about making things more efficient. If this comes up, you might want to mention that they might need to write completely different code for each language if they still want to use concatenation. That might get their attention.

There are right ways and less right ways to get rid of concatenation. Some strings, like the one mentioned above, have parts that might change. Here is a really wrong way to do it:
string.1 = "You have 1 item in your shopping cart."
string.2 = "You have 2 items in your shopping cart."
...
string.500 = "You have 500 items in your shopping cart."

I think it's obvious what's wrong with this. More strings = more time and money spent by everybody. And it might not always be possible to calculate every possibility.

Here's a less wrong way to do it:
string.itemsInList = "You have %d(numberOfItems) item(s) in your shopping cart."

The above is a way to get around string concatenation. You put some sort of placeholder (and the way to do this will change with each computer language, but almost all of them support it) in the middle to keep the string together, and you insert the value later. There are problems with this. It might cause awkward grammar. It might cause confusion. But it's better than concatenation, and it's better than a case for every alternative, especially when it could be an infinitely long list.
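As one concrete version of the placeholder idea, here is a sketch using Python's built-in `str.format` with a named field. The `{numberOfItems}` syntax is Python's, not the generic notation I used above, and the `templates` table and `items_in_list` function are made up for illustration.

```python
# Each entry is one complete sentence with a named placeholder,
# so a translator is free to move the placeholder wherever the grammar needs it.
templates = {
    "original": "You have {numberOfItems} item(s) in your shopping cart.",
    "reordered": "Your shopping cart contains {numberOfItems} item(s).",
}

def items_in_list(template_name, number_of_items):
    """Insert the value into the chosen template after the string has been looked up."""
    return templates[template_name].format(numberOfItems=number_of_items)

print(items_in_list("original", 3))   # prints: You have 3 item(s) in your shopping cart.
print(items_in_list("reordered", 3))  # prints: Your shopping cart contains 3 item(s).
```

The code that inserts the number is the same for both templates, which is the whole point: the sentence structure lives in the string, not in the program.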

Here's what I recommend, where possible:
string.itemsInList = "Number of items in your list: %d(numberOfItems)"

Put the variable part(s) of the string at the end of the string, with a logically complete phrase before it. Or, and this would be a correct usage of string concatenation:
string.itemsInList = "Number of items in your list: "
Then, in the code, the programmer could concatenate that logically complete string with the number.

As you can see, concatenation can cause more than a minor headache for localization teams. It should be managed or avoided, when possible.

Feel free to share your comments below. This is a new blog, so feel free to include things that you would like addressed. I am by no means an expert on everything in localization, but I'm willing to share what knowledge I have and observations that I have made.