Internationalization and Localization - by Joe Mabel

This is a piece from 2001, last significantly updated 2004. While much of it remains relevant, any decades-old technical article is dated in some degree. In particular, the emphasis here is on creating applications for user machines; nowadays, most development is in some degree directed to the web and the cloud.


Nowadays, many Internet sites and desktop programs must support multiple languages and cultures. Over more than a decade, as a Software Architect at Active Voice, as Director of Development at Saltmine, and then in 2004 on a project for Tableau Software, I had a lot of opportunity to develop truly international programs and sites, in applications ranging from voice mail systems to troubleshooting tools. I'd like to share some of what I've learned.

[In 2005, partly as a result of writing this article, I had the privilege of working with Microsoft's "Dr. International" team to rewrite most of the Microsoft SDK's conceptual documents and API references on internationalization and typography.]

To be useful to a broad audience, I've tried to avoid delving too deeply into technical issues, either about programming or about human language. This is not the place to learn about the use of hiragana and katakana in Japanese, or the several different and incompatible C++ libraries for text manipulation, or how Microsoft's notion of locale does and doesn't jibe with the ISO's. I'm also talking only about conventional personal computers; handheld wireless devices raise many of the same issues, but I am not yet expert enough with wireless to presume to give lessons.

10 tips on Internationalization and Localization

1. Understand the difference between internationalization and localization

Internationalization (sometimes abbreviated "i18n") is the up-front work of building a system in such a way that it can be adapted to multiple locales. An internationalized site or program requires forethought: it may not be easy to design up front for the international market, but it is far harder to adapt a system for other languages, nations, and cultures when the original designers thought only about use by English speakers in the United States. Localization is the adaptation of a system for a particular locale. If you have given sufficient thought up front to internationalization issues, each localization will be relatively straightforward.

Some authors will use terms other than internationalization and localization for these two concepts, but they are worth distinguishing: an expert on translating content from one language to another may not know much about designing a system that makes such localization easy. Conversely, an expert on internationalization may not have the language facility or the knowledge of a particular culture to perform a localization.

2. Localization isn't just about language

When most people think about localization, the first thing that comes to mind is translation. Of course, that is an important part of localization, but plenty of localization issues can arise even among countries with a common language.

Consider just the English-speaking world. There is a wide variety of currencies (e.g. US dollar, British pound, Canadian dollar, Australian dollar). Dates are written differently (1/12/2002 means January 12 in the US, but December 1 in the UK), and some regions prefer a 24-hour clock while others prefer am/pm. The British use European paper sizes; we Americans have our own paper sizes.
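To make the date ambiguity concrete, here is a minimal sketch in Python. The function name `parse_numeric_date` and the two-entry `DATE_PATTERNS` table are assumptions made up for this example; a real system would take the field order from the platform's locale data rather than from a hand-written table.

```python
from datetime import date, datetime

# Illustrative only: field order for two locales.
# A real system would take this from the platform's locale data.
DATE_PATTERNS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
}

def parse_numeric_date(text: str, locale_tag: str) -> date:
    """Parse an all-numeric date using the field order of locale_tag."""
    return datetime.strptime(text, DATE_PATTERNS[locale_tag]).date()

us = parse_numeric_date("1/12/2002", "en-US")
uk = parse_numeric_date("1/12/2002", "en-GB")
print(us.isoformat())  # 2002-01-12
print(uk.isoformat())  # 2002-12-01
```

The same eight characters yield two different days, which is one reason to store and exchange dates in an unambiguous form (such as ISO 8601) and apply the local format only when displaying them.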

If you are writing a spell-checker or building a voice-mail system, the very notion of a single "English-speaking world" falls apart pretty rapidly: is that "gray" or "grey"? "Color" or "colour"? What accent is acceptable for a recorded voice? Is there a strong national preference for recorded voices to be specifically male or female?

By the way, even numbers aren't written the same everywhere. English speakers generally favor a comma to mark thousands and a period to mark decimals, but in much of Europe it's the other way around.
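A small Python sketch of the separator swap follows. The `SEPARATORS` table and the `format_number` helper are hypothetical and cover only two locales; they illustrate the point and are no substitute for the locale data your platform supplies.

```python
# Illustrative only: (thousands separator, decimal separator) per locale.
SEPARATORS = {
    "en-US": (",", "."),
    "de-DE": (".", ","),
}

def format_number(value: float, locale_tag: str) -> str:
    """Format value with two decimals using the separators of locale_tag."""
    thousands, decimal = SEPARATORS[locale_tag]
    # Start from Python's fixed "1,234,567.89" form, then swap separators
    # through a placeholder so the two replacements do not collide.
    text = f"{value:,.2f}"
    return (text.replace(",", "\x00")
                .replace(".", decimal)
                .replace("\x00", thousands))

print(format_number(1234567.89, "en-US"))  # 1,234,567.89
print(format_number(1234567.89, "de-DE"))  # 1.234.567,89
```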

3. What's in a name?

Even something as seemingly simple as people's names becomes an issue in an international system. In much of the European world, you can get away with First Name + Middle Initial + Last Name, but even in the English-speaking world the "Juniors" and "IIIs" may not be thrilled. In Chinese and in several other languages, family name comes first. In the Spanish-speaking world, most people use a segundo apellido (a second surname) as part of their name (e.g. Gabriel García Márquez's surname is "García"; "Márquez" is his mother's maiden name).
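One way to avoid baking First + Middle + Last into a system is to store the name as its owner writes it and keep sorting information separately. The following Python sketch assumes a hypothetical `PersonName` structure; the field names are invented for illustration, and what to file each name under would have to come from someone who knows the culture.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonName:
    """A name stored the way its owner writes it (hypothetical structure)."""
    full_name: str                  # exactly as the person writes it
    sort_as: Optional[str] = None   # what to file the name under, if known

    def sort_key(self) -> str:
        # Fall back to the full name rather than guessing which word
        # is the family name.
        return (self.sort_as or self.full_name).casefold()

names = [
    PersonName("Mao Zedong"),  # family name already comes first
    PersonName("Martin Luther King Jr.", sort_as="King, Martin Luther, Jr."),
    PersonName("Gabriel Garc\u00eda M\u00e1rquez",
               sort_as="Garc\u00eda M\u00e1rquez, Gabriel"),
]
ordered = [n.full_name for n in sorted(names, key=PersonName.sort_key)]
print(ordered)
```

Here the names are filed under G, K, and M without the code ever splitting a name into parts.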

Any time you build a system for international use, you need to understand aspects like this of the cultures in which your system will be used. Typically, this knowledge can come only from a native or someone with near-native knowledge of a language and culture.

4. International e-Commerce is particularly tricky

For any but the most trivial international e-commerce, you need to understand the tax laws, disclosure laws, and customs (in both senses of the word) of the various countries.

If a user orders a product to be shipped from a different country than the one in which it is to be delivered, customs duties may apply.

Most of Europe uses some form of value-added tax, but rates vary, and some locales smaller than countries (e.g. the Canary Islands or the Channel Islands) have special tax rules of their own.

In some countries (e.g. Germany), credit cards are not a popular way to pay for things: people there expect to be able to pay COD.

There are literally dozens of such considerations, far more than I can go into here, and many of them affect how you can do business lawfully. Plan before you build, or the only people who will make any money off of your system are the lawyers.

5. Know about the environments in which your system must operate

Not all computers, operating systems, and browsers represent text exactly the same way. Unicode is an increasingly accepted standard, but that won't do you much good if your software must run on Windows 95 or support an old text-based Lynx browser or a legacy database. Even within the Microsoft world, NT-based Windows 2000 provides support for quite a bit more functionality for internationalization of programs than does Windows Millennium (which descends from the Win9X family). Macintosh and Unix raise other issues.

Although the 101-key keyboard has become pretty standard, you will find that the key caps are at least slightly different in each country. For example, an English-language keyboard starts with "qwerty", but a German-language keyboard starts with "qwertz". Typically, the operating system will shield you from most such issues, but they can wreak havoc with attempts to exploit the physical layout of the keyboard for navigation purposes.

Programmers should plan to handle most text as Unicode. If you need to use anything else, deal with it entirely in a thin I/O layer.
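As a rough sketch of what a thin I/O layer might look like, the Python below decodes legacy bytes on the way in, works only with Unicode strings in the middle, and encodes again on the way out. The function names and the choice of Windows-1252 as the legacy encoding are assumptions for the example.

```python
def read_legacy_text(raw: bytes, encoding: str = "cp1252") -> str:
    """Input boundary: decode legacy bytes into a Unicode string."""
    return raw.decode(encoding)

def write_legacy_text(text: str, encoding: str = "cp1252") -> bytes:
    """Output boundary: encode Unicode back to the legacy byte format."""
    # errors="replace" substitutes "?" for characters the legacy
    # encoding cannot represent instead of raising an exception.
    return text.encode(encoding, errors="replace")

def shout(text: str) -> str:
    """Core logic: works on Unicode only and never sees raw bytes."""
    return text.upper()

raw_in = b"caf\xe9"                        # "café" in Windows-1252
text = read_legacy_text(raw_in)            # Unicode from here on
raw_out = write_legacy_text(shout(text))   # back to legacy bytes
print(text, raw_out)
```

Because only the two boundary functions know about the legacy encoding, replacing it later should touch very little code.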

As with every other aspect of systems design, the more you can narrow your hardware, OS, and browser requirements, the simpler life becomes. It's a lot easier to build a system that runs correctly on Windows Millennium with Internet Explorer 5.0 or later than one that runs correctly on every Windows system ever built and which supports a wide variety of browsers.

6. Know which languages and locales you care about

It's a lot easier to build a site or application that will adapt successfully to the languages of Western Europe than one that will also cover the ideographic Far Eastern languages or Arabic and Hebrew.

Even working across the various Western European languages raises some tricky issues: for example, English is the only language where you can test whether a character is a lower-case letter just by a simple test like:

  if (ch >= 'a' && ch <= 'z')

All of the following characters will fail that test:

  àáäæçñø

but all are lower case characters in some Western European language.
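The same comparison can be run in Python, whose built-in `str.islower()` consults the Unicode character database rather than a fixed range. The helper `is_ascii_lower` is just the range test above restated for this sketch.

```python
def is_ascii_lower(ch: str) -> bool:
    """The range test from the text: true only for a through z."""
    return "a" <= ch <= "z"

# The seven sample characters, written as escapes: à á ä æ ç ñ ø
sample = "\u00e0\u00e1\u00e4\u00e6\u00e7\u00f1\u00f8"

print([is_ascii_lower(ch) for ch in sample])  # every entry is False
print([ch.islower() for ch in sample])        # every entry is True
```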

Some languages are simply more verbose than others, so a German-language user interface will typically need 30% more screen real estate for text than an English-language user interface. Even languages that are not generally verbose may raise this type of issue in particular cases: the Spanish-language equivalent of an "OK" button says "Aceptar".

Because different languages raise different issues, and because different languages have different degrees of importance in the world of commerce and computing, there seem to be five common levels of internationalization with respect to language:

7. Think about how much of your system really must be localized

How completely must your system really be localized? For example, if you are building a web site and the tools to administer it, can all of your administrators read English? If so, even though they may need to edit content in dozens of languages, the text for administrative screens needn't be localized. It may be that the only part of the administration tool affected by concerns of localization is the editing boxes.

Another example: some of the content driving the site might never be shown to an end user. There's generally no need to localize HTML tags or scripts. Good system architecture may separate localizable and non-localizable text elements, so that localization technicians don't accidentally translate text which will be read only by a machine.
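A minimal Python sketch of that separation: the markup is built in code, and only the visible label lives in a string table that translators see. The `STRINGS` table, the `help.link` key, and the `help_link` function are all hypothetical names chosen for the example.

```python
from html import escape

# Only these strings go to the translators (illustrative entries).
STRINGS = {
    "en": {"help.link": "Help"},
    "es": {"help.link": "Ayuda"},
}

def help_link(locale_tag: str, url: str) -> str:
    """Build the markup in code; take only the visible label from STRINGS."""
    label = STRINGS[locale_tag]["help.link"]
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'

print(help_link("en", "/help"))  # <a href="/help">Help</a>
print(help_link("es", "/help"))  # <a href="/help">Ayuda</a>
```

The tag and attribute names never appear in the localizable table, so nobody can translate them by accident.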

8. Exploit the platform

Don't reinvent what's already been built for you. For example, if you are working in a contemporary Microsoft environment, and if your users each have their computers set up for the appropriate locale, Microsoft provides locale-sensitive program elements such as:

Your programming language probably already has a library with locale-sensitive routines to convert upper case to lower case (for those languages where this makes sense: Far Eastern languages don't generally have case). Don't try to write your own. On the other hand, you may or may not have access to a tool that knows that in German, for searching and sorting, "ß" should be interchangeable with "ss" and "ä" with "ae". If you need that level of cultural appropriateness, you'll need to examine your tools closely and work out what the platform provides and what you have to build for yourself.
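To illustrate the split between what a platform provides and what you may have to build, here is a Python sketch. The built-in `upper()` and `casefold()` already know that "ß" corresponds to "ss"; the umlaut handling is a hand-written, hypothetical helper (`german_search_key`) and is far from a full German collation.

```python
# Illustrative table for matching in which an umlaut is treated as the
# base vowel plus "e" (keys are the characters ä, ö, ü). Not a collation.
GERMAN_EXPANSIONS = str.maketrans({
    "\u00e4": "ae",
    "\u00f6": "oe",
    "\u00fc": "ue",
})

def german_search_key(text: str) -> str:
    """Hypothetical helper: the key under which German text is searched."""
    # casefold() lowercases and also expands "ß" to "ss".
    return text.casefold().translate(GERMAN_EXPANSIONS)

print("stra\u00dfe".upper())                # STRASSE
print(german_search_key("Stra\u00dfe"))     # strasse
print(german_search_key("J\u00e4ger") == german_search_key("JAEGER"))  # True
```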

9. Plan the process

As you can see by now, there's a lot to this. If you jump in without planning, it's going to be a lot more expensive (in terms of both time and money) than it needs to be.

For example, I've seen programs that adopted the policy of putting all text in Microsoft resource files. That's workable, but it is really annoying to look through a file built for localization purposes and see strings like "<a href=". Pretty clearly, that does not need to be localized. On the other hand, someone might think they've been really clever to create the following collection of strings:

"Not enough memory to "
"open"
"save"
"spell-check"
" the file."

and dynamically stitch them together. When someone else goes to localize this for German or Finnish, they are going to be very frustrated, because the word order is not the same: the verb must come last. What should have been a content localization becomes a coding change.
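A common alternative is to keep one complete sentence per message, with named placeholders that each translator can position freely. The Python sketch below assumes a hypothetical `MESSAGES` catalog and `message` function; the German strings are example translations for illustration, not vetted copy.

```python
# Illustrative catalog: one complete sentence per message and locale.
MESSAGES = {
    "en": {
        "mem.open": "Not enough memory to open {filename}.",
        "mem.save": "Not enough memory to save {filename}.",
    },
    "de": {
        "mem.open": "Nicht genügend Arbeitsspeicher, um {filename} zu öffnen.",
        "mem.save": "Nicht genügend Arbeitsspeicher, um {filename} zu speichern.",
    },
}

def message(locale_tag: str, key: str, **params: str) -> str:
    """Look up a whole-sentence template and fill in its placeholders."""
    return MESSAGES[locale_tag][key].format(**params)

print(message("en", "mem.open", filename="report.txt"))
print(message("de", "mem.open", filename="report.txt"))
```

In the German templates the verb sits at the end and the file name in the middle, yet the calling code is identical for both locales, so localization stays a content change.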

If you are considering using a third party content management tool to organize a large web site, know in advance whether it will provide good support for content localization or whether your localization people will have to extract the entire site from that system, localize it in some other environment, then put it back.

If your web site intends to provide the same pricing information in multiple languages, what is the plan to make sure that the actual prices stay in synch? How do you plan to handle customer service calls that are not in a language in which your company routinely does business?

Is all of your content always going to originate in English, or might it originate in any of a number of languages? How will you identify which is the master and which the localization? (This is crucial for content that will go through multiple versions.)

10. Have a plan for testing

No non-trivial software system can ever be completely tested. However, there is such a thing as reasonable testing on a given budget.

To test whether software has been correctly internationalized, allow time in the release cycle to test several localizations. A highly internationalized piece of software generally cannot be thoroughly tested in all of the languages it supports, and some of the localizations will probably not be available at the time of the first release.

Typically, software that works correctly in English, German, Japanese, and Arabic will have few problems localizing for other languages. However, "few" is not "zero". You still may stumble across a currency with three digits to the right of the decimal point or a color scheme that turns out to look like a mockery of someone's flag.
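The currency case is easy to check in a test. This Python sketch rounds an amount to the number of decimal places a currency uses; the `MINOR_UNITS` table is a small illustrative subset (the yen has no decimal places, the Kuwaiti dinar has three), and both the `round_to_currency` name and the half-up rounding rule are assumptions for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Decimal places per currency (illustrative subset of ISO 4217).
MINOR_UNITS = {"USD": 2, "JPY": 0, "KWD": 3}

def round_to_currency(amount: str, currency: str) -> Decimal:
    """Round amount to the number of decimal places the currency uses."""
    digits = MINOR_UNITS[currency]
    quantum = Decimal(10) ** -digits      # 0.01, 1 or 0.001
    # Half-up rounding is an assumption; real rules vary by jurisdiction.
    return Decimal(amount).quantize(quantum, rounding=ROUND_HALF_UP)

for code in ("USD", "JPY", "KWD"):
    print(code, round_to_currency("1234.5678", code))
```

Code that hard-codes two decimal places would pass in the first case and fail in the other two.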

Summary

I hope you find these tips useful. As in most aspects of software / web development, the key is to do some research up front and to have a plan. Version 1.0 of your system may be US English-only, but if you keep localization issues in mind during the design phase, you won't have to throw it away and start over when your next big opportunity is in Japan.

Links, etc.

Probably the single most useful source of information on internationalization and localization of software applications I've ever encountered is Developing International Software from Microsoft Press, currently [as of 2004] in its second edition. The first edition was by the redoubtable Nadine Kano; the second is by a group of five authors writing under the collective pseudonym of "Dr. International". The focus is firmly on the Microsoft world, but I am unaware of a comparably good book from any other source. A search for "Nadine Kano" or "Dr. International" on Google is a great way to find a ton of useful information on internationalization and localization. Other useful search strings are "i18n" and "Bill Hall" (another fine author on this topic). [In 2005, I had the privilege of working with the "Dr. International" team to rewrite most of the Microsoft SDK's conceptual documents and API references (Win32 PSDK for C++ and .NET Framework) on internationalization and typography.]

Useful online references include:


Copyright ©2001-2004, 2021 Joseph L. Mabel
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. Notification of such use and a link back to this page would be greatly appreciated, but are not required. Please note that GFDL does require appropriate attribution of the authorship of this material, in this case to Joe Mabel.


Last modified: 23 February 2021

My e-mail address is jmabel@joemabel.com. Normally, I check this at least every 48 hours, more often during the working week.