Multilanguage support / internationalization
I'm starting out writing a CMS that needs to have multilanguage support.
Is there a single encoding that can handle all/most languages correctly? Unicode? I will be using a database, let's say MySQL, although it would be nice to be able to change DBs later on.
There are various concerns like how the language data is entered, stored, and returned, throughout the application lifecycle. I realize this is a big topic but can anyone shed some light on what might be a good way to proceed from scratch with respect to internationalization and language support?
First piece of advice would be to take advantage of whatever your framework gives you. For example, I'm developing in .NET so I don't track currencies, number formats (e.g. is it 1,000 or 1.000 for "one thousand"), date formats etc. I let the good folks at MS handle that for me. That essentially leaves me with translations for what I put on the screen. Granted, not all frameworks provide that much, but take advantage of whatever yours does.
I wouldn't think type of DB would be a big issue as most major dbs handle Unicode. If it doesn't, you probably don't want to be using it for other reasons. My guess would be if you change DBs in the future your bigger headache areas will be somewhere besides your internationalization.
Lastly, create something like string IDs and language IDs, where string ID 1 = "This is a string", and then create a different translation of ID 1 for each language. I say this as opposed to creating a table with a column for each language. The reason is that it's easier to support a new language by passing in a new language ID than it is to alter the code and schema so that there is a new column.
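A minimal sketch of that string ID / language ID layout, using Python's built-in sqlite3 as a stand-in for whatever database you end up with. The table name, column names, and the fallback-to-English behaviour are my own assumptions for illustration, not something from the posts above.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one row per (string, language)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE translations (
            string_id   INTEGER NOT NULL,
            language_id TEXT    NOT NULL,
            text        TEXT    NOT NULL,
            PRIMARY KEY (string_id, language_id)
        )
        """
    )
    conn.executemany(
        "INSERT INTO translations VALUES (?, ?, ?)",
        [
            (1, "en", "This is a string"),
            (1, "de", "Das ist eine Zeichenkette"),
            (1, "fr", "Ceci est une chaîne"),
        ],
    )
    return conn


def translate(conn, string_id, language_id, fallback="en"):
    """Look up a string in the requested language, then in the fallback."""
    for lang in (language_id, fallback):
        row = conn.execute(
            "SELECT text FROM translations"
            " WHERE string_id = ? AND language_id = ?",
            (string_id, lang),
        ).fetchone()
        if row is not None:
            return row[0]
    return None  # unknown string ID


conn = build_demo_db()
print(translate(conn, 1, "de"))
print(translate(conn, 1, "ja"))  # no Japanese row, so this falls back to "en"
```

Adding a language here means inserting rows with a new language_id; neither the schema nor the lookup code has to change.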
Of course, the best way to do internationalization is to take over the world and inflict your own preferences, but that can be time consuming and costly....
Is there a single encoding that can handle all/most languages correctly?
Unicode is a character set, not an encoding; UTF-8, UTF-16, UTF-32 and a bunch of more obscure ones are the encodings. But yes, they are supposed to solve all the world's internationalization problems and are the way to go. I'd suggest going with UTF-8, because it's the most widely used one on the web.
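To make the distinction concrete, here is a small Python sketch: one Unicode code point, three different byte sequences depending on the encoding. The helper name encodings_of is made up for this example.

```python
def encodings_of(ch):
    """Return the hex bytes of one character under three Unicode encodings."""
    return {
        codec: ch.encode(codec).hex()
        for codec in ("utf-8", "utf-16-be", "utf-32-be")
    }


ch = "日"
print(f"code point: U+{ord(ch):04X}")  # the Unicode code point itself
for codec, hex_bytes in encodings_of(ch).items():
    print(f"{codec:10} {hex_bytes}")
```

The code point is the same in all three cases; only the bytes written to disk or sent over the wire differ.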
Oh, and thanks for considering i18n. I'm constantly annoyed by apps that don't.
Thanks to both. I did a little research and came to a similar conclusion about UTF-8. Interesting and fairly obscure topic, at least to me. There are subtle considerations, such as string comparison performance depending on whether the encoding uses a fixed number of bytes per character.
It seems like most of the newest DBs support 4-byte UTF-8 (in MySQL that is the utf8mb4 character set), which should cover just about everything.
For encodings, UTF-8 is your most-likely choice
UTF-8 is well supported in many languages and frameworks. If the majority of the text in your system is in Japanese or Chinese, then UTF-16 will save you space (most of those characters take 2 bytes in UTF-16 versus 3 in UTF-8), but if it's mostly Latin characters, stick with UTF-8.
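A quick way to check the space trade-off on your own data, sketched in Python. The sample strings are arbitrary, and "utf-16-le" is used so that the byte-order mark is not counted.

```python
def encoded_sizes(text):
    """Byte length of the same text in UTF-8 and UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


samples = {
    "japanese": "こんにちは世界",
    "english": "Hello, world",
}
for name, text in samples.items():
    print(name, len(text), "characters", encoded_sizes(text))
```

For a CMS the markup around the text is usually ASCII, which tilts the balance further toward UTF-8 even for CJK content.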
For internationalization, you'll typically want to separate out changes in presentation/UI and translations. So, you'll typically end up with a slightly-different layout of your UI for each supported language, and a separate string table, which allows you to look up translated versions of all the UI captions, etc.
Don't forget to use the right Date and Number formats for your locales.
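To show what "the right format" means in practice, here is a deliberately tiny Python sketch. The CONVENTIONS table is hand-written and hypothetical, covering only two locales; in a real application you would let your framework or a locale library supply this data rather than maintain it yourself.

```python
from datetime import date

# Hypothetical, hand-written conventions for illustration only.
CONVENTIONS = {
    "en_US": {"thousands": ",", "decimal": ".", "date": "{m:02d}/{d:02d}/{y}"},
    "de_DE": {"thousands": ".", "decimal": ",", "date": "{d:02d}.{m:02d}.{y}"},
}


def format_number(value, locale_id):
    """Format with two decimals using the locale's separators."""
    conv = CONVENTIONS[locale_id]
    text = f"{value:,.2f}"  # always "1,234.50" style at this point
    # Swap separators via a placeholder so the two replacements don't collide.
    return (
        text.replace(",", "\0")
        .replace(".", conv["decimal"])
        .replace("\0", conv["thousands"])
    )


def format_date(d, locale_id):
    """Format a date in the locale's usual day/month order."""
    return CONVENTIONS[locale_id]["date"].format(d=d.day, m=d.month, y=d.year)


for loc in CONVENTIONS:
    print(loc, format_number(1234.5, loc), format_date(date(2024, 3, 7), loc))
```

Note how 03/07/2024 and 07.03.2024 are the same day; showing the wrong one to a user is a real bug, not a cosmetic issue.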
I've always used UTF-8. Have a read of the Gettext manual, it will highlight some of the language issues you have to consider when writing - date, time, plurals (not all languages have one singular and one plural form). Also, for icons, it's important to pick ones that are familiar globally.
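The plurals point is easy to underestimate, so here is a rough Python sketch of the idea behind Gettext's plural handling: each language supplies a rule that maps a count to one of its plural forms. The catalog structure and function names are invented for this example; the Russian rule follows the commonly used three-form rule for that language.

```python
def plural_index_en(n):
    """English: one form for exactly 1, another for everything else."""
    return 0 if n == 1 else 1


def plural_index_ru(n):
    """Russian: three forms, chosen by the last one or two digits."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and not (10 <= n % 100 < 20):
        return 1
    return 2


# Hypothetical catalog: (plural rule, list of forms) per language.
CATALOG = {
    "en": (plural_index_en, ["{n} file", "{n} files"]),
    "ru": (plural_index_ru, ["{n} файл", "{n} файла", "{n} файлов"]),
}


def n_files(n, lang):
    """Pick the plural form for n in the given language."""
    rule, forms = CATALOG[lang]
    return forms[rule(n)].format(n=n)


for n in (1, 3, 5, 11, 21):
    print(n_files(n, "en"), "|", n_files(n, "ru"))
```

This is why code like `count + " file" + (count != 1 ? "s" : "")` cannot be translated: the number of forms and the rule for choosing between them both vary by language.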
Just wanted to throw in my $0.02. UTF-8 is the Unicode encoding I have seen preferred everywhere. It is ASCII-compatible as well: plain ASCII text is already valid UTF-8, which means you can pass UTF-8 strings to functions that work with ASCII strings so long as you don't need them to interpret the string itself. In other words, if you just need the number of bytes and not the number of characters, ASCII functions work fine.
MySQL supports UTF-8 character sets and collations on tables and has for quite a while. You can actually store UTF-8 in an ASCII table, but certain string functions won't behave the way you think they do.
One thing you must be careful of is to always use mysql_real_escape_string rather than mysql_escape_string. The "real" variant takes the database connection and uses that connection's character set when escaping. This matters because correct escaping depends on the character set: in some multibyte character sets, a byte that looks like a quote or a backslash can be part of a longer character, and the only way for the client code to know how to escape is from the charset in use on the connection. A string escaped with mysql_escape_string can therefore break, leading to possible SQL injection exploits.
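A different technique that sidesteps the escaping question entirely is to bind values as query parameters instead of building SQL strings by hand. The sketch below uses Python's built-in sqlite3 as a stand-in, not MySQL, and the pages table is made up; the same idea is available as prepared statements in most MySQL client libraries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT)")


def add_page(conn, title):
    """Insert a title; the driver passes the value separately from the SQL."""
    conn.execute("INSERT INTO pages (title) VALUES (?)", (title,))


def find_pages(conn, title):
    """Exact-match lookup, again with a bound parameter."""
    rows = conn.execute("SELECT title FROM pages WHERE title = ?", (title,))
    return [row[0] for row in rows]


hostile = "x'; DROP TABLE pages; --"
add_page(conn, "日本語のページ")
add_page(conn, hostile)  # stored as plain data, never parsed as SQL

print(find_pages(conn, "日本語のページ"))
print(find_pages(conn, hostile))
```

Because the value never becomes part of the SQL text, there is nothing to escape, whatever the character set.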
This post wasn't funny. Please try again, chief.
Just a minor heads up - if you are going to use MySQL in an i18n app, be very careful when searching on text fields. I write my code in the Far East and many of the languages around here (Thai, Japanese, Cantonese etc.) require Unicode AND the correct collations to be set. If you don't get it right, then you'll find that you cannot search on text fields correctly. I generally recommend using a binary collation for non-Latin-based languages.
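To illustrate how much the collation decides what a search finds, here is a small sketch using SQLite through Python's sqlite3, not MySQL, so the collation names differ from MySQL's. SQLite's built-in NOCASE collation only folds ASCII letters, which makes it a handy example of a collation that quietly does the wrong thing for non-ASCII text. The places table and the search helper are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT)")
conn.executemany("INSERT INTO places VALUES (?)", [("Tokyo",), ("École",)])


def search(conn, term, collation):
    """Exact-match search under an explicitly chosen SQLite collation."""
    if collation not in ("BINARY", "NOCASE"):
        raise ValueError("unsupported collation")
    sql = f"SELECT name FROM places WHERE name = ? COLLATE {collation}"
    return [row[0] for row in conn.execute(sql, (term,))]


print(search(conn, "tokyo", "BINARY"))  # byte-for-byte comparison
print(search(conn, "tokyo", "NOCASE"))  # ASCII case is folded
print(search(conn, "école", "NOCASE"))  # "É" is not ASCII, so not folded
```

A binary collation is predictable because it compares bytes and nothing else, which is the appeal of the recommendation above; the cost is that you lose case-insensitive and accent-insensitive matching.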