Recommendations for translating database values?

Benjamin Hall

My first actual job-related help question. Y'all are pretty smart, and lots of you deal with multiple language products.

We support a bunch of different languages. But there are some values that currently are non-localized. They're pulled directly from the database and are basically used as display values in <select> elements. Not the ids themselves, thankfully. That's a good integer that won't have to change.

Names of allowed colors for icons
Names of supported icon shapes
Types of supported devices.
Etc.

The ticket I'm working on is to define a path forward for internationalizing these "database values".

Any suggestions on resources for doing this? From what I can tell from a quick Google search, the options seem to be

Add a new column per language per column that needs translating. Ugh. Combinatorial explosion.
Add a new "global" translations table, with columns for each translated column across all tables and rows for each language...or vice versa. Requires every query to join against this translations table, no matter what.
Add one new translation table (similar to the above) for each table that needs translation. Fewer joins, but more new tables.
???????

Gąska

@Benjamin-Hall a single global table with all translatable keys, but with only three columns: the ID of the text thing, the ID of the desired language, and the translation. That way, neither new screens nor new languages require any modifications to the schema.

DON'T use English for primary key. Create an artificial unique identifier for each translatable term and only ever use that in code. Otherwise you will find yourself in a situation where two terms are the same in English but different in some other language, so a simple lookup doesn't work anymore, and you'll invent some absolutely awful hack that will eventually end up on the frontpage here. E.g. instead of Hello, use TXT_DASHBOARD_HELLO.

If performance overhead becomes unbearable, you can make some materialized views for the most used languages. But remember what they say about premature optimization.

Benjamin Hall

@Gąska said in Recommendations for translating database values?:

@Benjamin-Hall a single global table with all translatable keys, but with only three columns: the ID of the text thing, the ID of the desired language, and the translation. That way, neither new screens nor new languages require any modifications to the schema.

DON'T use English for primary key. Create an artificial unique identifier for each translatable term and only ever use that in code. Otherwise you will find yourself in a situation where two terms are the same in English but different in some other language, so a simple lookup doesn't work anymore, and you'll invent some absolutely awful hack that will eventually end up on the frontpage here. E.g. instead of Hello, use TXT_DASHBOARD_HELLO.

If performance overhead becomes unbearable, you can make some materialized views for the most used languages. But remember what they say about premature optimization.

Yeah, primary keys are already all AUTOINCREMENT INT (or whatever you call them). No fear there. I will say that we have a distinct paucity of proper foreign key constraints...there are lots of foreign keys and joins, but I've yet to see actual constraints. Or maybe I'm looking in the wrong place/don't know where to look. That's very plausible as well, SQL is my weakest language (of the ones I consider myself versed in at all).

I'd have to investigate further to see if there's already a languages table. But I know what all the supported languages are currently, and there aren't that many and it should be pretty static (for now).

That approach (the "global" translations table was one I was leaning toward), with the one-translation-table-per-data-table approach as the other option. I'll probably present both. Now I have to write my reasoning down and talk to the team lead. Tomorrow.

There was a non-database option suggested elsewhere internally, but I think it's a bad idea. Right now, all the translations are done client side based on a big flat map. We could just add all the to-be-translated database values to that and carry on as normal. Low effort...right now. But it turns those ostensibly-database values into hard-coded values that require full client releases to show up. And we support a whole bunch of clients (Android, iOS, a web admin interface, an HTML5 application, and a WPF windows desktop client). So any change to any of these many many values and we have to push updates to each and every one of them, instead of just twiddling something in the database. No thank you.

Gąska

@Benjamin-Hall said in Recommendations for translating database values?:

I'd have to investigate further to see if there's already a languages table. But I know what all the supported languages are currently, and there aren't that many and it should be pretty static (for now).

It's one of those things that's extremely likely to change in the future. You really don't want to hardcode the supported languages, and if they're already hardcoded, you want to get rid of that at the first opportunity.

@Benjamin-Hall said in Recommendations for translating database values?:

There was a non-database option suggested elsewhere internally, but I think it's a bad idea. Right now, all the translations are done client side based on a big flat map.

This isn't such a bad idea, honestly. It saves you a lot of joins, and that's always a good thing. And JS is extremely well optimized for lookup tables (due to every object being basically a full-blown lookup table), so it's unlikely to slow things down, and it won't increase network traffic much either. The only issue you might have is those privacy nuts who turn off all scripts.

And what I said about English language keys applies to your big flat map as well. Just saying.

error

@Gąska said in Recommendations for translating database values?:

remember what they say about premature optimization.

It happens to a lot of devs?

Benjamin Hall

The flat map would have to be completely regenerated for new languages. It'd be a nightmare, honestly.

As far as turning off scripts...good luck. The whole thing goes nowhere without them.

@Gąska said in Recommendations for translating database values?:

And what I said about English language keys applies to your big flat map as well. Just saying

Yeah, the flat map is all english-keyed, but they're ALL_CAPS_AND_STILTED. Well, the newer ones are. There are, for instance, translations for 1 in there. really.

My big concern is that doing the flat map means all value changes (which happen not-infrequently) now have to get synced to all clients independently (all native clients, so different formats for the actual output, plus some code changes to support them), which entails going through app-store approval and causes clients to be out of sync. Which can cause a real problem if a new client says "make this marker grue" but half the clients across their agency don't support grue yet because they haven't updated to LATEST. Whereas a database change either works for everyone or breaks (and gets fixed).

Gąska

@Benjamin-Hall said in Recommendations for translating database values?:

The flat map would have to be completely regenerated for new languages. It'd be a nightmare, honestly.

Make separate maps per language. You don't need more than one at any given time. They can be separate, static JS files, making them easily cacheable.

As far as turning off scripts...good luck. The whole thing goes nowhere without them.

That's a good thing - it means you don't have to worry about it.

@Gąska said in Recommendations for translating database values?:

And what I said about English language keys applies to your big flat map as well. Just saying

Yeah, the flat map is all english-keyed, but they're ALL_CAPS_AND_STILTED. Well, the newer ones are. There are, for instance, translations for 1 in there. really.

As long as no two places reuse the same key in different contexts, that's fine.

My big concern is that doing the flat map means all value changes (which happen not-infrequently) now have to get synced to all clients independently (all native clients, so different formats for the actual output, plus some code changes to support them), which entails going through app-store approval and causes clients to be out of sync. Which can cause a real problem if a new client says "make this marker grue" but half the clients across their agency don't support grue yet because they haven't updated to LATEST. Whereas a database change either works for everyone or breaks (and gets fixed).

Ah, I thought it's a web app only. Yes, if the translations need to be updated significantly more often than new versions are deployed, it makes sense to put it in the DB.

Benjamin Hall

@Gąska said in Recommendations for translating database values?:

@Benjamin-Hall said in Recommendations for translating database values?:

The flat map would have to be completely regenerated for new languages. It'd be a nightmare, honestly.

Make separate maps per language. You don't need more than one at any given time. They can be separate, static JS files, making them easily cacheable.

There are separate maps...on the two mobile clients. The main global flat one is a master list that is used to programmatically generate the client-specific output files. At least on the mobile clients, they have to get baked into the app bundle as resources, so they end up as xml files.

Who knows what the web client does, and the web interface (admin side) is probably all javascript.

For the static values (labels, interface pieces, etc) this works pretty ok. But for the dynamically-loaded database-side values...

My big concern is that doing the flat map means all value changes (which happen not-infrequently) now have to get synced to all clients independently (all native clients, so different formats for the actual output, plus some code changes to support them), which entails going through app-store approval and causes clients to be out of sync. Which can cause a real problem if a new client says "make this marker grue" but half the clients across their agency don't support grue yet because they haven't updated to LATEST. Whereas a database change either works for everyone or breaks (and gets fixed).

Ah, I thought it's a web app only. Yes, if the translations need to be updated significantly more often than new versions are deployed, it makes sense to put it in the DB.

It may not even be frequency, but being able to deploy it globally and independently of client releases. And certainly independently of updates. We still have people running on iOS 8, for goodness sake!

Zenith

@Benjamin-Hall I don't know if I can help except maybe what not to do.

The voter registration app I used to support had a column per language. However, it only supported English and Spanish. And I'm pretty sure the push for Spanish was bolted on after we had a hispanic secretary appointed. If I were going to use that approach in an app, I'd probably name the columns after the language code.

Response.Write((DataTable.Columns.Contains(Session["MY_LANGUAGE"]) == true ? DataTable.Rows[0][Session["MY_LANGUAGE"]] : DataTable.Rows[0]["en-US"]));

Pros:

You can force every language addition to be done on all messages (set a constraint that the new row must be non-NULL).
Code can sort of auto pickup new languages without modification.

Cons:

Having to check column existence is clumsy until you're used to it.
A DBA has to alter the table before you can add more languages.

An alternative would be to have more of a row based system. You'd have the key, the language, and the text. Put the English (or whatever the default is) as NULL. Then do something like SELECT TOP(1) * FROM messages WHERE key = @key AND ISNULL(language, @language) = @language ORDER BY language DESC;. The sort should force the NULL to the bottom, so it only gets picked up if there isn't a preferred language row.

Pros:

Code can sort of auto pickup new languages without modification.
Code doesn't have to do that dumb existence check.

Cons:

Some clerical person can add more languages without involving a DBA.
It's more difficult to make sure an entire language set is added.

Gąska

@Benjamin-Hall said in Recommendations for translating database values?:

My big concern is that doing the flat map means all value changes (which happen not-infrequently) now have to get synced to all clients independently (all native clients, so different formats for the actual output, plus some code changes to support them), which entails going through app-store approval and causes clients to be out of sync. Which can cause a real problem if a new client says "make this marker grue" but half the clients across their agency don't support grue yet because they haven't updated to LATEST. Whereas a database change either works for everyone or breaks (and gets fixed).

Ah, I thought it's a web app only. Yes, if the translations need to be updated significantly more often than new versions are deployed, it makes sense to put it in the DB.

It may not even be frequency, but being able to deploy it globally and independently of client releases. And certainly independently of updates. We still have people running on iOS 8, for goodness sake!

Consider keeping it in a separate file (not part of the installation) that's updated on every app launch. So like DB, except without DB (so without joins everywhere).

Benjamin Hall

@Gąska said in Recommendations for translating database values?:

@Benjamin-Hall said in Recommendations for translating database values?:

My big concern is that doing the flat map means all value changes (which happen not-infrequently) now have to get synced to all clients independently (all native clients, so different formats for the actual output, plus some code changes to support them), which entails going through app-store approval and causes clients to be out of sync. Which can cause a real problem if a new client says "make this marker grue" but half the clients across their agency don't support grue yet because they haven't updated to LATEST. Whereas a database change either works for everyone or breaks (and gets fixed).

Ah, I thought it's a web app only. Yes, if the translations need to be updated significantly more often than new versions are deployed, it makes sense to put it in the DB.

It may not even be frequency, but being able to deploy it globally and independently of client releases. And certainly independently of updates. We still have people running on iOS 8, for goodness sake!

Consider keeping it in a separate file (not part of the installation) that's updated on every app launch. So like DB, except without DB (so without joins everywhere).

Can't, really. Not on the mobile clients. The core APIs provide the language translation service (based on the user's locale) and expect those to be in specific formats in specific places in the app bundle. It's all supposed to be transparent to the developer--they just use a resource ID and ask for the string that matches it, and it injects the right one. No dev-side code to write there.

But that only works for static values, really. So making these dynamic values part of those means that they can only be updated on client update.

Gąska

@Benjamin-Hall I meant the other way around. Get rid of the static part and make everything dynamic. Forget about the SDK-provided stuff and do the translation yourself. It's either that or having a weird static-dynamic hybrid that will be twice as much work to maintain.

But even if you can't refactor the existing part and thus must go with the weird hybrid, there's no reason why you can't add the dynamic translation file downloaded on launch on top (or rather, side by side with) whatever you already have but can't use. Keep it in app data or something.

dkf

@Benjamin-Hall said in Recommendations for translating database values?:

Add a new column per language per column that needs translating. Ugh. Combinatorial explosion.

DO NOT DO THAT!

Whatever else you choose to do, don't go in for anything where you might be thinking of a column name being a variable. Those are always design mistakes.

dkf

@Zenith said in Recommendations for translating database values?:

It's more difficult to make sure an entire language set is added.

That's a DB check. It just needs to look for cases where there are translations for some languages but not others, which is a classic SQL type thing. Could probably be done nightly or even weekly.

bobjanova

For me, internationalisation should be separated from the business data that's in your database. A list of valid colours or a saved list of countries or something like that (which is user data or data relating to business logic) should be colour/country/whatever codes, not displayable text. You should only translate them into displayable text in the appropriate locale in the client.

Personally I like to keep that entirely out of the database. We have a big Java web app so it's mostly in message.fr.properties type files, or React-intl JSON files for some client side text. In a .Net client app you'd probably use embedded resources, though it depends if you want them to be updatable outside a release cycle. For some complex messages we use parameter replacement ("The file {0} was invalid because the location (1) wasn't accessible", that kind of thing). But if you do decide to put it in the database then it should at least be in a different table.

Keeping it out of the database means you can't accidentally write lazy queries that cross over the two concepts. This is almost always a mistake, because your data access code and the business logic requesting those queries shouldn't be dealing with displayable text.

PleegWat

@bobjanova If @Benjamin-Hall has more UI frontends than he can count on the fingers of one hand, he'll want to keep it frontend-agnostic.

I'd suggest keeping the data translations in a universal format where all frontends which need it can get at it. That could be a database table, but might as well be a collection of static json files on a web server.

Unrelated: parameter replacement is something to be careful about, since it can easily lead to untranslatable strings.

dkf

@PleegWat said in Recommendations for translating database values?:

That could be a database table, but might as well be a collection of static json files on a web server.

The advantage of a database is that you can much more easily find out what's missing from some languages but not others, and you don't even have to assume that any language is complete (just that there's at least a translation into some language). Analysing a big bunch of flat files for problems is a lot more difficult. It doesn't have to be the same database as the main business data; there's no need to share anything there and any query that crossed over the two areas would probably be pretty horrific. (It should be the front-end logic that actually does substitutions into the translated templates. It's good at that sort of thing.)

bobjanova

@PleegWat said in Recommendations for translating database values?:

If @Benjamin-Hall has more UI frontends than he can count on the fingers of one hand, he'll want to keep it frontend-agnostic.

Indeed. Presumably all those UIs are talking to a common data display layer, and that's where the translation should happen in this case.

parameter replacement is something to be careful about, since it can easily lead to untranslatable strings

True, but by being able to re-order the parameter values you can mitigate most of that, and I'm not aware of an alternative when you want to construct strings from multiple parts ("Your order could not be processed because we are out of stock for Thingies and Watsits").

bobjanova

@dkf said in Recommendations for translating database values?:

The advantage of a database is that you can much more easily find out what's missing from some languages but not others

Doing this comparison across properties files or similar is not hard - we have a unit* test in our build that does it. That doesn't prove that you didn't forget to add a string to all languages, true, but you could forget to add a row in the database too, right?

*: I guess technically not since it reads files from disk, but close enough

Zenith

@dkf Sorry, I know it's a basic SQL query but I've been in a a hyper-siloed environment for too long.

Benjamin Hall

@bobjanova there's a common back end api call that all the clients call to get the display pieces from the db. It seems to me that adding a parameter with the language and letting the db do the right thing with the translation is less likely to go sproing than adding logic to each client to munge that themselves.

dkf

@bobjanova said in Recommendations for translating database values?:

but you could forget to add a row in the database too, right?

Yes, but a bit of querying should be able to identify any translations that are not present in all languages. It's mostly just SELECT DISTINCT and so on, and if it is a daily overnight or weekly over-the-weekend job then it doesn't really need a lot of thought on how to make it efficient. The query will pick up where things are incomplete, and that makes fixing things fairly easy. (It also makes deriving completeness metrics for manglement reporting easy.)

Assuming you can find someone who understands what the translation should be. Without that, adding translations is going to be tricky…

Benjamin Hall

Update:

So. Turns out the scope of the ticket was different than I thought. I only have to worry about the web interface, for now.

But, it turns out that the web interface

didn't actually query the database at all. The values had been hard-coded and actually translated a while ago. Except that the translation never got hooked up so no one realized it had been done at all.
I'm supposed to figure out a way of doing it "right" instead.
Most of the values aren't really in tables at all directly. Instead, they're the possible values for the enum that defines the column type. As in Color ENUM(red, green, blue, ...). Which makes querying for them rather painful.

My current thought is to have a simple file listing (in some format, probably bunched with a key for each page or modal) a bunch of value/translation-id pairs, where value is the thing that goes into the database, not the user-facing label. That would get generated from the translation project just like other static values and included in the interface project by the automated build tool. The display logic for each modal would then call the internationalization library on all the matching translation ids and replace the default labels (which might be english or might be blank, doesn't really matter I don't think).

That way, I can alter the logic to query the values from the database, use that as the key to grab the translation id, then get the proper label that way. I'll probably ask if sometime we can move to actual lookup tables and foreign key relationships, because, you know, relational databases and all that.

Zecc

@error said in Recommendations for translating database values?:

@Gąska said in Recommendations for translating database values?:

remember what they say about premature optimization.

It happens to a lot of devs?

I'm just glad I get the opportunity to optimize now and then.