I maintain a pluggable Django app that contains translations. All strings in Python and HTML code are written in English. When translating the strings to German, I'm always fighting with the problem that German differentiates between formal and informal speech (see T–V distinction). Because the app is used on different sites, ranging from a social network to a banking website, I can't just support either the formal or informal version. And since the translations can differ quite a bit, there's no way I can parameterize it. E.g. the sentence "Do you want to log out?" would have these two translations:
Wollen Sie sich abmelden? (formal)
Willst du dich abmelden? (informal)
Is there anything in Gettext that could help me with this?
You can use contextual markers to give your translations additional context.
from django.utils.translation import pgettext

logout = pgettext('casual', 'Do you want to log out?')
...
logout = pgettext('formal', 'Do you want to log out?')
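Since each site needs only one of the two forms, the context could be chosen once per deployment. Below is a minimal sketch of that idea using the standard library's gettext.pgettext (Python 3.8+), which takes the same (context, message) arguments as Django's pgettext. USE_FORMAL_ADDRESS and translate_addressed are hypothetical names made up for this illustration, not part of gettext or Django.

```python
import gettext

# Hypothetical per-site switch; in a Django project this could live in settings.
USE_FORMAL_ADDRESS = True


def translate_addressed(message, formal=None, translator=gettext):
    """Look up `message` under the 'formal' or 'casual' context.

    `translator` is anything with a pgettext(context, message) method,
    for example the gettext module or a GNUTranslations instance.
    """
    if formal is None:
        formal = USE_FORMAL_ADDRESS
    context = "formal" if formal else "casual"
    return translator.pgettext(context, message)


# With no catalog installed, the English message ID comes back unchanged.
print(translate_addressed("Do you want to log out?"))
```

One caveat: string extraction tools look for literal pgettext('formal', '...') calls, so with a wrapper like this the strings probably still have to be marked for both contexts somewhere the extractor can see them.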
The best approach, used in similar situations by gettext as well as UNIX, is to use locale variants. For example, sr_RS is (or was, because Serbian is considered a metalanguage these days...) the code used for Serbian written in Cyrillic. But it’s sometimes written in Latin script too, so sr_RS@latin is used as the language name and, of course, as the MO filename or directory name as well.
Here, have a look at some translations I have present on my system:
$ find /usr/local/share/locale | grep /sr
/usr/local/share/locale/sr
/usr/local/share/locale/sr/LC_MESSAGES
/usr/local/share/locale/sr/LC_MESSAGES/bash.mo
/usr/local/share/locale/sr/LC_MESSAGES/bfd.mo
/usr/local/share/locale/sr/LC_MESSAGES/binutils.mo
/usr/local/share/locale/sr/LC_MESSAGES/gettext-runtime.mo
/usr/local/share/locale/sr/LC_MESSAGES/gettext-tools.mo
/usr/local/share/locale/sr/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr/LC_MESSAGES/wget.mo
/usr/local/share/locale/sr@ije
/usr/local/share/locale/sr@ije/LC_MESSAGES
/usr/local/share/locale/sr@ije/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr@latin
/usr/local/share/locale/sr@latin/LC_MESSAGES
/usr/local/share/locale/sr@latin/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr_RS
/usr/local/share/locale/sr_RS/LC_MESSAGES
/usr/local/share/locale/sr_RS/LC_MESSAGES/mkvtoolnix.mo
/usr/local/share/locale/sr_RS@latin
/usr/local/share/locale/sr_RS@latin/LC_MESSAGES
/usr/local/share/locale/sr_RS@latin/LC_MESSAGES/mkvtoolnix.mo
$
So the best way to handle German variants is the same: use de (or de_DE) for the base informal variant and have a separate translation file de_DE@formal with the formal variant of the translation.
This is basically what WordPress does too. Of course, being WordPress, they have their own special flavour and don’t use the variant syntax, but instead add a third component to the filename: de_DE.mo is the informal (and also fallback, because it lacks any further specification) variant and de_DE_formal.mo contains the formal variant.
Is there a way to use Django's I18n features with keys in the code instead of storing the default language in the code and the others in .po/.mo files?
Something similar to Wikipedia, where the code has a "(whatlinkshere)" key that is translated in the English translation file as "What links here", in the French one as "Pages liées", etc.
I guess I could make "qqq" or "qqx" the default language and work from there, but then a person with a browser set in a non-managed language would see the keys instead of English.
The problem with having the English as default in the code is that if you make a slight adjustment to a string in English, the translation is lost altogether in the other language versions.
I am using Python 2.7 and Django 1.7.
When I use Google Translator Toolkit to machine translate my .po files to another language (English to German), there are many errors due to the use of different django template variables in my translation tags.
I understand that machine translation is not so great, but I only want to test my translation strings on my test pages.
Here is an example of a typical error of the machine-translated .po file translated from English (en) to German (de).
#. Translators: {{ site_name_lowercase }} is a variable that does not require translation.
#: .\templates\users\reset_password_email_html.txt:47
#: .\templates\users\reset_password_email_txt.txt:18
#, python-format
msgid ""
"Once you've returned to %(site_name_lowercase)s.com, we will give you "
"instructions to reset your password."
msgstr "Sobald du mit% (site_name_lowercase) s.com zurückgegeben haben, geben wir Ihnen Anweisungen, um Ihr Passwort zurückzusetzen."
The %(site_name_lowercase)s is machine translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
I have hundreds of these types of errors, and I estimate that a find & replace would take at least 7 hours. Plus, if I run makemessages and then translate the .po file again, I would have to go through the find and replace again.
I am hoping that there is some type of undocumented rule in the Google Translator Toolkit that will allow the machine-translation to ignore the variables. I have read the Google Translator Toolkit docs and searched SO & Google, but did not find anything that would assist me.
Does anyone have any suggestions?
The %(site_name_lowercase)s is machine translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
This is caused by tokenization prior to translation, followed by detokenization after translation, i.e. Google Translate tries to split the input before translation to re-merge it after translation. The variables you use are typically composed of characters that are used by tokenizers to detect token boundaries. To avoid this sort of problem, you can pre-process your file and replace the offending variables by placeholders that do not have this issue - I suggest you try out a couple of things, e.g. _VAR_PLACE_HOLDER_. It is important that you do not use any punctuation characters that may cause the tokenizer to split. After pre-processing, translate your newly generated file and post-process by replacing the placeholders by their original value. Typically, your placeholder will be picked up as an Out-Of-Vocabulary (OOV) item and it will be preserved during translation. Try to experiment with including a sequence number (to keep track of your placeholders during post-processing), since word reordering may occur. There used to be a scientific API for Google Translate that gives you the token alignments. You could use these for post-processing as well.
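The pre-processing and post-processing steps described above could look roughly like this. It is only a sketch: the placeholder spelling _VAR_PLACE_HOLDER_n_ and the function names protect and restore are assumptions for illustration, and the regular expression covers only simple named variables such as %(name)s.

```python
import re

# Matches Python named format variables such as %(site_name_lowercase)s.
VARIABLE_RE = re.compile(r"%\([A-Za-z_][A-Za-z0-9_]*\)[sdif]")


def protect(text):
    """Replace each variable with a numbered placeholder before translation."""
    mapping = {}

    def substitute(match):
        # The sequence number keeps placeholders distinguishable after reordering.
        placeholder = "_VAR_PLACE_HOLDER_{}_".format(len(mapping) + 1)
        mapping[placeholder] = match.group(0)
        return placeholder

    return VARIABLE_RE.sub(substitute, text), mapping


def restore(text, mapping):
    """Put the original variables back after translation."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


msgid = ("Once you've returned to %(site_name_lowercase)s.com, we will give "
         "you instructions to reset your password.")
safe, mapping = protect(msgid)
# `safe` is what would be sent to the translation service; the translated
# result is then passed through restore() together with `mapping`.
print(safe)
print(restore(safe, mapping) == msgid)
```

Whether the translation service really leaves such placeholders untouched has to be checked by experiment, as suggested above.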
Note that this procedure will not give you the best translation output possible, as the language model will not recognize the placeholder. You can see this illustrated here (with placeholder, the token "gelezen" is in the wrong place):
https://translate.google.com/#en/nl/I%20have%20read%20SOME_VARIABLE_1%20some%20time%20ago%0AI%20have%20read%20a%20book%20some%20time%20ago
If you just want to test the system for your variables, and you do not care about the translation quality, this is the fastest way to go.
Should you decide to go for a better solution, you can solve this issue yourself by developing your own machine translation system (it's fun, by the way, see http://www.statmt.org/moses/) and applying the procedure explained above, but then with, for example, part-of-speech tags to improve the language model. Note that you can use the alignment information as well.
I am creating an OpenGL game and I would like to make it open to more languages than just English for obvious reasons. From looking around and fiddling around with the games installed on my computer I can see that locales play a big part in this and that .lang files, such as en-US.lang that is shipped with minecraft, are basically text documents with a language code, "item.iron.ingot" for example, an equal sign, and then what it means for that given language, English as per en-US, so in this case would be, "Iron Ingot". Well I created a file that I named en-US.lang and this is its contents:
item.iron.ingot=Iron Ingot
In my C++ main method I put:
setlocale(LC_ALL, "en-US");
After including the locale header file. So I suppose the part that I am confused by is how to use the locales to read from the .lang file? Please help SO and some example code would be appreciated.
C++ does not come with built-in support for resource files or internationalization. However, there is a huge variety of solutions.
To support multi-language messages, you should have some basic understanding of how such strings are encoded in files and read to memory. Here is a basic introduction if you are not familiar:
http://www.joelonsoftware.com/articles/Unicode.html
To keep and load the correct text at runtime you need to use a third party library: GNU gettext http://www.gnu.org/software/gettext/ is one such example. However there are other solutions out there.
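If a full library is more than you need, note that setlocale by itself does not read a .lang file; the file has to be parsed by your own code. The key=value format from the question is simple enough for a small loader. The sketch below is in Python for brevity (the same logic ports to C++ with std::ifstream and std::getline). The function names and the convention of skipping blank lines and # comment lines are assumptions for illustration.

```python
def load_lang(lines):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            table[key.strip()] = value.strip()
    return table


def translate(table, key):
    """Fall back to the key itself when no translation exists."""
    return table.get(key, key)


# Reading from disk would be: load_lang(open("en-US.lang", encoding="utf-8"))
strings = load_lang(["item.iron.ingot=Iron Ingot"])
print(translate(strings, "item.iron.ingot"))
```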
Question
I am trying to match browsers set to Scandinavian languages based on HTTP header "Accept-Language".
My regex is:
^(nb|nn|no|sv|se|da|dk).*
My question is whether this is sufficient, and whether anyone knows about any other odd Scandinavian (but "valid") language codes or obscure browser bugs causing false positives.
Used for
The regex is used for displaying an English link at the top of the Norwegian web pages (Norwegian is the primary language and sits at the root of the domain and sub-domains). The link takes you to the English web pages (the secondary language, in a folder under the root) when the browser language is not Scandinavian. The link can be closed / "opted out of" with a hash stored in JavaScript localStorage if the user doesn't want to see the link again. We decided not to use IP geo-location because of limited time to implement.
Depending on the language you are working in there may be code in place you can use to parse this easily, e.g. this post: Parse Accept-Language header in Java <-- Also provides a good code example
Further, are you sure you want to limit your regex to the start of the string? Several languages can be provided (the first is intended to be "I prefer x but also accept the following"): http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
Otherwise your regex should work fine based on what you were asking, and here is a list of all browser language codes: http://www.metamodpro.com/browser-language-codes
In your shoes, I would also make the "switch to X language" link easy to find for all users until they have opted not to see it again. I expect many people have a preference set by default in their browser but would find a site actually using it unexpected, i.e. a user experience like:
I prefer English but don't know enough to change this setting and have never had a reason to before as so few sites make use of it.
That regular expression is enough if you are testing each item in accept-language individually.
If not individually, there are 2 problems:
One of the expected languages might appear not at the beginning of the header, but later.
Some of the expected language abbreviations could appear as a qualifier of a completely different language.
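Testing each item individually could be sketched as follows in Python. The function names are made up for this example, and the pattern is a slightly tightened variant of the one in the question: it requires the two-letter code to be followed by a hyphen or the end of the tag, so that three-letter codes that merely start with the same letters are not matched. Treating a Scandinavian tag anywhere in the list as a match, rather than only the first preference, is a design assumption.

```python
import re

# Codes taken from the question; "(-|$)" replaces the original ".*".
SCANDINAVIAN_RE = re.compile(r"^(nb|nn|no|sv|se|da|dk)(-|$)", re.IGNORECASE)


def parse_accept_language(header):
    """Return the language tags from the header, highest q-value first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            entries.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(entries)]


def accepts_scandinavian(header):
    """True if any accepted language tag is Scandinavian."""
    return any(SCANDINAVIAN_RE.match(tag) for tag in parse_accept_language(header))


print(accepts_scandinavian("en-US,en;q=0.8,nb;q=0.6"))
print(accepts_scandinavian("fr-SE,fr;q=0.9"))
```

The second header shows the qualifier problem: "SE" appears only as the region of a French tag, so it is not counted.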
I'm building a C++/MFC program in a multilingual environment. I have one main (national) language and three international languages. Every time I add a feature to the program I have to keep the international languages up-to-date with the national one. The resource editor in Visual Studio is not very helpful because I frequently end up leaving a string, dialog box, etc., untranslated.
I wonder if you guys know of a program that can edit resource (.rc) files and
Build a file that includes only the strings to be translated and their respective IDs and accepts the same (or similar) file in another language (this would be helpful since usually the translation is done by someone else), or
Handle the translations itself, allowing to view the same string in different languages at the same time.
In my experience, internationalization requires a little more than translating strings. Many strings when translated, require more space on a dialog. Because of this it's useful to be able to customize the dialogs for each language. Otherwise you have to create dialogs with extra space for the translated strings which then looks less than optimal when displayed in English.
Quite a while back I was using a translation tool for an MFC application but the company that produced the software stopped selling it. When I tried to find a reasonably priced replacement I did not find one.
Check out Lingobit Localizer. Expensive, but well worth it.
Here's a script I use to generate resource files for testing in different languages. It just parses a response from babelfish so clearly the translation will be about as high quality as that done by a drunken monkey, but it's useful for testing and such
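# Expects $trfile (key=value pairs), $langs (Babelfish language pair)
# and $agent (user-agent string) to be set beforehand.
for i in $trfile
do
    # Split the pair into key and value
    key=`echo $i | sed 's/^\(.*\)=\(.*\)$/\1/g'`
    value=`echo $i | sed 's/^\(.*\)=\(.*\)$/\2/g'`
    url="http://babelfish.altavista.com/tr?doit=done&intl=1&tt=urltext&lp=$langs&btnTrTxt=Translate&trtext=$value"
    # -U sets the user agent (-A is wget's accept list)
    wget -O foo.html -U "$agent" "$url" &> /dev/null
    # Pull the translated text out of the result cell
    tx=`grep "<td bgcolor=white class=s><div style=padding:10px;>" foo.html`
    tx=`echo $tx | iconv -f latin1 -t utf-8 | sed 's/<td bgcolor=white class=s><div style=padding:10px;>\(.*\)<\/div><\/td>/\1/g'`
    echo "$key=$tx"
done
rm foo.html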
Check out appTranslator; it's relatively cheap and works rather well. The guy developing it is really responsive to enhancement requests and bug reports, so you get really good support.
You might take a look at Sisulizer http://www.sisulizer.com. Expensive though. We're evaluating it for use at my company to manage the headache of ongoing translation. I read on their About page that the company was founded by people who left Multilizer and other similar companies.
If there isn't one, it would be pretty easy to loop through all the strings in a resource and compare them to the international resources. You could probably do this with a simple grid.
In the end we have ended up building our own external tools to manage this. Our devs work in the English string table, and every automated build sends the strings that have been added, changed or deleted to the translation manager. He can also run a report at any time from an old build to determine what is required for translation.
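A rough sketch of such a comparison in Python, assuming the strings live in STRINGTABLE blocks with one ID "text" pair per line. The parser is a deliberate simplification of real .rc syntax (no preprocessor directives, no multi-line strings), and the function names are made up for this example.

```python
import re

# Matches lines such as:  IDS_HELLO  "Hello"
STRING_RE = re.compile(r'^\s*(\w+)\s*,?\s*"((?:[^"]|"")*)"\s*$')


def read_string_table(rc_text):
    """Collect ID -> text pairs from STRINGTABLE blocks in .rc source text."""
    strings = {}
    in_table = False
    in_body = False
    for line in rc_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("STRINGTABLE"):
            in_table = True
        elif in_table and stripped in ("BEGIN", "{"):
            in_body = True
        elif in_body and stripped in ("END", "}"):
            in_table = in_body = False
        elif in_body:
            match = STRING_RE.match(line)
            if match:
                strings[match.group(1)] = match.group(2)
    return strings


def untranslated(main_rc, other_rc):
    """IDs that are missing from other_rc or still carry the main-language text."""
    main = read_string_table(main_rc)
    other = read_string_table(other_rc)
    return sorted(key for key, text in main.items() if other.get(key) in (None, text))


main_rc = 'STRINGTABLE\nBEGIN\n    IDS_OPEN "Open"\n    IDS_CLOSE "Close"\nEND\n'
french_rc = 'STRINGTABLE\nBEGIN\n    IDS_OPEN "Ouvrir"\nEND\n'
print(untranslated(main_rc, french_rc))
```

Strings that are legitimately identical in both languages (such as "OK") would show up as false positives and need a manual look.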
Check out RC-WinTrans. It's a commercial tool that my company uses. It basically imports our .RC files (or .resx files) into a database which we send to a different office for translation. The tool can then export a translated .RC file (or .resx file) for each language from the database. It even has a basic dialog box editor so the translator can adjust the size of various controls in the dialog box to be sure the translated text fits.
It also accepts a number of command line arguments and has a COM automation interface so you can integrate it into a build process more easily. It works quite well for us and we literally have thousands and thousands of strings and dialog boxes, etc.
(We currently have version 7 so what I've said might be a little bit different than their latest version 8.)
Also try AppTranslator: http://www.apptranslator.com/. It has a built-in resource editor so that translators can, for example, enlarge a text box when need be. It has separate versions for developers and translators and much more.
We are using Multilizer (http://www.multilizer.com/) and although it's sometimes a bit tricky to use, in the end, with a bit of patience, it works pretty well.
We even have a translation web site where translators can download our projects and then upload the translations using Multilizer command-line features.
Managing localization and translations using .rc files and Visual Studio is not a good idea. It's much smarter (though counter-intuitive) to start localization through the exe. Read here why: http://www.apptranslator.com/misconceptions.html
I've written this recently, which integrates into VS:
https://github.com/ekkis/Powershell/blob/master/MT.ps1
largely because I was unsatisfied with the solutions out there. You'll need to get a client id from M$ (but they give you 2M words/month translation free, which is not bad).
ResxCrunch will be out soon, it will edit multiple resource files in multiple languages in one single table.