Google Translator Toolkit machine-translation issues - django

I am using Python 2.7 and Django 1.7.
When I use Google Translator Toolkit to machine-translate my .po files to another language (English to German), there are many errors due to the use of different Django template variables in my translation tags.
I understand that machine translation is not so great, but I only want to test my translation strings on my test pages.
Here is an example of a typical error in the machine-translated .po file, translated from English (en) to German (de):
#. Translators: {{ site_name_lowercase }} is a variable that does not require translation.
#: .\templates\users\reset_password_email_html.txt:47
#: .\templates\users\reset_password_email_txt.txt:18
#, python-format
msgid ""
"Once you've returned to %(site_name_lowercase)s.com, we will give you "
"instructions to reset your password."
msgstr "Sobald du mit% (site_name_lowercase) s.com zurückgegeben haben, geben wir Ihnen Anweisungen, um Ihr Passwort zurückzusetzen."
The %(site_name_lowercase)s is machine-translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
I have hundreds of errors of this type, and I estimate that a find & replace would take at least 7 hours. Plus, if I run makemessages and then translate the .po file again, I would have to go through the find & replace again.
I am hoping that there is some kind of undocumented rule in the Google Translator Toolkit that will make the machine translation ignore the variables. I have read the Google Translator Toolkit docs and searched SO & Google, but did not find anything that would assist me.
Does anyone have any suggestions?

The %(site_name_lowercase)s is machine-translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
This is caused by tokenization prior to translation, followed by detokenization after translation: Google Translate tries to split the input into tokens before translating, then re-merges the tokens afterwards. The variables you use are typically composed of characters that tokenizers treat as token boundaries.
To avoid this sort of problem, you can pre-process your file and replace the offending variables with placeholders that do not have this issue - try out a couple of formats, e.g. _VAR_PLACE_HOLDER_. It is important that you do not use any punctuation characters that may cause the tokenizer to split. After pre-processing, translate your newly generated file, then post-process by replacing the placeholders with their original values. Typically, your placeholder will be picked up as an Out-Of-Vocabulary (OOV) item and preserved during translation. Experiment with including a sequence number (to keep track of your placeholders during post-processing), since word reordering may occur. There used to be a scientific API for Google Translate that gave you the token alignments; you could use those for post-processing as well.
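Here is a minimal sketch of that pre-/post-processing in Python, assuming python-format variables like %(site_name_lowercase)s; the XVARnX placeholder format is my own choice (letters and digits only, with a sequence number to survive reordering):

import re

VAR_RE = re.compile(r'%\([^)]+\)s')

def protect(text):
    # Swap each variable for a numbered placeholder; remember the originals.
    originals = []
    def repl(match):
        originals.append(match.group(0))
        return 'XVAR%dX' % (len(originals) - 1)
    return VAR_RE.sub(repl, text), originals

def restore(translated, originals):
    # Put the original variables back after machine translation.
    for i, var in enumerate(originals):
        translated = translated.replace('XVAR%dX' % i, var)
    return translated

msg, saved = protect("Once you've returned to %(site_name_lowercase)s.com, "
                     "we will give you instructions to reset your password.")
# ... machine-translate `msg` here ...
print(restore(msg, saved))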
Note that this procedure will not give you the best translation output possible, as the language model will not recognize the placeholder. You can see this illustrated here (with the placeholder, the token "gelezen" ends up in the wrong place):
https://translate.google.com/#en/nl/I%20have%20read%20SOME_VARIABLE_1%20some%20time%20ago%0AI%20have%20read%20a%20book%20some%20time%20ago
If you just want to test the system for your variables, and you do not care about the translation quality, this is the fastest way to go.
Should you decide to go for a better solution, you can solve this issue yourself by developing your own machine translation system (it's fun, by the way; see http://www.statmt.org/moses/) and applying the procedure explained above, but with, for example, Part-of-Speech tags to improve the language model. Note that you can use the alignment information as well.

Related

How to prevent the Google Translate API from translating proper names built from common words

I am using Google Cloud Platform to create content in an Indian regional language. Some of the content contains building and society names made of common words, like 'The Nest Glory'. After conversion with the Google Translate API, the building name should only be spelled out in the regional language; instead it is being translated literally. It sounds funny, and users will never find that building.
The Cloud Translation API can be told not to translate part of the text; use the following HTML tags:
<span translate="no"> </span>
<span class="notranslate"> </span>
This functionality requires the source text to be submitted in HTML.
Access to the Cloud Translation API
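For example, with the google-cloud-translate v2 Python client (a sketch; the sample text and target language are made up, but the client calls are the standard v2 usage):

from google.cloud import translate_v2 as translate

client = translate.Client()  # needs GOOGLE_APPLICATION_CREDENTIALS set
html = 'Flats available in <span translate="no">The Nest Glory</span>.'
# format_='html' is required so the API honours the translate="no" span
result = client.translate(html, target_language='hi', format_='html')
print(result['translatedText'])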
How to Disable Google Translate from Translating Specific Words
After some searching, I think one older way was to use the Transliteration API, which was deprecated by Google:
Google Transliteration API Deprecated
translate="no"
class="notranslate"
In recent API versions these features don't work.
That's a very complex thing to do: it needs semantic analysis as well as translation, and there is no real way to simply add that to your application. The best we can suggest is that you identify the names, remove them from the strings (or replace them with non-translated markers, as they may not be in anything like the same position in translated messages) and re-instate them afterwards.
But... that may well change the quality of the translation; as I said, Google Translate is pretty clever... but it isn't at human-interpreter level yet.
If that doesn't work, you need to use a real-world human translator who is very familiar with the source and target languages.

Extract comments written in Chinese and translate them into English using a script

I have a C++ project in which the source code comments are in Chinese; now I want to convert them to English.
I tried using Google Translate, but ran into issues: whole .cpp or header files didn't get converted, and I found that the names of structs, classes, etc. get changed. Sometimes the code also gets modified.
Note: each .cpp or .h file is less than 1000 lines of code, but there are multiple C++ projects, each having around 10 files. Thus I have around 50 files for which I need to translate the Chinese text to English.
Well, what did you expect? Google Translate doesn't know what a CPP file is and how to treat it. You'll have to write your own program that extracts comments from them (not that hard), runs just those through Google Translate, and then puts them back in.
Mind you, if there is commented out code, or the comments reference variable names, those will get translated too. Detecting and handling these cases is a lot harder already.
Extracting comments is a lexical issue, and mostly a quite simple one.
In a few hours, you could write (e.g. with flex) a simple command-line program that extracts them. And a good editor (such as GNU emacs) could even be configured to run that filter on selected code chunks.
(Handling a few corner cases, such as raw string literals, might be slightly more difficult, but these don't happen often and you could handle them manually.)
BTW, if you are assigned to work on that code, you'll need to understand it, and that takes much more time than copy-pasting or editing each comment manually.
Lastly, I am not sure about the quality of automatic translation of code comments. You might be disappointed. Also, the code names (of functions, classes, variables, etc.) matter a lot more.
Perhaps adding your comments in English could be wiser.
Don't forget to use some version control system; you really need one (e.g. git).
(I am not convinced that extracting comments for automatic translation would help your work)
First, separate the comments and the code into different files using a Python script like the one below:
import sys

# Split a source file into two line-aligned files: comment.txt gets the
# lines containing a "//" comment, code.txt gets everything else. Blank
# lines keep both files the same length so they can be merged back later.
filename = sys.argv[1]
with open(filename, "r") as f:
    lines = f.readlines()

with open("comment.txt", "w") as comment, open("code.txt", "w") as code:
    for l in lines:
        if "//" in l:  # note: any line containing // counts as a comment
            comment.write(l)
            code.write("\n")
        else:
            code.write(l)
            comment.write("\n")
Now translate comment.txt with Google Translate and then use
paste code.txt comment_en > source
where comment_en is the file of comments translated into English. (You may want paste -d '\0' instead, which joins the lines with an empty delimiter rather than the default tab.)

Proofread strings with Qt Linguist

Our main language is English, so we use tr("Some english text") all over the source code.
We also plan to translate it to several different languages - no problem with that.
Our customer wants to get all phrases from the source code and proofread them.
Of course, we should put those phrases back after proofreading.
How can we accomplish that in a proper way? Maybe Qt Linguist allows exporting/importing the embedded localizable texts?
I guess the customer can just translate English into English and then we can use that English translation, but it's weird.
I would go with Qt's lupdate utility (found in Qt's bin directory), which extracts all string literals from your sources into an XML (.ts) file. The file can be opened with the Linguist tool.
Note that the utility considers only strings wrapped in the tr() macro. Here is the lupdate description:
lupdate is part of Qt's Linguist tool chain. It extracts translatable
messages from Qt UI files, C++, Java and JavaScript/QtScript source
code. Extracted messages are stored in textual translation source
files (typically Qt TS XML). New and modified messages can be merged
into existing TS files.
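A typical invocation (the file names here are only examples) looks like:
lupdate main.cpp mainwindow.cpp -ts translations/myapp_de.ts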
UPDATE:
Another alternative is keeping all string literal definitions in a separate source file and updating it once the customer has corrected all the strings. I believe this happens infrequently, or perhaps even only once, so it would not be worth much effort with translations etc.
Finally, it looks like I will have to update the phrases (embedded in the source code) by hand. Actually, it shouldn't take too much time. If I have time to write a script in Python, I will update this post.
UPDATE
So, I did everything "by hand" with a little help from Sublime Text 3:
Find all matches in the repository folder using the following regular expression: (.*)(tr\((\"(.+?)\")\))(.*)
Copy the search results into a new document.
Using the same regular expression, do the search again and replace each match with \4 - this capture group represents the text inside tr("").
After receiving the proofread phrases from the customer, it took 3-5 minutes to find the differences with a diff tool and update the phrases in the code.
Not a true-programmer way of solving the problem, but it worked for me, and it worked pretty fast!
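For the record, a rough Python equivalent of that extraction step (the glob pattern and output format are assumptions):

import re
import sys
from pathlib import Path

# Same idea as the Sublime regex: capture the text inside tr("...")
TR_RE = re.compile(r'tr\(\s*"((?:[^"\\]|\\.)*)"\s*\)')

for path in Path(sys.argv[1]).rglob('*.cpp'):
    source = path.read_text(encoding='utf-8', errors='ignore')
    for match in TR_RE.finditer(source):
        print(match.group(1))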

Gettext/Django for German translations: formal/informal salutations

I maintain a pluggable Django app that contains translations. All strings in Python and HTML code are written in English. When translating the strings to German, I'm always fighting with the problem that German differentiates between formal and informal speech (see T–V distinction). Because the app is used on different sites, ranging from a social network to a banking website, I can't just support either the formal or informal version. And since the translations can differ quite a bit, there's no way I can parameterize it. E.g. the sentence "Do you want to log out?" would have these two translations:
Wollen Sie sich abmelden? (formal)
Willst du dich abmelden? (informal)
Is there anything in Gettext that could help me with this?
You can use contextual markers to give your translations additional context.
logout = pgettext('casual', 'Do you want to log out?')
...
logout = pgettext('formal', 'Do you want to log out?')
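In the .po file, each context becomes a separate entry via msgctxt, so the same English msgid can carry both German translations:

msgctxt "formal"
msgid "Do you want to log out?"
msgstr "Wollen Sie sich abmelden?"

msgctxt "casual"
msgid "Do you want to log out?"
msgstr "Willst du dich abmelden?"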
The best approach, used in other similar situations by gettext as well as UNIX, is to use locale variants. For example, sr_RS is (or was, because Serbian is considered a metalanguage these days...) the code used for Serbian written in Cyrillic. But it's sometimes written in Latin script too, so sr_RS@latin is used as the language name and, of course, as the MO filename or directory as well.
Here, have a look at some translations I have present on my system:
$ find /usr/local/share/locale | grep /sr
/usr/local/share/locale/sr
/usr/local/share/locale/sr/LC_MESSAGES
/usr/local/share/locale/sr/LC_MESSAGES/bash.mo
/usr/local/share/locale/sr/LC_MESSAGES/bfd.mo
/usr/local/share/locale/sr/LC_MESSAGES/binutils.mo
/usr/local/share/locale/sr/LC_MESSAGES/gettext-runtime.mo
/usr/local/share/locale/sr/LC_MESSAGES/gettext-tools.mo
/usr/local/share/locale/sr/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr/LC_MESSAGES/wget.mo
/usr/local/share/locale/sr@ije
/usr/local/share/locale/sr@ije/LC_MESSAGES
/usr/local/share/locale/sr@ije/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr@latin
/usr/local/share/locale/sr@latin/LC_MESSAGES
/usr/local/share/locale/sr@latin/LC_MESSAGES/glib20.mo
/usr/local/share/locale/sr_RS
/usr/local/share/locale/sr_RS/LC_MESSAGES
/usr/local/share/locale/sr_RS/LC_MESSAGES/mkvtoolnix.mo
/usr/local/share/locale/sr_RS@latin
/usr/local/share/locale/sr_RS@latin/LC_MESSAGES
/usr/local/share/locale/sr_RS@latin/LC_MESSAGES/mkvtoolnix.mo
$
So the best way to handle German variants is the same: use de (or de_DE) for the base informal variant, and have a separate translation file de_DE@formal with the formal variant of the translation.
This is basically what WordPress does too. Of course, being WordPress, they have their own special flavour and don't use the variant syntax; instead they add a third component to the filename: de_DE.mo is the informal (and also the fallback, because it lacks any further specification) variant, and de_DE_formal.mo contains the formal variant.
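With Python's standard library gettext, selecting the variant catalogue is then straightforward (the domain name and directory layout below are assumptions):

import gettext

# Expects locale/de_DE/LC_MESSAGES/myapp.mo          (informal)
# and     locale/de_DE@formal/LC_MESSAGES/myapp.mo   (formal)
t = gettext.translation('myapp', localedir='locale',
                        languages=['de_DE@formal', 'de_DE'], fallback=True)
_ = t.gettext
print(_("Do you want to log out?"))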

Do you know of a good program for editing/translating resource (.rc) files?

I'm building a C++/MFC program in a multilingual environment. I have one main (national) language and three international languages. Every time I add a feature to the program I have to keep the international languages up-to-date with the national one. The resource editor in Visual Studio is not very helpful because I frequently end up leaving a string, dialog box, etc., untranslated.
I wonder if you guys know of a program that can edit resource (.rc) files and either:
Build a file that includes only the strings to be translated and their respective IDs, and accept the same (or a similar) file in another language (this would be helpful since the translation is usually done by someone else), or
Handle the translations itself, allowing you to view the same string in different languages at the same time.
In my experience, internationalization requires a little more than translating strings. Many strings, when translated, require more space in a dialog. Because of this it's useful to be able to customize the dialogs for each language. Otherwise you have to create dialogs with extra space for the translated strings, which then look less than optimal when displayed in English.
Quite a while back I was using a translation tool for an MFC application but the company that produced the software stopped selling it. When I tried to find a reasonably priced replacement I did not find one.
Check out Lingobit Localizer. Expensive, but well worth it.
Here's a script I use to generate resource files for testing in different languages. It just parses a response from babelfish, so clearly the translation will be about as high quality as that done by a drunken monkey, but it's useful for testing and such:
for i in $trfile
do
  key=`echo $i | sed 's/^\(.*\)=\(.*\)$/\1/g'`
  value=`echo $i | sed 's/^\(.*\)=\(.*\)$/\2/g'`
  url="http://babelfish.altavista.com/tr?doit=done&intl=1&tt=urltext&lp=$langs&btnTrTxt=Translate&trtext=$value"
  # -U sets the user agent; discard wget's own output
  wget -O foo.html -U "$agent" "$url" &> /dev/null
  tx=`grep "<td bgcolor=white class=s><div style=padding:10px;>" foo.html`
  tx=`echo $tx | iconv -f latin1 -t utf-8 | sed 's/<td bgcolor=white class=s><div style=padding:10px;>\(.*\)<\/div><\/td>/\1/g'`
  echo $key=$tx
done
rm foo.html
Check out appTranslator, it's relatively cheap and works rather well. The guy developing it is really responsive to enhancement requests and bug reports, so you get really good support.
You might take a look at Sisulizer (http://www.sisulizer.com). Expensive, though. We're evaluating it for use at my company to manage the headache of ongoing translation. I read on their About page that the company was founded by people who left Multilizer and other similar companies.
If there isn't one, it would be pretty easy to loop through all the strings in a resource and compare them to the international resources. You could probably do this with a simple grid.
In the end, we ended up building our own external tools to manage this. Our devs work in the English string table, and every automated build sends the strings that have been added, changed, or deleted to the translation manager. He can also run a report at any time against an old build to determine what is required for translation.
Check out RC-WinTrans. It's a commercial tool that my company uses. It basically imports our .rc files (or .resx files) into a database, which we send to a different office for translation. The tool can then export a translated .rc file (or .resx file) for each language from the database. It even has a basic dialog box editor so the translator can adjust the size of various controls in the dialog box to be sure the translated text fits.
It also accepts a number of command line arguments and has a COM automation interface so you can integrate it into a build process more easily. It works quite well for us and we literally have thousands and thousands of strings and dialog boxes, etc.
(We currently have version 7 so what I've said might be a little bit different than their latest version 8.)
Also try AppTranslator: http://www.apptranslator.com/. It has a built-in resource editor so that translators can, for example, enlarge a text box when need be. It has separate versions for developers and translators, and much more.
We are using Multilizer (http://www.multilizer.com/) and although sometimes it's a bit tricky to use, in the end, with a bit of patience, it works pretty well.
We even have a translation web site where translators can download our projects and then upload the translations using Multilizer command-line features.
Managing localization and translations using .rc files and Visual Studio is not a good idea. It's much smarter (though counter-intuitive) to start localization from the exe. Read why here: http://www.apptranslator.com/misconceptions.html
I've written this recently, which integrates into VS:
https://github.com/ekkis/Powershell/blob/master/MT.ps1
Largely because I was unsatisfied with the solutions out there. You'll need to get a client ID from M$ (but they give you 2M words/month of translation for free - not bad).
ResxCrunch will be out soon; it will edit multiple resource files in multiple languages in one single table.