Google Cloud Translate - Serbian Latin not working - google-cloud-platform

I have been using the Google Translate API to translate from English to Serbian (written with Latin characters).
Until a few days ago, using en as the source language and sr-Latn as the target, I was getting the correct translation, but now it no longer seems to work.
Code snippet:
from google.cloud import translate
project_id = "<my project>"
parent = f"projects/{project_id}"
client = translate.TranslationServiceClient()
sample_text = "Hello!"
source_language_code = "en"
target_language_code = "sr-Latn"
response = client.translate_text(
    contents=[sample_text],
    source_language_code=source_language_code,
    target_language_code=target_language_code,
    parent=parent,
)
for translation in response.translations:
    print(translation.translated_text)
Actual output:
Здраво!
Expected output:
Zdravo!
Additional info: sr-Latn is a valid BCP-47 language code, and it worked until a few days ago.
Thanks for your help.

They have removed the feature. You can do a post-translation transliteration from Cyrillic (CYR) to Latin (LAT).
A sample:
https://github.com/kukicmilorad/cyrlat
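For reference, a minimal post-processing sketch in Python along those lines (the mapping covers the 30 letters of the Serbian Cyrillic alphabet; a library such as the cyrlat project above does the same job):

# Transliterate Serbian Cyrillic output to Latin after translation.
CYR_TO_LAT = {
    "Љ": "Lj", "Њ": "Nj", "Џ": "Dž", "љ": "lj", "њ": "nj", "џ": "dž",
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Ђ": "Đ", "Е": "E",
    "Ж": "Ž", "З": "Z", "И": "I", "Ј": "J", "К": "K", "Л": "L", "М": "M",
    "Н": "N", "О": "O", "П": "P", "Р": "R", "С": "S", "Т": "T", "Ћ": "Ć",
    "У": "U", "Ф": "F", "Х": "H", "Ц": "C", "Ч": "Č", "Ш": "Š",
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć",
    "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č", "ш": "š",
}

def cyrillic_to_latin(text):
    # The digraph letters (Љ, Њ, Џ) are single code points in Cyrillic,
    # so a simple per-character pass is sufficient.
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

print(cyrillic_to_latin("Здраво!"))  # -> Zdravo!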

It appears that translation to the Serbian Latin alphabet is not officially supported by the Cloud Translation API, as discussed in this recent issue, so there is no guarantee that any workaround will be functional or reliable. You can see the list of supported language codes for translation here.
You can, however, submit a Feature Request to the public Google issue tracker for the Cloud Translation API. The more users who bring attention to the request, the more likely it is to eventually be built into the API.

Related

Google's Cloud Translation API is not translating

I typed ami valo achi into Google Translate. The proper translation of this is "I'm fine", and Google Translate shows exactly that result.
But when I try to translate the same text using the Cloud Translation API, it doesn't translate; it returns the exact same text I gave as input. Here's my code segment:
const { Translate } = require("@google-cloud/translate").v2;
const translate = new Translate({
  keyFilename: "file path",
});
let target = "en";
let text = "ami valo achi";
async function detectLanguage() {
  let [translations] = await translate.translate(text, target);
  translations = Array.isArray(translations) ? translations : [translations];
  console.log("Translations:");
  translations.forEach((translation) => {
    console.log(translation);
    console.log(`${text} => (${target}) ${translation}`);
  });
}
detectLanguage();
Is there anything I'm doing wrong, or anything I can do to solve this?
Google Translate is a product developed by Google which implements not only GCP Translation AI but also various other components. Thus, differences in translations are not only common but expected.
The Cloud Translation API is designed for programmatic use. The Cloud Translation API uses Google's pre-trained Neural Machine Translation (NMT) model for translation, while Google Translate uses statistical machine translation (SMT), where computers analyze millions of existing translated documents to learn vocabulary and look for patterns in a language. The difference in how they work results in different behavior, so getting different responses from the two products is intended behavior.
Currently, the Cloud Translation API doesn't support translation of Bengali written in the Latin alphabet. Thus, when translating "আমি ভালো আছি" into English, "i am fine" is sent back. Conversely, translating "ami valo achi" does not work because it is not written in the Bengali script.
The same has been raised as an issue in this Issue Tracker, which will be updated whenever there is progress. We cannot provide an ETA at the moment, but you can "STAR" the issue via this link to receive automatic updates and give it traction.
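To see the difference programmatically, here is a minimal sketch with the Python client (google-cloud-translate, v2 client; credentials are assumed to be configured via GOOGLE_APPLICATION_CREDENTIALS):

# Minimal sketch illustrating the behaviour described above: the Bengali-script
# sentence is translated, while the romanized one comes back unchanged.
from google.cloud import translate_v2 as translate

client = translate.Client()

for text in ["আমি ভালো আছি", "ami valo achi"]:
    result = client.translate(text, target_language="en")
    print(text, "->", result["translatedText"])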

How to stop the Google Translate API from translating proper names made of common words

I am using Google Cloud Platform to create content in an Indian regional language. Some of the content contains building and society names made up of common words, like 'The Nest Glory'. After conversion with the Google Translate API, the building name should only be spelled out in the regional language, but instead it is being literally translated. It sounds funny, and users will never find that building.
The Cloud Translation API can be told not to translate part of the text by using the following HTML tags:
<span translate="no"> </span>
<span class="notranslate"> </span>
This functionality requires the source text to be submitted in HTML.
Access to the Cloud Translation API
How to Disable Google Translate from Translating Specific Words
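As an illustration of the approach above, a minimal sketch with the v3 Python client (the project ID, building name and target language code are placeholders), submitting the text as HTML so the span is honoured:

# Minimal sketch (Translation API v3): wrap names that must not be translated
# in <span translate="no"> and submit the text as HTML.
from google.cloud import translate

client = translate.TranslationServiceClient()
parent = "projects/<my project>"  # placeholder project ID

html_text = 'Flats available in <span translate="no">The Nest Glory</span>.'
response = client.translate_text(
    contents=[html_text],
    mime_type="text/html",          # so the span is interpreted as markup
    source_language_code="en",
    target_language_code="hi",      # example regional language code
    parent=parent,
)
print(response.translations[0].translated_text)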
After some searching, I think one older way was to use the Transliteration API, which has been deprecated by Google:
Google Transliteration API Deprecated
As for
translate="no"
class="notranslate"
these features do not work in the latest API versions.
That's a very complex thing to do: it needs semantic analysis as well as translation, and there is no real way to simply add that to your application. The best we can suggest is that you identify the names, remove them from the strings (or replace them with non-translated markers, as they may not be in anything like the same position in the translated messages) and re-instate them afterwards.
But... that may well change the quality of the translation; as I said, Google Translate is pretty clever... but it isn't at human-interpreter level yet.
If that doesn't work, you need to use a real-world human translator who is very familiar with the source and target languages.
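If submitting HTML is not an option, a minimal Python sketch of that remove-and-reinstate idea (the list of protected names is assumed to be known up front, and translate_fn stands for whatever translation call you use):

# Swap protected names for numbered tokens before translation, restore after.
def protect_names(text, names):
    mapping = {}
    for i, name in enumerate(names):
        token = "XNAME%dX" % i          # unlikely to be altered by translation
        mapping[token] = name
        text = text.replace(name, token)
    return text, mapping

def restore_names(text, mapping):
    for token, name in mapping.items():
        text = text.replace(token, name)
    return text

prepared, mapping = protect_names("Flats available in The Nest Glory.", ["The Nest Glory"])
# translated = translate_fn(prepared)   # hypothetical translation call
# print(restore_names(translated, mapping))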

ANTLR 4 Python Documentation

I am very new to ANTLR 4 and my target language is Python 2.
I am not able to understand CommonTokenStream in Python and how I can access tokens in ANTLR 4.
What I need is to access the tokens present in the hidden channel.
Could someone please point me to proper documentation where I can learn how to access and manipulate tokens in Python?
I am sorry if the question is vague; I am new here.
The ANTLR book is one place to start:
https://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference
Chapter 12, "Wielding Lexical Black Magic", has a section on accessing the hidden channel. Use the TokenStreamRewriter to rewrite tokens.
You need to mentally convert the Java code in the book to Python. The runtime libraries have subtle differences, but they are virtually the same.
That is not the only way. You can also override the lexer's emit() function (which I usually do); then you have total control over token routing.
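For a concrete starting point, a minimal sketch with the ANTLR Python runtime (MyGrammarLexer is a placeholder for your generated lexer; the same calls exist in the Python 2 and Python 3 runtimes):

# Fill the token stream and inspect the tokens on the hidden channel.
from __future__ import print_function   # keeps the print calls working on Python 2
from antlr4 import InputStream, CommonTokenStream, Token
from MyGrammarLexer import MyGrammarLexer   # placeholder: your generated lexer

lexer = MyGrammarLexer(InputStream("a = 1  // trailing comment\n"))
stream = CommonTokenStream(lexer)
stream.fill()                               # force the lexer to tokenize everything

for tok in stream.tokens:
    if tok.channel == Token.HIDDEN_CHANNEL: # e.g. whitespace/comments sent to HIDDEN
        print("hidden:", repr(tok.text))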
If you are on Python 3, it is all nicely cooked and ready:
https://github.com/jszheng/py3antlr4book
For some Python starting hints, try
https://github.com/antlr/antlr4/blob/master/doc/python-target.md
If you are using Anaconda3, try grepping for class / def / import statements and comments (#) in all the *.py files under
Anaconda3\Lib\site-packages\antlr4_python3_runtime-4.7.1-py3.6.egg\antlr4
Or even write an ANTLR script to create the Python docs and share it with me and the world.
Also, at run time this helps to see what methods and properties are available on, say, a ctx object:
def dump(obj):
    for attr in dir(obj):
        print("obj.%s = %r" % (attr, getattr(obj, attr)))
    print("-------------------------------------------")
dump(ctx)
print("===========================================")

Google Translator Toolkit machine-translation issues

I am using Python 2.7 and Django 1.7.
When I use the Google Translator Toolkit to machine translate my .po files into another language (English to German), there are many errors due to the various Django template variables used in my translation tags.
I understand that machine translation is not so great, but I only want to test my translation strings on my test pages.
Here is an example of a typical error in the machine-translated .po file, translated from English (en) to German (de).
#. Translators: {{ site_name_lowercase }} is a variable that does not require translation.
#: .\templates\users\reset_password_email_html.txt:47
#: .\templates\users\reset_password_email_txt.txt:18
#, python-format
msgid ""
"Once you've returned to %(site_name_lowercase)s.com, we will give you "
"instructions to reset your password."
msgstr "Sobald du mit% (site_name_lowercase) s.com zurückgegeben haben, geben wir Ihnen Anweisungen, um Ihr Passwort zurückzusetzen."
The %(site_name_lowercase)s is machine translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
I have hundreds of these types of errors, and I estimate that a find & replace would take at least 7 hours. Plus, if I run makemessages and then translate the .po file again, I would have to go through the find and replace again.
I am hoping that there is some kind of undocumented rule in the Google Translator Toolkit that will allow the machine translation to ignore the variables. I have read the Google Translator Toolkit docs and searched SO & Google, but did not find anything that would assist me.
Does anyone have any suggestions?
The %(site_name_lowercase)s is machine translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
This is caused by tokenization prior to translation, followed by detokenization after translation, i.e. Google Translate tries to split the input before translation and re-merge it after translation. The variables you use are typically composed of characters that tokenizers use to detect token boundaries. To avoid this sort of problem, you can pre-process your file and replace the offending variables with placeholders that do not have this issue - I suggest you try out a couple of things, e.g. _VAR_PLACE_HOLDER_. It is important that you do not use any punctuation characters that may cause the tokenizer to split. After pre-processing, translate your newly generated file and post-process it by replacing the placeholders with their original values. Typically, your placeholder will be picked up as an Out-Of-Vocabulary (OOV) item and will be preserved during translation. Try experimenting with including a sequence number (to keep track of your placeholders during post-processing), since word reordering may occur. There used to be a scientific API for Google Translate that gave you the token alignments; you could use these for post-processing as well.
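As an illustration of that pre/post-processing step, a minimal Python sketch that protects python-format variables such as %(site_name_lowercase)s behind numbered placeholder tokens (the placeholder shape is just one choice; adjust it if the translator still splits it):

import re

# Hide python-format variables behind numbered placeholder tokens before
# machine translation, then restore them afterwards.
VAR_RE = re.compile(r"%\([^)]+\)s")

def protect(msgid):
    variables = VAR_RE.findall(msgid)
    for i, var in enumerate(variables):
        msgid = msgid.replace(var, "XVAR%dX" % i, 1)
    return msgid, variables

def restore(msgstr, variables):
    for i, var in enumerate(variables):
        msgstr = msgstr.replace("XVAR%dX" % i, var)
    return msgstr

text, variables = protect("Once you've returned to %(site_name_lowercase)s.com, ...")
# ...send `text` through the Translator Toolkit, then on the translated string:
# print(restore(translated_text, variables))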
Note that this procedure will not give you the best translation output possible, as the language model will not recognize the placeholder. You can see this illustrated here (with placeholder, the token "gelezen" is in the wrong place):
https://translate.google.com/#en/nl/I%20have%20read%20SOME_VARIABLE_1%20some%20time%20ago%0AI%20have%20read%20a%20book%20some%20time%20ago
If you just want to test the system for your variables, and you do not care about the translation quality, this is the fastest way to go.
Should you decide to go for a better solution, you can solve this issue yourself by developing your own machine translation system (it's fun, by the way, see http://www.statmt.org/moses/) and applying the procedure explained above, but then with, for example, Part-of-Speech tags to improve the language model. Note that you can use the alignment information as well.

Match browsers set to Scandinavian languages based on "Accept-Language"

Question
I am trying to match browsers set to Scandinavian languages based on HTTP header "Accept-Language".
My regex is:
^(nb|nn|no|sv|se|da|dk).*
My question is whether this is sufficient, and whether anyone knows of any other odd Scandinavian (but "valid") language codes or obscure browser bugs that could cause false positives.
Used for
The regex is used to display an English link at the top of the Norwegian web pages (Norwegian is the primary language, at the root of the domain and sub-domains) that takes you to the English web pages (the secondary language, in a folder under the root) when the browser language is not Scandinavian. The link can be closed / "opted out of", with a hash stored in JavaScript localStorage, if the user doesn't want to see it again. We decided not to use IP geolocation because of limited time to implement it.
Depending on the language you are working in, there may be existing code you can use to parse this easily, e.g. this post: Parse Accept-Language header in Java <-- it also provides a good code example.
Further - are you sure you want to limit your regex to the start of the string? Several languages can be provided (the first is intended to be "I prefer x but also accept the following"): http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
Otherwise your regex should work fine based on what you were asking, and here is a list of all browser language codes: http://www.metamodpro.com/browser-language-codes
I would also, in your shoes, make the "switch to X language" link easy to find for all users until they have opted not to see it again. I would expect many people have a preference set by default in their browser but would find a site actually using it unexpected, i.e. a user experience like:
"I prefer English but don't know enough to change this setting, and have never had a reason to before, as so few sites make use of it."
That regular expression is enough if you are testing each item in Accept-Language individually.
If not, there are two problems:
One of the expected languages might not appear at the beginning of the header, but later on.
One of the expected language abbreviations could appear as a qualifier of a completely different language.
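For example, a minimal Python sketch that splits the header into items (dropping the ;q=... weights) and applies the regex to each item individually, which avoids both problems (the trailing \b is a slight tightening of the original pattern so that a code only matches as a whole subtag):

import re

# Test each Accept-Language item individually against the Scandinavian codes.
SCANDINAVIAN = re.compile(r"^(nb|nn|no|sv|se|da|dk)\b", re.IGNORECASE)

def accepts_scandinavian(header):
    for item in header.split(","):
        lang = item.split(";")[0].strip()   # drop any ";q=..." weight
        if SCANDINAVIAN.match(lang):
            return True
    return False

print(accepts_scandinavian("en-US,en;q=0.8,nb;q=0.6"))  # True
print(accepts_scandinavian("de-DE,en;q=0.7"))           # False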