I typed "ami valo achi" into Google Translate. The proper translation of this is "I'm fine", and Google Translate shows exactly that result.
But when I try to translate the same text using the Cloud Translation API, it doesn't translate: it returns the exact same text I gave as input. Here's my code segment:
const { Translate } = require("@google-cloud/translate").v2;

const translate = new Translate({
  keyFilename: "file path",
});

let target = 'en';
let text = 'ami valo achi';

async function detectLanguage() {
  let [translations] = await translate.translate(text, target);
  translations = Array.isArray(translations) ? translations : [translations];
  console.log("Translations:");
  translations.forEach((translation, i) => {
    console.log(translation);
    console.log(`${text[i]} => (${target}) ${translation}`);
  });
}

detectLanguage();
Is there anything that I'm doing wrong, or anything I can do to solve this?
Google Translate is a product developed by Google which implements not only GCP Translation AI but also various other components. Thus, differences in translations are not only common but expected.
The Cloud Translation API is designed for programmatic use. It uses Google's pre-trained Neural Machine Translation (NMT) model for translation, while Google Translate uses statistical machine translation (SMT), where computers analyze millions of existing translated documents to learn vocabulary and look for patterns in a language. The difference in how they work results in different behavior, so not getting the same response from two different products is intended behavior.
Currently, the Cloud Translation API doesn't support translation for Bengali written in the Latin alphabet. Thus, when translating "আমি ভালো আছি" into English, "i am fine" is sent back. Conversely, translating "ami valo achi" does not work, as it is not written in the Bengali alphabet.
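For what it's worth, here is a minimal sketch in Python (the v2 client, the same API the question calls from Node) showing the difference; default application credentials are assumed:

from google.cloud import translate_v2 as translate

client = translate.Client()  # assumes default application credentials

# Bengali written in the Bengali alphabet translates as expected.
print(client.translate("আমি ভালো আছি", target_language="en")["translatedText"])

# Bengali written in the Latin alphabet comes back unchanged, as described above.
print(client.translate("ami valo achi", target_language="en")["translatedText"])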
The same has been raised as an issue in the Issue Tracker, which will be updated whenever there is progress. We cannot provide an ETA at the moment, but you can "star" the issue via this link to receive automatic updates and give it traction.
Related
I have been using the Google Translate API for translating from English to Serbian (with Latin characters).
Until a few days ago, using en as the source language and sr-Latn as the target, I was able to get the correct translation, but now it no longer seems to work.
Code snippet:
from google.cloud import translate

project_id = "<my project>"
parent = f"projects/{project_id}"
client = translate.TranslationServiceClient()

sample_text = "Hello!"
source_language_code = "en"
target_language_code = "sr-Latn"

response = client.translate_text(
    contents=[sample_text],
    source_language_code=source_language_code,
    target_language_code=target_language_code,
    parent=parent,
)

for translation in response.translations:
    print(translation.translated_text)
Actual output:
Здраво!
Expected output:
Zdravo!
Additional info: sr-Latn is a valid BCP-47 language code and worked until a few days ago.
Thanks for your help.
They have removed the feature. You can do a post-translation transliteration from Cyrillic (CYR) to Latin (LAT).
A sample
https://github.com/kukicmilorad/cyrlat
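For illustration, a minimal post-translation transliteration sketch in Python (the mapping covers the standard Serbian Cyrillic letters; treat it as a starting point, not a complete solution):

# Serbian Cyrillic -> Latin mapping (uppercase; lowercase derived below).
CYR_TO_LAT = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Ђ": "Đ", "Е": "E",
    "Ж": "Ž", "З": "Z", "И": "I", "Ј": "J", "К": "K", "Л": "L", "Љ": "Lj",
    "М": "M", "Н": "N", "Њ": "Nj", "О": "O", "П": "P", "Р": "R", "С": "S",
    "Т": "T", "Ћ": "Ć", "У": "U", "Ф": "F", "Х": "H", "Ц": "C", "Ч": "Č",
    "Џ": "Dž", "Ш": "Š",
}
CYR_TO_LAT.update({k.lower(): v.lower() for k, v in CYR_TO_LAT.items()})

def cyr_to_lat(text):
    """Transliterate Serbian Cyrillic output into the Latin alphabet."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

print(cyr_to_lat("Здраво!"))  # -> Zdravo!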
It appears that translation to the Serbian Latin alphabet is not officially supported by the Cloud Translation API, as discussed in this recent issue. Therefore, it's not assured that any possible workaround will be functional or reliable. You can see the list of supported language codes for translation here.
You can, however, submit a Feature Request to the public Google issue tracker for the Cloud Translation API. The more users who bring attention to this request, the more likely it is to eventually be built into the API.
When trying to import into the Datastore Emulator, all the data is imported correctly, but the key references are wrong somehow.
The procedure I'm following to import is the one from here, after an export done per the instructions here.
I've included a screenshot of the situation from the Datastore viewer as otherwise it's hard to understand.
It appears as though the key references (blue arrow) contain the correct kind and ID, as the Datastore viewer is pulling those out (orange arrow) and they are correct, but the entity it references has a different main entity key (e.g. red arrow though obviously for a different entity) which are all in a slightly different format (they have a common prefix and two hyphens in them).
It seems as though the key encoding is done in a subtly inconsistent manner in the emulator versus in the live datastore, but I've not been able to find any documentation about this anywhere.
Running code and connecting to the emulator with the client library shows that all the references have the correct IDs as well (I'm not even sure if you can see the string keys using the Ruby client). Trying to use the client to reset the references by setting the same ID and saving to hopefully regenerate the keys didn't work either.
I assume your app is working fine and you are just concerned about the encodings. If so, there is nothing to worry about.
It seems as though the key encoding is done in a subtly inconsistent manner in the emulator versus in the live datastore, but I've not been able to find any documentation about this anywhere.
The way keys get encoded has changed at some point. The Datastore viewer that comes with the SDK, I believe, still uses the old style. The API was even enhanced to support decoding from the old style; refer to
https://github.com/googleapis/google-cloud-python/issues/3293
for more details. I found that the newer encoding is more compact than the old encoding. I believe the new encoding doesn't include the app/project ID, which makes sense because that information belongs to the entire database and not to each specific key.
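If you do want to convert between the two representations, the Python client exposes helpers for the old App Engine-style encoding. A minimal sketch (the project ID, kind, and ID are assumptions; the Ruby client may or may not offer equivalents):

from google.cloud import datastore

client = datastore.Client(project="my-project")  # hypothetical project id
key = client.key("Task", 1234)                   # hypothetical kind and id

# Old App Engine-style ("legacy urlsafe") encoding, the format the SDK's
# Datastore viewer still tends to display.
legacy = key.to_legacy_urlsafe()
print(legacy)

# Decode an old-style string back into a Key; kind and id round-trip even
# though the encoded strings differ between the emulator viewer and live Datastore.
decoded = datastore.Key.from_legacy_urlsafe(legacy)
print(decoded.kind, decoded.id)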
I'm using AWS SageMaker, and I want to create something that, given a text description, recognizes the place it describes. Is that possible?
If there are no other classes besides the text that you would like your model to identify, you may not need a multiclass classifier.
You could train your own text detection model using Amazon SageMaker, and train using a dataset with labelled examples using the Object Detection Algorithm, but this becomes rather involved for a problem that has existing solutions available.
If the appearance of the text you're trying to detect is identical each time, your problem space gets reduced from trying to interpret variable text to simply gathering enough examples and performing object detection for the "pattern" your text forms visually. Note that if the text were to appear in different fonts or styles, the generic object detection method would not interpret it dynamically, and an OCR-based solution would likely be necessary.
More broadly, for text identification in images on AWS, you have quite a few options:
Amazon Rekognition has a DetectText method that will enable you to easily find text within an image. If it's a small or simple phrase, with alphanumeric characters, this should work very well for your use case.
Amazon Textract will help you perform OCR (optical character recognition) while retaining the structure of the source. This is great for documents and tables, but doesn't sound like it may be applicable to your use case.
The AWS marketplace will also have hosted options available from third party vendors. One example of this for text region identification is this one from RocketML.
There are also some great open source tools I'd recommend looking into: OpenCV for ascertaining the text bounding boxes, and Tesseract for OCR and text extraction. This blog post does a good job walking through the process of using them together.
Any of these will help to solve your problem of performing OCR/text identification on AWS, but the best choice comes down to what your current and future needs are, and how quickly you're looking to implement the feature.
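As an illustration of the open-source route, a minimal sketch using OpenCV and pytesseract (the input file name is a placeholder, and the Tesseract binary must be installed separately):

import cv2
import pytesseract

image = cv2.imread("sign.png")  # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# A simple Otsu binarization often improves OCR on photos of printed text.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print(pytesseract.image_to_string(binary))  # extracted text
print(pytesseract.image_to_data(binary))    # per-word bounding boxes (TSV)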
Your question is not clear regarding the data that you have or the problem that you want to solve.
If you have a text that includes a place name (for example, "I visited Seattle and enjoyed the fish market"), you can use Amazon Comprehend Named Entity Extraction (NEE), which detects places ("Seattle" in the above example):
{
  "Entities": [
    {
      "Score": 0.9857407212257385,
      "Type": "LOCATION",
      "Text": "Seattle",
      "BeginOffset": 10,
      "EndOffset": 17
    }
  ]
}
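A minimal sketch of the call that produces output like the JSON above, using boto3 (the region and credentials setup are assumptions):

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
response = comprehend.detect_entities(
    Text="I visited Seattle and enjoyed the fish market",
    LanguageCode="en",
)

# Keep only the LOCATION entities, i.e. the recognized place names.
places = [e["Text"] for e in response["Entities"] if e["Type"] == "LOCATION"]
print(places)  # ['Seattle']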
If the description is more general and you want to classify if the description is of a hotel, a restaurant, a theme park, a concert/show, or similar types of places, you can either use the Custom classification in Comprehend or the Neural Topic Model in SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html). You will need some examples of the classes and documents/sentences that are used for the model training.
I am using Google Cloud Platform to create content in an Indian regional language. Some of the content contains building and society names made of common words, like 'The Nest Glory'. After conversion with the Google Translate API, the building name should only be spelled out in the regional language; instead it is being literally translated. It sounds funny, and users will never find that building.
The Cloud Translation API can be told not to translate part of the text. Use the following HTML tags:
<span translate="no"> </span>
<span class="notranslate"> </span>
This functionality requires the source text to be submitted in HTML.
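For example, a minimal sketch with the Python v2 client (the target language and credentials file are assumptions for illustration):

from google.cloud import translate_v2 as translate

client = translate.Client.from_service_account_json("service-account.json")

# Wrap the building name in a notranslate span and submit the text as HTML.
html = 'Book a flat at <span translate="no">The Nest Glory</span> today.'
result = client.translate(html, target_language="hi", format_="html")
print(result["translatedText"])  # the span contents should be left untranslated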
Access to the Cloud Translation API
How to Disable Google Translate from Translating Specific Words
After some searching, I think one older way is to use the Transliteration API, which was deprecated by Google:
Google Transliteration API Deprecated
translate="no"
class="notranslate"
in last API versions this features doesn't work
That's a very complex thing to do: it needs semantic analysis as well as translation, and there is no real way to simply add that to your application. The best we can suggest is that you identify the names, remove them from the strings (or replace them with non-translated markers, as they may not be in anything like the same position in the translated messages) and re-instate them afterwards.
But ... that may well change the quality of the translation; as I said, Google Translate is pretty clever ... but it isn't at human-interpreter level yet.
If that doesn't work, you need to use a real-world human translator who is very familiar with the source and target languages.
I am using Python 2.7 & Django 1.7.
When I use the Google Translator Toolkit to machine-translate my .po files into another language (English to German), there are many errors due to the use of Django template variables in my translation tags.
I understand that machine translation is not so great, but I only want to test my translation strings on my test pages.
Here is an example of a typical error of the machine-translated .po file translated from English (en) to German (de).
#. Translators: {{ site_name_lowercase }} is a variable that does not require translation.
#: .\templates\users\reset_password_email_html.txt:47
#: .\templates\users\reset_password_email_txt.txt:18
#, python-format
msgid ""
"Once you've returned to %(site_name_lowercase)s.com, we will give you "
"instructions to reset your password."
msgstr "Sobald du mit% (site_name_lowercase) s.com zurückgegeben haben, geben wir Ihnen Anweisungen, um Ihr Passwort zurückzusetzen."
The %(site_name_lowercase)s is machine-translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
I have hundreds of these types of errors, and I estimate that a find & replace would take at least 7 hours. Plus, if I run makemessages and then translate the .po file again, I would have to go through the find & replace again.
I am hoping that there is some type of undocumented rule in the Google Translator Toolkit that will allow the machine translation to ignore the variables. I have read the Google Translator Toolkit docs and searched SO & Google, but did not find anything that would help.
Does anyone have any suggestions?
The %(site_name_lowercase)s is machine-translated to % (site_name_lowercase) s and is often concatenated to the preceding word, as shown above.
This is caused by tokenization prior to translation, followed by detokenization after translation, i.e. Google Translate tries to split the input before translation and re-merge it after translation. The variables you use are typically composed of characters that tokenizers use to detect token boundaries.

To avoid this sort of problem, you can pre-process your file and replace the offending variables with placeholders that do not have this issue - I suggest you try out a couple of things, e.g. _VAR_PLACE_HOLDER_. It is important that you do not use any punctuation characters that may cause the tokenizer to split. After pre-processing, translate your newly generated file and post-process by replacing the placeholders with their original values. Typically, your placeholder will be picked up as an Out-Of-Vocabulary (OOV) item and will be preserved during translation. Try to experiment with including a sequence number (to keep track of your placeholders during post-processing), since word reordering may occur.

There used to be a scientific API for Google Translate that gives you the token alignments. You could use these for post-processing as well.
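A minimal sketch of that pre/post-processing for python-format variables (the placeholder pattern is just an example; pick whatever survives translation best in your tests):

import re

VAR_RE = re.compile(r"%\([^)]+\)s")  # matches python-format variables like %(name)s

def preprocess(msgid):
    """Swap each %(var)s for a numbered, punctuation-free placeholder."""
    variables = []
    def _sub(match):
        variables.append(match.group(0))
        return "XVARPLACEHOLDER{}X".format(len(variables) - 1)
    return VAR_RE.sub(_sub, msgid), variables

def postprocess(translation, variables):
    """Put the original variables back after translation."""
    for i, var in enumerate(variables):
        translation = translation.replace("XVARPLACEHOLDER{}X".format(i), var)
    return translation

masked, variables = preprocess(
    "Once you've returned to %(site_name_lowercase)s.com, we will give you instructions."
)
# ... run `masked` through the Translator Toolkit / machine translation here ...
print(postprocess(masked, variables))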
Note that this procedure will not give you the best possible translation output, as the language model will not recognize the placeholder. You can see this illustrated here (with the placeholder, the token "gelezen" ends up in the wrong place):
https://translate.google.com/#en/nl/I%20have%20read%20SOME_VARIABLE_1%20some%20time%20ago%0AI%20have%20read%20a%20book%20some%20time%20ago
If you just want to test the system for your variables, and you do not care about the translation quality, this is the fastest way to go.
Should you decide to go for a better solution, you can solve this issue yourself by developing your own machine translation system (it's fun, by the way - see http://www.statmt.org/moses/) and applying the procedure explained above, but then with, for example, Part-of-Speech tags to improve the language model. Note that you can use the alignment information as well.