Get multiple variations from Google Translate API

Get multiple variations from Google Translate API - google-cloud-platform

When we make a query to Translate API
https://translation.googleapis.com/language/translate/v2?key=$API_KEY&q=hello&source=en&target=e
I only get 1 result in :
{
"data": {
"translations": [
{
"translatedText": "....."
}
]
}
}
Is it possible to get all variations (alternatives) of that word, not only 1 translation?

Microsoft Azure supports one. https://learn.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-dictionary-lookup .
For ex. https://api.cognitive.microsofttranslator.com/dictionary/lookup?api-version=3.0&from=en&to=es
[
{"Text":"hello"}
]
gives you a list of translations like this:
[
{
"normalizedSource": "hello",
"displaySource": "hello",
"translations": [
{
"normalizedTarget": "diga",
"displayTarget": "diga",
"posTag": "OTHER",
"confidence": 0.6909,
"prefixWord": "",
"backTranslations": [
{
"normalizedText": "hello",
"displayText": "hello",
"numExamples": 1,
"frequencyCount": 38
}
]
},
{
"normalizedTarget": "dime",
"displayTarget": "dime",
"posTag": "OTHER",
"confidence": 0.3091,
"prefixWord": "",
"backTranslations": [
{
"normalizedText": "tell me",
"displayText": "tell me",
"numExamples": 1,
"frequencyCount": 5847
},
{
"normalizedText": "hello",
"displayText": "hello",
"numExamples": 0,
"frequencyCount": 17
}
]
}
]
}
]
You can see 2 different translations in this case.

The Translation API service doesn't support the retrieval of multiple translations of a word, as mentioned in the FAQ Documentation:
Is it possible to get multiple translations of a word?
No. This feature is only available via the web interface at
translate.google.com
In case this feature doesn't cover your current needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service public documentation, as well as take a look the Issue Tracker tool in order to raise a Translation API feature request and notify to Google about this desired functionality.

Approach mapping Wiktionary using POS tags, related terms and Google-translated word.
TL;DR
The question is titled 'get-multiple-variations-from-google-translate-api', but in short, you (still) currently can't do this by using Google's service alone (as of Sept. 2022). It seems most companies, such as Google, want to continue charging for this service. This answer provides an approach using a (free) service as a pivot to get the term, related terms, and their POS (Parts of Speech) e.g. noun, verb, etc. before translating those terms and then re-querying the service.
This alternative creates a small pipeline that queries Wiktionary before (on the source language), and after (on the translated terms target language) the translation (using Google).
The small pipeline is written in python and bash.
Rationale
We could get word senses, for each POS (Part of Speech) and corresponding synonyms, then translate for each word sense since Google only translates word to word, and then match word senses for the corresponding target language using a tool such as Wiktionary.
Wiktionary
Fortunately, someone has already created a python library to query Wiktionary for multiple languages.
Script to get definitions / synonyms from Wiktionary (using python):
(requires wiktionaryparser )
e.g. python -m pip install wiktionaryparser
import sys;
import json;
from wiktionaryparser import WiktionaryParser;
parser = WiktionaryParser()
# sys.argv[1] is a language e.g. 'english'
parser.set_default_language(sys.argv[1])
print(
json.dumps(
[
[
{
'pos': d.get('partOfSpeech'),
'text':d.get('text'),
'examples':[e for e in d.get('examples')][0] if d.get('examples') else [],
'related': d.get('relatedWords')
} for d in w.get('definitions')
] for w in parser.fetch(sys.argv[2])
],
indent=2
)
)
Google translate + Wiktionary
The bash script below gets Wiktionary definitions, splits on synonym lists and correlates translations based on POS (Part of Speech).
To be honest this script is a bit convoluted, it uses a lot of utils, but it works. It could be refactored into python like the wiktionary part by anyone wanting to make something a bit more robust.
This github post provided some of the below script that call the free Google translate api.
#!/bin/bash
sl=$1
tl=$2
wiki_sl=$3
wiki_tl=$4
string=$5
ua='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
#echo "$string"
result="{\"${sl}\":[],\"${tl}\":[]}"
#set -x
while IFS= read line; do
# line could be better named 'synonym' here
pos="$(echo ${line} | jq -r ".pos")"
sl_result="$(echo $line | jq . -c)"
tl_result=""
opt_single="single?client=gtx&sl=${sl}&tl=${tl}&dt=t&q=${string//[[:blank:]]/+}"
full_url="http://translate.googleapis.com/translate_a/${opt_single}"
response=$(curl -sA "${ua}" "${full_url}")
tl_word="$(echo ${response} | jq -r '.[[0][0]][] | .[0:1][0]')"
echo "${tl_word}" | grep -q " " && continue 1
tl_result_new="$(python ./get_wiki.py "${wiki_tl}" "${tl_word}" | jq -r -c --arg POS "$pos" '.[][] | select(.pos==$POS)'),"
# making json
tl_result="[${tl_result_new}"
# iterate over synonyms
while IFS= read qry; do
opt_single="single?client=gtx&sl=${sl}&tl=${tl}&dt=t&q=${qry//[[:blank:]]/+}"
full_url="http://translate.googleapis.com/translate_a/${opt_single}"
response=$(curl -sA "${ua}" "${full_url}")
tl_word="$(echo ${response} | jq -r '.[[0][0]][] | .[0:1][0]')"
echo "${tl_word}" | grep -q " " && continue 1
tl_result_new="$(python ./get_wiki.py "${wiki_tl}" "${tl_word}" | jq -r -c --arg POS "$pos" '.[][] | select(.pos==$POS)'),"
# adding to json
tl_result="${tl_result},${tl_result_new}"
done< <(echo "${line}" | jq -c -r ' .related[].words[]' | \
sed -e 's/.*://;s/"//g;s/^ *//g;s/ *$//g' | tr ',' '\n')
tl_result="$(echo "${tl_result_new}" | sed 's/,$//g')"
[ -z "${tl_result}" ] && tl_result=null
[ -z "${sl_result}" ] && sl_result=null
result="{\"${sl}\":${sl_result},\"${tl}\":${tl_result}}"
echo "$result" | jq "."
done< <(python ./get_wiki.py "$wiki_sl" "$string" | \
jq -c -r '.[][]|select(.related[].relationshipType=="synonyms")') 2> /dev/null | jq -c '[.]'
How to use:
The first 2 arguments used are for google (source language, and target language in that order which are two-letter codes.
The second 2 arguments used are for Wiktionary (source language, a full word - e.g. 'English', 'French', etc.)
The final (fifth) argument is the single word to be translated.
./translate.sh en pt english portuguese help
In fact, the python 'wiktionaryparser' lib occasionally breaks and can throw an error, due to the fact that it is a webscraping library, which is why I add 2> /dev/null to silence stderr on output.
./translate.sh en pt english portuguese help 2> /dev/null
This script isn't perfect, but it is a starting point and a proof-of-concept to show you this is possible using a free tool such as wiktionary.
English to Portuguese
$ ./translate.sh en pt english portuguese help 2> /dev/null
Output:
[
{
"en": {
"pos": "noun",
"text": [
"help (usually uncountable, plural helps)",
"(uncountable) Action given to provide assistance; aid.",
"(usually uncountable) Something or someone which provides assistance with a task.",
"Documentation provided with computer software, etc. and accessed using the computer.",
"(usually uncountable) One or more people employed to help in the maintenance of a house or the operation of a farm or enterprise.",
"(uncountable) Correction of deficits, as by psychological counseling or medication or social support or remedial training."
],
"examples": "I need some help with my homework.",
"related": [
{
"relationshipType": "synonyms",
"words": [
"(action given to provide assistance): aid, assistance"
]
}
]
},
"pt": {
"pos": "noun",
"text": [
"assistência f (plural assistências)",
"assistance, aid, help",
"protection"
],
"examples": [],
"related": [
{
"relationshipType": "related terms",
"words": [
"assistir"
]
}
]
}
}
]
[
{
"en": {
"pos": "verb",
"text": [
"help (third-person singular simple present helps, present participle helping, simple past helped or (archaic) holp, past participle helped or (archaic) holpen)",
"(transitive) To provide assistance to (someone or something).",
"(transitive) To assist (a person) in getting something, especially food or drink at table; used with to.",
"(transitive) To contribute in some way to.",
"(intransitive) To provide assistance.",
"(transitive) To avoid; to prevent; to refrain from; to restrain (oneself). Usually used in nonassertive contexts with can."
],
"examples": "Risk is everywhere. […] For each one there is a frighteningly precise measurement of just how likely it is to jump from the shadows and get you. “The Norm Chronicles” […] aims to help data-phobes find their way through this blizzard of risks.",
"related": [
{
"relationshipType": "synonyms",
"words": [
"(provide assistance to): aid, assist, come to the aid of, help out; See also Thesaurus:help",
"(contribute in some way to): contribute to",
"(provide assistance): assist; See also Thesaurus:assist"
]
}
]
},
"pt": {
"pos": "verb",
"text": [
"ajudar (first-person singular present indicative ajudo, past participle ajudado)",
"to help, aid; to assist"
],
"examples": "Ajude-me! ― Help me!",
"related": [
{
"relationshipType": "related terms",
"words": [
"ajuda",
"ajudante"
]
}
]
}
}
]
English to Latin
$ ./translate.sh en la english latin body | jq '.'
[
{
"en": {
"pos": "noun",
"text": [
"body (countable and uncountable, plural bodies)",
"Physical frame.",
"Main section.",
"Coherent group.",
"Material entity.",
"(printing) The shank of a type, or the depth of the shank (by which the size is indicated).",
"(geometry) A three-dimensional object, such as a cube or cone."
],
"examples": "I saw them walking from a distance, their bodies strangely angular in the dawn light.",
"related": [
{
"relationshipType": "synonyms",
"words": [
"See also Thesaurus:body",
"See also Thesaurus:corpse"
]
}
]
},
"la": {
"pos": "noun",
"text": [
"cadāver n (genitive cadāveris); third declension",
"A corpse, cadaver, carcass"
],
"examples": [],
"related": []
}
}
]
When it doesn't work
Sometimes there is no output at all.
Shortcomings of this approach, and going further
Despite a lot of words being on Wiktionary, and a lot of synonyms being present, they are not always inside the 'related' field, sometimes synonyms are in the 'text' field, which gives word senses. I suspect that the partial information wiktionaryparser provides is the same on the Wiktionary site.
One could use any dictionary tool, or online thesaurus, such as wordnet, to first get possible POS tags and a word's synsets, or query a fasttext model to get a word's nearest neighbors, then filter only words that are nearest neighbors from the 'text' field in wiktionary.

Related

Figuring out what this sed command do

I'm having a hard time trying to discover what the next comand is doing.
I'm trying to monitor different services on Linux using systemctl. I need a Json output with all the services on Linux that are running on the machine.
The problem is that with this comand the Status ouput is: "enable enabled". I only need the first parameter (state), and trying to delete the second one (Vendor preset) I really don't get it working. Basically because I don't understand it. I know with Sed is trying to replace some strings but with so many characters for me this isn't readable.
echo "{\"data\":[$(systemctl list-unit-files --type=service|grep \.service|grep -v "#"|sed -E -e "s/\.service\s+/\",\"{#STATUS}\":\"/;s/(\s+)?$/\"},/;s/^/{\"{#NAME}\":\"/;$ s/.$//")]}"
Result:
"data": [{
"{#NAME}": "accounts-daemon",
"{#STATUS}": "enabled enabled"
},
{
"{#NAME}": "acpid",
"{#STATUS}": "disabled enabled"
}, {
"{#NAME}": "zabbix-agent",
"{#STATUS}": "enabled enabled"
}
]
}
Expected result:
"data": [{
"{#NAME}": "accounts-daemon",
"{#STATUS}": "enabled"
},
{
"{#NAME}": "acpid",
"{#STATUS}": "disabled"
}, {
"{#NAME}": "zabbix-agent",
"{#STATUS}": "enabled"
}
]
}
Command without "sed": systemctl list-unit-files --type=service
UNIT FILE
STATE
VENDOR PRESET
accounts-daemon.service
enabled
enabled
acpid.service
disabled
enabled
zabbix-agent
static
enabled

The relevant substitute in your code is
s/(\s+)?$/
Try to replace that by deleting everyting starting with the first seperator (\s)
That is
s/\s.*$/
The modified command becomes
echo "{\"data\":[$(systemctl list-unit-files --type=service|grep \.service|grep -v "#"|sed -E -e "s/\.service\s+/\",\"{#STATUS}\":\"/;s/\s.*$/\"},/;s/^/{\"{#NAME}\":\"/;$ s/.$//")]}"

How to debug a (PCRE) regex passed to grep?

I'm trying to debug a regex passed to grep that doesn't seem to be working just on my system.
This is the full command that should return the latest terraform release version:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest" | grep -Po '"tag_name": "v\K.*?(?=")'
Which seems to be working for others but not me.
Adding a * quantifier after "tag_name": to match extra spaces makes it work for me:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest" | grep -Po '"tag_name": *"v\K.*?(?=")'
Here's the response from the wget without piping to grep:
{
"url": "https://api.github.com/repos/hashicorp/terraform/releases/20814583",
"assets_url": "https://api.github.com/repos/hashicorp/terraform/releases/20814583/assets",
"upload_url": "https://uploads.github.com/repos/hashicorp/terraform/releases/20814583/assets{?name,label}",
"html_url": "https://github.com/hashicorp/terraform/releases/tag/v0.12.12",
"id": 20814583,
"node_id": "MDc6UmVsZWFzZTIwODE0NTgz",
"tag_name": "v0.12.12",
"target_commitish": "master",
"name": "",
"draft": false,
"author": {
"login": "apparentlymart",
"id": 20180,
"node_id": "MDQ6VXNlcjIwMTgw",
"avatar_url": "https://avatars1.githubusercontent.com/u/20180?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/apparentlymart",
"html_url": "https://github.com/apparentlymart",
"followers_url": "https://api.github.com/users/apparentlymart/followers",
"following_url": "https://api.github.com/users/apparentlymart/following{/other_user}",
"gists_url": "https://api.github.com/users/apparentlymart/gists{/gist_id}",
"starred_url": "https://api.github.com/users/apparentlymart/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/apparentlymart/subscriptions",
"organizations_url": "https://api.github.com/users/apparentlymart/orgs",
"repos_url": "https://api.github.com/users/apparentlymart/repos",
"events_url": "https://api.github.com/users/apparentlymart/events{/privacy}",
"received_events_url": "https://api.github.com/users/apparentlymart/received_events",
"type": "User",
"site_admin": false
},
"prerelease": false,
"created_at": "2019-10-18T18:39:16Z",
"published_at": "2019-10-18T18:45:33Z",
"assets": [],
"tarball_url": "https://api.github.com/repos/hashicorp/terraform/tarball/v0.12.12",
"zipball_url": "https://api.github.com/repos/hashicorp/terraform/zipball/v0.12.12",
"body": "BUG FIXES:\r\n\r\n* backend/remote: Don't do local validation of whether variables are set prior to submitting, because only the remote system knows the full set of configured stored variables and environment variables that might contribute. This avoids erroneous error messages about unset required variables for remote runs when those variables will be set by stored variables in the remote workspace. ([#23122](https://github.com/hashicorp/terraform/issues/23122))"
}
And using https://regex101.com I can see that "tag_name": "v\K.*?(?=") and "tag_name": *"v\K.*?(?=") both match the version number correctly.
So there must be something wrong with my system and I'm just very curious why the original one doesn't work for me and how (if possible) to debug in situations like this.

I've been able to narrow it down to the following. If I execute the wget command without the piped grep and without formatting the json response:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest"
then I get a json without any whitespaces (I'll post only one a part of the response):
"html_url":"https://github.com/hashicorp/terraform/releases/tag/v0.12.12","id":20814583,"node_id":"MDc6UmVsZWFzZTIwODE0NTgz","tag_name":"v0.12.12","target_commitish":"master","name":"","draft":false
So naturally the original regex "tag_name": "v\K.*?(?=") fails because there is no space after :
This is clearly not related to the regex that is passed to the grep or the grep itself. I don't see the point in digging into the response itself here so the original question can be considered resolved (Though if someone knows what could be causing this please post a comment.)

It is very likely that your RegExp engine does not understand \K. There are many dialects for regexps.
Using standard PCRE regexp terms usually yields good results across all engines.
$ curl -s "https://api.github.com/repos/hashicorp/terraform/releases/latest" | egrep -oe '"tag_name": "v(.*)"'
"tag_name": "v0.12.12"
Now if you only want the version number, you need to fetch for the numbers afterwards (as using ?! to ignore a pattern might not always work).
curl -s "https://api.github.com/repos/hashicorp/terraform/releases/latest" | egrep -oe '"tag_name": "v(.*)"' | egrep -oe '([0-9]+\.?)+'
0.12.12

Using jq to parse json output of AWS CLI tools with Lightsail

I'm trying to modify a script to automate lightsail snapshots, and I am having trouble modifying the jq query.
I'm trying to parse the output of aws lightsail get-instance-snapshots
This is the original line from the script:
aws lightsail get-instance-snapshots | jq '.[] | sort_by(.createdAt) | select(.[0].fromInstanceName == "WordPress-Test-Instance") | .[].name'
which returns a list of snapshot names with one per line.
I need to modify the query so that is does not return all snapshots, but rather only ones where the name start with 'autosnap'. i'm doing this as the script rotates snapshots, but I don't want it to delete snapshots I manually create (which will not start with 'autosnap').
Here is a redacted sample output from aws lightsail get-instance-snapshots
{
"instanceSnapshots": [
{
"location": {
"availabilityZone": "all",
"regionName": "*****"
},
"arn": "*****",
"fromBlueprintId": "wordpress_4_9_2_1",
"name": "autosnap-WordPress-Test-Instance-2018-04-16_01.46",
"fromInstanceName": "WordPress-Test-Instance",
"fromBundleId": "nano_1_2",
"supportCode": "*****",
"sizeInGb": 20,
"createdAt": 1523843190.117,
"fromAttachedDisks": [],
"fromInstanceArn": "*****",
"resourceType": "InstanceSnapshot",
"state": "available"
},
{
"location": {
"availabilityZone": "all",
"regionName": "*****"
},
"arn": "*****",
"fromBlueprintId": "wordpress_4_9_2_1",
"name": "Premanent-WordPress-Test-Instance-2018-04-16_01.40",
"fromInstanceName": "WordPress-Test-Instance",
"fromBundleId": "nano_1_2",
"supportCode": "*****",
"sizeInGb": 20,
"createdAt": 1523842851.69,
"fromAttachedDisks": [],
"fromInstanceArn": "*****",
"resourceType": "InstanceSnapshot",
"state": "available"
}
]
}
I would have thought something like this would work, but I'm not having any luck after many attempts...
aws lightsail get-instance-snapshots | jq '.[] | sort_by(.createdAt) | select(.[0].fromInstanceName == "WordPress-Test-Instance") | select(.[0].name | test("autosnap")) |.[].name'
Any help would be greatly appreciated!

The basic query for making the selection you describe would be:
.instanceSnapshots | map(select(.name|startswith("autosnap")))
(If you didn't need to preserve the array structure, you could go with:
.instanceSnapshots[] | select(.name|startswith("autosnap"))
)
You could then perform additional filtering by extending the pipeline.
If you were to use test/1, the appropriate invocation would be test("^autosnap") or perhaps test("^autosnap-").
Example
.instanceSnapshots
| map(select(.name|startswith("autosnap")))
| map(select(.fromInstanceName == "WordPress-Test-Instance"))
| sort_by(.createdAt)
| .[].name
The two successive selects could of course be compacted into one. For efficiency, the sorting should be done as late as possible.
Postscript
Although you might indeed be able to get away with commencing the pipeline with .[] instead of .instanceSnapshots, the latter is advisable in case the JSON schema changes. In a sense, the whole point of data formats like JSON is to make it easy to write queries that are robust with respect to (sane) schema-evolution.

Is there any easy way / API to find out the number of pipelines on a gocd server?

Sorry for the brief question, but just wondering if there's an API to find out the number of pipelines on a GoCD server.

The Pipeline Groups API will give you what you need after some JSON parsing.
$ curl 'https://ci.example.com/go/api/config/pipeline_groups' \
-u 'username:password'
Returns:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
[
{
"pipelines": [
{
"stages": [
{
"name": "up42_stage"
}
],
"name": "up42",
"materials": [
{
"description": "URL: https://github.com/gocd/gocd, Branch: master",
"fingerprint": "2d05446cd52a998fe3afd840fc2c46b7c7e421051f0209c7f619c95bedc28b88",
"type": "Git"
}
],
"label": "${COUNT}"
}
],
"name": "first"
}
]

You can grab the config.xml file and parse it. from the config repo or via http.

As an alternative, you can just get the cctray file from your server at http://yourgoserver/go/cctray.xml and parse it.
It contains information about all the pipelines (including its stages)

I would recommend using yagocd:
from yagocd import Yagocd
go = Yagocd(server='https://build.gocd.io')
# login as guest
go._session.get('https://build.gocd.io/go/plugin/interact/gocd.guest.user.auth.plugin/index')
print(len(list(go.pipelines)))

Yes, of course. You can get the desired output in different ways. The first easy way to get the number of pipelines and other statistical information from the GoCD support URL (https://example.com/go/api/support) which requires admin privilege.
If the user does not have the admin privilege, we need to go with the GoCD pipeline_groups API. The below command should give you the exact result with jq(JSON processor)
$ curl 'https://example.com/go/api/config/pipeline_groups' -u 'username:password' | jq -r '.[] | .pipelines[].name' | wc -l
NOTE: Still Go Administrator users can get the actual number of pipelines.

how to locate 'imqi.hpp' from node-gyp

I am trying to use "nan" module to call MQ_CONNECT() from node.js
See
Node.js and C/C++ integration: how to properly implement callbacks?
and
https://github.com/nodejs/nan
When I use "node-gyp" it says it can not find "imqi.hpp", the MQ header
As far as I can see, the path to MQ includes has to be provided in "binding.gyp", and I have tried this without success:
{
"targets": [
{
"target_name": "mqconn",
"sources": [
"initall.cc",
"mqconn.cc"
],
"include_dirs": [
"<!(node -e \"require('nan')\")",
"c:\MQ\tools\cplus\include"
]
}
]
}
Does anyboby have a clue on how to fix this ?
Sebastian.
PD.- of course, the file is where the path indicates:
c:\>dir c:\MQ\tools\cplus\include\imqi.hpp
Volume in drive C is OS
Volume Serial Number is 12AA-0601
Directory of c:\MQ\tools\cplus\include
27/06/2013 02:00 1.538 imqi.hpp

Because binding.gyp is in JSON, the String "c:\MQ\tools\cplus\include"is a standard JavaScript String, and therefore the \ needs to be escaped to \\.
So you should replace "c:\MQ\tools\cplus\include" into "c:\\MQ\\tools\\cplus\\include".
I hope that fixes the problem...

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Get multiple variations from Google Translate API - google-cloud-platform

Related

Figuring out what this sed command do

How to debug a (PCRE) regex passed to grep?

Using jq to parse json output of AWS CLI tools with Lightsail

Is there any easy way / API to find out the number of pipelines on a gocd server?

how to locate 'imqi.hpp' from node-gyp

Categories

Resources