awk (or sed/grep) to get occurrences of substring


I have a json string in a bash variable, which is something like this:
{
"items": [
{
"foo": null,
"timestamp": 1553703000,
"bar": 123
},
{
"foo": null,
"timestamp": 1553703200,
"bar": 456
},
{
"foo": null,
"timestamp": 1553703400,
"bar": 789
}
]
}
I want to know how many of those timestamps are after a given datetime, so if I have 1553703100 it'll return 2.
(Bonus imaginary points if you can get me just that number!)
As a step towards that, I want to get just the matches of "timestamp": \d+, in the string so that I can loop through them in a bash script.
I've used sed and grep a bit, but never used awk, and from my reading it seems like that might be the better match for the task.
Other info:
- The json is already pretty-printed, as above, so the timestamps would always be on separate lines.
- This is to run in Cygwin, so I have awk/gawk, sed, and grep/egrep, but probably not others.
- Could be any number of timestamps in the json.

You didn't provide the expected output so it's a guess but is this what you're trying to do?
$ echo "$var" | jq '.items[].timestamp'
1553703000
1553703200
1553703400
or maybe:
$ echo "$var" | jq '.items[].timestamp | select(. > 1553703100)'
1553703200
1553703400
or:
$ echo "$var" | jq '[.items[].timestamp | select(. > 1553703100)] | length'
2
WARNING: I'm just learning jq so there may be better ways to do the above!

edit: The second approach listed below has serious problems that were very helpfully outlined by @EdMorton. I've elected to keep the old code for educational purposes.
This version avoids substr() and prints 0 rather than an empty string when i is never set:
$ awk -v dt=1553703100 '
/timestamp/ && $2+0>dt {i++}
END {print i+0}
' <<< "$var"
2
WARNING: PROBLEMATIC CODE
Here I used substr(string, index, [characters]) to trim the comma off your second field. The /timestamp/ regex is not complex; it could be improved if your json became more intricate.
$ awk -v dt=1553703100 '
/timestamp/ && substr($2, 0, length($2)) > dt {i++}
END {print i}
' <<< "$var"
2

You can also quickly implement a Python solution:
input:
$ cat data.json
{
"items": [
{
"foo": null,
"timestamp": 1553703000,
"bar": 123
},
{
"foo": null,
"timestamp": 1553703200,
"bar": 456
},
{
"foo": null,
"timestamp": 1553703400,
"bar": 789
}
]
}
code:
$ cat extract_value2.py
import json
tLimit = 1553703100
with open('data.json') as f:
    data = json.load(f)
print([t['timestamp'] for t in data["items"] if t['timestamp'] > tLimit])
output:
$ python extract_value2.py
[1553703200, 1553703400]
count code:
$ cat extract_value2.py
import json
tLimit = 1553703100
with open('data.json') as f:
    data = json.load(f)
print(len([t['timestamp'] for t in data["items"] if t['timestamp'] > tLimit]))
output:
$ python extract_value2.py
2
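Since the question starts from a JSON string held in a shell variable rather than a file, the same count can be taken from a string. A minimal sketch, assuming Python is available in the Cygwin environment (the question suggests it may not be); the function name `count_after` and the shortened sample string are hypothetical:

```python
import json


def count_after(json_text, limit):
    """Count the items whose "timestamp" is strictly greater than limit."""
    data = json.loads(json_text)
    return sum(1 for item in data["items"] if item["timestamp"] > limit)


# Shortened stand-in for the JSON held in the bash variable.
sample = '{"items": [{"timestamp": 1553703000}, {"timestamp": 1553703200}, {"timestamp": 1553703400}]}'
print(count_after(sample, 1553703100))  # prints 2
```

To feed it from bash, the string could be read from standard input with `sys.stdin.read()` and piped in via `echo "$var"`.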

Related

How to use powerShell and regular expressions to parse a text file

I am new to PowerShell and need to read a text file, parse it to extract the data we need, and write the result to a .csv file. At this point I am still unable to parse the file and am confused about which PowerShell commands to use and how to incorporate regular expressions. Rather than write out everything I've tried, I think it would be more useful to ask for help and then ask questions about anything I don't fully understand. FYI: we're running Windows 10 and my only two scripting options are batch or PowerShell.
We have a JSON file that was formatted by notepad++ and looks like this:
"issue": [{
"field": [{
"name": "someName",
"value": [],
"values": []
}],
"field": [{
"name": "numberinproject",
"value": ["81"],
"values": ["81"]
}],
"field": [{
"name": "summary",
"value": ["This is a summary for 81."],
"values": ["This is a summary for 81."]
}],
"comment":[{
"text": "someText for 81 - 01",
"markdown":false,
"created":0123456789101,
"updated":null,
"Author":"first.last01",
"permitted group":null
},{
"text": "someText for 81 - 02",
"markdown":false,
"created":0123456789102,
"updated":null,
"Author":"first.last02",
"permitted group":null
},{
"text": "someText for 81 - 03",
"markdown":false,
"created":0123456789103,
"updated":null,
"Author":"first.last03",
"permitted group":null
}],
"field": [{
"name": "someNameTwo",
"value": [],
"values": []
}],
"field": [{
"name": "numberinproject",
"value": ["83"],
"values": ["83"]
}],
"field": [{
"name": "summary",
"value": ["This is a summary for 83."],
"values": ["This is a summary for 83."]
}],
"comment":[]
}
]
What I am attempting to do is extract the numberinproject and summary values, plus each comment's text, created and Author fields.
Notice that there could be zero to multiple comments per project number. The comment.created field is a 13-digit epoch number that has to be converted into mm/dd/yyyy hh:mm:ss AM/PM.
I had hoped to export this data into a .csv file but at this time would be happy just getting the data parsed out of the file.
Thanks for whatever feedback you can give.
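The epoch conversion described above can be sketched on its own. This is an illustration in Python (the language used elsewhere on this page), not PowerShell. It assumes the 13-digit value is milliseconds since 1970-01-01 and that UTC output is acceptable, neither of which the question states. The function name and the sample value are hypothetical:

```python
from datetime import datetime, timezone


def format_epoch_ms(epoch_ms):
    """Format a millisecond epoch value as mm/dd/yyyy hh:mm:ss AM/PM in UTC."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime("%m/%d/%Y %I:%M:%S %p")


print(format_epoch_ms(1553703000000))  # 03/27/2019 04:10:00 PM
```

The same arithmetic (divide by 1000, add to the Unix epoch, format) applies in any language; only the formatting call differs.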
===================================================
By request: Here are some of the things I tried and I apologise for this being such a mess. Since the "json" file was not in a format that convertfrom-json could use I assumed the file was actually text and that is where this starts.
What I've picked up has been from Searching on the web. If anyone can suggest a good article, please let me know and I will read it.
Set-Variable -Name "inputFile" -Value "inputFile.txt"
Set-Variable -Name "outputTXTFile" -Value "outputTXTFile.txt"
Set-Variable -Name "outputFile" -Value "outputFile.csv"
numberinProject = \"value\"\:\s\[\"\d+
summary = \"value\"\:\s\[\".+\"\],
comment - text = \"text\"\:\s\".+\",
comment - created = \"created\"\:\d{13}
comment - author = \"Author\"\:\"\w+\.\w+
## This actually worked. Though it grabbed the whole line, my plan was to then parse it a for a substring.
$results = Get-Content -Path $inputFile | Select-String -Pattern '"values": ' -CaseSensitive -SimpleMatch
# ------------------------------------------------
# However, If I tried using regex, the parse failed
$results = Get-Content -Path $inputFile | Select-String -Pattern \"values\"\:\s\[\"\d+ -CaseSensitive -SimpleMatch
# I also tried this
#$A = Get-ChildItem $inputFile | Select-String -Pattern '(<ID>\"value\"\:\s\[\"\d'
# $results | Export-CSV $outputFile -NoTypeInformation
$results | Out-File $outputTXTFile
# ---------------------------------------------------------
#I tried to output the file as a single string for manipulation - it didn't work
Get-Content -Path $inputFile) -join "`r`n" | Out-File $outputTXTFile
# I tried to use "patterns" to find the data but that didn't work
$issueIDPattern = "(<ID>\"value\"\:\s\[\"\d+)"
$summaryPattern = "\"value\"\:\s\[\".+\"\],"
$commentTextPattern = "\"text\"\:\s\".+\","
$commentDatePattern = "\"created\"\:\d{13}"
$commentAuthorPattern = "\"Author\"\:\"\w+\.\w+
Get-ChildItem $inputFile|
Select-String -Pattern $issueIDPattern |
Foreach-Object {
$ID = $_.Matches[0].Groups['ID'].Value
[PSCustomObject] @{
issueNum = $ID
}
}
### Also tried a variation of this
Get-Content C:\Path\To\File.txt) -join "`r`n" -Split "(?m)^(?=\S)" |
Where{$_} |
ForEach{
Clear-Variable commentauthor,commentcreated,commenttext,summary,numberinProject
$commentcreated = @()
$numberinProject = ($_ -split "`r`n")[0].trim()
Switch -regex ($_ -split "`r`n"){
"^\s+summary:" {$summary = ($_ -split ':',2)[-1].trim();Continue}
"^\s+.:\\" {$commentcreated += $_.trim();continue}
"^\s+commenttext" {$commenttext = [RegEx]::Matches($_,"(?<=commenttext installed from )(.+?)(?= \[)").value;continue}
}
[PSCustomObject]#{'numberinProject' = $numberinProject;'summary' = $summary; 'commenttext' = $commenttext; 'commentcreated' = $commentcreated}
}

Get multiple variations from Google Translate API

When we make a query to Translate API
https://translation.googleapis.com/language/translate/v2?key=$API_KEY&q=hello&source=en&target=e
I only get one result:
{
"data": {
"translations": [
{
"translatedText": "....."
}
]
}
}
Is it possible to get all variations (alternatives) of that word, not only 1 translation?

Microsoft Azure Translator supports this through its dictionary lookup endpoint: https://learn.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-dictionary-lookup
For example, a POST to https://api.cognitive.microsofttranslator.com/dictionary/lookup?api-version=3.0&from=en&to=es with the request body
[
{"Text":"hello"}
]
gives you a list of translations like this:
[
{
"normalizedSource": "hello",
"displaySource": "hello",
"translations": [
{
"normalizedTarget": "diga",
"displayTarget": "diga",
"posTag": "OTHER",
"confidence": 0.6909,
"prefixWord": "",
"backTranslations": [
{
"normalizedText": "hello",
"displayText": "hello",
"numExamples": 1,
"frequencyCount": 38
}
]
},
{
"normalizedTarget": "dime",
"displayTarget": "dime",
"posTag": "OTHER",
"confidence": 0.3091,
"prefixWord": "",
"backTranslations": [
{
"normalizedText": "tell me",
"displayText": "tell me",
"numExamples": 1,
"frequencyCount": 5847
},
{
"normalizedText": "hello",
"displayText": "hello",
"numExamples": 0,
"frequencyCount": 17
}
]
}
]
}
]
You can see 2 different translations in this case.
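To pull the alternatives out of a response shaped like the one above, something along these lines could work. It is a minimal sketch that only parses the JSON and makes no network call. The function name `list_translations` is hypothetical and the sample is a trimmed copy of the response shown above:

```python
import json


def list_translations(response_text):
    """Return (displayTarget, confidence) pairs from a dictionary-lookup response."""
    pairs = []
    for entry in json.loads(response_text):
        for translation in entry.get("translations", []):
            pairs.append((translation["displayTarget"], translation["confidence"]))
    return pairs


# Trimmed copy of the response shown above.
sample = """
[{"normalizedSource": "hello", "displaySource": "hello",
  "translations": [
    {"displayTarget": "diga", "confidence": 0.6909},
    {"displayTarget": "dime", "confidence": 0.3091}
  ]}]
"""
for word, confidence in list_translations(sample):
    print(word, confidence)
```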

The Translation API service doesn't support the retrieval of multiple translations of a word, as mentioned in the FAQ Documentation:
Is it possible to get multiple translations of a word?
No. This feature is only available via the web interface at
translate.google.com
If this limitation doesn't meet your needs, you can use the Send Feedback button, located at the lower left and upper right corners of the service's public documentation, or use the Issue Tracker to raise a Translation API feature request and let Google know you want this functionality.

Approach mapping Wiktionary using POS tags, related terms and Google-translated word.
TL;DR
The question is titled 'get-multiple-variations-from-google-translate-api', but in short, you (still) currently can't do this by using Google's service alone (as of Sept. 2022). It seems most companies, such as Google, want to continue charging for this service. This answer provides an approach using a (free) service as a pivot to get the term, related terms, and their POS (Parts of Speech) e.g. noun, verb, etc. before translating those terms and then re-querying the service.
This alternative creates a small pipeline that queries Wiktionary before (on the source language), and after (on the translated terms target language) the translation (using Google).
The small pipeline is written in python and bash.
Rationale
We could get word senses, for each POS (Part of Speech) and corresponding synonyms, then translate for each word sense since Google only translates word to word, and then match word senses for the corresponding target language using a tool such as Wiktionary.
Wiktionary
Fortunately, someone has already created a python library to query Wiktionary for multiple languages.
Script to get definitions / synonyms from Wiktionary (using python):
(requires wiktionaryparser )
e.g. python -m pip install wiktionaryparser
import sys
import json
from wiktionaryparser import WiktionaryParser
parser = WiktionaryParser()
# sys.argv[1] is a language e.g. 'english'
# sys.argv[2] is the word to look up
parser.set_default_language(sys.argv[1])
print(
    json.dumps(
        [
            [
                {
                    'pos': d.get('partOfSpeech'),
                    'text': d.get('text'),
                    'examples': [e for e in d.get('examples')][0] if d.get('examples') else [],
                    'related': d.get('relatedWords')
                } for d in w.get('definitions')
            ] for w in parser.fetch(sys.argv[2])
        ],
        indent=2
    )
)
Google translate + Wiktionary
The bash script below gets Wiktionary definitions, splits on synonym lists and correlates translations based on POS (Part of Speech).
To be honest, this script is a bit convoluted and uses a lot of utilities, but it works. Anyone wanting something more robust could refactor it into Python, like the Wiktionary part.
Part of the script below, the call to the free Google Translate endpoint, comes from a GitHub post.
#!/bin/bash
sl=$1
tl=$2
wiki_sl=$3
wiki_tl=$4
string=$5
ua='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
#echo "$string"
result="{\"${sl}\":[],\"${tl}\":[]}"
#set -x
while IFS= read line; do
# line could be better named 'synonym' here
pos="$(echo ${line} | jq -r ".pos")"
sl_result="$(echo $line | jq . -c)"
tl_result=""
opt_single="single?client=gtx&sl=${sl}&tl=${tl}&dt=t&q=${string//[[:blank:]]/+}"
full_url="http://translate.googleapis.com/translate_a/${opt_single}"
response=$(curl -sA "${ua}" "${full_url}")
tl_word="$(echo ${response} | jq -r '.[[0][0]][] | .[0:1][0]')"
echo "${tl_word}" | grep -q " " && continue 1
tl_result_new="$(python ./get_wiki.py "${wiki_tl}" "${tl_word}" | jq -r -c --arg POS "$pos" '.[][] | select(.pos==$POS)'),"
# making json
tl_result="[${tl_result_new}"
# iterate over synonyms
while IFS= read qry; do
opt_single="single?client=gtx&sl=${sl}&tl=${tl}&dt=t&q=${qry//[[:blank:]]/+}"
full_url="http://translate.googleapis.com/translate_a/${opt_single}"
response=$(curl -sA "${ua}" "${full_url}")
tl_word="$(echo ${response} | jq -r '.[[0][0]][] | .[0:1][0]')"
echo "${tl_word}" | grep -q " " && continue 1
tl_result_new="$(python ./get_wiki.py "${wiki_tl}" "${tl_word}" | jq -r -c --arg POS "$pos" '.[][] | select(.pos==$POS)'),"
# adding to json
tl_result="${tl_result},${tl_result_new}"
done< <(echo "${line}" | jq -c -r ' .related[].words[]' | \
sed -e 's/.*://;s/"//g;s/^ *//g;s/ *$//g' | tr ',' '\n')
tl_result="$(echo "${tl_result_new}" | sed 's/,$//g')"
[ -z "${tl_result}" ] && tl_result=null
[ -z "${sl_result}" ] && sl_result=null
result="{\"${sl}\":${sl_result},\"${tl}\":${tl_result}}"
echo "$result" | jq "."
done< <(python ./get_wiki.py "$wiki_sl" "$string" | \
jq -c -r '.[][]|select(.related[].relationshipType=="synonyms")') 2> /dev/null | jq -c '[.]'
How to use:
The first two arguments are for Google: the source language and the target language, in that order, as two-letter codes.
The next two arguments are for Wiktionary: the source language and the target language, each as a full word (e.g. 'English', 'French', etc.).
The fifth and final argument is the single word to be translated.
./translate.sh en pt english portuguese help
In fact, the python 'wiktionaryparser' lib occasionally breaks and can throw an error, due to the fact that it is a webscraping library, which is why I add 2> /dev/null to silence stderr on output.
./translate.sh en pt english portuguese help 2> /dev/null
This script isn't perfect, but it is a starting point and a proof-of-concept to show you this is possible using a free tool such as wiktionary.
English to Portuguese
$ ./translate.sh en pt english portuguese help 2> /dev/null
Output:
[
{
"en": {
"pos": "noun",
"text": [
"help (usually uncountable, plural helps)",
"(uncountable) Action given to provide assistance; aid.",
"(usually uncountable) Something or someone which provides assistance with a task.",
"Documentation provided with computer software, etc. and accessed using the computer.",
"(usually uncountable) One or more people employed to help in the maintenance of a house or the operation of a farm or enterprise.",
"(uncountable) Correction of deficits, as by psychological counseling or medication or social support or remedial training."
],
"examples": "I need some help with my homework.",
"related": [
{
"relationshipType": "synonyms",
"words": [
"(action given to provide assistance): aid, assistance"
]
}
]
},
"pt": {
"pos": "noun",
"text": [
"assistência f (plural assistências)",
"assistance, aid, help",
"protection"
],
"examples": [],
"related": [
{
"relationshipType": "related terms",
"words": [
"assistir"
]
}
]
}
}
]
[
{
"en": {
"pos": "verb",
"text": [
"help (third-person singular simple present helps, present participle helping, simple past helped or (archaic) holp, past participle helped or (archaic) holpen)",
"(transitive) To provide assistance to (someone or something).",
"(transitive) To assist (a person) in getting something, especially food or drink at table; used with to.",
"(transitive) To contribute in some way to.",
"(intransitive) To provide assistance.",
"(transitive) To avoid; to prevent; to refrain from; to restrain (oneself). Usually used in nonassertive contexts with can."
],
"examples": "Risk is everywhere. […] For each one there is a frighteningly precise measurement of just how likely it is to jump from the shadows and get you. “The Norm Chronicles” […] aims to help data-phobes find their way through this blizzard of risks.",
"related": [
{
"relationshipType": "synonyms",
"words": [
"(provide assistance to): aid, assist, come to the aid of, help out; See also Thesaurus:help",
"(contribute in some way to): contribute to",
"(provide assistance): assist; See also Thesaurus:assist"
]
}
]
},
"pt": {
"pos": "verb",
"text": [
"ajudar (first-person singular present indicative ajudo, past participle ajudado)",
"to help, aid; to assist"
],
"examples": "Ajude-me! ― Help me!",
"related": [
{
"relationshipType": "related terms",
"words": [
"ajuda",
"ajudante"
]
}
]
}
}
]
English to Latin
$ ./translate.sh en la english latin body | jq '.'
[
{
"en": {
"pos": "noun",
"text": [
"body (countable and uncountable, plural bodies)",
"Physical frame.",
"Main section.",
"Coherent group.",
"Material entity.",
"(printing) The shank of a type, or the depth of the shank (by which the size is indicated).",
"(geometry) A three-dimensional object, such as a cube or cone."
],
"examples": "I saw them walking from a distance, their bodies strangely angular in the dawn light.",
"related": [
{
"relationshipType": "synonyms",
"words": [
"See also Thesaurus:body",
"See also Thesaurus:corpse"
]
}
]
},
"la": {
"pos": "noun",
"text": [
"cadāver n (genitive cadāveris); third declension",
"A corpse, cadaver, carcass"
],
"examples": [],
"related": []
}
}
]
When it doesn't work
Sometimes there is no output at all.
Shortcomings of this approach, and going further
Despite a lot of words being on Wiktionary, and a lot of synonyms being present, they are not always inside the 'related' field, sometimes synonyms are in the 'text' field, which gives word senses. I suspect that the partial information wiktionaryparser provides is the same on the Wiktionary site.
One could use any dictionary tool, or online thesaurus, such as wordnet, to first get possible POS tags and a word's synsets, or query a fasttext model to get a word's nearest neighbors, then filter only words that are nearest neighbors from the 'text' field in wiktionary.

Python 2.7.9 subprocess convert check_output to dictionary (volumio)

I've been searching a long time, but I can't find an answer.
I'm making a script for volumio on my Raspberry Pi
In the terminal, when I type
volumio status
I get exactly
{
"status": "pause",
"position": 0,
"title": "Boom Boom",
"artist": "France Gall",
"album": "Francegall Longbox",
"albumart": "/albumart?cacheid=614&web=France%20Gall/Francegall%20Longbox/extralarge&path=%2FUSB&metadata=false",
"uri": "/Boom Boom.flac",
"trackType": "flac",
"seek": 21192,
"duration": 138,
"samplerate": "44.1 KHz",
"bitdepth": "16 bit",
"channels": 2,
"random": true,
"repeat": null,
"repeatSingle": false,
"consume": false,
"volume": 100,
"mute": false,
"stream": "flac",
"updatedb": false,
"volatile": false,
"service": "mpd"
}
In Python, I would like to store this in a dictionary.
Since it already has the right formatting, I thought that assigning it to a variable would make it a dictionary right away, as follows:
import subprocess, shlex
cmd = "volumio status | sed -e 's/true/True/g' -e 's/false/False/g' -e 's/null/False/g'"
cmd = shlex.split(cmd)
status = subprocess.check_output(cmd)
print status["volume"]
If what I thought was true I would get "100". Instead, I get this error :
File "return.py", line 7, in <module>
print status["volume"]
TypeError: string indices must be integers, not str
This means "status" is stored as a string. Does anybody know how I can make it a dictionary?
dict() doesn't do it; I get:
ValueError: dictionary update sequence element #0 has length 1; 2 is required

Victory! I was able to make my code work with eval()
import subprocess
status = subprocess.check_output("volumio status | sed -e 's/true/True/g' -e 's/false/False/g' -e 's/null/False/g'", shell=True)
status = eval(status)
print status["volume"]
it returns 100
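The `volumio status` output is JSON, so the standard `json` module can turn it into a dictionary directly, without the `sed` rewriting or `eval()` (which executes whatever text it is given). A minimal sketch; the helper name `parse_status` is hypothetical, and the sample string is a shortened stand-in for the command output:

```python
import json


def parse_status(raw_output):
    """Parse the JSON text printed by `volumio status` into a dict."""
    if isinstance(raw_output, bytes):
        # check_output returns bytes on Python 3 (and str, which is bytes, on Python 2)
        raw_output = raw_output.decode("utf-8")
    return json.loads(raw_output)


# Shortened stand-in for the real command output.
sample = '{"status": "pause", "volume": 100, "mute": false, "repeat": null}'
status = parse_status(sample)
print(status["volume"])  # 100
```

With the real command this would be `parse_status(subprocess.check_output(["volumio", "status"]))`. JSON `true`, `false`, and `null` become `True`, `False`, and `None`, so `null` is no longer folded into `False`.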

Select a particular type of word from a text file and load it in a Variable

I am trying to use a PowerShell script to read the contents of a file and pick a specific type of word from it. I need to load the word that is found into a variable, which I intend to use further downstream.
This is what my input file looks like:
{
"AvailabilityZone": "ap-northeast-1b",
"VolumeType": "gp2",
"VolumeId": "vol-087238f9",
"State": "creating",
"Iops": 100,
"SnapshotId": "",
"CreateTime": "2016-09-15T12:17:27.952Z",
"Size": 10
}
The specific word I would like to pick is vol-xxxxxxxx.
I used this link to write my script
How to pass a variable in the select-string of powershell
This is how I am doing it:
$Filename = "c:\reports\volume.jason"
$regex = "^[vol-][a-z0-9]{8}$"
$newvolumeid=select-string -Pattern $regex -Path $filename > C:\Reports\newVolumeid.txt
$newVolumeid
When I run this script it runs but does not give any response. Seems somehow the output of select string is not loaded into the variable $newvolumeid.
Any idea how to resolve this? Or what I am missing?
PS: The post mentioned above is about 3 years old and doesn't work hence I am reposting.

You are trying to read a property of a JSON object. Instead of using regex, you can parse the JSON and select the property using:
Get-Content 'c:\reports\volume.jason' | ConvertFrom-Json | select -ExpandProperty VolumeId
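If a regex is still wanted, note that `[vol-]` in the question's pattern is a character class matching a single character (`v`, `o`, `l`, or `-`), and the `^`/`$` anchors require the ID to fill the whole line. A minimal sketch of a corrected pattern, shown in Python as an illustration. The same pattern syntax should carry over to `Select-String`, though that is not tested here. The function name is hypothetical:

```python
import re

# Literal "vol-" followed by exactly eight lowercase letters or digits.
VOLUME_ID = re.compile(r"vol-[a-z0-9]{8}")


def find_volume_id(text):
    """Return the first volume ID found in text, or None if there is none."""
    match = VOLUME_ID.search(text)
    return match.group(0) if match else None


line = '    "VolumeId": "vol-087238f9",'
print(find_volume_id(line))  # vol-087238f9
```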

Try this
$Inpath = "E:\tests\test.txt"
$INFile = Get-Content $Inpath
$NeedsTrimming = $INFile.Split(" ") | ForEach-Object {if ($_ -like '*vol-*'){$_}}
$FirstQuote = $NeedsTrimming.IndexOf('"')
$LastQuote = $NeedsTrimming.LastIndexOf('"')
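# Substring takes (start, length), so the length is the distance between the two quotes
$vol = $NeedsTrimming.Substring(($FirstQuote + 1),($LastQuote - $FirstQuote - 1))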
$vol

How to separate package name by regex in bash?

I'm writing a script function to split a listing of package tarball names into package name and version.
xorg-fonts-misc-1.0b-1
Xorg-font-bitstream-75dpi-1.0.0-2.i386
Xorg-font-bitstream-100dpi-1.2a-2.arm
Other-Third-Party-1.2.2-1-any
I'm using the following script to separate name and version.
split_pkgname_pipe() { # split x-x-1.3-1.x -> x-x 1.3-1.x
[ $opt_v != 0 ] && echo "dbg:split_pkgname_pipe $*" >&2
awk '{
f=$0
sub(/\-[0-9].*$/,"")
n=$1
v=substr(f, length(n)+2)
print n, v
}'
}
The problem is that my code splits Xorg-font-bitstream-75dpi-1.0.0 into Xorg-font-bitstream and 75dpi-1.0.0, but I want Xorg-font-bitstream-75dpi and 1.0.0.
[SOLVED]
split_pkgname_pipe() { # split x-x-1.3-1.x -> x-x 1.3-1.x
[ $opt_v != 0 ] && echo "dbg:split_pkgname_pipe $*" >&2
local line namever name ver rel
while read line ; do
namever="${line%-*}"
rel="${line##*-}"
if [ `expr match $rel '[0-9]'` = 0 ] ; then # rel is 'i386/any'...
name="${namever%-*}"
ver="${namever##*-}"
namever="$name"
rel="$ver-$rel"
fi
name="${namever%-*}"
ver="${namever##*-}"
echo "$name $ver-$rel"
done
}

$ package="Xorg-font-bitstream-75dpi-1.0.0"
$ echo "${package%-*}"
Xorg-font-bitstream-75dpi
$ echo "${package##*-}"
1.0.0

Try this
sed -re '/^(.*?)((\d[a-z]?\.)+.*)$/\1\t\2/gmi' file.txt
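A single regex can also do the split if the version can be told apart from name parts such as `75dpi`. A minimal sketch in Python, assuming the version is the first hyphen-separated part that starts with digits followed by a dot (true for the four sample names in the question, but an assumption in general); the function name is hypothetical:

```python
import re

# Assumption: the version is the first "-"-separated part that starts with
# digits followed by a dot (e.g. "1.0b"), so "75dpi" stays in the name.
PACKAGE = re.compile(r"^(.*?)-(\d+\..*)$")


def split_pkgname(package):
    """Split "name-version-release" into (name, "version-release")."""
    match = PACKAGE.match(package)
    if match is None:
        return package, ""
    return match.group(1), match.group(2)


for pkg in ("xorg-fonts-misc-1.0b-1", "Xorg-font-bitstream-75dpi-1.0.0-2.i386"):
    print(split_pkgname(pkg))
```

The non-greedy `(.*?)` makes the name as short as possible, so the split happens at the first part that looks like a version rather than the last.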
