How to use a capture variable as a field name in jq?

I'm trying to use jq to automate changing i18n string files from the format taken by one library to another.
I have a JSON file which looks like this:
{
  "some_label": {
    "message": "a string in English with a $VARIABLE$",
    "description": "directions to translators",
    "placeholders": {
      "VARIABLE": {
        "content": "{variable}"
      }
    }
  },
  // more of the same...
}
And I need to turn that into "some-label": "a string in English with a {variable}"
I am pretty close to getting it. Currently, I'm using
jq '[.
| to_entries
| .[]
| .key |= (gsub("_";"-"))
| .value.placeholders as $p
| .value.message |= (sub("\\$KEY_NAME\\$";$p.KEY_NAME.content))
| .value = .value.message
] | from_entries'
The next step is to use a capture group in the sub call so I can programmatically get variables with different names, but I'm not sure how to use the capture group to index into $p.
I've tried sub("\\$(?<id>VARIABLE)\\$";$p.(.id).content) which gave a compiler error, and I'm pretty much stuck on what to try next.

Here is one way of achieving the desired result; it could be simplified further. At the top level it replaces the to_entries/from_entries pair by wrapping the whole filter in with_entries() and modifying the .value field as required:
with_entries(
  .key |= ( gsub("_";"-") ) |
  .value.placeholders as $p |
  .value.message as $m |
  ( $m | match(".*\\$(.*)\\$") | .captures[0].string ) as $c |
  ( $p | .[$c].content ) as $v |
  ( "\\$" + $c + "\\$" ) as $t |
  .value = ( $m | sub($t; $v) )
)
The key parts of the expression are:
The part $m | match(".*\\$(.*)\\$") | .captures[0].string runs a regex match to extract the name between the $..$ delimiters in .message.
The part $p | .[$c].content does a generic object index lookup using the dynamic value of $c.
Because the first argument of sub()/gsub() is a regex, the captured value $c has to be rebuilt as \\$VARIABLE\\$ so that the dollar signs are matched literally.
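For readers who want to see the same idea outside jq, here is a minimal Python sketch of using a captured name as a dynamic key. It is an illustration, not a translation of the filter above: the function name flatten_messages and the choice to leave unknown placeholders untouched are assumptions made for this example.

```python
import re

def flatten_messages(catalog):
    """Turn {"some_label": {"message": ..., "placeholders": ...}} into {"some-label": "..."}."""
    result = {}
    for key, entry in catalog.items():
        placeholders = entry.get("placeholders", {})

        def lookup(match):
            # match.group(1) is the name between the dollar signs, e.g. "VARIABLE".
            name = match.group(1)
            if name in placeholders:
                return placeholders[name]["content"]
            # Assumption: unknown placeholders are left as they are.
            return match.group(0)

        # The captured group is used as a dictionary key inside the callback.
        result[key.replace("_", "-")] = re.sub(r"\$([^$]+)\$", lookup, entry["message"])
    return result

catalog = {
    "some_label": {
        "message": "a string in English with a $VARIABLE$",
        "description": "directions to translators",
        "placeholders": {"VARIABLE": {"content": "{variable}"}},
    }
}
print(flatten_messages(catalog))
```

Because the lookup happens in a callback, every placeholder in a message is resolved by its own captured name, which is the step the question was stuck on.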

Here's a basic jq filter. I haven't tried it with complex inputs and haven't accommodated for $. I guess you can build on top of this:
to_entries | map(. as $kv | { "\($kv.key)": $kv.value.placeholders | to_entries | map(. as $p | $kv.value.message | sub("\\$\($p.key)\\$"; $p.value.content))[0]}) | add
output -
{
"some_label": "a string in English with a {variable}"
}

Related

What is the maximum length of a regex expression?

In PostgreSQL, I want to exclude rows if the desc field contains any forbidden words.
items:
| id | desc |
|----|------------------|
| 1 | apple foo cat bar|
| 2 | foo bar |
| 3 | foocatbar |
| 4 | foo dog bar |
The forbidden words list is stored in another table, currently it has 400 words to check.
forbidden_word_table:
| word |
|---------|
| apple |
| boy |
| cat |
| dog |
| .... |
SQL query:
select id, desc
from items
where
desc !~* (select '\y(' || string_agg(word, '|') || ')\y' from forbidden_word_table)
I am checking if desc does not match the regex expression:
desc !~* '\y(apple|boy|cat|dog|.............)\y'
Results:
| id | desc |
|----|------------------|
| 2 | foo bar |
| 3 | foocatbar |
** 3rd is not excluded since cat is not a single word
My forbidden_word_table will likely grow by many rows, so the regex above will become a very long expression.
Do regular expressions have a maximum length limit (in bytes or characters)? I'm afraid my regex matching approach will stop working if forbidden_word_table keeps growing.
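It seems that Wiktor Stribiżew is right about "catastrophic backtracking".
I'd suggest using ILIKE and ANY instead. Note that this matches substrings rather than whole words, so row 3 (foocatbar) would be excluded as well: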
SELECT *
FROM items i
WHERE NOT i."desc" ILIKE ANY
(
SELECT '%' || word || '%'
FROM forbidden_word_table
);
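The difference between the whole-word regex from the question and the substring test used by ILIKE ANY can be illustrated with a small Python sketch. The data is copied from the question; Python's \b is only assumed to be a close enough stand-in for PostgreSQL's \y here, and the variable names are made up for the example.

```python
import re

items = {1: "apple foo cat bar", 2: "foo bar", 3: "foocatbar", 4: "foo dog bar"}
forbidden = ["apple", "boy", "cat", "dog"]

# Whole-word check, analogous to desc !~* '\y(apple|boy|cat|dog)\y'.
# re.escape guards against words that contain regex metacharacters.
word_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, forbidden)) + r")\b", re.IGNORECASE
)
whole_word_ok = [i for i, text in items.items() if not word_pattern.search(text)]

# Substring check, analogous to NOT desc ILIKE ANY ('%word%').
substring_ok = [
    i for i, text in items.items()
    if not any(w.lower() in text.lower() for w in forbidden)
]

print(whole_word_ok)  # [2, 3]
print(substring_ok)   # [2]
```

Row 3 survives only the whole-word check, so the two approaches are not interchangeable if "foocatbar" must be kept.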

Reading list style text file into powershell array

I am provided a list of string blocks in a text file, and I need this to be in an array in PowerShell.
The list looks like this
a:1
b:2
c:3
d:
e:5
[blank line]
a:10
b:20
c:30
d:
e:50
[blank line]
...
and I want this in a PowerShell array to work with it further.
I'm using
$output = @()
Get-Content ".\Input.txt" | ForEach-Object {
    $splitline = ($_).Split(":")
    if($splitline.Count -eq 2) {
        if($splitline[0] -eq "a") {
            #Write-Output "New Block starting"
            $output += ($string)
            $string = "$($splitline[1])"
        } else {
            $string += ",$($splitline[1])"
        }
    }
}
Write-Host $output -ForegroundColor Green
$output | Export-Csv ".\Output.csv" -NoTypeInformation
$output | Out-File ".\Output.txt"
But this whole thing feels quite cumbersome, and the output is not a CSV file, which I think is because of the way I use the array. Out-File does produce a file that contains rows separated by commas.
Maybe someone can give me a push in the right direction.
Thx
One solution is to convert your data to an array of hash tables that can be read into a custom object. Then the output array object can be exported, formatted, or read as required.
$hashtables = (Get-Content Input.txt) -replace '(.*?):','$1=' | ConvertFrom-StringData
# Empty template object that carries every property name found in the data
$ObjectShell = "" | Select-Object ($hashtables.keys | Select-Object -Unique)
$output = foreach ($hashtable in $hashtables) {
    # Copy the template so each row is an independent object
    $obj = $ObjectShell.psobject.Copy()
    foreach ($n in $hashtable.GetEnumerator()) {
        $obj.($n.key) = $n.value
    }
    $obj
}
$output
$output | Export-Csv Output.csv -NoTypeInformation
Explanation:
The first colons (:) on each line are replaced with =. That enables ConvertFrom-StringData to create an array of hash tables with values on the LHS of the = being the keys and values on the RHS of the = being the values. If you know there is only one : on each line, you can make the -replace operation simpler.
$ObjectShell is just an object with all of the properties your data presents. You need all of your properties present for each line of data whether or not you assign values to them. Otherwise, your CSV output or table view within the console will have issues.
The first foreach iterates through the $hashtables array. Then we need to enumerate through each hash table to find the keys and values, which is performed by the second foreach loop. Each key/value pair is stored as a copy of $ObjectShell. The .psobject.Copy() method is used to prevent references to the original object. Updating data that is a reference will update the data of the original object.
$output contains the array of objects of all processed data.
Usability of output:
# Console Output
$output | format-table
a  b  c  d e
-  -  -  - -
1
   2
      3

           5

10
   20
      30

           50
# Convert to CSV
$output | ConvertTo-Csv -NoTypeInformation
"a","b","c","d","e"
"1",,,,
,"2",,,
,,"3",,
,,,"",
,,,,"5"
,,,,
"10",,,,
,"20",,,
,,"30",,
,,,"",
,,,,"50"
# Accessing Properties
$output.b
2
20
$output[0],$output[1]
a : 1
b :
c :
d :
e :
a :
b : 2
c :
d :
e :
Alternative Conversion:
$output = ((Get-Content Input.txt -raw) -split "(?m)^\r?\n") | Foreach-Object {
    $data = $_ -replace "(.*?):(.*?)(\r?\n)",'"$1":"$2",$3'
    $data = $data.Remove($data.LastIndexOf(','),1)
    ("{1}`r`n{0}`r`n{2}" -f $data,'{','}') | ConvertFrom-Json
}
$output | ConvertTo-Csv -NoType
Alternative Explanation:
Since ConvertFrom-StringData does not guarantee hash table key order, this alternative readies the file for a JSON conversion. This will maintain the property order listed in the file provided each group's order is the same. Otherwise, the property order of the first group will be respected.
All properties and their respective values are divided by the first : character on each line. The property and value are each surrounded by double quotes. Each property line is separated by a ,. Then finally the opening { and closing } are added. The resulting JSON-formatted string is converted to a custom object.
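The block-wise idea described above (split on blank lines, split each line at the first colon, keep the property order of the first group) can be sketched in Python as follows. This is only an illustration under the assumption that every block carries the same keys; parse_blocks and to_csv are names invented for this example.

```python
import csv
import io

def parse_blocks(text):
    """Split text into blank-line-separated blocks and turn each block into a dict."""
    records = []
    for block in text.replace("\r\n", "\n").split("\n\n"):
        record = {}
        for line in block.splitlines():
            if ":" not in line:
                continue
            key, value = line.split(":", 1)  # split at the first colon only
            record[key] = value
        if record:
            records.append(record)
    return records

def to_csv(records):
    """Write the records as CSV, using the key order of the first block as the header."""
    fieldnames = list(records[0])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

sample = "a:1\nb:2\nc:3\nd:\ne:5\n\na:10\nb:20\nc:30\nd:\ne:50\n"
records = parse_blocks(sample)
print(to_csv(records))
```

Each blank-line-separated block becomes one row, so the sample yields a header line plus two data rows.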
You can split by \n newline, see example:
$text = @"
a:1
b:2
c:3
d:
e:5
a:10
b:20
c:30
d:
e:50
e:50
e:50
e:50
"@
$Array = $text -split '\n' | ? {$_}
$Array.Count
15
If you want to exclude the empty lines, add ? {$_}.
With your example:
$Array = (Get-Content ".\Input.txt") -split '\n' | ? {$_}

Powershell Regex Grouping from Select-String

I am scraping a web request response to pull information held within HTML code. The pattern repeats a few times, so I am using Select-String rather than -match. My code looks like:
$regexname = '\<h2 class="ener"\>(\w{0,11}).{1,10}\</h2\>'
$energenie.RawContent | select-string $regexname -AllMatches | % {$_.matches}
The return looks something like:
Groups : {<h2 class="ener">TV </h2>, TV}
Success : True
Captures : {<h2 class="ener">TV </h2>}
Index : 1822
Length : 33
Value : <h2 class="ener">TV </h2>
Groups : {<h2 class="ener">PS3 </h2>, PS3}
Success : True
Captures : {<h2 class="ener">PS3 </h2>}
Index : 1864
Length : 33
Value : <h2 class="ener">PS3 </h2>
I can't work out a way to grab the second element of Groups, e.g. TV or PS3, as:
$energenie.RawContent | select-string $regexname -AllMatches | % {$_.matches.groups}
Gives a strange output
Paul
This should work:
$energenie.RawContent | select-string $regexname -AllMatches | ForEach-Object { $_.Matches | ForEach-Object { Write-Host $_.Groups[1].Value } }
To get the second item in a collection, use the array index operator: [n] where n is the index from which you want a value.
For each entry in Matches, you want the second entry in the Groups property so that would be:
$MyMatches = $energenie.RawContent | select-string $regexname -AllMatches | % {$_.Matches}
$SecondGroups = $MyMatches | % {$_.Groups[1]}
To get just the captured value, use the Value property:
$MyMatches | % { $_.Groups[1].Value }
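The same "iterate over every match, then take group 1" pattern looks like this in Python. This is a sketch only: the HTML sample and the pattern are assumptions reconstructed from the output shown in the question, not the asker's actual page or regex.

```python
import re

# Assumed sample input, modelled on the Value fields shown in the question.
html = '<h2 class="ener">TV </h2>\n<h2 class="ener">PS3 </h2>'

# Group 0 is the whole match; group 1 is the first parenthesised capture.
pattern = re.compile(r'<h2 class="ener">(\w{0,11}).{1,10}</h2>')

# finditer yields one match object per occurrence, like -AllMatches.
names = [m.group(1) for m in pattern.finditer(html)]
print(names)  # ['TV', 'PS3']
```

Indexing the group on each match separately, rather than on the combined collection, is what keeps the second and later matches from being lost.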

Replace similar strings in a file in place

I have a file with the following types of pairs of strings:
Call Stack: [UniqueObject1] | [UnOb2] | [SuspectedObject1] | [SuspectedObject2] | [SuspectedObject3] | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
Call Stack: [UniqueObject1] | [UnOb2] | 0x28798765 | 0x18793765 | 0x48792767 | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
There are many such pairs that occur in the file.
The attributes of this pair are that the first part of the pair has "SuspectedObject1","SuspectedObject2" and so on, which in the second part of the pair are replaced by HEX-VALUES of the address of those objects.
What I want to do is, remove all the second part of the pairs.
Please note the pairs do not occur in any specific order and might be separated by many lines in between.
I plan to iterate through each line of this file, if I see a hex-string given as an address instead of a suspected object, I would want to start comparing the following regex
Call Stack: [UniqueObject1] | [UnOb2] | * | * | * | [UnOb3] | [UnOb4] | [UnOb5] | ... end till unique objects
in the whole file and if a string does match, I want to remove this specific line from the file.
Can someone suggest a shell way to do this?
If I have understood your question correctly, you may need to use awk. Run like:
awk -f script.awk file file
Contents of script.awk:
BEGIN {
FS=" \\| "
}
FNR==NR {
$3=$4=$5=""
a[$0]++
next
}
$3 ~ /^0x[0-9A-Fa-f]{8}$/ {
r=$0
$3=$4=$5=""
if (a[$0]<2) {
print r
}
next
}1
Alternatively, here's the one-liner:
awk -F ' \\| ' 'FNR==NR { $3=$4=$5=""; a[$0]++; next } $3 ~ /^0x[0-9A-Fa-f]{8}$/ { r=$0; $3=$4=$5=""; if (a[$0]<2) print r; next }1' file{,}
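The two-pass logic of the awk script can be sketched in Python to make the steps easier to follow. It assumes, as the awk version does, that the suspected objects always sit in fields 3 to 5 of a " | "-separated line; mask and drop_hex_duplicates are hypothetical names for this example.

```python
import re
from collections import Counter

HEX = re.compile(r"^0x[0-9A-Fa-f]{8}$")

def mask(line):
    """Blank out fields 3-5 so both halves of a pair collapse to the same key."""
    fields = line.split(" | ")
    fields[2:5] = ["", "", ""]
    return " | ".join(fields)

def drop_hex_duplicates(lines):
    # First pass: count how often each masked line occurs in the whole file.
    counts = Counter(mask(line) for line in lines)
    kept = []
    # Second pass: drop hex lines whose masked form occurs at least twice.
    for line in lines:
        fields = line.split(" | ")
        is_hex = len(fields) > 2 and HEX.match(fields[2]) is not None
        if is_hex and counts[mask(line)] >= 2:
            continue
        kept.append(line)
    return kept

lines = [
    "Call Stack: [A] | [B] | [S1] | [S2] | [S3] | [C]",
    "unrelated line",
    "Call Stack: [A] | [B] | 0x28798765 | 0x18793765 | 0x48792767 | [C]",
    "Call Stack: [X] | [Y] | 0x11111111 | 0x22222222 | 0x33333333 | [Z]",
]
for line in drop_hex_duplicates(lines):
    print(line)
```

As in the awk version, a hex line is kept when no other line shares its masked form, and two hex lines with the same masked form would both be dropped even without a symbolic twin.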

How to change case of back references?

I'm trying to modify a back reference in PowerShell but am having no luck :(
This is my example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper()) | `$1 |"
If I run it I get this:
| Jane Doe | 456 |
But I'm really expecting this:
| JANE DOE | 456 |
If I run the following (the same as above but without the '()' on the call to ToUpper):
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"`$2`".ToUpper) | `$1 |"
I get this:
| string ToUpper(), string ToUpper(System.Globalization.CultureInfo culture) | 456 |
So it would appear that PowerShell knows that the back reference '$2' is a string but why can't I get PowerShell to convert it to upper case?
Terry
[Regex]::Replace('456,Jane Doe',
'^(\d{3}),(.*)$',
{
param($m)
'| ' + $m.Groups[2].Value.ToUpper() + ' | ' + $m.Groups[1].Value + ' |'
}
)
Not very pretty, I admit. Sadly, you cannot use script blocks as the replacement in the -replace operator in Windows PowerShell (PowerShell 6.1 and later do accept one).
Just to explain what is happening: in "| $(`"`$2`".ToUpper()) | `$1 |", PowerShell evaluates the $(...) subexpression before passing the string to the -replace operator, rather than after the replace operation has occurred.
In other words, ToUpper is called on the string value $2, resulting in | $2 | $1 | being used for the replace operation. You can see this by including a letter in the subexpression string, for example:
"456,Jane Doe" -replace '^(\d{3}),(.*)$',"| $(`"zz `$2`".ToUpper()) | `$1 |"
This has an effective replace string of | ZZ $2 | $1 |, giving | ZZ Jane Doe | 456 | as the result.
Similarly, the second version omitting the parentheses, "| $(`"`$2`".ToUpper) | `$1 |", is evaluated as "some string".ToUpper, which puts the array of overload definitions for the ToUpper method on System.String in the replace string.
To keep the replace operation as a one-liner, Joey's answer using the MatchEvaluator overload to Regex.Replace works well. Or you might do the string formatting yourself based on the results of a -match:
if( '456,Jane Doe' -match '^(\d{3}),(.*)$' ) {
'| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
}
If this needs to be replaced in the context of a larger string, you can always do a literal replace to get the final result:
PS> $r = '| {0} | {1} |' -f $matches[2].ToUpper(),$matches[1]
PS> 'A longer string with 456,Jane Doe in it.'.Replace( $matches[0], $r )
A longer string with | JANE DOE | 456 | in it.
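The evaluation-order point is not specific to PowerShell. As a rough Python analogue (a sketch, not a PowerShell equivalent), upper-casing the replacement template happens before the substitution and changes nothing, whereas a callback runs once per match and sees the captured text:

```python
import re

PATTERN = r"^(\d{3}),(.*)$"
text = "456,Jane Doe"

# Pitfall: .upper() runs on the template itself, before any matching happens.
# The template contains no letters, so it is passed to re.sub unchanged.
eager = re.sub(PATTERN, r"| \2 | \1 |".upper(), text)

# Callback: the function receives the match object, so the captured
# group can be transformed after the match has been made.
lazy = re.sub(
    PATTERN,
    lambda m: f"| {m.group(2).upper()} | {m.group(1)} |",
    text,
)

print(eager)  # | Jane Doe | 456 |
print(lazy)   # | JANE DOE | 456 |
```

The callback plays the same role as the MatchEvaluator script block passed to [Regex]::Replace above.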