backreferencing in awk gensub with conditional branching

backreferencing in awk gensub with conditional branching - if-statement

I'm referencing to
answer to: GNU awk: accessing captured groups in replacement text
but whith ? Quantifier for regex matching
I would like to make if statement or ternary operator ?: or something more elegant so that if regex group that is backreferenced with \\1 returns nonempty string then, one arbitrary string (\\1 is not excluded) is inserted and if it returns empty string some other arbitrary string is inserted. My example works when capturing group returns nonempty string, but doesn't return expected branch "B" when backreference is empty. How to make conditional branching based on backreferenced values?
echo abba | awk '{ print gensub(/a(b*)?a/, "\\1"?"A":"B", "g", $0)}'

you can do the assignment in the gensub and use the value for the ternary operator afterwards, something like this
... | awk '{ v=gensub(/a(b*)?a/, "\\1", "g", $0); print v?"A":"B"}'

Something like this, maybe?:
$ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< abba
A
$ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< acca
B

The expressions in any arguments you pass to any function are evaluated before the function is called so gensub(/a(b*)?a/, "\\1"?"A":"B", "g", $0) is the same as str=("\\1"?"A":"B"); gensub(/a(b*)?a/, str, "g", $0) which is the same as gensub(/a(b*)?a/, "A", "g", $0).
So you cannot do what you're apparently trying to do with a single call to any function, nor can you call gsub() twice, once with ab+a and then again with aa, or similar without breaking the left-to-right, leftmost-longest order in which such a replacement function would match the regexp against the input if it existed.
It looks like you might be trying to do the following, using GNU awk for patsplit():
awk '
n = patsplit($0,f,/ab*a/,s) {
$0 = s[0]
for ( i=1; i<=n; i++ ) {
$0 = $0 (f[i] ~ /ab+a/ ? "A" : "B") s[i]
}
}
1'
or with any awk:
awk '
{
head = ""
while ( match($0,/ab*a/) ) {
str = substr($0,RSTART,RLENGTH)
head = head substr($0,1,RSTART-1) (str ~ /ab+a/ ? "A" : "B")
$0 = substr($0,RSTART+RLENGTH)
}
$0 = head $0
}
1'
but without sample input/output it's a guess. FWIW given this sample input file:
$ cat file
XabbaXaaXabaX
foo
abbaabba
aabbaabba
bar
abbaaabba
the above will output:
XAXBXAX
foo
AA
BbbBbba
bar
ABbba

Related

awk sub with a capturing group into the replacement

I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
addendum: Missing numbers should be best kept unchanged 1/4/2/3 = apple/pear/orange/3 to prevent loss of info.
Methodology:
Build an associative array key[$1] = $2 for file1
capture all characters between the slashes and replace them by matching to the key of associative array eg key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, results are
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.

You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note that if numbers from file2 don't exist in key array then it will make those fields empty.
file1 FS='/' file2 will keep default field separators for file1 but will use / as field separator while reading file2.

EDIT: In case you don't have a match in file2 from file and you want to keep original value as it is then try following:
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
}
print val
}
' file1 FS="/" file2
With your shown samples please try following.
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val = (val=="" ? "" : val FS) arr[$i]
}
print val
}
' file1 FS="/" file2
Explanation: Reading Input_file1 first and creating array arr with index of 1st field and value of 2nd field then setting field separator as / and traversing through each field os file2 and saving its value in val; printing it at last for each line.

Like #Sundeep comments in the comments, you can't use backreference as an array index. You could mix match and gensub (well, I'm using sub below). Not that this would be anywhere suggested method but just as an example:
$ awk '
NR==FNR {
k[$1]=$2 # hash them
next
}
{
while(match($0,/[0-9]+/)) # keep doing it while it lasts
sub(/[0-9]+/,k[substr($0,RSTART,RLENGTH)]) # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a neverending loop.

With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.

Another approach using awk without loop:
awk 'FNR==NR{
a[$1]=$2;
next
}
$1 in a{
printf("%s%s",FNR>1 ? RS: "",a[$1])
}
END{
print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple

Removing multiple delimiters between outside delimiters on each line

Using awk or sed in a bash script, I need to remove comma separated delimiters that are located between an inner and outer delimiter. The problem is that wrong values ends up in the wrong columns, where only 3 columns are desired.
For example, I want to turn this:
2020/11/04,Test Account,569.00
2020/11/05,Test,Account,250.00
2020/11/05,More,Test,Accounts,225.00
Into this:
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
I've tried to use a few things, testing regex:
But I cannot find a solution to only select the commas in order to remove.

awk -F, '{ printf "%s,",$1;for (i=2;i<=NF-2;i++) { printf "%s ",$i };printf "%s,%s\n",$(NF-1),$NF }' file
Using awk, print the first comma delimited field and then loop through the rest of the field up to the last but 2 field printing the field followed by a space. Then for the last 2 fields print the last but one field, a comma and then the last field.

With GNU awk for the 3rd arg to match():
$ awk -v OFS=, '{
match($0,/([^,]*),(.*),([^,]*)/,a)
gsub(/,/," ",a[2])
print a[1], a[2], a[3]
}' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
or with any awk:
$ awk '
BEGIN { FS=OFS="," }
{
n = split($0,a)
gsub(/^[^,]*,|,[^,]*$/,"")
gsub(/,/," ")
print a[1], $0, a[n]
}
' file
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00

Use this Perl one-liner:
perl -F',' -lane 'print join ",", $F[0], "#F[1 .. ($#F-1)]", $F[-1];' in.csv
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-F',' : Split into #F on comma, rather than on whitespace.
$F[0] : first element of the array #F (= first comma-delimited value).
$F[-1] : last element of #F.
#F[1 .. ($#F-1)] : elements of #F between the second from the start and the second from the end, inclusive.
"#F[1 .. ($#F-1)]" : the above elements, joined on blanks into a string.
join ",", ... : join the LIST "..." on a comma, and return the resulting string.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

perl -pe 's{,\K.*(?=,)}{$& =~ y/,/ /r}e' file
sed -e ':a' -e 's/\(,[^,]*\),\([^,]*,\)/\1 \2/; t a' file
awk '{$1=$1","; $NF=","$NF; gsub(/ *, */,","); print}' FS=, file
awk '{for (i=2; i<=NF; ++i) $i=(i>2 && i<NF ? " " : ",") $i} 1' FS=, OFS= file

awk doesn't support look arounds, we could have it by using match function of awk; using that could you please try following, written and tested with shown samples in GNU awk.
awk '
match($0,/,.*,/){
val=substr($0,RSTART+1,RLENGTH-2)
gsub(/,/," ",val)
print substr($0,1,RSTART) val substr($0,RSTART+RLENGTH-1)
}
' Input_file

Yet another perl
$ perl -pe 's/(?:^[^,]*,|,[^,]*$)(*SKIP)(*F)|,/ /g' ip.txt
2020/11/04,Test Account,569.00
2020/11/05,Test Account,250.00
2020/11/05,More Test Accounts,225.00
(?:^[^,]*,|,[^,]*$) matches first/last field along with the comma character
(*SKIP)(*F) this would prevent modification of preceding regexp
|, provide , as alternate regexp to be matched for modification
With sed (assuming \n is supported by the implementation, otherwise, you'll have to find a character that cannot be present in the input)
sed -E 's/,/\n/; s/,([^,]*)$/\n\1/; y/,/ /; y/\n/,/'
s/,/\n/; s/,([^,]*)$/\n\1/ replace first and last comma with newline character
y/,/ / replace all comma with space
y/\n/,/ change newlines back to comma

A similar answer to Timur's, in awk
awk '
BEGIN { FS = OFS = "," }
function join(start, stop, sep, str, i) {
str = $start
for (i = start + 1; i <= stop; i++) {
str = str sep $i
}
return str
}
{ print $1, join(2, NF-1, " "), $NF }
' file.csv
It's a shame awk doesn't ship with a join function builtin

Dynamically generated regex for gsub not working

I have an input CSV file:
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
I need to create a string out of this as follows:
;5,1,3,5;6,2,6,7;7,4
So each character, except the first which is the value of the field $2, in the substring in between the ; denotes the row number of middle field; for example ;5,1,3,5 means that 5 is at row number 1,3,5.
I've been trying to use awk with gsub, trying to create the string MYSTR dynamically.
The regex inside the gsub is not working. I need a regex that will match ;$3 (the value of $3, which can be a two digit number) and replace it with ;$3,RowNO, if the pattern is not matched then add ;$3 at the end of the string.
This is what I have so far:
awk -F',' '{
print NR, $3;
noofchars=gsub(/;$3/,";"$3","NR,MYSTR);
print noofchars;
if ( noofchars == 1 )
;
else
MYSTR=MYSTR";"$3","NR;
print NR, $3;
print MYSTR;
}
END{print MYSTR;}' $1

The regex doesn't work because $3 isn't interpreted as the field #3 value but is seen as the anchor $ (that matches the end of the line) and a literal 3.
You can do it without gsub:
awk -F, '{a[$2]=a[$2]","NR}END{for (i in a){printf(";%d%s",i,a[i])}}'

Input
$ cat file
1,5,1
1,6,2
1,5,3
1,7,4
1,5,5
1,6,6
1,6,7
Output
$ awk -F, '{gsub(/[ ]+/,"",$3);a[$2] = ($2 in a ? a[$2]:$2) FS $3 }END{for(i in a)printf("%s%s",";",a[i]); print ""}' file
;5,1,3,5;6,2,6,7;7,4
Better Readable version
awk -F, '
{
gsub(/[ ]+/,"",$3); # suppress space char in third field
a[$2] = ($2 in a ? a[$2]:$2) FS $3 # array a where index being field2 and value will be field3, if index exists before append string with existing value
}
END{
for(i in a) # loop through array a and print values
printf("%s%s",";",a[i]);
print ""
}
' file

#vsshekhar: Try following too: It will provide you values in the correct same order which Input_file ($2) are coming.
awk -F, '{A[++i]=$2;B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR} END{for(j=1;j<=i;j++){if(B[A[j]]){printf(";%s,%s",A[j],B[A[j]]);delete B[A[j]]}};print ""}' Input_file
Adding a non-one liner form of solution too now.
awk -F, '{
A[++i]=$2;
B[A[i]]=B[A[i]]?B[A[i]] "," FNR:FNR
}
END{
for(j=1;j<=i;j++){
if(B[A[j]]){
printf(";%s,%s",A[j],B[A[j]]);
delete B[A[j]]
}
};
print ""
}
' Input_file

Creating matching brackets- awk :sed

I have a data set that has three patterns:
First:
abrasion abrade:stem<>ion:suffix
abstainer abstain:stem<>er:suffix
abstention abstain:stem<>ion:suffix
Second:
inaccurate in:prefix<>accurate:stem
inactive in:prefix<>active:stem
Third:
incommunicable in:prefix<>communicate:stem<>able:suffix
incompatibility in:prefix<>compatible:stem<>ity:suffix
I need to convert the above to following form : Matching the brackets in the way for Penn Tree Bank (http://languagelog.ldc.upenn.edu/myl/PennTreebank1995.pdf)
First:
abrasion ((abrade:stem) ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
Second:
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
Third:
incommunicable (in:prefix ((communicate:stem)able:suffix))
incompatibility (in:prefix ((compatible:stem)ity:suffix))
The code, I am working is using awk
{
n = gsub(/<>/,")",$2)
s = sprintf("%*s",n,"")
gsub(/ /,"(",s)
print "(" $1, s "((" $2 "))"
}
EDIT
More complex forms
nationalistic national: stem <>ism:suffix<>ist:suffix<>ic:suffix
to:
nationalistic ((((national: stem) ism:suffix)ist:suffix)ic:suffix)
It is not producing the expected outputs that mentioned in the examples.

This should be general enough as it takes into account :stem, :prefix, and :suffix for matching:
awk 'BEGIN{FS=OFS="\n"}{
a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
b=gensub(/(\([a-zA-Z]*:stem\))<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
c=gensub(/([a-zA-Z]*:prefix)<>(.*)/,"(\\1\\2)", "g", b);
print c;}' testfile
Demo here: https://ideone.com/U3ux91
EDIT
This should take care of multiple suffixes and prefixes:
awk 'BEGIN{FS=OFS="\n"}{
a=gensub(/([a-zA-Z]*):stem/,"(\\1:stem)", "g");
while ( a ~ /stem)<>.*:suffix/) {
a=gensub(/(\([a-zA-Z]*:stem\).*?)<>([a-zA-Z]*):suffix/,"(\\1\\2:suffix)", "g", a);
}
while ( a ~ /<>/) {
a=gensub(/([a-zA-Z]*?:prefix)<>(.*)/,"(\\1\\2)", "g", a);
}
print a;}' test
Demo here: https://ideone.com/U7LYXi
(sorry if antinationalistic is not a word, but for testing sake....)

The expected output for pattern 1 may have problem, the brackets are not paired, I guess it was typos and it should be:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
I make this awk script:
awk -v d="<>" '{$2="("$2")"}
$1~/^ab/{sub(d,")",$2);$2="(" $2}
$1~/^ina/{sub(d,"(",$2);$2=$2")"}
$1~/^inc/{sub(d,"((",$2);sub(d,")",$2);$2=$2")"}7' file
with your 3 patterns example in same file, it gives:
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))

awk -F'<>| ' -v OFS= '{
$1 = $1 " "
for (i=2; i<=NF; i++) {
if ($i ~ /prefix$/) { $i = "(" $i; $NF = $NF ")" }
if ($i ~ /stem\)?$/) { stem = i; $i = "(" $i ")" }
if ($i ~ /suffix\)?$/) { $i = $i ")"; $stem = "(" $stem } }
} { print }'

awk to the rescue!
$ awk 'function wrap(v) {return "("v")"; }
{n=split($2,a,"<>");
if(n==3) w=wrap(a[1] wrap(wrap(a[2]) a[3]));
else if(a[1]~/:prefix/) w=wrap(a[1] wrap(a[2]));
else w=wrap(wrap(a[1]) a[2]);
print $1, w}' stems
abrasion ((abrade:stem)ion:suffix)
abstainer ((abstain:stem)er:suffix)
abstention ((abstain:stem)ion:suffix)
inaccurate (in:prefix(accurate:stem))
inactive (in:prefix(active:stem))
incommunicable (in:prefix((communicate:stem)able:suffix))
incompatibility (in:prefix((compatible:stem)ity:suffix))

Awk find and replace for exact match only

If I'd like to replace a character field, say {, with awk I can use:
awk '{ gsub(/{/, "<"); print }' file
...but this will also replace a field such as "{" (which I don't want). Is there an awk function which will find only an exact match (and replace) of an entire field; for all fields.
For example, the following:
$ echo "foo bar zod \"{\" {" | awk '{ gsub(/{/, "<"); print }'
will output:
foo bar zod "<" <
but I'd like it to output:
foo bar zod "{" <
I could also explicitly iterate over the fields and use == to check for an exact match, but I wonder if there's an alternative.

I would do what you said, loop through all field, either checking with == or /^{$/.
However if we play some trick, it could be done without loop: (gnu awk)
awk '$0=gensub(/(\s|^){(\s|$)/, "\\1<\\2","g")'
check this example:
kent$ echo '{ foo "{" and this: { bar {'|awk '$0=gensub(/(\s|^){(\s|$)/, "\\1<\\2","g")'
< foo "{" and this: < bar <
In the example above, 3 of 4 { were substituted.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

backreferencing in awk gensub with conditional branching - if-statement

you can do the assignment in the gensub and use the value for the ternary operator afterwards, something like this ... | awk '{ v=gensub(/a(b*)?a/, "\\1", "g", $0); print v?"A":"B"}'

Something like this, maybe?: $ gawk '{ print gensub(/a(.)a/, (match($0,/a(b)?a/)?"A":"B"), "g", $0)}' <<< abba A $ gawk '{ print gensub(/a(.)a/, (match($0,/a(b)?a/)?"A":"B"), "g", $0)}' <<< acca B

Related

awk sub with a capturing group into the replacement

Removing multiple delimiters between outside delimiters on each line

Dynamically generated regex for gsub not working

Creating matching brackets- awk :sed

Awk find and replace for exact match only

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

backreferencing in awk gensub with conditional branching - if-statement

you can do the assignment in the gensub and use the value for the ternary operator afterwards, something like this ... | awk '{ v=gensub(/a(b*)?a/, "\\1", "g", $0); print v?"A":"B"}'

Something like this, maybe?: $ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< abba A $ gawk '{ print gensub(/a(.*)a/, (match($0,/a(b*)?a/)?"A":"B"), "g", $0)}' <<< acca B

Related

awk sub with a capturing group into the replacement

Removing multiple delimiters between outside delimiters on each line

Dynamically generated regex for gsub not working

Creating matching brackets- awk :sed

Awk find and replace for exact match only

Categories

Resources

Something like this, maybe?: $ gawk '{ print gensub(/a(.)a/, (match($0,/a(b)?a/)?"A":"B"), "g", $0)}' <<< abba A $ gawk '{ print gensub(/a(.)a/, (match($0,/a(b)?a/)?"A":"B"), "g", $0)}' <<< acca B