Regex with awk or gawk

Regex with awk or gawk - regex

I'm a beginner user of awk/gawk.
If I run below, the shell gives me nothing. Please help!
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk 'BEGIN{
n = split($0, arr, /,(?=\\w+=)/)
for (x=1; x<n; x++) printf "arr[%d]=%s\n", x, arr[x]
}'
.....................................................
I am trying to parse:
A=1,B=2,3,C=,D=5,6,E=7,8,9
Expected Output:
A=1
B=2,3
C=
D=5,6
E=7,8,9
I bet there's something wrong with my awk.

gawk doesn't support look-ahead.
if you want gawk to parse it as you expected, try this:
awk '{n=split(gensub(/,([A-Z])/, " \\1","g" ),arr," ");for(x=1;x<=n;x++)print arr[x]}'
test with your example:
kent$ echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk '{n=split(gensub(/,([A-Z])/, " \\1","g" ),arr," ");for(x=1;x<=n;x++)print arr[x]}'
A=1
B=2,3
C=
D=5,6
E=7,8,9

This might be easier with sed:
$ echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" | sed 's/,\(\w\+=\)/\n\1/g'
A=1
B=2,3
C=
D=5,6
E=7,8,9

If you are using gnu awk, you could do:
awk '{printf $0 "\n" substr( RT, 2 )}' RS=,[A-Z]

As nhahtdh, theres is no lookahead in awk... But you can use a different separator for the assignments. Why not "A=1;B=2,3,4;C=5..."?
If your input must have that format, try flex...

You could also use comma as the record separator:
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" |
awk -v RS=, '{sep=","} /=/ {sep="\n"} NR==1 {sep=""} {printf "%s%s", sep, $0}'
outputs
A=1
B=2,3
C=
D=5,6
E=7,8,9

You have two problems. First, you don't want a BEGIN clause; you just want this to run on every input line. Second, you are trying to use regular expression features that AWK does not support.
Instead of trying to use a fancy pattern that splits the string, loop and call match() to parse out the features you want.
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk '
{
line = $0
for (i = 0;;)
{
i = match(line, /([A-Z]+)=([0-9,]*)(,|$)/, arr)
if (0 == i)
break
key = arr[1]
value = arr[2]
l = length(key "=" value ",") + 1
line = substr(line, l)
printf "DEBUG: key '%s' value '%s'\n", key, value
}
}'
This prints:
DEBUG: key A value 1
DEBUG: key B value 2,3
DEBUG: key C value
DEBUG: key D value 5,6
DEBUG: key E value 7,8,9

Other way using awk
awk '{print gensub(/,([A-Z]+=)/, "\n\\1","g")}' temp.txt
Output
A=1
B=2,3
C=
D=5,6
E=7,8,9

Related

awk sub with a capturing group into the replacement

I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
addendum: Missing numbers should be best kept unchanged 1/4/2/3 = apple/pear/orange/3 to prevent loss of info.
Methodology:
Build an associative array key[$1] = $2 for file1
capture all characters between the slashes and replace them by matching to the key of associative array eg key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, results are
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.

You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note that if numbers from file2 don't exist in key array then it will make those fields empty.
file1 FS='/' file2 will keep default field separators for file1 but will use / as field separator while reading file2.

EDIT: In case you don't have a match in file2 from file and you want to keep original value as it is then try following:
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
}
print val
}
' file1 FS="/" file2
With your shown samples please try following.
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val = (val=="" ? "" : val FS) arr[$i]
}
print val
}
' file1 FS="/" file2
Explanation: Reading Input_file1 first and creating array arr with index of 1st field and value of 2nd field then setting field separator as / and traversing through each field os file2 and saving its value in val; printing it at last for each line.

Like #Sundeep comments in the comments, you can't use backreference as an array index. You could mix match and gensub (well, I'm using sub below). Not that this would be anywhere suggested method but just as an example:
$ awk '
NR==FNR {
k[$1]=$2 # hash them
next
}
{
while(match($0,/[0-9]+/)) # keep doing it while it lasts
sub(/[0-9]+/,k[substr($0,RSTART,RLENGTH)]) # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a neverending loop.

With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.

Another approach using awk without loop:
awk 'FNR==NR{
a[$1]=$2;
next
}
$1 in a{
printf("%s%s",FNR>1 ? RS: "",a[$1])
}
END{
print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple

AWK - add value based on regex

I have to add the numbers returned by REGEX using awk in linux.
Basically from this file:
123john456:x:98:98::/home/john123:/bin/bash
I have to add the numbers 123 and 456 using awk.
So the result would be 579
So far I have done the following:
awk -F ':' '$1 ~ VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'match($1, VAR=/[0-9].*?:/) ; {print VAR}' /etc/passwd
And from what I've seen match doesn't support this at all.
Does someone has any idea?
UPDATE:
it also should work for
john123 result - > 123
123john result - > 123

$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
With your updated requirements:
$ cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
123
123

With gawk and for the given example
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); print a}' inputFile | bc
would do the job.
More general:
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); a=gensub(/^+/,"","g",a); a=gensub(/+$/,"","g",a); print a}' inputFile | bc
The regex-part replaces all sequences of letters with '+' (e.g., '12johnny34' becomes 12+34). Finally, this mathematical operation is evaluated by bc.
(The be safe, I remove leading and trailing '+' sings by ^+ and +$)

You may use
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' /etc/passwd
See online awk demo
s="123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash"
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' <<< "$s"
Output:
579
123
Details
-F ':' - records are split into fields with : char
n=split($1, a, /[^0-9]+/) - gets Field 1 and splits into digit only chunks saving the numbers in a array and the n var contains the number of these chunks
b=0 - b will hold the sum
for (i=1;i<=n;i++) { b += a[i]; } - iterate over a array and sum the values
print b - prints the result.

I used awk's split() to separate the first field on any string not containing numbers.
split(string, target_array, [regex], [separator_array]*)
*separator_array requires gawk
$ awk -F: '{split($1, A, /[^0-9]+/, S); print S[1], A[1]+A[2]}' <<EOF
123john456:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
EOF
john 579
john 123

You can use [^0-9]+ as a field separator, and :[^\n]*\n as a record separator instead:
awk -F '[^0-9]+' 'BEGIN{RS=":[^\n]*\n"}{print $1+$2}' /etc/passwd
so that given the content of /etc/passwd being:
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
This outputs:
579
123
123

You can try Perl also
$ cat johnny.txt
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ perl -F: -lane ' $_=$F[0]; $sum+= $1 while(/(\d+)/g); print $sum; $sum=0 ' johnny.txt
579
123
123
$

Here is another awk variant that adds all the numbers present in first field separated by ::
cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
1j2o3h4n5:x:98:98::/home/john123:/bin/bash
awk -F '[^0-9:]+' '{s=0; for (i=1; i<=NF; i++) {s+=$i; if ($i~/:$/) break} print s}' file
579
123
123
15

How to reverse all the words in a file with bash in Ubuntu?

I would like to reverse the complete text from the file.
Say if the file contains:
com.e.h/float
I want to get output as:
float/h.e.com
I have tried the command:
rev file.txt
but I have got all the reverse output: taolf/h.e.moc
Is there a way I can get the desired output. Do let me know. Thank you.
Here is teh link of teh sample file: Sample Text

You can use sed and tac:
str=$(echo 'com.e.h/float' | sed -E 's/(\W+)/\n\1\n/g' | tac | tr -d '\n')
echo "$str"
float/h.e.com
Using sed we insert \n before and after all non-word characters.
Using tac we reverse the output lines.
Using tr we strip all new lines.
If you have gnu-awk then you can do all this in a single awk command using 4 argument split function call that populates split strings and delimiters separately:
awk '{
s = ""
split($0, arr, /\W+/, seps)
for (i=length(arr); i>=1; i--)
s = s seps[i] arr[i]
print s
}' file
For non-gnu awk, you can use:
awk '{
r = $0
i = 0
while (match(r, /[^a-zA-Z0-9_]+/)) {
a[++i] = substr(r, RSTART, RLENGTH) substr(r, 0, RSTART-1)
r = substr(r, RSTART+RLENGTH)
}
s = r
for (j=i; j>=1; j--)
s = s a[j]
print s
}' file

Is it possible to use Perl?
perl -nlE 'say reverse(split("([/.])",$_))' f
This one-liner reverses all the lines of f, according to PO's criteria.
If prefer a less parentesis version:
perl -nlE 'say reverse split "([/.])"' f

For portability, this can be done using any awk (not just GNU) using substrings:
$ awk '{
while (match($0,/[[:alnum:]]+/)) {
s=substr($0,RLENGTH+1,1) substr($0,1,RLENGTH) s;
$0=substr($0,RLENGTH+2)
} print s
}' <<<"com.e.h/float"
This steps through the string grabbing alphanumeric strings plus the following character, reversing the order of those two captured pieces, and prepending them to an output string.

Using GNU awk's split, splitting from separators . and /, define more if you wish.
$ cat program.awk
{
for(n=split($0,a,"[./]",s); n>=1; n--) # split to a and s, use n from split
printf "%s%s", a[n], (n==1?ORS:s[(n-1)]) # printf it pretty
}
Run it:
$ echo com.e.h/float | awk -f program.awk
float/h.e.com
EDIT:
If you want to run it as one-liner:
awk '{for(n=split($0,a,"[./]",s); n>=1; n--); printf "%s%s", a[n], (n==1?ORS:s[(n-1)])}' foo.txt

Parsing a .csv-like file in bash

I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, that works fine:
#!/bin/bash
> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
if [[ "$line" =~ $regExp ]]
then
printf "${BASH_REMATCH[1]},${BASH_REMATCH[2]}\n" >> output
fi
done <<< "`gawk -F , '!/^$/ {print $2}' $1 | sort | uniq -c`"
My question is:
There is a better and simpler way to do the job?
In particular I don't know how to fix that:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespaces and, if so, the second call on gawk will truncate the string.
Neither i know how to print all the field "from 2 to NF", maintaining the delimiter, which can occur several times in succession.
Thank very much,
Goodbye
EDIT:
As asked, here there is some sample data:
(It is an exercise, sorry for the inventive)
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test

A one-liner in awk:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't contain a ,

You can make your final awk:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'

Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column.
The #F autosplit array starts at index $F[0] while awk fields start with $1

Printing the actual field delimiter value not the regular expression

Given the following input:
check1;check2
check1;;check2
check1,check2
and the awk command:
awk -F';+|,' '{print $1 FS $2}'
FS should contain the selected delimiter?
How can you print the delimiter which is selected i.e. either of ;, ;; or , not the regular expression that the describes the delimiters.
If the input is check1;check2 then the output should be check1;check2.

If you're using GNU Awk (gawk) you can use the 4th argument of split():
gawk '{split($0, a, /;+|,/, seps); print a[1] seps[1] a[2]}' file
Output:
check1;check2
check1;;check2
check1,check2
Using it within a loop is also easy to handle:
gawk '{nf = split($0, a, /;+|,/, seps); for (i = 1; i <= nf; ++i) printf "%s%s", a[i], seps[i]; print ""}' file
22011,25029;;3331,25275
6740,16516;;27292,1217
13480,31488;;7947,18804
328,30623;;12470,6883
If you only need the fields you would only have to touch a. Separators would be separated in seps and the indices of those are aligned with a.

I don't think awk stores the matched delimiter anywhere. If you use GNU awk, you can do it yourself:
gawk '{match($0, /([^;,]*)(;+|,)(.*)/, a); print a[1], a[2], a[3]}'

GNU awk has this feature for records not fields so you could also do something like this:
$ awk '{printf "%s%s",$0,RT}' RS=';+|,|\n' file
check1;check2
check1;;check2
check1,check2
Where RT is the value match by RS for the given record which you can see by:
$ awk '{printf "%s",RT}' RS=';+|,|\n' file
;
;;
,

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex with awk or gawk - regex

This might be easier with sed: $ echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" | sed 's/,\(\w\+=\)/\n\1/g' A=1 B=2,3 C= D=5,6 E=7,8,9

If you are using gnu awk, you could do: awk '{printf $0 "\n" substr( RT, 2 )}' RS=,[A-Z]

As nhahtdh, theres is no lookahead in awk... But you can use a different separator for the assignments. Why not "A=1;B=2,3,4;C=5..."? If your input must have that format, try flex...

You could also use comma as the record separator: echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" | awk -v RS=, '{sep=","} /=/ {sep="\n"} NR==1 {sep=""} {printf "%s%s", sep, $0}' outputs A=1 B=2,3 C= D=5,6 E=7,8,9

Other way using awk awk '{print gensub(/,([A-Z]+=)/, "\n\\1","g")}' temp.txt Output A=1 B=2,3 C= D=5,6 E=7,8,9

Related

awk sub with a capturing group into the replacement

AWK - add value based on regex

How to reverse all the words in a file with bash in Ubuntu?

Parsing a .csv-like file in bash

Printing the actual field delimiter value not the regular expression

Categories

Resources