AWK - add value based on regex - regex

I have to add the numbers returned by REGEX using awk in linux.
Basically from this file:
123john456:x:98:98::/home/john123:/bin/bash
I have to add the numbers 123 and 456 using awk.
So the result would be 579
So far I have done the following:
awk -F ':' '$1 ~ VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'match($1, VAR=/[0-9].*?:/) ; {print VAR}' /etc/passwd
And from what I've seen match doesn't support this at all.
Does someone has any idea?
UPDATE:
it also should work for
john123 result - > 123
123john result - > 123

$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
With your updated requirements:
$ cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
123
123

With gawk and for the given example
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); print a}' inputFile | bc
would do the job.
More general:
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); a=gensub(/^+/,"","g",a); a=gensub(/+$/,"","g",a); print a}' inputFile | bc
The regex-part replaces all sequences of letters with '+' (e.g., '12johnny34' becomes 12+34). Finally, this mathematical operation is evaluated by bc.
(The be safe, I remove leading and trailing '+' sings by ^+ and +$)

You may use
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' /etc/passwd
See online awk demo
s="123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash"
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' <<< "$s"
Output:
579
123
Details
-F ':' - records are split into fields with : char
n=split($1, a, /[^0-9]+/) - gets Field 1 and splits into digit only chunks saving the numbers in a array and the n var contains the number of these chunks
b=0 - b will hold the sum
for (i=1;i<=n;i++) { b += a[i]; } - iterate over a array and sum the values
print b - prints the result.

I used awk's split() to separate the first field on any string not containing numbers.
split(string, target_array, [regex], [separator_array]*)
*separator_array requires gawk
$ awk -F: '{split($1, A, /[^0-9]+/, S); print S[1], A[1]+A[2]}' <<EOF
123john456:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
EOF
john 579
john 123

You can use [^0-9]+ as a field separator, and :[^\n]*\n as a record separator instead:
awk -F '[^0-9]+' 'BEGIN{RS=":[^\n]*\n"}{print $1+$2}' /etc/passwd
so that given the content of /etc/passwd being:
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
This outputs:
579
123
123

You can try Perl also
$ cat johnny.txt
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ perl -F: -lane ' $_=$F[0]; $sum+= $1 while(/(\d+)/g); print $sum; $sum=0 ' johnny.txt
579
123
123
$

Here is another awk variant that adds all the numbers present in first field separated by ::
cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
1j2o3h4n5:x:98:98::/home/john123:/bin/bash
awk -F '[^0-9:]+' '{s=0; for (i=1; i<=NF; i++) {s+=$i; if ($i~/:$/) break} print s}' file
579
123
123
15

Related

How to print a line with no field separator in awk?

I have data like this (file is called list-in.dat)
a ; b ; c ; i
d
e ; f ; a ; b
g ; h ; i
and I want a list like this (output file list-out.dat) with all items, in alphabetically order (case insensitive) and each unique item only once.
a
b
c
d
e
f
g
h
i
My attempt is:
awk -F " ; " ' BEGIN { OFS="\n" ; } {for(i=0; i<=NF; i++) print $i} ' file-in.dat | uniq | sort -uf > file-out.dat
But I end up with all antries except those lines which has only one item:
a
b
c
e
f
g
h
i
How can I get all (unique, sorted) items no matter how many items are in one line / if the field separator is missing?
Using gnu-awk:
awk -F '[[:blank:]]*;[[:blank:]]*' '{
for (i=1; i<=NF; i++) uniq[$i]
}
END {
PROCINFO["sorted_in"]="#ind_str_asc"
for (i in uniq)
print i
}' file
a
b
c
d
e
f
g
h
i
For non-gnu awk use:
awk -F '[[:blank:]]*;[[:blank:]]*' '{for (i=1; i<=NF; i++) uniq[$i]}
END{for (i in uniq) print i}' file | sort
awk -F' ; ' -v OFS='\n' '{$1=$1} 1' ip.txt | sort -fu
-F' ; ' sets space followed by ; followed by space as field separator
-v OFS='\n' sets newline as output field separator
{$1=$1} change $0 as per new OFS
1 print $0
sort -fu sort uniquely ignoring case in alphabetic order
Could you please try following, awk + sort solution, written and tested with shown samples. In case you want to use ignorecase then add IGNORECASE=1 in awk code.
awk '
BEGIN{
FS=" ; "
}
{
for(i=1;i<=NF;i++){
if(!a[$i]++){ print $i }
}
}
' Input_file | sort
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=" ; " ##Setting field separator as space semi-colon space here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop till NF here for each line.
if(!a[$i]++){ print $i } ##Checking condition if current field is NOT present in array a then printing that field value here.
}
}
' Input_file | sort ##Mentioning Input_file name here and passing it to sort as Input to sort the data.

Using gawk to extract rows with string in a column

I was trying to extract rows from a tab separated file, if it contained a certain word in the 4th column. For example, if input file test.txt is:
chr 8 1234 abc ; xyz
chr 8 1255 abc
chr 8 987 xyz
chr 8 5467 jxyzm
The following code correctly outputs only the 1st and 3rd line:
gawk -F"\t" ' { if($4 ~ /\<xyz\>/) print $0 } ' test.txt >> test.out
However, when I try to run this in a loop, in a bash script, my output file is blank. the code I am using is:
while read id
do
OFILE=${ODIR}/${id}.txt
gawk -v id="$id" -F"\t" ' { if($4 ~ /\<id\>/) print $0 } ' ${IFILE} >> ${OFILE}
done < ${GFILE}
The file ${GFILE} has one word per line, e.g.:
xyz
fg45
tre2y
What am I doing wrong?
thanks!
Edited to:
Add fourth row in input file
Added -v id="$id" to command...script still doesn't work!
You can very well use awk to read search patterns from one file and find matches in other like this:
awk -F '\t' '
NR == FNR {
words[$1]
next
}
{
for (w in words)
if (index($4, w)) {
print > w ".txt"
break
}
}' "$GFILE" "$IFILE"
Then check output:
cat xyz.txt
chr 8 1234 abc ; xyz
chr 8 987 xyz
If you really-really want to fix your shell script then here it is:
while read id; do
awk -F '\t' -v id="$id" '$4 ~ id' "$IFILE" > "$id.txt"
done < "$GFILE"

How to reverse all the words in a file with bash in Ubuntu?

I would like to reverse the complete text from the file.
Say if the file contains:
com.e.h/float
I want to get output as:
float/h.e.com
I have tried the command:
rev file.txt
but I have got all the reverse output: taolf/h.e.moc
Is there a way I can get the desired output. Do let me know. Thank you.
Here is teh link of teh sample file: Sample Text
You can use sed and tac:
str=$(echo 'com.e.h/float' | sed -E 's/(\W+)/\n\1\n/g' | tac | tr -d '\n')
echo "$str"
float/h.e.com
Using sed we insert \n before and after all non-word characters.
Using tac we reverse the output lines.
Using tr we strip all new lines.
If you have gnu-awk then you can do all this in a single awk command using 4 argument split function call that populates split strings and delimiters separately:
awk '{
s = ""
split($0, arr, /\W+/, seps)
for (i=length(arr); i>=1; i--)
s = s seps[i] arr[i]
print s
}' file
For non-gnu awk, you can use:
awk '{
r = $0
i = 0
while (match(r, /[^a-zA-Z0-9_]+/)) {
a[++i] = substr(r, RSTART, RLENGTH) substr(r, 0, RSTART-1)
r = substr(r, RSTART+RLENGTH)
}
s = r
for (j=i; j>=1; j--)
s = s a[j]
print s
}' file
Is it possible to use Perl?
perl -nlE 'say reverse(split("([/.])",$_))' f
This one-liner reverses all the lines of f, according to PO's criteria.
If prefer a less parentesis version:
perl -nlE 'say reverse split "([/.])"' f
For portability, this can be done using any awk (not just GNU) using substrings:
$ awk '{
while (match($0,/[[:alnum:]]+/)) {
s=substr($0,RLENGTH+1,1) substr($0,1,RLENGTH) s;
$0=substr($0,RLENGTH+2)
} print s
}' <<<"com.e.h/float"
This steps through the string grabbing alphanumeric strings plus the following character, reversing the order of those two captured pieces, and prepending them to an output string.
Using GNU awk's split, splitting from separators . and /, define more if you wish.
$ cat program.awk
{
for(n=split($0,a,"[./]",s); n>=1; n--) # split to a and s, use n from split
printf "%s%s", a[n], (n==1?ORS:s[(n-1)]) # printf it pretty
}
Run it:
$ echo com.e.h/float | awk -f program.awk
float/h.e.com
EDIT:
If you want to run it as one-liner:
awk '{for(n=split($0,a,"[./]",s); n>=1; n--); printf "%s%s", a[n], (n==1?ORS:s[(n-1)])}' foo.txt

compare fields after removing spaces in two files with Regex tools

I am new to awk and need to find the statement to compare two fields in files below
The columns are , seperated
1.csv
_________
1space, aspace
2,b
space3space,c
2.csv
____________
1space,spacea
space2,bspace
3,spacecspace
The below statement works fine if there are no leading or training spaces in the fields of either of 1.tsv or 2.tsv
nawk -F, 'NR==FNR{a[$1,$2]++;next} !(a[$1,$2])' 2.tsv 1.tsv
Kindly let me know how can we modify the above statement to trim leadind and lagging spaces and then compare. Thanks for the help.
awk -F, '
{ key=$1; gsub(/^[[:space:]]+|[[:space:]]+$/,"",key) }
NR==FNR { a[key]; next }
!(key in a)
' 2.tsv 1.tsv
Do this:
awk '
BEGIN {FS=OFS=","}
NR==FNR {
gsub(/^ *| *$/,"",$1)
a[$1]++
next
}
{
gsub(/^ *| *$/,"",$1);
if (!($1 in a)) {
print
}
}' 2.tsv 1.tsv
Code for GNU sed:
sed -r 's#\s*(\S+)\s*,\s*(\S+)\s*#/\\s*\1\\s*,\\s*\2\\s*/p#' file1|sed -f - file2
$cat file1
1 , a
2,b
3 ,c
$cat file2
1 ,a
2,b
3,c
$sed -r 's#\s*(\S+)\s*,\s*(\S+)\s*#/\\s*\1\\s*,\\s*\2\\s*/d#' file1|sed -nf - file2
You need to trim all the spaces from $1 before trying to locate it in array a:
awk -F"," 'NR==FNR{$1=$1;a[$1]++;next} {f1=$1; gsub(/ /, "", f1);
if (!a[f1]) print}' 2.tsv 1.tsv

Regex with awk or gawk

I'm a beginner user of awk/gawk.
If I run below, the shell gives me nothing. Please help!
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk 'BEGIN{
n = split($0, arr, /,(?=\\w+=)/)
for (x=1; x<n; x++) printf "arr[%d]=%s\n", x, arr[x]
}'
.....................................................
I am trying to parse:
A=1,B=2,3,C=,D=5,6,E=7,8,9
Expected Output:
A=1
B=2,3
C=
D=5,6
E=7,8,9
I bet there's something wrong with my awk.
gawk doesn't support look-ahead.
if you want gawk to parse it as you expected, try this:
awk '{n=split(gensub(/,([A-Z])/, " \\1","g" ),arr," ");for(x=1;x<=n;x++)print arr[x]}'
test with your example:
kent$ echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk '{n=split(gensub(/,([A-Z])/, " \\1","g" ),arr," ");for(x=1;x<=n;x++)print arr[x]}'
A=1
B=2,3
C=
D=5,6
E=7,8,9
This might be easier with sed:
$ echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" | sed 's/,\(\w\+=\)/\n\1/g'
A=1
B=2,3
C=
D=5,6
E=7,8,9
If you are using gnu awk, you could do:
awk '{printf $0 "\n" substr( RT, 2 )}' RS=,[A-Z]
As nhahtdh, theres is no lookahead in awk... But you can use a different separator for the assignments. Why not "A=1;B=2,3,4;C=5..."?
If your input must have that format, try flex...
You could also use comma as the record separator:
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" |
awk -v RS=, '{sep=","} /=/ {sep="\n"} NR==1 {sep=""} {printf "%s%s", sep, $0}'
outputs
A=1
B=2,3
C=
D=5,6
E=7,8,9
You have two problems. First, you don't want a BEGIN clause; you just want this to run on every input line. Second, you are trying to use regular expression features that AWK does not support.
Instead of trying to use a fancy pattern that splits the string, loop and call match() to parse out the features you want.
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk '
{
line = $0
for (i = 0;;)
{
i = match(line, /([A-Z]+)=([0-9,]*)(,|$)/, arr)
if (0 == i)
break
key = arr[1]
value = arr[2]
l = length(key "=" value ",") + 1
line = substr(line, l)
printf "DEBUG: key '%s' value '%s'\n", key, value
}
}'
This prints:
DEBUG: key A value 1
DEBUG: key B value 2,3
DEBUG: key C value
DEBUG: key D value 5,6
DEBUG: key E value 7,8,9
Other way using awk
awk '{print gensub(/,([A-Z]+=)/, "\n\\1","g")}' temp.txt
Output
A=1
B=2,3
C=
D=5,6
E=7,8,9