Column replacement with awk, while retaining the format

This is file a.pdb:
ATOM 1 N ARG 1 0.000 0.000 0.000 1.00 0.00 N
ATOM 2 H1 ARG 1 0.000 0.000 0.000 1.00 0.00 H
ATOM 3 H2 ARG 1 0.000 0.000 0.000 1.00 0.00 H
ATOM 4 H3 ARG 1 0.000 0.000 0.000 1.00 0.00 H
And this is file a.xyz:
16.388 -5.760 -23.332
17.226 -5.608 -23.768
15.760 -5.238 -23.831
17.921 -5.926 -26.697
I want to replace the 6th, 7th, and 8th columns of a.pdb with the columns of a.xyz. Once replaced, I need to maintain the tabs/spaces/column layout of a.pdb.
I have tried:
awk 'NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next} {$6=fld1[FNR]; $7=fld2[FNR]; $8=fld3[FNR]}1' a.xyz a.pdb
But it doesn't keep the format.

This is exactly what the 4th arg for split() in GNU awk was invented to facilitate:
gawk '
NR==FNR { pdb[NR]=$0; next }
{
split(pdb[FNR],flds,FS,seps)
flds[6]=$1
flds[7]=$2
flds[8]=$3
for (i=1;i in flds;i++)
printf "%s%s", flds[i], seps[i]
print ""
}
' a.pdb a.xyz
ATOM 1 N ARG 1 16.388 -5.760 -23.332 1.00 0.00 N
ATOM 2 H1 ARG 1 17.226 -5.608 -23.768 1.00 0.00 H
ATOM 3 H2 ARG 1 15.760 -5.238 -23.831 1.00 0.00 H
ATOM 4 H3 ARG 1 17.921 -5.926 -26.697 1.00 0.00 H
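As a quick illustration of what the 4th argument does (a sketch, not part of the original answer): seps[i] captures the exact separator that followed flds[i], which is what lets the loop above reassemble the line byte for byte:
$ echo 'a  b   c' | gawk '{
    n = split($0, f, FS, seps)               # seps[i] = whitespace after f[i]
    for (i=1; i<=n; i++) printf "[%s][%s]", f[i], seps[i]
    print ""
}'
[a][  ][b][   ][c][]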

Not a general solution, but this might work in this particular case:
awk 'NR==FNR{for(i=6; i<=8; i++) A[FNR,i]=$(i-5); next} {for(i=6; i<=8; i++) sub($i,A[FNR,i])}1' file2 file1
or
awk '{for(i=6; i<=8; i++) if(NR==FNR) A[FNR,i]=$(i-5); else sub($i,A[FNR,i])} NR>FNR' file2 file1
There is a bit of a shift, though. We would need to know the field widths to prevent this.
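For instance, if the coordinates live in fixed columns (they do in real PDB files, where x/y/z occupy columns 31-54), a substr/printf sketch could splice the new values in without disturbing the rest of the line. The column positions below are assumptions about a fixed-width file, not derived from the simplified sample above:
awk '
NR==FNR { x[NR]=$1; y[NR]=$2; z[NR]=$3; next }   # cache coordinates from a.xyz
{ printf "%s%8.3f%8.3f%8.3f%s\n", substr($0,1,30), x[FNR], y[FNR], z[FNR], substr($0,55) }
' a.xyz a.pdb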
--
Or perhaps with substrings:
awk 'NR==FNR{A[FNR]=$0; next} {print substr($0,1,p) FS A[FNR] substr($0,p+length(A[FNR]))}' p=33 file2 file1
-- or, changing the OP's original solution to use sub():
awk 'NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next} {sub($6,fld1[FNR]); sub($7,fld2[FNR]); sub($8,fld3[FNR])}1' file file1
with the same restrictions as the first 2 suggestions.
So suggestions 1, 2, and 4 use sub() to replace, which is not a watertight solution, since earlier fields might interfere and it uses a regex rather than a string (so the regex dot happens to match the literal dot), but with this particular input it might pan out.
Suggestion 3 would probably be the more fool-proof method.
--edit--
I think this would work with the given input:
awk 'NR==FNR{A[FNR]=$1 " " $2 " " $3; next} {print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))}' p=32 file2 file1
but I think something like printf or sprintf formatting would be required to make it fool-proof.
So, perhaps something like this:
awk 'NR==FNR{A[FNR]=sprintf("%7.3f %7.3f %8.4f", $1, $2, $3); next} {print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))}' p=31 file2 file1
or not on one line:
awk '
NR==FNR {
A[FNR]=sprintf("%7.3f %7.3f %8.4f", $1, $2, $3)
next
}
{
print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))
}
' p=31 file2 file1

You can try this one
paste -d' ' test4 test5 |awk '{print $1,$2,$3,$4,$5,$12,$13,$14,$9,$10,$11}'
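Note that print with commas rebuilds each line with single spaces, so the original alignment is lost. If alignment matters, a printf variant is one option; the field widths below are guesses for this sample, not taken from the thread (test4/test5 are assumed to correspond to a.pdb/a.xyz):
paste -d' ' test4 test5 |
awk '{printf "%-4s %4s %-4s %-4s %4s %8s %8s %8s %5s %5s %2s\n",
             $1,$2,$3,$4,$5,$12,$13,$14,$9,$10,$11}'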


Parse default Salt highstate output

Parsing the highstate output of Salt has proven to be difficult, and I don't want to change the output to JSON because I still want it to be human-legible.
What's the best way to convert the Summary into something machine readable?
Summary for app1.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.383 s
--
Summary for app2.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.448 s
--
Summary for app0.domain.com
--------------
Succeeded: 293 (unchanged=13, changed=6)
Failed: 0
--------------
Total states run: 293
Total run time: 7.510 s
Without a better idea I'm trying to grep and awk the output and insert it into a csv.
These two work:
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
But this one fails, even though it works in an online regex tester:
cat ${_FILE} | grep -oP '(?<=\schanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
EDIT1: @vintnes @ikegami I agree, I'd much rather parse the JSON output, but Salt doesn't offer a summary of changes when outputting to JSON. So far this is what I have, and while very ugly, it's working.
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep unchanged | awk -F' ' '{ print $4}' | \
grep -oP '(?<=changed=)[0-9]+' | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Warning" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Failed" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
csvtool transpose /tmp/highstate_tmp.csv > /tmp/highstate.csv;
sed -i '1 i\instance,unchanged,changed,warning,failed' /tmp/highstate.csv;
Output:
instance,unchanged,changed,warning,failed
app1.domain.com,12,6,,0
app0.domain.com,13,6,,0
app2.domain.com,12,6,,0
Here you go. This will also work if your output contains warnings. Please note that the output is in a different order than you specified; it's the order in which each record occurs in the file. Don't hesitate with any questions.
$ awk -v OFS=, '
BEGIN { print "instance,unchanged,changed,warning,failed" }
/^Summary/ { instance=$NF }
/^Succeeded/ { split($3 $4 $5, S, /[^0-9]+/) }
/^Failed/ { print instance, S[2], S[3], S[4], $2 }
' "$_FILE"
split($3 $4 $5, S, /[^0-9]+/) handles the possibility of warnings by disregarding the first two "words" (Succeeded: ###) and using any run of non-digits as a separator.
edit: Printed on /^Fail/ instead of using /^Summ/ and END.
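To see what split() leaves in S (a quick sketch, not from the original answer): for a line without warnings, the concatenated string $3 $4 $5 begins with a non-digit, so S[1] is empty, the counts land in S[2] and S[3], and S[4] stays empty, which is why the warning column is blank in the output above:
$ echo 'Succeeded: 278 (unchanged=12, changed=6)' |
  awk '{split($3 $4 $5, S, /[^0-9]+/); printf "S[2]=%s S[3]=%s S[4]=%s\n", S[2], S[3], S[4]}'
S[2]=12 S[3]=6 S[4]=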
perl -e'
use strict;
use warnings qw( all );
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
my ( $instance, $unchanged, $changed, $warning, $failed );
while (<>) {
if (/^Summary for (\S+)/) {
( $instance, $unchanged, $changed, $warning, $failed ) = $1;
}
elsif (/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/) {
( $unchanged, $changed ) = ( $1, $2 );
}
elsif (/^Warning:\s+(\d+)/) {
$warning = $1;
}
elsif (/^Failed:\s+(\d+)/) {
$failed = $1;
$csv->say(select(), [ $instance, $unchanged, $changed, $warning, $failed ]);
}
}
'
Provide input via STDIN, or provide path to file(s) from which to read as arguments.
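For example (hypothetical file name; the script body is the one above saved without the -e wrapper):
perl parse_highstate.pl "$_FILE" > /tmp/highstate.csv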
Terse version:
perl -MText::CSV_XS -ne'
BEGIN {
$csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
}
/^Summary for (\S+)/ and @row=$1;
/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/ and @row[1,2]=($1,$2);
/^Warning:\s+(\d+)/ and $row[3]=$1;
/^Failed:\s+(\d+)/ and ($row[4]=$1), $csv->say(select(), \@row);
'
Improving on the answer from @vintnes.
Producing output as tab-separated CSV.
The awk script reads values from lines by their position within each 8-line block and prints each record as soon as it is complete.
script.awk
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
FNR%8 == 1 {arr[1] = $3}
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
FNR%8 == 4 {arr[5] = $2;}
FNR%8 == 6 {arr[6] = $4;}
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Run the script.
Tab-separated CSV output:
awk -v OFS="\t" -f script.awk input-1.txt input-2.txt ...
Comma-separated CSV output:
awk -v OFS="," -f script.awk input-1.txt input-2.txt ...
Output
computer succeeded unchanged changed failed states run run time
app1.domain.com 278 12 6 0 278 7.383
app2.domain.com 278 12 6 0 278 7.448
app0.domain.com 293 13 6 0 293 7.510
computer,succeeded,unchanged,changed,failed,states run,run time
app1.domain.com,278,12,6,0,278,7.383
app2.domain.com,278,12,6,0,278,7.448
app0.domain.com,293,13,6,0,293,7.510
Explanation
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
Print the heading CSV line
FNR%8 == 1 {arr[1] = $3}
Extract arr[1] from the 3rd field of the 1st line in each 8-line block.
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
Extract arr[2], arr[3], and arr[4] from the 2nd, 3rd, and 4th fields of the 3rd line in each block.
FNR%8 == 4 {arr[5] = $2;}
Extract arr[5] from the 2nd field of the 4th line in each block.
FNR%8 == 6 {arr[6] = $4;}
Extract arr[6] from the 4th field of the 6th line in each block.
FNR%8 == 7 {arr[7] = $4;
Extract arr[7] from the 4th field of the 7th line in each block.
print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
Print the extracted array elements upon completing the 7th line of each block.
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Utility function to extract the first number from a text field.
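One caveat: the 3-argument match() in extractNum() is gawk-only. A portable sketch using RSTART/RLENGTH (my variant, not part of the answer) would be:
function extractNum(str) {
    if (match(str, /[[:digit:]]+/))   # POSIX match() sets RSTART and RLENGTH
        return substr(str, RSTART, RLENGTH)
    return ""
}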

AWK - add value based on regex

I have to add the numbers returned by REGEX using awk in linux.
Basically from this file:
123john456:x:98:98::/home/john123:/bin/bash
I have to add the numbers 123 and 456 using awk.
So the result would be 579
So far I have done the following:
awk -F ':' '$1 ~ VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'match($1, VAR=/[0-9].*?:/) ; {print VAR}' /etc/passwd
And from what I've seen match doesn't support this at all.
Does anyone have any idea?
UPDATE:
it should also work for
john123 -> 123
123john -> 123
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
With your updated requirements:
$ cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
123
123
With gawk and for the given example
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); print a}' inputFile | bc
would do the job.
More general:
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); a=gensub(/^\+/,"","g",a); a=gensub(/\+$/,"","g",a); print a}' inputFile | bc
The regex-part replaces all sequences of letters with '+' (e.g., '12johnny34' becomes 12+34). Finally, this mathematical operation is evaluated by bc.
(To be safe, I remove leading and trailing '+' signs with ^\+ and \+$.)
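A quick demonstration of the intermediate step (a sketch, not from the answer):
$ echo '12johnny34' | gawk '{print gensub(/[a-zA-Z]+/,"+","g",$0)}'
12+34
$ echo '12johnny34' | gawk '{print gensub(/[a-zA-Z]+/,"+","g",$0)}' | bc
46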
You may use
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' /etc/passwd
See the demo below:
s="123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash"
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' <<< "$s"
Output:
579
123
Details
-F ':' - each record is split into fields on the : char
n=split($1, a, /[^0-9]+/) - takes Field 1 and splits it into digit-only chunks, saving the numbers in the a array; the n variable holds the number of chunks
b=0 - b will hold the sum
for (i=1;i<=n;i++) { b += a[i]; } - iterates over the a array and sums the values
print b - prints the result.
I used awk's split() to separate the first field on any string not containing numbers.
split(string, target_array, [regex], [separator_array]*)
*separator_array requires gawk
$ awk -F: '{split($1, A, /[^0-9]+/, S); print S[1], A[1]+A[2]}' <<EOF
123john456:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
EOF
john 579
john 123
You can use [^0-9]+ as a field separator, and :[^\n]*\n as a record separator instead:
awk -F '[^0-9]+' 'BEGIN{RS=":[^\n]*\n"}{print $1+$2}' /etc/passwd
so that given the content of /etc/passwd being:
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
This outputs:
579
123
123
You can try Perl also
$ cat johnny.txt
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ perl -F: -lane ' $_=$F[0]; $sum+= $1 while(/(\d+)/g); print $sum; $sum=0 ' johnny.txt
579
123
123
$
Here is another awk variant that adds all the numbers present in the first :-separated field:
cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
1j2o3h4n5:x:98:98::/home/john123:/bin/bash
awk -F '[^0-9:]+' '{s=0; for (i=1; i<=NF; i++) {s+=$i; if ($i~/:$/) break} print s}' file
579
123
123
15

awk to count and sum total using matching string from file

I am trying to get the total length and the count of each matching string in a file using awk. The count is the number of lines matching each string in $5, and the total length is the sum of $3 - $2 over those lines. Hopefully the awk below is a good start. Thank you :).
input
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
desired output
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119
awk
awk '{count[$5]++}
END {
for (word in count)
print $1,$2,$3,$4,word, count[word]
}' input > count |
awk 'print $1,$2,$3,$4,word, count[word]
}
{ $6 = $3 - $2 }
1' count.txt > length
edit
SCNN1D 1 119
GABRD 3 240
TAS1R3 4 223
You can do:
awk '{c1[$5]++; c2[$5]+=($3-$2)}
END{for (e in c1) print e, c1[e], c2[e]}' input
Note that the order of the records may be different than the order in the original file.
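If the original order matters, a hedged variant (my sketch, not part of the answer) can remember the order in which each key first appears:
awk '!($5 in c1){ord[++n]=$5}                   # remember first-seen order
     {c1[$5]++; c2[$5]+=($3-$2)}
     END{for (i=1;i<=n;i++) print ord[i], c1[ord[i]], c2[ord[i]]}' input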
With awk, it's possible to do the entire thing in a single script,
by keeping a running count of both the cumulative length, and the number of instances for each word.
Try this (yet untested):
awk '{
offset1=$2; offset2=$3; word=$5
TotalLength[word]+=offset2 - offset1 # accumulate (or just TotalLength[word]+=$3-$2)
count[word]++}
END {
for (word in count)
print word, count[word], TotalLength[word]
}' input
The original script had three errors.
The second awk chunk had an ambiguous input specification: it reads both from a pipe and from a file argument (count.txt), so awk cannot decide where to read from.
In an END section, the numbered fields will only refer to the fields of the last line/record read. This is not what you want.
Finally, the second awk script is missing the opening brace { for the print statement.
$ cat tst.awk
$5 != prev { if (NR>1) print prev, cnt, sum; prev=$5; cnt=sum=0 }
{ cnt++; sum+=($3-$2) }
END { print prev, cnt, sum }
$ awk -f tst.awk file
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119
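This relies on lines sharing the same $5 being contiguous, as in the sample. If they might not be, grouping them first is a simple fix (a sketch):
$ sort -k5,5 file | awk -f tst.awk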

grep, cut, sed, awk a file for 3rd column, n lines at a time, then paste into repeated columns of n rows?

I have a file of the form:
#some header text
a 1 1234
b 2 3333
c 2 1357
#some header text
a 4 8765
b 1 1212
c 7 9999
...
with repeated data in n-row chunks separated by a blank line (with possibly some other header text). I'm only interested in the third column, and would like to do some grep, cut, awk, sed, paste magic to turn it in to this:
a 1234 8765 ...
b 3333 1212
c 1357 9999
where the third column of each subsequent n-row chunk is tacked on as a new column. I guess you could call it a transpose, just n-lines at a time, and only a specific column. The leading (a b c) column label isn't essential... I'd be happy if I could just grab the data in the third column
Is this even possible? It must be. I can get things chopped down to only the interesting columns with grep and cut:
cat myfile | grep -A2 ^a\ | cut -c13-15
but I can't figure out how to take these n-row chunks and sed/paste/whatever them into repeated n-row columns.
Any ideas?
This awk does the job:
awk 'NF<3 || /^(#|[[:blank:]]*$)/{next} !a[$1]{b[++k]=$1; a[$1]=$3; next}
{a[$1] = a[$1] OFS $3} END{for(i=1; i<=k; i++) print b[i], a[b[i]]}' file
a 1234 8765
b 3333 1212
c 1357 9999
awk '/#/{next}{a[$1] = a[$1] $3 "\t"}END{for(i in a){print i, a[i]}}' file
Would produce
a 1234 8765
b 3333 1212
c 1357 9999
You can change "\t" to a different output separator like " " if you like.
sub(/\t$/, "", a[i]); may be inserted before the print if you don't like having a trailing separator. Another solution is to check whether a[$1] already has a value, to decide whether to append to a previous value or not. It complicates the code a bit, though.
Using bash >= 4.0 (for associative arrays):
declare -A array
while read line
do
if [[ $line && $line != \#* ]];then
c=$( echo $line | cut -f 1 -d ' ')
value=$( echo $line | cut -f 3 -d ' ')
array[$c]="${array[$c]} $value"
fi
done < myFile.txt
for k in "${!array[@]}"
do
echo "$k ${array[$k]}"
done
Will produce:
a 1234 8765
b 3333 1212
c 1357 9999
It stores the letter as the key of the associative array, and in each iteration appends the corresponding value to it.
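As an aside, the per-line echo | cut calls can be avoided by letting read split the fields directly; a sketch under the same assumptions about the input format:
declare -A array
while read -r c _ value _; do                       # $1 -> c, $3 -> value
    [[ $c && $c != \#* ]] && array[$c]+=" $value"
done < myFile.txt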
$ awk -v RS= -F'\n' '{ for (i=2;i<=NF;i++) {split($i,f,/[[:space:]]+/); map[f[1]] = map[f[1]] " " f[3]} } END{ for (key in map) print key map[key]}' file
a 1234 8765
b 3333 1212
c 1357 9999

Parsing a file line by line for key character in string and copying line

I'm trying to parse a DNA protein file and extract just a certain amount of information. I want to parse a line only if it starts with "ATOM" and has either G, A, T, or C at the end of the fourth column. For example, in the snippet below DG would be parsed because it ends with a G. The line should then be saved to a file. I am using bash. What would you use to do this: grep, find, sed, awk, or some kind of regular expression?
Thanks for any help!
HETATM 103 HG22 MVA A 8 4.999 -1.260 2.090 1.00 0.00 H
HETATM 104 HG23 MVA A 8 5.639 -2.810 2.604 1.00 0.00 H
TER 105 MVA A 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
In addition to the original problem:
Count the total lines and the individual G, A, T, C occurrences, and output the counts to a file as Total Lines, TOTAL G, TOTAL T, TOTAL A, TOTAL C.
awk '/^ATOM/&&$4~/[GATC]$/' input > output
Here is an old-fashioned bash way:
while read -ra fld; do
[[ ${fld[0]} == "ATOM" ]] && [[ ${fld[3]} =~ [GATC]$ ]] && echo "${fld[@]}"
done < dnafile.old > dnafile.new
Hope I get the chance to answer this, since the OP asked a follow-up on Kent's answer. Here is the question:
If you notice, in line 3 of the example the 3rd column is blank. Will this matter? It shouldn't in this case, because it's not an ATOM, but what if it was?
So here is the fix (based on the assumption that the format and column positions do not change):
awk '/^ATOM/&&substr($0,20,1)~/[GATC]/' file
Test result:
$ cat file
HETATM 103 HG22 MVA A 8 4.999 -1.260 2.090 1.00 0.00 H
HETATM 104 HG23 MVA A 8 5.639 -2.810 2.604 1.00 0.00 H
ATOM 105 MVA X 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
$ awk '/^ATOM/&&substr($0,20,1)~/[GATC]/' file
ATOM 105 MVA X 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
Edit for new request.
awk '/^ATOM/&&substr($0,20,1)~/[GATC]/{print;l++;a[substr($0,20,1)]++}END{printf "total line : %s\n",l;for (i in a) printf "%s : %s \n",i,a[i]}' file
ATOM 105 MVA A 8
ATOM 106 O5' DG C 11 -12.710 1.571 -11.945 1.00 0.00 O
ATOM 107 C5' DG C 11 -13.491 2.438 -11.111 1.00 0.00 C
total line : 3
A : 1
G : 2
Huh... after Kent's excellent awk solution I'm hesitant to write a long regexp :) :)
grep -P 'ATOM\s+\S+\s+\S+\s*\S*[GATC]\s+' dnafile
This needs a grep with -P (Perl regexes).
Without Perl regexes, the standard regex is much longer:
grep 'ATOM *[^ ][^ ]* *[^ ][^ ]* *[^ ][^ ]* *[^ ]*[GATC] *' dnafile
This might work for you (GNU sed):
sed -nr '/^ATOM.{15}[GATC]/w newfile' oldfile
Since columns may be empty, the match must be made on position in the line.
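The same positional match can be written with extended grep (a sketch, using the same column assumption as the sed above):
grep -E '^ATOM.{15}[GATC]' oldfile > newfile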