Awk: From CSV to PDB (Protein Data Bank) - if-statement

I have a CSV file with this format:
ATOM,3662,H,VAL,A,257,6.111,31.650,13.338,1.00,0.00,H
ATOM,3663,HA,VAL,A,257,3.180,31.995,13.768,1.00,0.00,H
ATOM,3664,HB,VAL,A,257,4.726,32.321,11.170,1.00,0.00,H
ATOM,3665,HG11,VAL,A,257,2.387,31.587,10.892,1.00,0.00,H
And I would like to format it according to PDB standards (fixed position):
ATOM 3662 H VAL A 257 6.111 31.650 13.338 1.00 0.00 H
ATOM 3663 HA VAL A 257 3.180 31.995 13.768 1.00 0.00 H
ATOM 3664 HB VAL A 257 4.726 32.321 11.170 1.00 0.00 H
ATOM 3665 HG11 VAL A 257 2.387 31.587 10.892 1.00 0.00 H
Everything can be treated as right-justified except for the first and the third columns. The first is not a problem. The third, however, is left-justified when its length is 1-3 but shifted one position to the left when its length is 4.
I have this AWK one-liner that almost does the trick:
awk -F, 'BEGIN {OFS=FS} {if(length($3) == 4 ) {pad=" "} else {pad=" "}} {printf "%-6s%5s%s%-4s%4s%2s%4s%11s%8s%8s%6s%6s%12s\n", $1, $2, $pad, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12}' < 1iy8_min.csv
Except for two things:
The exception of the third column. I was thinking about adding a condition which changes the padding before the third column according to the field length, but I cannot get it to work (the idea is illustrated in the above one-liner).
The other problem is that if there are no spaces between the fields, the padding does not work at all.
ATOM 3799 HH TYR A 267 -5.713 16.149 26.838 1.00 0.00 H
HETATM 3801 O7N NADA12688.285 19.839 10.489 1.00 20.51 O
In the above example, the second line should be:
HETATM 3801 O7N NAD A1268 8.285 19.839 10.489 1.00 20.51 O
But because there is no space between fields 5 and 6, everything gets shuffled. I think that A1268 is perceived as a single field, maybe because the default awk delimiter is whitespace. Is it possible to make it position-dependent?
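On the position-dependent question, one possible approach is to skip field splitting and slice each line by character position with substr(). The sketch below assumes the standard PDB ATOM/HETATM layout (chain ID in column 22, residue number in columns 23-26, x coordinate in columns 31-38); pdb_fields is a hypothetical helper name, and the sample line is built with printf so its spacing is exact.

```shell
# Hypothetical helper: read fixed-width PDB records by column, not by delimiter.
pdb_fields() {
    awk '{
        chain  = substr($0, 22, 1)        # column 22: chain identifier
        resseq = substr($0, 23, 4) + 0    # columns 23-26: residue number
        x      = substr($0, 31, 8) + 0    # columns 31-38: x coordinate
        print chain, resseq, x
    }'
}

# Build a record in which the chain ID and residue number touch ("A1268").
printf '%-6s%5s %-4s %3s %1s%4s    %8s%8s%8s\n' \
    HETATM 3801 ' O7N' NAD A 1268 8.285 19.839 10.489 | pdb_fields
```

Because the columns are addressed directly, it does not matter that there is no blank between the chain ID and the residue number.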
UPDATE: The following solves the problem with the exception on the third column:
awk 'BEGIN {FS = ",";OFS = ""} { if(length($3) == 4 ) {pad = sprintf("%s", " ")} else {pad = sprintf("%2s", " ")} } { if(length($3) == 4 ) {pad2 = sprintf("%s", " ")} else {pad2 = sprintf("%s", "")} } {printf "%-6s%5s%s%-4s%s%3s%2s%4s%11s%8s%8s%6s%6s%12s\n", $1, $2, pad, $3, pad2, $4, $5, $6, $7, $8, $9, $10, $11, $12}' 1iy8_min.csv
However, OFS seems to be ignored...
UPDATE2: The problem was in the input file. Sorry about that. Solved.
The working script:
awk 'BEGIN{OFS=FS=","}{$7=sprintf("%.3f",$7)}1{$8=sprintf("%.3f",$8)}1{$9=sprintf("%.3f",$9)}1' ${file} | awk 'BEGIN {FS =","; OFS=""} { if(length($3) == 4 ) {pad = sprintf("%s", " ")} else {pad = sprintf("%2s", " ")} } { if(length($3) == 4 ) {pad2 = sprintf("%s", " ")} else {pad2 = sprintf("%s", "")} } {printf "%-6s%5s%s%-4s%s%3s%2s%4s%12s%8s%8s%6s%6s%12s\n", $1, $2, pad, $3, pad2, $4, $5, $6, $7, $8, $9, $10, $11, $12}' > ${root}_csv.pdb
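For comparison, the two awk passes can probably be folded into a single printf. This is only a sketch, assuming the standard PDB column layout and one-letter element symbols: atom names shorter than four characters start in column 14, four-character names in column 13. The name csv_to_pdb is made up for the example. Note that printf never uses OFS, which is why setting it appears to be ignored.

```shell
# Hypothetical helper: CSV (12 fields per line) on stdin, fixed-width PDB on stdout.
csv_to_pdb() {
    awk -F, '{
        # Assumption: names shorter than 4 characters start in column 14,
        # so they get one leading space; 4-character names start in column 13.
        name = (length($3) < 4) ? " " $3 : $3
        printf "%-6s%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s\n", $1, $2, name, $4, $5, $6, $7, $8, $9, $10, $11, $12
    }'
}

printf '%s\n' \
    'ATOM,3662,H,VAL,A,257,6.111,31.650,13.338,1.00,0.00,H' \
    'ATOM,3665,HG11,VAL,A,257,2.387,31.587,10.892,1.00,0.00,H' | csv_to_pdb
```

The %8.3f conversions also take over the rounding done by the first awk in the pipeline above.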

Related

AWK if then loop conditional on column values

I have 3 columns, I want to create a 4th column that is equal to the 2nd column only when the 3rd column is equal to 1 (otherwise, the value can be 0).
For example,
4 3 1
would become
4 3 1 3
whereas
4 3 2
would become
4 3 2 0
I tried it 3 ways, in all cases the 4th column is all zeroes:
'BEGIN {FS = "\t"}; {if ($3!=1) last=0; else last=$2} {print $1, $2, $3, last}'
'BEGIN {FS = "\t"}; {if ($3!=1) print $1, $2, $3, 0; else print $1, $2, $3, $2}'
'BEGIN {FS = "\t"}; {if ($3==1) print $1, $2, $3, $2; else print $1, $2, $3, 0}'
awk to the rescue
awk '{$(NF+1)=$3==1?$2:0}1'
$ awk '{print $0, ($3==1?$2:0)}' file
4 3 1 3
4 3 2 0
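The ternary in the two one-liners above is shorthand for an if/else. Spelled out, it might look like the sketch below; add_col is a hypothetical wrapper name, and the default whitespace field splitting is assumed, as in the answers.

```shell
# Hypothetical wrapper: append column 2 when column 3 equals 1, otherwise 0.
add_col() {
    awk '{
        if ($3 == 1)
            last = $2
        else
            last = 0
        print $0, last
    }'
}

printf '4 3 1\n4 3 2\n' | add_col
```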

Column replacement with awk, with retaining the format

This is file a.pdb:
ATOM 1 N ARG 1 0.000 0.000 0.000 1.00 0.00 N
ATOM 2 H1 ARG 1 0.000 0.000 0.000 1.00 0.00 H
ATOM 3 H2 ARG 1 0.000 0.000 0.000 1.00 0.00 H
ATOM 4 H3 ARG 1 0.000 0.000 0.000 1.00 0.00 H
And this is file a.xyz:
16.388 -5.760 -23.332
17.226 -5.608 -23.768
15.760 -5.238 -23.831
17.921 -5.926 -26.697
I want to replace the 6th, 7th and 8th columns of a.pdb with the columns of a.xyz. Once replaced, I need to keep the tabs/spaces/column alignment of a.pdb.
I have tried:
awk 'NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next} {$6=fld1[FNR]; $7=fld2[FNR]; $8=fld3[FNR]}1' a.xyz a.pdb
But it doesn't keep the format.
This is exactly what the 4th arg for split() in GNU awk was invented to facilitate:
gawk '
NR==FNR { pdb[NR]=$0; next }
{
split(pdb[FNR],flds,FS,seps)
flds[6]=$1
flds[7]=$2
flds[8]=$3
for (i=1;i in flds;i++)
printf "%s%s", flds[i], seps[i]
print ""
}
' a.pdb a.xyz
ATOM 1 N ARG 1 16.388 -5.760 -23.332 1.00 0.00 N
ATOM 2 H1 ARG 1 17.226 -5.608 -23.768 1.00 0.00 H
ATOM 3 H2 ARG 1 15.760 -5.238 -23.831 1.00 0.00 H
ATOM 4 H3 ARG 1 17.921 -5.926 -26.697 1.00 0.00 H
Not a general solution, but this might work in this particular case:
awk 'NR==FNR{for(i=6; i<=8; i++) A[FNR,i]=$(i-5); next} {for(i=6; i<=8; i++) sub($i,A[FNR,i])}1' file2 file1
or
awk '{for(i=6; i<=8; i++) if(NR==FNR) A[FNR,i]=$(i-5); else sub($i,A[FNR,i])} NR>FNR' file2 file1
There is a bit of a shift, though. We would need to know the field widths to prevent this.
--
Or perhaps with substrings:
awk 'NR==FNR{A[FNR]=$0; next} {print substr($0,1,p) FS A[FNR] substr($0,p+length(A[FNR]))}' p=33 file2 file1
-- changing it in the OP's original solution:
awk 'NR==FNR {fld1[NR]=$1; fld2[NR]=$2; fld3[NR]=$3; next} {sub($6,fld1[FNR]); sub($7,fld2[FNR]); sub($8,fld3[FNR])}1' file2 file1
with the same restrictions as the first 2 suggestions.
So 1, 2, and 4 use sub to replace, which is not a watertight solution: earlier fields might interfere, and sub uses a regex rather than a plain string (so the regex dot merely happens to match the literal dot). With this particular input, though, it might pan out.
Number 3 would probably be the more foolproof method.
--edit--
I think this would work with the given input:
awk 'NR==FNR{A[FNR]=$1 " " $2 " " $3; next} {print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))}' p=32 file2 file1
but I think something like printf or sprintf formatting would be required to make it foolproof.
So, perhaps something like this:
awk 'NR==FNR{A[FNR]=sprintf("%7.3f %7.3f %8.4f", $1, $2, $3); next} {print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))}' p=31 file2 file1
or not on one line:
awk '
NR==FNR {
A[FNR]=sprintf("%7.3f %7.3f %8.4f", $1, $2, $3)
next
}
{
print substr($0,1,p) A[FNR] substr($0,p+length(A[FNR]))
}
' p=31 file2 file1
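If a.pdb follows the standard PDB layout (an assumption, since the spacing of the sample is not visible here), the x, y and z coordinates occupy columns 31-54 as three 8-character fields with 3 decimals each. The offsets can then be fixed instead of tuned through p. A sketch, with replace_xyz as a made-up helper name and throwaway test files:

```shell
# Hypothetical helper: replace_xyz COORDS.xyz STRUCTURE.pdb
replace_xyz() {
    awk 'NR == FNR {
             # First file: format each coordinate triple as 3 x %8.3f (24 characters).
             xyz[FNR] = sprintf("%8.3f%8.3f%8.3f", $1, $2, $3)
             next
         }
         {
             # Second file: keep columns 1-30, swap in 31-54, keep the tail from 55 on.
             print substr($0, 1, 30) xyz[FNR] substr($0, 55)
         }' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' '16.388 -5.760 -23.332' > "$tmp/a.xyz"
printf '%-6s%5s %-4s %3s %1s%4s    %8s%8s%8s%6s%6s          %2s\n' \
    ATOM 1 ' N' ARG '' 1 0.000 0.000 0.000 1.00 0.00 N > "$tmp/a.pdb"
replace_xyz "$tmp/a.xyz" "$tmp/a.pdb"
rm -r "$tmp"
```

Everything outside columns 31-54 is copied through untouched, so the rest of the line keeps its alignment.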
You can try this one
paste -d' ' test4 test5 |awk '{print $1,$2,$3,$4,$5,$12,$13,$14,$9,$10,$11}'

Working with AWK regex

I have a file which has values in the following format:
20/01/2012 01:14:27;UP;UserID;User=bob email=abc#sample.com
I want to pick each value from this file (not the labels). By label I mean that for the string email=abc#sample.com I only want to pick abc#sample.com, and for the string User=bob I only want to pick bob. All the space-separated values are easy to pick, but I am unable to pick the values separated by semicolons. Below is the command I am using in awk:
awk '{print "1=",$1} /;/{print "2=",$2,"3=",$3}' sample_file
In $2 I am getting the complete string up to bob, and the rest of the string is assigned to $3. I could work with awk's substr, but I want to be on the safe side, since the string length may vary.
Can somebody tell me how to design a regex to parse my file?
You can set multiple delimiters using awk -F:
awk -F "[ \t;=]+" '{ print $1, $2, $3, $4, $5, $6, $7, $8 }' file.txt
Results:
value1 value2 value3 value4 label1 value5 label2 value6
EDIT:
You can remove anything before the equal signs using sub (/[^=]*=/,"", $i). This will allow you to just print the 'values':
awk 'BEGIN { FS="[ \t;]+"; OFS=" " } { for (i=1; i<=NF; i++) { sub (/[^=]*=/,"", $i); line = (line ? line OFS : "") $i } print line; line = "" }' file.txt
Results:
20/01/2012 01:14:27 UP UserID bob abc#sample.com
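If only one labelled value is needed, a variation on the same delimiters could look the label up by name instead of printing every field. A sketch; get_value is a hypothetical helper, and it assumes each label=value token contains no spaces or semicolons:

```shell
# Hypothetical helper: get_value LABEL < file
get_value() {
    awk -F '[ \t;]+' -v want="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")                 # position of the first "=", 0 if none
            if (n && substr($i, 1, n - 1) == want)
                print substr($i, n + 1)        # everything after the "="
        }
    }'
}

printf '%s\n' '20/01/2012 01:14:27;UP;UserID;User=bob email=abc#sample.com' | get_value email
```

Using index() and substr() rather than a regex means the label is compared as a plain string.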

Filtering multiline pcregrep match with sed

I have data in multiple text files that look like this:
1 DAEJ X -3120041.6620 -3120042.0476 -0.3856 0.0014
Y 4084614.2137 4084614.6871 0.4734 0.0015
Z 3764026.4954 3764026.7346 0.2392 0.0014
HEIGHT 116.0088 116.6419 0.6332 0.0017 0.0017 8.0
LATITUDE 36 23 57.946407 36 23 57.940907 -0.1699 0.0013 0.0012 57.5 0.0012 62.9
LONGITUDE 127 22 28.131395 127 22 28.132160 0.0190 0.0012 0.0013 2.3 0.0013
and I want to run it through a filter so that the output will look like this:
DAEJ: 36 23 57.940907, 127 22 28.132160, 116.6419
I can do it easily enough with grepWin using named capture by searching for:
(?<site>\w\w\w\w+)<filler>\r\n\r\n<filler>(?<height>\-?\d+\.\d+)<filler>(?<heightRMS>\d+\.\d+)<filler>\r\n<filler>(?<lat>\-?\ *\d+\ +\d+\ +\d+\.\d+)<filler>(?<latRMS>\d+\.\d+)<filler>\r\n<filler>(?<lon>\-?\ *\d+\ +\d+\ +\d+\.\d+)<filler>(?<lonRMS>\d+\.\d+)<filler>
and replacing with (ignore the unreferenced groups, I'll use those in other implementations):
$+{site}: $+{lat}, $+{lon}, $+{height}
but of course, at the cost of doing it manually through a GUI. I was wondering if there's a way to script it by piping pcregrep output to sed for text substitution? I'm aware of the pcregrep -M option to match the multiline regex pattern above, and I've been successful until that point, but I'm stuck with the sed end of the problem.
I would use awk to handle your text file:
awk '$1 ~ /^[0-9]+$/ { printf "%s: ", $2 } $1 == "HEIGHT" { height = $3 } $1 == "LATITUDE" { printf "%s %s %s, ", $5, $6, $7 } $1 == "LONGITUDE" { printf "%s %s %s, %s\n", $5, $6, $7, height }' file.txt
Broken out on multiple lines for readability:
# A line starting with a number carries the site name in field 2
$1 ~ /^[0-9]+$/ {
    printf "%s: ", $2
}
# Remember the second height value until the LONGITUDE line is reached
$1 == "HEIGHT" {
    height = $3
}
# Fields 5-7 hold the second (adjusted) latitude triple
$1 == "LATITUDE" {
    printf "%s %s %s, ", $5, $6, $7
}
$1 == "LONGITUDE" {
    printf "%s %s %s, %s\n", $5, $6, $7, height
}
Results:
DAEJ: 36 23 57.940907, 127 22 28.132160, 116.6419
EDIT:
Put the following code in a file called script.awk:
$3 == "X" {
printf "%s: ", $2
}
$1 == "HEIGHT" {
height = $3
}
$1 == "LATITUDE" {
if ($2 == "-" && $6 == "-") { printf "-%s %s %s, ", $7, $8, $9 }
else if ($2 == "-") { printf "%s %s %s, ", $6, $7, $8 }
else if ($5 == "-") { printf "-%s %s %s, ", $6, $7, $8 }
else { printf "%s %s %s, ", $5, $6, $7 }
}
$1 == "LONGITUDE" {
if ($2 == "-" && $6 == "-") { printf "-%s %s %s, %s\n", $7, $8, $9, height }
else if ($2 == "-") { printf "%s %s %s, %s\n", $6, $7, $8, height }
else if ($5 == "-") { printf "-%s %s %s, %s\n", $6, $7, $8, height }
else { printf "%s %s %s, %s\n", $5, $6, $7, height }
}
Run like this:
awk -f script.awk file.txt
This might work for you (GNU sed):
sed '/^DAEJ/,/^\s*LONGITUDE/!d;/HEIGHT/{s/^\s*\S*\s*\S*\s*\(\S*\).*/\1/;h};/LATITUDE/{s/^\s*\(\S*\s*\)\{4\}\(\(\S*\s*\)\{2\}\S*\).*/\2/;H};/LONGITUDE/!d;s/^\s*\(\S*\s*\)\{4\}\(\(\S*\s*\)\{2\}\S*\).*/ \2/;H;g;y/\n/,/;s/\([^,]*\),\(.*\)/DAEJ: \2, \1/' file1 file2 filen

decimal pattern matching

I have a big file whose lines follow the pattern below:
MDQ[11:15],IO,MDQ[10:14],,,,MDQ[12:16],TPR_AAWD[11:15]
I want to modify this file so that each line is expanded like this:
MDQ[11],IO,MDQ[10],,,,MDQ[12],TPR_AAWD[11]
MDQ[12],IO,MDQ[11],,,,MDQ[13],TPR_AAWD[12]
MDQ[13],IO,MDQ[12],,,,MDQ[14],TPR_AAWD[13]
MDQ[14],IO,MDQ[13],,,,MDQ[15],TPR_AAWD[14]
How can I implement this in sed/awk/perl/csh/vim?
Please help.
awk -F '[][]' '{
split($2, a, /:/)
split($4, b, /:/)
split($6, c, /:/)
split($8, d, /:/)
for (i=0; i < a[2]-a[1]; i++) {
printf("%s[%d]%s[%d]%s[%d]%s[%d]\n",
$1, a[1]+i,
$3, b[1]+i,
$5, c[1]+i,
$7, d[1]+i)
}
}'
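The same idea can be generalised so that the number of bracketed ranges per line is not hard-coded. This is only a sketch: expand_ranges is a hypothetical name, every bracket is assumed to hold a lo:hi pair, and the loop bound is kept the same as above (the upper index itself is not emitted, which matches the four lines in the requested output).

```shell
# Hypothetical helper: expand every [lo:hi] on a line, one output line per step.
expand_ranges() {
    awk -F '[][]' '{
        split($2, r, ":")
        n = r[2] - r[1]                  # number of output lines, from the first range
        for (i = 0; i < n; i++) {
            line = ""
            for (f = 1; f <= NF; f++) {
                if (f % 2 == 0) {        # even fields are the text inside brackets
                    split($f, r, ":")
                    line = line "[" (r[1] + i) "]"
                } else {                 # odd fields are the text between brackets
                    line = line $f
                }
            }
            print line
        }
    }'
}

printf '%s\n' 'MDQ[11:15],IO,MDQ[10:14],,,,MDQ[12:16],TPR_AAWD[11:15]' | expand_ranges
```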
Hope the below helps:
sed -e 's/:[0-9]*//g'