Filtering multiline pcregrep match with sed - regex

I have data in multiple text files that look like this:
1 DAEJ X -3120041.6620 -3120042.0476 -0.3856 0.0014
Y 4084614.2137 4084614.6871 0.4734 0.0015
Z 3764026.4954 3764026.7346 0.2392 0.0014
HEIGHT 116.0088 116.6419 0.6332 0.0017 0.0017 8.0
LATITUDE 36 23 57.946407 36 23 57.940907 -0.1699 0.0013 0.0012 57.5 0.0012 62.9
LONGITUDE 127 22 28.131395 127 22 28.132160 0.0190 0.0012 0.0013 2.3 0.0013
and I want to run it through a filter so that the output will look like this:
DAEJ: 36 23 57.940907, 127 22 28.132160, 116.6419
I can do it easily enough with grepWin using named capture by searching for:
(?<site>\w\w\w\w+)<filler>\r\n\r\n<filler>(?<height>\-?\d+\.\d+)<filler>(?<heightRMS>\d+\.\d+)<filler>\r\n<filler>(?<lat>\-?\ *\d+\ +\d+\ +\d+\.\d+)<filler>(?<latRMS>\d+\.\d+)<filler>\r\n<filler>(?<lon>\-?\ *\d+\ +\d+\ +\d+\.\d+)<filler>(?<lonRMS>\d+\.\d+)<filler>
and replacing with (ignore the unreferenced groups; I'll use those in other implementations):
$+{site}: $+{lat}, $+{lon}, $+{height}
but of course, at the cost of doing it manually through a GUI. I was wondering if there's a way to script it by piping pcregrep output to sed for text substitution? I'm aware of the pcregrep -M option to match the multiline regex pattern above, and I've been successful until that point, but I'm stuck with the sed end of the problem.

I would use awk to handle your text file:
awk '$1 ~ /^[0-9]+$/ { printf "%s: ", $2 } $1 == "HEIGHT" { height = $3 } $1 == "LATITUDE" { printf "%s %s %s, ", $5, $6, $7 } $1 == "LONGITUDE" { printf "%s %s %s, %s\n", $5, $6, $7, height }' file.txt
Broken out on multiple lines for readability:
$1 ~ /^[0-9]+$/ {
printf "%s: ", $2
}
$1 == "HEIGHT" {
height = $3
}
$1 == "LATITUDE" {
printf "%s %s %s, ", $5, $6, $7
}
$1 == "LONGITUDE" {
printf "%s %s %s, %s\n", $5, $6, $7, height
}
Results:
DAEJ: 36 23 57.940907, 127 22 28.132160, 116.6419
EDIT:
Put the following code in a file called script.awk:
$3 == "X" {
printf "%s: ", $2
}
$1 == "HEIGHT" {
height = $3
}
$1 == "LATITUDE" {
if ($2 == "-" && $6 == "-") { printf "-%s %s %s, ", $7, $8, $9 }
else if ($2 == "-") { printf "%s %s %s, ", $6, $7, $8 }
else if ($5 == "-") { printf "-%s %s %s, ", $6, $7, $8 }
else { printf "%s %s %s, ", $5, $6, $7 }
}
$1 == "LONGITUDE" {
if ($2 == "-" && $6 == "-") { printf "-%s %s %s, %s\n", $7, $8, $9, height }
else if ($2 == "-") { printf "%s %s %s, %s\n", $6, $7, $8, height }
else if ($5 == "-") { printf "-%s %s %s, %s\n", $6, $7, $8, height }
else { printf "%s %s %s, %s\n", $5, $6, $7, height }
}
Run like this:
awk -f script.awk file.txt
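As a self-contained check, here is a minimal sketch of the same idea that stores each value in a variable and prints once per record. It assumes the layout shown in the question, with no detached minus signs, and takes the second (estimated) coordinate triple on each line. The temporary file and the variable names are made up for illustration.

```shell
# Hypothetical sample file holding the record from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1 DAEJ X -3120041.6620 -3120042.0476 -0.3856 0.0014
Y 4084614.2137 4084614.6871 0.4734 0.0015
Z 3764026.4954 3764026.7346 0.2392 0.0014
HEIGHT 116.0088 116.6419 0.6332 0.0017 0.0017 8.0
LATITUDE 36 23 57.946407 36 23 57.940907 -0.1699 0.0013 0.0012 57.5 0.0012 62.9
LONGITUDE 127 22 28.131395 127 22 28.132160 0.0190 0.0012 0.0013 2.3 0.0013
EOF

# Remember the pieces as they appear, then print on the LONGITUDE line,
# which is the last line of each record in this layout.
out=$(awk '
$3 == "X"         { site = $2 }
$1 == "HEIGHT"    { height = $3 }
$1 == "LATITUDE"  { lat = $5 " " $6 " " $7 }
$1 == "LONGITUDE" { printf "%s: %s, %s %s %s, %s\n", site, lat, $5, $6, $7, height }
' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

Collecting the values first keeps the output format in a single printf, which is easier to adjust than three partial prints.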

This might work for you (GNU sed):
sed '/^DAEJ/,/^\s*LONGITUDE/!d;/HEIGHT/{s/^\s*\S*\s*\S*\s*\(\S*\).*/\1/;h};/LATITUDE/{s/^\s*\(\S*\s*\)\{4\}\(\(\S*\s*\)\{2\}\S*\).*/\2/;H};/LONGITUDE/!d;s/^\s*\(\S*\s*\)\{4\}\(\(\S*\s*\)\{2\}\S*\).*/ \2/;H;g;y/\n/,/;s/\([^,]*\),\(.*\)/DAEJ: \2, \1/' file1 file2 filen


awk - wrong comparing floating point numbers

echo '"MSE_DB": -20.100000000000001,' | awk '/MSE_DB/ {mse_db = substr($2, 1, length($2)-1)} END {printf("MSE_DB %f ", mse_db); if (mse_db > -22.0)
{print ">-22.0"}; if (mse_db<= -22.0) {print "<= -22.0"} }'
MSE_DB -20.100000 <= -22.0
What am I missing?
I expected to see that -20.1 > -22.
substr() is a string function so the value it returns and stores in mse_db is a string and so you're doing a string comparison (i.e. alphabetic character-by-character), not a numeric comparison.
Add a 0 to the substr() result to make mse_db a number instead of a string:
echo '"MSE_DB": -20.100000000000001,' | awk '/MSE_DB/ {mse_db = substr($2, 1, length($2)-1)+0} END {printf("MSE_DB %f ", mse_db); if (mse_db > -22.0)
{print ">-22.0"}; if (mse_db<= -22.0) {print "<= -22.0"} }'
MSE_DB -20.100000 >-22.0
but you can just get rid of the substr() and add 0 since awk already knows how to strip trailing chars during a numeric conversion:
echo '"MSE_DB": -20.100000000000001,' | awk '/MSE_DB/ {mse_db = $2+0} END {printf("MSE_DB %f ", mse_db); if (mse_db > -22.0)
{print ">-22.0"}; if (mse_db<= -22.0) {print "<= -22.0"} }'
MSE_DB -20.100000 >-22.0
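To see the difference in isolation, here is a small sketch that compares the same value once as a string and once as a number. The shortened input value and the variable names as_string and as_number are just for illustration.

```shell
# substr() returns a string, so the comparison with -22.0 is done as text:
# "-20.1" sorts before "-22" because the character "0" comes before "2".
as_string=$(echo '-20.1,' | awk '{ v = substr($1, 1, length($1) - 1); print (v > -22.0 ? "greater" : "not greater") }')

# Adding 0 converts the value to a number first, so the comparison is numeric.
as_number=$(echo '-20.1,' | awk '{ v = substr($1, 1, length($1) - 1) + 0; print (v > -22.0 ? "greater" : "not greater") }')

printf 'string compare:  %s\nnumeric compare: %s\n' "$as_string" "$as_number"
```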
You can refactor/reduce your awk to this:
awk '/MSE_DB/ {
mse_db = $2+0
}
END {
print "MSE_DB", mse_db, (mse_db > -22.0 ? "> -22.0" : "<= -22.0")
}' <<< '"MSE_DB": -20.100000000000001,'
This will give output:
MSE_DB -20.1 > -22.0
There's nothing wrong with using substr() as long as you prepend a unary "+" to the numeric string to force a numeric comparison (even if the value is negative):
echo 'MSE_DB: -20.100000000000001,' |
{m,g,n}awk '
/MSE_DB/ { mse_db = substr(__=$(++_+_), _, length(__)-_--)
} END {
printf("MSE_DB [ %s ] :: %*s %*.*f\n", mse_db, _+=++_,
(__=_-_*(_+_+_)*_) < +mse_db ? ">" : "<=",--_,__) }'
MSE_DB [ -20.100000000000001 ] :: > -22.0

Awk: From CSV to PDB (Protein Data Bank)

I have a CSV file with this format:
ATOM,3662,H,VAL,A,257,6.111,31.650,13.338,1.00,0.00,H
ATOM,3663,HA,VAL,A,257,3.180,31.995,13.768,1.00,0.00,H
ATOM,3664,HB,VAL,A,257,4.726,32.321,11.170,1.00,0.00,H
ATOM,3665,HG11,VAL,A,257,2.387,31.587,10.892,1.00,0.00,H
And I would like to format it according to PDB standards (fixed position):
ATOM   3662  H   VAL A 257       6.111  31.650  13.338  1.00  0.00           H
ATOM   3663  HA  VAL A 257       3.180  31.995  13.768  1.00  0.00           H
ATOM   3664  HB  VAL A 257       4.726  32.321  11.170  1.00  0.00           H
ATOM   3665 HG11 VAL A 257       2.387  31.587  10.892  1.00  0.00           H
Everything can be considered right-justified except for the first and third columns. The first is not a problem. The third, however, is left-justified when its length is 1-3 but shifted one position to the left when its length is 4.
I have this AWK one-liner that almost does the trick:
awk -F, 'BEGIN {OFS=FS} {if(length($3) == 4 ) {pad=" "} else {pad=" "}} {printf "%-6s%5s%s%-4s%4s%2s%4s%11s%8s%8s%6s%6s%12s\n", $1, $2, $pad, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12}' < 1iy8_min.csv
Except for two things:
The exception of the third column. I was thinking about adding a condition which changes the padding before the third column according to the field length, but I cannot get it to work (the idea is illustrated in the above one-liner).
The other problem is that if there are no spaces between the fields, the padding does not work at all.
ATOM 3799 HH TYR A 267 -5.713 16.149 26.838 1.00 0.00 H
HETATM 3801 O7N NADA12688.285 19.839 10.489 1.00 20.51 O
In the above example, the second line should be:
HETATM 3801 O7N NAD A1268 8.285 19.839 10.489 1.00 20.51 O
But because there is no space between fields 5 and 6, everything gets shuffled. I think A1268 is perceived as a single field, maybe because the default awk delimiter is blank space. Is it possible to make it position-dependent?
UPDATE: The following solves the problem with the exception on the third column:
awk 'BEGIN {FS = ",";OFS = ""} { if(length($3) == 4 ) {pad = sprintf("%s", " ")} else {pad = sprintf("%2s", " ")} } { if(length($3) == 4 ) {pad2 = sprintf("%s", " ")} else {pad2 = sprintf("%s", "")} } {printf "%-6s%5s%s%-4s%s%3s%2s%4s%11s%8s%8s%6s%6s%12s\n", $1, $2, pad, $3, pad2, $4, $5, $6, $7, $8, $9, $10, $11, $12}' 1iy8_min.csv
However, OFS seems to be ignored...
UPDATE2: The problem was in the input file. Sorry about that. Solved.
The working script:
awk 'BEGIN{OFS=FS=","}{$7=sprintf("%.3f",$7)}1{$8=sprintf("%.3f",$8)}1{$9=sprintf("%.3f",$9)}1' ${file} | awk 'BEGIN {FS =","; OFS=""} { if(length($3) == 4 ) {pad = sprintf("%s", " ")} else {pad = sprintf("%2s", " ")} } { if(length($3) == 4 ) {pad2 = sprintf("%s", " ")} else {pad2 = sprintf("%s", "")} } {printf "%-6s%5s%s%-4s%s%3s%2s%4s%12s%8s%8s%6s%6s%12s\n", $1, $2, pad, $3, pad2, $4, $5, $6, $7, $8, $9, $10, $11, $12}' > ${root}_csv.pdb
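For comparison, here is a minimal single-pass sketch that builds the atom-name column explicitly instead of using two pad variables. The column positions are my reading of the PDB ATOM record layout (serial 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54, occupancy 55-60, temperature factor 61-66, element 77-78), so treat them as an assumption and check them against the format specification. Atom names longer than four characters are not handled.

```shell
out=$(printf '%s\n' \
  'ATOM,3662,H,VAL,A,257,6.111,31.650,13.338,1.00,0.00,H' \
  'ATOM,3665,HG11,VAL,A,257,2.387,31.587,10.892,1.00,0.00,H' |
awk -F, '{
  # Names of 1-3 characters start in column 14, 4-character names in column 13.
  name = (length($3) == 4) ? $3 : sprintf(" %-3s", $3)
  # Literal spaces in the format fill the unused columns (12, 17, 21, 27-30, 67-76).
  printf "%-6s%5s %-4s %3s %1s%4s    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s\n",
         $1, $2, name, $4, $5, $6, $7, $8, $9, $10, $11, $12
}')
printf '%s\n' "$out"
```

Because every field has a fixed width in the format string, a four-digit residue number such as 1268 fills columns 23-26 directly after the chain ID without shifting the coordinates.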

AWK if then loop conditional on column values

I have 3 columns, I want to create a 4th column that is equal to the 2nd column only when the 3rd column is equal to 1 (otherwise, the value can be 0).
For example,
4 3 1
would become
4 3 1 3
whereas
4 3 2
would become
4 3 2 0
I tried it 3 ways, in all cases the 4th column is all zeroes:
'BEGIN {FS = "\t"}; {if ($3!=1) last=0; else last=$2} {print $1, $2, $3, last}'
'BEGIN {FS = "\t"}; {if ($3!=1) print $1, $2, $3, 0; else print $1, $2, $3, $2}'
'BEGIN {FS = "\t"}; {if ($3==1) print $1, $2, $3, $2; else print $1, $2, $3, 0}'
awk to the rescue
awk '{$(NF+1)=$3==1?$2:0}1'
$ awk '{print $0, ($3==1?$2:0)}' file
4 3 1 3
4 3 2 0
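The attempts in the question are logically fine, so the input itself is the likely culprit. One possible cause (an assumption, since the question does not show the raw bytes) is Windows line endings: with FS set to a tab, $3 then holds "1" followed by a carriage return, which never equals 1. A sketch that strips the carriage return first:

```shell
# Two tab-separated sample rows with CRLF line endings (made up for the demo).
out=$(printf '4\t3\t1\r\n4\t3\t2\r\n' |
awk 'BEGIN { FS = OFS = "\t" }
{
  sub(/\r$/, "")                 # drop a trailing carriage return, if any
  print $0, ($3 == 1 ? $2 : 0)   # fourth column: $2 when $3 is 1, else 0
}')
printf '%s\n' "$out"
```

On input with plain Unix line endings the sub() call does nothing, so it is harmless to leave in.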

Print only '+' or '-' if string matches (two files)

I would like to print only a '+' or '-' symbol depending on whether a string is found. Basically, I have two files:
Input file 1 (tab-delimited):
HPNK_00457
HPNK_00458
HPNK_00459
Input file 2 (tab-delimited):
HPNK_00457 AAA50325 1e-43 437 28 43 83 ATP-binding protein.
HPNK_00458 P25256 8e-43 429 28 43 82 RecName: Full=Tylosin resistance ATP-binding protein tlrC.
HPNK_00458 CAM96590 1e-42 429 27 42 87 ABC transporter ATP-binding protein [Streptomyces ambofaciens].
Desired output (tab-delimited, maintaining order of strings in file 1):
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
This is what I've been using up to now, but need to update:
while read vl; do grep "^$vl " file2 || printf -- "- -\n" ; done < file1
Thanks, trying to learn everyday here.
Here's one way using awk:
awk 'FNR==NR { a[$1]; next } { print $1, ($1 in a ? "+" : "-" ) }' file2 file1
Results:
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
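Here is the same command as a runnable sketch with tab-separated output, as requested in the question. The temporary directory and the shortened file 2 rows are made up for the demo.

```shell
dir=$(mktemp -d)
printf 'HPNK_00457\nHPNK_00458\nHPNK_00459\n' > "$dir/file1"
printf 'HPNK_00457\tAAA50325\t1e-43\nHPNK_00458\tP25256\t8e-43\nHPNK_00458\tCAM96590\t1e-42\n' > "$dir/file2"

out=$(awk '
BEGIN    { OFS = "\t" }
FNR == NR { seen[$1]; next }                    # first file named (file2): remember its IDs
          { print $1, ($1 in seen ? "+" : "-") } # second file named (file1): look each ID up
' "$dir/file2" "$dir/file1")
printf '%s\n' "$out"
rm -rf "$dir"
```

FNR == NR is only true while awk reads the first file on the command line, which is why file2 is listed first. The output follows the order of file1 because that is the file being printed from.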
You can use:
while read -r line
do
grep -q "$line" f2 && echo "$line +" || echo "$line -"
done < f1
grep -q prints nothing and simply returns true if it matched something; in that case we print the line followed by +, otherwise we print the line followed by -.
It returns:
$ while read -r line; do grep -q "$line" f2 && echo "$line +" || echo "$line -"; done < f1
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
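One caveat with the loop above: grep treats the line as a regular expression and matches it anywhere, so an ID that is a prefix of another ID would also get a +. A hedged variant, assuming your grep supports -w (GNU and BSD grep do; it is not in POSIX), matches the ID as a fixed string and as a whole word. The sample IDs are made up to show the difference.

```shell
dir=$(mktemp -d)
printf 'HPNK_0045\nHPNK_00457\n' > "$dir/f1"
printf 'HPNK_00457\tAAA50325\t1e-43\n' > "$dir/f2"

out=$(while IFS= read -r id; do
  # -F: fixed string, -w: whole word only, -q: no output, just an exit status
  if grep -q -F -w -- "$id" "$dir/f2"; then
    printf '%s\t+\n' "$id"
  else
    printf '%s\t-\n' "$id"
  fi
done < "$dir/f1")
printf '%s\n' "$out"
rm -rf "$dir"
```

Note that -w still matches the ID in any column, not only the first one, and that the loop starts one grep per ID, so the awk approach scales better for large files.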
perl -lane'
BEGIN{ $, ="\t"; $x=shift; @h{ map /(\S+)/, <> } =(); @ARGV=$x }
print @F, exists $h{$F[0]} ? "+" : "-";
' file1 file2
output
HPNK_00457 +
HPNK_00458 +
HPNK_00459 -
Here's the algorithm:
Read file 2. For each line,
Get the first word
Store it in a hash.
Read file 1. For each line, chomp it, then
print $hash{$_}? '+' : '-'
I can write the code for you, but if you want to learn every day, writing it yourself will be a useful exercise.
This simple Perl script should do the work
#!/usr/local/bin/perl
## f1 and f2 are the 2 files containing your input data
use strict;
use warnings;
open my $fh1, '<', 'f1' or die "f1: $!";
open my $fh2, '<', 'f2' or die "f2: $!";
my @file1data = <$fh1>;
my @file2data = <$fh2>;
foreach my $data (@file1data) {
chomp($data);
# search every line of file 2 for the ID at the start of the line
if (grep { /^\Q$data\E\s/ } @file2data) {
print $data . " " . "+\n";
}
else {
print $data . " " . "-\n";
}
}
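awk 'FNR==NR{a[$1];next}   # first file: remember each ID
{b[$1]=1}                  # second file: mark the IDs that occur
END{
for(i in a)                # note: for-in order is unspecified
if(b[i]){print i,"+"}
else{print i,"-"}
}' file1 file2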

Working with AWK regex

I have a file containing values in the following format:
20/01/2012 01:14:27;UP;UserID;User=bob email=abc@sample.com
I want to pick each value from this file (not the labels). By label I mean that for the string email=abc@sample.com I only want abc@sample.com, and for the string User=bob I only want bob. The space-separated values are easy to pick, but I am unable to pick the values separated by semicolons. Below is the awk command I am using:
awk '{print "1=",$1} /;/{print "2=",$2,"3=",$3}' sample_file
In $2, I get the complete string up to bob, and the rest of the string is assigned to $3. I could use awk's substr, but I want to be on the safe side because the string length may vary.
Can somebody tell me how to design a regex to parse my file?
You can set multiple delimiters using awk -F:
awk -F "[ \t;=]+" '{ print $1, $2, $3, $4, $5, $6, $7, $8 }' file.txt
Results:
value1 value2 value3 value4 label1 value5 label2 value6
EDIT:
You can remove anything before the equal signs using sub (/[^=]*=/,"", $i). This will allow you to just print the 'values':
awk 'BEGIN { FS="[ \t;]+"; OFS=" " } { for (i=1; i<=NF; i++) { sub (/[^=]*=/,"", $i); line = (line ? line OFS : "") $i } print line; line = "" }' file.txt
Results:
20/01/2012 01:14:27 UP UserID bob abc@sample.com
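If you would rather address values by their label than by position, here is a small sketch that splits each label=value field with split() and stores the result in an array. The array names kv and val are made up for illustration, and the sample address is assumed to contain an @ sign.

```shell
out=$(echo '20/01/2012 01:14:27;UP;UserID;User=bob email=abc@sample.com' |
awk -F'[ \t;]+' '{
  for (i = 1; i <= NF; i++)
    if (split($i, kv, "=") == 2)   # field has the form label=value
      val[kv[1]] = kv[2]           # store the value under its label
  print val["User"], val["email"]
}')
printf '%s\n' "$out"
```

This keeps working if the label=value pairs change order, as long as the labels themselves stay the same.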