awk if statement and pattern matching - regex

I have the following input file:
##Names
##Something
FVEG_04063 1265 . AA ATTAT DP=19
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
My desired output file:
##Names
##Something
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
Explanation: I want to print to my output file all the lines beginning with "#" and all the lines that are "unique" with respect to column 1. If column 1 has repeated hits, first take the number in $2 and add the length of $5 (on the same line). If the result is smaller than the $2 of the next line, print both lines; BUT if the result is bigger than the $2 of the next line, compare the DP values and print only the line with the best DP.
What I've tried:
awk '/^#/ {print $0;} arr[$1]++; END {for(i in arr){ if(arr[i]>1){ HERE I NEED TO INTRODUCE MORE 'IF' I THINK... } } { if(arr[i]==1){print $0;} } }' file.txt
I'm new to the awk world... I think it would be simpler to write a small multi-line script, or maybe a bash solution would be better.
Thanks in advance

As requested, an awk solution. I have commented the code heavily, so hopefully the comments will serve as explanation. As a summary, the basic idea is to:
Match comment lines, print them, and go to the next line.
Match the first line (done by checking whether we have started remembering col1 yet).
On all subsequent lines, check values against the remembered values from the previous line. The "best" record, ie. the one that should be printed for each unique ID, is remembered each time and updated depending on conditions set forth by the question.
Finally, output the last "best" record of the last unique ID.
Code:
# Print lines starting with '#' and go to next line.
/^#/ { print $0; next; }
# Set up variables on the first line of input and go to next line.
! col1 { # If col1 is unset:
    col1 = $1;
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0; # Note dp is turned into int here by +0
    best = $0;
    next;
}
# For all other lines of input:
{
    # If col1 is the same as previous line:
    if ($1 == col1) {
        # Check col2
        if (len5 + col2 < $2) # Previous len5 + col2 < current $2
            print best; # Print previous record
        # Check DP
        else if (substr($6, 4) + 0 < dp) # Current dp < previous dp:
            next; # Go to next record, do not update variables.
    }
    else { # Different ids, print best line from previous id and update id.
        print best;
        col1 = $1;
    }
    # Update variables to current record.
    col2 = $2;
    len5 = length($5);
    dp = substr($6, 4) + 0;
    best = $0;
}
# Print the best record of the last id.
END { print best }
Note: dp is calculated by taking the sub-string of $6 starting at index 4 and going to the end. The + 0 is added to force the value to be converted to an integer, to ensure the comparison will work as expected.
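To see why the + 0 matters, here is a minimal sketch (the function name compare_dp and the two DP values are made up for illustration). substr() returns a string, so without + 0 awk compares the two values as text rather than as numbers:

```shell
# Hypothetical helper: compare two DP values as strings and as numbers.
compare_dp() {
  awk 'BEGIN {
    a = substr("DP=9", 4)    # "9"  (a string)
    b = substr("DP=45", 4)   # "45" (a string)
    as_string = (a < b)          # string comparison: "9" sorts after "45"
    as_number = (a + 0 < b + 0)  # numeric comparison: 9 < 45
    print as_string, as_number
  }'
}
compare_dp   # prints: 0 1
```

The string comparison says 9 is not smaller than 45, which is exactly the kind of surprise the + 0 in the answer avoids.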

Perl solution. You might need to fix the border cases as you didn't provide data to test them.
@last remembers the last line, @F is the current line.
#!/usr/bin/perl
use warnings;
use strict;
my (@F, @last);
while (<>) {
    # Print comment lines as-is; they never become @last.
    if (/^#/) { print; next }
    @F = split;
    if (@last) {
        if ($last[0] eq $F[0]) {
            if ($F[1] + length($F[4]) > $last[1] + length($last[4])) {
                print "@last\n";
            } else {
                my $dp_l = $last[5];
                my $dp_f = $F[5];
                s/DP=// for $dp_l, $dp_f;
                if ($dp_l > $dp_f) {
                    @F = @last;    # keep the previous line, it has the better DP
                }
            }
        } else {
            print "@last\n";
        }
    }
    @last = @F;
}
print "@last\n" if @last;

Related

perl - range between with regex

I have a file like
$ cat num_range.txt
rate1, rate2, rate3, rate4, rate5
pay1, pay2, rate1, rate2, rate3, rate4
rev1, rev2
rate2, rate3, rate4
And I need to filter the comma-separated rows by matching against a prefix and a numeric range.
For example - if the input is "rate" and range is 2 to 5, then I should get
rate2, rate3, rate4, rate5
rate2, rate3, rate4
rate2, rate3, rate4
If it is 5 to 10, then I should get
rate5
when I use perl -ne ' while ( /rate(\d)/g ) { print "$&," } ; print "\n" ' num_range.txt I get all the matches for the prefix,
But below one is not working.
perl -ne ' while ( /rate(\d){2,5}/g ) { print "$&," } ; print "\n" ' num_range.txt
A straightforward way
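perl -wlnE'
    my @keep = grep { /rate([0-9]+)/ and $1 >= 2 and $1 <= 5 } split /\s*,\s*/;
    say join ",", @keep if @keep;
' file
The hard-coded keyword rate and limits (2 and 5) can of course be variables set from input. The -l switch strips the newline from each input line, so each output line ends with exactly one newline whether or not its last field matches.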
Your code does nothing to compare the matched number to the range.
Also, you are gratuitously printing a comma after the last entry.
Try this instead.
perl -ne '$sep = ""; while (/(rate(\d+))/g ) {
if ($2 >= 2 and $2 <= 5) {
print "$sep$1"; $sep=", ";
}
}
print "\n" if $sep' num_range.txt
Notice also how \d+ is used to match any number after rate and extracted into a separate numeric comparison. This is slightly clumsy in isolation, but easy to adapt to different number ranges.
To explain why your code isn't working:
/rate(\d){2,5}/g
This doesn't do what you think it does. The {x,y} quantifier sets how many times the preceding item (here the group (\d)) may repeat; it is not a numeric range.
So this matches "the string 'rate' followed by between 2 and 5 digits". And that won't match anything in your data.
This does the job:
perl -anE '@rates=();while(/rate(\d+)/g){push @rates,$& if $1>=2 && $1<=5}say"@rates" if @rates' file.txt
Output:
rate2 rate3 rate4 rate5
rate2 rate3 rate4
rate2 rate3 rate4
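The quantifier behaviour explained above can be checked quickly from the shell. This is only a sketch: it uses grep -E as a stand-in for the Perl regex (so \d is written as [0-9]), and the function name match_quantifier and the test words are made up for illustration.

```shell
# Hypothetical helper: keep only lines where "rate" is followed by 2 to 5 digits.
match_quantifier() {
  grep -E 'rate[0-9]{2,5}'
}
printf '%s\n' 'rate2' 'rate5' 'rate23' 'rate12345' | match_quantifier
# prints:
# rate23
# rate12345
```

rate2 and rate5 are rejected because each has only one digit, which is why the original attempt printed nothing for the sample file.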

Match only very first occurrence of a pattern using awk

I am trying to follow the solution at
Moving matching lines in a text file using sed
The situation is that pattern2 needs to be applied just once in the whole file. How can I change the following to get this done
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ {t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
Here is the file on which I applied the pattern2 (RELOC_DIR)
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
RELOC_DIR = /i/j/k/l/m
asdasd3
asdasd4
DEFAULTS {
asdasd6
$(RELOC_DIR)/some other text1
$(RELOC_DIR)/some other text2
$(RELOC_DIR)/some other text3
$(RELOC_DIR)/some other text4
This is the result: the last 4 lines of the file were dropped because they also match pattern2.
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
asdasd3
asdasd4
DEFAULTS {
RELOC_DIR = /i/j/k/l/m
asdasd6
I am assuming you need to check pattern2 along with some other condition; if that is the case, then try:
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && /check_other_text_in_current_line/{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
The above checks that the string check_other_text_in_current_line (a placeholder; replace it with your actual string) is present on the same line as pattern2. If this is not what you are looking for, please post samples of input and expected output in your question.
OR, in case you only want the first match of pattern2 in Input_file to count, try the following. It treats only the very first match of pattern2 as the line to move; later matches are left in place. (Since samples were not provided by the OP, this code is written only for the specific pattern matching asked about.)
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && ++count==1{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
OR
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && !found2{t[2]=$0;found2=1;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
EDIT: My 2nd solution looks like the one that fits the OP's ask, but since the complete requirement is not given, here is code that only prints the first occurrence of pattern2 (the string RELOC_DIR).
awk '/RELOC_DIR/ && ++count==1{print}' Input_file
RELOC_DIR = /i/j/k/l/m
OR
awk '!found2 && /RELOC_DIR/ { t[2]=$0; found2=1; print}' Input_file
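Putting the first-match flag together with the move, here is a minimal sketch for the sample file in the question. The function name move_first_match is hypothetical, the patterns RELOC_DIR and DEFAULTS are taken from the sample, and it assumes the line to move appears before the target line.

```shell
# Hypothetical helper: move only the FIRST line matching RELOC_DIR
# to just after the line matching DEFAULTS; later matches stay in place.
move_first_match() {
  awk '
    /RELOC_DIR/ && !found { held = $0; found = 1; next }  # hold the first match only
    { print }                                             # everything else is printed as-is
    /DEFAULTS/ && found { print held }                    # re-emit the held line after the target
  ' "$@"
}
printf '%s\n' 'asdasd0' 'RELOC_DIR = /i/j/k/l/m' 'DEFAULTS {' '$(RELOC_DIR)/some other text1' | move_first_match
# prints:
# asdasd0
# DEFAULTS {
# RELOC_DIR = /i/j/k/l/m
# $(RELOC_DIR)/some other text1
```

Because the flag is checked before the line is held, the later $(RELOC_DIR) lines fall through to the plain print rule instead of being swallowed.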

Regex to replace a character if is unique

I need help, please, with a script that uses a regex to fix a big text file under Linux (with sed, for example). My records look like:
1373350|Doe, John|John|Doe|||B|Acme corp|...
1323350|Simpson, Homer|Homer|Simpson|||3|Moe corp|...
I need to check whether the 7th column consists of a single character (a letter or a number) and, if so, append the second column without the comma. I mean:
1373350|Doe, John|John|Doe|||B Doe John|Acme corp|...
1323350|Simpson, Homer|Homer|Simpson|||3 Simpson Homer|Moe corp|...
Any help? Thanks!
Awk is better suited for this job:
awk -F '|' 'BEGIN { OFS = FS } length($7) == 1 { x = $2; sub(/,/, "", x); $7 = $7 " " x } 1' filename
That is:
BEGIN { OFS = FS } # output separated the same way as the input
length($7) == 1 { # if the 7th field is one character long
x = $2 # make a copy of the second field
sub(/,/, "", x) # remove comma from it
$7 = $7 " " x # append it to seventh field
}
1 # print line
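As a quick check of the steps above, here is the same awk program wrapped in a shell function and run on two records. The function name append_name is hypothetical, and the second record (with a two-character 7th field) is invented to show a line that is left untouched.

```shell
# Hypothetical wrapper around the awk program from the answer.
append_name() {
  awk -F '|' 'BEGIN { OFS = FS }
    length($7) == 1 { x = $2; sub(/,/, "", x); $7 = $7 " " x }
    1' "$@"
}
printf '%s\n' \
  '1373350|Doe, John|John|Doe|||B|Acme corp' \
  '1373351|Roe, Jane|Jane|Roe|||BB|Other corp' | append_name
# prints:
# 1373350|Doe, John|John|Doe|||B Doe John|Acme corp
# 1373351|Roe, Jane|Jane|Roe|||BB|Other corp
```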

What does this awk sentence mean?

I have the following sentence in awk
$ gawk '$2 == "-" { print $1 }' file
I was wondering what exactly this instruction does, because I can't extract exactly the words I need.
Edit: How can I skip the lines before the following asterisks?
Let's say I have the following lines:
text
text
text
* * * * * * *
line1 -
line2 -
And then I want to filter just
line1
line2
with the sentence I posted above...
$ gawk '$2 == "-" { print $1 }' file
Thanks for your time and response!
This will find all lines on which the second column (Separated by spaces) is a -, and will then print the first column.
The first part ($2 == "-") checks for the second column being a -, and then if that is the case, runs the attached {} block, which prints the first column ($0 being the whole line, and $1, $2, etc being the first, second, ... columns.)
Spaces are the separator here simply because they are the default separator in awk.
Edit: To do what you want to do now, try the following (not the most elegant, but it should work). The flag p stays set once the line of asterisks has been seen, so later lines that do not match cannot switch it off again:
gawk 'BEGIN { p = 0 } { if (p && $2 == "-") { print $1 } else if ($0 == "* * * * * * *") { p = 1 } }' file
Spread over more lines for clarity on what's happening:
gawk 'BEGIN { p = 0 }
{ if (p && $2 == "-")
    { print $1 }
  else if ($0 == "* * * * * * *")
    { p = 1 }
}' file
Answer to the original question:
If the second column in a line from the file matches the string "-" then it prints out the first column of the line, columns are by default separated by spaces.
This would match and print out one:
one - two three
This would not:
one two three four
Answer to the revised question:
This code should do what you need after the match of the given string:
awk '/\* \* \* \* \* \* \*/{i++}i && $2 == "-" { print $1 }' data2.txt
Testing on this data gives the following output:
2two
2two
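For the sample lines in the question, the same flag idea can be sketched as follows. The function name after_marker is hypothetical, and the extra "skip -" and "note" lines are added only to show what gets filtered out.

```shell
# Hypothetical helper: print $1 of lines whose second field is "-",
# but only after the line of asterisks has been seen.
after_marker() {
  awk '
    $0 == "* * * * * * *" { seen = 1; next }  # remember the separator, do not print it
    seen && $2 == "-"     { print $1 }
  ' "$@"
}
printf '%s\n' 'text' 'skip -' '* * * * * * *' 'line1 -' 'note' 'line2 -' | after_marker
# prints:
# line1
# line2
```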

Parsing input to get specific values

I have input like this:
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone";
It appears as a continuous line, there are no line breaks. I need the
largest value out of the values between [ and the first occurrence of
|. In this case, for example, the largest value is 204. Once
that is obtained, I want to print the contents of that element
between []. In this case, it would be "204|0|{A=9,B=201,C=61,D=11}|Calculator".
I've tried something like this, but it is not going anywhere:
my @array1;
my $data = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone";
my $high = 0;
my @values = split(/\[([^\]]+)\]/,$data) ;
print "Values is @values \n";
foreach (@values) {
    # I want the value that precedes the first occurrence of | in each array
    # element, i.e. 0,0,196,204, etc.
    my ($conf,$rest)= split(/\|/,$_);
    print "Conf is $conf \n";
    print "Rest is $rest \n";
    push(@array1, $conf);
    push (@array2, $rest);
    print "Array 1 is @array1 \n";
    print "Array 2 is @array2 \n";
}
$conf = highest(@array1);
my $i=0;
# I want the index value of the element that contains the highest conf value,
# in this case 204.
for (@myarray1) { last if $conf eq $_; $i++; };
print "$conf=$i\n";
# I want to print the rest of the string that was split in the same index
# position.
$rest = @array2[$i];
print "Rest is $rest \n";
# To get the highest conf value
sub highest {
    my @data = @_;
    my $high = 0;
    for(@data) {
        $high = $_ if $_ > $high;
    }
    $high;
}
Maybe I should be using a different approach. Could someone help me, please?
One way of doing it:
#!/usr/bin/perl
use strict;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]";
my @parts = split(/\]/, $s);
my $max = 0;
my $data = "";
foreach my $part (@parts) {
    if ($part =~ /\[(\d+)/) {
        if ($1 > $max) {
            $max = $1;
            $data = substr($part, 1);
        }
    }
}
print $data."\n";
A couple of notes:
you can split your original string by \], so you get parts like [0|0|{A=145,B=2,C=12,D=18}|!
then you parse each part to get the integer after the initial [
the rest is easy: keep track of the biggest integer and of the corresponding part, and output it at the end.
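The same three steps can also be sketched from the shell with tr and awk. This is only an illustration: the function name max_element is hypothetical, the sample string is shortened, and it assumes the input contains only bracketed elements (any text outside the brackets would be treated as an element too).

```shell
# Hypothetical helper: print the bracketed element with the largest leading integer.
max_element() {
  tr '[]' '\n\n' | awk -F '|' '
    # Consider only lines whose first field is a plain integer.
    $1 ~ /^[0-9]+$/ && $1 + 0 > max { max = $1 + 0; best = $0 }
    END { if (best != "") print best }
  '
}
printf '%s\n' '[0|0|{A=1}|!][204|0|{A=9}|Calculator][66|0|{A=2}|phone]' | max_element
# prints: 204|0|{A=9}|Calculator
```

Like the Perl version, this starts the maximum at 0, so it prints nothing if every leading integer is 0.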
In shell script:
#!/bin/bash
MAXVAL=$(tr '[' '\n' < /tmp/data | cut -d'|' -f1 | sort -n | tail -1)
tr '[]' '\n\n' < /tmp/data | grep "^$MAXVAL|"
The first line cuts your big mass of data into lines, extracts just the first field, sorts it and takes the max. The second line cuts the data into lines again and greps for that max val.
If you have a LOT of data, this could be slow, so you could put the "lined" data into a temp file or something.
split() is the Right Tool when you know what you want to throw away. Capturing or m//g is the Right Tool when you know what you want to keep. (paraphrased from a Randal Schwartz quote).
You want to specify what to keep (between square brackets) rather than what to throw away (nothing!).
Luckily, your data is "hash shaped" (ie. alternating keys and values), so load it into a hash, sort the keys, and output the value for the highest key:
my %data = $data =~ /\[
(\d+) # digits are the keys
([^]]+) # rest are the values
\]/gx;
my($highest) = sort {$b <=> $a} keys %data; # inefficient if $data is big
print $highest, $data{$highest}, "\n";
Another way of doing this :
#!/usr/bin/perl
use strict;
my $str = '[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone';
my $maxval = 0;
my $pattern;
while ( $str =~ /(\[(\d+)\|.+?\])/g)
{
    if ( $maxval < $2 ) {
        $maxval = $2;
        $pattern = $1;
    }
}
print "Maximum value = $maxval and the associate pattern = $pattern \n";
# In this example $maxval = 204
# and $pattern = [204|0|{A=9,B=201,C=61,D=11}|Calculator]