Match only very first occurrence of a pettern using awk - regex

I am trying to follow the solution at
Moving matching lines in a text file using sed
The situation is that pattern2 needs to be applied just once in the whole file. How can I change the following to get this done
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ {t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
Here is the file on which I applied the pattern2 (RELOC_DIR)
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
RELOC_DIR = /i/j/k/l/m
asdasd3
asdasd4
DEFAULTS {
asdasd6
$(RELOC_DIR)/some other text1
$(RELOC_DIR)/some other text2
$(RELOC_DIR)/some other text3
$(RELOC_DIR)/some other text4
and the last 4 lines of the file got deleted because of the match.
asdasd0
-SRC_OUT_DIR = /a/b/c/d/e/f/g/h
asdasd3
asdasd4
DEFAULTS {
RELOC_DIR = /i/j/k/l/m
asdasd6

I am assuming you need to check pattern2 along with some other condition if this is the case then try.
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && /check_other_text_in_current_line/{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
Above is checking check_other_text_in_current_line string(which is a sample and you could change it as per your actual string) is present along with pattern2 also in same line. If this si not what you are looking for then please post samples of input and expected output in your post.
OR in case you are looking that only 1st match for pattern2 in Input_file and skip all others then try. It will only print very first match for pattern2 and skip all others.(since samples are not provied by OP so this code is written only for the ask of specific pattern matching)
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && ++count==1{t[2]=$0;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
OR
awk '/pattern1/ {t[1]=$0;next}
/pattern2/ && !found2{t[2]=$0;found2=1;next}
/pattern3/ {t[3]=$0;next}
/target3/ { print t[3] }
1
/target1/ { print t[1] }
/target2/ { print t[2] }' file
EDIT: Though my 2nd solution looks like should be the one as per OP's ask but complete picture of requirement is not given so adding code only for printing Pattern2(string RELOC_DIR)'s first occurence here.
awk '/RELOC_DIR/ && ++ count==1{print}' Input_file
RELOC_DIR = /i/j/k/l/m
OR
awk '!found2 && /RELOC_DIR/ { t[2]=$0; found2=1; print}' Input_file

Related

awk sub with a capturing group into the replacement

I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
addendum: Missing numbers should be best kept unchanged 1/4/2/3 = apple/pear/orange/3 to prevent loss of info.
Methodology:
Build an associative array key[$1] = $2 for file1
capture all characters between the slashes and replace them by matching to the key of associative array eg key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, results are
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.
You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note that if numbers from file2 don't exist in key array then it will make those fields empty.
file1 FS='/' file2 will keep default field separators for file1 but will use / as field separator while reading file2.
EDIT: In case you don't have a match in file2 from file and you want to keep original value as it is then try following:
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
}
print val
}
' file1 FS="/" file2
With your shown samples please try following.
awk '
FNR==NR{
arr[$1]=$2
next
}
{
val=""
for(i=1;i<=NF;i++){
val = (val=="" ? "" : val FS) arr[$i]
}
print val
}
' file1 FS="/" file2
Explanation: Reading Input_file1 first and creating array arr with index of 1st field and value of 2nd field then setting field separator as / and traversing through each field os file2 and saving its value in val; printing it at last for each line.
Like #Sundeep comments in the comments, you can't use backreference as an array index. You could mix match and gensub (well, I'm using sub below). Not that this would be anywhere suggested method but just as an example:
$ awk '
NR==FNR {
k[$1]=$2 # hash them
next
}
{
while(match($0,/[0-9]+/)) # keep doing it while it lasts
sub(/[0-9]+/,k[substr($0,RSTART,RLENGTH)]) # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a neverending loop.
With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.
Another approach using awk without loop:
awk 'FNR==NR{
a[$1]=$2;
next
}
$1 in a{
printf("%s%s",FNR>1 ? RS: "",a[$1])
}
END{
print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple

Match strings in two files using awk and regexp

I have two files.
File 1 includes various types of SeriesDescriptions
"SeriesDescription": "Type_*"
"SeriesDescription": "OtherType_*"
...
File 2 contains information with only one SeriesDescription
"Name":"Joe"
"Age":"18"
"SeriesDescription":"Type_(Joe_text)"
...
I want to
compare the two files and find the lines that match for "SeriesDescription" and
print the line number of the matched text from File 1.
Expected Output:
"SeriesDescription": "Type_*" 24 11 (the correct line numbers in my files)
"SeriesDescription" will always be found on line 11 of File 2. I am having trouble matching given the * and have also tried changing it to .* without luck.
Code I have tried:
grep -nf File1.txt File2.txt
Successfully matches, but I want the line number from File1
awk 'FNR==NR{l[$1]=NR; next}; $1 in l{print $0, l[$1], FNR}' File2.txt File1.txt
This finds a match and prints the line number from both files, however, this is matching on the first column and prints the last line from File 1 as the match (since every line has the same column 1 for File 1).
awk 'FNR==NR{l[$2]=$3;l[$2]=NR; next}; $2 in l{print $0, l[$2], FNR}' File2.txt File1.txt
Does not produce a match.
I have also tried various settings of FS=":" without luck. I am not sure if the trouble is coming from the regex or the use of "" in the files or something else. Any help would be greatly appreciated!
With your shown samples, please try following. Written and tested in GNU awk, should work in any awk.
awk '
{ val="" }
match($0,/^[^_]*_/){
val=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
if(val){
arr[val]=$0 OFS FNR
}
next
}
(val in arr){
print arr[val] OFS FNR
}
' SeriesDescriptions file2
With your shown samples output will be:
"SeriesDescription": "Type_*" 1 3
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{ val="" } ##Nullifying val here.
match($0,/^[^_]*_/){ ##Using match to match value till 1st occurrence of _ here.
val=substr($0,RSTART,RLENGTH) ##Creating val which has sub string of above matched regex.
gsub(/[[:space:]]+/,"",val) ##Globally substituting spaces with NULL in val here.
}
FNR==NR{ ##This will execute when first file is being read.
if(val){ ##If val is NOT NULL.
arr[val]=$0 OFS FNR ##Create arr with index of val, which has value of current line OFS and FNR in it.
}
next ##next will skip all further statements from here.
}
(val in arr){ ##Checking if val is present in arr then do following.
print arr[val] OFS FNR ##Printing arr value with OFS, FNR value.
}
' SeriesDescriptions file2 ##Mentioning Input_file name here.
Bonus solution: If above is working fine for you AND you have this match only once in your file2 then you can exit from program to make it quick, in that case have above code in following way.
awk '
{ val="" }
match($0,/^[^_]*_/){
val=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]+/,"",val)
}
FNR==NR{
if(val){
arr[val]=$0 OFS FNR
}
next
}
(val in arr){
print arr[val] OFS FNR
exit
}
' SeriesDescriptions file2

awk convert number to date format to select line bigger than specific mmyy

INPUT:
test,1120,1
test,1219,2
Expected Output
test,1120,1
Goal: trying to print line where $2 which is mmyy format is bigger than 1020 as example.
I've tried with the following:
awk -F, '{ if ( $2 > 1020 ) { print $0 }}' file that's will not give the expected output because it's still number etc.. 1219 is bigger than 1020.
Assuming the 2nd field always contains 4 digits, how about:
awk -F, 'substr($2, 3, 2) substr($2, 1, 2) > 2010' input
Please note that I have interpreted the word bigger as later, meaning 0921 is bigger than 1020. If my assumption is incorrect, please let me know.
EDIT: Since OP mentioned that now if dates require lesser than provided input in that case one could try following.
awk -v val="1020" '
BEGIN{
FS=OFS=","
user_year=substr(val,3)
user_month=substr(val,1,2)
}
{
year=substr($2,3)
month=substr($2,1,2)
if(year==user_year){
if(month<user_month){
print
}
}
else if(year<user_year){
print
}
}
' Input_file
Could you please try following. I have create a variable named val here which will have value which user needs to compare to all the lines of Input_file. In this case it is set to 1020
awk -v val="1020" '
BEGIN{
FS=OFS=","
user_year=substr(val,3)
user_month=substr(val,1,2)
}
{
year=substr($2,3)
month=substr($2,1,2)
if(year==user_year){
if(month>user_month){
print
}
}
if(year>user_year){
print
}
}
' Input_file

AWK script to check first line of a file and then print the rest

I am trying to write an AWK script to parse a file of the form
> field1 - field2 field3 ...
lineoftext
anotherlineoftext
anotherlineoftext
and I am checking using regex if the first line is correct (begins with a > and then has something after it) and then print all the other lines. This is the script I wrote but it only verifies that the file is in a correct format and then doesn't print anything.
#!/bin/bash
# FASTA parser
awk ' BEGIN { x = 0; }
{ if ($1 !~ />.*/ && x == 0)
{ print "Not a FASTA file"; exit; }
else { x = 1; next; }
print $0 }
END { print " - DONE - "; }'
Basically you can use the following awk command:
awk 'NR==1 && /^>./ {p=1} p' file
On the first row NR==1 it checks whether the line starts with a > followed by "something" (/^>./). If that condition is true the variable p will be set to one. The p at the end checks whether p evaluates true and prints the line in that case.
If you want to print the error message, you need to revert the logic a bit:
awk 'NR==1 && !/^>./ {print "Not a FASTA file"; exit 1} 1' file
In this case the program prints the error messages and exits the program if the first line does not start with a >. Otherwise all lines gets printed because 1 always evaluates to true.
For this OP literally
awk 'NR==1{p=$0~/^>/}p' YourFile
# shorter version with info of #EdMorton
awk 'NR==1{p=/^>/}p' YourFile
for line after > (including)
awk '!p{p=$0~/^>/}p' YourFile
# shorter version with info of #EdMorton
awk '!p{p=/^>/}p' YourFile
Since all you care about is the first line, you can just check that, then exit.
awk 'NR > 1 { exit (0) }
! /^>/ { print "Not a FASTA file" >"/dev/stderr"; exit (1) }' file
As noted in comments, the >"/dev/stderr" is a nonportable hack which may not work for you. Regard it as a placeholder for something slightly more sophisticated if you want a tool which behaves as one would expect from a standard Unix tool (run silently if no problems; report problems to standard error).

awk if statement and pattern matching

I have the next input file:
##Names
##Something
FVEG_04063 1265 . AA ATTAT DP=19
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
My desired output file:
##Names
##Something
FVEG_04063 1266 . AA ATTA DP=45
FVEG_04063 2703 . GTTTTTTTT ATA DP=1
FVEG_15672 2456 . TTG AA DP=71
FVEG_01111 300 . CTATA ATATA DP=7
FVEG_01111 350 . AGAC ATATATG DP=41
Explanation: I want to print in my output file, all the lines begining with "#", all the "unique" lines attending to column 1, and if I have repeated hits in column 1, first: take the number in $2 and sum to length of $5 (in same line), if the result is smaller than the $2 of next line, print both lines; BUT if the result is bigger than the $2 of next line, compare the values of DP and only print the line with best DP.
What I've tried:
awk '/^#/ {print $0;} arr[$1]++; END {for(i in arr){ if(arr[i]>1){ HERE I NEED TO INTRODUCE MORE 'IF' I THINK... } } { if(arr[i]==1){print $0;} } }' file.txt
I'm new in awk world... I think that is more simple to do a little script with multiple lines... or maybe is better a bash solution.
Thanks in advance
As requested, an awk solution. I have commented the code heavily, so hopefully the comments will serve as explanation. As a summary, the basic idea is to:
Match comment lines, print them, and go to the next line.
Match the first line (done by checking if whether we have started remembering col1 yet).
On all subsequent lines, check values against the remembered values from the previous line. The "best" record, ie. the one that should be printed for each unique ID, is remembered each time and updated depending on conditions set forth by the question.
Finally, output the last "best" record of the last unique ID.
Code:
# Print lines starting with '#' and go to next line.
/^#/ { print $0; next; }
# Set up variables on the first line of input and go to next line.
! col1 { # If col1 is unset:
col1 = $1;
col2 = $2;
len5 = length($5);
dp = substr($6, 4) + 0; # Note dp is turned into int here by +0
best = $0;
next;
}
# For all other lines of input:
{
# If col1 is the same as previous line:
if ($1 == col1) {
# Check col2
if (len5 + col2 < $2) # Previous len5 + col2 < current $2
print best; # Print previous record
# Check DP
else if (substr($6, 4) + 0 < dp) # Current dp < previous dp:
next; # Go to next record, do not update variables.
}
else { # Different ids, print best line from previous id and update id.
print best;
col1 = $1;
}
# Update variables to current record.
col2 = $2;
len5 = length($5);
dp = substr($6, 4) + 0;
best = $0;
}
# Print the best record of the last id.
END { print best }
Note: dp is calculated by taking the sub-string of $6 starting at index 4 and going to the end. The + 0 is added to force the value to be converted to an integer, to ensure the comparison will work as expected.
Perl solution. You might need to fix the border cases as you didn't provide data to test them.
#last remembers the last line, #F is the current line.
#!/usr/bin/perl
use warnings;
use strict;
my (#F, #last);
while (<>) {
#F = split;
print and next if /^#/ or not #last;
if ($last[0] eq $F[0]) {
if ($F[1] + length $F[4] > $last[1] + length $last[4]) {
print "#last\n";
} else {
my $dp_l = $last[5];
my $dp_f = $F[5];
s/DP=// for $dp_l, $dp_f;
if ($dp_l > $dp_f) {
#F = #last;
}
}
} else {
print "#last\n" if #last;
}
} continue {
#last = #F;
}
print "#last\n";