AWK negative regular expression with variable - regex

I am using awk in a bash script to compare two files to get just the not-matching lines.
I need to compare all three fields of the second file (as one pattern?) with all lines of the first file:
First file:
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 628885 635117 HumanGM18558_peak_2 2509 . 83.77238 255.95094 250.99944 5270
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
Second file:
chr1 628885 635117
chr1 1250086 1250413
chr1 16613629 16613934
chr1 16644496 16644800
chr1 16895871 16896489
chr1 16905126 16905616
The current idea is to load one file into an array and use AWK's negated regular expression match to compare.
readarray a < file2.txt
for i in "${a[@]}"; do
awk -v var="$i" '!/var/' file1.narrowPeak | cat > output.narrowPeak
done
The problem is that '!/var/' is not working with variables.

With awk alone:
$ awk 'NR==FNR{a[$1,$2,$3]; next} !(($1,$2,$3) in a)' file2 file1
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
NR==FNR this will be true only for the first file, which is file2 in this example
a[$1,$2,$3] create keys based on first three fields, if spacing is exactly same between the two files, you can simply use $0 instead of $1,$2,$3
next to skip remaining commands and process next line of input
($1,$2,$3) in a to check if first three fields of file1 is present as key in array a. Then invert the condition.
Here's another way to write it (thanks to Ed Morton)
awk '{key=$1 FS $2 FS $3} NR==FNR{a[key]; next} !(key in a)' file2 file1

When the pattern is stored in a variable, you have to use the match operator:
awk -v var="something" '
$0 !~ var {print "this line does not match the pattern"}
'
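To see why the original attempt fails, here is a minimal sketch on made-up sample lines (the data and the variable names literal and dynamic are hypothetical, not from the question). Inside /.../ awk treats var as the three literal characters, while $0 !~ var uses the contents of the variable as a dynamic regex:

```shell
#!/bin/sh
# Hypothetical sample data, not taken from the question.
tmp=$(mktemp)
printf '%s\n' 'chr1 100 200' 'var 1 2' 'chr2 300 400' > "$tmp"

# /var/ is a regex literal: it matches the three characters "var",
# no matter what the awk variable named var contains.
literal=$(awk -v var='chr1' '!/var/' "$tmp")

# $0 !~ var uses the contents of var as a dynamic regex.
dynamic=$(awk -v var='chr1' '$0 !~ var' "$tmp")

printf 'literal /var/:\n%s\n' "$literal"
printf 'dynamic $0 !~ var:\n%s\n' "$dynamic"
rm -f "$tmp"
```

The first command drops only the line containing the text "var"; the second drops the line that matches chr1.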
With this problem, regular expression matching looks a bit awkward. I'd go with Sundeep's solution, but if you really want regex:
awk '
NR == FNR {
# construct and store the regex
patt["^" $1 "[[:blank:]]+" $2 "[[:blank:]]+" $3 "[[:blank:]]"] = 1
next
}
{
for (p in patt)
if ($0 ~ p)
next
print
}
' second first

Related

awk: how to extract 2 patterns from a single line and then concatenate them?

I want to find 2 patterns in each line and then print them with a dash between them as a separator. Here is a sample of lines:
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
20200323: #5358 BULL_SPX_X10_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205556, IR=NRB, LN=BULL SPX X10 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193132, SG=250, SN=193132, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X10_NORDNET_D2, TY=W, UQ=1}
20200323: #5359 BULL_SPX_X12_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205630, IR=NRB, LN=BULL SPX X12 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193131, SG=250, SN=193131, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X12_NORDNET_D2, TY=W, UQ=1}
Given the above lines, my desired output after running a script should look like this:
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
The first alphanumeric value (e.g. BULL_SPX_X12_NORDNET_D2) is always in the 3rd position of a line.
The second alphanumeric value (e.g. DK0061205630) can be at various positions but it's always preceded by "II=" and is always exactly 12 characters length.
I tried to implement my task with the following script:
13 regex='II=.\{12\}'
14 while IFS="" read -r line; do
15 matchedString=`grep -o $regex littletest.txt | tr -d 'II=,'`
16 awk /II=/'{print $3, " - ", $matchedString}' littletest.txt > temp.txt
17 done <littletest.txt
My thought process and intentions/assumptions:
Line 13 defines a regex pattern to match the alphanumeric string preceded with "II="
In line 15 variable "matchedString" gets assigned a value that is extracted from a line via regex, with the preceding "II=" being deleted.
Line 16 uses an awk expression to detect all lines that contain "II=", then prints the third string found on each line of the input file together with the value of the matched pattern that was assigned on the previous line of the script. So I expect that at this point a pair of extracted patterns (e.g. BEAR_SPX_X15_NORDNET_D1 - DK0061205473) should be transferred to the temp.txt file.
Line 17 is taking an input file for a script to consume.
However, after running the script I did not get the desired output. Here is a sample of what I got:
BEAR_SPX_X15_NORDNET_D1
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
How could I achieve my desired output that I described earlier?
$ awk -v OFS=' - ' 'match($0,/II=/){print $3, substr($0,RSTART+3,12)}' file
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
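As a rough sketch of how match() drives this, the variant below lets the regex cover the whole "II=..." token and uses RLENGTH instead of a hard-coded 12. It assumes the value ends at the next comma or closing brace, and the input line is a shortened, hypothetical version of the sample:

```shell
#!/bin/sh
# Shortened sample line (hypothetical abbreviation of the data in the question).
line='20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB}'

result=$(printf '%s\n' "$line" | awk -v OFS=' - ' '
  # match() sets RSTART to where "II=" begins and RLENGTH to the match length
  match($0, /II=[^,}]+/) {
    # skip the 3 characters of "II=" and keep the rest of the match
    print $3, substr($0, RSTART + 3, RLENGTH - 3)
  }')

printf '%s\n' "$result"
```

Since the value is stated to be always 12 characters long, the fixed-length substr above is just as good for this data; the RLENGTH form only helps if that assumption ever changes.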
Just trying out awk.
awk 'BEGIN{ FS="[II=, ]+" ; OFS=" - " } {print $3, $8}' file.txt
Using gawk (GNU awk), which supports a regex as the field separator (FS), and assuming that each line in your file has exactly the same format and the same number of fields, this works fine in my tests:
awk '{print $3,$9}' FS="[ ]|II=" OFS=" - " file1
#or FS="[[:space:]]+|II=|[,]" if you might have more than one space between fields
Results
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
Since the II= part could be anywhere, this trick could also work with a penalty of parsing the file twice:
paste -d "-" <(awk '{print $3}' file1) <(awk '/II/{print $2}' RS="[ ]" FS="=|," file1)

Bash script to split a file by grep everything till the second time match in a column into one file and the rest into another

I am trying to split a file with data like
2 0.2345
58 0.3608
59 0.3504
60 0.4175
65 0.3995
66 0.3972
67 0.4411
411 0.3455
2 1.3867
3 1.4532
4 1.2925
5 1.2473
6 1.2605
7 1.2463
8 1.1667
9 1.1312
10 1.1502
11 1.1190
12 1.0346
13 1.0291
409 0.8025
410 0.8695
411 0.9154
For this kind of data, I am trying to split this into two files:
File 1 : 2 -411 (first Column match)
File 2 : 2-411 (second occurrence in the first column)
For this, I wrote these two one liners:
awk '1;/411/{exit}' $1 > File1_$1 ;
awk '/411/,0' $1 | awk '{if (NR!=1) {print}}' > File2_$1
The problem is that if there is a match of "411" (as in "67 0.4411") on the second column, my script prematurely cuts from that line.
I am unable to restrict the match to the first column only; 411 can occur any number of times in the second column, and those occurrences are not of interest.
Any help would be greatly appreciated.
an idea could be to use this command combination
awk '{ if ($1 >= 2 && $1 <= 411) print $0 }{if ($1=="411") exit}' input > f1
then
grep -v -f f1 input > f2
If your input file is bigger, you should repeat step 2.
I don't know much about Bash, but for the regex I think you should anchor the pattern to the start of the line, like this: ^411.
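Another option is to compare the first field directly instead of matching a regex against the whole line. The sketch below is only an illustration: the temporary file names, the variables last, f1, f2 and seen, and the shortened sample data are all made up for the example. Everything up to and including the first line whose first column is 411 goes to the first file, and the rest goes to the second:

```shell
#!/bin/sh
input=$(mktemp)
# Shortened, hypothetical version of the sample data
printf '%s\n' '2 0.2345' '67 0.4411' '411 0.3455' '2 1.3867' '411 0.9154' > "$input"

awk -v last=411 -v f1="$input.part1" -v f2="$input.part2" '
  { print > (seen ? f2 : f1) }   # write first, so the boundary line stays in part 1
  $1 == last { seen = 1 }        # compare field 1 only; "0.4411" in field 2 is ignored
' "$input"

part1=$(cat "$input.part1")
part2=$(cat "$input.part2")
printf 'part1:\n%s\npart2:\n%s\n' "$part1" "$part2"
rm -f "$input" "$input.part1" "$input.part2"
```

Because the test is $1 == last rather than /411/, a value such as 0.4411 in the second column can no longer end the first file early.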

print lines between patterns individual separate files

I have a big file of 5000+ lines which has a repeated pattern like shown below:
ABC
111
222
333
XYZ
ABC
444
555
666
777
XYZ
..
..
ABC
777777777
888888888
999999999
222
333
111
XYZ
I would like to extract contents between each 'ABC' and 'XYZ' and write it to a separate file.
Ex: file1 should have
ABC
111
222
333
XYZ
File2 should have
ABC
444
555
666
777
XYZ
Filen should have
ABC
777777777
888888888
999999999
222
333
111
XYZ
and so on.
How could we achieve this ? I read these below threads but it writes only one single file. Didn't help for my case.
How to select lines between two marker patterns which may occur multiple times with awk/sed
Print lines between two patterns to new file
awk '/^ABC/{file="file"c++}{print >>file}' a
Perl to the rescue!
< bigfile perl -nwe 'print {$OUT} $_
if (/ABC/ && do { open $OUT, ">", "file" . ++$i or die $!}
) ... /XYZ/'
-n reads the file line by line
it only prints if between /ABC/ and /XYZ/
when /ABC/ is true, i.e. we're starting a new section, a new file is opened and associated with the filehandle $OUT. $i is the number of the file.
awk '
# setup our output file name file0, file1, file2, ...
$0 == "ABC"{if (i) {close(f)};f="file"i++;};
# use inclusive range match
$0 == "ABC",$0 == "XYZ"{print > f}
'
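For a quick check of the idea, here is a self-contained sketch on a tiny made-up input (the directory, big.txt and the variables out and n are hypothetical). It numbers the files from 1 as in the question and closes each file as soon as its XYZ line is written, so only one output file is open at a time:

```shell
#!/bin/sh
dir=$(mktemp -d)
# Tiny made-up input with one stray line between the two sections
printf '%s\n' ABC 111 222 XYZ noise ABC 444 XYZ > "$dir/big.txt"

( cd "$dir" && awk '
  $0 == "ABC" { out = "file" (++n) }         # start a new output file
  out != ""   { print > out }                # copy lines while a section is open
  $0 == "XYZ" { close(out); out = "" }       # finish the section and release the file
' big.txt )

first=$(cat "$dir/file1")
second=$(cat "$dir/file2")
printf 'file1:\n%s\nfile2:\n%s\n' "$first" "$second"
rm -rf "$dir"
```

Lines that fall outside any ABC ... XYZ section, such as "noise" here, are simply skipped.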

Match a word just once - AWK

I was reading the GNU awk manual, but I didn't find a regular expression with which I can match a string just once.
For example from the files aha_1.txt, aha_2.txt, aha_3.txt, .... I would like to print the second column $2 from the first time ana appears in the files (aha_1.txt, aha_2.txt, aha_3.txt, ....). In addition, the same thing when pedro appears.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
.
.
.
.
Meanwhile I did this but it counts all the cases not just the first time
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)
Just call the exit command after printing the first value (the second column of the first line that starts with the string ana).
$ awk '$1~/^ana$/{print $2; exit}' file
454
Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...) each file has ana inside and I need just to take the first time ana appears in each file and the output has to be one file.
That's a slightly different question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt
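The question also asks for pedro. One possible way to handle several names at once, sketched here with the sample data from the question (the array name seen and the temporary directory are made up for the example), is to keep a per-file counter for each name and reset it on the first line of every file:

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' 'luis 321 487' 'ana 454 345' 'pedro 341 435' 'ana 941 345' > "$dir/aha_1.txt"
printf '%s\n' 'pedro 201 723' 'gusi 837 134' 'ana 319 518' 'cindy 738 278' 'ana 984 265' > "$dir/aha_2.txt"

result=$(cd "$dir" && awk '
  FNR == 1 { split("", seen) }    # new file: forget which names were already seen
  ($1 == "ana" || $1 == "pedro") && !seen[$1]++ { print FILENAME, $1, $2 }
' aha_1.txt aha_2.txt)

printf '%s\n' "$result"
rm -rf "$dir"
```

Comparing $1 with == instead of matching /ana/ also avoids accidental hits on longer names that merely contain "ana".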

how to extract number in a single quote from a line with awk or sed?

I have this line, tab delimited:
chr1 11460 11462 '16/38' 421 + chr1 11460 11462 '21/29' 724 + 2
chr1 11479 11481 '11/29' 379 + chr1 11479 11481 '20/5' 667 + 2
What I want to do is test whether every second number inside the ' ' is greater than or equal to 10. If so, I'll output the line. So the result should be to print only the first line:
chr1 11460 11462 '16/38' 421 + chr1 11460 11462 '21/29' 724 + 2
I can write a perl code to do it. But this seems to be something awk can do easily.. anyone has a solution?
Thanks.
If you set the right field separators, it's pretty easy:
awk -F "['/]" '{for (i=3; i<=NF; i+=3) if ($i<10) next; print}' file
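To see why the loop steps through fields 3, 6, ..., it may help to look at how that separator splits a line. In this small sketch (spaces stand in for the tabs, and the variable names are made up) each quoted pair contributes three fields, the last of which is the number after the slash:

```shell
#!/bin/sh
line1="chr1 11460 11462 '16/38' 421 + chr1 11460 11462 '21/29' 724 + 2"
line2="chr1 11479 11481 '11/29' 379 + chr1 11479 11481 '20/5' 667 + 2"

# With ' and / as separators, the numbers after each slash land in $3 and $6.
fields=$(printf '%s\n' "$line1" | awk -F "['/]" '{ print NF ": " $3 " " $6 }')
printf '%s\n' "$fields"

# The filter skips a line as soon as one of those numbers is below 10.
kept=$(printf '%s\n' "$line1" "$line2" |
  awk -F "['/]" '{for (i=3; i<=NF; i+=3) if ($i<10) next; print}')
printf '%s\n' "$kept"
```

This relies on the quotes and slashes appearing only in those columns, as in the sample.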
The easiest way to fetch the content inside single quotes might be to strip off everything from both ends of each line, up to and including the single quote:
$ sed "s/^[^']*'//;s/'.*//" file
16/38
11/29
This sed expression consists of two commands:
s/^[^']*'// -- strips off all text to the first single quote,
s/'.*// -- strips off all text from the first (remaining) single quote to EOL.
To wrap this in a shell script that does something with that data requires .. well, a shell script...
You can parse this stuff using bash's read command. For example:
#!/bin/bash
IFS=/
sed "s/^[^']*'//;s/'.*//" file \
| while read left right; do
echo "$left / $right"
done
To implement something that grabs contents of multiple single-quoted numbers, you can expand the sed script appropriately, and implement if statements for the conditions you want. For example, a sed expression to grab the TWO single-quoted strings might be:
sed "s/^[^']*'\([^']*\)'[^']*'\([^']*\)'.*/\1 \2/"
This is a single large regex that uses two sets of brackets \( and \), to mark patterns that will be placed in the output, \1 and \2.
But you might be better off parsing things according to column position:
$ while read _ _ _ A _ _ _ _ _ B _; do echo "$A .. $B"; done < file
'16/38' .. '21/29'
'11/29' .. '20/5'
Actually implementing your programming logic is left as an exercise to the reader. If you'd like us to help you with your script, please include your work so far.
As long as those are the only ' characters in the string and the numbers won't have leading zeros you could use the regular expression:
\d\d+'.*\d\d+'
If either of those preconditions isn't true there are changes that could be made, but it would depend on the situation.
You should be able to use grep to get the lines you want with that regex. Plain grep does not understand \d, so either use GNU grep's -P option or spell the digits out for grep -E.
The following prints just the first line of file to stdout:
grep -E "[0-9][0-9]+'.*[0-9][0-9]+'" file
My version, serious overkill but should work with any amount of 'xx/xx' per line:
awk -F'\t' "{
found=1;
for(i=1;i<=NF;i++){
if(match(\$i, /'[[:digit:]]+\/([[:digit:]]+)'/, capts)){
if(capts[1] < 10){
found=0;
break;
}
}
}
if(found){
print;
}
}" file.txt
Explanation:
This loops through each field of the line and applies a regex to the field to find the last digits of 'xx/xx'. If the last digits are less than 10, it breaks out of the loop and moves on to the next line. If all fields have been processed by the for loop and no last digits were less than 10, it prints the line.
Note:
Since I'm using the three-argument form of the match function to capture regex groups, this will only work with GNU awk.