Match a word just once - AWK - regex

I was reading the GNU awk manual, but I didn't find a regular expression with which I can match a string just once.
For example, from the files aha_1.txt, aha_2.txt, aha_3.txt, ... I would like to print the second column ($2) only the first time ana appears in each file. In addition, I want the same thing for the first time pedro appears.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
.
.
.
.
So far I have this, but it prints every match, not just the first one:
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)

Just call the exit command after printing the first value (the second column of the first line whose first field is exactly ana).
$ awk '$1~/^ana$/{print $2; exit}' file
454

Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...), each file has ana inside, and I need to take just the first time ana appears in each file; the output has to be one file.
That's a slightly different question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt
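The question also asks for pedro. A minimal sketch of one way to handle both names in a single pass, assuming the names always sit in the first column: FNR == 1 is used for the per-file reset instead of comparing FILENAME, and the seen array is a name chosen here for illustration. The sample files are recreated in a temporary directory only so the example is self-contained.

```shell
# Recreate the two sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'luis 321 487' 'ana 454 345' 'pedro 341 435' 'ana 941 345' > "$tmpdir/aha_1.txt"
printf '%s\n' 'pedro 201 723' 'gusi 837 134' 'ana 319 518' 'cindy 738 278' 'ana 984 265' > "$tmpdir/aha_2.txt"

# FNR == 1 is true on the first line of every input file, so "seen" is
# cleared per file; each name then prints only on its first occurrence.
first_hits=$(awk '
    FNR == 1 { delete seen }
    ($1 == "ana" || $1 == "pedro") && !seen[$1]++ { print $1, $2 }
' "$tmpdir"/aha_*.txt)

printf '%s\n' "$first_hits"
rm -rf "$tmpdir"
```

Printing $1 alongside $2 is optional; it is included here only to show which name each value belongs to.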

Related

AWK negative regular expression with variable

I am using awk in a bash script to compare two files to get just the not-matching lines.
I need to compare all three fields of the second file (as one pattern?) with all lines of the first file:
First file:
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 628885 635117 HumanGM18558_peak_2 2509 . 83.77238 255.95094 250.99944 5270
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
Second file:
chr1 628885 635117
chr1 1250086 1250413
chr1 16613629 16613934
chr1 16644496 16644800
chr1 16895871 16896489
chr1 16905126 16905616
The current idea is to load one file in an array and use AWKs negative regular expression to compare.
readarray a < file2.txt
for i in "${a[@]}"; do
awk -v var="$i" '!/var/' file1.narrowPeak | cat > output.narrowPeak
done
The problem is that '!/var/' is not working with variables.
With awk alone:
$ awk 'NR==FNR{a[$1,$2,$3]; next} !(($1,$2,$3) in a)' file2 file1
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
NR==FNR: this is true only for the first file, which is file2 in this example.
a[$1,$2,$3]: creates keys based on the first three fields. If both files contained only these three fields with exactly the same spacing, you could simply use $0 instead of $1,$2,$3.
next: skips the remaining commands and processes the next line of input.
($1,$2,$3) in a: checks whether the first three fields of the current file1 line are present as a key in array a. The ! then inverts the condition.
Here's another way to write it (thanks to Ed Morton)
awk '{key=$1 FS $2 FS $3} NR==FNR{a[key]; next} !(key in a)' file2 file1
When the pattern is stored in a variable, you have to use the match operator:
awk -v var="something" '
$0 !~ var {print "this line does not match the pattern"}
'
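As a quick check of that operator, here is a small sketch on three made-up lines (the sample variable and the pattern value are assumptions for the demo, not part of the question's files):

```shell
# Hypothetical sample input, only to show the dynamic-regex operator.
sample=$(printf '%s\n' 'chr1 9997 10330' 'chr1 628885 635117' 'chr1 15966215 15966638')

# /var/ would search for the literal text "var"; $0 !~ var uses the
# contents of the awk variable as the regular expression instead.
kept=$(printf '%s\n' "$sample" | awk -v var='^chr1 628885 ' '$0 !~ var')

printf '%s\n' "$kept"
```

Only the two lines that do not match the pattern held in var are printed.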
With this problem, regular expression matching looks a bit awkward. I'd go with Sundeep's solution, but if you really want regex:
awk '
NR == FNR {
# construct and store the regex
patt["^" $1 "[[:blank:]]+" $2 "[[:blank:]]+" $3 "[[:blank:]]"] = 1
next
}
{
for (p in patt)
if ($0 ~ p)
next
print
}
' second first

Bash/sed: delete everything from text file except match(es)

I have a text file which I need to extract a match from in a bash script. There might be more than one match and everything else is supposed to be discarded.
Sample snippet of input.txt file content:
PART TWO OF TWO PARTS-
E RESNO 56/20 56/30 54/40 52/50 TUDEP
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370 380 390
EUR RTS WEST NIL
NAR NIL-
REMARKS.
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
TMI NUMBER AS PART OF THE OCEANIC CLEARANCE READ BACK.
2.ADS-C AND CPDLC MANDATED OTS ARE AS FOLLOWS
TRACK A 350 360 370 380 390
TRACK B 350 360 370 380 390
I try to match for 142 from the line
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
The match is always a number (one to three digits, may have leading zeroes) and always preceded by TMI IS.
My experiments so far led to nothing: I tried .*TMI IS ([0-9]+).* with the following sed command in my bash script
sed -n 's/.*TMI IS \([0-9]+\).*/\1/g' input.txt > output.txt
but only got an empty output.txt.
My script runs in GNU Bash-4.2. Where do I make my mistake? I ran out of ideas so your input is highly appreciated!
Thanks,
Chris
Two points are needed to make your sed approach work:
the + quantifier must be escaped (\+) in sed basic regular expressions
to print the lines where the substitution matched, add the p flag to the s command:
sed -n 's/.*TMI IS \([0-9]\+\).*/\1/gp' input.txt
142
To get only the first match for your current format use:
sed -n 's/^\S\+TMI IS \([0-9]\+\).*/\1/gp' input.txt
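If escaping the quantifier and the group feels noisy, the same substitution can be written with extended regular expressions. This is only a sketch, assuming a sed that accepts -E (GNU and BSD sed both do); the sample variable holds a few lines from the question.

```shell
# Three lines from the question's sample, stored in a variable for the demo.
sample=$(printf '%s\n' 'REMARKS.' '1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE' 'TMI NUMBER AS PART OF THE OCEANIC CLEARANCE READ BACK.')

# -E switches sed to extended regular expressions: ( ) and + work unescaped.
# -n plus the p flag prints only the lines where the substitution succeeded.
tmi=$(printf '%s\n' "$sample" | sed -nE 's/.*TMI IS ([0-9]+).*/\1/p')

printf '%s\n' "$tmi"
```

The g flag is dropped here because .* on both sides already consumes the whole line, so at most one substitution per line is possible.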
With GNU grep:
$ grep -oP 'TMI IS \K([0-9]*)' input.txt
142
You could also do this using perl as an alternative to the above:
$ perl -nle 'print $1 if /TMI IS (\d+)/;' < input.txt
142

Bash script to split a file by grep everything till the second time match in a column into one file and the rest into another

I am trying to split a file with data like
2 0.2345
58 0.3608
59 0.3504
60 0.4175
65 0.3995
66 0.3972
67 0.4411
411 0.3455
2 1.3867
3 1.4532
4 1.2925
5 1.2473
6 1.2605
7 1.2463
8 1.1667
9 1.1312
10 1.1502
11 1.1190
12 1.0346
13 1.0291
409 0.8025
410 0.8695
411 0.9154
For this kind of data, I am trying to split this into two files:
File 1 : 2 -411 (first Column match)
File 2 : 2-411 (second occurrence in the first column)
For this, I wrote these two one liners:
awk '1;/411/{exit}' $1 > File1_$1 ;
awk '/411/,0' $1 | awk '{if (NR!=1) {print}}' > File2_$1
The problem is that if there is a match of "411" (as in "67 0.4411") on the second column, my script prematurely cuts from that line.
I am unable to make the match on the first column only as occurrence of 411 on the second column can be number of times and not of interest.
Any help would be greatly appreciated.
One idea is to use this combination of commands:
awk '{ if ($1 >= 2 && $1 <= 411) print $0 }{if ($1=="411") exit}' input > f1
then
grep -v -x -F -f f1 input > f2
(-x and -F make grep compare whole lines as fixed strings rather than treating each line of f1 as a regex.)
If your input file is bigger, you should repeat step 2.
I don't know much about Bash, but for the regex I think you should indicate that the line begins with 411, like this: ^411.
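Another way to avoid the false match on the second column is to compare $1 directly and let a single awk pass write both files. The sketch below is an assumption-laden illustration, not the asker's script: the File1/File2 names, the part variable, and the shortened data set are all made up for the demo.

```shell
# Recreate a shortened version of the question's data in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '2 0.2345' '67 0.4411' '411 0.3455' '2 1.3867' '410 0.8695' '411 0.9154' > "$tmpdir/input"

# "part" names the current output file. $1 == "411" compares the first
# column only, so 0.4411 in column 2 never triggers the switch.
# The switch happens after the matching line has been written.
awk -v out1="$tmpdir/File1" -v out2="$tmpdir/File2" '
    BEGIN { part = out1 }
    { print > part }
    $1 == "411" && part == out1 { close(part); part = out2 }
' "$tmpdir/input"

file1=$(cat "$tmpdir/File1")
file2=$(cat "$tmpdir/File2")
printf 'File1:\n%s\nFile2:\n%s\n' "$file1" "$file2"
rm -rf "$tmpdir"
```

Because the comparison is on the field rather than on the whole line, no second grep pass is needed.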

Sed command garbled with very easy multiline regex in bash

I'm stuck with a sed command again, most probably because I have a very old version of sed, and due to my limitations I can't change the version of sed (!).
My question is this: I wrote a simple regex that fits my string file:
/[^,]*$/mg
My string file is this :
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
23:53:22,539
23:53:23,109
23:53:23,110
23:53:23,115
23:53:23,117
23:53:23,118
23:53:23,119
23:53:23,690
23:53:23,721
23:53:23,722
23:53:24,275
23:53:24,276
23:53:24,313
23:53:24,316
23:53:24,317
23:53:24,318
23:53:24,854
23:53:24,888
23:53:24,889
23:53:24,890
23:53:24,891
23:53:50,676
23:53:50,677
23:53:50,711
23:53:50,713
23:53:50,714
23:53:51,257
23:53:51,258
23:53:51,296
23:53:51,297
23:53:51,298
23:53:51,820
23:53:51,822
23:53:51,823
23:53:52,358
23:53:52,364
23:53:52,367
23:53:52,909
23:53:52,910
23:53:52,936
23:53:52,939
23:53:52,941
23:53:52,944
23:53:52,945
23:53:52,946
23:53:52,949
23:53:52,953
23:53:52,956
23:53:52,959
23:53:52,963
23:53:52,966
23:53:52,970
23:53:52,971
23:53:52,974
23:53:52,978
23:53:52,980
23:53:52,983
23:53:52,984
23:53:52,986
23:53:52,987
23:53:52,989
23:53:52,990
23:53:52,991
23:53:52,994
23:53:52,995
23:53:52,999
23:53:53,001
23:53:53,002
23:53:53,004
23:53:53,005
23:53:53,007
23:53:53,010
23:53:53,026
23:53:53,027
23:53:53,081
23:53:53,082
23:53:53,083
23:53:53,085
07:32:54,519
07:32:54,521
07:32:54,537
07:32:54,538
07:32:54,539
07:32:54,540
07:32:54,541
07:32:54,542
07:32:54,543
07:32:54,544
07:32:54,545
07:32:54,546
07:32:54,547
07:32:54,548
07:32:54,549
07:32:54,550
I'm trying to get the values after the comma and then assign them to an array. When I used the sed command like this:
`sed -n '/[^,]*$/mg'` file
it says "command garbled". I read about multiline sed but still couldn't reach a solution. I am new to regexes, so any help will be appreciated.
Thank you in advance!
If you are using a "recent" bash, I think you can use cut and assign extracted values to an array:
numbers="$(cut -d',' -f2 filename.txt)"
array_numbers=( $numbers )
If you want to get the values after the comma, you could use one of the sed commands below, which remove everything from the start of the line up to and including the comma.
sed 's/^[^,]*,//' file
OR
sed 's/^.*,//' file
Example:
$ echo '23:53:22,492' | sed 's/^[^,]*,//'
492
$ echo '23:53:22,492' | sed 's/^.*,//'
492
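The two commands give the same result here only because each line has a single comma. A small sketch with a made-up line containing two commas (the line value is an assumption, not from the question's file) shows where they differ:

```shell
# Hypothetical input with two commas, to expose the difference.
line='23:53:22,492,extra'

# [^,]* cannot cross a comma, so only the text up to the first comma goes.
first=$(printf '%s\n' "$line" | sed 's/^[^,]*,//')

# .* is greedy, so it runs to the last comma on the line.
last=$(printf '%s\n' "$line" | sed 's/^.*,//')

printf '%s\n%s\n' "$first" "$last"
```

For the timestamps in the question either form is fine; the [^,]* version is the safer habit if extra commas could ever appear.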
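sed 's/.*,//' file
matches everything up to the last comma (.* is greedy) and substitutes the match with nothing, which effectively gives the value after the comma. Each line here has a single comma, so the first and last comma are the same.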
for the input file
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
will produce output as
650
654
655
656
238
240
302
303
304
305
889
890
896
897
898
899
492
538

how to extract number in a single quote from a line with awk or sed?

I have this line, tab delimited:
chr1 11460 11462 '16/38' 421 + chr1 11460 11462 '21/29' 724 + 2
chr1 11479 11481 '11/29' 379 + chr1 11479 11481 '20/5' 667 + 2
What I want to do is test whether all the second numbers inside ' ' are greater than or equal to 10. If so, I'll output the line. So the result should be to print the first line:
chr1 11460 11462 '16/38' 421 + chr1 11460 11462 '21/29' 724 + 2
I can write Perl code to do it, but this seems to be something awk can do easily. Does anyone have a solution?
Thanks.
If you set the right field separators, it's pretty easy:
awk -F "['/]" '{for (i=3; i<=NF; i+=3) if ($i<10) next; print}' file
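To see why the loop steps through fields 3, 6, 9, ..., it helps to print how the line splits under that separator. A small sketch on the first sample line (the line and fields variable names are chosen here for illustration):

```shell
# First sample line from the question, with real tab characters.
line=$(printf "chr1\t11460\t11462\t'16/38'\t421\t+\tchr1\t11460\t11462\t'21/29'\t724\t+\t2")

# With -F "['/]" every quote and slash ends a field, so the number after
# each slash lands in fields 3, 6, 9, ...
fields=$(printf '%s\n' "$line" | awk -F "['/]" '{ print NF; print $3; print $6 }')

printf '%s\n' "$fields"
```

Each quoted pair contributes three separators (quote, slash, quote), which is why the loop advances by 3.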
The easiest way to fetch the content inside single quotes might be just to strip off everything from both ends of each line, up to and including the single quote:
$ sed "s/^[^']*'//;s/'.*//" file
16/38
11/29
This sed expression consists of two commands:
s/^[^']*'// -- strips off all text to the first single quote,
s/'.*// -- strips off all text from the first (remaining) single quote to EOL.
To wrap this in a shell script that does something with that data requires .. well, a shell script...
You can parse this stuff using bash's read command. For example:
#!/bin/bash
IFS=/
sed "s/^[^']*'//;s/'.*//" file \
| while read left right; do
echo "$left / $right"
done
To implement something that grabs contents of multiple single-quoted numbers, you can expand the sed script appropriately, and implement if statements for the conditions you want. For example, a sed expression to grab the TWO single-quoted strings might be:
sed "s/^[^']*'\([^']*\)'[^']*'\([^']*\)'.*/\1 \2/"
This is a single large regex that uses two sets of brackets \( and \), to mark patterns that will be placed in the output, \1 and \2.
But you might be better off parsing things according to column position:
$ while read _ _ _ A _ _ _ _ _ B _; do echo "$A .. $B"; done < file
'16/38' .. '21/29'
'11/29' .. '20/5'
Actually implementing your programming logic is left as an exercise to the reader. If you'd like us to help you with your script, please include your work so far.
As long as those are the only ' characters in the string and the numbers won't have leading zeros, you could use the regular expression:
[0-9][0-9]+'.*[0-9][0-9]+'
(grep only understands \d with -P, so the digits are written as [0-9] here.) If either of those preconditions isn't true there are changes that could be made, but it would depend on the situation.
You should be able to use grep -E to get the lines you want using that regex.
The following prints just the first line of file to stdout:
grep -E "[0-9][0-9]+'.*[0-9][0-9]+'" file
My version, serious overkill but should work with any amount of 'xx/xx' per line:
awk -F'\t' "{
found=1;
# fields are numbered from 1 to NF (\$0 is the whole line)
for(i=1;i<=NF;i++){
if(match(\$i, /'[[:digit:]]+\/([[:digit:]]+)'/, capts)){
# + 0 forces a numeric comparison
if(capts[1] + 0 < 10){
found=0;
break;
}
}
}
if(found){
print;
}
}" file.txt
Explanation:
This will loop through each field of the line and apply a regex against the field to find the last digits of 'xx/xx'. If the last digits are less than 10 it will break out of the loop and the line is not printed. If all fields have been processed by the for loop and no last digits were less than 10, it will print the line.
Note:
Since I'm using the three-argument form of the match function to capture regex groups, this will only work with GNU awk.