print lines between patterns individual separate files - regex

I have a big file of 5000+ lines which has a repeated pattern like shown below:
ABC
111
222
333
XYZ
ABC
444
555
666
777
XYZ
..
..
ABC
777777777
888888888
999999999
222
333
111
XYZ
I would like to extract contents between each 'ABC' and 'XYZ' and write it to a separate file.
Ex: file1 should have
ABC
111
222
333
XYZ
File2 should have
ABC
444
555
666
777
XYZ
Filen should have
ABC
777777777
888888888
999999999
222
333
111
XYZ
and so on.
How could we achieve this ? I read these below threads but it writes only one single file. Didn't help for my case.
How to select lines between two marker patterns which may occur multiple times with awk/sed
Print lines between two patterns to new file

awk '/^ABC/{file="file"c++}{print >>file}' a

Perl to the rescue!
< bigfile perl -nwe 'print {$OUT} $_
if (/ABC/ && do { open $OUT, ">", "file" . ++$i or die $!}
) ... /XYZ/'
-n reads the file line by line
it only prints if between /ABC/ and /XYZ/
when /ABC/ is true, i.e. we're starting a new section, a new file is opened and associated with the filehandle $OUT. $i is the number of the file.

awk '
# setup our output file name file0, file1, file2, ...
$0 == "ABC"{if (i) {close(f)};f="file"i++;};
# use inclusive range match
$0 == "ABC",$0 == "XYZ"{print > f}
'
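The range-pattern approach above can be tried end to end on a tiny input. This is only a sketch: the scratch directory, `sample.txt`, and the `part_` prefix are made-up names, and the `junk` line stands in for anything that sits between blocks.

```shell
# Work in a throwaway directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two ABC..XYZ blocks with one stray line between them (hypothetical data).
printf '%s\n' ABC 111 222 XYZ junk ABC 333 XYZ > sample.txt

awk '
  # On each ABC, close the previous output file and pick the next name.
  $0 == "ABC" { if (out != "") close(out); out = "part_" (++n) ".txt" }
  # Inclusive range: print everything from ABC through XYZ to the current file.
  $0 == "ABC", $0 == "XYZ" { print > out }
' sample.txt

part1=$(cat part_1.txt)
part2=$(cat part_2.txt)
count=$(ls part_*.txt | wc -l)
printf 'files written: %s\n' "$count"
```

Closing each file before opening the next matters on inputs with thousands of blocks, since some awk implementations limit the number of files open at once.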


AWK negative regular expression with variable

I am using awk in a bash script to compare two files to get just the not-matching lines.
I need to compare all three fields of the second file (as one pattern?) with all lines of the first file:
First file:
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 628885 635117 HumanGM18558_peak_2 2509 . 83.77238 255.95094 250.99944 5270
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
Second file:
chr1 628885 635117
chr1 1250086 1250413
chr1 16613629 16613934
chr1 16644496 16644800
chr1 16895871 16896489
chr1 16905126 16905616
The current idea is to load one file in an array and use AWKs negative regular expression to compare.
readarray a < file2.txt
for i in "${a[@]}"; do
awk -v var="$i" '!/var/' file1.narrowPeak | cat > output.narrowPeak
done
The problem is that '!/var/' is not working with variables.
With awk alone:
$ awk 'NR==FNR{a[$1,$2,$3]; next} !(($1,$2,$3) in a)' file2 file1
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
NR==FNR this will be true only for the first file, which is file2 in this example
a[$1,$2,$3] create keys based on first three fields, if spacing is exactly same between the two files, you can simply use $0 instead of $1,$2,$3
next to skip remaining commands and process next line of input
($1,$2,$3) in a to check if first three fields of file1 is present as key in array a. Then invert the condition.
Here's another way to write it (thanks to Ed Morton)
awk '{key=$1 FS $2 FS $3} NR==FNR{a[key]; next} !(key in a)' file2 file1
When the pattern is stored in a variable, you have to use the match operator:
awk -v var="something" '
$0 !~ var {print "this line does not match the pattern"}
'
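The difference between a regex constant and a variable used as a pattern can be seen on three lines of made-up data. In this sketch the temporary file and the extra `var 1 2` line are assumptions added only to make the difference visible; the `index()` variant is an additional option for when the variable should be treated as a plain string rather than a regex.

```shell
tmp=$(mktemp)
printf '%s\n' 'chr1 9997 10330' 'chr1 628885 635117' 'var 1 2' > "$tmp"

# /var/ is a regex constant: it looks for the three letters v-a-r.
literal=$(awk -v var="628885" '!/var/' "$tmp")

# $0 !~ var uses the contents of var as a dynamic regex.
dynamic=$(awk -v var="628885" '$0 !~ var' "$tmp")

# index() is a plain substring test, so regex metacharacters in var are harmless.
substring=$(awk -v var="628885" 'index($0, var) == 0' "$tmp")

printf 'with /var/:\n%s\n\nwith $0 !~ var:\n%s\n' "$literal" "$dynamic"
rm -f "$tmp"
```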
With this problem, regular expression matching looks a bit awkward. I'd go with Sundeep's solution, but if you really want regex:
awk '
NR == FNR {
# construct and store the regex
patt["^" $1 "[[:blank:]]+" $2 "[[:blank:]]+" $3 "[[:blank:]]"] = 1
next
}
{
for (p in patt)
if ($0 ~ p)
next
print
}
' second first

Bash/sed: delete everything from text file except match(es)

I have a text file which I need to extract a match from in a bash script. There might be more than one match and everything else is supposed to be discarded.
Sample snippet of input.txt file content:
PART TWO OF TWO PARTS-
E RESNO 56/20 56/30 54/40 52/50 TUDEP
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370 380 390
EUR RTS WEST NIL
NAR NIL-
REMARKS.
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
TMI NUMBER AS PART OF THE OCEANIC CLEARANCE READ BACK.
2.ADS-C AND CPDLC MANDATED OTS ARE AS FOLLOWS
TRACK A 350 360 370 380 390
TRACK B 350 360 370 380 390
I try to match for 142 from the line
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
The match is always a number (one to three digits, may have leading zeroes) and always preceded by TMI IS.
My experiments so far led to nothing: I tried .*TMI IS ([0-9]+).* with the following sed command in my bash script
sed -n 's/.*TMI IS \([0-9]+\).*/\1/g' input.txt > output.txt
but only got an empty output.txt.
My script runs in GNU Bash-4.2. Where do I make my mistake? I ran out of ideas so your input is highly appreciated!
Thanks,
Chris
Two changes make your sed approach work:
In sed basic regular expressions the + quantifier must be escaped as \+ (a GNU extension).
To print the result of a successful substitution, add the p flag to the s command:
sed -n 's/.*TMI IS \([0-9]\+\).*/\1/gp' input.txt
142
To get only the first match for your current format use:
sed -n 's/^\S\+TMI IS \([0-9]\+\).*/\1/gp' input.txt
With GNU grep:
$ grep -oP 'TMI IS \K([0-9]*)' input.txt
142
You could also do this using perl as an alternative to the above:
$ perl -nle 'print $1 if /TMI IS (\d+)/;' < input.txt
142
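The role of the escaped quantifier can be checked side by side. This is a small sketch on the single sample line; the variable names are made up, and the `-E` form assumes a sed that supports extended regular expressions (GNU and BSD sed both do).

```shell
line='1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE'

# Basic regex with an unescaped +: the + is a literal plus sign, so nothing matches.
bre_plain=$(printf '%s\n' "$line" | sed -n 's/.*TMI IS \([0-9]+\).*/\1/p')

# Basic regex with a POSIX interval (one to three digits) instead of \+.
bre_interval=$(printf '%s\n' "$line" | sed -n 's/.*TMI IS \([0-9]\{1,3\}\).*/\1/p')

# Extended regex: + and () need no backslashes.
ere=$(printf '%s\n' "$line" | sed -n -E 's/.*TMI IS ([0-9]+).*/\1/p')

printf 'plain=[%s] interval=[%s] ere=[%s]\n' "$bre_plain" "$bre_interval" "$ere"
```

The interval form also encodes the "one to three digits" constraint from the question directly in the pattern.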

How to grep any word that appears between 2 and 4 times?

My file is:
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
I need to extract the words and numbers that appear 2 to 4 times, i.e. {2,4}.
I've tried many regex lines and even regex101.
I can't really put my finger on what's not working.
This is the closest I've got so far:
egrep -o '[\w]{2,4}' A1
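By default grep uses basic regular expressions, in which an unescaped {2,4} is taken literally rather than as a repetition count. To write the interval as {2,4}, you have to use extended regular expressions.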
Use
-E option as,
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Also use
-w to match words, so that it matches the entire words instead of partial.
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
Example
$ grep -Ewo "\w{2,4}" file
ab
12ab
1cd
uu
88
ab
33
33
ab
cd
uu
88
88
33
33
33
cw
Note
You can eliminate the unnecessary cat by passing the file to grep as an argument instead.
You were very close; inside a bracket expression [], \w is treated literally, so move it outside the brackets:
egrep -o '\w{2,4}'
Also egrep is deprecated in favor of grep -E, and you don't need the cat as grep takes file(s) as argument(s):
grep -Eo '\w{2,4}' file.txt
I would use awk for it:
awk '{for(i=1;i<=NF;i++)a[$i]++}
END{for(x in a)if(a[x]>1&&a[x]<5)print x}' file
It scans the whole file and prints the words whose total number of occurrences in the file falls in the range [2,4].
Output is:
uu
ab
88
1
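The same whole-file count can also be done with a plain sort and uniq pipeline instead of an awk array. This is a different technique from the answer above, shown only as a sketch; the temporary file stands in for the question's input, and `LC_ALL=C` is set so the ordering does not depend on the locale.

```shell
tmp=$(mktemp)
printf '%s\n' 'ab 12ab 1cd uu 88 ab 33 33 1 1' \
              'ab cd uu 88 88 33 33 33 cw ab' > "$tmp"

# One word per line, group identical words, count them, keep counts in [2,4].
words=$(tr -s ' ' '\n' < "$tmp" \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '$1 >= 2 && $1 <= 4 { print $2 }')

printf '%s\n' "$words"
rm -f "$tmp"
```

Unlike the awk array version, the result comes out in sorted order, at the cost of sorting the whole input.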
Using AWK, this solution counts the word instances per line not per file:
awk '{delete array; for(i = 1; i <= NF; i++) array[$i]+=1; for(i in array) if(array[i] >= 2 && array[i] <= 4) printf "%s ", i; printf "\n" }' input.txt
delete clears the array for each new line. Each field is used as an array index and its count is incremented by one. The indexes (fields) whose counts are between 2 and 4 inclusive are then printed.
Output:
ab 1 33
ab 88 33
Perl implementation for a file small enough to process its content as a single string:
$/ = undef;
$_ = <>;
@_ = /(\b\w+\b)/gs;
my %h; $h{$_}++ for @_;
for (keys %h) {
print "$_\n" if $h{$_} >= 2 and $h{$_} <= 4;
}
Save it into a script.pl and run:
perl script.pl < file
Of course, you can pass the code via -e option as well: perl -e 'the code' < file.
Input
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
Output
88
uu
ab
1
There is no 33 in the output, since it occurs 5 times in the input.
The code reads the file in slurp mode into the default variable ($_), then collects all the words (\w with word boundaries around) into the @_ array. Then it counts the number of times each word occurred in the file and stores the result in the %h hash. The final block prints only the items that occurred 2, 3, or 4 times, no more and no less.
Note that in Perl you should always use strict; and use warnings; to detect problems early.

grepping variables containing special characters in a shell script

I am trying to grep out some lines from a file based on patterns stored in a variable in bash script that may contain (, ), [ or ]. I get the desired output with patterns that do not contain the special characters but with ( or ), I get a blank output and with [ or ], I get the following error:
grep: range out of order in character class
Sample of pattern file:
14-3-3-like protein B
14-3-3-like protein B (Fragment)
3-oxoacyl-[acyl-carrier-protein] synthase 2
Sample of input file:
seq1 gi|703124372 380 285 + 2e-154 14-3-3-like protein B sp
seq2 Q96451 69 51 + 3e-16 14-3-3-like protein B (Fragment) sp
seq3 P0AAI5 104 84 - 4e-20 3-oxoacyl-[acyl-carrier-protein] synthase 2 sp
My code is as below:
if [ $# -eq 0 ]
then echo -e "\nUSAGE: $0 [pattern file] [in file] > [out file]\n"
exit;
else
while read line; do
echo -e "Pattern: $line"
grep -P "\t$line\t" $2
echo -e "\n"
done < $1
fi
Sample of the output:
Pattern: 14-3-3-like protein B
seq1 gi|703124372 380 285 + 2e-154 14-3-3-like protein B sp
Pattern: 14-3-3-like protein B (Fragment) sp
Pattern: 3-oxoacyl-[acyl-carrier-protein] synthase 2
grep: range out of order in character class
I've tried using grep -Fw but that also doesn't give the desired output..
I've also tried substituting the patterns in the two input files with \( and \[ instead of ( and [ but that also doesn't work..
Any idea how can I achieve this? Is there anything else I could use instead of grep?
tab=$(echo -e \\t)
grep -F "$tab$line$tab" $2
Edit:
See also the suggestion from @anubhava: grep -F $'\t'"$line"$'\t' "$2"
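A fixed-string search can be checked against patterns containing ( ) and [ ]. The sketch below uses a trimmed-down, tab-separated version of the sample input (the file contents and variable names are illustrative) and builds the tab with printf so that it also works in shells without the $'\t' syntax.

```shell
tmp=$(mktemp)
tab=$(printf '\t')

# Hypothetical three-line input, tab-separated like the original data.
printf 'seq1\tgi|703124372\t14-3-3-like protein B\tsp\n' > "$tmp"
printf 'seq2\tQ96451\t14-3-3-like protein B (Fragment)\tsp\n' >> "$tmp"
printf 'seq3\tP0AAI5\t3-oxoacyl-[acyl-carrier-protein] synthase 2\tsp\n' >> "$tmp"

# -F treats the pattern as a literal string, so [ ] ( ) have no special meaning.
pattern='3-oxoacyl-[acyl-carrier-protein] synthase 2'
hit_brackets=$(grep -F -- "$tab$pattern$tab" "$tmp" | cut -f1)

# The surrounding tabs keep the shorter name from matching the (Fragment) line.
pattern='14-3-3-like protein B'
hit_plain=$(grep -F -- "$tab$pattern$tab" "$tmp" | cut -f1)

printf 'brackets: %s\nplain: %s\n' "$hit_brackets" "$hit_plain"
rm -f "$tmp"
```

When reading the pattern file in a loop, `while IFS= read -r line` avoids losing leading whitespace or backslashes in a pattern.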

Match a word just once - AWK

I was reading the GNU awk manual but I didn't find a regular expression with which I can match a string just once.
For example, from the files aha_1.txt, aha_2.txt, aha_3.txt, ... I would like to print the second column ($2) of the first line in which ana appears in each file, and likewise for the first line in which pedro appears.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
.
.
.
.
Meanwhile I did this, but it matches every occurrence, not just the first one:
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)
Just call exit after printing the first value (the second column of the first line whose first field is ana).
$ awk '$1~/^ana$/{print $2; exit}' file
454
Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...) each file has ana inside and I need just to take the first time ana appears in each file and the output has to be one file.
That's slightly different as a question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt
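The per-file reset can also be written with FNR, and extended to cover both names from the question. This is a sketch under a few assumptions: the scratch directory is made up, the match is on the whole first field rather than a substring, and the file name is printed only to make the result easy to read.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'luis 321 487' 'ana 454 345' 'pedro 341 435' 'ana 941 345' > aha_1.txt
printf '%s\n' 'pedro 201 723' 'gusi 837 134' 'ana 319 518' 'cindy 738 278' 'ana 984 265' > aha_2.txt

result=$(awk '
  # FNR restarts at 1 for every input file, so forget earlier names here.
  FNR == 1 { split("", seen) }
  # Print only the first line per name within the current file.
  ($1 == "ana" || $1 == "pedro") && !seen[$1]++ { print FILENAME, $1, $2 }
' aha_*.txt)

printf '%s\n' "$result"
```

Redirecting the awk output to a single file gives the one combined output file asked for in the revised question.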