split file into multiple files based upon differing start and end delimiter

i have a file that i need split into multiple files, and need it done via separate start and end delimiters.
for example, if i have the following file:
abcdef
START
ghijklm
nopqrst
END
uvwxyz
START
abcdef
ghijklm
nopqrs
END
START
tuvwxyz
END
i need 3 separate files of:
file1
START
ghijklm
nopqrst
END
file2
START
abcdef
ghijklm
nopqrs
END
file3
START
tuvwxyz
END
i found this link which showed how to do it with a starting delimiter, but i also need an ending delimiter. i have tried this using some regex in the awk command, but am not getting the result that i want. i don't quite understand how to get awk to be 'lazy' or 'non greedy', so that i can get it to pull apart the file correctly.
i really like the awk solution. something similar would be fantastic (i am reposting the solution here so you don't have to click through):
awk '/DELIMITER_HERE/{n++}{print >"out" n ".txt" }' input_file.txt
any help is appreciated.

You can use this awk command:
awk '/^START/{n++;w=1} n&&w{print >"out" n ".txt"} /^END/{w=0}' input_file.txt

awk '
/START/ {p = 1; n++; file = "file" n}
p { print > file }
/END/ {p = 0}
' filename
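Both commands above keep every output file open until awk exits, and with many START/END blocks some awk implementations can run out of file descriptors. A minimal sketch, assuming the same START/END layout as in the question, that uses the standard close() function to release each file once its END line has been written (the out<n>.txt names follow the first answer; the input is the question's sample):

```shell
# Run in an empty scratch directory: the demo writes files here.
cat > input_file.txt <<'EOF'
abcdef
START
ghijklm
nopqrst
END
uvwxyz
START
abcdef
ghijklm
nopqrs
END
START
tuvwxyz
END
EOF

awk '
/^START/ { n++; file = "out" n ".txt"; w = 1 }   # open a new block
w        { print > file }                        # write while inside a block
/^END/   { if (w) close(file); w = 0 }           # block finished: release the file
' input_file.txt

ls out*.txt
```

Only one output file is open at any time, so the number of blocks no longer matters.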

Here's another example using range notation:
awk '/START/,/END/ {if(/START/) n++; print > "out" n ".txt"}' data
Or an equivalent with a different if/else syntax:
awk '/START/,/END/ {print > "out" (/START/ ? ++n : n) ".txt"}' data
Here's a version without repeating the /START/ regex after Ed Morton's comments because I just wanted to see if it would work:
awk '/START/ && ++n,/END/ {print > "out" n ".txt" }' data
The other answers are definitely better if your range is or will ever be non-inclusive of the ends.
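To illustrate that last point, here is a rough sketch of the flag-based approach with the START and END lines themselves left out. The rule order does the work: the flag is cleared before the print rule and set after it. The body<n>.txt names and the shortened input are made up for this example.

```shell
# Run in an empty scratch directory: the demo writes files here.
printf '%s\n' abcdef START ghijklm nopqrst END uvwxyz START tuvwxyz END > input_file.txt

awk '
/^END/   { w = 0 }                        # leave the block before the print rule runs
w        { print > ("body" n ".txt") }    # only lines strictly between the delimiters
/^START/ { n++; w = 1 }                   # enter the block after the print rule has run
' input_file.txt

cat body1.txt
cat body2.txt
```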

Related

Interval expressions in gawk to awk

I hope this is an easy fix
I originally wrote a clean and simple script that used gawk, first and foremost because that was what I found when I was solving the original issue. I now need to adapt it to use only awk.
sample file.fasta:
>gene1
>gene235
ATGCTTAGATTTACAATTCAGAAATTCCTGGTCTATTAACCCTCCTTCACTTTTCACTTTTCCCTAACCCTTCAAAATTTTATATCCAATCTTCTCACCCTCTACAATAATACATTTATTATCCTCTTACTTCAAAATTTTT
>gene335
ATGCTCCTTCTTAATCTAAACCTTCAAAATTTTCCCCCTCACATTTATCCATTATCACCTTCATTTCGGAATCCTTAACTAAATACAATCATCAACCATCTTTTAACATAACTTCTTCAAAATTTTACCAACTTACTATTGCTTCAAAATTTTTCAT
>gene406
ATGTACCACACACCCCCATCTTCCATTTTCCCTTTATTCTCCTCACCTCTACAATCCCCTTAATTCCTCTTCAAAATTTTTGGAGCCCTTAACTTTCAATAACTTCAAAATTTTTCACCATACCAATAATATCCCTCTTCAAAATTTTCCACACTCACCAAC
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
What I know works in awk is the following:
awk '/[ACTG]GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
the culprit therefore is the interval expression of {21,}
What I want is to match each line that contains at least 21 nucleotides to the left of my "GG" match.
Can anyone help?
Edit:
Thanks for all the help:
There are various solutions that worked. To reply to some of the comments a more basic example of the initial output and the desired effect achieved...
Prior to awk command:
cat file1.fasta
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene2
ATGGGTGCCTTAACTTTCAATAACTG
>gene3
ATGTCAAAATTTTTCATTTCAAT
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Following codes all produced the same desired output:
original code
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
slight modification that enables interval expressions in the original command for GNU awk 3.x.x
awk --re-interval '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
Allows for modification of val and gives correct output; untested, but should work with lower versions of awk
awk -v usr_count="21" '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>= usr_count){print id ORS $0};id=""}' file1.fasta
awk --re-interval '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS} /^>/{name=$0; seq=""; next} {seq = seq $0 } END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS }' file1.fasta
Desired output: only grab genes names and sequences of sequences that have 21 nucleotides prior to matching GG
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Lastly just to show the discarded lines
>gene2
ATG-GG-TGCCTTAACTTTCAATAACTG # only 3 nt prior to any GG combo
>gene3
ATGTCAAAATTTTTCATTTCAAT # No GG match found
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA # only 14 nt prior to any GG combo
Hope this helps others!

EDIT: As per the OP's comment, to print the gene ids too, try the following.
awk '
/gene/{
id=$0
next
}
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print id ORS $0
}
id=""
}
' Input_file
OR one-liner form of above solution as per OP's request:
awk '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>=21){print id ORS $0};id=""}' Input_file
Could you please try the following, written and tested with the shown samples only.
awk '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print
}
}
' Input_file
OR a more generic approach, using a variable (usr_count) in which the user sets the minimum number of nucleotides that must be present before GG.
awk -v usr_count="21" '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=usr_count){
print
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/.*GG/){ ##Using Match function to match everything till GG in current line.
val=substr($0,RSTART,RLENGTH-2) ##Storing sub-string of current line from RSTART till RLENGTH-2 into variable val here.
if(gsub(/[ACTG]/,"&",val)>=21){ ##Checking condition if global substitution of ACTG(with same value) is greater or equal to 21 then do following.
print ##Printing current line then.
}
}
' Input_file ##Mentioning Input_file name here.

GNU awk has accepted interval expressions in regular expressions since version 3.0. However, they have only been enabled by default since version 4.0. If you have awk 3.x.x, you have to use the --re-interval flag to enable them.
awk --re-interval '/a{3,6}/{print}' file
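Where neither interval expressions nor --re-interval are available, one possible workaround is to build the repeated bracket expression as a string and use it as a dynamic regex. This is only a sketch, assuming single-line sequences as in the question's file1.fasta; the variable names min and re are illustrative and not part of any answer here.

```shell
# Run in an empty scratch directory: the demo writes files here.
cat > file1.fasta <<'EOF'
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene2
ATGGGTGCCTTAACTTTCAATAACTG
>gene3
ATGTCAAAATTTTTCATTTCAAT
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
EOF

awk -v min=21 '
BEGIN {
    # Emulate [ACTG]{min,}GG: min copies of [ACTG], then [ACTG]*, then GG.
    for (i = 1; i <= min; i++) re = re "[ACTG]"
    re = re "[ACTG]*GG"
}
$0 ~ re { print a; print }   # a holds the previous line (the header)
{ a = $0 }
' file1.fasta > species_precrispr.fasta

cat species_precrispr.fasta
```

With the sample above this should keep gene1 and gene5 only, the same records the question lists as desired output.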
There is an issue that people often overlook when using awk on FASTA files: when sequences span multiple lines, a match can straddle a line break. To handle this, you need to join each sequence into a single string first.
The easiest way to process FASTA files with awk is to build up a variable called name and a variable called seq. Every time you have read a full sequence, you can process it. Note that, for best results, the sequence should be stored as one continuous string that contains no newlines or other whitespace. A generic awk program for processing FASTA looks like this:
awk '/^>/ && seq { **process_sequence_here** }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { **process_sequence_here** }' file.fasta
In the presented case, your sequence processing looks like:
awk '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS}
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq ORS }' file.fasta
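The same name/seq pattern can be sketched without interval expressions by moving the check into a function and testing RLENGTH. The input file multi.fasta and the function name process are made up for this illustration, and the check assumes sequences contain only A, C, T and G: match() reports the leftmost match, so another character early in a sequence could hide a later qualifying stretch.

```shell
# Run in an empty scratch directory: the demo writes files here.
# Hypothetical input: gene1 has its GG split across two lines.
cat > multi.fasta <<'EOF'
>gene1
ATGCCTTAACTTTCAATAACTG
GTTT
>gene2
ATGG
CC
EOF

awk '
# Print the record when at least 21 nucleotides precede a GG.
function process(hdr, s) {
    if (match(s, /[ACTG]+GG/) && RLENGTH > 22)
        print hdr ORS s
}
/^>/ { if (seq != "") process(name, seq); name = $0; seq = ""; next }
     { seq = seq $0 }                    # join wrapped sequence lines
END  { if (seq != "") process(name, seq) }
' multi.fasta > multi_hits.fasta

cat multi_hits.fasta
```

Neither line of gene1 matches on its own, but the joined sequence does, which is the reason for combining the lines first.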

Sounds like what you want is:
awk 'match($0,/[ACTG]+GG/) && RLENGTH>22{print a; print} {a=$0}' file
but this is probably all you need given the sample input you provided:
awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
They'll both work in any awk.
Using your updated sample input:
$ awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG

awk concatenate strings till contain substring

I have an awk script from this example:
awk '/START/{if (x) print x; x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
Here's a sample file with lines:
$ cat file
START
1
2
3
4
5
end
6
7
START
1
2
3
end
5
6
7
So I need to stop concatenating once the accumulated string contains the word end, so the desired output is:
START,1,2,3,4,5,end
START,1,2,3,end

Short Awk solution (though it will check for /end/ pattern twice):
awk '/START/,/end/{ printf "%s%s",$0,(/^end/? ORS:",") }' file
The output:
START,1,2,3,4,5,end
START,1,2,3,end
/START/,/end/ - range pattern
A range pattern is made of two patterns separated by a comma, in the
form ‘begpat, endpat’. It is used to match ranges of consecutive
input records. The first pattern, begpat, controls where the range
begins, while endpat controls where the pattern ends.
/^end/? ORS:"," - set delimiter for the current item within a range
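As a small illustration of the range pattern on its own: with no action attached, awk simply prints every record inside each range, both ends included. The file name range_demo.txt and its contents are invented for this sketch.

```shell
# Run in an empty scratch directory: the demo writes a file here.
printf '%s\n' START 1 2 end 6 7 START 3 end 5 > range_demo.txt

# No action given, so the default action (print) applies to each record
# from a line matching START through the next line matching end.
awk '/START/,/end/' range_demo.txt
```

The lines 6, 7 and 5 fall outside any range and are skipped.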

here is another awk
$ awk '/START/{ORS=","} /end/ && ORS=RS; ORS!=RS' file
START,1,2,3,4,5,end
START,1,2,3,end
Note that /end/ && ORS=RS; is shortened form of /end/{ORS=RS; print}

You can use this awk:
awk '/START/{p=1; x=""} p{x = x (x=="" ? "" : ",") $0} /end/{if (x) print x; p=0}' file
START,1,2,3,4,5,end
START,1,2,3,end

Another way, similar to answers in How to select lines between two patterns?
$ awk '/START/{ORS=","; f=1} /end/{ORS=RS; print; f=0} f' ip.txt
START,1,2,3,4,5,end
START,1,2,3,end
this doesn't need a buffer, but doesn't check if START had a corresponding end
/START/{ORS=","; f=1} set ORS as , and set a flag (which controls what lines to print)
/end/{ORS=RS; print; f=0} set ORS to newline on ending condition. Print the line and clear the flag
f print input record as long as this flag is set

Since we seem to have gone down the rabbit hole with ways to do this, here's a fairly reasonable approach with GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='end' -v OFS=',' 'RT{$0=gensub(/.*(START)/,"\\1",1); $NF=$NF OFS RT; print}' file
START,1,2,3,4,5,end
START,1,2,3,end

Remove \n newline if string contains keyword

I'd like to know if I can remove a \n (newline) only if the current line has one or more keywords from a list; for instance, I want to remove the \n if it contains the words hello or world.
Example:
this is an original
file with lines
containing words like hello
and world
this is the end of the file
And the result would be:
this is an original
file with lines
containing words like hello and world this is the end of the file
I'd like to use sed, or awk and, if needed, grep, wc or whatever commands work for this purpose. I want to be able to do this on a lot of files.

Using awk you can do:
awk '/hello|world/{printf "%s ", $0; next} 1' file
this is an original
file with lines
containing words like hello and world this is the end of the file
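One edge case with the printf approach: if the last line of the file contains a keyword, the output ends without a newline. A possible variant, sketched here with a made-up flag named pending and invented sample text, adds the missing newline in an END rule:

```shell
# Run in an empty scratch directory: the demo writes files here.
printf '%s\n' 'first line' 'say hello' 'to the world' > demo.txt

awk '
/hello|world/ { printf "%s ", $0; pending = 1; next }   # keyword: join with the next line
              { print; pending = 0 }
END           { if (pending) print "" }                 # terminate a dangling joined line
' demo.txt > joined.txt

cat joined.txt
```

The joined line still ends with the trailing space that printf wrote; only the missing newline is restored.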

here is simple one using sed
sed -r ':a;$!{N;ba};s/((hello|world)[^\n]*)\n/\1 /g' file
Explanation
:a;$!{N;ba} read whole file into pattern, like this: this is an original\nfile with lines\ncontaining words like hello\nand world\nthis is the end of the file$
s/((hello|world)[^\n]*)\n/\1 /g search the key words hello or world and remove the next \n,
g command in sed substitute stands to apply the replacement to all matches to the regexp, not just the first.

A non-regex approach:
awk '
BEGIN {
# define the word list
w["hello"]
w["world"]
}
{
printf "%s", $0
for (i=1; i<=NF; i++)
if ($i in w) {
printf " "
next
}
print ""
}
'
or a perl one-liner
perl -pe 'BEGIN {@w = qw(hello world)} s/\n/ / if grep {$_ ~~ @w} split'
To edit the file in-place, do:
awk '...' filename > tmpfile && mv tmpfile filename
perl -i -pe '...' filename
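Since the question mentions a lot of files, the tmpfile pattern can be wrapped in a shell loop. This is a sketch only: the file names a.txt and b.txt stand in for the real files, and the awk program is the ORS one-liner from another answer on this page.

```shell
# Run in an empty scratch directory: the demo writes files here.
# Two hypothetical input files standing in for "a lot of files".
printf '%s\n' 'say hello' 'there' > a.txt
printf '%s\n' 'big world' 'out here' 'done' > b.txt

for f in a.txt b.txt; do
    tmp="$f.tmp"
    # Replace the original only if awk succeeded.
    awk '{ ORS = (/hello|world/ ? FS : RS) } 1' "$f" > "$tmp" && mv "$tmp" "$f"
done

cat a.txt b.txt
```

In real use the list after "in" would typically be a glob such as *.txt.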

This might work for you (GNU sed):
sed -r ':a;/^.*(hello|world).*\'\''/M{$bb;N;ba};:b;s/\n/ /g' file
This checks if the last line, of a possible multi-line, contains the required string(s) and if so reads another line until end-of-file or such that the last line does not contain the/those string(s). Newlines are removed and the line printed.

$ awk '{ORS=(/hello|world/?FS:RS)}1' file
this is an original
file with lines
containing words like hello and world this is the end of the file

sed -n '
:beg
/hello/ b keep
/world/ b keep
H;s/.*//;x;s/\n/ /g;p;b
: keep
H;s/.*//
$ b beg
' YourFile
This is a bit harder, because the check on the current line may involve a previous line that already contained hello or world.
Principle:
On every pattern match, keep the line in the hold buffer.
Otherwise, load the hold buffer, remove the \n characters (using a swap and emptying the current line, because few buffer operations are available) and print the content.
Add a special case for a pattern match on the last line (it is normally held, so it would otherwise not be printed).

how to replace the next string after match (every) two blank lines?

is there a way to do this kind of substitution in Awk, sed, ...?
I have a text file with sections divided by two blank lines:
section1_name_x
dklfjsdklfjsldfjsl
section2_name_x
dlskfjsdklfjsldkjflkj
section_name_X
dfsdjfksdfsdf
I would like to replace every "section_name_x" with "#section_name_x"; that is, how do I replace the next string after every match of two blank lines?
Thanks,
Steve,

awk '
(NR==1 || blank==2) && $1 ~ /^section/ {sub(/section/, "#&")}
{
print
if (length)
blank = 0
else
blank ++
}
' file
#section1_name_x
dklfjsdklfjsldfjsl
#section2_name_x
dlskfjsdklfjsldkjflkj
#section_name_X
dfsdjfksdfsdf

hm....
Given your example data why not just
sed 's/^section[0-9]*_name.*/#&/' file > newFile && mv newFile file
some seds support sed -i OR sed -i"" to overwrite the existing file, avoiding the && mv ... shown above.
The reg ex says, section must be at the beginning of the line, and can optionally contain a number or NO number at all.
IHTH

In gawk you can use the RT builtin variable:
gawk '{$1="#"$1; print $0 RT}' RS='\n\n' file
* Update *
Thanks to @EdMorton I realized that my first version was incorrect.
What happens:
Assigning to $1 causes the record to be rebuilt, which is not good in this case, since any sequence of white space between fields is replaced by a single space, and leading and trailing white space is replaced by the null string.
Using print adds an additional newline to the output.
The correct version:
gawk '{printf "%s", "#" $0 RT}' RS='\n\n\n' file
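The record-rebuilding effect described in this answer can be seen in isolation with a two-line demo. The input string is made up, and both commands should behave the same in any POSIX awk:

```shell
# Assigning to a field makes awk rebuild $0 with OFS between fields,
# so runs of blanks collapse and leading blanks disappear.
rebuilt=$(printf '  a   b\n' | awk '{ $1 = "#" $1; print }')

# Editing $0 directly with sub() leaves the original spacing alone.
kept=$(printf '  a   b\n' | awk '{ sub(/[^ ]/, "#&"); print }')

printf '[%s]\n[%s]\n' "$rebuilt" "$kept"   # prints [#a b] then [  #a   b]
```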

sed: remove strings between two patterns leaving the 2nd pattern intact (half inclusive)

I am trying to filter out text between two patterns, I've seen a dozen examples but didn't manage to get exactly what I want:
Sample input:
START LEAVEMEBE text
data
START DELETEME text
data
more data
even more
START LEAVEMEBE text
data
more data
START DELETEME text
data
more
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I want to stay with:
START LEAVEMEBE text
data
START LEAVEMEBE text
data
more data
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I tried running sed with:
sed '/^START DELETEME/,/^[^ ]/d'
And got an inclusive removal, I tried adding "exclusions" (not sure if I really understand this syntax well):
sed '/^START DELETEME/,/^[^ ]/{/^[^ ]/!d}'
But my "START DELETEME" line is still there (yes, I can grep it out, but that's ugly :) and besides - it DOES remove the empty line in this sample as well and I'd like to leave empty lines if they are my end pattern intact )
I am wondering if there is a way to do it with a single sed command.
I have an awk script that does this well:
BEGIN { flag = 0 }
{
if ($0 ~ "^START DELETEME")
flag=1
else if ($0 !~ "^ ")
flag=0
if (flag != 1)
print $0
}
But as you know "A is for awk which runs like a snail". It takes forever.
Thanks in advance.
Dave.

Using a loop in sed:
sed -n '/^START DELETEME/{:l n; /^[ ]/bl};p' input

GNU sed
sed '/LEAVEMEBE/,/DELETEME/!d;{/DELETEME/d}' file

I would stick with awk:
awk '
/LEAVE|SOMETHING/{flag=1}
/DELETE/{flag=0}
flag' file
But if you still prefer sed, here's another way:
sed -n '
/LEAVE/,/DELETE/{
/DELETE/b
p
}
' file
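For comparison, the flag logic of the question's own awk script can be written as three pattern rules that do not depend on the LEAVEMEBE and SOMETHING keywords. This sketch assumes that continuation lines are indented with a space, as the question's /^[^ ]/ end pattern implies; the indentation in blocks.txt is reconstructed on that assumption, and the variable name skip is illustrative.

```shell
# Run in an empty scratch directory: the demo writes files here.
cat > blocks.txt <<'EOF'
START LEAVEMEBE text
 data
START DELETEME text
 data
 more data
START LEAVEMEBE text
 data
START DELETEME text
 more
SOMETHING that doesn't start with START
EOF

awk '
/^START DELETEME/ { skip = 1; next }   # drop the block header and start skipping
!/^ /             { skip = 0 }         # any unindented (or empty) line ends the block
!skip                                  # print everything outside a skipped block
' blocks.txt > kept.txt

cat kept.txt
```

The line that ends a skipped block is kept, because the flag is cleared before the print rule runs.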
