Search text for multiple lines matching string 1 which are not separated by string 2 - regex

I've got a file looking like this:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|some|supplementary|information
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|supplementary|information
I'm looking for a regexp to use with sed / awk / (e)grep (it doesn't really matter to me which of these, any would be fine) to find the following in the text above:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
I want to get back a |100| line if it is followed by at least two |110| lines before another |100| line appears. The result should contain the initial |100| line together with all |110| lines that follow but not the following |100| line.
sed -ne '/|100|/,/|110|/p'
gives me a list of all |100| lines that are followed by at least one |110| line, but it doesn't check whether the |110| line is repeated more than once, so I get back results I'm not looking for.
sed -ne '/|100|/,/|100|/p'
returns all |100| lines plus everything up to and including the next |100| line.
Finding lines between search patterns has always been a nightmare for me. I've spent hours of trial and error on similar problems until something finally worked, but I never really understood why. I hope someone might be so kind as to save me the headache this time and maybe explain how the pattern does its work. I'm quite sure I'll face this kind of problem again, and then I could finally help myself.
Thank you for any help on this one!
Regards
Manuel

I'd do this in awk.
awk -F'|' '$2==100&&c>2{print b} $2==100{c=1;b=$0;next} $2==110&&c{c++;b=b RS $0;next} {c=0}' file
Broken out for easier reading:
awk -F'|' '
# If we're starting a new section and conditions have been met, print buffer
$2==100 && c>2 {print b}
# Start a section with a new count and a new buffer...
$2==100 {c=1;b=$0;next}
# Add to buffer
$2==110 && c {c++;b=b RS $0;next}
# Finally, zero everything if we encounter lines that don't fit the pattern
{c=0;b=""}
' file
Rather than using a regex, this steps through the file using the field delimiters you've specified. Upon seeing the "start" condition, it begins keeping a buffer. As subsequent lines match your "continue" condition, the buffer grows. Once we see the start of a new section, we print the buffer if the counter is big enough.
Works for me on your sample data.
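For reference, run against the sample file above, the one-liner should print exactly the block asked for:
awk -F'|' '$2==100&&c>2{print b} $2==100{c=1;b=$0;next} $2==110&&c{c++;b=b RS $0;next} {c=0}' file
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information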

Here's a GNU awk specific answer: use |100| as the record separator, |110| as the field separator, and look for records with at least 3 fields.
gawk '
BEGIN {
# a newline, the first pipe-delimited column, then the "100" value
RS="(\n[^|]+[|]100[|])"
FS="[|]110[|]"
}
NF >= 3 {print RT $0} # RT is the actual text matching the RS pattern
' file
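To see why the NF test works, you can print each record's field count under the same RS/FS settings (a purely diagnostic sketch); for the sample data this should report NF values of 1, 2, 3 and 2, and only the third record, the block you want, reaches 3:
gawk '
BEGIN {
RS="(\n[^|]+[|]100[|])"
FS="[|]110[|]"
}
{print "record " NR " has NF = " NF}
' file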

In awk, the field separator is set to a pipe character and the second field of each line is compared to 100 and 110; $0 represents the current line of the input file.
BEGIN { FS = "|" }
{
if($2 == 100) {
one_hundred = 1;
one_hundred_one = 0;
var0 = $0
}
if($2 == 110) {
one_hundred_one += 1;
if(one_hundred_one == 1 && one_hundred == 1) var1 = $0;
if(one_hundred_one == 2 && one_hundred == 1) var2 = $0;
}
if(one_hundred == 1 && one_hundred_one == 2) {
print var0
print var1
print var2
}
}
awk -f foo.awk input.txt
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information

Related

Match the number of characters in two lines

I have a file that I'm trying to prepare for some downstream analysis, but I need the number of characters in two lines to be identical. The file is formatted as below, where the 2nd (CTTATAATGCCGCTCCCTAAG) and 4th (bbbeeeeegggggiiiiiiiiigghiiiiiiiiiiiiiiiiiigeccccb) lines need to contain the same number of characters.
@HWI-ST:8:1101:3346:2198#GTCCGC/1
CTTATAATGCCGCTCCCTAAG
+HWI-ST:8:1101:3346:2198#GTCCGC/1
bbbeeeeegggggiiiiiiiiigghiiiiiiiiiiiiiiiiiigeccccb
@HWI-ST:8:1101:10491:2240#GTCCGC/1
GAGTAGGGAGTATACATCAG
+HWI-ST:8:1101:10491:2240#GTCCGC/1
abbceeeeggggfiiiiiigg`gfhfhhifhifdgg^ggdf_`_Y[aa_R
@HWI-ST:8:1101:19449:2134#GTCCGC/1
AAGAAGAGATCTGTGGACCA
So far I've pulled out the second line from each set of four and generated a file containing a record of the length of each line using:
grep -v '[^A-Z]' file.fastq |awk '{ print length($0); }' > newfile
Now I'm just looking for a way to use this record to tell a sed command how many characters to trim off the end of each line. Something similar to:
sed -r 's/.{n}$//' file
Replacing n with some regular expression to reference the text file. I wonder if I'm overcomplicating things, but I need the lines to match EXACTLY so I haven't been able to think of another way to go about it. Any help would be awesome, thanks!
This might be what you're looking for:
awk '
# If 2nd line of 4-line group, save length as len.
NR % 4 == 2 { len = length($0) }
# If 4th line of 4-line group, trim the line to len.
NR % 4 == 0 { $0 = substr($0, 1, len)}
# print every line
{ print }
' file
This assumes that the file consists of 4-line groups where the 2nd and 4th line of each group are the ones you're interested in. It also assumes that the 2nd line of each group will be no longer than its corresponding 4th line.
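The same logic condensed into a one-liner, if you prefer (trimmed.fastq is only an example name for the output file):
awk 'NR%4==2{len=length($0)} NR%4==0{$0=substr($0,1,len)} {print}' file.fastq > trimmed.fastq  # output name is illustrative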

how to replace the next string after match (every) two blank lines?

Is there a way to do this kind of substitution in awk, sed, ...?
I have a text file with sections divided by two blank lines:
section1_name_x
dklfjsdklfjsldfjsl


section2_name_x
dlskfjsdklfjsldkjflkj


section_name_X
dfsdjfksdfsdf
I would like to replace every "section_name_x" with "#section_name_x"; that is, how do I replace the next string after every match of two blank lines?
Thanks,
Steve,
awk '
(NR==1 || blank==2) && $1 ~ /^section/ {sub(/section/, "#&")}
{
print
if (length)
blank = 0
else
blank ++
}
' file
#section1_name_x
dklfjsdklfjsldfjsl


#section2_name_x
dlskfjsdklfjsldkjflkj


#section_name_X
dfsdjfksdfsdf
hm....
Given your example data why not just
sed 's/^section[0-9]*_name.*/#&/' file > newFile && mv newFile file
some seds support sed -i OR sed -i"" to overwrite the existing file, avoiding the && mv ... shown above.
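With GNU sed, for example, that would be:
sed -i 's/^section[0-9]*_name.*/#&/' file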
The regex says: "section" must be at the beginning of the line and may optionally be followed by a number (or no number at all).
IHTH
In gawk you can use the RT builtin variable:
gawk '{$1="#"$1; print $0 RT}' RS='\n\n' file
* Update *
Thanks to @EdMorton I realized that my first version was incorrect.
What happens:
Assigning to $1 causes the record to be rebuilt, which is not good in this case, since any sequence of whitespace between fields is replaced by a single space, and leading and trailing whitespace is replaced by the null string.
Using print adds an additional newline to the output.
The correct version:
gawk '{printf "%s", "#" $0 RT}' RS='\n\n\n' file

delete n lines between 2 matching patterns, keeping the first match and deleting the second match

Given data in a text file:
string1 EP00 37.45 83.83
save
save
save
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
string2
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
gibberish
I would like to use sed or awk to match both string1 and string2, then delete everything after string1 and the first 3 lines that follow it. I would like it to also delete string2, but not string1, and also delete one extra line between that and the next text. So the expected output would be:
string1 EP00 37.45 83.83
save
save
save
If it helps, there are always the same number of lines (16) between the two patterns. I would like to do this with sed or awk, but have only been able to figure out a script that deletes the entire block of data between the two while holding onto both strings:
sed '/string1/,/string2/{//!d}' file >> tr.txt
Does anyone know how to retain string1 and the three lines after it, and delete the rest of the lines between the two patterns, including string2? Sed or awk, whichever is easier, would be fine.
Thanks!
You can use this awk:
awk '/^string1/{i=0} /^string1/,/^string2/{i++; if (i<5) print; next}1' file
string1 EP00 37.45 83.83
save
save
save
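For readability, here is the same one-liner broken out, with comments added:
awk '
/^string1/ {i=0}              # reset the counter at every string1 line
/^string1/,/^string2/ {       # within a string1 ... string2 block
    i++
    if (i < 5) print          # print string1 itself plus the next 3 lines
    next                      # swallow the rest of the block, including string2
}
1                             # any line outside a block is printed unchanged
' file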
If you want to do this with awk, the script might look something like this (updated based on your comments; it now "recycles", so it will do the matching correctly for as many string1-string2 blocks as you have). I realize you have already accepted an answer, but I wanted to give you this alternative; it is much less "professional" than @anubhava's answer, but it might give you an insight into how to make awk do "anything you want", even if you are not a pro:
BEGIN {
state = 0;
}
{ if($1 == "string1") {
state = 1;
}
if (state == 1) {
state = 2;
print;
next;
}
if (state > 1 && state < 5) {
print;
state = state + 1;
next;
}
if ($1 == "string2") {
state = 6;
next;
}
if (state == 6) {
state = 0;
next;
}
if (state == 0) {
print;
next;
}
}
The state variable basically tells you "where am I in the logic". The states are:
0: "normal state", print the line, go to the next
1: "found string1", start printing this line and the next three
2 - 4: printing "the lines that followed string1"
5: Waiting for string2, not printing anything
6: found string2, need to delete the next line
Having consumed that next line, we reset the state to 0 again.
You would run it with
awk -f scriptFile.awk inputfile.txt > outputfile.txt
I made this "pedestrian", so you can see exactly what is done, and in what order. Let me know if you have any questions.
Something like this:
sed -e '0,/^string1/{/^string1/!d}' -e '/^string1/{n;n;n;q}' < file > output
The first command (using GNU sed's 0,/regexp/ address form) removes everything from line 1 up to, but not including, the first line starting with "string1"; the second finds that line, lets it and the next three lines through, and then quits, which deletes everything from there to the end.
You could also do this, if your version of grep supports it:
grep -A3 "^string1" file > output
Using GNU sed
sed -n '/^string1/,+3p' file
If no GNU sed, try this:
sed -n ':a;/string1/{N;N;N;p;ta;}' file
This might work for you (GNU sed):
sed -rn '/string1/{h;d};H;/string2/{x;s/(string1([^\n]*\n){4}).*string2.*/\1/p}' file

Awk print if no match

I am using the following statement in awk with text piped to it from another command:
awk 'match($0,/(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)/) && NR<11 {print substr($0,RSTART,RLENGTH)}'
which is almost working for what I need (find one of the words in the regex within the first 10 lines of the input and print that word). The main thing I need to do is to output something if there is no match. For instance, if none of those words are found in the first ten lines it would output UNKNOWN.
I also need to limit the output to the first match, as I need to ensure a single line of output per input file. I can do this with head or ask another question if need be; I only mention it here in case it affects how to output the no-match text.
I am also not tied to awk as a tool - if there is a simpler way to do this with sed or something else I am open to it.
You just need to exit at the first match, or on line 11 if there was no match:
awk '
match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
print substr($0,RSTART,RLENGTH)
exit
}
NR == 11 {print "UNKNOWN"; exit}
'
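Slotted into the pipeline described in the question, with the full pattern written out (your_command is just a placeholder for whatever currently produces the text), that would look something like:
your_command |   # placeholder: whatever produces the text to search
awk '
match($0,/(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)/) {
    print substr($0,RSTART,RLENGTH)
    exit
}
NR == 11 {print "UNKNOWN"; exit}
'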
I like glenn jackman's answer; however, if you wish to print matches for all 10 lines, you can try something like this:
awk '
match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
print NR " ---> " substr($0,RSTART,RLENGTH)
flag=1
}
flag==0 && NR==11 {
print "UNKNOWN"
exit
}'
You can do this:
( head -10 | egrep -o '(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)' ||
  echo "UNKNOWN" ) | head -1

sed: remove strings between two patterns leaving the 2nd pattern intact (half inclusive)

I am trying to filter out text between two patterns. I've seen a dozen examples but didn't manage to get exactly what I want.
Sample input:
START LEAVEMEBE text
 data
START DELETEME text
 data
 more data
 even more
START LEAVEMEBE text
 data
 more data
START DELETEME text
 data
 more
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I want to be left with:
START LEAVEMEBE text
 data
START LEAVEMEBE text
 data
 more data
SOMETHING that doesn't start with START
# sometimes it starts with characters that needs to be escaped...
I tried running sed with:
sed '/^START DELETEME/,/^[^ ]/d'
And got an inclusive removal. I then tried adding "exclusions" (not sure if I really understand this syntax well):
sed '/^START DELETEME/,/^[^ ]/{/^[^ ]/!d}'
But my "START DELETEME" line is still there (yes, I can grep it out, but that's ugly), and besides, it DOES remove the empty line in this sample as well, and I'd like to leave empty lines intact if they are my end pattern.
I am wondering if there is a way to do it with a single sed command.
I have an awk script that does this well:
BEGIN { flag = 0 }
{
if ($0 ~ "^START DELETEME")
flag=1
else if ($0 !~ "^ ")
flag=0
if (flag != 1)
print $0
}
But as you know "A is for awk which runs like a snail". It takes forever.
Thanks in advance.
Dave.
Using a loop in sed:
sed -n '/^START DELETEME/{:l n; /^[ ]/bl};p' input
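Spread over several lines with comments added (GNU sed accepts comments in the script), the same command reads:
sed -n '
# a line that starts a block we want to drop
/^START DELETEME/ {
  :l
  # read the next line; because of -n it is not printed
  n
  # still an indented continuation line? go round the loop again
  /^[ ]/ b l
}
# everything else, including the line that ended the loop, gets printed
p
' input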
GNU sed
sed '/LEAVEMEBE/,/DELETEME/!d;{/DELETEME/d}' file
I would stick with awk:
awk '
/LEAVE|SOMETHING/{flag=1}
/DELETE/{flag=0}
flag' file
But if you still prefer sed, here's another way:
sed -n '
/LEAVE/,/DELETE/{
/DELETE/b
p
}
' file