Deleting n lines in both directions and the match in sed?

Deleting the match and two lines before it works:
sed -i.bak -e '/match/,-2d' someCommonName.txt
Deleting the match and two lines after it works:
sed -i.bak -e '/match/,+2d' someCommonName.txt
But deleting the match, two lines after it and two lines before it does not work?
sed -i.bak -e '/match/-2,+2d' someCommonName.txt
sed: -e expression #1 unknown command: `-'
Why is that?

A sed command takes at most two addresses: a single address, or a range made of two. It never takes three.
/match/ is an address which matches a regex.
-2 is meant as an address which specifies two lines before (GNU sed does not document a -N form, so this may not work in every sed).
+2 is an address which specifies two lines after (a GNU extension, valid only as the second address of a range).
Therefore:
/match/,-2 is a range which specifies the line matching match to two lines before.
/match/-2,+2d, on the other hand, includes three addresses, and thus makes no sense.
To delete two lines before and after a pattern, I would recommend something like this (modified from this answer):
sed -n "1N;2N;/\npattern$/{N;N;d};P;N;D"
This keeps 3 lines in the buffer and reads through the file. When the pattern is found in the last line, it reads two more lines and deletes all 5. Note that this will not work if the pattern is in the first two lines of the file, but it is a start.

sed -i .bak '/match/,-2 {/match/!d;};/match/,+2d' YourFile
Try this (I cannot test it here; -2 is not available in my sed version).

I don't have a complete solution but an outline: sed is a pretty simple tool which doesn't do two things at once. My approach would be to run sed once deleting the two lines after the pattern but keeping the pattern itself. The result can then be piped to sed again to remove the pattern and the two lines before.
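The two-pass outline above could be sketched roughly as follows. This assumes GNU sed (for the addr,+N range form) and GNU coreutils tac, which reverses the stream so that "two lines before" becomes "two lines after". The function name delete_around is made up for illustration, and matches that sit closer than three lines apart are not handled.

```shell
# Hypothetical helper: delete each line containing "match" plus the two
# lines after it and the two lines before it.
# Assumes GNU sed and GNU tac; overlapping matches are not handled.
#
# Pass 1: keep the matching line but drop the (up to) two lines after it.
# Pass 2: reverse the stream, delete the match and the two lines that now
#         follow it, then restore the original order.
delete_around() {
  sed '/match/{$!{n;$!N;d;};}' | tac | sed '/match/,+2d' | tac
}

printf '%s\n' 1 2 3 4 '5 match' 6 7 8 9 | delete_around
```

With GNU tools and the sample input above, this should leave only lines 1, 2, 8 and 9.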

FWIW this is how I'd really do the job (just change the b and a values to delete different numbers of lines before/after match is found):
$ cat file
1
2
3
4
5 match
6
7
8
9
$ awk -v b=2 -v a=2 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file file
1
2
8
9
$ awk -v b=3 -v a=1 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file file
1
7
8
9
Note that the above assumes that when two lines containing match fall within one removal window, you want to base the deletions on the original positions of the matches, not on what would remain after the first match has already caused the second one to be deleted:
$ cat file2
1
2
3
4 match
5
6 match
7
8
9
$ awk -v b=2 -v a=2 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file2 file2
1
9
as opposed to the output being:
1
7
8
9
since deleting the 2 lines after the first match would delete the 2nd match and so the 2 lines after THAT would not be deleted since they no longer are within 2 lines after a match.
Something else to consider:
$ diff --changed-group-format='%<' --unchanged-group-format='' file <(grep -A2 -B2 match file)
1
2
8
9
$ diff --changed-group-format='%<' --unchanged-group-format='' file2 <(grep -A2 -B2 match file2)
1
9
That uses bash and GNU diff 3.2; I don't know which other shells or diff implementations support those constructs and options.

Why is this sed command only working on every other match?

Here's a sed command that works great, but only on every other match (simplified for your convenience):
cat testfile.txt | sed -E "/PATTERN/,/^>/{//!d;}"
if my testfile.txt is
>PATTERN
1
2
3
>PATTERN
a
b
c
>PATTERN
1
2
3
>PATTERN
a
b
c
>asdf
1
2
3
>asdf
a
b
c
Expected output:
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3
>asdf
a
b
c
actual output:
>PATTERN
>PATTERN
a
b
c
>PATTERN
>PATTERN
a
b
c
>asdf
1
2
3
>asdf
a
b
c
-An aside-
(The actual goal is to find one of a group of patterns, then delete what comes after it until the next occurrence of a ">" symbol; that line should be deleted too, which I can do by piping to grep -v.)
I more or less got guidance by following what I found here. I've had this work for me. Here's an exact example (not that you have the file to look at):
for line in $(cat bad_results.txt)
do
echo "removing $line"
cat 16S.fasta | sed "/$line/,/^>/{//!d;}" | grep $line -v > temp_stor.fasta
done
/PATTERN/,/^>/ will match from a line containing PATTERN to a line starting with > (which can be a line containing PATTERN). You should instead match an empty line, like so:
$ sed '/PATTERN/,/^$/{/PATTERN/!d}' ip.txt
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3
>asdf
a
b
c
Your aside isn't very clear to me, but if you want to delete the line with PATTERN as well, you can simplify it to:
$ sed '/PATTERN/,/^$/d' ip.txt
>asdf
1
2
3
>asdf
a
b
c
You can also use:
awk -v RS= -v ORS='\n\n' '!/PATTERN/'
but it will have an extra empty line at the end of the output. The advantage is that instead of your for loop, you can do this:
awk 'BEGIN{FS="\n"; ORS="\n\n"}
NR==FNR{a[">" $0]; next}
!($1 in a)' bad_results.txt RS= 16S.fasta
The above code stores each line of bad_results.txt in an associative array, with > character prefixed. And then, contents of 16S.fasta will be printed only if entire line starting with > isn't present in bad_results.txt.
If you want a partial match:
awk 'BEGIN{FS="\n"; ORS="\n\n"}
NR==FNR{a[$0]; next}
{for (k in a) if(index($1, k)) next; print}' bad_results.txt RS= 16S.fasta
In your range pattern match, the second element 'consumes' the line so that the start of the range no longer sees that block as a match. This is why you apparently have 'skipping.' This can be fixed by using a lookahead that does not consume characters to match. Unfortunately, sed lacks lookaheads.
Perl is really a better choice than sed for complex multi line matches involving lookaheads.
Here is a Perl one-liner that reads the whole file and applies the regex /(?:^>PATTERN)|(?:^>[\s\S]*?)(?=\v?^>|\z)/ (Demo) to it:
$ perl -0777 -lnE 'while(/(?:^>PATTERN)|(?:^>[\s\S]*?)(?=\v?^>|\z)/gm) { say $& }' file
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3
>asdf
a
b
c
Aside: Please read Looping through the content of a file in Bash. The way you are doing it is not ideal. Specifically, read here on the side effects of using cat in a Bash loop.
This might work for you (GNU sed):
sed -E '/PATTERN/{p;:a;$!{N;/\n>/!s/\n//;ta};D}' file
As has been already stated, the range operator matches from PATTERN to a line beginning >. The latter line may also contain PATTERN but is not matched, hence the alternating pattern.
The solution above does not use the range operator; instead it gathers the lines from the first one containing PATTERN up to the line before the next line beginning with >.
If a line contains PATTERN it is printed, then subsequent lines are collected until the end-of-file or a line begins >.
Within this collection, newlines are removed - essentially making the first line in the pattern space the concatenation of one or more lines.
On a match (or end-of-file) this long line is removed and any line still in the pattern space is processed as if it had been read in as part of the normal sed cycle.
N.B. The difference between the d and the D commands: d deletes the pattern space and immediately begins the next sed cycle, which involves reading in the next line of input. D removes everything up to and including the first newline in the pattern space and then begins the next sed cycle; however, if the pattern space is not empty, no new line of input is read before the cycle resumes.
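A small experiment can make the d versus D distinction concrete. This is only an illustrative sketch; the variable names with_d and with_D are made up here. Both commands join each line with its successor and print the first line of the pair, and they differ only in how the pattern space is cleared afterwards.

```shell
# d: throw away both joined lines, then read fresh input for the next cycle.
with_d=$(printf '%s\n' a b c | sed '$!N;P;d')

# D: throw away only the first line and restart the cycle with what is left.
with_D=$(printf '%s\n' a b c | sed '$!N;P;D')

printf 'with d:\n%s\nwith D:\n%s\n' "$with_d" "$with_D"
```

With d, line b is lost because it was deleted together with a; with D, b stays in the pattern space and is printed on the next cycle, so all three lines come out.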
An alternative:
sed '/^>/{h;/^>PATTERN/p};G;/\n>PATTERN/!P;d' file
To answer the question as to why it seemed to be skipping every other occurrence
(as fleshed out in the comments on Sundeep's answer; see his answer for a workaround):
The apparent skipping was just an illusion. sed is greedy: it found the first occurrence of PATTERN and matched up to and including the next line starting with a >. It then deleted everything in between (as instructed). sed then continued where it left off and as such didn't "see" that last line as a new occurrence.
To be clear:
>PATTERN  <--- sed sees the first occurrence here --------------------|
a                                                                     | (this whole
a                                                                     | chunk is
a                                                                     | considered
                                                                      | by sed)
>PATTERN  <--- then matches up to here (the next occurrence of ">") -|
b         <--- then continues from here, "missing" the match of PATTERN above
b
b
>PATTERN
c
c
c
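Because the skipping comes from the end line of the range being used up, one way around it is to avoid ranges altogether and let every header line decide for itself. The following awk sketch is only one possible take, assuming each record starts with a line beginning with ">"; the function name strip_bodies is hypothetical.

```shell
# Hypothetical helper: keep every ">" header line, but drop the body lines of
# any record whose header contains PATTERN.
strip_bodies() {
  awk '
    /^>/ { skip = ($0 ~ /PATTERN/) }  # each header decides afresh whether its body is dropped
    /^>/ || !skip                     # headers always print, body lines only when not skipping
  '
}

printf '%s\n' '>PATTERN' 1 2 '>PATTERN' a b '>asdf' x y | strip_bodies
```

For this shortened input, both >PATTERN headers are printed without their bodies, and the >asdf record is printed in full.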

How to use grep to extract multiple groups

Say I have this file data.txt:
a=0,b=3,c=5
a=2,b=0,c=4
a=3,b=6,c=7
I want to use grep to extract 2 columns corresponding to the values of a and c:
0 5
2 4
3 7
I know how to extract each column separately:
grep -oP 'a=\K([0-9]+)' data.txt
0
2
3
And:
grep -oP 'c=\K([0-9]+)' data.txt
5
4
7
But I can't figure how to extract the two groups. I tried the following, which didn't work:
grep -oP 'a=\K([0-9]+),.+c=\K([0-9]+)' data.txt
5
4
7
I am also curious whether grep can do this. \K "removes" the previously matched content, so you cannot use it twice in the same expression: it will just show the last group. Hence, it should be done differently.
In the meantime, I would use sed:
sed -r 's/^a=([0-9]+).*c=([0-9]+)$/\1 \2/' file
It captures the digits after a= and c=, provided the line starts with a= and contains nothing after the digits of c=.
For your input, it returns:
0 5
2 4
3 7
You could try the grep command below. Note, however, that grep -o prints each match on a separate line, so you won't get the format shown in the question.
$ grep -oP 'a=\K([0-9]+)|c=\K([0-9]+)' file
0
5
2
4
3
7
To get that format, pipe the output of grep to paste or a similar command.
$ grep -oP 'a=\K([0-9]+)|c=\K([0-9]+)' file | paste -d' ' - -
0 5
2 4
3 7
Use this:
awk -F'[=,]' '{print $2" "$6}' data.txt
This uses = and , as field separators, so the values of a and c end up in the second and sixth fields.
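The field-number approach depends on a, b and c always appearing in the same order. If that cannot be assumed, a variant that looks the values up by key might be sketched as follows; the function name extract_a_c is made up for illustration, and it assumes every field has the form key=value.

```shell
# Hypothetical helper: print the values of keys a and c for every line,
# regardless of the position of the key=value pairs.
extract_a_c() {
  awk -F, '{
    split("", val)             # clear the lookup table for each new line
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")       # kv[1] is the key, kv[2] is the value
      val[kv[1]] = kv[2]
    }
    print val["a"], val["c"]
  }'
}

printf '%s\n' 'a=0,b=3,c=5' 'a=2,b=0,c=4' 'a=3,b=6,c=7' | extract_a_c
```

For the sample data this prints the same two columns as the commands above.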

How to delete lines before a match, preserving it?

I have the following script to remove all lines before a line which matches with a word:
str='
1
2
3
banana
4
5
6
banana
8
9
10
'
echo "$str" | awk -v pattern=banana '
print_it {print}
$0 ~ pattern {print_it = 1}
'
It returns:
4
5
6
banana
8
9
10
But I want to include the first match too. This is the desired output:
banana
4
5
6
banana
8
9
10
How could I do this? Do you have any better idea with another command?
I've also tried sed '0,/^banana$/d', but it seems to work only with files, and I want to use it with a variable.
And how could I get all lines before a match using awk?
That is, with banana as the regex, this would be the output:
1
2
3
This awk should do:
echo "$str" | awk '/banana/ {f=1} f'
banana
4
5
6
banana
8
9
10
sed -n '/^banana$/,$p'
Should do what you want. -n instructs sed to print nothing by default, and the p command prints all addressed lines. This works on a stream. It differs from the awk solution in that it requires the entire line to be exactly 'banana', whereas your awk solution merely requires 'banana' to appear somewhere in the line; I'm copying the regex from your sed example. I'm not sure what you mean by "use it with a variable". If you mean that you want the string 'banana' to be in a variable, you can use sed -n "/$variable/,\$p" (note the double quotes and the escaped $), or sed -n "/^$variable\$/,\$p", or sed -n "/^$variable"'$/,$p'. You can also run echo "$str" | sed -n '/banana/,$p', just as you do with awk.
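Putting the variable form into a small function may make the quoting easier to see. This is only a sketch: the name from_first_match is hypothetical, and it assumes the pattern contains no "/" and is meant as a basic regular expression.

```shell
# Hypothetical helper: print everything from the first line matching the
# regex in $1 through the end of the input.
# Assumption: $1 contains no "/" and is a basic regular expression.
from_first_match() {
  sed -n "/$1/,\$p"   # double quotes expand $1; \$ keeps sed's "last line" address
}

pattern=banana
printf '%s\n' 1 2 3 banana 4 5 6 banana 8 | from_first_match "$pattern"
```

The input here comes from a pipe rather than a file, which shows that the same command works on a stream.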
Just invert the commands in the awk:
echo "$str" | awk -v pattern=banana '
$0 ~ pattern {print_it = 1}   # if the line matches, activate the flag
print_it {print}              # if the flag is active, print the line
'
The print_it flag is activated when pattern is found. From that moment on (including that line), lines are printed because the flag is on. Previously, the print happened before the check.
cat in.txt | awk "/banana/,0"
If you don't want to preserve the matched line, you can use the following (the 0,/regex/ address form is a GNU sed extension):
cat in.txt | sed "0,/banana/d"
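The question also asks how to get only the lines before the first match. One possible awk sketch is to stop reading as soon as the pattern is seen; the function name before_first_match is made up for illustration.

```shell
# Hypothetical helper: print the lines that precede the first line matching
# the regex in $1, then stop.
before_first_match() {
  awk -v pattern="$1" '
    $0 ~ pattern { exit }   # first match: stop without printing it
    { print }               # every earlier line is printed
  '
}

printf '%s\n' 1 2 3 banana 4 5 6 banana 8 | before_first_match banana
```

For this input it prints 1, 2 and 3.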

grep line containing a pattern to line containing other pattern

Say the input is:
">"1aaa
2
3
4
">"5bbb
6
7
">"8ccc
9
">"10ddd
11
12
I want this output (per example for the matching pattern "bbb"):
">"5bbb
6
7
I had tried with grep:
grep -A 2 -B 0 "bbb" file.txt > results.txt
This works. However, the number of lines between ">"5bbb and ">"8ccc is variable. Does anyone know how to achieve that using Unix command line tools?
With awk you could simply use a flag, like so:
$ awk '/^">"/{f=0}/bbb/{f=1}f' file
">"5bbb
6
7
You could also parametrize the pattern like so:
$ awk '/^">"/{f=0}$0~pat{f=1}f' pat='aaa' file
">"1aaa
2
3
4
Explanation:
/^">"/   # Regular expression that matches lines starting with ">"
{f=0}    # If the regex matched, unset the print flag
/bbb/    # Regular expression to match the pattern bbb
{f=1}    # If the regex matched, set the print flag
f        # If the print flag is set, print the line
Something like this should do it:
sed -ne '/bbb/,/^"/ { /bbb/p; /^[^"]/p; }' file.txt
That is:
for the range of lines between matching /bbb/ and /^"/:
    if the line matches /bbb/, print it
    if the line doesn't start with ", print it
otherwise nothing else is printed
This might work for you (GNU sed):
sed '/^"/h;G;/\n.*bbb/P;d' file
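The one-liner above comes without an explanation, so here is a hedged walk-through of what each command appears to do, written as a small function. The name print_bbb_blocks is hypothetical; the sed commands are the same ones, only split across several -e options.

```shell
# Hypothetical helper: print every block whose header line (a line starting
# with a double quote) contains bbb.
#
#   /^"/h        on a header line, copy it into the hold space
#   G            append the remembered header to the current line
#   /\n.*bbb/P   if the remembered header contains bbb, print the current line
#   d            delete the pattern space so nothing is printed automatically
print_bbb_blocks() {
  sed -e '/^"/h' -e 'G' -e '/\n.*bbb/P' -e 'd'
}

printf '%s\n' '">"1aaa' 2 3 '">"5bbb' 6 7 '">"8ccc' 9 | print_bbb_blocks
```

Because the test looks only at the text after the newline, it is the remembered header that decides whether a line is printed, not the line itself.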

How to get the output from second value using sed

I have a file with below content as example
cat test.log
hello
how are you?
terminating
1
2
3
terminating
1
2
When I use the sed command below to show the output after terminating, it prints the following:
sed -n '/terminating/,$p' test.log
terminating
1
2
3
terminating
1
2
I want output as below
terminating
1
2
Can anyone help me on this please?
Code for sed:
$ sed -n '/terminating/{N;N;h};${g;p}' file
terminating
1
2
If a line matches terminating, store it and the next two lines in the hold space. Print those three lines at the end of the file ($).
Example with a sedscript:
$ cat script.sed
/terminating/{
N
N
h
}
${
g
p
}
$ sed -nf script.sed file
terminating
1
2
And for all lines after the last terminating:
$ cat file
cat test.log
hello
how are you?
terminating
1
2
3
terminating
1
2
3
4
5
6
$ cat script.sed
H
/terminating/{
h
}
${
g
p
}
$ sed -nf script.sed file
terminating
1
2
3
4
5
6
This might work for you (GNU sed):
sed -r '/^terminating/!d;:a;$!N;/.*\n(terminating)/s//\1/;$q;ba' file
Unless the line begins with terminating, discard it. Read in more lines, discarding everything that precedes a later line beginning with terminating. At end-of-file, print what remains.
sed -n 'H;/terminating/x;${x;p}' test.log
As annotated pseudocode:
for each line:
    append the line to the hold space     H
    if line matches /terminating/         /terminating/
        then set hold space to line       x
    on the last line:                     $
        get the hold space                x
        and print                         p
note: x actually exchanges the hold/pattern spaces.
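The same idea (start collecting afresh at every match, print only at the end) can also be written in awk, which some may find easier to read. This is a sketch under the assumption that the whole tail fits in memory; the function name after_last_terminating is made up.

```shell
# Hypothetical helper: print from the last line matching "terminating"
# through the end of the input.
after_last_terminating() {
  awk '
    /terminating/ { buf = ""; seen = 1 }  # a new match discards what was collected so far
    seen          { buf = buf $0 "\n" }   # collect the match and every line after it
    END           { printf "%s", buf }    # only the final collection is printed
  '
}

printf '%s\n' hello 'how are you?' terminating 1 2 3 terminating 1 2 | after_last_terminating
```

If no line matches, nothing is printed, because the seen flag is never set.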