Why is this sed command only working on every other match? - regex

Here's a sed command that works great, but only on every other match (simplified for your convenience):
cat testfile.txt | sed -E "/PATTERN/,/^>/{//!d;}"
if my testfile.txt is
>PATTERN
1
2
3
>PATTERN
a
b
c
>PATTERN
1
2
3
>PATTERN
a
b
c
>asdf
1
2
3
>asdf
a
b
c
Expected output:
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3
>asdf
a
b
c
actual output:
>PATTERN
>PATTERN
a
b
c
>PATTERN
>PATTERN
a
b
c
>asdf
1
2
3
>asdf
a
b
c
-An aside-
(The actual goal is to find one of a group of patterns, then delete everything that comes after it until the next occurrence of a ">" symbol. I also want to delete the matching line itself, which I can do by piping to grep -v.)
I more or less got guidance by following what I found here. I've had this work for me. Here's an exact example (not that you have the file to look at it):
for line in $(cat bad_results.txt)
do
echo "removing $line"
cat 16S.fasta | sed "/$line/,/^>/{//!d;}" | grep $line -v > temp_stor.fasta
done

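/PATTERN/,/^>/ matches from a line containing PATTERN to the next line starting with > (which can itself be a line containing PATTERN). Because that closing line is used up as the end of the range, it cannot also start a new range. If the records in your file are separated by empty lines, end the range at an empty line instead, like so: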
$ sed '/PATTERN/,/^$/{/PATTERN/!d}' ip.txt
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3
>asdf
a
b
c
Your aside isn't very clear to me, but if you want to delete the line with PATTERN as well, you can simplify it to:
$ sed '/PATTERN/,/^$/d' ip.txt
>asdf
1
2
3
>asdf
a
b
c
You can also use:
awk -v RS= -v ORS='\n\n' '!/PATTERN/'
but it will have an extra empty line at the end of the output. The advantage is that instead of your for loop, you can do this:
awk 'BEGIN{FS="\n"; ORS="\n\n"}
NR==FNR{a[">" $0]; next}
!($1 in a)' bad_results.txt RS= 16S.fasta
The above code stores each line of bad_results.txt in an associative array, with > character prefixed. And then, contents of 16S.fasta will be printed only if entire line starting with > isn't present in bad_results.txt.
If you want a partial match:
awk 'BEGIN{FS="\n"; ORS="\n\n"}
NR==FNR{a[$0]; next}
{for (k in a) if(index($1, k)) next; print}' bad_results.txt RS= 16S.fasta
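As a minimal sketch of the exact-match awk filter above, here it is run on made-up data. The record IDs and the temporary directory are assumptions for illustration only, and the records are assumed to be separated by empty lines, which is what paragraph mode (RS=) relies on.

```shell
# Hypothetical sample data; the IDs and sequences are invented.
workdir=$(mktemp -d)
printf '%s\n' 'seq2' > "$workdir/bad_results.txt"
printf '%s\n' '>seq1' 'ACGT' '' '>seq2' 'TTTT' '' '>seq3' 'GGGG' > "$workdir/16S.fasta"

# First file: one ID per line, stored as an array key with a ">" prefix.
# Second file: RS= switches to paragraph mode, so $1 is the header line.
kept=$(awk 'BEGIN{FS="\n"; ORS="\n\n"}
            NR==FNR{a[">" $0]; next}
            !($1 in a)' "$workdir/bad_results.txt" RS= "$workdir/16S.fasta")

rm -rf "$workdir"
printf '%s\n' "$kept"
```

Only the seq1 and seq3 records should survive. The command substitution strips the trailing empty lines that ORS adds.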

In your range pattern match, the second element 'consumes' the line so that the start of the range no longer sees that block as a match. This is why you apparently have 'skipping.' This can be fixed by using a lookahead that does not consume characters to match. Unfortunately, sed lacks lookaheads.
Perl is really a better choice than sed for complex multi line matches involving lookaheads.
Here is a Perl one-liner that reads the whole file and applies the regex /(?:^>PATTERN)|(?:^>[\s\S]*?)(?=\v?^>|\z)/ to it:
$ perl -0777 -lnE 'while(/(?:^>PATTERN)|(?:^>[\s\S]*?)(?=\v?^>|\z)/gm) { say $& }' file
>PATTERN
>PATTERN
>PATTERN
>PATTERN
>asdf
1
2
3
>asdf
a
b
c
Aside: Please read Looping through the content of a file in Bash. The way you are doing it is not ideal. Specifically, read here on the side effects of using cat in a Bash loop.
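To illustrate the aside, here is a small sketch comparing the two loop styles on an invented list file whose lines contain spaces. The file contents and variable names are assumptions, not taken from the question.

```shell
# Hypothetical list file with two lines, each containing a space.
list=$(mktemp)
printf '%s\n' 'seq one' 'seq two' > "$list"

# Unquoted $(cat ...) is word-split, so the loop sees words, not lines.
for_count=0
for word in $(cat "$list"); do
    for_count=$((for_count + 1))
done

# read -r consumes exactly one line per iteration.
read_count=0
while IFS= read -r line; do
    read_count=$((read_count + 1))
done < "$list"

rm -f "$list"
echo "for: $for_count iterations, while read: $read_count iterations"
```

The for loop runs four times (once per word), while the while read loop runs twice (once per line).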

This might work for you (GNU sed):
sed -E '/PATTERN/{p;:a;$!{N;/\n>/!s/\n//;ta};D}' file
As has been already stated, the range operator matches from PATTERN to a line beginning >. The latter line may also contain PATTERN but is not matched, hence the alternating pattern.
The solution above, does not use the range operator but instead gathers the lines from the first containing PATTERN to the line before a line beginning >.
If a line contains PATTERN it is printed, then subsequent lines are collected until the end-of-file or a line begins >.
Within this collection, newlines are removed - essentially making the first line in the pattern space the concatenation of one or more lines.
On a match (or end-of-file) this long line is removed and any line still in the pattern space is processed as if it had been read in as part of the normal sed cycle.
N.B. The d command deletes the whole pattern space and immediately begins the next sed cycle, which reads in the next line of input. The D command behaves differently when the pattern space contains a newline: it removes everything up to and including the first newline and restarts the cycle with what remains, without reading a new line of input. If the pattern space contains no newline, D behaves like d.
An alternative:
sed '/^>/{h;/^>PATTERN/p};G;/\n>PATTERN/!P;d' file
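The difference between d and D described in the N.B. above can be seen with a tiny sketch. The four-line numeric input is made up, and the two scripts differ only in their final command.

```shell
# Both scripts append the next line (N) and print the first line (P).
# D keeps the unprinted remainder for the next cycle; d throws it away.
with_D=$(printf '%s\n' 1 2 3 4 | sed '$!N;P;D' | paste -s -d ' ' -)
with_d=$(printf '%s\n' 1 2 3 4 | sed '$!N;P;d' | paste -s -d ' ' -)

echo "D: $with_D"
echo "d: $with_d"
```

With D every line is printed once (1 2 3 4). With d the second line of each pair is discarded before it can be printed, leaving 1 3.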

To answer the question as to why it seemed to be skipping every other occurrence
(as fleshed out in the comments of Sundeep's answer. See his answer to work around this)
The apparent skipping was just an illusion. sed found the first occurrence of PATTERN and took the range up to and including the next line starting with >. It then deleted everything in between (as instructed). sed then continued from where it left off, so it never "sees" that closing line as a new occurrence.
To be clear:
>PATTERN   <--- sed sees the first occurrence here -------------------|
a                                                                     | (this whole
a                                                                     |  chunk is
a                                                                     |  considered
                                                                      |  by sed)
>PATTERN   <--- then matches up to here (the next occurrence of ">") -|
b          <--- then continues from here, "missing" the match of PATTERN above
b
b
>PATTERN
c
c
c
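One way to check this explanation is to ask sed which line numbers it considers part of a range. The sketch below uses an invented nine-line sample with three PATTERN headers, and the = command prints the number of each line inside the range.

```shell
# Hypothetical sample: headers on lines 1, 4 and 7.
sample=$(printf '%s\n' '>PATTERN' 1 2 '>PATTERN' a b '>PATTERN' x y)

# Print the line numbers that fall inside a /PATTERN/,/^>/ range.
in_range=$(printf '%s\n' "$sample" | sed -n '/PATTERN/,/^>/=' | paste -s -d ' ' -)

echo "$in_range"
```

Lines 5 and 6 are missing from the result: line 4 closed the first range, so the lines after it are not in any range until line 7 opens the next one.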

Related

sed retrieve part of line

I have lines of code that look like this:
hi:12345:234 (second line)
How do I write a line of code using the sed command that only prints out the 2nd item in the second line?
My current command looks like this:
sed -n '2p' file which gets the second line, but I don't know what regex to use to match only the 2nd item '12345' and combine with my current command
Could you please try following, written and tested with shown samples in GNU sed.
sed -n '2s/\([^:]*\):\([^:]*\).*/\2/p' Input_file
Explanation: The -n option stops sed from printing every line, so output happens only where the p flag is given explicitly (later in the code). 2s restricts the substitution to the 2nd line. The regex uses sed's capture groups, which store matched text in numbered buffers (\1, \2, and so on) for later retrieval: the first group captures everything before the first :, and the second captures everything between the first and second :, as per OP's request. The replacement \2 therefore swaps the whole line for the second captured value, and p prints the result for the 2nd line only.
A couple of solutions:
sed -n '2s/.*:\([0-9]*\):.*/\1/p' file
sed -n '2{s/^[^:]*://; s/:.*//p; q}' file
awk -F':' 'FNR==2{print $2}' file
All display 12345 when the second line of file is hi:12345:234.
sed -n '2s/.*:\([0-9]*\):.*/\1/p' only displays the captured value thanks to -n and p option/flag. It matches a whole string capturing digits between two colons, \1 only keeps the capture.
The sed -n '2{s/^[^:]*://;s/:.*//p;q}' removes all from start till first :, all from the second to end, and then quits (q) so if your file is big, it will be processed quicker.
awk -F':' 'FNR==2{print $2}' splits the second line with a colon and fetches the second item.
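As a self-contained sketch of the above, the snippet below builds a two-line sample (the first line is invented) and extracts the second colon-separated item of the second line in two ways: with a single sed substitution, and by combining the asker's sed -n '2p' with cut.

```shell
# Hypothetical two-line input; only the second line comes from the question.
sample=$(printf '%s\n' 'first:line:here' 'hi:12345:234')

# One sed call: on line 2, keep only the text between the first two colons.
via_sed=$(printf '%s\n' "$sample" | sed -n '2s/^[^:]*:\([^:]*\).*/\1/p')

# Alternative: select line 2 with sed, then let cut pick field 2.
via_cut=$(printf '%s\n' "$sample" | sed -n '2p' | cut -d: -f2)

echo "$via_sed $via_cut"
```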

How can I print 2 lines if the second line contains the same match as the first line?

Let's say I have a file with several million lines, organized like this:
#1:N:0:ABC
XYZ
#1:N:0:ABC
ABC
I am trying to write a one-line grep/sed/awk matching function that returns both lines if the string after the last colon of the first line (ABC in this example) is found in the second line.
When I try to use grep -A1 -P and pipe the matches with a match like '(?<=:)[A-Z]{3}', I get stuck. I think my creativity is failing me here.
With awk
$ awk -F: 'NF==1 && $0 ~ s{print p ORS $0} {s=$NF; p=$0}' ip.txt
#1:N:0:ABC
ABC
-F: use : as delimiter, makes it easy to get last column
s=$NF; p=$0 save last column value and entire line for printing later
NF==1 if line doesn't contain :
$0 ~ s if line contains the last column data saved previously
if search data can contain regex meta characters, use index($0,s) instead to search literally
note that this code assumes input file having line containing : followed by line which doesn't have :
With GNU sed (might work with other versions too, syntax might differ though)
$ sed -nE '/:/{N; /.*:(.*)\n.*\1/p}' ip.txt
#1:N:0:ABC
ABC
/:/ if line contains :
N add next line to pattern space
/.*:(.*)\n.*\1/ capture string after last : and check if it is present in next line
again, this assumes input like shown in question.. this won't work for cases like
#1:N:0:ABC
#1:N:0:XYZ
XYZ
This might work for you (GNU sed):
sed -n 'N;/.*:\(.*\)\n.*\1/p;D' file
Use grep-like option -n to explicitly print lines. Read two lines into the pattern space and print both if they meet the requirements. Always delete the first and repeat.
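The difference between reading lines in fixed pairs and sliding over them one at a time shows up on the problem input mentioned earlier (two header lines in a row). A small sketch, assuming GNU sed for \n and the backreference in the address:

```shell
# The edge case from above: a header directly followed by another header.
sample=$(printf '%s\n' '#1:N:0:ABC' '#1:N:0:XYZ' 'XYZ')

# Fixed pairs: lines 1+2 are consumed together, so the pair 2+3 is never tested.
paired=$(printf '%s\n' "$sample" | sed -nE '/:/{N; /.*:(.*)\n.*\1/p}')

# Sliding window: D drops only the first line, so every adjacent pair is tested.
sliding=$(printf '%s\n' "$sample" | sed -n 'N;/.*:\(.*\)\n.*\1/p;D')

printf 'paired:  [%s]\n' "$paired"
printf 'sliding: [%s]\n' "$sliding"
```

The paired version prints nothing here, while the sliding version reports the second and third lines.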
If your actual Input_file is the same as the shown example, then the following may help too.
awk -v FS="[: \n]" -v RS="" '$(NF-1)==$NF' Input_file
EDIT: Adding 1 more solution as per Sundeep suggestion too here.
awk -v FS='[:\n]' -v RS= 'index($NF, $(NF-1))' Input_file

Deleting n lines in both directions and the match in sed?

Deleting the match and two lines before it works:
sed -i.bak -e '/match/,-2d' someCommonName.txt
Deleting the match and two lines after it works:
sed -i.bak -e '/match/,+2d' someCommonName.txt
But deleting the match, two lines after it and two lines before it does not work?
sed -i.bak -e '/match/-2,+2d' someCommonName.txt
sed: -e expression #1 unknown command: `-'
Why is that?
sed operates on a range of addresses. That means either one or two expressions, not three.
/match/ is an address which matches a regex.
-2 is an address which specifies two lines before
+2 is an address which specifies two lines after
Therefore:
/match/,-2 is a range which specifies the line matching match to two lines before.
/match/-2,+2d, on the other hand, includes three addresses, and thus makes no sense.
To delete two lines before and after a pattern, I would recommend something like this (modified from this answer):
sed -n "1N;2N;/\npattern$/{N;N;d};P;N;D"
This keeps 3 lines in the buffer and reads through the file. When the pattern is found in the last line, it reads two more lines and deletes all 5. Note that this will not work if the pattern is in the first two lines of the file, but it is a start.
sed -i .bak '/match/,-2 {/match/!d;};/match/,+2d' YourFile
try this (cannot test here, -2 is not available in my sed version)
I don't have a complete solution but an outline: sed is a pretty simple tool which doesn't do two things at once. My approach would be to run sed once deleting the two lines after the pattern but keeping the pattern itself. The result can then be piped to sed again to remove the pattern and the two lines before.
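The two-pass outline above could be sketched as follows. This is only one possible reading of it: it assumes GNU sed (for the addr,+N range) and GNU tac, and it reverses the stream so that "two lines before" becomes "two lines after" for the second pass. The nine-line sample is invented.

```shell
sample=$(printf '%s\n' 1 2 3 4 '5 match' 6 7 8 9)

# Pass 1: delete the two lines after the match but keep the match itself.
# Pass 2 (on the reversed stream): delete the match and the two lines after it,
# which are the two lines before it in the original order.
result=$(printf '%s\n' "$sample" |
    sed '/match/,+2{/match/!d}' |
    tac |
    sed '/match/,+2d' |
    tac |
    paste -s -d ' ' -)

echo "$result"
```

Overlapping windows (two matches close together) are not handled specially in this sketch.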
FWIW this is how I'd really do the job (just change the b and a values to delete different numbers of lines before/after match is found):
$ cat file
1
2
3
4
5 match
6
7
8
9
$ awk -v b=2 -v a=2 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file file
1
2
8
9
$ awk -v b=3 -v a=1 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file file
1
7
8
9
Note that the above assumes that when 2 "match"s appear within a removal window you want to base the deletions on the original occurrence, not what would happen after the first match being found causes the 2nd match to be deleted:
$ cat file2
1
2
3
4 match
5
6 match
7
8
9
$ awk -v b=2 -v a=2 'NR==FNR{if (/match/) for (i=(NR-b);i<=(NR+a);i++) skip[i]; next } !(FNR in skip)' file2 file2
1
9
as opposed to the output being:
1
7
8
9
since deleting the 2 lines after the first match would delete the 2nd match and so the 2 lines after THAT would not be deleted since they no longer are within 2 lines after a match.
Something else to consider:
$ diff --changed-group-format='%<' --unchanged-group-format='' file <(grep -A2 -B2 match file)
1
2
8
9
$ diff --changed-group-format='%<' --unchanged-group-format='' file2 <(grep -A2 -B2 match file2)
1
9
That uses bash and GNU diff 3.2, idk if/which other shells/diffs would support those constructs/options.

Find the Last Occurrence of a search string And Print the Line next line in Ksh

How can we find the last occurrence of a search string (regex) and then print the line that follows it? Assume a text file with the data below:
1 absc
1 sandka
file hjk
2 asdaps
2 amsdapm
file abc
So, from this file, I have to grep or awk the last occurrence of the 2 and print the line that follows it.
awk is always handy for these cases:
$ awk '/2/ {p=1; next} p{a=$0; p=0} END{print a}' file
file abc
Explanation
/2/ {p=1; next} when 2 appears in the line, activate the p flag and skip the line.
p{a=$0; p=0} when the p flag is active, store the line and deactivate p.
END{print a} print the stored value, which happens to be the last one because a is always overwritten.
Using grep
grep -A 1 '^2' option displays lines that match 2 at the beginning of the line plus one following line
then use tail -1 to print the final line:
grep -A 1 '^2' yourfile | tail -1
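Both approaches can be tried on the sample data from the question. This sketch anchors the awk pattern with ^2 (a small change from the awk answer, which matches a 2 anywhere in the line) and assumes a grep that supports -A.

```shell
sample=$(printf '%s\n' '1 absc' '1 sandka' 'file hjk' '2 asdaps' '2 amsdapm' 'file abc')

# awk: remember the most recent line that directly follows a matching line.
via_awk=$(printf '%s\n' "$sample" | awk '/^2/ {p=1; next} p {a=$0; p=0} END {print a}')

# grep: print each match plus one line of trailing context, keep the last line.
via_grep=$(printf '%s\n' "$sample" | grep -A 1 '^2' | tail -1)

echo "$via_awk"
echo "$via_grep"
```

Note that if the last matching line is also the last line of the file, neither variant has a following line to print: the grep pipeline returns the match itself, and the awk version falls back to whatever followed an earlier match.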

sed replace last line matching pattern

Given a file like this:
a
b
a
b
I'd like to be able to use sed to replace just the last line that contains an instance of "a" in the file. So if I wanted to replace it with "c", then the output should look like:
a
b
c
b
Note that I need this to work irrespective of how many matches it might encounter, or the details of exactly what the desired pattern or file contents might be. Thanks in advance.
Not quite sed only:
tac file | sed '/a/ {s//c/; :loop; n; b loop}' | tac
testing
% printf "%s\n" a b a b a b | tac | sed '/a/ {s//c/; :loop; n; b loop}' | tac
a
b
a
b
c
b
Reverse the file, then for the first match, make the substitution and then unconditionally slurp up the rest of the file. Then re-reverse the file.
Note, an empty regex (here as s//c/) means re-use the previous regex (/a/)
I'm not a huge sed fan, beyond very simple programs. I would use awk:
tac file | awk '/a/ && !seen {sub(/a/, "c"); seen=1} 1' | tac
Many good answers here; here's a conceptually simple two-pass sed solution assisted by tail that is POSIX-compliant and doesn't read the whole file into memory, similar to Eran Ben-Natan's approach:
sed "$(sed -n '/a/ =' file | tail -n 1)"' s/a/c/' file
sed -n '/a/=' file outputs the numbers of the lines (function =) matching regex a, and tail -n 1 extracts the output's last line, i.e. the number of the line in file file containing the last occurrence of the regex.
Placing the command substitution $(sed -n '/a/=' file | tail -n 1) directly before ' s/a/c/' results in an outer sed script such as 3 s/a/c/ (with the sample input), which performs the desired substitution only on the last line on which the regex occurred.
If the pattern is not found in the input file, the whole command is an effective no-op.
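Split into its two passes, the same idea looks like this. It is a sketch on the four-line sample from the question, written to a temporary file; the variable names are assumptions.

```shell
tmp=$(mktemp)
printf '%s\n' a b a b > "$tmp"

# Pass 1: line number of the last line matching /a/.
last=$(sed -n '/a/=' "$tmp" | tail -n 1)

# Pass 2: substitute only on that line. If $last is empty, the script is
# just s/a/c/, which changes nothing because no line matches.
result=$(sed "${last}s/a/c/" "$tmp" | paste -s -d ' ' -)

rm -f "$tmp"
echo "last match on line $last: $result"
```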
Another approach:
sed "`grep -n '^a$' a | cut -d \: -f 1 | tail -1`s/a/c/" a
The advantage of this approach is that you run sequentially on the file twice, and not read it to memory. This can be meaningful in large files.
This might work for you (GNU sed):
sed -r '/^PATTERN/!b;:a;$!{N;/^(.*)\n(PATTERN.*)/{h;s//\1/p;g;s//\2/};ba};s/^PATTERN/REPLACEMENT/' file
or another way:
sed '/^PATTERN/{x;/./p;x;h;$ba;d};x;/./{x;H;$ba;d};x;b;:a;x;/./{s/^PATTERN/REPLACEMENT/p;d};x' file
or if you like:
sed -r ':a;$!{N;ba};s/^(.*\n?)PATTERN/\1REPLACEMENT/' file
On reflection, this solution may replace the first two:
sed '/a/,$!b;/a/{x;/./p;x;h};/a/!H;$!d;x;s/^a$/c/M' file
If the regexp is nowhere to be found in the file, the file will pass through unchanged. Once the regex matches, all lines will be stored in the hold space and will be printed when one or both conditions are met. If a subsequent regex is encountered, the contents of the hold space is printed and the latest regex replaces it. At the end of file the first line of the hold space will hold the last matching regex and this can be replaced.
Another one:
tr '\n' ' ' | sed 's/\(.*\)a/\1c/' | tr ' ' '\n'
in action:
$ printf "%s\n" a b a b a b | tr '\n' ' ' | sed 's/\(.*\)a/\1c/' | tr ' ' '\n'
a
b
a
b
c
b
A two-pass solution for when buffering the entire input is intolerable:
sed "$(sed -n /a/= file | sed -n '$s/$/ s,a,c,/p' )" file
(the earlier version of this hit a bug with history expansion encountered on a redhat bash-4.1 install, this way avoids a $!d that was being mistakenly expanded.)
A one-pass solution that buffers as little as possible:
sed '/a/!{1h;1!H};/a/{x;1!p};$!d;g;s/a/c/'
Simplest:
tac | sed '0,/a/ s/a/c/' | tac
Here is all done in one single awk
awk 'FNR==NR {if ($0~/a/) f=NR;next} FNR==f {$0="c"} 1' file file
a
b
c
b
This reads the file twice. First run to find last a, second run to change it.
tac infile.txt | sed "s/a/c/; ta ; b ; :a ; N ; ba" | tac
The first tac reverses the lines of infile.txt, the sed expression (see https://stackoverflow.com/a/9149155/2467140) replaces the first match of 'a' with 'c' and prints the remaining lines, and the last tac reverses the lines back to their original order.
Here is a way with only using awk:
awk '{a[NR]=$1}END{x=NR;cnt=1;while(x>0){a[x]=((a[x]=="a"&&--cnt==0)?"c <===":a[x]);x--};for(i=1;i<=NR;i++)print a[i]}' file
$ cat f
a
b
a
b
f
s
f
e
a
v
$ awk '{a[NR]=$1}END{x=NR;cnt=1;while(x>0){a[x]=((a[x]=="a"&&--cnt==0)?"c <===":a[x]);x--};for(i=1;i<=NR;i++)print a[i]}' f
a
b
a
b
f
s
f
e
c <===
v
It can also be done in perl:
perl -e '@a=reverse<>;END{for(@a){if(/a/){s/a/c/;last}}print reverse @a}' temp > your_new_file
Tested:
> cat temp
a
b
c
a
b
> perl -e '@a=reverse<>;END{for(@a){if(/a/){s/a/c/;last}}print reverse @a}' temp
a
b
c
c
b
>
Here's another option:
sed -e '$ a a' -e '$ d' file
The first command appends an a and the second deletes the last line. From the sed(1) man page:
$ Match the last line.
d Delete pattern space. Start next cycle.
a text Append text, which has each embedded newline preceded by a backslash.
Here's the command:
sed '$s/.*/a/' filename.txt
And here it is in action:
> echo "a
> b
> a
> b" > /tmp/file.txt
> sed '$s/.*/a/' /tmp/file.txt
a
b
a
a
awk-only solution:
awk '/a/{printf "%s", all; all=$0"\n"; next}{all=all $0"\n"} END {sub(/^[^\n]*/,"c",all); printf "%s", all}' file
Explanation:
When a line matches a, all lines between the previous a up to (not including) current a (i.e. the content stored in the variable all) is printed
When a line doesn't match a, it gets appended to the variable all.
The last line matching a would not be able to get its all content printed, so you manually print it out in the END block. Before that though, you can substitute the line matching a with whatever you desire.
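The buffering described above can be followed on a six-line sample. This sketch runs the same awk program on invented input and joins the output onto one line for easy inspection.

```shell
# Each time a line matches /a/, the buffer collected so far is flushed and a
# new buffer starts with that line. At END the buffer begins with the last match.
result=$(printf '%s\n' a b a b a b |
    awk '/a/ {printf "%s", all; all = $0 "\n"; next}
         {all = all $0 "\n"}
         END {sub(/^[^\n]*/, "c", all); printf "%s", all}' |
    paste -s -d ' ' -)

echo "$result"
```

One caveat worth keeping in mind: if no line matches at all, the END block still rewrites the first buffered line, so a guard flag would be needed for input that may contain no match.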
Given:
$ cat file
a
b
a
b
You can use POSIX grep to count the matches:
$ grep -c '^a' file
2
Then feed that number into awk to print a replacement:
$ awk -v last=$(grep -c '^a' file) '/^a/ && ++cnt==last{ print "c"; next } 1' file
a
b
c
b