What does this sed expression from todo.sh do? - regex

What does the sed expression: G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P do? Exactly what does it match and how does it match it?
It's from todo.sh. In context:
archive()
{
#defragment blank lines
sed -i.bak -e '/./!d' "$TODO_FILE" ## delete all empty lines
[ $TODOTXT_VERBOSE -gt 0 ] && grep "^x " "$TODO_FILE" ## if verbose mode print completed tasks..
grep "^x " "$TODO_FILE" >> "$DONE_FILE" ## append completed tasks to $DONE_FILE
sed -i.bak '/^x /d' "$TODO_FILE" ## delete completed tasks
cp "$TODO_FILE" "$TMP_FILE"
sed -n 'G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P' "$TMP_FILE" > "$TODO_FILE"
## G; Add a newline
## s/\n/&&/; Substitute newline with && (two newlines?)
## /^\([ ~-]*\n\).*\n\1/d; Delete duplicate lines???
## s/\n// Remove newlines
## h Hold: copy pattern space to buffer
## P Print first line of pattern space
if [ $TODOTXT_VERBOSE -gt 0 ]; then
echo "TODO: $TODO_FILE archived."
fi
}

Ok, you've got some of the story already. Recall that the sed expression is executed for each input line. So the G at the beginning appends the contents of the hold space to the current line (with a newline in between). The hold space is empty initially, and the h command at the end of each input cycle grows it.
Then s/\n/&&/ duplicates the first newline only, the one between the current line and what was grabbed from the hold space. This is in preparation for the next command. /^\([ -~]*\n\).*\n\1/ indeed matches if the current line is identical to a line in the hold space:
    ^\([ -~]*\n\) matches a line at the beginning of the buffer¹
        Note that this matches only if the line contains only printable ASCII characters.
        If your system supports locales, ^\([[:print:]]*\n\) would be better.
    .*\n matches at least one subsequent line
    \1 matches a line identical to the first line
The extra newline added by the previous s command takes care of the case when the duplicate is the very first line from the hold space. The point of the \n\1 is to “anchor” the duplicate at the beginning of a line, otherwise bar would be considered a duplicate of foobar. If the current line is a duplicate, the d command discards it and execution branches to the next line.
If the current line is not a duplicate, s/\n// discards that extra newline (again, no g modifier, so only the first newline is removed). Then the h command results in the hold space containing what it contained before, with the current line prepended. Finally P prints the current input line.
Ok, now what does the hold space contain? It starts empty, then gets each successive line prepended unless it's a duplicate. So the hold space contains the input lines, in reverse order, minus the duplicates.
¹ Uh, I don't know how you did that, but that should be [ -~], not [ ~-] which wouldn't make any sense.
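To see the walkthrough above in action, here is a minimal sketch run on made-up input. The function name dedupe_keep_first and the sample lines are assumptions for illustration, not part of todo.sh, and the regex uses the corrected [ -~] range.

```shell
# dedupe_keep_first: hypothetical wrapper around the one-liner discussed above.
# Reads stdin and prints each distinct line once, keeping the first occurrence.
dedupe_keep_first() {
  sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
}

# Made-up sample input: "a" and "b" each occur twice, non-adjacently.
printf '%s\n' a b a c b | dedupe_keep_first
```

On this input the second a and the second b are discarded by d, so only a, b and c are printed, in their original order.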
Here's another way of doing this, if you have a POSIX-conforming set of tools (Single Unix v2 is good enough).
<"$TMP_FILE" \
nl -s: | # add line numbers
sort -t: -k2 -u | # sort, ignoring the line numbers, and remove duplicates
sort -t: -k1 -n | # sort by line number
cut -d: -f2- # cut out the line numbers
Oh, you wanted to do this legibly and concisely? Just use awk.
<"$TMP_FILE" awk '!seen[$0] {++seen[$0]; print}'
If the current line hasn't been seen yet, mark it as seen, and print it.
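The same idea is often written more tersely. The sketch below is one such form; the function name dedupe_awk and the sample input are assumptions for illustration.

```shell
# dedupe_awk: hypothetical helper with the same effect as the awk command above.
# seen[$0]++ evaluates to 0 (false) the first time a line appears, so the
# negation is true exactly once per distinct line; with no action given,
# awk's default action prints the line.
dedupe_awk() {
  awk '!seen[$0]++'
}

# Made-up sample input.
printf '%s\n' a b a c b | dedupe_awk
```

This prints a, b and c, the same result as the sed version on this input.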
Note that like the sed method, the awk method essentially stores the whole file in memory. The method above using sort has the advantage that only sort needs to keep more than one line of input at a time, and it's designed for this.
Of course, if you don't care about the order of the lines, it's as simple as sort -u.

After Gilles presented his excellent answer I found Famous Sed One-Liners Explained, which includes this exact sed expression; adding here for reference:
70. Delete duplicate, nonconsecutive lines from a file.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
This is a very tricky one-liner. It
stores the unique lines in hold buffer
and at each newly read line, tests if
the new line already is in the hold
buffer. If it is, then the new line is
purged. If it's not, then it's saved
in hold buffer for future tests and
printed.
A more detailed description - at each
line this one-liner appends the
contents of hold buffer to pattern
space with "G" command. The appended
string gets separated from the
existing contents of pattern space by
"\n" character. Next, a substitution
is made that substitutes the "\n"
character with two "\n\n". The
substitute command "s/\n/&&/" does
that. The "&" means the matched
string. As the matched string was
"\n", then "&&" is two copies of it
"\n\n". Next, a test
"/^\([ -~]*\n\).*\n\1/" is done to see if the contents of capture group 1 is
repeated. The capture group 1 is all
the characters from space " " to "~"
(which include all printable chars).
The "[ -~]" matches that. Replacing
one "\n" with two was the key idea
here. As "\([ -~]*\n\)" is greedy
(matches as much as possible), the
double newline makes sure that it
matches as little text as possible. If
the test is successful, the current
input line was already seen and "d"
purges the whole pattern space and
starts script execution from the
beginning. If the test was not
successful, the doubled "\n\n" gets
replaced with a single "\n" by
"s/\n//" command. Then "h" copies the
whole string to hold buffer, and "P"
prints the new line.
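To make the doubled newline visible, the following sketch dumps the pattern space with sed's l command right after the substitution. It is for inspection only: the helper name show_pattern_space is an assumption, and the duplicate test and P are deliberately left out so that every line is shown.

```shell
# show_pattern_space: hypothetical helper for illustration only.
# After G and s/\n/&&/ it dumps the pattern space with `l` (newlines appear
# as \n, the end of the pattern space as $), then restores the single
# newline and saves the result to the hold space as the real one-liner does.
show_pattern_space() {
  sed -n 'G; s/\n/&&/; l; s/\n//; h'
}

# Made-up two-line input.
printf '%s\n' a b | show_pattern_space
```

For the first line the dump is a\n\n$ (the current line, the doubled newline, and an empty hold space). For the second it is b\n\na\n$: the current line b, the doubled newline, then the previously held a.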

Related

grep/pcregrep/sed/awk the data after the last match to the end of a file

I need to grab the content after the last match of ENTRY to the end of the file, and I can't seem to do it. It can be multiple lines and the data can include any character to the end of the file including (,\n, ).
I’ve tried:
tail -1 file # doesn’t work due to it not consistently being one line
grep "^(.*" # only grabs one line
pcregrep -M '\n(.*' file # I think a variation of this is the solution, but I’ve had no luck so far.
File that grows below:
TOP OF FILE
%
ENTRY
(S®s
√6ûíπ‹ôTìßÅDPˆ¬k·Ù"=ÓxF)*†‰ú˚ÃQ´¿J‘\˜©ŒG»‡∫QÆ’<πsµ-ù±ñ∞NäAOilWçk
N+P}V<ôÒ∏≠µW*`Hß”;–GØ»14∏åR"ºã
FD‘mÍõ?*ÊÎÉC)(S®s
√6ûíπ‹ôTìßÅDPˆ¬k·Ù"=ÓxF)*†‰ú˚ÃQ´¿J‘\˜©ŒG»‡∫QÆ’<πsµ-ù±ñ∞NäAOilWçk
N+P}V<ôÒ∏≠µW*`Hß”;–GØ»14∏åR"ºã
FD‘mÍõ?*ÊÎÉC)eq
{
DATA
}
ENTRY
(A® S\kÉflã1»Âbπ¯Ú∞⁄äπHZ#F◊§•Ã*‹¡‹…ÿPkJòÑíòú˛¶à˛¨¢v|u«Ùbó–Ö¶¢∂5ıÜ#¨•˘®#W´≥‡*`H∑”ı–Só¬<˙ìEçöf∞Gg±:œe™flflå)A® S\kÉflã1»Âbπ¯Ú∞⁄äπHZ#F◊§•Ã*‹¡‹…ÿPkJòÑíòú˛¶à˛¨¢v|u«Ùbó–Ö¶¢∂5ıÜ#¨•˘®#W´≥‡*`H∑”ı–Só¬<˙ìEçöf∞Gg±:œe™flflå)eq
{
DATA
}if
ENTRY
(ÌSYõ˛9°\K¬∞≈fl|”/í÷L
Ö˙h/ÜÇi"û£fi±€ÀNéÓ›bÏÿmâ[≈4J’XPü´Z
oÜlø∫…qìõ¢,ßü©cÓ{—˜e&ÚÀÓHÏÜ‚m(Œ∆⁄ˆQ˝òêpoÉÄÂ(S‘E ⁄ !ŸQ§ô6ÉH
$ awk '/^[(]/{s="";} {s=s"\n"$0;} END{print substr(s,2);}' file
(ÌSYõ˛9°\K¬∞≈fl|”/í÷L
Ö˙h/ÜÇi"û£fi±€ÀNéÓ›bÏÿmâ[≈4J’XPü´Z
oÜlø∫…qìõ¢,ßü©cÓ{—˜e&ÚÀÓHÏÜ‚m(Œ∆⁄ˆQ˝òêpoÉÄÂ(S‘E ⁄ !ŸQ§ô6ÉH
How it works
awk implicitly loops through files line-by-line. This script stores whatever we want to print in the variable s.
/^[(]/{s="";}
Every time that we find a line which starts with (, we set s to an empty string.
The purpose of this is to remove everything before the last occurrence of a line starting with (.
s=s"\n"$0
We add the current line onto the end of s.
END{print substr(s,2);}
After we reach the end of the file, we print s (omitting the first character which will be a surplus newline character).
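A small run on readable data may help. This is only a sketch: the function name after_last_paren and the sample lines are made up, standing in for the binary payloads in the question.

```shell
# after_last_paren: hypothetical wrapper around the awk script explained above.
after_last_paren() {
  awk '/^[(]/{s="";} {s=s"\n"$0;} END{print substr(s,2);}'
}

# Made-up sample: two entries, each with a payload that starts with "(".
printf '%s\n' 'TOP OF FILE' ENTRY '(old payload' 'more old' \
  ENTRY '(new payload' 'more new' | after_last_paren
```

Only the last block survives: the output is the line (new payload followed by more new.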
Interesting problem. I think you can do it with just sed. When you find a match, zero the hold space and add the match line to the hold space. On the last line, print the hold space.
sed -n -e '/ENTRY/,$ { /ENTRY/ { h; n; }; H; $ { x; p; } }'
Don't print by default. From the first entry to the end of the file:
If it is an entry line; copy the new line over the hold space and move on.
Otherwise append the line to the hold space.
If it is the last line, swap the hold space and pattern space, and print the pattern space (what was in the hold space).
You might worry about what happens if the last line in the file is an ENTRY line.
Given a data file:
TOP OF FILE
not wanted
ENTRY
could be wanted
ENTRY
but it wasn't
and this isn't
because
ENTRY
this is here
EOF
The output is:
ENTRY
this is here
EOF
If you don't want ENTRY to appear, modify the script slightly:
sed -n -e '/ENTRY/,$ { /ENTRY/ { s/.*//; h; n; }; H; $ { x; s/^\n//; p; } }'
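As a quick check of this variant, here is a sketch that feeds it a shortened version of the data file shown earlier. The function name after_last_entry is an assumption for illustration.

```shell
# after_last_entry: hypothetical wrapper around the modified sed script above.
# Prints everything after the last ENTRY line, without the ENTRY line itself.
after_last_entry() {
  sed -n -e '/ENTRY/,$ { /ENTRY/ { s/.*//; h; n; }; H; $ { x; s/^\n//; p; } }'
}

# Shortened copy of the sample data used above.
printf '%s\n' 'TOP OF FILE' 'not wanted' ENTRY 'could be wanted' ENTRY \
  "but it wasn't" ENTRY 'this is here' EOF | after_last_entry
```

The output is this is here followed by EOF; the s/^\n// removes the newline that H put in front of the first appended line.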
Using tac you could do it:
tac file | sed -e '/ENTRY/,$d' | tac
This prints the file with its lines reversed, uses sed to delete everything from what is now the first occurrence of ENTRY to what is now the end of the file, then reverses the lines again to restore the original order.
As Jonathan Leffler pointed out, sed can do this more efficiently by quitting as soon as it finds the ENTRY line, instead of processing the rest of the file just to delete it. The gain is probably small, because tac still has a lot to do and the pipeline still carries the overhead of requiring three processes instead of one:
tac file | sed -e '/ENTRY/q' | tac
though his answer is often going to be better still. The command above will include the ENTRY line. If you don't want that, you could also do
tac file | sed -n '/ENTRY/q;p' | tac
which prints no output by default and quits as soon as it finds the ENTRY line, using the p command to print the lines that come before it.
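A minimal sketch of the last pipeline, reading from standard input so it can be tried without a file. It assumes tac is available (it ships with GNU coreutils); the function name after_last_entry_tac and the sample lines are made up.

```shell
# after_last_entry_tac: hypothetical wrapper; assumes GNU tac is installed.
# Reverse the lines, print until the first ENTRY (the last one in the
# original order), quit there, then reverse again.
after_last_entry_tac() {
  tac | sed -n '/ENTRY/q;p' | tac
}

# Made-up sample input.
printf '%s\n' 'TOP OF FILE' ENTRY 'could be wanted' ENTRY \
  'this is here' EOF | after_last_entry_tac
```

This prints this is here and EOF, without the ENTRY line.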
This should work too (at least with gawk)
awk -vRS="ENTRY" 'END{print $0}'
set the record separator as your pattern and print the last record.
Loading the whole file into memory:
sed -e 'H;$!d' -e 'x;s/.*ENTRY[[:blank:]]*\n//' YourFile

Finding and replacing a numeric string between colons, before a space, using sed?

I am attempting to change all coordinate information in a fastq file to zeros. My input file is composed of millions of entries in the following repeating 4-line structure:
#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I would like to replace the two numeric strings in the first line 16030:89299 with zeros in a generic way, such that any numeric string between the colons, before the space, is replaced. I would like the output to appear as follows, replacing the two strings globally throughout the file with zeros:
#HWI-SV007:140:C173GACXX:6:2215:0:0 1:N:0:CAGATC
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
+
###FFFDFHGGDHIIHGIJJJJJJJJJJJGIJJJJJJJIIIDHGHIGIJJIIIJJIJ
I am attempting to do this using the following sed:
sed 's/:^[0-9]+$:^[0-9]+$\s/:0:0 /g'
However, this does not behave as expected.
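I think you will need to use sed's -r option, because + is an extended regular expression operator.
Also, ^ matches the beginning of the line and $ matches the end of the line, so neither belongs in the middle of the pattern.
Thus this is the command line that works against your sample: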
sed -r 's/:[0-9]+:[0-9]+\s/:0:0 /g'
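Here is a sketch of the corrected command applied to the sample record. Two details are my own substitutions rather than part of the answer: -E is used in place of -r (GNU sed treats them the same, and BSD sed also accepts -E), and [[:space:]] with a capture group is used in place of \s so that the original whitespace is kept. The function name zero_coords and the shortened sequence lines are made up.

```shell
# zero_coords: hypothetical helper; replaces the last two numeric fields
# before the first whitespace of a header line with zeros.
zero_coords() {
  sed -E 's/:[0-9]+:[0-9]+([[:space:]])/:0:0\1/'
}

# Header copied from the question; the other three lines are shortened.
printf '%s\n' '#HWI-SV007:140:C173GACXX:6:2215:16030:89299 1:N:0:CAGATC' \
  'GATTACA' '+' '###FFFDFH' | zero_coords
```

The header becomes #HWI-SV007:140:C173GACXX:6:2215:0:0 1:N:0:CAGATC and the other three lines pass through unchanged, since none of them contains two colon-separated numbers followed by whitespace.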
Some alternatives:
awk -F ":" 'BEGIN{ OFS = ":" }{ if ( NF > 1 ) {$6 = 0; sub( /^[0-9]*/, 0, $7)}; print $0 }' YourFile
This treats : as the column separator.
sed 's/^\(\([^:]*:\)\{5\}\)[^[:blank:]]*/\10:0/' YourFile
This skips the first 5 elements separated by :, then uses the space as the delimiter.
As for your sed:
sed 's/:[0-9]\+:[0-9]\+\(\s\)/:0:0\1/'
^ and $ are relative to the whole line, not the current word.
In a basic regular expression + has to be written \+ (a GNU extension, like \s); alternatively pass -r and drop the backslashes before the parentheses.
The capture group keeps the original whitespace instead of replacing it with a single space (in case there are several spaces, or something else such as \t).
g is not needed (and is better avoided here) because there is normally only 1 occurrence per line.
You need to be sure that the pattern cannot occur anywhere else (that a space never follows an earlier pair of numbers), because it is a short one.

Sed regex expression?

I was wondering if somebody could help me with this sed regex expression.
I will put the whole code:
for w in ./tmp/horse_F3.csfasta; do
sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' ${w} > ${w}.flat
done
sed -n
Don't print unless told to.
'/^>/!{H;$!b};
If the line doesn't begin with '>', add the line to the hold space and then if it isn't the last line of the file, jump to the end of the script (i.e. start over with the next line).
s/$/ /;
Add a blank space to the end of the line.
x;
Swap the line with the contents of the hold space.
1b;
If we're working on the first line (i.e. if this is the first time through the script) then jump to the end of the script.
s/\n//g;
Remove all line feeds (\n) from the thing we're working on. That is, if it is several lines (from the hold space), turn them into one line.
p'
Print it.
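Putting the steps together, here is a sketch of the script on a tiny made-up input. The function name flatten_records and the sample records are assumptions for illustration; the sed program is the one from the question.

```shell
# flatten_records: hypothetical wrapper around the sed script explained above.
# Joins each ">" header line with the lines that follow it into one line.
flatten_records() {
  sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p'
}

# Made-up sample: two records, the first with two continuation lines.
printf '%s\n' '>seq1' ACGT TTGA '>seq2' GGCC | flatten_records
```

This prints >seq1 ACGTTTGA and >seq2 GGCC. Each record is emitted only when the next header (or the end of the input) is reached, which is why the first header line prints nothing by itself.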

SED: addressing two lines before match

Print the line that is situated 2 lines before the match (pattern).
I tried the following:
sed -n ': loop
/.*/h
:x
{n;n;/cen/p;}
s/./c/p
t x
s/n/c/p
t loop
{g;p;}
' datafile
The script:
sed -n "1N;2N;/XXX[^\n]*$/P;N;D"
works as follows:
Read the first three lines into the pattern space, 1N;2N
Search for the test string XXX anywhere in the last line, and if found print the first line of the pattern space, P
Append the next line input to pattern space, N
Delete first line from pattern space and restart cycle without any new read, D, noting that 1N;2N is no longer applicable
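The steps above can be tried on a ten-line input. In this sketch the function name two_before is made up and the search string is passed as an argument; the [^\n] bracket expression relies on GNU sed.

```shell
# two_before: hypothetical wrapper around the script above (GNU sed assumed).
# $1 is a basic regular expression and must not contain "/".
two_before() {
  sed -n '1N;2N;/'"$1"'[^\n]*$/P;N;D'
}

# Made-up sample input: the words one through ten, one per line.
printf '%s\n' one two three four five six seven eight nine ten | two_before six
```

This prints four. The pattern space always holds a sliding window of three lines, the pattern is tested against the last of them only, and P prints the first.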
This might work for you (GNU sed):
sed -n ':a;$!{N;s/\n/&/2;Ta};/^PATTERN\'\''/MP;$!D' file
This will print the line 2 lines before the PATTERN throughout the file.
This one uses grep; it is a bit simpler and easy to read, though it needs one pipe and only handles the first match:
grep -B2 'pattern' file_name | sed -n '1p'
If you can use awk try this:
awk '/pattern/ {print b} {b=a;a=$0}' file
This prints the line two lines before each match of the pattern.
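The two variables a and b in the awk one-liner above act as a two-slot history. As a sketch of how that might generalize to any distance n, the version below keeps the last n lines in an array used as a ring buffer. The function name n_before and its parameters are my own assumptions, not part of the answer.

```shell
# n_before: hypothetical generalization of the a/b trick.
# $1: how many lines to look back (n >= 1); $2: an awk regular expression.
n_before() {
  awk -v n="$1" -v pat="$2" '
    # Slot NR % n still holds line NR-n here, because it is only
    # overwritten by the rule below, after this test has run.
    $0 ~ pat && NR > n { print buf[NR % n] }
    { buf[NR % n] = $0 }
  '
}

# Made-up sample input: the words one through ten, one per line.
printf '%s\n' one two three four five six seven eight nine ten | n_before 2 six
```

With n set to 2 this prints four, matching the one-liner. Only n lines are kept in memory, however large the file is.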
I've tested your sed command but the result is strange (and obviously wrong), and you didn't give any explanation. You will have to save three lines in a buffer (named hold space), do a pattern search with the newest line and print the oldest one if it matches:
sed -n '
## At the beginning read three lines.
1 { N; N }
## Append them to "hold space". In following iterations it will append
## only one line.
H
## Get content of "hold space" to "pattern space" and check if the
## pattern matches. If so, extract content of first line (until a
## newline) and exit.
g
/^.*\nsix$/ {
s/^\n//
P
q
}
## Remove the old of the three lines saved and append the new one.
s/^\n[^\n]*//
h
' infile
Assuming an input file (infile) with the following content:
one
two
three
four
five
six
seven
eight
nine
ten
It will search for six and, as output, will yield:
four
Here are some other variants:
awk '{a[NR]=$0} /pattern/ {f=NR} END {print a[f-2]}' file
This stores all lines in an array a. When the pattern is found, it stores the line number in f.
At the end it prints the line two before the last stored line number.
PS: it may be slow with large files.
Here is another one:
awk 'FNR==NR && /pattern/ {f=NR;next} f-2==FNR' file{,}
This reads the file twice (in bash, file{,} expands to file file).
In the first round it finds the pattern and stores the line number in variable f.
Then in the second round it prints the line two before the value in f.

How to use sed to remove only double empty lines?

I found this question and answer on how to remove triple empty lines. However, I need the same only for double empty lines. Ie. all double blank lines should be deleted completely, but single blank lines should be kept.
I know a bit of sed, but the proposed command for removing triple blank lines is over my head:
sed '1N;N;/^\n\n$/d;P;D'
This would be easier with cat:
cat -s
I've commented the sed command you don't understand:
sed '
## In first line: append second line with a newline character between them.
1N;
## Do the same with third line.
N;
## When found three consecutive blank lines, delete them.
## Here there are two newlines but you have to count one more deleted with last "D" command.
/^\n\n$/d;
## The combo "P+D+N" simulates a FIFO, "P+D" prints and deletes from one side while "N" appends
## a line from the other side.
P;
D
'
Remove 1N, because we need only two lines in the 'stack' and the second N is enough for that, and change /^\n\n$/d; to /^\n$/d; to delete every pair of consecutive blank lines.
A test:
Content of infile:
1
2
3
4
5
6
7
Run the sed command:
sed '
N;
/^\n$/d;
P;
D
' infile
That yields:
1
2
3
4
5
6
7
sed '/^$/{N;/^\n$/d;}'
It will delete only two consecutive blank lines in a file. Try the expression on a file to understand it fully. When a blank line comes along, sed enters the braces.
Normally sed reads one line at a time. N appends the next line to the pattern space. If that line is also empty, the pattern space holds two empty lines separated by a newline.
Only in that case does the pattern /^\n$/ match, and only then does d run. d deletes the whole content of the pattern space and starts the next cycle.
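A short sketch of the expression above on made-up input, showing that a pair of blank lines disappears while a single blank line is kept. The function name drop_double_blank is an assumption for illustration.

```shell
# drop_double_blank: hypothetical wrapper around the expression above.
drop_double_blank() {
  sed '/^$/{N;/^\n$/d;}'
}

# Made-up input: two blank lines between a and b, one between b and c.
printf 'a\n\n\nb\n\nc\n' | drop_double_blank
```

The output is a, b, an empty line, then c. Blank lines are consumed in pairs, so a run of three blank lines would leave one behind.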
This would be easier with awk:
awk -v RS='\n\n\n' 1
But the above solution deletes only the first run of 3 consecutive blank lines.
To delete all runs of 3 consecutive blank lines, use the command below:
sed '1N;N;/^\n\n$/ { N;s/^\n\n//;N;D; };P;D' filename
As far as I can tell none of the solutions here work. cat -s as suggested by @DerMike isn't POSIX compliant (and it's less convenient if you're already using sed for another transformation), and sed 'N;/^\n$/d;P;D' as suggested by @Birei sometimes deletes more newlines than it should.
Instead, sed ':L;N;s/^\n$//;t L' works. For POSIX compliance use sed -e :L -e N -e 's/^\n$//' -e 't L', since POSIX doesn't specify using ; to separate commands.
Example:
$ S='foo\nbar\n\nbaz\n\n\nqux\n\n\n\nquxx\n';\
> paste <(printf "$S")\
> <(printf "$S" | sed -e 'N;/^\n$/d;P;D')\
> <(printf "$S" | sed -e ':L;N;s/^\n$//;t L')
foo foo foo
bar bar bar
baz baz baz
qux
qux
qux quxx
quxx
quxx
$
Here we can see the original file, @Birei's solution, and my solution side-by-side. @Birei's solution deletes all blank lines separating baz and qux, while my solution removes all but one as intended.
Explanation:
:L Create a new label called L.
N Read the next line into the current pattern space,
separated by an "embedded newline."
s/^\n$// Replace the pattern space with the empty pattern space,
corresponding to a single non-embedded newline in the output,
if the current pattern space only contains a single embedded newline,
indicating that a blank line was read into the pattern space by `N`
after a blank line had already been read from the input.
t L Branch to label L if the previous `s` command successfully
substituted text in the pattern space.
In effect, this deletes one recurrent blank line at a time, reading each into the pattern space as an embedded newline with N and deleting them with s.
Just pipe it to the 'uniq' command and all empty lines, regardless of how many there are, will be shrunk to just one. Simpler is better.
Clarification: As Marlar stated, this is not a solution if you have "other non-blank consecutive duplicated lines" that you do not want to get rid of. It is a solution in other cases, such as cleaning up configuration files, which is what I was after when I saw this question. I did indeed solve my problem just using 'uniq'.