Sed regex expression? - regex

I was wondering if somebody could help me with this sed regex expression.
I will put the whole code:
for w in ./tmp/horse_F3.csfasta; do
sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' ${w} > ${w}.flat
done

sed -n
Don't print unless told to.
'/^>/!{H;$!b};
If the line doesn't begin with '>', add the line to the hold space and then if it isn't the last line of the file, jump to the end of the script (i.e. start over with the next line).
s/$/ /;
Add a blank space to the end of the line.
x;
Swap the line with the contents of the hold space.
1b;
If we're working on the first line (i.e. if this is the first time through the script) then jump to the end of the script.
s/\n//g;
Remove all line feeds (\n) from the thing we're working on. That is, if it is several lines (from the hold space), turn them into one line.
p'
Print it.
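To see the steps above in action, here is a minimal sketch that spells out the same script one command per line and runs it on a tiny made-up sample. The function name flatten and the two-record input are assumptions for illustration only; they are not taken from the real horse_F3.csfasta file.

```shell
# Sketch: the one-liner from the question, one sed command per line.
flatten() {
  sed -n '
    # Sequence line: collect it in the hold space; unless it is the last
    # line of input, skip the rest of the script.
    /^>/!{
      H
      $!b
    }
    # Header line (or last line): append a space, then swap so that the
    # previously collected record becomes the pattern space.
    s/$/ /
    x
    # Nothing has been collected yet when the first header is read.
    1b
    # Join the collected record into one line and print it.
    s/\n//g
    p
  ' "$@"
}

# Hypothetical two-record sample, not real data.
printf '%s\n' '>read1' 'T012' '3301' '>read2' 'T221' | flatten
# prints:
# >read1 T0123301
# >read2 T221
```

Each header ends up on one line with its sequence lines glued together, separated from the header by the single space that s/$/ / added.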

Related

Replace several lines by one using sed

I have an input like this:
This_is(A)
Goto(B,condition_1)
Goto(C,condition_2)
This_is(B)
Goto(A,condition_3)
This_is(C)
Goto(B,condition_1)
I want it to become like this
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
Anyone knows how to do this with sed?
Assuming you don't really need to do this with sed, this will work using any awk in any shell on every UNIX box:
$ awk -F'[()]' '/^[^[:space:]]/{s=$2; next} {sub(/[^[:space:]]*\(/,"("s",")} 1' file
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
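The awk one-liner above tells the two kinds of lines apart by leading whitespace. As a hedged alternative sketch, the variant below keys on the literal names This_is and Goto instead, so it does not depend on indentation. The function name edges is hypothetical, and the assumption that those two names never change comes from the sample input only.

```shell
# Sketch: remember the node named by the latest This_is(...) line and
# emit one "(node,target,condition)" line per Goto(...) line.
edges() {
  awk -F'[()]' '
    /^[ \t]*This_is\(/ { node = $2; next }            # $2 is the text inside the parentheses
    /^[ \t]*Goto\(/    { print "(" node "," $2 ")" }  # prefix the remembered node
  ' "$@"
}

printf '%s\n' \
  'This_is(A)' 'Goto(B,condition_1)' 'Goto(C,condition_2)' \
  'This_is(B)' 'Goto(A,condition_3)' \
  'This_is(C)' 'Goto(B,condition_1)' | edges
# prints:
# (A,B,condition_1)
# (A,C,condition_2)
# (B,A,condition_3)
# (C,B,condition_1)
```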
This is a possible sed solution, where I have hardcoded a few bits, like This_is and Goto because the OP did not clarify if those strings change along the file in the actual file:
sed '/^This_is/{:a;N;s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/;$!ta;s/[^\n]*\n//}' input_file
(Unfortunately, with all these parentheses, using -E does not shorten the command much.)
The code is slightly more readable if split on more lines:
sed '/^This_is/{
:a
N
s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/
$!ta
s/[^\n]*\n//
}' input_file
Here you can see that the code takes action only on the lines starting with This_is; when the program hits those lines, it does the following.
It uses the N command to append the next line to the pattern space (interspersing \ns),
and it attempts a substitution with s/…/…/, which essentially tries to pick the x in This_is(x) and to put it just after the last Goto( on the multiline,
and it keeps doing this as long as the latter action is successful (ta branches to :a if s was successful) and the last line has not been read ($! matches all lines but the last).
Indeed, this is a do-while loop, where :a marks the entry point, where the control jumps back if the while-condition is true, and ta is the command that evaluates the logical condition.
When the above while loop terminates, the shorter s/…/…/ command removes the leading line from the multiline pattern space, which is the This_is line.
This might work for you (GNU sed):
sed -E '/^\S.*\(.*\)/{h;d};G;s/\S+\((.*\))\n.*(\(.*)\).*/\2,\1/;P;d' file
If a line starts with a non-white space character and contains parens, copy it to the hold space (HS) and then delete it.
Otherwise, append the HS, remove the non-white characters up to the opening paren, insert the value between parens from the stored value, add a comma, print the first line, and then delete the whole of the pattern space.
N.B. Lines that do not meet the substitution criteria will be unchanged.
An alternative solution using GNU parallel and sed:
parallel --pipe --recstart T -kqN1 sed -E '1{h;d};G;s/\S+\((.*)\n.*(\(.*)\).*/\2,\1/;P;d' <file

grep/pcregrep/sed/awk the data after the last match to the end of a file

I need to grab the content after the last match of ENTRY to the end of the file, and I can't seem to do it. It can be multiple lines and the data can include any character to the end of the file including (,\n, ).
I’ve tried:
tail -1 file # doesn’t work due to it not consistently being one line
grep "^(.*" # only grabs one line
pcregrep -M '\n(.*' file # I think a variation of this is the solution, but I’ve had no luck so far.
File that grows below:
TOP OF FILE
%
ENTRY
(S®s
√6ûíπ‹ôTìßÅDPˆ¬k·Ù"=ÓxF)*†‰ú˚ÃQ´¿J‘\˜©ŒG»‡∫QÆ’<πsµ-ù±ñ∞NäAOilWçk
N+P}V<ôÒ∏≠µW*`Hß”;–GØ»14∏åR"ºã
FD‘mÍõ?*ÊÎÉC)(S®s
√6ûíπ‹ôTìßÅDPˆ¬k·Ù"=ÓxF)*†‰ú˚ÃQ´¿J‘\˜©ŒG»‡∫QÆ’<πsµ-ù±ñ∞NäAOilWçk
N+P}V<ôÒ∏≠µW*`Hß”;–GØ»14∏åR"ºã
FD‘mÍõ?*ÊÎÉC)eq
{
DATA
}
ENTRY
(A® S\kÉflã1»Âbπ¯Ú∞⁄äπHZ#F◊§•Ã*‹¡‹…ÿPkJòÑíòú˛¶à˛¨¢v|u«Ùbó–Ö¶¢∂5ıÜ#¨•˘®#W´≥‡*`H∑”ı–Só¬<˙ìEçöf∞Gg±:œe™flflå)A® S\kÉflã1»Âbπ¯Ú∞⁄äπHZ#F◊§•Ã*‹¡‹…ÿPkJòÑíòú˛¶à˛¨¢v|u«Ùbó–Ö¶¢∂5ıÜ#¨•˘®#W´≥‡*`H∑”ı–Só¬<˙ìEçöf∞Gg±:œe™flflå)eq
{
DATA
}if
ENTRY
(ÌSYõ˛9°\K¬∞≈fl|”/í÷L
Ö˙h/ÜÇi"û£fi±€ÀNéÓ›bÏÿmâ[≈4J’XPü´Z
oÜlø∫…qìõ¢,ßü©cÓ{—˜e&ÚÀÓHÏÜ‚m(Œ∆⁄ˆQ˝òêpoÉÄÂ(S‘E ⁄ !ŸQ§ô6ÉH
$ awk '/^[(]/{s="";} {s=s"\n"$0;} END{print substr(s,2);}' file
(ÌSYõ˛9°\K¬∞≈fl|”/í÷L
Ö˙h/ÜÇi"û£fi±€ÀNéÓ›bÏÿmâ[≈4J’XPü´Z
oÜlø∫…qìõ¢,ßü©cÓ{—˜e&ÚÀÓHÏÜ‚m(Œ∆⁄ˆQ˝òêpoÉÄÂ(S‘E ⁄ !ŸQ§ô6ÉH
How it works
awk implicitly loops through files line-by-line. This script stores whatever we want to print in the variable s.
/^[(]/{s="";}
Every time that we find a line which starts with (, we set s to an empty string.
The purpose of this is to remove everything before the last occurrence of a line starting with (.
s=s"\n"$0
We add the current line onto the end of s.
END{print substr(s,2);}
After we reach the end of the file, we print s (omitting the first character which will be a surplus newline character).
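The same reset-and-accumulate idea can also be keyed on the ENTRY marker itself rather than on a leading parenthesis. The sketch below is one way to do that; the function name after_last_entry is made up, and it assumes ENTRY always sits alone on its own line, as in the sample file.

```shell
# Sketch: keep only the lines that follow the last line consisting of ENTRY.
after_last_entry() {
  awk '
    $0 == "ENTRY" { seen = 1; buf = ""; next }  # new ENTRY: drop what was collected
    seen          { buf = buf $0 "\n" }         # collect lines that follow an ENTRY
    END           { printf "%s", buf }          # whatever is left follows the last ENTRY
  ' "$@"
}

printf '%s\n' 'TOP OF FILE' 'not wanted' 'ENTRY' 'could be wanted' \
  'ENTRY' 'this is here' 'EOF' | after_last_entry
# prints:
# this is here
# EOF
```

Because the newline is appended after each line instead of before it, no substr call is needed to trim a surplus newline at the end.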
Interesting problem. I think you can do it with just sed. When you find a match, zero the hold space and add the match line to the hold space. On the last line, print the hold space.
sed -n -e '/ENTRY/,$ { /ENTRY/ { h; n; }; H; $ { x; p; } }'
Don't print by default. From the first entry to the end of the file:
If it is an ENTRY line, copy it over the hold space and move on to the next line.
Otherwise append the line to the hold space.
If it is the last line, swap the hold space and pattern space, and print the pattern space (what was in the hold space).
You might worry about what happens if the last line in the file is an ENTRY line.
Given a data file:
TOP OF FILE
not wanted
ENTRY
could be wanted
ENTRY
but it wasn't
and this isn't
because
ENTRY
this is here
EOF
The output is:
ENTRY
this is here
EOF
If you don't want ENTRY to appear, modify the script slightly:
sed -n -e '/ENTRY/,$ { /ENTRY/ { s/.*//; h; n; }; H; $ { x; s/^\n//; p; } }'
Using tac you could do it:
tac <file> | sed -e '/ENTRY/,$d' | tac
This will print the file with the lines reversed, then use sed to remove everything from what is now the first occurrence of ENTRY to the now end of the file, then reverse the lines again to get the original order.
As Jonathan Leffler pointed out, there is a faster way to do this, though probably not by much: tac still has a lot to do, and the pipeline carries the overhead of requiring three processes instead of one. The sed step can be made more efficient by quitting as soon as it finds the ENTRY line, instead of processing the rest of the file to remove the lines:
tac <file> | sed -e '/ENTRY/q' | tac
though his answer is often going to be better still. The command above includes the ENTRY line in its output. If you don't want that, you could also do
tac <file> | sed -n '/ENTRY/q;p' | tac
to not print any output by default, then quit as soon as you find the ENTRY line, but use the p command to print the lines until you get to that line.
This should work too (at least with gawk)
awk -vRS="ENTRY" 'END{print $0}'
set the record separator as your pattern and print the last record.
Loading the whole file into memory:
sed -e 'H;$!d' -e 'x;s/.*ENTRY[[:blank:]]*\n//' YourFile

sed substitution including newlines

I want to change a text file so that any line beginning with "Length:" is appended to the previous line.
I'm aware that sed 's/\nLength:/ Length:/' isn't going to work because sed is line based.
Googling for "How to match newlines in sed" did turn up a complex sed method for joining a pattern to the next line but I couldn't figure out how to adapt it.
Help would be appreciated.
In awk you can use something like:
awk '/^/&&!/^Length/{printf "\n"}{printf "%s",$0}' infile
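This prints \n before every line (every line matches the line start ^) except those that begin with Length, so each Length line ends up appended to the previous line. Note that the output then starts with an empty line and has no final newline.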
If the file isn't too large, you can use a Perl command line in slurp mode (load all the file content before processing) :
perl -0777 -pe 's/\R(?=Length:)//g' file
-0777 switches on the slurp mode
pattern:
\R any kind of newlines
(?=...) lookahead assertion
If there are no consecutive lines starting with Length: you can use this sed command:
sed -n ':a;/\nLength:/!{$p;N;ba;}; s/\n\(Length:\)/\1/;p;' file
details:
:a; # define the label "a"
/\nLength:/! { # if "\nLength:" doesn't match then:
$p; # if last line, print
N; # append the next line to the pattern space
ba; # go to label "a"
};
s/\n\(Length:\)/\1/; # perform the replacement (sed uses \1, not $1, for the captured group)
p; # print
Another way with awk using the record separator:
awk 'BEGIN{RS="\nLength:";ORS="Length:"}1' file | head -n -1
This might work for you (GNU sed):
sed 'N;/\nLength:/s/\n/ /;P;D' file
This appends the next line to the present line in the pattern space and if the appended line begins with the required string it replaces the newline with a space (if you do not want the space just replace the newline with nothing). The first line is then printed and deleted and the process repeated (the second line is now the first unless the condition was met in which case a line is automatically read in and then the first command appends the next).
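The N;P;D cycle described above can be tried on a small sample. This is a sketch only: the function name join_length and the Title:/Length: input are invented for illustration, and N is written as $!N so that seds which follow POSIX strictly do not drop a final unpaired line.

```shell
# Sketch: glue each "Length:" line onto the line before it.
join_length() {
  # $!N  append the next line unless we are already on the last one
  # s    turn the embedded newline into a space when a Length: line follows
  # P;D  print and delete the first line, then restart with what remains
  sed '$!N;/\nLength:/s/\n/ /;P;D' "$@"
}

printf '%s\n' 'Title: A' 'Length: 10' 'Title: B' 'Title: C' 'Length: 30' | join_length
# prints:
# Title: A Length: 10
# Title: B
# Title: C Length: 30
```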

Regex to move second line to end of first line

I have several lines with certain values and I want to merge every second line, or every line beginning with <name>, onto the end of the preceding line ending with </id>:
<id>rd://data1/8b</id>
<name>DM_test1</name>
<id>rd://data2/76f</id>
<name>DM_test_P</name>
so end up with something like
<id>rd://data1/8b</id><name>DM_test1</name>
The reason why it came out like this is because I used two piped xpath queries.
Regex
Simply remove the newline at the end of each line ending in </id>. On Windows, replace (<\/id>)\r\n with \1 or $1 (which is Perl syntax). On Linux, search for (<\/id>)\n and replace it with the same thing.
awk
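The ideal solution uses awk. The idea is simple: when the line number is odd, we print the line without a newline; otherwise we print it with a newline. Passing the line through a "%s" format keeps any % characters in the data from being read as format specifiers.
awk '{ if (NR % 2) { printf "%s", $0 } else { print $0 } }' file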
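As a quick check of the odd/even idea, the sketch below runs it on the four sample lines from the question. The function name join_pairs is hypothetical, and the approach assumes the file strictly alternates <id> and <name> lines.

```shell
# Sketch: print odd-numbered lines without a newline, even-numbered lines with one.
join_pairs() {
  awk 'NR % 2 { printf "%s", $0; next } { print }' "$@"
}

printf '%s\n' '<id>rd://data1/8b</id>' '<name>DM_test1</name>' \
  '<id>rd://data2/76f</id>' '<name>DM_test_P</name>' | join_pairs
# prints:
# <id>rd://data1/8b</id><name>DM_test1</name>
# <id>rd://data2/76f</id><name>DM_test_P</name>
```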
sed
Using sed, we place a line in the hold space when it contains <id> and append the line to it when it's a <name> line. Then we remove the newline and print the hold buffer by exchanging it with the pattern space.
sed -n '/<id>.*<\/id>/{h}; /<name>.*<\/name>/{H;x;s/\n//;p}' file
pr
Using pr we can achieve a similar goal:
pr -s --columns 2 file

What does this sed expression from todo.sh do?

What does the sed expression: G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P do? Exactly what does it match and how does it match it?
It's from todo.sh. In context:
archive()
{
#defragment blank lines
sed -i.bak -e '/./!d' "$TODO_FILE" ## delete all empty lines
[ $TODOTXT_VERBOSE -gt 0 ] && grep "^x " "$TODO_FILE" ## if verbose mode print completed tasks..
grep "^x " "$TODO_FILE" >> "$DONE_FILE" ## append completed tasks to $DONE_FILE
sed -i.bak '/^x /d' "$TODO_FILE" ## delete completed tasks
cp "$TODO_FILE" "$TMP_FILE"
sed -n 'G; s/\n/&&/; /^\([ ~-]*\n\).*\n\1/d; s/\n//; h; P' "$TMP_FILE" > "$TODO_FILE"
## G; Add a newline
## s/\n/&&/; Substitute newline with && (two newlines?)
## /^\([ ~-]*\n\).*\n\1/d; Delete duplicate lines???
## s/\n// Remove newlines
## h Hold: copy pattern space to buffer
## P Print first line of pattern space
if [ $TODOTXT_VERBOSE -gt 0 ]; then
echo "TODO: $TODO_FILE archived."
fi
}
Ok, you've got some of the story already. Recall that the sed expression is executed for each input line. So the G at the beginning appends the contents of the hold space to the current line (with a newline in between). The hold space is empty initially but is extended by the h command at the end of each input cycle.
Then s/\n/&&/ duplicates the first newline only, the one between the current line and what was grabbed from the hold space. This is in preparation for the next command. /^\([ -~]*\n\).*\n\1/ indeed matches if the current line is identical to a line in the hold space:
    ^\([ -~]*\n\) matches a line at the beginning of the buffer¹
        Note that this matches only if the line contains only printable ASCII characters.
        If your system supports locales, ^\([[:print:]]*\n\) would be better.
    .*\n matches at least one subsequent line
    \1 matches a line identical to the first line
The extra newline added by the previous s command takes care of the case when the duplicate is the very first line from the hold space. The point of the \n\1 is to “anchor” the duplicate at the beginning of a line, otherwise bar would be considered a duplicate of foobar. If the current line is a duplicate, the d command discards it and execution branches to the next line.
If the current line is not a duplicate, s/\n// discards that extra newline (again, no g modifier, so only the first newline is removed). Then the h command results in the hold space containing what it contained before, with the current line prepended. Finally P prints the current input line.
Ok, now what does the hold space contain? It starts empty, then gets each successive line prepended unless it's a duplicate. So the hold space contains the input lines, in reverse order, minus the duplicates.
¹ Uh, I don't know how you did that, but that should be [ -~], not [ ~-] which wouldn't make any sense.
Here's another way of doing this, if you have a POSIX-conforming set of tools (Single Unix v2 is good enough).
<"$TMP_FILE" \
nl -s: | # add line numbers
sort -t: -k2 -u | # sort, ignoring the line numbers, and remove duplicates
sort -t: -k1 -n | # sort by line number
cut -d: -f2- # cut out the line numbers
Oh, you wanted to do this legibly and concisely? Just use awk.
<"$TMP_FILE" awk '!seen[$0] {++seen[$0]; print}'
If the current line hasn't been seen yet, mark it as seen, and print it.
Note that like the sed method, the awk method essentially stores the whole file in memory. The method above using sort has the advantage that only sort needs to keep more than one line of input at a time, and it's designed for this.
Of course, if you don't care about the order of the lines, it's as simple as sort -u.
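To compare the sed and awk de-duplicators discussed above, here is a small sketch that runs both on the same input. The function names dedup_sed and dedup_awk and the sample words are made up; the input includes foobar and bar to show the anchoring effect of \n\1 (bar is not treated as a duplicate of foobar).

```shell
# Sketch: two order-preserving de-duplicators, using the corrected [ -~] range.
dedup_sed() {
  sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' "$@"
}

dedup_awk() {
  awk '!seen[$0] {++seen[$0]; print}' "$@"
}

printf '%s\n' foobar bar foobar baz bar | dedup_sed
printf '%s\n' foobar bar foobar baz bar | dedup_awk
# each command prints:
# foobar
# bar
# baz
```

Both keep the first occurrence of every line in its original position. The sed version only recognises lines made of printable ASCII characters, as noted above.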
After Gilles presented his excellent answer I found Famous Sed One-Liners Explained, which includes this exact sed expression; adding here for reference:
70. Delete duplicate, nonconsecutive lines from a file.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
This is a very tricky one-liner. It
stores the unique lines in hold buffer
and at each newly read line, tests if
the new line already is in the hold
buffer. If it is, then the new line is
purged. If it's not, then it's saved
in hold buffer for future tests and
printed.
A more detailed description - at each
line this one-liner appends the
contents of hold buffer to pattern
space with "G" command. The appended
string gets separated from the
existing contents of pattern space by
"\n" character. Next, a substitution
is made to that substitutes the "\n"
character with two "\n\n". The
substitute command "s/\n/&&/" does
that. The "&" means the matched
string. As the matched string was
"\n", then "&&" is two copies of it
"\n\n". Next, a test
"/^\([ -~]*\n\).*\n\1/" is done to see if the contents of capture group 1 is
repeated. The capture group 1 is all
the characters from space " " to "~"
(which include all printable chars).
The "[ -~]*" matches that. Replacing
one "\n" with two was the key idea
here. As "\([ -~]*\n\)" is greedy
(matches as much as possible), the
double newline makes sure that it
matches as little text as possible. If
the test is successful, the current
input line was already seen and "d"
purges the whole pattern space and
starts script execution from the
beginning. If the test was not
successful, the doubled "\n\n" gets
replaced with a single "\n" by
"s/\n//" command. Then "h" copies the
whole string to hold buffer, and "P"
prints the new line.