How can I remove a line only if it is followed by a line that starts with the same character? - regex

I need some help with sed or awk.
How can I remove a line only if it is followed by a line that starts with the same character (in this case >)?
For example, I have this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
I want to get this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
Note that not all the lines have the same numbers but they all have the same format, which is why I want to use regular expressions. If you could explain how to read the code you produce that would be really great.
Thank you so much!

If the whole file follows that pattern (some number of lines starting with >, of which you want only the last, followed by a single line that should always be printed), you could use something like this:
awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
If the line starts with >, then it is remembered (stored in the variable latest) but not printed. If the line doesn't start with >, then it is printed, but only after first printing whatever was most recently stored in latest.
The conditional means each printed > line will appear only once, even if there are multiple non-> lines in a row. Since that doesn't happen in your sample data, you may not need the complication, and could use this simpler unconditional version:
awk '/^>/ { latest=$0 } !/^>/ { print latest; print }'
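Spread over several lines with comments, the first (conditional) command may be easier to read. This is only a sketch of the same idea: the function name keep_last_header is made up for illustration, and the explicit test latest != "" replaces if (latest).

```shell
# keep_last_header is a hypothetical helper name, not part of the answer above.
keep_last_header() {
  awk '
    # Header line: remember it, print nothing yet.
    /^>/ { latest = $0; next }
    # Any other line: first emit the pending header (if any), then the line itself.
    {
      if (latest != "") { print latest; latest = "" }
      print
    }
  ' "$@"
}

printf '%s\n' '>5_SRR1422298' '>5_SRR1422294' 'CGTCAG' \
              '>6_SRR1422294' '>6_SRR1422250' 'TGTTCA' | keep_last_header
```

With a file, it would be called as keep_last_header yourfile.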

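A similar result can be obtained with the uniq command and its -w (--check-chars=N) option. Note that uniq keeps the first line of each run of duplicates, so the first header of each group survives, not the last one the question asks for (>5_SRR1422298 rather than >5_SRR1422294 in the output below):
uniq -w 3 testfile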
The output:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
-w, --check-chars=N
compare no more than N characters in lines
http://man7.org/linux/man-pages/man1/uniq.1.html
It compares at most the first N characters of each line when deciding whether adjacent lines are duplicates.
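If the last header of each run is wanted, as in the question, one possible workaround is to reverse the file, de-duplicate, and reverse it back. This is a sketch that assumes GNU coreutils (tac and uniq -w) and that three characters are enough to tell the headers apart, as in the sample data; keep_last_of_run is a made-up name.

```shell
# uniq keeps the FIRST line of each run, so on reversed input it keeps
# what was originally the LAST header of the run.
keep_last_of_run() {
  tac "$@" | uniq -w 3 | tac
}

printf '%s\n' '>5_SRR1422298' '>5_SRR1422294' 'CGTCAG' \
              '>6_SRR1422294' '>6_SRR1422250' 'TGTTCA' | keep_last_of_run
```

Like the original command, this would also merge adjacent sequence lines that happen to share their first three characters.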

Try this; if your data looks like the given sample Input_file, the following may help:
awk '/^>/{A=$0;next} {print A ORS $0;A=""}' Input_file

This might work for you (GNU sed):
sed 'N;/^>.*\n>/!P;D' file
Read two lines into the pattern space and do not print the first of these lines if the first and second lines begin with >.
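The same script, written one command per line with comments, as a sketch for reading it. It assumes GNU sed, like the answer; the function name drop_header_before_header is hypothetical.

```shell
drop_header_before_header() {
  sed '
    # Append the next input line, so the pattern space holds two lines.
    N
    # Print the first line unless both lines start with ">".
    /^>.*\n>/!P
    # Delete the first line and restart with the remaining one (no fresh read).
    D
  ' "$@"
}

printf '%s\n' '>5_SRR1422298' '>5_SRR1422294' 'CGTCAG' '>9_SRR1422294' 'GCGACT' |
  drop_header_before_header
```

Because D restarts the cycle with the leftover line, every line gets compared with its successor, not only every other pair.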

sed 'N;/^>.*\n\w/!D' file #(GNU sed)
N: append the next line to the pattern space. /^>.*\n\w/!D: unless the first line starts with ">" and the second line begins with a word character (letter, digit, or underscore), delete the first line and restart the cycle with what is left.

Related

Replace several lines by one using sed

I have an input like this:
This_is(A)
Goto(B,condition_1)
Goto(C,condition_2)
This_is(B)
Goto(A,condition_3)
This_is(C)
Goto(B,condition_1)
I want it to become like this
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
Does anyone know how to do this with sed?
Assuming you don't really need to do this with sed, this will work using any awk in any shell on every UNIX box:
$ awk -F'[()]' '/^[^[:space:]]/{s=$2; next} {sub(/[^[:space:]]*\(/,"("s",")} 1' file
(A,B,condition_1)
(A,C,condition_2)
(B,A,condition_3)
(C,B,condition_1)
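If the indentation of the Goto lines cannot be relied on, a variant can key on the This_is and Goto names instead. This is a sketch under the assumption that those two names are fixed throughout the file; rewrite_gotos is a hypothetical helper name.

```shell
rewrite_gotos() {
  awk -F'[()]' '
    # This_is(X): remember X (field 2 when splitting on parentheses).
    /^[[:space:]]*This_is\(/ { src = $2; next }
    # Goto(Y,cond): emit (X,Y,cond).
    /^[[:space:]]*Goto\(/ { print "(" src "," $2 ")"; next }
    # Anything else passes through unchanged.
    { print }
  ' "$@"
}

printf '%s\n' 'This_is(A)' 'Goto(B,condition_1)' 'Goto(C,condition_2)' \
              'This_is(B)' 'Goto(A,condition_3)' | rewrite_gotos
```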
This is a possible sed solution, where I have hardcoded a few bits, like This_is and Goto because the OP did not clarify if those strings change along the file in the actual file:
sed '/^This_is/{:a;N;s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/;$!ta;s/[^\n]*\n//}' input_file
(Unfortunately, with all these parentheses, using -E does not shorten the command much.)
The code is slightly more readable if split on more lines:
sed '/^This_is/{
:a
N
s/\(^This_is(\(.\)).*\)\(\n *\)Goto(\([^)]*)\)$/\1\3(\2,\4/
$!ta
s/[^\n]*\n//
}' input_file
Here you can see that the code takes action only on the lines starting with This_is; when the program hits those lines, it does the following.
It uses the N command to append the next line to the pattern space (interspersing \ns),
and it attempts a substitution with s/…/…/, which essentially tries to pick the x in This_is(x) and to put it just after the last Goto( on the multiline,
and it keeps doing this as long as the latter action is successful (ta branches to :a if s was successful) and the last line has not been read ($! matches all lines but the last).
Indeed, this is a do-while loop, where :a marks the entry point, where the control jumps back if the while-condition is true, and ta is the command that evaluates the logical condition.
When the above while loop terminates, the shorter s/…/…/ command removes the leading line from the multiline pattern space, which is the This_is line.
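The do-while shape of the :a ... ta loop is easier to see on a toy input. The sketch below is unrelated to the This_is data: it replaces one "a" per iteration and jumps back to :a for as long as the substitution succeeds. The function name is made up.

```shell
replace_all_a() {
  # :a marks the loop entry; ta jumps back to it only if s/a/b/ changed something.
  sed -e ':a' -e 's/a/b/' -e 'ta'
}

echo 'banana' | replace_all_a
```

Each pass changes only the first remaining "a", so the loop runs three times on this input before the substitution fails and the line is printed.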
This might work for you (GNU sed):
sed -E '/^\S.*\(.*\)/{h;d};G;s/\S+\((.*\))\n.*(\(.*)\).*/\2,\1/;P;d' file
If a line starts with a non-white space character and contains parens, copy it to the hold space (HS) and then delete it.
Otherwise, append the HS, remove the non-whitespace characters up to the opening paren, insert the value between the parens of the stored line followed by a comma, print the first line, and then delete the whole of the pattern space.
N.B. Lines that do not meet the substitution criteria will be unchanged.
An alternative solution using GNU parallel and sed:
parallel --pipe --recstart T -kqN1 sed -E '1{h;d};G;s/\S+\((.*)\n.*(\(.*)\).*/\2,\1/;P;d' <file

Print commands in history consisting of just one word

I want to print lines that contain a single word only.
For example:
this is a line
another line
one
more
line
last one
I want to get the ones with a single word only:
one
more
line
EDIT: Thank you for the answers. Almost all of them work on my test file. However, I wanted to list single-word lines in my bash history. When I try your answers like
history | your posted commands
all of them fail. Some only print numbers (maybe line numbers?).
You want to get all those commands in history that contain just one word. Since history prints the number of the command as the first column, you need to match the lines that consist of exactly two fields.
For this, you can say:
history | awk 'NF==2'
If you just want to print the command itself, say:
history | awk 'NF==2 {print $2}'
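To try this outside an interactive shell (where history prints nothing), the input can be simulated. A minimal sketch, assuming the usual layout of a command number followed by the command; one_word_commands is a hypothetical name.

```shell
one_word_commands() {
  # Two fields means: the history number plus a one-word command.
  awk 'NF == 2 { print $2 }'
}

# Simulated history output; in an interactive bash: history | one_word_commands
printf '%s\n' '  101  ls' '  102  git status' '  103  pwd' | one_word_commands
```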
To rephrase your problem: any line containing a space, or nothing at all, should be removed.
grep -Ev '^$| ' file
Your problem statement is unspecific on whether lines containing only punctuation might also occur. Maybe try
grep -Ex '[A-Za-z]+' file
to only match lines containing only one or more alphabetics. (The -x option implicitly anchors the pattern -- it requires the entire line to match.)
In Bash, the output from history is decorated with line numbers; maybe try
history | grep -E '^ *[0-9]+  [A-Za-z]+$'
to match lines where the line number is followed by a single alphabetic token. Notice that there are two spaces between the line number and the command.
In all cases above, the -E selects extended regular expression matching, aka egrep (basic RE aka traditional grep does not support e.g. the + operator, though it's available as \+).
Try this:
grep -E '^\s*\S+\s*$' file
With the above input, it will output:
one
more
line
If your test strings are in a file called in.txt, you can try the following:
grep -E "^\w+$" in.txt
What it means is:
^ starting the line with
\w any word character [a-zA-Z0-9_]
+ there should be at least 1 of those characters or more
$ line end
And output would be
one
more
line
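A quick check of what \w accepts may help here. This is a sketch assuming a grep that supports the \w extension (GNU and BSD grep do); single_word_lines is a made-up name.

```shell
single_word_lines() {
  # \w+ between ^ and $ accepts letters, digits and underscores only.
  grep -E '^\w+$' "$@"
}

printf '%s\n' 'one' 'foo_bar' 'cmd42' 'two words' 'dash-ed' | single_word_lines
```

Lines containing a space or a hyphen are rejected, while underscores and digits count as word characters.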
Assuming your file is named texts.txt, and if grep is not a requirement, then:
awk '{ if ( NF == 1 ) print }' texts.txt
If your single-word lines have no space at the end, you can also search for lines that contain no space at all:
grep -v " "
I think that what you're looking for could be best described as a newline followed by a word with a negative lookahead for a space,
/\n\w+\b(?! )/g

Regex to move second line to end of first line

I have several lines with certain values and I want to append every second line (every line beginning with <name>) to the end of the preceding line, which ends with </id>:
<id>rd://data1/8b</id>
<name>DM_test1</name>
<id>rd://data2/76f</id>
<name>DM_test_P</name>
so that I end up with something like
<id>rd://data1/8b</id><name>DM_test1</name>
The reason it came out like this is that I used two piped xpath queries.
Regex
Simply remove the newline at the end of each line ending in </id>. On Windows, replace (<\/id>)\r\n with \1 or $1 (the latter is Perl syntax). On Linux, search for (<\/id>)\n and replace it in the same way.
awk
A simple solution uses awk. The idea is this: when the line number is odd, we print the line without a newline; otherwise we print it with a newline. The format string "%s" keeps any % in the data from being interpreted by printf.
awk '{ if (NR % 2) { printf "%s", $0 } else { print $0 } }' file
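Relying on line parity breaks as soon as one <name> line is missing. A variant keyed on the closing </id> tag avoids that; this is a sketch, and join_id_name is a hypothetical name.

```shell
join_id_name() {
  awk '
    # A line ending in </id>: print it without a newline so the next line follows directly.
    /<\/id>$/ { printf "%s", $0; next }
    # Every other line is printed normally, which also ends the joined line.
    { print }
  ' "$@"
}

printf '%s\n' '<id>rd://data1/8b</id>' '<name>DM_test1</name>' \
              '<id>rd://data2/76f</id>' '<name>DM_test_P</name>' | join_id_name
```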
sed
Using sed we place a line in the hold space when it contains <id> and append the line to it when it's a <name> line. Then we remove the newline and print the hold buffer by exchanging it with the pattern space.
sed -n '/<id>.*<\/id>/{h}; /<name>.*<\/name>/{H;x;s/\n//;p}' file
pr
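Using pr we can achieve a similar result: -a fills the two columns across instead of down, -t suppresses the page header, and -s separates the columns with a single tab:
pr -a -t -s --columns 2 file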
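Another standard tool for joining lines pairwise is paste. A minimal sketch: the two "-" arguments make it take two lines at a time from standard input, and the delimiter \0 stands for the empty string. join_pairs is a made-up name.

```shell
join_pairs() {
  # Reads stdin; each output line is input line 2k-1 immediately followed by line 2k.
  paste -d '\0' - -
}

printf '%s\n' '<id>rd://data1/8b</id>' '<name>DM_test1</name>' \
              '<id>rd://data2/76f</id>' '<name>DM_test_P</name>' | join_pairs
```

With a file it would be join_pairs < file.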

Find text enclosed by patterns using sed

I have a config file like this:
[whatever]
Do I need this? no!
[directive]
This lines I want
Very much text here
So interesting
[otherdirective]
I dont care about this one anymore
Now I want to match the lines in between [directive] and [otherdirective] without matching [directive] or [otherdirective].
Also if [otherdirective] is not found all lines till the end of file should be returned. The [...] might contain any number or letter.
Attempt
I tried this using sed like this:
sed -r '/\[directive\]/,/\[[[:alnum:]]+\]/!d'
The only problem with this attempt is that the first line is [directive] and the last line is [otherdirective].
I know how to pipe this again to truncate the first and last line but is there a sed solution to this?
You can use the range, as you were trying, and inside it use a negated empty regex (//). An empty regex reuses the last regular expression that was applied, so it skips both edge lines:
sed -n '/\[directive\]/,/\[otherdirective\]/ { //! p }' infile
It yields:
This lines I want
Very much text here
So interesting
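The question also asks that any [...] header end the block and that everything up to the end of the file be returned when no further header follows. A flag-based awk sketch of that behaviour; section_body and its argument order are made up for illustration.

```shell
section_body() {
  # $1: section name without brackets; remaining arguments: input files (default stdin).
  wanted="[$1]"
  shift
  awk -v wanted="$wanted" '
    # Any [header] line switches printing on or off and is itself skipped.
    /^\[[[:alnum:]]+\]$/ { inside = ($0 == wanted); next }
    # Print lines only while inside the wanted section.
    inside
  ' "$@"
}

printf '%s\n' '[whatever]' 'Do I need this? no!' '[directive]' 'This lines I want' \
              'So interesting' '[otherdirective]' 'I dont care' | section_body directive
```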
Here is a nice way to get a section of data with awk, provided the sections are separated by blank lines:
awk -v RS= '/\[directive\]/' file
[directive]
This lines I want
Very much text here
So interesting
Setting RS to the empty string (RS=) puts awk in paragraph mode: the file is split into records at blank lines.
So when searching for [directive], it prints the whole record containing it, including the [directive] line itself.
Normally a record is one line, but because RS (the record separator) is changed, you get the whole block.
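A small demonstration of paragraph mode on input where the sections are separated by blank lines (an assumption; the sample in the question shows none). directive_paragraph is a hypothetical name.

```shell
directive_paragraph() {
  # RS= (empty) makes each blank-line-separated block one record.
  awk -v RS= '/\[directive\]/' "$@"
}

printf '%s\n' '[whatever]' 'no' '' '[directive]' 'line 1' 'line 2' '' \
              '[otherdirective]' 'other' | directive_paragraph
```

Without the blank lines, the whole input would be a single record and would be printed in full.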
Okay, after more tries I found a solution, or at least one solution:
sed -rn '/\[buildout\]/,/\[[[:alnum:]]+\]/{
/\[[[:alnum:]]+\]/d
p }'
Is this what you want?
\[directive\](.*?)\[

SED: addressing two lines before match

Print the line that is located 2 lines before the match (pattern).
I tried the following:
sed -n ': loop
/.*/h
:x
{n;n;/cen/p;}
s/./c/p
t x
s/n/c/p
t loop
{g;p;}
' datafile
The script:
sed -n "1N;2N;/XXX[^\n]*$/P;N;D"
works as follows:
Read the first three lines into the pattern space, 1N;2N
Search for the test string XXX anywhere in the last line, and if found print the first line of the pattern space, P
Append the next line input to pattern space, N
Delete first line from pattern space and restart cycle without any new read, D, noting that 1N;2N is no longer applicable
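The four steps above can be written one command per line with comments. This is a sketch assuming GNU sed, with XXX as the search string as in the answer; the function name two_before_xxx is made up.

```shell
two_before_xxx() {
  sed -n '
    # Prime the window: after these two commands the pattern space holds lines 1-3.
    1N
    2N
    # If the newest (last) line contains XXX, print the oldest (first) line.
    /XXX[^\n]*$/P
    # Slide the window: append one line, drop the oldest, restart without a fresh read.
    N
    D
  ' "$@"
}

printf '%s\n' 'one' 'two' 'three' 'four XXX' 'five' 'six XXX' | two_before_xxx
```

The pattern space therefore always holds a window of three lines, and only its first line is ever printed.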
This might work for you (GNU sed):
sed -n ':a;$!{N;s/\n/&/2;Ta};/^PATTERN\'\''/MP;$!D' file
This will print the line 2 lines before the PATTERN throughout the file.
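This one uses grep and is a bit simpler and easier to read (it needs one pipe, though). grep -B2 prints the two lines before each match, so the first line of its output is the line two lines before the first match, assuming that match is not within the first two lines of the file:
grep -B2 'pattern' file_name | sed -n '1p'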
If you can use awk try this:
awk '/pattern/ {print b} {b=a;a=$0}' file
This will print the line two lines before each line that matches the pattern.
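The two variables a and b are a two-slot buffer. For an arbitrary distance the same idea can be sketched with a small ring buffer; n_before and its argument order are made up for illustration.

```shell
# n_before N REGEX: print the line N lines before each line matching REGEX (input on stdin).
n_before() {
  awk -v n="$1" -v pat="$2" '
    # buf[NR % n] still holds line NR-n here, because the current line is stored afterwards.
    NR > n && $0 ~ pat { print buf[NR % n] }
    { buf[NR % n] = $0 }
  '
}

printf '%s\n' one two three four five six seven | n_before 2 six
```

Only n lines are kept in memory at any time, so this also suits large files.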
I've tested your sed command, but the result is strange (and obviously wrong), and you didn't give any explanation. You have to save three lines in a buffer (the hold space), test the pattern against the newest line, and print the oldest one if it matches:
sed -n '
## At the beginning read three lines.
1 { N; N }
## Append them to "hold space". In following iterations it will append
## only one line.
H
## Get content of "hold space" to "pattern space" and check if the
## pattern matches. If so, extract content of first line (until a
## newline) and exit.
g
/^.*\nsix$/ {
s/^\n//
P
q
}
## Remove the oldest of the three saved lines and append the new one.
s/^\n[^\n]*//
h
' infile
Assuming an input file (infile) with the following content:
one
two
three
four
five
six
seven
eight
nine
ten
It will search for six and yield as output:
four
Here are some other variants:
awk '{a[NR]=$0} /pattern/ {f=NR} END {print a[f-2]}' file
This stores all lines in an array a. When the pattern is found, it stores the line number in f.
At the end, it prints the line two lines before that one (only for the last match).
PS: this may be slow with large files, since the whole file is kept in memory.
Here is another one:
awk 'FNR==NR && /pattern/ {f=NR;next} f-2==FNR' file{,}
This reads the file twice (file{,} is a bash brace expansion, equivalent to file file).
In the first pass it finds the pattern and stores the line number in the variable f.
In the second pass it prints the line two lines before the line number stored in f.
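The two-pass idea can be extended to report every match rather than a single one. A sketch: the first pass records each target line number in an array, the second pass prints those lines. The file name is passed twice explicitly, which also works in shells without brace expansion. two_before_all and its argument order are hypothetical.

```shell
# two_before_all REGEX FILE
two_before_all() {
  awk -v pat="$1" '
    # Pass 1 (NR == FNR): note the line number two before each match.
    NR == FNR { if ($0 ~ pat) want[FNR - 2]; next }
    # Pass 2: print the lines whose numbers were noted.
    FNR in want
  ' "$2" "$2"
}

tmp=$(mktemp)
printf '%s\n' one two three four five six seven eight > "$tmp"
two_before_all 'six|eight' "$tmp"
rm -f "$tmp"
```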