Remove all sequences containing the phrase "random"

This is what my input looks like:
>a
AACTCTCTC
CGTGCTCTC
>b_random
ACTGSTSTS
CTCTCTCCT
ATATATA
>c
AACTCTCTC
CGTGCTCTC
>d
AACTCTCTC
CGTGCTCTC
CGTGCTCTC
>e_random
ACTGSTSTS
CTCTCTCCT
ATATATA
>c_random
ACTGSTSTS
CTCTCTACT
GSTSTSCTC
TCTCCTCCT
ATATATA
I would like to remove all sequences containing the phrase "random" - a sequence always starts with ">" and ends where the next sequence starts.
In this case, I would like to get 3 files:
a.txt
>a
AACTCTCTC
CGTGCTCTC
c.txt
>c
AACTCTCTC
CGTGCTCTC
d.txt
>d
AACTCTCTC
CGTGCTCTC
CGTGCTCTC
Right now, I somehow cannot get sed to do what I want. I started with this:
sed 's/random.*random//g' sample_data
which is not working. Thank you very much.

The easiest way to go here is probably with awk and a sensible RS/ORS setting:
awk 'NF && $1 !~ /random/ { print RS $0 > ($1 ".txt"); close($1 ".txt") }' RS='>' ORS='' sample_data
If you have description lines with spaces in them, you need to set FS='\n' as well.
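Here is a self-contained sketch of the RS='>' approach (the file name and sample data are illustrative only). The NF guard skips the empty record awk sees before the first ">", and the print target is parenthesized because some awks cannot handle concatenated targets:

```shell
# Build a small FASTA-style sample (illustrative data)
cat > sample_data <<'EOF'
>a
AACTCTCTC
CGTGCTCTC
>b_random
ACTGSTSTS
>c
AACTCTCTC
EOF

# Split on '>' records; NF skips the empty leading record
awk 'NF && $1 !~ /random/ { print RS $0 > ($1 ".txt"); close($1 ".txt") }' \
    RS='>' ORS='' sample_data

cat a.txt
# >a
# AACTCTCTC
# CGTGCTCTC
```

No b_random.txt is created, since that record's first field matches /random/.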

Here's one way using awk that should handle large files:
awk '/^>/ { i=substr($0,2) } i ~ /random/ { i="" } i { print > (i ".txt") }' file
Results of grep . *.txt:
a.txt:>a
a.txt:AACTCTCTC
a.txt:CGTGCTCTC
c.txt:>c
c.txt:AACTCTCTC
c.txt:CGTGCTCTC
d.txt:>d
d.txt:AACTCTCTC
d.txt:CGTGCTCTC
d.txt:CGTGCTCTC

awk '/^>/ && $0!~/random/{file=substr($0,2)".txt";f=1} /random/{f=0} f{print > file}' your_file

Another awk, without using RS, to avoid its limitations:
awk -F\> '/>/{close(f); f=/random/?x:$2 ".txt"} f{print>f}' file
This version also closes the file and uses a variable for the file name, because some awks cannot handle concatenated print targets.

How to use awk or sed to replace specific word under certain profile

A file contains data in the following format. Now I want to change the value of showfirst under the XYZ section. How can I achieve that with sed, awk, or grep?
I thought of using the line number or the second occurrence, but that's not going to be constant. In the future the file can contain hundreds of such profiles, so it has to be keyed on the profile name.
I know that I can extract the 1st line after the 'XYZ' pattern, but I want it to be field-based.
Thanks for the help.
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=10
showlast=3
With sed:
sed '/^\[XYZ\]/,/^showfirst *=/{0,//!s/.*/showfirst=20/}' file
How it works:
/^\[XYZ\]/,/^showfirst *=/: address range that matches lines from [XYZ] to the next line starting with showfirst
0,//: the empty regex // reuses the last regex applied, so this range runs from the start of input through the first line it matches (here the [XYZ] line); 0,addr is a GNU sed extension
0,//!: negated, so the substitution is skipped on the [XYZ] line itself and applied only to the showfirst line
s/.*/showfirst=20/: replace that whole line with showfirst=20
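A runnable sketch of the above (GNU sed only, since 0,// addressing is a GNU extension; the file name is illustrative):

```shell
# Sample config (illustrative)
cat > profile.conf <<'EOF'
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=10
showlast=3
EOF

# Only the showfirst line inside the [XYZ] section is rewritten
sed '/^\[XYZ\]/,/^showfirst *=/{0,//!s/.*/showfirst=20/}' profile.conf
```

The showfirst line under [ABC] is untouched because it sits outside the [XYZ] range.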
You can use awk like this:
awk -v val='50' 'BEGIN{FS=OFS="="} /^\[.*\]$/{ flag = ($1 == "[XYZ]"?1:0) }
flag && $1 == "showfirst" { $2 = val } 1' file
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=50
showlast=3
In awk, all parameterized:
$ awk -v l='XYZ' -v k='showfirst' -v v='666' ' # parameters: label, key, value
BEGIN { FS=OFS="=" }                           # delimiters
/\[.*\]/ { f=0 }                               # flag down at every new label
$1=="[" l "]" { f=1 }                          # flag up at the target label
f==1 && $1==k { $2=v }                         # replace value when flag is up and key matches
1' file                                        # print
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=666
showlast=3
Man, even I feel confused looking at that code.
If you set the record separator (RS) to the empty string, awk reads a whole record at a time, assuming the records are double-newline separated (i.e. blank-line-separated paragraphs).
So for example you can do something like this:
awk -v k=XYZ -v v=42 '$1 ~ "\\[" k "\\]" { sub("showfirst *=[^\n]*", "showfirst=" v) } 1' RS= ORS='\n\n' infile
Output:
[ABC]
showfirst =0
showlast=10
[XYZ]
showfirst=42
showlast=3
With sed, the following command solved my problem for every profile:
sed '/.*\[ABC\]/,/^$/ s/.*showfirst.*/showfirst=20/' input.conf
The syntax is: sed [address] command
/.*\[ABC\]/,/^$/: this generates an address range covering the region from [ABC] up to the next blank line (so it assumes sections are separated by blank lines). The search for the string is done within this range only.
s/.*showfirst.*/showfirst=20/: this searches for any line containing showfirst and replaces the entire line with showfirst=20
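A sketch of this approach on a blank-line-separated config (illustrative file name). Note that it relies on /^$/ to close the range: with no blank line after the [ABC] block, the range would run to the end of file and also rewrite later sections:

```shell
# Sample config with blank-line-separated sections (illustrative)
cat > input.conf <<'EOF'
[ABC]
showfirst =0
showlast=10

[XYZ]
showfirst=10
showlast=3
EOF

# Rewrites only the showfirst line inside the [ABC] block
sed '/.*\[ABC\]/,/^$/ s/.*showfirst.*/showfirst=20/' input.conf
```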

How can I use sed to find a line starting with AAA but NOT ending with BBB

I'm trying to create a script to append oracleserver to /etc/hosts as an alias of localhost, which means I need to:
Locate the line that matches ^127.0.0.1 and does NOT match oracleserver$
Then, append oracleserver to this line
I know the best practice would probably be a negative lookahead. However, sed does not have lookaround features: What's wrong with my lookahead regex in GNU sed?. Can anyone suggest possible solutions?
sed -i '/oracleserver$/! s/^127\.0\.0\.1.*$/& oracleserver/' filename
/oracleserver$/! - on lines not ending with oracleserver
^127\.0\.0\.1.*$ - replace the whole line if it starts with 127.0.0.1
& oracleserver - replace with the whole matched line, a space separator (required), and oracleserver after that
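A throwaway sketch, operating on a scratch copy rather than the real /etc/hosts (note this uses GNU sed's -i; BSD sed would need -i ''):

```shell
# Work on a scratch copy, not the real /etc/hosts
cat > hosts.sample <<'EOF'
127.0.0.1 localhost
192.168.0.5 db
EOF

sed -i '/oracleserver$/! s/^127\.0\.0\.1.*$/& oracleserver/' hosts.sample
grep '^127' hosts.sample   # 127.0.0.1 localhost oracleserver

# Idempotent: a second run changes nothing, since the line now ends with oracleserver
sed -i '/oracleserver$/! s/^127\.0\.0\.1.*$/& oracleserver/' hosts.sample
grep '^127' hosts.sample   # 127.0.0.1 localhost oracleserver
```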
Just use awk with && to combine the two conditions:
awk '/^127\.0\.0\.1/ && !/oracleserver$/ { $0 = $0 " oracleserver" } 1' file
This appends the string when the first pattern is matched but the second one isn't. The 1 at the end is always true, so awk prints each line (the default action is { print }).
I wouldn't use sed but instead perl:
Locate the line that ^127.0.0.1 and NOT oracleserver$
perl -pe 'if ( m/^127\.0\.0\.1/ and not m/oracleserver$/ ) { s/$/ oracleserver/ }'
That should do the trick. You can add -i.bak to edit in place too.

How to exclude patterns in regex conditionally in bash?

This is the content of input.txt:
hello=123
1234
stack=(23(4))
12341234
overflow=345
=
friends=(987)
I'm trying to match all the lines containing an equals sign, removing the external parentheses (if the line has them).
To be clear, this is the result I'm looking for:
hello=123
stack=23(4)
overflow=345
friends=987
I thought of something like this:
cat input.txt | grep -Poh '.+=(?=\()?.+(?=\))?'
But it does not return what I want. What am I doing wrong? Do you have any ideas? I'm very interested.
Using awk:
awk 'BEGIN{FS=OFS="="} NF==2 && $1!=""{gsub(/^\(|\)$/, "", $2); print}' file
hello=123
stack=23(4)
overflow=345
friends=987
Here is an alternate way with sed:
sed -nr ' # Use n to disable default printing and r for extended regex
/.+=.+/ { # Look for lines with key value pairs separated by =
/[(]/!ba; # If the line does not contain a paren branch out to label a
s/\(([^)]+)\)/\1/; # If the line contains parens, strip the outermost pair, keeping the content
:a # Our label
p # print the line
}' file
$ sed -nr '/.+=.+/{/[(]/!ba;s/\(([^)]+)\)/\1/;:a;p}' file
hello=123
stack=23(4)
overflow=345
friends=987

Split a file into multiple files based on differing start and end delimiters

I have a file that I need to split into multiple files, and I need it done via separate start and end delimiters.
For example, if I have the following file:
abcdef
START
ghijklm
nopqrst
END
uvwxyz
START
abcdef
ghijklm
nopqrs
END
START
tuvwxyz
END
I need 3 separate files:
file1
START
ghijklm
nopqrst
END
file2
START
abcdef
ghijklm
nopqrs
END
file3
START
tuvwxyz
END
I found this link which shows how to do it with a starting delimiter, but I also need an ending delimiter. I have tried using some regex in the awk command, but am not getting the result that I want. I don't quite understand how to get awk to be 'lazy' or 'non-greedy', so that I can get it to pull the file apart correctly.
I really like that awk solution; something similar would be fantastic (I am reposting it here so you don't have to click through):
awk '/DELIMITER_HERE/{n++}{print >"out" n ".txt" }' input_file.txt
Any help is appreciated.
You can use this awk command:
awk '/^START/{n++;w=1} n&&w{print > ("out" n ".txt")} /^END/{w=0}' input_file.txt
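A self-contained run of this splitter on the question's sample (the print target is parenthesized here, since some awks choke on concatenated targets):

```shell
cat > input_file.txt <<'EOF'
abcdef
START
ghijklm
nopqrst
END
uvwxyz
START
abcdef
ghijklm
nopqrs
END
START
tuvwxyz
END
EOF

# n counts START markers; w is 1 while we are inside a START..END block
awk '/^START/{n++;w=1} n&&w{print > ("out" n ".txt")} /^END/{w=0}' input_file.txt

cat out3.txt
# START
# tuvwxyz
# END
```

Text outside any START..END block (abcdef, uvwxyz) ends up in no output file.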
awk '
/START/ {p = 1; n++; file = "file" n}
p { print > file }
/END/ {p = 0}
' filename
Here's another example using range notation:
awk '/START/,/END/ {if(/START/) n++; print > ("out" n ".txt")}' data
Or an equivalent with a different if/else syntax:
awk '/START/,/END/ {print > ("out" (/START/ ? ++n : n) ".txt")}' data
Here's a version that avoids repeating the /START/ regex (after Ed Morton's comments), because I just wanted to see if it would work:
awk '/START/ && ++n,/END/ {print > ("out" n ".txt")}' data
The other answers are definitely better if your range is or will ever be non-inclusive of the ends.

Replace previous when match regular expression

I need to delete the "end of line" of the previous line when the current line does not start with a number (^[^0-9]); basically, if it matches, append it to the line before. I'm a sed & awk n00b, and really like them btw. Thanks.
edit:
$ cat file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;tex
t
broke
4564;1;1;"also
";12,2121;546465
$ "script" file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;text broke
4564;1;1;"also";12,2121;546465
You didn't post any sample input or expected output at first, so this is a guess, but it sounds like what you're asking for:
$ cat file
a
b
3
4
c
d
$ awk '{printf "%s%s",(NR>1 && /^[[:digit:]]/ ? ORS : ""),$0} END{print ""}' file
ab
3
4cd
On the OP's newly posted input:
$ awk '{printf "%s%s",(NR>1 && /^[[:digit:]]/ ? ORS : ""),$0} END{print ""}' file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;textbroke
4564;1;1;"also";12,2121;546465
This might work for you (GNU sed):
sed -r ':a;$!N;s/\n([^0-9]|$)/\1/;ta;P;D' file
Keep two lines in the pattern space; if the second line is empty or does not start with a digit, remove the newline between them.
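Exercising it on the question's sample (GNU sed, because of -r and the semicolon-separated labels):

```shell
cat > file <<'EOF'
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;tex
t
broke
4564;1;1;"also
";12,2121;546465
EOF

# N appends the next line; the substitution joins it if it starts
# with a non-digit (or is empty), then loops back to label a
sed -r ':a;$!N;s/\n([^0-9]|$)/\1/;ta;P;D' file
# 1;1;1;text,1
# 2;4;;8;some;1;1;1;more
# 100;textbroke
# 4564;1;1;"also";12,2121;546465
```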
If you have Ruby on your system:
array = File.open("file").readlines
array.each_with_index do |val, ind|
  # chomp off the previous item's \n; guard ind > 0 so index -1 doesn't wrap to the last element
  array[ind-1].chomp! if ind > 0 && !val[/^\d/]
end
puts array.join
output
# ruby test.rb
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;textbroke
4564;1;1;"also";12,2121;546465