regexp find and replace: bash variables inside sed - regex

I would like to remove this sequence when present at the beginning of the line:
ATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG followed by at least 3 A characters.
Both the sequence and the run of A's should be removed, and the rest of the file should be preserved.
My input files look like this:
#M00946:3:000000000-A2WF2:1:1101:18115:1962 1:N:0:2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACATTTTCTTTCTTACTTCGTTCACTTTCCACTTCTTTCTCCCTATCTTCCCCCTTCTGTCTGCCCCAGCTGTCTATCCCACTTATTGTCTCCCCCCACTGCCCCACACTCCTACCTTCTTCATCTTCACCTAACACCTCCCGCTCCCTCCTTATCGTCTCTTATCCTTTCCTTGTTCC
+
????????DDDDDDDDGGGGGGHHIIIIHHHIIIIFHIIIH/CGFHHIIIIHEDHHIIIIHI=5EEGFEHHEC+5,,4#,#,,....--..+77,,.6..6.....7.4..7.76=..-5.>.4-)134-.5....-3*))0***1*********10*0**01*1*)''..0***.)0'))*****00*11******01***0****0*)**0)'''...*0)0*11********1****1*0********
#M00946:3:000000000-A2WF2:1:1101:19888:2900 1:N:0:2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAACACAAATACCGTTCCAATATCTTTTTGTTTCATGTCTAATAAC
+
<<??????BB?BBBBBCAFFFCFHF;>EFCDFGFFHFBGHCA=FHA>EFGEE7CF>F?FFHB=?EEGF>>DH5<)++,++,4,,4+=:,,,,5,,,,,,,,),33?,3,3,3,,,,33
I was trying to use a script, replace.sh, which looks like this:
file=$1;
adapter_sequence=$2;
sed -r "s/${adapter_sequence}A{3}//" $file
from the command line:
./replace.sh file.fastq GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
It did not work. Any help in any script language will be appreciated.

I believe you have $1 and $2 reversed. Have it like this:
adapter_sequence=$2
sed "s/$adapter_sequence//" $1
In the ideal case I would like to remove all adapter sequences
starting at the beginning of line followed by at least three A
letters,
Try this sed:
sed -r "s/^${adapter_sequence}A{3,}//" file
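To see the anchored substitution in action without a full FASTQ file, here is a minimal sketch. The six-base adapter and the two reads are made up for illustration (stand-ins for the real 63-base adapter), and -E is used as the portable spelling of -r:

```shell
# Hypothetical short adapter standing in for the real sequence.
adapter_sequence="GATCGG"

# Two made-up reads: the first has five A's after the adapter, the second only two.
input=$(printf '%s\n' \
  '#read1' \
  'GATCGGAAAAACATTT' \
  '+' \
  '????????DDDDDDDD' \
  '#read2' \
  'GATCGGAACATTT')

# Double quotes let the shell expand ${adapter_sequence} before sed parses the script.
# ^ anchors the match at the start of the line; A{3,} means "three or more A".
trimmed=$(printf '%s\n' "$input" | sed -E "s/^${adapter_sequence}A{3,}//")
printf '%s\n' "$trimmed"
```

The first read loses the adapter plus its five A's; the second has only two A's after the adapter and is left alone. Only the sequence line is changed, so the quality line keeps its original length.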

Related

sed regex match and replace any last digit

I have lots of files containing the following IP address, and I want to replace the last digit of the IP, but I am struggling to come up with the correct regex.
file1
IPADDR=10.30.2.26
NETMASK=255.255.0.0
GATEWAY=10.30.0.1
I want to change 10.30.2.26 to 10.30.2.27 using sed, but I am missing something.
I have many files in which I want to make this replacement, and the last digit could be anything.
I have tried sed 's/[^IPADDR].$/7/g' file1
How do I match anything between ^IPADDR{anything}$ ?
In your regex, [^IPADDR] is a negated character class: it matches any single character except I, P, A, D and R. I'm not sure that's what you want.
You can use an address instead to select the lines starting with IPADDR (/^IPADDR/) and apply the substitution command only to them:
sed '/^IPADDR/s/[0-9]$/7/' file
You may use the following command:
sed -r 's/(^IPADDR=[0-9.]+)([0-9]$)/\17/g' file
Prints:
IPADDR=10.30.2.27
NETMASK=255.255.0.0
GATEWAY=10.30.0.1
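Both commands above replace a single trailing digit. If the last octet may have more than one digit, a variant along these lines might be closer. The target value 27 and the inline sample are assumptions for illustration:

```shell
# Inline copy of the sample file from the question.
config=$(printf '%s\n' \
  'IPADDR=10.30.2.26' \
  'NETMASK=255.255.0.0' \
  'GATEWAY=10.30.0.1')

# The address /^IPADDR=/ restricts the substitution to that one line;
# [0-9]+$ matches the whole final octet rather than just one digit.
updated=$(printf '%s\n' "$config" | sed -E '/^IPADDR=/s/[0-9]+$/27/')
printf '%s\n' "$updated"
```

The NETMASK and GATEWAY lines pass through untouched because the address never selects them.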

Finding strings across lines and replace with nothing

I have some 'fastq' format DNA sequence files (basically just text files) like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#
+
#
+
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
My ultimate goal is to turn these into 'fasta' format files, but to do that I need to get rid of the two empty sequences in the middle.
EDIT
The desired output would look like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
All of the dedicated software I tried (Biopython, stand alone programs, perl scripts posted by others) crash at the empty sequences. This is really just a problem of searching for the string #\n+ and replacing it with nothing. I googled this and read several posts and tried about a million options with sed and couldn't figure it out. Here are some things that didn't work:
sed s/'#'/,/'+'// test.fastq > test.fasta
sed s/'#,+'// test.fastq > test.fasta
Any insights would be greatly appreciated.
PS. I've got a Mac.
Try:
sed "/^[#+]*$/d" test.fastq > test.fasta
The d command tells sed to "delete" the matching line (i.e. not print it).
^ and $ mean "start of string" and "end of string" respectively, i.e. the line must be an exact match.
So, the above command basically says:
Print all lines that do not only contain # or +, and write the result to test.fasta.
Edit: I misunderstood the question slightly, sorry. If you want to only remove pairs of consecutive lines like
#
+
then you need to perform a multi-line search and replace.
Although this can be done with sed, it's perhaps easier to use something like a perl script instead:
perl -0pe 's/^#\n\+\n//gm' test.fastq > test.fasta
The -0 option sets Perl's input record separator to the null character. A text file normally contains no null characters, so Perl reads the entire input in one shot (instead of line by line), which enables multi-line search and replace. (-0777 is the explicit "file slurp" mode.)
The -p option loops over the input and prints the result after running the code given with -e (pattern matching and replacement in this case).
^#\n\+\n is the pattern to match, which we are replacing with nothing (i.e. deleting).
/g makes the substitution global, and /m lets ^ match at the start of every line rather than only at the start of the string.
You could also pass -i as the first parameter to perl to edit the file in place.
This may not be the most elegant solution in the world, but you can use tr to replace the \n with a null character and back.
cat test.fastq | tr '\n' '\0' | sed 's/#\x0+\x0//g' | tr '\0' '\n' > test.fasta
Try this:
sed '/^#$/{N;/\n+$/d}' file
When # is found, next line is appended to the pattern space with N.
If the next line is just +, the d command deletes both lines.
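A quick way to check the N approach is to run it on a shortened copy of the sample. This is a sketch with made-up four-base reads; the ; before } is added because, as far as I know, BSD sed (the one shipped with macOS) rejects d} while GNU sed accepts both spellings:

```shell
# Shortened stand-in for the sample: two real records around two empty ones.
fastq=$(printf '%s\n' \
  '#Sample_1' 'ACTG' '+' 'BBBB' \
  '#' '+' \
  '#' '+' \
  '#Sample_4' 'ACTG' '+' 'BBBB')

# On a line that is exactly "#", append the next line with N.
# If the appended line is exactly "+", delete both; otherwise both are printed.
# In basic regex syntax the + after \n is a literal plus sign.
cleaned=$(printf '%s\n' "$fastq" | sed '/^#$/{N;/\n+$/d;}')
printf '%s\n' "$cleaned"
```

Only the two empty # / + pairs disappear; headers such as #Sample_1 do not match /^#$/ and are never joined with the following line.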

Unix Shell Scripting removing - from file

I have a file containing following info
job, cost_code, laborcost,materialcost
202-12-21, 23-94-23, **110.00-**, 120.04
204-12-21, 23-93-23, 520.00, **120.04-**
204-12-12, 24-93-23, 155.00, **120.04-**
There are a few problem records, specifically the ones that have a - sign at the end of an amount (marked with **, which isn't really there in the file).
I am trying to remove the - sign with a regex, but I am having trouble matching only the amounts that have the problem.
Try this; the command removes a dash at the end of a line:
sed 's/-$//'
echo "204-12-21, 23-93-23, 520.00, 120.04-" | sed 's/-$//'
You can use sed:
sed -i~ 's/\([0-9]\)-,/\1,/g;s/-$//' input.csv
The first expression finds a digit followed by -, and replaces it by the digit and comma; the second expression removes dashes at ends of lines.
This might work for you (GNU sed):
sed -r 's/-(,|$)/\1/g' file
This removes a hyphen that comes directly before a comma or at the end of the line.
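As a small check of the alternation approach, here it is applied to two of the sample records (with the ** markers left out, since they are not in the real file). This is only a sketch; -E is used as the portable spelling of -r:

```shell
# Two records from the question, one with the stray minus mid-line, one at the end.
records=$(printf '%s\n' \
  '202-12-21, 23-94-23, 110.00-, 120.04' \
  '204-12-21, 23-93-23, 520.00, 120.04-')

# -(,|$) matches a hyphen only when a comma or the end of the line follows it;
# \1 puts the comma back (or nothing, when the match was at the end of the line).
fixed=$(printf '%s\n' "$records" | sed -E 's/-(,|$)/\1/g')
printf '%s\n' "$fixed"
```

The hyphens inside the job and cost codes are followed by digits, so they never match and stay in place.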

process a delimited text file with sed

I have a ";" delimited file:
aa;;;;aa
rgg;;;;fdg
aff;sfg;;;fasg
sfaf;sdfas;;;
ASFGF;;;;fasg
QFA;DSGS;;DSFAG;fagf
I'd like to process it replacing the missing value with a \N .
The result should be:
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;\N
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
I'm trying to do it with a sed script:
sed "s/;\(;\)/;\\N\1/g" file1.txt >file2.txt
But what I get is
aa;\N;;\N;aa
rgg;\N;;\N;fdg
aff;sfg;\N;;fasg
sfaf;sdfas;\N;;
ASFGF;\N;;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
You don't need to enclose the second semicolon in parentheses just to use it as \1 in the replacement string. You can use ; in the replacement string:
sed 's/;;/;\\N;/g'
As you noticed, when it finds a pair of semicolons it replaces it with the desired string then skips over it, not reading the second semicolon again and this makes it insert \N after every two semicolons.
A solution is to use positive lookaheads; the regex is /;(?=;)/ but sed doesn't support them.
But it's possible to solve the problem using sed in a simple manner: duplicate the search command; the first command replaces the odd appearances of ;; with ;\N, the second one takes care of the even appearances. The final result is the one you need.
The command is as simple as:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
It duplicates the previous command and uses the ; between g and s to separate them. Alternatively you can use the -e command line option once for each search expression:
sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
Update:
The OP asks in a comment "What if my file have 100 columns?"
Let's try and see if it works:
$ echo "0;1;;2;;;3;;;;4;;;;;5;;;;;;6;;;;;;;" | sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
0;1;\N;2;\N;\N;3;\N;\N;\N;4;\N;\N;\N;\N;5;\N;\N;\N;\N;\N;6;\N;\N;\N;\N;\N;\N;
Look, ma! It works!
:-)
Update #2
I ignored the fact that the question doesn't ask to replace ;; with something else but to replace the empty/missing values in a file that uses ; to separate the columns. Accordingly, my expression doesn't fix the missing value when it occurs at the beginning or at the end of the line.
As the OP kindly added in a comment, the complete sed command is:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/g;s/;$/;\\N/g'
or (for readability):
sed -e 's/;;/;\\N;/g;' -e 's/;;/;\\N;/g;' -e 's/^;/\\N;/g' -e 's/;$/;\\N/g'
The two additional steps replace a ';' found at the beginning or at the end of a line.
You can use this sed command with 2 s (substitute) commands:
sed 's/;;/;\\N;/g; s/;;/;\\N;/g;' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
Or using lookarounds regex in a perl command:
perl -pe 's/(?<=;)(?=;)/\\N/g' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
The main problem is that the same character can't be consumed by more than one match within a single substitution pass:
s/;;/..../g: The second ; can't be reused for the next match in a string like ;;;
If you want to do it with sed without using a Perl-like regex mode, you can use a loop with the conditional command t:
sed ':a;s/;;/;\\N;/g;ta;' file
:a defines a label "a"; ta jumps to this label only if something has been replaced.
For the ; at the end of the line (also dealing with possible trailing whitespace):
sed ':a;s/;;/;\\N;/g;ta; s/;[ \t\r]*$/;\\N/1' file
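Putting the loop together with the two edge cases (an empty first or last field), a sketch might look like this. The function name fill_empty is made up, and the label and commands are passed as separate -e expressions because some sed implementations do not accept a ; directly after a label:

```shell
# Hypothetical helper: fill every empty ;-separated field with \N.
fill_empty() {
  sed -e ':a' \
      -e 's/;;/;\\N;/g' \
      -e 'ta' \
      -e 's/^;/\\N;/' \
      -e 's/;$/;\\N/'
}

# The sample rows from the question.
rows=$(printf '%s\n' \
  'aa;;;;aa' \
  'rgg;;;;fdg' \
  'aff;sfg;;;fasg' \
  'sfaf;sdfas;;;' \
  'ASFGF;;;;fasg' \
  'QFA;DSGS;;DSFAG;fagf')

filled=$(printf '%s\n' "$rows" | fill_empty)
printf '%s\n' "$filled"
```

The loop repeats the ;; substitution until a pass changes nothing, so runs of any length are covered; the last two expressions then handle a leading and a trailing ; once the loop has finished.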
This awk one-liner will give you what you want:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N"}7' file
if you really want the line: sfaf;sdfas;\N;\N;\N , this line works for you:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N";sub(/;$/,";\\N")}7' file
sed 's/;/;\\N/g;s/;\\N\([^;]\)/;\1/g;s/;[[:blank:]]*$/;\\N/' YourFile
Non-recursive, one-liner, POSIX compliant.
Concept:
change all ;
put back the unmatched ones
handle the special case of a last ; possibly followed by blanks before the end of the line
This might work for you (GNU sed):
sed -r ':;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;t' file
There are 4 scenarios in which an empty field may occur: at the start of a record, between 2 field delimiters, an empty field following an empty field, and at the end of a record. Alternation can be employed to cater for scenarios 1, 2 and 4, and scenario 3 can be catered for by a second pass using a loop (:;...;t). Multiple scenarios can be replaced in both passes using the g flag.

Understanding a sed example

I found a solution for extracting the password from a Mac OS X Keychain item. It uses sed to get the password from the security command:
security 2>&1 >/dev/null find-generic-password -ga $USER | \
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
The code is here in a comment by 'sr105'. The part before the | evaluates to password: "secret". I'm trying to figure out exactly how the sed command works. Here are some thoughts:
I understand the flags -En, but what are the commas doing in this example? In the sed docs it says a comma separates an address range, but there's 3 commas.
The first 'address' /^password: / has a trailing s; in the docs s is only mentioned as the replace command like s/pattern/replacement/. Not the case here.
The ^password: "(.*)"$ part looks like the Regex for isolating secret, but it's not delimited.
I can understand the end part where the back-reference \1 is printed out, but again, what are the commas doing there??
Note that I'm not interested in an easier alternative to this sed example. This will only be part of a larger bash script which will include some more sed parsing in an .htaccess file, so I'd really like to learn the syntax even if it is obscure.
Thanks for your help!
Here is sed command:
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
Commas are used as the regex delimiter here; it could just as well be another delimiter, like #:
sed -En '/^password: / s#^password: "(.*)"$#\1#p'
/^password: / finds an input line that starts with password:
s#^password: "(.*)"$#\1#p finds and captures the double-quoted string after password: and replaces the entire line with the captured string \1 (so all that remains is the password).
First, the command extracts passwords from a file (or stream) and prints them to stdout.
While you "normally" might execute a sed command on all lines of a file, sed lets you specify a regex pattern (an address) that selects the lines the following command is applied to.
In your case
/^password: /
is a regex, saying that the command:
s,^password: "(.*)"$,\1,p
should get executed for all lines looking like password: "secret". The command substitutes those lines with the password itself, while the -n flag suppresses all other lines and the trailing p prints the substituted one.
The substitute command might look unusual, but you can choose the delimiter in a sed command; it is not limited to /. In this case , was chosen.
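The command can be tried without a Keychain by feeding it stand-in input. The two input lines below are made up for illustration and are not real output of the security tool:

```shell
# Made-up stand-in for the text that the security command writes.
sample=$(printf '%s\n' \
  'some other line' \
  'password: "secret"')

# -n suppresses automatic printing; the address /^password: / selects the line;
# s,...,...,p uses commas as delimiters and prints only when a substitution happened.
with_commas=$(printf '%s\n' "$sample" | sed -En '/^password: / s,^password: "(.*)"$,\1,p')

# Same command with # as the delimiter, to show the choice is arbitrary.
with_hashes=$(printf '%s\n' "$sample" | sed -En '/^password: / s#^password: "(.*)"$#\1#p')

printf '%s\n' "$with_commas" "$with_hashes"
```

Both variants print only the captured text between the quotes; the unrelated first line is dropped because of -n.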