Insert 1 after elements with no count - regex

I have a file of a structure like this:
NH3O
CH4
CHN
C2NOPH3
What I am trying to do is insert 1 as the count after every element that has no count, i.e. between two adjacent letters or at the end of the item. Thus, the desired output is:
N1H3O1
C1H4
C1H1N1
C2N1O1P1H3
So far, I was trying something like sed -e 's/\([A-Z]\)\([A-Z]\)/\11\2/g' -e 's/\([A-Z]\)[[:blank:]]/\11/g' but that does not work out.
Thanks for any tips

Could you please try the following, written for GNU awk. It appends 1 to every letter that is followed by an uppercase letter or by the end of the line.
awk '{num=split($0,array,"");for(i=1;i<=num;i++){if(array[i]~/[a-zA-Z]/ && (i==num || array[i+1]~/[A-Z]/)){array[i]=array[i]"1"};val=val array[i]};print val;val=""}' Input_file
Adding a non-one-liner form of the solution here.
awk '
{
num=split($0,array,"") ##Splitting the current line into single characters.
for(i=1;i<=num;i++){
if(array[i]~/[a-zA-Z]/ && (i==num || array[i+1]~/[A-Z]/)){ ##Letter followed by an uppercase letter or by the end of the line.
array[i]=array[i]"1"
}
val=val array[i]
}
print val
val=""
}
' Input_file

sed -e ':1' -e 's/\([[:upper:]][[:lower:]]*\)\([[:upper:]]\|$\)/\11\2/' -e 't1'
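Note that \| inside a basic regular expression is a GNU extension, so the one-liner above may not work with BSD/macOS sed. A minimal portable sketch, assuming every element symbol is one uppercase letter optionally followed by lowercase letters (the function name add_ones is made up for illustration):

```shell
# add_ones is a hypothetical helper name; it reads formulas on stdin.
add_ones() {
  # Loop: put a 1 between an element symbol and a directly following element.
  # Afterwards: put a 1 after an element symbol that ends the line.
  sed -e ':a' \
      -e 's/\([A-Z][a-z]*\)\([A-Z]\)/\11\2/' \
      -e 'ta' \
      -e 's/\([A-Z][a-z]*\)$/\11/'
}

printf '%s\n' NH3O CH4 CHN C2NOPH3 | add_ones
```

The alternation is simply split into two substitutions: one that loops for the "letter follows" case and one that handles the end of the line.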

Related

Interval expressions in gawk to awk

I hope this is an easy fix
I originally wrote a clean and easy script that used gawk, because that is what I found when I was solving the original issue. I now need to adapt it to use only awk.
sample file.fasta:
>gene1
>gene235
ATGCTTAGATTTACAATTCAGAAATTCCTGGTCTATTAACCCTCCTTCACTTTTCACTTTTCCCTAACCCTTCAAAATTTTATATCCAATCTTCTCACCCTCTACAATAATACATTTATTATCCTCTTACTTCAAAATTTTT
>gene335
ATGCTCCTTCTTAATCTAAACCTTCAAAATTTTCCCCCTCACATTTATCCATTATCACCTTCATTTCGGAATCCTTAACTAAATACAATCATCAACCATCTTTTAACATAACTTCTTCAAAATTTTACCAACTTACTATTGCTTCAAAATTTTTCAT
>gene406
ATGTACCACACACCCCCATCTTCCATTTTCCCTTTATTCTCCTCACCTCTACAATCCCCTTAATTCCTCTTCAAAATTTTTGGAGCCCTTAACTTTCAATAACTTCAAAATTTTTCACCATACCAATAATATCCCTCTTCAAAATTTTCCACACTCACCAAC
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
What I know works in awk is the following:
awk '/[ACTG]GG/{print a; print}{a=$0}' file.fasta >"species_precrispr".fasta
The culprit is therefore the interval expression {21,}.
What I want is to match each line that contains at least 21 nucleotides to the left of my "GG" match.
Can anyone help?
Edit:
Thanks for all the help:
There are various solutions that worked. To reply to some of the comments a more basic example of the initial output and the desired effect achieved...
Prior to awk command:
cat file1.fasta
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene2
ATGGGTGCCTTAACTTTCAATAACTG
>gene3
ATGTCAAAATTTTTCATTTCAAT
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Following codes all produced the same desired output:
original code
gawk '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
A slight modification that enables interval expressions in gawk 3.x.x
awk --re-interval '/[ACTG]{21,}GG/{print a; print}{a=$0}' file1.fasta
Allows the count to be changed through a variable and gives the correct output; untested, but should work with older versions of awk
awk -v usr_count="21" '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>= usr_count){print id ORS $0};id=""}' file1.fasta
awk --re-interval '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq } /^>/{name=$0; seq=""; next} {seq = seq $0 } END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq }' file1.fasta
Desired output: only the gene names and sequences of those sequences that have at least 21 nucleotides before a matching GG
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG
Lastly just to show the discarded lines
>gene2
ATG-GG-TGCCTTAACTTTCAATAACTG # only 3 nt prior to any GG combo
>gene3
ATGTCAAAATTTTTCATTTCAAT # No GG match found
>gene4
ATCCTTTTTTTTGGGTCAAAATTAAA # only 14 nt prior to any GG combo
Hope this helps others!
EDIT: As per the OP's comment, the gene IDs need to be printed too; in that case try the following.
awk '
/gene/{
id=$0
next
}
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print id ORS $0
}
id=""
}
' Input_file
OR one-liner form of above solution as per OP's request:
awk '/gene/{id=$0;next} match($0,/.*GG/){val=substr($0,RSTART,RLENGTH-2);if(gsub(/[ACTG]/,"&",val)>=21){print id ORS $0};id=""}' Input_file
Could you please try the following, written and tested with the shown samples only.
awk '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=21){
print
}
}
' Input_file
OR a more generic approach, with a variable in which the user sets how many nucleotides must be present before GG.
awk -v usr_count="21" '
match($0,/.*GG/){
val=substr($0,RSTART,RLENGTH-2)
if(gsub(/[ACTG]/,"&",val)>=usr_count){
print
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/.*GG/){ ##Using Match function to match everything till GG in current line.
val=substr($0,RSTART,RLENGTH-2) ##Storing sub-string of current line from RSTART till RLENGTH-2 into variable val here.
if(gsub(/[ACTG]/,"&",val)>=21){ ##Checking condition if global substitution of ACTG(with same value) is greater or equal to 21 then do following.
print ##Printing current line then.
}
}
' Input_file ##Mentioning Input_file name here.
GNU awk has accepted interval expressions in regular expressions since version 3.0. However, they are enabled by default only from version 4.0 onwards. If you have gawk 3.x.x, you have to use the --re-interval flag to enable them.
awk --re-interval '/a{3,6}/{print}' file
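If neither a recent gawk nor --re-interval is available, one way around interval expressions is to build the repeated part of the regex as a string. A sketch that should run in any POSIX awk; the function name filter_min_prefix and the variable n are illustrative and not taken from the question:

```shell
# filter_min_prefix N FILE: print every line that has at least N nucleotides
# directly before a GG, preceded by the previous line (the header).
filter_min_prefix() {
  awk -v n="$1" '
    BEGIN {
      for (i = 0; i < n; i++) re = re "[ACTG]"   # n copies, same as [ACTG]{n}
      re = re "GG"                                # unanchored, so "n or more" is implied
    }
    $0 ~ re { print prev; print }
    { prev = $0 }
  ' "$2"
}

# Usage: filter_min_prefix 21 file1.fasta
```

Because the regex is not anchored, matching exactly n nucleotides before GG accepts the same lines as {n,} would.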
There is an issue that people often overlook when processing FASTA files with awk: when you have multi-line sequences, a match may span multiple lines. To handle this, you need to join the lines of each sequence first.
The easiest way to process FASTA files with awk is to build up a variable called name and a variable called seq. Every time you have read a full sequence, you can process it. Note that, for best results, the sequence should be stored as one continuous string that does not contain any newlines or whitespace. A generic awk for processing FASTA looks like this:
awk '/^>/ && seq { process_sequence_here }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { process_sequence_here }' file.fasta
In the presented case, your sequence processing looks like:
awk '/^>/ && seq { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq }
/^>/{name=$0; seq=""; next}
{seq = seq $0 }
END { if (match(seq,"[ACTG]{21,}GG")) print name ORS seq }' file.fasta
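The joining step can also be done on its own, so that any line-based command can be applied afterwards. A small sketch under that assumption, which should work in any POSIX awk (linearize_fasta is a hypothetical name):

```shell
# linearize_fasta [FILE...]: print each header followed by its sequence
# joined onto a single line.
linearize_fasta() {
  awk '
    /^>/ { if (seq != "") print seq; print; seq = ""; next }
    { seq = seq $0 }                    # append wrapped sequence lines
    END { if (seq != "") print seq }
  ' "$@"
}

printf '%s\n' '>g1' 'ATGC' 'CTTA' '>g2' 'GG' | linearize_fasta
```

A header without any sequence lines is printed as-is, with nothing after it.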
Sounds like what you want is:
awk 'match($0,/[ACTG]+GG/) && RLENGTH>22{print a; print} {a=$0}' file
but this is probably all you need given the sample input you provided:
awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
They'll both work in any awk.
Using your updated sample input:
$ awk 'match($0,/.*GG/) && RLENGTH>22{print a; print} {a=$0}' file
>gene1
ATGCCTTAACTTTCAATAACTGG
>gene5
ATGCCTTAACTTTCAATAACTTTTTAAAATTTTTGG

Find and append to Text Between Two Strings or Words using sed or awk

I am looking for a sed in which I can recognize all of the text in between two indicators and then replace it with a place holder.
For instance, the 1st indicator is a list of words
(no|noone|haven't)
and the 2nd indicator is a list of punctuation
(.|,|!)
From an input text such as
"Noone understands the plot. There is no storyline. I haven't
recommended this movie to my friends! Did you understand it?"
The desired result would be:
"Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I
haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX
friends_AFFIX! Did you understand it?"
I know that there is the following sed:
sed -n '/WORD1/,/WORD2/p' /path/to/file
which recognizes the content between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.
I have also considered using awk, such as
awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile
yet still, it does not allow me to append the affix.
Does anyone have a suggestion to do so, either with awk or sed?
A slightly more compact awk (gensub requires GNU awk):
$ awk 'BEGIN{RS=ORS=" ";s="_AFFIX"}
/[.,!]$/{if(f)$0=gensub(/(.)$/,s"\\1",1); f=0}
f{$0=$0s}
/Noone|no|haven'\''t/{f=1}1' story
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
Perl to the rescue!
perl -pe 's/(?:no(?:one)?|haven'\''t)\s*\K([^.,!]+)/
join " ", map "${_}_AFFIX", split " ", $1/egi
' infile > outfile
\K discards everything matched so far from the reported match, so the text on its left must be present but is not touched by the replacement. In this case, it verifies the 1st indicator. (\K needs Perl 5.10+.)
/e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map adds _AFFIX to each of the members, and join joins them back into a string.
Here is one verbose awk command for the same (IGNORECASE and \y are GNU awk features):
s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"
awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
a=0
for (i=2; i<=NF; i++) {
if ($(i-1) ~ "\\y(" kw ")\\y")
a=1
if (a && $i ~ "(" pct ")$") {
p = substr($i, length($i), 1)
$i = substr($i, 1, length($i)-1)
}
if (a)
$i=$i "_AFFIX" p
if(p) {
p=""
a=0
}
}
} 1' <<< "$s"
Output:
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
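For awks without gensub, IGNORECASE or \y, the same idea can be sketched with plain POSIX features. This is only an illustration: the function name add_affix is made up, the indicator words and punctuation are hard-coded, and an indicator must be a whole whitespace-separated word.

```shell
# add_affix: hypothetical helper; reads text on stdin.
add_affix() {
  awk '{
    for (i = 1; i <= NF; i++) {
      w = $i; p = ""
      if (w ~ /[.,!]$/) {                 # split off the closing punctuation
        p = substr(w, length(w))
        w = substr(w, 1, length(w) - 1)
      }
      if (on) $i = w "_AFFIX" p           # inside a region: tag the word
      if (tolower(w) ~ /^(no|noone|haven'\''t)$/) on = 1
      if (p != "") on = 0                 # punctuation closes the region
    }
    print
  }'
}
```

The flag on is deliberately not reset at the start of a line, so a region that starts on one line continues on the next one.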

Replace number of specified characters

I have something like this:
aaaaaaaaaaaaaaaaaaaaaaaaa
I need something that will allow me to replace a with another character like c from left to right according to the specified number.
For example:
some_command 3 should replace the first 3 a with c
cccaaaaaaaaaaaaaaaaaaaaaa
some_command 15 should replace the first 15 a with c
cccccccccccccccaaaaaaaaaa
This can be done entirely in bash:
some_command() {
a="aaaaaaaaaaaaaaaaaaaaaaaaa"
c="ccccccccccccccccccccccccc"
echo "${c:0:$1}${a:$1}"
}
> some_command 3
cccaaaaaaaaaaaaaaaaaaaaaa
Using awk:
s='aaaaaaaaaaaaaaaaaaaaaaaaa'
awk -F "\0" -v n=3 -v r='c' '{for (i=1; i<=n; i++) $i=r}1' OFS= <<< "$s"
cccaaaaaaaaaaaaaaaaaaaaaa
This might work for you (GNU sed):
sed -r ':a;/a/{x;/^X{5}$/{x;b};s/$/X/;x;s/a/c/;ba}' file
This will replace the first 5 a's with c throughout the file.
sed -r ':a;/a/{x;/^X{5}$/{z;x;b};s/$/X/;x;s/a/c/;ba}' file
This will replace the first 5 a's with c for each line throughout the file.
#!/bin/bash
char=c
word=aaaaaaaaaaaaaaaaaaaaaaaaa
# pass in the number of chars to replace
replaceChar () {
num=$1
newword=""
# this for loop to concatenate the chars could probably be optimized
for i in $(seq 1 $num); do newword="${newword}${char}"; done
word="${newword}${word:$num}"
echo "$word"
}
replaceChar 4
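The concatenation loop can be avoided by letting printf produce the padding. A sketch for bash only (it relies on printf -v and substring expansion); replace_first_n is a hypothetical name, and CHAR is assumed to be a single ordinary character:

```shell
# replace_first_n STRING N CHAR: replace the first N characters of STRING with CHAR.
replace_first_n() {
  local s=$1 n=$2 r=$3 pad
  if [ "$n" -gt "${#s}" ]; then n=${#s}; fi   # never grow the string
  printf -v pad '%*s' "$n" ''                 # pad = N spaces
  printf '%s\n' "${pad// /$r}${s:n}"          # spaces become CHAR, the tail is kept
}

replace_first_n aaaaaa 3 c
```

Like the other bash answers, this replaces by position and does not check what the first N characters actually are.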
A more general solution than the OP asked for, building on @anubhava's excellent answer.
Parameterizes the replacement count as well as the "before and after" chars.
The "before" char is matched anywhere - not just at the beginning of the input string, and whether adjacent to other instances or not.
Input is taken from stdin, so multiple lines can be piped in.
# Usage:
# ... | some_command_x replaceCount beforeChar afterChar
some_command_x() {
awk -F '\0' -v n="$1" -v o="${2:0:1}" -v r="${3:0:1}" -v OFS='' \
'{
while(++i <= NF)
{ if ($i==o) { if (++n_matched > n) break; $i=r } }
{ i=n_matched=0; print }
}'
}
# Example:
some_command_x 2 a c <<<$'abc_abc_abc\naaa rating'
# Returns:
cbc_cbc_abc
cca rating
Perl has some interesting features that can be exploited. Define the following bash script some_command:
#! /bin/bash
str="aaaaaaaaaaaaaaaaaaaaaaaaa"
perl -s -nE'print s/(a{$x})/"c" x length $1/er' -- -x=$1 <<<"$str"
Testing:
$ some_command 5
cccccaaaaaaaaaaaaaaaaaaaa

Using awk to replace only once the delimiter between $4 and $5

I'm really new to the awk language and I'm surprised by all the power it has (I discovered it today). I have a file in this format:
test1;test2;test3;test4;test5;test6;test7
I need to output this in a new file and get this as result:
test1;test2;test3;test4 test5;test6;test7
Basically add ' ' between $4 and $5. I know there are a lot of questions about this, but I'm not able to do what I want.
I was testing this code that I found somewhere here:
for (i=1;i<=3;i++)
printf "%s;", $i
n = split($0,tmp,/ +/)
for (i=6;i>=8;i++)
printf ";%s", tmp[n-i]
print ""
}
But I get as output something like:
test1;test2;test3;test4;test5;test6;;;
Can you please tell me what I'm doing wrong? And is there another simple method, like a one-liner in awk, to do this?
Thank you in advance
I'd go with sed on this one. On the basis that the issue is "replace the 4th ; character" you can use this:
sed 's/;/ /4' inputfile
That'll output the result to stdout, or you can use "sed -i" to do it in place.
See http://www.gnu.org/software/sed/manual/html_node/The-_0022s_0022-Command.html
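A quick way to try the numeric flag on the sample line from the question (join_4_5 is just an illustrative name):

```shell
# join_4_5: reads ;-separated lines on stdin.
join_4_5() {
  sed 's/;/ /4'   # the flag 4 replaces only the 4th match on each line
}

printf '%s\n' 'test1;test2;test3;test4;test5;test6;test7' | join_4_5
```

This should print test1;test2;test3;test4 test5;test6;test7, leaving every other separator alone.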
Try this :
echo "test1;test2;test3;test4;test5;test6;test7" | awk -F';' '{
for (i=1; i<=NF; i++)
printf "%s%s", $i, (i == NF) ? "\n" : ((i == 4) ? " " : ";")
}'
Output
test1;test2;test3;test4 test5;test6;test7
If you're using GNU awk you could use a similar solution to what chooban suggested:
awk '{ $0=gensub(";", " ", "4") } 1'
The 1 at the end invokes the default block: { print $0 }.

Sed error: 'unescaped newline inside substitute pattern' and 'bad flag in substitute command'

Question:
I am new to sed. When I execute the following code in the shell:
sed -e "/^\s<key>CHANNEL_NAME<\/key>$/{N;s/\(^\s<string>\).+\(<\/string>$\)/\1test\2}" Info.plist > test.plist
sed gives me an error: "sed: 1: "/^\sCHANNEL_NAME<\ ...": unescaped newline inside substitute pattern"
My Question: What does "unescaped newline inside substitute pattern" exactly mean?
The Info.plist file is like this:
...
<key>CHANNEL_NAME</key>
<string>App Store</string>
...
I would appreciate it if anyone could answer the question, thanks!
Answer:
Thanks @potong @dogbane @Beta ! : )
Because it is a Cocoa plist, here's my final solution:
sed '/<key>CHANNEL_NAME<\/key>/{N;s/\(<string>\).*\(<\/string>\)/\1test\2/;}' Info.plist > test.plist
Tips:
I got two errors while solving the problem. I put them here:
sed: 1: "/^\sCHANNEL_NAME<\ ...": unescaped newline inside substitute pattern
sed: 1: "/CHANNEL_NAME</ke ...": bad flag in substitute command: '}'
I made several mistakes in my first attempt:
I hadn't escaped the '+'.
The script should end with 2/}" (the closing '/' of the s command was missing, which caused the first error).
Actually it should end with 2/;}" (I missed the ';', which caused the second error above).
Using 'n' or 'N' both work for me.
Probably because I am on a Mac, '.+' does not work even when escaped, so I had to change it to '..*' as @potong said.
Any good advice to improve the code is welcome. Thanks again to everyone who answered below!
This might work for you (GNU sed):
sed -e '/^\s*<key>CHANNEL_NAME<\/key>$/{n;s/^\(\s*<string>\).\+\(<\/string>\)$/\1test\2/}' Info.plist > test.plist
N.B. You should allow for whitespace (^\s*) at the beginning of a line and print the matched line before comparing the start of the next line for the substitution command i.e. use n instead of N.
Or:
sed -e '/^ *<key>CHANNEL_NAME<\/key>$/!b' -e 'n' -e 's/^\( *<string>\)..*\(<\/string>\)$/\1test\2/' Info.plist > test.plist
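For reuse, the substitution can be wrapped in a small function. A sketch that avoids \s and \+ and puts the braces in their own -e expressions so it does not depend on GNU sed; set_channel is a hypothetical name, and the new value is assumed not to contain / or &:

```shell
# set_channel VALUE: reads the plist on stdin and writes the result to stdout.
set_channel() {
  sed -e '/<key>CHANNEL_NAME<\/key>/{' \
      -e 'n' \
      -e "s/\(<string>\).*\(<\/string>\)/\1$1\2/" \
      -e '}'
}

# Usage: set_channel test < Info.plist > test.plist
```

Every s command here has all three delimiters, and the closing } sits in its own expression, so neither of the two errors from the question can occur.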
Since you said you're just learning sed: sed is an excellent tool for simple substitutions on a single line but for anything else just use awk.
Here's a GNU awk solution (you can cram it onto one line if you like):
$ cat file
...
foo
<key>CHANNEL_NAME</key>
<string>App Store</string>
...
$
$ awk '
found { $0=gensub(/(<string>).*(<\/string>)/,"\\1test\\2",1); found=0 }
/<key>CHANNEL_NAME<\/key>/ { found=1 }
{ print $0 }
' file
...
foo
<key>CHANNEL_NAME</key>
<string>test</string>
...
It doesn't LOOK much different from the sed solution, but just try modifying the sed solution to do anything additional e.g. add line numbers to the output:
$ awk '
found { $0=gensub(/(<string>).*(<\/string>)/,"\\1test\\2",1); found=0 }
/<key>CHANNEL_NAME<\/key>/ { found=1 }
{ print NR, $0 }
' file
1 ...
2 foo
3 <key>CHANNEL_NAME</key>
4 <string>test</string>
5 ...
or replace the text between "string"s with the contents of the line before CHANNEL_NAME instead of the hard-coded "test":
awk '
found { $0=gensub(/(<string>).*(<\/string>)/,"\\1" rep "\\2",1); found=0 }
/<key>CHANNEL_NAME<\/key>/ { found=1; rep=prev }
{ print $0; prev=$0 }
' file
...
foo
<key>CHANNEL_NAME</key>
<string>foo</string>
...
and you'll find you need a whole other solution, probably involving a nightmarish concoction of single letters and punctuation marks, whereas with awk it's a simple tweak to enhance your starting solution.
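Since gensub is specific to GNU awk, here is a sketch of the same idea with plain sub(), which should work in other awks too (for example the one shipped with macOS). The function name set_channel_awk is made up, and the value is assumed not to contain & or backslashes:

```shell
# set_channel_awk VALUE: reads the plist on stdin and writes the result to stdout.
set_channel_awk() {
  awk -v val="$1" '
    found { sub(/<string>.*<\/string>/, "<string>" val "</string>"); found = 0 }
    /<key>CHANNEL_NAME<\/key>/ { found = 1 }
    { print }
  '
}

# Usage: set_channel_awk test < Info.plist > test.plist
```

Only the line directly after the CHANNEL_NAME key is touched; other string elements pass through unchanged.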