Regexp for removing certain columns - regex

I have an input of this format:
<apple1> <orange1> : <apple2> <orange2> : <apple3> <orange3> : ...
This input is of undefined length and consists of apple-orange pairs with varying orange and apple parts, separated by a colon.
I'd like to have this as an output:
<apple1> <orange1> : <orange2> : <orange3> : ...
I.e., all apple parts except the first are removed.
Each apple part is 14 characters wide, each orange part is 19 characters wide.
I tried things like this:
sed -r 's/.{14}(.{19}):/\1:/g'
But this always ran into problems skipping the first apple part.
Can anybody provide a regexp solving this task?
Real world example input:
appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo
foofoofoofoofobarbarbarbarbarbarb:foofoofoofoofobarbarbarbarbarbarb:foofoofoofoofobarbarbarbarbarbarb
xxxxxxxxxxxxxxooooooooooooooooooo:ppppppppppppppqqqqqqqqqqqqqqqqqqq:nnnnnnnnnnnnnnttttttttttttttttttt
Output should be this:
appleappleapplorangeorangeorangeo:orangeorangeorangeo:orangeorangeorangeo
foofoofoofoofobarbarbarbarbarbarb:barbarbarbarbarbarb:barbarbarbarbarbarb
xxxxxxxxxxxxxxooooooooooooooooooo:qqqqqqqqqqqqqqqqqqq:ttttttttttttttttttt

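Your sed regex was almost correct. Anchor the match on the colon instead: match a colon, then 14 characters, then 19 characters, over and over, and drop the 14-character part. The first apple part has no colon in front of it, so it is never touched. (Note: I use commas as regex delimiters below because they're easier to read.)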
$ export A='appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo:foofoofoofoofobarbarbarbarbarbarb:foofoofoofoofobarbarbarbarbarbarb:foofoofoofoofobarbarbarbarbarbarb:xxxxxxxxxxxxxxooooooooooooooooooo:ppppppppppppppqqqqqqqqqqqqqqqqqqq:nnnnnnnnnnnnnnttttttttttttttttttt'
$ echo $A | sed -Ee 's,:.{14}(.{19}),:\1,g'
appleappleapplorangeorangeorangeo:orangeorangeorangeo:orangeorangeorangeo:barbarbarbarbarbarb:barbarbarbarbarbarb:barbarbarbarbarbarb:ooooooooooooooooooo:qqqqqqqqqqqqqqqqqqq:ttttttttttttttttttt

This job is better suited to awk, because the input is well structured in rows and columns with a known delimiter (the colon):
awk 'BEGIN{FS=OFS=":"} {for (i=2; i<=NF; i++) $i = substr($i, 15)} 1' file
appleappleapplorangeorangeorangeo:orangeorangeorangeo:orangeorangeorangeo
foofoofoofoofobarbarbarbarbarbarb:barbarbarbarbarbarb:barbarbarbarbarbarb
xxxxxxxxxxxxxxooooooooooooooooooo:qqqqqqqqqqqqqqqqqqq:ttttttttttttttttttt
This awk command uses : as both the input and output delimiter and, starting from the 2nd field of each record, replaces each field with the substring of that field beginning at position 15.
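The same field-by-field idea can be sketched in Python. This is only an illustration under the widths given in the question; the function name `strip_repeated_apples` and its parameters are made up for this example.

```python
def strip_repeated_apples(line: str, apple_width: int = 14, sep: str = ":") -> str:
    """Keep the first field whole; drop the leading apple part of every later field."""
    first, *rest = line.split(sep)
    # Slicing by position works even though the apple text differs per field.
    return sep.join([first] + [field[apple_width:] for field in rest])


if __name__ == "__main__":
    # Build a sample with 14-character apple parts and 19-character orange parts.
    sample = ":".join(a * 14 + o * 19 for a, o in [("x", "o"), ("p", "q"), ("n", "t")])
    print(strip_repeated_apples(sample))
```

Slicing by width rather than matching the text of the first apple part matters here, because each pair may carry a different apple part.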

With perl..
Our Input: appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo
Let's assume
a=appleappleappl (14 characters)
b=orangeorangeorangeo (19 characters)
c=appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo (rest of the line, which is a repeating combination of a and b).
Expected output: Before the first colon (:), both a and b are kept and after the first colon, only b is kept.
${a}${b}:${b}:${b}:.... (please correct me if I am wrong)
So here it is once again, to recap, both the input and output.
Our Input: appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo
Expected Output: appleappleapplorangeorangeorangeo:orangeorangeorangeo:orangeorangeorangeo
Please try this script: (As mentioned earlier, this is using perl and not shell).
%_Host#User> cat apple.pl
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
chomp $_ ;
my @tmp = split /:/, $_ ;
my ($a,$b) = (substr($tmp[0],0,14), substr($tmp[0],14,19)) ;
my $str = "$a"."$b" ;
foreach my $i (1..$#tmp) {
# Drop the 14-character apple part by position; apple parts vary from field to field
$tmp[$i] = substr($tmp[$i], 14) ;
$str .= ":"."$tmp[$i]" ;
}
print "$str\n" ;
}
%_Host#User>
Script Output:
%_Host#User> cat td_apple |./apple.pl
appleappleapplorangeorangeorangeo:orangeorangeorangeo:orangeorangeorangeo
foofoofoofoofobarbarbarbarbarbarb:barbarbarbarbarbarb:barbarbarbarbarbarb
xxxxxxxxxxxxxxooooooooooooooooooo:qqqqqqqqqqqqqqqqqqq:ttttttttttttttttttt
Sample Data:
%_Host#User> cat td_apple
appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo:appleappleapplorangeorangeorangeo
foofoofoofoofobarbarbarbarbarbarb:foofoofoofoofobarbarbarbarbarbarb:foofoofoofoofobarbarbarbarbarbarb
xxxxxxxxxxxxxxooooooooooooooooooo:ppppppppppppppqqqqqqqqqqqqqqqqqqq:nnnnnnnnnnnnnnttttttttttttttttttt
%_Host#User>
Thanks.

Related

regex replace and add hyphen before first zero

Input1: RC000030034
Replace1: RC-000030034
Input2: RC100003282
Replace2: RC1-00003282
Looking to add a hyphen before the first 0 in the string.
Input will always have 11 characters.
Final output will always be 12 characters.
Never will have alpha characters after the hyphen.
Based on this : "Looking to add a hyphen before the first 0 in the string."
example in javascript (the matched 0 has to be put back in the replacement):
> var re = /^([^0]*)0(.*)$/
> "RC000030034".replace(re,"$1-0$2")
'RC-000030034'
in bash (echo + sed):
$ echo 'RC000030034' | sed -e 's/^\([^0]*\)0\(.*\)$/\1-0\2/'
RC-000030034
output = input.replace("0", "-0")
or equivalent code in your language
Most languages provide some kind of replace method for strings. The example above works in JavaScript, where replace() with a string pattern replaces only the first occurrence of '0' with '-0'.
In Python, replace() replaces all occurrences by default, so you must pass the optional third argument to limit the number of replacements:
output = input.replace("0", "-0", 1)
In perl, you could use a regular expression:
$input =~ s/0/-0/;
Putting all your strings in a file (each string on a separate line) you can use the following shell command:
perl -ne 's/0/-0/; print' inputfile
Now, suppose your input allows for strings like RC0A00001234, where the hyphen should be before the second 0, behind the A (because we 'Never will have alpha characters after the hyphen').
Then the command has to change into:
perl -ne 's/(0\d*)$/-$1/; print' inputfile
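For comparison, the last Perl substitution can be sketched in Python with `re.sub`. The function name `hyphenate` is hypothetical; the pattern is the one from the Perl command above, and it assumes the part after the hyphen is always a run of digits starting with 0 and reaching the end of the string.

```python
import re


def hyphenate(code: str) -> str:
    """Insert a hyphen before the trailing all-digit run that begins with 0."""
    # The search is leftmost-first, so the earliest 0 that is followed
    # only by digits up to the end of the string is where the hyphen goes.
    return re.sub(r"(0\d*)$", r"-\1", code, count=1)


if __name__ == "__main__":
    for sample in ("RC000030034", "RC100003282", "RC0A00001234"):
        print(sample, "->", hyphenate(sample))
```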

How to use sed or awk to replace string in csv file

Sorry for a really basic question. How to replace a particular column in a csv file with some string?
e.g.
id, day_model,night_model
===========================
1 , ,
2 ,2_DAY ,
3 ,3_DAY ,3_NIGHT
4 , ,
(4 rows)
I want to replace any string in the column 2 and column 3 to true
others would be false, but not the 1,2 row and end row.
Output:
id, day_model,night_model
===========================
1 ,false ,false
2 ,true ,false
3 ,true ,true
4 ,false ,false
(4 rows)
What I tried is the following sample code( Only trying to replace the string to "true" in column 3):
#awk -F, '$3!=""{$3="true"}' OFS=, file.csv > out.csv
But the out.csv is empty. Please give me some direction.
Many thanks!!
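Your command prints nothing because the action block only assigns to $3 and never prints; awk prints a record by default only when a pattern has no action block. In addition, since your field separator is a comma, the "empty" fields may contain spaces, particularly the 2nd field, so they might not equal the empty string.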
I would do this:
awk -F, -v OFS=, '
# ex
NR>2 && !/^\([0-9]+ rows\)/ {
for (i=2; i<=NF; i++)
$i = ($i ~ /[^[:blank:]]/) ? "true" : "false"
}
{ print }
' file
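The logic of the awk script above can also be sketched in Python. This is an illustration only: the function name `to_flags` is made up, lines are assumed to have their trailing newline already stripped, and the first two lines and the "(N rows)" footer are passed through unchanged, as in the awk version.

```python
import re

# Footer such as "(4 rows)"; the pattern mirrors the one used in the awk script.
FOOTER = re.compile(r"^\(\d+ rows\)")


def to_flags(lines):
    """Replace every column after the first with 'true' or 'false'."""
    out = []
    for lineno, line in enumerate(lines, start=1):
        if lineno <= 2 or FOOTER.match(line):
            out.append(line)  # header, separator, and footer stay as they are
            continue
        first, *rest = line.split(",")
        # A field counts as filled if anything is left after stripping whitespace.
        flags = ["true" if field.strip() else "false" for field in rest]
        out.append(",".join([first] + flags))
    return out


if __name__ == "__main__":
    rows = ["id, day_model,night_model", "=====", "1 , ,", "2 ,2_DAY ,", "(2 rows)"]
    print("\n".join(to_flags(rows)))
```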
Since you tagged the question with sed and there are only three columns, here is a solution in four steps, because a single regex replacement cannot cover every case.
Because the 2nd and 3rd columns may contain blanks, I wrote one sed substitution for each kind of row. These use GNU sed with -E; sed has no \d, so [0-9] is used instead.
sed -E 's/^([0-9]+\s+,)\S+\s*,\S+\s*$/\1true,true/' file.txt
This will replace rows like 3 ,3_DAY ,3_NIGHT
sed -E 's/^([0-9]+\s+,)\S+\s*,\s*$/\1true,false/' file.txt
This will replace rows like 2 ,2_DAY ,
sed -E 's/^([0-9]+\s+,)\s*,\S+\s*$/\1false,true/' file.txt
This will replace rows like 5 , ,2_Day
sed -E 's/^([0-9]+\s+,)\s*,\s*$/\1false,false/' file.txt
This will replace rows like 1 , ,

Regex for Replacing String with Incrementing Number

Once upon a time I used UltraEdit for making this
text 1
text 2
text 3
text 4
...
from that
text REPLACE
text REPLACE
text REPLACE
text REPLACE
...
it was as easy as replace REPLACE with \i
now, how can I do this with eg. sed?
If you provide a solution, could you please add directions for filling the result with leading zeros?
thanks
You can use awk instead of sed for this:
awk '{ sub(/REPLACE/, ++i) } 1' file
text 1
text 2
text 3
text 4
...
Is this what you want?
$ awk '{$NF=sprintf("%05d",++i)} 1' file
text 00001
text 00002
text 00003
text 00004
If not, edit your question to show some more truly representative sample input and expected output (and get rid of the ...s if they don't literally exist in your input and output as they make your example non-testable).
perl -i -pe 's/REPLACE/++$i/ge' file
For zero-padding to the minimum width (i.e. if there are 10 replacements, use field width 2):
perl -i -pe '
BEGIN {
$patt = "REPLACE";
chomp( $n = qx(grep -co "$patt" file) );
$n = int( log($n)/log(10) ) + 1;
}
s/$patt/ sprintf("%0*d", $n, ++$i) /ge
' file
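The same two-pass idea (count first, then substitute with a padded counter) can be sketched in Python. The function name `number_placeholders` is hypothetical, and unlike the `grep -c` call above, which counts matching lines, this sketch counts individual occurrences of the placeholder.

```python
import re


def number_placeholders(text: str, placeholder: str = "REPLACE") -> str:
    """Replace each placeholder with 1, 2, 3, ... zero-padded to a common width."""
    pattern = re.compile(re.escape(placeholder))
    total = len(pattern.findall(text))
    width = len(str(total))  # minimum width that fits the largest number
    counter = 0

    def next_number(match):
        nonlocal counter
        counter += 1
        return f"{counter:0{width}d}"

    return pattern.sub(next_number, text)


if __name__ == "__main__":
    print(number_placeholders("\n".join(["text REPLACE"] * 10)))
```

Taking the width from the length of the decimal string avoids the floating-point logarithm used in the Perl version.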

Edit line names with a new name containing an incremented value

This seems like a simple task to me but getting it to work easily is ending up more difficult than I thought:
I have a fasta file containing several million lines of text (only a few hundred individual sequence entries) and these sequence names are long, I want to replace all characters after the header > with Contig $n, where $n is an integer starting at 1 and is incremented for each replacement.
an example input sequence name:
>NODE:345643RD:Cov_456:GC47:34thgd
ATGTCGATGCGT
>NODE...
ATGCGCTTACAC
Which I then want to output like this
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
so maybe a Perl script? I know some basics but I'd like to read in a file and then output the new file with the changes, and I'm unsure of the best way to do this? I've seen some Perl one liner examples but none did what I wanted.
$n = 1
if {
s/>.*/(Contig)++$n/e
++$n
}
$ awk '/^>/{$0=">Contig "++n} 1' file
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
Try something like this:
#!/usr/bin/perl -w
use strict;
open (my $fh, '<','example.txt') or die "Cannot open example.txt: $!";
open (my $fh1, '>','example2.txt') or die "Cannot open example2.txt: $!";
my $n = 1;
# For each line of the input file
while(<$fh>) {
# Try to update the name, if successful, increment $n
if ($_ =~ s/^>.*/>Contig $n/) { $n++; }
print $fh1 $_;
}
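When you use the /e modifier, Perl evaluates the replacement part as a Perl expression, so it has to be valid Perl code. Try something like this (with $n starting at 0, because ++$n increments before the value is used):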
s/>.*/">Contig " . ++$n/e
perl -i -pe 's/>.*/">Contig " . ++$c/e;' file.txt
Output:
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
I'm no awk expert (far from it), but I solved this out of curiosity, and because sed has no variables, which limits what it can do here.
One possible gawk solution could be
awk -v n=1 '/^>/{print ">Contig " n; n++; next}1' <file
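For completeness, the same renaming can be sketched in Python. The generator name `rename_headers` and its `prefix` parameter are made up for this example, and lines are assumed to arrive without their trailing newline.

```python
def rename_headers(lines, prefix="Contig"):
    """Yield the lines with every FASTA header replaced by '>prefix N'."""
    n = 0
    for line in lines:
        if line.startswith(">"):
            n += 1  # count only header lines, so numbering starts at 1
            yield f">{prefix} {n}"
        else:
            yield line  # sequence data passes through untouched


if __name__ == "__main__":
    sample = [">NODE:345643RD:Cov_456:GC47:34thgd", "ATGTCGATGCGT", ">NODE...", "ATGCGCTTACAC"]
    print("\n".join(rename_headers(sample)))
```

Because it is a generator, it can be fed a file handle line by line (after stripping newlines), so a file with several million lines does not need to be held in memory.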

Cut and copy-paste given positions of the text

My dummy text file (one continuous line) looks like this:
AAChvhkfiAFAjjfkqAPPMB
I want to:
Delete part of the text (specific range);
Copy-Paste (specific range of characters) within the file.
How I am doing this:
To cut part of the text at wanted positions (from 5 to 7 characters & from 10 to 14 characters) I use cut
echo 'AAChvhkfiAFAjjfkqAPPMB' | cut --complement -c 5-7,10-14
AAChfifkqAPPMB
But I really don't know how to copy-paste text. For example: to copy text from 15 to 18 characters and paste it after character 1 (also using previous cut command). To get the final result like this:
fkqAAAChfifkqAPPMB
So I have two questions:
How to read text (from .. to) given range using perl, awk or sed & paste this text at specific position.
How to combine this text pasting with the previous cut command as after cutting text will move to the left side, hence wrong text will be copied.
Maybe something like this:
$ echo AAChvhkfiAFAjjfkqAPPMB | awk '{ print(substr($1, 1, 14) substr($1, 18) substr($1, 15, 3)) }'
AAChvhkfiAFAjjAPPMBfkq
In Perl I think substr would be a good candidate, try eg.
$a = '1234567890';
#from pos 2, replace 3 chars with nothing, return the 3 chars
$b=substr($a,2,3,'');
print "$a\t$b\n"; #1267890 345
#in position 0 (first), replace 0 characters (ie pure insert)
#with the content of $b
substr($a,0,0,$b);
print "$a\t$b\n"; #3451267890 345
See http://perldoc.perl.org/functions/substr.html for more details.
splice() may be a candidate as well.
In Perl, you can use an array slice, by splitting the string into an array:
my $string = "AAChvhkfiAFAjjfkqAPPMB1";
my @arr = split //, $string;
and slicing (print elements 5 to 7 and 10 to 14; note that array indices are 0-based):
print @arr[5..7,10..14];
You can use splice() too to rearrange the array.
perldoc says of splice():
Removes the elements designated by OFFSET and LENGTH from an array, and replaces them with the elements of LIST, if any.
See http://perldoc.perl.org/perldata.html#Slices
quite straightforward with awk:
kent$ echo "AAChvhkfiAFAjjfkqAPPMB"|awk '
{for(i=5;i<=7;i++)$i="";
for(i=10;i<=14;i++)$i="";
for(i=15;i<=18;i++)t=sprintf("%s%s",t,$i);
$0=t""$0}1' OFS="" FS=""
fkqAAAChfifkqAPPMB
edit
to reverse the part of text, you just need to swap t and $i:
kent$ echo "AAChvhkfiAFAjjfkqAPPMB"|awk '
{for(i=5;i<=7;i++)$i="";
for(i=10;i<=14;i++)$i="";
for(i=15;i<=18;i++)t=sprintf("%s%s",$i,t);
$0=t""$0}1' OFS="" FS=""
AqkfAAChfifkqAPPMB
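The cut-then-paste task from the question can also be sketched with Python slicing. The function name `cut_and_paste` and its parameters are hypothetical; all ranges are 1-based and inclusive, as with cut -c, and they refer to positions in the original text, which sidesteps the problem of characters shifting left after the cut.

```python
def cut_and_paste(text, cut_ranges, copy_range):
    """Delete cut_ranges from text and prepend the characters in copy_range.

    All ranges are 1-based, inclusive, and refer to the ORIGINAL text.
    """
    start, end = copy_range
    copied = text[start - 1:end]  # take the copy before anything is removed

    drop = set()
    for lo, hi in cut_ranges:
        drop.update(range(lo, hi + 1))

    kept = "".join(ch for pos, ch in enumerate(text, start=1) if pos not in drop)
    return copied + kept


if __name__ == "__main__":
    print(cut_and_paste("AAChvhkfiAFAjjfkqAPPMB", [(5, 7), (10, 14)], (15, 18)))
```

Reversing the copied part, as in the second awk command, would only need `copied[::-1]` in place of `copied`.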