I have a list like:
DD2aaQQmmm
AA34DDmm
And i want to print the upper case, the lower case and the numbers separately, so it would be like:
DDQQ aammm 2
AADD mm 34
How should I do this in perl, using regex?
I have tried for the upper case this:
#!/usr/bin/perl
my #array = <>;
chomp(#array);
foreach (#array){
if ($_ = ~ /([A-Z][a-z])/){
print $_ "\n"
}
}
But this only prints out the words starts with upper case.
The task is quite simple -- separate the wheat from the chaff
use strict;
use warnings;
use feature 'say';
while(my $line = <DATA>) {
my #out;
$out[0] = $line =~ s/[^A-Z]//gr;
$out[1] = $line =~ s/[^a-z]//gr;
$out[2] = $line =~ s/[^0-9]//gr;
say join(' ', #out);
}
__DATA__
DD2aaQQmmm
AA34DDmm
Output
DDQQ aammm 2
AADD mm 34
Of cause the final script can read from pipe or file given on command line
use strict;
use warnings;
use feature 'say';
while(<>) {
my #out;
$out[0] = s/[^A-Z]//gr;
$out[1] = s/[^a-z]//gr;
$out[2] = s/[^0-9]//gr;
say join(' ', #out);
}
Another way to go is to use tr operator.
use Modern::Perl;
while(my $str = <DATA>) {
chomp $str;
my #out;
push #out, $str =~ tr/A-Z//cdr;
push #out, $str =~ tr/a-z//cdr;
push #out, $str =~ tr/0-9//cdr;
say "#out";
}
__DATA__
DD2aaQQmmm
AA34DDmm
Output:
DDQQ aammm 2
AADD mm 34
tr is about 6 times quicker than s///, here is a benchmarck:
use Modern::Perl;
use Benchmark qw(:all);
my $str = "DD2aaQQmmm";
my $count = -3;
cmpthese($count, {
'tr' => sub {
my #out;
push #out, $str =~ tr/A-Z//cdr;
push #out, $str =~ tr/a-z//cdr;
push #out, $str =~ tr/0-9//cdr;
},
'subst' => sub {
my #out;
$out[0] = $str =~ s/[^A-Z]//gr;
$out[1] = $str =~ s/[^a-z]//gr;
$out[2] = $str =~ s/[^0-9]//gr;
},
});
Output:
Rate subst tr
subst 58165/s -- -84%
tr 357629/s 515% --
Polar Bears solution is absolutely correct - but needs perl 5.14+ Here is a simple workaround for older perl versions.
use strict;
use warnings;
while(my $line = <DATA>) {
my #out = ($line, $line, $line);
$out[0] =~ s/[^A-Z]//g;
$out[1] =~ s/[^a-z]//g;
$out[2] =~ s/[^0-9]//g;
print join(' ', #out) . "\n";
}
__DATA__
DD2aaQQmmm
AA34DDmm
Related
I am VERY new to perl, and to programming in general.
I have been searching for the past couple of days on how to count the number of pattern matches; I have had a hard time understanding others solutions and applying them to the code I have already written.
Basically, I have a sequence and I need to find all the patterns that match [TC]C[CT]GGAAGC
I believe I have that part down. but I am stuck on counting the number of occurrences of each pattern match. Does anyone know how to edit the code I already have to do this? Any advice is welcomed. Thanks!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# open fasta file for reading
unless( open( FASTA, "<", '/scratch/Drosophila/dmel-all-chromosome- r6.02.fasta' )) {
die "Can't open dmel-all-chromosome-r6.02.fasta for reading:", $!;
}
#split the fasta record
local $/ = ">";
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
}
}
}
Update, I have added in
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
In the code below. However, it seems to be outputting 0 in front of each pattern match, instead of outputting the total sum of all pattern matches.
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
}
}
}
Edit: I need the output to list ever pattern match found. I also need it to find the total number of matches found. For example:
CCTGGAAGC
TCTGGAAGC
TCCGGAAGC
3 matches found
counting the number of occurrences of each pattern match
my #matches = $string =~ /pattern/g
#matches array will contain all the matched parts. You can then do below to get the count.
print scalar #matches
Or you could directly write
my $matches = () = $string =~ /pattern/
I would suggest you to use the former as you might need to check "what was matched" in future (perhaps for debugging?).
Example 1:
use strict;
use warnings;
my $string = 'John Doe John Done';
my $matches = () = $string =~ /John/g;
print $matches; #prints 2
Example 2:
use strict;
use warnings;
my $string = 'John Doe John Done';
my #matches = $string =~ /John/g;
print "#matches"; #prints John John
print scalar #matches; #prints 2
Edit:
while ( my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
print "Count of matches:". scalar #matches;
}
As you have written the code, you have to count the matches yourself:
local $/ = ">";
my $count = 0;
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
$count = $count +1;
}
}
}
print "Fount $count matches\n";
should do the job.
HTH Georg
my #count = ($seq =~ /([TC]C[CT]GGAAGC)/g);
print scalar #count ;
I have a problem I cannot understand. I have this string:
gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
And I want to change the gene_id. So, I have the following code:
if ($line =~ /;transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}
The aim of my code is to change siRNA_Z27kG1_20543 for siRNA_Z27kG1_20543_X_1. However, my code does not produce that output. Why? I can't understand that.
My regex needs to be as it is because I match other strings (this time with success).
#!/usr/bin/perl
use strict;
use warnings;
my $string = q{gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"};
if($string =~ m|transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)"|){
my $replace_with = qq{gene_id "$1$2"};
$string =~ s/gene_id (\"\w+\")/$replace_with/g;
}
print "$string";
Output: gene_id "siRNA_Z27kG1_20543_X_1"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
Demo
Remove the semicolon at the start of the pattern as it is not present in the string :-
if ($line =~ /transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}
The problem:
Find pieces of text in a file enclosed by # and replace the inside
Input:
#abc# abc #ABC#
cba #cba CBA#
Deisred output:
абц abc АБЦ
cba цба ЦБА
I have the following:
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
my $output;
open FILE,"<", 'test.txt';
while (<FILE>) {
chomp(my #chars = split(//, $_));
for (#chars) {
my #char;
$_ =~ s/a/chr(0x430)/eg;
$_ =~ s/b/chr(0x431)/eg;
$_ =~ s/c/chr(0x446)/eg;
$_ =~ s/d/chr(0x434)/eg;
$_ =~ s/e/chr(0x435)/eg;
$_ =~ s/A/chr(0x410)/eg;
$_ =~ s/B/chr(0x411)/eg;
$_ =~ s/C/chr(0x426)/eg;
push #char, $_;
$output = join "", #char;
print encode("utf-8",$output);}
print "\n";
}
close FILE;
But I'm stuck on how to process further
Thanks for help in advance!
Kluther
Here my solution. (you will fixed it, yes. It is prototype)
for (my $data = <DATA>){
$data=~s/[#]([\s\w]+)[#]/func($1)/ge;
print $data;
# while($data=~m/[#]([\s\w]+)[#]/g){
# print "marked: ",$1,"\n";
# print "position:", pos();
# }
# print "not marked: ";
}
sub func{
#do your magic here ;)
return "<< #_ >>";
}
__DATA__
#abc# abc #ABC# cba #cba CBA#
What happens here?
First, I read data. You can do it yourself.
for (my $data = <DATA>){...}
Next, I need to search your pattern and replace it.
What should I do?
Use substition operator: s/pattern/replace/
But in interesting form:
s/pattern/func($1)/ge
Key g mean Global Search
Key e mean Evaluate
So, I think, that you need to write your own func function ;)
Maybe better to use transliteration operator: tr/listOfSymbolsToBeReplaced/listOfSymbolsThatBePlacedInstead/
With minimal changes to your algorithm you need to keep track of whether you are inside the #marks or not. so add something like this
my $bConvert = 0;
chomp(my #chars = split(//, $_));
for (#chars) {
my $char = $_;
if (/#/) {
$bConvert = ($bConvert + 1) % 2;
next;
}
elsif ($bConvert) {
$char =~ s/a/chr(0x430)/eg;
$char =~ s/b/chr(0x431)/eg;
$char =~ s/c/chr(0x446)/eg;
$char =~ s/d/chr(0x434)/eg;
$char =~ s/e/chr(0x435)/eg;
$char =~ s/A/chr(0x410)/eg;
$char =~ s/B/chr(0x411)/eg;
$char =~ s/C/chr(0x426)/eg;
}
print encode("utf-8",$char);
}
Try this after $output is processed.
$output =~ s/\#//g;
my #split_output = split(//, $output);
$output = "";
my $len = scalar(#split_output) ;
while ($len--) {
$output .= shift(#split_output);
}
print $output;
It can be done with a single regex and no splitting of the string:
use strict;
use warnings;
use Encode;
my %chars = (
a => chr(0x430),
b => chr(0x431),
c => chr(0x446),
d => chr(0x434),
e => chr(0x435),
A => chr(0x410),
B => chr(0x411),
C => chr(0x426),
);
my $regex = '(' . join ('|', keys %chars) . ')';
while (<DATA>) {
1 while ($_ =~ s|\#(?!\s)[^#]*?\K$regex(?=[^#]*(?!\s)\#)|$chars{$1}|eg);
print encode("utf-8",$_);
}
It does require repeated runs of the regex due to the overlapping nature of the matches.
Given a url the following regular expression is able insert/substitute in words at certain points in the urls.
Code:
#!/usr/bin/perl
use strict;
use warnings;
#use diagnostics;
my #insert_words = qw/HELLO GOODBYE/;
my $word = 0;
my $match;
while (<DATA>) {
chomp;
foreach my $word (#insert_words)
{
my $repeat = 1;
while ((my $match=$_) =~ s|(?<![/])(?:[/](?![/])[^/]*){$repeat}[^/]*\K|$word|)
{
print "$match\n";
$repeat++;
}
print "\n";
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
10.15.16.17/dog/cat/rabbit/
The output given (for the first example url in __DATA__ with the HELLO word):
http://www.stackoverflow.com/dogHELLO/cat/rabbit/
http://www.stackoverflow.com/dog/catHELLO/rabbit/
http://www.stackoverflow.com/dog/cat/rabbitHELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
Where I am now stuck:
I would now like to alter the regular expression so that the output will look like what is shown below:
http://www.stackoverflow.com/dogHELLO/cat/rabbit/
http://www.stackoverflow.com/dog/catHELLO/rabbit/
http://www.stackoverflow.com/dog/cat/rabbitHELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
#above is what it already does at the moment
#below is what i also want it to be able to do as well
http://www.stackoverflow.com/HELLOdog/cat/rabbit/ #<-puts the word at the start of the string
http://www.stackoverflow.com/dog/HELLOcat/rabbit/
http://www.stackoverflow.com/dog/cat/HELLOrabbit/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
http://www.stackoverflow.com/HELLO/cat/rabbit/ #<- now also replaces the string with the word
http://www.stackoverflow.com/dog/HELLO/rabbit/
http://www.stackoverflow.com/dog/cat/HELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
But I am having trouble getting it to automatically do this within the one regular expression.
Any help with this matter would be highly appreciated, many thanks
One solution:
use strict;
use warnings;
use URI qw( );
my #insert_words = qw( HELLO );
while (<DATA>) {
chomp;
my $url = URI->new($_);
my $path = $url->path();
for (#insert_words) {
# Use package vars to communicate with /(?{})/ blocks.
local our $insert_word = $_;
local our #paths;
$path =~ m{
^(.*/)([^/]*)((?:/.*)?)\z
(?{
push #paths, "$1$insert_word$2$3";
if (length($2)) {
push #paths, "$1$insert_word$3";
push #paths, "$1$2$insert_word$3";
}
})
(?!)
}x;
for (#paths) {
$url->path($_);
print "$url\n";
}
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
http://10.15.16.17/dog/cat/rabbit/
Without crazy regexes:
use strict;
use warnings;
use URI qw( );
my #insert_words = qw( HELLO );
while (<DATA>) {
chomp;
my $url = URI->new($_);
my $path = $url->path();
for my $insert_word (#insert_words) {
my #parts = $path =~ m{/([^/]*)}g;
my #paths;
for my $part_idx (0..$#parts) {
my $orig_part = $parts[$part_idx];
local $parts[$part_idx];
{
$parts[$part_idx] = $insert_word . $orig_part;
push #paths, join '', map "/$_", #parts;
}
if (length($orig_part)) {
{
$parts[$part_idx] = $insert_word;
push #paths, join '', map "/$_", #parts;
}
{
$parts[$part_idx] = $orig_part . $insert_word;
push #paths, join '', map "/$_", #parts;
}
}
}
for (#paths) {
$url->path($_);
print "$url\n";
}
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
http://10.15.16.17/dog/cat/rabbit/
one more solution:
#!/usr/bin/perl
use strict;
use warnings;
my #insert_words = qw/HELLO GOODBYE/;
while (<DATA>) {
chomp;
/(?<![\/])(?:[\/](?![\/])[^\/]*)/p;
my $begin_part = ${^PREMATCH};
my $tail = ${^MATCH} . ${^POSTMATCH};
my #tail_chunks = split /\//, $tail;
foreach my $word (#insert_words) {
for my $index (1..$#tail_chunks) {
my #new_tail = #tail_chunks;
$new_tail[$index] = $word . $tail_chunks[$index];
my $str = $begin_part . join "/", #new_tail;
print $str, "\n";
$new_tail[$index] = $tail_chunks[$index] . $word;
$str = $begin_part . join "/", #new_tail;
print $str, "\n";
}
print "\n";
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
10.15.16.17/dog/cat/rabbit/
I am splitting sentences at individual space characters, and then matching these terms against keys of hashes. I am getting matches only if the terms are 100% similar, and I am struggling to find a perfect regex that could match several occurrences of the same word. Eg. Let us consider I have a term 'antagon' now it perfectly matches with the term 'antagon' but fails to match with antagonists, antagonistic or pre-antagonistic, hydro-antagonist etc. Also I need a regex to match occurrences of words like MCF-7 with MCF7 or MC-F7 silencing the effect of special characters and so on.
This is the code that I have till now; thr commented part is where I am struggling.
(Note: Terms in the hash are stemmed to root form of a word).
use warnings;
use strict;
use Drug;
use Stop;
open IN, "sample.txt" or die "cannot find sample";
open OUT, ">sample1.txt" or die "cannot find sample";
while (<IN>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/,/ , /g;
$string =~ s/\./ \. /g;
$string =~ s/;/ ; /g;
$string =~ s/\(/ ( /g;
$string =~ s/\)/ )/g;
$string =~ s/\:/ : /g;
$string =~ s/\::/ :: )/g;
my #array = split / /, $string;
foreach my $word (#array) {
chomp $word;
if ( $word =~ /\,|\;|\.|\(|\)/g ) {
push( #full, $word );
}
if ( $Stop_words{$word} ) {
push( #full, $word );
}
if ( $Values{$word} ) {
my $term = "<Drug>$word<\/Drug>";
push( #full, $term );
}
else {
push( #full, $word );
}
# if($word=~/.*\Q$Values{$word}\E/i)#Changed this
# {
# $term="<Drug>$word</$Drug>";
# print $term,"\n";
# push(#full,$term);
# }
}
}
my $mod_str = join( " ", #full );
print OUT $mod_str, "\n";
}
I need a regex to match occurances of words like MCF-7 with MCF7 or
MC-F7
The most straightforward approach is just to strip out the hyphenss i.e.
my $ignore_these = "[-_']"
$word =~ s{$ignore_these}{}g;
I am not sure what is stored in your Value hash, so its hard to tell what you expect to happen
if($word=~/.*\Q$Values{$word}\E/i)
However, the kind of thing I imagin you want is (simplified your code somewhat)
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use 5.10.0;
use Data::Dumper;
while (<>) {
chomp $_;
my $flag = 0;
my $line = lc $_;
my #full = ();
if ( $line =~ /<Sentence.*>(.*)<\/Sentence>/i ) {
my $string = $1;
chomp $string;
$string =~ s/([,\.;\(\)\:])/ $1 /g; # squished these together
$string =~ s/\:\:/ :: )/g; # typo in original
my #array = split /\s+/, $string; # split on one /or more/ spaces
foreach my $word (#array) {
chomp $word;
my $term=$word;
my $word_chars = "[\\w\\-_']";
my $word_part = "antagon";
if ($word =~ m{$word_chars*?$word_part$word_chars+}) {
$term="<Drug>$word</Drug>";
}
push(#full,$term); # push
}
}
my $mod_str = join( " ", #full );
say "<Sentence>$mod_str</Sentence>";
}
This gives me the following output, which is my best guess at what you expect:
$ cat tmp.txt
<Sentence>This in antagonizing the antagonist's antagonism pre-antagonistically.</Sentence>
$ cat tmp.txt | perl x.pl
<Sentence>this in <Drug>antagonizing</Drug> the <Drug>antagonist's</Drug> <Drug>antagonism</Drug> <Drug>pre-antagonistically</Drug> .</Sentence>
$
perl -ne '$things{$1}++while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//;END{print "$_\n" for sort keys %things}' FILENAME
If the file contains the following:
he was an antagonist
antagonize is a verb
why are you antagonizing her?
this is an alpha-antagonist
This will return:
alpha-antagonist
antagonist
antagonize
antagonizing
Below is the a regular (not one-liner) version:
#!/usr/bin/perl
use warnings;
use strict;
open my $in, "<", "sample.txt" or die "could not open sample.txt for reading!";
open my $out, ">", "sample1.txt" or die "could not open sample1.txt for writing!";
my %things;
while (<$in>){
$things{$1}++ while s/([^ ;.,!?]*?antagon[^ ;.,!?]++)//
}
print $out "$_\n" for sort keys %things;
You may want to take another look at your assumptions on your approach. What it sounds like to me is that you are looking for words which are within a certain distance of a list of words. Take a look at the Levenshtein distance formula to see if this is something you want. Be aware, however, that computing this might take exponential time.