Loop through global substitution - regex

$num = 6;
$str = "1 2 3 4";
while ($str =~ s/\d/$num/g)
{
print $str, "\n";
$num++;
}
Is it possible to do something like the above in perl? I would like the loop to run only 4 times and to finish with $str being 6 7 8 9.

You don't need the loop: the /g modifier causes the substitution to be repeated as many times as it matches. What you want is the /e modifier to compute the substitution. Assuming the effect you were after is to add 5 to each number, the example code is as follows.
$str = "1 2 3 4";
$str =~ s/(\d)/$1+5/eg;
print "$str\n";
If you really wanted to substitute numbers starting with 6, then this works.
$num = 6;
$str = "1 2 3 4";
$str =~ s/\d/$num++/eg;
print "$str\n";

You could with the match operator.
my $new_str = '';
while ($str =~ /\G(.*?)\d/gc) {
$new_str .= $1 . $num++;
}
$new_str .= substr($str, pos($str));
However, #TFBW's solution is probably what you want unless you're writing a tokenizer for a big parser. In a tokenizer, it would look more like:
# So we don't have to say «$str =~ all» over the place.
# Also, allows us to use redo.
for ($str) {
/\G \s+ /xgc; # Skip whitespace.
if (/\G (\d+) /xgc) {
# Do something with $1.
redo;
}
...
last if /\G \z /xgc;
die "Unrecognized token";
}

Related

Counting number of pattern matches in Perl

I am VERY new to perl, and to programming in general.
I have been searching for the past couple of days on how to count the number of pattern matches; I have had a hard time understanding others solutions and applying them to the code I have already written.
Basically, I have a sequence and I need to find all the patterns that match [TC]C[CT]GGAAGC
I believe I have that part down. but I am stuck on counting the number of occurrences of each pattern match. Does anyone know how to edit the code I already have to do this? Any advice is welcomed. Thanks!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# open fasta file for reading
unless( open( FASTA, "<", '/scratch/Drosophila/dmel-all-chromosome- r6.02.fasta' )) {
die "Can't open dmel-all-chromosome-r6.02.fasta for reading:", $!;
}
#split the fasta record
local $/ = ">";
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
}
}
}
Update, I have added in
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
In the code below. However, it seems to be outputting 0 in front of each pattern match, instead of outputting the total sum of all pattern matches.
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
}
}
}
Edit: I need the output to list ever pattern match found. I also need it to find the total number of matches found. For example:
CCTGGAAGC
TCTGGAAGC
TCCGGAAGC
3 matches found
counting the number of occurrences of each pattern match
my #matches = $string =~ /pattern/g
#matches array will contain all the matched parts. You can then do below to get the count.
print scalar #matches
Or you could directly write
my $matches = () = $string =~ /pattern/
I would suggest you to use the former as you might need to check "what was matched" in future (perhaps for debugging?).
Example 1:
use strict;
use warnings;
my $string = 'John Doe John Done';
my $matches = () = $string =~ /John/g;
print $matches; #prints 2
Example 2:
use strict;
use warnings;
my $string = 'John Doe John Done';
my #matches = $string =~ /John/g;
print "#matches"; #prints John John
print scalar #matches; #prints 2
Edit:
while ( my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
print "Count of matches:". scalar #matches;
}
As you have written the code, you have to count the matches yourself:
local $/ = ">";
my $count = 0;
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
$count = $count +1;
}
}
}
print "Fount $count matches\n";
should do the job.
HTH Georg
my #count = ($seq =~ /([TC]C[CT]GGAAGC)/g);
print scalar #count ;

Counting occurrences of a word in a string in Perl

I am trying to find out the number of occurrences of "The/the". Below is the code I tried"
print ("Enter the String.\n");
$inputline = <STDIN>;
chop($inputline);
$regex="\[Tt\]he";
if($inputline ne "")
{
#splitarr= split(/$regex/,$inputline);
}
$scalar=#splitarr;
print $scalar;
The string is :
Hello the how are you the wanna work on the project but i the u the
The
The output that it gives is 7. However with the string :
Hello the how are you the wanna work on the project but i the u the
the output is 5. I suspect my regex. Can anyone help in pointing out what's wrong.
I get the correct number - 6 - for the first string
However your method is wrong, because if you count the number of pieces you get by splitting on the regex pattern it will give you different values depending on whether the word appears at the beginning of the string. You should also put word boundaries \b into your regular expression to prevent the regex from matching something like theory
Also, it is unnecessary to escape the square brackets, and you can use the /i modifier to do a case-independent match
Try something like this instead
use strict;
use warnings;
print 'Enter the String: ';
my $inputline = <>;
chomp $inputline;
my $regex = 'the';
if ( $inputline ne '' ) {
my #matches = $inputline =~ /\b$regex\b/gi;
print scalar #matches, " occurrences\n";
}
With split, you're counting the substrings between the the's. Use match instead:
#!/usr/bin/perl
use warnings;
use strict;
my $regex = qr/[Tt]he/;
for my $string ('Hello the how are you the wanna work on the project but i the u the The',
'Hello the how are you the wanna work on the project but i the u the',
'the theological cathedral'
) {
my $count = () = $string =~ /$regex/g;
print $count, "\n";
my #between = split /$regex/, $string;
print 0 + #between, "\n";
print join '|', #between;
print "\n";
}
Note that both methods return the same number for the two inputs you mentioned (and the first one returns 6, not 7).
The following snippet uses a code side-effect to increment a counter, followed by an always-failing match to keep searching. It produces the correct answer for matches that overlap (e.g. "aaaa" contains "aa" 3 times, not 2). The split-based answers don't get that right.
my $i;
my $string;
$i = 0;
$string = "aaaa";
$string =~ /aa(?{$i++})(?!)/;
print "'$string' contains /aa/ x $i (should be 3)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the The";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 6)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 5)\n";
What you need is 'countof' operator to count the number of matches:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/[Tt]he/g;
print $count;
If you want to select only the word the or The, add word boundary:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/\b[Tt]he\b/g;
print $count;

Perl regex issue , last character truncated

Have a strange issue with my regex.
My regex is truncating the last character , in the example below it should return the value 32 but it is instead returning 3.
Note that the value could be up to 10 digits!!!
$word = "thisisit=";
$line = "hello thisisit=32 byefornow ";
if ($line =~ m/$word(.*?)\d /)
{
print $1; #returns 3 instead of 32
}
Thanks.
You can do:
if ($line =~ /$word(\d+)/) # This will capture all numbers after your $word
{
print $1;
}
You can also refine to:
if ($line =~ /$word\s*(\d+)/) # In case you're having like "thisisit= 32 byefornow"
Or, to capture everything and stop after first white space:
if ($line =~ /$word(.+?)\s/)
{
print $1;
}
You should ask for it to return zero or any number of digits:
($line =~ m/$word(.*?)\d* /)
At least one digit: \d+
Two digits: \d{2}
I'm not sure what you are looking for here, in terms of the specs.

Perl regex replace count

Is it possible to specify the maximum number of matches to replace. For instance if matching 'l' in "Hello World", would it be possible to replace the first 2 'l' characters, but not the third without looping?
$str = "Hello world!";
$str =~ s/l/r/ for (1,2);
print $str;
I don't see what's so bad about looping.
Actually, here's a way:
$str="Hello world!";
$str =~ s/l/$i++ >= 2 ? "l": "r"/eg;
print $str;
It's a loop, of sorts, since s///g works in a loopy way when you do this. But not a traditional loop.
Here is one way. This requires an external counter to be updated within the RE using a (?{code}) block inside of a (?(condition)true-sub-expression|false-sub-expression) construct. See perldoc perlre for an explanation.
use Modern::Perl;
use re qw/eval/; # Considered experimental.
my $string = 'Hello world!';
my $count = 2;
my $re = qr/
(l)
(?(?{$count--})|(*FAIL))
/x;
say "Looking for $count instances of 'l' in $string.";
my ( #found ) = $string =~ m/$re/g;
say "Found ", scalar #found, " instances of 'l': #found";
The output is:
Looking for 2 instances of 'l' in Hello world!
Found 2 instances of 'l': l l
Here's another test of the same regexp, but this time we're keeping track of the position of the matches just to prove it's matching the first two occurrences.
use Modern::Perl;
use strict;
use warnings;
use re qw/eval/; # Considered experimental.
my $string = 'Hello world!';
my $count = 2;
my $position = 0;
my $re = qr/
(l)(?{$position=pos})
(?(?{$count--})|(*FAIL))
/x;
while( $string =~ m/$re/g ) {
say "Found $1 at ", $position;
}
And this time the output is:
Found l at 3
Found l at 4
I don't think I would recommend any of this. If I were considering constraining matches to only one portion of a string, I would match against a substr() of the string. But if you like to live on the edge, go ahead and have fun with this snippet.
Here it is in a substitution:
use Modern::Perl;
use strict;
use warnings;
use re qw/eval/; # Considered experimental.
my $string = 'Hello world!';
say "Before substitution $string";
my $count = 2;
my $re = qr/
(l)
(?(?{$count--})|(*FAIL))
/x;
$string =~ s/$re/L/g;
say "After substitution $string";
And the output:
Before substitution Hello world!
After substitution HeLLo world!
Short answer: no. You will need to perform the substitutions in a loop of some kind.

How can I count the amount of spaces at the start of a string in Perl?

How can I count the amount of spaces at the start of a string in Perl?
I now have:
$temp = rtrim($line[0]);
$count = ($temp =~ tr/^ //);
But that gives me the count of all spaces.
$str =~ /^(\s*)/;
my $count = length( $1 );
If you just want actual spaces (instead of whitespace), then that would be:
$str =~ /^( *)/;
Edit: The reason why tr doesn't work is it's not a regular expression operator. What you're doing with $count = ( $temp =~ tr/^ // ); is replacing all instances of ^ and with itself (see comment below by cjm), then counting up how many replacements you've done. tr doesn't see ^ as "hey this is the beginning of the string pseudo-character" it sees it as "hey this is a ^".
You can get the offset of a match using #-. If you search for a non-whitespace character, this will be the number of whitespace characters at the start of the string:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
my $count = $s =~ /\S/ ? $-[0] : length $s;
print "'$s' has $count whitespace characters at its start\n";
}
Or, even better, use #+ to find the end of the whitespace:
#!/usr/bin/perl
use strict;
use warnings;
for my $s ("foo bar", " foo bar", " foo bar", " ") {
$s =~ /^\s*/;
print "$+[0] '$s'\n";
}
Here's a script that does this for every line of stdin. The relevant snippet of code is the first in the body of the loop.
#!/usr/bin/perl
while ($x = <>) {
$s = length(($x =~ m/^( +)/)[0]);
print $s, ":", $x, "\n";
}
tr/// is not a regex operator. However, you can use s///:
use strict; use warnings;
my $t = (my $s = " \t\n sdklsdjfkl");
my $n = 0;
++$n while $s =~ s{^\s}{};
print "$n \\s characters were removed from \$s\n";
$n = ( $t =~ s{^(\s*)}{} ) && length $1;
print "$n \\s characters were removed from \$t\n";
Since the regexp matcher returns the parenthesed matches when called in a list context, CanSpice's answer can be written in a single statement:
$count = length( ($line[0] =~ /^( *)/)[0] );
This prints amount of white space
echo " hello" |perl -lane 's/^(\s+)(.*)+$/length($1)/e; print'
3