Weird behaviour of the global g regex flag [duplicate] - regex

my $test = "There was once an\n ugly ducking";
if ($test =~ m/ugly/g) {
    if ($test =~ m/here/g) {
        print 'Match';
    }
}
Results in no output, but
my $test = "There was once an\n ugly ducking";
if ($test =~ m/here/g) {
    if ($test =~ m/ugly/g) {
        print 'Match';
    }
}
results in Match!
If I remove the g flag from the regex, then the second internal test matches whichever way around the matches appear in $test. I can't find a reference to why this is so.

Yes. That behaviour is documented in the perlop man page. Using m/.../ with the g flag in scalar context makes each match start where the previous successful match on that string ended.
In scalar context, each execution of "m//g" finds the next match, returning true if it matches, and false if there is no further match. The position after the last match can be read or set using the "pos()" function; see "pos" in perlfunc. A failed match normally resets the search position
to the beginning of the string, but you can avoid that by adding the "/c" modifier (e.g. "m//gc"). Modifying the target string also resets the search position.
So, in the first case there is no here after ugly, because the second match starts searching where the first one ended. In the second case, here is matched first (inside There), and ugly is then found later in the string.
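The effect can be made visible with pos(). The following is a minimal sketch using the string from the question; the variable names other than $test are made up for illustration.

```perl
use strict;
use warnings;

my $test = "There was once an\n ugly ducking";

# A successful scalar-context /g match leaves pos() just past the match.
my $ugly_found     = ($test =~ m/ugly/g) ? 1 : 0;
my $pos_after_ugly = pos($test);

# The next /g match on the same string starts from that offset,
# so "here" (near the start of the string) can no longer be found.
my $here_found = ($test =~ m/here/g) ? 1 : 0;

# The failed match reset the position, so pos() is undefined again.
my $pos_after_fail = pos($test);

# Without /g the match always starts at the beginning of the string.
my $here_without_g = ($test =~ m/here/) ? 1 : 0;

printf "pos after 'ugly': %d\n", $pos_after_ugly;
printf "here found with /g: %d\n", $here_found;
printf "pos after failed match: %s\n",
    defined $pos_after_fail ? $pos_after_fail : 'undef';
printf "here found without /g: %d\n", $here_without_g;
```

Dropping /g from the tests, or assigning pos($test) = 0 between them, makes the order of the two tests irrelevant.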

Related

Perl Regex Find and Return Every Possible Match

I'm trying to create a while loop that will find every possible substring within a string. But so far all I can match is either the longest instance or the shortest. So for example I have the string
EDIT CHANGE STRING FOR DEMO PURPOSES
"A.....B.....B......B......B......B"
And I want to find every possible sequence of "A.......B"
This code will give me the shortest possible return and exit the while loop
while ($string =~ m/(A(.*?)B)/gi) {
    print "found\n";
    my $substr = $1;
    print $substr."\n";
}
And this will give me the longest and exit the while loop.
$string =~ m/(A(.*)B)/gi
But I want it to loop through the string returning every possible match. Does anyone know if Perl allows for this?
EDIT ADDED DESIRED OUTPUT BELOW
found
A.....B
found
A.....B.....B
found
A.....B.....B......B
found
A.....B.....B......B......B
found
A.....B.....B......B......B......B
There are various ways to parse the string to extract what you want.
For example, use a regex to step through all A...A substrings and process each capture:
use warnings;
use strict;
use feature 'say';
my $s = "A.....B.....B......B......B......B";
while ($s =~ m/(A.*)(?=A|$)/gi) {
    my @seqs = split /(B)/, $1;
    for my $i (0..$#seqs) {
        say @seqs[0..$i] if $i % 2 != 0;
    }
}
The (?=A|$) is a lookahead, so the pattern matches up to an A (or the end of the string) without consuming that A, which is therefore still there for the next match. Note that .* is greedy: if the string contains several A's it runs to the end of the string, so a non-greedy .*? is needed to stop at the next A. The split uses () in the separator pattern so that the separator, too, is returned (so we have all those B's). It only prints for an even number of elements, so only substrings ending with the separator (B here).
The above prints
A.....B
A.....B.....B
A.....B.....B......B
A.....B.....B......B......B
A.....B.....B......B......B......B
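For input with more than one A, a non-greedy variant of the same idea can be sketched as follows. The sample string and the @found array are made up for illustration; this is one possible approach, not the only one.

```perl
use strict;
use warnings;

# Hypothetical input with two A-runs.
my $s = "A..B...BA.B";
my @found;

# .*? stops at the first position followed by an A or by the end of the string.
while ($s =~ /(A.*?)(?=A|$)/g) {
    # Capturing parentheses in the separator keep the B's in the result list.
    my @parts = split /(B)/, $1;
    for my $i (0 .. $#parts) {
        # Odd indices are the B separators, so each slice ends in a B.
        push @found, join('', @parts[0 .. $i]) if $i % 2;
    }
}

print "$_\n" for @found;
```

Each A-run is processed independently, so prefixes never span from one A into the next.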
There may be bioinformatics modules that do this but I am not familiar with them.

What does s/(\W)/\\$1/g do in perl? [duplicate]

I went over a piece of code in which a subroutine takes a video filename as an argument and then prints its duration. Here I'm only showing the relevant snippet.
sub videoInfo {
    my $file = shift;
    $file =~ s/(\W)/\\$1/g;
}
So far I understand that it is dealing with whitespace, but I'm not able to work out what the code means: what is $1, and how does it work?
It puts a backslash in front of every non-word character, so something like "untitled file" becomes "untitled\ file".
As in most regular expression operations, $1 represents the first thing captured with (...), which in this case is the (\W) matching a single non-word character.
I think this is an unnecessary home-rolled version of quotemeta.
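A quick way to compare the two is to run both on the same input. This is a small sketch with a made-up filename; for plain ASCII input the two results should be identical.

```perl
use strict;
use warnings;

# Hypothetical filename containing spaces and other non-word characters.
my $file = 'untitled file (1).mp4';

# The substitution from the question: backslash-escape every non-word character.
(my $escaped = $file) =~ s/(\W)/\\$1/g;

# The built-in equivalent.
my $quoted = quotemeta $file;

print "$escaped\n";
print "$quoted\n";
```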

Perl Regular Expression - What does the gc modifier mean?

I have a regex which matches some text as:
$text =~ m/$regex/gcxs
Now I want to know what the 'gc' modifier means.
I have searched and found that gc means "Allow continued search after failed /g match".
This is not clear to me. What does continued search mean?
As far as I understand, it means that matching starts at the beginning again if the /g search fails. But doesn't the /g modifier match the whole string?
The /g modifier makes Perl remember the position in the string, so you can process a string incrementally, e.g.
my $txt = "abc3de";
while ( $txt =~ /\G[a-z]/g ) {
    print "$&";
}
while ( $txt =~ /\G./g ) {
    print "$&";
}
Because the position is reset on a failed match, the above will output
abcabc3de
The /c flag does not reset the position on a failed match. So if we add /c to the first regex like so
my $txt = "abc3de";
while ( $txt =~ /\G[a-z]/gc ) {
    print "$&";
}
while ( $txt =~ /\G./g ) {
    print "$&";
}
We end up with
abc3de
Sample code: http://ideone.com/cC9wb
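A common use of /gc together with \G is a small tokenizer that tries several patterns at the same position. The sketch below is illustrative only; the input string and token names are made up.

```perl
use strict;
use warnings;

my $input = "abc 123 def";
my @tokens;

pos($input) = 0;    # start scanning at the beginning
while (pos($input) < length $input) {
    # With /c, a failed alternative leaves pos() where it was,
    # so the next pattern is tried at the same offset.
    if    ($input =~ /\G\s+/gc)      { next }
    elsif ($input =~ /\G([a-z]+)/gc) { push @tokens, "WORD($1)" }
    elsif ($input =~ /\G(\d+)/gc)    { push @tokens, "NUM($1)" }
    else  { die "unexpected character at offset " . pos($input) }
}

print "@tokens\n";
```

Without the c flag, the first failing alternative would reset the position and the scan would start again from the beginning of the string.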
In the perldoc perlre http://perldoc.perl.org/perlre.html#Modifiers
Global matching, and keep the Current position after failed matching. Unlike i, m, s and x, these two flags affect the way the regex is used rather than the regex itself. See Using regular expressions in Perl in perlretut for further explanation of the g and c modifiers.
The specified ref leads to:
http://perldoc.perl.org/perlretut.html#Using-regular-expressions-in-Perl
This URI has a sub-section entitled, 'Global matching' which contains a small tutorial/working example, including:
A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the //c , as in /regexp/gc . The current position in the string is associated with the string, not the regexp. This means that different strings have different positions and their respective positions can be set or read independently.
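The point that the position belongs to the string rather than to the regexp can be checked with a short sketch; the variable names here are made up for illustration.

```perl
use strict;
use warnings;

my $re     = qr/a/;
my $first  = "aaa";
my $second = "aaa";

# Two successful /g matches on $first, one on $second, all with the same regexp.
my $hits = 0;
$hits++ if $first  =~ /$re/g;
$hits++ if $first  =~ /$re/g;
$hits++ if $second =~ /$re/g;

my $pos_first  = pos($first);
my $pos_second = pos($second);

# Positions can also be set independently per string.
pos($second) = 0;
my $pos_second_reset = pos($second);

print "first: $pos_first, second: $pos_second, second after reset: $pos_second_reset\n";
```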
HTH
Lee

2-step regular expression matching with a variable in Perl

I am looking to do a 2-step regular expression look-up in Perl, I have text that looks like this:
here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words
I also have a hash with the following values:
%my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
Here's what I need to do:
1. Find a number
2. Look up the text code for the number using my_hash
3. Check if the text code appears within 50 characters of the identified number, and if true print the result
So the output I'm looking for is:
Found 9337, matches 'AA'
Found 2214, matches 'BB'
Found 1190, no matches
Found 8790, no matches
Here's what I have so far:
while ( $text =~ /(\d+)(.{1,50})/g ) {
    $num = $1;
    $text_after_num = $2;
    $search_for = $my_hash{$num};
    if ( $text_after_num =~ /($search_for)/ ) {
        print "Found $num, matches $search_for\n";
    }
    else {
        print "Found $num, no matches\n";
    }
}
This sort of works, except that the only correct match is 9337; the code doesn't match 2214. I think the reason is that the regular expression match on 9337 is including 50 characters after the number for the second-step match, and then when the regex engine starts again it is starting from a point after the 2214. Is there an easy way to fix this? I think the \G modifier can help me here, but I don't quite see how.
Any suggestions or help would be great.
You have a problem with greediness. The {1,50} quantifier will consume as much as it can, so the match for one number swallows the numbers that follow it. Your regex should be /(\d+)(.+?)(?=($|\d))/
To explain, the question mark makes the quantifier non-greedy (it stops as soon as the rest of the pattern can match). The (?=...) is a lookahead that says "check that the next character is a digit, or that the end of the string has been reached, but do not consume it." This leaves the digit available to be picked up by the beginning of the regex on the next iteration.
[EDIT]
I added an optional end value to the lookahead so that it wouldn't die on the last match.
Just use :
/\b\d+\b/g
Why match everything if you don't need to? You should use other functions to determine where the number is :
/(?=9337.{1,50}AA)/
This will fail if AA is more than 50 characters past the end of 9337. Of course, you will have to interpolate your variables to match your hash's keys and values; this is just an example for your first key/value pair.
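Putting the pieces together, one possible sketch matches only the number and then inspects the next 50 characters with substr and pos, so no text is consumed beyond the number itself. It assumes "within 50 characters" means the 50 characters following the number; @report and $window are names made up for illustration.

```perl
use strict;
use warnings;

my $text = 'here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words';
my %my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
my @report;

while ( $text =~ /\b(\d+)\b/g ) {
    my $num  = $1;
    my $code = $my_hash{$num};

    # Look at up to 50 characters after the number; pos($text) is not moved.
    my $window = substr( $text, pos($text), 50 );

    if ( defined $code && $window =~ /\Q$code\E/ ) {
        push @report, "Found $num, matches '$code'";
    }
    else {
        push @report, "Found $num, no matches";
    }
}

print "$_\n" for @report;
```

Because only the digits are consumed, the next iteration resumes right after each number and 2214 is no longer skipped.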

how do you match two strings in two different variables using regular expressions?

$a='program';
$b='programming';
if ($b=~ /[$a]/){print "true";}
This is not working.
Thanks everyone, I was a little confused.
The [] in a regex denotes a character class, which matches any one of the characters listed inside it.
Your regex is equivalent to:
$b=~ /[program]/
which is true because the character p (among others) is found in $b.
Note that if the code prints the bareword true (without quotes), nothing is shown, because Perl treats a bareword after print as a filehandle; quote the string instead.
But if you want to check whether one string is present inside another, you have to drop the [..]:
if ($b=~ /$a/) { print 'true';}
If $a contains any regex metacharacters, the above match may not behave as intended. To fix that, place the variable between \Q and \E so that any metacharacters in it are escaped:
if ($b=~ /\Q$a\E/) { print 'true';}
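The difference shows up as soon as the search string contains a metacharacter. A small sketch with made-up values ($needle and $haystack are illustrative names):

```perl
use strict;
use warnings;

my $needle   = 'a.c';    # contains the metacharacter "."
my $haystack = 'abc';    # does not contain the literal text "a.c"

# Unquoted: "." matches any character, so this matches "abc".
my $unquoted = ( $haystack =~ /$needle/ ) ? 1 : 0;

# Quoted: \Q...\E escapes the ".", so only a literal "a.c" would match.
my $quoted = ( $haystack =~ /\Q$needle\E/ ) ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```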
Assuming either variable may come from external input, please quote the variables inside the regex:
if ($b=~ /\Q$a\E/){print "true";}
You then won't get burned when the pattern you're looking for contains "reserved characters" such as any of -[]{}().
(Apart from the missing semicolons:) Why do you put $a in square brackets? This makes it a list of possible characters. Try:
$b =~ /\Q${a}\E/
Update
To answer your remarks regarding = and =~:
=~ is the binding operator; it specifies the variable to which the regex is applied ($b in your example above). If you omit =~, Perl matches against $_ implicitly.
In list context, a match returns the list of captured groups, which you usually assign to a list, as in ($match1, $match2) = $b =~ /.../;. In scalar context, on the other hand, the match returns true (1) or false (the empty string) rather than the captures.
So if you write $b = /\Q$a\E/, you'll end up with $b = $_ =~ /\Q$a\E/.
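How the return value of a match depends on context can be seen in a short sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'programming';

# List context: the match returns the captured groups.
my ( $head, $tail ) = $string =~ /^(pro)(gram)/;

# Scalar context: the match returns true (1) or false (empty string).
my $matched = $string =~ /gram/;

# Counting matches: force list context with /g, then count the list.
my $count = () = $string =~ /m/g;

print "head=$head tail=$tail matched=$matched count=$count\n";
```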
$a='program';
$b='programming';
if ( $b =~ /\Q$a\E/) {
    print "match found\n";
}
If you're just looking for whether one string is contained within another and don't need to use any character classes, quantifiers, etc., then there's really no need to fire up the regex engine to do an exact literal match. Consider using index instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'program';
my $string = 'programming';
if (index($string, $target) > -1) {
    print "target is in string\n";
}