Using a substring twice in regex

First, this question may have been asked before, but I'm not sure what phrase to search on.
I have a string:
Maaaa
I have a pattern:
aaa
I would like to match twice, giving me starting indices of 1 and 2. But of course I only get a single match (start index 1), because the regex engine gobbles up all 3 "a"s and can't use them again, leaving me with 1 "a" which doesn't match.
How do I solve this?
Thanks!

You could use a lookahead assertion to find an "a" followed by two more "a"s:
a(?=aa)
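A minimal Perl sketch of this idea, assuming the goal is the start offset of every overlapping occurrence. The variable names are illustrative; @- is Perl's built-in array of match start offsets.

```perl
use strict;
use warnings;

my $str = "Maaaa";
my @starts;

# a(?=aa) consumes only one "a" per match; the lookahead checks the next
# two without consuming them, so the following match may reuse them.
while ( $str =~ /a(?=aa)/g ) {
    push @starts, $-[0];    # $-[0] is the start offset of the current match
}

print "start indices: @starts\n";    # start indices: 1 2
```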

The perlre manpage suggests:
my @a;
"Maaaa" =~ /aaa(?{push @a,$&})(*FAIL)/;
print join "\n",@a;
print "\n";
which yields
aaa
aaa

Related

Difference between ? and * in regular expressions - match same input?

I am not able to understand the practical difference between ? and * in regular expressions. I know that ? means to check if previous character/group is present 0 or 1 times and * means to check if the previous character/group is present 0 or more times.
But this code
while(<>) {
chomp($_);
if(/hello?/) {
print "metch $_ \n";
}
else {
print "naot metch $_ \n";
}
}
gives the same output for both hello? and hello*. The external file that is given to this Perl program contains
hello
helloooo
hell
And the output is
metch hello
metch helloooo
metch hell
for both hello? and hello*. I am not able to understand the exact difference between ? and *
In Perl (and unlike Java), the m//-match operator is not anchored by default.
As such, all of the input is trivially matched by both /hello?/ and /hello*/. That is, these will match any string that contains "hell" anywhere (as both quantifiers make the "o" optional).
Compare with /^hello?$/ and /^hello*$/, respectively. Since these employ anchors the former will not match "helloo" (as at most one "o" is allowed) while the latter will.
Under Regexp Quote-like Operators:
m/PATTERN/ searches [anywhere in] a string for a pattern match, and in scalar context returns true if it succeeds, false if it fails.
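The effect of the anchors can be checked directly. The following is a small sketch, using the three input lines from the question and illustrative variable names, that records whether each string matches the unanchored and the anchored forms.

```perl
use strict;
use warnings;

my @rows;
for my $s (qw(hello helloooo hell)) {
    push @rows, [
        $s,
        ( $s =~ /hello?/   ? 'yes' : 'no' ),    # unanchored, ?
        ( $s =~ /hello*/   ? 'yes' : 'no' ),    # unanchored, *
        ( $s =~ /^hello?$/ ? 'yes' : 'no' ),    # anchored, ?
        ( $s =~ /^hello*$/ ? 'yes' : 'no' ),    # anchored, *
    ];
}

# Print a small table: one row per input, one column per pattern.
printf "%-10s %-8s %-8s %-10s %-10s\n", 'input', 'hello?', 'hello*', '^hello?$', '^hello*$';
printf "%-10s %-8s %-8s %-10s %-10s\n", @$_ for @rows;
```

Only the anchored /^hello?$/ rejects "helloooo"; every other combination matches all three inputs.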
What is confusing you is that, without anchors like ^ and $ a regex pattern match checks only whether the pattern appears anywhere in the target string.
If you add something to the pattern after the hello, like
if (/hello?, Ashwin/) { ... }
Then the strings
hello, Ashwin
and
hell, Ashwin
will match, but
helloooo, Ashwin
will not, because there are too many o characters between hell and the comma ,.
However, if you use a star * instead, like
if (/hello*, Ashwin/) { ... }
then all three strings will match.
? means the last item is optional. * means it is optional and may also be repeated any number of times.
i.e.
hello? matches hell, hello
hello* matches hell, hello, helloo, hellooo, ....
But without ^ or $, these matches can occur anywhere in the string.
Here's an example I came up with that makes it quite clear:
What if you wanted to only match up to tens of people and your data was like below:
2 people. 20 people. 200 people. 2000 people.
Only ? would be useful in that case, whereas * would incorrectly capture larger numbers.
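The answer does not give the patterns it has in mind, so the two below are assumptions: \b\d\d? for "one or two digits" and \b\d\d* for "one or more digits". A short sketch of the difference on the sample data:

```perl
use strict;
use warnings;

my $data = '2 people. 20 people. 200 people. 2000 people.';

# Assumed patterns: \d\d? allows at most two digits, \d\d* allows any number.
# The \b stops a match from starting in the middle of a longer number.
my @with_question = $data =~ /\b(\d\d?) people/g;
my @with_star     = $data =~ /\b(\d\d*) people/g;

print "with ?: @with_question\n";    # with ?: 2 20
print "with *: @with_star\n";        # with *: 2 20 200 2000
```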

Getting multiple matches within a string using regex in Perl

After having read this similar question and having tried my code several times, I keep on getting the same undesired output.
Let's assume the string I'm searching is "I saw wilma yesterday".
The regex should capture each word followed by an 'a' and its optional 5 following characters or spaces.
The code I wrote is the following:
$_ = "I saw wilma yesterday";
if (@m = /(\w+)a(.{5,})?/g){
    print "found " . @m . " matches\n";
    foreach(@m){
        print "\t\"$_\"\n";
    }
}
However, I kept on getting the following output:
found 2 matches
"s"
"w wilma yesterday"
while I expected to get the following one:
found 3 matches:
"saw wil"
"wilma yest"
"yesterday"
until I found out that the return values inside @m were $1 and $2, as you can notice.
Now, since the /g flag is on, and I don't think the problem is about the regex, how could I get the desired output?
You can try this pattern that allows overlapped results:
(?=\b(\w+a.{1,5}))
or
(?=(?i)\b([a-z]+a.{0,5}))
example:
use strict;
my $str = "I saw wilma yesterday";
my @matches = ($str =~ /(?=\b([a-z]+a.{0,5}))/gi);
print join("\n", @matches),"\n";
more explanations:
You can't get overlapping results from an ordinary match: once the regex engine has "eaten" a character, it can't eat it a second time. The trick to get around this constraint is to put a capturing group inside a lookahead. A lookahead is a zero-width assertion that only checks the text ahead without consuming it, so it can pass over the same characters several times.
For another example of this behaviour, you can try the example code without the word boundary (\b) to see the result.
Firstly you want to capture everything inside the expression, i.e.:
/(\w+a(?:.{5,})?)/
Next you want to start your search from one character past where the last expression's first character matched.
The pos() function allows you to specify where a /g regex starts its search from.
$s = "I saw wilma yesterday";
while ($s =~ /(\w+a(.{0,5}))/g){
print "\t\"$1\"\n";
pos($s) = pos($s) - length($2);
}
Gives you:
"saw wil"
"wilma yest"
"yesterday"
But I don't know why you should get day and not yesterday.

2-step regular expression matching with a variable in Perl

I am looking to do a 2-step regular expression look-up in Perl, I have text that looks like this:
here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words
I also have a hash with the following values:
%my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
Here's what I need to do:
Find a number
Look up the text code for the number using my_hash
Check if the text code appears within 50 characters of the identified number, and if true print the result
So the output I'm looking for is:
Found 9337, matches 'AA'
Found 2214, matches 'BB'
Found 1190, no matches
Found 8790, no matches
Here's what I have so far:
while ( $text =~ /(\d+)(.{1,50})/g ) {
    $num = $1;
    $text_after_num = $2;
    $search_for = $my_hash{$num};
    if ( $text_after_num =~ /($search_for)/ ) {
        print "Found $num, matches $search_for\n";
    }
    else {
        print "Found $num, no matches\n";
    }
}
This sort of works, except that the only correct match is 9337; the code doesn't match 2214. I think the reason is that the regular expression match on 9337 is including 50 characters after the number for the second-step match, and then when the regex engine starts again it is starting from a point after the 2214. Is there an easy way to fix this? I think the \G modifier can help me here, but I don't quite see how.
Any suggestions or help would be great.
You have a problem with greediness. The .{1,50} will consume as much as it can. Your regex should be /(\d+)(.+?)(?=($|\d))/
To explain, the question mark will make the multiple match non-greedy (it will stop as soon as the next pattern is matched - the next pattern gets precedence). The ?= is a lookahead operator to say "check if the next element is a digit. If so, match but do not consume." This allows the first digit to get picked up by the beginning of the regex and be put into the next matched pattern.
[EDIT]
I added an optional end value to the lookahead so that it wouldn't die on the last match.
Just use:
/\b\d+\b/g
Why match everything if you don't need to? You should use other functions to determine where the number is:
/(?=9337.{1,50}AA)/
This will fail if AA is more than 50 characters away from the end of 9337. Of course, you will have to interpolate your variables to match your hash's keys and values. This was just an example for your first key/value pair.
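One way to put these suggestions together is sketched below. It is not any answerer's exact code: it matches only the numbers, then inspects the next 50 characters with substr and index (a swapped-in alternative to the interpolated lookahead), so nothing after the number is consumed. The @report array is an illustrative name.

```perl
use strict;
use warnings;

my $text = 'here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words';
my %my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );

my @report;
while ( $text =~ /\b(\d+)\b/g ) {
    my $num  = $1;
    my $code = $my_hash{$num};    # undef when the number is not a key

    # Up to 50 characters following the number; the match position is
    # left untouched, so the next iteration resumes right after $num.
    my $window = substr( $text, pos($text), 50 );

    if ( defined $code && index( $window, $code ) >= 0 ) {
        push @report, "Found $num, matches '$code'";
    }
    else {
        push @report, "Found $num, no matches";
    }
}

print "$_\n" for @report;
```

On the sample text this reports matches for 9337 and 2214 and no matches for 1190 and 8790, which is the output the question asks for.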

regex for n characters or at least m characters

This should be a pretty simple regex question but I couldn't find any answers anywhere. How would one make a regex, which matches on either ONLY 2 characters, or at least 4 characters. Here is my current method of doing it (ignore the regex itself, that's besides the point):
[A-Za-z0_9_]{2}|[A-Za-z0_9_]{4,}
However, this method takes twice the time (and is approximately 0.3s slower for me on a 400 line file), so I was wondering if there was a better way to do it?
Optimize the beginning, and anchor it.
^[A-Za-z0-9_]{2}(?:|[A-Za-z0-9_]{2,})$
(Also, you did say to ignore the regex itself, but I guessed you probably wanted 0-9, not 0_9)
EDIT Hm, I was sure I read that you want to match lines. Remove the anchors (^$) if you want to match inside the line as well. If you do match full lines only, anchors will speed you up (well, the front anchor ^ will, at least).
Your solution looks pretty good. As an alternative, you can try something like this:
[A-Za-z0-9_]{2}(?:[A-Za-z0-9_]{2,})?
Btw, I think you want hyphen instead of underscore between 0 and 9, don't you?
The solution you present is correct.
If you're trying to optimize the routine, and the number of strings of 2 or more characters is much smaller than the number of those that are not, consider accepting all strings of length 2 or greater, then tossing those of length 3. This may boost performance because the regex is checked only once, and the second check need not even be a regular expression; checking a string length is usually an extremely fast operation.
As always, you really need to run tests on real-world data to verify if this would give you a speed increase.
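A rough sketch of that two-step check, assuming whole strings are being validated (hence the \A and \z anchors) and using the corrected 0-9 range. The helper name has_wanted_length is hypothetical, and whether this is actually faster has to be measured on real data.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $s consists only of [A-Za-z0-9_] and is
# exactly 2, or at least 4, characters long.
sub has_wanted_length {
    my ($s) = @_;
    return 0 unless $s =~ /\A[A-Za-z0-9_]{2,}\z/;    # single regex pass: 2 or more
    return length($s) != 3 ? 1 : 0;                   # then discard length 3
}

for my $word (qw(a ab abc abcd abcde)) {
    printf "%-6s %s\n", $word, has_wanted_length($word) ? 'accepted' : 'rejected';
}
```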
So basically you want to match words of length either 2 or 2+2+N, N>=0
([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)
working example:
#!/usr/bin/perl
while (<STDIN>)
{
    chomp;
    my @matches = ($_ =~ /([A-Za-z0-9][A-Za-z0-9](?:[A-Za-z0-9][A-Za-z0-9])*)/g);
    for my $m (@matches) {
        print "match: $m\n";
    }
}
input file:
cat in.txt
ab abc bcad a as asdfa
aboioioi i i abc bcad a as asdfa
output:
perl t.pl <in.txt
match: ab
match: ab
match: bcad
match: as
match: asdf
match: aboioioi
match: ab
match: bcad
match: as
match: asdf

Verify if a word has a letter repeated in any position

I'd like to know if there is a way to test whether a word has a letter repeated in any position.
I'm currently using this regex to test it, but it doesn't work, because if I add more than two 's' characters the test still returns true.
/s{0,2}/.test('süuaãpérbrôséê'); //expected true
/s{0,2}/.test('ssüuaãpérbrôéê'); //expected true
/s{0,2}/.test('süuaãpérbrôéê'); //expected true
/s{0,2}/.test('süuaãpérbrôséês'); //expected fail
Thanks.
/s{2,}/
or generally for any character:
/(.)\1/
/(\w)\1/ finds two identical word characters next to each other
This will find and replace the duplicates:
s/(\w)\1/$1/
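Note that /(\w)\1/ only detects repeats that sit next to each other. A small Perl sketch of the difference follows; the variant /(\w).*\1/, which allows other characters between the two occurrences, is an assumption about what "any position" means, and the helper names and ASCII test words are illustrative.

```perl
use strict;
use warnings;

# Adjacent repeat: the same word character twice in a row.
sub has_adjacent_repeat { return $_[0] =~ /(\w)\1/ ? 1 : 0 }

# Repeat anywhere: the same word character again later in the string.
sub has_any_repeat { return $_[0] =~ /(\w).*\1/ ? 1 : 0 }

for my $word (qw(letter banana word)) {
    printf "%-7s adjacent=%d anywhere=%d\n",
        $word, has_adjacent_repeat($word), has_any_repeat($word);
}
```

Here "letter" passes both tests, "banana" only the second, and "word" neither.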
The only way I found to solve this problem is to use PHP's preg_match_all, which lets me count how many times the character occurs.
$s = 'süuaãpérbrôséê';
preg_match_all('/s/i', $s, $m);
echo count($m[0]); //outputs 2
My initial idea was to pass a regex to preg_match and verify that the match occurs a given number of times, but I don't think that's possible, so I'll create a method that receives the word and the regex I need to match and returns the number of matches.
Thanks.
Using lookahead, you can achieve something like this:
^(?=.*(\w)(.*\1){1}.*$)((?!\1).)*\1(((?!\1).)*\1){1}((?!\1).)*$
Where {1} is the number of repetitions minus 1, so to check for three repetitions it would look like:
^(?=.*(\w)(.*\1){2}.*$)((?!\1).)*\1(((?!\1).)*\1){2}((?!\1).)*$
And for two or three:
^(?=.*(\w)(.*\1){1,2}.*$)((?!\1).)*\1(((?!\1).)*\1){1,2}((?!\1).)*$
etc.
The lookaheads with backreferences can be very powerful :)