Regular expression puzzler - regex

I have been doing regular expression for 25+ years but I don't understand why this regex is not a match (using Perl syntax):
"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'
Can someone help solve that riddle?

The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
The drawback to this, as well as to consuming the whole such substring, is the literal spelling out of all three subpatterns; what when the number need be decided dynamically? Or when it's twelve? The pattern can be built at runtime of course. In Perl, one way
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
 
For the sake of performance, for long strings and many repetitions, make that .* non-greedy
(?=[iny].*?[iny].*?[iny])
(when forming the pattern dynamically join with .*?)
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);
my $str = 'aa'
. join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
. 'a'x($n+1)
. 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
my ($s, $pla) = #_;
return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}
cmpthese(-$runfor, {
greedy => sub { match_repeated($str, $pat_greedy) },
non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
Rate greedy non_greedy
greedy 56.3/s -- -100%
non_greedy 90169/s 159926% --
So ... some 1600 times better non-greedy. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in greedy case); so there's a lot going on there. With default 2, so for a short string and a simpler pattern, the difference is 10%.
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
Rate greedy non_greedy
greedy 56.5/s -- -100%
non_greedy 171949/s 304117% --

The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing three characters three times, with replacement, means there are 33 = 27 matching substrings:
iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
To match non-adjacent letters you can use one of these:
[iny].*[iny].*[iny]
[iny](.*[iny]){2}
([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)

That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.
Perhaps you meant to use [inf] or [ify]?

Looks like you are looking for 3 consecutive letters, so yours should not match
[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
Hope that helps somewhat :)

The {3} atom means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive as they are separated by other characters in your string.
It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.

Related

Check if string is subset of a bunch of characters? (RegEx)?

I have a little problem, I have 8 characters, for example "a b c d a e f g", and a list of words, for example:
mom, dad, bad, fag, abac
How can I check if I can or cannot compose these words with the letters I have?
In my example, I can compose bad, abac and fag, but I cannot compose dad (I have not two D) and mom (I have not M or O).
I'm pretty sure it can be done using a RegEx but would be helpful even using some functions in Perl..
Thanks in advance guys! :)
This is done most simply by forming a regular expression from the word that is to be tested.
This sorts the list of available characters and forms a string by concatenating them. Then each candidate word is split into characters, sorted, and rejoined with the regex term .* as separator. So, for instance, abac will be converted to a.*a.*b.*c.
Then the validity of the word is determined by testing the string of available characters against the derived regex.
use strict;
use warnings;
my #chars = qw/ a b c d a e f g /;
my $chars = join '', sort #chars;
for my $word (qw/ mom dad bad fag abac /) {
my $re = join '.*', sort $word =~ /./g;
print "$word is ", $chars =~ /$re/ ? 'valid' : 'NOT valid', "\n";
}
output
mom is NOT valid
dad is NOT valid
bad is valid
fag is valid
abac is valid
This is to demonstrate the possibility rather than endorsing the regex method. Please consider other saner solution.
First step, you need to count the number of characters available.
Then construct your regex as such (this is not Perl code!):
Start with start of input anchor, this matches the start of the string (a single word from the list):
^
Append as many of these as the number of unique characters:
(?!(?:[^<char>]*+<char>){<count + 1>})
Example: (?!(?:[^a]*+a){3}) if the number of a is 2.
I used an advanced regex construct here called zero-width negative look-ahead (?!pattern). It will not consume text, and it will try its best to check that nothing ahead in the string matches the pattern specified (?:[^a]*+a){3}. Basically, the idea is that I check that I cannot find 3 'a' ahead in the string. If I really can't find 3 instances of 'a', it means that the string can only contain 2 or less 'a'.
Note that I use *+, which is 0 or more quantifier, possessively. This is to avoid unnecessary backtracking.
Put the characters that can appear within []:
[<unique_chars_in_list>]+
Example: For a b c d a e f g, this will become [abcdefg]+. This part will actually consume the string, and make sure the string only contains characters in the list.
End with end of input anchor, which matches the end of the string:
$
So for your example, the regex will be:
^(?!(?:[^a]*+a){3})(?!(?:[^b]*+b){2})(?!(?:[^c]*+c){2})(?!(?:[^d]*+d){2})(?!(?:[^e]*+e){2})(?!(?:[^f]*+f){2})(?!(?:[^g]*+g){2})[abcdefg]+$
You must also specify i flag for case-insensitive matching.
Note that this only consider the case of English alphabet (a-z) in the list of words to match. Space and hyphen are not (yet) considered here.
How about sorting both strings into alphabetical order then for the string you want to check insert .*
between each letter like so:
'aabcdefg' =~ m/a.*b.*d.*/
True
'aabcdefg' =~ m/m.*m.*u.*/
False
'aabcdefg' =~ m/a.*d.*d.*/
False
Some pseudocode:
Sort the available characters into alphabetical order
for each word:
Sort the characters of the word into alphabetical order
For each character of the word search forwards through the available characters to find a matching character. Note the this
search will never go back to the start of the available chars,
matched chars are consumed.
Or even better, use frequency counts of characters.
For your available characters, construct a map from character to occurence count of that character.
Do the same for each candidate word and compare against the available map, if the word map contains a mapping for a character where the available map does not, or the mapped value is larger in the word map than the available map, then the word cannot be constructed using the available characters.
Here's a really simple script that would be rather easy to generalize:
#!/usr/bin/env perl
use strict;
use warnings;
sub check_word {
my $word = shift;
my %chars;
$chars{$_}++ for #_;
$chars{$_}-- or return for split //, $word;
return 1;
}
print check_word( 'cab', qw/a b c/ ) ? "Good" : "Bad";
And of course the performance of this function could be greatly enhanced if the letters list is going to be the same every time. Actually for eight characters, copying the hash vs building a new one each time is probably the same speed.
pseudocode:
bool possible=true
string[] chars= { "a", "b", "c"}
foreach word in words
{
foreach char in word.chars
{
possible=possible && chars.contains(char)
}
}

Using perl Regular expressions I want to make sure a number comes in order

I want to use a regular expression to check a string to make sure 4 and 5 are in order. I thought I could do this by doing
'$string =~ m/.45./'
I think I am going wrong somewhere. I am very new to Perl. I would honestly like to put it in an array and search through it and find out that way, but I'm assuming there is a much easier way to do it with regex.
print "input please:\n";
$input = <STDIN>;
chop($input);
if ($input =~ m/45/ and $input =~ m/5./) {
print "works";
}
else {
print "nata";
}
EDIT: Added Info
I just want 4 and 5 in order, but if 5 comes before at all say 322195458900023 is the number then where 545 is a problem 5 always have to come right after 4.
Assuming you want to match any string that contains two digits where the first digit is smaller than the second:
There is an obscure feature called "postponed regular expressions". We can include code inside a regular expression with
(??{CODE})
and the value of that code is interpolated into the regex.
The special verb (*FAIL) makes sure that the match fails (in fact only the current branch). We can combine this into following one-liner:
perl -ne'print /(\d)(\d)(??{$1<$2 ? "" : "(*FAIL)"})/ ? "yes\n" :"no\n"'
It prints yes when the current line contains two digits where the first digit is smaller than the second digit, and no when this is not the case.
The regex explained:
m{
(\d) # match a number, save it in $1
(\d) # match another number, save it in $2
(??{ # start postponed regex
$1 < $2 # if $1 is smaller than $2
? "" # then return the empty string (i.e. succeed)
: "(*FAIL)" # else return the *FAIL verb
}) # close postponed regex
}x; # /x modifier so I could use spaces and comments
However, this is a bit advanced and masochistic; using an array is (1) far easier to understand, and (2) probably better anyway. But it is still possible using only regexes.
Edit
Here is a way to make sure that no 5 is followed by a 4:
/^(?:[^5]+|5(?=[^4]|$))*$/
This reads as: The string is composed from any number (zero or more) characters that are not a five, or a five that is followed by either a character that is not a four or the five is the end of the string.
This regex is also a possibility:
/^(?:[^45]+|45)*$/
it allows any characters in the string that are not 4 or 5, or the sequence 45. I.e., there are no single 4s or 5s allowed.
You just need to match all 5 and search fails, where preceded is not 4:
if( $str =~ /(?<!4)5/ ) {
#Fail
}

Negative lookahead assertion with the * modifier in Perl

I have the (what I believe to be) negative lookahead assertion <#> *(?!QQQ) that I expect to match if the tested string is a <#> followed by any number of spaces (zero including) and then not followed by QQQ.
Yet, if the tested string is <#> QQQ the regular expression matches.
I fail to see why this is the case and would appreciate any help on this matter.
Here's a test script
use warnings;
use strict;
my #strings = ('something <#> QQQ',
'something <#> RRR',
'something <#>QQQ' ,
'something <#>RRR' );
print "$_\n" for map {$_ . " --> " . rep($_) } (#strings);
sub rep {
my $string = shift;
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
return $string;
}
This prints
something <#> QQQ --> something at w/o QQQ
something <#> RRR --> something at w/o RRR
something <#>QQQ --> something at w/ QQQ
something <#>RRR --> something at w/o RRR
And I'd have expected the first line to be something <#> QQQ --> something at w/ QQQ.
It matches because zero is included in "any number". So no spaces, followed by a space, matches "any number of spaces not followed by a Q".
You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):
<#> *(?!QQQ)(?! )
ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one less space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc) take a back seat - if it can match more than one way, they determine which way is chosen. But matching always wins over not matching.
$string =~ s,<#> *(?!QQQ),at w/o ,;
$string =~ s,<#> *QQQ,at w/ QQQ,;
One problem of yours here is that you are viewing the two regexes separately. You first ask to replace the string without QQQ, and then to replace the string with QQQ. This is actually checking the same thing twice, in a sense. For example: if (X==0) { ... } elsif (X!=0) { ... }. In other words, the code may be better written:
unless ($string =~ s,<#> *QQQ,at w/ QQQ,) {
$string =~ s,<#> *,at w/o,;
}
You always have to be careful with the * quantifier. Since it matches zero or more times, it can also match the empty string, which basically means: it can match any place in any string.
A negative look-around assertion has a similar quality, in the sense that it needs to only find a single thing that differs in order to match. In this case, it matches the part "<#> " as <#> + no space + space, where space is of course "not" QQQ. You are more or less at a logical impasse here, because the * quantifier and the negative look-ahead counter each other.
I believe the correct way to solve this is to separate the regexes, like I showed above. There is no sense in allowing the possibility of both regexes being executed.
However, for theoretical purposes, a working regex that allows both any number of spaces, and a negative look-ahead would need to be anchored. Much like Mark Reed has shown. This one might be the simplest.
<#>(?! *QQQ) # Add the spaces to the look-ahead
The difference is that now the spaces and Qs are anchored to each other, whereas before they could match separately. To drive home the point of the * quantifier, and also solve a minor problem of removing additional spaces, you can use:
<#> *(?! *QQQ)
This will work because either of the quantifiers can match the empty string. Theoretically, you can add as many of these as you want, and it will make no difference (except in performance): / * * * * * * */ is functionally equivalent to / */. The difference here is that spaces combined with Qs may not exist.
The regex engine will backtrack until it finds a match, or until finding a match is impossible. In this case, it found the following match:
+--------------- Matches "<#>".
| +----------- Matches "" (empty string).
| | +--- Doesn't match " QQQ".
| | |
--- ---- ---
'something <#> QQQ' =~ /<#> [ ]* (?!QQQ)/x
All you need to do is shuffle things around. Replace
/<#>[ ]*(?!QQQ)/
with
/<#>(?![ ]*QQQ)/
Or you can make it so the regex will only match all the spaces:
/<#>[ ]*+(?!QQQ)/
/<#>[ ]*(?![ ]|QQQ)/
/<#>[ ]*(?![ ])(?!QQQ)/
PS — Spaces are hard to see, so I use [ ] to make them more visible. It gets optimised away anyway.

Regex to match the longest repeating substring

I'm writing regular expression for checking if there is a substring, that contains at least 2 repeats of some pattern next to each other. I'm matching the result of regex with former string - if equal, there is such pattern. Better said by example: 1010 contains pattern 10 and it is there 2 times in continuous series. On other hand 10210 wouldn't have such pattern, because those 10 are not adjacent.
What's more, I need to find the longest pattern possible, and it's length is at least 1. I have written the expression to check for it ^.*?(.+)(\1).*?$. To find longest pattern, I've used non-greedy version to match something before patter, then pattern is matched to group 1 and once again same thing that has been matched for group1 is matched. Then the rest of string is matched, producing equal string. But there's a problem that regex is eager to return after finding first pattern, and don't really take into account that I intend to make those substrings before and after shortest possible (leaving the rest longest possible). So from string 01011010 I get correctly that there's match, but the pattern stored in group 1 is just 01 though I'd except 101.
As I believe I can't make pattern "more greedy" or trash before and after even "more non-greedy" I can only come whit an idea to make regex less eager, but I'm not sure if this is possible.
Further examples:
56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match
What I would get using current expression (same strings used):
no pattern - no match
pattern 2 - match
pattern 555 - match
pattern 1919 - match
pattern 191919 - match
pattern 23 - match
In Perl you can do it with one expression with help of (??{ code }):
$_ = '01011010';
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
Output:
101
What happens here is that after a matching consecutive pair of substrings, we make sure with a negative lookahead that there is no longer pair following it.
To make the expression for the longer pair a postponed subexpression construct is used (??{ code }), which evaluates the code inside (every time) and uses the returned string as an expression.
The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N), $^N contains the current value of the previous capturing group).
Thus the full expression would have the form:
(?=(.+)\1)(?!.+?(..{N,})\2}))
With the magical N (and second capturing group not being a "real"/proper capturing group of the original expression).
Usage example:
use v5.10;
sub longest_rep{
$_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}
say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';
Output:
101
10001
191919
101
You can do it in a single regex, you just have to pick the longest match from the list of results manually.
def longestrepeating(strg):
regex = re.compile(r"(?=(.+)\1)")
matches = regex.findall(strg)
if matches:
return max(matches, key=len)
This gives you (since re.findall() returns a list of the matching capturing groups, even though the matches themselves are zero-length):
>>> longestrepeating("yabyababyab")
'abyab'
>>> longestrepeating("10100101")
'010'
>>> strings = ["56712453289", "22010110100", "5555555", "1919191919", 
               "191919191919", "2323191919191919"]
>>> [longestrepeating(s) for s in strings]
[None, '101', '555', '1919', '191919', '191919']
Here's a long-ish script that does what you ask. It basically goes through your input string, shortens it by one, then goes through it again. Once all possible matches are found, it returns one of the longest. It is possible to tweak it so that all the longest matches are returned, instead of just one, but I'll leave that to you.
It's pretty rudimentary code, but hopefully you'll get the gist of it.
use v5.10;
use strict;
use warnings;
while (<DATA>) {
chomp;
print "$_ : ";
my $longest = foo($_);
if ($longest) {
say $longest;
} else {
say "No matches found";
}
}
sub foo {
my $num = shift;
my #hits;
for my $i (0 .. length($num)) {
my $part = substr $num, $i;
push #hits, $part =~ /(.+)(?=\1)/g;
}
my $long = shift #hits;
for (#hits) {
if (length($long) < length) {
$long = $_;
}
}
return $long;
}
__DATA__
56712453289
22010110100
5555555
1919191919
191919191919
2323191919191919
Not sure if anyone's thought of this...
my $originalstring="pdxabababqababqh1234112341";
my $max=int(length($originalstring)/2);
my #result;
foreach my $n (reverse(1..$max)) {
#result=$originalstring=~m/(.{$n})\1/g;
last if #result;
}
print join(",",#result),"\n";
The longest doubled match cannot exceed half the length of the original string, so we count down from there.
If the matches are suspected to be small relative to the length of the original string, then this idea could be reversed... instead of counting down until we find the match, we count up until there are no more matches. Then we need to back up 1 and give that result. We would also need to put a comma after the $n in the regex.
my $n;
foreach (1..$max) {
unless (#result=$originalstring=~m/(.{$_,})\1/g) {
$n=--$_;
last;
}
}
#result=$originalstring=~m/(.{$n})\1/g;
print join(",",#result),"\n";
Regular expressions can be helpful in solving this, but I don't think you can do it as a single expression, since you want to find the longest successful match, whereas regexes just look for the first match they can find. Greediness can be used to tweak which match is found first (earlier vs. later in the string), but I can't think of a way to prefer an earlier, longer substring over a later, shorter substring while also preferring a later, longer substring over an earlier, shorter substring.
One approach using regular expressions would be to iterate over the possible lengths, in decreasing order, and quit as soon as you find a match of the specified length:
my $s = '01011010';
my $one = undef;
for(my $i = int (length($s) / 2); $i > 0; --$i)
{
if($s =~ m/(.{$i})\1/)
{
$one = $1;
last;
}
}
# now $one is '101'

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".?" with "[^(\?=)]?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is not to use .*? but instead do use the * without the ?, but refine the character class. In this case [^?]* to match a sequence of non-question mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?' with a '?' after them. This process twice to omit the string 'utf-8?Q?'
([^?]*) # Save in a group next characters until found a '?'
/g # Repeat this process multiple times until end of string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.