Find any permutation of a set using Perl's RegEx - regex

I need to find a way of checking for the existence of sets of the type {1,2,3,4,5,6,8,9,10}, that have a preset number of elements. Also, notice the missing 7. Obviously the numbers could be in any order and should appear only once, since according to definition, {1,2,3} = {3,2,1} = {1,2,3,3} = ... and so forth.
How could I do this with Perl (or is it even possible)? One thing I tried was
{([1-6],|[8-9],|10,){8}([1-6]|[8-9]|10)} here, but this doesn't take care of the multiple instances of the same number within the brackets.

Regexes are almost certainly the wrong tool here. You want something that deals with permutations of an input list.
This blog post gives a useful overview of Perl modules that deal with permutations and combinations. Sounds to me like Algorithm::Combinatorics would be a good place to start. Something like this, perhaps:
use Algorithm::Combinatorics;
my #input = qw[1 2 3 4 5 6 8 9 10];
my #perms = permutations(\#input);
You then need some way to compare the valid permutations with the sets you want to test. I'd consider constructing a string representation of the sets (by joining them with a known delimiter) and doing a simple string comparison.
my #perm_strs = map { join ':' } #perms;
my #test = qw[2 4 3 1 10 5 9 8 6];
my $test_str = join ':', #test;
my $match = 0;
for (#perm_strs) {
if ($test_str eq $_) {
$match = 1;
last;
}
}
The success of the match is now in $match.

This regex does that.
Here 10 slots are allocated, but you can add as many as you want ( a hundred ? ).
It doesn't mean you have to match 10 unique numbers in a set,
You can match anything less than or equal to 10 (example {5}),
or even a range like {3,7}
The slots will be filled sequentially starting from 1.
So, you just have to sit in a loop from 1 - N, seeing if it is defined.
If you're looking for speed, this is the demon you want !
/\{(?>(?>(?(1)(?!))((?&GetNum))|(?(2)(?!))((?&GetNum))|(?(3)(?!))((?&GetNum))|(?(4)(?!))((?&GetNum))|(?(5)(?!))((?&GetNum))|(?(6)(?!))((?&GetNum))|(?(7)(?!))((?&GetNum))|(?(8)(?!))((?&GetNum))|(?(9)(?!))((?&GetNum))|(?(10)(?!))((?&GetNum)))(?:,(?!\})|(?=\}))){3,7}\}(?(DEFINE)(?<GetNum>(?!(?:\g{1}|\g{2}|\g{3}|\g{4}|\g{5}|\g{6}|\g{7}|\g{8}|\g{9}|\g{10})\b)\d+))/
https://regex101.com/r/pPwPTe/1
Readable regex
# Unique numbers in set, 10 slots
\{
(?> # Atomic, no backtracking allowed
(?> # ditto
(?(1) (?!) ) ( (?&GetNum) ) # (1), Slot 1
| (?(2) (?!) ) ( (?&GetNum) ) # (2), Slot 2
| (?(3) (?!) ) ( (?&GetNum) ) # (3), Slot 3
| (?(4) (?!) ) ( (?&GetNum) ) # (4), Slot 4
| (?(5) (?!) ) ( (?&GetNum) ) # (5), Slot 5
| (?(6) (?!) ) ( (?&GetNum) ) # (6), Slot 6
| (?(7) (?!) ) ( (?&GetNum) ) # (7), Slot 7
| (?(8) (?!) ) ( (?&GetNum) ) # (8), Slot 8
| (?(9) (?!) ) ( (?&GetNum) ) # (9), Slot 9
| (?(10) (?!) ) ( (?&GetNum) ) # (10), Slot 10
)
(?: , (?! \} ) | (?= \} ) )
){3,7} # Set range, example: 3 to 7 unique numbers in set
\}
(?(DEFINE)
(?<GetNum> # (4) Get a new number, must not be seen before
(?! (?: \g{1}|\g{2}|\g{3}|\g{4}|\g{5}|\g{6}|\g{7}|\g{8}|\g{9}|\g{10} ) \b )
\d+
)
)

Given front matter and test cases of
#! /usr/bin/env perl
use strict;
use warnings;
my #tests = (
"{}",
"{1,1}",
"{1,2,3,4,5,6,8,9,10}",
"{1,1,2,3,4,5,6,8,9,10}",
"{1,2,3,4,5,6,7,8,9,10}",
"{10,9,8,7,6,5,4,3,2,1}",
"{10,9,8,6,5,4,3,2,1}",
"{10,9,8,6,5,4,3,2,1",
"{10,9,8,6,5,4,3,2,1,1}",
"{2,4,6,8,10,9,5,3,1}",
);
you have at least three approaches to implementing what you want.
Brute force
When in doubt, try a bigger hammer. Generate all permutations and bake those into your pattern directly. Note that this has a factorial cost, so it quickly becomes intractable as the number of elements in your set grows.
# perlfaq4: How do I permute N elements of a list?
sub permute (&#) {
my $code = shift;
my #idx = 0..$#_;
while ( $code->(#_[#idx]) ) {
my $p = $#idx;
--$p while $idx[$p-1] > $idx[$p];
my $q = $p or return;
push #idx, reverse splice #idx, $p;
++$q while $idx[$p-1] > $idx[$q];
#idx[$p-1,$q]=#idx[$q,$p-1];
}
}
my $brute_force;
permute { local $" = ",";
$brute_force .= "|" if $brute_force;
$brute_force .= "{#_}" }
#members;
$brute_force = qr/ ^ (?: $brute_force ) $/x;
for (#tests) {
my $result = /$brute_force/x ? "ACCEPT" : "REJECT";
print "$_ - $result\n";
}
Generating all permutations on my laptop takes about 3 minutes. Precomputing the pattern may or may not make sense depending on your application.
Piggyback on the regex engine’s backtracking
One way to do it is to take advantage of the Perl regex engine’s backtracking and running (?{ code }) at various points within your pattern.
Define members of your set as below. Note that these must be global variables because of limitations of the regex engine, so use our and not my.
# must use package variables inside (?{ })
our #members = (1 .. 6, 8 .. 10);
our %remaining;
A pattern that matches permutations becomes
my $permutation = qr!
\{ (?{ #remaining{#members} = map +($_ => 1), #members })
( ([0-9]+), (?(?{ delete local $remaining{$^N} })|(*FAIL)))+
([0-9]+)\} (?(?{ delete local $remaining{$^N} && keys %remaining == 0 })|(*FAIL))
!x;
Code inside (?{ code }) sections runs at corresponding points of the pattern match. For example, the first one initializes the hash %remaining to contain all members of the set as keys.
The second and third (?{ code }) sections are within (?(condition)yes-pattern|no-pattern) sections and (*FAIL) backtracking control verbs. For any member before the last in the set (which we know because it is terminated by a comma), the member just matched, available in the $^N special variable, must be still available in %remaining. For the last member (terminated by right curly brace), the member just matched must be available and we must have covered all elements of the set to succeed. If these constraints are met, we match against an empty yes-pattern and continue successfully, but if one of these conditions fails, we meet (*FAIL) in the no-pattern. This causes the current attempted match to fail and the regex engine backtracks to attempt the next possibility.
Writing delete local localizes deletion of the particular key from %remaining. This delegates the error-prone bookkeeping to the regex engine that correctly restores localized values when it backtracks past a non-viable match.
Note that this implementation requires a set of at least two members.
Use it as in
for (#tests) {
my $result = /^ $permutation $/x ? "ACCEPT" : "REJECT";
print "$_ - $result\n";
}
Hybrid approach
Finally, combine the approaches by searching for everything that looks like a set and reject invalid permutations.
sub _assert_permutation_of {
my($members,$set) = #_;
my %seen = map +($_ => 1), #$members;
while ($set =~ /\b([0-9]+)\b/g) {
return unless delete $seen{$1};
}
keys %seen == 0;
}
my $hybrid = qr!
( \{ # opening brace
(?: [0-9]+ , )+ # comma-terminated integers
[0-9]+ # final integer
\} # closing brace
)
(?(?{ _assert_permutation_of \#members, $^N })|(*FAIL))
!x;
for (#tests) {
my $result = /^ $hybrid $/x ? "ACCEPT" : "REJECT";
print "$_ - $result\n";
}
Test output
For all three, the output is
{} - REJECT
{1,1} - REJECT
{1,2,3,4,5,6,8,9,10} - ACCEPT
{1,1,2,3,4,5,6,8,9,10} - REJECT
{1,2,3,4,5,6,7,8,9,10} - REJECT
{10,9,8,7,6,5,4,3,2,1} - REJECT
{10,9,8,6,5,4,3,2,1} - ACCEPT
{10,9,8,6,5,4,3,2,1 - REJECT
{10,9,8,6,5,4,3,2,1,1} - REJECT
{2,4,6,8,10,9,5,3,1} - ACCEPT

Related

RegEx for matching subexpression

When I use a regular expression like
std::regex midiNoteNameRegex("([cdefgab])([b#]{0,1})([0-9]))|([0-9]{3})|([A-Z0-9]{2})");
there are three top-level subexpressions connected by "|" in the pattern of which one will match.
Is there a way to tell which one? Other than testing them sequentially one after the other?
If I would use named subexpressions it would be easy, but there are no named subexpressions in C++.
How do I solve this problem?
Given the groups in your regex, it's just a flat search of the match object,
which in C++ is a flag (int) check, with no noticeable overhead.
( [cdefgab] ) # (1)
( [b#]{0,1} ) # (2)
( [0-9] ) # (3)
| ( [0-9]{3} ) # (4)
| ( [A-Z0-9]{2} ) # (5)
And a possible usage
wregex MyRx = wregex( "([cdefgab])([b#]{0,1})([0-9])|([0-9]{3})|([A-Z0-9]{2})", 0);
wstring::const_iterator start = str.begin();
wstring::const_iterator end = str.end();
wsmatch m;
while ( regex_search( start, end, m, MyRx ) )
{
if ( m[1].matched )
// First alternation
else
if ( m[4].matched )
// Second alternation
else
if ( m[5].matched )
// Third alternation
start = m[0].second;
}
I don't have a definite answer but I believe the answer is most likely no.
Named capturing group is not a required feature: http://www.cplusplus.com/reference/regex/ECMAScript/
Implementation of named capturing group is probably not trivial and probably brings down the performance of the regex engine.
Found another post on this issue that agrees with me: C++ regex: Which group matched?

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( #{$body} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
if ( $delimit ) {
next if (/$delimit/ && ! $tp);
last if (/$delimit/ && $tp);
$tp = 1, next if /text.plain/;
$tp = 0, next if /text.html/;
s/<[^>]*>//g;
$newbody .= $_ if $tp;
} else {
s/<[^>]*>//g;
$newbody .= $_ ;
}
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( #{$body_text} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/^\s*NEMS\s*=\s*(\d+)/i;
$nems = $1;
next;
}
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
what includes the clarification that the new $body_text is a multiline string.
It is clear that $nems is declared (needed) outside of the loop and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit It's been clarified that there can only be 5 or 7 digits. Then the regex can be tightened, to check whether input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut, for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The clarification of the evolved problem understanding states that all digits following = need be captured into $nems (and there may be 5,(not 6),7,8,9,10 digits). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means a digit, one or more times. So a string of digits (match fails if there are none).

how to limit, characters between a range using regular expression

As far as I know {} curly braces are used to limit characters in regular expression like {3,12}, would match character length between 3 to 12.
I am trying to validate username that might contain a period . or _ either one, but not both, doesn't matter placement. For this below regex is working very well.
(^[a-z0-9]+$)|(^[a-z0-9]*[\.\_][a-z0-9]*$)
But I also need to limit the string length between 3 to 12, I had tried to put {3,12} in regex, but that doesn't work.
((^[a-z0-9]+$)|(^[a-z0-9]*[\.\_][a-z0-9]*$)){3,12}
See Example: https://regex101.com/r/kN3aO1/1
As hwnd suggested, a simpler solution would be:
^(?=.{3,12}$)[a-z0-9]+(?:[._][a-z0-9]+)?$
Old solution, which is rather complex and convoluted,is left here for reference, but use the one above instead.
^(?!(?:.{13,}|.{1,2})$)(?:([a-z0-9]+)|([a-z0-9]*[\.\_][a-z0-9]*))$
You can add a lookahead for this.
Demo on regex101
I would do this in three steps.
Check to see if the string has any '/' in it.
Check to see if the string has any '_' in it.
Check to see if string length is between 3 and 12.
In Perl:
if ( ( ( $name =~ /_/ ) && ( $name =~ /\./ ) ) ||
( length($name) < 3 ) ||
( length($name) > 12 ) )
{
# Handle invalid username
}
If you want to make sure that the username contains only one dot or underscore, you may count them. Again, in Perl:
my $dcnt = $name =~ tr /././;
my $ucnt = $name =~ tr /_/_/;
if ( ( $dcnt > 0 && $ucnt > 0 ) ||
( $dcnt > 1 ) ||
( $ucnt > 1 ) ||
( length($name) < 3 ) ||
( length($name) > 12 ) )
{
# Handle invalid username
}
Why not one monster regular expression that does everything at once? Well, for the sake of maintainability. If you or a colleague looks at this code in a year's time, when requirements have changed, this approach will make it easier to update the code.
Notice also that {3,12} says nothing about lengths. It allows the previous pattern to match three to twelve times.

PCRE regex for multiple decimal coordinates using [lon,lat] format

I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4
You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<lon> [+-]?
(?:
180 (?:\.0{1,15})?
|
(?: 1(?:[0-7][0-9]?)? | [2-9][0-9]? | 0 )
(?:\.[0-9]{1,15})?
)
)
(?<lat> [+-]?
(?:
90 (?:\.0{1,15})?
|
(?: [1-8][0-9]? | 9)
(?:\.[0-9]{1,15})?
)
)
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
explanations:
When you have to deal with a long pattern inside which you have to repeat several time the same subpatterns, you can use several features to make it more readable.
The most well know is to use the free-spacing mode (the x modifier) that allows to indent has you want the pattern (all spaces are ignored) and eventually to add comments.
The second consists to define subpatterns in a definition section (?(DEFINE)...) in which you can define named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describes the longitude number and the latitude number, I have created in the definition section two named pattern "lon" and "lat". To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
javascript version:
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?)?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|9)(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
'[120,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 or 3 digits, with or without decimals (up to 15) are allowed
(?!(90\.|9[1-9]))
The 90 check is similar put here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15) are allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.
This might work. Note that you have a lot of capture groups, none of which
will give you good information because of recursive quantifiers.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
( # (1 start)
\[
[-+]?
( # (2 start)
180
( \. 0{1,15} )? # (3)
|
( # (4 start)
( 1 [0-7] \d ) # (5)
|
( [1-9]? \d ) # (6)
) # (4 end)
( \. \d{1,15} )? # (7)
) # (2 end)
,
[-+]?
( # (8 start)
[1-8]? \d
( \. \d{1,15} )? # (9)
|
90
( \. 0{1,15} )? # (10)
) # (8 end)
\]
(?: ; \n? | $ )
)+ # (1 end)
$
Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
return -90.0<=y && y<=90.0;
}
function isValid(s)
{
//if there's a trailing semicolon, remove it
if(s.slice(-1)==';')
{
s = s.slice(0,-1);
}
//remove all newlines and split by semicolon
var all = s.replace(/\n/g,'').split(';');
//console.log(all);
for(var k=0; k<all.length; ++k)
{
var match = myRegex.exec(all[k]);
if(match===null)
return false;
console.log(" match[1]: "+match[1]);
console.log(" match[2]: "+match[2]);
//break out if one pair is bad
if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
{
console.log(" one of matches out of bounds");
return false;
}
}
return true;
}
var coords = new Array('[120,80];',
'[120.33,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]',
'[190,33.33]',
'[180.33,33]',
'[179.87,90]',
'[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
s = coords[i];
console.log((i+1)+". ==== testing "+s+" ====");
console.log(" isValid? => " + isValid(s));
}

Regex pattern works in javascript but fails in scala with PatternSyntaxException: Unclosed character class

Here is the regex:
ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?
Here is a test pattern:
wss://beta5.max.com:18989/abcde.html
softlion.com likes it:
Test results
Match count: 1
Global matches:
wss://beta5.max.com:18989/abcde.html
Value of each capturing group:
0 1 2 3 4
wss://beta5.max.com:18989/abcde.html s beta5.max.com 18989 /abcde.html
scala does not:
val regex = """ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?""".r
Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 58
ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?
My bad, I had an extra [ at the front of the last capturing group.
([/([0-9.a-zA-Z-_]+)?
Java allows intersections and all that, so error ..
ws
( s )?
://
( [0-9\.a-zA-Z\-_]+ )
:
( [\d]+ )
= ( <-- Unbalanced '('
= [ <-- Unbalanced '['
/
( [0-9\.a-zA-Z\-_]+ )?
With everybody else its no problem:
ws
( s )? # (1)
://
( [0-9\.a-zA-Z\-_]+ ) # (2)
:
( [\d]+ ) # (3)
( [/([0-9\.a-zA-Z\-_]+ )? # (4)
So, its good to see (know) the original regex is not what you thought it was.