RegEx for matching subexpression - c++

When I use a regular expression like
std::regex midiNoteNameRegex("([cdefgab])([b#]{0,1})([0-9]))|([0-9]{3})|([A-Z0-9]{2})");
there are three top-level subexpressions connected by "|" in the pattern of which one will match.
Is there a way to tell which one? Other than testing them sequentially one after the other?
If I would use named subexpressions it would be easy, but there are no named subexpressions in C++.
How do I solve this problem?

Given the groups in your regex, it's just a flat search of the match object,
which in C++ is a flag (int) check, with no noticeable overhead.
( [cdefgab] ) # (1)
( [b#]{0,1} ) # (2)
( [0-9] ) # (3)
| ( [0-9]{3} ) # (4)
| ( [A-Z0-9]{2} ) # (5)
And a possible usage
wregex MyRx = wregex( "([cdefgab])([b#]{0,1})([0-9])|([0-9]{3})|([A-Z0-9]{2})", 0);
wstring::const_iterator start = str.begin();
wstring::const_iterator end = str.end();
wsmatch m;
while ( regex_search( start, end, m, MyRx ) )
{
if ( m[1].matched )
// First alternation
else
if ( m[4].matched )
// Second alternation
else
if ( m[5].matched )
// Third alternation
start = m[0].second;
}

I don't have a definite answer but I believe the answer is most likely no.
Named capturing group is not a required feature: http://www.cplusplus.com/reference/regex/ECMAScript/
Implementation of named capturing group is probably not trivial and probably brings down the performance of the regex engine.
Found another post on this issue that agrees with me: C++ regex: Which group matched?

Related

Regex: Match every occurrence of dash in between two specific symbols

I have this string: ](-test-word-another)
I'm trying to find every single occurrence of - in between ] and )
Basically the return should be: ](-test-word-another)
Currently I have (?<=\]\()(-)(?=\)) but that just finds if there is only one -
Thank you in advance
Try this: /(?<=\].*)-(?=.*\))/gm
Test here: https://regex101.com/r/xkmTZs/3
This basically matches all - only if they occur after a ] and before ).
You can use
text.replace(/]\([^()]*\)/g, (x) => x.replace(/-/g, '_'))
See the demo below:
const text = 'Word-word ](-test-word-another) text-text-.';
console.log( text.replace(/]\([^()]*\)/g, (x) => x.replace(/-/g, '_')) );
The ]\([^()]*\) regex matches a ]( substring, then any zero or more chars other than ( and ) and then a ), and then all - chars inside the match value (x here) are replaced with _ using (x) => x.replace(/-/g, '_').
Another, single regex solution can look like
(?<=]\((?:(?!]\()[^)])*)-(?=[^()]*\))
See this regex demo. It matches any - that is immediately preceded with ]( and any zero or more chars other than ) that does not start a ]( char sequence, and is immediately followed with any zero or more chars other than ( and ) and then a ) char.

Find any permutation of a set using Perl's RegEx

I need to find a way of checking for the existence of sets of the type {1,2,3,4,5,6,8,9,10}, that have a preset number of elements. Also, notice the missing 7. Obviously the numbers could be in any order and should appear only once, since according to definition, {1,2,3} = {3,2,1} = {1,2,3,3} = ... and so forth.
How could I do this with Perl (or is it even possible)? One thing I tried was
{([1-6],|[8-9],|10,){8}([1-6]|[8-9]|10)} here, but this doesn't take care of the multiple instances of the same number within the brackets.
Regexes are almost certainly the wrong tool here. You want something that deals with permutations of an input list.
This blog post gives a useful overview of Perl modules that deal with permutations and combinations. Sounds to me like Algorithm::Combinatorics would be a good place to start. Something like this, perhaps:
use Algorithm::Combinatorics;
my #input = qw[1 2 3 4 5 6 8 9 10];
my #perms = permutations(\#input);
You then need some way to compare the valid permutations with the sets you want to test. I'd consider constructing a string representation of the sets (by joining them with a known delimiter) and doing a simple string comparison.
my #perm_strs = map { join ':' } #perms;
my #test = qw[2 4 3 1 10 5 9 8 6];
my $test_str = join ':', #test;
my $match = 0;
for (#perm_strs) {
if ($test_str eq $_) {
$match = 1;
last;
}
}
The success of the match is now in $match.
This regex does that.
Here 10 slots are allocated, but you can add as many as you want ( a hundred ? ).
It doesn't mean you have to match 10 unique numbers in a set,
You can match anything less than or equal to 10 (example {5}),
or even a range like {3,7}
The slots will be filled sequentially starting from 1.
So, you just have to sit in a loop from 1 - N, seeing if it is defined.
If you're looking for speed, this is the demon you want !
/\{(?>(?>(?(1)(?!))((?&GetNum))|(?(2)(?!))((?&GetNum))|(?(3)(?!))((?&GetNum))|(?(4)(?!))((?&GetNum))|(?(5)(?!))((?&GetNum))|(?(6)(?!))((?&GetNum))|(?(7)(?!))((?&GetNum))|(?(8)(?!))((?&GetNum))|(?(9)(?!))((?&GetNum))|(?(10)(?!))((?&GetNum)))(?:,(?!\})|(?=\}))){3,7}\}(?(DEFINE)(?<GetNum>(?!(?:\g{1}|\g{2}|\g{3}|\g{4}|\g{5}|\g{6}|\g{7}|\g{8}|\g{9}|\g{10})\b)\d+))/
https://regex101.com/r/pPwPTe/1
Readable regex
# Unique numbers in set, 10 slots
\{
(?> # Atomic, no backtracking allowed
(?> # ditto
(?(1) (?!) ) ( (?&GetNum) ) # (1), Slot 1
| (?(2) (?!) ) ( (?&GetNum) ) # (2), Slot 2
| (?(3) (?!) ) ( (?&GetNum) ) # (3), Slot 3
| (?(4) (?!) ) ( (?&GetNum) ) # (4), Slot 4
| (?(5) (?!) ) ( (?&GetNum) ) # (5), Slot 5
| (?(6) (?!) ) ( (?&GetNum) ) # (6), Slot 6
| (?(7) (?!) ) ( (?&GetNum) ) # (7), Slot 7
| (?(8) (?!) ) ( (?&GetNum) ) # (8), Slot 8
| (?(9) (?!) ) ( (?&GetNum) ) # (9), Slot 9
| (?(10) (?!) ) ( (?&GetNum) ) # (10), Slot 10
)
(?: , (?! \} ) | (?= \} ) )
){3,7} # Set range, example: 3 to 7 unique numbers in set
\}
(?(DEFINE)
(?<GetNum> # (4) Get a new number, must not be seen before
(?! (?: \g{1}|\g{2}|\g{3}|\g{4}|\g{5}|\g{6}|\g{7}|\g{8}|\g{9}|\g{10} ) \b )
\d+
)
)
Given front matter and test cases of
#! /usr/bin/env perl
use strict;
use warnings;
my #tests = (
"{}",
"{1,1}",
"{1,2,3,4,5,6,8,9,10}",
"{1,1,2,3,4,5,6,8,9,10}",
"{1,2,3,4,5,6,7,8,9,10}",
"{10,9,8,7,6,5,4,3,2,1}",
"{10,9,8,6,5,4,3,2,1}",
"{10,9,8,6,5,4,3,2,1",
"{10,9,8,6,5,4,3,2,1,1}",
"{2,4,6,8,10,9,5,3,1}",
);
you have at least three approaches to implementing what you want.
Brute force
When in doubt, try a bigger hammer. Generate all permutations and bake those into your pattern directly. Note that this has a factorial cost, so it quickly becomes intractable as the number of elements in your set grows.
# perlfaq4: How do I permute N elements of a list?
sub permute (&#) {
my $code = shift;
my #idx = 0..$#_;
while ( $code->(#_[#idx]) ) {
my $p = $#idx;
--$p while $idx[$p-1] > $idx[$p];
my $q = $p or return;
push #idx, reverse splice #idx, $p;
++$q while $idx[$p-1] > $idx[$q];
#idx[$p-1,$q]=#idx[$q,$p-1];
}
}
my $brute_force;
permute { local $" = ",";
$brute_force .= "|" if $brute_force;
$brute_force .= "{#_}" }
#members;
$brute_force = qr/ ^ (?: $brute_force ) $/x;
for (#tests) {
my $result = /$brute_force/x ? "ACCEPT" : "REJECT";
print "$_ - $result\n";
}
Generating all permutations on my laptop takes about 3 minutes. Precomputing the pattern may or may not make sense depending on your application.
Piggyback on the regex engine’s backtracking
One way to do it is to take advantage of the Perl regex engine’s backtracking and running (?{ code }) at various points within your pattern.
Define members of your set as below. Note that these must be global variables because of limitations of the regex engine, so use our and not my.
# must use package variables inside (?{ })
our #members = (1 .. 6, 8 .. 10);
our %remaining;
A pattern that matches permutations becomes
my $permutation = qr!
\{ (?{ #remaining{#members} = map +($_ => 1), #members })
( ([0-9]+), (?(?{ delete local $remaining{$^N} })|(*FAIL)))+
([0-9]+)\} (?(?{ delete local $remaining{$^N} && keys %remaining == 0 })|(*FAIL))
!x;
Code inside (?{ code }) sections runs at corresponding points of the pattern match. For example, the first one initializes the hash %remaining to contain all members of the set as keys.
The second and third (?{ code }) sections are within (?(condition)yes-pattern|no-pattern) sections and (*FAIL) backtracking control verbs. For any member before the last in the set (which we know because it is terminated by a comma), the member just matched, available in the $^N special variable, must be still available in %remaining. For the last member (terminated by right curly brace), the member just matched must be available and we must have covered all elements of the set to succeed. If these constraints are met, we match against an empty yes-pattern and continue successfully, but if one of these conditions fails, we meet (*FAIL) in the no-pattern. This causes the current attempted match to fail and the regex engine backtracks to attempt the next possibility.
Writing delete local localizes deletion of the particular key from %remaining. This delegates the error-prone bookkeeping to the regex engine that correctly restores localized values when it backtracks past a non-viable match.
Note that this implementation requires a set of at least two members.
Use it as in
for (#tests) {
my $result = /^ $permutation $/x ? "ACCEPT" : "REJECT";
print "$_ - $result\n";
}
Hybrid approach
Finally, combine the approaches by searching for everything that looks like a set and reject invalid permutations.
sub _assert_permutation_of {
my($members,$set) = #_;
my %seen = map +($_ => 1), #$members;
while ($set =~ /\b([0-9]+)\b/g) {
return unless delete $seen{$1};
}
keys %seen == 0;
}
my $hybrid = qr!
( \{ # opening brace
(?: [0-9]+ , )+ # comma-terminated integers
[0-9]+ # final integer
\} # closing brace
)
(?(?{ _assert_permutation_of \#members, $^N })|(*FAIL))
!x;
for (#tests) {
my $result = /^ $hybrid $/x ? "ACCEPT" : "REJECT";
print "$_ - $result\n";
}
Test output
For all three, the output is
{} - REJECT
{1,1} - REJECT
{1,2,3,4,5,6,8,9,10} - ACCEPT
{1,1,2,3,4,5,6,8,9,10} - REJECT
{1,2,3,4,5,6,7,8,9,10} - REJECT
{10,9,8,7,6,5,4,3,2,1} - REJECT
{10,9,8,6,5,4,3,2,1} - ACCEPT
{10,9,8,6,5,4,3,2,1 - REJECT
{10,9,8,6,5,4,3,2,1,1} - REJECT
{2,4,6,8,10,9,5,3,1} - ACCEPT

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( #{$body} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
if ( $delimit ) {
next if (/$delimit/ && ! $tp);
last if (/$delimit/ && $tp);
$tp = 1, next if /text.plain/;
$tp = 0, next if /text.html/;
s/<[^>]*>//g;
$newbody .= $_ if $tp;
} else {
s/<[^>]*>//g;
$newbody .= $_ ;
}
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( #{$body_text} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/^\s*NEMS\s*=\s*(\d+)/i;
$nems = $1;
next;
}
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
what includes the clarification that the new $body_text is a multiline string.
It is clear that $nems is declared (needed) outside of the loop and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit It's been clarified that there can only be 5 or 7 digits. Then the regex can be tightened, to check whether input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut, for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The clarification of the evolved problem understanding states that all digits following = need be captured into $nems (and there may be 5,(not 6),7,8,9,10 digits). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means a digit, one or more times. So a string of digits (match fails if there are none).

Stripping comments from Forth source code using regular expressions

I am trying to match all content between parentheses, including parentheses in a non-greedy way. There should be a space before and after the opening parentheses (or the start of a line before the opening parentheses) and a space before and after the closing parentheses. Take the following text:
( )
( This is a comment )
1 2 +
\ a
: square dup * ;
( foo bar
baz )
(quux)
( ( )
(
( )
The first line should be matched, the second line including its content should be matched, the second last line should not be matched (or raise an error) and the last line should be matched. The two lines foo bar baz should be matched, but (quux) should not as it doesn't contain a space before and after the parentheses. The line with the extra opening parentheses inside should be matched.
I tried a few conventional regexes for matching content between parentheses but without much success. The regex engine is that of Go's.
re := regexp.MustCompile(`(?s)\(( | .*? )\)`)
s = re.ReplaceAllString(s, "")
Playground: https://play.golang.org/p/t93tc_hWAG
Regular expressions "can't count" (that's over-simplified, but bear with me), so you can't match on an unbounded amount of parenthesis nesting. I guess you're mostly concerned about matching only a single level in this case, so you would need to use something like:
foo := regexp.MustCompile(`^ *\( ([^ ]| [^)]*? \)$`)
This does require the comment to be the very last thing on a line, so it may be better to add "match zero or more spaces" there. This does NOT match the string "( ( ) )" or try to cater for arbitrary nesting, as that's well outside the counting that regular expressions can do.
What they can do in terms of counting is "count a specific number of times", they can't "count how many blah, then make sure there's the same number of floobs" (that requires going from a regular expression to a context-free grammar).
Playground
Here is a way to match all the 3 lines in question:
(?m)^[\t\p{Zs}]*\([\pZs}\t](?:[^()\n]*[\pZs}\t])?\)[\pZs}\t]*$
See the Go regex demo at the new regex101.com
Details:
(?m) - multiline mode on
^ - due to the above, the start of a line
[\t\p{Zs}]* - 0+ horizontal whitespaces
\( - a (
[\pZs}\t] - exactly 1 horizontal whitespace
(?:[^()\n]*[\pZs}\t])? - an optional sequence matching:
[^()\n]* - a negated character class matching 0+ characters other than (, ) and a newline
[\pZs}\t] - horizontal whitespace
\) - a literal )
[\pZs}\t]* - 0+ horizontal whitespaces
$ - due to (?m), the end of a line.
Go playground demo:
package main
import (
"regexp"
"fmt"
)
func main() {
var re = regexp.MustCompile(`(?m)^[\t\p{Zs}]*\([\pZs}\t](?:[^()\n]*[\pZs}\t])?\)[\pZs}\t]*$`)
var str = ` ( )
( This is a comment )
1 2 +
\ a
: square dup * ;
( foo bar
baz )
(quux)
( ( )
(
( )`
for i, match := range re.FindAllString(str, -1) {
fmt.Println("'", match, "' (found at index", i, ")")
}
}

Regex pattern works in javascript but fails in scala with PatternSyntaxException: Unclosed character class

Here is the regex:
ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?
Here is a test pattern:
wss://beta5.max.com:18989/abcde.html
softlion.com likes it:
Test results
Match count: 1
Global matches:
wss://beta5.max.com:18989/abcde.html
Value of each capturing group:
0 1 2 3 4
wss://beta5.max.com:18989/abcde.html s beta5.max.com 18989 /abcde.html
scala does not:
val regex = """ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?""".r
Exception in thread "main" java.util.regex.PatternSyntaxException: Unclosed character class near index 58
ws(s)?://([0-9\.a-zA-Z\-_]+):([\d]+)([/([0-9\.a-zA-Z\-_]+)?
My bad, I had an extra [ at the front of the last capturing group.
([/([0-9.a-zA-Z-_]+)?
Java allows intersections and all that, so error ..
ws
( s )?
://
( [0-9\.a-zA-Z\-_]+ )
:
( [\d]+ )
= ( <-- Unbalanced '('
= [ <-- Unbalanced '['
/
( [0-9\.a-zA-Z\-_]+ )?
With everybody else its no problem:
ws
( s )? # (1)
://
( [0-9\.a-zA-Z\-_]+ ) # (2)
:
( [\d]+ ) # (3)
( [/([0-9\.a-zA-Z\-_]+ )? # (4)
So, its good to see (know) the original regex is not what you thought it was.