Why doesn't this regex capture group repeat for each match? - regex

I'm testing this on regex101.com
Regex: ^\+([0-9A-Za-z-]+)(?:\.([0-9A-Za-z-]+))*$
Test string: +beta-bar.baz-bz.fd.zz
The string matches, but the "match information" box shows that there are only two capture groups:
MATCH 1
1. [1-9] `beta-bar`
2. [20-22] `zz`
I was expecting all these captures:
beta-bar
baz-bz
fd
zz
Why didn't each identifier between periods get recognized as its own captured group?

This happens because a quantified capture group keeps only its last iteration: when the group matches n times, each new capture overwrites the previous one, so only the final captured text is stored and returned.
Instead of matching those parts, you can preg_split the string you have with a simple regex [+.]:
$str = "+beta-bar.baz-bz.fd.zz";
$a = preg_split('/[+.]/', $str, -1, PREG_SPLIT_NO_EMPTY);
Result:
Array
(
[0] => beta-bar
[1] => baz-bz
[2] => fd
[3] => zz
)
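The same behaviour can be reproduced outside PHP. Here is a minimal sketch in Python's re module (a substitution for illustration; the answer above uses PHP's preg_split). It shows that the repeated group keeps only its last capture, and that splitting on [+.] recovers every identifier.

```python
import re

s = "+beta-bar.baz-bz.fd.zz"

# Group 2 sits inside a repeated (?:...)*, so each iteration overwrites it.
m = re.match(r"^\+([0-9A-Za-z-]+)(?:\.([0-9A-Za-z-]+))*$", s)
groups = m.groups()

# Splitting on '+' or '.' and dropping empty pieces returns every identifier.
parts = [p for p in re.split(r"[+.]", s) if p]

print(groups)  # ('beta-bar', 'zz')
print(parts)   # ['beta-bar', 'baz-bz', 'fd', 'zz']
```

The list comprehension plays the role of PREG_SPLIT_NO_EMPTY: it drops the empty string that the leading + would otherwise produce.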

Related

Regex to match substrings containing n non-repeated characters

I am facing a (naive) problem with regular expression.
I need to find any substrings composed of a fixed number (n) of different characters.
So, for "aaabcddd", if n=3 the substrings that I expect to find are: "abc" and "bcd".
My idea is to use n-1 capture groups and '[^' to exclude characters already matched. Thus, I wrote the following Perl regex (in Julia):
r"(([[:alpha:]])[^\2])[^\1]"
But, it is not working.
Do you have any tips?
You cannot use a backreference to a capture group inside a negated character class such as [^\1]; inside a character class, \1 is not treated as a backreference.
What you can do is use a negative lookahead to assert that what is directly to the right of the current position is not what you have already captured in a previous group.
If that is the case, capture a single alpha in a new group.
The matches abc and bcd are in capture group 1
(?=(([[:alpha:]])(?!\2)([[:alpha:]])(?!\3|\2)[[:alpha:]]))
(?= Positive lookahead
( Capture group 1
([[:alpha:]]) Capture the first char in group 2
(?!\2)([[:alpha:]]) If not looking at what is captured by group 2 to the right, capture the second char in group 3
(?!\3|\2) If not looking to the right at what is captured by group 2 or 3
[[:alpha:]] Match the 3rd char
) Close group 1
) Close the lookahead
Or a bit shorter using a case insensitive match:
(?=(([a-z])(?!\2)([a-z])(?!\3|\2)[a-z]))
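To see the lookahead pattern at work on the question's sample string, here is a small sketch in Python (an assumption for illustration; the question uses Julia with a Perl-style regex). Python's re module has no POSIX classes, so [a-z] stands in for [[:alpha:]].

```python
import re

# [a-z] replaces [[:alpha:]] because Python's re has no POSIX classes.
pattern = re.compile(r"(?=(([a-z])(?!\2)([a-z])(?!\3|\2)[a-z]))")

# finditer tries every position; group 1 holds the three-character window.
hits = [m.group(1) for m in pattern.finditer("aaabcddd")]

print(hits)  # ['abc', 'bcd']
```

Because the whole pattern is a lookahead, each match is zero-width, which is what lets the windows abc and bcd overlap.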
Here is a solution to an arbitrary value of n characters:
#!/usr/local/bin/perl
use strict; use warnings; use feature ':5.10';
my $s="aaabcded";
my $n=3;
while ($s=~/(?=([[:alpha:]]{$n}))/g){
my $hit=$1;
my @chars = split //, $hit;
my %uniq;
@uniq{@chars} = ();
say "$hit" if (scalar keys %uniq) == $n;
}
Running with $n=3 prints:
abc
bcd
cde
Running with $n=4 prints:
abcd
bcde
And $n=5:
abcde
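The same idea carries over to other languages. Below is a rough Python equivalent of the Perl loop, assuming ASCII letters only ([A-Za-z] in place of [[:alpha:]]); the function name distinct_windows is made up for this sketch.

```python
import re

def distinct_windows(s, n):
    """Return every length-n run of letters in s whose characters all differ."""
    # The lookahead makes the matches overlap, like the Perl /(?=(...))/g loop.
    window = re.compile(r"(?=([A-Za-z]{%d}))" % n)
    # A window qualifies when its set of characters has exactly n members.
    return [m.group(1) for m in window.finditer(s) if len(set(m.group(1))) == n]

print(distinct_windows("aaabcded", 3))  # ['abc', 'bcd', 'cde']
print(distinct_windows("aaabcded", 4))  # ['abcd', 'bcde']
print(distinct_windows("aaabcded", 5))  # ['abcde']
```

The uniqueness check is done in code rather than in the regex, so n can be any value without rewriting the pattern.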

How to split string by slash which is not between numbers?

I am using preg_split function below:
$splitted = preg_split('#[/\\\\\_\s]+#u', $string);
Input: "925/123 Black/Jack"
Current split result:
[
0 => '925',
1 => '123',
2 => 'Black',
3 => 'Jack'
]
Split result I want:
[
0 => '925/123',
1 => 'Black',
2 => 'Jack'
]
You may use
preg_split('#(?:[\s\\\\_]|(?<!\d)/(?!\d))+#u', '925/123 Black/Jack')
Details
(?: - start of a non-capturing group:
[\s\\_] - a whitespace, \ or _
| - or
(?<!\d)/(?!\d) - a / that is neither preceded nor followed by a digit
)+ - end of a non-capturing group, repeat 1 or more times.
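The lookaround version is not specific to PHP. As a quick sketch, the same pattern can be tried with Python's re.split (a substitution for illustration; the answer itself uses preg_split):

```python
import re

# Split on whitespace, backslash, underscore, or a slash with no digit on either side.
separator = re.compile(r"(?:[\s\\_]|(?<!\d)/(?!\d))+")

result = separator.split("925/123 Black/Jack")

print(result)  # ['925/123', 'Black', 'Jack']
```

The slash in 925/123 has a digit on both sides, so both lookarounds reject it and it is never treated as a separator.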
One option is to match 1 or more digits separated by a forward slash, with whitespace boundaries on the left and on the right.
Then use (*SKIP)(*F) to discard that match, and otherwise match 1 or more of the characters listed in the character class. Note that you don't have to escape the underscore.
(?<!\S)\d+(?:/\d+)+(?!\S)(*SKIP)(*F)|[/\\_\s]+
Explanation
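(?<!\S)\d+(?:/\d+)+(?!\S) Match runs of digits separated by forward slashes, with whitespace boundaries on both sides
(*SKIP)(*F) Fail the match and skip past the text consumed so far, so it is never split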
| Or
[/\\_\s]+ Match 1+ occurrences of any of the listed
For example
$string = "925/123 Black/Jack";
$pattern = "#(?<!\S)\d+(?:/\d+)+(?!\S)(*SKIP)(*F)|[/\\\\_\s]+#u";
$splitted = preg_split($pattern, $string);
print_r($splitted);
Output
Array
(
[0] => 925/123
[1] => Black
[2] => Jack
)
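The (*SKIP)(*F) verbs are PCRE features, and Python's standard re module does not support them. A rough way to get the same effect there is to match either the protected token or a separator, and split only where the separator alternative matched. This is a substitute technique, not the PCRE mechanism itself, and the function name split_keeping_fractions is made up for this sketch.

```python
import re

# First alternative: a digit/digit token bounded by whitespace (left uncaptured).
# Second alternative: a real separator run, captured in group 1.
token_or_sep = re.compile(r"(?<!\S)\d+(?:/\d+)+(?!\S)|([/\\_\s]+)")

def split_keeping_fractions(text):
    parts, last = [], 0
    for m in token_or_sep.finditer(text):
        if m.group(1) is None:
            continue  # protected token: consume it without splitting
        parts.append(text[last:m.start()])
        last = m.end()
    parts.append(text[last:])
    return parts

print(split_keeping_fractions("925/123 Black/Jack"))  # ['925/123', 'Black', 'Jack']
```

Putting the protected token first in the alternation matters: it consumes 925/123 as a whole before the separator alternative can see the slash inside it.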
Your regex is unnecessarily complicated. You need to split your string on:
either a space (maybe more generally - a sequence of white chars),
or a slash
not preceded by a digit (negative lookbehind),
not followed by a digit (negative lookahead).
So the regex you need (enclosed in # chars, with doubled backslashes) is:
#(?<!\\d)/(?!\\d)|\\s+#
Example of code:
$string = "925/123 Black/Jack";
$pattern = "#(?<!\\d)/(?!\\d)|\\s+#";
$splitted = preg_split($pattern, $string);
print_r($splitted);
prints just what you want:
Array
(
[0] => 925/123
[1] => Black
[2] => Jack
)

Will a lookahead in regular expressions always not capture or does it depend?

I've been reading some articles on non-capturing groups on this site and on the net
(such as http://www.regular-expressions.info/brackets.html and http://www.asiteaboutnothing.net/regexp/regex-disambiguation.html, What does the "?:^" regular expression mean?, What is a non-capturing group? What does a question mark followed by a colon (?:) mean?)
I am clear on the meaning of (?:foo). What I am unclear about is (?=foo). Is (?=foo) also always a non-capturing group, or does it depend?
No, (?=foo) will not capture "foo". Any look-around assertion (negative- and positive look ahead & behind) will not capture, but only check the presence (or absence) of text.
For example, the regex:
(X(?=\d+))
matches "X" only when there's one or more digits after it. However, these digits are not a part of match group 1.
To capture the text that the lookahead checks, put a capture group inside the lookahead. For example, the regex:
(X(?=(\d+)))
matches "X" only when there's one or more digits after it. And these digits are captured in match group 2.
A PHP demo:
<?php
$s = 'X123';
preg_match_all('/(X(?=(\d+)))/', $s, $matches);
print_r($matches);
?>
will print:
Array
(
[0] => Array
(
[0] => X
)
[1] => Array
(
[0] => X
)
[2] => Array
(
[0] => 123
)
)
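The behaviour is the same in other regex engines. Here is a small sketch in Python (a substitution for illustration; the demo above is PHP) that compares the two patterns side by side:

```python
import re

without_inner = re.search(r"(X(?=\d+))", "X123")
with_inner = re.search(r"(X(?=(\d+)))", "X123")

# The lookahead is zero-width, so the overall match is just "X" in both cases.
print(without_inner.group(0))  # X
print(without_inner.groups())  # ('X',)

# A capture group placed inside the lookahead still records what it saw.
print(with_inner.group(0))     # X
print(with_inner.groups())     # ('X', '123')
```

In both cases the digits are checked but not consumed; only the inner (\d+) group makes them available afterwards.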
Lookarounds are always non-capturing and zero-width.
Most groups that start with (? are non-capturing, but not all: named groups such as (?<name>foo) or (?P<name>foo) do capture. Among the non-capturing forms, only (?:foo) works as a regular group.

Trying to create a RegEx for the following patterns

Here are the patterns:
Red,Green (and so on...)
Red (+5.00),Green (+6.00) (and so on...)
Red (+5.00,+10.00),Green (+6.00,+20.00) (and so on...)
Red (+5.00),Green (and so on...)
Each attribute ("Red", "Green") can have 0, 1, or 2 modifiers (shown as "+5.00,+10.00", etc.).
I need to capture each of the attributes and their modifiers as a single string (i.e. "Red (+5.00,+10.00)", "Green (+6.00,+20.00)").
Help?
Another example (PCRE):
((?:Red|Green)(?:\s\((?:\+\d+\.\d+,?)+\))?)
Explanation:
(...) // a capture group
(?:...) // a non-capturing group
Red|Green // matches Red or Green
(?:...)? // an optional non-capturing group
\s // matches any whitespace character
\( // matches a literal (
(?:...)+ // a non-capturing group that can occur one or more times
\+ // matches a literal +
\d+ // matches one or more digits
\. // matches a literal .
\d+ // matches one or more digits
,? // matches an optional comma
\) //matches a literal )
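As a quick check of the pattern explained above, here is a sketch that runs it over the four sample inputs from the question. It uses Python's re module, assumed here for illustration; the pattern uses no PCRE-only syntax, so it is unchanged.

```python
import re

attribute = re.compile(r"((?:Red|Green)(?:\s\((?:\+\d+\.\d+,?)+\))?)")

samples = [
    "Red,Green",
    "Red (+5.00),Green (+6.00)",
    "Red (+5.00,+10.00),Green (+6.00,+20.00)",
    "Red (+5.00),Green",
]

# findall returns the content of group 1 for every match in the string.
results = [attribute.findall(s) for s in samples]

for found in results:
    print(found)
# ['Red', 'Green']
# ['Red (+5.00)', 'Green (+6.00)']
# ['Red (+5.00,+10.00)', 'Green (+6.00,+20.00)']
# ['Red (+5.00)', 'Green']
```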
Update:
Or actually if you just want to extract the data, then
((?:Red|Green)(?:\s\([^)]+\))?)
would be sufficient.
Update 2: As pointed out in your comment, this would match anything in the first part but , and (:
([^,(]+(?:\s\([^)]+\))?)
(does not work, too permissive)
To be more restrictive (allowing only letters, digits, and underscores), you can just use \w:
(\w+(?:\s\([^)]+\))?)
Update 3:
I see, the first of my alternatives does not work correctly, but \w works:
$pattern = "#\w+(?:\s\([^)]+\))?#";
$str = "foo (+15.00,-10.00),bar (-10.00,+25),baz,bing,bam (150.00,-5000.00)";
$matches = array();
preg_match_all($pattern, $str, $matches);
print_r($matches);
prints
Array
(
[0] => Array
(
[0] => foo (+15.00,-10.00)
[1] => bar (-10.00,+25)
[2] => baz
[3] => bing
[4] => bam (150.00,-5000.00)
)
)
Update 4:
Ok, I got something working, please check whether it always works:
(?=[^-+,.]+)[^(),]+(?:\s?\((?:[-+\d.]+,?)+\))?
With:
$pattern = "#(?=[^-+,.]+)[^(),]+(?:\s?\((?:[-+\d.]+,?)+\))?#";
$str = "5 lb. (+15.00,-10.00),bar (-10.00,+25),baz,bing,bam (150.00,-5000.00)";
preg_match_all gives me
Array
(
[0] => Array
(
[0] => 5 lb. (+15.00,-10.00)
[1] => bar (-10.00,+25)
[2] => baz
[3] => bing
[4] => bam (150.00,-5000.00)
)
)
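One way to check whether the Update 4 pattern keeps working is to put it in a small script and add inputs as they come up. A minimal sketch in Python (assumed here for illustration; the answer uses PHP), covering the string above and one sample from the question:

```python
import re

pattern = re.compile(r"(?=[^-+,.]+)[^(),]+(?:\s?\((?:[-+\d.]+,?)+\))?")

text = "5 lb. (+15.00,-10.00),bar (-10.00,+25),baz,bing,bam (150.00,-5000.00)"

# The pattern has no capture groups, so findall returns the full matches.
matches = pattern.findall(text)
print(matches)
# ['5 lb. (+15.00,-10.00)', 'bar (-10.00,+25)', 'baz', 'bing', 'bam (150.00,-5000.00)']

print(pattern.findall("Red (+5.00),Green"))  # ['Red (+5.00)', 'Green']
```

This only confirms the listed inputs; it does not show that the pattern is correct for every possible attribute name.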
Maybe there is a simpler regex, I'm not an expert...
PCRE format:
(Red|Green)(\s\((?P<val1>.+?)(,){0,1}(?P<val2>.+?){0,1}\)){0,1}
Match from PHP:
preg_match_all("/(Red|Green)(\s\((?P<val1>.+?)(,){0,1}(?P<val2>.+?){0,1}\)){0,1}/ims", $text, $matches);
Here's my bid:
/
(?:^|,) # Match line beginning or a comma
(?: # parent wrapper to catch multiple "color (+#.##)" patterns
( # grouping pattern for picking off matches
(?:(?:Red|Green),?)+ # match the color prefix
\s\( # space then parenthesis
(?: # wrapper for repeated number groups
(?:\x2B\d+\.\d+) # pattern for the +#.##
,?)+ # end wrapper
\) # closing parenthesis
)+ # end matching pattern
)+ # end parent wrapper
/
Which translates to:
/(?:^|,)(?:((?:(?:Red|Green),?)+\s\((?:(?:\x2B\d+\.\d+),?)+\))+)+/
EDIT
Sorry, it was only catching the last pattern before. This will catch all matches (or should).