Non-Capturing and Capturing Groups - The right way - regex

I'm trying to match an array of elements preceded by a specific string in a line of text. For example, match all pets in the text below:
fruits:apple,banana;pets:cat,dog,bird;colors:green,blue
/(?:pets:)(\w+[,|;])+/g
Using this regex, I could only match the last word, "bird".
Can anybody help me to understand the right way of using Non-Capturing and Capturing Groups?
Thanks!

First, let's talk about capturing and non-capturing groups:
(?:...) is the non-capturing version: these characters must be present, but you don't need them in the result.
(...) is the capturing version: these are the values you want to extract.
So:
(?:pets:) means you are searching for "pets:" but don't want to capture it. After that point, you WANT to capture (if I've understood correctly).
So try (?:pets:)([a-zA-Z,]+); ... You are searching for "pets:" (but don't want it) and stop at the first ";" (which you don't want either).
Result:
Match 1 : cat,dog,bird
A better solution exists with 1 match == 1 pet.
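To see the difference concretely, here is a minimal sketch in Python's re module, which uses the same (?:...) and (...) syntax for these constructs. The variable names are illustrative only. It shows that a repeated capturing group keeps only its last repetition ("bird;" here), which is why the pattern in the question appeared to match only the last pet, while a single group around the whole list keeps everything.

```python
import re

text = "fruits:apple,banana;pets:cat,dog,bird;colors:green,blue"

# A capturing group that is repeated with + keeps only its final repetition.
repeated = re.search(r"(?:pets:)(\w+[,|;])+", text)
last_only = repeated.group(1)

# One group around the whole list captures every pet; split afterwards.
whole = re.search(r"(?:pets:)([a-zA-Z,]+);", text)
pets = whole.group(1).split(",")

print(last_only)
print(pets)
```

Splitting after the match is the simplest way to get one item per pet when the engine offers no way to continue from the previous match.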

Since you want each pet in a separate match and you are using PCRE, \G is, as suggested by Wiktor, a decent option:
(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)
Explanation:
1st alternative (?:pets:) finds the start of the pattern
2nd alternative \G(?!^)(\w+)(?:[,;]|$)
\G asserts the position at the end of the previous match, or at the start of the string for the first match
The negative lookahead (?!^) asserts that this alternative does not match at the start of the string
(\w+) matches and captures each pet
The non-capturing group (?:[,;]|$) acts as a delimiter: it matches a single character from the list ,; or, via $, asserts the position at the end of the string
Perl Code Sample:
use strict;
use Data::Dumper;
my $str = 'fruits:apple,banana;pets:cat,dog,bird;colors:green,blue';
my $regex = qr/(?:pets:)|\G(?!^)(\w+)(?:[,;]|$)/mp;
my @result = ();
while ( $str =~ /$regex/g ) {
    # $1 is undefined when only the "pets:" alternative matched
    if (defined $1 && $1 ne '') {
        #print "$1\n";
        push @result, $1;
    }
}
print Dumper(\@result);
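Python's standard re module does not support \G, so the sketch below swaps in a different technique to get one item per pet: it first isolates the segment after the label, then collects the words inside it. The helper name list_after is hypothetical, and this is an alternative under that assumption, not a port of the Perl loop above.

```python
import re

text = "fruits:apple,banana;pets:cat,dog,bird;colors:green,blue"

def list_after(label, s):
    """Return the items that follow `label:` up to the next ';' or the end."""
    # (?:^|;) makes sure the label starts a segment rather than ending a word.
    segment = re.search(r"(?:^|;)" + re.escape(label) + r":([^;]*)", s)
    if segment is None:
        return []
    return re.findall(r"\w+", segment.group(1))

print(list_after("pets", text))
```

Because the segment is cut at the first ";", the word search cannot run on into the next list, which is the job \G does in the PCRE version.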

Related

Replace a given substring before the digits

In a string like /p20 (the number can be anything), I want to replace /p with /pag- but keep the number, giving /pag-20.
This is what I tried:
preg_replace('/\/p+[0-9]/', '/pag-', $string);
but the result is /pag-0
Use a capturing group and a backreference:
$string = "/p20";
echo preg_replace('~/p([0-9])~', '/pag-$1', $string);
                      ^^^^^^^          ^^
Here, /p matches a literal substring and ([0-9]) matches and captures any 1 digit into Group 1, which can be referred to with the $1 backreference in the replacement pattern.
Alternatively, you may use a lookahead-based solution:
preg_replace('~/p(?=[0-9])~', '/pag-', $string);
Here, no backreference is necessary as the (?=[0-9]) positive lookahead does not consume text it matches, i.e. it does not add the text to the match value.
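The same three behaviours can be checked side by side. This is a small sketch using Python's re.sub on the /p20 input from the question; the variable names are illustrative, and in Python the backreference in the replacement is written \1 rather than $1.

```python
import re

s = "/p20"

# The original attempt consumes the first digit and throws it away.
broken = re.sub(r"/p+[0-9]", "/pag-", s)

# Capture the digit and put it back with a backreference.
with_group = re.sub(r"/p([0-9])", r"/pag-\1", s)

# Or only assert the digit with a lookahead, so it is never consumed.
with_lookahead = re.sub(r"/p(?=[0-9])", "/pag-", s)

print(broken, with_group, with_lookahead)
```

Both fixes leave the digits in place; the lookahead version simply avoids having to rebuild them in the replacement.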

Match a string between multiple whitespaces

Hello, can anyone help me with a regex to match a string between multiple whitespaces?
My string may look like this:
This is just      Nicolas-764 sdh      and his sister
I want to match Nicolas-764 sdh
So far I wrote this, but it matches the whole string after the first run of whitespace:
if ($string =~ m/(just) {5,}(.*) {5,}/) {
    print "$1\n";
    print "$2\n";
}
I want to create a hash that has "just" as the key and "Nicolas-764 sdh" as the value.
I don't want to only match a string between multiple spaces; the match also needs to be anchored on the word "just".
You're suffering from greedy matching .*.
You simply need to change to non-greedy matching using .*?.
use strict;
use warnings;
my $string = 'This is just      Nicolas-764 sdh      and his sister';
if ($string =~ m/just\s{5,}(.*?)\s{5,}/) {
    print "$1\n";
}
Outputs:
Nicolas-764 sdh
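The difference between .* and .*? is easiest to see when the line contains more than one wide gap after the wanted text. The sketch below uses Python's re module and assumes a hypothetical input with a third gap (the trailing "too" is not in the question); it also builds the key/value mapping the question asks for.

```python
import re

GAP = " " * 6  # any run of five or more spaces
text = "This is just" + GAP + "Nicolas-764 sdh" + GAP + "and his sister" + GAP + "too"

# Greedy: .* runs to the end and backtracks only to the LAST wide gap.
greedy = re.search(r"(just) {5,}(.*) {5,}", text)

# Lazy: .*? stops at the FIRST wide gap after the captured text.
lazy = re.search(r"(just)\s{5,}(.*?)\s{5,}", text)

# Key "just", value "Nicolas-764 sdh".
result = {lazy.group(1): lazy.group(2)}

print(repr(greedy.group(2)))
print(result)
```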
Your code would then be:
if ($string =~ m/^.*?just {5,}(\S+)\s+(\S+) {5,}.*$/) {
    print "$1\n";
    print "$2\n";
}
The first group contains Nicolas-764 and the second group contains sdh.
Or you could also try the regex below:
^.*?just {5,}(\S+(?:\s\S+)*?) {5,}.*$
Explanation:
^ Asserts that we are at the start of the line.
.*?just Matches up to the first "just". The ? after * makes the match non-greedy.
{5,} (preceded by a space) Matches 5 or more spaces.
() Capturing group.
\S+ One or more non-space characters.
(?:) Non-capturing group. It only matches; nothing is captured.
(?:\s\S+)*? Matches a whitespace character followed by one or more non-space characters, with the whole group repeated zero or more times, as few times as possible.
{5,} (preceded by a space) Matches 5 or more spaces.
.* Matches any character zero or more times.
$ Asserts that we are at the end of the line.
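One consequence of the breakdown above is that the captured text can only contain single whitespace characters between words, because each repetition of (?:\s\S+) allows exactly one. The following Python sketch checks this; the second input, with two spaces inside the name, is a hypothetical case added for illustration.

```python
import re

GAP = " " * 6  # a run of five or more spaces
pattern = re.compile(r"^.*?just {5,}(\S+(?:\s\S+)*?) {5,}.*$")

single_spaced = "This is just" + GAP + "Nicolas-764 sdh" + GAP + "and his sister"
double_spaced = "This is just" + GAP + "Nicolas-764  sdh" + GAP + "and his sister"

m = pattern.match(single_spaced)
captured = m.group(1) if m else None

# Two spaces inside the name cannot be bridged by \s\S+, so nothing matches.
rejected = pattern.match(double_spaced) is None

print(captured, rejected)
```

If names with wider internal spacing are possible, the simpler lazy (.*?) form is the more forgiving choice.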

Regular expression Capture and Backreference

Here's the string I'm searching.
T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG
I want to capture the letters that follow each number, taking exactly X letters (X being that number). I also want to capture the complete string, including the + and the number.
That is, the captures should return:
+4ACCG
+12CAAGTACTACCG
etc.
and:
ACCG
CAAGTACTACCG
etc.
Here's the regex I'm using:
(\+(\d+)([ATGCatgcnN]){\2});
and I'm using $1 and $3 for the captures.
What am I missing ?
You cannot use a backreference in a quantifier. \1 is an instruction to match what $1 contains, so {\1} is not a valid quantifier. But why do you need to match the exact number? Just match the letters (because the next part starts again with a +).
So try:
(\+\d+([ATGCatgcnN]+));
and find the complete match in $1 and the letters in $2
Another problem in your regex is that your quantifier is outside your third capturing group. That way only the last letter would be in the capturing group. Place the quantifier inside the group to capture the whole sequence.
You can also remove either the upper-case or the lower-case letters from your class by using the i modifier to match case-insensitively:
/(\+\d+([ATGCN]+))/gi
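The effect of placing the quantifier outside or inside the group can be checked on a short piece of the input. This is a sketch in Python's re module with a fixed count of 4 written by hand; it also shows that the open-ended [ATGCN]+ takes every letter up to the next +, which here is one letter more than the count, so it only fits if that extra letter is acceptable.

```python
import re

seq = "+4ACCGT+12CAAGTACTACCG"

# Quantifier outside the group: the group is overwritten on each repetition.
outside = re.search(r"\+\d+([ATGCN]){4}", seq, re.I).group(1)

# Quantifier inside the group: the whole run is captured.
inside = re.search(r"\+\d+([ATGCN]{4})", seq, re.I).group(1)

# Open-ended class: runs until the next character outside the class.
open_ended = re.search(r"(\+\d+([ATGCN]+))", seq, re.I).groups()

print(outside, inside, open_ended)
```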
The loop below works because the \G assertion tells the regex engine to begin the search immediately after the last match (the digits) in the string.
use feature 'say';
$_ = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';
while (/(\d+)/g) {
    my $dig = $1;
    /\G([TAGCN]{$dig})/i;
    say $1;
}
The results are
ACCG
CAAGTACTACCG
CAAGTACTACCG
ACCG
CTACCG
CAAGTACTACCG
CAAGTACTACCG
I think this is correct but not sure :-|
Update: Added the \G assertion which tells the regex to begin immediately after the last matched number.
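The same idea, building the quantifier from the captured number and matching exactly where the digits ended, can be sketched in Python. There is no \G in Python's re module, so this version passes the start position to a compiled pattern's match method instead; the function name counted_runs is hypothetical.

```python
import re

dna = "T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG"

def counted_runs(s):
    """Yield (full, letters) for every +N that is followed by N bases."""
    for num in re.finditer(r"\+(\d+)", s):
        count = int(num.group(1))
        # Build the quantifier from the captured number, then match
        # exactly at the position where the digits ended.
        bases = re.compile(r"[ATGCN]{%d}" % count, re.I).match(s, num.end())
        if bases:
            yield num.group(0) + bases.group(0), bases.group(0)

pairs = list(counted_runs(dna))
for full, letters in pairs:
    print(full, letters)
```

This returns both forms the question asks for: the complete token with its + and count, and the bare letters.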
my @sequences = split(/\+/, $string);
for my $seq (@sequences) {
    my ($bases) = $seq =~ /([^\d]+)/;
}

Match from last occurrence using regex in perl

I have a text like this:
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT rest string
The text is multilined and I need to extract from last occurrence of "*/" until "////RESULT". In this case, the result should be:
select this part on
ly
How to achieve this in perl?
I have attempted \\\*/(.|\n)*////RESULT but that will start from first "*/"
A useful trick in cases like this is to prefix the regexp with the greedy pattern .*, which will try to match as many characters as possible before the rest of the pattern matches. So:
my ($match) = ($string =~ m!^.*\*/(.*?)////RESULT!s);
Let's break this pattern into its components:
^.* starts at the beginning of the string and matches as many characters as it can. (The s modifier allows . to match even newlines.) The beginning-of-string anchor ^ is not strictly necessary, but it ensures that the regexp engine won't waste too much time backtracking if the match fails.
\*/ just matches the literal string */.
(.*?) matches and captures any number of characters; the ? makes it ungreedy, so it prefers to match as few characters as possible in case there's more than one position where the rest of the regexp can match.
Finally, ////RESULT just matches itself.
Since the pattern contains a lot of slashes, and since I wanted to avoid leaning toothpick syndrome, I decided to use alternative regexp delimiters. Exclamation points (!) are a popular choice, since they don't collide with any normal regexp syntax.
Edit: Per discussion with ikegami below, I guess I should note that, if you want to use this regexp as a sub-pattern in a longer regexp, and if you want to guarantee that the string matched by (.*?) will never contain ////RESULT, then you should wrap those parts of the regexp in an independent (?>) subexpression, like this:
my $regexp = qr!\*/(?>(.*?)////RESULT)!s;
...
my ($match) = ($string =~ /^.*$regexp$some_other_regexp/s);
The (?>) causes the pattern inside it to fail rather than accepting a suboptimal match (i.e. one that extends beyond the first substring matching ////RESULT) even if that means that the rest of the regexp will fail to match.
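The effect of the greedy .* prefix can be seen by running the pattern with and without it. This is a sketch in Python's re module on the text from the question, with re.S standing in for Perl's /s modifier; the variable names are illustrative.

```python
import re

text = (
    "hello world /* select a from table_b\n"
    "*/ some other text with new line cha\n"
    "racter and there are some blocks of\n"
    "/* any string */ select this part on\n"
    "ly\n"
    "////RESULT rest string"
)

# Without the prefix, the search starts at the FIRST "*/".
from_first = re.search(r"\*/(.*?)////RESULT", text, re.S).group(1)

# With the greedy prefix, the engine backtracks from the end and
# therefore settles on the LAST "*/" before ////RESULT.
from_last = re.search(r"^.*\*/(.*?)////RESULT", text, re.S).group(1)

print(repr(from_last))
```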
(?:(?!STRING).)*
matches any number of characters that don't contain STRING. It's like [^a], but for strings instead of characters.
You can take shortcuts if you know certain inputs won't be encountered (like Kenosis and Ilmari Karonen did), but this is what matches what you specified:
my ($segment) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
    (?: (?! \*/ ). )*
    \z
}xs;
If you don't care if */ appears after ////RESULT, the following is the safest:
my ($segment) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
}xs;
You didn't specify what should happen if there are two ////RESULT that follow the last */. The above matches until the last one. If you wanted to match until the first one, you'd use
my ($segment) = $string =~ m{
    \*/
    ( (?: (?! \*/ | ////RESULT ). )* )
    ////RESULT
}xs;
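The last two patterns differ only in what the lookahead forbids, and the difference shows up when ////RESULT occurs twice after the last */. The sketch below ports them to Python's re module, using re.X and re.S in place of Perl's /x and /s; the sample string and the constant names are made up for illustration.

```python
import re

# Captured text may not contain "*/"; greedy, so it runs to the LAST ////RESULT.
UNTIL_LAST = re.compile(r"""
    \*/
    ( (?: (?! \*/ ) . )* )
    ////RESULT
""", re.X | re.S)

# Captured text may contain neither "*/" nor "////RESULT": stops at the FIRST one.
UNTIL_FIRST = re.compile(r"""
    \*/
    ( (?: (?! \*/ | ////RESULT ) . )* )
    ////RESULT
""", re.X | re.S)

sample = "/* a */ skipped /* b */ keep ////RESULT more ////RESULT tail"

until_last = UNTIL_LAST.search(sample).group(1)
until_first = UNTIL_FIRST.search(sample).group(1)

print(repr(until_last))
print(repr(until_first))
```

In both cases the first */ is rejected because the capture may not cross the second */, so the match starts at the last comment terminator without needing a greedy prefix.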
Here's one option:
use strict;
use warnings;
my $string = <<'END';
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT
END
my ($segment) = $string =~ m!\*/([^/]+)////RESULT$!s;
print $segment;
Output:
select this part on
ly

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match the comma (,).
Can someone explain why?
I would like to match one or more digits or points, terminated by a comma:
123.456.768,
123,
.,
1.2,
But the following unexpectedly prints "," too:
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
    print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
This is the behaviour defined in the Perl documentation.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
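Python's re.findall follows the same rule as Perl's //g in list context (groups if there are any, otherwise whole matches), so it makes a convenient way to compare the variants. This is a sketch with illustrative variable names, not Perl code.

```python
import re

text = "241.000,00"

# Group around the comma: only the comma is returned.
group_on_comma = re.findall(r"[0-9.]+(,)", text)

# No group at all: the whole match, comma included, is returned.
no_group = re.findall(r"[0-9.]+,", text)

# Lookahead: the comma is required but is not part of the match.
lookahead = re.findall(r"[0-9.]+(?=,)", text)

# Group around the number: only the number is returned.
group_on_number = re.findall(r"([0-9.]+),", text)

print(group_on_comma, no_group, lookahead, group_on_number)
```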
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around the comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or simply leave it outside the group, since it isn't part of what you want to capture; it won't make a difference in this case. However, the pattern as written puts the comma, rather than the number, into capture group 1, and the code never refers to that capture group explicitly.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
    print $1;
}