Perl regex and capturing groups - regex

The following prints ac | a | bbb | c
#!/usr/bin/env perl
use strict;
use warnings;
# use re 'debug';
my $str = 'aacbbbcac';
if ($str =~ m/((a+)?(b+)?(c))*/) {
print "$1 | $2 | $3 | $4\n";
}
It seems like failed matches do not reset the captured group variables.
What am I missing?

It seems like failed matches don't reset the captured group variables
There are no failed matches here: your regex matches the string fine. What does happen is that some inner groups fail to match in some repetitions of the outer group. On each repetition, a group that matches overwrites its previous value, while a group that does not match in that repetition keeps its value from the previous one.
Let's see how regex match proceeds:
First, (a+)?(b+)?(c) matches aac. Since (b+)? is optional and there is no b here, it matches nothing. At this stage, the capture groups contain the following:
$1 contains the entire match - aac
$2 contains the (a+)? part - aa
$3 contains the (b+)? part - undef (nothing matched).
$4 contains the (c) part - c
There is still some string left to match - bbbcac - so matching proceeds, and (a+)?(b+)?(c) matches bbbc. Since (a+)? is optional, it is not matched this time.
$1 contains the entire match - bbbc. This overwrites the previous value in $1.
$2 doesn't match, so it keeps the text previously matched - aa
$3 matches this time. It contains - bbb
$4 matches c
Again, (a+)?(b+)?(c) goes on to match the last part - ac.
$1 contains the entire match - ac.
$2 matches a this time, overwriting the previous value in $2. It now contains - a
$3 doesn't match this time, as there is no b in this part. It keeps the value from the previous match - bbb
$4 matches c, overwriting the value from the previous match. It now contains - c.
Now there is nothing left in the string to match. The final values of the capture groups are:
$1 - ac
$2 - a
$3 - bbb
$4 - c.
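To see each repetition's captures separately, one option is to drop the outer * and let /g drive the repetition, since every successful match sets all capture variables afresh. The following is only a minimal sketch: the iterations helper is a made-up name, and \G with /c is used here just to keep the matches contiguous.

```perl
use strict;
use warnings;

# Hypothetical helper: return one "g1 | g2 | g3 | g4" row per repetition.
sub iterations {
    my ($str) = @_;
    my @rows;
    # \G anchors each attempt where the previous match ended.
    while ($str =~ /\G((a+)?(b+)?(c))/gc) {
        # Groups that did not take part in this match are undef.
        push @rows, join ' | ', map { $_ // 'undef' } ($1, $2, $3, $4);
    }
    return @rows;
}

print "$_\n" for iterations('aacbbbcac');
```

Each row then reflects one pass only, so a group that did not match shows up as undef instead of carrying over an older value.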

As odd as it seems this is the "expected" behavior. Here's a quote from the perlre docs:
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.

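For parenthesized groups such as /(\d+)/, the perlre documentation says to use \1, \2, ... or \g{1}, \g{2}, ... to refer back to a group from inside the pattern itself. A $1 written inside the pattern is interpolated before matching starts, so it holds whatever an earlier match left behind, not the current group. In the replacement part of s///, $1 is the usual form; \1 is also accepted there, but Perl warns "\1 better written as $1" under use warnings.
# Example to turn a css href to local css.
# Transforms <link href="http://..." into <link href="css/..."
# ... inside a loop ...
my $localcss = $_; # one line from the file
$localcss =~ s/href.+\/([^\/]+\.css")/href="css\/$1/g;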
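A small sketch of the two places a group can be referenced may help: \1 inside the pattern, $1 in the replacement. The helper names localize_css and dedupe_words and the example.com URL are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a remote stylesheet link to a local css/ path.
# $1 in the replacement is the file name captured by the pattern.
sub localize_css {
    my ($line) = @_;
    (my $out = $line) =~ s{href.+/([^/]+\.css")}{href="css/$1}g;
    return $out;
}

# Hypothetical helper: collapse a doubled word.
# \1 inside the pattern refers to group 1 of the match in progress.
sub dedupe_words {
    my ($text) = @_;
    (my $out = $text) =~ s/\b(\w+) \1\b/$1/g;
    return $out;
}

print localize_css('<link href="http://example.com/a/b/site.css" rel="stylesheet">'), "\n";
print dedupe_words('the the quick fox'), "\n";
```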

Related

how to shell script regex perfect matching?

I have a Bash script file that matches a regex.
My regex script file:
if [[ "$image" =~ [0-9]+(\.[0-9]+){3}\-[0-9]+$ ]]; then
I need to pass cases that only match 0.0.0.0-0000
These are my inputs and results.
pass : 0.0.0.0-0000
pass : 0.0.0.0.0.0-0000 << Unwanted match
no : 0.0.0.0-word
no : 0.0.0.0
As I marked above 0.0.0.0.0.0-0000 gets a match with my regex.
My question is how can I modify my regex to only match the pattern 0.0.0.0-0000?
Assuming that you are trying to match some sort of IP-address-like string, I came up with this regex:
^(\d+\.?){4}-\d+
Note the \d+ in the first capturing group (\d+\.?), which matches any run of digits before a '.'. If the only starting pattern is 0.0.0.0, you can remove the + to match a single digit only. Also note that \d is a Perl/PCRE shorthand; Bash's [[ =~ ]] uses POSIX extended regular expressions, where [0-9] is the portable spelling.
Explanation:
^ - Matches the start of the string
(\d+\.?){4} - Matches a number followed by an optional . character, 4 times in a row, matching 0.0.0.0
-\d+ - Matches a - character followed by a sequence of digits, matching -0000
This issue is solved, following the comment by @The fourth bird.
I missed the anchor (^).
To pin down the start and end points, the pattern should sit between '^' and '$'.
You can refer to this answer:
if [[ "$image" =~ ^[0-9]+(\.[0-9]+){3}\-[0-9]+$ ]]; (The fourth bird, Jul 11 at 8:43)
Thank you to everyone who replied XD
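For comparison with the Perl examples elsewhere on this page, here is a rough sketch of the same anchored check in Perl, run against the inputs from the question. The verdict helper and the extra input 1234-5 are assumptions added for illustration; \z is used instead of $ so that a trailing newline is rejected as well.

```perl
use strict;
use warnings;

# Anchored version of the pattern from the question.
my $strict = qr/^[0-9]+(?:\.[0-9]+){3}-[0-9]+\z/;
# The looser pattern from the first answer, kept only for comparison.
my $loose  = qr/^(\d+\.?){4}-\d+/;

# Hypothetical helper: 'pass' if the string matches the pattern, else 'no'.
sub verdict {
    my ($re, $s) = @_;
    return $s =~ $re ? 'pass' : 'no';
}

for my $image ('0.0.0.0-0000', '0.0.0.0.0.0-0000', '0.0.0.0-word', '0.0.0.0', '1234-5') {
    printf "%-18s strict=%-4s loose=%s\n",
        $image, verdict($strict, $image), verdict($loose, $image);
}
```

Because the dot is optional in the looser pattern, it also accepts 1234-5 (four one-digit "numbers" with no dots), which the anchored pattern rejects.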

Match reverse translation of a capture group in Perl regex

I am trying to find strings that match a certain pattern and then the reverse translation of that pattern followed by it separated by a letter O.
Translation rule is /ABC/XYZ.
Example of a match: CCBAOXYZZ
First section matches the pattern [ABC]{3,25}. Then there's a letter O which also matches. Then we see that XYZZ is the reverse of CCBA with the translation above applied.
I have managed to write the tr rule into my backreferencing. But I cannot figure out how to do the reverse as well.
while (my $input_string = <sample_input>) {
push @hit, $1 while $input_string
=~ m{
(([ABC]{3,25})
O
(??{ $2 =~ tr/ABC/XYZ/r}))
}xg;
}
Is it correct to add the 'reverse' function to the third line of the regex in this way: (??{ $2 =~ tr/ACGT/TGCA/r;reverse}))?
How do I match the reverse tr of $2?
Your tr///r returns the transliterated string. So you simply need to stick your reverse in front of the tr///r and you're good to go.
push @hit, $1 while $input_string
=~ m{
(([ABC]{3,25})
O
(??{ reverse $2 =~ tr/ABC/XYZ/r }))
}xg;
The return value of the tr///r does not go into $_, so ; reverse will reverse whatever is in $_. That makes the overall match fail.
You actually answered your own question in your last sentence.
How do I do the match the reverse tr of $2?
If you add use re 'debug' you can see the actual pattern that is being matched against.
With tr///; reverse, the second part of that debugging output, which relates to the regex compiled from the eval, is:
...
Compiling REx "ZZYXOABCC"
Final program:
1: EXACT <ZZYXOABCC> (5)
5: END (0)
anchored "ZZYXOABCC" at 0 (checking anchored isall) minlen 9
Matching embedded REx "ZZYXOABCC" against "XYZZ"
...
As we can see here, the embedded pattern is the entire input string reversed, not the transliterated left side. A bare reverse works on $_, and inside a regex code block $_ refers to the string being matched, so the return value of tr///r was simply discarded.
Now if we compare that to reverse tr///r, we see the difference.
...
Compiling REx "XYZZ"
Final program:
1: EXACT <XYZZ> (3)
3: END (0)
anchored "XYZZ" at 0 (checking anchored isall) minlen 4
Matching embedded REx "XYZZ" against "XYZZ"
...
It now only returns the transliterated left side of the string, which then matches.
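Putting the pieces together, a self-contained sketch of the working match might look like this. The find_hits wrapper and the test strings are assumptions for illustration, and scalar is spelled out so the reversal is a string reversal regardless of the context the code block is evaluated in.

```perl
use strict;
use warnings;

# Hypothetical wrapper: return every substring of the form
# <pattern> O <reversed transliteration of pattern>.
sub find_hits {
    my ($input) = @_;
    my @hit;
    push @hit, $1 while $input =~ m{
        (([ABC]{3,25})
        O
        (??{ scalar reverse($2 =~ tr/ABC/XYZ/r) }))
    }xg;
    return @hit;
}

print join(',', find_hits('CCBAOXYZZ')), "\n";   # the example from the question
print scalar(find_hits('CCBAOZZYX')), "\n";      # count of hits when the tail is not reversed
```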

PowerShell -replace to get string between two different characters

I am currently using split to get what I need, but I am hoping there is a better way in PowerShell.
Here is the string:
server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000
I want to get the server and database without the database= or the server= prefix.
Here is the method I am currently using:
$databaseserver = (($details.value).split(';')[0]).split('=')[1]
$database = (($details.value).split(';')[1]).split('=')[1]
This outputs to:
ss8.server.com
CSSDatabase
I would like it to be as simple as possible.
Thank you in advance
Replacing approach
You may use the following regex replace:
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$dbserver = $s -replace '^server=([^;]+).*', '$1'
$db = $s -replace '^[^;]*;database=([^;]+).*', '$1'
The technique is to match and capture (with (...)) what we need and just match what we need to remove.
Pattern details:
^ - start of the line
server= - a literal substring
([^;]+) - Group 1 (what $1 refers to) matching 1+ chars other than ;
.* - any 0+ chars other than a newline, as many as possible
Pattern 2 is almost the same, the capturing group is shifted a bit to capture another detail, and some more literal values are added to match the right context.
Note: if the values you need to extract may appear anywhere in the string, replace ^ in the first one and ^[^;]*; pattern in the second one with .*?\b (any 0+ chars other than a newline, as few as possible followed with a word boundary).
Matching approach
With a -match, you may do it the following way:
$s -match '^server=(.+?);database=([^;]+)'
The $Matches[1] will contain the server details and $Matches[2] will hold the DB info:
Name                           Value
----                           -----
2                              CSSDatabase
1                              ss8.server.com
0                              server=ss8.server.com;database=CSSDatabase
Pattern details
^ - start of string
server= - literal substring
(.+?) - Group 1: any 1+ non-linebreak chars as few as possible
;database= - literal substring
([^;]+) - 1+ chars other than ;
Another solution with a RegEx and named capture groups, similar to Wiktor's Matching Approach.
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$RegEx = '^server=(?<databaseserver>[^;]+);database=(?<database>[^;]+)'
if ($s -match $RegEx){
$Matches.databaseserver
$Matches.database
}
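The named-capture technique is not specific to PowerShell. As a rough sketch for readers following the Perl examples on this page, the same pattern can be used in Perl, where named groups are read from the %+ hash. The server_and_database helper is a made-up name for illustration.

```perl
use strict;
use warnings;

my $s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000';

# Hypothetical helper: return (server, database), or an empty list if the
# string does not start with server=...;database=...
sub server_and_database {
    my ($conn) = @_;
    if ($conn =~ /^server=(?<databaseserver>[^;]+);database=(?<database>[^;]+)/) {
        return ($+{databaseserver}, $+{database});
    }
    return;
}

my ($server, $db) = server_and_database($s);
print "$server\n$db\n";
```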

Trying to use /^\s*$/ match multiple blank lines and replace them failed and get a confusing result

Perl version : 5.16.01
I'm reading a book about regex which is based on Perl 5.8.
The book said that s/^\s*$/blabla/mg can match and replace multiple blank lines.
But when I practiced, I got a confusing result.
code:
$text = "c\n\n\n\n\nb";
$text =~ s/^\s*$/<p>/mg;
print "$text";
Here is the result:
C:\Users\Administrator\Desktop\regex>perl t2h.pl
c
<p><p>
b
I want to know why I got two <p> rather than a single one between 'c' and 'b'. Did the behavior of $ in Perl change after 5.8?
The lesson here is to be wary of regular expressions that can match a zero-width pattern: you could get unexpected results.
We can see what's happening here by showing the prematch, match and post match of both replacements:
use strict;
use warnings;
my $text = "c\n\n\n\nb";
$text =~ s{^\s*$}{
printf qq{<"%s" - "%s" - "%s">\n}, map s/\n/\\n/gr, ($`, $&, $');
"<p>"
}emg;
$text =~ s/\n/\\n/g;
print qq{Result: "$text"};
Outputs <"Prematch" - "Match" - "Postmatch">:
<"c\n" - "\n\n" - "\nb">
<"c\n\n\n" - "" - "\nb">
Result: "c\n<p><p>\nb"
Basically, the regex first matches from position 2 up to position 4, consuming 2 newline characters. After that replacement it resumes searching at position 4 and matches a zero-width pattern there, so it adds a second <p>.
One of the reasons this isn't intuitive is that our regex has already replaced the \n\n at positions 2 and 3 with a <p>. However, lookbehind-style assertions (of which ^ is a special variant) see the string as it originally was, not as modified by previous passes of a /g regex. Therefore, when matching at position 4, the regex sees c\n\n\n behind it instead of c\n<p> (as demonstrated in the output above), so ^ matches again and $ matches immediately after it with nothing in between.
The solution is to disallow zero-width matches, in this instance by using + instead of *.
Secondary Example
Another example of this is the following, simpler regex
my $text = "caab";
$text =~ s/a*/<p>/g;
print $text;
Outputs:
<p>c<p><p>b<p>
The positional breakdown of this matching is as follows:
0 c - match a zero width pattern
1 a - Match a 2 character pattern
2 a
3 b - Match a zero width pattern
4 $ - match a zero width pattern
Therefore, the final lesson is to simply be wary of regexes that will match a zero width pattern.
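As a quick check of the * versus + difference on the string from the question, here is a minimal sketch. The helper names with_star, with_plus and show are made up for illustration; show only makes the newlines visible.

```perl
use strict;
use warnings;

# Hypothetical helpers: apply the substitution with * and with +.
sub with_star { my ($t) = @_; (my $o = $t) =~ s/^\s*$/<p>/mg; return $o }
sub with_plus { my ($t) = @_; (my $o = $t) =~ s/^\s+$/<p>/mg; return $o }

# Render newlines as a literal \n so the result fits on one line.
sub show { my ($t) = @_; (my $o = $t) =~ s/\n/\\n/g; return $o }

my $text = "c\n\n\n\n\nb";
print 'star: ', show(with_star($text)), "\n";
print 'plus: ', show(with_plus($text)), "\n";
```

With +, the empty match right after the run of newlines is no longer possible, so only one <p> is inserted.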
Quantifier * matches 0 or more times,
quantifier + matches 1 or more times.
So your regex should be written as s/^\s+$/<p>/mg
You can try this:
#!/usr/bin/perl
$text = "c\n\n\n\n\nb";
$text =~ s/[\r\n]//g;
print $text;
DEMO http://ideone.com/WmVFHz

Regex: Matching 4-Digits within words

I have a body of text I'm looking to pull repeat sets of 4-digit numbers out from.
For Example:
The first is 1234 2) The Second is 2098 3) The Third is 3213
Now I know I'm able to get the first set of digits out by simply using:
/\d{4}/
...returning 1234
But how do I match the second set of digits, or the third, and so on...?
Edit: How do I return 2098, or 3213?
You don't appear to have a proper answer to your question yet.
The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my @numbers = $str =~ /\b \d{4} \b/gx;
print "@numbers\n";
output
1234 2098 3213
Or you can iterate through them, using scalar context in a while loop, like this
while ($str =~ /\b (\d{4}) \b/gx) {
my $number = $1;
print $number, "\n";
}
output
1234
2098
3213
I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
See http://perldoc.perl.org/perlre.html for a discussion of the 'g' modifier, which causes your regular expression to match ALL occurrences of its pattern, not just the first.
If you want a pattern that finds the $n'th 4-digit group, this seems to work:
$pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
if ($s =~ /$pat/) {
print "Found $1\n";
} else {
print "Not found\n";
}
I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.
This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.
EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:
if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...
If not, see amon's comment about qr//.
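Two ways of getting the n-th four-digit group can be sketched side by side: collecting all matches in list context and indexing, or repeating the group n times as described above. Both helper names are assumptions made up for illustration, and n is taken to be 1-based.

```perl
use strict;
use warnings;

# Hypothetical helper: n-th whole four-digit number via list-context /g.
sub nth_four_digit {
    my ($s, $n) = @_;
    return undef if $n < 1;
    my @all = $s =~ /\b(\d{4})\b/g;   # every match at once
    return $all[$n - 1];              # undef if there are fewer than $n
}

# Hypothetical helper: same result by repeating the group $n times;
# $1 keeps the value from the last repetition.
sub nth_via_quantifier {
    my ($s, $n) = @_;
    my $re = qr/^(?:.*?\b(\d{4})\b){$n}/;
    return $s =~ $re ? $1 : undef;
}

my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
print nth_four_digit($str, 2), "\n";
print nth_via_quantifier($str, 3), "\n";
```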
If the regex is matched only once, then match all three in one regex and extract them using capture groups:
^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$
The three 4-digit numbers will be captured in groups 1, 2 and 3.
Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
print "$num1, $num2, $num3\n";