Perl regex forward reference - regex

I would like to match a forward reference with regexp. The pattern I am looking for is
[snake-case prefix]_[snake-case words] [same snake-case prefix]_number
For example:
foo_bar_eighty_twelve foo_bar_8012
I cannot extract foo_bar and eighty_twelve without looking first at foo_bar_8012. Thus I need a forward reference, not a backward reference which work only if my prefix is not a snake-case prefix.
my $prefix = "foo";
local $_ = "${prefix}_thirty_two = ${prefix}_32";
# Backward reference that works with a prefix with no underscores
{
/(\w+)_(\w+) \s+ = \s+ \1_(\d+)/ix;
print "Name: $2 \t Number: $3\n";
}
# Wanted Forward reference that do not work :(
{
/\2_(\w+) \s+ = \s+ (\w+)_\d+/ix;
print "Name: $1 \t Number: $2\n";
}
Unfortunately, my forward reference does not work and I do not know why. I've read that Perl support that kind of patterns.
Any help ?

The following assumption is false:
“I cannot extract foo_bar and eighty_twelve without looking first at foo_bar_8012.”
Yes, it is true that you can't definitely determine where the break in prefix and name occur in the first group of characters until looking at the second group, but thus comes the power of regular expressions. It greedily matches on the first pass, finds the second string doesn't match, and then backtracks to try again with a smaller string for the prefix.
The following demonstrates how you would accomplish your goal using simple back references:
use strict;
use warnings;
while (<DATA>) {
if (m{\b(\w+)_(\w+)\s+\1_(\d+)\b}) {
print "Prefix = $1, Name = $2, Number = $3\n";
} else {
warn "Not found: $_"
}
}
__DATA__
foo_thirty_two foo_32
foo_bar_eighty_twelve foo_bar_8012
Outputs:
Prefix = foo, Name = thirty_two, Number = 32
Prefix = foo_bar, Name = eighty_twelve, Number = 8012

AFAIK Forward referencing is not a magic bullet that allows to to swap capture-group and reference.
I've look at quite a bit of examples and i simply dont think you can do what you're trying, using forward referencing.
I solved the issue by using back-referencing combined with look-ahead. Like so:
/(?=.*=\s*([a-z]+))\1_(\w+) \s+ = \s+ \w+_\d+/ix
This works because the look-ahead initializes the first capture group ahead of the "actual" expression. For reference, this part is the look-ahead:
(?=.*=\s*([a-z]+))
and its basically just sort of a "sub-regex". The reason i use [a-z]+, is because \w+ includes underscore. And i don't think that was what you wanted.

Related

regex: confirm if a optional portion was matched

I have a string that can be of two forms, and it is unknown which form it will be each time:
hello world[0:10]; or hello world;
There may or may not be the brackets with numbers. The two words (hello and world) can vary. If the brackets and numbers are there, the first number is always 0 and the second number (10) varies.
I need to capture the first word (hello) and, if it exists, the second number (10). I also need to know which form of the string it was.
hello world[0:10]; I would capture {hello, 10, form1}, and hello world; I would capture {hello, form2}. I don't really care how the "form" is formatted, I just need to be able to differentiate. It can be a bit (1=form1, 0=form2), structural (form1 puts me in one scope and form2 another), etc.
I currently have the following (now working) regex:
/(\w*) \s \w* (?:\[0:(\d*)\])?;/x
This gives me $1 = hello and potentially $2 = 10. I now just need to know if the bracketed numbers were there or not. This will be repeated many times, so I can't assume $2 = undef going into the regex. $2 could also be the same thing a few times in a row so I can't just look for a change in $2 before and after the regex.
My best solution so far is to run the regex twice, the first time with the brackets and the second time without:
if( /(\w*) \s \w* \[0:(\d*)\];/x ) {...}
elsif( /(\w*) \s \w*;/x ) {...}
This seems very inefficient and inelegant though so I was wondering if there is a better way?
You can use ? to optionally match portions of your regex. Then you can capture the output directly as a return value from the regex.
my $re = qr{ (\w*) \s* (?:\[0:(\d+)\])?; }x;
if( my($word, $num) = $line =~ $re ) {
say "Word: $word";
say "Num: $num" if defined $num;
}
else {
say "No match";
}
(?:\[0:(\d+)\])? says there may be a [0:\d+]. (?:) makes the grouping non-capturing so only \d+ is captured.
$1 and $2 are also safe to use, they are reset on each match, but using lexical variables makes things more explicit.

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
^ ^ ^^^
+--------+ |
Note 1 Note 2
Note 1: =~ returns a list of captured items, so the assignment target needs to be a list.
Note 2: allow one or more blanks after anywhere
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around left side expression of = operator to force array context for regexp evaluation. See m// and // in perlop documentation.[1] You can write
parentheses also around =~ binding operator to improve readability but it is not necessary because =~ has pretty high priority.
Use POSIX Character Classes word
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note I'm using x so whitespaces in regexp are ignored. Also use \b word boundary to anchor regexp correctly.
[1]: I write my ($w_after) just for convenience because you can write my ($a, $b, $c, #rest) as equivalent of (my $a, my $b, my $c, my #rest) but you can also control scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, #_).
This Regex to be matched:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
^\s+ the word between two spaces
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]|\s]+(\S+)/);

Sub-pattern in regex can't be dereferenced?

I have following Perl script to extract numbers from a log. It seems that the non-capturing group with ?: isn't working when I define the sub-pattern in a variable. It's only working when I leave out the grouping in either the regex-pattern or the sub-pattern in $number.
#!/usr/bin/perl
use strict;
use warnings;
my $number = '(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)';
#my $number = '-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?';
open(FILE,"file.dat") or die "Exiting with: $!\n";
while (my $line = <FILE>) {
if ($line =~ m{x = ($number). y = ($number)}){
print "\$1= $1\n";
print "\$2= $2\n";
print "\$3= $3\n";
print "\$4= $4\n";
};
}
close(FILE);
The output for this code looks like:
$1= 12.15
$2= 12.15
$3= 3e-5
$4= 3e-5
for an input of:
asdf x = 12.15. y = 3e-5 yadda
Those doubled outputs aren't desired.
Is this because of the m{} style in contrast to the regular m// patterns for regex? I only know the former style to get variables (sub-strings) in my regex expressions. I just noticed this for the backreferencing so possibly there are other differences for metacharacters?
The delimiters you use for the regular expression aren't causing any problems but the following is:
(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)
^^
Notice this isn't a capturing group, it is an optional colon :
Probably a typo mistake but it is causing the trouble.
Edit: It looks that it is not a typo mistake, i substituted the variables in the regex and I got this:
x = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)). y = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?))
^^ first and second group ^^ ^^ third and fourth grouop ^^
As you can see the first and second capturing group are capturing exactly the same thing, the same is happening for the third and fourth capturing group.
You're going to kick yourself...
Your regexp reads out as:
capture {
maybe-colon
maybe-minus
cluster { (?:(?:\d+\.?\d*)|(?:\.\d+))
cluster { (?:\d+\.?\d*)
1+ digits
maybe-dot
0+ digits
}
-or-
cluster { (?:\.\d+)
dot
1+digits
}
}
maybe cluster {
E or e
maybe + or -
1+ digets
} (?:[Ee][+-]?\d+)?
}
... which is what you're looking for.
However, when you then do your actual regexp, you do:
$line =~ m{x = $number. y = $number})
(the curly braces are a distraction.... you may use any \W if the m or s has been specified)
What this is asking is to capture whatever the regexp defined in $number is.... which is, itself, a capture.... hence $1 and $2 being the same thing.
Simply remove the capture braces from either $number or the regexp line.

Matching numbers for substitution in Perl

I have this little script:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the string start for $1 on this case.
Try this pattern
foreach (#list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use here the branch reset feature (i.e. (?|...()...|...()...)) that allows to put several capturing groups in a single reference ( $1 here ). So, you avoid using a second replacement to trim a zero from the left of the capture.
To remove all from the begining before the number, I use :
.*? # all characters zero or more times
# ( ? -> make the * quantifier lazy to match as less as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (#list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
I would modify your regular expression. Try using this code:
my #list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (#list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the first part in your s/// matches, what you think it does, but that the second part isn't replacing what you think it should. s/// will only replace what was previously matched. Thus to replace something like T3_ you will have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grabs" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must starts with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language so I'll just usre Perl. So basically instead of matching everything, I just matched exactly what I thought you needed. Correct me if I am wrong please.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
\$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Output:
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my #items = split '&', $query; # Each item is something like param=123
foreach my $item (#items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explain:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only query string, pattern would reduce such as:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, i check, is there any symbol like ?.
Then, i kill first part of line (left from ?).
Next, i split line by &, where each value splitted by =.
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my #params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (#params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}