How do I access the captures within a match? - regex

I am trying to parse a csv file, and I am trying to access names regex in proto regex in Perl6. It turns out to be Nil. What is the proper way to do it?
grammar rsCSV {
regex TOP { ( \s* <oneCSV> \s* \, \s* )* }
proto regex oneCSV {*}
regex oneCSV:sym<noQuote> { <-[\"]>*? }
regex oneCSV:sym<quoted> { \" .*? \" } # use non-greedy match
}
my $input = prompt("Enter csv line: ");
my $m1 = rsCSV.parse($input);
say "===========================";
say $m1;
say "===========================";
say "1 " ~ $m1<oneCSV><quoted>; # this fails; it is "Nil"
say "2 " ~ $m1[0];
say "3 " ~ $m1[0][2];

Detailed discussion complementing Christoph's answer
I am trying to parse a csv file
Perhaps you are focused on learning Raku parsing and are writing some throwaway code. But if you want industrial strength CSV parsing out of the box, please be aware of the Text::CSV modules[1].
I am trying to access a named regex
If you are learning Raku parsing, please take advantage of the awesome related (free) developer tools[2].
in proto regex in Raku
Your issue is unrelated to it being a proto regex.
Instead the issue is that, while the match object corresponding to your named capture is stored in the overall match object you stored in $m1, it is not stored precisely where you are looking for it.
Where do match objects corresponding to captures appear?
To see what's going on, I'll start by simulating what you were trying to do. I'll use a regex that declares just one capture, a "named" (aka "Associative") capture that matches the string ab.
given 'ab'
{
my $m1 = m/ $<named-capture> = ( ab ) /;
say $m1<named-capture>;
# 「ab」
}
The match object corresponding to the named capture is stored where you'd presumably expect it to appear within $m1, at $m1<named-capture>.
But you were getting Nil with $m1<oneCSV>. What gives?
Why your $m1<oneCSV> did not work
There are two types of capture: named (aka "Associative") and numbered (aka "Positional"). The parens you wrote in your regex that surrounded <oneCSV> introduced a numbered capture:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) ) /; # extra parens added
say $m1[0]<named-capture>;
# 「ab」
}
The parens in / ( ... ) / declare a single top level numbered capture. If it matches, then the corresponding match object is stored in $m1[0]. (If your regex looked like / ... ( ... ) ... ( ... ) ... ( ... ) ... / then another match object corresponding to what matches the second pair of parentheses would be stored in $m1[1], another in $m1[2] for the third, and so on.)
The match result for $<named-capture> = ( ab ) is then stored inside $m1[0]. That's why say $m1[0]<named-capture> works.
So far so good. But this is only half the story...
Why $m1[0]<oneCSV> in your code would not work either
While $m1[0]<named-capture> in the immediately above code is working, you would still not get a match object in $m1[0]<oneCSV> in your original code. This is because you also asked for multiple matches of the zeroth capture because you used a * quantifier:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) )* /; # * is a quantifier
say $m1[0][0]<named-capture>;
# 「ab」
}
Because the * quantifier asks for multiple matches, Raku writes a list of match objects into $m1[0]. (In this case there's only one such match so you end up with a list of length 1, i.e. just $m1[0][0] (and not $m1[0][1], $m1[0][2], etc.).)
Summary
Captures nest;
A capture quantified by either * or + corresponds to two levels of nesting not just one.
In your original code, you'd have to write say $m1[0][0]<oneCSV>; to get to the match object you're looking for.
[1] Install relevant modules and write use Text::CSV; (for a pure Raku implementation) or use Text::CSV:from<Perl5>; (for a Perl plus XS implementation) at the start of your code. (talk slides (click on top word, eg. "csv", to advance through slides), video, Raku module, Perl XS module.)
[2] Install CommaIDE and have fun with its awesome grammar/regex development/debugging/analysis features. Or install the Grammar::Tracer; and/or Grammar::Debugger modules and write use Grammar::Tracer; or use Grammar::Debugger; at the start of your code (talk slides, video, modules.)

The match for <oneCSV> lives within the scope of the capture group, which you get via $m1[0].
As the group is quantified with *, the results will again be a list, ie you need another indexing operation to get at a match object, eg $m1[0][0] for the first one.
The named capture can then be accessed by name, eg $m1[0][0]<oneCSV>. This will already contain the match result of the appropriate branch of the protoregex.
If you want the whole list of matches instead of a specific one, you can use >> or map, eg $m1[0]>>.<oneCSV>.

Related

How can I match only integers in Perl?

So I have an array that goes like this:
my #nums = (1,2,12,24,48,120,360);
I want to check if there is an element that is not an integer inside that array without using loop. It goes like this:
if(grep(!/[^0-9]|\^$/,#nums)){
die "Numbers are not in correct format.";
}else{
#Do something
}
Basically, the format should not be like this (Empty string is acceptable):
1A
A2
#A
#
#######
More examples:
1,2,3,A3 = Unacceptable
1,2,###,2 = unacceptable
1,2,3A,4 = Unacceptable
1, ,3,4=Acceptable
1,2,3,360 = acceptable
I know that there is another way by using look like a number. But I can't use that for some reason (outside of my control/setup reasons). That's why I used the regex method.
My question is, even though the numbers are in not correct format (A60 for example), the condition always return False. Basically, it ignores the incorrect format.
You say in the comments that you don't want to use modules because you can't install them, but there are many core modules that should come with Perl (although some systems screw this up).
zdim's answer in the comments is to look for anything that is not 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. That's the negated character class [^0-9]. A grep in scalar context returns the number of items that match:
my $found_non_ints = grep { /[^0-9]/ } #items;
Instead of that, I'd go back to the non-negated character class and match string that only has zero or more digits. To do this, anchor the pattern to the absolute start and end of the string:
my $found_non_ints = grep { ! /\A[0-9]*\z/ } #items;
But, this doesn't really match integers. It matches positive whole numbers (and zero). If you want to match negative numbers as well, allow an optional - at the start of the string:
my $found_non_ints = grep { ! /\A-?[0-9]*\z/ } #items;
That - would be a problem in the negated character class.
Also, you don't want the $ anchor here: that allows a possible newline to match at the end, and that's a non-digit (the \Z is the same for the end of the string). Also, the meaning of $ can change based on the setting of the /m flag, which might be set with default regex flags.
Here's a short program with your sample data. Note that you need to decide how to split up the list; does whitespace matter? I decided to remove whitespace around the comma:
#!perl
use v5.10;
while( <DATA> ) {
chomp;
my $found_non_ints = grep { ! /\A[0-9]*\z/ } split /\s*,\s*/;
say "$_ => Found $found_non_ints non-ints";
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,,3,4
1,2,3,360
The solution proposed in the question gets close, except that the logic got reversed and there is an error in a regex pattern. One way for it:
if ( grep { /[^0-9] | ^$/x } #nums ) { say 'not all integers' }
Regex explanation
[] is a character class: it matches any one of the characters listed inside (so [abc] matches either of a, b, or c) -- but when it starts with a ^ it matches any character not listed; so [^abc] matches any char not being either of a, b, or c. The pattern 0-9 inside a character class specifies all digits in that range (and we can also use a-z and A-Z)
So [^0-9] matches any character that is not a digit
Then that is or-ed by | with a ^$: ^ matches beginning of the string and $ is for the end of it. So ^$ match a string without anything -- an empty string! We need to account for that as [^0-9] doesn't while an array element can be an empty string. (It can also be a undef but from my understanding that is not possible with actual data, and a regex on undef would draw a warning.)
Note that $ allows for a newline as well, and that ^ and $ may change their meaning if /m modifier is in use, matching on linefeeds inside a string. However, in all these cases we'd be matching a non-digit, which is precisely the point here
/x modifier makes it disregard literal spaces inside so we can space things out for easier reading. (It also allows for newlines and comments with #, so complex patterns can be organized and documented very nicely)
So that's all -- the regex tries to match anything that shouldn't be in an integer (assumed to be strictly positive in OP's data).
If it matches any such, in any one of the array elements, then grep returns a list which isn't empty (but has at least one element) and that is "true" under if. So we caught a non-integer and we go into if's block to deal with that.
A little aside: we can also declare and populate an array right inside the if condition, to catch all those non-integers:
if ( my #non_ints = grep { /[^0-9] | ^$/x } #nums ) {
say 'Non-integers: ', join ' ', map { "|$_|" } #non_ints;
}
This also reads more nicely, telling by the array name what we're after in that complicated condition: "non_ints." I put || around each item in print to be able to see an empty string.†
Now, when you put an exclamation mark in front of that regex, it reverses the true/false return from the regex and our code goes haywire. So drop that !.
The other error is in escaping the ^ by having \^. This would match a literal ^ character, robbing ^ of its special meaning as a pattern in regex, explained above. So drop that \.
One other way is in using an extremely useful List::Util library, which is "core" (so it is normally installed with Perl, even though that can get messed up).
Among a number of essential functions it gives us any, and with it we have
use List::Util qw(any);
if ( any { /[^0-9]|^$/ } #nums ) { say 'not all integers' }
I like any firstly because the name of the function includes at least a part of the needed logic, making code that much clearer and easier to comprehend: is there any element of #nums for which the code in the block is true? So any element which contains a non-digit? Precisely what is needed here.
Then, another advantage is that any will quit as soon as it finds one match, while grep continues through the whole list. But this efficiency advantage shows only on very large arrays or a lot of repeated checks. Also, on the other hand sometimes we want to count all instances.
I'd also like to point out some of any's siblings: none and notall. These names themselves also capture a good deal of logic, making otherwise possibly convoluted code that much clearer. Browse through this library to get accustomed to what is in there.
† A program with your test data
use warnings;
use strict;
use feature 'say';
while (<DATA>) {
chomp;
my #nums = split /\s*,\s*/;
say "Data: #nums";
if ( my #non_ints = grep { /[^0-9] | ^$/x } #nums ) {
say 'Non-ints: ', join ' ', map { "|$_|" } #non_ints;
}
say '---';
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,2,3,360

Match very specific open and close brackets

I am trying to modify some LaTeX Beamer code, and want to do a quick regex find a certain pattern that defines a block of code. For example, in
\only{
\begin{equation*}
\psi_{w}(B)(1-B)^{d}y_{t,w} = x^{T}_{t,w}\gamma_{w} + \eta_{w}(B)z_{t,w},
\label{eq:gen_arima}
\end{equation*}
}
I want to match just the \only{ and the final }, and nothing else, to remove them. Is that even possible?
Regexes cannot count, as expressed in this famous SO answer: RegEx match open tags except XHTML self-contained tags
In addition, LaTeX itself has a fairly complicated grammar (i.e. macro expansions?). Ideally, you'd use a parser of some kind, but perhaps that's a little overkill. If you're looking for a really simple, quick and dirty hack which will only work for a certain class of inputs you can:
Search for \only.
Increment a counter every time you see a { and decrement it every time you see a }. If there is a \ preceding the {, ignore it. (Fancy points if you look for an odd number of \.)
When counter gets to 0, you've found the closing }.
Again, this is not reliable.
I want to remove \only{ and }, and keep everything within it
On a PCRE (php), Perl, Notepad++ it's done like this:
For something as simple as this, all you need is
Find \\only\s*({((?:[^{}]++|(?1))*)})
Replace $2
https://regex101.com/r/U3QxGa/1
Explained
\\ only \s*
( # (1 start)
{
( # (2 start), Inner core to keep
(?: # Cluster group
[^{}]++ # Possesive, not {}
| # or,
(?1) # Recurse to group 1
)* # End cluster, do 0 to many times
) # (2 end)
}
) # (1 end)

regex to duplicate repeated patterns, substituting part of the pattern

I'd like to duplicate a multiple matches in a line, substituting part of the match, but keeping the runs of matches together (that seems to be the tricky part).
e.g.:
Regex:
(x(\d)(,)?)
Replacement:
X$2,O$2$3
Input:
x1,x2,Z3,x4,Z5,x6
Output: (repeated groups broken apart)
X1,O1,X2,O2,Z3,X4,O4,Z5,X6,O6
Desired output (repeated groups, "X1,X2" kept together):
X1,X2,O1,O2,Z3,X4,O4,Z5,X6,O6
Demo: https://regex101.com/r/gH9tL9/1
Is this possible with regex or do I need to use something else?
Update: Wills answer is what I expected. It occurs to me that it might be possible with multiple passes of regex.
You would have to capture the repeating patterns as one match and write out replacements for the whole repeating pattern at once. your current pattern cannot tell that your first and second matches, x1, and x2, respectively, are adjacent.
Im going to say no, this is not possible with one pure regex.
This is because of two important facts about capture groups and replacing.
Repeated capture groups will return the last capture:
Regex's are able to capture patterns which repeat an arbitrary amount of time by using the form <PATTERN>{1,},<PATTERN>+ or <PATTERN>*. However any capture group within <PATTERN> would only return the captures from the last iteration of the pattern. This would prevent your desired ability to capture matches that arbitrarily repeat.
"Hold on", you might say, "I only want to capture patterns that repeat one or two times, I could use (x(\d)(,)?)(x(\d)(,)?)?", which brings us to point 2.
There is no conditional replacement
Using the above pattern we could get your desired output for the repeated match, but not without mangling the solo match replacement.
See: https://regex101.com/r/gH9tL9/2 Without the ability to turn off sections of the replacement based on the existence of capture groups, we cannot achieve the desired output.
But "No, you can't do that" is a challenge to a hacker, I hope I am shown up by a true regex ninja.
Solution with 2 regexes and some code
There's definitely ways to achieve this goal with some code.
Here's a quick and dirty python hack using two regexes http://pythonfiddle.com/wip-soln-for-so-q/
This makes use of python's re.sub(), which can pass matches to one regex to a function ordered_repl which returns the replacement string. By using your original regex within the ordered_repl we can extract the information we want and get the right order by buffering our lists of Xs and Os.
import re
input_string="x1,x2,Z3,x4,Z5,x6"
re1 = re.compile("(?:x\d,?)+") # captures the general thing you want to match using a repeating non-capturing group
re2 = re.compile("(x(\d)(,)?)") # your actual matcher
def ordered_repl(m): # m is a matchobj
buf1 = []
buf2 = []
cap_iter = re.finditer(re2,m.group(0)) # returns an iterator of MatchObjects for all non-overlapping matches
for cap_group in cap_iter:
capture = cap_group.group(2) # capture the digit
buf1.append("X%s" % capture) # buffer X's of this submatch group
buf2.append("O%s" % capture) # buffer O's of this submatch group
return "%s,%s," % (",".join(buf1),",".join(buf2)) # concatenate the buffers and return
print re.sub(re1,ordered_repl,input_string).rstrip(',') # searches string for matches to re1 and passes them to the ordered_repl function
In my specific case I'm using powershell, so I was able to come up with the following:
(linebreaks added for readability)
("x1,x2,z3,x4,z5,x6"
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
| Foreach-Object {
if ($_ -match 'x') {
$_ + ',' + ($_ -replace 'x','y')
} else {$_}
}
) -join ''
Outputs:
x1,x2,y1,y2,z3,x4,y4,z5,x6,y6
Where:
-split '((?<=x\d),(?!x)|(?<!x\d),(?=x))'
breaks apart the string into these groups:
x1,x2
,
z3
,
x4
,
z5
,
x6
using positive and negative lookahead and lookbehind:
comma with x\d before and without x after:
(?<=x\d),(?!x)
comma without x\d before and with x after:
(?<!x\d),(?=x)

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grabs" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must starts with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language so I'll just usre Perl. So basically instead of matching everything, I just matched exactly what I thought you needed. Correct me if I am wrong please.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
\$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Output:
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my #items = split '&', $query; # Each item is something like param=123
foreach my $item (#items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explain:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only query string, pattern would reduce such as:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, i check, is there any symbol like ?.
Then, i kill first part of line (left from ?).
Next, i split line by &, where each value splitted by =.
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my #params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (#params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".?" with "[^(\?=)]?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is not to use .*? but instead do use the * without the ?, but refine the character class. In this case [^?]* to match a sequence of non-question mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?' with a '?' after them. This process twice to omit the string 'utf-8?Q?'
([^?]*) # Save in a group next characters until found a '?'
/g # Repeat this process multiple times until end of string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.