Match very specific open and close brackets - regex

I am trying to modify some LaTeX Beamer code, and want to do a quick regex find a certain pattern that defines a block of code. For example, in
\only{
\begin{equation*}
\psi_{w}(B)(1-B)^{d}y_{t,w} = x^{T}_{t,w}\gamma_{w} + \eta_{w}(B)z_{t,w},
\label{eq:gen_arima}
\end{equation*}
}
I want to match just the \only{ and the final }, and nothing else, to remove them. Is that even possible?

Regexes cannot count, as expressed in this famous SO answer: RegEx match open tags except XHTML self-contained tags
In addition, LaTeX itself has a fairly complicated grammar (i.e. macro expansions?). Ideally, you'd use a parser of some kind, but perhaps that's a little overkill. If you're looking for a really simple, quick and dirty hack which will only work for a certain class of inputs you can:
Search for \only.
Increment a counter every time you see a { and decrement it every time you see a }. If there is a \ preceding the {, ignore it. (Fancy points if you look for an odd number of \.)
When counter gets to 0, you've found the closing }.
Again, this is not reliable.

I want to remove \only{ and }, and keep everything within it
On a PCRE (php), Perl, Notepad++ it's done like this:
For something as simple as this, all you need is
Find \\only\s*({((?:[^{}]++|(?1))*)})
Replace $2
https://regex101.com/r/U3QxGa/1
Explained
\\ only \s*
( # (1 start)
{
( # (2 start), Inner core to keep
(?: # Cluster group
[^{}]++ # Possesive, not {}
| # or,
(?1) # Recurse to group 1
)* # End cluster, do 0 to many times
) # (2 end)
}
) # (1 end)

Related

How do I access the captures within a match?

I am trying to parse a csv file, and I am trying to access names regex in proto regex in Perl6. It turns out to be Nil. What is the proper way to do it?
grammar rsCSV {
regex TOP { ( \s* <oneCSV> \s* \, \s* )* }
proto regex oneCSV {*}
regex oneCSV:sym<noQuote> { <-[\"]>*? }
regex oneCSV:sym<quoted> { \" .*? \" } # use non-greedy match
}
my $input = prompt("Enter csv line: ");
my $m1 = rsCSV.parse($input);
say "===========================";
say $m1;
say "===========================";
say "1 " ~ $m1<oneCSV><quoted>; # this fails; it is "Nil"
say "2 " ~ $m1[0];
say "3 " ~ $m1[0][2];
Detailed discussion complementing Christoph's answer
I am trying to parse a csv file
Perhaps you are focused on learning Raku parsing and are writing some throwaway code. But if you want industrial strength CSV parsing out of the box, please be aware of the Text::CSV modules[1].
I am trying to access a named regex
If you are learning Raku parsing, please take advantage of the awesome related (free) developer tools[2].
in proto regex in Raku
Your issue is unrelated to it being a proto regex.
Instead the issue is that, while the match object corresponding to your named capture is stored in the overall match object you stored in $m1, it is not stored precisely where you are looking for it.
Where do match objects corresponding to captures appear?
To see what's going on, I'll start by simulating what you were trying to do. I'll use a regex that declares just one capture, a "named" (aka "Associative") capture that matches the string ab.
given 'ab'
{
my $m1 = m/ $<named-capture> = ( ab ) /;
say $m1<named-capture>;
# 「ab」
}
The match object corresponding to the named capture is stored where you'd presumably expect it to appear within $m1, at $m1<named-capture>.
But you were getting Nil with $m1<oneCSV>. What gives?
Why your $m1<oneCSV> did not work
There are two types of capture: named (aka "Associative") and numbered (aka "Positional"). The parens you wrote in your regex that surrounded <oneCSV> introduced a numbered capture:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) ) /; # extra parens added
say $m1[0]<named-capture>;
# 「ab」
}
The parens in / ( ... ) / declare a single top level numbered capture. If it matches, then the corresponding match object is stored in $m1[0]. (If your regex looked like / ... ( ... ) ... ( ... ) ... ( ... ) ... / then another match object corresponding to what matches the second pair of parentheses would be stored in $m1[1], another in $m1[2] for the third, and so on.)
The match result for $<named-capture> = ( ab ) is then stored inside $m1[0]. That's why say $m1[0]<named-capture> works.
So far so good. But this is only half the story...
Why $m1[0]<oneCSV> in your code would not work either
While $m1[0]<named-capture> in the immediately above code is working, you would still not get a match object in $m1[0]<oneCSV> in your original code. This is because you also asked for multiple matches of the zeroth capture because you used a * quantifier:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) )* /; # * is a quantifier
say $m1[0][0]<named-capture>;
# 「ab」
}
Because the * quantifier asks for multiple matches, Raku writes a list of match objects into $m1[0]. (In this case there's only one such match so you end up with a list of length 1, i.e. just $m1[0][0] (and not $m1[0][1], $m1[0][2], etc.).)
Summary
Captures nest;
A capture quantified by either * or + corresponds to two levels of nesting not just one.
In your original code, you'd have to write say $m1[0][0]<oneCSV>; to get to the match object you're looking for.
[1] Install relevant modules and write use Text::CSV; (for a pure Raku implementation) or use Text::CSV:from<Perl5>; (for a Perl plus XS implementation) at the start of your code. (talk slides (click on top word, eg. "csv", to advance through slides), video, Raku module, Perl XS module.)
[2] Install CommaIDE and have fun with its awesome grammar/regex development/debugging/analysis features. Or install the Grammar::Tracer; and/or Grammar::Debugger modules and write use Grammar::Tracer; or use Grammar::Debugger; at the start of your code (talk slides, video, modules.)
The match for <oneCSV> lives within the scope of the capture group, which you get via $m1[0].
As the group is quantified with *, the results will again be a list, ie you need another indexing operation to get at a match object, eg $m1[0][0] for the first one.
The named capture can then be accessed by name, eg $m1[0][0]<oneCSV>. This will already contain the match result of the appropriate branch of the protoregex.
If you want the whole list of matches instead of a specific one, you can use >> or map, eg $m1[0]>>.<oneCSV>.

issue with regexp with nested groups in golang

Consider the following toy example. I want to match in Go a name with a regexp where the name is sequences of letters a separated by single #, so a#a#aaa is valid, but a# or a##a are not. I can code the regexp in the following two ways:
r1 := regexp.MustCompile(`^a+(#a+)*$`)
r2 := regexp.MustCompile(`^(a+#)*a+$`)
Both of these work. Now consider more complex task of matching a sequence of names separated by single slash. As in above, I can code that in two ways:
^N+(/N+)*$
^(N+/)*N+$
where N is a regexp for the name with ^ and $ stripped. As I have two cases for N, so now I can have 4 regexps:
^a+(#a+)*(/a+(#a+)*)*$
^(a+#)*a+(/a+(#a+)*)*$
^((a+#)*a+/)*a+(#a+)*$
^((a+#)*a+/)*(a+#)*a+$
The question is why when matching against the string "aa#a#a/a#a/a" the first one fails while the rest 3 cases work as expected? I.e. what causes the first regexp to mismatch? The full code is:
package main
import (
"fmt"
"regexp"
)
func main() {
str := "aa#a#a/a#a/a"
regs := []string {
`^a+(#a+)*(/a+(#a+)*)*$`,
`^(a+#)*a+(/a+(#a+)*)*$`,
`^((a+#)*a+/)*a+(#a+)*$`,
`^((a+#)*a+/)*(a+#)*a+$`,
}
for _, r := range(regs) {
fmt.Println(regexp.MustCompile(r).MatchString(str))
}
}
Surprisingly it prints false true true true
This must be a bug in golang regexp engine. A simpler test case is that ^a(/a+(#a+)*)*$ fails to match a/a#a while ^(a)(/a+(#a+)*)*$ works, see http://play.golang.org/p/CDKxVeXW98 .
I filed https://github.com/golang/go/issues/11905
You can try to alleviate atomic sub-nesting of quantifiers.
If you didn't have the anchors, the expression would possibly blow up
in backtracking if using nested optional quantifiers like this,
when it can't find a direct solution.
Go might be balking at that, forcing it to fail outright instead
of massive backtracking, but not sure.
Try something like this, see if it works.
# ^a+(#?a)*(/a+(#?a)*)*$
^
a+
( # (1 start)
\#?
a
)* # (1 end)
( # (2 start)
/
a+
( # (3 start)
\#?
a
)* # (3 end)
)* # (2 end)
$
edit: (Transposed from comment) If complexity is too high, some engines will not even compile it, some will silently fail at runtime, some will lock up in a backtracking loop.
This little subtle difference is the problem
bad: too complex (#?a+)*
good: no nested, open ended quantifiers (#?a)*
If you ever have this problem, strip out the nesting, usually solves
the backtracking issue.
eit2: If you require a separator and want to insure a separator is only in the middle and surrounded by a's, you could try this
https://play.golang.org/p/oM6B6H3Kdx
# ^a+[#/](a[#/]?)*a+$
^
a+
[\#/]
( # (1 start)
a
[\#/]?
)* # (1 end)
a+
$
or this
https://play.golang.org/p/WihqSjH_dI
# ^a+(?:[#/]a+)+$
^
a+
(?:
[\#/]
a+
)+
$

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grabs" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must starts with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language so I'll just usre Perl. So basically instead of matching everything, I just matched exactly what I thought you needed. Correct me if I am wrong please.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
\$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Output:
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my #items = split '&', $query; # Each item is something like param=123
foreach my $item (#items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explain:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only query string, pattern would reduce such as:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, i check, is there any symbol like ?.
Then, i kill first part of line (left from ?).
Next, i split line by &, where each value splitted by =.
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my #params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (#params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}

Is it possible to check if two groups are equal?

If I have some HTML like this:
<b>1<i>2</i>3</b>
And the following regex:
\<[^\>\/]+\>(.*?)\<\/[^\>]+\>
Then it will match:
<b>1<i>2</i>
I want it to only match HTML where the start and end tags are the same. Is there a way to do this?
Thanks,
Joe
Is there a way to do this?
Yes, certainly. Ignore those flippant non-answers that tell you it can’t be done. It most certainly can. You just may not wish to do so, as I explain below.
Numbered Captures
Pretending for the nonce that HTML <i> and <b> tags are always denude of attributes, and moreover, neither overlap nor nest, we have this simple solution:
#!/usr/bin/env perl
#
# solution A: numbered captures
#
use v5.10;
while (<>) {
say "$1: $2" while m{
< ( [ib] ) >
(
(?:
(?! < /? \1 > ) .
) *
)
</ \1 >
}gsix;
}
Which when run, produces this:
$ echo 'i got <i>foo</i> and <b>bar</b> bits go here' | perl solution-A
i: foo
b: bar
Named Captures
It would be better to use named captures, which leads to this equivalent solution:
#!/usr/bin/env perl
#
# Solution B: named captures
#
use v5.10;
while (<>) {
say "$+{name}: $+{contents}" while m{
< (?<name> [ib] ) >
(?<contents>
(?:
(?! < /? \k<name> > ) .
) *
)
</ \k<name> >
}gsix;
}
Recursive Captures
Of course, it is not reasonable to assume that such tags neither overlap nor nest. Since this is recursive data, it therefore requires a recursive pattern to solve. Remembering that the trival pattern to parse nested parens recursively is simply:
( \( (?: [^()]++ | (?-1) )*+ \) )
I’ll build that sort of recursive matching into the previous solution, and I’ll further toss in a bit interative processing to unwrap the inner bits, too.
#!/usr/bin/perl
use v5.10;
# Solution C: recursive captures, plus bonus iteration
while (my $line = <>) {
my #input = ( $line );
while (#input) {
my $cur = shift #input;
while ($cur =~ m{
< (?<name> [ib] ) >
(?<contents>
(?:
[^<]++
| (?0)
| (?! </ \k<name> > )
.
) *+
)
</ \k<name> >
}gsix)
{
say "$+{name}: $+{contents}";
push #input, $+{contents};
}
}
}
Which when demo’d produces this:
$ echo 'i got <i>foo <i>nested</i> and <b>bar</b> bits</i> go here' | perl Solution-C
i: foo <i>nested</i> and <b>bar</b> bits
i: nested
b: bar
That’s still fairly simple, so if it works on your data, go for it.
Grammatical Patterns
However, it doesn’t actually know about proper HTML syntax, which admits tag attributes to things like <i> and <b>.
As explained in this answer, one can certainly use regexes to parse markup languages, provided one is careful about it.
For example, this knows the attributes germane to the <i> (or <b>) tag. Here we defined regex subroutines used to build up a grammatical regex. These are definitions only, just like defining regular subs but now for regexes:
(?(DEFINE) # begin regex subroutine defs for grammatical regex
(?<i_tag_end> < / i > )
(?<i_tag_start> < i (?&attributes) > )
(?<attributes> (?: \s* (?&one_attribute) ) *)
(?<one_attribute>
\b
(?&legal_attribute)
\s* = \s*
(?:
(?&quoted_value)
| (?&unquoted_value)
)
)
(?<legal_attribute>
(?&standard_attribute)
| (?&event_attribute)
)
(?<standard_attribute>
class
| dir
| ltr
| id
| lang
| style
| title
| xml:lang
)
# NB: The white space in string literals
# below DOES NOT COUNT! It's just
# there for legibility.
(?<event_attribute>
on click
| on dbl click
| on mouse down
| on mouse move
| on mouse out
| on mouse over
| on mouse up
| on key down
| on key press
| on key up
)
(?<nv_pair> (?&name) (?&equals) (?&value) )
(?<name> \b (?= \pL ) [\w\-] + (?<= \pL ) \b )
(?<equals> (?&might_white) = (?&might_white) )
(?<value> (?&quoted_value) | (?&unquoted_value) )
(?<unwhite_chunk> (?: (?! > ) \S ) + )
(?<unquoted_value> [\w\-] * )
(?<might_white> \s * )
(?<quoted_value>
(?<quote> ["'] )
(?: (?! \k<quote> ) . ) *
\k<quote>
)
(?<start_tag> < (?&might_white) )
(?<end_tag>
(?&might_white)
(?: (?&html_end_tag)
| (?&xhtml_end_tag)
)
)
(?<html_end_tag> > )
(?<xhtml_end_tag> / > )
)
Once you have the pieces of your grammar assembled, you could incorporate those definitions into the recursive solution already given to do a much better job.
However, there are still things that haven’t been considered, and which in the more general case must be. Those are demonstrated in the longer solution already provided.
SUMMARY
I can think of only three possible reasons why you might not care to use regexes for parsing general HTML:
You are using an impoverished regex language, not a modern one, and so you have to recourse to essential modern conveniences like recursive matching or grammatical patterns.
You might such concepts as recursive and grammatical patterns too complicated for you to easily understand.
You prefer for someone else to do all the heavy lifting for you, including the heavy testing, and so you would rather use a separate HTML parsing module instead of rolling your own.
Any one or more of those might well apply. In which case, don’t do it this way.
For simple canned examples, this route is easy. The more robust you want this to work on things you’ve never seen before, the harder this route becomes.
Certainly you can’t do any of it if you are using the inferior, impoverished pattern matching bolted onto the side of languages like Python or even worse, Javascript. Those are barely any better than the Unix grep program, and in some ways, are even worse. No, you need a modern pattern matching engine such as found in Perl or PHP to even start down this road.
But honestly, it’s probably easier just to get somebody else to do it for you, by which I mean that you should probably use an already-written parsing module.
Still, understanding why not to bother with these regex-based approaches (at least, not more than once) requires that you first correctly implement proper HTML parsing using regexes. You need to understand what it is all about. Therefore, little exercises like this are useful for improving your overall understanding of the problem-space, and of modern pattern matching in general.
This forum isn’t really in the right format for explaining all these things about modern pattern-matching. There are books, though, that do so equitably well.
You probably don't want to use regular expressions with HTML.
But if you still want to do this you need to take a look at backreferences.
Basically it's a way to capture a group (such as "b" or "i") to use it later in the same regular expression.
Related issues:
RegEx match open tags except XHTML self-contained tags

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".?" with "[^(\?=)]?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is not to use .*? but instead do use the * without the ?, but refine the character class. In this case [^?]* to match a sequence of non-question mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Match each character until find next one in the regular expression. A '?' in this case.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?' with a '?' after them. This process twice to omit the string 'utf-8?Q?'
([^?]*) # Save in a group next characters until found a '?'
/g # Repeat this process multiple times until end of string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.