Regular Expressions: Non-Greedy with Stack? - regex

I have to do a lot of regex work within LaTeX and HTML files, and I often find myself in the following situation:
I want something like \mbox{\sqrt{2}} + \sqrt{4} to be stripped to \sqrt{2} + \sqrt{4}.
In words: "replace every occurrence of \mbox{...} by its content".
So, how do I do that?
The greedy version \mbox{(.*)} gets me \sqrt{2}} + \sqrt{4 in $1 and the
non-greedy version \mbox{(.*?)} gets me \sqrt{2 in $1.
Neither is what I want.
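The two results quoted above are easy to reproduce. A minimal sketch in Python, assuming the braces and the backslash are escaped (which the patterns above leave out for brevity):

```python
import re

source = r"\mbox{\sqrt{2}} + \sqrt{4}"

# Greedy: .* runs on to the LAST closing brace in the string.
greedy = re.search(r"\\mbox\{(.*)\}", source).group(1)

# Non-greedy: .*? stops at the FIRST closing brace it can.
lazy = re.search(r"\\mbox\{(.*?)\}", source).group(1)

print(greedy)  # \sqrt{2}} + \sqrt{4
print(lazy)    # \sqrt{2
```

Neither quantifier counts braces, which is why both captures come out unbalanced.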
What I need is for the regex engine to somehow keep a stack of the characters that sit immediately before and after (.*), namely { and }. When a new { is encountered inside .*, it should be pushed onto the stack. When a } is encountered, the last { should be popped off the stack. When the stack is empty, .* is done.
Similar cases occur with nested HTML Tags.
So, since most regex engines create an FSA for each regex, a stack should be feasible, or am I missing something? Is there some rare modifier that I'm not aware of? I am wondering why there is no solution for this.
Of course I could code something myself in Java/Python/Perl or whatever, but I'd like to have it integrated in the regex :)
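For comparison, a hand-coded version of the stack idea might look like the sketch below. Since only one kind of bracket is involved, a depth counter stands in for the stack. The function name strip_macro is made up for illustration, and the sketch assumes that braces are never escaped as \{ or \} in the input.

```python
def strip_macro(text, macro="\\mbox"):
    """Replace every macro{...} by its brace-balanced content."""
    opener = macro + "{"
    out = []
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            depth = 1                      # the opener's own brace is "on the stack"
            j = i + len(opener)
            while j < len(text) and depth > 0:
                if text[j] == "{":
                    depth += 1             # push
                elif text[j] == "}":
                    depth -= 1             # pop
                j += 1
            if depth == 0:
                # text[j - 1] is the matching close brace; keep only the
                # content, and unwrap any macros nested inside it as well.
                out.append(strip_macro(text[i + len(opener):j - 1], macro))
                i = j
                continue
        # Not an opener (or an unbalanced one): copy the character as-is.
        out.append(text[i])
        i += 1
    return "".join(out)


print(strip_macro(r"\mbox{\sqrt{2}} + \sqrt{4}"))  # \sqrt{2} + \sqrt{4}
```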
Regards, Gilbert
(ps: I omitted to project + \sqrt{4} to keep the example small, \ should be escaped too)

It depends on your regex engine, but it is possible with the .NET regex engine as follows:
\\mbox{(
(?>
[^{}]+
| { (?<number>)
| } (?<-number>)
)*
(?(number)(?!))
)
}
This assumes you are using RegexOptions.IgnorePatternWhitespace.
You can then call regex.Replace(sourceText, "$1") to perform the conversion you wanted.

Here's another regex that works in perl http://codepad.org/fcVz9Bky :
s/
\\mbox{
(
(?:
[^{}]+ #either match any number of non-braces
| #or
\{[^{}]+} #braces surrounding non-braces
)*
)
}
/$1/x;
Note: It only works for one level of nesting
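The same one-level pattern carries over to engines without recursion. Here is a rough Python port (the names ONE_LEVEL and unwrap are mine, not from the Perl snippet), which also shows the limit: a second level of nesting simply fails to match and is left alone.

```python
import re

ONE_LEVEL = re.compile(r"""
    \\mbox\{
    (                     # group 1: the content to keep
      (?:
          [^{}]+          # either a run of non-brace characters
        | \{ [^{}]+ \}    # or one brace pair surrounding non-braces
      )*
    )
    \}
""", re.VERBOSE)


def unwrap(text):
    # A callable replacement avoids any escape processing of the content.
    return ONE_LEVEL.sub(lambda m: m.group(1), text)


print(unwrap(r"\mbox{\sqrt{2}} + \sqrt{4}"))  # \sqrt{2} + \sqrt{4}
print(unwrap(r"\mbox{a{b{c}}}"))              # unchanged: two levels deep
```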

Another trick you may be able to use is a recursive regex (which should be supported by PCRE and a few other flavors):
\\mbox(\{([^{}]|(?1)+)*+\})
Not too much to explain, if you're in the right state of mind.
Here's a similar one, but a little more flexible (for example, easier to add [] and (), or other balanced constructs):
\\mbox\{([^{}]|\{(?1)*\})*\}

Related

Capturing what's inside a nested structure in a regex or grammar token

I'd like to capture the interior of a nested structure.
my $str = "(a)";
say $str ~~ /"(" ~ ")" (\w) /;
say $str ~~ /"(" ~ ")" <(\w)> /;
say $str ~~ /"(" <(~)> ")" \w /;
say $str ~~ /"(" <(~ ")" \w /;
The first one works; the last one works but also captures the closing parenthesis. The other two fail, so it's not possible to use capture markers in this case. But the problem is more complicated in the context of a grammar, since capturing groups do not seem to work either, like here:
# Please paste this together with the code above so that it compiles.
grammar G {
token TOP {
'(' ~ ')' $<content> = .+?
}
}
grammar H {
token TOP {
'(' ~ ')' (.+?)
}
}
grammar I {
token TOP {
'(' ~ ')' <( .+? )>
}
}
$str = "(one of us)";
for G,H,I -> $grammar {
say $grammar.parse( $str );
}
Neither capturing groups nor capture markers seem to work here; the only thing that does is assigning the match, on the fly, to a variable. This, however, creates an additional token I'd really like to avoid.
So there are two questions
What is the right way to make capture markers work in nested structures?
Is there a way to use either capturing groups or capturing markers in tokens to get the interior of a nested structure?
One solution to two issues
Per ugexe's comment, the [...] grouping construct works for all your use cases.
The <( and )> capture markers are not grouping constructs so they don't work with the regex ~ operation unless they're grouped.
The (...) capture/grouping construct clamps frugal matching to its minimum match when ratchet is in effect. A pattern like :r (.+?) never matches more than one character.
The behaviors described in the last two bullet points above aren't obvious, aren't in the docs, may not be per the design docs, may be holes in roast, may be figments of my imagination, etc. The rest of this answer explains what I've found out about the above three cases, and discusses some things that could be done.
Glib explanation, as if it's all perfectly cromulent
<( and )> are capture markers.
They behave as zero width assertions. Each asserts "this marks where I want capturing to start/end for the regex that contains this marker".
Per the doc for the regex ~ operator:
it mostly ignores the left argument, and operates on the next two [arguments]
(The doc says "atoms" where I've written "arguments". In reality it operates on the next two atoms or groups.)
In the regex pattern "(" ~ ")" <(\w)>:
")" is the first atom/group after ~.
<( is the second atom/group after ~.
~ ignores \w)>.
The solution is to use [...]:
say '(a)' ~~ / '(' ~ ')' [ <( \w )> ] /; # 「a」
Similarly, in a grammar:
token TOP { '(' ~ ')' [ <( .+? )> ] }
(...) grouping isn't what you want for two reasons:
It couldn't be what you want. It would create an additional token capture. And you wrote you'd like to avoid that.
Even if you wanted the additional capture, using (...) when ratchet is in effect clamps frugal matching within the parens.
What could be done about capture markers "not working"?
I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.
Is it known to be intended behavior or a bug?
Searches of GH repos for "capture markers":
raku/old-design-docs
raku/roast
raku/old-issue-tracker and rakudo/rakudo
raku/docs
The term "capture markers" comes from the doc, not the old design docs which just say:
A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. When matched, these behave as assertions that are always true, but have the side effect of setting the .from and .to attributes of the match object.
(Maybe you can figure out from that what strings to search for among issues etc...)
At the time of writing, all GH searches for <( or )> draw blanks, but that's due to a weakness of the current built-in GH search, not because there aren't any in those repos.
I was curious and tried this:
my $str = "aaa";
say $str ~~ / <(...)>* /;
It infinitely loops. The * is acting on just the )>. This corroborates the sense that capture markers are treated as atoms.
The regex ~ operator works for [...] and some other grouped atom constructions. Parsing any of them has a start and end within a regex pattern.
The capture markers are different in that they aren't necessarily paired -- the start or end can be implicit.
Perhaps this makes treating them as we might wish unreasonably difficult for Raku, given that the start (/ or {) and end (/ or }) occur at a slang boundary and Raku is a single-pass parsing braid?
I think that a doc fix is probably the appropriate response to this capture marker aspect of your SO.
If regex ~ were the only regex construct that cared that left and right capture markers are each an individual atom then perhaps the best place to mention this wrinkle would be in the regex ~ section.
But given that multiple regex constructs care (quantifiers do per the above infinite loop example), then perhaps the best place is the capture markers section.
Or perhaps it would be best if it's mentioned in both. (Though that's a slippery slope...)
What could be done about :r (.*?) "not working"?
I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.
Is it known to be intended behavior or a bug?
Searches of GH repos for ratchet frugal:
raku/old-design-docs
raku/roast
raku/old-issue-tracker and rakudo/rakudo
raku/docs
The terms "ratchet" and "frugal" both come from the old design docs and are still used in the latest doc and don't seem to have aliases. So searches for them should hopefully match all relevant mentions.
The above searches are for both words. Searching for one at a time may reveal important relevant mentions that happen to not mention the other.
At the time of writing, all GH searches for .*? or similar draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos.
Perhaps the issue here is broader than the combination of ratchet, frugal, and capture?
Perhaps file an issue using the words "ratchet", "frugal" and "capture"?

RegEx - is recursive substitution possible using only a RegEx engine? Conditional search replace

I'm editing some data, and my end goal is to conditionally substitute , (comma) chars with .(dot). I have a crude solution working now, so this question is strictly for suggestions on better methods in practice, and determining what is possible with a regex engine outside of an enhanced programming environment.
I gave it a good college try, but 6 hours is enough mental grind for a Saturday, and I'm throwing in the towel. :)
I've been through about 40 SO posts on regex recursion, substitution, etc, the wiki.org on the definitions and history of regex and regular language, and a few other tutorial sites. The majority is centered around Python and PHP.
The working, crude regex (facilitating loops / search and replace by hand):
(^.*)(?<=\()(.*?)(,)(.*)(?=\))(.*$)
A snip of the input:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
room_ass=01:macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*,4,6,8,),
room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),
And the desired output:
room_ass=01: macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*.3.5.7.),
room_ass=01: macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*.4.6.8.),
room_ass=01: macro_id=03: name=All, pgm_audio=1, list=(1.2*.3.4.5.6.7.8.),
That's all. Just replace the , with ., but only inside ( ).
This is one conceptual (not working) method I'd like to see, where the middle group<3> would loop recursively:
(^.*)(?<=\()([^,]*)([,|\d|\*]\3.*)(?=\))(.*$)
( ^ )
..where each recursive iteration would shift across the data, either 1 char or 1 comma at a time:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
iter 1-| ^ |
2-| ^ |
3-| ^ |
4-| ^|
or
A much simpler approach would be to just tell it to mask/select all , between the (), but I struck out on figuring that one out.
I use text editors a lot for little data editing tasks like this, so I'd like to verify that SublimeText can't do it before I dig into Python.
All suggestions and criticisms welcome. Be gentle. <--#n00b
Thanks in advance!
-B
Not much magic needed. Just check, if there's a closing ) ahead, without any ( in between.
,(?=[^)(]*\))
See this demo at regex101
However, it does not check for an opening (. It's a common approach and probably a duplicate.
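To try the lookahead outside an editor, here is a small Python sketch run against two of the rows from the question (the variable names are only for illustration):

```python
import re

rows = [
    "room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),",
    "room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),",
]

# A comma counts only if a ")" lies ahead with no "(" or ")" in between.
COMMA_BEFORE_CLOSE = re.compile(r",(?=[^)(]*\))")

converted = [COMMA_BEFORE_CLOSE.sub(".", row) for row in rows]

for row in converted:
    print(row)
```

The commas after name=Left, pgm_audio=0 and so on are left alone because the lookahead runs into the opening ( before it can reach a ).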
This is a complete guess because I don't use SublimeText; the assumption here is that SublimeText uses PCRE regular expressions.
Note that you mention "recursive", but I don't believe you mean regular expression recursion, which doesn't fit the problem here.
Something like this might work...
You'll need to test to make sure this isn't matching other things in your document and to see if SublimeText even supports this...
This is based on using the \K operator to "keep" what comes before it out of the match. You can find other uses of it as a PCRE alternative (workaround) to variable-length look-behinds, which PCRE does not support.
Regular Expression
\((?:(?:[^,\)]+),)*?(?:[^,\)]+)\K,
Regex Description
Match the opening parenthesis character \(
Match the regular expression below (?:(?:[^,\)]+),)*?
  Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
  Match the regular expression below (?:[^,\)]+)
    Match any single character NOT present in the list below [^,\)]+
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
      The literal character “,” ,
      The closing parenthesis character \)
  Match the character “,” literally ,
Match the regular expression below (?:[^,\)]+)
  Match any single character NOT present in the list below [^,\)]+
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
    The literal character “,” ,
    The closing parenthesis character \)
Keep the text matched so far out of the overall regex match \K
Match the character “,” literally ,
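Where \K is not available (Python's built-in re module, for one, does not support it), a replacement callback can do a similar job. This is a different technique from the pattern above, sketched under the assumption that the parentheses are never nested; the name dots_inside_parens is hypothetical.

```python
import re

# One "(" ... ")" group with no parentheses inside it.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")


def dots_inside_parens(line):
    # Rewrite the commas only within the text captured between ( and ).
    return PAREN_GROUP.sub(
        lambda m: "(" + m.group(1).replace(",", ".") + ")", line
    )


row = "room_ass=01:macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*,4,6,8,),"
print(dots_inside_parens(row))
```

Because the opening ( is part of the match, a stray ) without a partner does not trigger any replacement.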

PHP preg_match_all trouble

I have written a regular expression that I tested on rubular.com, where it returned 4 matches. The test subject can be found here http://pastebin.com/49ERrzJN and the PHP code is below. For some reason the PHP code returns only the first 2 matches. How can I make it match all 4? It seems to have something to do with greediness.
$file = file_get_contents('x.txt');
preg_match_all('~[0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})~', $file, $m, PREG_SET_ORDER);
foreach($m as $v) echo 'S: '. $v[1]. '; N: '. $v[3]. '; D:'. $v[7]. '<br>';
Your regex is very slooooooow. After trying it on regex101.com, I found it would time out on PHP (but not JS, for whatever reason). I'm pretty sure the timeout happens at around 50,000 steps. Actually, it makes sense now why you're not using an online PHP regex tester.
I'm not sure if this is the source of your problem, but there is a default memory limit in PHP:
memory_limit [default:] "128M"
[history:] "8M" before PHP 5.2.0, "16M" in PHP 5.2.0
If you use the multiline modifier (I assume that preg_match_all essentially adds the global modifier), you can use this regex that only takes 1282 steps to find all 4 matches:
^ [0-9]+\s+(((?!\d{7,}).){2,20})\s{2,30}(((?!\d{7,}).){2,30})\s+([0-9]+)-([0-9]+)-([0-9]+)\s+(F|M)\s+(.{3,25})\s+(((?!\d{7,}).){2,50})
Actually, there are only 2 characters that I added. They're at the beginning, the anchor ^ and the literal space.
If you have to write a long pattern, the first thing to do is to make it readable. To do that, use the verbose mode (x modifier) that allows comments and free-spacing, and use named captures.
Then you need to make a precise description of what you are looking for:
your target takes a whole line => use the anchors ^ and $ with the modifier m, and use the \h class (that only contains horizontal white-spaces) instead of the \s class.
instead of using this kind of inefficient sub-patterns (?:(?!.....).){m,n} to describe what your field must not contain, describe what the field can contain.
use atomic groups (?>...) when needed instead of non-capturing groups to avoid useless backtracking.
in general, using precise characters classes avoids a lot of problems
pattern:
~
^ \h*+ # start of the line
# named captures # field separators
(?<VOTERNO> [0-9]+ ) \h+
(?<SURNAME> \S+ (?>\h\S+)*? ) \h{2,}
(?<OTHERNAMES> \S+ (?>\h\S+)*? ) \h{2,}
(?<DOB> [0-9]{2}-[0-9]{2}-[0-9]{4} ) \h+
(?<SEX> [FM] ) \h+
(?<APPID_RECNO> [0-9A-Z/]+ ) \h+
(?<VILLAGE> \S+ (?>\h\S+)* )
\h* $ # end of the line
~mx
demo
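For readers who want to try the same layout outside PHP, here is a rough Python port. Several things in it are my assumptions rather than part of the answer: Python's re module has no \h, so [ \t] is swapped in; the atomic groups and the possessive quantifier are replaced by ordinary ones so the sketch also runs on older Python versions; and the two sample lines are invented, since the real data sits behind the pastebin link.

```python
import re

LINE = re.compile(r"""
    ^ [ \t]*                                          # start of the line
    (?P<VOTERNO>     [0-9]+ )                         [ \t]+
    (?P<SURNAME>     \S+ (?: [ \t] \S+ )*? )          [ \t]{2,}
    (?P<OTHERNAMES>  \S+ (?: [ \t] \S+ )*? )          [ \t]{2,}
    (?P<DOB>         [0-9]{2}-[0-9]{2}-[0-9]{4} )     [ \t]+
    (?P<SEX>         [FM] )                           [ \t]+
    (?P<APPID_RECNO> [0-9A-Z/]+ )                     [ \t]+
    (?P<VILLAGE>     \S+ (?: [ \t] \S+ )* )
    [ \t]* $                                          # end of the line
""", re.MULTILINE | re.VERBOSE)

# Invented sample data: fields separated by at least two spaces where needed.
sample = (
    "1 DE LA CRUZ  JUAN PABLO  12-03-1975 M 12/AB/7 SAN JOSE\n"
    "2 SMITH  ANNA  01-01-1990 F 9/XY/1 SPRINGFIELD"
)

records = [m.groupdict() for m in LINE.finditer(sample)]
for record in records:
    print(record["SURNAME"], "|", record["OTHERNAMES"], "|", record["VILLAGE"])
```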
If you want to know what goes wrong with a pattern, you can use the function preg_last_error()

Regex: delete contents of square brackets

Is there a regular expression that can be used with search/replace to delete everything occurring within square brackets (and the brackets)?
I've tried \[.*\] which chomps extra stuff (e.g. "[chomps] extra [stuff]")
Also, the same thing with lazy matching \[.*?\] doesn't work when there is a nested bracket (e.g. "stops [chomping [too] early]!")
Try something like this:
$text = "stop [chomping [too] early] here!";
$text =~ s/\[([^\[\]]|(?0))*]//g;
print($text);
which will print:
stop here!
A short explanation:
\[ # match '['
( # start group 1
[^\[\]] # match any char except '[' and ']'
| # OR
(?0) # recursively match group 0 (the entire pattern!)
)* # end group 1 and repeat it zero or more times
] # match ']'
The regex above will get replaced with an empty string.
You can test it online: http://ideone.com/tps8t
EDIT
As @ridgerunner mentioned, you can make the regex more efficient by making the character class [^\[\]] match one or more times possessively (++), making the * possessive as well (*+), and turning group 1 into a non-capturing group:
\[(?:[^\[\]]++|(?0))*+]
But a real improvement in speed might only be noticeable when working with large strings (you can test it, of course!).
This is technically not possible with regular expressions because the language you're matching does not meet the definition of "regular". There are some extended regex implementations that can do it anyway using recursive expressions, among them are:
Greta:
http://easyethical.org/opensource/spider/regexp%20c++/greta2.htm#_Toc39890907
and
PCRE
http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions
See "Recursive Patterns", which has an example for parentheses.
A PCRE recursive bracket match would look like this:
\[(?:[^\[\]]|(?R))*\]
edit:
Since you added that you're using Perl, here's a page that explicitly describes how to match balanced pairs of operators in Perl:
http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f
Something like:
$string =~ m/(\[(?:[^\[\]]++|(?1))*\])/xg;
Since you're using Perl, you can use modules from the CPAN and not have to write your own regular expressions. Check out the Text::Balanced module that allows you to extract text from balanced delimiters. Using this module means that if your delimiters suddenly change to {}, you don't have to figure out how to modify a hairy regular expression, you only have to change the delimiter parameter in one function call.
If you are only concerned with deleting the contents and not capturing them to use elsewhere you can use a repeated removal from the inside of the nested groups to the outside.
my $string = "stops [chomping [too] early]!";
# remove any [...] sequence that doesn't contain a [...] inside it
# and keep doing it until there are no [...] sequences to remove
1 while $string =~ s/\[[^\[\]]*\]//g;
print $string;
The 1 while will basically do nothing while the condition is true. If a s/// matches and removes a bracketed section the loop is repeated and the s/// is run again.
This will work even if you're using an older version of Perl or another language that doesn't support the (?0) recursion extended pattern in Bart Kiers's answer.
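As an illustration of that last point, the same inside-out loop can be written in Python, whose built-in re module has no recursion support. The function name strip_brackets is made up for this sketch.

```python
import re

# A [...] pair that contains no other brackets, i.e. an innermost pair.
INNERMOST = re.compile(r"\[[^\[\]]*\]")


def strip_brackets(text):
    # Keep removing innermost pairs until a pass removes nothing.
    while True:
        text, removed = INNERMOST.subn("", text)
        if removed == 0:
            return text


print(strip_brackets("stops [chomping [too] early]!"))  # stops !
```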
You want to remove only things between the []s that aren't []s themselves, i.e.:
\[[^\]]*\]
Which is a pretty hairy mess of []s ;-)
It won't handle nested []s, though; i.e., matching [foo[bar]baz] won't work.

How to edit "Full Windows Folder Path Regular Expression"

Hey, this regular expression works fine for a full Windows folder path:
^([A-Za-z]:|\\{2}([-\w]+|((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\\(([^"*/:?|<>\\,;[\]+=.\x00-\x20]|\.[.\x20]*[^"*/:?|<>\\,;[\]+=.\x00-\x20])([^"*/:?|<>\\,;[\]+=\x00-\x1F]*[^"*/:?|<>\\,;[\]+=\x00-\x20])?))\\([^"*/:?|<>\\.\x00-\x20]([^"*/:?|<>\\\x00-\x1F]*[^"*/:?|<>\\.\x00-\x20])?\\)*$
Matches
d:\, \\Dpk\T c\, E:\reference\h101\, \\be\projects$\Wield\Rff\, \\70.60.44.88\T d\SPC2\
Non-Matches
j:ohn\, \\Dpk\, G:\GD, \\cae\.. ..\, \\70.60.44\T d\SPC2\
Problem:
This expression requires a "\" at the end of the path.
How can I edit this expression so the user can enter paths like
C:\Folder1, C:\Folder 1\Sub Folder
There are two ways to approach this problem:
Understand the regex (way harder than necessary) and fix it to your specification (may be buggy)
Who cares how the regex does its thing (it seems to do what you need) and modify your input to conform to what you think the regex does
The second approach means that you just check whether the input string ends with \. If it doesn't, add one, then let the regex do its magic.
I normally wouldn't recommend this ignorant alternative, but this may be an exception.
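A minimal sketch of that second approach in Python. The STRICT pattern here is a deliberately simplified stand-in that I made up (drive-letter paths only), not the original expression, and is_valid_folder_path is a hypothetical helper name.

```python
import re

# Simplified stand-in for the strict pattern: a drive letter, then zero or
# more folder names, each followed by a mandatory backslash.
STRICT = re.compile(r'^[A-Za-z]:\\(?:[^"*/:?|<>\\]+\\)*$')


def is_valid_folder_path(path):
    # Normalise the input instead of touching the regex:
    # append the trailing backslash the pattern insists on.
    if not path.endswith("\\"):
        path += "\\"
    return STRICT.match(path) is not None


print(is_valid_folder_path(r"C:\Folder1"))              # True
print(is_valid_folder_path(r"C:\Folder 1\Sub Folder"))  # True
print(is_valid_folder_path("j:ohn"))                    # False
```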
Blackboxing
Here's how I'm "solving" this problem:
There's a magic box, who knows how it works but it does 99% of the time
We want it to work 100% of the time
It's simpler to fix the 1% so it works with the magic box rather than fixing the magic box itself (because this would require understanding of how the magic box works)
Then just fix the 1% manually and leave the magic box alone
Deciphering the black magic
That said, we can certainly try to take a look at the regex. Here's the same pattern but reformatted in free-spacing/comments mode, i.e. (?x) in e.g. Java.
^
( [A-Za-z]:
| \\{2} ( [-\w]+
| (
(25[0-5]
|2[0-4][0-9]
|[01]?[0-9][0-9]?
)\.
){3}
(25[0-5]
|2[0-4][0-9]
|[01]?[0-9][0-9]?
)
)
\\ (
( [^"*/:?|<>\\,;[\]+=.\x00-\x20]
| \.[.\x20]* [^"*/:?|<>\\,;[\]+=.\x00-\x20]
)
( [^"*/:?|<>\\,;[\]+=\x00-\x1F]*
[^"*/:?|<>\\,;[\]+=\x00-\x20]
)?
)
)
\\ (
[^"*/:?|<>\\.\x00-\x20]
(
[^"*/:?|<>\\\x00-\x1F]*
[^"*/:?|<>\\.\x00-\x20]
)?
\\
)*
$
The main skeleton of the pattern is as follows:
^
(head)
\\ (
bodypart
\\
)*
$
Based on this higher-level view, it looks like an optional trailing \ can be supported by adding ? on the two \\ following the (head) part:
^
(head)
\\?(
bodypart
\\?
)*
$
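One way to experiment with the skeleton in Python, using simplified stand-ins of my own for (head) and bodypart. Note that this sketch is arranged slightly differently from the skeleton above: every separator stays mandatory in front of its folder name and only the final one is optional, which I assume is enough for the stated goal and keeps the repetition unambiguous.

```python
import re

HEAD = r"[A-Za-z]:"            # stand-in for (head): drive letter only
BODYPART = r'[^"*/:?|<>\\]+'   # stand-in for bodypart: one folder name

# head, then any number of "\name" parts, then an optional trailing "\".
PATH = re.compile(rf"^{HEAD}(?:\\{BODYPART})*\\?$")

for candidate in [r"C:\Folder1", r"C:\Folder 1\Sub Folder", "C:\\Folder1\\", "j:ohn"]:
    print(candidate, bool(PATH.match(candidate)))
```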
References
regular-expressions.info/Question Mark for Optional
Note on catastrophic backtracking
You should generally be very wary of nesting repetition modifiers (a ? inside a * in this case), but for this specific pattern it's "okay", because the bodypart doesn't match \.
References
regular-expressions.info/Catastrophic Backtracking
I don't understand your regular expression at all. But I bet all you need to do is find the bit or bits that match the trailing "\", and add a single question mark after that bit or those bits.
The regex you provided seems to mismatch "C:\?tmp" which is an invalid windows path.
I have figured out one solution, but it works on Windows only. You may want to try this one:
"^[A-Za-z]:(?:\\\\(?![\"*/:?|<>\\\\,;[\\]+=.\\x00-\\x20])[^\"*/:?|<>\\\\[\\]]+){0,}(?:\\\\)?$"
This regex ignores the last "\" which hinders you.
I've tested with pcre.lib(5.5) in VS2005.
Hope it helps!
I know this question is roughly 4 years old, but the following may be sufficient:
string validWindowsOrUncPath = @"^(?:(?:[a-z]:)|(?:\\\\[^\\*\?\:;\0]*))(?:\\[^\\*\?\:;\0]*)+$";
(to be used with IgnoreCase option).
Edit:
I even came to this one, which can extract the root and each part in named groups:
string validWindowsOrUncPath = @"^(?<Root>(?:[a-z]:)|(?:\\\\[^\\*\?\:;\0]*))(?:\\(?<Part>[^\\*\?\:;\0]*))+$";