Regex doesn't fetch the nested curly braces - regex

Curly braces matches sometimes and doesn't in few case.
My Code:
use strict;
use warnings;
my $str1 = '$$\eqalign{&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\{\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+\!m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}$$';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
my $regex = qr/(?:[^{}]*(?:{(?:[^{}]*(?:{(?:[^{}]*(?:{[^{}]*})*[^{}]*)})*[^{}]*)*})*[^{}]*)*/;
if($str1=~m/\{$regex\}/) { print "str1: $&\n"; }
if($str2=~m/\{$regex\}/) { print "str2: $&\n"; }
OUTPUT:
str1: {&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\ {\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+ !m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}
str2: {2}
str1 is correct output. str2 incorrect output.
Expected Output on str2 is:
str2: {{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}
In the sample str1 string doesn't matched with the nested curly braces. However the second sample str12 string can matched the nested curly braces.
This is my question can matched the nested curly braces. I am clueless. It would be better if someone point out my mistake.
Thanks in advance.

Since your actual requirements (discussed in the chat) are to match substrings starting with \bib followed with {...} substrings or any chars other than { and }, you should use a regex with a subroutine:
/\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g
Details:
\\bib - \bib literal text
(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])* - 0+ occurrences of:
({(?:[^{}]++|(?1))*}) - Group 1 (that will be recursed with (?1)) matching
{ - a literal {
(?:[^{}]++|(?1))* - 0 or more occurrences of 1+ chars other than { and } or the whole Group 1 subpattern
} - a literal }
| - or
(?!\\bib)[^{}] - a char other than { and } not starting a \bib literal char sequence.
See the sample Perl code:
use strict;
use warnings;
use feature 'say';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
while($str2 =~ /\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g) {
say "$&";
}

Note The edit in the question adds \\bibcite{Airdetal2013} in front. However, this doesn't change the analysis below as it doesn't change the overall nesting levels.
This has got to be possible to do in a better way. There is recursive regex offered by Wiktor Stribiżew in comments. There are modules for recursive parsing. And there are tools for parsing Latex.
However, out of curiosity ...
Your string, shortened suitably
my $str2 = "{{2}{2017}{{{John}{et~al.}}}{{{James}, ... {Gnali}, \& {Giito}}}}";
or, with C standing for a pair of curlies with something inside (no nesting)
"{ C C { { C C } { C, ... \& C } } }"
So you have three levels of nesting, to get down to the last pair {...} (no further nesting).
Your regex, spread out and with $nc = qr/[^{}]*/ (Non-Curlies), so that we can look at it
my $regex = qr/
(?: $nc
(?: {
(?: $nc
(?: {
(?: $nc (?: { $nc } )* $nc )
}
)* $nc
)*
}
)* $nc
)*/x;
I can count two levels here. (The $nc has no curlies so { $nc } matches my C above.)
Thus this regex cannot match that whole string.
How to fix it? Best, find another way so to not drown in this.
Or, write it out like above, very carefully, and add the missing level.

Related

RE2 Nested Regex Group Match

I have a RE2 regex as following
const re2::RE2 numRegex("(([0-9]+),)+([0-9])+");
std::string inputStr;
inputStr="apple with make,up things $312,412,3.00");
RE2::Replace(&inputStr, numRegex, "$1$3");
cout << inputStr;
Expected
apple with make,up,things $3124123.00
I was trying to remove the , in the recognized number, $1 would only match 312 but not 412 part. Wondering how to extract the recursive pattern in the group.
Note that RE2 doesn't support lookahead (see Using positive-lookahead (?=regex) with re2) and the solutions I found all use lookaheads.
RE2 based solution
As RE2 does not support lookarounds, there is no pure single-pass regex solution.
You can have a workaround (as usual, when no solution is available): replace the string twice with (\d),(\d) regex and $1$2 substitution:
const re2::RE2 numRegex(R"((\d),(\d))");
std::string inputStr("apple with make,up things $312,412,3.00");
RE2::Replace(&inputStr, numRegex, "$1$2");
RE2::Replace(&inputStr, numRegex, "$1$2"); // <- Second pass to remove commas in 1,2,3,4 like strings
std::cout << inputStr;
C++ std::regex based solution:
You can remove the commas between digits using
std::string inputStr("apple with make,up things $312,412,3.00");
std::regex numRegex(R"((\d),(?=\d))");
std::cout << regex_replace(inputStr, numRegex, "$1") << "\n";
// => apple with make,up things $3124123.00
See the C++ demo. Also, see the regex demo here.
Details:
(\d) - Capturing group 1 ($1): a digit
, - a comma
(?=\d) - a positive lookahead that requires a digit immediately to the right of the current location.
In the pattern that you tried, you are repeating the outer group (([0-9]+),)+ which will then contain the value of the last iteration where it can match a 1+ digits and a comma.
The last iteration will capture 412, and 312, will only be matched.
You are using regex, but as an alternative if you have boost available, you could make use of the \G anchor which can get iterative matches asserting the position at the end of the previous match and replace with an empty string.
(?:\$|\G(?!^))\d+\K,(?=\d)
The pattern matches:
(?: Non capture group
\$ match $
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close non capture group
\d+\K Match 1+ digits and forget what is matched so far
,(?=\d) Match a comma and assert a digit directly to the right
Regex demo
#include<iostream>
#include <string>
#include <boost/regex.hpp>
using namespace std;
int main()
{
std::string inputStr = "apple with make,up things $312,412,3.00";
boost::regex numRegex("(?:\\$|\\G(?!^))\\d+\\K,(?=\\d)");
std::string result = boost::regex_replace(inputStr, numRegex, "");
std::cout << result << std::endl;
}
Output
apple with make,up things $3124123.00

regex - how to specify the expressions to exclude

I need to replace two characters {, } with {\n, \n}.
But they must be not surrounded in '' or "".
I tried this code to achieve that
text = 'hello(){imagine{myString("HELLO, {WORLD}!")}}'
replaced = re.sub(r'{', "{\n", text)
Ellipsis...
Naturally, This code replaces curly brackets that are surrounded in quote marks.
What are the negative statements like ! or not that can be used in regular expressions?
And the following is what I wanted.
hello(){
imagine{
puts("{HELLO}")
}
}
In a nutshell - what I want to do is
Search { and }.
If that is not enclosed in '' or ""
replace { or } to {\n or \n}
In the opposite case, I can solve it with (?P<a>\".*){(?P<b>.*?\").
But I have no clue how I can solve it in my case.
First replace all { characters with {\n. You will also be replacing {" with {\n". Now, you can replace back all {\n" characters with {".
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('{', '{\n').replace('{\n"','{"')
You may match single and double quoted (C-style) string literals (those that support escape entities with backslashes) and then match { and } in any other context that you may replace with your desired values.
See Python demo:
import re
text = 'hello(){imagine{puts("{HELLO}")}}'
dblq = r'(?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"'
snlq = r"(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*'"
rx = re.compile(r'({}|{})|[{{}}]'.format(dblq, snlq))
print(rx.pattern)
def repl(m):
if m.group(1):
return m.group(1)
elif m.group() == '{':
return '{\n'
else:
return '\n}'
# Examples
print(rx.sub(repl, text))
print(rx.sub(repl, r'hello(){imagine{puts("Nice, Mr. \"Know-all\"")}}'))
print(rx.sub(repl, "hello(){imagine{puts('MORE {HELLO} HERE ')}}"))
The pattern that is generated in the code above is
((?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"|(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*')|[{}]
It can actually be reduced to
(?<!\\)((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))|[{}]
See the regex demo.
Details:
The pattern matches 2 main alternatives. The first one matches single- and double-quoted string literals.
(?<!\\) - no \ immediately to the left is allowed
((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')) - Group 1:
(?:\\{2})* - 0+ repetitions of two consecutive backslashes
(?: - a non-capturing group:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quoted string literal
| - or
'[^'\\]*(?:\\.[^'\\]*)*' - a single quoted string literal
) - end of the non-capturing group
| - or
[{}] - a { or }.
In the repl method, Group 1 is checked for a match. If it matched, the single- or double-quoted string literal is matched, it must be put back where it was. Else, if the match value is {, it is replaced with {\n, else, with \n}.
Replace { with {\n:
text.replace('{', '{\n')
Replace } with \n}:
text.replace('}', '\n}')
Now to fix the braces that were quoted:
text.replace('"{\n','"{')
and
text.replace('\n}"', '}"')
Combined together:
replaced = text.replace('{', '{\n').replace('}', '\n}').replace('"{\n','"{').replace('\n}"', '}"')
Output
hello(){
imagine{
puts("{HELLO}")
}
}
You can check the similarities with the input and try to match them.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('){', '){\n').replace('{puts', '{\nputs').replace('}}', '\n}\n}')
print(replaced)
output:
hello(){
imagine{
puts("{HELLO}")
}
}
UPDATE
try this: https://regex101.com/r/DBgkrb/1

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( #{$body} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
if ( $delimit ) {
next if (/$delimit/ && ! $tp);
last if (/$delimit/ && $tp);
$tp = 1, next if /text.plain/;
$tp = 0, next if /text.html/;
s/<[^>]*>//g;
$newbody .= $_ if $tp;
} else {
s/<[^>]*>//g;
$newbody .= $_ ;
}
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( #{$body_text} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/^\s*NEMS\s*=\s*(\d+)/i;
$nems = $1;
next;
}
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
what includes the clarification that the new $body_text is a multiline string.
It is clear that $nems is declared (needed) outside of the loop and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit It's been clarified that there can only be 5 or 7 digits. Then the regex can be tightened, to check whether input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut, for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The clarification of the evolved problem understanding states that all digits following = need be captured into $nems (and there may be 5,(not 6),7,8,9,10 digits). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means a digit, one or more times. So a string of digits (match fails if there are none).

Why won't Groovy honor "OR" instances in my regex?

It is well established that "|" in a regex is the "OR" operator. So when I run this:
static void main(String[] args) {
String permission = "[fizz]:[index]"
if((permission =~ /\[fizz|buzz]:\[.*]/).matches()) {
println "We match!"
} else {
println "We don't match!"
}
}
...then why does it print "We don't match!"???
The regex \[fizz|buzz]:\[.*] matches:
\[fizz - literal [ followed by fizz
| - OR operator....
buzz]:\[ - matches literal buzz]:[
.* - any character but a newline, as many times as possible, greedy
] - a literal ].
I think you need to re-group the alternatives:
if((permission =~ /\[(?:fizz|buzz)]:\[[^\]]*]/).matches()) {
Here, \[(?:fizz|buzz)]:\[[^\]]*] will match a [, then either fizz or buzz without capturing the words, then ]:[, [^\]]* will match 0 or more any characters but a ] and then ].
Check the regex101 demo. Also checked at OCP Regex Visualizer:

c# regex split or replace. here's my code i did

I am trying to replace a certain group to "" by using regex.
I was searching and doing my best, but it's over my head.
What I want to do is,
string text = "(12je)apple(/)(jj92)banana(/)cat";
string resultIwant = {apple, banana, cat};
In the first square bracket, there must be 4 character including numbers.
and '(/)' will come to close.
Here's my code. (I was using matches function)
string text= #"(12dj)apple(/)(88j1)banana(/)cat";
string pattern = #"\(.{4}\)(?<value>.+?)\(/\)";
Regex rex = new Regex(pattern);
MatchCollection mc = rex.Matches(text);
if(mc.Count > 0)
{
foreach(Match str in mc)
{
print(str.Groups["value"].Value.ToString());
}
}
However, the result was
apple
banana
So I think I should use replace or something else instead of Matches.
The below regex would capture the word characters which are just after to ),
(?<=\))(\w+)
DEMO
Your c# code would be,
{
string str = "(12je)apple(/)(jj92)banana(/)cat";
Regex rgx = new Regex(#"(?<=\))(\w+)");
foreach (Match m in rgx.Matches(str))
Console.WriteLine(m.Groups[1].Value);
}
IDEONE
Explanation:
(?<=\)) Positive lookbehind is used here. It sets the matching marker just after to the ) symbol.
() capturing groups.
\w+ Then it captures all the following word characters. It won't capture the following ( symbol because it isn't a word character.