Recursive regular expression match with boost - c++

I have a problem with the C++ standard regex library: it fails to compile a recursive regex.
Searching online, I found that this is a well-known limitation and that people suggest using the Boost library instead. This is the offending pattern:
\\((?>[^()]|(?R))*\\)|\\w+
What I'm trying to do is use this regex to split statements on spaces and brackets (including balanced brackets nested inside brackets), but every piece of code I have found that shows how to do it with Boost fails to work properly, and I don't know why. Thanks in advance.

You may declare the regex using a raw string literal, with the R"(...)" syntax. This way, you won't have to escape backslashes twice.
For example, these two declarations are equivalent:
std::string my_pattern("\\w+");
std::string my_pattern(R"(\w+)");
The outer parentheses are not part of the regex pattern; they belong to the raw string literal delimiters.
However, your regex is not quite correct: you need to recurse only the first alternative and not the whole regex.
Here is the fix:
std::string my_pattern(R"((\((?:[^()]++|(?1))*\))|\w+)");
Here, (\((?:[^()]++|(?1))*\)) matches a (, then zero or more repetitions of either 1+ chars other than ( and ) or a recursion into the whole Group 1 pattern via the (?1) regex subroutine, and finally a closing ).
See the regex demo.
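To check which tokens the corrected pattern is expected to produce, the splitting rule can be sketched without any regex recursion. The following is a minimal Python sketch, not the Boost solution itself: it swaps the recursive subroutine for an explicit depth counter, and the names split_balanced and _matching_close are made up for this illustration. Like the pattern, it skips an opening bracket that has no matching close and carries on with the words after it.

```python
import re

# \w+ is the second alternative of the pattern in the answer above.
WORD = re.compile(r"\w+")


def split_balanced(text):
    """Return balanced (...) groups and bare words, in order of appearance."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == "(":
            end = _matching_close(text, i)
            if end is not None:
                # Keep the whole balanced group, nested brackets included.
                tokens.append(text[i:end + 1])
                i = end + 1
                continue
        word = WORD.match(text, i)
        if word:
            tokens.append(word.group())
            i = word.end()
        else:
            # Space, punctuation, or an unbalanced bracket: skip one character.
            i += 1
    return tokens


def _matching_close(text, start):
    """Index of the ')' that balances the '(' at start, or None if unbalanced."""
    depth = 0
    for j in range(start, len(text)):
        if text[j] == "(":
            depth += 1
        elif text[j] == ")":
            depth -= 1
            if depth == 0:
                return j
    return None
```

For example, split_balanced("foo (a (b c) d) bar") keeps the nested group together as a single token between the two words.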

Related

Regular expression to select single words inside parentheses using stringr

I specified that I'm using stringr because its character escaping is not "standard" regex escaping.
I want to detect strings that have a single word inside parentheses. Thus, I want to detect
"Men's shirt (blue)"
and not detect
"Blade Runner (Director's cut)"
If it helps simplify the regex, all the parenthetical parts are always at the end of the string.
I have attempted
str_detect(my_string, "\\(\\w?\\)") which yields no results and
str_detect(my_string, "\\(\\S\\)$") which returns everything including multiple words
as well as various combinations using \\S, with or without the $, etc.
When I look for other Stack Overflow answers, I usually find slightly different questions whose answers are simply "use this:" along with what seems like an incomprehensible regex using lookaheads and other seemingly too-complicated things. I would appreciate a little explanation of why the (probably obvious and simple) regex works.
Looks like I didn't hit upon \\S+ in my combinations. I was checking for a single non-space character, not one or more non-space characters. This solution works:
str_detect(desc, "\\(\\S+\\)$")
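The same pattern can be tried outside R. Below is a small Python sketch, assuming the standard re module; the helper name has_single_parenthetical_word is invented for this example. Python raw strings need only a single backslash, so the pattern reads \(\S+\)$ rather than the doubled form stringr requires.

```python
import re

# A literal "(", one or more non-space characters, a literal ")", end of string.
SINGLE_WORD_IN_PARENS = re.compile(r"\(\S+\)$")


def has_single_parenthetical_word(text):
    """True when the string ends with a parenthetical containing no spaces."""
    return SINGLE_WORD_IN_PARENS.search(text) is not None
```

Because \S+ cannot cross a space, "Blade Runner (Director's cut)" is rejected: the text between the brackets contains a space, so no match can reach the closing bracket.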

c++ Remove substring after last match of substring

What is a compact way of removing the last part of a string after the last occurrence of a given substring, including it?
In terms of bash parameter substitution it would be the equivalent of:
VAR=${VAR%substring*}
Is there a library (e.g. boost) supporting replacement with wildcards or something similar?
Without wildcards, the solution I've found is as follows
string.erase(string.rfind("substring"));
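This works provided the substring is found in the string. If it is not, rfind returns std::string::npos and erase throws std::out_of_range, so check the result of rfind first.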

Efficient way to match regex between delimiters

I have a string and want to match the substring between the two first delimiters with a regular expression.
For example a string foo"text"bar anotherfoo"anothertext"anotherbar with delimiter " should yield text.
I found the following possible solutions:
Non-greedy matching "(.*?)"
Non-greedy matching with Lookahead and Lookbehind assertions (?<=")(.*?)(?=")
Negated character classes "([^"]*)"
Which one is the most efficient way of doing this? Or am I missing cases where these solutions behave differently (assuming the new line modifier is set so that a dot matches a new line)?
Since the delimiters are single characters, and the matched substring should not contain them, the negated character class solution ("([^"]*)") is the most efficient.
If you want to match only once, you do not even need the closing ": just use "([^"]*).
The lazy dot matching ("(.*?)") technique might cause performance issues when there is no ending delimiter and the text is rather large after the initial delimiter.
Lookarounds almost always involve the additional overhead of checking some subpattern at each tested position. Since the delimiters here are single characters, the lookbehind/lookahead version is not efficient. Use it only if there is no way to access capturing groups. In Python, capturing works well, so there is no need for it.
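The three candidate patterns can be compared side by side. This is a minimal Python sketch using the standard re module; the dictionary PATTERNS and the helper first_delimited are names chosen for this example. It only checks that the first match agrees on the sample string from the question and makes no timing claim.

```python
import re

# The three candidate patterns from the question, each with one capture group.
PATTERNS = {
    "lazy": r'"(.*?)"',
    "lookaround": r'(?<=")(.*?)(?=")',
    "negated_class": r'"([^"]*)"',
}


def first_delimited(text, pattern):
    """Return the first captured substring, or None when nothing matches."""
    # re.DOTALL lets "." match newlines, as the question assumes.
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None
```

On foo"text"bar anotherfoo"anothertext"anotherbar, all three patterns yield text as the first match.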

Regexp Question - Negating a captured character

I'm looking for a regular expression that allows for either single-quoted or double-quoted strings, and allows the opposite quote character within the string. For example, the following would both be legal strings:
"hello 'there' world"
'hello "there" world'
The regexp I'm using uses negative lookahead and is as follows:
(['"])(?:(?!\1).)*\1
I think this would work, but what if the language didn't support negative lookahead? Is there any other way to do this? Without alternation?
EDIT:
I know I can use alternation. This was more of just a hypothetical question. Say I had 20 different characters in the initial character class. I wouldn't want to write out 20 different alternations. I'm trying to actually negate the captured character, without using lookahead, lookbehind, or alternation.
This is actually much simpler than you may have realized. You don't really need the negative look-ahead. What you want to do is a non-greedy (or lazy) match like this:
(['"]).*?\1
The ? character after the .* is the important part. It says, consume the minimum possible characters before hitting the next part of the regex. So, you get either kind of quote, and then you go after 0-M characters until you encounter a character matching whichever quote you first ran into. You can learn more about greedy matching vs. non-greedy here and here.
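The lazy match with a backreference can be checked against the two sample strings. This is a short Python sketch with the standard re module; first_quoted is a name made up for the example.

```python
import re

# Group 1 captures the opening quote; \1 requires the same character to close.
QUOTED = re.compile(r"""(['"]).*?\1""")


def first_quoted(text):
    """Return the first quoted string, quotes included, or None."""
    match = QUOTED.search(text)
    return match.group(0) if match else None
```

The opposite quote character inside the string is consumed by .*? like any other character, because only the captured quote can satisfy \1.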
Sure:
'([^']*)'|"([^"]*)"
On a successful match, the $+ variable will hold the contents of whichever alternate matched.
In the general case, regexps are not really the answer. You might be interested in something like Text::ParseWords, which tokenizes text, accounting for nested quotes, backslashed quotes, backslashed spaces, and other oddities.

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceded or followed by a double or single quote. It therefore only handles constructs like the ones you mentioned; the colon in something:firstValue would not be matched.
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
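Such a parser can be quite small. The following Python sketch is one possible shape, assuming that a quote is closed only by the same quote character and that there are no escape sequences; the function name unquoted_colon_positions is hypothetical.

```python
def unquoted_colon_positions(text):
    """Return the 0-based indexes of colons that are not inside quotes."""
    positions = []
    open_quote = None  # The quote character currently open, if any.
    for index, char in enumerate(text):
        if open_quote is not None:
            # Inside quotes: only the matching quote character closes them.
            if char == open_quote:
                open_quote = None
        elif char in "'\"":
            open_quote = char
        elif char == ":":
            positions.append(index)
    return positions
```

The single variable open_quote is the state that a plain regular expression cannot carry from one character to the next.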
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
$string =~ s/['"][^'"]*['"]//g;
my $match_count = () = $string =~ /:/g;  # list assignment in scalar context yields the number of matches
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
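The same two-step idea, sketched in Python for comparison with the Perl above. It uses the standard re module, the helper name count_unquoted_colons is an assumption of this example, and it works on a copy, so the original string is left untouched.

```python
import re

# A quote, any run of non-quote characters, and a closing quote.
QUOTED_RUN = re.compile(r"""['"][^'"]*['"]""")


def count_unquoted_colons(text):
    """Strip quoted runs, then count the colons that remain."""
    stripped = QUOTED_RUN.sub("", text)
    return stripped.count(":")
```

As in the Perl version, the pattern does not require the closing quote to be the same character as the opening one, so mixed or nested quotes are outside what this sketch handles.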
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
Oops, I missed the point. Forget the rest. This is quite hard to do because regex is not good at counting balanced characters. The .NET implementation, for example, has an extension that can do it, but it is a bit complicated.
You can use negated character groups to do this.
[^'"]:[^'"]
You can further wrap the quotes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use lookaround assertions.
(?<!['"]):(?!['"])
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right, using * within a look behind assertion is not allowed.
You can try to capture the strings within the quotes:
/(?<q>'|")([\w ]+)(\k<q>)/m
The first group defines the allowed quote types; the second group matches word characters (letters, digits, underscore) and spaces.
The nice thing about this solution is that it ONLY matches strings whose opening and closing quotes are the same character.
Try it at regex101.com