Named captured substring in pcre++ - c++

I want to capture a named substring with the pcre++ library.
I know the pcre library supports this, but pcre++ does not implement it.
This is what I have now (just a simple example):
pcrepp::Pcre regex("test (?P<groupName>bla)");
if (regex.search("test bla"))
{
    // Get matched group by name
    int pos = pcre_get_stringnumber(
        regex.get_pcre(),
        "groupName"
    );
    if (pos == PCRE_ERROR_NOSUBSTRING) return;
    // Get match
    std::string temp = regex[pos - 1];
    std::cout << "temp: " << temp << "\n";
}
If I debug, pos returns 1, which is right: (?P<groupName>bla) is the 1st submatch (0 is the whole match). So that should be OK. But regex.matches() returns 0. Why is that?
By the way, I use regex[pos - 1] because pcre++ reindexes the results so that 0 points to the first submatch: 1 becomes 0, 2 becomes 1, 3 becomes 2, and so on.
Does anybody know how to fix this?

My mistake, unfortunately: I tested the regex in my real program, and there the regex was different. I used something like this:
(?:/(?P<controller>[^/]+)(?:/(?P<action>[^/]+))?)?
So the group-name-to-number conversion goes well, but when I try to access the group I get an index-out-of-range error because of the optional (?: ... )? groups. I just added a check that the group index is in the valid range; if it is, I can use the group.
Sorry for asking here too early.
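The range problem described above comes from optional groups that did not take part in the match. As a minimal sketch of the same check, here is the idea with std::regex rather than pcre++ (an assumption made so the example is self-contained). std::regex has no named groups, so the groups are addressed by number, and the names Route and parseRoute are hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical result type: an empty string means the group did not participate.
struct Route {
    std::string controller;
    std::string action;
};

// Sketch only: numbered groups stand in for the named groups
// (?P<controller>...) and (?P<action>...) from the question.
Route parseRoute(const std::string& path) {
    static const std::regex re("(?:/([^/]+)(?:/([^/]+))?)?");
    Route route;
    std::smatch m;
    if (std::regex_match(path, m, re)) {
        // Check that each optional group actually matched before using it.
        if (m[1].matched) route.controller = m[1].str();
        if (m[2].matched) route.action = m[2].str();
    }
    return route;
}
```

With this sketch, parseRoute("/user") fills controller and leaves action empty, instead of reading a group that never matched.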

Related

Regex Replace everything except between the first " and the last "

I need a regex that removes everything except the content between the first " and the last ".
For example:
Input string: ["Key:"Value""]
After the regex, I only need this:
Output string: Key:"Value"
Thanks!
You can try something like this.
Pattern:
^.*?"(.*)".*$
Substitution:
$1
Explanation:
The first part, ^.*?", matches as few characters as possible between the start of the string and a double quote.
The second part, (.*)", makes the largest match it can that ends in a double quote, and stores everything before that quote in a capture group.
The last part, .*$, grabs whatever is left and includes it in the match.
Finally, you replace the entire match with the contents of the first capture group.
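As a rough illustration of the pattern and substitution described above, here is a minimal C++ sketch using std::regex_replace. The choice of C++ and the function name betweenQuotes are assumptions; the question itself does not name a language:

```cpp
#include <regex>
#include <string>

// Keeps only the text between the first and the last double quote.
// If the pattern does not match (fewer than two quotes), the input
// is returned unchanged.
std::string betweenQuotes(const std::string& input) {
    // Custom raw-string delimiter "re" because the pattern contains )"
    static const std::regex re(R"re(^.*?"(.*)".*$)re");
    // "$1" inserts the contents of the first capture group.
    return std::regex_replace(input, re, "$1");
}
```

The lazy .*? stops at the first quote, while the greedy (.*) runs to the last one, which is what keeps the inner quotes intact.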
Can you say why you need a RegExp?
A function like:
String unquote(String input) {
  int start = input.indexOf('"');
  if (start < 0) return input; // or throw.
  int end = input.lastIndexOf('"');
  if (start == end) return input; // or throw
  return input.substring(start + 1, end);
}
is going to be faster and easier to understand than a RegExp.
Anyway, for the challenge, let's say we do want a RegExp that replaces the part up to the first " and the part from the last " with nothing. That's two replacements, so you can do:
input.replaceAll(RegExp(r'^[^"]*"|"[^"]*$'), "")
or you can use a capturing group and a computed replacement like:
input.replaceFirstMapped(RegExp(r'^[^"]*"([^]*)"[^"]*$'), (m) => m[1])
Alternatively, you can use the capturing group to select the text between the two and extract it in code, instead of doing string replacement:
String unquote(String input) {
  // Same pattern as above; the trailing [^"]* allows any text after the last quote.
  var re = RegExp(r'^[^"]*"([^]*)"[^"]*$');
  var match = re.firstMatch(input);
  if (match == null) return input; // or throw.
  return match[1];
}

C++ regex: Get index of the Capture Group the SubMatch matched to

Context. I'm developing a Lexer/Tokenizing engine, which would use regex as a backend. The lexer accepts rules, which define the token types/IDs, e.g.
<identifier> = "\\b\\w+\\b".
As I envision, to do the regex match-based tokenizing, all of the rules defined by regexes are enclosed in capturing groups, and all groups are separated by ORs.
When the matching is being executed, every match we produce must have an index of the capturing group it was matched to. We use these IDs to map the matches to token types.
So the problem of this question arises - how to get the ID of the group?
Similar question here, but it does not provide the solution to my specific problem.
Exactly my problem here, but it's in JS, and I need a C/C++ solution.
So let's say I've got a regex, made up of capturing groups separated by an OR:
(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)
which matches whole numbers or alphabetic words.
My problem requires that the index of the capture group the regex submatch matched to could be known, e.g. when matching the string
foo bar 123
3 iterations will be done. The group indexes of the matches of every iteration would be 0 0 1, because the first two matches matched the first capturing group, and the last match matched the second capturing group.
I know that in standard std::regex library it's not entirely possible (regex_token_iterator is not a solution, because I don't need to skip any matches).
I don't have much knowledge about boost::regex or PCRE regex library.
What is the best way to accomplish this task? Which is the library and method to use?
You can use std::sregex_iterator to get all matches. For each match, inspect the std::match_results object and find the group that participated in the match (only one group can match here, either the first or the second), which is conveniently checked with m[index].matched. Subtract 1 from that group's index to get a zero-based ID:
std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
std::string s = "foo bar 123";
for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
     i != std::sregex_iterator();
     ++i)
{
    std::smatch m = *i;
    std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
    // Group 0 is the whole match, so start at 1
    for (std::size_t index = 1; index < m.size(); ++index) {
        if (m[index].matched) {
            std::cout << "Capture group ID: " << index - 1 << std::endl;
            break;
        }
    }
}
Output:
Match value: foo at Position 0
Capture group ID: 0
Match value: bar at Position 4
Capture group ID: 0
Match value: 123 at Position 8
Capture group ID: 1
Note that R"(...)" is a raw string literal, no need to double backslashes inside it.
Also, index is set to 1 at the start of the for loop because the 0th group is the whole match, but you want group IDs to be zero-based, that is why 1 is subtracted later.
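To connect this back to the lexer described in the question, the loop above could be wrapped into a small tokenizer that records the rule ID with each match. This is only a sketch under assumptions: the Token struct and the tokenize function are hypothetical names, and the two rules are the ones from the example pattern:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token record: the rule (group) ID plus the matched text.
struct Token {
    std::size_t rule;
    std::string text;
};

// Sketch: one capture group per lexer rule, joined with |.
// Rule 0 = alphabetic word, rule 1 = whole number.
std::vector<Token> tokenize(const std::string& input) {
    static const std::regex re(R"((\b[a-zA-Z]+\b)|(\b\d+\b))");
    std::vector<Token> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Group 0 is the whole match; find the first group that participated.
        for (std::size_t group = 1; group < m.size(); ++group) {
            if (m[group].matched) {
                tokens.push_back({group - 1, m.str()});
                break;
            }
        }
    }
    return tokens;
}
```

The linear scan over groups is fine for a handful of rules; with many rules the cost per token grows with the number of groups, which may matter for a real lexer.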

regex with all components optionals, how to avoid empty matches

I have to process a comma-separated string which contains triplets of values and translate them to runtime types. The input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case. The expression almost works, but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
This works better than the first version, but it still matches the empty string. It also means I have to tokenize the input first and pass each token to the expression, so that the begin (^) and end ($) anchors can match.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading about and (trying to) understand the regex lookahead concept, and with the help of Casimir et Hippolyte's answer, I tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! It detects complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which weren't intended to be matched at all). Unfortunately it captures an empty match every time an invalid token is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". So I modified the lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}, so the empty matches just before "adfank" and "12322134445688" are gone, but the ones just before "1s1m1h" and "1h1h1h" are still detected.
So the question is: is there a regular expression which matches a triplet of values in a given order, where every component is optional but at least one component must be present, and which doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the beginning to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (that is, without tokenizing first), you can remove the anchors and use a more explicit subpattern in the lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positives (since you are looking for very small strings that can be part of something else), you can add word boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma-delimited string, (?=\d+[ABC]) can be replaced by (?=[^,]).
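For the anchored variant, a minimal sketch with std::regex could look like the following. It assumes the input has already been split on commas and handles only the Time case (h, m, s). The Time struct and parseTime are hypothetical names, and std::regex_match is used in place of explicit ^ and $ anchors because it only succeeds on a full match:

```cpp
#include <regex>
#include <string>

// Hypothetical result type for one token such as "48h30m50s".
struct Time {
    bool valid;
    int h, m, s;
};

// Sketch: validate a single, already-tokenized value.
// (?=.) rejects the empty string; regex_match requires the whole token to match.
Time parseTime(const std::string& token) {
    static const std::regex re(R"((?=.)(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?)");
    Time t = {false, 0, 0, 0};
    std::smatch match;
    if (!std::regex_match(token, match, re)) return t;
    t.valid = true;
    // Components that did not participate keep their default of 0.
    if (match[1].matched) t.h = std::stoi(match[1].str());
    if (match[2].matched) t.m = std::stoi(match[2].str());
    if (match[3].matched) t.s = std::stoi(match[3].str());
    return t;
}
```

Tokens in the wrong order or with repeated components, such as "1s1m1h" or "1h1h1h", fail the full match and come back with valid set to false. Note that std::stoi throws on values too large for int, which this sketch does not handle.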
I think this might do the trick.
I am keying on either the beginning of the string (^) or the comma separator (,) to fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <string>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
    std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
    std::sregex_iterator iter(test.begin(), test.end(), r);
    std::sregex_iterator end_iter;
    for(; iter != end_iter; ++iter)
        std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched, then as far as I can tell you have to spell out every valid combination, like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");

Qt Regex Help (Array Keys)

Okay, so the following string is what my regex will attempt to match against:
[key1][key2][key3]
and here is my regex.
\[(.+?)\]
This is all being done in Qt, and here is the code I am using
QRegExp reg("\\[(.+?)\\]");
reg.indexIn(string);
qDebug() << "Matches: " << reg.capturedTexts();
The above returns this:
("", "")
So two questions then:
Why are the captures empty?
On my regex, why did I need to put \\ for it to work? If I just put \ it will not capture anything.
Thank you!
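First, let's tighten your regular expression: instead of the reluctant .+? use the negated character class [^\]]+, which cannot run past the closing bracket and therefore avoids needless backtracking. It also sidesteps the lazy +? quantifier, which QRegExp does not support on individual quantifiers (it offers setMinimal() for the whole pattern instead). The new expression, written as a C++ string literal, is as follows: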
\\[([^\\]]+)\\]
On my regex, why did I need to put \\ for it to work?
Because the regex passes through two compilers that both interpret backslashes: first your C++ compiler, and then the regex compiler inside the QRegExp constructor. The first backslash of each pair is for the C++ compiler; the second one is for the regex compiler. Once the C++ compiler is finished, each pair of backslashes has been replaced with a single backslash, which is what the regex needs.
I got key1, but now how do I get the other 2? reg.capturedCount() returns 1
Your regular expression captures one square-bracket-delimited item at a time. If you want to capture them all, you need a loop:
int pos = 0;
// indexIn() returns -1 when there are no more matches
while ((pos = reg.indexIn(str, pos)) != -1) {
    qDebug() << "Match: " << reg.cap(1); // text of the first capture group
    pos += reg.matchedLength(); // continue after the current match
}
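The two points above (doubled backslashes and looping over matches) can also be tried outside Qt. The following is a minimal sketch with std::regex instead of QRegExp, which is an assumption made so it compiles without Qt; extractKeys, escapedPattern and rawPattern are hypothetical names:

```cpp
#include <regex>
#include <string>
#include <vector>

// The same pattern written twice: with doubled backslashes in an ordinary
// string literal, and with single backslashes in a raw string literal.
const std::string escapedPattern = "\\[([^\\]]+)\\]";
const std::string rawPattern = R"re(\[([^\]]+)\])re";

// Sketch with std::regex (not QRegExp): collects the text of
// capture group 1 for every match in the input.
std::vector<std::string> extractKeys(const std::string& input) {
    const std::regex re(rawPattern);
    std::vector<std::string> keys;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it) {
        keys.push_back(it->str(1));
    }
    return keys;
}
```

Both literals produce the same characters at run time, which is the "two compilers" effect in practice: the regex engine only ever sees single backslashes.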

Feel the need to improve my RegExp's

I've been working with an HTTP library (winhttp) for 2 weeks, and now I want to improve the regular expressions I use for retrieving some data from the target website.
Given the following HTML code:
Total Posts:</span> 22,423</li>
Now what I want to do is retrieve only the number and store it in a variable:
regex = "Total Posts:</span> \\S+";
if(std::regex_search(regexs, regexmatch, regex))
{
    temp = regexmatch[0];
    found = temp.find(",");
    if(found != std::string::npos)
        temp.erase(found, 1);
    temp.erase(0, 19);
    temp.erase(temp.end() - 5, temp.end());
    User._Posts = ConvertStringToInteger(temp);
}
I used a regex for this and then stripped off the surrounding parts, since I don't understand how to retrieve only the part I'm interested in rather than the whole match. Hopefully someone understands me. I already looked at the docs but found nothing that could help me.
To match only your desired pattern, you want to use a capture group with std::regex_search.
Capture groups capture matched regions within a regular expression, and each captured region is represented by a sub_match. You can use the smatch specialization of match_results to work with std::string submatches and then use operator[] to get the one you want.
Example:
const std::string foo = "Total Posts:</span> 22,423</li>";
std::regex rgx("Total Posts:</span> ([^<]+)");
std::smatch match;
if (std::regex_search(foo.begin(), foo.end(), match, rgx)) {
    std::cout << match[1] << '\n';
}
Output:
22,423
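To get from the captured text to the integer the question asks for, the thousands separator still has to be removed. A minimal sketch, assuming the number only contains digits and commas; parsePostCount is a hypothetical name, and std::stoi stands in for the asker's ConvertStringToInteger, whose implementation is not shown:

```cpp
#include <algorithm>
#include <regex>
#include <string>

// Sketch: returns the post count, or -1 if the marker text is not found.
int parsePostCount(const std::string& html) {
    static const std::regex rgx("Total Posts:</span> ([0-9,]+)");
    std::smatch match;
    if (!std::regex_search(html, match, rgx)) return -1;
    std::string digits = match[1].str();
    // Drop the thousands separators before converting.
    digits.erase(std::remove(digits.begin(), digits.end(), ','), digits.end());
    if (digits.empty()) return -1;
    return std::stoi(digits);
}
```

Compared with erasing fixed offsets from the whole match, the capture group keeps working if the surrounding markup changes length.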