C++ boost::regex multiple captures - c++

I'm trying to extract multiple substrings with boost::regex and store each one in a variable. Here is my code:
unsigned int i = 0;
std::string string = "--perspective=45.0,1.33,0.1,1000";
std::string::const_iterator start = string.begin();
std::string::const_iterator end = string.end();
std::vector<std::string> matches;
boost::smatch what;
boost::regex const ex(R"(^-?\d*\.?\d+),(^-?\d*\.?\d+),(^-?\d*\.?\d+),(^-?\d*\.?\d+))");
matches.resize(4); // make room for the four expected values
while (boost::regex_search(start, end, what, ex))
{
std::string stest(what[1].first, what[1].second);
matches[i] = stest;
start = what[0].second;
++i;
}
I'm trying to extract each float from my string and put it in my vector matches. At the moment I can extract the first one (in the vector I can see "45", without the double quotes), but the second element is empty (matches[1] is "").
I can't figure out why this happens or how to correct it. Is my regex incorrect? Is my use of smatch wrong?

Firstly, ^ anchors the match to the beginning of a line, so a group that starts with ^ after a comma can never match. Secondly, \ must be escaped in an ordinary string literal (inside a raw string literal such as R"(...)" it does not need to be). So you should change each (^-?\d*\.?\d+) group to (-?\\d*\\.?\\d+). (Probably, (-?\\d+(?:\\.\\d+)?) is better.)
Your regular expression searches for the number,number,number,number pattern, not for each individual number. You add only the first captured substring to matches and ignore the others. To fix this, you can either replace your expression with a single-number pattern such as (-?\\d*\\.?\\d+), or add all the captures stored in what to your matches vector:
while (boost::regex_search(start, end, what, ex))
{
for(int j = 1; j < what.size(); ++j)
{
std::string stest(what[j].first, what[j].second);
matches.push_back(stest);
}
start = what[0].second;
}

You are using ^ several times in your regex. That's why it didn't match: ^ means the beginning of the string. Also, you have an extra ) at the end of the regex. I don't know what that closing bracket is doing there.
Here is your regex after correction:
(-?\d*\.?\d+),(-?\d*\.?\d+),(-?\d*\.?\d+),(-?\d*\.?\d+)
A better version of your regex could be the following (only if you want to avoid matching numbers like .01 or .1):
(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)

A repeated search in combination with a regular expression that apparently is built to match all of the target string is pointless.
If you are searching repeatedly in a moving window delimited by a moving iterator and string.end() then you should reduce the pattern to something that matches a single fraction.
If you know that the number of fractions in your string is (or must be) constant, match once rather than in a loop, and extract the matched substrings from what.
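The two options described above can be sketched as follows. This is only an illustration under a few assumptions: it uses std::regex instead of boost::regex so that it has no external dependency, it uses the stricter single-number pattern -?\d+(?:\.\d+)?, and the function names extract_each and extract_fixed are made up for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Option 1: a pattern for ONE number, applied repeatedly over the input.
std::vector<std::string> extract_each(const std::string& input) {
    static const std::regex number(R"(-?\d+(?:\.\d+)?)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(input.begin(), input.end(), number), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}

// Option 2: a pattern for exactly FOUR comma-separated numbers, matched once.
std::vector<std::string> extract_fixed(const std::string& input) {
    static const std::string num = R"((-?\d+(?:\.\d+)?))";
    static const std::regex four(num + "," + num + "," + num + "," + num);
    std::vector<std::string> out;
    std::smatch what;
    if (std::regex_search(input, what, four)) {
        // what[0] is the whole match; what[1]..what[4] are the captures.
        for (std::size_t j = 1; j < what.size(); ++j) {
            out.push_back(what[j].str());
        }
    }
    return out;
}
```

Called on the string from the question, both functions should return the four substrings 45.0, 1.33, 0.1 and 1000. The second one returns an empty vector when fewer than four numbers are present.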

Related

Quick regex_search/replace, or clear indication of replacement?

I must browse a collection of strings to replace a pattern and save the changes.
The saving operation is (very) expensive and out of my hands, so I would like to know beforehand if the replacement did anything.
I can use std::regex_search to learn whether the pattern is present in my input, and use capture groups to store details in a std::smatch. std::regex_replace does not seem to explicitly tell me whether it did anything.
The patterns and strings are arbitrarily long and complicated; running regex_replace after a regex_search seems wasteful.
I can directly compare the input and output to search for a discrepancy but that too is uncomfortable.
Is there either a simple way to observe regex_replace to determine its impact, or a way to use the smatch filled by regex_search to do a faster replacement operation?
Thanks in advance.
No, regex_replace doesn't provide this information, but yes, you can do it with a regex_search loop.
For example like this:
std::regex pattern("...");
std::string replacement_format = "...";
std::string input = "......"; // a very, very long string
std::string output, replacement;
std::smatch match;
auto begin = input.cbegin();
int replacements = 0;
while (std::regex_search(begin, input.cend(), match, pattern)) {
output += match.prefix();
replacement = match.format(replacement_format);
if (match[0] != replacement) {
replacements++;
}
output += replacement;
begin = match.suffix().first;
}
output.append(begin, input.cend());
if (replacements > 0) {
// process output ...
}
As regex_replace creates a copy of your string you could simply compare the replaced string with the original one and only "store" the new one if they differ.
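A minimal sketch of that comparison approach, assuming the expensive save should run only when the text actually changed. The helper name replace_if_changed is hypothetical and not part of any library.

```cpp
#include <regex>
#include <string>
#include <utility>

// Runs the replacement and reports whether it changed anything.
// `output` is written only when the result differs from `input`.
bool replace_if_changed(const std::string& input,
                        const std::regex& pattern,
                        const std::string& format,
                        std::string& output) {
    std::string replaced = std::regex_replace(input, pattern, format);
    if (replaced == input) {
        return false;  // nothing changed, so the caller can skip the save
    }
    output = std::move(replaced);
    return true;
}
```

The comparison costs one extra pass over the string, which is usually small next to the regex work itself. Note that a match whose replacement is identical to the matched text counts as "no change" here, just as in the regex_search loop above.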
For C++14 it seems that regex_replace returns an iterator to the last place it has written to. From https://www.cplusplus.com/reference/regex/regex_replace/: "Versions 5 and 6 return an iterator that points to the element past the last character written to the sequence pointed by out."

need support defining the right regex

I would like to parse a file using boost::sregex_token_iterator.
Unfortunately I'm not able to find the right regex to extract strings in the form FOO:BAR out of it.
The code example below works only if there is one such occurrence per line, but I would like to support multiple such entries per line, and ideally also a comment after a '#'.
So entries like this
AA:BB CC:DD EE:FF #this is a comment
should result in 3 identified tokens (AA:BB, CC:DD, EE:FF)
boost::regex re("((\\W+:\\W+)\\S*)+");
boost::sregex_token_iterator i(line.begin(), line.end(), re, -1), end;
for(; i != end; i++){
std::stringstream ss(*i);
...
}
Any support is very welcome.
I suggest you use splitting to get the values you need.
I would begin by first splitting using #. This separates the comment from the rest of the line. Then split using white space, which separates the pairs out. After this, individual pairs can be split using :.
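The three splitting steps can be sketched with the standard library alone. This is an illustration, not a definitive parser: the function name parse_pairs is made up, and it assumes that tokens without a ':' should simply be skipped.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Splits a line such as "AA:BB CC:DD #comment" into (key, value) pairs.
std::vector<std::pair<std::string, std::string>>
parse_pairs(const std::string& line) {
    // Step 1: drop everything from the first '#' onward (the comment).
    // If there is no '#', find() returns npos and the whole line is kept.
    const std::string data = line.substr(0, line.find('#'));

    std::vector<std::pair<std::string, std::string>> pairs;
    std::istringstream tokens(data);
    std::string token;
    // Step 2: split on whitespace.
    while (tokens >> token) {
        // Step 3: split each token at its first ':'.
        const std::string::size_type colon = token.find(':');
        if (colon == std::string::npos) {
            continue;  // assumption: ignore tokens that are not pairs
        }
        pairs.emplace_back(token.substr(0, colon), token.substr(colon + 1));
    }
    return pairs;
}
```

For the sample line from the question this yields three pairs: (AA, BB), (CC, DD) and (EE, FF).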
If, for whatever reason, you must use regex, you can iterate over the matches. In this case I would use the following regex:
(?:#(?:.*))*(\w+:\w+)\s*
This regex will match every pair until it finds a comment. If there is a comment, it will skip to the next new line.
You want to match sequences of 1 or more word chars followed with : and then having again 1 or more word chars.
Thus, you need to replace -1 with 1 in the call to boost::sregex_token_iterator to get Group 1 text chunks and replace the regex you use with \w+:\w+ pattern:
boost::regex re(R"(#.*|(\w+:\w+))");
boost::sregex_token_iterator i(line.begin(), line.end(), re, 1), end;
Note that R"(#.*|(\w+:\w+))" is a raw string literal that actually represents #.*|(\w+:\w+) pattern that matches # and then the rest of the line or matches and captures the pattern you need into Group 1.
See a std::regex C++ example (you may easily adjust the code for Boost):
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex r(R"(#.*|(\w+:\w+))");
std::string s = "AA:BB CC:DD EE:FF #this is a comment XX:YY";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
if (m[1].matched) // skip the "#.*" alternative, which leaves Group 1 unset
std::cout << m[1].str() << '\n';
}
return 0;
}

Search only beginning of string in c++ using regex

Edit: I am trying to tokenize a string from left to right by comparing it against a list of regex strings. I decided to do this by prepending a caret to each regex string; when I find a match, I take the substring after the matched text and look for the next match at the beginning of that substring.
I have a list of strings, stored in a vector container, that I convert to regexes to search with. Here is an example of one:
vector<vector<string>> operators = {
{{",|;|//.*"}} //punctuation
};
I then take substrings and search each one for a match at the beginning. In this case I add a caret at the beginning of each string before I add it to the regex to do that:
Token *find_Match(string &s, int i)
{
string substring = s.substr(i, s.length() - i);
string somestring;
for (string c : operators[x])
{
regex r = regex("^" + c);
smatch sm;
regex_search(substring, sm, r); // , std::regex_constants::;
int size = sm.size();
if (size > 0) //MATCH FOUND
{
somestring = sm[0];
}
}
return somestring;
}
Now the problem is that for the punctuation regexes, the caret applies only to the first alternative (the comma); the remaining alternatives can match anywhere in the string, so a; returns a match for ;. What is the best way in C++ to say that I want only a match at the very beginning, without having to go through every | operator to add a caret?
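One way to express this, offered as a sketch rather than a definitive answer: wrap the whole alternation in a non-capturing group before prepending the caret, so that ^ applies to every alternative. The example uses std::regex, and the helper name match_at is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the text matched at position `pos` of `s`, or an empty string
// when none of the alternatives matches exactly at that position.
std::string match_at(const std::string& s, std::size_t pos,
                     const std::string& alternation) {
    // "^,|;|//.*" parses as (^,) | (;) | (//.*), so only the comma is
    // anchored. "^(?:,|;|//.*)" anchors the whole group instead.
    const std::regex r("^(?:" + alternation + ")");
    const std::string rest = s.substr(pos);
    std::smatch sm;
    if (std::regex_search(rest, sm, r)) {
        return sm[0].str();
    }
    return std::string();
}
```

An alternative worth considering is passing std::regex_constants::match_continuous to std::regex_search, which restricts matches to those starting at the first character of the searched range and needs no caret at all.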

Ignore String containing special words (Months)

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: any string that contains at least one digit and one letter, plus any other special characters, such as # - _ [] () {} ç _ \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function?
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson: ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use negative look ahead, the same way you're using positive lookahead:
(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumerical strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uù]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag means it will match upper and lower case (so you can reduce your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. (Careful if you copy-paste: there's a line break between (?! ) and [ ] for readability.) Just add to the final [...] whatever special characters you want to match.
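In C++, the same idea can be sketched as follows. Several things here are assumptions made for the sake of a short, portable example: it uses std::regex (whose default ECMAScript grammar supports lookahead) rather than Boost, the forbidden list is cut down to three ASCII month abbreviations, the final part is .{3,90} from the question rather than the character class, and the function name is_allowed is made up. Accented characters are left out because byte-oriented regex matching on UTF-8 input needs extra care.

```cpp
#include <regex>
#include <string>

// True when `s` has at least one digit and one letter, is 3 to 90
// characters long, and contains none of the forbidden month names.
bool is_allowed(const std::string& s) {
    static const std::regex re(
        R"(^(?=.*\d)(?=.*[a-zA-Z])(?!.*(?:jan|feb|mar)).{3,90}$)",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(s, re);
}
```

The negative lookahead sits at the start of the pattern and begins with .* so that it rejects a month name anywhere in the string. In the attempt from the question, ^(?!Jan|Feb|Mar) was placed inside the second positive lookahead after [a-zA-Z], where ^ can never match, which is why nothing matched.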

boost regex to extract a number from string

I have a string
resource = "/Music/1"
The string can take multiple numeric values after "/Music/". I am new to regular expressions. I tried the following code:
#include <iostream>
#include <boost/regex.hpp>
int main()
{
std::string resource = "/Music/123";
const char * pattern = "\\d+";
boost::regex re(pattern);
boost::sregex_iterator it(resource.begin(), resource.end(), re);
boost::sregex_iterator end;
for( ; it != end; ++it)
{
std::cout<< it->str() <<"\n";
}
return 0;
}
vickey@tb:~/trash/boost$ g++ idExtraction.cpp -lboost_regex
vickey@tb:~/trash/boost$ ./a.out
123
This works fine. But when the string happens to be something like "/Music23/123", it gives me the value 23 before 123. When I use the pattern "/\d+", it gives results even when the string is /23/Music/123. What I want to do is extract only the number after "/Music/".
I think part of the problem is that you haven't defined very well (at least to us) what it is you are trying to match. I'm going to take some guesses. Perhaps one will meet your needs.
The number at the end of your input string. For example "/a/b/34". Use regex "\\d+$".
A path element that is entirely numeric. For example "/a/b/12/c" or "/a/b/34" but not "/a/b56/d". Use regex "(?:^|/)(\\d+)(?:/|$)" and get captured group [1]. You might do the same thing with lookahead and lookbehind, perhaps with "(?<=^|/)\\d+(?=/|$)".
If there will never be anything after the last slash, you could just use a regex or a string split to get everything after the last slash. I'd give you code but I'm on my phone now.
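For the specific goal stated in the question (only the number after "/Music/"), one possible sketch is to put the literal prefix into the pattern and capture the digits. It assumes the number is the last path element, uses std::regex so it builds without Boost, and the function name music_id is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the digits that directly follow "/Music/" at the end of the
// resource path, or an empty string when the path has no such element.
std::string music_id(const std::string& resource) {
    static const std::regex re(R"(/Music/(\d+)$)");
    std::smatch m;
    if (std::regex_search(resource, m, re)) {
        return m[1].str();  // Group 1 holds only the digits
    }
    return std::string();
}
```

With this pattern "/Music23/123" produces no match, because "/Music" is not immediately followed by a slash, while "/23/Music/123" yields only 123. If "/Music/" must also be the first path element, adding ^ at the front of the pattern would enforce that.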