C++ regex string literal with capturing group - c++

I have a std::string containing backslashes and double quotes. I want to extract a substring using a capture group, but I am not able to get the syntax right.
e.g.
std::string str(R"(some\"string"name":"john"\"lastname":"doe")"); //==> want to extract "john"
std::regex re(R"(some\"string"name":")"(.*)R"("\"lastname":"doe")"); //==> wrong syntax
std::smatch match;
std::string name;
if (std::regex_search(str, match, re) && match.size() > 1)
{
name = match.str(1);
}

Use a raw string delimiter that does not occur in the string, e.g. R"~( .... )~"
You still need to escape the \ for regex. To match \ literally use \\.
You probably want to stop as soon as the shortest possible match is found. So use (.*?):
std::regex re(R"~(some\\"string"name":"(.*?)"\\"lastname":"doe")~");
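Putting the three points together, a minimal sketch could look like this. The helper name extract_name is made up for illustration; the pattern is the one from the answer above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the question): returns the captured value,
// or an empty string when the pattern does not match.
std::string extract_name(const std::string& input)
{
    // The custom delimiter ~ is needed because the pattern itself contains )"
    // which would otherwise end a plain R"( ... )" literal too early.
    // \\ matches one literal backslash; (.*?) is a lazy capture group.
    static const std::regex re(R"~(some\\"string"name":"(.*?)"\\"lastname":"doe")~");
    std::smatch match;
    if (std::regex_search(input, match, re) && match.size() > 1)
        return match.str(1);
    return std::string();
}
```

Called with the string from the question, this should return john (without the surrounding quotes, since they are outside the capture group).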

Related

Regex replace names of methods

I'm trying to replace all occurrences of names within a given string. I'm using regex, since a simple substring match won't work in this case and I need to match full words.
My problem is that I can only match words that have blanks before and after them. For example, I cannot replace a name when it is not followed by a blank, as in:
toReplace()
with:
theReplacement()
My regex replace method looks like this:
void replaceWord(std::string &str, const std::string& search, const std::string& replace)
{
// Regular expression to match words beginning with 'search'
// std::regex e ("(\\b("+search+"))([^,. ]*)");
// std::regex e ("(\\b("+search+"))\\b)");
std::regex e("(\\b("+search+"))([^,.()<>{} ]*)");
str = std::regex_replace(str,e,replace) ;
}
What should the regex look like in order to ignore leading and trailing non-alphanumeric characters?
You need to
Escape all special characters in the regex pattern with std::regex_replace(search, std::regex(R"([.^$|{}()[\]*+?/\\])"), std::string(R"(\$&)"))
Escape all special chars in the replacement pattern with std::regex_replace(replace, std::regex("[$]"), std::string("$$$$")) (that is in case you replace with literal $1 text, $ can be set with $$, so to replace with a double $, we need $$$$ in the replacement here)
Wrap your search pattern with unambiguous word boundaries, i.e. "(\\W|^)("+search+")(?!\\w)"
When you replace, add $1 at the start of the replacement pattern to keep the whitespace (if it is matched and captured into the first group with the (\W|^) pattern).
See C++ sample code:
std::string replaceWord(std::string &str, std::string& search, std::string& replace)
{
// Escape the literal regex pattern
search = std::regex_replace(search, std::regex(R"([.^$|{}()[\]*+?/\\])"), std::string(R"(\$&)"));
// Escape the literal replacement pattern
replace = std::regex_replace(replace, std::regex("[$]"), std::string("$$$$"));
std::regex e("(\\W|^)("+search+")(?!\\w)");
return std::regex_replace(str, e, std::string("$1") + replace);
}
Then,
std::string text("String toReplace()");
std::string s("toReplace()");
std::string r("theReplacement()");
std::cout << replaceWord(text, s, r);
// => String theReplacement()
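Note that the function above overwrites the search and replace arguments it is given. If that is not wanted, a variant along these lines may work; replaceWordCopy is a hypothetical name, and the escaping steps are the same as in the answer.

```cpp
#include <regex>
#include <string>

// Hypothetical variant of replaceWord that takes its inputs by const
// reference and returns the result, leaving the caller's strings unchanged.
std::string replaceWordCopy(const std::string& str,
                            const std::string& search,
                            const std::string& replace)
{
    // Escape regex metacharacters in the literal search text.
    const std::string pattern = std::regex_replace(
        search, std::regex(R"([.^$|{}()[\]*+?/\\])"), std::string(R"(\$&)"));
    // Double every $ so the replacement text is inserted literally.
    const std::string literal = std::regex_replace(
        replace, std::regex("[$]"), std::string("$$$$"));
    const std::regex e("(\\W|^)(" + pattern + ")(?!\\w)");
    // $1 restores the character matched by (\W|^), if any.
    return std::regex_replace(str, e, "$1" + literal);
}
```

Because the arguments are const references, this version can also be called with string literals directly.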

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
And I have such sample below.
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
std::cout << match.size() << "match: " << match[0] << '\n';
s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How to create a proper regex that will ignore such strings?
Update
If I do like this
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I find only substring in the middle
1match: , /_mySymbol_
and I don't find the substrings at the beginning and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
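Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). With this approach the match won't include the characters before/after mysymbol. Be aware that the default ECMAScript grammar of std::regex supports lookahead but not lookbehind, so in C++11 this pattern needs another engine such as Boost.Regex; with std::regex, use the first option.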
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.
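Since std::regex has no lookbehind, here is a rough sketch of the first option adapted for it: the leading side uses the (non-alphanumeric OR start of string) alternation, the trailing side uses a negative lookahead so the following character is not consumed, and the word itself goes into a capture group. The function name find_whole_word is made up, and the sketch assumes sub contains only alphanumeric characters (no regex metacharacters).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every occurrence of `sub` (case-insensitive)
// that is not directly surrounded by other alphanumeric characters.
// Assumes `sub` contains no regex metacharacters.
std::vector<std::string> find_whole_word(const std::string& text,
                                         const std::string& sub)
{
    const std::regex rgx("(?:[^a-zA-Z0-9]|^)(" + sub + ")(?![a-zA-Z0-9])",
                         std::regex::icase);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end;
         it != end; ++it)
    {
        // Group 1 holds the word itself, without the leading separator.
        found.push_back((*it)[1].str());
    }
    return found;
}
```

For the sample string 1mySymbol1, /_mySymbol_ mysymbol this should skip the first occurrence and report the other two.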

Tokenizing boost::regex matches

I have created a regex to match the lines of a file which have the following structure: string int int
int main()
{
std::string line;
boost::regex pat("\\w\\s\\d\\s\\d");
while (std::cin)
{
std::getline(std::cin, line);
boost::smatch matches;
if (boost::regex_match(line, matches, pat))
std::cout << matches[2] << std::endl;
}
}
I would like to save those numbers into a pair<string,pair<int,int>>. How can I tokenize the match of the boost::regex to achieve this?
First of all, your regular expression accepts "one word character, then one space, then one digit, then one space, then one digit". I bet this is not what you want. Most probably you want your expression to look like:
\w+\s+\d+\s+\d+
where \w+ now means "one or more word characters". If you are sure that there is only one space between tokens you can omit + after \s but this way it is safer.
Then you need to select parts of your expression. That is called sub-expression in RE:
(\w+)\s+(\d+)\s+(\d+)
This way, whatever is matched by (\w+) (one or more word characters) will be in matches[1], the first (\d+) in matches[2] and the second (\d+) in matches[3]. Of course, you need to double each \ when you put the expression in a C++ string literal.
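To get from the sub-expressions to the pair, something like the following might do. Note that this sketch uses std::regex instead of boost::regex so that it compiles without Boost; the calls used here (regex_match, smatch, operator[] and str()) have the same shape in Boost.Regex. Record and parse_line are names invented for the example.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical alias for the target type from the question.
using Record = std::pair<std::string, std::pair<int, int>>;

// Hypothetical helper: fills `out` and returns true if `line` has the form
// "word int int"; returns false and leaves `out` untouched otherwise.
bool parse_line(const std::string& line, Record& out)
{
    static const std::regex pat(R"((\w+)\s+(\d+)\s+(\d+))");
    std::smatch m;
    if (!std::regex_match(line, m, pat))
        return false;
    // m[1] is the word, m[2] and m[3] are the two numbers.
    // std::stoi throws std::out_of_range if a number does not fit in an int.
    out = Record(m[1].str(),
                 std::make_pair(std::stoi(m[2].str()), std::stoi(m[3].str())));
    return true;
}
```

In the reading loop you would call parse_line for each line and push the resulting Record into whatever container you use.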

Ignore String containing special words (Months)

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: an alphanumeric string is any string that contains at least a number and a letter plus any other special characters it can be # - _ [] () {} ç _ \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function.
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson: ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use negative look ahead, the same way you're using positive lookahead:
(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumeric strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uù]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag makes the match case-insensitive (so you can shorten your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. Careful if you copy-paste: there's a line break between (?! ) and [ ] for readability. Just add whatever special characters you want to match to the final [...].
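As a small C++ illustration of the negative lookahead placed next to the two positive ones, here is a sketch with std::regex. To keep it short and ASCII-only it forbids just Jan, Feb and Mar (an assumption of this example; the full month list would go in the same place), and it is case-sensitive. The function name is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical check: true if `s` is 3 to 90 characters long, contains at
// least one digit and one letter, and contains none of the forbidden words.
// The forbidden list is shortened to Jan|Feb|Mar for illustration only.
bool is_alnum_without_month(const std::string& s)
{
    static const std::regex re(
        R"(^(?=.*\d)(?=.*[a-zA-Z])(?!.*(?:Jan|Feb|Mar)).{3,90}$)");
    return std::regex_match(s, re);
}
```

All three lookaheads sit right after ^, so each one scans the whole string independently before .{3,90} consumes it. With this, 2-hello-001 should pass while 12-Jan-2020 should be rejected.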

C++ regex escaping punctuation characters like "."

Matching a "." in a string with the std::tr1::regex class makes me use a weird workaround.
Why do I need to check for "\\\\." instead of "\\."?
regex(".") // Matches everything (but "\n") as expected.
regex("\\.") // Matches everything (but "\n").
regex("\\\\.") // Matches only ".".
Can someone explain why? It's really bothering me, since I had my code written using boost::regex classes, which didn't need this syntax.
Edit: Sorry, regex("\\\\.") seems to match nothing.
Edit2: Some code
void parser::lex(regex& token)
{
// Skipping whitespaces
{
regex ws("\\s*");
sregex_token_iterator wit(source.begin() + pos, source.end(), ws, regex_constants::match_default), wend;
if(wit != wend)
pos += (*wit).length();
}
sregex_token_iterator it(source.begin() + pos, source.end(), token, regex_constants::match_default), end;
if (it != end)
temp = *it;
else
temp = "";
}
This is because, inside a C++ string literal, \. is treated as an escape sequence, which the language itself tries to interpret as a single character before the regex engine ever sees it. What you want is for your regex to contain the actual string "\.", which is written \\. in source code because \\ is the escape sequence for the backslash character (\).
As it turns out, the actual problem was due to the way sregex_token_iterator was used. Using match_default meant it was always finding the next match in the string, if any, even if there is a non-match in-between. That is,
string source = "AAA.BBB";
regex dot("\\.");
sregex_token_iterator wit(source.begin(), source.end(), dot, regex_constants::match_default);
would give a match at the dot, rather than reporting that there was no match.
The solution is to use match_continuous instead.
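A minimal sketch of the match_continuous idea, using std::regex_search with iterators rather than sregex_token_iterator (that substitution and the helper name match_at are choices made for this example, not part of the original parser):

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: succeeds only if `token` matches starting exactly at
// offset `pos` of `source`; the matched text is stored in `out`.
bool match_at(const std::string& source, std::size_t pos,
              const std::regex& token, std::string& out)
{
    out.clear();
    if (pos > source.size())
        return false;
    std::smatch m;
    const auto first =
        source.cbegin() + static_cast<std::string::difference_type>(pos);
    // match_continuous forbids skipping ahead to a later match.
    if (std::regex_search(first, source.cend(), m, token,
                          std::regex_constants::match_continuous))
    {
        out = m.str();
        return true;
    }
    return false;
}
```

With source "AAA.BBB" and the regex "\\.", this should fail at offset 0 and succeed at offset 3, which is the behaviour a lexer wants.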
Try to escape the dot by its ASCII code:
regex("\\x2E")
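For what it's worth, with std::regex both spellings should behave the same once the C++ string escaping is right. A quick way to check on your own implementation (fully_matches is just a throwaway helper for this example):

```cpp
#include <regex>
#include <string>

// Hypothetical check: does `pattern` match the whole of `text`?
bool fully_matches(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex(pattern));
}
```

Here fully_matches(".", "\\.") and fully_matches(".", "\\x2E") are both expected to be true, and both should be false for "a", whereas the unescaped pattern "." accepts "a" as well.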