Regex in C++11 vs PHP - c++

I'm new to regex and C++11. In order to match an expression like this :
TYPE SIZE NUMBER ("regina s x99");
I built a regex which looks like this one :
\b(regina|margarita|americaine|fantasia)\b \b(s|l|m|xl|xxl)\b x([1-9])([0-9])
In my code I did this to try the regex :
std::string s("regina s x99");
std::regex rgx($RGX); //$RGX corresponds to the regex above
if (std::regex_match(s, rgx))
std::cout << "It works !" << std::endl;
This code throw a std::regex_error, but I don't know where it comes from..
Thanks,

This works with g++ (4.9.2) in c++11 mode:
std::regex rgx("\\b(regina|margarita|americaine|fantasia)\\b\\s*(s|l|m|xl|xxl)\\b\\s*x([1-9]*[0-9])");
This will capture three groups: regina s 99 which matches the TYPE SIZE NUMBER pattern, while your original captured four groups regina s 9 9 and had the NUMBER as two values (maybe that was what you wanted though).
Demo on IdeOne

In C++ strings the \ character is special and needs to be escaped so that it gets passed to the regular expression engine, not interpreted by the compiler.
So you either need to use \\b:
std::regex rgx("\\b(regina|margarita|americaine|fantasia)\\b \\b(s|l|m|xl|xxl)\\b x([1-9])([0-9])");
or use a raw string, which means that \ is not special and doesn't need to be escaped:
std::regex rgx(R"(\b(regina|margarita|americaine|fantasia)\b \b(s|l|m|xl|xxl)\b x([1-9])([0-9]))");

There was a typo in this line in original question:
if (std::reegex_match(s, rgx))
More over I am not sure what are you passing with this variable : $RGX
Corrected program as follows:
#include<regex>
#include<iostream>
int main()
{
std::string s("regina s x99");
std::regex rgx("\\b(regina|margarita|americaine|fantasia)\\b \\s*(s|l|m|xl|xxl)\\b \\s*x([1-9])([0-9])"); //$RGX corresponds to the regex above
if (std::regex_match(s, rgx))
std::cout << "It works !" << std::endl;
else
std::cout<<"No Match"<<std::endl;
}

Related

How to match one of multiple alternative patterns with C++11 regex [duplicate]

This question already has an answer here:
Strange results when using C++11 regexp with gcc 4.8.2 (but works with Boost regexp) [duplicate]
(1 answer)
Closed 7 years ago.
With Perl, the following results in a match:
echo xyz | perl -ne 'print if (/.*(yes|no|xy).*/);'
I'm trying to achieve the same thing with a C++ regex. The ECMAScript syntax documentation says
A regular expression can contain multiple alternative patterns simply
by separating them with the separator operator (|): The regular
expression will match if any of the alternatives match, and as soon as
one does.
However, the following example seems to suggest that std::regex_match only matches the first two alternatives, ignoring the third:
std::string pattern1 = ".*(yes|no|xy).*";
std::string pattern2 = ".*(yes|xy|no).*";
std::regex re1(pattern1);
std::regex re2(pattern2);
for (std::string str : {"yesplease", "bayes", "nobody", "anode", "xyz", "abc"} ) {
if (std::regex_match(str,re1)) {
std::cout << str << "\t\tmatches " << pattern1 << "\n";
}
else if (std::regex_match(str,re2)) {
std::cout << str << "\t\tmatches " << pattern2 << "\n";
}
}
Output:
yesplease matches .*(yes|no|xy).*
bayes matches .*(yes|no|xy).*
nobody matches .*(yes|no|xy).*
anode matches .*(yes|no|xy).*
xyz matches .*(yes|xy|no).*
How can I obtain the same behaviour as with my Perl regex example, i.e. having 'xyz' match pattern1?
It looks like regex is not fully implemented in gcc version 4.8.2 but rather in later versions of gcc (i.e., version > 4.9.0).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53631
In gcc version 4.9.0 works ok LIVE DEMO
So I guess you'll have to upgrade to newer version of gcc.

Find and replace with regular expressions

I'm trying to replace a bunch of function calls using regular expressions but can't seem to be getting it right. This is a simplified example of what I'm trying to do:
GetPetDog();
GetPetCat();
GetPetBird();
I want to change to:
GetPet<Animal_Dog>();
GetPet<Animal_Cat>();
GetPet<Animal_Bird>();
Use below regex:
(GetPet)([^(]*) with subsitution \1<Animal_\2>
Demo
You can use the following regex and code for that:
std::string ss ("GetPetDog();");
static const std::regex ee ("GetPet([^()]*)");
std::string result;
result = regex_replace(ss, ee, "GetPet<Animal_$1>");
std::cout << result << endl;
Regex:
GetPet - Matches GetPet literally (we need no capturing group here)
([^()]*) - A capturing group to match any characters other than ( or ) 0 or more times (*)
Output:

c++ regex substring wrong pattern found

I'm trying to understand the logic on the regex in c++
std::string s ("Ni Ni Ni NI");
std::regex e ("(Ni)");
std::smatch sm;
std::regex_search (s,sm,e);
std::cout << "string object with " << sm.size() << " matches\n";
This form shouldn't give me the number of substrings matching my pattern? Because it always give me 1 match and it says that the match is [Ni , Ni]; but i need it to find every single pattern; they should be 3 and like this [Ni][Ni][Ni]
The function std::regex_search only returns the results for the first match found in your string.
Here is a code, merged from yours and from cplusplus.com. The idea is to search for the first match, analyze it, and then start again using the rest of the string (that is to say, the sub-string that directly follows the match that was found, which can be retrieved thanks to match_results::suffix ).
Note that the regex has two capturing groups (Ni*) and ([^ ]*).
std::string s("the knights who say Niaaa and Niooo");
std::smatch m;
std::regex e("(Ni*)([^ ]*)");
while (std::regex_search(s, m, e))
{
for (auto x : m)
std::cout << x.str() << " ";
std::cout << std::endl;
s = m.suffix().str();
}
This gives the following output:
Niaaa Ni aaa
Niooo Ni ooo
As you can see, for every call to regex_search, we have the following information:
the content of the whole match,
the content of every capturing group.
Since we have two capturing groups, this gives us 3 strings for every regex_search.
EDIT: in your case if you want to retrieve every "Ni", all you need to do is to replace
std::regex e("(Ni*)([^ ]*)");
with
std::regex e("(Ni)");
You still need to iterate over your string, though.

Why am I getting multiple regex matches?

I'm trying to write a processor for GLSL shader code that will allow me to analyze the code and dynamically determine what inputs and outputs I need to handle for each shader.
To accomplish that, I decided to use some regex to parse the shader code before I compile it via OpenGL.
I've written some test code to verify that the regex is working as I expect.
Code:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main()
{
string strInput = " in vec3 i_vPosition; ";
smatch match;
// Will appear in regex as:
// \bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
regex rgx("\\bin\\s+[a-zA-Z0-9]+\\s+[a-zA-Z0-9_]+\\s*(\\[[0-9]+\\])?\\s*;");
bool bMatchFound = regex_search(strInput, match, rgx);
cout << "Match found: " << bMatchFound << endl;
for (int i = 0; i < match.size(); ++i)
{
cout << "match " << i << " (" << match[i] << ") ";
cout << "at position " << match.position(i) << std::endl;
}
}
The only problem is that the above code generates two results instead of one. Though one of the results is empty.
Output:
Match found: 1
match 0 (in vec3 i_vPosition;) at position 6
match 1 () at position 34
I ultimately want to generate multiple results when I provide a whole file as input, but I'd like to get some consistency so that I can process the results in a consistent manner.
Any ideas as to why I'm getting multiple results when I'm only expecting one?
Your regex appears to contain a back reference
(\[[0-9]+\])?
which would contain square brackets surrounding 1 or more digits, but the ? makes it optional.
When applying the regex, the leading and trailing spaces are trimmed by the
\s+ ... \s*
The remainder of the string is matched by
[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*
And the backreference bit matches the empty string.
If you want to match strings that optionally contain that bit, but not return it as a backreference, make it passive with ?: like:
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(?:\[[0-9]+\])?\s*
I ultimately want to generate multiple results
The regex_search only finds the first match of the complete regular expression.
If you want to find the other places in your source text that the complete regular expression matches,
you must run regex_search repeatedly.
See
" C++ Regex to match words without punctuation "
for an example of repeatedly running the search.
the above code generates two results instead of one.
Confusing, isn't it?
The regular expression
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
includes round brackets().
The round brackets create a "group" aka "sub-expression".
Because the sub-expression is optional "(....)?",
the expression as a whole is allowed to match even if the sub-expression doesn't really match anything.
When the sub-expression doesn't match anything, the value of that sub-expression is an empty string.
See "Regular-expressions: Use Round Brackets for Grouping" for far more information on "capturing parenthesis" and "non-capturing parenthesis".
According to the documentation for regex_search,
match.size() is the number of subexpressions plus 1,
match[0] is the part of the source string that matches the complete regular expression.
match[1] is the part of the source string that matches the first sub-expression inside the regular expression.
match[n] is the part of the source string that matches the n'th sub-expression inside the regular expression.
A regular expression with only 1 sub-expression, as in the above example, will always return a match.size() of 2 -- one match for the complete regular expression, and one match for the sub-expression -- even when that sub-expression doesn't really match anything and is therefore the empty string.

Regular expression validation fails while egrep validates just fine

I'm trying to use regular expressions in order to validate strings so before I go any further let me explain first how the strings looks like: optional number of digits followed by an 'X' and an optional ('^' followed by one or more digits).
Here are some exmaples: "2X", "X", "23X^6" fit the pattern while strings like "X^", "4", "foobar", "4X^", "4X44" don't.
Now where was I: using 'egrep' and the "^[0-9]{0,}\X(\^[0-9]{1,})$" regex I can validate just fine those strings however when trying this in C++ using the C++11 regex library it fails.
Here's the code I'm using to validate those strings:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main()
{
std::regex r("^[0-9]{0,}\\X(\\^[0-9]{1,})$",
std::regex_constants::egrep);
std::vector<std::string> challanges_ok {"2X", "X", "23X^66", "23X^6",
"3123X", "2313131X^213213123"};
std::vector<std::string> challanges_bad {"X^", "4", "asdsad", " X",
"4X44", "4X^"};
std::cout << "challanges_ok: ";
for (auto &str : challanges_ok) {
std::cout << std::regex_match(str, r) << " ";
}
std::cout << "\nchallanges_bad: ";
for (auto &str : challanges_bad) {
std::cout << std::regex_match(str, r) << " ";
}
std::cout << "\n";
return 0;
}
Am I doing something wrong or am I missing something? I'm compiling under GCC 4.7.
Your regex fails to make the '^' followed by one or more digits optional; change it to:
"^[0-9]*X(\\^[0-9]+)?$".
Also note that this page says that GCC's support of <regex> is only partial, so std::regex may not work at all for you ('partial' in this context apparently means 'broken'); have you tried Boost.Xpressive or Boost.Regex as a sanity check?
optional number of digits followed by an 'X' and an optional ('^' followed by one or more digits).
OK, the regular expression in your code doesn't match that description, for two reasons: you have an extra backslash on the X, and the '^digits' part is not optional. The regex you want is this:
^[0-9]{0,}X(\^[0-9]{1,}){0,1}$
which means your grep command should look like this (note single quotes):
egrep '^[0-9]{0,}X(\^[0-9]{1,}){0,1}$' filename
And the string you have to pass in your C++ code is this:
"^[0-9]{0,}X(\\^[0-9]{1,}){0,1}$"
If you then replace all the explicit quantifiers with their more traditional abbreviations, you get #ildjarn's answer: {0,} is *, {1,} is +, and {0,1} is ?.