Why am I getting multiple regex matches? - c++

I'm trying to write a processor for GLSL shader code that will allow me to analyze the code and dynamically determine what inputs and outputs I need to handle for each shader.
To accomplish that, I decided to use some regex to parse the shader code before I compile it via OpenGL.
I've written some test code to verify that the regex is working as I expect.
Code:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main()
{
    string strInput = " in vec3 i_vPosition; ";
    smatch match;
    // Will appear in regex as:
    // \bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
    regex rgx("\\bin\\s+[a-zA-Z0-9]+\\s+[a-zA-Z0-9_]+\\s*(\\[[0-9]+\\])?\\s*;");
    bool bMatchFound = regex_search(strInput, match, rgx);
    cout << "Match found: " << bMatchFound << endl;
    // size_t avoids a signed/unsigned comparison with match.size()
    for (size_t i = 0; i < match.size(); ++i)
    {
        cout << "match " << i << " (" << match[i] << ") ";
        cout << "at position " << match.position(i) << std::endl;
    }
}
The only problem is that the above code generates two results instead of one, though one of the results is empty.
Output:
Match found: 1
match 0 (in vec3 i_vPosition;) at position 6
match 1 () at position 34
I ultimately want to generate multiple results when I provide a whole file as input, but I'd like to get some consistency so that I can process the results in a consistent manner.
Any ideas as to why I'm getting multiple results when I'm only expecting one?

Your regex contains a capturing group
(\[[0-9]+\])?
which matches square brackets surrounding 1 or more digits, but the ? makes it optional.
When the regex is applied, regex_search skips the leading spaces because it looks for a match anywhere in the input, and the spaces after the ; are simply not part of the match.
The declaration itself is matched by
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*
and the optional capturing group matches nothing, so it is reported as an empty sub-match.
If you want to match strings that optionally contain that bit, but not report it as a sub-match, make the group non-capturing with ?: like:
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(?:\[[0-9]+\])?\s*;

I ultimately want to generate multiple results
The regex_search only finds the first match of the complete regular expression.
If you want to find the other places in your source text that the complete regular expression matches,
you must run regex_search repeatedly.
See "C++ Regex to match words without punctuation" for an example of repeatedly running the search.
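As a rough sketch of the repeated search applied to a whole shader source, std::sregex_iterator can drive the loop. The helper name findInputDeclarations is made up for illustration, and the pattern is the one from the question with the array suffix made non-capturing:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): returns every
// "in <type> <name>[N];" declaration found in a shader source string.
std::vector<std::string> findInputDeclarations(const std::string& source)
{
    // Pattern from the question; the optional array suffix is non-capturing.
    static const std::regex rgx(
        R"(\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(?:\[[0-9]+\])?\s*;)");

    std::vector<std::string> declarations;
    // Each increment of the iterator runs the search again, starting
    // right after the previous match.
    for (std::sregex_iterator it(source.begin(), source.end(), rgx), end;
         it != end; ++it)
    {
        declarations.push_back(it->str());
    }
    return declarations;
}
```

Calling it on a source containing two "in" declarations and one "out" declaration should return only the two "in" lines.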
the above code generates two results instead of one.
Confusing, isn't it?
The regular expression
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
includes round brackets ().
The round brackets create a "group" aka "sub-expression".
Because the sub-expression is optional "(....)?",
the expression as a whole is allowed to match even if the sub-expression doesn't really match anything.
When the sub-expression doesn't match anything, the value of that sub-expression is an empty string.
See "Regular-expressions: Use Round Brackets for Grouping" for far more information on "capturing parentheses" and "non-capturing parentheses".
According to the documentation for regex_search,
match.size() is the number of subexpressions plus 1,
match[0] is the part of the source string that matches the complete regular expression.
match[1] is the part of the source string that matches the first sub-expression inside the regular expression.
match[n] is the part of the source string that matches the n'th sub-expression inside the regular expression.
A regular expression with only 1 sub-expression, as in the above example, will always return a match.size() of 2 -- one match for the complete regular expression, and one match for the sub-expression -- even when that sub-expression doesn't really match anything and is therefore the empty string.
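If you need to tell an empty sub-match apart from a group that did not take part in the match at all, the matched flag of the sub-match can be checked. The following is only a sketch; the struct and function names are invented for illustration:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type (not from the question).
struct DeclarationMatch
{
    std::size_t subMatchCount; // match.size(): whole match plus one per group
    bool hasArraySuffix;       // match[1].matched
    std::string arraySuffix;   // empty when the group did not participate
};

// Runs the question's regex and reports what sub-expression 1 captured.
DeclarationMatch inspect(const std::string& input)
{
    static const std::regex rgx(
        R"(\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;)");
    std::smatch match;
    std::regex_search(input, match, rgx);
    return { match.size(), match[1].matched, match[1].str() };
}
```

For " in vec3 i_vPosition; " the count is 2 and hasArraySuffix is false. For an array declaration the count is still 2, but the group holds the bracketed size.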

Related

How to consider taking dot in the number in regular expression

Take a look at the following regular expression
std::regex reg("[A][-+]?([0-9]*\\.[0-9]+|[0-9]+)");
This will find any letter A followed by a floating-point number. The problem is that if the input is A30., this regular expression ignores the dot and prints the result as A30. I would like to force the regular expression to include the decimal dot as well. Is this feasible?
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main()
{
    std::string line("A50. hsih Y0 his ");
    std::smatch match;
    std::regex reg("[A][-+]?([0-9]*\\.[0-9]+|[0-9]+)");
    if ( std::regex_search(line,match,reg) ){
        cout << match.str(0) << endl;
    }else{
        cout << "nothing found" << endl;
    }
    return 0;
}
You require the dot to be followed by one or more (+) digits. Just make the trailing digits optional by changing it to:
std::regex reg("[A][-+]?([0-9]*\\.[0-9]*|[0-9]+)");
The only problem with this expression is that it would also match A followed by a single dot without any digits. I don't know if you'd see this as a valid match. A more robust alternative would hence be:
std::regex reg("[A][-+]?([0-9]*\\.[0-9]+|[0-9]+\\.?)");
So either trailing digits, or digits followed optionally by a dot.
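A quick way to convince yourself is to run the more robust expression over a few inputs. This is a minimal sketch; the helper name firstANumber is made up for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first "A<number>" token in line,
// or an empty string when there is none.
std::string firstANumber(const std::string& line)
{
    // Either optional digits, a dot and at least one digit,
    // or at least one digit followed by an optional dot.
    static const std::regex reg(R"(A[-+]?([0-9]*\.[0-9]+|[0-9]+\.?))");
    std::smatch match;
    return std::regex_search(line, match, reg) ? match.str(0) : std::string();
}
```

With this expression "A50. hsih Y0 his " yields A50., while an A followed only by a dot yields no match.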
You can change your regex like this:
A[-+]?(?:[0-9]*\\.?(?:[0-9]+)?)
A - Matches A.
[-+]? - Matches + or -. (? makes it optional)
(?:[0-9]*\\.?(?:[0-9]+)?) - Non-capturing group for the number:
[0-9]*\\.? - Matches zero or more digits followed by a dot. (? makes the dot optional)
(?:[0-9]+)? - Matches one or more digits. (? makes it optional)
Note that every part after the A is optional, so this expression also matches a bare A.

Regex grouping matches with C++ 11 regex library

I'm trying to use a regex for group matching. I want to extract two strings from one big string.
The input string looks something like this:
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
The Username can be anything. Same goes for the end part this is a message.
What I want to do is extract the Username that comes after the pound sign #, not from any other place in the string, since that can vary as well. I also want to get the message that comes after the colon :.
I tried that with the following regex. But it never outputs any results.
regex rgx("WEBMSG #([a-zA-Z0-9]) :(.*?)");
smatch matches;
for(size_t i=0; i<matches.size(); ++i) {
cout << "MATCH: " << matches[i] << endl;
}
I'm not getting any matches. What is wrong with my regex?
Your regular expression is incorrect because neither capture group does what you want. The first is looking to match a single character from the set [a-zA-Z0-9] followed by <space>:, which works for single character usernames, but nothing else. The second capture group will always be empty because you're looking for zero or more characters, but also specifying the match should not be greedy, which means a zero character match is a valid result.
Fixing both of these your regex becomes
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
But simply instantiating a regex and a match_results object does not produce matches; you need to apply a regex algorithm. Since you only want to match part of the input string, the appropriate algorithm to use in this case is regex_search.
std::regex_search(s, matches, rgx);
Putting it all together
std::string s{R"(
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
)"};
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
std::smatch matches;
if(std::regex_search(s, matches, rgx)) {
    std::cout << "Match found\n";
    for (size_t i = 0; i < matches.size(); ++i) {
        std::cout << i << ": '" << matches[i].str() << "'\n";
    }
} else {
    std::cout << "Match not found\n";
}
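To see the effect of the lazy quantifier in isolation, the patterns can be compared side by side. The helper below is only an illustrative sketch (its name and the "<no match>" marker are invented); it returns whatever the second capture group holds:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text captured by group 2 of pattern,
// or "<no match>" if the pattern is not found in line.
std::string capturedMessage(const std::string& line, const std::string& pattern)
{
    const std::regex rgx(pattern);
    std::smatch m;
    if (!std::regex_search(line, m, rgx))
        return "<no match>";
    return m[2].str();
}
```

With a multi-character username the original pattern does not match at all. Adding the + to the first group makes it match, but the trailing (.*?) still captures nothing. Only the greedy (.*) returns the message.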
"WEBMSG #([a-zA-Z0-9]) :(.*?)"
This regex matches only strings whose username is exactly 1 character long, followed by any message after the colon. The second group will always be empty, because a lazy quantifier at the end of the pattern matches as few characters as possible, which is zero.
This should work:
"WEBMSG #([a-zA-Z0-9]+) :(.*)"

c++ regex substring wrong pattern found

I'm trying to understand the logic on the regex in c++
std::string s ("Ni Ni Ni NI");
std::regex e ("(Ni)");
std::smatch sm;
std::regex_search (s,sm,e);
std::cout << "string object with " << sm.size() << " matches\n";
Shouldn't this give me the number of substrings matching my pattern? It always gives me 1 match and says that the match is [Ni, Ni], but I need it to find every occurrence of the pattern. There should be 3, like this: [Ni][Ni][Ni]
The function std::regex_search only returns the results for the first match found in your string.
Here is some code, merged from yours and from cplusplus.com. The idea is to search for the first match, analyze it, and then start again using the rest of the string (that is to say, the sub-string that directly follows the match that was found, which can be retrieved thanks to match_results::suffix).
Note that the regex has two capturing groups (Ni*) and ([^ ]*).
std::string s("the knights who say Niaaa and Niooo");
std::smatch m;
std::regex e("(Ni*)([^ ]*)");
while (std::regex_search(s, m, e))
{
    for (auto x : m)
        std::cout << x.str() << " ";
    std::cout << std::endl;
    s = m.suffix().str();
}
This gives the following output:
Niaaa Ni aaa
Niooo Ni ooo
As you can see, for every call to regex_search, we have the following information:
the content of the whole match,
the content of every capturing group.
Since we have two capturing groups, this gives us 3 strings for every regex_search.
EDIT: in your case if you want to retrieve every "Ni", all you need to do is to replace
std::regex e("(Ni*)([^ ]*)");
with
std::regex e("(Ni)");
You still need to iterate over your string, though.
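If all you need is the number of occurrences, the iteration can also be left to std::sregex_iterator. A minimal sketch, with a helper name chosen only for illustration:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of pattern in text.
std::size_t countMatches(const std::string& text, const std::string& pattern)
{
    const std::regex e(pattern);
    // A default-constructed sregex_iterator is the end-of-sequence marker.
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(text.begin(), text.end(), e),
        std::sregex_iterator()));
}
```

For the string "Ni Ni Ni NI" and the pattern "(Ni)" this counts 3, because the final "NI" differs in case.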

QRegExp not finding expected string pattern

I am working in Qt 5.2, and I have a piece of code that takes in a string and enters one of several if statements based on its format. One of the formats searched for is the letters "RCV", followed by a variable amount of numbers, a decimal, and then one more number. There can be more than one of these values in the line, separated by "|", for example it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9". Right now I am using the QRegExp class to find this, like this:
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^[RCV\d+\.\d\|?]+$").exactMatch(value))
    std::cout << ":D" << std::endl;
else
    std::cout << ":(" << std::endl;
I want it to use the if statement, but it keeps going into the else statement. Is there something I'm doing wrong with the regular expression?
Your expression should be as @vahancho mentioned in a comment:
if(QRegExp("^[RCV\\d+\\.\\d\\|?]+$").exactMatch(value))
If you use C++11, then you can use its raw strings feature:
if(QRegExp(R"(^[RCV\d+\.\d\|?]+$)").exactMatch(value))
Aside from escaping the backslashes, which others have mentioned in answers and comments,
There can be more than one of these values in the line, separated by "|", for example it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9".
[RCV\d+\.\d\|?] may not be doing what you expect. Perhaps you want () instead of []:
/^
[RCV\d+\.\d\|?]+ # More than one of characters from the list:
# R, C, V, a digit, a +, a dot, a digit, a |, a ?
$/x
/^
(
RCV\d+\.\d # RCV, some digits, a dot, followed by a digit
\|? # Optional: a |
)+ # Quantifier of one or more
$/x
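Also, maybe you could revise the regex so that each | must be followed by another value. This is PCRE syntax; QRegExp may not support the (?1) subpattern call, in which case the group has to be written out a second time:
/^
(RCV\d+\.\d) # RCV, some digits, a dot, followed by a digit
(
\|(?1) # A |, then match subpattern 1 (above)
)* # Quantifier of zero or more, so a single value still matches
$/x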
To check that the line contains only valid occurrences, and to require a | before the second and every later occurrence (your character-class version would not require the | even with the backslashes doubled):
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^RCV\\d+\\.\\d(\\|RCV\\d+\\.\\d)*$").exactMatch(value))
    std::cout << ":D" << std::endl;
else
    std::cout << ":(" << std::endl;
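The difference between the character class and the grouped form can be checked without Qt. The sketch below swaps in std::regex purely so that it is self-contained; that substitution and the function names are assumptions for illustration, not part of the original answers:

```cpp
#include <regex>
#include <string>

// Strict form: one RCV value, then zero or more "|RCV value" repetitions.
bool isRcvList(const std::string& value)
{
    static const std::regex strict(R"(^RCV\d+\.\d(\|RCV\d+\.\d)*$)");
    return std::regex_match(value, strict);
}

// Character-class form from the question (with the escaping fixed):
// accepts any run of the characters R, C, V, digits, +, ., | and ?.
bool isRcvCharacters(const std::string& value)
{
    static const std::regex loose(R"(^[RCV\d+\.\d\|?]+$)");
    return std::regex_match(value, loose);
}
```

Both accept "RCV000030249.2|RCV000035360.2", but only the character-class form also accepts nonsense such as "VVV+++" or two values with no | between them.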

C++11 regex to tokenize Mathematical Expression

I have the following code to tokenize a string of the format: (1+2)/((8))-(100*34):
I'd like to throw an error to the user if they use an operator or character that isn't part of my regex.
e.g if user enters 3^4 or x-6
Is there a way to negate my regex, search for it and if it is true throw the error?
Can the regex expression be improved?
//Using c++11 regex to tokenize input string
//[0-9]+ = 1 or many digits
//Or [\\-\\+\\\\\(\\)\\/\\*] = "-" or "+" or "/" or "*" or "(" or ")"
std::regex e ( "[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
std::sregex_iterator rend;
std::sregex_iterator a( infixExpression.begin(), infixExpression.end(), e );
queue<string> infixQueue;
while (a!=rend) {
    infixQueue.push(a->str());
    ++a;
}
return infixQueue;
-Thanks
You can search the string with the expression [^0-9()+\-*/], written as the C++ string "[^0-9()+\\-*/]", which finds any character that is NOT a digit, a round bracket, a plus or minus sign (actually a hyphen), an asterisk or a slash.
This search should not find anything; if it does, the string contains an unsupported character such as ^ or x.
[...] is a positive character class which means find a character being one of the characters in the square brackets.
[^...] is a negative character class which means find a character NOT being one of the characters in the square brackets.
The only characters which must be escaped within square brackets to be interpreted as literal characters are ], \ and -, although - need not be escaped if it is the first or last character in the list. It is nevertheless better to always escape - within square brackets, as this makes it unambiguous that the hyphen is meant as a literal character and not as a range "FROM x to z".
Of course, this expression does not check for missing closing round brackets. But formula parsers often do not require a closing parenthesis for every opening parenthesis, unlike a compiler or script interpreter, simply because it is not needed to calculate the value of the entered formula.
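Put into code, the check might look like the following sketch. The function name and the exception type are choices made for illustration. Note that the character class as written also rejects whitespace:

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws std::invalid_argument if expr contains a
// character outside digits, round brackets and the operators + - * /.
void requireSupportedCharacters(const std::string& expr)
{
    // Negated class: matches any single character that is NOT supported.
    static const std::regex unsupported(R"([^0-9()+\-*/])");
    std::smatch m;
    if (std::regex_search(expr, m, unsupported))
    {
        throw std::invalid_argument(
            "unsupported character '" + m.str() + "' at position " +
            std::to_string(m.position()));
    }
}
```

Calling it before tokenizing lets inputs such as 3^4 or x-6 be reported to the user, while (1+2)/((8))-(100*34) passes through.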
An answer has already been given, but perhaps someone might need this:
[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]
This regex separates floats, integers and arithmetic operators.
Here's the trick:
[0-9]?([0-9]*[.])?[0-9]+ -> if it's a digit and has a point, then grab the digits with the point and the digits that follow it; if not, just grab the digits.
Sorry if my answer isn't clear, I just learned regex and found this solution on my own by trial and error.
Here's the code (it takes a mathematical expression and splits all numbers and operators into a vector):
NOTE: I don't know if it accepts whitespace; the mathematical expressions I worked with had none. For example, 4+2*(3+1) is separated nicely, but I haven't tried it with whitespace.
/* Separate every int or float or operator into a single string using regular expression and store it in untokenize vector */
string infix; //The string to be parsed (the arithmetic operation if you will)
vector<string> untokenize;
std::regex words_regex("[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
auto words_begin = std::sregex_iterator(infix.begin(), infix.end(), words_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
    cout << (*i).str() << endl;
    untokenize.push_back((*i).str());
}
Output:
(
1
+
2
)
/
(
(
8
)
)
-
(
100
*
34
)
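As for the whitespace question, a small function makes it easy to try. This is only a sketch: the function name is invented, and the operator class is written as [-+*/()] in a raw string, which drops the literal backslash that the original class also contained:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits an infix expression into number and
// operator tokens. Characters that match neither alternative
// (for example spaces) are skipped by the iterator.
std::vector<std::string> tokenize(const std::string& infix)
{
    static const std::regex token(R"([0-9]?([0-9]*[.])?[0-9]+|[-+*/()])");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(infix.begin(), infix.end(), token), end;
         it != end; ++it)
    {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Because the iterator only reports matches, "4 + 2" and "4+2" produce the same three tokens, and "4.5" stays together as a single token.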