Regex grouping matches with C++ 11 regex library - c++

I'm trying to use a regex for group matching. I want to extract two strings from one big string.
The input string looks something like this:
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
The Username can be anything. Same goes for the end part this is a message.
What I want to do is extract the Username that comes after the pound sign #, and not from any other place in the string, since that can vary as well. I also want to get the message that comes after the colon :.
I tried that with the following regex. But it never outputs any results.
regex rgx("WEBMSG #([a-zA-Z0-9]) :(.*?)");
smatch matches;
for(size_t i=0; i<matches.size(); ++i) {
cout << "MATCH: " << matches[i] << endl;
}
I'm not getting any matches. What is wrong with my regex?

Your regular expression is incorrect because neither capture group does what you want. The first is looking to match a single character from the set [a-zA-Z0-9] followed by <space>:, which works for single character usernames, but nothing else. The second capture group will always be empty because you're looking for zero or more characters, but also specifying the match should not be greedy, which means a zero character match is a valid result.
Fixing both of these your regex becomes
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
But simply instantiating a regex and a match_results object does not produce matches; you need to apply a regex algorithm. Since you only want to match part of the input string, the appropriate algorithm in this case is regex_search.
std::regex_search(s, matches, rgx);
Putting it all together
std::string s{R"(
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
)"};
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
std::smatch matches;
if(std::regex_search(s, matches, rgx)) {
std::cout << "Match found\n";
for (size_t i = 0; i < matches.size(); ++i) {
std::cout << i << ": '" << matches[i].str() << "'\n";
}
} else {
std::cout << "Match not found\n";
}

"WEBMSG #([a-zA-Z0-9]) :(.*?)"
This regex only matches strings whose username is exactly one character long. It allows any message after the colon, but the second group will always be empty, because the non-greedy .*? matches as few characters as possible, which here is zero.
This should work:
"WEBMSG #([a-zA-Z0-9]+) :(.*)"
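As a rough sketch of how the corrected pattern could be applied one line at a time, the helper below wraps regex_search and copies out the two groups. The name parse_webmsg and its out-parameters are made up for illustration and are not part of the original code.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: extracts the user name and message from one WEBMSG line.
// Returns false (leaving the outputs untouched) when the line is not a WEBMSG.
bool parse_webmsg(const std::string& line, std::string& user, std::string& message) {
    static const std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
    std::smatch m;
    if (!std::regex_search(line, m, rgx)) {
        return false;
    }
    user = m[1].str();     // group 1: the characters after '#'
    message = m[2].str();  // group 2: everything after " :" up to the end of the line
    return true;
}
```

Because . does not match line terminators in the default ECMAScript grammar, the greedy (.*) stops at the end of the line even when the input holds several lines.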

Related

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
And I have such sample below.
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
std::cout << match.size() << "match: " << match[0] << '\n';
s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How can I create a proper regex that will ignore such strings?
UPD
If I do like this
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I only find the substring in the middle
1match: , /_mySymbol_
and I don't find the substrings at the beginning and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
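Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). Note that with this form the match won't include the characters before/after mysymbol. Be aware that the default ECMAScript grammar of std::regex supports lookahead but not lookbehind, so the (?<!...) part works in engines such as PCRE but throws std::regex_error in C++11; there, keep the (?:[^a-zA-Z0-9]+|^) prefix from the first option and use only the lookahead.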
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.
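A minimal sketch of how the two options could be combined for std::regex, which has lookahead but no lookbehind: the prefix is consumed by an alternation, the word is captured in group 1, and the trailing check is a lookahead. The helper name find_whole is hypothetical, and the sketch assumes the searched word contains no regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every occurrence of `word` in `text` that is not
// directly preceded or followed by an ASCII letter or digit (case-insensitive).
// Assumes `word` contains no regex metacharacters.
std::vector<std::string> find_whole(const std::string& text, const std::string& word) {
    // Group 1 captures the word itself. The prefix alternative is consumed,
    // while the trailing check is a lookahead and consumes nothing.
    const std::regex rgx("(?:^|[^a-zA-Z0-9])(" + word + ")(?![a-zA-Z0-9])",
                         std::regex::icase);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end; it != end; ++it) {
        found.push_back((*it)[1].str());
    }
    return found;
}
```

Iterating with std::sregex_iterator instead of reassigning s = match.suffix() keeps ^ anchored to the real start of the text, so a later occurrence is not mistaken for one at the beginning of the string.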

c++11 (MSVS2012) regex looking for file names in multiple line std::string

I have been trying to search for a clear answer on this one, but have not been able to find it.
So let's say I have the string (where \n could be \r\n - I want to handle both - not sure if that is relevant or not)
"4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54"
Then I want to get matches:
a_file_123.xml
a_file_j34.xml
Here is my test code:
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
std::smatch matches;
if (std::regex_search(s, matches, std::regex("a_file_(.*)\\.xml")))
{
std::cout << "total: " << matches.size() << std::endl;
for (unsigned int i = 0; i < matches.size(); i++)
{
std::cout << "match: " << matches[i] << std::endl;
}
}
Output is:
total: 2
match: a_file_123.xml
match: 123
I don't quite understand why match 2 is just "123"...
You only have one match, not two, as the regex_search method returns a single match. What you printed is two group values, Group 0 (the whole match, a_file_123.xml here) and Group 1 (the capturing group value, here, 123 that is a substring captured with a capturing group you defined as (.*) in the pattern).
If you want to match multiple strings, you need to use the regex iterator, not just a regex_search that only returns the first match.
Besides, .* is too greedy and will return weird results if you have more than one match on the same line. It seems you want to match letters or digits, so .* can be replaced with \w+. If there really can be anything there, use the lazy .*? instead.
Use
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
const std::regex rx("a_file_\\w+\\.xml");
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx),
std::sregex_token_iterator());
std::cout << "Number of matches: " << results.size() << std::endl;
for (auto result : results)
{
std::cout << result << std::endl;
}
This yields:
Number of matches: 2
a_file_123.xml
a_file_j34.xml
Notes on regex
a_file_ - a literal substring
\\w+ - 1+ word chars (letters, digits, _) (note you may use [^.]*? here instead of \\w+ if you want to match any char, 0 or more repetitions, as few as possible, up to the first .xml)
\\. - a dot (if you do not escape it, it will match any char except line break chars)
xml - a literal substring.
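If the captured part of each name is also needed, one possible variation is to iterate with std::sregex_iterator, which yields a full match_results object per match rather than only the matched text. The helper name find_files and the decision to return pairs are assumptions made for this sketch.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (file name, captured id) for every a_file_*.xml in `s`.
std::vector<std::pair<std::string, std::string>> find_files(const std::string& s) {
    static const std::regex rx("a_file_(\\w+)\\.xml");
    std::vector<std::pair<std::string, std::string>> out;
    for (std::sregex_iterator it(s.begin(), s.end(), rx), end; it != end; ++it) {
        // (*it)[0] is the whole match, (*it)[1] the text captured by (\w+).
        out.emplace_back((*it)[0].str(), (*it)[1].str());
    }
    return out;
}
```

Since \w matches neither \r nor \n, the same pattern should behave the same way for both kinds of line ending.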

Need regex to look for two words while ignoring everything in-between

I've been trying to make regex find both a two digit number and the word thanks, but ignore everything in-between.
Here is my current implementation in C++, but I need the two patterns to be consolidated into one:
regex pattern1{R"(\d\d)"};
regex pattern2{R"(thanks)"};
string to_search = "I would like the number 98 to be found and printed, thanks.";
smatch matches;
regex_search(to_search, matches, pattern1);
for (auto match : matches) {
cout << match << endl;
}
regex_search(to_search, matches, pattern2);
for (auto match : matches) {
cout << match << endl;
}
return 0;
Thanks!
EDIT: Is there any way to change ONLY the pattern and get rid of one of the for loops? Sorry for the confusion.

c++ regex substring wrong pattern found

I'm trying to understand the logic on the regex in c++
std::string s ("Ni Ni Ni NI");
std::regex e ("(Ni)");
std::smatch sm;
std::regex_search (s,sm,e);
std::cout << "string object with " << sm.size() << " matches\n";
Shouldn't this give me the number of substrings matching my pattern? It always gives me 1 match and says that the match is [Ni, Ni], but I need it to find every occurrence of the pattern; there should be 3, like this: [Ni][Ni][Ni]
The function std::regex_search only returns the results for the first match found in your string.
Here is some code, merged from yours and from an example on cplusplus.com. The idea is to search for the first match, analyze it, and then start again using the rest of the string (that is to say, the substring that directly follows the match that was found, which can be retrieved with match_results::suffix).
Note that the regex has two capturing groups (Ni*) and ([^ ]*).
std::string s("the knights who say Niaaa and Niooo");
std::smatch m;
std::regex e("(Ni*)([^ ]*)");
while (std::regex_search(s, m, e))
{
for (auto x : m)
std::cout << x.str() << " ";
std::cout << std::endl;
s = m.suffix().str();
}
This gives the following output:
Niaaa Ni aaa
Niooo Ni ooo
As you can see, for every call to regex_search, we have the following information:
the content of the whole match,
the content of every capturing group.
Since we have two capturing groups, this gives us 3 strings for every regex_search.
EDIT: in your case if you want to retrieve every "Ni", all you need to do is to replace
std::regex e("(Ni*)([^ ]*)");
with
std::regex e("(Ni)");
You still need to iterate over your string, though.
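As an alternative to the suffix loop above, the iteration can also be sketched with std::sregex_iterator, which walks all non-overlapping matches without copying the remaining string. The helper name count_matches is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex e(pattern);
    // A default-constructed sregex_iterator acts as the end-of-sequence marker.
    return static_cast<std::size_t>(
        std::distance(std::sregex_iterator(text.begin(), text.end(), e),
                      std::sregex_iterator()));
}
```

With the string from the question, count_matches("Ni Ni Ni NI", "(Ni)") counts the three "Ni" occurrences and skips the upper-case "NI", because the pattern is case-sensitive.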

Why am I getting multiple regex matches?

I'm trying to write a processor for GLSL shader code that will allow me to analyze the code and dynamically determine what inputs and outputs I need to handle for each shader.
To accomplish that, I decided to use some regex to parse the shader code before I compile it via OpenGL.
I've written some test code to verify that the regex is working as I expect.
Code:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main()
{
string strInput = " in vec3 i_vPosition; ";
smatch match;
// Will appear in regex as:
// \bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
regex rgx("\\bin\\s+[a-zA-Z0-9]+\\s+[a-zA-Z0-9_]+\\s*(\\[[0-9]+\\])?\\s*;");
bool bMatchFound = regex_search(strInput, match, rgx);
cout << "Match found: " << bMatchFound << endl;
for (int i = 0; i < match.size(); ++i)
{
cout << "match " << i << " (" << match[i] << ") ";
cout << "at position " << match.position(i) << std::endl;
}
}
The only problem is that the above code generates two results instead of one. Though one of the results is empty.
Output:
Match found: 1
match 0 (in vec3 i_vPosition;) at position 6
match 1 () at position 34
I ultimately want to generate multiple results when I provide a whole file as input, but I'd like to get some consistency so that I can process the results in a consistent manner.
Any ideas as to why I'm getting multiple results when I'm only expecting one?
Your regex contains a capturing group
(\[[0-9]+\])?
which captures square brackets surrounding 1 or more digits, but the ? makes it optional.
When the regex is applied, the whitespace between and after the tokens is consumed by
\s+ ... \s*
The rest of the declaration is matched by
[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*
and the optional group matches nothing, so it is reported as an empty string.
If you want to match strings that optionally contain that bit, but not have it reported as a capture, make the group non-capturing with ?: like:
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(?:\[[0-9]+\])?\s*;
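To see the difference between the two kinds of group in isolation, a small sketch like the following can report how many entries a successful search produces. The helper name reported_entries is made up for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: number of entries a successful search reports
// (the whole match plus one per capturing group), or 0 when nothing matches.
std::size_t reported_entries(const std::string& input, const std::string& pattern) {
    const std::regex rgx(pattern);
    std::smatch match;
    return std::regex_search(input, match, rgx) ? match.size() : 0;
}
```

For the test string from the question, the pattern with (\[[0-9]+\])? reports two entries, while the (?:\[[0-9]+\])? variant reports only the whole match.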
"I ultimately want to generate multiple results"
The regex_search only finds the first match of the complete regular expression.
If you want to find the other places in your source text that the complete regular expression matches,
you must run regex_search repeatedly.
See
" C++ Regex to match words without punctuation "
for an example of repeatedly running the search.
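For a whole shader file, one way to sketch the repeated search is with std::sregex_iterator rather than calling regex_search in a loop by hand. The helper name input_names and the choice to capture the type and the name are assumptions for illustration; the array suffix is kept non-capturing so every match has the same layout.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the variable name of every "in" declaration.
// The type and name are captured; the optional array suffix is non-capturing.
std::vector<std::string> input_names(const std::string& source) {
    static const std::regex rgx(
        R"(\bin\s+([a-zA-Z0-9]+)\s+([a-zA-Z0-9_]+)\s*(?:\[[0-9]+\])?\s*;)");
    std::vector<std::string> names;
    for (std::sregex_iterator it(source.begin(), source.end(), rgx), end; it != end; ++it) {
        names.push_back((*it)[2].str());  // group 1 is the type, group 2 the name
    }
    return names;
}
```

Each iteration yields one complete match with groups at fixed indices, which gives the consistent per-declaration result the question asks for.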
"the above code generates two results instead of one."
Confusing, isn't it?
The regular expression
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
includes round brackets ().
The round brackets create a "group" aka "sub-expression".
Because the sub-expression is optional "(....)?",
the expression as a whole is allowed to match even if the sub-expression doesn't really match anything.
When the sub-expression doesn't match anything, the value of that sub-expression is an empty string.
See "Regular-expressions: Use Round Brackets for Grouping" for far more information on "capturing parentheses" and "non-capturing parentheses".
According to the documentation for regex_search,
match.size() is the number of subexpressions plus 1,
match[0] is the part of the source string that matches the complete regular expression.
match[1] is the part of the source string that matches the first sub-expression inside the regular expression.
match[n] is the part of the source string that matches the n'th sub-expression inside the regular expression.
A regular expression with only 1 sub-expression, as in the above example, will always return a match.size() of 2 -- one match for the complete regular expression, and one match for the sub-expression -- even when that sub-expression doesn't really match anything and is therefore the empty string.
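When the capturing group is kept, the matched member of std::sub_match can be used to tell "the group did not take part in the match" apart from "the group matched some text". A minimal sketch, with the hypothetical helper name array_suffix:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the array suffix such as "[4]" when the optional
// group took part in the match, or an empty string when it did not.
// `had_suffix` reports which of the two cases occurred.
std::string array_suffix(const std::string& decl, bool& had_suffix) {
    static const std::regex rgx(
        R"(\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;)");
    std::smatch match;
    had_suffix = false;
    if (std::regex_search(decl, match, rgx)) {
        // match.size() is 2 either way; match[1].matched tells the cases apart.
        had_suffix = match[1].matched;
        return match[1].str();
    }
    return std::string();
}
```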