c++ regex substring wrong pattern found - c++

I'm trying to understand the logic on the regex in c++
std::string s ("Ni Ni Ni NI");
std::regex e ("(Ni)");
std::smatch sm;
std::regex_search (s,sm,e);
std::cout << "string object with " << sm.size() << " matches\n";
This form shouldn't give me the number of substrings matching my pattern? Because it always give me 1 match and it says that the match is [Ni , Ni]; but i need it to find every single pattern; they should be 3 and like this [Ni][Ni][Ni]

The function std::regex_search only returns the results for the first match found in your string.
Here is a code, merged from yours and from cplusplus.com. The idea is to search for the first match, analyze it, and then start again using the rest of the string (that is to say, the sub-string that directly follows the match that was found, which can be retrieved thanks to match_results::suffix ).
Note that the regex has two capturing groups (Ni*) and ([^ ]*).
std::string s("the knights who say Niaaa and Niooo");
std::smatch m;
std::regex e("(Ni*)([^ ]*)");
while (std::regex_search(s, m, e))
{
for (auto x : m)
std::cout << x.str() << " ";
std::cout << std::endl;
s = m.suffix().str();
}
This gives the following output:
Niaaa Ni aaa
Niooo Ni ooo
As you can see, for every call to regex_search, we have the following information:
the content of the whole match,
the content of every capturing group.
Since we have two capturing groups, this gives us 3 strings for every regex_search.
EDIT: in your case if you want to retrieve every "Ni", all you need to do is to replace
std::regex e("(Ni*)([^ ]*)");
with
std::regex e("(Ni)");
You still need to iterate over your string, though.

Related

c++11 (MSVS2012) regex looking for file names in multiple line std::string

I have been trying to search for a clear answer on this one, but not been able to find it.
So lets say I have the string (where \n could be \r\n - I want to handle both - not sure if that is relevant or not)
"4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54"
Then I want to get matches:
a_file_123.xml
a_file_j34.xml
Here is my test code:
const str::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
std::smatch matches;
if (std::regex_search(s, matches, std::regex("a_file_(.*)\\.xml")))
{
std::cout << "total: " << matches.size() << std::endl;
for (unsigned int i = 0; i < matches.size(); i++)
{
std::cout << "match: " << matches[i] << std::endl;
}
}
Output is:
total: 2
match: a_file_123.xml
match: 123
I don't quite understand why match 2 is just "123"...
You only have one match, not two, as the regex_search method returns a single match. What you printed is two group values, Group 0 (the whole match, a_file_123.xml here) and Group 1 (the capturing group value, here, 123 that is a substring captured with a capturing group you defined as (.*) in the pattern).
If you want to match multiple strings, you need to use the regex iterator, not just a regex_search that only returns the first match.
Besides, .* is too greedy and will return weird results if you have more than 1 match on the same line. It seems you want to match letter or digits, so .* can be replaced with \w+. Well, if there can really be anything, just use .*?.
Use
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
const std::regex rx("a_file_\\w+\\.xml");
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx),
std::sregex_token_iterator());
std::cout << "Number of matches: " << results.size() << std::endl;
for (auto result : results)
{
std::cout << result << std::endl;
}
See the C++ demo yielding
Number of matches: 2
a_file_123.xml
a_file_j34.xml
Notes on regex
a_file_ - a literal substring
\\w+ - 1+ word chars (letters, digits, _) (note you may use [^.]*? here instead of \\w+ if you want to match any char, 0 or more repetitions, as few as possible, up to the first .xml)
\\. - a dot (if you do not escape it, it will match any char except line break chars)
xml - a literal substring.
See the regex demo

C++ regex: Get index of the Capture Group the SubMatch matched to

Context. I'm developing a Lexer/Tokenizing engine, which would use regex as a backend. The lexer accepts rules, which define the token types/IDs, e.g.
<identifier> = "\\b\\w+\\b".
As I envision, to do the regex match-based tokenizing, all of the rules defined by regexes are enclosed in capturing groups, and all groups are separated by ORs.
When the matching is being executed, every match we produce must have an index of the capturing group it was matched to. We use these IDs to map the matches to token types.
So the problem of this question arises - how to get the ID of the group?
Similar question here, but it does not provide the solution to my specific problem.
Exactly my problem here, but it's in JS, and I need a C/C++ solution.
So let's say I've got a regex, made up of capturing groups separated by an OR:
(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)
which matches the the whole numbers or alpha-words.
My problem requires that the index of the capture group the regex submatch matched to could be known, e.g. when matching the string
foo bar 123
3 iterations will be done. The group indexes of the matches of every iteration would be 0 0 1, because the first two matches matched the first capturing group, and the last match matched the second capturing group.
I know that in standard std::regex library it's not entirely possible (regex_token_iterator is not a solution, because I don't need to skip any matches).
I don't have much knowledge about boost::regex or PCRE regex library.
What is the best way to accomplish this task? Which is the library and method to use?
You may use the sregex_iterator to get all matches, and once there is a match you may analyze the std::match_results structure and only grab the ID-1 value of the group that participated in the match (note only one group here will match, either the first one, or the second), which can be conveniently checked with the m[index].matched:
std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
std::string s = "foo bar 123";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
for(auto index = 1; index < m.size(); ++index ){
if (m[index].matched) {
std::cout << "Capture group ID: " << index-1 << std::endl;
break;
}
}
}
See the C++ demo. Output:
Match value: foo at Position 0
Capture group ID: 0
Match value: bar at Position 4
Capture group ID: 0
Match value: 123 at Position 8
Capture group ID: 1
Note that R"(...)" is a raw string literal, no need to double backslashes inside it.
Also, index is set to 1 at the start of the for loop because the 0th group is the whole match, but you want group IDs to be zero-based, that is why 1 is subtracted later.

Need regex to look for two words while ignoring everything in-between

I've been trying to make regex find both a two digit number and the word thanks, but ignore everything in-between.
Here is my current implementation in C++, but I need the two patterns to be consolidated into one:
regex pattern1{R"(\d\d)"};
regex pattern2{R"(thanks)");
string to_search = "I would like the number 98 to be found and printed, thanks.";
smatch matches;
regex_search(to_search, matches, pattern1);
for (auto match : matches) {
cout << match << endl;
}
regex_search(to_search, matches, pattern2);
for (auto match : matches) {
cout << match << endl;
}
return 0;
Thanks!
EDIT: Is there any way to change ONLY the pattern and get rid of one of the for loops? Sorry for the confusion.

C++ Regex: non-greedy match

I'm currently trying to make a regex which matches URL parameters and extracts them.
For example, if I got the following parameters string ?param1=someValue&param2=someOtherValue, std::regex_match should extract the following contents:
param1
some_content
param2
some_other_content
After trying different regex patterns, I finally built one corresponding to what I want: std::regex("(?:[\\?&]([^=&]+)=([^=&]+))*").
If I take the previous example, std::regex_match matches as expected. However, it does not extract the expected values, keeping only the last captured values.
For example, the following code:
std::regex paramsRegex("(?:[\\?&]([^=&]+)=([^=&]+))*");
std::string arg = "?param1=someValue&param2=someOtherValue";
std::smatch sm;
std::regex_match(arg, sm, paramsRegex);
for (const auto &match : sm)
std::cout << match << std::endl;
will give the following output:
param2
someOtherValue
As you can see, param1 and its value are skipped and not captured.
After searching on google, I've found that this is due to greedy capture and I have modified my regex into "(?:[\\?&]([^=&]+)=([^=&]+))\\*?" in order to enable non-greedy capturing.
This regex works well when I try it on rubular but it does not match when I use it in C++ (std::regex_match returns false and nothing is captured).
I've tried different std::regex_constants options (different regex grammar by using std::regex_constants::grep, std::regex_constants::egrep, ...) but the result is the same.
Does someone know how to do non-greedy regex capture in C++?
As Casimir et Hippolyte explained in his comment, I just need to:
remove the quantifier
Use std::regex_iterator
It gives me the following code:
std::regex paramsRegex("[\\?&]([^=]+)=([^&]+)");
std::string url_params = "?key1=val1&key2=val2&key3=val3&key4=val4";
std::smatch sm;
auto params_it = std::sregex_iterator(url_params.cbegin(), url_params.cend(), paramsRegex);
auto params_end = std::sregex_iterator();
while (params_it != params_end) {
auto param = params_it->str();
std::regex_match(param, sm, paramsRegex);
for (const auto &s : sm)
std::cout << s << std::endl;
++params_it;
}
And here is the output:
?key1=val1
key1
val1
&key2=val2
key2
val2
&key3=val3
key3
val3
&key4=val4
key4
val4
The orignal regex (?:[\\?&]([^=&]+)=([^=&]+))* was just changed into [\\?&]([^=]+)=([^&]+).
Then, by using std::sregex_iterator, I get an iterator on each matching groups (?key1=val1, &key2=val2, ...).
Finally, by calling std::regex_match on each sub-string, I can retrieve parameters values.
Try to use match_results::prefix/suffix:
string match_expression("your expression");
smatch result;
regex fnd(match_expression, regex_constants::icase);
while (regex_search(in_str, result, fnd, std::regex_constants::match_any))
{
for (size_t i = 1; i < result.size(); i++)
{
std::cout << result[i].str();
}
in_str = result.suffix();
}

Regex grouping matches with C++ 11 regex library

I'm trying to use a regex for group matching. I want to extract two strings from one big string.
The input string looks something like this:
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
The Username can be anything. Same goes for the end part this is a message.
What I want to do is extract the Username that comes after the pound sign #. Not from any other place in the string, since that can vary aswell. I also want to get the message from the string that comes after the semicolon :.
I tried that with the following regex. But it never outputs any results.
regex rgx("WEBMSG #([a-zA-Z0-9]) :(.*?)");
smatch matches;
for(size_t i=0; i<matches.size(); ++i) {
cout << "MATCH: " << matches[i] << endl;
}
I'm not getting any matches. What is wrong with my regex?
Your regular expression is incorrect because neither capture group does what you want. The first is looking to match a single character from the set [a-zA-Z0-9] followed by <space>:, which works for single character usernames, but nothing else. The second capture group will always be empty because you're looking for zero or more characters, but also specifying the match should not be greedy, which means a zero character match is a valid result.
Fixing both of these your regex becomes
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
But simply instantiating a regex and a match_results object does not produce matches, you need to apply a regex algorithm. Since you only want to match part of the input string the appropriate algorithm to use in this case is regex_search.
std::regex_search(s, matches, rgx);
Putting it all together
std::string s{R"(
tХB:Username!Username#Username.tcc.domain.com Connected
tХB:Username!Username#Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username#Username.tcc.domain.com Status: visible
)"};
std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
std::smatch matches;
if(std::regex_search(s, matches, rgx)) {
std::cout << "Match found\n";
for (size_t i = 0; i < matches.size(); ++i) {
std::cout << i << ": '" << matches[i].str() << "'\n";
}
} else {
std::cout << "Match not found\n";
}
Live demo
"WEBMSG #([a-zA-Z0-9]) :(.*?)"
This regex will match only strings, which contain username of 1 character length and any message after semicolon, but second group will be always empty, because tries to find the less non-greedy match of any characters from 0 to unlimited.
This should work:
"WEBMSG #([a-zA-Z0-9]+) :(.*)"