detect new line using C++ boost regex_match [duplicate] - c++

I just started using Boost::regex today and am quite a novice in regular expressions too. I have been using "The Regulator" and Expresso to test my regex and am satisfied with what I see there, but the same regex does not seem to do what I want once I transfer it to Boost. Any pointers to help me find a solution would be most welcome. As a side question, are there any tools that would help me test my regex against Boost.Regex?
using namespace boost;
using namespace std;
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("\\d*");
vector<string> vs;
cmatch matches;
if( regex_match(s.c_str(), matches, re) ) {
MessageBox(NULL, L"Hmmm", L"", MB_OK); // it never gets here
for( unsigned int i = 1 ; i < matches.size() ; ++i ) {
string match(matches[i].first, matches[i].second);
vs.push_back(match);
}
}
return vs;
}
void _uttokenizer::test_to_vector_int()
{
vector<string> __vi = tokenizer::to_vector_int("0<br/>1");
for( int i = 0 ; i < __vi.size() ; ++i ) INFO(__vi[i]);
CPPUNIT_ASSERT_EQUAL(2, (int)__vi.size());//always fails
}
Update (Thanks to Dav for helping me clarify my question):
I was hoping to get a vector with 2 strings in them => "0" and "1". I instead never get a successful regex_match() (regex_match() always returns false) so the vector is always empty.
Thanks '1800 INFORMATION' for your suggestions. The to_vector_int() method now looks like this, but it goes into a never-ending loop (I took the code you gave and modified it to make it compile) and finds "0", "", "", "" and so on. It never finds the "1".
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("(\\d*)");
vector<string> vs;
cmatch matches;
char * loc = const_cast<char *>(s.c_str());
while( regex_search(loc, matches, re) ) {
vs.push_back(string(matches[0].first, matches[0].second));
loc = const_cast<char *>(matches.suffix().str().c_str());
}
return vs;
}
In all honesty, I don't think I have understood the basics of searching for a pattern and getting the matches yet. Are there any tutorials with examples that explain this?

The basic problem is that you are using regex_match when you should be using regex_search:
"The algorithms regex_search and regex_match make use of match_results to report what matched; the difference between these algorithms is that regex_match will only find matches that consume all of the input text, whereas regex_search will search for a match anywhere within the text being matched."
From the boost documentation. Change it to use regex_search and it will work.
Also, it looks like you are not capturing the matches. Try changing the regex to this:
regex re("(\\d*)");
Or, maybe you need to be calling regex_search repeatedly:
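const char *where = s.c_str();
while (regex_search(where, matches, re))
{
    // keep the captured digits
    vs.push_back(string(matches[1].first, matches[1].second));
    // resume right after this match; this only terminates if the pattern
    // cannot match the empty string, e.g. (\\d+) rather than (\\d*)
    where = matches.suffix().first;
}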
This is since you only have one capture in your regex.
Alternatively, change your regex, if you know the basic structure of the data:
regex re("(\\d+).*?(\\d+)");
This would match two numbers within the search string.
Note that the regular expression \d* will match zero or more digits - this includes the empty string "" since this is exactly zero digits. I would change the expression to \d+ which will match 1 or more.
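Putting these points together, here is a minimal sketch of the repeated-search loop. It makes two assumptions that are not in the original post: it uses std::regex from C++11 instead of Boost (the calls shown have the same shape in both), and to_vector_int is written as a free function rather than a tokenizer member. It also avoids two likely causes of the endless loop in the updated question: \d* can match the empty string, so the search position never advances, and matches.suffix().str().c_str() points into a temporary string that is destroyed at the end of the statement.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every run of one or more digits in s, in order of appearance.
std::vector<std::string> to_vector_int(const std::string& s) {
    static const std::regex re("(\\d+)");  // + instead of *: never matches ""
    std::vector<std::string> vs;
    std::smatch matches;
    std::string::const_iterator where = s.begin();
    while (std::regex_search(where, s.end(), matches, re)) {
        vs.push_back(matches[1].str());
        // suffix().first is an iterator into s itself, so it stays valid
        where = matches.suffix().first;
    }
    return vs;
}
```

Calling to_vector_int("0<br/>1") should then yield the two strings "0" and "1".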

Related

How to match multiple patterns with a regex in C++ 11

Suppose there is a string named path that needs to match multiple patterns. The regular expression string is as follows:
std::string regexString="(/api/Attachment)|(/api/Attachment/upload)|(/api/Attachment/download)|(/api/v1/ApiTest)|(/api/v1/ApiTest/get/[^/]*/[^/]*)|(/api/v1/ApiTest/[^/]*/List)";
The matching code is as follows:
std::regex pattern(regexString); // std::regex_match needs a compiled std::regex, not a std::string
std::smatch result;
if (std::regex_match(path, result, pattern))
{
for (size_t i = 1; i < result.size(); i++)
{
/// Question: Is there any better way to find the sub-match index without using a loop?
if (!result[i].matched)
continue;
if (result[i].str() == path)
{
std::cout<<"Match a pattern with index "<<i<<std::endl;
/// Do something with it;
break;
}
}
}
else
{
std::cout<<"Match none"<<std::endl;
}
The above program works, but with a large number of patterns the loop is a bit ugly and inefficient. As the comment in the code shows, my question is: is there a way to find the sub-match index without using a loop?
Any comments would be greatly appreciated, thank you!
Try just using a single alternation which covers all variations. In the pattern below, I also turn off capturing in the alternation. This leaves us with fairly straightforward matching logic. If the smatch result does have an entry, then it should be a single entry with the entire matching path. Otherwise, it should be empty.
std::string regexString="/api/(?:Attachment|Attachment/upload|Attachment/download|v1/ApiTest|v1/ApiTest/get/[^/]*/[^/]*|v1/ApiTest/[^/]*/List)";
std::string s ("/api/Attachment/upload");
std::regex e (regexString);
std::smatch sm;
std::regex_match (s,sm,e);
if (sm.size() > 0) {
std::cout << "found a matching path: " << sm[0];
}
found a matching path: /api/Attachment/upload
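If the index of the matching alternative is still needed, one option is to keep the capturing groups from the question and rely on the matched flag alone. As far as I know, std::smatch has no direct "which alternative matched" query, so this sketch still scans the groups, but it drops the string comparison: regex_match only succeeds when the whole path is consumed, so the one group that participated must span the entire path. The helper name matched_alternative is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the 1-based index of the capture group that matched the whole
// path, or 0 when no alternative matches.
std::size_t matched_alternative(const std::string& path, const std::regex& re) {
    std::smatch result;
    if (!std::regex_match(path, result, re))
        return 0;
    for (std::size_t i = 1; i < result.size(); ++i)
        if (result[i].matched)
            return i;  // regex_match consumed all of path, so this group spans it
    return 0;
}
```

This assumes each alternative is wrapped in exactly one top-level capturing group, as in the question's regexString; nested groups inside an alternative would shift the numbering.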

C++: Regex: returns full string and not matched group

For those asking: the {0} selects one block of the sResult string, where blocks are separated by |; 0 is the first block.
It needs to be dynamic for future expansion, as that number will be configurable by users.
I am working on a regex to extract one portion of a string. The regex matches, but the results returned are not what I expected.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
for( int i = 0; i < regMatch.size(); i++)
{
//SUBMATCH 0 = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE"
//SUBMATCH 1 = "BUT|NOT|ANYTHNG|ELSE"
std::ssub_match sm = regMatch[i];
bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
}
}
For some reason I cannot figure out the code to get me just the MATCH_ME back so I can compare it to expected results list on the C++ side.
Anyone have any ideas on where I went wrong here.
It seems you're using regular expressions for what they haven't been designed for. You should first split your string at the delimiter | and apply regular expressions on the resulting tokens if you want to check them for validity.
By the way: the std::regex implementation in libstdc++ seems to be buggy in GCC 4.8.1. I just did some tests and found that even simple patterns containing an escaped pipe character such as \\| failed to compile, throwing a std::regex_error with no further information in the error message. (libstdc++ did not ship a working <regex> until GCC 4.9.)
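The split-first approach suggested above could look roughly like this. It is a sketch under assumptions: the helper names split and field_at are hypothetical, the delimiter is a single character, and an out-of-range index simply yields an empty string.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits text at every occurrence of delim. Interior empty fields are kept;
// a trailing empty field (text ending in delim) is not reported by getline.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> fields;
    std::istringstream in(text);
    std::string field;
    while (std::getline(in, field, delim))
        fields.push_back(field);
    return fields;
}

// Returns the field at zero-based position index, or "" if out of range.
std::string field_at(const std::string& text, std::size_t index, char delim = '|') {
    const std::vector<std::string> fields = split(text, delim);
    return index < fields.size() ? fields[index] : std::string();
}
```

With the string from the question, field_at(sResult, 0) gives "MATCH_ME", and a regex can then be applied to that single token if it needs validating.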
The following code example shows how to do what you are after - you compile this, then call it with a single numerical argument to extract that element of the input:
#include <iostream>
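#include <cstdio>
#include <regex>
#include <string>
int main(int argc, char *argv[]) {
char pat[100];
if (argc > 1) {
// snprintf lives in <cstdio> and bounds the write, so a long argument cannot overflow pat
std::snprintf(pat, sizeof pat, "^(?:[^|]+[|]){%s}([^|;]+)", argv[1]);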
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern(pat);
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
std::ssub_match sm = regMatch[1];
std::cout << "The match is " << sm << std::endl;
//bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
}
}
return 0;
}
Creating an executable called match, you can then do
>> match 2
The match is NOT
which is what you wanted.
The regex, it turns out, works just fine - although as a matter of preference I would use \| instead of [|] for the first part.
It turns out the problem was on the C++ side, in how I extracted the match; it had to be done more directly. Below is the code that gets exactly what I wanted out of the string so I can use it later.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
std::string theMatchedPortion = regMatch[1];
// The issue was not with the regex but in how I was retrieving the results.
// theMatchedPortion now equals "MATCH_ME", and by changing the number associated
// with it I can navigate through the string.
}

How to check which matching group was used to match (boost-regex)

I'm using boost::regex to parse a formatting string in which the '%' symbol is the escape character. Because I do not have much experience with boost::regex, or with regex at all to be honest, I am working by trial and error. This code is a prototype that I came up with.
std::string regex_string =
"(?:%d\\{(.*)\\})|" //this group will catch string for formatting time
"(?:%([hHmMsSqQtTlLcCxXmMnNpP]))|" //symbols that have some meaning
"(?:\\{(.*?)\\})|" //some other groups
"(?:%(.*?)\\s)|"
"(?:([^%]*))";
boost::regex regex;
boost::smatch match;
try
{
regex.assign(regex_string, boost::regex_constants::icase);
boost::sregex_iterator res(pattern.begin(), pattern.end(), regex);
//pattern in line above is string which I'm parsing
boost::sregex_iterator end;
for(; res != end; ++res)
{
match = *res;
output << match.get_last_closed_paren();
//I want to know if the thing that was just written to output is from group describing time string
output << "\n";
}
}
catch(boost::regex_error &e)
{
output<<"regex error\n";
}
This works pretty well: the output contains exactly what I want to catch. But I do not know which group it came from. I could do something like match[index_of_time_group]!="" but that is fragile and doesn't look good. If I change regex_string, the index of the group that catches the time-format string could change too.
Is there a neat way to do this? Something like naming groups? I'll be grateful for any help.
You can use boost::sub_match::matched bool member:
if(match[index_of_time_group].matched) process_it(match);
It is also possible to use named groups in the regex, such as (?<name_of_group>.*); with that, the line above could be changed to:
if(match["name_of_group"].matched) process_it(match);
Dynamically build regex_string from pairs of name/pattern, and return a name->index mapping as well as the regex. Then write some code that determines if the match comes from a given name.
If you are insane, you can do it at compile time (the mapping from tag to index that is). It isn't worth it.
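The name-to-index idea can be sketched as follows. Everything here is illustrative: NamedAlternation, build and which are hypothetical names, std::regex stands in for boost::regex, and the sketch assumes each sub-pattern contains no capturing groups of its own (use (?:...) inside), so that alternative i is always capture group i + 1.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// A compiled alternation plus a label for each of its capture groups.
struct NamedAlternation {
    std::regex re;
    std::vector<std::string> names;  // names[i] labels capture group i + 1
};

// Wraps each (name, pattern) pair in one capturing group and joins them with |.
NamedAlternation build(const std::vector<std::pair<std::string, std::string>>& parts) {
    std::string pattern;
    std::vector<std::string> names;
    for (const auto& part : parts) {
        if (!pattern.empty())
            pattern += '|';
        pattern += '(' + part.second + ')';
        names.push_back(part.first);
    }
    return NamedAlternation{std::regex(pattern), names};
}

// Returns the name of the alternative that produced match m, or "" if none.
std::string which(const NamedAlternation& na, const std::smatch& m) {
    for (std::size_t i = 0; i < na.names.size(); ++i)
        if (m[i + 1].matched)
            return na.names[i];
    return "";
}
```

Because the names travel with the regex, reordering or inserting alternatives does not break the code that asks whether a match came from, say, the "time" group.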

Getting sub-match_results with boost::regex

Hey, let's say I have this regex: (test[0-9])+
And that I match it against: test1test2test3test0
const bool ret = boost::regex_search(input, what, r);
for (size_t i = 0; i < what.size(); ++i)
cout << i << ':' << string(what[i]) << "\n";
Now, what[1] will be test0 (the last occurrence). Let's say that I need to get test1, 2 and 3 as well: what should I do?
Note: the real regex is extremely more complex and has to remain one overall match, so changing the example regex to (test[0-9]) won't work.
I think .NET can collect every capture of a repeated group, so that (grp)+ yields a collection object for group 1. Boost's regex_search() behaves like any ordinary match function: you sit in a while() loop, matching the pattern from where the last match left off. The form you used does not take a bidirectional iterator pair, so the function won't start the next match where the last match left off.
You can use the iterator form:
(Edit - you can also use the token iterator, defining what groups to iterate over. Added in the code below).
#include <boost/regex.hpp>
#include <string>
#include <iostream>
using namespace std;
using namespace boost;
int main()
{
string input = "test1 ,, test2,, test3,, test0,,";
boost::regex r("(test[0-9])(?:$|[ ,]+)");
boost::smatch what;
std::string::const_iterator start = input.begin();
std::string::const_iterator end = input.end();
while (boost::regex_search(start, end, what, r))
{
string stest(what[1].first, what[1].second);
cout << stest << endl;
// Update the beginning of the range to the character
// following the whole match
start = what[0].second;
}
// Alternate method using token iterator
const int subs[] = {1}; // we just want to see group 1
boost::sregex_token_iterator i(input.begin(), input.end(), r, subs);
boost::sregex_token_iterator j;
while(i != j)
{
cout << *i++ << endl;
}
return 0;
}
Output:
test1
test2
test3
test0
Boost.Regex offers experimental support for exactly this feature (called repeated captures); however, since it carries a huge performance hit, the feature is disabled by default.
To enable repeated captures, you need to rebuild Boost.Regex and define the macro BOOST_REGEX_MATCH_EXTRA in all translation units; the best way to do this is to uncomment this define in boost/regex/user.hpp (see the reference, it's at the very bottom of the page).
Once compiled with this define, you can use this feature by calling regex_search, regex_match and regex_iterator with the match_extra flag.
Check the Boost.Regex reference for more info.
Seems to me like you need to create a regex_iterator, using the (test[0-9]) regex as input. Then you can use the resulting regex_iterator to enumerate the matching substrings of your original target.
If you still need "one overall match" then perhaps that work has to be decoupled from the task of finding matching substrings. Can you clarify that part of your requirement?
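A minimal sketch of the regex_iterator suggestion, assuming std::sregex_iterator (used here in place of boost::sregex_iterator, which offers the same interface for this purpose) and a hypothetical helper named all_tests. It enumerates each test[0-9] unit separately; whether the surrounding "one overall match" is validated first is left to the caller.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every non-overlapping occurrence of test[0-9] in input, left to right.
std::vector<std::string> all_tests(const std::string& input) {
    static const std::regex unit("test[0-9]");
    std::vector<std::string> out;
    const std::sregex_iterator end;  // a default-constructed iterator marks the end
    for (std::sregex_iterator it(input.begin(), input.end(), unit); it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

For the input test1test2test3test0 this should produce test1, test2, test3 and test0 in that order.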
