ATL regex to parse csv files - c++

Can someone tell me what is wrong with the code below? I am trying to parse CSV files with this program, but it returns zero in the m_uNumGroups field.
int _tmain(int argc, _TCHAR* argv[])
{
    CAtlRegExp<> reUrl;
    // Five match groups: scheme, authority, path, query, fragment
    REParseError status = reUrl.Parse(L"[^\",]+|(?:[ˆ\"])|\"\")");
    if (REPARSE_ERROR_OK != status)
    {
        // Unexpected error.
        return 0;
    }
    TCHAR testing[] = L"It’ s \" 10 Grand\" , baby";
    CAtlREMatchContext<> mcUrl;
    if (!reUrl.Match(testing, &mcUrl))
    {
        // Unexpected error.
        return 0;
    }
    for (UINT nGroupIndex = 0; nGroupIndex < mcUrl.m_uNumGroups; ++nGroupIndex)
    {
        const CAtlREMatchContext<>::RECHAR* szStart = 0;
        const CAtlREMatchContext<>::RECHAR* szEnd = 0;
        mcUrl.GetMatch(nGroupIndex, &szStart, &szEnd);
        ptrdiff_t nLength = szEnd - szStart;
        printf_s("%d: \"%.*s\"\n", nGroupIndex, nLength, szStart);
    }
    return 0;
}

With ATL regular expression syntax you need to put curly brackets around the expression you want to capture. Your expression does not have any, so you are only performing a match, without any sub-expressions.
Check this out: http://msdn.microsoft.com/en-us/library/k3zs4axe%28v=vs.80%29.aspx
{ }
Indicates a match group. The actual text in the input that matches the expression inside the braces can be retrieved through the CAtlREMatchContext object.

I don't know C++, but if you're trying to parse "It’ s \" 10 Grand\" , baby" into It’ s \" 10 Grand\" and baby, then this fails for several reasons:
because that string is not valid CSV syntax. In CSV, quotes within fields need to be escaped by doubling (yours aren't escaped at all, only at string level), and fields that contain quotes must be surrounded by quotes. A valid CSV string would be "\"It’ s \"\" 10 Grand\"\"\", baby".
because your regex is wrong. Parsing CSV with regexes is hard, if not impossible, because of all the gotchas involved. Search StackOverflow for csv regex and find out that you should use a CSV parser instead.
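As a rough illustration of the "use a CSV parser instead" advice, here is a minimal sketch of a hand-written splitter for a single CSV line. The function name split_csv_line is hypothetical (it is not from ATL or any library), and the sketch assumes comma separators, double-quote quoting with doubled quotes as the escape, and no line breaks inside fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of ATL or any library): splits one CSV line
// into fields. Inside a quoted field, "" stands for one literal quote.
std::vector<std::string> split_csv_line(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';     // doubled quote -> one literal quote
                    ++i;
                } else {
                    in_quotes = false;  // closing quote of the field
                }
            } else {
                current += c;           // commas are ordinary text here
            }
        } else if (c == '"') {
            in_quotes = true;           // opening quote of a quoted field
        } else if (c == ',') {
            fields.push_back(current);  // field separator
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);          // last field
    return fields;
}
```

Fed the valid CSV string from the answer above, this yields two fields; whitespace around the separator is kept as part of the field.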

Related

detect new line using C++ boost regex_match [duplicate]

I just started using Boost::regex today and am quite a novice in regular expressions too. I have been using "The Regulator" and Expresso to test my regex and am satisfied with what I see there, but transferring that regex to Boost does not seem to do what I want. Any pointers toward a solution would be most welcome. As a side question, are there any tools that would help me test my regex against boost.regex?
using namespace boost;
using namespace std;
vector<string> tokenizer::to_vector_int(const string s)
{
    regex re("\\d*");
    vector<string> vs;
    cmatch matches;
    if( regex_match(s.c_str(), matches, re) ) {
        MessageBox(NULL, L"Hmmm", L"", MB_OK); // it never gets here
        for( unsigned int i = 1 ; i < matches.size() ; ++i ) {
            string match(matches[i].first, matches[i].second);
            vs.push_back(match);
        }
    }
    return vs;
}
void _uttokenizer::test_to_vector_int()
{
    vector<string> __vi = tokenizer::to_vector_int("0<br/>1");
    for( int i = 0 ; i < __vi.size() ; ++i ) INFO(__vi[i]);
    CPPUNIT_ASSERT_EQUAL(2, (int)__vi.size());//always fails
}
Update (Thanks to Dav for helping me clarify my question):
I was hoping to get a vector with 2 strings in it => "0" and "1". Instead I never get a successful regex_match() (it always returns false), so the vector is always empty.
Thanks, '1800 INFORMATION', for your suggestions. The to_vector_int() method now looks like this, but it goes into a never-ending loop (I took the code you gave and modified it to make it compilable) and finds "0", "", "", "" and so on. It never finds the "1".
vector<string> tokenizer::to_vector_int(const string s)
{
    regex re("(\\d*)");
    vector<string> vs;
    cmatch matches;
    char * loc = const_cast<char *>(s.c_str());
    while( regex_search(loc, matches, re) ) {
        vs.push_back(string(matches[0].first, matches[0].second));
        loc = const_cast<char *>(matches.suffix().str().c_str());
    }
    return vs;
}
In all honesty, I still don't think I have understood the basics of searching for a pattern and getting the matches. Are there any tutorials with examples that explain this?
The basic problem is that you are using regex_match when you should be using regex_search:
The algorithms regex_search and regex_match make use of match_results to report what matched; the difference between these algorithms is that regex_match will only find matches that consume all of the input text, where as regex_search will search for a match anywhere within the text being matched.
From the boost documentation. Change it to use regex_search and it will work.
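The difference can be seen in a small sketch. It uses std::regex instead of Boost.Regex purely to stay dependency-free (an assumption on my part; the two functions are documented with the same whole-input versus anywhere semantics in both libraries), and the helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true only if the pattern consumes the entire input.
bool whole_input_matches(const std::string& s, const std::regex& re)
{
    return std::regex_match(s, re);
}

// Hypothetical helper: true if the pattern matches anywhere in the input.
bool some_part_matches(const std::string& s, const std::regex& re)
{
    return std::regex_search(s, re);
}
```

With the pattern \d+ and the input "0<br/>1", the first helper returns false because the input is not made up solely of digits, while the second returns true because a digit occurs somewhere in it.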
Also, it looks like you are not capturing the matches. Try changing the regex to this:
regex re("(\\d*)");
Or, maybe you need to be calling regex_search repeatedly:
const char *where = s.c_str();
while (regex_search(where, matches, re))
{
    // continue searching just past the previous match
    where = matches.suffix().first;
}
This is because you only have one capture in your regex.
Alternatively, change your regex, if you know the basic structure of the data:
regex re("(\\d+).*?(\\d+)");
This would match two numbers within the search string.
Note that the regular expression \d* will match zero or more digits - this includes the empty string "" since this is exactly zero digits. I would change the expression to \d+ which will match 1 or more.
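Putting these suggestions together, a minimal sketch might look like the following. It uses std::regex and std::sregex_iterator rather than Boost.Regex (my substitution, to keep the example self-contained), and the function name extract_digit_runs is hypothetical. The iterator takes care of advancing past each match, and \d+ avoids empty matches.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every run of one or more digits in s.
std::vector<std::string> extract_digit_runs(const std::string& s)
{
    static const std::regex re("\\d+");  // one or more digits, never empty
    std::vector<std::string> out;
    // sregex_iterator repeatedly searches, resuming after the previous match.
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

For the input "0<br/>1" from the question this returns the two strings "0" and "1".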

C++: Regex: returns full string and not matched group

For those asking: the {0} selects one of the blocks in the sResult string, which are separated by |; 0 is the first block.
It needs to be dynamic for future expansion, as that number will be configurable by users.
So I am working on a regex to extract one portion of a string; however, while it matches, the results returned are not what I expected.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
    for( int i = 0; i < regMatch.size(); i++)
    {
        //SUBMATCH 0 = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE"
        //SUBMATCH 1 = "BUT|NOT|ANYTHNG|ELSE"
        std::ssub_match sm = regMatch[i];
        bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
    }
}
For some reason I cannot figure out the code to get just MATCH_ME back so I can compare it to the expected results list on the C++ side.
Does anyone have any idea where I went wrong here?
It seems you're using regular expressions for what they haven't been designed for. You should first split your string at the delimiter | and apply regular expressions on the resulting tokens if you want to check them for validity.
By the way: The std::regex implementation in libstdc++ seems to be buggy. I just did some tests and found that even simple patterns containing escaped pipe characters like \\| failed to compile throwing a std::regex_error with no further information in the error message (GCC 4.8.1).
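A minimal sketch of the split-first approach, assuming the delimiter is always | and that the block index is zero-based as in the question. The function name nth_block is hypothetical, and an empty string is returned when the index is out of range (a design choice of this sketch, not something the question specifies).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: returns the index-th '|'-separated block of s,
// or an empty string if there are not that many blocks.
std::string nth_block(const std::string& s, std::size_t index)
{
    std::istringstream in(s);
    std::string block;
    std::size_t i = 0;
    // std::getline with a custom delimiter yields one block per call.
    while (std::getline(in, block, '|')) {
        if (i == index)
            return block;
        ++i;
    }
    return std::string();
}
```

The returned token can then be compared directly, or checked with a regular expression if its contents need validating.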
The following code example shows how to do what you are after - you compile this, then call it with a single numerical argument to extract that element of the input:
#include <cstdio>
#include <cstring>
#include <iostream>
#include <regex>
int main(int argc, char *argv[]) {
    char pat[100];
    if (argc > 1) {
        // build the pattern with the requested block index; snprintf bounds the write
        std::snprintf(pat, sizeof pat, "^(?:[^|]+[|]){%s}([^|;]+)", argv[1]);
        std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
        std::regex pattern(pat);
        std::smatch regMatch;
        std::regex_search(sResult, regMatch, pattern);
        if(regMatch[1].matched)
        {
            std::ssub_match sm = regMatch[1];
            std::cout << "The match is " << sm << std::endl;
            //bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
        }
    }
    return 0;
}
Creating an executable called match, you can then do
>> match 2
The match is NOT
which is what you wanted.
The regex, it turns out, works just fine - although as a matter of preference I would use \| instead of [|] for the first part.
It turns out the problem was on the C++ side, in extracting the match; it had to be done more directly. Below is the code that gets exactly what I wanted out of the string so I can use it later.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
    std::string theMatchedPortion = regMatch[1];
    //the issue was not with the regex but in how I was retrieving the results.
    //theMatchedPortion now equals "MATCH_ME" and by changing the number associated
    //with it I can navigate through the string
}

C++11 regex replace

I have an XML string that I wish to log. This XML contains some sensitive data that I'd like to mask before sending it to the log file. I am currently using std::regex to do this:
std::regex reg("<SensitiveData>(\\d*)</SensitiveData>");
return std::regex_replace(xml, reg, "<SensitiveData>......</SensitiveData>");
Currently the data is replaced by exactly 6 '.' characters; however, what I really want is to replace the sensitive data with the correct number of dots, i.e. I'd like to get the length of the capture group and put down exactly that number of dots.
Can this be done?
regex_replace of C++11 regular expressions does not have the capability you are asking for — the replacement format argument must be a string. Some regular expression APIs allow replacement to be a function that receives a match, and which could perform exactly the substitution you need.
But regexps are not the only way to solve the problem, and in C++ it's not hard to look for two fixed strings and replace the characters in between:
#include <cstring>
#include <string>

const char* const PREFIX = "<SensitiveData>";
const char* const SUFFIX = "</SensitiveData>";
void replace_sensitive(std::string& xml) {
    size_t start = 0;
    while (true) {
        size_t pref, suff;
        if ((pref = xml.find(PREFIX, start)) == std::string::npos)
            break;
        if ((suff = xml.find(SUFFIX, pref + strlen(PREFIX))) == std::string::npos)
            break;
        // replace stuff between prefix and suffix with '.'
        for (size_t i = pref + strlen(PREFIX); i < suff; i++)
            xml[i] = '.';
        start = suff + strlen(SUFFIX);
    }
}
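If you would rather keep the regular expression, one possible workaround is to walk the matches yourself with std::sregex_iterator and build the output by hand, since the match length is available on each sub-match. This is only a sketch: the function name mask_sensitive is hypothetical, and it assumes the same digits-only pattern as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns a copy of xml in which the digits inside each
// <SensitiveData> element are replaced by the same number of '.' characters.
std::string mask_sensitive(const std::string& xml)
{
    static const std::regex re("(<SensitiveData>)(\\d*)(</SensitiveData>)");
    std::string out;
    std::string::const_iterator last = xml.begin();
    for (std::sregex_iterator it(xml.begin(), xml.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);   // text before this match, unchanged
        out += m[1].str();              // opening tag
        out.append(static_cast<std::size_t>(m[2].length()), '.');  // one dot per digit
        out += m[3].str();              // closing tag
        last = m[0].second;             // resume copying after the match
    }
    out.append(last, xml.end());        // trailing text after the last match
    return out;
}
```

Unlike replace_sensitive above, this returns a new string instead of modifying its argument, and it only masks elements whose content is all digits.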

Tokenize with colon using std::tr1::regex

I'm working on a quasi-SCPI command parser and I want to split a string based on colons, ignoring quoted strings. I want to get an empty string if there is no text between colons.
If I use this regex expression in EditPad Pro 7.2.2, it does exactly what I want.
(([^:\"']|\"[^\"]*\"|'[^']*')+)?
As an example, using this data string:
:foo:::bar:baz
I get 6 hits: [empty],foo,[empty],[empty],bar,baz
So far, so good. However, in my code, using std::tr1::regex, I'm getting 9 hits with the same data string. It seems like I'm getting an extra empty hit after each non-empty hit.
void RICommandState::InitRawCommandEnum(const std::string& full_command)
{
    // Split string by colons, but ignore text within quotes.
    static const std::tr1::regex split_by_colon("(([^:\"']|\"[^\"]*\"|'[^']*')+)?");
    raw_command_list.clear();
    raw_command_index = 0;
    DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum FULL '%S'"), full_command.c_str()));
    const std::tr1::sregex_token_iterator end;
    for (std::tr1::sregex_token_iterator it(full_command.begin(),
                                            full_command.end(),
                                            split_by_colon);
         it != end;
         it++)
    {
        raw_command_list.push_back(*it);
        const std::string temp(*it);
        DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum '%S'"), temp.c_str()));
    }
    DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum hits = %d"), raw_command_list.size()));
}
And here is my output:
InitRawCommandEnum FULL ':foo:::bar:baz'
InitRawCommandEnum ''
InitRawCommandEnum 'foo'
InitRawCommandEnum ''
InitRawCommandEnum ''
InitRawCommandEnum ''
InitRawCommandEnum 'bar'
InitRawCommandEnum ''
InitRawCommandEnum 'baz'
InitRawCommandEnum ''
InitRawCommandEnum hits = 9
The most important question is how can I get my regex search to yield one (and only one) hit for every token delimited by a colon? Is the problem with my search expression?
Or maybe I'm misinterpreting the results? Do the empty strings after the non-empty strings have a special meaning? If so, what? And if that's the case, then is the correct solution to simply ignore them?
As a side question, I'm deeply curious why my code is behaving differently than EditPad Pro. EditPad is a useful test environment for experimenting with regular expressions, and it would be nice to know what the gotchas are.
Thanks!
It's still not clear to me what the empty strings mean, but I was able to work around them by ignoring them. I track the position of each hit within the search string and only process results that are farther along in the string.
Here's my code, without modification. Note that my regex search expression is slightly different, but that's not critical to the answer.
void RICommandState::InitRawCommandEnum(const std::string& full_command)
{
    // Split string by colons, but ignore text within quotes.
    static const std::tr1::regex split_by_colon("(?:[^:\"']|\"[^\"]*\"|'[^']*')*");
    raw_command_list.clear();
    raw_command_index = 0;
    std::tr1::sregex_iterator::difference_type minPosition = 0;
    const std::tr1::sregex_iterator end;
    for (std::tr1::sregex_iterator it(full_command.begin(),
                                      full_command.end(),
                                      split_by_colon);
         it != end;
         it++)
    {
        if (it->position() >= minPosition)
        {
            raw_command_list.push_back(it->str());
            minPosition = it->position() + it->length() + 1;
        }
    }
}
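A plausible explanation for the extra hits, offered tentatively, is that the pattern can match the empty string: after each non-empty token the iterator resumes at the following colon (or at the end of the input), where the pattern matches with zero length. For comparison, here is a sketch that sidesteps the regex altogether with a small hand-written scanner. This is a different technique from the one in the answer above, the function name split_by_colon_outside_quotes is hypothetical, and it assumes quotes are never escaped inside quoted text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits s on ':' characters that are not inside
// single- or double-quoted text. Empty tokens between colons are kept.
std::vector<std::string> split_by_colon_outside_quotes(const std::string& s)
{
    std::vector<std::string> tokens;
    std::string current;
    char quote = '\0';                  // active quote character, or '\0' if none
    for (char c : s) {
        if (quote != '\0') {
            current += c;               // inside quotes: colons are plain text
            if (c == quote)
                quote = '\0';           // matching closing quote
        } else if (c == '"' || c == '\'') {
            quote = c;                  // opening quote
            current += c;
        } else if (c == ':') {
            tokens.push_back(current);  // token boundary
            current.clear();
        } else {
            current += c;
        }
    }
    tokens.push_back(current);          // final token
    return tokens;
}
```

For the data string :foo:::bar:baz this produces the six tokens listed in the question: an empty string, foo, two empty strings, bar and baz.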
