Determining the location of C++11 regular expression matches - c++

How do I efficiently determine the location of a capture group inside a searched string? Getting the location of the entire match is easy, but I see no obvious ways to get at capture groups beyond the first.
This is a simplified example; let's presume "a*" and "b*" are complicated regexes that are expensive to run.
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex matcher("a*(needle)b*");
smatch findings;
string haystack("aaaaaaaaneedlebbbbbbbbbbbbbb");
if( regex_match(haystack, findings, matcher) )
{
// What do I put here to know the offset of "needle" in the
// string haystack?
// This is the position of the entire match, which is
// always 0 with regex_match (but not necessarily with regex_search).
cout << "smatch::position - " << findings.position() << endl;
// Is this just a string or what? Are there member functions
// That can be called?
cout << "Needle - " << findings[1] << endl;
}
return 0;
}
If it helps I built this question in Coliru: http://coliru.stacked-crooked.com/a/885a6b694d32d9b5

I will not mark this as an answer until 72 hours have passed and no better answers are present.
Before asking this I presumed smatch::position took no arguments I cared about, because when I read the cppreference page the "sub" parameter was not obviously an index into the container of matches. I thought it had something to do with "sub"strings and the offset value of the whole match.
So my answer is:
cout << "Needle Position- " << findings.position(1) << endl;
Any explanation on this design, or other issues my line of thinking may have caused would be appreciated.

According to the documentation, you can access the iterators pointing to the beginning and the end of the captured text via match[n].first and match[n].second. To get the start and end indices, just do iterator arithmetic with haystack.begin().
if (findings[1].matched) {
cout << "[" << findings[1].first - haystack.begin() << "-"
<< findings[1].second - haystack.begin() << "] "
<< findings[1] << endl;
}
Except for the main match (index 0), capturing groups may or may not capture anything. In such cases, first and second will point to the end of the string.
I also demonstrate the matched member of the sub_match object. While it's unnecessary in this case, in general, if you want to print out the indices of the capturing groups, you need to check first whether each capturing group actually matched anything.

Related

QRegexp Missing Digits

We are all stumped on this one:
QRegExp kcc_stationing("(-)?(\\d+)\\.(\\d+)[^a-zA-Z]");
QString str;
if (kcc_stationing.indexIn(description) > -1)
{
str = kcc_stationing.cap(1) + kcc_stationing.cap(2) + "." + kcc_stationing.cap(3);
qDebug() << kcc_stationing.cap(1);
qDebug() << kcc_stationing.cap(2);
qDebug() << kcc_stationing.cap(3);
qDebug() << "Description: " << description;
qDebug() << "Returned Stationing string: " << str;
}
Running this code on "1082.006":
Note the missing "6"
After some blind guessing, we removed [^a-zA-Z] and got the correct answer. We originally added it so that we would reject any number with other characters directly attached without spaces.
For example: 10.05D should be rejected.
Can anyone explain why this extra piece was causing us to lose that last "6"?
The [^a-zA-Z] is a character class. Character classes match one character. It will not match the end of a string, since there is no character there.
To get that result, the engine will match all the numbers with the \\d+, including the last one. It will then need to backtrack in order for the last character class to be satisfied.
I think you want to allow zero-width match (specifically when it's the end of the string). In your case, it would be easiest to use:
(-)?(\\d+)\\.(\\d+)([^a-zA-Z]|$)
Or, if Qt supports non-capturing groups:
(-)?(\\d+)\\.(\\d+)(?:[^a-zA-Z]|$)
Note that I also recommend using [.] instead of \\., since I feel it improves readability.

c++ Is there a way to find sentences within strings?

I'm trying to recognise certain phrases within a user defined string but so far have only been able to get a single word.
For example, if I have the sentence:
"What do you think of stack overflow?"
is there a way to search for "What do you" within the string?
I know you can retrieve a single word with the find function but when attempting to get all three it gets stuck and can only search for the first.
Is there a way to search for the whole string in another string?
Use str.find()
size_t find (const string& str, size_t pos = 0)
Its return value is the starting position of the substring. You can test whether the string you are looking for is contained in the main string by comparing the return value against std::string::npos:
string str = "What do you think of stack overflow?";
if (str.find("What do you") != string::npos) // is contained
The second argument can be used to limit your search from certain string position.
The question mentions that the search gets stuck when looking for a three-word string. I believe you are misinterpreting the return value: "What" and "What do you" both start at the same position in the sentence, so str.find() returns the same index for each. To find the positions of individual words, call find() once per word.
Use regular expressions
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("What do you think of stack overflow?");
std::smatch m;
std::regex e ("\\bWhat do you think\\b");
std::cout << "The following matches and submatches were found:" << std::endl;
while (std::regex_search (s,m,e)) {
for (auto x:m) std::cout << x << " ";
std::cout << std::endl;
s = m.suffix().str();
}
return 0;
}
You can also implement wildcard matching with Boost (the standard library's regex grew out of the Boost.Regex library, which was available before C++11).

C++11 Regex Find Capture Group Identifier

I've looked at a number of sources for C++11's new regex library, but most of them focus more on the syntax, or the more basic usage of things like regex_match, or regex_search. While these articles helped me get started using the regex library, I'm having a difficult time finding more details on capture groups.
What I'm trying to accomplish, is find out which capture group a match belongs to. So far, I've only found a single way to do this.
#include <iostream>
#include <string>
#include <regex>
int main(int argc, char** argv)
{
std::string input = "+12 -12 -13 90 qwerty";
std::regex pattern("([+-]?[[:digit:]]+)|([[:alpha:]]+)");
auto iter_begin = std::sregex_token_iterator(input.begin(), input.end(), pattern, 1);
auto iter_end = std::sregex_token_iterator();
for (auto it = iter_begin; it != iter_end; ++it)
{
std::ssub_match match = *it;
std::cout << "Match: " << match.str() << " [" << match.length() << "]" << std::endl;
}
std::cout << std::endl << "Done matching..." << std::endl;
std::string temp;
std::getline(std::cin, temp);
return 0;
}
In changing the value of the fourth argument of std::sregex_token_iterator, I can control which submatch it will keep, telling it to throw away the rest of them. Therefore, to find out which capture group a match belongs to, I can simply iterate through the capture groups to find out which matches are not thrown away for a particular group.
However, this would be undesirable for me, because unless there's some caching going on in the background I would expect each construction of std::sregex_token_iterator to pass over the input and find the matches again (someone please correct me if this is wrong, but this is the best conclusion I could come to).
Is there any better way of finding the capture group(s) a match belongs to? Or is iterating over the submatches the best course of action?
Use regex_iterator instead. You will have access to match_results for each match, which contains all the sub_matches, where you can check which of the capturing group the match belongs to.

Why does std::regex_iterator cause a stack overflow with this data?

I've been using std::regex_iterator to parse log files. My program has been working quite nicely for some weeks and has parsed millions of log lines, until today, when I ran it against a log file and got a stack overflow. It turned out that just one log line in the file was causing the problem. Does anyone know why my regex is causing such massive recursion? Here's a small self-contained program which shows the issue (my compiler is VC2012):
#include <string>
#include <regex>
#include <iostream>
using namespace std;
std::wstring test = L"L3 T15356 79726859 [CreateRegistryAction] Creating REGISTRY Action:\n"
L" Identity: 272A4FE2-A7EE-49B7-ABAF-7C57BEA0E081\n"
L" Description: Set Registry Value: \"SortOrder\" in Key HKEY_CURRENT_USER\\Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
L" Operation: 3\n"
L" Hive: HKEY_CURRENT_USER\n"
L" Key: Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
L" ValueName: SortOrder\n"
L" ValueType: REG_DWORD\n"
L" ValueData: 0\n"
L"L4 T15356 79726859 [CEMRegistryValueAction::ClearRevertData] [ENTER]\n";
int wmain(int argc, wchar_t* argv[])
{
static wregex rgx_log_lines(
L"^L(\\d+)\\s+" // Level
L"T(\\d+)\\s+" // TID
L"(\\d+)\\s+" // Timestamp
L"\\[((?:\\w|\\:)+)\\]" // Function name
L"((?:" // Complex pattern
L"(?!" // Stop matching when...
L"^L\\d" // New log statement at the beginning of a line
L")"
L"[^]" // Matching all until then
L")*)" //
);
try
{
for (std::wsregex_iterator it(test.begin(), test.end(), rgx_log_lines), end; it != end; ++it)
{
wcout << (*it)[1] << endl;
wcout << (*it)[2] << endl;
wcout << (*it)[3] << endl;
wcout << (*it)[4] << endl;
wcout << (*it)[5] << endl;
}
}
catch (std::exception& e)
{
cout << e.what() << endl;
}
return 0;
}
Negative lookahead patterns which are tested on every character just seem like a bad idea to me, and what you're trying to do is not complicated. You want to match (1) the rest of the line and then (2) any number of following (3) lines which start with something other than L\d (small bug; see below): (another edit: these are regexes; if you want to write them as string literals, you need to change \ to \\.)
.*\n(?:(?:[^L]|L\D).*\n)*
| | |
+-1 | +---------------3
+---------------------2
In ECMAScript mode, . should not match \n, but you could always replace the two .s in that expression with [^\n]
Edited to add: I realize that this may not work if there is a blank line just before the end of the log entry, but this should cover that case; I changed . to [^\n] for extra precision:
[^\n]*\n(?:(?:(?:[^L\n]|L\D)[^\n]*)?\n)*
The regex appears to be OK; at least there is nothing in it that could cause catastrophic backtracking.
I see a small possibility to optimize the regex, cutting down on stack use:
static wregex rgx_log_lines(
L"^L(\\d+)\\s+" // Level
L"T(\\d+)\\s+" // TID
L"(\\d+)\\s+" // Timestamp
L"\\[([\\w:]+)\\]" // Function name
L"((?:" // Complex pattern
L"(?!" // Stop matching when...
L"^L\\d" // New log statement at the beginning of a line
L")"
L"[^]" // Matching all until then
L")*)" //
);
Did you set the ECMAScript option? Otherwise, I suspect the regex library defaults to POSIX regexes, and those don't support lookahead assertions.

PCRECPP (pcre) extract hostname from url code problem

I have this simple piece of code in c++:
int main(void)
{
string text = "http://www.amazon.com";
string a,b,c,d,e,f;
pcrecpp::RE re("^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?#)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??([^#]+)?#?(\\w*)");
if(re.PartialMatch(text, &a,&b,&c,&d,&e,&f))
{
std::cout << "match: " << f << "\n";
// should print "www.amazon.com"
}else{
std::cout << "no match. \n";
}
return 0;
}
When I run this it doesn't find a match.
I'm pretty sure that the regex pattern is correct and my code is what's wrong.
If anyone familiar with pcrecpp can take a look at this, I'll be grateful.
EDIT:
Thanks to Dingo, it works great.
Another issue I had is that the result was in the sixth place - "f".
I edited the code above so you can copy/paste if you wish.
The problem is that your code contains ??( which is a trigraph in C++ for [. You'll either need to disable trigraphs or do something to break them up like:
pcrecpp::RE re("^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?#)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??" "([^#]+)?#?(\\w*)");
Please do
cout << re.pattern() << endl;
to double-check that all your double-slashing is done right (and also post the result).
Looks like
^((\w+):///?)?((\w+):?(\w+)?#)?([^/\?:]+):?(\d+)?(/?[^\?#;\|]+)?([;\|])?([^\?#]+)?\??([^#]+)?#?(\w*)
The hostname isn't going to be returned from the first capture group. Why are you putting parentheses around parts, such as \w+, that you don't want to capture?