I'm currently studying implementations of UNIX-style glob pattern matching. Generally, the fnmatch library is a good reference implementation for this functionality.
Looking at some of the implementations, as well as reading various blogs/tutorials about this, it seems that this algorithm is usually implemented recursively.
Generally, any algorithm that requires backtracking, that is, returning to an earlier state, lends itself nicely to a recursive solution. This includes things like tree traversal or parsing nested structures.
But I'm having trouble understanding why glob pattern matching in particular is so often implemented recursively. I get that backtracking is sometimes necessary. For example, given the string aabaabxbaab and the pattern a*baab, the characters after the * will match the first "baab" substring, like aa(baab)xbaab, and then fail to match the rest of the string. So the algorithm needs to backtrack so that the character match counter starts over and we can match the second occurrence of baab, like aabaabx(baab).
Okay, but generally recursion is used when we might require multiple nested levels of backtracking, and I don't see how that would be necessary in this case. It seems we'd only ever have to backtrack one level at a time, when the iterator over the pattern and the iterator over the input string fail to match. At this point, the iterator over the pattern would need to move back to the last "save point", which would either be the beginning of the string, or the last * that successfully matched something. This doesn't require a stack - just a single save point.
The only complication I can think of is in the event of an "overlapping" match. For example if we have the input string aabaabaab, and the pattern a*baab, we would fail to match because the "b" in the last baab could be part of either the first match or the second match. But it seems this could be solved by simply backtracking the input iterator if the distance between the last pattern iterator save point and the end of the pattern is greater than the distance between the input iterator position and the end of the input string.
So, as far as I'm seeing, it shouldn't be too much of an issue to implement this glob matching algorithm iteratively (assuming a very simple glob matcher, which only uses the * character to mean "match zero or more characters". Also, the matching strategy would be greedy.)
So, I assume I'm definitely wrong about this, because everyone else does this recursively - so I must be missing something. It's just that I can't see what I'm missing here. So I implemented a simple glob matcher in C++ (that only supports the * operator), to see if I could figure out what I'm missing. This is an extremely straightforward, simple iterative solution which just uses an inner loop to do the wildcard matching. It also records the indices which the * character matches in a vector of pairs:
bool match_pattern(const std::string& pattern, const std::string& input,
                   std::vector<std::pair<std::size_t, std::size_t>>& matches)
{
    const char wildcard = '*';
    auto pat = std::begin(pattern);
    auto pat_end = std::end(pattern);
    auto it = std::begin(input);
    auto end = std::end(input);
    while (it != end && pat != pat_end)
    {
        const char c = *pat;
        if (*it == c)
        {
            ++it;
            ++pat;
        }
        else if (c == wildcard)
        {
            matches.push_back(std::make_pair(std::distance(std::begin(input), it), 0));
            ++pat;
            if (pat == pat_end)
            {
                matches.back().second = input.size();
                return true;
            }
            auto save = pat;
            std::size_t matched_chars = 0;
            while (it != end && pat != pat_end)
            {
                if (*it == *pat)
                {
                    ++it;
                    ++pat;
                    ++matched_chars;
                    if (pat == pat_end && it != end)
                    {
                        pat = save;
                        matched_chars = 0;
                        // Check for an overlap and back up the input iterator if necessary
                        std::size_t d1 = std::distance(it, end);
                        std::size_t d2 = std::distance(pat, pat_end);
                        if (d2 > d1) it -= (d2 - d1);
                    }
                }
                else if (*pat == wildcard)
                {
                    break;
                }
                else
                {
                    if (pat == save) ++it;
                    pat = save;
                    matched_chars = 0;
                }
            }
            matches.back().second = std::distance(std::begin(input), it) - matched_chars;
        }
        else break;
    }
    return it == end && pat == pat_end;
}
Then I wrote a series of tests to see if I could find some pattern or input string that would require multiple levels of backtracking (and therefore recursion or a stack), but I can't seem to come up with anything.
Here is my test function:
void test(const std::string& input, const std::string& pattern)
{
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    bool result = match_pattern(pattern, input, matches);
    auto match_iter = matches.begin();
    std::cout << "INPUT: " << input << std::endl;
    std::cout << "PATTERN: " << pattern << std::endl;
    std::cout << "INDICES: ";
    for (auto& p : matches)
    {
        std::cout << "(" << p.first << "," << p.second << ") ";
    }
    std::cout << std::endl;
    if (result)
    {
        std::cout << "MATCH: ";
        for (std::size_t idx = 0; idx < input.size(); ++idx)
        {
            if (match_iter != matches.end())
            {
                if (idx == match_iter->first) std::cout << '(';
                else if (idx == match_iter->second)
                {
                    std::cout << ')';
                    ++match_iter;
                }
            }
            std::cout << input[idx];
        }
        if (!matches.empty() && matches.back().second == input.size()) std::cout << ")";
        std::cout << std::endl;
    }
    else
    {
        std::cout << "NO MATCH!" << std::endl;
    }
    std::cout << std::endl;
}
And my actual tests:
test("aabaabaab", "a*b*ab");
test("aabaabaab", "a*");
test("aabaabaab", "aa*");
test("aabaabaab", "aaba*");
test("/foo/bar/baz/xlig/fig/blig", "/foo/*/blig");
test("/foo/bar/baz/blig/fig/blig", "/foo/*/blig");
test("abcdd", "*d");
test("abcdd", "*d*");
test("aabaabqqbaab", "a*baab");
test("aabaabaab", "a*baab");
So this outputs:
INPUT: aabaabaab
PATTERN: a*b*ab
INDICES: (1,2) (3,7)
MATCH: a(a)b(aaba)ab
INPUT: aabaabaab
PATTERN: a*
INDICES: (1,9)
MATCH: a(abaabaab)
INPUT: aabaabaab
PATTERN: aa*
INDICES: (2,9)
MATCH: aa(baabaab)
INPUT: aabaabaab
PATTERN: aaba*
INDICES: (4,9)
MATCH: aaba(abaab)
INPUT: /foo/bar/baz/xlig/fig/blig
PATTERN: /foo/*/blig
INDICES: (5,21)
MATCH: /foo/(bar/baz/xlig/fig)/blig
INPUT: /foo/bar/baz/blig/fig/blig
PATTERN: /foo/*/blig
INDICES: (5,21)
MATCH: /foo/(bar/baz/blig/fig)/blig
INPUT: abcdd
PATTERN: *d
INDICES: (0,4)
MATCH: (abcd)d
INPUT: abcdd
PATTERN: *d*
INDICES: (0,3) (4,5)
MATCH: (abc)d(d)
INPUT: aabaabqqbaab
PATTERN: a*baab
INDICES: (1,8)
MATCH: a(abaabqq)baab
INPUT: aabaabaab
PATTERN: a*baab
INDICES: (1,5)
MATCH: a(abaa)baab
The parentheses that appear in the output after "MATCH: " show the substrings that were matched/consumed by each * character. So, this seems to work fine, and I can't seem to see why it would be necessary to backtrack multiple levels here - at least if we limit our pattern to only allow * characters, and we assume greedy matching.
So I assume I'm definitely wrong about this, and probably overlooking something simple - can someone help me to see why this algorithm might require multiple levels of backtracking (and therefore recursion or a stack)?
I didn't check all the details of your implementation, but it is certainly true that you can do the match without recursive backtracking.
You can actually do glob matching without backtracking at all by building a simple finite-state machine. You could translate the glob into a regular expression and build a DFA in the normal way, or you could use something very similar to the Aho-Corasick machine; if you tweaked your algorithm a little bit, you'd achieve the same result. (The key is that you don't actually need to back up the input scan; you just need to figure out the correct scan state, which can be precomputed.)
The classic fnmatch implementations are not optimized for speed; they're based on the assumption that patterns and target strings are short. That assumption is usually reasonable, but if you allow untrusted patterns, you're opening yourself up to a DoS attack. And based on that assumption, an algorithm similar to the one you present, which does not require precomputation, is probably faster in the vast majority of use cases than any algorithm which requires precomputing state transition tables while avoiding the exponential blowup with pathological patterns.
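To make the "no recursive backtracking" point concrete, here is a minimal sketch of an iterative matcher that keeps a single save point (the most recent * and the input position where its match was last retried). This is not the finite-state machine described above and it is not taken from any fnmatch implementation; the name glob_match is made up for illustration, and it assumes the only special character is *.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: matches `input` against `pattern`, where '*' means
// "zero or more characters". Uses one save point instead of recursion.
bool glob_match(const std::string& pattern, const std::string& input)
{
    std::size_t p = 0;                       // position in pattern
    std::size_t i = 0;                       // position in input
    std::size_t star = std::string::npos;    // index of the most recent '*'
    std::size_t mark = 0;                    // input position that '*' currently extends to

    while (i < input.size())
    {
        if (p < pattern.size() && pattern[p] == '*')
        {
            star = p++;                      // remember the '*' and try matching zero chars
            mark = i;
        }
        else if (p < pattern.size() && pattern[p] == input[i])
        {
            ++p;                             // literal match, advance both
            ++i;
        }
        else if (star != std::string::npos)
        {
            p = star + 1;                    // mismatch: let the last '*' absorb one more char
            i = ++mark;
        }
        else
        {
            return false;                    // mismatch with no '*' to fall back on
        }
    }
    // Any trailing '*' can match the empty remainder.
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}
```

Only the most recent * ever needs to be revisited: once a later * has been reached, lengthening an earlier one cannot turn a failure into a success, so a stack of save points is unnecessary. Note that this sketch still rescans input after a mismatch, so its worst case is proportional to the product of the two lengths rather than linear.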
Let's say I have a sorted vector of strings:
std::vector<std::string> Dictionary;
Dictionary.push_back("ant");
Dictionary.push_back("anti-matter");
Dictionary.push_back("matter");
Dictionary.push_back("mate");
Dictionary.push_back("animate");
Dictionary.push_back("animal");
std::sort(Dictionary.begin(), Dictionary.end());
I want to find the first word in the vector that matches a prefix, but every example I found uses a hard-coded string as the prefix. For example, I can define a boolean unary function for finding the "an" prefix:
bool find_prefix(std::string &S) {
    // compare() returns 0 on equality, so test against 0 explicitly
    return S.compare(0, 2, "an") == 0;
}
and use it as the predicate of the std::find_if() function to find an iterator to the first match. But how can I search for user given string as a prefix? Is it possible to use binary predicates in some way? Or build a "pseudo-unary" predicate that depends on a variable and a parameter?
Or, is there any other container and methods that I should use in this problem?
I know that there are much more efficient and elegant structures to store a dictionary for prefix search, but I'm a beginner self-learning programming, so first I'd like to learn how to use the standard containers before adventuring in more complex structures.
You can write find_prefix as a lambda. That lets you capture the string you want to search for, and use that for the comparison:
string word = ... // the prefix you're looking for
auto result = std::find_if(Dictionary.begin(), Dictionary.end(),
    [&word](string const &S) {
        return ! S.compare(0, word.length(), word);
    });
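The "pseudo-unary predicate that depends on a variable" from the question can also be written by hand as a function object, which is roughly what the lambda amounts to. This is only a sketch; the name HasPrefix is invented for illustration.

```cpp
#include <string>
#include <utility>

// Hypothetical function object: stores the prefix, then acts as a unary predicate.
struct HasPrefix
{
    explicit HasPrefix(std::string p) : prefix(std::move(p)) {}

    bool operator()(const std::string& s) const
    {
        // compare() returns 0 when the first prefix.size() characters equal prefix;
        // a string shorter than the prefix compares unequal.
        return s.compare(0, prefix.size(), prefix) == 0;
    }

    std::string prefix;
};
```

It could then be passed as std::find_if(Dictionary.begin(), Dictionary.end(), HasPrefix(word)).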
Since you are sorting the vector, you should take advantage of the fact that it is sorted.
Rather than doing a linear search for a match, you can use std::lower_bound to put you close to, if not right on the entry that matches the prefix:
#include <vector>
#include <string>
#include <iostream>
#include <algorithm>
int main()
{
    std::vector<std::string> Dictionary;
    Dictionary.push_back("ant");
    Dictionary.push_back("anti-matter");
    Dictionary.push_back("matter");
    Dictionary.push_back("mate");
    Dictionary.push_back("animate");
    Dictionary.push_back("animal");
    std::sort(Dictionary.begin(), Dictionary.end());
    std::vector<std::string> search_test = {"an", "b", "ma", "m", "x", "anti"};
    for (auto& s : search_test)
    {
        auto iter = std::lower_bound(Dictionary.begin(), Dictionary.end(), s);
        // lower_bound may return end() (e.g. for "x"), so check that before
        // dereferencing, then see if the item returned actually is a match
        if (iter != Dictionary.end() && iter->compare(0, s.size(), s) == 0)
            std::cout << "The string \"" << s << "\" has a match on \"" << *iter << "\"\n";
        else
            std::cout << "no match for \"" << s << "\"\n";
    }
}
Output:
The string "an" has a match on "animal"
no match for "b"
The string "ma" has a match on "mate"
The string "m" has a match on "mate"
no match for "x"
The string "anti" has a match on "anti-matter"
The test after the lower_bound is done to see if the string actually matches the one found by lower_bound.
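Because every word sharing a prefix is adjacent in a sorted vector, the same lower_bound call can be extended to collect all matches rather than just the first. The following is a sketch under that assumption (the input must already be sorted); the function name words_with_prefix is made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns every word in a sorted vector that starts with `prefix`.
std::vector<std::string> words_with_prefix(const std::vector<std::string>& sorted_words,
                                           const std::string& prefix)
{
    std::vector<std::string> result;
    // First element that is not less than the prefix; may be end().
    auto it = std::lower_bound(sorted_words.begin(), sorted_words.end(), prefix);
    // Matches are contiguous, so stop at the first word that lacks the prefix.
    for (; it != sorted_words.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        result.push_back(*it);
    return result;
}
```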
My problem is more or less self-explanatory, I want to write a regex to parse out numbers out of a string that user enters via console. I take the user input using:
getline(std::cin,stringName); //1 2 3 4 5
I assume that the user enters N numbers, each followed by a space except the last one.
I have solved this problem by analyzing string char by char like this:
std::string helper = "";
std::for_each(stringName.cbegin(), stringName.cend(), [&](char c)
{
    if (c == ' ')
    {
        intVector.push_back(std::stoi(helper));
        helper = "";
    }
    else
        helper += c;
});
intVector.push_back(std::stoi(helper));
I want to achieve the same behavior using a regex. I've written the following code:
std::regex rx1("([0-9]+ )");
std::sregex_iterator begin(stringName.begin(), stringName.end(), rx1);
std::sregex_iterator end;
while (begin != end)
{
    std::smatch sm = *begin;
    int number = std::stoi(sm.str(1));
    std::cout << number << " ";
}
Problem with this regex occurs when it gets to the last number since it doesn't have space behind it, therefore it enters an infinite loop. Can someone give me an idea on how to fix this?
You're going to get an endless loop there because you never increment begin. If you do that, you'll get all the numbers except the last one (which, as you say, is not followed by a space).
But I don't understand why you feel it necessary to include the whitespace in the regular expression. If you just match a string of digits, the regex will automatically select the longest possible match, so the following character (if any) cannot be a digit.
I also see no value in the capture in the regex. If you wanted to restrict the capture to the number itself, you would have used ([0-9]+). (But since stoi only converts until it finds a non-digit, it doesn't matter.)
So you just use this:
std::regex rx1("[0-9]+");
for (auto it = std::sregex_iterator{str.begin(), str.end(), rx1},
          end = std::sregex_iterator{};
     it != end;
     ++it) {
    std::cout << std::stoi(it->str(0)) << '\n';
}
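If the goal is to fill a vector as in the original character-by-character version, the same digits-only regex can be wrapped in a small function. This is a sketch; the name extract_numbers is invented here, and it assumes every run of digits fits in an int (std::stoi throws std::out_of_range otherwise).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every run of digits in `line` as an int.
std::vector<int> extract_numbers(const std::string& line)
{
    static const std::regex digits("[0-9]+");
    std::vector<int> numbers;
    for (auto it = std::sregex_iterator(line.begin(), line.end(), digits),
              end = std::sregex_iterator();
         it != end; ++it)                    // the increment is what ends the loop
    {
        numbers.push_back(std::stoi(it->str()));
    }
    return numbers;
}
```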
I've been trying to make regex find both a two digit number and the word thanks, but ignore everything in-between.
Here is my current implementation in C++, but I need the two patterns to be consolidated into one:
regex pattern1{R"(\d\d)"};
regex pattern2{R"(thanks)"};
string to_search = "I would like the number 98 to be found and printed, thanks.";
smatch matches;
regex_search(to_search, matches, pattern1);
for (auto match : matches) {
    cout << match << endl;
}
regex_search(to_search, matches, pattern2);
for (auto match : matches) {
    cout << match << endl;
}
return 0;
Thanks!
EDIT: Is there any way to change ONLY the pattern and get rid of one of the for loops? Sorry for the confusion.
I am using the boost/regex.hpp library. The regex is intended to match a floating point number or one of an arbitrary list of math operators. The trailing a is a place holder because the current code to construct the regex leaves a | at the end, and I haven't fixed it yet. My regex is:
(?:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)|(\s*sqrt\((.+?)\)\s*)|(\s*exp\((.+?)\)\s*)|(\^)|(\s*log2\((.+?)\)\s*)|(\s*log10\((.+?)\)\s*)|(\s*neg\((.+?)\)\s*)|(\s*floor\((.+?)\)\s*)|(\s*log\((.+?)\)\s*)|(\s*fact\((.+?)\)\s*)|(/)|([*])|([+])|([-])|a)
and my test string is:
4.5 + 9.6e8 + sqrt(5)
The resulting match is:
4.5 + 9.6e8 + sqrt(5) 5
I'm not sure why there are so many spaces between the captures.
The printing code is
boost::regex reg(token);
boost::smatch m;
string s = input;
while (boost::regex_search(s, m, reg)) {
    // m.size() is unsigned, so use an unsigned index
    for (std::size_t i = 1; i < m.size(); ++i) cout << m[i] << " ";
    s = m.suffix().str();
}
You have a lot of capturing parentheses and you are printing a space between each capture group. Many of your capture groups are empty. Maybe you want to refactor your regex to only capture what you really want.
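As a rough illustration of that refactoring, the sketch below drops the capture groups entirely and keeps only the whole match of each token. It uses std::regex rather than Boost so that it is self-contained, and the pattern is a deliberately shortened stand-in (number, sqrt(...), or a single operator), not your full expression; the name tokenize is made up.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns each token as the whole match, with no capture groups.
std::vector<std::string> tokenize(const std::string& input)
{
    // Shortened pattern for illustration only: a float, a sqrt(...) call, or one operator.
    static const std::regex token(
        R"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?|sqrt\(.+?\)|[-+*/^])");
    std::vector<std::string> tokens;
    for (auto it = std::sregex_iterator(input.begin(), input.end(), token),
              end = std::sregex_iterator();
         it != end; ++it)
    {
        tokens.push_back(it->str());         // str() with no argument is the full match
    }
    return tokens;
}
```

If you do want to keep the groups, an alternative is to print m[i] only when m[i].matched is true, which skips the groups belonging to alternatives that did not participate in the match.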
Why does the following boost regex not return the results I am looking for (starts with 0 or more whitespace characters followed by one or more asterisks)?
boost::regex tmpCommentRegex("(^\\s*)\\*+");
for (std::vector<std::string>::iterator vect_it =
         tmpInputStringLines.begin(); vect_it != tmpInputStringLines.end();
     ++vect_it) {
    boost::match_results<std::string::const_iterator> tmpMatch;
    if (boost::regex_match((*vect_it), tmpMatch, tmpCommentRegex,
                           boost::match_default) == 0) {
        std::cout << "Found comment " << (*vect_it) << std::endl;
    } else {
        std::cout << "No comment" << std::endl;
    }
}
On the following input:
* Script 7
[P]%OMO * change
[P]%QMS * change
[T]%OMO * change
[T]%QMM * change
[S]%G1 * Resume
[]
This should read
Found comment * Script 7
No comment
No comment
No comment
No comment
No comment
No comment
Quoting from the documentation for regex_match:
Note that the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search. If you want to match a prefix of the character string then use regex_search with the flag match_continuous set.
None of your input lines are matched by your regular expression as a whole, so the program works as expected. You should use regex_search to get the desired behavior.
Besides, regex_match and regex_search both return bool and not int, so testing for == 0 is wrong: the comparison is true exactly when there is no match, which inverts your two branches.
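Here is a minimal sketch of the search-based check. It uses std::regex instead of Boost so that it compiles on its own; regex_match and regex_search differ in the same way there. The function name is_comment_line is invented for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the line starts with optional whitespace then one or more '*'.
bool is_comment_line(const std::string& line)
{
    // The leading ^ anchors the search to the start of the line.
    static const std::regex comment(R"(^\s*\*+)");
    // regex_search succeeds on a partial match; regex_match would need the whole line to match.
    return std::regex_search(line, comment);
}
```

The bool result is used directly as the condition, with no comparison against 0.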