Comparing a string with a regEx wildcard value - c++

So I need to check a string (url) against a list of reg ex wildcard values to see if there is a match. I will be intercepting an HTTP request and checking it against a list of pre-configured values and if there is a match, do something to the URL. Examples:
Request URL: http://www.stackoverflow.com
Wildcards: *.stackoverflow.com/
*.stack*.com/
www.stackoverflow.*
Are there any good libraries for C++ for doing this? Any good examples would be great. Pseudo-code that I have looks something like:
std::string requestUrl = "http://www.stackoverflow.com";
std::vector<string> urlWildcards = ...;
BOOST_FOREACH(string wildcard, urlWildcards) {
if (requestUrl matches wildcard) {
// Do something
} else {
// Do nothing
}
}
Thanks a lot.

The following code example uses regular expressions to look for exact substring matches. The search is performed by the static IsMatch method, which takes two strings as input. The first is the string to be searched, and the second is the pattern to be searched for.
#using <System.dll>
using namespace System;
using namespace System::Text::RegularExpressions;
int main()
{
array<String^>^ sentence =
{
"cow over the moon",
"Betsy the Cow",
"cowering in the corner",
"no match here"
};
String^ matchStr = "cow";
for (int i=0; i<sentence->Length; i++)
{
Console::Write( "{0,24}", sentence[i] );
if ( Regex::IsMatch( sentence[i], matchStr,
RegexOptions::IgnoreCase ) )
Console::WriteLine(" (match for '{0}' found)", matchStr);
else
Console::WriteLine("");
}
return 0;
}
}
Code from MSDN (http://msdn.microsoft.com/en-us/library/zcwwszd7(v=vs.80).aspx).

If you use VS 2010, consider use the regex introduced by c++ tr1.
Refer to following page for more details.
http://www.johndcook.com/cpp_regex.html

Related

Extract all allowed characters from a regular expression

I need to extract a list of all allowed characters from a given regular expression.
So for example, if the regex looks like this (some random example):
[A-Z]*\s+(4|5)+
the output should be
ABCDEFGHIJKLMNOPQRSTUVWXYZ45
(omitting the whitespace)
One obvious solution would be to define a complete set of allowed characters, and use a find method, to return the corresponding subsequence for each character. This seems to be a bit of a dull solution though.
Can anyone think of a (possibly simple) algorithm on how to implement this?
One thing you can do is:
split the regex by subgroup
test the char panel against the subgroup
See the following example (not perfect yet) c#:
static void Main(String[] args)
{
Console.WriteLine($"-->{TestRegex(#"[A-Z]*\s+(4|5)+")}<--");
}
public static string TestRegex(string pattern)
{
string result = "";
foreach (var subPattern in Regex.Split(pattern, #"[*+]"))
{
if(string.IsNullOrWhiteSpace(subPattern))
continue;
result += GetAllCharCoveredByRegex(subPattern);
}
return result;
}
public static string GetAllCharCoveredByRegex(string pattern)
{
Console.WriteLine($"Testing {pattern}");
var regex = new Regex(pattern);
var matches = new List<char>();
for (var c = char.MinValue; c < char.MaxValue; c++)
{
if (regex.IsMatch(c.ToString()))
{
matches.Add(c);
}
}
return string.Join("", matches);
}
Which outputs:
Testing [A-Z]
Testing \s
Testing (4|5)
-->ABCDEFGHIJKLMNOPQRSTUVWXYZ
? ? ???????? 45<--

detect new line using C++ boost regex_match [duplicate]

I just started using Boost::regex today and am quite a novice in Regular Expressions too. I have been using "The Regulator" and Expresso to test my regex and seem satisfied with what I see there, but transferring that regex to boost, does not seem to do what I want it to do. Any pointers to help me a solution would be most welcome. As a side question are there any tools that would help me test my regex against boost.regex?
using namespace boost;
using namespace std;
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("\\d*");
vector<string> vs;
cmatch matches;
if( regex_match(s.c_str(), matches, re) ) {
MessageBox(NULL, L"Hmmm", L"", MB_OK); // it never gets here
for( unsigned int i = 1 ; i < matches.size() ; ++i ) {
string match(matches[i].first, matches[i].second);
vs.push_back(match);
}
}
return vs;
}
void _uttokenizer::test_to_vector_int()
{
vector<string> __vi = tokenizer::to_vector_int("0<br/>1");
for( int i = 0 ; i < __vi.size() ; ++i ) INFO(__vi[i]);
CPPUNIT_ASSERT_EQUAL(2, (int)__vi.size());//always fails
}
Update (Thanks to Dav for helping me clarify my question):
I was hoping to get a vector with 2 strings in them => "0" and "1". I instead never get a successful regex_match() (regex_match() always returns false) so the vector is always empty.
Thanks '1800 INFORMATION' for your suggestions. The to_vector_int() method now looks like this, but it goes into a never ending loop (I took the code you gave and modified it to make it compilable) and find "0","","","" and so on. It never find the "1".
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("(\\d*)");
vector<string> vs;
cmatch matches;
char * loc = const_cast<char *>(s.c_str());
while( regex_search(loc, matches, re) ) {
vs.push_back(string(matches[0].first, matches[0].second));
loc = const_cast<char *>(matches.suffix().str().c_str());
}
return vs;
}
In all honesty I don't think I have still understood the basics of searching for a pattern and getting the matches. Are there any tutorials with examples that explains this?
The basic problem is that you are using regex_match when you should be using regex_search:
The algorithms regex_search and
regex_match make use of match_results
to report what matched; the difference
between these algorithms is that
regex_match will only find matches
that consume all of the input text,
where as regex_search will search for
a match anywhere within the text being
matched.
From the boost documentation. Change it to use regex_search and it will work.
Also, it looks like you are not capturing the matches. Try changing the regex to this:
regex re("(\\d*)");
Or, maybe you need to be calling regex_search repeatedly:
char *where = s.c_str();
while (regex_search(s.c_str(), matches, re))
{
where = m.suffix().first;
}
This is since you only have one capture in your regex.
Alternatively, change your regex, if you know the basic structure of the data:
regex re("(\\d+).*?(\\d+)");
This would match two numbers within the search string.
Note that the regular expression \d* will match zero or more digits - this includes the empty string "" since this is exactly zero digits. I would change the expression to \d+ which will match 1 or more.

C++: Regex: returns full string and not matched group

for those asking, the {0} allows selection of any one block within the sResult string separated by the | 0 is the first block
it needs to be dynamic for future expansion as that number will be configurable by users
So I am working on a regex to extract 1 portion of a string, however while it matches the results return are not what is expected.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
for( int i = 0; i < regMatch.size(); i++)
{
//SUBMATCH 0 = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE"
//SUBMATCH 1 = "BUT|NOT|ANYTHNG|ELSE"
std::ssub_match sm = regMatch[i];
bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
}
}
For some reason I cannot figure out the code to get me just the MATCH_ME back so I can compare it to expected results list on the C++ side.
Anyone have any ideas on where I went wrong here.
It seems you're using regular expressions for what they haven't been designed for. You should first split your string at the delimiter | and apply regular expressions on the resulting tokens if you want to check them for validity.
By the way: The std::regex implementation in libstdc++ seems to be buggy. I just did some tests and found that even simple patterns containing escaped pipe characters like \\| failed to compile throwing a std::regex_error with no further information in the error message (GCC 4.8.1).
The following code example shows how to do what you are after - you compile this, then call it with a single numerical argument to extract that element of the input:
#include <iostream>
#include <cstring>
#include <regex>
int main(int argc, char *argv[]) {
char pat[100];
if (argc > 1) {
sprintf(pat, "^(?:[^|]+[|]){%s}([^|;]+)", argv[1]);
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern(pat);
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
std::ssub_match sm = regMatch[1];
std::cout << "The match is " << sm << std::endl;
//bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
}
}
return 0;
}
Creating an executable called match, you can then do
>> match 2
The match is NOT
which is what you wanted.
The regex, it turns out, works just fine - although as a matter of preference I would use \| instead of [|] for the first part.
Turns out the problem was on the C side in extracting the match, it had to be done more directly, below is the code that gets me exactly what I wanted out of the string so I can use it later.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
std::string theMatchedPortion = regMatch[1];
//the issue was not with the regex but in how I was retrieving the results.
//theMatchedPortion now equals "MATCH_ME" and by changing the number associated
with it I can navigate through the string
}

Boost regex not working as expected in my code

I just started using Boost::regex today and am quite a novice in Regular Expressions too. I have been using "The Regulator" and Expresso to test my regex and seem satisfied with what I see there, but transferring that regex to boost, does not seem to do what I want it to do. Any pointers to help me a solution would be most welcome. As a side question are there any tools that would help me test my regex against boost.regex?
using namespace boost;
using namespace std;
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("\\d*");
vector<string> vs;
cmatch matches;
if( regex_match(s.c_str(), matches, re) ) {
MessageBox(NULL, L"Hmmm", L"", MB_OK); // it never gets here
for( unsigned int i = 1 ; i < matches.size() ; ++i ) {
string match(matches[i].first, matches[i].second);
vs.push_back(match);
}
}
return vs;
}
void _uttokenizer::test_to_vector_int()
{
vector<string> __vi = tokenizer::to_vector_int("0<br/>1");
for( int i = 0 ; i < __vi.size() ; ++i ) INFO(__vi[i]);
CPPUNIT_ASSERT_EQUAL(2, (int)__vi.size());//always fails
}
Update (Thanks to Dav for helping me clarify my question):
I was hoping to get a vector with 2 strings in them => "0" and "1". I instead never get a successful regex_match() (regex_match() always returns false) so the vector is always empty.
Thanks '1800 INFORMATION' for your suggestions. The to_vector_int() method now looks like this, but it goes into a never ending loop (I took the code you gave and modified it to make it compilable) and find "0","","","" and so on. It never find the "1".
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("(\\d*)");
vector<string> vs;
cmatch matches;
char * loc = const_cast<char *>(s.c_str());
while( regex_search(loc, matches, re) ) {
vs.push_back(string(matches[0].first, matches[0].second));
loc = const_cast<char *>(matches.suffix().str().c_str());
}
return vs;
}
In all honesty I don't think I have still understood the basics of searching for a pattern and getting the matches. Are there any tutorials with examples that explains this?
The basic problem is that you are using regex_match when you should be using regex_search:
The algorithms regex_search and
regex_match make use of match_results
to report what matched; the difference
between these algorithms is that
regex_match will only find matches
that consume all of the input text,
where as regex_search will search for
a match anywhere within the text being
matched.
From the boost documentation. Change it to use regex_search and it will work.
Also, it looks like you are not capturing the matches. Try changing the regex to this:
regex re("(\\d*)");
Or, maybe you need to be calling regex_search repeatedly:
char *where = s.c_str();
while (regex_search(s.c_str(), matches, re))
{
where = m.suffix().first;
}
This is since you only have one capture in your regex.
Alternatively, change your regex, if you know the basic structure of the data:
regex re("(\\d+).*?(\\d+)");
This would match two numbers within the search string.
Note that the regular expression \d* will match zero or more digits - this includes the empty string "" since this is exactly zero digits. I would change the expression to \d+ which will match 1 or more.

Regex Rejecting matches because of Instr

What's the easiest way to do an "instring" type function with a regex? For example, how could I reject a whole string because of the presence of a single character such as :? For example:
this - okay
there:is - not okay because of :
More practically, how can I match the following string:
//foo/bar/baz[1]/ns:foo2/#attr/text()
For any node test on the xpath that doesn't include a namespace?
(/)?(/)([^:/]+)
Will match the node tests but includes the namespace prefix which makes it faulty.
I'm still not sure whether you just wanted to detect if the Xpath contains a namespace, or whether you want to remove the references to the namespace. So here's some sample code (in C#) that does both.
class Program
{
static void Main(string[] args)
{
string withNamespace = #"//foo/ns2:bar/baz[1]/ns:foo2/#attr/text()";
string withoutNamespace = #"//foo/bar/baz[1]/foo2/#attr/text()";
ShowStuff(withNamespace);
ShowStuff(withoutNamespace);
}
static void ShowStuff(string input)
{
Console.WriteLine("'{0}' does {1}contain namespaces", input, ContainsNamespace(input) ? "" : "not ");
Console.WriteLine("'{0}' without namespaces is '{1}'", input, StripNamespaces(input));
}
static bool ContainsNamespace(string input)
{
// a namspace must start with a character, but can have characters and numbers
// from that point on.
return Regex.IsMatch(input, #"/?\w[\w\d]+:\w[\w\d]+/?");
}
static string StripNamespaces(string input)
{
return Regex.Replace(input, #"(/?)\w[\w\d]+:(\w[\w\d]+)(/?)", "$1$2$3");
}
}
Hope that helps! Good luck.
Match on :? I think the question isn't clear enough, because the answer is so obvious:
if(Regex.Match(":", input)) // reject
You might want \w which is a "word" character. From javadocs, it is defined as [a-zA-Z_0-9], so if you don't want underscores either, that may not work....
I dont know regex syntax very well but could you not do:
[any alpha numeric]\*:[any alphanumeric]\*
I think something like that should work no?
Yeah, my question was not very clear. Here's a solution but rather than a single pass with a regex, I use a split and perform iteration. It works as well but isn't as elegant:
string xpath = "//foo/bar/baz[1]/ns:foo2/#attr/text()";
string[] nodetests = xpath.Split( new char[] { '/' } );
for (int i = 0; i < nodetests.Length; i++)
{
if (nodetests[i].Length > 0 && Regex.IsMatch( nodetests[i], #"^(\w|\[|\])+$" ))
{
// does not have a ":", we can manipulate it.
}
}
xpath = String.Join( "/", nodetests );