How to match string with wildcard using C++11 regex - c++

This is alist of static strings, it only uses wildcard at begin or end of the string. No any other regex rules.
AAAA, BBBB*, *CCCC, *DDDD* .
I need to find a given string match any of the string in this list. I'm looking for something like this.
bool isMatch(std::string str)
{
std::vector<string> my_list = {AAAA, BBBB*, *CCCC, *DDDD*};
if(str.matchAny(my_list))
return true;
return false;
}
I don't like to uses any 3rd parties like boost. Thinking this can be achieve by C++11 std::regex? Or is there any other simple way?

A regular expression would be overkill here. Just look for each of the character sequences in the appropriate place:
str == "AAAA"
str.find("BBBB") == 0
str.find("CCCC") == str.size() - 4
str.find("DDDD") != std::string::npos

Here's how I've usually done it, I replace "\\*" with ".*" and "\\?" with ".".
Here's the C++ code for it.
#include <iostream>
#include <regex>
using namespace std;
int main()
{
regex star_replace("\\*");
regex questionmark_replace("\\?");
string data = "AAAABBBCCDDDD";
string pattern = "*CC*";
auto wildcard_pattern = regex_replace(
regex_replace(pattern, star_replace, ".*"),
questionmark_replace, ".");
cout << "Wildcard: " << pattern << " Regex: " << wildcard_pattern << endl;
regex wildcard_regex("^" + wildcard_pattern + "$");
if (regex_match(data, wildcard_regex))
cout << "Match!" << endl;
else
cout << "No match!" << endl;
return 0;
}
Here's a link to runnable code on onlinegdb

Related

Regex character class subtraction in C++

I'm writing a C++ program that will need to take regular expressions that are defined in a XML Schema file and use them to validate XML data. The problem is, the flavor of regular expressions used by XML Schemas does not seem to be directly supported in C++.
For example, there are a couple special character classes \i and \c that are not defined by default and also the XML Schema regex language supports something called "character class subtraction" that does not seem to be supported in C++.
Allowing the use of the \i and \c special character classes is pretty simple, I can just look for "\i" or "\c" in the regular expression and replace them with their expanded versions, but getting character class subtraction to work is a much more daunting problem...
For example, this regular expression that is valid in an XML Schema definition throws an exception in C++ saying it has unbalanced square brackets.
#include <iostream>
#include <regex>
int main()
{
try
{
// Match any lowercase letter that is not a vowel
std::regex rx("[a-z-[aeiuo]]");
}
catch (const std::regex_error& ex)
{
std::cout << ex.what() << std::endl;
}
}
How can I get C++ to recognize character class subtraction within a regex? Or even better, is there a way to just use the XML Schema flavor of regular expressions directly within C++?
Character ranges subtraction or intersection is not available in any of the grammars supported by std::regex, so you will have to rewrite the expression into one of the supported ones.
The easiest way is to perform the subtraction yourself and pass the set to std::regex, for instance [bcdfghjklvmnpqrstvwxyz] for your example.
Another solution is to find either a more featureful regular expression engine or a dedicated XML library that supports XML Schema and its regular expression language.
Starting from the cppreference examples
#include <iostream>
#include <regex>
void show_matches(const std::string& in, const std::string& re)
{
std::smatch m;
std::regex_search(in, m, std::regex(re));
if(m.empty()) {
std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
} else {
std::cout << "input=[" << in << "], regex=[" << re << "]: ";
std::cout << "prefix=[" << m.prefix() << "] ";
for(std::size_t n = 0; n < m.size(); ++n)
std::cout << " m[" << n << "]=[" << m[n] << "] ";
std::cout << "suffix=[" << m.suffix() << "]\n";
}
}
int main()
{
// greedy match, repeats [a-z] 4 times
show_matches("abcdefghi", "(?:(?![aeiou])[a-z]){2,4}");
}
You can test and check the the details of the regular expression here.
The choice of using a non capturing group (?: ...) is to prevent it from changing your groups in case you will use it in a bigger regular expression.
(?![aeiou]) will match without consuming the input if finds a character not matching [aeiou], the [a-z] will match letters. Combining these two condition is equivalent to your character class subtraction.
The {2,4} is a quantifier that says from 2 to 4, could also be + for one or more, * for zero or more.
Edit
Reading the comments in the other answer I understand that you want to support XMLSchema.
The next program shows how to use ECMA regular expression to translate the "character class differences" to a ECMA compatible format.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
std::string translated_regex(const std::string &pattern){
// pattern to identify character class subtraction
std::regex class_subtraction_re(
"\\[((?:\\\\[\\[\\]]|[^[\\]])*)-\\[((?:\\\\[\\[\\]]|[^[\\]])*)\\]\\]"
);
// translate the regular expression to ECMA compatible
std::string translated = std::regex_replace(pattern,
class_subtraction_re, "(?:(?![$2])[$1])");
return translated;
}
void show_matches(const std::string& in, const std::string& re)
{
std::smatch m;
std::regex_search(in, m, std::regex(re));
if(m.empty()) {
std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
} else {
std::cout << "input=[" << in << "], regex=[" << re << "]: ";
std::cout << "prefix=[" << m.prefix() << "] ";
for(std::size_t n = 0; n < m.size(); ++n)
std::cout << " m[" << n << "]=[" << m[n] << "] ";
std::cout << "suffix=[" << m.suffix() << "]\n";
}
}
int main()
{
std::vector<std::string> tests = {
"Some text [0-9-[4]] suffix",
"([abcde-[ae]])",
"[a-z-[aei]]|[A-Z-[OU]] "
};
std::string re = translated_regex("[a-z-[aeiou]]{2,4}");
show_matches("abcdefghi", re);
for(std::string test : tests){
std::cout << " " << test << '\n'
<< " -- " << translated_regex(test) << '\n';
}
return 0;
}
Edit: Recursive and Named character classes
The above approach does not work with recursive character class negation. And there is no way to deal with recursive substitutions using only regular expressions. This rendered the solution far less straight forward.
The solution has the following levels
one function scans the regular expression for a [
when a [ is found there is a function to handle the character classes recursively when '-[` is found.
The pattern \p{xxxxx} is handled separately to identify named character patterns. The named classes are defined in the specialCharClass map, I fill two examples.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <map>
std::map<std::string, std::string> specialCharClass = {
{"IsDigit", "0-9"},
{"IsBasicLatin", "a-zA-Z"}
// Feel free to add the character classes you want
};
const std::string getCharClassByName(const std::string &pattern, size_t &pos){
std::string key;
while(++pos < pattern.size() && pattern[pos] != '}'){
key += pattern[pos];
}
++pos;
return specialCharClass[key];
}
std::string translate_char_class(const std::string &pattern, size_t &pos){
std::string positive;
std::string negative;
if(pattern[pos] != '['){
return "";
}
++pos;
while(pos < pattern.size()){
if(pattern[pos] == ']'){
++pos;
if(negative.size() != 0){
return "(?:(?!" + negative + ")[" + positive + "])";
}else{
return "[" + positive + "]";
}
}else if(pattern[pos] == '\\'){
if(pos + 3 < pattern.size() && pattern[pos+1] == 'p'){
positive += getCharClassByName(pattern, pos += 2);
}else{
positive += pattern[pos++];
positive += pattern[pos++];
}
}else if(pattern[pos] == '-' && pos + 1 < pattern.size() && pattern[pos+1] == '['){
if(negative.size() == 0){
negative = translate_char_class(pattern, ++pos);
}else{
negative += '|';
negative = translate_char_class(pattern, ++pos);
}
}else{
positive += pattern[pos++];
}
}
return '[' + positive; // there is an error pass, forward it
}
std::string translate_regex(const std::string &pattern, size_t pos = 0){
std::string r;
while(pos < pattern.size()){
if(pattern[pos] == '\\'){
r += pattern[pos++];
r += pattern[pos++];
}else if(pattern[pos] == '['){
r += translate_char_class(pattern, pos);
}else{
r += pattern[pos++];
}
}
return r;
}
void show_matches(const std::string& in, const std::string& re)
{
std::smatch m;
std::regex_search(in, m, std::regex(re));
if(m.empty()) {
std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
} else {
std::cout << "input=[" << in << "], regex=[" << re << "]: ";
std::cout << "prefix=[" << m.prefix() << "] ";
for(std::size_t n = 0; n < m.size(); ++n)
std::cout << " m[" << n << "]=[" << m[n] << "] ";
std::cout << "suffix=[" << m.suffix() << "]\n";
}
}
int main()
{
std::vector<std::string> tests = {
"[a]",
"[a-z]d",
"[\\p{IsBasicLatin}-[\\p{IsDigit}-[89]]]",
"[a-z-[aeiou]]{2,4}",
"[a-z-[aeiou-[e]]]",
"Some text [0-9-[4]] suffix",
"([abcde-[ae]])",
"[a-z-[aei]]|[A-Z-[OU]] "
};
for(std::string test : tests){
std::cout << " " << test << '\n'
<< " -- " << translate_regex(test) << '\n';
// Construct a reegx (validate syntax)
std::regex(translate_regex(test));
}
std::string re = translate_regex("[a-z-[aeiou-[e]]]{2,10}");
show_matches("abcdefghi", re);
return 0;
}
Try using a library function from a library with XPath support, like xmlregexp in libxml (is a C library), it can handle the XML regexes and apply them to the XML directly
http://www.xmlsoft.org/html/libxml-xmlregexp.html#xmlRegexp
----> http://web.mit.edu/outland/share/doc/libxml2-2.4.30/html/libxml-xmlregexp.html <----
An alternative could have been PugiXML (C++ library, What XML parser should I use in C++? ) however i think it does not implement the XML regex functionality ...
Okay after going through the other answers I tried out a few different things and ended up using the xmlRegexp functionality from libxml2.
The xmlRegexp related functions are very poorly documented so I figured I would post an example here because others may find it useful:
#include <iostream>
#include <libxml/xmlregexp.h>
int main()
{
LIBXML_TEST_VERSION;
xmlChar* str = xmlCharStrdup("bcdfg");
xmlChar* pattern = xmlCharStrdup("[a-z-[aeiou]]+");
xmlRegexp* regex = xmlRegexpCompile(pattern);
if (xmlRegexpExec(regex, str) == 1)
{
std::cout << "Match!" << std::endl;
}
free(regex);
free(pattern);
free(str);
}
Output:
Match!
I also attempted to use the XMLString::patternMatch from the Xerces-C++ library but it didn't seem to use an XML Schema compliant regex engine underneath. (Honestly I have no clue what regex engine it uses underneath and the documentation for that was pretty abysmal and I couldn't find any examples online so I just gave up on it.)

Trouble with C++11 regex_search [duplicate]

I'm a bit confused about the following C++11 code:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string haystack("abcdefabcghiabc");
std::regex needle("abc");
std::smatch matches;
std::regex_search(haystack, matches, needle);
std::cout << matches.size() << std::endl;
}
I'd expect it to print out 3 but instead I get 1. Am I missing something?
You get 1 because regex_search returns only 1 match, and size() will return the number of capture groups + the whole match value.
Your matches is...:
Object of a match_results type (such as cmatch or smatch) that is filled by this function with information about the match results and any submatches found.
If [the regex search is] successful, it is not empty and contains a series of sub_match objects: the first sub_match element corresponds to the entire match, and, if the regex expression contained sub-expressions to be matched (i.e., parentheses-delimited groups), their corresponding sub-matches are stored as successive sub_match elements in the match_results object.
Here is a code that will find multiple matches:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
while (regex_search(str, smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
str = smtch.suffix().str();
}
return 0;
}
See IDEONE demo returning abc 3 times.
As this method destroys the input string, here is another alternative based on the std::sregex_iterator (std::wsregex_iterator should be used when your subject is an std::wstring object):
int main() {
std::regex r("ab(c)");
std::string s = "abcdefabcghiabc";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
std::cout << " Capture: " << m[1].str() << " at Position " << m.position(1) << '\n';
}
return 0;
}
See IDEONE demo, returning
Match value: abc at Position 0
Capture: c at Position 2
Match value: abc at Position 6
Capture: c at Position 8
Match value: abc at Position 12
Capture: c at Position 14
What you're missing is that matches is populated with one entry for each capture group (including the entire matched substring as the 0th capture).
If you write
std::regex needle("a(b)c");
then you'll get matches.size()==2, with matches[0]=="abc", and matches[1]=="b".
EDIT: Some people have downvoted this answer. That may be for a variety of reasons, but if it is because it does not apply to the answer I criticized (no one left a comment to explain the decision), they should take note that W. Stribizew changed the code two months after I wrote this, and I was unaware of it until today, 2021-01-18. The rest of the answer is unchanged from when I first wrote it.
#stribizhev's solution has quadratic worst case complexity for sane regular expressions. For insane ones (e.g. "y*"), it doesn't terminate. In some applications, these issues could be DoS attacks waiting to happen. Here's a fixed version:
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
auto beg = str.cbegin();
while (regex_search(beg, str.cend(), smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
if ( smtch.length(0) > 0 )
std::advance(beg, smtch.length(0));
else if ( beg != str.cend() )
++beg;
else
break;
}
According to my personal preference, this will find n+1 matches of an empty regex in a string of length n. You might also just exit the loop after an empty match.
If you want to compare the performance for a string with millions of matches, add the following lines after the definition of str (and don't forget to turn on optimizations), once for each version:
for (int j = 0; j < 20; ++j)
str = str + str;

Use regex to validate string not starting with specific character and string length

I have a function with the following if statements:
if (name.length() < 10 || name.length() > 64)
{
return false;
}
if (name.front() == L'.' || name.front() == L' ')
{
return false;
}
I was curious to see if can do this using the following regular expression:
^(?!\ |\.)([A-Za-z]{10,46}$)
to dissect the above expression the first part ^(?!\ |.) preforms a negative look ahead to assert that it is impossible for the string to start with space or dot(.) and the second part should take care of the string length condition. I wrote the following to test the expression out:
std::string randomStrings [] = {" hello",
" hellllloooo",
"\.hello",
".zoidbergvddd",
"asdasdsadadasdasdasdadasdsad"};
std::regex txt_regex("^(?!\ |\.)([A-Za-z]{10,46}$)");
for (const auto& str : randomStrings)
{
std::cout << str << ": " << std::boolalpha << std::regex_match(str, txt_regex) << '\n';
}
I expected the last one to to match since it does not start with space or dot(.) and it meets the length criteria. However, this is what I got:
hello: false
hellllloooo: false
.hello: false
.zoidbergvddd: false
asdasdsadadasdasdasdadasdsad: false
Did I miss something trivial here or this is not possible using regex? It seems like it should be.
Feel free to suggest a better title, I tried to be as descriptive as possible.
Change your regular expression to: "^(?![\\s.])([A-Za-z]{10,46}$)" and it will work.
\s refers to any whitespace and you need to escape the \ inside the string and that's why it becomes \\s.
You can also check this link
You need to turn on compiler warnings. It would have told you that you have an unknown escape sequence in your regex. I recommend using a raw literal.
#include <iostream>
#include <regex>
int main() {
std::string randomStrings[] = { " hello", " hellllloooo", ".hello",
".zoidbergvddd", "asdasdsadadasdasdasdadasdsad" };
std::regex txt_regex(R"foo(^(?!\ |\.)([A-Za-z]{10,46}$))foo");
for (const auto& str : randomStrings) {
std::cout << str << ": " << std::boolalpha
<< std::regex_match(str, txt_regex) << '\n';
}
}
clang++-3.8 gives
hello: false
hellllloooo: false
.hello: false
.zoidbergvddd: false
asdasdsadadasdasdasdadasdsad: true

Retrieving a regex search in C++

Hello I am new to regular expressions and from what I understood from the c++ reference website it is possible to get match results.
My question is: how do I retrieve these results? What is the difference between smatch and cmatch? For example, I have a string consisting of date and time and this is the regular expression I wrote:
"(1[0-2]|0?[1-9])([:][0-5][0-9])?(am|pm)"
Now when I do a regex_search with the string and the above expression, I can find whether there is a time in the string or not. But I want to store that time in a structure so I can separate hours and minutes. I am using Visual studio 2010 c++.
If you use e.g. std::regex_search then it fills in a std::match_result where you can use the operator[] to get the matched strings.
Edit: Example program:
#include <iostream>
#include <string>
#include <regex>
void test_regex_search(const std::string& input)
{
std::regex rgx("((1[0-2])|(0?[1-9])):([0-5][0-9])((am)|(pm))");
std::smatch match;
if (std::regex_search(input.begin(), input.end(), match, rgx))
{
std::cout << "Match\n";
//for (auto m : match)
// std::cout << " submatch " << m << '\n';
std::cout << "match[1] = " << match[1] << '\n';
std::cout << "match[4] = " << match[4] << '\n';
std::cout << "match[5] = " << match[5] << '\n';
}
else
std::cout << "No match\n";
}
int main()
{
const std::string time1 = "9:45pm";
const std::string time2 = "11:53am";
test_regex_search(time1);
test_regex_search(time2);
}
Output from the program:
Match
match[1] = 9
match[4] = 45
match[5] = pm
Match
match[1] = 11
match[4] = 53
match[5] = am
Just use named groups.
(?<hour>(1[0-2]|0?[1-9]))([:](?<minute>[0-5][0-9]))?(am|pm)
Ok, vs2010 doesn't support named groups. You already using unnamed capture groups. Go through them.

Extract IP address from a string using boost regex?

I was wondering if anyone can help me, I've been looking around for regex examples but I still can't get my head over it.
The strings look like this:
"User JaneDoe, IP: 12.34.56.78"
"User JohnDoe, IP: 34.56.78.90"
How would I go about to make an expression that matches the above strings?
The question is how exactly do you want to match these, and what else do you want to exclude?
It's trivial (but rarely useful) to match any incoming string with a simple .*.
To match these more exactly (and add the possibility of extracting things like the user name and/or IP), you could use something like: "User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})". Depending on your input, this might still run into a problem with a name that includes a comma (e.g., "John James, Jr."). If you have to allow for that, it gets quite a bit uglier in a hurry.
Edit: Here's a bit of code to test/demonstrate the regex above. At the moment, this is using the C++0x regex class(es) -- to use Boost, you'll need to change the namespaces a bit (but I believe that should be about all).
#include <regex>
#include <iostream>
void show_match(std::string const &s, std::regex const &r) {
std::smatch match;
if (std::regex_search(s, match, r))
std::cout << "User Name: \"" << match[1]
<< "\", IP Address: \"" << match[2] << "\"\n";
else
std::cerr << s << "did not match\n";
}
int main(){
std::string inputs[] = {
std::string("User JaneDoe, IP: 12.34.56.78"),
std::string("User JohnDoe, IP: 34.56.78.90")
};
std::regex pattern("User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})");
for (int i=0; i<2; i++)
show_match(inputs[i], pattern);
return 0;
}
This prints out the user name and IP address, but in (barely) enough different format to make it clear that it's matching and printing out individual pieces, not just passing entire strings through.
#include <string>
#include <iostream>
#include <boost/regex.hpp>
int main() {
std::string text = "User JohnDoe, IP: 121.1.55.86";
boost::regex expr ("User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})");
boost::smatch matches;
try
{
if (boost::regex_match(text, matches, expr)) {
std::cout << matches.size() << std::endl;
for (int i = 1; i < matches.size(); i++) {
std::string match (matches[i].first, matches[i].second);
std::cout << "matches[" << i << "] = " << match << std::endl;
}
}
else {
std::cout << "\"" << expr << "\" does not match \"" << text << "\". matches size(" << matches.size() << ")" << std::endl;
}
}
catch (boost::regex_error& e)
{
std::cout << "Error: " << e.what() << std::endl;
}
return 0;
}
Edited: Fixed missing comma in string, pointed out by Davka, and changed cmatch to smatch