Regex character class subtraction in C++

Regex character class subtraction in C++ - c++

I'm writing a C++ program that will need to take regular expressions that are defined in a XML Schema file and use them to validate XML data. The problem is, the flavor of regular expressions used by XML Schemas does not seem to be directly supported in C++.
For example, there are a couple special character classes \i and \c that are not defined by default and also the XML Schema regex language supports something called "character class subtraction" that does not seem to be supported in C++.
Allowing the use of the \i and \c special character classes is pretty simple, I can just look for "\i" or "\c" in the regular expression and replace them with their expanded versions, but getting character class subtraction to work is a much more daunting problem...
For example, this regular expression that is valid in an XML Schema definition throws an exception in C++ saying it has unbalanced square brackets.
#include <iostream>
#include <regex>
int main()
{
try
{
// Match any lowercase letter that is not a vowel
std::regex rx("[a-z-[aeiuo]]");
}
catch (const std::regex_error& ex)
{
std::cout << ex.what() << std::endl;
}
}
How can I get C++ to recognize character class subtraction within a regex? Or even better, is there a way to just use the XML Schema flavor of regular expressions directly within C++?

Character ranges subtraction or intersection is not available in any of the grammars supported by std::regex, so you will have to rewrite the expression into one of the supported ones.
The easiest way is to perform the subtraction yourself and pass the set to std::regex, for instance [bcdfghjklvmnpqrstvwxyz] for your example.
Another solution is to find either a more featureful regular expression engine or a dedicated XML library that supports XML Schema and its regular expression language.

Starting from the cppreference examples
#include <iostream>
#include <regex>
void show_matches(const std::string& in, const std::string& re)
{
std::smatch m;
std::regex_search(in, m, std::regex(re));
if(m.empty()) {
std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
} else {
std::cout << "input=[" << in << "], regex=[" << re << "]: ";
std::cout << "prefix=[" << m.prefix() << "] ";
for(std::size_t n = 0; n < m.size(); ++n)
std::cout << " m[" << n << "]=[" << m[n] << "] ";
std::cout << "suffix=[" << m.suffix() << "]\n";
}
}
int main()
{
// greedy match, repeats [a-z] 4 times
show_matches("abcdefghi", "(?:(?![aeiou])[a-z]){2,4}");
}
You can test and check the the details of the regular expression here.
The choice of using a non capturing group (?: ...) is to prevent it from changing your groups in case you will use it in a bigger regular expression.
(?![aeiou]) will match without consuming the input if finds a character not matching [aeiou], the [a-z] will match letters. Combining these two condition is equivalent to your character class subtraction.
The {2,4} is a quantifier that says from 2 to 4, could also be + for one or more, * for zero or more.
Edit
Reading the comments in the other answer I understand that you want to support XMLSchema.
The next program shows how to use ECMA regular expression to translate the "character class differences" to a ECMA compatible format.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
std::string translated_regex(const std::string &pattern){
// pattern to identify character class subtraction
std::regex class_subtraction_re(
"\\[((?:\\\\[\\[\\]]|[^[\\]])*)-\\[((?:\\\\[\\[\\]]|[^[\\]])*)\\]\\]"
);
// translate the regular expression to ECMA compatible
std::string translated = std::regex_replace(pattern,
class_subtraction_re, "(?:(?![$2])[$1])");
return translated;
}
void show_matches(const std::string& in, const std::string& re)
{
std::smatch m;
std::regex_search(in, m, std::regex(re));
if(m.empty()) {
std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
} else {
std::cout << "input=[" << in << "], regex=[" << re << "]: ";
std::cout << "prefix=[" << m.prefix() << "] ";
for(std::size_t n = 0; n < m.size(); ++n)
std::cout << " m[" << n << "]=[" << m[n] << "] ";
std::cout << "suffix=[" << m.suffix() << "]\n";
}
}
int main()
{
std::vector<std::string> tests = {
"Some text [0-9-[4]] suffix",
"([abcde-[ae]])",
"[a-z-[aei]]|[A-Z-[OU]] "
};
std::string re = translated_regex("[a-z-[aeiou]]{2,4}");
show_matches("abcdefghi", re);
for(std::string test : tests){
std::cout << " " << test << '\n'
<< " -- " << translated_regex(test) << '\n';
}
return 0;
}
Edit: Recursive and Named character classes
The above approach does not work with recursive character class negation. And there is no way to deal with recursive substitutions using only regular expressions. This rendered the solution far less straight forward.
The solution has the following levels
one function scans the regular expression for a [
when a [ is found there is a function to handle the character classes recursively when '-[` is found.
The pattern \p{xxxxx} is handled separately to identify named character patterns. The named classes are defined in the specialCharClass map, I fill two examples.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <map>
std::map<std::string, std::string> specialCharClass = {
{"IsDigit", "0-9"},
{"IsBasicLatin", "a-zA-Z"}
// Feel free to add the character classes you want
};
const std::string getCharClassByName(const std::string &pattern, size_t &pos){
std::string key;
while(++pos < pattern.size() && pattern[pos] != '}'){
key += pattern[pos];
}
++pos;
return specialCharClass[key];
}
std::string translate_char_class(const std::string &pattern, size_t &pos){
std::string positive;
std::string negative;
if(pattern[pos] != '['){
return "";
}
++pos;
while(pos < pattern.size()){
if(pattern[pos] == ']'){
++pos;
if(negative.size() != 0){
return "(?:(?!" + negative + ")[" + positive + "])";
}else{
return "[" + positive + "]";
}
}else if(pattern[pos] == '\\'){
if(pos + 3 < pattern.size() && pattern[pos+1] == 'p'){
positive += getCharClassByName(pattern, pos += 2);
}else{
positive += pattern[pos++];
positive += pattern[pos++];
}
}else if(pattern[pos] == '-' && pos + 1 < pattern.size() && pattern[pos+1] == '['){
if(negative.size() == 0){
negative = translate_char_class(pattern, ++pos);
}else{
negative += '|';
negative = translate_char_class(pattern, ++pos);
}
}else{
positive += pattern[pos++];
}
}
return '[' + positive; // there is an error pass, forward it
}
std::string translate_regex(const std::string &pattern, size_t pos = 0){
std::string r;
while(pos < pattern.size()){
if(pattern[pos] == '\\'){
r += pattern[pos++];
r += pattern[pos++];
}else if(pattern[pos] == '['){
r += translate_char_class(pattern, pos);
}else{
r += pattern[pos++];
}
}
return r;
}
void show_matches(const std::string& in, const std::string& re)
{
std::smatch m;
std::regex_search(in, m, std::regex(re));
if(m.empty()) {
std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
} else {
std::cout << "input=[" << in << "], regex=[" << re << "]: ";
std::cout << "prefix=[" << m.prefix() << "] ";
for(std::size_t n = 0; n < m.size(); ++n)
std::cout << " m[" << n << "]=[" << m[n] << "] ";
std::cout << "suffix=[" << m.suffix() << "]\n";
}
}
int main()
{
std::vector<std::string> tests = {
"[a]",
"[a-z]d",
"[\\p{IsBasicLatin}-[\\p{IsDigit}-[89]]]",
"[a-z-[aeiou]]{2,4}",
"[a-z-[aeiou-[e]]]",
"Some text [0-9-[4]] suffix",
"([abcde-[ae]])",
"[a-z-[aei]]|[A-Z-[OU]] "
};
for(std::string test : tests){
std::cout << " " << test << '\n'
<< " -- " << translate_regex(test) << '\n';
// Construct a reegx (validate syntax)
std::regex(translate_regex(test));
}
std::string re = translate_regex("[a-z-[aeiou-[e]]]{2,10}");
show_matches("abcdefghi", re);
return 0;
}

Try using a library function from a library with XPath support, like xmlregexp in libxml (is a C library), it can handle the XML regexes and apply them to the XML directly
http://www.xmlsoft.org/html/libxml-xmlregexp.html#xmlRegexp
----> http://web.mit.edu/outland/share/doc/libxml2-2.4.30/html/libxml-xmlregexp.html <----
An alternative could have been PugiXML (C++ library, What XML parser should I use in C++? ) however i think it does not implement the XML regex functionality ...

Okay after going through the other answers I tried out a few different things and ended up using the xmlRegexp functionality from libxml2.
The xmlRegexp related functions are very poorly documented so I figured I would post an example here because others may find it useful:
#include <iostream>
#include <libxml/xmlregexp.h>
int main()
{
LIBXML_TEST_VERSION;
xmlChar* str = xmlCharStrdup("bcdfg");
xmlChar* pattern = xmlCharStrdup("[a-z-[aeiou]]+");
xmlRegexp* regex = xmlRegexpCompile(pattern);
if (xmlRegexpExec(regex, str) == 1)
{
std::cout << "Match!" << std::endl;
}
free(regex);
free(pattern);
free(str);
}
Output:
Match!
I also attempted to use the XMLString::patternMatch from the Xerces-C++ library but it didn't seem to use an XML Schema compliant regex engine underneath. (Honestly I have no clue what regex engine it uses underneath and the documentation for that was pretty abysmal and I couldn't find any examples online so I just gave up on it.)

Related

regex_match giving unexpected results

I'm trying to write a recursive descent parser and am trying to search a match a regex within a string inputted by the user. I am trying to do the following to try to understand the <regex> library offered by C++11, but I'm getting unexpected results.
std::string expression = "2+2+2";
std::regex re("[-+*/()]");
std::smatch m;
std::cout << "My expression is " << expression << std::endl;
if(std::regex_search(expression, re)) {
std::cout << "Found a match!" << std::endl;
}
std::regex_match(expression, m, re);
std::cout << "matches:" << std::endl;
for (auto it = m.begin(); it!=m.end(); ++it) {
std::cout << *it << std::endl;
}
So based on my regular expression, I expect it to output
Found a match!
matches:
+
+
However, the output I get is:
My expression is 2+2+2
Found a match!
matches:
I feel like I'm making a stupid mistake, but I can't seem to figure out why there's a discrepancy between the outputs.
Thanks,
erip

You've got a few issues. First, let's look at some working code:
#include <regex>
#include <iostream>
int main() {
std::string expr = "2+2+2";
std::regex re("[+\\-*/()]");
const auto operators_begin = std::sregex_iterator(expr.begin(), expr.end(), re);
const auto operators_end = std::sregex_iterator();
std::cout << "Count: " << std::distance(operators_begin, operators_end) << "\n";
for (auto i = operators_begin; i != operators_end; ++i) {
std::smatch match = *i;
std::cout << match.str() << "\n";
}
}
Output:
Count: 2
+
+
Issues with your code:
regex_match() returns false.
You don't have any capturing groups in the regular expression. So even if regex_match() returned true, it wouldn't capture anything.
The number of captures in a regex_match can be determined strictly by looking at the regular expression. So my re will capture exactly one group.
But we want to apply this regular expression on our string multiple times, because we want to find all of the matches. The tool for that is regex_iterator.
We also needed to escape the - in the regular expression. The minus has a special meaning within character classes.

I found the regex_iterator, which solved the problem. Here's the working code:
std::regex re("[-+*/()]");
std::smatch m;
std::cout << "My expression is " << expression << std::endl;
if(std::regex_search(expression, re)) {
std::cout << "Found a match!" << std::endl;
}
try {
std::sregex_iterator next(expression.begin(), expression.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
next++;
}
} catch (std::regex_error& e) {
// Syntax error in the regular expression
}

Retrieving a regex search in C++

Hello I am new to regular expressions and from what I understood from the c++ reference website it is possible to get match results.
My question is: how do I retrieve these results? What is the difference between smatch and cmatch? For example, I have a string consisting of date and time and this is the regular expression I wrote:
"(1[0-2]|0?[1-9])([:][0-5][0-9])?(am|pm)"
Now when I do a regex_search with the string and the above expression, I can find whether there is a time in the string or not. But I want to store that time in a structure so I can separate hours and minutes. I am using Visual studio 2010 c++.

If you use e.g. std::regex_search then it fills in a std::match_result where you can use the operator[] to get the matched strings.
Edit: Example program:
#include <iostream>
#include <string>
#include <regex>
void test_regex_search(const std::string& input)
{
std::regex rgx("((1[0-2])|(0?[1-9])):([0-5][0-9])((am)|(pm))");
std::smatch match;
if (std::regex_search(input.begin(), input.end(), match, rgx))
{
std::cout << "Match\n";
//for (auto m : match)
// std::cout << " submatch " << m << '\n';
std::cout << "match[1] = " << match[1] << '\n';
std::cout << "match[4] = " << match[4] << '\n';
std::cout << "match[5] = " << match[5] << '\n';
}
else
std::cout << "No match\n";
}
int main()
{
const std::string time1 = "9:45pm";
const std::string time2 = "11:53am";
test_regex_search(time1);
test_regex_search(time2);
}
Output from the program:
Match
match[1] = 9
match[4] = 45
match[5] = pm
Match
match[1] = 11
match[4] = 53
match[5] = am

Just use named groups.
(?<hour>(1[0-2]|0?[1-9]))([:](?<minute>[0-5][0-9]))?(am|pm)
Ok, vs2010 doesn't support named groups. You already using unnamed capture groups. Go through them.

Regex C++: extract substring

I would like to extract a substring between two others.
ex: /home/toto/FILE_mysymbol_EVENT.DAT
or just FILE_othersymbol_EVENT.DAT
And I would like to get : mysymbol and othersymbol
I don't want to use boost or other libs. Just standard stuffs from C++, except CERN's ROOT lib, with TRegexp, but I don't know how to use it...

Since last year C++ has regular expression built into the standard. This program will show how to use them to extract the string you are after:
#include <regex>
#include <iostream>
int main()
{
const std::string s = "/home/toto/FILE_mysymbol_EVENT.DAT";
std::regex rgx(".*FILE_(\\w+)_EVENT\\.DAT.*");
std::smatch match;
if (std::regex_search(s.begin(), s.end(), match, rgx))
std::cout << "match: " << match[1] << '\n';
}
It will output:
match: mysymbol
It should be noted though, that it will not work in GCC as its library support for regular expression is not very good. Works well in VS2010 (and probably VS2012), and should work in clang.
By now (late 2016) all modern C++ compilers and their standard libraries are fully up to date with the C++11 standard, and most if not all of C++14 as well. GCC 6 and the upcoming Clang 4 support most of the coming C++17 standard as well.

TRegexp only supports a very limited subset of regular expressions compared to other regex flavors. This makes constructing a single regex that suits your needs somewhat awkward.
One possible solution:
[^_]*_([^_]*)_
will match the string until the first underscore, then capture all characters until the next underscore. The relevant result of the match is then found in group number 1.
But in your case, why use a regex at all? Just find the first and second occurrence of your delimiter _ in the string and extract the characters between those positions.

If you want to use regular expressions, I'd really recommend using C++11's regexes or, if you have a compiler that doesn't yet support them, Boost. Boost is something I consider almost-part-of-standard-C++.
But for this particular question, you do not really need any form of regular expressions. Something like this sketch should work just fine, after you add all appropriate error checks (beg != npos, end != npos etc.), test code, and remove my typos:
std::string between(std::string const &in,
std::string const &before, std::string const &after) {
size_type beg = in.find(before);
beg += before.size();
size_type end = in.find(after, beg);
return in.substr(beg, end-beg);
}
Obviously, you could change the std::string to a template parameter and it should work just fine with std::wstring or more seldomly used instantiations of std::basic_string as well.

I would study corner cases before trusting it.
But This is a good candidate:
std::string text = "/home/toto/FILE_mysymbol_EVENT.DAT";
std::regex reg("(.*)(FILE_)(.*)(_EVENT.DAT)(.*)");
std::cout << std::regex_replace(text, reg, "$3") << '\n';

The answers of Some programmer dude, Tim Pietzcker, and Christopher Creutzig are cool and correct, but they seemed to me not very obvious for beginners.
The following function is an attempt to create an auxiliary illustration for Some programmer dude and Tim Pietzcker's answers:
void ExtractSubString(const std::string& start_string
, const std::string& string_regex_extract_substring_template)
{
std::regex regex_extract_substring_template(
string_regex_extract_substring_template);
std::smatch match;
std::cout << std::endl;
std::cout << "A substring extract template: " << std::endl;
std::cout << std::quoted(string_regex_extract_substring_template)
<< std::endl;
std::cout << std::endl;
std::cout << "Start string: " << std::endl;
std::cout << start_string << std::endl;
std::cout << std::endl;
if (std::regex_search(start_string.begin(), start_string.end()
, match, regex_extract_substring_template))
{
std::cout << "match0: " << match[0] << std::endl;
std::cout << "match1: " << match[1] << std::endl;
std::cout << "match2: " << match[2] << std::endl;
}
std::cout << std::endl;
}
The following overloaded function is an attempt to help illustrate Christopher Creutzig's answer:
void ExtractSubString(const std::string& start_string
, const std::string& before_substring, const std::string& after_substring)
{
std::cout << std::endl;
std::cout << "A before substring: " << std::endl;
std::cout << std::quoted(before_substring) << std::endl;
std::cout << std::endl;
std::cout << "An after substring: " << std::endl;
std::cout << std::quoted(after_substring) << std::endl;
std::cout << std::endl;
std::cout << "Start string: " << std::endl;
std::cout << start_string << std::endl;
std::cout << std::endl;
size_t before_substring_begin
= start_string.find(before_substring);
size_t extract_substring_begin
= before_substring_begin + before_substring.size();
size_t extract_substring_end
= start_string.find(after_substring, extract_substring_begin);
std::cout << "Extract substring: " << std::endl;
std::cout
<< start_string.substr(extract_substring_begin
, extract_substring_end - extract_substring_begin)
<< std::endl;
std::cout << std::endl;
}
This is the main function to run the overloaded functions:
#include <regex>
#include <iostream>
#include <iomanip>
int main()
{
const std::string start_string
= "/home/toto/FILE_mysymbol_EVENT.DAT";
const std::string string_regex_extract_substring_template(
".*FILE_(\\w+)_EVENT\\.DAT.*");
const std::string string_regex_extract_substring_template2(
"[^_]*_([^_]*)_");
ExtractSubString(start_string, string_regex_extract_substring_template);
ExtractSubString(start_string, string_regex_extract_substring_template2);
const std::string before_substring = "/home/toto/FILE_";
const std::string after_substring = "_EVENT.DAT";
ExtractSubString(start_string, before_substring, after_substring);
}
This is the result of executing the main function:
A substring extract template:
".*FILE_(\\w+)_EVENT\\.DAT.*"
Start string:
"/home/toto/FILE_mysymbol_EVENT.DAT"
match0: /home/toto/FILE_mysymbol_EVENT.DAT
match1: mysymbol
match2:
A substring extract template:
"[^_]*_([^_]*)_"
Start string:
"/home/toto/FILE_mysymbol_EVENT.DAT"
match0: /home/toto/FILE_mysymbol_
match1: mysymbol
match2:
A before substring:
"/home/toto/FILE_"
An after substring:
"_EVENT.DAT"
Start string:
"/home/toto/FILE_mysymbol_EVENT.DAT"
Extract substring:
mysymbol

Boost regex don't match tabs

I'm using boost regex_match and I have a problem with matching no tab characters.
My test application looks as follows:
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_regex.hpp>
int
main(int args, char** argv)
{
boost::match_results<std::string::const_iterator> what;
if(args == 3) {
std::string text(argv[1]);
boost::regex expression(argv[2]);
std::cout << "Text : " << text << std::endl;
std::cout << "Regex: " << expression << std::endl;
if(boost::regex_match(text, what, expression, boost::match_default) != 0) {
int i = 0;
std::cout << text;
if(what[0].matched)
std::cout << " matches with regex pattern!" << std::endl;
else
std::cout << " does not match with regex pattern!" << std::endl;
for(boost::match_results<std::string::const_iterator>::const_iterator it=what.begin(); it!=what.end(); ++it) {
std::cout << "[" << (i++) << "] " << it->str() << std::endl;
}
} else {
std::cout << "Expression does not match!" << std::endl;
}
} else {
std::cout << "Usage: $> ./boost-regex <text> <regex>" << std::endl;
}
return 0;
}
If I run the program with these arguments, I don't get the expected result:
$> ./boost-regex "`cat file`" "(?=.*[^\t]).*"
Text : This text includes some tabulators
Regex: (?=.*[^\t]).*
This text includes some tabulators matches with regex pattern!
[0] This text includes some tabulators
In this case I would have expected that what[0].matched is false, but it's not.
Is there any mistake in my regular expression?
Or do I have to use other format/match flag?
Thank you in advance!

I am not sure what you want to do. My understanding is, you want the regex to fail as soon as there is a tab in the text.
Your positive lookahead assertion (?=.*[^\t]) is true as soon as it finds a non tab, and there are a lot of non tabs in your text.
If you want it to fail, when there is a tab, go the other way round and use a negative lookahead assertion.
(?!.*\t).*
this assertion will fail as soon as it find a tab.

Extract IP address from a string using boost regex?

I was wondering if anyone can help me, I've been looking around for regex examples but I still can't get my head over it.
The strings look like this:
"User JaneDoe, IP: 12.34.56.78"
"User JohnDoe, IP: 34.56.78.90"
How would I go about to make an expression that matches the above strings?

The question is how exactly do you want to match these, and what else do you want to exclude?
It's trivial (but rarely useful) to match any incoming string with a simple .*.
To match these more exactly (and add the possibility of extracting things like the user name and/or IP), you could use something like: "User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})". Depending on your input, this might still run into a problem with a name that includes a comma (e.g., "John James, Jr."). If you have to allow for that, it gets quite a bit uglier in a hurry.
Edit: Here's a bit of code to test/demonstrate the regex above. At the moment, this is using the C++0x regex class(es) -- to use Boost, you'll need to change the namespaces a bit (but I believe that should be about all).
#include <regex>
#include <iostream>
void show_match(std::string const &s, std::regex const &r) {
std::smatch match;
if (std::regex_search(s, match, r))
std::cout << "User Name: \"" << match[1]
<< "\", IP Address: \"" << match[2] << "\"\n";
else
std::cerr << s << "did not match\n";
}
int main(){
std::string inputs[] = {
std::string("User JaneDoe, IP: 12.34.56.78"),
std::string("User JohnDoe, IP: 34.56.78.90")
};
std::regex pattern("User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})");
for (int i=0; i<2; i++)
show_match(inputs[i], pattern);
return 0;
}
This prints out the user name and IP address, but in (barely) enough different format to make it clear that it's matching and printing out individual pieces, not just passing entire strings through.

#include <string>
#include <iostream>
#include <boost/regex.hpp>
int main() {
std::string text = "User JohnDoe, IP: 121.1.55.86";
boost::regex expr ("User ([^,]*), IP: (\\d{1,3}(\\.\\d{1,3}){3})");
boost::smatch matches;
try
{
if (boost::regex_match(text, matches, expr)) {
std::cout << matches.size() << std::endl;
for (int i = 1; i < matches.size(); i++) {
std::string match (matches[i].first, matches[i].second);
std::cout << "matches[" << i << "] = " << match << std::endl;
}
}
else {
std::cout << "\"" << expr << "\" does not match \"" << text << "\". matches size(" << matches.size() << ")" << std::endl;
}
}
catch (boost::regex_error& e)
{
std::cout << "Error: " << e.what() << std::endl;
}
return 0;
}
Edited: Fixed missing comma in string, pointed out by Davka, and changed cmatch to smatch

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex character class subtraction in C++ - c++

Related

regex_match giving unexpected results

Retrieving a regex search in C++

Regex C++: extract substring

Boost regex don't match tabs

Extract IP address from a string using boost regex?

Categories

Resources