Composing complex regular expressions with "DEFINED" subexpressions in C++ - c++

I'm trying to write a regular expression in C++ to match a base64 encoded string. I'm quite familiar with writing complex regular expressions in Perl so I started with that:
use strict;
use warnings;
my $base64_regex = qr{
(?(DEFINE)
(?<B64>[A-Za-z0-9+/])
(?<B16>[AEIMQUYcgkosw048])
(?<B04>[AQgw])
)
^(
((?&B64){4})*
(
(?&B64){4}|
(?&B64){2}(?&B16)=|
(?&B64)(?&B04)={2}
)
)?$}x;
# "Hello World!" base64 encoded
my $base64 = "SGVsbG8gV29ybGQh";
if ($base64 =~ $base64_regex)
{
print "Match!\n";
}
else
{
print "No match!\n"
}
Output:
Match!
I then tried to implement a similar regular expression in C++:
#include <iostream>
#include <regex>
int main()
{
std::regex base64_regex(
"(?(DEFINE)"
"(?<B64>[A-Za-z0-9+/])"
"(?<B16>[AEIMQUYcgkosw048])"
"(?<B04>[AQgw])"
")"
"^("
"((?&B64){4})*"
"("
"(?&B64){4}|"
"(?&B64){2}(?&B16)=|"
"(?&B64)(?&B04)={2}"
")"
")?$");
// "Hello World!" base64 encoded
std::string base64 = "SGVsbG8gV29ybGQh";
if (std::regex_match(base64, base64_regex))
{
std::cout << "Match!" << std::endl;
}
else
{
std::cout << "No Match!" << std::endl;
}
}
but when I run the code I get an exception telling me it is not a valid regular expression.
Catching the exception and printing the "what" string doesn't help much either. All it gives me is the following:
regex_error(error_syntax)
Obviously I could get rid of the "DEFINE" block with my pre-defined subpatterns, but that would make the whole expression very difficult to read... and, well... I like to be able to maintain my own code when I come back to it a few years later lol so that isn't really a good option.
How can I get a similar regular expression to work in C++?
Note: This must all be done within a single "std::regex" object because I am writing a library where users will be able to pass a string to be able to define their own regular expressions and I want these users to be able to "DEFINE" similar subexpressions within their regex if they need to.

How about string concatenation?
#define B64 "[A-Za-z0-9+/]"
#define B16 "[AEIMQUYcgkosw048]"
#define B04 "[AQgw]"
std::regex base64_regex(
"^("
"(" B64 "{4})*"
"("
B64 "{4}|"
B64 "{2}" B16 "=|"
B64 B04 "={2}"
")"
")?$");

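If the subpatterns have to be supplied at run time (as in the library scenario from the question), the same concatenation idea can be done with a small preprocessing step before the pattern reaches std::regex. The following is only a sketch under some assumptions: `expand_subpatterns` and `is_base64` are hypothetical helpers, not part of `<regex>`; definitions come from a map instead of a `(?(DEFINE)...)` block; and references are expanded once, so definitions that refer to other definitions and escaped occurrences of `(?&` are not handled.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: replaces every "(?&NAME)" in `pattern` with
// "(?:DEFINITION)", where DEFINITION is looked up in `defs`.
std::string expand_subpatterns(const std::string& pattern,
                               const std::map<std::string, std::string>& defs)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < pattern.size()) {
        if (pattern.compare(pos, 3, "(?&") == 0) {
            // Found a reference; the name runs up to the next ')'.
            const std::size_t close = pattern.find(')', pos + 3);
            if (close == std::string::npos)
                throw std::invalid_argument("unterminated (?&...) reference");
            const std::string name = pattern.substr(pos + 3, close - (pos + 3));
            const auto it = defs.find(name);
            if (it == defs.end())
                throw std::invalid_argument("unknown subpattern: " + name);
            // Wrap in a non-capturing group so a following quantifier
            // applies to the whole definition.
            out += "(?:" + it->second + ")";
            pos = close + 1;
        } else {
            out += pattern[pos++];
        }
    }
    return out;
}

// Hypothetical usage: the base64 pattern from the question, expanded once
// and compiled into a single std::regex (default ECMAScript grammar).
bool is_base64(const std::string& s)
{
    static const std::map<std::string, std::string> defs = {
        {"B64", "[A-Za-z0-9+/]"},
        {"B16", "[AEIMQUYcgkosw048]"},
        {"B04", "[AQgw]"},
    };
    static const std::regex re(expand_subpatterns(
        "^("
            "((?&B64){4})*"
            "("
                "(?&B64){4}|"
                "(?&B64){2}(?&B16)=|"
                "(?&B64)(?&B04)={2}"
            ")"
        ")?$", defs));
    return std::regex_match(s, re);
}
```

The user-facing pattern stays readable, and what std::regex finally sees is plain ECMAScript syntax, which is the only thing it can compile.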
I took a suggestion from the comments and checked out "boost" regex since it supports "Perl" regular expressions. I gave it a try and it worked great!
#include <boost/regex.hpp>
boost::regex base64_regex(
"(?(DEFINE)"
"(?<B64>[A-Za-z0-9+/])"
"(?<B16>[AEIMQUYcgkosw048])"
"(?<B04>[AQgw])"
")"
"("
"((?&B64){4})*"
"("
"(?&B64){4}|"
"(?&B64){2}(?&B16)=|"
"(?&B64)(?&B04)={2}"
")"
")?", boost::regex::perl);

Related

C++ Regex always matching entire string

Whenever I use a regex function it matches the entire string for some reason.
#include <iostream>
#include <regex>
int main() {
std::string text = "This (is a) test";
std::regex pattern("\(.+\)");
std::cout << std::regex_replace(text, pattern, "isnt") << std::endl;
return 0;
}
Output: isnt
Your pattern unfortunately is not what it seems to be. Here is the problem.
Imagine that, for some reason, you want to match tabs with your regex. You might try this.
std::regex my_regex("\t");
This would work, but the string that std::regex receives contains an actual tab character, not the two characters \t. This is because of how C++ treats escape sequences in string literals. To pass a literal \t to the regex engine, you have to do the following.
std::regex my_regex("\\t");
So the correct syntax for your regex is.
std::regex pattern("\\(.+\\)");
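A raw string literal is another way to write the same pattern without doubling the backslashes. The sketch below (the names `escaped_pattern`, `raw_pattern` and `replace_parenthesized` are made up for illustration) shows that both spellings produce the same regex text and that the replacement then only touches the parenthesized part.

```cpp
#include <regex>
#include <string>

// Two spellings of the same six-character regex:  \(.+\)
const std::string escaped_pattern = "\\(.+\\)";   // backslashes doubled for C++
const std::string raw_pattern     = R"(\(.+\))";  // raw literal, no doubling needed

// Hypothetical helper: replaces the parenthesized span in `text` with `with`.
std::string replace_parenthesized(const std::string& text, const std::string& with)
{
    return std::regex_replace(text, std::regex(raw_pattern), with);
}
```

With the input from the question, `replace_parenthesized("This (is a) test", "isnt")` gives `This isnt test`.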

How to match complex strings with regular expressions

I am a newbie in C++. I am using the regular expression functions, but I have not been able to get the results I want.
c++ code:
#include <regex>
std::string str = "[game.exe+009E820C]+338";
std::smatch result;
std::regex pattern("\\[([^\\[\\]]+)\\]");
std::regex_match(str, result, pattern);
// no result
std::cout << result[1] << std::endl;
I am familiar with javascript regular expressions, so I can get the value I want:
'[game.exe+009E820C]+338'.match(/\[([^\[\]]+)\]/)[1] => game.exe+009E820C
Is my C++ code doing something wrong?
std::regex_match only succeeds when the pattern matches the entire input, so to read the capture group with it you need a pattern that covers the whole string (std::regex_search, by contrast, looks for a match anywhere in the input). Also, to avoid getting bogged down by a negative character class which includes a closing square bracket, I recommend using the Perl lazy dot instead. Putting all this together:
std::string str = "[game.exe+009E820C]+338";
std::smatch result;
std::regex pattern(".*\\[(.*?)\\].*");
std::regex_match(str, result, pattern);
std::cout << result[1] << std::endl;
This prints:
game.exe+009E820C
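For comparison, the original pattern from the question can be kept unchanged if std::regex_search is used instead of std::regex_match, since regex_search accepts a match anywhere in the input. A minimal sketch, where `first_bracketed` is a hypothetical helper name:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text inside the first [...] pair in `s`,
// or an empty string if there is none.
std::string first_bracketed(const std::string& s)
{
    // Same pattern as in the question, written as a raw string literal.
    static const std::regex pattern(R"(\[([^\[\]]+)\])");
    std::smatch result;
    if (std::regex_search(s, result, pattern))
        return result[1].str();
    return std::string();
}
```

This mirrors what the JavaScript `match` call does: `first_bracketed("[game.exe+009E820C]+338")` yields `game.exe+009E820C`.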

Regex error at runtime Visual Studio 2019 [duplicate]

I would like to use regular expression from here:
https://www.rfc-editor.org/rfc/rfc3986#appendix-B
I am trying to compile it like this:
#include <regex.h>
...
regex_t regexp;
if((regcomp(&regexp, "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", REG_EXTENDED)) != 0){
return SOME_ERROR;
}
But I am stuck with return value of regcomp:
REG_BADRPT
According to man it means:
Invalid use of repetition operators such as using * as the first character.
Similar meaning at this man:
?, * or + is not preceded by valid regular expression
I wrote a parser using my own regular expression, but I would like to test this one too, since it is officially in the RFC. I do not intend to use it for validation though.
As Oli Charlesworth suggested, you need to double the backslash (\\?) so that the regex engine receives \? for the literal question mark. See C++ escape sequences for more information.
test program
#include <regex.h>
#include <iostream>
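// Compile rx as a POSIX extended regex and report whether regcomp accepted it.
void test_regcomp(const char *rx){
    regex_t regexp;
    if((regcomp(&regexp, rx, REG_EXTENDED)) != 0){
        std::cout << "ERROR :" << rx <<"\n";
    }
    else{
        std::cout << " OK :"<< rx <<"\n";
        regfree(&regexp); // release the compiled pattern
    }
}
int main()
{
    // rx1: "\?" is the C++ escape sequence for a plain '?', so regcomp sees "(?" and rejects it.
    const char *rx1 = "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?" ;
    // rx2: "\\\?" yields the two characters \? so regcomp sees an escaped question mark.
    const char *rx2 = "^(([^:/\?#]+):)\?(//([^/\?#]*))\?([^\?#]*)(\\\?([^#]*))\?(#(.*))\?" ;
    test_regcomp(rx1);
    test_regcomp(rx2);
    return 0;
}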
output
ERROR :^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(?([^#]*))?(#(.*))?
OK :^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
The \? in your regex is the source of the REG_BADRPT error: the C++ compiler converts it to a plain ?. If you replace it with \\?, regcomp will be able to compile your regex.
"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"
OK :^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
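If a C++11 compiler is available, a raw string literal lets the RFC pattern be pasted in exactly as written, with no extra escaping. The sketch below assumes a POSIX `<regex.h>`; `uri_group` is a hypothetical helper that returns one capture group of the Appendix B pattern.

```cpp
#include <regex.h>
#include <cstddef>
#include <string>

// Hypothetical helper: applies the RFC 3986 Appendix B pattern to `uri` and
// returns capture group `group` (2 = scheme, 4 = authority, 5 = path,
// 7 = query, 9 = fragment), or "" if the group did not participate.
std::string uri_group(const std::string& uri, std::size_t group)
{
    // Raw string literal: regcomp receives \? exactly as typed here.
    static const char* const pattern =
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)";

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return std::string();

    regmatch_t m[10];  // whole match plus nine groups
    std::string out;
    if (group < 10 &&
        regexec(&re, uri.c_str(), 10, m, 0) == 0 &&
        m[group].rm_so != -1) {
        out = uri.substr(m[group].rm_so, m[group].rm_eo - m[group].rm_so);
    }

    regfree(&re);  // always release the compiled pattern
    return out;
}
```

For example, `uri_group("http://example.com/a/b?x=1#top", 7)` returns the query part `x=1`.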

regex visual studio

I was planning to use the following regex to capture path and name of a file:
std::regex capture_path_name_file("(.+)\\([^\\]+)\\.[^\\]+$");
but when running (i'm using visual studio) i get the regex error
error_brack: the expression contained mismatched [ and ]
Trying to pinpoint the cause i tried the following regex:
std::regex test("[^\\]");
and I got the same error.
I have tested my regex in regex101.com (with the slight difference that i had to use \. instead of \\.)
Thanks for any help.
The issue is that \\ in an ordinary string literal denotes a single literal \ character. As Biffen explained in his comment, the regex engine therefore receives [^\], where \] is an escaped literal ] rather than the closing delimiter of the character class, and there is no further ] to close the class.
The right answer is: use _splitpath_s.
And if you want to further play with regex, you can fix it like this:
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex rex1(R"((.+?)([^\\.]+\.[^\\.]+)$)");
std::smatch m;
std::string str = "c:\\Python27\\REGEX\\test_regex.py";
if (regex_search(str, m, rex1)) {
std::cout << "Path: " << m[1] << std::endl;
std::cout << "File name: " << m[2] << std::endl;
}
return 0;
}
Using raw string literals, you can avoid the majority of issues related to escaping. Use R"((.+?)([^\\.]+\.[^\\.]+)$)", it will match and capture into Group 1 the file folder path, and it will capture into Group 2 the file name with extension. Note that the extension must be present.
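To make the escaping rule concrete, here is a small sketch comparing the two spellings of a "not a backslash" class. The names `escaped`, `raw` and `last_component` are hypothetical and only serve the illustration.

```cpp
#include <regex>
#include <string>

// The regex text  [^\\]+  (six characters) written two ways.
const std::string escaped = "[^\\\\]+";   // each regex backslash doubled again for C++
const std::string raw     = R"([^\\]+)";  // raw literal: typed as the regex engine sees it

// Hypothetical helper: returns the trailing run of non-backslash characters,
// which is the file name for a Windows-style path.
std::string last_component(const std::string& path)
{
    static const std::regex re(raw + "$");
    std::smatch m;
    return std::regex_search(path, m, re) ? m.str() : std::string();
}
```

Four backslashes in an ordinary literal become two in the regex text, which the regex engine in turn reads as one literal backslash inside the class; the raw literal skips the first of those two steps.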

What is the regular expression to get a token of a URL?

Say I have strings like these:
bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff
What is the regular expression to match The_Token_I_Want, another_token, YET_ANOTHER_TOKEN?
Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it for your case
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                     #######
This leaves The_Token_I_Want in $6, which is the “hashderlined” subexpression above. (Note that the hashes are not part of the pattern.) See it live:
#! /usr/bin/perl
$_ = "http://domain.com/133742/The_Token_I_Want.zip";
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
print "$6\n";
}
else {
print "no match\n";
}
Output:
$ ./prog.pl
The_Token_I_Want
UPDATE: I see in a comment that you're using boost::regex, so remember to escape the backslash in your C++ program.
#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
int main()
{
boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
"/([^.]+)"
// ####### I CAN HAZ HASHDERLINE PLZ
"[^?#]*)(\\?([^#]*))?(#(.*))?");
const char * const urls[] = {
"http://domain.com/133742/The_Token_I_Want.zip",
"http://domain.com/12345/another_token.zip",
"http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
};
BOOST_FOREACH(const char *url, urls) {
std::cout << url << ":\n";
std::string t;
boost::cmatch m;
if (boost::regex_match(url, m, token))
t = m[6];
else
t = "<no match>";
std::cout << " - " << t << '\n';
}
return 0;
}
Output:
http://domain.com/133742/The_Token_I_Want.zip:
- The_Token_I_Want
http://domain.com/12345/another_token.zip:
- another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
- YET_ANOTHER_TOKEN
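The same pattern should also work with std::regex from the standard library, assuming its default ECMAScript grammar, since the expression uses no Perl-only features. A minimal sketch, with `url_token` as a hypothetical helper name and a raw string literal so the backslash does not need doubling:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 6 of the adapted URI pattern,
// i.e. the last path segment up to its first dot, or "<no match>".
std::string url_token(const std::string& url)
{
    static const std::regex token(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (std::regex_match(url, m, token) && m[6].matched)
        return m[6].str();
    return "<no match>";
}
```

Calling `url_token("http://domain.com/133742/The_Token_I_Want.zip")` returns `The_Token_I_Want`.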
/a href="http://domain.com/[0-9]+/([a-zA-Z_]+).zip"/
Might want to add more characters to [a-zA-Z_]+
You can use:
(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+).[[:alnum:]_-]+
([[:alnum:]._-]+) is a group for the matched pattern, and in your example its value will be The_Token_I_Want. to access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second group of the matched pattern.
Try this:
/(?:f|ht)tps?:/{2}(?:www.)?domain[^/]+.([^/]+).([^/]+)/i
or
/\w{3,5}:/{2}(?:w{3}.)?domain[^/]+.([^/]+).([^/]+)/i
First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to grab the token straight out of a string.
Then:
The glib answer would be:
/(The_Token_I_Want.zip)/
You might want to be a little more precise than a single example.
I'm guessing you are actually looking for:
/([^/]+)$/
m/The_Token_I_Want/
You'll have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?
It's probably best to use something smarter than a RegEx. For example, if you're using C# you could use the System.Uri class to parse it for you.