C++ RegEx out of memory - c++

I am using a regex to retrieve the string between divs in an HTML page, but I have run into an out-of-memory error. I am using Visual Studio 2012 and C++.
The regex is "class=\"ListingDescription\">((.*|\r|\n)*?(?=</div>))", and RegexBuddy reports that it matches in 242 steps (much better than the ~5000 it took originally). The website I am trying to scrape the info from is http://www.trademe.co.nz/Browse/Listing.aspx?id=557211466
Here is the code:
typedef match_results<const char*> cmatch;
tr1::cmatch results;
try {
tr1::regex regx("class=\"ListingDescription\">((.*|\\r|\\n)*?(?=</div>))");
tr1::regex_search(data.c_str(), results, regx);
cout << results[1];
}
catch (const std::regex_error& e) {
std::cout << "regex_error caught: " << e.what() << '\n';
if (e.code() == std::regex_constants::error_brack) {
std::cout << "The code was error_brack\n";
}
}
This is the error I get:
regex_error caught: regex_error(error_stack): There was insufficient memory to determine whether the regular expression could match the specified character sequence.
RegexBuddy works fine, and so do some online regex tools, just not my code :( Please help.

You are using a . inside a group that can repeat many times, so it will match every <, including the one that starts </div>, which is probably not what you want.
And now the mandatory link: "RegEx match open tags except XHTML self-contained tags".
Using regexes to parse HTML is generally a bad idea. You should use an HTML parser instead.
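One way to avoid the nested repetition, offered as a minimal sketch rather than a guaranteed fix: replace the (.*|\r|\n)*? group with a single lazy character class, [\s\S]*?, and match the closing </div> directly. The function name extract_description is made up for this example, and std::regex from C++11 is assumed in place of tr1. A backtracking implementation may still run out of stack on a very large page, so the error handler is kept.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text between the ListingDescription
// opening tag and the next </div>, or an empty string if nothing matches.
std::string extract_description(const std::string& data) {
    try {
        // [\s\S] matches any character, including \r and \n, so one lazy
        // repetition replaces the nested (.*|\r|\n)*? group.
        const std::regex regx("class=\"ListingDescription\">([\\s\\S]*?)</div>");
        std::smatch results;
        if (std::regex_search(data, results, regx)) {
            return results[1].str();
        }
    } catch (const std::regex_error&) {
        // error_stack or error_complexity can still occur on large inputs.
    }
    return std::string();
}
```

An empty return value means either no match or a regex error; a real caller would probably want to tell the two apart.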

I see now. Regex is pretty limited in some areas. I will have a look at parsers and try them out. What I have done in the meantime is:
std::string startstr = "<div id=\"ListingDescription_ListingDescription\" class=\"ListingDescription\">";
std::string desc;
std::string::size_type startpos = data.find(startstr);
if (startpos != std::string::npos) {
startpos += startstr.size(); // skip past the opening tag
std::string::size_type endpos = data.find("</div>", startpos);
if (endpos != std::string::npos)
desc = data.substr(startpos, endpos - startpos);
}
LOL, I know it's not great, but it works.
Thanks Clement Bellot

Related

Composing complex regular expressions with "DEFINED" subexpressions in C++

I'm trying to write a regular expression in C++ to match a base64 encoded string. I'm quite familiar with writing complex regular expressions in Perl so I started with that:
use strict;
use warnings;
my $base64_regex = qr{
(?(DEFINE)
(?<B64>[A-Za-z0-9+/])
(?<B16>[AEIMQUYcgkosw048])
(?<B04>[AQgw])
)
^(
((?&B64){4})*
(
(?&B64){4}|
(?&B64){2}(?&B16)=|
(?&B64)(?&B04)={2}
)
)?$}x;
# "Hello World!" base64 encoded
my $base64 = "SGVsbG8gV29ybGQh";
if ($base64 =~ $base64_regex)
{
print "Match!\n";
}
else
{
print "No match!\n"
}
Output:
Match!
I then tried to implement a similar regular expression in C++:
#include <iostream>
#include <regex>
int main()
{
std::regex base64_regex(
"(?(DEFINE)"
"(?<B64>[A-Za-z0-9+/])"
"(?<B16>[AEIMQUYcgkosw048])"
"(?<B04>[AQgw])"
")"
"^("
"((?&B64){4})*"
"("
"(?&B64){4}|"
"(?&B64){2}(?&B16)=|"
"(?&B64)(?&B04)={2}"
")"
")?$");
// "Hello World!" base64 encoded
std::string base64 = "SGVsbG8gV29ybGQh";
if (std::regex_match(base64, base64_regex))
{
std::cout << "Match!" << std::endl;
}
else
{
std::cout << "No Match!" << std::endl;
}
}
but when I run the code I get an exception telling me it is not a valid regular expression.
Catching the exception and printing the "what" string doesn't help much either. All it gives me is the following:
regex_error(error_syntax)
Obviously I could get rid of the "DEFINE" block with my pre-defined subpatterns, but that would make the whole expression very difficult to read... and, well... I like to be able to maintain my own code when I come back to it a few years later lol so that isn't really a good option.
How can I get a similar regular expression to work in C++?
Note: This must all be done within a single "std::regex" object because I am writing a library where users will be able to pass a string to be able to define their own regular expressions and I want these users to be able to "DEFINE" similar subexpressions within their regex if they need to.
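For what it's worth, std::regex rejects the pattern because its default ECMAScript grammar has no (?(DEFINE)...) blocks, named groups, or (?&name) calls. If the sub-patterns only need to be readable in your own source, one option is to compose the pattern from std::string pieces. A minimal sketch follows; make_base64_regex and is_base64 are names made up here, and this does not give library users a DEFINE syntax of their own.

```cpp
#include <regex>
#include <string>

// Builds the base64 pattern from reusable pieces whose names mirror the
// Perl DEFINE block (B64, B16, B04).
std::regex make_base64_regex() {
    const std::string b64 = "[A-Za-z0-9+/]";
    const std::string b16 = "[AEIMQUYcgkosw048]";
    const std::string b04 = "[AQgw]";
    const std::string full_quad = "(?:" + b64 + "{4})";
    const std::string last_quad =
        "(?:" + b64 + "{4}|" +
        b64 + "{2}" + b16 + "=|" +
        b64 + b04 + "={2})";
    return std::regex("^(?:" + full_quad + "*" + last_quad + ")?$");
}

// Hypothetical convenience wrapper around the regex above.
bool is_base64(const std::string& s) {
    static const std::regex re = make_base64_regex();
    return std::regex_match(s, re);
}
```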
How about string concatenation?
#define B64 "[A-Za-z0-9+/]"
#define B16 "[AEIMQUYcgkosw048]"
#define B04 "[AQgw]"
std::regex base64_regex(
"^("
"(" B64 "{4})*"
"("
B64 "{4}|"
B64 "{2}" B16 "=|"
B64 B04 "={2}"
")"
")?$");
I took a suggestion from the comments and checked out "boost" regex since it supports "Perl" regular expressions. I gave it a try and it worked great!
#include <boost/regex.hpp>
boost::regex base64_regex(
"(?(DEFINE)"
"(?<B64>[A-Za-z0-9+/])"
"(?<B16>[AEIMQUYcgkosw048])"
"(?<B04>[AQgw])"
")"
"("
"((?&B64){4})*"
"("
"(?&B64){4}|"
"(?&B64){2}(?&B16)=|"
"(?&B64)(?&B04)={2}"
")"
")?", boost::regex::perl);

How to match absolute value using regex

I am having trouble with absolute value in regex in C++. This is what I have as the pattern:
std::tr1::regex loadAbsNM("load -|M\\((\\d+)\\)|"); // load -|M(x)|
I am trying to use std::tr1::regex_match( IR, result, loadAbsNM ) to match, but it does not match anything, even though it should.
I'm using the Visual Studio 2010 compiler.
Shortened version of the program (iostream and regex are included above):
int main()
{
std::string IR = "load -|M(x)|";
std::smatch result;
std::tr1::regex loadAbsNM("load -|M\\((\\d+)\\)|");
if( std::tr1::regex_match( IR , result, loadAbsNM ) )
{
int x = 2;
std::cout << "matched!" << std::endl;
}
else
{
std::cout << "!UNABLE TO DECODE INSTRUCTION!" << std::endl;
}
}
output produced
!UNABLE TO DECODE INSTRUCTION!
Note that your code is not going to produce a match as written: the letter x does not match the regex \d+.
Also, you do need a backslash in front of each pipe character. An unescaped pipe (|) separates alternatives: (a|b) means a or b.
Finally, since there is a pipe at the end, the expression also matches the empty string, which is often a bad idea.
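The alternation reading can be checked directly. Here is a small sketch, assuming std::regex from C++11 rather than std::tr1; the function name matches_unescaped is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: tests a string against the pattern from the question,
// with its pipes left unescaped.
bool matches_unescaped(const std::string& s) {
    // Read as three alternatives: "load -", "M(<digits>)", or nothing at all.
    static const std::regex loadAbsNM("load -|M\\((\\d+)\\)|");
    return std::regex_match(s, loadAbsNM);
}
```

With this pattern, "load -", "M(123)" and the empty string each match on their own, while "load -|M(123)|" does not, because no single alternative covers the whole string.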
I would suggest something like this:
"load -\\|M\\((\\d+)\\)\\|"
But that won't match:
"load -|M(x)|"
You'd need to use a number instead of 'x' as in:
"load -|M(123)|"

How to check which matching group was used to match (boost-regex)

I'm using boost::regex to parse a formatting string in which the '%' symbol is the escape character. Because I do not have much experience with boost::regex, or with regex at all to be honest, I am working by trial and error. This code is a prototype that I came up with.
std::string regex_string =
"(?:%d\\{(.*)\\})|" //this group will catch string for formatting time
"(?:%([hHmMsSqQtTlLcCxXmMnNpP]))|" //symbols that have some meaning
"(?:\\{(.*?)\\})|" //some other groups
"(?:%(.*?)\\s)|"
"(?:([^%]*))";
boost::regex regex;
boost::smatch match;
try
{
regex.assign(regex_string, boost::regex_constants::icase);
boost::sregex_iterator res(pattern.begin(), pattern.end(), regex);
//pattern in line above is string which I'm parsing
boost::sregex_iterator end;
for(; res != end; ++res)
{
match = *res;
output << match.get_last_closed_paren();
//I want to know if the thing that was just written to output is from group describing time string
output << "\n";
}
}
catch(boost::regex_error &e)
{
output<<"regex error\n";
}
And this works pretty well: the output contains exactly what I want to catch. But I do not know which group it came from. I could do something like match[index_of_time_group]!="" but this is fragile and doesn't look good. If I change regex_string, the index of the group that catches the time-formatting string could also change.
Is there a neat way to do this? Something like naming groups? I'll be grateful for any help.
You can use the bool member boost::sub_match::matched:
if(match[index_of_time_group].matched) process_it(match);
It is also possible to use named groups in the regexp, like (?<name_of_group>.*), and with this the line above could be changed to:
if(match["name_of_group"].matched) process_it(match);
Dynamically build regex_string from pairs of name/pattern, and return a name->index mapping as well as the regex. Then write some code that determines if the match comes from a given name.
If you are insane, you can do it at compile time (the mapping from tag to index that is). It isn't worth it.
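A rough sketch of that name-to-index idea, using std::regex from the C++11 standard library instead of boost::regex (an assumption; the matched member of sub_match is used the same way in both). The type NamedAlternatives, the functions build_alternatives, matched_name and tokenize, and the three simplified sub-patterns are all invented for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> NamedPattern;  // (name, sub-pattern)

// Hypothetical helper type: a regex plus the name attached to each group.
struct NamedAlternatives {
    std::regex re;
    std::vector<std::string> names;  // names[i] labels capture group i + 1
};

// Joins the sub-patterns as (p1)|(p2)|... Assumes no sub-pattern contains a
// capturing group of its own; use (?:...) inside them so the indices line up.
NamedAlternatives build_alternatives(const std::vector<NamedPattern>& parts) {
    NamedAlternatives out;
    std::string pattern;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) pattern += '|';
        pattern += '(' + parts[i].second + ')';
        out.names.push_back(parts[i].first);
    }
    out.re.assign(pattern);
    return out;
}

// Returns the name of the alternative that produced the match.
std::string matched_name(const NamedAlternatives& alts, const std::smatch& m) {
    for (std::size_t i = 0; i < alts.names.size(); ++i) {
        if (m[i + 1].matched) return alts.names[i];
    }
    return std::string();
}

// Splits `input` into (name, text) tokens using three sample alternatives.
std::vector<NamedPattern> tokenize(const std::string& input) {
    std::vector<NamedPattern> parts;
    parts.push_back(NamedPattern("time", "%d\\{[^}]*\\}"));
    parts.push_back(NamedPattern("symbol", "%[hHmMsS]"));
    parts.push_back(NamedPattern("text", "[^%]+"));
    const NamedAlternatives alts = build_alternatives(parts);

    std::vector<NamedPattern> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), alts.re), end;
         it != end; ++it) {
        tokens.push_back(NamedPattern(matched_name(alts, *it), it->str()));
    }
    return tokens;
}
```

Because the names travel with the regex, reordering or adding alternatives does not require touching any hard-coded group index.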

What is the regular expression to get a token of a URL?

Say I have strings like these:
bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff
What is the regular expression to match The_Token_I_Want, another_token, YET_ANOTHER_TOKEN?
Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it for your case:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                     #######
This leaves The_Token_I_Want in $6, which is the “hashderlined” subexpression above. (Note that the hashes are not part of the pattern.) See it live:
#! /usr/bin/perl
$_ = "http://domain.com/133742/The_Token_I_Want.zip";
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
print "$6\n";
}
else {
print "no match\n";
}
Output:
$ ./prog.pl
The_Token_I_Want
UPDATE: I see in a comment that you're using boost::regex, so remember to escape the backslash in your C++ program.
#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
int main()
{
boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
"/([^.]+)"
// ####### I CAN HAZ HASHDERLINE PLZ
"[^?#]*)(\\?([^#]*))?(#(.*))?");
const char * const urls[] = {
"http://domain.com/133742/The_Token_I_Want.zip",
"http://domain.com/12345/another_token.zip",
"http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
};
BOOST_FOREACH(const char *url, urls) {
std::cout << url << ":\n";
std::string t;
boost::cmatch m;
if (boost::regex_match(url, m, token))
t = m[6];
else
t = "<no match>";
std::cout << " - " << t << '\n';
}
return 0;
}
Output:
http://domain.com/133742/The_Token_I_Want.zip:
- The_Token_I_Want
http://domain.com/12345/another_token.zip:
- another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
- YET_ANOTHER_TOKEN
/a href="http://domain.com/[0-9]+/([a-zA-Z_]+).zip"/
Might want to add more characters to [a-zA-Z_]+
You can use:
(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+)\.[[:alnum:]_-]+
([[:alnum:]._-]+) is the group for the token, and in your example its value will be The_Token_I_Want. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second. Note that the dot before the extension is escaped; unescaped, it would match any character and the group would swallow part of the extension.
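To illustrate the group numbering from C++, here is a minimal sketch that assumes std::regex (C++11) rather than boost::regex, and uses the pattern with the dot before the extension escaped. The function name extract_token is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the file-name token of the first http/ftp URL
// found in `html`, or an empty string if there is none.
std::string extract_token(const std::string& html) {
    static const std::regex url(
        "(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+)\\.[[:alnum:]_-]+");
    std::smatch m;
    if (std::regex_search(html, m, url)) {
        return m[2].str();  // group 1 is the scheme, group 2 is the token
    }
    return std::string();
}
```

regex_search is used instead of regex_match because the URL sits in the middle of other HTML text.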
Try this:
/(?:f|ht)tps?:/{2}(?:www.)?domain[^/]+.([^/]+).([^/]+)/i
or
/\w{3,5}:/{2}(?:w{3}.)?domain[^/]+.([^/]+).([^/]+)/i
First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to grab the token straight out of a string.
Then:
The glib answer would be:
/(The_Token_I_Want.zip)/
You might want to be a little more precise than a single example.
I'm guessing you are actually looking for:
/([^/]+)$/
m/The_Token_I_Want/
You'll have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?
It's probably best to use something smarter than a RegEx. For example, if you're using C# you could use the System.Uri class to parse it for you.

PCRECPP (pcre) extract hostname from url code problem

I have this simple piece of code in c++:
int main(void)
{
string text = "http://www.amazon.com";
string a,b,c,d,e,f;
pcrecpp::RE re("^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?#)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??([^#]+)?#?(\\w*)");
if(re.PartialMatch(text, &a,&b,&c,&d,&e,&f))
{
std::cout << "match: " << f << "\n";
// should print "www.amazon.com"
}else{
std::cout << "no match. \n";
}
return 0;
}
When I run this it doesn't find a match.
I'm pretty sure that the regex pattern is correct and that my code is what's wrong.
If anyone familiar with pcrecpp can take a look at this, I'll be grateful.
EDIT:
Thanks to Dingo, it works great.
Another issue I had is that the result was in the sixth place, "f".
I edited the code above so you can copy/paste if you wish.
The problem is that your code contains ??( which is a trigraph in C++ for [. You'll either need to disable trigraphs or do something to break them up like:
pcrecpp::RE re("^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?#)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??" "([^#]+)?#?(\\w*)");
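Trigraph replacement happens before string literals are tokenized, so on compilers where trigraphs are enabled the sequence is rewritten even inside quotes. As a small sketch, here are two ways to keep it from forming. The fragment is a shortened, hypothetical piece of the pattern rather than the full URL regex, and the constant names are made up. Both constants end up holding the same characters.

```cpp
#include <string>

// Shortened, hypothetical fragment of the pattern from the question.

// Option 1: split the literal so the two question marks and the opening
// parenthesis sit in separate adjacent literals; the compiler joins them.
const std::string split_literal = "([^\\?#]+)?\\??" "([^#]+)?";

// Option 2: write the second question mark with the \? escape sequence,
// which produces an ordinary question mark character.
const std::string escaped_qmark = "([^\\?#]+)?\\?\?([^#]+)?";
```

Either form leaves the regex text unchanged; only the way it is spelled in the source differs.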
Please do
cout << re.pattern() << endl;
to double-check that all your double-slashing is done right (and also post the result).
Looks like
^((\w+):///?)?((\w+):?(\w+)?#)?([^/\?:]+):?(\d+)?(/?[^\?#;\|]+)?([;\|])?([^\?#]+)?\??([^#]+)?#?(\w*)
The hostname isn't going to be returned from the first capture group. Why are you using parentheses around, for example, \w+ when you don't want to capture it?