PCRECPP (pcre) extract hostname from url code problem - c++

I have this simple piece of code in c++:
#include <iostream>
#include <string>
#include <pcrecpp.h>
using namespace std;

int main(void)
{
    string text = "http://www.amazon.com";
    string a, b, c, d, e, f;
    pcrecpp::RE re("^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?#)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??([^#]+)?#?(\\w*)");
    if (re.PartialMatch(text, &a, &b, &c, &d, &e, &f))
    {
        std::cout << "match: " << f << "\n";
        // should print "www.amazon.com"
    }
    else
    {
        std::cout << "no match. \n";
    }
    return 0;
}
When I run this, it doesn't find a match.
I'm pretty sure that the regex pattern is correct and that my code is what's wrong.
If anyone familiar with pcrecpp could take a look at this, I'd be grateful.
EDIT:
Thanks to Dingo, it works great.
Another issue I had was that the result ended up in the sixth variable, "f".
I edited the code above so you can copy and paste it if you wish.

The problem is that your pattern contains ??(, which C++ treats as a trigraph for [. You'll either need to disable trigraphs or break the sequence up, for example:
pcrecpp::RE re("^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?#)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??" "([^#]+)?#?(\\w*)");

Please do
cout << re.pattern() << endl;
to double-check that all your double-slashing is done right (and also post the result).
Looks like
^((\w+):///?)?((\w+):?(\w+)?#)?([^/\?:]+):?(\d+)?(/?[^\?#;\|]+)?([;\|])?([^\?#]+)?\??([^#]+)?#?(\w*)
The hostname isn't going to be returned from the first capture group. Why are you putting parentheses around, for example, \w+ when you don't want to capture it?

Related

QRegexp Missing Digits

We are all stumped on this one:
QRegExp kcc_stationing("(-)?(\\d+)\\.(\\d+)[^a-zA-Z]");
QString str;
if (kcc_stationing.indexIn(description) > -1)
{
str = kcc_stationing.cap(1) + kcc_stationing.cap(2) + "." + kcc_stationing.cap(3);
qDebug() << kcc_stationing.cap(1);
qDebug() << kcc_stationing.cap(2);
qDebug() << kcc_stationing.cap(3);
qDebug() << "Description: " << description;
qDebug() << "Returned Stationing string: " << str;
}
Running this code on "1082.006":
Note the missing "6"
After some blind guessing, we removed [^a-zA-Z] and got the correct answer. We had originally added it to reject any number with other characters attached directly, without spaces.
For example, 10.05D should be rejected.
Can anyone explain why this extra piece was causing us to lose that last "6"?
[^a-zA-Z] is a character class, and a character class always matches exactly one character. It cannot match at the end of the string, because there is no character there.
To produce a match, the engine first lets \\d+ consume all the digits, including the last one. With nothing left for the character class, it then backtracks: \\d+ gives up the final "6" so that [^a-zA-Z] can match it, which is why that digit is missing from the third capture.
I think you want to allow a zero-width match (specifically at the end of the string). In your case, it would be easiest to use:
(-)?(\\d+)\\.(\\d+)([^a-zA-Z]|$)
Or, if Qt supports non-capturing groups:
(-)?(\\d+)\\.(\\d+)(?:[^a-zA-Z]|$)
Note that I also recommend using [.] instead of \\., since I feel it improves readability.

Determining the location of C++11 regular expression matches

How do I efficiently determine the location of a capture group inside a searched string? Getting the location of the entire match is easy, but I see no obvious ways to get at capture groups beyond the first.
This is a simplified example; let's presume "a*" and "b*" are complicated regexes that are expensive to run.
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main()
{
regex matcher("a*(needle)b*");
smatch findings;
string haystack("aaaaaaaaneedlebbbbbbbbbbbbbb");
if( regex_match(haystack, findings, matcher) )
{
// What do I put here to get the offset of "needle" in the
// string haystack?
// This is the position of the entire match, which is always 0
// with regex_match (though not necessarily with regex_search).
cout << "smatch::position - " << findings.position() << endl;
// Is this just a string or what? Are there member functions
// that can be called?
cout << "Needle - " << findings[1] << endl;
}
return 0;
}
If it helps I built this question in Coliru: http://coliru.stacked-crooked.com/a/885a6b694d32d9b5
I will not mark this as an answer until 72 hours have passed and no better answers are present.
Before asking this I presumed smatch::position took no arguments I cared about, because when I read the cppreference page the "sub" parameter was not obviously an index into the container of matches. I thought it had something to do with "sub"strings and the offset value of the whole match.
So my answer is:
cout << "Needle Position- " << findings.position(1) << endl;
Any explanation on this design, or other issues my line of thinking may have caused would be appreciated.
According to the documentation, you can access the iterators pointing to the beginning and the end of the captured text via match[n].first and match[n].second. To get the start and end indices, just do iterator arithmetic with haystack.begin().
if (findings[1].matched) {
cout << "[" << findings[1].first - haystack.begin() << "-"
<< findings[1].second - haystack.begin() << "] "
<< findings[1] << endl;
}
Except for the main match (index 0), a capturing group may or may not capture anything. When it does not, its first and second both point to the end of the string.
I also demonstrate the matched member of the sub_match object. While the check is unnecessary in this case, in general you need to test whether a capturing group matched anything before printing its indices.

Get smallest match using std::regex in C++

I have this code:
std::smatch m;
std::string dataType = "type = struct A {\nint a;\nint b;\nint c; }";
std::regex_search(dataType, m, std::regex("(= )(.*)( [{]|\n|$)"));
std::cout << m.str(2) << std::endl;
The problem is that it returns the longest match, but I need the smallest.
The output is:
struct A {\n
int a;\n
int b;\n
int c; }
But it needs to be:
struct A
How can I get the result I want?
You could change .* to .*?.
Read up on greediness.
(Did you really mean to put a literal newline in the regex? You probably want to make that \\n.)
To get "struct A" from your text, the regex to be used is:
\=\s(\w+\s\w+)
If you have other cases, please give more details or examples of your input and what your output should look like.
Edit:
Thanks to user3259253, the correct solution is:
\\=\\s(\\w+(\\s\\w+)?)
Right, use this:
std::regex_search(dataType, m, std::regex("(= )(.*)([\\{]|\\n)$"));
However, if you only want to capture what's between the = sign and the curly bracket/end of line, you don't need so many () groups. This would be enough:
std::regex_search(dataType, m, std::regex("= (.*?)(?: [\\{]|\\n|$)"));
std::cout << m.str(1) << std::endl;
Note that here we're capturing the text as the first entry, m.str(1), instead of the second, because we have eliminated "(= )".

C++ RegEx out of memory

I am using a regex to retrieve a string from between divs in an HTML page, but I have run into an out-of-memory error. I am using Visual Studio 2012 and C++.
The regular expression is "class=\"ListingDescription\">((.*|\r|\n)*?(?=</div>))", and RegexBuddy reckons it matches in 242 steps (much better than the ~5000 it took originally). The website I am trying to scrape the info from is http://www.trademe.co.nz/Browse/Listing.aspx?id=557211466
Here is the code:
typedef match_results<const char*> cmatch;
tr1::cmatch results;
try {
tr1::regex regx("class=\"ListingDescription\">((.*|\\r|\\n)*?(?=</div>))");
tr1::regex_search(data.c_str(), results, regx);
cout << results[1];
}
catch (const std::regex_error& e) {
std::cout << "regex_error caught: " << e.what() << '\n';
if (e.code() == std::regex_constants::error_brack) {
std::cout << "The code was error_brack\n";
}
}
This is the error I get:
regex_error caught: regex_error(error_stack): There was insufficient memory to determine whether the regular expression could match the specified character sequence.
RegexBuddy works fine, and so do some online regex tools, just not my code :( Please help.
You are using . inside a group that can repeat, so it will match every <, including the one that starts </div>, which is probably not what you want.
And now the mandatory link: "RegEx match open tags except XHTML self-contained tags".
Using regular expressions to parse HTML is generally a bad idea. You should use an HTML parser instead.
I see now. Regex is pretty limited in some areas. I will have a look at parsers and try them out. What I have done in the meantime is:
std::string startstr = "<div id=\"ListingDescription_ListingDescription\" class=\"ListingDescription\">";
// find() returns std::string::npos when nothing is found; this snippet assumes both markers are present.
std::string::size_type startpos = data.find(startstr) + startstr.size();
std::string::size_type endpos = data.find("</div>", startpos);
std::string desc = data.substr(startpos, endpos - startpos);
LOL, I know it's not great, but it works.
Thanks Clement Bellot

What is the regular expression to get a token of a URL?

Say I have strings like these:
bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff
What is the regular expression to match The_Token_I_Want, another_token, YET_ANOTHER_TOKEN?
Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it for your case:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
                                     #######
This leaves The_Token_I_Want in $6, which is the “hashderlined” subexpression above. (Note that the hashes are not part of the pattern.) See it live:
#! /usr/bin/perl
$_ = "http://domain.com/133742/The_Token_I_Want.zip";
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
print "$6\n";
}
else {
print "no match\n";
}
Output:
$ ./prog.pl
The_Token_I_Want
UPDATE: I see in a comment that you're using boost::regex, so remember to escape the backslash in your C++ program.
#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
int main()
{
boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
"/([^.]+)"
// ####### I CAN HAZ HASHDERLINE PLZ
"[^?#]*)(\\?([^#]*))?(#(.*))?");
const char * const urls[] = {
"http://domain.com/133742/The_Token_I_Want.zip",
"http://domain.com/12345/another_token.zip",
"http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
};
BOOST_FOREACH(const char *url, urls) {
std::cout << url << ":\n";
std::string t;
boost::cmatch m;
if (boost::regex_match(url, m, token))
t = m[6];
else
t = "<no match>";
std::cout << " - " << t << '\n';
}
return 0;
}
Output:
http://domain.com/133742/The_Token_I_Want.zip:
- The_Token_I_Want
http://domain.com/12345/another_token.zip:
- another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
- YET_ANOTHER_TOKEN
/a href="http://domain.com/[0-9]+/([a-zA-Z_]+).zip"/
You might want to add more characters to [a-zA-Z_]+.
You can use:
(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+).[[:alnum:]_-]+
([[:alnum:]._-]+) is a group for the matched pattern, and in your example its value will be The_Token_I_Want. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second group of the matched pattern.
Try this:
/(?:f|ht)tps?:/{2}(?:www.)?domain[^/]+.([^/]+).([^/]+)/i
or
/\w{3,5}:/{2}(?:w{3}.)?domain[^/]+.([^/]+).([^/]+)/i
First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to grab the token straight out of a string.
Then:
The glib answer would be:
/(The_Token_I_Want.zip)/
You might want to be a little more precise than a single example.
I'm guessing you are actually looking for:
/([^/]+)$/
m/The_Token_I_Want/
You'll have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?
It's probably best to use something smarter than a RegEx. For example, if you're using C# you could use the System.Uri class to parse it for you.