C++ regex - get a link from HTML code

#include <iostream>
#include <stdio.h>
#include <string.h>
#include <regex>
using namespace std;
int main(int argc, char* argv[]) {
string test = "<html><div><script>var link = "http://example.com/?key=dynamic_key";</script></div></html>";
regex re("http://example.com/(*)");
smatch match;
if (regex_search(test, match, re)) {
cout<<"OK"<<endl;
}
return 0;
}
This is the command I use to compile it:
root# g++ test.cpp -o test -std=gnu++11
This program does not work. How do I get the link (using regex) from the HTML code? Please help me.

Your string construction is incorrect, see the " escaping:
string test = "<html><div><script>var link = \"http://example.com/?key=dynamic_key\";</script></div></html>";
And I would use this regex:
http:\/\/example.com[^"]*
which selects only this:
http://example.com/?key=dynamic_key

I see two problems with your code.
The first is that you are trying to put quotes " inside quotes without escaping them.
You need to do: "escape your \"quotes\" properly" (note the \").
Also, your regex was not quite right: * needs to follow a matchable item (like [^"], meaning any character that is not a quote):
#include <iostream>
#include <stdio.h>
#include <string.h>
#include <regex>
using namespace std;
int main(int argc, char* argv[]) {
//string test = "<html><div><script>var link = "http://example.com/?key=dynamic_key";</script></div></html>";
string test = "<html><div><script>var link = \"http://example.com/?key=dynamic_key\";</script></div></html>";
//regex re("http://example.com/(*)");
regex re("http://example.com/([^\"]*)"); // NOTE the escape \"
smatch match;
if (regex_search(test, match, re)) {
cout<<"OK"<<endl;
cout << match.str(1) << '\n'; // first capture group
}
return 0;
}
Output:
OK
?key=dynamic_key

I think there are two errors here:
The test string is incorrectly delimited. Try using raw string literals.
The regex isn't quite right either (I assume you want to match the full link).
One further warning: regex and HTML don't always work well together.
Sample code listing:
#include <iostream>
#include <stdio.h>
#include <string.h>
#include <regex>
using namespace std;
int main(int argc, char* argv[]) {
string test = R"(<html><div><script>var link = "http://example.com/?key=dynamic_key";</script></div></html>)";
regex re( R"(http://example\.com/[^"]*)" );
smatch match;
if (regex_search(test, match, re)) {
cout << "OK" << endl;
for (auto i : match) {
cout << i << endl;
}
}
return 0;
}
And the output here is:
OK
http://example.com/?key=dynamic_key

Related

Boost regex cpp for finding strings between %% with output excluding the % character itself

I am having a problem with Boost regex in C++. I want to match a string like
"Hello %world% regex %cpp%", and the expected output is world, cpp.
Can somebody suggest a regex for this?
Thanks
Anil
I personally prefer "\\%([^\\%]*)\\%" (or as a raw string R"r(\%([^\%]*)\%)r").
It doesn't rely on non-greedy quantifiers.
It is essentially:
one percent character: \\%
any number of non-percent characters: [^\\%]*
one percent character: \\%
I know this is tagged boost, but here's a solution with std::regex:
#include <string>
#include <regex>
#include <iostream>
int main()
{
using namespace std;
string source = "Hello %world%";
regex match_percent_enclosed (R"_(\%([^\%]*)\%)_");
smatch between_percent;
bool found_match = regex_search(source,between_percent,match_percent_enclosed);
if(found_match && between_percent.size()>1)
cout << "found: \"" << between_percent[1].str() << "\"." << endl;
else
cout << "no match found." << endl;
}
This may give you an idea:
%(.+?)%
Result:
Match 1
1. world
Match 2
1. cpp
You can use this regex: \%(.*?)\% (the non-greedy .*? captures the smallest possible group).
Online regex: https://regex101.com/r/dSCE2a/2
And here is the code with Boost:
#include <iostream>
#include <cstdlib>
#include <boost/regex.hpp>
using namespace std;
int main()
{
boost::cmatch mat;
boost::regex reg( "\\%(.*?)\\%" );
char szStr[] = "Hello %world% regex %cpp%";
const char *where = szStr;
while (boost::regex_search(where, mat, reg))
{
cout << mat[1] << endl; // 0 for whole match, 1 for the first sub-expression
where = mat[0].second; // continue searching right after this match
}
}

How do I use regex_replace?

After asking this question on SO, I realised that I needed to replace all matches within a string with another string. In my case, I want to replace all occurrences of whitespace with \s* (i.e. any number of whitespace characters will match).
So I devised the following:
#include <string>
#include <regex>
int main ()
{
const std::string someString = "here is some text";
const std::string output = std::regex_replace(someString.c_str(), std::regex("\\s+"), "\\s*");
}
This fails with the following output:
error: no matching function for call to ‘regex_replace(const char*, std::regex, const char [4])’
Working example: http://ideone.com/yEpgXy
Not to be discouraged, I headed over to cplusplus.com and found that my attempt actually matches the first prototype of the regex_replace function quite well, so I was surprised the compiler rejected it (for your reference: http://www.cplusplus.com/reference/regex/match_replace/).
So I thought I'd just run the example they provided for the function:
// regex_replace example
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
int main ()
{
std::string s ("there is a subsequence in the string\n");
std::regex e ("\\b(sub)([^ ]*)"); // matches words beginning by "sub"
// using string/c-string (3) version:
std::cout << std::regex_replace (s,e,"sub-$2");
// using range/c-string (6) version:
std::string result;
std::regex_replace (std::back_inserter(result), s.begin(), s.end(), e, "$2");
std::cout << result;
// with flags:
std::cout << std::regex_replace (s,e,"$1 and $2",std::regex_constants::format_no_copy);
std::cout << std::endl;
return 0;
}
But when I run this I get the exact same error!
Working example: http://ideone.com/yEpgXy
So either ideone.com or cplusplus.com is wrong. Rather than bang my head against the wall trying to diagnose the errors of those far wiser than me, I'm going to spare my sanity and ask.
You need to update your compiler to GCC 4.9 or later; the <regex> implementation shipped with earlier GCC releases was incomplete.
Try using Boost's regex_replace as an alternative.
Simple example: using regex_replace to keep only alphanumeric characters (plus '.', '_' and '-'):
#include <iostream>
#include <regex>
#include <string>
int main() {
// '-' goes last in the class so it is a literal; ".-_" would be read as a range.
// icase is a syntax option, so it is passed to the regex constructor, not to regex_replace.
const std::regex pattern("[^a-zA-Z0-9._-]", std::regex_constants::icase);
std::string text = "!#!e-ma.il#boomer.zx";
// Replace only the first match:
// std::string newtext = std::regex_replace(text, pattern, "X", std::regex_constants::format_first_only);
// Remove every match:
std::string newtext = std::regex_replace(text, pattern, "");
std::cout << newtext << std::endl;
return 0;
}
Run https://ideone.com/CoMq3r

C++ regex_match doesn't match

I'm using C++ on XCode. I'd like to match non-alphabet characters using regex_match but seem to be having difficulty:
#include <iostream>
#include <regex>
using namespace std;
int main(int argc, const char * argv[])
{
cout << "BY-WORD: " << regex_match("BY-WORD", regex("[^a-zA-Z]")) << endl;
cout << "BYEWORD: " << regex_match("BYEWORD", regex("[^a-zA-Z]")) << endl;
return 0;
}
which returns:
BY-WORD: 0
BYEWORD: 0
I want "BY-WORD" to be matched (because of the hyphen), but regex_match returns a 0 for both tests.
I confoosed.
regex_match tries to match the whole input string against the regular expression you provide. Since your expression would only match a single character, it will always come back false on those inputs.
You probably want regex_search instead.
regex_match() returns whether the target sequence matches the regular expression rgx. If you want to search the non-alphabet characters from the target sequence, you need regex_search():
#include <regex>
#include <iostream>
int main()
{
std::regex rx("[^a-zA-Z]");
std::smatch res;
std::string str("BY-WORD");
while (std::regex_search (str,res,rx)) {
std::cout <<res[0] << std::endl;
str = res.suffix().str();
}
}

Regex search & replace group in C++?

The best I can come up with is:
#include <boost/algorithm/string/replace.hpp>
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main() {
string dog = "scooby-doo";
boost::regex pattern("(\\w+)-doo");
boost::smatch groups;
if (boost::regex_match(dog, groups, pattern))
boost::replace_all(dog, string(groups[1]), "scrappy");
cout << dog << endl;
}
with output:
scrappy-doo
Is there a simpler way of doing this that doesn't involve doing two distinct searches? Maybe with the new C++11 features (although I'm not sure they are supported by gcc at the moment)?
std::regex_replace should do the trick. The provided example is pretty close to your problem, even to the point of showing how to shove the answer straight into cout if you want. Pasted here for posterity:
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
int main()
{
std::string text = "Quick brown fox";
std::regex vowel_re("a|e|i|o|u");
// write the results to an output iterator
std::regex_replace(std::ostreambuf_iterator<char>(std::cout),
text.begin(), text.end(), vowel_re, "*");
// construct a string holding the results
std::cout << '\n' << std::regex_replace(text, vowel_re, "[$&]") << '\n';
}

If-Then-Else Conditionals in Regular Expressions and using capturing group

I have some difficulties in understanding if-then-else conditionals in regular expressions.
After reading If-Then-Else Conditionals in Regular Expressions I decided to write a simple test. I use C++, Boost 1.38 Regex and MS VC 8.0.
I have written this program:
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
std::string str_to_modify = "123";
//std::string str_to_modify = "ttt";
boost::regex regex_to_search ("(\\d\\d\\d)");
std::string regex_format ("(?($1)$1|000)");
std::string modified_str =
boost::regex_replace(
str_to_modify,
regex_to_search,
regex_format,
boost::match_default | boost::format_all | boost::format_no_copy );
std::cout << modified_str << std::endl;
return 0;
}
I expected to get "123" if str_to_modify is "123" and "000" if str_to_modify is "ttt". However, I get ?123123|000 in the first case and nothing in the second.
Could you tell me, please, what is wrong with my test?
The second example, which still doesn't work:
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
//std::string str_to_modify = "123";
std::string str_to_modify = "ttt";
boost::regex regex_to_search ("(\\d\\d\\d)");
std::string regex_format ("(?1foo:bar");
std::string modified_str =
boost::regex_replace(str_to_modify, regex_to_search, regex_format,
boost::match_default | boost::format_all | boost::format_no_copy );
std::cout << modified_str << std::endl;
return 0;
}
I think the format string should be (?1$1:000) as described in the Boost.Regex docs.
Edit: I don't think regex_replace can do what you want. Why don't you try the following instead? regex_match will tell you whether the match succeeded (or you can use match[i].matched to check whether the i-th tagged sub-expression matched). You can format the match using the match.format member function.
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
boost::regex regex_to_search ("(\\d\\d\\d)");
std::string str_to_modify;
while (std::getline(std::cin, str_to_modify))
{
boost::smatch match;
if (boost::regex_match(str_to_modify, match, regex_to_search))
std::cout << match.format("foo:$1") << std::endl;
else
std::cout << "error" << std::endl;
}
}