Replacing tokens that match pieces of a regex - c++

I would like to use a regex both as a pattern to search and as a template to construct a string. (I'm using boost::regex because I'm on gcc 4.8.4, where std::regex is apparently not fully supported until 4.9.)
That is, I want to construct a regex, pass it to a function, use the regex to match some files, then construct an output file name following the same pattern. For example:
Regex: "file_.*\.txt"
to match things like "file_1.txt", "file_2.txt", etc.
and then would like to construct from it
Output: "file_all.txt"
That is, I want to match files starting with "file_" and ending with ".txt", then I want to fill in "all" between the "file_" and the ".txt", all from a single regex object.
We'll skip the matching itself, as that is straightforward, and focus on the replacement:
#include <iostream>
#include <iterator>
#include <string>
#include <boost/regex.hpp>
std::string constructOutput(const boost::regex& myRegex)
{
// How to replace the match to the center of the filenames here?
// return boost::regex_replace(?, myRegex, "all");
}
int main()
{
// We can do something like this, but it requires us to manually separate the "center" of the regex from the string, as well as keep around a string object and a regex object:
// std::string myText = "File_.*.txt";
// boost::regex myRegex("_.*\\.");
// std::cout << '\n' << boost::regex_replace(myText, myRegex, "_all.") << '\n';
// Want to do this:
boost::regex myRegex("File_.*\\.txt");
std::string outputString = constructOutput(myRegex);
std::cout << outputString << std::endl;
}
Is something like this possible?

Related

C++ Regex always matching entire string

Whenever I use a regex function it matches the entire string for some reason.
#include <iostream>
#include <regex>
int main() {
std::string text = "This (is a) test";
std::regex pattern("\(.+\)");
std::cout << std::regex_replace(text, pattern, "isnt") << std::endl;
return 0;
}
Output: isnt
Your pattern unfortunately is not what it seems to be. Here is the problem.
Imagine for some reason you want to match tabs with your regex. You might try this.
std::regex my_regex("\t");
This would work, but the string your std::regex object has seen is an actual tab character, not "\t". This is because of how C++ treats escape sequences in string literals. To pass a literal "\t" to the regex, you have to do the following.
std::regex my_regex("\\t");
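In your case, "\(" and "\)" are not valid C++ escape sequences; compilers typically warn and drop the backslashes, so the regex engine sees (.+), a capture group that matches the entire string. So the correct syntax for your regex is: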
std::regex pattern("\\(.+\\)");

How to match complex strings with regular expressions

I am a newbie in C++. I am using the regular expression functions, but I have not been able to get the results I want.
C++ code:
#include <regex>
std::string str = "[game.exe+009E820C]+338";
std::smatch result;
std::regex pattern("\\[([^\\[\\]]+)\\]");
std::regex_match(str, result, pattern);
// no result
std::cout << result[1] << std::endl;
I am familiar with javascript regular expressions, so I can get the value I want:
'[game.exe+009E820C]+338'.match(/\[([^\[\]]+)\]/)[1] => game.exe+009E820C
Is my C++ code doing something wrong?
If you want to access the capture groups, it appears that the regex_match API requires a pattern which matches the entire input. Also, to avoid getting bogged down by a negative character class which includes a closing square bracket, I recommend using the Perl lazy dot instead. Putting all this together:
std::string str = "[game.exe+009E820C]+338";
std::smatch result;
std::regex pattern(".*\\[(.*?)\\].*");
std::regex_match(str, result, pattern);
std::cout << result[1] << std::endl;
This prints:
game.exe+009E820C

My Boost regular expression is not matching anything

I'm trying to match strings which look like this:
Mar 25 19:17:55 127.0.0.1 user:[pool-15-thread-17]INTOUCH;0;INFO;SOFTLOADSERVICE;Install started
with a regular expression. Here is my code defining the regular expression:
#include <boost/regex.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <tuple>
#include <string>
const std::string softload_startup = "(\\w{3}) (\\d{1,2}) (\\d{2}):
(\\d{2}):(\\d{2})*SOFTLOADSERVICE;Install started\\s"; //NOLINT
const boost::regex softload_start(softload_startup);
class InTouchSoftload {
public:
explicit InTouchSoftload(std::string filename);
private:
std::string _log_name;
std::tuple<unsigned int, std::string> software_update_start;
};
I am calling it here:
int main() {
fin.open(input_file);
if (fin.fail()) {
std::cerr << "Failed to open " << input_file << std::endl;
exit(1);
}
while (std::getline(fin, line)) {
line_no++;
if (regex_match(line, softload_start)) {
std::cout << line << std::endl;
}
}
return 0;
}
Unfortunately, I can't seem to get any matches. Any suggestions?
If your regular expression does not match the string you wanted it to match then your regular expression is wrong. I've corrected your regular expression:
(\\w{3}) (\\d{1,2}) (\\d{2}):(\\d{2}):(\\d{2}).*SOFTLOADSERVICE;Install started\\s*
Here's where you can test your regular expression yourself:
https://regex101.com/
https://www.regextester.com/
https://regexr.com/
While you haven't provided a complete example, your recent edit suggests you're failing because you're trying to match individual lines - the result of std::getline(), while your pattern involves two lines.
If that's indeed the case, you should probably do one of the following:
Match pairs of consecutive lines (i.e. at every iteration try matching the previous+current lines)
Split the regexp into 2 regular expressions, one for each line. Now, whenever a line matches the first regexp, try matching the next line to the second; otherwise try matching it to the first.
Add ^ to the beginning of your regexp and $ to the end (so that it matches on line boundaries), and match your regexp against the entire input stream instead of line by line.

Need help constructing Regular expression pattern

I'm failing to create a pattern for the stl regex_match function and need some help understanding why the pattern I created doesn't work and what would fix it.
I think the regex would have a hit for dl.boxcloud.com but it does not.
Still looking for input. I updated the program to reflect suggestions. There are two matches when I think there should be one.
#include <string>
#include <regex>
using namespace std;
wstring GetBody();
int _tmain(int argc, _TCHAR* argv[])
{
wsmatch m;
wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
regex_search(GetBody(), m, wregex(regex));
printf("%d matches.\n", m.size());
return 0;
}
wstring GetBody() {
wstring body(L"ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
return body;
}
There is no problem with the code itself. You mistake m.size() for the number of matches, when in fact it is the number of groups in a single match: the whole match plus one entry per capturing group.
The std::match_results::size reference is not helpful with understanding that:
Returns the number of matches and sub-matches in the match_results object.
There are 2 groups (since you defined a capturing group around the 2 alternatives) and 1 match all in all.
See this IDEONE demo
#include <regex>
#include <string>
#include <iostream>
#include <time.h>
using namespace std;
int main()
{
string data("ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
std::regex pattern("(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
std::smatch result;
while (regex_search(data, result, pattern)) {
std::cout << "Match: " << result[0] << std::endl;
std::cout << "Captured text 1: " << result[1] << std::endl;
std::cout << "Size: " << result.size() << std::endl;
data = result.suffix().str();
}
}
It outputs:
Match: dl.boxcloud.com
Captured text 1: dl.boxcloud.com
Size: 2
See, the captured text equals the whole match.
To "fix" that, you may use a non-capturing group, or remove the grouping altogether:
std::regex pattern("(?:dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
// or
std::regex pattern("dl\\.boxcloud\\.com|api-content\\.dropbox\\.com");
Also, consider using a raw string literal when declaring a regex (to avoid backslash hell):
std::regex pattern(R"(dl\.boxcloud\.com|api-content\.dropbox\.com)");
You need to add another "\" before each ".". I think that should fix it. You need to use an escape character to represent "\", so your regex looks like this:
wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
Update:
As @user3494744 also said, you have to use
std::regex_search
instead of
std::regex_match.
I tested and it works now.
The problem is that you use regex_match instead of regex_search. To quote from the manual:
Note that regex_match will only successfully match a regular expression to an entire character sequence, whereas std::regex_search will successfully match subsequences
This fix will give a match, but it will match too much unless you also replace \. with \\., as shown in the other answer. Otherwise the string "dlXboxcloud.com" will also match.

What is the regular expression to get a token of a URL?

Say I have strings like these:
bunch of other html<a href="http://domain.com/133742/The_Token_I_Want.zip" more html and stuff
bunch of other html<a href="http://domain.com/12345/another_token.zip" more html and stuff
bunch of other html<a href="http://domain.com/0981723/YET_ANOTHER_TOKEN.zip" more html and stuff
What is the regular expression to match The_Token_I_Want, another_token, YET_ANOTHER_TOKEN?
Appendix B of RFC 2396 gives a doozy of a regular expression for splitting a URI into its components, and we can adapt it for your case
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?
#######
This leaves The_Token_I_Want in $6, which is the “hashderlined” subexpression above. (Note that the hashes are not part of the pattern.) See it live:
#! /usr/bin/perl
$_ = "http://domain.com/133742/The_Token_I_Want.zip";
if (m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*/([^.]+)[^?#]*)(\?([^#]*))?(#(.*))?!) {
print "$6\n";
}
else {
print "no match\n";
}
Output:
$ ./prog.pl
The_Token_I_Want
UPDATE: I see in a comment that you're using boost::regex, so remember to escape the backslash in your C++ program.
#include <boost/foreach.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
int main()
{
boost::regex token("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*"
"/([^.]+)"
// ####### I CAN HAZ HASHDERLINE PLZ
"[^?#]*)(\\?([^#]*))?(#(.*))?");
const char * const urls[] = {
"http://domain.com/133742/The_Token_I_Want.zip",
"http://domain.com/12345/another_token.zip",
"http://domain.com/0981723/YET_ANOTHER_TOKEN.zip",
};
BOOST_FOREACH(const char *url, urls) {
std::cout << url << ":\n";
std::string t;
boost::cmatch m;
if (boost::regex_match(url, m, token))
t = m[6];
else
t = "<no match>";
std::cout << " - " << t << '\n';
}
return 0;
}
Output:
http://domain.com/133742/The_Token_I_Want.zip:
- The_Token_I_Want
http://domain.com/12345/another_token.zip:
- another_token
http://domain.com/0981723/YET_ANOTHER_TOKEN.zip:
- YET_ANOTHER_TOKEN
/a href="http://domain.com/[0-9]+/([a-zA-Z_]+).zip"/
Might want to add more characters to [a-zA-Z_]+
You can use:
(http|ftp)+://[[:alnum:]./_]+/([[:alnum:]._-]+).[[:alnum:]_-]+
([[:alnum:]._-]+) is a group for the matched pattern, and in your example its value will be The_Token_I_Want. To access this group, use \2 or $2, because (http|ftp) is the first group and ([[:alnum:]._-]+) is the second group of the matched pattern.
Try this:
/(?:f|ht)tps?:/{2}(?:www.)?domain[^/]+.([^/]+).([^/]+)/i
or
/\w{3,5}:/{2}(?:w{3}.)?domain[^/]+.([^/]+).([^/]+)/i
First, use an HTML parser and get a DOM. Then get the anchor elements and loop over them looking for the hrefs. Don't try to grab the token straight out of a string.
Then:
The glib answer would be:
/(The_Token_I_Want.zip)/
You might want to be a little more precise than a single example.
I'm guessing you are actually looking for:
/([^/]+)$/
m/The_Token_I_Want/
You'll have to be more specific about what kind of token it is. A number? A string? Does it repeat? Does it have a form or pattern to it?
It's probably best to use something smarter than a RegEx. For example, if you're using C# you could use the System.Uri class to parse it for you.