Need help constructing Regular expression pattern - c++

I'm failing to create a pattern for the stl regex_match function and need some help understanding why the pattern I created doesn't work and what would fix it.
I think the regex would have a hit for dl.boxcloud.com but it does not.
Update: still looking for input. I updated the program to reflect the suggestions. There are two matches when I think there should be one.
#include <string>
#include <regex>
using namespace std;
wstring GetBody();
int _tmain(int argc, _TCHAR* argv[])
{
wsmatch m;
wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
regex_search(GetBody(), m, wregex(regex));
printf("%d matches.\n", m.size());
return 0;
}
wstring GetBody() {
wstring body(L"ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
return body;
}

There is no problem with the code itself. You are mistaking m.size() for the number of matches, when in fact it is the number of groups your regex returns.
The std::match_results::size reference does not make that obvious:
Returns the number of matches and sub-matches in the match_results object.
Here m.size() is 2: the whole match (group 0) plus the one capturing group you defined around the two alternatives. There is only 1 match overall.
See this IDEONE demo
#include <regex>
#include <string>
#include <iostream>
#include <time.h>
using namespace std;
int main()
{
string data("ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
std::regex pattern("(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
std::smatch result;
while (regex_search(data, result, pattern)) {
std::cout << "Match: " << result[0] << std::endl;
std::cout << "Captured text 1: " << result[1] << std::endl;
std::cout << "Size: " << result.size() << std::endl;
data = result.suffix().str();
}
}
It outputs:
Match: dl.boxcloud.com
Captured text 1: dl.boxcloud.com
Size: 2
See, the captured text equals the whole match.
To "fix" that, you may use a non-capturing group, or remove the grouping altogether:
std::regex pattern("(?:dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
// or
std::regex pattern("dl\\.boxcloud\\.com|api-content\\.dropbox\\.com");
Also, consider using a raw string literal when declaring a regex (to avoid backslash hell):
std::regex pattern(R"(dl\.boxcloud\.com|api-content\.dropbox\.com)");

You need to add another "\" before each ".". I think that should fix it. In a C++ string literal the backslash is itself an escape character, so each "\" meant for the regex has to be written as "\\", and your regex looks like this
wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
Update:
As @user3494744 also said, you have to use
std::regex_search
instead of
std::regex_match.
I tested and it works now.

The problem is that you use regex_match instead of regex_search. To quote from the manual:
Note that regex_match will only successfully match a regular expression to an entire character sequence, whereas std::regex_search will successfully match subsequences
This fix will give a match, but too many, since you also have to replace \. with \\. as shown in the earlier answer. Otherwise the string "dlXboxcloud.com" will also match.

Related

C++ Regex always matching entire string

Whenever I use a regex function it matches the entire string for some reason.
#include <iostream>
#include <regex>
int main() {
std::string text = "This (is a) test";
std::regex pattern("\(.+\)");
std::cout << std::regex_replace(text, pattern, "isnt") << std::endl;
return 0;
}
Output: isnt
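Unfortunately, your pattern is not what it seems to be. Here is the problem.
Imagine that for some reason you want to match tabs with your regex. You might try this.
std::regex my_regex("\t");
This would work, but the string your std::regex object sees is an actual tab character, not the two characters \t. This is because of how C++ treats escape sequences in string literals. To pass a literal \t, you have to do the following.
std::regex my_regex("\\t");
The same thing happens to "\(.+\)": \( is not a recognized escape sequence, so compilers typically warn and drop the backslash, the regex sees (.+), and that matches the whole line. So the correct syntax for your regex is.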
std::regex pattern("\\(.+\\)");

regex_replace invalid open parenthesis

DEMO
#include <iostream>
#include <regex>
int main() {
std::wstring str = LR"(
bst.enable_adb_access="1"
)";
std::wregex re(L"(?<=bst\\.enable_adb_access.*?)\\d+");
str = std::regex_replace(str, re, L"0");
std::wcout << str << std::endl;
}
error:
terminate called after throwing an instance of 'std::regex_error'
what(): Invalid special open parenthesis.
https://regex101.com/r/a33eFL/1
What's wrong with the parenthesis?
Well, this is one illustration why the plural of "regex" is "regrets"...
C++ accepts several flavours of regexes, but none of them seems to understand lookbehinds. The default modified ECMAScript flavour only accepts lookaheads. I'm not 100% sure about the POSIX, awk and grep flavours, but none of them seems to have any lookarounds whatsoever.
Fortunately, you can get the same effect without lookarounds, using a capturing group. I had to change the format string rules to sed, because the default ECMAScript rules allow two-digit backreferences, so $10 could be read as group 10.
#include <iostream>
#include <regex>
int main() {
std::wstring str = LR"(
bst.enable_adb_access="1"
)";
std::wregex re(L"(bst\\.enable_adb_access.*?)\\d+");
str = std::regex_replace(str, re, L"\\10", std::regex_constants::format_sed);
std::wcout << str << std::endl;
}
See it online
You don't need to use a lookbehind for this situation. Simply use a normal capturing group and include it in the replacement string:
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::wstring str = LR"(
bst.enable_adb_access="1"
)";
std::wregex re(L"(bst\\.enable_adb_access.*?)\\d+");
str = std::regex_replace(str, re, L"$010");
std::wcout << str << std::endl;
}
Output:
bst.enable_adb_access="0"
Note that because the substitution for the capturing group is followed by a digit, we need to use the $nn format for the group number (hence $010). Otherwise $10 could, depending on the implementation, be interpreted as a reference to capture group 10.
Demo on ideone

Regexp matching fails with invalid special open parenthesis

I am trying to use regexps in C++11, but my code always throws a std::regex_error with the message "Invalid special open parenthesis.". A minimal example that tries to find the first duplicate character in a string:
std::string regexp_string("(?P<a>[a-z])(?P=a)"); // Nothing to be escaped here, right?
std::regex regexp_to_match(regexp_string);
std::string target("abbab");
std::smatch matched_regexp;
std::regex_match(target, matched_regexp, regexp_to_match);
for(const auto& m: matched_regexp)
{
std::cout << m << std::endl;
}
Why do I get an error and how do I fix this example?
There are 2 issues here:
1. std::regex flavors do not support named capturing groups or named backreferences; you need to use numbered capturing groups and backreferences.
2. You should use regex_search rather than regex_match, because regex_match requires a full string match.
Use
std::string regexp_string(R"(([a-z])\1)");
std::regex regexp_to_match(regexp_string);
std::string target("abbab");
std::smatch matched_regexp;
if (std::regex_search(target, matched_regexp, regexp_to_match)) {
std::cout << matched_regexp.str() << std::endl;
}
// => bb
See the C++ demo
The R"(([a-z])\1)" raw string literal defines the ([a-z])\1 regex that matches any lowercase ASCII letter and then matches the same letter again.
http://en.cppreference.com/w/cpp/regex/ecmascript says that ECMAScript (the default type for std::regex) requires (?= for positive lookahead.
The reason your regex throws is that named groups are not supported by std::regex. However, you can still use what is available to find the first duplicate character in a string:
#include <iostream>
#include <regex>
int main()
{
std::string s = "abc def cde";
std::smatch m;
std::regex r("(\\w).*?(?=\\1)");
if (std::regex_search(s, m, r))
std::cout << m[1] << std::endl;
return 0;
}
Prints
c

Replacing tokens that match pieces of a regex

I would like to use a regex both as a pattern to search and as a template to construct a string. (I'm using boost::regex because I'm on gcc 4.8.4, where std::regex is apparently not fully supported until 4.9.)
That is, I want to construct a regex, pass it to a function, use the regex to match some files, then construct an output file name following the same pattern. For example:
Regex: "file_.*\.txt"
to match things like "file_1.txt", "file_2.txt", etc.
and then would like to construct from it
Output: "file_all.txt"
That is, I want to match files starting with "file_" and ending with ".txt", then I want to fill in "all" between the "file_" and the ".txt", all from a single regex object.
We'll skip the matching itself, as that is straightforward, and focus on the replacement:
#include <iostream>
#include <iterator>
#include <string>
#include <boost/regex.hpp>
std::string constructOutput(const boost::regex& myRegex)
{
// How to replace the match to the center of the filenames here?
// return boost::regex_replace(?, myRegex, "all");
}
int main()
{
// We can do something like this, but it requires us to manually separate the "center" of the regex from the string, as well as keep around a string object and a regex object:
// std::string myText = "File_.*.txt";
// boost::regex myRegex("_.*\\.");
// std::cout << '\n' << boost::regex_replace(myText, myRegex, "_all.") << '\n';
// Want to do this:
boost::regex myRegex("File_.*\\.txt");
std::string outputString = constructOutput(myRegex);
std::cout << outputString << std::endl;
}
Is something like this possible?

regex visual studio

I was planning to use the following regex to capture path and name of a file:
std::regex capture_path_name_file("(.+)\\([^\\]+)\\.[^\\]+$");
but when running it (I'm using Visual Studio) I get the regex error
error_brack: the expression contained mismatched [ and ]
Trying to pinpoint the cause, I tried the following regex:
std::regex test("[^\\]");
and I got the same error.
I have tested my regex on regex101.com (with the slight difference that I had to use \. instead of \\.)
Thanks for any help.
The issue is that \\ is treated as one literal \ in regular string literals. As Biffen explained in his comment, [^\\] reaches the regex engine as [^\], where \] is read as an escaped literal ] rather than the closing delimiter of the character class, so the class is never closed.
The right answer is: use _splitpath_s.
If you still want to use a regex, you can fix it like this:
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex rex1(R"((.+?)([^\\.]+\.[^\\.]+)$)");
std::smatch m;
std::string str = "c:\\Python27\\REGEX\\test_regex.py";
if (regex_search(str, m, rex1)) {
std::cout << "Path: " << m[1] << std::endl;
std::cout << "File name: " << m[2] << std::endl;
}
return 0;
}
Using raw string literals, you can avoid the majority of issues related to escaping. Use R"((.+?)([^\\.]+\.[^\\.]+)$)", it will match and capture into Group 1 the file folder path, and it will capture into Group 2 the file name with extension. Note that the extension must be present.