Boost regex not matching "\\s" to spaces - c++

I'm just starting out with boost and c++ and I'm struggling to understand the behaviour of boost's regular expression engine when it comes to matching whitespace. If I use the code:
boost::regex rx(" ");
cout << regex_search(" ", rx);
to match spaces, then everything works as expected and the regex_search returns true. However, if I try replacing the regular expression with "\s" to match all whitespace characters, I never get a match, and the following code always outputs false:
boost::regex rx("\\s");
cout << regex_search(" ", rx);
What am I missing here?
As requested, here's my complete test case:
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main()
{
    boost::regex rx("\\s", boost::regex::icase);
    cout << regex_search(" ", rx);
}

Got it -- I was originally using the prebuilt libraries from ascend4.org/Binary_installer_for_Boost_on_MinGW . After building Boost 1.52 the code works as expected. Trying to shortcut the boost build process ended up costing me a couple of hours of frustration... I've learnt my lesson now!

Related

How to match complex strings with regular expressions

I am a newbie in C++, I am using the regular expression function, but I have not been able to get the results I want
c++ code:
#include <regex>
std::string str = "[game.exe+009E820C]+338";
std::smatch result;
std::regex pattern("\\[([^\\[\\]]+)\\]");
std::regex_match(str, result, pattern);
// no result
std::cout << result[1] << std::endl;
I am familiar with javascript regular expressions, so I can get the value I want:
'[game.exe+009E820C]+338'.match(/\[([^\[\]]+)\]/)[1] => game.exe+009E820C
Is my C++ code doing something wrong?
If you want to access the capture groups, it appears that the regex_match API requires a pattern which matches the entire input. Also, to avoid getting bogged down by a negative character class which includes a closing square bracket, I recommend using the Perl lazy dot instead. Putting all this together:
std::string str = "[game.exe+009E820C]+338";
std::smatch result;
std::regex pattern(".*\\[(.*?)\\].*");
std::regex_match(str, result, pattern);
std::cout << result[1] << std::endl;
This prints:
game.exe+009E820C

std regex_search to match only current line

I use various regexes to parse a C source file, line by line. First I read the whole content of the file into a string:
ifstream file_stream("commented.cpp",ifstream::binary);
std::string txt((std::istreambuf_iterator<char>(file_stream)),
std::istreambuf_iterator<char>());
Then I use a set of regexes, which should be applied in turn until a match is found. Here I will give only one as an example:
vector<regex> rules = { regex("^//[^\n]*$") };
char * search =(char*)txt.c_str();
int position = 0, length = 0;
for (int i = 0; i < rules.size(); i++) {
    cmatch match;
    if (regex_search(search + position, match, rules[i],regex_constants::match_not_bol | regex_constants::match_not_eol))
    {
        position += ( match.position() + match.length() );
    }
}
But it doesn't work. It matches a comment that is not on the current line, because it searches the whole string for the first match. I expected regex_constants::match_not_bol and regex_constants::match_not_eol to make regex_search treat ^ and $ as the start/end of a line only, not the start/end of the whole block. So here is my file:
commented.cpp:
#include <stdio.h>
//comment
The code should fail: my logic is that, with those options passed to regex_search, the match should fail, because it should search for the pattern in the first line only:
#include <stdio.h>
But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line. The options match_not_bol and match_not_eol do not help me. Of course I can read the file line by line into a vector and then match all the rules against each string in the vector, but that is very slow. I have done that, and it takes too long to parse a big file, which is why I want to let the regex deal with newlines and use a position counter.
If this is not what you want, please comment and I will delete the answer.
What you are doing is not a correct way of using a regex library.
So here is my suggestion for anyone who wants to use the std::regex library.
- Its default grammar is ECMAScript, which is somewhat poorer than what most modern regex libraries offer.
- It has plenty of bugs (these are just the ones I found):
  - the same regex but different results on Linux and Windows only C++
  - std::regex and ignoring flags
  - std::regex_match and lazy quantifier with strange behavior
- In some cases (I tested specifically with std::match_results) it is 200 times slower than std.regex in the D language.
- Its match flags are very confusing and mostly do not work (at least for me).
Conclusion: do not use it at all.
But if anyone still needs to use C++, then you can:
- use boost::regex from the Boost library, because:
  - It supports Perl-compatible syntax
  - It has fewer bugs (I have not seen any)
  - It produces a smaller binary (I mean the executable file after compiling)
  - It is faster than std::regex
- use gcc version 7.1.0 and NOT below. The last bug I found is in version 6.3.0
- use clang version 3 or above
If you have been persuaded NOT to use C++, then you can:
- Use the D regular expression library, std.regex, for large tasks, because it is:
  - Fast (see "Faster Command Line Tools in D")
  - Easy
  - Flexible
- Use native pcre or pcre2, which are written in C:
  - Extremely fast but a little complicated
- Use Perl for a simple task, especially a Perl one-liner
#include <stdio.h>
//comment
The code should fail: my logic is that, with those options passed to regex_search, the match should fail, because it should search for the pattern in the first line only:
#include <stdio.h>
But instead it searches the whole string and immediately finds //comment. I need help making regex_search match only in the current line.
Are you trying to match all // comments in a source code file, or only the first line?
The former can be done like this:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
int main()
{
    auto input = std::ifstream{"stream_union.h"};
    // Compile the pattern once, outside the loop.
    const auto pattern = std::regex(R"(//)");
    for(auto line = std::string{}; getline(input, line); )
    {
        auto submatch = std::smatch{};
        std::regex_search(line, submatch, pattern);
        auto match = submatch.str(0);
        // An empty match means the line contains no "//".
        if(match.empty()) continue;
        std::cout << line << std::endl;
    }
    std::cout << std::endl;
    return EXIT_SUCCESS;
}
And the latter can be done like this:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
int main()
{
    auto input = std::ifstream{"stream_union.h"};
    auto line = std::string{};
    // Read only the first line of the file.
    getline(input, line);
    auto submatch = std::smatch{};
    auto pattern = std::regex(R"(//)");
    std::regex_search(line, submatch, pattern);
    auto match = submatch.str(0);
    // An empty match means the first line contains no "//".
    if(match.empty()) { return EXIT_FAILURE; }
    std::cout << line << std::endl;
    return EXIT_SUCCESS;
}
If for any reason you need the position of the match, submatch.position(0) gives its offset within the line, and input.tellg() reports how far the stream has been read.

unchecked exception while running regex - get file name without extension from file path

I have this simple program
string str = "D:\Praxisphase 1 project\test\Brainstorming.docx";
regex ex("[^\\]+(?=\.docx$)");
if (regex_match(str, ex)){
cout << "match found"<< endl;
}
I expect the result to be true. My regex works, since I have tried it online, but when I run it in C++ the app throws an unchecked exception.
First of all, use raw string literals when defining regex to avoid issues with backslashes (the \. is not a valid escape sequence, you need "\\." or R"(\.)"). Second, regex_match requires a full string match, thus, use regex_search.
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    string str = R"(D:\Praxisphase 1 project\test\Brainstorming.docx)";
    // OR, as a regular string literal with escaped backslashes:
    // string str = "D:\\Praxisphase 1 project\\test\\Brainstorming.docx";
    regex ex(R"([^\\]+(?=\.docx$))");
    if (regex_search(str, ex)){
        cout << "match found"<< endl;
    }
    return 0;
}
Note that R"([^\\]+(?=\.docx$))" = "[^\\\\]+(?=\\.docx$)", the \ in the first are literal backslashes (and you need two backslashes in a regex pattern to match a \ symbol), and in the second, the 4 backslashes are necessary to declare 2 literal backslashes that will match a single \ in the input text.

regex visual studio

I was planning to use the following regex to capture path and name of a file:
std::regex capture_path_name_file("(.+)\\([^\\]+)\\.[^\\]+$");
but when running it (I'm using Visual Studio) I get the regex error
error_brack: the expression contained mismatched [ and ]
Trying to pinpoint the cause, I tried the following regex:
std::regex test("[^\\]");
and I got the same error.
I have tested my regex on regex101.com (with the slight difference that I had to use \. instead of \\.)
Thanks for any help.
The issue you have is because \\ is treated as 1 literal \ symbol in regular string literals. Biffen explained it well in his comment, [^\\] is treated as [^\], the ] is treated as a literal ] and not the closing character class delimiter (and there is no matching ] to close the character class further).
The right answer is: use _splitpath_s.
And if you want to further play with regex, you can fix it like this:
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
    // Group 1: folder path, Group 2: file name with extension.
    std::regex rex1(R"((.+?)([^\\.]+\.[^\\.]+)$)");
    std::smatch m;
    std::string str = "c:\\Python27\\REGEX\\test_regex.py";
    if (regex_search(str, m, rex1)) {
        std::cout << "Path: " << m[1] << std::endl;
        std::cout << "File name: " << m[2] << std::endl;
    }
    return 0;
}
Using raw string literals, you can avoid the majority of issues related to escaping. Use R"((.+?)([^\\.]+\.[^\\.]+)$)", it will match and capture into Group 1 the file folder path, and it will capture into Group 2 the file name with extension. Note that the extension must be present.

Need help constructing Regular expression pattern

I'm failing to create a pattern for the stl regex_match function and need some help understanding why the pattern I created doesn't work and what would fix it.
I think the regex would have a hit for dl.boxcloud.com but it does not.
Still looking for input. I updated the program to reflect the suggestions. There are two matches when I think there should be one.
#include <string>
#include <regex>
using namespace std;
wstring GetBody();
int _tmain(int argc, _TCHAR* argv[])
{
    wsmatch m;
    wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
    regex_search(GetBody(), m, wregex(regex));
    printf("%d matches.\n", m.size());
    return 0;
}
wstring GetBody() {
    wstring body(L"ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
    return body;
}
There is no problem with the code itself. You mistake m.size() for the number of matches, when in fact it is the number of groups your regex returns: the whole match plus each capturing group.
The std::match_results::size reference does not help with understanding that:
Returns the number of matches and sub-matches in the match_results object.
There are 2 groups (Group 0 is the whole match, and Group 1 is the capturing group you defined around the 2 alternatives) and 1 match all in all.
#include <regex>
#include <string>
#include <iostream>
using namespace std;
int main()
{
    string data("ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
    std::regex pattern("(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
    std::smatch result;
    while (regex_search(data, result, pattern)) {
        std::cout << "Match: " << result[0] << std::endl;
        std::cout << "Captured text 1: " << result[1] << std::endl;
        std::cout << "Size: " << result.size() << std::endl;
        // Continue searching in the text after the current match.
        data = result.suffix().str();
    }
}
It outputs:
Match: dl.boxcloud.com
Captured text 1: dl.boxcloud.com
Size: 2
See, the captured text equals the whole match.
To "fix" that, you may use a non-capturing group, or remove the grouping altogether:
std::regex pattern("(?:dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
// or
std::regex pattern("dl\\.boxcloud\\.com|api-content\\.dropbox\\.com");
Also, consider using a raw string literal when declaring a regex (to avoid backslash hell):
std::regex pattern(R"(dl\.boxcloud\.com|api-content\.dropbox\.com)");
You need to add another "\" before each ".". I think that should fix it. The backslash itself has to be escaped in a string literal, so your regex looks like this:
wstring regex(L"(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
Update:
As @user3494744 also said, you have to use std::regex_search instead of std::regex_match.
I tested it and it works now.
The problem is that you use regex_match instead of regex_search. To quote from the manual:
Note that regex_match will only successfully match a regular expression to an entire character sequence, whereas std::regex_search will successfully match subsequences
This fix will give a match, but too many since you also have to replace \. by \\. as shown before my answer. Otherwise the string "dlXboxcloud.com" will also match.