I'm trying to get a regex to match a char containing a space ' '.
When compiled with g++ (MinGW 8.1.0 on Windows) it reliably fails to match.
When compiled with onlinegdb it reliably matches.
Why would the behaviour differ between these two? What would be the best way to get my regex to match properly without using a std::string (which does match correctly)?
My code:
#include <iostream>
#include <regex>
#include <string>
int main() {
char a = ' ';
std::string b = " ";
std::cout << std::regex_match(b, std::regex("\\s+")) << '\n'; // always writes 1
std::cout << std::regex_match(&a, std::regex("\\s+")) << '\n'; // writes 1 in onlinegdb, 0 with MinGW
}
It seems that I can match higher-order Unicode characters on a char-by-char basis, but character classes/properties do not work well.
I created this sample program (on Windows):
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::wregex re(L"\\S+");
std::wstring c(L"c");
std::wstring imp_smiley(L"\U0001F608");//gets encoded as UTF16
std::cout << std::boolalpha << "c: " << std::regex_match(c, re) << std::endl;
std::cout << std::boolalpha << "imp_smiley: " << std::regex_match(imp_smiley, re) << std::endl;
}
What is weird is that the 'imp smiley' character matches neither \S (non-whitespace) nor \s (whitespace). I would have expected it to be treated as non-whitespace.
What is going on?
Update
Using [^\s]+ instead of \S+ does make it match. It seems that UTF-16 is not being recognized (or normalized).
I am trying to match all files that have the extension .nef. The match must be case-insensitive.
regex e("(.*)(\\.NEF)",ECMAScript|icase);
...
if (regex_match ( fn1, e )){
//Do Something
}
Here, fn1 is a string containing a file name.
However, this "does something" only with files with .NEF (upper case) extensions. .nef extensions are ignored.
I also tried:
regex e("(.*)(\\.[Nn][Ee][Ff])");
and
regex e("(.*)(\\.[N|n][E|e][F|f])");
Both of these resulted in a runtime error:
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error
Aborted (core dumped)
My code is compiled using:
g++ nefread.cpp -o nefread -lraw_r -lpthread -pthread -std=c++11 -O3
What am I doing wrong? This is my basic code. I want to extend it to match more file extensions .nef, .raw, .cr2 etc.
Your original expression is correct and should produce the desired result. The problem is the GCC implementation of <regex>, which is broken. This answer explains the historical reasons why that is so, and also says that GCC 4.9 will ship with a working <regex> implementation.
Your code works using Boost.Regex:
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
// Simple regular expression matching
boost::regex expr(R"((.*)\.(nef))", boost::regex_constants::ECMAScript |
boost::regex_constants::icase);
// With a raw string literal there is no need to escape the '\'.
boost::cmatch m;
for (auto const& fname : {"foo.nef", "bar.NeF", "baz.NEF"}) {
if(boost::regex_match(fname, m, expr)) {
std::cout << "matched: " << m[0] << '\n';
std::cout << " " << m[1] << '\n';
std::cout << " " << m[2] << '\n';
}
}
}
I wrote this test program in a Latin-1 encoded file...
#include <cstring>
#include <iostream>
using namespace std;
const char str[] = "ÅÄÖ";
int main() {
cout << sizeof(str) << ' ' << strlen(str) << ' '
<< sizeof("Åäö") << ' ' << strlen("åäö") << endl;
return 0;
}
...and compiled it with g++ -fexec-charset=UTF-8 -pedantic -Wall on OpenBSD 5.3. I was expecting the size of each string to be 6 chars + 1 NUL char, but I get the output 4 3 4 3.
I tried changing my system locale from ISO-8859-1 to UTF-8 with export LC_CTYPE=sv_SE.UTF-8, but that didn't help. (According to the gcc manual, that only changes the input character set, but hey, it was worth a try.)
So, what am I doing wrong?
Is there any way I can create a regex pattern that contains an unsigned char? I've tried:
regex* r = new regex("\\xff");
which results in an exception saying the pattern character is out of range. I've also tried to define my own basic_regex and my own regex_traits following the code in the regex include file but that results in a strange error in the local include.
Any help would be appreciated.
This is a legal HexEscapeSequence in ECMAScript regex (which is what C++ uses by default), and it appears to work on my tests:
#include <iostream>
#include <regex>
int main()
{
std::regex re("\\xff");
std::string test = "abc\xff";
std::smatch m;
regex_search(test, m, re);
std::cout << "Found " << m.size() << " match after '"
<< m.prefix() << "'\n";
}
clang++/libc++: Found 1 match after 'abc'
g++/boost.regex (replacing std:: with boost::): Found 1 match after 'abc'
What's your implementation?
I am using C++ regex and was not able to understand the output of the following program.
#include <iostream>
#include <regex>
#include <algorithm>
#include <string>
using namespace std;
int main(){
regex r("a(b+)(c+)d");
string s ="abcd";
smatch m;
cout << s << endl;
const bool b = regex_match(s,m, r);
cout << b <<endl; // prints 1 - OK
if(b){
cout << m[0] << endl; // prints abcd - OK
cout << m[1] << endl; // prints ab - Why? Shouldn't it be just b?
cout << m[2] << endl; // prints bc - Why? Shouldn't it be just c?
}
}
As per my exposure to regex in other languages, each pair of parentheses should capture the part of the string it matches, so the output should be
1
abcd
b
c
EDIT:
I am using g++ 4.6
Assuming you are using g++, you should note that its implementation of <regex> (section 28) is incomplete. Note the listings for basic_regex, sub_match, and match_results are declared "Partial".
For more info on g++, I think this post from a year ago is still relevant (as is this bug report).
This would explain why it's not giving the results that you expect. You may wish to try Boost regex in the meantime.