It seems that I can match higher-order Unicode characters on a char-by-char basis, but character classes/properties do not work well.
I created this sample program (on Windows):
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::wregex re(L"\\S+");
std::wstring c(L"c");
std::wstring imp_smiley(L"\U0001F608");//gets encoded as UTF16
std::cout << std::boolalpha << "c: " << std::regex_match(c, re) << std::endl;
std::cout << std::boolalpha << "imp_smiley: " << std::regex_match(imp_smiley, re) << std::endl;
}
What is weird is that the 'imp smiley' character matches neither \S (non-whitespace) nor \s (whitespace). I would have expected it to be treated as non-whitespace.
What is going on?
Update
Using [^\s]+ instead of \S+ actually makes it match. It seems that the UTF-16 encoding is not being recognized (or normalized).
Related
How to display character string literals with hex properly with std::cout in C++?
I want to use octal and hex to print character string literals with std::cout in C++.
I want to print "bee".
#include <iostream>
int main() {
std::cout << "b\145e" << std::endl;//1
std::cout << "b\x65e" << std::endl;//2
return 0;
}
//1 works fine, but //2 fails with the error "hex escape sequence out of range".
Now I want to print "be3".
#include <iostream>
int main() {
std::cout << "b\1453" << std::endl;//1
std::cout << "b\x653" << std::endl;//2
return 0;
}
Again, //1 works fine, but //2 fails with the error "hex escape sequence out of range".
Can I conclude that hex is not a good way to write characters in string literals?
I get the feeling I am wrong but don't know why.
Can someone explain whether hex can be used and how?
There's actually an example of this exact situation in cppreference's documentation on string literals.
If a valid hex digit follows a hex escape in a string literal, it would fail to compile as an invalid escape sequence. String concatenation can be used as a workaround:
They provide the example below:
// const char* p = "\xfff"; // error: hex escape sequence out of range
const char* p = "\xff""f"; // OK : the literal is const char[3] holding {'\xff','f','\0'}
Applying what they explain to your problem, we can print the string literal be3 in two ways:
std::cout << "b\x65" "3" << std::endl;
std::cout << "b\x65" << "3" << std::endl;
A hex escape consumes every hex digit that follows it, so the escape sequences become \x65e and \x653. You need to help the compiler stop after 65:
#include <iostream>
int main() {
std::cout << "b\x65""e" << std::endl;//2
std::cout << "b\x65""3" << std::endl;//2
}
I need to check whether my character is a letter. I tried using the isalpha() function; however, if I pass a non-Latin letter (for example ą, č, ę, ė, į, š, ų, ū, ž) I get an error that seems to state that isalpha only accepts chars whose character code is between 0 and 255. Is there any way to overcome this problem?
You can use the locale-aware overload of std::isalpha. Taking an example from the linked reference:
#include <iostream>
#include <locale>
int main()
{
const wchar_t c = L'\u042f'; // cyrillic capital letter ya
std::locale loc1("C");
std::cout << "isalpha('Я', C locale) returned "
<< std::boolalpha << std::isalpha(c, loc1) << '\n';
std::locale loc2("en_US.UTF8");
std::cout << "isalpha('Я', Unicode locale) returned "
<< std::boolalpha << std::isalpha(c, loc2) << '\n';
}
Output:
isalpha('Я', C locale) returned false
isalpha('Я', Unicode locale) returned true
I'm trying to get a regex to match a char containing a space ' '.
When compiled with g++ (MinGW 8.1.0 on Windows) it reliably fails to match.
When compiled with onlinegdb it reliably matches.
Why would the behaviour differ between these two? What would be the best way to get my regex to match properly without using a std::string (which does match correctly)?
My code:
#include <iostream>
#include <regex>
#include <string>
int main() {
char a = ' ';
std::string b = " ";
std::cout << std::regex_match(b, std::regex("\\s+")) << '\n'; // always writes 1
std::cout << std::regex_match(&a, std::regex("\\s+")) << '\n'; // writes 1 in onlinegdb, 0 with MinGW
}