std::isalpha does not recognize UTF-8? [duplicate] - c++

I need to check whether my character is a letter. I tried using the isalpha() function; however, if I pass a non-Latin letter (for example ą, č, ę, ė, į, š, ų, ū, ž), I get an error that seems to state that isalpha accepts only chars whose character code is between 0 and 255. Is there any way to overcome this problem?

You can use a locale version of std::isalpha. Taking an example from the linked reference:
#include <iostream>
#include <locale>
int main()
{
    const wchar_t c = L'\u042f'; // cyrillic capital letter ya
    std::locale loc1("C");
    std::cout << "isalpha('Я', C locale) returned "
              << std::boolalpha << std::isalpha(c, loc1) << '\n';
    std::locale loc2("en_US.UTF8");
    std::cout << "isalpha('Я', Unicode locale) returned "
              << std::boolalpha << std::isalpha(c, loc2) << '\n';
}
Output:
isalpha('Я', C locale) returned false
isalpha('Я', Unicode locale) returned true

Related

How to display character string literals with hex properly with std::cout in C++?

I want to use octal and hex escape sequences in string literals that I print with std::cout in C++.
I want to print "bee".
#include <iostream>
int main() {
    std::cout << "b\145e" << std::endl; //1
    std::cout << "b\x65e" << std::endl; //2
    return 0;
}
//1 works fine, but //2 fails to compile with the error "hex escape sequence out of range".
Now I want to print "be3".
#include <iostream>
int main() {
    std::cout << "b\1453" << std::endl; //1
    std::cout << "b\x653" << std::endl; //2
    return 0;
}
Again, //1 works fine, but //2 fails with "hex escape sequence out of range".
Can I conclude that hex escapes are not a good way to write characters in string literals?
I get the feeling I am wrong but don't know why.
Can someone explain whether hex can be used, and how?
There's actually an example of this exact same situation on cppreference's documentation on string literals.
If a valid hex digit follows a hex escape in a string literal, it would fail to compile as an invalid escape sequence. String concatenation can be used as a workaround:
They provide the example below:
// const char* p = "\xfff"; // error: hex escape sequence out of range
const char* p = "\xff""f"; // OK : the literal is const char[3] holding {'\xff','f','\0'}
Applying what they explain to your problem, we can print the string literal be3 in two ways:
std::cout << "b\x65" "3" << std::endl;
std::cout << "b\x65" << "3" << std::endl;
A hex escape consumes every hex digit that follows it, so the escape sequences are read as \x65e and \x653. You need to help the compiler stop after 65:
#include <iostream>
int main() {
    std::cout << "b\x65""e" << std::endl; //2
    std::cout << "b\x65""3" << std::endl; //2
}

Character matches neither space nor non-space - what's going on?

It seems that I can match higher-order Unicode characters on a char-by-char basis, but character classes/properties do not work well.
I created this sample program (on Windows):
#include <iostream>
#include <string>
#include <regex>
int main()
{
    std::wregex re(L"\\S+");
    std::wstring c(L"c");
    std::wstring imp_smiley(L"\U0001F608"); // gets encoded as UTF-16
    std::cout << std::boolalpha << "c: " << std::regex_match(c, re) << std::endl;
    std::cout << std::boolalpha << "imp_smiley: " << std::regex_match(imp_smiley, re) << std::endl;
}
What is weird is that the 'imp smiley' character matches neither \S (non-whitespace) nor \s (whitespace). I would have expected it to be treated as non-whitespace.
What is going on?
Update
Using [^\s]+ instead of, e.g., \\S+ actually makes it match. It seems that UTF-16 is not being recognized (or normalized).

Where do we really need to use wide character stream wcout?

I can't see the point of using std::wcout. As far as I understand, the wide stream object corresponds to a C wide-oriented stream (ISO C 7.19.2/5). When do we really need to use it in practice? I'm pretty sure it isn't suited to outputting a character from an implementation's wide character set (N3797 3.9.1/5 [basic.fundamental]), because
#include <iostream>
#include <locale>
int main()
{
    std::locale loc = std::locale("en_US.UTF-8");
    std::wcout.imbue(loc);
    std::cout << "ي" << std::endl; // OK!
    std::wcout << "ي" << std::endl; // Empty string
    std::wcout << "L specifier = " << L"ي" << std::endl; // J
    std::wcout << "u specifier = " << u"ي" << std::endl; // 0x400,eac
    std::wcout << "u8 specifier = " << u8"ي" << std::endl; // empty string
}
We can see that wcout's operator<< didn't print these characters correctly, whereas cout's operator<< does it well. I've also checked wcout with other characters such as
'л', 'ਚੰ', 'ਗਾ', 'კ', 'ა', 'რ', 'გ', 'ი' and so on, but it prints only Latin characters and digits correctly.