How to test a u32string for letters only (with locale)

How to test a u32string for letters only (with locale) - c++

I'm writing a compiler (for my own programming language) and I want to allow users to use any of the characters in the Unicode letter categories to define identifiers (modern languages, like Go allow such syntax already).
I've read a lot about character encoding in C++11 and based on all the informations I've found out, it will be fine to use utf32 encoding (it is fast to iterate over in lexer and it has better support than utf8 in C++).
In C++ there is isalpha function. How can I test wchar32_t if it is a letter (a Unicode code point classified as "letter" in any language)?
Is it even possible?

Use ICU to iterate over the string and check whether the appropriate Unicode properties are fulfilled. Here is an example in C that checks whether the UTF-8 command line argument is a valid identifier:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unicode/uchar.h>
#include <unicode/utf8.h>
int main(int argc, char **argv) {
if (argc != 2) return EXIT_FAILURE;
const char *const str = argv[1];
int32_t off = 0;
// U8_NEXT has a bug causing length < 0 to not work for characters in [U+0080, U+07FF]
const size_t actual_len = strlen(str);
if (actual_len > INT32_MAX) return EXIT_FAILURE;
const int32_t len = actual_len;
if (!len) return EXIT_FAILURE;
UChar32 ch = -1;
U8_NEXT(str, off, len, ch);
if (ch < 0 || !u_isIDStart(ch)) return EXIT_FAILURE;
while (off < len) {
U8_NEXT(str, off, len, ch);
if (ch < 0 || !u_isIDPart(ch)) return EXIT_FAILURE;
}
}
Note that ICU here uses the Java definitions, which are slightly different from those in UAX #31. In a real application you might also want to normalize to NFC before.

there is an isaplha in the ICU project. I think you can use that.

Related

C++17: Check if a file is encoded in UTF-8

I want to check whether a file is (likely) encoded in UTF-8. I don't want to use any external libraries (otherwise I would probably use Boost.Locale), just 'plain' C++17. I need this to be cross-platform compatible, at least on MS Windows and Linux, building with Clang, GCC and MSVC.
I am aware that such a check can only be a heuristic, since you can craft e.g. a ISO-8859 encoded file containing a weird combination of special charactes which yield a valid UTF-8 sequence (corresponding to probably equally weird, but different, unicode characters).
My best attempt so far is to use std::wstring_convert and std::codecvt<char16_t, char, std::mbstate_t> to attempt a conversion from the input data (assumed to be UTF-8) into something else (UTF-16 in this case) and handle a thrown std::range_error as "the file was not UTF-8". Something like this:
void check(const std::filesystem::path& path)
{
std::ifstream ifs(path);
if (!ifs)
{
return false;
}
std::string data = std::string(std::istreambuf_iterator<char>(ifs), std::istreambuf_iterator<char>());
std::wstring_convert<deletable_facet<std::codecvt<char16_t, char, std::mbstate_t>>, char16_t>
conv16;
try
{
std::u16string str16 = conv16.from_bytes(data);
std::cout << "Probably UTF-8\n";
}
catch (std::range_error&)
{
std::cout << "Not UTF-8!\n";
}
}
(Note that the conversion code, as well as the not defined deletable_facet, is taken more or less verbatim from cppreference.)
Is that a sensible approach? Are there better ways that do not rely on external libraries?

The rules for UTF-8 are much more stringent than for UTF-16, and are quite easy to follow. The code below basically does BNF parsing to check the validity of a string. If you plan to check on streams, remember that the longest UTF-8 sequence is 6 bytes long, so if an error appears less that 6 bytes before the end of a buffer, you may have a truncated symbol.
NOTE: the code below is backwards-compatible with RFC-2279, the precursor to the current standard (defined in RFC-3629). If any of the text you plan to check could have been generated by software made before 2004, then use this, else if you need more stringent testing for RFC-3679 compliance, the rules can be modified quite easily.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>
size_t find_first_not_utf8(std::string_view s) {
// ----------------------------------------------------
// returns true if fn(c) returns true for all n first charac-ters c of
// string src. the sring_voew is updated to exclude the first n characters
// if a match is found, left untouched otherwise.
const auto match_n = [](std::string_view& src, size_t n, auto&& fn) noexcept {
if (src.length() < n) return false;
if (!std::all_of(src.begin(), src.begin() + n, fn))
return false;
src.remove_prefix(n);
return true;
};
// ----------------------------------------------------
// returns true if fn(c) returns true for the first character c of
// string src. the sring_view is updated to exclude the first character
// if a match is found, left untouched otherwise.
const auto match_1 = [](std::string_view& src, auto&& fn) noexcept {
if (src.empty()) return false;
if (!fn(src.front()))
return false;
src.remove_prefix(1);
return true;
};
// ----------------------------------------------------
// returns true if the first chatacter sequence of src is a valid non-ascii
// utf8 sequece.
// the sring_view is updated to exclude the first utf-8 sequence if non-ascii
// sequence is found, left untouched otherwise.
const auto utf8_non_ascii = [&](std::string_view& src) noexcept {
const auto SRC = src;
auto UTF8_CONT = [](uint8_t c) noexcept {
return 0x80 <= c && c <= 0xBF;
};
if (match_1(src, [](uint8_t c) { return 0xC0 <= c && c <= 0xDF; }) &&
match_1(src, UTF8_CONT)) {
return true;
}
src = SRC;
if (match_1(src, [](uint8_t c) { return 0xE0 <= c && c <= 0xEF; }) &&
match_n(src, 2, UTF8_CONT)) {
return true;
}
src = SRC;
if (match_1(src, [](uint8_t c) { return 0xF0 <= c && c <= 0xF7; }) &&
match_n(src, 3, UTF8_CONT)) {
return true;
}
src = SRC;
if (match_1(src, [](uint8_t c) { return 0xF8 <= c && c <= 0xFB; }) &&
match_n(src, 4, UTF8_CONT)) {
return true;
}
src = SRC;
if (match_1(src, [](uint8_t c) { return 0xFC <= c && c <= 0xFD; }) &&
match_n(src, 5, UTF8_CONT)) {
return true;
}
src = SRC;
return false;
};
// ----------------------------------------------------
// returns true if the first symbol of st(ring src is a valid UTF8 character
// not-including control characters, nor space.
// the sring_view is updated to exclude the first utf-8 sequence
// if a valid symbol sequence is found, left untouched otherwise.
const auto utf8_char = [&](std::string_view& src) noexcept {
auto rule = [](uint8_t c) noexcept -> bool {
return (0x21 <= c && c <= 0x7E) || std::isspace(c);
};
const auto SRC = src;
if (match_1(src, rule)) return true;
s = SRC;
return utf8_non_ascii(src);
};
// ----------------------------------------------------
const auto S = s;
while (!s.empty() && utf8_char(s)) {
}
if (s.empty()) return std::string_view::npos;
return size_t(s.data() - S.data());
}
void test(const std::string s) {
std::cout << "testing \'" << s << "\": ";
auto pos = find_first_not_utf8(s);
if (pos < s.length())
std::cout << "failed at offset " << pos << "\n";
else
std::cout << "OK\n";
}
auto greek = "Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν, ὦ ἄνδρες ᾿Αθηναῖοι\n ὅταν τ᾿ εἰς τὰ πράγματα ἀποβλέψω καὶ ὅταν πρὸς τοὺς ";
auto ethiopian = "ሰማይ አይታረስ ንጉሥ አይከሰስ።";
const char* errors[] = {
"2-byte sequence with last byte missing (U+0000): \xC0xyz",
"3-byte sequence with last byte missing (U+0000): \xe0\x81xyz",
"4-byte sequence with last byte missing (U+0000): \xF0\x83\x80xyz",
"5-byte sequence with last byte missing (U+0000): \xF8\x81\x82\x83xyz",
"6-byte sequence with last byte missing (U+0000): \xFD\x81\x82\x83\x84xyz"
};
int main() {
test("hello world");
test(greek);
test(ethiopian);
for (auto& e : errors) test(e);
return 0;
}
You'll be able to play with the code here: https://godbolt.org/z/q6rbveEeY

Recommendation: Just use ICU
It exists (that is, it is already installed and in use) on every major modern OS you care about[citation needed]. It’s there. Use it.
The good news is, for what you want to do, you don’t even have to link with ICU ⟶ No extra magic compilation flags necessary!
This should compile with anything (modern) you’ve got:
#include <string>
#ifdef _WIN32
#include <icu.h>
#else
#include <unicode/utf8.h>
#endif
bool is_utf8( const char * s, size_t n )
{
if (!n) return true; // empty files are UTF-8 encoded
UChar32 c = 0;
int32_t i = 0;
do { U8_INTERNAL_NEXT_OR_SUB( s, i, (int32_t)n, c, 0 ); }
while (c and U_IS_UNICODE_CHAR( c ) and (i < (int32_t)n));
return !!c;
}
bool is_utf8( const std::string & s )
{
return is_utf8( s.c_str(), s.size() );
}
If you are using MSVC’s C++17 or earlier, you’ll want to add an #include <ciso646> above that.
Example program:
#include <fstream>
#include <iostream>
#include <sstream>
auto file_to_string( const std::string & filename )
{
std::ifstream f( filename, std::ios::binary );
std::ostringstream ss;
ss << f.rdbuf();
return ss.str();
}
auto ask( const std::string & prompt )
{
std::cout << prompt;
std::string s;
getline( std::cin, s );
return s;
}
int main( int, char ** argv )
{
std::string filename = argv[1] ? argv[1] : ask( "filename? " );
std::cout << (is_utf8( file_to_string( filename ) )
? "UTF-8 encoded\n"
: "Unknown encoding\n");
}
Tested with (Windows) MSVC, Clang/LLVM, MinGW-w64, TDM and (Linux) GCC, Clang over a whole bunch of random UTF-8 test files (valid and invalid) that I won’t offer you here.
cl /EHsc /W4 /Ox /std:c++17 isutf8.cpp
clang++ -Wall -Wextra -Werror -pedantic-errors -O3 -std=c++17 isutf8.cpp
(My copy of TDM is a little out of date. I also had to tell it where to find the ICU headers.)
Update
So, there is some interesting commentary about my claim to ICU’s ubiquity.
That's not how answers work. You're the one making the claim; you are therefore the one who must provide evidence for it.
Ah, but I am not making an extraordinary claim. But lest I get caught in a Shifting Burden of Proof circle, here’s my end of the easily-discovered stick. Clicky-clicky!
Microsoft’s ICU documentation
Arch Linux core
Ubuntu manifests
Mint manifest
CentOS
Red Hat Enterprise 9 • Base OS
Fedora buildsystem docs for ICU
SUSE Enterprise Basesystem Module
Oracle Solaris ICU announcement page
Android provides it via Java classes: android.icu.lang .. android.icu.text
Apple Developer documentation describes using the system ICU
And so on...I’m not going to look up every major OS.
What this boils down to is that if you have a shiny window manager or <insert internet browser here> or basically any modern i18n text processing software program on your OS, there is a very high probability that it uses ICU (and things like HarfBuzz-icu)
My pleasure.
I find no pleasure here. Online compilers aren’t meant to compile anything beyond basic, text-I/O, single-file, simple programs. The fact that Godbolt’s online compiler can actually pull an include file from the web is, AFAIK, unique.
But while indeed cool, its limitations are acknowledged here — the ultimate consequence being that it would be absolutely impossible to compile something against ICU using godbolt.org or any other online compiler.
Which leads to a final note relevant to the code sample I gave above:
You need to properly configure you toolsif you expect them to work for you
For the above code snippet you must have ICU headers installed on your development machine. That is a given and should not surprise anyone. Just because your system has ICU libraries installed, and the software on it uses them, does not mean your compiler can automagically compile against the library.
For Windows you do automatically get the <icu.h> with the most recent WDKs (for some years now, and <icucommon.h> and <icui18n.h> before that).
For *nixen you will have to do something like sudo apt-get install libicu-dev or whatever is appropriate for your OS package manager.
I am glad I had to look into this, at least, because I just remembered that I have my development environments a little better initialized than the basic defaults, and was inadvertently using my local copy of ICU’s headers instead of Windows’. So I fixed it in the code above with that wonky #ifdef.
Must roll your own?
This is not difficult, and many people have different solutions to this, but there is a tricky consideration: a valid UTF-8 file should have valid Unicode code-points — which is more than just basic UTF-8/CESU-8/Modified-UTF-8/etc form validation.
If all you care is that the data is encoded using UTF-8 scheme, then Michaël Roy’s solution above looks fine to my eyeballs.
Personally, I think you should be a bit more strict, which properly requires you to actually decode the UTF-8 data to Unicode code points and verify them as well.
This requires very little more effort, and as it is something that your reader needs to do to access the data at some point anyway, why not just do it once and get it over with?
Still, here is just the check:
#include <algorithm>
#include <ciso646>
#include <string>
namespace utf8
{
bool is_unicode( char32_t c )
{
return ((c & 0xFFFE) != 0xFFFE)
and (c < 0x10FFFF);
}
bool is_surrogate ( char32_t c ) { return (c & 0xF800) == 0xD800; }
bool is_high_surrogate( char32_t c ) { return (c & 0xFC00) == 0xD800; }
bool is_low_surrogate ( char32_t c ) { return (c & 0xFC00) == 0xDC00; }
char32_t decode( const char * & first, const char * last, char32_t invalid = 0xFFFD )
{
// Empty sequence
if (first == last) return invalid;
// decode byte length of encoded code point (1..4) from first octet
static const unsigned char nbytes[] =
{
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
};
unsigned char k, n = k = nbytes[(unsigned char)*first >> 3];
if (!n) { ++first; return invalid; }
// extract bits from lead octet
static const unsigned char masks[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
char32_t c = (unsigned char)*first++ & masks[n];
// extract bits from remaining octets
while (--n and (first != last) and ((signed char)*first < -0x40))
c = (c << 6) | ((unsigned char)*first++ & 0x3F);
// the possibility of an incomplete sequence (continuing with future
// input at a later invocation) is ignored here.
if (n != 0) return invalid;
// overlong-encoded sequences are not valid
if (k != 1 + (c > 0x7F) + (c > 0x7FF) + (c > 0xFFFF)) return invalid;
// the end
return is_unicode( c ) and !is_surrogate( c ) ? c : invalid;
}
bool is_utf8( const std::string & s )
{
return []( const char * first, const char * last )
{
if (first != last)
{
// ignore UTF-8 BOM
if ((last-first) > 2)
if ( ((unsigned char)first[0] == 0xEF)
and ((unsigned char)first[1] == 0xBF)
and ((unsigned char)first[2] == 0xBE) )
first += 3;
while (first != last)
if (decode( first, last, 0x10FFFF ) == 0x10FFFF)
return false;
}
return true;
}
( s.c_str(), s.c_str()+s.size() );
}
} // namespace utf8
using utf8::is_utf8;
The very same example program as above can be used to play with the new code.
It behaves exactly the same as the ICU code.
Variants
I have ignored some common UTF-8 variants. In particular:
CESU-8 is a variation that happens when software working over UTF-16 forgets that surrogate pairs exist and encode them as two adjacent UTF-8 code sequences.
Modified UTF-8 is a special encoding where '\0' is expressly encoded with the overlong sequence C0 80, which makes nul-terminated strings continue to work. Strict UTF-8 requires encoders to use as few octets as possible, but we could accept this one specific overlong sequence anyway.
We will not, however, accept 5- or 6-octet sequences. The current Unicode UTF-8 standard, which is twenty years old now (2003), emphatically forbids them.
Modified UTF-8 implies CESU-8.
WTF-8 happens, too. WTF-8 implies Modified UTF-8.
PEP 383 can go die in a lonely corner.
You may wish to consider these as valid. While the Unicode people think that those things shouldn’t appear in files you may have access to they do recognize that it is possible and not necessarily wrong. It wouldn’t take much to modify the code to enable checks for each of those. Let us know if that is what you are looking to do.
Simple, quick-n-dirty solutions look cute on the internet, but messing with text has corner cases and considerations that people on discussion forums like to forget — Which is the main reason I do not recommend doing this yourself. Use ICU. It is a highly-optimized, industry-proven library designed by people who eat and breathe this stuff. Everyone else is just hoping they get it right while the software that actually needs it just uses ICU.
Even the C++ Standard Library got it wrong, which is why the whole thing was deprecated. (std::codecvt_utf8 may or may not accept any of CESU-8, Modified UTF-8, and WTF-8, and its behavior is not consistent across platforms in that regard. That and its design means you must make more than a couple passes over your data to verify it, in contrast to the single-pass-cum-verify that ICU [and my code] does. Maybe not much of an issue in today’s highly-optimized memory pipelines, but still, I’m an old fart about this.)

How to convert unicode characters to uppercase in C++

I'm learning about unicode in C++ and I have a hard time getting it to work properly. I try to treat the individual characters as uint64_t. It works if all I need it for is to print out the characters, but the problem is that I need to convert them to uppercase. I could store the uppercase letters in an array and simply use the same index as I do for the lowercase letters, but I'm looking for a more elegant solution. I found this similar question but most of the answers used wide characters, which is not something I can use. Here is what I have attempted:
#include <iostream>
#include <locale>
#include <string>
#include <cstdint>
#include <algorithm>
// hacky solution to store a multibyte character in a uint64_t
#define E(c) ((((uint64_t) 0 | (uint32_t) c[0]) << 32) | (uint32_t) c[1])
typedef std::string::value_type char_t;
char_t upcase(char_t ch) {
return std::use_facet<std::ctype<char_t>>(std::locale()).toupper(ch);
}
std::string toupper(const std::string &src) {
std::string result;
std::transform(src.begin(), src.end(), std::back_inserter(result), upcase);
return result;
}
const uint64_t VOWS_EXTRA[]
{
E("å") , E("ä"), E("ö"), E("ĳ"), E("ø"), E("æ")
};
int main(void) {
char name[5];
std::locale::global(std::locale("sv_SE.UTF8"));
name[0] = (VOWS_EXTRA[3] >> 32) & ~((uint32_t)0);
name[1] = VOWS_EXTRA[3] & ~((uint32_t)0);
name[2] = '\0';
std::cout << toupper(name) << std::endl;
}
I expect this to print out the character Ĳ but in reality it prints out the same character as it was in the beginning (ĳ).
(EDIT: OK, so I read more about the unicode support in standard C++ here. It seems like my best bet is to use something like ICU or Boost.locale for this task. C++ essentially treats std::string as a blob of binary data so doesn't seem to be an easy task to uppercase unicode letters properly. I think that my hacky solution using uint64_t isn't in any way more useful than the C++ standard library if not even worse. I'd be grateful for an example on how to achieve the behaviour stated above using ICU.)

Have a look at the ICU User Guide. For simple (single-character) case mapping, you can use u_toupper. For full case mapping, use u_strToUpper. Example code:
#include <unicode/uchar.h>
#include <unicode/ustdio.h>
#include <unicode/ustring.h>
int main() {
UChar32 upper = u_toupper(U'ĳ');
u_printf("%lC\n", upper);
UChar src = u'ß';
UChar dest[3];
UErrorCode err = U_ZERO_ERROR;
u_strToUpper(dest, 3, &src, 1, NULL, &err);
u_printf("%S\n", dest);
return 0;
}

also if anyone else is looking for it, std::towupper and std::towlower seemed to work fine
https://en.cppreference.com/w/cpp/string/wide/towupper

What is the correct way of processing different strings encodings via c++ main char** args?

I need some clarifications.
The problem is I have a program for windows written in C++ which uses 'wmain' windows-specific function that accepts wchar_t** as its args. So, there is an opportunity to pass whatever-you-like as a command line parameters to such program: for example, Chinese symbols, Japanese ones, etc, etc.
To be honest, I have no information about the encoding this function is usually used with. Probably utf-32, or even utf-16.
So, the questions:
What is the not windows-specific, but unix/linux way to achieve this with standard main function? My first thoughts were about usage of utf-8 encoded input strings with some kind of locales specifying?
Can somebody give a simple example of such main function? How can a std::string hold a Chinese symbols?
Can we operate with Chinese symbols encoded in utf-8 and contained in std::strings as usual when we just access each char (byte) like this: string_object[i] ?

Disclaimer: All Chinese words provided by GOOGLE translate service.
1) Just proceed as normal using normal std::string. The std::string can hold any character encoding and argument processing is simple pattern matching. So on a Chinese computer with the Chinese version of the program installed all it needs to do is compare Chinese versions of the flags to what the user inputs.
2) For example:
#include <string>
#include <vector>
#include <iostream>
std::string arg_switch = "开关";
std::string arg_option = "选项";
std::string arg_option_error = "缺少参数选项";
int main(int argc, char* argv[])
{
const std::vector<std::string> args(argv + 1, argv + argc);
bool do_switch = false;
std::string option;
for(auto arg = args.begin(); arg != args.end(); ++arg)
{
if(*arg == "--" + arg_switch)
do_switch = true;
else if(*arg == "--" + arg_option)
{
if(++arg == args.end())
{
// option needs a value - not found
std::cout << arg_option_error << '\n';
return 1;
}
option = *arg;
}
}
std::cout << arg_switch << ": " << (do_switch ? "on":"off") << '\n';
std::cout << arg_option << ": " << option << '\n';
return 0;
}
Usage:
./program --开关 --选项 wibble
Output:
开关: on
选项: wibble
3) No.
For UTF-8/UTF-16 data we need to use special libraries like ICU
For character by character processing you need to use or convert to UTF-32.

In short:
int main(int argc, char **argv) {
setlocale(LC_CTYPE, "");
// ...
}
http://unixhelp.ed.ac.uk/CGI/man-cgi?setlocale+3
And then you use mulitbyte string functions. You can still use normal std::string for storing multibyte strings, but beware that characters in them may span multiple array cells. After successfully setting the locale, you can also use wide streams (wcin, wcout, wcerr) to read and write wide strings from the standard streams.

1) with linux, you'd get standard main(), and standard char. It would use UTF-8 encoding. So chineese specific characters would be included in the string with a multibyte encoding.
***Edit:**sorry, yes: you have to set the default "" locale like here as well as cout.imbue().*
2) All the classic main() examples would be good examples. As said, chineese specific characters would be included in the string with a multibyte encoding. So if you cout such a string with the default UTF-8 locale, the cout sream would interpret the special UTF8 encoded sequences, knowing it has to agregate between 2 and 6 of each in order to produce the chineese output.
3) you can operate as usual on strings. THere are some issues however if you cout the string length for example: there is a difference between memory (ex: 3 bytes) and the chars that the user sees (ex: only 1). Same if you move with a pointer forward or backward. You have to make sure you interpret mulrtibyte encoding correctly, in order not to output an invalid encoding.
You could be interested in this other SO question.
Wikipedia explains the logic of the UTF-8 multibyte encoding. From this article you'll understand that any char u is a multibyte encoded char if:
( ((u & 0xE0) == 0xC0)
|| ((u & 0xF0) == 0xE0)
|| ((u & 0xF8) == 0xF0)
|| ((u & 0xFC) == 0xF8)
|| ((u & 0xFE) == 0xFC) )
It is followed by one or several chars such as:
((u & 0xC0) == 0x80)
All other chars are ASCII chars (i.e. not multibyte).

Does POSIX regex.h provide unicode or basically non-ascii characters?

Hi i am using Standard Regex Library (regcomp, regexec..). But now on demand i should add unicode support to my codes for regular expressions.
Does Standard Regex Library provide unicode or basically non-ascii characters? I researched on the Web, and think not.
My project is resource critic therefore i don't want to use large libraries for it (ICU and Boost.Regex).
Any help would be appreciated..

Looks like POSIX Regex working properly with UTF-8 locale. I've just wrote a simple test (see below) and used it for matching string with a cyrillic characters against regex "[[:alpha:]]" (for example). And everything working just fine.
Note: The main thing you must remember - regex functions are locale-related. So you must call setlocale() before it.
#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>
int main(int argc, char** argv) {
int ret;
regex_t reg;
regmatch_t matches[10];
if (argc != 3) {
fprintf(stderr, "Usage: %s regex string\n", argv[0]);
return 1;
}
setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */
if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
char buf[256];
regerror(ret, &reg, buf, sizeof(buf));
fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
return 1;
}
if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
int i;
char buf[256];
int size;
for (i = 0; i < sizeof(matches) / sizeof(regmatch_t); i++) {
if (matches[i].rm_so == -1) break;
size = matches[i].rm_eo - matches[i].rm_so;
if (size >= sizeof(buf)) {
fprintf(stderr, "match (%d-%d) is too long (%d)\n",
matches[i].rm_so, matches[i].rm_eo, size);
continue;
}
buf[size] = '\0';
printf("%d: %d-%d: '%s'\n", i, matches[i].rm_so, matches[i].rm_eo,
strncpy(buf, argv[2] + matches[i].rm_so, size));
}
}
return 0;
}
Usage example:
$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$
The length of the matching result is two bytes because cyrillic letters in UTF-8 takes so much.

Basically, POSIX regexes are not Unicode aware. You can try to use them on Unicode characters, but there might be problems with glyphs that have multiple encodings and other such issues that Unicode aware libraries handle for you.
From the standard, IEEE Std 1003.1-2008:
Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
Maybe libpcre would work for you? It's slightly heavier than POSIX regexes, but I would think it lighter than ICU or Boost.

If you really mean "Standard", i.e. std::regex from C++11, then all you need to do is switch to std::wregex (and std::wstring of course).

Convert wchar_t to char

I was wondering is it safe to do so?
wchar_t wide = /* something */;
assert(wide >= 0 && wide < 256 &&);
char myChar = static_cast<char>(wide);
If I am pretty sure the wide char will fall within ASCII range.

Why not just use a library routine wcstombs.

assert is for ensuring that something is true in a debug mode, without it having any effect in a release build. Better to use an if statement and have an alternate plan for characters that are outside the range, unless the only way to get characters outside the range is through a program bug.
Also, depending on your character encoding, you might find a difference between the Unicode characters 0x80 through 0xff and their char version.

You are looking for wctomb(): it's in the ANSI standard, so you can count on it. It works even when the wchar_t uses a code above 255. You almost certainly do not want to use it.
wchar_t is an integral type, so your compiler won't complain if you actually do:
char x = (char)wc;
but because it's an integral type, there's absolutely no reason to do this. If you accidentally read Herbert Schildt's C: The Complete Reference, or any C book based on it, then you're completely and grossly misinformed. Characters should be of type int or better. That means you should be writing this:
int x = getchar();
and not this:
char x = getchar(); /* <- WRONG! */
As far as integral types go, char is worthless. You shouldn't make functions that take parameters of type char, and you should not create temporary variables of type char, and the same advice goes for wchar_t as well.
char* may be a convenient typedef for a character string, but it is a novice mistake to think of this as an "array of characters" or a "pointer to an array of characters" - despite what the cdecl tool says. Treating it as an actual array of characters with nonsense like this:
for(int i = 0; s[i]; ++i) {
wchar_t wc = s[i];
char c = doit(wc);
out[i] = c;
}
is absurdly wrong. It will not do what you want; it will break in subtle and serious ways, behave differently on different platforms, and you will most certainly confuse the hell out of your users. If you see this, you are trying to reimplement wctombs() which is part of ANSI C already, but it's still wrong.
You're really looking for iconv(), which converts a character string from one encoding (even if it's packed into a wchar_t array), into a character string of another encoding.
Now go read this, to learn what's wrong with iconv.

An easy way is :
wstring your_wchar_in_ws(<your wchar>);
string your_wchar_in_str(your_wchar_in_ws.begin(), your_wchar_in_ws.end());
char* your_wchar_in_char = your_wchar_in_str.c_str();
I'm using this method for years :)

A short function I wrote a while back to pack a wchar_t array into a char array. Characters that aren't on the ANSI code page (0-127) are replaced by '?' characters, and it handles surrogate pairs correctly.
size_t to_narrow(const wchar_t * src, char * dest, size_t dest_len){
size_t i;
wchar_t code;
i = 0;
while (src[i] != '\0' && i < (dest_len - 1)){
code = src[i];
if (code < 128)
dest[i] = char(code);
else{
dest[i] = '?';
if (code >= 0xD800 && code <= 0xD8FF)
// lead surrogate, skip the next code unit, which is the trail
i++;
}
i++;
}
dest[i] = '\0';
return i - 1;
}

Technically, 'char' could have the same range as either 'signed char' or 'unsigned char'. For the unsigned characters, your range is correct; theoretically, for signed characters, your condition is wrong. In practice, very few compilers will object - and the result will be the same.
Nitpick: the last && in the assert is a syntax error.
Whether the assertion is appropriate depends on whether you can afford to crash when the code gets to the customer, and what you could or should do if the assertion condition is violated but the assertion is not compiled into the code. For debug work, it seems fine, but you might want an active test after it for run-time checking too.

Here's another way of doing it, remember to use free() on the result.
char* wchar_to_char(const wchar_t* pwchar)
{
// get the number of characters in the string.
int currentCharIndex = 0;
char currentChar = pwchar[currentCharIndex];
while (currentChar != '\0')
{
currentCharIndex++;
currentChar = pwchar[currentCharIndex];
}
const int charCount = currentCharIndex + 1;
// allocate a new block of memory size char (1 byte) instead of wide char (2 bytes)
char* filePathC = (char*)malloc(sizeof(char) * charCount);
for (int i = 0; i < charCount; i++)
{
// convert to char (1 byte)
char character = pwchar[i];
*filePathC = character;
filePathC += sizeof(char);
}
filePathC += '\0';
filePathC -= (sizeof(char) * charCount);
return filePathC;
}

one could also convert wchar_t --> wstring --> string --> char
wchar_t wide;
wstring wstrValue;
wstrValue[0] = wide
string strValue;
strValue.assign(wstrValue.begin(), wstrValue.end()); // convert wstring to string
char char_value = strValue[0];

In general, no. int(wchar_t(255)) == int(char(255)) of course, but that just means they have the same int value. They may not represent the same characters.
You would see such a discrepancy in the majority of Windows PCs, even. For instance, on Windows Code page 1250, char(0xFF) is the same character as wchar_t(0x02D9) (dot above), not wchar_t(0x00FF) (small y with diaeresis).
Note that it does not even hold for the ASCII range, as C++ doesn't even require ASCII. On IBM systems in particular you may see that 'A' != 65

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to test a u32string for letters only (with locale) - c++

there is an isaplha in the ICU project. I think you can use that.

Related

C++17: Check if a file is encoded in UTF-8

How to convert unicode characters to uppercase in C++

What is the correct way of processing different strings encodings via c++ main char** args?

Does POSIX regex.h provide unicode or basically non-ascii characters?

Convert wchar_t to char

Categories

Resources