C++17: Check if a file is encoded in UTF-8

I want to check whether a file is (likely) encoded in UTF-8. I don't want to use any external libraries (otherwise I would probably use Boost.Locale), just 'plain' C++17. I need this to be cross-platform compatible, at least on MS Windows and Linux, building with Clang, GCC and MSVC.
I am aware that such a check can only be a heuristic, since you can craft e.g. an ISO-8859 encoded file containing a weird combination of special characters which yields a valid UTF-8 sequence (corresponding to probably equally weird, but different, Unicode characters).
My best attempt so far is to use std::wstring_convert and std::codecvt<char16_t, char, std::mbstate_t> to attempt a conversion from the input data (assumed to be UTF-8) into something else (UTF-16 in this case) and handle a thrown std::range_error as "the file was not UTF-8". Something like this:
bool check(const std::filesystem::path& path)
{
    std::ifstream ifs(path);
    if (!ifs)
    {
        return false;
    }
    std::string data = std::string(std::istreambuf_iterator<char>(ifs),
                                   std::istreambuf_iterator<char>());
    std::wstring_convert<deletable_facet<std::codecvt<char16_t, char, std::mbstate_t>>, char16_t>
        conv16;
    try
    {
        std::u16string str16 = conv16.from_bytes(data);
        std::cout << "Probably UTF-8\n";
        return true;
    }
    catch (std::range_error&)
    {
        std::cout << "Not UTF-8!\n";
        return false;
    }
}
(Note that the conversion code, as well as the deletable_facet helper (not defined above), is taken more or less verbatim from cppreference.)
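For reference, the deletable_facet in that cppreference example is a thin wrapper that gives the facet a public destructor, along these lines:

#include <utility>

// Makes a std::codecvt facet destructible outside of a std::locale
// (std::codecvt's destructor is protected).
template <class Facet>
struct deletable_facet : Facet
{
    template <class... Args>
    deletable_facet(Args&&... args) : Facet(std::forward<Args>(args)...) {}
    ~deletable_facet() {}
};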
Is that a sensible approach? Are there better ways that do not rely on external libraries?

The rules for UTF-8 are much more stringent than for UTF-16, and are quite easy to follow. The code below basically does BNF parsing to check the validity of a string. If you plan to check on streams, remember that the longest UTF-8 sequence is 6 bytes long, so if an error appears less than 6 bytes before the end of a buffer, you may have a truncated symbol.
NOTE: the code below is backwards-compatible with RFC-2279, the precursor to the current standard (defined in RFC-3629). If any of the text you plan to check could have been generated by software made before 2004, then use this; else, if you need more stringent testing for RFC-3629 compliance, the rules can be modified quite easily (a sketch of that modification follows the code below).
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <string_view>

size_t find_first_not_utf8(std::string_view s) {
    // ----------------------------------------------------
    // returns true if fn(c) returns true for the first n characters c of
    // string src. the string_view is updated to exclude the first n characters
    // if a match is found, left untouched otherwise.
    const auto match_n = [](std::string_view& src, size_t n, auto&& fn) noexcept {
        if (src.length() < n) return false;
        if (!std::all_of(src.begin(), src.begin() + n, fn))
            return false;
        src.remove_prefix(n);
        return true;
    };
    // ----------------------------------------------------
    // returns true if fn(c) returns true for the first character c of
    // string src. the string_view is updated to exclude the first character
    // if a match is found, left untouched otherwise.
    const auto match_1 = [](std::string_view& src, auto&& fn) noexcept {
        if (src.empty()) return false;
        if (!fn(src.front()))
            return false;
        src.remove_prefix(1);
        return true;
    };
    // ----------------------------------------------------
    // returns true if the first character sequence of src is a valid non-ascii
    // utf8 sequence.
    // the string_view is updated to exclude the first utf-8 sequence if a
    // non-ascii sequence is found, left untouched otherwise.
    const auto utf8_non_ascii = [&](std::string_view& src) noexcept {
        const auto SRC = src;
        auto UTF8_CONT = [](uint8_t c) noexcept {
            return 0x80 <= c && c <= 0xBF;
        };
        if (match_1(src, [](uint8_t c) { return 0xC0 <= c && c <= 0xDF; }) &&
            match_1(src, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xE0 <= c && c <= 0xEF; }) &&
            match_n(src, 2, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xF0 <= c && c <= 0xF7; }) &&
            match_n(src, 3, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xF8 <= c && c <= 0xFB; }) &&
            match_n(src, 4, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        if (match_1(src, [](uint8_t c) { return 0xFC <= c && c <= 0xFD; }) &&
            match_n(src, 5, UTF8_CONT)) {
            return true;
        }
        src = SRC;
        return false;
    };
    // ----------------------------------------------------
    // returns true if the first symbol of string src is a valid UTF8 character,
    // not including control characters, nor space.
    // the string_view is updated to exclude the first utf-8 sequence
    // if a valid symbol sequence is found, left untouched otherwise.
    const auto utf8_char = [&](std::string_view& src) noexcept {
        auto rule = [](uint8_t c) noexcept -> bool {
            return (0x21 <= c && c <= 0x7E) || std::isspace(c);
        };
        const auto SRC = src;
        if (match_1(src, rule)) return true;
        src = SRC;
        return utf8_non_ascii(src);
    };
    // ----------------------------------------------------
    const auto S = s;
    while (!s.empty() && utf8_char(s)) {
    }
    if (s.empty()) return std::string_view::npos;
    return size_t(s.data() - S.data());
}
void test(const std::string s) {
    std::cout << "testing \"" << s << "\": ";
    auto pos = find_first_not_utf8(s);
    if (pos < s.length())
        std::cout << "failed at offset " << pos << "\n";
    else
        std::cout << "OK\n";
}
auto greek = "Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν, ὦ ἄνδρες ᾿Αθηναῖοι\n ὅταν τ᾿ εἰς τὰ πράγματα ἀποβλέψω καὶ ὅταν πρὸς τοὺς ";
auto ethiopian = "ሰማይ አይታረስ ንጉሥ አይከሰስ።";
const char* errors[] = {
    "2-byte sequence with last byte missing (U+0000): \xC0xyz",
    "3-byte sequence with last byte missing (U+0000): \xe0\x81xyz",
    "4-byte sequence with last byte missing (U+0000): \xF0\x83\x80xyz",
    "5-byte sequence with last byte missing (U+0000): \xF8\x81\x82\x83xyz",
    "6-byte sequence with last byte missing (U+0000): \xFD\x81\x82\x83\x84xyz"
};

int main() {
    test("hello world");
    test(greek);
    test(ethiopian);
    for (auto& e : errors) test(e);
    return 0;
}
You'll be able to play with the code here: https://godbolt.org/z/q6rbveEeY
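As a sketch of the stricter RFC-3629 modification mentioned in the note above: drop the 5- and 6-byte branches of utf8_non_ascii and cap the 4-byte lead bytes at 0xF4 (higher lead bytes would encode code points beyond U+10FFFF). Like the original, this checks form only; it still does not reject surrogates or overlong encodings:

    // RFC-3629-only variant of the 4-byte branch in utf8_non_ascii;
    // the 0xF8..0xFB and 0xFC..0xFD branches are removed entirely.
    if (match_1(src, [](uint8_t c) { return 0xF0 <= c && c <= 0xF4; }) &&
        match_n(src, 3, UTF8_CONT)) {
        return true;
    }
    src = SRC;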

Recommendation: Just use ICU
It exists (that is, it is already installed and in use) on every major modern OS you care about[citation needed]. It’s there. Use it.
The good news is, for what you want to do, you don’t even have to link with ICU ⟶ No extra magic compilation flags necessary!
This should compile with anything (modern) you’ve got:
#include <string>

#ifdef _WIN32
#include <icu.h>
#else
#include <unicode/utf8.h>
#endif

bool is_utf8( const char * s, size_t n )
{
    if (!n) return true;  // empty files are UTF-8 encoded
    UChar32 c = 0;
    int32_t i = 0;
    do { U8_INTERNAL_NEXT_OR_SUB( s, i, (int32_t)n, c, 0 ); }
    while (c and U_IS_UNICODE_CHAR( c ) and (i < (int32_t)n));
    return !!c;
}

bool is_utf8( const std::string & s )
{
    return is_utf8( s.c_str(), s.size() );
}
If you are using MSVC’s C++17 or earlier, you’ll want to add an #include <ciso646> above that.
Example program:
#include <fstream>
#include <iostream>
#include <sstream>

auto file_to_string( const std::string & filename )
{
    std::ifstream f( filename, std::ios::binary );
    std::ostringstream ss;
    ss << f.rdbuf();
    return ss.str();
}

auto ask( const std::string & prompt )
{
    std::cout << prompt;
    std::string s;
    getline( std::cin, s );
    return s;
}

int main( int, char ** argv )
{
    std::string filename = argv[1] ? argv[1] : ask( "filename? " );
    std::cout << (is_utf8( file_to_string( filename ) )
        ? "UTF-8 encoded\n"
        : "Unknown encoding\n");
}
Tested with (Windows) MSVC, Clang/LLVM, MinGW-w64, TDM and (Linux) GCC, Clang over a whole bunch of random UTF-8 test files (valid and invalid) that I won’t offer you here.
cl /EHsc /W4 /Ox /std:c++17 isutf8.cpp
clang++ -Wall -Wextra -Werror -pedantic-errors -O3 -std=c++17 isutf8.cpp
(My copy of TDM is a little out of date. I also had to tell it where to find the ICU headers.)
Update
So, there is some interesting commentary about my claim to ICU’s ubiquity.
That's not how answers work. You're the one making the claim; you are therefore the one who must provide evidence for it.
Ah, but I am not making an extraordinary claim. But lest I get caught in a Shifting Burden of Proof circle, here’s my end of the easily-discovered stick. Clicky-clicky!
Microsoft’s ICU documentation
Arch Linux core
Ubuntu manifests
Mint manifest
CentOS
Red Hat Enterprise 9 • Base OS
Fedora buildsystem docs for ICU
SUSE Enterprise Basesystem Module
Oracle Solaris ICU announcement page
Android provides it via Java classes: android.icu.lang .. android.icu.text
Apple Developer documentation describes using the system ICU
And so on... I'm not going to look up every major OS.
What this boils down to is that if you have a shiny window manager or <insert internet browser here> or basically any modern i18n text processing software program on your OS, there is a very high probability that it uses ICU (and things like HarfBuzz-icu).
My pleasure.
I find no pleasure here. Online compilers aren’t meant to compile anything beyond basic, text-I/O, single-file, simple programs. The fact that Godbolt’s online compiler can actually pull an include file from the web is, AFAIK, unique.
But while indeed cool, its limitations are acknowledged here — the ultimate consequence being that it would be absolutely impossible to compile something against ICU using godbolt.org or any other online compiler.
Which leads to a final note relevant to the code sample I gave above:
You need to properly configure your tools if you expect them to work for you
For the above code snippet you must have ICU headers installed on your development machine. That is a given and should not surprise anyone. Just because your system has ICU libraries installed, and the software on it uses them, does not mean your compiler can automagically compile against the library.
For Windows you do automatically get the <icu.h> with the most recent WDKs (for some years now, and <icucommon.h> and <icui18n.h> before that).
For *nixen you will have to do something like sudo apt-get install libicu-dev or whatever is appropriate for your OS package manager.
I am glad I had to look into this, at least, because I just remembered that I have my development environments a little better initialized than the basic defaults, and was inadvertently using my local copy of ICU’s headers instead of Windows’. So I fixed it in the code above with that wonky #ifdef.
Must roll your own?
This is not difficult, and many people have different solutions to this, but there is a tricky consideration: a valid UTF-8 file should have valid Unicode code-points — which is more than just basic UTF-8/CESU-8/Modified-UTF-8/etc form validation.
If all you care about is that the data is encoded using the UTF-8 scheme, then Michaël Roy's solution above looks fine to my eyeballs.
Personally, I think you should be a bit more strict, which properly requires you to actually decode the UTF-8 data to Unicode code points and verify them as well.
This requires very little more effort, and as it is something that your reader needs to do to access the data at some point anyway, why not just do it once and get it over with?
Still, here is just the check:
#include <algorithm>
#include <ciso646>
#include <string>

namespace utf8
{
    bool is_unicode( char32_t c )
    {
        return ((c & 0xFFFE) != 0xFFFE)
           and (c < 0x10FFFF);
    }

    bool is_surrogate     ( char32_t c ) { return (c & 0xF800) == 0xD800; }
    bool is_high_surrogate( char32_t c ) { return (c & 0xFC00) == 0xD800; }
    bool is_low_surrogate ( char32_t c ) { return (c & 0xFC00) == 0xDC00; }

    char32_t decode( const char * & first, const char * last, char32_t invalid = 0xFFFD )
    {
        // Empty sequence
        if (first == last) return invalid;

        // decode byte length of encoded code point (1..4) from first octet
        static const unsigned char nbytes[] =
        {
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0
        };
        unsigned char k, n = k = nbytes[(unsigned char)*first >> 3];
        if (!n) { ++first; return invalid; }

        // extract bits from lead octet
        static const unsigned char masks[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
        char32_t c = (unsigned char)*first++ & masks[n];

        // extract bits from remaining octets
        while (--n and (first != last) and ((signed char)*first < -0x40))
            c = (c << 6) | ((unsigned char)*first++ & 0x3F);

        // the possibility of an incomplete sequence (continuing with future
        // input at a later invocation) is ignored here.
        if (n != 0) return invalid;

        // overlong-encoded sequences are not valid
        if (k != 1 + (c > 0x7F) + (c > 0x7FF) + (c > 0xFFFF)) return invalid;

        // the end
        return is_unicode( c ) and !is_surrogate( c ) ? c : invalid;
    }

    bool is_utf8( const std::string & s )
    {
        return []( const char * first, const char * last )
        {
            if (first != last)
            {
                // ignore UTF-8 BOM (EF BB BF)
                if ((last-first) > 2)
                    if ( ((unsigned char)first[0] == 0xEF)
                     and ((unsigned char)first[1] == 0xBB)
                     and ((unsigned char)first[2] == 0xBF) )
                        first += 3;
                while (first != last)
                    if (decode( first, last, 0x10FFFF ) == 0x10FFFF)
                        return false;
            }
            return true;
        }
        ( s.c_str(), s.c_str()+s.size() );
    }

} // namespace utf8

using utf8::is_utf8;
The very same example program as above can be used to play with the new code.
It behaves exactly the same as the ICU code.
Variants
I have ignored some common UTF-8 variants. In particular:
CESU-8 is a variation that happens when software working over UTF-16 forgets that surrogate pairs exist and encodes them as two adjacent UTF-8 code sequences.
Modified UTF-8 is a special encoding where '\0' is expressly encoded with the overlong sequence C0 80, which makes nul-terminated strings continue to work. Strict UTF-8 requires encoders to use as few octets as possible, but we could accept this one specific overlong sequence anyway.
We will not, however, accept 5- or 6-octet sequences. The current Unicode UTF-8 standard, which is twenty years old now (2003), emphatically forbids them.
Modified UTF-8 implies CESU-8.
WTF-8 happens, too. WTF-8 implies Modified UTF-8.
PEP 383 can go die in a lonely corner.
You may wish to consider these as valid. While the Unicode people think that those things shouldn't appear in files you may have access to, they do recognize that it is possible and not necessarily wrong. It wouldn't take much to modify the code to enable checks for each of those; a sketch of one such tweak follows. Let us know if that is what you are looking to do.
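For instance, a minimal sketch of the Modified-UTF-8 tweak: relax the overlong check in decode() so that exactly the two-octet sequence C0 80 for U+0000 is admitted, and nothing else:

        // overlong-encoded sequences are not valid, except we additionally
        // permit Modified UTF-8's two-octet encoding of U+0000 (C0 80)
        if (k != 1 + (c > 0x7F) + (c > 0x7FF) + (c > 0xFFFF)
            and not (k == 2 and c == 0)) return invalid;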
Simple, quick-n-dirty solutions look cute on the internet, but messing with text has corner cases and considerations that people on discussion forums like to forget, which is the main reason I do not recommend doing this yourself. Use ICU. It is a highly-optimized, industry-proven library designed by people who eat and breathe this stuff. Everyone else is just hoping they get it right while the software that actually needs it just uses ICU.
Even the C++ Standard Library got it wrong, which is why the whole thing was deprecated. (std::codecvt_utf8 may or may not accept any of CESU-8, Modified UTF-8, and WTF-8, and its behavior is not consistent across platforms in that regard. That and its design mean you must make more than a couple of passes over your data to verify it, in contrast to the single-pass-cum-verify that ICU [and my code] does. Maybe not much of an issue in today's highly-optimized memory pipelines, but still, I'm an old fart about this.)

Related

How can I "convert" ISO-8859-7 strings to UTF-8 in C++?

I'm working with 10+ year-old machines which use ISO 8859-7 to represent Greek characters, using a single byte each.
I need to catch those characters and convert them to UTF-8 in order to inject them in a JSON to be sent via HTTPS.
Also, I'm using GCC v4.4.7 and I don't feel like upgrading, so I can't use codecvt or such.
Example: "OΛΑ":
I get char values [ 0xcf, 0xcb, 0xc1, ], I need to write this string "\u039F\u039B\u0391".
PS: I'm not a charset expert so please avoid philosophical answers like "ISO 8859 is a subset of Unicode so you just need to implement the algorithm".
Given that there are so few values to map, a simple solution is to use a lookup table.
Pseudocode:
id_offset    = 0x80 // 0x00 .. 0x7F same in UTF-8
c1_offset    = 0x20 // 0x80 .. 0x9F control characters
table_offset = id_offset + c1_offset

table = [
    u8"\u00A0", // 0xA0
    u8"‘",      // 0xA1
    u8"’",
    u8"£",
    u8"€",
    u8"₯",
    // ... Refer to ISO 8859-7 for full list of characters.
]

let S be the input string
let O be an empty output string

for each char C in S
    reinterpret C as unsigned char U
    if U less than id_offset          // same in both encodings
        append C to O
    else if U less than table_offset  // control code
        append char '\xC2' to O       // lead byte
        append char C to O
    else
        append string table[U - table_offset] to O
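A direct C++ rendering of that pseudocode might look like the sketch below. The table here is truncated; the full ISO 8859-7 block 0xA0..0xFF has 96 entries to fill in from the code chart. (Written for C++17 and earlier; in C++20 the u8 literals become char8_t.)

#include <string>

// Partial ISO 8859-7 -> UTF-8 table for bytes 0xA0 and up (fill in the rest).
static const char* const iso_table[] = {
    u8"\u00A0", u8"‘", u8"’", u8"£", u8"€", u8"₯",
    // ...
};

std::string iso88597_to_utf8(const std::string& in)
{
    std::string out;
    for (const char c : in) {
        const unsigned char u = static_cast<unsigned char>(c);
        if (u < 0x80)          // 0x00 .. 0x7F: same in both encodings
            out += c;
        else if (u < 0xA0) {   // 0x80 .. 0x9F: C1 control codes
            out += '\xC2';     // lead byte
            out += c;
        } else                 // everything else comes from the table
            out += iso_table[u - 0xA0];
    }
    return out;
}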
All that said, I recommend saving some time by using a library instead.
One way could be to use the POSIX iconv library. On Linux, the functions needed (iconv_open, iconv and iconv_close) are even included in libc, so no extra linkage is needed there. On your old machines you may need to install libiconv, but I doubt it.
Converting may be as simple as this:
#include <iconv.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

// A wrapper for the iconv functions
class Conv {
public:
    // Open a conversion descriptor for the two selected character sets
    Conv(const char* to, const char* from) : cd(iconv_open(to, from)) {
        if(cd == reinterpret_cast<iconv_t>(-1))
            throw std::runtime_error(std::strerror(errno));
    }
    Conv(const Conv&) = delete;
    ~Conv() { iconv_close(cd); }

    // the actual conversion function
    std::string convert(const std::string& in) {
        const char* inbuf = in.c_str();
        size_t inbytesleft = in.size();

        // make the "out" buffer big to fit whatever we throw at it and set pointers
        std::string out(inbytesleft * 6, '\0');
        char* outbuf = out.data();
        size_t outbytesleft = out.size();

        // the const_cast shouldn't be needed but my "iconv" function declares it
        // "char**" not "const char**"
        size_t non_rev_converted = iconv(cd, const_cast<char**>(&inbuf),
                                         &inbytesleft, &outbuf, &outbytesleft);
        if(non_rev_converted == static_cast<size_t>(-1)) {
            // here you can add misc handling like replacing erroneous chars
            // and continue converting etc.
            // I'll just throw...
            throw std::runtime_error(std::strerror(errno));
        }
        // shrink to keep only what we converted
        out.resize(outbuf - out.data());
        return out;
    }

private:
    iconv_t cd;
};

int main() {
    Conv cvt("UTF-8", "ISO-8859-7");

    // create a string from the ISO-8859-7 data
    unsigned char data[]{0xcf, 0xcb, 0xc1};
    std::string iso88597_str(std::begin(data), std::end(data));

    auto utf8 = cvt.convert(iso88597_str);
    std::cout << utf8 << '\n';
}
Output (in UTF-8):
ΟΛΑ
Using this you can create a mapping table, from ISO-8859-7 to UTF-8, that you include in your project instead of iconv:
Demo
Ok, I decided to do this myself instead of looking for a compatible library. Here's how I did it.

The main problem was figuring out how to fill the two bytes for Unicode using the single one for ISO, so I used the debugger to read the value for the same character, first written by the old machine and then written with a constant string (UTF-8 by default). I started with "O" and "Π" and saw that in UTF-8 the first byte was always 0xCE while the second one was filled with the ISO value plus an offset (-0x30).

I built the following code to implement this and used a test string filled with all Greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one changed, so I added a test to figure out which of the two rules to apply.

The following method returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed as a pointer, with the new one. I worked with char arrays instead of strings for coherence with the rest of the project, which is basically a C project written in C++.
#include <cstring>

bool iso_to_utf8(char* in){
    bool wasISO = false;
    if(in == NULL)
        return wasISO;
    // count chars
    int i = strlen(in);
    if(!i)
        return wasISO;
    // create and size new buffer (worst case: every char doubles, plus terminator)
    char *out = new char[2*i + 1];
    // fill with 0's, useful for watching the string as it gets built
    memset(out, 0, 2*i + 1);
    // ready to start from head of old buffer
    i = 0;
    // index for new buffer
    int j = 0;
    // for each char in old buffer
    while(in[i] != '\0'){
        if((unsigned char)in[i] < 0x80){
            // it's already utf8-compliant, take it as it is
            out[j++] = in[i];
        }else{
            // it's ISO
            wasISO = true;
            // get plain value
            int val = in[i] & 0xFF;
            // first byte to CF or CE
            out[j++] = val > 0xEF ? 0xCF : 0xCE;
            // second char to plain value normalized
            out[j++] = val - (val > 0xEF ? 0x70 : 0x30);
        }
        i++;
    }
    // add string terminator
    out[j] = '\0';
    // paste into old char array (the caller's buffer must be big enough)
    strcpy(in, out);
    delete[] out;
    return wasISO;
}
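A quick usage sketch with a hypothetical buffer; note that the caller's buffer must be large enough for the in-place expansion, since the function strcpy's the result back:

#include <cstring>
#include <iostream>

int main() {
    // "ΟΛΑ" in ISO 8859-7; oversized on purpose, because iso_to_utf8
    // writes the (up to twice as long) UTF-8 string back in place.
    char buf[16] = "\xcf\xcb\xc1";
    bool wasISO = iso_to_utf8(buf);
    std::cout << buf << " (wasISO=" << wasISO << ")\n";
}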

How to remove the last character of a UTF-8 string in C++?

The text is stored in a std::string.
If the text is 8-bit ASCII, then it is really easy:
text.pop_back();
But what if it is UTF-8 text?
As far as I know, there are no UTF-8 related functions in the standard library which I could use.
You really need a UTF-8 library if you are going to work with UTF-8. However, for this task I think something like this may suffice:
#include <iostream>
#include <string>

void pop_back_utf8(std::string& utf8)
{
    if(utf8.empty())
        return;

    auto cp = utf8.data() + utf8.size();
    while(--cp >= utf8.data() && ((*cp & 0b10000000) && !(*cp & 0b01000000))) {}
    if(cp >= utf8.data())
        utf8.resize(cp - utf8.data());
}

int main()
{
    std::string s = "κόσμε";
    while(!s.empty())
    {
        std::cout << s << '\n';
        pop_back_utf8(s);
    }
}
Output:
κόσμε
κόσμ
κόσ
κό
κ
It relies on the fact that in UTF-8 encoding each character has one start byte, optionally followed by continuation bytes. Those continuation bytes can be detected using the provided bitwise operators.
What you can do is pop off characters until you reach the leading byte of a code point. The leading byte of a code point in UTF8 is either of the pattern 0xxxxxxx or 11xxxxxx, and all non-leading bytes are of the form 10xxxxxx. This means you can check the first and second bit to determine if you have a leading byte.
bool is_leading_utf8_byte(char c) {
    auto first_bit_set = (c & 0x80) != 0;
    auto second_bit_set = (c & 0x40) != 0;
    return !first_bit_set || second_bit_set;
}

void pop_utf8(std::string& x) {
    while (!is_leading_utf8_byte(x.back()))
        x.pop_back();
    x.pop_back();
}
This of course does no error checking and assumes that your string is valid utf-8.

What is the correct way of processing different strings encodings via c++ main char** args?

I need some clarifications.
The problem is that I have a program for Windows, written in C++, which uses the Windows-specific 'wmain' function that accepts wchar_t** as its args. So there is an opportunity to pass whatever you like as command line parameters to such a program: for example, Chinese symbols, Japanese ones, etc.
To be honest, I have no information about the encoding this function is usually used with. Probably UTF-32, or even UTF-16.
So, the questions:
What is the non-Windows-specific, Unix/Linux way to achieve this with the standard main function? My first thought was to use UTF-8 encoded input strings with some kind of locale specified.
Can somebody give a simple example of such a main function? How can a std::string hold Chinese symbols?
Can we operate with Chinese symbols encoded in UTF-8 and contained in std::strings as usual, when we just access each char (byte) like this: string_object[i]?
Disclaimer: All Chinese words provided by GOOGLE translate service.
1) Just proceed as normal using normal std::string. The std::string can hold any character encoding and argument processing is simple pattern matching. So on a Chinese computer with the Chinese version of the program installed all it needs to do is compare Chinese versions of the flags to what the user inputs.
2) For example:
#include <string>
#include <vector>
#include <iostream>

std::string arg_switch = "开关";
std::string arg_option = "选项";
std::string arg_option_error = "缺少参数选项";

int main(int argc, char* argv[])
{
    const std::vector<std::string> args(argv + 1, argv + argc);

    bool do_switch = false;
    std::string option;

    for(auto arg = args.begin(); arg != args.end(); ++arg)
    {
        if(*arg == "--" + arg_switch)
            do_switch = true;
        else if(*arg == "--" + arg_option)
        {
            if(++arg == args.end())
            {
                // option needs a value - not found
                std::cout << arg_option_error << '\n';
                return 1;
            }
            option = *arg;
        }
    }

    std::cout << arg_switch << ": " << (do_switch ? "on":"off") << '\n';
    std::cout << arg_option << ": " << option << '\n';

    return 0;
}
Usage:
./program --开关 --选项 wibble
Output:
开关: on
选项: wibble
3) No.
For UTF-8/UTF-16 data we need to use special libraries like ICU
For character by character processing you need to use or convert to UTF-32.
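For example, a minimal sketch using the standard library's codecvt facilities (deprecated since C++17, but still present) to get one element per code point:

#include <codecvt>   // std::codecvt_utf8, deprecated in C++17 but available
#include <locale>
#include <string>

// Decode a UTF-8 std::string into UTF-32, one char32_t per code point,
// so that s32[i] really is the i-th character.
std::u32string to_utf32(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);   // throws std::range_error on invalid input
}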
In short:
#include <clocale>

int main(int argc, char **argv) {
    setlocale(LC_CTYPE, "");
    // ...
}
http://unixhelp.ed.ac.uk/CGI/man-cgi?setlocale+3
And then you use multibyte string functions. You can still use normal std::string for storing multibyte strings, but beware that characters in them may span multiple array cells. After successfully setting the locale, you can also use wide streams (wcin, wcout, wcerr) to read and write wide strings from the standard streams.
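A minimal sketch of that wide-stream route, assuming the environment's locale uses UTF-8:

#include <clocale>
#include <iostream>
#include <string>

int main() {
    std::setlocale(LC_CTYPE, "");    // pick up the user's locale
    std::wstring name;
    std::wcout << L"name? ";
    std::getline(std::wcin, name);   // multibyte input arrives as wide chars
    std::wcout << L"hello, " << name << L'\n';
}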
1) With Linux, you'd get standard main(), and standard char. It would use UTF-8 encoding, so Chinese-specific characters would be included in the string with a multibyte encoding.
Edit: sorry, yes: you have to set the default "" locale like here, as well as cout.imbue().
2) All the classic main() examples would be good examples. As said, Chinese-specific characters would be included in the string with a multibyte encoding. So if you cout such a string with the default UTF-8 locale, the cout stream would interpret the special UTF-8 encoded sequences, knowing it has to aggregate between 2 and 6 of them in order to produce the Chinese output.
3) You can operate as usual on strings. There are some issues however if you cout the string length for example: there is a difference between memory (ex: 3 bytes) and the chars that the user sees (ex: only 1). The same applies if you move a pointer forward or backward. You have to make sure you interpret the multibyte encoding correctly, in order not to output an invalid encoding.
You could be interested in this other SO question.
Wikipedia explains the logic of the UTF-8 multibyte encoding. From this article you'll understand that any char u is the first char of a multibyte encoded sequence if:
( ((u & 0xE0) == 0xC0)
|| ((u & 0xF0) == 0xE0)
|| ((u & 0xF8) == 0xF0)
|| ((u & 0xFC) == 0xF8)
|| ((u & 0xFE) == 0xFC) )
It is followed by one or several chars such as:
((u & 0xC0) == 0x80)
All other chars are ASCII chars (i.e. not multibyte).
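Wrapped up as helper functions, a sketch following that old-style 2..6-byte description (strict RFC 3629 would stop at 4-byte sequences, i.e. lead bytes up to 0xF4):

// Lead byte of a multibyte sequence per the conditions above (0xC0 .. 0xFD).
bool is_utf8_lead(unsigned char u) {
    return ((u & 0xE0) == 0xC0) || ((u & 0xF0) == 0xE0) || ((u & 0xF8) == 0xF0)
        || ((u & 0xFC) == 0xF8) || ((u & 0xFE) == 0xFC);
}

// Continuation byte (0x80 .. 0xBF).
bool is_utf8_cont(unsigned char u) {
    return (u & 0xC0) == 0x80;
}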

How to test a u32string for letters only (with locale)

I'm writing a compiler (for my own programming language) and I want to allow users to use any of the characters in the Unicode letter categories to define identifiers (modern languages, like Go allow such syntax already).
I've read a lot about character encoding in C++11 and, based on all the information I've found, it will be fine to use UTF-32 encoding (it is fast to iterate over in the lexer and it has better support than UTF-8 in C++).
In C++ there is the isalpha function. How can I test a char32_t to see if it is a letter (a Unicode code point classified as "letter" in any language)?
Is it even possible?
Use ICU to iterate over the string and check whether the appropriate Unicode properties are fulfilled. Here is an example in C that checks whether the UTF-8 command line argument is a valid identifier:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#include <unicode/uchar.h>
#include <unicode/utf8.h>

int main(int argc, char **argv) {
    if (argc != 2) return EXIT_FAILURE;
    const char *const str = argv[1];
    int32_t off = 0;
    // U8_NEXT has a bug causing length < 0 to not work for characters in [U+0080, U+07FF]
    const size_t actual_len = strlen(str);
    if (actual_len > INT32_MAX) return EXIT_FAILURE;
    const int32_t len = actual_len;
    if (!len) return EXIT_FAILURE;
    UChar32 ch = -1;
    U8_NEXT(str, off, len, ch);
    if (ch < 0 || !u_isIDStart(ch)) return EXIT_FAILURE;
    while (off < len) {
        U8_NEXT(str, off, len, ch);
        if (ch < 0 || !u_isIDPart(ch)) return EXIT_FAILURE;
    }
}
Note that ICU here uses the Java definitions, which are slightly different from those in UAX #31. In a real application you might also want to normalize to NFC before.
There is an isalpha in the ICU project. I think you can use that.
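The ICU function in question is u_isalpha from <unicode/uchar.h>, which tests for the Unicode letter categories (Lu, Ll, Lt, Lm, Lo); a minimal sketch:

#include <unicode/uchar.h>

// True if the code point is in a Unicode letter category (Lu/Ll/Lt/Lm/Lo).
bool is_letter(char32_t cp) {
    return u_isalpha(static_cast<UChar32>(cp));
}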

Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?
Since you asked, here are my standard conversion functions from string to wide string, implemented using the C++ std::string and std::wstring classes.
First off, make sure to start your program with setlocale:
#include <clocale>
int main()
{
std::setlocale(LC_CTYPE, ""); // before any string operations
}
Now for the functions. First off, getting a wide string from a narrow string:
#include <string>
#include <vector>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>
#include <iostream>

// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
    return s;
}

// Real worker
std::wstring get_wstring(const std::string & s)
{
    const char * cs = s.c_str();
    const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);

    if (wn == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    std::vector<wchar_t> buf(wn + 1);
    const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);

    if (wn_again == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    assert(cs == NULL); // successful conversion
    return std::wstring(buf.data(), wn);
}
And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:
// Dummy
std::string get_locale_string(const std::string & s)
{
    return s;
}

// Real worker
std::string get_locale_string(const std::wstring & s)
{
    const wchar_t * cs = s.c_str();
    const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);

    if (wn == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    std::vector<char> buf(wn + 1);
    const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);

    if (wn_again == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    assert(cs == NULL); // successful conversion
    return std::string(buf.data(), wn);
}
Some notes:
If you don't have std::vector::data(), you can say &buf[0] instead.
I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1); and wcstombs(buf.data(), cs, wn + 1);
In response to your question, if you want to compare two strings, you can convert both of them to wide string and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR and then compare with the wide string.
Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char* / wchar_t*, so you'll probably need to do something like this:
std::wstring StringToWstring(const std::string & source)
{
    std::wstring target(source.size()+1, L' ');
    std::size_t newLength = std::mbstowcs(&target[0], source.c_str(), target.size());
    target.resize(newLength);
    return target;
}
I'm not entirely sure that the usage of &target[0] is standard-conforming (since C++11 a std::string's storage is guaranteed to be contiguous, so it should be); if someone has a good answer to that, please tell me in the comments. Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars of the original string - a logical assumption that still I'm not sure is covered by the standard.
On the other hand, mbstowcs can report the size of the needed buffer if you pass a null destination pointer (that is how the mbsrtowcs-based code above sizes its buffer), so you could measure first and then convert; otherwise, go with (better done and better defined) code from Unicode libraries (be it Windows APIs or libraries like iconv).
Still, keep in mind that comparing Unicode strings without using special functions is slippery ground, two equivalent strings may be evaluated different when compared bitwise.
Long story short: this should work, and I think it's the maximum you can do with just the standard library, but it's a lot implementation-dependent in how Unicode is handled, and I wouldn't trust it a lot. In general, it's just better to stick with an encoding inside your application and avoid this kind of conversions unless absolutely necessary, and, if you are working with definite encodings, use APIs that are less implementation-dependent.
Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.
If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable under Windows; for instance mbstowcs is deprecated.
The cross-platform way to work with Unicode is to use the ICU library.
Take care to use special functions for Unicode string comparison, don't do it manually. Two Unicode strings could have different characters, yet still be the same; for example, "é" can be the single code point U+00E9 or the pair U+0065 U+0301.
#include <string>
#include <vector>
#include <windows.h>

using std::string;
using std::vector;
using std::wstring;

wstring ConvertToUnicode(const string & str)
{
    UINT codePage = CP_ACP;
    DWORD flags = 0;
    int resultSize = MultiByteToWideChar
        ( codePage            // CodePage
        , flags               // dwFlags
        , str.c_str()         // lpMultiByteStr
        , (int)str.length()   // cbMultiByte
        , NULL                // lpWideCharStr
        , 0                   // cchWideChar
        );
    vector<wchar_t> result(resultSize + 1);
    MultiByteToWideChar
        ( codePage            // CodePage
        , flags               // dwFlags
        , str.c_str()         // lpMultiByteStr
        , (int)str.length()   // cbMultiByte
        , &result[0]          // lpWideCharStr
        , resultSize          // cchWideChar
        );
    return &result[0];
}