std::string conversion to char32_t (unicode characters) - c++

I need to read a file with fstream in C++ that contains both ASCII and Unicode characters, using the getline function.
But getline only works with std::string, and the characters of such a plain string cannot be converted to char32_t so that I can compare them with Unicode characters. Could anyone suggest a fix?

char32_t corresponds to UTF-32 encoding, which is almost never used (and often poorly supported). Are you sure that your file is encoded in UTF-32?
If you are sure, then you need to use std::u32string to store your string. For reading, you can use std::basic_stringstream<char32_t> for instance. However, please note that these types are generally poorly supported.
Unicode is generally encoded using:
UTF-8 in text files (and web pages, etc...)
A platform-specific 16-bit or 32-bit encoding in programs, using type wchar_t
So, generally, universally encoded files are in UTF-8. UTF-8 uses a variable number of bytes per character, from 1 (for ASCII characters) to 4. This means you cannot directly test individual characters by indexing a std::string.
To do that, you need to convert the UTF-8 string to a wchar_t string, stored in a std::wstring.
You can use a converter defined like this:
std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
And convert like that:
std::wstring unicodeString = converter.from_bytes(utf8String);
You can then access the individual Unicode characters. Don't forget to put an L before each string or character literal to make it a wide literal. For instance:
if (unicodeString[i] == L'仮')
{
    info("this is some japanese character");
}
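Putting the pieces together, here is a minimal sketch of that approach. It assumes the file (called "input.txt" here, a placeholder name) is UTF-8 encoded and that the characters you compare against are in the BMP, since std::codecvt_utf8<wchar_t> only covers the BMP on platforms with a 16-bit wchar_t such as Windows:

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::ifstream file("input.txt");                 // getline still reads raw bytes
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;

    std::string utf8Line;
    while (std::getline(file, utf8Line))
    {
        // decode the UTF-8 bytes of this line into wide characters
        std::wstring unicodeLine = converter.from_bytes(utf8Line);
        for (wchar_t ch : unicodeLine)
        {
            if (ch == L'仮')
                std::cout << "found the Japanese character\n";
        }
    }
}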

Related

How can I convert a std::string to UTF-8?

I need to put the contents of a stringstream as the value of a JSON member (using the rapidjson library), but std::stringstream::str does not seem to work because it is not returning UTF-8 characters. How can I do that?
Example:
d["key"].SetString(tmp_stream.str());
rapidjson::Value::SetString accepts a pointer and a length, so you have to call it this way:
std::string stream_data = tmp_stream.str();
d["key"].SetString(stream_data.data(), stream_data.size());
As others have mentioned in the comments, std::string is a container of char values with no encoding specified. It can contain UTF-8 encoded bytes or any other encoding.
I tested putting invalid UTF-8 data in an std::string and calling SetString. RapidJSON accepted the data and simply replaced the invalid characters with "?". If that's what you're seeing, then you need to:
Determine what encoding your string has
Re-encode the string as UTF-8
If your string is ASCII, then SetString will work fine as ASCII and UTF-8 are compatible.
If your string is UTF-16 or UTF-32 encoded, there are several lightweight portable libraries to do this, such as utfcpp. C++11 added an API for this (std::wstring_convert), but it was poorly supported and is deprecated as of C++17.
If your string is encoded with a more archaic encoding like Windows-1252, then you might need to use either an OS API like MultiByteToWideChar on Windows, or a heavyweight Unicode library like ICU, to convert the data to a more standard encoding.
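For the Windows-1252 case, here is a minimal sketch of that MultiByteToWideChar route (Windows only; windows1252_to_utf8 is just an illustrative helper name, 1252 is the source code page, and error handling is omitted):

#include <windows.h>
#include <string>

std::string windows1252_to_utf8(const std::string& in)
{
    if (in.empty()) return std::string();
    // Windows-1252 -> UTF-16
    int wlen = MultiByteToWideChar(1252, 0, in.data(), (int)in.size(), nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(1252, 0, in.data(), (int)in.size(), &wide[0], wlen);
    // UTF-16 -> UTF-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), &utf8[0], ulen, nullptr, nullptr);
    return utf8;
}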

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version:
I want to write a program that can efficiently perform operations on Unicode characters and can read and write files in UTF-8 or UTF-16 encodings. What is the appropriate way to do this with C++?
Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant, leak-free C++ code. I need clear answers to:
Which string container should I pick?
std::string with UTF-8?
std::wstring (don't really know much about it)
std::u16string with UTF-16?
std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-English characters in string literals when using UTF strings, such as Polish characters (ąćęłńśźż)?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
What are the differences between wchar_t and the other multi-byte character types? Is a wchar_t character or a wchar_t string literal capable of storing UTF encodings?
Which string container should I pick?
That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and each has its own advantages and disadvantages. Generally, UTF-8 is good for storage and communication purposes and is backwards compatible with ASCII, whereas UTF-16/32 is easier to use when processing Unicode data.
std::wstring (don't really know much about it)
The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendor, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
Should I stick entirely to one of the above containers or change them when needed?
Use whatever containers suit your needs.
Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.
How to properly convert between them?
There are many ways to handle that.
C++11 and later have std::wstring_convert/std::wbuffer_convert, but these were deprecated in C++17 (see the sketch after this list).
There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.
There are C library functions, platform system calls, etc.
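As an illustration of that first option (deprecated, but still available through C++17), here is a sketch of a UTF-8 <-> UTF-32 round trip; the Polish word is written with escapes so the snippet does not depend on the source file's charset:

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    // UTF-8 -> UTF-32 ("zażółć")
    std::u32string utf32 = conv.from_bytes(u8"za\u017C\u00F3\u0142\u0107");

    // UTF-32 -> UTF-8
    std::string utf8 = conv.to_bytes(utf32);
}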
Can I use non-English characters in string literals when using UTF strings, such as Polish characters (ąćęłńśźż)?
Yes, if you use appropriate string literal prefixes:
u8 for UTF-8.
L for wide (wchar_t) literals, which are UTF-16 or UTF-32 depending on compiler/platform.
u for UTF-16 (char16_t).
U for UTF-32 (char32_t).
Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So whatever charset you choose to save your files in (UTF-8, for example), make sure you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
Each string character may be a single-byte, or be part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string, and the character being encoded.
Similarly, std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogate pairs to encode.
When a string container contains a UTF-encoded string, each "character" is just a UTF code unit. UTF-8 encodes a Unicode codepoint as 1-4 code units (1-4 char elements in a std::string). UTF-16 encodes a codepoint as 1-2 code units (1-2 wchar_t/char16_t elements in a std::wstring/std::u16string). UTF-32 encodes a codepoint as exactly 1 code unit (1 char32_t in a std::u32string).
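For example (using the C++17 meaning of u8"...", which yields plain char):

std::string    s8  = u8"\u20AC"; // '€' takes 3 code units (bytes) in UTF-8
std::u16string s16 = u"\u20AC";  // 1 code unit in UTF-16
std::u32string s32 = U"\u20AC";  // 1 code unit in UTF-32
// s8.size() == 3, s16.size() == 1, s32.size() == 1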
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.
How can I distinguish a UTF char[] from a non-UTF char[] or std::string?
You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
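As a rough illustration of such a heuristic, here is a structural check that only verifies UTF-8's lead-byte/continuation-byte pattern (it does not reject overlong encodings or surrogate code points, so treat it as a sketch rather than a full validator):

#include <cstddef>
#include <string>

bool looks_like_utf8(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if      (b < 0x80)           extra = 0; // ASCII byte
        else if ((b & 0xE0) == 0xC0) extra = 1; // lead byte of a 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2; // lead byte of a 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3; // lead byte of a 4-byte sequence
        else return false;                      // stray continuation byte or invalid lead
        if (i + extra >= s.size()) return false; // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false;                   // expected a continuation byte
        i += extra + 1;
    }
    return true;
}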
What are the differences between wchar_t and the other multi-byte character types?
Byte size and UTF encoding.
char = ANSI/MBCS or UTF-8
wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform
char8_t = UTF-8
char16_t = UTF-16
char32_t = UTF-32
Is a wchar_t character or a wchar_t string literal capable of storing UTF encodings?
Yes, UTF-16 or UTF-32, depending on compiler/platform. In the case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
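For instance (the code unit counts in the comments assume a 16-bit wchar_t, as on Windows):

wchar_t euro = L'\u20AC';              // U+20AC is in the BMP: fits in a single wchar_t on any platform
const wchar_t* smiley = L"\U0001F600"; // U+1F600 is outside the BMP:
                                       // 2 surrogate code units when wchar_t is 16-bit (Windows),
                                       // 1 code unit when wchar_t is 32-bit (most Unix-likes)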
How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?
That is a very broad topic, worthy of its own separate question by itself.

using unicode in a C++ program

I want strings with Unicode characters to be handled correctly in my file synchronizer application, but I don't know how this kind of encoding works.
In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
On the internet I see examples using "wide strings" or wchar_t??
So, what is the suitable object to handle Unicode characters? In rapidjson (which supports Unicode: UTF-8, UTF-16, UTF-32), we can use const char* to store a JSON that could contain "wide characters", but those characters take more than a byte to be represented... I don't understand.
This is the kind of temporary workaround I have found for the moment (Unicode -> UTF-8? ASCII? listFolder is a std::string):
boost::replace_all(listFolder, "\\u00e0", "à");
boost::replace_all(listFolder, "\\u00e2", "â");
boost::replace_all(listFolder, "\\u00e4", "ä");
...
The suitable object to handle Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).
wchar_t was an early attempt at handling international character sets, which failed because Microsoft's definition of wchar_t as two bytes became insufficient once Unicode was extended beyond code point 0x10000. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.
C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instances of basic_string<>, and as such have their shortcomings, e.g. when trying to uppercase/lowercase characters that map to more than one replacement character (e.g. the German ß would need to be expanded to SS in uppercase; the standard library cannot do that).
ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
\uxxxx and \UXXXXXXXX are Unicode character escapes. The xxxx is a 16-bit hexadecimal number representing a UCS-2 code point, which is equivalent to a Unicode code point within the Basic Multilingual Plane.
The XXXXXXXX is a 32-bit hex number representing a Unicode code point, which may be in any plane.
How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.
C++11 introduced "proper" Unicode literals:
u8"..." is always a const char[] in UTF-8 encoding.
u"..." is always a const uchar16_t[] in UTF-16 encoding.
U"..." is always a const uchar32_t[] in UTF-32 encoding.
If you use \uxxxx or \UXXXXXXXX within one of those three, the character literal will always be expanded to the proper code unit sequence.
Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string. .substr() can lead to partial and invalid sequences. .find_first_of() will not work as expected. And so on.
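A quick illustration of those pitfalls:

std::string s = u8"h\u00E4user";   // "häuser"
// s.length() == 7: it counts bytes, not characters ('ä' takes two bytes in UTF-8)
std::string cut = s.substr(0, 2);
// cut now ends in the middle of the 'ä' byte sequence, so it is no longer valid UTF-8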
That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling text as UTF-16 in memory (the way ICU does), but on file, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).
In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are hex digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
That is a Unicode character escape sequence. It will be interpreted as a Unicode character. The u after the escape character is part of the syntax and is what differentiates it from other escape sequences. Read the documentation for more information.
So, what's the suitable object to handle unicode characters ?
char for UTF-8
char16_t for UTF-16
char32_t for UTF-32
The size of wchar_t is platform dependent, so you cannot make portable assumptions of which encoding it suits.
we can use const char* to store a JSON that could have "wide characters" but those characters take more than a byte to be represented...
If you mean that you can store multi-byte utf-8 characters in a char string, then you're correct.
This is the kind of temporary arrangement I found for the moment (unicode->utf8?ascii?, listFolder is a std::string)
What you're attempting to do there is replace some Unicode escape sequences with characters in a platform-defined encoding. If you have other Unicode characters besides those, you end up with a string that has mixed encodings. Also, in some cases it may accidentally replace parts of other byte sequences. I recommend using a library to convert encodings or to do any other manipulation on encoded strings.

Is a wstring character Unicode? What happens during conversion?

Recently I have been coming across conversions from UTF-8 encoding to std::string and vice versa. I understand that UTF-8 encoding can hold almost all the characters in the world, whereas char, the built-in data type used by std::string, can only store ASCII values. A character in UTF-8 encoding requires from one to four bytes of memory, but a char is usually 1 byte.
My question is: what happens in a conversion from wstring to string, or from wchar_t to char?
Are the characters which require more than one byte skipped? It seems to depend on the implementation, but I want to know what the correct way of doing it is.
Also, is wchar_t required to store Unicode characters? As far as I understand, Unicode characters can be stored in a normal string as well. Why should we use wstring or wchar_t?
It depends on how you convert them.
You need to specify the source encoding type and the target encoding type.
wstring is not a format, it just defines a data type.
Now, usually when one says "Unicode", one means UTF-16, which is what Microsoft Windows uses, and that is usually what a wstring contains.
So, the right way to convert from UTF8 to UTF16:
std::string utf8String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::wstring utf16String = convert.from_bytes( utf8String );
And the other way around:
std::wstring utf16String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf16String = convert.to_bytes( utf16String );
And to add to the confusion:
When you pass a std::string to the Windows API (for example in a multi-byte build), it is NOT treated as UTF-8. It is treated as ANSI.
More specifically, the default ANSI code page your Windows installation is using.
When compiling in Unicode, the Windows API commands expect these formats:
CommandA - multi-byte - ANSI
CommandW - Unicode - UTF-16
Make your source files UTF-8 encoded, set the character encoding to UNICODE in your IDE.
Use std::string and widen them for WindowsAPI calls.
std::string somestring = "こんにちは";
WindowsApiW(widen(somestring).c_str());
I know it sounds kind of hacky, but a more profound explanation of this issue can be found at utf8everywhere.org.
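Note that widen() above is not a standard function; one possible implementation for Windows (UTF-8 to UTF-16 via MultiByteToWideChar, error handling omitted) might look like this:

#include <windows.h>
#include <string>

std::wstring widen(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &out[0], len);
    return out;
}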

streams with default utf8 handling

I have read that in some environments std::string internally uses UTF-8, whereas on my platform, Windows, std::string is ASCII only. This behavior can be changed by using std::locale. My version of the STL doesn't have, or at least I can't find, a UTF-8 facet for use with strings. I do, however, have a facet for use with the fstream set of classes.
Edit:
When I say "use UTF-8 internally", I'm referring to methods like std::basic_filebuf::open(), which in some environments accept UTF-8 encoded strings. I know this isn't really an std::string issue but rather some OS's use UTF-8 natively. My question should be read as "how does your implementation handle code conversion of invalid sequences?".
How do these streams handle invalid code sequences on other platforms/implementations?
In my UTF-8 facet for files, it simply returns an error, which in turn prevents any more of the stream from being read. I would have thought substituting the Unicode replacement character 0xFFFD would be a better option.
My question isn't limited to UTF-8; what about invalid UTF-16 surrogate pairs?
Let's have an example. Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Or take a std::wstring containing a lone surrogate and print it to std::cout.
I have read that in some environments std::string internally uses UTF-8.
A C++ program can choose to use std::string to hold a UTF-8 string on any standard-compliant platform.
Whereas, on my platform, Windows, std::string is ASCII only.
That is not correct. On Windows you can use a std::string to hold a UTF-8 string if you want; std::string is not limited to holding ASCII on any standard-compliant platform.
This behavior can be changed by using std::locale.
No, the behaviour of std::string is not affected by the locale library.
A std::string is a sequence of chars. On most platforms, including Windows, a char is 8 bits. So you can use std::string to hold ASCII, Latin-1, UTF-8 or any character encoding that uses 8-bit or smaller code units. std::string::length returns the number of code units so held, and std::string::operator[] returns the ith code unit.
For holding UTF-16 you can use char16_t and std::u16string.
For holding UTF-32 you can use char32_t and std::u32string.
Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Typically no one bothers with converting to wchar_t or other wide char types on other platforms, but the standard facets that can be used for this all signal a read error that causes the stream to stop working until the error is cleared.
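For example, with the (now deprecated) std::codecvt_utf8 facet, an invalid byte sequence simply sets failbit and extraction stops. A sketch (note that on Windows, where wchar_t is 16-bit, this facet only covers the BMP):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wifstream in;
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>)); // UTF-8 -> wchar_t on read
    in.open("input.txt");

    std::wstring line;
    while (std::getline(in, line))
    {
        // process line
    }
    if (in.fail() && !in.eof())
    {
        // an invalid UTF-8 sequence was hit: the stream stopped there
        // and stays failed until clear() is called
    }
}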
std::string should be encoding-agnostic (http://en.cppreference.com/w/cpp/string/basic_string), so it should not validate codepoints/data; you should be able to store any binary data in it.
The only places where encoding really makes a difference are in calculating string length and iterating over a string character by character, and locale should have no effect in either of these cases.
Also, using std::locale is probably not a good idea if it can be avoided at all: it's not thread-safe on all platforms or in all implementations of the standard library, so care must be taken when using it. Its effect is also very limited, and probably not at all what you expect it to be.