Strings of 4 byte character in Windows? - c++

I have a program that does various operations on char types in std::string, for example
if (my_string.front() == my_char) {
// do stuff with my_string
}
I'm looking for some practical advice on how to make my program support Unicode. I need the ability to compare characters to characters, and that means 4 byte characters are required so that even the largest Unicode characters can be processed without losses.
I'm on Windows with a GCC compiler and read that in this case, std::wstring is 2 bytes. C++11 has std::u32string with 4 bytes but it seems largely unsupported by the standard library.
What's the easiest solution in this case?

Even if you had a string of uint32 you could not just compare these integers one by one. You would have to first normalize the strings before. As normalization is NOT simple, you will end up using a library like ICU. So you may directly try to use it directly :)
http://site.icu-project.org/

Windows uses the UTF-16 encoding:
http://en.wikipedia.org/wiki/UTF-16
You don't need "four byte characters" to support all unicode symbols. UTF-16 is a variable length encoding.
Good reading material:
http://www.joelonsoftware.com/articles/Unicode.html

Related

What exactly can wchar_t represent?

According to cppreference.com's doc on wchar_t:
wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode. A notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units) It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.
The Standard says in [basic.fundamental]/5:
Type wchar_­t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_­t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_­t and char32_­t denote distinct types with the same size, signedness, and alignment as uint_­least16_­t and uint_­least32_­t, respectively, in <cstdint>, called the underlying types.
So, if I want to deal with unicode characters, should I use wchar_t?
Equivalently, how do I know if a specific unicode character is "supported" by wchar_t?
So, if I want to deal with unicode characters, should I use
wchar_t?
First of all, note that the encoding does not force you to use any particular type to represent a certain character. You may use char to represent Unicode characters just as wchar_t can - you only have to remember that up to 4 chars together will form a valid code point depending on UTF-8, UTF-16, or UTF-32 encoding, while wchar_t can use 1 (UTF-32 on Linux, etc) or up to 2 working together (UTF-16 on Windows).
Next, there is no definite Unicode encoding. Some Unicode encodings use a fixed width for representing codepoints (like UTF-32), others (such as UTF-8 and UTF-16) have variable lengths (the letter 'a' for instance surely will just use up 1 byte, but apart from the English alphabet, other characters surely will use up more bytes for representation).
So you have to decide what kind of characters you want to represent and then choose your encoding accordingly. Depending on the kind of characters you want to represent, this will affect the amount of bytes your data will take. E.g. using UTF-32 to represent mostly English characters will lead to many 0-bytes. UTF-8 is a better choice for many Latin based languages, while UTF-16 is usually a better choice for Eastern Asian languages.
Once you have decided on this, you should minimize the amount of conversions and stay consistent with your decision.
In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).
If you would like to do text-manipulation/interpretation on a code-point basis, char certainly is not the way to go if you have e.g. Japanese kanji. But if you just want to communicate your data and regard it no more as a quantitative sequence of bytes, you may just go with char.
The link to UTF-8 everywhere was already posted as a comment, and I suggest you having a look there as well. Another good read is What every programmer should know about encodings.
As by now, there is only rudimentary language support in C++ for Unicode (like the char16_t and char32_t data types, and u8/u/U literal prefixes). So chosing a library for manging encodings (especially conversions) certainly is a good advice.
wchar_t is used in Windows which uses UTF16-LE format. wchar_t requires wide char functions. For example wcslen(const wchar_t*) instead of strlen(const char*) and std::wstring instead of std::string
Unix based machines (Linux, Mac, etc.) use UTF8. This uses char for storage, and the same C and C++ functions for ASCII, such as strlen(const char*) and std::string (see comments below about std::find_first_of)
wchar_t is 2 bytes (UTF16) in Windows. But in other machines it is 4 bytes (UTF32). This makes things more confusing.
For UTF32, you can use std::u32string which is the same on different systems.
You might consider converting UTF8 to UTF32, because that way each character is always 4 bytes, and you might think string operations will be easier. But that's rarely necessary.
UTF8 is designed so that ASCII characters between 0 and 128 are not used to represent other Unicode code points. That includes escape sequence '\', printf format specifiers, and common parsing characters like ,
Consider the following UTF8 string. Lets say you want to find the comma
std::string str = u8"汉,🙂"; //3 code points represented by 8 bytes
The ASCII value for comma is 44, and str is guaranteed to contain only one byte whose value is 44. To find the comma, you can simply use any standard function in C or C++ to look for ','
To find 汉, you can search for the string u8"汉" since this code point cannot be represented as a single character.
Some C and C++ functions don't work smoothly with UTF8. These include
strtok
strspn
std::find_first_of
The argument for above functions is a set of characters, not an actual string.
So str.find_first_of(u8"汉") does not work. Because u8"汉" is 3 bytes, and find_first_of will look for any of those bytes. There is a chance that one of those bytes are used to represent a different code point.
On the other hand, str.find_first_of(u8",;abcd") is safe, because all the characters in the search argument are ASCII (str itself can contain any Unicode character)
In rare cases UTF32 might be required (although I can't imagine where!) You can use std::codecvt to convert UTF8 to UTF32 to run the following operations:
std::u32string u32 = U"012汉"; //4 code points, represented by 4 elements
cout << u32.find_first_of(U"汉") << endl; //outputs 3
cout << u32.find_first_of(U'汉') << endl; //outputs 3
Side note:
You should use "Unicode everywhere", not "UTF8 everywhere".
In Linux, Mac, etc. use UTF8 for Unicode.
In Windows, use UTF16 for Unicode. Windows programmers use UTF16, they don't make pointless conversions back and forth to UTF8. But there are legitimate cases for using UTF8 in Windows.
Windows programmer tend to use UTF8 for saving files, web pages, etc. So that's less worry for non-Windows programmers in terms of compatibility.
The language itself doesn't care which Unicode format you want to use, but in terms of practicality use a format that matches the system you are working on.
So, if I want to deal with unicode characters, should I use wchar_t?
That depends on what encoding you're dealing with. In case of UTF-8 you're just fine with char and std::string.
UTF-8 means the least encoding unit is 8 bits: all Unicode code points from U+0000 to U+007F are encoded by only 1 byte.
Beginning with code point U+0080 UTF-8 uses 2 bytes for encoding, starting from U+0800 it uses 3 bytes and from U+10000 4 bytes.
To handle this variable width (1 byte - 2 byte - 3 byte - 4 byte) char fits best.
Be aware that C-functions like strlen will provide byte-based results: "öö" in fact is a 2-character text but strlen will return 4 because 'ö' is encoded to 0xC3B6.
UTF-16 means the least encoding unit is 16 bits: all code points from U+0000 to U+FFFF are encoded by 2 bytes; starting from U+100000 4 bytes are used.
In case of UTF-16 you should use wchar_t and std::wstring because most of the characters you'll ever encounter will be 2-byte encoded.
When using wchar_t you can't use C-functions like strlen any more; you have to use the wide char equivalents like wcslen.
When using Visual Studio and building with configuration "Unicode" you'll get UTF-16: TCHAR and CString will be based on wchar_t instead of char.
It all depends what you mean by 'deal with', but one thing is for sure: where Unicode is concerned std::basic_string doesn't provide any real functionality at all.
In any particular program, you will need to perform X number of Unicode-aware operations, e.g. intelligent string matching, case folding, regex, locating word breaks, using a Unicode string as a path name maybe, and so on.
Supporting these operations there will almost always be some kind of library and / or native API provided by the platform, and the goal for me would be to store and manipulate my strings in such a way that these operations can be carried out without scattering knowledge of the underlying library and native API support throughout the code any more than necessary. I'd also want to future-proof myself as to the width of the characters I store in my strings in case I change my mind.
Suppose, for example, you decide to use ICU to do the heavy lifting. Immediately there is an obvious problem: an icu::UnicodeString is not related in any way to std::basic_string. What to do? Work exclusively with icu::UnicodeString throughout the code? Probably not.
Or maybe the focus of the application switches from European languages to Asian ones, so that UTF-16 becomes (perhaps) a better choice than UTF-8.
So, my choice would be to use a custom string class derived from std::basic_string, something like this:
typedef wchar_t mychar_t; // say
class MyString : public std::basic_string <mychar_t>
{
...
};
Straightaway you have flexibility in choosing the size of the code units stored in your container. But you can do much more than that. For example, with the above declaration (and after you add in boilerplate for the various constructors that you need to provide to forward them to std::basic_string), you still cannot say:
MyString s = "abcde";
Because "abcde" is a narrow string and various the constructors for std::basic_string <wchar_t> all expect a wide string. Microsoft solve this with a macro (TEXT ("...") or __T ("...")), but that is a pain. All we need to do now is to provide a suitable constructor in MyString, with signature MyString (const char *s), and the problem is solved.
In practise, this constructor would probably expect a UTF-8 string, regardless of the underlying character width used for MyString, and convert it if necessary. Someone comments here somewhere that you should store your strings as UTF-8 so that you can construct them from UTF-8 literals in your code. Well now we have broken that constraint. The underlying character width of our strings can be anything we like.
Another thing that people have been talking about in this thread is that find_first_of may not work properly for UTF-8 strings (and indeed some UTF-16 ones also). Well, now you can provide an implementation that does the job properly. Should take about half an hour. If there are other 'broken' implementations in std::basic_string (and I'm sure there are), then most of them can probably be replaced with similar ease.
As for the rest, it mainly depends what level of abstraction you want to implement in your MyString class. If your application is happy to have a dependency on ICU, for example, then you can just provide a couple of methods to convert to and from an icu::UnicodeString. That's probably what most people would do.
Or if you need to pass UTF-16 strings to / from native Windows APIs then you can add methods to convert to and from const WCHAR * (which again you would implement in such a way that they work for all values of mychar_t). Or you could go further and abstract away some or all of the Unicode support provided by the platform and library you are using. The Mac, for example, has rich Unicode support but it's only available from Objective-C so you have to wrap it.
It depends on how portable you want your code to be.
So you can add in whatever functionality you like, probably on an on-going basis as work progresses, without losing the ability to carry your strings around as a std::basic_string. Of one sort or another. Just try not to write code that assumes it knows how wide it is, or that it contains no surrogate pairs.
First of all, you should check (as you point out in your question) if you are using Windows and Visual Studio C++ with wchar_t being 16bits, because in that case, to use full unicode support, you'll need to assume UTF-16 encoding.
The basic problem here is not the sizeof wchar_t you are using, but if the libraries you are going to use, support full unicode support.
Java has a similar problem, as its char type is 16bit wide, so it couldn't a priori support full unicode space, but it does, as it uses UTF-16 encoding and the pair surrogates to cope with full 24bit codepoints.
It's also worth to note that UNICODE uses only the high plane to encode rare codepoints, that are not normally used daily.
For unicode support anyway, you need to use wide character sets, so wchar_t is a good beginning. If you are going to work with visual studio, then you have to check how it's libraries deal with unicode characters.
Another thing to note is that standard libraries deal with character sets (and this includes unicode) only when you add locale support (this requires some library to be initialized, e.g. setlocale(3)) and so, you'll see no unicode at all (only basic ascii) in cases where you have not called setlocale(3).
There are wide char functions for almost any str*(3) function, as well as for any stdio.h library function, to deal with wchar_ts. A little dig into the /usr/include/wchar.h file will reveal the names of the routines. Go to the manual pages for documentation on them: fgetws(3), fputwc(3), fputws(3), fwide(3), fwprintf(3), ...
Finally, consider again that, if you are dealing with Microsoft Visual C++, you have a different implementation from the beginning. Even if they cope to be completely standard compliant, you'll have to cope with some idiosyncrasies of having a different implementation. Probably you'll have different function names for some uses.

how character sets are stored in strings and wstrings?

So, i've been trying to do a bit of research of strings and wstrings as i need to understand how they work for a program i'm creating so I also looked into ASCII and unicode, and UTF-8 and UTF-16.
I believe i have an okay understanding of the concept of how these work, but what i'm still having trouble with is how they are actually stored in 'char's, 'string's, 'wchar_t's and 'wstring's.
So my questions are as follows:
Which character set and encoding is used for char and wchar_t? and are these types limited to using only these character sets / encoding?
If they are not limited to these character sets / encoding, how is it decided what character set / encoding is used for a particular char or wchar_t? is it automatically decided at compile for example or do we have to explicitly tell it what to use?
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so how is this stored? for example is it simply stored identically to ASCII if it only uses 1 byte? and how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, eg. a string can't be used where a wstring is needed. But in a program that requires a wstring would it be better practice to write a conversion function from a string to a wstring and the use this when a wstring is required to make my code exclusively string-based or just use wstring where needed instead?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology as i'm trying to get to grips with this as best as I can.
i'm working in C++ btw
They use whatever characterset and encoding you want. The types do not imply a specific characterset or encoding. They do not even imply characters - you could happily do math problems with them. Don't do that though, it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using unicode, nowadays. On older systems anything might happen.
UTF8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained here quite well: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, on Visual Studio it is 16 bits (and could be used to store UTF16), but on GCC it is 32 bits (and could thus be used to store unicode codepoints directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF8, and convert only when needed.
Which character set and encoding is used for char and wchar_t? and are these types limited to using only these character sets / encoding?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encoding, how is it decided what character set / encoding is used for a particular char or wchar_t? is it automatically decided at compile for example or do we have to explicitly tell it what to use?
The compiler knows what is appropriate for each system.
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so how is this stored? for example is it simply stored identically to ASCII if it only uses 1 byte? and how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just store bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same thing for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, eg. a string can't be used where a wstring is needed. But in a program that requires a wstring would it be better practice to write a conversion function from a string to a wstring and the use this when a wstring is required to make my code exclusively string-based or just use wstring where needed instead?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
Everything about used encoding is implementation defined. Check your compiler documentation. It depends on default locale, encoding of source file and OS console settings.
Types like string, wstring, operations on them and C facilities, like strcmp/wstrcmp expect fixed-width encodings. So the would not work properly with variable width ones like UTF8 or UTF16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful and not use fixed-width operations on it. C-string do have some functions for manipulation of such strings in standard library .You can use classes from codecvt header to convert between different encodings for C++ strings.
I would avoid wstring and use C++11 exact width character string: std::u16string or std::u32string
As an example here is some info on how windows uses these types/encodings.
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16, note this means that some unicode characters will use 2 wchar_t's
If you call a system function, e.g. puts then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. if you are using unicode).
So on windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings you have to covert them to UTF-16 and call the corresponding UTF-16 system functions.

substr with characters instead of bytes

Suppose i have a string s = "101870002PTäPO PVä #Person Tätigkeitsdarstellung 001100001&0111010101101870100092001000010"
When I do a substring(30,40) it returns " #Person Tätigkeitsdarstellung" beginning with a space.
I guess it's counting bytes instead of characters.
Normally the size of the string is 110 and when I do a s.length() or s.size() it returns 113 because of the 3 special characters.
I was wondering if there is a way to avoid this empty space at the beginning of the return value.
Thanks for your help!
In utf-8, the code point (character) ä consists of two code units (which are 1 byte in utf-8). C++ does not have support for treating strings as sequence of code points. Therefore, as far the standard library is concerned, std::string("ä").size() is 2.
A simple approach is to use std::wstring. wstring uses a character type (wchar_t) which is at least as wide as the widest character set supported by the system. Therefore, if the system supports a wide enough encoding to represent any (non-composite) unicode character with a single code unit, then string methods will behave as you would expect. Currently utf-32 is wide enough and is supported by (most?) unix like OS.
A thing to note is that Windows only supports utf-16 and not utf-32, so if you choose wstring approach and port your program to Windows and a user of your program tries to use unicode characters that are more than 2 bytes wide, then the presumption of one code unit per code point does not hold.
The wstring approach also doesn't take control or composite characters into consideration.
Here's a little test code which converts a std::string containing a multi byte utf-8 character ä and converts it to a wstring:
string foo("ä"); // read however you want
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo.data());
cout << foo.size() << endl; // 2 on my system
cout << wfoo.size() << endl; // 1 on my system
Unfortunately, libstdc++ hasn't implemented <codecvt> which was introduced in c++11 as of gcc-4.8 at least. If you can't require libc++, then similar functionality is probably in Boost.Locale.
Alternatively, if you wish to keep your code portable to systems that don't support utf-32, you can keep using std::string and use an external library for iterating and counting and such. Here's one: http://utfcpp.sourceforge.net/ and another: http://site.icu-project.org/. I believe this is the recommended approach.

UTF-8 decoding library

I have to code in an application which is in Unicode UTF-8 in Windows, MSVC 10. I'm aware that the UTF-8 encoded strings would use either 1 or 2 bytes per character. So, my question is : Is std::string suitable for this? If yes, how do I decode the strings? As far as I understand std::string is just an array of bytes and it doesn't provide any decoding logic.
How can I know the logical length of the string? How can I extract logical characters from a string? Are there any libraries which helps me to extract logical characters from the string?
e.g : If I have the string "olé" in std::string, I need to know that the length is 3, but not 4.
A commonally used library is ICU - International Components for Unicode
Yes, std::string is appropriare but as you’ve noticed it only operates on bytes, not Unicode code points. In that, std::string is an opaque type; this isn’t necessarily bad (in fact, it does have some advantages, see the links below for information) but it makes it necessary to decode the string if you need information about characters.
For the actual handling of UTF-8 (where necessary), you can use the Boost.NoWide library to decode UTF-8.
Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.
First you may want to call the mbstowcs() function to transform the UTF-8 characters to wide characters. Then if you want the result to be 8 bits, you'll have a loss of data in the event you have "Unicode" characters (characters outside of the ISO-8859-1 plane, also called Latin 1.)
Note that the "Windows" encoding is not 1 to 1 equivalent to ISO-8859-1, but in most cases ISO-8859-1 is what people use these days.
Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/
Okay, if you just want the length in characters, use the mblen() function:
len = mblen(str.c_str(), str.length());
Additional note: an easy way to implementation mblen() is to count the number of bytes that are not between 0x80 and 0xBF since those are part of a multi-bytes sequence. This is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection.

Correct use of string storage in C and C++

Popular software developers and companies (Joel Spolsky, Fog Creek software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t in respect to good coding practices?
I am particularly interested in POSIX compliance when writing software that leverages Unicode.
When using wchar_t, you can look up characters in an array of wide characters on a per-character or per-array-element basis:
/* C code fragment */
const wchar_t *overlord = L"ov€rlord";
if (overlord[2] == L'€')
wprintf(L"Character comparison on a per-character basis.\n");
How can you compare unicode bytes (or characters) when using char?
So far my preferred way of comparing strings and characters of type char in C often looks like this:
/* C code fragment */
const char *mail[] = { "ov€rlord#masters.lt", "ov€rlord#masters.lt" };
if (mail[0][2] == mail[1][2] && mail[0][3] == mail[1][3] && mail[0][3] == mail[1][3])
printf("%s\n%zu", *mail, strlen(*mail));
This method scans for the byte equivalent of a unicode character. The Unicode Euro symbol € takes up 3 bytes. Therefore one needs to compare three char array bytes to know if the Unicode characters match. Often you need to know the size of the character or string you want to compare and the bits it produces for the solution to work. This does not look like a good way of handling Unicode at all. Is there a better way of comparing strings and character elements of type char?
In addition, when using wchar_t, how can you scan the file contents to an array? The function fread does not seem to produce valid results.
If you know that you're dealing with unicode, neither char nor wchar_t are appropriate as their sizes are compiler/platform-defined. For example, wchar_t is 2 bytes on Windows (MSVC), but 4 bytes on Linux (GCC). The C11 and C++11 standards have been a bit more rigorous, and define two new character types (char16_t and char32_t) with associated literal prefixes for creating UTF-{8, 16, 32} strings.
If you need to store and manipulate unicode characters, you should use a library that is designed for the job, as neither the pre-C11 nor pre-C++11 language standards have been written with unicode in mind. There are a few to choose from, but ICU is quite popular (and supports C, C++, and Java).
I am particularly interested in POSIX compliance when writing software
that leverages Unicode.
In this case, you'll probably want to use UTF-8 (with char) as your preferred Unicode string type. POSIX doesn't have a lot of functions for working with wchar_t — that's mostly a Windows thing.
This method scans for the byte equivalent of a unicode character. The
Unicode Euro symbol € takes up 3 bytes. Therefore one needs to compare
three char array bytes to know if the Unicode characters match. Often
you need to know the size of the character or string you want to
compare and the bits it produces for the solution to work.
No, you don't. You just compare the bytes. Iff the bytes match, the strings match. strcmp works just as well with UTF-8 as it does with any other encoding.
Unless you want something like a case-insensitive or accent-insensitive comparison, in which case you'll need a proper Unicode library.
You should never-ever compare bytes, or even code points, to decide if strings are equal. That's because of a lot of strings can be identical from user perspective without being identical from code point perspective.