Unicode test strings for unit tests - c++

I need some Utf32 test strings to exercise some cross platform string manipulation code. I'd like a suite of test strings that exercise the utf32 <-> utf16 <-> utf8 encodings to validate that characters outside the BMP can be transformed from utf32, through utf16 surrogates, through utf8, and back. properly.
And I always find it a bit more elegant if the strings in question aren't just composed of random bytes, but are actually meaningful in the (various) languages they encode.

Although this isn't quite what you asked for, I've always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site offers this
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
... which are equivalents of English's "Quick brown fox" text, which exercise all the characters used, for a variety of languages. This page refers to a larger list of "pangrams" which used to be on Wikipedia, but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/

https://github.com/noct/cutf/tree/master/bin
Includes following files:
UTF-8-demo.txt
big.txt
quickbrown.txt
utf8_invalid.txt

To really test all possible conversions between formats, opposed to character conversions (i.e. towupper(), towlower()) you should test all characters. The following loop gives you all of those:
for(wint_t c(0); c < 0x110000; ++c)
{
if(c >= 0xD800 && c <= 0xDFFF)
{
continue;
}
// here 'c' is any one Unicode character in UTF-32
...
}
That way you can make sure you don't miss anything (i.e. 100% complete test.) This is only 1,112,065 characters, so it will be very fast with a modern computer.
Note that for basic conversions between encodings my loop above is more than enough. However, there are other feature in Unicode which would require testing character pairs which behave differently when used together. This is really not necessary here.
Also I now have a separate C++ libutf8 library to convert characters between UTF-32, UTF-16, and UTF-8. The tests use loops as shown above. The tests also verify that using invalid character codes gets caught properly.

Hmmm
You could find a lot of incidental data by googling (and see the right column for questions like these on SO...)
However, I recommend you pretty much build your test strings as byte array. It is not really about 'what data', just that unicode gets handled correctly.
E.g. you will want to make sure that identical strings in different normalized forms (i.e. even if not in canonical form) still compare equal.
You will want to check that the string length detection is robust (and recognizes single, double, triple and quadruple byte characters). You will want to check that traversing a string from begin to end honours the same logic. More targeted tests for random access of unicode characters.
These are all things you knew, I'm sure. I'm just spelling them out to remind you that you need test data catered to exactly the edge cases, the logical properties that are intrinsic to Unicode.
Only then will you have proper test data.
Beyond this scope (technical correct Unicode handling) is actual localization (collation, charset conversion etc.). I refer to the Turkey Test
Here are helpful links:
http://minaret.info/test/collate.msp
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

You can try this one (there are some sentences in russian, greek, chinese, etc. to test Unicode):
http://www.madore.org/~david/misc/unitest/

Related

Read multi-language file - wchar_t vs char?

It's a horrible experience for me to get understanding of unicodes, locales, wide characters and conversion.
I need to read a text file which contains Russian and English, Chinese and Ukrainian characters all at once
My approach is to read the file in byte-chunks, then operate on the chunk, on a separate thread for fast reading. (Link)
This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)
However, I understand that there is no way any character from my multi-lingual file can be represented via 255 combinations, if I stick to char.
For that matter I converted everything into wchar_t and hoped for the best.
I also know about Sys.setlocale(locale = "Russian") (Link) but doesn't it then interpret each character as Russian? I wouldn't know when to flip between my 4 languages as I am parsing my bytes.
On Windows OS, I can create a .txt file and write "Привет! Hello!" in the program Notepad++, which will save file and re-open with the same letters. Does it somehow secretly add invisible tokens after each character, to know when to interpret as Russian, and when as English?
My current understanding is: have everything as wchar_t (double-byte), interpret any file as UTF-16 (double-byte) - is it correct?
Also, I hope to keep the code cross-platform.
Sorry for noob
Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.
Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.
So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars, you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:
std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
std::string utf8;
f >> utf8;
// ...
}
So far so good. That all looks easy enough.
But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.
So, without further ado, here are your conversion functions (which is what you bought your ticket for):
#include <string>
#include <codecvt>
#include <locale>
std::string narrow (const std::wstring& wide_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.to_bytes (wide_string);
}
std::wstring widen (const std::string& utf8_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.from_bytes (utf8_string);
}
My, that was easy, why did those tickets cost so much in the first place?
I imagine that's all I really need to say. I think, from what you say in your question, that you already had a fair idea of what you wanted to be able to do, you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet) but just in case there is any lingering confusion, once you do have a wide string you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert to back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.
Test program over at the most excellent Wandbox. I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.
Notes (added as an edit):
codecvt is deprecated in C++17 (not sure why), but if you limit its use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
if std::wstring (which is based on wchar_t) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string.
Unfortunately standard c++ does not have any real support for your situation. (e.g. unicode in c++-11)
You will need to use a text-handling library that does support it. Something like this one
The most important question is, what encoding that text file is in. It is most likely not a byte encoding, but Unicode of some sort (as there is no way to have Russian and Chinese in one file otherwise, AFAIK). So... run file <textfile.txt> or equivalent, or open the file in a hex editor, to determine encoding (could be UTF-8, UTF-16, UTF-32, something-else-entirely), and act appropriately.
wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bit, so that is what they went for. When Unicode was extended to 21 bit, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "wide" nature of wchar_). "The Unixes", on the other hand, made wchar_t 32 bit and use UTF-32 encoding, so...
Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)") that does a reasonably good job of explaining Unicode though. There are other encodings out there, and I did a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side.
C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparison that correctly handles normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from Unicode to / from other encodings) in C/C++ is ICU.
And here's a second answer - about Microsoft's (lack of) standards compilance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.
Just to be clear, wchar_t on Windows is only 16-bits wide and as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).
So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):
The Standard says in [basic.fundamental]/5:
Type wchar_­t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_­t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_­t and char32_­t denote distinct types with the same size, signedness, and alignment as uint_­least16_­t and uint_­least32_­t, respectively, in <cstdint>, called the underlying types.
Hmmm. "Among the supported locales." What's that all about?
Well, I for one don't know, and nor, I suspect, is the person that wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.
As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.
The c++ standard defines wchar_t as a type which will support any code point. On linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.
Therefore the only portable way to handle strings is to convert them from native strings to utf-8 on input and from utf-8 to native strings at the point of output.
You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.
Non-adherence to standards is the reason we can't have nice things.

How do I properly use std::string on UTF-8 in C++?

My platform is a Mac. I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.
I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoid wchar_t as there's no char8_t right now for UTF-8.
However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex as these function usually returns unexpected results when facing UTF-8.
Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?
Unicode Glossary
Unicode is a vast and complex topic. I do not wish to wade too deep there, however a quick glossary is necessary:
Code Points: Code Points are the basic building blocks of Unicode, a code point is just an integer mapped to a meaning. The integer portion fits into 32 bits (well, 24 bits really), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points, for example a flag in unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.
This is the basic of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture in smileys, flags, etc... then you may have to pay attention to the distinction.
UTF Primer
Then, a serie of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.
In UTF-X, X is the size in bits of the Code Unit, each Code Point is represented as one or several Code Units, depending on its magnitude:
UTF-8: 1 to 4 Code Units,
UTF-16: 1 or 2 Code Units,
UTF-32: 1 Code Unit.
std::string and std::wstring.
Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
While a 32-bits wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.
If you are only reading or composing strings, you should have no to little issues with std::string or std::wstring.
Troubles start when you start slicing and dicing, then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Clusters boundaries. The former can be handled easily enough on your own, the latter requires using a Unicode aware library.
Picking std::string or std::u32string?
If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint; though heavy use of Chinese may change the deal. As always, profile.
If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.
UTF-8 in std::string.
UTF-8 actually works quite well in std::string.
Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
Due the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:
str.find('\n') works,
str.find("...") works for matching byte by byte1,
str.find_first_of("\r\n") works if searching for ASCII characters.
Similarly, regex should mostly works out of the box. As a sequence of characters ("haha") is just a sequence of bytes ("哈"), basic search patterns should work out of the box.
Be wary, however, of character classes (such as [:alphanum:]), as depending on the regex flavor and implementation it may or may not match Unicode characters.
Similarly, be wary of applying repeaters to non-ASCII "characters", "哈?" may only consider the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
1 The key concepts to look-up are normalization and collation; this affects all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
std::string and friends are encoding-agnostic. The only difference between std::wstring and std::string are that std::wstring uses wchar_t as the individual element, not char. For most compilers the latter is 8-bit. The former is supposed to be large enough to hold any unicode character, but in practice on some systems it isn't (Microsoft's compiler, for example, uses a 16-bit type). You can't store UTF-8 in std::wstring; that's not what it's designed for. It's designed to be an equivalent of UTF-32 - a string where each element is a single Unicode codepoint.
If you want to index UTF-8 strings by Unicode codepoint or composed unicode glyph (or some other thing), count the length of a UTF-8 string in Unicode codepoints or some other unicode object, or find by Unicode codepoint, you're going to need to use something other than the standard library. ICU is one of the libraries in the field; there may be others.
Something that's probably worth noting is that if you're searching for ASCII characters, you can mostly treat a UTF-8 bytestream as if it were byte-by-byte. Each ASCII character encodes the same in UTF-8 as it does in ASCII, and every multi-byte unit in UTF-8 is guaranteed not to include any bytes in the ASCII range.
Consider upgrading to C++20 and std::u8string that is the best thing we have as of 2019 for holding UTF-8. There are no standard library facilities to access individual code points or grapheme clusters but at least your type is strong enough to at least say it is true UTF-8.
Both std::string and std::wstring must use UTF encoding to represent Unicode. On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.
For both, size tracks the number of code units instead of the number of code points, or grapheme clusters. (A code point is one named Unicode entity, one or more of which form a grapheme cluster. Grapheme clusters are the visible characters that users interact with, like letters or emojis.)
Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of grapheme clusters. Obviously, however, this comes at the cost of using up to 4x more memory.
The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.
Finally, UTF strings in human languages that don't use combining characters usually do pretty well with find/regex. I'm not sure about Chinese, but English is one of them.
Should I go ahead with std::string or switch to std::wstring?
I would recommend using std::string because wchar_t is non-portable and C++20 char8_t is poorly supported in the standard and not supported by any system APIs at all (and will likely never be because of compatibility reasons). On most platforms including macOS that you are using normal char strings are already UTF-8.
Most of the standard string operations work with UTF-8 but operate on code units. If you want a higher-level API you'll have to use something else such as the text library proposed to Boost.

How to achieve unicode-agnostic case insensitive comparison in C++

I have a requirement wherein my C++ code needs to do case insensitive comparison without worrying about whether the string is encoded or not, or the type of encoding involved. The string could be an ASCII or a non-ASCII, I just need to store it as is and compare it with a second string without concerning if the right locale is set and so forth.
Use case: Suppose my application receives a string (let's say it's a file name) initially as "Zoë Saldaña.txt" and it stores it as is. Subsequently, it receives another string "zoë saLdañA.txt", and the comparison between this and the first string should result in a match, by using a few APIs. Same with file name "abc.txt" and "AbC.txt".
I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:
If ICU provides a means of solving my requirement by seamlessly handling the strings regardless of their encoding type?
If the answer to 1. is no, then, using ICU's APIs, is it safe to normalize all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?
Are there alternatives that facilitate this?
I read this post, but it doesn't quite meet my requirements.
Thanks!
The requirement is impossible. Computers don't work with characters, they work with numbers. But "case insensitive" comparisons are operations which work on characters. Locales determine which numbers correspond to which characters, and are therefore indispensible.
The above isn't just true for all progamming langguages, it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique. That means that comparing two numbers doesn't work. There could be a locale where character 42 is equivalent to character 43. In Unicode, it's even worse. There are number sequences which have different lengths and still are equivalent. (precomposed and decomposed characters in particular)
Without knowing encoding, you cannot do that. I will take one example using french accented characters and 2 different encodings: cp850 used as OEM character for windows in west european zone, and the well known iso-8859-1 (also known as latin1, not very different from win1252 ansi character set for windows)).
in cp850, 0x96 is 'û', 0xca is '╩', 0xea is 'Û'
in latin1, 0x96 is non printable(*), 0xca is 'Ê', 0xea is 'ê'
so if string is cp850 encoded, 0xea should be the same as 0x96 and 0xca is a different character
but if string is latin1 encoded, 0xea should be the same as 0xca, 0x96 being a control character
You could find similar examples with other iso-8859-x encoding by I only speak of languages I know.
(*) in cp1252 0x96 is '–' unicode character U+2013 not related to 'ê'
For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, e.g. network protocols (e.g. CIFS), international database data, etc.
The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.
As of 2007, when I last looked, there are less than 2000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (most likely vice versa, as well, but I didn't try it).
At the time, I used Bob Burtle's perfect hash generator. It worked great in a CIFS implementation I was working on at the time.
There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
Note: this is locale-neutral. So it will not support applications like German telephone books. There are a great many applications you should definitely use locale aware folding and collation. But there are a large number where locale neutral is actually preferable. Especially now when folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.
If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale aware application. In which case, the OP's question doesn't apply.
See also:
The Unicode® Standard, Chapter 4, section 4.2, Case
The Unicode® Standard, Chapter 5, section 5.18, Case Mappings, subsection Caseless Matching.
UCD - CaseFolding.txt
Well, first I must say that any programmer dealing with natural language text has the utmost duty to know and understand Unicode well. Other ancient 20th Century encodings still exists, but things like EBCDIC and ASCII are not able to encode even a simple English text, which may contain words like façade, naïve or fiancée or even a geographical sign, a mathematical symbol or even emojis — conceptually, they are similar to ideograms. The majority of the world population does not use Latin characters to write text. UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present day operating systems, including Windows, which unfortunately still does it wrong. (For example, NTFS has a decade-long reported bug that allows a directory to contain 2 files with names that look exactly the same but are encoded with different normal forms — I get this a lot when synchronising files via FTP between Windows and MacOS or Linux; all my files with accented characters get duplicated because unlike the other systems, Windows uses a different normal forms and only normalise the file names on the GUI level, not on the file system level. I reported this in 2001 for Windows 7 and the bug is still present today in Windows 10.)
If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the point in order for things to work nicely. First, make sure both strings use the same normal form (you should do this in the input process, the Unicode standard has the algorithm). Please do not reinvent the wheel, use ICU normalising and comparison facilities. They have been extensively tested and they work correctly. Use them, IBM has made it gratis.
A note: if you plan on comparing string for ordering, please remember that collation is locale-dependant, and highly influenced by the language and the scenery. For example, in a dictionary these Portuguese words would have this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical — or better saying, its rules are not simple.
C'est la vie. これが私たちの世界です。

Create UTF-16 string from char*

So I have standard C string:
char* name = "Jakub";
And I want to convert it to UTF-16. I figured out, that UTF-16 will be twice as long - one character takes two chars.
So I create another string:
char name_utf_16[10]; //"Jakub" is 5 characters
Now, I believe with ASCII characters I will only use lower bytes, so for all of them it will be like 74 00 for J and so on. With that belief, I can make such code:
void charToUtf16(char* input, char* output, int length) {
/*Todo: how to check if output is long enough?*/
for(int i=0; i<length; i+=2) //Step over 2 bytes
{
//Lets use little-endian - smallest bytes first
output[i] = input[i];
output[i+1] = 0; //We will never have any data for this field
}
}
But, with this process, I ended with "Jkb". I know no way to test this properly - I've just sent the string to Minecraft Bukkit Server. And this is what it said upon disconnecting:
13:34:19 [INFO] Disconnecting jkb?? [/127.0.0.1:53215]: Outdated server!
Note: I'm aware that Minecraft uses big-endian. Code above is just an example, in fact, I have my conversion implemented in class.
Before I answer your question, consider this:
This area of programming is full of man traps. It makes a lot of sense to understand the differences between ASCII, UTF7/8 and ANSI/'MultiByte Character Strings (MBCS)', all of which to an english speaking programmer will look and feel identical, but need very different handling if they are introduced to a european or asian user.
ASCII: Characters are in range 32-127. only ever one byte. The clue is in the name, they are great for Americans, but not fit for purpose in the rest of the world.
ANSI/MBCS: This is the reason for 'code pages'. Characters 32-127 are the same as ASCII, but it is possible to have characters in the range of 128-255 as well for additional characters, and some of the 128-255 range can be used as a flag to mark that the character continues into a second, third or even fourth byte. To process the string correctly, you need both the string bytes and the correct code page. If you try processing the string using the wrong code page you will not have the right characters, and misinterpret whether a character is a one, two or even 4 byte character.
UTF7/8: These are 8-bit wide formatting of 21-bit unicode character points. in UTF-7 and UTF-8 unicode characters can be between one and four bytes long. The advantage that UTF encodings have over ANSI/MBCS is that there is no ambiguity caused by code pages. Each glyph in every script has a unique unicode code point, which means it is not possible to mangle the character sets by interpreting the data on a different computer with different regional settings.
So to to start to answer your question:
Whilst you are making the assumption that your char* will only point to an ASCII string, that is a really dangerous choice to make, users are in control of data that is typed in, not the programmer. Windows programs will be storing this as MBCS by default.
You are making the second assumption is that a UTF-16 encoding will be twice the size of an 8 bit encoding. That is not generally a safe assumption. depending on the source encoding the UTF-16 encoding may be twice the size, may be less than twice the size, and in an extreme example may actually be shorter in length.
So, what is the safe solution?
The safe option is to implement your application internally as unicode. On windows, this is a compiler option, and then means your windows controls all use wchar_t* strings for their data type. On linux I'm less sure that you can always use unicide graphics and OS libraries. You must also use the wcslen() functions to get the length of strings etc. When you interact with the outside world, be precise in the character encodings used.
To answer to your question then becomes changing the question to, what do i do when I receive non UTF-16 data?
Firstly, be very clear about what assumptions on its formatting are you making? and secondly, accept the fact that sometimes the conversion to UTF-16 may fail.
If you are clear on the source formatting, you can then choose the appropriate win32 or the stl converter to convert the format, and you should then look for evidence the conversion failed before using the result. e.g. mbstowcs in or MultiByteToWideChar() on windows. However the use of both of these approaches safely means you need to understand ALL of the above answer.
All other options introduce risk. Use mbcs strings and you will have data strings mangled by being entered using one code page, and processed using a different code page. Assume ASCII data, and when you encounter a non ascii character your code will break, and you will 'blame' the user for your short comings.
Why do you want to make your own Unicode conversion functionality when theres's existing C/C++ functions for this, like mbstowcs() which is included in <cstdlib>.
If you still want to make your own stuff, then have a look at Unicode Consortium's open source code which can be found here:
Convert UTF-16 to UTF-8 under Windows and Linux, in C
output[i] = input[i];
This will assign every other byte of the input, because you increment i by 2. So no wonder that you obtain "Jkb".
You probably wanted to write:
output[i] = input[i / 2];

How do you cope with signed char -> int issues with standard library?

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...
C naively defined all of its character test functions for an int:
int isspace(int ch);
But char's are often signed, and a full character often doesn't fit in an int, or in any single storage-unit that used for strings******.
And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.
So if you hand isspace(*pchar) you can end up with sign extension problems. They're hard to see, and thence they're hard to guard against in my experience.
Similarly, because isspace() and it's ilk all take ints, and because the actual width of a character is often unknown w/o string-analysis - meaning that any modern character library should essentially never be carting around char's or wchar_t's but only pointers/iterators, since only by analyzing the character stream can you know how much of it composes a single logical character, I am at a bit of a loss as to how best to approach the issues?
I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...
** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).
So, my question is:
"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues
After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8 bit unit.
Please note:
No matter what size your char_type is, it's wrong for most character encoding schemes.
This problem is in the standard C library, as well as in the C++ standard libraries; which still tries to pass around char and wchar_t, rather than string-iterators in the various isspace, isprint, etc. implementations.
Actually, it's precisely those type of functions that break the genericity of std::string. If it only worked in storage-units, and didn't try to pretend to understand the meaning of the storage-units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...
Thank You
Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.
How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...
Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OS' have a 32-bit wchar_t which makes them able to correctly handle UTF32, but that's an implementation, and is not guaranteed by C++, and is not remotely sufficient for many unicode tasks, such as iterating over Graphemes (letters).
IBMICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utfproc
and many more at http://unicode.org/resources/libraries.html.
If the question is less about specific character testing and more about code practices in general: Do whatever your framework does. If you're coding for linux/QT/networking, keep everything internally in UTF-8. If you're coding with Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
I think you are confounding a whole host of unrelated concepts.
First off, char is simply a data type. Its first and foremost meaning is "the system's basic storage unit", i.e. "one byte". Its signedness is intentionally left up to the implementation so that each implementation can pick the most appropriate (i.e. hardware-supported) version. It's name, suggesting "character", is quite possibly the single worst decision in the design of the C programming language.
The next concept is that of a text string. At the foundation, text is a sequence of units, which are often called "characters", but it can be more involved than that. To that end, the Unicode standard coins the term "code point" to designate the most basic unit of text. For now, and for us programmers, "text" is a sequence of code points.
The problem is that there are more codepoints than possible byte values. This problem can be overcome in two different ways: 1) use a multi-byte encoding to represent code point sequences as byte sequences; or 2) use a different basic data type. C and C++ actually offer both solutions: The native host interface (command line args, file contents, environment variables) are provided as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", as well as translation functions between them (mbstowcs/wcstombs).
Unfortunately, there is nothing specific about "the system's character set" and "the systems multibyte encoding", so you, like so many SO users before you, are left puzzling what to do with those mysterious wide characters. What people want nowadays is a definite encoding that they can share across platforms. The one and only useful encoding that we have for this purpose is Unicode, which assigns a textual meaning to a large number of code points (up to 221 at the moment). Along with the text encoding comes a family of byte-string encodings, UTF-8, UTF-16 and UTF-32.
The first step to examining the content of a given text string is thus to transform it from whatever input you have into a string of definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just as a sequence of raw codepoints (typically UTF-32, since we don't have a useful 21-bit data type).
Performing this transformation is already outside the scope of the C++ standard (even the new one), so we need a library to do this. Since we don't know anything about our "system's character set", we also need the library to handle that.
One popular library of choice is iconv(); the typical sequence goes from input multibyte char* via mbstowcs() to a std::wstring or wchar_t* wide string, and then via iconv()'s WCHAR_T-to-UTF32 conversion to a std::u32string or uint32_t* raw Unicode codepoint sequence.
At this point our journey ends. We can now either examine the text codepoint by codepoint (which might be enough to tell if something is a space); or we can invoke a heavier text-processing library to perform intricate textual operations on our Unicode codepoint stream (such as normalization, canonicalization, presentational transformation, etc.). This is far beyond the scope of a general-purpose programmer, and the realm of text processing specialists.
It is in any case invalid to pass a negative value other than EOF to isspace and the other character macros. If you have a char c, and you want to test whether it is a space or not, do isspace((unsigned char)c). This deals with the extension (by zero-extending). isspace(*pchar) is flat wrong -- don't write it, don't let it stand when you see it. If you train yourself to panic when you do see it, then it's less hard to see.
fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there's no sign-extension issue for values from that.
That's trivia really, though, since the standard character macros don't cover Unicode, or multi-byte encodings. If you want to handle Unicode properly then you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, other than that C++11 has std::u32string which sounds promising. Prior to that the answer is to use something implementation-specific or third-party. (Un)fortunately there are a lot of libraries to choose from.
It may be (I speculate) that a "complete" Unicode classification database is so large and so subject to change that it would be impractical for the C++ standard to mandate "full" support anyway. It depends to an extent what operations should be supported, but you can't get away from the problem that Unicode has been through 6 major versions in 20 years (since the first standard version), while C++ has had 2 major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly-moving target, so it's always going to be implementation-defined what code points the system knows about.
In general, there are three correct ways to handle Unicode text:
At all I/O (including system calls that return or accept strings), convert everything between an externally-used character encoding, and an internal fixed-width encoding. You can think of this as "deserialization" on input and "serialization" on output. If you had some object type with functions to convert it to/from a byte stream, then you wouldn't mix up byte stream with the objects, or examine sections of byte stream for snippets of serialized data that you think you recognize. It needn't be any different for this internal unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on implementation. Just pretend the standard library doesn't provide strings, if it helps, or use a std::basic_string of something big as the container but a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and such like, since even in a fixed-width Unicode encoding, there may be more than one code point per glyph.
Mess about with some ad-hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It's like (1), but usually harder, and hence although it's potentially correct, in practice it might just as easily come out wrong.
(Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation marks, and concatenate strings for output. Basically it works for programs where you don't need to understand anything with the top bit set, just pass it on unchanged. It doesn't work so well if you need to actually render text, or otherwise do things to it that a human would consider "obvious" but actually are complex. Like collation.
One comment up front: the old C functions like isspace took int for
a reason: they support EOF as input as well, so they need to be able
to support one more value than will fit in a char. The
“naïve” decision was allowing char to be signed—but
making it unsigned would have had severe performance implications on a
PDP-11.
Now to your questions:
1) Sign expansion
The C++ functions don't have this problem. In C++, the
“correct” way of testing things like whether a character is
a space is to grap the std::ctype facet from whatever locale you want,
and to use it. Of course, the C++ localization, in <locale>, has
been carefully designed to make it as hard as possible to use, but if
you're doing any significant text processing, you'll soon come up with
your own convenience wrappers: a functional object which takes a locale
and mask specifying which characteristic you want to test isn't hard.
Making it a template on the mask, and giving its locale argument a
default to the global locale isn't rocket science either. Throw in a
few typedef's, and you can pass things like IsSpace() to std::find.
The only subtility is managing the lifetime of the std::ctype object
you're dealing with. Something like the following should work, however:
template<std::ctype_base::mask mask>
class Is // Must find a better name.
{
std::locale myLocale;
//< Needed to ensure no premature destruction of facet
std::ctype<char> const* myCType;
public:
Is( std::locale const& l = std::locale() )
: myLocale( l )
, myCType( std::use_facet<std::ctype<char> >( l ) )
{
}
bool operator()( char ch ) const
{
return myCType->is( mask, ch );
}
};
typedef Is<std::ctype_base::space> IsSpace;
// ...
(Given the influence of the STL, it's somewhat surprising that the
standard didn't define something like the above as standard.)
2) Variable width character issues.
There is no real answer. It all depends on what you need. For some
applications, just looking for a few specific single byte characters is
sufficient, and keeping everything in UTF-8, and ignoring the multi-byte
issues, is a viable (and simple) solution. Beyond that, it's often
useful to convert to UTF-32 (or depending on the type of text you're
dealing with, UTF-16), and use each element as a single code point. For
full text handling, on the other hand, you have to deal with
multi-code-point characters even if you're using UTF-32: the sequence
\u006D\u0302 is a single character (a small m with a circumflex over
it).
I haven't been testing internationalization capabilities of Qt library so much, but from what i know, QString is fully unicode-aware, and is using QChar's which are unicode-chars. I don't know internal implementation of those, but I expect that this implies QChar's to be varaible size characters.
It would be weird to bind yourself to such big framework as Qt just to use strings though.
You seem to be confusing a function defined on 7-bit ascii with a universal space-recognition function. Character functions in standard C use int not to deal with different encodings, but to allow EOF to be an out-of-band indicator. There are no issues with sign-extension, because the numbers these functions are defined on have no 8th bit. Providing a byte with this possibility is a mistake on your part.
Plan 9 attempts to solve this with a UTF library, and the assumption that all input data is UTF-8. This allows some measure of backwards compatibility with ASCII, so non-compliant programs don't all die, but allows new programs to be written correctly.
The common notion in C, even still is that a char* represents an array of letters. It should instead be seen as a block of input data. To get the letters from this stream, you use chartorune(). Each Rune is a representation of a letter(/symbol/codepoint), so one can finally define a function isspacerune(), which would finally tell you which letters are spaces.
Work with arrays of Rune as you would with char arrays, to do string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write it out.
The sign extension issue is easy to deal with. You can either use:
isspace((unsigned char) ch)
isspace(ch & 0xFF)
the compiler option that makes char an unsigned type
As far the variable-length character issue (I'm assuming UTF-8), it depends on your needs.
If you just to deal with the ASCII whitespace characters \t\n\v\f\r, then isspace will work fine; the non-ASCII UTF-8 code units will simply be treated as non-spaces.
But if you need to recognize the extra Unicode space characters \x85\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000, it's a bit more work. You could write a function along the lines of
bool isspace_utf8(const char* pChar)
{
uint32_t codePoint = decode_char(*pChar);
return is_unicode_space(codePoint);
}
Where decode_char converts a UTF-8 sequence to the corresponding Unicode code point, and is_unicode_space returns true for characters with category Z or for the Cc characters that are spaces. iswspace may or may not help with the latter, depending on how well your C++ library supports Unicode. It's best to use a dedicated Unicode library for the job.
most strings in practice use a multibyte encoding such as UTF-7,
UTF-8, UTF-16, SHIFT-JIS, etc.
No programmer would use UTF-7 or Shift-JIS as an internal representation unless they enjoy pain. Stick with ŬTF-8, -16, or -32, and only convert as needed.
Your preamble argument is somewhat inacurate, and arguably unfair, it is simply not in the library design to support Unicode encodings - certainly not multiple Unicode encodings.
Development of the C and C++ languages and much of the libraries pre-date the development of Unicode. Also as system's level languages they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately perhaps the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. It is history that has shown this to be flawed perhaps, but changing the language definition and indeed the library would break a large amount of legacy code, so such things are left to newer languages such as C# that has an 8-bit byte and distinct char type.
Moreover the variable encoding of Unicode representations makes it unsuited to a built-in data type as such. You are obviously aware of this since you suggest that Unicode character operations should be performed on strings rather than machine word types. This would require library support and as you point out this is not provided by the standard library. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics. The library intrinsically does not address anything that is not generally universally supported by all target platforms from the deeply embedded to the super-computer. All such things must be provided by either system or third-party libraries.
Support for multiple character encodings is about system/environment interoperability, and the library is not intended to support that either. Data exchange between incompatible encoding systems is an application issue not a system issue.
"How do you test for whitespace, isprintable, etc., in a way that
doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues
isspace() considers only the lower 8-bits. Its definition explicitly states that if you pass an argument that is not representable as an unsigned char or equal to the value of the macro EOF, the results are undefined. The problem does not arise if it is used as it was intended. The problem is that it is inappropriate for the purpose you appear to be applying it to.
After all, all commonly used Unicode encodings are variable-width,
whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well
as older standards such as Shift-JIS
isspace() is not defined for Unicode. You'll need a library designed to use any specific encoding you are using. This question What is the best Unicode library for C? may be relevant.