Identify upside down question mark in a string - c++

I have a for loop that looks at every character in a string, the purpose is to eliminate some characters. For example one comparison that works is...
if(str[i] == '!'){str[i] = NULL;}
I also need to eliminate the upside down question mark. I tried several things including some hex codes and the following.
if(str[i] == 191){str[i] = NULL;}
Here, I get an error that says, "comparison of constant 191 with expression of type 'value_type' is always false." What am I missing here? How can I catch the upside-down question mark?

Your string's value_type is most likely char, which might or might not be signed on your platform.
If it's signed, CHAR_MAX would be 127... you see the problem when comparing that with 191? That is what the compiler is complaining about.
There are several ways around this.
The most roughshod one would be to cast the constant to value_type.
More elegant (but depending on your compiler's features) would be to actually write '¿' in your source and make sure your editor and your compiler agree on the encoding used by the source file.
While the standard only requires support for a subset of the ASCII-7 characters in source (minus backticks, $ and #), implementations are free (and usually quite capable) of supporting other encodings.
For GCC, the option would be -finput-charset=..., which defaults to UTF-8.
All this is, of course, assuming that your source and your input are agreeing on their respective encodings as well. Being on the same codepage, so to speak. ;-)
All that being said, if you're handling international characters in your application, you might want to take a look at the ICU library and full Unicode support.

Related

Read multi-language file - wchar_t vs char?

It's a horrible experience for me to get understanding of unicodes, locales, wide characters and conversion.
I need to read a text file which contains Russian and English, Chinese and Ukrainian characters all at once
My approach is to read the file in byte-chunks, then operate on the chunk, on a separate thread for fast reading. (Link)
This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)
However, I understand that there is no way any character from my multi-lingual file can be represented via 255 combinations, if I stick to char.
For that matter I converted everything into wchar_t and hoped for the best.
I also know about Sys.setlocale(locale = "Russian") (Link) but doesn't it then interpret each character as Russian? I wouldn't know when to flip between my 4 languages as I am parsing my bytes.
On Windows OS, I can create a .txt file and write "Привет! Hello!" in the program Notepad++, which will save file and re-open with the same letters. Does it somehow secretly add invisible tokens after each character, to know when to interpret as Russian, and when as English?
My current understanding is: have everything as wchar_t (double-byte), interpret any file as UTF-16 (double-byte) - is it correct?
Also, I hope to keep the code cross-platform.
Sorry for noob
Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.
Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.
So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars, you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:
std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
std::string utf8;
f >> utf8;
// ...
}
So far so good. That all looks easy enough.
But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.
So, without further ado, here are your conversion functions (which is what you bought your ticket for):
#include <string>
#include <codecvt>
#include <locale>
std::string narrow (const std::wstring& wide_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.to_bytes (wide_string);
}
std::wstring widen (const std::string& utf8_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.from_bytes (utf8_string);
}
My, that was easy, why did those tickets cost so much in the first place?
I imagine that's all I really need to say. I think, from what you say in your question, that you already had a fair idea of what you wanted to be able to do, you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet) but just in case there is any lingering confusion, once you do have a wide string you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert to back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.
Test program over at the most excellent Wandbox. I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.
Notes (added as an edit):
codecvt is deprecated in C++17 (not sure why), but if you limit its use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
if std::wstring (which is based on wchar_t) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string.
Unfortunately standard c++ does not have any real support for your situation. (e.g. unicode in c++-11)
You will need to use a text-handling library that does support it. Something like this one
The most important question is, what encoding that text file is in. It is most likely not a byte encoding, but Unicode of some sort (as there is no way to have Russian and Chinese in one file otherwise, AFAIK). So... run file <textfile.txt> or equivalent, or open the file in a hex editor, to determine encoding (could be UTF-8, UTF-16, UTF-32, something-else-entirely), and act appropriately.
wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bit, so that is what they went for. When Unicode was extended to 21 bit, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "wide" nature of wchar_). "The Unixes", on the other hand, made wchar_t 32 bit and use UTF-32 encoding, so...
Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)") that does a reasonably good job of explaining Unicode though. There are other encodings out there, and I did a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side.
C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparison that correctly handles normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from Unicode to / from other encodings) in C/C++ is ICU.
And here's a second answer - about Microsoft's (lack of) standards compilance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.
Just to be clear, wchar_t on Windows is only 16-bits wide and as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).
So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):
The Standard says in [basic.fundamental]/5:
Type wchar_­t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_­t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_­t and char32_­t denote distinct types with the same size, signedness, and alignment as uint_­least16_­t and uint_­least32_­t, respectively, in <cstdint>, called the underlying types.
Hmmm. "Among the supported locales." What's that all about?
Well, I for one don't know, and nor, I suspect, is the person that wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.
As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.
The c++ standard defines wchar_t as a type which will support any code point. On linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.
Therefore the only portable way to handle strings is to convert them from native strings to utf-8 on input and from utf-8 to native strings at the point of output.
You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.
Non-adherence to standards is the reason we can't have nice things.

Why doesn't this character conversion work?

Visual Studio 2008
Project compiled as multibyte character set
LPWSTR lpName[1] = {(WCHAR*)_T("Setup")};
After this conversion, lpName[0] contains garbage (at least when previewed in VS)
LPWSTR is typedef'd as follows:
typedef __nullterminated WCHAR *NWPSTR, *LPWSTR, *PWSTR;
It's an expanded version of my comment above.
The code shown casts a pointer of type A to a pointer of type B. This is a low-vevel, machine-dependent operation. It almost never works as a conversion of an object of type A to an object of type B, especially if one type is a regular character type and the other is wide characters.
Imagine that you take a French book, and read it aloud as if it was written in English.
FRENCH* book;
readaloud ((ENGLISH*) book);
You will mostly hear gibberish. The letters used in the two languages are the same (or similar, at any rate), but the rules of the two languages are are totally different. The representation is the same for both languages, but the meaning is not.
This is very similar to what we have here. Whatever type you have, bits and bytes are the same, but the rules are totally different. You take bits laid out according to regular character rules, and try to interpret them according to wide character rules. It doesn't work. The representation is the same in both cases, but the meaning is not.
To convert one character flavor to another, you in general need a lookup table or some other means to convert each character from one type to the other — change representation, but keep the meaning. Likewise, to convert a French book into an English book, you need to use a big lookup table a.k.a. dictionary... well, the analogy breaks here, because there's no formal set of conversion rules, you need to be creative! But you get the idea.
The rules of C++ actually prohibit such casts. You can only cast an object type poiner to void*, and only use the result to cast it back to the original object type. Everything else is a no-no (unless you are willing to venture in the realm of undefined behavior).
So what should you do?
Pick one character variant and stick to it.
If you must convert between flavors, do so with a library function.
Try to avoid pointer casts, they almost always signal trouble.
I think what you're looking for is
LPTSTR lpName[1] = {_T("Setup")};
The various typedefs with a T in them (e.g. TSTR, LPTSTR) are dependant on whether you use unicode or multi-byte or whatever else. By using these, you should be able to write code that work in whatever encoding you are using (i.e., tomorrow you could switch to ascii, and a large portion of your code should still work).
Edit
If you are in situation where you really must convert between encodings, then there are various conversion functions available, such as wcstombs (or microsoft's documentation) and mbstowcs. These are defined in <cstdlib>

How do you cope with signed char -> int issues with standard library?

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...
C naively defined all of its character test functions for an int:
int isspace(int ch);
But char's are often signed, and a full character often doesn't fit in an int, or in any single storage-unit that used for strings******.
And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.
So if you hand isspace(*pchar) you can end up with sign extension problems. They're hard to see, and thence they're hard to guard against in my experience.
Similarly, because isspace() and it's ilk all take ints, and because the actual width of a character is often unknown w/o string-analysis - meaning that any modern character library should essentially never be carting around char's or wchar_t's but only pointers/iterators, since only by analyzing the character stream can you know how much of it composes a single logical character, I am at a bit of a loss as to how best to approach the issues?
I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...
** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).
So, my question is:
"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues
After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8 bit unit.
Please note:
No matter what size your char_type is, it's wrong for most character encoding schemes.
This problem is in the standard C library, as well as in the C++ standard libraries; which still tries to pass around char and wchar_t, rather than string-iterators in the various isspace, isprint, etc. implementations.
Actually, it's precisely those type of functions that break the genericity of std::string. If it only worked in storage-units, and didn't try to pretend to understand the meaning of the storage-units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...
Thank You
Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.
How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...
Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OS' have a 32-bit wchar_t which makes them able to correctly handle UTF32, but that's an implementation, and is not guaranteed by C++, and is not remotely sufficient for many unicode tasks, such as iterating over Graphemes (letters).
IBMICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utfproc
and many more at http://unicode.org/resources/libraries.html.
If the question is less about specific character testing and more about code practices in general: Do whatever your framework does. If you're coding for linux/QT/networking, keep everything internally in UTF-8. If you're coding with Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
I think you are confounding a whole host of unrelated concepts.
First off, char is simply a data type. Its first and foremost meaning is "the system's basic storage unit", i.e. "one byte". Its signedness is intentionally left up to the implementation so that each implementation can pick the most appropriate (i.e. hardware-supported) version. It's name, suggesting "character", is quite possibly the single worst decision in the design of the C programming language.
The next concept is that of a text string. At the foundation, text is a sequence of units, which are often called "characters", but it can be more involved than that. To that end, the Unicode standard coins the term "code point" to designate the most basic unit of text. For now, and for us programmers, "text" is a sequence of code points.
The problem is that there are more codepoints than possible byte values. This problem can be overcome in two different ways: 1) use a multi-byte encoding to represent code point sequences as byte sequences; or 2) use a different basic data type. C and C++ actually offer both solutions: The native host interface (command line args, file contents, environment variables) are provided as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", as well as translation functions between them (mbstowcs/wcstombs).
Unfortunately, there is nothing specific about "the system's character set" and "the systems multibyte encoding", so you, like so many SO users before you, are left puzzling what to do with those mysterious wide characters. What people want nowadays is a definite encoding that they can share across platforms. The one and only useful encoding that we have for this purpose is Unicode, which assigns a textual meaning to a large number of code points (up to 221 at the moment). Along with the text encoding comes a family of byte-string encodings, UTF-8, UTF-16 and UTF-32.
The first step to examining the content of a given text string is thus to transform it from whatever input you have into a string of definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just as a sequence of raw codepoints (typically UTF-32, since we don't have a useful 21-bit data type).
Performing this transformation is already outside the scope of the C++ standard (even the new one), so we need a library to do this. Since we don't know anything about our "system's character set", we also need the library to handle that.
One popular library of choice is iconv(); the typical sequence goes from input multibyte char* via mbstowcs() to a std::wstring or wchar_t* wide string, and then via iconv()'s WCHAR_T-to-UTF32 conversion to a std::u32string or uint32_t* raw Unicode codepoint sequence.
At this point our journey ends. We can now either examine the text codepoint by codepoint (which might be enough to tell if something is a space); or we can invoke a heavier text-processing library to perform intricate textual operations on our Unicode codepoint stream (such as normalization, canonicalization, presentational transformation, etc.). This is far beyond the scope of a general-purpose programmer, and the realm of text processing specialists.
It is in any case invalid to pass a negative value other than EOF to isspace and the other character macros. If you have a char c, and you want to test whether it is a space or not, do isspace((unsigned char)c). This deals with the extension (by zero-extending). isspace(*pchar) is flat wrong -- don't write it, don't let it stand when you see it. If you train yourself to panic when you do see it, then it's less hard to see.
fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there's no sign-extension issue for values from that.
That's trivia really, though, since the standard character macros don't cover Unicode, or multi-byte encodings. If you want to handle Unicode properly then you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, other than that C++11 has std::u32string which sounds promising. Prior to that the answer is to use something implementation-specific or third-party. (Un)fortunately there are a lot of libraries to choose from.
It may be (I speculate) that a "complete" Unicode classification database is so large and so subject to change that it would be impractical for the C++ standard to mandate "full" support anyway. It depends to an extent what operations should be supported, but you can't get away from the problem that Unicode has been through 6 major versions in 20 years (since the first standard version), while C++ has had 2 major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly-moving target, so it's always going to be implementation-defined what code points the system knows about.
In general, there are three correct ways to handle Unicode text:
At all I/O (including system calls that return or accept strings), convert everything between an externally-used character encoding, and an internal fixed-width encoding. You can think of this as "deserialization" on input and "serialization" on output. If you had some object type with functions to convert it to/from a byte stream, then you wouldn't mix up byte stream with the objects, or examine sections of byte stream for snippets of serialized data that you think you recognize. It needn't be any different for this internal unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on implementation. Just pretend the standard library doesn't provide strings, if it helps, or use a std::basic_string of something big as the container but a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and such like, since even in a fixed-width Unicode encoding, there may be more than one code point per glyph.
Mess about with some ad-hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It's like (1), but usually harder, and hence although it's potentially correct, in practice it might just as easily come out wrong.
(Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation marks, and concatenate strings for output. Basically it works for programs where you don't need to understand anything with the top bit set, just pass it on unchanged. It doesn't work so well if you need to actually render text, or otherwise do things to it that a human would consider "obvious" but actually are complex. Like collation.
One comment up front: the old C functions like isspace took int for
a reason: they support EOF as input as well, so they need to be able
to support one more value than will fit in a char. The
“naïve” decision was allowing char to be signed—but
making it unsigned would have had severe performance implications on a
PDP-11.
Now to your questions:
1) Sign expansion
The C++ functions don't have this problem. In C++, the
“correct” way of testing things like whether a character is
a space is to grap the std::ctype facet from whatever locale you want,
and to use it. Of course, the C++ localization, in <locale>, has
been carefully designed to make it as hard as possible to use, but if
you're doing any significant text processing, you'll soon come up with
your own convenience wrappers: a functional object which takes a locale
and mask specifying which characteristic you want to test isn't hard.
Making it a template on the mask, and giving its locale argument a
default to the global locale isn't rocket science either. Throw in a
few typedef's, and you can pass things like IsSpace() to std::find.
The only subtility is managing the lifetime of the std::ctype object
you're dealing with. Something like the following should work, however:
template<std::ctype_base::mask mask>
class Is // Must find a better name.
{
std::locale myLocale;
//< Needed to ensure no premature destruction of facet
std::ctype<char> const* myCType;
public:
Is( std::locale const& l = std::locale() )
: myLocale( l )
, myCType( std::use_facet<std::ctype<char> >( l ) )
{
}
bool operator()( char ch ) const
{
return myCType->is( mask, ch );
}
};
typedef Is<std::ctype_base::space> IsSpace;
// ...
(Given the influence of the STL, it's somewhat surprising that the
standard didn't define something like the above as standard.)
2) Variable width character issues.
There is no real answer. It all depends on what you need. For some
applications, just looking for a few specific single byte characters is
sufficient, and keeping everything in UTF-8, and ignoring the multi-byte
issues, is a viable (and simple) solution. Beyond that, it's often
useful to convert to UTF-32 (or depending on the type of text you're
dealing with, UTF-16), and use each element as a single code point. For
full text handling, on the other hand, you have to deal with
multi-code-point characters even if you're using UTF-32: the sequence
\u006D\u0302 is a single character (a small m with a circumflex over
it).
I haven't been testing internationalization capabilities of Qt library so much, but from what i know, QString is fully unicode-aware, and is using QChar's which are unicode-chars. I don't know internal implementation of those, but I expect that this implies QChar's to be varaible size characters.
It would be weird to bind yourself to such big framework as Qt just to use strings though.
You seem to be confusing a function defined on 7-bit ascii with a universal space-recognition function. Character functions in standard C use int not to deal with different encodings, but to allow EOF to be an out-of-band indicator. There are no issues with sign-extension, because the numbers these functions are defined on have no 8th bit. Providing a byte with this possibility is a mistake on your part.
Plan 9 attempts to solve this with a UTF library, and the assumption that all input data is UTF-8. This allows some measure of backwards compatibility with ASCII, so non-compliant programs don't all die, but allows new programs to be written correctly.
The common notion in C, even still is that a char* represents an array of letters. It should instead be seen as a block of input data. To get the letters from this stream, you use chartorune(). Each Rune is a representation of a letter(/symbol/codepoint), so one can finally define a function isspacerune(), which would finally tell you which letters are spaces.
Work with arrays of Rune as you would with char arrays, to do string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write it out.
The sign extension issue is easy to deal with. You can either use:
isspace((unsigned char) ch)
isspace(ch & 0xFF)
the compiler option that makes char an unsigned type
As far the variable-length character issue (I'm assuming UTF-8), it depends on your needs.
If you just to deal with the ASCII whitespace characters \t\n\v\f\r, then isspace will work fine; the non-ASCII UTF-8 code units will simply be treated as non-spaces.
But if you need to recognize the extra Unicode space characters \x85\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000, it's a bit more work. You could write a function along the lines of
bool isspace_utf8(const char* pChar)
{
uint32_t codePoint = decode_char(*pChar);
return is_unicode_space(codePoint);
}
Where decode_char converts a UTF-8 sequence to the corresponding Unicode code point, and is_unicode_space returns true for characters with category Z or for the Cc characters that are spaces. iswspace may or may not help with the latter, depending on how well your C++ library supports Unicode. It's best to use a dedicated Unicode library for the job.
most strings in practice use a multibyte encoding such as UTF-7,
UTF-8, UTF-16, SHIFT-JIS, etc.
No programmer would use UTF-7 or Shift-JIS as an internal representation unless they enjoy pain. Stick with ŬTF-8, -16, or -32, and only convert as needed.
Your preamble argument is somewhat inacurate, and arguably unfair, it is simply not in the library design to support Unicode encodings - certainly not multiple Unicode encodings.
Development of the C and C++ languages and much of the libraries pre-date the development of Unicode. Also as system's level languages they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately perhaps the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. It is history that has shown this to be flawed perhaps, but changing the language definition and indeed the library would break a large amount of legacy code, so such things are left to newer languages such as C# that has an 8-bit byte and distinct char type.
Moreover the variable encoding of Unicode representations makes it unsuited to a built-in data type as such. You are obviously aware of this since you suggest that Unicode character operations should be performed on strings rather than machine word types. This would require library support and as you point out this is not provided by the standard library. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics. The library intrinsically does not address anything that is not generally universally supported by all target platforms from the deeply embedded to the super-computer. All such things must be provided by either system or third-party libraries.
Support for multiple character encodings is about system/environment interoperability, and the library is not intended to support that either. Data exchange between incompatible encoding systems is an application issue not a system issue.
"How do you test for whitespace, isprintable, etc., in a way that
doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues
isspace() considers only the lower 8-bits. Its definition explicitly states that if you pass an argument that is not representable as an unsigned char or equal to the value of the macro EOF, the results are undefined. The problem does not arise if it is used as it was intended. The problem is that it is inappropriate for the purpose you appear to be applying it to.
After all, all commonly used Unicode encodings are variable-width,
whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well
as older standards such as Shift-JIS
isspace() is not defined for Unicode. You'll need a library designed to use any specific encoding you are using. This question What is the best Unicode library for C? may be relevant.

How to check if the casting to wchar_t "failed"

I have a code that does something like this:
char16_t msg[256]={0};
//...
wstring wstr;
for (int i =0;i<len;++i)
{
if((unsigned short)msg[i]!=167)
wstr.push_back((wchar_t) msg[i]);
else
wstr.append(L"_<?>_");
}
as you can see it uses some rather ugly hardcoding(I'm not sure it works, but it works for my data) to figure out if wchar_t casting "failed"(that is the value of the replacement character)
From wiki:
The replacement character � (often a black diamond with a white
question mark) is a symbol found in the Unicode standard at codepoint
U+FFFD in the Specials table. It is used to indicate problems when a
system is not able to decode a stream of data to a correct symbol. It
is most commonly seen when a font does not contain a character, but is
also seen when the data is invalid and does not match any character:
So I have 2 questions:
1. Is there a proper way to do this nicely?
2. Are there other characters like replacement character that signal the failed conversion?
EDIT: i use gcc on linux so wchar_t is 32 bit, and the reason why I need this cast to work is because weird wstrings kill my glog library. :) Also wcout dies. :( :)
Doesn't work like that. wchar_t and char16_t are both integer types in C++. Casting from one to the other follows the usual rules for integer conversions, it does not attempt to convert between charsets in any way, or verify that anything is a genuine unicode code point.
Any replacement characters will have to come from more sophisticated code than a simple cast (or could be from the original input, of course).
Provided that:
The input in msg is a sequence of code points in the BMP
wchar_t in your implementation is at least 16 bits and the wide character set used by your implementation is Unicode (or a 16-bit version of Unicode, whether that's BMP-only, or UTF-16).
Then the code you have should work fine. It will not validate the input, though, just copy the values.
If you want to actually handle Unicode strings in C++ (and not merely sequences of 16-bit values), you should use the International Components for Unicode (ICU) library. Quoting the FAQ:
Why ICU4C?
The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).
As a side effect, you get proper error reporting if a conversion fails...
If you don't mind platform-specific code, Windows has the MultiByteToWideChar API.
*Edit: I see you're on linux; I'll leave my answer here though in case Windows people can benefit from it.
A cast can not fail neither it will produce any replacement characters. The 167 value in your code does not indicate a failed cast, it means something else what only the code's author knows.
Just for reference, Unicode code point 167 (0x00A7) is a section sign: §. Maybe that will ring some bells about what the code was supposed to do.
And though I don't know what it is, consider rewriting it with:
wchar_t msg[256];
...
wstring wstr(msg, wcslen(msg));
or
char16_t msg[256];
...
u16string u16str(msg, wcslen(msg));
then do something to that 167 values if you need to.

Unicode test strings for unit tests

I need some Utf32 test strings to exercise some cross platform string manipulation code. I'd like a suite of test strings that exercise the utf32 <-> utf16 <-> utf8 encodings to validate that characters outside the BMP can be transformed from utf32, through utf16 surrogates, through utf8, and back. properly.
And I always find it a bit more elegant if the strings in question aren't just composed of random bytes, but are actually meaningful in the (various) languages they encode.
Although this isn't quite what you asked for, I've always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site offers this
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
... which are equivalents of English's "Quick brown fox" text, which exercise all the characters used, for a variety of languages. This page refers to a larger list of "pangrams" which used to be on Wikipedia, but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/
https://github.com/noct/cutf/tree/master/bin
Includes following files:
UTF-8-demo.txt
big.txt
quickbrown.txt
utf8_invalid.txt
To really test all possible conversions between formats, opposed to character conversions (i.e. towupper(), towlower()) you should test all characters. The following loop gives you all of those:
for(wint_t c(0); c < 0x110000; ++c)
{
if(c >= 0xD800 && c <= 0xDFFF)
{
continue;
}
// here 'c' is any one Unicode character in UTF-32
...
}
That way you can make sure you don't miss anything (i.e. 100% complete test.) This is only 1,112,065 characters, so it will be very fast with a modern computer.
Note that for basic conversions between encodings my loop above is more than enough. However, there are other feature in Unicode which would require testing character pairs which behave differently when used together. This is really not necessary here.
Also I now have a separate C++ libutf8 library to convert characters between UTF-32, UTF-16, and UTF-8. The tests use loops as shown above. The tests also verify that using invalid character codes gets caught properly.
Hmmm
You could find a lot of incidental data by googling (and see the right column for questions like these on SO...)
However, I recommend you pretty much build your test strings as byte array. It is not really about 'what data', just that unicode gets handled correctly.
E.g. you will want to make sure that identical strings in different normalized forms (i.e. even if not in canonical form) still compare equal.
You will want to check that the string length detection is robust (and recognizes single, double, triple and quadruple byte characters). You will want to check that traversing a string from begin to end honours the same logic. More targeted tests for random access of unicode characters.
These are all things you knew, I'm sure. I'm just spelling them out to remind you that you need test data catered to exactly the edge cases, the logical properties that are intrinsic to Unicode.
Only then will you have proper test data.
Beyond this scope (technical correct Unicode handling) is actual localization (collation, charset conversion etc.). I refer to the Turkey Test
Here are helpful links:
http://minaret.info/test/collate.msp
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
You can try this one (there are some sentences in russian, greek, chinese, etc. to test Unicode):
http://www.madore.org/~david/misc/unitest/