Strings: TCHAR, LPWSTR, LPCTSTR, CString. What's what here, simply and quickly? (C++)

TCHAR szExeFileName[MAX_PATH];
GetModuleFileName(NULL, szExeFileName, MAX_PATH);
CString tmp;
lstrcpy(szExeFileName, tmp);
CString out;
out.Format("\nInstall32 at %s\n", tmp);
TRACE(tmp);
Error (At the Format):
error C2664: 'void ATL::CStringT<BaseType,StringTraits>::Format(const wchar_t *,...)' : cannot convert parameter 1 from 'const char [15]' to 'const wchar_t *'
I'd just like to get the current path that this program was launched from and copy it into a CString so I can use it elsewhere. I'm currently just trying to see the path by TRACE'ing it out. But strings, chars, char arrays... I can never get them all straight. Could someone give me a pointer?

The accepted answer addresses the problem. But the question also asked for a better understanding of the differences among all the character types on Windows.
Encodings
A char on Windows (and virtually all other systems) is a single byte. A byte is typically interpreted as either an unsigned value [0..255] or a signed value [-128..127]. (Older C++ standards guaranteed a signed range of only [-127..127], but most implementations give [-128..127]; C++20, which mandates two's complement representation, guarantees the larger range.)
ASCII is a character mapping for values in the range [0..127] to particular characters, so you can store an ASCII character in either a signed byte or an unsigned byte, and thus it will always fit in a char.
But ASCII doesn't have all the characters necessary for most languages, so the character sets were often extended by using the rest of the values available in a byte to represent the additional characters needed for certain languages (or families of languages). So, while [0..127] almost always mean the same thing, values like 150 can only be interpreted in the context of a particular encoding. For single-byte alphabets, these encodings are called code pages.
Code pages helped, but they didn't solve all the problems. You always had to know which code page a particular document used in order to interpret it correctly. Furthermore, you typically couldn't write a single document that used different languages.
Also, some languages have more than 256 characters, so there was no way to map one char to one character. This led to the development of multi-byte character encodings, where [0..127] is still ASCII, but some of the other values are "escapes" that mean you have to look at some number of following chars to figure out which character you really had. (It's best to think of multi-byte as variable-byte, as some characters require only one byte while others require two or more.) Multi-byte works, but it's a pain to code for.
Meanwhile, memory was becoming more plentiful, so a bunch of organizations got together and created Unicode, with the goal of making a universal mapping of values to characters (for appropriately vague definitions of "characters"). Initially, it was believed that all characters (or at least all the ones anyone would ever use) would fit into 16-bit values, which was nice because you wouldn't have to deal with multi-byte encodings--you'd just use two bytes per character instead of one. About this time, Microsoft decided to adopt Unicode as the internal representation for text in Windows.
WCHAR
So Windows has a type called WCHAR, a two-byte value that represents a "Unicode" "character". I'm using quotation marks here because Unicode evolved past the original two-byte encoding, so what Windows calls "Unicode" isn't really Unicode today--it's actually a particular encoding of Unicode called UTF-16. And a "character" is not as simple a concept in Unicode as it was in ASCII, because, in some languages, characters combine or otherwise influence adjacent characters in interesting ways.
Newer versions of Windows used these 16-bit WCHAR values for text internally, but there was a lot of code out there still written for single-byte code pages, and even some for multi-byte encodings. Those programs still used chars rather than WCHARs. And many of these programs had to work with people using older versions of Windows that still used chars internally as well as newer ones that use WCHAR. So a technique using C macros and typedefs was devised so that you could mostly write your code one way and--at compile time--choose to have it use either char or WCHAR.
TCHAR
To accomplish this flexibility, you use a TCHAR for a "text character". In some header file (often <tchar.h>), TCHAR would be typedef'ed to either char or WCHAR, depending on the compile time environment. Windows headers adopted conventions like this:
LPTSTR is a (long) pointer to a string of TCHARs.
LPWSTR is a (long) pointer to a string of WCHARs.
LPSTR is a (long) pointer to a string of chars.
(The L for "long" is a leftover from 16-bit days, when we had long, far, and near pointers. Those are all obsolete today, but the L prefix tends to remain.)
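Here is a simplified sketch of what those typedefs amount to (illustrative only; the real <winnt.h> and <tchar.h> headers are more involved):
typedef wchar_t WCHAR;          // on Windows, WCHAR is a 16-bit wchar_t

#ifdef UNICODE
typedef WCHAR TCHAR;            // wide build: TCHAR is a WCHAR
typedef WCHAR *LPTSTR;          // so LPTSTR is the same as LPWSTR
#else
typedef char TCHAR;             // "ANSI" build: TCHAR is a plain char
typedef char *LPTSTR;           // so LPTSTR is the same as LPSTR
#endif

typedef WCHAR *LPWSTR;          // always a pointer to wide characters
typedef char  *LPSTR;           // always a pointer to chars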
Most of the Windows API functions that take and return strings actually come in two versions: the A version (for "ANSI" characters) and the W version (for wide characters). (Again, historical legacy shows in these. The code page scheme was often called ANSI code pages, though I've never been clear if they were actually ruled by ANSI standards.)
So when you call a Windows API like this:
SetWindowText(hwnd, lptszTitle);
what you're really doing is invoking a preprocessor macro that expands to either SetWindowTextA or SetWindowTextW, consistent with how TCHAR is defined. That is, if you want strings of chars, you'll get the A version, and if you want strings of WCHARs, you get the W version.
But it's a little more complicated because of string literals. If you write this:
SetWindowText(hwnd, "Hello World"); // works only in "ANSI" mode
then that will only compile if you're targeting the char version, because "Hello World" is a string of chars, so it's only compatible with the SetWindowTextA version. If you wanted the WCHAR version, you'd have to write:
SetWindowText(hwnd, L"Hello World"); // only works in "Unicode" mode
The L here means you want wide characters. (The L actually stands for long, but it's a different sense of long than the long pointers above.) When the compiler sees the L prefix on the string, it knows that string should be encoded as a series of wchar_ts rather than chars.
(Compilers targeting Windows use a two-byte value for wchar_t, which happens to be identical to what Windows defines as a WCHAR. Compilers targeting other systems often use a four-byte value for wchar_t, which is what it really takes to hold a single Unicode code point.)
So if you want code that can compile either way, you need another macro to wrap the string literals. There are two to choose from: _T() and TEXT(). They work exactly the same way. The first comes from the compiler's library and the second from the OS's libraries. So you write your code like this:
SetWindowText(hwnd, TEXT("Hello World")); // compiles in either mode
If you're targeting chars, the macro is a no-op that just returns the regular string literal. If you're targeting WCHARs, the macro prepends the L.
So how do you tell the compiler that you want to target WCHAR? You define UNICODE and _UNICODE. The former is for the Windows APIs and the latter is for the compiler libraries. Make sure you never define one without the other.
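Putting those pieces together, here is a minimal sketch that compiles in either mode; it assumes a Windows project where <windows.h> and <tchar.h> are available and hwnd is a valid window handle:
#include <windows.h>
#include <tchar.h>

void SetTitle(HWND hwnd)
{
    TCHAR buffer[64];
    // _stprintf_s maps to sprintf_s or swprintf_s, just as TCHAR maps to char or WCHAR.
    _stprintf_s(buffer, 64, TEXT("Build %d"), 42);
    SetWindowText(hwnd, buffer);   // expands to SetWindowTextA or SetWindowTextW
}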

My guess is you are compiling in Unicode mode.
Try enclosing your format string in the _T macro, which is designed to provide an always-correct method of providing constant string parameters, regardless of whether you're compiling in Unicode or ANSI mode:
out.Format(_T("\nInstall32 at %s\n"), tmp);
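Putting it together for the original question, a sketch that also fixes the copy direction (the question's code copied the empty tmp over the path rather than the other way around); this assumes an MFC/ATL project where CString and TRACE are available:
CString tmp;
GetModuleFileName(NULL, tmp.GetBuffer(MAX_PATH), MAX_PATH);  // write straight into the CString's buffer
tmp.ReleaseBuffer();                                         // let CString recompute its length
CString out;
out.Format(_T("\nInstall32 at %s\n"), (LPCTSTR)tmp);
TRACE(_T("%s"), (LPCTSTR)out);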

Related

What exactly can wchar_t represent?

According to cppreference.com's doc on wchar_t:
wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode. A notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units) It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.
The Standard says in [basic.fundamental]/5:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
So, if I want to deal with unicode characters, should I use wchar_t?
Equivalently, how do I know if a specific unicode character is "supported" by wchar_t?
So, if I want to deal with unicode characters, should I use wchar_t?
First of all, note that the encoding does not force you to use any particular type to represent a certain character. You may use char to represent Unicode characters just as wchar_t can; you only have to remember that in UTF-8 up to 4 chars together form one code point, while a wchar_t holds a code point in a single unit (UTF-32 on Linux, etc.) or in up to 2 units working together (UTF-16 on Windows).
Next, there is no single definitive Unicode encoding. Some Unicode encodings use a fixed width for representing code points (like UTF-32), while others (such as UTF-8 and UTF-16) have variable lengths: in UTF-8 the letter 'a' uses just 1 byte, but characters outside the ASCII range use more bytes.
So you have to decide what kind of characters you want to represent and then choose your encoding accordingly. This choice affects how many bytes your data will take: e.g. using UTF-32 to represent mostly English characters leads to many zero bytes; UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually a better choice for East Asian languages.
Once you have decided on this, you should minimize the amount of conversions and stay consistent with your decision.
In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).
If you would like to do text-manipulation/interpretation on a code-point basis, char certainly is not the way to go if you have e.g. Japanese kanji. But if you just want to communicate your data and regard it no more as a quantitative sequence of bytes, you may just go with char.
The link to UTF-8 Everywhere was already posted as a comment, and I suggest having a look there as well. Another good read is What every programmer should know about encodings.
As of now, there is only rudimentary language support in C++ for Unicode (like the char16_t and char32_t data types, and the u8/u/U literal prefixes), so choosing a library for managing encodings (especially conversions) is certainly good advice.
wchar_t is used on Windows, which uses the UTF-16LE format. wchar_t requires the wide-char functions: for example wcslen(const wchar_t*) instead of strlen(const char*), and std::wstring instead of std::string.
Unix based machines (Linux, Mac, etc.) use UTF8. This uses char for storage, and the same C and C++ functions for ASCII, such as strlen(const char*) and std::string (see comments below about std::find_first_of)
wchar_t is 2 bytes (UTF16) in Windows. But in other machines it is 4 bytes (UTF32). This makes things more confusing.
For UTF32, you can use std::u32string which is the same on different systems.
You might consider converting UTF8 to UTF32, because that way each character is always 4 bytes, and you might think string operations will be easier. But that's rarely necessary.
UTF8 is designed so that the ASCII values 0 to 127 are never used to represent other Unicode code points. That includes the escape character '\', printf format specifiers, and common parsing characters like ','.
Consider the following UTF8 string. Let's say you want to find the comma:
std::string str = u8"汉,🙂"; //3 code points represented by 8 bytes
The ASCII value for comma is 44, and str is guaranteed to contain only one byte whose value is 44. To find the comma, you can simply use any standard function in C or C++ to look for ','
To find 汉, you can search for the string u8"汉" since this code point cannot be represented as a single character.
Some C and C++ functions don't work smoothly with UTF8. These include
strtok
strspn
std::find_first_of
The argument for above functions is a set of characters, not an actual string.
So str.find_first_of(u8"汉") does not work, because u8"汉" is 3 bytes and find_first_of will look for any one of those bytes. There is a chance that one of those bytes is used to represent a different code point.
On the other hand, str.find_first_of(u8",;abcd") is safe, because all the characters in the search argument are ASCII (str itself can contain any Unicode character)
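As a sketch of that point (mine, not part of the original answer), plain substring search with std::string::find is byte-exact, so it stays safe for multi-byte code points; this assumes pre-C++20 u8 literals, which yield const char[] as in the snippet above:
#include <iostream>
#include <string>

int main()
{
    std::string str = u8"汉,🙂";               // 3 code points stored in 8 bytes
    std::cout << str.find(',') << '\n';        // 3: byte offset of the ASCII comma
    std::cout << str.find(u8"汉") << '\n';      // 0: byte offset where the 3-byte sequence begins
}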
In rare cases UTF32 might be required (although I can't imagine where!) You can use std::codecvt to convert UTF8 to UTF32 to run the following operations:
std::u32string u32 = U"012汉"; //4 code points, represented by 4 elements
cout << u32.find_first_of(U"汉") << endl; //outputs 3
cout << u32.find_first_of(U'汉') << endl; //outputs 3
Side note:
You should use "Unicode everywhere", not "UTF8 everywhere".
In Linux, Mac, etc. use UTF8 for Unicode.
In Windows, use UTF16 for Unicode. Windows programmers use UTF16; they don't make pointless conversions back and forth to UTF8, though there are legitimate cases for using UTF8 in Windows.
Windows programmers do tend to use UTF8 for saving files, web pages, etc., so that's less worry for non-Windows programmers in terms of compatibility.
The language itself doesn't care which Unicode format you want to use, but in terms of practicality use a format that matches the system you are working on.
So, if I want to deal with unicode characters, should I use wchar_t?
That depends on what encoding you're dealing with. In case of UTF-8 you're just fine with char and std::string.
UTF-8 means the least encoding unit is 8 bits: all Unicode code points from U+0000 to U+007F are encoded by only 1 byte.
Beginning with code point U+0080 UTF-8 uses 2 bytes for encoding, starting from U+0800 it uses 3 bytes and from U+10000 4 bytes.
To handle this variable width (1 byte - 2 byte - 3 byte - 4 byte) char fits best.
Be aware that C functions like strlen provide byte-based results: "öö" is in fact a 2-character text, but strlen will return 4 because 'ö' is encoded as the two bytes 0xC3 0xB6.
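A short illustration of that byte-versus-character difference (my own sketch; the UTF-8 bytes are spelled out so it does not depend on the source file's encoding):
#include <cstdio>
#include <cstring>

int main()
{
    const char *s = "\xC3\xB6\xC3\xB6";        // UTF-8 for "öö": 2 characters, 4 bytes
    std::printf("%zu\n", std::strlen(s));      // prints 4 (bytes), not 2 (characters)
}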
UTF-16 means the least encoding unit is 16 bits: all code points from U+0000 to U+FFFF are encoded by 2 bytes; starting from U+10000, 4 bytes are used.
In case of UTF-16 you should use wchar_t and std::wstring because most of the characters you'll ever encounter will be 2-byte encoded.
When using wchar_t you can't use C-functions like strlen any more; you have to use the wide char equivalents like wcslen.
When using Visual Studio and building with configuration "Unicode" you'll get UTF-16: TCHAR and CString will be based on wchar_t instead of char.
It all depends what you mean by 'deal with', but one thing is for sure: where Unicode is concerned std::basic_string doesn't provide any real functionality at all.
In any particular program, you will need to perform X number of Unicode-aware operations, e.g. intelligent string matching, case folding, regex, locating word breaks, using a Unicode string as a path name maybe, and so on.
Supporting these operations there will almost always be some kind of library and / or native API provided by the platform, and the goal for me would be to store and manipulate my strings in such a way that these operations can be carried out without scattering knowledge of the underlying library and native API support throughout the code any more than necessary. I'd also want to future-proof myself as to the width of the characters I store in my strings in case I change my mind.
Suppose, for example, you decide to use ICU to do the heavy lifting. Immediately there is an obvious problem: an icu::UnicodeString is not related in any way to std::basic_string. What to do? Work exclusively with icu::UnicodeString throughout the code? Probably not.
Or maybe the focus of the application switches from European languages to Asian ones, so that UTF-16 becomes (perhaps) a better choice than UTF-8.
So, my choice would be to use a custom string class derived from std::basic_string, something like this:
typedef wchar_t mychar_t; // say
class MyString : public std::basic_string <mychar_t>
{
...
};
Straightaway you have flexibility in choosing the size of the code units stored in your container. But you can do much more than that. For example, with the above declaration (and after you add in boilerplate for the various constructors that you need to provide to forward them to std::basic_string), you still cannot say:
MyString s = "abcde";
Because "abcde" is a narrow string and various the constructors for std::basic_string <wchar_t> all expect a wide string. Microsoft solve this with a macro (TEXT ("...") or __T ("...")), but that is a pain. All we need to do now is to provide a suitable constructor in MyString, with signature MyString (const char *s), and the problem is solved.
In practise, this constructor would probably expect a UTF-8 string, regardless of the underlying character width used for MyString, and convert it if necessary. Someone comments here somewhere that you should store your strings as UTF-8 so that you can construct them from UTF-8 literals in your code. Well now we have broken that constraint. The underlying character width of our strings can be anything we like.
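A minimal sketch of that constructor idea, assuming mychar_t is wchar_t and the narrow input is UTF-8. Note that std::wstring_convert/std::codecvt_utf8 are deprecated since C++17 (and only cover the BMP when wchar_t is 16 bits), so a real project would plug in ICU or a platform API here instead:
#include <codecvt>
#include <locale>
#include <string>

typedef wchar_t mychar_t;

class MyString : public std::basic_string<mychar_t>
{
public:
    MyString() {}
    MyString(const mychar_t *s) : std::basic_string<mychar_t>(s) {}
    MyString(const char *utf8)               // narrow (UTF-8) literals now work too
    {
        std::wstring_convert<std::codecvt_utf8<mychar_t>, mychar_t> conv;
        assign(conv.from_bytes(utf8));
    }
};

// Usage: both of these now compile.
// MyString a = "abcde";
// MyString b = L"abcde";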
Another thing that people have been talking about in this thread is that find_first_of may not work properly for UTF-8 strings (and indeed some UTF-16 ones also). Well, now you can provide an implementation that does the job properly. Should take about half an hour. If there are other 'broken' implementations in std::basic_string (and I'm sure there are), then most of them can probably be replaced with similar ease.
As for the rest, it mainly depends what level of abstraction you want to implement in your MyString class. If your application is happy to have a dependency on ICU, for example, then you can just provide a couple of methods to convert to and from an icu::UnicodeString. That's probably what most people would do.
Or if you need to pass UTF-16 strings to / from native Windows APIs then you can add methods to convert to and from const WCHAR * (which again you would implement in such a way that they work for all values of mychar_t). Or you could go further and abstract away some or all of the Unicode support provided by the platform and library you are using. The Mac, for example, has rich Unicode support but it's only available from Objective-C so you have to wrap it.
It depends on how portable you want your code to be.
So you can add in whatever functionality you like, probably on an on-going basis as work progresses, without losing the ability to carry your strings around as a std::basic_string. Of one sort or another. Just try not to write code that assumes it knows how wide it is, or that it contains no surrogate pairs.
First of all, you should check (as you point out in your question) whether you are using Windows and Visual C++, where wchar_t is 16 bits, because in that case, to get full Unicode support, you'll need to assume the UTF-16 encoding.
The basic problem here is not the sizeof wchar_t you are using, but whether the libraries you are going to use provide full Unicode support.
Java has a similar problem, as its char type is 16 bits wide, so a priori it couldn't support the full Unicode space, but it does, as it uses the UTF-16 encoding and surrogate pairs to cope with the full 21-bit range of code points.
It's also worth noting that Unicode uses the supplementary planes only for rarer code points that are not normally used daily.
For Unicode support anyway, you need to use wide character sets, so wchar_t is a good beginning. If you are going to work with Visual Studio, then you have to check how its libraries deal with Unicode characters.
Another thing to note is that standard libraries deal with character sets (and this includes unicode) only when you add locale support (this requires some library to be initialized, e.g. setlocale(3)) and so, you'll see no unicode at all (only basic ascii) in cases where you have not called setlocale(3).
There are wide char functions for almost any str*(3) function, as well as for any stdio.h library function, to deal with wchar_ts. A little dig into the /usr/include/wchar.h file will reveal the names of the routines. Go to the manual pages for documentation on them: fgetws(3), fputwc(3), fputws(3), fwide(3), fwprintf(3), ...
Finally, consider again that, if you are dealing with Microsoft Visual C++, you have a different implementation from the beginning. Even if it aims to be completely standards-compliant, you'll have to cope with some idiosyncrasies of having a different implementation. You'll probably have different function names for some uses.

How are character sets stored in strings and wstrings?

So, I've been trying to do a bit of research on strings and wstrings, as I need to understand how they work for a program I'm creating, so I also looked into ASCII, Unicode, UTF-8 and UTF-16.
I believe I have an okay understanding of the concept of how these work, but what I'm still having trouble with is how they are actually stored in chars, strings, wchar_ts and wstrings.
So my questions are as follows:
Which character set and encoding is used for char and wchar_t? And are these types limited to using only these character sets / encodings?
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it automatically decided at compile time, for example, or do we have to explicitly tell it what to use?
From my understanding, UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from string to wstring and use this whenever a wstring is required, to make my code exclusively string-based, or just use wstring where needed instead?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology, as I'm trying to get to grips with this as best as I can.
I'm working in C++, by the way.
They use whatever character set and encoding you want. The types do not imply a specific character set or encoding. They do not even imply characters: you could happily do math problems with them. Don't do that, though; it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using unicode, nowadays. On older systems anything might happen.
UTF8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained here quite well: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, on Visual Studio it is 16 bits (and could be used to store UTF16), but on GCC it is 32 bits (and could thus be used to store unicode codepoints directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF8, and convert only when needed.
Which character set and encoding is used for char and wchar_t? And are these types limited to using only these character sets / encodings?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it automatically decided at compile time, for example, or do we have to explicitly tell it what to use?
The compiler knows what is appropriate for each system.
From my understanding, UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just stores bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same thing for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
Finally, if my understanding is correct, I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from string to wstring and use this whenever a wstring is required, to make my code exclusively string-based, or just use wstring where needed instead?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
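For instance (a small sketch of my own, not from the answer), the sizes are predictable across platforms:
#include <iostream>
#include <string>

int main()
{
    std::u16string s16 = u"a€𝄞";   // UTF-16: 1 + 1 + 2 code units (𝄞 needs a surrogate pair)
    std::u32string s32 = U"a€𝄞";   // UTF-32: exactly one code unit per code point
    std::cout << s16.size() << ' ' << s32.size() << '\n';   // prints "4 3" on any conforming compiler
}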
Everything about the encoding used is implementation-defined. Check your compiler documentation. It depends on the default locale, the encoding of the source file, and the OS console settings.
Types like string and wstring, operations on them, and C facilities like strcmp/wcscmp expect fixed-width encodings, so they would not work properly with variable-width ones like UTF-8 or UTF-16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful and not use fixed-width operations on them. C strings do have some functions for manipulating such strings in the standard library. You can use the classes from the <codecvt> header to convert between different encodings for C++ strings.
I would avoid wstring and use the C++11 exact-width character strings instead: std::u16string or std::u32string.
As an example, here is some info on how Windows uses these types/encodings.
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16, note this means that some unicode characters will use 2 wchar_t's
If you call a system function, e.g. puts then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. if you are using unicode).
So on Windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings, you have to convert them to UTF-16 and call the corresponding UTF-16 system functions.

Correct use of string storage in C and C++

Popular software developers and companies (Joel Spolsky, Fog Creek Software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t with respect to good coding practices?
I am particularly interested in POSIX compliance when writing software that leverages Unicode.
When using wchar_t, you can look up characters in an array of wide characters on a per-character or per-array-element basis:
/* C code fragment */
const wchar_t *overlord = L"ov€rlord";
if (overlord[2] == L'€')
wprintf(L"Character comparison on a per-character basis.\n");
How can you compare unicode bytes (or characters) when using char?
So far my preferred way of comparing strings and characters of type char in C often looks like this:
/* C code fragment */
const char *mail[] = { "ov€rlord@masters.lt", "ov€rlord@masters.lt" };
if (mail[0][2] == mail[1][2] && mail[0][3] == mail[1][3] && mail[0][4] == mail[1][4])
printf("%s\n%zu", *mail, strlen(*mail));
This method scans for the byte equivalent of a unicode character. The Unicode Euro symbol € takes up 3 bytes. Therefore one needs to compare three char array bytes to know if the Unicode characters match. Often you need to know the size of the character or string you want to compare and the bits it produces for the solution to work. This does not look like a good way of handling Unicode at all. Is there a better way of comparing strings and character elements of type char?
In addition, when using wchar_t, how can you scan the file contents to an array? The function fread does not seem to produce valid results.
If you know that you're dealing with unicode, neither char nor wchar_t are appropriate as their sizes are compiler/platform-defined. For example, wchar_t is 2 bytes on Windows (MSVC), but 4 bytes on Linux (GCC). The C11 and C++11 standards have been a bit more rigorous, and define two new character types (char16_t and char32_t) with associated literal prefixes for creating UTF-{8, 16, 32} strings.
If you need to store and manipulate unicode characters, you should use a library that is designed for the job, as neither the pre-C11 nor pre-C++11 language standards have been written with unicode in mind. There are a few to choose from, but ICU is quite popular (and supports C, C++, and Java).
I am particularly interested in POSIX compliance when writing software that leverages Unicode.
In this case, you'll probably want to use UTF-8 (with char) as your preferred Unicode string type. POSIX doesn't have a lot of functions for working with wchar_t — that's mostly a Windows thing.
This method scans for the byte equivalent of a unicode character. The Unicode Euro symbol € takes up 3 bytes. Therefore one needs to compare three char array bytes to know if the Unicode characters match. Often you need to know the size of the character or string you want to compare and the bits it produces for the solution to work.
No, you don't. You just compare the bytes. Iff the bytes match, the strings match. strcmp works just as well with UTF-8 as it does with any other encoding.
Unless you want something like a case-insensitive or accent-insensitive comparison, in which case you'll need a proper Unicode library.
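A quick sketch of that point (mine, not the answer's); the UTF-8 bytes of '€' are written out explicitly so nothing encoding-aware is involved:
#include <cstdio>
#include <cstring>

int main()
{
    const char *a = "ov\xE2\x82\xACrlord";     // "ov€rlord" in UTF-8 (€ = 0xE2 0x82 0xAC)
    const char *b = "ov\xE2\x82\xACrlord";
    std::printf("%d\n", std::strcmp(a, b) == 0);   // prints 1: same bytes, therefore same string
}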
You should never, ever compare bytes, or even code points, to decide if strings are equal. That's because a lot of strings can be identical from the user's perspective without being identical from the code point perspective.

How to use Unicode in C++?

Assuming a very simple program that:
asks for a name.
stores the name in a variable.
displays the variable's content on the screen.
It's so simple that it is the first thing one learns.
But my problem is that I don't know how to do the same thing if I enter the name using japanese characters.
So, if you know how to do this in C++, please show me an example (that I can compile and test)
Thanks.
user362981: Thanks for your help. I compiled the code that you wrote without problems, then the console window appears and I cannot enter any Japanese characters in it (using an IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, it also will not display them.
Svisstack : Also thanks for your help. But when I compile your code I get the following error:
warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'
You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:
With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.
and that
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.
So, it's implementation defined. Here's two implementations: On Linux, wchar_t is 4 bytes wide, and represents text in the UTF-32 encoding (regardless of the current locale). (Either BE or LE depending on your system, whichever is native.) Windows, however, has a 2 byte wide wchar_t, and represents UTF-16 code units with them. Completely different.
A better path: Learn about locales, as you'll need to know about them. For example, because I have my environment set up to use UTF-8 (Unicode), the following program will use Unicode:
#include <clocale>
#include <iostream>
#include <string>
int main()
{
    std::setlocale(LC_ALL, "");   // use the encoding of the user's environment
    std::cout << "What's your name? ";
    std::string name;
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}
...
$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8
But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.
Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)
On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (the byte 0xE8) to my terminal, I'll see "č" (which is what 0xE8 means in ISO-8859-2). Likewise, if you output a UTF-8 "è" (the bytes 0xC3 0xA8), I'll see the two characters "Ă¨" (the ISO-8859-2 reading of those bytes). This barfing of incorrect characters is called mojibake.
Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)
Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.
In the end, it's not all that complicated really: Know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:
#include <iostream>
#include <string>
int main() {
std::wstring name;
std::wcout << L"Enter your name: ";
std::wcin >> name;
std::wcout << L"Hello, " << name << std::endl;
}
There are other ways, but this is sort of the "minimal change" answer.
Pre-requisite: http://www.joelonsoftware.com/articles/Unicode.html
The above article is a must-read which explains what Unicode is, but a few lingering questions remain. Yes, Unicode has a unique code point for every character in every language, and furthermore those characters can be encoded and stored in memory potentially differently from the actual code point values. This way we can save memory, for example by using the UTF-8 encoding, which is great if the only language supported is English, since the memory representation is then essentially the same as ASCII (provided, of course, we know the encoding). In theory, if we know the encoding, we can store these longer Unicode characters however we like and read them back. But the real world is a little different.
How do you store a UNICODE character/string in a C++ program? Which encoding do you use? The answer is that you don't use any encoding but directly store the UNICODE code points in a Unicode character string, just like you store ASCII characters in an ASCII string. The question is what character size you should use, since UNICODE characters have no fixed size. The simple answer is that you choose a character size wide enough to hold the highest character code point (language) that you want to support.
The theory that a UNICODE character can take 2 bytes or more still holds true, and this can create some confusion. Shouldn't we be storing code points in 3 or 4 bytes then, which is really what it takes to represent all Unicode characters? Why is Visual C++ storing Unicode in wchar_t then, which is only 2 bytes, clearly not enough to store every UNICODE code point?
The reason we store UNICODE character code points in 2 bytes in Visual C++ is actually exactly the same reason why we were storing ASCII (=English) characters in one byte. At that time, we were thinking of only English, so one byte was enough. Now we are thinking of most international languages out there, but not all, so we are using 2 bytes, which is enough. Yes, it's true this representation will not allow us to represent those code points which take 3 bytes or more, but we don't care about those yet because those folks haven't even bought a computer yet. Yes, we are not using 3 or 4 bytes because we are still stingy with memory: why store an extra zero byte with every character when we are never going to use it (that language)? Again, this is exactly the same reason why ASCII was storing each character in one byte: why store a character in 2 or more bytes when English can be represented in one byte, with room to spare for those extra special characters?
In theory 2 bytes are not enough to represent every Unicode code point, but it is enough to hold anything that we may ever care about for now. A true UNICODE string representation could store each character in 4 bytes, but we just don't care about those languages.
Imagine 1000 years from now, when we find friendly aliens in abundance and want to communicate with them, incorporating their countless languages. A single Unicode character size will grow further, perhaps to 8 bytes, to accommodate all their code points. It doesn't mean we should start using 8 bytes for each Unicode character now. Memory is a limited resource; we allocate what we need.
Can I handle a UNICODE string as a C-style string?
In C++, an ASCII string can still be handled the C way, which is fairly common, by grabbing it through its char * pointer, where C functions can be applied. However, applying current C-style string functions to a UNICODE string will not make any sense, because it could contain individual NULL bytes that would terminate a C string.
A UNICODE string is no longer a plain buffer of text; well, it is, but now more complicated than a stream of single-byte characters terminating with a NULL byte. This buffer could be handled through its pointer even in C, but it would require UNICODE-compatible calls, or a C library that could then read and write those strings and perform operations on them.
This is made easier in C++ with a specialized class that represents a UNICODE string. This class handles the complexity of the Unicode string buffer and provides an easy interface. This class also decides whether each character of the Unicode string is 2 bytes or more; these are implementation details. Today it may use wchar_t (2 bytes), but tomorrow it may use 4 bytes for each character to support more (less known) languages. This is why it is often better to use TCHAR than a fixed size, as it maps to the right size when the implementation changes.
How do I index a UNICODE string?
It is also worth noting, particularly for C-style handling of strings, that indexes are used to traverse or find a substring in a string. In an ASCII string this index directly corresponds to the position of the item in that string, but it has no meaning in a UNICODE string and should be avoided.
What happens to the string terminating NULL byte?
Are UNICODE strings still terminated by a NULL byte? Is a single NULL byte enough to terminate the string? This is an implementation question, but a NULL character is still one Unicode code point, and like every other code point, it must be of the same size as any other (especially when no encoding is used). So the NULL character must be two bytes as well if the Unicode string implementation is based on wchar_t. All UNICODE code points are represented by the same size, irrespective of whether it's a null character or any other.
Does the Visual C++ debugger show UNICODE text?
Yes, if the text buffer is of type LPWSTR or any other type that supports UNICODE, Visual Studio 2005 and up support displaying international text in the debugger watch window (provided fonts and language packs are installed, of course).
Summary:
C++ doesn’t use any encoding to store unicode characters but it directly stores the UNICODE code points for each character in a string. It must pick character size large enough to hold the largest character of desirable languages (loosely speaking) and that character size will be fixed and used for all characters in the string.
Right now, 2 bytes are sufficient to represent most languages that we care about, and this is why 2 bytes are used to represent code points. In the future, if a new friendly space colony is discovered that wants to communicate with us, we will have to assign new Unicode code points to their language and use a larger character size to store those strings.
You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for unicode, so you'll be better off in the long run looking into something like ICU.
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main()
{
    wchar_t name[256];
    setlocale(LC_ALL, "");                 /* needed so wide I/O converts to the terminal's encoding */
    wprintf(L"Type a name: ");
    wscanf(L"%255ls", name);               /* %ls (not %s) reads into a wchar_t array portably */
    wprintf(L"Typed name is: %ls\n", name);
    return 0;
}

Is wchar_t needed for unicode support?

Is the wchar_t type required for unicode support? If not then what's the point of this multibyte type? Why would you use wchar_t when you could accomplish the same thing with char?
No.
Technically, no. Unicode is a standard that defines code points and it does not require a particular encoding.
So, you could use Unicode with the UTF-8 encoding, and then everything would fit in one or a short sequence of char objects and would even still be null-terminated.
The problem with UTF-8 and UTF-16 is that s[i] is not necessarily a character any more; it might be just a piece of one, whereas with sufficiently wide characters you can preserve the abstraction that s[i] is a single character, though that does not make strings fixed-length under various transformations.
32-bit integers are at least wide enough to solve the code point problem but they still don't handle corner cases, e.g., upcasing something can change the number of characters.
So it turns out that the x[i] problem is not completely solved even by char32_t, and those other encodings make poor file formats.
Your implied point, then, is quite valid: wchar_t is a failure, partly because Windows made it only 16 bits, and partly because it didn't solve every problem and was horribly incompatible with the byte stream abstraction.
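A one-liner illustrating the s[i] problem mentioned above (my own sketch, using explicit UTF-8 bytes):
#include <cstdio>
#include <string>

int main()
{
    std::string s = "\xC3\xA9";                 // UTF-8 for 'é' (U+00E9)
    std::printf("%zu\n", s.size());             // prints 2: s[0] and s[1] are halves of one character
}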
As has already been noted, wchar_t is absolutely not necessary for unicode support. Not only that, it is also utterly useless for that purpose, since the standard provides no fixed-size guarantee for wchar_t (in other words, you don't know ahead of time what sizeof( wchar_t ) will be on a particular system), whereas sizeof( char ) will always be 1.
In a UTF-8 encoding, any actual UNICODE character is mapped to a sequence of one or more (up to four, I believe) octets.
In a UTF-16 encoding, any actual UNICODE character is mapped to a sequence of one or more (up to two, I believe) 16-bit words.
In a UTF-32 encoding, any actual UNICODE character is mapped to exactly one 32-bit-word.
As you can see, wchar_t could be of some use for implementing UTF-16 support if the standard were nice enough to guarantee that wchar_t is always 16 bits wide. Unfortunately it does not, so you'd have to revert to a fixed-width integer type from <cstdint> (such as std::uint16_t) anyway.
<slightly OffTopic Microsoft-specific rant>
What's more infuriating is the additional confusion caused by Microsoft's Visual Studio UNICODE and MBCS (multi-byte character set) build configurations. Both of these are
A) confusing and
B) an outright lie
because neither does a "UNICODE" configuration in Visual Studio do anything to buy the programmer actual Unicode support, nor does the difference implied by these 2 build configurations make any sense. To explain, Microsoft recommends using TCHAR instead of using char or wchar_t directly. In an MBCS configuration, TCHAR expands to char, meaning you could potentially use this to implement UTF-8 support. In a UNICODE configuration, it expands to wchar_t, which in Visual Studio happens to be 16 bits wide and could potentially be used to implement UTF-16 support (which, as far as I'm aware, is the native encoding used by Windows). However, both of these encodings are multi-byte character sets, since both UTF-8 and UTF-16 allow for the possibility that a particular Unicode character may be encoded as more than one char/wchar_t respectively, so the term multi-byte character set (as opposed to single-byte character set?) makes little sense.
To add insult to injury, merely using the Unicode configuration does not actually give you one iota of Unicode support. To actually get that, you have to use an actual Unicode library like ICU ( http://site.icu-project.org/ ). In short, the wchar_t type and Microsoft's MBCS and UNICODE configurations add nothing of any use and cause unnecessary confusion, and the world would be a significantly better place if none of them had ever been invented.
</slightly OffTopic Microsoft-specific rant>
You absolutely do not need wchar_t to support Unicode in software; in fact, using wchar_t makes it even harder, because you do not know whether a "wide string" is UTF-16 or UTF-32. It depends on the OS: under Windows it is UTF-16, on all others UTF-32.
However, UTF-8 allows you to write Unicode-enabled software easily. (*)
See: https://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
(*) Note: under Windows you still have to use wchar_t, because it does not support UTF-8 locales, so for Unicode-enabled Windows programming you have to use the wchar_t-based API.
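In practice that means the usual UTF-8 to UTF-16 conversion helper on Windows looks roughly like this (a sketch only; error handling is omitted and the input is assumed to be valid UTF-8):
#include <windows.h>
#include <string>

std::wstring Utf8ToWide(const std::string &utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call asks how many wchar_t units are required.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    // Second call performs the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}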
wchar_t is absolutely NOT required for Unicode. UTF-8, for example, maintains backward compatibility with ASCII and uses plain 8-bit char. What wchar_t mostly buys you is support for characters that do not fit in a single char, i.e. character sets whose characters are encoded using more than sizeof(char) bytes.
wchar_t is not required. It's not even guaranteed to have a specific encoding. The point is to provide a data type that represents the wide characters native to your system, similar to char representing native characters. On Windows, for example, you can use wchar_t to access the wide character Win32 API functions.
Be careful: wchar_t is often 16 bits, which is not enough to store all Unicode characters, and it is a bad choice for data in UTF-8, for instance.
Because you can't accomplish the same thing with char:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
char is generally a single byte. (sizeof(char) must be equal to 1).
wchar_t was added to the language specifically to support wide characters, i.e. characters that don't fit in a single char.