TCHAR, WCHAR, LPWSTR, LPSTR, wstring clarification - c++

Hello everybody and good afternoon. I'm still new-ish to this scene, but I have quite the ambition for it and I've been trying to learn as much as I can. I consider myself adept in C++, but I've always been programming DOS programs, and recently I've broadened my horizons to the Windows API. With that being said, I've noticed that the Windows API is greatly intertwined with Unicode, while DOS used ANSI. I know that ANSI uses 8-bit character codes and Unicode uses 16-bit, so my questions are:
1) Why is this important? Is it more specific or able to hold more information since it's 16 bits versus 8? I mean, I know that there are some characters that ANSI does not support that Unicode does, but is that it?
2) What's the difference between TCHAR and WCHAR, and is it just the 16-bit version of char? If WCHAR is wide char, then what's TCHAR?
3) I understand that LPWSTR is a long pointer to a wide string, but when would you use this and why? Is it just a Windows thing? And isn't a long pointer automatically 16 bits? Does that mean a regular pointer is 8 bits? If so, why would you need the extra bits?
4) Next, why would you need wstring, and would you need to use WCHAR and TCHAR with it for certain functions? i.e.
wstring myStr;
TCHAR myChar;
if (myStr.find(myChar) != string::npos) { krmormrm }
or does it matter..
char myChar;
if (myStr.find(myChar) != string::npos) { jnrnikvnr }
5) Last but not least, i had trouble displaying WCHAR and wstring or even int without a conversion.. for instance (i figured it out sort of) i did:
WCHAR myChar = '1';
int i = 2;
wstring myString;
ofstream File1("myFile.txt");
if (File1.is_open())
{
File1 << (char)myChar; //if i didn't typecast it to char it displayed 49 instead of 1;
File1 << (WCHAR)i; //if i didn't typecast it to WCHAR(like to char instead)it displays symbols
WCHAR temp;
copy(myString.begin(), myString.end(), temp);
File1 << (char)temp;
}
OK, so I had a little problem with the wstring and copy. What I did in my real program (this was just a quick rescript) was take 9 WCHAR variables, use a wstringstream (wss) to load them all, and then put the result into myString (my wstring variable). To make sure they all loaded correctly, I copied it into a WCHAR temp to send it to File1 so I could physically see what was loaded. For some reason it loaded the variables I wanted AND extra variables I didn't want, and I've gone over the code multiple times and found nothing wrong. So I got rid of the copy function and displayed each variable individually with a for loop like:
for (int i = 0; i < 81; i++)
{
File1 << "Box " << (WCHAR)i << ": " << (char)BoxNum[i] << "\n";
}
and I concluded everything held the correct values. Just FYI, I was inputting the values into text boxes, retrieving the text, and storing it in individual variables. The text boxes are lined up 9 by 9, so there are 9 in a row and 9 in a column. I then used the variables from the boxes in the first row and put them in myString so I could just use the string.find() function to check for numbers in that row instead of going box by box. My problem was displaying this wstring. Anyway, sorry, just trying to provide as much info as possible; maybe someone can solve that problem for me as well.

An 8-bit character encoding only allows 256 different characters, minus a lot of control characters. That's enough for English, but when you want to cover other European languages, with characters like ö, ß, é or ø, this is simply not enough. Sure, you could use different codepages which place different characters on the upper 128 code points of an 8-bit encoding, but what if you need to mix multiple languages in the same string? And what about languages like Chinese, which have far more than 256 characters? With 16 bits per character you get 65,536 code points, which is enough to cover the whole Basic Multilingual Plane in a single encoding.
A WCHAR is always 16 bits. A TCHAR can be 8 or 16 bits, depending on whether you compile your program as a Unicode program (with the UNICODE macro defined) or not.
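To make the TCHAR switch concrete, here is a minimal sketch of the mechanism. It deliberately avoids the real Windows headers so it compiles anywhere: DEMO_UNICODE, DEMO_TCHAR, DEMO_TEXT, demo_tstring, demoCharSize and demoGreeting are hypothetical stand-ins (the real names are UNICODE, TCHAR and TEXT), so treat it as an illustration of the idea rather than what the Windows headers literally contain.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the real Windows names, so that the sketch
// also compiles outside Windows. DEMO_UNICODE plays the role of UNICODE.
#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;          // "Unicode build": wide characters
#define DEMO_TEXT(s) L##s            // turns "abc" into L"abc"
#else
typedef char DEMO_TCHAR;             // "ANSI build": 8-bit characters
#define DEMO_TEXT(s) s               // leaves "abc" untouched
#endif

// A string type that follows whatever DEMO_TCHAR currently is.
typedef std::basic_string<DEMO_TCHAR> demo_tstring;

// Size in bytes of one character unit in the current build mode.
inline std::size_t demoCharSize() { return sizeof(DEMO_TCHAR); }

// The same source line yields a narrow or a wide string depending on the switch.
inline demo_tstring demoGreeting() { return DEMO_TEXT("hello"); }
```

Compiling the same file with -DDEMO_UNICODE (or the equivalent compiler option) flips every DEMO_TCHAR to wchar_t without touching the rest of the code, which is the whole point of the TCHAR family.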
The difference between long-pointers and short-pointers is mostly historical and of not much concern on modern platforms (when you really want to know, check this question). The Windows API has a really long legacy dating back to the first Windows versions, so you find a lot of obsolete cruft in there. The length of a pointer depends on the kind of program. A 32bit program has 32bit long pointers and a 64bit program has 64bit long pointers. When you compile your program for 64bit, a LPWSTR will be a 64bit pointer (to a null-terminated array of 16-bit characters).
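The first code only does what you expect when TCHAR is 16-bit, because in that case WCHAR and TCHAR are the same type. When TCHAR is 8-bit, passing a single char to find still compiles, because char converts implicitly to wchar_t, but the match is only reliable for plain ASCII values. Passing a char string (a const char*) instead would not compile, because the find method requires the same character type the string is made from.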
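As a small illustration of keeping the character type consistent, the sketch below searches a std::wstring for a wchar_t. The helper name containsChar is made up for this example; it only wraps the standard find call from the question.

```cpp
#include <string>

// Returns true when the wide string contains the wide character c.
// The parameter type matches the string's element type (wchar_t),
// so no implicit char -> wchar_t conversion is involved.
inline bool containsChar(const std::wstring& s, wchar_t c) {
    return s.find(c) != std::wstring::npos;
}
```

For the 9-box row in the question, a call like containsChar(myStr, L'5') with a wide literal keeps everything in one character type regardless of how TCHAR is defined.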
When you write a 16-bit string to a file, it gets written to the file as a 16-bit string. If you then open it with a text editor and only see garbage, that's likely because your text editor interprets it with an 8-bit character encoding. Switch the encoding of the text editor to the encoding with which you wrote the file (UTF-16 might work), or convert the wstring to a string before you write it, as described in this question. But keep in mind that this cannot work well when there are characters in your strings which cannot be expressed in 8 bits.
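For the digits-only case in the question, the conversion to a narrow string can be sketched as below. This is a deliberately lossy helper with a made-up name (narrowAscii), not the method from the linked question: it keeps 7-bit ASCII and replaces everything else with '?', which is exactly the limitation mentioned above.

```cpp
#include <string>

// Converts a wide string to a narrow one, keeping only 7-bit ASCII.
// Any other character is replaced with '?', so the conversion is lossy.
inline std::string narrowAscii(const std::wstring& wide) {
    std::string out;
    out.reserve(wide.size());
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        // Go through an unsigned type so the range check also works
        // on platforms where wchar_t is signed.
        const unsigned long code = static_cast<unsigned long>(wide[i]);
        out.push_back(code < 128 ? static_cast<char>(code) : '?');
    }
    return out;
}
```

The result can be streamed into a plain std::ofstream, for example File1 << narrowAscii(myString), without any per-character casts.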

Related

c++ windows: Is there a way to convert from _UNICODE_STRING to std::string? [duplicate]

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:
When should I use std::wstring over std::string?
Can std::string hold the entire ASCII character set, including the special characters?
Is std::wstring supported by all popular C++ compilers?
What is exactly a "wide character"?
string? wstring?
std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
char vs. wchar_t
char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.
What about Unicode, then?
The problem is that neither char nor wchar_t is directly tied to unicode.
On Linux?
Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:
#include <cstring>
#include <cwchar>
#include <iostream>
int main()
{
const char text[] = "olé";
std::cout << "sizeof(char) : " << sizeof(char) << "\n";
std::cout << "text : " << text << "\n";
std::cout << "sizeof(text) : " << sizeof(text) << "\n";
std::cout << "strlen(text) : " << strlen(text) << "\n";
std::cout << "text(ordinals) :";
for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
{
unsigned char c = static_cast<unsigned char>(text[i]);
std::cout << " " << static_cast<unsigned int>(c);
}
std::cout << "\n\n";
// - - -
const wchar_t wtext[] = L"olé" ;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
//std::cout << "wtext : " << wtext << "\n"; <- error
std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
std::wcout << L"wtext : " << wtext << "\n";
std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
std::cout << "wcslen(wtext) : " << wcslen(wtext) << "\n";
std::cout << "wtext(ordinals) :";
for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
{
unsigned short wc = static_cast<unsigned short>(wtext[i]);
std::cout << " " << static_cast<unsigned int>(wc);
}
std::cout << "\n\n";
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise)
So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, so std::string is already unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with unicode chars because some combination of chars is forbidden in UTF-8.
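The gap between "4 chars" and "3 characters" can be measured directly. The following sketch counts code points in a UTF-8 std::string by skipping continuation bytes; the function name utf8CodePointCount is made up for the example, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Every code point starts with a byte that is NOT of the form 10xxxxxx,
// so counting the non-continuation bytes gives the number of code points.
// Assumes the input is valid UTF-8; no validation is performed.
inline std::size_t utf8CodePointCount(const std::string& s) {
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the "olé" example, size() reports 4 bytes while this function reports 3 code points. Note that code points are still not the same thing as user-perceived characters (combining marks exist), so this is only a first approximation.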
On Windows?
On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world before the advent of Unicode.
So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage of the machine, which for a long time could not be UTF-8. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, which is Unicode encoded in 2-byte units (or at the very least, UCS-2, which just lacks surrogate pairs and thus characters outside the BMP (>= 64K)).
Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.
Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low-level API function to set the label on a Win32 GUI).
Memory issues?
UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and a UTF-16 text will always use less memory than, or the same amount as, a UTF-32 text (and usually less).
If there is a memory issue, then you should know that for most western languages, UTF-8 text will use less memory than the same UTF-16 one.
Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.
All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.
See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
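The byte counts behind this comparison follow directly from the code point ranges of each encoding. The sketch below computes them per code point; the function names utf8Bytes, utf16Bytes and utf32Bytes are made up for the example, and the input is assumed to be a valid code point (at most 0x10FFFF and not a surrogate).

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to encode one code point in UTF-8.
inline std::size_t utf8Bytes(std::uint32_t cp) {
    if (cp < 0x80)    return 1;   // ASCII
    if (cp < 0x800)   return 2;   // e.g. most accented Latin letters
    if (cp < 0x10000) return 3;   // rest of the BMP, e.g. most CJK
    return 4;                     // outside the BMP
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP,
// a surrogate pair (two units) outside of it.
inline std::size_t utf16Bytes(std::uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// UTF-32 always uses one 32-bit unit.
inline std::size_t utf32Bytes(std::uint32_t) {
    return 4;
}
```

For example, 'a' (U+0061) costs 1 byte in UTF-8 and 2 in UTF-16, 'é' (U+00E9) costs 2 and 2, and '中' (U+4E2D) costs 3 and 2, which is where the trade-off between western and East Asian text comes from.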
Conclusion
When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§) : unless you use a toolkit/framework saying otherwise
Can std::string hold all the ASCII character set including special characters?
Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!
On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (After a comment from Johann Gerell):
a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:
ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
a char from 0 to 127 will be held correctly
a char from 128 to 255 will have a meaning that depends on your encoding (unicode, non-unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
Is std::wstring supported by almost all popular C++ compilers?
Mostly, with the exception of GCC based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
What is exactly a wide character?
In C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is meant to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.
My view is summarized in http://utf8everywhere.org of which I am a co-author.
Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.
And now, answering your questions:
A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how You treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
No.
Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.
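The "one or two pairs" rule for UTF-16 can be sketched as follows. The function name encodeUtf16 is made up for the example, and it assumes a valid code point (at most 0x10FFFF and not itself in the surrogate range).

```cpp
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as UTF-16 code units.
// Code points inside the BMP fit in a single 16-bit unit; everything above
// U+FFFF is split into a high and a low surrogate.
// Assumes cp is a valid code point; no validation is performed.
inline std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;   // 20 significant bits remain
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate: top 10 bits
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate: bottom 10 bits
    }
    return units;
}
```

So 'é' (U+00E9) is the single unit 0x00E9, while U+1F600 becomes the pair 0xD83D 0xDE00, which is why the number of wchar_t elements on Windows is not always the number of characters.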
So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatical conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.
My solution, after in-depth investigation, much frustration and the consequential experiences is the following:
accept, that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)
use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)
accept that such a UTF8String object is just a dumb but cheap container. Never access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those).
use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String); this is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).
use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.
add two utility functions to convert back & forth between UTF-8 and UCS-2:
UCS2String ConvertToUCS2( const UTF8String &str );
UTF8String ConvertToUTF8( const UCS2String &str );
The conversions are straightforward, google should help here ...
That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.
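One possible shape for the two conversion helpers named above is sketched here. This is not the author's implementation: it is a minimal hand-written version that assumes the input is well-formed, does not validate continuation bytes, and replaces anything UCS-2 cannot hold (4-byte UTF-8 sequences, stray or truncated bytes, values above U+FFFF) with U+FFFD.

```cpp
#include <cstddef>
#include <string>

typedef std::string  UTF8String;   // UTF-8 encoded bytes
typedef std::wstring UCS2String;   // one BMP code point per element

// UTF-8 -> UCS-2. Sketch only: continuation bytes are not validated.
inline UCS2String ConvertToUCS2(const UTF8String& str) {
    UCS2String out;
    const std::size_t n = str.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned b0 = static_cast<unsigned char>(str[i]);
        if (b0 < 0x80) {                                   // 1 byte: ASCII
            out.push_back(static_cast<wchar_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < n) {     // 2 bytes: U+0080..U+07FF
            const unsigned b1 = static_cast<unsigned char>(str[i + 1]);
            out.push_back(static_cast<wchar_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < n) {     // 3 bytes: U+0800..U+FFFF
            const unsigned b1 = static_cast<unsigned char>(str[i + 1]);
            const unsigned b2 = static_cast<unsigned char>(str[i + 2]);
            out.push_back(static_cast<wchar_t>(
                ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)));
            i += 3;
        } else {                                           // not representable in UCS-2
            out.push_back(static_cast<wchar_t>(0xFFFD));
            ++i;
            // Skip the continuation bytes that belonged to the rejected sequence.
            while (i < n && (static_cast<unsigned char>(str[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}

// UCS-2 -> UTF-8. Assumes the input holds no surrogates.
inline UTF8String ConvertToUTF8(const UCS2String& str) {
    UTF8String out;
    for (std::size_t i = 0; i < str.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(str[i]);
        if (cp > 0xFFFF) {
            cp = 0xFFFD;                                   // outside UCS-2
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A round trip such as ConvertToUTF8(ConvertToUCS2(text)) returns the original bytes for any valid UTF-8 text restricted to the BMP. Production code would more likely lean on ICU or the platform conversion APIs, which also handle malformed input properly.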
Alternatives & Improvements
conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with help of plain translation tables, e.g. const wchar_t tt_iso88951[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2.
if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)
ICU or other unicode libraries?
For advanced stuff.
When you want to have wide characters stored in your string. wide depends on the implementation. Visual C++ defaults to 16 bit if i remember correctly, while GCC defaults depending on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and at least as long as char. You can store unicode strings fine into std::string using the utf-8 encoding too. But it won't understand the meaning of unicode code points. So str.size() won't give you the amount of logical characters in your string, but merely the amount of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.
If your wchar_t is 32 bits long, then you can use utf-32 as an unicode encoding, and you can store and handle unicode strings using a fixed (utf-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters.
Yes, char is always at least 8 bit long, which means it can store all ASCII values.
Yes, all major compilers support it.
I frequently use std::string to hold utf-8 characters without any problems at all. I heartily recommend doing this when interfacing with API's which use utf-8 as the native string type as well.
For example, I use utf-8 when interfacing my code with the Tcl interpreter.
The major caveat is that the length of the std::string is no longer the number of characters in the string.
A good question!
I think DATA ENCODING (sometimes a CHARSET also involved) is a MEMORY EXPRESSION MECHANISM in order to save data to a file or transfer data via a network, so I answer this question as:
1. When should I use std::wstring over std::string?
If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g read from Windows'.REG file or network 2-byte stream, we should declare std::wstring variable to easily process them. e.g.: wstring ws=L"中国a"(6 octets memory: 0x4E2D 0x56FD 0x0061), we can use ws[0] to get character '中' and ws[1] to get character '国' and ws[2] to get character 'a', etc.
2. Can std::string hold the entire ASCII character set, including the special characters?
Yes. But notice: ASCII proper only covers 0x00~0x7F, while each std::string element can hold any octet 0x00~0xFF, each of which stands for one character in a single-byte charset. That includes printable text such as "123abc&*_&" as well as the special (control) characters you mentioned, which editors or terminals mostly print as a '.' to avoid confusion. And some other countries extend ASCII with their own charsets, e.g. Chinese ones, which use 2 octets to stand for one character.
3.Is std::wstring supported by all popular C++ compilers?
Maybe, or mostly. I have used: VC++6 and GCC 3.3, YES
4. What is exactly a "wide character"?
a wide character mostly indicates using 2 octets or 4 octets to hold all countries' characters. 2 octet UCS2 is a representative sample, and further e.g. English 'a', its memory is 2 octet of 0x0061(vs in ASCII 'a's memory is 1 octet 0x61)
When you want to store 'wide' (Unicode) characters.
Yes: 255 of them (excluding 0).
Yes.
Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using <<, and passing a filename to std::fstream.
I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter and I'm happy to change it if I have things wrong.
String literals
If you enter string literals that contain only characters that can be represented by your codepage then VS stores them in your file with 1 byte per character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.
If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all Non ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text and any characters missing from the code page are replaced with ?.
The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor or you need to convert it to utf-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well have not used a wide string literal.
std::cout
When outputting to the console using << you can only use std::string, not std::wstring and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).
std::fstream filenames
Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.
Your options
If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.
If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists may say use UTF8 then convert when needed, but why bother with the hassle.
If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.
Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code
#include <fstream>
#include <string>
#include <iostream>
#ifdef _WIN32
#include <Windows.h>
typedef std::wstring unicodestring;
#define UNI(text) L ## text
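std::string formatForConsole(const unicodestring &str)
{
std::string result;
if (str.empty()) return result;
//First call only asks WideCharToMultiByte for the required buffer size
int size = WideCharToMultiByte(CP_ACP, 0, str.c_str(), static_cast<int>(str.size()), NULL, 0, NULL, NULL);
if (size <= 0) return result;
result.resize(size);
//Second call does the conversion to the locale codepage
WideCharToMultiByte(CP_ACP, 0, str.c_str(), static_cast<int>(str.size()), &result[0], size, NULL, NULL);
return result;
}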
#else
typedef std::string unicodestring;
#define UNI(text) text
std::string formatForConsole(const unicodestring &str)
{
return str;
}
#endif
int main()
{
unicodestring fileName(UNI("fileName"));
std::ofstream fout;
fout.open(fileName);
std::cout << formatForConsole(fileName) << std::endl;
return 0;
}
would be fine on either platform I think.
Answers
So To answer your questions
1) If you are programming for Windows, then all the time, if cross platform then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform specific #ifdefs to work around the differences, if just using Linux then never.
2)Yes. In addition on Linux you can use it for all Unicode too. On Windows you can only use it for all unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all ASCII plus another 128 characters which change depending on the codepage your computer is setup to use.
3)I believe so, but if not then it is just a simple typedef of a 'std::basic_string' using wchar_t instead of char
4)A wide character is a character type which is bigger than the 1 byte standard char type. On Windows it is 2 bytes, on Linux it is 4 bytes.
Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.
The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.
Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding, is that the indices and lengths are measured in bytes, not characters.
The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for processing text in the implementation defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.
If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.
when you want to use Unicode strings and not just ascii, helpful for internationalisation
yes, but it doesn't play well with 0
not aware of any that don't
wide character is the compiler specific way of handling the fixed length representation of a unicode character, for MSVC it is a 2 byte character, for gcc I understand it is 4 bytes. and a +1 for http://www.joelonsoftware.com/articles/Unicode.html
If you want to keep your string code portable, you can use tstring and tchar. This is a widely used technique from long ago. In this sample I use a self-defined TCHAR, but you can find tchar.h implementations for Linux on the internet.
The idea is that wstring/wchar_t/UTF-16 is used on Windows and string/char/UTF-8 (or ASCII...) is used on Linux.
In the sample below, searching an English/Japanese mixed multibyte string works well on both Windows and Linux.
#include <locale.h>
#include <stdio.h>
#include <algorithm>
#include <string>
using namespace std;
#ifdef _WIN32
#include <tchar.h>
#else
#define _TCHAR char
#define _T(x) x
#define _tprintf printf
#endif
#define tstring basic_string<_TCHAR>
int main() {
setlocale(LC_ALL, "");
tstring s = _T("abcあいうえおxyz");
auto pos = s.find(_T("え"));
auto r = s.substr(pos);
_tprintf(_T("r=%s\n"), r.c_str());
}
1) As mentioned by Greg, wstring is helpful for internationalization, that's when you will be releasing your product in languages other than english
4) Check this out for wide character
http://en.wikipedia.org/wiki/Wide_character
When should you NOT use wide-characters?
When you're writing code before the year 1990.
Obviously, I'm being flip, but really, it's the 21st century now. 128 characters have long since ceased to be sufficient. Yes, you can use UTF8, but why bother with the headaches?

c++ wchar_t array and char array in programming for win32 console

I am writing a program that outputs Chinese characters, using Dev C++.
I've added
-finput-charset=big5
-fexec-charset=big5
in compiler parameters. I also set the code page of the console to be 950 (traditional chinese)
It works perfectly while in a simple cout like this:
cout << "中文字";
When it comes to a character array, it goes wrong as expected:
char chin[] = "中文字";
cout << chin[0];//output nothing
cout << chin[0] << chin[1];//output the first chinese character as one chinese character occupies 2 bytes.
So I decided to use wchar_t instead, and I have to use wcout with wchar_t or else a number will be shown.
However, wcout shows nothing in the console. All of the below show nothing:
wcout << L"中文字";
wchar_t chin2[] = L"中文字";
wcout << chin2[0];
What did I miss in using wchar_t to output Chinese (or other East Asian) characters? I really don't want to write 2 array members to show one single Chinese character.
There are subtle problems going on here.
The C++ compiler does not understand Big5 encoding. When you create a source code file and display it, you may see your familiar Chinese characters but the compiler sees a string of bytes. Big5 is a double byte charset so each input character will be represented by 2 bytes inside the compiler.
When that string of bytes is fed to a suitable output device the Chinese characters appear again. Code page 950 is compatible with Big5 so you see the "right" thing. But then you try to build on this and confusion is the result. Your second code sample uses L"" strings, but I expect those strings will contain half a character in each short.
The only "safe" character set you can use is Unicode. Windows internals were historically UCS-2 (a character is a single short) but are now theoretically UTF-16 (a character is a short, but some characters need a surrogate pair of two shorts). Not all existing software and older APIs fully support UTF-16 (or need to). Windows has very limited support for UTF-8 or other encodings. Everything gets converted into Unicode, so best to just leave it that way.
In practice, you should build your C++ code with Unicode settings, for UCS-2, and exercise caution if you need characters that would require multibyte sequences. You should ensure that any source code you write and any input text files are identified as whatever encoding they need to be, but are translated into Unicode internally. Leave your console as the default Unicode encoding, and everything will just work.
It is almost impossible to sensibly use Big5 as an internal encoding in a Windows program. Best not to try.

Create UTF-16 string from char*

So I have standard C string:
char* name = "Jakub";
And I want to convert it to UTF-16. I figured out, that UTF-16 will be twice as long - one character takes two chars.
So I create another string:
char name_utf_16[10]; //"Jakub" is 5 characters
Now, I believe with ASCII characters I will only use lower bytes, so for all of them it will be like 74 00 for J and so on. With that belief, I can make such code:
void charToUtf16(char* input, char* output, int length) {
/*Todo: how to check if output is long enough?*/
for(int i=0; i<length; i+=2) //Step over 2 bytes
{
//Lets use little-endian - smallest bytes first
output[i] = input[i];
output[i+1] = 0; //We will never have any data for this field
}
}
But, with this process, I ended with "Jkb". I know no way to test this properly - I've just sent the string to Minecraft Bukkit Server. And this is what it said upon disconnecting:
13:34:19 [INFO] Disconnecting jkb?? [/127.0.0.1:53215]: Outdated server!
Note: I'm aware that Minecraft uses big-endian. Code above is just an example, in fact, I have my conversion implemented in class.
Before I answer your question, consider this:
This area of programming is full of man traps. It makes a lot of sense to understand the differences between ASCII, UTF7/8 and ANSI/'MultiByte Character Strings (MBCS)', all of which to an english speaking programmer will look and feel identical, but need very different handling if they are introduced to a european or asian user.
ASCII: Character codes are in the range 0-127 (the printable ones are 32-126), and are only ever one byte. The clue is in the name: they are great for Americans, but not fit for purpose in the rest of the world.
ANSI/MBCS: This is the reason for 'code pages'. Characters 32-127 are the same as ASCII, but it is possible to have characters in the range of 128-255 as well for additional characters, and some of the 128-255 range can be used as a flag to mark that the character continues into a second, third or even fourth byte. To process the string correctly, you need both the string bytes and the correct code page. If you try processing the string using the wrong code page you will not have the right characters, and misinterpret whether a character is a one, two or even 4 byte character.
UTF7/8: These are 8-bit wide formatting of 21-bit unicode character points. in UTF-7 and UTF-8 unicode characters can be between one and four bytes long. The advantage that UTF encodings have over ANSI/MBCS is that there is no ambiguity caused by code pages. Each glyph in every script has a unique unicode code point, which means it is not possible to mangle the character sets by interpreting the data on a different computer with different regional settings.
So, to start to answer your question:
Whilst you are making the assumption that your char* will only point to an ASCII string, that is a really dangerous choice to make: users are in control of data that is typed in, not the programmer. Windows programs will be storing this as MBCS by default.
Your second assumption is that a UTF-16 encoding will be twice the size of an 8-bit encoding. That is not generally a safe assumption. Depending on the source encoding, the UTF-16 encoding may be twice the size, may be less than twice the size, and in an extreme example may actually be shorter in length.
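As a rough illustration of the size point, the sketch below computes how many bytes a UTF-8 string would need in UTF-16. The function name utf16ByteCount is made up, and it assumes the input is well-formed UTF-8 (no validation is done).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to store the same text as UTF-16.
// Assumption: `utf8` is well-formed UTF-8.
std::size_t utf16ByteCount(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) == 0x80) {
            continue;                   // continuation byte, not a new code point
        }
        units += (c >= 0xF0) ? 2 : 1;   // 4-byte UTF-8 needs a surrogate pair
    }
    return units * 2;                   // each UTF-16 code unit is 2 bytes
}
```

For plain ASCII the result is twice the input size, but the euro sign (bytes E2 82 AC in UTF-8) takes three bytes in UTF-8 and only two in UTF-16, so "twice as long" does not hold in general.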
So, what is the safe solution?
The safe option is to implement your application internally as Unicode. On Windows, this is a compiler option, and it means your Windows controls all use wchar_t* strings for their data type. On Linux I'm less sure that you can always use Unicode graphics and OS libraries. You must also use the wide-character functions such as wcslen() to get the length of strings etc. When you interact with the outside world, be precise about the character encodings used.
The answer to your question then becomes a different question: what do I do when I receive non-UTF-16 data?
Firstly, be very clear about what assumptions you are making about its format. Secondly, accept the fact that sometimes the conversion to UTF-16 may fail.
If you are clear on the source format, you can then choose the appropriate Win32 or standard library converter, e.g. mbstowcs() in <cstdlib> or MultiByteToWideChar() on Windows, and you should then look for evidence that the conversion failed before using the result. However, using either of these approaches safely means you need to understand ALL of the above answer.
All other options introduce risk. Use MBCS strings and you will have data strings mangled by being entered using one code page and processed using a different code page. Assume ASCII data, and when you encounter a non-ASCII character your code will break, and you will 'blame' the user for your shortcomings.
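A minimal sketch of "convert, then check for failure" using the standard mbstowcs() might look like this. The wrapper name toWide is made up, and the conversion uses whatever encoding the current C locale specifies, so the caller is assumed to have set the locale appropriately.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical wrapper around std::mbstowcs. Converts a multibyte string
// (in the current C locale's encoding) to wide characters.
// Returns false if the input is not valid in that encoding.
bool toWide(const char* input, std::wstring& output) {
    // A wide string never needs more elements than the input has bytes.
    std::wstring buffer(std::strlen(input), L'\0');
    const std::size_t written =
        std::mbstowcs(&buffer[0], input, buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return false;            // invalid byte sequence: do not use the result
    }
    buffer.resize(written);      // drop the unused tail
    output = buffer;
    return true;
}
```

The important part is the comparison against (size_t)-1: that is how mbstowcs() reports a byte sequence it cannot convert.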
Why do you want to write your own Unicode conversion functionality when there are existing C/C++ functions for this, like mbstowcs(), which is declared in <cstdlib>?
If you still want to make your own stuff, then have a look at Unicode Consortium's open source code which can be found here:
Convert UTF-16 to UTF-8 under Windows and Linux, in C
output[i] = input[i];
This will assign every other byte of the input, because you increment i by 2. So no wonder that you obtain "Jkb".
You probably wanted to write:
output[i] = input[i / 2];
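Putting that fix together, a corrected version of the function could look like the sketch below. It keeps the question's name charToUtf16 but loops over input characters instead of output bytes. It assumes the input is pure 7-bit ASCII and that the caller provides an output buffer of at least 2 * length bytes; it still writes little-endian, as in the question.

```cpp
#include <cstddef>

// Sketch of a corrected charToUtf16 from the question.
// Assumptions: `input` holds only 7-bit ASCII, and `output` has room
// for 2 * length bytes. Writes UTF-16LE (low byte first).
void charToUtf16(const char* input, char* output, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        output[2 * i]     = input[i];  // low byte: the ASCII code
        output[2 * i + 1] = 0;         // high byte: always zero for ASCII
    }
}
```

For big-endian output (as the Minecraft protocol mentioned in the question expects), the two assignments would simply swap places.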

How to use Unicode in C++?

Assuming a very simple program that:
ask a name.
store the name in a variable.
display the variable content on the screen.
It's so simple that is the first thing that one learns.
But my problem is that I don't know how to do the same thing if I enter the name using japanese characters.
So, if you know how to do this in C++, please show me an example (that I can compile and test)
Thanks.
user362981 : Thanks for your help. I compiled the code that you wrote without problem, them the console window appears and I cannot enter any Japanese characters on it (using IME). Also if
I change a word in your code ("hello") to one that contains Japanese characters, it also will not display these.
Svisstack : Also thanks for your help. But when I compile your code I get the following error:
warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'
You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:
With the wchar_t wide character type, ANSI/ISO C provides for
inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide
character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.
and that
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently,
programs that need to be portable across any C or C++ compiler should not use wchar_t
for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide
characters, which may be Unicode characters in some compilers.
So, it's implementation defined. Here's two implementations: On Linux, wchar_t is 4 bytes wide, and represents text in the UTF-32 encoding (regardless of the current locale). (Either BE or LE depending on your system, whichever is native.) Windows, however, has a 2 byte wide wchar_t, and represents UTF-16 code units with them. Completely different.
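If you want to see which situation you are in, a small check like the following can tell you. Both function names are made up for illustration.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: size of one wchar_t on this implementation, in bytes.
std::size_t wcharBytes() {
    return sizeof(wchar_t);
}

// Hypothetical helper: true when a single wchar_t can hold every Unicode
// code point (the highest one is U+10FFFF).
bool wcharHoldsAnyCodePoint() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}
```

On a typical Linux toolchain this should report 4 bytes and true; with Visual C++ on Windows it should report 2 bytes and false, which is exactly the difference described above.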
A better path: Learn about locales, as you'll need to know that. For example, because I have my environment setup to use UTF-8 (Unicode), the following program will use Unicode:
#include <clocale>
#include <iostream>
#include <string>
int main()
{
std::setlocale(LC_ALL, "");
std::cout << "What's your name? ";
std::string name;
std::getline(std::cin, name);
std::cout << "Hello there, " << name << "." << std::endl;
return 0;
}
...
$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8
But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.
Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)
On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see (ISO-8859-2) "Ă¨" (0xC3 0xA8). This barfing of incorrect characters has been called Mojibake.
Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)
Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.
In the end, it's not all that complicated really: Know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
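As an example of such a conversion, here is a minimal sketch for the simplest case: ISO-8859-1 input that needs to go out as UTF-8. It works because every ISO-8859-1 byte value is also its Unicode code point. The function name latin1ToUtf8 is made up, and it would not be correct for ISO-8859-2 or Windows-1252, which need a lookup table.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input really is ISO-8859-1, where byte value == code point.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8.push_back(static_cast<char>(c));                // ASCII: unchanged
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));  // lead byte
            utf8.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation
        }
    }
    return utf8;
}
```

Feeding it the single byte 0xE8 yields the two bytes 0xC3 0xA8, the same UTF-8 "è" used in the Mojibake example above.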
Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:
#include <iostream>
#include <string>
int main() {
std::wstring name;
std::wcout << L"Enter your name: ";
std::wcin >> name;
std::wcout << L"Hello, " << name << std::endl;
}
There are other ways, but this is sort of the "minimal change" answer.
Pre-requisite: http://www.joelonsoftware.com/articles/Unicode.html
The above article is a must read which explains what unicode is, but a few lingering questions remain. Yes, UNICODE has a unique code point for every character in every language, and furthermore they can be encoded and stored in memory potentially differently from what the actual code is. This way we can save memory by, for example, using UTF-8 encoding, which is great if the language supported is just English and so the memory representation is essentially the same as ASCII – this of course requires knowing the encoding itself. In theory, if we know the encoding, we can store these longer UNICODE characters however we like and read them back. But the real world is a little different.
How do you store a UNICODE character/string in a C++ program? Which encoding do you use? The answer is you don’t use any encoding but you directly store the UNICODE code points in a unicode character string just like you store ASCII characters in ASCII string. The question is what character size should you use since UNICODE characters has no fixed size. The simple answer is you choose character size which is wide enough to hold the highest character code point (language) that you want to support.
The theory that a UNICODE character can take 2 bytes or more still holds true and this can create some confusion. Shouldn’t we be storing code points in 3 or 4 bytes then, which is really what it takes to represent all unicode characters? Why is Visual C++ storing unicode in wchar_t then, which is only 2 bytes, clearly not enough to store every UNICODE code point?
The reason we store UNICODE character code point in 2 bytes in Visual C++ is actually exactly the same reason why we were storing ASCII (=English) character into one byte. At that time, we were thinking of only English so one byte was enough. Now we are thinking of most international languages out there but not all so we are using 2 bytes which is enough. Yes it’s true this representation will not allow us to represent those code points which takes 3 bytes or more but we don’t care about those yet because those folks haven’t even bought a computer yet. Yes we are not using 3 or 4 bytes because we are still stingy with memory, why store the extra 0(zero) byte with every character when we are never going to use it (that language). Again this is exactly the same reasons why ASCII was storing each character in one byte, why store a character in 2 or more bytes when English can be represented in one byte and room to spare for those extra special characters!
In theory 2 bytes are not enough to present every Unicode code point but it is enough to hold anything that we may ever care about for now. A true UNICODE string representation could store each character in 4 bytes but we just don’t care about those languages.
Imagine 1000 years from now, when we find friendly aliens in abundance and want to communicate with them, incorporating their countless languages. A single unicode character size will grow further, perhaps to 8 bytes, to accommodate all their code points. It doesn’t mean we should start using 8 bytes for each unicode character now. Memory is a limited resource; we allocate what we need.
Can I handle UNICODE string as C Style string?
In C++, an ASCII string can still be handled C-style, and it is fairly common to grab it by its char * pointer and apply C functions to it. However, applying the byte-oriented C string functions to a UNICODE string makes no sense, because the string can contain single NULL bytes, any of which would terminate a C string.
A UNICODE string is no longer a plain buffer of text, well it is but now more complicated than a stream of single byte characters terminating with a NULL byte. This buffer could be handled by its pointer even in C but it will require a UNICODE compatible calls or a C library which could than read and write those strings and perform operations.
This is made easier in C++ with a specialized class that represents a UNICODE string. This class handles complexity of the unicode string buffer and provide an easy interface. This class also decides if each character of the unicode string is 2 bytes or more – these are implementation details. Today it may use wchar_t (2 bytes) but tomorrow it may use 4 bytes for each character to support more (less known) language. This is why it is always better to use TCHAR than a fixed size which maps to the right size when implementation changes.
How do I index a UNICODE string?
It is also worth noting and particularly in C style handling of strings that they use index to traverse or find sub string in a string. This index in ASCII string directly corresponded to the position of item in that string but it has no meaning in a UNICODE string and should be avoided.
What happens to the string terminating NULL byte?
Are UNICODE strings still terminated by NULL byte? Is a single NULL byte enough to terminate the string? This is an implementation question but a NULL byte is still one unicode code point and like every other code point, it must still be of same size as any other(specially when no encoding). So the NULL character must be two bytes as well if unicode string implementation is based on wchar_t. All UNICODE code points will be represented by same size irrespective if its a null byte or any other.
Does Visual C++ Debugger shows UNICODE text?
Yes, if text buffer is type LPWSTR or any other type that supports UNICODE, Visual Studio 2005 and up support displaying the international text in debugger watch window (provided fonts and language packs are installed of course).
Summary:
C++ doesn’t use any encoding to store unicode characters but it directly stores the UNICODE code points for each character in a string. It must pick character size large enough to hold the largest character of desirable languages (loosely speaking) and that character size will be fixed and used for all characters in the string.
Right now, 2 bytes are sufficient to represent most languages that we care about, which is why 2 bytes are used to represent a code point. In future, if a new friendly space colony is discovered and we want to communicate with them, we will have to assign new unicode code points to their language and use a larger character size to store those strings.
You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for unicode, so you'll be better off in the long run looking into something like ICU.
#include <stdio.h>
#include <wchar.h>
int main()
{
wchar_t name[256];
wprintf(L"Type a name: ");
wscanf(L"%255ls", name); /* %ls means a wide string; 255 leaves room for the terminator */
wprintf(L"Typed name is: %ls\n", name);
return 0;
}

std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:
When should I use std::wstring over std::string?
Can std::string hold the entire ASCII character set, including the special characters?
Is std::wstring supported by all popular C++ compilers?
What is exactly a "wide character"?
string? wstring?
std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
char vs. wchar_t
char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.
What about Unicode, then?
The problem is that neither char nor wchar_t is directly tied to unicode.
On Linux?
Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:
#include <cstring>
#include <cwchar>
#include <iostream>
int main()
{
const char text[] = "olé";
std::cout << "sizeof(char) : " << sizeof(char) << "\n";
std::cout << "text : " << text << "\n";
std::cout << "sizeof(text) : " << sizeof(text) << "\n";
std::cout << "strlen(text) : " << strlen(text) << "\n";
std::cout << "text(ordinals) :";
for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
{
unsigned char c = static_cast<unsigned char>(text[i]);
std::cout << " " << static_cast<unsigned int>(c);
}
std::cout << "\n\n";
// - - -
const wchar_t wtext[] = L"olé" ;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
//std::cout << "wtext : " << wtext << "\n"; <- error
std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
std::wcout << L"wtext : " << wtext << "\n";
std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
std::cout << "wcslen(wtext) : " << wcslen(wtext) << "\n";
std::cout << "wtext(ordinals) :";
for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
{
unsigned short wc = static_cast<unsigned short>(wtext[i]);
std::cout << " " << static_cast<unsigned int>(wc);
}
std::cout << "\n\n";
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise)
So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, so std::string is already unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with unicode chars because some combination of chars is forbidden in UTF-8.
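To illustrate the caution about truncating, here is a minimal sketch that cuts a UTF-8 std::string to at most a given number of bytes without splitting a multi-byte character. The name utf8Truncate is made up, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: truncates to at most maxBytes bytes, backing up so
// that no multi-byte UTF-8 sequence is cut in half.
// Assumption: `s` is well-formed UTF-8.
std::string utf8Truncate(const std::string& s, std::size_t maxBytes) {
    if (s.size() <= maxBytes) {
        return s;
    }
    std::size_t cut = maxBytes;
    // If the first dropped byte is a continuation byte (10xxxxxx), the cut
    // falls inside a character, so move back to that character's lead byte.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    return s.substr(0, cut);
}
```

With the "olé" bytes from above (111 108 195 169), truncating to 3 bytes returns "ol" rather than "ol" followed by a dangling 195.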
On Windows?
On Windows, this is a bit different. Win32 had to support a lot of application working with char and on different charsets/codepages produced in all the world, before the advent of Unicode.
So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine, which could not be UTF-8 for a long time. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode based applications, Windows uses wchar_t, which is 2-bytes wide, and is encoded in UTF-16, which is Unicode encoded on 2-bytes characters (or at the very least, UCS-2, which just lacks surrogate-pairs and thus characters outside the BMP (>= 64K)).
Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.
Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted in wchar_t when using API like SetWindowText() (low level API function to set the label on a Win32 GUI).
Memory issues?
UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and a UTF-16 text will always use less than or the same amount of memory as a UTF-32 text (and usually less).
If there is a memory issue, then you should know that for most western languages, UTF-8 text will use less memory than the same UTF-16 one.
Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.
All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.
See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
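The per-character costs can be written down directly from the encoding rules. The two function names in this sketch are made up, and it assumes the argument is a valid Unicode code point (at most U+10FFFF, not a surrogate).

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
std::size_t utf8Bytes(char32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // most Latin, Greek, Cyrillic, ...
    if (cp < 0x10000) return 3;  // rest of the BMP, including most CJK
    return 4;                    // outside the BMP
}

// Hypothetical helper: bytes needed to encode one code point in UTF-16.
std::size_t utf16Bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4; // one code unit, or a surrogate pair
}
```

So 'A' costs 1 byte in UTF-8 against 2 in UTF-16, 'é' (U+00E9) costs 2 in both, and '中' (U+4E2D) costs 3 in UTF-8 against 2 in UTF-16.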
Conclusion
When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§) : unless you use a toolkit/framework saying otherwise
Can std::string hold all the ASCII character set including special characters?
Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!
On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (After a comment from Johann Gerell):
a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:
ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
a char from 0 to 127 will be held correctly
a char from 128 to 255 will have a signification depending on your encoding (unicode, non-unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
Is std::wstring supported by almost all popular C++ compilers?
Mostly, with the exception of GCC based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
What is exactly a wide character?
On C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to put inside characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.
My view is summarized in http://utf8everywhere.org of which I am a co-author.
Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.
And now, answering your questions:
A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how You treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
No.
Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.
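To show what "one or two pairs" means in practice, here is a minimal sketch of encoding a single code point as UTF-16 code units. The function name encodeUtf16 is made up, and it assumes the argument is a valid code point (at most U+10FFFF, not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Assumption: `cp` is a valid code point (<= 0x10FFFF, not a surrogate).
std::u16string encodeUtf16(char32_t cp) {
    std::u16string units;
    if (cp < 0x10000) {
        units.push_back(static_cast<char16_t>(cp));  // fits in one unit
    } else {
        cp -= 0x10000;                               // 20 bits remain
        units.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return units;
}
```

U+00E9 comes out as the single unit 0x00E9, while U+1F600 comes out as the pair 0xD83D 0xDE00.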
So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatical conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.
My solution, after in-depth investigation, much frustration and the consequential experiences is the following:
accept, that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)
use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)
accept that such an UTF8String object is just a dumb, but cheap container. Do never ever access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really just really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those).
use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String). This is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).
use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.
add two utility functions to convert back & forth between UTF-8 and UCS-2:
UCS2String ConvertToUCS2( const UTF8String &str );
UTF8String ConvertToUTF8( const UCS2String &str );
The conversions are straightforward, google should help here ...
That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.
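For what it's worth, the two utility functions could be sketched roughly as follows. This is not the answerer's implementation, only one possible version under these assumptions: the input is well-formed UTF-8 (continuation bytes are not validated), and anything that does not fit in UCS-2 (4-byte sequences, truncated input) is replaced by U+FFFD.

```cpp
#include <cstddef>
#include <string>

typedef std::string  UTF8String;   // names taken from the answer above
typedef std::wstring UCS2String;

// Decodes UTF-8 into 16-bit code points, one per wchar_t.
UCS2String ConvertToUCS2(const UTF8String& str) {
    UCS2String out;
    const std::size_t n = str.size();
    for (std::size_t i = 0; i < n;) {
        const unsigned char b0 = static_cast<unsigned char>(str[i]);
        unsigned long cp = 0xFFFD;           // replacement character by default
        std::size_t len = 1;
        if (b0 < 0x80) {
            cp = b0;                         // 1 byte: ASCII
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < n) {
            cp = ((b0 & 0x1Ful) << 6)
               | (static_cast<unsigned char>(str[i + 1]) & 0x3Ful);
            len = 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < n) {
            cp = ((b0 & 0x0Ful) << 12)
               | ((static_cast<unsigned char>(str[i + 1]) & 0x3Ful) << 6)
               | (static_cast<unsigned char>(str[i + 2]) & 0x3Ful);
            len = 3;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4;                         // outside the BMP: not UCS-2
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Encodes 16-bit code points as UTF-8 (1 to 3 bytes each).
UTF8String ConvertToUTF8(const UCS2String& str) {
    UTF8String out;
    for (wchar_t wc : str) {
        const unsigned long cp = static_cast<unsigned long>(wc) & 0xFFFFul;
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Once the text is in a UCS2String, indexing gives one BMP character per element, which is the character-by-character access the list above asks for.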
Alternatives & Improvements
conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with help of plain translation tables, e.g. const wchar_t tt_iso88951[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2.
if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)
ICU or other unicode libraries?
For advanced stuff.
When you want to have wide characters stored in your string. wide depends on the implementation. Visual C++ defaults to 16 bit if i remember correctly, while GCC defaults depending on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and at least as long as char. You can store unicode strings fine into std::string using the utf-8 encoding too. But it won't understand the meaning of unicode code points. So str.size() won't give you the amount of logical characters in your string, but merely the amount of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.
If your wchar_t is 32 bits long, then you can use utf-32 as an unicode encoding, and you can store and handle unicode strings using a fixed (utf-32 is fixed length) encoding. This means your wstring's s.size() function will then return the right amount of wchar_t elements and logical characters.
Yes, char is always at least 8 bit long, which means it can store all ASCII values.
Yes, all major compilers support it.
I frequently use std::string to hold utf-8 characters without any problems at all. I heartily recommend doing this when interfacing with API's which use utf-8 as the native string type as well.
For example, I use utf-8 when interfacing my code with the Tcl interpreter.
The major caveat is the length of the std::string, is no longer the number of characters in the string.
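If you do need the number of characters (code points, strictly speaking), a minimal sketch is to count every byte that is not a UTF-8 continuation byte. The name utf8CodePointCount is made up, and it assumes the string holds well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of Unicode code points in a UTF-8 string.
// Assumption: `s` is well-formed UTF-8.
std::size_t utf8CodePointCount(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {  // skip continuation bytes (10xxxxxx)
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "olé", size() reports 4 while this function reports 3. Note that a code point is still not always what a user perceives as one character (combining accents, for example), so this is only a first approximation.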
A good question!
I think DATA ENCODING (sometimes a CHARSET also involved) is a MEMORY EXPRESSION MECHANISM in order to save data to a file or transfer data via a network, so I answer this question as:
1. When should I use std::wstring over std::string?
If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g read from Windows'.REG file or network 2-byte stream, we should declare std::wstring variable to easily process them. e.g.: wstring ws=L"中国a"(6 octets memory: 0x4E2D 0x56FD 0x0061), we can use ws[0] to get character '中' and ws[1] to get character '国' and ws[2] to get character 'a', etc.
2. Can std::string hold the entire ASCII character set, including the special characters?
Yes. But notice: American ASCII, means each 0x00~0xFF octet stands for one character, including printable text such as "123abc&*_&" and you said special one, mostly print it as a '.' avoid confusing editors or terminals. And some other countries extend their own "ASCII" charset, e.g. Chinese, use 2 octets to stand for one character.
3.Is std::wstring supported by all popular C++ compilers?
Maybe, or mostly. I have used: VC++6 and GCC 3.3, YES
4. What is exactly a "wide character"?
a wide character mostly indicates using 2 octets or 4 octets to hold all countries' characters. 2 octet UCS2 is a representative sample, and further e.g. English 'a', its memory is 2 octet of 0x0061(vs in ASCII 'a's memory is 1 octet 0x61)
When you want to store 'wide' (Unicode) characters.
Yes: 255 of them (excluding 0).
Yes.
Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using << and passing a filename to std::fstream.
I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter and I'm happy to change it if I have things wrong.
String literals
If you enter string literals that contain only characters that can be represented by your codepage then VS stores them in your file with 1 byte per character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.
If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all non-ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text and any characters missing from the code page are replaced with ?.
The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L, making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor or you need to convert it to UTF-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well have not used a wide string literal.
std::cout
When outputting to the console using << you can only use std::string, not std::wstring and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).
std::fstream filenames
Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.
Your options
If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.
If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists may say use UTF8 then convert when needed, but why bother with the hassle.
If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.
Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code
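#include <fstream>
#include <iostream>
#include <string>
#ifdef _WIN32
#include <Windows.h>   // only available (and only needed) on Windows
typedef std::wstring unicodestring;
#define UNI(text) L ## text
std::string formatForConsole(const unicodestring &str)
{
    // Convert UTF-16 to the active code page (CP_ACP, assumed here to match
    // the "locale codepage" described above); unmappable characters become '?'.
    if (str.empty())
        return std::string();
    int len = WideCharToMultiByte(CP_ACP, 0, str.c_str(), (int)str.size(),
                                  NULL, 0, NULL, NULL);
    std::string result(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, str.c_str(), (int)str.size(),
                        &result[0], len, NULL, NULL);
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text
std::string formatForConsole(const unicodestring &str)
{
    return str;   // already UTF-8, which the terminal is assumed to understand
}
#endif
int main()
{
    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    fout.open(fileName);   // the std::wstring overload is a Microsoft extension
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}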
would be fine on either platform I think.
Answers
So to answer your questions:
1) If you are programming for Windows, then all the time. If cross platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform specific #ifdefs to work around the differences. If just using Linux, then never.
2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it for all Unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.
3) I believe so, but if not then it is just a simple typedef of a 'std::basic_string' using wchar_t instead of char.
4) A wide character is a character type which is bigger than the 1 byte standard char type. On Windows it is 2 bytes, on Linux it is 4 bytes.
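If you want to check what your own compiler does rather than take the 2-byte/4-byte figures on trust, a small sketch like this reports the width of wchar_t. The function names are made up for illustration; the only thing the static_assert checks is that wchar_t is no smaller than char.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Width of wchar_t in bits on the current compiler/platform.
constexpr std::size_t wideCharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

static_assert(sizeof(wchar_t) >= sizeof(char),
              "wchar_t is never smaller than char");

// Human-readable summary, e.g. for logging at start-up.
std::string describeWideChar() {
    return "wchar_t is " + std::to_string(wideCharBits()) + " bits";
}

inline void exampleWideCharSize() {
    // Typically 16 with MSVC and 32 with gcc on Linux, but do not rely on it.
    assert(wideCharBits() >= CHAR_BIT);
}
```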
Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.
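As a rough illustration of "process wide, store as UTF-8", here is a hand-written encoder from a fixed-width std::u32string (one element per code point) to UTF-8 bytes. The name toUtf8 is mine, and the sketch assumes every input value is a valid code point: it does not reject surrogates or values above U+10FFFF, which real code should do.

```cpp
#include <cassert>
#include <string>

// Encode a sequence of Unicode code points as UTF-8.
// Assumption: each element is a valid code point (no validation here).
std::string toUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

inline void exampleToUtf8() {
    assert(toUtf8(U"a") == "a");                  // ASCII stays one byte
    assert(toUtf8(U"\u00E9") == "\xC3\xA9");      // U+00E9 takes two bytes
    assert(toUtf8(U"\u4E2D") == "\xE4\xB8\xAD");  // U+4E2D takes three bytes
}
```

The trade-off described above shows up directly: the u32string spends 4 bytes on every character, while the UTF-8 output spends 1 to 4 depending on the character.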
The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.
Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding, is that the indices and lengths are measured in bytes, not characters.
The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for processing text in the implementation defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.
If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.
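The point about UTF-16 being variable-length can be shown with std::u16string, which has 16-bit elements on every platform (unlike wchar_t). This is a minimal sketch with hypothetical function names; it assumes well-formed UTF-16 and counts code points by skipping the second half of each surrogate pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string.
// Assumption: the input is well-formed (surrogates always come in pairs).
std::size_t countUtf16CodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        // Low surrogates (0xDC00..0xDFFF) are the second unit of a pair.
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}

inline void exampleUtf16Length() {
    // 'a' followed by U+1F600, which lies outside the Basic Multilingual Plane.
    const std::u16string s = u"a\U0001F600";
    assert(s.length() == 3);                 // three 16-bit code units
    assert(countUtf16CodePoints(s) == 2);    // but only two code points
}
```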
When you want to use Unicode strings and not just ASCII; helpful for internationalisation.
Yes, but it doesn't play well with 0.
Not aware of any that don't.
A wide character is the compiler-specific way of handling the fixed-length representation of a Unicode character; for MSVC it is a 2-byte character, for gcc I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html
If you want to keep your string code portable, you can use tstring and tchar. This is a widely used technique from long ago. In this sample I use a self-defined TCHAR, but you can find tchar.h implementations for Linux on the internet.
The idea is that wstring/wchar_t/UTF-16 is used on Windows and string/char/UTF-8 (or ASCII...) is used on Linux.
In the sample below, searching a mixed English/Japanese multibyte string works well on both Windows and Linux.
#include <locale.h>
#include <stdio.h>
#include <string>
using namespace std;
#ifdef _WIN32
#include <tchar.h>   // _TCHAR is wchar_t only when _UNICODE is defined
#else
// Minimal stand-ins for tchar.h on Linux: plain char, UTF-8 source assumed
#define _TCHAR char
#define _T(x) x
#define _tprintf printf
#endif
#define tstring basic_string<_TCHAR>
int main() {
    setlocale(LC_ALL, "");
    tstring s = _T("abcあいうえおxyz");
    auto pos = s.find(_T("え"));
    if (pos == tstring::npos)
        return 1;   // substring not found
    auto r = s.substr(pos);
    _tprintf(_T("r=%s\n"), r.c_str());
    return 0;
}
1) As mentioned by Greg, wstring is helpful for internationalization, i.e. when you will be releasing your product in languages other than English.
4) Check this out for wide character
http://en.wikipedia.org/wiki/Wide_character
When should you NOT use wide-characters?
When you're writing code before the year 1990.
Obviously, I'm being flip, but really, it's the 21st century now. The 128 characters of ASCII have long since ceased to be sufficient. Yes, you can use UTF-8, but why bother with the headaches?