C++: Comparing strings or wstrings with special characters in them (á, é, ő, etc.) - c++

I recently got an assignment that requires me to compare words. I don't want to describe it in full, but I have to compare the words character-by-character to see how similar two words are.
Now the problem is that the input text I have to use contains a lot of non-standard characters like á, é, ő, etc. I tried using string, wstring, char and wchar_t for representing my words, but nothing seems to work properly. An example:
setlocale(LC_ALL, "");
std::vector <Word::Word> words;
std::wfstream fileWrite("testout.txt");
std::wstring s = words[0].getString();
fileWrite << s;
Our string contains the word "Még" here. It outputs correctly. For the record, everything works the same if I use string instead of wstring. The following works too:
const wchar_t* wc = s.c_str();
fileWrite << wc;
But as soon as I try referencing a single char it gives me gibberish. Example:
fileWrite << wc[0] << " " << wc[1];
This outputs "ď »". I'm guessing the problem is that they use multiple bytes to store a char? I'm just wildly guessing here, but that would explain why
wcslen(wc);
returns 7.
I tried using the substr function with both string and wstring, but it generally doesn't seem to work. Does anyone have any idea how to tackle this problem? Am I missing something obvious here?
Also, I'm using Code::Blocks with a GCC compiler; I've read somewhere that it doesn't handle wchar_t and wstring well, could that be the problem? Remember, I've tried everything above with string instead of wstring, and it was the same.
Thank you all very much for the help, it would be greatly appreciated!

These characters are not unusual. They are absolutely standard Unicode characters. Unfortunately, plain standard C++ doesn't have any support for the finer details of Unicode. Your choice is to either find a nice library supporting it (for example for code running on MacOS X or iOS you would just use what's built into the OS, other operating systems might have similar support), or to go to www.unicode.org and download their code tables. And read everything you can find out about it.
wchar and wstring are inherently non-portable. Your best bet is to use UTF-8 encoding and standard std::string. And understanding UTF-8 is absolutely essential for any programmer nowadays.
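To make that concrete, here is a minimal sketch (my illustration, not the asker's assignment code) of splitting a UTF-8 encoded std::string into whole code points, so that two words can be compared character by character instead of byte by byte. It assumes the input is valid UTF-8:
#include <iostream>
#include <string>
#include <vector>
std::vector<std::string> splitUtf8(const std::string &s)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;                      // plain ASCII byte
        if      ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx -> 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx -> 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx -> 4-byte sequence
        out.push_back(s.substr(i, len));          // one whole code point
        i += len;
    }
    return out;
}
int main()
{
    std::vector<std::string> a = splitUtf8("M\xC3\xA9g"); // "Még" as UTF-8 bytes
    std::cout << a.size() << "\n";                        // prints 3, not 4
}
Comparing two words is then a matter of walking the two vectors element by element.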
There was some discussion here about Notepad. Lots of software writes UTF-8 preceded by a byte-order marker (BOM) and lots of software uses that to recognise UTF-8. If that byte order marker is not present, they look at individual bytes. There is a possibility that a file consists only of ASCII characters, in which case it doesn't matter what encoding it is. If it's not only ASCII, the likelihood that for example a Windows-1252 encoded file containing non-ASCII characters is legal UTF-8 is practically zero.
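A hedged sketch of the BOM sniffing described above; the file name is just an example, and only the UTF-8 BOM (0xEF 0xBB 0xBF) is checked:
#include <fstream>
#include <iostream>
int main()
{
    std::ifstream in("testout.txt", std::ios::binary);
    unsigned char bom[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(bom), 3);     // read the first three bytes
    bool hasUtf8Bom = in.gcount() == 3 &&
                      bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
    std::cout << (hasUtf8Bom ? "UTF-8 BOM present\n" : "no UTF-8 BOM\n");
}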

Related

how to convert utf8 to std::string?

I am working on code which receives a cpprest SDK response containing a base64-encoded payload which is JSON. Here is my code snippet:
typedef std::wstring string_t; //defined in basic_types.h in cpprest lib
void demo() {
    http_response response;
    //code to handle response ...
    json::value output = response.extract_json();
    string_t payload = output.at(L"payload").as_string();
    vector<unsigned char> base64_encoded_payload = conversions::from_base64(payload);
    std::string utf8_payload(base64_encoded_payload.begin(), base64_encoded_payload.end()); //in debugger I see the Japanese chars are garbled.
    string_t utf16_payload = utf8_to_utf16(utf8_payload); //in debugger I see the Japanese chars are good here
    //then I need to process the utf8_payload which is an xml.
    //I have an API available to process the xml which takes a string
    processXML(utf16_payload); //need to convert utf16_payload to a string here;
}
I also tried this and I see str contains garbled chars!
#include <codecvt> // for codecvt_utf8_utf16
#include <locale> // for wstring_convert
#include <string> // for string, wstring
void wstr2str(void) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
    std::wstring japanese = L"北島 美奈";
    std::string str = conversion.to_bytes(japanese); //str is garbled:(
}
My question is: can UTF-8 containing Japanese characters be converted to std::string without being garbled?
Update: I gained access to the processXML() code and changed the input argument type to std::wstring and it worked.
I figured that when the XML was getting created, it was converting the std::string to wstring; however, it was not turning out well.
void processXML(std::wstring xmlStrBuf) { //changed xmlStrBuf to wstring and it worked
    // more code
    CComBSTR xmlBuff = xmlStrBuf.c_str();
    VARIANT_BOOL bSuccess = false;
    xmlDoc->loadXML(xmlBuff, &bSuccess);
    //more code
}
Thanks for the answers; they were helpful in pointing out that the string is only storage.
You are confusing different concepts here.
Storage
This is how we save/store/hold our data. A std::string is a collection of chars, which are bytes. A std::wstring is a collection of wchar_ts, which are sometimes 2-byte wide values (but this is not guaranteed!).
Encoding
This is what the data means, and how it should be interpreted. A std::string, a collection of bytes, could hold UTF-8, or UTF-16, or UTF-32, or ASCII, or ShiftJIS, or morse code, or a JPEG, or a movie, or my DNA (lucky string!).
There are some strong conventions in play in the world. For example, on Windows, a std::wstring is generally accepted to hold UTF-16 (because the two-byte storage is convenient for this, and also because that's how the Windows API does it).
Newer versions of C++ give us things like std::u16string and std::u32string as well, which still do not directly have any notion of encoding, but are intended to be used for UTF-16 and UTF-32 respectively because their names make that intention more obvious to readers of code. C++20 will introduce std::u8string which is intended to signify a UTF-8 encoded string (and is otherwise more or less like a std::string).
But these are just conventions. Nothing about the type std::string says "UTF-8" or any other thing. It doesn't know about or care about or enforce any encoding. It just stores bytes.
So, your question about "converting UTF-8 to std::string" does not really make any sense; it's like asking how to convert a road into a car.
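To make the "it just stores bytes" point concrete, a tiny illustration: the two bytes below happen to be the UTF-8 encoding of 'é', but std::string neither knows nor cares.
#include <iostream>
#include <string>
int main()
{
    std::string s = "\xC3\xA9";     // two bytes
    std::cout << s.size() << "\n";  // prints 2: two chars, which a UTF-8 aware reader shows as one "é"
}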
"What should I do, then?"
Well, Base64 is also not an encoding. Well, actually, it totally is, but it's an encoding on top of the string encoding. It's a way of transmitting/escaping/sanitising the raw bytes, not a way of describing how to interpret them later. By asking cpprest to convert from Base64, that's just transforming the way the raw bytes are provided. That's why it gives you a std::vector<unsigned char> rather than a std::string: although (as discussed above) std::string doesn't care about encoding, we sometimes use a std::vector<unsigned char> to really, properly, completely say that "this collection does not have any particular encoding, so please don't try to guess from convention or whatever what the encoding is in this use case; all it knows is that it is a bunch of bytes". This is down to opinion. Some people will still use a std::string for that; the authors of cpprest decided not to.
The point is that the use of the function from_base64 cannot tell us anything about the encoding of the text that you've retrieved. For that, we have to go back to the documentation for the text. We have no access to that, and you did not tell us anything about it. If it were just a JSON string, the encoding would be down to the cpprest JSON library and so you'd already be done. However, it's not: it's something packed into a Base64 representation by whoever created the JSON object. Again, that information is not something that you shared with us.
But, based on the variable names you've chosen, the data you're looking at is already UTF-8. You've then attempted to convert it to UTF-16, which is rather the opposite of what you've described you wanted to do.
(Similarly, in your second example, you've taken a std::wstring that [probably] already stores UTF-16 thanks to the L"wide string literal", then told the computer that it's UTF-8 and to convert it "again" to UTF-16, then extracted the raw bytes into a std::string. None of that makes sense.)
Instead, why not literally just processXML(utf8_payload);?
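And if the XML API really does require UTF-16, as in your later update, the one sensible conversion is UTF-8 to std::wstring in a single step, which is essentially what the utf8_to_utf16 helper in your snippet is already doing. A standard-library sketch of the same idea (the helper name is mine; std::wstring_convert is deprecated since C++17 but still shipped by the major implementations):
#include <codecvt>
#include <locale>
#include <string>
std::wstring utf8_to_wide(const std::string &utf8)
{
    // wchar_t holds UTF-16 code units here (2 bytes on Windows; on Linux each
    // 4-byte wchar_t still carries one UTF-16 code unit with this facet)
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.from_bytes(utf8); // throws std::range_error on invalid UTF-8
}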
General advice
Encoding can be quite complex, although it's significantly easier to deal with once you've wrapped your mind around the basic concepts of all these layers of abstraction. For the future, and for this question if you wish to clarify it, you will need to ensure that you are absolutely clear, at each stage of the "pipeline" of your data as it gets transmitted from place A to place B, and gets converted from type C to type D, and whatever else, about what encoding it should be at each of those steps. If you want to change the encoding at one of those steps, then do so (though this should be rare!). But before you write any code make sure that you know for sure what it is that you need, otherwise you'll get yourself in a massive tangle.
Eventually you'll start to detect patterns that can help, though. For example, if you were expecting some delicious non-ASCII output and instead see strange text with lots of "Ã" characters in it, that's probably UTF-8 that's being interpreted as ASCII by mistake. That's because of the way that the multi-byte sequences denoting Unicode codepoints larger than one byte in UTF-8 often start with a byte whose numerical value is the same as that of the letter "Ã" in ASCII (well, ISO/IEC 8859-1, but close enough).
Similarly, if you get Japanese and didn't expect it, in my experience that's usually because you've given the computer some bytes and told it that they are a string in UTF-16 encoding, when actually they were UTF-8. You just get more experienced at recognising these patterns as you work more, and it can help you to fix your bugs faster.
Just last week the last example there saved me quite a bit of time: I knew immediately that my source data must have been UTF-8, and was therefore able to quickly decide to remove the byte-copy into a std::wstring that I'd been attempting. Examining the bytes in an encoding-agnostic way revealed the "Ã" pattern as well and then that was that. This was important because I had no documentation for the data source and thus no way to just look up what the encoding was supposed to be. I had to guess/deduce it. Hopefully that won't be the case for you here.
std::string is just a container for 8-bit wide chars, and does not know/care about the encoding. Always think in symbols (letters, numbers, punctuation, etc.). The first 128 characters (0-127) were defined by the ASCII standard, thus requiring a single char to store each symbol. With all the languages and symbols there are, we couldn't represent each of them with just 256 possibilities. The UTF-8 encoding introduces a way to deal with this problem by allowing a single symbol to span 1, 2, 3 or 4 chars. But, for the std::string object, this is entirely transparent and it's still dealing with a series of chars.
The reason why you're thinking the string is garbled is probably that your debugger assumes the contents of the std::string are always 1 symbol per char (extended ASCII for example), and as such, it's displaying the wrong characters.
Edit: you might want to read this post also.

c++ windows: Is there a way to convert from _UNICODE_STRING to std::string? [duplicate]

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:
When should I use std::wstring over std::string?
Can std::string hold the entire ASCII character set, including the special characters?
Is std::wstring supported by all popular C++ compilers?
What is exactly a "wide character"?
string? wstring?
std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
char vs. wchar_t
char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.
What about Unicode, then?
The problem is that neither char nor wchar_t is directly tied to unicode.
On Linux?
Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:
#include <cstring>
#include <cwchar>
#include <iostream>
int main()
{
    const char text[] = "olé";
    std::cout << "sizeof(char) : " << sizeof(char) << "\n";
    std::cout << "text : " << text << "\n";
    std::cout << "sizeof(text) : " << sizeof(text) << "\n";
    std::cout << "strlen(text) : " << strlen(text) << "\n";
    std::cout << "text(ordinals) :";
    for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
    {
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }
    std::cout << "\n\n";
    // - - -
    const wchar_t wtext[] = L"olé";
    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    //std::cout << "wtext : " << wtext << "\n"; <- error
    std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
    std::wcout << L"wtext : " << wtext << "\n";
    std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext) : " << wcslen(wtext) << "\n";
    std::cout << "wtext(ordinals) :";
    for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
    {
        unsigned short wc = static_cast<unsigned short>(wtext[i]);
        std::cout << " " << static_cast<unsigned int>(wc);
    }
    std::cout << "\n\n";
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char is really constructed from four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise)
So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have four characters, not three. So you should be cautious when truncating/playing with unicode chars, because some combinations of chars are forbidden in UTF-8.
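A sketch of that caution: when truncating a UTF-8 std::string, back up so the cut does not land in the middle of a multi-byte sequence (continuation bytes have the form 10xxxxxx).
#include <string>
std::string truncateUtf8(std::string s, std::size_t maxBytes)
{
    if (s.size() <= maxBytes)
        return s;
    std::size_t cut = maxBytes;
    // move left while the byte at 'cut' is a continuation byte
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut;
    s.resize(cut);
    return s;
}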
On Windows?
On Windows, this is a bit different. Win32 had to support a lot of applications working with char on the different charsets/codepages produced all over the world, before the advent of Unicode.
So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine, which was not UTF-8 for a long time. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on a cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, which is Unicode encoded on 2-byte code units (or at the very least UCS-2, which just lacks surrogate pairs and thus characters outside the BMP (>= 64K)).
Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.
Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low level API function to set the label on a Win32 GUI).
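A hedged, Windows-only sketch of the conversion APIs named above: turning a UTF-8 std::string into the wchar_t/UTF-16 string that wide Win32 functions such as SetWindowTextW() expect (the helper name is mine).
#include <windows.h>
#include <string>
std::wstring utf8ToWide(const std::string &utf8)
{
    if (utf8.empty())
        return std::wstring();
    // first call: ask for the required length; second call: do the conversion
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}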
Memory issues?
UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and a UTF-16 text will always use less than or the same amount of memory as a UTF-32 text (and usually less).
If there is a memory issue, then you should know that for most western languages, UTF-8 text will use less memory than the same UTF-16 text.
Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.
All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.
See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
Conclusion
When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§) : unless you use a toolkit/framework saying otherwise
Can std::string hold all the ASCII character set including special characters?
Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!
On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (After a comment from Johann Gerell):
a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:
ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
a char from 0 to 127 will be held correctly
a char from 128 to 255 will have a meaning depending on your encoding (unicode, non-unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
Is std::wstring supported by almost all popular C++ compilers?
Mostly, with the exception of GCC based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
What is exactly a wide character?
In C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.
My view is summarized in http://utf8everywhere.org of which I am a co-author.
Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.
And now, answering your questions:
A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how you treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
No.
Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.
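A small illustration of the "one or two pairs" point (assuming a C++11 compiler): U+00E9 fits in one UTF-16 code unit, while U+1F600, which lies outside the BMP, needs a surrogate pair.
#include <iostream>
#include <string>
int main()
{
    std::u16string one = u"\u00E9";     // é: one code unit
    std::u16string two = u"\U0001F600"; // an emoji: two code units (a surrogate pair)
    std::cout << one.size() << " " << two.size() << "\n"; // prints "1 2"
}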
So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.
My solution, after in-depth investigation, much frustration and the consequent experience, is the following:
accept that you have to be responsible for the encoding and conversion stuff on your own (and you will see that much of it is rather trivial)
use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)
accept that such a UTF8String object is just a dumb but cheap container. Never access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those).
use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String) - this is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).
use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.
add two utility functions to convert back & forth between UTF-8 and UCS-2:
UCS2String ConvertToUCS2( const UTF8String &str );
UTF8String ConvertToUTF8( const UCS2String &str );
The conversions are straightforward; google should help here, and a sketch is given below.
That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.
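A hedged sketch of the two conversion helpers declared above, using std::wstring_convert (deprecated since C++17 but still shipped by the major implementations; strictly speaking it produces UTF-16 code units, a superset of UCS-2):
#include <codecvt>
#include <locale>
#include <string>
typedef std::string  UTF8String;
typedef std::wstring UCS2String;
UCS2String ConvertToUCS2(const UTF8String &str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.from_bytes(str);  // UTF-8 bytes -> wide string
}
UTF8String ConvertToUTF8(const UCS2String &str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.to_bytes(str);    // wide string -> UTF-8 bytes
}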
Alternatives & Improvements
conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with the help of plain translation tables, e.g. const wchar_t tt_iso88591[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2.
if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)
ICU or other unicode libraries?
For advanced stuff.
When you want to have wide characters stored in your string. Wide depends on the implementation: Visual C++ defaults to 16 bits if I remember correctly, while GCC's default depends on the target (it's 32 bits long here). Please note wchar_t (wide character type) has nothing to do with unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and that it is at least as long as char. You can store unicode strings fine into std::string using the utf-8 encoding too. But it won't understand the meaning of unicode code points. So str.size() won't give you the number of logical characters in your string, but merely the number of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.
If your wchar_t is 32 bits long, then you can use utf-32 as a unicode encoding, and you can store and handle unicode strings using a fixed-length encoding (utf-32 is fixed length). This means your wstring's s.size() function will then return the right number of wchar_t elements, which equals the number of logical characters.
Yes, char is always at least 8 bit long, which means it can store all ASCII values.
Yes, all major compilers support it.
I frequently use std::string to hold utf-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs which use utf-8 as the native string type as well.
For example, I use utf-8 when interfacing my code with the Tcl interpreter.
The major caveat is that the length of the std::string is no longer the number of characters in the string.
A good question!
I think of data encoding (sometimes with a charset also involved) as a mechanism for expressing data in memory, in order to save it to a file or transfer it over a network, so I answer this question as follows:
1. When should I use std::wstring over std::string?
If the programming platform or API function is a single-byte one and we want to process or parse some Unicode data, e.g. read from a Windows .REG file or a network 2-byte stream, we should declare a std::wstring variable to process it easily. E.g.: wstring ws = L"中国a" (6 octets of memory: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get the character '中', ws[1] to get the character '国' and ws[2] to get the character 'a', etc.
2. Can std::string hold the entire ASCII character set, including the special characters?
Yes. But note: ASCII proper only covers 0x00~0x7F; in the extended single-byte charsets, each 0x00~0xFF octet stands for one character, including printable text such as "123abc&*_&" and the special ones you mention, which are mostly printed as '.' to avoid confusing editors or terminals. Some other countries extend their own "ASCII" charsets, e.g. Chinese, using 2 octets to stand for one character.
3. Is std::wstring supported by all popular C++ compilers?
Maybe, or mostly. I have used VC++ 6 and GCC 3.3: yes.
4. What is exactly a "wide character"?
A wide character mostly means using 2 or 4 octets to hold characters from all countries. 2-octet UCS2 is a representative sample; e.g. for the English 'a', its memory is the 2 octets 0x0061 (vs. in ASCII, where 'a' is the 1 octet 0x61).
When you want to store 'wide' (Unicode) characters.
Yes: 255 of them (excluding 0).
Yes.
Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using << and passing a filename to std::fstream.
I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter and I'm happy to change it if I have things wrong.
String literals
If you enter string literals that contain only characters that can be represented by your codepage then VS stores them in your file with 1 byte per character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.
If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all non-ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text and any characters missing from the code page are replaced with ?.
The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor or you need to convert it to utf-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well have not used a wide string literal.
std::cout
When outputting to the console using << you can only use std::string, not std::wstring and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).
std::fstream filenames
Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.
Your options
If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.
If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists may say use UTF8 then convert when needed, but why bother with the hassle.
If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.
Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code
#include <fstream>
#include <string>
#include <iostream>
#ifdef _WIN32
#include <Windows.h> //only needed (and only available) on Windows
typedef std::wstring unicodestring;
#define UNI(text) L ## text
std::string formatForConsole(const unicodestring &str)
{
    std::string result;
    //Call WideCharToMultiByte to do the conversion
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text
std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif
int main()
{
    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}
would be fine on either platform I think.
Answers
So, to answer your questions:
1) If you are programming for Windows, then all the time; if cross-platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform-specific #ifdefs to work around the differences; if just using Linux, then never.
2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it for all Unicode if you choose to manually encode using UTF-8. But the Windows API and standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.
3) I believe so, but if not then it is just a simple typedef of a 'std::basic_string' using wchar_t instead of char.
4) A wide character is a character type which is bigger than the 1-byte standard char type. On Windows it is 2 bytes, on Linux it is 4 bytes.
Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.
The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.
Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding, is that the indices and lengths are measured in bytes, not characters.
The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for processing text in the implementation defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.
If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.
When you want to use Unicode strings and not just ASCII; helpful for internationalisation.
Yes, but it doesn't play well with 0.
Not aware of any that don't.
A wide character is the compiler-specific way of handling a fixed-length representation of a unicode character; for MSVC it is a 2-byte character, for gcc I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html
If you want to keep your strings portable, you can use tstring and tchar. It is a widely used technique from long ago. In this sample I use a self-defined _TCHAR, but you can find a tchar.h implementation for Linux on the internet.
This idea means that wstring/wchar_t/UTF-16 is used on Windows and string/char/UTF-8 (or ASCII...) is used on Linux.
In the sample below, searching an English/Japanese multibyte mixed string works well on both Windows and Linux platforms.
#include <locale.h>
#include <stdio.h>
#include <algorithm>
#include <string>
using namespace std;
#ifdef _WIN32
#include <tchar.h>
#else
#define _TCHAR char
#define _T
#define _tprintf printf
#endif
#define tstring basic_string<_TCHAR>
int main() {
    setlocale(LC_ALL, "");
    tstring s = _T("abcあいうえおxyz");
    auto pos = s.find(_T("え"));
    auto r = s.substr(pos);
    _tprintf(_T("r=%s\n"), r.c_str());
}
1) As mentioned by Greg, wstring is helpful for internationalization; that's when you will be releasing your product in languages other than English.
4) Check this out for wide character
http://en.wikipedia.org/wiki/Wide_character
When should you NOT use wide-characters?
When you're writing code before the year 1990.
Obviously, I'm being flip, but really, it's the 21st century now. 127 characters have long since ceased to be sufficient. Yes, you can use UTF8, but why bother with the headaches?

How to search a non-ASCII character in a c++ string?

string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
    if(*i=='→'){
        ...
    }
}
Comparing chars like this is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode. Otherwise, you should read up, I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
either save your C++ file as UTF-16 (which Windows confusingly calls Unicode), and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. Any other encoding and your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008 .
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a char representing the arrow character. You can, however, with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
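A complete hedged example of that byte-offset point: find the arrow and report where it starts, remembering that the position is measured in bytes, not characters.
#include <iostream>
#include <string>
int main()
{
    std::string s = "x1\xe2\x86\x92(y1\xe2\x8a\x95y2)"; // "x1→(y1⊕y2)" as UTF-8 bytes
    std::string arrow = "\xe2\x86\x92";
    std::size_t pos = s.find(arrow);
    if (pos != std::string::npos)
        std::cout << "arrow starts at byte offset " << pos
                  << " and is " << arrow.size() << " bytes long\n";
}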
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of different encodings that Unicode may use (UTF-8, UTF-16, UCS2, etc). But your problems here will just begin.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for a character 'é', you find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain U+0065 U+0301 combination. Which is actually exactly the same character.
Yes, not just "character that looks the same", but it is exactly the same, so any software and even some programming libraries will freely convert from one to another without even telling you.
So if you wish to make a search that is robust, you will need something that handles not just the different encodings of Unicode, but Unicode characters themselves, with equality between composed and ready-made (precomposed) characters.
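A hedged sketch of what that can look like with the ICU library (an assumption on my part; it is not part of the standard library, so check ICU's documentation for the exact headers and link flags): normalize both strings to NFC before comparing, so that U+00E9 and the U+0065 U+0301 combination compare equal.
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
bool equalNormalized(const icu::UnicodeString &a, const icu::UnicodeString &b)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status))
        return false; // could not obtain the normalizer
    return nfc->normalize(a, status) == nfc->normalize(b, status) && U_SUCCESS(status);
}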

How to compare/replace non-ASCII chars in array in C++?

I have a large char array, which contains Czech diacritical characters (e.g. "á"), coded in UTF-8. I need to replace them with their ASCII equivalents (e.g. "a"), because the program must work on Windows (the Linux console accepts these chars perfectly).
I am reading the array char by char and writing the content into a string.
Here is the code I am using; this doesn't work:
int array_size = 50000; //size of file array
char * array = new char[array_size]; //array to store file contents
string ascicontent="";
if ('\u00E1'==array[zacatek]) { //check if char is "á"
    ascicontent +='a'; //write ordinal "a" into string
}
I even tried replacing '\u00E1' with 'á', but it also doesn't work. I'm guessing the problem is that these chars are longer than ASCII.
How can I declare the non-ascii char, so it could be compared?
Each char is a single byte, however UTF-8 can use multiple bytes to encode a single character. In particular, U+00E1 is encoded as two bytes: 0xC3 0xA1. So you can't do what you want by just comparing a single char.
There are multiple ways that you might be able to tackle your problem:
A) First, try googling for "windows console utf-8" and see if that gives anything which might make things just work without having to alter the characters at all. (I don't know if anything can work for you, I've never tried this.)
B) Convert the data to wide characters (wchar_t) using MultiByteToWideChar or mbstowcs and then google how to use wcout or such to output UTF-16 to the console.
C) Use MultiByteToWideChar to convert the data from UTF-8 to UTF-16. Then use WideCharToMultiByte to convert from UTF-16 to the console's code page, relying on the fact that it can automatically "best fit" common characters (such as "á" to "a").
D) If you really only care about a limited set of characters (such as only the accented characters in the Czech code page), then you could possibly write your own lookup table of UTF-8 byte sequences and your desired replacements; a short sketch is given below. You just need to do comparisons on the UTF-8 data over those multiple bytes rather than on individual chars. Among various tools out there, I've found this page helpful for seeing how characters are encoded in various ways.
Which of these make the most sense for your program depends on various factors, such as how easy or hard it might be to keep the Windows-specific pieces from conflicting with the Linux-specific or cross-platform parts.
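A sketch of option D: a small table mapping the UTF-8 byte sequences of a few Czech accented letters to ASCII replacements, applied while scanning the array. The table is deliberately tiny; extend it as needed.
#include <cstring>
#include <string>
struct Replacement { const char *utf8; char ascii; };
static const Replacement table[] = {
    { "\xC3\xA1", 'a' }, // á
    { "\xC3\xA9", 'e' }, // é
    { "\xC5\xA1", 's' }, // š
};
std::string toAscii(const char *data, std::size_t size)
{
    std::string out;
    for (std::size_t i = 0; i < size; )
    {
        bool replaced = false;
        for (const Replacement &r : table)
        {
            std::size_t len = std::strlen(r.utf8);
            if (i + len <= size && std::memcmp(data + i, r.utf8, len) == 0)
            {
                out += r.ascii;   // emit the ASCII replacement
                i += len;         // skip the whole multi-byte sequence
                replaced = true;
                break;
            }
        }
        if (!replaced)
            out += data[i++];     // ordinary byte, copy as-is
    }
    return out;
}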
char in C is not unicode, it is really a byte; it only gets converted to a glyph by the terminal console you happen to use. On some Linux implementations (like Debian) it defaults to UTF-8, so if your program outputs a sequence of bytes encoded in UTF-8, your terminal will display the proper glyph. If you know that array is UTF-8 encoded, you must check for the proper byte sequence.
Edit: take a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Please take a look at this link http://en.wikipedia.org/wiki/Wide_character.
And I believe this code might help you:
std::wstring str(L"cccccááddddddd");
std::replace( str.begin(), str.end(), L'á', L'a');

Correctly reading a utf-16 text file into a string without external libraries?

I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:
I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF16, it has a BOM and everything, and it can stay that way.
I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy for a solution that used the stl while relying on assumptions about Windows architecture, or even solutions that involved win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?
edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.
edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them '：' (FULLWIDTH COLON, U+FF1A) and '）' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters.
edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.
The C++11 solution (supported, on your platform, by Visual Studio since 2010, as far as I know), would be:
#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>
int main()
{
    // open as a byte stream
    std::wifstream fin("text.txt", std::ios::binary);
    // apply BOM-sensitive UTF-16 facet
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // read
    for(wchar_t c; fin.get(c); )
        std::cout << std::showbase << std::hex << c << '\n';
}
When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. There are some UTF-16 characters that will have one of those bytes as half of the character code and will mess up the reading of the file. This is not a bug, it is intentional behavior and is the sole reason for having separate text and binary modes.
For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
Edit:
So it appears that the issue was that the Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by using binary mode to read the file, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.
The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.
You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.
Here's the unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5)
codecvt<char32_t,char,mbstate_t>
codecvt<char16_t,char,mbstate_t>
codecvt_utf8
codecvt_utf16
codecvt_utf8_utf16
c32rtomb/mbrtoc32
c16rtomb/mbrtoc16
And what each one does
A codecvt facet that always converts between UTF-8 and UTF-32
converts between UTF-8 and UTF-16
converts between UTF-8 and UCS-2 or UCS-4 depending on the size of target element (characters outside BMP are probably truncated)
converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
converts between UTF-8 and UTF-16
If the macro __STDC_UTF_32__ is defined these functions convert between the current locale's char encoding and UTF-32
If the macro __STDC_UTF_16__ is defined these functions convert between the current locale's char encoding and UTF-16
If __STDC_ISO_10646__ is defined then converting directly using codecvt_utf16<wchar_t> should be fine since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).
Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.
So it's probably not worth being portable, and instead you can just read the data into a wchar_t array, or use some other Windows specific facility, such as the _O_U16TEXT mode on files.
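Before the portable-ish fallback below, here is a hedged, MSVC-only sketch of that kind of facility; rather than _O_U16TEXT itself it uses the related "ccs=" flag of the CRT's fopen_s, which (as I understand the MSVC documentation) decodes the UTF-16 file, consumes the BOM, and hands you wchar_t data:
#include <cstdio>
#include <cwchar>
int main()
{
    FILE *fp = NULL;
    if (fopen_s(&fp, "text.txt", "rt, ccs=UTF-16LE") != 0 || !fp)
        return 1;
    wchar_t line[256];
    while (fgetws(line, 256, fp))
    {
        // 'line' now holds decoded wchar_t (UTF-16) text, one line at a time
    }
    fclose(fp);
    return 0;
}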
This should build and run anywhere, but makes a bunch of assumptions to actually work:
#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <cstring> // for std::memcpy
int main ()
{
    std::stringstream ss;
    std::ifstream fin("filename", std::ios::binary); // binary mode, as discussed above
    ss << fin.rdbuf(); // dump file contents into a stringstream
    std::string const &s = ss.str();
    if (s.size()%sizeof(wchar_t) != 0)
    {
        std::cerr << "file not the right size\n"; // must be even, two bytes per code unit
        return 1;
    }
    std::wstring ws;
    ws.resize(s.size()/sizeof(wchar_t));
    std::memcpy(&ws[0],s.c_str(),s.size()); // copy data into wstring
}
You should probably at least add code to handle endianness and the BOM. Also, Windows newlines don't get converted automatically, so you need to do that manually.
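A minimal sketch of that BOM/endianness handling, run over a mutable copy of the file bytes before they are copied into the wstring (still assuming the 2-byte-wchar_t Windows convention):
#include <algorithm>
#include <string>
void fixBomAndEndianness(std::string &bytes)
{
    if (bytes.size() < 2)
        return;
    unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFF && b1 == 0xFE)       // little-endian BOM: just drop it
    {
        bytes.erase(0, 2);
    }
    else if (b0 == 0xFE && b1 == 0xFF)  // big-endian BOM: drop it and swap every byte pair
    {
        bytes.erase(0, 2);
        for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
            std::swap(bytes[i], bytes[i + 1]);
    }
}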