Why use MultiByteToWideCharArray to convert std::string to std::wstring? - c++

I want to convert a std::string to std::wstring. There are two approaches which i have come across.
Given a string str, we can convert it into a wide string using the following code:
wstring widestring = std::wstring(str.begin(),str.end());
The other approach is to use MultiByteToWideCharArray().
What I wanted to understand is: what is the drawback of using the first approach, and how does the second approach solve that problem?

MultiByteToWideChar offers more options (like the ability to select code pages) and translates non-standard symbols correctly.

The first option doesn't support multibyte encoding. It will iterate through each byte (char) in the string and convert it to a wide character. When you have a string with multibyte encoding, individual characters can take more than one byte, so a standard string iterator is inappropriate.
The MultiByteToWideChar function has support for different multibyte formats, as specified by the codepage parameter.
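For illustration, here is a minimal sketch of the usual two-call pattern with MultiByteToWideChar, assuming the input std::string is UTF-8 (pass CP_ACP instead if it is in the system ANSI code page):
#include <windows.h>
#include <string>

// Sketch: convert a std::string (assumed UTF-8 here) to a std::wstring.
std::wstring to_wide(const std::string& in)
{
    if (in.empty()) return std::wstring();
    // First call asks for the required size in wide characters.
    int len = MultiByteToWideChar(CP_UTF8, 0, in.data(), (int)in.size(), nullptr, 0);
    std::wstring out(len, L'\0');
    // Second call performs the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, in.data(), (int)in.size(), &out[0], len);
    return out;
}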

Related

std::string conversion to char32_t (unicode characters)

I need to read a file using fstream in C++ that contains ASCII as well as Unicode characters, using the getline function.
But getline only works with std::string, and the characters of a plain string cannot be converted to char32_t so that I can compare them with Unicode characters. Could anyone please suggest a fix?
char32_t corresponds to UTF-32 encoding, which is almost never used (and often poorly supported). Are you sure that your file is encoded in UTF-32?
If you are sure, then you need to use std::u32string to store your string. For reading, you can use std::basic_stringstream<char32_t> for instance. However, please note that these types are generally poorly supported.
Unicode is generally encoded using:
UTF-8 in text files (and web pages, etc...)
A platform-specific 16-bit or 32-bit encoding in programs, using type wchar_t
So generally, universally encoded files are in UTF-8. They use a variable number of bytes for encoding characters, from 1 (ASCII characters) to 4. This means you cannot directly test the individual chars using a std::string.
For this, you need to convert the UTF-8 string to wchar_t string, stored in a std::wstring.
For this, use a converter defined like this:
std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
And convert like that:
std::wstring unicodeString = converter.from_bytes(utf8String);
You can then access the individual Unicode characters. Don't forget to put an L before each string literal, to make it a wide string literal. For instance:
if (unicodeString[i] == L'仮')
{
    info("this is some japanese character");
}
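Putting it together, here is a rough sketch of reading a UTF-8 file line by line with getline and converting each line as above (the file name is just a placeholder; std::wstring_convert and std::codecvt_utf8 are deprecated since C++17 but still widely available):
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::ifstream file("input.txt");                    // placeholder file name
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    std::string line;
    while (std::getline(file, line))                    // getline reads raw bytes
    {
        std::wstring unicodeString = converter.from_bytes(line);  // decode UTF-8
        for (wchar_t c : unicodeString)
        {
            if (c == L'仮')
            {
                // found the Japanese character
            }
        }
    }
}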

How to convert from std::string to SQLWCHAR*

I'm trying to convert std::string to SQLWCHAR*, but I couldn't find how.
Any brilliant suggestion, please?
One solution would be to simply use a std::wstring in the first place, rather than std::string. With the Unicode character set you can define a wide string literal using the following syntax:
std::wstring wstr = L"hello world";
However, if you would like to stick with std::string then you will need to convert the string to a different encoding. This depends on how your std::string is encoded. The default encoding on Windows is ANSI (although the most common encoding when reading files or downloading text from websites is UTF-8).
This answer shows a function for converting a std::string to a std::wstring on Windows (using the MultiByteToWideChar function).
https://stackoverflow.com/a/27296/966782
Note that the answer uses CP_ACP as the input encoding (i.e. ANSI). If your input string is UTF-8 then you can change it to use CP_UTF8 instead.
Once you have a std::wstring you should be able to easily retrieve a SQLWCHAR* using:
std::wstring::c_str()
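As a rough sketch (assuming the std::string holds ANSI text and SQLWCHAR is a 16-bit type, as it normally is with the Windows ODBC headers), the conversion could look like this; swap CP_ACP for CP_UTF8 if your source string is UTF-8:
#include <windows.h>
#include <sqltypes.h>
#include <string>

// Sketch: ANSI std::string -> std::wstring -> SQLWCHAR*.
std::wstring ansi_to_wide(const std::string& in)
{
    int len = MultiByteToWideChar(CP_ACP, 0, in.data(), (int)in.size(), nullptr, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, in.data(), (int)in.size(), &out[0], len);
    return out;
}

void use(const std::string& s)
{
    std::wstring ws = ansi_to_wide(s);
    // SQLWCHAR is usually an unsigned 16-bit type, so a cast may be needed
    // depending on your headers; many ODBC calls also want a non-const pointer.
    SQLWCHAR* p = (SQLWCHAR*)ws.c_str();
    (void)p;  // pass p to the ODBC call
}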

streams with default utf8 handling

I have read that in some environments std::string internally uses UTF-8. Whereas, on my platform, Windows, std::string is ASCII only. This behavior can be changed by using std::locale. My version of STL doesn't have, or at least I can't find, a UTF-8 facet for use with strings. I do however have a facet for use with the fstream set of classes.
Edit:
When I say "use UTF-8 internally", I'm referring to methods like std::basic_filebuf::open(), which in some environments accept UTF-8 encoded strings. I know this isn't really an std::string issue but rather some OS's use UTF-8 natively. My question should be read as "how does your implementation handle code conversion of invalid sequences?".
How do these streams handle invalid code sequences on other platforms/implementations?
In my UTF-8 facet for files, it simply returns an error, which in turn prevents any more of the stream from being read. I would have thought substituting the Unicode replacement character (U+FFFD) would be a better option.
My question isn't limited to UTF-8, how about invalid UTF-16 surrogate pairs?
Let's have an example. Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Or take a std::wstring containing a lone surrogate and print it to std::cout.
I have read that in some environments std::string internally uses UTF-8.
A C++ program can choose to use std::string to hold a UTF-8 string on any standard-compliant platform.
Whereas, on my platform, Windows, std::string is ASCII only.
That is not correct. On Windows you can use a std::string to hold a UTF-8 string if you want; std::string is not limited to holding ASCII on any standard-compliant platform.
This behavior can be changed by using std::locale.
No, the behaviour of std::string is not affected by the locale library.
A std::string is a sequence of chars. On most platforms, including Windows, a char is 8-bits. So you can use std::string to hold ASCII, Latin1, UTF-8 or any character encoding that uses an 8-bit or less code unit. std::string::length returns the number of code units so held, and the std::string::operator[] will return the ith code unit.
For holding UTF-16 you can use char16_t and std::u16string.
For holding UTF-32 you can use char32_t and std::u32string.
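A small illustration of the point above (just a sketch): a std::string can carry UTF-8 bytes as-is, and length() reports code units, not characters:
#include <cassert>
#include <string>

int main()
{
    std::string utf8 = "\xC3\xA9";     // "é" encoded as two UTF-8 bytes
    assert(utf8.length() == 2);        // length() counts code units (bytes)

    std::u16string utf16 = u"\u00E9";  // one UTF-16 code unit
    assert(utf16.length() == 1);

    std::u32string utf32 = U"\u00E9";  // one UTF-32 code unit
    assert(utf32.length() == 1);
}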
Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Typically no one bothers with converting to wchar_t or other wide char types on other platforms, but the standard facets that can be used for this all signal a read error that causes the stream to stop working until the error is cleared.
std::string should be encoding agnostic: http://en.cppreference.com/w/cpp/string/basic_string - so it should not validate codepoints/data - you should be able to store any binary data in it.
The only places where encoding really makes a difference is in calculating string length and iterating over string character by character - and locale should have no effect in either of these cases.
Also, use of std::locale is probably not a good idea if it can be avoided at all: it's not thread-safe on all platforms or in all implementations of the standard library, so care must be taken when using it. Its effect is also very limited, and probably not at all what you expect it to be.

How can I convert a wchar_t* to char* without losing data?

I'm using a Japanese string as a wchar_t, and I need to convert it to a char*. Is there any method or function to convert wchar_t* to char* without losing data?
It is not enough to say "I have a string as wchar_t". You must also know what encoding the characters of the string are in. This is probably UTF-16, but you need to know definitely.
It is also not enough to say "I want to convert to char". Again, you must make a decision on what encoding the characters will be represented in. JIS? Shift-JIS? EUC? UTF-8? Another encoding?
If you know the answers to the two questions above, you can do the conversion without any problem using WideCharToMultiByte.
What you have to do first is choose the target encoding, such as UTF-8 or UTF-16. Then encode your wchar_t[] strings in the encoding you chose, via libiconv or a similar string encoding library.
You need to call WideCharToMultiByte and pass in the code page identifier for the Japanese multibyte encoding you want; see MSDN for that function. On Windows, the local multibyte set is CP932, the MS variation on Shift-JIS. However, you might conceivably want UTF-8 to send to someone who wants it.
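As a rough sketch (assuming the wchar_t string is UTF-16, as it normally is on Windows), converting to UTF-8 looks like this; replace CP_UTF8 with 932 to get the Windows Shift-JIS variant instead:
#include <windows.h>
#include <string>

// Sketch: null-terminated UTF-16 wchar_t string -> UTF-8 std::string.
std::string to_utf8(const wchar_t* in)
{
    // First call (cchWideChar == -1) returns the needed size including the NUL.
    int len = WideCharToMultiByte(CP_UTF8, 0, in, -1, nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, in, -1, &out[0], len, nullptr, nullptr);
    out.resize(len - 1);  // drop the terminating NUL counted by the first call
    return out;
}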

How to write a std::string to a UTF-8 text file

I just want to write some few simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest and simple way to do so?
The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.
And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.
If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).
You may want to start your file with a byte order mark so that other programs will know it is UTF-8.
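For example, a minimal sketch (assuming your source file is itself saved as UTF-8, so the literal below already holds UTF-8 bytes; the output file name is a placeholder):
#include <fstream>
#include <string>

int main()
{
    std::string utf8 = "Grüße, 世界\n";                  // UTF-8 encoded bytes
    std::ofstream out("hello.txt", std::ios::binary);    // placeholder file name
    out << "\xEF\xBB\xBF";                               // optional UTF-8 byte order mark
    out << utf8;                                         // bytes pass through unchanged
}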
There is a nice tiny library for working with UTF-8 from C++: utfcpp
libiconv is a great library for all our encoding and decoding needs.
If you are using Windows, you can use WideCharToMultiByte and specify that you want UTF-8.
What is the easiest and simple way to do so?
The most intuitive and thus easiest way to handle UTF-8 in C++ is surely to use a drop-in replacement for std::string.
As the internet still lacks one, I went and implemented the functionality on my own:
tinyutf8 (EDIT: now on GitHub).
This library provides a very lightweight drop-in replacement for std::string (or std::u32string if you will, because you iterate over code points rather than chars). It is implemented as a middle ground between fast access and small memory consumption, while being very robust. This robustness to 'invalid' UTF-8 sequences makes it (nearly completely) compatible with ANSI (0-255).
Hope this helps!
If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.
// Round-trip through Qt: wide string -> QString -> UTF-8 bytes -> std::string
std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);
QByteArray byteArray(qstr.toUtf8());                              // UTF-8 encoded bytes
std::string str_std(byteArray.constData(), byteArray.length());
My preference is to convert to and from a std::u32string and work with code points internally, then convert to UTF-8 when writing out to a file, using these converting iterators I put on GitHub.
#include <utf/utf.h>

int main()
{
    using namespace utf;
    u32string u32_text = U"ɦΈ˪˪ʘ";

    // do stuff with string

    // convert to utf8 string
    utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
    utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());
    u8string u8_text(pos, end);

    // write out utf8 to file.
    // ...
}
Use Glib::ustring from glibmm.
It is the only widespread UTF-8 string container (AFAIK). While glyph (not byte) based, it has the same method signatures as std::string so the port should be simple search and replace (just make sure that your data is valid UTF-8 before loading it into a ustring).
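A tiny illustration of the glyph-based behaviour, assuming glibmm is available (length() counts characters, while bytes() counts the underlying UTF-8 bytes):
#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    Glib::ustring s = "Grüße";            // valid UTF-8 input
    std::cout << s.length() << '\n';      // 5 (characters/glyphs)
    std::cout << s.bytes()  << '\n';      // 7 (UTF-8 bytes)
}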
UTF-8 is a multibyte character encoding, so you will run into problems working with it; in my opinion it's a bad idea. Use ordinary fixed-width Unicode instead.
So in my opinion the best approach is to use ordinary ASCII char text with some code page. You only need Unicode if you use more than two different sets of symbols (languages) in a single text.
That is a rather rare case; in most cases two sets of symbols are enough. For that common case use ASCII chars, not Unicode.
The only place multibyte chars like UTF-8 really buy you anything is traditional Chinese, Arabic or other hieroglyphic text. That is a very, very rare case!
I don't think many people need that. So never use UTF-8! It avoids the severe headache of manipulating such strings.