How can I use wstring(s) in Linux APIs?

How can I use wstring(s) in Linux APIs? - c++

I want to develope an application in Linux. I want to use wstring beacuse my application should supports unicode and I don't want to use UTF-8 strings.
In Windows OS, using wstring is easy. beacuse any ANSI API has a unicode form. for example there are two CreateProcess API, first API is CreateProcessA and second API is CreateProcessW.
wstring app = L"C:\\test.exe";
CreateProcess
(
app.c_str(), // EASY!
....
);
But it seems working with wstring in Linux is complicated! for example there is an API in Linux called parport_open (It just an example).
and I don't know how to send my wstring to this API (or APIs like parport_open that accept a string parameter).
wstring name = L"myname";
parport_open
(
0, // or a valid number. It is not important in this question.
name.c_str(), // Error: because type of this parameter is char* not wchat_t*
....
);
My question is how can I use wstring(s) in Linux APIs?
Note: I don't want to use UTF-8 strings.
Thanks

Linux APIs (on recent kernels and with correct locale setting) on almost every distribution use UTF-8 strings by default1. You too should use them inside your code. Resistance is futile.
The wchar_t (and thus wstring) on Windows were convenient only when Unicode was limited to 65536 characters (i.e. wchar_t were used for UCS-2), now that the 16-bit Windows wchar_t are used for UTF-16 the advantage of 1 wchar_t=1 Unicode character is long gone, so you have the same disadvantages of using UTF-8. Nowadays IMHO the Linux approach is the most correct. (Another answer of mine on UTF-16 and why Windows and Java use it)
By the way, both string and wstring aren't encoding-aware, so you can't reliably use any of these two to manipulate Unicode code points. I heard that wxString from the wxWidgets toolkit handles UTF-8 nicely, but I never did extensive research about it.
actually, as pointed out below, the kernel aims to be encoding-agnostic, i.e. it treats the strings as opaque sequences of (NUL-terminated?) bytes (and that's why encodings that use "larger" character types like UTF-16 cannot be used). On the other hand, wherever actual string manipulation is done, the current locale setting is used, and by default on almost any modern Linux distribution it is set to UTF-8 (which is a reasonable default to me).

I don't want to use UTF-8 strings.
Well, you will need to overcome that reluctance, at least when calling the APIs. Linux uses single byte string encodings, invariably UTF-8. Clearly you should use a single byte string type since you obviously can't pass wide characters to a function that expects char*. Use string rather than wstring.

Related

Storing math symbols into string c++

Is there a way to store math symbols into strings in c++ ?
I notably need the union/intersection symbols.
Thanks in advance!

This seemingly simple question is actual a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system is could be ASCII, some random code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostreams> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making your you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.

Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters but don't expect this code to be portable. You could also use Unicode, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";

In C++ when to use WCHAR and when to use CHAR

I have a question:
Some libraries use WCHAR as the text parameter and others use CHAR (as UTF-8): I need to know when to use WCHAR or CHAR when I write my own library.

Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:
http://utf8everywhere.org/
It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive it from any library, and converting back when you need to pass strings to it. So to answer your question, always use char except at the point that an API requires you to pass or receive wchar_t.

WCHAR (or wchar_t on Visual C++ compiler) is used for Unicode UTF-16 strings.
This is the "native" string encoding used by Win32 APIs.
CHAR (or char) can be used for several other string formats: ANSI, MBCS, UTF-8.
Since UTF-16 is the native encoding of Win32 APIs, you may want to use WCHAR (and better a proper string class based on it, like std::wstring) at the Win32 API boundary, inside your app.
And you can use UTF-8 (so, CHAR/char and std::string) to exchange your Unicode text outside your application boundary. For example: UTF-8 is widely used on the Internet, and when you exchange UTF-8 text between different platforms you don't have the problem of endianness (instead with UTF-16 you have to consider both the UTF-16BE big-endian and the UTF-16LE little-endian cases).
You can convert between UTF-16 and UTF-8 using the WideCharToMultiByte() and MultiByteToWideChar() Win32 APIs. These are pure-C APIs, and these can be conveniently wrapped in C++ code, using string classes instead of raw character pointers, and exceptions instead of raw error codes. You can find an example of that here.

The right question is not which type to use, but what should be your contract with your library users. Both char and wchar_t can mean more than one thing.
The right answer to me, is use char and consider everything utf-8 encoded, as utf8everywhere.org suggests. This will also make it easier to write cross-platform libraries.
Make sure you make correct use of strings though. Some APIs like fopen(), would accept a char* string and treat it differently (not as UTF-8) when compiled on Windows. If Unicode is important to you (and it probably is, when you are dealing with strings), be sure to handle your strings correctly. A good example can be seen in boost::locale. I also recommend using boost::nowide on Windows to get strings handled correctly inside your library.

In Windows we stick to WCHARS. std::wstring. Mainly because if you don't you end up having to convert because calling Windows functions.
I have a feeling that trying to use utf8 internally simply because of http://utf8everywhere.org/ is gonna bite us in the bum later on down the line.

It is best recommended that, when developing a Windows application, resort to TCHARs. The good thing about TCHARs is that they can be either regular chars or wchars, depending whether the unicode setting is set or not. Once you resort to TCHARs, you make sure that all string manipulations that you use also start with the _t prefix (e.g. _tcslen for length of string). That way you will know that your code will work both in Unicode and ASCII environments.

UNICODE, UTF-8 and Windows mess

I'm trying to implement text support in Windows with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way but that doesn't seem to be easily accomplished when considering the two platforms in question. I have spent a considerable amount of time reading up on UNICODE, UTF-8 (and other encodings), widechars and such and here is what I have come to understand so far:
UNICODE, as the standard, describes the set of characters that are mappable and the order in which they occur. I refer to this as the "what": UNICODE specifies what will be available.
UTF-8 (and other encodings) specify the how: How each character will be represented in a binary format.
Now, on windows, they opted for a UCS-2 encoding originally, but that failed to meet the requirements, so UTF-16 is what they have, which is also multi-char when necessary.
So here is the delemma:
Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).

Just do UTF-8
There are lots of support libraries for UTF-8 in every plaftorm, also some are multiplaftorm too. The UTF-16 APIs in Win32 are limited and inconsistent as you've already noted, so it's better to keep everything in UTF-8 and convert to UTF-16 at last moment. There are also some handy UTF-8 wrappings for the windows API.
Also, at application-level documents, UTF-8 is getting more and more accepted as standard. Every text-handling application either accepts UTF-8, or at worst shows it as "ASCII with some dingbats", while there's only few applications that support UTF-16 documents, and those that don't, show it as "lots and lots of whitespace!"

Correct. You will convert UTF-8 to UTF-16 for your Windows API calls.
Most of the time you will use regular string functions for UTF-8 -- strlen, strcpy (ick), snprintf, strtol. They will work fine with UTF-8 characters. Either use char * for UTF-8 or you will have to cast everything.
Note that the underscore versions like _mbstowcs are not standard, they are normally named without an underscore, like mbstowcs.
It is difficult to come up with examples where you actually want to use operator[] on a Unicode string, my advice is to stay away from it. Likewise, iterating over a string has surprisingly few uses:
If you are parsing a string (e.g., the string is C or JavaScript code, maybe you want syntax hilighting) then you can do most of the work byte-by-byte and ignore the multibyte aspect.
If you are doing a search, you will also do this byte-by-byte (but remember to normalize first).
If you are looking for word breaks or grapheme cluster boundaries, you will want to use a library like ICU. The algorithm is not simple.
Finally, you can always convert a chunk of text to UTF-32 and work with it that way. I think this is the sanest option if you are implementing any of the Unicode algorithms like collation or breaking.
See: C++ iterate or split UTF-8 string into array of symbols?

Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
Yes, that's correct. The *A function variants interpret the string parameters according to the currently active code page (which is Windows-1252 on most computers in the US and Western Europe, but can often be other code pages) and convert them to UTF-16. There is a UTF-8 code page, however AFAIK there isn't a way to programmatically set the active code page (there's GetACP to get the active code page, but not corresponding SetACP).
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
The mbs* family of functions is almost never used, in my experience. With the exception of mbstowcs, mbsrtowcs, and mbsinit, those functions are not standard C.
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).
I think that mbrtowc(3) would be the best option here for decoding a single code point of a multibyte string.
Overall, I think the best strategy for cross-platform Unicode compatibility is to do everything in UTF-8 internally using single-byte characters. When you need to call a Windows API function, convert it to UTF-16 and always call the *W variant. Most non-Windows platforms use UTF-8 already, so that makes using those a snap.

In Windows, you can call WideCharToMultiByte and MultiByteToWideChar to convert between UTF-8 string and UTF-16 string (wstring in Windows). Because Windows API do not use UTF-8, whenever you call any Windows API function that support Unicode, you have to convert string into wstring (Windows version of Unicode in UTF-16). And when you get output from Windows, you have to convert UTF-16 back to UTF-8. Linux uses UTF-8 internally, so you do not need such conversion. To make your code portable to Linux, stick to UTF-8 and provide something as below for conversion:
#if (UNDERLYING_OS==OS_WINDOWS)
using os_string = std::wstring;
std::string utf8_string_from_os_string(const os_string &os_str)
{
size_t length = os_str.size();
int size_needed = WideCharToMultiByte(CP_UTF8, 0, os_str, length, NULL, 0, NULL, NULL);
std::string strTo(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, os_str, length, &strTo[0], size_needed, NULL, NULL);
return strTo;
}
os_string utf8_string_to_os_string(const std::string &str)
{
size_t length = os_str.size();
int size_needed = MultiByteToWideChar(CP_UTF8, 0, str, length, NULL, 0);
os_string wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, str, length, &wstrTo[0], size_needed);
return wstrTo;
}
#else
// Other operating system uses UTF-8 directly and such conversion is
// not required
using os_string = std::string;
#define utf8_string_from_os_string(str) str
#define utf8_string_to_os_string(str) str
#endif
To iterate over utf8 strings, two fundamental functions you need are: one to calculate the number of bytes for an utf8 character and the another to determine if the byte is the leading byte of a utf8 character sequence. The following code provides a very efficient way to test:
inline size_t utf8CharBytes(char leading_ch)
{
return (leading_ch & 0x80)==0 ? 1 : clz(~(uint32_t(uint8_t(leading_ch))<<24));
}
inline bool isUtf8LeadingByte(char ch)
{
return (ch & 0xC0) != 0x80;
}
Using these functions, it should not be difficult to implement your own iterator over utf8 strings, one is for forwarding iterator, and another is for backward iterator.

Convert wchar_t* to UTF-16 string

I need a code in C++ to convert a string given in wchar_t* to a UTF-16 string. It must work both on Windows and Linux. I've looked through a lot of web-pages during the search, but the subject still is not clear to me.
As I understand I need to:
Call setlocale with LC_TYPE and UTF-16 encoding.
Use wcstombs to convert wchar_t to UTF-16 string.
Call setlocale to restore previous locale.
Do you know the way I can convert wchar_t* to UTF-16 in a portable way (Windows and Linux)?

There is no single cross-platform method for doing this in C++03 (not without a library). This is in part because wchar_t is itself not the same thing across platforms. Under Windows, wchar_t is a 16-bit value, while on other platforms it is often a 32-bit value. So you would need two different codepaths to do it.

C++11's std::codecvt_utf16 should work, I think.
std::codecvt_utf16 is a std::codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS2 or UCS4 character string (depending on the type of Elem).
See this: http://en.cppreference.com/w/cpp/locale/codecvt_utf16

You can assume that wchar_t is utf-32 in the non-Windows world. It is true on Linux and Mac OS X and most *nix systems (there are very few exceptions to that, and on systems you will probably never touch :-)
And wchar_t is utf-16 on Windows. So on Windows the conversion function can just do a memcpy :-)
On everything else, the conversion is algorithmic, and pretty simple. So there is no need of fancy support from 3rd party libraries.
Here is the basic algorithm: http://unicode.org/faq/utf_bom.html#utf16-3
And you can probably find find a dozen different implementations if you don't want to write your own :-)

The problem is with wchar_t being rather underspecified. You could use GNU libiconv to do what you want. It accepts special encoding name "wchar_t" as both source and target encoding. That way it will be portable to both Windows and Linux and elsewhere where you can provide libiconv.

The g++ compiler appears to support wcstombs?

How do I get STL std::string to work with unicode on windows?

At my company we have a cross platform(Linux & Windows) library that contains our own extension of the STL std::string, this class provides all sort of functionality on top of the string; split, format, to/from base64, etc. Recently we were given the requirement of making this string unicode "friendly" basically it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side since every thing is inherently UTF-8, however I am having trouble with the Windows side; is there a trick to getting the STL std::string to work as UTF-8 on windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on the std::string since that is what the string class is based on in Linux.
Thank you,

There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.

Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() takes strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signed-ness is left undefined by the standards. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchat_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said does not conform to the above, since 2-bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::string's, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather, I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to Window's API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.

Putting UTF-8 code points into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to an std::wstring which will store UTF-16 (at least on most Windows compilers) or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).

Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.

No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In Windows portion of the code, use "wide" versions of Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This allows to avoid dependency on build system configuration.
In the platfrom abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and std::tstring, but it still adds a lot of pain.
Use std::wstring everywhere. Does not help much since wchar_t is 16 bit on Windows, so you either have to restrict yourself to BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platfrom-specific portion; use std::string in cross-platfrom portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and thus not always acceptable or convenient.

If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.

In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.

It really platform dependant, Unicode is headache. Depends on which compiler you use. For older ones from MS (VS2010 or older), you would need use API described in MSDN
for VS2015
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
for mingw, gcc, etc.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
output contains proper file name...

You should consider using QString and QByteArray, it has good unicode support

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js