Portable wchar_t in C++

Is there a portable wchar_t in C++? On Windows, it's 2 bytes. On everything else it's 4 bytes. I would like to use wstring in my application, but this will cause problems if I decide down the line to port it.

If you're dealing with use internal to the program, don't worry about it; a wchar_t in class A is the same as in class B.
If you're planning to transfer data between Windows and Linux/MacOSX versions, you've got more than wchar_t to worry about, and you need to come up with means to handle all the details.
You could define a type that is four bytes everywhere and implement your own strings on top of it (most text handling in C++ is templated), but I don't know how well that would work for your needs.
Something like typedef int my_char; typedef std::basic_string<my_char> my_string;

What do you mean by "portable wchar_t"? There is a uint16_t type that is 16 bits wide everywhere, which is often available. But that of course doesn't make up a string yet. A string has to know its encoding to make sense of functions like length(), substring() and so on (so it doesn't cut characters in the middle of a code point when using UTF-8 or UTF-16). There are some Unicode-compatible string classes I know of that you can use. All can be used in commercial programs for free (the Qt one will be compatible with commercial programs for free in a couple of months, when Qt 4.5 is released).
ustring from the gtkmm project. If you program with gtkmm or use glibmm, that should be the first choice, it uses utf-8 internally. Qt also has a string class, called QString. It's encoded in utf-16. ICU is another project that creates portable unicode string classes, and has a UnicodeString class that internally seems to be encoded in utf-16, like Qt. Haven't used that one though.
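To illustrate why a string has to know its encoding before length() means anything, here is a minimal sketch. The function name utf8_code_points is invented for this example, and it assumes the input is already valid UTF-8; it counts code points by skipping continuation bytes, so the result can differ from std::string::size().

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string that is assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "h\xC3\xA9llo" (the word "héllo" in UTF-8), size() reports 6 bytes while this function reports 5 code points. Grapheme clusters are a further layer that this sketch does not attempt.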

The proposed C++0x standard will have char16_t and char32_t types. Until then, you'll have to fall back on using integers for the non-wchar_t character type.
#if defined(__STDC_ISO_10646__)
#define WCHAR_IS_UTF32
#elif defined(_WIN32) || defined(_WIN64)
#define WCHAR_IS_UTF16
#endif
#if defined(__STDC_UTF_16__)
typedef _Char16_t CHAR16;
#elif defined(WCHAR_IS_UTF16)
typedef wchar_t CHAR16;
#else
typedef uint16_t CHAR16;
#endif
#if defined(__STDC_UTF_32__)
typedef _Char32_t CHAR32;
#elif defined(WCHAR_IS_UTF32)
typedef wchar_t CHAR32;
#else
typedef uint32_t CHAR32;
#endif
According to the standard, you'll need to specialize char_traits for the integer types. But on Visual Studio 2005, I've gotten away with std::basic_string<CHAR32> with no special handling.
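With a C++11 compiler, char16_t and char32_t are built in, and std::u16string and std::u32string come with the required char_traits specializations, so the typedef ladder above is only needed for older toolchains. A small sketch (the two helper names are made up for illustration) showing that the same code point takes a different number of code units in each type:

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the BMP: it is one UTF-32 code unit,
// but a surrogate pair (two code units) in UTF-16.
inline std::size_t utf32_units()
{
    return std::u32string(U"\U0001F600").size();
}

inline std::size_t utf16_units()
{
    return std::u16string(u"\U0001F600").size();
}
```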
I plan to use a SQLite database.
Then you'll need to use UTF-16, not wchar_t.
The SQLite API also has a UTF-8 version. You may want to use that instead of dealing with the wchar_t differences.

My suggestion: use UTF-8 and std::string. Wide strings would not bring you much added value, because you can't treat a wide character as a letter anyway: some characters are built from several Unicode code points.
So use UTF-8 everywhere, and use a good library to deal with natural languages, for example Boost.Locale.
Defining something like typedef uint32_t mychar; is a bad idea. You can't use iostreams with it; for example, you can't create a stringstream based on this character type, because you would not be able to write to it.
For example this would not work:
std::basic_ostringstream<unsigned> s;
s << 10;
This would not give you a string.
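By contrast, the ordinary char-based stream handles UTF-8 text and formatted numbers alike, because UTF-8 is just bytes as far as the stream is concerned. A minimal sketch (the function name floor_label is invented for this example):

```cpp
#include <sstream>
#include <string>

// Builds a label such as "étage 10". The UTF-8 bytes for 'é' pass through
// std::ostringstream untouched, and the integer is formatted as usual.
std::string floor_label(int number)
{
    std::ostringstream s;
    s << "\xC3\xA9" "tage " << number;
    return s.str();
}
```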

Related

Proper cross-platform way to convert from std::string to 'const TCHAR *'

I'm working on a cross-platform project in C++. I have a variable of type std::string and need to convert it to const TCHAR *. What is the proper way, maybe functions from some library?
UPD 1: - as I see in function definition there is split windows and non-Windows implementations:
#if defined _MSC_VER || defined __MINGW32__
#define _tinydir_char_t TCHAR
#else
#define _tinydir_char_t char
#endif
- so is there really no way to pass a std::string parameter without splitting the implementation per platform?
Proper cross-platform way to convert from std::string to 'const TCHAR *'
TCHAR should not be used in cross-platform programs at all, except of course when interacting with Windows API calls; but those need to be abstracted away from the rest of the program, or else it won't be cross-platform. So you only need to convert between TCHAR strings and char strings in Windows-specific code.
The rest of the program should use char, and preferably assume that it contains UTF-8 encoded strings. If user input, or system calls return strings that are in a different encoding, you need to figure out what that encoding is, and convert accordingly.
The character encoding conversion functionality of the C++ standard library is rather weak, so it is not of much use. You can implement the conversion yourself according to the encoding specification, or you can use a third-party implementation.
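As an illustration of what "implement the conversion according to the encoding specification" involves, here is a hedged sketch of a UTF-8 decoder. The name utf8_to_utf32 and the choice to throw std::invalid_argument on bad input are assumptions made for this example, not part of any library mentioned here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32 code points. Throws on truncated or malformed
// input, overlong forms, surrogates, and values above U+10FFFF.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // continuation bytes expected after the lead
        char32_t lowest = 0;     // smallest code point valid for this length
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1; lowest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2; lowest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3; lowest = 0x10000;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (in.size() - i - 1 < extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("invalid code point");
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

A production converter would usually also offer a non-throwing mode that substitutes U+FFFD; a maintained third-party library remains the safer choice.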
may be functions from some library ?
I recommend this.
as I see in function definition there is split windows and non-Windows implementations
The library that you use doesn't provide a uniform API to different platforms, so it cannot be used in a truly cross-platform way. You can write a wrapper library with uniform function declarations that handles the character encoding conversion on platforms that need it.
Or, you can use another library, which provides a uniform API and converts the encoding transparently.
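A wrapper of that kind might look roughly like the sketch below. The names native_string and to_native are hypothetical; the Windows branch relies on the Win32 MultiByteToWideChar function, and the rest of the program is assumed to hold UTF-8 in std::string.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <stdexcept>

typedef std::wstring native_string;   // UTF-16 for the Windows API

// Converts UTF-8 to UTF-16; only this function knows about the platform.
inline native_string to_native(const std::string& utf8)
{
    if (utf8.empty()) return native_string();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  NULL, 0);
    if (len <= 0) throw std::runtime_error("invalid UTF-8 input");
    native_string out(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &out[0], len);
    return out;
}
#else
typedef std::string native_string;    // byte strings everywhere else

// Non-Windows systems take byte strings, so UTF-8 passes through unchanged.
inline native_string to_native(const std::string& utf8)
{
    return utf8;
}
#endif
```

Callers would then write something like some_platform_call(to_native(path).c_str()) inside the wrapper, keeping TCHAR and wchar_t out of the portable code.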
TCHAR is a Windows type, defined like this:
#ifdef UNICODE
typedef wchar_t TCHAR, *PTCHAR;
#else
typedef char TCHAR, *PTCHAR;
#endif
The UNICODE macro is typically defined in the project settings (when you use a Visual Studio project on Windows).
You can get the const TCHAR* from std::string (which is ASCII or UTF8 in most cases) in this way:
std::string s("hello world");
const TCHAR* pstring = nullptr;
#ifdef UNICODE
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring wstr = converter.from_bytes(s);
pstring = wstr.data();
#else
pstring = s.data();
#endif
pstring will be the result.
But using TCHAR on other platforms is strongly discouraged. It's better to use UTF-8 strings (char*) held in std::string.
I came across boost.nowide the other day. I think it will do exactly what you want.
http://cppcms.com/files/nowide/html/
As others have pointed out, you should not be using TCHAR except in code that interfaces with the Windows API (or libraries modeled after the Windows API).
Another alternative is to use the character conversion classes/macros defined in atlconv.h. CA2T will convert an 8-bit character string to a TCHAR string. CA2CT will convert to a const TCHAR string (LPCTSTR). Assuming your 8-bit strings are UTF-8, you should specify CP_UTF8 as the code page for the conversion.
If you want to declare a variable containing a TCHAR copy of a std::string:
CA2T tstr(stdstr.c_str(), CP_UTF8);
If you want to call a function that takes an LPCTSTR:
FunctionThatTakesString(CA2CT(stdstr.c_str(), CP_UTF8));
If you want to construct a std::string from a TCHAR string:
std::string mystdstring(CT2CA(tstr, CP_UTF8));
If you want to call a function that takes an LPTSTR then maybe you should not be using these conversion classes. (But you can if you know that the function you are calling does not modify the string outside its current length.)

Should string encoding for library conform to Unicode or flexible?

I have created a library in C++ which exposes a C-style API. Some of the arguments are strings, so they would be char *. Now I know they should all be Unicode, but because it is a library I don't think I want to force that decision on users. Ideally I thought it would be best to use TCHAR so I can build it either way, for Unicode and ASCII users. Then I read this and it opposes the idea in general.
As an example of API, the strings are filenames or error messages like below.
void LoadSomeFile(char * fileName );
const char * GetErrorMsg();
I am using c++ and STL. There is this debate of std::string vs std::wstring as well.
Personally I really like MFC's CString class which takes care of all this nicely but that means I have to use MFC just for its string class.
Now I think TCHAR is probably the best solution for me but do I have to use CString (internally) for that to work? Can I use it with STL string? As far as I can see, it is either string or wstring there.
The TCHAR type is an unfortunate design choice that has thankfully been left behind us. Nobody has to use TCHAR any more, thank goodness. The Unicode choice has been made for us as well: Unicode is the only sane choice going forwards.
The question is, is your library Windows-only? Or is it portable?
If your library is portable, then the typical choice is char * or std::string with UTF-8 encoded strings. For more information, see UTF-8 Everywhere. The summary is that wchar_t is UTF-16 on Windows but UTF-32 everywhere else, which makes it almost useless for cross-platform programming.
If your library runs on Win32 only, then you may feel free to use wchar_t instead. On Windows, wchar_t is UTF-16.
Don't use both, it will make your code and API bloated and difficult to read. TCHAR is a hack for supporting the Win32 API and migrating to Unicode.

C++ UNICODE and STL

The Windows API seems big on UNICODE, you make a new project in Visual C++ and it sets it to UNICODE by default.
And me trying to be a good Windows programmer, I want to use UNICODE.
The problem is the C++ Standard Library and STL (such as std::string, or std::runtime_error) don't work well with UNICODE strings.
I can only pass a std::string or a char* to std::runtime_error, and I'm pretty sure std::string doesn't support UNICODE.
So my question is, how should I use things such as std::runtime_error? Should I mix UNICODE and regular ANSI? (I think this is a bad idea...)
Just use ANSI in my whole project? (prefer not..) Or what?
In general you shouldn’t mix those two encodings. However, exception messages are something that is only of interest to the developer (e.g. in log files) and should never be shown to the user (but look at Jim’s comment for an important caveat).
So you are on the safe side if you use UNICODE for your whole user-facing interface and still use std::exception etc. behind the scenes for developer messages. There should be no need ever to convert between the two.
Furthermore, it’s a good trick to define a typedef for UNICODE-independent strings in C++:
typedef std::basic_string<TCHAR> tstring;
… and analogously define tcout, tcin etc. conditionally:
#ifdef UNICODE
std::wostream& tcout = std::wcout;
std::wostream& tcerr = std::wcerr;
std::wostream& tclog = std::wclog;
std::wistream& tcin = std::wcin;
#else
std::ostream& tcout = std::cout;
std::ostream& tcerr = std::cerr;
std::ostream& tclog = std::clog;
std::istream& tcin = std::cin;
#endif
Josh,
Please have a look at my answer here: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
There is a growing number of engineers who believe std::string is just perfect for Unicode on Windows, and is the right way to write portable and Unicode-correct programs faster.
Take a look at this (rather old now) article on CodeProject: Upgrading an STL-based application to use Unicode. It covers the issues you're likely to hit if you're using the STL extensively. It shouldn't be that bad and, generally speaking, it's worth your while to use wide strings.
To work with Windows Unicode API, just use the wide string versions - wstring, etc. It won't help with exception::what(), but for that you can use UTF-8 encoding if you really need Unicode.
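One way to get a wide-string message into exception::what() is to encode it as UTF-8 first. The sketch below uses made-up helper names (append_utf8, wide_to_utf8, make_error) and assumes that wchar_t holds either UTF-16 or UTF-32 code units; surrogate pairs are combined, and unpaired surrogates are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Appends one code point to a UTF-8 string.
inline void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a wide string to UTF-8, combining UTF-16 surrogate pairs
// (as found in wchar_t strings on Windows) into single code points.
inline std::string wide_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            unsigned long low = static_cast<unsigned long>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}

// Builds a runtime_error whose what() text is UTF-8.
inline std::runtime_error make_error(const std::wstring& message)
{
    return std::runtime_error(wide_to_utf8(message));
}
```

Whoever logs or displays what() then has to treat it as UTF-8 and convert back to a wide string before handing it to a Windows API.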

WChars, Encodings, Standards and Portability

The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning. </preamble>
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa."
iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.
The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR-T and UTF.
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:
                        my program
    <-- wcstombs ---  /==============\   --- iconv(UTF8, WCHAR_T) -->
CRT                   |   wchar_t[]  |                                <Disk>
    --- mbstowcs -->  \==============/   <-- iconv(WCHAR_T, UTF8) ---
                            |
                            +-- iconv(WCHAR_T, UCS-4) --+
                                                        |
       ... <--- (adv. Unicode malarkey) ----- libicu ---+
Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:
// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>
std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc
int wmain(const std::vector<std::wstring> args); // user starts here
#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
setlocale(LC_CTYPE, "");
int argc;
wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char * argv[])
{
setlocale(LC_CTYPE, "");
return wmain(parse(argc, argv));
}
#endif
// Serialization utilities
#include <iconv.h>
typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;
U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);
/* ... */
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
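The parse helper declared in the wrapper above is left unimplemented. One possible sketch, using std::mbsrtowcs as the comment there suggests, is shown here; the widen helper is a name invented for this example, and the result depends on the LC_CTYPE locale that was set with setlocale.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Widens one multibyte string according to the current LC_CTYPE locale.
std::wstring widen(const char* s)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = s;
    // With a null destination, mbsrtowcs only reports the required length.
    std::size_t len = std::mbsrtowcs(NULL, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence");
    }
    std::vector<wchar_t> buf(len + 1);
    src = s;
    state = std::mbstate_t();
    std::mbsrtowcs(&buf[0], &src, buf.size(), &state);
    return std::wstring(&buf[0], len);
}

// Converts the narrow argv array into a vector of wide strings.
std::vector<std::wstring> parse(int argc, char* argv[])
{
    std::vector<std::wstring> args;
    for (int i = 0; i < argc; ++i) {
        args.push_back(widen(argv[i]));
    }
    return args;
}
```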
Updates
Following many very nice comments I'd like to add a few observations:
If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.
Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).
File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++
No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with
int main(int argc, char** argv)
you have already lost Unicode support for command line arguments. You have to write
int wmain(int argc, wchar_t** argv)
instead, or use the GetCommandLineW function, none of which is specified in the C standard.
More specifically,
any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
Encoding agnosticity usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know with which encoding you work, or use a wrapper library that abstracts them away.
I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.
I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++0x/C1x. (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)
DO define functions to convert between UTF-8, UTF-16, and UTF-32.
DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.
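As a sketch of one such conversion function, the following encodes UTF-32 as UTF-16, including surrogate pairs for code points outside the BMP. The name utf32_to_utf16 and the exception type are choices made for this example only.

```cpp
#include <stdexcept>
#include <string>

// Encodes UTF-32 code points as UTF-16 code units.
std::u16string utf32_to_utf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("not a Unicode scalar value");
        }
        if (cp < 0x10000) {
            // BMP code points map to a single code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Everything else becomes a high/low surrogate pair.
            char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

With one function like this per direction, the rest of the program can stay in a single internal encoding and convert only at the boundaries.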
The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.
Here are some things that are much harder with wchar_t than they are if you just pick one of the UTF encodings:
Parsing Javascript: Identifiers can contain certain characters outside the BMP (and let's assume that you care about this kind of correctness).
HTML: How do you turn the character reference &#65536; (U+10000, 𐀀) into a string of wchar_t?
Text editor: How do you find grapheme cluster boundaries in a wchar_t string?
If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.
Your program requirements may differ and wchar_t may work fine for you.
Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.
There are new codecvt facets coming with char32_t and char16_t so I don't see how you can be wrong as long as you are consistent and pick one char type + encoding if the facets are here.
The facets are described in 22.5 [locale.stdcvt] (from n3242).
I don't understand how this doesn't satisfy at least some of your requirements:
#include <codecvt>
#include <locale>
#include <string>
namespace ns {
typedef char32_t char_t;
typedef std::u32string string;
// Function-like macro that applies the char32_t literal prefix: LIT("x") -> U"x"
// (or use a user-defined literal)
#define LIT(x) U##x
// Communicate with interface0, which wants utf-8
// This type doesn't need to be public at all; I just refactored it.
typedef std::wstring_convert<std::codecvt_utf8<char_t>, char_t> converter0;
inline std::string
to_interface0(string const& s)
{
    return converter0().to_bytes(s);
}
inline string
from_interface0(std::string const& s)
{
    return converter0().from_bytes(s);
}
// Communicate with interface1, which wants utf-16
// Doesn't have to be public either
// codecvt_utf16 works on a UTF-16 byte sequence, so the external type is std::string
typedef std::wstring_convert<std::codecvt_utf16<char_t>, char_t> converter1;
inline std::string
to_interface1(string const& s)
{
    return converter1().to_bytes(s);
}
inline string
from_interface1(std::string const& s)
{
    return converter1().from_bytes(s);
}
} // ns
Then your code can use ns::string, ns::char_t, LIT('A') and LIT("Hello, World!") with reckless abandon, without knowing what the underlying representation is. Then use from_interfaceX(some_string) whenever it's needed. It doesn't affect the global locale or streams either. The helpers can be as clever as needed, e.g. codecvt_utf8 can deal with 'headers', which I assume is Standardese for tricky stuff like the BOM (ditto codecvt_utf16).
In fact I wrote the above to be as short as possible but you'd really want helpers like this:
template<typename... T>
inline ns::string
ns::from_interface0(T&&... t)
{
return converter0().from_bytes(std::forward<T>(t)...);
}
which give you access to the 3 overloads for each [from|to]_bytes members, accepting things like e.g. const char* or ranges.

Would std::basic_string<TCHAR> be preferable to std::wstring on Windows?

As I understand it, Windows #defines TCHAR as the correct character type for your application based on the build - so it is wchar_t in UNICODE builds and char otherwise.
Because of this I wondered if std::basic_string<TCHAR> would be preferable to std::wstring, since the first would theoretically match the character type of the application, whereas the second would always be wide.
So my question is essentially: Would std::basic_string<TCHAR> be preferable to std::wstring on Windows? And, would there be any caveats (i.e. unexpected behavior or side effects) to using std::basic_string<TCHAR>? Or, should I just use std::wstring on Windows and forget about it?
I believe the time when it was advisable to release non-Unicode versions of your application (to support Win95, or to save a KB or two) is long past: nowadays the underlying Windows systems you'll support are going to be Unicode-based (so using char-based system interfaces will actually complicate the code by interposing a shim layer from the library) and it's doubtful whether you'd save any space at all. Go std::wstring, young man!-)
I have done this on very large projects and it works great:
namespace std
{
#ifdef _UNICODE
typedef wstring tstring;
#else
typedef string tstring;
#endif
}
You can use wstring everywhere instead though if you'd like, if you do not need to ever compile using a multi-byte character string. I don't think you need to ever support multi byte character strings though in any modern application.
Note: The std namespace is supposed to be off limits, but I have not had any problems with the above method for several years.
One thing to keep in mind. If you decide to use std::wstring all the way in your program, you might still need to use std::string if you are communicating with other systems using UTF8.