Would std::basic_string<TCHAR> be preferable to std::wstring on Windows? - c++

As I understand it, Windows #defines TCHAR as the correct character type for your application based on the build - so it is wchar_t in UNICODE builds and char otherwise.
Because of this I wondered if std::basic_string<TCHAR> would be preferable to std::wstring, since the first would theoretically match the character type of the application, whereas the second would always be wide.
So my question is essentially: Would std::basic_string<TCHAR> be preferable to std::wstring on Windows? And, would there be any caveats (i.e. unexpected behavior or side effects) to using std::basic_string<TCHAR>? Or, should I just use std::wstring on Windows and forget about it?

I believe the time when it was advisable to release non-unicode versions of your application (to support Win95, or to save a KB or two) is long past: nowadays the underlying Windows system you'll support are going to be unicode-based (so using char-based system interfaces will actually complicate the code by interposing a shim layer from the library) and it's doubtful whether you'd save any space at all. Go std::wstring, young man!-)

I have done this on very large projects and it works great:
namespace std
{
#ifdef _UNICODE
typedef wstring tstring;
#else
typedef string tstring;
#endif
}
You can use wstring everywhere instead though if you'd like, if you do not need to ever compile using a multi-byte character string. I don't think you need to ever support multi byte character strings though in any modern application.
Note: The std namespace is supposed to be off limits, but I have not had any problems with the above method for several years.

One thing to keep in mind. If you decide to use std::wstring all the way in your program, you might still need to use std::string if you are communicating with other systems using UTF8.

Related

C++ C2440 Error, cannot convert from 'std::_String_iterator<_Elem,_Traits,_Alloc>' to 'LPCTSTR'

Unfortunately I have been tasked with compiling an old C++ DLL so that it will work on Win 7 64 Bit machines. I have zero C++ experience. I have muddled through other issues, but this one has stumped me and i have not found a solution elsewhere. The following code is throwing a C2440 compiler error.
The error: "error C2440: '' : cannot convert from 'std::_String_iterator<_Elem,_Traits,_Alloc>' to 'LPCTSTR'"
the code:
#include "StdAfx.h"
#include "Antenna.h"
#include "MyTypes.h"
#include "objbase.h"
const TCHAR* CAntenna::GetMdbPath()
{
if(m_MdbPath.size() <= 0) return NULL;
return LPCTSTR(m_MdbPath.begin());
}
"m_MdbPath" is defined in the Antenna.h file as
string m_MdbPath;
Any assistance or guidance someone could provide would be extremely helpful. Thank you in advance. I am happy to provide additional details on the code if needed.
std::string has a .c_str() member function that should accomplish what you're looking for. It will return a const char* (const wchar_t* with std::wstring). std::string also has an empty() member function and I recommend using nullptr instead of the NULL macro.
const TCHAR* CAntenna::GetMdbPath()
{
if(m_MdbPath.empty()) return nullptr;
return m_MdbPath.c_str();
}
The solution:
const TCHAR* CAntenna::GetMdbPath()
{
if(m_MdbPath.size() <= 0) return NULL;
return m_MdbPath.c_str();
}
will work if you can guarantee that your program is using the MBCS character set option (or if the Character Set option is set to Not Setif you're using Visual Studio), since a std::string character type will be the same as the TCHAR type.
However, if your build is UNICODE, then a std::string character type will not be the same as the TCHAR type, since TCHAR is defined as a wide character, not a single-byte character. Thus returning c_str() will give a compiler error. So your original code, plus the tentative fix of returning std::string::c_str() may work, but it is technically wrong with respect to the types being used.
You should be working directly with the types presented to your function -- if the type is TCHAR, then you should be using TCHAR based types, but again, a TCHAR is a type that changes definition depending on the build type.
As a matter of fact, the code will fail miserably at runtime if it were a UNICODE build and to fix the compiler error(s), you casted to an LPCTSTR to keep the compiler quiet. You cannot cast a string pointer type to another string pointer type. Casting is not converting, and merely casting will not convert a narrow string into a wide string and vice-versa. You will wind up with at the least, odd characters being used or displayed, and worse, a program with random weird behavior and crashes.
In addition to this you never know when you really will need to build a UNICODE version of the DLL, since MBCS builds are becoming more rare. Going by your original post, the DLL's usage of TCHAR indicates that yes, this DLL may (or even has) been built for UNICODE.
There are a few alternate solutions that you can use. Note that some of these solutions may require you to make additional coding changes beyond the string types. If you're using C++ stream objects, they may also need to be changed to match the character type (for example, std::ostream and std::wostream)
Solution 1. Use a typedef, where the string type to use depends on the build type.
For example:
#ifdef UNICODE
typedef std::wstring tchar_string;
#else
typedef std::string tchar_string;
#endif
and then use tchar_string instead of std::string. Switching between MBCS and UNICODE builds will work without having to cast strin types, but requires coding changes.
Solution 2. Use a typedef to use TCHAR as the underlying character type in the std::basic_string template.
For example:
typedef std::basic_string<TCHAR> tchar_string;
And use tchar_string in your application. You get the same public interface as std::(w)string and this allows you to switch between MBCS and UNICODE build seamlessly (with respect to the string types).
Solution 3. Go with UNICODE builds and drop MBCS builds altogether
If the application you're building is a new one, then UNICODE is the default option (at least in Visual Studio) when creating a new application. Then all you really need to do is use std::wstring.
This is sort of the opposite of the original solution given of making sure you use MBCS, but IMO this solution makes more sense if there is only going to be one build type. MBCS builds are becoming more rare, and basically should be used only for legacy apps.

Should string encoding for library conform to Unicode or flexible?

I am created a library in C++ which exposes c style interface APIs. Some of the arguments are string so they would be char *. Now I know they should be all Unicode but because it is a library I don't think I want to force users to use decide or not. Ideally I thought it would be best to use TCHAR so I can build it either way for unicode code and ASCII users. Than I read this and it opposes the idea in general.
As an example of API, the strings are filenames or error messages like below.
void LoadSomeFile(char * fileName );
const char * GetErrorMsg();
I am using c++ and STL. There is this debate of std::string vs std::wstring as well.
Personally I really like MFC's CString class which takes care of all this nicely but that means I have to use MFC just for its string class.
Now I think TCHAR is probably the best solution for me but do I have to use CString (internally) for that to work? Can I use it with STL string? As far as I can see, it is either string or wstring there.
The TCHAR type is an unfortunate design choice that has thankfully been left behind us. Nobody has to use TCHAR any more, thank goodness. The Unicode choice has been made for us as well: Unicode is the only sane choice going forwards.
The question is, is your library Windows-only? Or is it portable?
If your library is portable, then the typical choice is char * or std::string with UTF-8 encoded strings. For more information, see UTF-8 Everywhere. The summary is that wchar_t is UTF-16 on Windows but UTF-32 everywhere else, which makes it almost useless for cross-platform programming.
If your library runs on Win32 only, then you may feel free to use wchar_t instead. On Windows, wchar_t is UTF-16.
Don't use both, it will make your code and API bloated and difficult to read. TCHAR is a hack for supporting the Win32 API and migrating to Unicode.

C++ UNICODE and STL

The Windows API seems big on UNICODE, you make a new project in Visual C++ and it sets it to UNICODE by default.
And me trying to be a good Windows programmer, I want to use UNICODE.
The problem is the C++ Standard Library and STL (such as std::string, or std::runtime_error) don't work well with UNICODE strings.
I can only pass a std::string, or a char* to std::runtime_error, and i'm pretty sure std::string doesn't support UNICODE.
So my question is, how should I use things such as std::runtime_error? Should I mix UNICODE and regular ANSI? (I think this is a bad idea...)
Just use ANSI in my whole project? (prefer not..) Or what?
In general you shouldn’t mix those two encodings. However, exception messages are something that is only of interest to the developer (e.g. in log files) and should never be shown to the user (but look at Jim’s comment for an important caveat).
So you are on the safe side if you use UNICODE for your whole user-faced interface and still use std::exception etc. behind the scenes for developer messages. There should be no need ever to convert between the two.
Furthermore, it’s a good trick to define a typedef for UNICODE-independent strings in C++:
typedef std::basic_string<TCHAR> tstring;
… and analogously define tcout, tcin etc. conditionally:
#ifdef UNICODE
std::wostream& tcout = std::wcout;
std::wostream& tcerr = std::wcerr;
std::wostream& tclog = std::wclog;
std::wistream& tcin = std::wcin;
#else
std::ostream& tcout = std::cout;
std::ostream& tcerr = std::cerr;
std::ostream& tclog = std::clog;
std::istream& tcin = std::cin;
#endif
Josh,
Please have a look at my answer here: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
There is growing number of engineers who believe std::string is just perfect for unicode on Windows, and is the right way to write portable and unicode-correct programs faster.
Take a look at this (rather old now) article on CodeProject: Upgrading an STL-based application to use Unicode. It covers the issues you're likely to hit if you're using the STL extensively. It shouldn't be that bad and, generally speaking, it's worth your while to use wide strings.
To work with Windows Unicode API, just use the wide string versions - wstring, etc. It won't help with exception::what(), but for that you can use UTF-8 encoding if you really need Unicode.

proper style for interfacing with legacy TCHAR code

I'm modifying someone else's code which uses TCHAR extensively. Is it better form to just use std::wstring in my code? wstring should be equivalent to TString on widechar platforms so I don't see an issue. The rationale being, its easier to use a raw wstring than to support TCHAR... e.g., using boost:wformat.
Which style will be more clear to the next maintainer? I wasted several hours myself trying to understand string intricacies, it seems just using wstring would cut off half of the stuff you need to understand.
typedef std::basic_string<TCHAR> TString; //on winxp, TCHAR resolves to wchar_t
typedef basic_string<wchar_t, char_traits<wchar_t>, allocator<wchar_t> > wstring;
...the only difference is the allocator.
In the unlikely case that your program
lands on a Window 9x machine, there's
still an API layer that can translate
your UTF-16 strings to 8-bit chars.
There's no point left in using TCHAR
for new code development.
source
If you are only intending on targetting Unicode (wchar_t) platforms, you are better off using std::wstring. If you want to support multibyte and Unicode builds, you will need to use TString and similar.
Also note that basic_string defaults the char_traits and allocator to one based on the passed in character type, so on builds where UNICODE (or _UNICODE, I can never remember which), TString and wstring will be the same.
NOTE: If you are just passing the arguments to various APIs and not doing any manipulations on them, you are better off using const wchar_t * instead of std::wstring directly (especially if mixing Win32, COM and standard C++ code) as you will end up doing less conversions and copying.
TCHAR used to be more important when you where going to compile the binaries twice, once for char and a second for wchar_t.
You can still make this choice if you like, changing the MSVC project settings from MBCS to Unicode and back.
This also means when calling the windows API you will have the matching data type.

Portable wchar_t in C++

Is there a portable wchar_t in C++? On Windows, its 2 bytes. On everything else is 4 bytes. I would like to use wstring in my application, but this will cause problems if I decide down the line to port it.
If you're dealing with use internal to the program, don't worry about it; a wchar_t in class A is the same as in class B.
If you're planning to transfer data between Windows and Linux/MacOSX versions, you've got more than wchar_t to worry about, and you need to come up with means to handle all the details.
You could define a type that you'll define to be four bytes everywhere, and implement your own strings, etc. (since most text handling in C++ is templated), but I don't know how well that would work for your needs.
Something like typedef int my_char; typedef std::basic_string<my_char> my_string;
What do you mean by "portable wchar_t"? There is a uint16_t type that is 16bits wide everywhere, which is often available. But that of course doesn't make up a string yet. A string has to know of its encoding to make sense of functions like length(), substring() and so on (so it doesn't cut characters in the middle of a code point when using utf8 or 16). There are some unicode compatible string classes i know of that you can use. All can be used in commercial programs for free (the Qt one will be compatible with commercial programs for free in a couple of months, when Qt 4.5 is released).
ustring from the gtkmm project. If you program with gtkmm or use glibmm, that should be the first choice, it uses utf-8 internally. Qt also has a string class, called QString. It's encoded in utf-16. ICU is another project that creates portable unicode string classes, and has a UnicodeString class that internally seems to be encoded in utf-16, like Qt. Haven't used that one though.
The proposed C++0x standard will have char16_t and char32_t types. Until then, you'll have to fall back on using integers for the non-wchar_t character type.
#if defined(__STDC_ISO_10646__)
#define WCHAR_IS_UTF32
#elif defined(_WIN32) || defined(_WIN64)
#define WCHAR_IS_UTF16
#endif
#if defined(__STDC_UTF_16__)
typedef _Char16_t CHAR16;
#elif defined(WCHAR_IS_UTF16)
typedef wchar_t CHAR16;
#else
typedef uint16_t CHAR16;
#endif
#if defined(__STDC_UTF_32__)
typedef _Char32_t CHAR32;
#elif defined(WCHAR_IS_UTF32)
typedef wchar_t CHAR32;
#else
typedef uint32_t CHAR32;
#endif
According to the standard, you'll need to specialize char_traits for the integer types. But on Visual Studio 2005, I've gotten away with std::basic_string<CHAR32> with no special handling.
I plan to use a SQLite database.
Then you'll need to use UTF-16, not wchar_t.
The SQLite API also has a UTF-8 version. You may want to use that instead of dealing with the wchar_t differences.
My suggestion. Use UTF-8 and std::string. Wide strings would not bring you too much added value. As you anyway can't interpret wide character as letter as some characters crated from several unicode code points.
So use anywhere UTF-8 and use good library to deal with natural languages. Like for example Boost.Locale.
Bad idea: define something like typedef uint32_t mychar; is bad. As you can't use iostream with it, you can't create for example stringstream based in this character as you would not be able to write in it.
For example this would not work:
std::basic_ostringstream<unsigned> s;
ss << 10;
Would not create you a string.