Which syntax for Unicode strings in VC++? - c++

How should you use Unicode strings in VC++? Of course you should #define UNICODE, but what about your strings?
Should the TEXT() or _T() macro be used around all text, or should you just put an L in front of strings? It's my belief that all programs should use Unicode these days, so wouldn't it be cleanest to just use the L prefix?
Opinions?

It depends on what you want to achieve. If you want to make sure your code will compile and work correctly both with and without Unicode, use the TEXT or _T macros, and call the "default" Win32 function names (for example CreateWindow).
If you want to make sure your program always uses the Unicode API, then you should use an L prefix in front of your strings, and call the wide versions of the Win32 functions (such as CreateWindowW).
In the latter case, you'll get Unicode behavior whether or not UNICODE is defined.
In the former case, your application will change its behavior based on whether UNICODE is defined.
I agree with you that the non-Unicode versions haven't really been relevant since Win98, so I'd go with the second approach.
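A minimal sketch contrasting the two approaches (the window parameters are placeholders, not taken from the answer):

    #include <windows.h>
    #include <tchar.h>

    // Approach 1: character-set neutral; CreateWindow resolves to CreateWindowA
    // or CreateWindowW depending on whether UNICODE is defined.
    void CreateNeutral(HINSTANCE hInst)
    {
        CreateWindow(TEXT("BUTTON"), _T("OK"), WS_OVERLAPPEDWINDOW,
                     0, 0, 100, 50, NULL, NULL, hInst, NULL);
    }

    // Approach 2: always Unicode, regardless of the UNICODE macro.
    void CreateAlwaysWide(HINSTANCE hInst)
    {
        CreateWindowW(L"BUTTON", L"OK", WS_OVERLAPPEDWINDOW,
                      0, 0, 100, 50, NULL, NULL, hInst, NULL);
    }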

Declare Unicode string literals with L prefix.
The TEXT() or _T() macros were for the bad old days when you wanted single source to compile for both Unicode and non-Unicode versions of Windows (Windows 9x). Thankfully you can safely ignore Windows 9x today.

Something I learned a while ago:
Golden rule:
Don't fight the framework.
Do what the framework was designed to do -- if you use Windows, use _T, to make your code independent of the character type. If you're on Linux, use UTF-8. If you have a cross-platform framework, do whatever it does. But don't try to invent something of your own unless you have a really good reason to. (It is simply usually not worth the effort of working against a framework.)

Related

How to properly localize a cross platform program?

I am currently making a game engine that is eventually going to support all platforms. Currently I am working on the Windows support with the Win32 API. Reading the documentation, it suggests that I use wide strings/chars and the Unicode version of API functions so that my application can be localized. But if I use wide versions of everything (wcout wstring wchar_t etc.), I will have to make my entire game engine use those wide types. That also means that when working with other platforms, I will also have to use wide types, or I will have to convert between them.
My idea is that maybe my code could be compiled with wide string types on Windows and with normal string types on other platforms, perhaps with macro definitions. Is that the best way to do this? And how may I go about doing it?
I also don't really understand how Unicode works in C++. If I set my system locale to English, then I get a compiler warning from MSVC if I have any Chinese characters stored in a normal string type. However, now that I have set my system locale to Chinese and enabled UTF-8 encoding, I get no compiler warnings if I store Unicode characters in normal strings. I also have no idea how Unicode works on other platforms. Can somebody explain this for me?
Thanks in advance.
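A minimal sketch of the macro/alias-based approach described in the question (all names here are hypothetical):

    #include <string>

    #if defined(_WIN32)
        typedef wchar_t      engine_char;     // wide strings for the Win32 W APIs
        typedef std::wstring engine_string;
        #define ENGINE_TEXT(s) L##s
    #else
        typedef char         engine_char;     // UTF-8 in narrow strings elsewhere
        typedef std::string  engine_string;
        #define ENGINE_TEXT(s) s
    #endif

    engine_string title = ENGINE_TEXT("My Game");   // resolves per platform at compile time

This is essentially the same trick the Windows TCHAR/_T() machinery uses, applied across platforms instead of across character sets.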

What are the disadvantages to not using Unicode in Windows?

What are the disadvantages to not using Unicode on Windows?
By Unicode, I mean WCHAR and the wide API functions. (CreateWindowW, MessageBoxW, and so on)
What problems could I run into by not using this?
Your code won't be able to deal correctly with characters outside the currently selected codepage when dealing with system APIs.
Typical problems include unsupported characters being translated to question marks and the inability to process text with special characters, in particular files with "strange characters" in their names/paths.
Also, several newer APIs are present only in the "wide" version.
Finally, each API call involving text will be marginally slower, since the "A" versions of the APIs are normally just thin wrappers around the "W" APIs that convert the parameters to UTF-16 on the fly - so you have some overhead with respect to a "plain" W call.
Nothing stops you from working with a narrow-character Unicode encoding (i.e. UTF-8) inside your application, but the Windows "A" APIs don't speak UTF-8, so you'd have to convert to UTF-16 and call the W versions anyway.
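A minimal sketch of that boundary conversion (not from the answer; MessageBoxW is just a stand-in for any W API):

    #include <windows.h>
    #include <string>

    // Convert a UTF-8 std::string to a UTF-16 std::wstring for the W APIs.
    std::wstring Utf8ToWide(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
        return wide;
    }

    int main()
    {
        std::string text = "Hello from a UTF-8 string";   // narrow/UTF-8 text inside the app
        MessageBoxW(NULL, Utf8ToWide(text).c_str(), L"Demo", MB_OK);   // UTF-16 at the boundary
        return 0;
    }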
I believe the gist of the original question was: "Should I compile all my Windows apps with #define _UNICODE, and what's the downside if I don't?"
My original reply was: "Yeah, you should. We've moved past 8-bit ASCII, and '_UNICODE' is a reasonable default for any modern Windows code."
For Windows, I still believe that's reasonably good advice. But I've deleted my original reply, because I didn't realize until I re-read my own links how much UTF-16 is "quite a sad state of affairs" (as Matteo Italia eloquently put it).
For example:
http://utf8everywhere.org/
Microsoft has ... mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.
I heartily recommend these three links:
The Absolute Minimum Every Software Developer Should Know about Unicode
Should UTF-16 Be Considered Harmful?
UTF-8 Everywhere
IMHO...

Trying to CreateDirectory, getting char* to LPCWSTR error, willing to try another function

I've tried Googling this, and there are so many answers based on various specific situations that frankly I'm more stuck than I was when I started.
The facts are these:
Language: C/C++
OS: Windows
IDE: Visual Studio 2005
I'm trying to create a directory from a function in my program, using CreateDirectory (after a #include of windows.h).
Supposedly, the first parameter (a path) should be a char*. However, when I try to compile, I get the following error: error C2664: 'CreateDirectoryW' : cannot convert parameter 1 from 'char *' to 'LPCWSTR'
What I've read is that I have some sort of issue between UNICODE and ANSI. The solutions vary wildly and I'm afraid of breaking something important, or doing something very stupid.
I am perfectly willing to use any other method of creating a new directory, if one exists without me having to find some other library.
I only minored in comp sci, and frankly I have no idea why it's so easy to open, close, edit, and otherwise access files through stdio, but doing anything with directories (specifically making them and finding out if they exist) is a wild goose chase through the streets of the Internet.
Please help me, either to fix the current attempt at CreateDirectory or to use something else to create a directory.
Thank you!
This is completely crazy. Microsoft has mechanisms to support both ANSI (as they call it) and UNICODE from the same code, because Windows 95/98/Me were ANSI operating systems and NT/XP/Vista etc. are UNICODE operating systems. So if you really want to, you can write one set of code that supports both systems and just recompile for each system.
But who is interested in Windows 95/98/Me any more? Yet this stuff carries on confusing newbies.
Here's the beef: there is no function called CreateDirectory in Windows; there is a UNICODE function called CreateDirectoryW and an ANSI function called CreateDirectoryA. The macro CreateDirectory expands to CreateDirectoryW or CreateDirectoryA depending on which compiler options you have defined. If you end up using CreateDirectoryW (as you evidently did), then you must pass a Unicode string to it; if you use CreateDirectoryA, then you pass a normal ASCII string.
The simplest thing in your case would be to forget about all this and just call CreateDirectoryA directly.
CreateDirectoryA("c:\\some\\directory", NULL);
Why is it so hard to create a directory in C++? I would guess it's because back in the 70s, when C was new, not every operating system had directories, so this stuff never made it into the language standard.
VS2005 is Unicode by default, and it's best to keep it that way; it might save you a lot of issues in the future. In Unicode builds, CreateDirectory (and other Win32 functions) expect wchar_t strings, not regular char. Making string literals wchar_t is simple -
L"Some string" to make it always Unicode, or _T("Some string") to make it configuration dependent.
I don't know exactly how you are calling CreateDirectory, but if converting the string to Unicode is too much trouble, you can use the ANSI version directly - CreateDirectoryA. Post some code if you want a more detailed answer.
Try using #undef UNICODE before including windows.h. Unless you are really coding Unicode, but that's another story.
With newer versions of Visual Studio, windows.h defaults to the Unicode versions, so #undef UNICODE will give you back the ANSI version. It will show in the stack trace -- you will call CreateDirectoryA rather than CreateDirectoryW.
Or, just call CreateDirectoryA directly, but somehow I feel that is not a best practice.

Which character set to choose when compiling a c++ dll

Could someone give some info regarding the different character sets within visual studio's project properties sheets.
The options are:
None
Unicode
Multi byte
I would like to make an informed decision as to which to choose.
Thanks.
All new software should be Unicode enabled. For Windows apps that means the UTF-16 character set, and for pretty much everyone else UTF-8 is often the best choice. The other character set choices in Windows programming should only be used for compatibility with older apps. They do not support the same range of characters as Unicode.
Multi-byte (MBCS) uses 1 or 2 bytes per character, None uses exactly 1, and Unicode (UTF-16) uses 2 or 4.
None is not good, as it doesn't support non-Latin characters. It's very frustrating for a non-English user trying to input their name into an edit box. Do not use None.
If you do not do custom computation of string lengths, then from the programmer's point of view multi-byte and Unicode do not differ, as long as you use the TEXT macro to wrap your string constants.
Some libraries explicitly require a certain encoding (DirectShow, etc.); just use what they want.
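A rough sketch of what that project setting actually controls: it defines UNICODE/_UNICODE or _MBCS, which in turn selects what TCHAR and TEXT() expand to.

    #include <windows.h>
    #include <tchar.h>

    // "Use Unicode Character Set":    TCHAR is wchar_t, TEXT("x") is L"x"
    // "Use Multi-Byte Character Set": TCHAR is char,    TEXT("x") is "x"
    // "Not Set":                      TCHAR is char,    TEXT("x") is "x"

    TCHAR greeting[] = TEXT("Hello");   // compiles under any of the three settings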
As Mr. Shiny recommended, Unicode is the right thing.
If you want to understand a bit more on what are the implications of that decision, take a look here: http://www.mihai-nita.net/article.php?artID=20050306b

Is TCHAR still relevant?

I'm new to Windows programming and after reading the Petzold book I wonder:
is it still good practice to use the TCHAR type and the _T() macro to declare strings, or should I just use wchar_t and L"" strings in new code?
I will target only Windows 2000 and up, and my code will be i18n from the start.
The short answer: NO.
Like all the others already wrote, a lot of programmers still use TCHARs and the corresponding functions. In my humble opinion the whole concept was a bad idea. UTF-16 string processing is a lot different from simple ASCII/MBCS string processing. If you use the same algorithms/functions with both of them (this is what the TCHAR idea is based on!), you get very bad performance on the UTF-16 version if you are doing a little bit more than simple string concatenation (like parsing etc.). The main reason is surrogates.
With the sole exception of when you really have to compile your application for a system which doesn't support Unicode, I see no reason to use this baggage from the past in a new application.
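A minimal sketch of the surrogate issue: a single code point outside the Basic Multilingual Plane occupies two wchar_t units in a UTF-16 build, so naive per-element processing miscounts it.

    #include <iostream>
    #include <string>

    int main()
    {
        std::wstring s = L"\U0001F600";     // one code point (an emoji, U+1F600)
        std::wcout << s.size() << L"\n";    // prints 2 on Windows: a surrogate pair,
                                            // not one "character" per wchar_t
        return 0;
    }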
I have to agree with Sascha. The underlying premise of TCHAR / _T() / etc. is that you can write an "ANSI"-based application and then magically give it Unicode support by defining a macro. But this is based on several bad assumptions:
That you actively build both MBCS and Unicode versions of your software
Otherwise, you will slip up and use ordinary char* strings in many places.
That you don't use non-ASCII backslash escapes in _T("...") literals
Unless your "ANSI" encoding happens to be ISO-8859-1, the resulting char* and wchar_t* literals won't represent the same characters.
That UTF-16 strings are used just like "ANSI" strings
They're not. Unicode introduces several concepts that don't exist in most legacy character encodings. Surrogates. Combining characters. Normalization. Conditional and language-sensitive casing rules.
And perhaps most importantly, the fact that UTF-16 is rarely saved on disk or sent over the Internet: UTF-8 tends to be preferred for external representation.
That your application doesn't use the Internet
(Now, this may be a valid assumption for your software, but...)
The web runs on UTF-8 and a plethora of rarer encodings. The TCHAR concept only recognizes two: "ANSI" (which can't be UTF-8) and "Unicode" (UTF-16). It may be useful for making your Windows API calls Unicode-aware, but it's damned useless for making your web and e-mail apps Unicode-aware.
That you use no non-Microsoft libraries
Nobody else uses TCHAR. Poco uses std::string and UTF-8. SQLite has UTF-8 and UTF-16 versions of its API, but no TCHAR. TCHAR isn't even in the standard library, so no std::tcout unless you want to define it yourself.
What I recommend instead of TCHAR
Forget that "ANSI" encodings exist, except for when you need to read a file that isn't valid UTF-8. Forget about TCHAR too. Always call the "W" version of Windows API functions. #define _UNICODE just to make sure you don't accidentally call an "A" function.
Always use UTF encodings for strings: UTF-8 for char strings and UTF-16 (on Windows) or UTF-32 (on Unix-like systems) for wchar_t strings. typedef UTF16 and UTF32 character types to avoid platform differences.
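A minimal sketch of the typedef suggestion above (since C++11 the language also offers char16_t/char32_t, which serve the same purpose):

    #include <cstdint>

    typedef std::uint16_t UTF16;   // one UTF-16 code unit on every platform
    typedef std::uint32_t UTF32;   // one UTF-32 code unit / Unicode code point

    // wchar_t, by contrast, is 16 bits on Windows but 32 bits on most
    // Unix-like systems, which is exactly the platform difference to avoid.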
If you're wondering whether it's still used in practice, then yes - it is still used quite a bit. No one will look at your code funny if it uses TCHAR and _T(""). The project I'm working on now is converting from ANSI to Unicode - and we're going the portable (TCHAR) route.
However...
My vote would be to forget all the ANSI/UNICODE portable macros (TCHAR, _T(""), all the _tXXXXXX calls, etc...) and just assume Unicode everywhere. I really don't see the point of being portable if you'll never need an ANSI version. I would use all the wide character functions and types directly. Prepend all string literals with an L.
I would still use the TCHAR syntax if I was doing a new project today. There's not much practical difference between using it and the WCHAR syntax, and I prefer code which is explicit in what the character type is. Since most API functions and helper objects take/use TCHAR types (e.g.: CString), it just makes sense to use it. Plus it gives you flexibility if you decide to use the code in an ASCII app at some point, or if Windows ever evolves to Unicode32, etc.
If you decide to go the WCHAR route, I would be explicit about it. That is, use CStringW instead of CString, and casting macros when converting to TCHAR (eg: CW2CT).
That's my opinion, anyway.
I would like to suggest a different approach (neither of the two).
To summarize, use char* and std::string, assuming UTF-8 encoding, and do the conversions to UTF-16 only when wrapping API functions.
More information and justification for this approach in Windows programs can be found in http://www.utf8everywhere.org.
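A minimal sketch of what such a wrapper might look like (the function name is hypothetical, not part of any library): the rest of the program passes UTF-8 std::strings around, and only this function converts and talks UTF-16 to Win32.

    #include <windows.h>
    #include <string>

    bool CreateDirectoryUtf8(const std::string& utf8Path)
    {
        // Convert the UTF-8 path to UTF-16 at the API boundary.
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, NULL, 0);
        std::wstring widePath(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, &widePath[0], len);
        return CreateDirectoryW(widePath.c_str(), NULL) != FALSE;
    }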
The Introduction to Windows Programming article on MSDN says
New applications should always call the Unicode versions (of the API).
The TEXT and TCHAR macros are less useful today, because all applications should use Unicode.
I would stick to wchar_t and L"".
TCHAR/WCHAR might be enough for some legacy projects. But for new applications, I would say NO.
All this TCHAR/WCHAR stuff is there for historical reasons. TCHAR provides a seemingly neat way (a disguise) to switch between ANSI text encoding (MBCS) and Unicode text encoding (UTF-16). In the past, people did not have an understanding of the number of characters in all the languages of the world. They assumed 2 bytes were enough to represent all characters, and thus built a fixed-length character encoding scheme using WCHAR. However, this is no longer true after the release of Unicode 2.0 in 1996.
That is to say:
No matter which you use in CHAR/WCHAR/TCHAR, the text processing part in your program should be able to handle variable length characters for internationalization.
So you actually need to do more than choosing one from CHAR/WCHAR/TCHAR for programming in Windows:
If your application is small and does not involve text processing (i.e. just passing around text strings as arguments), then stick with WCHAR, since it is easier that way to work with the WinAPI's Unicode support.
Otherwise, I would suggest using UTF-8 as the internal encoding and storing texts in char strings or std::string, and converting them to UTF-16 when calling the WinAPI. UTF-8 is now the dominant encoding and there are lots of handy libraries and tools for processing UTF-8 strings.
Check out this wonderful website for more in-depth reading:
http://utf8everywhere.org/
Yes, absolutely; at least for the _T macro. I'm not so sure about the wide-character stuff, though.
The reason is to better support WinCE or other non-standard Windows platforms. If you're 100% certain that your code will remain on NT, then you can probably just use regular C-string declarations. However, it's best to tend towards the more flexible approach, as it's much easier to #define that macro away on a non-Windows platform than to go through thousands of lines of code and add it everywhere in case you need to port some library to Windows Mobile.
IMHO, if there's TCHARs in your code, you're working at the wrong level of abstraction.
Use whatever string type is most convenient for you when dealing with text processing - this will hopefully be something supporting Unicode, but that's up to you. Do conversion at OS API boundaries as necessary.
When dealing with file paths, whip up your own custom type instead of using strings. This will allow you OS-independent path separators, give you an easier interface to code against than manual string concatenation and splitting, and be a lot easier to adapt to different OSes (ANSI, UCS-2, UTF-8, whatever).
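A minimal sketch of that idea (all names hypothetical; since C++17, std::filesystem::path fills much of this role):

    #include <string>
    #include <vector>

    // The program deals only in Path objects; each platform backend decides
    // the separator and encoding at the OS boundary.
    class Path
    {
    public:
        void push(std::string component) { parts_.push_back(component); }

        std::string native(char separator) const
        {
            std::string out;
            for (size_t i = 0; i < parts_.size(); ++i)
            {
                if (i != 0) out += separator;
                out += parts_[i];
            }
            return out;
        }

    private:
        std::vector<std::string> parts_;   // components kept as UTF-8
    };

    // Usage: Path p; p.push("data"); p.push("save.dat");
    // then p.native('\\') on Windows, p.native('/') elsewhere.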
The only reasons I see to use anything other than the explicit WCHAR are portability and efficiency.
If you want to make your final executable as small as possible use char.
If you don't care about RAM usage and want internationalization to be as easy as simple translation, use WCHAR.
If you want to make your code flexible, use TCHAR.
If you only plan on using the Latin characters, you might as well use the ASCII/MBCS strings so that your user does not need as much RAM.
For people who are "i18n from the start up", save yourself the source code space and simply use all of the Unicode functions.
TCHAR is not relevant anymore, since now we have UNICODE. You should use UTF-16 wchar_t* strings instead.
The Windows APIs take wchar_t* strings, and they are UTF-16.
Just adding to an old question:
NO
Go start a new CLR C++ project in VS2010. Microsoft themselves use L"Hello World", 'nuff said.
TCHAR has taken on a new meaning: porting from WCHAR back to CHAR.
https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
Recent releases of Windows 10 have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8.