Assistance with porting from Multi-Byte to UNICODE in MFC - C++

I've got a tedious six months to a year ahead of me. I'm working on a program with 1 million+ lines of code (much of it written in the early/mid 90s), and it has been decided that it should now support a UNICODE build. I've researched and found many of the best practices:
using the _t versions of many Microsoft and C runtime functions, like _stprintf_s() instead of sprintf_s() or _tcsstr() instead of strstr(),
wrapping all hard-coded strings that need to be TCHAR* like so: _T("string") or _T('c'),
replacing most char* with LPTSTR, most const char* with LPCTSTR, and char with TCHAR,
using CA2T() and CT2A() to convert between char* and LPTSTR where necessary.
I was wondering if anyone has written a script that is capable of automatically making many of these changes, since it could save me months of work.

I think the following approach exactly fits your scenario.
Leave all your strings as narrow chars, use sprintf and strstr as before, read and write text files that are always assumed to be UTF-8 without BOMs, and so on. All you need to change is your communication with the system: assume the strings are UTF-8, and convert to UTF-16 on the fly just before calling into MFC or Windows.
As a bonus, you'll get easier portability to non-Windows platforms, compared to the approach advocated by Microsoft.
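For illustration, a minimal sketch of what that boundary conversion can look like, assuming ATL's conversion helpers are available (the helper name SetWindowTextUtf8 is just an example, not part of MFC):

#include <windows.h>
#include <atlconv.h>   // CA2W
#include <string>

// Internal strings stay UTF-8 in std::string; the UTF-16 conversion
// happens only at the call into the Windows API.
void SetWindowTextUtf8(HWND hwnd, const std::string& utf8)
{
    // CA2W with CP_UTF8 builds a temporary UTF-16 copy that lives
    // for the duration of the call.
    ::SetWindowTextW(hwnd, CA2W(utf8.c_str(), CP_UTF8));
}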

Related

Designing an application for UTF-8 or UTF-16 usage

I am developing an application that will be used primarily by English and Spanish readers. However, in the future I would like to be able to support more languages, such as Japanese. While thinking about the design of the program I have hit a wall on the UTF-8 vs. UTF-16 vs. multi-byte question. I would like to compile my program to support either UTF-8 or UTF-16 (for when languages such as Chinese are used). For this to happen, I was thinking that I should have something such as
#if _UTF8
typedef char char_type;
#elif _UTF16
typedef unsigned short char_type;
#else
#error
#endif
That way, in the future when I use UTF-16, I can switch the #define (and of course, have the same type of #if/#endif for things such as sprintf, etc.). I have my own custom string type, which would also make use of this typedef.
Would replacing every single use of "char" with my "char_type", using the scenario mentioned above, be considered a "bad idea"? If so, why is it a bad idea, and how could I achieve what I mentioned above?
The reason I would like to use one or the other is due to memory efficiency. I would rather not use UTF-16 all the time if I am not using it.
UTF-8 can represent every Unicode character. If your application properly supports UTF-8, you are golden for any language.
Note that Windows' native controls do not have APIs to set UTF-8 text in them, if you are writing a Windows application. However, it's easy to make an application which uses UTF-8 internally for everything, converts UTF-8 -> UTF-16 when setting text in Windows, and converts UTF-16 -> UTF-8 when getting text from Windows. I've done it, and it worked great and was MUCH nicer than writing a WCHAR application. It's trivial to convert UTF-8 <-> UTF-16; Windows has APIs for it, or you can find a simple (one page) function to do it in your own code.
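As a rough sketch of what that "one page" of conversion helpers can look like, built on the Win32 MultiByteToWideChar/WideCharToMultiByte calls (the function names here are only examples):

#include <windows.h>
#include <string>

// UTF-8 -> UTF-16
std::wstring Utf8ToUtf16(const std::string& s)
{
    if (s.empty()) return std::wstring();
    int len = ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring out(len, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], len);
    return out;
}

// UTF-16 -> UTF-8
std::string Utf16ToUtf8(const std::wstring& s)
{
    if (s.empty()) return std::string();
    int len = ::WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0, NULL, NULL);
    std::string out(len, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], len, NULL, NULL);
    return out;
}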
I believe choosing UTF-8 is enough for your needs. Keep in mind that a char_type as defined above is a code unit, not a whole character, in either encoding.
You may wish to have a look at this discussion: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful for the benefits of different types of popular encodings.
This is essentially what Windows does with TCHAR (except that the Windows API interprets char as the "ANSI" code page instead of UTF-8).
I think it's a bad idea.

C++ and UTF8 - Why not just replace ASCII?

In my application I have to constantly convert strings between std::string and std::wstring due to different APIs (boost, win32, ffmpeg, etc.). Especially with ffmpeg, the strings end up going utf8->utf16->utf8->utf16 just to open a file.
Since UTF-8 is backwards compatible with ASCII, I thought I would consistently store all my strings as UTF-8 in std::string and only convert to std::wstring when I have to call certain unusual functions.
This worked reasonably well; I implemented to_lower, to_upper, and iequals for UTF-8. However, I then hit several dead ends: std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string, with re-implementations of all the corresponding algorithms (including regex).
Basically my conclusion is that UTF-8 is not very good for general usage, and the current std::string/std::wstring situation is a mess.
However, my question is: why are the default std::string and "" not simply changed to use UTF-8, especially as UTF-8 is backward compatible? Is there possibly some compiler flag which can do this? Of course the STL implementation would need to be adapted accordingly.
I've looked at ICU, but it is not very compatible with APIs that assume basic_string, e.g. no begin()/end()/c_str(), etc.
The main issue is the conflation of in-memory representation and encoding.
None of the Unicode encodings is really amenable to text processing. Users generally care about graphemes (what's on the screen), while the encodings are defined in terms of code points... and some graphemes are composed of several code points.
As such, when one asks what the 5th character of "Hélène" (a French first name) is, the question is quite confusing:
In terms of graphemes, the answer is n.
In terms of code points... it depends on the representation of é and è (each can be represented either as a single code point or as a base letter plus a combining diacritic...)
Depending on the source of the question (an end-user in front of her screen or an encoding routine), the answer is completely different.
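A small illustration of that point (a sketch, requires C++11): both literals below display as "Hélène", yet their code unit counts differ.

#include <cstdio>
#include <string>

int main()
{
    // Precomposed: é is U+00E9 and è is U+00E8, one code point each.
    std::u16string precomposed = u"H\u00E9l\u00E8ne";
    // Decomposed: 'e' followed by a combining accent (U+0301 / U+0300).
    std::u16string decomposed  = u"He\u0301le\u0300ne";

    std::printf("%u\n", (unsigned)precomposed.size()); // 6 UTF-16 code units
    std::printf("%u\n", (unsigned)decomposed.size());  // 8 UTF-16 code units
    // The 5th character a user sees ('n') is not at index 4 in the
    // decomposed form, which is exactly the confusion described above.
}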
Therefore, I think that the real question is Why are we speaking about encodings here?
Today it does not make sense, and we would need two "views": Graphemes and Code Points.
Unfortunately the std::string and std::wstring interfaces were inherited from a time where people thought that ASCII was sufficient, and the progress made didn't really solve the issue.
I don't even understand why the in-memory representation should be specified; it is an implementation detail. All a user should want is:
to be able to read/write in UTF-* and ASCII
to be able to work on graphemes
to be able to edit a grapheme (to manage the diacritics)
... who cares how it is represented? I thought that good software was built on encapsulation?
Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
You cannot; the primary reason for this is named Microsoft. They decided not to support Unicode as UTF-8, so UTF-8 support under Windows is minimal.
Under Windows you cannot use UTF-8 as a code page, but you can convert from or to UTF-8.
There are two snags to using UTF-8 on Windows.
You cannot tell from the character count how many bytes a string will occupy - it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4.
The Windows API uses UTF-16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it uses a narrow-char Windows API, but all that is happening is that the conversion on each call is hidden, and it goes through the ANSI code page rather than UTF-8.)
The big snag with UTF-16 is that the binary representation of a string depends on the byte order of a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers where you cannot be sure that the other computer uses the same byte order.
So what to do? I use UTF-16 everywhere 'inside' all my programs. When string data has to be stored in a file, or transmitted over a socket, I first convert it to UTF-8.
This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF-8 and UTF-16 can be isolated to routines responsible for I/O.
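A minimal sketch of what one such I/O routine can look like, assuming UTF-16 std::wstring inside the program and UTF-8 on disk (the helper name and the missing error handling are simplifications):

#include <windows.h>
#include <fstream>
#include <string>

void SaveTextAsUtf8(const std::wstring& text, const char* path)
{
    // Convert the internal UTF-16 string to UTF-8 only here, at the file boundary.
    int len = ::WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                                    NULL, 0, NULL, NULL);
    std::string utf8(len, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, text.data(), (int)text.size(),
                          &utf8[0], len, NULL, NULL);

    std::ofstream out(path, std::ios::binary); // binary: keep the bytes as-is
    out.write(utf8.data(), (std::streamsize)utf8.size());
}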

Trying to CreateDirectory, getting char* to LPCWSTR error, willing to try another function

I've tried Googling this, and there are so many answers based on various specific situations that frankly I'm more stuck than I was when I started.
The facts are these:
Language: C/C++
OS: Windows
IDE: Visual Studio 2005
I'm trying to create a directory from a function in my program, using CreateDirectory (after a #include of windows.h).
Supposedly, the first parameter (a path) should be a char*. However, when I try to compile, I get the following error: error C2664: 'CreateDirectoryW' : cannot convert parameter 1 from 'char *' to 'LPCWSTR'
What I've read is that I have some sort of issue between UNICODE and ANSI. The solutions vary wildly and I'm afraid of breaking something important, or doing something very stupid.
I am perfectly willing to use any other method of creating a new directory, if one exists without me having to find some other library.
I only minored in comp sci, and frankly I have no idea why it's so easy to open, close, edit, and otherwise access files through stdio, but doing anything with directories (specifically making them and finding out if they exist) is a wild goose chase through the streets of the Internet.
Please help me, either to fix the current attempt at CreateDirectory or to use something else to create a directory.
Thank you!
This is completely crazy. Microsoft has mechanisms to support both ANSI (as they call it) and Unicode from the same code, because Windows 95/98/Me were ANSI operating systems and NT/XP/Vista etc. are Unicode operating systems. So if you really want to, you can write one set of code that supports both families and just recompile for each one.
But who is interested in Windows 95/98/Me any more? Yet this stuff carries on confusing newbies.
Here's the beef: there is no function called CreateDirectory in Windows; there is a Unicode function called CreateDirectoryW and an ANSI function called CreateDirectoryA. The macro CreateDirectory expands to CreateDirectoryW or CreateDirectoryA depending on which compiler options you have defined. If you end up using CreateDirectoryW (as you evidently did) then you must pass a Unicode string to it; if you use CreateDirectoryA then you pass a normal narrow (ANSI) string.
The simplest thing in your case would be to forget about all this and just call CreateDirectoryA directly.
CreateDirectoryA("c:\\some\\directory", NULL);
Why is it so hard to create a directory in C++? I would guess that because back in the 70s, when C was new, not every operating system had directories, this stuff never made it into the language standard.
VS2005 is Unicode by default, and you had better keep it that way; it might save you a lot of issues in the future. In Unicode builds CreateDirectory (and other Win32 functions) expects wchar_t strings, not regular char strings. Making string literals wchar_t is simple -
L"Some string" to make it always Unicode, or _T("Some string") to make it configuration dependent.
I don't know exactly how you are calling CreateDirectory, but if converting the string to Unicode is too much trouble, you can use the ANSI version directly - CreateDirectoryA. Post some code if you want a more detailed answer.
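If you prefer to keep the Unicode build, the conversion is not much work either. A sketch (the helper name, path value, and lack of error handling are only illustrative; the narrow path is assumed to be in the ANSI code page):

#include <windows.h>

BOOL MakeDir(const char* path)
{
    wchar_t wide[MAX_PATH];
    // Convert the narrow path to UTF-16 for the W version of the API;
    // the -1 length includes the null terminator.
    ::MultiByteToWideChar(CP_ACP, 0, path, -1, wide, MAX_PATH);
    return ::CreateDirectoryW(wide, NULL);
}

// usage: MakeDir("c:\\some\\directory");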
Try using #undef UNICODE before including windows.h. Unless you are really coding Unicode, but that's another story.
With the newer versions of Visual Studio, windows.h defaults to the Unicode versions, so #undef UNICODE will give you back the ANSI versions. It will show in the stack trace -- you will call CreateDirectoryA rather than CreateDirectoryW.
Or, just call CreateDirectoryA directly, but somehow I feel that is not a best practice.

C++: Making my project support unicode

My C++ project is currently about 16K lines of code, and I admit I completely failed to think about Unicode support in the first place.
All I did was typedef std::string as String and jump into coding.
I have never really worked with Unicode myself in programs I wrote.
How hard is it to switch my project to Unicode now? Is it even a good idea?
Can I just switch to std::wchar without any major problems?
Probably the most important part of making an application unicode aware is to track the encoding of your strings and to make sure that your public interfaces are well specified and easy to use with the encodings that you wish to use.
Switching to a wider character (in C++, wchar_t) is not necessarily the correct solution. In fact, I would say it is usually not the simplest solution. Some applications can get away with specifying that all strings and interfaces use UTF-8 and not need to change at all. std::string can perfectly well be used for UTF-8 encoded strings.
However, if you need to interpret the characters in a string or interface with non-UTF-8 interfaces then you will have to put more work in but without knowing more about your application it is impossible to recommend a single best approach.
There are some issues with using std::wstring. If your application will be storing text in Unicode, and it will be running on different platforms, you may run into trouble. std::wstring relies on wchar_t, which is compiler dependent. In Microsoft Visual C++, this type is 16 bits wide, and will thus only support UTF-16 encodings. The GNU C++ compiler specifies this type to be 32 bits wide, and will thus only support UTF-32 encodings. If you then store the text in a file from one system (say Windows/VC++), and then read the file from another system (Linux/GCC), you will have to prepare for this (in this case, converting from UTF-16 to UTF-32).
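A quick way to see this platform dependence for yourself (a sketch; compile and run it under both MSVC and GCC to compare):

#include <cstdio>

int main()
{
    std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    // Typically 2 on Windows/Visual C++ (UTF-16 code units) and
    // 4 on Linux/GCC (UTF-32 code units), which is why raw wchar_t data
    // written to a file by one cannot simply be read back by the other.
}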
Can I just switch to [std::wchar_t] without any major problems?
No, it's not that simple.
The encoding of a wchar_t string is platform-dependent. Windows uses UTF-16. Linux usually uses UTF-32. (C++0x will mitigate this difference by introducing separate char16_t and char32_t types.)
If you need to support Unix-like systems, you don't have all the UTF-16 functions that Windows has, so you'd need to write your own _wfopen, etc.
Do you use any third-party libraries? Do they support wchar_t?
Although wide characters are commonly used for an in-memory representation, on-disk and on-the-Web formats are much more likely to be UTF-8 (or another char-based encoding) than UTF-16/32. You'd have to convert these.
You can't just search-and-replace char with wchar_t because C++ confounds "character" and "byte", and you have to determine which chars are characters and which chars are bytes.
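A small illustration of that last point (a sketch with made-up names): in a Unicode build, some char buffers really are bytes and must stay char, while others hold text and should become wide.

#include <windows.h>
#include <cstdio>

void Example(std::FILE* f, HWND hwnd)
{
    char raw[512];                          // bytes read from disk: keep as char
    std::size_t n = std::fread(raw, 1, sizeof raw, f);
    (void)n;

    wchar_t title[64];                      // user-visible text: make it wide
    ::GetWindowTextW(hwnd, title, 64);
}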

Is TCHAR still relevant?

I'm new to Windows programming and after reading the Petzold book I wonder:
is it still good practice to use the TCHAR type and the _T() macro to declare strings, or should I just use wchar_t and L"" strings in new code?
I will target only Windows 2000 and up, and my code will be internationalized from the start.
The short answer: NO.
As all the others have already written, a lot of programmers still use TCHARs and the corresponding functions. In my humble opinion the whole concept was a bad idea. UTF-16 string processing is very different from simple ASCII/MBCS string processing. If you use the same algorithms/functions with both of them (this is what the TCHAR idea is based on!), you get very bad performance on the UTF-16 version if you are doing anything more than simple string concatenation (like parsing, etc.). The main reason is surrogates (see the sketch below).
With the sole exception of when you really have to compile your application for a system which doesn't support Unicode, I see no reason to use this baggage from the past in a new application.
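To make "surrogates" concrete, a small sketch (requires C++11): any character outside the Basic Multilingual Plane occupies two UTF-16 code units, so element counts and character counts diverge.

#include <cstdio>
#include <string>

int main()
{
    std::u16string s = u"\U0001F600";         // a single emoji character (U+1F600)
    std::printf("%u\n", (unsigned)s.size());  // prints 2: a surrogate pair
    // s[0] == 0xD83D (high surrogate), s[1] == 0xDE00 (low surrogate), so the
    // "one element == one character" assumption from ASCII/MBCS code silently breaks.
}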
I have to agree with Sascha. The underlying premise of TCHAR / _T() / etc. is that you can write an "ANSI"-based application and then magically give it Unicode support by defining a macro. But this is based on several bad assumptions:
That you actively build both MBCS and Unicode versions of your software
Otherwise, you will slip up and use ordinary char* strings in many places.
That you don't use non-ASCII backslash escapes in _T("...") literals
Unless your "ANSI" encoding happens to be ISO-8859-1, the resulting char* and wchar_t* literals won't represent the same characters.
That UTF-16 strings are used just like "ANSI" strings
They're not. Unicode introduces several concepts that don't exist in most legacy character encodings. Surrogates. Combining characters. Normalization. Conditional and language-sensitive casing rules.
And perhaps most importantly, the fact that UTF-16 is rarely saved on disk or sent over the Internet: UTF-8 tends to be preferred for external representation.
That your application doesn't use the Internet
(Now, this may be a valid assumption for your software, but...)
The web runs on UTF-8 and a plethora of rarer encodings. The TCHAR concept only recognizes two: "ANSI" (which can't be UTF-8) and "Unicode" (UTF-16). It may be useful for making your Windows API calls Unicode-aware, but it's damned useless for making your web and e-mail apps Unicode-aware.
That you use no non-Microsoft libraries
Nobody else uses TCHAR. Poco uses std::string and UTF-8. SQLite has UTF-8 and UTF-16 versions of its API, but no TCHAR. TCHAR isn't even in the standard library, so no std::tcout unless you want to define it yourself.
What I recommend instead of TCHAR
Forget that "ANSI" encodings exist, except for when you need to read a file that isn't valid UTF-8. Forget about TCHAR too. Always call the "W" version of Windows API functions. #define _UNICODE just to make sure you don't accidentally call an "A" function.
Always use UTF encodings for strings: UTF-8 for char strings and UTF-16 (on Windows) or UTF-32 (on Unix-like systems) for wchar_t strings. typedef UTF16 and UTF32 character types to avoid platform differences.
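A short sketch of what that recommendation can look like in practice (the UTF16/UTF32 typedef names come from the answer above; everything else is illustrative, not a prescribed API):

#include <windows.h>

typedef char16_t UTF16;   // C++11; fixed 16-bit code units on every platform
typedef char32_t UTF32;   // C++11; fixed 32-bit code units on every platform

void ShowGreeting()
{
    // Call the "W" API explicitly: no TCHAR, no _T(), no "A" fallback.
    ::MessageBoxW(NULL, L"\u00A1Hola, se\u00F1or!", L"Greeting", MB_OK);
}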
If you're wondering whether it's still used in practice, then yes - it is still used quite a bit. No one will look at your code funny if it uses TCHAR and _T(""). The project I'm working on now is converting from ANSI to Unicode - and we're going the portable (TCHAR) route.
However...
My vote would be to forget all the ANSI/UNICODE portable macros (TCHAR, _T(""), all the _tXXXXXX calls, etc.) and just assume Unicode everywhere. I really don't see the point of being portable if you'll never need an ANSI version. I would use all the wide character functions and types directly. Prepend all string literals with an L.
I would still use the TCHAR syntax if I was doing a new project today. There's not much practical difference between using it and the WCHAR syntax, and I prefer code which is explicit in what the character type is. Since most API functions and helper objects take/use TCHAR types (e.g.: CString), it just makes sense to use it. Plus it gives you flexibility if you decide to use the code in an ASCII app at some point, or if Windows ever evolves to Unicode32, etc.
If you decide to go the WCHAR route, I would be explicit about it. That is, use CStringW instead of CString, and casting macros when converting to TCHAR (eg: CW2CT).
That's my opinion, anyway.
I would like to suggest a different approach (neither of the two).
To summarize, use char* and std::string, assuming UTF-8 encoding, and do the conversions to UTF-16 only when wrapping API functions.
More information and justification for this approach in Windows programs can be found at http://www.utf8everywhere.org.
The Introduction to Windows Programming article on MSDN says
New applications should always call the Unicode versions (of the API).
The TEXT and TCHAR macros are less useful today, because all applications should use Unicode.
I would stick to wchar_t and L"".
TCHAR/WCHAR might be enough for some legacy projects. But for new applications, I would say NO.
All this TCHAR/WCHAR stuff is there for historical reasons. TCHAR provides a seemingly neat way (a disguise) to switch between ANSI text encoding (MBCS) and Unicode text encoding (UTF-16). In the past, people did not have a full understanding of the number of characters in all the languages of the world. They assumed 2 bytes were enough to represent all characters, and thus used a fixed-length character encoding scheme based on WCHAR. However, this has no longer been true since the release of Unicode 2.0 in 1996.
That is to say:
No matter which of CHAR/WCHAR/TCHAR you use, the text processing part of your program should be able to handle variable-length characters for internationalization.
So you actually need to do more than choose one of CHAR/WCHAR/TCHAR for programming in Windows:
If your application is small and does not involve text processing (i.e. just passing text strings around as arguments), then stick with WCHAR, since it is easier that way to work with the WinAPI's Unicode support.
Otherwise, I would suggest using UTF-8 as the internal encoding, storing text in char strings or std::string, and converting to UTF-16 when calling the WinAPI. UTF-8 is now the dominant encoding and there are lots of handy libraries and tools for processing UTF-8 strings.
Check out this wonderful website for more in-depth reading:
http://utf8everywhere.org/
Yes, absolutely; at least for the _T macro. I'm not so sure about the wide-character stuff, though.
The reason is to better support WinCE or other non-standard Windows platforms. If you're 100% certain that your code will remain on NT, then you can probably just use regular C-string declarations. However, it's best to tend towards the more flexible approach, as it's much easier to #define that macro away on a non-Windows platform than to go through thousands of lines of code and add it everywhere in case you need to port some library to Windows Mobile.
IMHO, if there's TCHARs in your code, you're working at the wrong level of abstraction.
Use whatever string type is most convenient for you when dealing with text processing - this will hopefully be something supporting Unicode, but that's up to you. Do conversion at OS API boundaries as necessary.
When dealing with file paths, whip up your own custom type instead of using strings. This will allow you OS-independent path separators, will give you an easier interface to code against than manual string concatenation and splitting, and will be a lot easier to adapt to different OSes (ANSI, UCS-2, UTF-8, whatever).
The only reasons I see to use anything other than the explicit WCHAR are portability and efficiency.
If you want to make your final executable as small as possible use char.
If you don't care about RAM usage and want internationalization to be as easy as simple translation, use WCHAR.
If you want to make your code flexible, use TCHAR.
If you only plan on using Latin characters, you might as well use ASCII/MBCS strings so that your users do not need as much RAM.
For people who are "i18n from the start up", save yourself the source code space and simply use all of the Unicode functions.
TCHAR is not relevant anymore, since now we have UNICODE. You should use UTF-16 wchar_t* strings instead.
The Windows APIs take wchar_t* strings, and those are UTF-16.
Just adding to an old question:
NO
Go start a new CLR C++ project in VS2010. Microsoft themselves use L"Hello World", 'nuff said.
TCHAR has taken on a new meaning: porting from WCHAR back to CHAR.
https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
Recent releases of Windows 10 have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8.
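A hedged sketch of what that enables: once the process's ANSI code page is UTF-8 (CP_UTF8, i.e. 65001, typically opted into via the application manifest as described in the linked article), the -A functions accept UTF-8 bytes directly.

#include <windows.h>
#include <cstdio>

int main()
{
    if (::GetACP() == CP_UTF8) {
        // The char* parameters of -A APIs are now interpreted as UTF-8.
        const char* msg = "caf\xC3\xA9";   // "café" encoded as UTF-8 bytes
        ::MessageBoxA(NULL, msg, "UTF-8 ANSI code page", MB_OK);
    } else {
        std::printf("ANSI code page is %u, not UTF-8\n", ::GetACP());
    }
}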