I want to make my Win32 C++ application able to be played on any encoding version (UNICODE & ANSI). Now I am a little confused as to what exactly is the difference between the two(or more?) encodings?
To make my Win32 application cross-encoding compatible does that mean I have to go through my code & replace every std::string with std::wstring, then replace every char with a wchar_t* and then replace every literal string("") with L""?
What will happen if my application runs on a UNICODE machine & my application has one std::string in it?
Do you have any advice on the steps I need to take to make my application cross-encoding compatible?
For eg:
- Change all c_strings & strings to their UNICODE equivalent
- Change any Win32 functions to the uncide version (eg, change from getenv() to _wgetenv())
What will happen if my application runs on a UNICODE machine & my application has one std::string in it?
Computers are not ANSI or Unicode but the Operating Systems on which the computers operate on are. The last version of Windows that didn't support Unicode was Windows 3.11 for Workgroups. If you run a ASCII compiled application on a UniCode.
What exactly is the difference between the two(or more?) encodings?
What is ASCII?
ASCII is a seven-bit encoding technique which assigns a number to each of the 128 characters used most frequently in American English. This allows most computers to record and display basic text. ASCII does not include symbols frequently used in other countries.
What is Unicode?
One major draw back to ASCII was you could only have 256 different characters. However, languages such as Japanese and Arabic have thousands of characters. Thus ASCII would not work in these situations. The result was Unicode which allowed for up to 65,536 different characters.
Unicode is an attempt by ISO and the Unicode Consortium to develop a coding system for electronic text that includes every written alphabet in existence. Unicode uses 8-, 16-, or 32-bit characters depending on the specific representation, so Unicode documents often require up to twice as much disk space as ASCII or Latin-1 documents. The first 256 characters of Unicode are identical to Latin-1.
In Win32, UNICODE is supported by #define-ing the UNICODE and _UNICODE macros. This, in turn, causes your program to use the Unicode variants of the Win32 functions.
Do you have any advice on the steps I need to take to make my application cross-encoding compatible?
Each Win32 function (that takes or returns a string) has two variants, one for ASCII and one for Unicode. And the function call resolves to one of these, depending on whether or not the UNICODE macro is defined. So you should define the macro and start using the Unicode versions of the functions. for eg:
Replacing every std::string with std::wstring,
Replacing every char with a wchar_t*
Replacing every literal string("") with L""
Making use of the TCHAR support in Windows etc.
as you pointed out are a list of things that you will have to take care of, mind you this is not the complete list.
Basically, You will have to use all the Unicode versions of the types and function calls in your code.
The last version of Windows that did not use Unicode internally was Windows ME. The recommendation for new code is to use Unicode exclusively. Some conversion may be necessary when you need to read and write files that are encoded with a specific code page.
You're on the right track with your initial thoughts. If you're using Microsoft's CString, it comes in two versions CStringA and CStringW - you need to change one compiler definition and it will use CStringW in every place that you specify CString, and everything will just work. You should use std::wstring instead of std::string. Prefix every string literal with L"" or use Microsoft's macro _T("") which will convert to the same thing.
When you compile a program for ANSI or Unicode, you're affecting two things.
Which set of APIs get called. Suppose your code calls CreateFile(). The actual API called is either CreateFileA() or CreateFileW() (ANSI or Wide (i.e. Unicode)) depending on your compiler setting. Internally the NT kernal uses Unicde for all APIs. The ANSI APIs simply convert their string parameters to ANSI and call the Unicode APIs. Many APIs are Unicode only.
How T* macros are expanded. TCHAR will eventually be expanded to char in ANSI mode, wchar_t in Unicode mode.
Things like std::string and std::wstring are not affected until you need to call an API and want to pass a string to them. The use of string vs. wstring should be determined by your program's needs and not whether it's compiled ANSI or Unicode.
You can use ATL to easily convert strings as necessary.
// assume compiled for Unicode
#include <atlbase.h>
void myfunc() {
USES_CONVERSION;
std::string filename = "...";
HANDLE hFile = CreateFile(A2W(filename.c_str()), ...
or, if you prefer, you can use A2T() and your code will work whether it's compiled for ANSI or Unicode.
You can use TCHAR in your case.
In UNICODE, TCHAR is WCHAR.
In not UNICODE, TCHAR is CHAR.
If you want to use std::string, I recommend you the following usage.
#ifdef UNICODE
#define std::tstring str::wstring
#else
#define std::tstring str::string
#endif
and,
Use std::tstring in your program.
Related
I have a C++ Native WinAPI application that strictly uses Unicode functions and data types. Ie, CreateWindowW(), SendMessageW(), wstring, WCHAR, etc. Now I intend to expand my application to use SQLite3.
My Problem: The SQLite3 library is ANSI. Which means I have to use char* as most function parameters.
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
If there are what might these impacts be?
SQLite is not restricted to ANSI. It is a misconception that char* implies ANSI encoded text. Not all functions that operate on char* data assume that the data is ANSI encoded. In the case of SQLite it fully supports Unicode and does so using char* data encoded using UTF-8.
If you intend to continue using UTF-16 encoded text internal to your application you'll need to add an adapter layer at the boundary between your code and the SQLite code. Convert from UTF-16 to UTF-8 when passing data to SQLite, and the opposite direction when receiving.
Which to my mind renders the question that you asked somewhat moot, but I'll address that anyway:
Are there any limitations or negative impacts from using ANSI Functions in a Unicode Application?
The most obvious drawbacks of using ANSI functions are:
Severely restricted character set.
Performance cost when converting between different character sets.
Risk of programmer confusion and errors due to using multiple character sets in a single codebase.
No limitation, you can use ANSI strings in Unicode applications.
Some details: Unicode application is compile-time definition. At run time, program can work both with Unicode and ANSI strings.
For example:
char* ptr1; // this is always ANSI string
wchar_t* ptr2; // this is always Unicode string
TCHAR* ptr3; // this is generic string, which is compiled as char* or wchar_t*
Unicode/ANSI configuration differs by interpreting a generic text macros, like TCHAR. Some Windows API are also implemented using generic text macros. For example: SetWindowText is actually macro, which is expanded to SetWindowTextA in ANSI configuration, and to SetWindowTextW in Unicode configuration.
Any non-generic string or API name (like char*, SetWindowTextW etc.) works by the same way in any program configuration.
Use ATL conversion macros to convert between different (generic and non-generic) string types: http://msdn.microsoft.com/en-us/library/87zae4a3%28v=vs.80%29.aspx
You can use Ansi-based APIs in a Unicode application. Simply convert your input Unicode strings to Ansi when passing them to the API, and convert any output Ansi strings to Unicode upon return from the API. You can use WideCharToMultiByte() and MultiByteToWideChar() for that, or higher-level wrappers like CString, ATL conversions, etc.
TCHAR szExeFileName[MAX_PATH];
GetModuleFileName(NULL, szExeFileName, MAX_PATH);
CString tmp;
lstrcpy(szExeFileName, tmp);
CString out;
out.Format("\nInstall32 at %s\n", tmp);
TRACE(tmp);
Error (At the Format):
error C2664: 'void ATL::CStringT<BaseType,StringTraits>::Format(const wchar_t
*,...)' : cannot convert parameter 1 from 'const char [15]' to 'const wchar_t
I'd just like to get the current path that this program was launched from and copy it into a CString so I can use it elsewhere. I am currently just try to get to see the path by TRACE'ing it out. But strings, chars, char arrays, I can't ever get all the strait. Could someone give me a pointer?
The accepted answer addresses the problem. But the question also asked for a better understanding of the differences among all the character types on Windows.
Encodings
A char on Windows (and virtually all other systems) is a single byte. A byte is typically interpreted as either an unsigned value [0..255] or a signed value [-128..127]. (Older C++ standards guarantees a signed range of only [-127..127], but most implementations give [-128..127]. I believe C++11 guarantees the larger range.)
ASCII is a character mapping for values in the range [0..127] to particular characters, so you can store an ASCII character in either a signed byte or an unsigned byte, and thus it will always fit in a char.
But ASCII doesn't have all the characters necessary for most languages, so the character sets were often extended by using the rest of the values available in a byte to represent the additional characters needed for certain languages (or families of languages). So, while [0..127] almost always mean the same thing, values like 150 can only be interpreted in the context of a particular encoding. For single-byte alphabets, these encodings are called code pages.
Code pages helped, but they didn't solve all the problems. You always had to know which code page a particular document used in order to interpret it correctly. Furthermore, you typically couldn't write a single document that used different languages.
Also, some languages have more than 256 characters, so there was no way to map one char to one character. This led to the development of multi-byte character encodings, where [0..127] is still ASCII, but some of the other values are "escapes" that mean you have to look at some number of following chars to figure out what character you really had. (It's best to think of multi-byte as variable-byte, as some characters require only one byte while other require two or more.) Multi-byte works, but it's a pain to code for.
Meanwhile, memory was becoming more plentiful, so a bunch of organizations got together and created Unicode, with the goal of making a universal mapping of values to characters (for appropriately vague definitions of "characters"). Initially, it was believed that all characters (or at least all the ones anyone would ever use) would fit into 16-bit values, which was nice because you wouldn't have to deal with multi-byte encodings--you'd just use two bytes per character instead of one. About this time, Microsoft decided to adopt Unicode as the internal representation for text in Windows.
WCHAR
So Windows has a type called WCHAR, a two-byte value that represents a "Unicode" "character". I'm using quotation marks here because Unicode evolved past the original two-byte encoding, so what Windows calls "Unicode" isn't really Unicode today--it's actually a particular encoding of Unicode called UTF-16. And a "character" is not as simple a concept in Unicode as it was in ASCII, because, in some languages, characters combine or otherwise influence adjacent characters in interesting ways.
Newer versions of Windows used these 16-bit WCHAR values for text internally, but there was a lot of code out there still written for single-byte code pages, and even some for multi-byte encodings. Those programs still used chars rather than WCHARs. And many of these programs had to work with people using older versions of Windows that still used chars internally as well as newer ones that use WCHAR. So a technique using C macros and typedefs was devised so that you could mostly write your code one way and--at compile time--choose to have it use either char or WCHAR.
TCHAR
To accomplish this flexibility, you use a TCHAR for a "text character". In some header file (often <tchar.h>), TCHAR would be typedef'ed to either char or WCHAR, depending on the compile time environment. Windows headers adopted conventions like this:
LPTSTR is a (long) pointer to a string of TCHARs.
LPWSTR is a (long) pointer to a string of WCHARs.
LPSTR is a (long) pointer to a string of chars.
(The L for "long" is a leftover from 16-bit days, when we had long, far, and near pointers. Those are all obsolete today, but the L prefix tends to remain.)
Most of the Windows API functions that take and return strings were actually replaced with two versions: the A version (for "ANSI" characters) and the W version (for wide characters). (Again, historical legacy shows in these. The code pages scheme was often called ANSI code pages, though I've never been clear if they were actually ruled by ANSI standards.)
So when you call a Windows API like this:
SetWindowText(hwnd, lptszTitle);
what you're really doing is invoking a preprocessor macro that expands to either SetWindowTextA or SetWindowTextW. It should be consistent with however TCHAR is defined. That is, if you want strings of chars, you'll get the A version, and if you want strings of WCHARs, you get the W version.
But it's a little more complicated because of string literals. If you write this:
SetWindowText(hwnd, "Hello World"); // works only in "ANSI" mode
then that will only compile if you're targeting the char version, because "Hello World" is a string of chars, so it's only compatible with the SetWindowTextA version. If you wanted the WCHAR version, you'd have to write:
SetWindowText(hwnd, L"Hello World"); // only works in "Unicode" mode
The L here means you want wide characters. (The L actually stands for long, but it's a different sense of long than the long pointers above.) When the compiler sees the L prefix on the string, it knows that string should be encoded as a series of wchar_ts rather than chars.
(Compilers targeting Windows use a two-byte value for wchar_t, which happens to be identical to what Windows defined a WCHAR. Compilers targeting other systems often use a four-byte value for wchar_t, which is what it really takes to hold a single Unicode code point.)
So if you want code that can compile either way, you need another macro to wrap the string literals. There are two to choose from: _T() and TEXT(). They work exactly the same way. The first comes from the compiler's library and the second from the OS's libraries. So you write your code like this:
SetWindowText(hwnd, TEXT("Hello World")); // compiles in either mode
If you're targeting chars, the macro is a no-op that just returns the regular string literal. If you're targeting WCHARs, the macro prepends the L.
So how do you tell the compiler that you want to target WCHAR? You define UNICODE and _UNICODE. The former is for the Windows APIs and the latter is for the compiler libraries. Make sure you never define one without the other.
My guess is you are compiling in Unicode mode.
Try enclosing your format string in the _T macro, which is designed to provide an always-correct method of providing constant string parameters, regardless of whether you're compiling in Unicode or ANSI mode:
out.Format(_T("\nInstall32 at %s\n"), tmp);
I want to develope an application in Linux. I want to use wstring beacuse my application should supports unicode and I don't want to use UTF-8 strings.
In Windows OS, using wstring is easy. beacuse any ANSI API has a unicode form. for example there are two CreateProcess API, first API is CreateProcessA and second API is CreateProcessW.
wstring app = L"C:\\test.exe";
CreateProcess
(
app.c_str(), // EASY!
....
);
But it seems working with wstring in Linux is complicated! for example there is an API in Linux called parport_open (It just an example).
and I don't know how to send my wstring to this API (or APIs like parport_open that accept a string parameter).
wstring name = L"myname";
parport_open
(
0, // or a valid number. It is not important in this question.
name.c_str(), // Error: because type of this parameter is char* not wchat_t*
....
);
My question is how can I use wstring(s) in Linux APIs?
Note: I don't want to use UTF-8 strings.
Thanks
Linux APIs (on recent kernels and with correct locale setting) on almost every distribution use UTF-8 strings by default1. You too should use them inside your code. Resistance is futile.
The wchar_t (and thus wstring) on Windows were convenient only when Unicode was limited to 65536 characters (i.e. wchar_t were used for UCS-2), now that the 16-bit Windows wchar_t are used for UTF-16 the advantage of 1 wchar_t=1 Unicode character is long gone, so you have the same disadvantages of using UTF-8. Nowadays IMHO the Linux approach is the most correct. (Another answer of mine on UTF-16 and why Windows and Java use it)
By the way, both string and wstring aren't encoding-aware, so you can't reliably use any of these two to manipulate Unicode code points. I heard that wxString from the wxWidgets toolkit handles UTF-8 nicely, but I never did extensive research about it.
actually, as pointed out below, the kernel aims to be encoding-agnostic, i.e. it treats the strings as opaque sequences of (NUL-terminated?) bytes (and that's why encodings that use "larger" character types like UTF-16 cannot be used). On the other hand, wherever actual string manipulation is done, the current locale setting is used, and by default on almost any modern Linux distribution it is set to UTF-8 (which is a reasonable default to me).
I don't want to use UTF-8 strings.
Well, you will need to overcome that reluctance, at least when calling the APIs. Linux uses single byte string encodings, invariably UTF-8. Clearly you should use a single byte string type since you obviously can't pass wide characters to a function that expects char*. Use string rather than wstring.
I'm currently taking care of an application that uses std::string and char for string operations - which is fine on linux, since Linux is agnostic to Unicode (or so it seems; I don't really know, so please correct me if I'm telling stories here). This current style naturally leads to this kind of function/class declarations:
std::string doSomethingFunkyWith(const std::string& thisdata)
{
/* .... */
}
However, if thisdata contains unicode characters, it will be displayed wrongly on windows, since std::string can't hold unicode characters on Windows.
So I thought up this concept:
namespace MyApplication {
#ifdef UNICODE
typedef std::wstring string_type;
typedef wchar_t char_type;
#else
typedef std::string string_type;
typedef char char_type;
#endif
/* ... */
string_type doSomethingFunkyWith(const string_type& thisdata)
{
/* ... */
}
}
Is this a good concept to go with to support Unicode on windows?
My current toolchain consists of gcc/clang on Linux, and wine+MinGW for Windows support (crosstesting also happens via wine), if that matters.
Multiplatform issues comes from the fact that there are many encodings, and a wrong encoding pick will lead to encóding Ãssues. Once you tackle that problem, you should be able to use std::wstring on all your program.
The usual workflow is:
raw_input_data = read_raw_data()
input_encoding = "???" // What is your file or terminal encoding?
unicode_data = convert_to_unicode(raw_input_data, input_encoding)
// Do something with the unicode_data, store in some var, etc.
output_encoding = "???" // Is your terminal output encoding the same as your input?
raw_output_data = convert_from_unicode(unicode_data, output_encoding)
print_raw_data(raw_data)
Most Unicode issues comes from wrongly detecting the values of input_encoding and output_encoding. On a modern Linux distribution this is usually UTF-8. On Windows YMMV.
Standard C++ don't know about encodings, you should use some library like ICU to do the conversion.
How you store a string within your application is entirely up to you -- after all, nobody would know as long as the strings stay within your application. The problem starts when you try to read or write strings from the outside world (console, files, sockets etc.) and this is where the OS matters.
Linux isn't exactly "agnostic" to Unicode -- it does recognize Unicode but the standard library functions assume UTF-8 encoding, so Unicode strings fit into standard char arrays. Windows, on the other hand, uses UTF-16 encoding, so you need a wchar_t array to represent 16-bit characters.
The typedefs you proposed should work fine, but keep in mind that this alone doesn't make your code portable. As an example, if you want to store text in files in a portable manner, you should choose one encoding and stick to it across all platforms -- this could require converting between encodings on certain platforms.
Linux does support Unicode, it simply uses UTF-8. Probably a better way to make your system portable would be to make use of International Components for Unicode and treat all std::string objects as containing UTF-8 characters, and convert them to UTF-16 as needed when invoking Windows functions. It almost always makes sense to use UTF-8 over UTF-16, as UTF-8 uses less space for some of the most commonly used characters (e.g. English*) and more space for less frequent characters, whereas UTF-16 wastes space equally for all characters, no matter how frequently they are used.
While you can use your typedefs, this will mean that you have to write two copies of every single function that has to deal with strings. I think it would be more efficient to simply do all internal computations in UTF-8 and simply translate that to/from UTF-16 if necessary when inputting/outputting as needed.
*For HTML, XML, and JSON that use English as part of the encoding (e.g. "<html>, <body>, etc.) regardless of the language of the values, this can still be a win for foreign languages.
The problem for Linux and using Unicode is that all the IO and most system functions use UTF-8 and the wide character type is 32 bit. Then there is interfacing to Java and other programs which requires UTF-16.
As a suggestion for Unicode support, see the OpenRTL library at http://code.google.com/p/openrtl which supports all UTF-8, UTF-16 and UTF-32 on windows, Linux, Osx and Ios. The Unicode support is not just the character types, but also Unicode collation, normalization, case folding, title casing and about 64 different Unicode character properties per full unsigned 32 bit character.
The OpenRTL code is ready now to support char8_t, char16_t and char32_t for the new C++ standards as well, allthough the same character types are supported using macros for existing C and C++ compilers. I think for Unicode and strings processing that it might be what you want for your library.
The point is that if you use OpenRTL, you can build the system using the OpenRTL "char_t" type. This supports the notion that your entire library can be built in either UTF8, UTF16 or UTF32 mode, even on Linux, because OpenRTL is already handling all the interfacing to a lot of system functions like files and io stuff. It has its own print_f functions for example.
By default the char_t is mapping to the wide character type. So on windows it is 32 bit and on Linux it is 32 bit. But you can make it also make it 8 bit everywhere for example. Also it has the support to do fast UTF decoding inside loops using macros.
So instead of ifdeffing between wchar_t and char, you can build everything using char_t and OpenRTL takes care of the rest.
I am in the process of learning C++ and came across an article on the MSDN here:
http://msdn.microsoft.com/en-us/magazine/dd861344.aspx
In the first code example the one line of code which my question relates to is the following:
VERIFY(SetWindowText(L"Direct2D Sample"));
More specifically that L prefix. I had a little read up, and correct me if I am wrong :-), but this is to allow for unicode strings, i.e. to prep for a long character set. Now in during my read up on this I came across another article on Adavnced String Techniques in C here http://www.flipcode.com/archives/Advanced_String_Techniques_in_C-Part_I_Unicode.shtml
It says there are a few options including the inclusion of the header:
#define UNICODE
OR
#define _UNICODE
in C , again point out if I am wrong, appreciate your feedback. Further it shows the datatype suitable for these unicode strings being:
wchar_t
It throws into the mix a macro and a kind of hybrid datatype, the macro being:
_TEXT(t)
which simply prefixes the string with the L and the hybrid data type as
TCHAR
Which it points out will allow for unicode if the header is there and ASCII if not. Now my question is, or more of an asumption which I would like to confirm, would Microsoft use this TCHAR data type which is more flexible or is there any benefit to committing to using the wchar_t.
Also when I say does Microsoft use this, more specifically for exmaple in the ATL and WTL libraries, do anyone of yourselves have preference or have some advice regarding this?
Cheers,
Andrew
For all new software you should define UNICODE and use wchar_t directly. Using ANSI stirngs will come back to haunt you.
You should just use wchar_t and the wide versions of all the CRT functions (ex: wcscmp instead of strcmp). The TEXT macros and TCHAR etc just exist if your code needs to work in both ANSI and UNICODE environments which I feel code rarely needs to do.
When you create a new windows application using Visual Studio UNICODE is automatically defined and wchar_t will work like a built-in.
Short answer: the hybrid infrastructure with the TCHAR type, the _TEXT() macro and the various _t* functions (_tcscpy comes to mind) are a throwback to the times when Microsoft had two platforms coexisting:
The Windows NT line was based on the Unicode string representation
The Windows 95/98/ME line was based on ANSI string representation.
String representation here means that all the Windows APIs that expected or returned string to your app used one or the other representation for these strings. COM added even more confusion as it was available on both platforms -- and expected Unicode strings on both!
In those old times it was encouraged that you write "portable" code: you were instructed to use the hybrid infrastructure for your strings so that you can compile for both models just by defining/undefining UNICODE and/or _UNICODE for your app.
As the Windows9x line is no more relevant (for the vast majority of the apps anyway) you can safely ignore the ANSI world and use the Unicode strings directly.
Beware though that Unicode has multiple representations today: as it is pointed out above the Unicode convention implied by wchar_t is the UCS-2 representation (all characters encoded in 16-bit words). There are other, widely used representations where this is not necessarily true.
On Windows it's wchar_t with UTF-16 (2 bytes) encoding.
Source : http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm
TCHAR changes its type depending if UNICODE is defined, and should be used when you want code that you can compile for UNICODE and non-UNICODE.
If you want to explicitly process UNICODE data only, then feel free to use wchar_t.