WNDCLASSEX vs WNDCLASSEXW? [duplicate] - c++

What is the difference in calling the Win32 API function that have an A character appended to the end as opposed to the W character.
I know it means ASCII and WIDE CHARACTER or Unicode, but what is the difference in the output or the input?
For example, If I call GetDefaultCommConfigA, will it fill my COMMCONFIG structure with ASCII strings instead of WCHAR strings? (Or vice-versa for GetDefaultCommConfigW)
In other words, how do I know what Encoding the string is in, ASCII or UNICODE, it must be by the version of the function I call A or W? Correct?
I have found this question, but I don't think it answers my question.

The A functions use Ansi (not ASCII) strings as input and output, and the W functions use Unicode string instead (UCS-2 on NT4 and earlier, UTF-16 on W2K and later). Refer to MSDN for more details.

Related

How to create multibyte characters in C

During my study of character encoding in C and C++ I came across two general ways of encoding: multibyte characters and wide characters. In order to strengthen my understanding of those systems (benefits and drawbacks) I wanted to do some examples.
Doing examples with wide characters is not a problem due to the native support with the wchar_t type. But when I wanted to create a string which contains those so called multibyte characters I came to a problem.
How can I actually create a multibyte character string which uses an encoding that works with a char array (using Visual C++)? This kind of encoding sure does exist: http://www.gnu.org/software/libc/manual/html_node/Shift-State.html. But I read only about it and never saw an actual example. Or do you have to create your own encoding for this kind of string?
If you are able to create a wide character string literal, simply omitting the L should give you a multibyte character string literal with an implementation defined encoding (gcc has an option to chose it, I don't know about visual C++).
If you have a wide character string, you can get the equivalent multibyte string according to the C locale using the functions wcstombs (in <stdlib.h>) and wcsrtombs (in <wchar.h>).
C++ locale system also provides a way to do that conversion. (Look for the in and out member of the codecvt facet, I won't provide here a tutorial on their use, the site cppreference has example codes, for instance for out).
I'm not sure you'll be able to find easily support either on Unix or on Windows for an encoding with a shift state. You should search for encoding for China, Japan, Korea, Vietman (for instance ISO 2022-JP, but it seems to me that Unix tend to use EUC-JP instead and Windows Shift JIS).

Strings. TCHAR LPWCS LPCTSTR CString. Whats what here, simple quick

TCHAR szExeFileName[MAX_PATH];
GetModuleFileName(NULL, szExeFileName, MAX_PATH);
CString tmp;
lstrcpy(szExeFileName, tmp);
CString out;
out.Format("\nInstall32 at %s\n", tmp);
TRACE(tmp);
Error (At the Format):
error C2664: 'void ATL::CStringT<BaseType,StringTraits>::Format(const wchar_t
*,...)' : cannot convert parameter 1 from 'const char [15]' to 'const wchar_t
I'd just like to get the current path that this program was launched from and copy it into a CString so I can use it elsewhere. I am currently just try to get to see the path by TRACE'ing it out. But strings, chars, char arrays, I can't ever get all the strait. Could someone give me a pointer?
The accepted answer addresses the problem. But the question also asked for a better understanding of the differences among all the character types on Windows.
Encodings
A char on Windows (and virtually all other systems) is a single byte. A byte is typically interpreted as either an unsigned value [0..255] or a signed value [-128..127]. (Older C++ standards guarantees a signed range of only [-127..127], but most implementations give [-128..127]. I believe C++11 guarantees the larger range.)
ASCII is a character mapping for values in the range [0..127] to particular characters, so you can store an ASCII character in either a signed byte or an unsigned byte, and thus it will always fit in a char.
But ASCII doesn't have all the characters necessary for most languages, so the character sets were often extended by using the rest of the values available in a byte to represent the additional characters needed for certain languages (or families of languages). So, while [0..127] almost always mean the same thing, values like 150 can only be interpreted in the context of a particular encoding. For single-byte alphabets, these encodings are called code pages.
Code pages helped, but they didn't solve all the problems. You always had to know which code page a particular document used in order to interpret it correctly. Furthermore, you typically couldn't write a single document that used different languages.
Also, some languages have more than 256 characters, so there was no way to map one char to one character. This led to the development of multi-byte character encodings, where [0..127] is still ASCII, but some of the other values are "escapes" that mean you have to look at some number of following chars to figure out what character you really had. (It's best to think of multi-byte as variable-byte, as some characters require only one byte while other require two or more.) Multi-byte works, but it's a pain to code for.
Meanwhile, memory was becoming more plentiful, so a bunch of organizations got together and created Unicode, with the goal of making a universal mapping of values to characters (for appropriately vague definitions of "characters"). Initially, it was believed that all characters (or at least all the ones anyone would ever use) would fit into 16-bit values, which was nice because you wouldn't have to deal with multi-byte encodings--you'd just use two bytes per character instead of one. About this time, Microsoft decided to adopt Unicode as the internal representation for text in Windows.
WCHAR
So Windows has a type called WCHAR, a two-byte value that represents a "Unicode" "character". I'm using quotation marks here because Unicode evolved past the original two-byte encoding, so what Windows calls "Unicode" isn't really Unicode today--it's actually a particular encoding of Unicode called UTF-16. And a "character" is not as simple a concept in Unicode as it was in ASCII, because, in some languages, characters combine or otherwise influence adjacent characters in interesting ways.
Newer versions of Windows used these 16-bit WCHAR values for text internally, but there was a lot of code out there still written for single-byte code pages, and even some for multi-byte encodings. Those programs still used chars rather than WCHARs. And many of these programs had to work with people using older versions of Windows that still used chars internally as well as newer ones that use WCHAR. So a technique using C macros and typedefs was devised so that you could mostly write your code one way and--at compile time--choose to have it use either char or WCHAR.
TCHAR
To accomplish this flexibility, you use a TCHAR for a "text character". In some header file (often <tchar.h>), TCHAR would be typedef'ed to either char or WCHAR, depending on the compile time environment. Windows headers adopted conventions like this:
LPTSTR is a (long) pointer to a string of TCHARs.
LPWSTR is a (long) pointer to a string of WCHARs.
LPSTR is a (long) pointer to a string of chars.
(The L for "long" is a leftover from 16-bit days, when we had long, far, and near pointers. Those are all obsolete today, but the L prefix tends to remain.)
Most of the Windows API functions that take and return strings were actually replaced with two versions: the A version (for "ANSI" characters) and the W version (for wide characters). (Again, historical legacy shows in these. The code pages scheme was often called ANSI code pages, though I've never been clear if they were actually ruled by ANSI standards.)
So when you call a Windows API like this:
SetWindowText(hwnd, lptszTitle);
what you're really doing is invoking a preprocessor macro that expands to either SetWindowTextA or SetWindowTextW. It should be consistent with however TCHAR is defined. That is, if you want strings of chars, you'll get the A version, and if you want strings of WCHARs, you get the W version.
But it's a little more complicated because of string literals. If you write this:
SetWindowText(hwnd, "Hello World"); // works only in "ANSI" mode
then that will only compile if you're targeting the char version, because "Hello World" is a string of chars, so it's only compatible with the SetWindowTextA version. If you wanted the WCHAR version, you'd have to write:
SetWindowText(hwnd, L"Hello World"); // only works in "Unicode" mode
The L here means you want wide characters. (The L actually stands for long, but it's a different sense of long than the long pointers above.) When the compiler sees the L prefix on the string, it knows that string should be encoded as a series of wchar_ts rather than chars.
(Compilers targeting Windows use a two-byte value for wchar_t, which happens to be identical to what Windows defined a WCHAR. Compilers targeting other systems often use a four-byte value for wchar_t, which is what it really takes to hold a single Unicode code point.)
So if you want code that can compile either way, you need another macro to wrap the string literals. There are two to choose from: _T() and TEXT(). They work exactly the same way. The first comes from the compiler's library and the second from the OS's libraries. So you write your code like this:
SetWindowText(hwnd, TEXT("Hello World")); // compiles in either mode
If you're targeting chars, the macro is a no-op that just returns the regular string literal. If you're targeting WCHARs, the macro prepends the L.
So how do you tell the compiler that you want to target WCHAR? You define UNICODE and _UNICODE. The former is for the Windows APIs and the latter is for the compiler libraries. Make sure you never define one without the other.
My guess is you are compiling in Unicode mode.
Try enclosing your format string in the _T macro, which is designed to provide an always-correct method of providing constant string parameters, regardless of whether you're compiling in Unicode or ANSI mode:
out.Format(_T("\nInstall32 at %s\n"), tmp);

dmDeviceName is just 'c'

I'm trying to get the names of each of my monitors using DEVMODE.dmDeviceName:
dmDeviceName
A zero-terminated character array that specifies the "friendly" name of the printer or display; for example, "PCL/HP LaserJet" in the case of PCL/HP LaserJet. This string is unique among device drivers. Note that this name may be truncated to fit in the dmDeviceName array.
I'm using the following code:
log.printf("Device Name: %s",currDevMode.dmDeviceName);
But for every monitor, the name is printed as just c. All other information from DEVMODE seems to print ok. What's going wrong?
Most likely you are using the Unicode version of the structure and thus are passing wide characters to printf. Since you use a format string that implies char data there is a mis-match.
The UTF-16 encoding results in every other byte being 0 for characters in the ASCII range and so printf thinks that the second byte of the first two byte character is actually a null-terminator.
This is the sort of problem that you get with printf which of course has no type-safety. Since you are using C++ it's probably worth switching to iostream based I/O.
However, if you want to use ANSI text, as you indicate in a comment, then the simplest solution is to use the ANSI DEVMODEA version of the struct and the corresponding A versions of the API functions, e.g. EnumDisplaySettingsA, DeviceCapabilitiesA.
dmDeviceName is TCHAR[] so if you're compiling for unicode, the first wide character will be interpreted as a 'c' followed by a zero terminator.
You will need to convert it to ascii or use unicode capable printing routines.

Using Different Character Encodings

Recently, I have gotten interested in Text Encoding. As you know, there are many kinds of Text Encoding such as CRC949, UTF-8 and so on.
I am wondering how to express them properly. (To the screen and users.) I mean, they are different from each other. I remember there was particular way to express text accrording to encoding in C#.
Is it possible one can use just simple printf() in C to express string regardless of encoding? Does the compiler automatically do it?
Read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
From the article:
We decided to do everything internally in UCS-2 (two byte) Unicode,
which is what Visual Basic, COM, and Windows NT/2000/XP use as their
native string type. In C++ code we just declare strings as wchar_t
("wide char") instead of char and use the wcs functions instead of the
str functions (for example wcscat and wcslen instead of strcat and
strlen). To create a literal UCS-2 string in C code you just put an L
before it as so: L"Hello".

Is a wide character string literal starting with L like L"Hello World" guaranteed to be encoded in Unicode?

I've recently tried to get the full picture about what steps it takes to create platform independent C++ applications that support unicode. A thing that is confusing to me is that most howtos and stuff equalize the character encoding (i.e. ANSI or Unicode) and the character type (char or wchar_t). As I've learned so far, these are different things and there may exist a character sequence encodeded in Unicode but represented by std::string as well as a character sequence encoded in ANSI but represented as std::wstring, right?
So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L or does it just say it's of type wchar_t with implementation specific character encoding?
If there is no such guaranty, does that mean I need some sort of external resource system to provide non ASCII string literals for my application in a platform independent way?
What is the prefered way for this? Resource system or proper encoding of source files plus proper compiler options?
The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode, (however C++11 defines Unicode char types and string literals) so it's up to you to properly represent Unicode strings in C++03.
Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide-characters.
The standard also mentions "universal-character-names" which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, you can at least guarantee that your string will be represented as a certain sequence of bytes by using universal-character-names. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.
As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.
The C++03 does not mention unicode (future C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform specific code. As others have mentioned, wchar_t encoding (or even size) is not specified. Consequently, string literal encoding is implementation specific. However, you can give unicode codepoints in string literals by using \x \u \U escapes.
Typically unicode apps in Windows use wchar_t (with UTF-16 encoding) as internal character format, because it makes using Windows APIs easier as Windows itself uses UTF-16. Unix/Linux unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is usual choice for data transfer encoding.
The standard makes no mention of encoding formats for strings.
Take a look at ICU from IBM (its free). http://site.icu-project.org/