Different char types in Windows programming - C++

Recently I ran into some tasks involving chars and strings on the Windows platform. I see that there are different char types such as char, TCHAR, WCHAR, LPSTR, LPWSTR and LPCTSTR. Can someone give me some information about them, and explain how to use them like the regular char and char *? I am confused about these types.

They are documented on MSDN. Here are a few:
TCHAR: A WCHAR if UNICODE is defined, a CHAR otherwise.
WCHAR: A 16-bit Unicode character.
CHAR: An 8-bit Windows (ANSI) character.
LPTSTR: An LPWSTR if UNICODE is defined, an LPSTR otherwise.
LPSTR: A pointer to a null-terminated string of 8-bit Windows (ANSI) characters.
LPWSTR: A pointer to a null-terminated string of 16-bit Unicode characters.
LPCTSTR: An LPCWSTR if UNICODE is defined, an LPCSTR otherwise.
LPCWSTR: A pointer to a constant null-terminated string of 16-bit Unicode characters.
LPCSTR: A pointer to a constant null-terminated string of 8-bit Windows (ANSI) characters.
Note that some of these types map to something different depending on whether UNICODE has been #define'd. When UNICODE is not defined, they resolve to the ANSI versions:
#include <windows.h>
// LPCTSTR resolves to LPCSTR
When you #define UNICODE before #include <windows.h>, they resolve to the Unicode versions.
#define UNICODE
#include <windows.h>
// LPCTSTR resolves to LPCWSTR
They are in reality typedefs to some fundamental types in the C and C++ language. For example:
typedef char CHAR;
typedef wchar_t WCHAR;
On compilers like Visual C++, there's really no difference between an LPCSTR and a const char*, or between an LPCWSTR and a const wchar_t*. This might differ between compilers, however, which is why these data types exist in the first place!
It's sort of like the Windows API equivalent of <cstdint> or <stdint.h>. The Windows API has bindings in other languages, and having data types with a known size is useful, if not required.
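To make the UNICODE switch described above concrete, here is a minimal portable sketch of the same idea. It does not use <windows.h>; DEMO_UNICODE, DEMO_TCHAR, DEMO_TEXT, DEMO_LPTSTR, DEMO_LPCTSTR and demo_length are hypothetical stand-ins invented for illustration, not real Windows names.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the UNICODE switch. Comment it out to get the
// narrow (char) variants instead.
#define DEMO_UNICODE

#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;       // plays the role of TCHAR resolving to WCHAR
#define DEMO_TEXT(s) L##s         // plays the role of TEXT("...")
#else
typedef char DEMO_TCHAR;          // plays the role of TCHAR resolving to CHAR
#define DEMO_TEXT(s) s
#endif

typedef DEMO_TCHAR*       DEMO_LPTSTR;   // like LPTSTR
typedef const DEMO_TCHAR* DEMO_LPCTSTR;  // like LPCTSTR

// Counts code units up to the terminating null, for either character width.
std::size_t demo_length(DEMO_LPCTSTR s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

Code written against DEMO_TCHAR and DEMO_TEXT compiles unchanged in both configurations, which is the whole point of the T-types.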

char is the standard 8-bit character type.
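wchar_t is, on Windows, a 16-bit character type that holds UTF-16 code units. The wide-character API is native to the Windows NT family; Windows 95 implemented only a small part of it. WCHAR is another name for it.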
TCHAR can be either one, depending on your compiler settings. Most of the time in a modern program it's wchar_t.
The P and LP prefixes are pointers to the different types. The L is legacy (it stands for Long pointer) and became obsolete with 32-bit Windows (Windows 95 and NT); you still see it quite a bit though.
The C after the prefix stands for const.

TCHAR, LPTSTR and LPCTSTR are all generic types that will be either regular character strings or wide character strings depending on whether or not the UNICODE define is set. CHAR, LPSTR and LPCSTR are regular character strings. WCHAR, LPWSTR and LPCWSTR are wide character strings. TCHAR, CHAR and WCHAR represent a single character. LPTSTR, LPSTR and LPWSTR are "Long Pointer to STRing". LPCTSTR, LPCSTR and LPCWSTR are constant string pointers.

Let me try to shed some light (I've blogged this on my site at https://www.dima.to/blog/?p=190 in case you want to check it out):
#include "stdafx.h"
#include "Windows.h"
int _tmain(int argc, _TCHAR* argv[])
{
/* Quick Tutorial on Strings in Microsoft Visual C++
The Unicode Character Set and Multibyte Character Set options in MSVC++ provide a project with two flavours of string encodings. They will use different encodings for characters in your project. Here are the two main character types in MSVC++ that you should be concerned about:
1. char <-- char characters use an 8-bit character encoding (8 bits = 1 byte) according to MSDN.
2. wchar_t <-- wchar_t uses a 16-bit character encoding (16 bits = 2 bytes) according to MSDN.
From above, we can see that the size of each character in our strings will change depending on our chosen character set.
WARNING: Do NOT assume that every character in a Multibyte or Unicode string occupies exactly one char or one wchar_t! That depends on the encoding. In a Multibyte string, a single character may take more than one byte, so a given byte is not necessarily a whole character. MSDN says: "A multibyte-character string may contain a mixture of single-byte and double-byte characters. A two-byte multibyte character has a lead byte and a trail byte." Likewise, in a Unicode (UTF-16) string, some characters need two wchar_t values.
WARNING: Do NOT assume that Unicode contains every character for every language. For more information, please see http://stackoverflow.com/questions/5290182/how-many-bytes-takes-one-unicode-character.
Note: The ASCII Character Set is a subset of both Multibyte and Unicode Character Sets (in other words, both of these flavours encompass ASCII characters).
Note: You should always use Unicode for new development, according to MSDN. For more information, please see http://msdn.microsoft.com/en-us/library/ey142t48.aspx.
*/
// Strings that are Multibyte.
LPSTR a; // Regular Multibyte string (synonymous with char *).
LPCSTR b; // Constant Multibyte string (synonymous with const char *).
// Strings that are Unicode.
LPWSTR c; // Regular Unicode string (synonymous with wchar_t *).
LPCWSTR d; // Constant Unicode string (synonymous with const wchar_t *).
// Strings that take on either Multibyte or Unicode depending on project settings.
LPTSTR e; // Multibyte or Unicode string (can be either char * or wchar_t *).
LPCTSTR f; // Constant Multibyte or Unicode string (can be either const char * or const wchar_t *).
/* From above, it is safe to assume that the pattern is as follows:
LP: Specifies a long pointer type (this is synonymous with appending a * to the underlying type).
W: Specifies that the type is of the Unicode Character Set.
C: Specifies that the type is constant.
T: Specifies that the type has a variable encoding.
STR: Specifies that the type is a string type.
*/
// String literal prefixes and macros. String literals are const, so they are
// assigned to the const pointer types declared above.
f = _T("Example."); // Multibyte or Unicode literal depending on project settings (_T comes from <tchar.h>).
f = TEXT("Example."); // Multibyte or Unicode literal depending on project settings (same as _T).
d = L"Example."; // Unicode (wide) literal.
b = "Example."; // Multibyte (narrow) literal.
return 0;
}

Related

How do I convert a `CString` into a `CHAR *`?

I have the following C++ code:
#include "stdafx.h"
#include <atlstr.h>
int _tmain(int argc, _TCHAR* argv[])
{
CString OFST_PATH;
TCHAR DIR_PATH[MAX_PATH];
GetCurrentDirectory(MAX_PATH, DIR_PATH);
OFST_PATH.Format(DIR_PATH);
CHAR *pOFST_PATH = (LPSTR)(LPCTSTR)OFST_PATH;
return 0;
}
I want to understand why the value of pOFST_PATH at the end of the program is "c". What did the (LPSTR)(LPCTSTR) cast of OFST_PATH do to the whole path that was stored in it?
As you can see in the following debugger window, the variable values are:
CString and LPCTSTR are both based on TCHAR, which is wchar_t when UNICODE is defined (which it is, in your case, as I can tell by the value of argv in your debugger). When you do this:
(LPCTSTR)OFST_PATH
That works okay, because CString has a conversion operator to LPCTSTR. But with UNICODE defined, LPCTSTR is LPCWSTR, a.k.a. wchar_t const*. It points to an array of UTF-16 characters. The first character in that array is L'c' (that's the wide character version of 'c'). The bytes of L'c' look like this in (little-endian) memory: 0x63 0x00. That's the ASCII code for the letter 'c', followed by a zero. So converting your CString to LPCTSTR is valid. Your next conversion, however, is not:
(LPSTR)(LPCTSTR)OFST_PATH
That's not valid. LPSTR is char*, so you are treating a wchar_t const* as if it were a char*. When your debugger sees a char*, it assumes it is looking at a null-terminated narrow character string. Remember the bytes of the first character from above: the ASCII value for the letter 'c', followed by a zero. So the debugger sees a null-terminated string consisting of just the letter 'c'.
The moral of the story: don't use C-style casts if you don't understand what they do and whether they are appropriate. A cast only changes the pointer type; it never converts the characters.
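The effect described above can be reproduced without Windows. The following is a minimal sketch, assuming a little-endian machine and using char16_t in place of the 16-bit Windows wchar_t; narrow_view_length and narrow_ascii are hypothetical helper names, and narrow_ascii only handles ASCII. In a real ATL/MFC program you would more likely reach for a conversion class such as CT2A or for WideCharToMultiByte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// What the debugger effectively does with the miscast pointer: it reads
// bytes until the first zero byte. On a little-endian machine the second
// byte of an ASCII UTF-16 code unit is already zero.
std::size_t narrow_view_length(const char16_t* wide) {
    assert(wide != nullptr);
    return std::strlen(reinterpret_cast<const char*>(wide));
}

// Hypothetical helper: an actual conversion that copies every code unit
// into a narrow string. Only ASCII is preserved; anything else becomes '?'.
std::string narrow_ascii(const std::u16string& wide) {
    std::string out;
    out.reserve(wide.size());
    for (char16_t unit : wide) {
        out.push_back(unit < 0x80 ? static_cast<char>(unit) : '?');
    }
    return out;
}
```

The first function shows why the cast yields a one-character string; the second shows that getting a usable char string requires producing new data, not relabelling the old pointer.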

Difference between char* and wchar_t*

I am new to MFC. I am trying to write a simple MFC application and I'm getting confused in some places. For example, SetWindowText has two APIs, SetWindowTextA and SetWindowTextW; one takes char * and the other accepts wchar_t *.
What is the use of char * and wchar_t *?
char is used for the so-called ANSI family of functions (the function name typically ends with A). These interpret text in the current Windows code page, which is often loosely described as ASCII.
wchar_t is used for the newer so-called Unicode (or Wide) family of functions (the function name typically ends with W), which use the UTF-16 encoding. It is very similar to UCS-2, but not quite the same. If a character does not fit in 2 bytes, it is encoded as a surrogate pair of two 16-bit code units, and this can be very confusing.
Converting one to the other is not a trivial task. You will need to use something like MultiByteToWideChar, which requires knowing and providing the code page of the input ANSI string.
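The surrogate-pair rule mentioned above is plain arithmetic. Here is a small sketch of it in portable C++11; encode_utf16 is a hypothetical helper name, and it uses char16_t rather than wchar_t so that it behaves the same on every platform.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF);                // beyond the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);    // surrogate values are not characters
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, same value as the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two 10-bit halves.
        char32_t offset = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+0B95 fits in a single unit, while U+1F600 becomes the pair 0xD83D 0xDE00, so the number of 16-bit units in a string is not always the number of characters.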
On Windows, APIs that take char * use the current code page whereas wchar_t * APIs use UTF-16. As a result, you should always use wchar_t on Windows. A recommended way to do this is to:
// Be sure to define this BEFORE including <windows.h>
#define UNICODE 1
#include <windows.h>
When UNICODE is defined, APIs like SetWindowText will be aliased to SetWindowTextW and can therefore be used safely. Without UNICODE, SetWindowText will be aliased to SetWindowTextA and therefore cannot be used without first converting to the current code page.
However, there's no good reason to use wchar_t when you are not calling Windows APIs, since its portable functionality is not useful, and its useful functionality is not portable (wchar_t is UTF-16 only on Windows, on most other platforms it is UTF-32, what a total mess.)
SetWindowTextA takes char*, which is a pointer to ANSI strings.
SetWindowTextW takes wchar_t*, which is a pointer to "wide" strings (Unicode).
SetWindowText has been defined (#define) to either of these in header Windows.h based on the type of application you are building. If you are building a UNICODE build then your code will automatically use SetWindowTextW.
SetWindowTextA is there primarily to support legacy code, which needs to be built as SBCS (Single byte character set).
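The aliasing works through an ordinary preprocessor macro. The sketch below imitates the mechanism in portable code; DemoSetTextA, DemoSetTextW, DemoSetText and DEMO_WIDE_BUILD are hypothetical names standing in for SetWindowTextA, SetWindowTextW, SetWindowText and UNICODE, and the functions only report which variant was reached.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the narrow ("A") variant.
std::string DemoSetTextA(const char* text) {
    assert(text != nullptr);
    return "A";
}

// Hypothetical stand-in for the wide ("W") variant.
std::string DemoSetTextW(const wchar_t* text) {
    assert(text != nullptr);
    return "W";
}

// Plays the role of UNICODE: it decides what the generic name expands to.
#define DEMO_WIDE_BUILD

#ifdef DEMO_WIDE_BUILD
#define DemoSetText DemoSetTextW
#else
#define DemoSetText DemoSetTextA
#endif
```

With DEMO_WIDE_BUILD defined, DemoSetText(L"title") compiles and reaches the W variant, while DemoSetText("title") fails to compile because a char string is not a wchar_t string. Both suffixed names stay callable directly in either configuration.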
char* : It means that this is a pointer to data of type char.
Example
// Regular char
char aChar = 'a';
// Pointer to char
char* aPointer = new char;
*aPointer = 'a';
// Pointer to an array of 10 chars
char* anArray = new char[ 10 ];
*anArray = 'a';
anArray[ 1 ] = 'b';
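// An actual array of 10 chars (not a pointer; it decays to char* when passed to a function)
char anotherArray[ 10 ];
*anotherArray = 'a';
anotherArray[ 1 ] = 'b';
// Memory obtained with new must be released
delete aPointer;
delete[] anArray;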
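wchar_t* : A pointer to data of type wchar_t. The C++ standard intends wchar_t to be wide enough that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint. On Windows this does not hold: wchar_t is 16 bits there, so characters outside the Basic Multilingual Plane take two wchar_t values.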

Cannot convert char[33] to LPCTSTR, and casting to LPCTSTR does not give the desired result

I am trying to use ListBox.AddString() in an MFC application; it takes an LPCTSTR.
I am passing a variable of char array that's 33 chars long.
ListBox.AddString(Adapter_List->pScanList->network[0].szSsid);
szSsid is declared as char szSsid[33];
I am facing two problems:
1) if I typecast to LPCTSTR like
ListBox.AddString( (LPCTSTR ) Adapter_List->pScanList->network[0].szSsid );
I do not get the correct output; some Chinese characters are displayed instead. I know it's some Unicode problem, but I am not knowledgeable about Unicode.
2) if I don't typecast I get an error
Cannot convert char[33] to LPCTSTR
I am trying to build an MFC application which will display all access points. In szSsid I am able to see access point names.
Casting to LPCTSTR is just wrong: a cast changes the pointer type, not the character data. You may want to use an ATL conversion helper like CA2T to convert from a char string to a TCHAR (LPCTSTR) string, or CA2W to convert from a char string to a Unicode UTF-16 wchar_t string; e.g.:
// CA2T - Uses the TCHAR model (obsolete)
ListBox.AddString( CA2T(Adapter_List->pScanList->network[0].szSsid) );
or:
// CA2W - Conversion to Unicode UTF-16 (wchar_t) string
// More modern approach.
ListBox.AddString( CA2W(Adapter_List->pScanList->network[0].szSsid) );
But, more importantly, what encoding does your char szSsid[] string use? You may want to pass that encoding identifier (e.g. CP_UTF8 for UTF-8 strings) as the nCodePage parameter of the CA2W constructor, so that the string is converted properly to the Unicode UTF-16 string passed to the AddString() method.

Multi-Byte UTF-8 in Arrays in C++

I have been having trouble working with 3-byte Unicode UTF-8 characters in arrays. When they are in char arrays I get multi-character character constant and implicit constant conversion warnings, but when I use wchar_t arrays, wcout returns nothing at all. Because of the nature of the project, it must be an array and not a string. Below is an example of what I've been trying to do.
#include <iostream>
#include <string>
using namespace std;
int main()
{
wchar_t testing[40];
testing[0] = L'\u0B95';
testing[1] = L'\u0BA3';
testing[2] = L'\u0B82';
testing[3] = L'\0';
wcout << testing[0] << endl;
return 0;
}
Any suggestions? I'm working with OSX.
Since '\u0B95' requires 3 bytes, it is considered a multicharacter literal. A multicharacter literal has type int and an implementation-defined value. (Actually, I don't think gcc is correct to do this)
Putting the L prefix before the literal makes it have type wchar_t and has an implementation defined value (it maps to a value in the execution wide-character set which is an implementation defined superset of the basic execution wide-character set).
The C++11 standard provides us with some more Unicode aware types and literals. The additional types are char16_t and char32_t, whose values are the Unicode code-points that represent the character. They are analogous to UTF-16 and UTF-32 respectively.
Since you need character literals to store characters from the basic multilingual plane, you'll need a char16_t literal. This can be written as, for example, u'\u0B95'. You can therefore write your code as follows, with no warnings or errors:
char16_t testing[40];
testing[0] = u'\u0B95';
testing[1] = u'\u0BA3';
testing[2] = u'\u0B82';
testing[3] = u'\0';
Unfortunately, the I/O library does not play nicely with these new types.
If you do not truly require using character literals as above, you may make use of the new UTF-8 string literals:
const char* testing = u8"\u0B95\u0BA3\u0B82";
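This will encode the characters as UTF-8. Note that from C++20 on, u8 literals have type const char8_t[], so this exact declaration only compiles in C++11 through C++17.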
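As a rough illustration of what "3-byte characters" means for a char array, the sketch below spells out the three code points from the question as raw UTF-8 bytes (avoiding u8 literals so it compiles under any standard from C++11 on) and counts code points versus bytes. tamil_sample and count_utf8_code_points are hypothetical helper names, and the counter assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+0B95, U+0BA3 and U+0B82 written out as their UTF-8 byte sequences.
// Each of them occupies three bytes.
std::string tamil_sample() {
    return "\xE0\xAE\x95\xE0\xAE\xA3\xE0\xAE\x82";
}

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t count_utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Here tamil_sample().size() is 9 while count_utf8_code_points(tamil_sample()) is 3, which is why a single char element can never hold one of these characters.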

CreateFileMapping() name

I'm creating a DLL that shares memory between different applications.
The code that creates the shared memory looks like this:
#define NAME_SIZE 4
HANDLE hSharedFile;
void create(char name[NAME_SIZE])
{
hSharedFile = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, 1024, (LPCSTR)name);
(...) //Other stuff that maps the view of the file etc.
}
It does not work. However if I replace name with a string it works:
hSharedFile = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, 1024, (LPCSTR)"MY_TEST_NAME");
How can I get this to work with the char array?
I have a Java background, where you would just use String all the time. What is an LPCSTR? And does this relate to whether my MS VC++ project is using the Unicode or the Multi-Byte character set?
I suppose you should increase the NAME_SIZE value.
Do not forget that the array must hold at least the number of chars + 1, to leave room for the \0 char at the end, which marks the end of the string.
LPCSTR is a pointer to a constant null-terminated string of 8-bit Windows (ANSI) characters. It is defined as follows:
typedef __nullterminated CONST CHAR *LPCSTR;
For example even if you have "Hello world" constant and it has 11 characters it will take 12 bytes in the memory.
If you are passing a string constant as an array you must add '\0' to the end like {'T','E','S','T', '\0'}
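The "+ 1" rule can be checked mechanically. Below is a minimal sketch; copy_name is a hypothetical helper, not part of the Windows API, that refuses to copy a name unless the terminator fits as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies src into dest only if the characters plus the terminating '\0'
// fit into capacity bytes. Returns false (and leaves dest untouched) otherwise.
bool copy_name(char* dest, std::size_t capacity, const char* src) {
    assert(dest != nullptr && src != nullptr);
    std::size_t needed = std::strlen(src) + 1;  // characters + terminator
    if (needed > capacity) {
        return false;
    }
    std::memcpy(dest, src, needed);
    return true;
}
```

With a char[4] buffer, copy_name rejects "TEST" because four characters need five bytes; with a char[5] buffer it succeeds, and the result can be passed safely wherever an LPCSTR is expected.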
If you look at the documentation, you'll find that most Win32 functions take an LPCTSTR, which represents a string of TCHAR. Depending on whether you build for Unicode (the default in Visual Studio projects) or ANSI, TCHAR will expand to either wchar_t or char. Also, LPCWSTR and LPCSTR explicitly represent Unicode and ANSI strings respectively.
When you're developing for Win32, in most cases, it's best to follow suit and use LPCTSTR wherever you need strings, instead of explicit char arrays/pointers. Also, use the TEXT("...") macro to create the correct kind of string literals instead of just "...".
In your case though, I doubt this is causing a problem, since both your examples use only LPCSTR. You have also defined NAME_SIZE to be 4, could it be that your array is too small to hold the string you want?