Difference between unsigned char and char pointers - c++

I'm a bit confused about the differences between unsigned char (which is also BYTE in the WinAPI) and char pointers.
Currently I'm working with some ATL-based legacy code and I see a lot of expressions like the following:
CAtlArray<BYTE> rawContent;
CALL_THE_FUNCTION_WHICH_FILLS_RAW_CONTENT(rawContent);
return ArrayToUnicodeString(rawContent);
// or return ArrayToAnsiString(rawContent);
Now, the implementations of ArrayToXXString look like this:
CStringA ArrayToAnsiString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);
    copiedArray.Add('\0');
    // Casting from BYTE* -> LPCSTR (const char*).
    return CStringA((LPCSTR)copiedArray.GetData());
}

CStringW ArrayToUnicodeString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);
    copiedArray.Add('\0');
    copiedArray.Add('\0');
    // Same here.
    return CStringW((LPCWSTR)copiedArray.GetData());
}
So, the questions:
Is the C-style cast from BYTE* to LPCSTR (const char*) safe for all possible cases?
Is it really necessary to add double null-termination when converting array data to wide-character string?
The conversion routine CStringW((LPCWSTR)copiedArray.GetData()) seems invalid to me, is that true?
Any way to make all this code easier to understand and to maintain?

The C standard is kind of weird when it comes to the definition of a byte. You do have a couple of guarantees though.
A byte will always be one char in size
sizeof(char) always returns 1
A byte will be at least 8 bits in size
This definition doesn't mesh well with older platforms where a byte was 6 or 7 bits long, but it does mean that BYTE* and char* are the same size and may safely be used to view the same memory.
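If you want to verify those guarantees in your own build, here is a minimal compile-time sketch (an illustration only, not part of the original code):

    #include <climits>

    // All three guarantees hold by definition in C and C++:
    static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
    static_assert(CHAR_BIT >= 8, "a byte is at least 8 bits wide");
    static_assert(sizeof(unsigned char) == sizeof(char),
                  "BYTE (unsigned char) matches char in size");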
Multiple zero bytes are needed at the end of a wide (UTF-16) string stored in a byte array because the terminator is a full two-byte character. A single zero byte is not enough to mark the end: many valid UTF-16 code units contain a zero byte, so for Latin text every other byte is already zero.
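A quick way to see this is to dump the bytes of a wide string. A minimal sketch, assuming a little-endian Windows build where wchar_t is 16 bits:

    #include <cstdio>

    int main()
    {
        const wchar_t s[] = L"ab";  // stored as UTF-16LE on Windows
        const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
        for (unsigned i = 0; i < sizeof s; ++i)
            std::printf("%02x ", p[i]);  // prints: 61 00 62 00 00 00
        return 0;
    }

Note the zero byte after each Latin character; that is why a single zero byte cannot serve as the terminator.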
As for making the code easier to read, that is completely a matter of style. This code appears to be written in a style used by a lot of old C Windows code, which has definitely fallen out of favor. There are probably a ton of ways to make it clearer for you, but how to make it clearer has no clear answer.

Yes, it is always safe, because both point to arrays of single-byte elements.
LPCSTR: Long Pointer to a Const String (single-byte characters)
LPCWSTR: Long Pointer to a Const Wide String (two-byte characters on Windows)
LPCTSTR: Long Pointer to a Const TCHAR String (single-byte or wide, depending on whether UNICODE is defined)
In wide character strings, every character occupies 2 bytes of memory, and the length of the memory block containing the string must be a multiple of 2. So if you want to add a wide '\0' to the end of a string, you must add two zero bytes.
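Concretely, appending one wide terminator means appending two zero bytes. A minimal sketch with a plain byte buffer (an illustration only, no ATL; the helper name is hypothetical):

    #include <vector>

    void terminate_utf16le(std::vector<unsigned char>& bytes)
    {
        // A wide L'\0' occupies two bytes, so push both of them.
        bytes.push_back(0);
        bytes.push_back(0);
    }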
Sorry, I do not know ATL, so I cannot help you with this last part; but honestly I see no complexity here, and I think the code is easy to maintain. What exactly do you want to make easier to understand and maintain?

If the BYTE* behaves like a proper string (i.e. the last BYTE is 0), you can cast a BYTE* to a LPCSTR, yes. Functions working with LPCSTR assume zero-terminated strings.
I think the multiple zeroes are only necessary when dealing with multi-byte or wide character sets such as UTF-16. The most common 8-bit encodings (like ordinary Windows Western, and also UTF-8) don't require them.
The CString is Microsoft's best attempt at user-friendly strings. For instance, its constructor can handle both char and wchar_t type input, regardless of whether the CString itself is wide or not, so you don't have to worry about the conversion much.
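For example, a minimal sketch of those conversion-friendly constructors (an illustration only; note that CStringW also offers a pointer-plus-length constructor, which avoids the null-termination games in the question's helpers):

    #include <atlstr.h>

    void sketch()  // hypothetical demonstration function
    {
        CStringW wide("narrow input");         // char* converted to wide via the ANSI code page
        CStringA narrow(L"wide input");        // wchar_t* converted to narrow
        const wchar_t raw[] = { L'a', L'b' };  // not null-terminated
        CStringW fromRaw(raw, 2);              // explicit length, no terminator required
    }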
Edit: wait, now I see that they are abusing a BYTE array to store wide chars. I can't recommend that.

An LPCWSTR points to a string with 2 bytes per character, while a char is one byte per character. That means you cannot turn narrow data into a wide string with a C-style cast: the memory itself would have to be adjusted (for plain ASCII stored as little-endian UTF-16, a zero byte follows each character byte), whereas a C-style cast only reads the existing data in a different way.
So I would say the cast is not safe.
The double null termination: every character occupies 2 bytes, so the end-of-string marker must also be 2 bytes long.
To make the code easier to understand, have a look at lexical_cast in Boost (http://www.boost.org/doc/libs/1_48_0/doc/html/boost_lexical_cast.html).
Another way would be to use the standard strings (std::string and std::wstring, both specializations of std::basic_string), which give you the usual string operations.

Related

How to tell if LPCWSTR text is numeric?

The entire string needs to be made of digits, which as we know are 0123456789. I am trying the following function but it doesn't seem to work:
bool isNumeric( const char* pszInput, int nNumberBase )
{
    string base = "0123456789";
    string input = pszInput;
    return (::strspn(input.substr(0, nNumberBase).c_str(), base.c_str()) == input.length());
}
and the example of using it in code...
isdigit = (isNumeric((char*)text, 11));
It returns true even with text in the string
Presumably the issue is that text is actually LPCWSTR which is const wchar_t*. We have to infer this fact from the question title and the cast that you made.
Now, that cast is a problem. The compiler objected to you passing text. It said that text is not const char*. By casting you have not changed what text is, you simply lied to the compiler. And the compiler took its revenge.
What happens next is that you reinterpret the wide char buffer as a narrow 8-bit buffer. If your wide char buffer holds Latin text, encoded as UTF-16, then every other byte will be zero. Hence the reinterpret cast you do results in isNumeric thinking that the string is only 1 character long.
What you need to do is either:
Start using UTF-16 encoded wchar_t buffers in isNumeric (see the sketch after this list).
Convert from UTF-16 to ANSI before calling isNumeric.
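For the first option, a minimal sketch of a wide-character rewrite (hypothetical; it uses wcsspn in place of strspn and drops the misused nNumberBase parameter):

    #include <cwchar>
    #include <string>

    bool isNumeric(const wchar_t* pszInput)
    {
        const std::wstring input = pszInput;
        // Numeric iff the initial span of digit characters covers
        // the whole (non-empty) string.
        return !input.empty()
            && std::wcsspn(input.c_str(), L"0123456789") == input.length();
    }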
You should think about this carefully. It seems that at present you have a rather unholy mix of ANSI and UTF-16 in your program. You really ought to settle on a standard character encoding and use it consistently throughout. That is tenable internal to your program, but you will encounter external text that could use different encodings. Deal with that by converting at the boundary between your program and the outside world.
Personally I don't understand why you are using C strings at all. Surely you should be using std::wstring or std::string.

How to make a non-null-terminated C string?

I am wondering: given char *cs = .....;, what will happen to strlen() and printf("%s", cs) if cs points to a memory block which is huge but has no '\0' in it?
I wrote these lines:
char s2[3] = {'a','a','a'};
printf("str is %s,length is %d",s2,strlen(s2));
I get the result "aaa", "3", but I think that is only because a '\0' (a zero byte) happens to reside at location s2+3.
How do you make a non-null-terminated C string? strlen and the other C string functions rely heavily on the '\0' byte; what happens if there is no '\0'? I just want to understand this rule more deeply.
PS: my curiosity was aroused by studying the following post on SO.
How to convert a const char * to std::string
and these words in that post:
"This is actually trickier than it looks, because you can't call strlen unless the string is actually nul terminated."
If it's not null-terminated, then it's not a C string, and you can't use functions like strlen - they will march off the end of the array, causing undefined behaviour. You'll need to keep track of the length some other way.
You can still print a non-terminated character array with printf, as long as you give the length:
printf("str is %.3s",s2);
printf("str is %.*s",s2_length,s2);
or, if you have access to the array itself, not a pointer:
printf("str is %.*s", (int)(sizeof s2), s2);
You've also tagged the question C++: in that language, you usually want to avoid all this error-prone malarkey and use std::string instead.
A "C string" is, by definition, null-terminated. The name comes from the C convention of having null-terminated strings. If you want something else, it's not a C string.
So if you have a string that is not null-terminated, you cannot use the C string manipulation routines on it. You can't use strlen, strcpy or strcat. Basically, any function that takes a char* but no separate length is not usable.
Then what can you do? If you have a string that is not null-terminated, you will have the length separately. (If you don't, you're screwed. You need some way to find the length, either by a terminator or by storing it separately.) What you can do is allocate a buffer of the appropriate size, copy the string over, and append a null. Or you can write your own set of string manipulation functions that work with pointer and length. In C++ you can use std::string's constructor that takes a char* and a length; that one doesn't need the terminator.
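For instance, a minimal sketch of that std::string constructor in action (an illustration only):

    #include <cstdio>
    #include <string>

    int main()
    {
        char s2[3] = {'a', 'a', 'a'};     // not null-terminated
        std::string str(s2, sizeof s2);   // length passed explicitly; no '\0' needed
        std::printf("%zu\n", str.size()); // prints 3
        return 0;
    }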
Your supposition is correct: your strlen is returning the correct value out of sheer luck, because there happens to be a zero on the stack right after your improperly terminated string. It probably helps that the string is 3 bytes, and the compiler is likely aligning stuff on the stack to 4-byte boundaries.
You cannot depend on this. C strings need NUL characters (zeroes) at the end to work correctly. C string handling is messy, and error-prone; there are libraries and APIs that help make it less so… but it's still easy to screw up. :)
In this particular case, your string could be initialized as one of these:
A: char s2[4] = { 'a','a','a', 0 }; // good if string MUST be 3 chars long
B: const char *s2 = "aaa"; // if you don't need to modify the string after creation
C: char s2[]="aaa"; // if you DO need to modify the string afterwards
Also note that declarations B and C are 'safer' in the sense that if someone comes along later and changes the string declaration in a way that alters the length, B and C are still correct automatically, whereas A depends on the programmer remembering to change the array size and keeping the explicit null terminator at the end.
What happens is that strlen keeps going, reading memory values until it eventually reaches a null. It then assumes that is the terminator and returns a length that could be massively large. If you're using strlen in an environment that expects C strings, you could then copy this huge buffer of data into another one that is just not big enough, causing buffer overrun problems, or at best you could copy a large amount of garbage data into your buffer.
Copying a non-null-terminated C string into a std::string will do this. If you then decide that you know this string is only 3 characters long and discard the rest, you will still have a massively long std::string that contains the first 3 good characters and then a load of wastage. That's inefficient.
The moral is: if you're using the CRT functions to operate on C strings, they must be null-terminated. It's no different from any other API; you must follow the rules that API sets down for correct usage.
Of course, there is no reason you cannot use the CRT functions if you always use the specific-length versions (e.g. strncpy), but you will have to limit yourself to just those, always, and manually keep track of the correct lengths.
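For example, here is a minimal sketch of a bounded copy that always leaves the destination terminated (the helper name is hypothetical; note that strncpy by itself does not guarantee a terminator when the source is too long):

    #include <cstring>

    void bounded_copy(char* dst, std::size_t dst_size, const char* src)
    {
        if (dst_size == 0) return;
        std::strncpy(dst, src, dst_size - 1); // copies at most dst_size - 1 chars
        dst[dst_size - 1] = '\0';             // terminate unconditionally
    }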
Convention states that a char array with a terminating \0 is a null terminated string. This means that all str*() functions expect to find a null-terminator at the end of the char-array. But that's it, it's convention only.
Also by convention, strings should contain printable characters.
If you create an array like you did char arr[3] = {'a', 'a', 'a'}; you have created a char array. Since it is not terminated by a \0 it is not called a string in C, although its contents can be printed to stdout.
The C standard does not define the term string until the section 7 - Library functions. The definition in C11 7.1.1p1 reads:
A string is a contiguous sequence of characters terminated by and including the first null character.
(emphasis mine)
If the definition of string is a sequence of characters terminated by a null character, a sequence of non-null characters not terminated by a null is not a string, period.
What you have done is undefined behavior: strlen reads past the end of the array, through memory that is not yours.
Change it to
char s2[] = {'a','a','a','\0'};

Differences in string class implementations

Why are string classes implemented in several different ways, and what are the advantages and disadvantages? I have seen it done several different ways:
Only using a simple char (most basic way).
Supporting UTF8 and UTF16 through a templated string, such as string<UTF8>, where UTF8 is a char and UTF16 is an unsigned short.
Having both a UTF8 and UTF16 in the string class.
Are there any other ways to implement a string class that may be better?
As far as I know, std::basic_string<wchar_t> where sizeof(wchar_t) == 2 is not UTF16 encoding. There are more than 2^16 characters in Unicode, and code points go up to 0x10FFFF, which is > 0xFFFF (the capacity of a 2-byte wchar_t). As a result, proper UTF16 must use a variable number of bytes per letter (one 2-byte wchar_t or two of them), which is not the case with std::basic_string and similar classes that assume one string element == one character.
As far as I know there are two ways to deal with unicode strings.
Either use a type big enough to fit any character into a single string element (for example, on Linux it is quite normal to see sizeof(wchar_t) == 4), so you'll be able to enjoy the "benefits" (basically, easy string length calculation and nothing else) of std::string-like classes.
Or use a variable-length encoding (UTF8, 1..4 bytes per character, or UTF16, 2..4 bytes per character) and a well-tested string class that provides string-manipulation routines.
As long as you don't use char, it doesn't matter which method you use. char-based strings are likely to cause trouble on machines with a different 8-bit codepage, if you weren't careful enough to take care of that (it is safe to assume that you'll forget about it and won't be careful enough; Microsoft AppLocale was created for a reason).
Unicode contains plenty of non-printable characters (control and formatting characters), so that pretty much defeats any benefit method #1 could provide. Regardless, if you decide to use method #1, you should remember that wchar_t is not big enough to fit all possible characters on some compilers/platforms (Windows / the Microsoft compiler), and that std::basic_string<wchar_t> is not a perfect solution because of that.
Rendering internationalized text is PAIN, so the best idea would be just to grab whatever unicode-compatible string class (like QString) there is that hopefully comes with text layout engine (that can properly handle control characters and bidirectional text) and concentrate on more interesting programming problems instead.
-Update-
If unsigned short is not UTF16, then what is, unsigned int? What is UTF8 then? Is that unsigned char?
UTF16 is a variable-length character encoding. UTF16 uses one or two 2-byte (i.e. uint16_t, 16-bit) elements per character. I.e. the number of elements in a UTF16 string != the number of characters in the string. You can't calculate string length by counting elements.
UTF8 is another variable-length encoding, based on 1-byte elements (8 bits, 1 byte, or "unsigned char"). One Unicode character ("code point") in UTF8 takes 1..4 uint8_t elements. Once again, the number of elements in the string != the number of characters in the string. The advantage of UTF8 is that characters that exist within ASCII take exactly 1 byte per character, which saves a bit of space, while in UTF16 a character always takes at least 2 bytes.
UTF32 is a fixed-length character encoding that always uses 32 bits (4 bytes, a uint32_t) per character. Currently any Unicode character fits into a single UTF32 element, and UTF32 will probably remain fixed-length for a long time (I don't think all the languages of Earth combined would produce 2^31 different characters). It wastes more memory, but the number of elements in the string == the number of characters.
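To make the element-count vs. character-count distinction concrete, here is a minimal sketch that counts code points in a UTF-8 string by skipping continuation bytes (bytes of the form 10xxxxxx); the function name is hypothetical, an illustration only:

    #include <cstddef>
    #include <string>

    std::size_t utf8_code_points(const std::string& s)
    {
        std::size_t count = 0;
        for (unsigned char c : s)
            if ((c & 0xC0) != 0x80)  // count everything except continuation bytes
                ++count;
        return count;
    }
    // e.g. "a\xC3\xA9" (UTF-8 for "aé") holds 3 bytes but 2 code points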
Also, keep in mind that the C++ standard doesn't specify how big "int" or "short" should be.

The 'L' and 'LPCWSTR' in Windows API

I've found that
NetUserChangePassword(0, 0, L"ab", L"cd");
changes the user password from ab to cd. However,
NetUserChangePassword(0, 0, (LPCWSTR) "ab", (LPCWSTR) "cd");
doesn't work. The returned value indicates invalid password.
I need to pass const char* as the last two parameters for this function call. How can I do that? For example,
NetUserChangePassword(0, 0, (LPCWSTR) vs[0].c_str(), (LPCWSTR) vs[1].c_str());
Where vs is std::vector<std::string>.
Those are two totally different L's. The first is a part of the C++ language syntax. Prefix a string literal with L and it becomes a wide string literal; instead of an array of char, you get an array of wchar_t.
The L in LPCWSTR doesn't describe the width of the characters, though. Instead, it describes the size of the pointer. Or, at least, it used to. The L abbreviation on type names is a relic of 16-bit Windows, when there were two kinds of pointers. There were near pointers, where the address was somewhere within the current 64 KB segment, and there were far, or long pointers, which could point beyond the current segment. The OS required callers to provide the latter to its APIs, so all the pointer-type names use LP. Nowadays, there's only one type of pointer; Microsoft keeps the same type names so that old code continues to compile.
The part of LPCWSTR that specifies wide characters is the W. But merely type-casting a char string literal to LPCWSTR is not sufficient to transform those characters into wide characters. Instead, what happens is the type-cast tells the compiler that what you wrote really is a pointer to a wide string, even though it really isn't. The compiler trusts you. Don't type-cast unless you really know better than the compiler what the real types are.
If you really need to pass a const char*, then you don't need to type-cast anything, and you don't need any L prefix. A plain old string literal is sufficient. (If you really want to cast to a Windows type, use LPCSTR — no W.) But it looks like what you really need to pass in is a const wchar_t*. As we learned above, you can get that with the L prefix on the string literal.
In a real program, you probably don't have a string literal. The user will provide a password, or you'll read a password from some other external source. Ideally, you would store that password in a std::wstring, which is like std::string but for wchar_t instead of char. The c_str() method of that type returns a const wchar_t*. If you don't have a wstring, a plain array of wchar_t might be sufficient.
But if you're storing the password in a std::string, then you'll need to convert it into wide characters some other way. To do a conversion, you need to know what code page the std::string characters use. The "current ANSI code page" is usually a safe bet; it's represented by the constant CP_ACP. You'll use that when calling MultiByteToWideChar to have the OS convert from the password's code page into Unicode.
// Ask how many wide characters are needed. Passing vs[0].size() + 1
// includes the terminating null from c_str() in the conversion, so the
// returned count already accounts for it.
int required_size = MultiByteToWideChar(CP_ACP, 0, vs[0].c_str(),
                                        (int)vs[0].size() + 1, NULL, 0);
if (required_size == 0)
    ERROR;
// We'll be storing the Unicode password in this vector. Its size
// includes room for the terminating null character at the end.
std::vector<wchar_t> wv(required_size);
int result = MultiByteToWideChar(CP_ACP, 0, vs[0].c_str(),
                                 (int)vs[0].size() + 1, &wv[0], required_size);
if (result != required_size)
    ERROR;
Now, when you need a wchar_t*, just use a pointer to the first element of that vector: &wv[0]. If you need it in a wstring, you can construct it from the vector in a few ways:
// The vector is null-terminated, so use "const wchar_t*" constructor
std::wstring ws1 = &wv[0];
// Use iterator constructor. The vector is null-terminated, so omit
// the final character from the iterator range.
std::wstring ws2(wv.begin(), wv.end() - 1);
// Use pointer/length constructor.
std::wstring ws3(&wv[0], wv.size() - 1);
You have two problems.
The first is the practical problem - how to do this. You are confusing wide and narrow strings and casting from one to the other. A string with an L prefix is a wide string, where each character is two bytes (a wchar_t). A string without the L is a single byte (a char). You cannot cast from one to the other using the C-style cast (LPCWSTR) "ab" because you have an array of chars, and are casting it to a pointer to wide chars. It is simply changing the pointer type, not the underlying data.
To convert from a narrow string to a wide string, you would normally use MultiByteToWideChar. You don't mention what code page your narrow strings are in; you would probably pass in CP_ACP for the first parameter. However, since you are converting between a string and a wstring, you might be interested in other ways to convert (one, two). This will give you a wstring with your characters, not a string, and a wstring's .c_str() method returns a pointer to wchar_ts.
The second is the following misunderstanding:
I need to pass const char* as last two parameters for this function call. How can I do that?
No you don't. You need to pass a wide string, which you got above. Your approach to this (casting the pointer) indicates you probably don't know about different string types and character encodings, and this is something every software developer should know. So, on the assumption you're interested, hopefully you'll find the following references handy:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Unicode FAQ (covers things like 'What is Unicode?')
MSDN introduction to wide characters.
I'd recommend you investigate recompiling your application with UNICODE and using wide strings. Many APIs are defined in both narrow and wide versions; normally you access the narrow version by default. (You can access either the ANSI (narrow) or wide version of these APIs by directly calling the A or W variant; they have A or W appended to their name, such as CreateWindowW. See the bottom of that page for the two names. You normally don't need to worry about this.) As far as I can tell, this particular API is always available as-is regardless of UNICODE; it is simply only prototyped as wide.
C style casts as you've used here are a very blunt instrument. They assume you know exactly what you're doing.
You'll need to convert your ASCII or multi-byte strings into Unicode strings for the API. There might be a NetUserChangePasswordA function that takes the char * types you're trying to pass, try that first.
LPWSTR is defined as wchar_t* (wchar_t is a 2-byte character type), which is interpreted differently from normal 1-byte chars.
LPCWSTR means you need to pass a const wchar_t*.
If you change vs to std::vector<std::wstring>, then vs[0].c_str() will give you a const wchar_t* you can pass directly.
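A minimal sketch of that change (assuming the passwords are known at compile time; in a real program they would come from user input):

    #include <windows.h>
    #include <lm.h>      // NetUserChangePassword; link against netapi32.lib
    #include <string>
    #include <vector>

    int main()
    {
        std::vector<std::wstring> vs = { L"ab", L"cd" };

        // c_str() now yields const wchar_t*, matching LPCWSTR with no cast.
        NET_API_STATUS status =
            NetUserChangePassword(nullptr, nullptr, vs[0].c_str(), vs[1].c_str());
        return status == NERR_Success ? 0 : 1;
    }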
If you look at the example at http://msdn.microsoft.com/en-us/library/aa370650(v=vs.85).aspx you can see that they define UNICODE, which is why they use wchar_t.

conversion from std::vector<char> to wchar_t*

I'm trying to read ID3 frames and their values with TagLib (1) and index them with CLucene (2). The former returns frame IDs as std::vector<char> (3) and the latter writes field names as TCHAR* [wchar_t* on Linux] (4). I need to make a link between the two. How can I convert from std::vector<char> to wchar_t* by means of the STL? Thank you.
(1)http://developer.kde.org/~wheeler/taglib.html
(2)http://clucene.sourceforge.net/
(3)http://developer.kde.org/~wheeler/taglib/api/classTagLib_1_1ID3v2_1_1Frame.html#6aac53ec5893fd15164cd22c6bdb5dfd
(4)http://ohnopublishing.net/doc/clucene-0.9.21b/html/classlucene_1_1document_1_1Field.html#59b0082e2ade8c78a51a64fe99e684b2
In a simple case where your chars don't contain any accented characters or anything like that, you can just copy each one to the destination and use it:
std::vector<char> frameID;
std::vector<wchar_t> field_name;

// Widen each char individually; fine for plain ASCII frame IDs.
std::copy(frameID.begin(), frameID.end(), std::back_inserter(field_name));
lucene_write_field(&field_name[0], field_name.size());
My guess is that for ID3 frame ID's you don't have accented characters and such, so that'll probably be all you need. If you do have a possibility of accented characters and such, things get more complex in a hurry -- you'll need to convert from something like ISO 8859-x to (probably) UTF-16 Unicode. To do that, you need a code-page that tells you how to interpret the input (i.e., there are several varieties of ISO 8859, and one for French input will be different from one for Russian, for example).
In order to prevent large char values from becoming negative wchar_t values, you need to make sure that you cast to unsigned first. This works (and in fact reading memory through an unsigned char* is one of the few reinterpret_casts the standard explicitly permits):
unsigned char* uchar = reinterpret_cast<unsigned char*>(&vect[0]);
std::vector<wchar_t> vwchar(uchar, uchar + vect.size());
This is important if your text contains anything above 127 in the character set.
Also keep in mind that none of these answers correctly deals with UTF-anything.