Is there any mistake in writing code such as this:
char* sp=(char*)malloc(128);
int x=22;
wsprintf(sp,"%d",x);
cout<<sp;
I am asking especially about security mistakes.
There are a number of "potential" issues here; none of them actually infringes anything, but you may find things not behaving as you expect.
First: wsprintf, as a Win32 API (http://msdn.microsoft.com/en-us/library/windows/desktop/ms647550(v=vs.85).aspx ) is prototyped as:
int __cdecl wsprintf(
_Out_ LPTSTR lpOut,
_In_ LPCTSTR lpFmt,
_In_ ...
);
where LPTSTR is defined as char* or wchar_t* depending on whether the UNICODE symbol is defined (check your project settings and/or build commands).
Now, in case you are on an ANSI build (no UNICODE), all types are coherent, but there is no check that wsprintf does not write more than the 128 chars you allocated. If you just write a decimal integer there will be no problem, but if you (or somebody else after you) later modify the "message" and no checks are made, some surprises may arise (like wsprintf(sp,"This is the number I've been told I was supposed to be expected to be: %d",x); will this still fit in the 128 chars?!?).
In case you are on a UNICODE build, you allocate 128 chars but write a double-byte string into them. The number 22 will be written as \x32\x00\x32\x00\x00\x00 (3200 is the little-endian encoding of 0x0032, the wchar_t corresponding to the Unicode code point 50, which stands for '2').
If you give that sequence to cout (which is char based, not wchar_t based), it will see the first \x00 as a string terminator and will output ... just '2'.
To be coherent, you should either:
use all char-based types and functions (malloc and cout are OK, but wsprintfA instead of wsprintf),
use all wchar_t-based types and functions (malloc(128*sizeof(wchar_t)), wchar_t*, and wsprintfW), or
use all TCHAR-based types (malloc(128*sizeof(TCHAR)), TCHAR*, and wsprintf, but define tcout as cout or wcout depending on UNICODE); a minimal sketch of this third option follows below.
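Here is a minimal sketch of that third, TCHAR-based option (the tcout macro is just an illustrative name, not a standard identifier):
// Minimal sketch of the TCHAR-based option; tcout is an illustrative name, not a standard identifier.
#include <windows.h>
#include <tchar.h>
#include <iostream>
#include <cstdlib>

#ifdef UNICODE
    #define tcout std::wcout
#else
    #define tcout std::cout
#endif

int main()
{
    TCHAR* sp = (TCHAR*)malloc(128 * sizeof(TCHAR)); // room for 128 TCHARs, not just 128 bytes
    int x = 22;
    wsprintf(sp, TEXT("%d"), x);                     // resolves to wsprintfA or wsprintfW to match TCHAR
    tcout << sp;                                     // stream whose width matches the string
    free(sp);
    return 0;
}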
There is no security mistake because it is never the case that an int converted to a C string will exceed the size you've allocated.
However this style of programming has the potential for security issues. And history has shown that this kind of code has caused real security issues time and time again. So maybe you should be learning a better style of coding?
This MSDN link lists some concerns over the use of wsprintf. They don't appear to apply to your example but they do give some alternatives that you might want to explore.
OK, given that you have stated you are using the Win32 API, read this from its documentation:
Note Do not use. Consider using one of the following functions
instead: StringCbPrintf, StringCbPrintfEx, StringCchPrintf, or
StringCchPrintfEx. See Security Considerations.
Therefore, do not use it. I would, however, ignore what they tell you to do instead, and either:
Write in C, and use a function from the C standard library (sprintf, snprintf, etc.). In that case you cannot use cout.
Write in C++. Use std::string; there is even the new std::to_string, as well as boost::format and ostringstream, to help you build formatted strings, as sketched below. You can still use C standard library functions when they really suit your purpose better, but leave the allocation to the library.
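For instance, a small sketch of that C++ approach using std::to_string and std::ostringstream:
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    int x = 22;

    std::string s1 = std::to_string(x);     // the simplest way to turn a number into a std::string

    std::ostringstream oss;                 // handy when building a longer formatted message
    oss << "the value is " << x;
    std::string s2 = oss.str();

    std::cout << s1 << '\n' << s2 << '\n';  // no manual allocation, no buffer-size guessing
    return 0;
}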
Related
I want to convert a __int64 variable into a CString. The code is exactly this
__int64 i64TotalGB;
CString totalSpace;
i64TotalGB = 150;
printf("disk space: %I64d GB\n", i64TotalGB);
totalSpace.Format(_T("%I64d", i64TotalGB));
printf("totalSpace contains: %s", totalSpace);
the first printf prints
"disk space: 150GB"
which is correct, but the second printf prints random high numbers like
"totalSpace contains: 298070026817519929"
I also tried to use an INT64 variable instead of a __int64 variable, but the result is the same. What can be the cause of this?
Here:
totalSpace.Format(_T("%I64d", i64TotalGB));
you're passing i64TotalGB as an argument to the _T() macro instead of passing it as the second argument to Format().
Try this:
totalSpace.Format(_T("%I64d"), i64TotalGB);
Having said that, thanks to MS's mess (ha) around character encodings, using _T here is not quite the right thing, as CString is built on TCHAR (keyed to UNICODE) rather than _TCHAR (keyed to _UNICODE). So taking that into account, you might as well use TEXT() instead of _T(), as it depends on UNICODE and not _UNICODE:
totalSpace.Format(TEXT("%I64d"), i64TotalGB);
In addition, this line is wrong as it tries to pass an ATL CString as a char* (a.k.a. C-style string):
printf("totalSpace contains: %s", totalSpace);
For which the compiler gives this warning:
warning C4477: 'printf' : format string '%s' requires an argument of type 'char *', but variadic argument 1 has type 'ATL::CString'
While the structure of CString is practically compatible with passing it like you have, this is still formally undefined behavior. Use CString::GetString() to safeguard against it:
printf("totalSpace contains: %ls", totalSpace.GetString());
Note the %ls: under my configuration, totalSpace.GetString() returned a const wchar_t*. However, since "printf does not currently support output into a UNICODE stream.", the correct version of this line, which will also support characters outside your current code page, is a call to wprintf() in the following manner:
wprintf(L"totalSpace contains: %s", totalSpace.GetString());
Having said ALL that, here's a general piece of advice, regardless of the direct problem behind the question. The far better practice nowadays is slightly different altogether, and I quote from the respectable answer by @IInspectable, saying that "generic-text mappings were relevant 2 decades ago".
What's the alternative? In the absence of good enough reason, try sticking explicitly to CStringW (A Unicode character type string with CRT support). Prefer the L character literal over the archaic data/text mappings that depend on whether the constant _UNICODE or _MBCS has been defined in your program. Conversely, the better practice would be using the wide-character versions of all API and language library calls, such as wprintf() instead of printf().
The bug is a result of numerous issues with the code, specifically these 2:
totalSpace.Format(_T("%I64d", i64TotalGB));
This uses the _T macro in a way it's not meant to be used: it should wrap a single string literal, but in the code it also wraps the second argument.
printf("totalSpace contains: %s", totalSpace);
This assumes an ANSI-encoded character string, but passes a CString object, that can store both ANSI as well as Unicode encoded strings.
The recommended course of action is to drop generic-text mappings altogether, in favor of using Unicode (that's UTF-16LE on Windows) throughout1. The generic-text mappings were relevant 2 decades ago, to ease porting of Win9x code to the Windows NT based products.
To do this
Choose CStringW over CString.
Drop all occurrences of _T, TEXT, and _TEXT, and replace them with an L prefix.
Use the wide-character version of the Windows API, CRT, and C++ Standard Library.
The fixed code looks like this:
__int64 i64TotalGB;
CStringW totalSpace; // Use wide-character string
i64TotalGB = 150;
printf("disk space: %I64d GB\n", i64TotalGB);
totalSpace.Format(L"%I64d", i64TotalGB); // Use wide-character string literal
wprintf(L"totalSpace contains: %s", totalSpace.GetString()); // Use wide-character library
On an unrelated note, while it is technically safe to pass a CString object in place of a character pointer in a variable argument list, this is an implementation detail, and not formally documented to work. Call CString::GetString() if you care about correct code.
1 Unless there is a justifiable reason to use a character encoding that uses char as its underlying type (like UTF-8 or ANSI). In that case you should still be explicit about it by using CStringA.
try this
totalSpace.Format(_T("%I64d"), i64TotalGB);
I want to understand the difference between char and wchar_t. I understand that wchar_t uses more bytes, but can I get a clear-cut example of when I would use char vs. wchar_t?
Short answer:
You should never use wchar_t in modern C++, except when interacting with OS-specific APIs (basically use wchar_t only to call Windows API functions).
Long answer:
The design of the standard C++ library implies there is only one way to handle Unicode: by storing UTF-8 encoded strings in char arrays, as almost all functions exist only in char variants (think of std::exception::what()).
In a C++ program you have two locales:
Standard C library locale set by std::setlocale
Standard C++ library locale set by std::locale::global
Unfortunately, neither of them defines the behavior of standard functions that open files (like std::fopen, std::fstream::open, etc.). Behavior differs between OSes:
Linux is encoding agnostic, so those functions simply pass the char string to the underlying system call.
On Windows, the char string is converted to a wide string using the user-specific locale before the system call is made.
Everything usually works fine on Linux, as everyone uses UTF-8 based locales, so all user input and arguments passed to main will be UTF-8 encoded. But you might still need to switch the current locales to UTF-8 variants explicitly, as by default a C++ program starts in the default "C" locale. At this point, if you only care about Linux and don't need to support Windows, you can use char arrays and std::string, assume they hold UTF-8 sequences, and everything "just works".
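A minimal sketch of that locale setup, using only standard calls:
#include <clocale>
#include <locale>
#include <cstdio>

int main()
{
    std::setlocale(LC_ALL, "");            // C library: adopt the user's locale (usually a *.UTF-8 one on Linux)
    std::locale::global(std::locale(""));  // C++ library: same, instead of the default "C" locale
    std::printf("C locale is now: %s\n", std::setlocale(LC_ALL, NULL));
    return 0;
}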
Problems appear when you want to support Windows, as there you always have an additional third locale: the one set for the current user, which can be configured somewhere in the Control Panel. The main issue is that this locale is never a Unicode locale, so it is impossible to use functions like std::fopen(const char *) and std::fstream::open(const char *) to open a file using a Unicode path. On Windows you have to use custom wrappers around non-standard, Windows-specific functions like _wfopen or std::fstream::open(const wchar_t *). You can check Boost.Nowide (not yet included in Boost) to see how this can be done: http://cppcms.com/files/nowide/html/
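A rough sketch of such a wrapper (the name u8fopen is invented here for illustration; it assumes a Windows build and a UTF-8 encoded path):
#include <windows.h>
#include <stdio.h>
#include <string>

// Hypothetical helper (u8fopen is not a real API): open a file whose path is
// UTF-8 encoded by converting it to UTF-16 and calling _wfopen.
FILE* u8fopen(const char* utf8Path, const wchar_t* mode)
{
    // First call asks how many wchar_t's are needed, including the terminator.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8Path, -1, NULL, 0);
    if (len <= 0)
        return NULL;
    std::wstring widePath(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8Path, -1, &widePath[0], len);
    return _wfopen(widePath.c_str(), mode);
}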
With C++17 you can use std::filesystem::path to store file path in a portable way, but it is still broken on Windows:
The implicit constructor std::filesystem::path::path(const char *) uses the user-specific locale on MSVC, and there is no way to make it use UTF-8. The function std::filesystem::u8path should be used to construct a path from a UTF-8 string (a sketch follows below), but it is too easy to forget about it and use the implicit constructor instead.
std::error_category::message(int) for both error categories returns the error description using the user-specific encoding.
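A tiny sketch of that first pitfall (the UTF-8 file name is purely illustrative):
#include <filesystem>
namespace fs = std::filesystem;

// utf8Name is assumed to be a UTF-8 encoded file name, e.g. read from a config file.
void openByPath(const char* utf8Name)
{
    fs::path wrong = utf8Name;             // MSVC: bytes interpreted via the user's ANSI code page
    fs::path right = fs::u8path(utf8Name); // C++17: bytes interpreted as UTF-8
    (void)wrong; (void)right;              // use "right" with std::ifstream(right) and similar
}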
So what we have on Windows is:
Standard library functions that open files are broken and should never be used.
Arguments passed to main(int, char**) are broken and should never be used.
WinAPI functions ending with *A and macros are broken and should never be used.
std::filesystem::path is partially broken and should never be used directly.
Error categories returned by std::generic_category and std::system_category are broken and should never be used.
If you need long term solution for a non-trivial project, I would recommend:
Using Boost.Nowide or implementing similar functionality directly - this fixes broken standard library.
Re-implementing standard error categories returned by std::generic_category and std::system_category so that they would always return UTF-8 encoded strings.
Wrapping std::filesystem::path so that new class would always use UTF-8 when converting path to string and string to path.
Wrapping all required functions from std::filesystem so that they would use your path wrapper and your error categories.
Unfortunately, this won't fix issues with other libraries that work with files, but many of them are broken anyway (they do not support Unicode).
You can check this link for further explanation: http://utf8everywhere.org/
Fundamentally, use wchar_t when the encoding has more symbols than a char can contain.
Background
The char type has enough capacity to hold any character (encoding) in the ASCII character set.
The issue is that many languages require more characters than ASCII accounts for. So, instead of 128 possible characters, more are needed. Some languages have more than 256 characters, while a char can represent at most 256 distinct values. Thus a new data type is required.
The wchar_t, a.k.a. wide characters, provides more room for encodings.
Summary
Use the char data type when the character set has 256 or fewer values, such as ASCII. Use wchar_t when you need the capacity for more than 256.
Prefer Unicode to handle large character sets (such as emojis).
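A tiny example of the size difference (figures shown are for typical Windows and Linux compilers):
#include <iostream>

int main()
{
    const char    narrow[] = "abc";   // 1 byte per character plus the terminator
    const wchar_t wide[]   = L"abc";  // sizeof(wchar_t) bytes per character plus the terminator

    std::cout << sizeof(narrow)  << '\n';  // 4
    std::cout << sizeof(wide)    << '\n';  // 8 on Windows (2-byte wchar_t), 16 on most Linux systems
    std::cout << sizeof(wchar_t) << '\n';  // 2 on Windows, typically 4 elsewhere
    return 0;
}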
Never use wchar_t.
When possible, use (some kind of array of) char, such as std::string, and ensure that it is encoded in UTF-8.
When you must interface with APIs that don't speak UTF-8, use char16_t or char32_t. Never use them otherwise; they provide only illusory advantages and encourage faulty code.
Note that there are plenty of cases where more than one char32_t is required to represent a single user-visible character. OTOH, using UTF-8 with char forces you to handle variable width very early.
TCHAR szExeFileName[MAX_PATH];
GetModuleFileName(NULL, szExeFileName, MAX_PATH);
CString tmp;
lstrcpy(szExeFileName, tmp);
CString out;
out.Format("\nInstall32 at %s\n", tmp);
TRACE(tmp);
Error (At the Format):
error C2664: 'void ATL::CStringT<BaseType,StringTraits>::Format(const wchar_t
*,...)' : cannot convert parameter 1 from 'const char [15]' to 'const wchar_t
I'd just like to get the current path that this program was launched from and copy it into a CString so I can use it elsewhere. I am currently just trying to see the path by TRACE'ing it out. But strings, chars, char arrays: I can't ever get them all straight. Could someone give me a pointer?
The accepted answer addresses the problem. But the question also asked for a better understanding of the differences among all the character types on Windows.
Encodings
A char on Windows (and virtually all other systems) is a single byte. A byte is typically interpreted as either an unsigned value [0..255] or a signed value [-128..127]. (Older C++ standards guaranteed a signed range of only [-127..127], but most implementations give [-128..127]. I believe C++11 guarantees the larger range.)
ASCII is a character mapping for values in the range [0..127] to particular characters, so you can store an ASCII character in either a signed byte or an unsigned byte, and thus it will always fit in a char.
But ASCII doesn't have all the characters necessary for most languages, so the character sets were often extended by using the rest of the values available in a byte to represent the additional characters needed for certain languages (or families of languages). So, while [0..127] almost always mean the same thing, values like 150 can only be interpreted in the context of a particular encoding. For single-byte alphabets, these encodings are called code pages.
Code pages helped, but they didn't solve all the problems. You always had to know which code page a particular document used in order to interpret it correctly. Furthermore, you typically couldn't write a single document that used different languages.
Also, some languages have more than 256 characters, so there was no way to map one char to one character. This led to the development of multi-byte character encodings, where [0..127] is still ASCII, but some of the other values are "escapes" that mean you have to look at some number of following chars to figure out what character you really had. (It's best to think of multi-byte as variable-byte, as some characters require only one byte while others require two or more.) Multi-byte works, but it's a pain to code for.
Meanwhile, memory was becoming more plentiful, so a bunch of organizations got together and created Unicode, with the goal of making a universal mapping of values to characters (for appropriately vague definitions of "characters"). Initially, it was believed that all characters (or at least all the ones anyone would ever use) would fit into 16-bit values, which was nice because you wouldn't have to deal with multi-byte encodings--you'd just use two bytes per character instead of one. About this time, Microsoft decided to adopt Unicode as the internal representation for text in Windows.
WCHAR
So Windows has a type called WCHAR, a two-byte value that represents a "Unicode" "character". I'm using quotation marks here because Unicode evolved past the original two-byte encoding, so what Windows calls "Unicode" isn't really Unicode today--it's actually a particular encoding of Unicode called UTF-16. And a "character" is not as simple a concept in Unicode as it was in ASCII, because, in some languages, characters combine or otherwise influence adjacent characters in interesting ways.
Newer versions of Windows used these 16-bit WCHAR values for text internally, but there was a lot of code out there still written for single-byte code pages, and even some for multi-byte encodings. Those programs still used chars rather than WCHARs. And many of these programs had to work with people using older versions of Windows that still used chars internally as well as newer ones that use WCHAR. So a technique using C macros and typedefs was devised so that you could mostly write your code one way and--at compile time--choose to have it use either char or WCHAR.
TCHAR
To accomplish this flexibility, you use a TCHAR for a "text character". In some header file (often <tchar.h>), TCHAR would be typedef'ed to either char or WCHAR, depending on the compile time environment. Windows headers adopted conventions like this:
LPTSTR is a (long) pointer to a string of TCHARs.
LPWSTR is a (long) pointer to a string of WCHARs.
LPSTR is a (long) pointer to a string of chars.
(The L for "long" is a leftover from 16-bit days, when we had long, far, and near pointers. Those are all obsolete today, but the L prefix tends to remain.)
Most of the Windows API functions that take and return strings were actually replaced with two versions: the A version (for "ANSI" characters) and the W version (for wide characters). (Again, historical legacy shows in these. The code pages scheme was often called ANSI code pages, though I've never been clear if they were actually ruled by ANSI standards.)
So when you call a Windows API like this:
SetWindowText(hwnd, lptszTitle);
what you're really doing is invoking a preprocessor macro that expands to either SetWindowTextA or SetWindowTextW. It should be consistent with however TCHAR is defined. That is, if you want strings of chars, you'll get the A version, and if you want strings of WCHARs, you get the W version.
But it's a little more complicated because of string literals. If you write this:
SetWindowText(hwnd, "Hello World"); // works only in "ANSI" mode
then that will only compile if you're targeting the char version, because "Hello World" is a string of chars, so it's only compatible with the SetWindowTextA version. If you wanted the WCHAR version, you'd have to write:
SetWindowText(hwnd, L"Hello World"); // only works in "Unicode" mode
The L here means you want wide characters. (The L actually stands for long, but it's a different sense of long than the long pointers above.) When the compiler sees the L prefix on the string, it knows that string should be encoded as a series of wchar_ts rather than chars.
(Compilers targeting Windows use a two-byte value for wchar_t, which happens to be identical to what Windows defined a WCHAR. Compilers targeting other systems often use a four-byte value for wchar_t, which is what it really takes to hold a single Unicode code point.)
So if you want code that can compile either way, you need another macro to wrap the string literals. There are two to choose from: _T() and TEXT(). They work exactly the same way. The first comes from the compiler's library and the second from the OS's libraries. So you write your code like this:
SetWindowText(hwnd, TEXT("Hello World")); // compiles in either mode
If you're targeting chars, the macro is a no-op that just returns the regular string literal. If you're targeting WCHARs, the macro prepends the L.
So how do you tell the compiler that you want to target WCHAR? You define UNICODE and _UNICODE. The former is for the Windows APIs and the latter is for the compiler libraries. Make sure you never define one without the other.
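The machinery behind this boils down to something like the following simplified sketch (the real headers use internal helper macros, and <tchar.h> keys _T/_TEXT off _UNICODE rather than UNICODE):
// Simplified sketch of what the Windows and CRT headers do, not their literal contents.
#ifdef UNICODE
    typedef wchar_t TCHAR;                     // really defined in <winnt.h>
    #define TEXT(s)        L##s                // really an internal __TEXT helper
    #define SetWindowText  SetWindowTextW
#else
    typedef char TCHAR;
    #define TEXT(s)        s
    #define SetWindowText  SetWindowTextA
#endif

// So SetWindowText(hwnd, TEXT("Hello World")); expands to
//   SetWindowTextW(hwnd, L"Hello World")  when UNICODE is defined, and to
//   SetWindowTextA(hwnd, "Hello World")   when it is not.
// <tchar.h> does the same for _T()/_TEXT(), keyed off _UNICODE instead of UNICODE.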
My guess is you are compiling in Unicode mode.
Try enclosing your format string in the _T macro, which is designed to provide an always-correct method of providing constant string parameters, regardless of whether you're compiling in Unicode or ANSI mode:
out.Format(_T("\nInstall32 at %s\n"), tmp);
I'm sure this question gets asked a lot but I just want to make sure there's not a better way to do this.
Basically, I have a const char* which points to a null-terminated C string. I have another function which expects a const wchar_t* pointing to a string with the same characters.
For the time being, I have been trying to do it like this:
size_t newsize = strlen(myCString) + 1;
wchar_t * wcstring = new wchar_t[newsize];
size_t convertedChars = 0;
mbstowcs_s(&convertedChars, wcstring, newsize, myCString, _TRUNCATE);
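// ... use wcstring with the function expecting a const wchar_t* here ...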
delete[] wcstring;
I need to make these conversions in a lot of places since I'm dealing with 3rd party libraries which expect one or the other. Is this the recommended way to go about this?
What you're doing is pretty much the recommended way of doing it, assuming that your data is all ASCII. If you have non-ASCII data in there, you need to know what its encoding is: UTF-8, Windows-1252, any of the ISO 8859 variants, SHIFT-JIS, etc. Each one needs to be converted in a different way.
The only thing I would change would be to use mbstowcs instead of mbstowcs_s. mbstowcs_s is only available on Windows, while mbstowcs is a standard C99 function which is portable. Of course, if you'd like to avoid the CRT deprecation warnings with the Microsoft compiler without completely turning them off, it's perfectly fine to use a macro or #if test to use mbstowcs on non-Windows systems and mbstowcs_s on Windows systems.
You can also use mbstowcs to get the length of the converted string by first passing in NULL for the destination. That way, you can avoid truncation no matter how long the input string is; however, it does involve converting the string twice.
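A small sketch of that two-pass approach (assuming the source string is valid in the current locale):
#include <cstdlib>
#include <vector>

// Two-pass conversion: query the length first, then convert into a right-sized buffer.
std::vector<wchar_t> toWide(const char* narrow)
{
    size_t len = std::mbstowcs(NULL, narrow, 0);      // pass 1: length in wide chars, excluding the terminator
    if (len == (size_t)-1)
        return std::vector<wchar_t>();                // invalid multibyte sequence for the current locale
    std::vector<wchar_t> wide(len + 1);
    std::mbstowcs(&wide[0], narrow, len + 1);         // pass 2: the actual conversion, including the terminator
    return wide;
}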
For non-ASCII conversions, I recommend using libiconv.
You haven't said what encodings are involved. If you have non-multibyte strings, you can just use this:
std::string a("hello");
std::wstring b(s.begin(), s.end());
const wchar_t *wcString= b.c_str();
I have the following code:
http://privatepaste.com/8364a2a7b8/12345
But it only writes "c" (supposedly, conversion to LPBYTE leaves one byte only).
What's the proper way to handle GetModuleFileName and registry edit?
strlen((char*)szPath2)+1
This is most likely where your problem is. I bet your program is compiled in UNICODE mode. strlen only counts narrow (char) strings correctly; on a UTF-16 string it stops at the first zero byte. (The fact that you're having to cast from TCHAR to char is a big hint that something isn't right.)
To keep consistent with the usage of TCHAR and such, you should probably use _tcslen instead.
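For illustration, a sketch of the registry write with the corrected length calculation (hKey and the value name "InstallPath" are assumptions, not taken from the linked code):
#include <windows.h>
#include <tchar.h>

// hKey is assumed to be an already-opened registry key; the value name is illustrative.
void WriteModulePath(HKEY hKey)
{
    TCHAR szPath[MAX_PATH];
    GetModuleFileName(NULL, szPath, MAX_PATH);

    // RegSetValueEx wants the size in BYTES, including the terminating null,
    // so count TCHARs with _tcslen and multiply by sizeof(TCHAR).
    DWORD cbData = (DWORD)((_tcslen(szPath) + 1) * sizeof(TCHAR));
    RegSetValueEx(hKey, TEXT("InstallPath"), 0, REG_SZ, (const BYTE*)szPath, cbData);
}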