C string to wide C string - c++

I'm sure this question gets asked a lot but I just want to make sure there's not a better way to do this.
Basically, I have a const char* which points to a null-terminated C string. I have another function which expects a const wchar_t* pointing to a string with the same characters.
For the time being, I have been trying to do it like this:
size_t newsize = strlen(myCString) + 1;
wchar_t * wcstring = new wchar_t[newsize];
size_t convertedChars = 0;
mbstowcs_s(&convertedChars, wcstring, newsize, myCString, _TRUNCATE);
delete[] wcstring;
I need to make these conversions in a lot of places since I'm dealing with 3rd party libraries which expect one or the other. Is this the recommended way to go about this?

What you're doing is pretty much the recommended way of doing it, assuming that your data is all ASCII. If you have non-ASCII data in there, you need to know what its encoding is: UTF-8, Windows-1252, any of the ISO 8859 variants, SHIFT-JIS, etc. Each one needs to be converted in a different way.
The only thing I would change would be to use mbstowcs instead of mbstowcs_s. mbstowcs_s is only available on Windows, while mbstowcs is a standard C99 function which is portable. Of course, if you'd like to avoid the CRT deprecation warnings with the Microsoft compiler without completely turning them off, it's perfectly fine to use a macro of #if test to use mbstowcs on non-Windows systems and mbstowcs_s on Windows systems.
You can also use mbstowcs to get the length of the converted string by first passing in NULL for the destination. That way, you can avoid truncation no matter how long the input string is; however, it does involve converting the string twice.
For non-ASCII conversions, I recommend using libiconv.

You haven't said what encodings are involved. If you have non-multibyte strings, you can just use this:
std::string a("hello");
std::wstring b(s.begin(), s.end());
const wchar_t *wcString= b.c_str();

Related

Conversion from wstring to u16string and back (standard conform) in C++17 / C++20

My main platform is Windows which is the reason why I use internally UTF-16 (mostly BMP strings).
I would like to use console output for these strings.
Unfortunately there is no std::u16cout or std::u8cout so I need to use std::wcout. Therefore I must convert my u16strings to wstrings - what is the best (and easiest) way to do that?
On Windows I know that wstring points to UTF16 data, so I can create a simple std::u16string_view which uses the same data (no conversion).
But on Linux wstring is usually UTF32...
Is there a way to do that without macros and without things like assuming sizeof(wchar_t) == 2 => utf16?
There is nothing in the C++20 standard that converts wchar_t to char32_t and back. After all, wchar_t is supposed to be large enough to contain any supported code point.
And indeed everywhere Unicode above U+FFFF is supported, wchar_t is 32-bit, except on Windows (and in Java, but that's irrelevant). So yes, even today working with Unicode in a portable way is problematic, and sizeof(wchar_t)==2 or #ifdef _WIN32 both sound like legitimate workarounds.
Having said that, wcout still seamlessly works with wchar_t on all platforms regardless of the underlying encoding.
It is only if you cut wstrings or work with individual code points and you want to support code points beyond the basic plane, then you need to take surrogate pairs into account (which is pretty easy still, 0xD800–0xDBFF = first pair, 0xDC00–0xDFFF = second pair, don't cut in between).

How to tell if LPCWSTR text is numeric?

Entire string needs to be made of integers which as we know are 0123456789 I am trying with following function but it doesnt seem to work
bool isNumeric( const char* pszInput, int nNumberBase )
{
string base = "0123456789";
string input = pszInput;
return (::strspn(input.substr(0, nNumberBase).c_str(), base.c_str()) == input.length());
}
and the example of using it in code...
isdigit = (isNumeric((char*)text, 11));
It returns true even with text in the string
Presumably the issue is that text is actually LPCWSTR which is const wchar_t*. We have to infer this fact from the question title and the cast that you made.
Now, that cast is a problem. The compiler objected to you passing text. It said that text is not const char*. By casting you have not changed what text is, you simply lied to the compiler. And the compiler took its revenge.
What happens next is that you reinterpret the wide char buffer as being a narrow 8 bit buffer. If your wide char buffer has latin text, encoded as UTF-16, then every other byte will be zero. Hence the reinterpret cast that you do results in isNumeric thinking that the string is only 1 character long.
What you need to do is either:
Start using UTF-16 encoded wchar_t buffers in isNumeric.
Convert from UTF-16 to ANSI before calling isNumeric.
You should think about this carefully. It seems that at present you have a rather unholy mix of ANSI and UTF-16 in your program. You really ought to settle on a standard character encoding an use it consistently throughout. That is tenable internal to your program, but you will encounter external text that could use different encodings. Deal with that by converting at the boundary between your program and the outside world.
Personally I don't understand why you are using C strings at all. Surely you should be using std::wstring or std::string.

UNICODE, UTF-8 and Windows mess

I'm trying to implement text support in Windows with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way but that doesn't seem to be easily accomplished when considering the two platforms in question. I have spent a considerable amount of time reading up on UNICODE, UTF-8 (and other encodings), widechars and such and here is what I have come to understand so far:
UNICODE, as the standard, describes the set of characters that are mappable and the order in which they occur. I refer to this as the "what": UNICODE specifies what will be available.
UTF-8 (and other encodings) specify the how: How each character will be represented in a binary format.
Now, on windows, they opted for a UCS-2 encoding originally, but that failed to meet the requirements, so UTF-16 is what they have, which is also multi-char when necessary.
So here is the delemma:
Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).
Just do UTF-8
There are lots of support libraries for UTF-8 in every plaftorm, also some are multiplaftorm too. The UTF-16 APIs in Win32 are limited and inconsistent as you've already noted, so it's better to keep everything in UTF-8 and convert to UTF-16 at last moment. There are also some handy UTF-8 wrappings for the windows API.
Also, at application-level documents, UTF-8 is getting more and more accepted as standard. Every text-handling application either accepts UTF-8, or at worst shows it as "ASCII with some dingbats", while there's only few applications that support UTF-16 documents, and those that don't, show it as "lots and lots of whitespace!"
Correct. You will convert UTF-8 to UTF-16 for your Windows API calls.
Most of the time you will use regular string functions for UTF-8 -- strlen, strcpy (ick), snprintf, strtol. They will work fine with UTF-8 characters. Either use char * for UTF-8 or you will have to cast everything.
Note that the underscore versions like _mbstowcs are not standard, they are normally named without an underscore, like mbstowcs.
It is difficult to come up with examples where you actually want to use operator[] on a Unicode string, my advice is to stay away from it. Likewise, iterating over a string has surprisingly few uses:
If you are parsing a string (e.g., the string is C or JavaScript code, maybe you want syntax hilighting) then you can do most of the work byte-by-byte and ignore the multibyte aspect.
If you are doing a search, you will also do this byte-by-byte (but remember to normalize first).
If you are looking for word breaks or grapheme cluster boundaries, you will want to use a library like ICU. The algorithm is not simple.
Finally, you can always convert a chunk of text to UTF-32 and work with it that way. I think this is the sanest option if you are implementing any of the Unicode algorithms like collation or breaking.
See: C++ iterate or split UTF-8 string into array of symbols?
Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
Yes, that's correct. The *A function variants interpret the string parameters according to the currently active code page (which is Windows-1252 on most computers in the US and Western Europe, but can often be other code pages) and convert them to UTF-16. There is a UTF-8 code page, however AFAIK there isn't a way to programmatically set the active code page (there's GetACP to get the active code page, but not corresponding SetACP).
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
The mbs* family of functions is almost never used, in my experience. With the exception of mbstowcs, mbsrtowcs, and mbsinit, those functions are not standard C.
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).
I think that mbrtowc(3) would be the best option here for decoding a single code point of a multibyte string.
Overall, I think the best strategy for cross-platform Unicode compatibility is to do everything in UTF-8 internally using single-byte characters. When you need to call a Windows API function, convert it to UTF-16 and always call the *W variant. Most non-Windows platforms use UTF-8 already, so that makes using those a snap.
In Windows, you can call WideCharToMultiByte and MultiByteToWideChar to convert between UTF-8 string and UTF-16 string (wstring in Windows). Because Windows API do not use UTF-8, whenever you call any Windows API function that support Unicode, you have to convert string into wstring (Windows version of Unicode in UTF-16). And when you get output from Windows, you have to convert UTF-16 back to UTF-8. Linux uses UTF-8 internally, so you do not need such conversion. To make your code portable to Linux, stick to UTF-8 and provide something as below for conversion:
#if (UNDERLYING_OS==OS_WINDOWS)
using os_string = std::wstring;
std::string utf8_string_from_os_string(const os_string &os_str)
{
size_t length = os_str.size();
int size_needed = WideCharToMultiByte(CP_UTF8, 0, os_str, length, NULL, 0, NULL, NULL);
std::string strTo(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, os_str, length, &strTo[0], size_needed, NULL, NULL);
return strTo;
}
os_string utf8_string_to_os_string(const std::string &str)
{
size_t length = os_str.size();
int size_needed = MultiByteToWideChar(CP_UTF8, 0, str, length, NULL, 0);
os_string wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, str, length, &wstrTo[0], size_needed);
return wstrTo;
}
#else
// Other operating system uses UTF-8 directly and such conversion is
// not required
using os_string = std::string;
#define utf8_string_from_os_string(str) str
#define utf8_string_to_os_string(str) str
#endif
To iterate over utf8 strings, two fundamental functions you need are: one to calculate the number of bytes for an utf8 character and the another to determine if the byte is the leading byte of a utf8 character sequence. The following code provides a very efficient way to test:
inline size_t utf8CharBytes(char leading_ch)
{
return (leading_ch & 0x80)==0 ? 1 : clz(~(uint32_t(uint8_t(leading_ch))<<24));
}
inline bool isUtf8LeadingByte(char ch)
{
return (ch & 0xC0) != 0x80;
}
Using these functions, it should not be difficult to implement your own iterator over utf8 strings, one is for forwarding iterator, and another is for backward iterator.

The 'L' and 'LPCWSTR' in WIndows API

I've found that
NetUserChangePassword(0, 0, L"ab", L"cd");
changes the user password from ab to cd. However,
NetUserChangePassword(0, 0, (LPCWSTR) "ab", (LPCWSTR) "cd");
doesn't work. The returned value indicates invalid password.
I need to pass const char* as last two parameters for this function call. How can I do that? For example,
NetUserChangePassword(0, 0, (LPCWSTR) vs[0].c_str(), (LPCWSTR) vs[1].c_str());
Where vs is std::vector<std::string>.
Those are two totally different L's. The first is a part of the C++ language syntax. Prefix a string literal with L and it becomes a wide string literal; instead of an array of char, you get an array of wchar_t.
The L in LPCWSTR doesn't describe the width of the characters, though. Instead, it describes the size of the pointer. Or, at least, it used to. The L abbreviation on type names is a relic of 16-bit Windows, when there were two kinds of pointers. There were near pointers, where the address was somewhere within the current 64 KB segment, and there were far, or long pointers, which could point beyond the current segment. The OS required callers to provide the latter to its APIs, so all the pointer-type names use LP. Nowadays, there's only one type of pointer; Microsoft keeps the same type names so that old code continues to compile.
The part of LPCWSTR that specifies wide characters is the W. But merely type-casting a char string literal to LPCWSTR is not sufficient to transform those characters into wide characters. Instead, what happens is the type-cast tells the compiler that what you wrote really is a pointer to a wide string, even though it really isn't. The compiler trusts you. Don't type-cast unless you really know better than the compiler what the real types are.
If you really need to pass a const char*, then you don't need to type-cast anything, and you don't need any L prefix. A plain old string literal is sufficient. (If you really want to cast to a Windows type, use LPCSTR — no W.) But it looks like what you really need to pass in is a const wchar_t*. As we learned above, you can get that with the L prefix on the string literal.
In a real program, you probably don't have a string literal. The user will provide a password, or you'll read a password from some other external source. Ideally, you would store that password in a std::wstring, which is like std::string but for wchar_t instead of char. The c_str() method of that type returns a const wchar_t*. If you don't have a wstring, a plain array of wchar_t might be sufficient.
But if you're storing the password in a std::string, then you'll need to convert it into wide characters some other way. To do a conversion, you need to know what code page the std::string characters use. The "current ANSI code page" is usually a safe bet; it's represented by the constant CP_ACP. You'll use that when calling MultiByteToWideString to have the OS convert from the password's code page into Unicode.
int required_size = MultiByteToWideChar(CP_ACP, 0, vs[0].c_str(), vs[0].size(), NULL, 0);
if (required_size == 0)
ERROR;
// We'll be storing the Unicode password in this vector. Reserve at
// least enough space for all the characters plus a null character
// at the end.
std::vector<wchar_t> wv(required_size);
int result = MultiByteToWideChar(CP_ACP, 0, vs[0].c_str(), vs[0].size(), &wv[0], required_size);
if (result != required_size - 1)
ERROR;
Now, when you need a wchar_t*, just use a pointer to the first element of that vector: &wv[0]. If you need it in a wstring, you can construct it from the vector in a few ways:
// The vector is null-terminated, so use "const wchar_t*" constructor
std::wstring ws1 = &wv[0];
// Use iterator constructor. The vector is null-terminated, so omit
// the final character from the iterator range.
std::wstring ws2(wv.begin(), wv.end() - 1);
// Use pointer/length constructor.
std::wstring ws3(&wv[0], wv.size() - 1);
You have two problems.
The first is the practical problem - how to do this. You are confusing wide and narrow strings and casting from one to the other. A string with an L prefix is a wide string, where each character is two bytes (a wchar_t). A string without the L is a single byte (a char). You cannot cast from one to the other using the C-style cast (LPCWSTR) "ab" because you have an array of chars, and are casting it to a pointer to wide chars. It is simply changing the pointer type, not the underlying data.
To convert from a narrow string to a wide string, you would normally use MultiByteToWideChar. You don't mention what code page your narrow strings are in; you would probably pass in CP_ACP for the first parameter. However, since you are converting between a string and a wstring, you might be interested in other ways to convert (one, two). This will give you a wstring with your characters, not a string, and a wstring's .c_str() method returns a pointer to wchar_ts.
The second is the following misunderstanding:
I need to pass const char* as last two parameters for this function call. How can I do that?
No you don't. You need to pass a wide string, which you got above. Your approach to this (casting the pointer) indicates you probably don't know about different string types and character encodings, and this is something every software developer should know. So, on the assumption you're interested, hopefully you'll find the following references handy:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Unicode FAQ (covers things like 'What is Unicode?')
MSDN introduction to wide characters.
I'd recommend you investigate recompiling your application with UNICODE and using wide strings. Many APIs are defined in both narrow and wide versions, and normally this would mean you access the narrow version by default (you can access either the ANSI (narrow) or Wide versions of these APIs by directly calling the A or W version - they have A or W appended to their name, such as CreateWindowW - see the bottom of that page for the two names. You normally don't need to worry about this.) As far as I can tell, this API is always available as-is regardless of UNICODE, it's just it's only prototyped as wide.
C style casts as you've used here are a very blunt instrument. They assume you know exactly what you're doing.
You'll need to convert your ASCII or multi-byte strings into Unicode strings for the API. There might be a NetUserChangePasswordA function that takes the char * types you're trying to pass, try that first.
LPWSTR is defined wchar_t* (whcar_T is 2 byte char-type) which are interpreted differently then normal 1 byte chars.
LPCWSTR means you need to pass a wchar_t*.
if you change vs to std::vector<std::wstring> that will give you a wide char when you pass vs[0].c_str()
if you look at the example at http://msdn.microsoft.com/en-us/library/aa370650(v=vs.85).aspx you can see that they define UNICODE which is why they use the wchar_t.

What is the simplest way to convert char[] to/from tchar[] in C/C++(ms)?

This seems like a pretty softball question, but I always have a hard time looking up this function because there seem there are so many variations regarding the referencing of char and tchar.
The simplest way is to use the conversion macros:
CW2A
CA2W
etc...
MSDN
TCHAR is a Microsoft-specific typedef for either char or wchar_t (a wide character).
Conversion to char depends on which of these it actually is. If TCHAR is actually a char, then you can do a simple cast, but if it is truly a wchar_t, you'll need a routine to convert between character sets. See the function MultiByteToWideChar()
MultiByteToWideChar but also see "A few of the gotchas of MultiByteToWideChar".
There are a few answers in this post as well, especially if you're looking for a cross-platform solution:
UTF8 to/from wide char conversion in STL
Although in this particular situation I think the TChar is a wide character I'll only need to do the conversion if it isn't. which I gotta check somehow.
if (sizeof(TCHAR) != sizeof(wchar_t))
{ .... }
The cool thing about that is both sizes of the equals are constants, which means that the compiler will handle (and remove) the if(), and if they are equal, remove everything inside the braces
Here is the CPP code that duplicates _TCHAR * argv[] to char * argn[].
http://www.wincli.com/?p=72
If you adopting old code to Windows, simple use define mentioned in the code as optional.
You can put condition in your code
ifdef _UNICODE
{ //DO LIKE TCHAR IS WIDE CHAR } ELSE { //DO LIKE TCHAR IS CHAR }
I realize this is an old thread, but it didn't get me the "right" answer, so am adding it now.
The way this appears to be done now is to use the TEXT macro. The example for FindFirstFile at msdn points this out.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%29.aspx