Number of bytes of CString in C++ - c++

I have a Unicode string stored in CString and I need to know the number bytes this string takes in UTF-8 encoding. I know CString has a method getLength(), but that returns number of characters, not bytes.
I tried (beside other things) converting to char array, but I get (logically, I guess) only array of wchar_t, so this doesn't solve my problem.
To be clear about my goal. For the input lets say "aaa" I want "3" as output (since "a" takes one byte in UTF-8). But for the input "āaa", I'd like to see output "4" (since ā is two byte character).
I think this has to be quite common request, but even after 1,5 hours of search and experimenting, I couldn't find the correct solution.
I have very little experience with Windows programming, so maybe I left out some crucial information. If you feel like that, please let me know, I'll add any information you request.

As your CString contains a series of wchar_t, you can just use WideCharToMultiByte with the output charset as CP_UTF8. The function will return the number of bytes written to the output buffer, or the length of the UTF-8 encoded string
LPWSTR instr;
char outstr[MAX_OUTSTR_SIZE];
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, outstr, MAX_OUTSTR_SIZE, NULL, NULL);
If you don't need the output string, you can simply set the output buffer size to 0
cbMultiByte
Size, in bytes, of the buffer indicated by lpMultiByteStr. If this parameter is set to 0, the function returns the required buffer size for lpMultiByteStr and makes no use of the output parameter itself.
In that case the function will return the number of bytes in UTF-8 without really outputting anything
int utf8_len = WideCharToMultiByte(CP_UTF8, 0, instr, -1, NULL, 0, NULL, NULL);
If your CString is really CStringA, i.e. _UNICODE is not defined, then you need to use Multi­Byte­To­Wide­Char to convert the string to UTF-16 and then convert from UTF-16 to UTF-8 with Wide­Char­To­Multi­byte. See How do I convert an ANSI string directly to UTF-8? But new code should never be compiled without Unicode support anyway

Related

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState() which I then convert to unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it's in UCS-2 LE, when I want it to be in UTF-8 format. I believe ToUnicodeEx() is returning it in UCS-2 LE format, it also only works with wide chars. Is there any way to do this using either fstream or wfstream by somehow converting into UTF-8 first? Thanks!
You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN]; // input buffer
char output_buffer[OUT_LEN]; // output buffer where the utf-8 string will be written
int num = WideCharToMultiByte(
CP_UTF8,
0,
buffer,
number_of_characters_in_buffer, // or -1 if buffer is null-terminated
output_buffer,
size_in_bytes_of_output_buffer,
NULL,
NULL);
Windows API generally refers to UTF-16 as unicode which is a little confusing. This means most unicode Win32 function calls operate on or give utf-16 strings.
So ToUnicodeEx returns a utf-16 string.
If you need this as utf 8 you'll need to convert it using WideCharToMultiByte
Thank you for all the help, I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8 here.
This function converts wide char arrays to a UTF-8 string:
// Takes in pointer to wide char array and length of the array
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
if (nChars == 0)
{
return u8"";
}
std::string newBuffer;
newBuffer.resize(nChars);
WideCharToMultiByte(CP_UTF8, 0, buffer, len, const_cast<char*>(newBuffer.c_str()), nChars, NULL, NULL);
return newBuffer;
}

Set Registry Value to a Wide Character String (WCHAR) in C++

I'm trying to add a wide character string to registry in C++. The problem is that the RegSetValueEx() function does not support wide chars, it only supports BYTE type (BYTE = unsigned char).
WCHAR myPath[] = "C:\\éâäà\\éâäà.exe"
RegSetValueExA(HKEY_CURRENT_USER, "MyProgram", 0, REG_SZ, myPath, sizeof(myPath)); // error: cannot convert argument 5 from WCHAR* to BYTE*
And please don't tell me I should convert WCHAR to BYTE because characters such as é and â can't be stored as 8 bit characters.
I'm sure this is possible because I tried opening regedit and adding a new key with value C:\\éâäà\\éâäà.exe and it worked. I wonder how other programs can add themselves to startup on a Russian or Chinese computer.
Is there another way to do so? Or is there a way to format wide character path using wildcards?
Edit: The Unicode version of the function RegSetValueExW() only changes the type of the second argument.
You are calling RegSetValueExA() when you should be calling RegSetValueExW() instead. But in either case, RegSetValueEx() writes bytes, not characters, that is why the lpData parameter is declared as BYTE*. Simply type-cast your character array. The REG_SZ value in the dwType parameter will let RegSetValueEx() know that the bytes represent a Unicode string. And make sure to include the null terminator in the value that you pass to the cbData parameter, per the documentation:
cbSize [in]
The size of the information pointed to by the lpData parameter, in bytes. If the data is of type REG_SZ, REG_EXPAND_SZ, or REG_MULTI_SZ, cbData must include the size of the terminating null character or characters.
For example:
WCHAR myPath[] = L"C:\\éâäà\\éâäà.exe";
RegSetValueExW(HKEY_CURRENT_USER, L"MyProgram", 0, REG_SZ, (LPBYTE)myPath, sizeof(myPath));
Or:
LPCWSTR myPath = L"C:\\éâäà\\éâäà.exe";
RegSetValueExW(HKEY_CURRENT_USER, L"MyProgram", 0, REG_SZ, (LPCBYTE)myPath, (lstrlenW(myPath) + 1) * sizeof(WCHAR));
That being said, you should not be writing values to the root of HKEY_CURRENT_USER itself. You should be writing to a subkey instead, eg:
WCHAR myPath[] = L"C:\\éâäà\\éâäà.exe";
if (RegCreateKeyEx(HKEY_CURRENT_USER, L"Software\\MyProgram", 0, NULL, REG_OPTION_NON_VOLATILE, KEY_SET_VALUE, NULL, &hKey, NULL) == 0)
{
RegSetValueExW(hKey, L"MyValue", 0, REG_SZ, (LPBYTE)myPath, sizeof(myPath));
RegCloseKey(hKey);
}
It seems to me you're trying to use the narrow/non-wide-char version of that function, which will only support ASCII. How about trying RegSetValueExW? Maybe you should also look up how the Windows API tries to supports ASCII and UNICODE as transparently as possible.
Edit: The Unicode version of the function RegSetValueExW() only changes the type of the second argument.
No it does not.
REG_SZ: A null-terminated string. This will be either a Unicode or an ANSI string, depending on whether you use the Unicode or ANSI functions.
From here:
https://learn.microsoft.com/en-us/windows/win32/sysinfo/registry-value-types

LPWSTR to wstring c++

I would like to read utf-8 test from a .dll string table.
something like this
LPWSTR nnW;
LoadStringW(hMod, id, nnW, MAX_PATH);
and after that I would like to convert the LPWSTR nnW to std::wstring nnWstring.
I tried in this way:
LPWSTR nnW;
LoadStringW(hMod, id, nnW, MAX_PATH);
const int length = MultiByteToWideChar(CP_UTF8,
0, // no flags required
(LPCSTR)nnW,
-1, // automatically determine length
NULL,
0);
std::wstring nnWstring(length, L'\0');
if (!MultiByteToWideChar(CP_UTF8,
0,
(LPCSTR)nnW,
-1,
&nnWstring[0],
length))
MessageBoxW(NULL, (LPCWSTR)nnWstring.c_str(), L"wstring", MB_OK | MB_ICONERROR);
After that in the MessageBoxW only shows the first letter.
No conversion or copying needed.
std::wstring nnWString(MAX_PATH, 0);
nnWString.resize(LoadStringW(hMod, id, &nnWString[0], nnWString.size());
Note: Your original code causes undefined behavior, because it writes using an uninitialized pointer. Surely not what you wanted.
Here's another variation:
http://msmvps.com/blogs/gdicanio/archive/2010/01/05/stl-strings-loading-from-resources.aspx
I would like to read utf-8 test from a .dll string table. something like this
Generally, string tables in Windows are UTF-16. You're trying to put UTF-8 data into one. The UTF-8 data is being treated like "extended" ASCII, so each byte is being expanded to two bytes with zero bytes between them.
You should probably put UTF-16 data in the string table directly.
If you must store UTF-8 data in the resources, you can put it into an RCDATA resource and use the lower-level resource functions to get the data out. Then you'll have to convert from UTF-8 to UTF-16 to store it in a wstring.

c++ string to utf8 valid string using utf8proc

I have a std::string output. Using utf8proc i would like to transform it into an valid utf8 string.
http://www.public-software-group.org/utf8proc-documentation
typedef int int32_t;
#define ssize_t int
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options)
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned.
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!
So first, how do I add an extra byte at the end? Then how do I convert from std::string to int32_t *buffer?
This does not work:
std::string g = output();
fprintf(stdout,"str: %s\n",g.c_str());
g += " "; //add an extra byte??
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0);
fprintf(stdout,"strutf8: %s\n",g.c_str());
You very likely don't actually want utf8proc_reencode() - that function takes a valid UTF-32 buffer and turns it into a valid UTF-8 buffer, but since you say you don't know what encoding your data is in then you can't use that function.
So, first you need to figure out what encoding your data is actually in. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8 with utf8::is_valid(g.begin(), g.end()). If that's true, you're done!
If false, things get complicated...but ICU ( http://icu-project.org/ ) can help you; see http://userguide.icu-project.org/conversion/detection
Once you somewhat reliably know what encoding your data is in, ICU can help again with getting it to UTF-8. For example, assuming your source data g is in ISO-8859-1:
UErrorCode err = U_ZERO_ERROR; // check this after every call...
// CONVERT FROM ISO-8859-1 TO UChar
UConverter *conv_from = ucnv_open("ISO-8859-1", &err);
std::vector<UChar> converted(g.size()*2); // *2 is usually more than enough
int32_t conv_len = ucnv_toUChars(conv_from, &converted[0], converted.size(), g.c_str(), g.size(), &err);
converted.resize(conv_len);
ucnv_close(conv_from);
// CONVERT FROM UChar TO UTF-8
g.resize(converted.size()*4);
UConverter *conv_u8 = ucnv_open("UTF-8", &err);
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], g.size(), &converted[0], converted.size(), &err);
g.resize(u8_len);
ucnv_close(conv_u8);
after which your g is now holding UTF-8 data.

How to find if a character belongs to a particular codepage using c++ or calling winapi

How can we find if a character belongs to a particular codepage?
or How can we determine whether a charcter fits into currently active IME for an application.
First, Convert your UTF-8 string of characters to UTF-16 using MultiByteToWideChar
Now, reverse the process using WideCharToMultiByte passing the desired codepage as the first parameter.
Use the WC_ERR_INVALID_CHARS flag and WideCharToMultiByte will fail outright if any invalid characters are used. If you want to know which characters are not represented in the target codepage, use the lpDefaultChar, and lpUsedDefaultChar parameters.
LPCWSTR pszUtf16; // converted from utf8 source character
UINT nTargetCP = CP_ACP;
BOOL fBadCharacter = FALSE;
if(WideCharToMultiByte(nTargetCP,WC_NO_BEST_FIT_CHARS,pszUtf16,NULL,0,NULL,&fBadCharacter)
{
if(fBadCharacter)
{
// at least one character in the string was not represented in nTargetCP
}
}
The two previous answers have correctly suggested using MultiByteToWideChar then WideCharToMultiByte to translate your UTF-8 character to UTF-16, then to the current Windows codepage (CP_ACP). Check the result of WideCharToMultiByte to see if the conversion was successful.
What wasn't clear from the original question, is that you are having a particular issue with Hindi. For this language, your question is meaningless because there is no Windows ANSI codepage for Hindi, as Chris Becke pointed out. Therefore, you can never convert a Hindi character to CP_ACP, and WideCharToMultiByte will always fail.
To use Hindi on Windows, as far as I understand it, you must be a Unicode app that calls Unicode APIs.
Using the windows functions WideCharToMultiByte and MultiByteToWideChar you can convert between UTF-8 and 16-bit Unicode characters. The functions have arguments to specify the code page and to specify the behavior if an invalid character is encountered.
Thanks Chris..I am running the following code
#define CP_HINDI 0
#define CP_JAPANESE 932
#define CP_ENGLISH 1252
wchar_t wcsStringJapanese = 'あ';
wchar_t wcsStringHindi = 'र';
wchar_t wcsStringEnglish = 'A';
int main()
{
BOOL usedDefaultCharacter = FALSE;
/* Test for ENGLISH */
WideCharToMultiByte( CP_ENGLISH,
0, &wcsStringEnglish,
-1,
NULL,
0,
NULL,
&usedDefaultCharacter);
printf("usedDefaultCharacters for English? %d \n",usedDefaultCharacter);
usedDefaultCharacter = FALSE;
/*TEST FOR JAPANESE */
WideCharToMultiByte( CP_JAPANESE,
0,
&wcsStringJapanese,
-1,
NULL,
0,
NULL,
&usedDefaultCharacter);
printf("usedDefaultCharacters for Japanese? %d \n",usedDefaultCharacter);
//TEST FOR HINDI
usedDefaultCharacter = FALSE;
WideCharToMultiByte( CP_HINDI,
0,
&wcsStringHindi,
-1,
NULL,
0,
NULL,
&usedDefaultCharacter);
printf("usedDefaultCharacters for Hindi? %d \n",usedDefaultCharacter);
}
The above code returns:
usedDefaultCharacters for English? 0
usedDefaultCharacters for Japanese? 0
usedDefaultCharacters for Hindi? 1
The third line is incorrect as the Codepage for Hindi is 0 , and the string passed consists of Hindi Character and still the usedDefaultChar is set to 1 .. which should not be the case.