C++ Arabic UTF8 string to CString

In a Visual Studio 2008 MFC project I have to manage UTF-8 strings containing Arabic city names. After searching online, I wrote this little piece of code:
CString MyClass::convertString(string input) {
int l = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, NULL, 0);
wchar_t *str = new wchar_t[l];
int r = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, str, l);
CString output = str;
delete[] str; // allocated with new[], so it must be released with delete[]
return output;
}
When I convert a string it remains the same, and if I print the two strings the result is identical.
What am I doing wrong?
Thanks in advance.

You don't want to convert strings to UTF-8 for display purposes. There is no UTF-8 charset that will allow you to display them correctly. If you already have them in Unicode, just keep them in Unicode. I would build your application as Unicode and avoid MBCS if you can; it makes life easier. Otherwise, to display those Arabic strings, you would have to convert them to the Arabic code page and then use an Arabic font/charset to display them.

Thanks for all the replies. I've found a solution: the input string was not encoded in UTF-8 (I should have checked that before posting on Stack Overflow). I then edited the code, changing the return type from CString to wstring.
wstring MyClass::convertString(string input) {
int l = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, NULL, 0);
wchar_t *str = new wchar_t[l];
int r = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, str, l);
wstring output = wstring(str);
delete[] str;
return output;
}
Now everything works fine. Thanks.
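Since the root cause here was input that was not actually UTF-8, it can help to check the bytes before converting. Below is a minimal sketch of such a check in portable C++. The name looksLikeUtf8 is made up for illustration and is not part of MFC or the Windows API; it only tests whether the bytes are well-formed UTF-8, it does not guess which other encoding they might be in.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
bool looksLikeUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }   // plain ASCII byte

        std::size_t extra = 0;        // continuation bytes expected
        unsigned long cp = 0;         // decoded code point
        unsigned long minValue = 0;   // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; minValue = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minValue = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minValue = 0x10000; }
        else return false;            // stray continuation or invalid lead byte

        if (i + extra >= n) return false;     // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minValue) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                   // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate range
        i += extra + 1;
    }
    return true;
}
```

If this returns false for a city name, the bytes are probably in a legacy code page, and passing CP_UTF8 to MultiByteToWideChar would not give the expected text.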

Related

Converting shift-jis encoded file to utf-8 in c++

I am trying to convert a Shift-JIS file to UTF-8 with the code below, but when I open the output file it contains corrupted characters. It looks like something is missing here. Any thoughts?
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen];
fread(lpszBuf, 1, nLen, shiftJisFile);
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, pUTF16, utf16size);
wstring str(pUTF16);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out);
string result = string();
result.resize(WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, NULL, 0, 0, 0));
char* ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, ptr, result.size(), 0, 0);
File << result;
File.close();
There are multiple problems.
The first problem is that when you are writing the output file, you need to set it to binary for the same reason you need to do so when reading the input.
fstream File("filepath", std::ios::out | std::ios::binary);
The second problem is that when you are reading the input file, you are only reading the bytes of the input stream and treating them like a string. However, those bytes do not have a terminating null character. If you call MultiByteToWideChar with a -1 length, it infers the input string length from the terminating null character, which is missing in your case. That means both utf16size and the contents of pUTF16 are already wrong. Add the terminator manually after reading the file:
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
The last problem is that you are using CP_ACP. That means "the system's current ANSI code page", which is only Shift-JIS on a machine configured for Japanese. In your question, you were specifically asking how to convert Shift-JIS. The code page Windows uses for its closest equivalent to what is commonly called "Shift-JIS" is 932 (you can look that up on Wikipedia, for example). So use 932 instead of CP_ACP:
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
Additionally, there is no reason to create wstring str(pUTF16). Just use pUTF16 directly in the WideCharToMultiByte calls.
Also, I'm not sure how kosher char *ptr = &result[0] is. I personally would not create a string specifically as a buffer for this.
Here is the corrected code. I would personally not write it this way, but I don't want to impose my coding ideology on you, so I made only the changes necessary to fix it:
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out | std::ios::binary);
string result;
result.resize(WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, NULL, 0, 0, 0));
char *ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, ptr, result.size(), 0, 0);
File << ptr;
File.close();
Also, you have a memory leak -- lpszBuf and pUTF16 are not cleaned up.
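One way to avoid both the leak and the missing-terminator problem is to keep the file contents in a container that frees itself and knows its own length. The following is only a sketch in portable C++; readAllBytes and writeAllBytes are hypothetical helper names, not functions from the question or from any library.

```cpp
#include <fstream>
#include <ios>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file as raw bytes in binary mode.
// Returns an empty vector if the file cannot be opened.
std::vector<char> readAllBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Hypothetical helper: writes raw bytes in binary mode, so no newline
// translation is applied to the data.
void writeAllBytes(const std::string& path, const std::vector<char>& bytes) {
    std::ofstream out(path, std::ios::binary);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

With a buffer like this, the byte count (static_cast<int>(bytes.size())) can be passed to MultiByteToWideChar as the input length instead of -1, so no terminating null is needed and nothing has to be deleted by hand.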
You could try using std::locale to perform this conversion:
#include <filesystem>
#include <fstream>
#include <locale>
#include <string>
namespace fs = std::filesystem;
void convert(const fs::path& inName, const fs::path& outName)
{
std::wifstream in{inName};
in.imbue(std::locale{".932"}); // or "ja_JP.SJIS"
if (in) {
std::wofstream out{outName};
out.imbue(std::locale{".utf-8"});
std::wstring line;
while (getline(in, line)) {
out << line << L'\n';
}
}
}
Note that locale names are platform-specific; I think I used the proper one for Windows.
Update: I've tested this on my Windows 10 machine with MSVC 19.29.30145 and it works perfectly. I used a wiki page to get some valid Japanese text and used Notepad++ to save this text in the proper encoding (Shift-JIS).
I also used Beyond Compare to verify the results.
Note that I used a similar method for Korean and it worked nicely.
wstring str(pUTF16); - pUTF16 there does not end with a null character. It should be wstring str(pUTF16, utf16size);

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState(), which I then convert to Unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it is in UCS-2 LE, but I want it to be in UTF-8. I believe ToUnicodeEx() returns its result in UCS-2 LE, and it only works with wide chars. Is there any way to do this with either fstream or wfstream by somehow converting to UTF-8 first? Thanks!
You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN]; // input buffer
char output_buffer[OUT_LEN]; // output buffer where the utf-8 string will be written
int num = WideCharToMultiByte(
CP_UTF8,
0,
buffer,
number_of_characters_in_buffer, // or -1 if buffer is null-terminated
output_buffer,
size_in_bytes_of_output_buffer,
NULL,
NULL);
The Windows API generally refers to UTF-16 as "Unicode", which is a little confusing. This means most Unicode Win32 functions operate on, or return, UTF-16 strings.
So ToUnicodeEx returns a UTF-16 string.
If you need it as UTF-8, you'll need to convert it using WideCharToMultiByte.
Thank you for all the help. I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8.
This function converts wide char arrays to a UTF-8 string:
// Takes in pointer to wide char array and length of the array
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
if (nChars == 0)
{
return std::string(); // nothing to convert, or the call failed (u8"" is not a char string in C++20)
}
std::string newBuffer;
newBuffer.resize(nChars);
WideCharToMultiByte(CP_UTF8, 0, buffer, len, &newBuffer[0], nChars, NULL, NULL); // write into the string's own storage, no const_cast needed
return newBuffer;
}
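For anyone curious what a UTF-16 to UTF-8 conversion involves, here is a rough portable sketch of the encoding rules. It is not how WideCharToMultiByte is implemented; utf16ToUtf8 and appendUtf8 are made-up names, std::u16string stands in for the 16-bit wchar_t buffer used on Windows, and replacing an unpaired surrogate with U+FFFD is simply this sketch's choice.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends one Unicode code point to `out` as UTF-8.
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts UTF-16 code units to a UTF-8 byte string.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (isHigh || isLow) {
            cp = 0xFFFD;   // unpaired surrogate: substitute the replacement character
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

Because a single key press can in principle produce more than one UTF-16 code unit, converting the whole buffer at once, as ConvertCharacters does, is safer than converting one wchar_t at a time.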

C++ Unicode Issue

I'm having a bit of trouble with handling unicode conversions.
The following code outputs this into my text file.
HELLO??O
std::string test = "HELLO";
std::string output;
int len = WideCharToMultiByte(CP_OEMCP, 0, (LPCWSTR)test.c_str(), -1, NULL, 0, NULL, NULL);
char *buf = new char[len];
int len2 = WideCharToMultiByte(CP_OEMCP, 0, (LPCWSTR)test.c_str(), -1, buf, len, NULL, NULL);
output = buf;
std::wofstream outfile5("C:\\temp\\log11.txt");
outfile5 << test.c_str();
outfile5 << output.c_str();
outfile5.close();
But as you can see, output is just a unicode conversion from the test variable. How is this possible?
Check whether len is correct after the first (measuring) call. In general, you should not cast test.c_str() to LPCWSTR: test is a char string, not a wchar_t string. You could cast it to LPCSTR (note the missing 'W'); the WinAPI distinguishes between the two. You really should be using wstring if you want to keep wide chars in it. After re-reading your code: test should be a wstring, and then you can pass test.c_str() as an LPCWSTR safely.
After reading the Microsoft wstring reference, I changed
std::string test = "HELLO";
to
std::wstring test = L"HELLO";
The string was then output correctly and I got
HELLOHELLO
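To see why the cast produced "??O", it helps to look at what the narrow bytes become when they are read two at a time as 16-bit units. This is just an illustrative sketch; bytesSeenAsUtf16LE is a made-up name, and it assumes a little-endian machine with a 16-bit wchar_t, as on Windows.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pairs up narrow bytes the way they would be read
// through a 16-bit wchar_t pointer on a little-endian machine.
std::vector<std::uint16_t> bytesSeenAsUtf16LE(const char* data, std::size_t byteCount) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < byteCount; i += 2) {
        const std::uint16_t lo = static_cast<unsigned char>(data[i]);
        const std::uint16_t hi = static_cast<unsigned char>(data[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}
```

For the six bytes of "HELLO" plus its terminator this gives 0x4548, 0x4C4C and 0x004F. The first two are CJK code points that an OEM code page most likely cannot represent, which would explain the two question marks, and the third is a plain 'O'. Anything read after that lies outside the string, so the original code only stopped by luck.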

Handling Hunspell suggestions with special characters

I've integrated Hunspell in an unmanaged C++ app on Windows 7 using Visual Studio 2010.
I've got spell checking and suggestions working for English, but now I'm trying to get things working for Spanish and hitting some snags. Whenever I get suggestions for Spanish the suggestions with accent characters are not translating properly to std::wstring objects.
Here is an example of a suggestion that comes back from the Hunspell->suggest method:
Here is the code I'm using to translate that std::string to a std::wstring
std::wstring StringToWString(const std::string& str)
{
std::wstring convertedString;
int requiredSize = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, 0, 0);
if(requiredSize > 0)
{
std::vector<wchar_t> buffer(requiredSize);
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &buffer[0], requiredSize);
convertedString.assign(buffer.begin(), buffer.end() - 1);
}
return convertedString;
}
And after I run that through I get this, with the funky character on the end.
Can anyone help me figure out what could be going on with the conversion here? I have a guess that it's related to the negative char returned from hunspell, but don't know how I can convert that to something for the std::wstring conversion code.
It looks like the output of Hunspell is ASCII with code page 852.
Use 852 instead of CP_UTF8 http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
Or configure Hunspell to return UTF8.
It looks like the output of Hunspell is encoded in code page 28591 (ISO 8859-1 Latin 1; Western European (ISO)), which I found by looking at the Hunspell default settings for the Unix command line utility.
Changing the CP_UTF8 to 28591 worked for me.
// Updated code page to 28591 from CP_UTF8
std::wstring StringToWString(const std::string& str)
{
std::wstring convertedString;
int requiredSize = MultiByteToWideChar(28591, 0, str.c_str(), -1, 0, 0);
if(requiredSize > 0)
{
std::vector<wchar_t> buffer(requiredSize);
MultiByteToWideChar(28591, 0, str.c_str(), -1, &buffer[0], requiredSize);
convertedString.assign(buffer.begin(), buffer.end() - 1);
}
return convertedString;
}
Here is a list of code pages from MSDN that helped me find the correct code page integer.
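As a side note on the "negative char" guess in the question: ISO 8859-1 maps each byte value directly to the Unicode code point with the same number, so that particular conversion can also be sketched without the Windows API. latin1ToWide below is a hypothetical helper, shown only to illustrate the mapping and the sign issue, and it assumes the input really is ISO 8859-1.

```cpp
#include <string>

// Hypothetical helper: widens ISO 8859-1 bytes to wchar_t, one code point per byte.
std::wstring latin1ToWide(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        // Convert through unsigned char first. Where plain char is signed,
        // bytes above 0x7F are negative, and widening them directly would
        // sign-extend into the wrong code point.
        out += static_cast<wchar_t>(static_cast<unsigned char>(c));
    }
    return out;
}
```

A negative char is therefore not a problem in itself; it is just a byte above 0x7F, and what it means depends entirely on the code page used to interpret it.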

How to output utf8 encoded characters normally in c/c++ console application?

Here's what I'm getting now by wprintf:
1胩?鳧?1敬爄汯?瑳瑡獵猆慴畴??
Is utf8 just not supported by windows?
No, Windows doesn't support printing UTF-8 to the console.
When Windows says "Unicode", it means UTF-16. You need to use MultiByteToWideChar to convert from UTF-8 to UTF-16. Something like this:
const char* text = "My UTF-8 text\n"; // string literals are const
int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
wchar_t *unicode_text = new wchar_t[len];
MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
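wprintf(L"%ls", unicode_text); // %ls is the standard specifier for a wide string
delete[] unicode_text; // release the buffer allocated with new[]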
wprintf expects a UTF-16 encoded string. Use MultiByteToWideChar with the CP_UTF8 code page to do the conversion (and don't blindly cast from char* to wchar_t*).
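To make it clearer why a cast cannot work and a real conversion is needed, here is a rough portable sketch of the decoding that a UTF-8 to UTF-16 conversion performs. It is not the Windows implementation; utf8ToUtf16 is a made-up name, std::u16string stands in for a 16-bit wchar_t buffer, and the function assumes the input is already well-formed UTF-8 (it does no validation beyond stopping at a truncated sequence).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into UTF-16 code units.
// Assumption: the input is valid UTF-8; malformed bytes give garbage output.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;   // continuation bytes that follow the lead byte
        char32_t cp = lead;      // single-byte (ASCII) case by default
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }

        if (i + extra >= in.size()) break;   // truncated sequence: stop early
        for (std::size_t k = 1; k <= extra; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += extra + 1;

        if (cp >= 0x10000) {
            // Code points above U+FFFF need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
    }
    return out;
}
```

The number of bytes per character varies in UTF-8 and the bits have to be regrouped, which is why reinterpreting the same memory as wchar_t* prints unrelated characters like the ones in the question.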