Equivalent wstring operation? - c++

i'm trying to do the same operation, but with a compatible code for visual c++ 2010:
wstring widen(string Str) {
const size_t wcharCount = Str.size() + 1;
vector<wchar_t> Buffer(wcharCount);
return wstring{
Buffer.data(),(size_t)MultiByteToWideChar(CP_UTF8, 0, Str.c_str(), -1, Buffer.data(), (int)wcharCount);
};
}
Specifically this:
return wstring{
Buffer.data(),(size_t)MultiByteToWideChar(CP_UTF8, 0, Str.c_str(), -1, Buffer.data(), (int)wcharCount);
};
How do i return a wstring, but with another way... because i'm getting errors, because the compiler i must use is older.
I also want to learn more about converting the code, compatible with older versions of Microsoft visual c++ (this case here, i want to compile this in 2010 version (and i have it)).
EDIT:
As an useful to track errors:
|Line 5| error C2143: syntax error : missing ';' before '{'|
|Line 5| error C2275: 'std::wstring' : illegal use of this type as an expression|

For older (pre-C++11) compilers, use parenthesis (return wstring(...)) instead of braces (return wstring{...}), and also use std::vector::operator[] to access the allocated array, not std::vector::data() (which didn’t exist yet), eg:
return wstring(
&Buffer[0], MultiByteToWideChar(CP_UTF8, 0, Str.c_str(), -1, &Buffer[0], wcharCount)
);
That being said, the source std::string’s size is the wrong size to use for the std::vector. Call MultiByteToWideChar() twice - call it once with a NULL output buffer to calculate the necessary size, and then call it again to write to the buffer. And, you should be using the std::string’s actual size instead of -1 for the source buffer size.
wstring widen(const string &Str) {
const int wcharCount = MultiByteToWideChar(CP_UTF8, 0, Str.c_str(), Str.size(), NULL, 0);
vector<wchar_t> Buffer(wcharCount);
return wstring(
&Buffer[0], MultiByteToWideChar(CP_UTF8, 0, Str.c_str(), Str.size(), &Buffer[0], wcharCount)
);
}
Note that in C++11 and later, it is safe to use a pre-sized std::wstring directly as the destination buffer, you don’t need the std::vector at all, because std::wstring is guaranteed to use a single contiguous block of memory for its character data. And even in earlier compilers, this is usually safe in practice (though NOT guaranteed) because most implementations use a contiguous block anyway for efficiency:
wstring widen(const string &Str) {
const int wcharCount = MultiByteToWideChar(CP_UTF8, 0, Str.c_str(), Str.size(), NULL, 0);
wstring Buffer;
if (wcharCount > 0) {
Buffer.resize(wcharCount);
MultiByteToWideChar(CP_UTF8, 0, Str.c_str(), Str.size(), &Buffer[0]/* or Buffer.data() in C++17 */, wcharCount);
}
return Buffer;
}

Related

Converting shift-jis encoded file to to utf-8 in c++

I am trying with below code to convert from shift-jis file to utf-8, but when we open the output file it has corrupted characters, looks like something is missed out here, any thoughts?
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen];
fread(lpszBuf, 1, nLen, shiftJisFile);
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, pUTF16, utf16size);
wstring str(pUTF16);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out);
string result = string();
result.resize(WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, NULL, 0, 0, 0));
char* ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, ptr, result.size(), 0, 0);
File << result;
File.close();
There are multiple problems.
The first problem is that when you are writing the output file, you need to set it to binary for the same reason you need to do so when reading the input.
fstream File("filepath", std::ios::out | std::ios::binary);
The second problem is that when you are reading the input file, you are only reading the bytes of the input stream and treat them like a string. However, those bytes do not have a terminating null character. If you call MultiByteToWideChar with a -1 length, it infers the input string length from the terminating null character, which is missing in your case. That means both utf16size and the contents of pUTF16 are already wrong. Add it manually after reading the file:
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
The last problem is that you are using CP_ACP. That means "the current code page". In your question, you were specifically asking how to convert Shift-JIS. The code page Windows uses for its closes equivalent to what is commonly called "Shift-JIS" is 932 (you can look that up on wikipedia for example). So use 932 instead of CP_ACP:
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
Additionally, there is no reason to create wstring str(pUTF16). Just use pUTF16 directly in the WideCharToMultiByte calls.
Also, I'm not sure how kosher char *ptr = &result[0] is. I personally would not create a string specifically as a buffer for this.
Here is the corrected code. I would personally not write it this way, but I don't want to impose my coding ideology on you, so I made only the changes necessary to fix it:
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out | std::ios::binary);
string result;
result.resize(WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, NULL, 0, 0, 0));
char *ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, ptr, result.size(), 0, 0);
File << ptr;
File.close();
Also, you have a memory leak -- lpszBuf and pUTF16 are not cleaned up.
You should try use std::locale to perform this conversion:
namespace fs = std::filesystem;
void convert(const fs::path inName, const fs::path outName)
{
std::wifstream in{inName};
in.imbue(std::locale{".932"}); // or "ja_JP.SJIS"
if (in) {
std::wofstream out{outName};
out.imbue(std::locale{".utf-8"});
std::wstring line;
while (getline(in, line)) {
out << line << L'\n';
}
}
}
Note locale names are platform specific - I think I used proper one for Windows.
Update: I've tested this on my Window 10 machine with MSVC 19.29.30145 and works perfectly. I used wiki page to get some valid Japanese text and used Notepad++ to save this text in proper encoding (Shift-JIS).
I also used Beyond Compare to verify results:
Note I used similar method here for Korean and it worked nicely.
wstring str(pUTF16); - pUTF16 there does not end with zero char. It should be wstring str(pUTF16, utf16size);

Unexpected results with wchar_t and c_str string conversion

Based on this answer to a related question, I tried to write a method that converts a standard string to a wide string, which I can then convert into a wchar_t*.
Why aren't the two different ways of creating the wchar_t* equivalent? (I've shown the values that my debugger gives me).
TEST_METHOD(TestingAssertsWithGetWideString)
{
std::wstring wString1 = GetWideString("me");
const wchar_t* wchar1 = wString1.c_str(); // wchar1 = "me"
const wchar_t* wchar2 = GetWideString("me").c_str(); // wchar2 = "ﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮﻮ#" (Why?!)
}
where GetWideString is defined as follows:
inline const std::wstring GetWideString(const std::string &str)
{
std::wstring wstr;
wstr.assign(str.begin(), str.end());
return wstr;
};
Note: the following doesn't work either.
const wchar_t* wchar2 = GetWChar("me");
const wchar_t *GetWChar(const std::string &str)
{
std::wstring wstr;
wstr.assign(str.begin(), str.end());
return wstr.c_str();
};
Each time you call GetWideString(), you are creating a new std::wstring, which has a newly allocated memory buffer. You are comparing pointers to different memory blocks (assuming Assert::AreEqual() is simply comparing the pointers themselves and not the contents of the memory blocks that are being pointed at).
Update: const wchar_t* wchar2 = GetWideString("me").c_str(); does not work because GetWideString() returns a temporary std::wstring that goes out of scope and gets freed as soon as the statement is finished. Thus you are obtaining a pointer to a temporary memory block, and then leaving that pointer dangling when that memory gets freed before you can use the pointer for anything.
Also, const wchar_t* wchar2 = GetWChar("me"); should not compile. GetWChar() returns a std::wstring, which does not implement an implicit conversion to wchar_t*. You have to use the c_str() method to get a wchar_t* from a std::wstring.
Because the two pointers aren't equal. A wchar_t * is not a String, so you get the generic AreEqual.
std::wstring contains of wide characters of type wchar_t. std::string contains characters of type char. For special characters stored within std::string a multi-byte encoding is being used, i.e. some characters are represented by 2 characters within such a string. Converting between these thus can not be easy as calling a simple assign.
To convert between "wide" strings and multi-byte strings, you can use following helpers (Windows only):
// multi byte to wide char:
std::wstring s2ws(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
// wide char to multi byte:
std::string ws2s(const std::wstring& wstr)
{
int size_needed = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), int(wstr.length() + 1), 0, 0, 0, 0);
std::string strTo(size_needed, 0);
WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), int(wstr.length() + 1), &strTo[0], size_needed, 0, 0);
return strTo;
}

C++ Unicode Issue

I'm having a bit of trouble with handling unicode conversions.
The following code outputs this into my text file.
HELLO??O
std::string test = "HELLO";
std::string output;
int len = WideCharToMultiByte(CP_OEMCP, 0, (LPCWSTR)test.c_str(), -1, NULL, 0, NULL, NULL);
char *buf = new char[len];
int len2 = WideCharToMultiByte(CP_OEMCP, 0, (LPCWSTR)test.c_str(), -1, buf, len, NULL, NULL);
output = buf;
std::wofstream outfile5("C:\\temp\\log11.txt");
outfile5 << test.c_str();
outfile5 << output.c_str();
outfile5.close();
But as you can see, output is just a unicode conversion from the test variable. How is this possible?
Check if the LEN is correct after first measuring call. In general, you should not cast test.c_str() to LPCWSTR. The 'test' as is 'char'-string not 'wchar_t'-wstring. You may cast it to LPCSTR - note the 'W' missing. The WinAPI has distinction between that. You really should be using wstring if you want to keep widechars in it.. Yeah, after re-reading your code, the test should be a wstring, then you can cast it to LPCWSTR safely.
after reading this
Microsoft wstring reference
I changed
std::string test = "HELLO";
to
std::wstring test = L"HELLO";
And the string was outputted correctly and I got
HELLOHELLO

WINAPI: how to get edit's text to a std::string?

I'm trying the following code:
int length = SendMessage(textBoxPathToSource, WM_GETTEXTLENGTH, 0, 0);
LPCWSTR lpstr;
SendMessage(textBoxPathToSource, WM_GETTEXT, length+1, LPARAM(LPCWSTR));
std::string s(lpstr);
But it doesn't work.
You're using it absolutely incorrectly:
First, you are passing a type instead of a value here:
SendMessage(textBoxPathToSource, WM_GETTEXT, length+1, LPARAM(LPCWSTR));
Interfacing WinAPI functions who write to a string requires a buffer, since std::string's cannot be written to directly.
You need to define a space to hold the value:
WCHAR wszBuff[256] = {0}; (of course you could allocate the storage space using new, which you didn't, you just declared LPCWSTR lpstr).
Extract the string and store in that buffer:
SendMessage(textBoxPathToSource, WM_GETTEXT, 256, (LPARAM)wszBuff);
and perform std::wstring s(lpStr).
EDIT:
Please note the use of std::wstring, not std::string.
What ALevy said is correct, but it'd be better to use a std::vector<WCHAR> than a fixed-size buffer (or using new):
std::wstring s;
int length = SendMessageW(textBoxPathToSource, WM_GETTEXTLENGTH, 0, 0);
if (length > 0)
{
std::vector<WCHAR> buf(length + 1 /* NUL */);
SendMessageW(textBoxPathToSource,
WM_GETTEXT,
buf.size(),
reinterpret_cast<LPCWSTR>(&buf[0]));
s = &buf[0];
}

Wrong reading file in UNICODE (fread) on C++

I'm trying to load into string the content of file saved on the dics. The file is .CS code, created in VisualStudio so I suppose it's saved in UTF-8 coding. I'm doing this:
FILE *fConnect = _wfopen(connectFilePath, _T("r,ccs=UTF-8"));
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPTSTR lpContent = (LPTSTR)malloc(sizeof(TCHAR) * lSize + 1);
fread(lpContent, sizeof(TCHAR), lSize, fConnect);
But result is so strange - the first part (half of the string is content of .CS file), then strange symbols like 췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍 appear.
So I think I read the content in a wrong way. But how to do that properly?
Thank you so much and I'm looking to hear!
ftell(), fseek(), and fread() all operate on bytes, not on characters. In a Unicode environment, TCHAR is at least 2 bytes, so you are allocating and reading twice as much memory as you should be.
I have never seen fopen() or _wfopen() support a "ccs" attribute. You should use "rb" as the reading mode, read the raw bytes into memory, and then decode them once you have them all available, ie:
FILE *fConnect = _wfopen(connectFilePath, _T("rb"));
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPBYTE lpContent = (LPBYTE) malloc(lSize);
fread(lpContent, 1, lSize, fConnect);
fclose(lpContent);
.. decode lpContent as needed ...
free(lpContent);
Does the string contain all the contents of the cs file and then additional funny characters? Probably it's just not correctly null-terminated since fread will not automatically do that. You need to set the character following the string content to zero:
lpContent[lSize] = 0;
.. decode lpContent as needed ...
s2ws function convert string to wstring
std::wstring s2ws(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
add null terminator in the end of buffer:
lpContent[lSize-1] = 0;
initialize wstring from buffer content
std::wstring replyStr = (s2ws((char*)lpContent));