Converting a Shift-JIS encoded file to UTF-8 in C++

I am trying to convert a Shift-JIS file to UTF-8 with the code below, but when I open the output file it contains corrupted characters. It looks like something is missing here. Any thoughts?
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen];
fread(lpszBuf, 1, nLen, shiftJisFile);
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, pUTF16, utf16size);
wstring str(pUTF16);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out);
string result = string();
result.resize(WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, NULL, 0, 0, 0));
char* ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, ptr, result.size(), 0, 0);
File << result;
File.close();

There are multiple problems.
The first problem is that when writing the output file, you need to open it in binary mode for the same reason you do so when reading the input.
fstream File("filepath", std::ios::out | std::ios::binary);
The second problem is that when reading the input file, you are only reading the raw bytes of the input and treating them like a string. However, those bytes do not have a terminating null character. If you call MultiByteToWideChar with a length of -1, it infers the input string length from the terminating null character, which is missing in your case. That means both utf16size and the contents of pUTF16 are already wrong. Add the terminator manually after reading the file:
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
The last problem is that you are using CP_ACP. That means "the current code page". In your question, you were specifically asking how to convert Shift-JIS. The code page Windows uses for its closest equivalent to what is commonly called "Shift-JIS" is 932 (you can look that up on Wikipedia, for example). So use 932 instead of CP_ACP:
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
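As a side note, here is a minimal alternative sketch (not part of the original code): pass the byte count explicitly instead of -1 and skip the manual null terminator. With an explicit length the converted output is not null-terminated, so the explicit length must also be carried into the later WideCharToMultiByte call.
// Alternative sketch: give MultiByteToWideChar the byte count explicitly
// instead of relying on a terminating null character. The converted buffer
// is then NOT null-terminated, so utf16size must be passed explicitly later on.
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, nLen, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, nLen, pUTF16, utf16size);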
Additionally, there is no reason to create wstring str(pUTF16). Just use pUTF16 directly in the WideCharToMultiByte calls.
Also, I'm not sure how kosher char *ptr = &result[0] is. I personally would not create a string specifically as a buffer for this.
Here is the corrected code. I would personally not write it this way, but I don't want to impose my coding ideology on you, so I made only the changes necessary to fix it:
// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen+1];
fread(lpszBuf, 1, nLen, shiftJisFile);
lpszBuf[nLen] = 0;
// convert multibyte to wide char
int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out | std::ios::binary);
string result;
result.resize(WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, NULL, 0, 0, 0));
char *ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, ptr, result.size(), 0, 0);
File << ptr;
File.close();
Also, you have a memory leak -- lpszBuf and pUTF16 are not cleaned up.
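For what it's worth, here is a minimal leak-free sketch (my own variant, not the code above) that uses std::string/std::wstring as buffers with explicit lengths, so nothing has to be deleted manually; the helper name shiftJisToUtf8 is made up for illustration:
#include <windows.h>
#include <string>

// Sketch: Shift-JIS (code page 932) -> UTF-8, with automatic buffers (no new/delete).
std::string shiftJisToUtf8(const std::string& sjis)
{
    if (sjis.empty()) return std::string();
    // Shift-JIS -> UTF-16, passing the byte count explicitly
    int wideLen = ::MultiByteToWideChar(932, 0, sjis.data(), (int)sjis.size(), NULL, 0);
    std::wstring wide(wideLen, L'\0');
    ::MultiByteToWideChar(932, 0, sjis.data(), (int)sjis.size(), &wide[0], wideLen);
    // UTF-16 -> UTF-8
    int utf8Len = ::WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), NULL, 0, NULL, NULL);
    std::string utf8(utf8Len, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), &utf8[0], utf8Len, NULL, NULL);
    return utf8;
}
The result can then be written with an ofstream opened with std::ios::binary, as discussed above.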

You could try using std::locale to perform this conversion:
namespace fs = std::filesystem;

void convert(const fs::path inName, const fs::path outName)
{
    std::wifstream in{inName};
    in.imbue(std::locale{".932"}); // or "ja_JP.SJIS"
    if (in) {
        std::wofstream out{outName};
        out.imbue(std::locale{".utf-8"});
        std::wstring line;
        while (getline(in, line)) {
            out << line << L'\n';
        }
    }
}
Note that locale names are platform specific - I think I used the proper ones for Windows.
Update: I've tested this on my Windows 10 machine with MSVC 19.29.30145 and it works perfectly. I used a wiki page to get some valid Japanese text and used Notepad++ to save the text in the proper encoding (Shift-JIS).
I also used Beyond Compare to verify the results.
Note that I used a similar method here for Korean and it worked nicely.
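A minimal usage sketch, assuming the convert() function above is in scope (the file names are made up); note that constructing a std::locale with a name the runtime does not recognize throws std::runtime_error, so it can be worth catching:
#include <iostream>
#include <stdexcept>

int main()
{
    try {
        // Hypothetical file names, purely for illustration.
        convert("input_sjis.txt", "output_utf8.txt");
    } catch (const std::runtime_error& e) {
        // std::locale throws if the locale name is unknown on this platform.
        std::cerr << "Locale error: " << e.what() << '\n';
    }
    return 0;
}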

wstring str(pUTF16); - pUTF16 there does not end with a zero char. It should be wstring str(pUTF16, utf16size);

Related

C++ Arabic UTF8 string to CString

In a Visual Studio 2008 MFC project I have to manage UTF-8 strings containing Arabic city names, and after searching online I wrote this little piece of code:
CString MyClass::convertString(string input) {
    int l = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, NULL, 0);
    wchar_t *str = new wchar_t[l];
    int r = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, str, l);
    CString output = str;
    delete[] str;
    return output;
}
When I try to convert a string it remains the same, and if I print the two strings the result is identical.
What am I doing wrong?
Thanks in advance.
You don't want to convert strings to UTF-8 for display purposes. There is no UTF-8 charset that will allow you to display them correctly. If you already have them in Unicode, just keep them in Unicode. I would build your application in Unicode and avoid MBCS if you can. It makes life easier. Otherwise, for displaying those Arabic strings, you would have to convert them to the Arabic codepage and then use an Arabic font/charset to display them.
Thanks for all the replies. I've found a solution: the input string was not encoded in UTF-8 (I should have checked it before posting on Stack Overflow), so I edited the code, changing the output from CString to wstring.
wstring MyClass::convertString(string input) {
    int l = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, NULL, 0);
    wchar_t *str = new wchar_t[l];
    int r = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), -1, str, l);
    wstring output = wstring(str);
    delete[] str;
    return output;
}
Now everything works fine. Thanks.

C++ Unicode Issue

I'm having a bit of trouble handling Unicode conversions.
The following code outputs this into my text file:
HELLO??O
std::string test = "HELLO";
std::string output;
int len = WideCharToMultiByte(CP_OEMCP, 0, (LPCWSTR)test.c_str(), -1, NULL, 0, NULL, NULL);
char *buf = new char[len];
int len2 = WideCharToMultiByte(CP_OEMCP, 0, (LPCWSTR)test.c_str(), -1, buf, len, NULL, NULL);
output = buf;
std::wofstream outfile5("C:\\temp\\log11.txt");
outfile5 << test.c_str();
outfile5 << output.c_str();
outfile5.close();
But as you can see, output is just a Unicode conversion of the test variable. How is this possible?
Check if the len is correct after the first measuring call. In general, you should not cast test.c_str() to LPCWSTR. 'test' is a 'char' string, not a 'wchar_t' wstring. You may cast it to LPCSTR - note the missing 'W'; the WinAPI distinguishes between the two. You really should be using wstring if you want to keep wide chars in it. After re-reading your code: test should be a wstring, then you can cast it to LPCWSTR safely.
After reading this
Microsoft wstring reference
I changed
std::string test = "HELLO";
to
std::wstring test = L"HELLO";
And the string was outputted correctly and I got
HELLOHELLO
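For completeness, a rough sketch of how the whole corrected snippet might look (same hypothetical file path as in the question); with test as a wstring, WideCharToMultiByte receives a genuine wide string and no casts are needed:
#include <windows.h>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::wstring test = L"HELLO";
    std::string output;
    // Measure, then convert UTF-16 -> OEM code page without any casts.
    int len = WideCharToMultiByte(CP_OEMCP, 0, test.c_str(), -1, NULL, 0, NULL, NULL);
    std::vector<char> buf(len);
    WideCharToMultiByte(CP_OEMCP, 0, test.c_str(), -1, buf.data(), len, NULL, NULL);
    output = buf.data();
    std::wofstream outfile5("C:\\temp\\log11.txt");
    outfile5 << test.c_str();
    outfile5 << output.c_str();
    outfile5.close();
    return 0;
}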

How to set file encoding format to UTF8 in C++

A requirement for my software is that the encoding of a file which contains exported data shall be UTF-8. But when I write the data to the file, the encoding is always ANSI. (I use Notepad++ to check this.)
What I'm currently doing is trying to convert the file manually by reading it, converting it to UTF8 and writing the text to a new file.
line is a std::string
inputFile is an std::ifstream
pOutputFile is a FILE*
// ...
if( inputFile.is_open() )
{
    while( inputFile.good() )
    {
        getline(inputFile,line);
        //1
        DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
        wchar_t *pwcharText;
        pwcharText = new wchar_t[ dwCount ];
        //2
        MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );
        //3
        dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
        char *pText;
        pText = new char[ dwCount ];
        //4
        WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );
        fprintf(pOutputFile,pText);
        fprintf(pOutputFile,"\n");
        delete[] pwcharText;
        delete[] pText;
    }
}
// ...
Unfortunately the encoding is still ANSI. I searched for a while for a solution, but I always come across the MultiByteToWideChar/WideCharToMultiByte approach. However, this doesn't seem to work. What am I missing here?
I also looked here on SO for a solution, but most UTF-8 questions deal with C# and PHP stuff.
On Windows with VC++ 2010 it is possible (not yet implemented in GCC, as far as I know) using the localization facet std::codecvt_utf8_utf16 (i.e. in C++11). The sample code from cppreference.com has all the basic information you would need to read/write a UTF-8 file.
std::wstring wFromFile = _T("𤭢teststring");
std::wofstream fileOut("textOut.txt");
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut<<wFromFile;
This writes the file as UTF-8 instead of ANSI (checked in Notepad). Hope this is what you need.
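As a rough read-back sketch using the same facet (note that std::codecvt_utf8_utf16 was deprecated in C++17 but is still available in MSVC):
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // Read the UTF-8 file back into UTF-16 wide strings via the same facet.
    std::wifstream fileIn("textOut.txt");
    fileIn.imbue(std::locale(fileIn.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
    std::wstring line;
    while (std::getline(fileIn, line)) {
        // line now holds the decoded text
    }
    return 0;
}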
On Windows, files don't have encodings. Each application will assume an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.
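For illustration, a minimal sketch of writing a UTF-8 BOM (the three bytes 0xEF 0xBB 0xBF) at the start of the output file; the helper name writeUtf8Bom is made up here. You would call it right after opening the file and before writing any converted lines:
#include <cstdio>

// Sketch: write the UTF-8 BOM first so that editors such as Notepad++ detect the encoding.
void writeUtf8Bom(FILE* pOutputFile)
{
    const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    fwrite(bom, 1, sizeof(bom), pOutputFile);
}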
AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write the UTF-8 to the file. Since you already converted the data yourself, use fwrite() instead so you are writing the UTF-8 data as-is, e.g.:
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), NULL, 0 );
if (dwCount == 0) continue;
std::vector<WCHAR> utf16Text(dwCount);
MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), &utf16Text[0], dwCount );
dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), NULL, 0, NULL, NULL );
if (dwCount == 0) continue;
std::vector<CHAR> utf8Text(dwCount);
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), &utf8Text[0], dwCount, NULL, NULL );
fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);
fprintf(pOutputFile, "\n");
The type char has no clue about any encoding; all it can do is store 8 bits. Therefore any text file is just a sequence of bytes, and the user must guess the underlying encoding. A file starting with a BOM indicates UTF-8, but using a BOM is not recommended any more. The type wchar_t, in contrast, is always interpreted as UTF-16 on Windows.
So let's say you have a file encoded in UTF-8 with just one line: "Confucius says: Smile. 孔子说:微笑!😊." The following code snippet appends this text once more, then reads the first line and displays it in a MessageBoxW and a MessageBoxA. Note that MessageBoxW shows the correct text while MessageBoxA shows some junk because it assumes my local codepage 1252 for the char* string.
Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful, the CP_Whatever argument is optional and if omitted the local codepage is used.
#include <iostream>
#include <fstream>
#include <filesystem>
#include <atlbase.h>

int main(int argc, char** argv)
{
    std::fstream afile;
    std::string line1A = u8"Confucius says: Smile. 孔子说：微笑! 😊";
    std::wstring line1W;
    afile.open("Test.txt", std::ios::out | std::ios::app);
    if (!afile.is_open())
        return 0;
    afile << "\n" << line1A;
    afile.close();
    afile.open("Test.txt", std::ios::in);
    std::getline(afile, line1A);
    line1W = CA2W(line1A.c_str(), CP_UTF8);
    MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0);
    MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);
    afile.close();
    return 0;
}

WNetUseConnection SystemErrorCode 1113: No Mapping Exists

I am trying to convert a string into a wchar_t string to use it in a WNetUseConnection function.
Basically it's a UNC name looking like this: "\\remoteserver".
I get a return code 1113, which is described as:
"No mapping for the Unicode character exists in the target multi-byte code page."
My code looks like this:
std::string serverName = "\\uncDrive";
wchar_t *remoteName = new wchar_t[ serverName.size() ];
MultiByteToWideChar(CP_ACP, 0, serverName.c_str(), serverName.size(), remoteName, serverName.size()); //also doesn't work if CP_UTF8
NETRESOURCE nr;
memset( &nr, 0, sizeof( nr ));
nr.dwType = RESOURCETYPE_DISK;
nr.lpRemoteName = remoteName;
wchar_t pswd[] = L"user"; //would have the same problem if converted and not set
wchar_t usrnm[] = L"pwd"; //would have the same problem if converted and not set
int ret = WNetUseConnection(NULL, &nr, pswd, usrnm, 0, NULL, NULL, NULL);
std::cerr << ret << std::endl;
The interesting thing is that if remoteName is hard coded like this:
wchar_t remoteName[] = L"\\\\uncName";
everything works fine. But since later on the server, user and pwd will be parameters which I get as strings, I need a way to convert them (I also tried the mbstowcs function with the same result).
MultiByteToWideChar will not 0-terminate the converted string with your current code, and therefore you get garbage characters following the converted "\uncDrive"
Use this:
std::string serverName = "\\uncDrive";
int CharsNeeded = MultiByteToWideChar(CP_ACP, 0, serverName.c_str(), serverName.size() + 1, 0, 0);
wchar_t *remoteName = new wchar_t[ CharsNeeded ];
MultiByteToWideChar(CP_ACP, 0, serverName.c_str(), serverName.size() + 1, remoteName, CharsNeeded);
This first checks with MultiByteToWideChar how many chars are needed to store the specified string and the 0-termination, then allocates the string and converts it. Note that I didn't compile/test this code, beware of typos.
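Alternatively, a sketch of my own (not the answer's code): wrap the conversion in a small helper returning std::wstring, so the buffer, its length and the 0-termination are handled for you; the helper name toWide is made up here.
#include <windows.h>
#include <string>

// Sketch: ANSI string -> wide string; std::wstring owns the buffer.
std::wstring toWide(const std::string& s)
{
    if (s.empty()) return std::wstring();
    int charsNeeded = MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(), NULL, 0);
    std::wstring wide(charsNeeded, L'\0');
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), (int)s.size(), &wide[0], charsNeeded);
    return wide;
}

// Usage (the internal buffer is 0-terminated since C++11):
//   std::wstring remoteName = toWide(serverName);
//   nr.lpRemoteName = &remoteName[0];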

Wrong reading of a Unicode file (fread) in C++

I'm trying to load the content of a file saved on the disk into a string. The file is .CS code, created in Visual Studio, so I suppose it's saved in UTF-8 encoding. I'm doing this:
FILE *fConnect = _wfopen(connectFilePath, _T("r,ccs=UTF-8"));
if (!fConnect)
    return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPTSTR lpContent = (LPTSTR)malloc(sizeof(TCHAR) * lSize + 1);
fread(lpContent, sizeof(TCHAR), lSize, fConnect);
But the result is strange - the first half of the string is the content of the .CS file, and then strange symbols like 췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍 appear.
So I think I'm reading the content in a wrong way. But how do I do it properly?
Thank you so much and I'm looking forward to hearing from you!
ftell(), fseek(), and fread() all operate on bytes, not on characters. In a Unicode environment, TCHAR is at least 2 bytes, so you are allocating and reading twice as much memory as you should be.
I have never seen fopen() or _wfopen() support a "ccs" attribute. You should use "rb" as the reading mode, read the raw bytes into memory, and then decode them once you have them all available, i.e.:
FILE *fConnect = _wfopen(connectFilePath, _T("rb"));
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPBYTE lpContent = (LPBYTE) malloc(lSize);
fread(lpContent, 1, lSize, fConnect);
fclose(lpContent);
.. decode lpContent as needed ...
free(lpContent);
Does the string contain all the contents of the cs file and then additional funny characters? Probably it's just not correctly null-terminated since fread will not automatically do that. You need to set the character following the string content to zero:
lpContent[lSize] = 0;
.. decode lpContent as needed ...
An s2ws function to convert a string to a wstring:
std::wstring s2ws(const std::string& str)
{
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}
Add a null terminator at the end of the buffer:
lpContent[lSize-1] = 0;
Initialize the wstring from the buffer content:
std::wstring replyStr = (s2ws((char*)lpContent));
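Putting the pieces together, a rough sketch of the whole read-and-decode flow, assuming the file really is UTF-8 and the s2ws helper above is in scope (the function name readUtf8File is made up); the buffer gets one extra byte so the terminator does not overwrite content:
#include <cstdio>
#include <cstdlib>
#include <string>

std::wstring readUtf8File(const wchar_t* path)
{
    FILE* f = _wfopen(path, L"rb");
    if (!f)
        return std::wstring();
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    // One extra byte for the terminating zero so no content is overwritten.
    char* buf = (char*)malloc(size + 1);
    fread(buf, 1, size, f);
    fclose(f);
    buf[size] = 0;
    std::wstring result = s2ws(buf); // s2ws from above
    free(buf);
    return result;
}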