Reading csv file having japanese text.(C++)

Reading csv file having japanese text.(C++) - c++

A csv file is having japanese text in it.
On opening through notepad, it says its encoding is utf-8.
I read on stackoverflow, for utf-8 , first read the file in single stream and then convert it into wstring.
I am using below code for the conversion of string to wstring.
wstring stow(const std::string& str){
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo( size_needed, 0 );
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;}
But still, i am getting junk in the returned wstring in case of japanese text.
Note:
I can only use stream to read the csv.
No static memory allocation is allowed.
How can i read the Japanese text successfully ?

Missing check to see if UTF-8 BOM is prepended in the string, if yes, skip it.

It is achieved by using CP_ACP encoding.

Related

Converting to UTF-8 from ToUnicodeEx()

I get input using GetAsyncKeyState() which I then convert to unicode using ToUnicodeEx():
wchar_t character[1];
ToUnicodeEx(i, scanCode, keyboardState, character, 1, 0, layout);
I can write this to a file using wfstream like so:
wchar_t buffer[128]; // Will not print unicode without these 2 lines
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // BOM needed since it's encoded using UCS-2 LE
file << character[0];
When I open this file in Notepad++ it's in UCS-2 LE, when I want it to be in UTF-8 format. I believe ToUnicodeEx() is returning it in UCS-2 LE format, it also only works with wide chars. Is there any way to do this using either fstream or wfstream by somehow converting into UTF-8 first? Thanks!

You might want to use the WideCharToMultiByte function.
For example:
wchar_t buffer[LEN]; // input buffer
char output_buffer[OUT_LEN]; // output buffer where the utf-8 string will be written
int num = WideCharToMultiByte(
CP_UTF8,
0,
buffer,
number_of_characters_in_buffer, // or -1 if buffer is null-terminated
output_buffer,
size_in_bytes_of_output_buffer,
NULL,
NULL);

Windows API generally refers to UTF-16 as unicode which is a little confusing. This means most unicode Win32 function calls operate on or give utf-16 strings.
So ToUnicodeEx returns a utf-16 string.
If you need this as utf 8 you'll need to convert it using WideCharToMultiByte

Thank you for all the help, I've managed to solve my problem with additional help from a blog post about WideCharToMultiByte() and UTF-8 here.
This function converts wide char arrays to a UTF-8 string:
// Takes in pointer to wide char array and length of the array
std::string ConvertCharacters(const wchar_t* buffer, int len)
{
int nChars = WideCharToMultiByte(CP_UTF8, 0, buffer, len, NULL, 0, NULL, NULL);
if (nChars == 0)
{
return u8"";
}
std::string newBuffer;
newBuffer.resize(nChars);
WideCharToMultiByte(CP_UTF8, 0, buffer, len, const_cast<char*>(newBuffer.c_str()), nChars, NULL, NULL);
return newBuffer;
}

How to set file encoding format to UTF8 in C++

A requirement for my software is that the encoding of a file which contains exported data shall be UTF8. But when I write the data to the file the encoding is always ANSI. (I use Notepad++ to check this.)
What I'm currently doing is trying to convert the file manually by reading it, converting it to UTF8 and writing the text to a new file.
line is a std::string
inputFile is an std::ifstream
pOutputFile is a FILE*
// ...
if( inputFile.is_open() )
{
while( inputFile.good() )
{
getline(inputFile,line);
//1
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, NULL, 0 );
wchar_t *pwcharText;
pwcharText = new wchar_t[ dwCount];
//2
MultiByteToWideChar( CP_ACP, 0, line.c_str(), -1, pwcharText, dwCount );
//3
dwCount = WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, NULL, 0, NULL, NULL );
char *pText;
pText = new char[ dwCount ];
//4
WideCharToMultiByte( CP_UTF8, 0, pwcharText, -1, pText, dwCount, NULL, NULL );
fprintf(pOutputFile,pText);
fprintf(pOutputFile,"\n");
delete[] pwcharText;
delete[] pText;
}
}
// ...
Unfortunately the encoding is still ANSI. I searched a while for a solution but I always encounter the solution via MultiByteToWideChar and WideCharToMultiByte. However, this doesn't seem to work. What am I missing here?
I also looked here on SO for a solution but most UTF8 questions deal with C# and php stuff.

On Windows in VC++2010 it is possible (not yet implemented in GCC, as far as i know) using localization facet std::codecvt_utf8_utf16 (i.e. in C++11). The sample code from cppreference.com has all basic information you would need to read/write UTF-8 file.
std::wstring wFromFile = _T("𤭢teststring");
std::wofstream fileOut("textOut.txt");
fileOut.imbue(std::locale(fileOut.getloc(), new std::codecvt_utf8_utf16<wchar_t>));
fileOut<<wFromFile;
It sets the ANSI encoded file to UTF-8 (checked in Notepad). Hope this is what you need.

On Windows, files don't have encodings. Each application will assume an encoding based on its own rules. The best you can do is put a byte-order mark at the front of the file and hope it's recognized.

AFAIK, fprintf() does character conversions, so there is no guarantee that passing UTF-8 encoded data to it will actually write the UTF-8 to the file. Since you already converted the data yourself, use fwrite() instead so you are writing the UTF-8 data as-is, eg:
DWORD dwCount = MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), NULL, 0 );
if (dwCount == 0) continue;
std::vector<WCHAR> utf16Text(dwCount);
MultiByteToWideChar( CP_ACP, 0, line.c_str(), line.length(), &utf16Text[0], dwCount );
dwCount = WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), NULL, 0, NULL, NULL );
if (dwCount == 0) continue;
std::vector<CHAR> utf8Text(dwCount);
WideCharToMultiByte( CP_UTF8, 0, &utf16Text[0], utf16Text.size(), &utf8Text[0], dwCount, NULL, NULL );
fwrite(&utf8Text[0], sizeof(CHAR), dwCount, pOutputFile);
fprintf(pOutputFile, "\n");

The type char has no clue of any encoding, all it can do is store 8 bits. Therefore any text file is just a sequence of bytes and the user must guess the underlying encoding. A file starting with a BOM indicates UTF 8, but using a BOM is not recommended any more. The type wchar_t in contrast is in Windows always interpreted as UTF 16.
So let's say you have a file encoded in UTF 8 with just one line: "Confucius says: Smile. 孔子说：微笑！😊." The following code snippet appends this text once more, then reads the first line and displays it in a MessageBoxW and MessageBoxA. Note that MessageBoxW shows the correct text while MessageBoxA shows some junk because it assumes my local codepage 1252 for the char* string.
Note that I have used the handy CA2W class instead of MultiByteToWideChar. Be careful, the CP_Whatever argument is optional and if omitted the local codepage is used.
#include <iostream>
#include <fstream>
#include <filesystem>
#include <atlbase.h>
int main(int argc, char** argv)
{
std::fstream afile;
std::string line1A = u8"Confucius says: Smile. 孔子说：微笑！ 😊";
std::wstring line1W;
afile.open("Test.txt", std::ios::out | std::ios::app);
if (!afile.is_open())
return 0;
afile << "\n" << line1A;
afile.close();
afile.open("Test.txt", std::ios::in);
std::getline(afile, line1A);
line1W = CA2W(line1A.c_str(), CP_UTF8);
MessageBoxW(nullptr, line1W.c_str(), L"Smile", 0);
MessageBoxA(nullptr, line1A.c_str(), "Smile", 0);
afile.close();
return 0;
}

converting array of bytes to UTF-8 unicode

I have a file saved as UTF-8, and i'm reading it like this:
ReadFile(hFile, pContents, pFile->nFileSize, &dwRead, NULL);
(pContents is a BYTE* of size nFileSize)
its just a small file with 100 bytes or so, contains text which i want to read into memory in wchar_t* format, so i can set the text of edit and static controls with the unicode text.
How can i convert the bytes to UTF-8?
edit (i don't want to use fstream or wfstream)

MultiByteToWideChar to convert from UTF-8 to UTF-16 (wchar_t).
WideCharToMuliByte to convert from UTF-16 to UTF-8.

If the file is in UTF-8 and you read it into an array.
Then it is still in UTF-8 format and you don;t need to do anything.

int res2 = WideCharToMultiByte(CP_UTF8, 0, tempBuf.c_str(), -1,
multiByteBuf, lengthOfInputString, NULL, NULL);
int res = MultiByteToWideChar(CP_UTF8, 0, buf, -1, wcharBuf, lengthOfInputString);

How to output utf8 encoded characters normally in c/c++ console application?

Here's what I'm getting now by wprintf:
1胩?鳧?1敬爄汯?瑳瑡獵猆慴畴??
Is utf8 just not supported by windows?

No, Windows doesn't support printing UTF-8 to the console.
When Windows says "Unicode", it means UTF-16. You need to use MultiByteToWideChar to convert from UTF-8 to UTF-16. Something like this:
char* text = "My UTF-8 text\n";
int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
wchar_t *unicode_text = new wchar_t[len];
MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
wprintf(L"%s", unicode_text);

wprintf supposed to receive a UTF-16 encoded string. Use the following for conversion:
Use MultiByteToWideChar with CP_UTF8 codepage to do the conversion. (and don't do blind casting from char* into wchar_t*).

Wrong reading file in UNICODE (fread) on C++

I'm trying to load into string the content of file saved on the dics. The file is .CS code, created in VisualStudio so I suppose it's saved in UTF-8 coding. I'm doing this:
FILE *fConnect = _wfopen(connectFilePath, _T("r,ccs=UTF-8"));
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPTSTR lpContent = (LPTSTR)malloc(sizeof(TCHAR) * lSize + 1);
fread(lpContent, sizeof(TCHAR), lSize, fConnect);
But result is so strange - the first part (half of the string is content of .CS file), then strange symbols like 췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍췍 appear.
So I think I read the content in a wrong way. But how to do that properly?
Thank you so much and I'm looking to hear!

ftell(), fseek(), and fread() all operate on bytes, not on characters. In a Unicode environment, TCHAR is at least 2 bytes, so you are allocating and reading twice as much memory as you should be.
I have never seen fopen() or _wfopen() support a "ccs" attribute. You should use "rb" as the reading mode, read the raw bytes into memory, and then decode them once you have them all available, ie:
FILE *fConnect = _wfopen(connectFilePath, _T("rb"));
if (!fConnect)
return;
fseek(fConnect, 0, SEEK_END);
lSize = ftell(fConnect);
rewind(fConnect);
LPBYTE lpContent = (LPBYTE) malloc(lSize);
fread(lpContent, 1, lSize, fConnect);
fclose(lpContent);
.. decode lpContent as needed ...
free(lpContent);

Does the string contain all the contents of the cs file and then additional funny characters? Probably it's just not correctly null-terminated since fread will not automatically do that. You need to set the character following the string content to zero:
lpContent[lSize] = 0;

.. decode lpContent as needed ...
s2ws function convert string to wstring
std::wstring s2ws(const std::string& str)
{
int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
std::wstring wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
return wstrTo;
}
add null terminator in the end of buffer:
lpContent[lSize-1] = 0;
initialize wstring from buffer content
std::wstring replyStr = (s2ws((char*)lpContent));

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Reading csv file having japanese text.(C++) - c++

Missing check to see if UTF-8 BOM is prepended in the string, if yes, skip it.

It is achieved by using CP_ACP encoding.

Related

Converting to UTF-8 from ToUnicodeEx()

How to set file encoding format to UTF8 in C++

converting array of bytes to UTF-8 unicode

How to output utf8 encoded characters normally in c/c++ console application?

Wrong reading file in UNICODE (fread) on C++

Categories

Resources