Convert hexadecimal into Unicode character - C++

In C++, I would like to save a hexadecimal code point to a file as the corresponding Unicode character.
Ex: 0x4E3B saved to file ---> 主
Any suggestions or ideas are appreciated.

What encoding? I assume UTF-8.
What platform?
If you are under Linux, then:
#include <fstream>
#include <locale>

std::locale loc("en_US.UTF-8"); // or "" for the system default
std::wofstream file;
file.imbue(loc); // make the UTF-8 locale the default for this stream
file.open("file.txt");
wchar_t cp = 0x4E3B;
file << cp;
However, if you need Windows, it is quite a different story:
You need to convert the code point to UTF-8. There are many ways to do this. If the code point is bigger than 0xFFFF, convert it to a UTF-16 surrogate pair first, then look at how to use WideCharToMultiByte, and then save the result to the file.
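For illustration, a minimal sketch of that Windows route (the helper name, file name, and lack of error handling are mine, not part of the original answer):
#include <windows.h>
#include <fstream>

// Hypothetical helper: encode a single code point as UTF-16, convert it to
// UTF-8 with WideCharToMultiByte, and append the bytes to a file.
void WriteCodePointUtf8(const char* path, unsigned long cp)
{
    wchar_t utf16[2];
    int len = 0;
    if (cp <= 0xFFFF) {                        // BMP character: one UTF-16 unit
        utf16[len++] = static_cast<wchar_t>(cp);
    } else {                                   // above the BMP: surrogate pair
        cp -= 0x10000;
        utf16[len++] = static_cast<wchar_t>(0xD800 + (cp >> 10));
        utf16[len++] = static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
    }

    char utf8[8];
    int bytes = WideCharToMultiByte(CP_UTF8, 0, utf16, len,
                                    utf8, static_cast<int>(sizeof(utf8)), NULL, NULL);

    std::ofstream file(path, std::ios::binary | std::ios::app);
    file.write(utf8, bytes);
}

// WriteCodePointUtf8("file.txt", 0x4E3B);  // writes E4 B8 BB, i.e. 主 in UTF-8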

Related

How to set file encoding to ISO-8859-1 or WinCP-1252 in C++

I am learning C++ and I have a requirement to write a CSV file encoded in ISO-8859-1 or WinCP-1252.
I've tried the following code snippet to set a locale that will use 1252 codepage encoding, but when I open the output file in Notepad.exe, the encoding is displayed as UTF-8.
std::ofstream ofs;
ofs.imbue(std::locale("English_United States.1252"));
ofs.open("file.txt");
ofs << 78123.456 << std::endl;
If you use only chars with ASCII codes 0..127, you do not need to care about the file encoding. UTF-8 is Notepad's default assumption and is an 8-bit multibyte encoding, but Notepad is not a tool for determining a file's encoding. In other words, a file containing only chars with ASCII codes 0..127 can be read as any 8-bit ISO or multibyte encoding, including Windows-1252.
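To see Notepad change its mind, it is enough to emit one byte above 127. A small sketch, assuming you actually want a Windows-1252 character such as 'é' (0xE9) in the output (the file name and sample text are placeholders):
#include <fstream>

int main()
{
    std::ofstream ofs("file.txt", std::ios::binary);
    // 0xE9 is 'é' in Windows-1252 but is not valid UTF-8 on its own,
    // so an editor has to treat the file as an 8-bit (ANSI) encoding.
    ofs << "caf\xE9" << ";" << 78123.456 << "\n";
}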

Converting UTF16(Windows wchar_t) to UTF8 in C++ Non-English letters corrupted(Korean)

I'm trying to make a multiplatform app. On the Windows Store App (WinRT) side, I open a file and read its path as a Platform::String, which is wchar_t based, i.e. UTF-16 on Windows.
Since my core logic is platform independent and only uses standard C++ data types, I've converted the path into a std::string in UTF-8 via this code:
Platform::String^ copyPath = copy->Path;
std::wstring source(copyPath->Data());
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t >, wchar_t > convert;
std::string u8CopyPath = convert.to_bytes(source);
However, when I check u8CopyPath in the debugger, it shows corrupted letters for the non-English chars. As far as I know, UTF-8 is perfectly capable of encoding non-English languages, since it can use multiple bytes for a single letter. Is there something in the conversion that corrupts the non-English letters?
It turns out it's just a debugger thing. Once I wrote it to a file and examined it, it printed out correctly.
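If anyone wants to do the same check, a minimal sketch (the file name and helper are mine): write the converted bytes untouched and open the file in a UTF-8 aware editor or a hex viewer.
#include <fstream>
#include <string>

// Dump the UTF-8 bytes exactly as they are, without any further conversion.
void DumpUtf8(const std::string& u8CopyPath)
{
    std::ofstream out("path_check.txt", std::ios::binary);
    out.write(u8CopyPath.data(), static_cast<std::streamsize>(u8CopyPath.size()));
}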

Convert wide CString to char*

There are lots of times this question has been asked, and as many answers - none of which work for me and, it seems, for many others. The question is about wide CStrings and 8-bit chars under MFC. We all want an answer that will work in ALL cases, not just in a specific instance.
void Dosomething(CString csFileName)
{
char cLocFileNamestr[1024];
char cIntFileNamestr[1024];
// Convert from whatever version of CString is supplied
// to an 8 bit char string
cIntFileNamestr = ConvertCStochar(csFileName);
sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt" );
m_KFile = fopen(cLocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.
You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, it will fill with some garbage characters if the unicode string you are translating from contains characters that are not in the default code page.
Instead of using fopen(), you could use _wfopen() which will open a file with a unicode filename. To create your file name, you would use swprintf_s().
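A rough sketch of that all-wide route, assuming a UNICODE build where CString holds wchar_t (the helper name is mine):
#include <afx.h>      // CString (MFC)
#include <cstdio>

// Build the file name with swprintf_s and open it with _wfopen,
// so no lossy wide-to-narrow conversion is needed at all.
FILE* OpenPlingFile(const CString& csFileName)
{
    wchar_t wLocFileName[1024];
    swprintf_s(wLocFileName, L"%ls_%ls", csFileName.GetString(), L"pling.txt");
    return _wfopen(wLocFileName, L"wt");
}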
an answer that will work in ALL cases, not a specific instance...
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way with non-Latin languages.
Stick to CString and wchar_t when your project is Unicode.
If you need to upload data to a web page or something similar, then use CW2A and CA2W for the UTF-8/UTF-16 conversions.
CStringW unicode = L"Россия";
MessageBoxW(0,unicode,L"Russian",0);//should be okay
CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0,utf8,"format error",0);//WinApi doesn't get UTF-8
char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0,buf,"format error",0);//same problem
//send this buf to webpage or other utf-8 systems
//this should be compatible with notepad etc.
//text will appear correctly
ofstream f(L"c:\\stuff\\okay.txt");
f.write(buf, strlen(buf));
//convert utf8 back to utf16
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0,unicode,L"okay",0);

Writing wide string to a file in byte mode stopped

I am writing out Unicode text (stored as a wstring) into a file, and I'm doing it in byte mode, but the string in the file ends before the "™" character is printed. Is "™" not Unicode, or am I doing something wrong?
wofstream output;
output.open("output.txt", ofstream::binary);
wstring a = L"ABC™";
output << a;
™ is definitely Unicode. ofstream and wofstream do not write the text in UTF-8 format by default. You have to encode the output buffer in UTF-8 in order to see the results you're expecting, so try using WideCharToMultiByte.
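For example, a hedged sketch of that WideCharToMultiByte route (the file name matches the question; everything else is mine):
#include <windows.h>
#include <fstream>
#include <string>

int main()
{
    std::wstring a = L"ABC™";

    // First call asks for the required buffer size, second call converts.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, a.c_str(), static_cast<int>(a.size()),
                                    NULL, 0, NULL, NULL);
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, a.c_str(), static_cast<int>(a.size()),
                        &utf8[0], bytes, NULL, NULL);

    std::ofstream output("output.txt", std::ios::binary);
    output.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));  // "ABC" then E2 84 A2
}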
There is a common misconception about the iostream binary mode: that it is for reading and writing binary files. The iostream library works only with text; it only reads and writes text files. The only thing the "binary" mode changes is how NL (new line) characters are handled. In binary mode, no transformation occurs. In non-binary mode, writing LF characters ('\n') to a stream converts them to the platform-specific new line sequence (Unix -> LF, Windows -> CR LF ("\r\n"), Mac -> CR), while when reading, the platform-specific new line sequence is converted back to a single LF ('\n') character.
For everything else, nothing changes, meaning a wofstream will always convert the Unicode wide character string to a single-byte or multi-byte character stream depending on the locale used by your process. If you have a locale of "en_US.utf8" on Linux, for example, it will be converted to UTF-8. Now, if the current locale does not have a representation for the ™ Unicode symbol, then either nothing or a '?' will be written to the file.
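On Linux, for example, the same wide write works once the stream has a UTF-8 locale; a minimal sketch (the locale name must exist on the machine, otherwise the std::locale constructor throws):
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::wofstream output;
    output.imbue(std::locale("en_US.utf8"));     // UTF-8 capable locale
    output.open("output.txt", std::ios::binary);
    output << std::wstring(L"ABC™");             // ™ (U+2122) becomes E2 84 A2
}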

How to convert unsigned hex values to corresponding unicode characters which should be written to file using c++ [duplicate]

This question already has answers here:
UTF8 to/from wide char conversion in STL
(8 answers)
Closed 9 years ago.
I need to convert unsigned hex values to the corresponding Unicode characters, which should be written to a file using C++.
So far I have tried this:
unsigned short array[2]={0x20ac,0x20ab};
These should be converted to the corresponding characters in the file using C++.
It depends on what encoding you have chosen.
If you are using the UTF-8 encoding, you need to first convert each Unicode character to the corresponding UTF-8 byte sequence and then write that byte sequence to the file.
Its pseudo code will look like this:
EncodeCharToUTF8(charin, charout, &numbytes); //EncodeCharToUTF8(short,char*, int*);
WriteToFile(charout, numchar);
If you are using the UTF-16 encoding, you need to first write a BOM at the beginning of the file and then encode each character into a UTF-16 byte sequence (byte order matters here: little-endian or big-endian, matching your BOM).
WriteToFile("\xFF\xFE", 2); //Write BOM
EncodeCharToUTF16(charin, charout, &numbytes); //EncodeCharToUTF16(short,char*, int*);
//Write the character.
WriteToFile(charout, numchar);
UTF-32 is not recommended, although the steps are similar to UTF-16.
I think this should help you to start.
From your array, it seems that you are going to use UTF-16.
Write the UTF-16 BOM first: the bytes FF FE for little-endian, FE FF for big-endian. After that, write each code unit in that byte order.
I have given pseudo code here which you can flesh out yourself. Search for more on encoding conversion.
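To make the UTF-8 branch concrete, here is a minimal sketch restricted to code points up to 0xFFFF, which covers the values in your array (the function and file names follow the pseudo code above but are otherwise mine):
#include <cstdio>

// Minimal UTF-8 encoder for code points up to 0xFFFF (the BMP),
// fleshing out EncodeCharToUTF8 from the pseudo code above.
int EncodeCharToUTF8(unsigned short ch, char* out)
{
    if (ch < 0x80) {                                      // 1 byte: 0xxxxxxx
        out[0] = static_cast<char>(ch);
        return 1;
    }
    if (ch < 0x800) {                                     // 2 bytes: 110xxxxx 10xxxxxx
        out[0] = static_cast<char>(0xC0 | (ch >> 6));
        out[1] = static_cast<char>(0x80 | (ch & 0x3F));
        return 2;
    }
    out[0] = static_cast<char>(0xE0 | (ch >> 12));        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out[1] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    out[2] = static_cast<char>(0x80 | (ch & 0x3F));
    return 3;
}

int main()
{
    unsigned short array[2] = {0x20ac, 0x20ab};
    FILE* f = fopen("out_utf8.txt", "wb");
    if (!f) return 1;
    for (unsigned short ch : array) {
        char buf[4];
        int n = EncodeCharToUTF8(ch, buf);
        fwrite(buf, 1, n, f);                             // file ends up with E2 82 AC E2 82 AB
    }
    fclose(f);
    return 0;
}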
Actually you are facing two problems:
1. How to convert a buffer from one encoding (e.g. ISO-8859-1 or UTF-8) to UTF-16?
I suggest you use the Boost.Locale library; the sample code can look like this:
#include <boost/locale.hpp>
#include <iostream>
#include <string>

std::string ansi = "This is what we want to convert";
try
{
    std::string utf8 = boost::locale::conv::to_utf<char>(ansi, "ISO-8859-1");
    std::wstring utf16 = boost::locale::conv::to_utf<wchar_t>(ansi, "ISO-8859-1");
    std::wstring utf16_2 = boost::locale::conv::utf_to_utf<wchar_t, char>(utf8);
}
catch (const boost::locale::conv::conversion_error& e)
{
    std::cout << "Failed to convert to Unicode: " << e.what() << std::endl;
}
2. How to save the buffer to a file in UTF-16 encoding?
This involves writing a BOM (byte order mark) at the beginning of the file manually; you can find a reference here.
That means that if you want to save a buffer encoded as UTF-8 to a Unicode file, you should first write the 3 bytes "EF BB BF" at the beginning of the output file; write "FE FF" for big-endian UTF-16, or "FF FE" for little-endian UTF-16.
If you still don't understand how a BOM works, just open Notepad, write some words, save the file with different "Encoding" options, and then open the saved file with a hex editor; you will see the BOM.
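Putting it together for your array, a short sketch that writes a little-endian UTF-16 file with its BOM (the file name is a placeholder):
#include <fstream>

int main()
{
    unsigned short array[2] = {0x20ac, 0x20ab};

    std::ofstream out("out_utf16le.txt", std::ios::binary);
    out.write("\xFF\xFE", 2);                              // UTF-16LE BOM
    for (unsigned short ch : array) {
        char bytes[2] = { static_cast<char>(ch & 0xFF),    // low byte first
                          static_cast<char>(ch >> 8) };
        out.write(bytes, 2);
    }
}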
Hope it helps you!