Unicode big endian some character not getting properly from wchar_t array - c++

I am trying to extract the exact "unicode big endian" character from an array.
The values i directly taken from a file using big endian. i use vs 2015, mfc framework (unicode support).
values: 𠀐亙𠀃𠀃亙亙𠀐𠀐Val𪛕𨕥
So these values directly taken from file to an array and without changing those values in the same array and directly printing to another txt file as unicode big endian format is possible. But changing some chars getting wrong result.
Directly written to editor.cpp file
wchar_t chr[] = {L'𠀐', L'亙', L'𠀃', L'𠀃', L'亙', L'亙', L'𠀐', L'𠀐', L'V', L'a', L'l', L'𪛕', L'𨕥'};
wchar_t chVal = (wchar_t) chr[0]; // getting � or a rectangle mark
if(chVal == L'𠀐')
MessageBox(_T("Show msg")); // results wrong
wchar_t chVal = (wchar_t) chr[1]; // getting 亙 proper element.
if(chVal == L'亙')
MessageBox(_T("Show msg")); // results correct
llly correct results in 'V', 'a', 'l'
=======================================
Before i placed the code
wchar_t* ch = _wsetlocale(LC_ALL, _T("Chinese"));
is it a problem from _wsetLocale ?
in the editor we can directly write those characters. But during debug or exe the results wrong.
why the editor not displaying some characters during debugging or execution.
================
updated:
// wcstring is wchar_t array with unicode characters
CStringW str; wchar_t wh;
System::Text::Encoding^ encodingWr = System::Text::Encoding::BigEndianUnicode;
StreamWriter^ writer = gcnew StreamWriter("Converted.txt", true, encodingWr );
//String^ line = reader->ReadLine();
for(int ct = 0; ct< ctTot; ct++)
{
int ln = wcstring[ct]; // correct number
wh = /*(wchar_t)*/ wcstring[ct]; //wrong
str.Format(_T("UNNUM %d %lc"), ln, wh);
/* https://learn.microsoft.com/en-us/cpp/text/how-to-convert-between-various-string-types?view=vs-2017*/
// Convert a wide character CStringW to a
// System::String.
String ^systemstringw = gcnew String(str);
//systemstringw += " (System::String)";
//Console::WriteLine("{0}", systemstringw);
//delete systemstringw;
writer->WriteLine(systemstringw);
delete systemstringw;
OutputDebugString(str);
}
but needed to print on file correct unicode character.
so compiler problems need to know too.

Related

How to convert an integer to a unicode character?

So I wanted to try converting Unicode to an integer for a project of mine. I tried something like this :
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? Or in other words, How do I convert an int to the respective Unicode character ?
EDIT : I am expecting the output to be the unicode value of an integer, example:
cout << (wchar_t) 1570 ; // This should print the unicode value of 1570 (which is :آ)
I am using Visual Studio 2013 Community with it's default compiler, Windows 10 64 bit Pro
Cheers
L'آ' will work okay as a signle wide character, because it is below 0xFFFF. But in general UTF16 includes surrogate pairs, so a unicode code point cannot be represented with a single wide character. You need wide string instead.
Your problem is also partly to do with printing UTF16 character in Windows console. If you use MessageBoxW to view a wide string it will work as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if(utf32 < 0xFFFF)
{
str.push_back((wchar_t)utf32);
}
else
{
utf32 -= 0x10000;
int hi = (utf32 >> 10) & mask;
int lo = utf32 & mask;
hi += 0xD800;
lo += 0xDC00;
str.push_back((wchar_t)hi);
str.push_back((wchar_t)lo);
}
MessageBox(0, str.c_str(), 0, 0);
See related posts for printing UTF16 in Windows console.
The key here is setlocale(LC_ALL, "en_US.UTF-8");. en_US is the localization string which you may want to set to a different value like zh_CN for Chinese for example.
#include <stdio.h>
#include <iostream>
int main() {
setlocale(LC_ALL, "en_US.UTF-8");
// This does not work without setlocale(LC_ALL, "en_US.UTF-8");
for(int ch=30000; ch<30030; ch++) {
wprintf(L"%lc", ch);
}
printf("\n");
return 0;
}
Things to notice here is the use of wprintf and how the formatted string is given: L"%lc" which tells wprintf to treat the string and the character as long characters.
If you want to use this method to print some variables, use the type wchat_t.
Useful links:
setlocale
wprintf

Putting Hebrew string in a variable using C++ on Windows

I have problem putting Hebrew string in a variable like this:
wchar_t* hebrewString = L"א";
The value in unicode of א is 0x05d0 in hex or 1488 in dec.
The problem is that my memory show different value that totally unconnected
to the real value of א.
If I write:
wchar_t hebrewChar = 0x05d0
it is obvious that the right value will be in hebrewChar, but I want to write regular string.
I thought maybe I did something wrong so I looked up in the generate ASM code and even there it was wrong value.
How can I write Hebrew string in a simple way?
Edit 1:
add source code(in comment above the code is the assembly)
wchar_t d = 0x05D0;
// DB 0f3H, 05H, 090H, 00H, 00H, 00H
wchar_t *test = L"א";
// mov eax, 1523 ; 000005f3H
wchar_t test1 = L'א';
// mov eax, -112 ; ffffff90H
char test2 = 'א';
By specifying L in front of string or Unicode character, compiler will convert it into an encoding matching the encoding file is saved. Therefore you have to change file encoding via FILE -> Advance Save Options and choose UTF 8 with signature - codepage 65001 for example.
Also bear in mind that Windows Console isn't capable of printing all Unicode characters (you could if you'd have different default language and encoding).
Here is also an example to see if your code is working by saving character into a text file:
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
// UCS-2 little endian text file magic number
char magic_number[] = { 0xFF, 0xFE };
wchar_t unicode_char = L'א';
wchar_t unicode_val = 0x05d0;
if (unicode_char == unicode_val)
cout << "Works!" << endl;
ofstream f("out.txt", ios::out);
f.write(magic_number, 2);
f.write((char *)&unicode_char, 2);
f.close();
return 0;
}
Open the file and check if the value is printed correctly.
Otherwise for storing non ANSI characters in code, I'd strongly recommend using library like ICU for saving, loading... - in general all operations regarding strings.

Chinese character in source code when UTF-8 settings can't be used [duplicate]

This question already has an answer here:
PHP and C++ for UTF-8 code unit in reverse order in Chinese character
(1 answer)
Closed 9 years ago.
This is the scenario:
I can only use the char* data type for the string, not wchar_t *
My MS Visual C++ compiler has to be set to MBCS, not UNICODE because the third party source code that I have is using MBCS; Setting it to UNICODE will cause data type issues.
I am trying to print chinese characters on a printer which needs to get a character string so it can print correctly
What should I do with this line to make the code correct: char * str = "你好";
Convert it to hex sequence perhaps? If yes, how? Thanks a lot.
char * str = "你好";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
cout << convertedSize;
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , convertedSize, NULL))
{
return 0;
}
UPDATE : Let's put the question in another way
I have this, the char * str contain sequence of UTF-8 code units, for the 2 chinese character 你好 , the ExtTextOutW still cannot execute the wstr correctly, because I think the my code for mbstowcs_s could still not working correctly. Any idea why ?
char * str = "\xE4\xBD\xA0\xE5\xA5\xBD";
size_t len = strlen(str) + 1;
wchar_t * wstr = new wchar_t[len];
size_t convertedSize = 0;
mbstowcs_s(&convertedSize, wstr, len, str, _TRUNCATE);
if(! ExtTextOutW(resource->dc, 1,1 , ETO_OPAQUE, NULL, wstr , len, NULL))
{
return 0;
}
The fact is, 你好 is a sequence of Unicode characters. You will need to use a Unicode character set in order to ensure that it will be displayed correctly.
The only possible exception to that is if you're using a multi-byte character set that includes both of these characters in the basic character set. Since you say that you're stuck compiling for the MBCS anyway, that might be a solution. In order to make it work, you will have to set the system language to one that includes this character. The exact way you do this changes in each OS version. I think they're trying to "improve" it. On Windows 7, at least, they call this the "Language for non-Unicode programs" setting, accessible in the "Regions and Language" control panel.
If there is no system language in which these characters are provided as part of the basic character set, then you are basically out of luck.
Even if you tried to use a UTF-8 encoding (which Windows does not natively support, instead preferring UTF-16 for its Unicode support), which uses the char data type, it is very likely that whatever other application/library you're interfacing with would not be able to deal with it. Windows applications assume that a char holds a character in the current ANSI/MB character set. Unicode characters are in a wchar_t, and since you can't use that, it indicates the application simply doesn't support Unicode. (That means it's broken, by the way—time to upgrade.)
As an adaptation from what MYMNeo said, I would suggest that this would work:
wchar_t *str = L"你好";
fputws(str, stdout);
ps. This probably isn't C: cout << convertedSize;.

Call popen() on a command with Chinese characters on Mac

I'm trying to execute a program on a file using the popen() command on a Mac. For this, I create a command of the form <path-to_executable> <path-to-file> and then call popen() on this command. Right now, both these two components are declared in a char*. I need to read the output of the command so I need the pipe given by popen().
Now it turns out that path-to-file can contain Chinese, Japanese, Russian and pretty much any other characters. For this, I can represent the path-to-file as wchar_t*. But this doesn't work with popen() because apparently Mac / Linux don't have a wide _wpopen() like Windows.
Is there any other way I can make this work? I'm getting the path-to-file from a data structure that can only give me wchar_t* so I have to take it from there and convert it appropriately, if needed.
Thanks in advance.
Edit:
Seems like one of those days when you just end up pulling your hair out.
So I tried using wcstombs, but the setlocale call failed for "C.UTF-8" and any of its permutations. Unsurprisingly, the wcstombs call failed returning -1 after that.
Then I tried to write my own iconv implementation based on some sample codes searched on Google. I came up with this, which stubbornly refuses to work:
iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = (char*) inbuf;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
iconv always returns -1 and the errno is set to EINVAL. I've verified that <size-of-len> is set correctly. I've got no clue why this code's failing now.
Edit 2:
iconv was failing because I was not setting the input buffer length right. Also, Mac doesn't seem to support the "WCHAR_T" encoding so I've changed it to UTF-16. Now I've corrected the length and changed the from encoding but iconv just returns without converting any character. It just returns 0.
To debug this issue, I even changed the input string to a temp string and set the input length appropriately. Even this iconv call just returns 0. My code now looks like:
iconv_t cd = iconv_open("UTF-8", "UTF-16");
// error checking here
wchar_t* inbuf = ...; // get wchar_t* here - guaranteed to be UTF-16
char outbuf[<size-of-inbuf>*4+1];
size_t inlen = <size-of-inbuf>;
size_t outlen = <size-of-inbuf>*4+1;
char* c_inbuf = "abc"; // (char*) inbuf;
inlen = 4;
char* c_outbuf = outbuf;
int ret = iconv(cd, &c_inbuf, &inlen, &c_outbuf, &outlen);
// more error checking here
I've confirmed that the converter descriptor is being opened correctly. The from-encoding is correct. The input buffer contains a few simple characters. Everything is hardcoded and still, iconv doesn't convert any characters and just returns 0 and outbuf remains empty.
Sanity loss alert!
You'll need an UTF-8 string for popen. For this, you can use iconv to convert between different encodings, including from the local wchar_t encoding to UTF-8. (Note that on my Mac OS install, wchar_t is actually 32 bits, and not 16.)
EDIT Here's an example that works on OS X Lion. I did not have problems using the wchar_t encoding (and it is documented in the iconv man page).
#include <sys/param.h>
#include <string.h>
#include <iconv.h>
#include <stdio.h>
#include <errno.h>
char* utf8path(const wchar_t* wchar, size_t utf32_bytes)
{
char result_buffer[MAXPATHLEN];
iconv_t converter = iconv_open("UTF-8", "wchar_t");
char* result = result_buffer;
char* input = (char*)wchar;
size_t output_available_size = sizeof result_buffer;
size_t input_available_size = utf32_bytes;
size_t result_code = iconv(converter, &input, &input_available_size, &result, &output_available_size);
if (result_code == -1)
{
perror("iconv");
return NULL;
}
iconv_close(converter);
return strdup(result_buffer);
}
int main()
{
wchar_t hello_world[] = L"/éè/path/to/hello/world.txt";
char* utf8 = utf8path(hello_world, sizeof hello_world);
printf("%s\n", utf8);
free(utf8);
return 0;
}
The utf8_hello_world function accepts a wchar_t string with its byte length and returns the equivalent UTF-8 string. If you deal with pointers to wchar_t instead of an array of wchar_t, you'll want to use (wcslen(ptr) + 1) * sizeof(wchar_t) instead of sizeof.
Mac OS X uses UTF-8, so you need to convert the wide-character strings into UTF-8. You can do this using wcstombs, provided you first switch into a UTF-8 locale. For example:
// Do this once at program startup
setlocale(LC_ALL, "en_US.UTF-8");
...
// Error checking omitted for expository purposes
wchar_t *wideFilename = ...; // This comes from wherever
char filename[256]; // Make sure this buffer is big enough!
wcstombs(filename, wideFilename, sizeof(filename));
// Construct popen command using the UTF-8 filename
You can also use libiconv to do the UTF-16 to UTF-8 conversion for you if you don't want to change your program's locale setting; you could also roll your own implementation, as doing the conversion is not all that complicated.

WideCharToMultiByte problem

I have the lovely functions from my previous question, which work fine if I do this:
wstring temp;
wcin >> temp;
string whatever( toUTF8(getSomeWString()) );
// store whatever, copy, but do not use it as UTF8 (see below)
wcout << toUTF16(whatever) << endl;
The original form is reproduced, but the in between form often contains extra characters. If I enter for example àçé as the input, and add a cout << whatever statement, i'll get ┬à┬ç┬é as output.
Can I still use this string to compare to others, procured from an ASCII source? Or asked differently: if I would output ┬à┬ç┬é through the UTF8 cout in linux, would it read àçé? Is the byte content of a string àçé, read in UTF8 linux by cin, exactly the same as what the Win32 API gets me?
Thanks!
PS: the reason I'm asking is because I need to use the string a lot to compare to other read values (comparing and concatenating...).
Let's start by me saying that it appears that there is simply no way to output UTF-8 text to the console in Windows via cout (assuming you compile with Visual Studio).
What you can do however for your tests is to output your UTF-8 text via the Win32 API fn WriteConsoleA:
if(!SetConsoleOutputCP(CP_UTF8)) { // 65001
cerr << "Failed to set console output mode!\n";
return 1;
}
HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD nNumberOfCharsWritten;
const char* utf8 = "Umlaut AE = \xC3\x84 / ue = \xC3\xBC \n";
if(!WriteConsoleA(consout, utf8, strlen(utf8), &nNumberOfCharsWritten, NULL)) {
DWORD const err = GetLastError();
cerr << "WriteConsole failed with << " << err << "!\n";
return 1;
}
This should output:
Umlaut AE = Ä / ue = ü if you set your console (cmd.exe) to use the Lucida Console font.
As for your question (taken from your comment) if
a win23 API converted string is the
same as a raw UTF8 (linux) string
I will say yes: Given a Unicode character sequence, it's UTF-16 (Windows wchar_t) representation converted to a UTF-8 (char) representation via the WideCharToMultiByte function will always yield the same byte sequence.
When you convert the string to a UTF 16 it is a 16 byte wide character, you can't compare it to the ASCII values because they aren't 16 byte values. You have to convert them to compare, or write a specialized comparision to ASCII function.
I doubt the UTF8 cout in linux would produce the same correct output unless it were regular ASCII values, as UTF8 UTF-8 encoding forms are binary-compatible with ASCII for code points below 128, and I assume UTF16 comes after UTF8 in a simliar fashion.
The good news is there are many converters out there written to convert these strings to different character sets.