std::wcout, why is the printed character not the same as the input?

I tried to print some accented characters such as á é í ó ú with printf:
printf("my name is Seán\n");
The text editor in the Dev-C++ IDE displays them fine, i.e. the source code looks fine.
I guess I need some library other than stdio.h and maybe some variant of the normal printf.
I'm using the Bloodshed Dev-C++ IDE running on Windows XP.

Perhaps the best is to use Unicode.
Here's how...
First, manually set your console font to "Consolas" or "Lucida Console" or whichever True-Type Unicode font you can choose ("Raster fonts" may not work, those aren't Unicode fonts, although they may include characters you're interested in).
Next, set the console code page to 65001 (UTF-8) with SetConsoleOutputCP(CP_UTF8).
Then convert your text to UTF-8 (if it's not yet in UTF-8) using WideCharToMultiByte(CP_UTF8, ...).
Finally, call WriteConsoleA() to output the UTF-8 text.
Here's a little function that does all these things for you. It's an "improved" variant of wprintf():
int _wprintf(const wchar_t* format, ...)
{
int r;
static int utf8ModeSet = 0;
static wchar_t* bufWchar = NULL;
static size_t bufWcharCount = 256;
static char* bufMchar = NULL;
static size_t bufMcharCount = 256;
va_list vl;
int mcharCount = 0;
if (utf8ModeSet == 0)
{
if (!SetConsoleOutputCP(CP_UTF8))
{
DWORD err = GetLastError();
fprintf(stderr, "SetConsoleOutputCP(CP_UTF8) failed with error 0x%lX\n", (unsigned long)err);
utf8ModeSet = -1;
}
else
{
utf8ModeSet = 1;
}
}
if (utf8ModeSet != 1)
{
va_start(vl, format);
r = vwprintf(format, vl);
va_end(vl);
return r;
}
if (bufWchar == NULL)
{
if ((bufWchar = malloc(bufWcharCount * sizeof(wchar_t))) == NULL)
{
return -1;
}
}
for (;;)
{
va_start(vl, format);
r = vswprintf(bufWchar, bufWcharCount, format, vl);
va_end(vl);
if (r < 0)
{
break;
}
if ((size_t)r + 2 <= bufWcharCount)
{
break;
}
free(bufWchar);
if ((bufWchar = malloc(bufWcharCount * sizeof(wchar_t) * 2)) == NULL)
{
return -1;
}
bufWcharCount *= 2;
}
if (r > 0)
{
if (bufMchar == NULL)
{
if ((bufMchar = malloc(bufMcharCount)) == NULL)
{
return -1;
}
}
for (;;)
{
mcharCount = WideCharToMultiByte(CP_UTF8,
0,
bufWchar,
-1,
bufMchar,
bufMcharCount,
NULL,
NULL);
if (mcharCount > 0)
{
break;
}
if (GetLastError() != ERROR_INSUFFICIENT_BUFFER)
{
return -1;
}
free(bufMchar);
if ((bufMchar = malloc(bufMcharCount * 2)) == NULL)
{
return -1;
}
bufMcharCount *= 2;
}
}
if (mcharCount > 1)
{
DWORD numberOfCharsWritten, consoleMode;
if (GetConsoleMode(GetStdHandle(STD_OUTPUT_HANDLE), &consoleMode))
{
fflush(stdout);
if (!WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE),
bufMchar,
mcharCount - 1,
&numberOfCharsWritten,
NULL))
{
return -1;
}
}
else
{
if (fputs(bufMchar, stdout) == EOF)
{
return -1;
}
}
}
return r;
}
The following calls test this function:
_wprintf(L"\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7"
L"\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF"
L"\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7"
L"\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF"
L"\n"
L"\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7"
L"\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF"
L"\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7"
L"\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF"
L"\n"
L"\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7"
L"\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF"
L"\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7"
L"\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
L"\n");
_wprintf(L"\x391\x392\x393\x394\x395\x396\x397"
L"\x398\x399\x39A\x39B\x39C\x39D\x39E\x39F"
L"\x3A0\x3A1\x3A2\x3A3\x3A4\x3A5\x3A6\x3A7"
L"\x3A8\x3A9\x3AA\x3AB\x3AC\x3AD\x3AE\x3AF\x3B0"
L"\n"
L"\x3B1\x3B2\x3B3\x3B4\x3B5\x3B6\x3B7"
L"\x3B8\x3B9\x3BA\x3BB\x3BC\x3BD\x3BE\x3BF"
L"\x3C0\x3C1\x3C2\x3C3\x3C4\x3C5\x3C6\x3C7"
L"\x3C8\x3C9\x3CA\x3CB\x3CC\x3CD\x3CE"
L"\n");
_wprintf(L"\x410\x411\x412\x413\x414\x415\x401\x416\x417"
L"\x418\x419\x41A\x41B\x41C\x41D\x41E\x41F"
L"\x420\x421\x422\x423\x424\x425\x426\x427"
L"\x428\x429\x42A\x42B\x42C\x42D\x42E\x42F"
L"\n"
L"\x430\x431\x432\x433\x434\x435\x451\x436\x437"
L"\x438\x439\x43A\x43B\x43C\x43D\x43E\x43F"
L"\x440\x441\x442\x443\x444\x445\x446\x447"
L"\x448\x449\x44A\x44B\x44C\x44D\x44E\x44F"
L"\n");
They should produce the following text in the console:
 ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩΪΫάέήίΰ
αβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдеёжзийклмнопрстуфхцчшщъыьэюя
I do not know the encoding in which your IDE stores non-ASCII characters in .c/.cpp files, and I do not know what your compiler does when it encounters non-ASCII characters. This part you will have to figure out yourself.
As long as you supply to _wprintf() properly encoded UTF-16 text or call WriteConsoleA() with properly encoded UTF-8 text, things should work.
P.S. Some gory details about console fonts can be found here.
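To make the conversion step concrete, here is a minimal, portable sketch of what a UTF-16 to UTF-8 conversion produces. It is not how WideCharToMultiByte() is implemented. The names append_utf8 and utf16_to_utf8 are made up for illustration, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 text to UTF-8.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

For example, utf16_to_utf8(u"Se\u00E1n") yields the bytes 53 65 C3 A1 6E, which is the kind of data WriteConsoleA() expects once the console code page is 65001.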

The Windows console is generally considered badly broken with regard to character encodings. You can read about this problem here, for example.
The problem is that Windows generally uses the ANSI codepage (Windows-1252, assuming you are in Western Europe or America), but the console uses the OEM codepage (CP850 under the same assumption).
You have several options:
Convert the text to CP850 before writing it (see CharToOem()). The drawback is that if the user redirects the output to a file (> file.txt) and opens the file with e.g. Notepad, the accented characters will be displayed incorrectly.
Change the codepage of the console: You need to select a TTF console font (Lucida Console, for example) and use the command chcp 1252.
Use Unicode text and wprintf(): You need the TTF console font anyway.
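As a rough illustration of the first option, the sketch below remaps only the five accented vowels from the question. It assumes the ANSI code page is Windows-1252 and the OEM code page is CP850. The helper names ansi_to_oem_vowel and ansi_to_oem are hypothetical, and real code should call CharToOem() instead, which covers the whole code page.

```cpp
#include <string>

// Map a Windows-1252 byte to its CP850 equivalent for the five lowercase
// acute-accented vowels; every other byte is passed through unchanged.
unsigned char ansi_to_oem_vowel(unsigned char c)
{
    switch (c) {
    case 0xE1: return 0xA0; // á
    case 0xE9: return 0x82; // é
    case 0xED: return 0xA1; // í
    case 0xF3: return 0xA2; // ó
    case 0xFA: return 0xA3; // ú
    default:   return c;
    }
}

// Apply the mapping to every byte of a Windows-1252 string.
std::string ansi_to_oem(std::string text)
{
    for (char& ch : text) {
        ch = static_cast<char>(ansi_to_oem_vowel(static_cast<unsigned char>(ch)));
    }
    return text;
}
```

With this mapping, the Windows-1252 bytes of "Seán" (53 65 E1 6E) become 53 65 A0 6E, which a CP850 console renders as the intended text.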

The character set used by Windows console mode (the OEM code page) is not the same as the Windows-1252 (also known as "ANSI") character set used by GUI applications. Hence the IDE representation differs from the runtime representation.
A quick-and-dirty solution for your example is to write the character as its code in the console's OEM code page (á is 0xA0 in CP437 and CP850):
printf("my name is Se\xa0n\n");
Most solutions to this problem are flawed one way or another and the simplest solution for Windows applications that need extensive multi-language localisation is to write them as GUI apps using Unicode.

Related

Strange unicode error when converting Chinese wide strings to regular strings in C++

Some of my Chinese software users noticed a strange C++ exception being thrown when my C++ code for Windows tried to list all running processes:
在多字节的目标代码页中,没有此 Unicode 字符可以映射到的字符。
Translated to English this roughly means:
There are no characters to which this Unicode character can be mapped
in the multi-byte target code page.
The code which prints this is:
try
{
    list_running_processes();
}
catch (std::runtime_error &exception)
{
    LOG_S(ERROR) << exception.what();
    return EXIT_FAILURE;
}
The most likely culprit source code is:
std::vector<running_process_t> list_running_processes()
{
    std::vector<running_process_t> running_processes;
    const auto snapshot_handle = unique_handle(CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0));
    if (snapshot_handle.get() == INVALID_HANDLE_VALUE)
    {
        throw std::runtime_error("CreateToolhelp32Snapshot() failed");
    }
    PROCESSENTRY32 process_entry{};
    process_entry.dwSize = sizeof process_entry;
    if (Process32First(snapshot_handle.get(), &process_entry))
    {
        do
        {
            const auto process_id = process_entry.th32ProcessID;
            const auto executable_file_path = get_file_path(process_id);
            // *** HERE ***
            const auto process_name = wide_string_to_string(process_entry.szExeFile);
            running_processes.emplace_back(executable_file_path, process_name, process_id);
        } while (Process32Next(snapshot_handle.get(), &process_entry));
    }
    return running_processes;
}
Or alternatively:
std::string get_file_path(const DWORD process_id)
{
    std::string file_path;
    const auto snapshot_handle = unique_handle(CreateToolhelp32Snapshot(TH32CS_SNAPMODULE, process_id));
    MODULEENTRY32W module_entry32{};
    module_entry32.dwSize = sizeof(MODULEENTRY32W);
    if (Module32FirstW(snapshot_handle.get(), &module_entry32))
    {
        do
        {
            if (module_entry32.th32ProcessID == process_id)
            {
                return wide_string_to_string(module_entry32.szExePath); // *** HERE ***
            }
        } while (Module32NextW(snapshot_handle.get(), &module_entry32));
    }
    return file_path;
}
This is the code for performing a conversion from a std::wstring to a regular std::string:
std::string wide_string_to_string(const std::wstring& wide_string)
{
    if (wide_string.empty())
    {
        return std::string();
    }
    const auto size_needed = WideCharToMultiByte(CP_UTF8, 0, &wide_string.at(0),
        static_cast<int>(wide_string.size()), nullptr, 0, nullptr, nullptr);
    std::string str_to(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wide_string.at(0), static_cast<int>(wide_string.size()), &str_to.at(0),
        size_needed, nullptr, nullptr);
    return str_to;
}
Is there any reason this can fail on Chinese language file paths or Chinese language Windows etc.? The code works fine on regular western Windows machines. Let me know if I'm missing any crucial pieces of information here since I cannot debug or test this on my own right now without access to one of the affected machines.
I managed to test on a Chinese machine and it turns out that converting a file path from wide string to a regular string will produce a bad file path output if the file path contains e.g. Chinese (non-ASCII) symbols.
I could fix this bug by replacing calls to wide_string_to_string() with std::filesystem::path(wide_string_file_path).string() since the std::filesystem API will handle the conversion correctly for file paths unlike wide_string_to_string().

How to handle pasted text correctly via ReadConsoleInput()?

In the Windows console, we can use ReadConsoleInput() to get raw keyboard (and other) input. I want to use it to implement a custom function that reads keystrokes together with the state of CTRL, SHIFT and ALT. A simplified version of the function is:
// for demo only, no error checking ...
struct ret {
    wchar_t ch; // 2-byte UTF-16 in Windows
    DWORD control_keys;
};
ret getch() {
    HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
    INPUT_RECORD buf;
    DWORD cnt;
    for (;;) {
        ReadConsoleInput(in, &buf, 1, &cnt);
        if (buf.EventType != KEY_EVENT)
            continue;
        const KEY_EVENT_RECORD& rec = buf.Event.KeyEvent;
        if (!rec.bKeyDown)
            continue;
        if (!rec.uChar.UnicodeChar)
            continue;
        return { rec.uChar.UnicodeChar, rec.dwControlKeyState };
    }
}
It works fine, except that when I paste a character that does not fit in a single 16-bit UTF-16 code unit (i.e. one outside the Basic Multilingual Plane), the UnicodeChar field is 0 when bKeyDown==true, and the UnicodeChar field holds the pasted content when bKeyDown==false. Can anyone explain why this is the case and suggest possible workarounds?
Here is some demo code and result.
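Whatever rule ends up deciding which key records to accept, a character outside the Basic Multilingual Plane cannot fit in one UnicodeChar value, so it has to be reassembled from two consecutive 16-bit code units. The sketch below shows only that reassembly step. The class name SurrogateAssembler is hypothetical, char16_t stands in for the WCHAR field so the code stays portable, and silently dropping stray surrogates is an assumption. It does not explain the bKeyDown behaviour.

```cpp
#include <cstdint>

// Collects UTF-16 code units one at a time and reports complete code points.
class SurrogateAssembler {
public:
    // Feed one code unit. Returns true and sets `cp` when a full code point
    // is available; returns false while waiting for the second half of a
    // pair or when a stray low surrogate is dropped.
    bool feed(char16_t unit, std::uint32_t& cp)
    {
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            pending_high_ = unit; // remember the high surrogate
            return false;
        }
        if (unit >= 0xDC00 && unit <= 0xDFFF) {
            if (pending_high_ == 0) {
                return false; // low surrogate without a preceding high one
            }
            cp = 0x10000
               + ((static_cast<std::uint32_t>(pending_high_) - 0xD800) << 10)
               + (static_cast<std::uint32_t>(unit) - 0xDC00);
            pending_high_ = 0;
            return true;
        }
        pending_high_ = 0; // an ordinary BMP character cancels a pending half
        cp = unit;
        return true;
    }

private:
    char16_t pending_high_ = 0;
};
```

In a getch()-style loop, each non-zero UnicodeChar from an accepted record would be passed to feed(), and the function would return only once feed() reports a complete code point. The return type would then need to hold a 32-bit value rather than a single wchar_t.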

Pasting after SetClipboardData() in C++ does not include newlines for Notepad

I'm developing some software which copies a large string to the windows clipboard to paste into some other software. Pasting in the other software does not work, and when I paste into Notepad, the newlines in the initial strings are gone, which is why it is failing to paste in the other software. I know this because when I re-add the newlines to Notepad, and do a Copy, Pasting then works in the other program. When I paste into Wordpad, the newlines are there mysteriously.
I'm using SetClipboardData() in C++ with the CF_TEXT clipboard format type. I've tried using CF_OEMTEXT, CF_DSPTEXT but neither of those work. I saw some documentation on CF_SYLK (Symbolic Link) for spreadsheets, as the software I'm pasting in is similar to a spreadsheet, but I couldn't get that to work either. Below is my code for copying to the clipboard.
void ClipBoardManager::CopyExcelStringToClipBoard(std::string excel_str)
{
    if (!OpenClipboard(nullptr)) {
        return;
    }
    EmptyClipboard();
    HGLOBAL hg = GlobalAlloc(GMEM_MOVEABLE, excel_str.size() + 1);
    if (!hg) {
        CloseClipboard();
        return;
    }
    memcpy(GlobalLock(hg), excel_str.c_str(), excel_str.size() + 1);
    GlobalUnlock(hg);
    // On success the clipboard owns the memory, so free it only on failure.
    if (!SetClipboardData(CF_TEXT, hg)) {
        GlobalFree(hg);
    }
    CloseClipboard();
}
Any help is appreciated.
The excel_str must use CRLF line endings. Here is example code for converting a string to that format:
string replaceAll(string in, string replaceIn, string replaceOut)
{
    size_t pos = 0;
    while (pos < in.size())
    {
        size_t pos2 = in.find(replaceIn, pos);
        if (pos2 != string::npos)
        {
            in.replace(in.begin() + pos2, in.begin() + pos2 + replaceIn.size(), replaceOut);
            pos = pos2 + replaceOut.size();
        }
        else
            break;
    }
    return in;
}
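One caveat with a plain search-and-replace: calling replaceAll(text, "\n", "\r\n") on text that already contains CRLF turns each "\r\n" into "\r\r\n". A minimal sketch of a converter that leaves existing CRLF pairs alone could look like this. The name to_crlf is made up for illustration, and lone '\r' characters are deliberately left untouched.

```cpp
#include <cstddef>
#include <string>

// Return a copy of `in` in which every '\n' that is not already preceded
// by '\r' is expanded to "\r\n".
std::string to_crlf(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r')) {
            out += '\r';
        }
        out += in[i];
    }
    return out;
}
```

The result of to_crlf(excel_str) can then be copied to the clipboard in place of the original string.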
If your project is set up for Unicode characters (the default), use Unicode everywhere and use CF_UNICODETEXT instead of CF_TEXT. Otherwise use non-Unicode text consistently and change the project settings accordingly.
The code below copies text with its line endings intact: after the program exits, the copied text (including line endings) can be pasted from the clipboard into, say, Notepad:
#include <Windows.h>
#include <string.h>
BOOL WINAPI ToClipboard(VOID);
int main()
{
    ToClipboard();
}
BOOL WINAPI ToClipboard(VOID)
{
    wchar_t* lptstrCopy;
    HGLOBAL hglbCopy;
    if (!OpenClipboard(NULL))
        return FALSE;
    EmptyClipboard();
    // Allocate a global memory object for the text.
    wchar_t s[] = L"12345\n6789";
    hglbCopy = GlobalAlloc(GMEM_MOVEABLE,
        (wcslen(s) + 1) * sizeof(wchar_t));
    if (hglbCopy == NULL)
    {
        CloseClipboard();
        return FALSE;
    }
    lptstrCopy = (wchar_t*)GlobalLock(hglbCopy);
    // Copy the text including its terminating null character.
    memcpy(lptstrCopy, s,
        (wcslen(s) + 1) * sizeof(wchar_t));
    GlobalUnlock(hglbCopy);
    SetClipboardData(CF_UNICODETEXT, hglbCopy);
    CloseClipboard();
    return TRUE;
}

freopen and fwprintf writes incorrect wide characters into TXT file

I've already spent the whole day searching for an answer about the UTF-8 and UTF-16 options when freopen and fwprintf are used, with no results so far. I'll add my code below; maybe someone can help. Thanks in advance.
template<typename... ArgsT>
void log(const wchar_t* message, ArgsT... args)
{
    fwprintf(stdout, message, args...);
    fwprintf(stdout, L"\n");
    fflush(stdout);
}
int main()
{
    bool init = true;
    if (!std::freopen("log.txt", "w", stdout))
    {
        init = false;
    }
    if (std::fwide(stdout, 1) <= 0)
    {
        init = false;
    }
    if (init)
    {
        std::wstring str = L"кирилиця";
        log(L"Some text in cyrillic %S and some number %i", str.c_str(), 10);
    }
    return 0;
}
As a result, the TXT file contains: Some text in cyrillic :8#8;8FO and some number 10
You need to start your file with the byte order mark, wchar_t(0xFEFF).
It tells text editors to treat the data that follows as Unicode.
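A byte order mark only helps if the bytes after it really are in the encoding it announces. As a minimal sketch of what such a file looks like on disk, the hypothetical helper utf16le_with_bom below builds the raw bytes of a little-endian UTF-16 file by hand. Writing them through a binary-mode stream is an assumption of this sketch, not something the answer above spells out.

```cpp
#include <string>

// Build the on-disk bytes of a UTF-16LE text file: the byte order mark
// (FF FE) followed by each code unit, low byte first.
std::string utf16le_with_bom(const std::u16string& text)
{
    std::string bytes;
    bytes.reserve(2 + text.size() * 2);
    bytes += '\xFF';
    bytes += '\xFE';
    for (char16_t unit : text) {
        bytes += static_cast<char>(unit & 0xFF);        // low byte
        bytes += static_cast<char>((unit >> 8) & 0xFF); // high byte
    }
    return bytes;
}
```

The returned string can be written with std::ofstream opened in std::ios::binary mode, so that no newline or character translation alters the bytes.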

Call HPDF_SaveToFile() with japanese filename

I'm trying to save a PDF to a path that contains a Japanese user name. In this case, HPDF_SaveToFile crashes my app on Windows. Is there a compile option or something else that helps? Any idea how to support Unicode file names with libharu? I don't want to create a PDF with Japanese encoding; I want to write a PDF with a Japanese file name.
Here is a solution using Qt. In plain C++ you can use fstream/ofstream (write()) instead; in C you can use fwrite.
QFile file(path);
if (file.open(QIODevice::WriteOnly))
{
    HPDF_SaveToStream(m_pdf);
    /* get the data from the stream and write it to file. */
    for (;;)
    {
        HPDF_BYTE buf[4096];
        HPDF_UINT32 siz = 4096;
        HPDF_STATUS ret = HPDF_ReadFromStream(m_pdf, buf, &siz);
        if (siz == 0)
        {
            break;
        }
        if (-1 == file.write(reinterpret_cast<const char *>(buf), siz))
        {
            qDebug() << "Write PDF error";
            break;
        }
    }
}
HPDF_Free(m_pdf);
Reference: Libharu Usage examples