I want to know how to enter UTF-8 text in Visual Studio's default debugging console.
I changed the input code page with SetConsoleCP(CP_UTF8), but when I read the input with ReadConsole, the text came back as UTF-16 rather than UTF-8, and I couldn't get UTF-8 directly.
While looking for an answer to my own question, "How to use ESC sequences to set foreground colors?", I tried to find out how printf() itself works, with the idea of writing my own function that accepts only RGB values and automatically applies the color as an escape sequence to the VT100 emulator of the Windows console. But as I dug into the code, I discovered that most of the I/O, including stdin and stdout, is written in C++, as is the validator for the printf format string.
As I was taught, to build a program the compiler follows the includes all the way back, until the text of every included library has been pasted into your code. But my compiler is set to C only, so it can't compile C++ code. How can this work? And is there another way to write to stdout with the escape codes I already know (\033[%d;%d;%d;%d;%dm) while pushing the limits of the console output?
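For the last part of the question, here is a minimal sketch of building the escape sequence by hand, without going through printf's internals. It assumes the five-parameter form quoted above is the 24-bit foreground sequence 38;2;R;G;B and that the console has virtual terminal processing enabled. The names rgb_foreground and reset_attributes are hypothetical helpers, not part of any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: builds the SGR sequence ESC[38;2;R;G;Bm, which selects
// a 24-bit foreground color on terminals that support it.
std::string rgb_foreground(int r, int g, int b) {
    char buf[64];
    // 38 = "set foreground", 2 = "color is given as RGB".
    std::snprintf(buf, sizeof buf, "\033[%d;%d;%d;%d;%dm", 38, 2, r, g, b);
    return std::string(buf);
}

// ESC[0m resets all attributes to the terminal defaults.
std::string reset_attributes() {
    return "\033[0m";
}
```

A call such as std::fputs((rgb_foreground(255, 128, 0) + "orange" + reset_attributes()).c_str(), stdout) would then write the colored text. The same snprintf format string can be used unchanged from plain C.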
I know there are many questions already asked on this topic but I am facing a very unusual situation here.
I am working on CentOS. My application reads some data as wchar_t, converts it to multibyte (UTF-8 encoding), fills a char buffer in a Google proto message, and sends it to another application.
The other application converts it back to a wide string and displays it to the user. I am using wcstombs for the conversion. My locale is "en_US.UTF-8".
For some strings it works fine. The problem is with one particular wide string (there may be others), for which wcstombs returns -1. errno is set to 84 (Invalid or incomplete multibyte or wide character).
The problem is that when I run my application through Eclipse, the conversion succeeds, but when it is run as root (as a service), the conversion fails.
The same string converts successfully on Windows using the WideCharToMultiByte API.
I am not able to understand why this is happening.
Hope the experts can help me out.
EDIT
My wide string is L"\006£æ?Jÿ", which, when converted and displayed to the user, appears as in the image.
L"\006" doesn't appear to be a valid unicode string (neither in UTF-16 nor in UTF-32). I agree with wcstombs, there's no corresponding UTF-8 sequence.
I suspect you didn't use WC_ERR_INVALID_CHARS on Windows. That would catch the same error.
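Separately, since the same string converts under Eclipse but not when the program runs as a service, it may be worth checking the environment too. wcstombs converts according to the LC_CTYPE locale active in the process, which is "en_US.UTF-8" only if the program called setlocale and the name it was given (or the environment, for an empty name) selects that locale. This is an assumption about a possible cause, not a confirmed diagnosis. The sketch below shows one way to surface the failure; to_multibyte is a hypothetical helper.

```cpp
#include <cerrno>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts a wide string with wcstombs under whatever
// LC_CTYPE locale is currently active. Returns false and leaves `out`
// untouched if some character has no representation in that locale.
bool to_multibyte(const std::wstring& in, std::string& out) {
    // MB_CUR_MAX is the longest multibyte character in the active locale.
    std::vector<char> buf(in.size() * MB_CUR_MAX + 1);
    errno = 0;
    std::size_t n = std::wcstombs(buf.data(), in.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        return false;  // errno is EILSEQ (84 on Linux) in this case
    }
    out.assign(buf.data(), n);
    return true;
}
```

At startup, calling std::setlocale(LC_ALL, "") and logging its return value in both environments would show whether the service really runs under the locale you expect; a null return means the requested locale is not available.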
I'm trying to output UTF-8 characters in the Windows command line. I can't seem to get the function SetConsoleOutputCP to work. I also heard that you have to change the font to "Lucida Grande" for it to work, but I can't get that working either. Can someone please provide me with a short example of how to use these functions to correctly output UTF-8 characters to the console?
Also I heard that those functions don't work in Windows XP, is there a better alternative to those functions which will work in Windows XP?
[I know this question is old and was about Windows XP, but it still seemed like a good place to drop this information so I (and maybe others) can find it again in the future.]
Support for Unicode in CMD windows has improved in newer versions of Windows. This program will work on Windows 10.
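#include <iostream>
#include <Windows.h>

// RAII guard: switches the console output code page to UTF-8 and restores
// the previous code page on destruction.
class UTF8CodePage {
public:
    UTF8CodePage() : m_old_code_page(::GetConsoleOutputCP()) {
        ::SetConsoleOutputCP(CP_UTF8);
    }
    ~UTF8CodePage() { ::SetConsoleOutputCP(m_old_code_page); }
private:
    UINT m_old_code_page;
};

int main() {
    UTF8CodePage use_utf8;
    // In C++20 a u8 literal has type const char8_t[], so this initialization
    // compiles as written only in C++11 through C++17.
    const char *text = u8"This text is in UTF-8. ¡Olé! 佻\n";
    std::cout << text;
    return 0;
}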
I made an RAII class to ensure the code page is restored because it would be rude to leave the code page changed if the user had purposely selected a specific one. All the Windows-specific code (SetConsoleOutputCP) is contained within that class. The definition of the use_utf8 variable in main changes the code page to UTF-8, and that code page will stay in effect until the variable is destructed at the end of the scope.
Note that I used the u8 prefix on the string literal, which is a newer feature of C++ to ensure that the string is encoded using UTF-8 regardless of the encoding used for the source file. You don't have to use that feature if you have another way to make a string of valid UTF-8 text.
You still have to be sure that the CMD window is using a font that supports the glyphs you need. I don't think there's a way to get font linking automatically.
But this will at least show the replacement character if the font is missing the glyph. For example, in my window, the ¡Olé! looks right but the CJK glyph is shown approximately like �. If the user copies that replacement character, the clipboard will receive the original glyph, so they can paste it into other programs without any loss of fidelity.
Note that command line parameters you get from main's argv will be in the original code page. One way to work around this is to get the unconverted "wide" command line with GetCommandLineW, convert it to UTF-8 with WideCharToMultiByte, and then parse it yourself. Alternatively, you can pass the result of GetCommandLineW to CommandLineToArgvW, which will parse it, and then you'd convert each argument to UTF-8.
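The second approach might look roughly like the sketch below. utf8_args and wide_to_utf8 are hypothetical helper names, error handling is minimal, and the non-Windows branch exists only so the sketch builds elsewhere; it simply assumes argv is already UTF-8 there.

```cpp
#include <cstddef>
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW; link with Shell32.lib

// Converts one NUL-terminated UTF-16 string to UTF-8.
static std::string wide_to_utf8(const wchar_t* w) {
    // First call: ask for the required size in bytes, including the NUL.
    int n = ::WideCharToMultiByte(CP_UTF8, 0, w, -1, nullptr, 0, nullptr, nullptr);
    if (n <= 0) return std::string();
    std::string s(static_cast<std::size_t>(n), '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, w, -1, &s[0], n, nullptr, nullptr);
    s.resize(static_cast<std::size_t>(n) - 1);  // drop the trailing NUL
    return s;
}
#endif

// Hypothetical helper: returns the program arguments as UTF-8 strings.
std::vector<std::string> utf8_args(int argc, char** argv) {
    std::vector<std::string> args;
#ifdef _WIN32
    (void)argc;
    (void)argv;  // ignored: these are encoded in the original code page
    int wargc = 0;
    LPWSTR* wargv = ::CommandLineToArgvW(::GetCommandLineW(), &wargc);
    if (wargv != nullptr) {
        for (int i = 0; i < wargc; ++i) {
            args.push_back(wide_to_utf8(wargv[i]));
        }
        ::LocalFree(wargv);  // the array is allocated by the system
    }
#else
    // Assumption: on other platforms argv is already UTF-8.
    for (int i = 0; i < argc; ++i) {
        args.push_back(argv[i]);
    }
#endif
    return args;
}
```

Calling utf8_args(argc, argv) at the top of main and using the returned vector instead of argv keeps the rest of the program working with UTF-8 only.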
Finally, note that changing the code page affects only the output. If you input text from the user, it arrives encoded using the original code page (often called the OEM code page).
TODO: Figure out input. SetConsoleCP isn't doing what I think the documentation says it should do.
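Until that is sorted out, one workaround is to read wide characters and convert them yourself. On Windows that would mean reading with ReadConsoleW and converting the UTF-16 result to UTF-8, for example with WideCharToMultiByte and CP_UTF8. The sketch below shows only the conversion step, written in portable C++ so it can be tried anywhere. utf16_to_utf8 and append_utf8 are hypothetical helpers, and replacing an unpaired surrogate with U+FFFD is a choice made for this sketch.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
static void append_utf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts UTF-16 code units to a UTF-8 string.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High surrogate followed by a low surrogate: combine the pair.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
        }
        append_utf8(cp, out);
    }
    return out;
}
```

On Windows, wchar_t is 16 bits wide, so the buffer filled by ReadConsoleW holds UTF-16 code units and can be copied into a std::u16string (or the function adapted to take std::wstring) before conversion.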
Windows console doesn't play nice with UNICODE and particularly with UTF-8.
Setting a console code page to utf-8 won't work.
One approach is to use MultiByteToWideChar() (or something else) to convert the text to UTF-16, then WideCharToMultiByte() (or something else) to convert it to a localised ISO encoding. Then set the console code page to that ISO code page.
It's ugly, but it sort of works.
In short: SetConsoleOutputCP(CP_UTF8) and cout/wcout don't work together by default.
Though the Windows CRT supports UTF-8 output, a robust way to output UTF-8 characters to the console is to convert them to the console's current code page, especially if you want to use cout/wcout.
The standard high-level functions of basic_ostream do not work properly with UTF-8 by default.
I've seen usage of MultiByteToWideChar and WideCharToMultiByte with CP_OEMCP and CP_UTF8 parameters.
You may set up your application environment, including the console font, via SetCurrentConsoleFontEx, but it works only from Vista and Server 2008 onward.
Also, check this about cout and console.
_setmode and wprintf work together as well, but this may lead to a crash when non-wide-character functions are used afterwards.
The problem occurs because the code page that Windows uses in your console differs from the encoding of your source file.
Qt uses UTF-8 by default, but another editor may use a different encoding, so you must verify which one you're using.
To change to utf-8 use:
#include <windows.h>
SetConsoleOutputCP(CP_UTF8);
Currently it defaults to my system locale, which is 932 (Japanese Shift-JIS) in my case, but I want it to be 65001 (UTF-8) by default.
I can change the default for a given program by inserting a SetConsoleOutputCP line somewhere in the code and then removing it, but doing it for every program is pretty annoying.
Any suggestions?
This could be done with the standard approach: by composing the registry settings for the executable being debugged. My template dbg_console.reg:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Console\<encoded_path_to_executable>]
"ScreenBufferSize"=dword:1388012c
"WindowSize"=dword:00340096
"FontSize"=dword:00100000
"FontFamily"=dword:00000036
"FontWeight"=dword:00000190
"FaceName"="Lucida Console"
"HistoryNoDup"=dword:00000000
"QuickEdit"=dword:00000001
"CodePage"=dword:000004e3
<encoded_path_to_executable> is the string where \ is replaced with _.
E.g.: Z:\prj\prj_name\out\debug\bin\program.exe is transformed to Z:_prj_prj_name_out_debug_bin_program.exe.
"CodePage"=dword:000004e3 sets the desired code page. The value is hexadecimal: 000004e3 is code page 1251, and 65001 (UTF-8) would be written as 0000fde9. It is important to choose a proper font.
In debug mode, the standard settings dialog opened from the title bar of the program's console window cannot function properly. However, you can obtain the desired settings from a Windows cmd window started with the Win+R shortcut. They appear under the key %SystemRoot%_system32_cmd.exe or similar.
Thus, you get exactly the desired code page in the debug console of your executable without the need to set up the code page conditionally at run time. The support of the 65001 code page can be determined from the keys in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage: if there is the key 65001 having an appropriate *.nls file, this page is supported.
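The two mechanical steps in this recipe, encoding the path into a subkey name and writing the code page as a hexadecimal DWORD, can be sketched as follows. console_subkey and code_page_line are hypothetical helpers written for illustration; they only build strings and do not touch the registry.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

// Hypothetical helper: turns an executable path into the subkey name used
// under HKEY_CURRENT_USER\Console by replacing every backslash with '_'.
std::string console_subkey(std::string path) {
    std::replace(path.begin(), path.end(), '\\', '_');
    return path;
}

// Hypothetical helper: formats a code page number as a .reg DWORD line.
// The DWORD is written as eight lowercase hexadecimal digits.
std::string code_page_line(unsigned code_page) {
    char buf[48];
    std::snprintf(buf, sizeof buf, "\"CodePage\"=dword:%08x", code_page);
    return std::string(buf);
}
```

For the path in the example above, console_subkey returns Z:_prj_prj_name_out_debug_bin_program.exe, and code_page_line(65001) returns the line "CodePage"=dword:0000fde9.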
While testing some functions to convert strings between wchar_t and utf8 I met the following weird result with Visual C++ express 2008
std::wcout << L"élève" << std::endl;
prints out "ÚlÞve:" which is obviously not what is expected.
This is obviously a bug. How can that be? How am I supposed to deal with such a "feature"?
The C++ compiler does not support Unicode in code files. You have to replace those characters with their escaped versions instead.
Try this:
std::wcout << L"\x00E9l\x00E8ve" << std::endl;
Also, your console must support Unicode as well.
UPDATE:
It's not going to produce the desired output in your console, because the console does not support Unicode.
I found these related questions with useful answers
Is there a Windows command shell that will display Unicode characters?
How can I embed unicode string constants in a source file?
You might also want to take a look at this question. It shows how you can actually hard-code Unicode characters into files using some compilers (I'm not sure what the options would be for MSVC).
This is obviously a bug. How can that be?
While other operating systems have dispensed with legacy character encodings and switched to UTF-8, Windows uses two legacy encodings: An "OEM" code page (used at the command prompt) and an "ANSI" code page (used by the GUI).
Your C++ source file is in ANSI code page 1252 (or possibly 1254, 1256, or 1258), but your console is interpreting it as OEM code page 850.
Your IDE and the compiler use the ANSI code page.
The console uses the OEM code page.
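To see the mismatch concretely, the sketch below takes the bytes that "élève" has in code page 1252 and decodes them the way a code page 850 console would. decode_cp850_subset is a hypothetical helper that maps only the two non-ASCII bytes involved here; a full table has 128 entries above 0x7F.

```cpp
#include <string>

// Decodes bytes as code page 850 and returns the result as UTF-8.
// Only the two bytes needed for this example are mapped.
std::string decode_cp850_subset(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out += static_cast<char>(b);  // ASCII is the same in both pages
        } else if (b == 0xE8) {
            out += "\xC3\x9E";            // CP850 0xE8 is U+00DE (Þ)
        } else if (b == 0xE9) {
            out += "\xC3\x9A";            // CP850 0xE9 is U+00DA (Ú)
        } else {
            out += '?';                   // not covered by this sketch
        }
    }
    return out;
}
```

In code page 1252, é is byte 0xE9 and è is byte 0xE8, so "élève" is the byte sequence E9 6C E8 76 65. Passing those bytes to decode_cp850_subset yields "ÚlÞve", which matches the output reported in the question.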
It also matters what you are doing with those conversion functions.