Linux wide string to multibyte issue

Linux wide string to multibyte issue - c++

I know there are many questions already asked on this topic but I am facing a very unusual situation here.
I am working in Centos. My application reads some data in wchar_t and converts in multibyte (UTF-8 encoding) and fills the char buffer in a google proto message and sends to another application.
The other application converts it again to wide string and displays it to user. I am using wcstombs for the conversion. My locale is "en_US.UTF-8".
For some strings it is working fine. I am facing issue in one particular wide string (maybe there are several others) in which wcstombs returns -1. Error number is set to 84 (Invalid or incomplete multibyte or wide character).
The problem is, when I am running my application through eclipse, the conversion is successful but when my application is run from root (as a service), the conversion fails.
Same string conversion is successful in windows using widechartomultibyte API.
I am not able to understand why this is happening.
Hope the experts can help me out.
EDIT
My Wide string is L"\006£æ?Jÿ" which when converted and displayed to user becomes as the image

L"\006" doesn't appear to be a valid unicode string (neither in UTF-16 nor in UTF-32). I agree with wcstombs, there's no corresponding UTF-8 sequence.
I suspect you didn't use WC_ERR_INVALID_CHARS on Windows. That would catch the same error.

Related

SendInput: Combine Unicode and VirtualKey

I need to replay remote characters on a local machine. It seems I can only use SendInput to either send Unicode chars or VirtualKeys... But I need both. I do not know which of them an application needs. Games need virtual codes, but sometimes unicode (for chat), etc...
So how can I replay key input such that the end result is still the unicode char that I got from the remote end, but also transmits the virtual codes for apps that need it?
I was thinking of:
SendInput(virtualKey, DOWN)
SendInput(unicode, DOWN)
SendInput(unicode, UP)
SendInput(virtualKey, UP)
But I tried various combinations and always ended up with duplicated chars.

Change Console Code Page in Windows C++

I'm trying to output UTF8 characters in the Windows command line. I can't seem to get the function, setConsoleOutputCP to work. I also heard that you had to change the font to "Lucida Grande" for it to work but I can't get that working either. Can someone please provide me with a short example of how to use these functions to correctly output UTF-8 characters to the console?
Also I heard that those functions don't work in Windows XP, is there a better alternative to those functions which will work in Windows XP?

[I know this question is old and was about Windows XP, but it still seemed like a good place to drop this information so I (and maybe others) can find it again in the future.]
Support for Unicode in CMD windows has improved in newer versions of Windows. This program will work on Windows 10.
#include <iostream>
#include <Windows.h>
class UTF8CodePage {
public:
UTF8CodePage() : m_old_code_page(::GetConsoleOutputCP()) {
::SetConsoleOutputCP(CP_UTF8);
}
~UTF8CodePage() { ::SetConsoleOutputCP(m_old_code_page); }
private:
UINT m_old_code_page;
};
int main() {
UTF8CodePage use_utf8;
const char *text = u8"This text is in UTF-8. ¡Olé! 佻\n";
std::cout << text;
return 0;
}
I made an RAII class to ensure the code page is restored because it would be rude to leave the code page changed if the user had purposely selected a specific one. All the Windows-specific code (SetConsoleOutputCP) is contained within that class. The definition of the use_utf8 variable in main changes the code page to UTF-8, and that code page will stay in effect until the variable is destructed at the end of the scope.
Note that I used the u8 prefix on the string literal, which is a newer feature of C++ to ensure that the string is encoded using UTF-8 regardless of the encoding used for the source file. You don't have to use that feature if you have another way to make a string of valid UTF-8 text.
You still have to be sure that the CMD window is using a font that supports the glyphs you need. I don't think there's a way to get font linking automatically.
But this will at least show a the replacement character if the font is missing the glyph. For example, on my window, the ¡Olé! looks right but the CJK glyph is shown approximately like �. If the user copies that replacement character, the clipboard will receive the original glyph, so they can paste it into other programs without any loss of fidelity.
Note that command line parameters you get from main's argv will be in the original code page. One way to work around this is to get the unconverted "wide" command line with GetCommandLineW, convert it to UTF-8 with WideToMultibyte, and then parse it yourself. Alternatively, you can pass the result of GetCommandLineW to CommandLineToArgvW, which will parse it, and then you'd convert each argument to UTF-8.
Finally, note that changing the code page affects only the output. If you input text from the user, it arrives encoded using the original code page (often called the OEM code page).
TODO: Figure out input. SetConsoleCP isn't doing what I think the documentation says it should do.

Windows console doesn't play nice with UNICODE and particularly with UTF-8.
Setting a console code page to utf-8 won't work.
One approach is to use WideCharToMultiByte() (or something else) to convert the text to UTF-16, then MultiByteToWideChar() (or something else) to convert to a localised ISO encoding. The set the console code page to the ISO code page.
Its ugly, but it sort of works.

In short: SetConsoleOutputCP CP_UTF8 and cout/wcout dont work together by default.
Though windows CRT supports utf-8 output, a robust way to output to console utf-8 chars is to convert them into a console current codepage, especially if you want to use count/wcout.
Standard high level functions of basic_ostream does not work properly with utf-8 by default.
I've seen usage of MultiByteToWideChar and WideCharToMultiByte with CP_OEMCP and CP_UTF8 parameters.
You may setup your application environment, including console font via SetCurrentConsoleFontEx but it works only from Vista and Server 2008.
Also, check this about cout and console.
_setmode and wprintf works together as well, but this may lead to crash for non-wide char functions.

The problem occurs because there is a difference of codepage that uses windows in your console with the encoding of your source code text file.
Qt uses utf-8 by default, but another editor can use another one. So you must to verify which one you're using.
To change to utf-8 use:
#include <windows.h>
SetConsoleOutputCP(CP_UTF8);

Windows active code page

I have a certain library (IBM's WebSphere MQ) which I'm using, with an API that is suppose to return a remote servers character set.
After some debugging, it seems as though the return value of this function call returns the active code page of my machine. I saw this by looking at the return value of the function call and the result of running chcp in a command line - both returned 862. When I changed the language in the Control Panel->Regional and Language Options->Advanced tab to something else, both values changed again, which verified my suspicion.
My question is, what is the value that chcp returns? What Win32 API gets/sets it? How does it relate to locales? (trying to change the global locale in a C++ application using std::locale::global had no impact on it apparently).

CHCP returns the OEM codepage (OEMCP). The API is Get/SetConsoleCP.
You can set the C++ locale to ".OCP" to match this locale.

Locales mostly identifies languages, and given that historically there are not so many codepages (many languages' alphabet differ not so greatly from 26-Latin), several languages can be "mapped" to the same codepage. As I remember, there is no direct converstion function, but I made it with statistical approach:
For any given locale I collected this languages words I can obtain from the system (LOCALE_SMONTHNAME1..LOCALE_SMONTHNAME12, LOCALE_SNATIVELANGNAME etc) in Unicode
I called WideCharToMultiByte function for every string trying to convert them to this codepage one-byte encoding
WideCharToMultiByte(CodePage, CP_ACP or WC_NO_BEST_FIT_CHARS, ..., #DefChar, #DefUsed);
If DefUsed was set during the process, that it basically meant that this languages is not compatible with this codepage.

Unexpected output of std::wcout << L"élève"; in Windows Shell

While testing some functions to convert strings between wchar_t and utf8 I met the following weird result with Visual C++ express 2008
std::wcout << L"élève" << std::endl;
prints out "ÚlÞve:" which is obviously not what is expected.
This is obviously a bug. How can that be ? How am I suppose to deal with such "feature" ?

The C++ compiler does not support Unicode in code files. You have to replace those characters with their escaped versions instead.
Try this:
std::wcout << L"\x00E9l\x00E8ve" << std::endl;
Also, your console must support Unicode as well.
UPDATE:
It's not going to produce the desired output in your console, because the console does not support Unicode.

I found these related questions with useful answers
Is there a Windows command shell that will display Unicode characters?
How can I embed unicode string constants in a source file?

You might also want to take a look at this question. It shows how you can actually hard-code unicode characters into files using some compilers (I'm not sure what the options would be got MSVC).

This is obviously a bug. How can that be?
While other operating systems have dispensed with legacy character encodings and switched to UTF-8, Windows uses two legacy encodings: An "OEM" code page (used at the command prompt) and an "ANSI" code page (used by the GUI).
Your C++ source file is in ANSI code page 1252 (or possibly 1254, 1256, or 1258), but your console is interpreting it as OEM code page 850.

You IDE and the compiler use the ANSI code page.
The console uses the OEM code page.
It also matter what are you doing with those conversion functions.

Multi-byte character set in MFC application

I have a MFC application in which I want to add internationalization support. The project is configured to use the "multi-byte character set" (the "unicode character set" is not an option in my situation).
Now, I would expect the CWnd::OnChar() function to send me multi-byte characters if I set my keyboard to some foreign language, but it doesn't seem to work that way. The OnChar() function always sends me a 1-byte character in its nChar variable.
I thought that the _getmbcp() function would give me the current code page for the application, but this function always return 0.
Any advice would be appreciated.

And help here? Multibyte Functions in Microsoft C Run-time

As far as changing the default code page:
The default code page for a user (for WinXP - not sure how it is on Vista) is set in the "Regional and Languages options" Control Panel applet on the "Advanced" tab.
The "Language for non-Unicode programs" sets the default code page for the current user. Unfortunately it does not actually tell you the codepage number it's configuring - it just gives the language (which might be further specified with a region variant). This meakes sense from an end-user perspective, because I think codepage numbers have no meaning to 99.999% of end users. You need to reboot for a change to take effect. If you use regmon to determine what changes you could probably come up with something that specifies the default codepage somewhat easier.
Microsoft also has an unsupported utility, AppLocale, for testing localization that changes the codepage for particular applications: http://www.microsoft.com/globaldev/tools/apploc.mspx
Also you can change the code page for a thread by calling SetThreadLocale() - but you also have to call the C runtime's setlocale() function because some CRT functions don't talk to the Win API locale functions (and vice versa). See "Windows SetThreadLocale and CRT setlocale" by Chris Grimes for details.

As always in non Unicode scenarios, you'll get a reliable result only if the system locale (aka in Control Panel "language for non-unicode applications") is set accordingly. If not, don't expect anything good.
For example, if system locale is Chinese Traditional, you'll receive 2 successive WM_CHAR messages (One for each byte, assuming user composed a 2-char character).
isleadbyte() should help you determine if a 2nd byte is coming soon.
If your system locale is NOT set to Chinese, don't expect to receive correct messages even iusing a Chinese keyboard/IME. The misleading part is that some scenarios work. e.g. using a Greek keyboard, you'll receive WM_CHAR char values based of the Greek codepage even if your system locale is Latin-based. But you should really stay away from trying to cope with such scenarios: Success is not guaranteed and will likely vary according to Windows version and locale.
As MikeB wrote, MS AppLocale is your friend to make basic tests.
[ad] and appTranslator is your friend if you need to translate your UI [/ad]

For _getmbcp, MSDN says "A return value of 0 indicates that a single byte code page is in use." That sure makes it not very useful. Try one of these: GetUserDefaultLCID GetSystemDefaultLCID GetACP. (Now why is there no "user" equivalent for GetACP?)
Anyway if you want _getmbcp to return an actual value then set your system default language to Chinese, Japanese, or Korean.

There is actually a very simple (but weird) way of force the OnChar function to send unicode characters to the application even if it's configured in multi-byte character set:
SetWindowLongW( m_hWnd, GWL_WNDPROC, GetWindowLong( m_hWnd, GWL_WNDPROC ) );
Simply by calling the unicode version of "SetWindowLong", it forces the application to receive unicode characters.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Linux wide string to multibyte issue - c++

L"\006" doesn't appear to be a valid unicode string (neither in UTF-16 nor in UTF-32). I agree with wcstombs, there's no corresponding UTF-8 sequence. I suspect you didn't use WC_ERR_INVALID_CHARS on Windows. That would catch the same error.

Related

SendInput: Combine Unicode and VirtualKey

Change Console Code Page in Windows C++

Windows active code page

Unexpected output of std::wcout << L"élève"; in Windows Shell

Multi-byte character set in MFC application

Categories

Resources