to unicode or not to unicode - c++

I'm getting a value from the registry. This value might have double byte characters in it.
I will later have to transfer this across the network to a C# client to display. C# is all unicode.
The registry function returns an MBCS string if you call the non-Unicode (ANSI) variant.
What should I use?
string result(cbData, '\0');
RegQueryValueExA(h_sub_key, "DisplayName", NULL, NULL, (LPBYTE) &result[0], &cbData);
or
string result(cbData, '\0');
RegQueryValueExW(h_sub_key, L"DisplayName", NULL, NULL, (LPBYTE) &result[0], &cbData);

Using Unicode whenever possible will make your life easier. The registry contains Unicode natively and converts to MBCS on the fly when you use RegQueryValueExA, so why would you want to do an unneeded conversion?
Converting to UTF-8 from UTF-16 might make sense for information going over the network, but if you control both ends of the connection it wouldn't be necessary.
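If you do decide to convert for the wire, a minimal sketch of the UTF-16 to UTF-8 direction could look like the following; WideCharToMultiByte with CP_UTF8 does the work, and the ToUtf8 function name is just for illustration:

#include <windows.h>
#include <string>

// Convert a UTF-16 std::wstring to a UTF-8 std::string, e.g. before
// putting it on the wire. Error handling kept to a minimum for brevity.
std::string ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    const int bytes = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(),
                                          (int)utf16.size(), NULL, 0, NULL, NULL);
    if (bytes <= 0) return std::string();
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(),
                        (int)utf16.size(), &utf8[0], bytes, NULL, NULL);
    return utf8;
}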

No, that's not the way it works. The string you get back from the first snippet is encoded according to the current system code page. It could be a double-byte encoding; it could be anything. The big problem, of course, is that the C# code on the other end of that Internet connection has no way to guess what the code page might be.
So do not use the first snippet. The second one gets you the string in UTF-16, the native encoding used in Windows; result then needs to be a std::wstring. UTF-16 is also the encoding used by C#, so you could send the binary string as-is, although that's not typically a good idea; XML is popular. It is up to you to define the wire format.
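To make the second approach concrete, here is a minimal sketch (error handling omitted; h_sub_key is assumed to be an already opened HKEY) of reading the value as UTF-16 into a std::wstring:

#include <windows.h>
#include <string>

std::wstring ReadDisplayName(HKEY h_sub_key)
{
    DWORD cbData = 0;
    // First call asks for the required size in bytes.
    RegQueryValueExW(h_sub_key, L"DisplayName", NULL, NULL, NULL, &cbData);
    if (cbData == 0) return std::wstring();
    std::wstring result(cbData / sizeof(wchar_t), L'\0');
    RegQueryValueExW(h_sub_key, L"DisplayName", NULL, NULL,
                     (LPBYTE)&result[0], &cbData);
    // REG_SZ data may include a terminating null; trim it if present.
    if (!result.empty() && result.back() == L'\0') result.pop_back();
    return result;
}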

Related

Can't display 'ä' in GLFW window title

glfwSetWindowTitle(win, "Nämen");
Becomes "N?men", where '?' is in a little black, twisted square, indicating that the character could not be displayed.
How do I display 'ä'?
If you want to use non-ASCII letters in the window title, then the string has to be utf-8 encoded.
GLFW: Window title:
The window title is a regular C string using the UTF-8 encoding. This means for example that, as long as your source file is encoded as UTF-8, you can use any Unicode characters.
If you see a little black, twisted square, then this indicates that the ä is encoded with some ISO encoding that is not UTF-8, maybe something like Latin-1. To fix this you need to open the file in an editor that lets you change the encoding of the file, change it to UTF-8 (without BOM), and fix the ä in the title.
It seems like the GLFW implementation does not work according to the specification in this case. Probably the function still uses Latin-1 instead of UTF-8.
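A quick way to rule out source-encoding problems, independent of the editor, is to spell the title out as explicit UTF-8 bytes ('ä' is 0xC3 0xA4 in UTF-8). A small sketch, with win being the GLFWwindow* from the question:

// Pass the title as raw UTF-8 bytes so the result does not depend on
// how the source file itself happens to be encoded.
glfwSetWindowTitle(win, "N\xC3\xA4men");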
I had the same problem on GLFW 3.3 Windows 64 bit precompiled binaries and fixed it like this:
SetWindowTextA(glfwGetWin32Window(win), "Nämen");
The issue does not lie within GLFW but within the compiler. Encodings are handled by the major compilers as follows:
Good guy clang assumes that every file is encoded in UTF-8
Trusty gcc checks the system's settings[1] and falls back on UTF-8 when it fails to determine one.
MSVC checks for a BOM and uses the detected encoding; otherwise it assumes that the file is encoded using the user's current code page[2].
You can determine your current code page by simply running chcp in your (Windows) console or PowerShell. For me, on a fresh install of Windows 11, it yields "850" (language: English; keyboard: German), which stands for Code Page 850.
To fix this issue you have several solutions:
Change your system's code page. Arguably the worst solution.
Prefix your strings with u8, escape all Unicode literals and convert the string to wide char before passing it to Win32 functions; e.g.:
const char* title = u8"\u0421\u043b\u0430\u0432\u0430\u0020\u0423\u043a\u0440\u0430\u0457\u043d\u0456\u0021";
// This conversion is actually performed by GLFW; see footnote [3]
const int l = MultiByteToWideChar(CP_UTF8, 0, title, -1, NULL, 0);
wchar_t* buf = static_cast<wchar_t*>(_malloca(l * sizeof(wchar_t)));
MultiByteToWideChar(CP_UTF8, 0, title, -1, buf, l);
SetWindowTextW(hWnd, buf);
_freea(buf);
Save your source files with UTF-8 encoding WITH BOM. This allows you to write your strings without having to escape them. You'll still need to convert the string to a wide char string using the method seen above.
Specify the /utf-8 flag when compiling; this has the same effect as the previous solution, but you don't need the BOM anymore.
The solutions stated above still require you to convert your good and nice string to a big chunky wide string.
Another option would be to provide a manifest[4] with the activeCodePage set to UTF-8. This way all Win32 functions with the A-suffix (e.g. SetWindowTextA) now accept and properly handle UTF-8 strings, provided the running system is Windows version 1903 or newer.
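As a rough sanity check (a sketch, not part of GLFW), you can ask at runtime whether the manifest actually took effect: GetACP() reports the process's active code page, and 65001 (CP_UTF8) means the A-suffixed functions will treat your strings as UTF-8:

#include <windows.h>
#include <cstdio>

int main()
{
    if (GetACP() == CP_UTF8)   // 65001
        std::printf("Active code page is UTF-8; SetWindowTextA will accept UTF-8.\n");
    else
        std::printf("Active code page is %u; the manifest was not applied.\n", GetACP());
}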
TL;DR
Compile your application with the /utf-8 flag active.
IMPORTANT: This works for the Win32 APIs. This doesn't let you magically write Unicode emojis to the console like a hipster JS developer.
[1] I suppose it reads the LC_ALL setting on Linux. In the last six years I have never seen a distribution that does not specify UTF-8. However, take this information with a grain of salt; I might be entirely wrong about how gcc handles this now.
[2] If no byte-order mark is found, it assumes that the source file is encoded in the current user code page [...].
[3] GLFW performs the conversion as seen here.
[4] More about Win32 ANSI APIs, manifests and UTF-8 can be found here.

c++ Lithuanian language, how to get more than ascii

I am trying to use Lithuanian in my C++ application, but every attempt is unsuccessful.
The multi-byte character set is used. I have tried everything I have thought of; I am new to C++ and have never tried to do anything in Lithuanian before.
I tried every setlocale variant: setlocale(LC_ALL, "en_US.utf8"); setlocale(LC_ALL, "Lithuanian"); ...
I researched for 2 hours and didn't find proper examples or a solution.
I have an average-sized project which needs Lithuanian translation from a database, and it can't understand most of "ĄČĘĖĮŠŲŪąčęėįšųū".
Compiler - Visual Studio 2013
Database - sqlite3.
I can't even get simple strings (defined myself) to work and output as Lithuanian in a Win32 application.
In Windows use wide character strings (UTF-16[1] encoding, wchar_t type) for internal text handling, and preferably UTF-8 for external text files and networking.
Note that Visual C++ will translate narrow text literals from the source encoding to Windows ANSI, which is a platform-dependent usually single-byte encoding (you can check which one via the GetACP API function), i.e., Visual C++ has the platform-specific Windows ANSI as its narrow C++ execution character set.
But also do note that for an app restricted to non-Windows platforms, i.e. Unix-land, it makes practical sense to do everything in UTF-8, based on char type.
For the database communication you may need to translate to and from the program's internal text representation.
This depends on what the database interface requires, which is not stated.
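Since the question mentions sqlite3, one convenient option is to let sqlite3 hand the data back as UTF-16, which matches the wchar_t-based internal representation on Windows. A minimal sketch, under the assumption that you already have a prepared sqlite3_stmt* whose first column is text:

#include <sqlite3.h>
#include <string>

std::wstring ReadTextColumn(sqlite3_stmt* stmt)
{
    std::wstring value;
    if (sqlite3_step(stmt) == SQLITE_ROW)
    {
        // sqlite3_column_text16 returns UTF-16 in native byte order;
        // on Windows wchar_t is 16 bits, so the cast below is valid there.
        const wchar_t* p = static_cast<const wchar_t*>(sqlite3_column_text16(stmt, 0));
        if (p) value = p;
    }
    return value;
}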
Example for console output in Windows:
#include <iostream>
#include <fcntl.h>
#include <io.h>
auto main() -> int
{
    _setmode( _fileno( stdout ), _O_WTEXT );
    using namespace std;
    wcout << L"ĄČĘĖĮŠŲŪąčęėįšųū" << endl;
}
To make this compile by default with g++, the source code encoding needs to be UTF-8. Then, to make it produce correct results with Visual C++, the source code encoding needs to be UTF-8 with BOM, which happily is also accepted by modern versions of g++. Otherwise the Visual C++ compiler will assume the Windows ANSI encoding and produce an incorrect UTF-16 string.
Not coincidentally this is the default meaning of UTF-8 in Windows, e.g. in the Notepad editor, namely UTF-8 with BOM.
But note that while in Windows the problem is that the main system compiler requires a BOM for UTF-8, in Unix-land the problem is the opposite, that many old tools can't handle the BOM (for example, even MinGW g++ 4.9.1 isn't yet entirely up to speed: it sometimes includes the BOM bytes, then incorrectly interpreted, in error messages).
[1] On other platforms wide character text can be encoded in other ways, e.g. with UTF-32. In fact the Windows convention is in direct conflict with the C and C++ standards, which require that a single wchar_t should be able to encode any character in the extended character set. However, this requirement was, AFAIK, imposed after Windows adopted UTF-16, so the fault probably lies with the politics of the C and C++ standardization process, not yet another Microsoft'ism.
Complexity of internationalisation
There are several related but distinct topics that can cause mismatches between them, making a trial-and-error approach very tedious:
type used for storing strings and chars: Windows uses wchar_t by default, but for most APIs you also have char-based equivalent functions
character set encoding: this defines how the chars stored in the type are to be understood, for example Unicode (UTF-8, UTF-16, UTF-32), 7-bit ASCII, or an 8-bit ANSI code page. In Windows, by default it is UTF-16 for wchar_t and Windows ANSI for char
locale: defines, among other things, the character set assumptions used when processing strings. This permits the use of language-independent functions like isalpha(i, loc), islower(i, loc), ispunct(i, loc) to find out whether a given character is alphabetic, a lower-case letter, or punctuation, for example to break user text down into words (see the sketch after this list). C++ offers portable functions here.
output code page or font: used to show a character to the user. This assumes that the font used shows the characters using the same character set as the code uses internally.
source code encoding: for example, your editor could assume an ANSI encoding with the Windows-1252 character set.
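As a small illustration of the locale point above (a sketch; the locale name is an assumption, and which named locales are available differs between platforms):

#include <locale>
#include <iostream>

int main()
{
    std::locale loc("en_US.UTF-8");   // on Visual C++ a name like "en-US" may be needed
    wchar_t ch = L'\u010C';           // 'Č'
    std::wcout << std::boolalpha
               << std::isalpha(ch, loc) << L' '   // is it alphabetic in this locale?
               << std::islower(ch, loc) << L'\n'; // is it lower case?
}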
Most typical errors
Problem number one is Win32 console output, as Unicode is not well supported by the console. But that is not your problem here.
Another cause of mismatch is the encoding of your text editor. It might not use Unicode, but a Windows code page. In this case, you type "Č" and the editor displays it as such, but the editor might use the Windows-1257 encoding for Lithuanian and store 0xC8 in the file. If you then display this literal with a Windows Unicode function, it will interpret 0xC8 as a Latin capital E with grave accent and print something else, as the right Unicode code point for "Č" is 0x010C!
It can be even worse: the compiler may have its own assumptions about the character set encoding used and convert your literals into Unicode based on false assumptions (it happened to me when I used some exotic code-generation switch).
How to do ?
To figure out what goes wrong, proceed by elimination:
First, for plain Windows, use the native Unicode setting. OK, it's UTF-16 and wchar_t instead of UTF-8, and as such comes with some drawbacks, but it's native and well supported.
Then use explicit Unicode escapes in literals, for example TEXT("\u010C") instead of TEXT("Č"). This avoids editor and compiler mismatches.
If it's still not the right character, make sure that your font FULLY supports Unicode. The default system font, for instance, doesn't, while most others do. You can easily check with the Windows font panel (WindowsKey+R, fonts, then click on "search char") to display the character table of your font.
Set fonts explicitly in your code
For example, a very tiny experiment:
...
case WM_PAINT:
{
    hdc = BeginPaint(hWnd, &ps);
    auto hf = CreateFont(24, 0, 0, 0, 0, TRUE, 0, 0, 0, 0, 0, 0, 0, L"Times New Roman");
    auto hfOld = SelectObject(hdc, hf); // if you comment this out, € and Č won't display
    TextOut(hdc, 50, 50, L"Test with éç € \u010C special chars", 30);
    SelectObject(hdc, hfOld);
    DeleteObject(hf);
    EndPaint(hWnd, &ps);
    break;
}

Storing and retrieving UTF-8 strings from Windows resource (RC) files

I created an RC file which contains a string table. I would like to use some special characters: ö ü ó ú ő ű á é, so I save the string with UTF-8 encoding.
But when I call something like this in my cpp file:
LoadString("hu.dll", 12, nn, MAX_PATH);
I get a weird result:
How do I solve this problem?
As others have pointed out in the comments, the Windows APIs do not provide direct support for UTF-8 encoded text. You cannot pass the MessageBox function UTF-8 encoded strings and get the output that you expect. It will, instead, interpret them as characters in your local code page.
To get a UTF-8 string to pass to the Windows API functions (including MessageBox), you need to use the MultiByteToWideChar function to convert from UTF-8 to UTF-16 (what Windows calls Unicode, or wide strings). Passing the CP_UTF8 flag for the first parameter is the magic that enables this conversion. Example:
std::wstring ConvertUTF8ToUTF16String(const char* pszUtf8String)
{
    // Determine the size required for the destination buffer.
    const int length = MultiByteToWideChar(CP_UTF8,
                                           0,              // no flags required
                                           pszUtf8String,
                                           -1,             // automatically determine length
                                           nullptr,
                                           0);

    // Allocate a buffer of the appropriate length.
    std::wstring utf16String(length, L'\0');

    // Call the function again to do the conversion.
    if (!MultiByteToWideChar(CP_UTF8,
                             0,
                             pszUtf8String,
                             -1,
                             &utf16String[0],
                             length))
    {
        // Uh-oh! Something went wrong.
        // Handle the failure condition, perhaps by throwing an exception.
        // Call the GetLastError() function for additional error information.
        throw std::runtime_error("The MultiByteToWideChar function failed");
    }

    // Return the converted UTF-16 string.
    return utf16String;
}
Then, once you have a wide string, you will explicitly call the wide-string variant of the MessageBox function, MessageBoxW.
However, if you only need to support Windows and not other platforms that use UTF-8 everywhere, you will probably have a much easier time sticking exclusively with UTF-16 encoded strings. This is the native Unicode encoding that Windows uses, and you can pass these types of strings directly to any of the Windows API functions. See my answer here to learn more about the interaction between Windows API functions and strings. I recommend the same thing to you as I did to the other guy:
Stick with wchar_t and std::wstring for your characters and strings, respectively.
Always call the W variants of Windows API functions, including LoadStringW and MessageBoxW.
Ensure that the UNICODE and _UNICODE macros are defined either before you include any of the Windows headers or in your project's build settings.
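Putting the three points together, a minimal sketch for the question's resource ("hu.dll" and string ID 12 are taken from the question) could look like this:

#include <windows.h>

void ShowResourceString()
{
    // Load the resource DLL (LoadLibraryExW with LOAD_LIBRARY_AS_DATAFILE
    // also works if the DLL contains only resources).
    HMODULE hRes = LoadLibraryW(L"hu.dll");
    if (!hRes) return;
    wchar_t buffer[MAX_PATH] = L"";
    if (LoadStringW(hRes, 12, buffer, MAX_PATH) > 0)
        MessageBoxW(NULL, buffer, L"Resource string", MB_OK);
    FreeLibrary(hRes);
}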

Converting LCID to Language string

In my application I have to query for the system and user locale and return them as strings (like en-US instead of numeric language codes). I can get the LCID for these languages from the GetSystemDefaultLCID() and GetUserDefaultLCID() functions, but what I am struggling with is the conversion from LCID to a language string.
My app has to run on Windows XP as well, so I can't use the LCIDToLocaleName() Win API. The only way I have been able to get the locale name so far is to use the GetLocaleInfo() Win API, passing LOCALE_SYSTEM_DEFAULT and LOCALE_USER_DEFAULT as the LCID and LOCALE_SISO639LANGNAME as the LCType. My code looks somewhat like this to query the system locale:
int localeBufferSize = GetLocaleInfo(LOCALE_SYSTEM_DEFAULT, LOCALE_SISO639LANGNAME, NULL, 0);
char *sysLocale = new char[localeBufferSize];
GetLocaleInfo(LOCALE_SYSTEM_DEFAULT, LOCALE_SISO639LANGNAME, sysLocale,localeBufferSize);
But the value I get is only the language name(only en of en-US). To get the full locale I have to call GetLocaleInfo() twice passing LOCALE_SISO3166CTRYNAME as the LCType the second time and then append the two values. Is there any better way to do this?
Sorry to say that, but no. Only the two (or more) letter codes are standardized; the concatenation of them is not. But! There is no need for the dynamic buffer allocation. Quoted from MSDN:
The maximum number of characters allowed for this string is nine,
including a terminating null character.
Reading this, it sounds overkill to dynamically allocate the required buffer, so if you really want to keep it extra-clean, you can do something like this:
char buf[19];
int ccBuf = GetLocaleInfo(LOCALE_SYSTEM_DEFAULT, LOCALE_SISO639LANGNAME, buf, 9);
buf[ccBuf - 1] = '-'; // overwrite the terminating null with the separator
ccBuf += GetLocaleInfo(LOCALE_SYSTEM_DEFAULT, LOCALE_SISO3166CTRYNAME, buf + ccBuf, 9);
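For completeness: if you can ever drop the Windows XP requirement, Vista and later provide the one-call alternative the question already mentions, for example:

// Windows Vista and later only; returns a name like L"en-US" directly.
wchar_t name[LOCALE_NAME_MAX_LENGTH];
if (LCIDToLocaleName(GetSystemDefaultLCID(), name, LOCALE_NAME_MAX_LENGTH, 0) > 0)
{
    // 'name' now holds the full locale name, e.g. L"en-US".
}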

Can't read unicode (japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information to the standard output.
I am using Visual Studio 2008.
int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
    //myfile.imbue(locale("Japanese_Japan"));
    if(!myfile)
        cout << "While opening a file an error is encountered" << endl;
    else
        cout << "File is successfully opened" << endl;
    //wcout.imbue (locale("Japanese_Japan"));
    while ( myfile.good() )
    {
        getline(myfile,line);
        wcout << line << endl;
    }
    myfile.close();
    system("PAUSE");
    return 0;
}
This program generates some random output and I don't see any Japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. Which will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use it to this day...)
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16, and even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And all of this is only the case on Windows; other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream, not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows! Other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value; CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (i.e., character encoding) is correct. The short answer is: you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints; e.g., if you see a byte order mark, chances are it's whatever variant of Unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess and rely on the user to correct you if you're wrong, or you have to select a fixed character set and not attempt to support any others.
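To make that concrete, here is a minimal sketch (assuming the file turns out to be UTF-8; substitute the correct code page once you know it): read the raw bytes with a narrow stream, then convert to UTF-16 with MultiByteToWideChar:

#include <windows.h>
#include <fstream>
#include <sstream>
#include <string>

std::wstring ReadUtf8File(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream raw;
    raw << in.rdbuf();                      // slurp the file as raw bytes
    const std::string utf8 = raw.str();
    if (utf8.empty()) return std::wstring();

    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                        (int)utf8.size(), NULL, 0);
    if (len == 0) return std::wstring();    // invalid data for this code page
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        (int)utf8.size(), &utf16[0], len);
    return utf16;
}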
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly, but your program tries to convert them to whatever locale it is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, little-endian. If not, you will have trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx