Read UTF-8 encoded character from keyboard in ncurses - c++

When reading keyboard input in ncurses I use the getch() function, which works fine for ASCII characters but fails for UTF-8 encoded characters. If I press the character ś on the keyboard:
int c = getch();
The value of c should be 0xC59B in hex, but when I print it I only get 0xC5.
How can I read a whole character, and is getch() the correct function to use?

getch reads bytes, but UTF-8 is multibyte. You could read it byte by byte and interpret it yourself, but that's work most programs don't need. Use get_wch to read a whole wide character.
That assumes you've initialized ncurses' locale:
setlocale(LC_ALL, "");
(if you don't do that, getch would not return the correct bytes, anyway).
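A minimal sketch of that approach, assuming the wide-character build of ncurses (ncursesw) is linked (on some systems you may also need to define _XOPEN_SOURCE_EXTENDED or include <ncursesw/ncurses.h>):

#include <clocale>
#include <ncurses.h>

int main() {
    setlocale(LC_ALL, "");     // pick up the user's (UTF-8) locale before initscr()
    initscr();
    keypad(stdscr, TRUE);

    wint_t wc;
    int rc = get_wch(&wc);     // reads a whole wide character, not a single byte
    if (rc == OK)
        printw("Read character U+%04X\n", (unsigned)wc);
    else if (rc == KEY_CODE_YES)
        printw("Function key code %d\n", (int)wc);

    refresh();
    get_wch(&wc);              // wait for one more key before exiting
    endwin();
    return 0;
}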

getch's name comes from the Old Earth meaning of "character", which was really just "byte". Multibyte encodings are not understood by this mechanism.
However, it is still the right function to use; you just need to deal with its result properly. Call it repeatedly and append what you get to a string of bytes (in your particular example, you'll need two calls to obtain enough bytes to represent that particular Unicode character), then interpret those bytes with a UTF-8 library.
Don't forget to filter out the "special values" that getch can provide, as it does not always give you raw characters (consider, for example, the F1 key!).
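A rough sketch of that byte-by-byte approach (the helper names here are just for illustration; it assumes the locale has been set up as above):

#include <string>
#include <ncurses.h>

// Number of bytes in a UTF-8 sequence, derived from its lead byte.
static int utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x6)  return 2;  // 110xxxxx
    if ((lead >> 4) == 0xE)  return 3;  // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;                           // invalid lead byte; treat as a single byte
}

std::string read_utf8_char() {
    int c = getch();
    if (c == ERR || c > 0xFF)           // ERR, or a special value such as KEY_F(1)
        return std::string();
    std::string bytes(1, static_cast<char>(c));
    int need = utf8_len(static_cast<unsigned char>(c));
    for (int i = 1; i < need; ++i) {
        int next = getch();
        if (next == ERR || next > 0xFF)
            break;
        bytes.push_back(static_cast<char>(next));
    }
    return bytes;                       // hand these bytes to a UTF-8 library to interpret
}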

Related

Streaming Out Extended ASCII

I know that only positive character ASCII values are guaranteed cross platform support.
In Visual Studio 2015, I can do:
cout << '\xBA';
And it prints:
║
When I try that on http://ideone.com, nothing is printed.
If I try to directly print this using the literal character:
cout << '║';
Visual Studio gives the warning:
warning C4566: character represented by universal-character-name '\u2551' cannot be represented in the current code page (1252)
And then prints:
?
When this command is run on http://ideone.com I get:
14849425
I've read that wchars may provide a cross platform approach to this. Is that true? Or am I simply out of luck on extended ASCII?
There are two separate concepts in play here.
The first concept is that of a locale, which is often called a "code page" in Microsoft-ese. A locale defines which visual characters are represented by which byte sequences. In your first example, whatever locale your program is executed in happens to display the "║" character in response to the byte 0xBA.
Other locales, or code pages, will display different characters for the same bytes. Many locales are multibyte locales, where it can take several bytes to display a single character. In the UTF-8 locale, for example, the same character, ║, takes three bytes to display: 0xE2 0x95 0x91.
The second concept here is one of the source code character set, which comes from the locale in which the source code is edited, before it gets compiled. When you enter the ║ character in your source code, it may get represented, I suppose, either as the 0xBA character, or maybe 0xE2 0x95 0x91 sequence, if your editor uses the UTF-8 locale. The compiler, when it reads the source code, just sees the actual byte sequence. Everything gets reduced to bytes.
Fortunately, all C++ keywords use US-ASCII, so it doesn't matter what character set is used to write C++ code, until you start using non-ASCII characters. Those result in a compiler warning informing you, basically, that you're using something that may or may not work, depending on the eventual locale the resulting program runs in.
First, your input source file has its own encoding. Your compiler needs to be able to read this encoding (maybe with the help of flags/settings).
With a simple string, the compiler is free to do what it wants, but it must yield a const char[]. Usually, the compiler keeps the source encoding when it can, so the string stored in your program will have the encoding of your input file. There are cases when the compiler will do a conversion, for example if your file is UTF-16 (you can't fit UTF-16 characters in chars).
When you use '\xBA', you write a raw byte value and have chosen the encoding yourself, so the compiler performs no conversion.
When you use '║', the type of '║' is not necessarily char. If the character is not representable as a single byte in the compiler character set, its type will be int. In the case of Visual Studio with the Windows-1252 source file, '║' doesn't fit, so it will be of type int and printed as such by cout <<.
You can force an encoding with prefixes on string literals. u8"" will force UTF-8, u"" UTF-16 and U"" UTF-32. Note that the L"" prefix will give you a wide-character (wchar_t) string, but its encoding is still implementation dependent: wide characters are 2 bytes on Windows (UTF-16, historically UCS-2) but 4 bytes (UTF-32) on Linux.
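A small illustration of those prefixes, assuming C++11 through C++17 (in C++20, u8"" yields char8_t rather than char):

int main() {
    const char*     a = u8"\u2551";  // UTF-8: bytes 0xE2 0x95 0x91, then NUL
    const char16_t* b = u"\u2551";   // UTF-16: code unit 0x2551
    const char32_t* c = U"\u2551";   // UTF-32: code unit 0x00002551
    const wchar_t*  d = L"\u2551";   // wide encoding, implementation dependent
    (void)a; (void)b; (void)c; (void)d;
    return 0;
}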
Printing to the console only depends on the type of the variable. cout << is overloaded for all common types, so what it does depends on the type. cout << will usually feed char strings as-is to the console (actually stdout), and wcout << will usually feed wchar_t strings as-is. Other combinations may involve conversions or interpretations (like feeding an int). UTF-8 strings are char strings, so cout << should always feed them through correctly.
Next, there is the console itself. A console is a totally independent piece of software: you feed it some bytes, and it displays them. It doesn't care one bit about your program. It uses its own encoding and tries to print the bytes you feed it using that encoding.
The default console encoding on Windows is code page 850 (not sure if that is always the case). In your case, your file is CP 1252 and your console is CP 850, which is why you can't print '║' directly (CP 1252 doesn't contain '║'), but you can with a raw character. You can change the console's output encoding on Windows with SetConsoleOutputCP().
On Linux, the default encoding is UTF-8, which is more convenient because it supports the whole Unicode range. Ideone runs on Linux, so it will use UTF-8. Note that there is the added layer of HTTP and HTML, but those also use UTF-8 here.
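Putting this together, a hedged sketch that prints the character as raw UTF-8 bytes, with an optional Windows-only step to switch the console to the UTF-8 code page:

#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

int main() {
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);    // ask the Windows console to interpret output as UTF-8
#endif
    std::cout << "\xE2\x95\x91\n";  // the ║ character as its three UTF-8 bytes
    return 0;
}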

Required to convert a String to UTF8 string

Problem Statement:
I am required to convert a generated string to a UTF-8 string. This generated string has extended ASCII characters, and I am on a Linux system (2.6.32-358.el6.x86_64).
A POC is still in progress so I can only provide small code samples
and complete solution can be posted only once ready.
Why I require UTF-8: I have extended ASCII characters that have to be stored in a string which has to be UTF-8.
How I am proceeding:
Convert generated string to wchar_t string.
Please look at the below sample code
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <iconv.h>

int main(){
    char CharString[] = "Prova";
    iconv_t cd;
    wchar_t WcharString[255];
    size_t size = mbstowcs(WcharString, CharString, strlen(CharString));
    wprintf(L"%ls\n", WcharString);
    wprintf(L"%s\n", WcharString);
    printf("\n%zu\n", size);
}
One question here:
The output is:
Prova?????
s
Why is the size not printed here?
Why does the second wprintf print only one character?
If I print size before both strings, then only 5 is printed and both strings are missing from the console.
Moving on to Second Part:
Now that I will have a wchar_t string I want to convert it to UTF8 string
For this I was surfing through and found iconv will help here.
Question here
These are the methods I found in manual
iconv_t iconv_open(const char *, const char *);
size_t iconv(iconv_t, char **, size_t *, char **, size_t *);
int iconv_close(iconv_t);
Do I need to convert the wchar_t array back to a char array before feeding it to iconv?
Please provide suggestions on the above issues.
For the extended ASCII I am talking about, please see the letters 'i' in the marked snapshot below.
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the string as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L'P', which is the number 0x50 stored in a wchar_t-sized integer. That means that the second byte of that wchar_t in memory is 0 (on your little-endian architecture the low-order byte comes first), so if you treat the string as a string of chars, it is terminated immediately after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.
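A corrected sketch of the fragment, following the three points above (pass the output buffer's capacity in wide characters, use %ls for the wide string, and don't mix printf with wprintf on the same stream):

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void) {
    char CharString[] = "Prova";
    wchar_t WcharString[255];

    /* Pass the output buffer's capacity in wide characters, not strlen(input). */
    size_t size = mbstowcs(WcharString, CharString,
                           sizeof WcharString / sizeof WcharString[0]);

    wprintf(L"%ls\n", WcharString);  /* %ls: the argument is a wide string        */
    wprintf(L"%zu\n", size);         /* stay with wprintf: the stream is now wide */
    return 0;
}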
Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.
Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.
You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.
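For example, a rough sketch of a byte-string-to-byte-string conversion with iconv, assuming the input really is ISO-8859-1 (swap the encoding names for whatever you actually have):

#include <iconv.h>
#include <string>

std::string latin1_to_utf8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  // (to, from)
    if (cd == (iconv_t)-1)
        return std::string();

    std::string out(in.size() * 4, '\0');            // generous worst-case output size
    char*  inbuf   = const_cast<char*>(in.data());
    size_t inleft  = in.size();
    char*  outbuf  = &out[0];
    size_t outleft = out.size();

    if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1) {
        iconv_close(cd);
        return std::string();                        // conversion failed
    }
    iconv_close(cd);
    out.resize(out.size() - outleft);                // trim to the bytes actually written
    return out;
}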
It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself; that is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8.
You don't even need wchar_t; just use uint32_t for your characters.
You will learn a lot if you implement it yourself, and your program will gain speed from not using the mb* or iconv functions.
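A minimal sketch of that idea: encoding a single uint32_t code point into UTF-8 bytes, following the table on the Wikipedia page (no validation of surrogates or out-of-range values here):

#include <cstdint>
#include <string>

std::string encode_utf8(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                     // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                             // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                             // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}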

How to compare/replace non-ASCII chars in array in C++?

I have a large char array which contains Czech diacritical characters (e.g. "á"), encoded in UTF-8. I need to replace them with their ASCII equivalents (e.g. "a"), because the program must work on Windows (the Linux console accepts these chars perfectly).
I am reading the array char by char and writing the content into a string.
Here is the code I am using; this doesn't work:
int array_size = 50000; //size of file array
char * array = new char[array_size]; //array to store file contents
string ascicontent = "";
if ('\u00E1' == array[zacatek]) { //check if char is "á"
    ascicontent += 'a'; //write ordinal "a" into string
}
I even tried replacing '\u00E1' with 'á', but that also doesn't work. I'm guessing the problem is that these characters are longer than ASCII.
How can I declare the non-ASCII char so it can be compared?
Each char is a single byte, however UTF-8 can use multiple bytes to encode a single character. In particular U+00E1 is encoded as two bytes: 0xC3 0xA1. So you can't do what you want with just comparing a single char.
There are multiple ways that you might be able to tackle your problem:
A) First, try googling for "windows console utf-8" and see if that gives anything which might make things just work without having to alter the characters at all. (I don't know if anything can work for you, I've never tried this.)
B) Convert the data to wide characters (wchar_t) using MultiByteToWideChar or mbstowcs and then google how to use wcout or such to output UTF-16 to the console.
C) Use MultiByteToWideChar to convert the data from UTF-8 to UTF-16. Then use WideCharToMultiByte to convert from UTF-16 to the console's code page, relying on the fact that it can automatically "best fit" common characters (such as "á" to "a").
D) If you really only care about a limited set of characters (such as only the accented characters in the Czech code page), then you could possibly write your own lookup table of UTF-8 byte sequences and your desired replacements. You just need to do the comparisons on those multi-byte UTF-8 sequences rather than on individual chars (a sketch of this appears below). Among various tools out there, I've found this page helpful for seeing how characters are encoded in various ways.
Which of these make the most sense for your program depends on various factors, such as how easy or hard it might be to keep the Windows-specific pieces from conflicting with the Linux-specific or cross-platform parts.
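A minimal sketch of option D, with an illustrative (and deliberately incomplete) table of two-byte UTF-8 sequences and their ASCII replacements:

#include <cstddef>
#include <string>

std::string strip_czech_diacritics(const std::string& in) {
    // Each entry: a two-byte UTF-8 sequence and its ASCII replacement (sample subset only).
    static const struct { const char* utf8; char ascii; } table[] = {
        { "\xC3\xA1", 'a' },   // á
        { "\xC3\xA9", 'e' },   // é
        { "\xC5\xA1", 's' },   // š
        { "\xC5\xBE", 'z' },   // ž
    };

    std::string out;
    for (std::size_t i = 0; i < in.size(); ) {
        bool replaced = false;
        if (i + 1 < in.size()) {
            for (const auto& e : table) {
                if (in[i] == e.utf8[0] && in[i + 1] == e.utf8[1]) {
                    out += e.ascii;
                    i += 2;
                    replaced = true;
                    break;
                }
            }
        }
        if (!replaced) {
            out += in[i];
            ++i;
        }
    }
    return out;
}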
char in C is not Unicode; it is really just a byte. It only gets converted to a glyph by the terminal console you happen to use. On some Linux distributions (like Debian) the terminal defaults to UTF-8, so if your program outputs a sequence of bytes encoded in UTF-8, your terminal will display the proper glyph. If you know that the array is UTF-8 encoded, you must check for the proper byte sequence.
Edit: take a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Please take a look at this link http://en.wikipedia.org/wiki/Wide_character.
And I believe this code might help you:
std::wstring str(L"cccccááddddddd");
std::replace( str.begin(), str.end(), L'á', L'a');

dmDeviceName is just 'c'

I'm trying to get the names of each of my monitors using DEVMODE.dmDeviceName:
dmDeviceName
A zero-terminated character array that specifies the "friendly" name of the printer or display; for example, "PCL/HP LaserJet" in the case of PCL/HP LaserJet. This string is unique among device drivers. Note that this name may be truncated to fit in the dmDeviceName array.
I'm using the following code:
log.printf("Device Name: %s",currDevMode.dmDeviceName);
But for every monitor, the name is printed as just c. All other information from DEVMODE seems to print ok. What's going wrong?
Most likely you are using the Unicode version of the structure and thus are passing wide characters to printf. Since you use a format string that implies char data, there is a mismatch.
The UTF-16 encoding results in every other byte being 0 for characters in the ASCII range and so printf thinks that the second byte of the first two byte character is actually a null-terminator.
This is the sort of problem that you get with printf which of course has no type-safety. Since you are using C++ it's probably worth switching to iostream based I/O.
However, if you want to use ANSI text, as you indicate in a comment, then the simplest solution is to use the ANSI DEVMODEA version of the struct and the corresponding A versions of the API functions, e.g. EnumDisplaySettingsA, DeviceCapabilitiesA.
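A hedged sketch of both options against the Win32 API (either print the wide name with a matching format specifier, or use the A structure and function explicitly):

#include <windows.h>
#include <cstdio>

// Matching format for the Unicode struct: %ls expects a wide string.
void print_name_wide(const DEVMODEW& dm) {
    wprintf(L"Device Name: %ls\n", dm.dmDeviceName);
}

// Explicit ANSI variant: DEVMODEA holds a char[] name, so %s matches.
void print_name_ansi() {
    DEVMODEA dm = {};
    dm.dmSize = sizeof(dm);
    if (EnumDisplaySettingsA(nullptr, ENUM_CURRENT_SETTINGS, &dm))
        printf("Device Name: %s\n", dm.dmDeviceName);
}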
dmDeviceName is a TCHAR[], so if you're compiling for Unicode, the first wide character will be interpreted as a 'c' followed by a zero terminator.
You will need to convert it to ASCII or use Unicode-capable printing routines.

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg, all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH api function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, and ASCII is a character encoding. ASCII was developed by ANSI, but the two are not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily: either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1, in which case you must use UTF-8 decoding logic.
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
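A sketch of that validation, assuming the data is NUL-terminated since no length is available (it checks only the lead/continuation byte structure, not overlong forms or surrogates):

bool looks_like_utf8(const unsigned char* s) {
    while (*s) {
        unsigned char b = *s;
        int follow;
        if (b < 0x80)                follow = 0;  // plain ASCII byte
        else if ((b & 0xE0) == 0xC0) follow = 1;  // 110xxxxx lead byte
        else if ((b & 0xF0) == 0xE0) follow = 2;  // 1110xxxx lead byte
        else if ((b & 0xF8) == 0xF0) follow = 3;  // 11110xxx lead byte
        else return false;                        // not a valid UTF-8 lead byte
        ++s;
        for (int j = 0; j < follow; ++j, ++s)
            if ((*s & 0xC0) != 0x80) return false;  // missing or bad continuation byte
    }
    return true;  // every sequence was valid UTF-8 (pure ASCII passes trivially)
}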
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.