Standard main and Unicode: managing console input and output in C++

The standard recognizes the int main() and int main(int, char*[]) entry points, and there is no Unicode version of the function, except of course where an implementation provides one, such as wmain on Windows.
I have code which strictly uses 'wide' strings, i.e. wchar_t and std::wstring; not a single line of code uses the char type.
All of the program output is printed with std::wcout.
Now, I do understand that when the program starts its stream will be set to a 'wide' stream, and if std::cout appears somewhere in the code instead of std::wcout, the output may be wrong.
My first point of confusion: if a user runs several command-line tools in the same terminal (e.g. cmd.exe), where one program uses wide strings and another uses narrow (ASCII) strings, what is the behavior of the terminal?
Does it handle both streams just fine? Are wide and narrow streams a property of the program rather than of the terminal?
Secondly, my main issue is that my 'Unicode' program expects user input, which will most likely arrive as char, because int main(int, char*[]) accepts char strings, yet my code uses wide strings.
What are my options here? Do I need to convert the non-Unicode input strings to wide strings?
And what happens if I use int wmain(int, wchar_t*[]) on Windows? How do I know whether the terminal will interpret user input as wide characters and forward them to my entry point?
What if the user copy/pastes an ASCII string into the console as input to my program?
So my questions are about console program input and output with char and wchar_t, in both portable and non-portable code; the wide main function missing from the standard is what confuses me most, in combination with the behavior of different terminals.
Edit
To be clear:
I want to use the standard main and wide strings.
My code uses wide strings.
My program expects user input and writes output, so how do I best handle this?
Is wmain the only way to get Unicode input, and how do terminals behave with Unicode vs. non-Unicode programs in the same console session?
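Not an answer from this thread, just a sketch of one commonly suggested option: keep the standard int main() and, on Windows, ignore the narrow argv entirely and ask the OS for the wide command line. On other platforms you would instead convert the char arguments with the current locale (e.g. mbstowcs). The names and structure below are mine, not the asker's.

#include <windows.h>
#include <shellapi.h>   // CommandLineToArgvW; link with Shell32.lib
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Windows-specific sketch: ignore the narrow argv and fetch the wide command line.
    int argc = 0;
    LPWSTR* wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (wargv == nullptr) return 1;

    std::vector<std::wstring> args(wargv, wargv + argc);
    LocalFree(wargv);               // the returned array is a single LocalAlloc block

    for (const std::wstring& a : args)
        std::wcout << a << L'\n';
    return 0;
}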

Required to convert a String to UTF8 string

Problem Statement:
I need to convert a generated string to a UTF-8 string. This generated string contains extended ASCII characters, and I am on a Linux system (2.6.32-358.el6.x86_64).
A POC is still in progress, so I can only provide small code samples;
the complete solution can be posted only once it is ready.
Why I need UTF-8: I have extended ASCII characters that must be stored in a string which has to be UTF-8.
How I am proceeding:
Convert the generated string to a wchar_t string.
Please look at the sample code below:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <iconv.h>

int main() {
    char CharString[] = "Prova";
    iconv_t cd;
    wchar_t WcharString[255];
    size_t size = mbstowcs(WcharString, CharString, strlen(CharString));
    wprintf(L"%ls\n", WcharString);
    wprintf(L"%s\n", WcharString);
    printf("\n%zu\n", size);
}
One question here:
Output is
Prova?????
s
Why is the size not printed here?
Why does the second printf print only one character?
If I print the size before both strings, then only 5 is printed and both strings are missing from the console.
Moving on to the second part:
Now that I have a wchar_t string, I want to convert it to a UTF-8 string.
Searching around, I found that iconv will help here.
Question here:
These are the functions I found in the manual:
iconv_t iconv_open(const char *, const char *);
size_t iconv(iconv_t, char **, size_t *, char **, size_t *);
int iconv_close(iconv_t);
Do I need to convert the wchar_t array back to a char array before feeding it to iconv?
Please provide suggestions on the above issues.
The extended ASCII characters I am talking about are the accented 'i' letters marked in the snapshot below (image not reproduced here).
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the argument as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L'P', which is the number 0x50 in some integer size. That means that the second byte of that wchar_t (counting from the end, since you have a little-endian architecture) is 0, so if you treat the string as a string of chars, it will be terminated immediately after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.
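Putting those three points together, a corrected sketch of the sample (my adaptation, not code from the answer) might look like this:

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main() {
    char CharString[] = "Prova";
    wchar_t WcharString[255];

    /* Pass the size of the destination buffer, not the length of the input. */
    size_t size = mbstowcs(WcharString, CharString,
                           sizeof WcharString / sizeof WcharString[0]);

    wprintf(L"%ls\n", WcharString);   /* %ls: the argument is a wide string */
    wprintf(L"size = %zu\n", size);   /* stay with wide output: the stream is now wide-oriented */
    return 0;
}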
Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.
Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.
You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.
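For illustration only, a minimal iconv sketch along those lines; the encodings and the sample byte 0xEC (an ì in ISO-8859-1) are assumptions, not taken from the question:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main() {
    char in[] = "Prova \xEC";          /* assumed ISO-8859-1 input; 0xEC is i-grave */
    char out[64];

    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  /* to-encoding first, then from-encoding */
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    iconv_close(cd);

    *outp = '\0';
    printf("%s\n", out);               /* prints correctly on a UTF-8 terminal */
    return 0;
}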
It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself; that is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8.
You don't even need wchar_t, just use uint32_t for your characters.
You will learn a lot if you implement it yourself, and your program will gain speed from not using the mb* or iconv functions.
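As a sketch of what that looks like (the function name, structure, and test code point are mine, and range/surrogate validation is omitted to keep it short):

#include <cstdint>
#include <cstdio>
#include <string>

// Encode a single code point (assumed <= 0x10FFFF) as UTF-8, straight from the
// definition: 1 to 4 bytes, with the high bits of the first byte giving the length.
std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    std::printf("%s\n", utf8_encode(0x00EC).c_str());  // U+00EC, i with grave accent
    return 0;
}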

c++ wchar_t array and char array in programming for win32 console

I am writing a program that outputs Chinese characters, using Dev C++.
I've added
-finput-charset=big5
-fexec-charset=big5
to the compiler parameters. I also set the code page of the console to 950 (Traditional Chinese).
It works perfectly with a simple cout like this:
cout << "中文字";
but when it comes to a character array, it goes wrong as expected:
char chin[] = "中文字";
cout << chin[0];              // outputs nothing
cout << chin[0] << chin[1];   // outputs the first Chinese character, since one Chinese character occupies 2 bytes
So I decided to use wchar_t instead, and I have to use wcout with wchar_t or else a number is shown.
However, wcout shows nothing in the console. All of the following show nothing:
wcout << L"中文字";
wchar_t chin2[] = L"中文字";
wcout << chin2[0];
What did I miss in using wchar_t to output Chinese (or other East Asian) characters? I really don't want to write 2 array elements to show one single Chinese character.
There are subtle problems going on here.
The C++ compiler does not understand Big5 encoding. When you create a source code file and display it, you may see your familiar Chinese characters but the compiler sees a string of bytes. Big5 is a double byte charset so each input character will be represented by 2 bytes inside the compiler.
When that string of bytes is fed to a suitable output device the Chinese characters appear again. Code page 950 is compatible with Big5 so you see the "right" thing. But then you try to build on this and confusion is the result. Your second code sample uses L"" strings, but I expect those strings will contain half a character in each short.
The only "safe" character set you can use is Unicode. Windows internals are historically UCS-2 (each character is a single 16-bit unit) but are now theoretically UTF-16 (characters are 16-bit units, but may form multi-unit surrogate sequences). Not all existing software and older APIs fully support UTF-16 (or need to). Windows has very limited support for UTF-8 or other encodings. Everything gets converted into Unicode, so it is best to just leave it that way.
In practice, you should build your C++ code with Unicode settings, effectively for UCS-2, and exercise caution if you need characters that would require surrogate sequences. You should ensure that any source code you write and any input text files are identified with whatever encoding they need, but are translated into Unicode internally. Leave your console at its default Unicode encoding, and everything will just work.
It is almost impossible to sensibly use Big5 as an internal encoding in a Windows program. Best not to try.
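As an illustration of "keep it Unicode internally", here is a minimal sketch (mine, not part of the answer) that writes UTF-16 text straight to the Windows console with WriteConsoleW, bypassing the console code page; it still assumes the compiler turns the wide literal into correct UTF-16, so the source-encoding options remain necessary:

#include <windows.h>

int main() {
    // Assumes the compiler stores the literal below as UTF-16 (depends on source encoding settings).
    const wchar_t text[] = L"中文字\n";
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    // WriteConsoleW takes UTF-16 directly, so no code-page conversion is involved.
    WriteConsoleW(out, text, static_cast<DWORD>(sizeof(text) / sizeof(text[0]) - 1), &written, nullptr);
    return 0;
}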

dmDeviceName is just 'c'

I'm trying to get the names of each of my monitors using DEVMODE.dmDeviceName:
dmDeviceName
A zero-terminated character array that specifies the "friendly" name of the printer or display; for example, "PCL/HP LaserJet" in the case of PCL/HP LaserJet. This string is unique among device drivers. Note that this name may be truncated to fit in the dmDeviceName array.
I'm using the following code:
log.printf("Device Name: %s",currDevMode.dmDeviceName);
But for every monitor, the name is printed as just c. All other information from DEVMODE seems to print ok. What's going wrong?
Most likely you are using the Unicode version of the structure and thus are passing wide characters to printf. Since you use a format string that implies char data there is a mis-match.
The UTF-16 encoding results in every other byte being 0 for characters in the ASCII range, so printf treats the second byte of the first two-byte character as the null terminator.
This is the sort of problem that you get with printf which of course has no type-safety. Since you are using C++ it's probably worth switching to iostream based I/O.
However, if you want to use ANSI text, as you indicate in a comment, then the simplest solution is to use the ANSI DEVMODEA version of the struct and the corresponding A versions of the API functions, e.g. EnumDisplaySettingsA, DeviceCapabilitiesA.
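For example, a sketch of the ANSI route (a plain printf stands in for the questioner's log.printf, which is not shown in the question):

#include <windows.h>
#include <cstdio>

int main() {
    DEVMODEA dm = {};           // ANSI variant: dmDeviceName is a char array
    dm.dmSize = sizeof(dm);
    // Query the current mode of the primary display (nullptr = default device).
    if (EnumDisplaySettingsA(nullptr, ENUM_CURRENT_SETTINGS, &dm)) {
        std::printf("Device Name: %s\n", dm.dmDeviceName);   // char data matches %s
    }
    return 0;
}

Alternatively, keep the Unicode structure and tell printf that the argument is wide: printf("Device Name: %ls", currDevMode.dmDeviceName);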
dmDeviceName is TCHAR[] so if you're compiling for unicode, the first wide character will be interpreted as a 'c' followed by a zero terminator.
You will need to convert it to ascii or use unicode capable printing routines.

Polish chars in std::string

I have a problem. I'm writing an app in Polish (with, of course, Polish characters) for Linux and I get 80 warnings when compiling. These are just "warning: multi-character character constant" and "warning: case label value exceeds maximum value for type". I'm using std::string.
What do I replace the std::string class with?
Please help.
Thanks in advance.
Regards.
std::string does not define a particular encoding. You can thus store any sequence of bytes in it. There are subtleties to be aware of:
.c_str() will return a null-terminated buffer. If your character set allows null bytes, don't pass this string to functions that take a const char* parameter without a length, or your data will be truncated.
A char does not represent a character, but a byte. IMHO, this is the most problematic nomenclature in computing history. Note that a wchar_t does not necessarily hold a full character either, e.g. because of UTF-16 surrogate pairs or normalization.
.size() and .length() will return the number of bytes, not the number of characters.
[edit] The warnings about case labels are related to issue (2). You are using a switch statement on multi-byte characters with type char, which cannot hold more than one byte. [/edit]
Therefore, you can use std::string in your application, provided that you respect these three rules. There are subtleties involving the STL, such as std::find(), that are consequences of this. You need to use more clever string matching algorithms to properly support Unicode because of normalization forms.
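To make the last point concrete, here is a small sketch that counts code points rather than bytes; it assumes the std::string holds valid UTF-8, and the helper name and sample word are mine:

#include <cstddef>
#include <iostream>
#include <string>

// Count code points in a UTF-8 string: continuation bytes have the bit
// pattern 10xxxxxx, so every byte that is NOT a continuation byte starts
// a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}

int main() {
    std::string word = "za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87";  // "zażółć" spelled out as UTF-8 bytes
    std::cout << word.size() << " bytes, " << utf8_length(word) << " code points\n";
    return 0;
}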
However, when writing applications in any language that uses non-ASCII characters (if you're paranoid, consider this anything outside [0, 128)), you need to be aware of encodings in different sources of textual data.
The source-file encoding might not be specified, and might be subject to change using compiler options. Any string literal will be subject to this rule. I guess this is why you are getting warnings.
You will get a variety of character encodings from external sources (files, user input, etc.). When that source specifies the encoding or you can get it from some external source (i.e. asking the user that imports the data), then this is easier. A lot of (newer) internet protocols impose ASCII or UTF-8 unless otherwise specified.
These two issues are not addressed by any particular string class. You just need to convert any external source to your internal encoding. I suggest UTF-8 all the time, but especially so on Linux because of native support. I strongly recommend placing your string literals in a message file to forget about issue (1) and only deal with issue (2).
I don't suggest using std::wstring on Linux because 100% of native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to/from std::wstring non-stop and eventually get something wrong, on top of making everything slow(er).
If you were writing an application for Windows, I'd suggest exactly the opposite because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to/from const wchar_t*.
Some "portable" libraries/languages use different representations based on the platform. They use UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading about that trick in the Python reference implementation, but the article was quite old, so I'm not sure if that is still true.
On Linux you should use a multibyte string class provided by whatever framework you use.
I'd recommend Glib::ustring, from the glibmm framework, which stores strings in UTF-8 encoding.
If your source files are in UTF-8, then using a multibyte string literal in code is as easy as:
Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");
But you cannot build a switch/case statement on multibyte characters using char. I'd recommend using a series of ifs. You can use glibmm's gunichar, but it's not very readable (you can get the correct Unicode values for the characters from the table in the Wikipedia article on the Polish alphabet):
#include <glibmm.h>
#include <iostream>
using namespace std;

int main()
{
    Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");
    int small_polish_vowels_with_diacritics_count = 0;
    for ( Glib::ustring::size_type i = 0; i < alphabet.size(); i++ ) {
        switch (alphabet[i]) {
            case 0x0105: // ą
            case 0x0119: // ę
            case 0x00f3: // ó
                small_polish_vowels_with_diacritics_count++;
                break;
            default:
                break;
        }
    }
    cout << "There are " << small_polish_vowels_with_diacritics_count
         << " small Polish vowels with diacritics in this string.\n";
    return 0;
}
You can compile this using:
g++ `pkg-config --cflags --libs glibmm-2.4` progname.cc -o progname
std::string is for ASCII strings. Since your polish strings don't fit in, you should use std::wstring.

How to use Unicode in C++?

Assuming a very simple program that:
asks for a name,
stores the name in a variable,
displays the content of the variable on the screen.
It's so simple that it is the first thing one learns.
But my problem is that I don't know how to do the same thing if I enter the name using Japanese characters.
So, if you know how to do this in C++, please show me an example (that I can compile and test).
Thanks.
user362981: Thanks for your help. I compiled the code that you wrote without problems, but when the console window appears I cannot enter any Japanese characters into it (using an IME). Also, if
I change a word in your code ("hello") to one that contains Japanese characters, it will not display them either.
Svisstack: Also thanks for your help. But when I compile your code I get the following errors:
warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'
You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t, do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can with an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:
With the wchar_t wide character type, ANSI/ISO C provides for
inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide
character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.
and that
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently,
programs that need to be portable across any C or C++ compiler should not use wchar_t
for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide
characters, which may be Unicode characters in some compilers.
So, it's implementation-defined. Here are two implementations: on Linux, wchar_t is 4 bytes wide, and represents text in the UTF-32 encoding (regardless of the current locale), either BE or LE depending on your system, whichever is native. Windows, however, has a 2-byte-wide wchar_t, and represents UTF-16 code units with it. Completely different.
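A trivial sketch to see the difference on your own machine (the values in the comment are the typical ones, not guarantees):

#include <cstdio>

int main() {
    // Typically prints 4 on Linux/glibc (UTF-32 code points) and 2 on Windows
    // (UTF-16 code units); the standard fixes neither the size nor the encoding.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}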
A better path: Learn about locales, as you'll need to know that. For example, because I have my environment setup to use UTF-8 (Unicode), the following program will use Unicode:
#include <clocale>
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_ALL, "");
    std::cout << "What's your name? ";
    std::string name;
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}
...
$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8
But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.
Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)
On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8 in ISO-8859-2). Likewise, if you output a UTF-8 "è" (the two bytes 0xC3 0xA8), my ISO-8859-2 terminal will render those two bytes as two unrelated characters. This barfing of incorrect characters has been called mojibake.
Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)
Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.
In the end, it's not all that complicated really: Know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:
#include <iostream>
#include <string>

int main() {
    std::wstring name;
    std::wcout << L"Enter your name: ";
    std::wcin >> name;
    std::wcout << L"Hello, " << name << std::endl;
}
There are other ways, but this is sort of the "minimal change" answer.
Pre-requisite: http://www.joelonsoftware.com/articles/Unicode.html
The above article is a must-read which explains what Unicode is, but a few lingering questions remain. Yes, UNICODE has a unique code point for every character in every language, and furthermore these can be encoded and stored in memory potentially differently from the actual code point value. This way we can save memory, for example by using the UTF-8 encoding, which is great if the only language supported is English, since the memory representation is then essentially the same as ASCII (provided, of course, that we know the encoding). In theory, if we know the encoding, we can store these longer UNICODE characters however we like and read them back. But the real world is a little different.
How do you store a UNICODE character/string in a C++ program? Which encoding do you use? The answer is that you don't use any encoding; you directly store the UNICODE code points in a Unicode character string, just like you store ASCII characters in an ASCII string. The question is what character size you should use, since UNICODE characters have no fixed size. The simple answer is that you choose a character size wide enough to hold the highest character code point (language) that you want to support.
The theory that a UNICODE character can take 2 bytes or more still holds true, and this can create some confusion. Shouldn't we be storing code points in 3 or 4 bytes then, which is what would really represent all Unicode characters? Why does Visual C++ store Unicode in wchar_t then, which is only 2 bytes, clearly not enough to store every UNICODE code point?
The reason we store a UNICODE character code point in 2 bytes in Visual C++ is actually exactly the same reason we were storing an ASCII (= English) character in one byte. At that time, we were thinking only of English, so one byte was enough. Now we are thinking of most international languages out there, but not all, so we use 2 bytes, which is enough. Yes, it's true that this representation will not allow us to represent those code points which take 3 bytes or more, but we don't care about those yet because those folks haven't even bought a computer yet. Yes, we are not using 3 or 4 bytes because we are still stingy with memory: why store the extra 0 (zero) byte with every character when we are never going to use it (that language)? Again, this is exactly the same reason why ASCII stored each character in one byte: why store a character in 2 or more bytes when English can be represented in one byte, with room to spare for those extra special characters?
In theory 2 bytes are not enough to represent every Unicode code point, but it is enough to hold anything that we may ever care about for now. A true UNICODE string representation could store each character in 4 bytes, but we just don't care about those languages.
Imagine 1000 years from now, when we find friendly aliens in abundance and want to communicate with them, incorporating their countless languages. A single Unicode character size will grow further, perhaps to 8 bytes, to accommodate all their code points. It doesn't mean we should start using 8 bytes for each Unicode character now. Memory is a limited resource; we allocate what we need.
Can I handle UNICODE string as C Style string?
In C++, an ASCII string can still be handled the C way, and that's fairly common: grab it by its char* pointer and apply C functions to it. However, applying current C-style string functions to a UNICODE string will not make any sense, because it can contain single NULL bytes, which terminate a C string.
A UNICODE string is no longer a plain buffer of text; well, it is, but it is now more complicated than a stream of single-byte characters terminated by a NULL byte. This buffer can be handled through its pointer even in C, but that requires UNICODE-compatible calls, or a C library that can read and write those strings and perform operations on them.
This is made easier in C++ with a specialized class that represents a UNICODE string. This class handles the complexity of the Unicode string buffer and provides an easy interface. It also decides whether each character of the Unicode string is 2 bytes or more; these are implementation details. Today it may use wchar_t (2 bytes), but tomorrow it may use 4 bytes per character to support more (less known) languages. This is why it is always better to use TCHAR than a fixed size: it maps to the right size when the implementation changes.
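As an illustration of that TCHAR point, a minimal Windows-only sketch (identifiers per <tchar.h>; the rest is mine):

#include <windows.h>
#include <tchar.h>

int main() {
    // TCHAR expands to wchar_t when UNICODE/_UNICODE is defined and to char
    // otherwise; _T() adds or drops the L prefix on literals to match, and
    // _tcslen maps to wcslen or strlen accordingly.
    TCHAR greeting[] = _T("Hello");
    size_t len = _tcslen(greeting);
    return len == 5 ? 0 : 1;
}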
How do I index a UNICODE string?
It is also worth noting, particularly for C-style string handling, that indexes are used to traverse a string or to find a substring within it. In an ASCII string this index corresponds directly to the position of an item in that string, but it has no such meaning in a UNICODE string and should be avoided.
What happens to the string terminating NULL byte?
Are UNICODE strings still terminated by a NULL byte? Is a single NULL byte enough to terminate the string? This is an implementation question, but a NULL character is still one Unicode code point and, like every other code point, it must be of the same size as any other (especially when no encoding is applied). So the NULL character must be two bytes as well if the Unicode string implementation is based on wchar_t. All UNICODE code points are represented by the same size, irrespective of whether it is the NULL character or any other.
Does Visual C++ Debugger shows UNICODE text?
Yes, if the text buffer is of type LPWSTR or any other type that supports UNICODE, Visual Studio 2005 and up support displaying international text in the debugger watch window (provided the fonts and language packs are installed, of course).
Summary:
C++ doesn't use any encoding to store Unicode characters; it directly stores the UNICODE code point for each character in a string. It must pick a character size large enough to hold the largest character of the desirable languages (loosely speaking), and that character size will be fixed and used for all characters in the string.
Right now, 2 bytes are sufficient to represent most languages that we care about, which is why 2 bytes are used per code point. In the future, if a new friendly space colony is discovered that we want to communicate with, we will have to assign new Unicode code points to their language and use a larger character size to store those strings.
You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for unicode, so you'll be better off in the long run looking into something like ICU.
#include <stdio.h>
#include <wchar.h>

int main()
{
    wchar_t name[256];
    wprintf(L"Type a name: ");
    wscanf(L"%255ls", name);                 /* %ls reads into a wchar_t array; 255 limits the length */
    wprintf(L"Typed name is: %ls\n", name);  /* %ls: the argument is a wide string */
    return 0;
}
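Following up on the ICU suggestion in the last answer, a small hedged sketch (the sample text, variable names, and build line are mine; it assumes a UTF-8 source file and terminal):

#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    // Interpret a UTF-8 byte string (assumed encoding) as Unicode text.
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8("Grüße, 佐藤");

    std::cout << "UTF-16 code units: " << ustr.length() << "\n";
    std::cout << "Code points:       " << ustr.countChar32() << "\n";

    std::string back;
    ustr.toUTF8String(back);          // convert back to UTF-8 for byte-oriented output
    std::cout << back << "\n";
    return 0;
}

Build with something like: g++ icu_demo.cc $(pkg-config --cflags --libs icu-uc) -o icu_demo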