C++ with wxWidgets, Unicode vs. ASCII, what's the difference? - c++

I am using Code::Blocks 10.05 rev 0 with gcc 4.5.2 (Linux, Unicode, 64-bit) and wxWidgets 2.8.12.0-0.
I have a simple problem:
#define _TT(x) wxT(x)
string file_procstatus;
file_procstatus.assign("/PATH/TO/FILE");
printf("%s",file_procstatus.c_str());
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
printf outputs "/PATH/TO/FILE" as expected, while wxLogVerbose prints garbage. To turn the std::string into a wxString I have to do the following:
wxString buf;
buf = wxString::From8BitData(file_procstatus.c_str());
Does anybody have an idea what might be wrong? Why do I need to convert from 8-bit data?

This has to do with how the character data is stored in memory. A plain "string" literal produces a string of type char, here holding ASCII characters, whereas I assume the _TT macro expands to L"string", which creates a string of type wchar_t using a Unicode encoding (UTF-32 on Linux, I believe).
The printf function expects a char string, whereas I assume wxLogVerbose expects a wchar_t string, so handing it the result of std::string::c_str() makes it read narrow bytes as if they were wide characters. This is where the need for conversion comes from: ASCII uses one byte per character (8-bit data), but wchar_t strings use several bytes per character, so the problem comes down to character encoding.
If you don't want to call this conversion function, do something like the following:
wstring file_procstatus = wxT("/PATH/TO/FILE");
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
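To make the size difference concrete, here is a minimal sketch in standard C++ without wxWidgets. The helper name widen_ascii is hypothetical, and the element-by-element copy is only valid for pure 7-bit ASCII input; anything else needs a real conversion such as the From8BitData call from the question.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widens a pure-ASCII narrow string by copying each
// char into its own wchar_t. Not a general encoding conversion.
std::wstring widen_ascii(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (unsigned char c : narrow) {
        assert(c < 0x80 && "widen_ascii only handles 7-bit ASCII");
        wide.push_back(static_cast<wchar_t>(c));
    }
    return wide;
}
```

The result has the same number of elements as the input, but each element is sizeof(wchar_t) bytes wide instead of one, which is why a function expecting one kind of string cannot simply be handed the other.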

The following article gives a good explanation of the differences between the Unicode and ASCII character sets, how they are stored in memory, and how string functions work with them.
http://allaboutcharactersets.blogspot.in/

Related

Storing Unicode in C++ characters

A char in C++ occupies 1 byte, but most Unicode characters require 2 bytes.
Does this mean that Unicode can't be stored in characters in C++?
No, char isn't the only character type. On Windows there is wchar_t (WCHAR), and more generally a short is usually 2 bytes as well, but what matters more is the encoding you choose and how you implement and use it. For example:
#if !defined(_NATIVE_WCHAR_T_DEFINED)
typedef unsigned short WCHAR;
#else
typedef wchar_t WCHAR;
#endif
const WCHAR* strDemo = L"consider the L";
You will need to dig further on the web. Also search for "multibyte strings", the term for encodings in which one character can span several bytes.
For example, the more general, old-school, cross-platform BSD way:
https://www.freebsd.org/cgi/man.cgi?query=multibyte&apropos=0&sektion=0&format=html
Do not miss this one either: http://utf8everywhere.org
Since you asked the question in the first place, you should probably know about Boost too.
C and C++ also provide the wide character type wchar_t. It is 16 bits wide on Windows, where it holds UTF-16, but typically 32 bits on Linux.
It is often used through a macro or typedef such as WCHAR or TCHAR.
The L prefix gives you a wide character literal in source code:
wchar_t c = L'a';
and the same works for wide character strings:
wchar_t s[256] = L"utf-16";
First of all, you have to be aware that there is something called encoding.
There are multiple ways to represent non-ASCII characters.
The most popular encoding nowadays is UTF-8, which represents each non-ASCII character as a sequence of 2 to 4 bytes. In this encoding you CAN'T store such a character in a single char variable.
There are other encodings in which a small subset of non-ASCII characters is represented as a single byte, for example ISO 8859-2. The encoding is defined by the locale, and Windows prefers such encodings, which is why Java Rookie's answer had a chance to work for you.
Other systems usually use UTF-8 for std::string, so a single character can be represented by multiple bytes.
Another approach is to use wchar_t, wstring, wcout and wcin; note that there are still some issues with that.
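As a rough illustration of "one character, several bytes", the sketch below counts code points in a std::string that is assumed to hold valid UTF-8. The function name count_code_points is made up for this example, and it does no validation: it simply skips continuation bytes, which always have the bit pattern 10xxxxxx.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// ASSUMED to be valid UTF-8. Every byte that is not a continuation byte
// (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    assert(count <= utf8.size());  // each code point needs at least one byte
    return count;
}
```

For the two bytes "\xD1\x84" (the UTF-8 form of U+0444) the std::string reports a size of 2, while this function reports a single code point.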
To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444 and so in C++ you could write it '\u0444' or '\U00000444'. Also if the source code encoding supports this character then you can just write it literally in your source code.
// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character
// encoding supports this character
Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, the terminal emulator is using an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do the following:
#include <iostream>
int main() {
std::cout << "Hello, ф or \u0444!\n";
}
you can also use wchar_t

Required to convert a String to UTF8 string

Problem Statement:
I need to convert a generated string to a UTF-8 string. The generated string contains extended ASCII characters, and I am on a Linux system (2.6.32-358.el6.x86_64).
A proof of concept is still in progress, so I can only provide small code samples; a complete solution can be posted once it is ready.
Why I need UTF-8: I have extended ASCII characters to store in a string that has to be UTF-8.
How I am proceeding:
Convert the generated string to a wchar_t string.
Please look at the sample code below:
int main(){
char CharString[] = "Prova";
iconv_t cd;
wchar_t WcharString[255];
size_t size= mbstowcs(WcharString, CharString, strlen(CharString));
wprintf(L"%ls\n", WcharString);
wprintf(L"%s\n", WcharString);
printf("\n%zu\n",size);
}
One question here:
Output is
Prova?????
s
Why is the size not printed here?
Why does the second wprintf print only one character?
If I print size before both strings, then only 5 is printed and both strings are missing from the console.
Moving on to the second part:
Now that I have a wchar_t string, I want to convert it to a UTF-8 string.
While searching, I found that iconv should help here.
My question:
These are the functions I found in the manual:
iconv_t iconv_open(const char *, const char *);
size_t iconv(iconv_t, char **, size_t *, char **, size_t *);
int iconv_close(iconv_t);
Do I need to convert the wchar_t array back to a char array before feeding it to iconv?
Please provide suggestions on the above issues.
Extended ascii I am talking about please see letters i in the marked snapshot below
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
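A minimal sketch of the corrected call, wrapped in a hypothetical helper named to_wide: the third argument to mbstowcs is the capacity of the output buffer in wchar_t elements, so there is always room for the terminator. The input is interpreted in the current C locale, so for non-ASCII input you would presumably call setlocale(LC_ALL, "") first.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte string (in the encoding of the
// current C locale) to a wide string.
std::wstring to_wide(const char* mb) {
    assert(mb != nullptr);
    // A multibyte string never produces more wide characters than it has
    // bytes, so strlen + 1 elements always leaves room for the terminator.
    std::vector<wchar_t> buf(std::strlen(mb) + 1);
    // Third argument: capacity of the OUTPUT buffer, in wchar_t elements.
    std::size_t written = std::mbstowcs(buf.data(), mb, buf.size());
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for this locale");
    }
    // 'written' excludes the terminator.
    return std::wstring(buf.data(), written);
}
```

The return value of mbstowcs does not count the terminator, which is why the question saw 5 for "Prova".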
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the string as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L'P', which is the number 0x50 in some integer size. That means that the second byte of the wchar_t (in memory order, since you have a little-endian architecture) is a 0, so if you treat the string as a string of characters, it will be terminated immediately after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.
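Stream orientation can be observed with the standard fwide function, which reports the orientation without changing it when called with a mode of 0. The sketch below is only an illustration: it uses a temporary file rather than stdout, and the function name orientation_after_first_write is hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>

// Hypothetical demo: opens a fresh temporary stream, performs one write, and
// returns what fwide(stream, 0) reports afterwards:
//   > 0  wide-oriented, < 0  byte-oriented, 0  unoriented.
// Returns 0 if the temporary file could not be created.
int orientation_after_first_write(bool wide_first) {
    std::FILE* stream = std::tmpfile();
    if (stream == nullptr) {
        return 0;
    }
    assert(std::fwide(stream, 0) == 0);  // a new stream has no orientation
    if (wide_first) {
        std::fwprintf(stream, L"%ls\n", L"Prova");  // fixes wide orientation
    } else {
        std::fprintf(stream, "%s\n", "Prova");      // fixes byte orientation
    }
    int orientation = std::fwide(stream, 0);
    std::fclose(stream);
    return orientation;
}
```

Once the first call has fixed the orientation, sticking to one family of functions per stream (all printf or all wprintf) avoids the problem from the question.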
Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.
Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.
You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.
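As a sketch of converting directly between byte encodings with iconv, without going through wchar_t: the helper name latin1_to_utf8 is made up, and ISO-8859-1 as the input encoding is purely an assumption for illustration, since the question does not say which encoding the generated string really uses. The code follows the glibc prototype, where the input pointer is char**.

```cpp
#include <cassert>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper. ASSUMPTION: the input bytes are ISO-8859-1.
std::string latin1_to_utf8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  // (to, from)
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("iconv_open failed");
    }
    // Each ISO-8859-1 byte becomes at most 2 bytes of UTF-8.
    std::string out(in.size() * 2 + 1, '\0');
    char* src = const_cast<char*>(in.data());  // iconv does not modify input
    std::size_t src_left = in.size();
    char* dst = &out[0];
    std::size_t dst_left = out.size();
    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("iconv conversion failed");
    }
    assert(src_left == 0);                // all input was consumed
    out.resize(out.size() - dst_left);    // keep only the bytes written
    return out;
}
```

The byte 0xEC (an i with a grave accent in ISO-8859-1) comes out as the two bytes 0xC3 0xAC, while plain ASCII passes through unchanged.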
It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself. That is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8.
You don't even need wchar_t; just use uint32_t for your characters.
You will learn a lot by implementing it yourself, and your program will gain speed from not using the mb or iconv functions.
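A hand-rolled encoder along those lines might look like the sketch below. The name encode_utf8 is hypothetical; the bit layout follows the standard UTF-8 definition (1 to 4 bytes per code point), and invalid input such as surrogates is only rejected by an assert, not handled gracefully.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    // Valid scalar values: up to U+10FFFF, excluding the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Note that this only covers the encoding step; you still need to know what the input bytes mean (which code point each extended ASCII byte stands for) before you can encode them.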

How to tell if LPCWSTR text is numeric?

The entire string needs to consist of digits, which as we know are 0123456789. I am trying the following function, but it doesn't seem to work:
bool isNumeric( const char* pszInput, int nNumberBase )
{
string base = "0123456789";
string input = pszInput;
return (::strspn(input.substr(0, nNumberBase).c_str(), base.c_str()) == input.length());
}
Here is an example of using it in code:
isdigit = (isNumeric((char*)text, 11));
It returns true even when the string contains text.
Presumably the issue is that text is actually LPCWSTR which is const wchar_t*. We have to infer this fact from the question title and the cast that you made.
Now, that cast is a problem. The compiler objected to you passing text. It said that text is not const char*. By casting you have not changed what text is, you simply lied to the compiler. And the compiler took its revenge.
What happens next is that you reinterpret the wide char buffer as being a narrow 8 bit buffer. If your wide char buffer has latin text, encoded as UTF-16, then every other byte will be zero. Hence the reinterpret cast that you do results in isNumeric thinking that the string is only 1 character long.
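The effect can be reproduced portably. The sketch below lays out a UTF-16 string as little-endian bytes by hand (so it does not depend on the machine's byte order) and then measures it the way char-based code would. Both function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: serializes UTF-16 code units as little-endian bytes,
// which is how they sit in memory on x86 Windows.
std::string utf16le_bytes(const std::u16string& text) {
    std::string bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // high byte
    }
    return bytes;
}

// The length that strlen-style char code sees in those bytes: it stops at
// the first zero byte, which for ASCII-range text is the second byte.
std::size_t length_seen_by_char_code(const std::u16string& text) {
    const std::string bytes = utf16le_bytes(text);
    assert(bytes.size() == 2 * text.size());
    return std::strlen(bytes.c_str());
}
```

For the six-character string u"123abc" the buffer holds 12 bytes, yet char-based code sees a string of length 1, namely "1", which consists only of digits.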
What you need to do is either:
Start using UTF-16 encoded wchar_t buffers in isNumeric.
Convert from UTF-16 to ANSI before calling isNumeric.
You should think about this carefully. It seems that at present you have a rather unholy mix of ANSI and UTF-16 in your program. You really ought to settle on a standard character encoding and use it consistently throughout. That is tenable internal to your program, but you will encounter external text that could use different encodings. Deal with that by converting at the boundary between your program and the outside world.
Personally I don't understand why you are using C strings at all. Surely you should be using std::wstring or std::string.
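For the first option, a wide-character version of the check could look like the sketch below. The name is_numeric_w is hypothetical, it only accepts the ASCII digits 0 to 9, and treating an empty string as non-numeric is a choice made for this example. On Windows an LPCWSTR can be passed directly, since it is a const wchar_t*.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: true only if the wide string is non-empty and every
// character is one of the ASCII digits 0-9.
bool is_numeric_w(const wchar_t* text) {
    assert(text != nullptr);
    const std::size_t length = std::wcslen(text);
    // wcsspn returns the length of the leading run made up of digits only;
    // the string is numeric when that run covers the whole string.
    return length > 0 && std::wcsspn(text, L"0123456789") == length;
}
```

No cast is needed at the call site, so the compiler can check the argument type again.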

dmDeviceName is just 'c'

I'm trying to get the names of each of my monitors using DEVMODE.dmDeviceName:
dmDeviceName
A zero-terminated character array that specifies the "friendly" name of the printer or display; for example, "PCL/HP LaserJet" in the case of PCL/HP LaserJet. This string is unique among device drivers. Note that this name may be truncated to fit in the dmDeviceName array.
I'm using the following code:
log.printf("Device Name: %s",currDevMode.dmDeviceName);
But for every monitor, the name is printed as just c. All other information from DEVMODE seems to print ok. What's going wrong?
Most likely you are using the Unicode version of the structure and thus are passing wide characters to printf. Since you use a format string that implies char data there is a mis-match.
The UTF-16 encoding results in every other byte being 0 for characters in the ASCII range and so printf thinks that the second byte of the first two byte character is actually a null-terminator.
This is the sort of problem that you get with printf which of course has no type-safety. Since you are using C++ it's probably worth switching to iostream based I/O.
However, if you want to use ANSI text, as you indicate in a comment, then the simplest solution is to use the ANSI DEVMODEA version of the struct and the corresponding A versions of the API functions, e.g. EnumDisplaySettingsA, DeviceCapabilitiesA.
dmDeviceName is TCHAR[] so if you're compiling for unicode, the first wide character will be interpreted as a 'c' followed by a zero terminator.
You will need to convert it to ascii or use unicode capable printing routines.
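One portable way to do the conversion is the standard wcstombs function, sketched below in a hypothetical helper named to_narrow. It converts to the encoding of the current C locale, so characters that the locale cannot represent make it fail. On Windows, WideCharToMultiByte is the native alternative, and the sample device name used in the usage note is made up.

```cpp
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a wide string to a narrow string in the
// encoding of the current C locale.
std::string to_narrow(const wchar_t* wide) {
    assert(wide != nullptr);
    // Worst case: every wide character expands to MB_CUR_MAX bytes.
    std::vector<char> buf(std::wcslen(wide) * MB_CUR_MAX + 1);
    std::size_t written = std::wcstombs(buf.data(), wide, buf.size());
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in this locale");
    }
    return std::string(buf.data(), written);
}
```

With that, a call such as printf("Device Name: %s", to_narrow(L"Generic Monitor").c_str()) receives the char data that %s expects.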

assigning chinese as DBCS

In my code I can do:
wchar_t *s = L"...some chinese/japanese/etc string..";
and this works okay.
but if I do:
char *s = "...some chinese/japanese/etc string...";
I end up with s assigned to "???????" (not a display problem, there are actual question marks in the value).
Given that I'm on a US/1252 Win 7 (VS2010) and Unicode-compiled apps, how do I create a MBCS chinese string given a constant string literal? I do not want it to be unicode, but rather the MBCS representation of the chinese characters.
So far the only way I've been able to do it is use the unicode version and convert it to MBCS using WideCharToMultiByte. Do I really need to do that, or enter it as a byte array?
Yes, you really do need to do that. There are no MBCS string literals in C++.
(In theory you could do something like char *s = "...\xa7\xf6\xd5..." with the right bytes, but that would be difficult to write and read.)
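For completeness, here is what the byte-array route looks like. This is only a sketch: the constant names are made up, and the bytes shown are the GBK (code page 936) encoding of the two characters U+4E2D U+6587, chosen as an example. One pitfall is that a hex escape consumes every following hex digit, so a literal that continues with a character such as 'A' has to be split into adjacent literals.

```cpp
// GBK (code page 936) bytes for U+4E2D U+6587, written as hex escapes.
constexpr char kGbkSample[] = "\xD6\xD0\xCE\xC4";

// Adjacent string literals are concatenated after escapes are processed.
// Without the split, "\xD0ABC" would be read as one oversized hex escape.
constexpr char kGbkThenAscii[] = "\xD6\xD0" "ABC";

// 4 data bytes plus the terminator, and 2 + 3 data bytes plus the terminator.
static_assert(sizeof(kGbkSample) == 5, "unexpected literal size");
static_assert(sizeof(kGbkThenAscii) == 6, "unexpected literal size");
```

This shows why the answer calls the approach difficult to write and read: the source no longer shows which characters are meant, and the bytes are only correct for one specific code page.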