How to tell if LPCWSTR text is numeric? - c++

The entire string needs to be made of digits, which as we know are 0123456789. I am trying with the following function, but it doesn't seem to work:
bool isNumeric(const char* pszInput, int nNumberBase)
{
    string base = "0123456789";
    string input = pszInput;
    return (::strspn(input.substr(0, nNumberBase).c_str(), base.c_str()) == input.length());
}
And an example of using it in code:
isdigit = (isNumeric((char*)text, 11));
It returns true even when there is non-numeric text in the string.

Presumably the issue is that text is actually LPCWSTR which is const wchar_t*. We have to infer this fact from the question title and the cast that you made.
Now, that cast is a problem. The compiler objected to you passing text. It said that text is not const char*. By casting you have not changed what text is, you simply lied to the compiler. And the compiler took its revenge.
What happens next is that you reinterpret the wide char buffer as being a narrow 8 bit buffer. If your wide char buffer has latin text, encoded as UTF-16, then every other byte will be zero. Hence the reinterpret cast that you do results in isNumeric thinking that the string is only 1 character long.
What you need to do is either:
Start using UTF-16 encoded wchar_t buffers in isNumeric.
Convert from UTF-16 to ANSI before calling isNumeric.
You should think about this carefully. It seems that at present you have a rather unholy mix of ANSI and UTF-16 in your program. You really ought to settle on a standard character encoding and use it consistently throughout. That is tenable internally to your program, but you will encounter external text that could use different encodings. Deal with that by converting at the boundary between your program and the outside world.
Personally I don't understand why you are using C strings at all. Surely you should be using std::wstring or std::string.
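If you settle on wide strings, the check itself is straightforward. Here is a minimal sketch, assuming a NUL-terminated wchar_t input; isNumericW is a hypothetical name for illustration:
#include <string>

// Sketch: the string is numeric when it is non-empty and contains only
// the wide digit characters 0-9.
bool isNumericW(const wchar_t* pszInput)
{
    std::wstring input = pszInput;
    return !input.empty() &&
           input.find_first_not_of(L"0123456789") == std::wstring::npos;
}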

Related

Required to convert a String to UTF8 string

Problem Statement:
I am required to convert a generated string to a UTF-8 string. This generated string has extended ASCII characters, and I am on a Linux system (2.6.32-358.el6.x86_64).
A POC is still in progress, so I can only provide small code samples; the complete solution can be posted once it is ready.
Why I require UTF-8: I have extended ASCII characters that must be stored in a string which has to be UTF-8.
How I am proceeding:
Convert the generated string to a wchar_t string.
Please look at the sample code below:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <wchar.h>
#include <iconv.h>

int main(){
    char CharString[] = "Prova";
    iconv_t cd;
    wchar_t WcharString[255];
    size_t size = mbstowcs(WcharString, CharString, strlen(CharString));
    wprintf(L"%ls\n", WcharString);
    wprintf(L"%s\n", WcharString);
    printf("\n%zu\n", size);
}
A few questions here.
The output is:
Prova?????
s
Why is the size not printed here?
Why does the second wprintf print only one character?
If I print the size before both printed strings, then only 5 is printed and both strings are missing from the console.
Moving on to the second part:
Now that I will have a wchar_t string, I want to convert it to a UTF-8 string.
For this I was searching around and found that iconv will help here.
Questions here:
These are the functions I found in the manual:
iconv_t iconv_open(const char *, const char *);
size_t iconv(iconv_t, char **, size_t *, char **, size_t *);
int iconv_close(iconv_t);
Do I need to convert the wchar_t array back to a char array before feeding it to iconv?
Please provide suggestions on the above issues.
For the extended ASCII I am talking about, please see the letters 'i' in the marked snapshot below.
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the string as a char*, and it will convert each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in the string is L'P', which is the number 0x50 in some integer size. On your little-endian architecture the low-order byte of that wchar_t comes first in memory, so the byte immediately after the 'P' is a 0; if you treat the string as a string of chars, it is terminated right after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.
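Putting those three points together, here is a minimal corrected sketch of the program from the question. The essential changes are passing the output buffer's capacity to mbstowcs, using %ls for the wide argument, and keeping all output wide so the stream's orientation stays consistent:
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    char CharString[] = "Prova";
    wchar_t WcharString[255];

    /* Pass the capacity of the OUTPUT buffer, not strlen(input), so
       mbstowcs has room to write the NUL terminator. */
    size_t size = mbstowcs(WcharString, CharString,
                           sizeof WcharString / sizeof WcharString[0]);

    wprintf(L"%ls\n", WcharString);  /* %ls for a wide string argument */
    wprintf(L"%zu\n", size);         /* stay wide: the stream is now
                                        wide-oriented */
    return 0;
}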
Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.
Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.
You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.
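For completeness, here is a minimal sketch of such a byte-to-byte conversion with iconv, assuming, purely as an example, that the input is ISO-8859-1 and the desired output is UTF-8; the helper name and the error handling are mine:
#include <iconv.h>
#include <string>
#include <stdexcept>

std::string latin1_to_utf8(const std::string& in)
{
    if (in.empty())
        return std::string();

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // ISO-8859-1 expands to at most 2 UTF-8 bytes per input byte.
    std::string out(in.size() * 2, '\0');
    char* inbuf   = const_cast<char*>(in.data());
    char* outbuf  = &out[0];
    size_t inleft = in.size(), outleft = out.size();

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - outleft);  // trim the unused tail
    return out;
}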
It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself; that is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8.
You don't even need wchar_t; just use uint32_t for your characters.
You will learn a lot if you implement it yourself, and your program will gain speed from not using the mb or iconv functions.
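As an illustration of how little code the encoding side takes, here is a minimal sketch taken straight from the definition on that page; append_utf8 is a hypothetical helper name, and validation (e.g. rejecting surrogate code points) is omitted for brevity:
#include <stdint.h>
#include <string>

// Append one code point to a UTF-8 byte string.
void append_utf8(std::string& out, uint32_t cp)
{
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += (char)cp;
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += (char)(0xC0 | (cp >> 6));
        out += (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += (char)(0xE0 | (cp >> 12));
        out += (char)(0x80 | ((cp >> 6) & 0x3F));
        out += (char)(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes, up to U+10FFFF
        out += (char)(0xF0 | (cp >> 18));
        out += (char)(0x80 | ((cp >> 12) & 0x3F));
        out += (char)(0x80 | ((cp >> 6) & 0x3F));
        out += (char)(0x80 | (cp & 0x3F));
    }
}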

Difference between unsigned char and char pointers

I'm a bit confused about the differences between unsigned char (which is also BYTE in the WinAPI) and char pointers.
Currently I'm working with some ATL-based legacy code and I see a lot of expressions like the following:
CAtlArray<BYTE> rawContent;
CALL_THE_FUNCTION_WHICH_FILLS_RAW_CONTENT(rawContent);
return ArrayToUnicodeString(rawContent);
// or return ArrayToAnsiString(rawContent);
Now, the implementations of ArrayToXXString look like this:
CStringA ArrayToAnsiString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);
    copiedArray.Add('\0');
    // Casting from BYTE* -> LPCSTR (const char*).
    return CStringA((LPCSTR)copiedArray.GetData());
}

CStringW ArrayToUnicodeString(const CAtlArray<BYTE>& array)
{
    CAtlArray<BYTE> copiedArray;
    copiedArray.Copy(array);
    copiedArray.Add('\0');
    copiedArray.Add('\0');
    // Same here.
    return CStringW((LPCWSTR)copiedArray.GetData());
}
So, the questions:
Is the C-style cast from BYTE* to LPCSTR (const char*) safe for all possible cases?
Is it really necessary to add double null-termination when converting array data to wide-character string?
The conversion routine CStringW((LPCWSTR)copiedArray.GetData()) seems invalid to me, is that true?
Any way to make all this code easier to understand and to maintain?
The C standard is kind of weird when it comes to the definition of a byte. You do have a couple of guarantees though.
A byte will always be one char in size
sizeof(char) always returns 1
A byte will be at least 8 bits in size
This definition doesn't mesh well with older platforms where a byte was 6 or 7 bits long, but it does mean that BYTE* and char* are guaranteed to be equivalent.
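Both guarantees can be checked at compile time; a minimal sketch using C++11 static_assert:
#include <climits>

static_assert(sizeof(char) == 1, "a char is always exactly one byte");
static_assert(CHAR_BIT >= 8, "a byte is at least 8 bits wide");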
Multiple null bytes are needed at the end of a UTF-16 string because valid UTF-16 code units can have a zero byte as one of their two bytes. A single zero byte is therefore not enough to mark the end when the buffer is viewed bytewise; the terminator must be a full two-byte zero character.
As for making the code easier to read, that is completely a matter of style. This code appears to be written in a style used by a lot of old C Windows code, which has definitely fallen out of favor. There are probably a ton of ways to make it clearer for you, but how to make it clearer has no clear answer.
Yes, it is always safe, because both point to an array of single-byte memory locations.
LPCSTR: Long Pointer to a Const (single-byte character) String
LPCWSTR: Long Pointer to a Const Wide (two-byte character) String
LPCTSTR: Long Pointer to a Const context-dependent (narrow or wide, depending on whether UNICODE is defined) String
In wide character strings, every single character occupies 2 bytes of memory, and the length of the memory location containing the string must be a multiple of 2. So if you want to add a wide '\0' to the end of a string, you should add two bytes.
Sorry, I do not know ATL and cannot help you on this part, but actually I see no complexity here, and I think it is easy to maintain. What code do you really want to make easier to understand and maintain?
If the BYTE* behaves like a proper string (i.e. the last BYTE is 0), you can cast a BYTE* to a LPCSTR, yes. Functions working with LPCSTR assume zero-terminated strings.
I think the multiple zeroes are only necessary when dealing with some multibyte character sets. The most common 8-bit encodings (like ordinary Windows Western and also UTF-8) don't require them.
The CString is Microsoft's best attempt at user-friendly strings. For instance, its constructor can handle both char and wchar_t type input, regardless of whether the CString itself is wide or not, so you don't have to worry about the conversion much.
Edit: wait, now I see that they are abusing a BYTE array for storing wide chars in. I couldn't recommend that.
An LPCWSTR is a string with 2 bytes per character; a char is one byte per character. That means you cannot convert with a C-style cast, because the memory would have to be adjusted (a zero byte added alongside each plain ASCII character), not just read in a different way, which is all a C-style cast does.
So I would say the cast is not safe.
The double null termination: each character is always 2 bytes, so your end-of-string marker must be 2 bytes long as well.
To make that code easier to understand, have a look at lexical_cast in Boost (http://www.boost.org/doc/libs/1_48_0/doc/html/boost_lexical_cast.html).
Another way would be to use the std::basic_string specializations (std::string and std::wstring), which let you perform proper string operations.
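As a sketch of what a tidier version might look like, assuming the byte array really does hold UTF-16LE data (which, as noted above, is itself questionable): constructing the CString from an explicit pointer/length pair avoids copying the array and appending terminators entirely.
#include <atlcoll.h>
#include <atlstr.h>

CStringW ArrayToUnicodeString(const CAtlArray<BYTE>& array)
{
    return CStringW(reinterpret_cast<LPCWSTR>(array.GetData()),
                    static_cast<int>(array.GetCount() / sizeof(wchar_t)));
}

CStringA ArrayToAnsiString(const CAtlArray<BYTE>& array)
{
    return CStringA(reinterpret_cast<LPCSTR>(array.GetData()),
                    static_cast<int>(array.GetCount()));
}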

assigning chinese as DBCS

In my code I can do:
wchar_t *s = L"...some chinese/japanese/etc string..";
and this works okay.
but if I do:
char *s = "...some chinese/japanese/etc string...";
I end up with s assigned to "???????" (not a display problem, there are actual question marks in the value).
Given that I'm on a US/1252 Win 7 machine (VS2010) with Unicode-compiled apps, how do I create an MBCS Chinese string from a constant string literal? I do not want it to be Unicode, but rather the MBCS representation of the Chinese characters.
So far the only way I've been able to do it is use the unicode version and convert it to MBCS using WideCharToMultiByte. Do I really need to do that, or enter it as a byte array?
Yes, you really do need to do that. There are no MBCS string literals in C++.
(In theory you could do something like char *s = "...\xa7\xf6\xd5..." with the right bytes, but that would be difficult to write and read.)
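So WideCharToMultiByte is the reasonable route. Here is a minimal sketch; the helper name is hypothetical and the code page (936, Simplified Chinese GBK) is only an example:
#include <windows.h>
#include <string>

std::string wideToMbcs(const wchar_t* wide, UINT codePage)
{
    // First call asks for the required buffer size, including the NUL.
    int len = WideCharToMultiByte(codePage, 0, wide, -1,
                                  NULL, 0, NULL, NULL);
    if (len <= 0)
        return std::string();

    std::string out(len, '\0');
    WideCharToMultiByte(codePage, 0, wide, -1, &out[0], len, NULL, NULL);
    out.resize(len - 1);  // drop the terminating NUL the API wrote
    return out;
}

// Usage: std::string s = wideToMbcs(L"...some chinese string...", 936);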

C++ with wxWidgets, Unicode vs. ASCII, what's the difference?

I have Code::Blocks 10.05 rev 0 and gcc 4.5.2 on Linux (Unicode, 64-bit), with wxWidgets version 2.8.12.0-0.
I have a simple problem:
#define _TT(x) wxT(x)
string file_procstatus;
file_procstatus.assign("/PATH/TO/FILE");
printf("%s",file_procstatus.c_str());
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
printf outputs "/PATH/TO/FILE" normally, while wxLogVerbose prints garbage. When I want to change the std::string to a wxString, I have to do the following:
wxString buf;
buf = wxString::From8BitData(file_procstatus.c_str());
Does somebody have an idea what might be wrong? Why do I need to convert from 8-bit data?
This has to do with how the character data is stored in memory. Writing "string" produces a string of type char using the ASCII character set, whereas I would assume that the _TT macro expands to L"string", which creates a string of type wchar_t using a Unicode encoding (UTF-32 on Linux, I believe).
The printf function expects a char string, whereas wxLogVerbose, I assume, expects a wchar_t string. This is where the need for conversion comes from. ASCII uses one byte per character (8-bit data), but wchar_t strings use multiple bytes per character, so the problem comes down to the character encoding.
If you don't want to have to call this conversion function then do something like the following:
wstring file_procstatus = wxT("/PATH/TO/FILE");
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
The following article gives a good explanation of the differences between the Unicode and ASCII character sets, how they are stored in memory, and how string functions work with them:
http://allaboutcharactersets.blogspot.in/
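To make the working route concrete, here is a minimal sketch assuming a wx 2.8 Unicode build; logPath is a hypothetical helper name:
#include <wx/string.h>
#include <wx/log.h>
#include <string>

// Wrap the narrow std::string in a wxString first, so the format string
// and the argument agree on character width.
void logPath(const std::string& file_procstatus)
{
    wxString buf = wxString::From8BitData(file_procstatus.c_str());
    wxLogVerbose(wxT("%s"), buf.c_str());
}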

The 'L' and 'LPCWSTR' in Windows API

I've found that
NetUserChangePassword(0, 0, L"ab", L"cd");
changes the user password from ab to cd. However,
NetUserChangePassword(0, 0, (LPCWSTR) "ab", (LPCWSTR) "cd");
doesn't work. The returned value indicates invalid password.
I need to pass const char* as last two parameters for this function call. How can I do that? For example,
NetUserChangePassword(0, 0, (LPCWSTR) vs[0].c_str(), (LPCWSTR) vs[1].c_str());
Where vs is std::vector<std::string>.
Those are two totally different L's. The first is a part of the C++ language syntax. Prefix a string literal with L and it becomes a wide string literal; instead of an array of char, you get an array of wchar_t.
The L in LPCWSTR doesn't describe the width of the characters, though. Instead, it describes the size of the pointer. Or, at least, it used to. The L abbreviation on type names is a relic of 16-bit Windows, when there were two kinds of pointers. There were near pointers, where the address was somewhere within the current 64 KB segment, and there were far, or long pointers, which could point beyond the current segment. The OS required callers to provide the latter to its APIs, so all the pointer-type names use LP. Nowadays, there's only one type of pointer; Microsoft keeps the same type names so that old code continues to compile.
The part of LPCWSTR that specifies wide characters is the W. But merely type-casting a char string literal to LPCWSTR is not sufficient to transform those characters into wide characters. Instead, what happens is the type-cast tells the compiler that what you wrote really is a pointer to a wide string, even though it really isn't. The compiler trusts you. Don't type-cast unless you really know better than the compiler what the real types are.
If you really need to pass a const char*, then you don't need to type-cast anything, and you don't need any L prefix. A plain old string literal is sufficient. (If you really want to cast to a Windows type, use LPCSTR — no W.) But it looks like what you really need to pass in is a const wchar_t*. As we learned above, you can get that with the L prefix on the string literal.
In a real program, you probably don't have a string literal. The user will provide a password, or you'll read a password from some other external source. Ideally, you would store that password in a std::wstring, which is like std::string but for wchar_t instead of char. The c_str() method of that type returns a const wchar_t*. If you don't have a wstring, a plain array of wchar_t might be sufficient.
But if you're storing the password in a std::string, then you'll need to convert it into wide characters some other way. To do a conversion, you need to know what code page the std::string characters use. The "current ANSI code page" is usually a safe bet; it's represented by the constant CP_ACP. You'll use that when calling MultiByteToWideChar to have the OS convert from the password's code page into Unicode.
int required_size = MultiByteToWideChar(CP_ACP, 0, vs[0].c_str(),
                                        vs[0].size() + 1, NULL, 0);
if (required_size == 0)
    ERROR;

// We'll be storing the Unicode password in this vector. The input length
// passed above includes the string's null terminator, so required_size
// already counts the null character at the end.
std::vector<wchar_t> wv(required_size);

int result = MultiByteToWideChar(CP_ACP, 0, vs[0].c_str(),
                                 vs[0].size() + 1, &wv[0], required_size);
if (result != required_size)
    ERROR;
Now, when you need a wchar_t*, just use a pointer to the first element of that vector: &wv[0]. If you need it in a wstring, you can construct it from the vector in a few ways:
// The vector is null-terminated, so use "const wchar_t*" constructor
std::wstring ws1 = &wv[0];
// Use iterator constructor. The vector is null-terminated, so omit
// the final character from the iterator range.
std::wstring ws2(wv.begin(), wv.end() - 1);
// Use pointer/length constructor.
std::wstring ws3(&wv[0], wv.size() - 1);
You have two problems.
The first is the practical problem - how to do this. You are confusing wide and narrow strings and casting from one to the other. A string with an L prefix is a wide string, where each character is two bytes (a wchar_t). A string without the L is a single byte (a char). You cannot cast from one to the other using the C-style cast (LPCWSTR) "ab" because you have an array of chars, and are casting it to a pointer to wide chars. It is simply changing the pointer type, not the underlying data.
To convert from a narrow string to a wide string, you would normally use MultiByteToWideChar. You don't mention what code page your narrow strings are in; you would probably pass CP_ACP for the first parameter. Since you are converting between a std::string and a std::wstring, you might also be interested in other ways to convert between them. This will give you a wstring holding your characters rather than a string, and a wstring's .c_str() method returns a pointer to wchar_t's.
The second is the following misunderstanding:
I need to pass const char* as last two parameters for this function call. How can I do that?
No you don't. You need to pass a wide string, which you got above. Your approach to this (casting the pointer) indicates you probably don't know about different string types and character encodings, and this is something every software developer should know. So, on the assumption you're interested, hopefully you'll find the following references handy:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Unicode FAQ (covers things like 'What is Unicode?')
MSDN introduction to wide characters.
I'd recommend you investigate recompiling your application with UNICODE and using wide strings. Many APIs are defined in both narrow and wide versions; by default the generic name maps to one of them, and you can always reach the ANSI (narrow) or wide version of these APIs directly by calling the A or W variant; they have A or W appended to their name, such as CreateWindowW (see the bottom of that page for the two names; you normally don't need to worry about this). As far as I can tell, this particular API is always available as-is regardless of UNICODE; it's just that it's only prototyped as wide.
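Schematically, the headers select the A or W variant like this (a simplified sketch, not the verbatim SDK header):
#ifdef UNICODE
#define CreateWindowEx  CreateWindowExW
#else
#define CreateWindowEx  CreateWindowExA
#endif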
C style casts as you've used here are a very blunt instrument. They assume you know exactly what you're doing.
You'll need to convert your ASCII or multi-byte strings into Unicode strings for the API. There might be a NetUserChangePasswordA function that takes the char* types you're trying to pass; try that first.
LPWSTR is defined as wchar_t* (wchar_t is a 2-byte character type on Windows), which is interpreted differently from normal 1-byte chars.
LPCWSTR means you need to pass a const wchar_t*.
If you change vs to std::vector<std::wstring>, then vs[0].c_str() will give you a wide-character pointer.
If you look at the example at http://msdn.microsoft.com/en-us/library/aa370650(v=vs.85).aspx, you can see that they define UNICODE, which is why they use wchar_t.
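Putting it together, a minimal sketch of the wide-string approach; the password values are placeholders:
#include <windows.h>
#include <lm.h>      // NetUserChangePassword; link with netapi32.lib
#include <string>
#include <vector>

int main()
{
    std::vector<std::wstring> vs;
    vs.push_back(L"ab");   // old password (placeholder)
    vs.push_back(L"cd");   // new password (placeholder)

    // c_str() on a std::wstring yields const wchar_t*, i.e. an LPCWSTR.
    NET_API_STATUS status =
        NetUserChangePassword(NULL, NULL, vs[0].c_str(), vs[1].c_str());

    return status == NERR_Success ? 0 : 1;
}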