strcmp error comparing converted wide string - c++

I added this because I am trying to handle WStrings in the Android NDK, which does not support wide characters. I could use advice on how to do this. I think the asciiConvert method does not work anymore.
typedef std::basic_string<wchar_t> WString;
WString val;
val=L"";
set_val(L"");
char* value=asciiConvert(get_val()); // value is 0x00000000
std::string token; // value is ""
if (strcmp(token.c_str(),value)==0) //ERROR HERE: INFINITE LOOP HERE I THINK since it will never be true.
HERE IS THE CONVERSION FUNCTION:
char* asciiConvert(const wchar_t* wideStr, char replSpace) // replSpace == -1
{
if (wideStr == NULL)
return NULL;
char* asciiStr = new char[wcslen(wideStr) + 10];
sprintf(asciiStr, "%S", wideStr);
if (replSpace >= 0)
{
int len = strlen(asciiStr);
while (len)
{
if (asciiStr[len] == ' ')
asciiStr[len] = replSpace;
len--;
}
}
return asciiStr;
}
UPDATE: the typedef is advised for some implementations that do not support wstring, so I think I need it, but now code like the above does not work. I have not used C++ in a while, so I could use very specific instructions on this.
Basically I have dozens of functions like const wchar_t* foo(const wchar_t* a, const wchar_t& b)
and quite a few wchar* [] as well as const wchar_t* memVariable; there are even virtual functions with these.
How about CrystalX for this? Is that the way to go?
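One possible direction, sketched under the assumption that the strings only need to survive as ASCII: convert character by character instead of relying on the "%S" format, and return a std::string so that a null or empty input never hands a null pointer to strcmp. The name toAscii and the '?' placeholder are hypothetical and are not part of the NDK or of the code above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: narrows a wide string to ASCII without using "%S".
// Characters outside 0..127 become '?'.
// If replSpace is >= 0, every space is replaced by that character.
std::string toAscii(const wchar_t* wideStr, int replSpace = -1)
{
    std::string out;
    if (wideStr == nullptr)
        return out;  // empty string instead of a null pointer
    for (std::size_t i = 0; wideStr[i] != L'\0'; ++i) {
        // Cast to unsigned so the range check works whether wchar_t is signed or not.
        const unsigned long code = static_cast<unsigned long>(wideStr[i]);
        char c = (code < 128) ? static_cast<char>(code) : '?';
        if (c == ' ' && replSpace >= 0)
            c = static_cast<char>(replSpace);
        out.push_back(c);
    }
    return out;
}
```

With this shape the comparison can be written as token == toAscii(get_val()), which is well defined for an empty value.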

Related

Can I safely use std::string to assemble binary data into messages?

I am using a std::string to hold binary data read from a socket.
The data consists of messages beginning with a '$' and ending with a '#'. Each message may contain '\0' characters.
I use std::string::find() to find the location of the first message and extract it from the string using std::string::substr():
class MessageSplitter {
public:
    MessageSplitter() { m_data.reserve(1'000'000); }
    void appendBinaryData(const std::string& binaryData) {
        m_data.append(binaryData);
    }
    bool popMessage(std::string& msg) {
        size_t beg_index = m_data.find("$");
        if (beg_index == std::string::npos) {
            return false;
        }
        size_t end_index = m_data.find(end, beg_index);
        if (end_index == std::string::npos) {
            return false;
        }
        size_t count = end_index - beg_index + end.size();
        msg = m_data.substr(beg_index, count);
        m_data = m_data.substr(end_index + end.size());
        return true;
    }
private:
    const std::string end = "#"; // end-of-message delimiter
    std::string m_data;
};
I read from socket this way (error checking on recv omitted):
char buffer[4096];
int ret = ::recv(m_socket, buffer, 4096, 0);
std::string binaryData = std::string(buffer, ret);
This approach seems to work fine on Windows.
However is it guaranteed to work on other platforms according to the C++ standard?
This is perfectly safe from a language level. std::string is guaranteed to be able to handle non-printable characters including embedded nul characters just fine.
From a programmer's perspective, though, it's somewhat unsafe because it's surprising. When I see std::string I generally expect it to be printable text. It has an operator<< for example to make it easy to print to output streams, and I have to remember never to use that.
For the second reason, I would tend to prefer something more explicit. std::vector<std::byte> or std::vector<unsigned char> or similar. Something that doesn't act like text is much more difficult to accidentally treat as text.
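As a rough sketch of that suggestion, the splitter from the question could be built on std::vector<unsigned char> instead. The class name ByteMessageSplitter and the Bytes alias are made up for this example, and the '$' and '#' delimiters are taken from the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Bytes = std::vector<unsigned char>;

class ByteMessageSplitter {
public:
    // Append raw bytes as received; embedded zero bytes need no special handling.
    void appendBinaryData(const unsigned char* data, std::size_t size) {
        m_data.insert(m_data.end(), data, data + size);
    }

    // Extract the first complete "$...#" message, if there is one.
    bool popMessage(Bytes& msg) {
        auto beg = std::find(m_data.begin(), m_data.end(),
                             static_cast<unsigned char>('$'));
        if (beg == m_data.end()) return false;
        auto end = std::find(beg, m_data.end(),
                             static_cast<unsigned char>('#'));
        if (end == m_data.end()) return false;
        ++end;                              // include the terminating '#'
        msg.assign(beg, end);
        m_data.erase(m_data.begin(), end);  // drop the message and anything before it
        return true;
    }

private:
    Bytes m_data;
};
```

The recv buffer can be passed in as reinterpret_cast<const unsigned char*>(buffer) together with the returned byte count. Because Bytes has no stream operator<<, printing it as text by accident no longer compiles.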

How to convert std::string to wchar_t*

std::regex regexpy("y:(.+?)\"");
std::smatch my;
regex_search(value.text, my, regexpy);
y = my[1];
std::wstring wide_string = std::wstring(y.begin(), y.end());
const wchar_t* p_my_string = wide_string.c_str();
wchar_t* my_string = const_cast<wchar_t*>(p_my_string);
URLDownloadToFile(my_string, aDest);
I'm using Unicode and the source string is ASCII-encoded. URLDownloadToFile expands to URLDownloadToFileW (which takes wchar_t*). The code above compiles in debug mode, but with a lot of warnings like:
warning C4244: 'argument': conversion from 'wchar_t' to 'const _Elem', possible loss of data
So my question is: how can I convert a std::string to a wchar_t*?
First off, you don't need the const_cast, as URLDownloadToFileW() takes a const wchar_t* as input, so passing it wide_string.c_str() will work as-is:
URLDownloadToFile(..., wide_string.c_str(), ...);
That being said, you are constructing a std::wstring with the individual char values of a std::string as-is. That will work without data loss only for ASCII characters <= 127, which have the same numeric values in both ASCII and Unicode. For non-ASCII characters, you need to actually convert the char data to Unicode, such as with MultiByteToWideChar() (or equivalent), e.g.:
std::wstring to_wstring(const std::string &s)
{
std::wstring wide_string;
// NOTE: be sure to specify the correct codepage that the
// std::string data is actually encoded in...
int len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), s.size(), NULL, 0);
if (len > 0) {
wide_string.resize(len);
MultiByteToWideChar(CP_ACP, 0, s.c_str(), s.size(), &wide_string[0], len);
}
return wide_string;
}
URLDownloadToFileW(..., to_wstring(y).c_str(), ...);
That being said, there is a simpler solution. If the std::string is encoded in the user's default locale, you can simply call URLDownloadToFileA() instead, passing it the original std::string as-is, and let the OS handle the conversion for you, eg:
URLDownloadToFileA(..., y.c_str(), ...);
There is a cross-platform solution. You can use std::mbtowc.
std::wstring convert_mb_to_wc(std::string s) {
std::wstring out;
std::mbtowc(nullptr, 0, 0);
int offset;
size_t index = 0;
for (wchar_t wc;
(offset = std::mbtowc(&wc, &s[index], s.size() - index)) > 0;
index += offset) {
out.push_back(wc);
}
return out;
}
Adapted from an example on cppreference.com at https://en.cppreference.com/w/cpp/string/multibyte/mbtowc .
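A variant of the same loop, sketched with std::mbrtowc and an explicit std::mbstate_t so that no hidden global conversion state is involved. The function name widen_restartable is made up here. Like std::mbtowc, it interprets the input in the current LC_CTYPE locale, so the result depends on what std::setlocale has selected.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a multibyte string to a wide string using
// the restartable std::mbrtowc and a local conversion state.
std::wstring widen_restartable(const std::string& s)
{
    std::wstring out;
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* p = s.data();
    const char* const end = p + s.size();
    while (p < end) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            throw std::runtime_error("invalid or incomplete multibyte sequence");
        if (n == 0)
            n = 1;  // a null character was converted; step over its single byte
        out.push_back(wc);
        p += n;
    }
    return out;
}
```

Unlike the loop above, this version keeps going past embedded null characters and reports invalid input with an exception instead of silently stopping.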

libxml2 xmlChar * to std::wstring

libxml2 seems to store all its strings in UTF-8, as xmlChar *.
/**
* xmlChar:
*
* This is a basic byte in an UTF-8 encoded string.
* It's unsigned allowing to pinpoint case where char * are assigned
* to xmlChar * (possibly making serialization back impossible).
*/
typedef unsigned char xmlChar;
As libxml2 is a C library, there's no provided routines to get an std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):
std::wstring xmlCharToWideString(const xmlChar *xmlString) {
if(!xmlString){abort();} //provided string was null
int charLength = xmlStrlen(xmlString); //excludes null terminator
wchar_t *wideBuffer = new wchar_t[charLength];
size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
std::wstring wideString(wideBuffer, wcharLength);
delete[] wideBuffer;
return wideString;
}
Edit: Just an FYI, I'm very aware of what xmlStrlen returns; it's the number of xmlChar used to store the string; I know it's not the number of characters but rather the number of unsigned char. It would have been less confusing if I had named it byteLength, but I thought it would be clearer this way as I have both charLength and wcharLength. As for the correctness of the code, the wideBuffer will always be larger than or equal to the size required to hold the result (I believe), since characters that require more space than a wchar_t will be truncated (I think).
xmlStrlen() returns the number of UTF-8 encoded code units in the xmlChar* string. That is not going to be the same as the number of wchar_t code units needed when the data is converted, so do not use xmlStrlen() to size your wchar_t string. Call std::mbstowcs() once with a null destination to get the required length (a POSIX extension that common implementations provide), then allocate the memory and call mbstowcs() again to fill it. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    std::wstring wideString;
    if (xmlStrlen(xmlString) > 0)
    {
        // setlocale() may return a pointer to static storage, so copy the name before changing the locale
        const char *current = setlocale(LC_CTYPE, NULL);
        std::string origLocale = current ? current : "C";
        setlocale(LC_CTYPE, "en_US.UTF-8");
        // with a NULL destination, mbstowcs() only reports the required length
        size_t wcharLength = mbstowcs(NULL, (const char*) xmlString, 0); //excludes null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], (const char*) xmlString, wcharLength);
        }
        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }
    return wideString;
}
A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead so you do not have to deal with locales:
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
if (!xmlString) { abort(); } //provided string was null
try
{
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
return conv.from_bytes((const char*)xmlString);
}
catch(const std::range_error& e)
{
abort(); //wstring_convert failed
}
}
An alternative option is to use an actual Unicode library, such as ICU or ICONV, to handle Unicode conversions.
There are some problems in this code, besides the fact that you are using wchar_t and std::wstring which is a bad idea unless you're making calls to the Windows API.
xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all stuff in the documentation.
Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.
The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.
My recommendations:
Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).
If you need UTF-32, then use std::u32string. Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or fixed-length (Linux, OS X).
If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:
std::wstring utf8_to_wstring(const char *utf8)
{
size_t utf8len = std::strlen(utf8);
int wclen = MultiByteToWideChar(
CP_UTF8, 0, utf8, utf8len, NULL, 0);
wchar_t *wc = NULL;
try {
wc = new wchar_t[wclen];
MultiByteToWideChar(
CP_UTF8, 0, utf8, utf8len, wc, wclen);
std::wstring wstr(wc, wclen);
delete[] wc;
wc = NULL;
return wstr;
} catch (std::exception &) {
delete[] wc;
throw; // rethrow so the function never falls off the end without returning
}
}
If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
Remember: You probably don't want wchar_t or std::wstring. What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.
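To make the std::u32string recommendation concrete, here is a minimal sketch of a locale-independent UTF-8 to UTF-32 decoder. The function name utf8_to_utf32 is an assumption for this example and it is not part of libxml2 or the standard library. A real project may prefer ICU or iconv, as mentioned above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 without touching the locale.
// Throws std::runtime_error on truncated or malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                       { cp = lead;        extra = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; extra = 1; }
        else if (lead >= 0xE0 && lead <= 0xEF) { cp = lead & 0x0F; extra = 2; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        static const char32_t minimum[] = {0, 0x80, 0x800, 0x10000};
        if (cp < minimum[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

An xmlChar* can be passed in as std::string(reinterpret_cast<const char*>(xmlString)). The result has one element per code point on every platform, which is the property wchar_t does not guarantee.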
Add the Boost.Locale header:
#include <boost/locale.hpp>
Convert the xmlChar* to a std::string:
std::string strGbk((char*)node);
Then convert the std::string to a std::wstring (shown here as a separate snippet, with a literal standing in for the node text):
std::string strGbk = "china powerful forever";
std::wstring wstr = boost::locale::conv::to_utf<wchar_t>(strGbk, "gbk");
std::cout << strGbk << std::endl;
std::wcout << wstr << std::endl;
It works for me. Good luck.

Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?
Since you asked, here's my standard conversion functions from string to wide string, implemented using C++ std::string and std::wstring classes.
First off, make sure to start your program with a call to std::setlocale:
#include <clocale>
int main()
{
std::setlocale(LC_CTYPE, ""); // before any string operations
}
Now for the functions. First off, getting a wide string from a narrow string:
#include <string>
#include <vector>
#include <iostream>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>
// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
return s;
}
// Real worker
std::wstring get_wstring(const std::string & s)
{
const char * cs = s.c_str();
const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
std::vector<wchar_t> buf(wn + 1);
const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
assert(cs == NULL); // successful conversion
return std::wstring(buf.data(), wn);
}
And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:
// Dummy
std::string get_locale_string(const std::string & s)
{
return s;
}
// Real worker
std::string get_locale_string(const std::wstring & s)
{
const wchar_t * cs = s.c_str();
const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
std::vector<char> buf(wn + 1);
const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
assert(cs == NULL); // successful conversion
return std::string(buf.data(), wn);
}
Some notes:
If you don't have std::vector::data(), you can say &buf[0] instead.
I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use the mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1);, wcstombs(buf.data(), cs, wn + 1);
In response to your question, if you want to compare two strings, you can convert both of them to wide strings and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR_T and then compare with the wide string.
Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
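A small illustration of that caveat, using only the standard library. The variable and function names are made up for this example. Both strings render as "café", but they are different code point sequences, so a plain comparison treats them as different.

```cpp
#include <string>

// "é" as a single precomposed code point (U+00E9).
const std::wstring precomposed = L"caf\u00E9";
// "é" as 'e' followed by a combining acute accent (U+0301).
const std::wstring decomposed = L"cafe\u0301";

// Code-unit comparison, which is all std::wstring::operator== does.
bool sameCodeUnits(const std::wstring& a, const std::wstring& b)
{
    return a == b;
}
```

Here sameCodeUnits(precomposed, decomposed) returns false. Making the two compare equal requires normalizing both to a common form first, which the standard library does not provide.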
You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char */wchar *, so you'll probably need to do something like this:
std::wstring StringToWstring(const std::string & source)
{
std::wstring target(source.size()+1, L' ');
std::size_t newLength=std::mbstowcs(&target[0], source.c_str(), target.size());
target.resize(newLength);
return target;
}
I'm not entirely sure that this usage of &target[0] is standard-conforming; if someone has a good answer to that, please tell me in the comments. Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars of the original string. It is a logical assumption, but I'm not sure it's covered by the standard.
On the other hand, it seems that there's no standard way to ask mbstowcs for the size of the needed buffer, so either you go this way, or you go with (better done and better defined) code from Unicode libraries (be it Windows APIs or libraries like iconv).
Still, keep in mind that comparing Unicode strings without using special functions is slippery ground: two equivalent strings may be evaluated as different when compared bitwise.
Long story short: this should work, and I think it's the most you can do with just the standard library, but how Unicode is handled is highly implementation-dependent, and I wouldn't trust it much. In general, it's better to stick with one encoding inside your application and avoid this kind of conversion unless absolutely necessary, and, if you are working with definite encodings, to use APIs that are less implementation-dependent.
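Putting the conversion and the comparison together, a minimal sketch might look like the following. The name equalsWide is hypothetical, the narrow string is interpreted in the current LC_CTYPE locale, and the buffer size relies on the same assumption discussed above, namely that the result never has more wide characters than the input has chars.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: returns true if `narrow`, interpreted in the current
// LC_CTYPE locale, converts to exactly the wide string `wide`.
// A narrow string that cannot be converted compares unequal.
// Note: std::mbstowcs stops at the first null character in `narrow`.
bool equalsWide(const std::string& narrow, const std::wstring& wide)
{
    // Assumption: each wide character consumes at least one char of input,
    // so size() + 1 elements are enough (the +1 is for the terminator).
    std::vector<wchar_t> buf(narrow.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return false;  // invalid multibyte sequence for this locale
    return std::wstring(buf.data(), n) == wide;
}
```

For example, equalsWide("Hello", L"Hello") is true in any locale whose narrow encoding is ASCII-compatible. The comparison is still a plain code-unit comparison, so the caveat about equivalent Unicode strings applies.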
Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.
If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable under Windows; for instance mbstowcs is deprecated.
The cross-platform way to work with Unicode is to use the ICU library.
Take care to use special functions for Unicode string comparison, don't do it manually. Two Unicode strings could have different characters, yet still be the same.
wstring ConvertToUnicode(const string & str)
{
UINT codePage = CP_ACP;
DWORD flags = 0;
int resultSize = MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, NULL // lpWideCharStr
, 0 // cchWideChar
);
vector<wchar_t> result(resultSize + 1);
MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, &result[0] // lpWideCharStr
, resultSize // cchWideChar
);
return &result[0];
}

How to compare two BSTRs or CComBSTRs?

What is the right way to compare two CComBSTRs? I tried to use
bool operator ==(
const CComBSTR& bstrSrc
) const throw( );
However, it always returns false even when the two CComBSTRs are the same, so it does not seem to work correctly.
Do I have to convert CComBSTRs to ANSI string first and then use strcmp?
Thanks!
-bc
You should probably use VarBstrCmp.
EDIT: this is actually what CComBSTR::operator== does, so without further context, your code may be incorrect.
BSTRs (and therefore CComBSTRs) are usually Unicode strings. You can use wcscmp() (or wcsicmp() for case-insensitive comparison).
Beware that encapsulated BSTR can be null which is a legal representation for an empty string and this should be treated as a special case, otherwise your program might run into undefined behaviour (most likely just crash).
To properly compare BSTR values which may contain embedded null characters you need to use something like this:
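bool EqualBSTR(const BSTR String1, const BSTR String2, bool IgnoreCase = false)
{
    // A null BSTR is a legal empty string; SysStringLen() returns 0 for it.
    const UINT Length1 = SysStringLen(String1);
    const UINT Length2 = SysStringLen(String2);
    if (Length1 != Length2) {
        return false; // strings of different lengths are never equal
    }
    if (Length1 == 0) {
        return true; // both are empty (or null)
    }
    if (IgnoreCase) {
        // Note: _wcsnicmp still stops at the first embedded null character.
        return _wcsnicmp(String1, String2, Length1) == 0;
    } else {
        // wmemcmp compares all Length1 characters, including embedded nulls.
        return wmemcmp(String1, String2, Length1) == 0;
    }
}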
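The effect of embedded null characters can be shown without any COM headers. The sketch below uses a made-up CountedWide struct as a stand-in for a BSTR (a real BSTR keeps its length in a prefix that SysStringLen() reads) and compares by explicit length rather than by null terminator.

```cpp
#include <cstddef>
#include <cwchar>

// Stand-in for a BSTR: a pointer plus an explicit character count.
// Hypothetical type, used only to keep this example portable.
struct CountedWide {
    const wchar_t* data;
    std::size_t length;
};

// Length-aware equality: embedded nulls are compared like any other character.
bool equalCounted(const CountedWide& a, const CountedWide& b)
{
    if (a.length != b.length)
        return false;
    return a.length == 0 || std::wmemcmp(a.data, b.data, a.length) == 0;
}
```

For the five-character strings L"ab\0cd" and L"ab\0ce", wcscmp() reports equality because it stops at the first null, while equalCounted() sees the difference in the last character.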
void BSTRsAreEqual(BSTR bstr1, BSTR bstr2, VARIANT_BOOL* boolptrEqual)
{
    CString s1, s2;
    s1 = bstr1;
    s2 = bstr2;
    // VARIANT_BOOL uses VARIANT_TRUE (-1) and VARIANT_FALSE (0), not true/false
    *boolptrEqual = (s1 == s2) ? VARIANT_TRUE : VARIANT_FALSE;
}