Can I define a Shift-JIS string literal in c++? - c++

I'm writing a program which involves comparing some strings which will be encoded in Shift-JIS. I only need to compare against a few possible values, and I'd like to encode them as string constants/literals.
I know how to do this with a user-defined literal using iconv or the MultiByteToWideChar/WideCharToMultiByte functions, but I'm wondering if there's some other way to do this. Or maybe a simpler way to define my own literal.
Some context if it's useful:
The program will only be compiled for Windows.
I'm using the Mingw64 cross compiler to build from Linux.
The strings I'm checking against are all exactly 16 bytes long.
I can use c++17 features, but probably not anything from c++20.
The user-defined literal I'm using now:
auto operator "" _sjs(const char* in, std::size_t len)
{
#ifdef HAS_ICONV
auto t = iconv_open("SHIFT-JIS", "UTF-8");
// max length of BS header name + 1 for terminator
char output[17] = "";
memset(output, ' ', 16);
size_t outLeft = 16;
char* tmpOut = output;
iconv(t, &in, &len, &tmpOut, &outLeft);
iconv_close(t);
return std::string{output};
#else
wchar_t output[34];
memset(output, 0, 34);
MultiByteToWideChar(CP_UTF8, 0, in, -1, output, len);
char output2[17] = "";
memset(output2, ' ', 16);
WideCharToMultiByte(932, 0, output, -1, output2, len*2, NULL, NULL);
return std::string{output2};
#endif
}

Related

How to convert std::string to wchar_t*

std::regex regexpy("y:(.+?)\"");
std::smatch my;
regex_search(value.text, my, regexpy);
y = my[1];
std::wstring wide_string = std::wstring(y.begin(), y.end());
const wchar_t* p_my_string = wide_string.c_str();
wchar_t* my_string = const_cast<wchar_t*>(p_my_string);
URLDownloadToFile(my_string, aDest);
I'm using Unicode, the encoding of the source string is ASCII, UrlDownloadToFile expands to UrlDownloadToFileW (wchar_t*) the code above compiles in debug mode, but with a lot of warnings like:
warning C4244: 'argument': conversion from 'wchar_t' to 'const _Elem', possible loss of data
So do I ask, how I could convert a std::string to a wchar_t?
First off, you don't need the const_cast, as URLDownloadToFileW() takes a const wchar_t* as input, so passing it wide_string.c_str() will work as-is:
URLDownloadToFile(..., wide_string.c_str(), ...);
That being said, you are constructing a std::wstring with the individual char values of a std::string as-is. That will work without data loss only for ASCII characters <= 127, which have the same numeric values in both ASCII and Unicode. For non-ASCII characters, you need to actually convert the char data to Unicode, such as with MultiByteToWideChar() (or equivilent), eg:
std::wstring to_wstring(const std::string &s)
{
std::wstring wide_string;
// NOTE: be sure to specify the correct codepage that the
// str::string data is actually encoded in...
int len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), s.size(), NULL, 0);
if (len > 0) {
wide_string.resize(len);
MultiByteToWideChar(CP_ACP, 0, s.c_str(), s.size(), &wide_string[0], len);
}
return wide_string;
}
URLDownloadToFileW(..., to_wstring(y).c_str(), ...);
That being said, there is a simpler solution. If the std::string is encoded in the user's default locale, you can simply call URLDownloadToFileA() instead, passing it the original std::string as-is, and let the OS handle the conversion for you, eg:
URLDownloadToFileA(..., y.c_str(), ...);
There is a cross-platform solution. You can use std::mbtowc.
std::wstring convert_mb_to_wc(std::string s) {
std::wstring out;
std::mbtowc(nullptr, 0, 0);
int offset;
size_t index = 0;
for (wchar_t wc;
(offset = std::mbtowc(&wc, &s[index], s.size() - index)) > 0;
index += offset) {
out.push_back(wc);
}
return out;
}
Adapted from an example on cppreference.com at https://en.cppreference.com/w/cpp/string/multibyte/mbtowc .

C++: socket encoding (working with TeamSpeak)

As I'm currently working on a program for a TeamSpeak server, I need to retrieve the names of the currently online users which I'm doing with sockets - that's working fine so far.In my UI I'm displaying all clients in a ListBox which is basically working. Nevertheless I'm having problems with wrong displayed characters and symbols in the ListBox.
I'm using the following code:
//...
auto getClientList() -> void{
i = 0;
queryString.str("");
queryString.clear();
queryString << clientlist << " \n";
send(sock, queryString.str().c_str(), strlen(queryString.str().c_str()), NULL);
TeamSpeak::getAnswer(1);
while(p_1 != -1){
p_1 = lastLog.find(L"client_nickname=", sPos + 1);
if(p_1 != -1){
sPos = p_1;
p_2 = lastLog.find(L" ", p_1);
temporary = lastLog.substr(p_1 + 16, p_2 - (p_1 + 16));
users[i].assign(temporary.begin(), temporary.end());
SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)(users[i].c_str()));
i++;
}
else{
sPos = 0;
p_1 = 0;
break;
}
}
TeamSpeak::getAnswer(0);
}
//...
I've already checked lastLog, temporary and users[i] (by writing them to a file), but all of them have no encoding problem with characters or symbols (for example Andrè). If I add a string directly:SendMessage(hwnd_2, LB_ADDSTRING, (WPARAM)NULL, (LPARAM)(LPTSTR)L"Andrè", it is displayed correctly in the ListBox.What might be the issue here, is it a problem with my code or something else?
Update 1:I recently continued working on this problem and considered the word Olè! receiving it from the socket. The result I got, is the following:O (79) | l (108) | � (-61) | � (-88) | ! (33).How can I convert this char array to a wstring containing the correct characters?
Solution: As #isanae mentioned in his post, the std::wstring_convert-template did the trick for me, thank you very much!
Many things can go wrong in this code, and you don't show much of it. What's particularly lacking is the definition of all those variables.
Assuming that users[i] contains meaningful data, you also don't say how it is encoded. Is it ASCII? UTF-8? UTF-16? The fact that you can output it to a file and read it with an editor doesn't mean anything, as most editors are able to guess at encoding.
If it really is UTF-16 (the native encoding on Windows), then I see no reason for this code not to work. One way to check would be to break into the debugger and look at the individual bytes in users[i]. If you see every character with a value less than 128 followed by a 0, then it's probably UTF-16.
If it is not UTF-16, then you'll need to convert it. There are a variety of ways to do this, but MultiByteToWideChar may be the easiest. Make sure you set the codepage to same encoding used by the sender. It may be CP_UTF8, or an actual codepage.
Note also that hardcoding a string with non-ASCII characters doesn't help you much either, as you'd first have to find out the encoding of the file itself. I know some versions of Visual C++ will convert your source file to UTF-16 if it encounters non-ASCII characters, which may be what happened to you.
O (79) | l (108) | � (-61) | � (-88) | ! (33).
How can I convert this char array to a wstring containing the correct characters?
This is a UTF-8 string. It has to be converted to UTF-16 so Windows can use it.
This is a portable, C++11 solution on implementations where sizeof(wchar_t) == 2. If this is not the case, then char16_t and std::u16string may be used, but the most recent version of Visual C++ as of this writing (2015 RC) doesn't implement std::codecvt for char16_t and char32_t.
#include <string>
#include <codecvt>
std::wstring utf8_to_utf16(const std::string& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.from_bytes(s);
}
std::string utf16_to_utf8(const std::wstring& s)
{
static_assert(sizeof(wchar_t)==2, "wchar_t needs to be 2 bytes");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
return conv.to_bytes(s);
}
Windows-only:
#include <string>
#include <cassert>
#include <memory>
#include <codecvt>
#include <Windows.h>
std::wstring utf8_to_utf16(const std::string& s)
{
// getting the required size in characters (not bytes) of the
// output buffer
const int size = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
nullptr, 0);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<wchar_t[]> buffer(new wchar_t[size]);
// converting from utf8 to utf16
const int written = ::MultiByteToWideChar(
CP_UTF8, 0, s.c_str(), static_cast<int>(s.size()),
buffer.get(), size);
// error handling
assert(written != 0);
return std::wstring(buffer.get(), buffer.get() + written);
}
std::string utf16_to_utf8(const std::wstring& ws)
{
// getting the required size in bytes of the output buffer
const int size = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
nullptr, 0, nullptr, nullptr);
// error handling
assert(size != 0);
// creating a buffer with enough characters in it
std::unique_ptr<char[]> buffer(new char[size]);
// converting from utf16 to utf8
const int written = ::WideCharToMultiByte(
CP_UTF8, 0, ws.c_str(), static_cast<int>(ws.size()),
buffer.get(), size, nullptr, nullptr);
// error handling
assert(written != 0);
return std::string(buffer.get(), buffer.get() + written);
}
Test:
// utf-8 string
const std::string s = {79, 108, -61, -88, 33};
::MessageBoxW(0, utf8_to_utf16(s).c_str(), L"", MB_OK);

Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?
Since you asked, here's my standard conversion functions from string to wide string, implemented using C++ std::string and std::wstring classes.
First off, make sure to start your program with set_locale:
#include <clocale>
int main()
{
std::setlocale(LC_CTYPE, ""); // before any string operations
}
Now for the functions. First off, getting a wide string from a narrow string:
#include <string>
#include <vector>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>
// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
return s;
}
// Real worker
std::wstring get_wstring(const std::string & s)
{
const char * cs = s.c_str();
const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
std::vector<wchar_t> buf(wn + 1);
const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
return L"";
}
assert(cs == NULL); // successful conversion
return std::wstring(buf.data(), wn);
}
And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:
// Dummy
std::string get_locale_string(const std::string & s)
{
return s;
}
// Real worker
std::string get_locale_string(const std::wstring & s)
{
const wchar_t * cs = s.c_str();
const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);
if (wn == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
std::vector<char> buf(wn + 1);
const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);
if (wn_again == size_t(-1))
{
std::cout << "Error in wcsrtombs(): " << errno << std::endl;
return "";
}
assert(cs == NULL); // successful conversion
return std::string(buf.data(), wn);
}
Some notes:
If you don't have std::vector::data(), you can say &buf[0] instead.
I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use the mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1);, wcstombs(buf.data(), cs, wn + 1);
In response to your question, if you want to compare two strings, you can convert both of them to wide string and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR and then compare with the wide string.
Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char */wchar *, so you'll probably need to do something like this:
std::wstring StringToWstring(const std::string & source)
{
std::wstring target(source.size()+1, L' ');
std::size_t newLength=std::mbstowcs(&target[0], source.c_str(), target.size());
target.resize(newLength);
return target;
}
I'm not entirely sure that that usage of &target[0] is entirely standard-conforming, if someone has a good answer to that please tell me in the comments. Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars of the original string - a logical assumption that still I'm not sure it's covered by the standard.
On the other hand, it seems that there's no way to ask to mbstowcs the size of the needed buffer, so either you go this way, or go with (better done and better defined) code from Unicode libraries (be it Windows APIs or libraries like iconv).
Still, keep in mind that comparing Unicode strings without using special functions is slippery ground, two equivalent strings may be evaluated different when compared bitwise.
Long story short: this should work, and I think it's the maximum you can do with just the standard library, but it's a lot implementation-dependent in how Unicode is handled, and I wouldn't trust it a lot. In general, it's just better to stick with an encoding inside your application and avoid this kind of conversions unless absolutely necessary, and, if you are working with definite encodings, use APIs that are less implementation-dependent.
Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.
If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable under Windows; for instance mbstowcs is deprecated.
The cross-platform way to work with Unicode is to use the ICU library.
Take care to use special functions for Unicode string comparison, don't do it manually. Two Unicode strings could have different characters, yet still be the same.
wstring ConvertToUnicode(const string & str)
{
UINT codePage = CP_ACP;
DWORD flags = 0;
int resultSize = MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, NULL // lpWideCharStr
, 0 // cchWideChar
);
vector<wchar_t> result(resultSize + 1);
MultiByteToWideChar
( codePage // CodePage
, flags // dwFlags
, str.c_str() // lpMultiByteStr
, str.length() // cbMultiByte
, &result[0] // lpWideCharStr
, resultSize // cchWideChar
);
return &result[0];
}

Why is the following C++ code printing only the first character?

I am trying to convert a char string to a wchar string.
In more detail: I am trying to convert a char[] to a wchar[] first and then append " 1" to that string and the print it.
char src[256] = "c:\\user";
wchar_t temp_src[256];
mbtowc(temp_src, src, 256);
wchar_t path[256];
StringCbPrintf(path, 256, _T("%s 1"), temp_src);
wcout << path;
But it prints just c
Is this the right way to convert from char to wchar? I have come to know of another way since. But I'd like to know why the above code works the way it does?
mbtowc converts only a single character. Did you mean to use mbstowcs?
Typically you call this function twice; the first to obtain the required buffer size, and the second to actually convert it:
#include <cstdlib> // for mbstowcs
const char* mbs = "c:\\user";
size_t requiredSize = ::mbstowcs(NULL, mbs, 0);
wchar_t* wcs = new wchar_t[requiredSize + 1];
if(::mbstowcs(wcs, mbs, requiredSize + 1) != (size_t)(-1))
{
// Do what's needed with the wcs string
}
delete[] wcs;
If you rather use mbstowcs_s (because of deprecation warnings), then do this:
#include <cstdlib> // also for mbstowcs_s
const char* mbs = "c:\\user";
size_t requiredSize = 0;
::mbstowcs_s(&requiredSize, NULL, 0, mbs, 0);
wchar_t* wcs = new wchar_t[requiredSize + 1];
::mbstowcs_s(&requiredSize, wcs, requiredSize + 1, mbs, requiredSize);
if(requiredSize != 0)
{
// Do what's needed with the wcs string
}
delete[] wcs;
Make sure you take care of locale issues via setlocale() or using the versions of mbstowcs() (such as mbstowcs_l() or mbstowcs_s_l()) that takes a locale argument.
why are you using C code, and why not write it in a more portable way, for example what I would do here is use the STL!
std::string src = std::string("C:\\user") +
std::string(" 1");
std::wstring dne = std::wstring(src.begin(), src.end());
wcout << dne;
it's so simple it's easy :D
L"Hello World"
the prefix L in front of the string makes it a wide char string.

How do you convert from a nsACString to a LPCWSTR?

I'm making a firefox extension (nsACString is from mozilla) but LoadLibrary expects a LPCWSTR. I googled a few options but nothing worked. Sort of out of my depth with strings so any references would also be appreciated.
It depends whether your nsACString (which I'll call str) holds ASCII or UTF-8 data:
ASCII
std::vector<WCHAR> wide(str.Length()+1);
std::copy(str.beginReading(), str.endReading(), wide.begin());
// I don't know whether nsACString has a terminating NUL, best to be sure
wide[str.Length()] = 0;
LPCWSTR newstr = &wide[0];
UTF-8
// get length, including nul terminator
int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
str.BeginReading(), str.Length(), 0, 0);
if (len == 0) panic(); // happens if input data is invalid UTF-8
// allocate enough space
std::vector<WCHAR> wide(len);
// convert string
MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
str.BeginReading(), str.Length(), &wide[0], len)
LPCWSTR newstr = &wide[0];
This allocates only as much space as is needed - if you want faster code that potentially uses more memory than necessary, you can replace the first two lines with:
int len = str.Length() + 1;
This works because a conversion from UTF-8 to WCHAR never results in more characters than there were bytes of input.
Firstly note: LoadLibrary need not accept a LPWSTR. Only LoadLibraryW does. You may call LoadLibraryA directly (passing a narrow LPCSTR) and it will perform the translation for you.
If you choose to do it yourself however, below is one possible example.
nsACString sFoo = ...; // Some string.
size_t len = sFoo.Length() + 1;
WCHAR *swFoo = new WCHAR[len];
MultiByteToWideChar(CP_ACP, 0, sFoo.BeginReading(), len - 1, swFoo, len);
swFoo[len - 1] = 0; // Null-terminate it.
...
delete [] swFoo;
nsACString a;
const char* pData;
PRUint32 iLen = NS_CStringGetData(a, &pData);