I've tried a few things and haven't yet been able to figure out how to pass const wchar_t *text (shown below) into the variable StoreText (also shown below). What am I doing wrong?
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
char* StoreText = text; //This is where error occurs
}
You cannot directly assign a wchar_t* to a char*, as they are different and incompatible data types.
If StoreText needs to point at the same memory address that text is pointing at, such as if you are planning on looping through the individual bytes of the text data, then a simple type-cast will suffice:
const char* StoreText = reinterpret_cast<const char*>(text); // keeps the const qualifier of text
However, if StoreText is expected to point to its own separate copy of the character data, then you would need to convert the wide character data into narrow character data instead, for example by:
using the WideCharToMultiByte() function on Windows:
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
int StoreTextLen = 1 + WideCharToMultiByte(CP_ACP, 0, text, len, NULL, 0, NULL, NULL);
std::vector<char> StoreTextBuffer(StoreTextLen);
WideCharToMultiByte(CP_ACP, 0, text, len, &StoreTextBuffer[0], StoreTextLen, NULL, NULL);
char* StoreText = &StoreTextBuffer[0];
//...
}
using the std::wcsrtombs() function:
#include <cwchar>
#include <vector>
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
std::mbstate_t state = std::mbstate_t();
int StoreTextLen = 1 + std::wcsrtombs(NULL, &text, 0, &state);
std::vector<char> StoreTextBuffer(StoreTextLen);
std::wcsrtombs(&StoreTextBuffer[0], &text, StoreTextLen, &state);
char *StoreText = &StoreTextBuffer[0];
//...
}
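using the std::wstring_convert class (C++11 and later; deprecated since C++17):
#include <codecvt>
#include <locale>
#include <string>
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
// std::codecvt has a protected destructor, so std::wstring_convert cannot own it directly.
// std::codecvt_utf8 is used here instead, which means the output is UTF-8.
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;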
std::string StoreTextBuffer = conv.to_bytes(text, text+len);
char *StoreText = &StoreTextBuffer[0];
//...
}
using similar conversions from the ICONV or ICU library.
First of all, for strings you should use std::wstring/std::string instead of raw pointers.
The C++11 Locale (http://en.cppreference.com/w/cpp/locale) library can be used to convert wide string to narrow string.
I wrote a wrapper function below and have used it for years. Hope it will be helpful to you, too.
#include <string>
#include <locale>
#include <codecvt>
std::string WstringToString(const std::wstring & wstr, const std::locale & loc /*= std::locale()*/)
{
std::string buf(wstr.size(), 0);
std::use_facet<std::ctype<wchar_t>>(loc).narrow(wstr.c_str(), wstr.c_str() + wstr.size(), '?', &buf[0]);
return buf;
}
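The opposite direction can be sketched the same way with ctype::widen. The name StringToWstring and the default of std::locale::classic() are assumptions made for this illustration, not part of the wrapper above. Like narrow, this maps one char to one wchar_t, so it is not a multibyte (for example UTF-8) decoder.

```cpp
#include <locale>
#include <string>

// Hypothetical counterpart to WstringToString.
// Widens each char independently using the ctype facet of the given locale.
std::wstring StringToWstring(const std::string& str,
                             const std::locale& loc = std::locale::classic())
{
    std::wstring buf(str.size(), L'\0');
    if (!str.empty()) {
        std::use_facet<std::ctype<wchar_t>>(loc).widen(
            str.data(), str.data() + str.size(), &buf[0]);
    }
    return buf;
}
```

For plain ASCII input the two helpers round-trip; anything beyond the basic character set depends on the locale that is passed in.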
wchar_t is a wide character. It is typically 16 or 32 bits per character, but this is system dependent.
char is a good ol' CHAR_BIT-sized data type. Again, how big it is is system dependent. It is always exactly one byte by definition, and that byte is most likely 8 bits, but I can't think of a reason why CHAR_BIT can't be 16 or 32, making char the same size as wchar_t.
If they are different sizes, a direct assignment is doomed. For example an 8 bit char will see 2 characters, and quite likely 2 completely unrelated characters, for every 1 character in a 16 bit wchar_t. This would be bad.
Second, even if they are the same size, they may have different encodings. For example, the numeric value assigned to the letter 'A' may be different for the char and the wchar_t. It could be 65 in char and 16640 in wchar_t.
For the text to make any sense in the other data type, it has to be translated to that type's encoding. std::wstring_convert will often perform this translation for you, but look into the locale library for more complicated translations. Both require a compiler supporting C++11 or better. In previous C++ Standards, a small army of functions provided conversion support. Third-party libraries such as Boost.Locale are helpful to unify and provide wider support.
Conversion functions are supplied by the operating system to translate between the encoding used by the OS and other common encodings.
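To see the size mismatch described above, one can copy the raw storage of a wide string into a byte vector and look at how many bytes each wchar_t occupies. This is only a sketch for inspection; RawBytes is a hypothetical helper name, and the byte order you get depends on the platform.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copies the object representation of `len` wide characters into bytes.
// No encoding conversion takes place; this is what a plain pointer cast exposes.
std::vector<unsigned char> RawBytes(const wchar_t* text, std::size_t len)
{
    std::vector<unsigned char> bytes(len * sizeof(wchar_t));
    if (!bytes.empty()) {
        std::memcpy(bytes.data(), text, bytes.size());
    }
    return bytes;
}
```

With a 16-bit wchar_t, RawBytes(L"AB", 2) holds four bytes, two of which are zero; code that treats them as a char string would stop at the first zero byte.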
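One option is a cast:
char* StoreText = (char*)text;
This compiles, but StoreText then points at the raw bytes of the wide string, not at converted text.
To actually convert the characters, you can use the wcstombs function from the cstdlib library:
char someText[12];
wcstombs(someText, text, sizeof(someText));
The last parameter must be the number of bytes available in the destination array. If the converted text fills all of them, no terminating null character is written.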
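A fixed 12-byte buffer truncates longer input, so a two-pass variant may be preferable: first ask for the required size, then convert. The sketch below uses the restartable std::wcsrtombs, because its behaviour with a null destination is specified by the standard. NarrowWithCLocale is a hypothetical name, and the result depends on the current C locale.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Converts a null-terminated wide string using the current C locale.
std::string NarrowWithCLocale(const wchar_t* text)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = text;
    // First pass: a null destination only computes the required byte count.
    const std::size_t needed = std::wcsrtombs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in the current locale");
    }
    std::string out(needed, '\0');
    if (needed != 0) {
        src = text;
        state = std::mbstate_t();
        // Second pass: write exactly `needed` bytes; std::string keeps its own terminator.
        std::wcsrtombs(&out[0], &src, needed, &state);
    }
    return out;
}
```

In the default "C" locale only the basic character set is guaranteed to convert; call std::setlocale first if the text contains anything else.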
I have a string in char* format and would like to convert it to wchar_t*, to pass to a Windows function.
Does this little function help?
#include <cstdlib>
size_t mbstowcs(wchar_t *out, const char *in, size_t size);
Also see the C++ reference
If you don't want to link against the C runtime library, use the MultiByteToWideChar API call, e.g:
const size_t WCHARBUF = 100;
const char szSource[] = "HELLO";
wchar_t wszDest[WCHARBUF];
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, szSource, -1, wszDest, WCHARBUF);
The Windows SDK specifies two functions in kernel32.lib for converting strings from and to a wide character set: MultiByteToWideChar() and WideCharToMultiByte().
Please note that, unlike the function names suggest, the narrow string does not necessarily use a multi-byte character set; it can be a simple ANSI string. Also note that those functions understand UTF-7 and UTF-8 as multi-byte character sets. The wide character set is always UTF-16.
schnaader's answer uses the conversion defined by the current C locale; this one uses the C++ locale interface (who said that it was simple?):
std::wstring widen(std::string const& s, std::locale loc)
{
std::char_traits<wchar_t>::state_type state = { 0 };
typedef std::codecvt<wchar_t, char, std::char_traits<wchar_t>::state_type >
ConverterFacet;
ConverterFacet const& converter(std::use_facet<ConverterFacet>(loc));
char const* nextToRead = s.data();
wchar_t buffer[BUFSIZ];
wchar_t* nextToWrite;
std::codecvt_base::result result;
std::wstring wresult;
while ((result
= converter.in
(state,
nextToRead, s.data()+s.size(), nextToRead,
buffer, buffer+sizeof(buffer)/sizeof(*buffer), nextToWrite))
== std::codecvt_base::partial)
{
wresult.append(buffer, nextToWrite);
}
if (result == std::codecvt_base::error) {
throw std::runtime_error("Encoding error");
}
wresult.append(buffer, nextToWrite);
return wresult;
}
I have some code that reads in a Unicode code point (escaped in a string as 0xF00).
Since I'm using Boost, I'm wondering whether the following is the best (and correct) approach:
unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);
?
You can do this with the standard library using std::wstring_convert to convert UTF-32 (code points) to UTF-8:
#include <locale>
#include <codecvt>
std::string codepoint_to_utf8(char32_t codepoint) {
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
return convert.to_bytes(&codepoint, &codepoint + 1);
}
This returns a std::string whose size is 1, 2, 3 or 4 depending on how large codepoint is. It will throw a std::range_error if the code point is too large (> 0x10FFFF, the max unicode code point).
Your version with boost seems to be doing the same thing. The documentation says that the utf_to_utf function converts a UTF encoding to another one, in this case 32 to 8. If you use char32_t, it will be a "correct" approach, that will work on systems where unsigned int isn't the same size as char32_t.
// The function also converts the unsigned int to char32_t
std::string codepoint_to_utf8(char32_t codepoint) {
return boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint + 1);
}
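For reference, the transcoding these library calls perform for a single code point is small enough to write by hand. The following is a minimal sketch, not what either library does internally; EncodeUtf8 is a hypothetical name, and rejecting surrogates is a choice made here for illustration.

```cpp
#include <stdexcept>
#include <string>

// Encodes one Unicode scalar value as 1 to 4 UTF-8 bytes.
std::string EncodeUtf8(char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::range_error("not a Unicode scalar value");
    }
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For the code point from the question, EncodeUtf8(0xF00) yields the three bytes E0 BC 80.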
As mentioned, a codepoint in this form is (conveniently) UTF-32, so what you're looking for is a transcoding.
For a solution that does not rely on functions deprecated since C++17, and isn't really ugly, and which also does not require hefty third-party libraries, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.
It's going to look something like this:
const uint32_t codepoint{0xF00};
std::vector<unsigned char> result;
try
{
utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
// something
}
(There's also a utf8::unchecked::utf32to8, if you're allergic to exceptions.)
(And consider reading into vector<char8_t> or std::u8string, since C++20).
(Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)
I tend to use this library in projects until I need something a little heavier for other purposes (at which point I'll typically switch to ICU).
C++17 deprecated a number of convenience functions for processing UTF. Unfortunately, the last remaining ones will be deprecated in C++20 (*). That being said, std::codecvt is still valid. From C++11 to C++17, you can use a std::codecvt<char32_t, char, mbstate_t>; starting with C++20 it will be std::codecvt<char32_t, char8_t, mbstate_t>.
Here is some code converting a code point (up to 0x10FFFF) to UTF-8:
// codepoint is the codepoint to convert
// buf is a char array of size sz (should be at least 4 to convert any code point)
// on return sz is the used size of buf for the UTF-8 converted string
// the return value is the return value of std::codecvt::out (0 for ok)
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, size_t& sz) {
std::locale loc("");
const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);
std::mbstate_t state{}; // value-initialize; the layout of mbstate_t is implementation-defined
const char32_t * last_in;
char *last_out;
std::codecvt_base::result res = cvt.out(state, &codepoint, 1+&codepoint, last_in,
buf, buf+sz, last_out);
sz = last_out - buf;
return res;
}
(*) std::codecvt will still exist in C++20. Simply the default instantiations will no longer be std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t> but std::codecvt<char16_t, char8_t, std::mbstate_t> and std::codecvt<char32_t, char8_t, std::mbstate_t> (note char8_t instead of char)
After reading about the unsteady state of UTF-8 support in C++, I stumbled upon the corresponding C support c32rtomb, which looks promising and likely won't be deprecated any time soon.
#include <clocale>
#include <cuchar>
#include <climits>
#include <string>
size_t to_utf8(char32_t codepoint, char *buf)
{
// setlocale returns the name of the *new* locale, so query and copy the current one first
const std::string previous = std::setlocale(LC_ALL, nullptr);
std::setlocale(LC_ALL, "en_US.utf8");
std::mbstate_t state{};
std::size_t len = std::c32rtomb(buf, codepoint, &state); // (size_t)-1 on failure
std::setlocale(LC_ALL, previous.c_str());
return len;
}
Usage would then be
char32_t codepoint{0xfff};
char buf[MB_LEN_MAX]{};
size_t len = to_utf8(codepoint, buf);
If your application's current locale is already UTF-8, you might omit the back and forth call to setlocale of course.
How can/should I cast from an unsigned char array to a wide character array (wchar_t or std::wstring)? And how can I convert it back to an unsigned char array?
Or can OpenSSL produce a widechar hash from SHA256_Update?
Try the following:
#include <cstdlib>
using namespace std;
unsigned char* temp; // pointer to initial data
// memory allocation and filling
// calculation of string length
wchar_t* wData = new wchar_t[len+1];
// mbstowcs expects const char*; len + 1 leaves room for the terminating null
mbstowcs(wData, reinterpret_cast<const char*>(temp), len + 1);
For the inverse conversion, see the example here, or use wcstombs, which takes the same kind of arguments with source and destination swapped.
The WideCharToMultiByte function can also be useful for Windows development, and setting the locale should be considered as well (see some examples).
UPDATE:
To calculate the length of the string pointed to by unsigned char* temp, the following approach can be used:
const char* ccp = reinterpret_cast<const char*>(temp);
size_t len = mbstowcs(nullptr, &ccp[0], 0);
std::setlocale(LC_ALL, "en_US.utf8");
const char* mbstr = "hello";
std::mbstate_t state = std::mbstate_t();
// calc length
int len = 1 + std::mbsrtowcs(nullptr, &mbstr, 0, &state);
std::vector<wchar_t> wstr(len);
std::mbsrtowcs(&wstr[0], &mbstr, wstr.size(), &state);
How can/should I cast from an unsigned char array to a wide character array (wchar_t or std::wstring)? And how can I convert it back to an unsigned char array?
They are completely different, so you should not be doing it under most circumstances. If you provide a specific question with real code, then we can probably tell you more.
Or can OpenSSL produce a widechar hash from SHA256_Update?
No, OpenSSL cannot do this. It produces hashes which are binary strings composed of bytes, not chars. You are responsible for presentation details, like narrow/wide character sets or base64 encoding.
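As one example of such a presentation step, a digest can be rendered as a wide hexadecimal string. This sketch assumes you already have the raw digest bytes (for SHA-256, 32 of them); ToHexWide is a hypothetical helper, not an OpenSSL function.

```cpp
#include <cstddef>
#include <string>

// Renders `len` raw bytes as lowercase hexadecimal wide characters.
std::wstring ToHexWide(const unsigned char* digest, std::size_t len)
{
    static const wchar_t kDigits[] = L"0123456789abcdef";
    std::wstring hex;
    hex.reserve(len * 2);
    for (std::size_t i = 0; i < len; ++i) {
        hex += kDigits[digest[i] >> 4];     // high nibble
        hex += kDigits[digest[i] & 0x0F];   // low nibble
    }
    return hex;
}
```

Because every byte maps to two ASCII digits, no locale or encoding conversion is involved, and the original bytes can be recovered by parsing the digits back.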
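How can I convert a Unicode string to a char* or char* const in Embarcadero C++?
String text = "Hello world";
AnsiString ansi(text); // keep the converted copy alive for as long as txt is used
char *txt = ansi.c_str();
The older text.t_str() is now AnsiString(text).c_str(). Note that the pointer is only valid while the AnsiString object exists, so do not keep the result of c_str() called on a temporary.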
"Unicode string" really isn't specific enough to know what your source data is, but you probably mean 'UTF-16 string stored as wchar_t array' since that's what most people who don't know the correct terminology use.
"char*" also isn't enough to know what you want to target, although maybe "embarcadero" has some convention. I'll just assume you want UTF-8 data unless you mention otherwise.
Also I'll limit my example to what works in VS2010
// your "Unicode" string
wchar_t const * utf16_string = L"Hello, World!";
// #include <codecvt>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> convert;
std::string utf8_string = convert.to_bytes(utf16_string);
This assumes that wchar_t strings are UTF-16, as is the case on Windows, but otherwise is portable code.
You can legally reinterpret any array as an array of char. So if your Unicode data comes in 4-byte code units like
char32_t data[100];
then you can access it as a char array:
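// unsigned char avoids sign extension when a byte value is >= 0x80
unsigned char const * p = reinterpret_cast<unsigned char const*>(data);
for (std::size_t i = 0; i != sizeof data; ++i)
{
std::printf("Byte %03zu is 0x%02X.\n", i, static_cast<unsigned>(p[i]));
}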
That way, you can examine the individual bytes of your Unicode data one by one.
(That has of course nothing to do with converting the encoding of your text. For that, use a library like iconv or ICU.)
If you work with Windows:
//#include <windows.h>
u16string utext = u"объява";
char text[0x100];
WideCharToMultiByte(CP_UTF8, 0, reinterpret_cast<const wchar_t*>(utext.c_str()), -1, text, sizeof(text), NULL, NULL); // the output size is in bytes and must not be -1
cout << text;
We can't use std::wstring_convert because it is not available in MinGW 4.9.2.