How to convert a codepoint to utf-8? - c++

I have some code that reads in a Unicode code point (as escaped in a string, 0xF00).
Since I'm using Boost, I'm wondering if the following is the best (and correct) approach:
unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);
?

You can do this with the standard library using std::wstring_convert (deprecated in C++17, but still available) to convert UTF-32 (code points) to UTF-8:
#include <locale>
#include <codecvt>
#include <string>

std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}
This returns a std::string whose size is 1, 2, 3 or 4 depending on how large codepoint is. It will throw a std::range_error if the code point is too large (> 0x10FFFF, the maximum Unicode code point).
Your version with Boost appears to do the same thing. The documentation says that the utf_to_utf function converts one UTF encoding to another, in this case 32 to 8. If you use char32_t, it will be a "correct" approach that also works on systems where unsigned int isn't the same size as char32_t.
// Calling this with an unsigned int implicitly converts it to char32_t
std::string codepoint_to_utf8(char32_t codepoint) {
    return boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint + 1);
}

As mentioned, a codepoint in this form is (conveniently) UTF-32, so what you're looking for is a transcoding.
For a solution that does not rely on functions deprecated since C++17, isn't ugly, and does not require hefty third-party libraries, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.
It's going to look something like this:
const uint32_t codepoint{0xF00};
std::vector<unsigned char> result;
try
{
    utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
    // something
}
(There's also a utf8::unchecked::utf32to8, if you're allergic to exceptions.)
(And consider reading into vector<char8_t> or std::u8string, since C++20).
(Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)
I tend to use this library in projects until I need something a little heavier for other purposes (at which point I'll typically switch to ICU).
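If even a header-only dependency is unwelcome, the encoding itself is simple enough to write by hand. This sketch follows the UTF-8 bit layout from RFC 3629 directly (the function name encode_utf8 is my own, not from any library):

```cpp
#include <string>
#include <stdexcept>
#include <cstdint>

// Hand-rolled UTF-8 encoder:
// 1 byte:  0xxxxxxx
// 2 bytes: 110xxxxx 10xxxxxx
// 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
// 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::range_error("surrogates are not valid code points");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::range_error("code point out of range");
    }
    return out;
}
```

The surrogate check mirrors what utf8::utf32to8 rejects with utf8::invalid_code_point.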

C++17 has deprecated a number of convenience functions for processing UTF. Unfortunately, the last remaining ones are deprecated in C++20 (*). That being said, std::codecvt is still valid. From C++11 to C++17 you can use a std::codecvt<char32_t, char, mbstate_t>; starting with C++20 it will be std::codecvt<char32_t, char8_t, mbstate_t>.
Here is some code converting a code point (up to 0x10FFFF) to UTF-8:
// codepoint is the code point to convert
// buf is a char array of size sz (sz should be at least 4 to convert any code point)
// on return, sz is the used size of buf for the UTF-8 converted string
// the return value is the return value of std::codecvt::out (0 for ok)
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, size_t& sz) {
    std::locale loc("");
    const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
        std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);
    std::mbstate_t state{};
    const char32_t *last_in;
    char *last_out;
    std::codecvt_base::result res = cvt.out(state, &codepoint, 1 + &codepoint, last_in,
                                            buf, buf + sz, last_out);
    sz = last_out - buf;
    return res;
}
(*) std::codecvt itself will still exist in C++20. It is just that the default instantiations will no longer be std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t>, but std::codecvt<char16_t, char8_t, std::mbstate_t> and std::codecvt<char32_t, char8_t, std::mbstate_t> (note char8_t instead of char).

After reading about the unsteady state of UTF-8 support in C++, I stumbled upon the corresponding C support, c32rtomb, which looks promising and likely won't be deprecated any time soon.
#include <clocale>
#include <cuchar>
#include <climits>
#include <string>

size_t to_utf8(char32_t codepoint, char *buf)
{
    // Copy the current locale name first: the pointer returned by
    // setlocale may be invalidated by the next setlocale call.
    std::string saved = std::setlocale(LC_ALL, nullptr);
    std::setlocale(LC_ALL, "en_US.utf8");
    std::mbstate_t state{};
    std::size_t len = std::c32rtomb(buf, codepoint, &state);
    std::setlocale(LC_ALL, saved.c_str());
    return len;
}
Usage would then be
char32_t codepoint{0xfff};
char buf[MB_LEN_MAX]{};
size_t len = to_utf8(codepoint, buf);
If your application's current locale is already UTF-8, you might omit the back and forth call to setlocale of course.

Related

C++: Convert arg[0] to a wchar_t [duplicate]

I have a string in char* format and would like to convert it to wchar_t*, to pass to a Windows function.
Does this little function help?
#include <cstdlib>
size_t mbstowcs(wchar_t *out, const char *in, size_t size);
Also see the C++ reference
If you don't want to link against the C runtime library, use the MultiByteToWideChar API call, e.g:
const size_t WCHARBUF = 100;
const char szSource[] = "HELLO";
wchar_t wszDest[WCHARBUF];
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, szSource, -1, wszDest, WCHARBUF);
The Windows SDK specifies two functions in kernel32.lib for converting strings from and to a wide character set: MultiByteToWideChar() and WideCharToMultiByte().
Please note that, unlike the function names suggest, the string does not necessarily use a multi-byte character set, but can be a simple ANSI string. Also note that those functions understand UTF-7 and UTF-8 as multi-byte character sets. The wide character set is always UTF-16.
schnaader's answer uses the conversion defined by the current C locale; this one uses the C++ locale interface (who said that it was simple?)
std::wstring widen(std::string const& s, std::locale loc)
{
    std::char_traits<wchar_t>::state_type state = { 0 };
    typedef std::codecvt<wchar_t, char, std::char_traits<wchar_t>::state_type>
        ConverterFacet;
    ConverterFacet const& converter(std::use_facet<ConverterFacet>(loc));
    char const* nextToRead = s.data();
    wchar_t buffer[BUFSIZ];
    wchar_t* nextToWrite;
    std::codecvt_base::result result;
    std::wstring wresult;
    while ((result = converter.in(state,
                                  nextToRead, s.data() + s.size(), nextToRead,
                                  buffer, buffer + sizeof(buffer) / sizeof(*buffer),
                                  nextToWrite))
           == std::codecvt_base::partial)
    {
        wresult.append(buffer, nextToWrite);
    }
    if (result == std::codecvt_base::error) {
        throw std::runtime_error("Encoding error");
    }
    wresult.append(buffer, nextToWrite);
    return wresult;
}

Take wchar_t and put into char?

I've tried a few things and haven't yet been able to figure out how to get const wchar_t *text (shown below) to pass into the variable StoreText (shown below). What am I doing wrong?
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
char* StoreText = text; //This is where error occurs
}
You cannot directly assign a wchar_t* to a char*, as they are different and incompatible data types.
If StoreText needs to point at the same memory address that text is pointing at, such as if you are planning on looping through the individual bytes of the text data, then a simple type-cast will suffice:
char* StoreText = (char*)text;
However, if StoreText is expected to point to its own separate copy of the character data, then you would need to convert the wide character data into narrow character data instead. Such as by:
using the WideCharToMultiByte() function on Windows:
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    int StoreTextLen = 1 + WideCharToMultiByte(CP_ACP, 0, text, len, NULL, 0, NULL, NULL);
    std::vector<char> StoreTextBuffer(StoreTextLen);
    WideCharToMultiByte(CP_ACP, 0, text, len, &StoreTextBuffer[0], StoreTextLen, NULL, NULL);
    char* StoreText = &StoreTextBuffer[0];
    //...
}
using the std::wcsrtombs() function:
#include <cwchar>

void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    std::mbstate_t state = std::mbstate_t();
    int StoreTextLen = 1 + std::wcsrtombs(NULL, &text, 0, &state);
    std::vector<char> StoreTextBuffer(StoreTextLen);
    std::wcsrtombs(&StoreTextBuffer[0], &text, StoreTextLen, &state);
    char *StoreText = &StoreTextBuffer[0];
    //...
}
using the std::wstring_convert class (C++11 and later):
#include <locale>

// std::codecvt has a protected destructor, so wstring_convert needs a
// destructible subclass of it
struct destructible_codecvt : std::codecvt<wchar_t, char, std::mbstate_t> {
    ~destructible_codecvt() = default;
};

void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    std::wstring_convert<destructible_codecvt> conv;
    std::string StoreTextBuffer = conv.to_bytes(text, text + len);
    char *StoreText = &StoreTextBuffer[0];
    //...
}
using similar conversions from the ICONV or ICU library.
First of all, for strings you should use std::wstring/std::string instead of raw pointers.
The C++11 Locale (http://en.cppreference.com/w/cpp/locale) library can be used to convert wide string to narrow string.
I wrote a wrapper function below and have used it for years. Hope it will be helpful to you, too.
#include <string>
#include <locale>
#include <codecvt>

std::string WstringToString(const std::wstring & wstr, const std::locale & loc /*= std::locale()*/)
{
    std::string buf(wstr.size(), 0);
    std::use_facet<std::ctype<wchar_t>>(loc).narrow(wstr.c_str(), wstr.c_str() + wstr.size(), '?', &buf[0]);
    return buf;
}
wchar_t is a wide character. It is typically 16 or 32 bits per character, but this is system dependent.
char is a good ol' CHAR_BIT-sized data type. Again, how big it is is system dependent. Most likely it's going to be one byte, but I can't think of a reason why CHAR_BIT can't be 16 or 32 bits, making it the same size as wchar_t.
If they are different sizes, a direct assignment is doomed. For example an 8 bit char will see 2 characters, and quite likely 2 completely unrelated characters, for every 1 character in a 16 bit wchar_t. This would be bad.
Second, even if they are the same size, they may have different encodings. For example, the numeric value assigned to the letter 'A' may be different for the char and the wchar_t. It could be 65 in char and 16640 in wchar_t.
To make any sense of each other's data, char and wchar_t will need to be translated to the other's encoding. std::wstring_convert will often perform this translation for you, but look into the locale library for more complicated translations. Both require a compiler supporting C++11 or better. In previous C++ standards, a small army of functions provided conversion support. Third-party libraries such as Boost.Locale are helpful to unify and provide wider support.
Conversion functions are supplied by the operating system to translate between the encoding used by the OS and other common encodings.
You have to do a cast; you can do this:
char* StoreText = (char*)text;
I think this may work.
But you can use the wcstombs function from <cstdlib>:
char someText[12];
wcstombs(someText, text, 12);
The last parameter must be the number of bytes available in the destination array.

convert from char to char16_t

My config:
Compiler: gnu gcc 4.8.2
I compile with C++11
platform/OS: Linux 64bit Ubuntu 14.04.1 LTS
I have this method:
static inline std::u16string StringtoU16(const std::string &str) {
    const size_t si = strlen(str.c_str());
    char16_t cstr[si+1];
    memset(cstr, 0, (si+1)*sizeof(char16_t));
    const char* constSTR = str.c_str();
    mbstate_t mbs;
    memset(&mbs, 0, sizeof(mbs)); // set shift state to the initial state
    size_t ret = mbrtoc16(cstr, constSTR, si, &mbs);
    std::u16string wstr(cstr);
    return wstr;
}
I want a conversion from char to char16_t, pretty much (via std::string and std::u16string to facilitate memory management), but regardless of the size of the input variable str, it will return the first character only. If str = "Hello" it will return "H". I am not sure what is wrong with my method. The value of ret is 1.
I didn't know mbrtoc16() can only handle one character at a time... what a turtle. Here, then, is the code I came up with, which works like a charm:
static inline std::u16string StringtoU16(const std::string &str) {
    std::u16string wstr = u"";
    char16_t c16str[3] = u"\0";
    mbstate_t mbs;
    for (const auto& it : str) {
        memset(&mbs, 0, sizeof(mbs)); // set shift state to the initial state
        memset(c16str, 0, sizeof(c16str)); // sizeof, not 3: each char16_t is 2 bytes
        mbrtoc16(c16str, &it, 3, &mbs);
        wstr.append(std::u16string(c16str));
    }
    return wstr;
}
for its counterpart (when one way is needed, sooner or later the other way will be needed):
static inline std::string U16toString(const std::u16string &wstr) {
    std::string str = "";
    char cstr[MB_LEN_MAX + 1]; // c16rtomb can write up to MB_CUR_MAX bytes
    mbstate_t mbs;
    for (const auto& it : wstr) {
        memset(&mbs, 0, sizeof(mbs)); // set shift state to the initial state
        memset(cstr, 0, sizeof(cstr));
        c16rtomb(cstr, it, &mbs);
        str.append(std::string(cstr));
    }
    return str;
}
Be aware that c16rtomb will be lossy if a character cannot be converted from char16_t to char (it might end up printing a bunch of '?' depending on your system), but it will work without complaint.
mbrtoc16() converts a single character, and returns the number of bytes that were consumed to produce the char16_t.
In order to effect this conversion, the general approach is:
A) call mbrtoc16().
B) save the converted character, skip the number of characters that were consumed.
C) Have you consumed the entire string you wanted to convert? If no, go back to step A.
Additionally, there could be conversion errors. You must check the return value from mbrtoc16() and do whatever you want to do to handle conversion errors (the original multibyte string is not valid).
Finally, you should not assume that the length of the char16_t string will be equal to or less than the length of the multibyte string. It probably is; but in some weird locale, I suppose it could, theoretically, be more.

Convert std::string to Unicode in Linux

EDIT I modified the question after realizing it was wrong to begin with.
I'm porting part of a C# application to Linux, where I need to get the bytes of a UTF-16 string:
string myString = "ABC";
byte[] bytes = Encoding.Unicode.GetBytes(myString);
So that the bytes array is now:
"65 00 66 00 67 00" (bytes)
How can I achieve the same in C++ on Linux? I have myString defined as std::string, and it seems that wchar_t on Linux is 4 bytes?
Your question isn't really clear, but I'll try to clear up some confusion.
Introduction
The status of character-set handling in C (and what was inherited by C++) after the '95 amendment to the C standard:
the character set used is given by the current locale
wchar_t is meant to store a code point
char is meant to store a multibyte encoded form (one constraint, for instance, is that characters in the basic character set must be encoded in one byte)
string literals are encoded in an implementation-defined manner. If they use characters outside of the basic character set, you can't assume they are valid in all locales.
Thus with a 16-bit wchar_t you are restricted to the BMP. Using UTF-16 surrogates is not compliant, but I think MS and IBM are more or less forced to do this because they believed Unicode when it said it would forever be a 16-bit charset. Those who delayed their Unicode support tend to use a 32-bit wchar_t.
Newer standards don't change much. Mostly there are literals for UTF-8, UTF-16 and UTF-32 encoded strings, and there are types for 16-bit and 32-bit characters. There is little or no additional support for Unicode in the standard libraries.
How to do the transformation of one encoding to the other
You have to be in a locale which uses Unicode. Hopefully
std::locale::global(std::locale(""));
will be enough for that. If not, your environment is not properly set up (or it is set up for another charset, and assuming Unicode won't be a service to your user).
C Style
Use the wcstombs and mbstowcs functions. Here is an example of what you asked for.
std::string narrow(std::wstring const& s)
{
    std::vector<char> result(4*s.size() + 1);
    size_t used = wcstombs(&result[0], s.data(), result.size());
    assert(used < result.size());
    return result.data();
}
C++ Style
The codecvt facet of the locale provides the needed functionality. The advantage is that you don't have to change the global locale to use it. The inconvenience is that the usage is more complex.
#include <locale>
#include <iostream>
#include <string>
#include <vector>
#include <assert.h>
#include <iomanip>
std::string narrow(std::wstring const& s,
                   std::locale loc = std::locale())
{
    std::vector<char> result(4*s.size() + 1);
    wchar_t const* fromNext;
    char* toNext;
    mbstate_t state = {0};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
            .out(state, &s[0], &s[s.size()], fromNext,
                 &result[0], &result[result.size()], toNext);
    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[result.size()]);
    assert(convResult == std::codecvt_base::ok);
    *toNext = '\0';
    return &result[0];
}

std::wstring widen(std::string const& s,
                   std::locale loc = std::locale())
{
    std::vector<wchar_t> result(s.size() + 1);
    char const* fromNext;
    wchar_t* toNext;
    mbstate_t state = {0};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
            .in(state, &s[0], &s[s.size()], fromNext,
                &result[0], &result[result.size()], toNext);
    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[result.size()]);
    assert(convResult == std::codecvt_base::ok);
    *toNext = L'\0';
    return &result[0];
}
You should replace the assertions with better error handling.
BTW, this is standard C++ and doesn't assume Unicode, except for the computation of the size of result; you can do better by checking convResult, which can indicate a partial conversion.
The easiest way is to grab a small library, such as UTF8 CPP and do something like:
utf8::utf8to16(line.begin(), line.end(), back_inserter(utf16line));
I usually use the UnicodeConverter class from the Poco C++ libraries. If you don't want the dependency then you can have a look at the code.

How do I convert wchar_t* to std::string?

I changed my class to use std::string (based on the answer I got here), but a function I have returns wchar_t *. How do I convert it to std::string?
I tried this:
std::string test = args.OptionArg();
but it says error C2440: 'initializing' : cannot convert from 'wchar_t *' to 'std::basic_string<_Elem,_Traits,_Ax>'
std::wstring ws( args.OptionArg() );
std::string test( ws.begin(), ws.end() );
You can convert a wide char string to an ASCII string using the following function:
#include <locale>
#include <sstream>
#include <string>
std::string ToNarrow( const wchar_t *s, char dfault = '?',
                      const std::locale& loc = std::locale() )
{
    std::ostringstream stm;
    while( *s != L'\0' ) {
        stm << std::use_facet< std::ctype<wchar_t> >( loc ).narrow( *s++, dfault );
    }
    return stm.str();
}
Be aware that this will just replace any wide character for which an equivalent ASCII character doesn't exist with the dfault parameter; it doesn't convert from UTF-16 to UTF-8. If you want to convert to UTF-8 use a library such as ICU.
This is an old question, but if it's the case that you're not really seeking conversions but rather using the TCHAR stuff from Microsoft to be able to build both ASCII and Unicode, you could recall that std::string is really
typedef std::basic_string<char> string
So we could define our own typedef, say
#include <string>

namespace magic {
    typedef std::basic_string<TCHAR> string;
}
Then you could use magic::string with TCHAR, LPCTSTR, and so forth
It's rather disappointing that none of the answers given to this old question addresses the problem of converting wide strings into UTF-8 strings, which is important in non-English environments.
Here's an example that works and may be used as a hint to construct custom converters. It is based on example code from cppreference.com.
#include <iostream>
#include <clocale>
#include <string>
#include <cstdlib>
#include <array>
std::string convert(const std::wstring& wstr)
{
    const int BUFF_SIZE = 7;
    if (MB_CUR_MAX >= BUFF_SIZE) throw std::invalid_argument("BUFF_SIZE too small");
    std::string result;
    std::wctomb(nullptr, 0); // reset the conversion state
    for (const wchar_t wc : wstr)
    {
        std::array<char, BUFF_SIZE> buffer;
        const int ret = std::wctomb(buffer.data(), wc);
        if (ret < 0) throw std::invalid_argument("inconvertible wide characters in the current locale");
        buffer[ret] = '\0'; // make 'buffer' contain a C-style string
        result = result + std::string(buffer.data());
    }
    return result;
}

int main()
{
    auto loc = std::setlocale(LC_ALL, "en_US.utf8"); // UTF-8
    if (loc == nullptr) throw std::logic_error("failed to set locale");
    std::wstring wstr = L"aąß水𝄋-扫描-€𐍈\u00df\u6c34\U0001d10b";
    std::cout << convert(wstr) << "\n";
}
This prints, as expected:
aąß水𝄋-扫描-€𐍈ß水𝄋
Explanation
7 seems to be the minimal secure value of the buffer size, BUFF_SIZE. This includes 4 as the maximum number of UTF-8 bytes encoding a single character, 2 for a possible "shift sequence", and 1 for the trailing '\0'.
MB_CUR_MAX is a run-time variable, so static_assert is not usable here
Each wide character is translated into its char representation using std::wctomb
This conversion makes sense only if the current locale allows multi-byte representations of a character
For this to work, the application needs to set the proper locale. en_US.utf8 seems to be sufficiently universal (available on most machines). In Linux, available locales can be queried in the console via locale -a command.
Critique of the most upvoted answer
The most upvoted answer,
std::wstring ws( args.OptionArg() );
std::string test( ws.begin(), ws.end() );
works well only when the wide characters represent ASCII characters - but these are not what wide characters were designed for. In this solution, the converted string contains one char per source wide char, so ws.size() == test.size(). Thus it loses information from the original wstring and produces strings that cannot be interpreted as proper UTF-8 sequences. For example, on my machine the string resulting from this simplistic conversion of "ĄŚĆII" prints as "ZII", even though its size is 5 (and should be 8).
You could just use wstring and keep everything in Unicode
just for fun :-):
const wchar_t* val = L"hello mfc";
std::string test((LPCTSTR)CString(val));
The following code is more concise:
wchar_t wstr[500] = L"hello"; // the source must be initialized before converting
char string[500];
sprintf(string, "%ls", wstr);