Convert std::string to Unicode in Linux - c++

EDIT I modified the question after realizing it was wrong to begin with.
I'm porting part of a C# application to Linux, where I need to get the bytes of a UTF-16 string:
string myString = "ABC";
byte[] bytes = Encoding.Unicode.GetBytes(myString);
So that the bytes array is now:
"65 00 66 00 67 00" (bytes)
How can I achieve the same in C++ on Linux? I have myString defined as a std::string, and it seems that wchar_t (the element type of std::wstring) is 4 bytes on Linux?

Your question isn't really clear, but I'll try to clear up some confusion.
Introduction
Here is the status of character-set handling in C (and inherited by C++) after the 1995 amendment to the C standard:
the character set used is given by the current locale
wchar_t is meant to store a code point
char is meant to store a multibyte encoded form (one constraint, for instance, is that characters in the basic character set must be encoded in one byte)
string literals are encoded in an implementation-defined manner. If they use characters outside of the basic character set, you can't assume they are valid in all locales.
Thus with a 16-bit wchar_t you are restricted to the BMP. Using the surrogate pairs of UTF-16 is not compliant, but I think MS and IBM were more or less forced to do this because they believed the Unicode Consortium when it said Unicode would forever be a 16-bit charset. Those who delayed their Unicode support tend to use a 32-bit wchar_t.
Newer standards don't change much. Mostly there are literals for UTF-8, UTF-16 and UTF-32 encoded strings, and there are character types for 16-bit and 32-bit code units. There is little or no additional support for Unicode in the standard library.
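For illustration, here is a minimal sketch of those newer literals and character types (C++11 and later):
// Minimal sketch of the C++11 literal forms; nothing here depends on the locale.
auto u8s = u8"ABC";           // UTF-8 literal (const char* before C++20, const char8_t* since)
const char16_t* u16 = u"ABC"; // UTF-16 literal, 16-bit code units
const char32_t* u32 = U"ABC"; // UTF-32 literal, one code unit per code point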
How to do the transformation of one encoding to the other
You have to be in a locale which uses Unicode. Hopefully
std::locale::global(std::locale(""));
will be enough for that. If not, your environment is not properly set up (or it is set up for another charset, and assuming Unicode would not be a service to your user).
C Style
Use the wcstombs and mbstowcs functions. Here is an example of what you asked for:
#include <cstdlib>
#include <cassert>
#include <string>
#include <vector>

std::string narrow(std::wstring const& s)
{
    std::vector<char> result(4 * s.size() + 1);
    // wcstombs converts according to the current (global) locale
    std::size_t used = std::wcstombs(&result[0], s.data(), result.size());
    assert(used != static_cast<std::size_t>(-1) && used < result.size());
    return result.data();
}
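For the opposite direction, here is a sketch using mbstowcs, with the same current-locale caveat, the same headers, and the same assert-level error handling as narrow() above:
std::wstring widen(std::string const& s)
{
    std::vector<wchar_t> result(s.size() + 1);
    // mbstowcs converts from the current (global) locale's multibyte encoding
    std::size_t used = std::mbstowcs(&result[0], s.data(), result.size());
    assert(used != static_cast<std::size_t>(-1));
    return result.data();
}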
C++ Style
The codecvt facet of the locale provides the needed functionality. The advantage is that you don't have to change the global locale to use it. The inconvenience is that the usage is more complex.
#include <locale>
#include <iostream>
#include <string>
#include <vector>
#include <cassert>
#include <iomanip>

std::string narrow(std::wstring const& s,
                   std::locale loc = std::locale())
{
    std::vector<char> result(4 * s.size() + 1);
    wchar_t const* fromNext;
    char* toNext;
    std::mbstate_t state{};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
              .out(state, s.data(), s.data() + s.size(), fromNext,
                   result.data(), result.data() + result.size(), toNext);
    assert(fromNext == s.data() + s.size());
    assert(toNext != result.data() + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = '\0';
    return result.data();
}

std::wstring widen(std::string const& s,
                   std::locale loc = std::locale())
{
    std::vector<wchar_t> result(s.size() + 1);
    char const* fromNext;
    wchar_t* toNext;
    std::mbstate_t state{};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
              .in(state, s.data(), s.data() + s.size(), fromNext,
                  result.data(), result.data() + result.size(), toNext);
    assert(fromNext == s.data() + s.size());
    assert(toNext != result.data() + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = L'\0';
    return result.data();
}
You should replace the assertions with better error handling.
BTW, this is standard C++ and doesn't assume Unicode, except for the computation of the size of result (you can do better by checking convResult, which can indicate a partial conversion).

The easiest way is to grab a small library, such as UTF8 CPP and do something like:
utf8::utf8to16(line.begin(), line.end(), back_inserter(utf16line));
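A slightly fuller sketch of the same call, assuming the header-only utf8.h from UTF8 CPP is on your include path (the function name here is mine):
#include <string>
#include <iterator>
#include "utf8.h"

// Convert a UTF-8 encoded std::string into a string of UTF-16 code units.
std::u16string to_utf16(const std::string& line)
{
    std::u16string utf16line;
    utf8::utf8to16(line.begin(), line.end(), std::back_inserter(utf16line));
    return utf16line;
}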

I usually use the UnicodeConverter class from the Poco C++ libraries. If you don't want the dependency then you can have a look at the code.
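For reference, a usage sketch; I'm going from memory on the exact Poco::UnicodeConverter overloads, so double-check against your Poco version:
#include <string>
#include "Poco/UnicodeConverter.h"

// Convert a UTF-8 std::string into a std::wstring (UTF-16 or UTF-32 depending on wchar_t's width).
std::wstring to_wide(const std::string& utf8)
{
    std::wstring wide;
    Poco::UnicodeConverter::toUTF16(utf8, wide); // there is also toUTF8() for the other direction
    return wide;
}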

Related

How to convert a codepoint to utf-8?

I have some code that reads in a Unicode code point (as escaped in a string, 0xF00).
Since I'm using Boost, I'm wondering whether the following is the best (and correct) approach:
unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);
?
You can do this with the standard library using std::wstring_convert to convert UTF-32 (code points) to UTF-8:
#include <locale>
#include <codecvt>
std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}
This returns a std::string whose size is 1, 2, 3 or 4 depending on how large codepoint is. It will throw a std::range_error if the code point is too large (> 0x10FFFF, the maximum Unicode code point).
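For example, a quick check with the euro sign (U+20AC, which is three bytes in UTF-8):
#include <cassert>

int main() {
    // U+20AC (euro sign) encodes to the three bytes E2 82 AC in UTF-8.
    assert(codepoint_to_utf8(0x20AC) == "\xE2\x82\xAC");
}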
Your version with Boost seems to be doing the same thing. The documentation says that the utf_to_utf function converts one UTF encoding to another, in this case UTF-32 to UTF-8. If you use char32_t, it will be a "correct" approach that also works on systems where unsigned int isn't the same size as char32_t.
// The function also converts the unsigned int to char32_t
std::string codepoint_to_utf8(char32_t codepoint) {
    return boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint + 1);
}
As mentioned, a codepoint in this form is (conveniently) UTF-32, so what you're looking for is a transcoding.
For a solution that does not rely on functions deprecated since C++17, and isn't really ugly, and which also does not require hefty third-party libraries, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.
It's going to look something like this:
const uint32_t codepoint{0xF00};
std::vector<unsigned char> result;
try
{
    utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
    // something
}
(There's also a utf8::unchecked::utf32to8, if you're allergic to exceptions.)
(And consider writing into a std::vector<char8_t> or std::u8string, since C++20.)
(Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)
I tend to use this library in projects until I need something a little heavier for other purposes (at which point I'll typically switch to ICU).
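For completeness, here is what the ICU route might look like. This is a hedged sketch from memory (icu::UnicodeString stores UTF-16 internally), so verify the calls against your ICU version:
#include <string>
#include <unicode/unistr.h> // ICU; link against the icuuc library

// Convert a single code point to UTF-8 via icu::UnicodeString.
std::string codepoint_to_utf8_icu(UChar32 codepoint)
{
    icu::UnicodeString ustr(codepoint); // construct from one code point
    std::string out;
    ustr.toUTF8String(out);             // appends the UTF-8 form to out
    return out;
}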
C++17 has deprecated a number of convenience functions for processing UTF. Unfortunately, the last remaining ones are deprecated in C++20 (*). That being said, std::codecvt is still valid. From C++11 to C++17, you can use std::codecvt<char32_t, char, std::mbstate_t>; starting with C++20, it is std::codecvt<char32_t, char8_t, std::mbstate_t>.
Here is some code converting a code point (up to 0x10FFFF) to UTF-8:
#include <locale>
#include <cstddef>

// codepoint is the code point to convert
// buf is a char array of size sz (should be at least 4 to convert any code point)
// on return sz is the used size of buf for the utf8-converted string
// the return value is the return value of std::codecvt::out (0 for ok)
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, std::size_t& sz) {
    std::locale loc("");
    const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
        std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);
    std::mbstate_t state{};
    const char32_t *last_in;
    char *last_out;
    std::codecvt_base::result res = cvt.out(state, &codepoint, &codepoint + 1, last_in,
                                            buf, buf + sz, last_out);
    sz = last_out - buf;
    return res;
}
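A possible call site (the values here are just for illustration):
char buf[4];                  // 4 bytes covers any code point in UTF-8
std::size_t sz = sizeof buf;
// U+1F600 needs all four bytes: F0 9F 98 80
std::codecvt_base::result res = to_utf8(0x1F600, buf, sz);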
(*) std::codecvt will still exist in C++20. It is simply that the default instantiations will no longer be std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t>, but std::codecvt<char16_t, char8_t, std::mbstate_t> and std::codecvt<char32_t, char8_t, std::mbstate_t> (note char8_t instead of char).
After reading about the unsteady state of UTF-8 support in C++, I stumbled upon the corresponding C support, c32rtomb, which looks promising and likely won't be deprecated any time soon:
#include <clocale>
#include <cuchar>
#include <climits>
#include <cstddef>
#include <string>

std::size_t to_utf8(char32_t codepoint, char *buf)
{
    // remember the current locale name, then switch to a UTF-8 locale for the conversion
    std::string old_loc = std::setlocale(LC_ALL, nullptr);
    std::setlocale(LC_ALL, "en_US.utf8");
    std::mbstate_t state{};
    std::size_t len = std::c32rtomb(buf, codepoint, &state);
    std::setlocale(LC_ALL, old_loc.c_str());
    return len;
}
Usage would then be
char32_t codepoint{0xfff};
char buf[MB_LEN_MAX]{};
size_t len = to_utf8(codepoint, buf);
If your application's current locale is already UTF-8, you can of course omit the back-and-forth calls to setlocale.

C++ append int to wstring

Before (using ASCII) I was using std::string as a buffer, like this:
std::string test = "";
int value = 6;
test.append("some string");
test.append((char*)&value, 4);
test.append("some string");
with the expected value in test:
"some string\x6\x0\x0\x0some string"
Now I am trying to use Unicode and I want to keep the same "code", but trouble happens:
std::wstring test = L"";
int value = 6;
test.append(L"some string");
test.append((wchar_t*)&value, 4); (buffer overflow, because this reads 8 bytes)
test.append(L"some string");
How can I append bytes like I did with std::string?
Doing:
std::wstring test = L"";
int value = 6;
test.append(L"some string");
test.append((wchar_t*)&value, 2);
test.append(L"some string");
partially solves the problem, because afterwards I can't append bools.
EDIT:
I could even use a wstringstream, if a binary copy is applied (normally it is not).
You're confusing Unicode and character encodings. An std::string can represent Unicode code points just fine, using the UTF-8 encoding.
Windows uses the UTF-16LE encoding (or UTF-16 with a BOM, I believe) to represent Unicode text. Most others use UTF-8.
An std::string which is encoded in UTF-8 and which uses only ASCII characters can actually be interpreted as an ASCII string. This is the beauty of UTF-8. It's a natural extension.
Anyway,
i need a "binary" dynamic buffer, where i can add the real size of types(bool 1, int 4 etc)
An std::vector<uint8_t> is probably more suitable for this task. It communicates that it is not something human-readable, per se. If you need to embed strings into this buffer, make sure that sizeof(char) == sizeof(uint8_t) on the platform, and then just write the data as-is to this buffer.
If you're saving this buffer on one machine and try to read it on another machine, you have to take care of endianness too.
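A minimal sketch of that idea (the helper name put() is mine, not from the question):
#include <cstdint>
#include <vector>
#include <type_traits>

// Append the raw bytes of any trivially copyable value to a byte buffer.
template <typename T>
void put(std::vector<std::uint8_t>& buf, const T& value)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw byte copies only make sense for trivially copyable types");
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Usage: put(buf, 6); put(buf, true); put(buf, 3.14);
// Endianness still matters if the buffer crosses machine boundaries.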
You can make a function that puts in the bytes you want:
void putBytes(std::wstring& s, char* c, int numBytes)
{
    while (numBytes-- > 0)
        s += (wchar_t)*c++;
}
Then you can call it:
int value = 65;
putBytes(s, reinterpret_cast<char*>(&value), sizeof(value));
I think an IStream is the proper way to do this... I'll make an interface to handle different types. I was abusing std::string as an easy "dynamic binary array"; with std::wstring this is not possible, for many reasons, but the silliest one is that each element requires at least 2 bytes, so there is no room for a bool.

Take wchar_t and put into char?

I've tried a few things and haven't yet been able to figure out how to get const wchar_t *text (shown below) to pass into the variable StoreText (shown below). What am I doing wrong?
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    char* StoreText = text; //This is where the error occurs
}
You cannot directly assign a wchar_t* to a char*, as they are different and incompatible data types.
If StoreText needs to point at the same memory address that text is pointing at, such as if you are planning on looping through the individual bytes of the text data, then a simple type-cast will suffice:
char* StoreText = (char*)text;
However, if StoreText is expected to point to its own separate copy of the character data, then you would need to convert the wide character data into narrow character data instead. Such as by:
using the WideCharToMultiByte() function on Windows:
void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    int StoreTextLen = 1 + WideCharToMultiByte(CP_ACP, 0, text, len, NULL, 0, NULL, NULL);
    std::vector<char> StoreTextBuffer(StoreTextLen);
    WideCharToMultiByte(CP_ACP, 0, text, len, &StoreTextBuffer[0], StoreTextLen, NULL, NULL);
    char* StoreText = &StoreTextBuffer[0];
    //...
}
using the std::wcsrtombs() function:
#include <cwchar>

void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    std::mbstate_t state = std::mbstate_t();
    int StoreTextLen = 1 + std::wcsrtombs(NULL, &text, 0, &state);
    std::vector<char> StoreTextBuffer(StoreTextLen);
    std::wcsrtombs(&StoreTextBuffer[0], &text, StoreTextLen, &state);
    char *StoreText = &StoreTextBuffer[0];
    //...
}
using the std::wstring_convert class (C++11 and later); note that std::codecvt has a protected destructor, so it has to be wrapped in a facet that std::wstring_convert is allowed to delete:
#include <locale>

// std::codecvt's destructor is protected, so derive a deletable facet for wstring_convert
struct deletable_codecvt : std::codecvt<wchar_t, char, std::mbstate_t>
{
    ~deletable_codecvt() {}
};

void KeyboardComplete(int localClientNum, const wchar_t *text, unsigned int len)
{
    std::wstring_convert<deletable_codecvt> conv;
    std::string StoreTextBuffer = conv.to_bytes(text, text + len);
    char *StoreText = &StoreTextBuffer[0];
    //...
}
using similar conversions from the ICONV or ICU library.
First of all, for strings you should use std::wstring/std::string instead of raw pointers.
The C++11 locale library (http://en.cppreference.com/w/cpp/locale) can be used to convert a wide string to a narrow string.
I wrote a wrapper function below and have used it for years. Hope it will be helpful to you, too.
#include <string>
#include <locale>
#include <codecvt>
std::string WstringToString(const std::wstring & wstr, const std::locale & loc /*= std::locale()*/)
{
    std::string buf(wstr.size(), 0);
    std::use_facet<std::ctype<wchar_t>>(loc).narrow(wstr.c_str(), wstr.c_str() + wstr.size(), '?', &buf[0]);
    return buf;
}
wchar_t is a wide character. It is typically 16 or 32 bits per character, but this is system dependent.
char is a good ol' CHAR_BIT-sized data type. Again, how big it is is system dependent. Most likely it's going to be one byte, but I can't think of a reason why CHAR_BIT can't be 16 or 32 bits, making it the same size as wchar_t.
If they are different sizes, a direct assignment is doomed. For example an 8 bit char will see 2 characters, and quite likely 2 completely unrelated characters, for every 1 character in a 16 bit wchar_t. This would be bad.
Second, even if they are the same size, they may have different encodings. For example, the numeric value assigned to the letter 'A' may be different for the char and the wchar_t. It could be 65 in char and 16640 in wchar_t.
To make any sense across the two data types, char and wchar_t data needs to be translated into the other's encoding. std::wstring_convert will often perform this translation for you, but look into the locale library for more complicated translations. Both require a compiler supporting C++11 or better. In previous C++ standards, a small army of functions provided conversion support. Third-party libraries such as Boost.Locale are helpful to unify and provide wider support.
Conversion functions are supplied by the operating system to translate between the encoding used by the OS and other common encodings.
You have to do a cast. You can do this:
char* StoreText = (char*)text;
I think this may work.
But you can also use the wcstombs function from the cstdlib library:
char someText[12];
wcstombs(someText, text, 12);
The last parameter must be the number of bytes available in the destination array.

how to convert char * to uchar16 in JNI C++

here's what I am trying to do:
typedef uint16_t uchar16_t;
uchar16_t buf[32];
// buf will contain timezone information like GMT-6, Eastern Daylight Time, etc
char * str = "Test";
for (int i = 0; i <= strlen(str); i++)
buf[i] = str[i];
I guess that's not correct since uchar16_t would contain 2 bytes and str contains 1 byte.
What is it that I am supposed to do ?
Strlen? buf[32]? Trying to destroy the universe?
You want to use a wstringstream.
std::wstringstream lols;
lols << "Test";
std::wstring cakes;
lols >> cakes;
Edit (in response to a comment):
You shouldn't use strlen because any decent string system allows embedded zeros, and strlen is seriously slow. In addition, you didn't resize your buffer as needed, so if you had a string of size > 31 you would get a buffer overflow. In addition, you would have to (if you did dynamically size your buffer) manually free it afterwards. Both of these things are serious failings of the C string system. My example code makes your standard library writer do all the work and avoid all these problems for you.
That's actually OK if your string will always be ASCII. To do it correctly, the portable function is mbstowcs, which assumes you're converting from the default locale; if you're on Windows, there are API functions that let you specify the source code page explicitly.
Your code will work as long as str is ASCII; calling strlen() in the loop condition is probably a bad idea, though. It might be easier to just use swprintf() if it's available on your system (note that it wants a wchar_t buffer, a count of characters rather than bytes, and a wide format string):
wchar_t buf[32];
const char *str = "Test";
swprintf(buf, sizeof buf / sizeof *buf, L"%s", str);
Have a look here.
Also, is there a good reason you are defining your own type?
If you have a (narrow) char string, you cannot convert it to a wchar_t string by setting your locale to "C" and then passing the string through mbstowcs(). That's because the "C" locale specifies a particular character encoding, and that encoding might not match the encoding of the execution character set, so mbstowcs() might map the characters to something unexpected, or could even fail (if the execution character set happened to use encodings that were incompatible with the encoding structure for the C locale character set).
Thus, in order to convert a char string into a wider string, you have to copy the chars one by one into an array of wchar_t. If you need to work with Unicode or UTF-16 or whatever after that, then wcstombs() is what you should look at.
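A sketch of that char-by-char copy; note this is only meaningful when the input is plain 7-bit ASCII:
#include <string>

// Widen a narrow string by copying each char into a wchar_t.
// Only correct for ASCII input; anything else needs a real conversion (mbstowcs, etc.).
std::wstring widen_ascii(const std::string& s)
{
    std::wstring result;
    result.reserve(s.size());
    for (unsigned char c : s)
        result.push_back(static_cast<wchar_t>(c));
    return result;
}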

Assigning a "const char*" to std::string is allowed, but assigning to std::wstring doesn't compile. Why?

I assumed that std::wstring and std::string both provide more or less the same interface.
So I tried to enable Unicode capabilities for our application:
# ifdef APP_USE_UNICODE
typedef std::wstring AppStringType;
# else
typedef std::string AppStringType;
# endif
However that gives me a lot of compile errors when -DAPP_USE_UNICODE is used.
It turned out that the compiler chokes when a const char[] is assigned to a std::wstring.
EDIT: improved example by removing the usage of literal "hello".
#include <string>
void myfunc(const char h[]) {
    std::string s = h;  // compiles OK
    std::wstring w = h; // compile error
}
Why does it make such a difference?
Assigning a const char* to std::string is allowed, but assigning to std::wstring gives compile errors.
Shouldn't std::wstring provide the same interface as std::string? At least for such a basic operation as assignment?
(environment: gcc-4.4.1 on Ubuntu Karmic 32bit)
You should do:
#include <string>
int main() {
    const wchar_t h[] = L"hello";
    std::wstring w = h;
    return 0;
}
std::string is a typedef of std::basic_string<char>, while std::wstring is a typedef of std::basic_string<wchar_t>. As such, the 'equivalent' C-string of a wstring is an array of wchar_ts.
The 'L' in front of the string literal is to indicate that you are using a wide-char string constant.
The relevant part of the string API is this constructor:
basic_string(const charT*);
For std::string, charT is char. For std::wstring it's wchar_t. So the reason it doesn't compile is that wstring doesn't have a char* constructor. Why doesn't wstring have a char* constructor?
There is no one unique way to convert a string of char to a string of wchar. What's the encoding used with the char string? Is it just 7 bit ASCII? Is it UTF-8? Is it UTF-7? Is it SHIFT-JIS? So I don't think it would entirely make sense for std::wstring to have an automatic conversion from char*, even though you could cover most cases. You can use:
w = std::wstring(h, h + strlen(h));
which will convert each char in turn to wchar_t (stopping before the NUL terminator), and in this example that's probably what you want. As int3 says though, if that's what you mean it's most likely better to use a wide string literal in the first place.
To convert from a multibyte encoding to a wide character encoding, take a look at the header <locale> and the type std::codecvt. The Dinkumware library has a class Dinkum::wstring_convert that makes performing such multibyte-to-wide conversions easier.
The function std::codecvt_byname allows one to find a codecvt instance for a particular named encoding. Unfortunately, discovering the names of the encodings (or locales) on your system is implementation-specific.
Small suggestion... Do not use "Unicode" strings under Linux (a.k.a. wide strings). std::string is perfectly fine and holds Unicode very well (UTF-8).
Most Linux APIs work with char * strings, and the most popular encoding is UTF-8.
So... Just don't bother yourself using wstring.
In addition to the other answers, you could use a trick from Microsoft's book (specifically, tchar.h), and write something like this:
# ifdef APP_USE_UNICODE
typedef std::wstring AppStringType;
#define _T(s) (L##s)
# else
typedef std::string AppStringType;
#define _T(s) (s)
# endif
AppStringType foo = _T("hello world!");
(Note: my macro-fu is weak, and this is untested, but you get the idea.)
Looks like you can do something like this:
#include <sstream>
// ...
std::wstringstream tmp;
tmp << "hello world";
std::wstring our_string = tmp.str();
Although for a more complex situation, you may want to break down and use mbstowcs.
You should use:
#include <tchar.h>
tstring instead of wstring/string
TCHAR* instead of char*
and _T("hello") instead of "hello" or L"hello"
This will use the appropriate form of string and char when _UNICODE is defined.