convert from char to char16_t - c++

My config:
Compiler: gnu gcc 4.8.2
I compile with C++11
platform/OS: Linux 64bit Ubuntu 14.04.1 LTS
I have this method:
static inline std::u16string StringtoU16(const std::string &str) {
    const size_t si = strlen(str.c_str());
    char16_t cstr[si+1];
    memset(cstr, 0, (si+1)*sizeof(char16_t));
    const char* constSTR = str.c_str();
    mbstate_t mbs;
    memset(&mbs, 0, sizeof(mbs)); // set shift state to the initial state
    size_t ret = mbrtoc16(cstr, constSTR, si, &mbs);
    std::u16string wstr(cstr);
    return wstr;
}
I basically want a conversion from char to char16_t (via std::string and std::u16string to facilitate memory management), but regardless of the size of the input variable str, it returns only the first character. If str = "Hello" it returns "H". I am not sure what is wrong with my method. The value of ret is 1.

I didn't know mbrtoc16() can only handle one character at a time... what a turtle. Here, then, is the code I ended up with, and it works like a charm:
static inline std::u16string StringtoU16(const std::string &str) {
    std::u16string wstr = u"";
    char16_t c16str[3] = u"\0";
    mbstate_t mbs;
    for (const auto& it: str) {
        memset(&mbs, 0, sizeof(mbs)); // set shift state to the initial state
        memmove(c16str, u"\0\0\0", 3);
        mbrtoc16(c16str, &it, 3, &mbs);
        wstr.append(std::u16string(c16str));
    } // for
    return wstr;
}
And for its counterpart (when one direction is needed, sooner or later the other will be needed too):
static inline std::string U16toString(const std::u16string &wstr) {
    std::string str = "";
    char cstr[3] = "\0";
    mbstate_t mbs;
    for (const auto& it: wstr) {
        memset(&mbs, 0, sizeof(mbs)); // set shift state to the initial state
        memmove(cstr, "\0\0\0", 3);
        c16rtomb(cstr, it, &mbs);
        str.append(std::string(cstr));
    } // for
    return str;
}
Be aware that c16rtomb will be lossy if a character cannot be converted from char16_t to char (you might end up printing a bunch of '?' depending on your system), but it will work without complaint.

mbrtoc16() converts a single character, and returns the number of bytes of the multibyte input that were consumed in order to produce the char16_t.
In order to effect this conversion, the general approach is:
A) call mbrtoc16().
B) save the converted character, skip the number of characters that were consumed.
C) Have you consumed the entire string you wanted to convert? If no, go back to step A.
Additionally, there could be conversion errors. You must check the return value from mbrtoc16() and do whatever you want to do to handle conversion errors (the original multibyte string is not valid).
Finally, you should not assume that the length of the resulting char16_t string will be equal to or less than the length of the multibyte string. It probably is; but in some weird locale, I suppose, it could theoretically be more.
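A minimal sketch of that loop, under the assumption that the input is in the locale's multibyte encoding (e.g. UTF-8); the function name and the error handling are illustrative, not code from the question:
#include <cstddef>
#include <cuchar>
#include <stdexcept>
#include <string>

static std::u16string to_u16(const std::string &str) {
    std::u16string out;
    std::mbstate_t mbs{};
    const char *p = str.data();
    const char *end = p + str.size();
    while (p < end) {
        char16_t c16;
        std::size_t rc = std::mbrtoc16(&c16, p, end - p, &mbs);
        if (rc == (std::size_t)-1 || rc == (std::size_t)-2)
            throw std::runtime_error("invalid multibyte sequence"); // handle conversion errors
        if (rc == (std::size_t)-3) { out += c16; continue; } // low surrogate: no input consumed
        out += c16;              // step B: save the converted character
        p += (rc == 0 ? 1 : rc); // skip the bytes that were consumed (rc == 0 means a null byte)
    }
    return out;
}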

Related

How to convert a codepoint to utf-8?

I have some code that reads in a Unicode code point (escaped in a string, e.g. 0xF00).
Since I'm using boost, I'm wondering whether the following is the best (and correct) approach:
unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);
?
You can do this with the standard library using std::wstring_convert to convert UTF-32 (code points) to UTF-8:
#include <locale>
#include <codecvt>
#include <string>

std::string codepoint_to_utf8(char32_t codepoint) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(&codepoint, &codepoint + 1);
}
This returns a std::string whose size is 1, 2, 3 or 4 depending on how large codepoint is. It will throw a std::range_error if the code point is too large (> 0x10FFFF, the maximum Unicode code point).
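For example, a quick check of the function above (U+20AC, the euro sign, encodes to three UTF-8 bytes):
std::string s = codepoint_to_utf8(U'\x20AC');
// s.size() == 3; the bytes are 0xE2 0x82 0xAC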
Your version with boost seems to be doing the same thing. The documentation says that the utf_to_utf function converts one UTF encoding to another, in this case 32 to 8. If you use char32_t, it will be a "correct" approach that also works on systems where unsigned int isn't the same size as char32_t.
// The function also converts the unsigned int to char32_t
std::string codepoint_to_utf8(char32_t codepoint) {
    return boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint + 1);
}
As mentioned, a codepoint in this form is (conveniently) UTF-32, so what you're looking for is a transcoding.
For a solution that does not rely on functions deprecated since C++17, and isn't really ugly, and which also does not require hefty third-party libraries, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.
It's going to look something like this:
const uint32_t codepoint{0xF00};
std::vector<unsigned char> result;
try
{
    utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
}
catch (const utf8::invalid_code_point&)
{
    // something
}
(There's also a utf8::unchecked::utf32to8, if you're allergic to exceptions.)
(And consider reading into vector<char8_t> or std::u8string, since C++20).
(Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)
I tend to use this library in projects until I need something a little heavier for other purposes (at which point I'll typically switch to ICU).
C++17 has deprecated a number of convenience functions for processing UTF. Unfortunately, the last remaining ones will be deprecated in C++20 (*). That being said, std::codecvt is still valid. From C++11 to C++17 you can use std::codecvt<char32_t, char, mbstate_t>; starting with C++20 it will be std::codecvt<char32_t, char8_t, mbstate_t>.
Here is some code converting a code point (up to 0x10FFFF) to UTF-8:
// codepoint is the code point to convert
// buf is a char array of size sz (should be at least 4 to convert any code point)
// on return, sz is the used size of buf for the UTF-8 converted string
// the return value is the return value of std::codecvt::out (0 for ok)
std::codecvt_base::result to_utf8(char32_t codepoint, char *buf, size_t& sz) {
    std::locale loc("");
    const std::codecvt<char32_t, char, std::mbstate_t> &cvt =
        std::use_facet<std::codecvt<char32_t, char, std::mbstate_t>>(loc);
    std::mbstate_t state{{0}};
    const char32_t *last_in;
    char *last_out;
    std::codecvt_base::result res = cvt.out(state, &codepoint, 1 + &codepoint, last_in,
                                            buf, buf + sz, last_out);
    sz = last_out - buf;
    return res;
}
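A possible usage of the to_utf8 above, assuming your default locale's narrow encoding is UTF-8 (buffer size and names are only for illustration):
char buf[4];
size_t sz = sizeof buf;
if (to_utf8(U'\x20AC', buf, sz) == std::codecvt_base::ok) {
    // buf[0] .. buf[sz-1] now hold the UTF-8 bytes (3 bytes for U+20AC)
}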
(*) std::codecvt will still exist in C++20. Simply, the default instantiations will no longer be std::codecvt<char16_t, char, std::mbstate_t> and std::codecvt<char32_t, char, std::mbstate_t>, but std::codecvt<char16_t, char8_t, std::mbstate_t> and std::codecvt<char32_t, char8_t, std::mbstate_t> (note char8_t instead of char).
After reading about the unsteady state of UTF-8 support in C++, I stumbled upon the corresponding C support, c32rtomb, which looks promising and likely won't be deprecated any time soon:
#include <clocale>
#include <cuchar>
#include <climits>
#include <string>

size_t to_utf8(char32_t codepoint, char *buf)
{
    // query the current locale first (setlocale returns the locale now in effect,
    // so the previous one has to be saved before switching)
    const std::string old_loc = std::setlocale(LC_ALL, nullptr);
    std::setlocale(LC_ALL, "en_US.utf8");
    std::mbstate_t state{};
    std::size_t len = std::c32rtomb(buf, codepoint, &state);
    std::setlocale(LC_ALL, old_loc.c_str());
    return len;
}
Usage would then be
char32_t codepoint{0xfff};
char buf[MB_LEN_MAX]{};
size_t len = to_utf8(codepoint, buf);
If your application's current locale is already UTF-8, you might omit the back and forth call to setlocale of course.

C++ append int to wstring

Before (using ASCII) I was using std::string as a buffer, like this:
std::string test = "";
int value = 6;
test.append("some string");
test.append((char*)value, 4);
test.append("some string");
with the expected value in test:
"some string\x6\x0\x0\x0some string"
Now I am trying to use Unicode and I want to keep the same "code", but trouble happens:
std::wstring test = "";
int value = 6;
test.append("some string");
test.append((wchar_t*)value, 4); (buffer overflow cause reading 8 bytes)
test.append("some string");
How can I append bytes like in std::string?
Doing:
std::wstring test = "";
int value = 6;
test.append("some string");
test.append((wchar_t*)value, 2);
test.append("some string");
solves the problem only partially, because afterwards I can't append bools.
EDIT:
I could even use wstringstream if a binary copy is applied (normally it is not).
You're confusing Unicode and character encodings. An std::string can represent Unicode code points just fine, using the UTF-8 encoding.
Windows uses the UTF-16LE (or UTF-16 with a BOM, I believe) encoding to represent Unicode glyphs. Most others use UTF-8.
An std::string which is encoded in UTF-8 and which uses only ASCII characters can actually be interpreted as an ASCII string. This is the beauty of UTF-8. It's a natural extension.
Anyway,
I need a "binary" dynamic buffer, where I can add the real size of types (bool 1, int 4, etc.)
An std::vector<uint8_t> is probably more suitable for this task. It communicates that it is not something human-readable, per se. If you need to embed strings into this buffer, make sure that sizeof(char) == sizeof(uint8_t) on the platform, and then just write the data as-is to this buffer.
If you're saving this buffer on one machine and try to read it on another machine, you have to take care of endianness too.
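A minimal sketch of such a buffer; the put_bytes helper is my naming, not part of any library, and it assumes trivially copyable values:
#include <cstdint>
#include <vector>

// append the raw bytes of a trivially copyable value to the buffer
template <typename T>
void put_bytes(std::vector<uint8_t>& buf, const T& value)
{
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

void example()
{
    std::vector<uint8_t> buf;
    put_bytes(buf, 6);     // appends 4 bytes (int)
    put_bytes(buf, true);  // appends 1 byte (bool)
}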
You can make a function that appends the bytes you want to put:
void putBytes(std::wstring& s, char* c, int numBytes)
{
    while (numBytes-- > 0)
        s += (wchar_t)*c++;
}
Then you can call it:
int value = 65;
putBytes(s, reinterpret_cast<char*>(&value), sizeof(value));
I think an IStream is the proper way to do this... I'll make an interface to handle different types. I was abusing std::string as an easy "dynamic binary array"; with std::wstring this is not possible, for many reasons, but the silliest one is that each element requires at least 2 bytes, so there is no room for a bool.

converting narrow string to wide string

How can I convert a narrow string to a wide string?
I have tried this method:
string myName;
getline( cin , myName );
wstring printerName( L(myName) ); // error C3861: 'L': identifier not found
wchar_t* WprinterName = printerName.c_str(); // error C2440: 'initializing' : cannot convert from 'const wchar_t *' to 'wchar_t *'
But I get the errors listed above.
Why do I get these errors? How can I fix them?
Is there any other method of directly converting a narrow string to a wide string?
If the source is ASCII encoded, you can just do this:
wstring printerName;
printerName.assign( myName.begin(), myName.end() );
You should do this:
inline std::wstring convert( const std::string& as )
{
    // deal with trivial case of empty string
    if( as.empty() ) return std::wstring();
    // determine required length of new string
    size_t reqLength = ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), 0, 0 );
    // construct new string of required length
    std::wstring ret( reqLength, L'\0' );
    // convert old string to new string
    ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), &ret[0], (int)ret.length() );
    // return new string ( compiler should optimize this away )
    return ret;
}
This expects the std::string to be UTF-8 (CP_UTF8); when you have another encoding, replace the code page.
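A possible usage of convert() (the variable names are mine; the input must already be UTF-8, or you must change the code page as noted):
std::string narrowName = "example.txt";      // assumed to be UTF-8 encoded
std::wstring wideName = convert(narrowName);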
Another way could be:
inline std::wstring convert( const std::string& as )
{
    wchar_t* buf = new wchar_t[as.size() * 2 + 2];
    swprintf( buf, as.size() * 2 + 2, L"%S", as.c_str() ); // %S consumes a narrow string here (MSVC behavior)
    std::wstring rval = buf;
    delete[] buf;
    return rval;
}
I found this while googling the problem; I have pasted the code here for reference. The author of this post is Paul McKenzie.
std::string str = "Hello";
std::wstring str2(str.length(), L' '); // Make room for characters
// Copy string to wstring.
std::copy(str.begin(), str.end(), str2.begin());
ATL (non-express editions of Visual Studio) has a couple of useful class types which can convert the strings plainly. You can use the constructor directly if you do not need to hold onto the string.
#include <atlbase.h>
std::wstring wideString(L"My wide string");
std::string narrowString("My not-so-wide string");
ATL::CW2A narrow(wideString.c_str()); // narrow is a narrow string
ATL::CA2W wide(narrowString.c_str()); // wide is a wide string
Here are two functions that can be used: mbstowcs_s and wcstombs_s.
mbstowcs_s: Converts a sequence of multibyte characters to a corresponding sequence of wide characters.
wcstombs_s: Converts a sequence of wide characters to a corresponding sequence of multibyte characters.
errno_t wcstombs_s(
    size_t *pReturnValue,
    char *mbstr,
    size_t sizeInBytes,
    const wchar_t *wcstr,
    size_t count
);
errno_t mbstowcs_s(
    size_t *pReturnValue,
    wchar_t *wcstr,
    size_t sizeInWords,
    const char *mbstr,
    size_t count
);
See http://msdn.microsoft.com/en-us/library/eyktyxsx.aspx and http://msdn.microsoft.com/en-us/library/s7wzt4be.aspx.
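A hedged sketch of mbstowcs_s in use (MSVC-specific; the helper name, the _TRUNCATE count and the final resize are my choices, not from the documentation above):
#include <cstdlib>
#include <string>

std::wstring to_wide(const std::string& narrow)
{
    std::wstring wide(narrow.size() + 1, L'\0');   // room for the terminating L'\0'
    size_t converted = 0;
    mbstowcs_s(&converted, &wide[0], wide.size(), narrow.c_str(), _TRUNCATE);
    wide.resize(converted ? converted - 1 : 0);    // converted counts the L'\0'; drop it
    return wide;
}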
The Windows API provides routines for doing this: WideCharToMultiByte() and MultiByteToWideChar(). However, they are a pain to use. Each conversion requires two calls to the routines and you have to look after allocating/freeing memory and making sure the strings are correctly terminated. You need a wrapper!
I have a convenient C++ wrapper on my blog, here, which you are welcome to use.
The original question of this thread was: "How can i convert a narrow string to a wide string?"
However, from the example code given in the question, there seems to be no conversion necessary. Rather, there is a compiler error due to the newer compilers deprecating something that used to be okay. Here is what I think is going on:
// wchar_t* wstr = L"A wide string"; // Error: cannot convert from 'const wchar_t *' to 'wchar_t *'
wchar_t const* wstr = L"A wide string"; // okay
const wchar_t* wstr_equivalent = L"A wide string"; // also okay
The c_str() result is treated the same as a literal: it is a pointer to const data. You could use a cast, but it is preferable to add const.
The best answer I have seen for converting between wide and narrow strings is to use std::wstringstream. And this is one of the answers given to C++ Convert string (or char*) to wstring (or wchar_t*)
You can convert most anything to and from strings and wide strings using stringstream and wstringstream.
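For instance, formatting a number into a std::wstring is just (a small illustrative example, not a narrow-to-wide transcoding):
#include <sstream>
#include <string>

std::wstring int_to_wstring(int value)
{
    std::wstringstream ss;
    ss << value;   // anything with an operator<< can be written here
    return ss.str();
}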
This article published on the MSDN Magazine 2016 September issue discusses the conversion in details using Win32 APIs.
Note that using MultiByteToWideChar() is much faster than using the std:: stuff on Windows.
Use mbstowcs():
string myName;
wchar_t wstr[BUFFER_SIZE];
getline( cin , myName );
mbstowcs(wstr, myName.c_str(), BUFFER_SIZE);

How do I convert a char string to a wchar_t string?

I have a string in char* format and would like to convert it to wchar_t*, to pass to a Windows function.
Does this little function help?
#include <cstdlib>
size_t mbstowcs(wchar_t *out, const char *in, size_t size);
Also see the C++ reference
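A minimal sketch of how it is typically used; the widen name and the setlocale call are my additions, not part of the reference:
#include <clocale>
#include <cstdlib>
#include <string>

std::wstring widen(const char* in)
{
    std::setlocale(LC_ALL, "");                       // use the environment's locale
    std::size_t len = std::mbstowcs(nullptr, in, 0);  // required length, not counting '\0'
    if (len == static_cast<std::size_t>(-1))
        return std::wstring();                        // invalid multibyte sequence
    std::wstring out(len, L'\0');
    std::mbstowcs(&out[0], in, len);                  // fills exactly len wide characters
    return out;
}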
If you don't want to link against the C runtime library, use the MultiByteToWideChar API call, e.g:
const size_t WCHARBUF = 100;
const char szSource[] = "HELLO";
wchar_t wszDest[WCHARBUF];
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, szSource, -1, wszDest, WCHARBUF);
The Windows SDK specifies two functions in kernel32.lib for converting strings from and to a wide character set: MultiByteToWideChar() and WideCharToMultiByte().
Please note that, unlike the function names suggest, the string does not necessarily use a multi-byte character set, but can be a simple ANSI string. Also note that those functions understand UTF-7 and UTF-8 as multi-byte character sets. The wide character set is always UTF-16.
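For completeness, the reverse direction (wide to narrow) looks something like this, mirroring the MultiByteToWideChar example above (buffer size and code page chosen only for illustration):
const wchar_t wszSource[] = L"HELLO";
char szDest[100];
WideCharToMultiByte(CP_UTF8, 0, wszSource, -1, szDest, sizeof(szDest), NULL, NULL);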
schnaader's answer uses the conversion defined by the current C locale; this one uses the C++ locale interface (who said that it was simple?)
#include <cstdio>     // BUFSIZ
#include <locale>
#include <stdexcept>
#include <string>

std::wstring widen(std::string const& s, std::locale loc)
{
    std::char_traits<wchar_t>::state_type state = { 0 };
    typedef std::codecvt<wchar_t, char, std::char_traits<wchar_t>::state_type>
        ConverterFacet;
    ConverterFacet const& converter(std::use_facet<ConverterFacet>(loc));
    char const* nextToRead = s.data();
    wchar_t buffer[BUFSIZ];
    wchar_t* nextToWrite;
    std::codecvt_base::result result;
    std::wstring wresult;
    while ((result
            = converter.in(state,
                           nextToRead, s.data()+s.size(), nextToRead,
                           buffer, buffer+sizeof(buffer)/sizeof(*buffer), nextToWrite))
           == std::codecvt_base::partial)
    {
        wresult.append(buffer, nextToWrite);
    }
    if (result == std::codecvt_base::error) {
        throw std::runtime_error("Encoding error");
    }
    wresult.append(buffer, nextToWrite);
    return wresult;
}

String comparisons. How can you compare string with std::wstring? WRT strcmp

I am trying to compare two formats that I expected would be somewhat compatible, since they are both generally strings. I have tried to perform strcmp with a string and std::wstring, and as I'm sure C++ gurus know, this will simply not compile. Is it possible to compare these two types? Is there an easy conversion here?
You need to convert your char* string - "multibyte" in ISO C parlance - to a wchar_t* string - "wide character" in ISO C parlance. The standard function that does that is called mbstowcs ("Multi-Byte String To Wide Character String")
NOTE: as Steve pointed out in comments, this is a C99 function and thus is not ISO C++ conformant, but may be supported by C++ implementations as an extension. MSVC and g++ both support it.
It is used thus:
const char* input = ...;
std::size_t output_size = std::mbstowcs(NULL, input, 0); // get length
std::vector<wchar_t> output_buffer(output_size);
// output_size is guaranteed to be >0 because of \0 at end
std::mbstowcs(&output_buffer[0], input, output_size);
std::wstring output(&output_buffer[0]);
Once you have two wstrings, just compare as usual. Note that this will use the current system locale for conversion (i.e. on Windows this will be the current "ANSI" codepage) - normally this is just what you want, but occasionally you'll need to deal with a specific encoding, in which case the above won't do, and you'll need to use something like iconv.
EDIT
All other answers seem to go for direct code point translation (i.e. the equivalent of (wchar_t)c for every char c in the string). This may not work for all locales, but it will work if e.g. your chars are all ASCII or Latin-1 and your wchar_t is Unicode. If you're sure that's what you really want, the fastest way is actually to avoid conversion altogether, and to use std::lexicographical_compare:
#include <algorithm>
const char* s = ...;
std::wstring ws = ...;
const char* s_end = s + strlen(s);
bool is_ws_less_than_s = std::lexicographical_compare(ws.begin(), ws.end(),
                                                      s, s_end);
bool is_s_less_than_ws = std::lexicographical_compare(s, s_end,
                                                      ws.begin(), ws.end());
bool is_s_equal_to_ws = !is_ws_less_than_s && !is_s_less_than_ws;
If you specifically need to test for equality, use std::equal with a length check:
#include <algorithm>
const char* s = ...;
std::wstring ws = ...;
std::size_t s_len = strlen(s);
bool are_equal =
    ws.length() == s_len &&
    std::equal(ws.begin(), ws.end(), s);
The quick and dirty way is
if( std::wstring(your_char_ptr_string, your_char_ptr_string + strlen(your_char_ptr_string)) == your_wstring )
I say dirty because it will create a temporary wstring and copy your_char_ptr_string into it, widening each char. However, it will work just fine as long as you are not in a tight loop.
Note that on Windows wstring uses 16-bit characters (i.e. Unicode, 65536 possible values; on other platforms wchar_t is often 32-bit), whereas char* tends to hold 8-bit characters (ASCII/Latin-1, English only). They are not the same, so wstring --> char* might lose accuracy.
-Tom
First of all, you have to ask yourself why you are using std::wstring, which is a Unicode format, with char* (a C string), which is ANSI. It is best practice to use Unicode because it allows your application to be internationalized, but using a mix doesn't make much sense in most cases. If you want your C strings to be Unicode, use wchar_t. If you want your STL strings to be ANSI, use std::string.
Now back to your question.
The first thing you want to do is convert one of them to match the other datatype.
std::string and std::wstring have the c_str function.
Here are the function declarations:
const char* std::string::c_str() const
const wchar_t* std::wstring::c_str() const
I don't remember offhand how to convert char* to wchar_t* and vice versa, but after you do that you can use wcscmp (for wide strings) or strcmp (for narrow ones). If you Google it, you'll find a way.
You could use the functions below to convert std::wstring to std::string; then c_str() will give you a char* which you can pass to strcmp.
#include <string>
#include <algorithm>

// Prototype for conversion functions
std::wstring StringToWString(const std::string& s);
std::string WStringToString(const std::wstring& s);

std::wstring StringToWString(const std::string& s)
{
    std::wstring temp(s.length(), L' ');
    std::copy(s.begin(), s.end(), temp.begin());
    return temp;
}

std::string WStringToString(const std::wstring& s)
{
    std::string temp(s.length(), ' ');
    std::copy(s.begin(), s.end(), temp.begin());
    return temp;
}
Convert your wstring to a string.
wstring a = L"foobar";
string b(a.begin(),a.end());
Now you can compare it to any char* using b.c_str() or whatever you like.
char c[] = "foobar";
cout<<strcmp(b.c_str(),c)<<endl;