wstring_convert throwing on umlaut - c++

Consider the following piece of code:
#include <iostream>
#include <string>
#include <codecvt>
std::wstring string_to_wstring(const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
    return converter.from_bytes(str);
}

int main()
{
    std::string str = "abcä"; // without the "ä" it works
    std::wstring wstr = string_to_wstring(str);
    std::wcout << wstr << L"\n";
}
This throws a "bad_conversion" exception, which seems to be caused by the umlaut: if I remove the "ä", everything works.
I found the code for the string_to_wstring function some time ago here on SO, and until now it has worked very well, mainly because I never came across any umlauts.
Can we fix this function to work with any characters? Or is there a better (more efficient/safer) way to convert between string and wstring?
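One likely cause (an assumption, since it depends on how the source file is saved and compiled): the bytes of "abcä" in the executable are not UTF-8 (Windows-1252, for instance, stores "ä" as the single byte 0xE4), so codecvt_utf8 rejects the input and from_bytes throws. A minimal sketch of a fix, forcing a UTF-8 literal so the input matches what the facet expects:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

std::wstring string_to_wstring(const std::string& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
    return converter.from_bytes(str);
}

int main()
{
    // u8"" guarantees the literal is stored as UTF-8 bytes
    // (pre-C++20; in C++20 u8 literals have type const char8_t[]).
    std::string str = u8"abcä";
    std::wstring wstr = string_to_wstring(str); // no longer throws
    std::wcout << wstr.size() << L"\n";         // prints 4
}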

Related

How to convert a string encoded in UTF-16 to a string encoded in UTF-8?

I have a string or a char[], but it is encoded in UTF-16, like this:
Now I want to convert it to UTF-8 in a new string. Please help me! I already tried like this:
But the compiler tells me I have a problem. How do I solve this?
The problem is evident: you define u16_str as a std::string, while cvt.to_bytes() expects a std::u16string (as the name of the variable suggests).
The following code works for me:
#include <locale>
#include <codecvt>
#include <iostream>
int main()
{
    std::u16string u16_str { u"aeiuoàèìòùAEIOU" };
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> cvt;
    std::string u8_str = cvt.to_bytes(u16_str);
    std::cout << u8_str << std::endl;
    return 0;
}
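For the opposite direction (UTF-8 bytes back to a std::u16string), the same converter's from_bytes() works; a short sketch along the same lines:

#include <locale>
#include <codecvt>
#include <iostream>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> cvt;
    // from_bytes() parses the UTF-8 byte sequence into UTF-16 code units
    std::u16string u16_str = cvt.from_bytes(u8"aeiuo\u00e0\u00e8\u00ec\u00f2\u00f9");
    std::cout << u16_str.size() << std::endl; // 10 code units
    return 0;
}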

Debug Assertion Failed when using ispunct('ø')

I am writing a program that handles large portions of text and need to remove punctuation. I encountered a Debug Assertion Failed error and isolated it to this: it occurs when calling ispunct() on non-English letters.
My test program is now like this:
main.cpp
#include <iostream>
#include <cctype>
using namespace std;

int main() {
    ispunct('ø');
    cin.get();
    return 0;
}
The Debug Assertion Failed window looks like this:
Screenshot of the error
All non-English letters I have tried cause this problem, including 'æ', 'ø', 'å', 'é', etc. Punctuation and English letters do not cause the problem. It's probably something very simple that I am overlooking, so I am thankful for any help!
The argument to ispunct() must be representable as an unsigned char (or be EOF); 'ø' is not, so the call is undefined behavior, and MSVC's debug runtime catches it with this assertion. Use wchar_t and the std::ispunct overload from <locale> instead, for example:
#include <iostream>
#include <locale>
int main()
{
    const wchar_t c = L'ø';
    std::locale loc("en_US.UTF-8");
    std::ispunct(c, loc);
}
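If you have to stay with narrow chars, the usual idiom is to cast the argument to unsigned char first, which keeps the value within ispunct()'s valid range and avoids the assertion. A minimal sketch; note that in the default "C" locale a byte such as 0xF8 is simply classified as non-punctuation:

#include <cctype>
#include <iostream>

int main()
{
    char c = '\xF8'; // 'ø' in Latin-1
    // The cast keeps the value in [0, UCHAR_MAX], as the narrow
    // ispunct() requires.
    bool is_punct = std::ispunct(static_cast<unsigned char>(c)) != 0;
    std::cout << std::boolalpha << is_punct << '\n'; // false
}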
For your problem, you can also do this:
#include <iostream>
#include <locale>
#include <string>
#include <algorithm>
#include <functional>

int main()
{
    std::wstring word = L"søme.?.thing";
    std::locale loc("en_US.UTF-8");
    using namespace std::placeholders;
    word.erase(std::remove_if(word.begin(), word.end(),
                   std::bind(std::ispunct<wchar_t>, _1, loc)),
               word.end());
    std::wcout << word << std::endl;
}

How to store unicode character in wstring on linux?

#include <iostream>
using namespace std;

int main() {
    std::wstring str = L"\u00A2";
    std::wcout << str;
    return 0;
}
Why doesn't this work? And how do I solve it?
It doesn't work because in the default C locale, there is no character which corresponds to U+00A2.
If you're using a standard Ubuntu install, it is most likely that your user locale uses a larger character set than US-ASCII, quite possibly Unicode encoded with UTF-8. So you just need to switch to the locale specified in the environment, as follows:
#include <iostream>
/* <clocale> is needed for std::setlocale */
#include <clocale>
#include <string>

int main() {
    /* The following switches to the locale specified
     * by the LC_ALL environment variable.
     */
    std::setlocale(LC_ALL, "");
    std::wstring str = L"\u00A2";
    std::wcout << str;
    return 0;
}
If you use std::string instead of std::wstring and std::cout instead of std::wcout, then you don't need the setlocale because no translation is needed (provided the console expects UTF-8).
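A sketch of that narrow-string variant, writing the two UTF-8 bytes of U+00A2 explicitly so it does not depend on the source file's encoding:

#include <iostream>
#include <string>

int main() {
    // "\xC2\xA2" is the UTF-8 encoding of U+00A2 (cent sign);
    // no setlocale call is needed because the bytes pass through unchanged.
    std::string str = "\xC2\xA2";
    std::cout << str << '\n';
    return 0;
}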

How to convert string to wstring in C++

I have written some code, but there is a problem. In this code I am trying to convert a string to a wstring, but the string contains "█" characters. This character has character code 219 in my code page (it is not ASCII).
The conversion produces a wrong result.
In my code:
string strs= "█and█something else";
wstring wstr(strs.begin(),strs.end());
After debugging, I get a result like this:
?and?something else
How do I correct this problem?
Thanks...
The C-library solution for converting between the system's narrow and wide encodings uses the mbsrtowcs and wcsrtombs functions from the <cwchar> header. I've spelt this out in this answer.
In C++11, you can use the wstring_convert template instantiated with a suitable codecvt facet. Unfortunately this requires some custom rigging, which is spelt out on the cppreference page.
I've adapted it here into a self-contained example which converts a wstring to a string, converting from the system's wide into the system's narrow encoding:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
// utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
template <typename Facet>
struct deletable_facet : Facet
{
    using Facet::Facet;
};

int main()
{
    std::wstring_convert<
        deletable_facet<std::codecvt<wchar_t, char, std::mbstate_t>>> conv;
    std::wstring ws(L"Hello world.");
    std::string ns = conv.to_bytes(ws);
    std::cout << ns << std::endl;
}
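For the asker's direction (string to wstring), the same converter's from_bytes() performs the narrow-to-wide conversion, assuming the input bytes really are in the system's narrow encoding:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

template <typename Facet>
struct deletable_facet : Facet
{
    using Facet::Facet;
};

int main()
{
    std::wstring_convert<
        deletable_facet<std::codecvt<wchar_t, char, std::mbstate_t>>> conv;
    // narrow (system encoding) -> wide
    std::wstring ws = conv.from_bytes("Hello world.");
    std::wcout << ws << std::endl;
}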

clang: converting const char16_t* (UTF-16) to wstring (UCS-4)

I'm trying to convert UTF-16 encoded strings to UCS-4.
If I understand correctly, C++11 provides this conversion through codecvt_utf16.
My code is something like:
#include <iostream>
#include <locale>
#include <memory>
#include <codecvt>
#include <string>
using namespace std;
int main()
{
    u16string s;
    s.push_back('h');
    s.push_back('e');
    s.push_back('l');
    s.push_back('l');
    s.push_back('o');

    wstring_convert<codecvt_utf16<wchar_t>, wchar_t> conv;
    wstring ws = conv.from_bytes(reinterpret_cast<const char*>(s.c_str()));

    wcout << ws << endl;
    return 0;
}
Note: the explicit push_backs are there to get around the fact that my version of clang (Xcode 4.2) doesn't have Unicode string literals.
When the code is run, it terminates with an exception. Am I doing something illegal here? I was thinking it should work because the const char* that I passed to wstring_convert is UTF-16 encoded, right? I have also considered endianness being the issue, but I have checked that it's not the case.
Two errors:
1) The from_bytes() overload that takes a single const char* expects a null-terminated byte string, but your very second byte is '\0'.
2) Your system is likely little-endian, so you need to convert from UTF-16LE to UCS-4:
#include <iostream>
#include <locale>
#include <memory>
#include <codecvt>
#include <string>
using namespace std;
int main()
{
    u16string s;
    s.push_back('h');
    s.push_back('e');
    s.push_back('l');
    s.push_back('l');
    s.push_back('o');

    wstring_convert<codecvt_utf16<wchar_t, 0x10ffff, little_endian>,
                    wchar_t> conv;
    wstring ws = conv.from_bytes(
        reinterpret_cast<const char*>(&s[0]),
        reinterpret_cast<const char*>(&s[0] + s.size()));

    wcout << ws << endl;
    return 0;
}
Tested with Visual Studio 2010 SP1 on Windows and CLang++/libc++-svn on Linux.
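Note also that wchar_t is only 16 bits wide on Windows, so the result there is UTF-16 rather than UCS-4. If you need UCS-4 regardless of platform, the same approach can be instantiated with char32_t; a sketch under that assumption (and assuming a compiler with Unicode string literals):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::u16string s = u"hello";
    // Interpret the bytes of s as UTF-16LE and produce 32-bit code points.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>,
        char32_t> conv;
    std::u32string u32 = conv.from_bytes(
        reinterpret_cast<const char*>(s.data()),
        reinterpret_cast<const char*>(s.data() + s.size()));
    std::cout << u32.size() << std::endl; // 5 code points
    return 0;
}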