I'm trying to convert a UTF-8 string to a ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.
I would definitely prefer a completely string-based C++ solution then just call .c_str() on the resulting string.
How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.
I'm going to modify my code from another answer to implement the suggestion from Alf.
std::string UTF8toISO8859_1(const char * in)
{
std::string out;
if (in == NULL)
return out;
unsigned int codepoint;
while (*in != 0)
{
unsigned char ch = static_cast<unsigned char>(*in);
if (ch <= 0x7f)
codepoint = ch;
else if (ch <= 0xbf)
codepoint = (codepoint << 6) | (ch & 0x3f);
else if (ch <= 0xdf)
codepoint = ch & 0x1f;
else if (ch <= 0xef)
codepoint = ch & 0x0f;
else
codepoint = ch & 0x07;
++in;
if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
{
if (codepoint <= 255)
{
out.append(1, static_cast<char>(codepoint));
}
else
{
// do whatever you want for out-of-bounds characters
}
}
}
return out;
}
Invalid UTF-8 input results in dropped characters.
First convert UTF-8 to 32-bit Unicode.
Then keep the values that are in the range 0 through 255.
Those are the Latin-1 code points, and for other values, decide if you want to treat that as an error or perhaps replace with code point 127 (my fav, the ASCII "del") or question mark or something.
The C++ standard library defines a std::codecvt specialization that can be used,
template<>
codecvt<char32_t, char, mbstate_t>
C++11 §22.4.1.4/3: “the specialization codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and
UTF-8 encoding schemes”
Alfs suggestion implemented in C++11
#include <string>
#include <codecvt>
#include <algorithm>
#include <iterator>
auto i = u8"H€llo Wørld";
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
auto wide = utf8.from_bytes(i);
std::string out;
out.reserve(wide.length());
std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
[](const wchar_t c) { return (c <= 255) ? c : '?'; });
// out now contains "H?llo W\xf8rld"
Related
This question already has answers here:
Convert string from UTF-8 to ISO-8859-1
(3 answers)
Closed 1 year ago.
Referring to the ISO-8859-1 (Latin-1) encoding:
The capital E acute (É) has a hex value of C9.
I am trying to write a function that takes a std::string and then converts it to hex according to the ISO-8859-1 encoding above.
Currently, I am only able to write a function that converts an ASCII string to hex:
std::string Helper::ToHex(std::string input) {
std::stringstream strstream;
std::string output;
for (int i=0; i<input.length(); i++) {
strstream << std::hex << unsigned(input[i]);
}
strstream >> output;
}
However, this function can't do the job when the input has accented characters. It will convert É to a hex value of ffffffc3ffffff89.
std::string has no encoding of its own. It can easily hold characters encoded in ASCII, UTF-8, ISO-8859-x, Windows-125x, etc. They are just raw bytes, as far as std::string is concerned. So, before you can print your output in ISO-8859-1 specifically, you need to first know what the std::string is already holding so it can be converted to ISO-8859-1 if needed.
FYI, ffffffc3ffffff89 is simply the two char values 0xc3 0x89 (the UTF-8 encoded form of É) being sign-extended to 32 bits. Which means your compiler implements char as a signed type rather than an unsigned type. To eliminate the leading fs, you need to cast each char to unsigned char before then casting to unsigned. You also will need to account for unsigned values < 10 so that the output is an even multiple of 2 hex digits per char, eg:
strstream << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
So, it appears that your std::string is encoded in UTF-8. There are plenty of libraries available that can convert text from one encoding to another, such as ICU or ICONV. Or platform-specific APIs, like WideCharToMultiByte()/MultiByteToWideChar() on Windows, std::mbstowcs()/std::wcstombs(), etc (provided suitable locales are installed in the OS). But there is nothing really built-in to C++ for this exact UTF-8 to ISO-8859-1 conversion. Though, you could use the (deprecated) std::wstring_convert to decode the UTF-8 std::string to a UTF-16/32 encoded std::wstring, or a UTF-16 encoded std::u16string, at least. And then you can convert that to ISO-8859-1 using whatever library you want as needed.
Or, knowing that the input is UTF-8 and the output is ISO-8859-1, it is really not that hard to just convert the data manually, decoding the UTF-8 into codepoints, and then encoding those codepoints to bytes. Both encodings are well-documented and fairly easy to write code for without too much effort, eg:
size_t nextUtf8CodepointLen(const char* data)
{
unsigned char ch = static_cast<unsigned char>(*data);
if ((ch & 0x80) == 0) {
return 1;
}
if ((ch & 0xE0) == 0xC0) {
return 2;
}
if ((ch & 0xF0) == 0xE0) {
return 3;
}
if ((ch & 0xF8) == 0xF0) {
return 4;
}
return 0;
}
unsigned nextUtf8Codepoint(const char* &data, size_t &data_size)
{
if (data_size == 0) return -1;
unsigned char ch = static_cast<unsigned char>(*data);
size_t len = nextUtf8CodepointLen(data);
++data;
--data_size;
if (len < 2) {
return (len == 1) ? static_cast<unsigned>(ch) : 0xFFFD;
}
--len;
unsigned cp;
if (len == 1) {
cp = ch & 0x1F;
}
else if (len == 2) {
cp = ch & 0x0F;
}
else {
cp = ch & 0x07;
}
if (len > data_size) {
data += data_size;
data_size = 0;
return 0xFFFD;
}
for(size_t j = 0; j < len; ++j) {
ch = static_cast<unsigned char>(data[j]);
if ((ch & 0xC0) != 0x80) {
cp = 0xFFFD;
break;
}
cp = (cp << 6) | (ch & 0x3F);
}
data += len;
data_size -= len;
return cp;
}
std::string Helper::ToHex(const std::string &input) {
const char *data = input.c_str();
size_t data_size = input.size();
std::ostringstream oss;
unsigned cp;
while ((cp = nextUtf8Codepoint(data, data_size)) != -1) {
if (cp > 0xFF) {
cp = static_cast<unsigned>('?');
}
oss << std::hex << std::setw(2) << std::setfill('0') << cp;
}
return oss.str();
}
Online Demo
I need to extract Unicode strings from a PE file. While extracting I need to detect it first. For UTF-8 characters, I used the following link - How to easily detect utf8 encoding in the string?. Is there any similar way to detect UTF-16 characters. I have tried the following code. Is this right? Please do help or provide suggestions. Thanks in advance!!!
BYTE temp1 = buf[offset];
BYTE temp2 = buf[offset+1];
while (!(temp1 == 0x00 && temp2 == 0x00) && offset <= bufSize)
{
if ((temp1 >= 0x00 && temp1 <= 0xFF) && (temp2 >= 0x00 && temp2 <= 0xFF))
{
tmp += 2;
}
else
{
break;
}
offset += 2;
temp1 = buf[offset];
temp2 = buf[offset+1];
if (temp1 == 0x00 && temp2 == 0x00)
{
break;
}
}
I just implemented right now a function for you, DecodeUtf16Char(), basically it is able to do two things - either just check if it is a valid utf-16 (when check_only = true) or check and return valid decoded Unicode code-point (32-bit). Also it supports either big endian (default, when big_endian = true) or little endian (big_endian = false) order of bytes within two-byte utf-16 word. bad_skip equals to number of bytes to be skipped if failed to decode a character (invalid utf-16), bad_value is a value that is used to signify that utf-16 wasn't decoded (was invalid) by default it is -1.
Example of usage/tests are included after this function definition. Basically you just pass starting (ptr) and ending pointer to this function and when returned check return value, if it is -1 then at pointer begin was invalid utf-16 sequence, if it is not -1 then this returned value contains valid 32-bit unicode code-point. Also my function increments ptr, by amount of decoded bytes in case of valid utf-16 or by bad_skip number of bytes if it is invalid.
My functions should be very fast, because it contains only few ifs (plus a bit of arithmetics in case when you ask to actually decode chars), always place my function into headers so that it is inlined into calling function to produce very fast code! Also pass in only compile-time-constants check_only and big_endian, this will remove extra decoding code through C++ optimizations.
If for example you just want to detect long runs of utf-16 bytes then you do next thing, iterate in a loop calling this function and whenever it first returned not -1 then it will be possible beginning, then iterate further and catch last not-equal-to -1 value, this will be the last point of text. Also important to pass in bad_skip = 1 when searching for utf-16 bytes because valid char may start at any byte.
I used for testing different characters - English ASCII, Russian chars (two-byte utf-16) plus two 4-byte chars (two utf-16 words). My tests append converted line to test.txt file, this file is UTF-8 encoded to be easily viewable e.g. by notepad. All of the code after my decoding function is not needed for it to work, the rest is just testing code.
My function to work needs two functions - _DecodeUtf16Char_ReadWord() (helper) plus DecodeUtf16Char() (main decoder). I only include one standard header <cstdint>, if you're not allowed to include anything then just define uint8_t and uint16_t and uint32_t, I use only these types definition from this header.
Also, for reference, see my other post which implements both from scratch (and using standard C++ library) all types of conversions between UTF-8<-->UTF-16<-->UTF-32!
Try it online!
#include <cstdint>
static inline bool _DecodeUtf16Char_ReadWord(
uint8_t const * & ptrc, uint8_t const * end,
uint16_t & r, bool const big_endian
) {
if (ptrc + 1 >= end) {
// No data left.
if (ptrc < end)
++ptrc;
return false;
}
if (big_endian) {
r = uint16_t(*ptrc) << 8; ++ptrc;
r |= uint16_t(*ptrc) ; ++ptrc;
} else {
r = uint16_t(*ptrc) ; ++ptrc;
r |= uint16_t(*ptrc) << 8; ++ptrc;
}
return true;
}
static inline uint32_t DecodeUtf16Char(
uint8_t const * & ptr, uint8_t const * end,
bool const check_only = true, bool const big_endian = true,
uint32_t const bad_skip = 1, uint32_t const bad_value = -1
) {
auto ptrs = ptr, ptrc = ptr;
uint32_t c = 0;
uint16_t v = 0;
if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
// No data left.
c = bad_value;
} else if (v < 0xD800 || v > 0xDFFF) {
// Correct single-word symbol.
if (!check_only)
c = v;
} else if (v >= 0xDC00) {
// Unallowed UTF-16 sequence!
c = bad_value;
} else { // Possibly double-word sequence.
if (!check_only)
c = (v & 0x3FF) << 10;
if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
// No data left.
c = bad_value;
} else if ((v < 0xDC00) || (v > 0xDFFF)) {
// Unallowed UTF-16 sequence!
c = bad_value;
} else {
// Correct double-word symbol
if (!check_only) {
c |= v & 0x3FF;
c += 0x10000;
}
}
}
if (c == bad_value)
ptr = ptrs + bad_skip; // Skip bytes.
else
ptr = ptrc; // Skip all eaten bytes.
return c;
}
// --------- Next code only for testing only and is not needed for decoding ------------
#include <iostream>
#include <string>
#include <codecvt>
#include <fstream>
#include <locale>
static std::u32string DecodeUtf16Bytes(uint8_t const * ptr, uint8_t const * end) {
std::u32string res;
while (true) {
if (ptr >= end)
break;
uint32_t c = DecodeUtf16Char(ptr, end, false, false, 2);
if (c != -1)
res.append(1, c);
}
return res;
}
#if (!_DLL) && (_MSC_VER >= 1900 /* VS 2015*/) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif
template <typename CharT = char>
static std::basic_string<CharT> U32ToU8(std::u32string const & s) {
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv;
auto res = utf_8_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
return res;
}
template <typename WCharT = wchar_t>
static std::basic_string<WCharT> U32ToU16(std::u32string const & s) {
std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffffUL, std::little_endian>, char32_t> utf_16_32_conv;
auto res = utf_16_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
return std::basic_string<WCharT>((WCharT*)(res.c_str()), (WCharT*)(res.c_str() + res.length()));
}
template <typename StrT>
void OutputString(StrT const & s) {
std::ofstream f("test.txt", std::ios::binary | std::ios::app);
f.write((char*)s.c_str(), size_t((uint8_t*)(s.c_str() + s.length()) - (uint8_t*)s.c_str()));
f.write("\n\x00", sizeof(s.c_str()[0]));
}
int main() {
std::u16string a = u"привет|мир|hello|𐐷|world|𤭢|again|русский|english";
*((uint8_t*)(a.data() + 12) + 1) = 0xDD; // Introduce bad utf-16 byte.
// Also truncate by 1 byte ("... - 1" in next line).
OutputString(U32ToU8(DecodeUtf16Bytes((uint8_t*)a.c_str(), (uint8_t*)(a.c_str() + a.length()) - 1)));
return 0;
}
Output:
привет|мир|hllo|𐐷|world|𤭢|again|русский|englis
I'm trying to convert a UTF-8 string to a ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.
I would definitely prefer a completely string-based C++ solution then just call .c_str() on the resulting string.
How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.
I'm going to modify my code from another answer to implement the suggestion from Alf.
std::string UTF8toISO8859_1(const char * in)
{
std::string out;
if (in == NULL)
return out;
unsigned int codepoint;
while (*in != 0)
{
unsigned char ch = static_cast<unsigned char>(*in);
if (ch <= 0x7f)
codepoint = ch;
else if (ch <= 0xbf)
codepoint = (codepoint << 6) | (ch & 0x3f);
else if (ch <= 0xdf)
codepoint = ch & 0x1f;
else if (ch <= 0xef)
codepoint = ch & 0x0f;
else
codepoint = ch & 0x07;
++in;
if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
{
if (codepoint <= 255)
{
out.append(1, static_cast<char>(codepoint));
}
else
{
// do whatever you want for out-of-bounds characters
}
}
}
return out;
}
Invalid UTF-8 input results in dropped characters.
First convert UTF-8 to 32-bit Unicode.
Then keep the values that are in the range 0 through 255.
Those are the Latin-1 code points, and for other values, decide if you want to treat that as an error or perhaps replace with code point 127 (my fav, the ASCII "del") or question mark or something.
The C++ standard library defines a std::codecvt specialization that can be used,
template<>
codecvt<char32_t, char, mbstate_t>
C++11 §22.4.1.4/3: “the specialization codecvt <char32_t, char, mbstate_t> converts between the UTF-32 and
UTF-8 encoding schemes”
Alfs suggestion implemented in C++11
#include <string>
#include <codecvt>
#include <algorithm>
#include <iterator>
auto i = u8"H€llo Wørld";
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
auto wide = utf8.from_bytes(i);
std::string out;
out.reserve(wide.length());
std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
[](const wchar_t c) { return (c <= 255) ? c : '?'; });
// out now contains "H?llo W\xf8rld"
If I have a UTF-8 std::string how do I convert it to a UTF-16 std::wstring? Actually, I want to compare two Persian words.
This is how you do it with C++11:
std::string str = "your string in utf8";
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>> converter;
std::wstring wstr = converter.from_bytes(str);
And these are the headers you need:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
A more complete example available here:
http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
Here's some code. Only lightly tested and there's probably a few improvements. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8 then it will throw an exception, otherwise it returns the equivalent UTF-16 wstring.
std::wstring utf8_to_utf16(const std::string& utf8)
{
std::vector<unsigned long> unicode;
size_t i = 0;
while (i < utf8.size())
{
unsigned long uni;
size_t todo;
bool error = false;
unsigned char ch = utf8[i++];
if (ch <= 0x7F)
{
uni = ch;
todo = 0;
}
else if (ch <= 0xBF)
{
throw std::logic_error("not a UTF-8 string");
}
else if (ch <= 0xDF)
{
uni = ch&0x1F;
todo = 1;
}
else if (ch <= 0xEF)
{
uni = ch&0x0F;
todo = 2;
}
else if (ch <= 0xF7)
{
uni = ch&0x07;
todo = 3;
}
else
{
throw std::logic_error("not a UTF-8 string");
}
for (size_t j = 0; j < todo; ++j)
{
if (i == utf8.size())
throw std::logic_error("not a UTF-8 string");
unsigned char ch = utf8[i++];
if (ch < 0x80 || ch > 0xBF)
throw std::logic_error("not a UTF-8 string");
uni <<= 6;
uni += ch & 0x3F;
}
if (uni >= 0xD800 && uni <= 0xDFFF)
throw std::logic_error("not a UTF-8 string");
if (uni > 0x10FFFF)
throw std::logic_error("not a UTF-8 string");
unicode.push_back(uni);
}
std::wstring utf16;
for (size_t i = 0; i < unicode.size(); ++i)
{
unsigned long uni = unicode[i];
if (uni <= 0xFFFF)
{
utf16 += (wchar_t)uni;
}
else
{
uni -= 0x10000;
utf16 += (wchar_t)((uni >> 10) + 0xD800);
utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
}
}
return utf16;
}
There are some relevant Q&A here and here which is worth a read.
Basically you need to convert the string to a common format -- my preference is always to convert to UTF-8, but your mileage may wary.
There have been lots of software written for doing the conversion -- the conversion is straigth forwards and can be written in a few hours -- however why not pick up something already done such as the UTF-8 CPP
To convert between the 2 types, you should use: std::codecvt_utf8_utf16< wchar_t>
Note the string prefixes I use to define UTF16 (L) and UTF8 (u8).
#include <string>
#include <codecvt>
int main()
{
std::string original8 = u8"הלו";
std::wstring original16 = L"הלו";
//C++11 format converter
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
//convert to UTF8 and std::string
std::string utf8NativeString = convert.to_bytes(original16);
std::wstring utf16NativeString = convert.from_bytes(original8);
assert(utf8NativeString == original8);
assert(utf16NativeString == original16);
return 0;
}
Microsoft has developed a beautiful library for such conversions as part of their Casablanca project also named as CPPRESTSDK. This is marked under the namespaces utility::conversions.
A simple usage of it would look something like this on using namespace
utility::conversions
utf8_to_utf16("sample_string");
This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx
In the comment section of that page, there are also some interesting suggestions for this task like:
// Get en ASCII std::string from anywhere
std::string sLogLevelA = "Hello ASCII-world!";
std::wstringstream ws;
ws << sLogLevelA.c_str();
std::wstring sLogLevel = ws.str();
Or
// To std::string:
str.assign(ws.begin(), ws.end());
// To std::wstring
ws.assign(str.begin(), str.end());
Though I'm not sure the validity of these approaches...
Is there any C++ method support this conversion?
By now i just fill character '0' to convert ucs2 to ucs4, is it safe?
thanks!
It's correct for UCS2, but that's most likely not what you have. Nowadays, you're more likely to encounter UTF-16. Unlike UCS-2, UTF-16 encodes Unicode characters as either one or two 16-bit units. This is necessary because Unicode has more than 65536 characters in its current version.
The more complex conversions usually can be done by your OS, and there are several (non-standard) libraries that offer the same functionality, e.g. ICU.
I have something like that. Hope it will help:
String^ StringFromUCS4(const char32_t* element, int length)
{
StringBuilder^ result = gcnew StringBuilder(length);
const char32_t* pUCS4 = element;
int characterCount = 0;
while (*pUCS4 != 0)
{
wchar_t cUTF16;
if (*pUCS4 < 0x10000)
{
cUTF16 = (wchar_t)*pUCS4;
}
else
{
unsigned int t = *pUCS4 - 0x10000;
unsigned int h = (((t << 12) >> 22) + 0xD800);
unsigned int l = (((t << 22) >> 22) + 0xDC00);
cUTF16 = (wchar_t)((h << 16) | (l & 0x0000FFFF));
}
result->Append((wchar_t)*pUCS4);
characterCount++;
if (characterCount >= length)
{
break;
}
pUCS4++;
}
return result->ToString();
}