Referring to the ISO-8859-1 (Latin-1) encoding:
The capital E acute (É) has a hex value of C9.
I am trying to write a function that takes a std::string and then converts it to hex according to the ISO-8859-1 encoding above.
Currently, I am only able to write a function that converts an ASCII string to hex:
std::string Helper::ToHex(std::string input) {
    std::stringstream strstream;
    std::string output;
    for (int i = 0; i < input.length(); i++) {
        strstream << std::hex << unsigned(input[i]);
    }
    strstream >> output;
    return output;
}
However, this function can't do the job when the input has accented characters. It will convert É to a hex value of ffffffc3ffffff89.
std::string has no encoding of its own. It can easily hold characters encoded in ASCII, UTF-8, ISO-8859-x, Windows-125x, etc. They are just raw bytes, as far as std::string is concerned. So, before you can print your output in ISO-8859-1 specifically, you need to first know what the std::string is already holding so it can be converted to ISO-8859-1 if needed.
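A quick illustration of that point (a hypothetical snippet, not the asker's code): the same character É produces different raw bytes under different encodings, and std::string records nothing about which one was used.

```cpp
#include <string>

// Capital E acute (É) stored as raw bytes under two encodings.
// std::string only holds the bytes; which encoding they represent
// is knowledge the program must carry separately.
const std::string kEAcuteUtf8   = "\xC3\x89"; // UTF-8:      0xC3 0x89
const std::string kEAcuteLatin1 = "\xC9";     // ISO-8859-1: 0xC9
```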
FYI, ffffffc3ffffff89 is simply the two char values 0xc3 0x89 (the UTF-8 encoded form of É) being sign-extended to 32 bits, which means your compiler implements char as a signed type rather than an unsigned type. To eliminate the leading fs, you need to cast each char to unsigned char before then casting to unsigned. You will also need to account for values < 0x10 so that the output is exactly two hex digits per char, eg:
strstream << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
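Put together, a complete sketch of the corrected per-byte hex dump might look like this (written as a free function rather than the original Helper member, for illustration):

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Dump every byte of the input as exactly two lowercase hex digits.
// Going through unsigned char first prevents sign extension on
// platforms where plain char is a signed type.
std::string ToHexBytes(const std::string &input) {
    std::ostringstream oss;
    for (char c : input) {
        oss << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<unsigned>(static_cast<unsigned char>(c));
    }
    return oss.str();
}
```

Note that with a UTF-8 input this prints the UTF-8 bytes (É becomes c389, not the ISO-8859-1 value c9), which is exactly the encoding distinction being discussed.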
So, it appears that your std::string is encoded in UTF-8. There are plenty of libraries available that can convert text from one encoding to another, such as ICU or ICONV. Or platform-specific APIs, like WideCharToMultiByte()/MultiByteToWideChar() on Windows, std::mbstowcs()/std::wcstombs(), etc (provided suitable locales are installed in the OS). But there is nothing really built-in to C++ for this exact UTF-8 to ISO-8859-1 conversion. Though, you could use the (deprecated) std::wstring_convert to decode the UTF-8 std::string to a UTF-16/32 encoded std::wstring, or a UTF-16 encoded std::u16string, at least. And then you can convert that to ISO-8859-1 using whatever library you want as needed.
Or, knowing that the input is UTF-8 and the output is ISO-8859-1, it is really not that hard to just convert the data manually, decoding the UTF-8 into codepoints, and then encoding those codepoints to bytes. Both encodings are well-documented and fairly easy to write code for without too much effort, eg:
size_t nextUtf8CodepointLen(const char* data)
{
    unsigned char ch = static_cast<unsigned char>(*data);
    if ((ch & 0x80) == 0) {
        return 1;
    }
    if ((ch & 0xE0) == 0xC0) {
        return 2;
    }
    if ((ch & 0xF0) == 0xE0) {
        return 3;
    }
    if ((ch & 0xF8) == 0xF0) {
        return 4;
    }
    return 0;
}
unsigned nextUtf8Codepoint(const char* &data, size_t &data_size)
{
    if (data_size == 0) return -1;
    unsigned char ch = static_cast<unsigned char>(*data);
    size_t len = nextUtf8CodepointLen(data);
    ++data;
    --data_size;
    if (len < 2) {
        return (len == 1) ? static_cast<unsigned>(ch) : 0xFFFD;
    }
    --len;
    unsigned cp;
    if (len == 1) {
        cp = ch & 0x1F;
    }
    else if (len == 2) {
        cp = ch & 0x0F;
    }
    else {
        cp = ch & 0x07;
    }
    if (len > data_size) {
        data += data_size;
        data_size = 0;
        return 0xFFFD;
    }
    for (size_t j = 0; j < len; ++j) {
        ch = static_cast<unsigned char>(data[j]);
        if ((ch & 0xC0) != 0x80) {
            cp = 0xFFFD;
            break;
        }
        cp = (cp << 6) | (ch & 0x3F);
    }
    data += len;
    data_size -= len;
    return cp;
}
std::string Helper::ToHex(const std::string &input) {
    const char *data = input.c_str();
    size_t data_size = input.size();
    std::ostringstream oss;
    unsigned cp;
    while ((cp = nextUtf8Codepoint(data, data_size)) != -1) {
        if (cp > 0xFF) {
            cp = static_cast<unsigned>('?');
        }
        oss << std::hex << std::setw(2) << std::setfill('0') << cp;
    }
    return oss.str();
}
I need to extract Unicode strings from a PE file. While extracting, I need to detect the encoding first. For UTF-8 characters, I used the following link - How to easily detect utf8 encoding in the string?. Is there any similar way to detect UTF-16 characters? I have tried the following code. Is this right? Please help or provide suggestions. Thanks in advance!
BYTE temp1 = buf[offset];
BYTE temp2 = buf[offset+1];
while (!(temp1 == 0x00 && temp2 == 0x00) && offset <= bufSize)
{
    if ((temp1 >= 0x00 && temp1 <= 0xFF) && (temp2 >= 0x00 && temp2 <= 0xFF))
    {
        tmp += 2;
    }
    else
    {
        break;
    }
    offset += 2;
    temp1 = buf[offset];
    temp2 = buf[offset+1];
    if (temp1 == 0x00 && temp2 == 0x00)
    {
        break;
    }
}
I just implemented a function for you, DecodeUtf16Char(). Basically it can do two things - either just check whether the input is valid UTF-16 (when check_only = true), or check it and return the decoded Unicode code point (32-bit). It also supports either big-endian (the default, when big_endian = true) or little-endian (big_endian = false) byte order within each two-byte UTF-16 word. bad_skip is the number of bytes to skip if a character fails to decode (invalid UTF-16); bad_value is the value used to signal that the UTF-16 could not be decoded (was invalid), -1 by default.
Examples of usage/tests are included after the function definition. Basically you just pass a starting pointer (ptr) and an ending pointer to this function and check the return value: if it is -1, then ptr pointed at an invalid UTF-16 sequence; otherwise the returned value contains a valid 32-bit Unicode code point. The function also advances ptr, by the number of decoded bytes for valid UTF-16 or by bad_skip bytes if decoding failed.
The function should be very fast, because it contains only a few ifs (plus a bit of arithmetic when you ask it to actually decode chars). Always place it in a header so that it is inlined into the calling function, producing very fast code. Also pass in only compile-time constants for check_only and big_endian; this lets the compiler optimize away the unused decoding code.
If, for example, you just want to detect long runs of UTF-16 bytes, iterate in a loop calling this function: the first position where it returns something other than -1 is a possible beginning of text, and the last such position marks the end of the text. It is also important to pass bad_skip = 1 when searching for UTF-16 bytes, because a valid character may start at any byte.
For testing I used different characters - English ASCII, Russian chars (two-byte UTF-16), plus two 4-byte chars (two UTF-16 words each). My tests append the converted line to a test.txt file; this file is UTF-8 encoded so it can be viewed easily, e.g. in Notepad. Everything after the decoding function is just testing code and is not needed for decoding to work.
Two functions are needed: _DecodeUtf16Char_ReadWord() (a helper) plus DecodeUtf16Char() (the main decoder). I include only one standard header, <cstdint>; if you're not allowed to include anything, just define uint8_t, uint16_t and uint32_t yourself, as these are the only types used from that header.
Also, for reference, see my other post which implements both from scratch (and using standard C++ library) all types of conversions between UTF-8<-->UTF-16<-->UTF-32!
Try it online!
#include <cstdint>

static inline bool _DecodeUtf16Char_ReadWord(
    uint8_t const * & ptrc, uint8_t const * end,
    uint16_t & r, bool const big_endian
) {
    if (ptrc + 1 >= end) {
        // No data left.
        if (ptrc < end)
            ++ptrc;
        return false;
    }
    if (big_endian) {
        r  = uint16_t(*ptrc) << 8; ++ptrc;
        r |= uint16_t(*ptrc)     ; ++ptrc;
    } else {
        r  = uint16_t(*ptrc)     ; ++ptrc;
        r |= uint16_t(*ptrc) << 8; ++ptrc;
    }
    return true;
}
static inline uint32_t DecodeUtf16Char(
    uint8_t const * & ptr, uint8_t const * end,
    bool const check_only = true, bool const big_endian = true,
    uint32_t const bad_skip = 1, uint32_t const bad_value = -1
) {
    auto ptrs = ptr, ptrc = ptr;
    uint32_t c = 0;
    uint16_t v = 0;
    if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
        // No data left.
        c = bad_value;
    } else if (v < 0xD800 || v > 0xDFFF) {
        // Correct single-word symbol.
        if (!check_only)
            c = v;
    } else if (v >= 0xDC00) {
        // Disallowed UTF-16 sequence!
        c = bad_value;
    } else { // Possibly a double-word sequence.
        if (!check_only)
            c = (v & 0x3FF) << 10;
        if (!_DecodeUtf16Char_ReadWord(ptrc, end, v, big_endian)) {
            // No data left.
            c = bad_value;
        } else if ((v < 0xDC00) || (v > 0xDFFF)) {
            // Disallowed UTF-16 sequence!
            c = bad_value;
        } else {
            // Correct double-word symbol.
            if (!check_only) {
                c |= v & 0x3FF;
                c += 0x10000;
            }
        }
    }
    if (c == bad_value)
        ptr = ptrs + bad_skip; // Skip bad bytes.
    else
        ptr = ptrc; // Skip all consumed bytes.
    return c;
}
// --------- The following code is for testing only and is not needed for decoding ------------

#include <iostream>
#include <string>
#include <codecvt>
#include <fstream>
#include <locale>

static std::u32string DecodeUtf16Bytes(uint8_t const * ptr, uint8_t const * end) {
    std::u32string res;
    while (true) {
        if (ptr >= end)
            break;
        uint32_t c = DecodeUtf16Char(ptr, end, false, false, 2);
        if (c != -1)
            res.append(1, c);
    }
    return res;
}

#if (!_DLL) && (_MSC_VER >= 1900 /* VS 2015 */) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif

template <typename CharT = char>
static std::basic_string<CharT> U32ToU8(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv;
    auto res = utf_8_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return res;
}

template <typename WCharT = wchar_t>
static std::basic_string<WCharT> U32ToU16(std::u32string const & s) {
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffffUL, std::little_endian>, char32_t> utf_16_32_conv;
    auto res = utf_16_32_conv.to_bytes(s.c_str(), s.c_str() + s.length());
    return std::basic_string<WCharT>((WCharT*)(res.c_str()), (WCharT*)(res.c_str() + res.length()));
}

template <typename StrT>
void OutputString(StrT const & s) {
    std::ofstream f("test.txt", std::ios::binary | std::ios::app);
    f.write((char*)s.c_str(), size_t((uint8_t*)(s.c_str() + s.length()) - (uint8_t*)s.c_str()));
    f.write("\n\x00", sizeof(s.c_str()[0]));
}

int main() {
    std::u16string a = u"привет|мир|hello|𐐷|world|𤭢|again|русский|english";
    *((uint8_t*)(a.data() + 12) + 1) = 0xDD; // Introduce a bad UTF-16 byte.
    // Also truncate by 1 byte ("... - 1" in the next line).
    OutputString(U32ToU8(DecodeUtf16Bytes((uint8_t*)a.c_str(), (uint8_t*)(a.c_str() + a.length()) - 1)));
    return 0;
}
Output:
привет|мир|hllo|𐐷|world|𤭢|again|русский|englis
I'm trying to convert a UTF-8 string to a ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.
I would definitely prefer a completely string-based C++ solution, and then just call .c_str() on the resulting string.
How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.
I'm going to modify my code from another answer to implement the suggestion from Alf.
std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}
Invalid UTF-8 input results in dropped characters.
First convert UTF-8 to 32-bit Unicode.
Then keep the values that are in the range 0 through 255.
Those are the Latin-1 code points, and for other values, decide if you want to treat that as an error or perhaps replace with code point 127 (my fav, the ASCII "del") or question mark or something.
The C++ standard library defines a std::codecvt specialization that can be used,
template<>
codecvt<char32_t, char, mbstate_t>
C++11 §22.4.1.4/3: “the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
Alf's suggestion implemented in C++11
#include <string>
#include <codecvt>
#include <algorithm>
#include <iterator>
auto i = u8"H€llo Wørld";
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8;
auto wide = utf8.from_bytes(i);
std::string out;
out.reserve(wide.length());
std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
               [](const wchar_t c) { return (c <= 255) ? c : '?'; });
// out now contains "H?llo W\xf8rld"
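For reference, here is the same approach wrapped into one self-contained function (my own sketch; note that on Windows wchar_t is only 16 bits, so non-BMP input would arrive there as surrogate pairs, and that std::wstring_convert is deprecated since C++17):

```cpp
#include <algorithm>
#include <codecvt>
#include <iterator>
#include <locale>
#include <string>

// Decode UTF-8 into wide characters, then keep code points 0-255
// (which map 1:1 onto ISO-8859-1) and replace everything else with '?'.
std::string Utf8ToLatin1Lossy(const std::string &utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring wide = conv.from_bytes(utf8);
    std::string out;
    out.reserve(wide.length());
    std::transform(wide.cbegin(), wide.cend(), std::back_inserter(out),
                   [](wchar_t c) -> char {
                       return (c <= 255) ? static_cast<char>(c) : '?';
                   });
    return out;
}
```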
How can I print (cout / wcout / ...) char32_t to console in C++11?
The following code prints hex values:
u32string s2 = U"Добрый день";
for (auto x : s2) {
    wcout << (char32_t)x << endl;
}
First, I don't think wcout is supposed to print as characters anything but char and wchar_t. char32_t is neither.
Here's a sample program that prints individual wchar_t's:
#include <iostream>
using namespace std;
int main()
{
wcout << (wchar_t)0x41 << endl;
return 0;
}
Output (ideone):
A
Currently, it's impossible to get consistent Unicode output in the console even in major OSes. Simplistic Unicode text output via cout, wcout, printf(), wprintf() and the like won't work on Windows without major hacks. The problem of getting readable Unicode text in the Windows console is in having and being able to select proper Unicode fonts. Windows' console is quite broken in this respect. See this answer of mine and follow the link(s) in it.
I know this is very old, but I had to solve it on my own, so here you go.
The idea is to switch between the UTF-8 and UTF-32 encodings of Unicode: you can cout u8 strings, so just translate the UTF-32-encoded char32_t to UTF-8 and you're done. These are the low-level functions I came up with (no modern C++). They can probably be optimized, so any suggestion is appreciated.
char* char_utf32_to_utf8(char32_t utf32, const char* buffer)
// Encodes the UTF-32 encoded char into a UTF-8 string.
// Stores the result in the buffer and returns the position
// of the end of the buffer
// (unchecked access, be sure to provide a buffer that is big enough)
{
    char* end = const_cast<char*>(buffer);
    if (utf32 <= 0x7F) *(end++) = static_cast<unsigned>(utf32);
    else if (utf32 <= 0x7FF) {
        *(end++) = 0b1100'0000 + static_cast<unsigned>(utf32 >> 6);
        *(end++) = 0b1000'0000 + static_cast<unsigned>(utf32 & 0b0011'1111);
    }
    else if (utf32 < 0x10000) {
        *(end++) = 0b1110'0000 + static_cast<unsigned>(utf32 >> 12);
        *(end++) = 0b1000'0000 + static_cast<unsigned>((utf32 >> 6) & 0b0011'1111);
        *(end++) = 0b1000'0000 + static_cast<unsigned>(utf32 & 0b0011'1111);
    } else if (utf32 < 0x110000) {
        *(end++) = 0b1111'0000 + static_cast<unsigned>(utf32 >> 18);
        *(end++) = 0b1000'0000 + static_cast<unsigned>((utf32 >> 12) & 0b0011'1111);
        *(end++) = 0b1000'0000 + static_cast<unsigned>((utf32 >> 6) & 0b0011'1111);
        *(end++) = 0b1000'0000 + static_cast<unsigned>(utf32 & 0b0011'1111);
    }
    else throw encoding_error(end); // encoding_error: a user-defined exception type
    *end = '\0';
    return end;
}
You can implement this function in a class if you want, in a constructor, in a template, or whatever you prefer.
Follows the overloaded operator with the char array
std::ostream& operator<<(std::ostream& os, const char32_t* s)
{
    char buffer[5] {0}; // That's the famous "big-enough buffer"
    while (s && *s)
    {
        char_utf32_to_utf8(*(s++), buffer);
        os << buffer;
    }
    return os;
}
and with the u32string
std::ostream& operator<<(std::ostream& os, const std::u32string& s)
{
return (os << s.c_str());
}
Running the simplest stupidest test with the Unicode characters found on Wikipedia
int main()
{
    std::cout << std::u32string(U"\x10437\x20AC") << std::endl;
}
leads to 𐐷€ printed on the (Linux) console. This should be tested with different Unicode characters, though...
Also this varies with endianness but I'm sure you can find the solution looking at this.
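As a sanity check on the bit arithmetic above, here is a condensed standalone version of the same UTF-32 → UTF-8 encoding rules, appending to a std::string instead of a raw buffer (my own sketch, not the answer's exact function):

```cpp
#include <string>

// Append the UTF-8 encoding of one code point to a std::string.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF).
void AppendUtf8(std::string &out, char32_t cp) {
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                       // 1 byte
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));         // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));        // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));        // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```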
Is there any C++ method that supports this conversion?
For now I just zero-fill the upper 16 bits to convert UCS-2 to UCS-4 - is that safe?
Thanks!
It's correct for UCS-2, but that's most likely not what you have. Nowadays, you're more likely to encounter UTF-16. Unlike UCS-2, UTF-16 encodes Unicode characters as either one or two 16-bit units. This is necessary because Unicode has more than 65536 characters in its current version.
The more complex conversions usually can be done by your OS, and there are several (non-standard) libraries that offer the same functionality, e.g. ICU.
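A minimal sketch of that one-or-two-unit decoding (a hypothetical helper, not from either answer), combining surrogate pairs into code points:

```cpp
#include <string>

// Convert UTF-16 code units to UTF-32 code points, combining
// surrogate pairs; lone or malformed surrogates become U+FFFD.
std::u32string Utf16ToUtf32(const std::u16string &in) {
    std::u32string out;
    for (size_t i = 0; i < in.size(); ++i) {
        char16_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: one code point.
            out += 0x10000 + ((char32_t(u) - 0xD800) << 10)
                           + (char32_t(in[i + 1]) - 0xDC00);
            ++i;
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            out += 0xFFFD; // lone surrogate: replacement character
        } else {
            out += u;      // BMP character: widen with zero bits, as asked
        }
    }
    return out;
}
```

Note that the last branch is exactly the asker's "zero-fill" conversion; it is safe precisely for code units outside the surrogate range 0xD800-0xDFFF.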
I have something like this. Hope it helps:
String^ StringFromUCS4(const char32_t* element, int length)
{
    StringBuilder^ result = gcnew StringBuilder(length);
    const char32_t* pUCS4 = element;
    int characterCount = 0;
    while (*pUCS4 != 0)
    {
        if (*pUCS4 < 0x10000)
        {
            // BMP character: a single UTF-16 unit.
            result->Append((wchar_t)*pUCS4);
        }
        else
        {
            // Supplementary character: encode as a surrogate pair.
            unsigned int t = *pUCS4 - 0x10000;
            wchar_t h = (wchar_t)((t >> 10) + 0xD800);
            wchar_t l = (wchar_t)((t & 0x3FF) + 0xDC00);
            result->Append(h);
            result->Append(l);
        }
        characterCount++;
        if (characterCount >= length)
        {
            break;
        }
        pUCS4++;
    }
    return result->ToString();
}
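For those not using C++/CLI, here is the same surrogate-pair arithmetic in plain standard C++, without the .NET StringBuilder (my own sketch, assuming valid input code points):

```cpp
#include <string>

// Encode UTF-32 code points as UTF-16, emitting a surrogate pair
// for each code point above the Basic Multilingual Plane.
std::u16string Utf32ToUtf16(const std::u32string &in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);       // single unit
        } else {
            char32_t t = cp - 0x10000;              // 20 bits remain
            out += static_cast<char16_t>(0xD800 + (t >> 10));   // high
            out += static_cast<char16_t>(0xDC00 + (t & 0x3FF)); // low
        }
    }
    return out;
}
```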