C++ ShiftJIS to UTF8 conversion

I need to convert double-byte characters, in my special case Shift-JIS, into something easier to handle, preferably with standard C++.
The following question ended up without a workaround:
Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized
So, does anyone have a suggestion or a reference on how to handle this conversion with standard C++?

Normally I would recommend using the ICU library, but for this alone, using it is way too much overhead.
First, a conversion function that takes a std::string with Shift-JIS data and returns a std::string with UTF-8 (note 2019: no idea anymore if it works :)).
It uses a uint8_t array of 25088 elements (25088 bytes), which is used as convTable in the code. The function does not fill this variable; you have to load it from e.g. a file first. The second code part below is a program that can generate the file.
The conversion function doesn't check whether the input is valid Shift-JIS data.
std::string sj2utf8(const std::string &input)
{
    std::string output(3 * input.length(), ' '); //ShiftJis won't give 4-byte UTF8, so max. 3 bytes per input char are needed
    size_t indexInput = 0, indexOutput = 0;

    while(indexInput < input.length())
    {
        char arraySection = ((uint8_t)input[indexInput]) >> 4;

        size_t arrayOffset;
        if(arraySection == 0x8) arrayOffset = 0x100;       //these are two-byte shiftjis
        else if(arraySection == 0x9) arrayOffset = 0x1100;
        else if(arraySection == 0xE) arrayOffset = 0x2100;
        else arrayOffset = 0;                              //this is one byte shiftjis

        //determining real array offset
        if(arrayOffset)
        {
            arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
            indexInput++;
            if(indexInput >= input.length()) break;
        }
        arrayOffset += (uint8_t)input[indexInput++];
        arrayOffset <<= 1;

        //unicode number is...
        uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];

        //converting to UTF8
        if(unicodeValue < 0x80)
        {
            output[indexOutput++] = unicodeValue;
        }
        else if(unicodeValue < 0x800)
        {
            output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
        else
        {
            output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
            output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
    }

    output.resize(indexOutput); //remove the unnecessary bytes
    return output;
}
About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So... either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:
First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . I can't paste it here because of its length, so we have to hope that at least unicode.org stays online.
Then compile the program below and run it, piping the text file in and redirecting the binary output to a new file, e.g. ./generator < SHIFTJIS.TXT > convTable.bin (needs a binary-safe shell; no idea if it works on Windows).
#include<iostream>
#include<string>
#include<cstdio>
#include<cstdint>
using namespace std;

// pipe SHIFTJIS.TXT in and pipe to (binary) file out
int main()
{
    string s;
    uint8_t *mapping; //same big-endian array as in the conversion function
    mapping = new uint8_t[2*(256 + 3*256*16)];

    //initializing with space for invalid values, and then ASCII control chars
    for(size_t i = 32; i < 256 + 3*256*16; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = 0x20;
    }
    for(size_t i = 0; i < 32; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = i;
    }

    while(getline(cin, s)) //pipe the file SHIFTJIS.TXT to stdin
    {
        if(s.substr(0, 2) != "0x") continue; //comment lines

        uint16_t shiftJisValue, unicodeValue;
        if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) //getting hex values
        {
            puts("Error hex reading");
            continue;
        }

        size_t offset; //array offset
        if((shiftJisValue >> 8) == 0) offset = 0;
        else if((shiftJisValue >> 12) == 0x8) offset = 256;
        else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
        else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
        else
        {
            puts("Error input values");
            continue;
        }
        offset = 2 * (offset + (shiftJisValue & 0xfff));

        if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
        {
            puts("Error mapping not 1:1");
            continue;
        }
        mapping[offset] = unicodeValue >> 8;
        mapping[offset + 1] = unicodeValue & 0xff;
    }

    fwrite(mapping, 1, 2*(256 + 3*256*16), stdout);
    delete[] mapping;
    return 0;
}
Notes:
Two-byte big-endian raw Unicode values (more than two bytes are not necessary here).
First 256 chars (512 bytes) for the single-byte Shift-JIS chars, with value 0x20 for invalid ones.
Then 3 * 256*16 chars for the groups 0x8???, 0x9??? and 0xE???
= 25088 bytes
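
To illustrate how the table and the converter fit together, here's a minimal sketch that loads the generated file into convTable and converts one string. The file name convTable.bin is just the one from the example command above; adjust it to whatever you named your output.
#include <cstdint>
#include <cstdio>
#include <string>

uint8_t convTable[25088]; // the mapping table, filled from the generated file

std::string sj2utf8(const std::string &input); // the conversion function above

int main()
{
    FILE *f = fopen("convTable.bin", "rb");
    if(!f || fread(convTable, 1, sizeof(convTable), f) != sizeof(convTable))
        return 1; // file missing or too short
    fclose(f);

    std::string utf8 = sj2utf8("\x83\x65\x83\x58\x83\x67"); // "テスト" in Shift-JIS
    printf("%s\n", utf8.c_str());
    return 0;
}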

For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here:
https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h
Also, here's a very simple function to convert basic Shift-JIS chars to ASCII:
const char SJIS_REPLACEMENT_TABLE[] =
" ,.,..:;?!\"*'`*^"
"-_????????*---/\\"
"~||--''\"\"()()[]{"
"}<><>[][][]+-+X?"
"-==<><>????*'\"CY"
"$c&%#&*#S*******"
"*******T><^_'='";
//Convert Shift-JIS characters to ASCII equivalent
void sjis2ascii(char* bData)
{
    uint16_t ch;
    int i, j = 0;
    int len = strlen(bData);

    for (i = 0; i < len; i += 2)
    {
        // cast each byte to unsigned first, so a signed char isn't sign-extended
        ch = ((uint8_t)bData[i] << 8) | (uint8_t)bData[i+1];

        // 'A' .. 'Z'
        // '0' .. '9'
        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
        {
            bData[j++] = (ch & 0xFF) - 0x1F;
            continue;
        }

        // 'a' .. 'z'
        if (ch >= 0x8281 && ch <= 0x829A)
        {
            bData[j++] = (ch & 0xFF) - 0x20;
            continue;
        }

        if (ch >= 0x8140 && ch <= 0x81AC)
        {
            bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
            continue;
        }

        if (ch == 0x0000)
        {
            //End of the string
            bData[j] = 0;
            return;
        }

        // Character not found
        bData[j++] = bData[i];
        bData[j++] = bData[i+1];
    }

    bData[j] = 0;
    return;
}
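A quick usage sketch (the input bytes are written as hex escapes so the source stays plain ASCII; 0x8260-0x8262 is full-width "ABC" in Shift-JIS, per the ranges above):
#include <cstdio>
#include <cstring>
#include <cstdint>

// SJIS_REPLACEMENT_TABLE and sjis2ascii as above

int main()
{
    char text[] = "\x82\x60\x82\x61\x82\x62"; // full-width "ABC"
    sjis2ascii(text);
    puts(text); // prints: ABC
    return 0;
}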

Related

Decoding and saving image files from base64 C++

I'm trying to write a program in C++ that can encode images into base64 and also decode base64 into images. I believe the encoder function is working fine; some websites can take the base64 I generate and decode it into the image correctly. But for some reason, once I decode the base64 into a string, write it to a file, and save it as a .png, it can't be opened in an image viewer.
I confirmed that the string being written to the new file is exactly the same as the existing file (when opened in a text editor), but for some reason the new file can't be opened while the existing one can. I have even tried making a new file in a text editor and copying the text from the old file into it, but it still doesn't open in an image viewer.
I believe that both of the encode functions and the base64 decode function work fine. I think the problem is in the image decode function.
Image Encode Function
string base64_encode_image(const string& path) {
    vector<char> temp;
    std::ifstream infile;
    infile.open(path, ios::binary); // Open file in binary mode
    if (infile.is_open()) {
        while (!infile.eof()) {
            char c = (char)infile.get();
            temp.push_back(c);
        }
        infile.close();
    }
    else return "File could not be opened";
    string ret(temp.begin(), temp.end() - 1);
    ret = base64_encode((unsigned const char*)ret.c_str(), ret.size());
    return ret;
}
Image Decode Function
void base64_decode_image(const string& input) {
    ofstream outfile;
    outfile.open("test.png", ofstream::out);
    string temp = base64_decode(input);
    outfile.write(temp.c_str(), temp.size());
    outfile.close();
    cout << "file saved" << endl;
}
Encode Function base64
string base64_encode(unsigned const char* input, unsigned const int len) {
    string ret;
    size_t i = 0;
    unsigned char bytes[3];
    unsigned char sextets[4];
    while (i <= (len - 3)) {
        bytes[0] = *(input++);
        bytes[1] = *(input++);
        bytes[2] = *(input++);
        sextets[0] = (bytes[0] & 0xfc) >> 2; // Cuts last two bits off of first byte
        sextets[1] = ((bytes[0] & 0x03) << 4) + ((bytes[1] & 0xf0) >> 4); // Takes last two bits from first byte and adds it to first 4 bits of 2nd byte
        sextets[2] = ((bytes[1] & 0x0f) << 2) + ((bytes[2] & 0xc0) >> 6); // Takes last 4 bits of 2nd byte and adds it to first 2 bits of third byte
        sextets[3] = bytes[2] & 0x3f; // takes last 6 bits of third byte
        for (size_t j = 0; j < 4; ++j) {
            ret += base64_chars[sextets[j]];
        }
        i += 3; // increases to go to third byte
    }
    if (i != len) {
        size_t k = 0;
        size_t j = len - i; // Find index of last byte
        while (k < j) { // Sets first bytes
            bytes[k] = *(input++);
            ++k;
        }
        while (j < 3) { // Set last bytes to 0x00
            bytes[j] = '\0';
            ++j;
        }
        sextets[0] = (bytes[0] & 0xfc) >> 2; // Cuts last two bits off of first byte
        sextets[1] = ((bytes[0] & 0x03) << 4) + ((bytes[1] & 0xf0) >> 4); // Takes last two bits from first byte and adds it to first 4 bits of 2nd byte
        sextets[2] = ((bytes[1] & 0x0f) << 2) + ((bytes[2] & 0xc0) >> 6); // Takes last 4 bits of 2nd byte and adds it to first 2 bits of third byte
        // No last one is needed, because if there were 4, then (i == len) == true
        for (j = 0; j < (len - i) + 1; ++j) { // Gets sextets that include data
            ret += base64_chars[sextets[j]]; // Appends them to string
        }
        while ((j++) < 4) // Appends remaining ='s
            ret += '=';
    }
    return ret;
}
Decode Function base64
string base64_decode(const string& input) {
    string ret;
    size_t i = 0;
    unsigned char bytes[3];
    unsigned char sextets[4];
    while (i < input.size() && input[i] != '=') {
        size_t j = i % 4; // index per sextet
        if (is_base64(input[i])) sextets[j] = input[i++]; // set sextets with characters from string
        else { cerr << "Non base64 string included in input (possibly newline)" << endl; return ""; }
        if (i % 4 == 0) {
            for (j = 0; j < 4; ++j) // Using j as a separate index (not the same as it was originally used as, will later be reset)
                sextets[j] = indexof(base64_chars, strlen(base64_chars), sextets[j]); // Change value to indices of b64 characters and not ascii characters
            bytes[0] = (sextets[0] << 2) + ((sextets[1] & 0x30) >> 4); // Similar bitshifting to before
            bytes[1] = ((sextets[1] & 0x0f) << 4) + ((sextets[2] & 0x3c) >> 2);
            bytes[2] = ((sextets[2] & 0x03) << 6) + sextets[3];
            for (j = 0; j < 3; ++j) // Using j separately again to iterate through bytes and adding them to full string
                ret += bytes[j];
        }
    }
    if (i % 4 != 0) {
        for (size_t j = 0; j < (i % 4); ++j)
            sextets[j] = indexof(base64_chars, strlen(base64_chars), sextets[j]);
        bytes[0] = (sextets[0] << 2) + ((sextets[1] & 0x30) >> 4); // Similar bitshifting to before
        bytes[1] = ((sextets[1] & 0x0f) << 4) + ((sextets[2] & 0x3c) >> 2);
        for (size_t j = 0; j < (i % 4) - 1; ++j)
            ret += bytes[j]; // Add final bytes
    }
    return ret;
}
When I try to open the files produced by the image decode function, it says that the file format isn't supported or that it has been corrupted.
The base64 produced by the encode function that I'm trying to decode is in this link
https://pastebin.com/S5D90Fs8
When you open outfile in base64_decode_image, you do not specify the ofstream::binary flag like you do in base64_encode_image when reading the image. Without that flag you're writing in text mode, which can alter the data you write (e.g. when adjusting for newlines).
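That is, the open call in base64_decode_image should be:
outfile.open("test.png", ofstream::out | ofstream::binary);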

C++ string of greek characters and .at() operator

With English characters it is easy to extract, so to say, a char from a string; e.g., the following code should have y as output:
string my_word = "mystring"; // any string with 'y' at index 1
cout << my_word.at(1);
If I try to do the same with greek characters, I get a funny character:
string my_word = "λογος";
cout << my_word.at(1);
Output:
�
My question is: what can I do to make .at() or whatever similar function work?
many thanks!
std::string is a sequence of narrow characters (char). But many national alphabets use more than one char to encode a single letter when a UTF-8 locale is in use. So when you take s.at(0) you get half of a whole letter, or even less. You should use wide chars: std::wstring instead of std::string, std::wcout instead of std::cout, and L"λογος" as the string literal.
Also, you should set the right locale before any printing, using the std::locale machinery.
Code example for this case:
#include <iostream>
#include <string>
#include <locale>

int main(int, char**) {
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());
    std::wstring s = L"λογος";
    std::wcout << s.at(0) << std::endl;
    return 0;
}
The problem is complex. Non-Latin characters have to be encoded properly, and there are a couple of standards for that. The question is which encoding your system is using.
In UTF-8 encoding, one character is represented by multiple bytes: from 1 to 4, depending on what kind of character it is.
For example: λ is represented by two bytes (in hex): CE BB.
I don't know offhand which other character encodings give single-byte characters for Greek letters, but I'm sure there is one such encoding (ISO 8859-7 would be a candidate).
Note that your value my_word.length() most probably returns 10, not 5.
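You can check this yourself with a short dump of the bytes (assuming your source file and terminal are UTF-8):
#include <cstdio>
#include <string>

int main() {
    std::string my_word = "λογος";
    std::printf("length() = %zu\n", my_word.length()); // 10, not 5
    for (unsigned char c : my_word)
        std::printf("%02X ", c); // CE BB CE BF CE B3 CE BF CF 82
    std::printf("\n");
    return 0;
}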
As others have said, it depends on your encoding. An at() function is problematic once you move to internationalisation because, for example, Hebrew has vowels written around the character. Not all scripts consist of discrete sequences of glyphs.
Generally it's best to treat strings as atomic, unless you are writing the display / word-manipulation code itself, when of course you need the individual glyphs. To read UTF, check out the code in Baby X (it's a windowing system that has to draw text to the screen).
Here's the link: https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c
Here's the UTF-8 code - it's quite a hunk of code, but fundamentally straightforward.
static const unsigned int offsetsFromUTF8[6] =
{
0x00000000UL, 0x00003080UL, 0x000E2080UL,
0x03C82080UL, 0xFA082080UL, 0x82082080UL
};
static const unsigned char trailingBytesForUTF8[256] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
int bbx_isutf8z(const char *str)
{
    int len = 0;
    int pos = 0;
    int nb;
    int i;
    int ch;

    while(str[len])
        len++;
    while(pos < len && *str)
    {
        nb = bbx_utf8_skip(str);
        if(nb < 1 || nb > 4)
            return 0;
        if(pos + nb > len)
            return 0;
        for(i=1;i<nb;i++)
            if( (str[i] & 0xC0) != 0x80 )
                return 0;
        ch = bbx_utf8_getch(str);
        if(ch < 0x80)
        {
            if(nb != 1)
                return 0;
        }
        else if(ch < 0x800)
        {
            if(nb != 2)
                return 0;
        }
        else if(ch < 0x10000)
        {
            if(nb != 3)
                return 0;
        }
        else if(ch < 0x110000)
        {
            if(nb != 4)
                return 0;
        }
        pos += nb;
        str += nb;
    }

    return 1;
}
int bbx_utf8_skip(const char *utf8)
{
    return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb)
    {
        /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
    char *dest = out;

    if (ch < 0x80)
    {
        *dest++ = (char)ch;
    }
    else if (ch < 0x800)
    {
        *dest++ = (ch>>6) | 0xC0;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x10000)
    {
        *dest++ = (ch>>12) | 0xE0;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x110000)
    {
        *dest++ = (ch>>18) | 0xF0;
        *dest++ = ((ch>>12) & 0x3F) | 0x80;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else
        return 0;

    return dest - out;
}

int bbx_utf8_charwidth(int ch)
{
    if (ch < 0x80)
    {
        return 1;
    }
    else if (ch < 0x800)
    {
        return 2;
    }
    else if (ch < 0x10000)
    {
        return 3;
    }
    else if (ch < 0x110000)
    {
        return 4;
    }
    else
        return 0;
}

int bbx_utf8_Nchars(const char *utf8)
{
    int answer = 0;

    while(*utf8)
    {
        utf8 += bbx_utf8_skip(utf8);
        answer++;
    }

    return answer;
}
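To tie this back to the question, here's a sketch of iterating over the code points of the Greek string with these helpers (assuming the source file and terminal are UTF-8):
#include <stdio.h>

/* trailingBytesForUTF8, offsetsFromUTF8 and the bbx_* functions as above */

int main(void)
{
    const char *s = "λογος";
    while (*s)
    {
        int ch = bbx_utf8_getch(s);        /* decode one code point */
        printf("U+%04X ", (unsigned)ch);   /* U+03BB U+03BF U+03B3 U+03BF U+03C2 */
        s += bbx_utf8_skip(s);             /* advance by its byte length */
    }
    printf("\n");
    return 0;
}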

C++ convert ASCII escaped unicode string into utf8 string

I need to read in a standard ASCII-style string with Unicode escaping and convert it into a std::string containing the UTF-8 encoded equivalent. So, for example, "\u03a0" (a std::string with 6 characters) should be converted into a std::string with two characters, 0xce and 0xa0 respectively, in raw binary.
I would be most happy if there's a simple answer using ICU or Boost, but I haven't been able to find one.
(This is similar to Convert a Unicode string to an escaped ASCII string, but NB that I ultimately need to arrive at the UTF-8 encoding. If we can use the Unicode code point as an intermediate step, that's fine.)
Try something like this:
std::string to_utf8(uint32_t cp)
{
    /*
    if using C++11 or later, you can do this:

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes( (char32_t)cp );

    Otherwise...
    */

    std::string result;

    int count;
    if (cp <= 0x007F)
        count = 1;
    else if (cp <= 0x07FF)
        count = 2;
    else if (cp <= 0xFFFF)
        count = 3;
    else if (cp <= 0x10FFFF)
        count = 4;
    else
        return result; // or throw an exception

    result.resize(count);

    if (count > 1)
    {
        for (int i = count-1; i > 0; --i)
        {
            result[i] = (char) (0x80 | (cp & 0x3F));
            cp >>= 6;
        }

        for (int i = 0; i < count; ++i)
            cp |= (1 << (7-i));
    }

    result[0] = (char) cp;

    return result;
}
std::string str = ...; // "\\u03a0"

std::string::size_type startIdx = 0;
do
{
    startIdx = str.find("\\u", startIdx);
    if (startIdx == std::string::npos) break;

    std::string::size_type endIdx = str.find_first_not_of("0123456789abcdefABCDEF", startIdx+2);
    if (endIdx == std::string::npos) break;

    std::string tmpStr = str.substr(startIdx+2, endIdx-(startIdx+2));
    std::istringstream iss(tmpStr);

    uint32_t cp;
    if (iss >> std::hex >> cp)
    {
        std::string utf8 = to_utf8(cp);
        str.replace(startIdx, 2+tmpStr.length(), utf8);
        startIdx += utf8.length();
    }
    else
        startIdx += 2;
}
while (true);
(\u03a0 is the Unicode code point for GREEK CAPITAL LETTER PI whose UTF-8 encoding is 0xCE 0xA0)
You need to:
Get the number 0x03a0 from the string "\u03a0": drop the backslash and the u, and parse 03a0 as hex into a wchar_t. Repeat until you get a (wide) string.
Convert 0x3a0 into UTF-8. C++11 has codecvt_utf8, which may help; see the sketch below.
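A minimal sketch of both steps using those C++11 facilities (std::wstring_convert and std::codecvt_utf8 were deprecated in C++17, but they still work for this):
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // step 1, done by hand here: parse the hex digits after "\u"
    char32_t cp = (char32_t)std::stoul("03a0", nullptr, 16);

    // step 2: code point -> UTF-8 bytes
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8 = conv.to_bytes(cp);

    for (unsigned char c : utf8)
        std::cout << std::hex << (int)c << ' '; // prints: ce a0
    std::cout << '\n';
    return 0;
}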
My solution:
convert_unicode_escape_sequences(str)
input: "\u043f\u0440\u0438\u0432\u0435\u0442"
output: "привет"
Boost is used for the wchar/char conversion:
#include <boost/locale/encoding_utf.hpp>
#include <boost/numeric/conversion/cast.hpp>
#include <cstdint>
#include <sstream>
#include <string>

using boost::locale::conv::utf_to_utf;

inline uint8_t get_uint8(uint8_t h, uint8_t l)
{
    uint8_t ret = 0;

    if (h - '0' < 10)
        ret = h - '0';
    else if (h - 'A' < 6)
        ret = h - 'A' + 0x0A;
    else if (h - 'a' < 6)
        ret = h - 'a' + 0x0A;

    ret = ret << 4;

    if (l - '0' < 10)
        ret |= l - '0';
    else if (l - 'A' < 6)
        ret |= l - 'A' + 0x0A;
    else if (l - 'a' < 6)
        ret |= l - 'a' + 0x0A;

    return ret;
}

std::string wstring_to_utf8(const std::wstring& str)
{
    return utf_to_utf<char>(str.c_str(), str.c_str() + str.size());
}

std::string convert_unicode_escape_sequences(const std::string& source)
{
    std::wstring ws; ws.reserve(source.size());
    std::wstringstream wis(ws);

    auto s = source.begin();
    while (s != source.end())
    {
        if (*s == '\\')
        {
            if (std::distance(s, source.end()) > 5)
            {
                if (*(s + 1) == 'u')
                {
                    unsigned int v = get_uint8(*(s + 2), *(s + 3)) << 8;
                    v |= get_uint8(*(s + 4), *(s + 5));
                    s += 6;
                    wis << boost::numeric_cast<wchar_t>(v);
                    continue;
                }
            }
        }
        wis << wchar_t(*s);
        s++;
    }

    return wstring_to_utf8(wis.str());
}

CString Hex value conversion to Byte Array

I have been trying to carry out a conversion from a CString that contains a hex string to a byte array, and I have been unsuccessful so far. I have looked on forums and none of them seem to help. Is there a function with just a few lines of code to do this conversion?
My code:
BYTE abyData[8]; // BYTE = unsigned char
CString sByte = "0E00000000000400";
Expecting:
abyData[0] = 0x0E;
abyData[6] = 0x04; // etc.
You can simply gobble up two characters at a time:
unsigned int value(char c)
{
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    return -1; // Error!
}

for (unsigned int i = 0; i != 8; ++i)
{
    abyData[i] = value(sByte[2 * i]) * 16 + value(sByte[2 * i + 1]);
}
Of course 8 should be the size of your array, and you should ensure that the string is precisely twice as long. A checking version of this would make sure that each character is a valid hex digit and signal some type of error if that isn't the case.
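For illustration, such a checking version might look like this (a sketch built on the value() function above; the error policy is up to you):
#include <cstddef>
#include <stdexcept>

// Converts byteCount*2 hex digits from src into dst; throws on a bad digit
// (including hitting the terminating '\0' early).
void hexToBytes(const char* src, unsigned char* dst, std::size_t byteCount)
{
    for (std::size_t i = 0; i != byteCount; ++i)
    {
        // value() returns (unsigned)-1 on a bad digit, hence the > 15 checks
        unsigned int hi = value(src[2 * i]);
        if (hi > 15) throw std::invalid_argument("invalid hex digit");
        unsigned int lo = value(src[2 * i + 1]);
        if (lo > 15) throw std::invalid_argument("invalid hex digit");
        dst[i] = hi * 16 + lo;
    }
}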
How about something like this:
for (int i = 0; i < sizeof(abyData) && (i * 2) < sByte.GetLength(); i++)
{
    char ch1 = sByte[i * 2];
    char ch2 = sByte[i * 2 + 1];

    int value = 0;

    if (std::isdigit(ch1))
        value += ch1 - '0';
    else
        value += (std::tolower(ch1) - 'a') + 10;

    // That was the four high bits, so make them that
    value <<= 4;

    if (std::isdigit(ch2))
        value += ch2 - '0';
    else
        value += (std::tolower(ch2) - 'a') + 10;

    abyData[i] = value;
}
Note: The code above is not tested.
You could:
#include <stdint.h>
#include <sstream>
#include <iostream>

int main() {
    unsigned char result[8];
    std::stringstream ss;
    ss << std::hex << "0E00000000000400";
    ss >> *( reinterpret_cast<uint64_t *>( result ) );
    std::cout << static_cast<int>( result[1] ) << std::endl;
}
however, take care of alignment and aliasing issues with that cast!
Plus, on a little-endian machine the result is in the reverse order from what you would expect, so:
result[0] = 0x00
result[1] = 0x04
...
result[7] = 0x0E
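If you want the bytes in the expected order on such a little-endian machine, you can simply reverse them afterwards:
#include <algorithm>

std::reverse(result, result + sizeof(result)); // now result[0] == 0x0E ... result[6] == 0x04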

C++: Char iteration over string (I'm getting crazy)

I have this string:
std::string str = "presents";
And when I iterate over the characters, they come in this order:
spresent
So, the last char comes first.
This is the code:
uint16_t c;
printf("%s: ", str.c_str());
for (unsigned int i = 0; i < str.size(); i += extractUTF8_Char(str, i, &c)) {
printf("%c", c);
}
printf("\n");
And this is the extract method:
uint8_t extractUTF8_Char(string line, int offset, uint16_t *target) {
    uint8_t ch = uint8_t(line.at(offset));
    if ((ch & 0xC0) == 0xC0) {
        if (!target) {
            return 2;
        }
        uint8_t ch2 = uint8_t(line.at(offset + 1));
        uint16_t fullCh = (uint16_t(((ch & 0x1F) >> 2)) << 8) | ((ch & 0x3) << 0x6) | (ch2 & 0x3F);
        *target = fullCh;
        return 2;
    }
    if (target) {
        *target = ch;
    }
    return 1;
}
This method returns the length of the character: 1 byte or 2 bytes. And if the length is 2 bytes, it extracts the Unicode code point out of the UTF-8 string.
Your first printf("%c", c) is printing nonsense (the initial, uninitialized value of c), and the last c extracted is never printed.
This is because the call to extractUTF8_Char occurs in the last clause of the for statement, which runs after the loop body. You might want to change it to
for (unsigned int i = 0; i < str.size();) {
    i += extractUTF8_Char(str, i, &c);
    printf("%c", c);
}
instead.