C++ string of Greek characters and the .at() operator
With English characters it is easy to extract a single char from a string; e.g., the following code should output y:

string my_word = "my_word";  // any string whose second character is 'y'
cout << my_word.at(1);
If I try to do the same with Greek characters, I get a funny character:
string my_word = "λογος";
cout << my_word.at(1);
Output:
�
My question is: what can I do to make .at(), or some similar function, work?
Many thanks!
std::string is a sequence of narrow characters (char). But many national alphabets use more than one char to encode a single letter in a UTF-8 locale, so when you take s.at(0) you get half of a letter, or even less. You should use wide characters instead: std::wstring instead of std::string, std::wcout instead of std::cout, and L"λογος" as the string literal.
Also, you should set the right locale before any printing, using the std::locale machinery.
Code example for this case:
#include <iostream>
#include <string>
#include <locale>
int main(int, char**) {
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());
    std::wstring s = L"λογος";
    std::wcout << s.at(0) << std::endl;
    return 0;
}
The problem is complex. Non-Latin characters have to be encoded properly, and there are a couple of standards for that; the question is which encoding your system is using.
In the UTF-8 encoding, one character is represented by multiple bytes. It can vary from 1 to 4 bytes depending on what kind of character it is.
For example: λ is represented by two bytes (in hex): CE BB.
Single-byte encodings for Greek letters also exist (ISO 8859-7, for instance), but UTF-8 is what most modern systems use by default.
Note that my_word.length() most probably returns 10, not 5.
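You can see this for yourself by dumping the raw bytes of the string. A minimal sketch (the byte values shown assume the source file and execution character set are UTF-8):

#include <cstdio>
#include <string>

int main() {
    std::string my_word = "λογος";
    std::printf("length() = %zu\n", my_word.length()); // 10 on a UTF-8 system, not 5
    for (unsigned char c : my_word)
        std::printf("%02X ", c);                       // CE BB CE BF CE B3 CE BF CF 82
    std::printf("\n");
}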
As others have said, it depends on your encoding. An at() function is problematic once you move to internationalisation: Hebrew, for example, writes vowels around the base character. Not all scripts consist of discrete sequences of glyphs.
Generally it's best to treat strings as atomic, unless you are writing the display / word manipulation code itself, when of course you need the individual glyphs. To read UTF-8, check out the code in Baby X (it's a windowing system that has to draw text to the screen).
Here's the link: https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c
Here's the UTF-8 code. It's quite a hunk of code, but fundamentally straightforward.
static const unsigned int offsetsFromUTF8[6] =
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
int bbx_isutf8z(const char *str)
{
    int len = 0;
    int pos = 0;
    int nb;
    int i;
    int ch;

    while(str[len])
        len++;
    while(pos < len && *str)
    {
        nb = bbx_utf8_skip(str);
        if(nb < 1 || nb > 4)
            return 0;
        if(pos + nb > len)
            return 0;
        for(i=1;i<nb;i++)
            if( (str[i] & 0xC0) != 0x80 )
                return 0;
        ch = bbx_utf8_getch(str);
        /* reject overlong encodings: each range must use its shortest form */
        if(ch < 0x80)
        {
            if(nb != 1)
                return 0;
        }
        else if(ch < 0x800)
        {
            if(nb != 2)
                return 0;
        }
        else if(ch < 0x10000)
        {
            if(nb != 3)
                return 0;
        }
        else if(ch < 0x110000)
        {
            if(nb != 4)
                return 0;
        }
        else
            return 0; /* beyond the Unicode range */
        pos += nb;
        str += nb;
    }

    return 1;
}
int bbx_utf8_skip(const char *utf8)
{
    return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb)
    {
        /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}
int bbx_utf8_putch(char *out, int ch)
{
    char *dest = out;

    if (ch < 0x80)
    {
        *dest++ = (char)ch;
    }
    else if (ch < 0x800)
    {
        *dest++ = (ch>>6) | 0xC0;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x10000)
    {
        *dest++ = (ch>>12) | 0xE0;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x110000)
    {
        *dest++ = (ch>>18) | 0xF0;
        *dest++ = ((ch>>12) & 0x3F) | 0x80;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else
        return 0;

    return dest - out;
}
int bbx_utf8_charwidth(int ch)
{
    if (ch < 0x80)
        return 1;
    else if (ch < 0x800)
        return 2;
    else if (ch < 0x10000)
        return 3;
    else if (ch < 0x110000)
        return 4;
    else
        return 0;
}
int bbx_utf8_Nchars(const char *utf8)
{
    int answer = 0;

    while(*utf8)
    {
        utf8 += bbx_utf8_skip(utf8);
        answer++;
    }

    return answer;
}
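Putting these helpers together, an at()-style lookup by character index could look like the sketch below. This is not part of Baby X; it just chains bbx_utf8_skip() and bbx_utf8_getch() from above:

/* Return the code point of the n-th UTF-8 character, or -1 if the
   string ends before that index. */
int bbx_utf8_at(const char *utf8, int n)
{
    while (n > 0 && *utf8)
    {
        utf8 += bbx_utf8_skip(utf8);
        n--;
    }
    if (!*utf8)
        return -1;
    return bbx_utf8_getch(utf8);
}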
Related
atoi() for int128_t type
How can I use argv values with int128_t support? I know about atoi() and the family of functions exposed by <cstdlib>, but somehow I cannot find one for the int128_t fixed-width integer. This might be because this type isn't backed by either the C or the C++ standard, but is there any way for me to make this code work?

#include <iostream>

int main(int argc, char **argv)
{
    __int128_t value = atoint128_t(argv[1]);
}

Almost all answers posted are good enough for me, but I'm selecting the one that is a drop-in solution for my current code, so do look at the other ones too.
Here's a simple way of implementing this:

__int128_t atoint128_t(const char *s)
{
    const char *p = s;
    __int128_t val = 0;

    if (*p == '-' || *p == '+') {
        p++;
    }
    while (*p >= '0' && *p <= '9') {
        val = (10 * val) + (*p - '0');
        p++;
    }
    if (*s == '-') val = val * -1;
    return val;
}

This code checks each character to see if it's a digit (with an optional leading + or -), and if so it multiplies the current result by 10 and adds the value associated with that digit. It then inverts the sign if need be. Note that this implementation does not check for overflow, which is consistent with the behavior of atoi.

EDIT: Revised implementation that covers the int128_MIN case by either adding or subtracting the value of each digit based on the sign, and skips leading whitespace.

int myatoi(const char *s)
{
    const char *p = s;
    int neg = 0, val = 0;

    while ((*p == '\n') || (*p == '\t') || (*p == ' ') ||
           (*p == '\f') || (*p == '\r') || (*p == '\v')) {
        p++;
    }
    if ((*p == '-') || (*p == '+')) {
        if (*p == '-') {
            neg = 1;
        }
        p++;
    }
    while (*p >= '0' && *p <= '9') {
        if (neg) {
            val = (10 * val) - (*p - '0');
        } else {
            val = (10 * val) + (*p - '0');
        }
        p++;
    }
    return val;
}
Here is a C++ implementation:

#include <string>
#include <stdexcept>
#include <cctype>

__int128_t atoint128_t(std::string const & in)
{
    __int128_t res = 0;
    size_t i = 0;
    bool sign = false;

    if (in[i] == '-') {
        ++i;
        sign = true;
    }
    if (in[i] == '+') {
        ++i;
    }
    for (; i < in.size(); ++i) {
        const char c = in[i];
        if (not std::isdigit(c))
            throw std::runtime_error(std::string("Non-numeric character: ") + c);
        res *= 10;
        res += c - '0';
    }
    if (sign) {
        res *= -1;
    }
    return res;
}

int main()
{
    __int128_t a = atoint128_t("170141183460469231731687303715884105727");
}

If you want to test it then there is a stream operator here.

Performance

I ran a few performance tests. I generated 100,000 random numbers uniformly distributed over the entire support of __int128_t. Then I converted each of them 2000 times. All of these (200,000,000) conversions were completed within ~12 seconds. Using this code:

#include <iostream>
#include <string>
#include <random>
#include <vector>
#include <chrono>

int main()
{
    std::mt19937 gen(0);
    std::uniform_int_distribution<> num(0, 9);
    std::uniform_int_distribution<> len(1, 38);
    std::uniform_int_distribution<> sign(0, 1);

    std::vector<std::string> str;
    for (int i = 0; i < 100000; ++i) {
        std::string s;
        int l = len(gen);
        if (sign(gen))
            s += '-';
        for (int u = 0; u < l; ++u)
            s += std::to_string(num(gen));
        str.emplace_back(s);
    }

    namespace sc = std::chrono;
    auto start = sc::duration_cast<sc::microseconds>(
        sc::high_resolution_clock::now().time_since_epoch()).count();
    __int128_t b = 0;
    for (int u = 0; u < 200; ++u) {
        for (std::size_t i = 0; i < str.size(); ++i) {
            __int128_t a = atoint128_t(str[i]);
            b += a;
        }
    }
    auto time = sc::duration_cast<sc::microseconds>(
        sc::high_resolution_clock::now().time_since_epoch()).count() - start;
    std::cout << time / 1000000. << 's' << std::endl;
}
Adding here a "not-so-naive" implementation in pure C; it's still kind of simple:

#include <stdio.h>
#include <inttypes.h>

__int128 atoi128(const char *s)
{
    while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '+')
        ++s;
    int sign = 1;
    if (*s == '-') {
        ++s;
        sign = -1;
    }
    size_t digits = 0;
    while (s[digits] >= '0' && s[digits] <= '9')
        ++digits;
    char scratch[digits];
    for (size_t i = 0; i < digits; ++i)
        scratch[i] = s[i] - '0';
    size_t scanstart = 0;
    __int128 result = 0;
    __int128 mask = 1;
    while (scanstart < digits) {
        if (scratch[digits-1] & 1)
            result |= mask;
        mask <<= 1;
        for (size_t i = digits-1; i > scanstart; --i) {
            scratch[i] >>= 1;
            if (scratch[i-1] & 1)
                scratch[i] |= 8;
        }
        scratch[scanstart] >>= 1;
        while (scanstart < digits && !scratch[scanstart])
            ++scanstart;
        for (size_t i = scanstart; i < digits; ++i) {
            if (scratch[i] > 7)
                scratch[i] -= 3;
        }
    }
    return result * sign;
}

int main(int argc, char **argv)
{
    if (argc > 1) {
        __int128 x = atoi128(argv[1]);
        printf("%" PRIi64 "\n", (int64_t)x); // just for demo with smaller numbers
    }
}

It reads the number bit by bit, using a shifted BCD scratch space; see Double dabble for the algorithm (it's reversed here). This is a lot more efficient than doing many multiplications by 10 in general. *)

This relies on VLAs; without them, you can replace

char scratch[digits];

with

char *scratch = malloc(digits);
if (!scratch) return 0;

and add a free(scratch); at the end of the function.

Of course, the code above has the same limitations as the original atoi() (e.g. it will produce "random" garbage on overflow and has no way to check for that). If you need strtol()-style guarantees and error checking, extend it yourself (not a big problem, just work to do).

*) Of course, implementing double dabble in C always suffers from the fact that you can't use "hardware carries", so there are extra bit masking and testing operations necessary. On the other hand, "naively" multiplying by 10 can be very efficient, as long as the platform provides multiplication instructions with a width "close" to your target type. Therefore, on your typical x86_64 platform (which has instructions for multiplying 64-bit integers), this code is probably a lot slower than the naive decimal method. But it scales much better to really huge integers (which you would implement e.g. using arrays of uintmax_t).
is there any way for me to make this code work?

"What about implementing your own atoint128_t?" (@Marian)

It is not too hard to roll your own atoint128_t(). Points to consider:

There is 0 or 1 more representable negative value than positive values. Accumulating the value using negative numbers provides more range.

Overflow is not defined for atoi(). Perhaps provide a capped value and set errno? Detecting potential overflow prevents UB.

__int128_t constants need careful code to form correctly.

How to handle unusual input? atoi() is fairly loose and made sense years ago for speed/size, yet less UB is usually desired these days. Candidate cases: "", " ", "-", "z", "+123", "999..many...999", "the min int128", "locale_specific_space" + " 123", or even a non-string NULL.

Code to do atoi() and atoint128_t() need only vary on the type, range, and names. The algorithm is the same.

#if 1
#define int_t __int128_t
#define int_MAX (((__int128_t)0x7FFFFFFFFFFFFFFF << 64) + 0xFFFFFFFFFFFFFFFF)
#define int_MIN (-1 - int_MAX)
#define int_atoi atoint128_t
#else
#define int_t int
#define int_MAX INT_MAX
#define int_MIN INT_MIN
#define int_atoi int_atoi
#endif

Sample code; tailor as needed. Relies on C99 or later negative/positive and % functionality.

int_t int_atoi(const char *s) {
    if (s == NULL) { // could omit this test
        errno = EINVAL;
        return 0;
    }
    while (isspace((unsigned char) *s)) { // skip the same leading white space as atoi()
        s++;
    }
    char sign = *s; // remember if the sign was `-` for later
    if (sign == '-' || sign == '+') {
        s++;
    }
    int_t sum = 0;
    while (isdigit((unsigned char)*s)) {
        int digit = *s - '0';
        if ((sum > int_MIN/10) || (sum == int_MIN/10 && digit <= -(int_MIN%10))) {
            sum = sum * 10 - digit; // accumulate on the - side
        } else {
            sum = int_MIN;
            errno = ERANGE;
            break; // overflow
        }
        s++;
    }
    if (sign != '-') {
        if (sum < -int_MAX) {
            sum = int_MAX;
            errno = ERANGE;
        } else {
            sum = -sum; // make positive
        }
    }
    return sum;
}

As @Lundin commented about the lack of overflow detection, etc., modeling the string-->int128 conversion after strtol() is a better idea. For simplicity, consider

__int128_t strto__128_base10(const char *s, char **endptr);

This answer already handles overflow and flags errno like strtol(). Just a few changes are needed:

bool digit_found = false;
while (isdigit((unsigned char)*s)) {
    digit_found = true;
    // delete the `break`
    // On overflow, continue looping to get to the end of the digits.

// after the `while()` loop:
if (!digit_found) { // optional test
    errno = EINVAL;
}
if (endptr) {
    *endptr = digit_found ? s : original_s;
}

A full long int strtol(const char *nptr, char **endptr, int base); like functionality would also handle other bases, with special code when base is 0 or 16. @chqrlie
The C Standard does not mandate support for 128-bit integers. Yet they are commonly supported by modern compilers: both gcc and clang support the types __int128_t and __uint128_t, but surprisingly still keep intmax_t and uintmax_t limited to 64 bits.

Beyond the basic arithmetic operators, there is not much support for these large integers, especially in the C library: no scanf() or printf() conversion specifiers, etc.

Here is an implementation of strtoi128(), strtou128() and atoi128() that is consistent with the C Standard's atoi(), strtol() and strtoul() specifications.

#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Change these typedefs for your local flavor of 128-bit integer types */
typedef __int128_t i128;
typedef __uint128_t u128;

static int strdigit__(char c) {
    /* This is ASCII / UTF-8 specific, would not work for EBCDIC */
    return (c >= '0' && c <= '9') ? c - '0'
         : (c >= 'a' && c <= 'z') ? c - 'a' + 10
         : (c >= 'A' && c <= 'Z') ? c - 'A' + 10
         : 255;
}

static u128 strtou128__(const char *p, char **endp, int base) {
    u128 v = 0;
    int digit;

    if (base == 0) {    /* handle octal and hexadecimal syntax */
        base = 10;
        if (*p == '0') {
            base = 8;
            if ((p[1] == 'x' || p[1] == 'X') && strdigit__(p[2]) < 16) {
                p += 2;
                base = 16;
            }
        }
    }
    if (base < 2 || base > 36) {
        errno = EINVAL;
    } else if ((digit = strdigit__(*p)) < base) {
        v = digit;
        /* convert to unsigned 128 bit with overflow control */
        while ((digit = strdigit__(*++p)) < base) {
            u128 v0 = v;
            v = v * base + digit;
            if (v < v0) {
                v = ~(u128)0;
                errno = ERANGE;
            }
        }
        if (endp) {
            *endp = (char *)p;
        }
    }
    return v;
}

u128 strtou128(const char *p, char **endp, int base) {
    if (endp) {
        *endp = (char *)p;
    }
    while (isspace((unsigned char)*p)) {
        p++;
    }
    if (*p == '-') {
        p++;
        return -strtou128__(p, endp, base);
    } else {
        if (*p == '+')
            p++;
        return strtou128__(p, endp, base);
    }
}

i128 strtoi128(const char *p, char **endp, int base) {
    u128 v;

    if (endp) {
        *endp = (char *)p;
    }
    while (isspace((unsigned char)*p)) {
        p++;
    }
    if (*p == '-') {
        p++;
        v = strtou128__(p, endp, base);
        if (v >= (u128)1 << 127) {
            if (v > (u128)1 << 127)
                errno = ERANGE;
            return -(i128)(((u128)1 << 127) - 1) - 1;
        }
        return -(i128)v;
    } else {
        if (*p == '+')
            p++;
        v = strtou128__(p, endp, base);
        if (v >= (u128)1 << 127) {
            errno = ERANGE;
            return (i128)(((u128)1 << 127) - 1);
        }
        return (i128)v;
    }
}

i128 atoi128(const char *p) {
    return strtoi128(p, (char**)NULL, 10);
}

char *utoa128(char *dest, u128 v, int base) {
    char buf[129];
    char *p = buf + 128;
    const char *digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    *p = '\0';
    if (base >= 2 && base <= 36) {
        while (v > (unsigned)base - 1) {
            *--p = digits[v % base];
            v /= base;
        }
        *--p = digits[v];
    }
    return strcpy(dest, p);
}

char *itoa128(char *buf, i128 v, int base) {
    char *p = buf;
    u128 uv = (u128)v;

    if (v < 0) {
        *p++ = '-';
        uv = -uv;
    }
    if (base == 10)
        utoa128(p, uv, 10);
    else if (base == 16)
        utoa128(p, uv, 16);
    else
        utoa128(p, uv, base);
    return buf;
}

static char *perrno(char *buf, int err) {
    switch (err) {
    case EINVAL:
        return strcpy(buf, "EINVAL");
    case ERANGE:
        return strcpy(buf, "ERANGE");
    default:
        sprintf(buf, "%d", err);
        return buf;
    }
}

int main(int argc, char *argv[]) {
    char buf[130];
    char xbuf[130];
    char ebuf[20];
    char *p1, *p2;
    i128 v, v1;
    u128 v2;
    int i;

    for (i = 1; i < argc; i++) {
        printf("%s:\n", argv[i]);
        errno = 0;
        v = atoi128(argv[i]);
        perrno(ebuf, errno);
        printf(" atoi128():   %s 0x%s errno=%s\n",
               itoa128(buf, v, 10), utoa128(xbuf, v, 16), ebuf);
        errno = 0;
        v1 = strtoi128(argv[i], &p1, 0);
        perrno(ebuf, errno);
        printf(" strtoi128(): %s 0x%s endptr:\"%s\" errno=%s\n",
               itoa128(buf, v1, 10), utoa128(xbuf, v1, 16), p1, ebuf);
        errno = 0;
        v2 = strtou128(argv[i], &p2, 0);
        perrno(ebuf, errno);
        printf(" strtou128(): %s 0x%s endptr:\"%s\" errno=%s\n",
               utoa128(buf, v2, 10), utoa128(xbuf, v2, 16), p2, ebuf);
    }
    return 0;
}
C++ ShiftJIS to UTF8 conversion
I need to convert double-byte characters, in my special case Shift-JIS, into something better to handle, preferably with standard C++. The following question ended up without a workaround: Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized. So is there anyone with a suggestion or a reference on how to handle this conversion with standard C++?
Normally I would recommend using the ICU library, but for this alone, using it is way too much overhead.

First, a conversion function which takes an std::string with Shift-JIS data and returns an std::string with UTF-8 (note 2019: no idea anymore if it works :)). It uses a uint8_t array of 25088 elements (25088 bytes), which is used as convTable in the code. The function does not fill this variable; you have to load it from e.g. a file first. The second code part below is a program that can generate the file. The conversion function doesn't check if the input is valid Shift-JIS data.

#include <cstdint>
#include <string>

std::string sj2utf8(const std::string &input)
{
    std::string output(3 * input.length(), ' '); // Shift-JIS won't give 4-byte UTF-8, so max. 3 bytes per input char are needed
    size_t indexInput = 0, indexOutput = 0;

    while(indexInput < input.length())
    {
        char arraySection = ((uint8_t)input[indexInput]) >> 4;

        size_t arrayOffset;
        if(arraySection == 0x8) arrayOffset = 0x100;       // these are two-byte Shift-JIS
        else if(arraySection == 0x9) arrayOffset = 0x1100;
        else if(arraySection == 0xE) arrayOffset = 0x2100;
        else arrayOffset = 0;                              // this is one-byte Shift-JIS

        // determining the real array offset
        if(arrayOffset)
        {
            arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
            indexInput++;
            if(indexInput >= input.length()) break;
        }
        arrayOffset += (uint8_t)input[indexInput++];
        arrayOffset <<= 1;

        // the Unicode number is...
        uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];

        // converting to UTF-8
        if(unicodeValue < 0x80)
        {
            output[indexOutput++] = unicodeValue;
        }
        else if(unicodeValue < 0x800)
        {
            output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
        else
        {
            output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
            output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
    }

    output.resize(indexOutput); // remove the unnecessary bytes
    return output;
}

About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So... either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:

First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . I can't paste this here because of the length, so we have to hope at least unicode.org stays online.

Then use this program while piping/redirecting the above text file in, and redirecting the binary output to a new file. (Needs a binary-safe shell; no idea if it works on Windows.)
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdint>

using namespace std;

// pipe SHIFTJIS.txt in and pipe to (binary) file out
int main()
{
    string s;
    uint8_t *mapping; // same big-endian array as in the converting function
    mapping = new uint8_t[2 * (256 + 3*256*16)];

    // initializing with space for invalid values, and then ASCII control chars
    for(size_t i = 32; i < 256 + 3*256*16; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = 0x20;
    }
    for(size_t i = 0; i < 32; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = i;
    }

    while(getline(cin, s)) // pipe the file SHIFTJIS to stdin
    {
        if(s.substr(0, 2) != "0x") continue; // comment lines

        uint16_t shiftJisValue, unicodeValue;
        if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) // getting hex values
        {
            puts("Error hex reading");
            continue;
        }

        size_t offset; // array offset
        if((shiftJisValue >> 8) == 0) offset = 0;
        else if((shiftJisValue >> 12) == 0x8) offset = 256;
        else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
        else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
        else
        {
            puts("Error input values");
            continue;
        }
        offset = 2 * (offset + (shiftJisValue & 0xfff));

        if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
        {
            puts("Error mapping not 1:1");
            continue;
        }

        mapping[offset] = unicodeValue >> 8;
        mapping[offset + 1] = unicodeValue & 0xff;
    }

    fwrite(mapping, 1, 2 * (256 + 3*256*16), stdout);
    delete[] mapping;
    return 0;
}

Notes:
Two-byte big-endian raw Unicode values (more than two bytes are not necessary here).
First 256 chars (512 bytes) for the single-byte Shift-JIS chars, value 0x20 for invalid ones.
Then 3 * 256*16 chars for the groups 0x8???, 0x9??? and 0xE??? = 25088 bytes.
For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here: https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h

Also, here's a very simple function to convert basic Shift-JIS chars to ASCII:

#include <stdint.h>
#include <string.h>

const char SJIS_REPLACEMENT_TABLE[] =
    " ,.,..:;?!\"*'`*^"
    "-_????????*---/\\"
    "~||--''\"\"()()[]{"
    "}<><>[][][]+-+X?"
    "-==<><>????*'\"CY"
    "$c&%#&*#S*******"
    "*******T><^_'='";

// Convert Shift-JIS characters to their ASCII equivalent
void sjis2ascii(char* bData)
{
    uint16_t ch;
    int i, j = 0;
    int len = strlen(bData);

    for (i = 0; i < len; i += 2)
    {
        ch = (bData[i] << 8) | bData[i+1];

        // 'A' .. 'Z', '0' .. '9'
        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
        {
            bData[j++] = (ch & 0xFF) - 0x1F;
            continue;
        }
        // 'a' .. 'z'
        if (ch >= 0x8281 && ch <= 0x829A)
        {
            bData[j++] = (ch & 0xFF) - 0x20;
            continue;
        }
        if (ch >= 0x8140 && ch <= 0x81AC)
        {
            bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
            continue;
        }
        if (ch == 0x0000)
        {
            // end of the string
            bData[j] = 0;
            return;
        }
        // character not found
        bData[j++] = bData[i];
        bData[j++] = bData[i+1];
    }

    bData[j] = 0;
    return;
}
C++ convert ASCII escaped Unicode string into UTF-8 string
I need to read in a standard ASCII-style string with Unicode escaping and convert it into a std::string containing the UTF-8 encoded equivalent. So for example "\u03a0" (a std::string with 6 characters) should be converted into a std::string with two characters, 0xce and 0xa0 respectively, in raw binary. I would be most happy if there's a simple answer using ICU or Boost, but I haven't been able to find one. (This is similar to Convert a Unicode string to an escaped ASCII string, but NB that I ultimately need to arrive at the UTF-8 encoding. If we can use the Unicode code points as an intermediate step, that's fine.)
Try something like this:

#include <string>
#include <sstream>
#include <cstdint>

std::string to_utf8(uint32_t cp)
{
    /*
    if using C++11 or later, you can do this:

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes( (char32_t)cp );

    Otherwise...
    */

    std::string result;

    int count;
    if (cp <= 0x007F)
        count = 1;
    else if (cp <= 0x07FF)
        count = 2;
    else if (cp <= 0xFFFF)
        count = 3;
    else if (cp <= 0x10FFFF)
        count = 4;
    else
        return result; // or throw an exception

    result.resize(count);

    if (count > 1)
    {
        for (int i = count-1; i > 0; --i)
        {
            result[i] = (char) (0x80 | (cp & 0x3F));
            cp >>= 6;
        }

        for (int i = 0; i < count; ++i)
            cp |= (1 << (7-i));
    }

    result[0] = (char) cp;

    return result;
}

std::string str = ...; // "\\u03a0"

std::string::size_type startIdx = 0;
do
{
    startIdx = str.find("\\u", startIdx);
    if (startIdx == std::string::npos) break;

    std::string::size_type endIdx = str.find_first_not_of("0123456789abcdefABCDEF", startIdx+2);
    if (endIdx == std::string::npos) break;

    std::string tmpStr = str.substr(startIdx+2, endIdx-(startIdx+2));
    std::istringstream iss(tmpStr);

    uint32_t cp;
    if (iss >> std::hex >> cp)
    {
        std::string utf8 = to_utf8(cp);
        str.replace(startIdx, 2+tmpStr.length(), utf8);
        startIdx += utf8.length();
    }
    else
        startIdx += 2;
}
while (true);
(\u03a0 is the Unicode code point for GREEK CAPITAL LETTER PI, whose UTF-8 encoding is 0xCE 0xA0.) You need to: Get the number 0x03a0 from the string "\u03a0": drop the backslash and the u, and parse 03a0 as hex into a wchar_t. Repeat until you get a (wide) string. Then convert 0x3a0 into UTF-8. C++11 has a codecvt_utf8 that may help.
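For the second step, here is a minimal C++11 sketch using std::wstring_convert with std::codecvt_utf8 (note both were deprecated in C++17, though they still work; ICU or Boost.Locale are the longer-term options):

#include <codecvt>
#include <locale>
#include <string>

// Convert one Unicode code point to its UTF-8 byte sequence.
std::string codepoint_to_utf8(char32_t cp) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(cp); // e.g. 0x03A0 -> "\xCE\xA0"
}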
My solution: convert_unicode_escape_sequences(str)
input: "\u043f\u0440\u0438\u0432\u0435\u0442"
output: "привет"

Boost is used for the wchar/char conversion:

#include <boost/locale/encoding_utf.hpp>

using boost::locale::conv::utf_to_utf;

inline uint8_t get_uint8(uint8_t h, uint8_t l)
{
    uint8_t ret;

    if (h - '0' < 10)
        ret = h - '0';
    else if (h - 'A' < 6)
        ret = h - 'A' + 0x0A;
    else if (h - 'a' < 6)
        ret = h - 'a' + 0x0A;

    ret = ret << 4;

    if (l - '0' < 10)
        ret |= l - '0';
    else if (l - 'A' < 6)
        ret |= l - 'A' + 0x0A;
    else if (l - 'a' < 6)
        ret |= l - 'a' + 0x0A;

    return ret;
}

std::string wstring_to_utf8(const std::wstring& str)
{
    return utf_to_utf<char>(str.c_str(), str.c_str() + str.size());
}

std::string convert_unicode_escape_sequences(const std::string& source)
{
    std::wstring ws;
    ws.reserve(source.size());
    std::wstringstream wis(ws);

    auto s = source.begin();
    while (s != source.end())
    {
        if (*s == '\\')
        {
            if (std::distance(s, source.end()) > 5)
            {
                if (*(s + 1) == 'u')
                {
                    unsigned int v = get_uint8(*(s + 2), *(s + 3)) << 8;
                    v |= get_uint8(*(s + 4), *(s + 5));
                    s += 6;
                    wis << boost::numeric_cast<wchar_t>(v);
                    continue;
                }
            }
        }
        wis << wchar_t(*s);
        s++;
    }

    return wstring_to_utf8(wis.str());
}
Optimizing Hexadecimal To Ascii Function in C++
This is a function in C++ that takes a hex string and converts it to its equivalent ASCII characters:

string HEX2STR (string str)
{
    string tmp;
    const char *c = str.c_str();
    unsigned int x;
    while(*c != 0) {
        sscanf(c, "%2X", &x);
        tmp += x;
        c += 2;
    }
    return tmp;
}

If you input the following string:

537461636b6f766572666c6f77206973207468652062657374212121

The output will be:

Stackoverflow is the best!!!

Say I were to input 1,000,000 unique hex strings into this function; it takes a while to compute. Is there a more efficient way to complete this?
Of course. Look up two characters at a time:

#include <string>

unsigned char val(char c)
{
    if ('0' <= c && c <= '9') { return c - '0'; }
    if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
    if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
    throw "Eeek";
}

std::string decode(std::string const & s)
{
    if ((s.size() % 2) != 0) { throw "Eeek"; }

    std::string result;
    result.reserve(s.size() / 2);

    for (std::size_t i = 0; i < s.size() / 2; ++i)
    {
        unsigned char n = val(s[2 * i]) * 16 + val(s[2 * i + 1]);
        result += n;
    }

    return result;
}
Just since I wrote it anyway, this should be fairly efficient :)

#include <string>

const char lookup[32] = {
    0,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,
    0, 1, 2, 3, 4, 5, 6,7,8,9,0,0,0,0,0,0
};

std::string HEX2STR(std::string str)
{
    std::string out;
    out.reserve(str.size()/2);
    const char* tmp = str.c_str();
    unsigned char ch = 0, last = 1;
    while(*tmp)
    {
        ch <<= 4;
        ch |= lookup[*tmp & 0x1f];
        if(last ^= 1)
            out += ch;
        tmp++;
    }
    return out;
}
Don't use sscanf. It's a very general, flexible function, which means it's slow to allow for all those use cases. Instead, walk the string and convert each character yourself; that is much faster.
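In that spirit, the whole conversion can be as small as the sketch below (it assumes well-formed input: hex digits only, even length):

#include <cstddef>
#include <string>

std::string hex2str(const std::string &s) {
    // translate one hex digit to its numeric value
    auto nibble = [](char c) -> unsigned {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return c - 'A' + 10;
    };
    std::string out;
    out.reserve(s.size() / 2);
    for (std::size_t i = 0; i + 1 < s.size(); i += 2)
        out += static_cast<char>((nibble(s[i]) << 4) | nibble(s[i + 1]));
    return out;
}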
This routine takes a string with (what I call) hexwords, often used in embedded ECUs, for example "31 01 7F 33 38 33 37 30 35 31 30 30 20 20 49", and transforms it into readable ASCII where possible. It transforms by taking care of the discontinuity in the ASCII table (0-9: 48-57, A-F: 65-70):

int i, j, len = strlen(stringWithHexWords);
char ascii_buffer[250];
char c1, c2, r;

i = 0;
j = 0;
while (i < len) {
    c1 = stringWithHexWords[i];
    c2 = stringWithHexWords[i+1];
    if ((int)c1 != 32) {  // if space found, skip next section and bump index only once
        // skip scary ASCII codes
        if (32 < (int)c1 && 127 > (int)c1 && 32 < (int)c2 && 127 > (int)c2) {
            // transform by taking the first hexdigit * 16 and adding the second hexdigit,
            // both with the correct offset
            r = (char) (16 * ((int)c1 < 64 ? ((int)c1 - 48) : ((int)c1 - 55))
                           + ((int)c2 < 64 ? ((int)c2 - 48) : ((int)c2 - 55)));
            if (31 < (int)r && 127 > (int)r)
                ascii_buffer[j++] = r;  // check result for readability
        }
        i++;  // bump index
    }
    i++;  // bump index once more for the next hexdigit
}
ascii_bufferCurrentLength = j;

return true;
The hexToString() function will convert a hex string to a readable ASCII string:

#include <sstream>
#include <string>
using namespace std;

int hexCharToInt(char a)
{
    if (a >= '0' && a <= '9')
        return (a - 48);
    else if (a >= 'A' && a <= 'Z')
        return (a - 55);
    else
        return (a - 87);
}

string hexToString(string str)
{
    std::stringstream HexString;
    for (int i = 0; i < str.length(); i++)
    {
        char a = str.at(i++);
        char b = str.at(i);
        int x = hexCharToInt(a);
        int y = hexCharToInt(b);
        HexString << (char)((16 * x) + y);
    }
    return HexString.str();
}
C++: Char iteration over string (I'm going crazy)
I have this string:

std::string str = "presents";

And when I iterate over the characters, they come in this order: spresent. So, the last char comes first. This is the code:

uint16_t c;
printf("%s: ", str.c_str());
for (unsigned int i = 0; i < str.size(); i += extractUTF8_Char(str, i, &c)) {
    printf("%c", c);
}
printf("\n");

And this is the extract method:

uint8_t extractUTF8_Char(string line, int offset, uint16_t *target) {
    uint8_t ch = uint8_t(line.at(offset));
    if ((ch & 0xC0) == 0xC0) {
        if (!target) {
            return 2;
        }
        uint8_t ch2 = uint8_t(line.at(offset + 1));
        uint16_t fullCh = (uint16_t(((ch & 0x1F) >> 2)) << 8) | ((ch & 0x3) << 0x6) | (ch2 & 0x3F);
        *target = fullCh;
        return 2;
    }
    if (target) {
        *target = ch;
    }
    return 1;
}

This method returns the length of the character in bytes: 1 or 2. And if the length is 2 bytes, it extracts the Unicode code point from the UTF-8 string.
Your first printf is printing nonsense (the initial value of c), and the last c fetched is never printed. This is because the call to extractUTF8_Char occurs in the last clause of the for statement, which runs after the loop body. You might want to change it to

for (unsigned int i = 0; i < str.size();) {
    i += extractUTF8_Char(str, i, &c);
    printf("%c", c);
}

instead.