How can I use argv values with int128_t support? I know about atoi() and the family of functions exposed by <cstdlib>, but I cannot find one for the int128_t fixed-width integer. This may be because the type is not part of either the C or the C++ standard, but is there any way for me to make this code work?
#include <iostream>
int main(int argc, char **argv) {
__int128_t value = atoint128_t(argv[1]);
}
Almost all the answers posted are good enough for me, but I'm selecting the one that is a drop-in solution for my current code, so do look at the other ones too.
Here's a simple way of implementing this:
__int128_t atoint128_t(const char *s)
{
const char *p = s;
__int128_t val = 0;
if (*p == '-' || *p == '+') {
p++;
}
while (*p >= '0' && *p <= '9') {
val = (10 * val) + (*p - '0');
p++;
}
if (*s == '-') val = val * -1;
return val;
}
This code checks each character to see if it's a digit (with an optional leading + or -), and if so it multiplies the current result by 10 and adds the value associated with that digit. It then inverts the sign if need be.
Note that this implementation does not check for overflow, which is consistent with the behavior of atoi.
EDIT:
Revised implementation that covers the minimum __int128_t value by either adding or subtracting the value of each digit based on the sign, and that skips leading whitespace:
__int128_t atoint128_t(const char *s)
{
const char *p = s;
int neg = 0;
__int128_t val = 0; // must be 128 bits wide, otherwise the result is truncated
while ((*p == '\n') || (*p == '\t') || (*p == ' ') ||
(*p == '\f') || (*p == '\r') || (*p == '\v')) {
p++;
}
if ((*p == '-') || (*p == '+')) {
if (*p == '-') {
neg = 1;
}
p++;
}
while (*p >= '0' && *p <= '9') {
if (neg) {
val = (10 * val) - (*p - '0');
} else {
val = (10 * val) + (*p - '0');
}
p++;
}
return val;
}
Here is a C++ implementation:
#include <cctype>
#include <string>
#include <stdexcept>
__int128_t atoint128_t(std::string const & in)
{
__int128_t res = 0;
size_t i = 0;
bool sign = false;
if (in[i] == '-')
{
++i;
sign = true;
}
if (in[i] == '+')
{
++i;
}
for (; i < in.size(); ++i)
{
const char c = in[i];
if (not std::isdigit(static_cast<unsigned char>(c)))
throw std::runtime_error(std::string("Non-numeric character: ") + c);
res *= 10;
res += c - '0';
}
if (sign)
{
res *= -1;
}
return res;
}
int main()
{
__int128_t a = atoint128_t("170141183460469231731687303715884105727");
}
If you want to test it then there is a stream operator here.
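Since the standard streams have no overload for __int128_t, some helper is needed to look at the result. The following is a minimal sketch rather than the linked operator: the name to_string128 is hypothetical, and it assumes GCC or Clang, where __int128_t and __uint128_t are available.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: formats a __int128_t as a decimal string.
// It works on the unsigned magnitude so that the minimum value does not overflow.
std::string to_string128(__int128_t v)
{
    const bool negative = v < 0;
    // Negation in unsigned arithmetic is well defined, even for the minimum value.
    __uint128_t mag = negative ? -static_cast<__uint128_t>(v)
                               : static_cast<__uint128_t>(v);
    std::string digits;
    do {
        // Collect the digits least significant first.
        digits += static_cast<char>('0' + static_cast<int>(mag % 10));
        mag /= 10;
    } while (mag != 0);
    if (negative)
        digits += '-';
    std::reverse(digits.begin(), digits.end());
    return digits;
}
```

With this, std::cout << to_string128(a) prints the converted value in decimal.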
Performance
I ran a few performance tests. I generated 100,000 random numbers uniformly distributed over the entire support of __int128_t, then converted each of them 2000 times. All of these (200,000,000) conversions were completed within ~12 seconds.
Using this code:
#include <iostream>
#include <string>
#include <random>
#include <vector>
#include <chrono>
int main()
{
std::mt19937 gen(0);
std::uniform_int_distribution<> num(0, 9);
std::uniform_int_distribution<> len(1, 38);
std::uniform_int_distribution<> sign(0, 1);
std::vector<std::string> str;
for (int i = 0; i < 100000; ++i)
{
std::string s;
int l = len(gen);
if (sign(gen))
s += '-';
for (int u = 0; u < l; ++u)
s += std::to_string(num(gen));
str.emplace_back(s);
}
namespace sc = std::chrono;
auto start = sc::duration_cast<sc::microseconds>(sc::high_resolution_clock::now().time_since_epoch()).count();
__int128_t b = 0;
for (int u = 0; u < 200; ++u)
{
for (int i = 0; i < str.size(); ++i)
{
__int128_t a = atoint128_t(str[i]);
b += a;
}
}
auto time = sc::duration_cast<sc::microseconds>(sc::high_resolution_clock::now().time_since_epoch()).count() - start;
std::cout << time / 1000000. << 's' << std::endl;
}
Adding here a "not-so-naive" implementation in pure C, it's still kind of simple:
#include <stdio.h>
#include <inttypes.h>
__int128 atoi128(const char *s)
{
while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '+') ++s;
int sign = 1;
if (*s == '-')
{
++s;
sign = -1;
}
size_t digits = 0;
while (s[digits] >= '0' && s[digits] <= '9') ++digits;
char scratch[digits];
for (size_t i = 0; i < digits; ++i) scratch[i] = s[i] - '0';
size_t scanstart = 0;
__int128 result = 0;
__int128 mask = 1;
while (scanstart < digits)
{
if (scratch[digits-1] & 1) result |= mask;
mask <<= 1;
for (size_t i = digits-1; i > scanstart; --i)
{
scratch[i] >>= 1;
if (scratch[i-1] & 1) scratch[i] |= 8;
}
scratch[scanstart] >>= 1;
while (scanstart < digits && !scratch[scanstart]) ++scanstart;
for (size_t i = scanstart; i < digits; ++i)
{
if (scratch[i] > 7) scratch[i] -= 3;
}
}
return result * sign;
}
int main(int argc, char **argv)
{
if (argc > 1)
{
__int128 x = atoi128(argv[1]);
printf("%" PRIi64 "\n", (int64_t)x); // just for demo with smaller numbers
}
}
It reads the number bit by bit, using a shifted BCD scratch space, see Double dabble for the algorithm (it's reversed here). This is a lot more efficient than doing many multiplications by 10 in general. *)
This relies on VLAs, without them, you can replace
char scratch[digits];
with
char *scratch = malloc(digits);
if (!scratch) return 0;
and add a
free(scratch);
at the end of the function.
Of course, the code above has the same limitations as the original atoi() (e.g. it will produce "random" garbage on overflow and has no way to check for that). If you need strtol()-style guarantees and error checking, extend it yourself (not a big problem, just work to do).
*) Of course, implementing double dabble in C always suffers from the fact you can't use "hardware carries", so there are extra bit masking and testing operations necessary. On the other hand, "naively" multiplying by 10 can be very efficient, as long as the platform provides multiplication instructions with a width "close" to your target type. Therefore, on your typical x86_64 platform (which has instructions for multiplying 64bit integers), this code is probably a lot slower than the naive decimal method. But it scales much better to really huge integers (which you would implement e.g. using arrays of uintmax_t).
is there any way for me to make this code work?
"What about implementing your own atoint128_t ?" @Marian
It is not too hard to roll your own atoint128_t().
Points to consider.
There is 0 or 1 more representable negative value than positive values. Accumulating the value using negative numbers provides more range.
Overflow is not defined for atoi(). Perhaps provide a capped value and set errno? Detecting potential OF prevents UB.
__int128_t constants need careful code to form correctly.
How to handle unusual input? atoi() is fairly loose and made sense years ago for speed/size, yet less UB is usually desired these days. Candidate cases: "", " ", "-", "z", "+123", "999..many...999", "the min int128", "locale_specific_space" + " 123" or even non-string NULL.
Code to do atoi() and atoint128_t() need only vary on the type, range, and names. The algorithm is the same.
#if 1
#define int_t __int128_t
#define int_MAX (((__int128_t)0x7FFFFFFFFFFFFFFF << 64) + 0xFFFFFFFFFFFFFFFF)
#define int_MIN (-1 - int_MAX)
#define int_atoi atoint128_t
#else
#define int_t int
#define int_MAX INT_MAX
#define int_MIN INT_MIN
#define int_atoi int_atoi
#endif
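The point about forming __int128_t constants can also be expressed with typed constants instead of macros. This is only a sketch under the assumption of GCC or Clang (two's complement, no 128-bit literal suffix); the names kInt128Max and kInt128Min are hypothetical.

```cpp
#include <cstdint>

// There is no literal suffix for 128-bit integers, so the limits are
// assembled from 64-bit pieces: the high half is INT64_MAX, the low half all ones.
constexpr __int128_t kInt128Max =
    static_cast<__int128_t>((static_cast<__uint128_t>(INT64_MAX) << 64) | UINT64_MAX);
// Derived from the maximum to avoid negating a value that has no positive counterpart.
constexpr __int128_t kInt128Min = -kInt128Max - 1;

static_assert(kInt128Max > 0, "maximum must be positive");
static_assert(kInt128Min < 0, "minimum must be negative");
static_assert(kInt128Min + kInt128Max == -1, "one more negative value than positive");
```

The last static_assert restates why accumulating on the negative side gives more range: the minimum has no positive counterpart.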
Sample code: Tailor as needed. Relies on C99 or later negative/positive and % functionality.
int_t int_atoi(const char *s) {
if (s == NULL) { // could omit this test
errno = EINVAL;
return 0;
}
while (isspace((unsigned char ) *s)) { // skip same leading white space like atoi()
s++;
}
char sign = *s; // remember if the sign was `-` for later
if (sign == '-' || sign == '+') {
s++;
}
int_t sum = 0;
while (isdigit((unsigned char)*s)) {
int digit = *s - '0';
if ((sum > int_MIN/10) || (sum == int_MIN/10 && digit <= -(int_MIN%10))) {
sum = sum * 10 - digit; // accumulate on the - side
} else {
sum = int_MIN;
errno = ERANGE;
break; // overflow
}
s++;
}
if (sign != '-') {
if (sum < -int_MAX) {
sum = int_MAX;
errno = ERANGE;
} else {
sum = -sum; // Make positive
}
}
return sum;
}
As @Lundin commented about the lack of overflow detection, etc.: modeling the string-->int128 conversion after strtol() is a better idea.
For simplicity, consider __int128_t strto__128_base10(const char *s, char **endptr);
This answer already handles overflow and flags errno like strtol(). It just needs a few changes:
bool digit_found = false;
while (isdigit((unsigned char)*s)) {
digit_found = true;
// delete the `break`
// On overflow, continue looping to get to the end of the digits.
// break;
// after the `while()` loop:
if (!digit_found) { // optional test
errno = EINVAL;
}
if (endptr) {
*endptr = digit_found ? s : original_s;
}
A full long int strtol(const char *nptr, char **endptr, int base); like functionality would also handle other bases with special code when base is 0 or 16. @chqrlie
The C Standard does not mandate support for 128-bit integers.
Yet they are commonly supported by modern compilers: both gcc and clang support the types __int128_t and __uint128_t, but surprisingly still keep intmax_t and uintmax_t limited to 64 bits.
Beyond the basic arithmetic operators, there is not much support for these large integers, especially in the C library: no scanf() or printf() conversion specifiers, etc.
Here is an implementation of strtoi128(), strtou128() and atoi128() that is consistent with the C Standard's atoi(), strtol() and strtoul() specifications.
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Change these typedefs for your local flavor of 128-bit integer types */
typedef __int128_t i128;
typedef __uint128_t u128;
static int strdigit__(char c) {
/* This is ASCII / UTF-8 specific, would not work for EBCDIC */
return (c >= '0' && c <= '9') ? c - '0'
: (c >= 'a' && c <= 'z') ? c - 'a' + 10
: (c >= 'A' && c <= 'Z') ? c - 'A' + 10
: 255;
}
static u128 strtou128__(const char *p, char **endp, int base) {
u128 v = 0;
int digit;
if (base == 0) { /* handle octal and hexadecimal syntax */
base = 10;
if (*p == '0') {
base = 8;
if ((p[1] == 'x' || p[1] == 'X') && strdigit__(p[2]) < 16) {
p += 2;
base = 16;
}
}
}
if (base < 2 || base > 36) {
errno = EINVAL;
} else
if ((digit = strdigit__(*p)) < base) {
v = digit;
/* convert to unsigned 128 bit with overflow control */
while ((digit = strdigit__(*++p)) < base) {
/* test before multiplying: a wrapped product is not always smaller than v */
if (v > (~(u128)0 - digit) / base) {
v = ~(u128)0;
errno = ERANGE;
} else {
v = v * base + digit;
}
}
if (endp) {
*endp = (char *)p;
}
}
return v;
}
u128 strtou128(const char *p, char **endp, int base) {
if (endp) {
*endp = (char *)p;
}
while (isspace((unsigned char)*p)) {
p++;
}
if (*p == '-') {
p++;
return -strtou128__(p, endp, base);
} else {
if (*p == '+')
p++;
return strtou128__(p, endp, base);
}
}
i128 strtoi128(const char *p, char **endp, int base) {
u128 v;
if (endp) {
*endp = (char *)p;
}
while (isspace((unsigned char)*p)) {
p++;
}
if (*p == '-') {
p++;
v = strtou128__(p, endp, base);
if (v >= (u128)1 << 127) {
if (v > (u128)1 << 127)
errno = ERANGE;
return -(i128)(((u128)1 << 127) - 1) - 1;
}
return -(i128)v;
} else {
if (*p == '+')
p++;
v = strtou128__(p, endp, base);
if (v >= (u128)1 << 127) {
errno = ERANGE;
return (i128)(((u128)1 << 127) - 1);
}
return (i128)v;
}
}
i128 atoi128(const char *p) {
return strtoi128(p, (char**)NULL, 10);
}
char *utoa128(char *dest, u128 v, int base) {
char buf[129];
char *p = buf + 128;
const char *digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
*p = '\0';
if (base >= 2 && base <= 36) {
while (v > (unsigned)base - 1) {
*--p = digits[v % base];
v /= base;
}
*--p = digits[v];
}
return strcpy(dest, p);
}
char *itoa128(char *buf, i128 v, int base) {
char *p = buf;
u128 uv = (u128)v;
if (v < 0) {
*p++ = '-';
uv = -uv;
}
if (base == 10)
utoa128(p, uv, 10);
else
if (base == 16)
utoa128(p, uv, 16);
else
utoa128(p, uv, base);
return buf;
}
static char *perrno(char *buf, int err) {
switch (err) {
case EINVAL:
return strcpy(buf, "EINVAL");
case ERANGE:
return strcpy(buf, "ERANGE");
default:
sprintf(buf, "%d", err);
return buf;
}
}
int main(int argc, char *argv[]) {
char buf[130];
char xbuf[130];
char ebuf[20];
char *p1, *p2;
i128 v, v1;
u128 v2;
int i;
for (i = 1; i < argc; i++) {
printf("%s:\n", argv[i]);
errno = 0;
v = atoi128(argv[i]);
perrno(ebuf, errno);
printf(" atoi128(): %s 0x%s errno=%s\n",
itoa128(buf, v, 10), utoa128(xbuf, v, 16), ebuf);
errno = 0;
v1 = strtoi128(argv[i], &p1, 0);
perrno(ebuf, errno);
printf(" strtoi128(): %s 0x%s endptr:\"%s\" errno=%s\n",
itoa128(buf, v1, 10), utoa128(xbuf, v1, 16), p1, ebuf);
errno = 0;
v2 = strtou128(argv[i], &p2, 0);
perrno(ebuf, errno);
printf(" strtou128(): %s 0x%s endptr:\"%s\" errno=%s\n",
utoa128(buf, v2, 10), utoa128(xbuf, v2, 16), p2, ebuf);
}
return 0;
}
I have a typical use case where I need to convert unsigned char values from hexadecimal text to their numeric values.
For example,
unsigned char *pBuffer = (unsigned char *)pvaluefromClient; //where pvaluefromClient is received from a client
The length of pBuffer is 32 bytes and it holds the value as follows,
(gdb) p pBuffer
$5 = (unsigned char *) 0x7fd4b82cead0 "EBA5F7304554DCC3702E06182AB1D487"
(gdb) n
STEP 1: I need to split this pBuffer value as follows,
{EB,A5,F7,30,45,54,DC,C3,70,2E,06,18,2A,B1,D4,87 }
STEP 2: I need to convert the above split values to decimal as follows,
const unsigned char pConvertedBuffer[16] = {
235,165,247,48,69,84,220,195,112,46,6,24,42,177,212,135
};
Any idea on how to achieve STEP 1 and STEP 2? Any help on this would be highly appreciated.
How about something like this:
unsigned char *pBuffer = (unsigned char *)pvaluefromClient; //where pvaluefromClient is received from a client
int i;
unsigned int j; // %X expects a pointer to unsigned int
unsigned char target[16];
for(i=0;i<32;i+=2)
{
sscanf((char*)&pBuffer[i], "%2X", &j); // read exactly two hex digits
target[i/2] = (unsigned char)j;
}
You can create a function that takes two unsigned chars as parameter and returns another unsigned char. The two parameters are the chars (E and B for the first byte). The returned value would be the numerical value of the byte.
The logic would be :
unsigned char hex2byte(unsigned char uchar1, unsigned char uchar2) {
unsigned char returnValue = 0;
if((uchar1 >= '0') && (uchar1 <= '9')) {
returnValue = uchar1 - 0x30; //0x30 = '0'
}
else if((uchar1 >= 'a') && (uchar1 <= 'f')) {
returnValue = uchar1 - 0x61 + 0x0A; //0x61 = 'a'
}
else if((uchar1 >= 'A') && (uchar1 <= 'F')) {
returnValue = uchar1 - 0x41 + 0x0A; //0x41 = 'A'
}
if((uchar2 >= '0') && (uchar2 <= '9')) {
returnValue = (returnValue <<4) + (uchar2 - 0x30); //0x30 = '0'; one hex digit is 4 bits
}
else if((uchar2 >= 'a') && (uchar2 <= 'f')) {
returnValue = (returnValue <<4) + (uchar2 - 0x61 + 0x0A); //0x61 = 'a'
}
else if((uchar2 >= 'A') && (uchar2 <= 'F')) {
returnValue = (returnValue <<4) + (uchar2 - 0x41 + 0x0A); //0x41 = 'A'
}
return returnValue;
}
The basic idea is to calculate the numerical value of the chars and to reassemble a number from two chars (hence the bit shift)
I'm pretty sure there are multiple more elegant solutions than mine here and there.
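The same idea can be applied to the whole buffer at once. The sketch below is one possible variant, not the code from the answer above: hexDigitValue and hexToBytes are hypothetical names, and invalid input is reported by returning an empty vector, which is an arbitrary choice made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Returns the value of one hex digit, or -1 if the character is not a hex digit.
int hexDigitValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decodes pairs of hex digits into bytes.
// Returns an empty vector if the length is odd or a non-hex character appears.
std::vector<std::uint8_t> hexToBytes(const std::string &hex)
{
    std::vector<std::uint8_t> bytes;
    if (hex.size() % 2 != 0) return bytes;
    bytes.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        const int hi = hexDigitValue(hex[i]);
        const int lo = hexDigitValue(hex[i + 1]);
        if (hi < 0 || lo < 0) return {};
        // One hex digit is 4 bits, so the high digit is shifted left by 4.
        bytes.push_back(static_cast<std::uint8_t>((hi << 4) | lo));
    }
    return bytes;
}
```

Calling hexToBytes("EBA5F7304554DCC3702E06182AB1D487") yields the 16 values listed in STEP 2 of the question, starting with 235 and ending with 135.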
#include <iostream>
#include <sstream>
void Conversion(char *pBuffer, int *ConvertedBuffer)
{
int j = 0;
for(int i = 0; i < 32; i += 2)
{
std::stringstream ss;
char sz[4] = {0};
sz[0] = pBuffer[i];
sz[1] = pBuffer[i+1];
sz[2] = 0;
ss << std::hex << sz;
ss >> ConvertedBuffer[j];
++j;
}
}
int main()
{
char Buffer[] = "EBA5F7304554DCC3702E06182AB1D487";
int ConvertedBuffer[16];
Conversion(Buffer, ConvertedBuffer);
for(int i = 0; i < 16; ++i)
{
std::cout << ConvertedBuffer[i] << " ";
}
return 0;
}
//output:
235 165 247 48 69 84 220 195 112 46 6 24 42 177 212 135
When I input
0x123456789
I get incorrect outputs, I can't figure out why. At first I thought it was a max possible int value problem, but I changed my variables to unsigned long and the problem was still there.
#include <iostream>
using namespace std;
long htoi(char s[]);
int main()
{
cout << "Enter Hex \n";
char hexstring[20];
cin >> hexstring;
cout << htoi(hexstring) << "\n";
}
//Converts string to hex
long htoi(char s[])
{
int charsize = 0;
while (s[charsize] != '\0')
{
charsize++;
}
int base = 1;
unsigned long total = 0;
unsigned long multiplier = 1;
for (int i = charsize; i >= 0; i--)
{
if (s[i] == '0' || s[i] == 'x' || s[i] == 'X' || s[i] == '\0')
{
continue;
}
if ( (s[i] >= '0') && (s[i] <= '9') )
{
total = total + ((s[i] - '0') * multiplier);
multiplier = multiplier * 16UL;
continue;
}
if ((s[i] >= 'A') && (s[i] <= 'F'))
{
total = total + ((s[i] - '7') * multiplier); //'7' equals 55 in decimal, while 'A' equals 65
multiplier = multiplier * 16UL;
continue;
}
if ((s[i] >= 'a') && (s[i] <= 'f'))
{
total = total + ((s[i] - 'W') * multiplier); //W equals 87 in decimal, while 'a' equals 97
multiplier = multiplier * 16UL;
continue;
}
}
return total;
}
long probably is 32 bits on your computer as well. Try long long.
You need more than 32 bits to store that number. Your long type could well be as small as 32 bits.
Use a std::uint64_t instead. This is always a 64 bit unsigned type. If your compiler doesn't support that, use a long long. That must be at least 64 bits.
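If a hand-written loop is not required, the standard library can already parse the value into a 64-bit type. This is a small sketch; parseHex64 is a hypothetical wrapper name, while std::stoull is the standard function from <string>.

```cpp
#include <cstdint>
#include <string>

// Hypothetical wrapper around std::stoull. With base 16 an optional
// "0x" or "0X" prefix is accepted. std::stoull throws std::invalid_argument
// or std::out_of_range on bad input, which this sketch does not catch.
std::uint64_t parseHex64(const std::string &text)
{
    static_assert(sizeof(unsigned long long) >= sizeof(std::uint64_t),
                  "unsigned long long must hold 64 bits");
    return static_cast<std::uint64_t>(std::stoull(text, nullptr, 16));
}
```

For the input from the question, parseHex64("0x123456789") returns 4886718345, which does not fit in 32 bits.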
The idea follows the polynomial nature of a number. 123 is the same as
1*10^2 + 2*10^1 + 3*10^0
In other words, I had to multiply the first digit by ten two times. I had to multiply 2 by ten one time. And I multiplied the last digit by one. Again, reading from left to right:
Multiply zero by ten and add the 1 → 0*10+1 = 1.
Multiply that by ten and add the 2 → 1*10+2 = 12.
Multiply that by ten and add the 3 → 12*10+3 = 123.
We will do the same thing:
#include <cctype>
#include <ciso646>
#include <iostream>
#include <string>
using namespace std;
unsigned long long hextodec( const std::string& s )
{
unsigned long long result = 0;
for (char c : s)
{
result *= 16;
if (isdigit( c )) result |= c - '0';
else result |= toupper( c ) - 'A' + 10;
}
return result;
}
int main( int argc, char** argv )
{
cout << hextodec( argv[1] ) << "\n";
}
You may notice that the function is more than three lines. I did that for clarity. C++ idioms can make that loop a single line:
for (char c : s)
result = (result << 4) | (isdigit( c ) ? (c - '0') : (toupper( c ) - 'A' + 10));
You can also do validation if you like. What I have presented is not the only way to do the digit-to-value conversion. There exist other methods that are just as good (and some that are better).
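As one illustration of what such validation might look like, here is a sketch that rejects empty input, non-hex characters and values that do not fit. The name hextodecChecked and the bool-plus-out-parameter interface are assumptions made for this example.

```cpp
#include <cctype>
#include <climits>
#include <string>

// Hypothetical validating variant: returns false on empty input, on any
// non-hex character (including an "0x" prefix), or when the value would
// not fit in unsigned long long. On success the value is stored in out.
bool hextodecChecked(const std::string &s, unsigned long long &out)
{
    if (s.empty()) return false;
    unsigned long long result = 0;
    for (char c : s) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (!std::isxdigit(uc)) return false;
        // Shifting left by 4 would drop the top 4 bits if any of them is set.
        if (result >> (sizeof(result) * CHAR_BIT - 4)) return false;
        const unsigned digit = std::isdigit(uc) ? uc - '0'
                                                : std::toupper(uc) - 'A' + 10;
        result = (result << 4) | digit;
    }
    out = result;
    return true;
}
```

The caller checks the return value before using out, for example: unsigned long long v; if (hextodecChecked(argv[1], v)) cout << v << "\n";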
I do hope this helps.
I found out what was happening: when I input "1234567890" it would skip over the '0', so I had to modify the code. The other problem was that long was indeed 32 bits, so I changed it to uint64_t as suggested by @Bathsheba. Here's the final working code.
#include <cstdint>
#include <iostream>
using namespace std;
uint64_t htoi(char s[]);
int main()
{
char hexstring[20];
cin >> hexstring;
cout << htoi(hexstring) << "\n";
}
//Converts a hex string to its integer value
uint64_t htoi(char s[])
{
int charsize = 0;
while (s[charsize] != '\0')
{
charsize++;
}
int base = 1;
uint64_t total = 0;
uint64_t multiplier = 1;
for (int i = charsize; i >= 0; i--)
{
if (s[i] == 'x' || s[i] == 'X' || s[i] == '\0')
{
continue;
}
if ( (s[i] >= '0') && (s[i] <= '9') )
{
total = total + ((uint64_t)(s[i] - '0') * multiplier);
multiplier = multiplier * 16;
continue;
}
if ((s[i] >= 'A') && (s[i] <= 'F'))
{
total = total + ((uint64_t)(s[i] - '7') * multiplier); //'7' equals 55 in decimal, while 'A' equals 65
multiplier = multiplier * 16;
continue;
}
if ((s[i] >= 'a') && (s[i] <= 'f'))
{
total = total + ((uint64_t)(s[i] - 'W') * multiplier); //W equals 87 in decimal, while 'a' equals 97
multiplier = multiplier * 16;
continue;
}
}
return total;
}
I need to convert double-byte characters, in my case Shift-JIS, into something that is easier to handle, preferably with standard C++.
The following question ended up without a workaround:
Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized
So does anyone have a suggestion or a reference on how to handle this conversion with standard C++?
Normally I would recommend using the ICU library, but for this alone, using it is way too much overhead.
First, a conversion function which takes a std::string with Shift-JIS data and returns a std::string with UTF-8 (note 2019: no idea anymore if it works :))
It uses a uint8_t array of 25088 elements (25088 bytes), which appears as convTable in the code. The function does not fill this variable; you have to load it first, e.g. from a file. The second code part below is a program that can generate the file.
The conversion function doesn't check if the input is valid ShiftJIS data.
std::string sj2utf8(const std::string &input)
{
std::string output(3 * input.length(), ' '); //ShiftJis won't give 4byte UTF8, so max. 3 byte per input char are needed
size_t indexInput = 0, indexOutput = 0;
while(indexInput < input.length())
{
char arraySection = ((uint8_t)input[indexInput]) >> 4;
size_t arrayOffset;
if(arraySection == 0x8) arrayOffset = 0x100; //these are two-byte shiftjis
else if(arraySection == 0x9) arrayOffset = 0x1100;
else if(arraySection == 0xE) arrayOffset = 0x2100;
else arrayOffset = 0; //this is one byte shiftjis
//determining real array offset
if(arrayOffset)
{
arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
indexInput++;
if(indexInput >= input.length()) break;
}
arrayOffset += (uint8_t)input[indexInput++];
arrayOffset <<= 1;
//unicode number is...
uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];
//converting to UTF8
if(unicodeValue < 0x80)
{
output[indexOutput++] = unicodeValue;
}
else if(unicodeValue < 0x800)
{
output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
}
else
{
output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
}
}
output.resize(indexOutput); //remove the unnecessary bytes
return output;
}
About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So... either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:
First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . I can't paste this here because of the length, so we have to hope at least unicode.org stays online.
Then use this program while piping/redirecting above text file in, and redirecting the binary output to a new file. (Needs a binary-safe shell, no idea if it works on Windows).
#include<iostream>
#include<string>
#include<cstdio>
using namespace std;
// pipe SHIFTJIS.txt in and pipe to (binary) file out
int main()
{
string s;
uint8_t *mapping; //same bigendian array as in converting function
mapping = new uint8_t[2*(256 + 3*256*16)];
//initializing with space for invalid value, and then ASCII control chars
for(size_t i = 32; i < 256 + 3*256*16; i++)
{
mapping[2 * i] = 0;
mapping[2 * i + 1] = 0x20;
}
for(size_t i = 0; i < 32; i++)
{
mapping[2 * i] = 0;
mapping[2 * i + 1] = i;
}
while(getline(cin, s)) //pipe the file SHIFTJIS to stdin
{
if(s.substr(0, 2) != "0x") continue; //comment lines
uint16_t shiftJisValue, unicodeValue;
if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) //getting hex values
{
puts("Error hex reading");
continue;
}
size_t offset; //array offset
if((shiftJisValue >> 8) == 0) offset = 0;
else if((shiftJisValue >> 12) == 0x8) offset = 256;
else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
else
{
puts("Error input values");
continue;
}
offset = 2 * (offset + (shiftJisValue & 0xfff));
if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
{
puts("Error mapping not 1:1");
continue;
}
mapping[offset] = unicodeValue >> 8;
mapping[offset + 1] = unicodeValue & 0xff;
}
fwrite(mapping, 1, 2*(256 + 3*256*16), stdout);
delete[] mapping;
return 0;
}
Notes:
Two-byte big endian raw unicode values (more than two byte not necessary here)
First 256 chars (512 byte) for the single byte ShiftJIS chars, value 0x20 for invalid ones.
Then 3 * 256*16 chars for the groups 0x8???, 0x9??? and 0xE???
= 25088 byte
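The layout described in these notes can be summarized as an offset calculation. The following sketch mirrors the branches of the generator program; sjisTableOffset is a hypothetical name, and returning the largest size_t value for uncovered lead bytes is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the byte offset of a character's two-byte,
// big-endian entry in the 25088-byte table, given the Shift-JIS value as a
// 16-bit number (single-byte characters have a zero high byte).
std::size_t sjisTableOffset(std::uint16_t sjis)
{
    std::size_t entry;
    if ((sjis >> 8) == 0)         entry = 0;                  // single-byte block
    else if ((sjis >> 12) == 0x8) entry = 256;                // group 0x8???
    else if ((sjis >> 12) == 0x9) entry = 256 + 16 * 256;     // group 0x9???
    else if ((sjis >> 12) == 0xE) entry = 256 + 2 * 16 * 256; // group 0xE???
    else return static_cast<std::size_t>(-1);                 // not covered by the table
    // Each entry occupies two bytes.
    return 2 * (entry + (sjis & 0xfff));
}
```

For example, the two-byte character 0x82A0 maps to entry 256 + 0x2A0 = 928, that is byte offset 1856, and the last possible entry (0xEFFF) sits at byte offset 25086, which matches the total size of 25088 bytes.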
For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here:
https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h
Also, here's a very simple function to convert basic Shift-JIS chars to ASCII:
const char SJIS_REPLACEMENT_TABLE[] =
" ,.,..:;?!\"*'`*^"
"-_????????*---/\\"
"~||--''\"\"()()[]{"
"}<><>[][][]+-+X?"
"-==<><>????*'\"CY"
"$c&%#&*#S*******"
"*******T><^_'='";
//Convert Shift-JIS characters to ASCII equivalent
void sjis2ascii(char* bData)
{
uint16_t ch;
int i, j = 0;
int len = strlen(bData);
for (i = 0; i < len; i += 2)
{
ch = ((uint8_t)bData[i] << 8) | (uint8_t)bData[i+1]; // cast first: plain char may be signed
// 'A' .. 'Z'
// '0' .. '9'
if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
{
bData[j++] = (ch & 0xFF) - 0x1F;
continue;
}
// 'a' .. 'z'
if (ch >= 0x8281 && ch <= 0x829A)
{
bData[j++] = (ch & 0xFF) - 0x20;
continue;
}
if (ch >= 0x8140 && ch <= 0x81AC)
{
bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
continue;
}
if (ch == 0x0000)
{
//End of the string
bData[j] = 0;
return;
}
// Character not found
bData[j++] = bData[i];
bData[j++] = bData[i+1];
}
bData[j] = 0;
return;
}
This is a function in C++ that takes a hex string and converts each pair of hex digits to its equivalent ASCII character.
string HEX2STR (string str)
{
string tmp;
const char *c = str.c_str();
unsigned int x;
while(*c != 0) {
sscanf(c, "%2X", &x);
tmp += x;
c += 2;
}
return tmp;
}
If you input the following string:
537461636b6f766572666c6f77206973207468652062657374212121
The output will be:
Stackoverflow is the best!!!
Say I were to input 1,000,000 unique hex strings into this function; it takes a while to compute.
Is there a more efficient way to complete this?
Of course. Look up two characters at a time:
unsigned char val(char c)
{
if ('0' <= c && c <= '9') { return c - '0'; }
if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
throw "Eeek";
}
std::string decode(std::string const & s)
{
if ((s.size() % 2) != 0) { throw "Eeek"; }
std::string result;
result.reserve(s.size() / 2);
for (std::size_t i = 0; i < s.size() / 2; ++i)
{
unsigned char n = val(s[2 * i]) * 16 + val(s[2 * i + 1]);
result += n;
}
return result;
}
Just since I wrote it anyway, this should be fairly efficient :)
const char lookup[32] =
{0,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0};
std::string HEX2STR(std::string str)
{
std::string out;
out.reserve(str.size()/2);
const char* tmp = str.c_str();
unsigned char ch = 0, last = 1; // initialize ch so the first shift does not read an indeterminate value
while(*tmp)
{
ch <<= 4;
ch |= lookup[*tmp&0x1f];
if(last ^= 1)
out += ch;
tmp++;
}
return out;
}
Don't use sscanf. It's a very general, flexible function, which means it's slow because it has to allow for all those use cases. Instead, walk the string and convert each character yourself, which is much faster.
This routine takes a string with (what I call) hexwords, often used in embedded ECUs, for example "31 01 7F 33 38 33 37 30 35 31 30 30 20 20 49", and transforms it into readable ASCII where possible.
It transforms by taking care of the discontinuity in the ASCII table (0-9: 48-57, A-F: 65-70):
int i,j, len=strlen(stringWithHexWords);
char ascii_buffer[250];
char c1, c2, r;
i=0;
j=0;
while (i<len) {
c1 = stringWithHexWords[i];
c2 = stringWithHexWords[i+1];
if ((int)c1!=32) { // if space found, skip next section and bump index only once
// skip scary ASCII codes
if (32<(int)c1 && 127>(int)c1 && 32<(int)c2 && 127>(int)c2) {
//
// transform by taking first hexdigit * 16 and add second hexdigit
// both with correct offset
r = (char) (16*((int)c1<64?((int)c1-48):((int)c1-55))+((int)c2<64?((int)c2-48):((int)c2-55)));
if (31<(int)r && 127>(int)r)
ascii_buffer[j++] = r; // check result for readability
}
i++; // bump index
}
i++; // bump index once more for next hexdigit
}
ascii_bufferCurrentLength = j;
return true;
}
The hexToString() function converts a hex string to a readable ASCII string:
int hexCharToInt(char a); // declared first so hexToString() can call it
string hexToString(string str){
std::stringstream HexString;
for(int i=0;i<str.length();i++){
char a = str.at(i++);
char b = str.at(i);
int x = hexCharToInt(a);
int y = hexCharToInt(b);
HexString << (char)((16*x)+y);
}
return HexString.str();
}
int hexCharToInt(char a){
if(a>='0' && a<='9')
return(a-48);
else if(a>='A' && a<='Z')
return(a-55);
else
return(a-87);
}