Convert 128-bit hexadecimal string to base-36 string - c++

I have a 128-bit number in hexadecimal stored in a string (from MD5; security isn't a concern here) that I'd like to convert to a base-36 string. If it were a 64-bit or smaller number I'd convert it to a 64-bit integer and then use an algorithm I found for converting integers to base-36 strings, but this number is too large for that, so I'm at a loss for how to approach this. Any guidance would be appreciated.
Edit: After Roland Illig pointed out the hassle of saying 0/O and 1/l over the phone, and that base-36 doesn't gain much data density over hex, I think I may end up staying with hex. I'm still curious, though, whether there is a relatively simple way to convert a hex string of arbitrary length to a base-36 string.

A base-36 token can take 36 distinct values, so storing one naively takes 6 bits, the same as base-64 except that 28 of the 64 possible values go unused. Solving 36^n >= 2^128 yields n >= 128 / log2(36) ≈ 24.76, so 25 tokens to encode the value.
A base-64 encoding also requires 6 bits per token, but all possible token values are used. Solving 64^n >= 2^128 yields n >= 128 / 6 ≈ 21.33, so 22 tokens to encode the value.
Calculating the base-36 encoding requires dividing by powers of 36. No easy shortcuts, you need a division algorithm that can work with 128-bit values. The base-64 encoding is much easier to compute since it is a power of 2. Just take 6 bits at a time and shift by 6, in total 22 times to consume all 128 bits.
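As a sketch of that 6-bit extraction (assuming the 128-bit value is held as two uint64_t halves, hi and lo; the names and alphabet choice are mine, not from any standard encoder):
#include <cstdint>

// Emit 22 tokens, least significant 6-bit group first.
void encode128_base64ish(std::uint64_t hi, std::uint64_t lo, char out[23])
{
    static const char tokens[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (int i = 0; i < 22; i++) {
        out[i] = tokens[lo & 0x3f];   // take the low 6 bits
        lo = (lo >> 6) | (hi << 58);  // pull the next 6 bits down from hi
        hi >>= 6;
    }
    out[22] = '\0';
}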
Why do you want to use base-36? Base-64 encoders are standard. If you really have a constraint on the token space (you shouldn't, ASCII rulez) then at least use a base-32 encoding. Or any power of 2, base-16 is hex.

If the only thing missing is support for 128-bit unsigned integers, here is a solution for you:
#include <stdio.h>
#include <inttypes.h>

typedef struct {
    uint32_t v3, v2, v1, v0;   /* v3 holds the most significant 32 bits */
} uint128;

/* Long division of a 128-bit numerator by a 32-bit denominator,
   one 32-bit limb at a time. */
static void
uint128_divmod(uint128 *out_div, uint32_t *out_mod,
               const uint128 *in_num, uint32_t in_den)
{
    uint64_t x = 0;

    x = (x << 32) + in_num->v3;
    out_div->v3 = x / in_den;
    x %= in_den;

    x = (x << 32) + in_num->v2;
    out_div->v2 = x / in_den;
    x %= in_den;

    x = (x << 32) + in_num->v1;
    out_div->v1 = x / in_den;
    x %= in_den;

    x = (x << 32) + in_num->v0;
    out_div->v0 = x / in_den;
    x %= in_den;

    *out_mod = x;
}

int
main(void)
{
    uint128 x = { 0x12345678, 0x12345678, 0x12345678, 0x12345678 };
    uint128 result;
    uint32_t mod;

    uint128_divmod(&result, &mod, &x, 16);
    fprintf(stdout, "%08"PRIx32" %08"PRIx32" %08"PRIx32" %08"PRIx32" rest %08"PRIx32"\n",
            result.v3, result.v2, result.v1, result.v0, mod);
    return 0;
}
Using this function you can repeatedly divide by 36 and collect the remainders, which gives you the number encoded in base-36, as sketched below.
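A minimal sketch of that loop, reusing the uint128 type and uint128_divmod() from above (the digit alphabet is my own assumption):
#include <cstdint>

// Fills buf with 25 base-36 digits, most significant first.
void uint128_to_base36(uint128 n, char buf[26])
{
    static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    for (int i = 24; i >= 0; i--) {
        uint128 q;
        std::uint32_t rem;
        uint128_divmod(&q, &rem, &n, 36);  // n = q * 36 + rem
        buf[i] = digits[rem];
        n = q;
    }
    buf[25] = '\0';
}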

If you are using C++ with .NET 4 you could always use the System.Numerics.BigInteger class. You could try calling one of the ToString overloads to get you to base 36.
Alternatively, look at one of the many big-integer libraries, e.g. Matt McCutchen's C++ Big Integer Library, although you might have to dig into the internals of the classes to use a custom base such as 36.

Two things:
1. It really isn't that hard to divide a byte string by 36. But if you can't be bothered to implement that, you can use base-32 encoding, which would need 26 characters instead of 25.
2. If you want to be able to read the result over the phone to humans, you absolutely must add a simple checksum to your string, which will cost one or two characters but will save you a huge amount of 'Chinese whispers' hassle from hard-of-hearing customers. A minimal sketch of such a checksum follows.
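To illustrate the checksum idea, here is a minimal sketch (my own, not a standard scheme): append one extra character holding the sum of all token values modulo 36.
#include <cstring>
#include <string>

// Returns the check character for a base-36 string; assumes every
// character of s comes from the alphabet below.
char checksum36(const std::string& s)
{
    static const char alphabet[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    unsigned sum = 0;
    for (char c : s)
        sum += (unsigned)(std::strchr(alphabet, c) - alphabet);
    return alphabet[sum % 36];
}
A plain sum catches most single-character mishearings but not transpositions; a scheme like Luhn mod N catches those too.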

Related

How to convert a float into uint8_t?

I am trying to send multiple float values from an Arduino using the LMIC LoRa library. The LMIC function only takes a uint8_t as its transmission argument type.
temp contains my temperature value as a float and I can print the measured temperature as such without problem:
Serial.println((String)"Temp C: " + temp);
There is an example that shows this code being used to do the conversion:
uint16_t payloadTemp = LMIC_f2sflt16(temp);
// int -> bytes
byte tempLow = lowByte(payloadTemp);
byte tempHigh = highByte(payloadTemp);
payload[0] = tempLow;
payload[1] = tempHigh;
I am not sure this works; it doesn't seem to. The resulting data that gets sent is FF 7F, which I don't believe is what I am looking for.
I have also tried the following conversion procedure:
uint8_t *array;
array = (unit8_t*)(&f);
Using Arduino, this will not even compile (note that unit8_t here is a typo for uint8_t).
Something that does work, but creates a much too long result, is:
String toSend = String(temp);
toSend.toCharArray(payload, toSend.length());
payloadActualLength = toSend.length();
Serial.print("the payload is: ");
Serial.println(payload);
but the resulting payload is far too long once I add the other values I want to send.
So how do I convert a float into a uint8_t value, and why doesn't the original conversion work the way I expect it to?
Sounds like you are trying to figure out a minimally sized representation for these numbers that you can transmit in some very small packet format. If the range is suitably limited, this can often best be done by using an appropriate fixed-point representation.
For example, if your temperatures are always in the range 0..63, you could use a 6.2 fixed point format in a single byte:
if (value < 0.0 || value > 63.75) {
    // out of range for 6.2 fixed point, so do something else.
} else {
    uint8_t bval = (uint8_t)(value * 4 + 0.5);
    // output this byte value
}
When you read the byte back, you just multiply it by 0.25 to get the (approximate) float value back.
Of course, since 8 bits is pretty limited for precision (about 2 digits), it will get rounded a bit to fit -- your 23.24 value will be rounded to 23.25. If you need more precision, you'll need to use more bits.
If you only need a little precision but a wider range, you can use a custom floating-point format. IEEE 16-bit floats (S5.10) are pretty good (they give you 3 digits of precision and around 10 orders of magnitude of range), but you can go even smaller, particularly if you don't need negative values. A U4.4 float format gives you 1 digit of precision and 5 orders of magnitude of range in 8 bits (positive only).
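A quick round trip of the 6.2 format described above (the values are illustrative):
#include <cstdint>
#include <cstdio>

int main()
{
    float temp = 23.24f;                                 // sensor reading
    std::uint8_t bval = (std::uint8_t)(temp * 4 + 0.5);  // encode as 6.2 fixed point
    float back = bval * 0.25f;                           // decode on the receiver
    std::printf("sent 0x%02x, decoded %.2f\n", bval, back);  // decodes as 23.25
    return 0;
}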
If you know that both sender and receiver use the same fp binary representation and both use the same endianness then you can just memcpy:
float a = 23.24;
uint8_t buffer[sizeof(float)];
::memcpy(buffer, &a, sizeof(float));
In Arduino one can convert the float into a String
float ds_temp=sensors.getTempCByIndex(0); // DS18b20 Temp sensor
then convert the String into a char array:
String ds_str = String(ds_temp);
char ds_char[ds_str.length() + 1]; // +1 for the terminating '\0'
ds_str.toCharArray(ds_char, sizeof(ds_char));
uint8_t* data = (uint8_t*)ds_char;
The uint8_t values are now available through data; the length is ds_str.length() bytes (note that sizeof(data) would only give the size of the pointer).
A variable of type uint8_t can carry only 256 distinct values. If you actually want to squeeze a temperature into a single byte, you have to use a fixed-point approach or a least-significant-bit (LSB) value approach, as shown below:
Define the working range, T0 to T1.
Divide the range T1 - T0 by 256 (2^8, the number of possible values).
The resulting value is a float constant, the LSB (working with a flexible LSB value is also possible), by which you divide the shifted original float value X: R = (X - T0) / LSB. You can round the result; it fits into a byte.
On the receiving side you multiply the integer value by the same constant and shift back: X = R * LSB + T0.
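A minimal sketch of this approach; the range T0 = -40 to T1 = 60 is an assumption for a temperature sensor, which gives an LSB of about 0.39:
#include <cstdint>

const float T0 = -40.0f;
const float T1 = 60.0f;
const float LSB = (T1 - T0) / 256.0f;

std::uint8_t encode_temp(float x)
{
    float r = (x - T0) / LSB + 0.5f;
    if (r < 0.0f) r = 0.0f;        // clamp to the working range
    if (r > 255.0f) r = 255.0f;
    return (std::uint8_t)r;
}

float decode_temp(std::uint8_t r)
{
    return r * LSB + T0;
}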

Convert a 64-bit integer to an array of 7-bit characters

Say I have a function vector<unsigned char> byteVector(long long UID), returning a byte representation of the UID, a 64-bit integer, as a vector. This vector is later used to write this data to a file.
Now, because I decided I want to read that file with Python, I have to comply with the UTF-8 standard, which means I can only use the first 7 bits of each char. If the most significant bit is 1 I can't decode it to a string anymore, because those only support ASCII characters. Also, I'll have to pass those strings to other processes via a command-line interface, which also only supports the ASCII character set.
Before that problem arose, my approach on splitting the 64bit integer up into 8 separate bytes was the following, which worked really great:
vector<unsigned char> outputVector = vector<unsigned char>();
unsigned char *uidBytes = (unsigned char *)&UID_;
for (int i = 0; i < 8; i++) {
    outputVector.push_back(uidBytes[i]);
}
Of course that doesn't work anymore, as the constraint "the high bit may not be 1" limits the maximum value of each unsigned char to 127.
My easiest option now would of course be to replace the one push_back call with this:
outputVector.push_back(uidBytes[i] / 128);
outputVector.push_back(uidBytes[i] % 128);
But this seems kind of wasteful, as the first of each unsigned char pair can only be 0 or 1 and I would be wasting some space (6 bytes) I could otherwise use.
As I need to save 64 bits and can use 7 bits per byte, I'll need ceil(64 / 7) = 10 bytes.
It isn't really much (none of the files I write ever even reached the 1 kB mark), but I was using 8 bytes before and it seems a bit wasteful to use 16 now when ten would suffice. So:
How do I convert a 64-bit integer to a vector of ten 7-bit integers?
This is probably too much optimization, but there could be some very cool solution for this problem (probably using shift operators) and I would be really interested in seeing it.
You can use bit shifts to take 7-bit pieces of the 64-bit integer. However, you need ten 7-bit integers, nine is not enough: 9 * 7 = 63, one bit short.
std::uint64_t uid = 42; // Your 64-bit input here.
std::vector<std::uint8_t> outputVector;
for (int i = 0; i < 10; i++)
{
    outputVector.push_back((uid >> (i * 7)) & 0x7f);
}
In every iteration, we shift the input bits by a multiple of 7, and mask out a 7-bit part. The most significant bit of the 8-bit numbers will be zero. Note that the numbers in the vector are “reversed”: the least significant bits have the lowest index. This is irrelevant though, if you decode the parts in the correct way. Decoding can be done as follows:
std::uint64_t decoded = 0;
for (int i = 0; i < 10; i++)
{
    decoded |= static_cast<std::uint64_t>(outputVector[i]) << (i * 7);
}
Please note that it seems like a bad idea to interpret the resulting vector as UTF-8 encoded text: the sequence can still contain control characters and '\0'. If you want to encode your 64-bit integer in printable characters, take a look at base64; in that case, you will need one more character (eleven in total) to encode 64 bits.
I suggest using assembly language.
Many assembly languages have instructions for shifting a bit into a "spare" carry bit and shifting the carry bit into a register. The C language has no convenient or efficient method to do this.
The algorithm, repeated once per 7-bit output character:
for i = 0; i < 7; ++i
{
    right shift the 64-bit word into carry.
    right shift the carry into the character.
}
You should also look into using std::bitset.

C++: Binary to Decimal Conversion

I am trying to convert a binary array to decimal in following way:
uint8_t array[8] = {1,1,1,1,0,1,1,1};
int decimal = 0;
for (int i = 0; i < 8; i++)
    decimal = (decimal << 1) + array[i];
Actually I have to convert a 64-bit binary array to decimal, and I have to do it a million times.
Can anybody help me: is there any faster way to do the above, or is it fine as it is?
Your method is adequate. To call it nice, I would just not mix bitwise operations with the "mathematical" way of converting; i.e., use either
decimal = decimal << 1 | array[i];
or
decimal = decimal * 2 + array[i];
It is important, before attempting any optimisation, to profile the code. Time it, look at the code being generated, and optimise only when you understand what is going on.
And as already pointed out, the best optimisation is to not do something, but to make a higher level change that removes the need.
However...
Most changes you might want to trivially make here, are likely to be things the compiler has already done (a shift is the same as a multiply to the compiler). Some may actually prevent the compiler from making an optimisation (changing an add to an or will restrict the compiler - there are more ways to add numbers, and only you know that in this case the result will be the same).
Pointer arithmetic may be better, but the compiler is not stupid - it ought to already be producing decent code for dereferencing the array, so you need to check that you have not in fact made matters worse by introducing an additional variable.
In this case the loop count is well defined and limited, so unrolling probably makes sense.
Furthermore, it depends on how dependent you want the result to be on your target architecture. If you want portability, it is hard(er) to optimise.
For example, the following produces better code here:
unsigned int x0 = *(unsigned int *)array;        // first four 0/1 bytes as one word
unsigned int x1 = *(unsigned int *)(array + 4);  // last four 0/1 bytes
// multiplying by 0x8040201 gathers each byte's low bit into four adjacent bit positions
int decimal = ((x0 * 0x8040201) >> 20) + ((x1 * 0x8040201) >> 24);
I could probably also roll a 64-bit version that did 8 bits at a time instead of 4.
But it is very definitely not portable code. I might use that locally if I knew what I was running on and I just wanted to crunch numbers quickly. But I probably wouldn't put it in production code. Certainly not without documenting what it did, and without the accompanying unit test that checks that it actually works.
The binary 'compression' can be generalized as a problem of weighted sum -- and for that there are some interesting techniques.
X mod (255) means essentially summing of all independent 8-bit numbers.
X mod 254 means summing each digit with a doubling weight, since 1 mod 254 = 1, 256 mod 254 = 2, 256*256 mod 254 = 2*2 = 4, etc.
If the encoding was big endian, then *(unsigned long long *)array % 254 would produce a weighted sum (with a truncated range of 0..253). Then removing the value with weight 2 and adding it manually produces the correct result:
uint64_t a = *(uint64_t *)array;
/* clear the weight-2 bit (bit 8) so the sum stays below 254,
   take the remainder, then add the weight-2 term back in */
return (a & ~(uint64_t)256) % 254 + ((a >> 7) & 2);
Another mechanism to get the weights is to premultiply each binary digit by 255 and mask out the correct bit:
uint64_t a = (*(uint64_t *)array * 255) & 0x0102040810204080ULL; // little endian
uint64_t a = (*(uint64_t *)array * 255) & 0x8040201008040201ULL; // big endian
In both cases one can then take the remainder of 255 (and correct now with weight 1):
return (a & 0x00ffffffffffffff) % 255 + (a>>56); // little endian, or
return (a & ~1) % 255 + (a&1);
For the sceptical mind: I actually did profile the modulus version to be (slightly) faster than iteration on x64.
To continue from JasonD's answer, parallel bit selection can be applied iteratively.
But first, expressing the equation in full form helps the compiler remove the artificial dependency created by the iterative accumulation:
ret = ((a[0]<<7) | (a[1]<<6) | (a[2]<<5) | (a[3]<<4) |
(a[4]<<3) | (a[5]<<2) | (a[6]<<1) | (a[7]<<0));
vs.
uint32_t HI = *(uint32_t *)array, LO = *(uint32_t *)&array[4];
LO |= (HI << 4);   // the HI dword has a weight 16 relative to the LO bytes
LO |= (LO >> 14);  // the high word has 4x the weight of the low word
LO |= (LO >> 7);   // each remaining high byte has 2x the weight of the byte below it
return LO & 255;
(As with the mod-254 trick above, this fragment assumes the eight bytes are read in big-endian order.)
One more interesting technique would be to utilize crc32 as a compression function; then it just happens that the result would be LookUpTable[crc32(array) & 255]; as there is no collision with this given small subset of 256 distinct arrays. However to apply that, one has already chosen the road of even less portability and could as well end up using SSE intrinsics.
You could use std::accumulate (from <numeric>), with a doubling-and-adding binary operation:
int doubleSumAndAdd(const int& sum, const int& next) {
    return (sum * 2) + next;
}

int decimal = std::accumulate(array, array + ARRAY_SIZE, 0, doubleSumAndAdd);
With 0 as the initial value, this computes the same MSB-first result as the OP's loop.
Try this; it converts a binary string to a number. A 64-bit unsigned long long can hold at most 64 binary digits (and floating-point pow() loses precision long before that), so the function checks the width and uses a shift-and-add loop:
#include <iostream>
#include <string>
using namespace std;

unsigned long long binary_decimal(const string& bin) /* Function to convert binary to dec */
{
    unsigned long long dec = 0;
    if (bin.length() > 64) {
        cout << "Binary number too large" << endl;
        return 0;
    }
    for (size_t i = 0; i < bin.length(); i++)
        dec = (dec << 1) | (bin.at(i) == '1');
    return dec;
}
For anything longer than 64 bits, this method would need an arbitrary-precision integer type.

Ultra Quick Way To Concatenate Byte Values

Given 3 different bytes, say x = 64, y = 90, z = 240, I am looking to concatenate them into a string like 6490240. It would be lovely if this worked, but it doesn't:
string xx = (string)x + (string)y + (string)z;
I am working in C++, and would settle for a concatenation of the bytes as a 24 bit string using their 8-bit representations.
It needs to be ultra fast because I am using this method on a lot of data, and it seems frustratingly like there isn't a way to just say "treat this byte as if it were a string".
Many thanks for your help
To clarify, the reason why I'm particular about using 3 bytes is because the original data pertains to RGB values which are read via pointers and are stored of course as bytes in memory.
I want a way really to treat each color independently so you can think of this as a hashing function if you like. So any fast representation that does it without collisions is desired. This is the only way I can think of to avoid any collisions at all.
Did you consider instead just packing the color elements into three bytes of an integer?
uint32_t full_color = (x << 16) | (y << 8) | z;
The easiest way to turn numbers into a string is to use an ostringstream:
#include <sstream>
#include <string>
std::ostringstream os;
os << x << y << z;  // if x, y, z are 8-bit types, write +x, +y, +z so they print as numbers
std::string str = os.str(); // "6490240"
You can even make use of manipulators to do this in hex or octal:
os << std::hex << x << y << z;
Update
Since you've clarified what you really want to do, I've updated my answer. You're looking to take RGB values as three bytes, and use them as a key somehow. This would be best done with a long int, not as a string. You can still stringify the int quite easily, for printing to the screen.
unsigned long rgb = 0;
unsigned char* b = reinterpret_cast<unsigned char*>(&rgb);
b[0] = x;
b[1] = y;
b[2] = z;
// rgb now holds the bytes x, y, z (which positions they land in
// depends on endianness; see below)
Then you can use the long int rgb as your key, very efficiently. Whenever you want to print it out, you can still do that:
std::cout << std::hex << rgb;
Depending on the endian-ness of your system, you may need to play around with which bytes of the long int you set. My example overwrites bytes 0-2, but you might want to write bytes 1-3. And you might want to write the order as z, y, x instead of x, y, z. That kind of detail is platform dependent. Although if you never want to print the RGB value, but simply want to consider it as a hash, then you don't need to worry about which bytes you write or in what order.
try sprintf with a char buffer: char xx[12]; sprintf(xx, "%d%d%d", x, y, z);
Use a 3 character character array as your 24 bit representation, and assign each char the value of one of your input values.
Converting 3 bytes to bits and storing the result in an array can be done easily as below:
// res must point to a buffer of at least 25 chars (24 bits + terminating '\0')
void bytes2bits(unsigned char x, unsigned char y, unsigned char z, char *res)
{
    res += 24;
    *res-- = 0;
    unsigned xyz = (x << 16) + (y << 8) + z;
    for (size_t l = 0; l < 24; l++) {
        *res-- = '0' + (xyz & 1);
        xyz >>= 1;
    }
}
However, if you are looking for a way to store three byte values in a non-ambiguous and compact way, you should probably settle for hexadecimal (each group of four bits of the binary representation matches a digit between 0 and 9 or a letter between A and F). It's ultra compact, ultra simple to encode and decode, and gives human-readable output.
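For example, a minimal sketch (the function name and buffer size are mine): snprintf with %02X turns each byte into exactly two hex digits, so distinct (x, y, z) triples always produce distinct 6-character keys.
#include <cstdio>

void rgb_to_hex(unsigned char x, unsigned char y, unsigned char z, char out[7])
{
    std::snprintf(out, 7, "%02X%02X%02X", x, y, z);  // e.g. 64, 90, 240 -> "405AF0"
}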
If you never need to print out the result, just combining the values into a single integer and using it as a key, as Mark proposed, is certainly the fastest and simplest solution. Assuming your native integer is 32 bits or more on the target system, just do:
unsigned int key = (x << 16) | (y << 8) | z;
You can as easily get back the initial values from key if needed:
unsigned char x = (key >> 16) & 0xFF;
unsigned char y = (key >> 8) & 0xFF;
unsigned char z = key & 0xFF;

How best to implement BCD as an exercise?

I'm a beginner (self-learning) programmer learning C++, and recently I decided to implement a binary-coded decimal (BCD) class as an exercise, and so I could handle very large numbers on Project Euler. I'd like to do it as basically as possible, starting properly from scratch.
I started off using an array of ints, where every digit of the input number was saved as a separate int. I know that each BCD digit can be encoded with only 4 bits, so I thought using a whole int for this was a bit overkill. I'm now using an array of bitset<4>'s.
Is using a library class like this overkill as well?
Would you consider it cheating?
Is there a better way to do this?
EDIT: The primary reason for this is as an exercise - I wouldn't want to use a library like GMP because the whole point is making the class myself. Is there a way of making sure that I only use 4 bits for each decimal digit?
Just one note: using an array of bitset<4>'s is going to require the same amount of space as an array of longs. bitset is usually implemented by having an array of word-sized integers as the backing store for the bits, so that bitwise operations can use word-sized operations instead of byte-sized ones, and more gets done at a time.
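You can check the space claim directly; a quick sketch (the exact numbers depend on the implementation, but on a typical 64-bit libstdc++ both lines print 8):
#include <bitset>
#include <iostream>

int main()
{
    std::cout << sizeof(std::bitset<4>) << '\n';  // padded up to a whole word
    std::cout << sizeof(unsigned long) << '\n';
    return 0;
}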
Also, I question your motivation. BCD is usually used as a packed representation of a string of digits when sending them between systems; it usually has nothing to do with arithmetic. What you really want is an arbitrary-sized integer arithmetic library like GMP.
Is using a library class like this overkill as well?
I would benchmark it against an array of ints to see which one performs better. If an array of bitset<4> is faster, then no, it's not overkill. Every little bit helps on some of the PE problems.
Would you consider it cheating?
No, not at all.
Is there a better way to do this?
Like Greg Rogers suggested, an arbitrary precision library is probably a better choice, unless you just want to learn from rolling your own. There's something to learn from both methods (using a library vs. writing a library). I'm lazy, so I usually use Python.
Like Greg Rogers said, using a bitset probably won't save any space over ints, and doesn't really provide any other benefits. I would probably use a vector<char> instead. It's twice as big as it needs to be, but you get simpler and faster indexing for each digit.
If you want to use packed BCD, you could write a custom indexing function and store two digits in each byte, as sketched below.
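A minimal sketch of that indexing scheme (the names are mine): digit 2i lives in the low nibble of byte i, digit 2i+1 in the high nibble.
#include <cstdint>
#include <vector>

struct PackedBCD {
    std::vector<std::uint8_t> bytes;  // pre-sized to (ndigits + 1) / 2

    int get(std::size_t i) const {
        std::uint8_t b = bytes[i / 2];
        return (i % 2) ? (b >> 4) : (b & 0x0f);
    }
    void set(std::size_t i, int digit) {  // digit must be 0..9
        std::uint8_t& b = bytes[i / 2];
        if (i % 2) b = (std::uint8_t)((b & 0x0f) | (digit << 4));
        else       b = (std::uint8_t)((b & 0xf0) | digit);
    }
};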
Is using a library class like this overkill as well?
Would you consider it cheating?
Is there a better way to do this?
1 & 2: not really.
3: each byte has 8 bits, so you can store 2 BCD digits in each unsigned char.
In general, bit operations are applied in the context of an integer, so from a performance standpoint there is no real reason to go down to bits.
If you want to take the bit approach to gain experience, then this may be of help:
#include <stdio.h>

int main(void)
{
    typedef struct {
        unsigned int value : 4;
    } Nibble;

    Nibble nibble;
    for (nibble.value = 0; nibble.value < 20; nibble.value++) {
        printf("nibble.value is %d\n", nibble.value);
    }
    return 0;
}
The gist of the matter is that inside that struct, you are creating a short integer, one that is 4 bits wide. Under the hood, it is still really an integer, but for your intended use, it looks and acts like a 4 bit integer.
This is shown clearly by the for loop, which is actually an infinite loop. When the nibble value hits 16, the value is really zero, as there are only 4 bits to work with.
As a result nibble.value < 20 never becomes true.
If you look in the K&R White book, one of the notes there is the fact that bit operations like this are not portable, so if you want to port your program to another platform, it may or may not work.
Have fun.
You are trying to get a base-10 representation (i.e. a decimal digit in each cell of the array). This way either space (one int per digit) or time (4 bits per digit, but with the overhead of packing/unpacking) is wasted.
Why not try base-256, for example, and use an array of bytes? Or even base-2^32 with an array of ints? The operations are implemented the same way as in base-10. The only thing that will be different is converting the number to a human-readable string.
It may work like this:
Assuming base-256, each "digit" has 256 possible values, so the numbers 0-255 are all single-digit values. Then 256 is written as 1:0 (I'll use a colon to separate the "digits"; we cannot use letters as in base-16). The analogue in base-10 is how, after 9, there is 10.
Likewise 1030 (base-10) = 4 * 256 + 6 = 4:6 (base-256).
Also 1020 (base-10) = 3 * 256 + 252 = 3:252 (base-256), a two-digit number in base-256.
Now let's assume we put the digits in an array with the least significant digit first:
unsigned short digits1[] = { 212, 121 }; // 121 * 256 + 212 = 31188
int len1 = 2;
unsigned short digits2[] = { 202, 20 }; // 20 * 256 + 202 = 5322
int len2 = 2;
Then adding will go like this (warning: notepad code ahead, may be broken):
unsigned short resultdigits[3] = { 0 }; /* must have max(len1, len2) + 1 cells */
int len = len1 > len2 ? len1 : len2; // max of the lengths
int carry = 0;
int i;
for (i = 0; i < len; i++) {
    int leftdigit = i < len1 ? digits1[i] : 0;
    int rightdigit = i < len2 ? digits2[i] : 0;
    int sum = leftdigit + rightdigit + carry;
    if (sum > 255) {
        carry = 1;
        sum -= 256;
    } else {
        carry = 0;
    }
    resultdigits[i] = sum;
}
if (carry > 0) {
    resultdigits[i] = carry;
}
On the first iteration it should go like this:
sum = 212 + 202 + 0 = 414
414 > 255, so carry = 1 and sum = 414 - 256 = 158
resultdigits[0] = 158
On the second iteration:
sum = 121 + 20 + 1 = 142
142 <= 255, so carry = 0
resultdigits[1] = 142
So at the end resultdigits[] = { 158, 142 }, that is 142:158 (base-256) = 142 * 256 + 158 = 36510 (base-10), which is exactly 31188 + 5322.
Note that converting this number to/from a human-readable form is by no means a trivial task - it requires multiplication and division by 10 or 256, and I cannot present code as a sample without proper research. The advantage is that the operations 'add', 'subtract' and 'multiply' can be made really efficient, and the heavy conversion to/from base-10 is done only once at the beginning and once after the end of the calculation.
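For what it's worth, here is a rough sketch of the decimal-printing direction (my own, not from the answer): repeatedly divide the digit array by 10 and collect the remainders.
#include <cstdio>

// Divide the base-256 number digits[0..len) (least significant digit
// first) by 10 in place and return the remainder.
int divmod10(unsigned char* digits, int len)
{
    int rem = 0;
    for (int i = len - 1; i >= 0; i--) {
        int cur = rem * 256 + digits[i];
        digits[i] = (unsigned char)(cur / 10);
        rem = cur % 10;
    }
    return rem;
}

int main()
{
    unsigned char n[] = { 158, 142 };  // 142:158 (base-256) = 36510
    char out[16];
    int pos = 0;
    do {
        out[pos++] = (char)('0' + divmod10(n, 2));
    } while (n[0] | n[1]);             // stop once the number reaches zero
    while (pos > 0)                    // remainders arrive least significant first
        std::putchar(out[--pos]);
    std::putchar('\n');                // prints 36510
    return 0;
}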
Having said all that, personally, I'd use base 10 in an array of bytes and not care about the memory loss. This will require adjusting the constants 255 and 256 above to 9 and 10 respectively.