Given 3 different bytes such as say x = 64, y = 90, z = 240 I am looking to concatenate them into say a string like 6490240. It would be lovely if this worked but it doesn't:
string xx = (string)x + (string)y + (string)z;
I am working in C++, and would settle for a concatenation of the bytes as a 24 bit string using their 8-bit representations.
It needs to be ultra fast because I am using this method on a lot of data, and frustratingly there doesn't seem to be a way to just say "treat this byte as if it were a string".
Many thanks for your help
To clarify, the reason why I'm particular about using 3 bytes is because the original data pertains to RGB values which are read via pointers and are stored of course as bytes in memory.
I want a way really to treat each color independently so you can think of this as a hashing function if you like. So any fast representation that does it without collisions is desired. This is the only way I can think of to avoid any collisions at all.
Did you consider instead just packing the color elements into three bytes of an integer?
uint32_t full_color = (x << 16) | (y << 8) | z;
Easiest way to turn numbers into a string is to use ostringstream
#include <sstream>
#include <string>
std::ostringstream os;
os << +x << +y << +z; // unary + promotes each byte to int so it prints as a number, not as a character
std::string str = os.str(); // 6490240
You can even make use of manipulators to do this in hex or octal:
os << std::hex << +x << +y << +z; // unary + again, so the bytes print as numbers
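One thing to watch with plain concatenation: without padding, different triples can give the same digits (1, 23, 4 and 12, 3, 4 both come out as 1234 in decimal). Padding each byte to two hex digits avoids that. A minimal sketch, assuming the inputs are 8-bit values; rgb_to_hex is just a name I made up for illustration:

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats three bytes as six lowercase hex digits.
std::string rgb_to_hex(std::uint8_t x, std::uint8_t y, std::uint8_t z)
{
    std::ostringstream os;
    os << std::hex << std::setfill('0')
       // setw applies only to the next item, so it is repeated for each byte;
       // the cast makes the byte print as a number rather than a character
       << std::setw(2) << static_cast<int>(x)
       << std::setw(2) << static_cast<int>(y)
       << std::setw(2) << static_cast<int>(z);
    return os.str();
}
```

With the values from the question, rgb_to_hex(64, 90, 240) gives "405af0". Stream formatting is convenient but unlikely to be the fastest option on a lot of data.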
Update
Since you've clarified what you really want to do, I've updated my answer. You're looking to take RGB values as three bytes, and use them as a key somehow. This would be best done with a long int, not as a string. You can still stringify the int quite easily, for printing to the screen.
unsigned long rgb = 0;
byte* b = reinterpret_cast<byte*>(&rgb);
b[0] = x;
b[1] = y;
b[2] = z;
// the first three bytes of rgb in memory are now { x, y, z }; any remaining bytes stay 0
Then you can use the long int rgb as your key, very efficiently. Whenever you want to print it out, you can still do that:
std::cout << std::hex << rgb;
Depending on the endian-ness of your system, you may need to play around with which bytes of the long int you set. My example overwrites bytes 0-2, but you might want to write bytes 1-3. And you might want to write the order as z, y, x instead of x, y, z. That kind of detail is platform dependent. Although if you never want to print the RGB value, but simply want to consider it as a hash, then you don't need to worry about which bytes you write or in what order.
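To illustrate the endianness point: writing the bytes through memory gives a platform-dependent number, while building the key with shifts gives the same number everywhere. A rough sketch, where key_from_memory and key_from_shifts are names I made up:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies the three bytes into the start of a zeroed
// 32-bit integer. The numeric result depends on the platform's byte order.
std::uint32_t key_from_memory(std::uint8_t x, std::uint8_t y, std::uint8_t z)
{
    const std::uint8_t bytes[4] = { x, y, z, 0 };
    std::uint32_t key = 0;
    std::memcpy(&key, bytes, sizeof key);
    return key;
}

// Hypothetical helper: builds the key arithmetically, so the value does not
// depend on byte order.
std::uint32_t key_from_shifts(std::uint8_t x, std::uint8_t y, std::uint8_t z)
{
    return (static_cast<std::uint32_t>(x) << 16) |
           (static_cast<std::uint32_t>(y) << 8)  |
            static_cast<std::uint32_t>(z);
}
```

For 64, 90, 240 the shift version always yields 0x405AF0. The memory version yields 0x00F05A40 on a little-endian machine and 0x405AF000 on a big-endian one. Both are collision-free as hash keys; only the printed value differs.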
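Try sprintf(xx, "%d%d%d", x, y, z); where xx is a char buffer large enough for the result (10 bytes covers three 3-digit values plus the terminating null).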
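Here is roughly what that looks like as a complete function, using snprintf so the buffer size is checked. This is only a sketch; concat_decimal and concat_decimal_padded are hypothetical names. The padded variant is there because unpadded decimal output is ambiguous: 1, 23, 4 and 12, 3, 4 both produce "1234".

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: plain decimal concatenation, e.g. 64, 90, 240 -> "6490240".
std::string concat_decimal(unsigned char x, unsigned char y, unsigned char z)
{
    char buf[10];  // at most 3 digits per byte, plus the terminating '\0'
    std::snprintf(buf, sizeof buf, "%d%d%d", x, y, z);
    return buf;
}

// Hypothetical helper: same idea, zero-padded to three digits per byte so that
// every distinct triple maps to a distinct 9-character string.
std::string concat_decimal_padded(unsigned char x, unsigned char y, unsigned char z)
{
    char buf[10];
    std::snprintf(buf, sizeof buf, "%03d%03d%03d", x, y, z);
    return buf;
}
```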
Use a 3 character character array as your 24 bit representation, and assign each char the value of one of your input values.
Converting 3 bytes to bits and storing the result in an array can be done easily as below:
void bytes2bits(unsigned char x, unsigned char y, unsigned char z, char * res)
{
res += 24; *res-- = 0;
unsigned xyz = (x<<16)+(y<<8)+z;
for (size_t l = 0 ; l < 24 ; l++){
*res-- = '0'+(xyz & 1); xyz >>= 1;
}
}
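If the standard library is an option, std::bitset can produce the same 24-character string with less manual pointer work. A minimal sketch (bytes_to_bit_string is a made-up name, and I have not measured how its speed compares with the loop):

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: returns the 24-bit binary representation of x, y, z,
// most significant bit first.
std::string bytes_to_bit_string(unsigned char x, unsigned char y, unsigned char z)
{
    const unsigned long packed = (static_cast<unsigned long>(x) << 16) |
                                 (static_cast<unsigned long>(y) << 8)  |
                                  static_cast<unsigned long>(z);
    return std::bitset<24>(packed).to_string();
}
```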
However, if you are looking for a way to store three byte values in an unambiguous and compact way, you should probably settle for hexadecimal (each group of four bits of the binary representation maps to a digit from 0 to 9 or a letter from A to F). It's ultra simple to encode and decode, and the output is human readable.
If you never need to print out the result, just combining the values into a single integer and using it as a key, as Mark proposed, is certainly the fastest and simplest solution. Assuming your native integer is 32 bits or more on the target system, just do:
unsigned int key = (x<< 16)|(y<<8)|z;
You can as easily get back the initial values from key if needed:
unsigned char x = (key >> 16) & 0xFF;
unsigned char y = (key >> 8) & 0xFF;
unsigned char z = key & 0xFF;
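As an example of using the packed integer as a key, here is a sketch that counts the distinct colors in a pixel buffer. It assumes tightly packed RGB data (3 bytes per pixel, no padding), which the question doesn't actually specify, and count_distinct_colors is a hypothetical name:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical helper: counts distinct colors in a tightly packed RGB buffer
// (assumed layout: 3 bytes per pixel, no row padding).
std::size_t count_distinct_colors(const std::uint8_t* rgb, std::size_t pixel_count)
{
    std::unordered_set<std::uint32_t> seen;
    for (std::size_t i = 0; i < pixel_count; ++i) {
        const std::uint8_t* p = rgb + 3 * i;
        // Each of the 2^24 possible colors gets its own key, so no collisions.
        const std::uint32_t key = (static_cast<std::uint32_t>(p[0]) << 16) |
                                  (static_cast<std::uint32_t>(p[1]) << 8)  |
                                   static_cast<std::uint32_t>(p[2]);
        seen.insert(key);
    }
    return seen.size();
}
```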
Related
I have a coordinate pair of values that each range from [0,15]. For now I can use an unsigned, however since 16 x 16 = 256 total possible coordinate locations, this also represents all the binary and hex values of 1 byte. So to keep memory compact I'm starting to prefer the idea of using a BYTE or an unsigned char. What I want to do with this coordinate pair is this:
Let's say we have a coordinate pair with the hex value [0x05,0x0C], I would like the final value to be 0x5C. I would also like to do the reverse as well, but I think I've already found an answer with a solution to the reverse. I was thinking on the lines of using & or | however, I'm missing something for I'm not getting the correct values.
However as I was typing this and looking at the reverse of this: this is what I came up with and it appears to be working.
byte a = 0x04;
byte b = 0x0C;
byte c = (a << 4) | b;
std::cout << +c;
And the value that is printing is 76; which converted to hex is 0x4C.
Since I have figured out the calculation for this, is there a more efficient way?
EDIT
After doing some testing the operation to combine the initial two is giving me the correct value, however when I'm doing the reverse operation as such:
byte example = c;
byte nibble1 = 0x0F & example;
byte nibble2 = (0xF0 & example) >> 4;
std::cout << +nibble1 << " " << +nibble2 << std::endl;
It prints 12 4. Is this correct or should this be a concern? If worst comes to worst I can rename the values to indicate which coordinate value they are.
EDIT
After thinking about this for a little bit and from some of the suggestions I had to modify the reverse operation to this:
byte example = c;
byte nibble1 = (0xF0 & example) >> 4;
byte nibble2 = (0x0F & example);
std::cout << +nibble1 << " " << +nibble2 << std::endl;
And this prints out 4 12 which is the correct order of what I am looking for!
First of all, be careful: there are in fact 17 values in the range 0..16. Your values are probably 0..15, because if they both actually range from 0 to 16, you won't be able to uniquely store every possible coordinate pair in a single byte.
The code extract you submitted is pretty efficient: you are using bitwise operators, which are among the quickest things you can ask a processor to do.
For the "reverse" (splitting your byte into two 4-bit values), you are right when thinking about using &. Just apply a 4-bit shift at the right time.
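Put together, packing and splitting could look something like this. It's only a sketch; pack_coord, coord_x and coord_y are names chosen for illustration, and inputs are masked to 4 bits so out-of-range values can't spill into the other nibble:

```cpp
#include <cstdint>

// Hypothetical helper: stores x in the high nibble and y in the low nibble.
std::uint8_t pack_coord(std::uint8_t x, std::uint8_t y)
{
    return static_cast<std::uint8_t>(((x & 0x0F) << 4) | (y & 0x0F));
}

// Hypothetical helpers: recover each coordinate from the packed byte.
std::uint8_t coord_x(std::uint8_t packed)
{
    return static_cast<std::uint8_t>(packed >> 4);    // shift the high nibble down
}

std::uint8_t coord_y(std::uint8_t packed)
{
    return static_cast<std::uint8_t>(packed & 0x0F);  // keep only the low nibble
}
```

For the values in the question, pack_coord(0x04, 0x0C) is 0x4C, and coord_x / coord_y give back 4 and 12.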
I have the question in the title, but if not, how could I get away with using only 4 bits to represent an integer?
EDIT really my question is how. I am aware that there are 1 byte data structures in a language like c, but how could I use something like a char to store two integers?
In C or C++ you can use a struct to allocate the required number of bits to a variable as given below:
#include <stdio.h>
struct packed {
unsigned char a:4, b:4;
};
int main() {
struct packed p;
p.a = 10;
p.b = 20;
printf("p.a %d p.b %d size %zu\n", p.a, p.b, sizeof(struct packed)); /* %zu is the conversion for size_t */
return 0;
}
The output is p.a 10 p.b 4 size 1, showing that p takes only 1 byte to store, and that numbers with more than 4 bits (larger than 15) get truncated, so 20 (0x14) becomes 4. This is simpler to use than the manual bitshifting and masking used in the other answer, but it is probably not any faster.
You can store two 4-bit numbers in one byte (call it b which is an unsigned char).
Hex makes this easy to see: in b=0xAE the two numbers are A and E.
Use a mask to isolate them:
a = (b & 0xF0) >> 4
and
e = b & 0x0F
You can easily define functions to set/get both numbers in the proper portion of the byte.
Note: if the 4-bit numbers need to have a sign, things can become a tad more complicated since the sign must be extended correctly when packing/unpacking.
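To make the note about signed values concrete, here is one possible way to do the sign extension for values in [-8, 7]. It's a sketch under the assumption of two's complement integers (guaranteed from C++20, and near-universal before that); the function names are invented:

```cpp
#include <cstdint>

// Hypothetical helper: packs two signed values in [-8, 7] into one byte.
// Masking with 0x0F keeps the low 4 bits of the two's complement pattern.
std::uint8_t pack_signed(int hi, int lo)
{
    return static_cast<std::uint8_t>(((hi & 0x0F) << 4) | (lo & 0x0F));
}

// Hypothetical helper: turns a 4-bit pattern back into a signed value.
// If bit 3 (the sign bit of the nibble) is set, the value is negative.
int sign_extend4(unsigned nibble)
{
    nibble &= 0x0Fu;
    return (nibble & 0x08u) ? static_cast<int>(nibble) - 16
                            : static_cast<int>(nibble);
}

int unpack_hi(std::uint8_t b) { return sign_extend4(b >> 4); }
int unpack_lo(std::uint8_t b) { return sign_extend4(b & 0x0F); }
```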
So I can't figure out how to do this in C++. I need to do a modulus operation and integer conversion on data that is 96 bits in length.
Example:
struct Hash96bit
{
char x[12];
};
int main()
{
Hash96bit n;
// set n to something
int size = 23;
int result = n % size;
}
Edit: I'm trying to have a 96 bit hash because i have 3 floats which when combined create a unique combination. Thought that would be best to use as the hash because you don't really have to process it at all.
Edit: Okay... so at this point I might as well explain the bigger issue. I have a 3D world that I want to subdivide into sectors, so that groups of objects can be placed in sectors, which would make frustum culling and physics iterations take less time. So at the beginning let's say you are at sector 0,0,0. Sure, we could store them all in an array, cool, but what happens when we get far away from 0,0,0? We don't care about those sectors anymore. So we use a hashmap, since memory isn't an issue and because we will be accessing data with sector values rather than handles. Now a sector is 3 floats, and hashing that could easily be done with any number of algorithms. I thought it might be better if I could just say the 3 floats together are the key and go from there; I just needed a way to mod a 96 bit number to fit it in the data segment. Anyway I think i'm just gonna take the bottom bits of each of these floats and use a 64 bit hash unless anyone comes up with something brilliant. Thank you for the advice so far.
UPDATE: Having just read your second edit to the question, I'd recommend you use David's Jenkins approach (which I upvoted a while back)... just point it at the first byte of your struct of three floats.
Regarding "Anyway I think i'm just gonna take the bottom bits of each of these floats" - again, the idea with a hash function used by a hash table is not just to map each bit in the input (much less some subset of them) to a bit in the hash output. You could easily end up with a lot of collisions that way, especially if the number of buckets is not a prime number. For example, if you take 21 bits from each float, and the number of buckets happens to be 1024 currently, then after % 1024 only 10 bits from one of the floats will be used with no regard to the values of the other floats... hash(a,b,c) == hash(d,e,c) for all c (it's actually a little worse than that - values like 5.5, 2.75 etc. will only use a couple of bits of the mantissa....).
Since you're insisting on this (though it's very likely not what you need, and a misnomer to boot):
struct Hash96bit
{
union {
float f[3];
char x[12];
uint32_t u[3];
};
Hash96bit(float a, float b, float c)
{
f[0] = a;
f[1] = b;
f[2] = c;
}
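// the operator will support your "int result = n % size;" usage...
// (uint128_t is not a standard type; this assumes one is available, e.g. a typedef for unsigned __int128 on GCC/Clang)
operator uint128_t() const
{
return u[0] * ((uint128_t)1 << 64) + // arbitrary ordering
u[1] * ((uint128_t)1 << 32) +
u[2];
}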
};
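If no 128-bit integer type is available, the remainder of a 96-bit value can still be computed with 64-bit arithmetic by working through 32-bit pieces from the most significant down, the same way long division works. A sketch, assuming the divisor is nonzero and fits in 32 bits; mod96 is a hypothetical name:

```cpp
#include <cstdint>

// Hypothetical helper: remainder of a 96-bit value given as three 32-bit
// limbs, most significant limb first. The divisor must be nonzero.
std::uint32_t mod96(const std::uint32_t limbs[3], std::uint32_t divisor)
{
    std::uint64_t rem = 0;
    for (int i = 0; i < 3; ++i) {
        // rem < divisor < 2^32, so (rem << 32) | limb fits in 64 bits
        rem = ((rem << 32) | limbs[i]) % divisor;
    }
    return static_cast<std::uint32_t>(rem);
}
```

This gives a well-defined n % size, but it doesn't address whether the raw float bits make a good hash in the first place.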
You can use jenkins hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += (unsigned char)key[i]; // cast so bytes >= 0x80 are not sign-extended where char is signed
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
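For the sector use case, the hash could be fed the raw bytes of the three floats and then reduced to a bucket index. This is just a sketch: sector_bucket is a name I invented, and the function is restated here with unsigned char so the block compiles on its own. Note that 0.0f and -0.0f compare equal but have different bytes, so they would hash differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Jenkins one-at-a-time hash over a byte buffer.
std::uint32_t one_at_a_time(const unsigned char* key, std::size_t len)
{
    std::uint32_t hash = 0;
    for (std::size_t i = 0; i < len; ++i) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

// Hypothetical helper: hashes three floats and maps the result to a bucket.
// bucket_count must be nonzero.
std::uint32_t sector_bucket(float x, float y, float z, std::uint32_t bucket_count)
{
    const float coords[3] = { x, y, z };
    unsigned char bytes[sizeof coords];
    std::memcpy(bytes, coords, sizeof coords);  // copy out the raw bytes of the floats
    return one_at_a_time(bytes, sizeof bytes) % bucket_count;
}
```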
I have a 128-bit number in hexadecimal stored in a string (from md5, security isn't a concern here) that I'd like to convert to a base-36 string. If it were a 64-bit or less number I'd convert it to a 64-bit integer then use an algorithm I found to convert integers to base-36 strings but this number is too large for that so I'm kind of at a loss for how to approach this. Any guidance would be appreciated.
Edit: After Roland Illig pointed out the hassle of saying 0/O and 1/l over the phone, and that base-36 doesn't gain much data density over hex, I think I may end up staying with hex. I'm still curious though if there is a relatively simple way to convert a hex string of arbitrary length to a base-36 string.
A base-36 encoding requires 6 bits to store each token. Same as base-64 but not using 28 of the available tokens. Solving 36^n >= 2^128 yields n >= log(2^128) / log(36) or 25 tokens to encode the value.
A base-64 encoding also requires 6 bits, all possible token values are used. Solving 64^n >= 2^128 yields n >= log(2^128) / log(64) or 22 tokens to encode the value.
Calculating the base-36 encoding requires dividing by powers of 36. No easy shortcuts, you need a division algorithm that can work with 128-bit values. The base-64 encoding is much easier to compute since it is a power of 2. Just take 6 bits at a time and shift by 6, in total 22 times to consume all 128 bits.
Why do you want to use base-36? Base-64 encoders are standard. If you really have a constraint on the token space (you shouldn't, ASCII rulez) then at least use a base-32 encoding. Or any power of 2, base-16 is hex.
If the only thing that is missing is the support for 128 bit unsigned integers, here is the solution for you:
#include <stdio.h>
#include <inttypes.h>
typedef struct {
uint32_t v3, v2, v1, v0;
} uint128;
static void
uint128_divmod(uint128 *out_div, uint32_t *out_mod, const uint128 *in_num, uint32_t in_den)
{
uint64_t x = 0;
x = (x << 32) + in_num->v3;
out_div->v3 = x / in_den;
x %= in_den;
x = (x << 32) + in_num->v2;
out_div->v2 = x / in_den;
x %= in_den;
x = (x << 32) + in_num->v1;
out_div->v1 = x / in_den;
x %= in_den;
x = (x << 32) + in_num->v0;
out_div->v0 = x / in_den;
x %= in_den;
*out_mod = x;
}
int
main(void)
{
uint128 x = { 0x12345678, 0x12345678, 0x12345678, 0x12345678 };
uint128 result;
uint32_t mod;
uint128_divmod(&result, &mod, &x, 16);
fprintf(stdout, "%08"PRIx32" %08"PRIx32" %08"PRIx32" %08"PRIx32" rest %08"PRIx32"\n", result.v3, result.v2, result.v1, result.v0, mod);
return 0;
}
Using this function you can repeatedly compute the mod-36 result, which leads you to the number encoded as base-36.
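The same repeated-division idea also works directly on the hex digits, which handles strings of any length. A sketch in C++ rather than C; hex_to_base36 is a made-up name, and the output alphabet 0-9 then a-z is my assumption:

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a hex string of any length to base 36
// by repeatedly dividing the number by 36 and collecting the remainders.
std::string hex_to_base36(const std::string& hex)
{
    static const char alphabet[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    std::vector<int> num;  // base-16 digits, most significant first
    for (char c : hex) {
        if (c >= '0' && c <= '9')      num.push_back(c - '0');
        else if (c >= 'a' && c <= 'f') num.push_back(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') num.push_back(c - 'A' + 10);
        else throw std::invalid_argument("not a hex digit");
    }
    std::string out;
    std::size_t start = 0;
    for (;;) {
        while (start < num.size() && num[start] == 0) ++start;  // skip leading zeros
        if (start == num.size()) break;                         // quotient is zero
        int rem = 0;
        for (std::size_t i = start; i < num.size(); ++i) {      // long division by 36
            const int cur = rem * 16 + num[i];
            num[i] = cur / 36;
            rem = cur % 36;
        }
        out.push_back(alphabet[rem]);  // remainders come out least significant first
    }
    if (out.empty()) out = "0";
    std::reverse(out.begin(), out.end());
    return out;
}
```

For example, "ff" (255 = 7 * 36 + 3) becomes "73", and a full 128-bit value of 32 f digits comes out 25 characters long.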
If you are using C++ with .NET 4 you could always use the System.Numerics.BigInteger class. You could try calling one of the toString overrides to get you to base 36.
Alternatively look at one of the many Big Integer libraries e.g. Matt McCutchen's C++ Big Integer Library although you might have to look into the depths of the classes to use a custom base such as 36.
Two things:
1. It really isn't that hard to divide a byte string by 36. But if you can't be bothered to implement that, you can use base-32 encoding, which would need 26 bytes instead of 25.
2. If you want to be able to read the result over the phone to humans, you absolutely must add a simple checksum to your string, which will cost one or two bytes but will save you a huge amount of 'Chinese whispers' hassle from hard-of-hearing customers.
I have two hex strings, accompanied by masks, that I would like to merge into a single string value/mask pair. The strings may have bytes that overlap but after applying masks, no overlapping bits should contradict what the value of that bit must be, i.e. value1 = 0x0A, mask1 = 0xFE and value2 = 0x0B, mask2 = 0x0F basically says that the resulting merge must have the upper nibble be all '0's and the lower nibble must be 1011.
I've done this already using straight c, converting strings to byte arrays and memcpy'ing into buffers as a prototype. It's tested and seems to work. However, it's ugly and hard to read and doesn't throw exceptions for specific bit requirements that contradict. I've considered using bitsets, but is there another way that might not demand the conversion overhead? Performance would be nice, but not crucial.
EDIT: More detail, although writing this makes me realize I've made a simple problem too difficult. But, here it is, anyway.
I am given a large number of inputs that are binary searches of a mixed-content document. The document is broken into pages, and pages are provided by an API that delivers a single page at a time. Each page needs to be searched with the provided search terms.
I have all the search terms prior to requesting pages. The inputs are strings representing hex digits (this is what I mean by hex strings) as well as a mask to indicate which bits are significant in the input hex string. Since I'm given all input up-front I wanted to improve the search of each page returned, so I wanted to pre-process the hex strings by merging them together. To make the problem more interesting, every string has an optional offset into the page where it must appear, and a lack of an offset indicates that the string can appear anywhere in a requested page. So, something like this:
class Input {
public:
int input_id;
std::string value;
std::string mask;
bool offset_present;
unsigned int offset;
};
If a given Input object has offset_present = false, then any value assigned to offset is ignored. If offset_present is false, then it clearly can't be merged with other inputs.
To make the problem more interesting, I want to report an output that provides information about what was found (input_id that was found, where the offset was, etc). Merging some input (but not others) makes this a bit more difficult.
I had considered defining a CompositeInput class and was thinking about having the underlying merger be a bitset, but further reading about bitsets made me realize they weren't what I thought. My inexperience made me give up on the composite idea and go brute force. I necessarily skipped some details about other input types and additional information to be collected for the output (say, page number, paragraph number) when an input is found. Here's an example output class:
class Output {
public:
Output();
int id_result;
unsigned int offset_result;
};
I would want to produce N of these if I merge N hex strings, keeping any merger details hidden from the user.
I don't know what a hexstring is... but other than that it should be like this:
outcome = (value1 & mask1) | (value2 & mask2);
it sounds like |, & and ~ would work?
const size_t prefix = 2; // "0x"
const size_t bytes = 2;
const char* value1 = "0x0A";
const char* mask1 = "0xFE";
const char* value2 = "0x0B";
const char* mask2 = "0x0F";
char output[prefix + bytes + 1] = "0x";
uint8_t char2int[] = { /*zeroes until index '0'*/ 0,1,2,3,4,5,6,7,8,9 /*...*/ 10,11,12,13,14,15 };
char int2char[] = { '0', /*...*/ 'F' };
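for (size_t ii = prefix; ii != prefix + bytes; ++ii)
{
    const uint8_t v1 = char2int[(unsigned char)value1[ii]], m1 = char2int[(unsigned char)mask1[ii]];
    const uint8_t v2 = char2int[(unsigned char)value2[ii]], m2 = char2int[(unsigned char)mask2[ii]];
    // a conflict is a bit that both masks care about but on which the two values disagree
    if ((v1 ^ v2) & m1 & m2)
        throw std::invalid_argument("conflicting bits");
    output[ii] = int2char[(v1 & m1) | (v2 & m2)];
}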
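On already-parsed bytes, the merge rule from the question can be sketched as follows. A contradiction is taken to mean a bit that both masks mark as significant but where the two values differ; merge_masked is a hypothetical name and the pair return type is just one way to hand back value and mask together:

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>

// Hypothetical helper: merges two value/mask pairs for one byte.
// Returns {merged value, merged mask}; throws if the inputs contradict.
std::pair<std::uint8_t, std::uint8_t>
merge_masked(std::uint8_t value1, std::uint8_t mask1,
             std::uint8_t value2, std::uint8_t mask2)
{
    // Bits that differ between the values AND are significant in both masks.
    if ((value1 ^ value2) & mask1 & mask2)
        throw std::invalid_argument("conflicting bits");

    const std::uint8_t value =
        static_cast<std::uint8_t>((value1 & mask1) | (value2 & mask2));
    const std::uint8_t mask = static_cast<std::uint8_t>(mask1 | mask2);
    return { value, mask };
}
```

With the question's example, merge_masked(0x0A, 0xFE, 0x0B, 0x0F) returns value 0x0B with mask 0xFF: the upper nibble is all zeros and the lower nibble is 1011. Changing mask1 to 0xFF makes bit 0 significant in both inputs, and the call throws.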