Is there a way to build a unique integer from the data below? - c++

I have structured data like below:
struct Leg
{
    char type;
    char side;
    int  qty;
    int  id;
} Legs[5];
where
type is 'O' or 'E',
side is 'B' or 'S',
qty is 1 to 9999, and the qty values across the Legs are relatively prime to each other, i.e. 1 2 3, not 2 4 6,
id is an integer from 1 to 9999999, and all ids are unique within the group of Legs.
To build a unique signature for the above data, I currently build a string like below:
first sort the Legs by id;
then
signature = ""
for i = 1 to 5
    signature += id + type + qty + side of leg-i
and I insert it into an unordered_map, so that when matching structured data arrives I can look it up by building its signature as above.
An unordered_map keyed on a string means the key comparison is a string compare, and the hash function also has to traverse the string, which is usually around 25 chars.
For efficiency, if it is possible to build a unique integer out of the above data for each structure, lookups/insertions in the unordered_map would be much faster.
Just wondering if there are any mathematical properties I can take advantage of.
Edit:
The map will contain key/value pairs like
<key = unique signature, value = an int that needs to be located when another repeating Leg group is looked up, by constructing its signature as above after sorting the Legs by id>
<123O2B234E3S456O3S567O2S789E2B, 989>
The goal is to build a unique signature for each such unique repeating group of legs. The legs may appear in a different order and still match another group whose legs are in yet another order; that's why I sort by the unique id and then build the signature.
My signature is string based; if there were a way to construct a unique numeric signature, my lookups/insertions would be faster.

You can just create a unique 40-bit number from the fields you have. Why 40 bits? I'm glad you asked.
You have 9,999,999 possible id values, which means you can use 24 bits to represent all possibilities (log2(9999999) = a little over 23).
You have 9,999 possible qty values, which requires another 14 bits.
type and side require 1 bit each, which gives you a total of 40 bits of information. Store this number as a long long and you have a nice, fast key for your map.
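For illustration, here is a minimal sketch of that packing (the field layout and helper name are my own assumptions, using the Leg struct from the question):

#include <cstdint>

struct Leg { char type; char side; int qty; int id; };  // from the question

// Pack one Leg into 40 bits: [id:24][qty:14][type:1][side:1].
// Assumes the stated value ranges hold; no validation is done here.
uint64_t pack_leg(const Leg& l)
{
    uint64_t key = static_cast<uint64_t>(l.id) & 0xFFFFFF;        // 24 bits
    key = (key << 14) | (static_cast<uint64_t>(l.qty) & 0x3FFF);  // 14 bits
    key = (key << 1) | (l.type == 'O' ? 0u : 1u);                 // 1 bit
    key = (key << 1) | (l.side == 'B' ? 0u : 1u);                 // 1 bit
    return key;
}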
If you really want a unique int key then you're probably out of luck because it's going to be pretty tricky to get rid of 8 bits of information. You might be able to take advantage of the co-primality of the qty field to represent it in fewer than 14 bits, however I doubt that you can get it down to 6 bits because that only gives you 64 possible values for qty.
That's a way to get what you asked for, but @David Schwartz's answer is probably what you actually need: hash collisions are generally not expensive unless you have a really bad hash function - see Application vulnerability due to Non Random Hash Functions for an example of how that can bite you - or a carefully crafted data set that happens to hit the worst case.
In your case you should be fine with David's answer. It'll be fast enough unless you are extremely unfortunate with your set of data.
EDIT: Just noticed that you are computing your signature over the set of 5 Legs. The same math applies; you will just need 200 bits rather than 40. So it won't fit in a long long unless you have some information that can be shared amongst all 5 Leg objects; if each set of 5 shared the same id, for example.
Stick with David's answer.

It doesn't have to be unique. I would suggest something like:
std::size_t hash_value(const Leg& l)
{
    std::size_t ret = l.type;
    ret <<= 8;
    ret |= l.side;
    ret *= 2654435761;  // Knuth's multiplicative hashing constant
    ret += l.qty;
    ret *= 2654435761;
    ret += l.id;
    return ret * 2654435761;
}
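If it helps, here is a minimal sketch (my own addition, with assumed names) of wiring this into an unordered_map; the map also needs an equality operator on Leg to resolve hash collisions:

#include <cstddef>
#include <unordered_map>

struct Leg { char type; char side; int qty; int id; };  // from the question

std::size_t hash_value(const Leg& l);  // the function above

// Needed so the map can tell real matches from hash collisions.
bool operator==(const Leg& a, const Leg& b)
{
    return a.type == b.type && a.side == b.side && a.qty == b.qty && a.id == b.id;
}

// Hypothetical adapter that forwards to hash_value().
struct LegHasher {
    std::size_t operator()(const Leg& l) const { return hash_value(l); }
};

std::unordered_map<Leg, int, LegHasher> table;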

In order to create an order-independent hash function for groups of five legs, first choose a hash function for individual legs -- David's answer looks great. Compute the hashes for each of the five legs. Now choose an order-independent function to combine these five hash values. You could, for example, xor the hashes together, or add them all together, or multiply them all together.
The fact that multiplication distributes over addition, and multiplication was the last operation to happen, makes me a little bit wary of using that. I think xor might be the best option of the ones I give here; but before using this in production, you should definitely run a few tests to see if you can easily generate collisions with any of them.
Probably superfluous, but here is a simple implementation that calls hash_value from David's answer:
std::size_t hash_value(const Leg_Array& legs) {
    std::size_t ret = 0;
    for (int i = 0; i < 5; ++i) {
        ret ^= hash_value(legs[i]);
    }
    return ret;
}

Related

Storing multiple int values into one variable - C++

I am doing an algorithmic contest, and I'm trying to optimize my code. Maybe what I want to do is stupid and impossible but I was wondering.
I have these requirements:
An inventory which can contain 4 distinct types of item. This inventory can't contain more than 10 items (all types included). Example of a valid inventory: 1 / 1 / 1 / 0. Examples of invalid inventories: 11 / 0 / 0 / 0 or 5 / 5 / 5 / 0
I have recipes which consume or add items in my inventory. A recipe can't add or consume more than 10 items, since the inventory can't hold more than 10 items. Example of a valid recipe: -1 / -2 / 3 / 0. Example of an invalid recipe: -6 / -6 / +12 / 0
For now, I store the inventory and the recipe as 4 integers each. Then I am able to perform some operations like:
ApplyRecepe: Inventory(1/1/1/0).Apply(Recepe(-1/1/0/0)) = Inventory(0/2/1/0)
CanAfford: Inventory(1/1/0/0).CanAfford(Recepe(-2/1/0/0)) = False
I would like to know if it is possible (and if yes, how) to store the 4 values of an inventory/recipe in one single integer, and to perform the previous operations on it faster than by comparing/adding the 4 integers as I'm doing now.
I thought of something like having the inventory like that:
int32: XXXX (number of items of the first type) - YYYY (number of items of the second type) - ZZZ (number of items of the third type) - WWW (number of item of the fourth type)
But I have two problems with that:
I don't know how to handle the possible negative values
It seems to me much slower than just adding the 4 integers since I have to bit shift the inventory and the recipe to get the value I want and then proceed with the addition.
Storing multiple int values into one variable
Here are two alternatives:
An array. The advantage of this is that you may iterate over the elements:
int variable[] {
    1,
    1,
    1,
    0,
};
Or a class. The advantage of this is the ability to name the members:
struct {
    int X;
    int Y;
    int Z;
    int W;
} variable {
    1,
    1,
    1,
    0,
};
Then I am able to perform some operations like:
Those look like SIMD vector operations (Single Instruction, Multiple Data). The array is the way to go in this case. Since the number of operands appears to be constant and small in your description, an efficient way to perform them is vector operations on the CPU [1].
There is no standard way to use SIMD operations directly in C++. To give the compiler optimal opportunity to use them, these steps need to be followed:
Make sure that the CPU you use supports the operations that you need. The AVX2 instruction set and its extensions have wide support for integer vector operations.
Make sure that you tell the compiler that the program should be optimised for that architecture.
Make sure to tell the compiler to perform vectorisation optimisations.
Make sure that the integers are sufficiently aligned as required by the operations. This can be achieved with alignas.
Make sure that the number of integers is known at compile time.
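As an illustration (my own sketch, not from the answer), here is a plain aligned array addition that an optimising compiler can typically auto-vectorise, e.g. with -O3 -march=native on GCC or Clang:

#include <iostream>

constexpr int count = 4;

int main()
{
    // Aligned, fixed-size arrays are good candidates for auto-vectorisation.
    alignas(16) int inventory[count] { 1, 1, 1, 0};
    alignas(16) int recepe[count]    {-1, 1, 0, 0};
    alignas(16) int applied[count]   {};

    for (int i = 0; i < count; i++)
        applied[i] = inventory[i] + recepe[i];

    for (int i = 0; i < count; i++)
        std::cout << applied[i] << ", ";
}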
If the prospect of relying on the optimiser worries you, then you may instead prefer to use vector extensions provided by your compiler. The use of language extensions naturally comes at the cost of portability to other compilers. Here is an example with GCC:
#include <iostream>

constexpr int count = 4;
using v4si = int __attribute__ ((vector_size (sizeof(int) * count)));

int main()
{
    v4si inventory { 1, 1, 1, 0};
    v4si recepe    {-1, 1, 0, 0};
    v4si applied = inventory + recepe;
    for (int i = 0; i < count; i++) {
        std::cout << applied[i] << ", ";
    }
}
[1] If the number of operands were large, then a specialised vector processor such as a GPU could be faster.
Especially if you're learning, it's not a bad opportunity to try implementing your own helper class for vectorization, and consequently deepen your understanding about data in C++, even if your use case might not warrant the technique.
The insight you want to exploit is that arithmetic operations are largely invariant under bit-shifts, once one accounts for the pesky carry bit and the effects of signedness (e.g. two's complement). But precisely because of these latter factors, it's much better to use some standardized underlying type like an int8_t[], as @Botje suggests.
To begin, implement the following functions. (My C++ is rusty, consider this pseudocode.)
int8_t* add(int8_t[], int8_t[], size_t);
int8_t* multiply(int8_t[], int8_t[], size_t);
int8_t* zeroes(size_t); // additive identity
int8_t* ones(size_t); // multiplicative identity
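For example, a minimal sketch of add() (my own illustration; it heap-allocates the result and lets overflow wrap, both of which are design choices the questions below would revisit):

#include <cstdint>
#include <cstddef>

// Element-wise addition of two int8_t arrays of length n.
// Overflow wraps around here; see the questions below for alternatives.
int8_t* add(const int8_t* a, const int8_t* b, std::size_t n)
{
    int8_t* result = new int8_t[n];
    for (std::size_t i = 0; i < n; ++i)
        result[i] = static_cast<int8_t>(a[i] + b[i]);
    return result;  // caller owns the memory
}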
Also considering:
How would you like to handle overflows and underflows? Let them be and ask the developer to be cautious? Or throw exceptions?
Maybe you'd like to pin down the size of the array and avoid having to deal with a dynamic size_t?
Maybe you'd like to go as far as overloading operators?
The end result of an exercise like this, but generalized and polished, is something like Armadillo. But you'll understand it on a whole different level by doing the exercise yourself first. Also, if all this makes sense so far, you can now take a look at How to vectorize my loop with g++? - even the compiler can vectorize for you in certain cases.
Bitpacking as @Botje mentions is another step beyond this. You won't even have the safety and convenience of an integer type like int8_t or a hypothetical int4_t. Which additionally means the code you write might stop being platform-independent. I recommend at least finishing the vectorization exercise before delving into this.
This will be something of a non-answer, just intended to show what you're up against if you do bitpacking.
Suppose, for simplicity's sake, that recipes can only remove from inventory, and only contain positive values (you could represent negative numbers using two's complement, but it would take more bits, and add much complexity to working with the bit-packed numbers).
You then have 11 possible values for an item, so you need 4 bits for each item. Four items can then be represented in one uint16_t.
So, say you have an inventory with 10,4,6,9 items; this would be uint16_t inv = 0b1010'0100'0110'1001.
Then, a recipe with 2,2,2,2 items or uint16_t rec = 0b0010'0010'0010'0010.
inv - rec would give 0b1000'0010'0100'0111 for 8,2,4,7 items.
So far, so good. No need here to shift and mask to get at the individual values before doing the calculation. Yay.
Now, a recipe with 6,6,6,6 items which would be 0b0110'0110'0110'0110, giving inv - rec = 0b0011'1110'0000'0011 for 3,14,0,3 items.
Oops.
The arithmetic will work, but only if you check beforehand that the individual 4-bit results don't go out of bounds; in this example this would mean that you know beforehand that there are enough items in the inventory to fill a recipe.
You could get at, say, the third item in the inventory by doing (inv >> 4) & 0b1111, or (inv << 8) >> 12 (the latter only if the arithmetic really is 16-bit; with C++'s integer promotion to int you would still need a mask), for doing your checks.
For testing, you would then get expressions like:
if (((inv >> 4) & 0b1111) >= ((rec >> 4) & 0b1111))
or, comparing the 4 bits "in place":
if ((inv & 0b0000000011110000) >= (rec & 0b0000000011110000))
for each 4-bit part. (Mind the parentheses: relational operators bind tighter than & in C++.)
All these things are doable, but do you want to? It probably won't be faster than what is suggested in the other answers after the compiler has done its job, and it certainly won't be more readable.
It becomes even more horrible when you allow negative numbers (two's complement or otherwise) in recipes, especially if you want to bit-shift them.
So, bitpacking is nice for storage, and in some rare cases you can even do math without unpacking the bits, but I wouldn't try to go there (unless you are very performance and memory constrained).
Having said that, it could be fun to try to get it to work; there's always that.

Finding all values that occurs odd number of times in huge list of positive integers

I came across this question from a colleague.
Q: Given a huge list (say some thousands) of positive integers, with many values repeating in the list, how do you find the values occurring an odd number of times?
Like 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1...
Here,
1 occurs 8 times
2 occurs 7 times (must be listed in output)
3 occurs 6 times
4 occurs 5 times (must be listed in output)
& so on... (the above set of values is only for explaining the problem, but really there could be any positive numbers in the list, in any order).
Originally we were looking at deriving the logic (to be based on C).
I suggested the following:
Use a hash table with the values from the list as indices/keys into the table, and keep updating the count at the corresponding index every time a value is encountered while walking through the list. However, how do you decide on the size of the hash table? It might need to be as big as the list, though I couldn't say for sure.
Once the list has been walked and the hash table populated (with the count of occurrences for each value/index), is walking through the table the only way to find/list the values occurring an odd number of times?
This might not be the best solution for this scenario.
Can you please suggest any other efficient way of doing it?
I searched on SO, but found only questions/replies about finding a single value occurring an odd number of times, none like the one I have mentioned.
The relevance of this question is unknown, but it seems to have been asked in an interview...
Please suggest.
Thank you.
If the values to be counted are bounded by even a moderately reasonable limit then you can just create an array of counters, and use the values to be counted as the array indices. You don't need a tight bound, and "reasonable" is somewhat a matter of platform. I would not hesitate to take this approach for a bound (and therefore array size) sufficient for all uint16_t values, and that's not a hard limit:
#include <stdint.h>
#include <string.h>

#define UPPER_BOUND 65536

uint64_t count[UPPER_BOUND];

void count_values(size_t num_values, uint16_t values[num_values]) {
    size_t i;

    memset(count, 0, sizeof(count));
    for (i = 0; i < num_values; i += 1) {
        count[values[i]] += 1;
    }
}
Since you only need to track even vs. odd counts, though, you really only need one bit per distinct value in the input. Squeezing it that far is a bit extreme, but this isn't so bad:
#include <stdint.h>
#include <string.h>

#define UPPER_BOUND 65536

uint8_t odd[UPPER_BOUND];

void count_values(size_t num_values, uint16_t values[num_values]) {
    size_t i;

    memset(odd, 0, sizeof(odd));
    for (i = 0; i < num_values; i += 1) {
        odd[values[i]] ^= 1;
    }
}
At the end, odd[i] contains 1 if the value i appeared an odd number of times, and it contains 0 if i appeared an even number of times.
On the other hand, if the values to be counted are so widely distributed that an array would require too much memory, then the hash table approach seems reasonable. In that case, however, you are asking the wrong question. Rather than
how do you decide on the size of the hash table?
you should be asking something along the lines of "what hash table implementation doesn't require me to manage the table size manually?" There are several. Personally, I have used UTHash successfully, though as of recently it is no longer maintained.
You could also use a linked list maintained in order, or a search tree. No doubt there are other viable choices.
You also asked
Once the list has been walked and the hash table populated (with the count of occurrences for each value/index), is walking through the table the only way to find/list the values occurring an odd number of times?
If you perform the analysis via the general approach we have discussed so far then yes, the only way to read out the result is to iterate through the counts. I can imagine alternative, more complicated, approaches wherein you switch numbers between lists of those having even counts and those having odd counts, but I'm having trouble seeing how whatever efficiency you might gain in readout could fail to be swamped by the efficiency loss at the counting stage.
In your specific case, you can walk the list and toggle the value's existence in a set. The resulting set will contain all of the values that appeared an odd number of times. However, this only works for that specific predicate, and the more generic count-then-filter algorithm you describe will be required if you wanted, say, all of the entries that appear an even number of times.
Both algorithms should be O(N) time and worst-case O(N) space, and the constants will probably be lower for the set-based algorithm, but you'll need to benchmark it against your data. In practice, I'd run with the more generic algorithm unless there was a clear performance problem.
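A minimal sketch of the set-toggling idea (my own illustration, in C++ for concreteness):

#include <unordered_set>
#include <vector>

// Returns the set of values that occur an odd number of times:
// each occurrence toggles the value's membership in the set.
std::unordered_set<int> odd_occurrences(const std::vector<int>& values)
{
    std::unordered_set<int> odd;
    for (int v : values) {
        auto [it, inserted] = odd.insert(v);
        if (!inserted)
            odd.erase(it);  // even-numbered occurrence: toggle off
    }
    return odd;
}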

What are some checksum implementations that allow for incremental computation?

In my program I have a set of sets that are stored in a proprietary hash table. Like all hash tables, I need two functions for each element. First, I need the hash value to use for insertion. Second, I need a compare function when there are conflicts. It occurs to me that a checksum function would be perfect for this: I could use the value in both functions. There's no shortage of checksum functions, but I would like to know if there are any commonly available ones that wouldn't require bringing in a library (my company is a PIA when it comes to that). A system library would be OK.
But I have an additional, more complicated requirement. I need for the checksum to be incrementally calculable. That is, if a set contains A B C D E F and I subtract D from the set, it should be able to return a new checksum value without iterating over all the elements in the set again. The reason for this is to prevent non-linearity in my code. Ideally, I'd like for the checksum to be order independent but I can sort them first if needed. Does such an algorithm exist?
Simply store a dictionary of items in your set, and their corresponding hash value. The hash value of the set is the hash value of the concatenated, sorted hashes of the items. In Python:
import hashlib

# dictionary mapping each item to its hash, in string representation
# (assumes each item is a bytes object)
hashes = { item: hashlib.sha384(item).hexdigest() for item in items }

sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join(sorted_hashes)
hash_of_the_set = hashlib.sha384(concatenated_hashes.encode())
As hash function I would use sha384, but you might want to try Keccak-384.
Because there are (of course) no cryptographic hash functions with a length of only 32 bits, you have to use a checksum instead, like Adler-32 or CRC-32. The idea remains the same. Best use Adler-32 on the items and CRC-32 on the concatenated hashes:
import zlib

hashes = { item: zlib.adler32(item) for item in items }

sorted_hashes = sorted(hashes.values())
# adler32 returns ints, so serialise them to a fixed width before concatenating
concatenated_hashes = b''.join(h.to_bytes(4, 'big') for h in sorted_hashes)
hash_of_the_set = zlib.crc32(concatenated_hashes)
In C++ you can use the Adler-32 and CRC-32 implementations from Botan.
A CRC is a set of bits that are calculated from an input.
If your input is the same size (or less) as the CRC (in your case - 32 bits), you can find the input that created this CRC - in effect reversing it.
If your input is larger than 32 bits, but you know all the input except for 32 bits, you can still reverse the CRC to find the missing bits.
If, however, the unknown part of the input is larger than 32 bits, you can't find it as there is more than one solution.
Why am I telling you this? Imagine you have the CRC of the set
{A,B,C}
Say you know what B is, and you can now calculate easily the CRC of the set
{A,C}
(by "easily" I mean - without going over the entire A and C inputs - like you wanted)
Now you have 64 bits describing A and C! And since we didn't have to go over the entirety of A and C to do it - it means we can do it even if we're missing information about A and C.
So it looks like IF such a method exists, we can magically fix more than 32 unknown bits from an input if we have the CRC of it.
This obviously is wrong. Does that mean there's no way to do what you want? Of course not. But it does give us constraints on how it can be done:
Option 1: we don't gain more information from CRC({A,C}) that we didn't have in CRC({A,B,C}). That means that the (relative) effect of A and C on the CRC doesn't change with the removal of B. Basically - it means that when calculating the CRC we use some "order not important" function when adding new elements:
we can use, for example:
CRC({A,B,C}) = CRC(A) ^ CRC(B) ^ CRC(C) (not very good, as if A appears twice it's the same CRC as if it never appeared at all), or
CRC({A,B,C}) = CRC(A) + CRC(B) + CRC(C), or
CRC({A,B,C}) = CRC(A) * CRC(B) * CRC(C) (make sure CRC(X) is odd, so it's actually just 31 bits of CRC), or
CRC({A,B,C}) = g^CRC(A) * g^CRC(B) * g^CRC(C) (where ^ is power - useful if you want it to be cryptographically secure), etc.
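As an illustration of the additive variant (my own sketch, not from the answer): an order-independent set checksum with O(1) add and remove, assuming some 32-bit per-element hash called element_crc here:

#include <cstdint>
#include <string>

uint32_t element_crc(const std::string& s);  // any 32-bit hash of one element

struct SetChecksum {
    uint32_t value = 0;
    // Addition mod 2^32 is commutative, so insertion order doesn't matter,
    // and removal is just subtraction - no re-scan of the set is needed.
    void add(const std::string& s)    { value += element_crc(s); }
    void remove(const std::string& s) { value -= element_crc(s); }
};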
Option 2: we do need all of A and C to calculate CRC({A,C}), but we have a data structure that makes it less than linear in time to do so if we already calculated CRC({A,B,C}).
This is useful if you want specifically CRC32, and don't mind remembering more information in addition to the CRC after the calculation (the CRC is still 32 bit, but you remember a data structure that's O(len(A,B,C)) that you will later use to calculate CRC{A,C} more efficiently)
How will that work? Many CRCs are just the application of a polynomial on the input.
Basically, if you divide the input into n chunks of 32 bit each - X_1...X_n - there is a matrix M such that
CRC(X_1...X_n) = M^n * X_1 + ... + M^1 * X_n
(where ^ here is power)
How does that help? This sum can be calculated in a tree-like fashion:
CRC(X_1...X_n) = M^(n/2) * CRC(X_1...X_(n/2)) + CRC(X_(n/2+1)...X_n)
So you begin with all the X_i on the leaves of the tree, start by calculating the CRC of each consecutive pair, then combine them in pairs until you get the combined CRC of all your input.
If you remember all the partial CRCs on the nodes, you can then easily remove (or add) an item anywhere in the list by doing just O(log(n)) calculations!
So there - as far as I can tell, those are your two options. I hope this wasn't too much of a mess :)
I'd personally go with option 1, as it's just simpler... but the resulting CRC isn't standard, and is less... good. Less "CRC"-like.
Cheers!

Algo: find max XOR in array for various interval limits, given N inputs, and p,q where 0<=p<=i<=q<=N

the problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, … xn as key. To implement this algorithm efficiently, Xorq needs to find maximum value for (a xor xj) for given integers a,p and q such that p<=j<=q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describe a query which consists of three integers ai,pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
#include <iostream>
using namespace std;

int xArray[100000];

int main()
{
    int t, n, q;
    cin >> t;
    for (int j = 0; j < t; j++)
    {
        cin >> n >> q;
        //int* xArray = (int*)malloc(n*sizeof(int));
        int i, a, pi, qi;
        for (i = 0; i < n; i++)
        {
            cin >> xArray[i];
        }
        for (i = 0; i < q; i++)
        {
            cin >> a >> pi >> qi;
            int max = 0;
            for (int it = pi - 1; it < qi; it++)
            {
                int val = xArray[it] ^ a;  // renamed from t to avoid shadowing
                if (val > max)
                    max = val;
            }
            cout << max << "\n";
        }
    }
}
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted).
The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?
XOR flips bits. The max result of XOR on 8-bit values is 0b11111111.
To get the best result
if 'a' on ith place has 1 then you have to XOR it with key that has ith bit = 0
if 'a' on ith place has 0 then you have to XOR it with key that has ith bit = 1
saying simply, for bit B you need !B
Another obvious thing is that higher order bits are more important than lower order bits.
That is:
if 'a' on highest place has B and you have found a key with highest bit = !B
then ALL keys that have highest bit = !B are worse than this one
This cuts your amount of numbers by half "in average".
How about building a huge binary tree from all the keys, ordering them in the tree by their bits, from MSB to LSB? Then, consuming a bit-by-bit from MSB to LSB would tell you which left/right branch to take next to get the best result. Of course, that ignores the pi/qi limits, but it would surely give you the best result, since you always pick the best available bit at the i-th level.
Now, if you annotate the tree nodes with the low/high index ranges of their sub-elements (done only once, when building the tree), then later, when querying against a case a-pi-qi, you can use that to filter out branches that do not fall in the index range.
The point is that if you order the tree levels in MSB->LSB bit order, then the decision performed at the "upper nodes" guarantees that you are currently in the best possible branch, and that holds even if all the sub-branches below were the worst:
Being at level 3, the result of
0b111?????
can be then expanded into
0b11100000
0b11100001
0b11100010
and so on, but even if the ????? are expanded poorly, the overall result is still greater than
0b11011111
which would be the best possible result had you picked the other branch at the 3rd level.
I have absolutely no idea how much preparing the tree would cost, but querying it for an a-pi-qi is something like 32 comparisons and jumps, certainly faster than iterating over up to 100,000 elements and xor/maxing. And since you have up to 50,000 queries, building such a tree can actually be a good investment, since it is built once per key set.
Now, the best part is that you actually don't need the whole tree. You may build it from, e.g., the first two or four or eight bits only, and use the index ranges from the nodes to limit your xor-max loop to a smaller part. At worst, you'd end up with the same range as pi-qi. At best, it'd be down to one element.
But, looking at the max of N = 100,000 keys, I think the whole tree might actually fit in memory, and you may get away without any xor-maxing loop.
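A minimal sketch of such a trie (my own illustration; it ignores the pi/qi index filtering described above and simply finds the best XOR partner among all inserted keys, assuming at least one key has been inserted):

#include <cstdint>

constexpr int BITS = 15;  // values are < 2^15 per the problem statement

struct TrieNode {
    TrieNode* child[2] = {nullptr, nullptr};
};

void insert(TrieNode* root, int key)
{
    TrieNode* node = root;
    for (int b = BITS - 1; b >= 0; --b) {
        int bit = (key >> b) & 1;
        if (!node->child[bit]) node->child[bit] = new TrieNode;
        node = node->child[bit];
    }
}

// Walk from MSB to LSB, preferring the branch holding the opposite of a's bit.
int maxXor(TrieNode* root, int a)
{
    TrieNode* node = root;
    int result = 0;
    for (int b = BITS - 1; b >= 0; --b) {
        int want = ((a >> b) & 1) ^ 1;     // the bit that would set bit b of a^x
        if (node->child[want]) {
            result |= 1 << b;
            node = node->child[want];
        } else {
            node = node->child[want ^ 1];  // forced to take the other branch
        }
    }
    return result;  // the maximum value of a ^ x over all inserted keys x
}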
I've spent some time googling this problem and it seems that you can find it in the context of various programming competitions. While the brute-force approach is intuitive, it does not really solve the challenge, as it is too slow.
There are a few constraints in the problem which you need to exploit in order to write a faster algorithm:
the input consists of max 100k numbers, but there are only 32768 (2^15) possible numbers
for each input array there are Q, max 50k, test cases; each test case consists of 3 values, a,pi,and qi. Since 0<=a<2^15 and there are 50k cases, there is a chance the same value will come up again.
I've found 2 ideas for solving the problem: splitting the input into sqrt(N) intervals, and building a segment tree (a nice explanation for these approaches can be found here).
The biggest problem is the fact that each test case can have a different value for a, which makes previous results useless, since you would need to compute max(a^x[i]) afresh. However, when Q is large enough and a value of a repeats, reusing previous results becomes possible.
I will come back with the actual results once I finish implementing both methods.

10 character id that's globally and locally unique

I need to generate a 10 character unique id (SIP/VOIP folks need to know that it's for a param icid-value in the P-Charging-Vector header). Each character shall be one of the 26 ASCII letters (case sensitive), one of the 10 ASCII digits, or the hyphen-minus.
It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!
Here's my take on it. I FIRST encode the local IP address (the part that MUST be globally unique) into base 63 (it's an unsigned long int that will occupy 1-6 characters after encoding), and then as much as I can of the current timestamp (it's a time_t/long long int that will occupy 9-4 characters after encoding, depending on how much space the encoded IP address occupied in the first place).
I've also added loop count 'i' to the time stamp to preserve the uniqueness in case the function is called more than once in a second.
Is this good enough to be globally and locally unique or is there another better approach?
Gaurav
#include <stdio.h>
#include <string.h>
#include <time.h>

// base-63 character set
static char set[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

// b63() returns the next vacant location in char array x
int b63(long long longlong, char *x, int index){
    if(index > 9)
        return index+1;
    //printf("index=%d,longlong=%lld,longlong%63=%lld\n",index,longlong,longlong%63);
    if(longlong < 63){
        x[index] = set[longlong];
        return index+1;
    }
    x[index] = set[longlong%63];
    return b63(longlong/63, x, index+1);
}

int main(){
    char x[11], y[11] = {0}; /* '\0' is taken care of here */
    // let's generate 10 million ids
    for(int i = 0; i < 10000000; i++){
        /* add i to timestamp to take care of sub-second function calls,
           3770168404 (a sample ip address in n/w byte order) = 84.52.184.224 */
        b63((long long)time(NULL)+i, x, b63((long long)3770168404, x, 0));
        // reverse the char array to get proper base-63 output
        for(int j = 0, k = 9; j < 10; j++, k--)
            y[j] = x[k];
        printf("%s\n", y);
    }
    return 0;
}
It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!
Are you in control of all the software generating ids? Are you doling out the ids? If not...
I know nothing about SIP, but there's got to be a misunderstanding that you have about the spec (or the spec must be wrong). If another developer attempts to build an id using a different algorithm than the one you've cooked up, you will have collisions with their ids, meaning they will no longer be globally unique in that system.
I'd go back to the SIP documentation and see if there's an appendix with an algorithm for generating these ids. Or maybe a smarter SO user than I can answer what the SIP algorithm for generating these ids is.
I would have a serious look at RFC 4122, which describes the generation of 128-bit GUIDs. There are several different generation algorithms, some of which may fit (the MAC address-based one springs to mind). This is a bigger number space than yours (2^128 = 3.4 * 10^38, compared with 63^10 = 9.8 * 10^17), so you may have to make some compromises on uniqueness. Consider factors like how frequently the IDs will be generated.
However in the RFC, they have considered some practical issues, like the ability to generate large numbers of unique values efficiently by pre-allocating blocks of IDs.
Can't you just have a distributed ID table?
Machines on NAT'ed LANs will often have an IP from a small range, and not all of the 32-bit values would be valid (think multicast, etc). Machines may also grab the same timestamp, especially if the granularity is large (such as seconds); keep in mind that the year is very often going to be the same, so it's the lower bits that will give you the most 'uniqueness'.
You may want to take the various values, hash them with a cryptographic hash, and translate that to the characters you are permitted to use, truncating to the 10 characters.
But you're dealing with a value with less than 60 bits; you need to think carefully about the implications of a collision. You might be approaching the problem the wrong way...
Well, if I cast aside the fact that I think this is a bad idea, and concentrate on a solution to your problem, here's what I would do:
You have an id range of 63^10, which corresponds to roughly 60 bits. You want it to be both "globally" and "locally" unique. Let's generate the first N bits to be globally unique, and the rest to be locally unique. The concatenation of the two will have the properties you are looking for.
First, the global uniqueness: IP won't work, especially local addresses, as they hold very little entropy. I would go with MAC addresses; they were made for being globally unique. They cover a range of 256^6, so they use up 6*8 = 48 bits.
Now, for the locally unique part: why not use the process ID? I'm making the assumption that the uniqueness is per process; if it's not, you'll have to think of something else. On Linux, the process ID is 32 bits. If we wanted to nitpick, the 2 most significant bytes probably hold very little entropy, as they would be 0 on most machines. So discard them if you know what you're doing.
So now you'll see you have a problem, as it would take up to 70 bits to generate a decent (but not bulletproof) globally and locally unique ID (using my technique, anyway). And since I would also advise putting in a random number (at least 8 bits long) just in case, it definitely won't fit. So if I were you, I would hash the ~78 generated bits with SHA-1 (for example), and convert the first 60 bits of the resulting hash to your ID format. To do so, notice that you have a 63-character range to choose from, so almost the full range of 6 bits. So split the hash into 6-bit pieces, and use the first 10 pieces to select the 10 characters of your ID from the 63-character range. Obviously, the range of 6 bits is 64 possible values (you only want 63), so if you have a 6-bit piece equal to 63, either floor it to 62 or take it modulo 63 and pick 0. It will slightly bias the distribution, but it's not too bad.
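For illustration (my own sketch, not from the answer), here is one way to map the first 60 bits of a digest to 10 characters from the 63-character set, folding 63 onto 0 as described:

#include <cstdint>
#include <string>

static const char SET[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

// Assumes `hash` points at >= 8 bytes of digest output (e.g. from SHA-1).
std::string to_id(const unsigned char* hash)
{
    // Load the first 64 bits of the digest into an integer.
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | hash[i];

    std::string id;
    for (int i = 0; i < 10; ++i) {
        unsigned piece = (bits >> (64 - 6 * (i + 1))) & 0x3F;  // next 6-bit piece
        id += SET[piece % 63];  // fold 63 onto 0, slightly biasing as noted
    }
    return id;
}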
So there, that should get you a decent globally and locally pseudo-unique ID.
A few last points: according to the Birthday paradox, you'll get a ~1% chance of collisions after generating ~142 million IDs, and a 99% chance after generating 3 billion IDs. So if you hit great commercial success and have millions of IDs being generated, get a larger ID.
Finally, I think I provided a "better than the worst" solution to your problem, but I can't help thinking you're attacking this problem in the wrong fashion, and possibly, as others have mentioned, misreading the specs. So use this if there is no other way that would be more "bulletproof" (a centralised ID provider, a much longer ID...).
Edit: I re-read your question, and you say you call this function possibly many times a second. I was assuming this was to serve as some kind of application ID, generated once at the start of your application and never changed afterwards until it exited. Since that's not the case, you should definitely add a random number, and if you generate a lot of IDs, make it at least a 32-bit number. And read and re-read the Birthday Paradox I linked to above. And seed your number generator with a highly entropic value, like the usec value of the current timestamp, for example. Or even go so far as to get your random values from /dev/urandom.
Very honestly, my take on your endeavour is that 60 bits is probably not enough...
Hmm, using the system clock may be a weakness... what if someone sets the clock back? You might re-generate the same ID again. But if you are going to use the clock, you might call gettimeofday() instead of time(); at least that way you'll get better resolution than one second.
@Doug T.
No, I'm not in control of all the software generating the ids.
I agree that without a standardized algorithm there may be collisions; I've raised this issue in the appropriate mailing lists.
@Florian
Taking a cue from your reply, I decided to use the /dev/urandom PRNG for a 32-bit random number as the space-unique component of the id. I assume that every machine will have its own noise signature, and it can be assumed to be safely globally unique in space at an instant in time. The time-unique component that I used earlier remains the same.
These unique ids are generated to collate all the billing information collected from the different network functions that independently generate charging information for a particular call during call processing.
Here's the updated code below:
Gaurav
#include <stdio.h>
#include <string.h>
#include <time.h>

// base-63 character set
static char set[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

// b63() returns the next vacant location in char array x
int b63(long long longlong, char *x, int index){
    if(index > 9)
        return index+1;
    if(longlong < 63){
        x[index] = set[longlong];
        return index+1;
    }
    x[index] = set[longlong%63];
    return b63(longlong/63, x, index+1);
}

int main(){
    unsigned int number;
    char x[11], y[11] = {0};
    FILE *urandom = fopen("/dev/urandom", "r");
    if(!urandom)
        return -1;
    // let's generate 1 billion ids
    for(int i = 0; i < 1000000000; i++){
        fread(&number, 1, sizeof(number), urandom);
        // add i to timestamp to take care of sub-second function calls
        b63((long long)time(NULL)+i, x, b63((long long)number, x, 0));
        // reverse the char array to get proper base-63 output
        for(int j = 0, k = 9; j < 10; j++, k--)
            y[j] = x[k];
        printf("%s\n", y);
    }
    fclose(urandom);
    return 0;
}