Find initial value of a XOR list - c++

I've a pool of numbers and from these one number X has been XORed with all the others. From these comparisons the minimum XOR values are stored in a list in a sorted format.
How can I retrieve X ?
e.g.
List: { 3, 2, 1, 0, 15, 14, 13, 12 }
Looking for X so that:
X ^ 3 < X ^ 2 < X ^ 1 <... < X ^ 12
There might be more than one X, or none at all. Is there any way to reverse the XOR when we know neither the initial value nor the results, only how the results compare to each other? How can this be solved efficiently, given that we know the whole pool of numbers?

You probably can't.
XOR is a really special operation. Its traits make it impossible to figure out anything given only one of the operands, or the result.
If A xor B == C, then we have all the other five expressions:
C xor A == B
B xor C == A
C xor B == A
A xor C == B
B xor A == C
If you see this, you should know it's impossible to get one value from another. Two is always needed for the third one.

It depends very much on the outputs you have. The problem might be underconstrained, giving many possible X values. Or it might be impossible (no X satisfies the constraints).
One approach would be to pair values that differ only in the most-significant bit. If the smaller of each pair appears before the larger, the MSB must be 0; if the larger appears first, then the MSB must be 1.
With that knowledge, we can consider pairs that differ only in the first two bits, and so on.
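The bit-by-bit idea can be taken one step further. Below is a minimal sketch, assuming the list is already in the required order and all values fit in 32 bits. For each adjacent pair, the highest bit in which the two values differ decides their order after the XOR, so that bit of X is forced to equal the corresponding bit of the earlier value. The names `XorConstraints` and `solveXorOrder` are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of the analysis: which bits of X are forced, and to what value.
struct XorConstraints {
    bool          feasible;   // false if no X can produce the given order
    std::uint32_t fixedMask;  // bits of X that are determined
    std::uint32_t fixedBits;  // required values of those bits
};

// For each adjacent pair (p, q) we need (X ^ p) < (X ^ q). The comparison is
// decided at the highest bit where p and q differ: X ^ p must have a 0 there,
// so X must carry the same bit value as p at that position.
XorConstraints solveXorOrder(const std::vector<std::uint32_t>& sorted) {
    XorConstraints c = { true, 0u, 0u };
    for (std::size_t i = 0; i + 1 < sorted.size(); ++i) {
        std::uint32_t diff = sorted[i] ^ sorted[i + 1];
        if (diff == 0) {            // equal values can never be strictly ordered
            c.feasible = false;
            return c;
        }
        std::uint32_t top = 1u;
        while (diff >>= 1) top <<= 1;   // isolate the highest set bit of diff
        std::uint32_t want = sorted[i] & top;
        if ((c.fixedMask & top) != 0 && (c.fixedBits & top) != want) {
            c.feasible = false;     // two pairs demand opposite values for this bit
            return c;
        }
        c.fixedMask |= top;
        c.fixedBits |= want;
    }
    return c;
}
```

For the list in the question this gives `fixedMask == 0b1011` and `fixedBits == 0b0011`: bit 2 is unconstrained, so among 4-bit values both X = 3 and X = 7 work.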

If certain conditions were true in the output, the value of X could be determined fairly efficiently.
For example:
If the value 15 were in the output array, then you would know that X is a binary complement of one of the other numbers. Also, if the value 0 were in the output array, you would know that this was the result of X being XORed with itself.
A general algorithm with an efficient time complexity would be difficult, however.

Do we need epsilon value for lesser or greater comparison for float value? [duplicate]

I have read several threads about comparing floats for equality, but it is still not clear to me whether we also need epsilon logic for less-than or greater-than comparisons of float values.
e.g.:
float a, b;
if (a < b) // is this correct way to compare two float value or we need epsilon value for lesser comparator
{
}
if (a > b) // is this correct way to compare two float value for greater comparator
{
}
I know that comparing floats for equality requires some epsilon value:
bool AreSame(double a, double b)
{
return fabs(a - b) < EPSILON;
}
It really depends on what should happen when both values are close enough to be seen as equal, meaning fabs(a - b) < EPSILON. In some use cases (for example when computing statistics), it does not matter much whether a comparison between two close values yields equality or not.
If it matters, you should first determine the uncertainty of the values. That depends on the use case (where the input values come from and how they are processed); two values differing by less than that uncertainty should then be considered equal. But that equality is no longer a true mathematical equivalence relation: you can easily imagine building a chain of close values between two truly different values. In mathematical terms, the relation is not transitive (or, in everyday language, it is only almost transitive).
I am sorry, but as soon as you have to process approximations there cannot be any precise and consistent way: you have to think about the real-world use case to determine how to handle the approximation.
When you are working with floats, it's inevitable that you will run into precision errors.
In order to mitigate this, when checking two floats for equality we often check whether their difference is small enough.
For lesser and greater, however, there is no way to tell with full certainty which float is larger. The best approach (presumably, for your intentions) is to first check whether the two floats are the same, using the AreSame function. If so, return false (as a = b implies that a < b and a > b are both false).
Otherwise, return the value of either a < b or a > b.
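The rule just described can be sketched as follows, assuming a fixed absolute tolerance. The value of `EPSILON` and the name `definitelyLess` are placeholders chosen for illustration; a real tolerance has to come from the application.

```cpp
#include <cmath>

const double EPSILON = 1e-9;  // assumed tolerance; pick one that fits your data

bool AreSame(double a, double b) {
    return std::fabs(a - b) < EPSILON;
}

// a counts as less than b only if the two are not "the same" within EPSILON.
bool definitelyLess(double a, double b) {
    if (AreSame(a, b)) return false;
    return a < b;
}
```

With this definition `definitelyLess(1.0, 1.0 + 1e-12)` is false, because the two values fall within the tolerance, while `definitelyLess(1.0, 2.0)` is true.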
The answer is application dependent.
If you are sure that a and b are sufficiently different that numerical errors will not reverse the order, then a < b is good enough.
But if a and b are dangerously close, you might require a < b + EPSILON. In such a case, it should be clear to you that < and ≤ are not distinguishable.
Needless to say, EPSILON should be chosen with the greatest care (which is often pretty difficult).
It ultimately depends on your application, but I would say generally no.
The problem, very simplified, is that if you calculate: (1/3) * 3 and get the answer 0.999999, then you want that to compare equal to 1. This is why we use epsilon values for equal comparisons (and the epsilon should be chosen according to the application and expected precision).
On the other hand, if you want to sort a list of floats then by default the 0.999999 value will sort before 1. But then again what would the correct behavior be? If they both are sorted as 1, then it will be somewhat random which one is actually sorted first (depending on the initial order of the list and the sorting algorithm you use).
The problem with floating point numbers is not that they are "random" and that it is impossible to predict their exact values. The problem is that base-10 fractions don't translate cleanly into base-2 fractions, and that non-repeating decimals in one system can translate into repeating ones in the other, which then results in rounding errors when truncated to a finite number of digits. We use epsilon values for equal comparisons to handle rounding errors that arise from these back and forth translations.
But do be aware that the nice relations that ==, < and <= have for integers, don't always translate over to floating points exactly because of the epsilons involved. Example:
a = x
b = a + epsilon/2
c = b + epsilon/2
d = c + epsilon/2
Now: a == b, b == c, c == d, BUT a != d, a < d. In fact, you can continue the sequence keeping num(n) == num(n+1) and at the same time get an arbitrarily large difference between a and the last number in the sequence.
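The chain above can be reproduced directly. This is a small sketch in which "==" means "closer than eps"; the helper names `closeEnough` and `chainBreaksTransitivity` are invented for the demonstration, and eps is passed in as a parameter so that the example can use values that are exact in binary.

```cpp
#include <cmath>

// Tolerance-based "equality".
bool closeEnough(double x, double y, double eps) {
    return std::fabs(x - y) < eps;
}

// Builds the chain a, b, c, d with steps of eps/2 and reports whether all
// neighbours compare equal while the two endpoints do not.
bool chainBreaksTransitivity(double x, double eps) {
    double a = x;
    double b = a + eps / 2;
    double c = b + eps / 2;
    double d = c + eps / 2;
    return closeEnough(a, b, eps) && closeEnough(b, c, eps) &&
           closeEnough(c, d, eps) && !closeEnough(a, d, eps);
}
```

For instance, `chainBreaksTransitivity(0.0, 1.0)` builds 0, 0.5, 1.0, 1.5: every neighbouring pair is within 1.0 of each other, but the endpoints are 1.5 apart.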
As others have stated, there would always be precision errors when dealing with floats.
Thus, you should have an epsilon value even for comparing less than / greater than.
We know that in order for a to be less than b, firstly, a must be different from b. Checking this is a simple NOT equals, which uses the epsilon.
Then, once you already know a != b, the operator < is sufficient.

Given (a, b) compute the maximum value of k such that a^{1/k} and b^{1/k} are whole numbers

I'm writing a program that tries to find the minimum value of k > 1 such that the kth root of a and b (which are both given) equals a whole number.
Here's a snippet of my code, which I've commented for clarification.
int main()
{
// Declare the variables a and b.
double a;
double b;
// Read in variables a and b.
while (cin >> a >> b) {
int k = 2;
// We require the kth root of a and b to both be whole numbers.
// "while a^{1/k} and b^{1/k} are not both whole numbers..."
while ((fmod(pow(a, 1.0/k), 1) != 1.0) || (fmod(pow(b, 1.0/k), 1) != 0)) {
k++;
}
Pretty much, I read in (a, b), and I start from k = 2 and increment k until the kth roots of a and b are both congruent to 0 mod 1 (meaning that they are divisible by 1 and thus whole numbers).
But, the loop runs infinitely. I've tried researching, and I think it might have to do with precision error; however, I'm not too sure.
Another approach I've tried is changing the loop condition to check whether the floor of a^{1/k} equals a^{1/k} itself. But again, this runs infinitely, likely due to precision error.
Does anyone know how I can fix this issue?
EDIT: for example, when (a, b) = (216, 125), I want to have k = 3 because 216^(1/3) and 125^(1/3) are both integers (namely, 6 and 5).
That is not a programming problem but a mathematical one:
if a is a real, and k a positive integer, and if a^(1./k) is an integer, then a is an integer. (otherwise the aim is to toy with approximation error)
So the fastest approach may be to first check whether a and b are integers, then do a prime decomposition such that a = p_0^{e_0} * p_1^{e_1} * ..., where the p_i are distinct primes.
Notice that, for a^{1/k} to be an integer, each e_i must be divisible by k. In other words, k must be a common divisor of the e_i. The same must be true for the prime powers of b if b^{1/k} is to be an integer.
Thus the largest k is the greatest common divisor of all e_i of both a and b.
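As a rough sketch of this idea, assuming a and b are positive integers that fit in 64 bits, the exponents can be collected by plain trial division and folded into a running gcd. The function names are made up for this example, and trial division is only practical when the prime factors are reasonably small.

```cpp
#include <cstdint>

static std::uint64_t gcd64(std::uint64_t x, std::uint64_t y) {
    while (y != 0) { std::uint64_t t = x % y; x = y; y = t; }
    return x;
}

// Folds the exponents of the prime factorisation of n into g (the gcd so far).
static std::uint64_t foldExponents(std::uint64_t n, std::uint64_t g) {
    for (std::uint64_t p = 2; p <= n / p; ++p) {
        std::uint64_t e = 0;
        while (n % p == 0) { n /= p; ++e; }
        if (e > 0) g = gcd64(g, e);
    }
    if (n > 1) g = gcd64(g, 1);  // one leftover prime factor with exponent 1
    return g;
}

// Largest k such that a^{1/k} and b^{1/k} are both integers (a, b >= 1).
// Returns 0 when a == b == 1, where every k works.
std::uint64_t largestCommonRoot(std::uint64_t a, std::uint64_t b) {
    return foldExponents(b, foldExponents(a, 0));
}
```

For the example in the question, `largestCommonRoot(216, 125)` returns 3, since 216 = 2^3 * 3^3 and 125 = 5^3.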
With your approach you will have problems with large numbers. All IEEE 754 binary64 floating-point numbers (the case of double on x86) have 53 significant bits. That means that all doubles larger than 2^53 are integers.
The function pow(x, 1./k) will return the same value for two different x, so your approach will necessarily give false answers. For example, the numbers 5^5*2^90 and 3^5*2^120 are exactly representable as doubles, and the result of the algorithm is k=5. You may find this value of k with these numbers, but you will also find k=5 for 5^5*2^90-2^49 and 3^5*2^120, because pow(5^5*2^90-2^49, 1./5) == pow(5^5*2^90, 1./5).
On the other hand, as there are only 53 significant bits, prime number decomposition of double is trivial.
Floating numbers are not mathematical real numbers. The computation is "approximate". See http://floating-point-gui.de/
You could replace the test fmod(pow(a, 1.0/k), 1) != 1.0 with something like fabs(fmod(pow(a, 1.0/k), 1) - 1.0) > 0.0000001 (and play with various such 𝛆 instead of 0.0000001; see also std::numeric_limits::epsilon, but use it carefully, since pow might introduce some error in its computations, and 1.0/k also injects imprecision; the details are very complex, dive into the IEEE 754 specifications).
Of course, you could (and probably should) define your bool almost_equal(double x, double y) function (and use it instead of ==, and use its negation instead of !=).
As a rule of thumb, never test floating numbers for equality (i.e. ==), but consider instead some small enough distance between them; that is, replace a test like x == y (respectively x != y) with something like fabs(x-y) < EPSILON (respectively fabs(x-y) > EPSILON) where EPSILON is a small positive number, hence testing for a small L1 distance (for equality, and a large enough distance for inequality).
And avoid floating point in integer problems.
Actually, predicting or estimating floating point accuracy is very difficult. You might want to consider tools like CADNA. My colleague Franck Védrine is an expert on static program analyzers to estimate numerical errors (see e.g. his TERATEC 2017 presentation on Fluctuat). It is a difficult research topic, see also D. Monniaux's paper "The pitfalls of verifying floating-point computations", etc.
And floating point errors did in some cases cost human lives (or loss of billions of dollars). Search the web for details. There are some cases where all the digits of a computed number are wrong (because the errors may accumulate, and the final result was obtained by combining thousands of operations)! There is some indirect relationship with chaos theory, because many programs might have some numerical instability.
As others have mentioned, comparing floating point values for equality is problematic. If you find a way to work directly with integers, you can avoid this problem. One way to do so is to raise integers to the k power instead of taking the kth root. The details are left as an exercise for the reader.
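One possible way to fill in that exercise, sketched under the assumption that a and b are positive integers that fit in 64 bits: binary-search an integer root and compare its k-th power with the target, bailing out before the multiplication can overflow. All function names here are hypothetical.

```cpp
#include <cstdint>

// Compares base^k with target without overflowing: returns -1, 0 or 1.
static int comparePow(std::uint64_t base, unsigned k, std::uint64_t target) {
    std::uint64_t p = 1;
    for (unsigned i = 0; i < k; ++i) {
        if (base != 0 && p > target / base) return 1;  // p * base would exceed target
        p *= base;
    }
    return p < target ? -1 : (p == target ? 0 : 1);
}

// True if n is the k-th power of some integer (binary search on the root).
bool isPerfectPower(std::uint64_t n, unsigned k) {
    std::uint64_t lo = 0, hi = n;
    while (lo <= hi) {
        std::uint64_t mid = lo + (hi - lo) / 2;
        int c = comparePow(mid, k, n);
        if (c == 0) return true;
        if (c < 0) lo = mid + 1;
        else if (mid == 0) return false;
        else hi = mid - 1;
    }
    return false;
}

// Smallest k > 1 for which both a and b are perfect k-th powers, or 0 if none.
unsigned smallestCommonK(std::uint64_t a, std::uint64_t b) {
    for (unsigned k = 2; k <= 64; ++k) {
        if (isPerfectPower(a, k) && isPerfectPower(b, k)) return k;
    }
    return 0;
}
```

Here `smallestCommonK(216, 125)` returns 3, and no floating-point arithmetic is involved at any step.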

SML Fibonacci large number . Using int datatype

I want big integers in SML. Please point me in the right direction.
I can write the normal Fibonacci function, but I have to print fibo(100) using only int, not IntInf.
fun fibo 0 = 0
| fibo n =
if n <= 2
then 1
else fibo (n-1) + fibo(n-2)
I have to print fibo(100)
only using int, not IntInf.int.
To address each of these,
Just printing (and not storing) fibo(100) does not necessarily involve finding a good representation for big numbers, so I would change this goal into "I have to find a way to represent big numbers so I can add them and display them."
But, as John points out, overflowing int values is not your only concern when calculating fibo(100). Just try your naive implementation for fibo(40) or so. It doesn't overflow, but it takes at least a few seconds on my computer. Now try fibo(41), fibo(42), etc. and witness the exponential growth. Your algorithm not only needs a representation of big integers, it also needs a sub-exponential resource usage.
E.g. by making the function tail-recursive:
fun fib n =
let fun fib_ a b 0 = a (* or b *)
| fib_ a b i = fib_ b (a+b) (i-1)
in fib_ (Int.toLarge 0) (Int.toLarge 1) n end
You're essentially asking how to represent numbers that are bigger than ints using ints only, which, logically, is not possible. But perhaps you mean "by inventing a representation that uses multiple ints in series." That's exactly what IntInf.int does. See for example How to use IntInf or LargeInt in SML?.
A naive implementation of big integers could keep lists of ints, add them pairwise and carry over when handle Overflow is triggered. Or simply strings of digit characters! But it sounds more like a mental exercise than something useful.
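To make the list-of-digits idea concrete, here is a minimal sketch. It is written in C++ rather than SML purely for illustration; the same structure (a list of small ints, least significant digit first, added with a carry) carries over to an SML `int list`. It uses one decimal digit per element instead of relying on an overflow exception, and the names `Digits`, `addDigits` and `fibDecimal` are invented here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<int> Digits;  // base-10 digits, least significant first

// Schoolbook addition with carry; every intermediate value fits in a plain int.
Digits addDigits(const Digits& x, const Digits& y) {
    Digits sum;
    int carry = 0;
    for (std::size_t i = 0; i < x.size() || i < y.size() || carry != 0; ++i) {
        int d = carry;
        if (i < x.size()) d += x[i];
        if (i < y.size()) d += y[i];
        sum.push_back(d % 10);
        carry = d / 10;
    }
    return sum;
}

// Iterative Fibonacci on digit lists: n additions instead of exponential recursion.
std::string fibDecimal(int n) {
    Digits a(1, 0), b(1, 1);  // fib(0) = 0, fib(1) = 1
    for (int i = 0; i < n; ++i) {
        Digits next = addDigits(a, b);
        a = b;
        b = next;
    }
    std::string text;
    for (std::size_t i = a.size(); i-- > 0;) text += char('0' + a[i]);
    return text;
}
```

`fibDecimal(100)` returns "354224848179261915075", a value well beyond what a 64-bit int can hold.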

Hashing an unordered sequence of small integers

Background
I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:
1. it is of length 12;
2. the order of the sequence elements does not matter;
3. no element appears twice in the same sequence;
4. all elements are smaller than about 300.
Note that the properties 2. and 3. imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.
I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.
Question
I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.
Any suggestions?
http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/
Here's a basic idea; feel free to modify it at will.
Hashing an integer is just the identity.
We use the formula from boost::hash_combine to combine hashes.
We sort the array to get a unique representative.
Code:
#include <algorithm>
std::size_t array_hash(int (&array)[12])
{
int a[12];
std::copy(array, array + 12, a);
std::sort(a, a + 12);
std::size_t result = 0;
for (int * p = a; p != a + 12; ++p)
{
std::size_t const h = *p; // the "identity hash"
result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
}
return result;
}
Update: scratch that. You just edited the question to be something completely different.
If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you an extra 12!, which is about 29 bits, so it doesn't really make a difference.
You can either look for a 128 bit unsigned integral type and store the sorted, packed set of integers in that directly. Or you can split that range up into two 64-bit integers and compute the hash as above:
uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
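The multiply-and-XOR variant might look like the sketch below. The multiplier is an assumption (the same golden-ratio constant used elsewhere in this thread), not a factor that has been tuned or tested for this data, and `xorMulHash` is a made-up name. Each element is offset by one so that the value 0 still contributes to the result.

```cpp
#include <cstddef>

// Commutative hash: XOR of each element smeared across the word by a
// constant multiplier. XOR is order-independent, so no sorting is needed.
std::size_t xorMulHash(const int (&seq)[12]) {
    const std::size_t factor = 0x9e3779b9u;  // assumed constant, not tuned
    std::size_t h = 0;
    for (int i = 0; i < 12; ++i) {
        h ^= static_cast<std::size_t>(seq[i] + 1) * factor;
    }
    return h;
}
```

Because equal hashes do not prove equal sets, a hash table built on this still has to compare the candidate sequences themselves on a hit.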
I would just use the sum function as the hash and see how far you come with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that they are all < 300. On the other hand, it’s blazingly fast.
#include <numeric>
std::size_t hash(int (&arr)[12]) {
return std::accumulate(arr, arr + 12, 0);
}
Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a number of if…else statements) to sort the 12 values in-place (but I have no idea what a sorting network for 12 values would look like, or even whether it’s practical).
EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:
#include <algorithm>
#include <numeric>
struct pow2 {
int operator ()(int n) const { return n * n; }
};
std::size_t hash(int (&arr)[12]) {
int raised[12];
std::transform(arr, arr + 12, raised, pow2());
return std::accumulate(raised, raised + 12, 0);
}
You could toggle bits, corresponding to each of the 12 integers, in the bitset of size 300. Then use formula from boost::hash_combine to combine ten 32-bit integers, implementing this bitset.
This gives commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
This approach may be generalized if we choose arbitrary bitset size and if we set or toggle arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or using a pre-computed lookup table). Which results in a Bloom filter or related structures.
We can choose Bloom filter of size 32 or 64 bits. In this case, there is no need to combine pieces of large bit vector into a single hash value. In case of classical implementation of Bloom filter with size 32, optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.
If, instead of "or" operation of classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get a solution, mentioned by Jim Balter.
If, instead of "or" operation, we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution, similar to one, suggested by Konrad Rudolph.
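The first variant in this answer (a bitset of about 300 bits folded with the boost::hash_combine formula) could be sketched like this. It assumes every element lies in [0, 320), rounds the bitset up to ten 32-bit words, and uses the hypothetical name `bitsetHash`.

```cpp
#include <cstddef>
#include <stdint.h>

// Commutative hash for 12 distinct values in [0, 320): set one bit per value
// in a 320-bit bitset, then fold its ten 32-bit words together.
std::size_t bitsetHash(const int (&seq)[12]) {
    uint32_t words[10] = { 0 };
    for (int i = 0; i < 12; ++i) {
        words[seq[i] / 32] |= uint32_t(1) << (seq[i] % 32);
    }
    std::size_t result = 0;
    for (int w = 0; w < 10; ++w) {
        // same mixing step as the boost::hash_combine formula quoted earlier
        result ^= words[w] + 0x9e3779b9 + (result << 6) + (result >> 2);
    }
    return result;
}
```

The bitset is identical for every permutation of the same 12 values, so the order-dependent folding step afterwards does not break commutativity.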
I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.
Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.
import random
def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)
mask_size = 64
mask_size_over_2 = mask_size // 2
nmasks = 300
suffix = 'UL'
print('HashType mask[' + str(nmasks) + '] = {')
for i in range(nmasks):
    # pick the positions of the 32 set bits at random
    combo = random_combination(range(mask_size), mask_size_over_2)
    mask = 0
    for j in combo:
        mask |= (1 << j)
    if i < nmasks - 1:
        print('\t' + str(mask) + suffix + ',')
    else:
        print('\t' + str(mask) + suffix + ' };')
The C++ array generated by the script is used as follows:
typedef int_least64_t HashType;
const int maxTableSize = 300;
HashType mask[maxTableSize] = {
// generated array goes here
};
inline HashType xorrer(HashType const &l, HashType const &r) {
return l^mask[r];
}
HashType hashConfig(HashType *sequence, int n) {
return std::accumulate(sequence, sequence+n, (HashType)0, xorrer);
}
This algorithm is by far the fastest of those that I have tried (this, this with cubes and this with a bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.

std::map trick for comparing unrepresentable numbers?

I would like to have a user-defined key in a C++ std::map. The key is a binary representation of an integer set with maximum value 2^V so I can't represent all 2^V possible values. I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Now the problem is that to put this user-defined bitset as key in a std::map, I need to define a valid comparison between bitset values but if I have a maximum size of, say, V=1000, then I cannot get a number I can compare, let alone aggregating them all i.e., 2^1000 is not representable.
Therefore my question is, suppose I have two different sets (by setting the right bits in my bitset representation) and I cannot represent the final number because it will overflow:
id_1 = 2^0 + 2^1 + ... + 2^V
id_2 = 2^0 + 2^1 + ... + 2^V
Is there a suitable transformation that would lead to a value I can compare? I need to be able to say id_1 < id_2 so I would like to transform a sum of exponentials to a value that is representable BUT maintaining the invariant of the "less than". I was thinking along the lines of e.g. applying a log transformation in a clever way to preserve "less than".
Here is an example:
set_1 = {2,3,4}; set_2 = {8}
id(set_1) = 2^2 + 2^3 + 2^4 = 28; id(set_2) = 2^8 = 256
id(set_1) < id(set_2)
Perfect! How about a general set that can have {1,...,V}, and thus 2^V possible subsets?
I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Supposing that this array is accessed via a data member ra of the key type Key, and both arrays are of length N, then you want a comparator something like this:
bool operator<(const Key &lhs, const Key &rhs) {
return std::lexicographical_compare(lhs.ra, &lhs.ra[N], rhs.ra, &rhs.ra[N]);
}
This implicitly considers the array to be big-endian, i.e. the first uint64_t is the most significant. If you don't like that, that's fair enough, since you might already have in mind some relative significance for whatever order you've stored your V bits into your array. There's no great mystery to lexicographical_compare, so just look at an example implementation and modify as required.
This is called "lexicographical order". Other than the facts that I've used uint64_t instead of char and both arrays are the same length, it is how strings are compared[*] -- in fact the use of uint64_t isn't important, you could just use std::memcmp in your comparator instead of comparing 64-bit chunks. operator< for strings doesn't work by converting the whole string to an integer, and neither should your comparator.
[*] until you bring locale-specific collation rules into play.
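Putting the pieces together, a key type along these lines could be used directly in a std::map. This is only a sketch: the size `N = 16` (1024 bits, enough for V = 1000), the `insert` helper and the word layout (ra[0] holds the most significant bits) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <stdint.h>

const std::size_t N = 16;  // 16 * 64 = 1024 bits (assumed size)

struct Key {
    uint64_t ra[N];  // ra[0] is treated as the most significant word
    Key() { std::fill(ra, ra + N, uint64_t(0)); }
    // Adds an element to the set; element must be < 64 * N.
    void insert(std::size_t element) {
        ra[N - 1 - element / 64] |= uint64_t(1) << (element % 64);
    }
};

// Word-by-word comparison, most significant word first. This orders keys
// exactly as the (unrepresentable) integers 2^e1 + 2^e2 + ... would be ordered.
bool operator<(const Key& lhs, const Key& rhs) {
    return std::lexicographical_compare(lhs.ra, lhs.ra + N, rhs.ra, rhs.ra + N);
}
```

With this in place, the sets {2,3,4} and {8} from the question compare as expected (the first is less than the second), and `std::map<Key, int>` works without ever computing a number like 2^1000.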