Efficient way of ensuring newness of a set - c++

Given set N = {1,...,n}, consider P different pre-existing subsets of N. A subset, S_p, is characterized by the 0-1 n vector x_p where the ith element is 0 or 1 depending on whether the ith (of n) items is part of the subset or not. Let us call such x_ps indicator vectors.
For e.g., if N={1,2,3,4,5}, subset {1,2,5} is represented by vector (1,0,0,1,1).
Now, given P pre-existing subsets and their associated vectors x_ps.
A candidate subset denoted by vector yis computed.
What is the most efficient way of checking whether y is already part of the set of P pre-existing subsets or whether y is indeed a new subset not part of the P subsets?
The following are the methods I can think of:
(Method 1) Basically, we have to do an element by element check against all pre-existing sets. Pseudocode follows:
for(int p = 0; p < P; p++){
//(check if x_p == y by doing an element by element comparison)
int i;
for(i = 0; i < n; i++){
if(x_pi != y_i){
i = 999999;
}
}
if(i < 999999)
return that y is pre-existing
}
return that y is new
(Method 2) Another thought that comes to mind is to store the decimal equivalent of the indicator vectors x_ps (where the indicator vectors are taken to be binary representations) and compare it with the decimal equivalent of y. That is, if set of P pre-existing sets is: { (0,1,0,0,1), (1,0,1,1,0) }, the stored decimals for this set would be {9, 22}. If y is (0,1,1,0,0), we compute 12 and check this against the set {9, 22}. The benefit of this method is that for each new y, we don't have to check against the n elements of every pre-existing set. We can just compare the decimal numbers.
Question 1. It appears to me that (Method 2) should be more efficient than (Method 1). For (Method 2), is there an efficient way (inbuilt library function in C/C++) that converts the x_ps and y from binary to decimal? What should be data type of these indicator variables? For e.g., bool y[5]; or char y[5];?
Question 2. Is there any method more efficient than (Method 2)?

As you've noticed, there's a trivial isomorphism between your indicator vectors and N-bit integers. That means the answer to your question 2 is "no": the tools available for maintain a set and testing membership in it are the same as integers (hash tables bring the normal approach). A commented mentioned Bloom fillers, which can efficiently test membership at the risk of some false positives, but Bloom filters are generally for much larger data sizes than you're looking at.
As for your question 1: Method 2 is reasonable, and it's even easier than you think. While vector<bool> doesn't give you an easy way to turn it into integer blocks, on implementations I'm aware of it's already implemented this way (the C++ standard allows special treatment of that particular vector type, something that is generally considered nowadays to have been a poor decision, but which occasionally yields some benefit). And those vectors are hashable. So just keep an unordered_set<vector<bool>> around, and you'll get performance which is reasonably close to the optimum. (If you know N at compile time you may want to prefer bitset to vector<bool>.)

Method 2 can be optimized by calculating the decimal equivalent of the given subset and hashing it using modulus 1e9+7. This results in different decimal numbers every time since N<=1000(No collision occurs).
#define M 1000000007 //big prime number
unordered_set<long long> subset; //containing decimal representation of all the
//previous found subsets
/*fast computation of power of 2*/
long long Pow(long long num,long long pow){
long long result=1;
while(pow)
{
if(pow&1)
{
result*=num;
result%=M;
}
num*=num;
num%=M;
pow>>=1;
}
return result;
}
/*checks if subset pre exists*/
bool check(vector<bool> booleanVector){
long long result=0;
for(int i=0;i<booleanVector.size();i++)
if(booleanVector[i])
result+=Pow(2,i);
return (subset.find(result)==subset.end());
}

Related

Efficiently find an integer not in a set of size 40, 400, or 4000

Related to the classic problem find an integer not among four billion given ones but not exactly the same.
To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].
Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?
If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.
I am more in favor of simple, non-random approaches. Is there any?
Thank you!
Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.
You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.
int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}
Random methods can be more space efficient, but may require more passes over the data in the worst case.
Random methods are indeed very efficient here.
If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.
For each i in the range, we just set x[array[i] modulo m] = 1.
Then a simple O(n) search in x will provide a value which is not in array
Note: the modulo operation is not exactly the "%" operation
Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation
You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)
size_t smallest_unused(std::vector<unsigned>& data) {
size_t N = data.size(), scan = 0;
while (scan < N) {
auto other = data[scan];
if (other < scan && data[other] != other) {
data[scan] = data[other];
data[other] = other;
}
else
++scan;
}
for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
return scan;
}
The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.
That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.
To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).
Notes:
I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).
Step 1: Sort the vector.
That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.
Step 2: Find the first int not in the vector.
Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:
Pseudocode:
SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
if (array[i] != INT_MIN + i)
return INT_MIN + i;
The solution would be O(n log(n) + n) meaning: O(n log(n))
Edit: just read your edit asking for something better than O(n log(n)), sorry.
For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.
The space complexity of this approach would be O(1).
If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:
int find_lowest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::min(); ; ++i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:
int find_greatest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::max(); ; --i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).
Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.

summing array of doubles with large value span : proper algorithm

I have an algorithm where I need to sum (a lot of time) double numbers ranging in the e-40 to the e+40.
Array Example (randomly dumped from real application):
-2.06991e-05
7.58132e-06
-3.91367e-06
7.38921e-07
-5.33143e-09
-4.13195e-11
4.01724e-14
6.03221e-17
-4.4202e-20
6.58873
-1.22257
-0.0606178
0.00036508
2.67599e-07
0
-627.061
-59.048
5.92985
0.0885884
0.000276455
-2.02579e-07
It goes without saying the I am aware of the rounding effect this will cause, I am trying to keep it under control : the final result should not have any missing information in the fractional part of the double or, if not avoidable result should be at least n-digit accurate (with n defined). End result needs something like 5 digits plus exponent.
After some decent thinking, I ended up with following algorithm :
Sort the array so that the largest absolute value comes first, closest to zero last.
Add everything in a loop
The idea is that in this case, any cancellation of large values (negatives and positive) will not impact latter smaller values.
In short :
(10e40 - 10e40) + 1 = 1 : result is as expected
(1 + 10e-40) - 10e40 = 0 : not good
I ended up using std::multiset (benchmark on my PC gave 20% higher speed with long double compared to normal doubles - I am fine with doubles resolution) with a custom sort function using std:fabs.
It's still quite slow (it takes 5 seconds to do the whole thing) and I still have this feeling of "you missed something in your algo". Any recommandation :
for speed optimization. Is there a better way to sort the intermediate products ? Sorting a set of 40 intermediate results (typically) takes about 70% of the total execution time.
for missed issues. Is there a chance to still lose critical data (one that should have been in the fractional part of the final result) ?
On a bigger picture, I am implementing real coefficient polynomial classes of pure imaginary variable (electrical impedances : Z(jw)). Z is a big polynom representing a user defined system, with coefficient exponent ranging very far.
The "big" comes from adding things like Zc1 = 1/jC1w to Zc2 = 1/jC2w :
Zc1 + Zc2 = (C1C2(jw)^2 + 0(jw))/(C1+C2)(jw)
In this case, with C1 and C2 in nanofarad (10e-9), C1C2 is already in 10e-18 (and it only started...)
my sort function use a manhattan distance of complex variables (because, mine are either pure real or pure imaginary) :
struct manhattan_complex_distance
{
bool operator() (std::complex<long double> a, std::complex<long double> b)
{
return std::fabs(std::real(a) + std::imag(a)) > std::fabs(std::real(b) + std::imag(b));
}
};
and my multi set in action :
std:complex<long double> get_value(std::vector<std::complex<long double>>& frequency_vector)
{
//frequency_vector is precalculated once for all to have at index n the value (jw)^n.
std::multiset<std::complex<long double>, manhattan_distance> temp_list;
for (int i=0; i<m_coeficients.size(); ++i)
{
// element of : ℝ * ℂ
temp_list.insert(m_coeficients[i] * frequency_vector[i]);
}
std::complex<long double> ret=0;
for (auto i:temp_list)
{
// it is VERY important to start adding the big values before adding the small ones.
// in informatics, 10^60 - 10^60 + 1 = 1; while 1 + 10^60 - 10^60 = 0. Of course you'd expected to get 1, not 0.
ret += i;
}
return ret;
}
The project I have is c++11 enabled (mainly for improvement of the math lib and complex number tools)
ps : I refactored the code to make is easy to read, in reality all complexes and long double names are template : I can change the polynomial type in no time or use the class for regular polynomial of ℝ
As GuyGreer suggested, you can use Kahan summation:
double sum = 0.0;
double c = 0.0;
for (double value : values) {
double y = value - c;
double t = sum + y;
c = (t - sum) - y;
sum = t;
}
EDIT: You should also consider using Horner's method to evaluate the polynomial.
double value = coeffs[degree];
for (auto i = degree; i-- > 0;) {
value *= x;
value += coeffs[i];
}
Sorting the data is on the right track. But you definitely should be summing from smallest magnitude to largest, not from largest to smallest. Summing from largest to smallest, by the time you get to the smallest, aligning the next value with the current sum is liable to cause most or all of the bits of the next value to 'fall off the end'. Summing instead from smallest to largest, the smallest values get a chance to accumulate a decent-sized sum, for which more bits will get into the largest. Combined with Kahan summation, that should yield a fairly accurate sum.
First: have your math keep track of error. Replace your doubles with error-aware types, and when you add or multiply together two doubles it also calculates the maximium error.
This is about the only way you can guarantee that your code produces accurate results while being reasonably fast.
Second, don't use a multiset. The associative containers are not for sorting, they are for maintaining a sorted collection, while being able to incrementally add or remove elements from it efficiently.
The ability to add/remove elements incrementally means it is node-based, and node-based means it is slow in general.
If you simply want a sorted collection, start with a vector then std::sort it.
Next, to minimize error, keep a list of positive and negative elements. Start with zero as your sum. Now pick the smallest of either the positive or negative elements such that the total of your sum and that element is closest to zero.
Do so with elements that calculate their error bounds.
At the end, determine if you have 5 digits of precision, or not.
These error-propogating doubles should be ideally used as early on in the algorithm as possible.

Random pairs of different bits

I have the following problem. I have a number represented in binary representation. I need a way to randomly select two bits of them that are different (i.e. find a 1 and a 0). Besides this I run other operations on that number (reversing sequences, permute sequences,...) These are the approaches I already used:
Keep track of all the ones and the zeros. When I create the binary representation of the binary number I store the places of the 0's and 1's. So that I can choose an index for one list and one index from the other one. I then have two different bits. To run my other operations I created those from an elementary swap operations which updates the indices of the 1's and 0's when manipulating. Therefore I have a third list that stores the list index for each bit. If a bit is 1 I know where to find in the list with all the indices of the ones (same goes for zeros).
The method above yields some overhead when operations are done that do not require the bits to be different. So another way would be to create the lists whenever different bits are needed.
Does anyone have a better idea to do this? I need these operations to be really fast (I am working with popcount, clz, and other binary operations)
I don't feel as though I have enough information to assess the tradeoffs properly, but perhaps you'll find this idea useful. To find a random 1 in a word (find a 1 over multiple words by popcount and reservoir sampling; find a 0 by complementing), first test the popcount. If the popcount is high, then generate indexes uniformly at random and test them until a one is found. If the popcount is medium, then take bitwise ANDs with uniform random masks (but keep the original if the AND is zero) to reduce the popcount. When the popcount is low, use clz to compile the (small) list of candidates efficiently and then sample uniformly at random.
I think the following might be a rather efficient algorithm to do what you are asking. You only iterate over each bit in the number once, and for each element, you have to generate a random number (not exactly sure how costly that is, but I believe there are some optimized CPU instructions for getting random numbers).
Idea is to iterate over all the bits, and with the right probability, update the index to the current index you are visiting.
Generic pseudocode for getting an element from a stream/array:
p = 1
e = null
for s in stream:
with probability 1/p:
replace e with s
p++
return e
Java version:
int[] getIdx(int n){
int oneIdx = 0;
int zeroIdx = 0;
int ones = 1;
int zeros = 1;
// this loop depends on whether you want to select all the prepended zeros
// in a 32/64 bit representation. Alter to your liking...
for(int i = n, j = 0; i > 0; i = i >>> 1, j++){
if((i & 1) == 1){ // current element is 1
if(Math.random() < 1/(float)ones){
oneIdx = j;
}
ones++;
} else{ // element is 0
if(Math.random() < 1/(float)zeros){
zeroIdx = j;
}
zeros++;
}
}
return new int[]{zeroIdx,oneIdx};
}
An optimization you might look into is to do the probability selection using ints instead of floats, might be slightly faster. Here is a short proof I did some time ago regarding that this works: here . I believe the algorithm is attributed to Knuth but can't remember exactly.

How to approximate the count of distinct values in an array in a single pass through it

I have several huge arrays (millions++ members). All those are arrays of numbers and they are not sorted (and I cannot do that). Some are uint8_t, some uint16_t/32/64. I would like to approximate the count of distinct values in these arrays. The conditions are following:
speed is VERY important, I need to do this in one pass through the array and I must go through it sequentially (cannot jump back and forth) (I am doing this in C++, if that's important)
I don't need EXACT counts. What I want to know is that if it is an uint32_t array if there are like 10 or 20 distinct numbers or if there are thousands or millions.
I have quite a bit of memory that I can use, but the less is used the better
the smaller the array data type, the more accurate I need to be
I dont mind STL, but if I can do it without it that would be great (no BOOST though, sorry)
if the approach can be easily parallelized, that would be cool (but its not a mandatory condition)
Examples of perfect output:
ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values
I understand that some of the constraints may seem illogical, but thats the way it is.
As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However if someone has thoughts for that too, feel free to share them!
EDIT: a bit of clarification regarding STL. If you have a solution using it, please post it. Not using STL would be just a bonus for us, we dont fancy it too much. However if it is a good solution, it will be used!
For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.
For larger values, if you are not interested in counts above 100000, std::map is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
I'm pretty sure you can do it by:
Create a Bloom filter
Run through the array inserting each element into the filter (this is a "slow" O(n), since it requires computing several independent decent hashes of each value)
Count how many bits are set in the Bloom Filter
Compute back from the density of the filter to an estimate of the number of distinct values. I don't know the calculation off the top of my head, but any treatment of the theory of Bloom filters goes into this, because it's vital to the probability of the filter giving a false positive on a lookup.
Presumably if you're simultaneously computing the top 10 most frequent values, then if there are less than 10 distinct values you'll know exactly what they are and you don't need an estimate.
I believe the "most frequently used" problem is difficult (well, memory-consuming). Suppose for a moment that you only want the top 1 most frequently used value. Suppose further that you have 10 million entries in the array, and that after the first 9.9 million of them, none of the numbers you've seen so far has appeared more than 100k times. Then any of the values you've seen so far might be the most-frequently used value, since any of them could have a run of 100k values at the end. Even worse, any two of them could have a run of 50k each at the end, in which case the count from the first 9.9 million entries is the tie-breaker between them. So in order to work out in a single pass which is the most frequently used, I think you need to know the exact count of each value that appears in the 9.9 million. You have to prepare for that freak case of a near-tie between two values in the last 0.1 million, because if it happens you aren't allowed to rewind and check the two relevant values again. Eventually you can start culling values -- if there's a value with a count of 5000 and only 4000 entries left to check, then you can cull anything with a count of 1000 or less. But that doesn't help very much.
So I might have missed something, but I think that in the worst case, the "most frequently used" problem requires you to maintain a count for every value you have seen, right up until nearly the end of the array. So you might as well use that collection of counts to work out how many distinct values there are.
One approach that can work, even for big values, is to spread them into lazily allocated buckets.
Suppose that you are working with 32 bits integers, creating an array of 2**32 bits is relatively impractical (2**29 bytes, hum). However, we can probably assume that 2**16 pointers is still reasonable (2**19 bytes: 500kB), so we create 2**16 buckets (null pointers).
The big idea therefore is to take a "sparse" approach to counting, and hope that the integers won't be to dispersed, and thus that many of the buckets pointers will remain null.
typedef std::pair<int32_t, int32_t> Pair;
typedef std::vector<Pair> Bucket;
typedef std::vector<Bucket*> Vector;
struct Comparator {
bool operator()(Pair const& left, Pair const& right) const {
return left.first < right.first;
}
};
void add(Bucket& v, int32_t value) {
Pair const pair(value, 1);
Vector::iterator it = std::lower_bound(v.begin(), v.end(), pair, Compare());
if (it == v.end() or it->first > value) {
v.insert(it, pair);
return;
}
it->second += 1;
}
void gather(Vector& v, int32_t const* begin, int32_t const* end) {
for (; begin != end; ++begin) {
uint16_t const index = *begin >> 16;
Bucket*& bucket = v[index];
if (bucket == 0) { bucket = new Bucket(); }
add(*bucket, *begin);
}
}
Once you have gathered your data, then you can count the number of different values or find the top or bottom pretty easily.
A few notes:
the number of buckets is completely customizable (thus letting you control the amount of original memory)
the strategy of repartition is customizable as well (this is just a cheap hash table I have made here)
it is possible to monitor the number of allocated buckets and abandon, or switch gear, if it starts blowing up.
if each value is different, then it just won't work, but when you realize it, you will already have collected many counts, so you'll at least be able to give a lower bound of the number of different values, and a you'll also have a starting point for the top/bottom.
If you manage to gather those statistics, then the work is cut out for you.
For 8 and 16 bit it's pretty obvious, you can track every possibility every iteration.
When you get to 32 and 64 bit integers, you don't really have the memory to track every possibility.
Here's a few natural suggestions that are likely outside the bounds of your constraints.
I don't really understand why you can't sort the array. RadixSort is O(n) and once sorted it would be one more pass to get accurate distinctiveness and top X information. In reality it would be 6 passes all together for 32bit if you used a 1 byte radix (1 pass for counting + 1 * 4 passes for each byte + 1 pass for getting values).
In the same cusp as above, why not just use SQL. You could create a stored procedure that takes the array in as a table valued parameter and return the number of distinct values and the top x values in one go. This stored procedure could also be called in parallel.
-- number of distinct
SELECT COUNT(DISTINCT(n)) FROM #tmp
-- top x
SELECT TOP 10 n, COUNT(n) FROM #tmp GROUP BY n ORDER BY COUNT(n) DESC
I've just thought of an interesting solution. It's based on law of boolean algebra called Idempotence of Multiplication, which states that:
X * X = X
From it, and using the commutative property of boolean multiplication, we can deduce that:
X * Y * X = X * X * Y = X * Y
Now, you see where I'm going to? This is how the algorithm would work (I'm terrible with pseudo-code):
make c = element1 & element2 (binary AND between the binary representation of the integers)
for i=3 until i == size_of_array
make b = c & element[i];
if b != c then diferent_values++;
c=b;
In first iteration, we make (element1*element2) * element3. We could represent it as:
(X * Y) * Z
If Z (element3) is equal to X (element1), then:
(X * Y) * Z = X * Y * X = X * Y
And if Z is equal to Y (element2), then:
(X * Y) * Z = X * Y * Y = X * Y
So, if Z isn´t different to X or Y, then X * Y won't change when we multiply it for Z
This remains valid for big expressions, like:
(X * A * Z * G * T * P * S) * S = X * A * Z * G * T * P * S
If we receive a value which is factor of our big multiplicand (that means that it has been already computed) then the big multiplicand won't change when we multiply it to the recieved input, so there's no new distinct value.
So that's how it will go. Each time that a different value is computed then the multiplication of our big multiplicand and that distinct value, will be different to the big operand. So, with b = c & element[i], if b!= c we just increment out distinct values counter.
I guess I'm no being clear enough. If that's the case, please let me know.

Given an array of integers, find the first integer that is unique

Given an array of integers, find the first integer that is unique.
my solution: use std::map
put integer (number as key, its index as value) to it one by one (O(n^2 lgn)), if have duplicate, remove the entry from the map (O(lg n)), after putting all numbers into the map, iterate the map and find the key with smallest index O(n).
O(n^2 lgn) because map needs to do sorting.
It is not efficient.
other better solutions?
I believe that the following would be the optimal solution, at least based on time / space complexity:
Step 1:
Store the integers in a hash map, which holds the integer as a key and the count of the number of times it appears as the value. This is generally an O(n) operation and the insertion / updating of elements in the hash table should be constant time, on the average. If an integer is found to appear more than twice, you really don't have to increment the usage count further (if you don't want to).
Step 2:
Perform a second pass over the integers. Look each up in the hash map and the first one with an appearance count of one is the one you were looking for (i.e., the first single appearing integer). This is also O(n), making the entire process O(n).
Some possible optimizations for special cases:
Optimization A: It may be possible to use a simple array instead of a hash table. This guarantees O(1) even in the worst case for counting the number of occurrences of a particular integer as well as the lookup of its appearance count. Also, this enhances real time performance, since the hash algorithm does not need to be executed. There may be a hit due to potentially poorer locality of reference (i.e., a larger sparse table vs. the hash table implementation with a reasonable load factor). However, this would be for very special cases of integer orderings and may be mitigated by the hash table's hash function producing pseudorandom bucket placements based on the incoming integers (i.e., poor locality of reference to begin with).
Each byte in the array would represent the count (up to 255) for the integer represented by the index of that byte. This would only be possible if the difference between the lowest integer and the highest (i.e., the cardinality of the domain of valid integers) was small enough such that this array would fit into memory. The index in the array of a particular integer would be its value minus the smallest integer present in the data set.
For example on modern hardware with a 64-bit OS, it is quite conceivable that a 4GB array can be allocated which can handle the entire domain of 32-bit integers. Even larger arrays are conceivable with sufficient memory.
The smallest and largest integers would have to be known before processing, or another linear pass through the data using the minmax algorithm to find out this information would be required.
Optimization B: You could optimize Optimization A further, by using at most 2 bits per integer (One bit indicates presence and the other indicates multiplicity). This would allow for the representation of four integers per byte, extending the array implementation to handle a larger domain of integers for a given amount of available memory. More bit games could be played here to compress the representation further, but they would only support special cases of data coming in and therefore cannot be recommended for the still mostly general case.
All this for no reason. Just using 2 for-loops & a variable would give you a simple O(n^2) algo.
If you are taking all the trouble of using a hash map, then it might as well be what #Micheal Goldshteyn suggests
UPDATE: I know this question is 1 year old. But was looking through the questions I answered and came across this. Thought there is a better solution than using a hashtable.
When we say unique, we will have a pattern. Eg: [5, 5, 66, 66, 7, 1, 1, 77]. In this lets have moving window of 3. first consider (5,5,66). we can easily estab. that there is duplicate here. So move the window by 1 element so we get (5,66,66). Same here. move to next (66,66,7). Again dups here. next (66,7,1). No dups here! take the middle element as this has to be the first unique in the set. The left element belongs to the dup so could 1. Hence 7 is the first unique element.
space: O(1)
time: O(n) * O(m^2) = O(n) * 9 ≈ O(n)
Inserting to a map is O(log n) not O(n log n) so inserting n keys will be n log n. also its better to use set.
Although it's O(n^2), the following has small coefficients, isn't too bad on the cache, and uses memmem() which is fast.
for(int x=0;x<len-1;x++)
if(memmem(&array[x+1], sizeof(int)*(len-(x+1)), array[x], sizeof(int))==NULL &&
memmem(&array[x+1], sizeof(int)*(x-1), array[x], sizeof(int))==NULL)
return array[x];
public static string firstUnique(int[] input)
{
int size = input.Length;
bool[] dupIndex = new bool[size];
for (int i = 0; i < size; ++i)
{
if (dupIndex[i])
{
continue;
}
else if (i == size - 1)
{
return input[i].ToString();
}
for (int j = i + 1; j < size; ++j)
{
if (input[i]==input[j])
{
dupIndex[j] = true;
break;
}
else if (j == size - 1)
{
return input[i].ToString();
}
}
}
return "No unique element";
}
#user3612419
Solution given you is good with some what close to O(N*N2) but further optimization in same code is possible I just added two-3 lines that you missed.
public static string firstUnique(int[] input)
{
int size = input.Length;
bool[] dupIndex = new bool[size];
for (int i = 0; i < size; ++i)
{
if (dupIndex[i])
{
continue;
}
else if (i == size - 1)
{
return input[i].ToString();
}
for (int j = i + 1; j < size; ++j)
{
if(dupIndex[j]==true)
{
continue;
}
if (input[i]==input[j])
{
dupIndex[j] = true;
dupIndex[i] = true;
break;
}
else if (j == size - 1)
{
return input[i].ToString();
}
}
}
return "No unique element";
}