Finding all values that occur an odd number of times in a huge list of positive integers

I came across this question from a colleague.
Q: Given a huge list (say some thousands) of positive integers, with many values repeating, how do you find the values that occur an odd number of times?
Like 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1...
Here,
1 occurs 8 times
2 occurs 7 times (must be listed in output)
3 occurs 6 times
4 occurs 5 times (must be listed in output)
& so on... (the above values are only for explaining the problem; in reality the list could contain any positive numbers in any order).
Originally we were looking at deriving the logic in C.
I suggested the following:
Use a hash table, with the values from the list as indices/keys into the table, and increment the count at the corresponding entry each time a value is encountered while walking the list. However, how do you decide on the size of the hash table? I couldn't say for sure, though it might require a hash table as big as the list.
Once the list has been walked and the hash table populated (with the count of occurrences for each value/index), is walking through the table the only way to find/list the values occurring an odd number of times?
This might not be the best solution for this scenario.
Can you please suggest any other, more efficient way of doing it?
I searched SO, but the existing questions and answers deal with finding a single value occurring an odd number of times, not all of them as described here.
The practical relevance of the question is unknown, but it was apparently asked in his interview...
Please suggest.
Thank You,

If the values to be counted are bounded by even a moderately reasonable limit then you can just create an array of counters, and use the values to be counted as the array indices. You don't need a tight bound, and "reasonable" is somewhat a matter of platform. I would not hesitate to take this approach for a bound (and therefore array size) sufficient for all uint16_t values, and that's not a hard limit:
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define UPPER_BOUND 65536

uint64_t count[UPPER_BOUND];

void count_values(size_t num_values, uint16_t values[num_values]) {
    size_t i;

    memset(count, 0, sizeof(count));
    for (i = 0; i < num_values; i += 1) {
        count[values[i]] += 1;
    }
}
Since you only need to track even vs. odd counts, though, you really only need one bit per distinct value in the input. Squeezing it that far is a bit extreme, but this isn't so bad:
#define UPPER_BOUND 65536

uint8_t odd[UPPER_BOUND];

void count_values(size_t num_values, uint16_t values[num_values]) {
    size_t i;

    memset(odd, 0, sizeof(odd));
    for (i = 0; i < num_values; i += 1) {
        odd[values[i]] ^= 1;
    }
}
At the end, odd[i] contains 1 if the value i appeared an odd number of times, and it contains 0 if i appeared an even number of times.
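Reading out the answer is then a single pass over the flags; a minimal usage sketch (printing is just one way to consume the result):

#include <stdio.h>

void print_odd_values(void) {
    size_t v;
    for (v = 0; v < UPPER_BOUND; v += 1) {
        if (odd[v]) {
            printf("%zu occurs an odd number of times\n", v);
        }
    }
}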
On the other hand, if the values to be counted are so widely distributed that an array would require too much memory, then the hash table approach seems reasonable. In that case, however, you are asking the wrong question. Rather than
how do you decide on the size of the hash table?
you should be asking something along the lines of "what hash table implementation doesn't require me to manage the table size manually?" There are several. Personally, I have used UTHash successfully, though as of recently it is no longer maintained.
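For illustration, the counting loop with UTHash might look something like the sketch below (error handling omitted; the struct layout is my own, but HASH_FIND_INT / HASH_ADD_INT are the library's real macros):

#include <stdlib.h>
#include "uthash.h"

struct entry {
    int value;             /* key */
    long count;
    UT_hash_handle hh;     /* makes this structure hashable */
};

struct entry *table = NULL;

void count_value(int v) {
    struct entry *e;
    HASH_FIND_INT(table, &v, e);
    if (e == NULL) {
        e = malloc(sizeof(*e));
        e->value = v;
        e->count = 0;
        HASH_ADD_INT(table, value, e);
    }
    e->count += 1;
}

The table grows as needed, so there is no size to decide up front; at the end you iterate with for (e = table; e != NULL; e = e->hh.next) and test e->count % 2.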
You could also use a linked list maintained in order, or a search tree. No doubt there are other viable choices.
You also asked
Once the list has been walked and the hash table populated (with the count of occurrences for each value/index), is walking through the table the only way to find/list the values occurring an odd number of times?
If you perform the analysis via the general approach we have discussed so far then yes, the only way to read out the result is to iterate through the counts. I can imagine alternative, more complicated, approaches wherein you switch numbers between lists of those having even counts and those having odd counts, but I'm having trouble seeing how whatever efficiency you might gain in readout could fail to be swamped by the efficiency loss at the counting stage.

In your specific case, you can walk the list and toggle the value's existence in a set. The resulting set will contain all of the values that appeared an odd number of times. However, this only works for that specific predicate, and the more generic count-then-filter algorithm you describe will be required if you wanted, say, all of the entries that appear an even number of times.
Both algorithms should be O(N) time and worst-case O(N) space, and the constants will probably be lower for the set-based algorithm, but you'll need to benchmark it against your data. In practice, I'd run with the more generic algorithm unless there was a clear performance problem.
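A minimal sketch of the toggle idea in C, assuming a fixed-capacity open-addressing table (the capacity and hash multiplier are illustrative; a production version would need resizing and error handling):

#include <stddef.h>

#define TABLE_SIZE (1 << 20)   /* illustrative; must exceed the number of distinct values */

static unsigned table_keys[TABLE_SIZE];   /* 0 marks an empty slot (values are positive) */
static unsigned char in_set[TABLE_SIZE];  /* flipped on every occurrence */

static void toggle(unsigned value) {
    size_t h = (size_t)(value * 2654435761u) % TABLE_SIZE;
    while (table_keys[h] != 0 && table_keys[h] != value)
        h = (h + 1) % TABLE_SIZE;          /* linear probing */
    table_keys[h] = value;
    in_set[h] ^= 1;                        /* in the set <=> seen an odd number of times */
}

After toggling every element of the list, the slots with in_set[h] == 1 hold exactly the values that appeared an odd number of times.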

Related

Can't figure out how to program in outputs for the given situations

So, I'm working on an assignment for my intro to computer science class. The assignment is as follows.
There is an organism whose population can be determined according to
the following rules:
The organism requires at least one other organism to propagate. Thus,
if the population goes to 1, then the organism will become extinct in
one time cycle (e.g. one breeding season). In an unusual turn of
events, an even number of organisms is not a good thing. The
organisms will form pairs and from each pair, only one organism will
survive. If there is an odd number of organisms and this number is
greater than 1 (e.g., 3,5,7,9,…), then this is good for population
growth. The organisms cannot pair up and in one time cycle, each
organism will produce 2 other organisms. In addition, one other
organism will be created. (As an example, let us say there are 3
organisms. Since 3 is an odd number greater than 1, in one time
cycle, each of the 3 organisms will produce 2 others. This yields 6
additional organisms. Furthermore, there is one more organism
produced so the total will be 10 organisms, 3 originals, 6 produced by
the 3, and then 1 more.)
A: Write a program that tests initial populations from 1 to 100,000.
Find all populations that do not eventually become extinct.
Write your answer here:
B: Find the value of the initial population that eventually goes
extinct but that has the largest number of time cycles before it does.
Write your answer here:
The general idea of what I have so far (lacking syntax) is this, with P representing the population:
int generations = 0;
{
    if (P is odd)  // I'll use the modulus operator: if P % 2 is not 0, then it's odd
        P = 3*P + 1;
    else
        P = P / 2;
    generations = generations + 1;
}
The problem for me is that I'm uncertain how to tell what numbers will not go extinct or how to figure out which population takes the longest time to go extinct. Any suggestions would be helpful.
Basically what you want to do is this: wrap your code into a while-loop that exits if either P==1 or generations > someMaxValue.
Wrap this construct into a for-loop that counts from 1 to 100,000 and uses this count to set the initial P.
If you always store the generations after your while-loop (e.g. into an array) you can then search for the greatest element in the array.
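A minimal sketch of that structure (the generation cap is an arbitrary safety net, since we cannot know in advance that every starting population dies out):

#include <stdio.h>

#define MAX_GENERATIONS 100000   /* illustrative cap for "assume it never goes extinct" */

int main(void) {
    int bestStart = 0, bestGenerations = -1;
    for (int start = 1; start <= 100000; start++) {
        unsigned long long P = start;   /* 64-bit: intermediate sizes overflow 32 bits */
        int generations = 0;
        while (P > 1 && generations <= MAX_GENERATIONS) {
            if (P % 2 == 1)
                P = 3 * P + 1;   /* odd, > 1: each produces 2 more, plus 1 extra */
            else
                P = P / 2;       /* even: one survivor per pair */
            generations++;
        }
        /* P == 1 means one more cycle takes the population to 0 */
        if (P == 1 && generations + 1 > bestGenerations) {
            bestGenerations = generations + 1;
            bestStart = start;
        }
    }
    printf("longest survivor: initial population %d (%d cycles)\n",
           bestStart, bestGenerations);
    return 0;
}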
This problem can actually be harder than it looks at first sight. First, you should use memoization to speed things up - for example, with 3 you get 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 -> 0, so you now know the answer for all of those numbers as well (note that every power of 2 goes extinct).
But as pointed out by @Jerry, the problem is with the populations which never go extinct - it is difficult to say when to stop simulating them. The only sure sign is a recurrence (a population size you have already passed once while examining the current starting population); then you can say for certain that the organisms will never go extinct.
Edit: I hacked together a solution quickly and, if it is correct, you are lucky - every initial population between 1 and 100,000 seems to eventually go extinct (my program terminated, so I didn't actually need to check for recurrences). I won't give you the solution for now, so that you can try it yourself and learn, but according to my program the largest number of cycles is 351 (and the initial number is close to 3/4 of the range). According to a Google search for the Collatz conjecture, that number is correct (they report 350 cycles to reach a population of 1, where I'm adding one extra cycle to reach 0), and the initial population number agrees as well.
One additional hint: check for integer overflow, and use a 64-bit integer (unsigned __int64, unsigned long long) to calculate the population growth; with a 32-bit unsigned int there is already an overflow in the range 1-100,000 (the population can grow much higher intermediately). That was a problem in my initial solution, although it did not change the result. With 64-bit ints I was able to calculate up to 100,000,000 in relatively decent time (didn't try more; optimized release MSVC build); for that I had to limit the memo table to the first 80,000,000 items to not run out of memory (compiled 32-bit with LARGEADDRESSAWARE to be able to use up to 4 GB of memory - when compiled 64-bit the table could of course be larger).
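A sketch of the memoization idea (the table size is illustrative, and the recursion assumes every chain terminates, which the experiment above suggests holds in this range):

#include <stdint.h>

#define MEMO_SIZE 1000000   /* illustrative; cache only populations below this */

static int memo[MEMO_SIZE];   /* 0 = not yet computed */

static int cycles_to_extinction(uint64_t P) {
    if (P == 1)
        return 1;                     /* one final cycle takes 1 down to 0 */
    if (P < MEMO_SIZE && memo[P] != 0)
        return memo[P];
    uint64_t next = (P % 2 == 1) ? 3 * P + 1 : P / 2;   /* 64-bit avoids overflow */
    int c = 1 + cycles_to_extinction(next);
    if (P < MEMO_SIZE)
        memo[P] = c;
    return c;
}

The recursion depth stays small here (around 350 for this range), so the stack is not a concern.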

Permutations of English Alphabet of a given length

So I have this code. I'm not sure if it works because the program is still running.
void permute(std::vector<std::string>& wordsVector, std::string prefix, int length, std::string alphabet) {
    if (length == 0) {
        // end the recursion
        wordsVector.push_back(prefix);
    }
    else {
        for (int i = 0; i < alphabet.length(); ++i) {
            permute(wordsVector, prefix + alphabet.at(i), length - 1, alphabet);
        }
    }
}
where I'm trying to get all combinations of characters in the English alphabet of a given length. I'm not sure if the approach is correct at the moment.
The alphabet consists of A-Z in a string of length 26. wordsVector holds all the different combinations of words, prefix is passed through recursively until a word is made, and length is self-explanatory.
Example, if I give the length of 7 to the function, I expect a size of 26 x 25 x 24 x 23 x 22 x 21 x 20 = 3315312000 if I'm correct, following the formula for permutations.
I don't think programs are meant to run this long so either I'm hitting an infinite loop or something is wrong with my approach. Please advise. Thanks.
Surely you would run out of memory storing all those strings, but concentrating on your question: even if you write an iterative program, it will take a very long time (not an infinite loop, just very long).
[26L, 650L, 15600L, 358800L, 7893600L, 165765600L, 3315312000L, 62990928000L, 1133836704000L, 19275223968000L, 308403583488000L, 4626053752320000L, 64764752532480000L, 841941782922240000L, 10103301395066880000L, 111136315345735680000L, 1111363153457356800000L, 10002268381116211200000L, 80018147048929689600000L, 560127029342507827200000L, 3360762176055046963200000L, 16803810880275234816000000L, 67215243521100939264000000L, 201645730563302817792000000L, 403291461126605635584000000L, 403291461126605635584000000L]
The above list gives the number of possibilities for 1<=n<=26. You can see that as n increases, the number of possibilities grows tremendously. Say you have a 1 GHz processor that does 10^9 operations per second, and consider the count for n=26: 403291461126605635584000000. At even one operation per possibility, that is about 4x10^17 seconds, on the order of 10^10 years, so if you sit down to list all the possibilities you will feel it has hit an infinite loop. Finally, I have not looked that closely into your code, but in a nutshell: even if you write it correctly and iteratively, and don't store the results (you can't have that much memory) but just print them, it is going to take a very long time for larger values of n.
EDIT
As jaromax and others said, if you just want to write it for smaller values of n,
say less than 10-12, you can write an iterative program to list/print them. It will run quite fast for small values. But if you also want to store them, then n will have to be, say, less than 5. (It really depends on how much RAM is available; alternatively you could generate some permutations and write them to disk, in which case it depends on how much disk space you can spare. Again, refer to the list of counts I posted above - it gives a rough idea of both the time and the space complexity.)
I think there could be quite a problem in that you do this on the stack. A large part of the calculation is done recursively, which means space is allocated for every function call.
Try to reformulate it linearly. I think I had such a problem before.
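For what it's worth, a sketch of an iterative reformulation: treat the word as a base-26 odometer. Note that, like the recursive code above, this enumerates words with repetition (26^length of them):

#include <stdio.h>

/* Prints all words of the given length over A-Z, with repetition. */
void print_words(int length) {
    int digits[16] = {0};                /* assumes length <= 16 */
    char word[17];
    for (;;) {
        for (int i = 0; i < length; i++)
            word[i] = 'A' + digits[i];
        word[length] = '\0';
        puts(word);
        /* increment the base-26 odometer, rightmost digit first */
        int pos = length - 1;
        while (pos >= 0 && ++digits[pos] == 26) {
            digits[pos] = 0;
            pos--;
        }
        if (pos < 0)
            break;                       /* wrapped around: all words printed */
    }
}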
Your question implies you think there are 26x25x24x ... permutations
Your code doesn't have anything I can see to avoid "AAAAAAA" being a permutation, in which case there are 26x26x26x ...
So in addition to being a very complicated way of counting in base 26, I think it's also giving bad answers?

Simpler solution than iterating through all combinations

Designing a program given the following:
An initial array of unknown length, containing only 2's and 6's
Initial Array {2,2,2,2,2,2,6,6,6,6,6,6}
The goal is to find the fewest number of arrays, each summing to at most 16, that together use up the initial array
Wrong answer would be
4 arrays:
Array 1 {2,2,2,2,2,2}
Array 2 {6,6}
Array 3 {6,6}
Array 4 {6,6}
The correct answer would be
3 arrays:
Array 1 {6,6,2,2}
Array 2 {6,6,2,2}
Array 3 {6,6,2,2}
My solution is to step through each and every possible partition, keeping track of the number of arrays needed. If a previous solution uses the same number of arrays or fewer than the current one, throw the current one out. This seems quite intensive, not least because of the likelihood of comparing equivalent arrangements such as (6,6) and (6,6)…
I was reading up on how to iterate through different combinations. Most of the articles related to poker, and I'm sure I could draw some similarities to blackjack (21).
I was hoping a “shortcut” might work here due to:
Possible entries are only 2’s & 6’s
Sums up to less than 16…Can’t go over
Final remark: I would love any info on how to proceed…logic..go read this..etc
Thanks,
Josh
A few notes on your problem before making any assumptions.
First, there are not many ways to arrange 6's and 2's to get a sum of 16. As a matter of fact, there are only 3:
2*6+2*2 = 6+6+2+2
1*6+5*2 = 6+2+2+2+2+2
0*6+8*2 = 2+2+2+2+2+2+2+2
Second, you can sum up 2's to get a 6: 2+2+2 = 6
Third (which follows from the first), you cannot have more than two 6's in an array.
Now, I think that the solution is pretty simple. However, I have no proof, so do not take my word for granted.
From the second and third points, I assume you had better get rid of the 6's first. So you can pack them up by pairs in arrays like so: {6, 6, 2, 2}. If you do not have enough 2's to fill in, just do not fill in.
If you have an odd number of 6's, then the last one will appear in an array like so: {6, 2, 2, 2, 2, 2}.
Then you just need to fill arrays with 2's if there are any left: {2, 2, 2, 2, 2, 2, 2, 2}.
Despite the fact that I have no proof (probably because I am lazy), it appears quite obvious to me that this is a solution, and it doesn't involve any kind of complicated combinatorics.
Would you mind testing it against the solution you describe?
Edit :
Below are a correction of my comment and a proof that my solution is, indeed, a good solution.
Let us call n2 and n6 the number of 2's and 6's in the initial array, and sum the sum of elements in the initial array (6+6+...+2+2+...).
The number of arrays necessary is not ceil((3*n6 + n2)/8) = ceil(sum/16). It fails if you have, for example, nine 6's and no 2's: it gives 4 instead of 5. This is because you cannot entirely fill an array with 6's (the sum would be 12, not 16), but you can with 2's. The correct formula is then max(ceil(sum/16), ceil(n6/2)).
Now for the whole solution. Let us keep the notations n2, n6 and sum.
Consider your arrays as 16m long boxes (replace m with any linear unit you want), 2's as 2m long blocks and 6's as 6m long blocks. What you want, in order to arrange all your blocks within the fewest boxes possible, is to fit into each array the maximum length (not number) of blocks you can. This is pretty obvious: you need to fit sum meters into the fewest boxes of 16m.
So, the way to solve the problem is to fill up the boxes to the maximum. With 2's, it is easy: you can put eight 2m long blocks in one box. With 6's, you can only put two 6m long blocks in a box, without filling it up.
If you only have 6's, just put the maximum in every box (two in each), and you get your arrays. The number of arrays in that case is ceil(n6/2).
If you only have 2's, do the same (eight in each box). The number of arrays in this case is ceil(n2/8).
If you have both, you can pack two 6's and two 2's to fill up a box, so just do that! We already know it is the best solution to fit the maximum length of blocks in each array, and you cannot do better than 16m = 2*6m + 2*2m.
When you run out of one kind of block, here is what to do.
If you run out of 6's, you can fill up the rest of the boxes with 2's: you will reach the maximum of 16m for every following box until you run out of 2's, at which point you have floor(sum/16) boxes full, and maybe one partially filled. So the number of arrays in this case is ceil(sum/16).
If you run out of 2's, then discard every 2m long block you just put in boxes. What you have left is boxes with pairs of 6m long blocks, as if you didn't have any 2's to start with. Then you have your solution: put two 6m long blocks in every box. You can put the 2's back in the gaps left, because you know they will not overflow. The number of arrays in this case is also ceil(n6/2).
Now, how do we know the number of arrays necessary in every case? Well, we have 3 possible formulas and need to take the maximum: ceil(n6/2), ceil(n2/8) and ceil(sum/16). But we know that ceil(n2/8) cannot be larger than ceil(sum/16), since sum = 6*n6 + 2*n2, so sum/16 = (3/8)*n6 + n2/8, and (3/8)*n6 is not negative.
However, ceil(n6/2) might be larger than ceil(sum/16).
Example: n6 = 9, n2 = 0. Then ceil(n6/2) = 5 but ceil(sum/16) = 4.
Finally, the correct formula is max(ceil(sum/16), ceil(n6/2)).
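A quick sketch of the closing formula (integer ceilings written as (x + y - 1) / y), checked against the two examples above:

#include <stdio.h>

/* number of arrays needed, given the counts of 2's (n2) and 6's (n6) */
int arrays_needed(int n2, int n6) {
    int sum = 2 * n2 + 6 * n6;
    int by_sum   = (sum + 15) / 16;   /* ceil(sum/16) */
    int by_sixes = (n6 + 1) / 2;      /* ceil(n6/2): at most two 6's per array */
    return by_sum > by_sixes ? by_sum : by_sixes;
}

int main(void) {
    printf("%d\n", arrays_needed(6, 6));  /* the question's example: 3 */
    printf("%d\n", arrays_needed(0, 9));  /* nine 6's, no 2's: 5 */
    return 0;
}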
Since your objective deals with only 2s and 6s, you can use a greedy algorithm (since 6 is at least twice 2). That means you just take the largest number from your source numbers that still fits under 16 and add it to the current array. When no new number will fit, start a new array.
From that approach it seems clear that you can simplify this down to a formula in terms of the number of 2s and 6s in the source array: In other words you do one linear count of 2s and 6s, and then use a constant time formula to determine the number of arrays.

Algo: find max XOR in an array for various interval limits, given N inputs, and p,q where 0<=p<=i<=q<=N

the problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, … xn as key. To implement this algorithm efficiently, Xorq needs to find maximum value for (a xor xj) for given integers a,p and q such that p<=j<=q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describe a query which consists of three integers ai,pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
#include <iostream>
using namespace std;

int xArray[100000];

int main() {
    int t, n, q;
    cin >> t;
    for (int j = 0; j < t; j++) {
        cin >> n >> q;
        int i, a, pi, qi;
        for (i = 0; i < n; i++) {
            cin >> xArray[i];
        }
        for (i = 0; i < q; i++) {
            cin >> a >> pi >> qi;
            int max = 0;
            for (int it = pi - 1; it < qi; it++) {
                int x = xArray[it] ^ a;   // renamed from t, which shadowed the test count
                if (x > max)
                    max = x;
            }
            cout << max << "\n";
        }
    }
    return 0;
}
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted).
The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?
XOR flips bits. Since the values here are below 2^15, the maximum possible XOR result is 0b111111111111111 (all 15 bits set).
To get the best result:
if 'a' has a 1 in the ith place, then you want to XOR it with a key whose ith bit is 0
if 'a' has a 0 in the ith place, then you want to XOR it with a key whose ith bit is 1
Simply put, for bit B you need !B.
Another obvious thing is that higher-order bits are more important than lower-order bits.
That is:
if 'a' has B in the highest place and you have found a key with highest bit = !B,
then ALL keys that have highest bit = B are worse than that one.
On average, this cuts the number of candidate keys in half at each step.
How about building a big binary tree from all the keys, ordering them in the tree by their bits, from MSB to LSB? Then, walking the bits of A from MSB to LSB would tell you which left/right branch to take next to get the best result. Of course, that ignores the PI/QI limits, but it would surely give you the best overall result, since you always pick the best available bit at the i-th level.
Now, if you annotate the tree nodes with the low/high index ranges of their subelements (done only once, when building the tree), then later, when querying a case A-PI-QI, you can use that to filter out branches that do not fall within the index range.
The point is that if you order the tree levels in MSB->LSB bit order, then the decision made at the "upper nodes" guarantees that you are currently in the best possible branch, and that holds even if all the subbranches turn out to be the worst:
Being at level 3, the result of
0b111?????
can be then expanded into
0b11100000
0b11100001
0b11100010
and so on, but even if the ????? are expanded poorly, the overall result is still greater than
0b11011111
which would be the best possible result had you picked the other branch at the 3rd level.
I have absolutely no idea how long preparing the tree would take, but querying it for an A-PI-QI over 32 bits seems to be something like 32 comparisons and jumps, certainly faster than iterating over up to 100,000 elements and xor/maxing. And since you have up to 50,000 queries, building such a tree can actually be a good investment, since it is built once per key set.
Now, the best part is that you don't actually need the whole tree. You may build it from, say, only the first two, four or eight bits, and use the index ranges in the nodes to limit your xor-max loop to a smaller part of the array. At worst, you'd end up with the same range as Pi-Qi. At best, it'd be down to one element.
But, looking at the maximum of N keys, I think the whole tree might actually fit in the memory pool, and you may get away without any xor-maxing loop; see the sketch below.
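For concreteness, here is a sketch (in C) of one common way to make the index filtering exact, which goes a bit beyond the min/max annotation above: each node stores the sorted list of indices beneath it, and a binary search tells whether any of them falls inside [p, q]. The names are illustrative; indices are 0-based, so query with pi-1 and qi-1:

#include <stdlib.h>

#define BITS 15   /* values are below 2^15 */

struct node {
    struct node *child[2];
    int *idx;          /* indices beneath this node, in increasing order */
    int count, cap;
};

static struct node *new_node(void) {
    return calloc(1, sizeof(struct node));
}

static void push_index(struct node *n, int i) {
    if (n->count == n->cap) {
        n->cap = n->cap ? 2 * n->cap : 4;
        n->idx = realloc(n->idx, n->cap * sizeof(int));
    }
    n->idx[n->count++] = i;   /* stays sorted: insertions arrive in index order */
}

/* insert key x[i]; call once per i, in increasing order of i */
static void trie_insert(struct node *root, int key, int i) {
    struct node *n = root;
    for (int b = BITS - 1; b >= 0; b--) {
        int bit = (key >> b) & 1;
        if (!n->child[bit])
            n->child[bit] = new_node();
        n = n->child[bit];
        push_index(n, i);
    }
}

/* does this subtree contain an index in [p, q]? */
static int has_index_in(struct node *n, int p, int q) {
    if (!n)
        return 0;
    int lo = 0, hi = n->count;
    while (lo < hi) {                     /* binary search: first index >= p */
        int mid = (lo + hi) / 2;
        if (n->idx[mid] < p) lo = mid + 1; else hi = mid;
    }
    return lo < n->count && n->idx[lo] <= q;
}

/* maximum of (a xor x[j]) over p <= j <= q (0-based, inclusive) */
static int max_xor(struct node *root, int a, int p, int q) {
    struct node *n = root;
    int best = 0;
    for (int b = BITS - 1; b >= 0; b--) {
        int want = 1 - ((a >> b) & 1);    /* prefer the opposite of a's bit */
        if (has_index_in(n->child[want], p, q)) {
            best |= 1 << b;               /* this result bit can be 1 */
            n = n->child[want];
        } else {
            n = n->child[1 - want];
        }
    }
    return best;
}

Building the trie is O(N * 15) inserts, and each query then costs O(15 log N) for the binary searches, instead of O(N) per query for the brute-force loop.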
I've spent some time googling this problem, and it seems you can find it in the context of various programming competitions. While the brute-force approach is intuitive, it does not really solve the challenge, as it is too slow.
There are a few constraints in the problem which you need to exploit in order to write a faster algorithm:
the input consists of at most 100k numbers, but there are only 32768 (2^15) possible values
for each input array there are Q (at most 50k) test cases; each test case consists of three values: a, pi and qi. Since 0<=a<2^15 and there are up to 50k cases, there is a good chance the same value of a will come up more than once.
I've found 2 ideas for solving the problem: splitting the input into sqrt(N) intervals, and building a segment tree (a nice explanation of these approaches can be found here)
The biggest problem is the fact that each test case can have a different value of a, which makes previous results useless, since max(a^x[i]) has to be recomputed for each new a. However, when Q is large enough and a value of a repeats, reusing previous results becomes possible.
I will come back with the actual results once I finish implementing both methods

Find the numbers missing

If we have an array of N numbers, each between 1 and N (N < 10), what is the best way to find all the numbers that are missing?
Example:
N = 5
1 5 3 2 3
Output: 1 5 4 2 3
In the example, the number 4 was the missing one and there were two 3s, so we replaced the first 3 with 4, and now the array is complete: all the numbers up to 5 are there.
Is there any simple algorithm that can do this?
Since N is really small, you can use F[i] = k if number i appears k times.
int F[10] = {0};   // counts, initialized to 0
for (int i = 0; i < N; ++i)
    ++F[numbers[i]];
Now, to replace the duplicates, traverse your number array and if the current number appears more than once, decrement its count and replace it with a number that appears 0 times and increment that number's count. You can keep this O(N) if you keep a list of numbers that don't appear at all. I'll let you figure out what exactly needs to be done, as this sounds like homework.
Assume all numbers within the range 1 ≤ x ≤ N.
Keep 2 arrays of size N: output and used (as an associative array). Initialize them all to 0.
Scan from the right, filling values into output unless they have already been used (marking them used as you go).
Check for unused values, and put them into the empty (zero) slots of output in order.
O(N) time complexity, O(N) space complexity.
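A sketch of that approach, assuming the values lie in 1..N (the array contents come from the question's example):

#include <stdio.h>

#define N 5

int main(void) {
    int input[N] = {1, 5, 3, 2, 3};
    int output[N] = {0};             /* 0 marks a hole */
    int used[N + 1] = {0};

    /* scan from the right, so the LAST occurrence of a duplicate is kept */
    for (int i = N - 1; i >= 0; i--) {
        if (!used[input[i]]) {
            used[input[i]] = 1;
            output[i] = input[i];
        }
    }
    /* fill the holes with the unused values, in increasing order */
    int v = 1;
    for (int i = 0; i < N; i++) {
        if (output[i] == 0) {
            while (used[v]) v++;
            output[i] = v;
            used[v] = 1;
        }
    }
    for (int i = 0; i < N; i++)
        printf("%d ", output[i]);    /* prints: 1 5 4 2 3 */
    printf("\n");
    return 0;
}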
You can use a set data structure - one for all the numbers up to N, one for the numbers you actually saw, and use a set difference.
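Since N < 10, the sets can even be bitmasks; a small sketch of the set-difference idea (values assumed to lie in 1..N):

#include <stdio.h>

int main(void) {
    const int N = 5;
    int numbers[] = {1, 5, 3, 2, 3};
    unsigned all = ((1u << N) - 1) << 1;   /* bits 1..N: all the numbers up to N */
    unsigned seen = 0;
    for (int i = 0; i < N; i++)
        seen |= 1u << numbers[i];          /* the numbers actually seen */
    unsigned missing = all & ~seen;        /* set difference */
    for (int v = 1; v <= N; v++)
        if (missing & (1u << v))
            printf("%d is missing\n", v);  /* prints: 4 is missing */
    return 0;
}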
One way to do this would be to look at each element of the array in sequence, and see whether that element has been seen before in elements that you've already checked. If so, then change that number to one you haven't seen before, and proceed.
Allow me to introduce you to my friend Schlemiel the Painter. Discovery of a more efficient method is left as a challenge for the reader.
This kind of looks like homework, please let us know if it isn't. I'll give you a small hint, and then I'll improve my answer if you confirm this isn't homework.
My tip for now is this: if you were to do this by hand, how would you do it? Would you write out an extra list of numbers of some kind? Would you read through the list (and how many times)? Etc.
For simple problems, sometimes modelling your algorithm after an intuitive by-hand approach can work well.
Here's a link I read just today that may be helpful.
http://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html