Huffman's Data compression filltable and invert code problems - c++

I just began learning about Huffman's data compression algorithm and I need help with the following functions: fillTable() and invertCodes().
I don't understand why a codetable array is needed.
while (n>0){
copy = copy * 10 + n %10;
n /= 10;
}
Please help me understand what is going on in this part of the function, and why n is divided by ten while it is larger than 0, because it seems it is always going to be greater than 0 no matter how many times you divide it.
Link for code: http://www.programminglogic.com/implementing-huffman-coding-in-c/
void fillTable(int codeTable[], Node *tree, int Code){
    if (tree->letter<27)
        // leaf: store the digits accumulated on the way down as this letter's code
        codeTable[(int)tree->letter] = Code;
    else{
        // internal node: append digit 1 for the left branch, 2 for the right branch
        fillTable(codeTable, tree->left, Code*10+1);
        fillTable(codeTable, tree->right, Code*10+2);
    }
    return;
}
void invertCodes(int codeTable[],int codeTable2[]){
    int i, n, copy;
    for (i=0;i<27;i++){
        n = codeTable[i];
        copy = 0;
        // reverse the decimal digits of n into copy
        while (n>0){
            copy = copy * 10 + n %10;
            n /= 10;
        }
        codeTable2[i]=copy;
    }
}
** edit **
To make this question clearer: I don't need an explanation of Huffman encoding and decoding, but I need an explanation of how these two functions work and why code tables are necessary.

n is an int. Therefore, it will reduce to 0 over time. If n starts at 302, it will be reduced to 30 after the first n /= 10;. At the end of the second iteration of the while loop, it will be reduced to 3. At the end of the third iteration, it will equal 0 ( int 3 / int 10 = int 0 ).
It is integer math. No decimal bits to extend to infinity.
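The termination argument can be checked directly. The following is a minimal C++ sketch of the same digit-reversal loop, pulled out into a standalone function; the name reverse_digits is hypothetical and does not appear in the linked program.

```cpp
// Reverse the decimal digits of n, mirroring the inner loop of invertCodes().
// Integer division truncates, so n loses one digit per pass and reaches 0
// after as many passes as n has digits.
int reverse_digits(int n) {
    int copy = 0;
    while (n > 0) {
        copy = copy * 10 + n % 10;  // append the last digit of n to copy
        n /= 10;                    // drop the last digit of n
    }
    return copy;
}
```

A code such as 112 (left, left, right) comes back as 211. A reversed code lets a caller peel off the first branch with % 10, which is presumably why the linked program builds the second table.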

I made a minor update to the example program to include an end of data code. The original example code may append an extra letter to the end of the original data when decompressing. Also there's a lot of stuff "hard coded" in this code, such as the number of codes, which was 27, and which I changed to 28 to include the end of data code that I added, and also the output file names which I changed to "compress.bin" (if compressing) or "output.txt" (if decompressing). It's not an optimal implementation, but it's ok to use as a learning example. It would help if you follow the code with a source level debugger.
http://rcgldr.net/misc/huffmanx.zip
A more realistic Huffman program would use tables to do the encode and decode. The encode table is indexed with the input code, and each table entry contains two values, the number of bits in the code, and the code itself. The decode table is indexed with a code composed of the minimum number of bits from the input stream required to determine the code (it's at least 9 bits, but may need to be 10 bits), and each entry in that table contains two values, the actual number of bits, and the character (or end of data) represented by that code. Since the actual number of bits may be less than the number of bits used to determine the code, the left over bits will need to be buffered and used before reading data from the compressed file.
One variation of a Huffman like process is to have the length of the code determined by the leading bits of each code, to reduce the size of the decode table.
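As a rough illustration of the encode table described above, here is a minimal C++ sketch that walks a Huffman tree and records, for each symbol, the number of bits and the code itself. The Node and CodeEntry types, the function name fill_encode_table, and the table size of 28 (27 letters plus an end of data code) are assumptions made for this sketch, not the layout used in huffmanx.zip.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical tree node: symbol >= 0 marks a leaf, -1 marks an internal node.
struct Node {
    int symbol = -1;
    std::unique_ptr<Node> left, right;
};

// One encode-table entry: how many bits the code has, and the bits themselves
// (first branch taken ends up in the most significant used bit).
struct CodeEntry {
    std::uint8_t length;
    std::uint32_t bits;
};

// Depth-first walk: append 0 for a left branch and 1 for a right branch.
void fill_encode_table(const Node* node, std::uint32_t bits, std::uint8_t length,
                       std::array<CodeEntry, 28>& table) {
    if (node->symbol >= 0) {
        table[static_cast<std::size_t>(node->symbol)] = {length, bits};
        return;
    }
    const std::uint8_t next = static_cast<std::uint8_t>(length + 1);
    fill_encode_table(node->left.get(), bits << 1, next, table);
    fill_encode_table(node->right.get(), (bits << 1) | 1u, next, table);
}
```

Storing the length next to the bits avoids the decimal-digit encoding of the example program, where a leading 0 bit could not be represented.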

Related

outputting serializable sorted integers

I want to generate a serialized list of randomly selected positive integers in sorted order, but the number of integers desired and the range of numbers that it may select from in a given use case could easily each be in the many millions (or sometimes even each in the range of billions, if 64 bit integers are being used), so it isn't really feasible to store the numbers in an array that can then be accessed randomly by the software.
Therefore, I wanted to generate the numbers via a simple loop that looked something like this:
unsigned current = 0;
while(remaining>0) {
if (find_next_to_output(current,max,remaining)) {
// do stuff having output a value
}
}
Where remaining is initialized to however many random numbers I intend to output, and max is the upper bound (plus one) on the numbers that may be generated. It can be assumed that remaining will always be initialized to a number less than or equal to max.
The find_next_to_output function would look similar to this:
/**
 * advance through the range of accepted values until all values have been output
 * @param current [in/out] integer to examine. Advances to the next integer
 * to consider for output
 * @param max one more than the largest integer to ever output
 * @param remaining [in/out] number of integers left to output.
 * @return true if the function outputted an integer, false otherwise
 */
bool find_next_to_output(unsigned &current, unsigned max, unsigned &remaining)
{
    bool result = false;
    if (remaining == 0) {
        return false;
    }
    if (rnd() * (max - current) < remaining) {
        // code to output 'current' goes here.
        remaining--;
        result = true;
    }
    int x = ?; // WHAT GOES HERE?
    current += x;
    return result;
}
Note, the function rnd() used above would return a uniform randomly generated floating point number on the range [0..1).
As the comment highlights, I am unsure how I can calculate a reasonable value for x, such that the number of values of current that get skipped over by the function is reflective of the probability that none of the values that are skipped would be picked (while still leaving a sufficient number remaining that all remaining numbers can still be picked). I know that it needs to be a random number (probably not from a uniform distribution), but I don't know how to calculate a good value for it. At worst, it would simply increment current by one each time, but this should be statistically unlikely when there is a sufficient difference between the number of integers remaining to output and the number of integers remaining in the range.
I do not want to use any third party libraries such as boost, although I am fine with using any random number generators that may be packed in the standard library for C++11.
If any part of my question is unclear, please comment below and I will endeavor to clarify.
If I understand you correctly, you want to generate random, ascending numbers. You are trying to do this by creating a step of a random size to add to the previous number.
Your worry is that if the step is too large, then you will overflow and wrap back around, breaking the ascending requirement.
x needs to be constrained in a manner that prevents the overflow, while still satisfying the random requirement.
You want the modulo operator (modulus). %
const unsigned step = (max - current) / remaining;
x = unsigned(rnd() * max) % step; // will never be larger than step
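For comparison, the loop in the question with x fixed at 1 is selection sampling: visit each candidate in order and keep it with probability remaining / (candidates left). The sketch below is one way to write that in C++11 with only the standard library. The function name sorted_sample is hypothetical, the values are collected in a vector purely for illustration, and an integer draw replaces the floating point rnd() so that the selection probability is exact.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Selection sampling over 0..max-1. The result is sorted, has no duplicates,
// and contains exactly `count` values, provided count <= max.
std::vector<std::uint64_t> sorted_sample(std::uint64_t count, std::uint64_t max,
                                         std::mt19937_64& rng) {
    std::vector<std::uint64_t> out;  // stands in for "output the value"
    std::uint64_t remaining = count;
    for (std::uint64_t current = 0; current < max && remaining > 0; ++current) {
        const std::uint64_t left = max - current;  // candidates not yet visited
        std::uniform_int_distribution<std::uint64_t> pick(0, left - 1);
        // pick(rng) < remaining holds with probability remaining / left
        if (pick(rng) < remaining) {
            out.push_back(current);
            --remaining;
        }
    }
    return out;
}
```

This always advances by one, so it takes time proportional to max. Skipping several candidates at once requires drawing the skip length from the matching distribution, which this sketch does not attempt.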

Analyzing an algorithm involving bitwise operations and powers of two?

I wrote a program that computes the largest power of 2 in a given input number. For instance, the largest power of 2 in the number 26 is 16, since 2^4 is 16. Here is the algorithm:
uint power2( uint n )
{
uint i = n;
uint j = i & (i - 1);
while( j != 0 )
{
i = j;
j = i & (i - 1);
}
return i;
}
I've struggled a bit with analysis of algorithms. I know we are trying to figure out the Big-Oh, Big-Omega, or Big-Theta notation. In order to analyze an algorithm, we are supposed to count the number of basic operations? My problem here is that I see two lines that could be basic operations. I see the line uint j = i & (i - 1) outside the while loop and I also see the j = i & (i - 1) inside the while loop. I feel like the one inside the while loop is a basic operation for sure, but what about the one outside the while loop?
The other part I struggle with is determining how many times the body of the while loop will execute. For instance, if we have a for loop for(int i = 0; i < n; i++) {...} we know that this loop will execute n times. Or even in a while loop while(i < n * n) {... i++} we know this loop will run n * n times. But for this while loop, it varies depending on the input. For instance, if the number you pass into it is a power of 2 right off the bat, the while loop will never execute. But, if you pass in a really large number, it'll execute many times. I don't know how many times it will execute to be quite honest. That's what I am trying to figure out. Can anyone help me understand what is going on in this algorithm and the number of times it runs, like O(n) or O(n^2), etc?
This is a good example of an algorithm where it's easier to reason about it when you have a good intuitive understanding of what it's doing rather than just by looking at the code itself.
For starters, what does i & (i - 1) do? If you take a number written in binary and subtract one from it, it has the effect of
clearing the least-significant 1 bit, and
setting all the bits after that to 1.
For example, if we take the binary number 1001000 (72) and subtract one, we get 1000111 (71). Notice how the least-significant 1 bit was cleared and all the bits below that were set to 1.
What happens when you AND a number i with the number i - 1? Well, all the bits above the least-significant 1 bit in both i and i - 1 are unchanged, but i and i - 1 disagree in all positions at or below the least-significant 1 bit in i. This means that i & (i - 1) has the effect of clearing the lowest 1 bit in the number i.
So let's go back to the code. Notice that each iteration of the while loop uses this technique to clear a bit from the number j. This means that the number of iterations of the while loop is directly proportional to the number of 1 bits set in the number n. Therefore, if we let b represent the number of 1 bits set in n, then the runtime of this algorithm is Θ(b).
To get a sense for the best and worst-case behavior of this algorithm, as you noted, if n is a perfect power of two, then the runtime is O(1). That's as good as this is going to get. For the worst case, the number n could be all 1 bits, in which case there will be Θ(log n) 1 bits set in the number n (since, in binary, the number n requires Θ(log n) bits to write out). Therefore, the worst-case runtime is Θ(log n).
In summary, we see that
the best-case runtime is O(1),
the worst-case runtime is Θ(log n), and
the exact runtime is Θ(b), where b is the number of bits set in n.
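To see the Θ(b) behavior concretely, here is a small C++ sketch of the same loop with a counter added. The name power2_counted and the iterations out-parameter are additions for illustration only.

```cpp
#include <cstdint>

// Same loop as power2(), plus a count of how often the loop body runs.
std::uint32_t power2_counted(std::uint32_t n, unsigned& iterations) {
    iterations = 0;
    std::uint32_t i = n;
    std::uint32_t j = i & (i - 1);  // i with its lowest 1 bit cleared
    while (j != 0) {
        i = j;
        j = i & (i - 1);
        ++iterations;
    }
    return i;
}
```

For n = 26 (binary 11010, three 1 bits) the body runs twice and the result is 16. In general, for n > 0 with b bits set, the body runs b - 1 times, which is consistent with Θ(b).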

Random pairs of different bits

I have the following problem. I have a number represented in binary representation. I need a way to randomly select two bits of them that are different (i.e. find a 1 and a 0). Besides this I run other operations on that number (reversing sequences, permute sequences,...) These are the approaches I already used:
Keep track of all the ones and the zeros. When I create the binary representation of the binary number I store the places of the 0's and 1's. So that I can choose an index for one list and one index from the other one. I then have two different bits. To run my other operations I created those from an elementary swap operations which updates the indices of the 1's and 0's when manipulating. Therefore I have a third list that stores the list index for each bit. If a bit is 1 I know where to find in the list with all the indices of the ones (same goes for zeros).
The method above yields some overhead when operations are done that do not require the bits to be different. So another way would be to create the lists whenever different bits are needed.
Does anyone have a better idea to do this? I need these operations to be really fast (I am working with popcount, clz, and other binary operations)
I don't feel as though I have enough information to assess the tradeoffs properly, but perhaps you'll find this idea useful. To find a random 1 in a word (find a 1 over multiple words by popcount and reservoir sampling; find a 0 by complementing), first test the popcount. If the popcount is high, then generate indexes uniformly at random and test them until a one is found. If the popcount is medium, then take bitwise ANDs with uniform random masks (but keep the original if the AND is zero) to reduce the popcount. When the popcount is low, use clz to compile the (small) list of candidates efficiently and then sample uniformly at random.
I think the following might be a rather efficient algorithm to do what you are asking. You only iterate over each bit in the number once, and for each element, you have to generate a random number (not exactly sure how costly that is, but I believe there are some optimized CPU instructions for getting random numbers).
Idea is to iterate over all the bits, and with the right probability, update the index to the current index you are visiting.
Generic pseudocode for getting an element from a stream/array:
p = 1
e = null
for s in stream:
    with probability 1/p:
        replace e with s
    p++
return e
Java version:
int[] getIdx(int n){
int oneIdx = 0;
int zeroIdx = 0;
int ones = 1;
int zeros = 1;
// this loop depends on whether you want to select all the prepended zeros
// in a 32/64 bit representation. Alter to your liking...
for(int i = n, j = 0; i != 0; i = i >>> 1, j++){ // i != 0 so that negative n (top bit set) is scanned too
if((i & 1) == 1){ // current element is 1
if(Math.random() < 1/(float)ones){
oneIdx = j;
}
ones++;
} else{ // element is 0
if(Math.random() < 1/(float)zeros){
zeroIdx = j;
}
zeros++;
}
}
return new int[]{zeroIdx,oneIdx};
}
An optimization you might look into is to do the probability selection using ints instead of floats, which might be slightly faster. Here is a short proof I did some time ago that this works: here. I believe the algorithm is attributed to Knuth, but I can't remember exactly.
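The same single-pass idea can be sketched in C++ with integer draws instead of floats, as suggested above. The function name pick_zero_and_one and the explicit width parameter (how many low bits count as part of the number) are assumptions of this sketch.

```cpp
#include <cstdint>
#include <random>
#include <utility>

// Pick one random 0-bit position and one random 1-bit position among the low
// `width` bits of `value` (width <= 64). Returns {zeroIdx, oneIdx}; an index
// is -1 when no bit of that kind exists.
std::pair<int, int> pick_zero_and_one(std::uint64_t value, int width,
                                      std::mt19937& rng) {
    int zero_idx = -1, one_idx = -1;
    int zeros = 0, ones = 0;
    for (int pos = 0; pos < width; ++pos) {
        const bool bit = ((value >> pos) & 1u) != 0;
        int& seen = bit ? ones : zeros;
        int& chosen = bit ? one_idx : zero_idx;
        ++seen;
        // Replace the current choice with probability 1/seen.
        if (std::uniform_int_distribution<int>(1, seen)(rng) == 1) {
            chosen = pos;
        }
    }
    return {zero_idx, one_idx};
}
```

Each bit of a given kind ends up selected with equal probability, because the k-th one seen replaces the earlier choice with probability 1/k.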

Finding a number in an array

I have an array of 20 numbers (64 bit int) something like 10, 25, 36,43...., 118, 121 (sorted numbers).
Now, I have to give millions of numbers as input (say 17, 30).
What I have to give as output is:
for Input 17:
17 is < 25 and > 10. So, output will be index 0.
for Input 30:
30 is < 36 and > 25. So, output will be index 1.
Now, I can do it using linear search or binary search. Is there any method to do it faster? The input numbers are random (Gaussian).
If you know the distribution, you can direct your search in a smarter way.
Here is the rough idea of this variant of binary search:
Assuming that your data is expected to be distributed uniformly on 0 to 100.
If you observe the value 0, you start at the beginning. If your value is 37, you start at 37% of the array you have. This is the key difference to binary search: you don't always start at 50%, but you try to start in the expected "optimal" position.
This also works for Gaussian distributed data, if you know the parameters (If you don't know them, you can still estimate them easily from the observed data). You would compute the Gaussian CDF, and this yields the place to start your search.
Now for the next step, you need to refine your search. At the position you looked at, there was a different value. You can use this to re-estimate the position to continue searching.
Now even if you don't know the distribution this can work very well. So you start with a binary search, and looked at objects at 50% and 25% already. Instead of going to 37.5% next, you can do a better guess, if your query value was, e.g., very close to the 50% entry. Unless your data set is very "clumpy" (and your queries are not correlated to the data) then this should still outperform "naive" binary search that always splits in the middle.
http://en.wikipedia.org/wiki/Interpolation_search
The expected average runtime is apparently O(log(log(n))), according to Wikipedia.
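A minimal C++ sketch of interpolation search over a sorted vector follows. It looks for an exact match and assumes the spread between the smallest and largest element fits in a 64-bit signed integer; the function name interpolation_search is chosen here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of `key` in the sorted vector `a`, or -1 if it is absent.
// Instead of probing the middle, each step probes where the key would sit if
// the values between a[lo] and a[hi] were spread evenly.
long interpolation_search(const std::vector<std::int64_t>& a, std::int64_t key) {
    if (a.empty()) return -1;
    std::size_t lo = 0, hi = a.size() - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[lo] == a[hi]) {
            return a[lo] == key ? static_cast<long>(lo) : -1;
        }
        const double frac = static_cast<double>(key - a[lo]) /
                            static_cast<double>(a[hi] - a[lo]);
        const std::size_t pos =
            lo + static_cast<std::size_t>(frac * static_cast<double>(hi - lo));
        if (a[pos] == key) return static_cast<long>(pos);
        if (a[pos] < key) {
            lo = pos + 1;
        } else {
            hi = pos - 1;  // safe: a[pos] > key >= a[lo] implies pos > lo
        }
    }
    return -1;
}
```

To answer the original "which interval" question rather than an exact match, the same probing rule could be combined with a final comparison against the neighbors of the last probe.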
Update: since someone complained that with just 20 numbers things are different. Yes, they are. With 20 numbers linear search may be best. Because of CPU caching. Linear scanning through a small amount of memory - that fits into the CPU cache - can be really fast. In particular with an unrolled loop. But that case is quite pathetic and uninteresting IMHO.
I believe best option for you is to use upper_bound - it will find the first value in the array bigger than the one you are searching for.
Still depending on the problem you try to solve maybe lower_bound or binary_search may be the thing you need.
All of these algorithms are with logarithmic complexity.
Nothing will be better than binary search, since your array is sorted.
Linear search is O(n) while binary search is O(log n)
Edit:
Interpolation search makes an extra assumption (the elements have to be uniformly distributed) and does more comparisons per iteration.
You can try both and empirically measure which is better for your case
In fact, this problem is quite interesting because it is a re-cast of an information theoretic framework.
Given 20 numbers, you will end up with 21 bins (including < first one and > last one).
For each incoming number, you are to map to one of these 21 bins. This mapping is done by comparison. Each comparison gives you 1 bit of information (< or >= -- two states).
So suppose the incoming number requires 5 comparisons in order to figure out which bin it belongs to, then it is equivalent to using 5 bits to represent that number.
Our goal is to minimize the number of comparisons! We have 1 million numbers each belonging to 21 ordered code words. How do we do that?
This is exactly an entropy compression problem.
Let a[1],.. a[20], be your 20 numbers.
Let p(n) = pr { incoming number is < n }.
Build the decision tree as follows.
Step 1.
let i = argmin |p(a[i]) - 0.5|
define p0(n) = p(n) / (sum(p(j), j=0...a[i-1])), and p0(n)=0 for n >= a[i].
define p1(n) = p(n) / (sum(p(j), j=a[i]...a[20])), and p1(n)=0 for n < a[i].
Step 2.
let i0 = argmin |p0(a[i0]) - 0.5|
let i1 = argmin |p1(a[i1]) - 0.5|
and so on...
and by the time we're done, we end up with:
i, i0, i1, i00, i01, i10, i11, etc.
each one of these i gives us the comparison position.
so now our algorithm is as follows:
let u = input number.
if (u < a[i]) {
if (u < a[i0]) {
if (u < a[i00]) {
} else {
}
} else {
if (u < a[i01]) {
} else {
}
}
} else {
similarly...
}
so the i's define a tree, and the if statements are walking the tree. we can just as well put it into a loop, but it's easier to illustrate with a bunch of if.
so for example, if you knew that your data were uniformly distributed between 0 and 2^63, and your 20 numbers were
0,1,2,3,...19
then
i = 20 (notice that there is no i1)
i0 = 10
i00 = 5
i01 = 15
i000 = 3
i001 = 7
i010 = 13
i011 = 17
i0000 = 2
i0001 = 4
i0010 = 6
i0011 = 9
i00110 = 8
i0100 = 12
i01000 = 11
i0110 = 16
i0111 = 19
i01110 = 18
ok so basically, the comparison would be as follows:
if (u < a[20]) {
if (u < a[10]) {
if (u < a[5]) {
} else {
...
}
} else {
...
}
} else {
return 21
}
so note here, that I am not doing binary search! I am first checking the end point. why?
there is 100*((2^63)-20)/(2^63) percent chance that it will be greater than a[20]. this is basically like 99.999999999999999783159565502899% chance!
so this algorithm as it is has an expected number of comparison of 1 for a dataset with the properties specified above! (this is better than log log :p)
notice what I have done here is I am basically using fewer compares to find numbers that are more probable and more compares to find numbers that are less probable. for example, the number 18 requires 6 comparisons (1 more than needed with binary search); however, the numbers 20 to 2^63 require only 1 comparison. this same principle is used for lossless (entropy) data compression -- use fewer bits to encode code words that appear often.
building the tree is a one time process and you can use the tree 1 million times later.
the question is... when does this decision tree become binary search? homework exercise! :p the answer is simple. it's similar to when you can't compress a file any more.
ok, so I didn't pull this out of my behind... the basis is here:
http://en.wikipedia.org/wiki/Arithmetic_coding
You could perform binary search using std::lower_bound and std::upper_bound. These give you back iterators, so you can use std::distance to get an index.
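A short sketch of that approach, using the index convention from the question (17 falls between 10 and 25, so the answer is index 0). The function name bin_index and the -1 result for inputs below the first element are choices made for this sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Index of the last element <= key, or -1 if key is smaller than every element.
long bin_index(const std::vector<std::int64_t>& sorted, std::int64_t key) {
    // upper_bound returns an iterator to the first element greater than key.
    auto it = std::upper_bound(sorted.begin(), sorted.end(), key);
    return static_cast<long>(std::distance(sorted.begin(), it)) - 1;
}
```

With the array 10, 25, 36, 43, ..., an input of 17 gives 0 and an input of 30 gives 1, matching the expected output in the question.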

Create Random Number Sequence with No Repeats

Duplicate:
Unique random numbers in O(1)?
I want a pseudo-random number generator that can generate numbers with no repeats in a random order.
For example:
random(10)
might return
5, 9, 1, 4, 2, 8, 3, 7, 6, 10
Is there a better way to do it other than making the range of numbers and shuffling them about, or checking the generated list for repeats?
Edit:
Also I want it to be efficient in generating big numbers without the entire range.
Edit:
I see everyone suggesting shuffle algorithms. But if I want to generate large random numbers (1024 bytes+), then that method would take a lot more memory than if I just used a regular RNG and inserted into a Set until it was a specified length, right? Is there no better mathematical algorithm for this?
You may be interested in a linear feedback shift register.
We used to build these out of hardware, but I've also done them in software. It uses a shift register with some of the bits xor'ed and fed back to the input, and if you pick just the right "taps" you can get a sequence that's as long as the register size. That is, a 16-bit lfsr can produce a sequence 65535 long with no repeats. It's statistically random but of course eminently repeatable. Also, if it's done wrong, you can get some embarrassingly short sequences. If you look up the lfsr, you will find examples of how to construct them properly (which is to say, "maximal length").
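As a concrete instance, here is a C++ sketch of a 16-bit Galois LFSR using the toggle mask 0xB400 (feedback polynomial x^16 + x^14 + x^13 + x^11 + 1), a commonly cited maximal-length choice. The function names are hypothetical.

```cpp
#include <cstdint>

// One step of a 16-bit Galois LFSR: shift right, and if the bit shifted out
// was 1, XOR the tap mask into the state.
std::uint16_t lfsr16_next(std::uint16_t state) {
    const bool lsb = (state & 1u) != 0;
    state >>= 1;
    if (lsb) state ^= 0xB400u;
    return state;
}

// Count steps until the state first returns to `seed` (seed must be nonzero,
// because the all-zero state maps to itself).
unsigned lfsr16_period(std::uint16_t seed) {
    unsigned steps = 0;
    std::uint16_t s = seed;
    do {
        s = lfsr16_next(s);
        ++steps;
    } while (s != seed);
    return steps;
}
```

With these taps every nonzero 16-bit state is visited once per cycle, which is the 65535-long sequence mentioned above.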
A shuffle is a perfectly good way to do this (provided you do not introduce a bias using the naive algorithm). See Fisher-Yates shuffle.
If a random number is guaranteed to never repeat it is no longer random and the amount of randomness decreases as the numbers are generated (after nine numbers random(10) is rather predictable and even after only eight you have a 50-50 chance).
I understand you don't want a shuffle for large ranges, since you'd have to store the whole list to do so.
Instead, use a reversible pseudo-random hash. Then feed in the values 0 1 2 3 4 5 6 etc in turn.
There are infinite numbers of hashes like this. They're not too hard to generate if they're restricted to a power of 2, but any base can be used.
Here's one that would work for example if you wanted to go through all 2^32 32 bit values. It's easiest to write because the implicit mod 2^32 of integer math works to your advantage in this case.
unsigned int reversableHash(unsigned int x)
{
x*=0xDEADBEEF;
x=x^(x>>17);
x*=0x01234567;
x+=0x88776655;
x=x^(x>>4);
x=x^(x>>9);
x*=0x91827363;
x=x^(x>>7);
x=x^(x>>11);
x=x^(x>>20);
x*=0x77773333;
return x;
}
If you don't mind mediocre randomness properties and if the number of elements allows it then you could use a linear congruential random number generator.
A shuffle is the best you can do for random numbers in a specific range with no repeats. The reason that the method you describe (randomly generate numbers and put them in a Set until you reach a specified length) is less efficient is because of duplicates. Theoretically, that algorithm might never finish. At best it will finish in an indeterminable amount of time, as compared to a shuffle, which will always run in a highly predictable amount of time.
Response to edits and comments:
If, as you indicate in the comments, the range of numbers is very large and you want to select relatively few of them at random with no repeats, then the likelihood of repeats diminishes rapidly. The bigger the difference in size between the range and the number of selections, the smaller the likelihood of repeat selections, and the better the performance will be for the select-and-check algorithm you describe in the question.
What about using GUID generator (like in the one in .NET). Granted it is not guaranteed that there will be no duplicates, however the chance getting one is pretty low.
This has been asked before - see my answer to the previous question. In a nutshell: You can use a block cipher to generate a secure (random) permutation over any range you want, without having to store the entire permutation at any point.
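To show the structure of that idea, here is a toy C++ sketch: a small Feistel network gives a permutation of a power-of-two domain, and "cycle walking" (re-applying it until the result lands in range) restricts it to [0, n). The round function and constants are arbitrary and NOT cryptographically secure, and all names are hypothetical; a real implementation would use a proper block cipher.

```cpp
#include <cstdint>

// Arbitrary mixing function used as the Feistel round function (not secure).
static std::uint32_t round_fn(std::uint32_t half, std::uint32_t key) {
    std::uint32_t x = half * 0x9E3779B1u + key;
    x ^= x >> 15;
    x *= 0x85EBCA77u;
    x ^= x >> 13;
    return x;
}

// Four-round Feistel network: a bijection on values of 2 * half_bits bits.
static std::uint32_t feistel(std::uint32_t v, unsigned half_bits, std::uint32_t seed) {
    const std::uint32_t mask = (1u << half_bits) - 1;
    std::uint32_t l = v >> half_bits, r = v & mask;
    for (std::uint32_t round = 0; round < 4; ++round) {
        const std::uint32_t t = l ^ (round_fn(r, seed + round) & mask);
        l = r;
        r = t;
    }
    return (l << half_bits) | r;
}

// Maps i in [0, n) to a unique value in [0, n) without storing the permutation.
std::uint32_t permute(std::uint32_t i, std::uint32_t n, std::uint32_t seed) {
    unsigned half_bits = 1;
    while ((1ull << (2 * half_bits)) < n) ++half_bits;  // smallest even-width domain >= n
    std::uint32_t v = i;
    do {
        v = feistel(v, half_bits, seed);  // cycle-walk until back inside [0, n)
    } while (v >= n);
    return v;
}
```

Calling permute(0, n, seed), permute(1, n, seed), and so on yields every value in [0, n) exactly once, in an order determined by the seed.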
If you want to create large (say, 64 bits or greater) random numbers with no repeats, then just create them. If you're using a good random number generator, that actually has enough entropy, then the odds of generating repeats are so minuscule as to not be worth worrying about.
For instance, when generating cryptographic keys, no one actually bothers checking to see if they've generated the same key before; since you're trusting your random number generator that a dedicated attacker won't be able to get the same key out, then why would you expect that you would come up with the same key accidentally?
Of course, if you have a bad random number generator (like the Debian SSL random number generator vulnerability), or are generating small enough numbers that the birthday paradox gives you a high chance of collision, then you will need to actually do something to ensure you don't get repeats. But for large random numbers with a good generator, just trust probability not to give you any repeats.
As you generate your numbers, use a Bloom filter to detect duplicates. This would use a minimal amount of memory. There would be no need to store earlier numbers in the series at all.
The trade off is that your list could not be exhaustive in your range. If your numbers are truly on the order of 256^1024, that's hardly any trade off at all.
(Of course if they are actually random on that scale, even bothering to detect duplicates is a waste of time. If every computer on earth generated a trillion random numbers that size every second for trillions of years, the chance of a collision is still absolutely negligible.)
I second gbarry's answer about using an LFSR. They are very efficient and simple to implement even in software and are guaranteed not to repeat in (2^N - 1) uses for an LFSR with an N-bit shift-register.
There are some drawbacks however: by observing a small number of outputs from the RNG, one can reconstruct the LFSR and predict all values it will generate, making them not usable for cryptography and anywhere where a good RNG is needed. The second problem is that either the all zero word or the all one (in terms of bits) word is invalid depending on the LFSR implementation. The third issue which is relevant to your question is that the maximum number generated by the LFSR is always a power of 2 - 1 (or power of 2 - 2).
The first drawback might not be an issue depending on your application. From the example you gave, it seems that you are not expecting zero to be among the answers; so, the second issue does not seem relevant to your case.
The maximum value (and thus range) problem can be solved by reusing the LFSR until you get a number within your range. Here's an example:
Say you want to have numbers between 1 and 10 (as in your example). You would use a 4-bit LFSR which has a range [1, 15] inclusive. Here's a pseudo code as to how to get number in the range [1,10]:
x = LFSR.getRandomNumber();
while (x > 10) {
x = LFSR.getRandomNumber();
}
You should embed the previous code in your RNG; so that the caller wouldn't care about implementation.
Note that this would slow down your RNG if you use a large shift-register and the maximum number you want is not a power of 2 - 1.
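The 1 to 10 example above can be made concrete with a 4-bit Galois LFSR. The sketch below assumes the toggle mask 0xC (polynomial x^4 + x^3 + 1), which cycles through all 15 nonzero states; the function names are hypothetical.

```cpp
#include <cstdint>

// One step of a 4-bit Galois LFSR. From any nonzero state it visits every
// value in [1, 15] once before repeating.
std::uint8_t lfsr4_next(std::uint8_t state) {
    const bool lsb = (state & 1u) != 0;
    state >>= 1;
    if (lsb) state ^= 0xCu;
    return state;
}

// Advance the LFSR, discarding values above max_value (1 <= max_value <= 15).
// `state` must start nonzero and is updated in place.
std::uint8_t next_in_range(std::uint8_t& state, std::uint8_t max_value) {
    do {
        state = lfsr4_next(state);
    } while (state > max_value);
    return state;
}
```

Ten consecutive calls to next_in_range(state, 10) return each of 1 through 10 exactly once, at the cost of at most five discarded values per full cycle.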
This answer suggests some strategies for getting what you want and ensuring they are in a random order using some already well-known algorithms.
There is an inside out version of the Fisher-Yates shuffle algorithm, called the Durstenfeld version, that randomly distributes sequentially acquired items into arrays and collections while loading the array or collection.
One thing to remember is that the Fisher-Yates (AKA Knuth) shuffle or the Durstenfeld version used at load time is highly efficient with arrays of objects because only the reference pointer to the object is being moved and the object itself doesn't have to be examined or compared with any other object as part of the algorithm.
I will give both algorithms further below.
If you want really huge random numbers, on the order of 1024 bytes or more, a really good random generator that can generate unsigned bytes or words at a time will suffice. Randomly generate as many bytes or words as you need to construct the number, make it into an object with a reference pointer to it and, hey presto, you have a really huge random integer. If you need a specific really huge range, you can add a base value of zero bytes to the low-order end of the byte sequence to shift the value up. This may be your best option.
If you need to eliminate duplicates of really huge random numbers, then that is trickier. Even with really huge random numbers, removing duplicates also makes them significantly biased and not random at all. If you have a really large set of unduplicated really huge random numbers and you randomly select from the ones not yet selected, then the bias is only the bias in creating the huge values for the really huge set of numbers from which to choose. A reverse version of Durstenfeld's version of the Fisher-Yates shuffle could be used to randomly choose values from a really huge set of them, remove them from the remaining values from which to choose and insert them into a new array that is a subset and could do this with just the source and target arrays in situ. This would be very efficient.
This may be a good strategy for getting a small number of random numbers with enormous values from a really large set of them in which they are not duplicated. Just pick a random location in the source set, obtain its value, swap its value with the top element in the source set, reduce the size of the source set by one and repeat with the reduced size source set until you have chosen enough values. This is essentially the Durstenfeld version of Fisher-Yates in reverse. You can then use the Durstenfeld version of the Fisher-Yates algorithm to insert the acquired values into the destination set. However, that is overkill since they should be randomly chosen and randomly ordered as given here.
Both algorithms assume you have some random number instance method, nextInt(int setSize), that generates a random integer from zero (inclusive) to setSize (exclusive), meaning there are setSize possible values. In this case, it will be at most the size of the array, since the last index of the array is size-1.
The first algorithm is the Durstenfeld version of Fisher-Yates (aka Knuth) shuffle algorithm as applied to an array of arbitrary length, one that simply randomly positions the integers from 0 to the length of the array minus one into the array. The array need not be an array of integers, but can be an array of any objects that are acquired sequentially which, effectively, makes it an array of reference pointers. It is simple, short and very effective.
int size = someNumber;
int[] array = new int[size]; // here is the array to load
int location; // this will get assigned a value before used
// i will also conveniently be the value to load, but any sequentially acquired
// object will work
for (int i = 0; i < size; i++) { // conveniently, i is also the value to load
    // you can instance or acquire any object at this place in the algorithm to load
    // by reference, into the array and use a pointer to it in place of j
    int j = i; // in this example, j is trivially i
    if (i == 0) { // first integer goes into first location
        array[i] = j; // this may get swapped from here later
    } else { // subsequent integers go into random locations
        // the next random location will be somewhere in the locations
        // already used or a new one at the end
        // here we get the next random location
        // to preserve true randomness without a significant bias
        // it is REALLY IMPORTANT that the newest value could be
        // stored in the newest location, that is,
        // location has to be able to randomly have the value i
        location = nextInt(i + 1); // a random value between 0 and i
        // move the random location's value to the new location
        array[i] = array[location];
        array[location] = j; // put the new value into the random location
    } // end if...else
} // end for
Voila, you now have an already randomized array.
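For readers working in C++, the same load-time technique (often described as the inside-out variant of the Fisher-Yates shuffle) can be sketched with the standard <random> header. The function name shuffled_range is hypothetical, and std::mt19937 stands in for the nextInt() method assumed above.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Build a shuffled array of 0..size-1 while loading it, one value at a time.
std::vector<int> shuffled_range(std::size_t size, std::mt19937& rng) {
    std::vector<int> array(size);
    for (std::size_t i = 0; i < size; ++i) {
        // location is uniform over 0..i inclusive, so the new value may stay
        // in the newest slot; excluding i would bias the result.
        const std::size_t location =
            std::uniform_int_distribution<std::size_t>(0, i)(rng);
        array[i] = array[location];             // move the old occupant to the end
        array[location] = static_cast<int>(i);  // new value takes its place
    }
    return array;
}
```

No special case is needed for i == 0: location can only be 0, so the first value is simply written into the first slot.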
If you want to randomly shuffle an array you already have, here is the standard Fisher-Yates algorithm.
type[] array = new type[size];
// some code that loads array...
// randomly pick an item anywhere in the current array segment,
// swap it with the top element in the current array segment,
// then shorten the array segment by 1
// just as with the Durstenfeld version above,
// it is REALLY IMPORTANT that an element could get
// swapped with itself to avoid any bias in the randomization
type temp; // this will get assigned a value before used
int location; // this will get assigned a value before used
for (int i = array.length - 1; i > 0; i--) {
    location = nextInt(i + 1); // a random value between 0 and i
    temp = array[i];
    array[i] = array[location];
    array[location] = temp;
} // end for
For sequenced collections and sets, i.e. some type of list object, you could just use adds or inserts with an index value that lets you insert items anywhere, but the index has to allow appending after the current last item to avoid creating bias in the randomization.
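As a rough illustration of the list-based idea, here is a minimal Python sketch. The function name build_shuffled and the rng parameter are made up for this example; it assumes a list type with positional insert, and it is only one way to do it.

```python
import random

def build_shuffled(items, rng=random):
    """Insert each item at a random index in 0..len(result), inclusive."""
    result = []
    for item in items:
        # randint's upper bound is inclusive, so the item may land after the
        # current last element; leaving that slot out would bias the result
        result.insert(rng.randint(0, len(result)), item)
    return result

print(build_shuffled(range(10)))
```

Each insert shifts the items behind it, so this costs more than the array swaps above for large inputs.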
Shuffling N elements doesn't take up excessive memory...think about it. You only swap one element at a time, so the maximum memory used is that of N+1 elements.
Assuming you have a random or pseudo-random number generator, even one that is not guaranteed to return unique values, you can implement one that returns unique values each time using this code, provided the upper limit remains constant (i.e. you always call it with random(10), and don't call it with random(10) and then random(11)).
The code doesn't check for errors. You can add that yourself if you want to.
It also requires a lot of memory if you want a large range of numbers.
#include <stdlib.h> /* malloc, free, NULL */

/* the function returns a random number between 0 and max - 1
 * not necessarily unique
 * I assume it's written
 * (on POSIX systems <stdlib.h> may already declare a different random();
 * rename this one if your compiler complains)
 */
int random(int max);

/* swap two numbers */
void swap(int *x, int *y);

/* the function returns a unique random number between 0 and max - 1 */
int unique_random(int max)
{
    static int *list = NULL; /* contains a list of numbers we haven't returned */
    static int in_progress = 0; /* 0 --> we haven't started randomizing numbers
                                 * 1 --> we have started randomizing numbers
                                 */
    static int count; /* index of the last number we haven't returned */
    static int prev_max = 0;
    /* initialize the list */
    if (!in_progress || (prev_max != max)) {
        if (list != NULL) {
            free(list);
        }
        list = malloc(sizeof(int) * max);
        prev_max = max;
        in_progress = 1;
        count = max - 1;
        int i;
        for (i = max - 1; i >= 0; --i) {
            list[i] = i;
        }
    }
    /* now choose one from the list; indices 0..count are still unused,
     * so we need count + 1 possible values
     */
    int index = random(count + 1);
    int retval = list[index];
    /* now we throw away the returned value.
     * we do this by shortening the list by 1
     * and replacing the element we returned with
     * the last remaining number
     */
    swap(&list[index], &list[count]);
    /* when the count reaches 0 we start over */
    if (count == 0) {
        in_progress = 0;
        free(list);
        list = NULL;
    } else { /* reduce the counter by 1 */
        count--;
    }
    return retval;
}

/* swap two numbers */
void swap(int *x, int *y)
{
    int temp = *x;
    *x = *y;
    *y = temp;
}
Actually, there's a minor point to make here; a random number generator which is not permitted to repeat is not random.
Suppose you wanted to generate a series of 256 random numbers without repeats.
Create a 256-bit (32-byte) memory block initialized with zeros, let's call it b
Your looping variable will be n, the number of numbers yet to be generated
Loop from n = 256 to n = 1
Generate a random number r in the range [0, n)
Find the r-th zero bit in your memory block b (counting from 0, left to right), let's call its position p
Put p in your list of results, an array called q
Flip the p-th bit in memory block b to 1
After the n = 1 pass, you are done generating your list of numbers
Here's a short example of what I am talking about, using n = 4 initially:
**Setup**
b = 0000
q = []
**First loop pass, where n = 4**
r = 2
p = 2
b = 0010
q = [2]
**Second loop pass, where n = 3**
r = 2
p = 3
b = 0011
q = [2, 3]
**Third loop pass, where n = 2**
r = 0
p = 0
b = 1011
q = [2, 3, 0]
**Fourth and final loop pass, where n = 1**
r = 0
p = 1
b = 1111
q = [2, 3, 0, 1]
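The walkthrough above can be sketched in Python as follows. This is only an illustration: unique_sequence and its pick parameter are names invented here, and a list of booleans stands in for the packed bit block b.

```python
import random

def unique_sequence(size, pick=random.randrange):
    """Return 0..size-1 in random order using a 'used' bitmap."""
    used = [False] * size   # plays the role of the bit block b
    result = []             # plays the role of q
    for remaining in range(size, 0, -1):
        r = pick(remaining)         # r in [0, remaining)
        zeros_seen = 0
        for p in range(size):       # find the r-th unused position (0-based)
            if not used[p]:
                if zeros_seen == r:
                    used[p] = True
                    result.append(p)
                    break
                zeros_seen += 1
    return result

# Replaying the r values from the example (2, 2, 0, 0) gives [2, 3, 0, 1]
rs = iter([2, 2, 0, 0])
print(unique_sequence(4, lambda n: next(rs)))
```

The scan for the r-th zero bit makes this quadratic in the number of values, which is the price paid for using only one bit of state per value.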
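Please check the answers at
Generate sequence of integers in random order without constructing the whole list upfront
My own answer there is:
A very simple generator is 1+((power(r,x)-1) mod p). For values of x from 1 to p-1 it yields each value from 1 to p-1 exactly once, in a scrambled order, where p is a prime number and r is a primitive root modulo p. (r being a prime different from p is not enough on its own: r = 2, p = 7 gives 2, 4, 1 and then repeats.)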
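To make the formula concrete, here is a small Python sketch. The function name power_sequence is made up for this example, and it assumes p is prime and r is a primitive root modulo p; with other choices of r the values repeat early.

```python
def power_sequence(r, p):
    """Yield 1 + ((r**x - 1) mod p) for x = 1 .. p-1."""
    for x in range(1, p):
        # three-argument pow computes r**x mod p without huge intermediates
        yield 1 + ((pow(r, x, p) - 1) % p)

print(list(power_sequence(2, 11)))  # [2, 4, 8, 5, 10, 9, 7, 3, 6, 1]
print(list(power_sequence(2, 7)))   # [2, 4, 1, 2, 4, 1], 2 is not a primitive root mod 7
```

The order is fixed by r and p, so this is a scrambling of 1..p-1 rather than a random draw.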
I asked a similar question before, but mine was for the whole range of an int; see Looking for a Hash Function /Ordered Int/ to /Shuffled Int/
static std::unordered_set<long> s;
long l;
do {
    l = generator();
} while (s.find(l) != s.end()); // re-roll while l was already produced
s.insert(l);
generator() being your random number generator. You keep rolling numbers as long as the entry is already in your set, then you add the first one that is not. You get the idea.
I did it with long for the example, but you should make that a template if your PRNG is templatized.
An alternative is to use a cryptographically secure PRNG, which will have a very low probability of generating the same number twice as long as its outputs are wide enough (for example 128 bits).
If you don't mind the poor statistical properties of the generated sequence, there is one method:
Let's say you want to generate N numbers of 1024 bits each. You can sacrifice some bits of each generated number to act as a "counter".
So you generate each random number, but into some bits you have chosen you put a binary-encoded counter (a variable that you increase each time the next random number is generated).
You can split that counter into single bits and put them in some of the less significant bits of the generated number.
That way you are sure you get a unique number each time.
For example, each generated number looks like this:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyxxxxyxyyyyxxyxx
where each x is taken directly from the generator, and the ys are taken from the counter variable.
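A minimal Python sketch of the counter idea, under simplifying assumptions: 32-bit values instead of 1024-bit ones, and the counter kept in the low 8 bits rather than scattered. The names COUNTER_BITS, RANDOM_BITS and tagged_numbers are invented for this illustration.

```python
import random

COUNTER_BITS = 8    # hypothetical split: low 8 bits hold the counter
RANDOM_BITS = 24    # remaining high bits come from the generator

def tagged_numbers(count, rng=random):
    """Yield `count` distinct 32-bit values; count must fit in COUNTER_BITS."""
    if count > (1 << COUNTER_BITS):
        raise ValueError("counter would wrap and uniqueness would be lost")
    for counter in range(count):
        # random high bits, counter in the low bits: the counter alone
        # already makes every value different from the others
        yield (rng.getrandbits(RANDOM_BITS) << COUNTER_BITS) | counter

for value in tagged_numbers(4):
    print(f"{value:032b}")
```

Scattering the counter bits across fixed positions, as in the picture above, changes only where the bits go; uniqueness still comes from the counter.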
Mersenne twister
A description can be found on Wikipedia: Mersenne twister
Look at the bottom of the page for implementations in various languages.
The problem is to select a "random" sequence of N unique numbers from the range 1..M where there is no constraint on the relationship between N and M (M could be much bigger, about the same, or even smaller than N; they may not be relatively prime).
Expanding on the linear feedback shift register answer: for a given M, construct a maximal LFSR for the smallest power of two that is larger than M. Then just grab your numbers from the LFSR, throwing out numbers larger than M. On average, you will throw out at most half the generated numbers (since by construction more than half the range of the LFSR is less than M), so the expected running time of getting a number is O(1). You are not storing previously generated numbers, so space consumption is O(1) too. If you cycle before getting N numbers, then M is less than N (or the LFSR is constructed incorrectly).
You can find the parameters for maximum length LFSRs up to 168 bits here (from wikipedia): http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf
Here's some Java code:
/**
* Generate a sequence of unique "random" numbers in [0,M)
* @author dkoes
*
*/
public class UniqueRandom
{
long lfsr;
long mask;
long max;
private static long seed = 1;
//indexed by number of bits
private static int [][] taps = {
null, // 0
null, // 1
null, // 2
{3,2}, //3
{4,3},
{5,3},
{6,5},
{7,6},
{8,6,5,4},
{9,5},
{10,7},
{11,9},
{12,6,4,1},
{13,4,3,1},
{14,5,3,1},
{15,14},
{16,15,13,4},
{17,14},
{18,11},
{19,6,2,1},
{20,17},
{21,19},
{22,21},
{23,18},
{24,23,22,17},
{25,22},
{26,6,2,1},
{27,5,2,1},
{28,25},
{29,27},
{30,6,4,1},
{31,28},
{32,22,2,1},
{33,20},
{34,27,2,1},
{35,33},
{36,25},
{37,5,4,3,2,1},
{38,6,5,1},
{39,35},
{40,38,21,19},
{41,38},
{42,41,20,19},
{43,42,38,37},
{44,43,18,17},
{45,44,42,41},
{46,45,26,25},
{47,42},
{48,47,21,20},
{49,40},
{50,49,24,23},
{51,50,36,35},
{52,49},
{53,52,38,37},
{54,53,18,17},
{55,31},
{56,55,35,34},
{57,50},
{58,39},
{59,58,38,37},
{60,59},
{61,60,46,45},
{62,61,6,5},
{63,62},
};
//m is upperbound; things break if it isn't positive
UniqueRandom(long m)
{
max = m;
lfsr = seed; //could easily pass a starting point instead
//figure out number of bits
int bits = 0;
long b = m;
while((b >>>= 1) != 0)
{
bits++;
}
bits++;
if(bits < 3)
bits = 3;
mask = 0;
for(int i = 0; i < taps[bits].length; i++)
{
mask |= (1L << (taps[bits][i]-1));
}
}
//return -1 if we've cycled
long next()
{
long ret = -1;
if(lfsr == 0)
return -1;
do {
ret = lfsr;
//update lfsr - from wikipedia
long lsb = lfsr & 1;
lfsr >>>= 1;
if(lsb == 1)
lfsr ^= mask;
if(lfsr == seed)
lfsr = 0; //cycled, stick
ret--; //zero is stuck state, never generated so sub 1 to get it
} while(ret >= max);
return ret;
}
}
Here is a way to pick random elements without repeating results. It also works for strings. It's in C#, but the logic should work in many places. Put the random results in a list and check whether each new random element is already in that list. If it is not, then you have a new random element. If it is in that list, repeat the random pick until you get an element that is not in that list.
List<string> Erledigte = new List<string>();
private void Form1_Load(object sender, EventArgs e)
{
label1.Text = "";
listBox1.Items.Add("a");
listBox1.Items.Add("b");
listBox1.Items.Add("c");
listBox1.Items.Add("d");
listBox1.Items.Add("e");
}
private void button1_Click(object sender, EventArgs e)
{
Random rand = new Random();
int index=rand.Next(0, listBox1.Items.Count);
string rndString = listBox1.Items[index].ToString();
if (listBox1.Items.Count <= Erledigte.Count)
{
return;
}
else
{
if (Erledigte.Contains(rndString))
{
//MessageBox.Show("vorhanden");
while (Erledigte.Contains(rndString))
{
index = rand.Next(0, listBox1.Items.Count);
rndString = listBox1.Items[index].ToString();
}
}
Erledigte.Add(rndString);
label1.Text += rndString;
}
}
For a sequence to be random there should not be any autocorrelation. The restriction that the numbers should not repeat means the next number depends on all the previous numbers, which means it is not random anymore.
If you can generate 'small' random numbers, you can generate 'large' random numbers by integrating them: add a small random increment to each 'previous'.
const size_t amount = 100; // a limited amount of random numbers
vector<long int> numbers;
numbers.reserve( amount );
const short int spread = 250; // about 250 between each random number
numbers.push_back( myrandom( spread ) );
for( size_t n = 1; n != amount; ++n ) { // one value is already stored
    // the increment must be at least 1, otherwise two numbers could be equal
    // (assuming myrandom returns a value >= 0)
    const short int increment = 1 + myrandom( spread );
    numbers.push_back( numbers.back() + increment );
}
myshuffle( numbers );
The myrandom and myshuffle functions I hereby generously delegate to others :)
To get non-repeated random numbers without wasting time checking for duplicates and drawing new numbers over and over, use the method below, which keeps the number of calls to Rand to a minimum.
For example, if you want to get 100 non-repeated random numbers:
1. fill an array with the numbers from 1 to 100
2. get a random number using the Rand function in the range (1-100)
3. use the generated random number as an index to get the value from the array (Numbers[IndexGeneratedFromRandFunction])
4. shift the numbers in the array after that index to the left
5. repeat from step 2, but now the range should be (1-99), and so on
Now we have an array with different numbers!
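The steps above can be sketched in Python like this. The function name draw_without_repeats is made up for the example, and it uses 0-based indices into the remaining pool rather than the 1-based range in the description.

```python
import random

def draw_without_repeats(low, high, rng=random):
    """Return every integer in low..high exactly once, in random order."""
    pool = list(range(low, high + 1))       # step 1: fill the array
    drawn = []
    while pool:
        index = rng.randrange(len(pool))    # step 2: index into what is left
        drawn.append(pool.pop(index))       # steps 3-4: take it, close the gap
    return drawn

print(draw_without_repeats(1, 100))
```

pop(index) does the left shift from step 4; swapping the chosen item with the last one and popping from the end would avoid that shift.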
#include <stdlib.h> /* rand */

#define N 10 /* the number of them; 10 is only an example value */

int main() {
    int b[N];
    for (int i = 0; i < N; i++) {
        int a = rand() % (N + 1) + 1; /* candidate in 1..N+1 */
        int j = 0;
        while (j < i) {
            if (a == b[j]) {
                /* already used: draw again and restart the scan */
                a = rand() % (N + 1) + 1;
                j = -1;
            }
            j++;
        }
        b[i] = a;
    }
    return 0;
}