I have a bunch of classes that I'd like to instantiate using seeds. Such a class would have only one constructor, taking a single argument: the seed. A very simple pseudo example could be:
class Person {
    int age;

    Person(uint32 seed) {
        age = deriveAgeFromSeed(seed);
    }
}
If I instantiate a Person with a given seed, e.g. 123456789, it should evaluate to a Person with a specific age, e.g. 30. The same seed will always generate the same person (same age).
To achieve this specific example, I could use a regular random-number generator, seed it with my seed, and generate a random number in, say, 0-100 for the age.
However, I may not want it to be uniformly random. Maybe I'd want a 50% chance that the age falls in the range 30-40. I guess I could chain a bunch of random-number operations with my own logic, e.g. generating a number from 0 to 1 to indicate which age range should be used, and then generating a new number to decide the specific age within that range. But this would be an ugly chain of hard-coded logic, and very hard to adjust later.
I'd rather have a way to bundle an "option probability set" with the application, for instance an XML file that specifies the probabilities for all variables. The file could be loaded into memory at launch to avoid re-reading it every time a person is instantiated. An unrealistic example to give an idea of what I mean:
"Person":
"age":
0-30: 25%,
30-40: 50%,
40-100: 25%
The application would use this information to automagically set an age based on the seed, with these given "probability parameters".
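Roughly, I imagine something like this minimal sketch (all names invented, assuming the ranges above have been loaded into a vector): seed an engine with the seed, draw once in [0, 100), walk the cumulative percentages to pick a range, then draw the specific age within it.

#include <cstdint>
#include <random>
#include <vector>

struct Bucket { int lo, hi; double pct; };        // e.g. {30, 40, 50.0}

int deriveAgeFromSeed(uint32_t seed, const std::vector<Bucket>& buckets) {
    std::mt19937 eng(seed);                       // deterministic: same seed -> same age
    std::uniform_real_distribution<double> pick(0.0, 100.0);
    double p = pick(eng), cum = 0.0;
    for (const Bucket& b : buckets) {
        cum += b.pct;
        if (p < cum)                              // landed in this bucket
            return std::uniform_int_distribution<int>(b.lo, b.hi)(eng);
    }
    return buckets.back().hi;                     // guard against rounding at 100%
}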
Having such a file would drastically decrease the workload for future adjustments of parameters, and would even let me change parameters without having to rebuild the application. But is it viable?
In addition to this, there may be cases where a second parameter could be dependent on the first. An example could be enum Occupation, where certain occupations are more common for certain ages (e.g. 'fast food employee' being more common at younger ages and CEO more common the older they get).
This type of logic seems rather common for certain types of video games, e.g. RTS such as "Civilization", where the game seed would be used to create the map, place its resources and the player spawn locations. It appears to also be used for procedural sandbox games such as Minecraft, where there's a certain probability for biomes. The latter is probably more noise-based, but still, noise would only give an output between 0 and 1, and they somehow derive a certain biome probability from it.
(I will code in C++, but language doesn't matter for the question)
So:

1. What is the optimal/best-practice procedure to derive many values at preset probabilities from a single seed?
2. Can the probabilities be imported from an external file?
3. Can probabilities be dependent on each other?
It seems you are thinking about a PRNG with state that you initialize and use exactly once to generate the age value. However, for this purpose a hash would work just as well: it also produces a pseudo-random number with a uniform distribution over the given output space. If you need to generate a sequence of random numbers from a single seed, then a random number engine is more useful.
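For example, a hash-based sketch (assuming the splitmix64 finalizer as the mixer; the field-index convention is invented here): hash (seed, fieldIndex) so each attribute gets its own independent uniform value, with no engine state to carry around.

#include <cstdint>

// splitmix64 finalizer: a well-known 64-bit mixing function.
uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Uniform value in [0,1) for attribute number `field` of the object
// built from `seed` -- any field can be derived in any order.
double fieldUniform(uint64_t seed, uint64_t field) {
    uint64_t h = mix64(seed ^ mix64(field));
    return (h >> 11) * (1.0 / 9007199254740992.0);   // top 53 bits / 2^53
}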
XML, JSON and other formats can be read using third-party libraries in C++. However, if the difficulty tops out at storing age ranges and probabilities, it can be better for portability and dependency management to implement the parser yourself.
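A minimal hand-rolled parser sketch for lines like "30-40: 50%" (the file format and names are invented for illustration):

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

struct Bucket { int lo, hi; double pct; };

// Reads lines like "30-40: 50%" and ignores anything else.
std::vector<Bucket> loadBuckets(const std::string& path) {
    std::vector<Bucket> out;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        Bucket b;
        if (std::sscanf(line.c_str(), "%d-%d: %lf%%", &b.lo, &b.hi, &b.pct) == 3)
            out.push_back(b);
    }
    return out;
}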
If I understand correctly, the age distribution is occupation-dependent. Once you derive the correct distribution for the occupation (either explicitly defined in a settings file, as in point 2, or calculated from a formula), you can get the right age from a single pseudo-random number generated by the hash function.
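A sketch of that dependency (the table layout and names are invented): look up the bucket set belonging to the occupation, then draw the age from it.

#include <map>
#include <random>
#include <string>
#include <vector>

struct Bucket { int lo, hi; double pct; };

// Pick an age from the bucket set that belongs to the given occupation.
int ageForOccupation(const std::string& occupation,
                     const std::map<std::string, std::vector<Bucket>>& table,
                     std::mt19937& eng) {
    const std::vector<Bucket>& buckets = table.at(occupation);
    double p = std::uniform_real_distribution<double>(0.0, 100.0)(eng);
    double cum = 0.0;
    for (const Bucket& b : buckets) {
        cum += b.pct;
        if (p < cum)
            return std::uniform_int_distribution<int>(b.lo, b.hi)(eng);
    }
    return buckets.back().hi;
}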
input:

1 - - GET hm_brdr.gif
2 - - GET s102382.gif
3 - - GET bg_stars.gif
3 - - GET phrase.gif

after the map-reduce grouping step:

("1", {"- - GET hm_brdr.gif"})
("2", {"- - GET s102382.gif"})
("3", {"- - GET bg_stars.gif", "- - GET phrase.gif"})
I want to make the first-column values 1, 2, 3, ... anonymous using random integers. But the mapping must be consistent: it shouldn't map 1 -> x on one line and 1 -> t on another. So my solution is to replace the keys with random integers (rand(1)=x, rand(2)=y, ...) in the reduce step, then ungroup the values with their new keys and write them to files again, as shown below.
output file
x - - GET hm_brdr.gif
y - - GET s102382.gif
z - - GET bg_stars.gif
z - - GET phrase.gif
My question is: is there a better way of doing this in terms of running time?
If you want to assign a random integer to a key, you'll have to do it in a reducer, where all key/value pairs for that key are gathered in one place. As @jason pointed out, you don't want to assign a random number, since there's no guarantee that a particular random number won't be chosen for two different keys. What you can do instead is increment a counter held as an instance variable on the reducer to get the next available number to associate with a key. If you have a small amount of data, a single reducer can be used and the numbers will be unique. If you're forced to use multiple reducers, you'll need a slightly more complicated technique. Use
Context.getTaskAttemptID().getTaskID().getId()
to get a unique reducer number with which to calculate an overall unique number for each key.
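To make the scheme concrete, a sketch in plain C++ (not Hadoop code; names are invented): each reducer numbers its keys locally with a counter, and striding by the total reducer count keeps the id spaces disjoint.

#include <cstdint>

// Reducer `reducerId` hands out local numbers 0, 1, 2, ...; striding by
// the total reducer count means no two reducers can produce the same id.
uint64_t uniqueId(uint64_t localCounter, uint32_t reducerId, uint32_t numReducers) {
    return localCounter * numReducers + reducerId;
}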
There is no way this is a bottleneck to your MapReduce job. More precisely, the runtime of your job is dominated by other concerns (network and disk I/O, etc.). A quick little key function? Meh.
But that's not even the biggest issue with your proposal. The biggest issue with your proposal is that it's doomed to fail. What is a key fact about keys? They serve as unique identifiers for records. Do random number generators guarantee uniqueness? No.
In fact, pretend for just a minute that your random key space has 365 possible values. It turns out that if you generate a mere 23 random keys, you are more likely than not to have a key collision; welcome to the birthday paradox. And all of a sudden you've lost the whole point of keys in the first place, because you've started smashing together records by giving two that shouldn't share a key the same key!
And you might be thinking, well, my key space isn't as small as 365 possible keys, it's more like 2^32 possible keys, so I'm, like, totally in the clear. No. After approximately 77,000 keys you're more likely than not to have a collision.
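A quick sanity check of that figure (my addition, using the standard birthday-bound approximation): with m possible keys, a collision becomes more likely than not after about sqrt(2 * m * ln 2) draws.

#include <cmath>
#include <cstdio>

int main() {
    double m = 4294967296.0;                        // 2^32 possible keys
    double n = std::sqrt(2.0 * m * std::log(2.0));  // 50% collision threshold
    std::printf("collision more likely than not after ~%.0f keys\n", n);  // ~77163
    return 0;
}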
Your idea is just completely untenable because it's the wrong tool for the job. You need unique identifiers. Random doesn't guarantee uniqueness. Get a different tool.
In your case, you need a function that is injective on your input key space (that is, it guarantees that f(x) != f(y) if x != y). You haven't given me enough details to propose anything concrete, but that's what you're looking for.
And seriously, there is no way that performance of this function will be an issue. Your job's runtime really will be completely dominated by other concerns.
Edit:
To respond to your comment:
here i am actually trying to make the ip numbers anonymous in the log files, so if you think there is a better way i ll be happy to know.
First off, we have a serious XY problem here. You should have asked that question instead, or searched for existing answers to it. Anonymizing IP addresses, or anything for that matter, is hard. You haven't even told us the criteria for a "solution" (e.g., who are the attackers?). I recommend taking a look at this answer on the IT Security Stack Exchange site.
I'm using the Mersenne Twister algorithm to shuffle playing cards. Each time the deck needs to be shuffled I seed it with time(NULL) + deckCutCardNumber, which is where the user chose to cut the deck. Would I get better results from seeding it only for the first hand and continuing to generate from it, or is this method more random?
Thanks
Only seed the PRNG once. The statistical properties of the generated sequence are only guaranteed after the seed. If you reseed every time, the resulting sequence may not have any predictable statistical properties.
For instance, consider a PRNG which always returns the seed value itself as the first number in the sequence, but which is perfectly uniform over its range. This constitutes a great PRNG, as long as you don't use the first number. However, if you reseed it before every use, say to an incrementing counter value, you have no randomness at all!
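A minimal sketch of the seed-once pattern (assuming C++11's <random>; names invented):

#include <algorithm>
#include <random>
#include <vector>

int main() {
    std::mt19937 eng(std::random_device{}());          // seed exactly once
    std::vector<int> deck(52);
    for (int i = 0; i < 52; ++i) deck[i] = i;
    for (int hand = 0; hand < 10; ++hand)
        std::shuffle(deck.begin(), deck.end(), eng);   // reuse the engine, don't reseed
    return 0;
}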
Assuming the user doesn't mess with the clock (or carefully reduce their cut number by exactly the time that has passed), they'll never see a repeated state of the PRNG anyway, so it doesn't make much difference what you do. You'll get a reasonable distribution out of the Mersenne Twister from any seed value[*], and at any feasible number of steps after re-seeding.
If you're keen to reseed, though, you could combine both approaches by seeding with the time, plus the user-chosen number, plus an output taken from the generator just before reseeding. That combines (part of, not all) the current state of the PRNG with the new seed data, so to some degree all of the past times and cut values (and number of uses of the PRNG) can affect the state, not just the most recent. Pouring more information into the seed value in this way could be considered "more random" than a seed involving less information and hence fewer plausible values.
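A one-function sketch of that combined reseed (deckCutCardNumber is the cut position from the question; the rest of the names are mine):

#include <cstdint>
#include <ctime>
#include <random>

// Fold one output of the current engine into the new seed so that past
// state still influences the sequence after reseeding.
void combinedReseed(std::mt19937& eng, unsigned deckCutCardNumber) {
    eng.seed(static_cast<uint32_t>(std::time(nullptr)) + deckCutCardNumber + eng());
}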
The only thing about Mersenne Twister in particular is that if you can observe 600-odd outputs of it, then you can deduce its internal state and predict the rest of the output until it's reseeded. Then again, you probably wouldn't use MT for an application where that sort of thing matters: if you're relying on the reseed in any way then you should probably use a more secure PRNG to begin with. Clearly it doesn't matter for your application if the user can predict the values out of the PRNG, since the user knows the time just as well as you do. All of this tells you that it shouldn't matter how it's seeded, just so long as it isn't seeded with exactly the same value so that two games are identical. Hence it doesn't matter whether it's reseeded either.
[*] That's not strictly true, there are classes of weak seeds for MT. But as long as you take that into account when seeding (for instance, hash the seed before use so that bad values are unlikely to crop up by chance), you work around that.
It will be less random if you seed off of the user's choice every time than if you only seed once. The reason is that the choice of cut will probably have a skewed distribution (maybe cutting at the 10th card is the most likely, etc.). If you want to continuously seed, you should use something like the system time as the seed.
Yes, you would get better results when not seeding every time. That's the purpose of a (good) random number generator.
In this special case the first value would just increase by the time you waited between shuffles, while a continuously running RNG would give you numbers across its whole range.
It's neither more nor less random. It's not really random at all anyway, but you won't notice any difference if you reseed it every time or not.
However, I'd recommend against it, because time has a resolution of one second: if you call it twice in the same second, you'll get the same seed, and hence the same numbers from the RNG. Then there's distribution and all that.
I would suggest initializing the PRNG for each shuffle for a completely different reason: It allows you to quantify the state of the deck using only the seed, which means you can provide the seed to the user, or log it, or whatever suits, and be able to easily recreate the hand as dealt at a later stage.
You really should avoid seeding based on time, though - it's generally a better idea to use a source of randomness such as /dev/urandom instead.
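A sketch of that reseed-per-shuffle idea (names invented): draw a fresh seed from an entropy source for each deal and log it, so the exact hand can be replayed later.

#include <cstdio>
#include <random>

// One engine per shuffle, seeded from a real entropy source. Logging the
// seed is what makes the deal reproducible later.
std::mt19937 makeShuffleEngine() {
    std::random_device rd;
    unsigned int seed = rd();
    std::printf("deal seed: %u\n", seed);   // store this to recreate the hand
    return std::mt19937(seed);
}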
Edit: Another argument for re-seeding occurs if you're worried about players guessing the internal state and therefore knowing what cards will be dealt in future. This is possible after observing 624 outputs from the Mersenne Twister (at least according to Wikipedia); this is only possible if you reuse the same PRNG. If this does matter, though, you certainly shouldn't be seeding based on time, and you should probably be using a cryptographically secure PRNG anyway.
Re-seeding the random number generator will not give you any higher quality random numbers than seeding it once (quite the contrary in many cases, depending on your seed values).
Possible Duplicates:
Unique (non-repeating) random numbers in O(1)?
How do you efficiently generate a list of K non-repeating integers between 0 and an upper bound N
I want to generate random numbers in a certain range, and I must be sure that each new number is not a duplicate of a former one. One solution is to store the previously generated numbers in a container and check each new number against it: if the number is already in the container, generate again; otherwise use it and add it to the container. But with each new number this operation becomes slower and slower. Is there a better approach, or any rand function that can work faster and ensure uniqueness?
EDIT: Yes, there is a limit (for example from 0 to 1,000,000,000), but I want to generate 100,000 unique numbers! (It would be great if the solution used Qt features.)
Is there a range for the random numbers? If you have a limit for random numbers and you keep generating unique random numbers, then you'll end up with a list of all numbers from x..y in random order, where x-y is the valid range of your random numbers. If this is the case, you might improve speed greatly by simply generating the list of all numbers x..y and shuffling it, instead of generating the numbers.
I think there are three possible approaches; depending on range size and the performance pattern needed, a different one may fit best.

1. Create a random number and see if it is in a (sorted) list. If not, add it and return it; otherwise try another.
   - The list grows and consumes memory with every number you need: if every number is 32 bits, it grows by at least 32 bits every time.
   - Every new random number increases the hit ratio, which makes it slower.
   - O(n^2), I think.

2. Create a bit array with one bit for every number in the range; mark a bit with 1/true once its number has been returned.
   - Every number now takes only one bit. This can still be a problem if the range is big, but each number allocates just one bit.
   - Every new random number increases the hit ratio, which makes it slower.
   - Roughly O(2n).

3. Pre-populate a list with all the numbers, shuffle it, and return the Nth number.
   - The list will not grow and returning numbers will not get slower, but generating the list may take a long time and a lot of memory.
   - O(1) per number after setup.

Depending on the speed needed, you could store the lists in a database; there's no need for them to be in memory except for speed.
Fill out a list with the numbers you need, then shuffle the list and pick your numbers from one end.
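One refinement worth knowing (my addition, not from the answer above): to draw only k unique numbers from 0..n-1 you need just the first k swap steps of a Fisher-Yates shuffle.

#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Draw k unique numbers from 0..n-1 by running only the first k swap
// steps of a Fisher-Yates shuffle (assumes k <= n).
std::vector<int> sampleUnique(int n, int k, std::mt19937& eng) {
    std::vector<int> pool(n);
    std::iota(pool.begin(), pool.end(), 0);   // fill with 0, 1, ..., n-1
    for (int i = 0; i < k; ++i) {
        std::uniform_int_distribution<int> d(i, n - 1);
        std::swap(pool[i], pool[d(eng)]);
    }
    pool.resize(k);
    return pool;
}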
If you use a simple 32-bit linear congruential RNG (such as the so-called "Minimal Standard"), all you have to do is store the seed value you use and compare each generated number to it. If you ever reach that value again, your sequence is starting to repeat itself and you're out of values. This is O(1), but of course limited to 2^32-1 values (though I suppose you could use a 64-bit version as well).
There is a class of pseudo-random number generators that, I believe, has the properties you want: the Linear congruential generator. If defined properly, it will produce a list of integers from 0 to N-1, with no two numbers repeating until you've used all of the numbers in the list once.
#include <stdint.h>

/*
 * Choose these values as follows:
 *
 * The MODULUS and INCREMENT must be relatively prime.
 * The MULTIPLIER-1 must be divisible by all prime factors of the MODULUS.
 * The MULTIPLIER-1 must be divisible by 4, if the MODULUS is divisible by 4.
 *
 * In addition, the MODULUS must be <= 2**32 (0x0000000100000000ULL).
 *
 * A small example would be 8, 5, 3.
 * A larger example would be 256, 129, 251.
 * A useful example would be 0x0000000100000000ULL, 1664525, 1013904223.
 */
#define MODULUS    (0x0000000100000000ULL)
#define MULTIPLIER (1664525)
#define INCREMENT  (1013904223)

static uint64_t seed;

uint32_t lcg(void) {
    uint64_t temp;
    temp = seed * MULTIPLIER + INCREMENT;  /* 64-bit intermediate product */
    seed = temp % MODULUS;                 /* 32-bit end result           */
    return (uint32_t) seed;
}
All you have to do is choose a MODULUS such that it is larger than the number of numbers you'll need in a given run.
It wouldn't be random if there were such a pattern, would it?
As far as I know, you'd have to store and filter out all unwanted numbers...
#include <algorithm>
#include <vector>

unsigned int N = 1000;
std::vector<unsigned int> vals(N);
for (unsigned int i = 0; i < vals.size(); ++i)
    vals[i] = i;

// Note: std::random_shuffle was deprecated in C++14 and removed in C++17;
// std::shuffle with an explicit engine is the modern replacement.
std::random_shuffle(vals.begin(), vals.end());

unsigned int random_number_1 = vals[0];
unsigned int random_number_2 = vals[1];
unsigned int random_number_3 = vals[2];
// etc.
You could store the numbers in a vector and pick them by random index (0..n-1). After each pick, remove the indexed number from the vector, then generate the next index in the interval 0..n-2, etc.
If they can't be repeated, they aren't random.
EDIT:
Furthermore...
if they can't be repeated, they don't fit in a finite computer
How many random numbers do you need? Maybe you can apply a shuffle algorithm to a precalculated array of random numbers?
There is no way a random generator will output values that depend on previously output values, because then they wouldn't be random. However, you can improve performance by using several pools of random values, each mixed with a different salt value, which divides the quantity of numbers to check by the number of pools you have.
If the range of the random numbers doesn't matter, you could use a really large range and hope you don't get any collisions. If your range is billions of times larger than the number of elements you expect to create, your chances of a collision are small but still there. If the numbers don't need to have an actual random distribution, you could use a two-part number {counter}{random x digits}, which would ensure uniqueness but wouldn't be randomly distributed.
There's not going to be a pure functional approach that isn't O(n^2) on the number of results returned so far - every time a number is generated you will need to check against every result so far. Additionally, think about what happens when you're returning e.g. the 1000th number out of 1000 - you will require on average 1000 tries until the random algorithm comes up with the last unused number, with each attempt requiring an average of 499.5 comparisons with the already-generated numbers.
It should be clear from this that your description as posted is not quite exactly what you want. The better approach, as others have said, is to take a list of e.g. 1000 numbers upfront, shuffle it, and then return numbers from that list incrementally. This will guarantee you're not returning any duplicates, and return the numbers in O(1) time after the initial setup.
You can allocate an array of bits, with one bit for each possible number, and check/set the bit for every generated number. For example, for numbers from 0 to 65535 you need only 8192 bytes (8 KB) of memory.
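A sketch of that bit-array approach (using C++'s std::bitset; the retry loop assumes the range isn't exhausted):

#include <bitset>
#include <cstdint>
#include <random>

std::bitset<65536> seen;   // one bit per possible value: 8 KB total

// Assumes the range is not yet exhausted, otherwise the loop never ends.
uint16_t nextUnique(std::mt19937& eng) {
    std::uniform_int_distribution<unsigned> d(0, 65535);
    unsigned v;
    do { v = d(eng); } while (seen.test(v));   // retry until an unseen value
    seen.set(v);
    return static_cast<uint16_t>(v);
}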
Here's an interesting solution I came up with:
Assume you have the numbers 1 to 1000 and don't have enough memory.
You could put all 1000 numbers into an array and remove them one by one, but you'd run out of memory.
Instead, you could split the range in two, so you have an array holding 1-500 and one empty array.
You can then check whether a number exists in array 1, or doesn't exist in the second array.
So assuming you have 1000 numbers, you get a random number from 1-1000. If it's less than 500, check array 1 and remove it if present. If it's 500 or more and NOT in array 2, you can add it there.
This halves your memory usage.
If you propagate this using recursion, you can split the 500-element array into a 250-element array and an empty one.
Assuming empty arrays use no space, you can decrease your memory usage quite a bit.
Searching will be massively faster too: if you break it down far enough and generate a number such as 29, it's less than 500, less than 250, less than 125, less than 62, less than 31, greater than 15, so after those 6 comparisons you check an array containing an average of 16/2 = 8 items in total.
I should patent this search, although I bet it already exists!
Especially given the desired number of values, you want a Linear Feedback Shift Register.
Why?
No shuffle step, nor a need to keep track of values you've already hit. As long as you go less than the full period, you should be fine.
It turns out that the Wikipedia article has some C++ code examples which are more tested than anything I would give you off the top of my head. Note that you'll want to be pulling values from inside the loops -- the loops just iterate the shift register through. You can see this in the snippet here.
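For reference, a minimal Galois LFSR sketch (the textbook 16-bit example with taps 16/14/13/11, as on the Wikipedia page): starting from any nonzero state it cycles through all 65535 nonzero 16-bit states before repeating, and never produces zero.

#include <cstdint>

// 16-bit Galois LFSR with taps 16/14/13/11.
uint16_t lfsrNext(uint16_t state) {
    unsigned lsb = state & 1u;
    state >>= 1;
    if (lsb) state ^= 0xB400u;   // apply the feedback taps
    return state;
}

// Usage: uint16_t s = 0xACE1u; s = lfsrNext(s); // each value unique within the cycle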
(Yes, I know this was mentioned, briefly in the dupe -- saw it as I was revising. Given it hasn't been brought up here and is the best way to solve the poster's question, I think it should be brought up again.)
Let's say size = 100,000; create an array of that size. Generate random numbers and put them into the array. Which index does a number go to? randomNumber % size gives you the index.
When you put the next number, compute its index the same way and check whether a value already exists there. If it doesn't, put it there; if it does, generate a new number and try again. This way you can create the numbers very fast. The disadvantage is that you will never get two numbers whose value modulo size is the same, i.e. whose last section matches.
For example, given
1231232444556
3458923444556
you will never have both numbers in your list, even though they are totally different, because their last sections are the same.
First off, there's a huge difference between random and pseudorandom. There's no way to generate perfectly random numbers from a deterministic process (such as a computer) without bringing in some physical process like latency between keystrokes or another entropy source.
The approach of saving all the numbers generated will slow down the computation rather quickly: the more numbers you have, the larger your storage needs, until you've filled up all available memory. A better method would be (as someone's already suggested) using a well-known pseudorandom number generator such as the linear congruential generator; it's super fast, requiring only modular multiplication and addition, and the theory behind it gets a lot of mention in Vol. 2 of Knuth's TAOCP. That way, the theory involved guarantees a rather large period before repetition, and the only storage needed is for the parameters and the seed used.
If it's acceptable that one value can be calculated from the previous one, an LFSR or LCG is fine. If you don't want one output value to be derivable from another, you can use a block cipher in counter mode to generate the output sequence, given that the cipher's block length is equal to the output length.
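A toy sketch of that idea (a 4-round Feistel network standing in for a real block cipher; the round function and keys are arbitrary and this is NOT cryptographically secure): a Feistel network is a bijection on its block, so encrypting the counter values 0, 1, 2, ... yields outputs that never repeat.

#include <cstdint>

// Cheap integer mixing for the Feistel round function.
static uint16_t roundF(uint16_t half, uint32_t key) {
    uint32_t x = (half ^ key) * 0x45D9F3Bu;
    return static_cast<uint16_t>(x ^ (x >> 13));
}

// Toy 4-round Feistel network over 32 bits: a bijection, so
// feistel32(0), feistel32(1), ... are all distinct.
uint32_t feistel32(uint32_t counter) {
    static const uint32_t keys[4] = {0xA511E9B3u, 0x9F3C0D21u,
                                     0x7B891DCEu, 0x52D1E4AFu};
    uint16_t left  = static_cast<uint16_t>(counter >> 16);
    uint16_t right = static_cast<uint16_t>(counter & 0xFFFFu);
    for (uint32_t k : keys) {
        uint16_t tmp = right;
        right = left ^ roundF(right, k);
        left  = tmp;
    }
    return (static_cast<uint32_t>(left) << 16) | right;
}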
Use the HashSet<T> generic class. This class never contains duplicate values: you can add all of your generated numbers to it and then check whether a value already exists. A HashSet can determine the existence of an item very quickly, and lookups don't slow down as the set grows, which is its biggest advantage here.
For example:

HashSet<int> array = new HashSet<int>();
array.Add(1);
array.Add(2);
array.Add(1);   // ignored: the set already contains 1

foreach (var item in array)
{
    Console.WriteLine(item);   // prints 1 and 2
}
Console.ReadKey();