unordered_map maxes out unique keys only with 16 character strings

This code generates a random 16-character string using only A,C,T,G. It then checks whether this sequence is in the hash (unordered_map), and if not, inserts it and points to a dummy placeholder.
In its current form, it hangs at datact=16384 even though the outer for loop is supposed to run 20000 iterations, and despite the fact that there are 4^16 possible strings over A, C, T, G.
But if the string length is changed to anything from 8 to 15, or to 17, 18 and so on, it correctly iterates to 20000. Why does unordered_map refuse to hash new sequences, but only when those sequences are 16 characters long?
#include <string>
#include <vector>
#include <unordered_map>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
string funnelstring;
srand ( time(NULL) );
const int buffersize=10000;
int currentsize=buffersize;
int datact=0;
vector <unsigned int> ctarr(buffersize);
vector <char> nuc(4);
nuc[0]='A';
nuc[1]='C';
nuc[2]='T';
nuc[3]='G';
unordered_map <string,unsigned int*> location;
unsigned int sct;
sct=1;
for (int i=0;i<20000; i++)
{
do
{
funnelstring="";
for (int i=0; i<16; i++)
{ // generate random 16 nucleotide sequence
funnelstring+=nuc[(rand() % 4)];
}
} while (location.find(funnelstring) != location.end()); //asks whether this key has been assigned
ctarr[datact]=sct;
location[funnelstring]=&ctarr[datact]; //assign current key to point to data count
datact++;
cout << datact << endl;
if (datact>=currentsize)
{
ctarr.resize(currentsize+buffersize);
currentsize+=buffersize;
}
}
return 0;
}

As @us2012 said, the problem is your PRNG, and the poor randomness in the lower order bits. Here's a relevant quote:
In Numerical Recipes in C: The Art of Scientific Computing (William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling; New York: Cambridge University Press, 1992 (2nd ed., p. 277)), the following comments are made:
"If you want to generate a random integer between 1 and 10, you should always do it by using high-order bits, as in
j = 1 + (int) (10.0 * (rand() / (RAND_MAX + 1.0)));
and never by anything resembling
j = 1 + (rand() % 10);
(which uses lower-order bits)."
Also, as others have pointed out, you can also use a better, more modern RNG.

The culprit is very likely your random number generator, i.e. the sequence of random numbers from the PRNG became periodic (mod 4) too quickly (most random number generators really produce pseudo-random numbers, hence the name PRNG). Therefore, your do...while loop never quits as it is unable to find a new nucleotide sequence with the random numbers provided.
Two fixes I can think of:
Instead of generating random numbers mod 4, generate them mod 4^length and extract the bit pairs, 00 -> A, 01 -> G, ...
Use a better PRNG, like std::mersenne_twister_engine.
(Disclaimer: I'm not an expert on random numbers. Don't rely on this advice for mission-critical systems, cryptographic requirements, etc.)

Related

Random ints with different likelihoods

I was wondering if there is a way to generate a random number between A and B where numbers that meet a certain requirement are more likely to appear than the other numbers between A and B. For example, lower numbers are more likely to appear, so if A = 1 and B = 10 then 1 would be the likeliest and 10 the unlikeliest.
All help is appreciated :) (sorry for bad English/grammar/question)

C++11 (which you should absolutely be using by now) added the <random> header to the C++ standard library. This header provides much higher quality random number generators to C++. Using srand() and rand() has never been a very good idea because there's no guarantee of quality, but now it's truly inexcusable.
In your example, it sounds like you want what would probably be called a 'discrete triangular distribution': the probability mass function looks like a triangle. The easiest (but perhaps not the most efficient) way to implement this in C++ would be the discrete distribution included in <random>:
#include <iostream>
#include <map>
#include <numeric>
#include <random>
#include <vector>
auto discrete_triangular_distribution(int max) {
    std::vector<int> weights(max);
    // weights are 0, 1, ..., max-1, so index 0 is never produced
    std::iota(weights.begin(), weights.end(), 0);
    std::discrete_distribution<> dist(weights.begin(), weights.end());
    return dist;
}
int main() {
    std::random_device rd;
    std::mt19937 gen(rd());
    auto&& dist = discrete_triangular_distribution(10);
    std::map<int, int> counts;
    for (int i = 0; i < 10000; i++)
        ++counts[dist(gen)];
    // braces are required here: both output statements belong to the loop
    for (auto count: counts) {
        std::cout << count.first << " generated ";
        std::cout << count.second << " times.\n";
    }
}
which for me gives the following output:
1 generated 233 times.
2 generated 425 times.
3 generated 677 times.
4 generated 854 times.
5 generated 1130 times.
6 generated 1334 times.
7 generated 1565 times.
8 generated 1804 times.
9 generated 1978 times.
Things more complex than this would be better served either by using one of the existing distributions (I have been told that all commonly used statistical distributions are included) or by writing your own distribution, which isn't too hard: it just has to be an object with a function call operator that takes a random bit generator and uses those bits to produce (in this case) random numbers. But you could create one that made random strings, or any arbitrary random objects, perhaps for testing purposes.

Your question doesn't specify which distribution to use. One option (of many) is to use the (negative) exponential distribution. This distribution is parameterized by a parameter λ. For each value of λ, the maximum result is unbounded (which needs to be handled in order to return results only in the range specified), so any λ could theoretically work; however, the properties of the CDF imply that it pays to choose something on the order of 1 / (to - from + 1).
(Figures of the distribution omitted; originals from Wikipedia, by Skbkekas, CC BY 3.0.)
The following class works like a standard library distribution. Internally, it generates numbers in a loop, until a result in [from, to] is obtained.
#include <iostream>
#include <iomanip>
#include <string>
#include <map>
#include <random>
#include <vector>
class bounded_discrete_exponential_dist {
public:
explicit bounded_discrete_exponential_dist(std::size_t from, std::size_t to) :
m_from{from}, m_to{to}, m_d{0.5 / (to - from + 1)} {}
explicit bounded_discrete_exponential_dist(std::size_t from, std::size_t to, double factor) :
m_from{from}, m_to{to}, m_d{factor} {}
template<class Gen>
std::size_t operator()(Gen &gen) {
while(true) {
const auto r = m_from + static_cast<std::size_t>(m_d(gen));
if(r <= m_to)
return r;
}
}
private:
std::size_t m_from, m_to;
std::exponential_distribution<> m_d;
};
Here is an example of using it:
int main()
{
std::random_device rd;
std::mt19937 gen(rd());
bounded_discrete_exponential_dist d{1, 10};
std::vector<std::size_t> hist(10, 0);
for(std::size_t i = 0; i < 99999; ++i)
++hist[d(gen) - 1];
for(auto h: hist)
std::cout << std::string(static_cast<std::size_t>(80 * h / 99999.), '+') << std::endl;
}
When run, it outputs a histogram like this:
$ ./a.out
++++++++++
+++++++++
+++++++++
++++++++
+++++++
+++++++
+++++++
+++++++
++++++
++++++

Your basic random number generator should produce high-quality, uniform random numbers on 0 to 1 - epsilon. You then transform them to get the distribution you want. The simplest transform is of course (int) (p * N) in the common case of needing an integer on 0 to N - 1.
But there are many many other transforms you can try. Take the square root, for example, to bias it to 1.0, then 1 - p to set the bias towards zero. Or you can look up the Poisson distribution, which might be what you are after. You can also use a half-Gaussian distribution (statistical bell curve with the zero entries cut off, and presumably also the extreme tail of the distribution as it goes out of range).
There can be no right answer. Try various things, plot out ten thousand or so values, and pick the one that gives results you like.

You can make an array of values, the more likely value has more indexes and then choose a random index.
example:
int random[55];
int result;
int index = 0;
for (int i = 1 ; i <= 10 ; ++i)
for (int j = i ; j <= 10 ; ++j)
random[index++] = i;
result = random[rand() % 55];
Alternatively, you can draw a random number twice: first choose the maximum, then choose your random number at or below it:
int max = rand() % 10 + 1; // This is your max value
int random = rand() % max + 1; // This is your result
Both ways will make 1 more likely than 2 , 2 more likely than 3 ... 9 more likely than 10.

Trying to produce a unique sequence of random numbers per iteration

As the title states, I'm trying to create a unique sequence of random numbers every time I run this little program.
However, sometimes I get results like:
102
201
102
The code
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;
int main() {
for (int i = 0; i < 3; i++) {
srand (time(NULL)+i);
cout << rand() % 3;
cout << rand() % 3;
cout << rand() % 3 << '\n' << endl;
}
}
Clearly srand doesn't have quite the magical functionality I wanted it to. I'm hoping that there's a logical hack around this though?
Edit1: To clarify, this is just a simple test program for what will be implemented on a larger scale. So instead of 3 iterations of rand%3, I might run 1000, or more of rand%50.
If I see 102 at some point in its operation, I'd want it so that I never see 102 again.

First of all, if you were going to use srand/rand, you'd want to seed it once (and only once) at the beginning of each execution of the program:
int main() {
    srand(time(NULL)); // seed once, before the loop
    for (int i = 0; i < 3; i++) {
        cout << rand() % 3;
        cout << rand() % 3;
        cout << rand() % 3 << '\n' << endl;
    }
}
Second, time typically only produces a result with a resolution of one second, so even with this correction, if you run the program twice in the same second, you can expect it to produce identical results in the two runs.
Third, you don't really want to use srand/rand anyway. The random number generators in <random> are generally considerably better (and, perhaps more importantly, are better defined, so they represent a much better-known quantity).
#include <random>
#include <iostream>
int main() {
std::mt19937_64 gen { std::random_device()() };
std::uniform_int_distribution<int> d(0, 2);
for (int i = 0; i < 3; i++) {
for (int j=0; j<3; j++)
std::cout << d(gen);
std::cout << "\n";
}
}
Based on the edit, however, this still isn't adequate. What you really want is a random sample without duplication. To get that, you need to do more than just generate numbers. Randomly generated numbers not only can repeat, but inevitably will repeat if you generate enough of them (but the likelihood of repetition becomes quite high even when it's not yet inevitable).
As long as the number of results you're producing is small compared to the number of possible results, you can pretty easily just store results in a set as you produce them, and only treat a result as actual output if it wasn't previously present in the set:
#include <random>
#include <iostream>
#include <set>
#include <iomanip>
int main() {
std::mt19937_64 gen { std::random_device()() };
std::uniform_int_distribution<int> d(0, 999);
std::set<int> results;
for (int i = 0; i < 50;) {
int result = d(gen);
if (results.insert(result).second) {
std::cout << std::setw(5) << result;
++i;
if (i % 10 == 0)
std::cout << "\n";
}
}
}
This becomes quite inefficient if the number of results approaches the number of possible results. For example, let's assume you're producing numbers from 1 to 1000 (so 1000 possible results). Consider what happens if you decide to produce 1000 results (i.e., all possible results). In this case, when you're producing the last result, there's really only one possibility left--but rather than just producing that one possibility, you produce one random number after another after another, until you stumble across the one possibility that remains.
For such a case, there are better ways to do the job. For example, you can start with a container holding all the possible numbers. To generate an output, you generate a random index into that container. You output that number, and remove that number from the container, then repeat (but this time, the container is one smaller, so you reduce the range of your random index by one). This way, each random number you produce gives one output.
It is possible to do the same by just shuffling an array of numbers. This has two shortcomings though. First, you need to shuffle them correctly--a Fisher-Yates shuffle works nicely, but otherwise it's easy to produce bias. Second, unless you actually do use all (or very close to all) the numbers in the array, this is inefficient.
For an extreme case, consider wanting a few (10, for example) 64-bit numbers. In this case, you start by filling an array with the numbers from 0 to 2^64-1. You then do 2^64-2 swaps. So, you're doing roughly 2^65 operations just to produce 10 numbers. In this extreme of a case, the problem should be quite obvious. Although it's less obvious if you produce (say) 1000 numbers of 32 bits apiece, you still have the same basic problem, just to a somewhat lesser degree. So, while this is a valid way to do things for a few specific cases, its applicability is fairly narrow.

Generate an array containing the 27 three digit numbers whose digits are less than 3. Shuffle it. Iterate through the shuffled array as needed, values will be unique until you've exhausted them all.
As other people have pointed out, don't keep reseeding your random number generator. Also, rand is a terrible generator, you should use one of the better choices available in C++'s standard libraries.

You are effectively generating a three digit base 3 number. Use your RNG of choice to generate a base 10 number in the range 0 .. 26 and convert it to base 3. That gives 000 .. 222.
If you absolutely must avoid repeats, then shuffle an array as pjs suggests. That will result in later numbers being 'less random' than the earlier numbers because they are taken from a smaller pool.

Random list of numbers

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include <math.h>
int main()
{
int i;
int diceRoll;
for(i=0; i < 20; i++)
{
printf("%d \n", rand());
}
return 0;
}
This is the code I wrote in c (codeblocks) to get random numbers, the problem is I always get the same sequence: 41,18467,6334,26500 etc...
I'm still learning so please try to explain like you're talking with a 8 year old D:

You get the same sequence each time because the seed for the random number generator isn't set. You need to call srand(time(NULL)) like this:
int main()
{
srand(time(NULL));
....

Random number generators are pseudorandom. What this means is that they use some "algorithm" to come up with the next "random" number. In other words, if you start with the same seed to this algorithm, you get the same sequence of random numbers each time. To solve this, you have to make sure to seed your random number generator. Sometimes it is desirable to use the same seed so that you can check whether the logic of your program is working correctly. Either way, one common way that folks seed their programs is through the use of time(NULL). time gives the time elapsed (in seconds) since the epoch. What this means is that its value changes every second. Thus, if you seed your random number generator with srand(time(NULL)) at the beginning of the program, you'll get a different random number sequence for every different second in which you run your program. Be sure not to seed for every random number that you request. Just do this once at the very beginning of your code and then leave it alone.
Your title says C# but I've answered with C++. You'll want to include ctime for this. It may also be beneficial to look at the new style of random number generation, as rand() isn't very random these days. Look into #include <random> and make yourself an engine and distribution to pull random numbers through. Don't forget to seed there as well!

First of all, seed your random function by including <ctime> and calling srand(time(NULL));.
Secondly, you need a modulo if you're going to call rand(), for example: rand() % x will return a random number from 0 to x-1. Since you're simulating dice rolls, do rand() % 6 + 1.

The line srand((unsigned)time(NULL)) must be outside the loop; it should appear just once in your code.
The modulo rand()%10 means you get any number from 0 up to one less than the value you take the modulo by. So in this case 0-9; if you want 1-10 you do: rand()%10 + 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main()
{
    int i;
    srand((unsigned)time(NULL)); /* seed once, before the loop */
    for(i=0; i < 20; i++)
    {
        printf("%d \n", rand() % 10); //Gets you numbers 0-9
    }
    return 0;
}

rand() gives still the same value

I noticed that while practicing by doing a simple console-based quiz app. When I'm using rand() it gives me the same value several times in a row. The smaller number range, the bigger the problem is.
For example
for (i=0; i<10; i++) {
x = rand() % 20 + 1;
cout << x << ", ";
}
Will give me 1, 1, 1, 2, 1, 1, 1, 1, 14, - there are definitely too many ones, right? I usually get between none and 4 different numbers (the rest are all the same; it can also be 11, 11, 11, 4, 11 ...)
Am I doing something wrong? Or is rand() not as random as I thought it was?
(Or is it just some habit from C#/Java that I'm not aware of? It happens a lot to me, too...)

If I run that code a couple of times, I get different output. Sure, not as varied as I'd like, but seemingly not deterministic (although of course it is, since rand() only gives pseudo-random numbers...).
However, the way you treat your numbers isn't going to give you a uniform distribution over [1,20], which I guess is what you expect. To achieve that is rather more complicated, but in no way impossible. For an example, take a look at the documentation for <random> at cplusplus.com - at the bottom there's a showcase program that generates a uniform distribution over [0,1). To get that to [1,20), you simply change the input parameters to the generator - it can give you a uniform distribution over any range you like.
I did a quick test, and called rand() one million times. As you can see in the output below, even at very large sample sizes, there are some nonuniformities in the distribution. As the number of samples goes to infinity, the line will (probably) flatten out, but using something like rand() % 20 + 1 gives you a distribution that takes a very long time to do so. If you take something else (like the example above) your chances are better at achieving a uniform distribution even for quite small sample sizes.
Edit:
I see several others posting about using srand() to seed the random number generator before using it. This is good advice, but it won't solve your problem in this case. I repeat: seeding is not the problem in this case.
Seeds are mainly used to control the reproducibility of the output of your program. If you seed your random number with a constant value (e.g. 0), the program will give the same output every time, which is useful for testing that everything works the way it should. By seeding with something non-constant (the current time is a popular choice) you ensure that the results vary between different runs of the program.
Not calling srand() at all is the same as calling srand(1), by the C++ standard. Thus, you'll get the same results every time you run the program, but you'll have a perfectly valid series of pseudo-random numbers within each run.

Sounds like you're hitting modulo bias.
Scaling your random numbers to a range by using % is not a good idea. It's just about passable if you're reducing it to a range that is a power of 2, but still pretty poor. It is primarily influenced by the lower bits, which are frequently less random with many algorithms (and rand() in particular), and it contracts to the smaller range in a non-uniform fashion because the range you're reducing to will not equally divide the range of your random number generator. To reduce the range you should be using a division and loop, like so:
// generate a number from 0 to range-1
int divisor = RAND_MAX/(range+1);
int result;
do
{
    result = rand()/divisor;
} while (result >= range); // reject the few values that fall outside the range
This is not as inefficient as it looks because the loop is nearly always passed through only once. Also, if you're ever going to use your generator for numbers that approach RAND_MAX, you'll need a more complex equation for divisor which I can't remember off-hand.
Also, rand() is a very poor random number generator, consider using something like a Mersenne Twister if you care about the quality of your results.

You need to call srand() first and give it the time for parameter for better pseudorandom values.
Example:
#include <iostream>
#include <string>
#include <vector>
#include "stdlib.h"
#include "time.h"
using namespace std;
int main()
{
srand(time(0));
int x,i;
for (i=0; i<10; i++) {
x = rand() % 20 + 1;
cout << x << ", ";
}
system("pause");
return 0;
}
If you don't want any of the generated numbers to repeat and memory isn't a concern you can use a vector of ints, shuffle it randomly and then get the values of the first N ints.
Example:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
int main()
{
//Get 5 random numbers between 1 and 20
vector<int> v;
for(int i=1; i<=20; i++)
v.push_back(i);
random_shuffle(v.begin(),v.end());
for(int i=0; i<5; i++)
cout << v[i] << endl;
system("pause");
return 0;
}

The likely problems are that you are using the same "random" numbers each time and that any int mod 1 is zero. In other words (myInt % 1 == 0) is always true. Instead of %1, use % theBiggestNumberDesired.
Also, seed your random numbers with srand. Use a constant seed to verify that you are getting good results. Then change the seed to make sure you are still getting good results. Then use a more random seed like the clock to test further. Release with the random seed.

C++ random number generator without repeating numbers

I have searched high and low for a type of function that turns this code
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
using namespace std;
void ran(int array[], int max);
int main() {
printf("Today's lottery numbers are:\n");
for (int i = 0; i < 6; i++)
srand((unsigned)(NULL));
}
into a random number generator that ensures no repeating numbers. Can someone help me with it? After the check I plan to print it with printf("%d\n", rand()%50);
I just need a routine that makes sure it's non-repeating. Please, if you can give me a routine I would be greatly relieved and will be sure to pay it forward.
Thanks. The libraries don't seem to be displaying right on this screen, but they are stdio, stdlib and time, and I'm using namespace std.

Why not just use what's already in the STL? Looking at your example code, and assuming it's somewhat representative of what you want to do, everything should be in there. (I assume you need a relatively small range of numbers, so memory wouldn't be a constraint)
Using std::random_shuffle, and an std::vector containing the integers in the range you wish your numbers to be in, should give you a sequence of unique random numbers that you need in your example code.
You will still have to call srand once, and once only, before using std::random_shuffle. Not multiple times like you're doing in your current code example.

If your range of random numbers is finite and small, say you have X different numbers:
Create an array with every single number
Select a random index I between 0 and X-1, and take the value at that index
Move the last value (at index X-1) into position I
Decrease X and repeat

You should only call srand once in your code and you should call it with a "random" seed like time(NULL).
By calling srand within the loop, and calling it with a 0 seed each time, you'll get six numbers exactly the same.
However, even with those fixes, rand()%50 may give you the same number twice. What you should be using is a shuffle algorithm like this one since it works exactly the same as the lottery machines.
Here's a complete program showing that in action:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static void getSix (int *dst) {
int sz, pos, i, src[50];
for (i = 0; i < sizeof(src)/sizeof(*src); i++)
src[i] = i + 1;
sz = 50;
for (i = 0; i < 6; i++) {
pos = rand() % sz;
dst[i] = src[pos];
src[pos] = src[sz-1];
sz--;
}
}
int main (void) {
srand (time (NULL));
int i, numbers[6];
getSix (numbers);
printf ("Numbers are:\n");
for (i = 0; i < sizeof(numbers)/sizeof(*numbers); i++)
printf (" %d\n", numbers[i]);
return 0;
}
Sample runs:
Numbers are:
25
10
26
4
18
1
Numbers are:
39
45
8
18
17
22
Numbers are:
8
6
49
21
40
28
Numbers are:
37
49
45
43
6
40

I would recommend using a better random number generation algorithm that can offer that internally, rather than using rand.
The problem with rand() and trying to prevent repeats is that finding an unused number will slow down with every number added to the used list, eventually becoming a very long process of finding and discarding numbers.
If you were to use a more complex pseudo-random number generator (and there are many, many available, check Boost for a few), you'll have an easier time and may be able to avoid the repeats altogether. It depends on the algorithm, so you'll need to check the documentation.
To do it without using any additional libraries, you could prefill a vector or list with sequential (or even random) numbers, making sure each number is present once in the list. Then, to generate a number, generate a random number and select (and remove) that item from the list. By removing each item as it's used, so long as every item was present once to begin with, you'll never run into a duplicate.

And if you have access to C++0x you can use the new random generator facilities that wrap all of this junk for you!
http://www2.research.att.com/~bs/C++0xFAQ.html#std-random
