I'm writing some tests for a C++ command-line Linux app. I'd like to generate a bunch of integers with a power-law/long-tail distribution, meaning I get some numbers very frequently but most of them relatively infrequently.
Ideally there would just be some magic equations I could use with rand() or one of the stdlib random functions. If not, an easy-to-use chunk of C/C++ would be great.
Thanks!
This page at Wolfram MathWorld discusses how to get a power-law distribution from a uniform distribution (which is what most random number generators provide).
The short answer (derivation at the above link):
x = [(x1^(n+1) - x0^(n+1))*y + x0^(n+1)]^(1/(n+1))
where y is a uniform variate, n is the distribution power, x0 and x1 define the range of the distribution, and x is your power-law distributed variate.
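For concreteness, here is a minimal C++ sketch of that formula using C++11 <random> (the function name and parameter handling are mine, not from the MathWorld page; n must not equal -1, and x0 must be positive for negative exponents):

#include <cmath>
#include <random>

// Power-law variate via the MathWorld inverse-transform formula above.
// n is the distribution power (negative for a decaying tail), [x0, x1]
// is the range of the distribution.
double power_law_variate(double x0, double x1, double n, std::mt19937& rng)
{
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    double y = u01(rng);                 // uniform variate
    double a = std::pow(x0, n + 1.0);
    double b = std::pow(x1, n + 1.0);
    return std::pow((b - a) * y + a, 1.0 / (n + 1.0));
}

Casting or rounding the result gives the long-tailed integers the question asks about.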
If you know the distribution you want (its Probability Density Function (PDF)) and have it properly normalized, you can integrate it to get the Cumulative Distribution Function (CDF), then invert the CDF (if possible) to get the transformation you need from the uniform [0,1] distribution to your desired one.
So you start by defining the distribution you want.
P = F(x)
(for x in [0,1]) then integrated to give
C(y) = \int_0^y F(x) dx
If C can be inverted you get
y = C^{-1}(u)
So draw a uniform variate u (e.g., call rand() and divide by RAND_MAX), plug it in on the right-hand side of the last line, and use the resulting y.
This result is called the Fundamental Theorem of Sampling. This is a hassle because of the normalization requirement and the need to analytically invert the function.
Alternatively you can use a rejection technique: throw a number uniformly in the desired range, then throw another number and compare it to the PDF at the location indicated by your first throw. Reject if the second throw exceeds the PDF. This tends to be inefficient for PDFs with a lot of low-probability region, like those with long tails...
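A minimal sketch of such a rejection sampler (all names are mine; pdf_max must be an upper bound on the PDF over [x0, x1]):

#include <random>

// Rejection sampling: keep only the throws that land under the PDF curve.
template <typename Pdf>
double rejection_variate(double x0, double x1, double pdf_max, Pdf pdf,
                         std::mt19937& rng)
{
    std::uniform_real_distribution<double> ux(x0, x1);       // first throw
    std::uniform_real_distribution<double> uy(0.0, pdf_max); // second throw
    for (;;) {
        double x = ux(rng);
        if (uy(rng) <= pdf(x))   // accept if under the curve
            return x;
    }
}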
An intermediate approach involves inverting the CDF by brute force: you store the CDF as a lookup table, and do a reverse lookup to get the result.
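A sketch of that lookup-table approach (names and the crude integration scheme are mine):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Tabulate the CDF on a grid, then invert it by reverse lookup.
struct TabulatedCdf {
    std::vector<double> xs, cdf;

    template <typename Pdf>
    TabulatedCdf(double x0, double x1, int bins, Pdf pdf) {
        double dx = (x1 - x0) / bins, c = 0.0;
        for (int i = 0; i <= bins; ++i) {
            double x = x0 + i * dx;
            if (i > 0) c += pdf(x) * dx;    // crude rectangle-rule integral
            xs.push_back(x);
            cdf.push_back(c);
        }
        for (double& v : cdf) v /= cdf.back(); // normalize so the CDF ends at 1
    }

    double operator()(std::mt19937& rng) const {
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        // reverse lookup: find the first CDF entry >= u
        std::size_t i = std::lower_bound(cdf.begin(), cdf.end(), u01(rng))
                        - cdf.begin();
        return xs[i];
    }
};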
The real stinker here is that simple x^-n distributions are non-normalizable on the range [0,1] (the integral diverges at x = 0), so you can't use the sampling theorem there. Try (x+1)^-n instead...
I just wanted to carry out an actual simulation as a complement to the (rightfully) accepted answer. Although in R, the code is so simple as to be (pseudo)-pseudo-code.
One tiny difference between the Wolfram MathWorld formula in the accepted answer and other, perhaps more common, equations is that the power-law exponent n (typically denoted alpha) does not carry an explicit negative sign there. So the chosen alpha value has to be negative, typically between -3 and -2.
x0 and x1 stand for the lower and upper limits of the distribution.
So here it is:
set.seed(0)
x1 = 5 # Maximum value
x0 = 0.1 # Lower limit; can't be zero, since x0^(alpha+1) diverges at 0 for negative exponents
alpha = -2.5 # It has to be negative.
y = runif(1e7) # 1e7 uniform samples
x = ((x1^(alpha+1) - x0^(alpha+1))*y + x0^(alpha+1))^(1/(alpha+1))
plot(density(x), ylab="log density x", col=2)
or plotted in logarithmic scale:
plot(density(x), log="xy", ylab="log density x", col=2)
Here is the summary of the data:
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1000  0.1208  0.1584  0.2590  0.2511  4.9388
I can't comment on the math required to produce a power law distribution (the other posts have suggestions) but I would suggest you familiarize yourself with the TR1 C++ Standard Library random number facilities in <random>. These provide more functionality than std::rand and std::srand. The new system specifies a modular API for generators, engines and distributions and supplies a bunch of presets.
The included distribution presets are:
uniform_int
bernoulli_distribution
geometric_distribution
poisson_distribution
binomial_distribution
uniform_real
exponential_distribution
normal_distribution
gamma_distribution
When you define your power law distribution, you should be able to plug it in with existing generators and engines. The book The C++ Standard Library Extensions by Pete Becker has a great chapter on <random>.
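For a feel of the modular API, a short sketch, written with the C++11 std:: names; under a TR1 implementation the same classes live in namespace std::tr1 and a few are named slightly differently (e.g. uniform_real rather than uniform_real_distribution):

#include <iostream>
#include <random>

int main()
{
    std::mt19937 engine(42);                          // engine: Mersenne Twister
    std::exponential_distribution<double> dist(1.0);  // one of the preset distributions
    for (int i = 0; i < 5; ++i)
        std::cout << dist(engine) << '\n';            // draw by feeding the engine in
}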
Here is an article about how to create other distributions (with examples for Cauchy, Chi-squared, Student t and Snedecor F).
Related
I do understand principal component analysis. I know how to do it and what it actually does. I have applied PCA, and my best result has shown to be two components. I do understand that each of my inputs now contributes partially to each component. What I do not understand is how to feed the result of PCA (in my case, 2 components) to a machine learning model.
How do we input them?
For example, when I want to run a NN on my features, I can just navigate to where they are stored and import them, but my PCA analysis has been run in SPSS, and all it shows me is the contribution of my features to each component.
What should I import to my NN model?
PCA is a method of feature extraction, which is used to avoid the problem of collinearity. For example, if several variables are highly correlated because "they measure the same thing", then PCA can extract a measure of "that thing" (technically: a component), which is called a score. Your data set of, say, 100 measured variables may reduce to, say, 10 significant components. Then you can use the scores your test persons have achieved in those 10 components to do, for example, a multi-dimensional regression, a cluster analysis or a discriminant analysis. This will yield more valid results than performing the analysis directly on the 100 variables.
So the procedure is: sort the eigenvalues (and eigenvectors) by size, identify the number of significant components q (e.g., by a scree plot), set up the projection matrix F (the eigenvectors corresponding to the q largest eigenvalues, as columns) and multiply the data matrix D by it. This gives you the score matrix C = D * F (dimension n times q, with n the number of test persons), which you can use as input for whatever method you want to use next.
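As a sketch of that recipe in code (here in C++ with the Eigen library as an assumed dependency; SPSS and R have this built in):

#include <Eigen/Dense>

// D: n x m data matrix (rows = test persons), assumed column-centered.
// Returns the n x q score matrix C = D * F described above.
Eigen::MatrixXd pca_scores(const Eigen::MatrixXd& D, int q)
{
    Eigen::MatrixXd cov = (D.transpose() * D) / double(D.rows() - 1);
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(cov);
    // Eigen sorts eigenvalues in increasing order, so the eigenvectors
    // for the q largest eigenvalues are the last q columns.
    Eigen::MatrixXd F = es.eigenvectors().rightCols(q);  // projection matrix
    return D * F;                                        // score matrix C
}

The rows of the returned matrix are what you feed to the NN: one q-dimensional score vector per test person.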
I am using MS Visual Studio 2010, and I would like to generate random numbers in the range from 3 to 200 with a log-normal distribution.
I heard that the "central limit theorem" can convert a uniform distribution to a normal distribution, but it seems like too much work, because my range has 198 numbers:
a = random(MaxRange+1); // meaning I have to write this 198 times???!!!!
x = (a + .......)/198; // this will give a number with a normal distribution, right???
Then, may I just write
y = log(x); // and does this mean that y has a log-normal distribution????
thanks for answering my question....
Well, random will give you uniformly distributed random numbers, as you said correctly. In order to generate variables with a normal distribution you can use the Box-Muller transform, which is simple to implement.
Next you need to generate your lognormal variable v by calculating v = exp(mu + sig * n), where n is your normally distributed random variable.
I don't quite understand what you mean by the range 3 to 200, though, as the lognormal distribution has support (0, inf).
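A minimal sketch of that recipe (function names are mine; rand() is kept only to stay close to the question):

#include <cmath>
#include <cstdlib>

// Box-Muller: two uniform (0,1] variates in, one standard normal out.
double standard_normal()
{
    const double pi = 3.14159265358979323846;
    double u1 = (std::rand() + 1.0) / (RAND_MAX + 1.0); // shift to avoid log(0)
    double u2 = (std::rand() + 1.0) / (RAND_MAX + 1.0);
    return std::sqrt(-2.0 * std::log(u1)) * std::cos(2.0 * pi * u2);
}

// v = exp(mu + sig * n) as described above.
double lognormal_variate(double mu, double sig)
{
    return std::exp(mu + sig * standard_normal());
}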
You may want to look at the lognormal_distribution class inside Boost random library. See here for an example of how to generate numbers from a given distribution (you have to instantiate a boost::variate_generator with a given random number generator plus an instance of the distribution).
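Roughly like this; a sketch against the older Boost.Random interface that ships variate_generator (treat the header names and constructor details as assumptions to check against your Boost version):

#include <boost/random/lognormal_distribution.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/variate_generator.hpp>

int main()
{
    boost::mt19937 rng(42);
    boost::lognormal_distribution<double> dist(1.0, 0.5);  // mean, sigma
    boost::variate_generator<boost::mt19937&,
                             boost::lognormal_distribution<double> > gen(rng, dist);
    double v = gen();   // one lognormally distributed number
    (void)v;
}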
Further to Azrael3000's answer,
Let the lognormal variable lgn be generated as lgn = exp(mu + sig * stdn), where stdn is a standard normal variable. Then note that the mu and sig in the equation above are computed from the sample statistics as:

mu = ln(m^2 / sqrt(v + m^2))
sig = sqrt(ln(1 + v / m^2))

where m and v are the mean and variance of the non-logarithmized sample values.

Ref: wiki - Log-normal_distribution
I want to get a boost::variate_generator which gives me numbers distributed to the lognormal distribution according to http://en.wikipedia.org/wiki/Log-normal_distribution.
There is a distribution in boost::math that implements the formula from the wikipedia entry, but it doesn't work with the variate_generator.
And the one from boost::random, which works with the variate_generator, is somewhat different from the one mentioned above:
http://www.boost.org/doc/libs/1_46_1/doc/html/boost/lognormal_distribution.html. Mu needs to be > 0, and mu and sigma are calculated internally instead of just being used as given.
Does someone have any idea how I can get it to work with the former formula?
EDIT
@Howard Hinnant: there is this init() function which gets called in the constructor. So the formula is the same, but sigma and mean get calculated this way (why, I don't know):
_nmean = log(_mean*_mean/sqrt(_sigma*_sigma + _mean*_mean));   // mu of the underlying normal
_nsigma = sqrt(log(_sigma*_sigma/_mean/_mean+result_type(1))); // sigma of the underlying normal
I may be mistaken, but from inspecting the boost code, it looks to me like the boost lognormal_distribution is consistent with the Wikipedia description. The documentation would be consistent too if it dropped the N subscript from mu and sigma.
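If you want the Wikipedia parameterization directly, one workaround is to generate normal variates with Boost and exponentiate them yourself; a sketch (same caveats about the older Boost.Random names):

#include <cmath>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/variate_generator.hpp>

// mu and sigma are the parameters of the underlying normal, exactly as
// in the Wikipedia formula X = exp(mu + sigma * Z).
double wiki_lognormal(boost::mt19937& rng, double mu, double sigma)
{
    boost::normal_distribution<double> norm(mu, sigma);
    boost::variate_generator<boost::mt19937&,
                             boost::normal_distribution<double> > gen(rng, norm);
    return std::exp(gen());
}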
This is probably a super easy question, but I just wanted to make 10000% sure before I did it.
Basically I'm doing a formula for a program; it takes certain values and does things with them... etc.
Anyways, let's say I have some values called:
N
Links_Retrieved
True_Links
True_Retrieved
I also have a % "scalar", as I'll call it; for this example let's say the % scalar is 10%.
Links_Retrieved is ALWAYS half of N, so that's easy to calculate.
BUT I want True_Links to be ANYWHERE from 1-10% of Links_Retrieved.
Then I want True_Retrieved to be anywhere from True_Links to 15% of Links_Retrieved.
How would I do this? would it be something like
True_Link=(((rand()%(Scalar(10%)-1))+1)/100);
?
I would divide by 100 to get the "percent" value, i.e. 0.1, so it'd be anywhere from 0.01 to 0.1?
and to do True_Retrieved it'd be
True_Retrieved=(rand()%(.15-True_Link))+True_Link;
Am I doing this correctly or am I WAYYYY off?
thanks
rand() is a very simple random number generator. The Boost libraries include Boost.Random. In addition to random number generators, Boost.Random provides a set of classes to generate specific distributions. It sounds like you want a distribution that's uniformly random between 1% and 10%, i.e. 0.01 and 0.1. That's done with boost::random::uniform_real(0.01, 0.1).
Maybe it would be better to use an advanced random number generator like the Mersenne Twister.
rand() actually produces integer values between 0 and RAND_MAX inclusive, so you have to scale that output to the interval you want (hence the casts to double below; rand()/RAND_MAX alone would do integer division). To get a value fact1 between 0.01 and 0.1 (1%-10%) you'd do:
perc1 = ((double)rand()/RAND_MAX)*9.0 + 1.0; // percentage 1-10 on the 0-100 scale
fact1 = perc1/100.0;                         // factor 0.01 - 0.1 on the 0-1 scale
to get a value between perc1 and 15 (percent) you'd do:
percrange = 15.0 - perc1;
perc2 = ((double)rand()/RAND_MAX)*percrange + perc1;
fact2 = perc2/100.0;
so your values become:
True_Links = fact1*Links_Retrieved;
True_Retrieved = fact2*Links_Retrieved;
This is sort-of pseudocode. You should make sure perc1, perc2, fact1, fact2 and percrange are floating-point values, and that the final multiplications are done in floating point and rounded to integer numbers.
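A compilable version of the above might look like this (variable names follow the question; rounding to the nearest integer at the end is one reasonable choice):

#include <cstdlib>
#include <ctime>

int main()
{
    std::srand((unsigned)std::time(0));

    int N = 1000;
    int Links_Retrieved = N / 2;                    // always half of N

    double perc1 = ((double)std::rand() / RAND_MAX) * 9.0 + 1.0;        // 1 - 10
    double fact1 = perc1 / 100.0;                                       // 0.01 - 0.10

    double percrange = 15.0 - perc1;
    double perc2 = ((double)std::rand() / RAND_MAX) * percrange + perc1; // perc1 - 15
    double fact2 = perc2 / 100.0;

    int True_Links = (int)(fact1 * Links_Retrieved + 0.5);     // round to int
    int True_Retrieved = (int)(fact2 * Links_Retrieved + 0.5);

    (void)True_Links;
    (void)True_Retrieved;
    return 0;
}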
There are two parameters when using RBF kernels with Support Vector Machines: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify good (C, γ) so that the classifier can accurately predict unknown data (i.e., testing data).
weka.classifiers.meta.GridSearch is a meta-classifier for tuning a pair of parameters. It seems, however, that it takes ages to finish (when the dataset is rather large). What would you suggest to do in order to bring down the time required to accomplish this task?
According to A User's Guide to Support Vector Machines:
C: soft-margin constant. A smaller value of C allows ignoring points close to the boundary, which increases the margin.
γ > 0 is a parameter that controls the width of the Gaussian kernel.
Hastie et al.'s SVMPath explores the entire regularization path for C and requires only about the same computational cost as training a single SVM model. From their paper:
Our R function SvmPath computes all 632 steps in the mixture example (n+ = n− = 100, radial kernel, γ = 1) in 1.44(0.02) secs on a pentium 4, 2Ghz linux machine; the svm function (using the optimized code libsvm, from the R library e1071) takes 9.28(0.06) seconds to compute the solution at 10 points along the path. Hence it takes our procedure about 50% more time to compute the entire path, than it costs libsvm to compute a typical single solution.
They released a GPLed implementation of the algorithm in R that you can download from CRAN here.
Using SVMPath should allow you to find a good C value for any given γ quickly. However, you would still need to do separate training runs for different γ values; even so, this should be much faster than doing a separate run for every (C, γ) pair.