I'm working with the random number generators available in C++11. At the moment, I'm using a uniform distribution, which should give me an equal probability of getting any number within a range [A, B] that I specify.
However, I'm confused about generating Poisson distributions. While I understand how to determine the Poisson probability, I don't understand how a random series of numbers can be "distributed" based on the Poisson distribution.
For instance, the C++11 constructor for a Poisson distribution takes one argument, λ, which is the mean of the distribution:
std::mt19937 eng(std::random_device{}());     // some random engine
std::poisson_distribution<int> poisson(7.0);  // λ = 7.0; the result type must be an integer type
std::cout << poisson(eng) << std::endl;
In a Poisson probability problem, this is equal to the expected number of successes / occurrences during a given interval. However, I don't understand what it represents in this instance. What is a "success" / "occurrence" in a random number scenario?
I appreciate any assistance or reference materials which I can use to help me understand this.
The probability mass of a Poisson distribution is the chance that a specific value occurs. Imagine you want to count how many cars pass a certain point each day. This value will be higher on some days and lower on others, but when you keep track of it over a long period of time, a mean starts to emerge, with values in its vicinity occurring more often and values further away (0 cars per day, or ten times the usual amount) being less likely. λ is that mean.
Translated to an RNG, the algorithm returns the number of cars that passed on a random day (chosen uniformly). As you would expect, values near the mean λ are the most likely to come up, and the extremes are the least likely.
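For instance, a minimal sketch using the C++11 facilities (λ = 7 and the seed are purely illustrative): drawing many samples shows the counts clustering around the mean.

#include <iostream>
#include <map>
#include <random>

int main() {
    std::mt19937 eng(42);                      // any seed will do
    std::poisson_distribution<int> cars(7.0);  // λ = 7 cars per day on average

    // Count how often each "cars per day" value comes up over 10,000 draws.
    std::map<int, int> histogram;
    for (int i = 0; i < 10000; ++i)
        ++histogram[cars(eng)];

    // Values near 7 dominate; 0 cars or 20 cars per day are rare.
    for (const auto& kv : histogram)
        std::cout << kv.first << " cars: " << kv.second << " days\n";
}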
The following link shows an example of the Poisson distribution, with the discrete results you can obtain and the chance each of them has of occurring:
http://www.mathworks.com/help/toolbox/stats/brn2ivz-127.html
A sample implementation could calculate, for each value, the probability that it occurs, and then build ranges from these values to translate a uniform distribution into a Poisson one. E.g. for λ == 2 we have roughly a 13.5% chance of 0, a 27% chance of 1, a 27% chance of 2, and so on. Then we generate a good old uniform random number between 0.0 and 1.0: if this number is <= 0.135, return 0; if it is <= 0.406, return 1; if it is <= 0.677, return 2; etc.
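A minimal sketch of that idea (not necessarily how the standard library implements it) could accumulate the probabilities on the fly and walk them with a single uniform draw:

#include <cmath>
#include <random>

// Draw one Poisson(λ)-distributed integer from a uniform [0,1) draw by
// accumulating the PMF p(k) = e^-λ λ^k / k! until it exceeds the draw.
// Fine for moderate λ; e^-λ underflows for very large λ.
int poisson_from_uniform(double lambda, std::mt19937& eng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double u = uniform(eng);

    double p = std::exp(-lambda);   // p(0)
    double cumulative = p;
    int k = 0;
    while (u > cumulative) {
        ++k;
        p *= lambda / k;            // p(k) = p(k-1) * λ / k
        cumulative += p;
    }
    return k;
}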
Related
I have used the below hyperparameters to train the model.
rcf.set_hyperparameters(
    num_samples_per_tree=200,
    num_trees=250,
    feature_dim=1,
    eval_metrics=["accuracy", "precision_recall_fscore"])
Is there a good way to choose the num_samples_per_tree and num_trees parameters?
What are the best values for both num_samples_per_tree and num_trees?
There are natural interpretations for these two hyper-parameters that can help you determine good starting approximations for HPO:
num_samples_per_tree -- the reciprocal of this value approximates the density of anomalies in your data set/stream. For example, if you set this to 200 then the assumption is that approximately 0.5% of the data is anomalous. Try exploring your dataset to make an educated estimate.
num_trees -- the more trees in your RCF model the less noise in scores. That is, if more trees are reporting that the input inference point is an anomaly then the point is much more likely to be an anomaly than if few trees suggest so.
The total number of points sampled from the input dataset is equal to num_samples_per_tree * num_trees; with the values above, that is 200 * 250 = 50,000 points. You should make sure that the input training set is at least this size.
(Disclosure - I helped create SageMaker Random Cut Forest)
I have a dataset with multiple variables, each of them heavily concentrated around zero, forming a high peak. The kurtosis of each variable is more than 100.
What I want to estimate is the probability density of any given value, assuming it belongs to the dataset. The most accessible distribution function I have found so far is the multivariate Gaussian distribution. However, since my dataset is not normally shaped, I am worried that estimating the probability density with this function would be inaccurate.
Does anyone have any good suggestions on which function to use for this purpose?
You are repeating a common but incorrect interpretation of kurtosis, namely "peakedness", which contributes to the confusion about which distribution to use.
Kurtosis does not measure "peakedness" at all. You can have a distribution with a perfectly flat peak, a V-shaped peak, a trimodal peak, a wavy peak, or any shape of peak whatsoever, that has infinite kurtosis. And you can have a distribution with an infinitely high peak that has negative (excess) kurtosis.
Instead, kurtosis is a measure of the tails (outlier potential) of the distribution, not the peak. The only reason people think that there is a "high peak" when there is high kurtosis is that the outliers stretch the horizontal scale of the histogram, making the data appear concentrated in a narrow vertical strip. But if you zoom in on the bulk of the data in that strip, the peak can have any shape whatsoever. Further, if you compare the height of your histogram of standardized data with the height of a corresponding standard normal, either can be higher, no matter what your data show. The "height" mythology was debunked around 1945 by Kaplansky.
For your data, you do not need a "peaked" distribution. Instead, you need a distribution that allows such extreme values as you have observed. Examples include mixture distributions, lognormal distributions, t distributions with small degrees of freedom, or multivariate versions of such, if that's what you need.
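To make the point concrete, here is a small sketch (the data is made up purely for illustration) that computes excess kurtosis for a sample whose bulk is perfectly flat but which contains two extreme outliers; the result is large even though there is no peak at all:

#include <iostream>
#include <vector>

// Sample excess kurtosis: m4 / m2^2 - 3, with central moments m2 and m4.
double excess_kurtosis(const std::vector<double>& x) {
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= x.size();

    double m2 = 0.0, m4 = 0.0;
    for (double v : x) {
        double d = v - mean;
        m2 += d * d;
        m4 += d * d * d * d;
    }
    m2 /= x.size();
    m4 /= x.size();
    return m4 / (m2 * m2) - 3.0;
}

int main() {
    // A perfectly flat "bulk" between -1 and 1, plus two extreme outliers.
    std::vector<double> data;
    for (int i = 0; i <= 200; ++i)
        data.push_back(-1.0 + i * 0.01);
    data.push_back(-50.0);
    data.push_back(50.0);

    // Prints a large positive number: the tails, not a peak, drive kurtosis.
    std::cout << "excess kurtosis: " << excess_kurtosis(data) << "\n";
}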
References:
Westfall, P.H. (2014). Kurtosis as Peakedness, 1905 – 2014. R.I.P. The American Statistician, 68, 191–195.
(A simplified discussion of the above paper is given in the talk section of the Wikipedia entry on kurtosis.)
I have been using sklearn to learn from some data. This is a binary classification task, and I am using an RBF kernel. My dataset is quite unbalanced (80:20), and I'm using only 120 samples with roughly 10 features (I've been experimenting with slightly fewer). Since I set class_weight="auto", the accuracy I've calculated from a cross-validated (10-fold) grid search has dropped dramatically. Why?
I will include a couple of validation accuracy heatmaps to demonstrate the difference.
NOTE: the top heatmap is from before class_weight was changed to "auto".
Accuracy is not the best metric to use when dealing with an unbalanced dataset. Say you have 99 positive examples and 1 negative example: if you predict every output to be positive, you will still get 99% accuracy, even though you have misclassified the only negative example. You probably got high accuracy in the first case because your predictions fell on the side with the larger number of samples.
When you set class_weight="auto", the imbalance is taken into consideration, so your predictions may have moved towards the centre; you can cross-check this by plotting histograms of the predictions.
My suggestion is: don't use accuracy as the performance metric; use something like the F1 score or AUC instead. A small worked example of the 99:1 case follows below.
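As a concrete illustration of the 99:1 example above (a standalone sketch, not tied to sklearn), the classifier that blindly predicts "positive" scores 99% accuracy yet gets an F1 of zero on the minority class:

#include <iostream>

int main() {
    // 99 positives, 1 negative, and a classifier that predicts "positive" for everything.
    int tp = 99;  // positives correctly predicted positive
    int fn = 0;   // positives predicted negative
    int fp = 1;   // the single negative, wrongly predicted positive
    int tn = 0;   // negatives correctly predicted negative

    double accuracy = double(tp + tn) / (tp + tn + fp + fn);

    // Score the minority (negative) class.
    double precision_neg = (tn + fn) > 0 ? double(tn) / (tn + fn) : 0.0;
    double recall_neg    = (tn + fp) > 0 ? double(tn) / (tn + fp) : 0.0;
    double f1_neg = (precision_neg + recall_neg) > 0
        ? 2.0 * precision_neg * recall_neg / (precision_neg + recall_neg)
        : 0.0;

    std::cout << "accuracy: " << accuracy << "\n";           // 0.99
    std::cout << "F1 (minority class): " << f1_neg << "\n";  // 0.0
}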
I have a sinusoid-like signal, and I would like to compute its frequency.
I tried to implement something, but it looks very difficult. Any ideas?
So far I have a vector of timesteps and values; how can I get the frequency from this?
Thank you.
If the input signal is a perfect sinusoid, you can calculate the frequency using the time between positive zero crossings. Find two consecutive instances where the signal goes from negative to positive and measure the time between them, then invert this number to convert from period to frequency. Note this is only as accurate as your sample interval, and it does not account for any potential aliasing.
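A minimal sketch of that zero-crossing idea, assuming the data is a vector of (timestamp, value) pairs as described in the question (the names are illustrative):

#include <cstddef>
#include <utility>
#include <vector>

// Estimate frequency from the time between two consecutive
// negative-to-positive zero crossings. Returns 0 if fewer than two are found.
double frequency_from_zero_crossings(
    const std::vector<std::pair<double, double>>& samples)  // (time, value)
{
    double first_crossing = 0.0;
    bool have_first = false;

    for (std::size_t i = 1; i < samples.size(); ++i) {
        if (samples[i - 1].second < 0.0 && samples[i].second >= 0.0) {
            // Linear interpolation for a slightly better crossing time.
            double t0 = samples[i - 1].first, t1 = samples[i].first;
            double v0 = samples[i - 1].second, v1 = samples[i].second;
            double t = t0 + (t1 - t0) * (-v0) / (v1 - v0);

            if (!have_first) {
                first_crossing = t;
                have_first = true;
            } else {
                return 1.0 / (t - first_crossing);  // period -> frequency
            }
        }
    }
    return 0.0;  // not enough crossings
}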
You could try autocorrelating the signal. An autocorrelation can be rapidly calculated by following these steps:
Perform an FFT of the audio.
Multiply each complex value by its complex conjugate (i.e. take its squared magnitude).
Perform the inverse FFT of the result.
The leftmost peak (at zero lag) will always be the highest, as the signal always correlates best with itself. The second-highest peak, however, can be used to calculate the sinusoid's frequency.
For example, if the second peak occurs at an offset (lag) of 50 samples, the sample rate is 16 kHz, and the window is 1 second, then the frequency is 16000 / 50 = 320 Hz. You can even use interpolation around the peak to get a more accurate estimate of its position, and thus a more accurate sinusoid frequency. This method is computationally intensive, but it is very good for estimating the frequency after significant amounts of noise have been added. A sketch of the idea is below.
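Here is a small sketch of the idea; for clarity it uses a direct O(n²) time-domain autocorrelation rather than the FFT route described above, which computes the same thing but much faster for long signals:

#include <cstddef>
#include <vector>

// Direct time-domain autocorrelation; the FFT steps above compute the
// same result much faster for long signals.
std::vector<double> autocorrelate(const std::vector<double>& x) {
    std::vector<double> r(x.size(), 0.0);
    for (std::size_t lag = 0; lag < x.size(); ++lag)
        for (std::size_t i = 0; i + lag < x.size(); ++i)
            r[lag] += x[i] * x[i + lag];
    return r;
}

// Estimate the sinusoid's frequency from the first peak after lag 0.
double frequency_from_autocorrelation(const std::vector<double>& x,
                                      double sample_rate) {
    std::vector<double> r = autocorrelate(x);

    // Skip the zero-lag peak: move past the first dip.
    std::size_t lag = 1;
    while (lag + 1 < r.size() && r[lag + 1] < r[lag]) ++lag;

    // Find the highest peak after the dip.
    std::size_t best = lag;
    for (std::size_t k = lag; k < r.size(); ++k)
        if (r[k] > r[best]) best = k;

    return best > 0 ? sample_rate / best : 0.0;  // e.g. 16000 / 50 = 320 Hz
}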
I have a set of values which follows an exponential distribution. Now I want to calculate the rate parameter alpha. Can anybody help me work out how to calculate it (I am using C++ to code it)?
If you know these values are from an exponential distribution, then the maximum-likelihood estimate of λ (lambda, not alpha) is 1 divided by the average of the values, i.e. n / (x₁ + ... + xₙ), because the mean of the exponential distribution is 1 / λ. This is a statistical estimation, since you are trying to assess a parameter through observation.
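A minimal sketch of that estimate (the container name is just illustrative):

#include <numeric>
#include <stdexcept>
#include <vector>

// Maximum-likelihood estimate of the exponential rate parameter:
// lambda_hat = n / sum(x_i), i.e. the reciprocal of the sample mean.
double estimate_lambda(const std::vector<double>& values) {
    if (values.empty())
        throw std::invalid_argument("need at least one observation");
    double sum = std::accumulate(values.begin(), values.end(), 0.0);
    return values.size() / sum;
}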