Estimate the probability density of a given value if it belongs to a highly peaked multivariate dataset with high kurtosis (>100) - kurtosis

I have a dataset with multiple variables, each of them heavily concentrated around zero so that it forms a high peak. The kurtosis of each variable is more than 100.
What I want to estimate is the probability density of any given value, assuming it belongs to the dataset. The most accessible distribution function I have found so far is the multivariate Gaussian distribution. However, my dataset does not have a normal shape, and I am worried that estimating the probability density with this function would be inaccurate.
Does anyone have any good suggestions on which function to use for this purpose?

You are repeating a common incorrect interpretation of kurtosis, namely "peakedness," which contributes to the confusion about what distribution to use.
Kurtosis does not measure "peakedness" at all. You can have a distribution with a perfectly flat peak, with a V-shaped peak, with a trimodal peak, with a wavy peak, or with any shape of peak whatsoever, that has infinite kurtosis. And you can have a distribution with an infinite peak that has negative (excess) kurtosis.
Instead, kurtosis is a measure of the tails (outlier potential) of the distribution, not the peak. The only reason people think that there is a "high peak" when there is high kurtosis is that the outliers stretch the horizontal scale of the histogram, making the data appear concentrated in a narrow vertical strip. But if you zoom in on the bulk of the data in that strip, the peak can have any shape whatsoever. Further, if you compare the height of your histogram of standardized data with the height of a corresponding standard normal, either can be higher, no matter what your data show. The "height" mythology was debunked around 1945 by Kaplansky.
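To make the point about tails concrete, here is a minimal sketch (not from the original answer) that computes the sample excess kurtosis of a perfectly flat bulk with and without two far outliers. The helper name `excess_kurtosis` and the specific numbers (998 evenly spaced points on [-1, 1], outliers at ±50) are illustrative assumptions.

```python
def excess_kurtosis(values):
    """Sample excess kurtosis: m4 / m2**2 - 3, using central moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# A perfectly flat "peak": 998 evenly spaced points on [-1, 1] (assumed example data).
flat_bulk = [-1.0 + 2.0 * i / 997 for i in range(998)]

# The same flat bulk plus two far outliers.
with_outliers = flat_bulk + [-50.0, 50.0]

print(excess_kurtosis(flat_bulk))      # negative: flat and light-tailed
print(excess_kurtosis(with_outliers))  # very large, although the bulk is still flat
```

The bulk of the data is identical in both cases; only the two outliers change, and they alone push the kurtosis far above 100.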
For your data, you do not need a "peaked" distribution. Instead, you need a distribution that allows such extreme values as you have observed. Examples include mixture distributions, lognormal distributions, t distributions with small degrees of freedom, or multivariate versions of such, if that's what you need.
References:
Westfall, P.H. (2014). Kurtosis as Peakedness, 1905 – 2014. R.I.P. The American Statistician, 68, 191–195.
(A simplified discussion of the above paper is given in the talk section of the Wikipedia entry on kurtosis.)
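As a rough illustration of why a heavy-tailed family matters for density estimates, the sketch below compares the log-density that a standard normal and a Student t distribution assign to an extreme value. It is univariate and uses hand-written formulas; the function names, the choice of 3 degrees of freedom, and the test point 10 are assumptions for illustration, not a recommendation for your data.

```python
import math

def normal_logpdf(x, loc=0.0, scale=1.0):
    """Log-density of a normal distribution."""
    z = (x - loc) / scale
    return -0.5 * math.log(2.0 * math.pi) - math.log(scale) - 0.5 * z * z

def student_t_logpdf(x, df, loc=0.0, scale=1.0):
    """Log-density of a location-scale Student t distribution."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1.0) / 2.0 * math.log1p(z * z / df))

# An extreme observation, ten scale units from the center (assumed example).
x = 10.0
print(normal_logpdf(x))           # extremely low: the normal treats x as near-impossible
print(student_t_logpdf(x, df=3))  # much higher: heavy tails allow such values
```

In practice the degrees of freedom, location, and scale would be estimated from the data (for example by maximum likelihood), and a multivariate version would replace the scalar scale with a scale matrix.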

Related

Using class weight to balance data set lowers accuracy in RBF SVM

I have been using sklearn to learn on some data. This is a binary classification task and I am using an RBF kernel. My data set is quite unbalanced (80:20) and I'm using only 120 samples, with 10ish features (I've been experimenting with a few less). Since I set class_weight="auto", the accuracy I've calculated from a cross-validated (10 folds) grid search has dropped dramatically. Why?
I will include a couple of validation accuracy heatmaps to demonstrate the difference.
NOTE: top heatmap is before classweight was changed to auto.
Accuracy is not the best metric to use when dealing with an unbalanced dataset. Let's say you have 99 positive examples and 1 negative example. If you predict all outputs to be positive, you will still get 99% accuracy, even though you have misclassified the only negative example. You might have gotten high accuracy in the first case because your predictions fell on the side that has the larger number of samples.
When you set class weight = auto, it takes the imbalance into consideration, and hence your predictions might have moved towards the center. You can cross-check this by plotting histograms of the predictions.
My suggestion is, don't use accuracy as performance metric, use something like F1 Score or AUC.
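A small worked example may help show how accuracy and F1 can disagree on an 80:20 dataset. The counts below are made up for illustration, and the helper functions are hand-written rather than taken from sklearn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, positive=1):
    """F1 score of one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 20 minority samples (label 1) followed by 80 majority samples (label 0).
y_true = [1] * 20 + [0] * 80

# Classifier A (hypothetical): always predicts the majority class.
pred_majority = [0] * 100

# Classifier B (hypothetical): finds 15 of the 20 minority samples,
# at the cost of 20 false alarms among the majority samples.
pred_weighted = [1] * 15 + [0] * 5 + [1] * 20 + [0] * 60

print(accuracy(y_true, pred_majority), f1_for_class(y_true, pred_majority))
print(accuracy(y_true, pred_weighted), f1_for_class(y_true, pred_weighted))
```

Classifier A has the higher accuracy (0.80 versus 0.75) but an F1 of 0 on the minority class, while classifier B reaches an F1 of about 0.55. A drop in accuracy after enabling class weights can therefore coincide with a more useful classifier.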

What is the fastest way to calculate the positions of cluster centers constrained by a concave polygon

I have a distribution of weighted 2D pose estimates (position + orientation) that are samples of an unknown PDF of a systems pose. All estimates and the underlying real position are constrained by a concave polygon.
The picture shows an exemplary distribution. The magenta colored circles are the estimates, the radius line indicates the estimated direction. The weights are indicated by the circles' diameter. The red dot is the weighted mean, the yellow circle indicates the variance and the direction but is of no importance for the following problem:
From all estimates I want to derive the most likely position of the system.
Up to now I have evaluated the following approaches:
Using the estimate with the highest weight: Gives poor results since one estimate with a high weight outperforms several coinciding estimates with slightly lower weights.
Weighted Mean: Not applicable since the mean might lie outside the polygon as in the picture (red dot with yellow circle).
Weighted Median: Would work but does neglect potential clusters. E.g. in the image below two clusters are prominent of which one is more likely than the other.
Additionally I have looked into K-Means and K-Medoids. For K-Means I do not know the most efficient way to constrain the centers to the polygon. K-Medoids seems to work, but has poor performance (O(n^2)), which is important since I have a high number of estimates (contrary to explanatory picture)
What would be the ideal algorithm to solve this kind of problem ?
What complexity can be achieved ?
Are there readily available algorithms in c++ that solve this problem, or can be easily adapted to solve it?
k-means may also yield an estimate outside your polygons.
Such constraints are beyond the clustering use case. But nothing prevents you from devising a method to correct the estimates afterwards.
For non-convex data, DBSCAN may be worth a try. You could even incorporate line-of-sight into Generalized DBSCAN easily. But I'm not convinced that clustering will help for your overall objective.
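One simple way to correct an estimate afterwards, sketched here under the assumption that every sample already lies inside the polygon, is to snap a computed center (a weighted mean or a k-means center) to the nearest sample. The function names are hypothetical, and this only addresses feasibility; it does not by itself pick the most likely cluster.

```python
def weighted_mean(points, weights):
    """Weighted mean of 2D points given as (x, y) tuples."""
    total = sum(weights)
    x = sum(w * p[0] for p, w in zip(points, weights)) / total
    y = sum(w * p[1] for p, w in zip(points, weights)) / total
    return (x, y)

def snap_to_nearest_sample(center, points):
    """Return the sample closest to `center`; O(n) in the number of samples."""
    return min(points, key=lambda p: (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2)

# Two clusters in the two arms of an L-shaped region (assumed example data).
points = [(0.0, 4.0), (0.2, 4.1), (-0.1, 3.9), (4.0, 0.0), (4.1, 0.2)]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]

center = weighted_mean(points, weights)      # may fall outside a concave polygon
feasible = snap_to_nearest_sample(center, points)
print(center, feasible)
```

Because the result is always one of the samples, it is inside the polygon whenever the samples are. Applied to each k-means center after clustering, this costs O(n) per center rather than the O(n^2) of K-Medoids.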

Estimate gaussian height from its area

We (my colleague and I) were given a device that sends us a large amount of discrete integer data (intensities) each second, which tend to have a Gaussian distribution. These pseudo-Gaussians arrive one after another, and we are supposed to pick the largest intensity from the center of each Gaussian as fast as possible. Moreover, the data contain noise, so we cannot say that each Gaussian can be separated into two monotone parts => we cannot rely on the simple fact that if the data start to decline, we have found the maximum.
My colleague came up with an idea:
introduce an intensity threshold to separate gaussians from each other
sum intensities of each gaussian to estimate its area and then estimate its height
But the question is, how can I fast estimate height of this pseudo gaussian from its area?
UPDATE:
To be more clear, the intensities that I get represent "function values" of a Gaussian, or better, they represent heights of histogram bins.
You could use a moving average filter, and when that starts to decline, take the maximum value in that window as your height. As long as the noise in the signal is fairly low amplitude and high frequency, that should work reasonably well. You can always combine it with thresholding if required. The people on the DSP site will probably have much better ideas though, so I'd ask there.
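The moving-average idea combined with a threshold could be sketched as follows. The function name, the window length, and the threshold value are assumptions to be tuned to the real signal; the rule is simply "once the smoothed signal starts to decline, report the largest raw value in the current window, then wait until the signal returns to baseline".

```python
def find_peak_heights(samples, window=3, threshold=0.5):
    """Return one height estimate per pulse using a moving-average decline rule."""
    heights = []
    armed = True       # True while we are still waiting for the current pulse's peak
    prev_avg = None
    for end in range(window, len(samples) + 1):
        chunk = samples[end - window:end]
        avg = sum(chunk) / window
        if avg <= threshold:
            armed = True                 # back at baseline: ready for the next pulse
        elif armed and prev_avg is not None and avg < prev_avg:
            heights.append(max(chunk))   # smoothed signal started to decline
            armed = False
        prev_avg = avg
    return heights

# Two noise-free pulses separated by baseline (assumed example data).
pulse = [0, 1, 3, 7, 12, 15, 12, 7, 3, 1, 0]
samples = [0] * 5 + pulse + [0] * 5 + [2 * v for v in pulse] + [0] * 5

print(find_peak_heights(samples))  # [15, 30]
```

With noisy data a longer window makes the decline test more reliable, at the cost of reporting the peak a few samples later.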

terminology and references for detecting light pulses in a field of light

Given a video with a fixed background containing a lot of variation in light I am trying to detect pulses of light that occur for relatively short spans of time. When the video is played it is pretty easy for a person to distinguish the light pulses but if only shown a still frame it would be impossible to distinguish a pulse from background light.
I would like to know if there is specific terminology in machine vision that I can use to search for algorithms used to solve this problem. Also if you have any references for papers or open source software that solves this problem that would be great.
Edit: More context
The video itself is of a biological process that occurs at the sub-cellular level and while the background is fixed there is also a significant amount of random signal noise at the pixel level (there doesn't appear to be significant correlation in the noise between neighboring pixels). Note that the variation I refer to in the first paragraph is true variation and not signal noise. Since I mentioned that the process is biological it's probably also worth saying that there is no movement going on; these are just pulses of light. Also, the pulses themselves occupy enough pixels so that it is easy to discern their relative sizes.
From statistics, you could look into change point detection. The essential idea being that most of the time each (x,y) point or region, if you define some granularity of regions, has an intensity I(x,y), where I(x,y) is random, but either bounded or stochastic with some assumed distribution (e.g. normal with a given mean and standard deviation), and then it is observed with an intensity that is anomalous for that distribution. Anomaly detection would also apply, but the time series nature is more appropriate.
(If you want to go more into the statistical methodologies, it would be far more appropriate to discuss this on the statistics Stack Exchange site.)
If you look into astronomical applications, you can find papers on supernova and pulsar detection.
Update 1. Just to clarify the astronomical analogies, if the pulse is repeating, then papers on pulsars or satellites may be most appropriate. If the pulse is one-time, then papers on supernova detection would be better. If the pulse is bursty, and spatially clustered, then meteor strike detection would be better. Although spatial time series analysis, especially change point or anomaly detection, is useful, it's best to have an understanding of the stochastic phenomena of interest in order to narrow down the detection methodology.
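As a very simple stand-in for a full change point method, the sketch below thresholds a single pixel's (or region's) intensity trace against a baseline mean and standard deviation. It assumes the first `baseline_len` frames contain no pulse; that assumption, the function name, and the threshold of 4 standard deviations are all illustrative choices.

```python
from statistics import mean, stdev

def flag_pulse_frames(trace, baseline_len=20, z_threshold=4.0):
    """Return frame indices whose intensity is anomalously high versus the baseline."""
    baseline = trace[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline)  # must be > 0, i.e. the baseline must not be constant
    return [t for t, value in enumerate(trace)
            if t >= baseline_len and value - mu > z_threshold * sigma]

# Noisy background around 10 with a short pulse near 30 (assumed example data).
trace = [10, 11, 9, 10, 12, 8] * 4 + [10, 11, 30, 31, 29, 10, 9]

print(flag_pulse_frames(trace))  # [26, 27, 28]
```

Running this per pixel or per region gives a binary mask over time; proper change point methods refine the idea by updating the baseline and modeling the pulse duration.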
To continue the notion of applying statistics: you might consider gridding each image frame into rectangular neighborhoods. At each time t, compute the variance (or standard deviation) of the neighborhood. Presumably, the unexcited neighborhoods will exhibit some common distribution of intensity (i.e. uniform, but most likely some form of gaussian). The presence of pulse pixels will bias that distribution in some way. When comparing a neighborhood at time t and t-1, a significant change in mean intensity (or a change in the variance, etc.) would indicate an excited neighborhood.
You might also consider looking at other measures, such as skewness and kurtosis. Assuming the initial, unexcited distribution is gaussian, the "shape" parameters could also identify differences in the pixel populations.
*Note that I'm assuming a grayscale image for simplicity, but the same principles may be applied to an RGB image.
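The gridding idea can be sketched as follows for grayscale frames stored as nested lists. The block size, the `min_jump` threshold, and the function names are assumptions for illustration; only the change in mean intensity is compared here, and the same structure would work for variance, skewness, or kurtosis.

```python
def block_means(frame, block):
    """Mean intensity of each block x block neighborhood, keyed by (block_row, block_col)."""
    rows, cols = len(frame), len(frame[0])
    means = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [frame[r][c]
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            means[(r0 // block, c0 // block)] = sum(cells) / len(cells)
    return means

def excited_blocks(prev_frame, frame, block=2, min_jump=5.0):
    """Blocks whose mean intensity rose by more than min_jump between two frames."""
    before = block_means(prev_frame, block)
    after = block_means(frame, block)
    return sorted(key for key in after if after[key] - before[key] > min_jump)

# Frame at t-1: flat background. Frame at t: a pulse in the lower-right corner.
prev_frame = [[10] * 4 for _ in range(4)]
frame = [[10, 10, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 30, 30],
         [10, 10, 30, 30]]

print(excited_blocks(prev_frame, frame))  # [(1, 1)]
```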
Assuming a completely static scene with no object and camera motion, then any color deviation would be due to lighting changes.
If you detect an abrupt color/intensity change at particular pixels (i.e. brightness change above a certain allowable threshold), then this should be due to the light source turning on/off.
If you are only interested in point light sources, then any change in a region larger than the maximum apparent light source should be considered as coming from something else (e.g. the sun suddenly revealed from behind clouds).

How to exploit periodicity to reduce noise of a signal?

100 periods have been collected from a 3 dimensional periodic signal. The wavelength slightly varies. The noise of the wavelength follows Gaussian distribution with zero mean. A good estimate of the wavelength is known, that is not an issue here. The noise of the amplitude may not be Gaussian and may be contaminated with outliers.
How can I compute a single period that approximates 'best' all of the collected 100 periods?
Time-series, ARMA, ARIMA, Kalman Filter, autoregression and autocorrelation seem to be keywords here.
UPDATE 1: I have no idea how time-series models work. Are they prepared for varying wavelengths? Can they handle non-smooth true signals? If a time-series model is fitted, can I compute a 'best estimate' for a single period? How?
UPDATE 2: A related question is this. Speed is not an issue in my case. Processing is done off-line, after all periods have been collected.
Origin of the problem: I am measuring acceleration during human steps at 200 Hz. After that I am trying to double integrate the data to get the vertical displacement of the center of gravity. Of course the noise introduces a HUGE error when you integrate twice. I would like to exploit periodicity to reduce this noise. Here is a crude graph of the actual data (y: acceleration in g, x: time in second) of 6 steps corresponding to 3 periods (1 left and 1 right step is a period):
My interest is now purely theoretical, as http://jap.physiology.org/content/39/1/174.abstract gives a pretty good recipe what to do.
We have used wavelets for noise suppression with similar signal measured from cows during walking.
I don't think the noise is so much of a problem here; the biggest peaks represent actual changes in the acceleration during walking.
I suppose that the angle of the leg, and thus of the accelerometer, changes during your experiment, and you need to account for that in order to calculate the distance, i.e. you need to know the orientation of the accelerometer at each time step. See, e.g., this technical note for one way to account for the angle.
If you need get accurate measures of the position the best solution would be to get an accelerometer with a magnetometer, which also measures orientation. Something like this should work: http://www.sparkfun.com/products/10321.
EDIT: I have looked into this a bit more in the last few days because a similar project is in my to do list as well... We have not used gyros in the past, but we are doing so in the next project.
The inaccuracy in the positioning doesn't come from the white noise, but from the inaccuracy and drift of the gyro. And the error then accumulates very quickly due to the double integration. Intersense has a product called Navshoe, that addresses this problem by zeroing the error after each step (see this paper). And this is a good introduction to inertial navigation.
Periodic signal without noise has the following property:
f(a) = f(a+k), where k is the wavelength.
The next bit of information that is needed is that your signal is composed of separate samples. Every bit of information you've collected is based on samples, which are values of the function f(). From the 100 samples (one per period), you can get the mean value:
1/n * sum(s_i), where i is in range [0..n-1] and n = 100.
This needs to be done for every dimension of your data. If you use 3d data, it will be applied 3 times. Result would be (x,y,z) points. You can find value of s_i from the periodic signal equation simply by doing
s_i(a).x = f(a+k*i).x
s_i(a).y = f(a+k*i).y
s_i(a).z = f(a+k*i).z
If the wavelength is not accurate, this will give you additional source of error or you'll need to adjust it to match the real wavelength of each period. Since
k*i = k+k+...+k
if the wavelength varies, you'll need to use
k_1+k_2+k_3+...+k_i
instead of k*i.
Unfortunately, with errors in the wavelength, there will be big problems keeping this k_1..k_i chain in sync with the actual data. You'd actually need to know how to recognize the starting position of each period from your actual data. You may need to mark them by hand.
Now, all the mean values you calculated would be functions like this:
m(a) :: R->(x,y,z)
Now this is a curve in 3d space. More complex error models will be left as an exercise for the reader.
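The averaging described above can be sketched for one dimension as follows, assuming the start index of every period is already known (marked by hand or detected separately). Each period is resampled by linear interpolation onto a common phase grid so that periods of different length can be averaged. The function name, the `combine` parameter, and the resampling step are assumptions added for illustration; passing `statistics.median` instead of `mean` is one possible way to reduce the influence of outliers.

```python
import math
from statistics import mean

def average_period(signal, starts, n_points=50, combine=mean):
    """Average periods of unequal length on a common phase grid.

    `starts` lists the first sample index of each period, followed by the
    index one past the end of the last period.
    """
    resampled = []
    for start, stop in zip(starts[:-1], starts[1:]):
        length = stop - start
        one_period = []
        for i in range(n_points):
            pos = start + i * length / n_points   # fractional sample index
            lo = int(math.floor(pos))
            hi = min(lo + 1, len(signal) - 1)     # clamp at the end of the data
            frac = pos - lo
            one_period.append((1.0 - frac) * signal[lo] + frac * signal[hi])
        resampled.append(one_period)
    # Combine across periods, phase by phase: this is m(a) for one dimension.
    return [combine(values) for values in zip(*resampled)]

# Three periods of a sine with slightly different wavelengths (assumed example data).
lengths = [8, 10, 12]
signal = [math.sin(2.0 * math.pi * j / n) for n in lengths for j in range(n)]
starts = [0, 8, 18, 30]

template = average_period(signal, starts, n_points=8)
print(template)
```

For 3D data the function would be applied once each to the x, y, and z components.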
If you have a copy of Curve Fitting Toolbox, localized regression might be a good choice.
Curve Fitting Toolbox supports both lowess and loess localized regression models for curve and surface fitting.
There is also an option for robust localized regression.
The following blog post shows how to use cross validation to estimate an optimal spanning parameter for a localized regression model, as well as techniques to estimate confidence intervals using a bootstrap.
http://blogs.mathworks.com/loren/2011/01/13/data-driven-fitting/