How to assess the variance explained by a principal component in a sub-set of data? - pca

I have conducted a PCA (in Matlab) on a set of thousands of points of spatial data. I have also calculated the variance explained across the full dataset by each principal component (i.e. PC or eigenvector) by dividing its eigenvalue by the sum of all eigenvalues. As an example, PC 15 accounts for 2% of the variance in the entire dataset; however, there is a subset of points in this dataset for which I suspect PC 15 accounts for a much higher % of their variance (e.g. 80%).
My question is this: is there a way to calculate the variance explained by a given PC from my existing analysis for only a subset of points (e.g. 1000 pts from the full dataset of 500k+)? I know that I could run another PCA on just the subset, but for my purposes, I need to continue to use the PCs from my original analysis. Any idea for how to do this would be very helpful.
Thanks!
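One possible approach, sketched here in Python rather than Matlab, is to keep the original PCs fixed, project only the subset points onto them, and divide the variance of the subset's scores on a given PC by the subset's total variance. The function name and the toy data are illustrative assumptions, and the sketch measures variance about the subset's own mean, which may or may not be the right reference for your purposes.

```python
from math import sqrt
from statistics import pvariance

def subset_variance_fractions(subset, components):
    """Fraction of the subset's total variance lying along each original PC.

    subset     : rows of the full data matrix belonging to the subset
    components : unit-length, mutually orthogonal PC directions from the
                 ORIGINAL analysis, one direction per row
    Variance is taken about the subset's own mean (an assumption of this sketch).
    """
    # Total variance of the subset = sum of the per-coordinate variances.
    total = sum(pvariance(column) for column in zip(*subset))
    fractions = []
    for pc in components:
        # Scores of the subset points on this PC (no re-fitting involved).
        scores = [sum(x * c for x, c in zip(point, pc)) for point in subset]
        fractions.append(pvariance(scores) / total)
    return fractions

# Toy 2-D check with hypothetical PCs along the two diagonals.
s = 1 / sqrt(2)
components = [[s, s], [s, -s]]
subset = [[1.0, -1.0], [2.0, -2.0], [-1.0, 1.0], [-2.0, 2.0]]  # spread only along PC 2
fractions = subset_variance_fractions(subset, components)
print(fractions)
```

Because the PCs form an orthonormal basis, the fractions over all PCs sum to 1 for any subset. If the score matrix from the original analysis is still available, the same ratio can be read off it directly: the variance of the subset's rows in column k, divided by the sum of the variances of those rows over all columns.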

Related

A large number of principal components when using PCA

I'm new to PCA. I have a dataset of 64 features and I am trying to find the most important features using PCA. When I run a PCA that explains 90% of the variance in my dataset, I get about 40 principal components. My question is: how can I get the feature importance based on all these principal components? The attached picture (pic1) shows the number of principal components needed to explain 90% of the variance.
Should I sum the values of all principal components for each feature and then sort them in descending order?
Just run a regression model and check the values of the importance statistics associated with each feature. Check out this article for a discussion of feature importance.
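For the "sum over components" idea in the question, a rough heuristic (not the regression approach suggested above, and not a standard importance statistic) is to sum the squared loadings of each feature, weighted by each component's explained-variance ratio. Summing raw loadings would let positive and negative values cancel. The function name and the toy numbers below are assumptions for illustration only.

```python
def loading_importance(components, explained_variance_ratio):
    """Heuristic per-feature score: sum_k ratio_k * loading_kj ** 2.

    components               : one row per retained PC, one column per feature
    explained_variance_ratio : fraction of variance explained by each retained PC
    """
    n_features = len(components[0])
    scores = [0.0] * n_features
    for pc, ratio in zip(components, explained_variance_ratio):
        for j, loading in enumerate(pc):
            # Squaring avoids cancellation between positive and negative loadings.
            scores[j] += ratio * loading ** 2
    return scores

# Hypothetical example: 2 retained PCs, 3 features.
components = [[0.8, 0.6, 0.0],
              [0.0, 0.0, 1.0]]
ratios = [0.7, 0.2]
scores = loading_importance(components, ratios)
# Feature indices sorted from highest to lowest score.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

This only says how much of each feature's variance the retained components capture. It says nothing about how useful a feature is for predicting a target, which is what the regression-based statistics address.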

Hyper parameter tuning for Random cut forest

I have used the hyperparameters below to train the model.
rcf.set_hyperparameters(
    num_samples_per_tree=200,
    num_trees=250,
    feature_dim=1,
    eval_metrics=["accuracy", "precision_recall_fscore"])
Is there a recommended way to choose the num_samples_per_tree and num_trees parameters? What are the best values for both?
There are natural interpretations for these two hyper-parameters that can help you determine good starting approximations for HPO:
num_samples_per_tree -- the reciprocal of this value approximates the density of anomalies in your data set/stream. For example, if you set this to 200 then the assumption is that approximately 0.5% of the data is anomalous. Try exploring your dataset to make an educated estimate.
num_trees -- the more trees in your RCF model the less noise in scores. That is, if more trees are reporting that the input inference point is an anomaly then the point is much more likely to be an anomaly than if few trees suggest so.
The total number of points sampled from the input dataset is equal to num_samples_per_tree * num_trees. You should make sure that the input training set is at least this size.
(Disclosure - I helped create SageMaker Random Cut Forest)
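The two rules of thumb above can be turned into a small starting-point calculation. This is only a sketch of the arithmetic: rcf_sizing is a hypothetical helper, not part of the SageMaker SDK, and the anomaly fraction is something you would have to estimate from your own data.

```python
def rcf_sizing(expected_anomaly_fraction, num_trees):
    """Starting values derived from the rules of thumb in the answer above.

    expected_anomaly_fraction : your own estimate, e.g. 0.005 for 0.5%
    num_trees                 : chosen by you; more trees means less noisy scores
    """
    # num_samples_per_tree is roughly the reciprocal of the anomaly density.
    num_samples_per_tree = round(1 / expected_anomaly_fraction)
    # The training set should contain at least this many points.
    min_training_points = num_samples_per_tree * num_trees
    return num_samples_per_tree, min_training_points

samples, needed = rcf_sizing(expected_anomaly_fraction=0.005, num_trees=250)
print(samples, needed)
```

With the values from the question (200 samples per tree, 250 trees), the training set would need at least 50,000 points.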

Estimate the probability density of a given value if it belongs to a highly peaked multivariate dataset with high kurtosis (>100)

I have a dataset with multiple variables, each of them heavily concentrated around zero so that it forms a high peak. The kurtosis of each variable is more than 100.
What I want to estimate is the probability density of any given value, assuming it belongs to the dataset. The most accessible distribution function I have found so far is the multivariate Gaussian distribution. However, since my dataset does not have a normal shape, I am worried that estimating the probability density with this function would be inaccurate.
Does anyone have any good suggestions on which function to use for this purpose?
You are repeating a common incorrect interpretation of kurtosis, namely, "peakedness," which contributes to the confusion about what distribution to use.
Kurtosis does not measure "peakedness" at all. You can have a distribution with a perfectly flat peak, with a V-shaped peak, with a trimodal peak, with a wavy peak, or with any shape peak whatsoever, that has infinite kurtosis. And you can have a distribution with an infinite peak that has negative (excess) kurtosis.
Instead, kurtosis is a measure of the tails (outlier potential) of the distribution, not the peak. The only reason people think that there is a "high peak" when there is high kurtosis is that the outliers stretch the horizontal scale of the histogram, making the data appear concentrated in a narrow vertical strip. But if you zoom in on the bulk of the data in that strip, the peak can have any shape whatsoever. Further, if you compare the height of your histogram of standardized data with the height of a corresponding standard normal, either can be higher, no matter what your data show. The "height" mythology was debunked around 1945 by Kaplansky.
For your data, you do not need a "peaked" distribution. Instead, you need a distribution that allows such extreme values as you have observed. Examples include mixture distributions, lognormal distributions, t distributions with small degrees of freedom, or multivariate versions of such, if that's what you need.
References:
Westfall, P.H. (2014). Kurtosis as Peakedness, 1905 – 2014. R.I.P. The American Statistician, 68, 191–195.
(A simplified discussion of the above paper is given in the talk section of the Wikipedia entry on kurtosis.)
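A small simulation can illustrate the point that kurtosis tracks the tails rather than the peak. The two distributions below are my own illustrative choices, not taken from the cited paper: one has a perfectly flat top plus rare outliers, the other has a density that is infinite at zero but bounded tails.

```python
import random

def excess_kurtosis(data):
    """Sample excess kurtosis: m4 / m2**2 - 3, using central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(0)
n = 200_000

# Flat-topped bulk: 99% uniform on [-1, 1], plus 1% outliers uniform on [-50, 50].
flat_with_outliers = [
    rng.uniform(-50, 50) if rng.random() < 0.01 else rng.uniform(-1, 1)
    for _ in range(n)
]

# Infinite peak at zero, no outliers: x = +/- u**2 with u uniform on (0, 1).
# The density is proportional to 1/sqrt(|x|) on [-1, 1]; theory gives 25/9 - 3 = -2/9.
spiked_bounded = [rng.choice((-1, 1)) * rng.random() ** 2 for _ in range(n)]

k_flat = excess_kurtosis(flat_with_outliers)
k_spike = excess_kurtosis(spiked_bounded)
print(k_flat, k_spike)
```

The flat-topped sample comes out with excess kurtosis far above 100 because of its rare extreme values, while the sample with the infinite peak comes out slightly negative.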

Parameters to improve a music frequency analyzer

I'm using an FFT on audio data to output an analyzer, like you'd see in Winamp or Windows Media Player. However, the output doesn't look that great. I'm plotting using a logarithmic scale and I average the linear results from the FFT into the corresponding logarithmic bins. As an example, I'm using bins like:
16k, 8k, 4k, 2k, 1k, 500, 250, 125, 62, 31, 15 [Hz]
Then I plot the magnitude (dB) against frequency [Hz]. The graph definitely 'reacts' to the music, and I can see the response of a drum sample or a high-pitched voice. But the graph is very 'saturated' close to the lower frequencies, and overall doesn't look much like what you see in applications, which tend to be more evenly distributed. I feel that apps that display visual output tend to do different things to the data to make it look better.
What things could I do to the data to make it look more like the typical music player app?
Some useful information:
I downsample to a single channel at 32 kHz and specify a time window of 35 ms. That means the FFT gets ~1100 points. I vary these values to experiment (i.e. I tried 16 kHz, and increasing/decreasing the interval length) but I get similar results.
With an FFT of 1100 points, you probably aren't able to capture the low frequencies with a lot of frequency resolution.
Think about it, 30 Hz corresponds to a period of 33ms, which at 32kHz is roughly 1000 samples. So you'll only be able to capture about 1 period in this time.
Thus, you'll need a longer FFT window to capture those low frequencies with sharp frequency resolution.
You'll likely need a time window of 4000 samples or more to start getting noticeably more frequency resolution at the low frequencies. This will be fine too, since you'll still get about 8-10 spectrum updates per second.
One option too, if you want very fast updates for the high frequency bins but good frequency resolution at the low frequencies, is to update the high frequency bins more quickly (such as with the windows you're currently using) but compute the low frequency bins less often (and with larger windows necessary for the good freq. resolution.)
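The arithmetic behind this answer can be checked with a short sketch. The helper names are made up for illustration, and the 31-62 Hz band is taken from the bin list in the question.

```python
def bin_spacing_hz(sample_rate_hz, n_points):
    """Distance between adjacent FFT bins, which is also the
    number of non-overlapping windows per second."""
    return sample_rate_hz / n_points

def bins_in_band(sample_rate_hz, n_points, low_hz, high_hz):
    """Count FFT bins whose center frequency falls in [low_hz, high_hz)."""
    return sum(
        1
        for k in range(1, n_points // 2 + 1)
        if low_hz <= k * sample_rate_hz / n_points < high_hz
    )

fs = 32_000
for n in (1100, 4000):
    print(
        n,
        round(bin_spacing_hz(fs, n), 2),   # Hz between bins
        bins_in_band(fs, n, 31, 62),       # bins landing in the 31-62 Hz band
    )
```

With 1100 points the bins are about 29 Hz apart, so the 31-62 Hz band is represented by a single FFT bin. With 4000 points the spacing drops to 8 Hz and the same band gets four bins, while the window rate without overlap is still 8 per second.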
I think a lot of these applications have variable FFT bins.
What you could do is start with very wide, evenly spaced FFT bins like you have and then keep track of the number of elements that are placed in each FFT bin. If some of the bins are hardly used at all (usually the higher frequencies), then widen those bins so that they are larger (and thus have more frequency entries) and shrink the low frequency bins.
I have worked on projects where we just spent a lot of time tuning bins for specific input sources, but it is much nicer to have the software adjust in real time.
A typical visualizer would use constant-Q bandpass filters, not a single FFT.
You could emulate a set of constant-Q bandpass filters by multiplying the FFT results by a set of constant-Q filter responses in the frequency domain, then sum. For low frequencies, you should use an FFT longer than the significant impulse response of the lowest frequency filter. For high frequencies, you can use shorter FFTs for better responsiveness. You can slide any length FFTs along at any desired update rate by overlapping (re-using) data, or you might consider interpolation. You might also want to pre-window each FFT to reduce "spectral leakage" between frequency bands.

How to exploit periodicity to reduce noise of a signal?

100 periods have been collected from a 3 dimensional periodic signal. The wavelength slightly varies. The noise of the wavelength follows Gaussian distribution with zero mean. A good estimate of the wavelength is known, that is not an issue here. The noise of the amplitude may not be Gaussian and may be contaminated with outliers.
How can I compute a single period that approximates 'best' all of the collected 100 periods?
Time-series, ARMA, ARIMA, Kalman Filter, autoregression and autocorrelation seem to be keywords here.
UPDATE 1: I have no idea how time-series models work. Are they prepared for varying wavelengths? Can they handle non-smooth true signals? If a time-series model is fitted, can I compute a 'best estimate' for a single period? How?
UPDATE 2: A related question is this. Speed is not an issue in my case. Processing is done off-line, after all periods have been collected.
Origin of the problem: I am measuring acceleration during human steps at 200 Hz. After that I am trying to double integrate the data to get the vertical displacement of the center of gravity. Of course the noise introduces a HUGE error when you integrate twice. I would like to exploit periodicity to reduce this noise. Here is a crude graph of the actual data (y: acceleration in g, x: time in second) of 6 steps corresponding to 3 periods (1 left and 1 right step is a period):
My interest is now purely theoretical, as http://jap.physiology.org/content/39/1/174.abstract gives a pretty good recipe for what to do.
We have used wavelets for noise suppression with similar signal measured from cows during walking.
I don't think the noise is so much of a problem here; the biggest peaks represent actual changes in the acceleration during walking.
I suppose that the angle of the leg, and thus of the accelerometer, changes during your experiment, and you need to account for that in order to calculate the distance, i.e. you need to know the orientation of the accelerometer at each time step. See e.g. this technical note for one way to account for the angle.
If you need to get accurate measures of the position, the best solution would be to get an accelerometer with a magnetometer, which also measures orientation. Something like this should work: http://www.sparkfun.com/products/10321.
EDIT: I have looked into this a bit more in the last few days because a similar project is in my to do list as well... We have not used gyros in the past, but we are doing so in the next project.
The inaccuracy in the positioning doesn't come from the white noise, but from the inaccuracy and drift of the gyro. And the error then accumulates very quickly due to the double integration. Intersense has a product called Navshoe, that addresses this problem by zeroing the error after each step (see this paper). And this is a good introduction to inertial navigation.
Periodic signal without noise has the following property:
f(a) = f(a+k), where k is the wavelength.
The next bit of information that is needed is that your signal is composed of separate samples. Everything you've collected is based on samples, which are values of the function f(). From the 100 samples (one per collected period), you can get the mean value:
1/n * sum(s_i), where i is in range [0..n-1] and n = 100.
This needs to be done for every dimension of your data. If you use 3d data, it will be applied 3 times. The result would be (x,y,z) points. You can find the value of s_i from the periodic signal equation simply by doing
s_i(a).x = f(a+k*i).x
s_i(a).y = f(a+k*i).y
s_i(a).z = f(a+k*i).z
If the wavelength is not accurate, this will give you an additional source of error, or you'll need to adjust it to match the real wavelength of each period. Since
k*i = k+k+...+k
if the wavelength varies, you'll need to use
k_1+k_2+k_3+...+k_i
instead of k*i.
Unfortunately, with errors in the wavelength, there will be big problems keeping this k_1..k_i chain in sync with the actual data. You'd actually need to know how to recognize the starting position of each period from your actual data. You may need to mark them by hand.
Now, all the mean values you calculated would be functions like this:
m(a) :: R->(x,y,z)
Now this is a curve in 3d space. More complex error models will be left as an exercise for the reader.
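The averaging described above could be sketched as follows for one axis of the signal, assuming the start index of every period is already known (for example, marked by hand). Each period is resampled onto a common phase grid so that periods of slightly different length line up. The function names, the linear interpolation, and the option to swap the mean for a median are my assumptions, not part of the answer.

```python
import math
import statistics

def resample_period(samples, n_out):
    """Linearly interpolate one period onto n_out evenly spaced phases in [0, 1)."""
    n_in = len(samples)
    out = []
    for j in range(n_out):
        pos = j * n_in / n_out            # fractional index into this period
        i = int(pos)
        frac = pos - i
        nxt = samples[(i + 1) % n_in]     # wrap around, since the signal is periodic
        out.append((1 - frac) * samples[i] + frac * nxt)
    return out

def average_period(signal, starts, n_out=100, combine=statistics.fmean):
    """Combine all periods of a 1-D signal into a single representative period.

    starts  : index where each period begins; the last entry marks the end
              of the final period
    combine : statistics.fmean for the plain mean, or statistics.median
              if outliers are a concern
    """
    periods = [resample_period(signal[a:b], n_out)
               for a, b in zip(starts, starts[1:])]
    return [combine(column) for column in zip(*periods)]

# Toy data: three noise-free sine periods of slightly different length.
lengths = [40, 50, 45]
signal = [math.sin(2 * math.pi * j / n) for n in lengths for j in range(n)]
starts = [0, 40, 90, 135]
mean_period = average_period(signal, starts, n_out=20)
median_period = average_period(signal, starts, n_out=20, combine=statistics.median)
```

For 3-dimensional data the same function would be applied to the x, y and z series separately, using the same list of period starts.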
If you have a copy of Curve Fitting Toolbox, localized regression might be a good choice.
Curve Fitting Toolbox supports both lowess and loess localized regression models for curve and surface fitting.
There is also an option for robust localized regression.
The following blog post shows how to use cross validation to estimate an optimal span parameter for a localized regression model, as well as techniques to estimate confidence intervals using a bootstrap.
http://blogs.mathworks.com/loren/2011/01/13/data-driven-fitting/