How to choose the right number of dimensions in UMAP?


I want to use UMAP on my high-dimensional dataset as a preprocessing step (not for data visualization) in order to reduce the number of features. How can I choose (if there is a method) the right number of dimensions to map the original data into? In PCA, for example, you can select the number of components that explain a fixed percentage of the variance.

There is no good way to do this comparable to the explicit measure given by PCA. As a rule of thumb, however, you will get significantly diminishing returns for an embedding dimension larger than the n_neighbors value. With that in mind, and since you actually have a downstream task, it makes the most sense to build a pipeline to the downstream task evaluation and look at cross validation over the number of UMAP dimensions.

Related

What's the most efficient way to store a subset of column indices of a big matrix in C++?

I am working with a very big matrix X (say, 1,000-by-1,000,000). My algorithm goes as follows:
1. Scan the columns of X one by one and, based on some filtering rules, identify the subset of columns that are needed. Denote the set of indices of those columns by S. Its size depends on the filter, so it is unknown before the computation and will change if the filtering rules are different.
2. Loop over S and do some computation with column x_i if i is in S. This step needs to be parallelized with OpenMP.
3. Repeat steps 1 and 2 100 times with changed filtering rules, defined by a parameter.
I am wondering what the best way is to implement this procedure in C++. Here are two ways I can think of:
(a) Use a 0-1 array (with length 1,000,000) to indicate needed columns for Step 1 above; then in Step 2 loop over 1 to 1,000,000, use if-else to check indicator and do computation if indicator is 1 for that column;
(b) Use std::vector for S and push_back the column index if identified as needed; then only loop over S, each time extract column index from S and then do computation. (I thought about using this way, but it's said push_back is expensive if just storing integers.)
Since my algorithm is very time-consuming, I assume a little time saving in the basic step would mean a lot overall. So my question is, should I try (a) or (b) or other even better way for better performance (and for working with openMP)?
Any suggestions or comments for achieving a better speedup are much appreciated. Thank you very much!

To me, it seems that "step #1 really does not matter much." (At the end of the day, you're going to wind up with: "a set of columns, however represented.")
To me, what's really going to matter is: "just what's gonna happen when you unleash ('parallelized ...') step #2."
"An array of 'ones and zeros,'" however large, should be fairly simple for parallelization, while a more-'advanced' data structure might well, in this case, "just get in the way."
"One million bits, these days?" Sure. Done. No problem. ("And if not, a simple array of bit-sets.") However-many simultaneously executing entities should be able to navigate such a data structure, in parallel, with a minimum of conflict . . . Therefore, to my gut, "big bit-sets win."

I think you will find std::vector easier to use. Regarding push_back, the cost arises when the vector reallocates (and possibly copies) its data. To avoid that (if it matters), you could call vector::reserve(1,000,000) up front so that the capacity is already large enough. Your vector is then 8 MB (with 8-byte indices), insignificant compared to your problem size. It's only one to two orders of magnitude bigger than a bitmap would be, and a lot simpler to deal with: if we call your vector S, then access to the i-th interesting column is just x[S[i]].

(Based on my gut feeling) I'd probably go for pushing back into a vector, but the answer is quite simple: Measure both methods (they are both trivial to implement). Most likely you won't see a noticeable difference.
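
A minimal sketch of option (b), assuming the matrix is stored column-major in one contiguous buffer. The names select_columns and column_sums, the predicate keep, and the column sum used as the per-column computation are all hypothetical placeholders for the real filter and the real work:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1: collect the indices of the columns that pass the filter.
// `keep` is a hypothetical predicate standing in for the filtering rule.
template <class Pred>
std::vector<std::size_t> select_columns(std::size_t n_cols, Pred keep) {
    std::vector<std::size_t> S;
    S.reserve(n_cols);  // one allocation up front, so push_back never reallocates
    for (std::size_t j = 0; j < n_cols; ++j)
        if (keep(j)) S.push_back(j);
    return S;
}

// Step 2: visit only the selected columns, in parallel.
// Assumption: X is column-major with n_rows entries per column.
// The column sum is a placeholder for the real computation.
std::vector<double> column_sums(const std::vector<double>& X, std::size_t n_rows,
                                const std::vector<std::size_t>& S) {
    assert(n_rows > 0 && X.size() % n_rows == 0);
    std::vector<double> out(S.size(), 0.0);
    #pragma omp parallel for schedule(static)
    for (long k = 0; k < static_cast<long>(S.size()); ++k) {
        const double* col = X.data() + S[k] * n_rows;
        double s = 0.0;
        for (std::size_t r = 0; r < n_rows; ++r) s += col[r];
        out[k] = s;  // each iteration writes its own slot, so no locking is needed
    }
    return out;
}
```

Because the loop runs over S rather than over all 1,000,000 columns, every thread gets real work and there is no branch on an indicator inside the parallel region. Whether that beats the 0-1 array in practice is something to measure, as suggested above.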

Query re. how to set up an SVM, which SVM variation … and how to define a metric

I’d like to learn how best to set up an SVM in OpenCV (or another C++ library) for my particular problem (or whether there is in fact a more appropriate algorithm).
My goal is to receive a weighting of how well an input set of labeled points on a 2D plane compares or fits with a set of ‘ideal’ sets of labeled 2D points.
I hope my illustrations make this clear – the first three boxes labeled A through C, indicate different ideal placements of 3 points, in my illustrations the labelling is managed by colour:
The second graphic gives examples of possible inputs:
If I then pass for instance example input set 1 to the algorithm it will compare that input set with each ideal set, illustrated here:
I would suggest that most observers would agree that the example input 1 is most similar to ideal set A, then B, then C.
My problem is to get not only this ordering out of an algorithm, but ideally also a weighting of how similar the input is to A relative to B and C.
For the example given it might be something like:
A:60%, B:30%, C:10%
Example input 3 might yield something such as:
A:33%, B:32%, C:35% (i.e. different order, and a less 'determined' result)
My end goal is to interpolate between the ideal settings using these weights.
To get the ordering, I’m guessing the ‘cost’ of fitting the inputs to each set may simply have been compared anyway (?). If so, could this cost be used to find the weighting? Or is it perhaps non-linear, so that some kind of transformation needs to happen (even though relative comparisons were obviously fine for determining the order)?
Am I on track?
Direct question>> is the OpenCV SVM appropriate? - or more specifically:
A series of separated binary SVM classifiers for each ideal state and then a final ordering somehow ? (i.e. what is the metric?)
A version of an SVM such as multiclass, structured and so on from another library? (...that I still find hard to conceptually grasp as the examples seem so unrelated)
Another critical component I’m not fully grasping yet is how to define what determines a good fit between any example input set and an ideal set. I was thinking of Euclidean distance, simply summing the distances. What about outliers? My vector calculus needs a brush-up, but maybe dot products could nose in there somewhere?
Direct question>> How best to define a metric that describes a fit in this case?
The real case would have 10~20 points per set and, time permitting, as many 'ideal' sets of points as possible; let's go with 30 for now. Could I expect to get away with ~2 ms per iteration on a reasonable machine (MacBook Pro), or does this kind of thing blow up?
(disclaimer, I have asked this question more generally on Cross Validated, but there isn't much activity there (?))

How to create a vector containing an (artificially generated) Gaussian (normal) distribution?

Suppose I have data (a daily stock chart is a good example, but it could be anything) in which I only know the range (high - low) within which X units sold, but not the exact price at which any given item sold. Assume for simplicity that the price range contains enough buckets (e.g. forty one-cent increments for a 40 cent range) to make such a distribution practical. How can I go about distributing those items to form a normal bell curve stored in a vector? It doesn't have to be perfect, just realistic.
My (very) naive thinking has been to assume that, since random numbers should form a normal distribution, I can do something like use a binary RNG. If, for example, there are forty buckets, then if a '0' comes up 40 times the 0th bucket gets incremented, and if a '1' comes up forty times in a row the 39th bucket gets incremented. If '1' comes up 20 times, the item lands in the middle of the vector. Do this for each item until X units have been accounted for. This may or may not be right, and in any case seems far more inefficient than necessary. I am looking for something more sensible.
This isn't homework, just a problem that has been bugging me and my statistics is not up to snuff. Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
I want to write this in C++, so pre-packaged solutions in R or MATLAB or whatnot are not too useful for me.
Thanks. I hope this made sense.

"Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one."
There's tons of literature on how to create one. The Box–Muller transform, the Marsaglia polar method (a variant of Box-Muller), and the Ziggurat algorithm are three. (Google those terms). Both Box-Muller methods are easy to implement.
Better yet, just use an existing random generator that implements one of these algorithms. Both Boost and C++11 provide one (in C++11, std::normal_distribution in the <random> header).
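
As a rough sketch of the library route applied to the bucket problem in the question, using std::normal_distribution from C++11. The function name gaussian_buckets, the fixed seed, and the choice to centre the bell on the middle bucket with the range spanning about three standard deviations either side are all assumptions made for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Distribute n_items over n_buckets so that the counts follow a bell curve.
std::vector<int> gaussian_buckets(int n_buckets, int n_items, unsigned seed = 42) {
    assert(n_buckets > 0 && n_items >= 0);
    std::mt19937 rng(seed);
    // Assumption: mean at the middle bucket, range covering roughly +/- 3 sigma.
    const double mean = (n_buckets - 1) / 2.0;
    const double sigma = n_buckets / 6.0;
    std::normal_distribution<double> dist(mean, sigma);

    std::vector<int> buckets(n_buckets, 0);
    int placed = 0;
    while (placed < n_items) {
        const long b = std::lround(dist(rng));   // nearest bucket index
        if (b < 0 || b >= n_buckets) continue;   // redraw samples outside the range
        ++buckets[b];
        ++placed;
    }
    return buckets;
}
```

Redrawing out-of-range samples truncates the tails slightly, which seems acceptable given that the result only has to look realistic.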

The algorithm that you describe relies on the Central Limit Theorem, which says that the suitably normalized sum of n independent random variables drawn from the same distribution (with finite variance) approaches a normal distribution as n grows to infinity. Uniformly distributed pseudorandom variables that come from a computer PRNG are a special case of this general theorem.
To get a more efficient algorithm, you can view the probability density function as a sort of space warp that expands the real axis in the middle and shrinks it towards the ends.
Let F: R -> [0:1] be the cumulative function of the normal distribution, invF be its inverse and x be a random variable uniformly distributed on [0:1] then invF(x) will be a normally distributed random variable.
All you need to implement this is to be able to compute invF(x). Unfortunately, this function cannot be expressed in terms of elementary functions. In fact, it is a solution of a nonlinear differential equation. However, you can efficiently solve the equation x = F(y) for y using Newton's method.
What I have described is a simplified presentation of the Inverse transform method. It is a very general approach. There are specialized algorithms for sampling from the normal distribution that are more efficient. These are mentioned in the answer of David Hammen.
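
A minimal sketch of the inverse transform idea for the standard normal distribution, solving x = F(y) with Newton's method. The function names, the starting point y = 0, the iteration cap and the tolerance are assumptions of this sketch. F is written with std::erfc, and its derivative is the normal density:

```cpp
#include <cassert>
#include <cmath>

// Standard normal CDF F(y), written with erfc for accuracy in the tails.
double normal_cdf(double y) { return 0.5 * std::erfc(-y / std::sqrt(2.0)); }

// Standard normal density, i.e. the derivative F'(y).
double normal_pdf(double y) {
    const double inv_sqrt_2pi = 0.3989422804014327;
    return inv_sqrt_2pi * std::exp(-0.5 * y * y);
}

// invF(x): solve x = F(y) for y by Newton iteration starting from y = 0.
double inv_normal_cdf(double x) {
    assert(x > 0.0 && x < 1.0);
    double y = 0.0;
    for (int iter = 0; iter < 100; ++iter) {
        const double step = (normal_cdf(y) - x) / normal_pdf(y);
        y -= step;
        if (std::abs(step) < 1e-12) break;  // converged
    }
    return y;
}
```

Feeding inv_normal_cdf a uniform draw from the open interval (0, 1) then yields a standard normal sample, which can be scaled and shifted to the desired mean and standard deviation.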

Identifying local minima in a histogram

I'm interested in finding the local minima in a histogram that roughly resembles
I'd want to find the local minimum at 109.258, and the easiest way to do so would be to check whether the number of counts at 109.258 is lower than the average number of counts in some interval around (and including) 109.258. Identifying this interval is the most difficult part for me.
As for the source of this data, it's a histogram with 100 bins of non-uniform width. Each bin has a value (shown on the x-axis), and a count of the samples falling into that bin (shown on the y-axis). What I'm trying to do is find the "best" place to split the histogram. Each side of the split is propagated down a binary tree, as part of a classification algorithm.
I'm thinking that my best course of action would be to try to fit a curve to this histogram, using something like the Levenberg-Marquardt algorithm and then to compare the local minima to find the "best" one. A proper measure of "best" would include some indication of the significance of that split, which is measured as the difference between the average counts in the interval to the left and the average of the counts in the interval to the right, and then maybe weight each difference with the number of counts included to get a composite measurement of "best," if that makes sense.
Either way, computational complexity of the algorithm isn't a huge issue, 100 bins is about the maximum number I'd expect to encounter. However, this calculation will be performed once for each sample, so keeping it linear with respect to the number of bins would, of course, be ideal.
By the way, I'm doing everything in C++, and make use of the boost libraries and STL, so nothing is off-limits in that regard.
Any thoughts or insights concerning best practices would be greatly appreciated!

If I understand correctly, kmore wants to partition two "peaks" based on the largest separation (product of histogram count and bin distance). If this is true:
1. Find all maxima.
2. For each maximum, build rectangles as in the figure.
3. Find the rectangle with the largest white area, which gives you the x range in which to find the desired bin, 109.258.

Levenberg–Marquardt is not a good choice in a rugged optimization terrain, and yours is pretty rugged. There are lots of local minima there. Levenberg–Marquardt might well find the local minimum at about 100. Or it might find one of the two global minima at the extremes of the graph, where the function tails off to zero.
You want something that finds the most significant local minimum. For example, some kind of clustering algorithm. Here is a very simple one:
Step 1:
Find the local extrema in your data set. These are the extrema at the extremes of the range plus the internal local minima and maxima. With your histogram you should have an odd number of such extrema, alternating between minima and maxima.
Step 2:
Find the pair with the smallest delta. This will either be a (local max, local min) or a (local min, local max) pair.
Step 3:
Find a pair of elements to remove, one of:
1. The pair found by step 2
2. The pair comprising the first element of the pair from step 2 and its predecessor
3. The pair comprising the last element of the pair from step 2 and its successor
When the found pair includes a boundary point you should use option 2 or 3, as appropriate. For an internal pair, you might want to use some heuristics in choosing between the three choices. Or you could just do the simple thing and use the found pair.
Step 4:
Delete the pair of elements from step 3, keeping track of the deleted pair.
Step 5:
Repeat steps 2 to 4 until there are only three elements left in the extrema data set (the extremes of the range plus a local max, hopefully the global max).
The last-removed minimum is what you want.
There are lots of other clustering algorithms. The one I presented is rather crude and obviously isn't particularly fast. One that extends nicely to a lot more data, and higher dimensional data is the Expectation Maximization algorithm. Simulated annealing (Metropolis-Hastings) could also be adapted to this problem.
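
One possible reading of steps 1 to 5 as C++, offered as a sketch and not as the definitive form of the procedure. It takes the "simple thing" in step 3 for internal pairs and shifts the pair inward when it touches a boundary point. The function name and the convention of returning -1 when nothing was removed are assumptions:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the bin index of the last-removed local minimum,
// or -1 if no pair was ever removed.
int most_significant_minimum(const std::vector<double>& h) {
    assert(h.size() >= 3);
    const int n = static_cast<int>(h.size());

    // Step 1: both end bins plus every interior turning point.
    std::vector<int> ext{0};
    int prev_dir = 0;
    for (int i = 1; i < n; ++i) {
        const double d = h[i] - h[i - 1];
        if (d == 0.0) continue;  // skip flat runs
        const int dir = d > 0.0 ? 1 : -1;
        if (prev_dir != 0 && dir != prev_dir) ext.push_back(i - 1);
        prev_dir = dir;
    }
    ext.push_back(n - 1);

    int last_min = -1;
    // Step 5: repeat until at most three extrema remain.
    while (ext.size() > 3) {
        // Step 2: adjacent pair with the smallest difference in counts.
        std::size_t k = 0;
        double best = std::abs(h[ext[1]] - h[ext[0]]);
        for (std::size_t j = 1; j + 1 < ext.size(); ++j) {
            const double delta = std::abs(h[ext[j + 1]] - h[ext[j]]);
            if (delta < best) { best = delta; k = j; }
        }
        // Step 3: keep the boundary points by shifting the pair inward.
        if (k == 0) k = 1;                             // option 3
        else if (k + 2 == ext.size()) k = k - 1;       // option 2
        // Step 4: record the minimum of the pair, then delete both elements.
        last_min = h[ext[k]] < h[ext[k + 1]] ? ext[k] : ext[k + 1];
        ext.erase(ext.begin() + k, ext.begin() + k + 2);
    }
    return last_min;
}
```

The returned value is a bin index, so for non-uniform bins it still has to be mapped back to the bin's x value. The repeated scan is quadratic in the number of extrema, which should be negligible at 100 bins.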

The problem can, of course, be transformed into one of peak finding by functional manipulation of the data (inversion or negation are obvious candidates).
Alternatively, if the example is typical, one might begin with peak finding in the untransformed data and seek regions where the peaks are (relatively) widely separated as candidates for containing a good local minimum.
I am forever recommending the method used by the ROOT TSpectrum classes for peak finding.
The underlying algorithm is discussed in detail in
M.Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
M.Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
M.Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 443 (2000) 108-125.
Copies of these papers are maintained on the ROOT web site and linked in the TSpectrum documentation for those that do not have a subscription to NIM A.

What you want seems to be more complicated than just a local minimum. Also, the local minimum concept depends strongly on your choice of bins.
Have you heard about Otsu's method? It might be more along the lines of what you want.
Here's another Otsu's method link.
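
For reference, a minimal sketch of Otsu's method applied to a histogram given as bin values and counts. It picks the split that maximizes the between-class variance w0 * w1 * (mu0 - mu1)^2. The function name and the use of one representative value per bin (which also covers bins of non-uniform width) are assumptions of this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns k such that splitting the bins into [0, k] and [k+1, n-1]
// maximizes the between-class variance.
std::size_t otsu_split(const std::vector<double>& value,
                       const std::vector<double>& count) {
    assert(value.size() == count.size() && value.size() >= 2);
    double total_w = 0.0, total_s = 0.0;
    for (std::size_t i = 0; i < value.size(); ++i) {
        total_w += count[i];
        total_s += count[i] * value[i];
    }
    std::size_t best_k = 0;
    double best_var = -1.0;
    double w0 = 0.0, s0 = 0.0;  // weight and weighted sum of the left class
    for (std::size_t k = 0; k + 1 < value.size(); ++k) {
        w0 += count[k];
        s0 += count[k] * value[k];
        const double w1 = total_w - w0;
        if (w0 <= 0.0 || w1 <= 0.0) continue;  // one side is empty
        const double diff = s0 / w0 - (total_s - s0) / w1;  // mu0 - mu1
        const double var = w0 * w1 * diff * diff;
        if (var > best_var) { best_var = var; best_k = k; }
    }
    return best_k;
}
```

This is a single pass over the bins after the totals are computed, so it stays linear in the number of bins.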

How to select an unlike number in an array in C++?

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those that stick out like sore thumbs. How should I approach it? As an example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers are the y-values of the pulse.
The point of this is to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to find a better method for grounding than averaging the first N points in the sample.

Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data? The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things with some fairly minor adjustments to the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?

If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (ie, you're going to need two passes over the data, one to find the median - which is almost as slow as sorting for floating point data, the other to calculate the average), but requires less overhead at the time of working out the average at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes = 256 possible values), you could use 256 histogram 'bins' and make a single pass over your data, counting the values that go in each bin; then it's really easy to find the median, approximate the mean, remove outliers, etc. This would be my preferred option if you could afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.

A quick way might be to take the median, and then take the average of the numbers that are not too far from the median.
"Not too far" depends on your project.
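
A minimal sketch of that idea, using std::nth_element to find the median without a full sort. The function name and the absolute tolerance parameter tol are hypothetical, and what counts as "not too far" is left to the caller:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Average of all values within `tol` of the median.
// Takes the vector by value because nth_element reorders it.
double mean_near_median(std::vector<double> v, double tol) {
    assert(!v.empty() && tol >= 0.0);
    auto mid = v.begin() + v.size() / 2;
    std::nth_element(v.begin(), mid, v.end());  // median ends up at *mid
    const double median = *mid;

    double sum = 0.0;
    std::size_t count = 0;
    for (double x : v) {
        if (std::abs(x - median) <= tol) {
            sum += x;
            ++count;
        }
    }
    return sum / count;  // count >= 1: the median itself always qualifies
}
```

For an even number of entries this uses the upper of the two middle values as the median, which should not matter for rejecting gross outliers.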

A good rule of thumb for determining likely outliers is to calculate the interquartile range (IQR = Q3 - Q1); any value more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile is then treated as an outlier.
This is the basic method many statistics systems (like R) use to automatically detect outliers.
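
A sketch of the 1.5*IQR rule in C++. Quartiles can be defined in several ways. This version assumes linear interpolation between neighbouring order statistics, and the function names are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantile of already-sorted data by linear interpolation
// between neighbouring order statistics (an assumed convention).
double quantile_sorted(const std::vector<double>& s, double p) {
    const double pos = p * (s.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    const std::size_t hi = std::min(lo + 1, s.size() - 1);
    return s[lo] + (pos - lo) * (s[hi] - s[lo]);
}

// Mean of the values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
double mean_without_iqr_outliers(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const double q1 = quantile_sorted(v, 0.25);
    const double q3 = quantile_sorted(v, 0.75);
    const double fence = 1.5 * (q3 - q1);

    double sum = 0.0;
    std::size_t count = 0;
    for (double x : v) {
        if (x >= q1 - fence && x <= q3 + fence) {
            sum += x;
            ++count;
        }
    }
    return sum / count;
}
```

With the eight example values from the question, the upper fence falls below 10.5, so both 10.5 and 11.2 are excluded from the average.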

Any method that is statistically sound and a good way to approach this (Dark Eru, Daniel White) will be too computationally intensive to repeat, and I think I've found a workaround that will allow later correction (meaning, leave it un-grounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.

Here's a quick and dirty method that I've used before (it works well if there are very few outliers at the beginning and you don't have very complicated conditions for what constitutes an outlier).
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple of minutes.
avgX = Array[0] // initialize the running average with the first point
sumX = Array[0] // running sum of the accepted points
N = length(Array)
percentDeviation = 0.3 // fractional deviation acceptable for non-outliers
count = 1
foreach x in Array[1..N-1]
    tol = abs(avgX) * percentDeviation // abs() keeps the band valid for negative averages
    if x < avgX + tol
    and x > avgX - tol
        count++
        sumX += x
        avgX = sumX / count
    endif
endfor
return avgX
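
Since the question asks for C++, here is one way the running-average pseudocode above might be written. The function name and the default of 0.3 for the allowed fractional deviation are placeholders, and taking the absolute value of the average (so the band also works for negative baselines) is an assumption of this sketch:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running average that skips points too far from the current average.
double running_average_without_outliers(const std::vector<double>& a,
                                         double frac = 0.3) {
    assert(!a.empty());
    double sum = a[0];   // running sum of accepted points
    double avg = a[0];   // running average, seeded with the first point
    std::size_t count = 1;
    for (std::size_t i = 1; i < a.size(); ++i) {
        const double tol = std::abs(avg) * frac;  // width of the acceptance band
        if (std::abs(a[i] - avg) < tol) {
            sum += a[i];
            ++count;
            avg = sum / count;
        }
    }
    return avg;
}
```

As noted above, the result depends on the first point being representative: if the array starts with an outlier, the band is centred in the wrong place.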
