How to select an unlike number in an array in C++? - c++

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? For example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers are the y-values of the pulse.
The point of this is to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to get a better method for grounding than averaging the first N points in the sample.

Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data? The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things with some fairly minor adjustments to the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?

If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (i.e., you're going to need two passes over the data: one to find the median, which is almost as slow as sorting for floating point data, and the other to calculate the average), but it requires less overhead at the time of working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes = 256 possible values), you could use 256 histogram 'bins' with a single pass over your data, counting the values that go in each bin; then it's really easy to find the median / approximate the mean / remove outliers, etc. This would be my preferred option, if you could afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.
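As a rough illustration of the two-pass median approach mentioned above, here is a minimal sketch. The function name averageNearMedian and the tolerance parameter are my own assumptions, not anything from ROOT; how wide the tolerance should be depends on your data.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: average of all values lying within `tolerance`
// of the median. Returns NaN for an empty input.
double averageNearMedian(const std::vector<double>& values, double tolerance)
{
    if (values.empty())
        return std::numeric_limits<double>::quiet_NaN();

    // Pass 1: find the median with a partial sort on a copy (O(N) on average).
    std::vector<double> work(values);
    const auto mid = work.begin() + static_cast<std::ptrdiff_t>(work.size() / 2);
    std::nth_element(work.begin(), mid, work.end());
    const double median = *mid;  // upper median when the size is even

    // Pass 2: average only the values close to the median.
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : values) {
        if (std::fabs(v - median) <= tolerance) {
            sum += v;
            ++count;
        }
    }
    return sum / static_cast<double>(count);  // count >= 1: the median itself qualifies
}
```

With the example array from the question and a tolerance of 1.0, this averages the six values near 2.3 and ignores 10.5 and 11.2.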

A quick way might be to take the median, and then take the average of the numbers that are not too far from the median.
"Not too far" depends on your project.

A good rule of thumb for determining likely outliers is to calculate the Interquartile Range (IQR); any values that lie more than 1.5*IQR below the first quartile or above the third quartile are then treated as outliers.
This is the basic method many statistics systems (like R) use to automatically detect outliers.

Any method that is statistically significant and a good way to approach it (Dark Eru, Daniel White) will be too computationally intense to repeat, and I think I've found a work around that will allow later correction (meaning, leave it un-grounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.

Here's a quick and dirty method that I've used before (works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier).
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple of minutes.
avgX = Array[0] // initialize the average with the first point
sumX = Array[0] // running sum of the accepted points
N = length(Array)
percentDeviation = 0.3 // fractional deviation acceptable for non-outliers (0.3 = 30%)
count = 1
foreach x in Array[1..N-1]
    if x < avgX + abs(avgX)*percentDeviation
    and x > avgX - abs(avgX)*percentDeviation
        count++
        sumX += x
        avgX = sumX / count
    endif
endfor
return avgX
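For reference, a possible C++ rendering of the pseudocode above. This is only a sketch: the function name runningFilteredAverage is my own, and the tolerance is taken from the magnitude of the running average so that a negative baseline also works.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: single-pass average that skips points deviating
// from the running average by more than `fraction` of its magnitude.
// Assumes the first point is not an outlier. Returns NaN for an empty input.
double runningFilteredAverage(const std::vector<double>& data, double fraction = 0.3)
{
    if (data.empty())
        return std::numeric_limits<double>::quiet_NaN();

    double sum = data[0];  // running sum of accepted points
    double avg = data[0];  // running average, seeded with the first point
    std::size_t count = 1;

    for (std::size_t i = 1; i < data.size(); ++i) {
        const double tolerance = std::fabs(avg) * fraction;
        if (data[i] < avg + tolerance && data[i] > avg - tolerance) {
            ++count;
            sum += data[i];
            avg = sum / static_cast<double>(count);
        }
    }
    return avg;
}
```

Because the tolerance is relative, this breaks down when the baseline is close to zero; an absolute tolerance would be the safer choice there.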

Related

What is the fastest algorithm to find the point from a set of points, which is closest to a line?

I have:
- a set of points of known size (in my case, only 6 points)
- a line characterized by x = s + t * r, where x, s and r are 3D vectors
I need to find the point closest to the given line. The actual distance does not matter to me.
I had a look at several different questions that seem related (including this one) and know how to solve this on paper from my high school math classes. But I cannot find a solution without calculating every distance, and I am sure there has to be a better/faster way. Performance is absolutely crucial in my application.
One more thing: All numbers are integers (coordinates of points and elements of s and r vectors). Again, for performance reasons I would like to keep the floating-point math to a minimum.
You have to process every point at least once to know their distance. Unless you want to repeat the process many times with different lines, simply computing the distance of every point is unavoidable. So the algorithm has to be O(n).
Since you don't care about the actual distance, we can make some simplification to the point-distance computation. The exact distance is computed by (source):
d^2 = |r⨯(p-s)|^2 / |r|^2
where ⨯ is the cross product and |r|^2 is the squared length of vector r. Since |r|^2 is constant for all points, we can omit it from the distance computation without changing which point comes out as the closest:
d^2 = |r⨯(p-s)|^2
Compare these scaled squared distances and keep the minimum. The advantage of this formula is that you can do everything with integers, since you mentioned that all coordinates are integers.
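A minimal integer-only sketch of this comparison. The type alias Vec3 and both function names are assumptions for illustration; 64-bit integers are used on the assumption that the coordinates are small enough for the squared cross product not to overflow.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec3 = std::array<std::int64_t, 3>;

// |r x (p - s)|^2: the squared distance from p to the line x = s + t*r,
// scaled by the constant factor |r|^2.
std::int64_t scaledSquaredDistance(const Vec3& p, const Vec3& s, const Vec3& r)
{
    const std::int64_t dx = p[0] - s[0];
    const std::int64_t dy = p[1] - s[1];
    const std::int64_t dz = p[2] - s[2];
    const std::int64_t cx = r[1] * dz - r[2] * dy;
    const std::int64_t cy = r[2] * dx - r[0] * dz;
    const std::int64_t cz = r[0] * dy - r[1] * dx;
    return cx * cx + cy * cy + cz * cz;
}

// Index of the point closest to the line. Requires a non-empty vector.
std::size_t closestPointIndex(const std::vector<Vec3>& points, const Vec3& s, const Vec3& r)
{
    std::size_t best = 0;
    std::int64_t bestDistance = scaledSquaredDistance(points[0], s, r);
    for (std::size_t i = 1; i < points.size(); ++i) {
        const std::int64_t d = scaledSquaredDistance(points[i], s, r);
        if (d < bestDistance) {
            bestDistance = d;
            best = i;
        }
    }
    return best;
}
```

This still visits every point once, as argued above; it only avoids the division and the square root.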
I'm afraid you can't get away with computing fewer than 6 distances (if you could, at least one point would be left out, possibly the nearest one).
See if it makes sense to preprocess: Is the line fixed and the points vary? Consider rotating coordinates to make the line horizontal.
As there are few points, it is doubtful that this is your bottleneck. Measure where the hot spots are, redesign algorithms/data representation, spice up compiler optimization, compile to assembly and bum that. Strictly in that order.
Jon Bentley's "Writing Efficient Programs" (sadly long out of print) and "Programming Pearls" (2nd edition) are full of advice on practical programming.

OpenCV kmean: how to choose decent values for COUNT and EPS?

I am trying to use the kmean function in OpenCV to pre-classify 36000 sample images into 100+ classes (to reduce my work to prepare train data for supervised learning). In this function there are two parameters which I do not really understand: cv::TermCriteria::EPS and cv::TermCriteria::COUNT.
cv::kmeans(dataset.t(), K, kmean_labels, cv::TermCriteria( cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 10, 1.0),
3, cv::KMEANS_PP_CENTERS, kmean_centers);
In OpenCV documents, it explains that:
cv::TermCriteria::EPS: the desired accuracy or change in parameters at which the iterative algorithm stops.
cv::TermCriteria::COUNT: the maximum number of iterations or elements to compute.
The explanation above is not quite clear to me. Can anyone explain it in more detail and show how to find good values for COUNT and EPS?
Thank you very much.
There are no magical numbers that will fit all applications (otherwise they wouldn't be parameters).
Kmeans is an iterative algorithm, which will move towards an optimum and each iteration should get better, but you need to tell your algorithm when to stop.
Using cv::TermCriteria::COUNT, you tell the algorithm: you can perform x iterations, then stop. But this doesn't guarantee you any precision.
Using cv::TermCriteria::EPS, you tell the algorithm to continue iterating until the difference between two successive iterations becomes sufficiently small. The parameter EPS tells the algorithm how small this difference should become. This depends of course on the dataset that you are feeding to the algorithm. Suppose you multiply all your data points by 10; then EPS should change accordingly (for cv::kmeans the documentation describes the criterion as each cluster center moving by less than epsilon, so EPS is in the same units as your data).
When you use both parameters, you tell the algorithm to stop when either condition is fulfilled; for example: stop iterating when the difference between two successive iterations is smaller than 0.1, OR when you have done 10 iterations.
In conclusion: only analysis of your datasets and trial and error can give you decent values...
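To make the two stopping rules concrete, here is a small self-contained sketch of k-means on 1-D data. It does not use OpenCV: StopCriteria and kmeans1d are hypothetical names, and the loop only mimics the idea of "stop after maxCount iterations OR when no center moves by more than epsilon".

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a COUNT + EPS termination criterion.
struct StopCriteria {
    int maxCount;    // maximum number of iterations (COUNT)
    double epsilon;  // stop once no center moves by more than this (EPS)
};

// Lloyd's iterations on 1-D data. `centers` holds the initial guesses and
// is updated in place. Returns the number of iterations performed.
int kmeans1d(const std::vector<double>& data, std::vector<double>& centers, StopCriteria stop)
{
    int iterations = 0;
    while (iterations < stop.maxCount) {  // COUNT criterion
        ++iterations;
        std::vector<double> sum(centers.size(), 0.0);
        std::vector<std::size_t> members(centers.size(), 0);

        // Assignment step: each value goes to its nearest center.
        for (double x : data) {
            std::size_t best = 0;
            for (std::size_t k = 1; k < centers.size(); ++k) {
                if (std::fabs(x - centers[k]) < std::fabs(x - centers[best]))
                    best = k;
            }
            sum[best] += x;
            ++members[best];
        }

        // Update step: move each center to the mean of its members.
        double maxShift = 0.0;
        for (std::size_t k = 0; k < centers.size(); ++k) {
            if (members[k] == 0)
                continue;  // an empty cluster keeps its old center
            const double updated = sum[k] / static_cast<double>(members[k]);
            maxShift = std::max(maxShift, std::fabs(updated - centers[k]));
            centers[k] = updated;
        }

        if (maxShift <= stop.epsilon)  // EPS criterion
            break;
    }
    return iterations;
}
```

Running it on the same data with a tight and a loose epsilon, and with a small and a large maxCount, shows directly which of the two criteria ends the loop for your data.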

c++ discrete distribution sampling with frequently changing probabilities

Problem: I need to sample from a discrete distribution constructed of certain weights e.g. {w1,w2,w3,..}, and thus probability distribution {p1,p2,p3,...}, where pi=wi/(w1+w2+...).
Some of the wi's change very frequently, but only a very small proportion of all wi's. The distribution therefore has to be renormalised every time this happens, and so I believe the alias method does not work efficiently, because one would need to rebuild the whole distribution from scratch every time.
The method I am currently thinking of is a binary tree (heap method), where all wi's are saved in the lowest level, then the sums of each pair in the next level up, and so on. The sum of all of them will be in the highest level, which is also the normalisation constant. Thus, in order to update the tree after a change in wi, one needs to do log(n) changes, and the same amount of work to get a sample from the distribution.
Question:
Q1. Do you have a better idea on how to achieve it faster?
Q2. The most important part: I am looking for a library which has already done this.
explanation: I have done this myself several years ago, by building heap structure in a vector, but since then I have learned many things including discovering libraries ( :) ), and containers such as map... Now I need to rewrite that code with higher functionality, and I want to make it right this time:
So Q2.1: is there a nice way to make a C++ map ordered and searched not by index, but by a cumulative sum of its elements (this is how we sample, right?). (That is my current theory of how I would like to do it, but it doesn't have to be this way...)
Q2.2 Maybe there is some even nicer way to do the same? I would believe this problem is so frequent that I am very surprised I could not find some sort of library which would do it for me...
Thank you very much, and I am very sorry if this has been asked in some other form, please direct me towards it, but I have spent a good while looking...
-z
Edit: There is a possibility that I might need to remove or add the elements as well, but I think I could avoid it, if that makes a huge difference, thus leaving only changing the value of the weights.
Edit2: weights are reals in general, I would have to think if I could make them integers...
I would actually use a hash set of strings (don't remember the C++ container for it, you might need to implement your own though). Put wi elements for each i, with the values "w1_1", "w1_2",... all through "w1_[w1]" (that is, w1 elements starting with "w1_").
When you need to sample, pick an element at random using a uniform distribution. If you picked w5_*, say you picked element 5. Because of the number of elements in the hash, this will give you the distribution you were looking for.
Now, when wi changes from A to B, just add B-A elements to the hash (if B>A), or remove the last A-B elements of wi (if A>B).
Adding new elements and removing old elements is trivial in this case.
Obviously the problem is 'pick an element at random'. If your hash is a closed hash, you pick an array cell at random, if it's empty - just pick one at random again. If you keep your hash 3 or 4 times larger than the total sum of weights, your complexity will be pretty good: O(1) for retrieving a random sample, O(|A-B|) for modifying the weights.
Another option, since only a small part of your weights change, is to split the weights into two - the fixed part and the changed part. Then you only need to worry about changes in the changed part, and the difference between the total weight of changed parts and the total weight of unchanged parts. Then for the fixed part your hash becomes a simple array of numbers: 1 appears w1 times, 2 appears w2 times, etc..., and picking a random fixed element is just picking a random number.
Updating your normalisation factor when you change a value is trivial. This might suggest an algorithm.
w_sum = w_sum_old - w_i_old + w_i_new;
If you leave p_i as a computed property, p_i = w_i / w_sum, you avoid recalculating the entire p_i array, at the cost of calculating p_i every time it is needed. You would, however, be able to update many statistical properties without recalculating the entire sum:
expected_something = (something_1 * w_1 + something_2 * w_2 + ...) / w_sum;
With a bit of algebra you can update expected_something by subtracting the contribution with the old weight and add the contribution with the new weight, multiplying and dividing with the normalization factors as required.
If you keep track during sampling of which outcomes are part of the sample, it would be possible to propagate the updated probabilities to the generated sample. Would this make it possible for you to update rather than recalculate values related to the sample? I think a bitmap could provide an efficient way to store an index of which outcomes were used to build the sample.
One way of storing the probabilities together with the sums is to start with all the probabilities. In the next N/2 positions you store the sums of the pairs, after that the N/4 sums of those pairs, and so on. Where the sums are located can, obviously, be calculated in O(1) time. This data structure is sort of a heap, but upside down.

How to create a vector containing an (artificially generated) Gaussian (normal) distribution?

Suppose I have data (a daily stock chart is a good example, but it could be anything) in which I only know the range (high - low) that X units sold within, but I don't know the exact price at which any given item sold. Assume for simplicity that the price range contains enough buckets (e.g. forty one-cent increments for a 40 cent range) to make such a distribution practical. How can I go about distributing those items to form a normal bell curve stored in a vector? It doesn't have to be perfect, but it should be realistic.
My (very) naive thinking has been to assume that, since random numbers should form a normal distribution, I can do something like have a binary RNG. If, for example, there are forty buckets, then if a '0' comes up 40 times the 0th bucket gets incremented, and if a '1' comes up forty times in a row then the 39th bucket gets incremented. If '1' comes up 20 times then it is in the middle of the vector. Do this for each item until X units have been accounted for. This may or may not be right and in any case seems way more inefficient than necessary. I am looking for something more sensible.
This isn't homework, just a problem that has been bugging me and my statistics is not up to snuff. Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
I want to write this in c++ so pre-packaged solutions in R or matlab or whatnot are not too useful for me.
Thanks. I hope this made sense.
Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
There's tons of literature on how to create one. The Box–Muller transform, the Marsaglia polar method (a variant of Box-Muller), and the Ziggurat algorithm are three. (Google those terms). Both Box-Muller methods are easy to implement.
Better yet, just use an existing random generator that implements one of these algorithms. Both Boost and C++11 provide one (in C++11, std::normal_distribution in the <random> header).
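A minimal sketch using the C++11 <random> facilities to spread X units over a fixed number of buckets. The function name normalBuckets and the choice of sigma (one sixth of the range, so that the range covers roughly plus or minus three sigma) are my assumptions; samples that fall outside the range are simply redrawn.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: distribute `units` items over `buckets` bins so
// that the bin counts follow an approximately normal bell curve.
std::vector<int> normalBuckets(std::size_t buckets, int units, unsigned seed)
{
    std::vector<int> histogram(buckets, 0);
    if (buckets == 0)
        return histogram;

    std::mt19937 rng(seed);
    const double mean = (static_cast<double>(buckets) - 1.0) / 2.0;  // centre of the range
    const double sigma = static_cast<double>(buckets) / 6.0;         // assumption: range ~ +/- 3 sigma
    std::normal_distribution<double> dist(mean, sigma);

    int placed = 0;
    while (placed < units) {
        const long index = std::lround(dist(rng));
        if (index < 0 || index >= static_cast<long>(buckets))
            continue;  // outside the price range: draw again
        ++histogram[static_cast<std::size_t>(index)];
        ++placed;
    }
    return histogram;
}
```

For example, normalBuckets(40, 10000, 42) returns 40 counts that add up to 10000, with most of the units near the middle of the range.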
The algorithm that you describe relies on the Central Limit Theorem that says that a random variable defined as the sum of n random variables that belong to the same distribution tends to approach a normal distribution when n grows to infinity. Uniformly distributed pseudorandom variables that come from a computer PRNG make a special case of this general theorem.
To get a more efficient algorithm you can view the probability density function as a sort of space warp that expands the real axis in the middle and shrinks it towards the ends.
Let F: R -> [0:1] be the cumulative distribution function of the normal distribution, invF be its inverse and x be a random variable uniformly distributed on [0:1]; then invF(x) will be a normally distributed random variable.
All you need in order to implement this is to be able to compute invF(x). Unfortunately this function cannot be expressed with elementary functions. In fact, it is a solution of a nonlinear differential equation. However, you can efficiently solve the equation x = F(y) using Newton's method.
What I have described is a simplified presentation of the Inverse transform method. It is a very general approach. There are specialized algorithms for sampling from the normal distribution that are more efficient. These are mentioned in the answer of David Hammen.
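A minimal sketch of solving x = F(y) with Newton's method for the standard normal distribution, using std::erfc for F and the density as its derivative. The function names are my own, and the iteration limit and tolerance are arbitrary choices; x must lie strictly between 0 and 1 and should not be extremely close to either end.

```cpp
#include <cmath>

// Cumulative distribution function F of the standard normal distribution.
double normalCdf(double y)
{
    return 0.5 * std::erfc(-y / std::sqrt(2.0));
}

// Density of the standard normal distribution, i.e. the derivative of F.
double normalPdf(double y)
{
    const double pi = 3.14159265358979323846;
    return std::exp(-0.5 * y * y) / std::sqrt(2.0 * pi);
}

// Solve x = F(y) for y by Newton's method, starting from y = 0.
// Requires 0 < x < 1.
double inverseNormalCdf(double x)
{
    double y = 0.0;
    for (int i = 0; i < 100; ++i) {
        const double step = (normalCdf(y) - x) / normalPdf(y);
        y -= step;
        if (std::fabs(step) < 1e-12)
            break;  // converged
    }
    return y;
}
```

Feeding uniformly distributed values from (0, 1) into inverseNormalCdf gives standard normal samples; multiply by sigma and add the mean to get other normal distributions.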

minimum distance between 2 points in c++

I'm given m places (x,y coordinates).
I have n requests to find the closest place to a given point P(x,y) (the minimum Euclidean distance).
How can I solve this problem in less than O(n*m), where n is the number of requests and m the number of places? I could use squared Euclidean distances but it's still n*m.
Try a kd-tree. A high performance library implementation can be found here.
Note: I'm pointing you to an approximate nearest-neighbors search which is optimized for high dimensions. This may be slightly overkill for your application.
Edit:
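For a 2D kd-tree, the build time would be O(m*log(m)) and the query time for all n requests would be O(n*sqrt(m)) in the worst case (typically closer to O(n*log(m)) for nearest-neighbour lookups). This should end up being a net win over the naive solution if your number of queries, n, exceeds log(m).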
The library means you don't have to implement it so the complexity shouldn't be an issue.
If you want to generalize to high dimension extremely fast querying, check out locality sensitive hashing.
Interesting. To reduce the effect of n, I wonder if perhaps it would help to save the result of each request as you encounter and handle it. A clever result table might shortcut the need to calculate sqrt(x^2 + y^2) in solving subsequent requests.
The Nearest-Neighbor-Problem, eh? I found Robert Sedgewick Std Book very useful in these cases. He describes Nearest Neighbour Search, too.