OpenCV kmeans: how to choose decent values for COUNT and EPS? - C++

I am trying to use the kmeans function in OpenCV to pre-classify 36000 sample images into 100+ classes (to reduce my work in preparing training data for supervised learning). In this function there are two parameters which I do not really understand: cv::TermCriteria::EPS and cv::TermCriteria::COUNT.
cv::kmeans(dataset.t(), K, kmean_labels,
           cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 10, 1.0),
           3, cv::KMEANS_PP_CENTERS, kmean_centers);
The OpenCV documentation explains:
cv::TermCriteria::EPS: the desired accuracy or change in parameters at which the iterative algorithm stops.
cv::TermCriteria::COUNT: the maximum number of iterations or elements to compute.
The explanation above is not quite clear to me. Can anyone explain further and show how to find good values for COUNT and EPS?
Thank you very much.

There are no magical numbers that will fit all applications (otherwise they wouldn't be parameters).
k-means is an iterative algorithm: it moves towards an optimum, and each iteration should improve the result, but you need to tell the algorithm when to stop.
Using cv::TermCriteria::COUNT, you tell the algorithm: you may perform x iterations, then stop. But this does not guarantee any particular precision.
Using cv::TermCriteria::EPS, you tell the algorithm to continue iterating until the difference between two successive iterations becomes sufficiently small. The parameter EPS tells the algorithm how small that difference should be. This of course depends on the dataset you are feeding to the algorithm: if you multiply all your data points by 10, then EPS should vary accordingly (quadratically, I suppose, but I am not sure about that).
When you use both parameters, you tell the algorithm to stop as soon as either condition is fulfilled; for example: stop iterating when the difference between two successive runs is smaller than 0.1, OR when 10 iterations have been done.
In conclusion: only analysis of your dataset, and trial and error, can give you decent values...
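One practical aid for that trial and error, sketched below on the assumption that your samples are stored as a CV_32F matrix with one sample per row (the helper name try_criteria is just illustrative): cv::kmeans returns the compactness (the sum of squared distances from each sample to its assigned center), so you can run it with a few different EPS/COUNT settings and compare both the compactness and the runtime.

#include <opencv2/core.hpp>

// Run k-means once with the given termination criteria and report the
// compactness; lower means tighter clusters for the same K.
double try_criteria(const cv::Mat& samples, int K, int maxCount, double eps)
{
    cv::Mat labels, centers;
    return cv::kmeans(samples, K, labels,
                      cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                       maxCount, eps),
                      3, cv::KMEANS_PP_CENTERS, centers);
}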

Related

The most optimized way of calculating distances between data points in C++

I have n points in a 2D plane. I want to calculate the distance between every two points in C++. The position of the m-th point in the plane is (x(m), y(m)). These points change over time. The number of time steps is 10^5.
I wrote the code below, but as n is large (5000) and I want to compute the distances 10^5 times, I'm searching for the most optimized way to do it. Could anyone tell me what the least time-consuming way is?
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        if (i > j)
            r[i][j] = r[j][i];   // the distance matrix is symmetric, reuse the other half
        else
            r[i][j] = std::sqrt((x[i] - x[j]) * (x[i] - x[j]) +
                                (y[i] - y[j]) * (y[i] - y[j]));
    }
}
I know that in Matlab I can do this using the bsxfun function. I would also like to know which one can calculate the distances faster: Matlab or C++?
Regarding Matlab, you also have pdist, which does exactly that (but is not so fast), and you should also read this.
About comparing Matlab and C, first read this and this. Also keep in mind that Matlab, as a desktop program, requires knowing not only a generally efficient way to implement your code but also the right way to do it in Matlab. One example is the difference between built-in and non-built-in functions. Built-in functions are written in FORTRAN or C and run much faster than non-built-in functions. To know if a function is built-in, type:
which function_name
and check if you see "built-in" at the start of the output:
built-in (C:\Program Files\MATLAB\...)
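Not from the answer above, just a hedged C++ sketch of the usual optimizations for this kind of loop: compute only the upper triangle (the matrix is symmetric), store it in a flat vector, and keep squared distances, taking the square root only if true distances are actually needed.

#include <cmath>
#include <vector>

// Squared distances for all pairs i < j, packed row by row.
std::vector<double> pairwise_sq_dist(const std::vector<double>& x,
                                     const std::vector<double>& y)
{
    const std::size_t n = x.size();
    std::vector<double> r;
    r.reserve(n * (n - 1) / 2);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            const double dx = x[i] - x[j], dy = y[i] - y[j];
            r.push_back(dx * dx + dy * dy);   // apply std::sqrt only when needed
        }
    return r;
}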

Reduce the length of a feature vector for comparison

I have a problem where several different objects are each described by a vector of real numbers between 0 and 100, with a length (dimension) of 1000 elements.
I then want to compare a new vector of the same kind against the set of vectors above, using the Mahalanobis distance, to find the most similar one.
My question is:
How can I reduce the length of the vectors to the N most relevant elements (say, 100 of the 1000) without affecting the quality of the answers too much, i.e., so that the distances do not vary too much?
Remember that each vector is a description of a different object, unrelated to others.
I thought about using PCA, but after studying it, I saw that I needed at least two samples per object, or so I understood.
Any idea? In case it matters for the code, I'm using C++ and OpenCV.
Thanks in advance.
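This is not an answer from the thread, only a hedged sketch: if PCA does turn out to be applicable after all, OpenCV's cv::PCA can project the 1000-element rows down to, say, 100 components. The matrix name descriptors and the choice of 100 components are assumptions for illustration.

#include <opencv2/core.hpp>

// descriptors: CV_32F matrix, one 1000-element object vector per row.
cv::Mat reduce_to_100(const cv::Mat& descriptors)
{
    // keep only the 100 principal components (an illustrative choice)
    cv::PCA pca(descriptors, cv::Mat(), cv::PCA::DATA_AS_ROW, 100);
    return pca.project(descriptors);   // each row now has 100 elements
}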

How to create a vector containing an (artificially generated) Gaussian (normal) distribution?

Suppose I have data (a daily stock chart is a good example, but it could be anything) in which I only know the range (high - low) within which X units sold, but I don't know the exact price at which any given item sold. Assume for simplicity that the price range contains enough buckets (e.g. forty one-cent increments for a 40 cent range) to make such a distribution practical. How can I go about distributing those items to form a normal bell curve stored in a vector? It doesn't have to be perfect, but it should be realistic.
My (very) naive thinking has been to assume that since random numbers should form a normal distribution, I can do something like use a binary RNG. If, for example, there are forty buckets, then if a '0' comes up 40 times in a row the 0th bucket gets incremented, and if a '1' comes up 40 times in a row the 39th bucket gets incremented. If '1' comes up 20 times then it lands in the middle of the vector. Do this for each item until X units have been accounted for. This may or may not be right, and in any case seems way more inefficient than necessary. I am looking for something more sensible.
This isn't homework, just a problem that has been bugging me and my statistics is not up to snuff. Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
I want to write this in C++, so pre-packaged solutions in R or Matlab or whatnot are not too useful for me.
Thanks. I hope this made sense.
Most literature seems to be about analyzing the distribution after it already exists but not much about how to artificially create one.
There's tons of literature on how to create one. The Box–Muller transform, the Marsaglia polar method (a variant of Box-Muller), and the Ziggurat algorithm are three. (Google those terms). Both Box-Muller methods are easy to implement.
Better yet, just use a random number generator that already exists and implements one of these algorithms. Both Boost and the new C++11 standard library have such facilities.
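A minimal sketch of the C++11 route mentioned above, assuming forty buckets and that centering the bell on the middle of the range with a standard deviation of one sixth of the range is acceptable (those two choices are assumptions, not part of the question):

#include <algorithm>
#include <random>
#include <vector>

// Distribute num_items draws from a normal distribution into num_buckets bins.
std::vector<int> fill_buckets(int num_buckets, int num_items)
{
    std::mt19937 gen{std::random_device{}()};
    std::normal_distribution<double> dist(num_buckets / 2.0,   // mean: centre of the range
                                          num_buckets / 6.0);  // ~99.7% of draws fall inside
    std::vector<int> buckets(num_buckets, 0);
    for (int i = 0; i < num_items; ++i) {
        int b = static_cast<int>(dist(gen));
        b = std::max(0, std::min(num_buckets - 1, b));         // clamp the rare tail draws
        ++buckets[b];
    }
    return buckets;
}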
The algorithm that you describe relies on the Central Limit Theorem, which says that a random variable defined as the sum of n random variables drawn from the same distribution tends towards a normal distribution as n grows to infinity. Uniformly distributed pseudorandom variables coming from a computer PRNG are a special case of this general theorem.
To get a more efficient algorithm, you can view the probability distribution as some sort of space warp that expands the real axis in the middle and shrinks it towards the ends.
Let F: R -> [0,1] be the cumulative distribution function of the normal distribution, let invF be its inverse, and let x be a random variable uniformly distributed on [0,1]; then invF(x) will be a normally distributed random variable.
All you need to implement this is to be able to compute invF(x). Unfortunately, this function cannot be expressed in terms of elementary functions; in fact, it is a solution of a nonlinear differential equation. However, you can efficiently solve the equation x = F(y) using Newton's method.
What I have described is a simplified presentation of the Inverse transform method. It is a very general approach. There are specialized algorithms for sampling from the normal distribution that are more efficient. These are mentioned in the answer of David Hammen.
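A hedged sketch of that inverse-transform idea, using std::erf for the normal CDF and a plain Newton iteration starting from y = 0; the fixed iteration count is an assumption, and the method loses accuracy for x very close to 0 or 1 (the extreme tails):

#include <cmath>

// Approximate inverse CDF of the standard normal via Newton's method on F(y) = x.
double inv_normal_cdf(double x)   // x uniform in (0, 1)
{
    const double PI = 3.14159265358979323846;
    double y = 0.0;
    for (int i = 0; i < 30; ++i) {
        double F = 0.5 * (1.0 + std::erf(y / std::sqrt(2.0)));   // normal CDF
        double f = std::exp(-0.5 * y * y) / std::sqrt(2.0 * PI); // normal PDF = F'
        y -= (F - x) / f;                                        // Newton step
    }
    return y;   // approximately normally distributed when x is uniform
}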

Fast Median Filter in C / C++ for `UINT16` 2D Array

Does anyone know a fast median filter algorithm for 16-bit (unsigned short) arrays in C++?
http://nomis80.org/ctmf.html
This one seems quite promising, but it only seems to work with byte arrays. Does anyone know either how to modify it to work with shorts or an alternative algorithm?
The technique in the paper relies on creating a histogram with 256 bins for an 8 bit pixel channel. Converting to 16 bits per channel would require a histogram with 65536 bins, and a histogram is required for each column of the image. Inflating the memory requirements by 256 makes this a less efficient algorithm overall, but still probably doable with today's hardware.
Using their proposed optimization of breaking the histogram into coarse and fine sections should further reduce the runtime hit to only 16x.
For small radius values I think you'll find traditional methods of median filtering will be more performant.
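For what it's worth, a hedged sketch of the coarse/fine idea for 16-bit data: keep a 256-bin coarse histogram over the high byte and a 65536-bin fine histogram over the full value, both assumed to already hold the counts of the current window, and walk them to find the median. The function name and signature are illustrative.

#include <cstddef>
#include <cstdint>

// Locate the median of a window of 16-bit values from its coarse/fine histograms.
uint16_t median_from_histograms(const uint32_t coarse[256],
                                const uint32_t fine[65536],
                                std::size_t window_size)
{
    const std::size_t target = (window_size + 1) / 2;   // rank of the median
    std::size_t seen = 0;
    for (int c = 0; c < 256; ++c) {                      // coarse pass over high bytes
        if (seen + coarse[c] >= target) {
            for (int f = c << 8; f < ((c + 1) << 8); ++f) {   // fine pass inside that bucket
                seen += fine[f];
                if (seen >= target)
                    return static_cast<uint16_t>(f);
            }
        }
        seen += coarse[c];
    }
    return 0;   // only reached if the histograms are inconsistent
}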
Fast Median Search - An ANSI C implementation (PDF) is something for C; it's a paper titled "Fast median search: an ANSI C implementation". The author claims it's O(log(n)) and also provides some code, so maybe it will help you. It's not better than the code you suggested, but it may be worth a look.
#include <algorithm>
#include <vector>

std::vector<unsigned short> v{4, 2, 5, 1, 3};
std::vector<unsigned short> h(v.size() / 2 + 1);                 // room for the smallest N/2+1 values
std::partial_sort_copy(v.begin(), v.end(), h.begin(), h.end());
int median = h.back();                                           // largest of those, i.e. the median
Runs in O(N·log(N/2+1)) and does not modify your input.
This article describes a method for median filtering of images that runs in O(log r) time per pixel, where r is the filter radius, and works for any data type (be it 8 bit integers or doubles):
Fast Median and Bilateral Filtering
I know this question is somewhat old but I also got interested in median filtering. If one is working with signals or images, then there will be a large overlap of data for the processing window. This can be taken advantage of.
I've posted some benchmark code here: 1D moving median filtering in C++
It's template based so it should work with most POD data types.
According to my results, std::nth_element has poor performance for a moving median, as it must re-partition the whole window of values each time.
However, using a pool of values that is kept sorted, one can compute the median with 3 operations:
1. Remove the oldest value from the pool (calls std::lower_bound)
2. Insert the new value into the pool (calls std::lower_bound)
3. Store the new value in the history buffer
The median is now the middle value in the pool.
I hope someone finds this interesting and contributes their ideas!
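Here is a hedged, self-contained sketch of that sorted-pool idea (class and member names are mine, and an odd window size is assumed so the middle element is well defined):

#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

class MovingMedian {
    std::vector<unsigned short> pool;    // kept sorted at all times
    std::deque<unsigned short> history;  // the same values in arrival order
    std::size_t window;
public:
    explicit MovingMedian(std::size_t w) : window(w) {}

    unsigned short push(unsigned short value) {
        if (history.size() == window) {
            // 1. remove the oldest value from the sorted pool
            unsigned short oldest = history.front();
            history.pop_front();
            pool.erase(std::lower_bound(pool.begin(), pool.end(), oldest));
        }
        // 2. insert the new value at its sorted position
        pool.insert(std::lower_bound(pool.begin(), pool.end(), value), value);
        // 3. remember it in the history buffer
        history.push_back(value);
        return pool[pool.size() / 2];    // the median is the middle of the pool
    }
};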
See equations 4 and 5 in the following paper. The complexity is O(N*W) where W is the width of the filter and N is the number of samples.
See Noise Reduction by Vector Median Filtering.

How to select an unlike number in an array in C++?

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which many are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? For example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times, and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers of the y-values of the pulse.
The point of this to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to get a better method for grounding than to average the first N points in the sample.
Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data? The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things with some fairly minor adjustments to the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?
If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (i.e., you're going to need two passes over the data: one to find the median, which is almost as slow as sorting for floating-point data, and the other to calculate the average), but it requires less overhead when working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes = 256 possible values), you could use 256 histogram 'bins' with a single pass over your data, counting the values that fall into each bin; then it's really easy to find the median, approximate the mean, remove outliers, etc. This would be my preferred option, if you can afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.
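A hedged sketch of that histogram route, assuming the data has already been quantised to bytes (0..255): one pass fills the bins, and a short walk over the 256 bins then yields the median.

#include <cstddef>
#include <cstdint>
#include <vector>

uint8_t histogram_median(const std::vector<uint8_t>& data)
{
    uint32_t bins[256] = {0};
    for (uint8_t v : data) ++bins[v];                    // single pass over the data

    const std::size_t target = (data.size() + 1) / 2;    // rank of the median
    std::size_t seen = 0;
    for (int b = 0; b < 256; ++b) {
        seen += bins[b];
        if (seen >= target)
            return static_cast<uint8_t>(b);
    }
    return 0;                                            // only reached for empty input
}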
A quick way might be to take the median, and then take the average of the numbers not too far off from the median.
"Not too far off" being dependent on your project.
A good rule of thumb for determining likely outliers is to calculate the Interquartile Range (IQR), and then any values that are 1.5*IQR away from the nearest quartile are outliers.
This is the basic method many statistics systems (like R) use to automatically detect outliers.
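A hedged sketch of that IQR rule in C++ (the quartile indexing is deliberately simple and the function name is mine):

#include <algorithm>
#include <cstddef>
#include <vector>

// Average of the values that survive the 1.5*IQR outlier test.
double iqr_filtered_mean(std::vector<double> data)
{
    if (data.empty()) return 0.0;
    std::sort(data.begin(), data.end());
    const double q1 = data[data.size() / 4];            // lower quartile (approximate)
    const double q3 = data[(3 * data.size()) / 4];      // upper quartile (approximate)
    const double iqr = q3 - q1;
    const double lo = q1 - 1.5 * iqr, hi = q3 + 1.5 * iqr;

    double sum = 0.0;
    std::size_t count = 0;
    for (double x : data)
        if (x >= lo && x <= hi) { sum += x; ++count; }
    return sum / count;
}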
Any method that is statistically sound and a good way to approach it (Dark Eru, Daniel White) will be too computationally intense to repeat, and I think I've found a workaround that will allow later correction (meaning, leave it ungrounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.
Here's a quick and dirty method that I've used before (it works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier).
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple minutes.
sumX = Array[0]                // initialize the running sum and average with the first point
avgX = Array[0]
N = length(Array)
percentDeviation = 0.3         // fractional deviation acceptable for non-outliers
count = 1
foreach x in Array[1..N-1]
    if x < avgX + avgX*percentDeviation
       and x > avgX - avgX*percentDeviation
        count++
        sumX += x
        avgX = sumX / count
    endif
endfor
return avgX
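A hedged C++ rendering of the pseudocode above; the deviation test is written with std::fabs so it also tolerates a negative running average, which is a small departure from the pseudocode.

#include <cmath>
#include <cstddef>
#include <vector>

double running_filtered_average(const std::vector<double>& a,
                                double percentDeviation = 0.3)
{
    double sum = a[0], avg = a[0];    // initialize with the first point
    int count = 1;
    for (std::size_t i = 1; i < a.size(); ++i) {
        if (std::fabs(a[i] - avg) < std::fabs(avg) * percentDeviation) {
            ++count;
            sum += a[i];
            avg = sum / count;        // the only division, once per accepted point
        }
    }
    return avg;
}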