super fast median of matrix in opencv (as fast as matlab) - c++

I'm writing some code in openCV and want to find the median value of a very large matrix array (single channel grayscale, float).
I tried several methods, such as sorting the array (using std::sort) and picking the middle entry, but it is extremely slow compared with the median function in Matlab. To be precise: what takes 0.25 seconds in Matlab takes over 19 seconds in OpenCV.
My input image is originally a 12-bit greyscale image with dimensions 3840x2748 (~10.5 megapixels), converted to float (CV_32FC1) with all values mapped to the range [0,1]. At some point in the code I request the median value by calling:
double myMedianValue = medianMat(Input);
Where the function medianMat is:
double medianMat(cv::Mat Input){
    Input = Input.reshape(0, 1); // spread Input Mat to a single row
    std::vector<double> vecFromMat;
    Input.copyTo(vecFromMat); // copy Input Mat to vector vecFromMat
    std::sort(vecFromMat.begin(), vecFromMat.end()); // sort vecFromMat
    if (vecFromMat.size() % 2 == 0) { return (vecFromMat[vecFromMat.size() / 2 - 1] + vecFromMat[vecFromMat.size() / 2]) / 2; } // even number of elements
    return vecFromMat[(vecFromMat.size() - 1) / 2]; // odd number of elements
}
I timed the function medianMat by itself and also its various parts - as expected the bottleneck is in:
std::sort( vecFromMat.begin(), vecFromMat.end() ); // sort vecFromMat
Does anyone here have an efficient solution?
Thanks!
EDIT
I have tried using std::nth_element as suggested in Adi Shavit's answer.
The function medianMat now reads as:
double medianMat(cv::Mat Input){
    Input = Input.reshape(0, 1); // spread Input Mat to a single row
    std::vector<double> vecFromMat;
    Input.copyTo(vecFromMat); // copy Input Mat to vector vecFromMat
    std::nth_element(vecFromMat.begin(), vecFromMat.begin() + vecFromMat.size() / 2, vecFromMat.end());
    return vecFromMat[vecFromMat.size() / 2];
}
The runtime has dropped from over 19 seconds to 3.5 seconds. This is still nowhere near the 0.25 seconds Matlab needs with its median function...

Sorting and taking the middle element is not the most efficient way to find a median. It requires O(n log n) operations.
With C++ you should use std::nth_element() and take the middle iterator. This is an O(n) operation:
nth_element is a partial sorting algorithm that rearranges elements in [first, last) such that:
The element pointed at by nth is changed to whatever element would occur in that position if [first, last) was sorted.
All of the elements before this new nth element are less than or equal to the elements after the new nth element.
Also, your original data is 12 bit integers. Your implementation does a few things that make the comparison to Matlab problematic:
You converted to floating point (CV_32FC1, double, or both); this is costly and takes time.
The code makes an extra copy into a vector<double>.
Operations on floats, and especially doubles, cost more than on integers.
Assuming your image is continuous in memory, as is the default for OpenCV, keep it as CV_16UC1 and work directly on the data array after reshape(), as in the sketch below.
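As a rough illustration of that last point, here is a minimal sketch (my own, not tested on the asker's data) that applies std::nth_element directly to a continuous CV_16UC1 Mat, avoiding both the float conversion and the extra vector copy; note that nth_element reorders the buffer, so a clone is taken first:

#include <algorithm>
#include <opencv2/core.hpp>

double medianMat16U(const cv::Mat& input)      // input: CV_16UC1
{
    cv::Mat work = input.clone();              // a clone is continuous; nth_element reorders the data
    ushort* begin = work.ptr<ushort>(0);
    ushort* mid   = begin + work.total() / 2;
    ushort* end   = begin + work.total();
    std::nth_element(begin, mid, end);
    return static_cast<double>(*mid);          // upper middle value when the count is even
}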
Another option, which should be very fast, is to simply build a histogram of the image - this is a single pass over the image. Then, working on the histogram, find the bin that corresponds to half the pixels on each side - this is at most a single pass over the bins.
The OpenCV docs have several tutorials on how to build histograms. Once you have the histogram, accumulate the bin values until you get past 3840x2748/2. That bin is your median.

OK.
I actually tried this before posting the question, and due to some silly mistakes I disqualified it as a solution... anyway, here it is:
I basically create a histogram of values for my original input with 2^12 = 4096 bins, compute the CDF, normalize it so it is mapped from 0 to 1, and find the smallest index in the CDF that is equal to or larger than 0.5. I then divide this index by 2^12 and thus find the requested median value. It now runs in 0.11 seconds (and that's in debug mode without heavy optimizations), which is less than half the time required in Matlab.
Here's the function (nVals = 4096 in my case corresponding with 12-bits of values):
double medianMat(cv::Mat Input, int nVals){
    // COMPUTE HISTOGRAM OF SINGLE CHANNEL MATRIX
    float range[] = { 0, (float)nVals };
    const float* histRange = { range };
    bool uniform = true; bool accumulate = false;
    cv::Mat hist;
    calcHist(&Input, 1, 0, cv::Mat(), hist, 1, &nVals, &histRange, uniform, accumulate);

    // COMPUTE CUMULATIVE DISTRIBUTION FUNCTION (CDF)
    cv::Mat cdf;
    hist.copyTo(cdf);
    for (int i = 1; i < nVals; i++){
        cdf.at<float>(i) += cdf.at<float>(i - 1);
    }
    cdf /= Input.total();

    // COMPUTE MEDIAN
    double medianVal = 0;
    for (int i = 0; i < nVals; i++){
        if (cdf.at<float>(i) >= 0.5) { medianVal = i; break; }
    }
    return medianVal / nVals;
}

It's probably faster to find it from the original data. Since the original data has 12-bit values, there are only 4096 different possible values. That's a nice and small table! Go through all the data in one pass and count how many of each value you have. That is an O(n) operation. Then it's easy to find the median: just count size/2 items from either end of the table, as in the sketch below.
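A minimal sketch of that counting approach, assuming the original 12-bit data is accessible as unsigned 16-bit samples (the function name and interface are mine):

#include <cstddef>
#include <cstdint>
#include <vector>

double median12bit(const uint16_t* data, std::size_t count)
{
    std::vector<std::size_t> bins(4096, 0);
    for (std::size_t i = 0; i < count; ++i)    // O(n): one pass over the data
        ++bins[data[i]];

    std::size_t half = count / 2;
    std::size_t seen = 0;
    for (std::size_t value = 0; value < bins.size(); ++value)
    {
        seen += bins[value];
        if (seen > half)                       // crossed the middle of the data
            return static_cast<double>(value);
    }
    return 0.0;                                // not reached for count > 0
}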

Related

clFFT: Calculating overlapped FFTs

I want to create a batch for clFFT to calculate 3 FFTs of 256 length, where the FFT input values overlap (FFT overlap processing)
Input: a 1D array of 276 complex numbers
Task: Calculate FFTs for [0..255], [10..265], [20..275]
Output: 3x 256 FFTs = 768 values.
If I were to write a loop, it would look like this:
std::complex<float> *input;
for (int i = 0; i < 3; ++i) {
    calcFFT(input, input + 256);
    input += 10;
}
In other words: the FFT is computed over 256 input values, then the input advances by 10 values and the next 256 values are transformed.
How do I set up a clFFT plan, so that this happens in one call?
clfftSetPlanIn/OutStride specifies the distance between the individual values, so that is the wrong parameter.
It looks as if clfftSetPlanDistance might be what I need. Doc says:
CLFFTAPI clfftStatus clfftSetPlanDistance( clfftPlanHandle plHandle, size_t iDist, size_t oDist );
Pitch is the distance between each discrete array object in an FFT array. This is only used for 'array' dimensions in clfftDim; see clfftSetPlanDimension (units are in terms of clfftPrecision)
which I find very confusing.
Yes, clfftSetPlanDistance is the right API to use. In the example I would have to use
clfftSetPlanDistance(plan, 10, 256);
to calculate FFTs with a step of 10.
This will generate OpenCL code where the global offset of the first FFT index is calculated like this:
// Inside the generated fft_fwd OpenCL function
iOffset = (batch/32)*10 + (batch%32)*8;
where batch is the batch number of the FFT to calculate.
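For completeness, a hedged sketch of how the plan setup might look for the three overlapping 256-point FFTs; context, queue, inBuffer and outBuffer are assumed to already exist, error checking is omitted, and the exact parameter choices should be verified against the clFFT documentation:

#include <clFFT.h>

clfftPlanHandle plan;
size_t lengths[1] = { 256 };

clfftCreateDefaultPlan(&plan, context, CLFFT_1D, lengths);
clfftSetPlanPrecision(plan, CLFFT_SINGLE);
clfftSetLayout(plan, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
clfftSetResultLocation(plan, CLFFT_OUTOFPLACE);
clfftSetPlanBatchSize(plan, 3);        // three overlapping FFTs
clfftSetPlanDistance(plan, 10, 256);   // input advances by 10 values, output by 256

clfftBakePlan(plan, 1, &queue, NULL, NULL);
clfftEnqueueTransform(plan, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
                      &inBuffer, &outBuffer, NULL);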

Fast 7x7 2D median filter in C and C++

I'm trying to convert the following code from MATLAB to C++:
function data = process(data)
data = medfilt2(data, [7 7], 'symmetric');
mask = fspecial('gaussian', [35 35], 12);
data = imfilter(data, mask, 'replicate', 'same');
maximum = max(data(:));
data = 1 ./ (data/maximum);
data(data > 10) = 16;
end
My problem is with medfilt2, which is a 2D median filter. I need it to support images with 10 bits per pixel and more.
I have looked into OpenCV: it has a 5x5 median filter which supports 16 bits, but the 7x7 filter only supports bytes.
medianBlur
I have also looked into Intel IPP, but I can see only a 1D median filter.
https://software.intel.com/en-us/node/502283
Is there a fast implementation for a 2D filter?
I am looking for something like:
Fast Median Search: An ANSI C Implementation using parallel programming and vectorized (AVX/SSE) operations...
Two Dimensional Digital Signal Processing II. Transforms and median filters.
Edited by T.S.Huang. Springer-Verlag. 1981.
There are more code examples in Fast median filtering with implementations in C/C++/C#/VB.NET/Delphi.
I also found Median Filtering in Constant Time.
Motivated by the fact that OpenCV does not implement a 16-bit median filter for large kernel sizes (larger than 5), I tried three different strategies.
All of them are based on Huang's [2] sliding window algorithm. That is, the histogram is updated by removing and inserting pixel entries as the window slides from left to right. This is quite straightforward for an 8-bit image and is already implemented in OpenCV. However, a large 65536-bin histogram makes computation a bit difficult.
...The algorithm still remains O(log r), but storage considerations render it impractical for 16-bit images and impossible for floating-point images. [3]
I used the C++ standard library where applicable, and did not implement Weiss' additional optimization strategies.
1) A naive sorting implementation. I think this is the best starting point for arbitrary pixel type (floats particularly).
// copy pixels in the sliding window to a temporary vec and
// compute the median value (size is always odd)
memcpy( &v[0], &window[0], window.size() * sizeof(_Type) );
typename std::vector< _Type >::iterator it = v.begin() + v.size() / 2;
std::nth_element( v.begin(), it, v.end() );
return *it;
2) A sparse histogram. We wouldn't want to step over 65536 bins to find the median of each pixel, so how about storing the sparse histogram then? Again, this is suitable for all pixel types, but it doesn't make sense if all pixels in the window are different (e.g. floats).
typedef std::map< _Type, int > Map;
// ...
// inside the sliding window, update the histogram as follows
for ( /* pixels to remove */ )
{
    // _Type px
    Map::iterator it = map.find( px );
    if ( it->second > 1 )
        it->second -= 1;
    else
        map.erase( it );
}
// ...
for ( /* pixels to add */ )
{
    // _Type px
    Map::iterator lower = map.lower_bound( px );
    if ( lower != map.end() && lower->first == px )
        lower->second += 1;
    else
        map.insert( lower, std::pair<_Type,int>( px, 1 ) );
}
// ... and compute the median by integrating from one end
// until the appropriate sum is reached
3) A dense histogram. So this is the dense histogram, but instead of a simple 65536 array, we make searching a little easier by dividing it into sub-bins e.g.:
[0...65535] <- px
[0...4095] <- px / 16
[0...255] <- px / 256
[0...15] <- px / 4096
This makes insertion a bit slower (by a constant factor), but search a lot faster. I found 16 to be a good number; a sketch of the idea follows below.
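A minimal sketch of this multi-level idea (my own interpretation, not the benchmarked code), with bin widths of 16 as suggested above; counts are kept at four resolutions so the median search can skip large empty ranges:

#include <cstdint>
#include <vector>

struct MultiLevelHist
{
    std::vector<uint32_t> level[4];

    MultiLevelHist()
    {
        level[0].assign(65536, 0); // full resolution
        level[1].assign(4096, 0);  // px / 16
        level[2].assign(256, 0);   // px / 256
        level[3].assign(16, 0);    // px / 4096
    }

    void add(uint16_t px)    // O(1): one increment per level
    {
        level[0][px]++; level[1][px >> 4]++; level[2][px >> 8]++; level[3][px >> 12]++;
    }
    void remove(uint16_t px)
    {
        level[0][px]--; level[1][px >> 4]--; level[2][px >> 8]--; level[3][px >> 12]--;
    }

    // Return the value whose cumulative count first reaches `rank`
    // (rank = window_area / 2 + 1 for the median of an odd-sized window).
    uint16_t select(uint32_t rank) const
    {
        uint32_t bin = 0;
        for (int lvl = 3; lvl >= 0; --lvl)
        {
            while (rank > level[lvl][bin]) { rank -= level[lvl][bin]; ++bin; } // walk this level
            if (lvl > 0) bin *= 16; // descend into the first sub-bin of the hit
        }
        return static_cast<uint16_t>(bin);
    }
};

In the worst case select() touches 16 bins per level, i.e. at most 64 bins, instead of walking up to 65536.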
Figure: I tested methods (1) red, (2) blue and (3) black against each other and against 8bpp OpenCV (green). For all but OpenCV, the input image is 16-bpp grayscale. The dotted lines are for data truncated to the dynamic range [0, 255] and the smooth lines for data in [0, 8020] (obtained via multiplication by 16 and smoothing, to add more variance to the pixel values).
Interesting is the divergence of the sparse histogram as the variance of pixel values increases. nth_element is always a safe bet, OpenCV is the fastest (if 8bpp is OK), and the dense histogram trails behind.
I used Windows 7, 8 cores at 3.4 GHz, and Visual Studio 10. My implementations ran multithreaded; the OpenCV implementation is single-threaded. Input image size 2136x3201 (http://i.imgur.com/gg9Z2aB.jpg, from Vogue).
[2]: Huang, T.: "Two-Dimensional Signal Processing II: Transforms and Median Filters", 1981
[3]: Weiss, B.: "Fast Median and Bilateral Filtering", 2006
I just implemented, in DIPlib, an efficient algorithm for computing the median filter (and the more generic percentile filter). This algorithm works for integer images of any bit depth as well as floating-point images, works for images of any number of dimensions, and works for kernels of any shape.
The algorithm is similar to the binary search tree implementation suggested by @mainactual in their answer to this question (as method #2), but uses a more appropriate order statistic tree. @mainactual's implementation needs O(n) to find the median in the search tree, for a tree with n nodes, because it iterates through half the nodes in the tree. This is only efficient if there are many fewer nodes than pixels in the kernel, which is typically only true for integer images with a small bit depth. In contrast, the order statistic tree can find the median value in O(log n) by storing an additional value in each node: the size of the subtree rooted at that node. The filter has a cost of O(k log k) for a compact 2D kernel with a height of k pixels (independent of the width).
I wrote down a more detailed description of the algorithm in my blog.
The C++ code is available on GitHub.
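To illustrate the O(log n) selection the order statistic tree enables, here is a minimal sketch (not the DIPlib code; node layout and names are mine) of rank selection in such a tree, where every node stores the size of its subtree:

#include <cstddef>

struct Node
{
    double      value;
    std::size_t count;    // multiplicity of `value` (repeated pixel values share a node)
    std::size_t subtree;  // total number of entries stored in this subtree
    Node*       left;
    Node*       right;
};

inline std::size_t subtreeSize(const Node* n) { return n ? n->subtree : 0; }

// Return the value with 0-based rank k (k = n/2 gives the median of n entries).
double select(const Node* root, std::size_t k)
{
    while (root)
    {
        std::size_t leftSize = subtreeSize(root->left);
        if (k < leftSize)
            root = root->left;                 // the rank lies in the left subtree
        else if (k < leftSize + root->count)
            return root->value;                // the rank falls on this node
        else
        {
            k -= leftSize + root->count;       // skip the left subtree and this node
            root = root->right;
        }
    }
    return 0.0; // not reached for a valid rank
}

The balancing and the sliding-window insert/remove bookkeeping are the parts the real implementation has to get right; the sketch only shows why the stored subtree sizes make selection logarithmic.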
Here is a timing comparison for square kernels, comparing:
the new implementation in DIPlib (blue),
the naive implementation in scikit-image (which computes the median for each pixel's neighborhood independently; method #1 in @mainactual's answer, with a quadratic cost) (green), and
the O(1) implementation in OpenCV that only works for 8-bit images and square kernels (red).
"SFLOAT" stands for single-precision floating-point, "UINT8" stands for 8-bit unsigned integer, and "0-10" is also 8-bit unsigned integer, but containing only pixel values between 0 and 10 (this one tests what happens when there are many repeated values in each neighborhood).
The new implementation in DIPlib kicks in at k = 13; the lower part of the graph is the naive, quadratic-cost algorithm.
I found this online. It is the same algorithm that OpenCV uses, but extended to 16 bits and optimized with SSE.
medianFilter.c
I happened to find (my) solution online as open source (image-quality-and-characterization-utilities), in include/Teisko/Image/Algorithm.hpp.
The algorithm finds the Kth element of any set of size M<=64 in N steps, where N is the number of bits in the elements.
This is a radix-2 sort algorithm, which needs the original bit pattern int16_t data[7][7]; to be transposed into N planes of uint64_t bits[N] (10 for 10-bit images), with the MSB first.
// Runs N iterations for pixel data as bit planes in `bits`
// to recover the K(th) largest item (as set by initial threshold)
// The parameter `mask` must be initialized to contain a set bit for all those bits
// of interest in the corresponding pixel data bits[0..N-1]
template <int N> inline
uint64_t median8x8_iteration(uint64_t (&bits)[N], uint64_t mask, uint64_t threshold)
{
    uint64_t result = 0;
    int i = 0;
    do
    {
        uint64_t ones = mask & bits[i];
        uint64_t ones_size = popcount(ones);
        uint64_t mask_size = popcount(mask);
        auto zero_size = mask_size - ones_size;
        int new_bit = 0;
        if (zero_size < threshold)
        {
            new_bit = 1;
            threshold -= zero_size;
            mask = 0;
        }
        result = result * 2 + new_bit;
        mask ^= ones;
    } while (++i < N);
    return result;
}
Use threshold = 25 to get the median of 49 elements, and mask = 0xfefefefefefefe00ull in the case where the planes bits[] contain the support of 8x8 adjacent bits.
By toggling the MSB plane one can use the same inner loop for signed integers, and by conditionally toggling the MSB and the other planes one can use the algorithm for floating point values as well.
Well after 2016, Ice Lake with AVX-512 introduced _mm256_mask_popcnt_epi64 even on consumer machines, allowing the inner loop to be almost trivially vectorised for all four submatrices in the common 8x8 support; the masks would be 0xfefefefefefefe00ull >> {0, 1, 8, 9}.
The idea here is that the mask marks the set of pixels under inspection. Counting the number of ones (or zeros) in that set and comparing to a threshold, we can determine at each step whether the Kth element belongs to the set of ones or zeros, producing one correct output bit per step. A sketch of how the bit planes might be packed is given below.
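For illustration, a hedged sketch (my interpretation of the description, not the library code) of how a 7x7 window of 10-bit pixels could be packed into the bit planes expected by median8x8_iteration, using the layout implied by mask = 0xfefefefefefefe00ull (rows 1..7 and columns 1..7 of an 8x8 grid of bits), MSB plane first; popcount() is assumed to be available elsewhere:

#include <cstdint>

template <int N>
void make_bit_planes(const int16_t (&window)[7][7], uint64_t (&bits)[N])
{
    for (int b = 0; b < N; ++b)
        bits[b] = 0;

    for (int r = 0; r < 7; ++r)
        for (int c = 0; c < 7; ++c)
        {
            // place the pixel at row r+1, column c+1 of the 8x8 bit grid
            uint64_t pos = uint64_t(1) << ((r + 1) * 8 + (c + 1));
            for (int b = 0; b < N; ++b)               // plane 0 holds the MSB
                if (window[r][c] & (1 << (N - 1 - b)))
                    bits[b] |= pos;
        }
}

// Usage, following the text above (N = 10 for 10-bit pixels):
//   uint64_t bits[10];
//   make_bit_planes(window, bits);
//   uint64_t med = median8x8_iteration(bits, 0xfefefefefefefe00ull, 25);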
EDIT
Another method I tried was an SSE2 version, where one keeps a window of size Width*(7 + 1) sorted by rows:
original sorted
1 2 1 3 .... 1 1 0
2 1 3 4 .... 2 2 1
5 2 0 1 .... -> 5 2 3
. . . . .... . . .
Sorting 7 rows is efficiently done by a sorting network using 16 primitive sorting operations (32 instructions with 3-parameter VEX encoding + 14 instructions for memory access).
One can also incrementally remove the element input[row-1][column] from a presorted SSE2 register and add the element input[row+7][column] to the register (which takes about 12 instructions per sorted column).
Having 7 sorted columns in 7 SSE2 registers, one can now implement bitonic merge sort of three different widths,
which at first column will sort in groups of
(r0),(r1,r2), ((r3,r4), (r5,r6))
<merge> <merge> <merge> // level #0
<--merge---> <---- merge* ----> // level #1
<----- merge + take middle ----> // partial level #2
At column 1, one needs to sort columns
(r1,r2), ((r3,r4),(r5,r6)), (r7)
******* **** <-- merge (r1,r2,r7)
<----- merge + take middle ----> <-- partial level #2
At column 2, with
r2, ((r3,r4),(r5,r6)), (r7,r8)
<merge> // level #0
** ******* // level #1
<----- merge + take middle ----> // level #2
This takes advantage of memorizing partially sorted substrings (and does it better than e.g. a heap-based priority queue). The final merging of the 3x7 and 4x7 element substrings does not need to compute every element correctly, since we are only interested in element #24.
Still, while being better than (my implementations of) heap-based priority queues and several hierarchical / flat histogram based methods, the overall winner was the popcount method (with a 64-bit popcount instruction).

C++ Fast Percentile Calculation

I'm trying to write a percentile function that takes 2 vectors as input and returns 1 vector as output. One of the input vectors (Distr) is a distribution of random numbers. The other input vector (Tests) holds the values whose percentiles I want to compute against Distr. The output is a vector (same size as Tests) that returns the percentile for each value in Tests.
The following is an example of what I want:
Input Distr = {3, 5, 8, 12}
Input Tests = {4, 9}
Output Percentile = {0.375, 0.8125}
Following is my implementation in C++:
vector<double> Percentile(vector<double> Distr, vector<double> Tests)
{
    double prevValue, nextValue;
    vector<double> result;
    unsigned distrSize = Distr.size();

    std::sort(Distr.begin(), Distr.end());

    for (vector<double>::iterator test = Tests.begin(); test != Tests.end(); test++)
    {
        if (*test <= Distr.front())
        {
            result.push_back((double) 1 / distrSize); // min percentile returned (not important)
        }
        else if (Distr.back() <= *test)
        {
            result.push_back(1); // max percentile returned (not important)
        }
        else
        {
            prevValue = Distr[0];
            for (unsigned sortedDistrIdx = 1; sortedDistrIdx < distrSize; sortedDistrIdx++)
            {
                nextValue = Distr[sortedDistrIdx];
                if (nextValue <= *test)
                {
                    prevValue = nextValue;
                }
                else
                {
                    // linear interpolation
                    result.push_back(((*test - prevValue) / (nextValue - prevValue) + sortedDistrIdx) / distrSize);
                    break;
                }
            }
        }
    }
    return result;
}
The size of both Distr and Tests can be from 2,000 to 30,000.
Are there any existing libraries that can calculate percentile as shown above (or similar)? If not how can I make the above code faster?
I would do something like
vector<double> Percentile(vector<double> Distr, vector<double> Tests)
{
    double prevValue, nextValue;
    vector<double> result;
    unsigned distrSize = Distr.size();

    std::sort(Distr.begin(), Distr.end());

    for (vector<double>::iterator test = Tests.begin(); test != Tests.end(); test++)
    {
        if (*test <= Distr.front())
        {
            result.push_back((double) 1 / distrSize); // min percentile returned (not important)
        }
        else if (Distr.back() <= *test)
        {
            result.push_back(1); // max percentile returned (not important)
        }
        else
        {
            // lower_bound returns the first element that is not less than *test
            auto it = lower_bound(Distr.begin(), Distr.end(), *test);
            prevValue = *(it - 1);
            nextValue = *it;
            // linear interpolation
            result.push_back(((*test - prevValue) / (nextValue - prevValue) + (it - Distr.begin())) / distrSize);
        }
    }
    return result;
}
Note that instead of making a linear search on Distr for each test, I leverage the fact that Distr is sorted and make a binary search instead (using lower_bound).
There is a near-linear algorithm for your problem (linear after sorting, so logarithmic-linear in both sizes overall). You need to sort both vectors, and then have two iterators going through each (itDistr, itTest). There are three possibilities:
1.
*itDistr < *itTest
Here, you have nothing to do except increment itDistr.
2.
*itDistr >= *itTest
This is the case when you have found a test value where *itTest is an element of the interval [ *(itDistr-1), *itDistr ). So you have to do the interpolation you have used (linear), and then increment itTest.
The third possibility is when either of them reaches the end of its container vector. You also have to define what happens at the beginning and at the end; it depends on how you define the distribution from the series of your numbers. A merge-style sketch is given below.
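A minimal sketch of this two-iterator idea (my own code, with simplified boundary handling), reusing the linear interpolation from the question and returning the results in the caller's original Tests order:

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<double> PercentileMerge(std::vector<double> Distr,
                                    std::vector<double> Tests)
{
    std::sort(Distr.begin(), Distr.end());

    // Sort Tests indirectly so results can be written back in the original order.
    std::vector<std::size_t> order(Tests.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return Tests[a] < Tests[b]; });

    std::vector<double> result(Tests.size());
    const double n = static_cast<double>(Distr.size());
    std::size_t d = 0;                                  // itDistr

    for (std::size_t idx : order)                       // itTest, in increasing value order
    {
        double t = Tests[idx];
        while (d < Distr.size() && Distr[d] < t) ++d;   // advance itDistr

        if (d == 0)
            result[idx] = 1.0 / n;                      // at or below the distribution
        else if (d == Distr.size())
            result[idx] = 1.0;                          // above the distribution
        else                                            // interpolate in [Distr[d-1], Distr[d])
            result[idx] = ((t - Distr[d - 1]) / (Distr[d] - Distr[d - 1]) + d) / n;
    }
    return result;
}

With Distr = {3, 5, 8, 12} and Tests = {4, 9} this reproduces the example output {0.375, 0.8125}.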
Are there any existing libraries that can calculate percentile as shown above (or similar)?
Probably, but it is easy to implement it, and you can have fine control over the interpolation technique.
The linear search of Distr for each element of Tests would be the major amount of time if both of those are large.
When Distr is much larger, it is much faster to do a binary search instead of linear. There is a binary search algorithm available in std. You don't need to write one.
When Tests is nearly as big as Distr, or bigger, it is faster to do an index sort of Tests and then sequence through the two sorted lists together storing the results, then output the stored results in a next pass.
Edit: I see the answer by Csaba Balint gives a little more detail on what I meant by "sequence through the two sorted lists together".
Edit: The basic methods being discussed are:
1) Sort both lists and then process linearly together, time NlogN+MlogM
2) Sort just one list and binary search, time (N+M)logM
3) Sort just the other list and partition; the time for this I haven't figured out, but in the case of N and M similar it has to be larger than either method 1 or 2, and in the case of sufficiently tiny N it has to be smaller than methods 1 or 2.
This answer is relevant to the case where the input is initially random (not sorted) and test.size() is smaller than input.size(), which is the most common situation.
Suppose there is only one test value. Then you only have to partition the input with respect to this value and obtain the upper (lower) bound of the lower (upper) partition to compute the respective percentile. This is much faster than a full sort of the input (which quicksort implements as a recursion of partitions).
If test.size() > 1, then you first sort test (ideally, test is already sorted and you can skip this step) and subsequently proceed with the test elements in increasing order, each time only partitioning the upper part from the previous partition. Since we also keep track of the lower bound of the upper partition (as well as the upper bound of the lower partition), we can detect if no input data lie between consecutive test elements, and avoid partitioning.
This algorithm should be near-optimal, since no unnecessary information is generated (as it would be with a full sort of input).
If each subsequent partitioning split the input roughly in half, the algorithm would be optimal. This could be approximated by proceeding not in increasing order of test, but by subsequent halving of test, i.e. starting with the median test element, then the first & third quartiles, etc. A sketch of the single-test-value case is given below.
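A hedged sketch of the single-test-value case (my own code, not the answerer's): partition the unsorted input around the test value and derive the percentile from the split point and the two boundary elements, following the interpolation convention used earlier:

#include <algorithm>
#include <vector>

double PercentileByPartition(std::vector<double> input, double test)
{
    const double n = static_cast<double>(input.size());

    // Elements strictly below `test` end up in the front of the range.
    auto split = std::partition(input.begin(), input.end(),
                                [test](double v) { return v < test; });

    if (split == input.begin())
        return 1.0 / n;                 // test at or below everything
    if (split == input.end())
        return 1.0;                     // test above everything

    // Upper bound of the lower partition and lower bound of the upper partition.
    double below = *std::max_element(input.begin(), split);
    double above = *std::min_element(split, input.end());
    double index = static_cast<double>(split - input.begin());

    return ((test - below) / (above - below) + index) / n;
}

With Distr = {3, 5, 8, 12}, PercentileByPartition gives 0.375 for test = 4 and 0.8125 for test = 9, matching the example.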

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value equal to or greater than 0, and this value represents the probability of the cell being chosen. In particular, for example, a cell with a value equal to 3 has three times the probability of being chosen compared to a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So, it is good when N is large (or even, 'medium') but uses a lot more memory, in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me whether you assume they are normalized or not.) That means: compute the sum of all the entries and divide each entry by it. (This part is potentially slow, so it's better if you assume or require that it has already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i][j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output is equal to the probability that [i][j] was selected (this is the same for each entry), times the probability that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling chooses each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is: suppose we arbitrarily choose a number k and then consider the distribution of the algorithm conditioned on stopping exactly after k rounds. Conditioned on stopping at the k'th round, no matter what value k we choose, the distribution we sample has to be exactly right by the above argument, since if we eliminate the case that p is too small, the other possibilities all have their proportions correct. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect also.
If you want to analyze the number of rounds typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 in any particular round. Since the rounds are independent, this is the same for every round; statistically, it means that the number of rounds is geometrically distributed, so it is concentrated around its mean, and we can determine the mean by knowing that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So the expected number of rounds is about n^2 (that is, n^2 up to a constant factor), no matter what the entries of the matrix are. You can't hope to do much better than that, I think -- it's about the same amount of time it takes just to read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
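A minimal sketch of method A (my own wording of it), assuming the entries have already been normalized to sum to 1; the RNG setup and the names are mine:

#include <random>
#include <utility>
#include <vector>

std::pair<std::size_t, std::size_t>
sampleCell(const std::vector<std::vector<double>>& normalized, std::mt19937& rng)
{
    std::uniform_int_distribution<std::size_t> row(0, normalized.size() - 1);
    std::uniform_int_distribution<std::size_t> col(0, normalized[0].size() - 1);
    std::uniform_real_distribution<double> real(0.0, 1.0);

    for (;;)
    {
        std::size_t i = row(rng);
        std::size_t j = col(rng);
        if (normalized[i][j] > real(rng))   // accept with probability matrix[i][j]
            return { i, j };
    }
}

Each call returns one cell; call it N times to draw N cells, as noted above.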
Method B:
Basically you just want to compute a histogram and sample inversely from it, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make histogram
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;

histogram_type histogram;
double cumulative = 0.0;
for (uint i = 0; i < Matrix.size(); ++i) {
    for (uint j = 0; j < Matrix[i].size(); ++j) {
        cumulative += Matrix[i][j];
        histogram[cumulative] = std::make_pair(i, j);
    }
}

std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
    // Do a sample (this should never repeat... if it does not find a lower bound
    // you could also assert false quite reasonably, since it means something is
    // wrong with the rand() implementation)
    while (1) {
        // For best results use std::mt19937 or boost::mt19937 and sample a real in the range [0,1] here.
        double p = cumulative * ((double) rand() / RAND_MAX);
        histogram_type::iterator it = histogram.lower_bound(p);
        if (it != histogram.end()) {
            result.push_back(it->second);
            break;
        }
    }
}
return result;
Here the time to make the histogram is something like number of cells * O(log number of cells) since inserting into the map takes time O(log n). You need an ordered data structure in order to get cheap lookup N * O(log number of cells) later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As @Bob__ points out in the comments, in method (B) as written there is potentially going to be some error due to floating point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that if cumulative is much larger than Matrix[i][j], beyond what the floating point precision can handle, then each time this statement is executed you may lose precision, and these small errors can accumulate into significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first. You could even do this in the general implementation to be safe -- sorting them isn't going to take more time asymptotically than you already spend anyway.

OpenCV: Fill in missing elements in CVMat with avg of nearest non-zero neighbors?

The basic problem is this:
I have a CVMat, type CV_8UC1, which is mostly filled in with integers (well, chars, actually, but whatever) between 1 and 100 inclusive. The remaining elements are zeros.
In this case, 0 basically means "unknown". I want to fill in the unknown elements with, essentially, the average of its nearest neighbors... i.e. if this matrix were representing a 3d surface with a bunch of holes in it, I want to smoothly fill in the holes.
Keeping in mind, of course, that it's possible there are some rather big holes.
Efficiency isn't super important, as this operation is only going to be happening once, and the matrix in question isn't bigger than around 1000x1000.
Here's the code I need to finish:
for (int x = 0; x < heightMatrix.cols; x++) {
    for (int y = 0; y < heightMatrix.rows; y++) {
        if (heightMatrix.at<char>(x, y) == 0) {
            // ???
        }
    }
}
Thanks!!
How about this instead:
put your data in an image and use image closing with a large kernel (or with a lot of iterations); a sketch follows the link below:
http://opencv.willowgarage.com/documentation/image_filtering.html#morphologyex
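A minimal sketch of that suggestion with the current OpenCV API (the kernel size and shape are guesses to be tuned): closing fills small dark gaps, and the original known values are then copied back so only the holes take the closed result:

#include <opencv2/imgproc.hpp>

cv::Mat fillByClosing(const cv::Mat& heightMatrix)   // CV_8UC1, 0 = unknown
{
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(15, 15));
    cv::Mat closed;
    cv::morphologyEx(heightMatrix, closed, cv::MORPH_CLOSE, kernel);

    // Keep the original values where they were known; use the closed
    // result only in the holes (where the input was 0).
    cv::Mat result = heightMatrix.clone();
    closed.copyTo(result, heightMatrix == 0);
    return result;
}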
What about this?
int sum = 0;
... paste the following part inside the loop ...
sum += heightMatrix.at<char>(x - 1,y);
sum += heightMatrix.at<char>(x + 1,y);
sum += heightMatrix.at<char>(x,y - 1);
sum += heightMatrix.at<char>(x,y + 1);
heightMatrix.at<char>(x,y) = sum / 4;
Since you deal with a CV_8UC1 Mat you have in practice a 2d array and each pixel has just 4 nearest neighbors.
There are some caveats however:
1) put your averaged pixels in a Mat of floats to avoid round-off!
2) filling the whole Mat with this average may not be what you are looking for if the non-zero pixels are quite sparse: when there are a lot of empty pixels and really few non-zero pixels, the further you move away from a non-zero pixel, the more the average converges to 0. And this may happen in as few as 3-4 iterations (another good reason not to store the values in a Mat of integers). A sketch combining these points is given below.
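A hedged sketch that combines the snippet with both caveats (my own code, not a drop-in answer): it accumulates averages in a float Mat, averages only the known (non-zero) neighbors so the holes don't decay towards 0, and repeats the pass until nothing changes any more; border pixels are skipped for simplicity:

#include <opencv2/core.hpp>

void fillHolesByAveraging(cv::Mat& heightMatrix)   // CV_8UC1, 0 = unknown
{
    cv::Mat values;
    heightMatrix.convertTo(values, CV_32F);        // caveat 1: work in floats

    bool changed = true;
    while (changed)                                // caveat 2: iterate until no hole can be filled
    {
        changed = false;
        cv::Mat next = values.clone();
        for (int y = 1; y < values.rows - 1; ++y)
            for (int x = 1; x < values.cols - 1; ++x)
            {
                if (values.at<float>(y, x) != 0.0f)
                    continue;                      // already known
                float sum = 0.0f;
                int count = 0;
                const float nb[4] = { values.at<float>(y - 1, x), values.at<float>(y + 1, x),
                                      values.at<float>(y, x - 1), values.at<float>(y, x + 1) };
                for (float v : nb)
                    if (v != 0.0f) { sum += v; ++count; }
                if (count > 0)                     // average only the known neighbors
                {
                    next.at<float>(y, x) = sum / count;
                    changed = true;
                }
            }
        values = next;
    }
    values.convertTo(heightMatrix, CV_8U);
}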