Grid nearest neighbour BFS slow - C++

I'm trying to upsample my image. I fill the upsampled version with the corresponding pixels in this way.
Pseudocode:
upsampled.getPixel((int)(x * factorX), (int)(y * factorY)) = old.getPixel(x, y)
As a result I end up with a bitmap that is not completely filled, and I try to fill each unfilled pixel with its nearest filled neighbour.
I use the method below for nearest-neighbour search and call it for each unfilled pixel. I do not flag an unfilled pixel as filled after changing its value, as that may create some weird patterns. The problem is that it works, but very slowly. On my i7 9700K, execution for a 2500 x 3000 image scaled by factors x = 1.5 and y = 1.5 takes about 10 seconds.
template<typename T>
std::pair<int, int> cn::Utils::nearestNeighbour(const Bitmap<T> &bitmap, const std::pair<int, int> &point, int channel, const bool *filledArr) {
    auto belongs = [](const cn::Bitmap<T> &bitmap, const std::pair<int, int> &point){
        return point.first >= 0 && point.first < bitmap.w && point.second >= 0 && point.second < bitmap.h;
    };
    if(!belongs(bitmap, point)){
        throw std::out_of_range("This point does not belong to bitmap!");
    }
    auto hash = [](std::pair<int, int> const &pair){
        std::size_t h1 = std::hash<int>()(pair.first);
        std::size_t h2 = std::hash<int>()(pair.second);
        return h1 ^ h2;
    };
    std::queue<std::pair<int, int>> queue;
    queue.push(point);
    std::unordered_set<std::pair<int, int>, decltype(hash)> visited(10, hash);
    while (!queue.empty()){
        auto p = queue.front();
        queue.pop();
        visited.insert(p);
        if(belongs(bitmap, p)){
            if(filledArr[bitmap.getDataIndex(p.first, p.second, channel)]){
                return {p.first, p.second};
            }
            std::vector<std::pair<int,int>> neighbors(4);
            neighbors[0] = {p.first - 1, p.second};
            neighbors[1] = {p.first + 1, p.second};
            neighbors[2] = {p.first, p.second - 1};
            neighbors[3] = {p.first, p.second + 1};
            for(auto n : neighbors) {
                if (visited.find(n) == visited.end()) {
                    queue.push(n);
                }
            }
        }
    }
    return {-1, -1};
}
The bitmap.getDataIndex() works in O(1) time. Here's its implementation:
template<typename T>
int cn::Bitmap<T>::getDataIndex(int col, int row, int depth) const {
    if(col >= this->w or col < 0 or row >= this->h or row < 0 or depth >= this->d or depth < 0){
        throw std::invalid_argument("cell does not belong to bitmap!");
    }
    return depth * w * h + row * w + col;
}
I have spent a while debugging this but could not really find what makes it so slow.
Theoretically, when scaling by factors x = 1.5, y = 1.5, a filled pixel should be no further than 2 pixels from an unfilled one, so a well-implemented BFS shouldn't take long.
Also, I use the following encoding for the bitmap; example for a 3x3x3 image:
(each row and channel is in ascending order)
     {00, 01, 02}       {09, 10, 11}       {18, 19, 20}
c0:  {03, 04, 05}   c1: {12, 13, 14}   c2: {21, 22, 23}
     {06, 07, 08}       {15, 16, 17}       {24, 25, 26}

"a filled pixel should be no further than 2 pixels from an unfilled one, so a well-implemented BFS shouldn't take long"
Sure, doing it once won't take long. But you need to do this for almost every pixel in the output image, and doing something cheap millions of times still takes a long time.
Instead of searching for a set pixel, use the information you have about the earlier computation to directly find the values you are looking for.
For example, in your output image, any set pixel is at ((int)(x * factorX), (int)(y * factorY)) for integer x and y. So for a non-set pixel (a, b), you can find the nearest set pixel as ((int)(round(a / factorX) * factorX), (int)(round(b / factorY) * factorY)).
However, you are much better off directly upsampling the image in a simpler way: don’t loop over the input pixels, instead loop over the output pixels, and find the corresponding input pixel.
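A minimal sketch of that output-driven loop, mirroring the question's pseudocode interface (getPixel used as an assignable reference, as in the question; the std::min clamp from <algorithm> is an added safety assumption against rounding at the borders):

for (int y = 0; y < upsampled.h; ++y) {
    for (int x = 0; x < upsampled.w; ++x) {
        // map each output pixel back to its source pixel
        int srcX = std::min((int)(x / factorX), old.w - 1);
        int srcY = std::min((int)(y / factorY), old.h - 1);
        upsampled.getPixel(x, y) = old.getPixel(srcX, srcY);
    }
}

Every output pixel is written exactly once, so there are no holes to fill and no search at all.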

Related

Efficient way to get index of every element in array greater than some value

I have a (quite large) standard C++ array of type double, with ~50,000,000 rows and 20 columns. The array is filled with random data, according to some Gaussian distribution (if that's of any use in answering this question).
I've written an algorithm to solve a problem using this array. A significant part of this algorithm's time is spent iterating, row by row (and sometimes over the same row more than once) and returning, for each row, the index of every element in that row such that the absolute value of that element exceeds some value (also of type double).
Unfortunately, the algorithm is quite slow. As it's rather large, and the problem being solved is a bit complex for simply dumping the code here on SO, I'd like to start by tackling this issue. What is the most efficient way (or, at least, a more efficient way) to grab the index of every element in a row of a multidimensional array?
What I've tried:
I've tried simply iterating through each row (with an iterator), passing each value to fabs(), and using std::distance() to get the index. I then store it in an std::set() (I don't care much about how the indices are stored, unless that is a significant speed factor, so long as they are "easily accessible").
I.e.:
for(auto it = row.begin(); it != row.end(); ++it){
    auto &element = *it;
    if(fabs(element) >= threshold){
        cache.insert(std::distance(row.begin(), it));
    }
}
I've also tried using std::find_if, and similarly std::ranges. Neither gave measurable speed improvements (admittedly, I haven't used particularly scientific benchmarks; I'm going for a visibly noticeable improvement).
I.e. something like this:
auto exceeds_thresh = [threshold](double x){ return x > threshold; };
auto it = std::ranges::find_if(row, exceeds_thresh);
while(it != std::end(row)){
    results.emplace_back(std::distance(std::begin(row), it));
    it = std::ranges::find_if(std::next(it), std::end(row), exceeds_thresh);
}
Note that, by efficiency, I'm focusing on speed
Here, 11.3, 9.8 and 17.5 satisfy the condition, so their indices 1, 3, 6 should be printed by the example below. Note that, in practice, each array is a row in a far larger array (as above), with a far greater number of elements in each row:
double row_of_array[7] = {1.4, 11.3, 4.2, 9.8, 0.1, 3.2, 17.5};
double threshold = 8;
for(auto it = std::begin(row_of_array); it != std::end(row_of_array); ++it){
    auto &element = *it;
    if(fabs(element) > threshold){
        std::cout << std::distance(std::begin(row_of_array), it) << "\n";
    }
}
You can try loop unrolling
double row_of_array[] = {1, 11, 4, 9, 0, 3, 17};
constexpr double threshold = 8;
std::vector<int> results;
results.reserve(20);
for(int i{}, e = std::ssize(row_of_array); i < e; i += 4)
{
    if(std::abs(row_of_array[i]) > threshold)
        results.push_back(i);
    if(i + 1 < e && std::abs(row_of_array[i + 1]) > threshold)
        results.push_back(i + 1);
    if(i + 2 < e && std::abs(row_of_array[i + 2]) > threshold)
        results.push_back(i + 2);
    if(i + 3 < e && std::abs(row_of_array[i + 3]) > threshold)
        results.push_back(i + 3);
}
EDIT:
or the riskier
double row_of_array[20] = {1, 11, 4, 9, 0, 3, 17};
constexpr double threshold = 8;
std::vector<int> results;
results.reserve(20);
static_assert(std::ssize(row_of_array) % 4 == 0, "only works for mul of 4");
for(int i{}, e = std::ssize(row_of_array); i < e; i += 4)
{
    if(std::abs(row_of_array[i]) > threshold) results.push_back(i);
    if(std::abs(row_of_array[i + 1]) > threshold) results.push_back(i + 1);
    if(std::abs(row_of_array[i + 2]) > threshold) results.push_back(i + 2);
    if(std::abs(row_of_array[i + 3]) > threshold) results.push_back(i + 3);
}
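For comparison, a hedged baseline worth benchmarking against the unrolled versions: the same scan as a plain single pass, written here with C++20 ranges for the index sequence since the question mentioned trying std::ranges. Compilers often auto-vectorise this form on their own. The function name and pointer-plus-length signature are illustrative, not from the question:

#include <cmath>
#include <ranges>
#include <vector>

std::vector<int> indices_over(const double* row, int n, double threshold) {
    std::vector<int> results;
    results.reserve(n);                       // avoid reallocation in the hot loop
    for (int i : std::views::iota(0, n))      // indices 0..n-1
        if (std::abs(row[i]) > threshold)
            results.push_back(i);
    return results;
}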

Query points on the vertices of a Hamming cube

I have N points that lie only on the vertices of a cube of dimension D, where D is something like 3.
A vertex may not contain any point. So every point has coordinates in {0, 1}^D. I am only interested in query time, as long as the memory cost is reasonable (not exponential in N, for example :) ).
Given a query that lies on one of the cube's vertices and an input parameter r, find all the vertices (and thus points) that have Hamming distance <= r from the query.
What's the way to go in a c++ environment?
I am thinking of a k-d tree, but I am not sure and want help; any input, even approximate, would be appreciated! Since Hamming distance comes into play, bitwise manipulations should help (e.g. XOR).
There is a nice bithack to go from one bitmask with k bits set to the lexicographically next permutation, which means it's fairly simple to loop through all masks with k bits set. XORing these masks with an initial value gives all the values at Hamming distance exactly k away from it.
So for D dimensions, where D is less than 32 (otherwise change the types),
uint32_t limit = (1u << D) - 1;
for (int k = 1; k <= r; k++) {
    uint32_t diff = (1u << k) - 1;
    while (diff <= limit) {
        // v is the input vertex
        uint32_t vertex = v ^ diff;
        // use it
        diff = nextBitPermutation(diff);
    }
}
Where nextBitPermutation may be implemented in C++ as something like (if you have __builtin_ctz)
uint32_t nextBitPermutation(uint32_t v) {
    // see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
    uint32_t t = v | (v - 1);
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}
Or for MSVC (not tested)
uint32_t nextBitPermutation(uint32_t v) {
    // see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
    uint32_t t = v | (v - 1);
    unsigned long tzc;
    _BitScanForward(&tzc, v); // v != 0 so the return value doesn't matter
    return (t + 1) | (((~t & -~t) - 1) >> (tzc + 1));
}
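Putting the pieces together, a hedged usage sketch (pointExists is an assumed lookup over your point set, e.g. an unordered_set of vertex masks; note that k starts at 1 above, so the query vertex itself is handled separately):

std::vector<uint32_t> neighbours;
if (pointExists(v))
    neighbours.push_back(v);              // Hamming distance 0
uint32_t limit = (1u << D) - 1;
for (int k = 1; k <= r; k++) {
    uint32_t diff = (1u << k) - 1;        // lowest mask with k bits set
    while (diff <= limit) {
        uint32_t vertex = v ^ diff;       // flips exactly k coordinates
        if (pointExists(vertex))
            neighbours.push_back(vertex);
        diff = nextBitPermutation(diff);
    }
}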
If D is really low, 4 or lower, the old popcnt-with-pshufb trick works really well and generally everything just lines up well, like this:
// requires SSSE3; #include <tmmintrin.h> (or <immintrin.h>)
uint16_t query(int vertex, int r, int8_t* validmask)
{
    // validmask should be an array of 16 int8_t's:
    // 0 for a vertex that doesn't exist, -1 if it does
    __m128i valid = _mm_loadu_si128((__m128i*)validmask);
    __m128i t0 = _mm_set1_epi8(vertex);
    __m128i r0 = _mm_set1_epi8(r + 1);
    __m128i all = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    __m128i popcnt_lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    // 4-bit popcount via table lookup: dist[i] = popcount(vertex ^ i)
    __m128i dist = _mm_shuffle_epi8(popcnt_lut, _mm_xor_si128(t0, all));
    __m128i close_enough = _mm_cmpgt_epi8(r0, dist);   // true where dist <= r
    __m128i result = _mm_and_si128(close_enough, valid);
    return _mm_movemask_epi8(result);
}
This should be fairly fast; fast compared to the bithack above (nextBitPermutation, which is fairly heavy, is used a lot there) and also compared to looping over all vertices and testing whether they are in range (even with builtin popcnt, that automatically takes at least 16 cycles and the above shouldn't, assuming everything is cached or even permanently in a register). The downside is the result is annoying to work with, since it's a mask of which vertices both exist and are in range of the queried point, not a list of them. It would combine well with doing some processing on data associated with the points though.
This also scales down to D = 3 of course; just make none of the points >= 8 valid. D > 4 can be done similarly, but it takes more code, and since this is really a brute-force solution that is only fast due to parallelism, it fundamentally gets exponentially slower in D.
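For reference, the scalar brute force mentioned above can be written with C++20's std::popcount; it is the simplest correct baseline, but exponential in D (the function name and signature are illustrative, assuming D < 32):

#include <bit>
#include <cstdint>
#include <vector>

std::vector<uint32_t> bruteForceQuery(uint32_t v, int r, int D) {
    std::vector<uint32_t> result;
    for (uint32_t u = 0; u < (1u << D); ++u)
        if (std::popcount(u ^ v) <= r)    // Hamming distance = popcount of XOR
            result.push_back(u);
    return result;
}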

Computing Rand error efficiently

I'm trying to compare two image segmentations to one another.
In order to do so, I transform each image into a vector of unsigned short values and calculate the Rand error according to the following formula:

    rand error = (a + b) / (N choose 2)

where:

    a = the number of pairs (i, j) whose elements have the same label in both vectors
    b = the number of pairs (i, j) whose elements have different labels in both vectors
Here is my code (the rand error calculation part):
cv::Mat im1, im2;
//code for acquiring data for im1, im2
//code for copying im1(:)->v1, im2(:)->v2
int N = v1.size();
double a = 0;
double b = 0;
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < i; j++)
    {
        unsigned short l1 = v1[i];
        unsigned short l2 = v1[j];
        unsigned short gt1 = v2[i];
        unsigned short gt2 = v2[j];
        if (l1 == l2 && gt1 == gt2)
        {
            a++;
        }
        else if (l1 != l2 && gt1 != gt2)
        {
            b++;
        }
    }
}
double NPairs = ((double)N * N) / 2; // cast before multiplying to avoid int overflow
double res = (a + b) / NPairs;
My problem is that the length of each vector is 307,200.
Therefore the total number of iterations is 47,185,920,000.
This makes the running time of the entire process very slow (a few minutes to compute).
Do you have any idea how I can improve it?
Thanks!
Let's assume that we have P distinct labels in the first image and Q distinct labels in the second image. The key observation for efficient computation of Rand error, also called Rand index, is that the number of distinct labels is usually much smaller than the number of pixels (i.e. P, Q << n).
Step 1
First, pre-compute the following auxiliary data:
the vector s1, with size P, such that s1[p] is the number of pixel positions i with v1[i] = p.
the vector s2, with size Q, such that s2[q] is the number of pixel positions i with v2[i] = q.
the matrix M, with size P x Q, such that M[p][q] is the number of pixel positions i with v1[i] = p and v2[i] = q.
The vectors s1, s2 and the matrix M can be computed by passing once through the input images, i.e. in O(n).
Step 2
Once s1, s2 and M are available, a and b can be computed efficiently:

    a = sum over p, q of M[p][q] * (M[p][q] - 1) / 2

This holds because each pair of pixels (i, j) that we are interested in has the property that both of its pixels have the same label in image 1, i.e. v1[i] = v1[j] = p, and the same label in image 2, i.e. v2[i] = v2[j] = q. Since v1[i] = p and v2[i] = q, pixel i contributes to the bin M[p][q], and so does pixel j. Therefore, for each combination of labels p and q, we count the number of pairs of pixels that fall into the bin M[p][q] (that is, M[p][q] choose 2), and then sum over all possible labels p and q.
Similarly, for b we have:

    b = (1/2) * sum over p, q of M[p][q] * (n - s1[p] - s2[q] + M[p][q])

Here, we count how many pairs are formed with one of the pixels falling into the bin M[p][q]. Such a pixel forms a good pair with every pixel that falls into a bin M[p'][q'] with p != p' and q != q'. Summing over all such M[p'][q'] is equivalent to taking the sum over the entire matrix M (which is n) and subtracting the sum of row p (i.e. s1[p]) and the sum of column q (i.e. s2[q]). However, after subtracting the row and column sums we have subtracted M[p][q] twice, which is why it is added back at the end of the expression. Finally, this is divided by 2 because each pair was counted twice (once for each of its two constituent pixels being part of a bin M[p][q] in the argument above).
The Rand error (Rand index) can now be computed as:

    res = (a + b) / (n * (n - 1) / 2)
The overall complexity of this method is O(n) + O(PQ), with the first term usually being the dominant one.
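For illustration, a minimal C++ sketch of this method under the question's setup (v1 and v2 are the flattened label vectors; P and Q are assumed known upper bounds on the label values, so labels can index arrays directly):

#include <cstddef>
#include <vector>

double randIndex(const std::vector<unsigned short>& v1,
                 const std::vector<unsigned short>& v2,
                 int P, int Q)
{
    const std::size_t n = v1.size();
    std::vector<double> s1(P, 0.0), s2(Q, 0.0);
    std::vector<std::vector<double>> M(P, std::vector<double>(Q, 0.0));
    for (std::size_t i = 0; i < n; ++i) {        // Step 1: one pass, O(n)
        s1[v1[i]] += 1;
        s2[v2[i]] += 1;
        M[v1[i]][v2[i]] += 1;
    }
    double a = 0, b = 0;
    for (int p = 0; p < P; ++p) {                // Step 2: O(PQ)
        for (int q = 0; q < Q; ++q) {
            const double m = M[p][q];
            a += m * (m - 1) / 2;                    // same labels in both images
            b += m * (n - s1[p] - s2[q] + m) / 2;    // different labels in both
        }
    }
    const double nPairs = (double)n * (n - 1) / 2;
    return (a + b) / nPairs;
}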
After reading your comments, I tried the following approach:
calculate the intersections for each possible pair of values.
use the intersection results to calculate the error.
I performed the calculation straight on the cv::Mat objects, without converting them into std::vector objects. That gave me the ability to use opencv functions and achieve a faster runtime.
Code:
double a = 0, b = 0; //init variables
//unique() finds all the unique values of a matrix, with an optional input mask
std::set<unsigned short> m1Vals = unique(mat1);
for (unsigned short s1 : m1Vals)
{
    cv::Mat mask1 = (mat1 == s1);
    std::set<unsigned short> m2ValsInRoi = unique(mat2, mask1);
    for (unsigned short s2 : m2ValsInRoi)
    {
        cv::Mat mask2 = (mat2 == s2);
        cv::Mat andMask = mask1 & mask2;
        double andVal = cv::countNonZero(andMask);
        a += (andVal * (andVal - 1)) / 2;
        b += (andVal * (double)cv::countNonZero(~mask1 & ~mask2)) / 2;
    }
}
double NPairs = (double)(N*(N-1)) / 2;
double res = (a + b) / NPairs;
The runtime is now reasonable (a few milliseconds vs. a few minutes), and the output matches that of the code above.
Example:
I ran the code on the following matrices:
//mat1 = [1 1 2]
cv::Mat mat1 = cv::Mat::ones(cv::Size(3, 1), CV_16U);
mat1.at<ushort>(cv::Point(2, 0)) = 2;
//mat2 = [1 2 1]
cv::Mat mat2 = cv::Mat::ones(cv::Size(3, 1), CV_16U);
mat2.at<ushort>(cv::Point(1, 0)) = 2;
In this case a = 0 (no pair has the same label in both matrices), and b = 1 (one matching pair, for i = 2, j = 3). The algorithm's result:
a = 0
b = 1
NPairs = 3
result = 0.3333333
Thank you all for your help!

Find all peaks for Mat() in OpenCV C++

I want to find all maximums (numbers of non-zero pixels) for my image. I need it to divide my picture in such a way:
So, I already asked a question about how to project the whole image onto one axis, and now I need to find all the maximums in this one-row image.
Here's my part of the code:
void segment_plate (Mat input_image) {
    double minVal;
    double maxVal;
    Point minLoc;
    Point maxLoc;
    Mat work_image = input_image;
    Mat output;
    //binarize image
    adaptiveThreshold(work_image, work_image, 255, ADAPTIVE_THRESH_MEAN_C, THRESH_BINARY, 15, 10);
    //project it to one axis
    reduce(work_image, output, 0, CV_REDUCE_SUM, CV_32S);
    //find minimum and maximum value for the WHOLE image
    minMaxLoc(output, &minVal, &maxVal, &minLoc, &maxLoc);
    cout << "min val : " << minVal << endl;
    cout << "max val: " << maxVal << endl;
}
EDIT
Ok, so I made a mistake: I need to find the peaks of this vector. I've used this code to find the first peak:
int findPeakUtil(Mat arr, int low, int high, int n) {
    // Find index of middle element
    int mid = low + (high - low)/2; /* (low + high)/2 */
    // Compare middle element with its neighbours (if neighbours exist)
    if ((mid == 0 || arr.at<int>(0, mid-1) <= arr.at<int>(0, mid)) &&
        (mid == n-1 || arr.at<int>(0, mid+1) <= arr.at<int>(0, mid)))
        return mid;
    // If middle element is not a peak and its left neighbour is greater,
    // then the left half must have a peak element
    else if (mid > 0 && arr.at<int>(0, mid-1) > arr.at<int>(0, mid))
        return findPeakUtil(arr, low, (mid - 1), n);
    // If middle element is not a peak and its right neighbour is greater,
    // then the right half must have a peak element
    else return findPeakUtil(arr, (mid + 1), high, n);
}

// A wrapper over recursive function findPeakUtil()
int findPeak(Mat arr, int n) {
    return findPeakUtil(arr, 0, n-1, n);
}
So now my code looks like:
void segment_plate (Mat input_image) {
    Mat work_image = input_image;
    Mat output;
    //binarize image
    adaptiveThreshold(work_image, work_image, 255, ADAPTIVE_THRESH_MEAN_C, THRESH_BINARY, 15, 10);
    //project it to one axis
    reduce(work_image, output, 0, CV_REDUCE_SUM, CV_32S);
    int n = output.cols;
    printf("Index of a peak point is %d", findPeak(output, n));
}
But how can I find the other peaks? The algorithm for peak finding I took from here.
One way I can think of to find peaks is to take the first derivative and then find the negative numbers in it.
For example,
a = [1, 2, 3, 4, 4, 5, 6, 3, 4]
In this example the peaks are the 6 at position 6 and the 4 at the last position.
So, if you extend the vector (0 at the end) and apply the first derivative (a[i] - a[i-1]) you'll get
a_deriv = [1, 1, 1, 0, 1, 1, -3, 1, -4]
where the negative numbers are at the positions of the peaks. In this case, -3 is at position 6 and -4 at position 8, which is where the peaks are located.
This is one way to do it... but it is not the only one.
Note that this method will count only the last number in a plateau as the peak (you get a plateau when two consecutive numbers share the peak because they have the same value).
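Putting that into code, a hedged sketch applied to the one-row CV_32S Mat produced by reduce() in the question (the function name is an illustrative choice; the sign-change test avoids reporting every step of a long descent, and, as noted, a plateau reports its last element):

#include <opencv2/core.hpp>
#include <vector>

std::vector<int> findPeaksByDerivative(const cv::Mat& row) {
    std::vector<int> peaks;
    const int n = row.cols;
    int prevDiff = 1;                                 // treat the start as ascending
    for (int i = 1; i <= n; ++i) {
        int cur = (i < n) ? row.at<int>(0, i) : 0;    // extend the vector with 0 at the end
        int diff = cur - row.at<int>(0, i - 1);       // first derivative
        if (diff < 0 && prevDiff >= 0)
            peaks.push_back(i - 1);                   // the element just before the drop
        prevDiff = diff;
    }
    return peaks;
}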
Hope this helps you

Partitioning of an AABB

I have a problem where I need to divide an AABB into a number of smaller AABBs. I need to find the minimum and maximum points of each of the smaller AABBs.
If we take this cuboid as an example, we can see that it is divided into 64 smaller cuboids. I need to calculate the minimum and maximum points of all of these smaller cuboids, where the number of cuboids (64) can be specified by the end user.
I have made a basic attempt with the following code:
// Half the length of each side of the AABB.
float h = side * 0.5f;
// The length of each side of the inner AABBs.
float l = side / NUMBER_OF_PARTITIONS;
// Calculate the minimum point on the parent AABB.
Vector3 minPointAABB(
    origin.getX() - h,
    origin.getY() - h,
    origin.getZ() - h
);
// Calculate all inner AABBs which completely fill the parent AABB.
for (int i = 0; i < NUMBER_OF_PARTITIONS; i++)
{
    // This is not correct! Given a parent AABB of min (-10, 0, 0) and max (0, 10, 10) I need to
    // calculate the following positions as minimum points of InnerAABB (with 8 inner AABBs):
    // (-10, 0, 0), (-5, 0, 0), (-10, 5, 0), (-5, 5, 0), (-10, 0, 5), (-5, 0, 5),
    // (-10, 5, 5), (-5, 5, 5)
    Vector3 minInnerAABB(
        minPointAABB.getX() + i * l,
        minPointAABB.getY() + i * l,
        minPointAABB.getZ() + i * l
    );
    // We can calculate the maximum point of the AABB from the minimum point
    // by summing each coordinate of the minimum point with the length of each side.
    Vector3 maxInnerAABB(
        minInnerAABB.getX() + l,
        minInnerAABB.getY() + l,
        minInnerAABB.getZ() + l
    );
    // Add the inner AABB points to a container for later use.
}
Many thanks!
I assume that your problem is that you don't get enough sub-boxes. The number of partitions refers to partitions per dimension, right? So 2 partitions yield 8 sub-boxes, 3 partitions yield 27 sub-boxes and so on.
Then you must have three nested loops, one for each dimension:
for (int k = 0; k < NUMBER_OF_PARTITIONS; k++)
{
    for (int j = 0; j < NUMBER_OF_PARTITIONS; j++)
    {
        for (int i = 0; i < NUMBER_OF_PARTITIONS; i++)
        {
            Vector3 minInnerAABB(
                minPointAABB.getX() + i * l,
                minPointAABB.getY() + j * l,
                minPointAABB.getZ() + k * l
            );
            Vector3 maxInnerAABB(
                minInnerAABB.getX() + l,
                minInnerAABB.getY() + l,
                minInnerAABB.getZ() + l
            );
            // Add the inner AABB points to a container for later use.
        }
    }
}
Alternatively, you can have one huge loop over the cube of your partitions and sort out the indices by division and remainder operations inside the loop, which is a bit messy for three dimensions; a sketch follows.
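A minimal sketch of that single-loop variant, reusing l and minPointAABB from the code above (the remainder/division index recovery is the part described as messy):

const int n = NUMBER_OF_PARTITIONS;
for (int idx = 0; idx < n * n * n; idx++)
{
    int i = idx % n;            // x index
    int j = (idx / n) % n;      // y index
    int k = idx / (n * n);      // z index
    Vector3 minInnerAABB(
        minPointAABB.getX() + i * l,
        minPointAABB.getY() + j * l,
        minPointAABB.getZ() + k * l
    );
    Vector3 maxInnerAABB(
        minInnerAABB.getX() + l,
        minInnerAABB.getY() + l,
        minInnerAABB.getZ() + l
    );
    // Add the inner AABB points to a container for later use.
}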
It might also be a good idea to make the code more general by calculating three independent sub-box lengths for each dimension based on the side lengths of the original box.
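A hedged sketch of that generalisation (partsX/partsY/partsZ and the per-dimension side lengths sideX/sideY/sideZ are assumed parameters, not part of the original code):

// Independent partition counts and side lengths per dimension.
float lx = sideX / partsX;
float ly = sideY / partsY;
float lz = sideZ / partsZ;
for (int k = 0; k < partsZ; k++)
    for (int j = 0; j < partsY; j++)
        for (int i = 0; i < partsX; i++)
        {
            Vector3 minInnerAABB(
                minPointAABB.getX() + i * lx,
                minPointAABB.getY() + j * ly,
                minPointAABB.getZ() + k * lz
            );
            Vector3 maxInnerAABB(
                minInnerAABB.getX() + lx,
                minInnerAABB.getY() + ly,
                minInnerAABB.getZ() + lz
            );
            // Add the inner AABB points to a container for later use.
        }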