Is this connected-component labeling algorithm new? - c++

A long time ago, I made a game in which a sort of connected-component labeling was required to implement the AI. I unknowingly used the two-pass algorithm at the time.
Recently, I learned that I could make it faster using a bit-scan based method instead. It takes 1-bit-per-pixel data as input, instead of the typical bytes-per-pixel input, and finds every linear chunk of 1s in each scan line using the BSF instruction. Please see the code below. Cut is a struct that stores information about one such linear chunk of 1 bits in a scan line.
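For reference, the code below assumes roughly the following definitions. This is only a sketch of mine, since the exact types aren't shown in the snippet: on MSVC, _BitScanForward takes an unsigned long*, hence the typedef, and the positions are presumably unsigned short so that the 0xFFFF end-of-row sentinel compares as intended.
typedef unsigned long u32;          // 32-bit scan word on MSVC; matches _BitScanForward's argument type

struct Label;                       // label node, used in the second (labeling) stage

struct Cut {
    unsigned short start_pos;       // first bit position of a run of 1s in the row
    unsigned short end_pos;         // position of the first 0 after the run (0xFFFF marks the end-of-row sentinel)
    Label* label;                   // label assigned in the labeling stage
};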
Cut* get_cuts_in_row(const u32* bits, const u32* bit_final, Cut* cuts) {
    u32 working_bits = *bits;
    u32 basepos = 0, bitpos = 0;
    for (;; cuts++) {
        //find starting position
        while (!_BitScanForward(&bitpos, working_bits)) {
            bits++, basepos += 32;
            if (bits == bit_final) {
                cuts->start_pos = (short)0xFFFF;
                cuts->end_pos = (short)0xFFFF;
                return cuts + 1;
            }
            working_bits = *bits;
        }
        cuts->start_pos = short(basepos + bitpos);
        //find ending position
        working_bits = (~working_bits) & (0xFFFFFFFF << bitpos);
        while (!_BitScanForward(&bitpos, working_bits)) {
            bits++, basepos += 32;
            working_bits = ~(*bits);
        }
        working_bits = (~working_bits) & (0xFFFFFFFF << bitpos);
        cuts->end_pos = short(basepos + bitpos);
    }
}
First, it uses the BSF instruction to find the first position where a 1 bit appears. Once that is found, it finds the first position where a 0 bit appears after it, using bit inversion and bit masking, and then repeats the process.
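To make the trick concrete, here is a small worked example on a single 32-bit word (bit 0 is the least significant bit):
// bits                      = ...0000111100001100   (runs of 1s at bits 2-3 and 8-11)
// BSF(bits)                 = 2                      -> the run starts at bit 2
// ~bits & (0xFFFFFFFF << 2) = ...1111000011110000
// BSF of that               = 4                      -> first 0 after the run: the run is [2, 4)
// ~that & (0xFFFFFFFF << 4) = ...0000111100000000    -> the original bits with the first run cleared,
//                                                       so repeating finds the run at bits 8-11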
After getting the starting and ending positions of all linear chunks of 1s (I prefer to call them 'cuts') in every scan line, it assigns labels to them in the usual CCL manner. In the first row, every cut gets a different label.
For each cut in the remaining rows, it first checks whether any cuts in the row above are connected to it. If no upper cut is connected, it gets a new label. If exactly one upper cut is connected, it copies that cut's label. If several upper cuts are connected, their labels are merged and the cut gets the merged label. This can be done easily with two pointers advancing over the upper and lower cuts. Here is the full code for that part.
Label* get_labels_8c(Cut* cuts, Cut* cuts_end, Label* label_next) {
    Cut* cuts_up = cuts;
    //generate labels for the first row
    for (; cuts->start_pos != 0xFFFF; cuts++) cuts->label = [GET NEW LABEL FROM THE POOL];
    cuts++;
    //generate labels for the remaining rows
    for (; cuts != cuts_end; cuts++) {
        Cut* cuts_save = cuts;
        for (;; cuts++) {
            u32 start_pos = cuts->start_pos;
            if (start_pos == 0xFFFF) break;
            //skip upper cuts that end before this cut starts
            for (; cuts_up->end_pos < start_pos; cuts_up++);
            //no upper cut meets this one
            u32 end_pos = cuts->end_pos;
            if (cuts_up->start_pos > end_pos) {
                cuts->label = [GET NEW LABEL FROM THE POOL];
                continue;
            }
            Label* label = label_equiv_recursion(cuts_up->label);
            //the next upper cut cannot meet this one
            if (end_pos <= cuts_up->end_pos) {
                cuts->label = label;
                continue;
            }
            //find the remaining upper cuts that meet this one
            for (; cuts_up->start_pos <= end_pos; cuts_up++) {
                Label* label_other = label_equiv_recursion(cuts_up->label);
                if (label != label_other) [MERGE TWO LABELS]
                if (end_pos <= cuts_up->end_pos) break;
            }
            cuts->label = label;
        }
        cuts_up = cuts_save;
    }
    return label_next;
}
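The bracketed placeholders and label_equiv_recursion can be realized in several ways; here is one minimal sketch of mine (not necessarily what the final code does), treating labels as a union-find structure over a preallocated pool:
struct Label {
    Label* parent;   // points to itself while this label is a representative
    // ...any per-component payload (final id, pixel count, ...) could go here
};

// [GET NEW LABEL FROM THE POOL]: take the next free slot and make it its own root
inline Label* new_label(Label*& label_next) {
    Label* l = label_next++;
    l->parent = l;
    return l;
}

// label_equiv_recursion: follow parents to the representative, halving the path as we go
inline Label* label_equiv_recursion(Label* l) {
    while (l->parent != l) {
        l->parent = l->parent->parent;
        l = l->parent;
    }
    return l;
}

// [MERGE TWO LABELS]: both arguments are already representatives at the call site
inline void merge_labels(Label* a, Label* b) {
    b->parent = a;
}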
After this, one can use this per-scan-line information to build the array of labels, or any other output, directly.
I measured the execution time of this method and found that it's much faster than the two-pass method I used previously. Surprisingly, it turned out to be much faster than the two-pass one even when the input data is random, although the bit-scanning algorithm is clearly best suited to data with relatively simple structure, where the chunks in each scan line are large; it wasn't designed for random images.
What baffled me was that nobody seems to mention this method. Frankly, it doesn't seem like an idea that's hard to come up with, and it's hard to believe that I'm the first one who tried it.
Perhaps my method, while better than the basic two-pass method, is worse than the more developed algorithms built on the two-pass idea, and so isn't worth mentioning.
However, if the two-pass method can be improved, so can the bit-scan method. I myself found a nice improvement for 8-connectivity: it analyses two neighboring scan lines at once by merging them with a bitwise OR. You can find the full code and a detailed explanation of how it works here.
I got to know that there is a benchmark for CCL algorithms named YACCLAB. I'll test my algorithms there against the best CCL algorithms to see how good they really are. Before that, I want to ask a few things here.
My questions are:
Are these algorithms I found really new? It's still hard to believe that nobody has ever thought of a CCL algorithm based on bit-scanning. If it's already a thing, why can't I find anyone mentioning it? Were bit-scan based algorithms proven to be bad and then forgotten?
If I really did find a new algorithm, what should I do next? Of course I'll test it in a more reliable framework like YACCLAB; I'm asking about what comes after that. What should I do to establish these algorithms as mine and spread them?

So far, I'm a bit sceptical
My reasoning was getting too long for a comment, so here we are. There is a lot to unpack. I like the question quite a bit even though it might be better suited for a computer science site.
The thing is, there are two layers to this question:
Was a new algorithm discovered?
What about the bit scanning part?
You are combining these two, so first I will explain why I would like to think about them separately:
An algorithm is a set of steps (in the more formal definition) that is language-agnostic. As such, it should work even without the bit scanning.
The bit scanning, on the other hand, I would consider an optimization technique: we are using a structure that the computer is comfortable with, which can bring us performance gains.
Unless we separate these two, the question gets a bit fuzzy since there are several possible scenarios that can be happening:
The algorithm is new and improved and bit scanning makes it even faster. That would be awesome.
The algorithm is just a new way of saying "two pass" or something similar. That would still be good if it beats the benchmarks. In this case it might be worth adding to a library for the CCL.
The algorithm is a good fit for some cases but somehow fails in others (speed-wise, not correctness-wise). The bit scanning here makes the comparison difficult.
The algorithm is a good fit for some cases but completely fails in others(produces incorrect result). You just didn't find a counterexample yet.
Let us assume that 4 isn't the case and we want to decide among 1 to 3. In each case, the bit scanning makes things fuzzy since it most likely speeds things up even more - so in some cases even a slower algorithm could outperform a better one.
So first I would try to remove the bit scanning and re-evaluate the performance. After a quick look it seems that CCL algorithms have linear complexity in the image size - you need to check every pixel at least once. The rest is the fight to lower the constant as much as possible (number of passes, number of neighbors to check, etc.). I think it is safe to assume that you can't do better than linear - so the first question is: does your algorithm improve on the constant factor? Since the algorithm is linear, that factor translates directly to performance, which is nice.
Second question would then be: Does bit scanning further improve the performance of the algorithm?
Also, since I already started thinking about it, what about a chessboard pattern and 4-connectivity? Or, alternatively, a chessboard of 3x3 crosses for 8-connectivity.
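Such a worst case is easy to generate for a quick test; a sketch (one bit per pixel, 32 pixels per word, roughly matching the question's input format):
#include <cstdint>
#include <vector>

// Checkerboard bitmap: under 4-connectivity every set pixel is its own component.
std::vector<std::uint32_t> make_checkerboard(int words_per_row, int height) {
    std::vector<std::uint32_t> rows(words_per_row * height);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < words_per_row; ++x)
            rows[y * words_per_row + x] = (y & 1) ? 0xAAAAAAAAu : 0x55555555u;
    return rows;
}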

Related

Shouldn't this be using a backtracking algorithm?

I am solving some questions on LeetCode. One of the questions is:
Given an m x n grid filled with non-negative numbers, find a path from top left to bottom right which minimizes the sum of all numbers along its path. You can only move either down or right at any point in time.
The editorial as well as the posted solutions all use dynamic programming. One of the most upvoted solutions is as follows:
class Solution {
public:
    int minPathSum(vector<vector<int>>& grid) {
        int m = grid.size();
        int n = grid[0].size();
        vector<vector<int> > sum(m, vector<int>(n, grid[0][0]));
        for (int i = 1; i < m; i++)
            sum[i][0] = sum[i - 1][0] + grid[i][0];
        for (int j = 1; j < n; j++)
            sum[0][j] = sum[0][j - 1] + grid[0][j];
        for (int i = 1; i < m; i++)
            for (int j = 1; j < n; j++)
                sum[i][j] = min(sum[i - 1][j], sum[i][j - 1]) + grid[i][j];
        return sum[m - 1][n - 1];
    }
};
My question is simple: shouldn't this be solved using backtracking? Suppose the input matrix is something like:
[
[1,2,500]
[100,500,500]
[1,3,4]
]
My doubt arises because in DP, the solutions to subproblems are part of the global solution (optimal substructure). However, as can be seen above, when we make the local choice of picking 2 out of (2, 100), we might be wrong, since the future paths might be too expensive (all numbers surrounding the 2 are 500s). So how is using dynamic programming justified in this case?
To summarize:
Shouldn't we use backtracking since we might have to retract our path if we have made an incorrect choice previously (looking at local maxima)?
How is this a dynamic programming question?
P.S.: The above solution definitely runs.
The example you illustrated above shows that a greedy solution to the problem will not necessarily produce an optimal solution, and you're absolutely right about that.
However, the DP solution to this problem doesn't quite use this strategy. The idea behind the DP solution is to compute, for each location, the cost of the shortest path ending at that location. In the course of solving the overall problem, the DP algorithm will end up computing the length of some shortest paths that pass through the 2 in your grid, but it won't necessarily use those intermediate shortest paths when determining the overall shortest path to return. Try tracing through the above code on your example - do you see how it computes and then doesn't end up using those other path options?
Shouldn't we use backtracking since we might have to retract our path if we have made an incorrect choice previously (looking at local maxima)?
In a real-world scenario, there will be quite a few factors that will determine which algorithm will be better suited to solve this problem.
This DP solution is alright in the sense that it will give you the best performance/memory usage when handling worst-case scenarios.
Any backtracking/dijkstra/A* algorithm will need to maintain a full matrix as well as a list of open nodes. This DP solution just assumes every node will end up being visited, so it can ditch the open node list and just maintain the costs buffer.
By assuming every node will be visited, it also gets rid of the "which node do I open next" part of the algorithm.
So if optimal worst-case performance is what we are looking for, then this algorithm is actually going to be very hard to beat. But whether that's what we want or not is a different matter.
How is this a dynamic programming question?
This is only a dynamic programming question in the sense that there exists a dynamic programming solution for it. But by no means is DP the only way to tackle it.
Edit: Before I get dunked on, yes there are more memory-efficient solutions, but at very high CPU costs in the worst-case scenarios.
For your input
[
[ 1, 2, 500]
[100, 500, 500]
[ 1, 3, 4]
]
the sum array works out to
[
[ 1, 3, 503]
[101, 503, 1003]
[102, 105, 109]
]
And we can even retrace the shortest path:
109, 105, 102, 101, 1
The algorithm doesn't check each path, but uses the property that it can take the previous optimal paths to compute the current cost:
sum[i][j] = min(sum[i - 1][j], // take better path between previous horizontal
sum[i][j - 1]) // or previous vertical
+ grid[i][j]; // current cost
Backtracking, in itself, doesn't fit this problem particularly well.
Backtracking works well for problems like eight queens, where a proposed solution either works, or it doesn't. We try a possible route to a solution, and if it fails, we backtrack and try another possible route, until we find one that works.
In this case, however, every possible route gets us from the beginning to the end. We can't just try different possibilities until we find one that works. Instead, we have to basically try every route from beginning to end, until we find the one that works best (the lowest weight, in this case).
Now, it's certainly true that with backtracking and pruning, we could (perhaps) improve our approach to this solution, to at least some degree. In particular, let's assume you did a search that started by looking downward (if possible) and then to the side. In this case, with the input you gave, its first attempt would end up being the optimal route.
The question is whether it can recognize that, and prune some branches of the tree without traversing them entirely. The answer is that yes, it can. To do that, it keeps track of the best route it's found so far, and based upon that, it can reject entire sub-trees. In this case its first route gives a total weight of 109. Then it tries to the right of the first node, which is a 2, for a total weight of 3 so far. That's smaller than 109, so it proceeds. From there, it looks downward and gets to the 500. That gives a weight of 503, so without doing any further looking, it knows no route from there can be suitable, so it stops and prunes off all the branches that start from that 500. Then it tries rightward from the 2 and finds another 500. This lets it prune that entire branch as well. So it never has to explore anything past those two 500 nodes - just by looking at them, we can determine that the routes through them can't possibly yield an optimal solution.
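Here is a sketch of that pruned search (a plain depth-first search with a running best, written independently of the DP code above):
#include <vector>
#include <climits>
using namespace std;

// Abandon a partial path as soon as its cost reaches the best complete path found so far.
void dfs(const vector<vector<int>>& grid, int i, int j, int costSoFar, int& best) {
    costSoFar += grid[i][j];
    if (costSoFar >= best) return;                       // prune this whole subtree
    int m = grid.size(), n = grid[0].size();
    if (i == m - 1 && j == n - 1) { best = costSoFar; return; }
    if (i + 1 < m) dfs(grid, i + 1, j, costSoFar, best); // look downward first
    if (j + 1 < n) dfs(grid, i, j + 1, costSoFar, best); // then to the side
}

int minPathSumPruned(const vector<vector<int>>& grid) {
    int best = INT_MAX;
    dfs(grid, 0, 0, 0, best);
    return best;
}
On the 3x3 example it behaves exactly as described: the first route found costs 109, and both 500s reached from the 2 are cut off immediately.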
Whether that's really an improvement on the DP strategy largely comes down to a question of what operations cost how much. For the task at hand, it probably doesn't make much difference either way. If, however, your input matrix was a lot larger, it might. For example, we might have a large input stored in tiles. With a DP solution, we evaluate all the possibilities, so we always load all the tiles. With a tree-trimming approach, we might be able to completely avoid loading some tiles at all, because the routes including those tiles have already been eliminated.

all solutions to change making with dynamic programming

I was reviewing my handouts for our algorithm class and I started to think about this question:
Given different types of coins with different values, find all coin configurations to add up to a certain sum without duplication.
During class, we solved the problem to find the number of all possible ways for a sum and the least number of coins for a sum. However, we never tried to actually find the solutions.
I was thinking about solving this problem with dynamic programming.
I came up with the recursive version (for simplicity I only print the solutions):
void solve(vector<string>& result, string& currSoln, int index, int target, vector<int>& coins)
{
    if(target < 0)
    {
        return;
    }
    if(target == 0)
    {
        result.push_back(currSoln);
    }
    for(int i = index; i < coins.size(); ++i)
    {
        stringstream ss;
        ss << coins[i];
        string newCurrSoln = currSoln + ss.str() + " ";
        solve(result, newCurrSoln, i, target - coins[i], coins);
    }
}
However, I got stuck when trying to use DP to solve the problem.
I have 2 major obstacles:
I don't know what data structure I should use to store previous answers
I don't know what my bottom-up procedure(using loops to replace recursions) should look like.
Any help is welcomed and some codes would be appreciated!
Thank you for your time.
In a dp solution you generate a set of intermediate states, and how many ways there are to get there. Then your answer is the number that wound up in a success state.
So, for change counting, the states are that you got to a specific amount of change. The counts are the number of ways of making change. And the success state is that you made the correct amount of change.
To go from counting solutions to enumerating them you need to keep those intermediate states, and also keep a record in each state of all of the states that transitioned to that one - and information about how. (In the case of change counting, the how would be which coin you added.)
Now with that information you can start from the success state and recursively go backwards through the dp data structures to actually find the solutions rather than the count. The good news is that all of your recursive work is efficient - you're always only looking at paths that succeed so waste no time on things that won't work. But if there are a billion solutions, then there is no royal shortcut that makes printing out a billion solutions fast.
If you wish to be a little clever, though, you can turn this into a usable enumeration. You can, for instance, say "I know there are 4323431 solutions, what is the 432134'th one?" And finding that solution will be quick.
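Here is a minimal sketch of that idea for the coin problem (the denominations and target are made up for illustration, and the counts table itself serves as the record of which transitions are possible). The state (i, t) means "make amount t using coins[i..]", which mirrors the recursion in the question and so avoids duplicates:
#include <iostream>
#include <vector>
using namespace std;

vector<int> coins = {1, 2, 5};              // hypothetical denominations
vector<vector<long long>> count_tbl;        // count_tbl[i][t] = number of solutions of state (i, t)

// Walk the same transitions that built the table, printing every solution exactly once.
void emit(int i, int t, vector<int>& chosen) {
    if (t == 0) {                                            // success state: one complete solution
        for (int c : chosen) cout << c << ' ';
        cout << '\n';
        return;
    }
    if (i == (int)coins.size()) return;
    if (t >= coins[i] && count_tbl[i][t - coins[i]] > 0) {   // transition "use coin i"
        chosen.push_back(coins[i]);
        emit(i, t - coins[i], chosen);
        chosen.pop_back();
    }
    if (count_tbl[i + 1][t] > 0) emit(i + 1, t, chosen);     // transition "skip coin i"
}

int main() {
    int target = 6;
    int n = (int)coins.size();
    count_tbl.assign(n + 1, vector<long long>(target + 1, 0));
    for (int i = 0; i <= n; ++i) count_tbl[i][0] = 1;        // amount 0: the empty solution
    for (int i = n - 1; i >= 0; --i)
        for (int t = 1; t <= target; ++t) {
            count_tbl[i][t] = count_tbl[i + 1][t];                            // skip coin i
            if (t >= coins[i]) count_tbl[i][t] += count_tbl[i][t - coins[i]]; // use coin i
        }
    cout << "number of solutions: " << count_tbl[0][target] << "\n";
    vector<int> chosen;
    emit(0, target, chosen);
    return 0;
}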
It is immediately obvious that you can take a dynamic programming approach. What isn't obvious is that in most cases (depending on the denominations of the coins) you can use the greedy algorithm, which is likely to be more efficient. See Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms, 2nd ed., problem 16.1.

efficient check for value change in array of floats in c++

I want to optimize an OpenGL application. One hotspot is doing expensive handling (uploading to the graphics card) of relatively small arrays (8-64 values) where the values sometimes change but most of the time stay constant. So the most efficient solution would be to upload the array only when it has changed.
Of course the simplest way would be setting flags whenever the data is changed, but this would need many code changes, and for a quick test I would like to know the possible performance gains before too much work has to be done.
So I thought of a quick check (like a murmur hash, etc.) in memory to tell whether the data has changed from frame to frame, and decide on uploading after this check. So the question is: how could I e.g. XOR an array of values like
float vptr[] = { box.x1,box.y1, box.x1,box.y2, box.x2,box.y2, box.x2,box.y1 };
together to reliably detect value changes?
Best & thanks,
Heiner
If you're using Intel, you could look into Intel intrinsics.
http://software.intel.com/en-us/articles/intel-intrinsics-guide gives you an interactive reference where you can explore. There are a bunch of instructions for comparing multiple integers or doubles in one instruction, which is a nice speed-up.
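For example, something along these lines (just a sketch; it assumes you keep a copy of the previous frame's values and that the count is a multiple of 4):
#include <xmmintrin.h>   // SSE

// Compare the current values against the stored previous values, four floats at a time.
bool values_changed(const float* cur, const float* prev, int count) {
    for (int i = 0; i < count; i += 4) {
        __m128 a = _mm_loadu_ps(cur + i);
        __m128 b = _mm_loadu_ps(prev + i);
        // movemask is 0xF only when all four lanes compare equal
        // (note: a NaN lane never compares equal)
        if (_mm_movemask_ps(_mm_cmpeq_ps(a, b)) != 0xF)
            return true;
    }
    return false;
}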
@Ming, thank you for the intrinsics speedup, I will have a look into this.
float vptr[] = { box.x1,box.y1, box.x1,box.y2, box.x2,box.y2, box.x2,box.y1 };
unsigned h = 0;
for (int i = 0; i < (int)(sizeof(vptr) / sizeof(vptr[0])); ++i)
{
    h ^= (unsigned&) vptr[i];   // reinterpret the float's bits as an unsigned int
}
Dead simple, and it worked for the really tiny arrays. The compiler should be able to auto-vectorize it since the size of the array is known. I still have to test for larger arrays.
origin: Hash function for floats

Fast code for searching bit-array for contiguous set/clear bits?

Is there some reasonably fast code out there which can help me quickly search a large bitmap (a few megabytes) for runs of contiguous zero or one bits?
By "reasonably fast" I mean something that can take advantage of the machine word size and compare entire words at once, instead of doing bit-by-bit analysis which is horrifically slow (such as one does with vector<bool>).
It's very useful for e.g. searching the bitmap of a volume for free space (for defragmentation, etc.).
Windows has an RTL_BITMAP data structure one can use along with its APIs.
But I needed the code for this some time ago, so I wrote it here (warning, it's a little ugly):
https://gist.github.com/3206128
I have only partially tested it, so it might still have bugs (especially on reverse). But a recent version (only slightly different from this one) seemed to be usable for me, so it's worth a try.
The fundamental operation for the entire thing is being able to -- quickly -- find the length of a run of bits:
long long GetRunLength(
    const void *const pBitmap, unsigned long long nBitmapBits,
    long long startInclusive, long long endExclusive,
    const bool reverse, /*out*/ bool *pBit);
Everything else should be easy to build upon this, given its versatility.
I tried to include some SSE code, but it didn't noticeably improve the performance. However, in general, the code is many times faster than doing bit-by-bit analysis, so I think it might be useful.
It should be easy to test if you can get a hold of vector<bool>'s buffer somehow -- and if you're on Visual C++, then there's a function I included which does that for you. If you find bugs, feel free to let me know.
I couldn't figure out how to do this well directly on memory words, so I made up a quick solution that works on bytes; for convenience, let's sketch the algorithm for counting contiguous ones:
Construct two tables of size 256 where, for each number between 0 and 255, you store the number of consecutive 1s at the beginning and at the end of the byte. For example, for the number 167 (10100111 in binary), put 1 in the first table and 3 in the second. Let's call the first table BBeg and the second table BEnd. Then, for each byte b, there are two cases: if it is 255, add 8 to your current contiguous run of ones, and you stay inside a region of ones. Otherwise, you end a region with BBeg[b] additional bits and begin a new one with BEnd[b] bits.
Depending on what information you want, you can adapt this algorithm (this is one reason why I don't put any code here; I don't know what output you want).
A flaw is that it does not count (small) contiguous sets of ones that lie entirely inside one byte...
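For what it's worth, here is a minimal sketch of that table idea, specialized to one possible output (the longest run of ones) and assuming the bits within each byte are ordered most significant first:
// Table helpers: ones at the beginning (MSB side) and at the end (LSB side) of a byte.
static int leading_ones(unsigned b)  { int n = 0; while (b & 0x80) { ++n; b = (b << 1) & 0xFFu; } return n; }
static int trailing_ones(unsigned b) { int n = 0; while (b & 1)    { ++n; b >>= 1; }             return n; }

int longest_run_of_ones(const unsigned char* bytes, long long n) {
    int BBeg[256], BEnd[256];
    for (int i = 0; i < 256; ++i) { BBeg[i] = leading_ones(i); BEnd[i] = trailing_ones(i); }

    int best = 0, current = 0;                      // current = length of the run continuing into this byte
    for (long long i = 0; i < n; ++i) {
        unsigned b = bytes[i];
        if (b == 0xFF) { current += 8; continue; }  // still inside a region of ones
        current += BBeg[b];                         // the run ends somewhere in this byte
        if (current > best) best = current;
        current = BEnd[b];                          // a new run may start at the end of this byte
    }
    if (current > best) best = current;
    return best;
}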
Besides this algorithm, a friend tells me that if it is for disk compression, just look for bytes different from 0 (empty disk area) and 255 (full disk area). It is a quick heuristic to build a map of which blocks you have to compress. Maybe it is beyond the scope of this topic...
Sounds like this might be useful:
http://www.aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
and
http://www.aggregate.org/MAGIC/#Leading%20Zero%20Count
You don't say whether you want to do some sort of RLE or to simply count in-byte zero and one bits (like 0b1001 should return 1x1 2x0 1x1).
A lookup table plus a SWAR algorithm for a fast check might give you that information easily.
A bit like this:
byte lut[0x10000] = { /* see below */ };
for (uint * word = words; word < words + bitmapSize; word++) {
    if (*word == 0 || *word == (uint)-1) // Fast bailout
    {
        // Do what you want if all 0 or all 1
        continue;
    }
    byte hiVal = lut[*word >> 16], loVal = lut[*word & 0xFFFF];
    // Do what you want with hiVal and loVal
}
The LUT will have to be constructed depending on your intended algorithm. If you want to count the number of contiguous 0s and 1s in the word, you'd build it like this:
for (int i = 0; i < sizeof(lut); i++)
    lut[i] = countContiguousZero(i); // Or countContiguousOne(i)
// The implementation of countContiguousZero can be slow, you don't care.
// It should return the largest number of contiguous zeros (0 to 15, using the 4 low bits
// of the byte) and might return the position of the run in the 4 high bits of the byte.
// Since you've already dismissed *word == 0, you don't need the 16-contiguous-zeros case.

C++, using one byte to store two variables

I am working on a representation of the chess board, and I am planning to store it in a 32-byte array, where each byte will be used to store two pieces. (That way only 4 bits are needed per piece.)
Doing it this way results in overhead when accessing a particular index of the board.
Do you think this code can be optimised, or that a completely different method of accessing indexes should be used?
#include <iostream>
#include <cstdlib>
using namespace std;

char getPosition(unsigned char* c, int index){
    //moving pointer
    c += (index >> 1);
    //odd number
    if (index & 1){
        //taking right part
        return *c & 0xF;
    } else {
        //taking left part
        return *c >> 4;
    }
}

void setValue(unsigned char* board, char value, int index){
    //moving pointer
    board += (index >> 1);
    //odd number
    if (index & 1){
        //replace right part
        //save left value only 4 bits
        *board = (*board & 0xF0) + value;
    } else {
        //replacing left part
        *board = (*board & 0xF) + (value << 4);
    }
}

int main() {
    char* c = (char*)malloc(32);
    for (int i = 0; i < 64; i++){
        setValue((unsigned char*)c, i % 8, i);
    }
    for (int i = 0; i < 64; i++){
        cout << (int)getPosition((unsigned char*)c, i) << " ";
        if (((i + 1) % 8 == 0) && (i > 0)){
            cout << endl;
        }
    }
    return 0;
}
I am equally interested in your opinions regarding chess representations and in optimisation of the method above as a standalone problem.
Thanks a lot
EDIT
Thanks for your replies. A while ago I created a checkers game, where I used a 64-byte board representation. This time I am trying some different methods, just to see what I like. Memory is not such a big problem. Bitboards are definitely on my list to try. Thanks.
That's the problem with premature optimization. Where your chess board would have taken 64 bytes to store, now it takes 32. What has this really bought you? Did you actually analyze the situation to see if you needed to save that memory?
Assuming that you used one of the least optimal search methods, a straight alpha-beta search to depth D with no heuristics, and you generate all possible moves in a position before searching, then the absolute maximum memory required for your boards is going to be sizeof(board) * W * D. If we assume a rather large W = 100 and a large D = 30, then you're going to have 3000 boards in memory at depth D. Roughly 192 KB vs 96 KB... is it really worth it?
On the other hand, you've increased the amount of operations necessary to access board[location] and this will be called many millions of times per search.
When building chess AIs, the main thing you'll end up looking for is CPU cycles, not memory. This may vary a little if you're targeting a cell phone or something, but even then you're going to worry more about speed before you ever reach enough depth to cause any memory issues.
As to which representation I prefer... I like bitboards. I haven't done a lot of serious measurements, but I did compare two engines I made, one bitboard and one array-based, and the bitboard one was faster and could reach much greater depths than the other.
Let me be the first to point out a potential bug (depending on compilers and compiler settings) - and bugs are why premature optimization is evil:
//taking left part
return *c>>4;
if *c is negative, then >> may replicate the negative high bit, i.e. in binary:
0b10100000 >> 4 == 0b11111010
for some compilers (the C++ standard leaves it to the compiler to decide both whether to propagate the high bit and whether char is signed or unsigned).
If you do want to go forward with your packed bits (and let me say that you probably shouldn't bother, but it is up to you), I would suggest wrapping the packed bits in a class and overloading [] such that
board[x][y]
gives you the unpacked value. Then you can turn the packing on and off easily and keep the same syntax in either case. If you inline the operator overloads, it should be as efficient as the code you have now.
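A minimal sketch of what that wrapper could look like (class and member names are mine), using a small proxy so the same operator[] works for both reading and writing; a 2D operator could be layered on top in the same way:
#include <cstdint>

class PackedBoard {
    std::uint8_t data[32] = {};

    class Ref {                               // proxy: lets board[i] appear on either side of '='
        std::uint8_t& byte;
        bool high;                            // even index -> high nibble, odd -> low nibble
    public:
        Ref(std::uint8_t& b, bool h) : byte(b), high(h) {}
        operator int() const { return high ? (byte >> 4) : (byte & 0xFU); }
        Ref& operator=(int v) {
            if (high) byte = std::uint8_t((byte & 0x0FU) | ((v & 0xFU) << 4));
            else      byte = std::uint8_t((byte & 0xF0U) | (v & 0xFU));
            return *this;
        }
    };

public:
    Ref operator[](int index) { return Ref(data[index >> 1], (index & 1) == 0); }
    int operator[](int index) const {
        std::uint8_t b = data[index >> 1];
        return (index & 1) ? (b & 0xFU) : (b >> 4);
    }
};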
Well, 64 bytes is a very small amount of RAM. You're better off just using a char[8][8]. That is, unless you plan on storing a ton of chess boards. Doing char[8][8] makes it easier (and faster) to access the board and do more complex operations on it.
If you're still interested in storing the board in a packed representation (either for practice or to store a lot of boards), I'd say you're "doing it right" regarding the bit operations. If you're going for speed, you may want to consider marking your accessors with the inline keyword.
Is space enough of a concern that you can't just use a full byte to represent a square? That would make accesses easier to follow in the program, and most likely faster as well, since the bit manipulations are not required.
Otherwise, to make sure everything goes smoothly, I would make all your types unsigned: have getPosition return unsigned char, and qualify all your numeric literals with "U" (0xF0U for example) to make sure they're always interpreted as unsigned. Most likely you won't have any problems with signedness, but why take chances on some architecture that behaves unexpectedly?
Nice code, but if you are really that deep into performance optimization, you should probably learn more about your particular CPU architecture.
AFAIK, you may find that storing a chess piece in as much as 8 bytes will be more efficient. Even if you recurse 15 moves deep, L2 cache size would hardly be a constraint, but RAM misalignment may be. I would guess that proper handling of a chess board would include Expand() and Reduce() functions to translate between board representations during different parts of the algorithm: some parts may be faster on the compact representation, and some the other way around. For example, caching, and algorithms involving hashing by composition of two adjacent cells, might be good for the compact structure; everything else, not.
I would also consider developing some helper hardware, like an FPGA board, or some GPU code, if performance is that important.
As a chess player, I can tell you: there's more to a position than the mere placement of each piece. You have to take into consideration some other things:
Which side has to move next?
Can white and/or black castle king and/or queenside?
Can a pawn be taken en passant?
How many moves have passed since the last pawn move and/or capturing move?
If the data structure you use to represent a position doesn't reflect this information, then you're in big trouble.
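A sketch of that extra state, just to make the list concrete (field names are illustrative):
#include <cstdint>

struct PositionState {
    bool white_to_move;                  // which side has to move next
    bool white_castle_kingside, white_castle_queenside;
    bool black_castle_kingside, black_castle_queenside;
    std::int8_t en_passant_file;         // file of a pawn that just advanced two squares, or -1
    int halfmove_clock;                  // moves since the last pawn move or capture
};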