Today I've run into this problem, but I couldn't solve it after a period of time. I need some help
I have number N. The problem is to find next higher number ( > N ) with only one zero bit in binary.
Example:
Number 1 can be represented in binary as 1.
Next higher number with only one zero bit is 2 - Binary 10
A few other examples:
N = 2 (10), next higher number with one zero bit is 5 (101)
N = 5 (101), next higher number is 6 (110)
N = 7 (111), next higher number is 11 (1011)
List of 200 number:
1 1
2 10 - 1
3 11
4 100
5 101 - 1
6 110 - 1
7 111
8 1000
9 1001
10 1010
11 1011 - 1
12 1100
13 1101 - 1
14 1110 - 1
15 1111
16 10000
17 10001
18 10010
19 10011
20 10100
21 10101
22 10110
23 10111 - 1
24 11000
25 11001
26 11010
27 11011 - 1
28 11100
29 11101 - 1
30 11110 - 1
31 11111
32 100000
33 100001
34 100010
35 100011
36 100100
37 100101
38 100110
39 100111
40 101000
41 101001
42 101010
43 101011
44 101100
45 101101
46 101110
47 101111 - 1
48 110000
49 110001
50 110010
51 110011
52 110100
53 110101
54 110110
55 110111 - 1
56 111000
57 111001
58 111010
59 111011 - 1
60 111100
61 111101 - 1
62 111110 - 1
63 111111
64 1000000
65 1000001
66 1000010
67 1000011
68 1000100
69 1000101
70 1000110
71 1000111
72 1001000
73 1001001
74 1001010
75 1001011
76 1001100
77 1001101
78 1001110
79 1001111
80 1010000
81 1010001
82 1010010
83 1010011
84 1010100
85 1010101
86 1010110
87 1010111
88 1011000
89 1011001
90 1011010
91 1011011
92 1011100
93 1011101
94 1011110
95 1011111 - 1
96 1100000
97 1100001
98 1100010
99 1100011
100 1100100
101 1100101
102 1100110
103 1100111
104 1101000
105 1101001
106 1101010
107 1101011
108 1101100
109 1101101
110 1101110
111 1101111 - 1
112 1110000
113 1110001
114 1110010
115 1110011
116 1110100
117 1110101
118 1110110
119 1110111 - 1
120 1111000
121 1111001
122 1111010
123 1111011 - 1
124 1111100
125 1111101 - 1
126 1111110 - 1
127 1111111
128 10000000
129 10000001
130 10000010
131 10000011
132 10000100
133 10000101
134 10000110
135 10000111
136 10001000
137 10001001
138 10001010
139 10001011
140 10001100
141 10001101
142 10001110
143 10001111
144 10010000
145 10010001
146 10010010
147 10010011
148 10010100
149 10010101
150 10010110
151 10010111
152 10011000
153 10011001
154 10011010
155 10011011
156 10011100
157 10011101
158 10011110
159 10011111
160 10100000
161 10100001
162 10100010
163 10100011
164 10100100
165 10100101
166 10100110
167 10100111
168 10101000
169 10101001
170 10101010
171 10101011
172 10101100
173 10101101
174 10101110
175 10101111
176 10110000
177 10110001
178 10110010
179 10110011
180 10110100
181 10110101
182 10110110
183 10110111
184 10111000
185 10111001
186 10111010
187 10111011
188 10111100
189 10111101
190 10111110
191 10111111 - 1
192 11000000
193 11000001
194 11000010
195 11000011
196 11000100
197 11000101
198 11000110
199 11000111
200 11001000
There are three cases.
The number x has more than one zero bit in its binary representation. All but one of these zero bits must be "filled in" with 1 to obtain the required result. Notice that all numbers obtained by taking x and filling in one or more of its low-order zero bits are numerically closer to x compared to the number obtained by filling just the top-most zero bit. Therefore the answer is the number x with all-but-one of its zero bits filled: only its topmost zero bit remains unfilled. For example if x=110101001 then the answer is 110111111. To get the answer, find the index i of the topmost zero bit of x, and then calculate the bitwise OR of x and 2^i - 1.
C code for this case:
// warning: this assumes x is known to have *some* (>1) zeros!
unsigned next(unsigned x)
{
unsigned topmostzero = 0;
unsigned bit = 1;
while (bit && bit <= x) {
if (!(x & bit)) topmostzero = bit;
bit <<= 1;
}
return x | (topmostzero - 1);
}
The number x has no zero bits in binary. It means that x=2^n - 1 for some number n. By the same reasoning as above, the answer is then 2^n + 2^(n-1) - 1. For example, if x=111, then the answer is 1011.
The number x has exactly one zero bit in its binary representation. We know that the result must be strictly larger than x, so x itself is not allowed to be the answer. If x has the only zero in its least-significant bit, then this case reduces to case #2. Otherwise, the zero should be moved one position to the right. Assuming x has zero in its i-th bit, the answer should have its zero in i-1-th bit. For example, if x=11011, then the result is 11101.
You could also use another approach:
Every number with exactly one zero bit can be represented as
2^n - 1 - 2^m
Now the task is easy:
1. Find an n, great enough for at least 2^n-1-2^0>x, that's equivalent to 2^n>x+2
2. Find the greatest m for which 2^n-1-2^m is still greater than x.
as Code:
#include <iostream>
#include <math.h>
using namespace std;
//binary representation
void bin(unsigned n)
{
for (int i = floor(log2(n));i >= 0;--i)
(n & (1<<i))? printf("1"): printf("0");
}
//outputs the next greater int to x with exactly one 0 in binary representation
int nextHigherOneZero(int x)
{
unsigned int n=0;
while((1<<n)<= x+2 ) ++n;
unsigned int m=0;
while((1<<n)-1-(1<<(m+1)) > x && m<n-2)
++m;
return (1<<n)-1-(1<<m);
}
int main()
{
int r=0;
for(int i = 1; i<100;++i){
r=nextHigherOneZero(i);
printf("\nX: %i=",i);
bin(i);
printf(";\tnextHigherOneZero(x):%i=",r);
bin(r);
printf("\n");
}
return 0;
}
You can try it here (with some additional Debug-Output):
http://ideone.com/6w3fAN
As a note: its probably possible to get m and n faster with some good binary logic, feel free to contribute...
Pro of this approach:
No assumptions needs to be made
Cons:
Ugly while loops
couldn't miss the opportunity to remember binary logic :), here's my solution:
here's main
main(int argc, char** argv)
{
int i = 139261;
i++;
while (!oneZero(i))
{
i++;
}
std::cout << i;
}
and here's all logic to find if number has 1 zero
bool oneZero(int i)
{
int count = 0;
while (i != 0)
{
// check last bit if it is zero
if ((1 & i) == 0) {
count++;
if (count > 1) return false;
}
// make the number shorter :)
i = i >> 1;
}
return (count == 1);
}
Related
I'm new to C++ and I wanted to compress a vector via ZSTD compression library. I used ZSTD simple API ZSTD_compress and ZSTD_decompress in the same way as the example. But I found a wired issue that when I compressed and decompressed a vector, the decompressed vector was not the same as the original vector. I'm not sure which part of my operation went wrong.
I looked at ZSTD's GitHub homepage and didn't find an answer. Please help or try to give some ideas how to solve it.
Example C code: https://github.com/facebook/zstd/blob/dev/examples/simple_compression.c
//Initialize a vector
vector<int> NumToCompress ;
NumToCompress.resize(10000);
for(int i = 0; i < 10000; i++)
{
NumToCompress[i] = rand()% 255;
}
//compress
int* com_ptr = NULL;
size_t NumSize = NumToCompress.size();
size_t Boundsize = ZSTD_compressBound(NumSize);
com_ptr =(int*) malloc(Boundsize);
size_t ComSize;
ComSize = ZSTD_compress(com_ptr,Boundsize,NumToCompress.data(),NumToCompress.size(),ZSTD_fast);
//decompress
int* decom_ptr = NULL;
unsigned long long decom_Boundsize;
decom_Boundsize = ZSTD_getFrameContentSize(com_ptr, ComSize);
decom_ptr = (int*)malloc(decom_Boundsize);
size_t DecomSize;
DecomSize = ZSTD_decompress(decom_ptr, decom_Boundsize, com_ptr, ComSize);
vector<int> NumAfterDecompress(decom_ptr,decom_ptr+DecomSize);
//check if two vectors are same
if(NumToCompress == NumAfterDecompress)
{
cout << "Two vectors are same" << endl;
}else
{cout << "Two vectors are insame" << endl;}
free(com_ptr);
free(decom_ptr);
case 1: If zstd can compress std::vector directly?
case 2: How to properly compress vectors with zstd if zstd can compress std::vector directly ?
Two vectors are insame
Original vector:
163 151 162 85 83 190 241 252 249 121 107 82 20 19 233 226 45 81 142 31 86 8 87 39 167 5 212 208 82 130 119 117 27 153 74 237 88 61 106 82 54 213 36 74 104 142 173 149 95 60 53 181 196 140 221 108 17 50 61 226 180 180 89 207 206 35 61 39 223 167 249 150 252 30 224 102 44 14 123 140 202 48 66 143 188 159 123 206 209 184 177 135 236 138 214 187 46 21 99 14
Decompressed vector:
163 151 162 85 83 190 241 252 249 121 107 82 20 19 233 226 45 81 142 31 86 8 87 39 167 0 417 0551929248 21916 551935408 21916 551933352 21916 551939512 21916 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Seems likely that your API expects the size to be in bytes but you give the size as the number of elements. So you need to multiply the number of elements by the size of each element. Like this
ComSize = ZSTD_compress(com_ptr, Boundsize, NumToCompress.data(),
NumToCompress.size()*sizeof(int), ZSTD_fast);
and similarly when you decompress you need to divide by the element size
DecomSize = ZSTD_decompress(decom_ptr, decom_Boundsize, com_ptr, ComSize);
vector<int> NumAfterDecompress(decom_ptr, decom_ptr+DecomSize/sizeof(int));
I've read with a logic analyzer a TX of a controller. I know that it works at 1200 Baud and I have identified the frames according to this photo:
Frames
I have identified in the frame:
1byte - Always 54
2byte - Sequential, ++ per frame
3byte - Always 0
4, 5 and 6byte - Data
7byte - Always 0
8, 9, 10 and 11- Data
12, 13, 14 and 15- They vary (I understand that 15 is a checksum)
I cannot identify the checksum (I suspect Checksum8 Xor due to its similarity to another controller).
I try to take each byte with Arduino to a position of an Array, knowing that the first byte is constant (54) and the frame is always the same length.
Could it be that as the Arduino loop being faster than Serial it duplicates the data in all positions of each Array?
Looking for information I have read that when I make a Serial.print it works correctly (each byte is not repeated, but when I write an array with while (Serial.available ()) it fails.
I leave some frames obtained through arduino with:
#include <Arduino.h>
void setup() {
Serial.begin(9600);
Serial1.begin(1200);
}
void loop() {
if (Serial1.available()) {
int test= Serial1.read();
Serial.println(test);
}
}
54 154 0 84 84 84 0 84 84 84 84 201 133 84 224
54 155 0 89 89 89 0 89 89 89 89 206 138 89 233
54 156 0 2 2 2 0 2 2 2 2 119 51 2 238
54 157 0 7 7 7 0 7 7 7 7 124 56 7 239
54 158 0 0 0 0 0 0 0 0 0 117 49 0 236
54 159 0 5 5 5 0 5 5 5 5 122 54 5 229
Any help is welcome
Thank you very much
Here is the struct:
struct RGB {
byte r;
byte g;
byte b;
byte v;
};
And here is the sorting function:
void insertion_sort(RGB arr[], size_t capacity) {
RGB temp;
size_t j;
for(size_t i = 1; i < capacity; i++) {
temp = arr[i];
j = i - 1;
while(j >= 0 && arr[j].v > temp.v) {
arr[j+1] = arr[j];
j--;
}
arr[j+1] = temp;
}
}
the v member of the RGB struct is the object's position in the array before scrambling (and hopefully where it ends up after sorting). Here is the output when printing the v member of every element of the array:
165 171 53 164 171 13 13 167 156 168 163 14 15 16 17 18 19 20 20 156 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 142 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149
I get the same output every time.
There are 150 elements in the array, and everything seems fine after 21. The issue is that the elements before that are wrong, and also made up of values higher than 149. Currently, the b and g members of RGB are all 0 throughout the array, and r is evenly-spaced values from 0-254.
It should be noted that the exact same sorting function works when I recreated everything in C++, and it even works to sort an array of ints in Arduino. The only issue is sorting an array of RGB structs, which seems to be causing memory issues.
Your problem is the definition of j.
Your options are:
Serial output in critical lines and analyze the serial monitor -> Arduinos way of debuging
using one of the hundred tested sorting functions (e.g. here on stack exchange) like quick sort etc.
using the std::sort
For the last option you have to:
The best solution is to use the C++ standard library function std::sort. This function is type-safe, and doesn't require you to cast anything to void * and back, and you don't have to manually enter the sizes of the array and its elements.
#include < Arduino_Helpers.h\>
#include < AH/STL/algorithm\>
#include < AH/STL/iterator\>
auto cmpfunc = [](RGB A, RGB B) { return A.value < B.value; };
// or if you do not like this notation use
bool cmpfunc(RGB A, RGB B) {
return A.value < B.value;
}
std::sort(std::begin(RGB arr[]), std::end(RGB arr[]), cmpfunc); // Dummy for a RGB arr[] containing all your values
-
The problem was because of the size_t datatype, which is unsigned and could not become a value < 0 when it was supposed to. Thank you #JohnFilleau
I'm trying to store a number in an array of 4 integers. The array is in the class Num. My problem is that when I call getValue, the function returns numbers that aren't correct. I tried go through the program on paper, doing all the calculations in Microsoft's calculator, and the program should give the correct output. I don't even know which function could be problematic since there aren't any errors or warnings, and both worked on paper.
21 in binary:10101
What I'm trying to do:
Input to setValue function: 21
setValue puts the first four bits of 21 (0101) into num[3]. So num[3] is now 0101 in binary. Then it should put the next four bits of 21 into num[2]. The next four bits are 0001 so 0001 goes into num[2] The rest of the bits are 0 so we ignore them. Now num is {0,0,1,5}. getValue first goes to num[3]. There is 5 which is 0101 in binary. So it puts that into the first four bits of return value. It then puts 0001 into the next four bits. The rest of the numbers are 0 so it is supposed to ignore them. Then the output of the function getValue is directly printed out. The actual output is at the bottom.
My code:
#include <iostream>
class Num {
char len = 4;
int num[4];
public:
void setValue(int);
int getValue();
};
void Num::setValue(int toSet)
{
char len1=len-1;
for (int counter = len1;counter>=0;counter--)
{
if(toSet&(0xF<<(len1-counter))!=0)
{
num[counter]=(toSet&(0xF<<(len1-counter)))>>len1-counter;
} else {
break;
}
}
}
int Num::getValue()
{
char len1 = len-1;
int returnValue = 0;
for(char counter = len1; counter>=0;counter--)
{
if (num[counter]!=0) {
returnValue+=(num[counter]<<(len1-counter));
} else {
break;
}
}
return returnValue;
}
int main()
{
int x=260;
Num number;
while (x>0)
{
number.setValue(x);
std::cout<<x<<"Test: "<<number.getValue()<<std::endl;
x--;
}
std::cin>>x;
return 0;
}
Output:
260Test: -1748023676
259Test: 5
258Test: 5
257Test: 1
256Test: 1
255Test: 225
254Test: 225
253Test: 221
252Test: 221
251Test: 213
250Test: 213
249Test: 209
248Test: 209
247Test: 193
246Test: 193
245Test: 189
244Test: 189
243Test: 181
242Test: 181
241Test: 177
240Test: 177
239Test: 177
238Test: 177
237Test: 173
236Test: 173
235Test: 165
234Test: 165
233Test: 161
232Test: 161
231Test: 145
230Test: 145
229Test: 141
228Test: 141
227Test: 133
226Test: 133
225Test: 1
224Test: 1
223Test: 161
222Test: 161
221Test: 157
220Test: 157
219Test: 149
218Test: 149
217Test: 145
216Test: 145
215Test: 129
214Test: 129
213Test: 125
212Test: 125
211Test: 117
210Test: 117
209Test: 113
208Test: 113
207Test: 113
206Test: 113
205Test: 109
204Test: 109
203Test: 101
202Test: 101
201Test: 97
200Test: 97
199Test: 81
198Test: 81
197Test: 77
196Test: 77
195Test: 5
194Test: 5
193Test: 1
192Test: 1
191Test: 161
190Test: 161
189Test: 157
188Test: 157
187Test: 149
186Test: 149
185Test: 145
184Test: 145
183Test: 129
182Test: 129
181Test: 125
180Test: 125
179Test: 117
178Test: 117
177Test: 113
176Test: 113
175Test: 113
174Test: 113
173Test: 109
172Test: 109
171Test: 101
170Test: 101
169Test: 97
168Test: 97
167Test: 81
166Test: 81
165Test: 77
164Test: 77
163Test: 69
162Test: 69
161Test: 1
160Test: 1
159Test: 97
158Test: 97
157Test: 93
156Test: 93
155Test: 85
154Test: 85
153Test: 81
152Test: 81
151Test: 65
150Test: 65
149Test: 61
148Test: 61
147Test: 53
146Test: 53
145Test: 49
144Test: 49
143Test: 49
142Test: 49
141Test: 45
140Test: 45
139Test: 37
138Test: 37
137Test: 33
136Test: 33
135Test: 17
134Test: 17
133Test: 13
132Test: 13
131Test: 5
130Test: 5
129Test: 1
128Test: 1
127Test: 225
126Test: 225
125Test: 221
124Test: 221
123Test: 213
122Test: 213
121Test: 209
120Test: 209
119Test: 193
118Test: 193
117Test: 189
116Test: 189
115Test: 181
114Test: 181
113Test: 177
112Test: 177
111Test: 177
110Test: 177
109Test: 173
108Test: 173
107Test: 165
106Test: 165
105Test: 161
104Test: 161
103Test: 145
102Test: 145
101Test: 141
100Test: 141
99Test: 133
98Test: 133
97Test: 1
96Test: 1
95Test: 161
94Test: 161
93Test: 157
92Test: 157
91Test: 149
90Test: 149
89Test: 145
88Test: 145
87Test: 129
86Test: 129
85Test: 125
84Test: 125
83Test: 117
82Test: 117
81Test: 113
80Test: 113
79Test: 113
78Test: 113
77Test: 109
76Test: 109
75Test: 101
74Test: 101
73Test: 97
72Test: 97
71Test: 81
70Test: 81
69Test: 77
68Test: 77
67Test: 5
66Test: 5
65Test: 1
64Test: 1
63Test: 161
62Test: 161
61Test: 157
60Test: 157
59Test: 149
58Test: 149
57Test: 145
56Test: 145
55Test: 129
54Test: 129
53Test: 125
52Test: 125
51Test: 117
50Test: 117
49Test: 113
48Test: 113
47Test: 113
46Test: 113
45Test: 109
44Test: 109
43Test: 101
42Test: 101
41Test: 97
40Test: 97
39Test: 81
38Test: 81
37Test: 77
36Test: 77
35Test: 69
34Test: 69
33Test: 1
32Test: 1
31Test: 97
30Test: 97
29Test: 93
28Test: 93
27Test: 85
26Test: 85
25Test: 81
24Test: 81
23Test: 65
22Test: 65
21Test: 61
20Test: 61
19Test: 53
18Test: 53
17Test: 49
16Test: 49
15Test: 49
14Test: 49
13Test: 45
12Test: 45
11Test: 37
10Test: 37
9Test: 33
8Test: 33
7Test: 17
6Test: 17
5Test: 13
4Test: 13
3Test: 5
2Test: 5
1Test: 1
I compiled this with g++ 6.3.0 with the command g++ a.cpp -o a.exe
When compiling with -Wall, there are a number of warnings:
orig.cpp: In member function ‘void Num::setValue(int)’:
orig.cpp:15:39: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]
if(toSet&(0xF<<(len1-counter))!=0)
~~~~~~~~~~~~~~~~~~~~~^~~
orig.cpp:17:61: warning: suggest parentheses around ‘-’ inside ‘>>’ [-Wparentheses]
num[counter]=(toSet&(0xF<<(len1-counter)))>>len1-counter;
~~~~^~~~~~~~
orig.cpp: In member function ‘int Num::getValue()’:
orig.cpp:30:24: warning: array subscript has type ‘char’ [-Wchar-subscripts]
if (num[counter]!=0) {
^
orig.cpp:31:38: warning: array subscript has type ‘char’ [-Wchar-subscripts]
returnValue+=(num[counter]<<(len1-counter));
^
If you were to print the values of num before changing them, you'd see that some might be non-zero (i.e. they are uninitialized), which causes undefined behavior and probably breaks your for loops in getValue and setValue.
So change:
int num[4];
Into:
int num[4] = { 0 };
Here's a cleaned up version with the warnings fixed:
#include <iostream>
class Num {
int len = 4;
int num[4] = { 0 };
public:
void setValue(int);
int getValue();
void showval();
};
void Num::setValue(int toSet)
{
int len1=len-1;
for (int counter = len1;counter>=0;counter--)
{
if ((toSet & (0xF << (len1-counter))) != 0)
{
num[counter] = (toSet & (0xF << (len1-counter))) >> (len1-counter);
} else {
break;
}
}
}
int Num::getValue()
{
int len1 = len-1;
int returnValue = 0;
for(int counter = len1; counter>=0;counter--)
{
if (num[counter]!=0) {
returnValue+=(num[counter]<<(len1-counter));
} else {
break;
}
}
return returnValue;
}
void Num::showval()
{
for (int i = 0; i < len; ++i)
std::cout << i << ": show: " << num[i] << "\n";
#if 0
for (int i = 0; i < len; ++i)
num[i] = 0;
#endif
}
int main()
{
int x=260;
Num number;
number.showval();
while (x>0)
{
number.setValue(x);
std::cout << x << " Test: " << number.getValue() << std::endl;
x--;
}
std::cin>>x;
return 0;
}
To break a number into nibbles, the shift counts should be multiples of 4. Otherwise slices of 4 bits are extracted that don't line up.
00010101 (21)
^^^^ first nibble
^^^^ second nibble
The second nibble is displaced by 4 bits so it needs to be shifted right by 4, not by 1.
You could multiply your shift counts by 4, but there is an easier way: only ever shift by 4. For example:
for (int i = len - 1; i >= 0; i--) {
num[i] = toSet & 0xF;
toSet >>= 4;
}
Then every iteration extracts the lowest nibble in toSet, and shifts toSet over so that the next nibble becomes the lowest nibble.
I didn't put in a break and there should not be one. It definitely shouldn't be the kind of break that you had, which stops the loop also whenever a number has a zero in the middle of it (for example in 0x101 the middle 0 causes the loop to stop). The loop also should not stop when the entire rest of the number is zero, since that leaves junk in the other entries of num.
It's more common to store the lowest nibble in the 0th element and so on (then you don't have to deal with all the "reverse logic" with down-counting loops and subtracting things from the length) but that's up to you.
Extracting the value can be done symmetrically, building up the result while shifting it, instead of shifting every piece into its final place immediately. Or just multiply (len1-counter) by 4. While extracting the value, you also cannot stop when num[i] is zero, since that does not prove that the rest of the number is zero too.
I have implemented Summed Area Table (or Integral image) in C++ using OpenMP.
The problem is that the Sequential code is always faster then the Parallel code even changing the number of threads and image sizes.
For example I tried images from (100x100) to (10000x10000) and threads from 1 to 64, but none of the combination is ever faster.
I also tried this code in different machines like:
Mac OSX 1,4 GHz Intel Core i5 dual core
Mac OSX 2,3 GHz Intel Core i7 quad core
Ubuntu 16.04 Intel Xeon E5-2620 2,4 GHz 12 cores
The time has been measured with OpenMP function: omp_get_wtime().
For compiling I use: g++ -fopenmp -Wall main.cpp.
Here is the parallel code:
void transpose(unsigned long *src, unsigned long *dst, const int N, const int M) {
#pragma omp parallel for
for(int n = 0; n<N*M; n++) {
int i = n/N;
int j = n%N;
dst[n] = src[M*j + i];
}
}
unsigned long * integralImageMP(uint8_t*x, int n, int m){
unsigned long * out = new unsigned long[n*m];
unsigned long * rows = new unsigned long[n*m];
#pragma omp parallel for
for (int i = 0; i < n; ++i)
{
rows[i*m] = x[i*m];
for (int j = 1; j < m; ++j)
{
rows[i*m + j] = x[i*m + j] + rows[i*m + j - 1];
}
}
transpose(rows, out, n, m);
#pragma omp parallel for
for (int i = 0; i < n; ++i)
{
rows[i*m] = out[i*m];
for (int j = 1; j < m; ++j)
{
rows[i*m + j] = out[i*m + j] + rows[i*m + j - 1];
}
}
transpose(rows, out, m, n);
delete [] rows;
return out;
}
Here is the sequential code:
unsigned long * integralImage(uint8_t*x, int n, int m){
unsigned long * out = new unsigned long[n*m];
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < m; ++j)
{
unsigned long val = x[i*m + j];
if (i>=1)
{
val += out[(i-1)*m + j];
if (j>=1)
{
val += out[i*m + j - 1] - out[(i-1)*m + j - 1];
}
} else {
if (j>=1)
{
val += out[i*m + j -1];
}
}
out[i*m + j] = val;
}
}
return out;
}
I also tried without the transpose but it was even slower probably because the cache accesses.
An example of calling code:
int main(int argc, char **argv){
uint8_t* image = //read image from file (gray scale)
int height = //height of the image
int width = //width of the image
double start_omp = omp_get_wtime();
unsigned long* integral_image_parallel = integralImageMP(image, height, width); //parallel
double end_omp = omp_get_wtime();
double time_tot = end_omp - start_omp;
std::cout << time_tot << std::endl;
start_omp = omp_get_wtime();
unsigned long* integral_image_serial = integralImage(image, height, width); //sequential
end_omp = omp_get_wtime();
time_tot = end_omp - start_omp;
std::cout << time_tot << std::endl;
return 0;
}
Each thread is working on a block of rows (maybe an illustration of what each thread is doing can be useful):
Where ColumnSum is done transposing the matrix and repeating RowSum.
Let me first say, that the results are a bit surprising to me and I would guesstimate the problem being in the non local memory access required by the transpose algorithm.
You can anyway mitigate it by turning your sequential algorithm into parallel by a two pass approach. The first pass has to calculate the 2D integral in T threads N rows apart and the second pass must compensate the fact that each block didn't start from the accumulated result of the previous row but from zero.
An example with Matlab shows the principle in 2D.
f=fix(rand(12,8)*8) % A random matrix with 12 rows, 8 columns
5 6 1 4 7 5 4 4
4 6 0 7 1 3 2 0
7 0 2 3 0 1 6 3
5 3 1 7 4 3 7 2
6 4 3 2 7 3 5 1
3 3 2 5 5 0 2 1
3 5 7 5 1 4 4 3
6 5 7 4 2 1 0 0
0 2 0 5 3 3 7 4
1 3 5 5 7 4 7 3
1 0 2 1 1 2 6 5
3 7 3 1 6 2 2 5
ff=cumsum(cumsum(f')') % The Summed Area Table
5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113
27 46 53 76 95 110 134 144
30 52 61 89 113 128 154 165
33 60 76 109 134 153 183 197
39 71 94 131 158 178 208 222
39 73 96 138 168 191 228 246
40 77 105 152 189 216 260 281
41 78 108 156 194 223 273 299
44 88 121 170 214 245 297 328
fx=[cumsum(cumsum(f(1:4,:)')'); % The original table summed in
cumsum(cumsum(f(5:8,:)')'); % three parts -- 4 rows per each
cumsum(cumsum(f(9:12,:)')')] % "thread"
5 11 12 16 23 28 32 36
9 21 22 33 41 49 55 59
16 28 31 45 53 62 74 81
21 36 40 61 73 85 104 113 %% Notice this row #4
6 10 13 15 22 25 30 31
9 16 21 28 40 43 50 52
12 24 36 48 61 68 79 84
18 35 54 70 85 93 104 109 %% Notice this row #8
0 2 2 7 10 13 20 24
1 6 11 21 31 38 52 59
2 7 14 25 36 45 65 77
5 17 27 39 56 67 89 106
fx(4,:) + fx(8,:) %% this is the SUM of row #4 and row #8
39 71 94 131 158 178 208 222
%% and finally -- what is the difference of the piecewise
%% calculated result and the real result?
ff-fx
0 0 0 0 0 0 0 0 %% look !! the first block
0 0 0 0 0 0 0 0 %% is already correct
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
21 36 40 61 73 85 104 113 %% All these rows in this
21 36 40 61 73 85 104 113 %% block are short by
21 36 40 61 73 85 104 113 %% the row #4 above
21 36 40 61 73 85 104 113 %%
39 71 94 131 158 178 208 222 %% and all these rows
39 71 94 131 158 178 208 222 %% in this block are short
39 71 94 131 158 178 208 222 %% by the SUM of the rows
39 71 94 131 158 178 208 222 %% #4 and #8 above
Fortunately one can start integrating the block 2, i.e. rows 2N..3N-1 before the block #1 has been compensated -- one just has to calculate the offset, which is a relatively small sequential task.
acc_for_block_2 = row[2*N-1] + row[N-1];
acc_for_block_3 = acc_for_block_2 + row[3*N-1];
..
acc_for_block_T-1 = acc_for_block_(T-2) + row[N*(T-1)-1];