Removing Data-Dependence from Loop

I have the following C++ loop:
for (i = LEN_MAX - 1; i >= 0; i--) {
int j = i - LEN_MAX + len;
if (j < 0)
break;
int ind = a.getElem(j);
short t = ind;
ind = --c[ind];
b.setElem(ind, t);
}
What I would like to do is remove every dependency between iterations. In the loop above, for example, the line ind = --c[ind] has an inter-iteration dependency: the decrement needs the value left behind by an earlier iteration. Here is an example of the transformation I'm looking for:
From:
for (i = 1; i < RADIX_MAX; i++) {
if (i == radix)
break;
c[i] += c[i - 1];
c[i] += temp;
}
To:
short temp = c[0];
for (i = 1; i < RADIX_MAX; i++) {
if (i == radix)
break;
c[i] += temp; //this loop no longer depends on last iteration
temp = c[i];
}
I want to apply the same technique to the first loop I posted, but I'm not sure how. I need this in order to optimize the performance of a tool I am using. Does anyone have any ideas?

There is no simple transformation of the loop you provided that removes the inter-iteration dependency (and the second example you gave doesn't actually remove it either). Each iteration depends on what earlier iterations did to c, and there is no way around that with the current implementation. If you know something about the algorithm, or about the values stored in c or in a, you may be able to rework the code to remove the order dependency, but we can't do it from the short segment of code you provided.
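To make the parenthetical concrete, here is a minimal sketch. It assumes the "To" loop is meant as a running sum and leaves out the radix early exit; the function names are hypothetical. The scalar temp still carries a value from one iteration to the next, so the dependency has only moved from c[i - 1] into a variable. std::partial_sum from <numeric> computes the same result.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical name: the "To" loop from the question, without the radix check.
std::vector<int> prefixWithTemp(std::vector<int> c) {
    if (c.empty()) return c;
    int temp = c[0];
    for (std::size_t i = 1; i < c.size(); ++i) {
        c[i] += temp;  // reads the value produced by the previous iteration
        temp = c[i];   // hands this iteration's result to the next one
    }
    return c;
}

// Hypothetical name: the same running sum written with the standard library.
std::vector<int> prefixWithLibrary(std::vector<int> c) {
    std::partial_sum(c.begin(), c.end(), c.begin());
    return c;
}
```

Both functions turn {1, 2, 3, 4} into {1, 3, 6, 10}; each element still needs the one before it.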

Related

insertion sort : why assigning key to j+1 index?

void insertionSort(vector<int>& v){
for(int i =1; i<v.size(); ++i){
int key = v[i];
int j = i - 1;
while(j>=0 && key < v[j]){
swap(v[j], v[j+1]);
j--;
}
v[j+1] = key; // works fine without this.
}
}
In this insertion sort, I wonder why the commented line was included.
I ran several experiments with that line removed, and it seems fine to get rid of it.
Could anyone explain the purpose of the line? Any help would be much appreciated!

Since after each swap it does j--, after the final swap (which frees up v[j]), it decreases j once more. Hence you need to put the new element at v[j + 1].
By the way, swap is not necessary for this code, you might as well do v[j + 1] = v[j] instead of swap.
Edit
Regarding the question on implementation, perhaps the author was making some point that needed the swap; without knowing the context, we can't say for sure.
Since no one really uses insertion sort, I reckon the purpose of this was only theoretical, and likely to compute complexity by counting number of swaps. Hence the author may have been demonstrating the sort with swap as a building block.
Back to the question,
the implementation is correct, if you are okay with the extra writes that swap does.
(Essentially swap(a, b) is t = a; a = b; b = t;, so two additional writes.)
If you do have the swap, then the commented out line is indeed not necessary.
Without the swap you may rewrite it as -
void insertionSort(vector<int>& v){
for(int i = 1; i < v.size(); ++i){
int key = v[i];
int j = i - 1;
while(j >= 0 && key < v[j]){
v[j + 1] = v[j];
j--;
}
v[j + 1] = key; // this is now necessary.
}
}
Note that since this reduces the running time by only a constant factor, the complexity remains the same as with swap, i.e. $O(n^2)$.
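As a rough way to see the difference in writes, the sketch below instruments both variants. The function names and the counting convention are assumptions for illustration: only stores into the vector are counted, and the temporary inside swap is ignored.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: insertion sort built from adjacent swaps.
// Returns the number of stores made into the vector.
std::size_t sortWithSwaps(std::vector<int>& v) {
    std::size_t writes = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        std::size_t j = i;
        while (j > 0 && v[j] < v[j - 1]) {
            std::swap(v[j], v[j - 1]);
            writes += 2;  // a swap stores into both slots
            --j;
        }
        // no final store needed: the key has already been swapped into place
    }
    return writes;
}

// Hypothetical helper: insertion sort that shifts elements and stores the key once.
std::size_t sortWithShifts(std::vector<int>& v) {
    std::size_t writes = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        const int key = v[i];
        std::size_t j = i;
        while (j > 0 && key < v[j - 1]) {
            v[j] = v[j - 1];  // one store per shifted element
            ++writes;
            --j;
        }
        v[j] = key;  // the final store that the swap version does not need
        ++writes;
    }
    return writes;
}
```

For the input {3, 2, 1}, the swap version makes 6 stores and the shift version 5. On input that is already sorted, the shift version still performs its one final store per element, while the swap version stores nothing.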

Almost same code running much slower

I am trying to solve this problem:
Given a string array words, find the maximum value of length(word[i]) * length(word[j]) where the two words do not share common letters. You may assume that each word will contain only lower case letters. If no such two words exist, return 0.
https://leetcode.com/problems/maximum-product-of-word-lengths/
You can create a bitmap of char for each word to check if they share chars in common and then calc the max product.
I have two almost identical methods, but the first passes the checks while the second is too slow. Can you see why?
class Solution {
public:
int maxProduct2(vector<string>& words) {
int len = words.size();
int *num = new int[len];
// compute the bit O(n)
for (int i = 0; i < len; i ++) {
int k = 0;
for (int j = 0; j < words[i].length(); j ++) {
k = k | (1 <<(char)(words[i].at(j)));
}
num[i] = k;
}
int c = 0;
// O(n^2)
for (int i = 0; i < len - 1; i ++) {
for (int j = i + 1; j < len; j ++) {
if ((num[i] & num[j]) == 0) { // if no common letters
int x = words[i].length() * words[j].length();
if (x > c) {
c = x;
}
}
}
}
delete []num;
return c;
}
int maxProduct(vector<string>& words) {
vector<int> bitmap(words.size());
for(int i=0;i<words.size();++i) {
int k = 0;
for(int j=0;j<words[i].length();++j) {
k |= 1 << (char)(words[i][j]);
}
bitmap[i] = k;
}
int maxProd = 0;
for(int i=0;i<words.size()-1;++i) {
for(int j=i+1;j<words.size();++j) {
if ( !(bitmap[i] & bitmap[j])) {
int x = words[i].length() * words[j].length();
if ( x > maxProd )
maxProd = x;
}
}
}
return maxProd;
}
};
Why is the second function (maxProduct) too slow for LeetCode?
Solution
The second method calls words.size() repeatedly. If you save that value in a variable, it works fine.

Since my comment turned out to be correct I'll turn my comment into an answer and try to explain what I think is happening.
I wrote some simple code to benchmark on my own machine with two solutions of two loops each. The only difference is the call to words.size() is inside the loop versus outside the loop. The first solution is approximately 13.87 seconds versus 16.65 seconds for the second solution. This isn't huge, but it's about 20% slower.
Even though vector.size() is a constant time operation that doesn't mean it's as fast as just checking against a variable that's already in a register. Constant time can still have large variances. When inside nested loops that adds up.
The other thing that could be happening (someone much smarter than me will probably chime in and let us know) is that you're hurting CPU optimizations such as branch prediction and pipelining. Every time it gets to the end of the loop it has to stop, wait for the call to size() to return, and then check the loop variable against that return value. If the CPU can look ahead and guess that j is still going to be less than len because it hasn't seen len change (len isn't even inside the loop!), it can make a good branch prediction each time and not have to wait.
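A minimal sketch of the hoisted form discussed above, under stated assumptions: the name maxProductHoisted is hypothetical, the words contain only lowercase ASCII letters (as the problem statement says), and the bit index is taken as ch - 'a' so that the shift stays below the width of the mask. Whether a particular compiler would hoist words.size() by itself is not something this sketch settles; it simply reads the sizes once.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical name; assumes every word contains only 'a'..'z'.
int maxProductHoisted(const std::vector<std::string>& words) {
    const std::size_t n = words.size();  // read once, outside all loops
    std::vector<unsigned> mask(n, 0u);
    std::vector<std::size_t> len(n, 0);

    for (std::size_t i = 0; i < n; ++i) {
        len[i] = words[i].length();      // cache the length as well
        for (char ch : words[i]) {
            mask[i] |= 1u << (ch - 'a'); // bits 0..25, one per letter
        }
    }

    std::size_t best = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            if ((mask[i] & mask[j]) == 0) {  // no common letters
                const std::size_t product = len[i] * len[j];
                if (product > best) best = product;
            }
        }
    }
    return static_cast<int>(best);
}
```

Starting the inner index at i + 1 and testing j < n also avoids the words.size() - 1 expression, which wraps around for an empty input because size() is unsigned.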

Why cant you have one loop for bubble sort?

I was just in an argument with my instructor about bubble sort. He told me that bubble sort is known as two for loops, one nested in the other. That was not stated before I started the assignment, so that's fine, but what is wrong with this code for a bubble sort?
int num = 0, i = 0;
bool go = true;
while (i < size - 1){
if (array[i] > array[i + 1]){
num = array[i];
array[i] = array[i + 1];
array[i + 1] = num;
go = false;
}
i++;
if (i >= size - 1 && go == false){
i = 0;
go = true;
}
}
for (int i = 0; i < size; i++){
cout << array[i];
}
does it not do the same thing as a bubble sort?
int i, j;
bool flag = true;
int temp;
int numLength = size;
for (i = 1; (i <= numLength) && flag; i++)
{
flag = false;
for (j = 0; j < (numLength - 1); j++)
{
if (array[j + 1] < array[j])
{
temp = array[j];
array[j] = array[j + 1];
array[j + 1] = temp;
flag = true;
}
}
}
for (int i = 0; i < size; i++){
cout << array[i];
}
return;
Thanks!

The bubble sort algorithm needs two loops: the inner one iterating through the items and swapping them if adjacent ones are out of order, the outer one repeating until no more changes are made.
Your implementation does effectively have two loops. It's just that one of them is implemented using a flag and if condition, which resets the outer loop variable. It will do the same thing - loop through the items until no more need swapping.
Note, however, that constructing the algorithm in this way does not make it more efficient, or faster, or anything like that. It just makes it harder to figure out what is going on.
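The equivalence can be checked with a small sketch. The function names and the pass counter are assumptions added for illustration, and the flag has been renamed swapped; otherwise the first function follows the single-loop form from the question and the second the usual nested form.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Single syntactic loop; resetting i plays the role of the outer loop.
// Returns the number of full passes over the array.
std::size_t bubbleSingleLoop(std::vector<int>& a) {
    std::size_t passes = 0;
    if (a.size() < 2) return passes;
    bool swapped = false;
    std::size_t i = 0;
    while (i + 1 < a.size()) {
        if (a[i] > a[i + 1]) {
            std::swap(a[i], a[i + 1]);
            swapped = true;
        }
        ++i;
        if (i + 1 == a.size()) {  // reached the end of one pass
            ++passes;
            if (swapped) {        // something moved: start another pass
                i = 0;
                swapped = false;
            }
        }
    }
    return passes;
}

// The same algorithm with the outer loop written out explicitly.
std::size_t bubbleNested(std::vector<int>& a) {
    std::size_t passes = 0;
    bool swapped = a.size() >= 2;
    while (swapped) {
        swapped = false;
        for (std::size_t j = 0; j + 1 < a.size(); ++j) {
            if (a[j] > a[j + 1]) {
                std::swap(a[j], a[j + 1]);
                swapped = true;
            }
        }
        ++passes;
    }
    return passes;
}
```

On {3, 2, 1} both versions sort the array and both report 3 passes, which is the sense in which the single-loop form still contains two loops.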

You need nested loops because one pass through the array will not always sort all the elements.
Your code only simulates nested loops by resetting i when we've reached the end and still have things left to sort. Theoretically speaking, your code will have the same runtime as a nested bubble sort if given the same input array.
As to the question of whether you can; of course you can. But it's important to realize there is no benefit, in practice or in theory, to choosing one form over the other, at least as far as I can tell.
Also, when computing the time complexity of both algorithms, you will come to the conclusion that your algorithm, just like the form with nested loops, will need to perform the operation at most n times; the operation being a pass through the array, which is on the order of n. You will have to convince yourself that this is the case with your algorithm.
So no matter how you slice it (array pun intended, I guess?), bubble sort will have complexity O(n^2).

Syntactically, you can have a one-loop bubble sort. Conceptually, you'll still have two loops.
does it not do the same thing as a bubble sort?
Yes, it does the same thing. Up to and including having a nested loop.
Your argument seems to be that if you only have one for loop in your code, you've created a one-loop bubble sort. It's a ruse though; there are still two loops happening algorithmically. And that's really the only thing that matters.

Your code implements bubble sort. I can't see any advantage in your instructor's code. The only argument for two loops is that changing the loop variable inside the loop is bad style and should be avoided.
I would also change
if (i >= size - 1 && go == false)
to
if (i == size - 1 && go == false)
since the first version suggests that i can exceed size - 1, which it can't.

I want to optimize this short loop

I would like to optimize this simple loop:
unsigned int i;
while(j-- != 0){ //j is an unsigned int with a start value of about N = 36.000.000
float sub = 0;
i=1;
unsigned int c = j+s[1];
while(c < N) {
sub += d[i][j]*x[c];//d[][] and x[] are arrays of float
i++;
c = j+s[i];// s[] is an array of unsigned int with 6 entries.
}
x[j] -= sub; // only one memory-write per j
}
The loop has an execution time of about one second with a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop is recursive.
Any suggestions?

I think you may want to transpose the matrix d, that is, store it with the indices exchanged so that i becomes the last (contiguous) index:
sub += d[j][i]*x[c];
instead of
sub += d[i][j]*x[c];
This should result in better cache performance.
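A minimal sketch of what such a layout could look like, assuming d can be copied into one flat buffer and that every diagonal has the same length. The names toJMajor and elementAt are hypothetical. With element (i, j) stored at j * M + i, all the values used by the inner loop for one j sit next to each other.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy d[i][j] (M rows of length N) into a flat buffer
// in which the entries for one j are contiguous.
std::vector<float> toJMajor(const std::vector<std::vector<float>>& d) {
    const std::size_t M = d.size();
    const std::size_t N = (M != 0) ? d[0].size() : 0;  // assumes equal row lengths
    std::vector<float> flat(M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            flat[j * M + i] = d[i][j];
        }
    }
    return flat;
}

// Hypothetical accessor: element (i, j) of the transposed, flattened storage.
inline float elementAt(const std::vector<float>& flat, std::size_t M,
                       std::size_t i, std::size_t j) {
    return flat[j * M + i];
}
```

The copy itself walks d with the same strided pattern as the original loop, so it only pays off if the transposed data is reused many times or built in that layout from the start.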

I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...
Original function, for reference (with some tidying for my sanity):
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
//We want to solve L D Lt x = b where D is a diagonal matrix described by Diagonals[0] and L is a unit lower triangular matrix described by the rest of the diagonals.
//Let D Lt x = y. Then, first solve L y = b.
float *y = new float[n];
float **d = IncompleteCholeskyFactorization->Diagonals;
unsigned int *s = IncompleteCholeskyFactorization->StartRows;
unsigned int M = IncompleteCholeskyFactorization->m;
unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i, j;
for(j = 0; j != N; j++){
float sub = 0;
for(i = 1; i != M; i++){
int c = (int)j - (int)s[i];
if(c < 0) break;
if(c==j) {
sub += d[i][c]*b[c];
} else {
sub += d[i][c]*y[c];
}
}
y[j] = b[j] - sub;
}
//Now, solve x from D Lt x = y -> Lt x = D^-1 y
// Took this one out of the while, so it can be parallelized now, which speeds up, because division is expensive
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] = y[j]/d[0][j];
}
while(j-- != 0){
float sub = 0;
for(i = 1; i != M; i++){
if(j + s[i] >= N) break;
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
delete[] y;
}
Because of the comment about parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99. You might want to use a define for it. Maybe the library already has one).
Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it. b is never used when x or y have been set. That would also mean you can get rid of the branch in the first loop which runs ~N*M times. Use memcpy at the start if you must have 2 parameters.
And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).
So, new code:
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
// comments removed so that suggestions are more visible. Don't remove them in the real code!
// these definitions got long. Feel free to remove const; it does nothing for the optimiser
const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
const unsigned int M = IncompleteCholeskyFactorization->m;
const unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i;
unsigned int j;
for(j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
float sub = 0;
for(i = 1; i < M && j >= s[i]; i++){
const unsigned int c = j - s[i];
sub += d[i][c]*x[c];
}
x[j] -= sub;
}
// Consider using processor-specific optimisations for this
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] /= d[0][j];
}
for( j = N; (j --) > 0; ){ // changed for clarity
float sub = 0;
for(i = 1; i < M && j + s[i] < N; i++){
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
}
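Well, it's looking tidier, and the lack of memory allocation and the reduced branching are, if nothing else, a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both of the i<M checks, which again run ~N*M times). Note that in the backward loop the remaining test would then have to be written as s[i] < N - j, because j + UINT_MAX wraps around in unsigned arithmetic.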
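A hedged sketch of the sentinel idea for the backward loop only. Everything about the interface is an assumption for illustration: std::vector stands in for the raw pointers, withSentinel and backSubstitute are hypothetical names, and s is taken to be sorted in increasing order from s[1] onwards, as the early exit in the original loop implies.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical helper: append the UINT_MAX sentinel to the start-row offsets.
std::vector<unsigned> withSentinel(std::vector<unsigned> s) {
    s.push_back(UINT_MAX);
    return s;
}

// Backward substitution without the i < M test.
// d[i][j] is diagonal i at row j; s[0] is unused; s ends with the sentinel.
void backSubstitute(std::vector<float>& x,
                    const std::vector<std::vector<float>>& d,
                    const std::vector<unsigned>& s) {
    assert(!s.empty() && s.back() == UINT_MAX);
    const unsigned N = static_cast<unsigned>(x.size());
    for (unsigned j = N; j-- > 0; ) {
        float sub = 0.0f;
        // N - j is at least 1 and never wraps, so the sentinel always stops the loop.
        for (std::size_t i = 1; s[i] < N - j; ++i) {
            sub += d[i][j] * x[j + s[i]];
        }
        x[j] -= sub;
    }
}
```

With one off-diagonal at offset 1 holding 0.5 everywhere and x starting as {1, 1, 1}, the loop leaves x as {0.75, 0.5, 1}.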
Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.
If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).

I want to answer my own question: the bad performance was caused by cache conflict misses, because (at least) Win7 aligns big memory blocks to the same boundary. In my case the addresses of all buffers had the same alignment (buffer address % 4096 was the same for all buffers), so they fell into the same cache set of the L1 cache. I changed the memory allocation to align the buffers to different boundaries to avoid cache conflict misses and got a speedup of a factor of 2. Thanks for all the answers, especially the answers from Dave!

c++ counting sort

I tried to write a counting sort, but there's a problem with it.
Here's the code:
int *countSort(int* start, int* end, int maxvalue)
{
int *B = new int[(int)(end-start)];
int *C = new int[maxvalue];
for (int i = 0; i < maxvalue; i++)
{
*(C+i) = 0;
}
for (int *i = start; i < end; i++)
{
*(C+*i) += 1;
}
for (int i = 1; i < maxvalue-1 ; i++)
{
*(C+i) += *(C+i-1);
}
for (int *i = end-1; i > start-1; i--)
{
*(B+*(C+(*i))) = *i;
*(C+(*i)) -= 1;
}
return B;
}
In the last loop it throws an exception: "Access violation writing at location: -some ram address-"
Where did I go wrong?

for (int i = 1; i < maxvalue-1 ; i++)
That's the wrong upper bound. You want i to run from 1 up to, but not including, maxvalue, i.e. i < maxvalue.
for (int *i = end-1; i > start-1; i--)
{
*(B+*(C+(*i))) = *i;
*(C+(*i)) -= 1;
}
This loop is also completely incorrect. I don't know what it does, but a brief mental test shows that the first iteration sets the element of B at the index of the value of the last element in the array to the number of times it shows. I guarantee that that is not correct. The last loop should be something like:
// Note: C must hold the plain counts here, not the cumulative sums
int* out = B;
for (int i = 0; i < maxvalue; i++) { // for each value
for (int j = 0; j < C[i]; j++) { // for the number of times it's in the source
*out = i; // add it to the output
++out; // in the next open slot
}
}
As a final note, why are you playing with pointers like that?
*(B + i) //is the same as
B[i] //and people will hate you less
*(B+*(C+(*i))) //is the same as
B[C[*i]]
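For comparison, here is a hedged sketch of the cumulative-count variant that the question's last loop appears to be aiming for. Assumptions: every value lies in [0, maxvalue), and the name countSortStable and the std::vector interface are illustrative only. The difference from the question's loop is that the count is decremented before it is used as an index. After the cumulative pass, the count for a value is the number of elements less than or equal to it, so the last free slot for that value is one below the count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical name; values must lie in [0, maxvalue).
std::vector<int> countSortStable(const std::vector<int>& in, int maxvalue) {
    std::vector<std::size_t> count(static_cast<std::size_t>(maxvalue), 0);
    for (int v : in) {
        assert(v >= 0 && v < maxvalue);
        ++count[static_cast<std::size_t>(v)];
    }
    // Cumulative counts over the whole table: count[k] = number of elements <= k.
    for (std::size_t k = 1; k < count.size(); ++k) {
        count[k] += count[k - 1];
    }
    std::vector<int> out(in.size());
    // Walk the input backwards so that equal values keep their relative order.
    for (std::size_t i = in.size(); i-- > 0; ) {
        const std::size_t v = static_cast<std::size_t>(in[i]);
        out[--count[v]] = in[i];  // decrement first, then write
    }
    return out;
}
```

Called with {3, 1, 2, 1, 0} and maxvalue 4, this returns {0, 1, 1, 2, 3}.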

Since you're using C++ anyway, why not simplify the code (dramatically) by using std::vector instead of dynamically allocated arrays (and leaking one in the process)?
std::vector<int> countSort(int* start, int* end, int maxvalue)
{
std::vector<int> B(end-start);
std::vector<int> C(maxvalue);
for (int *i = start; i < end; i++)
++C[*i];
// etc.
Other than that, the logic you're using doesn't make sense to me. I think to get a working result, you're probably best off sitting down with a sheet of paper and working out the steps you need to use. I've left the counting part in place above, because I believe that much is correct. I don't think the rest really is. I'll even give a rather simple hint: once you've done the counting, you can generate B (your result) based only on what you have in C -- you do not need to refer back to the original array at all. The easiest way to do it will normally use a nested loop. Also note that it's probably easier to reserve the space in B and use push_back to put the data in it, rather than setting its initial size.
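One way the hint could be worked out is sketched below, assuming that all input values lie in [0, maxvalue). The name countSortExpand is hypothetical and this is only one possible reading of the hint. The output is generated from the counts alone, using reserve and push_back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical name; values in [start, end) must lie in [0, maxvalue).
std::vector<int> countSortExpand(const int* start, const int* end, int maxvalue) {
    std::vector<int> counts(static_cast<std::size_t>(maxvalue), 0);
    for (const int* p = start; p != end; ++p) {
        assert(*p >= 0 && *p < maxvalue);
        ++counts[static_cast<std::size_t>(*p)];
    }

    std::vector<int> sorted;
    sorted.reserve(static_cast<std::size_t>(end - start));
    for (int value = 0; value < maxvalue; ++value) {  // for each possible value
        for (int k = 0; k < counts[static_cast<std::size_t>(value)]; ++k) {
            sorted.push_back(value);                  // emit it once per occurrence
        }
    }
    return sorted;
}
```

Because both vectors clean up after themselves, nothing is leaked, and the original array is never read again after the counting pass.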
