C++ loop unfolding, bounds

I have a loop that I want to unfold:
for(int i = 0; i < N; i++)
    do_stuff_for(i);
Unfolded:
for(int i = 0; i < N; i += CHUNK) {
    do_stuff_for(i + 0);
    do_stuff_for(i + 1);
    ...
    do_stuff_for(i + CHUNK-1);
}
But I should make sure that I do not run past the original N, for example when N == 14 and CHUNK == 10. My question is: what is the best/fastest/standard/most elegant (you name it) way to do it?
One solution that comes to mind is:
int i;
for(i = 0; i < (N % CHUNK); i++)
    do_stuff_for(i);
for(; i < N; i += CHUNK) {
    // unfolded, for the rest
}
But maybe there is a better practice.

You could use a switch-case. It's called Duff's Device.

Related

How can I circle back to the starting point using a for loop?

I have a std::vector<std::unique_ptr<object>> myObjects_ptrs. Starting at one of my objects, I need to circle back around to where I started.
I am doing this as follows:
while(true)
{
    for(int i = 0; i < myObjects_ptrs.size(); ++i)
    {
        myObjects_ptrs[i]->doSomething();
        //and here I need to circle back
        for(int j = i + 1; j < myObjects_ptrs.size(); ++j)
        {
            //do some things with each other object
        }
        for(int j = 0; j < i; ++j)
        {
            //do the same things with the rest of the objects
        }
    }
}
Is this the standard way of doing that? My problem is that once I detect something, I don't need to keep going around. For example, if I find something during the first loop, there is no need to go through the second loop. I can solve this by adding an extra if before the second loop, but is there a better way?
You could use a modulus, i.e. the two inner loops would become:
int numObjects = myObjects_ptrs.size();
for (int j = i + 1; j < numObjects + i + 1; ++j)
{
    // Get object
    auto& obj = myObjects_ptrs[j % numObjects];
}
You could replace the two inner loops with something like this:
for(int j = i + 1;; j++)
{
    j %= myObjects_ptrs.size();
    if (j == i)
    {
        break;
    }
    // Do stuff
}

Counting Sort in C++

I am trying to implement counting sort in C++ without creating a function. This is the code I've written so far, but the program doesn't output any values. It doesn't give me any errors either. What is wrong?
#include <iostream>
using namespace std;
int main()
{
    int A[100], B[100], C[100], i, j, k = 0, n;
    cin >> n;
    for (i = 0; i < n; ++i)
    {
        cin >> A[i];
    }
    for (i = 0; i < n; ++i)
    {
        if (A[i] > k)
        {
            k = A[i];
        }
    }
    for (i = 0; i < k + 1; ++i)
    {
        C[i] = 0;
    }
    for (j = 0; j < n; ++j)
    {
        C[A[j]]++;
    }
    for (i = 0; i < k; ++i)
    {
        C[i] += C[i - 1];
    }
    for (j = n; j > 0; --j)
    {
        B[C[A[j]]] = A[j];
        C[A[j]] -= 1;
    }
    for (i = 0; i < n; ++i)
    {
        cout << B[i] << " ";
    }
    return 0;
}
It looks like you're on the right track. You take input into A, find the largest value you'll be dealing with, and then zero out that many values in your C array. But that's where things start to go wrong. You then do:
for (i = 0; i < k; ++i)
{
    C[i] += C[i - 1];
}
for (j = n; j > 0; --j)
{
    B[C[A[j]]] = A[j];
    C[A[j]] -= 1;
}
That first loop will always go out of bounds on the first iteration (reading C[i - 1] when i == 0 is undefined behavior), but even if it didn't, I'm not sure what you have in mind here, or in the loop after that, for that matter.
Instead, if I were you, I'd create an indx variable to keep track of which index I'm next going to insert a number at (i.e., how many numbers I've inserted so far), then loop over C, and for each count C[x], insert the value x that many times. My explanation may sound a little wordy, but it looks like this:
int indx = 0;
for(int x = 0; x <= k; x++) {
    for(int y = 0; y < C[x]; y++) {
        B[indx++] = x;
    }
}
If you replace the two loops above with this one, then everything should work as expected.
See a live example here: ideone

Stack Smashing Error While Working with CStrings

I am working on a small project and I am absolutely stuck. The purpose of the function I'm working on is to rearrange and change a C string based on a few preset rules. My issue lies within the second portion of the swapping algorithm I came up with.
for(int i = 0; i < len; i++)
{
    if(sentence[i] == SPACE)
    {
        space++;
        spacePlace[counter] = i;
        counter++;
    }
}
for(int i = 0; i < space; i++)
{
    if(i == 0)
    {
        count2 = 0;
        for(int h = 0; h < 20; h++)
        {
            temp1[h] = NUL;
            temp2[h] = NUL;
        }
        for(int j = 0; j < spacePlace[0]; j++)
            temp1[j] = sentence[j];
        for(int m = spacePlace[0]; m < spacePlace[1]; m++)
        {
            temp2[count2] = sentence[m];
            count2++;
        }
...
The first for loop executes perfectly and the output is great, but the second for loop always messes up and ends up giving me a stack smashing error. For reference: sentence is a C string passed to the function, and temp1 and temp2 are also C strings. Any help or pointers in the right direction would be a godsend. Thanks!

Is it possible to parallelize this for loop?

I was given some code to parallelize using OpenMP and, among the various function calls, I noticed this for loop takes up a good share of the computation time.
double U[n][n];
double L[n][n];
double Aprime[n][n];
for(i=0; i<n; i++) {
    for(j=0; j<n; j++) {
        if (j <= i) {
            double s;
            s=0;
            for(k=0; k<j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        } else if (j >= i) {
            double s;
            s=0;
            for(k=0; k<i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}
However, after trying to parallelize it and applying some semaphores here and there (with no luck), I came to the realization that the else if branch has a strong dependency on the earlier if (L[j][i] is computed using U[i][i], which may be set in the earlier if), making it, in my opinion, non-parallelizable due to race conditions.
Is it possible to parallelize this code in such a manner to make the else if only be executed if the earlier if has already completed?
Before trying to parallelize things, try simplification first.
For example, the if can be completely eliminated.
Also, the code is accessing the matrixes in a way that causes worst cache performance. That may be the real bottleneck.
Note: In update #3 below, I did benchmarks and the cache friendly version fix5, from update #2, outperforms the original by 3.9x.
I've cleaned things up in stages, so you can see the code transformations.
With this, it should be possible to add omp directives successfully. As I mentioned in my top comment, the global vs. function scope of the variables affects the type of update that may be required (e.g. omp atomic update, etc.)
For reference, here is your original code:
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        if (j <= i) {
            double s;
            s = 0;
            for (k = 0; k < j; k++) {
                s += L[j][k] * U[k][i];
            }
            U[j][i] = Aprime[j][i] - s;
        }
        else if (j >= i) {
            double s;
            s = 0;
            for (k = 0; k < i; k++) {
                s += L[j][k] * U[k][i];
            }
            L[j][i] = (Aprime[j][i] - s) / U[i][i];
        }
    }
}
The else if (j >= i) was unnecessary and could be replaced with just else. But, we can split the j loop into two loops so that neither needs an if/else:
// fix2.c -- split up j's loop to eliminate if/else inside
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / U[i][i];
    }
}
U[i][i] is invariant in the second j loop, so we can presave it:
// fix3.c -- save off value of U[i][i]
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[j][k] * U[k][i];
        U[j][i] = Aprime[j][i] - s;
    }
    double Uii = U[i][i];
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[j][k] * U[k][i];
        L[j][i] = (Aprime[j][i] - s) / Uii;
    }
}
The access to the matrixes is done in probably the worst way for cache performance. So, if the assignment of dimensions can be flipped, a substantial savings in memory access can be achieved:
// fix4.c -- transpose matrix coordinates to get _much_ better memory/cache
// performance
double U[n][n];
double L[n][n];
double Aprime[n][n];
for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++) {
        double s = 0;
        for (k = 0; k < j; k++)
            s += L[k][j] * U[i][k];
        U[i][j] = Aprime[i][j] - s;
    }
    double Uii = U[i][i];
    for (; j < n; j++) {
        double s = 0;
        for (k = 0; k < i; k++)
            s += L[k][j] * U[i][k];
        L[i][j] = (Aprime[i][j] - s) / Uii;
    }
}
UPDATE:
In the OP's first k-loop it's k<j and in the 2nd k<i, don't you have to fix that?
Yes, I've fixed it. It was too ugly a change for fix1.c, so I removed that and applied the changes to fix2-fix4 where it was easy to do.
UPDATE #2:
These variables are all local to the function.
If you mean they are function-scoped [without static], this means the matrixes can't be too large because, unless the code increases the stack size, they're limited by the stack size limit (e.g. 8 MB).
Although the matrixes appeared to be VLAs [because n was lowercase], I ignored that. You may want to try a test case using fixed-dimension arrays, as I believe they may be faster.
Also, if the matrixes are function scope, and want to parallelize things, you'd probably need to do (e.g.) #pragma omp shared(Aprime) shared(U) shared(L).
The biggest drag on cache were the loops to calculate s. In fix4, I was able to make access to U cache friendly, but L access was poor.
I'd need to post a whole lot more if I did include the external context
I guessed as much, so I did the matrix dimension swap speculatively, not knowing how much other code would need changing.
I've created a new version that changes the dimensions on L back to the original way, but keeping the swapped versions on the other ones. This provides the best cache performance for all matrixes. That is, the inner loop for most matrix access is such that each iteration is incrementing along the cache lines.
In fact, give it a try. It may improve things to the point where parallel isn't needed. I suspect the code is memory bound anyway, so parallel might not help as much.
// fix5.c -- further transpose to fix poor performance on s calc loops
//
// flip the L dimensions back to original
double U[n][n];
double L[n][n];
double Aprime[n][n];
double *Up;
double *Lp;
double *Ap;
for (i = 0; i < n; i++) {
    Ap = Aprime[i];
    Up = U[i];
    for (j = 0; j <= i; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < j; k++)
            s += Lp[k] * Up[k];
        Up[j] = Ap[j] - s;
    }
    double Uii = Up[i];
    for (; j < n; j++) {
        double s = 0;
        Lp = L[j];
        for (k = 0; k < i; k++)
            s += Lp[k] * Up[k];
        Lp[i] = (Ap[j] - s) / Uii;
    }
}
Even if you really need the original dimensions, depending upon the other code, you might be able to transpose going in and transpose back going out. This would keep things the same for other code, but, if this code is truly a bottleneck, the extra transpose operations might be small enough to merit this.
UPDATE #3:
I've run benchmarks on all the versions. Here are the elapsed times and ratios relative to original for n equal to 1037:
orig: 1.780916929 1.000x
fix1: 3.730602026 0.477x
fix2: 1.743769884 1.021x
fix3: 1.765769482 1.009x
fix4: 1.762100697 1.011x
fix5: 0.452481270 3.936x
Higher ratios are better.
Anyway, this is the limit of what I can do. So, good luck ...

Deadlock on parallel loop

I'm trying to parallelize the code below. It's easy to see that there is a dependency between the values of aux, since they are computed after the inner loop, but they are needed inside that inner loop (note that on the first iteration j = 0, the code inside the inner loop is not executed). On the other hand, there is no dependency between the values of mu because we only update mu[k], but the only values needed for other computations are in mu[j], for 0 <= j < k.
My approach consists in having the elements of aux locked until they are computed. As soon as a given value of aux is computed, the lock of that element is released and every thread can use it. However, with this code a deadlock occurs and I can't figure out why. Does someone have any tips?
Thanks
for (j = 0; j < k; ++j)
    locks[j] = 0;

#pragma omp parallel for num_threads(N_THREADS) private(j, i)
for (j = 0; j < k; ++j)
{
    vals[j] = (long)0;
    for (i = 0; i < j; i++)
    {
        while(!locks[i]);
        vals[j] += mu[j][i] * aux[i];
    }
    aux[j] = (s[j] - vals[j]);
    locks[j] = 1;
    mu[k][j] = aux[j] / c[j];
}
Does it also hang when not optimized?
In optimized code, gcc would not bother reading locks[i] more than once, so this:
for (i = 0; i < j; i++) {
    while(!locks[i]);
would be like writing:
for (i = 0; i < j; i++) {
    if( !locks[i] ) for(;;) {}
Try adding a barrier to force gcc to re-read locks[i]:
#define pause() do { asm volatile("pause;":::"memory"); } while(0)
...
for (i = 0; i < j; i++) {
    while(!locks[i]) pause();
HTH