Define variable inside if block for function scope - C++

Background: I am developing several different controllers (over 10 or so) for a piece of hardware; the code runs in hard real-time under RTAI Linux. I have implemented a class for the hardware, with each controller as a separate member function of the class. I want to pass the desired trajectory for the respective control variable to each of these control functions, based on which controller is chosen. In addition, since each controller has several parameters and I want to switch controllers quickly without navigating through the entire code to change parameters, I would like to define all the control variables in one place and initialize them according to which controller I choose to run. Here is a minimal working example of what I am looking for.
I am looking to define variables based on if a condition is true or not as follows in C++:
int foo()
{
int i=0;
if(i==0)
{
int a=0;
float b=1;
double c=10;
}
else if(i==1)
{
int e=0;
float f=1;
double g=10;
}
// Memory locked for hard real-time execution
// execute in hard real-time from here
while(some condition)
{
// 100's of lines of code
if(i==0)
{
a=a+1;
b=b*2;
c=c*4;
// 100's of lines of code
}
else if(i==1)
{
e=e*e*e;
f=f*3;
g=g*10;
// 100's of lines of code
}
// 100's of lines of code
}
// stop execution in hard real-time
}
The above code fails to compile, because the scope of the variables defined in the if blocks is limited to the respective block. Could anyone suggest a better way of handling this issue? What is the best practice in this context in C++?

In your case, you may simply use:
void foo()
{
int i = 0;
if (i == 0) {
int a = 0;
float b = 1;
double c = 10;
for(int j = 1; j < 10; j++) {
a = a + 1;
b = b * 2;
c = c * 4;
}
} else if (i == 1) {
int e = 0;
float f = 1;
double g = 10;
for(int j = 1; j < 10; j++) {
e = e * e * e;
f = f * 3;
g = g * 10;
}
}
}
Or, even better, create sub-functions:
void foo0()
{
int a = 0;
float b = 1;
double c = 10;
for(int j = 1; j < 10; j++) {
a = a + 1;
b = b * 2;
c = c * 4;
}
}
void foo1()
{
//.. stuff with e, f, g
}
void foo()
{
int i = 0;
if (i == 0) {
foo0();
} else if (i == 1) {
foo1();
}
}

Related

Static Atomic Global variable has different values between threads

I have an issue where I have two threads in two different C++ classes, where one reads an atomic bool and one changes it. I have defined it in a separate header file, shown here, which is included in both class files:
static std::atomic<bool> readyToRead;
One class launches a thread which changes the value of readyToRead, and this function is passed to that thread:
void WhiteBoard::writePoint(int x, int y) {
QMutex mutex1;
bool binX[16];
bool binY[16];
for(int i = 0; i <= 15; i++){
binX[i]=x%2;
x/=2;
}
for(int i = 0; i <= 15; i++){
binY[i]=y%2;
y/=2;
}
if(readyToWrite){
mutex1.lock();
memcpy(GPIOX, binX, sizeof(binX));
memcpy(GPIOY, binY, sizeof(binY));
readyToRead = 1;
mutex1.unlock();
}
qDebug() << "T: " << readyToRead;
}
The other class also has a thread which runs this function:
void DisplayBoard::readPoint() {
valChanged = 0;
QMutex mutex2;
int x = 0;
int y = 0;
bool binX[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
bool binY[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
while(1){
qDebug() << "R: " << readyToRead;
if(readyToRead){
mutex2.lock();
memcpy(binX, GPIOX, sizeof(binX));
memcpy(binX, GPIOY, sizeof(binY));
qDebug() << GPIOX[0];
mutex2.unlock();
valChanged = 1;
}
for(int i = 0; i <= 15; i++){
int b = 1;
x = x + binX[i] * b;
b *= 2;
}
for(int j = 0; j <= 15; j++){
int b = 1;
y = y + binY[j] * b;
b *= 2;
}
point.setX(x);
point.setY(y);
}
}
What I want is for the code in the if statement in the second thread to run when the first thread changes the value of readyToRead. readyToWrite, GPIOX and GPIOY are all global variables; readyToWrite is also atomic and is set to 1 before the thread launches. However, even when I can see that the value of readyToRead has changed to true in the first thread, it remains false in the second.
Thank you in advance for any help,
I apologise for any mistakes in the question. I am very new to this platform and programming in general.

How to nest for loops in CUDA?

I would like to ask for a complete example of CUDA code, one that includes everything someone may want to include, so that it may be referenced by people trying to write such code, such as myself.
My main concern is whether it is possible to process multiple for loops at the same time on different threads in the same block. To give a clear example, that is the difference between running a total of 2016 threads divided into blocks of 32 on case 3 in the example code, and running 1024 threads on each for loop; theoretically, with the code we have, we could run even fewer, saving another 2 blocks, by running the for loops of other cases under the same block. Otherwise, separate cases would primarily be used for processing separate tasks such as a for loop. Currently it appears that the CUDA code simply knows when to run in parallel.
// note: rarely referenced, you can process if statements in parallel seemingly by block, I'd say that is the primary purpose of using more blocks instead of increasing thread count per block during call, other than the need of multiple SMs (Streaming Multiprocessors), capped at 2048 threads (also the cap for a block)//
If we have the following code including for loops and if statements then what would the code that optimizes parallelization be?
public void main(string[] args) {
doMath(3); // we want to process each statement in parallel. For this we use different blocks.
}
void doMath(int question) {
int[] x = new int{0,1,2,3,4,5,6,7,8,9};
int[] y = new int{0,1,2,3,4,5,6,7,8,10};
int[] z = new int{0,1,2,3,4,5,6,7,8,11};
int[] w = new int{0,1,2,3,4,5,6,7,8,12};
int[] q = new int[1000];
int[] r = new int[1000];
int[] v = new int[1000];
int[] t = new int[1000];
switch(question) {
case 1:
for (int a = 0; a < x.length; a++) {
for (int b = 0; b < y.length; b++) {
for (int c = 0; c < z.length; c++) {
q[(a*100)+(b*10)+(c)] = x[a] + y[b] + z[c];
}
}
}
break;
case 2:
for (int a = 0; a < x.length; a++) {
for (int b = 0; b < y.length; b++) {
for (int c = 0; c < w.length; c++) {
r[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
}
}
}
break;
case 3:
for (int a = 0; a < x.length; a++) {
for (int b = 0; b < z.length; b++) {
for (int c = 0; c < w.length; c++) {
v[(a*100)+(b*10)+(c)] = x[a] + z[b] + w[c];
}
}
}
for (int a = 0; a < x.length; a++) {
for (int b = 0; b < y.length; b++) {
for (int c = 0; c < w.length; c++) {
t[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
}
}
}
break;
}
}
From the samples I have seen the CUDA code would be as follows:
// 3 blocks for 3 switch cases the third case requires 2000 threads to be done in perfect parallel while the first two only require 1000. blocks operate by multiples of 32 (threads). the trick is to take the greatest common denominator of all cases, or if/else statements as the... case... may be, and appropriate the number of blocks required to each case. (in this example we would need 127 blocks of 32 threads (1024 * 2 + 2048 - 32)//
//side note: each Streaming Multiprocessor or SM can only support 2048 threads and 2048 / (# of blocks * # of threads/block)//
public void main(string[] args) {
int *x, *y, *z, *w, *q, *r, *t;
int[] x = new int{0,1,2,3,4,5,6,7,8,9};
int[] y = new int{0,1,2,3,4,5,6,7,8,10};
int[] z = new int{0,1,2,3,4,5,6,7,8,11};
int[] w = new int{0,1,2,3,4,5,6,7,8,12};
int[] q = new int[1000];
int[] r = new int[1000];
int[] t = new int[1000];
cudaMallocManaged(&x, x.length*sizeof(int));
cudaMallocManaged(&y, y.length*sizeof(int));
cudaMallocManaged(&z, z.length*sizeof(int));
cudaMallocManaged(&w, w.length*sizeof(int));
cudaMallocManaged(&q, q.length*sizeof(int));
cudaMallocManaged(&r, r.length*sizeof(int));
cudaMallocManaged(&t, t.length*sizeof(int));
doMath<<<127,32>>>(x, y, z, w, q, r, t);
cudaDeviceSynchronize();
cudaFree(x);
cudaFree(y);
cudaFree(z);
cudaFree(w);
cudaFree(q);
cudaFree(r);
cudaFree(t);
}
__global__
void doMath(int *x, int *y, int *z, int *w, int *q, int *r, int *t) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
switch(question) {
case 1:
for (int a = index; a < x.length; a+=stride ) {
for (int b = index; b < y.length; b+=stride) {
for (int c = index; c < z.length; c+=stride) {
q[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
}
}
}
break;
case 2:
for (int a = index; a < x.length; a+=stride) {
for (int b = index; b < y.length; b+=stride) {
for (int c = index; c < w.length; c+=stride) {
r[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
}
}
}
break;
case 3:
for (int a = index; a < x.length; a+=stride) {
for (int b = index; b < y.length; b+=stride) {
for (int c = index; c < z.length; c+=stride) {
q[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
}
}
}
for (int a = index; a < x.length; a+=stride) {
for (int b = index; b < y.length; b+=stride) {
for (int c = index; c < w.length; c+=stride) {
t[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
}
}
}
break;
}
}
In CUDA, every thread runs your kernel. If you want the threads to do different things, you have to branch depending (in some way) on threadIdx and/or blockIdx.
You did this by calculating index: every thread in your kernel has a different index. Now you have to map your indices to the work the kernel should do, i.e. map every index to one or multiple triplets (a, b, c).
Your current mapping is something like:
index -> (index + i*stride, index + j*stride, index + k*stride)
I do not believe this was your intent.

Speeding up calculation using vectors in C++ by using pointers/references

Currently, I am making a C++ program that solves a sudoku. In order to do this, I calculate the "energy" of the sudoku (the number of faults) frequently. This calculation unfortunately takes up a lot of computation time. I think that it can be sped up significantly by using pointers and references in the calculation, but have trouble figuring out how to implement this.
In my solver class, I have a vector<vector<int>> data member called _sudoku that contains the value of each site. Currently, when calculating the energy, I call a lot of functions with pass-by-value. I tried adding a & in the arguments of the functions and a * when making the variables, but this did not work. How can I make this program run faster by using pass-by-reference?
Calculating the energy should not change the vector anyway so that would be better.
Using CPU usage, I tracked 80% of the calculation time down to the functions where vectors are copied.
int SudokuSolver::calculateEnergy() {
int energy = 243 - (rowUniques() + colUniques() + blockUniques());//count number as faults
return energy;
}
int SudokuSolver::colUniques() {
int count = 0;
for (int col = 0; col < _dim; col++) {
vector<int> colVec = _sudoku[col];
for (int i = 1; i <= _dim; i++) {
if (isUnique(colVec, i)) {
count++;
}
}
}
return count;
}
int SudokuSolver::rowUniques() {
int count = 0;
for (int row = 0; row < _dim; row++) {
vector<int> rowVec(_dim);
for (int i = 0; i < _dim; i++) {
rowVec[i] = _sudoku[i][row];
}
for (int i = 1; i <= _dim; i++) {
if (isUnique(rowVec, i)) {
count++;
}
}
}
return count;
}
int SudokuSolver::blockUniques() {
int count = 0;
for (int nBlock = 0; nBlock < _dim; nBlock++) {
vector<int> blockVec = blockMaker(nBlock);
for (int i = 1; i <= _dim; i++) {
if (isUnique(blockVec, i)) {
count++;
}
}
}
return count;
}
vector<int> SudokuSolver::blockMaker(int No) {
vector<int> block(_dim);
int xmin = 3 * (No % 3);
int ymin = 3 * (No / 3);
int col, row;
for (int i = 0; i < _dim; i++) {
col = xmin + (i % 3);
row = ymin + (i / 3);
block[i] = _sudoku[col][row];
}
return block;
}
bool SudokuSolver::isUnique(vector<int> v, int n) {
int count = 0;
for (int i = 0; i < _dim; i++) {
if (v[i] == n) {
count++;
}
}
if (count == 1) {
return true;
} else {
return false;
}
}
The specific lines that use a lot of computatation time are the ones like:
vector<int> colVec = _sudoku[col];
and every time isUnique() is called.
I expect that if I switch to using pass-by-reference, my code will speed up significantly. Could anyone help me in doing so, if that would indeed be the case?
Thanks in advance.
If you change your SudokuSolver::isUnique to take vector<int> &v, that is the only change you need to pass by reference instead of by value. Passing a pointer would be similar to passing by reference, with the difference that pointers can be re-assigned or be NULL, while references cannot.
I suspect you would see some performance increase on a sufficiently large problem, where the copies are expensive enough to measure (if your problem is small, it will be difficult to see minor performance increases).
Hope this helps!
vector<int> colVec = _sudoku[col]; does copy/transfer all the elements, while const vector<int>& colVec = _sudoku[col]; would not (it only creates an alias for the right hand side).
Same with bool SudokuSolver::isUnique(vector<int> v, int n) { versus bool SudokuSolver::isUnique(const vector<int>& v, int n) {
Edited after Jesper Juhl's suggestion: the const addition makes sure that you don't change the referenced contents by mistake.
Edit 2: Another thing to notice is that vectors like vector<int> rowVec(_dim); are allocated and deallocated at each iteration, which can get costly. You could try something like:
int SudokuSolver::rowUniques() {
int count = 0;
vector<int> rowVec(_maximumDim); // Specify maximum dimension
for (int row = 0; row < _dim; row++) {
for (int i = 0; i < _dim; i++) {
rowVec[i] = _sudoku[i][row];
}
for (int i = 1; i <= _dim; i++) {
if (isUnique(rowVec, i)) {
count++;
}
}
}
return count;
}
if that doesn't interfere with your implementation.

How can I parallel this loop with open mp?

I don't know how to parallelize these loops because I have a lot of dependent variables and I am very confused.
Can you help and guide me?
The first one is:
for (int a = 0; a < sigmaLen; ++a) {
int f = freq[a];
if (f >= sumFreqLB)
if (updateRemainingDistances(s, a, pos))
if (prunePassed(pos + 1)) {
lmer[pos] = a;
enumerateStrings(pos + 1, sumFreqLB - f);
}
}
The second one is :
void preprocessLowerBounds() {
int i = stackSz - 1;
int pairOffset = (i * (i - 1)) >> 1;
for (int k = L; k; --k) {
int *dsn = dist[k] + pairOffset;
int *ds = dist[k - 1] + pairOffset;
int *s = colS[k - 1];
char ci = s[i];
for (int j = 0; j < i; ++j) {
char cj = s[j];
*ds++ = (*dsn++) + (ci != cj);
}
}
}
And another one is:
void enumerateSubStrings(int rowNumber, int remainQTolerance) {
int nItems = rowSize[rowNumber][stackSz];
if (shouldGenerateNeighborhood(rowNumber, nItems)) {
bruteForceIt(rowNumber, nItems);
} else {
indexType *row = rowItem[rowNumber];
for (int j = 0; j < nItems; ++j) {
indexType ind = row[j];
addString(lmers + ind);
preprocessLowerBounds();
uint threshold = maxLB[stackSz] - addMaxFreq();
if (hasSolution(0, threshold)) {
if (getValid<hasPreprocessedPairs, useQ>(rowNumber + 1,
(stackSz <= 2 ? n : smallN), threshold + LminusD,
ind, remainQTolerance)) {
enumerateSubStrings<hasPreprocessedPairs, useQ>(
rowNumber + 1, remainQTolerance);
}
}
removeLastString();
}
}
}
void addString(const char *t) {
int *mf = colMf[stackSz + 1];
for (int j = 0; j < L; ++j) {
int c = t[j];
colS[j][stackSz] = c;
mf[j] = colMaxFreq[j] + (colMaxFreq[j] == colFreq[j][c]++);
}
colMaxFreq = mf;
++stackSz;
}
void preprocessLowerBounds() {
int i = stackSz - 1;
int pairOffset = (i * (i - 1)) >> 1;
for (int k = L; k; --k) {
int *dsn = dist[k] + pairOffset;
int *ds = dist[k - 1] + pairOffset;
int *s = colS[k - 1];
char ci = s[i];
for (int j = 0; j < i; ++j) {
char cj = s[j];
*ds++ = (*dsn++) + (ci != cj);
}
}
}
void removeLastString() {
--stackSz;
for (int j = 0; j < L; ++j)
--colFreq[j][colS[j][stackSz]];
colMaxFreq = colMf[stackSz];
}
OK. For OpenMP to parallelize a loop, you basically have to follow two rules: first, never write to the same memory location from different threads; second, never depend on reading a memory area that another thread may modify. Now, in the first loop you only change the lmer variable, and the other operations only read variables that I assume are not being changed at the same time from another part of your code, so the first loop would be as follows:
#pragma omp parallel for private(s, pos) // My assumption is that these variables are global or belong to a class, so they must be made private to each thread; sumFreqLB and freq are not included because they are only read
for (int a = 0; a < sigmaLen; ++a) {
int f = freq[a];
if (f >= sumFreqLB)
if (updateRemainingDistances(s, a, pos))
if (prunePassed(pos + 1)) {
#pragma omp critical // Only one thread at a time may enter; otherwise you will get wrong results at runtime
{
lmer[pos] = a;
}
enumerateStrings(pos + 1, sumFreqLB - f);
}
}
In the second loop I could not understand how you're using the for, but you have no problems there, because you only read shared data and only modify thread-local variables.
You must make sure that the functions updateRemainingDistances, prunePassed and enumerateStrings do not use static or global variables internally.
In the following function most operations are reads, which can be done from multiple threads (as long as no other thread modifies these variables), and the writes go to distinct memory positions, so you just need to reshape the for loop so OpenMP can recognize it:
void preprocessLowerBounds() {
int i = stackSz - 1;
int pairOffset = (i * (i - 1)) >> 1;
#pragma omp parallel for
for (int var = 0; var < L; var++) {
int newK = L - var; // This covers the original range k = L..1, one k per iteration
int *dsn = dist[newK] + pairOffset;
int *ds = dist[newK - 1] + pairOffset;
int *s = colS[newK - 1];
char ci = s[i];
for (int j = 0; j < i; ++j) {
char cj = s[j];
*ds++ = (*dsn++) + (ci != cj);
}
}
}
In the last function you use many functions for which I do not know the source code, so I cannot tell whether they are parallelizable. For example, the following functions are not parallelizable:
std::vector<int> myVector;
void notParalelizable_1(int i){
myVector.push_back(i);
}
}
void notParalelizable_2(int i){
static int A=0;
A=A+i;
}
int varGlobal=0;
void notParalelizable_3(int i){
varGlobal=varGlobal+i;
}
void oneFunctionParalelizable(int i)
{
int B=i;
}
int main()
{
#pragma omp parallel for
for(int i=0;i<10;i++)
{
notParalelizable_1(i); // Error: myVector is modified simultaneously from multiple threads, so the values will not be stored in ascending order; with more complex functions this can produce wrong results or even runtime errors
}
#pragma omp parallel for
for(int i=0;i<10;i++)
{
notParalelizable_2(i);//Error because A is modified simultaneously from multiple threads
}
#pragma omp parallel for
for(int i=0;i<10;i++)
{
notParalelizable_3(i);//Error because varGlobal is modified simultaneously from multiple threads
}
#pragma omp parallel for
for(int i=0;i<10;i++)
{
oneFunctionParalelizable(i);//no problem
}
//The following code is correct
int *vector=new int[10];
#pragma omp parallel for
for(int i=0;i<10;i++)
{
vector[i]=i; // No problem, because each thread writes to a different memory position
}
//The following code is wrong
int k=2;
#pragma omp parallel for
for(int i=0;i<10;i++)
{
k=k+i; // The final value of k will be wrong, because it is modified from different threads
}
return 0;
}

Why does using an inline function in C++ not grow the binary size?

I have written this code:
inline int a_plus_b_power2(int a, int b) {
return (a + b) * (a + b);
}
int main() {
for(int a = 0; a < 9999999999999; ++a)
for(int b = 0; b < 999999999999; ++b)
a_plus_b_power2(a, b);
return 0;
}
but why does the binary of this program not differ from that of this program:
inline int a_plus_b_power2(int a, int b) {
return (a + b) * (a + b);
}
int main() {
for(int a = 0; a < 9; ++a)
for(int b = 0; b < 9; ++b)
a_plus_b_power2(a, b);
return 0;
}
You are confusing function inlining with loop unrolling:
Loop unrolling means transforming
for (int i = 0; i < 4; i++)
a(i);
into
a(0); a(1); a(2); a(3);
while function inlining means transforming
void a(int i) { cout << i; }
for (int i = 0; i < 4; i++)
a(i);
into
for (int i = 0; i < 4; i++)
cout << i;
Compilers do have options to enable loop unrolling (look at -funroll-loops and related options for gcc), but unless you poke them really hard, most of them will be very reluctant to unroll 999999999999 iterations... (the resulting binary would be multiple terabytes).
Inlined functions are only "pasted" once per call site.
In both your examples, the inlined function appears at a single call site, although that call site is executed many times.
I believe you want something like this:
for (unsigned int a = 0; a < 9; ++a)
{
for (unsigned int b = 0; b < 9; b+= 3) // Incremented by 3 because of 3 calls in loop.
{
a_plus_b_power_2(a, b + 0);
a_plus_b_power_2(a, b + 1);
a_plus_b_power_2(a, b + 2);
}
}
The above example may cause the compiler to paste the code inside your inline function 3 times within the loop and increase the size of the binary.
Note: turn off optimizations, because they may cause the compiler to convert the inline function into a standalone function inside the loop.