Why doesn't using an inline function in C++ grow the binary size? - c++

I have written this code:
inline int a_plus_b_power2(int a, int b) {
    return (a + b) * (a + b);
}

int main() {
    for (int a = 0; a < 9999999999999; ++a)
        for (int b = 0; b < 999999999999; ++b)
            a_plus_b_power2(a, b);
    return 0;
}
but why doesn't the binary of this program differ from that of this program:
inline int a_plus_b_power2(int a, int b) {
    return (a + b) * (a + b);
}

int main() {
    for (int a = 0; a < 9; ++a)
        for (int b = 0; b < 9; ++b)
            a_plus_b_power2(a, b);
    return 0;
}

You are confusing function inlining with loop unrolling.
Loop unrolling means transforming

for (int i = 0; i < 4; i++)
    a(i);

into

a(0); a(1); a(2); a(3);

while function inlining means transforming

void a(int i) { cout << i; }
for (int i = 0; i < 4; i++)
    a(i);

into

for (int i = 0; i < 4; i++)
    cout << i;
Compilers do have options to enable loop unrolling (look at -funroll-loops and related options for gcc), but unless you poke them really hard, most of them will be very reluctant to unroll 999999999999 iterations... (the resulting binary would be multiple terabytes).

Inlined functions are only "pasted" once per call site.
In both of your examples, the inlined function appears at only one call site, although it is called many times.
I believe you want something like this:

for (unsigned int a = 0; a < 9; ++a)
{
    for (unsigned int b = 0; b < 9; b += 3) // Incremented by 3 because of 3 calls in loop.
    {
        a_plus_b_power2(a, b + 0);
        a_plus_b_power2(a, b + 1);
        a_plus_b_power2(a, b + 2);
    }
}

The above example may cause the compiler to paste the body of your inline function 3 times within the loop and increase the size of the binary.
Note: turn off optimizations, because optimizations may cause the compiler to keep the inline function as a standalone function that is called from the loop.

Related

Confused about using cpp to achieve selection sort

I tried to implement selection sort in C++. When I encapsulate the swap in a function, the output shows a lot of zeros (though the beginning of the array still comes out right). When I replace the swap function call with the code in the comment, the output is correct.
I am so confused by this result; can anyone help me figure it out?
#include <iostream>
#include <string>
using namespace std;

template<class T>
int length(T& arr)
{
    return sizeof(arr) / sizeof(arr[0]);
}

void swap(int& a, int& b)
{
    a += b;
    b = a - b;
    a = a - b;
}

int main()
{
    int array[] = { 2,2,2,2,6,56,9,4,6,7,3,2,1,55,1 };
    int N = length(array);
    for (int i = 0; i < N; i++)
    {
        int min = i; // index of min
        for (int j = i + 1; j < N; j++)
        {
            if (array[j] < array[min]) min = j;
        }
        swap(array[i], array[min]);
        // int temp = array[i];
        // array[i] = array[min];
        // array[min] = temp;
    }
    for (int i = 0; i < N; i++)
    {
        int showNum = array[i];
        cout << showNum << " ";
    }
    return 0;
}
The problem is that your swap function does not work when a and b refer to the same variable, for example when swap(array[i], array[i]) is called.
Note that in that case the line b = a - b; sets b to zero, since a and b are the same variable.
This happens whenever element i happens to already be in place.
Off topic: learn to split code into functions. Avoid putting lots of code in a single function, especially main. See the example. This is more important than you think.
Your swap function is not doing what it is supposed to do. Just use this instead, or fix your current swap:
void swap(int& a, int& b) {
    int temp = a;
    a = b;
    b = temp;
}

How to nest for loops in CUDA?

I would like to ask for a complete example of CUDA code, one that includes everything someone may want to include, so that it can be referenced by people trying to write such code, such as myself.
My main concern is whether it is possible to process multiple for loops at the same time on different threads in the same block. This is the difference between running (to give a clear example) a total of 2016 threads divided into blocks of 32 on case 3 in the example code, and running 1024 threads on each for loop; theoretically, with the code we have, we could run even fewer, saving another 2 blocks, by running the for loops of the other cases under the same block. Otherwise, separate cases would primarily be used for processing separate tasks such as a for loop. Currently it appears that the CUDA code simply knows when to run in parallel.
// note: rarely referenced, you can process if statements in parallel seemingly by block, I'd say that is the primary purpose of using more blocks instead of increasing thread count per block during call, other than the need of multiple SMs (Streaming Multiprocessors), capped at 2048 threads (also the cap for a block)//
If we have the following code including for loops and if statements then what would the code that optimizes parallelization be?
public void main(string[] args) {
    doMath(3); // we want to process each statement in parallel. For this we use different blocks.
}

void doMath(int question) {
    int[] x = new int{0,1,2,3,4,5,6,7,8,9};
    int[] y = new int{0,1,2,3,4,5,6,7,8,10};
    int[] z = new int{0,1,2,3,4,5,6,7,8,11};
    int[] w = new int{0,1,2,3,4,5,6,7,8,12};
    int[] q = new int[1000];
    int[] r = new int[1000];
    int[] v = new int[1000];
    int[] t = new int[1000];
    switch(question) {
    case 1:
        for (int a = 0; a < x.length; a++) {
            for (int b = 0; b < y.length; b++) {
                for (int c = 0; c < z.length; c++) {
                    q[(a*100)+(b*10)+(c)] = x[a] + y[b] + z[c];
                }
            }
        }
        break;
    case 2:
        for (int a = 0; a < x.length; a++) {
            for (int b = 0; b < y.length; b++) {
                for (int c = 0; c < w.length; c++) {
                    r[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                }
            }
        }
        break;
    case 3:
        for (int a = 0; a < x.length; a++) {
            for (int b = 0; b < z.length; b++) {
                for (int c = 0; c < w.length; c++) {
                    v[(a*100)+(b*10)+(c)] = x[a] + z[b] + w[c];
                }
            }
        }
        for (int a = 0; a < x.length; a++) {
            for (int b = 0; b < y.length; b++) {
                for (int c = 0; c < w.length; c++) {
                    t[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                }
            }
        }
        break;
    }
}
From the samples I have seen, the CUDA code would be as follows:
// 3 blocks for 3 switch cases; the third case requires 2000 threads to be done in perfect parallel, while the first two only require 1000 each. Blocks operate in multiples of 32 (threads). The trick is to take the greatest common denominator of all cases (or if/else statements, as the... case... may be) and apportion the number of blocks required to each case. (In this example we would need 127 blocks of 32 threads: 1024 * 2 + 2048 - 32.) //
// side note: each Streaming Multiprocessor or SM can only support 2048 threads, i.e. 2048 / (# of blocks * # of threads/block) //
public void main(string[] args) {
    int *x, *y, *z, *w, *q, *r, *t;
    int[] x = new int{0,1,2,3,4,5,6,7,8,9};
    int[] y = new int{0,1,2,3,4,5,6,7,8,10};
    int[] z = new int{0,1,2,3,4,5,6,7,8,11};
    int[] w = new int{0,1,2,3,4,5,6,7,8,12};
    int[] q = new int[1000];
    int[] r = new int[1000];
    int[] t = new int[1000];
    cudaMallocManaged(&x, x.length*sizeof(int));
    cudaMallocManaged(&y, y.length*sizeof(int));
    cudaMallocManaged(&z, z.length*sizeof(int));
    cudaMallocManaged(&w, w.length*sizeof(int));
    cudaMallocManaged(&q, q.length*sizeof(int));
    cudaMallocManaged(&r, r.length*sizeof(int));
    cudaMallocManaged(&t, t.length*sizeof(int));
    doMath<<<127,32>>>(x, y, z, w, q, r, t);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);
    cudaFree(w);
    cudaFree(q);
    cudaFree(r);
    cudaFree(t);
}

__global__
void doMath(int *x, int *y, int *z, int *w, int *q, int *r, int *t) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    switch(question) {
    case 1:
        for (int a = index; a < x.length; a+=stride) {
            for (int b = index; b < y.length; b+=stride) {
                for (int c = index; c < z.length; c+=stride) {
                    q[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                }
            }
        }
        break;
    case 2:
        for (int a = index; a < x.length; a+=stride) {
            for (int b = index; b < y.length; b+=stride) {
                for (int c = index; c < w.length; c+=stride) {
                    r[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                }
            }
        }
        break;
    case 3:
        for (int a = index; a < x.length; a+=stride) {
            for (int b = index; b < y.length; b+=stride) {
                for (int c = index; c < z.length; c+=stride) {
                    q[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                }
            }
        }
        for (int a = index; a < x.length; a+=stride) {
            for (int b = index; b < y.length; b+=stride) {
                for (int c = index; c < w.length; c+=stride) {
                    t[(a*100)+(b*10)+(c)] = x[a] + y[b] + w[c];
                }
            }
        }
        break;
    }
}
In CUDA every thread runs your kernel. If you want the threads to do different things, you have to branch depending (in some way) on threadIdx and/or blockIdx.
You did this by calculating index: every thread in your kernel has a different index. Now you have to map your indices to the work the kernel should do, i.e. map every index to one or more triplets (a,b,c).
Your current mapping is something like:

index -> (index + i*stride, index + j*stride, index + k*stride)

I do not believe this was your intent.

Garbage in array after calling function

The problem occurs in foo() (in the commented lines): foo2() should return the result of a repeated matrix-multiplication process in its first parameter. It works in the first case and fails right after.
The B and B_tmp arrays should have the same values at the end of foo(), and that's not happening.
T is a 1x6 matrix, A is a 6x3 matrix, B is a 200x3 matrix.
foo3() multiplies T*A and stores the result (a 1x3 matrix) at the end of B.
What foo2() does at the beginning with B_t1_t2 is not relevant; it just prepares the 1x6 matrix, changing the order in some way.
I must try to solve this without changing any function declaration.
I'm new to C++ and have been searching for too long now; I'm desperate.
#include <stdio.h>
#include <iostream>
#include <random>
#include <thread>
using namespace std;

double fRand(const double & min, const double & max) {
    thread_local std::mt19937 generator(std::random_device{}());
    std::uniform_real_distribution<double> distribution(min, max);
    return distribution(generator);
}

int iRand(const int & min, const int & max) {
    thread_local std::mt19937 generator(std::random_device{}());
    std::uniform_int_distribution<int> distribution(min, max);
    return distribution(generator);
}

void foo3(double T[6], double A[18], double *B)
{
    for (int i = 0; i < 3; i++) {
        double r = 0;
        for (int j = 0; j < 6; j++) {
            r += T[j] * A[i*3+j];
        }
        *B = r; B++;
    }
}

void foo2(double *B, double *A, int from, int to)
{
    for (int i = from; i < to; i++) { // This is not relevant but I leave it just in case
        double B_t1_t2[6];
        for (int x = 0; x < 3; x++)
            B_t1_t2[x] = B[(i-1)*3 + x];
        for (int x = 0; x < 3; x++)
            B_t1_t2[x+3] = B[(i-2)*3 + x];
        foo3(B_t1_t2, A, &B[i*3]);
    }
}

void foo(double *A, double *B)
{
    for (int i = 0; i < 18; i++)
        A[i] = fRand(1, 2);
    foo2(B, A, 2, 200);
    cout << "\nB" << endl;
    for (int i = 0; i < 600; i++)
        cout << B[i] << " "; // HERE IS WORKING, B DOES NOT CONTAIN GARBAGE
    cout << endl;
    double B_tmp[600];
    foo2(B_tmp, A, 2, 200);
    cout << "\nB_tmp" << endl;
    for (int i = 0; i < 600; i++)
        cout << B_tmp[i] << " "; // WHY NOT WORKING HERE?
    cout << endl;
}

int main()
{
    double A[18], B[600];
    for (int i = 0; i < 6; i++)
        B[i] = 1;
    foo(A, B);
}
Why is the second cout in foo() showing garbage?
Also, if the declarations must change, what would be the best way?
I'm trying to use stack memory as much as I can.
Before calling foo(A, B); the first 6 elements of the B array were filled (all set to 1). In the foo function you call the foo2 function twice. In the first call you pass the B array into foo2, and it works because B is filled. In the second call of foo2 in foo you pass the B_tmp array, but all items of this array have garbage values; you didn't initialize them. So do:
double B_tmp[600];
for (int i = 0; i < 6; ++i)
    B_tmp[i] = 1;
foo2(B_tmp, A, 2, 200);

How can I parallelize these loops with OpenMP?

I don't know how I can parallelize these loops, because they have a lot of dependent variables and I am very confused.
Can you help and guide me?
The first one is:
for (int a = 0; a < sigmaLen; ++a) {
    int f = freq[a];
    if (f >= sumFreqLB)
        if (updateRemainingDistances(s, a, pos))
            if (prunePassed(pos + 1)) {
                lmer[pos] = a;
                enumerateStrings(pos + 1, sumFreqLB - f);
            }
}
The second one is:

void preprocessLowerBounds() {
    int i = stackSz - 1;
    int pairOffset = (i * (i - 1)) >> 1;
    for (int k = L; k; --k) {
        int *dsn = dist[k] + pairOffset;
        int *ds = dist[k - 1] + pairOffset;
        int *s = colS[k - 1];
        char ci = s[i];
        for (int j = 0; j < i; ++j) {
            char cj = s[j];
            *ds++ = (*dsn++) + (ci != cj);
        }
    }
}
And the last one is:

void enumerateSubStrings(int rowNumber, int remainQTolerance) {
    int nItems = rowSize[rowNumber][stackSz];
    if (shouldGenerateNeighborhood(rowNumber, nItems)) {
        bruteForceIt(rowNumber, nItems);
    } else {
        indexType *row = rowItem[rowNumber];
        for (int j = 0; j < nItems; ++j) {
            indexType ind = row[j];
            addString(lmers + ind);
            preprocessLowerBounds();
            uint threshold = maxLB[stackSz] - addMaxFreq();
            if (hasSolution(0, threshold)) {
                if (getValid<hasPreprocessedPairs, useQ>(rowNumber + 1,
                        (stackSz <= 2 ? n : smallN), threshold + LminusD,
                        ind, remainQTolerance)) {
                    enumerateSubStrings<hasPreprocessedPairs, useQ>(
                        rowNumber + 1, remainQTolerance);
                }
            }
            removeLastString();
        }
    }
}

void addString(const char *t) {
    int *mf = colMf[stackSz + 1];
    for (int j = 0; j < L; ++j) {
        int c = t[j];
        colS[j][stackSz] = c;
        mf[j] = colMaxFreq[j] + (colMaxFreq[j] == colFreq[j][c]++);
    }
    colMaxFreq = mf;
    ++stackSz;
}

void preprocessLowerBounds() {
    int i = stackSz - 1;
    int pairOffset = (i * (i - 1)) >> 1;
    for (int k = L; k; --k) {
        int *dsn = dist[k] + pairOffset;
        int *ds = dist[k - 1] + pairOffset;
        int *s = colS[k - 1];
        char ci = s[i];
        for (int j = 0; j < i; ++j) {
            char cj = s[j];
            *ds++ = (*dsn++) + (ci != cj);
        }
    }
}

void removeLastString() {
    --stackSz;
    for (int j = 0; j < L; ++j)
        --colFreq[j][colS[j][stackSz]];
    colMaxFreq = colMf[stackSz];
}
OK. For OpenMP to parallelize a loop, you basically follow two rules: first, never write to the same memory location from different threads; second, never depend on reading a memory area that another thread may modify. Now, in the first loop you only change the lmer variable, and the other operations involve read-only variables that I assume are not being changed at the same time by another part of your code, so the first loop would be as follows:

#pragma omp parallel for private(s, a, pos) // My guess is that these variables are global or belong to a class, so you must make them private to each thread; sumFreqLB and freq are not included because they are only read
for (int a = 0; a < sigmaLen; ++a) {
    int f = freq[a];
    if (f >= sumFreqLB)
        if (updateRemainingDistances(s, a, pos))
            if (prunePassed(pos + 1)) {
                #pragma omp critical // Only one thread at a time can enter; otherwise you will fail at runtime
                {
                    lmer[pos] = a;
                }
                enumerateStrings(pos + 1, sumFreqLB - f);
            }
}
In the second loop I could not understand how you're using the for, but you should have no problems there, because you only read shared data and only modify thread-local variables.
You must make sure that the functions updateRemainingDistances, prunePassed and enumerateStrings do not use static or global variables internally.
In the following function you mostly use read operations, which can be done from multiple threads (as long as no other thread modifies those variables), and you write to local memory positions, so just change the shape of the for loop so that OpenMP can recognize it:
void preprocessLowerBounds() {
    int i = stackSz - 1;
    int pairOffset = (i * (i - 1)) >> 1;
    #pragma omp parallel for
    for (int var = 0; var < L; var++) {
        int newK = L - var; // This covers the original range k = L..1 in the same order
        int *dsn = dist[newK] + pairOffset;
        int *ds = dist[newK - 1] + pairOffset;
        int *s = colS[newK - 1];
        char ci = s[i];
        for (int j = 0; j < i; ++j) {
            char cj = s[j];
            *ds++ = (*dsn++) + (ci != cj);
        }
    }
}
In the last function you use many functions whose source code I do not know, so I cannot tell whether they are parallelizable. For example, the following functions are not parallelizable:

std::vector<int> myVector;
void notParalelizable_1(int i) {
    myVector.push_back(i);
}
void notParalelizable_2(int i) {
    static int A = 0;
    A = A + i;
}
int varGlobal = 0;
void notParalelizable_3(int i) {
    varGlobal = varGlobal + i;
}
void oneFunctionParalelizable(int i)
{
    int B = i;
}
int main()
{
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        notParalelizable_1(i); // Error: myVector is modified simultaneously from multiple threads, so the values will not be stored in ascending order; with more complex functions this can produce wrong results or even runtime errors.
    }
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        notParalelizable_2(i); // Error: A is modified simultaneously from multiple threads
    }
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        notParalelizable_3(i); // Error: varGlobal is modified simultaneously from multiple threads
    }
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        oneFunctionParalelizable(i); // no problem
    }
    // The following code is correct
    int *vector = new int[10];
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        vector[i] = i; // No problem: each thread writes to a different memory position
    }
    // The following code is wrong
    int k = 2;
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
    {
        k = k + i; // The final value of k will be wrong, as it is modified from different threads
    }
    return 0;
}

c++ define variable inside if block for function scope

Background: I am working on developing several different controllers (over 10 or so) for a piece of hardware, which involves running the code in hard real-time under RTAI Linux. I have implemented a class for the hardware, with each controller as a separate member function of the class. I'm looking to pass the desired trajectory for the respective control variable to each of these control functions, based on which controller is chosen. In addition, since there are several parameters for each controller, and I want to switch controllers quickly without navigating through the entire code changing parameters, I would like to declare all the control variables in one place and define them based on which controller I choose to run. Here is a minimum working example of what I am looking for.
I am looking to define variables based on if a condition is true or not as follows in C++:
int foo()
{
    int i = 0;
    if (i == 0)
    {
        int a = 0;
        float b = 1;
        double c = 10;
    }
    else if (i == 1)
    {
        int e = 0;
        float f = 1;
        double g = 10;
    }
    // Memory locked for hard real-time execution
    // execute in hard real-time from here
    while (some condition)
    {
        // 100's of lines of code
        if (i == 0)
        {
            a = a + 1;
            b = b * 2;
            c = c * 4;
            // 100's of lines of code
        }
        else if (i == 1)
        {
            e = e * e * e;
            f = f * 3;
            g = g * 10;
            // 100's of lines of code
        }
        // 100's of lines of code
    }
    // stop execution in hard real-time
}
The above code fails to compile because the scope of the variables defined in the if blocks is limited to the respective block. Could anyone suggest a better way of handling this? What is the best practice in this context in C++?
In your case, you may simply use:

int foo()
{
    int i = 0;
    if (i == 0) {
        int a = 0;
        float b = 1;
        double c = 10;
        for (int j = 1; j < 10; j++) {
            a = a + 1;
            b = b * 2;
            c = c * 4;
        }
    } else if (i == 1) {
        int e = 0;
        float f = 1;
        double g = 10;
        for (int j = 1; j < 10; j++) {
            e = e * e * e;
            f = f * 3;
            g = g * 10;
        }
    }
    return 0;
}
or even better, create sub-functions:

void foo0()
{
    int a = 0;
    float b = 1;
    double c = 10;
    for (int j = 1; j < 10; j++) {
        a = a + 1;
        b = b * 2;
        c = c * 4;
    }
}

void foo1()
{
    // .. stuff with e, f, g
}

int foo()
{
    int i = 0;
    if (i == 0) {
        foo0();
    } else if (i == 1) {
        foo1();
    }
    return 0;
}