Calculate row-major and column-major access time in C++

I want to measure the access time for these two patterns: row-major and column-major.
As we know, C/C++ stores arrays in row-major order, so processing in the first way (row-major) should be faster.
But look at this code in C++:
#include <iostream>
#include <time.h>
#include <cstdio>

clock_t RowMajor()
{
    char* buf = new char[20000,20000];
    clock_t start = clock();
    for (int i = 0; i < 20000; i++)
        for (int j = 0; j < 20000; j++)
        {
            ++buf[i,j];
        }
    clock_t elapsed = clock() - start;
    delete [] buf;
    return elapsed;
}
clock_t ColumnMajor()
{
    char* buf = new char[20000,20000];
    clock_t start = clock();
    for (int i = 0; i < 20000; i++)
        for (int j = 0; j < 20000; j++)
        {
            ++buf[j,i];
        }
    clock_t elapsed = clock() - start;
    delete [] buf;
    return elapsed;
}
int main()
{
    std::cout << "Process Started." << std::endl;
    printf("ColumnMajor took %lu microseconds.\n", ColumnMajor() * 1000000 / CLOCKS_PER_SEC);
    printf("RowMajor took %lu microseconds.\n", RowMajor() * 1000000 / CLOCKS_PER_SEC);
    std::cout << "done" << std::endl;
    return 0;
}
But whenever I run this code I get different answers: sometimes the row-major time is greater than the column-major time, and sometimes it is the opposite. Any help is appreciated.

In C++ the comma operator can't be used to create or access a matrix. In new char[20000,20000] the comma expression evaluates to 20000, so you allocate only 20000 chars, and buf[i,j] is just buf[j]. To make a matrix you need to keep track of width and height and allocate all the memory as one array: create a buffer with as many elements as the matrix has, and get element (x, y) by indexing at x + y * width.
clock_t RowMajor()
{
    int width = 20000;
    int height = 20000;
    char* buf = new char[width * height];
    clock_t start = clock();
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
        {
            ++buf[i + width * j];
        }
    clock_t elapsed = clock() - start;
    delete[] buf;
    return elapsed;
}
For ColumnMajor, the buffer needs to be accessed as buf[j + width * i] instead, so the inner loop jumps by width elements on every iteration (this is what the final code below does).
An alternative way to create a matrix (from comments, thanks to James Kanze) is to create the buffer like so: char (*buf)[20000] = new char[20000][20000]. In this case you access the buffer as buf[i][j].
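For illustration, here is that variant spelled out as a full function, in the same shape as the code above (my sketch of the comment's approach, not code from the original answer):

clock_t RowMajor2D()
{
    const int len = 20000;
    // One contiguous len*len block; buf[i][j] compiles to the same
    // address arithmetic as indexing a flat buffer at i * len + j.
    char (*buf)[len] = new char[len][len];
    clock_t start = clock();
    for (int i = 0; i < len; i++)
        for (int j = 0; j < len; j++)
        {
            ++buf[i][j];
        }
    clock_t elapsed = clock() - start;
    delete[] buf;
    return elapsed;
}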
The safest way to do this is to use std::vector (or std::array) and avoid new/delete; std::vector also helps prevent buffer write overflows:
clock_t RowMajor()
{
    int width = 20000;
    int height = 20000;
    std::vector<char> buf;
    buf.resize(width * height);
    clock_t start = clock();
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
        {
            ++buf[i + j * width];
        }
    clock_t elapsed = clock() - start;
    return elapsed;
}

Thanks to Raxvan; this is the final code, which works fine so far:
#include <iostream>
#include <time.h>
#include <cstdio>

int calc = 0;

clock_t RowMajor()
{
    int width = 20000;
    int height = 20000;
    char* buf = new char[width * height];
    clock_t start = clock();
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
        {
            ++buf[i + width * j];  // contiguous: inner loop walks adjacent bytes
        }
    clock_t elapsed = clock() - start;
    delete[] buf;
    return elapsed;
}

clock_t ColumnMajor()
{
    int width = 20000;
    int height = 20000;
    char* buf = new char[width * height];
    clock_t start = clock();
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
        {
            ++buf[j + width * i];  // strided: inner loop jumps width bytes each step
        }
    clock_t elapsed = clock() - start;
    delete[] buf;
    return elapsed;
}

int main()
{
    std::cout << "Process Started." << std::endl;
    calc = ColumnMajor() / CLOCKS_PER_SEC;
    printf("ColumnMajor took %d seconds.\n", calc);  // %d matches the int, unlike the original %lu
    calc = RowMajor() / CLOCKS_PER_SEC;
    printf("RowMajor took %d seconds.\n", calc);
    std::cout << "done" << std::endl;
    return 0;
}
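If you want finer-grained numbers than whole seconds, std::chrono sidesteps the clock()/CLOCKS_PER_SEC juggling. A minimal self-contained sketch of the same benchmark (the time_us helper is my addition, not part of the original code):

#include <chrono>
#include <cstddef>
#include <iostream>

// Hypothetical helper: runs a callable once, returns elapsed microseconds.
template <typename F>
long long time_us(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main()
{
    const std::size_t width = 20000, height = 20000;
    char* buf = new char[width * height]();  // value-initialized to zero

    long long row = time_us([&] {
        for (std::size_t j = 0; j < height; j++)   // contiguous walk
            for (std::size_t i = 0; i < width; i++)
                ++buf[i + width * j];
    });
    long long col = time_us([&] {
        for (std::size_t j = 0; j < height; j++)   // stride of width bytes
            for (std::size_t i = 0; i < width; i++)
                ++buf[j + width * i];
    });

    std::cout << "RowMajor:    " << row << " us\n"
              << "ColumnMajor: " << col << " us\n";
    delete[] buf;
    return 0;
}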

Related

How can this for loop be optimized to run faster without parallelizing or SSE?

I am trying to optimize a piece of code without resorting to parallelization / SSE.
The critical code currently runs in about 20 ms on my PC with -O2. That seems like quite a lot, even for ~17 million iterations.
The particular piece that is too slow is as follows:
for (int d = 0; d < numDims; d++)
{
    for (int i = 0; i < numNodes; i++)
    {
        bins[d][(int) (floodVals[d][i] * binSteps)]++;
    }
}
Update: Changing to iterators reduced the run-time to 17ms.
for (int d = 0; d < numDims; d++)
{
    std::vector<float>::iterator floodIt;
    for (floodIt = floodVals[d].begin(); floodIt != floodVals[d].end(); ++floodIt)
    {
        bins[d][(int) (*floodIt * binSteps)]++;
    }
}
The full dummy code is here:
#include <vector>
#include <random>
#include <iostream>
#include <chrono>

int main()
{
    // Initialize random normalized input [0, 1)
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> dist(0, 0.99999);

    // Initialize dimensions
    const int numDims = 130;
    const int numNodes = 130000;
    const int binSteps = 30;

    // Make dummy data
    std::vector<std::vector<float>> floodVals(numDims, std::vector<float>(numNodes));
    for (int d = 0; d < numDims; d++)
    {
        for (int i = 0; i < numNodes; i++)
        {
            floodVals[d][i] = dist(gen);
        }
    }

    // Initialize binning
    std::vector<std::vector<int>> bins(numDims, std::vector<int>(binSteps, 0));

    // Time critical section of code
    auto start = std::chrono::high_resolution_clock::now();
    for (int d = 0; d < numDims; d++)
    {
        for (int i = 0; i < numNodes; i++)
        {
            bins[d][(int) (floodVals[d][i] * binSteps)]++;
        }
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed: " << elapsed.count() * 1000 << " ms" << std::endl;
    return 0;
}
Try eliminating the indexing on d in the inner loop, since d is constant there anyway: hoisting the row pointers out spares the compiler from redoing the double indirection (and from assuming that writes to bins might alias floodVals) on every iteration. This was roughly 2x faster for me.
for (int d = 0; d < numDims; d++)
{
    int* const bins_d = &bins[d][0];
    float* const floodVals_d = &floodVals[d][0];
    for (int i = 0; i < numNodes; i++)
    {
        bins_d[(int) (floodVals_d[i] * binSteps)]++;
    }
}
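The same hoisting can also be written with references instead of raw pointers, which reads a bit more naturally in C++11. This is just a restatement of the snippet above, assuming the same bins, floodVals, numDims and numNodes as the dummy code:

for (int d = 0; d < numDims; d++)
{
    // References fix the row once per outer iteration, like the pointers above.
    std::vector<int>& bins_d = bins[d];
    const std::vector<float>& flood_d = floodVals[d];
    for (int i = 0; i < numNodes; i++)
    {
        bins_d[(int) (flood_d[i] * binSteps)]++;
    }
}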

Speedup when avoiding false sharing problem?

I am trying to add cache-line padding to avoid the false sharing problem, but I can't see a big difference in speedup: with padding it is only 1.2x faster. I am running the code without padding and with padding n = 700 million times for testing. Should I get more than a 1.2x speedup? Maybe I have missed something in my padding implementation? I am adding 15 ints of padding because I am assuming the counters don't have to be allocated at the start of a cache line. Any tips appreciated.
Here is my code:
template <const int k>
void par_countingsort2(int *out, int const *in, const int n) {
    const int paddingAmount = cachelinesize / sizeof(int); // cachelinesize defined elsewhere
    const int kPadded = k + (paddingAmount - 1);
    printf("\n%d", kPadded);
    int counters[nproc][kPadded] = {}; // all zeros; nproc defined elsewhere
    #pragma omp parallel
    {
        int *thcounters = counters[omp_get_thread_num()];
        #pragma omp for
        for (int i = 0; i < n; ++i)
            ++thcounters[in[i]];
        #pragma omp single
        {
            int tmp, sum = 0;
            for (int j = 0; j < k; ++j)
                for (int i = 0; i < nproc; ++i) {
                    tmp = counters[i][j];
                    counters[i][j] = sum;
                    sum += tmp;
                }
        }
        #pragma omp for
        for (int i = 0; i < n; ++i)
            out[thcounters[in[i]]++] = in[i];
    }
}
#define k 1000

int main(int argc, char *argv[]) {
    // init input
    int n = argc > 1 && atoi(argv[1]) > 0 ? atoi(argv[1]) : 0;
    int* in = (int*)malloc(sizeof(int) * n);
    int* out = (int*)malloc(sizeof(int) * n);
    for (int i = 0; i < n; ++i)
        in[i] = rand() % k;
    printf("n = %d\n", n);
    // print some parameters
    printf("nproc = %d\n", nproc);
    printf("cachelinesize = %d byte\n", cachelinesize);
    printf("k = %d\n", k);
    double tp2 = omp_get_wtime();
    par_countingsort2<k>(out, in, n);
    tp2 = omp_get_wtime() - tp2;
    printf("par2, elapsed time = %.3f seconds (%.1fx speedup from par1), check passed = %c\n",
           tp2, tp/tp2, checkreset(out,in,n) ? 'y' : 'n'); // tp and checkreset defined elsewhere
    // free mem
    free(in);
    free(out);
    return EXIT_SUCCESS;
}
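As an aside, a common way to express per-thread, cache-line-padded counters is alignas rather than manual index padding. A small self-contained sketch of the idea (my illustration, not the questioner's code; it assumes a 64-byte cache line and at most 64 threads):

#include <cstdio>
#include <omp.h>

// alignas(64) rounds sizeof(PaddedCounter) up to a full cache line, so each
// thread's counter lives on its own line and increments by one thread never
// invalidate another thread's line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

int main() {
    PaddedCounter counters[64] = {}; // one slot per possible thread (assumption)
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        for (int i = 0; i < 10000000; ++i)
            ++counters[t].value;
    }
    long total = 0;
    for (const PaddedCounter& c : counters)
        total += c.value;
    std::printf("total = %ld\n", total);
    return 0;
}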

Measuring time with chrono changes after printing

I want to measure the execution time of a program in ns in C++. For that purpose I am using the chrono library.
#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    double input[ROWS][COLS];
    int i, j;
    auto start = std::chrono::steady_clock::now();
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            input[i][j] = i + j;
    }
    auto end = std::chrono::steady_clock::now();
    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : "
              << res
              << " ns" << std::endl;
    return 0;
}
I measured the time and it executed in 90 ns. However, when I add printing afterwards, the time changes.
#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    double input[ROWS][COLS];
    int i, j;
    auto start = std::chrono::steady_clock::now();
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            input[i][j] = i + j;
    }
    auto end = std::chrono::steady_clock::now();
    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : "
              << res
              << " ns" << std::endl;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            std::cout << input[i][j];
    }
    return 0;
}
The time changes to 89700 ns. What could be the problem? I only want to measure the execution time of the for loop.
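A likely explanation (my reading, not from the original post): in the first program nothing ever reads input, so the optimizer is free to delete the fill loop entirely, and the ~90 ns is roughly the cost of the two clock reads; the later printing forces the writes to actually happen, so they get timed. A minimal sketch that keeps the work observable without timing any I/O (the volatile sink is my addition):

#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    static double input[ROWS][COLS]; // static: keeps the ~320 KB array off the stack

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            input[i][j] = i + j;
    auto end = std::chrono::steady_clock::now();

    // Reading every element into a volatile forces the compiler to keep the
    // fill loop, while the reads themselves stay outside the timed region.
    volatile double sink = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sink = sink + input[i][j];

    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : " << res << " ns" << std::endl;
    return 0;
}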

Multithread calculate mean and std does not improve efficiency

I am a novice in the field of C++ multithreaded programming, and I am trying to use multiple threads to compute the mean and standard deviation of my data in parallel to reduce the time cost. My function for calculating the mean and standard deviation is the following:
void cal_mean_std(float* data, float* mean, float* sd, int N, int start_index, int span_cols)
{
    float value;  // float, since the data are floats
    for (int j = start_index; j < start_index + span_cols; j++) {
        mean[j] = 0;
        sd[j] = 0;
        for (int i = 0; i < N; i++) {
            value = data[j * N + i];
            mean[j] += value;
            sd[j] += value * value;
        }
        mean[j] = mean[j] / N;
        sd[j] = sqrt(sd[j] / N - mean[j] * mean[j]);
    }
}
I specify the start index and calculation span of each thread, and I launch my thread_pool as follows:
x.mean = new float[x.M];
x.sd = new float[x.M];
std::vector<std::thread> thread_pool;
int h = 4;
thread_pool.reserve(h);
int SNIPs = static_cast<int>(x.M / h + 1);
int SNIPs_final = x.M - (h - 1) * SNIPs;
for (int i = 0; i < h - 1; i++)
{
    thread_pool.push_back(std::thread(std::bind(cal_mean_std, x.data, x.mean, x.sd,
                                                x.N, i * SNIPs, SNIPs)));
}
thread_pool.push_back(std::thread(std::bind(cal_mean_std, x.data, x.mean, x.sd,
                                            x.N, (h - 1) * SNIPs, SNIPs_final)));
for (int i = 0; i < h; i++)
    thread_pool.at(i).join();
where x.M is the total number of columns of my data. However, I found that implementing it this way did not improve the program's efficiency, and I am not sure why.
We can simulate data to reproduce the computation: my data size is 5k x 300k. The sequential calculation, looping over all the data in one thread, takes 15 seconds, while my multithreaded version sometimes takes 16 seconds.
The simulation code is below; I find that with h = 1 the program takes 6 s to finish, but with h = 4 it takes 14 s.
#include <thread>
#include <vector>
#include <functional>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <math.h>
#include <time.h>

void gen_matrix(int N, int P, float* data) {
    for (int i = 0; i < N * P; i++)
    {
        data[i] = rand() % 10;
    }
}

void cal_mean_std(float* data, float* mean, float* sd, int N, int start_index, int span_cols)
{
    float value;
    for (int j = start_index; j < start_index + span_cols; j++) {
        mean[j] = 0;
        sd[j] = 0;
        for (int i = 0; i < N; i++) {
            value = data[j * N + i];
            mean[j] += value;
            sd[j] += value * value;
        }
        mean[j] = mean[j] / N;
        sd[j] = sqrt(sd[j] / N - mean[j] * mean[j]);
    }
}

int main()
{
    int N = 5000;
    int P = 300000;
    float* data = new float[N * P];
    gen_matrix(N, P, data);
    float* mean = new float[P];
    float* sd = new float[P]; // renamed from "std", which collides with the std namespace
    std::vector<std::thread> thread_pool;
    clock_t t1 = clock();
    int h = 1;
    thread_pool.reserve(h);
    int SNIPs = static_cast<int>(P / h + 1);
    int SNIPs_final = P - (h - 1) * SNIPs;
    for (int i = 0; i < h - 1; i++)
    {
        thread_pool.push_back(std::thread(std::bind(cal_mean_std, data, mean, sd,
                                                    N, i * SNIPs, SNIPs)));
    }
    thread_pool.push_back(std::thread(std::bind(cal_mean_std, data, mean, sd,
                                                N, (h - 1) * SNIPs, SNIPs_final)));
    for (int i = 0; i < h; i++)
        thread_pool.at(i).join();
    std::cout << "Time for the cal mean and std is " << (clock() - t1) * 1.0 / CLOCKS_PER_SEC << std::endl;
    return 0;
}
Thank you, everyone. I finally found the problem with my code: clock() measures CPU time consumed, summed across all threads, rather than wall-clock time.
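For anyone hitting the same thing: a wall-clock timer such as std::chrono::steady_clock reports elapsed real time, so it is the one to use around multithreaded sections. A minimal sketch (the sleeping worker just stands in for the real mean/std computation):

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    auto t1 = std::chrono::steady_clock::now();

    // Stand-in for the threaded computation above.
    std::thread worker([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
    worker.join();

    auto t2 = std::chrono::steady_clock::now();
    // duration<double> yields elapsed wall-clock seconds; clock() would
    // instead report CPU time summed over every running thread.
    std::cout << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    return 0;
}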

Dynamic 2D array C++98 vs C++11

Following the question "What is “cache-friendly” code?", I've created a dynamic 2D array to check how long it takes to access its elements column-wise and row-wise.
When I create an array in the following way:
const int len = 10000;
int **mass = new int*[len];
for (int i = 0; i < len; ++i)
{
    mass[i] = new int[len];
}
it takes 0.239 sec to traverse this array row-wise and 1.851 sec column-wise (in Release).
But when I create an array in this way:
auto mass = new int[len][len];
I get an opposite result: 0.204 sec to traverse this array row-wise and 0.088 sec column-wise.
My code:
const int len = 10000;
int **mass = new int*[len];
for (int i = 0; i < len; ++i)
{
mass[i] = new int[len];
}
// auto mass = new int[len][len]; // C++11 style
begin = std::clock();
for (int i = 0; i < len; ++i)
{
for (int j = 0; j < len; ++j)
{
mass[i][j] = i + j;
}
}
end = std::clock();
std::cout << "[i][j] " << static_cast<float>(end - begin) / 1000 << std::endl;
begin = std::clock();
for (int i = 0; i < len; ++i)
{
for (int j = 0; j < len; ++j)
{
mass[j][i] = i + j;
}
}
end = std::clock();
std::cout << "[j][i] " << static_cast<float>(end - begin) / 1000 << std::endl;
Please can you explain the difference between these two ways of allocating memory for a two-dimensional dynamic array? Why is it faster to traverse the array row-wise with the first allocation and column-wise with the second?
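For what it's worth, the two allocations have very different memory layouts, which you can see by printing the row addresses. A small sketch of the difference (my illustration, not from the original post):

#include <iostream>

int main()
{
    const int len = 4;

    // Pointer-to-pointer style: len + 1 separate heap allocations. Each row
    // can land anywhere, so mass[j][i] with varying j hops between blocks.
    int** scattered = new int*[len];
    for (int i = 0; i < len; ++i)
        scattered[i] = new int[len];

    // Single allocation of len*len ints: rows sit back to back, so even a
    // column-wise walk stays inside one compact block.
    int (*contiguous)[len] = new int[len][len];

    for (int i = 0; i < len; ++i)
        std::cout << "row " << i
                  << ": scattered at " << static_cast<void*>(scattered[i])
                  << ", contiguous at " << static_cast<void*>(contiguous[i]) << "\n";

    for (int i = 0; i < len; ++i)
        delete[] scattered[i];
    delete[] scattered;
    delete[] contiguous;
    return 0;
}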