Different float values in array impact performance by 10x - why? - c++

Please check out my code and the question below - thanks.
Code:
#include <iostream>
#include <chrono>
using namespace std;
int bufferWriteIndex = 0;
float curSample = 0;
float damping[5] = { 1, 1, 1, 1, 1 };
float modeDampingTermsExp[5] = { 0.447604, 0.0497871, 0.00247875, 0.00012341, 1.37263e-05 };
float modeDampingTermsExp2[5] = { -0.803847, -3, -6, -9, -11.1962 };
int main(int argc, char** argv) {
float subt = 0;
int subWriteIndex = 0;
auto now = std::chrono::high_resolution_clock::now();
while (true) {
curSample = 0;
for (int i = 0; i < 5; i++) {
//Slow version
damping[i] = damping[i] * modeDampingTermsExp2[i];
//Fast version
//damping[i] = damping[i] * modeDampingTermsExp[i];
float cosT = 2 * damping[i];
for (int m = 0; m < 5; m++) {
curSample += cosT;
}
}
//t += tIncr;
bufferWriteIndex++;
//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
now = std::chrono::high_resolution_clock::now();
int idx = bufferWriteIndex;
cout << idx - subWriteIndex << endl;
subWriteIndex = idx;
}
}
}
As you can see, I'm measuring the number of calculations (increments of bufferWriteIndex) per second.
Question:
Why is performance faster when using modeDampingTermsExp -
Program output:
12625671
12285846
12819392
11179072
12272587
11722863
12648955
vs using modeDampingTermsExp2 ?
1593620
1668170
1614495
1785965
1814576
1851797
1808568
1801945
It's about 10x faster. It seems like the numbers in those 2 arrays have an impact on calculation time. Why?
I am using Visual Studio 2019 with the following flags: /O2 /Oi /Ot /fp:fast

This is because you are hitting denormal numbers (also see this question).
You can get rid of denormals like so:
#include <cmath>
// [...]
for (int i = 0; i < 5; i++) {
    damping[i] = damping[i] * modeDampingTermsExp2[i];
    if (std::fpclassify(damping[i]) == FP_SUBNORMAL) {
        damping[i] = 0; // Treat denormals as 0.
    }
    float cosT = 2 * damping[i];
    for (int m = 0; m < 5; m++) {
        curSample += cosT;
    }
}
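To see how the Exp2 coefficients reach the subnormal range at all: damping[0] starts at 1 and is repeatedly multiplied by modeDampingTermsExp2[0] = -0.803847, so its magnitude decays geometrically, spends a stretch of iterations in FP_SUBNORMAL territory, and finally underflows to exactly 0. A small standalone check of that decay (firstSubnormalStep is my own helper, not part of the original program):

```cpp
#include <cassert>
#include <cmath>

// Traces the decay of damping[0] under repeated multiplication by a
// fixed factor and returns the step at which the value first becomes
// subnormal, or -1 if it never does within maxSteps.
int firstSubnormalStep(float factor, int maxSteps)
{
    float d = 1.0f; // same starting value as damping[0]
    for (int n = 1; n <= maxSteps; ++n) {
        d *= factor;
        if (std::fpclassify(d) == FP_SUBNORMAL)
            return n;
    }
    return -1;
}
```

With factor -0.803847f this reports that the value goes subnormal after a few hundred multiplications. On x86 you can also sidestep the penalty globally by enabling flush-to-zero and denormals-are-zero, e.g. _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON) and _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON) from <immintrin.h>, at the cost of strict IEEE 754 conformance.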


UPDATE 2:
I've switched to initializing all arrays with fixed numbers. Why is performance faster when using modeDampingTermsExp:
12625671
12285846
12819392
11179072
12272587
11722863
12648955
vs using modeDampingTermsExp2 ?
1593620
1668170
1614495
1785965
1814576
1851797
1808568
1801945
It's about 10x faster.
Full code
#include <iostream>
#include <chrono>
using namespace std;
int bufferWriteIndex = 0;
float curSample = 0;
float tIncr = 0.1f;
float modeGainsTimesModeShapes[25] = { -0.144338, -1.49012e-08, -4.3016e-09, 7.45058e-09, -0, -0.25,
-1.49012e-08, 4.77374e-16, -7.45058e-09, 0, -0.288675, 0, 4.3016e-09, 3.55271e-15, -0, -0.25,
1.49012e-08, -1.4512e-15, 7.45058e-09, 0, -0.144338, 1.49012e-08, -4.30159e-09, -7.45058e-09, -0 };
float modeDampingTermsString[5] = { -8.03847, -30, -60, -90, -111.962 };
float damping[5] = { 1, 1, 1, 1, 1 };
float modeFrequenciesArr[5] = { 71419.1, 266564, 533137, 799710, 994855 };
float modeDampingTermsExp[5] = { 0.447604, 0.0497871, 0.00247875, 0.00012341, 1.37263e-05 };
float modeDampingTermsExp2[5] = { -0.803847, -3, -6, -9, -11.1962 };
int main(int argc, char** argv) {
float subt = 0;
int subWriteIndex = 0;
auto now = std::chrono::high_resolution_clock::now();
while (true) {
curSample = 0;
for (int i = 0; i < 5; i++) {
//Slow version
//damping[i] = damping[i] * modeDampingTermsExp2[i];
//Fast version
damping[i] = damping[i] * modeDampingTermsExp[i];
float cosT = 2 * damping[i];
for (int m = 0; m < 5; m++) {
curSample += modeGainsTimesModeShapes[i * 5 + m] * cosT;
}
}
//t += tIncr;
bufferWriteIndex++;
//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
now = std::chrono::high_resolution_clock::now();
int idx = bufferWriteIndex;
cout << idx - subWriteIndex << endl;
subWriteIndex = idx;
}
}
}
UPDATE 1:
I changed tIncr to 0.1f to avoid a possible subnormal number as mentioned by @JaMit. I've also removed t += tIncr; from the calculation.
Full code:
#include <iostream>
#include <chrono>
using namespace std;
int bufferWriteIndex = 0;
float curSample = 0;
float t = 0;
float tIncr = 0.1f;
float modeGainsTimesModeShapes[25] = { -0.144338, -1.49012e-08, -4.3016e-09, 7.45058e-09, -0, -0.25,
-1.49012e-08, 4.77374e-16, -7.45058e-09, 0, -0.288675, 0, 4.3016e-09, 3.55271e-15, -0, -0.25,
1.49012e-08, -1.4512e-15, 7.45058e-09, 0, -0.144338, 1.49012e-08, -4.30159e-09, -7.45058e-09, -0 };
float modeDampingTermsString[5] = { -8.03847, -30, -60, -90, -111.962 };
float damping[5] = { 1, 1, 1, 1, 1 };
float modeFrequenciesArr[5] = { 71419.1, 266564, 533137, 799710, 994855 };
float modeDampingTermsExp[5];
int main(int argc, char** argv) {
/*
for (int m = 0; m < 5; m++) {
modeDampingTermsExp[m] = exp(modeDampingTermsString[m] * tIncr);
}*/
for (int m = 0; m < 5; m++) {
modeDampingTermsExp[m] = modeDampingTermsString[m] * tIncr;
}
//std::thread t1(audioStringSimCos);
//t1.detach();
float subt = 0;
int subWriteIndex = 0;
auto now = std::chrono::high_resolution_clock::now();
while (true) {
curSample = 0;
for (int i = 0; i < 5; i++) {
damping[i] = damping[i] * modeDampingTermsExp[i];
float cosT = 2 * damping[i] * cos(t * modeFrequenciesArr[i]);
for (int m = 0; m < 5; m++) {
curSample += modeGainsTimesModeShapes[i * 5 + m] * cosT;
}
}
//t += tIncr;
bufferWriteIndex++;
//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
now = std::chrono::high_resolution_clock::now();
int idx = bufferWriteIndex;
cout << idx - subWriteIndex << endl;
subWriteIndex = idx;
}
}
}
Now it runs faster WITH the exp in the initialization?
Output with exp:
7632423
7516857
6855266
6251330
7040232
6784555
7169865
7638150
7403717
7626824
7408493
7722998
around 7 million/s, and without exp:
1229743
1193849
1069924
1426083
1472080
1484318
1503082
1462985
1433372
1357586
1483370
1491731
1526445
1516673
1517916
1522941
1523948
1506818
around 1.5 million. This is confusing to me.
ORIGINAL POST:
It seems like the way I initialize modeDampingTermsExp in my small example has a huge impact on the performance of my calculations where I access it. Here is my minimal reproducible example:
#include <iostream>
#include <chrono>
using namespace std;
int bufferWriteIndex = 0;
float curSample = 0;
float t = 0;
float tIncr = 1.0f / 48000;
float modeGainsTimesModeShapes[25] = { -0.144338, -1.49012e-08, -4.3016e-09, 7.45058e-09, -0, -0.25,
-1.49012e-08, 4.77374e-16, -7.45058e-09, 0, -0.288675, 0, 4.3016e-09, 3.55271e-15, -0, -0.25,
1.49012e-08, -1.4512e-15, 7.45058e-09, 0, -0.144338, 1.49012e-08, -4.30159e-09, -7.45058e-09, -0 };
float modeDampingTermsString[5] = { -8.03847, -30, -60, -90, -111.962 };
float damping[5] = { 1, 1, 1, 1, 1 };
float modeFrequenciesArr[5] = { 71419.1, 266564, 533137, 799710, 994855 };
float modeDampingTermsExp[5];
int main(int argc, char** argv) {
/*
for (int m = 0; m < 5; m++) {
modeDampingTermsExp[m] = exp(modeDampingTermsString[m] * tIncr);
}*/
for (int m = 0; m < 5; m++) {
modeDampingTermsExp[m] = modeDampingTermsString[m] * tIncr;
}
//std::thread t1(audioStringSimCos);
//t1.detach();
float subt = 0;
int subWriteIndex = 0;
auto now = std::chrono::high_resolution_clock::now();
while (true) {
curSample = 0;
for (int i = 0; i < 5; i++) {
damping[i] = damping[i] * modeDampingTermsExp[i];
float cosT = 2 * damping[i] * cos(t * modeFrequenciesArr[i]);
for (int m = 0; m < 5; m++) {
curSample += modeGainsTimesModeShapes[i * 5 + m] * cosT;
}
}
t += tIncr;
bufferWriteIndex++;
//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
now = std::chrono::high_resolution_clock::now();
int idx = bufferWriteIndex;
cout << idx - subWriteIndex << endl;
subWriteIndex = idx;
}
}
}
When I initialize it like this
for (int m = 0; m < 5; m++) {
modeDampingTermsExp[m] = exp(modeDampingTermsString[m] * tIncr);
}
using the exp function, performance is about 10 times slower than with this:
for (int m = 0; m < 5; m++) {
modeDampingTermsExp[m] = modeDampingTermsString[m] * tIncr;
}
I measure the calculations per second using chrono right below the two nested for loops in the endless while(true) loop (snippet from the full example above):
//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
now = std::chrono::high_resolution_clock::now();
int idx = bufferWriteIndex;
cout << idx - subWriteIndex << endl;
subWriteIndex = idx;
}
Using the exp function, my program gives the following output for example:
538108
356659
356227
383885
389902
390405
391748
391375
388910
383791
391691
It stays at around 390k.
Using the other initialization without it, I get the following output:
3145299
3049618
2519474
2755627
2846730
2824666
2893591
3119401
3492762
3366317
3470675
3505168
3492805
3523005
3432182
3561458
3580840
3576725
around 3 - 3.5 million "samples" per second.
Why does the way I initialize the modeDampingTermsExp array impact performance later in the code where I access it? What am I missing here?
I am using Visual Studio 2019 with the following flags:
/O2 /Oi /Ot /fp:fast
Thank you very much!

what are some optimization tricks to make my code run faster

I'm moving outside my comfort zone and trying to make a random number distribution program, while also making sure it is still somewhat uniform.
here is my code
this is the RandomDistribution.h file
#pragma once
#include <vector>
#include <random>
#include <iostream>
static float randy(float low, float high) {
static std::random_device rd;
static std::mt19937 random(rd());
std::uniform_real_distribution<float> ran(low, high);
return ran(random);
}
typedef std::vector<float> Vfloat;
class RandomDistribution
{
public:
RandomDistribution();
RandomDistribution(float percent, float contents, int container);
~RandomDistribution();
void setvariables(float percent, float contents, int container);
Vfloat RunDistribution();
private:
float divider;
float _percent;
int jar_limit;
float _contents;
float _maxdistribution;
Vfloat Jar;
bool is0;
};
this is my RandomDistribution.cpp
#include "RandomDistribution.h"
RandomDistribution::RandomDistribution() {
}
RandomDistribution::RandomDistribution(float percent, float contents, int containers):_contents(contents),jar_limit(containers)
{
Jar.resize(containers);
if (percent < 0)
_percent = 0;
else {
_percent = percent;
}
divider = jar_limit * percent;
is0 = false;
}
RandomDistribution::~RandomDistribution()
{
}
void RandomDistribution::setvariables(float percent, float contents, int container) {
if (jar_limit != container)
Jar.resize(container);
_contents = contents;
jar_limit = container;
is0 = false;
if (percent < 0)
_percent = 0;
else {
_percent = percent;
}
divider = jar_limit * percent;
}
Vfloat RandomDistribution::RunDistribution() {
for (int i = 0; i < jar_limit; i++) {
if (!is0) {
if (i + 1 >= jar_limit || _contents < 2) {
Jar[i] = _contents;
_contents -= Jar[i];
is0 = true;
}
if (!_percent <= 0) {//making sure it does not get the whole container at once
_maxdistribution = (_contents / (divider)) * (i + 1);
}
else {
_maxdistribution = _contents;
}
Jar[i] = randy(0, _maxdistribution);
if (Jar[i] < 1) {
Jar[i] = 0;
continue;
}
_contents -= Jar[i];
}
else {
Jar[0];
}
//mixing Jar so it is randomly spaced out instead all at the top
int swapper = randy(0, i);
float hold = Jar[i];
Jar[i] = Jar[swapper];
Jar[swapper] = hold;
}
return Jar;
}
source code
int main(){
RandomDistribution distribution[100];
for (int i = 0; i < 100; i++) {
distribution[i] = {RandomDistribution(1.0f, 5000.0f, 2000) };
}
Vfloat k;
k.resize(200);
for (int i = 0; i < 10; i++) {
auto t3 = chrono::steady_clock::now();
for (int b = 0; b < 100; b++) {
k = distribution[b].RunDistribution();
distribution[b].setvariables(1.0f, 5000.0f, 2000);
}
auto t4 = chrono::steady_clock::now();
auto time_span = chrono::duration_cast<chrono::duration<double>>(t4 - t3);
cout << time_span.count() << " seconds\n";
}
}
What prints out is usually between 1 and 2 seconds for each cycle. I want to bring it down to a tenth of a second if possible, because this is only one step of the process and I want to run it a lot more than 100 times. What can I do to speed this up? Is there a trick or something I'm just missing here?
here is a sample of the time stamps
4.71113 seconds
1.35444 seconds
1.45008 seconds
1.74961 seconds
2.59192 seconds
2.76171 seconds
1.90149 seconds
2.2822 seconds
2.36768 seconds
2.61969 seconds
Cheinan Marks has some benchmarks and performance tips related to random generators & friends in his CppCon 2016 talk I Just Wanted a Random Integer! He mentions some fast generators as well, IIRC. I'd start there.
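Independent of the generator you pick, one cheap tweak is to construct the engine and a single fixed [0,1) distribution once and scale the result at the call site, instead of building a fresh std::uniform_real_distribution on every call. Here randy_fast is a hypothetical variant of the question's randy(), sketched under that assumption:

```cpp
#include <random>

// Hypothetical variant of the question's randy() helper: one engine and
// one fixed [0,1) distribution, created once; the result is scaled to
// [low, high) at the call site.
static float randy_fast(float low, float high) {
    static std::mt19937 random{std::random_device{}()};
    static std::uniform_real_distribution<float> ran01(0.0f, 1.0f);
    return low + ran01(random) * (high - low);
}
```

For the shuffle index (randy(0, i) truncated to int), std::uniform_int_distribution is the more natural tool. Also double-check that the timings come from an optimized (Release / -O2) build; unoptimized builds with checked iterators can easily dominate numbers in this range.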

Eigen: simplifying expression with Eigen intrinsics

I'm trying to scale all the columns in a matrix with a corresponding value from a vector. Where this value is 0, I want to replace that column with a column from an other matrix scaled by a constant. Sounds complicated, but in Matlab it's pretty simple (but probably not fully optimized):
a(:,b ~= 0) = a(:,b ~= 0)./b(b ~= 0);
a(:,b == 0) = c(:,b == 0)*x;
Doing it with a for loop in C++ would also be pretty simple:
RowVectorXf b;
Matrix3Xf a, c;
float x;
for (int i = 0; i < b.size(); i++) {
if (b(i) != 0) {
a.col(i) = a.col(i) / b(i);
} else {
a.col(i) = c.col(i) * x;
}
}
Is there a possibility to do this operation (faster) with Eigen intrinsics such as colwise and select?
p.s. I tried to shorten the if condition to the form
a.col(i) = (b(i) != 0) ? (a.col(i) / b(i)) : (c.col(i) * x);
But this does not compile, failing with error: operands to ?: have different types ... (long listing of the types).
Edit:
I added the code for testing the answers, here it is:
#include <Eigen/Dense>
#include <stdlib.h>
#include <chrono>
#include <iostream>
using namespace std;
using namespace Eigen;
void flushCache()
{
const int size = 20 * 1024 * 1024; // Allocate 20M. Set much larger than L2
volatile char *c = (char *) malloc(size);
volatile int i = 8;
for (volatile int j = 0; j < size; j++)
c[j] = i * j;
free((void*) c);
}
int main()
{
Matrix3Xf a(3, 1000000);
RowVectorXf b(1000000);
Matrix3Xf c(3, 1000000);
float x = 0.4;
a.setRandom();
b.setRandom();
c.setRandom();
for (int testNumber = 0; testNumber < 4; testNumber++) {
flushCache();
chrono::high_resolution_clock::time_point t1 = chrono::high_resolution_clock::now();
for (int repetition = 0; repetition < 1000; repetition++) {
switch (testNumber) {
case 0:
for (int i = 0; i < b.size(); i++) {
if (b(i) != 0) {
a.col(i) = a.col(i) / b(i);
} else {
a.col(i) = c.col(i) * x;
}
}
break;
case 1:
for (int i = 0; i < b.size(); i++) {
a.col(i) = (b(i) != 0) ? (a.col(i) / b(i)).eval() : (c.col(i) * x).eval();
}
break;
case 2:
for (int i = 0; i < b.size(); i++) {
a.col(i) = (b(i) != 0) ? (a.col(i) * (1.0f / b(i))) : (c.col(i) * x);
}
break;
case 3:
a = b.cwiseEqual(0.0f).replicate< 3, 1 >().select(c * x, a.cwiseQuotient(b.replicate< 3, 1 >()));
break;
default:
break;
}
}
chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast< chrono::milliseconds >(t2 - t1).count();
cout << "duration: " << duration << "ms" << endl;
}
return 0;
}
Sample output is:
duration: 14391ms
duration: 15219ms
duration: 9148ms
duration: 13513ms
By the way, when I don't use setRandom to initialize the variables, the output is totally different:
duration: 10255ms
duration: 11076ms
duration: 8250ms
duration: 5198ms
@chtz suggests it's because of denormalized values, but I think it's because of branch prediction. Evidence for branch prediction is that initializing with b.setZero(); leads to the same timings as not initializing at all.
a.col(i) = (b(i) != 0) ? (a.col(i) * (1.0f/b(i))) : (c.col(i) * x);
would work, but only because the expressions would then be of the same type, and it will likely not save any time (a ?: expression is essentially translated to the same code as an if-else branch).
If you prefer writing it into one line, the following expression should work:
a = b.cwiseEqual(0.0f).replicate<3,1>().select(c*x, a.cwiseQuotient(b.replicate<3,1>()));
Again, I doubt it will make any significant performance difference.

compute pi value using monte carlo method multithreading

I am trying to find the value of PI using the Monte Carlo method in parallel C code. I have written serial code that works fine, but the parallel code gives me wrong values of pi, sometimes 0 or negative values.
My code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define NUM_THREADS 4 //number of threads
#define TOT_COUNT 10000055 //total number of iterations
void *doCalcs(void *threadid)
{
long longTid;
longTid = (long)threadid;
int tid = (int)longTid; //obtain the integer value of thread id
//using malloc for the return variable in order make
//sure that it is not destroyed once the thread call is finished
float *in_count = (float *)malloc(sizeof(float));
*in_count=0;
unsigned int rand_state = rand();
//get the total number of iterations for a thread
float tot_iterations= TOT_COUNT/NUM_THREADS;
int counter=0;
//calculation
for(counter=0;counter<tot_iterations;counter++){
//float x = (double)random()/RAND_MAX;
//float y = (double)random()/RAND_MAX;
//float result = sqrt((x*x) + (y*y));
double x = rand_r(&rand_state) / ((double)RAND_MAX + 1) * 2.0 - 1.0;
double y = rand_r(&rand_state) / ((double)RAND_MAX + 1) * 2.0 - 1.0;
float result = sqrt((x*x) + (y*y));
if(result<1){
*in_count+=1; //check if the generated value is inside a unit circle
}
}
//get the remaining iterations calculated by thread 0
if(tid==0){
float remainder = TOT_COUNT%NUM_THREADS;
for(counter=0;counter<remainder;counter++){
float x = (double)random()/RAND_MAX;
float y = (double)random()/RAND_MAX;
float result = sqrt((x*x) + (y*y));
if(result<1){
*in_count+=1; //check if the generated value is inside a unit circle
}
}
}
}
int main(int argc, char *argv[])
{
pthread_t threads[NUM_THREADS];
int rc;
long t;
void *status;
float tot_in=0;
for(t=0;t<NUM_THREADS;t++){
rc = pthread_create(&threads[t], NULL, doCalcs, (void *)t);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
//join the threads
for(t=0;t<NUM_THREADS;t++){
pthread_join(threads[t], &status);
//printf("Return from thread %ld is : %f\n",t, *(float*)status);
tot_in+=*(float*)status; //keep track of the total in count
}
printf("Value for PI is %f \n",1, 4*(tot_in/TOT_COUNT));
/* Last thing that main() should do */
pthread_exit(NULL);
}
This is a solution using async and future as suggested by @vladon.
#include <iostream>
#include <vector>
#include <random>
#include <future>
using namespace std;
long random_circle_sampling(long n_samples){
std::random_device rd; //Will be used to obtain a seed for the random number engine
std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd()
std::uniform_real_distribution<> dis(0.0, 1.0);
long points_inside = 0;
for(long i = 0; i < n_samples; ++i){
double x = dis(gen);
double y = dis(gen);
if(x*x + y*y <= 1.0){
++points_inside;
}
}
return points_inside;
}
double approximate_pi(long tot_samples, int n_threads){
long samples_per_thread = tot_samples / n_threads;
// Used to store the future results
vector<future<long>> futures;
for(int t = 0; t < n_threads; ++t){
// Start a new asynchronous task
futures.emplace_back(async(launch::async, random_circle_sampling, samples_per_thread));
}
long tot_points_inside = 0;
for(future<long>& f : futures){
// Wait for the result to be ready
tot_points_inside += f.get();
}
double pi = 4.0 * (double) tot_points_inside / (double) tot_samples;
return pi;
}
int main() {
cout.precision(32);
long tot_samples = 1e6;
int n_threads = 8;
double pi = 3.14159265358979323846;
double approx_pi = approximate_pi(tot_samples, n_threads);
double abs_diff = abs(pi - approx_pi);
cout << "pi\t\t" <<pi << endl;
cout << "approx_pi\t" <<approx_pi << endl;
cout << "abs_diff\t" <<abs_diff << endl;
return 0;
}
You can simply run it with:
$ g++ -std=c++11 -O3 pi.cpp -o pi && time ./pi
pi 3.1415926535897931159979634685442
approx_pi 3.1427999999999998159694314381341
abs_diff 0.0012073464102066999714679695898667
./pi 0.04s user 0.00s system 27% cpu 0.163 total
Your code is not C++, it's bad, very bad plain old C.
That is C++:
#include <cmath>
#include <iostream>
#include <numeric>
#include <random>
#include <thread>
#include <vector>
constexpr auto num_threads = 4; //number of threads
constexpr auto total_count = 10000055; //total number of iterations
void doCalcs(int total_iterations, int & in_count_result)
{
auto seed = std::random_device{}();
auto gen = std::mt19937{ seed };
auto dist = std::uniform_real_distribution<>{0, 1};
auto in_count{ 0 };
//calculation
for (auto counter = 0; counter < total_iterations; ++counter) {
auto x = dist(gen);
auto y = dist(gen);
auto result = std::sqrt(std::pow(x, 2) + std::pow(y, 2));
if (result < 1) {
++in_count; //check if the generated value is inside a unit circle
}
}
in_count_result = in_count;
}
int main()
{
std::vector<std::thread> threads;
threads.reserve(num_threads); // reserve only: emplace_back below adds the real threads
std::vector<int> in_count(num_threads);
for (size_t i = 0; i < num_threads; ++i) {
int total_iterations = total_count / num_threads;
if (i == 0) {
total_iterations += total_count % num_threads; // get the remaining iterations calculated by thread 0
}
threads.emplace_back(doCalcs, total_iterations, std::ref(in_count[i]));
}
for (auto & thread : threads) {
if (thread.joinable()) {
thread.join();
}
}
double pi_value = 4.0 * static_cast<double>(std::accumulate(in_count.begin(), in_count.end(), 0)) / static_cast<double>(total_count);
std::cout << "Value of PI is: " << pi_value << std::endl;
}
P.S. And it is also not that good, read about futures, promises and std::async.
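As for why the posted pthread version prints garbage in the first place: doCalcs() falls off the end without return in_count; (or pthread_exit(in_count);), so the pointer that pthread_join() stores into status is indeterminate, and the final printf() also passes a stray 1 before the value it formats with %f. A minimal sketch of just the join-side fix, with the sampling loop reduced to a trivial payload (joinOneResult is my own helper, not from the question):

```cpp
#include <pthread.h>
#include <cstdlib>

// Stand-in for the sampling loop: the thread just produces a fixed count.
void *doCalcs(void *)
{
    float *in_count = (float *)malloc(sizeof(float));
    *in_count = 42.0f;  // imagine this is the in-circle count from the loop
    return in_count;    // the question's version falls off the end instead,
                        // so pthread_join() hands back an indeterminate pointer
}

// Launches one thread and retrieves its result the way main() should.
float joinOneResult()
{
    pthread_t thread;
    void *status = NULL;
    pthread_create(&thread, NULL, doCalcs, NULL);
    pthread_join(thread, &status); // receives the pointer returned by doCalcs
    float result = *(float *)status;
    free(status);
    return result;
}
```

With the totals retrieved this way, the print call should then be printf("Value for PI is %f\n", 4 * (tot_in / TOT_COUNT)); with no extra argument.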

Simple test to measure cache lines size

Starting from this article - Gallery of Processor Cache Effects by Igor Ostrovsky - I wanted to play with his examples on my own machine.
This is my code for the first example, that looks at how touching different cache lines affect running time:
#include <iostream>
#include <time.h>
using namespace std;
int main(int argc, char* argv[])
{
int step = 1;
const int length = 64 * 1024 * 1024;
int* arr = new int[length];
timespec t0, t1;
clock_gettime(CLOCK_REALTIME, &t0);
for (int i = 0; i < length; i += step)
arr[i] *= 3;
clock_gettime(CLOCK_REALTIME, &t1);
long int duration = (t1.tv_nsec - t0.tv_nsec);
if (duration < 0)
duration = 1000000000 + duration;
cout<< step << ", " << duration / 1000 << endl;
return 0;
}
Using various values for step, I don't see the jump in the running time:
step, microseconds
1, 451725
2, 334981
3, 287679
4, 261813
5, 254265
6, 246077
16, 215035
32, 207410
64, 202526
128, 197089
256, 195154
I would expect to see something similar to the chart from the article (image omitted here).
But from 16 onwards, the running time is halved each time we double the step.
I tested it on Ubuntu 13 with a Xeon X5450, compiling with: g++ -O0.
Is something flawed with my code, or the results are actually ok?
Any insight on what I'm missing would be highly appreciated.
As I understand it, you want to observe the effect of the cache line size; for that I recommend the tool cachegrind, part of the valgrind tool set. Your approach is right, but the measurements are not close to the expected results yet.
#include <iostream>
#include <time.h>
#include <stdlib.h>
using namespace std;
int main(int argc, char* argv[])
{
int step = atoi(argv[1]);
const int length = 64 * 1024 * 1024;
int* arr = new int[length];
for (int i = 0; i < length; i += step)
arr[i] *= 3;
return 0;
}
Run valgrind --tool=cachegrind ./a.out $cacheline-size and you should see the results. After plotting them you will get the desired curve with good accuracy. Happy experimenting!
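Independently of cachegrind, two details of the timing harness are worth fixing before reading too much into the numbers: at -O0 the loop bookkeeping can swamp the memory effect (try -O2 and make sure the array writes aren't optimized away), and the duration math only looks at tv_nsec, so any interval crossing a second boundary relies on the negative-value patch. A sketch of a duration helper handling both fields (the name elapsed_ns is my own):

```cpp
#include <time.h>

// Elapsed nanoseconds between two clock_gettime() samples. Unlike the
// question's version, which subtracts only tv_nsec and patches negative
// results afterwards, this accounts for the seconds field, so intervals
// longer than one second also come out right.
long long elapsed_ns(const timespec &t0, const timespec &t1)
{
    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (long long)(t1.tv_nsec - t0.tv_nsec);
}
```

CLOCK_MONOTONIC is also a better fit than CLOCK_REALTIME here, since it cannot jump if the system clock is adjusted mid-measurement.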
public class CacheLine {
public static void main(String[] args) {
CacheLine cacheLine = new CacheLine();
cacheLine.startTesting();
}
private void startTesting() {
byte[] array = new byte[128 * 1024];
for (int testIndex = 0; testIndex < 10; testIndex++) {
testMethod(array);
System.out.println("--------- // ---------");
}
}
private void testMethod(byte[] array) {
for (int len = 8192; len <= array.length; len += 8192) {
long t0 = System.nanoTime();
for (int i = 0; i < 10000; i++) {
for (int k = 0; k < len; k += 64) {
array[k] = 1;
}
}
long dT = System.nanoTime() - t0;
System.out.println("len: " + len / 1024 + " dT: " + dT + " dT/stepCount: " + (dT) / len);
}
}
}
This code helps you determine the L1 data cache size. You can read about it in more detail here: https://medium.com/@behzodbekqodirov/threading-in-java-194b7db6c1de#.kzt4w8eul