Program runs slower when run a second time after recompilation - C++

The performance of a simple program (generate 1,200,000 unique, randomly shuffled integers, then sort them) is worse when I run it from Qt Creator a second time after recompilation (and on every subsequent run until the next recompilation).
#include <iostream>
#include <random>
#include <algorithm>
#include <chrono>
#include <iterator>
#include <cstdint>
#include <utility> // for std::exchange

using size_type = std::uint32_t;

alignas(64) size_type v[1200000];

// the behaviour really does not depend on CPU affinity
#ifdef __linux__
#include <sched.h>
#endif

int main()
{
#ifdef __linux__
    {
        cpu_set_t m;
        int status;
        CPU_ZERO(&m);
        CPU_SET(0, &m);
        status = sched_setaffinity(0, sizeof(m), &m);
        if (status != 0) {
            perror("sched_setaffinity");
        }
    }
#endif
    std::mt19937 g(0);
    for (size_type i = 1; i < std::size(v); ++i) {
        v[i] = std::exchange(v[g() % i], i);
    }
    for (size_type i = 0; i < 10; ++i) { // the first output does not depend on the number of iterations
        auto start = std::chrono::high_resolution_clock::now();
        std::sort(std::begin(v), std::end(v));
        std::cout << std::chrono::duration_cast< std::chrono::microseconds >(std::chrono::high_resolution_clock::now() - start).count() << std::endl;
    }
}
Say, the first time it prints:
97896
26069
25628
25771
25863
25722
25976
25855
25687
25735
and then:
137238
35056
34880
34468
34746
27309
25781
25932
25502
25383
yet another run (and all further runs look like the second and third):
137648
35086
34966
26005
26305
26435
25683
25440
25981
25632
If I recompile the program, the whole pattern repeats.
If I recompile the program and run it from the console, then every run starts from a value near 137000, even the first one, and looks like this:
137207
35059
35035
34844
34563
34586
34466
34132
34327
34487
If it matters, I build and run the above program on Ubuntu Desktop 16.04.3 64-bit, on an AMD A10-7800 Radeon R7 (12 Compute Cores 4C+8G) with 8 GB RAM and an SSD, without root privileges and without a debugger attached. I use g++-7 -m32 -march=native -mtune=native -O3, gold, and ccache.
I expected the opposite, perhaps because of branch-prediction caching or some other caching (if that is even possible between consecutive runs of the same binary), but the results are discouraging.

Related

C++ with OpenMP try to avoid the false sharing for tight looped array

I am trying to introduce OpenMP into my C++ code to improve performance, using a simple case as shown below:
#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>

using std::cout;
using std::endl;

#define NUM 100000

int main()
{
    double data[NUM] __attribute__ ((aligned (128)));
#ifdef _OPENMP
    auto t1 = omp_get_wtime();
#else
    auto t1 = std::chrono::steady_clock::now();
#endif
    for(long int k=0; k<100000; ++k)
    {
        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; ++i)
        {
            data[i] = cos(sin(i*i + k*k));
        }
    }
#ifdef _OPENMP
    auto t2 = omp_get_wtime();
    auto duration = t2 - t1;
    cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
#else
    auto t2 = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
#endif
    double tempsum = 0.;
    for(long int i=0; i<NUM; ++i)
    {
        int nextind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[nextind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;
    return 0;
}
The loop accesses a tightly packed array and changes its elements in either a parallel or a non-parallel way.
Build as
g++ -o test test.cpp
or
g++ -o test test.cpp -fopenmp
The program reported results as:
No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09
OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09
Intel 10th-generation CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.
I suspect that false sharing in the cache line leads to the bad performance of the OpenMP version.
Update: I have re-run the test after increasing the loop size, and the OpenMP version runs faster, as expected.
Thank you!
Since you're writing C++, use the C++ random number generator, which is threadsafe, unlike the C legacy one you're using.
Also, you're not using your data array, so the compiler is actually at liberty to remove your loop completely.
You should touch all your data once before you do the timed loop. That way you ensure that pages are instantiated and that data is in or out of cache, as appropriate.
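For instance, a warm-up pass over the array, placed just before t1 is taken in the code above, might look like this (a minimal sketch assuming the data array and NUM from the question):
// Warm-up pass: touch every element once before starting the timer so that
// the pages backing the array are faulted in and the cache state is established.
for (long int i = 0; i < NUM; ++i)
    data[i] = 0.0;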
Your loop is pretty short.
rand() is not thread-safe (see here). Use an array of C++ random-number generators instead, one for each thread. See std::uniform_int_distribution for details.
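A minimal sketch of what that could look like (the names and seeding scheme here are illustrative, not from the original code; each thread uses its own engine, so no generator state is shared):
#include <omp.h>
#include <random>
#include <vector>

int main()
{
    const int nthreads = omp_get_max_threads();

    // One independently seeded engine per thread, so no generator state is shared.
    std::vector<std::mt19937> engines;
    for (int t = 0; t < nthreads; ++t)
        engines.emplace_back(1234 + t); // any per-thread seed scheme works

    std::vector<int> out(100000);
    #pragma omp parallel
    {
        std::mt19937& gen = engines[omp_get_thread_num()];
        std::uniform_int_distribution<int> dist(0, 99);
        #pragma omp for schedule(static)
        for (long i = 0; i < (long)out.size(); ++i)
            out[i] = dist(gen);
    }
    return 0;
}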
You can drop the #ifdef _OPENMP variations in your code. In a Bash terminal, you can call your application as OMP_NUM_THREADS=1 test. See here for details.
You can then remove num_threads(4) as well, because the degree of parallelism can be specified explicitly from outside.
Use Google Benchmark or command-line parameters so you can parameterize the number of threads and array size.
From here, I expect you will see:
The performance when you call OMP_NUM_THREADS=1 test is close to your non-OpenMP version.
The array of C++ RNG generators is faster than calling rand() from multiple threads.
The multi-threaded version is still slower than the single-threaded version when using a 10,000 element array.

How to improve the speed of merkle root calculation in C++?

I am trying to optimise the Merkle root calculation as much as possible. So far, I have implemented it in Python, which resulted in this question and the suggestion to rewrite it in C++.
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>

std::vector<unsigned char> double_sha256(std::vector<unsigned char> a, std::vector<unsigned char> b)
{
    unsigned char inp[64];
    int j=0;
    for (int i=0; i<32; i++)
    {
        inp[j] = a[i];
        j++;
    }
    for (int i=0; i<32; i++)
    {
        inp[j] = b[i];
        j++;
    }
    const EVP_MD *md_algo = EVP_sha256();
    unsigned int md_len = EVP_MD_size(md_algo);
    std::vector<unsigned char> out( md_len );
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<std::vector<unsigned char> > calculate_merkle_root(std::vector<std::vector<unsigned char> > inp_list)
{
    std::vector<std::vector<unsigned char> > out;
    int len = inp_list.size();
    if (len == 1)
    {
        out.push_back(inp_list[0]);
        return out;
    }
    for (int i=0; i<len-1; i+=2)
    {
        out.push_back(
            double_sha256(inp_list[i], inp_list[i+1])
        );
    }
    if (len % 2 == 1)
    {
        out.push_back(
            double_sha256(inp_list[len-1], inp_list[len-1])
        );
    }
    return calculate_merkle_root(out);
}

int main()
{
    std::ifstream infile("txids.txt");
    std::vector<std::vector<unsigned char> > txids;
    std::string line;
    int count = 0;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        std::vector<unsigned char> buf2;
        for (int i=31; i>=0; i--)
        {
            buf2.push_back(
                buf[i]
            );
        }
        txids.push_back(
            buf2
        );
        count++;
    }
    infile.close();
    std::cout << count << std::endl;
    std::vector<std::vector<unsigned char> > merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    std::vector<unsigned char> out0 = merkle_root_hash[0];
    std::vector<unsigned char> out;
    for (int i=31; i>=0; i--)
    {
        out.push_back(
            out0[i]
        );
    }
    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');
    return 0;
}
However, the performance is worse compared to the Python implementation (~4s):
$ g++ test.cpp -L/usr/local/opt/openssl/lib -I/usr/local/opt/openssl/include -lcrypto
$ time ./a.out
1452
289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e
real 0m9.245s
user 0m9.235s
sys 0m0.008s
The complete implementation and the input file are available here: test.cpp and txids.txt.
How can I improve the performance? Are the compiler optimizations enabled by default? Are there faster sha256 libraries than openssl available?
There are plenty of things you can do to optimize the code.
Here is the list of the important points:
compiler optimizations need to be enabled (using -O3 in GCC);
std::array can be used instead of the slower dynamically-sized std::vector (since the size of a hash is 32), one can even define a new Hash type for clarity;
parameters should be passed by reference (C++ passes parameters by copy by default);
the C++ vectors can be reserved to pre-allocate the memory space and avoid unneeded copies;
OPENSSL_free must be called to release the allocated memory of OPENSSL_hexstr2buf;
push_back should be avoided when the size is a constant known at compile time;
using std::copy is often faster (and cleaner) than a manual copy;
std::reverse is often faster (and cleaner) than a manual loop;
the size of a hash is supposed to be 32, but one can check that using assertions to be sure it is fine;
count is not needed, as it is just the size of the txids vector.
Here is the resulting code:
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <streambuf>
#include <sstream>
#include <cstring>
#include <array>
#include <algorithm>
#include <cassert>
#include <openssl/evp.h>
#include <openssl/sha.h>
#include <openssl/crypto.h>

using Hash = std::array<unsigned char, 32>;

Hash double_sha256(const Hash& a, const Hash& b)
{
    assert(a.size() == 32 && b.size() == 32);
    unsigned char inp[64];
    std::copy(a.begin(), a.end(), inp);
    std::copy(b.begin(), b.end(), inp+32);
    const EVP_MD *md_algo = EVP_sha256();
    assert(EVP_MD_size(md_algo) == 32);
    unsigned int md_len = 32;
    Hash out;
    EVP_Digest(inp, 64, out.data(), &md_len, md_algo, nullptr);
    EVP_Digest(out.data(), md_len, out.data(), &md_len, md_algo, nullptr);
    return out;
}

std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
    std::vector<Hash> out;
    int len = inp_list.size();
    out.reserve(len/2+2);
    if (len == 1)
    {
        out.push_back(inp_list[0]);
        return out;
    }
    for (int i=0; i<len-1; i+=2)
    {
        out.push_back(double_sha256(inp_list[i], inp_list[i+1]));
    }
    if (len % 2 == 1)
    {
        out.push_back(double_sha256(inp_list[len-1], inp_list[len-1]));
    }
    return calculate_merkle_root(out);
}

int main()
{
    std::ifstream infile("txids.txt");
    std::vector<Hash> txids;
    std::string line;
    while (std::getline(infile, line))
    {
        unsigned char* buf = OPENSSL_hexstr2buf(line.c_str(), nullptr);
        Hash buf2;
        std::copy(buf, buf+32, buf2.begin());
        std::reverse(buf2.begin(), buf2.end());
        txids.push_back(buf2);
        OPENSSL_free(buf);
    }
    infile.close();
    std::cout << txids.size() << std::endl;
    std::vector<Hash> merkle_root_hash;
    for (int k=0; k<1000; k++)
    {
        merkle_root_hash = calculate_merkle_root(txids);
    }
    Hash out0 = merkle_root_hash[0];
    Hash out = out0;
    std::reverse(out.begin(), out.end());
    static const char alpha[] = "0123456789abcdef";
    for (int i=0; i<32; i++)
    {
        unsigned char c = out[i];
        std::cout << alpha[ (c >> 4) & 0xF];
        std::cout << alpha[ c & 0xF];
    }
    std::cout.put('\n');
    return 0;
}
On my machine, this code is 3 times faster than the initial version and 2 times faster than the Python implementation.
This implementation spends >98% of its time in EVP_Digest. As a result, if you want faster code, you could try to find a faster hashing library, although OpenSSL should already be pretty fast. The current code already manages to compute 1.7 million hashes per second sequentially on a mainstream CPU, which is quite good. Alternatively, you can also parallelize the program using OpenMP (this is roughly 5 times faster on my 6-core machine).
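As an illustration, the pairing loop inside calculate_merkle_root could be parallelized along these lines (a rough sketch, not the exact code benchmarked above; it reuses the Hash type and double_sha256 from the answer and needs -fopenmp):
std::vector<Hash> calculate_merkle_root(const std::vector<Hash>& inp_list)
{
    int len = inp_list.size();
    if (len == 1)
        return {inp_list[0]};

    // One output hash per pair (plus one more if the input size is odd).
    std::vector<Hash> out((len + 1) / 2);

    // The pairs are independent, so each double_sha256 call can run in parallel.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < len - 1; i += 2)
        out[i / 2] = double_sha256(inp_list[i], inp_list[i + 1]);

    if (len % 2 == 1)
        out.back() = double_sha256(inp_list[len - 1], inp_list[len - 1]);

    return calculate_merkle_root(out);
}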
I decided to implement the Merkle root and SHA-256 computation from scratch, with a full SHA-256 implementation, using a SIMD (Single Instruction, Multiple Data) approach targeting SSE2, AVX2, and AVX512.
My code below, in the AVX2 case, is 3.5x faster than the OpenSSL version and 7.3x faster than Python's hashlib implementation.
Here I provide the C++ implementation; I also made a Python implementation with the same speed (because it uses the C++ code at its core); for the Python implementation see the related post. The Python implementation is definitely easier to use than the C++ one.
My code is quite complex, both because it contains a full SHA-256 implementation and because it has a class abstracting the SIMD operations, plus many tests.
First I provide timings, measured on Google Colab because they have a fairly advanced AVX2 processor there:
MerkleRoot-Ossl 1274 ms
MerkleRoot-Simd-GEN-1 1613 ms
MerkleRoot-Simd-GEN-2 1795 ms
MerkleRoot-Simd-GEN-4 788 ms
MerkleRoot-Simd-GEN-8 423 ms
MerkleRoot-Simd-SSE2-1 647 ms
MerkleRoot-Simd-SSE2-2 626 ms
MerkleRoot-Simd-SSE2-4 690 ms
MerkleRoot-Simd-AVX2-1 407 ms
MerkleRoot-Simd-AVX2-2 403 ms
MerkleRoot-Simd-AVX2-4 489 ms
Ossl refers to the OpenSSL implementation; the rest are my implementations. AVX512 gives an even greater speed improvement; it is not tested here because Colab has no AVX512 support. The actual speed improvement depends on the processor's capabilities.
Compilation is tested both on Windows (MSVC) and Linux (Clang), using the following commands:
Windows with OpenSSL support: cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1 -DSHS_HAS_OPENSSL=1 /MD -Id:/bin/OpenSSL/include/ /link /LIBPATH:d:/bin/OpenSSL/lib/ libcrypto_static.lib libssl_static.lib Advapi32.lib User32.lib Ws2_32.lib (provide your own directory with installed OpenSSL). If OpenSSL support is not needed, use cl.exe /O2 /GL /Z7 /EHs /std:c++latest sha256_simd.cpp -DSHS_HAS_AVX2=1. Instead of AVX2 you may also use SSE2 or AVX512. Windows OpenSSL can be downloaded from here.
On Linux, Clang compilation is done through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe -DSHS_HAS_OPENSSL=1 -lssl -lcrypto if OpenSSL is needed, and otherwise through clang++-12 -march=native -g -m64 -O3 -std=c++20 sha256_simd.cpp -o sha256_simd.exe. As you can see, the most recent clang-12 is used; to install it, run bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)" (this command is described here). The Linux version automatically detects the current CPU architecture and uses the best SIMD instruction set.
My code needs C++20 support, as it uses some advanced features to make the implementation easier.
I implemented OpenSSL support in my library only to compare timings and show that my AVX2 version is 3-3.5x faster.
I am also providing timings made on GodBolt, but only as an example of AVX-512 usage, as GodBolt's CPUs have advanced AVX-512 support. Don't use GodBolt to actually measure timings, because the timings there jump up and down by as much as 5x, seemingly because of active process eviction by the operating system. I am also providing a GodBolt link as a playground (this link may have slightly outdated code; use the newest link to the code at the bottom of my post):
MerkleRoot-Ossl 2305 ms
MerkleRoot-Simd-GEN-1 2982 ms
MerkleRoot-Simd-GEN-2 3078 ms
MerkleRoot-Simd-GEN-4 1157 ms
MerkleRoot-Simd-GEN-8 781 ms
MerkleRoot-Simd-GEN-16 349 ms
MerkleRoot-Simd-SSE2-1 387 ms
MerkleRoot-Simd-SSE2-2 769 ms
MerkleRoot-Simd-SSE2-4 940 ms
MerkleRoot-Simd-AVX2-1 251 ms
MerkleRoot-Simd-AVX2-2 253 ms
MerkleRoot-Simd-AVX2-4 777 ms
MerkleRoot-Simd-AVX512-1 257 ms
MerkleRoot-Simd-AVX512-2 741 ms
MerkleRoot-Simd-AVX512-4 961 ms
Examples of how to use my code can be seen inside the Test() function, which exercises all the functionality of my library. My code is a bit dirty because I didn't want to spend much time creating a beautiful library; rather, I just wanted a proof of concept that a SIMD-based implementation can be considerably faster than the OpenSSL version.
If you really want to use my boosted SIMD-based version instead of OpenSSL, you care about speed very much, and you have questions about how to use it, please ask me in the comments or in chat.
I also didn't bother implementing a multi-core/multi-threaded version; I think it is obvious how to do that, and you can and should implement it without difficulty.
I am providing an external link to the code below, because my code is around 51 KB in size, which exceeds the 30 KB of text allowed for a Stack Overflow post.
sha256_simd.cpp

OpenMP 4.5 won't offload to GPU with target directive

I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below, it shows me that it can see the 8 GPUs, but when I try to offload, it reports that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>

#define n 10000
#define m 10000

using namespace std;

int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();
    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;
    for (int iter = 0; iter < iter_max; iter++){
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j){
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++){
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }
    if (target[0]){
        cout << "not on GPU" << endl;
    } else {
        cout << "On GPU" << endl;
    }
    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time, target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked on Nvidia's PGI compiler. I find PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free, and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
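A minimal, self-contained sketch of that idea (illustrative only; it assumes device 0 is a GPU and adds a map clause so the flag is copied back for inspection):
#include <omp.h>
#include <iostream>

int main()
{
    int on_host = 1;

    // Request device 0 explicitly and copy the flag back to the host.
    #pragma omp target device(0) map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }

    std::cout << (on_host ? "not on GPU" : "On GPU") << std::endl;
    return 0;
}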

How do you find out what parts of code are creating the most virtual memory?

I have a program that starts up and, within about 5 minutes, the virtual size of the process is about 13 GB. It runs on Linux and uses Boost, the GNU C++ library, and various other third-party libraries.
After 5 minutes the virtual size stays at 13 GB and the RSS holds steady at around 5 GB.
I can't just run it in a debugger, because at startup about 30 threads are started, each of which starts running its own code that does various allocations. So stepping through and checking virtual memory at different parts of the code at each breakpoint is not feasible.
I thought of changing the program to start each thread one at a time to make it easier to track memory allocation, but before doing this: are there any good tools?
Valgrind is fairly slow, maybe tcmalloc could provide the info?
I would use valgrind (perhaps run it an entire night) or else use Boehm GC.
Alternatively, use the proc(5) filesystem (e.g. through /proc/$pid/statm and /proc/$pid/maps) to understand when a lot of memory gets allocated.
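A minimal sketch of reading /proc/self/statm from inside the program (Linux-specific; the field order is documented in proc(5), and the function name here is just illustrative):
#include <cstdio>

// Print the current virtual size and resident set size, in pages,
// by reading /proc/self/statm (Linux only).
void print_memory_usage(const char* label)
{
    long vm_pages = 0, rss_pages = 0;
    if (FILE* f = std::fopen("/proc/self/statm", "r")) {
        std::fscanf(f, "%ld %ld", &vm_pages, &rss_pages);
        std::fclose(f);
    }
    std::fprintf(stderr, "%s: vsize=%ld pages, rss=%ld pages\n",
                 label, vm_pages, rss_pages);
}
Call it at interesting points (e.g. after each subsystem or thread starts) to see which step makes the numbers grow.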
The most important thing is to find memory leaks. If the memory doesn't grow after startup, it is less of an issue.
Perhaps adding instance counters to each class might help (use atomic integers or mutexes to serialize them).
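For example (a sketch; the class and counter names are arbitrary, and the copy constructor is counted too):
#include <atomic>
#include <cstdio>

class BigThing
{
    static std::atomic<long> live_count;   // how many instances currently exist
public:
    BigThing()                 { ++live_count; }
    BigThing(const BigThing&)  { ++live_count; }
    ~BigThing()                { --live_count; }
    static long count() { return live_count.load(); }
};
std::atomic<long> BigThing::live_count{0};

// Periodically (or from a monitoring thread) print the count:
//   std::printf("BigThing instances: %ld\n", BigThing::count());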
If the program's source code is big (e.g. a million source lines), so that spending several days or weeks is worth the effort, perhaps customizing the GCC compiler (e.g. with MELT) might be relevant.
a std::set minibenchmark
You mentioned a big std::set based upon a million rows.
#include <set>
#include <string>
#include <string.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <time.h>

class MyElem
{
    int _n;
    char _s[16-sizeof(_n)];
public:
    MyElem(int k) : _n(k)
    {
        snprintf (_s, sizeof(_s), "%d", k);
    };
    ~MyElem()
    {
        _n=0;
        memset(_s, 0, sizeof(_s));
    };
    int n() const
    {
        return _n;
    };
    std::string str() const
    {
        return std::string(_s);
    };
    bool less(const MyElem&x) const
    {
        return _n < x._n;
    };
};

bool operator < (const MyElem& l, const MyElem& r)
{
    return l.less(r);
}

typedef std::set<MyElem> MySet;

void bench (int cnt, MySet& set)
{
    for (long i=0; i<(long)cnt*1024; i++)
        set.insert(MyElem(i));
    time_t now = 0;
    time (&now);
    set.insert (((now) & 0xfffffff) * 100);
}

int main (int argc, char** argv)
{
    MySet s;
    clock_t cstart, cend;
    int c = argc>1?atoi(argv[1]):256;
    if (c<16) c=16;
    printf ("c=%d Kiter\n", c);
    cstart = clock();
    bench (c, s);
    cend = clock();
    int x = getpid();
    char cmdbuf[64];
    snprintf(cmdbuf, sizeof(cmdbuf), "pmap %d", x);
    printf ("running %s\n", cmdbuf);
    fflush (NULL);
    system(cmdbuf);
    putchar('\n');
    printf ("at end c=%d Kiter clockdiff=%.2f millisec = %.f µs/Kiter\n",
            c, (cend-cstart)*1.0e-3, (double)(cend-cstart)/c);
    if (s.find(x) != s.end())
        printf("set has %d\n", x);
    else
        printf("set don't contain %d\n", x);
    return 0;
}
Notice the 16-byte sizeof(MyElem). On Debian/Sid/AMD64 with GCC 4.8.1 (Intel i3770K processor, 16 GB RAM), compiling that benchmark with g++ -Wall -O1 tset.cc -o ./tset-01:
With 32768 thousand iterations, i.e. 32M elements:
total 2109592K
(last line above given by pmap)
at end c=32768 Kiter clockdiff=16470.00 millisec = 503 µs/Kiter
Then the time implicitly reported by my zsh:
./tset-01 32768 16.77s user 0.54s system 99% cpu 17.343 total
This is about 2.1 GB, so perhaps 64.3 bytes per element including the set's per-node overhead (since sizeof(MyElem) == 16, the set seems to have a non-negligible cost of perhaps 6 words per element).

LRU Caching & Multithreading

I have already made a post some time ago to ask about a good design for LRU caching (in C++). You can find the question, the answer and some code there:
Better understanding the LRU algorithm
I have now tried to multi-thread this code (using pthreads) and came up with some really unexpected results. Before even attempting to use locking, I created a system in which each thread accesses its own cache (see the code). I run this code on a 4-core processor. I tried to run it with 1 thread and with 4 threads. With 1 thread I do 1 million lookups in the cache; with 4 threads, each thread does 250K lookups. I was expecting a time reduction with 4 threads but got the opposite: 1 thread runs in 2.2 seconds, 4 threads run in more than 6 seconds. I just can't make sense of this result.
Is something wrong with my code? Can this be explained somehow (thread management takes time)? It would be great to have feedback from experts. Thanks a lot.
I compile this code with: c++ -o cache cache.cpp -std=c++0x -O3 -lpthread
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <list>
#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map>
#include <stdint.h>
#include <iostream>

typedef uint32_t data_key_t;

using namespace std;
//using namespace std::tr1;

class TileData
{
public:
    data_key_t theKey;
    float *data;
    static const uint32_t tileSize = 32;
    static const uint32_t tileDataBlockSize;
    TileData(const data_key_t &key) : theKey(key), data(NULL)
    {
        float *data = new float [tileSize * tileSize * tileSize];
    }
    ~TileData()
    {
        /* std::cerr << "delete " << theKey << std::endl; */
        if (data) delete [] data;
    }
};

typedef shared_ptr<TileData> TileDataPtr; // automatic memory management!

TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
    return shared_ptr<TileData>(new TileData(theKey));
}

class CacheLRU
{
public:
    list<TileDataPtr> linkedList;
    unordered_map<data_key_t, TileDataPtr> hashMap;
    CacheLRU() : cacheHit(0), cacheMiss(0) {}
    TileDataPtr getData(data_key_t theKey)
    {
        unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
        if (iter != hashMap.end()) {
            TileDataPtr ret = iter->second;
            linkedList.remove(ret);
            linkedList.push_front(ret);
            ++cacheHit;
            return ret;
        }
        else {
            ++cacheMiss;
            TileDataPtr ret = loadDataFromDisk(theKey);
            linkedList.push_front(ret);
            hashMap.insert(make_pair<data_key_t, TileDataPtr>(theKey, ret));
            if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
                const TileDataPtr dropMe = linkedList.back();
                hashMap.erase(dropMe->theKey);
                linkedList.remove(dropMe);
            }
            return ret;
        }
    }
    static const uint32_t MAX_LRU_CACHE_SIZE = 100;
    uint32_t cacheMiss, cacheHit;
};

int numThreads = 1;

void *testCache(void *data)
{
    struct timeval tv1, tv2;
    // Measuring time before starting the threads...
    double t = clock();
    printf("Starting thread, lookups %d\n", (int)(1000000.f / numThreads));
    CacheLRU *cache = new CacheLRU;
    for (uint32_t i = 0; i < (int)(1000000.f / numThreads); ++i) {
        int key = random() % 300;
        TileDataPtr tileDataPtr = cache->getData(key);
    }
    std::cerr << "Time (sec): " << (clock() - t) / CLOCKS_PER_SEC << std::endl;
    delete cache;
}

int main()
{
    int i;
    pthread_t thr[numThreads];
    struct timeval tv1, tv2;
    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);
#if 0
    CacheLRU *c1 = new CacheLRU;
    (*testCache)(c1);
#else
    for (int i = 0; i < numThreads; ++i) {
        pthread_create(&thr[i], NULL, testCache, (void*)NULL);
        //pthread_detach(thr[i]);
    }
    for (int i = 0; i < numThreads; ++i) {
        pthread_join(thr[i], NULL);
        //pthread_detach(thr[i]);
    }
#endif
    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);
    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }
    printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,
           tv2.tv_usec - tv1.tv_usec);
    return 0;
}
A thousand apologies: by continuing to debug the code I realised I had made a really bad beginner's mistake. If you look at this code:
TileData(const data_key_t &key) : theKey(key), data(NULL)
{
    float *data = new float [tileSize * tileSize * tileSize];
}
from the TileData class: data is supposed to actually be a member variable of the class, but the constructor declares a new local variable that shadows it... So the right code should be:
class TileData
{
public:
    float *data;
    TileData(const data_key_t &key) : theKey(key), data(NULL)
    {
        data = new float [tileSize * tileSize * tileSize];
        numAlloc++;
    }
};
I am so sorry about that! It's a mistake I have made in the past; I guess prototyping is great, but it sometimes leads to such stupid mistakes.
I ran the code with 1 and 4 threads and now do see the speedup: 1 thread takes about 2.3 seconds, 4 threads take 0.92 seconds.
Thanks all for your help, and sorry if I wasted your time ;-)
I don't have a concrete answer yet. I can think of several possibilities. One is that testCache() is using random(), which is almost certainly implemented with a single global mutex. (Thus all of your threads are competing for the mutex, which is now ping-ponging between the caches.) ((That's assuming that random() is actually thread-safe on your system.))
Next, testCache() is accessing a CacheLRU, which is implemented with unordered_maps and shared_ptrs. The unordered_maps, in particular, might be implemented with some kind of global mutex underneath that is causing all of your threads to compete for access.
To really diagnose what is going on here you should do something much simpler inside of testCache(). (First try just taking the sqrt() of an input variable 250K times (vs. 1M times). Then try linearly accessing a C array of size 250K (or 1M). Slowly build up to the complex thing you are currently doing.)
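A sketch of such a stripped-down thread body (no cache, no RNG; the function name is illustrative, and numThreads refers to the global in the original code):
#include <cmath>
#include <cstdio>

// Minimal thread body: pure computation, no shared state, and no library
// calls that might take a hidden global lock. If this scales with the
// thread count, add pieces of the real work back one at a time.
void *testComputeOnly(void *arg)
{
    long iterations = 1000000 / numThreads;
    double sum = 0.0;
    for (long i = 0; i < iterations; ++i)
        sum += std::sqrt((double)(i % 1000) + 1.0);
    std::printf("checksum %f\n", sum); // keep the loop from being optimized away
    return NULL;
}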
Another possibility has to do with pthread_join. pthread_join doesn't return until all the threads are done, so if one is taking longer than the others, you are measuring the slowest one. Your computation here seems balanced, but perhaps your OS is doing something unexpected, like mapping several threads to one core (perhaps because you have a hyper-threaded processor), or moving one thread from one core to another in the middle of the run (perhaps because the OS thinks it is smart when it is not).
This will be a bit of a "build it up" answer. I'm running your code on a Fedora 16 Linux system with a 4-core AMD CPU and 16 GB of RAM.
I can confirm that I'm seeing similar "slower with more threads" behaviour. I removed the random function, which doesn't improve things at all.
I'm going to make some other minor changes.