I'm running CUDA 4.2 on 64-bit Windows 7 in the Visual Studio 2010 Professional environment.
First, I have the following code, which runs correctly:
// include the header files
#include <iostream>
#include <stdio.h>
#include <time.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
using namespace std;
//kernel function
__global__
void dosomething(int *d_bPtr, int count, int* d_bStopPtr)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid==0)
d_bStopPtr[tid]=0;
else if(tid<count)
{
d_bPtr[tid]=tid;
// only if the array cell before it is 0, change this one to 0 too
if (d_bStopPtr[tid-1]==0 )
d_bStopPtr[tid]=0;
}
}
int main()
{
int count=100000;
// define the vectors
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int* d_bPtr=thrust::raw_pointer_cast(&d_b[0]);
thrust::device_vector <int> d_bStop(count,1);
int* d_bStopPtr=thrust::raw_pointer_cast(&d_bStop[0]);
// get the device property
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int threadsPerBlock = prop.maxThreadsDim[0];
int blocksPerGrid = min(prop.maxGridSize[0], (count + threadsPerBlock - 1) / threadsPerBlock);
//copy device to host
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//run the kernel
while(d_bStop[count-1])
{
dosomething<<<blocksPerGrid, threadsPerBlock>>>(d_bPtr,count,d_bStopPtr);
}
//copy device back to host again
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//wait to see the console output
int x;
cin>>x;
return 0;
}
However, checking the while condition from the host on each iteration is slow. So I'm thinking of checking the condition on this device vector inside the kernel instead, and changing the code like this:
// include the header files
#include <iostream>
#include <stdio.h>
#include <time.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
using namespace std;
//kernel function
__global__
void dosomething(int *d_bPtr, int count, int* d_bStopPtr)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid==0)
d_bStopPtr[tid]=0;
else if(tid<count)
{
// if the last cell of the array is still not 0, repeat
while(d_bStopPtr[count-1])
{
d_bPtr[tid]=tid;
// only if the array cell before it is 0, change this one to 0 too
if (d_bStopPtr[tid-1]==0 )
d_bStopPtr[tid]=0;
}
}
}
int main()
{
int count=100000;
// define the vectors
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int* d_bPtr=thrust::raw_pointer_cast(&d_b[0]);
thrust::device_vector <int> d_bStop(count,1);
int* d_bStopPtr=thrust::raw_pointer_cast(&d_bStop[0]);
// get the device property
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int threadsPerBlock = prop.maxThreadsDim[0];
int blocksPerGrid = min(prop.maxGridSize[0], (count + threadsPerBlock - 1) / threadsPerBlock);
//copy device to host
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//run the kernel
dosomething<<<blocksPerGrid, threadsPerBlock>>>(d_bPtr,count,d_bStopPtr);
//copy device back to host again
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//wait to see the console output
int x;
cin>>x;
return 0;
}
However, the second version always causes the graphics card and the computer to hang. Can you please help me speed up the first version? How can I check the condition inside the kernel and then jump out and stop the kernel?
You are basically looking for global thread synchronous behavior. This is a no-no in GPU programming. Ideally each threadblock is independent, and can complete its work based on its own data and processing. Creating threadblocks that depend on the results of other threadblocks to complete their work creates the possibility of a deadlock condition. Suppose I have a GPU with 14 SMs (threadblock execution units), and suppose I create 100 threadblocks. Now suppose threadblocks 0-13 are waiting for threadblock 99 to release a lock (e.g. write a zero value to a particular location). Now suppose those first 14 threadblocks begin executing on the 14 SMs, perhaps looping, spinning on the lock value. There is no mechanism in the GPU to guarantee that threadblock 99 will execute first or even execute at all if threadblocks 0-13 have the SMs tied up.
Let's not get into questions about "what about GMEM stalls that force eviction of threadblocks 0-13" because none of that guarantees that threadblock 99 will get priority to execute at any point. The only thing that guarantees that threadblock 99 will execute is the draining (i.e. completion) of other threadblocks. But if the other threadblocks are spinning, waiting for results from threadblock 99, that may never happen.
Good forward-compatible, scalable GPU code depends on independent parallel work. So you're advised to re-craft your algorithm to make the work you are trying to accomplish independent, at least at the inter-threadblock level.
If you must do global thread syncing, the kernel launch is the only truly guaranteed point for this, and thus your first approach is the working approach.
To help with this, it may be useful to study how reduction algorithms get implemented on a GPU. Various types of reductions have dependencies across all threads, but by creating intermediate results, we can break the work into independent pieces. The independent pieces can then be aggregated using a multi-kernel approach (or some other more advanced approaches) to speed up what amounts to a serial algorithm.
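To sketch that idea, here is a generic two-pass sum reduction (not your algorithm; the kernel name and the 256-thread block size are assumptions for illustration). Each block produces an independent partial result, and a second kernel launch, the only guaranteed global sync point, combines them:
// each block computes one independent partial sum (assumes blockDim.x == 256, a power of 2)
__global__ void partialSums(const int *in, int *out, int n)
{
__shared__ int sdata[256];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = (i < n) ? in[i] : 0;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) sdata[tid] += sdata[tid + s];
__syncthreads();
}
if (tid == 0) out[blockIdx.x] = sdata[0];   // one value per block, no inter-block dependency
}
// host side:
// partialSums<<<blocks, 256>>>(d_in, d_partial, n);
// partialSums<<<1, 256>>>(d_partial, d_sum, blocks);  // second launch combines the partials (assumes blocks <= 256)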
Your kernel doesn't actually do much. It sets one array equal to its index, i.e. a[i] = i; and it sets the other array to all zeroes (although sequentially), b[i]=0;. To show an example of your first code "sped up", you could do something like this:
// include the header files
#include <iostream>
#include <stdio.h>
#include <time.h>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
using namespace std;
//kernel function
__global__
void dosomething(int *d_bPtr, int count, int* d_bStopPtr)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid<count)
{
d_bPtr[tid]=tid;
while(d_bStopPtr[tid]!=0)
// only if the array cell before it is 0, change this one to 0 too
if (tid==0) d_bStopPtr[tid] =0;
else if (d_bStopPtr[tid-1]==0 )
d_bStopPtr[tid]=0;
tid += blockDim.x;
}
}
int main()
{
int count=100000;
// define the vectors
thrust::host_vector <int> h_a(count);
thrust::device_vector <int> d_b(count,0);
int* d_bPtr=thrust::raw_pointer_cast(&d_b[0]);
thrust::device_vector <int> d_bStop(count,1);
int* d_bStopPtr=thrust::raw_pointer_cast(&d_bStop[0]);
// get the device property
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// int threadsPerBlock = prop.maxThreadsDim[0];
int threadsPerBlock = 32;
// int blocksPerGrid = min(prop.maxGridSize[0], (count + threadsPerBlock - 1) / threadsPerBlock);
int blocksPerGrid = 1;
//copy device to host
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//run the kernel
// while(d_bStop[count-1])
// {
dosomething<<<blocksPerGrid, threadsPerBlock>>>(d_bPtr,count,d_bStopPtr);
// }
//copy device back to host again
cudaDeviceSynchronize();
thrust::copy(d_b.begin(),d_b.end(),h_a.begin());
cout<<h_a[100]<<"\t"<<h_a[200]<<"\t"<<h_a[300]<<"\t"<<endl;
//wait to see the console output
int x;
cin>>x;
return 0;
}
On my machine this speeds the execution time up from 10 secs to almost instantaneous (much less than 1 second). Note that this is not a great example of CUDA programming, because I am only launching one block of 32 threads. That's not enough to effectively utilize the machine. But the work done by your kernel is so trivial that I'm not sure what a good example would be. I could just create a kernel that sets one array to its index, a[i]=i;, and the other array to zero, b[i]=0;, all in parallel. That would be even faster, and we could use the whole machine that way.
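For reference, such a fully parallel kernel might look like this (just a sketch; initArrays is a name I made up, and the 256-thread launch configuration is only an example):
// hypothetical kernel that does all the work of your example in one fully parallel pass
__global__ void initArrays(int *a, int *b, int count)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < count) {
a[tid] = tid;   // a[i] = i
b[tid] = 0;     // b[i] = 0
}
}
// e.g. initArrays<<<(count + 255) / 256, 256>>>(d_bPtr, d_bStopPtr, count);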
Related
I am trying to generate a set of random numbers that are only 1 and 0. The code below almost works. When I print them in the for loop I notice that sometimes a number is generated that is not 1 or 0. I know I am missing something, I'm just not sure what. I think it's a memory misplacement.
#include <stdio.h>
#include <curand.h>
#include <curand_kernel.h>
#include <math.h>
#include <assert.h>
#define MIN 1
#define MAX (2048*20)
#define MOD 2 // only need one and zero for each random value.
#define THREADS_PER_BLOCK 256
__global__ void setup_kernel(curandState *state, unsigned long seed)
{
int idx = threadIdx.x+blockDim.x*blockIdx.x;
curand_init(seed, idx, 0, state+idx);
}
__global__ void generate_kernel(curandState *state, unsigned int *result){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
result[idx] = curand(state+idx) % MOD;
}
int main(){
curandState *d_state;
cudaMalloc(&d_state, sizeof(curandState));
unsigned *d_result, *h_result;
cudaMalloc(&d_result, (MAX-MIN+1) * sizeof(unsigned));
h_result = (unsigned *)malloc((MAX-MIN+1)*sizeof(unsigned));
cudaMemset(d_result, 0, (MAX-MIN+1)*sizeof(unsigned));
setup_kernel<<<MAX/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_state,time(NULL));
generate_kernel<<<MAX/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_state, d_result);
cudaMemcpy(h_result, d_result, (MAX-MIN+1) * sizeof(unsigned), cudaMemcpyDeviceToHost);
printf("Bin: Count: \n");
for (int i = MIN; i <= MAX; i++)
printf("%d %d\n", i, h_result[i-MIN]);
free(h_result);
cudaFree(d_result);
system("pause");
return 0;
}
What I am attempting to do is transform a genetic algorithm from this site.
http://www.ai-junkie.com/ga/intro/gat3.html
I thought it would be a good problem to learn CUDA and have some fun at the same time.
The first part is to generate my random array.
The problem here is that both your setup_kernel and generate_kernel are not running to completion because of out of bounds memory access. Both kernels are expecting that there will be a generator state for each thread, but you are only allocating a single state on the device. This results in out of bounds memory reads and writes on both kernels. Change this:
curandState *d_state;
cudaMalloc(&d_state, sizeof(curandState));
to something like
curandState *d_state;
cudaMalloc(&d_state, sizeof(curandState) * (MAX-MIN+1));
so that you have one generator state per thread you are running, and things should start working. If you had made any attempt at checking errors, either via the runtime API return statuses or by using cuda-memcheck, the source of the error would have been immediately apparent.
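As a rough sketch of what that checking could look like (the checkCuda macro name is my own, not a CUDA API):
#define checkCuda(call) do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
printf("CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
exit(1); } } while (0)
// usage:
checkCuda(cudaMalloc(&d_state, sizeof(curandState) * (MAX-MIN+1)));
setup_kernel<<<MAX/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_state, time(NULL));
checkCuda(cudaGetLastError());        // reports kernel launch failures
checkCuda(cudaDeviceSynchronize());   // reports errors that occur while the kernel runs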
I am learning some basic CUDA programming. I am trying to initialize an array on the Host with host_a[i] = i. This array consists of N = 128 integers. I am launching a kernel with 1 block and 128 threads per block, in which I want to square the integer at index i.
My questions are:
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
The expected output for my program is a space-separated list of squares of integers -
1 4 9 16 ... .
What's wrong with my code, since it outputs 1 2 3 4 5 ...
Code:
#include <iostream>
#include <numeric>
#include <stdlib.h>
#include <cuda.h>
const int N = 128;
__global__ void f(int *dev_a) {
unsigned int tid = threadIdx.x;
if(tid < N) {
dev_a[tid] = tid * tid;
}
}
int main(void) {
int host_a[N];
int *dev_a;
cudaMalloc((void**)&dev_a, N * sizeof(int));
for(int i = 0 ; i < N ; i++) {
host_a[i] = i;
}
cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);
f<<<1, N>>>(dev_a);
cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
for(int i = 0 ; i < N ; i++) {
printf("%d ", host_a[i]);
}
}
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
You can use printf in device code (as long as you #include <stdio.h>) on any compute capability 2.0 or higher GPU. Since CUDA 7 and CUDA 7.5 only support those types of GPUs, if you are using CUDA 7 or CUDA 7.5 (successfully) then you can use printf in device code.
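For instance, a trivial (hypothetical) kernel like this prints from the device:
#include <stdio.h>
__global__ void hello()
{
printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}
// hello<<<2, 4>>>();
// cudaDeviceSynchronize();  // flushes the device-side printf buffer before the program exits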
What's wrong with my code?
As identified in the comments, there is nothing "wrong" with your code, if run on a properly set up machine. To address your previous question "How do I come to know whether the kernel gets launched or not?", the best approach in my opinion is to use proper cuda error checking, which has numerous benefits besides just telling you whether your kernel launched or not. In this case it would also give a clue as to the failure being an improper CUDA setup on your machine. You can also run CUDA codes with cuda-memcheck as a quick test as to whether any runtime errors are occurring.
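As a minimal sketch (not a full error-checking framework), you could wrap your launch like this:
f<<<1, N>>>(dev_a);
cudaError_t err = cudaGetLastError();   // did the launch itself fail (bad configuration, no device, etc.)?
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();          // did the kernel hit an error while executing?
if (err != cudaSuccess) printf("execution error: %s\n", cudaGetErrorString(err));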
I am currently designing a short tutorial exhibiting various aspects and capabilities of Thrust template library.
Unfortunately, it seems that there is a problem in a code that I have written in order to show how to use copy/compute concurrency using cuda streams.
My code could be found here, in the asynchronousLaunch directory:
https://github.com/gnthibault/Cuda_Thrust_Introduction/tree/master/AsynchronousLaunch
Here is an abstract of the code that generates the problem:
//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <vector>
#include <functional>
//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
//Cuda
#include <cuda_runtime.h>
//Local
#include "AsynchronousLaunch.cu.h"
int main( int argc, char* argv[] )
{
const size_t fullSize = 1024*1024*64;
const size_t halfSize = fullSize/2;
//Declare one host std::vector and initialize it with random values
std::vector<float> hostVector( fullSize );
std::generate(hostVector.begin(), hostVector.end(), normalRandomFunctor<float>(0.f,1.f) );
//And two device vector of Half size
thrust::device_vector<float> deviceVector0( halfSize );
thrust::device_vector<float> deviceVector1( halfSize );
//Declare and initialize also two cuda stream
cudaStream_t stream0, stream1;
cudaStreamCreate( &stream0 );
cudaStreamCreate( &stream1 );
//Now, we would like to perform an alternate scheme copy/compute
for( int i = 0; i < 10; i++ )
{
//Wait for the end of the copy to host before starting to copy back to device
cudaStreamSynchronize(stream0);
//Warning: thrust::copy does not handle asynchronous behaviour for host/device copy, you must use cudaMemcpyAsync to do so
cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector0.data()), thrust::raw_pointer_cast(hostVector.data()), halfSize*sizeof(float), cudaMemcpyHostToDevice, stream0);
cudaStreamSynchronize(stream1);
//second copy is most likely to occur sequentially after the first one
cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector1.data()), thrust::raw_pointer_cast(hostVector.data())+halfSize, halfSize*sizeof(float), cudaMemcpyHostToDevice, stream1);
//Compute on device, here inclusive scan, for histogram equalization for instance
thrust::transform( thrust::cuda::par.on(stream0), deviceVector0.begin(), deviceVector0.end(), deviceVector0.begin(), computeFunctor<float>() );
thrust::transform( thrust::cuda::par.on(stream1), deviceVector1.begin(), deviceVector1.end(), deviceVector1.begin(), computeFunctor<float>() );
//Copy back to host
cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data()), thrust::raw_pointer_cast(deviceVector0.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(thrust::raw_pointer_cast(hostVector.data())+halfSize, thrust::raw_pointer_cast(deviceVector1.data()), halfSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
}
//Full Synchronize before exit
cudaDeviceSynchronize();
cudaStreamDestroy( stream0 );
cudaStreamDestroy( stream1 );
return EXIT_SUCCESS;
}
Here are the results of one instance of the program, observed through the NVIDIA Visual Profiler:
As you can see, the cudaMemcpyAsync calls (in brown) are issued to streams 13 and 14, but the kernels generated by Thrust from thrust::transform are issued to the default stream (in blue in the capture).
By the way, I am using cuda toolkit version 7.0.28, with a GTX680 and gcc 4.8.2.
I would be grateful if someone could tell me what is wrong with my code.
Thank you in advance
Edit: here is the code that I consider to be a solution:
//STL
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <functional>
#include <vector>
//Thrust
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
//Cuda
#include <cuda_runtime.h>
//Local definitions
template<typename T>
struct computeFunctor
{
__host__ __device__
computeFunctor() {}
__host__ __device__
T operator()( T in )
{
//Naive functor that generates expensive but useless instructions
T a = cos(in);
for(int i = 0; i < 350; i++ )
{
a+=cos(in);
}
return a;
}
};
int main( int argc, char* argv[] )
{
const size_t fullSize = 1024*1024*2;
const size_t nbOfStrip = 4;
const size_t stripSize = fullSize/nbOfStrip;
//Allocate host pinned memory in order to use asynchronous api and initialize it with random values
float* hostVector;
cudaMallocHost(&hostVector,fullSize*sizeof(float));
std::fill(hostVector, hostVector+fullSize, 1.0f );
//And one device vector of the same size
thrust::device_vector<float> deviceVector( fullSize );
//Declare and initialize also two cuda stream
std::vector<cudaStream_t> vStream(nbOfStrip);
for( auto it = vStream.begin(); it != vStream.end(); it++ )
{
cudaStreamCreate( &(*it) );
}
//Now, we would like to perform an alternate scheme copy/compute in a loop using the copyToDevice/Compute/CopyToHost for each stream scheme:
for( int i = 0; i < 5; i++ )
{
for( int j=0; j!=nbOfStrip; j++)
{
size_t offset = stripSize*j;
size_t nextOffset = stripSize*(j+1);
cudaStreamSynchronize(vStream.at(j));
cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );
cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
}
}
//On devices that do not possess multiple copy-engine queues, this solution serializes all commands even if they have been issued to different streams
//Why? Because from the point of view of the copy engine, which is a single resource in this case, there is a time dependency between HtoD(n) and DtoH(n), which is ok, but there is also
// a false dependency between DtoH(n) and HtoD(n+1), which precludes any copy/compute overlap
//Full Synchronize before testing second solution
cudaDeviceSynchronize();
//Now, we would like to perform an alternate scheme copy/compute in a loop using the copyToDevice for each stream /Compute for each stream /CopyToHost for each stream scheme:
for( int i = 0; i < 5; i++ )
{
for( int j=0; j!=nbOfStrip; j++)
{
cudaStreamSynchronize(vStream.at(j));
}
for( int j=0; j!=nbOfStrip; j++)
{
size_t offset = stripSize*j;
cudaMemcpyAsync(thrust::raw_pointer_cast(deviceVector.data())+offset, hostVector+offset, stripSize*sizeof(float), cudaMemcpyHostToDevice, vStream.at(j));
}
for( int j=0; j!=nbOfStrip; j++)
{
size_t offset = stripSize*j;
size_t nextOffset = stripSize*(j+1);
thrust::transform( thrust::cuda::par.on(vStream.at(j)), deviceVector.begin()+offset, deviceVector.begin()+nextOffset, deviceVector.begin()+offset, computeFunctor<float>() );
}
for( int j=0; j!=nbOfStrip; j++)
{
size_t offset = stripSize*j;
cudaMemcpyAsync(hostVector+offset, thrust::raw_pointer_cast(deviceVector.data())+offset, stripSize*sizeof(float), cudaMemcpyDeviceToHost, vStream.at(j));
}
}
//On devices that do not possess multiple copy-engine queues, this solution yields better results; on others, it should show nearly identical results
//Full Synchronize before exit
cudaDeviceSynchronize();
for( auto it = vStream.begin(); it != vStream.end(); it++ )
{
cudaStreamDestroy( *it );
}
cudaFreeHost( hostVector );
return EXIT_SUCCESS;
}
Compiled using nvcc ./test.cu -o ./test.exe -std=c++11
There are 2 things I would point out. Both of these are (now) referenced in this related question/answer which you may wish to refer to.
The failure of thrust to issue the underlying kernels to non-default streams in this case seems to be related to this issue. It can be rectified (as covered in the comments to the question) by updating to the latest thrust version. Future CUDA versions (beyond 7) will probably include a fixed thrust as well. This is probably the central issue being discussed in this question.
The question seems to also suggest that one of the goals is overlap of copy and compute:
in order to show how to use copy/compute concurrency using cuda streams
but this won't be achievable, I don't think, with the code as currently crafted, even if item 1 above is fixed. Overlap of copy with compute operations requires the proper use of cuda streams on the copy operation (cudaMemcpyAsync) as well as a pinned host allocation. The code proposed in the question is lacking any use of a pinned host allocation (std::vector does not use a pinned allocator by default, AFAIK), and so I would not expect the cudaMemcpyAsync operation to overlap with any kernel activity, even if it should be otherwise possible. To rectify this, a pinned allocator should be used, and one such example is given here.
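A minimal sketch of the kind of change I mean, using cudaMallocHost directly (this is essentially what your edited code does; a custom pinned allocator for std::vector would also work):
// replace the std::vector host buffer with pinned (page-locked) memory
float* hostVector = 0;
cudaMallocHost(&hostVector, fullSize * sizeof(float));   // pinned allocation
std::fill(hostVector, hostVector + fullSize, 0.0f);
// ... cudaMemcpyAsync to/from hostVector can now overlap with kernel execution ...
cudaFreeHost(hostVector);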
For completeness, the question is otherwise lacking an MCVE, which is expected for questions of this type. This makes it more difficult for others to attempt to test your issue, and is explicitly a close reason on SO. Yes, you provided a link to an external github repo, but this behavior is frowned on. The MCVE requirement explicitly states that the necessary pieces should be included in the question itself (not an external reference.) Since the only lacking piece, AFAICT, is "AsynchronousLaunch.cu.h", it seems like it would have been relatively straightforward to include this one additional piece in your question. The problem with external links is that when they break in the future, the question becomes less useful for future readers. (And, forcing others to navigate an external github repo looking for specific files is not conducive to getting help, in my opinion.)
I have a program that starts up, and within about 5 minutes the virtual size of the process is about 13 GB. It runs on Linux, uses Boost, the GNU C++ library, and various other 3rd-party libraries.
After 5 minutes, the virtual size stays at 13 GB and the RSS stays steady at around 5 GB.
I can't just run it in a debugger because at startup about 30 threads are started, each of which starts running its own code, that does various allocations. So stepping through and checking virtual memory at different parts of code at each breakpoint is not feasible.
I thought of changing the program to start each thread one at a time to make it easier to track the allocation of memory, but before doing this, are there any good tools?
Valgrind is fairly slow, maybe tcmalloc could provide the info?
I would use valgrind (perhaps running it for an entire night) or else use the Boehm GC.
Alternatively, use the proc(5) filesystem to understand (e.g. through /proc/$pid/statm & /proc/$pid/maps) when a lot of memory gets allocated.
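For instance, a small helper called at interesting points of the program could dump its own figures (a sketch; printMemoryUsage is a name I made up, and /proc/self/statm reports both values in pages):
#include <fstream>
#include <iostream>
#include <unistd.h>
// print the current process's virtual and resident sizes, in MB
void printMemoryUsage(const char* tag)
{
long sizePages = 0, residentPages = 0;
std::ifstream statm("/proc/self/statm");
statm >> sizePages >> residentPages;
const long pageSize = sysconf(_SC_PAGESIZE);
std::cerr << tag << ": virtual " << (sizePages * pageSize) / (1024 * 1024)
<< " MB, resident " << (residentPages * pageSize) / (1024 * 1024) << " MB" << std::endl;
}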
The most important thing is to find memory leaks. If the memory doesn't grow after startup, it is less of an issue.
Perhaps adding instance counters to each class might help (use atomic integers or mutexes to serialize them).
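A sketch of such a counter (Tile is a made-up class name; adapt it to your suspect classes):
#include <atomic>
// count live instances of a class to spot which type keeps growing
struct Tile
{
static std::atomic<long> liveCount;
Tile() { ++liveCount; }
Tile(const Tile&) { ++liveCount; }
~Tile() { --liveCount; }
};
std::atomic<long> Tile::liveCount(0);
// log Tile::liveCount.load() periodically, e.g. from a monitoring thread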
If the program's source code is big (e.g. a million source lines), so that spending several days or weeks is worth the effort, perhaps customizing the GCC compiler (e.g. with MELT) might be relevant.
a std::set minibenchmark
You mentioned a big std::set based upon millions of rows.
#include <set>
#include <string>
#include <string.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <time.h>
class MyElem
{
int _n;
char _s[16-sizeof(_n)];
public:
MyElem(int k) : _n(k)
{
snprintf (_s, sizeof(_s), "%d", k);
};
~MyElem()
{
_n=0;
memset(_s, 0, sizeof(_s));
};
int n() const
{
return _n;
};
std::string str() const
{
return std::string(_s);
};
bool less(const MyElem&x) const
{
return _n < x._n;
};
};
bool operator < (const MyElem& l, const MyElem& r)
{
return l.less(r);
}
typedef std::set<MyElem> MySet;
void bench (int cnt, MySet& set)
{
for (long i=0; i<(long)cnt*1024; i++)
set.insert(MyElem(i));
time_t now = 0;
time (&now);
set.insert (((now) & 0xfffffff) * 100);
}
int main (int argc, char** argv)
{
MySet s;
clock_t cstart, cend;
int c = argc>1?atoi(argv[1]):256;
if (c<16) c=16;
printf ("c=%d Kiter\n", c);
cstart = clock();
bench (c, s);
cend = clock();
int x = getpid();
char cmdbuf[64];
snprintf(cmdbuf, sizeof(cmdbuf), "pmap %d", x);
printf ("running %s\n", cmdbuf);
fflush (NULL);
system(cmdbuf);
putchar('\n');
printf ("at end c=%d Kiter clockdiff=%.2f millisec = %.f µs/Kiter\n",
c, (cend-cstart)*1.0e-3, (double)(cend-cstart)/c);
if (s.find(x) != s.end())
printf("set has %d\n", x);
else
printf("set don't contain %d\n", x);
return 0;
}
Notice the 16-byte sizeof(MyElem). On Debian/Sid/AMD64 with GCC 4.8.1 (Intel i3770K processor, 16 GB RAM), I compiled that benchmark with g++ -Wall -O1 tset.cc -o ./tset-01
With 32768 thousand iterations, so 32M elements:
total 2109592K
(last line above given by pmap)
at end c=32768 Kiter clockdiff=16470.00 millisec = 503 µs/Kiter
Then the time as implicitly reported by my zsh:
./tset-01 32768 16.77s user 0.54s system 99% cpu 17.343 total
This is about 2.1 GB in total, i.e. roughly 64.3 bytes per element (2109592 KB / 32M elements). Since sizeof(MyElem)==16, the set seems to have a non-negligible per-member overhead of roughly 48 bytes, or about 6 words, per element.
I have already made a post some time ago to ask about a good design for LRU caching (in C++). You can find the question, the answer and some code there:
Better understanding the LRU algorithm
I have now tried to multi-thread this code (using pthreads) and came up with some really unexpected results. Before even attempting to use locking, I have created a system in which each thread accesses its own cache (see code). I run this code on a 4-core processor. I tried to run it with 1 thread and with 4 threads. When it runs on 1 thread, I do 1 million lookups in the cache; on 4 threads, each thread does 250K lookups. I was expecting a time reduction with 4 threads but got the opposite: 1 thread runs in 2.2 seconds, 4 threads run in more than 6 seconds?? I just can't make sense of this result.
Is something wrong with my code? Can this be explained somehow (e.g. thread management takes time)? It would be great to have feedback from experts. Thanks a lot.
I compile this code with: c++ -o cache cache.cpp -std=c++0x -O3 -lpthread
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <list>
#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map>
#include <stdint.h>
#include <iostream>
typedef uint32_t data_key_t;
using namespace std;
//using namespace std::tr1;
class TileData
{
public:
data_key_t theKey;
float *data;
static const uint32_t tileSize = 32;
static const uint32_t tileDataBlockSize;
TileData(const data_key_t &key) : theKey(key), data(NULL)
{
float *data = new float [tileSize * tileSize * tileSize];
}
~TileData()
{
/* std::cerr << "delete " << theKey << std::endl; */
if (data) delete [] data;
}
};
typedef shared_ptr<TileData> TileDataPtr; // automatic memory management!
TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
return shared_ptr<TileData>(new TileData(theKey));
}
class CacheLRU
{
public:
list<TileDataPtr> linkedList;
unordered_map<data_key_t, TileDataPtr> hashMap;
CacheLRU() : cacheHit(0), cacheMiss(0) {}
TileDataPtr getData(data_key_t theKey)
{
unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
if (iter != hashMap.end()) {
TileDataPtr ret = iter->second;
linkedList.remove(ret);
linkedList.push_front(ret);
++cacheHit;
return ret;
}
else {
++cacheMiss;
TileDataPtr ret = loadDataFromDisk(theKey);
linkedList.push_front(ret);
hashMap.insert(make_pair<data_key_t, TileDataPtr>(theKey, ret));
if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
const TileDataPtr dropMe = linkedList.back();
hashMap.erase(dropMe->theKey);
linkedList.remove(dropMe);
}
return ret;
}
}
static const uint32_t MAX_LRU_CACHE_SIZE = 100;
uint32_t cacheMiss, cacheHit;
};
int numThreads = 1;
void *testCache(void *data)
{
struct timeval tv1, tv2;
// Measuring time before starting the threads...
double t = clock();
printf("Starting thread, lookups %d\n", (int)(1000000.f / numThreads));
CacheLRU *cache = new CacheLRU;
for (uint32_t i = 0; i < (int)(1000000.f / numThreads); ++i) {
int key = random() % 300;
TileDataPtr tileDataPtr = cache->getData(key);
}
std::cerr << "Time (sec): " << (clock() - t) / CLOCKS_PER_SEC << std::endl;
delete cache;
}
int main()
{
int i;
pthread_t thr[numThreads];
struct timeval tv1, tv2;
// Measuring time before starting the threads...
gettimeofday(&tv1, NULL);
#if 0
CacheLRU *c1 = new CacheLRU;
(*testCache)(c1);
#else
for (int i = 0; i < numThreads; ++i) {
pthread_create(&thr[i], NULL, testCache, (void*)NULL);
//pthread_detach(thr[i]);
}
for (int i = 0; i < numThreads; ++i) {
pthread_join(thr[i], NULL);
//pthread_detach(thr[i]);
}
#endif
// Measuring time after threads finished...
gettimeofday(&tv2, NULL);
if (tv1.tv_usec > tv2.tv_usec)
{
tv2.tv_sec--;
tv2.tv_usec += 1000000;
}
printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,
tv2.tv_usec - tv1.tv_usec);
return 0;
}
A thousand apologies; by continuing to debug the code I realised I made a really bad beginner's mistake. If you look at this code:
TileData(const data_key_t &key) : theKey(key), data(NULL)
{
float *data = new float [tileSize * tileSize * tileSize];
}
from the TileData class, data is supposed to actually be a member variable of the class, but the constructor declares a new local variable that shadows it, so the member is never assigned... So the right code should be:
class TileData
{
public:
float *data;
TileData(const data_key_t &key) : theKey(key), data(NULL)
{
data = new float [tileSize * tileSize * tileSize];
numAlloc++;
}
};
I am so sorry about that! It's a mistake I have made in the past, and I guess prototyping is great, but it sometimes leads to such stupid mistakes.
I ran the code with 1 and 4 threads and now see the speedup. 1 thread takes about 2.3 seconds, 4 threads take 0.92 seconds.
Thanks all for your help, and sorry if I made you lose your time ;-)
I don't have a concrete answer yet. I can think of several possibilities. One is that testCache() is using random(), which is almost certainly implemented with a single global mutex. (Thus all of your threads are competing for the mutex, whose cache line is now ping-ponging between the cores.) ((That's assuming that random() is actually thread-safe on your system.))
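If you want to rule that out, one quick experiment is to give each thread its own generator state, e.g. with rand_r. Here is a sketch of your testCache along those lines (testCacheNoSharedRng is a hypothetical name; timing code omitted):
void *testCacheNoSharedRng(void *data)
{
// the address of a local differs per thread stack, so each thread gets a distinct seed
unsigned int seed = (unsigned int)(size_t)&data;
CacheLRU *cache = new CacheLRU;
for (int i = 0; i < (int)(1000000.f / numThreads); ++i) {
int key = rand_r(&seed) % 300;   // rand_r keeps its state in 'seed': no shared lock
TileDataPtr tileDataPtr = cache->getData(key);
}
delete cache;
return NULL;
}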
Next, testCache() is accessing a CacheLRU which is implemented with unordered_maps and shared_ptrs. The unordered_maps, in particular, might be implemented with some kind of global mutex underneath that is causing all of your threads to compete for access.
To really diagnose what is going on here you should do something much simpler inside of testCache(). (First try just taking the sqrt() of an input variable 250K times (vs. 1M times). Then try linearly accessing a C array of size 250K (or 1M). Slowly build up to the complex thing you are currently doing.)
Another possibility has to do with the pthread_join. pthread_join doesn't return until all the threads are done. So if one is taking longer than the others, you are measuring the slowest one. Your computation here seems balanced, but perhaps your OS is doing something unexpected? (Like mapping several threads to one core (perhaps because you have a hyper-threaded processor?), or one thread moving from one core to another in the middle of the run (perhaps because the OS thinks it is smart when it is not).)
This will be a bit of a "build it up" answer. I'm running your code on a Fedora 16 Linux system with a 4-core AMD cpu and 16GB of RAM.
I can confirm that I'm seeing similar "slower with more threads" behaviour. I removed the random function, which doesn't improve things at all.
I'm going to make some other minor changes.