Im writting a console application that uses multithreading. Each thread process a set of images using opencv functions.
If the function that uses opencv functions is executed in a single thread I get a reference computational time. If I execute this function from multiple threads the function (individually in each thread) is much slower (nearly double), when it should be nearly the same.
¿Does opencv parallelizes, serializes or blocks itself the execution?.
I have test the aplication using opencv libraries compiled WITH_TBB and without TBB and the result is almost the same. I don't know if it may have any inffluence, but I have seen also that some functions like cv::threshold or cv::findcontours create 12 additional subprocesses when beein executed. If The open cv calls are commented the time is the same for all threads and is the same to the obtained in a single thread execution, so in this case the multithreading is working well. The question is if there is maybe an opencv compilation option or a function call that allows to obtain the same time in multithreading and in single threading execution??.
EDIT
This is the result of increasing the number of threads (cores) in a 4 cores CPU, executing with 1, 2, 3 and 4 cores the same function. Each core process 768 images with 1600x1200 resolution in a for loop. Inside the loop the function causing the increasing delay is called. I shoud expect that, independently of the number of cores the time is approx the same obtained for a single thread (35000ms) or 10% more, but, as can be seen the time raises up when the number of threads is increased, I can not find why...
TIMES: (Sorry, the system not allow me to upload images to the posts)
time in File No. 3 --> 35463
Mean time using 1 cores is: 47ms
time in File No. 3 --> 42747
time in File No. 3 --> 42709
Mean time using 2 cores is: 28ms
time in File No. 3 --> 54587
time in File No. 3 --> 54595
time in File No. 3 --> 54437
Mean time using 3 cores is: 24ms
time in File No. 3 --> 68751
time in File No. 3 --> 68865
time in File No. 3 --> 68878
time in File No. 3 --> 68622
Mean time using 4 cores is: 22ms
If no opencv code is used insithe the function, the time, as expected, is similar for all the cases 1, 2 3 or 4 threads but when an open cv function is used, for example only with a a simple call to:
img.convertTo(img,CV_32F);
beeing img a cv::Mat, the time increases when the number of threads is increased. I have made test also disabling the hiper-threading option in the CPU Bios. In that case all the times decrease, been the time with 1 thread 25.000ms, but the problem of time increase is still present (33sec with 2 threads, 43 with 3, 57 with 4)... I dont know if this tells you something
Edit 2
A mcve:
#include "stdafx.h"
#include <future>
#include <chrono>
#include "Filter.h"
#include <iostream>
#include <future>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
long long Ticks();
int WithOpencv(cv::Mat img);
int With_OUT_Opencv(cv::Mat img);
int TestThreads (char *buffer,std::string file);
#define Blur3x3(matrix,f,c) ((matrix[(f-1)*1600+(c-1)] + matrix[(f-1)*1600+c] + matrix[(f-1)*1600+(c+1)] + matrix[f*1600+(c-1)] + matrix[f*1600+c] + matrix[f*1600+(c+1)] + matrix[(f+1)*1600+(c-1)] + matrix[(f+1)*1600+c] + matrix[(f+1)*1600+(c+1)])/9)
int _tmain(int argc, _TCHAR* argv[])
{
std::string file="Test.bmp";
auto function = [&](char *buffer){return TestThreads(buffer,file);};
char *buffers[12];
std::future<int> frames[12];
DWORD tid;
int i,j;
int nframes = 0;
int ncores;
cv::setNumThreads(8);
for (i=0;i<8;i++) buffers[i] = new char[1000*1024*1024];
for (j=1;j<9;j++)
{
ncores = j;
long long t = Ticks();
for (i=0;i<ncores;i++) frames[i] = std::async(std::launch::async,function,buffers[i]);
for (i=0;i<ncores;i++) nframes += frames[i].get();
t = Ticks() - t;
std::cout << "Mean time using " << ncores << " cores is: " << t/nframes << "ms" << std::endl << std::endl;
nframes = 0;
Sleep(2000);
}
for (int i=0;i<8;i++) delete buffers[i];
return NULL;
return 0;
}
int TestThreads (char *buffer,std::string file)
{
long long ta;
int res;
char *ruta=new char[file.length() + 1];
strcpy(ruta,file.c_str());
cv::Mat img (1200, 1600, CV_8UC1);
img=cv::imread(file);
ta = Ticks();
for (int i=0;i<15;i++) {
//Uncomment this and comment next line to test without opencv calls. With_OUT_Opencv implements simple filters with direct operations over mat data
//res = With_OUT_Opencv(img);
res = WithOpencv(img);
}
ta = Ticks() - ta;
std::cout << "Time in file No. 3--> " << ta << std::endl;
return 15;
}
int WithOpencv(cv::Mat img){
cv::Mat img_bin;
cv::Mat img_filtered;
cv::Mat img_filtered2;
cv::Mat img_res;
int Crad_morf=2;
double Tthreshold=20;
cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf+1));
img.convertTo(img,CV_32F);
cv::blur(img, img_filtered, cv::Size(3, 3));
cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
img_res.convertTo(img_res,CV_8UC1,255.0);
cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
if (Crad_morf!=0){
cv::dilate(img_bin, img_bin, element);
}
return 0;
}
int With_OUT_Opencv(cv::Mat img){
unsigned char *baux1 = new unsigned char[1600*1200];
unsigned short *baux2 = new unsigned short[1600*1200];
unsigned char max=0;
int f,c,i;
unsigned char threshold = 177;
for (f=1;f<1199;f++) // Bad Blur filters
{
for (c=1; c<1599; c++)
{
baux1[f*1600+c] = Blur3x3(img.data,f,c);
baux1[f*1600+c] = baux1[f*1600+c] * baux1[f*1600+c];
baux2[f*1600+c] = img.data[f*1600+c] * img.data[f*1600+c];
}
}
for (f=1;f<1199;f++)
{
for (c=1; c<1599; c++)
{
baux1[f*1600+c] = sqrt(Blur3x3(baux2,f,c) - baux1[f*1600+c]);
if (baux1[f*1600+c] > max) max = baux1[f*1600+c];
}
}
threshold = threshold * ((float)max/255.0); // Bad Norm/Bin
for (i=0;i<1600*1200;i++)
{
if (baux1[i]>threshold) baux1[i] = 1;
else baux1[i] = 0;
}
delete []baux1;
delete []baux2;
return 0;
}
long long Ticks()
{
static long long last = 0;
static unsigned ticksPerMS = 0;
LARGE_INTEGER largo;
if (last==0)
{
QueryPerformanceFrequency(&largo);
ticksPerMS = (unsigned)(largo.QuadPart/1000);
QueryPerformanceCounter(&largo);
last = largo.QuadPart;
return 0;
}
QueryPerformanceCounter(&largo);
return (largo.QuadPart-last)/ticksPerMS;
}
I'm confused as to what your question is.
Your initial question suggested that running x number of iterations in serial is considerably faster than running them in parallel. Note: when the same target function is used. And you're wondering why running the same target function is considerably slower in a multithreaded scenario.
However, I now see that your example is comparing the performance of OpenCV with some other custom code. Is that what your question is about?
Related to the question as I initially thought the question was, the answer is: no, running the target function in serial is not considerably faster than running it in parallel. See results and code below.
Results
eight threads took 4104.38 ms
single thread took 7272.68 ms
four threads took 3687 ms
two threads took 4500.15 ms
(on a Apple MBA 2012 i5 & opencv3)
Test code
#include <iostream>
#include <vector>
#include <chrono>
#include <thread>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace std::chrono;
using namespace cv;
class benchmark {
time_point<steady_clock> start = steady_clock::now();
string title;
public:
benchmark(const string& title) : title(title) {}
~benchmark() {
auto diff = steady_clock::now() - start;
cout << title << " took " << duration <double, milli> (diff).count() << " ms" << endl;
}
};
template <typename F>
void repeat(unsigned n, F f) {
while (n--) f();
};
int targetFunction(Mat img){
cv::Mat img_bin;
cv::Mat img_filtered;
cv::Mat img_filtered2;
cv::Mat img_res;
int Crad_morf=2;
double Tthreshold=20;
cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf+1));
img.convertTo(img,CV_32F);
cv::blur(img, img_filtered, cv::Size(3, 3));
cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
img_res.convertTo(img_res,CV_8UC1,255.0);
cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
if (Crad_morf!=0){
cv::dilate(img_bin, img_bin, element);
}
//imshow("WithOpencv", img_bin);
return 0;
}
void runTargetFunction(int nIterations, int nThreads, const Mat& img) {
int nIterationsPerThread = nIterations / nThreads;
vector<thread> threads;
auto targetFunctionFn = [&img]() {
targetFunction(img);
};
setNumThreads(nThreads);
repeat(nThreads, [&] {
threads.push_back(thread([=]() {
repeat(nIterationsPerThread, targetFunctionFn);
}));
});
for(auto& thread : threads)
thread.join();
}
int main(int argc, const char * argv[]) {
string file = "../../opencv-test/Test.bmp";
auto img = imread(file);
const int nIterations = 64;
// let's run using eight threads
{
benchmark b("eight threads");
runTargetFunction(nIterations, 8, img);
}
// let's run using a single thread
{
benchmark b("single thread");
runTargetFunction(nIterations, 1, img);
}
// let's run using four threads
{
benchmark b("four threads");
runTargetFunction(nIterations, 4, img);
}
// let's run using a two threads
{
benchmark b("two threads");
runTargetFunction(nIterations, 2, img);
}
return 0;
}
You are measuring three things:
The time that all threads need to complete the whole task divided by the size of the whole task.
The time required by each individual thread to complete its part of the task.
The time required to complete the whole task.
You are observing that the first time is going down from 47ms to 22ms when increasing the number of threads. That is good! At the same time you are realizing that the time requried by an individual thread increases from 35463 to about 68751 (whatever units). Finally, you are realizing that the overall executing time goes up.
Regarding the second measurement: When increasing the number of threads, the individual threads need longer to perform there respective operations. Two possible explanations:
Your threads are competing for memory bus bandwidth.
Your threads are triggering computations that are multi-threaded by themselves, so effectively they are competing with each other for CPU time.
Now for the question why the overall working time increases. The reason is simple: You are not only increasing the number of threads, but you are increasing the work load at the same rate. If your threads were not competing with each other at all and there would be no overhead involved, N threads would require the same time to do N times the work. It does not, so you are noticing a slow down.
Related
We are profiling a complex C++ program that performs multiple iterations of an algorithm. Timing is critical and we want to minimise the execution time of each iteration. The algorithm is such that the execution time should be very similar for each iteration, but we find that the execution time decreases with successive iterations. We suspect that the cache is responsible for this but can't fully explain what we see based on our understanding of caches.
We are running the code on an Intel Xeon processor with Centos 7.6, compiled by g++ 7.3.1.
We managed to demonstrate the behaviour using the simple program shown below:
#include <vector>
#include <fstream>
#include <array>
#include <chrono>
#include <iostream>
int main()
{
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
const unsigned NUM_BUFFERS = 200;
const unsigned BUFFER_SIZE_BYTES = 1024 * 1024;
const unsigned NUM_TRIALS = 50;
std::vector<uint8_t*> buffers;
for (int buff=0; buff<NUM_BUFFERS; ++buff)
buffers.push_back(new uint8_t[BUFFER_SIZE_BYTES]);
std::vector<double> tAll; // Records execution time for each buffer write
tAll.resize(NUM_TRIALS*NUM_BUFFERS);
unsigned indt = 0;
// For each trial
for ( unsigned indTrial=0; indTrial<NUM_TRIALS; indTrial++ )
{
// For all buffers
for ( unsigned indBuffer=0; indBuffer<NUM_BUFFERS; indBuffer++ )
{
t1 = std::chrono::high_resolution_clock::now();
// Increment contents of entire buffer
uint8_t* p_buff = buffers.at(indBuffer);
for ( unsigned ind=0; ind<BUFFER_SIZE_BYTES; ind++ )
{
p_buff[ind]++;
}
t2 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
tAll.at(indt++) = duration.count();
}
}
// Write execution times to a file
std::ofstream fp;
fp.open("TEST_ARTEMIS.TXT");
double max=0;
for ( unsigned ind=0; ind<tAll.size(); ind++ )
{
fp << tAll[ind] << std::endl;
}
}
This program increments every byte of a series of 200 off 1MB buffers.The process is repeated 50 times.
The time for each complete write of each buffer is written to a file. If we plot those times, and zoom in to the first 250 buffer writes, we see:
The first buffer write takes ~10ms, the next few take ~3ms, the next 200 take ~2.5ms, and the time then drops to 2ms.
We don't think this behaviour can be explained just by simple cache behaviour as the L2/L3 caches are not large enough to contain all the buffers, so cache writes should be happening throughout the experiment. It's as though the memory gets 'warmed up' and gets faster with time.
Can anyone suggest an explanation for what we are seeing please?
Maybe I’m confusing myself with threads, but my understanding of threading conflicts with each other.
I’ve created a program which uses POSIX pthreads. Without using these threads the program takes 0.061723 seconds to run, and with threads takes 0.081061 seconds to run.
At first I thought this is what should happen, as threads allow something to happen while other things should be able to happen. i.e. processing a lot of data on one thread while still having responsive UI on another, this would mean the processing of the data would take longer as the CPU divides its time between processing UI and processing the data.
However, surely the point of multithreading is to make the program take advantage of multiple CPUs/cores?
As you can tell I’m something of an intermediate so excuse me if it’s a simple question.
But what should I expect the program to do?
I’m running this on a mid-2012 Macbook Pro 13” base model. CPU is 22 nm "Ivy Bridge" 2.5 GHz Intel "Core i5" processor (3210M), with two independent processor "cores" on a single silicon chip
UPDATED WITH CODE
This is in main function. I didn’t add variable declaration for convenience but I’m sure you can work out what each does by its name:
// Loop through all items we need to process
//
while (totalNumberOfItemsToProcess > 0 && numberOfItemsToProcessOnEachIteration > 0 && startingIndex <= totalNumberOfItemsToProcess)
{
// As long as we have items to process...
//
// Align the index with number of items to process per iteration
//
const uint endIndex = startingIndex + (numberOfItemsToProcessOnEachIteration - 1);
// Create range
//
Range range = RangeMake(startingIndex,
endIndex);
rangesProcessed[i] = range;
// Create thread
//
// Create a thread identifier, 'newThread'
//
pthread_t newThread;
// Create thread with range
//
int threadStatus = pthread_create(&newThread, NULL, processCoordinatesInRangePointer, &rangesProcessed[i]);
if (threadStatus != 0)
{
std::cout << "Failed to create thread" << std::endl;
exit(1);
}
// Add thread to threads
//
threadIDs.push_back(newThread);
// Setup next iteration
//
// Starting index
//
// Realign the index with number of items to process per iteration
//
startingIndex = (endIndex + 1);
// Number of items to process on each iteration
//
if (startingIndex > (totalNumberOfItemsToProcess - numberOfItemsToProcessOnEachIteration))
{
// If the total number of items to process is less than the number of items to process on each iteration
//
numberOfItemsToProcessOnEachIteration = totalNumberOfItemsToProcess - startingIndex;
}
// Increment index
//
i++;
}
std::cout << "Number of threads: " << threadIDs.size() << std::endl;
// Loop through all threads, rejoining them back up
//
for ( size_t i = 0;
i < threadIDs.size();
i++ )
{
// Wait for each thread to finish before returning
//
pthread_t currentThreadID = threadIDs[i];
int joinStatus = pthread_join(currentThreadID, NULL);
if (joinStatus != 0)
{
std::cout << "Thread join failed" << std::endl;
exit(1);
}
}
The processing functions:
void processCoordinatesAtIndex(uint index)
{
const int previousIndex = (index - 1);
// Get coordinates from terrain
//
Coordinate3D previousCoordinate = terrain[previousIndex];
Coordinate3D currentCoordinate = terrain[index];
// Calculate...
//
// Euclidean distance
//
double euclideanDistance = Coordinate3DEuclideanDistanceBetweenPoints(previousCoordinate, currentCoordinate);
euclideanDistances[index] = euclideanDistance;
// Angle of slope
//
double slopeAngle = Coordinate3DAngleOfSlopeBetweenPoints(previousCoordinate, currentCoordinate, false);
slopeAngles[index] = slopeAngle;
}
void processCoordinatesInRange(Range range)
{
for ( uint i = range.min;
i <= range.max;
i++ )
{
processCoordinatesAtIndex(i);
}
}
void *processCoordinatesInRangePointer(void *threadID)
{
// Cast the pointer to the right type
//
struct Range *range = (struct Range *)threadID;
processCoordinatesInRange(*range);
return NULL;
}
UPDATE:
Here are my global variables, which, are only global for simplicity - don’t have a go!
std::vector<Coordinate3D> terrain;
std::vector<double> euclideanDistances;
std::vector<double> slopeAngles;
std::vector<Range> rangesProcessed;
std::vector<pthread_t> threadIDs;
Correct me if I’m wrong, but, I think the issue was with how the time elapsed was measured. Instead of using clock_t I’ve moved to gettimeofday() and that reports a shorter time, from non threaded time of 22.629000 ms to a threaded time of 8.599000 ms.
Does this seem right to people?
Of course, my original question was based on whether or not a multithreaded program SHOULD be faster or not, so I won’t mark this answer as the correct one for that reason.
Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting with two threads that the code would run about 2 times as fast. Measuring it with a stopwatch I think it runs about 6 times slower! EDIT: Now using the computer and clock() function to tell the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);
int main(int argn, char** argv)
{
// Program entry point
std::cout << "Generating data..." << std::endl;
// Create a vector containing many variables
std::vector<double> data;
for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);
// Calculate mean using 1 core
double mean = 0;
std::cout << "Calculating mean, 1 Thread..." << std::endl;
findmean(&data, 0, data.size(), &mean);
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Repeat, using two threads
std::vector<std::thread> thread;
std::vector<double> result;
result.push_back(0.0);
result.push_back(0.0);
std::cout << "Calculating mean, 2 Threads..." << std::endl;
// Run threads
uint32_t halfsize = data.size() / 2;
uint32_t A = 0;
uint32_t B, C, D;
// Split the data into two blocks
if(data.size() % 2 == 0)
{
B = C = D = halfsize;
}
else if(data.size() % 2 == 1)
{
B = C = halfsize;
D = hsz + 1;
}
// Run with two threads
thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));
// Join threads
thread[0].join();
thread[1].join();
// Calculate result
mean = result[0] + result[1];
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Return
return EXIT_SUCCESS;
}
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
}
I don't think this code is exactly wonderful, if you could suggest ways of improving it then I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
register double holding = *result;
for(uint32_t i = 0; i < length; i ++) {
holding += (*datavec).at(start + i);
}
*result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds 2 Threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as the std::vector<T>::operator[]()) access the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
*result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is an overhead in creating and context-switching threads, even the hardware in which this code run is influencing the results. For such a trivial work like this it's better probably a single thread.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data size is 128MB, which is not alot for modern processors to process in a single loop.
I have already made a post some time ago to ask about a good design for LRU caching (in C++). You can find the question, the answer and some code there:
Better understanding the LRU algorithm
I have now tried to multi-thread this code (using pthread) and came with some really unexpected results. Before even attempting to use locking, I have created a system in which each thread accesses its own cache (see code). I run this code on a 4 cores processor. I tried to run it with 1 thread and 4 thread. When it runs on 1 thread I do 1 million lookups in the cache, on 4 threads, each threads does 250K lookups. I was expecting to get a time reduction with 4 threads but get the opposite. 1 threads runs in 2.2 seconds, 4 threads runs in more than 6 seconds?? I just can't make sense of this result.
Is something wrong with my code? Can this be explained somehow (thread management takes time). It would be great to have the feedback from experts. Thanks a lot -
I compile this code with: c++ -o cache cache.cpp -std=c++0x -O3 -lpthread
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <list>
#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map>
#include <stdint.h>
#include <iostream>
typedef uint32_t data_key_t;
using namespace std;
//using namespace std::tr1;
class TileData
{
public:
data_key_t theKey;
float *data;
static const uint32_t tileSize = 32;
static const uint32_t tileDataBlockSize;
TileData(const data_key_t &key) : theKey(key), data(NULL)
{
float *data = new float [tileSize * tileSize * tileSize];
}
~TileData()
{
/* std::cerr << "delete " << theKey << std::endl; */
if (data) delete [] data;
}
};
typedef shared_ptr<TileData> TileDataPtr; // automatic memory management!
TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
return shared_ptr<TileData>(new TileData(theKey));
}
class CacheLRU
{
public:
list<TileDataPtr> linkedList;
unordered_map<data_key_t, TileDataPtr> hashMap;
CacheLRU() : cacheHit(0), cacheMiss(0) {}
TileDataPtr getData(data_key_t theKey)
{
unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
if (iter != hashMap.end()) {
TileDataPtr ret = iter->second;
linkedList.remove(ret);
linkedList.push_front(ret);
++cacheHit;
return ret;
}
else {
++cacheMiss;
TileDataPtr ret = loadDataFromDisk(theKey);
linkedList.push_front(ret);
hashMap.insert(make_pair<data_key_t, TileDataPtr>(theKey, ret));
if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
const TileDataPtr dropMe = linkedList.back();
hashMap.erase(dropMe->theKey);
linkedList.remove(dropMe);
}
return ret;
}
}
static const uint32_t MAX_LRU_CACHE_SIZE = 100;
uint32_t cacheMiss, cacheHit;
};
int numThreads = 1;
void *testCache(void *data)
{
struct timeval tv1, tv2;
// Measuring time before starting the threads...
double t = clock();
printf("Starting thread, lookups %d\n", (int)(1000000.f / numThreads));
CacheLRU *cache = new CacheLRU;
for (uint32_t i = 0; i < (int)(1000000.f / numThreads); ++i) {
int key = random() % 300;
TileDataPtr tileDataPtr = cache->getData(key);
}
std::cerr << "Time (sec): " << (clock() - t) / CLOCKS_PER_SEC << std::endl;
delete cache;
}
int main()
{
int i;
pthread_t thr[numThreads];
struct timeval tv1, tv2;
// Measuring time before starting the threads...
gettimeofday(&tv1, NULL);
#if 0
CacheLRU *c1 = new CacheLRU;
(*testCache)(c1);
#else
for (int i = 0; i < numThreads; ++i) {
pthread_create(&thr[i], NULL, testCache, (void*)NULL);
//pthread_detach(thr[i]);
}
for (int i = 0; i < numThreads; ++i) {
pthread_join(thr[i], NULL);
//pthread_detach(thr[i]);
}
#endif
// Measuring time after threads finished...
gettimeofday(&tv2, NULL);
if (tv1.tv_usec > tv2.tv_usec)
{
tv2.tv_sec--;
tv2.tv_usec += 1000000;
}
printf("Result - %ld.%ld\n", tv2.tv_sec - tv1.tv_sec,
tv2.tv_usec - tv1.tv_usec);
return 0;
}
A thousand apologies, by keeping debugging the code I realised I made a really bad beginner's mistake, if you look at that code:
TileData(const data_key_t &key) : theKey(key), data(NULL)
{
float *data = new float [tileSize * tileSize * tileSize];
}
from the TikeData class where data is supposed to actually be a member variable of the class... So the right code should be:
class TileData
{
public:
float *data;
TileData(const data_key_t &key) : theKey(key), data(NULL)
{
data = new float [tileSize * tileSize * tileSize];
numAlloc++;
}
};
I am so sorry about that! It's a mistake I have done in the past, and I guess prototyping is great, but it sometimes lead to do such stupid mistakes.
I ran the code with 1 and 4 threads and do now see the speedup. 1 thread takes about 2.3 seconds, 4 threads takes 0.92 seconds.
Thanks all for your help, and sorry if I made you lose your time ;-)
I don't have a concrete answer yet. I can think of several possibilities. One is that testCache() is using random(), which is almost certainly implemented with a single global mutex. (Thus all of your threads are competing for the mutex, which is now ping-ponging between the caches.) ((That's assuming that random() is actually thread-safe on your system.))
Next, testCach() is accessing a CacheLRU which is implemented with unordered_maps and shared_ptrs. The unordered_maps, in particular might be implemented with some kind of global mutex underneath that is causing all of your threads to compete for access.
To really diagnose what is going on here you should do something much simpler inside of testCache(). (First try just taking the sqrt() of an input variable 250K times (vs. 1M times). Then try linearly accessing a C array of size 250K (or 1M). Slowly build up to the complex thing you are currently doing.)
Another possibility has to do with the pthread_join. pthread_join doesn't return until all the threads are done. So if one is taking longer than the others, you are measuring the slowest one. Your computation here seems balanced, but perhaps your OS is doing something unexpected? (Like mapping several threads to one core (perhaps because you have a hyper-threaded processor?, or one thread is moving from one core to another in the middle of the run (perhaps because the OS thinks it is smart when it is not.)
This will be a bit of a "build it up" answer. I'm running your code on a Fedora 16 Linux system with a 4-core AMD cpu and 16GB of RAM.
I can confirm that I'm seeing similar "slower with more threads" behaviour. I removed the random function, which doesn't improve things at all.
I'm going to make some other minor changes.
I’m having a problem writing some C++ AMP code. I have included a sample.
It runs fine on emulated accelerators but crashes the display driver on my hardware (windows 7, NVIDIA GeForce GTX 660, latest drivers) but I can see nothing on wrong with my code.
Is there a problem with my code or is this a hardware/driver/complier issue?
#include "stdafx.h"
#include <vector>
#include <iostream>
#include <amp.h>
int _tmain(int argc, _TCHAR* argv[])
{
// Prints "NVIDIA GeForce GTX 660"
concurrency::accelerator_view target_view = concurrency::accelerator().create_view();
std::wcout << target_view.accelerator.description << std::endl;
// lower numbers do not cause the issue
const int x = 2000;
const int y = 30000;
// 1d array for storing result
std::vector<unsigned int> resultVector(y);
Concurrency::array_view<unsigned int, 1> resultsArrayView(resultVector.size(), resultVector);
// 2d array for data for processing
std::vector<unsigned int> dataVector(x * y);
concurrency::array_view<unsigned int, 2> dataArrayView(y, x, dataVector);
parallel_for_each(
// Define the compute domain, which is the set of threads that are created.
resultsArrayView.extent,
// Define the code to run on each thread on the accelerator.
[=](concurrency::index<1> idx) restrict(amp)
{
concurrency::array_view<unsigned int, 1> buffer = dataArrayView[idx[0]];
unsigned int bufferSize = buffer.get_extent().size();
// needs both loops to cause crash
for (unsigned int outer = 0; outer < bufferSize; outer++)
{
for (unsigned int i = 0; i < bufferSize; i++)
{
// works without this line, also if I change to buffer[0] it works?
dataArrayView[idx[0]][0] = 0;
}
}
// works without this line
resultsArrayView[0] = 0;
});
std::cout << "chash on next line" << std::endl;
resultsArrayView.synchronize();
std::cout << "will never reach me" << std::endl;
system("PAUSE");
return 0;
}
It is very likely that your computation exceeds permitted quantum time (default 2 seconds). After that time the operating systems comes in and restarts the GPU forcefully, this is called Timeout Detection and Recovery (TDR). The software adapter (reference device) does not have the TDR enabled, that is why the computation can exceed permitted quantum time.
Does your computation really require 3000 threads (variable x), each performing 2000 * 3000 (x * y) loop iterations? You can chunk your computation, such that each chunks takes less than 2 seconds to compute. You can also consider disabling TDR or exceeding the permitted quantum time to fit your need.
I highly recommend reading a blog post on how to handle TDRs in C++ AMP, which explains TDR in details: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/07/handling-tdrs-in-c-amp.aspx
Additionally, here is the separate blog post on how to disable the TDR on Windows 8:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/disabling-tdr-on-windows-8-for-your-c-amp-algorithms.aspx