c++ thread creation big overhead - c++

I have the following code, which confuses me a lot:
float OverlapRate(cv::Mat& model, cv::Mat& img) {
if ((model.rows!=img.rows)||(model.cols!=img.cols)) {
return 0;
}
cv::Mat bgr[3];
cv::split(img, bgr);
int counter = 0;
float b_average = 0, g_average = 0, r_average = 0;
for (int i = 0; i < model.rows; i++) {
for (int j = 0; j < model.cols; j++) {
if((model.at<uchar>(i,j)==255)){
counter++;
b_average += bgr[0].at<uchar>(i, j);
g_average += bgr[1].at<uchar>(i, j);
r_average += bgr[2].at<uchar>(i, j);
}
}
}
b_average = b_average / counter;
g_average = g_average / counter;
r_average = r_average / counter;
counter = 0;
float b_stde = 0, g_stde = 0, r_stde = 0;
for (int i = 0; i < model.rows; i++) {
for (int j = 0; j < model.cols; j++) {
if((model.at<uchar>(i,j)==255)){
counter++;
b_stde += std::pow((bgr[0].at<uchar>(i, j) - b_average), 2);
g_stde += std::pow((bgr[1].at<uchar>(i, j) - g_average), 2);
r_stde += std::pow((bgr[2].at<uchar>(i, j) - r_average), 2);
}
}
}
b_stde = std::sqrt(b_stde / counter);
g_stde = std::sqrt(g_stde / counter);
r_stde = std::sqrt(r_stde / counter);
return (b_stde + g_stde + r_stde) / 3;
}
void work(cv::Mat& model, cv::Mat& img, int index, std::map<int, float>& results){
results[index] = OverlapRate(model, img);
}
int OCR(cv::Mat& a, std::map<int,cv::Mat>& b, const std::vector<int>& possible_values)
{
int recog_value = -1;
clock_t start = clock();
std::thread threads[10];
std::map<int, float> results;
for(int i=0; i<10; i++)
{
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
}
for(int i=0; i<10; i++)
threads[i].join();
float min_score = 1000;
int min_index = -1;
for(auto& it:results)
{
if (it.second < min_score) {
min_score = it.second;
min_index = it.first;
}
}
clock_t end = clock();
clock_t t = end - start;
printf ("It took me %d clicks (%f seconds) .\n",t,((float)t)/CLOCKS_PER_SEC);
recog_value = min_index;
}
What the above code does is just simple optical character recognition. I have one optical character as an input and compare it with 0 - 9 ten standard character models to get the most similar one, and then output the recognized value.
When I execute the above code without using ten threads running at the same time, the time is 7ms. BUT, when I use ten threads, it drops down to 1 or 2 seconds for a single optical character recognition.
What is the reason?? The debug information tells that thread creation consumes a lot of time, which is this code:
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
Why? Thanks.

Running multiple threads is useful in only 2 contexts: you have multiple hardware cores (so the threads can run simultaneously) OR each thread is waiting for IO (so one thread can run while another thread is waiting for IO, like a disk load or network transfer).
Your code is not IO bound, so I hope you have 10 cores to run your code. If you don't have 10 cores, then each thread will be competing for scarce resources, and the scarcest resource of all is L1 cache space. If all 10 threads are fighting for 1 or 2 cores and their cache space, then the caches will be "thrashing" and give you 10-100x slower performance.
Try running benchmarking your code 10 different times, with N=1 to 10 threads and see how it performs.
(There is one more reason the have multiple threads, which is when the cores support hyper threading. The OS will"pretend" that 1 core has 2 virtual processors, but with this you don't get 2x performance. You get something between 1x and 2x. But in order to get this partial boost, you have to run 2 threads per core)

Not always is efficient to use threads. If you use threads on small problem, then managing threads cost more time and resources then solving the problem. You must have enough work for threads and good managing work over threads.
If you want to know how many threads you can use on problem or how big must be problem, find Isoeffective functions (psi1, psi2, psi3) from theory of parallel computers.

Related

How to make a thread wait for a condition to perform an operation without it using too much CPU time?

I've been trying to use threads in a matrix operation to make it faster for large matrixes (1000x1000). I've had some sucess so far with the below code. With significant speed improvements in comparision to using a single thread.
void updateG(Matrix &u, Matrix &g, int n, int bgx, int tamx, int tamy)
{
int i, j;
for (i = bgx; i < tamx; i += n)
{
for (j = 0; j < tamy; j++)
{
g(i,j, g(i,j)+ dt * 0.5 * (u(i,j) - (g(i,j) * y)));
}
}
}
void updateGt(Matrix &u, Matrix &g, int tam)strong text
{
int i;
const int n = NT;
std::thread array[n];
for (int i = 0; i < n; i++)
{
array[i] = std::thread(updateG, std::ref(u), std::ref(g), n, i, tam, tam);
}
joinAll(array, n);
}
However, I need to call this operation several times in the main code, and every time this happens, I must initialize the thread array again, creating new threads and wasting a lot of time (according to what I've read online those are expensive).
So, I've developed an alternative solution to create and initialize the thread array only once, and use the same threads to perform the matrix operations every time the function is called. Using some flag variables so the thread only performs the operation when it has to. Like in the following code:
void updateG(int bgx,int tam)
{
while (!flaguGkill[bgx]) {
if (flaguG[bgx]) {
int i, j;
for (i = bgx; i < tam; i += NT)
{
for (j = 0; j < tam; j++)
{
g->operator()(i, j, g->operator()(i, j) + dt * 0.5 * (u->operator()(i, j) - (g->operator()(i, j) * y)));
}
}
flaguG[bgx] = false;
}
}
}
void updateGt()
{
for (int k = 0; k < NT; k++)
{
flaguG[k] = true;
}
for (int i = 0; i < NT; i++)
{
while(flaguG[i]);
}
}
My problem is. This solution, that's supposed to be faster, is much slower than the first one, by a large margin. In my complete code, I have 2 functions like this updateGt and updateXt, and I'm using 4 threads for each, I believe the problem is that while the function is supposed to be idle waiting, it is instead using a lot of CPU time only to keep checking on the codition. Anyone knows if that is really the case, and if so, how could I fix it?
The problem here is called busy waiting. As mentioned in comments you'll want to use std::condition_variable, like this:
std::mutex mutex;
std::condition_variable cv;
while (!flaguGkill[bgx]) {
{
unique_lock<mutex> lock(mutex); // aquire mutex lock as required by condition variable
cv.wait(lock, [this]{return flaguG[bgx];}); // thread will suspend here and release the lock if the expression does not return true
}
int i, j;
for (i = bgx; i < tam; i += NT)
{
for (j = 0; j < tam; j++)
{
g->operator()(i, j, g->operator()(i, j) + dt * 0.5 * (u->operator()(i, j) - (g->operator()(i, j) * y)));
}
}
flaguG[bgx] = false;
}
Note: the section [this] { return flaguG[bgx];}, you may need to alter the capture paramaters (the bit in the []) depending on the scope of those variables
Where you set this to be true, you then need to call
for (int k = 0; k < NT; k++)
{
flaguG[k] = true;
cv.notify_one();
}
// you can then use another condition variable here

C++ Multi-Threaded Operation Slower than Single Thread

I am carrying out a 3D matrix by 1D vector multiplication within a class in C++. All variables are contained within the class. When I create one instance of the class on a single thread and carry out the multiplication 100 times, the multiplication operation takes ~0.8ms each time.
When I create 4 instances of the class, each on a separate thread, and run the multiplication operation 25 times on each, the operation takes ~1.7ms each time. The operations on each thread are being carried out on separate data, and are running on separate cores.
As expected, however, the overall time to complete the 100 matrix multiplications is reduced with 4 threads over a single thread.
My questions are:
1) What is the cause of the slowdown in the multiplication operation when multiple threads are used?
2) Is there any way in which the operation can be sped up?
EDIT:
To clarify the problem:
The overall time to carry out 100 matrix products does decrease when I split them over 4 threads - threading does make the overall program faster.
The timing in question is the actual matrix multiplication within the already created threads (see code). This time excludes thread creation, and memory allocation & deletion. This is the time that doubles when I use 4 threads rather than 1. The overall time to carry out all multiplications halves when I use 4 threads. My question is why are the individual matrix products slower when running on 4 threads rather than 1.
Below is a code sample. It is not my actual code, but a simplified example I have written to demonstrate the problem.
Multiply.h
class Multiply
{
public:
Multiply ();
~Multiply ();
void
DoProduct ();
private:
double *a;
};
Multiply.cpp
Multiply::Multiply ()
{
a = new double[100 * 100 * 100];
std::memset(a,1,100*100*100*sizeof(double));
}
void
Multiply::DoProduct ()
{
double *result = new double[100 * 100];
double *b = new double[100];
std::memset(result,0,100*100*sizeof(double));
std::memset(b,1,100*sizeof(double));
//Timer starts here, i.e. excluding memory allocation and thread creation and the rest
auto start_time = std::chrono::high_resolution_clock::now ();
//matrix product
for (int i = 0; i < 100; ++i)
for (int j = 0; j < 100; ++j)
{
double t = 0;
for (int k = 0; k < 100; ++k)
t = t + a[k + j * 100 + i * 100 * 100] * b[k];
result[j + 100 * i] = result[j + 100 * i] + t;
}
//Timer stops here, i.e. before memory deletion
int time = std::chrono::duration_cast < std::chrono::microseconds > (std::chrono::high_resolution_clock::now () - start_time).count ();
std::cout << "Time: " << time << std::endl;
delete []result;
delete []b;
}
Multiply::~Multiply ()
{
delete[] a;
}
Main.cpp
void
threadWork (int iters)
{
Multiply *m = new Multiply ();
for (int i = 0; i < iters; i++)
{
m->DoProduct ();
}
}
void
main ()
{
int numProducts = 100;
int numThreads = 1; //4;
std::thread t[numThreads];
auto start_time = std::chrono::high_resolution_clock::now ();
for (int i = 0; i < numThreads; i++)
t[i] = std::thread (threadWork, numProducts / numThreads);
for (int i = 0; i < n; i++)
t[i].join ();
int time = std::chrono::duration_cast < std::chrono::microseconds > (std::chrono::high_resolution_clock::now () - start_time).count ();
std::cout << "Time total: " << time << std::endl;
}
Async and thread calls are quite time expensive compare to ordinary function calls. So pre-launch threads and create a thread pool. You push your functions as tasks and request the thread pool to tether these tasks from the prority-queue.
The tasks could be set with priorities to execute in proper order to avoid use and hence delays arising due to use of mutexes and locks
You are launching too many threads , keep it below the maximum allowed by your system to avoid bottlenecks.

Same program execution time on different thread numbers openMP

I have a c++ program that multiplies 2 matrixes. I have to use openMP. This is what I have so far. https://pastebin.com/wn0AXFBG
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
int n = 1;
int Matrix1[1000][100];
int Matrix2[100][2];
int Matrix3[1000][2];
int sum = 0;
ofstream fr("rez.txt");
double t1 = omp_get_wtime();
omp_set_num_threads(n);
#pragma omp parallel for collapse(2) num_threads(n)
for ( int i = 0; i < 10; i++) {
for ( int j = 0; j < 10; j++) {
Matrix1[i][j] = i * j;
}
}
#pragma omp simd
for (int i = 0; i < 100; i++) {
for (int j = 0; j < 2; j++) {
int t = rand() % 100;
if (t < 50) Matrix2[i][j] = -1;
if (t >= 50) Matrix2[i][j] = 1;
}
}
#pragma omp parallel for collapse(3) num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
for (int cj = 0; cj < 2; cj++) {
for (int i = 0; i < 100; i++) {
if(i==0) Matrix3[ci][cj] = 0;
Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
}
}
}
double t2 = omp_get_wtime();
double time = t2 - t1;
fr << time;
return 0;
}
The problem is that I get the same execution times whether I use 1 thread or 8. Pictures of timing added.
I have to show that the time is reduced near to 8 times. I am using the Intel C++ compiler with openMP turned on. Please advise.
First of all, I think, there is a small bug in your program, when you are initializing entries in matrix 1 as Matrix1[i][j] = i * j. The i and j are not going upto 1000 and 100 respectively.
Also, I am not sure if your computer actually supports 8 logical cores or not,
If there are no 8 logical cores then your computer will create 8 threads and one logical core will context switch more than one threads and thus will bring the performance down and thus, high execution time. So be sure about how many actual logical cores are available and specify less than or equal to that amount of cores to num_threads()
Now coming to the question, collapse clause fuses all the loops into one and tries to dynamically schedule that fused loop among p processors. I am not sure about how it deals with the race condition handling, but if you try to parallelize innermost loop without fusing all 3 loops, there is race condition as each thread will try to concurrently update Matrix3[ci][cj] and some kind of synchronization mechanism maybe atomic or reduction clause are needed to ensure correctness.
I am pretty sure that you can parallelize outer loop without any kind of race condition and also get a speedup near the number of processors you have employed (Again, as far as number of processors are less than or equal to number of logical cores) and I would suggest changing segment of your code as below.
// You can also use this function to set number of threads:
// omp_set_num_threads(n);
#pragma omp parallel for num_threads(n)
for (int ci = 0; ci < 1000; ci++) {
for (int cj = 0; cj < 2; cj++) {
for (int i = 0; i < 100; i++) {
if(i==0) Matrix3[ci][cj] = 0;
Matrix3[ci][cj] += Matrix1[ci][i] * Matrix2[i][cj];
}
}
}

threading program in C++ not faster

I have a program which reads the file line by line and then stores each possible substring of length 50 in a hash table along with its frequency. I tried to use threads in my program so that it will read 5 lines and then use five different threads to do the processing. The processing involves reading each substring of that line and putting them into hash map with frequency. But it seems there is something wrong which I could not figure out for which the program is not faster then the serial approach. Also, for large input file it is aborted. Here is the piece of code I am using
unordered_map<string, int> m;
mutex mtx;
void parseLine(char *line, int subLen){
int no_substr = strlen(line) - subLen;
for(int i = 0; i <= no_substr; i++) {
char *subStr = (char*) malloc(sizeof(char)* subLen + 1);
strncpy(subStr, line+i, subLen);
subStr[subLen]='\0';
mtx.lock();
string s(subStr);
if(m.find(s) != m.end()) m[s]++;
else {
pair<string, int> ret(s, 1);
m.insert(ret);
}
mtx.unlock();
}
}
int main(){
char **Array = (char **) malloc(sizeof(char *) * num_thread +1);
int num = 0;
while (NOT END OF FILE) {
if(num < num_th) {
if(num == 0)
for(int x = 0; x < num_th; x++)
Array[x] = (char*) malloc(sizeof(char)*strlen(line)+1);
strcpy(Array[num], line);
num++;
}
else {
vector<thread> threads;
for(int i = 0; i < num_th; i++) {
threads.push_back(thread(parseLine, Array[i]);
}
for(int i = 0; i < num_th; i++){
if(threads[i].joinable()) {
threads[i].join();
}
}
for(int x = 0; x < num_th; x++) free(seqArray[x]);
num = 0;
}
}
}
It's a myth that just by the virtue of using threads, the end result must be faster. In general, in order to take advantage of multithreading, two conditions must be met(*):
1) You actually have to have sufficient physical CPU cores, that can run the threads at the same time.
2) The threads have independent tasks to do, that they can do on their own.
From a cursory examination of the shown code, it seems to fail on the second part. It seems to me that, most of the time all of these threads will be fighting each other in order to acquire the same mutex. There's little to be gained from multithreading, in this situation.
(*) Of course, you don't always use threads for purely performance reasons. Multithreading also comes in useful in many other situations too, for example, in a program with a GUI, having a separate thread updating the GUI helps the UI working even while the main execution thread is chewing on something, for a while...

Parallel program no speed increase vs linear program

I have created a model program of a more complex program that will utilize multithreading and multiple harddrives to increase performance. The data size is so large that reading all data into memory will not be feasible so the data will be read, processed, and written back out in chunks. This test program uses pipeline design to be able to read, process and write out at the same time on 3 different threads. Because read and write are to different harddrive, there is no problem with read and write at the same time. However, the program utilizing multithread seems to run 2x slower than its linear version(also in the code). I have tried to have the read and write thread not be destoryed after running a chunk but the synchronization seem to have slowed it down even more than the current version. I was wondering if I am doing something wrong or how I can improve this. Thank You.
Tested using i3-2100 # 3.1ghz and 16GB ram.
#include <iostream>
#include <fstream>
#include <ctime>
#include <thread>
#define CHUNKSIZE 8192 //size of each chunk to process
#define DATASIZE 2097152 //total size of data
using namespace std;
int data[3][CHUNKSIZE];
int run = 0;
int totalRun = DATASIZE/CHUNKSIZE;
bool finishRead = false, finishWrite = false;
ifstream infile;
ofstream outfile;
clock_t starttime, endtime;
/*
Process a chunk of data(simulate only, does not require to sort all data)
*/
void quickSort(int arr[], int left, int right) {
int i = left, j = right;
int tmp;
int pivot = arr[(left + right) / 2];
while (i <= j) {
while (arr[i] < pivot) i++;
while (arr[j] > pivot) j--;
if (i <= j) {
tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
i++;
j--;
}
};
if (left < j) quickSort(arr, left, j);
if (i < right) quickSort(arr, i, right);
}
/*
Find runtime
*/
void diffclock(){
double diff = (endtime - starttime)/(CLOCKS_PER_SEC/1000);
cout<<"Total run time: "<<diff<<"ms"<<endl;
}
/*
Read a chunk of data
*/
void readData(){
for(int i = 0; i < CHUNKSIZE; i++){
infile>>data[run%3][i];
}
finishRead = true;
}
/*
Write a chunk of data
*/
void writeData(){
for(int i = 0; i < CHUNKSIZE; i++){
outfile<<data[(run-2)%3][i]<<endl;
}
finishWrite = true;
}
/*
Pipelines Read, Process, Write using multithread
*/
void threadtransfer(){
starttime = clock();
infile.open("/home/pcg/test/iothread/source.txt");
outfile.open("/media/pcg/Data/test/iothread/ThreadDuplicate.txt");
thread read, write;
run = 0;
readData();
run = 1;
readData();
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
run = 2;
while(run < totalRun){
//cout<<run<<endl;
finishRead = finishWrite = false;
read = thread(readData);
write = thread(writeData);
read.detach();
write.detach();
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
while(!finishRead||!finishWrite){} //check if next cycle is ready.
run++;
}
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
writeData();
run++;
writeData();
infile.close();
outfile.close();
endtime = clock();
diffclock();
}
/*
Linearly read, sort, and write a chunk and repeat.
*/
void lineartransfer(){
int totalRun = DATASIZE/CHUNKSIZE;
int holder[CHUNKSIZE];
starttime = clock();
infile.open("/home/pcg/test/iothread/source.txt");
outfile.open("/media/pcg/Data/test/iothread/Linearduplicate.txt");
run = 0;
while(run < totalRun){
for(int i = 0; i < CHUNKSIZE; i++) infile>>holder[i];
quickSort(holder, 0, CHUNKSIZE - 1);
for(int i = 0; i < CHUNKSIZE; i++) outfile<<holder[i]<<endl;
run++;
}
endtime = clock();
diffclock();
}
/*
Create large amount of data for testing
*/
void createData(){
outfile.open("/home/pcg/test/iothread/source.txt");
for(int i = 0; i < DATASIZE; i++){
outfile<<rand()<<endl;
}
outfile.close();
}
int main(){
int mode=0;
cout<<"Number of threads: "<<thread::hardware_concurrency()<<endl;
cout<<"Enter mode\n1.Create Data\n2.thread copy\n3.linear copy\ninput mode:";
cin>>mode;
if(mode == 1) createData();
else if(mode == 2) threadtransfer();
else if(mode == 3) lineartransfer();
return 0;
}
Don't busy-wait. This wastes precious CPU time and may well slow down the rest (not to mention the compiler can optimize it into an infinite loop because it can't guess whether those flags will change or not, so it's not even correct in the first place). And don't detach() either. Replace both detach() and busy-waiting with join():
while (run < totalRun) {
read = thread(readData);
write = thread(writeData);
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
read.join();
write.join();
run++;
}
As to the global design, well, ignoring the global variables I guess it's otherwise acceptable if you don't expect the processing (quickSort) part to ever exceed the read/write time. I for one would use message queues to pass the buffers between the various threads (which allows to add more processing threads if you need it, either doing the same tasks in parallel or different tasks in sequence) but maybe that's because I'm used to do it that way.
Since you are measuing time using clock on a Linux machine, I expect that the total CPU time is (roughly) the same whether you run one thread or multiple threads.
Maybe you want to use time myprog instead? Or use gettimeofday to fetch the time (which will give you a time in seconds + nanoseconds [although the nanoseconds may not be "accurate" down to the last digit].
Edit:
Next, don't use endl when writing to a file. It slows things down a lot, because the C++ runtime goes and flushes to the file, which is an operating system call. It is almost certainly somehow protected against multiple threads, so you have three threads doing write-data, a single line, synchronously, at a time. Most likely going to take nearly 3x as long as running a single thread. Also, don't write to the same file from three different threads - that's going to be bad in one way or another.
Please correct me if I am wrong, but it seems your threaded function is basically a linear function doing 3 times the work of your linear function?
In a threaded program you would create three threads and run the readData/quicksort functions once on each thread (distributing thee workload), but in your program it seems like the thread simulation is actually just reading three times, quicksorting three times, and writing three times, and totalling the time it takes to do all three of each.