I'm working on Linux, but the multithreaded and single-threaded versions both take 340 ms. Can someone tell me what I'm doing wrong?
Here is my code:
#include <time.h>
#include <pthread.h>
#include <iostream>
#include <fstream>
#define SIZE_OF_ARRAY 1000000
using namespace std;
struct parameter
{
int *data;
int left;
int right;
};
void readData(int *data)
{
fstream iFile("Data.txt");
for(int i = 0; i < SIZE_OF_ARRAY; i++)
iFile>>data[i];
}
int threadCount = 4;
int Partition(int *data, int left, int right)
{
int i = left, j = right, temp;
int pivot = data[(left + right) / 2];
while(i <= j)
{
while(data[i] < pivot)
i++;
while(data[j] > pivot)
j--;
if(i <= j)
{
temp = data[i];
data[i] = data[j];
data[j] = temp;
i++;
j--;
}
}
return i;
}
void QuickSort(int *data, int left, int right)
{
int index = Partition(data, left, right);
if(left < index - 1)
QuickSort(data, left, index - 1);
if(index < right)
QuickSort(data, index, right); // Partition returns the first index of the right part, so it must be included
}
//Multi threading code starts from here
void *Sort(void *param)
{
parameter *param1 = (parameter *)param;
QuickSort(param1->data, param1->left, param1->right);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
clock_t start, diff;
int *data = new int[SIZE_OF_ARRAY];
pthread_t threadID, threadID1;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
parameter param, param1;
readData(data);
start = clock();
int index = Partition(data, 0, SIZE_OF_ARRAY - 1);
if(0 < index - 1)
{
param.data = data;
param.left = 0;
param.right = index - 1;
pthread_create(&threadID, NULL, Sort, (void *)&param);
}
if(index < SIZE_OF_ARRAY - 1)
{
param1.data = data;
param1.left = index; // the right part starts at index
param1.right = SIZE_OF_ARRAY - 1; // last valid element, not one past the end
pthread_create(&threadID1, NULL, Sort, (void *)&param1);
}
pthread_attr_destroy(&attr);
pthread_join(threadID, NULL);
pthread_join(threadID1, NULL);
diff = clock() - start;
cout<<"Sorting Time = "<<diff * 1000 / CLOCKS_PER_SEC<<"\n";
delete []data;
return 0;
}
//Multithreading Ends here
Single thread main function
int main(int argc, char *argv[])
{
clock_t start, diff;
int *data = new int[SIZE_OF_ARRAY];
readData(data);
start = clock();
QuickSort(data, 0, SIZE_OF_ARRAY - 1);
diff = clock() - start;
cout<<"Sorting Time = "<<diff * 1000 / CLOCKS_PER_SEC<<"\n";
delete []data;
return 0;
}
//Single thread code ends here
The single-threaded and multithreaded versions share some of the same functions.
clock returns total CPU time, not wall time.
If you have 2 CPUs and 2 threads, then after a second of running both threads simultaneously, clock will return a CPU time of 2 seconds (the sum of the CPU times of each thread).
So the result is totally expected. It does not matter how many CPUs you have, the total running time summed over all CPUs will be the same.
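To make the difference visible, here is a minimal sketch, assuming a C++11 compiler, that times the same work with both clock() and std::chrono::steady_clock. The names burn, Timing and measure are made up for this illustration and are not part of the question's code.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical CPU-bound workload; volatile keeps the loop from being optimized away.
void burn(long iterations) {
    volatile long sink = 0;
    for (long i = 0; i < iterations; ++i) sink = sink + i;
}

struct Timing {
    double cpu_ms;   // from clock(): CPU time summed over all threads (on Linux)
    double wall_ms;  // from steady_clock: elapsed real time
};

// Runs burn() on `threads` threads at once and reports both kinds of time.
Timing measure(int threads, long iterations) {
    assert(threads > 0);
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) pool.emplace_back(burn, iterations);
    for (auto& th : pool) th.join();

    auto w1 = std::chrono::steady_clock::now();
    std::clock_t c1 = std::clock();
    return { 1000.0 * (c1 - c0) / CLOCKS_PER_SEC,
             std::chrono::duration<double, std::milli>(w1 - w0).count() };
}
```

With two threads on a machine that has at least two free cores, I would expect cpu_ms to come out at roughly twice wall_ms, which is the effect described above. The exact numbers depend on the machine.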
Note that you call Partition once from the main thread...
The code works on the same memory block, which can stall one CPU while the other accesses that same block. Unless your data is really large, you're likely to have many such hits.
Finally, if your algorithm already runs at memory speed with one thread, adding more threads doesn't help. I did such tests a while back with image data, and having multiple threads decreased the total speed because the process was so memory intensive that both threads were fighting to access memory... and the result was worse than not having threads at all.
What makes today's really fast computers fast is running one very intensive process per computer, not a large number of threads (or processes) on a single computer.
Build a thread pool with a producer-consumer queue with 24 threads hanging off it. Partition your data into two and issue a mergesort task object to the pool. The mergesort object should issue further pairs of mergesorts to the queue and wait on a signal for them to finish, and so on, until a mergesort object finds that it has [L1 cache-size data]. The object then quicksorts its data and signals completion to its parent task.
If that doesn't turn out to be blindingly quick on 24 cores, I'll stop posting about threads..
..and it will handle multiple sorts in parallel.
..and the pool can be used for other tasks.
.. and there is no performance-destroying, deadlock-generating join(), synchronize(), (if you except the P-C queue, which only locks for long enough to push an object ref on), no thread-creation overhead and no dodgy thread-stopping/terminating/destroying code. Like the barbers, there is no waiting - as soon as a thread is finished with a task it can get another.
No thread micro-management, no tuning, (you could create 64 threads now, ready for the next generation of boxes). You could make the thread count tuneable - just add more threads at runtime, or delete some by queueing up poison-pills.
You don't need a reference to the threads at all - just set 'em off, (pass queue as parameter).
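A rough sketch of the pool itself, assuming C++11, could look like the following. It only covers the producer-consumer queue and the poison pills; the mergesort tasks and their completion signals are left out. ThreadPool, submit and run are names I made up for the illustration, and an empty std::function stands in for the poison pill.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal pool: workers block on a shared queue and run whatever task arrives.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        assert(n > 0);
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Queue one poison pill (an empty task) per worker, then wait for them to exit.
    ~ThreadPool() {
        for (std::size_t i = 0; i < workers_.size(); ++i) submit(nullptr);
        for (auto& w : workers_) w.join();
    }

    // Producer side: the lock is held only long enough to push the task.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    // Consumer side: pop a task, run it outside the lock, repeat.
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return !tasks_.empty(); });
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            if (!task) return;  // poison pill: this worker stops
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
};
```

Because the queue is first-in, first-out, every task submitted before the pool is destroyed runs before the pills are reached. The only join() in this sketch is in the destructor, when the whole pool is torn down.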
I have the following code, which confuses me a lot:
float OverlapRate(cv::Mat& model, cv::Mat& img) {
if ((model.rows!=img.rows)||(model.cols!=img.cols)) {
return 0;
}
cv::Mat bgr[3];
cv::split(img, bgr);
int counter = 0;
float b_average = 0, g_average = 0, r_average = 0;
for (int i = 0; i < model.rows; i++) {
for (int j = 0; j < model.cols; j++) {
if((model.at<uchar>(i,j)==255)){
counter++;
b_average += bgr[0].at<uchar>(i, j);
g_average += bgr[1].at<uchar>(i, j);
r_average += bgr[2].at<uchar>(i, j);
}
}
}
b_average = b_average / counter;
g_average = g_average / counter;
r_average = r_average / counter;
counter = 0;
float b_stde = 0, g_stde = 0, r_stde = 0;
for (int i = 0; i < model.rows; i++) {
for (int j = 0; j < model.cols; j++) {
if((model.at<uchar>(i,j)==255)){
counter++;
b_stde += std::pow((bgr[0].at<uchar>(i, j) - b_average), 2);
g_stde += std::pow((bgr[1].at<uchar>(i, j) - g_average), 2);
r_stde += std::pow((bgr[2].at<uchar>(i, j) - r_average), 2);
}
}
}
b_stde = std::sqrt(b_stde / counter);
g_stde = std::sqrt(g_stde / counter);
r_stde = std::sqrt(r_stde / counter);
return (b_stde + g_stde + r_stde) / 3;
}
void work(cv::Mat& model, cv::Mat& img, int index, std::map<int, float>& results){
results[index] = OverlapRate(model, img);
}
int OCR(cv::Mat& a, std::map<int,cv::Mat>& b, const std::vector<int>& possible_values)
{
int recog_value = -1;
clock_t start = clock();
std::thread threads[10];
std::map<int, float> results;
for(int i=0; i<10; i++)
{
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
}
for(int i=0; i<10; i++)
threads[i].join();
float min_score = 1000;
int min_index = -1;
for(auto& it:results)
{
if (it.second < min_score) {
min_score = it.second;
min_index = it.first;
}
}
clock_t end = clock();
clock_t t = end - start;
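printf("It took me %ld clicks (%f seconds).\n", (long)t, ((float)t)/CLOCKS_PER_SEC); // clock_t is not an int, so cast it for printf
recog_value = min_index;
return recog_value; // the function is declared to return int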
}
The above code does simple optical character recognition. I take one optical character as input, compare it with ten standard character models (0 to 9) to find the most similar one, and then output the recognized value.
When I execute the above code without running ten threads at the same time, it takes 7 ms. BUT, when I use ten threads, it slows down to 1 or 2 seconds for a single optical character recognition.
What is the reason? The debug information shows that thread creation consumes a lot of time, in this line:
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
Why? Thanks.
Running multiple threads is useful in only 2 contexts: you have multiple hardware cores (so the threads can run simultaneously) OR each thread is waiting for IO (so one thread can run while another thread is waiting for IO, like a disk load or network transfer).
Your code is not IO bound, so I hope you have 10 cores to run your code. If you don't have 10 cores, then each thread will be competing for scarce resources, and the scarcest resource of all is L1 cache space. If all 10 threads are fighting for 1 or 2 cores and their cache space, then the caches will be "thrashing" and give you 10-100x slower performance.
Try benchmarking your code 10 times, with N = 1 to 10 threads, and see how it performs.
(There is one more reason to have multiple threads, which is when the cores support hyper-threading. The OS will "pretend" that 1 core has 2 virtual processors, but with this you don't get 2x performance. You get something between 1x and 2x. But in order to get this partial boost, you have to run 2 threads per core.)
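The benchmark suggested above could be sketched like this, assuming C++11. timeWithThreads is a hypothetical helper, not something from the question: it spreads a fixed number of jobs over a chosen number of threads and returns the wall-clock time.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Runs job(i) for i in [0, jobs) spread over `threads` threads; returns wall time in ms.
double timeWithThreads(int threads, int jobs, const std::function<void(int)>& job) {
    assert(threads > 0);
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        // Thread t handles jobs t, t + threads, t + 2 * threads, ...
        pool.emplace_back([=, &job] {
            for (int i = t; i < jobs; i += threads) job(i);
        });
    }
    for (auto& th : pool) th.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For the OCR case, you would call it in a loop with threads going from 1 to 10 and jobs fixed at 10, where job(i) compares the input against model i and stores the score in slot i of a pre-sized vector. Giving each job its own slot means the threads never write to the same element.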
Using threads is not always efficient. If you use threads on a small problem, managing the threads costs more time and resources than solving the problem itself. You need enough work for the threads and a good way of distributing that work among them.
If you want to know how many threads you can use on a problem, or how big the problem must be, look up isoefficiency functions (psi1, psi2, psi3) in the theory of parallel computing.
I have a program which reads a file line by line and stores each possible substring of length 50 in a hash table along with its frequency. I tried to use threads so that the program reads 5 lines and then uses five different threads to do the processing. The processing involves reading each substring of a line and putting it into the hash map with its frequency. But something seems to be wrong that I cannot figure out, because the program is not faster than the serial approach. Also, it aborts on large input files. Here is the piece of code I am using:
unordered_map<string, int> m;
mutex mtx;
void parseLine(char *line, int subLen){
int no_substr = strlen(line) - subLen;
for(int i = 0; i <= no_substr; i++) {
char *subStr = (char*) malloc(sizeof(char)* subLen + 1);
strncpy(subStr, line+i, subLen);
subStr[subLen]='\0';
mtx.lock();
string s(subStr);
free(subStr); // s holds its own copy; without this, every iteration leaks the buffer
if(m.find(s) != m.end()) m[s]++;
else {
pair<string, int> ret(s, 1);
m.insert(ret);
}
mtx.unlock();
}
}
int main(){
char **Array = (char **) malloc(sizeof(char *) * (num_th + 1)); // num_th: number of threads, as used below
int num = 0;
while (NOT END OF FILE) {
if(num < num_th) {
if(num == 0)
for(int x = 0; x < num_th; x++)
Array[x] = (char*) malloc(sizeof(char)*strlen(line)+1);
strcpy(Array[num], line);
num++;
}
else {
vector<thread> threads;
for(int i = 0; i < num_th; i++) {
threads.push_back(thread(parseLine, Array[i], 50)); // 50 is the substring length; parseLine takes two arguments
}
for(int i = 0; i < num_th; i++){
if(threads[i].joinable()) {
threads[i].join();
}
}
for(int x = 0; x < num_th; x++) free(Array[x]);
num = 0;
}
}
}
It's a myth that just by the virtue of using threads, the end result must be faster. In general, in order to take advantage of multithreading, two conditions must be met(*):
1) You actually have to have sufficient physical CPU cores, that can run the threads at the same time.
2) The threads have independent tasks to do, that they can do on their own.
From a cursory examination of the shown code, it seems to fail on the second part. It seems to me that, most of the time all of these threads will be fighting each other in order to acquire the same mutex. There's little to be gained from multithreading, in this situation.
(*) Of course, you don't always use threads purely for performance reasons. Multithreading is useful in many other situations too. For example, in a program with a GUI, having a separate thread update the GUI keeps the UI responsive even while the main execution thread is chewing on something for a while...
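One way to give the threads independent work, and so avoid the fight over the mutex described above, is to let each thread count into its own map and merge the maps after joining. This is only a sketch under my own assumptions: countLine, countBatch and Counts are made-up names, and it uses std::string instead of the malloc'd buffers in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

using Counts = std::unordered_map<std::string, int>;

// Counts every substring of length subLen in one line; touches no shared state.
void countLine(const std::string& line, std::size_t subLen, Counts& local) {
    for (std::size_t i = 0; i + subLen <= line.size(); ++i)
        ++local[line.substr(i, subLen)];
}

// One thread per line in the batch, each with its own map; merge after joining.
Counts countBatch(const std::vector<std::string>& lines, std::size_t subLen) {
    assert(subLen > 0);
    std::vector<Counts> partial(lines.size());
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < lines.size(); ++t)
        threads.emplace_back(countLine, std::cref(lines[t]), subLen, std::ref(partial[t]));
    for (auto& th : threads) th.join();

    Counts total;  // merged on the calling thread, so no mutex is needed
    for (const auto& p : partial)
        for (const auto& kv : p) total[kv.first] += kv.second;
    return total;
}
```

The merge step is serial, so whether this ends up faster than the single-threaded version depends on how much work each line contains. It is worth measuring rather than assuming.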
I have a __global__ function that gets an array and an index into the array.
The function needs to find a word in a dictionary and where it starts in a given sequence.
But I see that the threads overwrite each other's results, so I guess it's because of a race condition on memory.
What can I do?
__global__ void find_words(int* dictionary, int dictionary_size, int* indeces,
int indeces_size, int *sequence, int sequence_size,
int longest_word, int* devWords, int *counter)
{
int id = blockIdx.x * blockDim.x + threadIdx.x;
int start = id * (CHUNK_SIZE - longest_word);
int finish = start + CHUNK_SIZE;
int word_index = -1;
if (finish > sequence_size)
{
finish = sequence_size;
}
// search in a closed area
while(start < finish)
{
find_word_in_phoneme_dictionary_kernel(dictionary, dictionary_size,
indeces, indeces_size, sequence, &word_index, start, finish);
if(word_index >= 0 && word_index <= indeces[indeces_size-1])
{
devWords[*counter] = word_index;
devWords[*counter+1] = start; // index in sequence
*counter+=2;
start += dictionary[word_index];
}
else
{
start++;
}
}
__syncthreads();
}
I also tried giving each thread its own array and counter to store its results in,
and then collecting the results of all the threads, but I don't understand how to implement the gather in CUDA. Any help?
I guess the problem is that your counter is read and incremented by multiple threads. As a result, multiple threads will use the same counter value as index in the array. You should instead use int atomicAdd(int* address, int val); to increment the counter. The code would look like this:
int oldCounter = atomicAdd(counter, 2);
devWords[oldCounter] = word_index;
devWords[oldCounter+1] = start;
Note that I incremented counter before accessing the array. atomicAdd(...) returns the old value of the counter, which I then used to access the array.
The atomic operations, however, are serialized, which means that incrementing the counter cannot run in parallel. The rest of the code still runs in parallel though.
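If you want to try the reserve-then-write pattern without a GPU, a host-side analogy in plain C++ is sketched below. This is not CUDA code: std::atomic<int>::fetch_add plays the role of atomicAdd, and record and collect are names invented for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Each worker reserves two adjacent slots with one atomic add, then fills them.
void record(std::vector<int>& out, std::atomic<int>& counter, int wordIndex, int start) {
    int old = counter.fetch_add(2);  // returns the value before the add, like atomicAdd
    out[old] = wordIndex;
    out[old + 1] = start;
}

// Runs `workers` threads that each record one (wordIndex, start) pair.
std::vector<int> collect(int workers) {
    assert(workers > 0);
    std::vector<int> out(2 * workers, -1);
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < workers; ++t)
        threads.emplace_back(record, std::ref(out), std::ref(counter), t, 10 * t);
    for (auto& th : threads) th.join();
    return out;
}
```

The pairs can land in any order, but each pair stays together because both of its slots were reserved by a single atomic operation.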
I have created a model of a more complex program that will use multithreading and multiple hard drives to increase performance. The data is too large to read into memory all at once, so it will be read, processed, and written back out in chunks. This test program uses a pipeline design so that it can read, process, and write at the same time on 3 different threads. Because the reads and writes go to different hard drives, reading and writing at the same time is not a problem. However, the multithreaded version seems to run 2x slower than the linear version (also in the code). I have tried keeping the read and write threads alive instead of destroying them after each chunk, but the synchronization seemed to slow it down even more than the current version. Am I doing something wrong, and how can I improve this? Thank you.
Tested using an i3-2100 @ 3.1 GHz and 16 GB of RAM.
#include <iostream>
#include <fstream>
#include <ctime>
#include <thread>
#define CHUNKSIZE 8192 //size of each chunk to process
#define DATASIZE 2097152 //total size of data
using namespace std;
int data[3][CHUNKSIZE];
int run = 0;
int totalRun = DATASIZE/CHUNKSIZE;
bool finishRead = false, finishWrite = false;
ifstream infile;
ofstream outfile;
clock_t starttime, endtime;
/*
Process a chunk of data(simulate only, does not require to sort all data)
*/
void quickSort(int arr[], int left, int right) {
int i = left, j = right;
int tmp;
int pivot = arr[(left + right) / 2];
while (i <= j) {
while (arr[i] < pivot) i++;
while (arr[j] > pivot) j--;
if (i <= j) {
tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
i++;
j--;
}
};
if (left < j) quickSort(arr, left, j);
if (i < right) quickSort(arr, i, right);
}
/*
Find runtime
*/
void diffclock(){
double diff = (endtime - starttime)/(CLOCKS_PER_SEC/1000);
cout<<"Total run time: "<<diff<<"ms"<<endl;
}
/*
Read a chunk of data
*/
void readData(){
for(int i = 0; i < CHUNKSIZE; i++){
infile>>data[run%3][i];
}
finishRead = true;
}
/*
Write a chunk of data
*/
void writeData(){
for(int i = 0; i < CHUNKSIZE; i++){
outfile<<data[(run-2)%3][i]<<endl;
}
finishWrite = true;
}
/*
Pipelines Read, Process, Write using multithread
*/
void threadtransfer(){
starttime = clock();
infile.open("/home/pcg/test/iothread/source.txt");
outfile.open("/media/pcg/Data/test/iothread/ThreadDuplicate.txt");
thread read, write;
run = 0;
readData();
run = 1;
readData();
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
run = 2;
while(run < totalRun){
//cout<<run<<endl;
finishRead = finishWrite = false;
read = thread(readData);
write = thread(writeData);
read.detach();
write.detach();
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
while(!finishRead||!finishWrite){} //check if next cycle is ready.
run++;
}
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
writeData();
run++;
writeData();
infile.close();
outfile.close();
endtime = clock();
diffclock();
}
/*
Linearly read, sort, and write a chunk and repeat.
*/
void lineartransfer(){
int totalRun = DATASIZE/CHUNKSIZE;
int holder[CHUNKSIZE];
starttime = clock();
infile.open("/home/pcg/test/iothread/source.txt");
outfile.open("/media/pcg/Data/test/iothread/Linearduplicate.txt");
run = 0;
while(run < totalRun){
for(int i = 0; i < CHUNKSIZE; i++) infile>>holder[i];
quickSort(holder, 0, CHUNKSIZE - 1);
for(int i = 0; i < CHUNKSIZE; i++) outfile<<holder[i]<<endl;
run++;
}
endtime = clock();
diffclock();
}
/*
Create large amount of data for testing
*/
void createData(){
outfile.open("/home/pcg/test/iothread/source.txt");
for(int i = 0; i < DATASIZE; i++){
outfile<<rand()<<endl;
}
outfile.close();
}
int main(){
int mode=0;
cout<<"Number of threads: "<<thread::hardware_concurrency()<<endl;
cout<<"Enter mode\n1.Create Data\n2.thread copy\n3.linear copy\ninput mode:";
cin>>mode;
if(mode == 1) createData();
else if(mode == 2) threadtransfer();
else if(mode == 3) lineartransfer();
return 0;
}
Don't busy-wait. This wastes precious CPU time and may well slow down the rest (not to mention the compiler can optimize it into an infinite loop because it can't guess whether those flags will change or not, so it's not even correct in the first place). And don't detach() either. Replace both detach() and busy-waiting with join():
while (run < totalRun) {
read = thread(readData);
write = thread(writeData);
quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
read.join();
write.join();
run++;
}
As to the global design, well, ignoring the global variables, I guess it's otherwise acceptable if you don't expect the processing (quickSort) part to ever exceed the read/write time. I for one would use message queues to pass the buffers between the various threads (which makes it possible to add more processing threads if you need them, either doing the same tasks in parallel or different tasks in sequence), but maybe that's because I'm used to doing it that way.
Since you are measuring time using clock on a Linux machine, I expect that the total CPU time is (roughly) the same whether you run one thread or multiple threads.
Maybe you want to use time myprog instead? Or use gettimeofday to fetch the time (which gives you seconds + microseconds, although the microseconds may not be "accurate" down to the last digit).
Edit:
Next, don't use endl when writing to a file. It slows things down a lot, because the C++ runtime goes and flushes to the file, which is an operating system call. It is almost certainly somehow protected against multiple threads, so you have three threads doing write-data, a single line, synchronously, at a time. Most likely going to take nearly 3x as long as running a single thread. Also, don't write to the same file from three different threads - that's going to be bad in one way or another.
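As a small illustration of the endl point, here is a sketch of a chunk writer that ends each line with '\n' and leaves flushing to the stream. writeChunk and renderChunk are hypothetical helpers, not part of the posted program; renderChunk exists only so the output can be checked in memory.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes one value per line; '\n' ends the line without forcing a flush.
void writeChunk(std::ostream& out, const std::vector<int>& values) {
    assert(out.good());  // precondition: the stream is open and error-free
    for (int v : values) out << v << '\n';
}

// Convenience wrapper that writes the chunk into a string instead of a file.
std::string renderChunk(const std::vector<int>& values) {
    std::ostringstream os;
    writeChunk(os, values);
    return os.str();
}
```

In the posted program the same idea would mean replacing outfile<<...<<endl with outfile<<...<<'\n' in writeData and lineartransfer. The stream still flushes when its buffer fills up and when the file is closed.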
Please correct me if I am wrong, but it seems your threaded function is basically a linear function doing 3 times the work of your linear function?
In a threaded program you would create three threads and run the readData/quicksort functions once on each thread (distributing the workload), but in your program it seems like the thread simulation is actually just reading three times, quicksorting three times, and writing three times, and totalling the time it takes to do all three of each.
I have a very simple function in C++:
double testSpeed()
{
using namespace boost;
int temp = 0;
timer aTimer;
//1 billion iterations.
for(int i = 0; i < 1000000000; i++) {
temp = temp + i;
}
double elapsedSec = aTimer.elapsed();
double speed = 1.0/elapsedSec;
return speed;
}
I want to run this function with multiple threads. I saw examples online that I can
do it as follows:
// start two new threads that calls the "hello_world" function
boost::thread my_thread1(&testSpeed);
boost::thread my_thread2(&testSpeed);
// wait for both threads to finish
my_thread1.join();
my_thread2.join();
However, this will run two threads that each will iterate billion times, right? I want the
two threads to do the job concurrently so the entire thing will run faster. I don't care
about sync, it's just a speed test.
Thank you!
There may be a nicer way, but this should work. It passes the range to iterate over into each thread, starts a single timer before the threads are started, and stops it after they're both done. It should be pretty obvious how to scale this up to more threads.
void testSpeed(int start, int end)
{
int temp = 0;
for(int i = start; i < end; i++)
{
temp = temp + i;
}
}
using namespace boost;
timer aTimer;
// start two new threads that calls the "hello_world" function
boost::thread my_thread1(&testSpeed, 0, 500000000);
boost::thread my_thread2(&testSpeed, 500000000, 1000000000);
// wait for both threads to finish
my_thread1.join();
my_thread2.join();
double elapsedSec = aTimer.elapsed();
double speed = 1.0/elapsedSec;