Basically I haven't done multi-threaded programming before, though I am aware of the concepts.
So I started with some coding around random number generation. The code works, but it produces results more slowly than the single-threaded version, so I wanted to know what the loopholes in my code are and how to improve performance.
If I try to generate the numbers 1-1500 randomly using a single thread versus 10 threads (or 5 threads), the single thread executes faster. Thread switching or locking seems to be taking the time. How do I handle that?
pthread_mutex_t map_lock;
std::set<int> numSet;
int randcount = 0;

static void *thread_fun (void *arg)
{
    int randNum = *(int *) arg;
    int result;

    while (randcount != randNum - 1) {
        result = rand () % randNum;
        if (result == 0) continue;
        pthread_mutex_lock (&map_lock);
        const bool is_in = (numSet.find (result) != numSet.end ());
        if (!is_in)
        {
            numSet.insert (result);
            printf (" %d\t", result);
            randcount++;
        }
        pthread_mutex_unlock (&map_lock);
    }
    return NULL;
}
Since the majority of your code blocks all parallel threads (because it sits between a pthread_mutex_lock (&map_lock); and a pthread_mutex_unlock (&map_lock); call), your code effectively runs sequentially, only with the added overhead of parallelisation.
Tip: have each thread only collect its results, then pass them back to the main thread, which displays them. If you don't access your set in parallel but instead pass back partial lists from each thread, you don't have to deal with the concurrency that is slowing your code down.
So I built a program that should be released to production soon, but I'm worried that if I run into a situation where all threads lock/wait, the pipeline will be compromised. I am pretty sure I designed it so this won't happen, but if it did, I'd like to kill all the threads and produce a boilerplate output. My first idea was simply to code a thread that monitors the iterations of all the other threads and kills them if no iteration occurs for 5 seconds, but this doesn't seem to work, and there's also the problem that all the threads are in some random state of execution:
void deadlock_monitor() {
    while (true) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        int64_t time_diff = gnut_GetMicroTime() - last_thread_iter;
        if (((time_diff / 1000) > 5000) && !processing_completed) {
            exit(1);
        }
        if (processing_completed) {
            return;
        }
    }
}
Is there a best practice to deal with this, or is ensuring there are no race conditions all I can do?
I have some code that needs to benchmark multiple algorithms. But before they can be benchmarked they need to be prepared. I would like to do this preparation multi-threaded, but the benchmarking needs to be sequential. Normally I would create threads for the preparation, wait for them to finish with join, and do the benchmarking in the main thread. However, the preparation and benchmarking are done in a separate process after a fork, because sometimes the preparation or the benchmarking may take too long. (So there is also a timer process, created by a fork, which kills the other process after x seconds.) And the preparation and benchmarking have to be done in the same process, otherwise the benchmarking does not work. So I was wondering: if I make a thread for every algorithm, is there a way to let them run concurrently until a certain point, then let them all wait until the others reach that point, and then let them do the rest of the work sequentially?
Here is the code that would be executed in a thread:
void prepareAndBenchmark(algorithm) {
    // The timer process that stops the worker after x seconds
    pid_t timeout_pid = fork();
    if (timeout_pid == 0) {
        sleep(x);
        _exit(0);
    }
    // The actual work
    pid_t worker_pid = fork();
    if (worker_pid == 0) {
        // Concurrently:
        prepare(algorithm);
        // Concurrently up until this point.
        // From here on all the threads should run sequentially, one after the other:
        double result = benchmark(algorithm);
        exit(0);
    }
    int status;
    pid_t exited_pid = wait(&status);
    if (exited_pid == worker_pid) {
        kill(timeout_pid, SIGKILL);
        if (status == 0) {
            // I would use pipes to get the result of the benchmark.
        } else {
            // Something went wrong.
        }
    } else {
        // It took too long.
        kill(worker_pid, SIGKILL);
    }
    wait(NULL);
}
I have also read that forking from threads might cause problems; would that be an issue in this code?
I think I could use a mutex to have only one thread benchmarking at a time, but I don't want a thread to benchmark while others are still preparing.
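The "concurrent until a point, then strictly one at a time" part can be done with a turn counter guarded by a condition variable. This sketch omits the fork/timeout machinery, and prepare/benchmark are stand-ins for the real functions: every thread runs its concurrent phase freely, then waits until the turn counter reaches its id before running its sequential phase.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// All threads prepare concurrently; benchmarks then run strictly one
// thread at a time, in thread-id order. Returns the benchmark order.
std::vector<int> prepareThenBenchmark(int numThreads) {
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;
    std::vector<int> benchmarkOrder;   // records who benchmarked when

    auto worker = [&](int id) {
        // --- concurrent phase: prepare(algorithm) would go here ---
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return turn == id; });   // wait for our turn
        // --- sequential phase: benchmark(algorithm) would go here ---
        benchmarkOrder.push_back(id);
        ++turn;
        cv.notify_all();                            // wake the next thread
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back(worker, i);
    for (auto &t : threads) t.join();
    return benchmarkOrder;   // always {0, 1, ..., numThreads - 1}
}
```

No benchmark starts until its predecessor has finished, and a slow preparer simply makes its successors wait at the condition variable, so no benchmark overlaps anyone's preparation that comes before it in the order.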
I have a big file and I want to read and also [process] all lines (the even lines) of the file with multiple threads.
One suggestion is to read the whole file, break it into multiple files (the same count as threads), and then let every thread process a specific file. Since this idea reads the whole file, writes it out again, and then reads multiple files, it seems slow (3x the I/O), and I think there must be better scenarios.
The scenario I thought of myself seems better:
One thread reads the file and puts the data in a global variable, and the other threads read the data from that variable and process it. In more detail:
One thread reads the main file, running function func1, and puts each even line in a buffer, line1Buffer, of maximum size MAX_BUFFER_SIZE; the other threads pop their data from the buffer and process it, running function func2. In code:
Global variables:
#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool keepRunning = true; // 'continue' is a C++ keyword, so it can't be a variable name;
                         // set to false to end threads 2 through the last thread
string file = "reads.fq";
Function func1 : (thread 1)
void func1() {
    string ReadSeq;
    ifstream ifstr(file.c_str());
    for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
        getline(ifstr, ReadSeq);
        getline(ifstr, ReadSeq); // reading even lines
        while (line1Buffer.size() == MAX_BUFFER_SIZE)
            ; // busy-wait while the buffer is full
        line1Buffer.push_back(ReadSeq);
    }
    keepRunning = false;
    return;
}
And function func2 : (other threads)
void func2() {
    string ReadSeq;
    while (keepRunning) {
        if (line1Buffer.size() > 0) {     // was 'line2Buffer', a typo
            ReadSeq = line1Buffer.back(); // vector::pop_back() returns void,
            line1Buffer.pop_back();       // so read back() first
            // do the processing....
        }
    }
}
About the speed:
If the reading part is slower, the total time will equal the time to read the file once (and the buffer may hold only 1 line at a time, so only 1 other thread will be able to work alongside thread 1). If the processing part is slower, the total time will equal the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file, writing it into multiple files with 1 thread, and then reading and processing the files with multiple threads...
So there are 2 questions:
1- How do I call the functions from threads so that thread 1 runs func1 and the others run func2?
2- Is there any faster scenario?
3- [Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously we know M + N == numberOfThreads must hold.
Edit: the 3rd question is not right, as multiple threads can't help with reading a single file.
Thanks All
Another approach could be an interleaved thread design.
Reading is done by every thread, but only one at a time. Because of the waiting in the very first iteration, the threads become interleaved.
But this is only a scalable option if work() is the bottleneck (otherwise any non-parallel execution would be better).
Thread:
while (!end) {
    // should be fair!
    lock();
    read();
    unlock();
    work();
}
Basic example (you should probably add some error handling):
void thread_exec(ifstream* file, std::mutex* mutex, int* global_line_counter) {
    std::string line;
    std::vector<std::string> data;
    int i;
    do {
        i = 0;
        // only 1 concurrent reader
        mutex->lock();
        // try to read the maximum number of lines
        while (i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file, line)) {
            // we only want to process the even lines
            if (*global_line_counter % 2 == 0) {
                data.push_back(line);
                i++;
            }
            (*global_line_counter)++;
        }
        mutex->unlock();
        // execute work for every line
        for (size_t j = 0; j < data.size(); j++) {
            work(data[j]);
        }
        // free old data
        data.clear();
        // loop until EOF is reached
    } while (i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
    // counter for checking whether a line is even
    int global_line_counter = 0;
    // open file
    ifstream ifstr(file.c_str());
    // mutex for synchronization
    // maybe a fair lock would be a better solution
    std::mutex mutex;
    // create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter);
    std::vector<std::thread> threads(NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
    }
    // wait until all threads have finished
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
    }
}
What is your bottleneck? Hard disk or processing time?
If it's the hard disk, then you're probably not going to get any more performance out of it, as you've hit the limits of the hardware. Sequential reads are by far faster than trying to jump around the file, so having multiple threads trying to read your file will almost certainly reduce the overall speed, since it increases disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.
Assume we have #p threads. Two scenarios were mentioned in the post and answers:
1) Reading with one thread and processing with the other threads. In this case #p-1 threads process while only one thread reads. Let the time for the full job be jobTime and the time for processing with n threads be pTime(n). The worst case occurs when reading is much slower than processing, giving jobTime = pTime(1) + readTime; the best case is when processing is slower than reading, in which case jobTime equals pTime(#p-1) + readTime.
2) Reading and processing with all #p threads. In this scenario every thread does two steps. The first step is to read a part of the file of size MAX_BUFFER_SIZE, which is sequential; no two threads can read at the same time. The second step is processing the read data, which can be parallel. This way, in the worst case, jobTime is pTime(1) + readTime as before (but*), while the best case is pTime(#p) + readTime, which is better than the previous scenario.
*: In the 2nd approach's worst case, even though reading is slower, you can find an optimized MAX_BUFFER_SIZE for which (in the worst case) some reading by one thread overlaps with some processing by another thread. With this optimized MAX_BUFFER_SIZE, jobTime will be less than pTime(1) + readTime and can converge toward readTime.
First off, reading a file is a slow operation, so unless you are doing some super-heavy processing, the file reading will be the limiting factor.
If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push in at the front and pop out at the back. A std::deque should work well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.
One last thing: you will need to limit the size of the queue for the scenario where we are pushing faster than we are popping.
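A minimal bounded queue along those lines (class and member names are illustrative): push at the front blocks when the queue is full, pop at the back blocks when it is empty, and all access is guarded by one mutex with two condition variables.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is at capacity, then pushes at the front.
    void push(std::string item) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < capacity_; });
        q_.push_front(std::move(item));
        not_empty_.notify_one();
    }

    // Blocks while the queue is empty, then pops from the back (FIFO).
    std::string pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });
        std::string item = std::move(q_.back());
        q_.pop_back();
        not_full_.notify_one();
        return item;
    }

private:
    std::deque<std::string> q_;
    size_t capacity_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};
```

The capacity bound is what applies back-pressure: if the reader gets ahead of the workers, its push() simply blocks until a worker pops, so memory use stays flat.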
Recently I started developing on CUDA and ran into a problem with atomicCAS().
To do some manipulation of memory in device code, I have to create a mutex so that only one thread can work with the memory in a critical section of the code.
The device code below runs on 1 block and several threads.
__global__ void cudaKernelGenerateRandomGraph(..., int* mutex)
{
    int i = threadIdx.x;
    ...
    do
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);
    // critical section
    // do some manipulations with objects in device memory
    *mutex = 0;
    ...
}
When the first thread executes
atomicCAS(mutex, 0, 1 + i);
mutex becomes 1. After that, the first thread changes its status from Active to Inactive, and the line
*mutex = 0;
is never executed. The other threads stay in the loop forever. I have tried many variants of this cycle, like while(){};, do{}while();, with a temp variable = *mutex inside the loop, even a variant with if(){} and goto. But the result is the same.
The host part of code:
...
int verticlesCount = 5;
int *mutex;
cudaMalloc((void **)&mutex, sizeof(int));
cudaMemset(mutex, 0, sizeof(int));
cudaKernelGenerateRandomGraph<<<1, verticlesCount>>>(..., mutex);
I use Visual Studio 2012 with CUDA 5.5.
The device is NVidia GeForce GT 240 with compute capability 1.2.
Thanks in advance.
UPD:
After some time working on my diploma project this spring, I found a solution for a critical section on CUDA.
It is a combination of lock-free and mutex mechanisms.
Here is the working code. I used it to implement an atomic dynamically-resizable array.
// *mutex should be 0 before calling this function
__global__ void kernelFunction(..., unsigned long long* mutex)
{
    bool isSet = false;
    do
    {
        if (isSet = (atomicCAS(mutex, 0, 1) == 0))
        {
            // critical section goes here
        }
        if (isSet)
        {
            *mutex = 0; // release the lock ('mutex = 0;' would only null the pointer)
        }
    }
    while (!isSet);
}
The loop in question
do
{
atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);
would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.
However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into groups of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleep) for the divergent threads to finish. [1]
Going back to the loop in question, thread 0 becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0, and the other 31 threads loop forever.
A potential solution is to make a local copy of the shared resource in question, let the 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it is shared by the threads belonging to the same block but not by other blocks. We can use __syncthreads() to control the member threads' access to this variable.
[1] CUDA Best Practices Guide - Branching and Divergence
Avoid different execution paths within the same warp.
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
I wrote a program that employs multithreading for parallel computing. I have verified that on my system (OS X) it maxes out both cores simultaneously. I just ported it to Ubuntu with no modifications needed, because I coded it with that platform in mind. In particular, I am running the Canonical HVM Oneiric image on an Amazon EC2 cluster compute 4x large instance. Those machines feature 2 Intel Xeon X5570 quad-core CPUs.
Unfortunately, my program does not achieve multithreading on the EC2 machine. Running more than 1 thread actually slows the computing marginally for each additional thread. Running top while my program is running shows that when more than 1 thread is initialized, the system % of CPU consumption is roughly proportional to the number of threads. With only 1 thread, %sy is ~0.1. In either case, user % never goes above ~9%.
The following are the threading-relevant sections of my code
const int NUM_THREADS = N; // where changing N is how I set the # of threads

void Threading::Setup_Threading()
{
    sem_unlink("producer_gate");
    sem_unlink("consumer_gate");
    producer_gate = sem_open("producer_gate", O_CREAT, 0700, 0);
    consumer_gate = sem_open("consumer_gate", O_CREAT, 0700, 0);
    completed = 0;
    queued = 0;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
}

void Threading::Init_Threads(vector<NetClass>* p_Pop)
{
    thread_list.assign(NUM_THREADS, pthread_t());
    for (int q = 0; q < NUM_THREADS; q++)
        pthread_create(&thread_list[q], &attr, Consumer, (void*) p_Pop);
}

void* Consumer(void* argument)
{
    std::vector<NetClass>* p_v_Pop = (std::vector<NetClass>*) argument;
    while (1)
    {
        sem_wait(consumer_gate);
        pthread_mutex_lock(&access_queued);
        int index = queued;
        queued--;
        pthread_mutex_unlock(&access_queued);
        Run_Gen((*p_v_Pop)[index - 1]);
        completed--;
        if (!completed)
            sem_post(producer_gate);
    }
}

int main()
{
    ...
    t1 = time(NULL);
    threads.Init_Threads(p_Pop_m);
    for (int w = 0; w < MONTC_NUM_TRIALS; w++)
    {
        queued = MONTC_POP;
        completed = MONTC_POP;
        for (int q = MONTC_POP - 1; q > -1; q--)
            sem_post(consumer_gate);
        sem_wait(producer_gate);
    }
    threads.Close_Threads();
    t2 = time(NULL);
    cout << difftime(t2, t1);
    ...
}
OK, just a guess. There is a simple way to transform parallel code into sequential code. For example:
thread_func:
while (1) {
    pthread_mutex_lock(m1);
    // do something
    pthread_mutex_unlock(m1);
    ...
    pthread_mutex_lock(mN);
    pthread_mutex_unlock(mN);
}
If you run such code in several threads, you will not see any speedup, because of the mutex usage. The code works sequentially, not in parallel: only one thread is doing work at any moment.
The bad thing is that you may not use any mutex in your program explicitly, yet still end up in this situation. For example, a call to "malloc" may take a mutex somewhere in the C runtime, and a call to "write" may take a mutex somewhere in the Linux kernel. Even a call to gettimeofday may cause a mutex lock/unlock (and it does, speaking of Linux/glibc).
You may have only one mutex, but if you spend a lot of time under it, that alone can cause this behaviour.
And because mutexes may be taken somewhere in the kernel and in the C/C++ runtime, you can see different behaviour on different OSes.
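To illustrate the point, here is a hypothetical workload (a simple counter, standing in for real work) written both ways: both versions compute the same total, but in the first the single mutex around the whole loop body makes the threads run one at a time, so adding threads cannot speed it up, while the second keeps all state thread-private until a final merge.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Version 1: the lock covers the entire loop body, serializing all threads.
long sum_coarse_lock(int numThreads, int itersPerThread) {
    long total = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < itersPerThread; ++i) {
                std::lock_guard<std::mutex> lk(m); // held for all the "work"
                ++total;
            }
        });
    for (auto &th : pool) th.join();
    return total;
}

// Version 2: each thread works on its own slot; no lock in the hot loop.
long sum_thread_local(int numThreads, int itersPerThread) {
    std::vector<long> partial(numThreads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t)
        pool.emplace_back([&partial, t, itersPerThread] {
            for (int i = 0; i < itersPerThread; ++i)
                ++partial[t];                      // private: no contention
        });
    for (auto &th : pool) th.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Both return numThreads * itersPerThread, but only the second version scales with thread count; the first behaves like the "consecutive" code described above.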