I have started working on multithreading and point cloud processing. The problem is that I have to add multithreading to an existing implementation, and there are so many read and write operations that using a mutex does not give me enough of a speed-up, mainly because of the sheer number of read operations from the grid.
In the end I modified the code so that I have one vtkSmartPointer<vtkUnstructuredGrid> which holds my point cloud. The only operation the threads have to do is access points using the GetPoint method. However, it is not thread-safe even for read-only access, apparently because of the smart pointers.
Because of that I had to copy my main point cloud for each thread, which in turn causes memory issues when I have too many threads and big clouds.
I tried to cut the point cloud into chunks, but that gets too complicated once there are many threads: I cannot guarantee an optimal number of points for each thread to process. I also do a neighbour search for each point, so cutting the cloud into chunks gets even more complicated, because the chunks need to overlap for the neighbourhood search to work properly.
Since vtkUnstructuredGrid is memory-optimized, I could not replace it with STL containers. I would be happy if you could recommend data structures for point cloud processing that are thread-safe to read, or any other solution I could use.
Thanks in advance
I am not familiar with VTK or how it works.
In general, there are various techniques and methods to improve performance in a multithreaded environment. The question is vague, so I can only provide a correspondingly general answer.
Easy: in case there are many reads and few writes, use std::shared_mutex, as it allows multiple readers simultaneously.
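A minimal sketch of that reader/writer split, assuming C++17; the PointGrid type and the two accessors are made-up placeholders for whatever the threads actually share:
#include <array>
#include <cstddef>
#include <shared_mutex>
#include <vector>

struct PointGrid {                      // placeholder for the shared structure
    std::vector<std::array<double, 3>> points;
};

std::shared_mutex grid_mutex;
PointGrid grid;

// Any number of readers may hold a shared_lock at the same time.
std::array<double, 3> read_point(std::size_t i) {
    std::shared_lock<std::shared_mutex> lock(grid_mutex);
    return grid.points[i];
}

// A writer takes an exclusive lock and temporarily excludes all readers.
void add_point(const std::array<double, 3>& p) {
    std::unique_lock<std::shared_mutex> lock(grid_mutex);
    grid.points.push_back(p);
}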
Moderate: if the threads work with distinct data most of the time - they access the same data array but at distinct locations - then you can implement a handler that ensures the threads concurrently work over distinct, non-overlapping pieces of data, and if a thread asks to work on a piece that is currently being processed, tell it to work on something else or wait.
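One common way to realise such a handler without per-item locks is a shared atomic cursor from which threads claim the next unprocessed chunk; this is only a sketch of the idea, with arbitrary names and chunk size:
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t kChunkSize = 1024;     // tuning parameter, picked arbitrarily here
std::atomic<std::size_t> next_index{0};

// Each worker claims the next free chunk; no two threads ever receive the same range.
void worker(const std::vector<double>& data) {
    for (;;) {
        const std::size_t begin = next_index.fetch_add(kChunkSize);
        if (begin >= data.size()) break;
        const std::size_t end = std::min(begin + kChunkSize, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            // ... process data[i] (read-only access to shared data is fine here) ...
        }
    }
}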
Hard: there are methods that achieve efficient concurrency via std::atomic by exploiting the memory-ordering guarantees of the underlying instructions. I am not too familiar with them and they are definitely not simple, but you can find tutorials on the internet. As far as I know, parts of this area are still in research and development and best practices have not settled yet.
P.S. If there are many reads/writes over the same data... is the existing implementation even aware that the data is shared across several threads? Does it still behave correctly? You might end up needing to rewrite the whole implementation.
I just thought I'd post the solution, because it was actually my own stupidity. I realized that in one part of my code I was using the double* vtkDataSet::GetPoint(vtkIdType ptId) version of GetPoint(), which is not thread-safe.
For multithreaded code, void vtkDataSet::GetPoint(vtkIdType id, double x[3]) should be used instead.
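For reference, a minimal sketch of what a per-thread read loop might look like with that overload; the range splitting and the per-point processing are placeholders, and the grid is assumed to be treated as read-only:
#include <vtkUnstructuredGrid.h>

// Each thread reads into its own stack buffer instead of the shared internal
// buffer returned by the single-argument GetPoint().
void process_range(vtkUnstructuredGrid* grid, vtkIdType begin, vtkIdType end) {
    double p[3];
    for (vtkIdType id = begin; id < end; ++id) {
        grid->GetPoint(id, p);    // thread-safe overload: writes into caller-owned memory
        // ... neighbour search / per-point processing using p ...
    }
}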
Related
I have written a small program that generates images of the Mandelbrot set, and I have been using it as an opportunity to teach myself multithreading.
I currently have four threads that each handle calculating a quarter of the data. When they finish, the data is aggregated to then be drawn to a bitmap.
I'm currently pre-calculating all the complex numbers for each pixel in the main thread and putting them into a vector. Then I split the vector into four smaller vectors to pass into each thread to modify.
Is there a best practice here? Should I be splitting up my data set so that the threads can work without interfering with each other, or should I just use one data set and use mutexes/locking? I suppose benchmarking would probably be my best bet.
Thanks, let me know if you'd want to see my code.
The best practice is to make threads as independent of each other as possible. I'm not familiar with the particular problem you're trying to solve, but if it allows equally dividing the work among threads, splitting up the data set will be the most efficient way (a minimal sketch follows at the end of this answer). When splitting data, keep false sharing in mind, and minimize cross-thread data movement.
Choosing other parallelisation strategies makes sense in cases where, e.g.:
Eliminating cross-thread dependencies requires a change to the algorithm that will cause too much extra work.
The amount of work per thread isn't balanced, and you need some dynamic work assignment to ensure all threads are busy until work is completed.
The algorithm is composed of different stages such that task parallelism is more efficient than data parallelism (namely, each stage is handled by a different thread, and data is pipelined between threads. This makes sense if there are too many dependencies within each stage).
Bear in mind that a mutex/lock means time wasted waiting, as well as possibly non-trivial synchronisation overhead if the mutex is a kernel object. However, correctness comes first: if other options are too difficult to get right, you'll lose more than you'll gain. Finally, always compare your parallel implementation to a sequential one: due to data movement and dependencies, the sequential implementation can turn out to run faster than the parallel one.
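To make the data-splitting suggestion concrete, here is a minimal sketch using std::thread, where each thread owns a disjoint band of rows so no locking is needed; the iteration limit and coordinate mapping are arbitrary stand-ins for the real Mandelbrot parameters:
#include <complex>
#include <thread>
#include <vector>

// Stand-in for the per-pixel work: count iterations until escape.
int iterations_for(std::complex<double> c) {
    std::complex<double> z = 0;
    int i = 0;
    while (i < 1000 && std::norm(z) <= 4.0) { z = z * z + c; ++i; }
    return i;
}

void render(std::vector<int>& out, int width, int height, int num_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Interleaved rows: thread t handles rows t, t + num_threads, ...
            for (int y = t; y < height; y += num_threads) {
                for (int x = 0; x < width; ++x) {
                    std::complex<double> c(-2.0 + 3.0 * x / width,
                                           -1.5 + 3.0 * y / height);
                    out[y * width + x] = iterations_for(c);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}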
From my studies I know the concepts of starvation, deadlock, fairness and other concurrency issues. However, theory differs from practice, to an extent, and real engineering tasks often involve greater detail than academic blah blah...
As a C++ developer I've been concerned about threading issues for a while...
Suppose you have a shared variable x which refers to some larger portion of the program's memory. The variable is shared between two threads A and B.
Now, if we consider read/write operations on x from both A and B threads, possibly at the same time, there is a need to synchronize those operations, right? So the access to x needs some form of synchronization which can be achieved for example by using mutexes.
Now let's consider another scenario where x is initially written by thread A, then passed to thread B (somehow), and that thread only reads x. Thread B then produces a response to x, called y, and passes it back to thread A (again, somehow). My question is: what synchronization primitives should I use to make this scenario thread-safe? I've read about atomics and, more importantly, memory fences - are these the tools I should rely on?
This is not a typical scenario with a "critical section". Instead, some data is passed between threads with no possibility of concurrent writes to the same memory location. So, after being written, the data should first be "flushed" somehow, so that the other thread can see it in a valid and consistent state before reading. What is this called in the literature - is it "visibility"?
What about pthread_once and its Boost/std counterpart, i.e. call_once? Does it help if both x and y are passed between threads through a sort of "message queue" which is accessed by means of the "once" functionality? AFAIK it serves as a sort of memory fence, but I couldn't find any confirmation of this.
What about CPU caches and their coherency? What should I know about that from the engineering point of view? Does such knowledge help in the scenario mentioned above, or any other scenario commonly encountered in C++ development?
I know I might be mixing a lot of topics but I'd like to better understand what is the common engineering practice so that I could reuse the already known patterns.
This question is primarily related to the situation in C++03 as this is my daily environment at work. Since my project mainly involves Linux then I may only use pthreads and Boost, including Boost.Atomic. But I'm also interested if anything concerning such matters has changed with the advent of C++11.
I know the question is abstract and not that precise but any input could be useful.
you have a shared variable x
That's where you've gone wrong. Threading is MUCH easier if you hand off ownership of work items using some sort of threadsafe consumer-producer queue, and from the perspective of the rest of the program, including all the business logic, nothing is shared.
Message passing also helps prevent cache collisions (because there is no true sharing - except of the producer-consumer queue itself, and that has a trivial effect on performance if the unit of work is large - and organizing the data into messages helps reduce false sharing).
Parallelism scales best when you separate the problem into subproblems. Small subproblems are also much easier to reason about.
You seem to already be thinking along these lines, but no, threading primitives like atomics, mutexes, and fences are not very good for applications using message passing. Find a real queue implementation (queue, circular ring, Disruptor - they go under different names but all meet the same need). The primitives will be used inside the queue implementation, but never by application code.
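For orientation only, here is a minimal sketch of the kind of thread-safe producer-consumer queue meant here, built on a mutex and condition variable; in line with the advice above, those primitives live inside the queue, and the rest of the program just calls push() and pop():
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class WorkQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Blocks until an item is available; ownership of the item moves to the caller.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};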
Considering my lack of C++ knowledge, please try to read my intent and not my poor technical wording.
This is the backbone of my program https://github.com/zaphoyd/websocketpp/blob/experimental/examples/broadcast_server/broadcast_server.cpp
I'm building a websocket server with websocket++ (and oh is websocket++ sweet - I highly recommend it), and I can easily manipulate per-user data thread-safely because it really doesn't need to be touched by different threads; however, I do want to be able to write to an array (I'm going to use the catch-all term "array" from weaker languages like VB, PHP, JS) in one function thread (with multiple iterations that could be running simultaneously) and also read it in one or more threads.
Take Stack as an example: if I wanted to have all of the ids (the PRIMARY column of all articles) sorted in a particular way, in this case by net votes, and held in memory, I'm thinking I would have a function running in its own boost::thread, fired whenever a vote on the site comes in, to reorder the array.
How can I do this without locking & blocking? I'm 100% fine with users reading from an old array while another is being built, but I absolutely do not want their reads or the thread writes to ever fail/be blocked.
Does a lock-free array exist? If not, is there some way to build the new array in a temporary array and then write it to the actual array when the building is finished without locking & blocking?
Have you looked at Boost.Lockfree?
Uh, uh, uh. Complicated.
Look here (for an example): RCU -- and this is only about multiple reads along with ONE write.
My guess is that multiple writers at once are not going to work. You should rather look for a more efficient representation than an array, one that allows for faster updates. How about a balanced tree? log(n) updates should never block anything in a noticeable fashion.
Regarding boost -- I'm happy that it finally has proper support for thread synchronization.
Of course, you could also keep a copy and batch the updates. Then a background process merges the updates and copies the result for the readers.
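One way to realise the "readers keep using the old array while a new one is built" idea with no lock in the read path is to publish each rebuilt snapshot through an atomically swapped shared_ptr. A sketch assuming C++11's atomic free functions for shared_ptr (which may themselves use an internal lock, but readers and the writer never hold anything for long); the names and the sort criterion are invented:
#include <algorithm>
#include <memory>
#include <vector>

// Readers and the single writer share only this pointer.
std::shared_ptr<const std::vector<int>> sorted_ids =
    std::make_shared<const std::vector<int>>();

// Reader: grabs whatever snapshot is current and keeps it alive while using it.
std::shared_ptr<const std::vector<int>> current_ids() {
    return std::atomic_load(&sorted_ids);
}

// Writer: rebuilds a fresh vector off to the side, then publishes it in one step.
void publish_new_order(std::vector<int> ids) {
    std::sort(ids.begin(), ids.end());          // stand-in for the "net votes" ordering
    auto fresh = std::make_shared<const std::vector<int>>(std::move(ids));
    std::atomic_store(&sorted_ids, std::move(fresh));
}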
I am writing mex code in MATLAB to do an operation (because the operation uses a C++ library). The mex code has a section where a function is repeatedly called in a loop with a different argument value, and each call is independent (i.e., the computation of one call does not depend on previous calls). So, to speed this up, I wrote multithreaded code that creates multiple threads - the exact number of threads equals the number of loop iterations, which in my example is 10. Each thread computes the function for a separate value of the argument, the threads return and join, some more computation is done, and a result is returned.
All this in theory should give me a good speedup, but I see that the multithreaded code is a lot slower than the normal single-threaded one!! I have access to very powerful 24-core machines, so this is totally baffling, because I'd expected each thread to be scheduled on a separate core.
Any ideas to what is leading to this? Any common problems/errors in code that lead to this?
Any help will be greatly appreciated.
EDIT:
To answer many doubts raised in solutions proposed by people here, I want to share some information about my code:
1. Each function call takes a few minutes, so synchronization and the spawning of threads should not be an overhead here (though if there are any mitigating circumstances in this case, any info about that would be really helpful!)
2. Each thread does access common data structures, arrays, and matrices, but the values in these are not overwritten at all. All writes are done to variables, pointers, arrays, etc. that are local to the thread. So I am guessing there shouldn't be many cache misses here?
3. There are also no mutex sections in my code, since no thread writes to any common memory location. All writes are to memory locations local to the thread.
I'm still trying to figure out the reason why my multithreaded implementation is not working :( So, any pointers/info will be really helpful!
Thanks!!
Given how general your question is, the general answer is that there are probably two effects in play:
There is a large overhead involved in starting and stopping threads (and synchronizing them), and the computation does not scale well enough to overcome that overhead. The total time per function call will shed some light on this issue.
Threads can compete with each other and slow down aggregate performance. A common mechanism is "cache thrashing": since multiple cores share the same memory controller and parts of the cache hierarchy, one thread can fill the cache with the information it needs, only to have some of that data evicted by the needs of a different thread, causing more trips to main memory. Since main memory access is so expensive, the end result is a slowdown.
I would test the job with varying numbers of threads. It may turn out, for instance, that using two threads is advantageous, but four or more is not. For more detailed answers, add more details to the question, such as type of computation, size of dataset, etc.
You didn't describe what your code does, so this is just guesswork.
Multithreading is not a miracle cure. There are a lot of ways in which multithreading what was a single-threaded chunk of code can make it slower than the original. There's a good deal of overhead involved in spawning, synchronizing, joining, and destroying threads.
Suppose the task at hand was to add ten pairs of numbers. If you make this multithreaded by spawning a thread for each addition and then joining and destroying when the calculation is finished, your multithreaded version will be much, much slower than the original. Threading is not intended for very short duration calculations. The costs of spawning, joining, and destroying are going to overwhelm any speedup you gain by performing those simple tasks in parallel.
Another way to make things slower is to establish barriers that prevent parallel operation. A mutex, for example, to protect against multiple writers simultaneously accessing the same object. That protected code needs to be small: make the entire body of your thread operate under the guise of a mutex and you have the equivalent of a single-threaded application with a whole bunch of threading overhead added in.
Those barriers that preclude parallel execution might be present even if you didn't put them in place. Some of them are in the C standard library. POSIX mandates that most library functions be thread-safe; the standard only lists the functions that don't have to be. Many functions achieve that thread safety by locking internally, so if your computations lean heavily on such library functions, you might be better off staying single-threaded, because your code is essentially being serialized anyway.
I do not think your problems are mex-specific at all - this sounds like the usual performance problems encountered when programming multithreaded code for SMPs.
To add a little to the already mentioned potential problems:
False cache-line sharing: you might think that your threads work independently, while in fact they access different data within the same cache line. Trivial example (a padded fix is sketched after this list):
/* global variable accessible by all threads */
int thread_data[nthreads];
/* inside thread function */
thread_data[thrid] = some_value;
Inefficient memory bandwidth utilization: on NUMA systems you want each CPU to access its own memory bank. If you do not distribute the data correctly, the CPUs fetch memory attached to other CPUs. That implies communication you do not suspect is there.
Thread affinity: somewhat connected to the point above. You want your threads to be bound to their own CPUs for the entire duration of the computation. Otherwise they might be migrated by the OS, which causes overhead, and they might end up further away from the memory bank they access.
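Regarding the false-sharing example above, the usual fix is to pad or align each thread's slot so it occupies its own cache line; a sketch assuming a 64-byte line (C++17 exposes the real value as std::hardware_destructive_interference_size):
#include <cstddef>

constexpr std::size_t kCacheLine = 64;   // assumed line size

// Each slot now occupies its own cache line, so one thread's write no longer
// invalidates the line another thread is writing to.
struct alignas(kCacheLine) PaddedSlot {
    int value;
};

PaddedSlot thread_data[16];              // one slot per thread

/* inside thread function */
// thread_data[thrid].value = some_value;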
I'm currently in the process of developing my own little threading library, mainly for learning purposes, and I am at the part dealing with the message queue, which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes and condition variables, which are all variations of the same theme: a lock for a section that should only be used by one thread at a time.
Are there any solutions to synchronisation other than using locks? I've read about lock-free synchronization in places, but some consider hiding the locks in containers to be lock-free, which I disagree with - you just don't explicitly use the locks yourself.
Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3 current_value = the_variable
4 new_value = ...some expression using current_value...
5 } while(!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value, and only if that's so will it update the_variable's value to new_value and return true.
The exact calling syntax varies with the CPU, and may involve assembly language or system/compiler-provided wrapper functions (use the latter if available, as they also keep compiler optimisations and related issues restricted to safe behaviour); generally, check your docs.
The significance is that if another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare-and-swap instruction will fail, because the state from which you're updating is no longer the one you used to calculate the desired target state. Such do/while loops are said to "spin" rather than lock, as they go round and round the loop until the CAS succeeds.
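The same spin can be written with C++11's std::atomic; in this sketch the update (scaling a counter, chosen arbitrarily) is recomputed each time round, and compare_exchange_weak reloads the observed value on failure, so the loop body is slightly simpler than the pseudocode above:
#include <atomic>

std::atomic<int> the_variable{1};

void scale_by(int factor) {
    int current = the_variable.load();
    int desired;
    do {
        desired = current * factor;   // recompute from the freshly observed value
        // On failure, compare_exchange_weak writes the now-current value back
        // into 'current', so the next iteration starts from up-to-date state.
    } while (!the_variable.compare_exchange_weak(current, desired));
}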
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutexes, read-write locks etc. (sketched roughly after this list), involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value; if it's not set, then CAS(current = not set, new = set) }) - which means other threads doing a quick update often won't cause your thread to be swapped out to wait, avoiding all the relatively time-consuming overheads associated with that.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
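A rough sketch of that two-stage idea, using a bounded spin on an atomic flag and then yielding to the scheduler (a real mutex would instead park the thread in the kernel; the spin limit here is arbitrary):
#include <atomic>
#include <thread>

class SpinThenYieldLock {
public:
    void lock() {
        // Stage one: optimistic spinning - cheap if the holder releases quickly.
        for (int i = 0; i < 1000; ++i) {
            if (!flag_.test_and_set(std::memory_order_acquire))
                return;
        }
        // Stage two: stop burning CPU and let the scheduler run someone else.
        while (flag_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }

    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};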
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
Lock-free algorithms come into their own when you are working on a variable that's small enough to update directly with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
A core (good) consequence of lock-free algorithms is that one thread can't be holding the mutex then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather - with CAS - they can spin safely and efficiently without an OS fallback option.
Things that lock-free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load balancing. Many others, of course.
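The reference-counter case, for instance, needs nothing more than atomic increments and decrements; a minimal sketch (the memory orderings follow the usual shared-ownership pattern):
#include <atomic>

std::atomic<int> ref_count{1};

void add_ref() {
    // A pure increment publishes no other data, so relaxed ordering is sufficient.
    ref_count.fetch_add(1, std::memory_order_relaxed);
}

bool release() {
    // The final decrement must synchronize with all earlier ones before the
    // owning object is destroyed, hence the stronger ordering.
    return ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1;
}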
As you say, simply hiding use of mutexes behind some API is not lock free.
There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.
Some data structures can be implemented in a lock-free fashion. For example, the producer/consumer pattern can often be implemented using lock-free linked-list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make it work - at the bare minimum in implementation complexity, and probably in performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before adopting lock-free code on the assumption that it would be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.
Yet another library to add to your reading list: Fast Flow
What's interesting in your case is that they are based on lock-free queues. They have implemented a simple lock-free queue and then have built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.