I am using two Boost threads, each of which uses a different FFTW plan (for example, thread 1 uses 'plan_fft' and thread 2 uses 'plan_ifft'). When I run only one thread (thread 2), it works perfectly, but when I run both threads I get a segmentation fault. I suspect this is because plan creation is not thread safe. It would be a great help if someone could explain how to use two different fftw_plans (one per thread) in two threads in parallel.
I forgot to mention the solutions suggested by the FFTW multithreading developers:
using semaphore locks
creating all the plans in one thread
I implemented the second one (i.e. created all the plans in the main program and then launched the two threads from it). When I do so, there are no errors and no segmentation fault, but I am not getting the expected result.
Please note: these two threads are independent and do not share any common data, so I think a semaphore lock won't work for my case.
My question: can we create (and destroy) plans in the main program and then execute the two different plans in two different threads?
The FFTW folks provide a nice summary of the thread-safety topic here. The upshot: nothing is thread safe except fftw_execute, so you have to take care that, e.g., only a single thread creates (and destroys) plans. However, executing them in parallel is no problem.
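A minimal sketch of that pattern, assuming 1-D complex transforms (the transform size and buffers are placeholders, and std::thread is used for brevity; the same applies to boost::thread):

```cpp
#include <fftw3.h>
#include <thread>

int main() {
    const int N = 1024;  // placeholder transform size
    fftw_complex *in1  = fftw_alloc_complex(N);
    fftw_complex *out1 = fftw_alloc_complex(N);
    fftw_complex *in2  = fftw_alloc_complex(N);
    fftw_complex *out2 = fftw_alloc_complex(N);

    // Plan creation/destruction is NOT thread safe: do it in one thread only.
    fftw_plan plan_fft  = fftw_plan_dft_1d(N, in1, out1, FFTW_FORWARD,  FFTW_ESTIMATE);
    fftw_plan plan_ifft = fftw_plan_dft_1d(N, in2, out2, FFTW_BACKWARD, FFTW_ESTIMATE);

    // ... fill in1 and in2 with input data ...

    // fftw_execute is thread safe, so each plan may run in its own thread.
    std::thread t1([&] { fftw_execute(plan_fft);  });
    std::thread t2([&] { fftw_execute(plan_ifft); });
    t1.join();
    t2.join();

    // Destroy the plans back in the main thread.
    fftw_destroy_plan(plan_fft);
    fftw_destroy_plan(plan_ifft);
    fftw_free(in1); fftw_free(out1);
    fftw_free(in2); fftw_free(out2);
    return 0;
}
```

Note that the two plans here also work on disjoint buffers; concurrent execution is only truly independent when the plans do not share their in/out arrays (or when you use FFTW's new-array execute functions).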
I am wondering whether there is any difference in performance between running a single program (exe) with 10 different threads and running the program with a single thread 10 times in parallel (started from a .bat file), assuming the work done is the same and only the number of threads spawned by the program changes.
I am developing a client/server communication program and want to test it for throughput. I'm currently learning about parallel programming and threading and wasn't sure how Windows would handle the above scenario. Will the scheduler schedule work the same way in both scenarios? Will there be a performance difference?
The machine the program is running on has 4 threads.
Threads are slightly lighter weight than processes, since there are many things a process gets its own copy of. This is especially visible in the time it takes to start a new thread versus starting a new process from scratch (fork, where available, also avoids much of that cost). In either case, though, you can generally get even better performance by using a worker pool where possible, rather than repeatedly starting and stopping fresh processes/threads.
The other major difference is that threads all share the same memory by default, while processes each get their own and must communicate through more explicit means (which may include blocks of shared memory). This can make it easier for a threaded solution to avoid copying data, but it is also one of the dangers of multithreaded programming when care is not taken with how the shared memory/objects are used.
There may also be APIs that are more oriented around a single process. For example, on Windows there are IO Completion Ports, which basically work on the idea of having many in-progress IO operations for different files, sockets, etc., with multiple threads (though generally far fewer than the number of files/sockets) handling the results as they become available through a GetQueuedCompletionStatus loop.
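A bare-bones sketch of such a worker loop (Windows-specific; the completion port is assumed to have been created with CreateIoCompletionPort and associated with the relevant handles elsewhere):

```cpp
#include <windows.h>

// Each pool thread runs this loop, pulling completed IO operations off the
// shared completion port as they finish.
DWORD WINAPI worker(LPVOID param) {
    HANDLE iocp = static_cast<HANDLE>(param);
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    OVERLAPPED *ov = nullptr;
    for (;;) {
        // Blocks until some in-progress operation on the port completes.
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
            break;  // real code would distinguish IO errors from shutdown
        // ... dispatch on key/ov to handle the completed operation ...
    }
    return 0;
}
```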
In TBB, the task_scheduler_init() method is often (and, it seems, should be) invoked internally; this is a deliberate design decision.
However, if we mix TBB and MPI, is thread safety guaranteed without controlling the number of threads in each MPI process?
For example, say we have 7 cores (no hyper-threading) and 2 MPI processes. If each process simultaneously spawns TBB tasks using 4 threads, then there is a potential conflict that might cause the program to crash at runtime.
I'm a newbie of TBB.
Looking forward to your comments and suggestions!
From the Intel TBB runtime's perspective it does not matter whether the process is an MPI process or not. So if you have two processes, you will have two independent instances of the Intel TBB runtime. I am not sure I understand the question about thread safety, but it should work correctly. However, you may get oversubscription, which can lead to performance issues.
In addition, you should check the MPI implementation's documentation if you use MPI routines concurrently (from multiple threads), because that can cause issues.
Generally speaking, this is a two-step tango:
MPI binds each task to some resources;
the thread runtime (TBB here, but the same applies to OpenMP) is generally smart enough to bind its threads within the previously provided resources.
Bottom line: if MPI binds its tasks to non-overlapping resources, then there should be no conflict caused by the TBB runtime.
A typical scenario is to run 2 MPI tasks with 8 OpenMP threads per task on a dual-socket octo-core box. As long as MPI binds each task to a socket, and the OpenMP runtime is told to bind its threads to cores, performance will be optimal.
I have a main thread which does some not-so-heavy work, and I also create worker threads which do very heavy work. All the documentation and examples show how to create a number of threads equal to std::thread::hardware_concurrency(). But since the main thread already exists, the total number of threads becomes std::thread::hardware_concurrency() + 1. For example:
my machine supports 2 hardware threads.
in the main thread I create these 2 worker threads, so the total number of threads becomes 3.
a core running the main thread does its job plus (probably) a worker's job.
Of course I don't want this, because the UI (which runs in the main thread) becomes unresponsive due to the added latency. What will happen if I create std::thread::hardware_concurrency() - 1 threads? Will that guarantee that the main thread, and only the main thread, runs on its own core? How can I check this?
P.S.: I'm using some sort of pool: I start the threads at program start and stop them on exit. During execution, all worker threads run an infinite while loop.
As others have written in the comments, you should carefully consider whether you can do a better job than the OS.
That being said, it is technically possible:
Use the native_handle method to get the OS's handle to your thread.
Consult your OS's documentation for setting the thread affinity. E.g., with pthreads you'd want pthread_setaffinity_np.
This gives you full control over where each thread runs. In particular, you can give one of the threads a core of its own.
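For example, a glibc/Linux-flavoured sketch (pthread_setaffinity_np is non-portable, hence the _np suffix):

```cpp
#define _GNU_SOURCE  // for cpu_set_t / pthread_setaffinity_np on glibc
#include <pthread.h>
#include <thread>

// Pin a std::thread to a single core via its native handle.
void pin_to_core(std::thread &t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}
```

Pinning the main thread to core 0 and each worker to one of the remaining cores would give the main thread a core of its own.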
Note that this isn't part of the standard, since it operates at a level that is not portable. That might serve as another hint that it's possibly not what you're looking for.
No: std::thread::hardware_concurrency() only gives you a hint about the number of cores potentially available for multithreading. You might be interested in CPU affinity masks (putting threads on different CPUs). These work at the pthread level, which you can reach via std::thread::native_handle (http://en.cppreference.com/w/cpp/thread/thread/native_handle).
Depending on your OS, you can get each thread's native handle and control its priority level using pthread_setschedparam(), for example giving the worker threads a lower priority than the main thread. This can be one solution to the UI problem. In general, the number of threads need not match the number of available HW cores.
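A Linux-flavoured sketch of that idea (SCHED_IDLE is Linux-specific, and valid policy/priority combinations vary by OS, so treat this as an assumption to check against your platform's documentation):

```cpp
#define _GNU_SOURCE  // for SCHED_IDLE on glibc
#include <pthread.h>
#include <sched.h>
#include <thread>

// Demote a worker thread so the main/UI thread wins contention for cores.
void demote_worker(std::thread &worker) {
    sched_param sp{};
    sp.sched_priority = 0;  // SCHED_IDLE requires a priority of 0
    pthread_setschedparam(worker.native_handle(), SCHED_IDLE, &sp);
}
```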
There are definitely cases where you want full control and the ability to reliably analyze what is going on. You are using Windows, but as an example, on a multicore Linux machine it is possible to exclude, say, one core from the normal OS scheduler and use that core for time-critical hard real-time tasks. In essence, you own that core and handle its interrupts, enabling something close to hard real-time response times and predictability. This requires careful programming and analysis and takes significant effort, but it is very attractive if done right.
I was wondering: in an MPI program where you have requested thread support, if all the threads make an MPI::Bcast call (ensuring that on the sending process only one thread makes the call), is the broadcast received by all the other threads, or just by one thread in each process (the fastest)?
Common MPI implementations deal with communication among processes. Implementations supporting threads simply allow multiple threads to make some or all MPI calls, rather than just one. Every one of T threads in a process calling MPI_Bcast means that the process has called MPI_Bcast T times, and expects that all of the other ranks on the communicator will do the same.
Depending on the level of thread support in your MPI implementation (please check; threading support across MPI implementations is very sketchy), the MPI call is made only once per process.
To add to the answer given by Novelocrat:
The basic unit of computation in MPI is the "rank". In most (all?) interesting implementations of MPI, a rank IS a process. All of the threads within a process share the same rank ID.
The MPI Standard supports multiple levels of thread parallelism: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE.
Of these, only MPI_THREAD_MULTIPLE actually allows multiple threads to make overlapping calls into the MPI library. The other three cases are an assertion from the application that the rank can be treated as if it were "single threaded." For more, see the MPI Standard entry on MPI_INIT_THREAD.
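For illustration, a minimal sketch of requesting the highest level and checking what the library actually grants (the fallback message is a placeholder; adapt it to your application):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    int provided = 0;
    // Ask for full multithreaded access; the library may grant less.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        // e.g. FUNNELED/SERIALIZED: restrict which threads may call MPI.
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE unavailable (got %d)\n", provided);
    }
    // ... collective calls such as MPI_Bcast still happen once per rank ...
    MPI_Finalize();
    return 0;
}
```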
I've begun using Boost.ASIO for some simple network programming. My understanding of the library is not great, so please bear with me and my newbie question.
At the moment my project has only one io_service object, which I use for all the async I/O operations.
My understanding is that one can create multiple threads and pass the run method of an io_service instance to each of them, in order to provide more threads to the io_service.
My question: is it good design to have multiple io_service objects? Say, for example, 2 distinct io_service instances, each with 2 threads associated. Do they somehow know about each other (and hence cooperate), or, if not, would they negatively affect each other?
My intention is to have 1 io_service for socket based I/O and another for serial based (tty) I/O.
We use multiple io_services because some of the components in our application need to run all their worker threads at certain fixed priorities, different for each component. Thus each component is given its own io_service, and each has its own pool of threads executing run().
Other designs I could think of would apply if a different number of threads in the pool is required for each kind of IO, or, more relevant to your case, if the pool cannot be shared because, for example, your network IO could occupy every thread and leave your serial IO waiting.
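A rough sketch of that kind of separation (io_service-era ASIO; the pool sizes and the bare stop() calls are placeholders for real lifetime management):

```cpp
#include <boost/asio.hpp>
#include <boost/thread.hpp>

int main() {
    // One io_service (and one pool) per IO domain, so network handlers can
    // never starve the serial handlers of threads, and vice versa.
    boost::asio::io_service net_io, serial_io;
    boost::asio::io_service::work net_work(net_io), serial_work(serial_io);

    boost::thread_group net_pool, serial_pool;
    for (int i = 0; i < 2; ++i) {
        net_pool.create_thread([&net_io] { net_io.run(); });
        serial_pool.create_thread([&serial_io] { serial_io.run(); });
    }

    // ... open sockets against net_io and a serial_port against serial_io ...

    net_io.stop();
    serial_io.stop();
    net_pool.join_all();
    serial_pool.join_all();
    return 0;
}
```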
IIRC, during Michael Caisse's BoostCon ASIO talk (which is worth watching anyway), this question was explicitly asked by an audience member and OK'd as a potential solution. I take from that that it's not wrong per se, and that it can be used that way if it suits your design.
This discussion may be enlightening:
http://thread.gmane.org/gmane.comp.lib.boost.asio.user/1300
I don't have the code right here, but why would you use multiple io_services? I thought it used one io_service and multiple threads executing run on that one io_service.
IIUC, each io_service owns a select/epoll/whatever queue, so having multiple io_services is akin to having multiple independent select/epoll loops. In some situations, e.g. large numbers of sockets and multiple CPUs, this might help.
Something I'm less sure about is having multiple threads all running io_service::run (with the same io_service). I think this just means the handlers are run concurrently, while the select/epoll/etc. loop is 'shared'. I think this is best when your handlers are relatively long-running operations.