In my c++ program, I use a proprietarry library (.dll) which is not thread-safe. :-(
In this library there is a specific scientific computing.
Is there a way to safety start several computing of this library in parallel with threads ?
(1 process, many threads)
My program "is like" a "for" loop on which call each time the computing of my not thread-safe library
Sounds like you want to load the DLL multiple times. Take a look at Load the same dll multiple times.
A very simple approach would be forking multiple slave processes in your for loop. The slave process loads the non-thread-safe module and does computations, and finally returns the result to the parent process via either a simple return code (if the result fits into 4 bytes), IPC, or file.
Of course, this approach assumes that the parallel computations do not need any interaction with the others.
Related
I am wondering whether there is any difference in performance in running a single program (exe) with 10 different threads or running the program with a single thread 10 times in parallel (starting it from a .bat file) assuming the work done is the same and only the number of threads spawned by the program change?
I am developing a client/server communication program and want to test it for throughput. I'm currently learning about parallel programming and threading as wasn't sure how Windows would handle the above scenario. Will the scheduler schedule work the same way for both scenarios? Will there be a performance difference?
The machine the program is running on has 4 threads.
Threads are slightly lighter weight than processes as there are many things a process gets it's own copy of. Especially when you compare the time it takes to start a new thread, vs starting a new process (from scratch, fork where available also avoids a lot of costs). Although in either case you can generally get even better performance using a worker pool where possible rather than starting and stopping fresh processes/threads.
The other major difference is that threads by default all share the same memory while processes get their own and need to communicate through more explicit means (which may include blocks of shared memory). This might make it easier for a threaded solution to avoid copying data, but this is also one of the dangers of multithreaded programming when care is not taken in how they use the shared memory/objects.
Also there may be API's that are more orientated around a single process. For example on Windows there is IO Completion Ports which basically works on the idea of having many in-progress IO operations for different files, sockets, etc. with multiple threads (but generally far less than the number of files/sockets) handling the results as they become available through a GetQueuedCompletionStatus loop.
As Far as I know,
In computer science, a thread of execution is the smallest sequence of
programmed instructions that can be managed independently by a
scheduler, which is typically a part of the operating system. The
implementation of threads and processes differs between operating
systems, but in most cases a thread is a component of a process.
Multiple threads can exist within one process, executing concurrently
and sharing resources such as memory, while different processes do not
share these resources. In particular, the threads of a process share
its executable code and the values of its variables at any given time.[1]
When I decided to write a multi thread program in c++, i faced with many choices like boost thread, posix thread and std thread.
A simple search on internet shows a performance measurement taken by boost.org website here.
My question is a bit more basic and performance related as well.
Basically, Why do they differ in performance ? Why, for example thread type A, is faster than the others? The are written by most professional programmers, are ran by powerful OSs ,yet they offer different performance.
What makes them faster or slower?
The Boost documentation refers to the Fiber library, which are not actually threads. Creating what the library calls a fiber (essentially a user-space thread or coroutine, sometimes also referred to as green threads) does not create a separate schedulable entity on the kernel side, so it can be much more efficient at creation time. Other things could be less efficient because I/O operations necessarily become much more involved under this model (because a fiber doing I/O should not block the operating system thread it runs on if other fibers could do work there).
Note that some of the coroutine implementations out there are well out of the conceptual limits of the de-facto GNU/Linux ABI and other POSIX-like operating systems, so they should be considered ugly hacks at best.
In TBB, the task_scheduler_init () method, which is often (and should be?) invoked internally, is a deliberate design decision.
However, if we mix TBB and MPI, is it guaranteed to be thread-safe without controlling the number of threads of each MPI process?
For example, say we have 7 cores (with no hyper-threading) and 2 MPI processes. If each process spawns an individual TBB task using 4 threads simultaneously, then there is a conflict which might cause the program to crash at runtime.
I'm a newbie of TBB.
Looking forward to your comments and suggestions!
From Intel TBB runtime perspective it does not matter if it is a MPI process or not. So if you have two processes you will have two independent instances of Intel TBB runtime. I am not sure that I understand the question related to thread-safeness but it should work correctly. However, you may have oversubscription that can lead to performance issues.
In addition, you should check the MPI implementation documentation if you use MPI routines concurrently (from multiple threads) because it can cause some issues.
Generally speaking, this is a two steps tango
MPI binds each task to some resources
the thread runtime (TBB, same things would apply to OpenMP) is generally smart enough to bind threads within the previously provided resources.
bottom line, if MPI binds its tasks to non overlapping resources, then there should be no conflict caused by the TBB runtime.
a typical scenario is to run 2 MPI tasks with 8 OpenMP threads per task on a dual socket octo core box. as long as MPI binds a task to a socket, and the OpenMP runtime is told to bind the threads to cores, performance will be optimal.
I have a C++ program that performs some lengthy computation in parallel using OpenMP. Now that program also has to respond to user input and update some graphics. So far I've been starting my computations from the main / GUI thread, carefully balancing the workload so that it is neither to short to mask the OpenMP threading overhead nor to long so the GUI becomes unresponsive.
Clearly I'd like to fix that by running everything concurrently. As far as I can tell, OpenMP 2.5 doesn't provide a good mechanism for doing this. I assume it wasn't intended for this type of problem. I also wouldn't want to dedicate an entire core to the GUI thread, it just needs <10% of one for its work.
I thought maybe separating the computation into a separate pthread which launches the parallel constructs would be a good way of solving this. I coded this up but had OpenMP crash when invoked from the pthread, similar to this bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36242 . Note that I was not trying to launch parallel constructs from more than one thread at a time, OpenMP was only used in one pthread throughout the program.
It seems I can neither use OpenMP to schedule my GUI work concurrently nor use pthreads to have the parallel constructs run concurrently. I was thinking of just handling my GUI work in a separate thread, but that happens to be rather ugly in my case and might actually not work due to various libraries I use.
What's the textbook solution here? I'm sure others have used OpenMP in a program that needs to concurrently deal with a GUI / networking etc., but I haven't been able to find any information using Google or the OpenMP forum.
Thanks!
There is no textbook solution. The textbook application for OpenMP is non-interactive programs that read input files, do heavy computation, and write output files, all using the same thread pool of size ~ #CPUs in your supercomputer. It was not designed for concurrent execution of interactive and computation code and I don't think interop with any threads library is guaranteed by the spec.
Leaving theory aside, you seem to have encountered a bug in the GCC implementation of OpenMP. Please file a bug report with the GCC maintainers and for the time being, either look for a different compiler or run your GUI code in a separate process, communicating with the OpenMP program over some IPC mechanism. (E.g., async I/O over sockets.)
I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the db and for each one call a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but rather sets the next calls going, so they could be processed in parallel? A naive approach to demonstrate the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.
Well, if .net is an option, they have put a lot of effort into Parallel Computing.
If you still plan on using Python, you might want to have a look at Processing. It uses processes rather than threads for parallel computing (due to the Python GIL) and provides classes for distributing "work items" onto several processes. Using the pool class, you can write code like the following:
import processing
def worker(i):
return i*i
num_workers = 2
pool = processing.Pool(num_workers)
result = pool.imap(worker, range(100000))
This is a parallel version of itertools.imap, which distributes calls over to processes. You can also use the apply_async methods of the pool and store lazy result objects in a list:
results = []
for i in range(10000):
results.append(pool.apply_async(worker, i))
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e.
the number of work items send to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
I might be missing something here, but this this seems fairly straight forward using pthreads.
Set up a small threadpool with N threads in it and have one thread to control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find next free thread If no thread is free then wait
Hand over chunk to worker thread
Go back and get next chunk from DB
In the meantime the worker threads they sit and do:
Mark myself as free
Wait for the mast thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex controlled arrays. One has the worked threads in it (the threadpool) and the other indicated if each corresponding thread is free or busy.
Tweak N to your liking ...
If you're working with a compiler that will support it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code in such a way that
certain loops will be parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc4.2 will support openmp, for example.
The same thread pool is used in java. But the threads in threadpools are serialisable and sent to other computers and deserialised to run.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but not yet accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might be of interest to you also.