Boost Threadpool with network or GPU

Boost Threadpool with network or GPU - c++

I'm using a thread group like shown here Boost group_threads Maximal number of parallel thread
My program does depth graph search which takes really long. Because that i want to speed up. I thought about connection other Computers over network or using my GPU.
So it is possible to start threads on other computers over network(of course they have to run client) or to use the own GPU
Does boost thread support something like this?

boost thread does not support this exactly. Boost threads are specific to a given process.
For the multi-machine case, you would need to communicate between the machines over the network. You could use boost asio sockets or boost MPI for this.
For the GPU case, you would have to write code specifically to execute on the GPU which is an in-depth subject.
You could also take a look at OpenCL which may be a better fit for your purpose.

Related

How to harness power of multiple processors with c++ and boost

I have a task that can be easily broken down to parallel tasks. And I have a PC with multiple processors which will run the task. I'm planning to use c++ and boost library.
I'm familliar with multithreading using multiple cores, but it's my first time with multiprocessor system. I'm not sure if boost::threads will be sufficient for efficient usage of all processors.
Should I use boost::threads or build a solution upon multiple processes? Also, I'm not familiar with MPI, but I feel it may be useful for my task.

Using multi core support in C / C++

I have seen in some posts it has been said that to use multiple cores of processor use Boost thread (use multi-threading) library. Usually threads are not visible to operating system. So how can we sure that multi-threading will support usage of multi-cores. Is there a difference between Java threads and Boost threads?

The operating system is also called a "supervisor" because it has access to everything. Since it is responsible for managing preemptive threads, it knows exactly how many a process has, and can inspect what they are doing at any time.
Java may add a layer of indirection (green threads) to make many threads look like one, depending on JVM and configuration. Boost does not do this, but instead only wraps the POSIX interface which usually communicates directly with the OS kernel.
Massively multithreaded applications may benefit from coalescing threads, so that the number of ready-to-run threads matches the number of logical CPU cores. Reducing everything to one thread may be going too far, though :v) and #Voo says that green threads are only a legacy technology. A good JVM should support true multithreading; check your configuration options. On the C++ side, there are libraries like Intel TBB and Apple GCD to help manage parallelism.

C++ Server - To Thread or not to Thread?

I'm working on a game server, written in C++, and I'm trying to decide how many threads to use and what tasks to thread. The basic server skeleton consists of keyboard I/O and output to a console, accepting incoming connects, sending outgoing connects, and doing the game "stuff".
What I'd like to know is which things should be given a separate thread. Should each connect have its own thread? I know this is variable, it depends on the project or so, but I would like it to support a pretty decent number of players (somewhere in the hundreds if possible).

The standard answer should always be: Try it the simplest way first, and only look for ways to improve performance if the simple way isn't good enough. However, re-architecting a large C++ program can be a painful experience, so some guesses about performance in advance may be appropriate.
Theoretically, hundreds of threads are probably OK on modern machines. The NPTL implementation for Linux was tested with tens of thousands of threads, as I recall. If that's the easiest way for you to implement, it may be the right answer.
However, high-performance web servers and similar typically use event-driven models instead. Consider a library like libevent. I'm sure there are C++ libraries for the same purpose.
I personally believe that languages without first-class continuations, or at least coroutines, are poor choices for this kind of work, but the C language family is how we get work done today, so off we go. :-)

A good solution could be to use a Thread pool.
Idea is to let the main thread dispatch equitably all connexions in a fixed number of threads.
With a good design, you can easily set the number of thread on runtime.
You can find more informations here.
Create more threads than you have CPU cores is not productive, and adding too threads decrease performances due to time taken for switching between threads.
By example, for compiling a large project (it's not exactly the same thing, but it's valid for both case), it's often recommended to use no more thread than number of CPU cores + 1.

A very common technique is to have the game server run on one thread to monitor several connections (i.e. sockets) by using a select on each socket. When data is available, grab the data and enqueue it in a producer/consumer type model for the game engine to pick up.
This is by no means the be-all-end-all implementation, but it should be enough to get you started. Sounds like a cool project. Good luck!

If you setup the connections and utilize them in a manner that cause the thread to block waiting on IO then you should be able to service all of the connections and the keyboard on one thread. You may not want to put the console output on that same thread, as I've seen cases (on windows at least), where the speed of writing to the console is actually a bottleneck (i.e. if the console window is minimized the process runs considerably faster).
If the work of your game engine parallelizes well then you probably want to set use as many threads as there are CPUs less one (for the OS and the other two threads). If you expect the client to run on the same machine the server will want to detect that and scale back the number of threads it uses.

Calls to GPU kernel from a multithreaded C++ application?

I'm re-implementing some sections of an image processing library that's multithreaded C++ using pthreads. I'd like to be able to invoke a CUDA kernel in every thread and trust the device itself to handle kernel scheduling, but I know better than to count on that behavior. Does anyone have any experience with this type of issue?

CUDA 4.0 made it much simpler to drive a single CUDA context from multiple threads - just call cudaSetDevice() to specify which CUDA device you want the thread to submit commands.
Note that this is likely to be less efficient than driving the CUDA context from a single thread - unless the CPU threads have other work to keep them occupied between kernel launches, they are likely to get serialized by the mutexes that CUDA uses internally to keep its data structures consistent.

Perhaps Cuda streams are the solution to your problem. Try to invoke kernels from a different stream in each thread. However, I don't see how this will help, as I think that your kernel executions will be serialized, even though they are invoked in parallel. In fact, Cuda kernel invocations even on the same stream are asynchronous by nature, so you can make any number of invocations from the same thread. I really don't understand what you are trying to achieve.

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the db and for each one call a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but rather sets the next calls going, so they could be processed in parallel? A naive approach to demonstrate the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.

Well, if .net is an option, they have put a lot of effort into Parallel Computing.

If you still plan on using Python, you might want to have a look at Processing. It uses processes rather than threads for parallel computing (due to the Python GIL) and provides classes for distributing "work items" onto several processes. Using the pool class, you can write code like the following:
import processing
def worker(i):
return i*i
num_workers = 2
pool = processing.Pool(num_workers)
result = pool.imap(worker, range(100000))
This is a parallel version of itertools.imap, which distributes calls over to processes. You can also use the apply_async methods of the pool and store lazy result objects in a list:
results = []
for i in range(10000):
results.append(pool.apply_async(worker, i))
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e.
the number of work items send to a worker process in one batch
processing.Pool uses a background thread

You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.

I might be missing something here, but this this seems fairly straight forward using pthreads.
Set up a small threadpool with N threads in it and have one thread to control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find next free thread If no thread is free then wait
Hand over chunk to worker thread
Go back and get next chunk from DB
In the meantime the worker threads they sit and do:
Mark myself as free
Wait for the mast thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex controlled arrays. One has the worked threads in it (the threadpool) and the other indicated if each corresponding thread is free or busy.
Tweak N to your liking ...

If you're working with a compiler that will support it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code in such a way that
certain loops will be parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc4.2 will support openmp, for example.

The same thread pool is used in java. But the threads in threadpools are serialisable and sent to other computers and deserialised to run.

I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but not yet accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce

You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.

Intel's TBB or boost::mpi might be of interest to you also.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Boost Threadpool with network or GPU - c++

Related

How to harness power of multiple processors with c++ and boost

Using multi core support in C / C++

C++ Server - To Thread or not to Thread?

Calls to GPU kernel from a multithreaded C++ application?

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

Categories

Resources