Threading OpenCL compiling - C++

Update: I'm spawning multiple processes now and it works fairly well, though the basic threading problem still exists.
I'm trying to thread a C++ (g++ 4.6.1) program that compiles a bunch of OpenCL kernels. Most of the time is spent inside clBuildProgram. (It's genetic programming, and actually running the code and evaluating fitness is much, much faster.) I'm trying to thread the compilation of these kernels and having no luck so far. At this point there's no shared data between threads (aside from having the same platform and device reference), but only one thread will run at a time. I can run this code as several processes (just launching them in different terminal windows in Linux) and it will then use multiple cores, but not within one process. I can use multiple cores with the same basic threading code (std::thread) doing just basic math, so I think it's something to do with either the OpenCL compile or some static data I forgot about. :) Any ideas? I've done my best to make this thread-safe, so I'm stumped.
I'm using AMD's SDK (OpenCL 1.1, circa 6/13/2010) and a 5830 or 5850 to run it. The SDK and g++ are not as up to date as they could be. The last time I installed a newer Linux distro in order to get a newer g++, my code ran at half speed (at least the OpenCL compiles did), so I went back. (I just checked the code on that install, and it still runs at half speed, with no threading differences.) Also, when I said it only runs one thread at a time: it will launch all of them and then alternate between two until they finish, then do the next two, and so on. And it does look like all of the threads are running until the code gets to building the program. I'm not using a callback function in clBuildProgram. I realize there's a lot that could be going wrong here and it's hard to say without the code. :)
I am pretty sure this problem occurs inside the call to clBuildProgram. I'm printing the time taken in there, and the threads that get postponed come back with a long compile time for their first compile. The only data shared between these clBuildProgram calls is the device ID, in that each thread's cl_device_id has the same value.
This is how I'm launching the threads:
for (a = 0; a < num_threads; a++) {
    threads[a] = std::thread(std::ref(programs[a]));
    threads[a].detach();
    sleep(1); // giving the opencl init f()s time to complete
}
This is where it's bogging down (and these are all local variables being passed, though the device id will be the same):
clBuildProgram(program, 1, &device, options, NULL, NULL);
It doesn't seem to make a difference whether each thread has a unique context or command queue. I really suspected this was the problem, which is why I mention it. :)
Update: Spawning child processes with fork() will work for this.

You might want to post something on AMD's support forum about this. Considering the many OpenGL implementations that failed to deliver the thread consistency the spec requires, it would not surprise me if OpenCL drivers are still suboptimal in that sense. They could be using the process ID internally to separate data, who knows.
If you have a working multi-process generation scheme, then I suggest you keep it, and communicate results using IPC. You could use Boost.Interprocess, which has interesting ways of using serialization (e.g. with boost::spirit to reflect the data structures). Or you could use POSIX pipes, or shared memory, or just dump compilation results to files and poll the directory from your parent process using boost::filesystem and directory iterators...
Forked processes may inherit some handles, so there are ways to use unnamed pipes as well, I believe; that could help you avoid the need to create a pipe server that instantiates client pipes, which can lead to extensive protocol coding.
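A minimal sketch of that fork-plus-unnamed-pipe idea (the "built:" payload is a stand-in for real compile results, and error handling is trimmed for brevity):

```cpp
#include <cassert>
#include <string>
#include <sys/wait.h>
#include <unistd.h>

// Run some work in a forked child and receive its result over an unnamed pipe.
// The child inherits the write end of the pipe, so no named-pipe server or
// client-handshake protocol is needed.
std::string run_in_child(const std::string& input) {
    int fd[2];
    if (pipe(fd) != 0) return "";
    pid_t pid = fork();
    if (pid == 0) {                              // child: "compile" and report back
        close(fd[0]);
        std::string result = "built:" + input;   // stand-in for clBuildProgram work
        write(fd[1], result.c_str(), result.size());
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);                                // parent: read the child's result
    char buf[256];
    ssize_t n = read(fd[0], buf, sizeof buf);
    close(fd[0]);
    waitpid(pid, nullptr, 0);                    // reap the child
    return std::string(buf, n > 0 ? static_cast<size_t>(n) : 0);
}
```

Each child gets its own address space, so a driver that serializes builds per process can no longer force all compiles through one lock.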

Related

Game Engine Multithreading with Lua

I'm designing the threading architecture for my game engine, and I have reached a point where I am stumped.
The engine is partially inspired by Grimrock's engine, where they put as much as they could into LuaJIT, with some things, including low level systems, written in C++.
This seemed like a good plan, given that LuaJIT is easy to use, and I can continue to add API functions in C++ and expand it further. Faster iteration is nice, the ability to have a custom IDE attached to the game and edit the code while it runs is an interesting option, and serializing from Lua is also easy.
But I am stumped on how to go about adding threading. I know Lua has coroutines, but that is not true threading; it's basically to keep Lua from stalling as it waits for code that takes too long.
I originally had in mind to have the main thread running in Lua and calling C++ functions which are dispatched to the scheduler, but I can't find enough information on how Lua functions. I do know that when Lua calls a C++ function it runs outside of the state, so theoretically it may be possible.
I also don't know whether, if Lua makes such a call that is not supposed to return anything, it will hang on the function until it's done.
And I'm not sure whether the task scheduler runs in the main thread, or if it is simply all worker threads pulling data from a queue.
Basically, that would mean that instead of everything running at once, the scheduler waits for the game-state update before doing anything.
Does anyone have any ideas, or suggestions for threading?
In general, a single lua_State * is not thread-safe. It's written in pure C and meant to be very fast. It's not safe to let exceptions propagate through it either. There are no locks in there and no way for it to protect itself.
If you want to run multiple Lua scripts simultaneously in separate threads, the most straightforward way is to call luaL_newstate() separately in each thread, initialize each of them, and load and run the scripts in each of them. They can talk to the C++ side safely as long as your callbacks use locks where necessary. At least, that's how I would try to do it.
There are various things you could do to speed it up. For instance, if you are loading copies of a single script in each of the threads, you could compile it to Lua bytecode before you launch any of the threads, then put the buffer into shared memory and have each thread load the shared bytecode unchanged. That's most likely an unnecessary optimization, though, depending on your application.
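The one-interpreter-per-thread pattern can be sketched without linking Lua at all (a plain struct stands in for lua_State here, since the point is the ownership model, not the Lua API):

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for a lua_State: owned by exactly one thread, never shared.
struct Interp { int accum = 0; };

std::mutex results_mutex;   // guards the shared C++ side
std::vector<int> results;   // filled via "callbacks" from each interpreter

void run_script(int id) {
    Interp state;                                     // each thread creates its own state
    for (int i = 0; i <= id; ++i) state.accum += i;   // the "script" body
    std::lock_guard<std::mutex> lock(results_mutex);  // callback into shared C++: lock
    results.push_back(state.accum);
}

int run_all(int n) {
    std::vector<std::thread> pool;
    for (int id = 0; id < n; ++id) pool.emplace_back(run_script, id);
    for (auto& t : pool) t.join();
    int total = 0;
    for (int r : results) total += r;
    return total;
}
```

Only the callback boundary needs a lock; each interpreter's own state is untouched by other threads.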

How to take advantage of multi-cpu in c++?

I work in a lab and wrote a multithreaded computational program in C++11 using std::thread. Now I have an opportunity to run my program on a multi-CPU server.
Server:
Runs Ubuntu server
Has 40 Intel CPUs
I know nothing about multi-CPU programming. The first idea that comes to mind is to run 40 copies of the application and then glue their results together. That's possible, but I want to know more about my options.
If I compile my code on the server with its gcc compiler, will the resulting application take advantage of multiple CPUs?
If the answer to #1 is "it depends", how can I check?
Thank you!
If your program is multithreaded, the OS will automatically take care of using the CPUs available.
Make sure to distribute the work across about the same number of threads as there are CPUs you can use. Make sure it's not just one thread doing the work while the other threads simply wait for it to finish.
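A sketch of that kind of even split (summing a range in chunks, roughly one chunk per hardware thread; the function name is illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Split [0, n) across roughly one thread per available CPU and sum in parallel.
std::int64_t parallel_sum(std::int64_t n) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::int64_t> partial(workers, 0);   // one slot per thread, no sharing
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([=, &partial] {
            std::int64_t lo = n * w / workers;       // this thread's chunk
            std::int64_t hi = n * (w + 1) / workers;
            for (std::int64_t i = lo; i < hi; ++i) partial[w] += i;
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

With all threads doing comparable amounts of work, the scheduler can keep every core busy without any affinity tricks.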
Your question is not only about multithreading, but about multiple CPUs.
Basically the operating system will automatically spread out the threads over the cores. You don't need to do anything.
Since you are using C++11, you have std::thread::get_id(), which you can call to identify the different threads, but you CANNOT identify the core you are running on. Use pthreads directly plus "CPU affinity" for that.
You can google "CPU affinity" for more details on how to get control over it, if you want that kind of precision. You can identify the core as well as choose it. You can start with this: http://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html
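A Linux-specific sketch of pinning a std::thread to one core via its native pthread handle (pthread_setaffinity_np is non-portable, hence the _np suffix):

```cpp
#define _GNU_SOURCE 1   // for CPU_SET and pthread_setaffinity_np (Linux-only)
#include <atomic>
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the given std::thread to a single core; returns 0 on success.
int pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // std::thread::native_handle() hands us the underlying pthread_t
    return pthread_setaffinity_np(t.native_handle(), sizeof set, &set);
}
```

Affinity is rarely necessary for throughput (the scheduler usually does fine), but it helps when you want cache locality or per-core measurements.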

How do I correctly handle a permanently hung third-party library call in a thread in C++?

I have a device which comes with a library. Some of its functions are most awesomely ill-behaved, in the "occasionally hang forever" sense.
I have a program which uses this device. If/when it hangs, I need to be able to recover gracefully and reset the device. The offending calls should return within milliseconds and are called in a loop many, many times per second.
My first question is: when a thread running the recalcitrant function hangs, what do I do? Even if I litter the thread with interruption points, this happens:
boost::this_thread::interruption_point(); // irrelevant, in the past
deviceLibrary.thatFunction(); // <-- hangs here forever
boost::this_thread::interruption_point(); // never gets here!
The only advice I've read on what to do there is to modify the function itself, but that's out of the question for a variety of reasons, not least of which is "this is already miles outside of my skill set".
I have tried asynchronous launching with C++11 futures:
// this was in a looping thread -- it does not work: wait_for sometimes never returns
std::future<void> future = std::async(std::launch::async,
    [this] () { deviceLibrary.thatFunction(*data_ptr); });
if (future.wait_for(std::chrono::seconds(timeout)) == std::future_status::timeout) {
    printf("no one will ever read this\n");
    deviceLibrary.reset(); // this would work if it ever got here
}
// (even when wait_for does time out, the destructor of a std::async future
// blocks until the task finishes, so a hung call still wedges this thread)
No dice, in that or a number of variations.
I am now trying boost::asio with a thread_group of a number of worker threads running io_service::run(). It works magnificently until the second timeout. Then I've run out of threads: each hanging call eats one thread of my thread_group, and it never comes back, ever.
My latest idea is to call work_threads.create_thread to make a new thread to replace the now-hanging one. So my second question is: if this is a viable way of dealing with this, how should I cope with the slowly amassing group of hung threads? How do I remove them? Is it fine to leave them there?
Incidentally, I should mention that there is in fact a version of deviceLibrary.thatFunction() that takes a timeout. It doesn't actually time out.
I found this answer, but it's C#- and Windows-specific, and this one, which seems relevant. But I'm not so sure about spawning hundreds of extra processes a second (edit: oh right, I could banish all the calls to one or two separate processes, if they communicate well enough and I can share the device between them. Hm...)
Pertinent background information: I'm using MSVC 2013 on Windows 7, but the code has to cross-compile for ARM on Debian with GCC 4.6 also. My level of C++ knowledge is... well... if it seems like I'm missing something obvious, I probably am.
Thanks!
If you want to reliably kill something that's out of your control and may hang, use a separate process.
While process isolation was once considered very 'heavy-handed', browsers like Chrome today implement it on a per-tab basis. Each tab gets a process, the GUI has a process, and if a tab's rendering dies it doesn't take down the whole browser.
How can Google Chrome isolate tabs into separate processes while looking like a single application?
Threads are simply not designed for letting a codebase defend itself from ill-behaved libraries. Processes are.
So define the services you need, put them all in one bridge program that uses your flaky library, and use interprocess communication from your main app to speak with the bridge. If the bridge times out or has a problem due to the flakiness, kill it and restart it.
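A POSIX sketch of that kill-and-restart supervision (the child here fakes a hang with sleep, and the timings are illustrative, not tuned):

```cpp
#include <cassert>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Run the bridge as a child process; if it exceeds `timeout_sec`, kill it.
// Returns true if the child finished on its own, false if we had to kill it.
bool supervise(unsigned work_sec, unsigned timeout_sec) {
    pid_t pid = fork();
    if (pid == 0) {            // child: stand-in for the flaky bridge process
        sleep(work_sec);       // pretend to call deviceLibrary.thatFunction()
        _exit(0);
    }
    for (unsigned waited = 0; waited < timeout_sec; ++waited) {
        int status;
        if (waitpid(pid, &status, WNOHANG) == pid) return true;  // finished in time
        sleep(1);
    }
    kill(pid, SIGKILL);        // hung: a process, unlike a thread, can be killed
    waitpid(pid, nullptr, 0);  // reap it so we can restart a fresh bridge cleanly
    return false;
}
```

After a kill, the parent simply forks a new bridge; no hung threads accumulate anywhere.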
I am only going to answer this part of your text:
when a thread running the recalcitrant function hangs, what do I do?
A thread could invoke inline machine instructions.
These instructions might clear the interrupt flag.
This may cause the code to be non-interruptible.
As long as it does not decide to return, you cannot force it to return.
You might be able to force it to die (e.g., kill the process containing the thread), but you cannot force the code to return.
I hope my answer convinces you that the answer recommending a bridge process is in fact what you should do.
The first thing you do is make sure that it's the library that's buggy. Then you create a minimal example that demonstrates the problem (if possible), and send a bug report and the example to the library's developer. Lastly, you cross your fingers and wait.
What you don't do is put your fingers in your ears and say "LALALALALA" while you hide the problem behind layers of crud in an attempt to pretend the problem is gone.

C/C++ framework for distributed computing (MPI?)

I'm investigating as to whether there is a framework/library that will help me implement a distributed computing system.
I have a master that has a large amount of data split up into files of a few hundred megabytes. The files would be chunked up into ~1MB pieces and distributed to workers for processing. Once initialized, the processing on each worker is dependent on state information obtained from the previous chunk, so workers must stay alive throughout the entire process, and the master needs to be able to send the right chunks to the right workers. One other thing to note is that this system is only a piece of a larger processing chain.
I did a little bit of looking into MPI (specifically Open MPI), but I'm not sure it's the right fit. It seems to be geared toward sending small messages (a few bytes), though I did find some charts showing that its throughput increases with larger messages (up to 1/5 MB).
I'm concerned that there might not be a way to maintain the state unless it was constantly sent back and forth in messages. Looking at the structure of some MPI examples, it looked like the master (rank 0) and workers (ranks 1-n) were part of the same piece of code, and their actions were determined by conditionals. Can I have the workers stay alive (maintaining state) and wait for more messages to arrive?
Now that I'm writing this, I think it would work. The rank 1...n section would just be a loop with a blocking receive followed by the processing code. State would be maintained in that loop until a "no more data" message was received, at which point it would send back the results. I might be beginning to grasp the MPI structure here...
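That receive-process loop can be sketched without MPI at all (a mutex-guarded queue stands in for MPI_Recv here, and an empty chunk plays the "no more data" message):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Stand-in for the MPI message channel: a blocking receive from a shared queue.
struct Channel {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> q;
    void send(std::string chunk) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(chunk)); }
        cv.notify_one();
    }
    std::string recv() {                      // blocks like MPI_Recv
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); });
        std::string chunk = std::move(q.front());
        q.pop();
        return chunk;
    }
};

// Worker loop: state (here, total bytes seen) survives across chunks
// until the empty "no more data" message arrives.
std::size_t worker(Channel& ch) {
    std::size_t state = 0;
    for (;;) {
        std::string chunk = ch.recv();
        if (chunk.empty()) return state;      // in real MPI, send results to rank 0
        state += chunk.size();                // processing depends on accumulated state
    }
}
```

In the MPI version, recv() becomes MPI_Recv, the sentinel becomes a distinguished message tag, and the return becomes an MPI_Send back to the master.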
My other question about MPI is how to actually run the code. Remember that this system is part of a larger system, so it needs to be called from some other code. The examples I've seen make use of mpirun, with which you can specify the number of processors, or a hosts file. Can I get the same behavior by calling my MPI function from other code?
So my question is: is MPI the right framework here? Is there something better suited to this task, or am I going to be doing this from scratch?
MPI seems a reasonable option for your task. It uses the SPMD architecture, meaning you have the same program executing simultaneously on a possibly distributed or even heterogeneous system. So the choice of the process with rank 0 as the master and the others as workers is not mandatory; you can choose other patterns.
If you want your application to maintain state, you can keep a long-lived MPI application, with the master process sending commands to the workers over time. You should probably also consider saving that state to disk to be more robust against failures.
An MPI program is initially launched with mpirun. For example, you create some program program.c, compile it with mpicc -o program program.c, and then run mpirun -np 20 ./program <params> to start 20 processes. You will have 20 independent processes, each with its own rank, so further progress is up to your application. The way these 20 processes are distributed among nodes/processors is controlled by things like the hostfile; you should look at the documentation more closely.
If you want your code to be reusable, i.e. runnable from another MPI program, you should generally at least learn what an MPI communicator is and how to create/use one. There are articles on the net, the keywords being "creating an MPI library".
If the code using your library is not itself MPI code, that's no huge problem: your MPI program is not limited to MPI for external communication; it just has to communicate through MPI inside its own logic. You can launch any program with mpirun; unless it makes calls into the MPI library, it won't notice that it's being run under MPI.
If you are getting up and running with a cluster and MPI, then I recommend having a look at Boost.MPI. It's a C++ wrapper over an underlying MPI library (such as Open MPI or MPICH2). I found it very useful.
Your idea of sending messages back and forth, with each node requesting a new message when it finishes until a "no more messages" handshake is received, sounds like a good one. I had a similar idea and got a simple version up and running. I just put it on GitHub in case you want to have a look: https://github.com/thshorrock/mpi_manager. Most of the code is in the header file:
https://github.com/thshorrock/mpi_manager/blob/master/include/mpi_manager/mpi_manager.hpp
Note, this was just a bit of code to get me up and running; it's not fully documented and not a final version, but it's fairly short, works fine for my purposes, and should provide a starting point for you.
Have a look at FastFlow. They use a data flow model to process data. It is extremely efficient if this model is suitable for you.
RayPlatform is an MPI framework for C++. You need to define plugins for your application (like modules in Linux).
RayPlatform is licensed under the LGPLv3.
Link: https://github.com/sebhtml/RayPlatform
It is also well documented.
An example application using RayPlatform: https://github.com/sebhtml/RayPlatform-example

Handling GUI thread in a program using OpenMP

I have a C++ program that performs some lengthy computation in parallel using OpenMP. Now that program also has to respond to user input and update some graphics. So far I've been starting my computations from the main / GUI thread, carefully balancing the workload so that each batch is neither too short to mask the OpenMP threading overhead nor so long that the GUI becomes unresponsive.
Clearly I'd like to fix that by running everything concurrently. As far as I can tell, OpenMP 2.5 doesn't provide a good mechanism for doing this; I assume it wasn't intended for this type of problem. I also wouldn't want to dedicate an entire core to the GUI thread; it needs less than 10% of one for its work.
I thought maybe moving the computation into a separate pthread which launches the parallel constructs would be a good way of solving this. I coded this up, but OpenMP crashed when invoked from the pthread, similar to this bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36242 . Note that I was not trying to launch parallel constructs from more than one thread at a time; OpenMP was used in only one pthread throughout the program.
It seems I can neither use OpenMP to schedule my GUI work concurrently nor use pthreads to have the parallel constructs run concurrently. I was thinking of just handling my GUI work in a separate thread, but that happens to be rather ugly in my case and might actually not work due to various libraries I use.
What's the textbook solution here? I'm sure others have used OpenMP in a program that needs to concurrently deal with a GUI / networking etc., but I haven't been able to find any information using Google or the OpenMP forum.
Thanks!
There is no textbook solution. The textbook application for OpenMP is the non-interactive program that reads input files, does heavy computation, and writes output files, all using the same thread pool of size ~#CPUs in your supercomputer. It was not designed for concurrent execution of interactive and computational code, and I don't think interoperability with any threading library is guaranteed by the spec.
Leaving theory aside, you seem to have encountered a bug in GCC's OpenMP implementation. Please file a bug report with the GCC maintainers, and in the meantime either look for a different compiler or run your GUI code in a separate process, communicating with the OpenMP program over some IPC mechanism (e.g., async I/O over sockets).