Separate physics thread without locks - C++

I have a classic physics-thread vs. graphics-thread problem:
Say I'm running one thread for the physics update and one thread for rendering.
In the physics thread (pseudo-code):
while(true)
{
    foreach object in simulation
        SomeComplicatedPhysicsIntegration( &object->modelviewmatrix );
    // modelviewmatrix is a vector of 16 floats (i.e. a 4x4 matrix)
}
and in the graphics thread:
while(true)
{
    foreach object in simulation
        RenderObject(object->modelviewmatrix);
}
Now in theory this would not require locks, as one thread is only writing to the matrices and another is only reading, and I don't care about stale data so much.
The problem is that updating the matrix is not an atomic operation, and sometimes the graphics thread will read only partially updated matrices (i.e. not all 16 floats have been copied yet), which means part of the matrix is from one physics frame and part is from the previous frame, which in turn means the matrix is no longer affine (i.e. it's basically corrupted).
Is there any good method of preventing this without using locks? I read about a possible implementation using double buffering, but I cannot imagine a way that would work without syncing the threads.
Edit: I guess what I'd really like to use is some sort of triple buffering like they use on graphics displays... anyone know of a good presentation of the triple buffering algorithm?
Edit 2: Indeed, using non-synced triple buffering is not a good idea (as suggested in the answers below). The physics thread can run multiple cycles, eating a lot of CPU and stalling the graphics thread, computing frames that never even get rendered in the end.
I have opted for a simple double-buffered algorithm with a single lock, where the physics thread only computes as much as 1 frame in advance of the graphics thread before swapping buffers. Something like this:
Physics:
while(true)
{
    foreach physicstimestep
        foreach object in simulation
            SomeComplicatedPhysicsIntegration( &object->modelviewmatrix.WriteBuffer );
    LockSemaphore()
    SwapBuffers()
    UnlockSemaphore()
}
Graphics:
while(true)
{
    LockSemaphore()
    foreach object in simulation
        RenderObject(object->modelviewmatrix.ReadBuffer);
    UnlockSemaphore()
}
How does that sound?
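For concreteness, a minimal C++11 sketch of what I have in mind, with std::mutex standing in for the semaphore (Matrix4, Object, and the stub functions are illustrative placeholders, not an existing API):

#include <mutex>
#include <vector>

struct Matrix4 { float m[16]; };

void SomeComplicatedPhysicsIntegration(Matrix4* m) { /* ... */ }
void RenderObject(const Matrix4& m)                { /* ... */ }

struct Object {
    Matrix4 buffers[2];
    int readIndex = 0;                                       // graphics reads this one
    Matrix4& writeBuffer() { return buffers[1 - readIndex]; }
    Matrix4& readBuffer()  { return buffers[readIndex]; }
};

std::mutex swapMutex;
std::vector<Object> objects(100);

void physicsLoop() {
    for (;;) {
        for (auto& o : objects)
            SomeComplicatedPhysicsIntegration(&o.writeBuffer()); // no lock needed
        std::lock_guard<std::mutex> lock(swapMutex);
        for (auto& o : objects)
            o.readIndex = 1 - o.readIndex;                   // swap all buffers at once
    }
}

void graphicsLoop() {
    for (;;) {
        std::lock_guard<std::mutex> lock(swapMutex);
        for (auto& o : objects)
            RenderObject(o.readBuffer());
    }
}

Physics can compute the next frame while the current one is drawn, but since the swap needs the same lock the renderer holds while drawing, it never gets more than one frame ahead.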

You could maintain a shared queue between the two threads, and implement the physics thread such that it only adds a matrix to the queue after it has fully populated all of the values in that matrix. This assumes that the physics thread allocates a new matrix on each iteration (or more specifically that the matrices are treated as read-only once they are placed in the queue).
So any time your graphics thread pulls a matrix out of the queue, it is guaranteed to be fully populated and a valid representation of the simulation state at the time at which the matrix was generated.
Note that the graphics thread will need to be able to handle cases in which the queue is empty for one or more iterations, and that it would probably be a good idea to world-timestamp each queue entry so that you have a mechanism of keeping the two threads reasonably in sync without using any formal synchronization techniques (for instance, by not allowing the graphics thread to consume any matrices that have a timestamp that is in the future, and by allowing it to skip ahead in the queue if the next matrix is from too far in the past). Also note that whatever queue you use must be implemented such that it will not explode if the physics thread tries to add something at the same time that the graphics thread is removing something.
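A minimal sketch of such a queue (C++11; matrices are copied in by value, which satisfies the read-only requirement, and StampedMatrix with its world-time field is an illustrative name, not an existing API):

#include <mutex>
#include <queue>

struct Matrix4 { float m[16]; };

struct StampedMatrix {
    double worldTime;   // simulation time this matrix describes
    Matrix4 matrix;     // fully populated before being enqueued
};

class MatrixQueue {
    std::queue<StampedMatrix> q;
    std::mutex m;
public:
    void push(const StampedMatrix& s) {   // physics thread only
        std::lock_guard<std::mutex> lock(m);
        q.push(s);
    }
    bool tryPop(StampedMatrix& out) {     // graphics thread only
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return false;      // caller re-renders the last matrix it got
        out = q.front();
        q.pop();
        return true;
    }
};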

but I cannot imagine a way that would work without syncing the threads.
No matter what kind of scheme you are using, synchronizing the threads is an absolute essential here. Without synchronization you run the risk that your physics thread will race far ahead of the graphics thread, or vice versa. Your program, typically a master thread that advances time, needs to be in control of thread operations, not the threading mechanism.
Double buffering is one scheme that lets your physics and graphics threads run in parallel (for example, you have a multi-CPU or multi-core machine). The physics thread operates on one buffer while the graphics thread operates on the other. Note that this induces a lag in the graphics, which may or may not be an issue.

The basic gist behind double buffering is that you duplicate your data to be rendered on screen.
If you run with some sort of locking, then your simulation thread will always run exactly one frame ahead of the display thread. Every piece of data that gets simulated gets rendered. (The synchronization doesn't have to be very heavy: a simple condition variable can frequently be updated and wake the rendering thread pretty cheaply.)
If you run without synchronization, your simulation thread might simulate events that never get rendered, if the rendering thread cannot keep up. If you include a monotonically increasing generation number in your data (update it after each complete simulation cycle), then your rendering thread can simply busy-wait on the two generation numbers (one for each buffer of data).
Once one (or both) of the generation numbers is greater than the most-recently-rendered generation, copy the newest buffer into the rendering thread, update the most-recently-rendered counter, and start rendering. When it's done, return to busy waiting.
If your rendering thread is too fast, you may chew through a lot of processor in that busy wait. So this only makes sense if you expect to periodically skip rendering some data and almost never need to wait for more simulation.
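One hedged way to implement this (C++11 atomics; a seqlock-style variant of the generation counter, where the counter is odd while a write is in progress, and where the admittedly racy buffer copy is the usual seqlock compromise):

#include <atomic>
#include <cstdint>

struct Snapshot { float matrices[16]; };        // stands in for a whole buffer of data

Snapshot buffers[2];
std::atomic<std::uint64_t> gen[2] = {{0}, {0}}; // even = stable, odd = being written

// Simulation thread: bump to odd before writing, back to even after,
// so the renderer can detect a torn read.
void publish(int idx, const Snapshot& s) {
    gen[idx].fetch_add(1);                      // odd: write in progress
    buffers[idx] = s;
    gen[idx].fetch_add(1);                      // even: complete
}

// Render thread: busy-wait until some buffer is stable and newer than
// the most-recently-rendered generation, then copy it out.
bool tryFetch(Snapshot& out, std::uint64_t& lastRendered) {
    for (int i = 0; i < 2; ++i) {
        std::uint64_t before = gen[i].load();
        if ((before & 1) || before <= lastRendered)
            continue;                           // being written, or already rendered
        out = buffers[i];
        if (gen[i].load() != before)
            continue;                           // the writer raced us: retry
        lastRendered = before;
        return true;
    }
    return false;                               // nothing new yet: keep spinning
}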

Don't update the matrix in the physics thread?
Take a chunk, (perhaps a row you have just rendered), and queue its position/size/whatever to the physics thread. Invert/transpose/whateverCleverMatrixStuff the row of modelviewmatrices into another, new row. Post it back to the render thread. Copy the new row in at some suitable time in your rendering. Perhaps you do not need to copy it in - maybe you can just swap out an 'old' vector for the new one and free the old one?
Is this possible, or is the structure/manipulation of your matrices/whatever too complex for this?
All kinda depends on the structure of your data, so this solution may well be inappropriate/impossible.
Rgds,
Martin

Now in theory this would not require locks, as one thread is only writing to the matrices and another is only reading, and I don't care about stale data so much.
Beware: without proper synchronization, there is no guarantee that the reading thread will ever observe any changes by the writing thread. This aspect is known as visibility, and sadly, it is often overlooked.
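A minimal illustration, assuming C++11 is available: with a plain bool instead of the atomic, the compiler is free to cache the flag in a register and the reader may spin forever without ever seeing the update.

#include <atomic>

std::atomic<bool> ready(false);   // a plain `bool ready;` gives no visibility guarantee

void writer() { /* ... fill in the data ... */ ready.store(true, std::memory_order_release); }
bool reader() { return ready.load(std::memory_order_acquire); }   // guaranteed to see the store eventually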

Assuming lockless or near-lockless updates is actually what would solve your problem best, it sounds like you want the physics thread to calculate a new matrix, and then instantaneously update all those values at once, so it doesn't matter what version of the matrices the graphics thread gets, as long as (a) it gets them eventually and (b) it never gets half of one and half of the old one.
In which case, it sounds like you want a physics thread something like:
/* pseudocode */
while (true) foreach (object in simulation) {
    auto new_object = object;
    SomeComplicatedPhysicsIntegrationInPlace(new_object);
    atomic_swap(object, new_object); // in pseudocode, ignore the return value, since nothing
                                     // else changes the value of object; in real code, assert, etc.
}
Alternatively, you can calculate a new state of the whole simulation, and then swap the values. An easy way of implementing this would be:
/* pseudocode */
while (true) {
    simulation[ 1-global_active_idx ] = simulation[ global_active_idx ];
    foreach (object in simulation[ 1-global_active_idx ]) {
        SomeComplicatedPhysicsIntegrationInPlace(object);
    }
    global_active_idx = 1-global_active_idx; // implicitly assume this is atomic
}
The graphics thread should constantly render simulation[global_active_idx].
In fact, this doesn't work. It will work in many situations because, typically, writing a 1 to a memory location containing 0 is in fact atomic on most processors, but it's not guaranteed to work. Specifically, the other thread may never re-read that value. Many people bodge this by declaring the variable volatile, which works on many compilers, but is not guaranteed to work.
However, to make either example work, all you need is an atomic write instruction, which C++ doesn't provide until C++0x, but which is fairly easy for compilers to implement, since most "write an int" instructions ARE atomic; the compiler just has to ensure that one is used.
So, you can write your code using an atomic_swap function at the end of the physics loop, and implement that in terms of (a) a lock, write, unlock sequence, which shouldn't block the graphics thread significantly, because it's only blocking for the length of time of one memory write, and possibly only doing so once per whole frame, or (b) a compiler's built-in atomic support, eg. http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
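A sketch of what that atomic_swap could look like with C++11 (the standard that C++0x became); the two-slot layout and all names here are illustrative:

#include <atomic>

struct Matrix4 { float m[16]; };

struct Object {
    Matrix4 slots[2];                         // physics fills the spare slot
    std::atomic<Matrix4*> published;          // graphics reads through this
    Object() : published(&slots[0]) {}
};

void physicsStep(Object& obj) {
    Matrix4* cur   = obj.published.load(std::memory_order_relaxed);
    Matrix4* spare = (cur == &obj.slots[0]) ? &obj.slots[1] : &obj.slots[0];
    // SomeComplicatedPhysicsIntegrationInPlace(*spare);    // fill it completely...
    obj.published.store(spare, std::memory_order_release);  // ...then swap atomically
}

void renderStep(Object& obj) {
    const Matrix4* m = obj.published.load(std::memory_order_acquire);
    // RenderObject(*m);  // always a fully written matrix
}

One caveat: with only two slots, the renderer must be done with the pointer it loaded before physics rewrites that slot; a third slot (the triple buffering mentioned in the question) or a small pool removes that constraint.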
There are similar solutions, eg. the physics thread updating a semaphore, which the graphics thread treats simply as a variable with value 0 or 1; eg. the physics thread posts finished calculations to a queue (which is implemented internally in a similar way to the above), and the graphics thread constantly renders the top of the queue, repeating the last one if the queue underflows.
However, I'm not sure I'm understanding your problem. Why is there any point updating the graphics if the physics hasn't changed? Why is there any point updating the physics faster than the graphics can consume it; can't it extrapolate further in each chunk? Does locking the update actually make any difference?

You can make two matrices fit in one cache line (128 bytes, on processors with 128-byte cache lines) and align them to 128 bytes. This way a matrix is loaded in a single cache line and therefore written to and read from memory in one block. This is not for the faint of heart though, and will require much more work. (This only fixes the problem of reading a matrix while it is being updated and getting a non-orthogonal result.)

Related

Thread Safe Integer Array?

I have a situation where I have a legacy multi-threaded application I'm trying to move to a Linux platform and convert into C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one foreground task that mostly reads the values, but on occasion can update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update the value if the value has actually changed. The worker is constantly collecting data, doing calculations, and storing the data whether it changes or not.
So should I create a custom class MyInt which holds the structure, include an array of mutexes to lock for updating/reading each value, and then overload the [], =, ++, +=, -=, etc.? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to try and keep the above notation for doing the updates...but I get that it might not be possible.
Thanks,
WB
The first thing to do is make the program work reliably, and the easiest way to do that is to have a Mutex that is used to control access to the entire array. That is, whenever either thread needs to read or write to anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
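A minimal sketch of that striping scheme, assuming C++11's std::mutex (the stripe size of 100 mirrors the example above):

#include <mutex>

const int N = 5000;
const int STRIPE = 100;                      // array items per mutex

int R[N];
std::mutex stripes[N / STRIPE];

std::mutex& mutexFor(int index) { return stripes[index / STRIPE]; }

void incrementR(int i) {
    std::lock_guard<std::mutex> lock(mutexFor(i));
    R[i]++;
}

Note that an operation touching several stripes (e.g. R[5] = (R[10] + R[20]) / 50, if those indices fell in different stripes) must lock every stripe involved, in a fixed order such as ascending index, to avoid deadlock.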
If things still aren't fast enough for you, you could then look in to having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.

thread building block combined with pthreads

I have a queue with elements which need to be processed. I want to process these elements in parallel. There will be some sections on each element which need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q

process_element(e)
{
    lock()
    some synchronized area
    // a matrix access performed here so a spin lock would do
    unlock()
    ...
    unsynchronized area
    ...
    if( condition )
    {
        new_element = generate_new_element()
        q.push(new_element) // synchronized access to queue
    }
}

process_queue()
{
    while( elements in q ) // algorithm-is-finished condition
    {
        e = get_elem_from_queue(q) // synchronized access to queue
        process_element(e)
    }
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to use the Intel TBB concurrent_queue for the queue container. But then, will I be able to use pthreads functions (mutexes, conditions)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads running at one point in time? I was thinking to create the threads once, and then, after one element is processed, to access the queue and get the next element. However it is more complicated than that, because I have no guarantee that if there is no element in the queue the algorithm is finished.
My question
Before I start implementing I'd like to know if there is an easy way to use Intel TBB or pthreads to obtain the behaviour I want? More precisely, processing elements from a queue in parallel.
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthread_mutex_lock and pthread_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthread mutexes for dequeueing elements
Termination is tricky - one way to do this is to add a TERMINATE element to the queue, which, upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a designated thread dequeue it after all the threads are done. (A sketch of this scheme follows below.)
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
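To make the queue operations and the TERMINATE trick concrete, here is a hedged sketch using raw pthreads (the Element type and the function names are illustrative, not an existing API):

#include <pthread.h>
#include <queue>

struct Element { int payload; bool terminate; };

std::queue<Element> q;
pthread_mutex_t qMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  qCond  = PTHREAD_COND_INITIALIZER;

void enqueue(const Element& e) {
    pthread_mutex_lock(&qMutex);
    q.push(e);
    pthread_cond_signal(&qCond);
    pthread_mutex_unlock(&qMutex);
}

Element dequeue() {
    pthread_mutex_lock(&qMutex);
    while (q.empty())
        pthread_cond_wait(&qCond, &qMutex);   // sleeps; no busy-waiting
    Element e = q.front();
    q.pop();
    pthread_mutex_unlock(&qMutex);
    return e;
}

// Start num_threads of these with pthread_create(&tid, NULL, worker, NULL).
void* worker(void*) {
    for (;;) {
        Element e = dequeue();
        if (e.terminate) {
            enqueue(e);          // pass the sentinel on to the next worker
            return NULL;
        }
        // process_element(e); it may call enqueue() to add new work
    }
}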
TBB is compatible with other threading packages.
TBB also emphasizes scalability. So when you port your program from a dual-core to a quad-core machine you do not have to adjust it. With data-parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library you have to decide how much control you need in your application, because it does offer flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance costs.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with an std::queue correctly without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that the newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing till after all "older" items). Also, you don't have to worry about termination: parallel_do will complete automatically as soon as all initial queue items and new items created on the fly are processed.
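A minimal sketch of this approach (classic TBB parallel_do API; the Element type and the rule for spawning new work are illustrative):

#include <tbb/parallel_do.h>
#include <vector>

struct Element { int value; };

void process_all(std::vector<Element>& initial) {
    tbb::parallel_do(initial.begin(), initial.end(),
        [](Element& e, tbb::parallel_do_feeder<Element>& feeder) {
            // ... synchronized matrix access, then unsynchronized work ...
            if (e.value > 1) {
                Element child = { e.value / 2 };
                feeder.add(child);   // new work, processed on the fly
            }
        });
    // returns only when the initial items and all fed items are done
}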
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.

Threading access to various buffers

I'm trying to figure out the best way to do this, but I'm getting a bit stuck in figuring out exactly what it is that I'm trying to do, so I'm going to explain what it is, what I'm thinking I want to do, and where I'm getting stuck.
I am working on a program that has a single array (an image, really), on which a large number of objects can be placed each frame. Each object is completely independent of all other objects. The only dependency is in the output: it is possible for two of these objects to be placed on the same location of the array. I'm trying to increase the efficiency of placing the objects on the image, so that I can place more objects. In order to do that, I'm wanting to thread the problem.
The first step that I have taken towards threading it involves simply mutex protecting the array. All operations which place an object on the array will call the same function, so I only have to put the mutex lock in one place. So far, it is working, but it is not seeing the improvements that I would hope to have. I am hypothesizing that this is because most of the time, the limiting factor is the image write statement.
What I'm thinking I need to do next is to have multiple image buffers that I'm writing to, and to combine them when all of the operations are done. I should say that obscuration is not a problem; all that needs to be done is to simply add the pixel counts together. However, I'm struggling to figure out what mechanism I need to use in order to do this. I have looked at semaphores, but while I can see that they would limit the number of buffers, I can envision a situation in which two or more threads would be trying to write to the same buffer at the same time, potentially leading to inaccuracies.
I need a solution that does not involve any new non-standard libraries. I am more than willing to build the solution, but I would very much appreciate a few pointers in the right direction, as I'm currently just wandering around in the dark...
To help visualize this, imagine that I am told to place, say, balls at various locations on the image array. I am told to place the balls each frame, with a given brightness, location, and size. The exact location of the balls is dependent on the physics from the previous frame. All of the balls must be placed on a final image array, as quickly as they possibly can be. For the purpose of this example, if two balls are on top of each other, the brightness can simply be added together, thus there is no need to figure out if one is blocking the other. Also, no using GPU cards;-)
Pseudo-code would look like this (assuming that some logical object is given for location, brightness, and size). Also, assume that isValidPoint simply determines whether the point should be on the circle, given the location and radius of said circle.
global output_array[x_arrLimit*y_arrLimit]

void update_ball(int ball_num)
{
    calc_ball_location(ball_num, *location, *brightness, *size); // location, brightness, size all set inside function
    place_ball(location, brightness, size)
}

void place_ball(location, brightness, size)
{
    get_bounds(location, size, *xlims, *ylims)
    for (int x=xlims.min; x<xlims.max; x++)
    {
        for (int y=ylims.min; y<ylims.max; y++)
        {
            if (isValidPoint(location, size, x, y))
            {
                output_array(x,y) += brightness;
            }
        }
    }
}
The reason you're not seeing any speed up with the current design is that, with a single mutex for the entire buffer, you might as well not bother with threading, as all the objects have to be added serially anyway (unless there's significant processing being done to determine what to add, but it doesn't sound like that's the case). Depending on what it takes to "add an object to the buffer" (do you use scan-line algorithms, flood fill, or something else), you might consider having one mutex per row or a range of rows, or divide the image into rectangular tiles with one mutex per region or something. That would allow multiple threads to add to the image at the same time as long as they're not trying to update the same regions.
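For illustration, a sketch of per-band locking with C++11 mutexes (the image and band sizes are made up; a thread holds a band's lock only while touching rows inside that band):

#include <mutex>
#include <vector>

const int W = 1024, H = 1024;    // illustrative image size
const int BAND = 64;             // rows per lock

std::vector<float> output_array(W * H);
std::mutex bandLocks[H / BAND];

// Draw one row of a ball under that row's band lock; two threads only
// contend when their balls overlap the same band of rows.
void addRow(int y, int xmin, int xmax, float brightness) {
    std::lock_guard<std::mutex> lock(bandLocks[y / BAND]);
    for (int x = xmin; x < xmax; ++x)
        output_array[y * W + x] += brightness;
}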
OK, you have an image member in some object. Add the, no doubt complex, code to add other images/objects to it, manipulate it, whatever. Aggregate in all the other objects that may be involved, add some command enum to tell threads what op to do and an 'OnCompletion' event to call when done.
Queue it to a pool of threads hanging on the end of a producer-consumer queue. Some thread will get the *object, perform the operation on the image/set and then call the event, (pass the completed *object as a parameter). In the event, you can do what you like, according to the needs of your app. Maybe you will add the processed images into a (thread-safe!!), vector or other container or queue them off to some other thread - whatever.
If the order of processing the images must be preserved, (eg. video stream), you could add an incrementing sequence-number to each object that is submitted to the pool, so enabling your 'OnComplete' handler to queue up 'later' images until all earlier ones have come in.
Since no two threads ever work on the same image, you need no locking while processing. The only locks you should, (may), need are those internal to the queues, and they only lock for the time taken to push/pop object pointers to/from the queue - contention will be very rare.

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4; i.e. it's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores: user should be larger than real, as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());
    int i, j;

    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the implementation of glibc's rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops, turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
    // constructors, destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {
    Colour col;

    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.
    _sensor[i*_scrWidth + j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works: there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There are quite a few factors that could affect the performance of your code; one thing that comes to mind is cache alignment. Perhaps your data structures, and you did mention a tree, are not really ideal for caching, and the CPU ends up waiting for the data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out, and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler will attempt to keep this thread on the same core. If more than one of your threads are running on a single core, they will have to share that core. Another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end data that other threads need to be able to do computePixel?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
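As an illustration, a sketch of the question's render loop with one contiguous block of rows per thread: schedule(static) with no chunk size hands each thread a single large block, so only the block boundaries can ever share a cache line in the sensor array.

void Raytracer::render( Camera& cam ) {
    cam.setSamplingFunc(getSamplingFunction());
    // One contiguous block of rows per thread, instead of interleaved
    // chunks of 4 rows.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < cam.height(); ++i)
        for (int j = 0; j < cam.width(); ++j)
            cam.computePixel(i, j);
}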
I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times of my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

How to store and push simulation state while minimally affecting updates per second?

My app is comprised of two threads:
GUI Thread (using Qt)
Simulation Thread
My reason for using two threads is to keep the GUI responsive, while letting the Sim thread spin as fast as possible.
In my GUI thread I'm rendering the entities in the sim at an FPS of 30-60; however, I want my sim to "crunch ahead" - so to speak - and queue up game state to be drawn eventually (think streaming video, you've got a buffer).
Now for each frame of the sim I render I need the corresponding simulation "State". So my sim thread looks something like:
while(1) {
    simulation.update();
    SimState* s = new SimState;
    simulation.getAgents( s->agents ); // store agents
    // store other things to SimState here..
    stateStore.enqueue(s); // stateStore is a QQueue<SimState*>
    if( /* some threshold reached */ )
        // push stateStore
}
SimState looks like:
struct SimState {
    std::vector<Agent> agents;
    // other stuff here
};
And Simulation::getAgents looks like:
void Simulation::getAgents(std::vector<Agent> &a) const
{
    // mAgents is a std::vector<Agent>
    std::vector<Agent> a_tmp(mAgents);
    a.swap(a_tmp);
}
The Agents themselves are somewhat complex classes. The members are a bunch of ints and floats and two std::vector<float>s.
With this current setup the sim can't crunch much faster than the GUI thread is drawing. I've verified that the current bottleneck is simulation.getAgents( s->agents ), because even if I leave out the push, the updates per second are slow. If I comment out that line I see several orders of magnitude improvement in updates/second.
So, what sorts of containers should I be using to store the simulation's state? I know there is a bunch of copying going on atm, but some of it is unavoidable. Should I store Agent* in the vector instead of Agent ?
Note: In reality the simulation isn't in a loop, but uses Qt's QMetaObject::invokeMethod(this, "doSimUpdate", Qt::QueuedConnection); so I can use signals/slots to communicate between the threads; however, I've verified a simpler version using while(1){} and the issue persists.
Try re-using your SimState objects (using some kind of pool mechanism) instead of allocating them every time. After a few simulation loops, the re-used SimState objects will have vectors that have grown to the needed size, thus avoiding reallocation and saving time.
An easy way to implement a pool is to initially push a bunch of pre-allocated SimState objects onto a std::stack<SimState*>. Note that a stack is preferable to a queue, because you want to take the SimState object that is more likely to be "hot" in the cache memory (the most recently used SimState object will be at the top of the stack). Your simulation queue pops SimState objects off the stack and populates them with the computed SimState. These computed SimState objects are then pushed into a producer/consumer queue to feed the GUI thread. After being rendered by the GUI thread, they are pushed back onto the SimState stack (i.e. the "pool"). Try to avoid needless copying of SimState objects while doing all this. Work directly with the SimState object in each stage of your "pipeline".
Of course, you'll have to use the proper synchronization mechanisms in your SimState stack and queue to avoid race conditions. Qt might already have thread-safe stacks/queues. A lock-free stack/queue might speed things up if there is a lot of contention (Intel Thread Building Blocks provides such lock-free queues). Seeing that it takes on the order of 1/50 seconds to compute a SimState, I doubt that contention will be a problem.
If your SimState pool becomes depleted, then it means that your simulation thread is too "far ahead" and can afford to wait for some SimState objects to be returned to the pool. The simulation thread should block (using a condition variable) until a SimState object becomes available again in the pool. The size of your SimState pool corresponds to how much SimState can be buffered (e.g. a pool of ~50 objects gives you a crunch-ahead time of up to ~1 seconds).
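A minimal sketch of such a pool (C++11 standard library; Qt's QMutex/QWaitCondition would work the same way):

#include <condition_variable>
#include <mutex>
#include <stack>

struct SimState;                     // as defined in the question

class SimStatePool {
    std::stack<SimState*> pool;      // stack: the most recently used state stays on top
    std::mutex m;
    std::condition_variable cv;
public:
    void put(SimState* s) {          // GUI thread returns a rendered state
        std::lock_guard<std::mutex> lock(m);
        pool.push(s);
        cv.notify_one();
    }
    SimState* take() {               // sim thread blocks while the pool is drained
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !pool.empty(); });
        SimState* s = pool.top();
        pool.pop();
        return s;
    }
};

The sim thread calls take(), fills in the state, and enqueues it for the GUI thread, which calls put() once it has rendered the state.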
You can also try running parallel simulation threads to take advantage of multi-core processors. The Thread Pool pattern can be useful here. However, care must be taken that the computed SimStates are enqueued in the proper order. A thread-safe priority queue ordered by time-stamp might work here.
A simple sketch of the pipeline architecture I'm suggesting: SimState pool (a stack) -> simulation thread -> producer/consumer queue -> GUI thread -> back into the pool.
(NOTE: The pool and queue hold SimState by pointer, not by value!)
Hope this helps.
If you plan to re-use your SimState objects, then your Simulation::getAgents method will be inefficient. This is because the vector<Agent>& a parameter is likely to already have enough capacity to hold the agent list.
The way you're doing it now would throw away this already allocated vector and create a new one from scratch.
IMO, your getAgents should be:
void Simulation::getAgents(std::vector<Agent> &a) const
{
    a = mAgents;
}
Yes, you lose exception safety, but you might gain performance (especially with the reusable SimState approach).
Another idea: You could try making your Agent objects fixed-size, by using a C-style array (or boost::array) and a "count" variable instead of std::vector for Agent's float list members. Simply make the fixed-size array big enough for any situation in your simulation. Yes, you'll waste space, but you might gain a lot of speed.
You can then pool your Agents using a fixed-size object allocator (such as boost::pool) and pass them around by pointer (or shared_ptr). That'll eliminate a lot of heap allocation and copying.
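A sketch of what such an Agent might look like (the capacity is an arbitrary assumption):

#include <boost/array.hpp>
#include <cstddef>

const std::size_t MaxFloats = 64;           // "big enough for any situation"

struct FixedAgent {
    int   someInt;                          // the ints and floats stay as they were
    float someFloat;
    boost::array<float, MaxFloats> values;  // replaces one std::vector<float> member
    std::size_t count;                      // how many entries are actually in use
};
// FixedAgent has a fixed size and owns no heap memory, so it can come from a
// boost::pool allocator and be copied with a flat, memcpy-like copy.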
You can use this idea alone or in combination with the above ideas. This idea seems easier to implement than the pipeline thing above, so you might want to try it first.
Yet another idea: Instead of a thread pool for running simulation loops, you can break down the simulation into several stages and execute each stage in its own thread. Producer/consumer queues are used to exchange SimState objects between stages. For this to be effective, the different stages need to have roughly similar CPU workloads (otherwise, one stage will become the bottleneck). This is a different way to exploit parallelism.