Threading access to various buffers - C++

I'm trying to figure out the best way to do this, but I'm getting stuck pinning down exactly what it is I'm trying to do, so I'm going to explain the problem, what I think I want to do, and where I'm getting stuck.
I am working on a program that has a single array (an image, really) onto which a large number of objects can be placed each frame. Each object is completely independent of all the others; the only shared dependency is the output, since it is possible for two of these objects to land on the same location in the array. I'm trying to make placing the objects on the image more efficient so that I can place more objects, and to do that I want to thread the problem.
The first step I have taken towards threading it involves simply protecting the array with a mutex. All operations that place an object on the array call the same function, so I only have to put the mutex lock in one place. So far it is working, but I am not seeing the improvement I had hoped for. My hypothesis is that most of the time, the limiting factor is the image write statement.
What I'm thinking I need to do next is to have multiple image buffers that I'm writing to, and to combine them when all of the operations are done. I should say that obscuration is not a problem; all that needs to be done is to simply add the pixel counts together. However, I'm struggling to figure out what mechanism I need to use in order to do this. I have looked at semaphores, but while I can see that they would limit access to a number of buffers, I can envision a situation in which two or more threads would be trying to write to the same buffer at the same time, potentially leading to inaccuracies.
I need a solution that does not involve any new non-standard libraries. I am more than willing to build the solution, but I would very much appreciate a few pointers in the right direction, as I'm currently just wandering around in the dark...
To help visualize this, imagine that I am told to place, say, balls at various locations on the image array. I am told to place the balls each frame, with a given brightness, location, and size. The exact location of the balls is dependent on the physics from the previous frame. All of the balls must be placed on a final image array, as quickly as they possibly can be. For the purpose of this example, if two balls are on top of each other, the brightness can simply be added together, thus there is no need to figure out if one is blocking the other. Also, no using GPU cards;-)
Pseudo-code would look like this (assuming that some logical object is given for location, brightness, and size). Also assume that isValidPoint simply determines whether the point lies on the circle, given the location and radius of said circle.
global output_array[x_arrLimit*y_arrLimit]

void update_ball(int ball_num)
{
    calc_ball_location(ball_num, *location, *brightness, *size); // location, brightness, size all set inside function
    place_ball(location, brightness, size);
}

void place_ball(location, brightness, size)
{
    get_bounds(location, size, *xlims, *ylims);
    for (int x = xlims.min; x < xlims.max; x++)
    {
        for (int y = ylims.min; y < ylims.max; y++)
        {
            if (isValidPoint(location, size, x, y))
            {
                output_array(x, y) += brightness;
            }
        }
    }
}

The reason you're not seeing any speed up with the current design is that, with a single mutex for the entire buffer, you might as well not bother with threading, as all the objects have to be added serially anyway (unless there's significant processing being done to determine what to add, but it doesn't sound like that's the case). Depending on what it takes to "add an object to the buffer" (do you use scan-line algorithms, flood fill, or something else), you might consider having one mutex per row or a range of rows, or divide the image into rectangular tiles with one mutex per region or something. That would allow multiple threads to add to the image at the same time as long as they're not trying to update the same regions.
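The per-region idea could be sketched like this in standard C++11; the image dimensions, the band height, and the add_rect helper are all illustrative choices, not anything from the question:

```cpp
#include <algorithm>
#include <mutex>
#include <vector>

// Sketch of per-region locking: one mutex guards each band of rows.
constexpr int kWidth = 1024, kHeight = 1024, kRowsPerTile = 64;
constexpr int kNumTiles = (kHeight + kRowsPerTile - 1) / kRowsPerTile;

std::vector<float> output_array(kWidth * kHeight);
std::mutex tile_mutex[kNumTiles];

// Add a rectangular splat; each row band is locked once while we write it,
// so two threads only contend when their objects overlap the same band.
void add_rect(int xmin, int xmax, int ymin, int ymax, float brightness)
{
    for (int tile = ymin / kRowsPerTile; tile <= (ymax - 1) / kRowsPerTile; ++tile) {
        std::lock_guard<std::mutex> lock(tile_mutex[tile]);
        int y0 = std::max(ymin, tile * kRowsPerTile);
        int y1 = std::min(ymax, (tile + 1) * kRowsPerTile);
        for (int y = y0; y < y1; ++y)
            for (int x = xmin; x < xmax; ++x)
                output_array[y * kWidth + x] += brightness;
    }
}
```

The key point is locking once per band per object, not once per pixel; a per-pixel lock would cost more than the serial version.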

OK, you have an image member in some object. Add the, no doubt complex, code to add other images/objects to it, manipulate it, whatever. Aggregate in all the other objects that may be involved, add a command enum to tell threads which op to do and an 'OnCompletion' event to call when done.
Queue it to a pool of threads hanging on the end of a producer-consumer queue. Some thread will get the *object, perform the operation on the image/set and then call the event, (passing the completed *object as a parameter). In the event, you can do what you like, according to the needs of your app. Maybe you will add the processed images into a (thread-safe!!) vector or other container, or queue them off to some other thread - whatever.
If the order of processing the images must be preserved, (eg. a video stream), you could add an incrementing sequence-number to each object that is submitted to the pool, enabling your 'OnCompletion' handler to queue up 'later' images until all earlier ones have come in.
Since no two threads ever work on the same image, you need no locking while processing. The only locks you should, (may), need are those internal to the queues, and they only lock for the time taken to push/pop object pointers to/from the queue - contention will be very rare.
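A minimal sketch of that producer-consumer pool using only the standard library (C++11); Task and all other names are illustrative, and a real version would carry the image and a command enum rather than bare callbacks:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// "Task" stands in for the image-carrying object described above.
struct Task {
    std::function<void()> work;          // the image operation
    std::function<void()> on_complete;   // called by the worker when done
};

class WorkerPool {
public:
    explicit WorkerPool(int n) {
        for (int i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~WorkerPool() {   // drain remaining tasks, then join the workers
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(Task t) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(t)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            Task t;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;   // done_ set and no work left
                t = std::move(q_.front());
                q_.pop();
            }
            t.work();          // process the image: no locking needed,
            t.on_complete();   // no other thread touches this object
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Task> q_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

The only lock is the one inside the queue, held just long enough to push or pop a task, which matches the claim above that contention will be very rare.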

Related

Update index variables in threads

I have objects like balls. These objects are dynamically created and stored in a vector. For each of these balls, a separate thread is created that updates its coordinates. Each of these threads has a reference to the vector of balls and knows the index of its ball. Now, let's say I need to delete several balls and the threads associated with them.
I did it like this:
each ball has a bool killMe variable that becomes true when the ball needs to be removed. The thread that updates the coordinates notices that the ball needs to be removed, removes the ball, and terminates on its own. But when the ball is removed from the vector, the indices of the subsequent balls change, and their threads, trying to access them the next time, cause the program to crash.
How can I keep the ball indices in their threads up to date?
Rather than each thread having an index into the vector, why not pass a reference to the object being worked on?
Note that this may still be problematic if your vector is vector<Ball>: references into a vector are invalidated when elements are erased or the vector reallocates, which sounds like exactly your problem.
But you could store vector<std::shared_ptr<Ball>> and then you're golden.
Another choice if you really want to use indexes is still to use a vector of shared pointers but then you can nullify the pointers you need to delete -- leaving holes in your vector, but at least you aren't moving things around.
The other choice involves mutexes, and you'll be mutex-locked A LOT. This seems less useful.
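A rough sketch of the shared_ptr approach, with illustrative names:

```cpp
#include <memory>
#include <vector>

// Each updater thread keeps its own shared_ptr to a Ball instead of an
// index into the vector, so erasing (or nulling) entries never
// invalidates what the thread holds.
struct Ball {
    double x = 0, y = 0;
    bool killMe = false;
};

std::vector<std::shared_ptr<Ball>> balls;

// "Delete" a ball without shifting later elements: leave a hole.
void remove_ball(std::size_t i) {
    balls[i].reset();   // threads holding a copy keep the Ball alive
}
```

Resetting the pointer leaves a hole but never shifts elements, so any indices other threads hold stay valid, and a thread that still owns a copy of the shared_ptr keeps its Ball alive until it drops it.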

Thread Safe Integer Array?

I have a situation where I have a legacy multi-threaded application I'm trying to move to a linux platform and convert into C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one Foreground task that mostly reads the values....but on occasion can update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update the value if the value has actually changed. The worker is constantly collecting data and doing calculation and storing the data whether it changes or not.
So should I create a custom class MyInt which wraps the structure, include an array of mutexes to lock for updating/reading each value, and then overload the [], =, ++, +=, -=, etc.? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to try and keep the above notation for doing the updates...but I get that it might not be possible.
Thanks,
WB
The first thing to do is make the program work reliably, and the easiest way to do that is to have a Mutex that is used to control access to the entire array. That is, whenever either thread needs to read or write to anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
If things still aren't fast enough for you, you could then look in to having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.
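The segment-mutex optimization from the middle paragraph might look roughly like this; the stripe size and accessor functions are illustrative:

```cpp
#include <mutex>

// One mutex per segment ("stripe") of the array: 5000 ints, 100 per
// stripe -> 50 mutexes. All numbers are illustrative.
constexpr int kSize = 5000, kStripe = 100;
int R[kSize];
std::mutex stripe_mutex[kSize / kStripe];

void set_value(int i, int v) {
    std::lock_guard<std::mutex> lock(stripe_mutex[i / kStripe]);
    R[i] = v;
}

int get_value(int i) {
    std::lock_guard<std::mutex> lock(stripe_mutex[i / kStripe]);
    return R[i];
}
```

Note that an expression touching several stripes at once, like R[5] = (R[10] + R[20]) / 50, would need to take all the involved stripe mutexes, always in a fixed order, to avoid deadlock.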

MPI synchronize matrix of vectors

Excuse me if this question is common or trivial, I am not very familiar with MPI so bear with me.
I have a matrix of vectors. Each vector is empty or has a few items in it.
std::vector<someStruct*> partitions[matrix_size][matrix_size];
When I start the program each process will have the same data in this matrix, but as the code progresses each process might remove several items from some vectors and put them in other vectors.
So when I reach a barrier I somehow have to make sure each process has the latest version of this matrix. The big problem is that each process might manipulate any or all vectors.
How would I go about to make sure that every process has the correct updated matrix after the barrier?
EDIT:
I am sorry I was not clear. Each process may move one or more objects to another vector but only one process may move each object. In other words each process has a list of objects it may move, but the matrix may be altered by everyone. And two processes can't move the same object ever.
In that case you'll need to send messages using MPI_Bcast that inform the other processes about this and instruct them to do the same. Alternatively, if the ordering doesn't matter until you hit the barrier, you can send the messages only to the root process, which performs the permutations and then, after the barrier, sends the result to all the others using MPI_Bcast.
One more thing: vectors of pointers are usually quite a bad idea, as you'll need to manage the memory manually in there. If you can use C++11, use std::unique_ptr or std::shared_ptr instead (depending on what your semantics are), or use Boost which provides very similar facilities.
And lastly, representing a matrix as a fixed-size array of fixed-size arrays is really bad. First: the matrix size is fixed. Second: adjacent rows are not necessarily stored in contiguous memory, which can slow your program down enormously (literally by orders of magnitude). Instead, represent the matrix as a linear array of size Nrows*Ncols, and index the elements as i*Ncols + j, where i and j are the row and column indices, respectively (row-major storage). If you want column-major storage instead, address the elements by i + Nrows*j. You can wrap this index-juggling in inline functions that have virtually zero overhead.
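The flat-storage suggestion can be wrapped up in a tiny class; this is just a sketch, and the Matrix name and double element type are illustrative:

```cpp
#include <vector>

// One contiguous buffer with an inline index helper (row-major here;
// swap the roles of i and j in at() for column-major).
struct Matrix {
    int nrows, ncols;
    std::vector<double> data;
    Matrix(int r, int c) : nrows(r), ncols(c), data(r * c) {}
    inline double& at(int i, int j)       { return data[i * ncols + j]; }
    inline double  at(int i, int j) const { return data[i * ncols + j]; }
};
```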
I would suggest to lay out the data differently:
Each process has a map of its objects and their positions in the matrix. How that is implemented depends on how you identify objects. If all local objects are numbered, you could just use a vector<pair<int,int>>.
Treat that as the primary structure you manipulate and communicate that structure with MPI_Allgather (each process sends it data to all other processes, at the end everyone has all data). If you need fast lookup by coordinates, then you can build up a cache.
That may or may not be performing well. Other optimizations (like sharing 'transactions') totally depend on your objects and the operations you perform on them.

Storing collections of items that an algorithm might need

I have a class MyClass that stores a collection of PixelDescriptor* objects. MyClass uses a function specified by a Strategy pattern style object (call it DescriptorFunction) to do something for each descriptor:
void MyFunction()
{
    descriptorA = DescriptorCollection[0];
    for_each descriptor in DescriptorCollection
    {
        DescriptorFunction->DoSomething(descriptor)
    }
}
However, this only makes sense if the descriptors are of a type that the DescriptorFunction knows how to handle. That is, not all DescriptorFunctions know how to handle all descriptor types, but as long as the stored descriptors are of a type that the specified visitor knows about, all is well.
How would you ensure the right type of descriptors are computed? Even worse, what if the strategy object needs more than one type of descriptor?
I was thinking about making a composite descriptor type, something like:
class CompositeDescriptor
{
    std::vector<PixelDescriptor*> Descriptors;
};
Then a CompositeDescriptor could be passed to the DescriptorFunction. But again, how would I ensure that the correct descriptors are present in the CompositeDescriptor?
As a more concrete example, say one descriptor is Color and another is Intensity. One Strategy may be to average Colors. Another strategy may be to average Intensities. A third strategy may be to pick the larger of the average color or the average intensity.
I've thought about having another Strategy style class called DescriptorCreator that the client would be responsible for setting up. If a ColorDescriptorCreator was provided, then the ColorDescriptorFunction would have everything it needs. But making the client responsible for getting this pairing correct seems like a bad idea.
Any thoughts/suggestions?
EDIT: In response to Tom's comments, a bit more information:
Essentially DescriptorFunction is comparing two pixels in an image. These comparisons can be done in many ways (besides just finding the absolute difference between the pixels themselves). For example: 1) compute the average of corresponding pixels in regions around the pixels (centered at the pixels); 2) compute a fancier "descriptor", which typically produces a vector at each pixel, and average the element-wise difference of the two vectors; 3) compare 3D points corresponding to the pixel locations in external data, etc. etc.
I've run into two problems.
1) I don't want to compute everything inside the strategy (if the strategy just took the 2 pixels to compare as arguments) because then the strategy has to store lots of other data (the image, there is a mask involved describing some invalid regions, etc etc) and I don't think it should be responsible for that.
2) Some of these things are expensive to compute. I have to do this millions of times (the pixels being compared are always different, but the features at each pixel do not change), so I don't want to compute any feature more than once. For example, consider a strategy that compares the fancy descriptors. In each iteration, one pixel is compared to all other pixels. This means that in the second iteration, all of the features would have to be computed again, which is extremely redundant. This data needs to be stored somewhere that all of the strategies can access - this is why I was trying to keep a vector in the main client.
Does this clarify the problem? Thanks for the help so far!
The first part sounds like a visitor pattern could be appropriate. A visitor can ignore any types it doesn't want to handle.
If they require more than one descriptor type, then it is a different abstraction entirely. Your description is somewhat vague, so it's hard to say exactly what to do. I suspect that you are over thinking it. I say that because generally choosing arguments for an algorithm is a high level concern.
I think the best advice is to try it out.
I would write each algorithm with the concrete arguments (or stubs if its well understood). Write code to translate the collection into the concrete arguments. If there is an abstraction to be made, it should be apparent while writing your translations. Writing a few choice helper functions for these translations is usually the bulk of the work.
Giving a concrete example of the descriptors and a few example declarations might give enough information to offer more guidance.
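A bare-bones sketch of the visitor suggestion, with entirely illustrative class names: the visitor's default visit overloads do nothing, so a strategy silently ignores descriptor types it doesn't handle.

```cpp
#include <vector>

struct ColorDescriptor;
struct IntensityDescriptor;

// Default visits do nothing, so a visitor ignores types it doesn't want.
struct DescriptorVisitor {
    virtual ~DescriptorVisitor() = default;
    virtual void visit(ColorDescriptor&) {}
    virtual void visit(IntensityDescriptor&) {}
};

struct PixelDescriptor {
    virtual ~PixelDescriptor() = default;
    virtual void accept(DescriptorVisitor& v) = 0;
};

struct ColorDescriptor : PixelDescriptor {
    double r = 0, g = 0, b = 0;
    void accept(DescriptorVisitor& v) override { v.visit(*this); }
};

struct IntensityDescriptor : PixelDescriptor {
    double value = 0;
    void accept(DescriptorVisitor& v) override { v.visit(*this); }
};

// A strategy that only understands intensities; colors pass through.
struct AverageIntensity : DescriptorVisitor {
    double sum = 0;
    int count = 0;
    void visit(IntensityDescriptor& d) override { sum += d.value; ++count; }
};
```

Running AverageIntensity over a mixed collection then simply skips every ColorDescriptor, which is the "ignore any types it doesn't want to handle" behavior described above.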

Separate physics thread without locks

I have a classic physics-thread vs. graphics-thread problem:
Say I'm running one thread for the physics update and one thread for rendering.
In the physics thread (pseudo-code):
while(true)
{
    foreach object in simulation
        SomeComplicatedPhysicsIntegration( &object->modelviewmatrix);
        //modelviewmatrix is a vector of 16 floats (ie. a 4x4 matrix)
}
and in the graphics thread:
while(true)
{
    foreach object in simulation
        RenderObject(object->modelviewmatrix);
}
Now in theory this would not require locks, as one thread is only writing to the matrices and another is only reading, and I don't care about stale data so much.
The problem is that updating the matrix is not an atomic operation, and sometimes the graphics thread will read a partially updated matrix (ie. not all 16 floats have been copied yet, only some of them), which means part of the matrix is from one physics frame and part is from the previous frame, which in turn means the matrix is no longer affine (ie. it's basically corrupted).
Is there any good method of preventing this without using locks? I read about a possible implementation using double buffering, but I cannot imagine a way that would work without syncing the threads.
Edit: I guess what I'd really like to use is some sort of triple buffering like they use on graphic displays.. anyone know of a good presentation of the triple buffering algorithm?
Edit 2: Indeed, unsynchronized triple buffering is not a good idea (as suggested in the answers below). The physics thread can run multiple cycles, eating a lot of CPU and stalling the graphics thread, computing frames that never even get rendered in the end.
I have opted for a simple double-buffered algorithm with a single lock, where the physics thread only computes as much as 1 frame in advance of the graphics thread before swapping buffers. Something like this:
Physics:
while(true)
{
    foreach physicstimestep
        foreach object in simulation
            SomeComplicatedPhysicsIntegration( &object->modelviewmatrix.WriteBuffer);
    LockSemaphore()
    SwapBuffers()
    UnlockSemaphore()
}
Graphics:
while(true)
{
    LockSemaphore()
    foreach object in simulation
        RenderObject(object->modelviewmatrix.ReadBuffer);
    UnlockSemaphore()
}
How does that sound?
You could maintain a shared queue between the two threads, and implement the physics thread such that it only adds a matrix to the queue after it has fully populated all of the values in that matrix. This assumes that the physics thread allocates a new matrix on each iteration (or more specifically that the matrices are treated as read-only once they are placed in the queue).
So any time your graphics thread pulls a matrix out of the queue, it is guaranteed to be fully populated and a valid representation of the simulation state at the time at which the matrix was generated.
Note that the graphics thread will need to be able to handle cases in which the queue is empty for one or more iterations, and that it would probably be a good idea to world-timestamp each queue entry so that you have a mechanism of keeping the two threads reasonably in sync without using any formal synchronization techniques (for instance, by not allowing the graphics thread to consume any matrices that have a timestamp that is in the future, and by allowing it to skip ahead in the queue if the next matrix is from too far in the past). Also note that whatever queue you use must be implemented such that it will not explode if the physics thread tries to add something at the same time that the graphics thread is removing something.
but I cannot imagine a way that would work without syncing the threads.
No matter what kind of scheme you are using, synchronizing the threads is an absolute essential here. Without synchronization you run the risk that your physics thread will race far ahead of the graphics thread, or vice versa. Your program, typically a master thread that advances time, needs to be in control of thread operations, not the threading mechanism.
Double buffering is one scheme that lets your physics and graphics threads run in parallel (for example, you have a multi-CPU or multi-core machine). The physics thread operates on one buffer while the graphics thread operates on the other. Note that this induces a lag in the graphics, which may or may not be an issue.
The basic gist behind double buffering is that you duplicate your data to be rendered on screen.
If you run with some sort of locking, then your simulation thread will always be rendering exactly one frame ahead of the display thread. Every piece of data that gets simulated gets rendered. (The synchronization doesn't have to be very heavy: a simple condition variable can frequently be updated and wake the rendering thread pretty cheaply.)
If you run without synchronization, your simulation thread might simulate events that never get rendered, if the rendering thread cannot keep up. If you include a monotonically increasing generation number in your data (update it after each complete simulation cycle), then your rendering thread can simply busy-wait on the two generation numbers (one for each buffer of data).
Once one (or both) of the generation numbers is greater than the most-recently-rendered generation, copy the newest buffer into the rendering thread, update the most-recently-rendered counter, and start rendering. When it's done, return to busy waiting.
If your rendering thread is too fast, you may chew through a lot of processor in that busy wait. So this only makes sense if you expect to periodically skip rendering some data and almost never need to wait for more simulation.
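The generation-number scheme might be sketched like this with C++11 atomics; the matrix is shrunk to a single float for brevity, and all names are illustrative:

```cpp
#include <atomic>
#include <cstdint>

// The simulation alternates between two slots and bumps a generation
// counter after finishing each one; the renderer copies whichever slot
// is newest and newer than what it last drew.
struct Slot {
    std::atomic<std::uint64_t> generation{0};
    float payload{0};          // stands in for the 16-float matrix
};

Slot slots[2];
std::uint64_t last_rendered = 0;   // owned by the rendering thread

// Simulation side: write the payload first, then publish the generation.
void publish(int idx, std::uint64_t gen, float value) {
    slots[idx].payload = value;
    slots[idx].generation.store(gen, std::memory_order_release);
}

// Render side: returns true and copies out the newest unseen payload.
bool try_fetch(float& out) {
    int best = -1;
    std::uint64_t best_gen = last_rendered;
    for (int i = 0; i < 2; ++i) {
        std::uint64_t g = slots[i].generation.load(std::memory_order_acquire);
        if (g > best_gen) { best_gen = g; best = i; }
    }
    if (best < 0) return false;   // nothing newer yet: busy-wait or skip
    out = slots[best].payload;
    last_rendered = best_gen;
    return true;
}
```

One caveat this sketch glosses over: if the simulation laps the renderer and rewrites a slot while it is being copied, the copy can still tear; a real implementation would re-check the generation after the copy (seqlock style) or keep enough buffers that a slot in use is never rewritten.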
Don't update the matrix in the physics thread?
Take a chunk, (perhaps a row you have just rendered), and queue its position/size/whatever to the physics thread. Invert/transpose/whateverCleverMatrixStuff the row of modelviewmatrix's into another, new row. Post it back to the render thread. Copy the new row in at some suitable time in your rendering. Perhaps you do not need to copy it in - maybe you can just swap out an 'old' vector for the new one and free the old one?
Is this possible, or is the structure/manipulation of your matrices/whatever too complex for this?
All kinda depends on the structure of your data, so this solution may well be inappropriate/impossible.
Rgds,
Martin
Now in theory this would not require locks, as one thread is only writing to the matrices and another is only reading, and I don't care about stale data so much.
Beware: without proper synchronization, there is no guarantee that the reading thread will ever observe any changes by the writing thread. This aspect is known as visibility, and sadly, it is often overlooked.
Assuming lockless or near-lockless updates is actually what would solve your problem best, it sounds like you want the physics thread to calculate a new matrix, and then instantaneously update all those values at once, so it doesn't matter what version of the matrices the graphics thread gets, as long as (a) it gets them eventually and (b) it never gets half of one and half of the old one.
In which case, it sounds like you want a physics thread something like:
/* pseudocode */
while (true) foreach (object in simulation) {
    auto new_object = object;
    SomeComplicatedPhysicsIntegrationInPlace(new_object);
    atomic_swap(object, new_object); // In pseudocode, ignore return value since nowhere
                                     // else changes value of object. In code, use assert, etc
}
Alternatively, you can calculate a new state of the whole simulation and then swap the values. An easy way of implementing this would be:
/* Pseudocode */
while (true) {
    global_inactive_idx = 1 - global_active_idx;
    simulation[global_inactive_idx] = simulation[global_active_idx];
    foreach (object in simulation[global_inactive_idx]) {
        SomeComplicatedPhysicsIntegrationInPlace(object);
    }
    global_active_idx = global_inactive_idx; // implicitly assume this is atomic
}
The graphics thread should constantly render simulation[global_active_idx].
In fact, this doesn't quite work as written. It will work in many situations because, typically, writing a 1 to a memory location containing 0 is in fact atomic on most processors, but it's not guaranteed to work. Specifically, the other thread may never re-read that value. Many people bodge this by declaring the variable volatile, which works on many compilers but is not guaranteed to work.
However, to make either example work, all you need is an atomic write instruction, which C++ doesn't provide until C++0x, but which is fairly easy for compilers to implement: since most "write an int" instructions ARE atomic, the compiler just has to ensure the write actually happens and isn't reordered.
So, you can write your code using an atomic_swap function at the end of the physics loop, and implement that in terms of (a) a lock, write, unlock sequence - which shouldn't block the graphics thread significantly, because it's only blocked for the length of one memory write, and possibly only once per whole frame - or (b) a compiler's built-in atomic support, eg. http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
There are similar solutions, eg. the physics thread updating a semaphore, which the graphics thread treats simply as a variable with value 0 or 1; eg. the physics thread posts finished calculations to a queue (which is implemented internally in a similar way to the above), and the graphics thread constantly renders the top of the queue, repeating the last one if the queue underflows.
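The atomic_swap idea maps directly onto what became std::atomic in C++11; this sketch publishes a fully written matrix with a single atomic pointer store (all names illustrative):

```cpp
#include <array>
#include <atomic>

using Matrix = std::array<float, 16>;   // the 4x4 modelview matrix

Matrix buffers[2];
std::atomic<Matrix*> current{&buffers[0]};

// Physics thread: write into the spare buffer, then publish it. The
// graphics thread can never observe a half-written matrix, only the
// old one or the new one.
void publish(const Matrix& m) {
    Matrix* spare = (current.load(std::memory_order_relaxed) == &buffers[0])
                        ? &buffers[1] : &buffers[0];
    *spare = m;                                       // fill completely...
    current.store(spare, std::memory_order_release);  // ...then swap in
}

// Graphics thread: copy whichever matrix is current right now.
Matrix read_current() {
    return *current.load(std::memory_order_acquire);
}
```

With only two buffers this is safe as long as the physics thread publishes at most once while the graphics thread is mid-copy; otherwise the old "current" buffer can be overwritten during the copy, which is exactly the problem triple buffering solves.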
However, I'm not sure I'm understanding your problem. Why is there any point updating the graphics if the physics hasn't changed? Why is there any point running the physics faster than the graphics; can't it extrapolate further in each chunk? Does locking the update actually make any difference?
You can make two matrices fit in one cache-line-sized, 128-byte block and align it to 128 bytes. That way the matrix is loaded into a single cache line and therefore written to and read from memory as one block. This is not for the faint of heart, though, and will require much more work. (It also only fixes the problem of reading a matrix while it is being updated and getting a non-orthogonal result.)