Most efficient way to roll back data (to turn back time) - C++

So, I have a 3D platformer, and I want a button that, while held, makes you "go back in time". Thankfully the game is rather simple and only has one entity, so the only thing that would have to be saved for each frame is:
struct Coord {
    float x;
    float y;
    float z;
};

struct Bool6 {
    bool front;
    bool back;
    bool left;
    bool right;
    bool top;
    bool bottom;
};

struct Player {
    Coord Pos;
    Coord Vel;
    Bool6 Col;
};
But I fear that is a lot of data, especially since my game runs at roughly 60 fps and it would be good to have about 5 seconds (300 frames) of data saved that can be accessed when rolling back. I have considered doing something like this each frame:
Player Data[300];

// shift every saved frame back one slot, dropping the oldest
for (int i = 299; i > 0; i--)
{
    Data[i] = Data[i - 1];
}
Data[0] = /* this frame's data */;
However, that sounds like an outrageous amount of processing power going just into storing each frame.
Is there a more efficient way to store this data while keeping all of it in order?
Also, is there a way I can tell an array slot that it holds nothing, so that there aren't problems if the player tries to roll back before all of the array slots are filled, or after having already rolled back? I believe in C# I would have set it to null, but that doesn't work in C++, probably because I'm using structures.
Thanks much!

However that sounds like it means an outrageous amount of processing power
Before making such a statement it can be useful to do the math. The data you are concerned with is about 40 bytes per frame, so 40 * 300 = 12 kB. That easily fits in memory, and copying it once per frame is far from an "outrageous amount of processing power" on modern computers.
Is there a more efficient way to store this data while keeping all of it in order?
Yes. If your game is deterministic, all you have to store is the player's inputs and one game state from 5 seconds ago. When rolling back, reset the game to that state and replay the user inputs to recompute each frame's data.
See this question on gamedev stackexchange for an interesting discussion on how to design your own replay system.
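As a rough illustration of that replay idea, here is a minimal sketch (not code from the question; the Input struct and simulateFrame function are assumptions standing in for your own input type and per-frame update):

#include <deque>
#include <cstddef>

struct Input { bool forward, back, left, right, jump; }; // hypothetical per-frame input

// Assumed to exist: advances the game deterministically by one frame.
Player simulateFrame(const Player& state, const Input& in);

std::deque<Input> inputHistory; // the last 300 frames of input
Player keyframe;                // the game state from ~5 seconds ago

void recordFrame(const Input& in)
{
    inputHistory.push_back(in);
    if (inputHistory.size() > 300) {
        // let the keyframe absorb the oldest input so it stays exactly 300 frames behind
        keyframe = simulateFrame(keyframe, inputHistory.front());
        inputHistory.pop_front();
    }
}

// Rebuild the state as it was framesBack frames ago by replaying inputs from the keyframe.
// Assumes framesBack <= inputHistory.size().
Player rollBack(std::size_t framesBack)
{
    Player state = keyframe;
    for (std::size_t i = 0; i + framesBack < inputHistory.size(); ++i)
        state = simulateFrame(state, inputHistory[i]);
    return state;
}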

I don't think an array of 300 relatively small elements will slow you down at all; have you tried profiling it yet?
That said, you could store it in a vector and keep an iterator (or index) to the "current" element and update that, instead of shifting every entry.
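For instance, a minimal sketch of a fixed-size ring buffer over the Player struct from the question (the names are made up), where only an index moves each frame and nothing is shifted:

#include <cstddef>

struct History {
    static const std::size_t capacity = 300;   // ~5 seconds at 60 fps
    Player frames[capacity];
    std::size_t head;    // slot the next frame will be written to
    std::size_t count;   // how many valid frames are stored so far

    History() : head(0), count(0) {}

    void push(const Player& p)
    {
        frames[head] = p;
        head = (head + 1) % capacity;
        if (count < capacity) ++count;
    }

    // framesBack = 0 is the most recent frame. The caller must check
    // framesBack < count, which also answers the "empty slot" question.
    const Player& get(std::size_t framesBack) const
    {
        return frames[(head + capacity - 1 - framesBack) % capacity];
    }
};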

If you think storing 300 frames is a lot, store fewer; for example, you can keep 1 in every 5 frames:
....|....|....|....|..*
`*` is your current position, `|` the frames you will store, and `....` the other frames.
And you don't have to copy all the saved data each time: just delete the oldest entry and add one at the end. You could use a std::list, so you won't have to copy any data.
Every 5 frames, call myList.pop_front() to drop the oldest frame and myList.push_back() to add the newest.
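A minimal sketch of that idea, assuming the Player struct from the question (the names below are made up):

#include <list>
#include <cstddef>

std::list<Player> history;            // oldest sample at the front, newest at the back
const std::size_t kMaxSamples = 60;   // 60 samples * 5 frames = ~5 seconds at 60 fps
unsigned long frameCounter = 0;

void recordFrame(const Player& current)
{
    if (++frameCounter % 5 != 0)
        return;                       // only keep every 5th frame
    history.push_back(current);
    if (history.size() > kMaxSamples)
        history.pop_front();          // drop the oldest sample; nothing else is copied
}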

I don't think the storage requirement is THAT harsh - even with 300 frames this small object will take up much less memory than your average texture.
I suggest you avoid using a raw array and look at using a std::vector, which is almost as efficient and will automatically resize as you need more buffer space (that way, if you suddenly need 8 seconds or the fps goes up to 100, you aren't going to suddenly run out of buffer space). This also resolves your difficulty with unfilled 'slots', as the vector has a known size that you can efficiently check.
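A tiny sketch of that, with hypothetical names (the size() check is what replaces the NULL/empty-slot test asked about in the question):

#include <vector>
#include <cstddef>

std::vector<Player> history;   // grows as frames are recorded

void recordFrame(const Player& current)
{
    history.push_back(current);
    // optionally trim the front here if you only ever want the last 5 seconds
}

bool tryGetFrame(std::size_t framesBack, Player& out)
{
    if (framesBack >= history.size())
        return false;          // not enough history recorded yet - nothing to roll back to
    out = history[history.size() - 1 - framesBack];
    return true;
}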
I might also suggest that you don't need to store every frame. Games like Prince of Persia that do this trick are, if you watch them carefully, much less smooth when time runs backwards, suggesting that they perhaps only store every few frames, or something like twice a second, as opposed to every frame.

Unless you're running on an MCU, data sizes (given the structures you've provided) will be negligible compared to the rest of your game (~10K, if I calculated it correctly, is nothing for modern PCs).
CPU-wise, moving data in the manner you've specified on every frame MIGHT be sub-optimal (it will move around 10K 60 times per second, or 600K per second, which MIGHT - though probably won't - be noticeable). IF it becomes a concern, I'd go for a circular buffer (for example, as in boost::circular_buffer), or for a std::deque or std::list (with deque probably being my first choice); all of them have O(1) insertion/deletion time, and insertion/deletion is what you need most of the time. Theoretically, there is also the option of using memmove() to speed things up without changing much, but it is quite error-prone and still O(N), so I'd rather not do it.
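For instance, a minimal std::deque version (just a sketch; boost::circular_buffer would look almost identical with a fixed capacity):

#include <deque>
#include <cstddef>

std::deque<Player> history;          // newest at the back, oldest at the front
const std::size_t kMaxFrames = 300;  // ~5 seconds at 60 fps

void recordFrame(const Player& current)
{
    history.push_back(current);      // O(1)
    if (history.size() > kMaxFrames)
        history.pop_front();         // O(1), unlike shifting a whole array
}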

Indeed, you could do fewer operations per frame to store the state.
You could use a std::vector: each frame, push_back() the new state and, if (vector.size() > 300), do a pop_front().
If you think that's not enough, just save less frequently (say, every half second). When you roll back, you could interpolate between the saved values.
Edit:
You're damn right, Othman, vector doesn't have pop_front, so you can use vector.erase(vector.begin()) instead :) That way you don't have to use linked lists.

Related

(C++) Are there any techniques using memory rearrangement to increase the cache hit rate?

I want better performance by increasing the cache hit rate.
I know that looping from 0 to n over something like a vector is the best way to reduce cache misses, but in my application the access order of the elements changes every time, so I can't use a fixed order. However, it is not totally random: I can assume that between any two consecutive rounds of access (each round eventually visits every element), only a small part of the order is different. I think I can use that feature to reduce cache misses, but I don't know whether similar techniques exist or whether it is just worthless.
Situation Example:
struct Data {
    // ...
};

std::vector<Data> dataList;

void Assignment(std::vector<int> dispatch_order)
{
    for (int idx : dispatch_order) { /* do something with dataList[idx] ... */ }
}
Calls to Assignment will look like this (executing hill climbing or simulated annealing):
Assignment({0,1,2,3,4});//move 4 to [1]
Assignment({0,4,1,2,3});//move {2,3} to [0]
Assignment({2,3,0,4,1});//move 0 to last
Assignment({2,3,4,1,0});
(So I can say that most parts are similar between any two consecutive calls to Assignment.)
dataList is not modified during the entire computation.
dataList has no more than 10,000 elements (2,000 on average?), but Assignment is repeated over 1,000,000 times.
If I could rearrange the data locations in dataList so that for (int idx : dispatch_order) accesses the Data elements in almost increasing memory order, performance should improve. But rearranging has its own cost; is it really a good idea?
Or I could keep a duplicate of dataList in a different order, but that needs extra memory, which might itself lower the hit rate, and I would still need to update the order.
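As a rough sketch of the "rearrange" option being weighed here (not an answer from the thread; permuteToOrder is a made-up helper), one O(N) pass builds a copy laid out in the current dispatch order, so the following rounds, which differ only slightly, access memory almost sequentially:

#include <vector>

std::vector<Data> permuteToOrder(const std::vector<Data>& dataList,
                                 const std::vector<int>& dispatch_order)
{
    std::vector<Data> reordered;
    reordered.reserve(dispatch_order.size());
    for (int idx : dispatch_order)
        reordered.push_back(dataList[idx]);   // one O(N) copy per rearrangement
    return reordered;
}

Whether that copy pays off depends on how many rounds reuse the order before it drifts too far; the index remapping that later rounds would need is omitted here.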

c++ Algorithm: searching for a subset of a 'list', non-ordered

UPDATE: My bad, this was not the cause of the double slowdown. I had other bugs.
C++ MFC. Visual Studio 12.
I'm trying to optimize performance within a draw loop. I have a list of all my objects (ListAll); let's say it has 300 objects, all with unique IDs. I have a second list (ListNow) of the IDs which need to be rendered, of size 100. All the values in ListNow have associated objects stored in ListAll.
Currently, ListAll is a CMap<UINT, UINT, Object*, Object*>, and ListNow is a CArray<UINT, UINT>.
// this is the slower, current method
Object* object = NULL;
for (int i = 0; i < ListNow.GetSize(); i++)
{
    UINT id = ListNow.GetAt(i);
    if (ListAll->Lookup(id, object))
    {
        object->Draw();
    }
}
In the past I only had ListAll (the CMap), and I called Draw() on every object in it. It only had the 100 I wanted to draw, and I 'rebuilt' it every time I switched what was being drawn.
// this is the faster, old method
UINT id;
Object* object;
POSITION pos = ListAll->GetStartPosition();
while (pos)
{
    ListAll->GetNextAssoc(pos, id, object);
    object->Draw();
}
Technically both algorithms perform at O(n) speed, but simply adding the CMap::Lookup call to the loop has doubled the time it takes. I have properly set my CMap hash size to a prime number larger than the number of objects in the CMap. The slowdown is blatant with lists of 300,000 elements and above.
I switched to this system so that I could store all the objects in the draw lists and quickly swap what is being drawn between different windows using the same object lists. This speeds up switching drastically but has slowed down each individual draw call. Switching back now is not an option; we knew it would slow down each draw call a bit, but not this much. The slowdown is definitely in the code I show you, because when I switch back to drawing everything (removing the Lookup), it cuts the time in half.
My only idea to increase performance is to record the last-drawn object pointers in a list, and tell the function whether it needs to change (call Lookup()) or can simply re-use the last drawn (GetNext()), since 90% of the time nothing has changed between calls.
Does anyone have a faster solution than this? I'm dreaming of a tricky bit-masking solution that somehow produces the object pointers I want, I don't know. Anything would help at this point.
It appears that your problem will be solved if you store your Object pointers, instead of their IDs, in your ListNow.
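A minimal sketch of that suggestion, using the same MFC types as the question (ListNow becomes a list of pointers that is rebuilt only when the visible set changes, so the per-frame loop needs no Lookup at all):

CArray<Object*, Object*> ListNow;   // now stores pointers instead of IDs

// Rebuild only when the set of visible objects changes.
void RebuildListNow(CMap<UINT, UINT, Object*, Object*>& ListAll,
                    const CArray<UINT, UINT>& visibleIds)
{
    ListNow.RemoveAll();
    Object* object = NULL;
    for (int i = 0; i < visibleIds.GetSize(); i++)
    {
        if (ListAll.Lookup(visibleIds.GetAt(i), object))
            ListNow.Add(object);
    }
}

// Per-frame draw loop.
void DrawVisible()
{
    for (int i = 0; i < ListNow.GetSize(); i++)
        ListNow.GetAt(i)->Draw();
}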

What is the most efficient (yet sufficiently flexible) way to store multi-dimensional variable-length data?

I would like to know the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable length. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important), or in your spectral element code where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you generally can't do much of the hard work on an element or block until a substantial amount of that block's data is already in cache. You will already have had to flush most of block N's data out of cache before getting much work done on block N+1 anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - e.g. elementwise operations on cell values.) So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want each block/element to be itself contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only will all those memory copies wipe your cache, the copies themselves get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1 GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's 0.2 GB (read + write) / ~20 GB/s ~ 20 ms, compared to reloading (say) 16k cache lines at ~100 ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to refine/derefine very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice; the first, with the elements in order and computing on them 1-2, and the second, doing the operation in the "wrong" order, 2-1, and timing the two computations several times.
For each cell, store the offset at which to find that cell's data in one contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is also much better for external analysis, because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be performed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
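A minimal sketch of such an offset mapping, with made-up names, for the 2D case described in the question (N_k points per dimension and nVars values per point):

#include <cstddef>
#include <vector>

struct Grid {
    int nVars;                       // values stored per data point (fixed at runtime)
    std::vector<std::size_t> offset; // offset[k] = start of cell k's data; size K + 1
    std::vector<double> values;      // all solution values in one contiguous block

    // (Re)build the layout whenever the cell count or the per-cell orders N_k change.
    void allocate(const std::vector<int>& N, int nVars_)
    {
        nVars = nVars_;
        offset.assign(N.size() + 1, 0);
        for (std::size_t k = 0; k < N.size(); ++k)
            offset[k + 1] = offset[k] + static_cast<std::size_t>(N[k]) * N[k] * nVars;
        values.assign(offset.back(), 0.0);
    }

    // Pointer to the nVars values at data point i of cell k (i in [0, N_k^2)).
    double* point(std::size_t k, std::size_t i)
    {
        return &values[offset[k] + i * nVars];
    }
};

On refinement or coarsening, one would allocate a new Grid with the new N array, interpolate the old values into it, and discard the old arrays, as described above.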

Threading access to various buffers

I'm trying to figure out the best way to do this, but I'm getting a bit stuck in figuring out exactly what it is that I'm trying to do, so I'm going to explain what it is, what I'm thinking I want to do, and where I'm getting stuck.
I am working on a program that has a single array (an image, really) onto which, each frame, a large number of objects can be placed. Each object is completely independent of all the other objects; the only dependency is the output, since in theory two of these objects can be placed at the same location in the array. I'm trying to increase the efficiency of placing the objects on the image, so that I can place more objects. In order to do that, I want to thread the problem.
The first step that I have taken towards threading it involves simply mutex-protecting the array. All operations which place an object on the array call the same function, so I only have to put the mutex lock in one place. So far it is working, but I am not seeing the improvement that I would hope for. I am hypothesizing that this is because, most of the time, the limiting factor is the image write statement.
What I'm thinking I need to do next is to have multiple image buffers to write to, and to combine them when all of the operations are done. I should say that obscuration is not a problem; all that needs to be done is to simply add the pixel counts together. However, I'm struggling to figure out what mechanism I need to use in order to do this. I have looked at semaphores, but while I can see that they would limit the number of buffers, I can envision a situation in which two or more threads would be trying to write to the same buffer at the same time, potentially leading to inaccuracies.
I need a solution that does not involve any new non-standard libraries. I am more than willing to build the solution, but I would very much appreciate a few pointers in the right direction, as I'm currently just wandering around in the dark...
To help visualize this, imagine that I am told to place, say, balls at various locations on the image array. I am told to place the balls each frame, with a given brightness, location, and size. The exact location of the balls is dependent on the physics from the previous frame. All of the balls must be placed on a final image array, as quickly as they possibly can be. For the purpose of this example, if two balls are on top of each other, the brightness can simply be added together, thus there is no need to figure out if one is blocking the other. Also, no using GPU cards;-)
Pseudo-code would look like this (assuming that some logical object is given for location, brightness, and size; also assume that isValidPoint simply determines whether the point should be on the circle, given the circle's location and radius):
global output_array[x_arrLimit * y_arrLimit]

void update_ball(int ball_num)
{
    calc_ball_location(ball_num, *location, *brightness, *size); // location, brightness, size all set inside function
    place_ball(location, brightness, size)
}

void place_ball(location, brightness, size)
{
    get_bounds(location, size, *xlims, *ylims)
    for (int x = xlims.min; x < xlims.max; x++)
    {
        for (int y = ylims.min; y < ylims.max; y++)
        {
            if (isValidPoint(location, size, x, y))
            {
                output_array(x, y) += brightness;
            }
        }
    }
}
The reason you're not seeing any speed up with the current design is that, with a single mutex for the entire buffer, you might as well not bother with threading, as all the objects have to be added serially anyway (unless there's significant processing being done to determine what to add, but it doesn't sound like that's the case). Depending on what it takes to "add an object to the buffer" (do you use scan-line algorithms, flood fill, or something else), you might consider having one mutex per row or a range of rows, or divide the image into rectangular tiles with one mutex per region or something. That would allow multiple threads to add to the image at the same time as long as they're not trying to update the same regions.
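As a rough sketch of the per-row idea (hypothetical names; brightness accumulation as in the question's pseudo-code), threads then only contend when they touch the same rows:

#include <mutex>
#include <vector>

struct SharedImage {
    int width, height;
    std::vector<float> pixels;        // row-major, width * height
    std::vector<std::mutex> rowLocks; // one lock per row

    SharedImage(int w, int h) : width(w), height(h), pixels(w * h, 0.0f), rowLocks(h) {}

    void addBrightness(int xMin, int xMax, int yMin, int yMax, float brightness)
    {
        for (int y = yMin; y < yMax; ++y) {
            std::lock_guard<std::mutex> lock(rowLocks[y]); // lock only this row
            for (int x = xMin; x < xMax; ++x)
                pixels[y * width + x] += brightness;
        }
    }
};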
OK, you have an image member in some object. Add the, no doubt complex, code to add other images/objects to it, manipulate it, whatever. Aggregate in all the other objects that may be involved, add a command enum to tell threads which operation to perform, and an 'OnCompletion' event to call when done.
Queue it to a pool of threads hanging on the end of a producer-consumer queue. Some thread will get the object pointer, perform the operation on the image/set and then call the event (passing the completed object pointer as a parameter). In the event handler, you can do what you like, according to the needs of your app. Maybe you will add the processed images into a (thread-safe!) vector or other container, or queue them off to some other thread - whatever.
If the order of processing the images must be preserved (e.g. a video stream), you could add an incrementing sequence number to each object submitted to the pool, enabling your 'OnCompletion' handler to queue up 'later' images until all earlier ones have come in.
Since no two threads ever work on the same image, you need no locking while processing. The only locks you should (or may) need are those internal to the queues, and they only lock for the time taken to push/pop object pointers to/from the queue, so contention will be very rare.
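A bare-bones sketch of such a producer-consumer queue using only the standard library (made-up names; each job owns its own image, so workers need no locking while processing, and only the queue itself is locked briefly on push/pop):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

class WorkQueue {
    std::queue<std::function<void()> > jobs;
    std::mutex m;
    std::condition_variable cv;
    bool done;
public:
    WorkQueue() : done(false) {}

    void push(std::function<void()> job)
    {
        { std::lock_guard<std::mutex> lock(m); jobs.push(job); }
        cv.notify_one();
    }

    void shutdown()
    {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
    }

    // Each pool thread runs this loop.
    void workerLoop()
    {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !jobs.empty(); });
                if (jobs.empty()) return;   // shut down and drained
                job = jobs.front();
                jobs.pop();
            }
            job();  // e.g. render one object into its own buffer, then fire OnCompletion
        }
    }
};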

Better way to copy several std::vectors into 1? (multithreading)

Here is what I'm doing:
I'm taking in bezier points and running bezier interpolation, then storing the result in a std::vector<std::vector<POINT> >.
The bezier calculation was slowing me down so this is what I did.
I start with a std::vector<USERPOINT> which is a struct with a point and 2 other points for bezier handles.
I divide these up into ~4 groups and assign each thread 1/4 of the work. To do this I created 4 std::vector<std::vector<POINT> > to store the results from each thread. In the end all the points have to be in one contiguous vector; before I used multithreading I accessed this directly, but now I reserve the size of the 4 vectors produced by the threads and insert them into the original vector, in the correct order. This works, but unfortunately the copy part is very slow and makes it slower than without multithreading. So now my new bottleneck is copying the results into the vector. How could I do this more efficiently?
Thanks
Have all the threads put their results into a single contiguous vector just like before. You have to ensure each thread only accesses parts of the vector that are separate from the others. As long as that's the case (which it should be regardless -- you don't want to generate the same output twice) each is still working with memory that's separate from the others, and you don't need any locking (etc.) for things to work. You do, however, need/want to ensure that the vector for the result has the correct size for all the results first -- multiple threads trying (for example) to call resize() or push_back() on the vector will wreak havoc in a hurry (not to mention causing copying, which you clearly want to avoid here).
Edit: As Billy O'Neal pointed out, the usual way to do this would be to pass a pointer to each part of the vector where each thread will deposit its output. For the sake of argument, let's assume we're using the std::vector<std::vector<POINT> > mentioned as the original version of things. For the moment, I'm going to skip over the details of creating the threads (especially since it varies across systems). For simplicity, I'm also assuming that the number of curves to be generated is an exact multiple of the number of threads -- in reality, the curves won't divide up exactly evenly, so you'll have to "fudge" the count for one thread, but that's really unrelated to the question at hand.
std::vector<USERPOINT> inputs;             // input data
std::vector<std::vector<POINT> > outputs;  // space for output data

const int thread_count = 4;

struct work_packet {             // describes the work for one thread
    USERPOINT *inputs;           // where to get its input
    std::vector<POINT> *outputs; // where to put its output
    int num_points;              // how many points to process
    HANDLE finished;             // signalled when it's done
};

std::vector<work_packet> packets(thread_count); // storage for the packets
std::vector<HANDLE> events(thread_count);       // parent's handles to the events

outputs.resize(inputs.size());   // can't resize output after processing starts

for (int i = 0; i < thread_count; i++) {
    int offset = i * inputs.size() / thread_count;
    packets[i].inputs = &inputs[0] + offset;
    packets[i].outputs = &outputs[0] + offset;
    packets[i].num_points = inputs.size() / thread_count;
    events[i] = packets[i].finished = CreateEvent(NULL, FALSE, FALSE, NULL);
    threads[i].process(&packets[i]);   // thread creation details omitted, as above
}

// wait for all curves to be generated (Win32 style, for the moment)
WaitForMultipleObjects(thread_count, &events[0], TRUE, INFINITE);
Note that although we have to be sure that the outputs vector doesn't get resized while being operated on by multiple threads, the individual vectors of points in outputs can be, because each will only ever be touched by one thread at a time.
If the simple copy in between is slower than before you started using multithreading, it's entirely likely that what you're doing simply isn't going to scale to multiple cores. If it's something simple like Bezier stuff, I suspect that will be the case.
Remember that the overhead of making the threads and such has an impact on total run time.
Finally.. for the copy, what are you using? Is it std::copy?
Multithreading by itself is not going to speed up your process; processing the data on different cores could.