A clean way to render things - C++

I don't really have any problems with the way I'm rendering now, but I don't feel like it's a very good way of handling rendering. I'm using SDL.
It boils down to this: I have an abstract class
class Renderable
With two functions.
virtual void update() = 0;
virtual void doRender(SDL_Surface* surface) = 0;
I have another class
class RenderManager
With 1 std::vector
std::vector<Renderable*> _world;
and 2 std::queue
std::queue<Renderable*> _addQueue;
std::queue<Renderable*> _delQueue;
The two queues hold the renderables that need to be added in the next tick and the ones that need to be removed. Doing everything in one shot gave me problems, and now that I think about it, that makes sense (at least the way I did it).
Renderables can add and remove themselves from the RenderManager statically.
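Something like this, roughly (simplified; the real code has more to it):
class RenderManager {
public:
    // Renderables call these from anywhere; the actual add/remove is
    // deferred to the queues and applied on the next tick.
    static void add(Renderable* r)    { _instance->_addQueue.push(r); }
    static void remove(Renderable* r) { _instance->_delQueue.push(r); }
private:
    static RenderManager* _instance;
};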
Here's more or less the function handling everything.
void renderAll() {
    // 1) Update and draw everything currently in the world.
    std::vector<Renderable*>::iterator it, end;
    for (it = _world.begin(), end = _world.end(); it != end; ++it) {
        (*it)->update();
        (*it)->doRender(_mainWindow); // _mainWindow is the screen of course
    }
    // 2) Process pending removals. pop() shrinks the queue, so cache the
    //    count up front instead of re-reading size() in the loop condition.
    if (_delQueue.size() > 0) {
        const std::size_t delCount = _delQueue.size();
        for (std::size_t i = 0; i < delCount; i++) {
            std::vector<Renderable*>::iterator del =
                std::find(_world.begin(), _world.end(), _delQueue.front());
            if (del != _world.end()) {
                delete *del;
                _world.erase(del); // invalidates iterators; re-fetch next pass
            }
            _delQueue.pop();
        }
    }
    // 3) Process pending additions.
    if (_addQueue.size() > 0) {
        const std::size_t addCount = _addQueue.size();
        for (std::size_t i = 0; i < addCount; i++) {
            Renderable* front = _addQueue.front();
            // _placement is a property of Renderable calculated by RenderManager
            // where they can choose the level they want to be rendered on.
            _world.insert(_world.begin() + front->_placement, front);
            _addQueue.pop();
        }
    }
}
I'm kinda sorta newish to C++, but I think I know my way around it on an average scale at least. I'm even newer to SDL, but it seems pretty simple and easy to learn. I'm concerned because I have 3 big loops together. I tried one-shotting it, but I was having problems with _world resizing during the loop, causing massive amounts of destruction. But I'm not claiming I did it right! :)
I was thinking maybe something involving threads?
EDIT:
Ahh, sorry for the ambiguity. By "cleaner" I mean more efficient. Also, there is no "problem" with my approach; I just feel there's a more efficient way.

Firstly, I'd say don't fix something which isn't broken. Are you experiencing performance issues? Unless you're adding and removing 'renderables' in huge quantities every frame, I can't see a huge problem with what you have. Of course, in terms of the overall application it could be a clumsy design, but you haven't stated what sort of application this is for, so it's hard, if not impossible, to judge.
However, I can guess and say that because you're using SDL, there's a chance you're developing a game. Personally I've always rendered game objects by having a render method for each active object and using an object manager to cycle through pointers to each object every tick and call this render method.
Because constantly removing an item from the middle of a vector can cause slowdowns due to internal copying of memory (vectors guarantee contiguous memory), you could have a flag in every object which is set when it is meant to be removed, and periodically the object manager performs 'garbage collection', removing all flagged objects at the same time, hence reducing the amount of copying that needs to be done. In the meantime, before garbage collection occurs, the manager simply ignores the flagged object, not calling its render method each tick; it's as if it has gone.
It's actually not too dissimilar to what you have here with your queue system; in fact, if game objects are derived from your 'renderable' class, it could be deemed the same.
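To make that concrete, here is a minimal sketch of the flag-and-sweep idea, assuming your Renderable gains a dead flag with an isDead() accessor (the names are mine, adapt to taste):
void RenderManager::renderAll() {
    for (std::size_t i = 0; i < _world.size(); ++i) {
        if (!_world[i]->isDead()) {   // flagged objects are simply skipped
            _world[i]->update();
            _world[i]->doRender(_mainWindow);
        }
    }
}

void RenderManager::collectGarbage() { // call periodically, not every tick
    std::vector<Renderable*> alive;
    alive.reserve(_world.size());
    for (std::size_t i = 0; i < _world.size(); ++i) {
        if (_world[i]->isDead())
            delete _world[i];         // reclaim all flagged objects in one pass
        else
            alive.push_back(_world[i]);
    }
    _world.swap(alive);               // one bulk replacement instead of many erases
}
A full sweep like this also preserves relative order, so your _placement-based layering survives the collection.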
By the way, is there any reason you're querying the queue sizes before accessing their elements? If size() is 0, the for loops won't operate anyway.

Related

Am I fragmenting the heap? And how to avoid it, or otherwise do better

I'm working in C++ for Arduino running on an ESP8266.
I have a long lived object which maintains two collections of sub-objects (two different types).
These two collections will be on the order of ~5-20 elements each by the end of the long-lived object's life, although the upper bound is not fixed. Realistically it won't be much more than 20 if it ever gets that far; it won't be 100, for example.
My code for maintaining each collection looks like this (only showing one here for brevity):
void MyLongLivedObject::addSubObject(SubObject* obj)
{
    SubObject** newColl = new SubObject*[this->collectionCount + 1];
    for (uint8_t i = 0; i < this->collectionCount; i++)
        newColl[i] = this->collection[i];
    delete[] this->collection; // array new requires a matching array delete[]
    this->collection = newColl;
    this->collection[this->collectionCount++] = obj;
}
Is my understanding right that as each of the two collections is expanded by adding objects, it is moved to the end of the heap memory, leaving a hole? So if I were to alternate adding objects to each collection, they would basically walk their way down the heap?
And then maybe at some point the hole at the start of the heap would be large enough to accommodate one of the collections and it would get moved back there.
I know the question of "is this a good idea?" is subjective and will depend on many things. To avoid that, can I instead ask for input on how to answer this question myself? What things should I be considering, and how do I determine whether this is acceptable or not? What questions do I need to answer to be comfortable with this implementation (i.e. to not horribly fragment the heap and cause related woe)?
Alternatively - if this whole approach is rubbish, what should I be doing instead?
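One alternative I've considered is growing the array capacity geometrically instead of by one element, so the buffer is reallocated O(log n) times rather than on every add. Untested sketch; collectionCapacity would be a new member alongside collectionCount:
void MyLongLivedObject::addSubObject(SubObject* obj)
{
    if (this->collectionCount == this->collectionCapacity) {
        // Double the capacity (starting at 4) so reallocations become rare.
        uint8_t newCap = this->collectionCapacity ? this->collectionCapacity * 2 : 4;
        SubObject** newColl = new SubObject*[newCap];
        for (uint8_t i = 0; i < this->collectionCount; i++)
            newColl[i] = this->collection[i];
        delete[] this->collection;
        this->collection = newColl;
        this->collectionCapacity = newCap;
    }
    this->collection[this->collectionCount++] = obj;
}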

Is this request frequency limiter thread safe?

To prevent excessive server pressure, I implemented a request frequency limiter using a sliding window algorithm, which decides whether the current request is allowed to pass according to its parameters. To make the algorithm thread-safe, I used an atomic type to control the number of sliding steps of the window, and a unique_lock to correctly sum the total number of requests in the current window.
But I'm not sure whether my implementation is thread-safe, and if it is safe, whether it will affect service performance. Is there a better way to achieve it?
class SlideWindowLimiter
{
public:
    bool TryAcquire();
    void SlideWindow(int64_t window_number);
private:
    int32_t limit_;                      // maximum number of window requests
    int32_t split_num_;                  // number of subwindows
    int32_t window_size_;                // the big window
    int32_t sub_window_size_;            // size of subwindow = window_size_ / split_num_
    int16_t index_{0};                   // the index into the window vector
    std::mutex mtx_;
    std::vector<int32_t> sub_windows_;   // window vector
    std::atomic<int64_t> start_time_{0}; // start time of limiter
};
bool SlideWindowLimiter::TryAcquire() {
    int64_t cur_time = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
    auto time_stamp = start_time_.load();
    int64_t window_num = std::max(cur_time - window_size_ - start_time_, int64_t(0)) / sub_window_size_;
    std::unique_lock<std::mutex> guard(mtx_, std::defer_lock);
    if (window_num > 0 && start_time_.compare_exchange_strong(time_stamp, start_time_.load() + window_num*sub_window_size_)) {
        guard.lock();
        SlideWindow(window_num);
        guard.unlock();
    }
    monitor_->TotalRequestQps();
    {
        guard.lock();
        int32_t total_req = 0;
        std::cout<<" "<<std::endl;
        for(auto &p : sub_windows_) {
            std::cout<<p<<" "<<std::this_thread::get_id()<<std::endl;
            total_req += p;
        }
        if(total_req >= limit_) {
            monitor_->RejectedRequestQps();
            return false;
        } else {
            monitor_->PassedRequestQps();
            sub_windows_[index_] += 1;
            return true;
        }
        guard.unlock();
    }
}
void SlideWindowLimiter::SlideWindow(int64_t window_num) {
    int64_t slide_num = std::min(window_num, int64_t(split_num_));
    for(int i = 0; i < slide_num; i++){
        index_ += 1;
        index_ = index_ % split_num_;
        sub_windows_[index_] = 0;
    }
}
First of all, thread-safe is a relative property. Two sequences of operations are thread-safe relative to each other. A single bit of code cannot be thread-safe by itself.
I'll instead answer "am I handling threading in such a way that reasonable thread-safety guarantees could be made with other reasonable code".
The answer is "No".
I found one concrete problem: your use of atomic and compare_exchange_strong isn't in a loop, and you access start_time_ at multiple spots without the proper care. If start_time_ changes between the three places where you read and write it, the compare_exchange_strong returns false, SlideWindow never runs, and the function then... proceeds as if it had.
I can't think of why that would be a reasonable response to contention, so that is a "No, this code isn't written to behave reasonably under multiple threads using it".
There is a lot of bad smell in your code. You are mixing concurrency code with a whole pile of state, which means it isn't clear which mutexes are guarding which data.
You have a pointer in your code that is never defined. Maybe it is supposed to be a global variable?
You are writing to cout using multiple << on one line. That is a bad plan in a multithreaded environment; even if your cout is concurrency-hardened, you'll get scrambled output. Build a buffer string and do one <<.
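For example (a small sketch; needs <sstream>):
// Build the whole line in a local buffer, then write it with a single <<,
// so concurrent threads can't interleave within one line.
std::ostringstream line;
line << p << " " << std::this_thread::get_id() << "\n";
std::cout << line.str();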
You are passing data between functions via the back door: index_, for example. One function sets a member variable, another reads it. Is there any possibility it gets edited by another thread? Hard to audit, but it seems reasonably likely; you set it under one .lock(), then .unlock(), then read it as if it were in a sensible state under a later .lock(). What's more, you use it to index a vector; if the vector or index changed in unplanned ways, that could crash or lead to memory corruption.
...
I would be shocked if this code didn't have a pile of race conditions, crashes and the like in production. I see no sign of any attempt to prove that this code is concurrency safe, or simplify it to the point where it is easy to sketch such a proof.
In actual practice, any code that you haven't proven is concurrency-safe is going to be unsafe to use concurrently. So complex concurrency code is almost guaranteed to be unsafe to use concurrently.
...
Start with a really, really simple model. If you have a mutex and some data, make that mutex and the data into a single struct, so you know exactly what that mutex is guarding.
If you are messing with an atomic, don't use it in the middle of other code mixed up with other variables. Put it in its own class. Give that class a name, representing some concrete semantics, ideally ones that you have found elsewhere. Describe what it is supposed to do, and what the methods guarantee before and after. Then use that.
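For example, here is a minimal sketch of both suggestions. The names are invented, and the exact semantics are up to you:
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// The mutex lives next to the exact data it guards, and nothing else.
struct GuardedWindows {
    std::mutex mtx;
    std::vector<int32_t> sub_windows;
    int16_t index = 0;
};

// The atomic gets its own class with one concrete, documented meaning.
class AtomicWindowStart {
public:
    // Advance the window start by 'delta' if it is still 'expected'.
    // Returns true iff this thread won the race to slide the window.
    bool TryAdvance(int64_t expected, int64_t delta) {
        return start_.compare_exchange_strong(expected, expected + delta);
    }
    int64_t Load() const { return start_.load(); }
private:
    std::atomic<int64_t> start_{0};
};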
Elsewhere, avoid any kind of global state. This includes class member variables used to pass state around. Pass your data explicitly from one function to another. Avoid pointers to anything mutable.
If your data is all value types in automatic storage and pointers to immutable (never changing in the lifetime of your threads) data, that data can't be directly involved in a race condition.
The remaining data is bundled up and firewalled into as small a spot as possible, and you can look at how you interact with it and determine if you are messing up.
...
Multithreaded programming is hard, especially in an environment with mutable data. If you aren't working to make it possible to prove your code is correct, you are going to produce code that isn't correct, and you won't know it.
Well, based on my experience, I know it: all code that isn't obviously trying to act in such a way that it is easy to show it is correct is simply incorrect. If the code is old and has piles of patches over a decade+, the incorrectness probably shows up rarely and is harder to find, but it is probably still there. If it is new code, the incorrectness is probably easier to find.

C++ speed up method call

I am working on a very time-consuming application and I want to speed it up a little. I analyzed the runtime of individual parts using the clock() function of the ctime library and found something which is not totally clear to me.
I have time prints outside and inside of a method; let's call it Method1. The print inside Method1 covers the whole body of it, only the return of a float is excluded, of course. Well, the thing is that the print outside reports two to three times the time of the print inside Method1. It's obvious that the print outside should state more time, but the difference seems quite big to me.
My method looks as follows; I am using references and pointers as parameters to prevent copying of data. Note that the data vector contains 330,000 pointers to instances.
float ClassA::Method1(vector<DataClass*>& data, TreeClass* node)
{
    //start time measurement
    vector<Mat> offset_vec_1 = vector<Mat>();
    vector<Mat> offset_vec_2 = vector<Mat>();
    for (int i = 0; i < data.size(); i++)
    {
        DataClass* cur_data = data.at(i);
        Mat offset1 = Mat();
        Mat offset2 = Mat();
        getChildParentOffsets(cur_data, node, offset1, offset2);
        offset_vec_1.push_back(offset1);
        offset_vec_2.push_back(offset2);
    }
    float ret = CalculateCovarReturnTrace(offset_vec_1) + CalculateCovarReturnTrace(offset_vec_2);
    //end time measurement
    return ret;
}
Is there any "obvious" way to increase the call speed? I would prefer to keep the method for readability reasons, thus, can I change anything to gain a speed up?
I am appreciating any suggestions!
Based on your updated code, the only code between the end-time measurement inside the function and the measurement after the function call is the destructors for objects constructed in the function; that being the two vectors of 330,000 Mats each. Destroying those will likely take some time, depending on the resources used by each of those Mats.
Without trying to lay claim to any of the comments made by others to the OP ...
(1) The short answer might well be, "no." This function appears to be quite clear, and it's doing a lot of work 330,000 times. Then, it's doing a calculation over "all that data."
(2) Consider re-using the "offset1" and "offset2" matrices, instead of creating entirely new ones for each iteration (see the sketch after this answer). It remains to be seen, of course, whether this would actually be faster. (And in any case, see below: it amounts to "diddling the code.")
(3) Therefore, borrowing from The Elements of Programming Style: "Don't 'diddle' code to make it faster: find a better algorithm." And in this case, there just might not be one. You might need to address the runtime issue by "throwing silicon at it," and I'd suggest that the first thing to do would be to add as much RAM as possible to this computer. A process that "deals with a lot of data" is very exposed to virtual-memory page faults, each of which requires on the order of milliseconds to resolve. (Those thousandths of a second add up real fast.)
I personally do not see anything categorically wrong with this code, nor anything that would categorically cause it to run faster. Nor would I advocate re-writing ("diddling") the code away from the very clear expression of it that you have right now.
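That said, if you do want to try point (2), here's a hedged sketch. One caveat lives in the comments: with a reference-counted, shared-header matrix type (cv::Mat, for instance), reusing a Mat that is written in place would make every pushed element alias one buffer, so this only works if getChildParentOffsets assigns fresh data to its outputs:
float ClassA::Method1(vector<DataClass*>& data, TreeClass* node)
{
    vector<Mat> offset_vec_1;
    vector<Mat> offset_vec_2;
    offset_vec_1.reserve(data.size()); // avoid repeated reallocation/copying
    offset_vec_2.reserve(data.size());
    Mat offset1, offset2;              // constructed once, reused each pass
    for (size_t i = 0; i < data.size(); i++)
    {
        // Assumes getChildParentOffsets assigns fresh data to its outputs
        // rather than writing into their existing storage.
        getChildParentOffsets(data[i], node, offset1, offset2);
        offset_vec_1.push_back(offset1);
        offset_vec_2.push_back(offset2);
    }
    return CalculateCovarReturnTrace(offset_vec_1) + CalculateCovarReturnTrace(offset_vec_2);
}
Whether any of this is actually faster is exactly the kind of thing to measure, per point (3).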

Function pointers and their called parameters

This may be the wrong approach, but as I dig deeper into developing my game engine I have run into a timing issue.
So let's say I have a set of statements like:
for(int i = 0; i < 400; i++)
{
    engine->get2DObject("bullet")->Move(1, 0);
}
The bullet will move, however, on the screen there won't be any animation. It will basically "warp" from one part of the screen to the next.
So, I am thinking: create a vector of function pointers for each base object (which the user inherits from), and when they call "Move" I don't actually move the object until the next iteration of the game loop.
So something like:
while(game.running())
{
    game.go();
}
go()
{
    for(...)
        2dobjs.at(i)->action.pop_back();
}
Or something like that, and this way it runs only a single action of each object during each iteration (of course I'll add in a check to see if there are actually "actions" to be run).
And if that is a good idea, my question is how do I store the parameters. Since each object can perform more than one type of "action" besides move (rotateby is one example), I think it would be nonsensical to create a struct along the lines of:
struct MoveAction {
    MovePTR...;
    int para1;
    int para2;
};
Thoughts? Or is this the completely wrong direction?
Thoughts? Or is this the completely wrong direction?
It's the wrong direction.
Actually, the idea of being able to queue actions or orders has some merit. But usually this applies to things like AI and to more general tasks (like "proceed to location X", "attack anything you see", etc). So it usually doesn't cover the frame-by-frame movement issue that you're dealing with here.
For very simple entities like bullets, queuing tasks is rather overkill. After all, what else is a bullet going to do except move forward each time? It's also an awkward way to implement such simple motion. It brings up issues like: why only 400 steps forward? What if the area in front is longer? What if it's shorter? What if some other object gets in the way after only, say, 50 steps? (Then 350 queued move actions were a waste.)
Another problem with this idea (at least in the simplistic way you've presented it) is that your bullets are going to move a fixed amount per iteration of your game loop. But not all iterations take the same amount of time. Some people's computers will obviously be able to run the code faster than others. And even disregarding that, sometimes the game loop may be doing considerably more work than at other times (such as when many more entities are active in the game). So you would really want a way to factor in the time each iteration takes, adjusting the movement so that everything appears to move at a consistent speed to the user, regardless of how fast the game loop is running.
So instead of each movement action being "move X amount" they would be "move at X speed" and would then be multiplied by something like the time elapsed since the last iteration. And then you would really never know how many move actions to queue, because it would depend on how fast your game loop is running in that scenario. But again, this is another indication that queuing movement actions is an awkward solution to frame-by-frame movement.
The way most game engines do it is to simply call some sort of function on the each object each frame which evaluates its movement. Maybe a function like void ApplyMovement(float timeElapsedSinceLastFrame);. In the case of a bullet, it would multiply the speed at which a bullet should move per second by the passed in time since the last frame to determine the amount that the bullet should move this frame. For more complex objects you might want to do math with rotations, accelerations, decelerations, target-seeking, etc. But the general idea is the same: call a function each iteration of the game loop to evaluate what should happen for that iteration.
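As a concrete (hypothetical) illustration of that shape, with invented names:
class Bullet
{
public:
    Bullet(float x, float y, float dirX, float dirY, float speed)
        : x_(x), y_(y), dirX_(dirX), dirY_(dirY), speedPerSecond_(speed) {}

    // Called once per game-loop iteration with the elapsed time in seconds,
    // so movement stays consistent regardless of how fast the loop runs.
    void ApplyMovement(float timeElapsedSinceLastFrame)
    {
        x_ += dirX_ * speedPerSecond_ * timeElapsedSinceLastFrame;
        y_ += dirY_ * speedPerSecond_ * timeElapsedSinceLastFrame;
    }

private:
    float x_, y_;            // position
    float dirX_, dirY_;      // unit direction
    float speedPerSecond_;   // speed in pixels per second
};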
I don't think this is the right direction. After you store 400 moves in a vector of function pointers, you'll still need to pop and perform a move, redraw the screen, and repeat. Isn't it easier just to move(), redraw, and repeat?
I'd say your bullet is warping because it's moving 400 pixels/frame, not because you need to delay the move calculations.
But if this is the correct solution for your architecture, in C++ you should use classes rather than function pointers. For example:
class Action // an abstract interface for Actions
{
public:
    virtual void execute() = 0; // pure virtual function
    virtual ~Action() {}        // virtual destructor, so deleting via Action* is safe
};

class MoveAction : public Action
{
public:
    MoveAction(Point vel) : velocity(vel) {}
    virtual void execute();
    Point velocity;
    ...
};

std::vector<Action*> todo;

gameloop
{
    ...
    todo.push_back(new MoveAction(Point(1, 0)));
    ...
}
So where does 400 come from? Why not just do it this way:
go()
{
    engine->get2DObject("bullet")->Move(1, 0);
}
I think the general approach could be different to get what you want. I personally would prefer a setup where each pass through the go() loop calls functions to set the position of every object, including the bullet, and then draws the scene. That way, each iteration puts the bullet where it should be right then.
If you do decide to go your route, you will likely have to do your struct idea (sketched below). It's not pleasant, but until we get closures, you'll have to make do.
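A rough sketch of what that struct might look like (all names here are invented; a tagged struct lets one queue type cover several actions):
struct QueuedAction {
    enum Kind { Move, RotateBy } kind;
    int para1;  // Move: dx,  RotateBy: degrees
    int para2;  // Move: dy,  RotateBy: unused
};

// Applied once per game-loop iteration per object.
void applyAction(Object2D* obj, const QueuedAction& a)
{
    switch (a.kind) {
        case QueuedAction::Move:     obj->Move(a.para1, a.para2); break;
        case QueuedAction::RotateBy: obj->RotateBy(a.para1);      break;
    }
}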

Lockless Deque in Win32 C++

I'm pretty new to lockless data structures, so for an exercise I wrote (what I hope functions as) a bounded lockless deque (no resizing yet, I just want to get the base cases working). I'd just like to have some confirmation from people who know what they're doing as to whether I've got the right idea and/or how I might improve it.
class LocklessDeque
{
public:
    LocklessDeque() : m_empty(false),
                      m_bottom(0),
                      m_top(0) {}

    ~LocklessDeque()
    {
        // Delete remaining tasks
        for( unsigned i = m_top; i < m_bottom; ++i )
            delete m_tasks[i];
    }

    void PushBottom(ITask* task)
    {
        m_tasks[m_bottom] = task;
        InterlockedIncrement(&m_bottom);
    }

    ITask* PopBottom()
    {
        if( m_bottom - m_top > 0 )
        {
            m_empty = false;
            InterlockedDecrement(&m_bottom);
            return m_tasks[m_bottom];
        }
        m_empty = true;
        return NULL;
    }

    ITask* PopTop()
    {
        if( m_bottom - m_top > 0 )
        {
            m_empty = false;
            InterlockedIncrement(&m_top);
            return m_tasks[m_top];
        }
        m_empty = true;
        return NULL;
    }

    bool IsEmpty() const
    {
        return m_empty;
    }

private:
    ITask* m_tasks[16];
    bool m_empty;
    volatile unsigned m_bottom;
    volatile unsigned m_top;
};
Looking at this I would think this would be a problem:
void PushBottom(ITask* task)
{
    m_tasks[m_bottom] = task;
    InterlockedIncrement(&m_bottom);
}
If this is used in an actual multithreaded environment I would think you'd collide when setting m_tasks[m_bottom]. Think of what would happen if you have two threads trying to do this at the same time - you couldn't be sure of which one actually set m_tasks[m_bottom].
Check out this article which is a reasonable discussion of a lock-free queue.
Your use of the m_bottom and m_top members to index the array is not okay. You can use the return value of InterlockedXxxx() to get a safe index. You'll need to lose IsEmpty(); it can never be accurate in a multi-threading scenario. Same problem with the empty check in PopXxx. I don't see how you could make that work without a mutex.
The key to doing almost impossible stuff like this is to use InterlockedCompareExchange. (This is the name Win32 uses but any multithreaded-capable platform will have an InterlockedCompareExchange equivalent).
The idea is, you make a copy of the structure, which must be small enough to read atomically (64 bits, or, if you can tolerate some unportability, 128 bits on x86).
You make another copy with your proposed update, do your logic and update that copy, then update the "real" structure using InterlockedCompareExchange. What InterlockedCompareExchange does is atomically check that the value is still the value you started with before your state update and, if it is, atomically replace the value with the new state. Generally this is wrapped in an infinite loop that keeps trying until no one else has changed the value in the middle. Here is roughly the pattern:
union State
{
    struct
    {
        short a;
        short b;
    };
    uint32_t volatile rawState;
} state;

void DoSomething()
{
    // Keep looping until nobody else changed it behind our back
    for (;;)
    {
        State origState;
        State newState;

        // It's important that you only read the state once per try
        origState.rawState = state.rawState;
        // This must copy origState, NOT read the state again
        newState.rawState = origState.rawState;

        // Now you can do something impossible to do atomically...
        // This example takes a lot of cycles, there is huge
        // opportunity for another thread to come in and change
        // it during this update
        if (newState.b == 3 || newState.b % 6 != 0)
        {
            newState.a++;
        }

        // Now we atomically update the state,
        // this ONLY changes state.rawState if it's still == origState.rawState
        // In either case, InterlockedCompareExchange returns what value it has now
        if (InterlockedCompareExchange(&state.rawState, newState.rawState,
                                       origState.rawState) == origState.rawState)
            return;
    }
}
(Please forgive if the above code doesn't actually compile - I wrote it off the top of my head)
Great. Now you can make lockless algorithms easy. WRONG! The trouble is that you are severely limited on the amount of data that you can update atomically.
Some lockless algorithms use a technique where they "help" concurrent threads. For example, say you have a linked list that you want to be able to update from multiple threads. Other threads can "help" by performing updates to the "first" and "last" pointers if they are racing through and see that they are at the node pointed to by "last" but the "next" pointer in the node is not null. In this example, upon noticing that the "last" pointer is wrong, they update the last pointer, but only if it still points to the current node, using an interlocked compare exchange.
Don't fall into a trap where you "spin" or loop (like a spinlock). While there is value in spinning briefly because you expect the "other" thread to finish something - they may not. The "other" thread may have been context switched and may not be running anymore. You are just eating CPU time, burning electricity (killing a laptop battery perhaps) by spinning until a condition is true. The moment you begin to spin, you might as well chuck your lockless code and write it with locks. Locks are better than unbounded spinning.
Just to go from hard to ridiculous, consider the mess you can get yourself into with other architectures. Things are generally pretty forgiving on x86/x64, but when you get into other "weakly ordered" architectures, you get into territory where things happen that make no sense: memory updates won't happen in program order, so all your mental reasoning about what the other thread is doing goes out the window. (Even x86/x64 have a memory type called "write combining", often used when updating video memory but usable for any memory buffer hardware, where you need fences.)
Those architectures require you to use "memory fence" operations to guarantee that all reads/writes/both before the fence will be globally visible (by other cores). A write fence guarantees that any writes before the fence will be globally visible before any writes after the fence. A read fence guarantees that no reads after the fence will be speculatively executed before the fence. A read/write fence (aka full fence or memory fence) makes both guarantees. Fences are very expensive. (Some use the term "barrier" instead of "fence".)
My suggestion is to implement it first with locks/condition variables. If you have any trouble getting that working perfectly, it's hopeless to attempt a lockless implementation. And always measure, measure, measure. You'll probably find the performance of the implementation using locks is perfectly fine - without the uncertainty of some flaky lockless implementation with a nasty hang bug that will only show up when you're doing a demo for an important customer. Perhaps you can fix the problem by redefining the original problem into something more easily solved, perhaps by restructuring the work so bigger items (or batches of items) go into the collection, which reduces the pressure on the whole thing.
Writing lockless concurrent algorithms is very difficult (as you've seen written 1000x elsewhere I'm sure). It is often not worth the effort either.
Addressing the problem pointed out by Aaron, I'd do something like:
void PushBottom(ITask *task) {
    // InterlockedIncrement returns the new value, so the slot this call
    // reserved is the value just before the increment.
    int pos = InterlockedIncrement(&m_bottom) - 1;
    m_tasks[pos] = task;
}
Likewise, to pop:
ITask* PopTop() {
    int pos = InterlockedIncrement(&m_top) - 1;
    if (pos >= m_bottom) // This is still subject to a race condition.
        return NULL;
    return m_tasks[pos];
}
I'd eliminate both m_empty and IsEmpty() from the design completely. The result returned by IsEmpty is subject to a race condition, meaning by the time you look at that result, it may well be stale (i.e. what it tells you about the queue may be wrong by the time you act on it). Likewise, m_empty provides nothing but a record of information that's already available without it: a recipe for producing stale data. Using m_empty doesn't guarantee the code can't work right, but it does increase the chances of bugs considerably, IMO.
I'm guessing it's due to the interim nature of the code, but right now you also have some serious problems with the array bounds. You aren't doing anything to force your array indexes to wrap around, so as soon as you try to push the 17th task onto the queue, you're going to have a major problem.
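For instance (a sketch of the bounds fix only; it does nothing for the race conditions), with a power-of-two capacity you can mask the index so every access stays inside the array no matter how large the counters grow:
static const unsigned kCapacity = 16;      // must be a power of two
ITask* m_tasks[kCapacity];

ITask* SlotAt(unsigned pos)
{
    return m_tasks[pos & (kCapacity - 1)]; // same as pos % kCapacity, branch-free
}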
Edit: I should point out that the race condition noted in the comment is quite serious -- and it's not the only one either. Although somewhat better than the original, this should not be mistaken for code that will work correctly.
I'd say that writing correct lock-free code is considerably more difficult than writing correct code that uses locks. I don't know of anybody who has done so without a solid understanding of code that does use locking. Based on the original code, I'd say it would be much better to start by writing and understanding code for a queue that does use locks, and only attempt lock-free code once you've used that to gain a much better understanding of the issues involved.