I have a class that stores the latest value of some incoming realtime data (around 150 million events/second).
Suppose it looks like this:
class DataState
{
Event latest_event;
public:
//pushes event atomically
void push_event(const Event* __restrict__ e);
//pulls event atomically
Event pull_event();
};
I need to be able to push events atomically and pull them with strict ordering guarantees. Now, I know I can use a spinlock, but given the massive event rate (over 100 million/second) and high degree of concurrency I'd prefer to use lockfree operations.
The problem is that Event is 64 bytes in size. There is no CMPXCHG64B instruction on any current X86 CPU (as of August '16). So if I use std::atomic<Event> I'd have to link to libatomic which uses mutexes under the hood (too slow).
So my solution was to instead atomically swap pointers to the value. Problem is dynamic memory allocation becomes a bottleneck with these event rates. So... I define something I call a "ring allocator":
/// @brief Lockfree static short-lived allocator used for a ringbuffer
/// Elements are guaranteed to persist only for "size" calls to get_next()
template<typename T> class RingAllocator {
T *arena;
std::atomic_size_t arena_idx;
const std::size_t arena_size;
public:
/// @brief Creates a new RingAllocator
/// @param size The number of elements in the underlying arena. Make this large enough to avoid overwriting fresh data
RingAllocator<T>(std::size_t size) : arena_size(size)
{
//allocate pool
arena = new T[size];
//zero out pool
std::memset(arena, 0, sizeof(T) * size);
arena_idx = 0;
}
~RingAllocator()
{
delete[] arena;
}
/// @brief Return next element's pointer. Thread-safe
/// @return pointer to next available element
T *get_next()
{
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
}
};
Then I could have my DataState class look like this:
class DataState
{
std::atomic<Event*> latest_event;
RingAllocator<Event> event_allocator;
public:
//pushes event atomically
void push_event(const Event* __restrict__ e)
{
//store event
Event *new_ptr = event_allocator.get_next();
*new_ptr = *e;
//swap event pointers
latest_event.store(new_ptr, std::memory_order_release);
}
//pulls event atomically
Event pull_event()
{
return *(latest_event.load(std::memory_order_acquire));
}
};
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. Plus everything's super localized so indirection won't cause bad cache performance. Any possible pitfalls with this approach?
The DataState class:
I thought it was going to be a stack or queue, but it isn't, so push / pull don't seem like good names for methods. (Or else the implementation is totally bogus).
It's just a latch that lets you read the last event that any thread stored.
There's nothing to stop two writes in a row from overwriting an element that's never been read. There's also nothing to stop you reading the same element twice.
If you just need somewhere to copy small blocks of data, a ring buffer does seem like a decent approach. But if you don't want to lose events, I don't think you can use it this way. Instead, just get a ring buffer entry, then copy to it and use it there. So the only atomic operation should be incrementing the ring buffer position index.
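In code terms, a rough sketch of that pattern (process() is a hypothetical consumer, not something from the question; get_next() is the improved version shown below):
// Claim a slot, copy into it, and consume the data in place. The only shared
// atomic operation is the index increment inside get_next().
void handle_event(const Event* e, RingAllocator<Event>& alloc)
{
    Event* slot = alloc.get_next();   // one atomic fetch_add
    *slot = *e;                       // copy the incoming event into the ring entry
    process(*slot);                   // hypothetical consumer: use the data right here,
                                      // before "size" further get_next() calls can recycle the slot
}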
The ring buffer
You can make get_next() much more efficient. This line does an atomic post-increment (fetch_add) and an atomic exchange:
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
I'm not even sure it's safe, because the xchg can maybe step on the fetch_add from another thread. Anyway, even if it's safe, it's not ideal.
You don't need that. Make sure the arena_size is always a power of 2, then you don't need to modulo the shared counter. You can just let it go, and have every thread modulo it for their own use. It will eventually wrap, but it's a binary integer so it will wrap at a power of 2, which is a multiple of your arena size.
I'd suggest storing an AND-mask instead of a size, so there's no risk of the % compiling to anything other than an and instruction, even if it's not a compile-time constant. This makes sure we avoid a 64-bit integer div instruction.
template<typename T> class RingAllocator {
T *arena;
std::atomic_size_t arena_idx;
const std::size_t size_mask; // maybe even make this a template parameter?
public:
RingAllocator<T>(std::size_t size)
: arena_idx(0), size_mask(size-1)
{
// verify that size is actually a power of two, so the mask is all-ones in the low bits, and all-zeros in the high bits.
// so that i % size == i & size_mask for all i
...
}
...
T *get_next() {
size_t idx = arena_idx.fetch_add(1, std::memory_order_relaxed); // still atomic, but we don't care which order different threads take blocks in
idx &= size_mask; // modulo our local copy of the idx
return &arena[idx];
}
};
Allocating the arena would be more efficient if you used calloc instead of new + memset. The OS already zeros pages before giving them to user-space processes (to prevent information leakage), so writing them all is just wasted work.
arena = new T[size];
std::memset(arena, 0, sizeof(T) * size);
// vs.
arena = (T*)calloc(size, sizeof(T));
Writing the pages yourself does fault them in, so they're all wired to real physical pages, instead of just copy-on-write mappings for a system-wide shared physical zero page (like they are after new/malloc/calloc). On a NUMA system, the physical page chosen might depend on which thread actually touched the page, rather than which thread did the allocation. But since you're reusing the pool, the first core to write a page might not be the one that ends up using it most.
Maybe something to look for in microbenchmarks / perf counters.
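If it does show up, here is a minimal sketch of first-touch prefaulting (prefault_arena is a hypothetical helper, not part of the original code): each thread writes the slice of the arena it will later use, so the pages are faulted in on that thread's NUMA node.
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: fault the arena in from the threads that will use it
// (first-touch placement), instead of from the thread that allocated it.
template <typename T>
void prefault_arena(T* arena, std::size_t count, unsigned nthreads)
{
    std::vector<std::thread> workers;
    const std::size_t chunk = count / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([=] {
            // Writing one element per page would be enough; writing every
            // element keeps the sketch simple.
            for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                arena[i] = T{};
        });
    }
    for (auto& w : workers)
        w.join();
}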
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. .... Any possible pitfalls with this approach?
The pitfall is, IIUC, that your statement is wrong.
If I have just 2 threads, and 10 elements in the ring buffer, the first thread could call pull_event once, and be "mid-pulling", and then the second thread could call push 10 times, overwriting what thread 1 is pulling.
Again, assuming I understand your code correctly.
Also, as mentioned above,
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
that arena_idx++ inside the exchange on the same variable, just looks wrong. And in fact is wrong. Two threads could increment it - ThreadA increments to 8 and threadB increments to 9, and then threadB exchanges it to 9, then threadA exchanges it to 8. whoops.
atomic(op1) # atomic(op2) != atomic(op1 # op2)
I worry about what else is wrong in code not shown. I don't mean that as an insult - lock-free is just not easy.
Have you looked at any of the C++ Disruptor (Java) ports that are available?
disruptor--
disruptor
Although they are not complete ports, they may offer all that you need. I am currently working on a more fully featured port; however, it's not quite ready.
Related
Currently, we have a system where metadata will be stored in a kv storage cluster. We store it simply by serializing the application metadata with protobuf and then sending it to the kv cluster. As the system gets bigger and bigger, fetching metadata itself becomes expensive. Therefore, we developed an in-memory meta-cache component, simply an LRU cache, where the cached items are the protobuf objects. Recently, we have faced some challenges:
Concurrent read-write seems to be a big issue: when we add new data to our system, we need to update the cache as well, so we will need to partially lock part of the cache to ensure repeatable reads; this brings very high lock contention on the cache.
The cache gets bigger and bigger over time.
I'm thinking that maybe our cache design is not good enough, and am considering using a 3rd-party lib like Facebook CacheLib (our system is written in C++). Can anyone who has experience with the matter give me some advice? Should we use a 3rd-party lib, or should we improve our own? If we improve our own, what can we do?
Thanks so much :).
If you are caching in memory on only a single server node, if you can hash all of the keys (for metadata indexing) into unique positive integers, and
if ~75M lookups per second on an old CPU is OK for you, there is a multi-threaded multi-level read-write cache implementation:
https://github.com/tugrul512bit/LruClockCache/blob/main/MultiLevelCache.h
It works like this:
int main()
{
std::vector<int> data(1024*1024); // simulating a backing-store
MultiLevelCache<int,int> cache(
64*1024 /* direct-mapped L1 cache elements */,
256,1024 /* n-way set-associative (LRU approximation) L2 cache elements */,
[&](int key){ return data[key]; } /* cache-miss function to get data from backingstore into cache */,
[&](int key, int value){ data[key]=value; } /* cache-miss function to set data on backing-store during eviction */
);
cache.set(5,10); // this is single-thread example, sets value 10 at key position of 5
cache.flush(); // writes all latest bits of data to backing-store
std::cout<<data[5]<<std::endl;
auto val = cache.getThreadSafe(5); // this is thread-safe from any number of threads
cache.setThreadSafe(10,val); // thread-safe, any number of threads
return 0;
}
It consists of two caches: L1 is fast and sharded but weak on cache-hits; L2 is slower and sharded but better on cache-hits, although not as good as a perfect LRU.
If there is a read-only part (like just distributing values to threads from a read-only backing-store) in the application, there is another way to improve performance up to 2.5 billion lookups per second:
// shared last level cache (LRU approximation)
auto LLC = std::make_shared<LruClockCache<int,int>>(LLCsize,
[ & ](int key) {
return backingStore[key];
},
[ & ](int key, int value) {
backingStore[key] = value;
});
#pragma omp parallel num_threads(8)
{
// each thread builds its own lockless L1&L2 caches, binds to LLC (LLC has thread-safe bindings to L2)
// building overhead is linearly dependent on cache sizes
CacheThreader<LruClockCache,int,int> cache(LLC,L1size,L2size);
for(int i=0;i<many_times;i++)
cache.get(i); // 300M-400M per second per thread
}
Performance depends on how predictable the access pattern is. For user-input-dependent access order, it has lower performance due to inefficient pipelining. For compile-time-known indexing, it has better performance. For true random access, it has lower performance due to cache misses and lack of vectorization. For sequential access, it has better performance.
If you need to use your main thread asynchronously during get/set calls and still require a (weak) cache coherence between gets and sets, there is another implementation whose performance depends on the number of keys requested at a time from a thread:
// backing-store
std::vector<int> data(1000000);
// L1 direct mapped 128 tags
// L2 n-way set-associative 128 sets + 1024 tags per set
AsyncCache<int,int> cache(128,128,1024,[&](int key){ return data[key]; },[&](int key, int value){ data[key]=value; });
// a variable to use as output for the get() call
int val;
// thread-safe, returns immediately
// each set/get call should be given a slot id per thread
// or it returns a slot id instead
int slot = cache.setAsync(5,100);
// returns immediately
cache.getAsync(5,&val,slot);
// garbage garbage
std::cout<<data[5]<<" "<<val<<std::endl;
// waits for operations made in a slot
cache.barrier(slot);
// garbage 100
std::cout<<data[5]<<" "<<val<<std::endl;
// writes latest bits to backing-store
cache.flush();
// 100 100
std::cout<<data[5]<<" "<<val<<std::endl;
This is not fully coherent, and threads are responsible for ordering their gets/sets correctly with respect to each other.
If your metadata indexing is two/three-dimensional, there are 2D/3D direct-mapped multi-thread caches:
https://github.com/tugrul512bit/LruClockCache/blob/main/integer_key_specialization/DirectMapped2DMultiThreadCache.h
https://github.com/tugrul512bit/LruClockCache/blob/main/integer_key_specialization/DirectMapped3DMultiThreadCache.h
int backingStore[10][10];
DirectMapped2DMultiThreadCache<int,int> cache(4,4,
[&](int x, int y){ return backingStore[x][y]; },
[&](int x, int y, int value){ backingStore[x][y]=value; });
for(int i=0;i<10;i++)
for(int j=0;j<10;j++)
cache.set(i,j,0); // row-major
cache.flush();
for(int i=0;i<10;i++)for(int j=0;j<10;j++)
std::cout<<backingStore[i][j]<<std::endl;
It has a better cache-hit ratio than a normal direct-mapped cache if you do tiled processing.
The goal is to implement a sequence number generator in modern C++. The context is in a concurrent environment.
Requirement #1 The class must be singleton (common for all threads)
Requirement #2 The type used for the numbers is 64-bit integer.
Requirement #3 The caller can request more than one number
Requirement #4 This class will cache a sequence of numbers before being able to serve the calls. Because it caches a sequence, it must also store the upper bound -> the maximum number it can return.
Requirement #5 Last but not least, at startup (constructor) and when there are no available numbers to give ( n_requested > n_available ), the singleton class must query the database to get a new sequence. This load from the DB updates both seq_n_ and max_seq_n_.
A brief draft for its interface is the following:
class singleton_sequence_manager {
public:
static singleton_sequence_manager& instance() {
static singleton_sequence_manager s;
return s;
}
std::vector<int64_t> get_sequence(int64_t n_requested);
private:
singleton_sequence_manager(); //Constructor
void get_new_db_sequence(); //Gets a new sequence from DB
int64_t seq_n_;
int64_t max_seq_n_;
};
Example just to clarify the use case.
Suppose that at startup, DB sets seq_n_ to 1000 and max_seq_n_ to 1050:
get_sequence(20); //Gets [1000, 1019]
get_sequence(20); //Gets [1020, 1039]
get_sequence(5); //Gets [1040, 1044]
get_sequence(10); //In order to serve this call, a new sequence must be loaded from DB
Obviously, an implementation using locks and std::mutex is quite simple.
What I am interested into is implementing a lock-free version using std::atomic and atomic operations.
My first attempt is the following one:
int64_t seq_n_;
int64_t max_seq_n_;
were changed to:
std::atomic<int64_t> seq_n_;
std::atomic<int64_t> max_seq_n_;
Get a new sequence from DB just sets the new values in the atomic variables:
void singleton_sequence_manager::get_new_db_sequence() {
//Sync call is made to DB
//Let's just ignore unhappy paths for simplicity
seq_n_.store( start_of_seq_got_from_db );
max_seq_n_.store( end_of_seq_got_from_db );
//At this point, the class can start returning numbers in [seq_n_ : max_seq_n_]
}
And now the get_sequence function using atomic compare and swap technique:
std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
bool succeeded{false};
int64_t current_seq{};
int64_t next_seq{};
do {
current_seq = seq_n_.load();
do {
next_seq = current_seq + n_requested;
}
while( !seq_n_.compare_exchange_weak( current_seq, next_seq ) );
//After the CAS, the caller gets the sequence [current_seq:next_seq-1]
//Check if sequence is in the cached bound.
if( max_seq_n_.load() > next_seq - 1 )
succeeded = true;
else //Needs to load new sequence from DB, and re-calculate again
get_new_db_sequence();
}
while( !succeeded );
//Building the response
std::vector<int64_t> res{};
res.reserve(n_requested);
for(int64_t n = current_seq ; n < next_seq ; n++)
res.push_back(n);
return res;
}
Thoughts:
I am really concerned about the lock-free version. Is the implementation safe? If we ignore the DB load part, obviously yes. The problem arises (in my head at least) when the class has to load a new sequence from the DB. Is the update from the DB safe? Two atomic stores?
My second attempt was to combine both seq_n_ and max_seq_n_ into a struct called sequence and use a single std::atomic<sequence> variable, but the compiler failed because the size of the struct sequence is greater than 64 bits.
Is it possible to somehow protect the DB part by using an atomic flag to mark whether the sequence is ready yet: the flag is set to false while waiting for the DB load to finish and for both atomic variables to be updated. get_sequence would then have to wait for the flag to be set to true. (Use of a spin lock?)
Your lock-free version has a fundamental flaw, because it treats two independent atomic variables as one entity. Since writes to seq_n_ and max_seq_n_ are separate statements, they can be separated during execution resulting in the use of one of them with a value that is incorrect when paired with the other.
For example, one thread can get past the inner CAS while loop (with an n_requested that is too large for the current cached sequence), then be suspended before checking if it is cached. A second thread can come through and update the max_seq_n_ value to a larger value. The first thread then resumes and passes the max_seq_n_ check because the value was updated by the second thread. It is now using an invalid sequence.
A similar thing can happen in get_new_db_sequence between the two store calls.
Since you're writing to two distinct locations (even if adjacent in memory), and they cannot be updated atomically (due to the combined size of 128 bits not being a supported atomic size with your compiler), the writes must be protected by a mutex.
A spin lock should only be used for very short waits, since it does consume CPU cycles. A typical usage would be to use a short spin lock, and if the resource is still unavailable use something more expensive (like a mutex) to wait using CPU time.
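For illustration, here is a minimal mutex-based sketch of get_sequence, assuming a std::mutex member (call it mutex_) is added to the class; this is not the OP's code, just one way the advice above could look:
std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
    std::lock_guard<std::mutex> lock(mutex_);   // hypothetical std::mutex member
    if (max_seq_n_ - seq_n_ + 1 < n_requested)  // not enough cached numbers left
        get_new_db_sequence();                  // refreshes seq_n_ and max_seq_n_ under the lock
    std::vector<int64_t> res;
    res.reserve(n_requested);
    for (int64_t n = seq_n_; n < seq_n_ + n_requested; ++n)
        res.push_back(n);
    seq_n_ += n_requested;   // plain int64_t members are fine while the mutex is held
    return res;
}
With the mutex in place, seq_n_ and max_seq_n_ can go back to being plain int64_t members, since they are only ever touched while the lock is held.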
I'm using C++11 and I'm aware that concurrent writes to std::vector<bool> someArray are not thread-safe due to the specialization of std::vector for bools.
I'm trying to find out if writes to bool someArray[2048] have the same problem:
Suppose all entries in someArray are initially set to false.
Suppose I have a bunch of threads that write at different indices in someArray. In fact, these threads only set different array entries from false to true.
Suppose I have a reader thread that at some point acquires a lock, triggering a memory fence operation.
Q: Will the reader see all the writes to someArray that occurred before the lock was acquired?
Thanks!
You should use std::array<bool, 2048> someArray, not bool someArray[2048];. If you're in C++11-land, you'll want to modernize your code as much as you are able.
std::array<bool, N> is not specialized in the same way that std::vector<bool> is, so there are no concerns there in terms of raw safety.
As for your actual question:
Will the reader see all the writes to someArray that occurred before the lock was acquired?
Only if the writers to the array also interact with the lock, either by releasing it at the time that they finish writing, or else by updating a value associated with the lock that the reader then synchronizes with. If the writers never interact with the lock, then the data that will be retrieved by the reader is undefined.
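A rough sketch of that pattern, with illustrative names (arrayMutex and writersFinished are made up for the example):
#include <array>
#include <cstddef>
#include <mutex>

std::array<bool, 2048> someArray{};   // writers touch disjoint indices
std::mutex arrayMutex;
int writersFinished = 0;              // "a value associated with the lock"

void writer(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        someArray[i] = true;          // plain writes, each index owned by one thread
    std::lock_guard<std::mutex> guard(arrayMutex);
    ++writersFinished;                // publishing under the lock releases the writes above
}

bool allWritesVisible(int expectedWriters) {
    std::lock_guard<std::mutex> guard(arrayMutex);  // the reader acquires the same lock
    return writersFinished == expectedWriters;      // if true, every finished writer's stores are visible
}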
One thing you'll also want to bear in mind: while it's not unsafe to have multiple threads write to the same array, provided that they are all writing to unique memory addresses, writing could be slowed pretty dramatically by interactions with the cache. For example:
void func_a() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for(int i = 0; i < 8; i++) {
        writers.emplace_back([i, &someArray]{
            // each thread fills its own contiguous 256-element block
            for(size_t index = i * 256; index < (i+1) * 256; index++)
                someArray[index] = true;
        });
    }
    for(auto& writer : writers)
        writer.join(); // join rather than detach, so someArray outlives the writers
}
void func_b() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for(int i = 0; i < 8; i++) {
        writers.emplace_back([i, &someArray]{
            // each thread strides through the array, interleaving with the others
            for(size_t index = i; index < 2048; index += 8)
                someArray[index] = true;
        });
    }
    for(auto& writer : writers)
        writer.join(); // join rather than detach, so someArray outlives the writers
}
The details are going to vary depending on the underlying hardware, but in nearly all situations, func_a is going to be orders of magnitude faster than func_b, at least for a sufficiently large array size (2048 was chosen as an example, but it may not be representative of the actual underlying performance differences). Both functions should have the same result, but one will be considerably faster than the other.
First of all, the general std::vector is not thread-safe as you might think. The guarantees are already stated here.
Addressing your question: The reader may not see all writes after acquiring the lock. This is due to the fact that the writers may never have performed a release operation, which is required to establish a happens-before relationship between the writes and the subsequent reads. In (very) simple terms: every acquire operation (such as a mutex lock) needs a release operation to synchronize with. Every memory operation done before a release operation on a certain variable will be visible to any thread that acquires the same variable. See also Release-Acquire ordering.
One important thing to note is that all operations (fetch and store) on an int32-sized variable (such as a bool) are atomic (this holds true for x86 and x64 architectures). So if you declare your array as volatile (necessary as each thread may have a cached value of the array), you should not have any issues in modifying the array (via multiple threads).
I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements. Since iterating over the vector while it is changing will cause a crash, I thought to copy the vector and then iterate over the copy. My question is, can this way also crash?
struct Data
{
int A;
double B;
bool C;
};
std::vector<Data> DataVec;
void ModifyThreadFunc()
{
// Here the vector is changed, which includes adding and erasing elements
...
}
void ReadThreadFunc()
{
auto temp = DataVec; // Will this crash?
for (auto& data : temp)
{
// Do stuff with the data
...
}
// This definitely can crash
/*for (auto& data : DataVec)
{
// Do stuff with the data
...
}*/
}
The basic exception safety guarantee for vector::operator= is:
"if an exception is thrown, the container is in a valid state."
What types of exceptions are possible here?
EDIT:
I solved this using double buffering, and posted my answer below.
As has been pointed out by the other answers, what you ask for is not doable. If you have concurrent access, you need synchronization, end of story.
That being said, it is not unusual to have requirements like yours where synchronization is not an option. In that case, what you can still do is get rid of the concurrent access. For example, you mentioned that the data is accessed once per frame in a game-loop like execution. Is it strictly required that you get the data from the current frame or could it also be the data from the last frame?
In that case, you could work with two vectors, one that is being written to by the producer thread and one that is being read by all the consumer threads. At the end of the frame, you simply swap the two vectors. Now you no longer need *(1) fine-grained synchronization for the data access, since there is no concurrent data access any more.
This is just one example how to do this. If you need to get rid of locking, start thinking about how to organize data access so that you avoid getting into the situation where you need synchronization in the first place.
*(1): Strictly speaking, you still need a synchronization point that ensures that when you perform the swapping, all the writer and reader threads have finished working. But this is far easier to do (usually you have such a synchronization point at the end of each frame anyway) and has a far lesser impact on performance than synchronizing on every access to the vector.
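A rough sketch of that scheme (Data is the struct from the question; the frame-boundary synchronization point itself is left out, as noted in the footnote):
// Two buffers of the question's Data struct: one written during the frame, one read.
std::vector<Data> writeBuffer;   // only the producer thread touches this during a frame
std::vector<Data> readBuffer;    // only the consumer threads touch this during a frame

void ModifyThreadFunc() {
    // add / erase elements of writeBuffer freely; no lock needed during the frame
}

void ReadThreadFunc() {
    for (auto& data : readBuffer) {
        // work with last frame's data; no lock needed during the frame
    }
}

void OnFrameEnd() {
    // called at the per-frame synchronization point, when no reader or writer is running
    writeBuffer.swap(readBuffer);
}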
My question is, can this way also crash?
Yes, you still have a data race. If thread A modifies the vector while thread B is creating a copy, all iterators to the vector are invalidated.
What types of exceptions are possible here?
std::vector::operator=(const vector&) will throw on memory allocation failure, or if the contained elements throw on copy. The same thing applies to copy construction, which is what the line in your code marked "Will this crash?" is actually doing.
The fundamental problem here is that std::vector is not thread-safe. You have to either protect it with a lock/mutex, or replace it with a thread-safe container (such as the lock-free containers in Boost.Lockfree or libcds).
I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements.
This is an impossible-to-meet requirement.
Anyway, any sharing of data between 2 threads will require some kind of locking, be it explicit or provided by the implementation (possibly by the hardware). You must re-examine your actual requirements: it may be unacceptable to suspend one thread until the other one ends, but you could lock short sequences of instructions. And/or possibly use a different architecture. For example, erasing an item in a vector is a costly operation (linear time, because you have to move all the data above the removed item) while marking it as invalid is much quicker (constant time, because it is one single write). If you really have to erase in the middle of a vector, maybe a list would be more appropriate.
But if you can put a lock around the copy of the vector in ReadThreadFunc and around any vector modification in ModifyThreadFunc, it could be enough. To give priority to the modifying thread, you could just try to lock in the other thread and immediately give up if you cannot.
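For example, a sketch of that idea using std::mutex::try_lock (DataVec and Data are from the question; the reader simply skips a pass if the writer holds the lock):
std::mutex vecMutex;  // protects DataVec from the question

void ModifyThreadFunc() {
    std::lock_guard<std::mutex> guard(vecMutex);  // the writer always takes the lock
    // add / erase elements of DataVec here
}

void ReadThreadFunc() {
    std::vector<Data> temp;
    {
        std::unique_lock<std::mutex> guard(vecMutex, std::try_to_lock);
        if (!guard.owns_lock())
            return;               // give up immediately rather than delay the writer
        temp = DataVec;           // short critical section: just the copy
    }
    for (auto& data : temp) {
        // work on the copy outside the lock
    }
}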
Maybe you should rethink your design!
Each thread should have its own vector (list, queue, whatever fits your needs) to work on. So thread A can do some work and pass the result to thread B. You simply have to lock when writing the data from thread A into thread B's queue.
Without some kind of locking it's not possible.
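A small sketch of that hand-off, with the lock held only around thread B's queue (the names are illustrative; Data is the struct from the question):
std::deque<Data> threadBQueue;   // owned by thread B
std::mutex queueMutex;           // protects only the hand-off

// thread A: produce a result and pass it to thread B
void PassToB(const Data& result) {
    std::lock_guard<std::mutex> guard(queueMutex);
    threadBQueue.push_back(result);
}

// thread B: take pending work, then process it on its own private data
void DrainQueue(std::vector<Data>& privateVec) {
    std::deque<Data> pending;
    {
        std::lock_guard<std::mutex> guard(queueMutex);
        pending.swap(threadBQueue);   // keep the locked region as short as possible
    }
    for (auto& d : pending)
        privateVec.push_back(d);
}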
So I solved this using double buffering, which guarantees no crashing, and the reading thread will always have usable data, even if it might not be correct:
struct Data
{
int A;
double B;
bool C;
};
const int MAXSIZE = 100;
Data Buffer[MAXSIZE];
std::vector<Data> DataVec;
void ModifyThreadFunc()
{
// Here the vector is changed, which includes adding and erasing elements
...
// Copy from the vector to the buffer
size_t numElements = DataVec.size();
memcpy(Buffer, DataVec.data(), sizeof(Data) * numElements);
memset(&Buffer[numElements], 0, sizeof(Data) * (MAXSIZE - numElements));
}
void ReadThreadFunc()
{
Data* p = Buffer;
for (int i = 0; i < MAXSIZE; ++i)
{
// Use the data
...
++p;
}
}
I'm writing a small ray tracer using bounding volume hierarchies to accelerate ray tracing.
Long story short, I have a binary tree and I might need to visit multiple leafs.
Currently I have a node with two children, left and right; then during travel(), if some condition holds, in this example intersect(), the children are visited:
class BoundingBoxNode{
BoundingBoxNode* left, *right;
void travel(param &p);
inline bool intersect(param &p){...};
};
void BoundingBoxNode::travel(param &p){
if(this->intersect(p)){
if(left)
left->travel(p);
if(right)
right->travel(p);
}
}
This approach uses recursive method calls; however, I need to optimize this code as much as possible... And according to the Optimization Reference Manual for IA-32, function calls deeper than 16 can be very expensive, so I would like to do this using a while loop instead of recursive calls.
But I do NOT wish to do dynamic heap allocations as these are expensive. So I was thinking that maybe I could use the fact that every time the while loop starts over, the stack will be in the same position.
In the following very ugly hack I rely on alloca() to always allocate the same address:
class BoundingBoxNode{
BoundingBoxNode *left, *right;
inline void travel(param &p){
int stack_size = 0;
BoundingBoxNode* current = this;
while(stack_size >= 0){
BoundingBoxNode** stack = (BoundingBoxNode**)alloca(stack_size * sizeof(BoundingBoxNode*) + 2 * sizeof(BoundingBoxNode*));
if(current->intersect(p)){
if(current->left){
stack[stack_size] = current->left;
stack_size++;
}
if(current->right){
stack[stack_size] = current->right;
stack_size++;
}
}
stack_size--;
current = stack[stack_size];
}
};
inline bool intersect(param &p){...};
};
However surprising it may seem this approach does fail :)
But it does work as long as the stack is smaller than 4 or 5... I'm also quite confident that this approach is possible, I just really think I need some help implementing it correctly.
So how can I manipulate the stack manually from C++? Is it possible that I can use some compiler extension... Or must I do this in assembler, and if so, how do I write assembler that can be compiled with both GCC and ICC?
I hope somebody can help me... I don't need a perfect solution, just a hack, if it works it's good enough for this purpose :)
Regards Jonas Finnemann Jensen
So, you've got a recursive function that you want to convert to a loop. You correctly work out that your function is not tail-recursive, so you have to implement it with a stack.
Now, why are you worried about the number of times that you allocate your "scratch space" stack? Is this not done once per traversal? -- if not then pass the scratch area in to the traverse function itself so it can be allocated once and then re-used for each traversal.
If the stack is small enough to fit in the cache it will stay hot and the fact that it isn't on the real C++ stack won't matter.
Once you've done all of that profile it both ways and see if it made any difference -- keep the faster version.
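Something along these lines, as a sketch (the two-argument travel overload is my own addition; the scratch vector is owned by the caller and reused across traversals):
// Sketch: the caller owns the scratch stack and passes it in, so it is
// allocated once and reused for every traversal.
void BoundingBoxNode::travel(param &p, std::vector<BoundingBoxNode*> &stack) {
    stack.clear();                      // no allocation here once capacity is warm
    stack.push_back(this);
    while (!stack.empty()) {
        BoundingBoxNode* current = stack.back();
        stack.pop_back();
        if (current->intersect(p)) {
            if (current->left)  stack.push_back(current->left);
            if (current->right) stack.push_back(current->right);
        }
    }
}

// caller: allocate the scratch area once, keep it hot in cache
std::vector<BoundingBoxNode*> scratch;
scratch.reserve(64);    // comfortably more than any realistic BVH depth
// for each ray: root->travel(p, scratch);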
Stack allocations cannot be resized.
In your example, it isn't immediately obvious which data you need to allocate - besides the call stack itself. You could basically hold the current path in a vector preallocated to the maximum depth. The loop gets ugly, but that's life...
If you need many small allocations that can be released as a whole (after the algorithm completes), use a continuous pool for your allocations.
If you know an upper boundary for the required memory, the allocation is just a pointer increment:
class CPool
{
std::vector<char> m_data;
size_t m_head;
public:
CPool(size_t size) : m_data(size), m_head(0) {}
void * Alloc(size_t size)
{
if (m_data.size() - m_head < size)
throw std::bad_alloc();
void * result = &(m_data[m_head]);
m_head += size;
return result;
}
void Free(void * p) {} // free is free ;)
};
If you don't have an upper boundary for the total size, use "pool on a rope" - i.e. when the big chunk of memory runs out, get a new one, and put these chunks in a list.
You don't need the stack, you just need a stack. You can probably use a std::stack<BoundingBoxNode*>, if I look at your code.
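For example, a sketch of the iterative traversal with std::stack:
#include <stack>

void BoundingBoxNode::travel(param &p) {
    std::stack<BoundingBoxNode*> nodes;   // heap-backed, but growth is rare and amortized
    nodes.push(this);
    while (!nodes.empty()) {
        BoundingBoxNode* current = nodes.top();
        nodes.pop();
        if (current->intersect(p)) {
            if (current->left)  nodes.push(current->left);
            if (current->right) nodes.push(current->right);
        }
    }
}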
The C++ Standard provides no means of manipulating the stack - it doesn't even require that there be a stack. Have you actually measured the performance of your code using dynamic allocation?
The fact that it works with small stack sizes is probably a coincidence. You'd have to maintain multiple stacks and copy between them. You're never guaranteed that successive calls to alloca will return the same address.
Best approach is probably a fixed size for the stack, with an assert to catch overflows. Or you could determine the max stack size from the tree depth on construction and dynamically allocate a stack that will be used for every traversal... assuming you're not traversing on multiple threads, at least.
Since alloca allocations are cumulative, I suggest you do a first alloca to store the "this" pointer, thus becoming the "base" of the stack, keep track of how many elements your stack can hold and allocate only the size needed:
inline void travel(param &p){
BoundingBoxNode** stack = (BoundingBoxNode**)alloca(sizeof(BoundingBoxNode*)*3);
int stack_size = 3, stack_idx = 0;
stack[stack_idx] = this;
do {
BoundingBoxNode* current = stack[stack_idx];
if( current->intersect(p)){
int need = (current->left ? 1 : 0) + (current->right ? 1 : 0);
if ( stack_size - stack_idx < need ) {
// Let us simplify and always allocate enough for two
alloca(sizeof(BoundingBoxNode*)*2);
stack_size += 2;
}
if(current->left){
stack[stack_idx++] = current->left;
}
if(current->right){
stack[stack_idx++] = current->right;
}
}
stack_idx--;
} while(stack_idx >= 0);
};
From your question, it appears there is a lot that still needs to be learned.
The most important thing to learn is: don't assume anything about performance without first measuring your runtime execution and analysing the results to determine exactly where the bottlenecks to performance are.
The function 'alloca' allocates a chunk of memory from the stack; the stack size is increased (by moving the stack pointer). Each call to 'alloca' creates a new chunk of memory until you run out of stack space; it does not re-use previously allocated memory. The data that was pointed to by 'stack' is lost when you allocate another chunk of memory and assign it to 'stack'. This is a memory leak. In this case, the memory is automatically freed when the function exits so it's not a serious leak, but you've lost the data.
I would leave the "Optimization Reference Manual for IA-32" well alone. It assumes you know exactly how the CPU works. Let the compiler worry about optimisations it will do a good enough job for what you're doing - the compiler writers hopefully know that reference inside out. With modern PC's, the common bottleneck to performance is usually memory bandwidth.
I believe the '16 deep' function calls being expensive has to do with how the CPU manages its stack, and is a guideline only. The CPU keeps the top of the stack in onboard cache; when the cache is full, the bottom of the stack is paged to RAM, which is where the performance starts to decrease. Functions with lots of arguments won't nest as deeply as functions with no arguments. And it's not just function calls, it's also local variables and memory allocated using alloca. In fact, using alloca is probably a performance hit, since the CPU will be designed to optimise its stack for common use cases - a few parameters and a few local variables. Stray from the common case and performance drops off.
Try using std::stack as MSalters has suggested above. Get that working.
Use a C++ data structure. You are using C++ after all. A std::vector<> can be pre-allocated in chunks for an amortized cost of pretty much nil. And it's safe too (as you have noticed, using the normal stack is not, especially when you're using threads).
And no, it's not expensive. It's as fast as a stack allocation.
std::list<>, yes, that will be expensive, but that's because you can't pre-allocate it. std::vector<> allocates in chunks by default.