Currently, we have a system where metadata is stored in a KV storage cluster. We store it simply by serializing the application metadata with protobuf and sending it to the KV cluster. As the system grows, fetching the metadata itself becomes expensive, so we developed an in-memory meta-cache component: simply an LRU cache whose cached items are the protobuf objects. Recently, we have faced some challenges:
Concurrent reads and writes are a big issue: when we add new data to our system, we need to update the cache as well, so we have to lock part of the cache to ensure repeatable reads, and this causes very high lock contention on the cache.
The cache gets bigger and bigger over time.
I'm thinking that maybe our cache design is not good enough, and I'm considering using a third-party library like Facebook's CacheLib (our system is written in C++). Can anyone with experience in this area give me some advice? Should we use a third-party library, or should we improve our own cache? If we improve our own, what can we do?
Thanks so much :).
If you are caching in memory on only a single server node, can hash all of your keys (for metadata indexing) into unique positive integers,
and ~75M lookups per second on an old CPU is enough for you, there is a multi-threaded, multi-level read-write cache implementation:
https://github.com/tugrul512bit/LruClockCache/blob/main/MultiLevelCache.h
It works like this:
#include <iostream>
#include <vector>
#include "MultiLevelCache.h"

int main()
{
    std::vector<int> data(1024*1024); // simulating a backing-store
    MultiLevelCache<int,int> cache(
        64*1024 /* direct-mapped L1 cache elements */,
        256,1024 /* n-way set-associative (LRU approximation) L2 cache elements */,
        [&](int key){ return data[key]; } /* cache-miss function to get data from backing-store into cache */,
        [&](int key, int value){ data[key]=value; } /* cache-miss function to set data on backing-store during eviction */
    );
    cache.set(5,10); // this is a single-thread example; sets value 10 at key position 5
    cache.flush(); // writes all latest bits of data to backing-store
    std::cout<<data[5]<<std::endl;
    auto val = cache.getThreadSafe(5); // this is thread-safe from any number of threads
    cache.setThreadSafe(10,val); // thread-safe, any number of threads
    return 0;
}
It consists of two caches: L1 is fast and sharded but weak on cache hits; L2 is slower and sharded but better on cache hits, although not as good as a perfect LRU.
If the application has a read-only part (like just distributing values to threads from a read-only backing store), there is another way to improve performance, up to 2.5 billion lookups per second:
// shared last level cache (LRU approximation)
auto LLC = std::make_shared<LruClockCache<int,int>>(LLCsize,
    [&](int key) {
        return backingStore[key];
    },
    [&](int key, int value) {
        backingStore[key] = value;
    });

#pragma omp parallel num_threads(8)
{
    // each thread builds its own lockless L1&L2 caches and binds to the LLC (the LLC has thread-safe bindings to L2)
    // building overhead is linearly dependent on cache sizes
    CacheThreader<LruClockCache,int,int> cache(LLC, L1size, L2size);
    for(int i=0;i<many_times;i++)
        cache.get(i); // 300M-400M lookups per second per thread
}
Performance depends on how predictable the access pattern is. For user-input-dependent access order, performance is lower due to inefficient pipelining; for compile-time-known indexing, it is better. For truly random access, performance is lower due to cache misses and lack of vectorization; for sequential access, it is better.
If you need to keep your main thread free during get/set calls and still require (weak) cache coherence between gets and sets, there is another implementation whose performance depends on the number of keys requested at a time from a thread:
// backing-store
std::vector<int> data(1000000);

// L1: direct-mapped, 128 tags
// L2: n-way set-associative, 128 sets + 1024 tags per set
AsyncCache<int,int> cache(128, 128, 1024,
    [&](int key){ return data[key]; },
    [&](int key, int value){ data[key]=value; });

// a variable to use as output for the get() call
int val;

// thread-safe, returns immediately;
// each set/get call should be given a per-thread slot id, or it returns a slot id instead
int slot = cache.setAsync(5,100);

// returns immediately
cache.getAsync(5,&val,slot);

// prints: garbage garbage
std::cout<<data[5]<<" "<<val<<std::endl;

// waits for operations made in a slot
cache.barrier(slot);

// prints: garbage 100
std::cout<<data[5]<<" "<<val<<std::endl;

// writes latest bits to backing-store
cache.flush();

// prints: 100 100
std::cout<<data[5]<<" "<<val<<std::endl;
This is not fully coherent; threads are responsible for ordering their gets/sets correctly with respect to each other.
If your metadata indexing is two/three-dimensional, there are 2D/3D direct-mapped multi-thread caches:
https://github.com/tugrul512bit/LruClockCache/blob/main/integer_key_specialization/DirectMapped2DMultiThreadCache.h
https://github.com/tugrul512bit/LruClockCache/blob/main/integer_key_specialization/DirectMapped3DMultiThreadCache.h
int backingStore[10][10];
DirectMapped2DMultiThreadCache<int,int> cache(4,4,
    [&](int x, int y){ return backingStore[x][y]; },
    [&](int x, int y, int value){ backingStore[x][y]=value; });

for(int i=0;i<10;i++)
    for(int j=0;j<10;j++)
        cache.set(i,j,0); // row-major

cache.flush();

for(int i=0;i<10;i++)
    for(int j=0;j<10;j++)
        std::cout<<backingStore[i][j]<<std::endl;
It has a better cache-hit ratio than a plain direct-mapped cache if you do tiled processing.
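For instance, a hedged usage sketch of such a tiled traversal, assuming the class also exposes a get(x, y) counterpart to the set(x, y, value) call shown above (sum is a local accumulator introduced only for illustration):
int sum = 0;
constexpr int TILE = 4;   // matches the 4x4 cache geometry above
for (int ti = 0; ti < 10; ti += TILE)
    for (int tj = 0; tj < 10; tj += TILE)
        for (int i = ti; i < std::min(ti + TILE, 10); i++)
            for (int j = tj; j < std::min(tj + TILE, 10); j++)
                sum += cache.get(i, j);   // neighbouring accesses stay within one tile, so they tend to hit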
Related
The goal is to implement a sequence number generator in modern C++. The context is a concurrent environment.
Requirement #1: The class must be a singleton (common for all threads).
Requirement #2: The type used for the numbers is a 64-bit integer.
Requirement #3: The caller can request more than one number.
Requirement #4: This class will cache a sequence of numbers before being able to serve the calls. Because it caches a sequence, it must also store the upper bound, i.e. the maximum number it is able to return.
Requirement #5: Last but not least, at startup (in the constructor) and whenever there are no available numbers to give (n_requested > n_available), the singleton class must query the database to get a new sequence. This load from the DB updates both seq_n_ and max_seq_n_.
A brief draft for its interface is the following:
class singleton_sequence_manager {
public:
    static singleton_sequence_manager& instance() {
        static singleton_sequence_manager s;
        return s;
    }
    std::vector<int64_t> get_sequence(int64_t n_requested);
private:
    singleton_sequence_manager();   // Constructor
    void get_new_db_sequence();     // Gets a new sequence from DB
    int64_t seq_n_;
    int64_t max_seq_n_;
};
Example just to clarify the use case.
Suppose that at startup, the DB sets seq_n_ to 1000 and max_seq_n_ to 1050:
get_sequence(20); // Gets [1000, 1019]
get_sequence(20); // Gets [1020, 1039]
get_sequence(5);  // Gets [1040, 1044]
get_sequence(10); // In order to serve this call, a new sequence must be loaded from the DB
Obviously, an implementation using locks and std::mutex is quite simple.
What I am interested in is implementing a lock-free version using std::atomic and atomic operations.
My first attempt is the following one:
int64_t seq_n_;
int64_t max_seq_n_;
were changed to:
std::atomic<int64_t> seq_n_;
std::atomic<int64_t> max_seq_n_;
Getting a new sequence from the DB just stores the new values into the atomic variables:
void singleton_sequence_manager::get_new_db_sequence() {
    // Sync call is made to DB
    // Let's just ignore unhappy paths for simplicity
    seq_n_.store( start_of_seq_got_from_db );
    max_seq_n_.store( end_of_seq_got_from_db );
    // At this point, the class can start returning numbers in [seq_n_ : max_seq_n_]
}
And now the get_sequence function, using the atomic compare-and-swap technique:
std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
    bool succeeded{false};
    int64_t current_seq{};
    int64_t next_seq{};
    do {
        current_seq = seq_n_.load();
        do {
            next_seq = current_seq + n_requested + 1;
        } while( !seq_n_.compare_exchange_weak( current_seq, next_seq ) );
        // After the CAS, the caller gets the sequence [current_seq:next_seq-1]
        // Check if the sequence is within the cached bound.
        if( max_seq_n_.load() > next_seq - 1 )
            succeeded = true;
        else // Needs to load a new sequence from DB, and re-calculate again
            get_new_db_sequence();
    } while( !succeeded );
    // Building the response
    std::vector<int64_t> res{};
    res.reserve(n_requested);
    for(int64_t n = current_seq ; n < next_seq ; n++)
        res.push_back(n);
    return res;
}
Thoughts:
I am really concerned about the lock-free version. Is the implementation safe? If we ignore the DB load part, obviously yes. The problem arises (in my head at least) when the class has to load a new sequence from the DB. Is the update from the DB safe? Two atomic stores?
My second attempt was to combine both seq_n_ and max_seq_n_ into a struct called sequence and use a single std::atomic<sequence> variable, but compilation failed because the size of the struct sequence is greater than 64 bits.
Is it possible to somehow protect the DB part by using an atomic flag that marks whether the sequence is ready yet: the flag is set to false while waiting for the DB load to finish and for both atomic variables to be updated. get_sequence would then have to wait for the flag to be set to true (use of a spin lock?).
Your lock-free version has a fundamental flaw, because it treats two independent atomic variables as one entity. Since writes to seq_n_ and max_seq_n_ are separate statements, they can be separated during execution resulting in the use of one of them with a value that is incorrect when paired with the other.
For example, one thread can get past the inner CAS while loop (with an n_requested that is too large for the current cached sequence), then be suspended before checking whether it is cached. A second thread can come through and update the max_seq_n value to a larger value. The first thread then resumes and passes the max_seq_n check because the value was updated by the second thread. It is now using an invalid sequence.
A similar thing can happen in get_new_db_sequence between the two store calls.
Since you're writing to two distinct locations (even if adjacent in memory), and they cannot be updated atomically (due to the combined size of 128 bits not being a supported atomic size with your compiler), the writes must be protected by a mutex.
A spin lock should only be used for very short waits, since it does consume CPU cycles. A typical usage would be to use a short spin lock, and if the resource is still unavailable use something more expensive (like a mutex) to wait using CPU time.
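To make that concrete, here is a minimal sketch of the mutex-protected variant, assuming a std::mutex member (call it mutex_) is added to the draft interface above; the refill condition is illustrative, not the poster's actual DB code:
std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
    std::lock_guard<std::mutex> lock(mutex_);   // mutex_ is an assumed extra member
    // Refill when the cached range cannot satisfy the request (requirement #5).
    if (seq_n_ + n_requested - 1 > max_seq_n_)
        get_new_db_sequence();                  // updates seq_n_ and max_seq_n_ while the lock is still held
    std::vector<int64_t> res;
    res.reserve(n_requested);
    for (int64_t i = 0; i < n_requested; ++i)
        res.push_back(seq_n_++);
    return res;
}
With this, seq_n_ and max_seq_n_ can stay plain int64_t, since every read and write of them happens under the lock.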
I have a class that stores the latest value of some incoming realtime data (around 150 million events/second).
Suppose it looks like this:
class DataState
{
    Event latest_event;
public:
    // pushes event atomically
    void push_event(const Event* __restrict__ e);
    // pulls event atomically
    Event pull_event();
};
I need to be able to push events atomically and pull them with strict ordering guarantees. Now, I know I can use a spinlock, but given the massive event rate (over 100 million/second) and high degree of concurrency I'd prefer to use lockfree operations.
The problem is that Event is 64 bytes in size. There is no CMPXCHG64B instruction on any current X86 CPU (as of August '16). So if I use std::atomic<Event> I'd have to link to libatomic which uses mutexes under the hood (too slow).
So my solution was to instead atomically swap pointers to the value. The problem is that dynamic memory allocation becomes a bottleneck at these event rates. So... I define something I call a "ring allocator":
/// @brief Lockfree static short-lived allocator used for a ringbuffer.
/// Elements are guaranteed to persist only for "size" calls to get_next()
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t arena_size;
public:
    /// @brief Creates a new RingAllocator
    /// @param size The number of elements in the underlying arena. Make this large enough to avoid overwriting fresh data
    RingAllocator<T>(std::size_t size) : arena_size(size)
    {
        // allocate pool
        arena = new T[size];
        // zero out pool
        std::memset(arena, 0, sizeof(T) * size);
        arena_idx = 0;
    }
    ~RingAllocator()
    {
        delete[] arena;
    }
    /// @brief Return next element's pointer. Thread-safe
    /// @return pointer to next available element
    T *get_next()
    {
        return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
    }
};
Then I could have my DataState class look like this:
class DataState
{
    std::atomic<Event*> latest_event;
    RingAllocator<Event> event_allocator;
public:
    // pushes event atomically
    void push_event(const Event* __restrict__ e)
    {
        // store event
        Event *new_ptr = event_allocator.get_next();
        *new_ptr = *e;
        // swap event pointers
        latest_event.store(new_ptr, std::memory_order_release);
    }
    // pulls event atomically
    Event pull_event()
    {
        return *(latest_event.load(std::memory_order_acquire));
    }
};
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. Plus everything's super localized so indirection won't cause bad cache performance. Any possible pitfalls with this approach?
The DataState class:
I thought it was going to be a stack or queue, but it isn't, so push / pull don't seem like good names for methods. (Or else the implementation is totally bogus).
It's just a latch that lets you read the last event that any thread stored.
There's nothing to stop two writes in a row from overwriting an element that's never been read. There's also nothing to stop you reading the same element twice.
If you just need somewhere to copy small blocks of data, a ring buffer does seem like a decent approach. But if you don't want to lose events, I don't think you can use it this way. Instead, just get a ring buffer entry, then copy to it and use it there. So the only atomic operation should be incrementing the ring buffer position index.
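A minimal sketch of that idea, just to show the indexing (the names are made up here, and it deliberately ignores the overwrite/visibility handling mentioned above):
struct Slot { Event ev; };

constexpr std::size_t RING_SIZE = 1024;        // power of two, illustrative
Slot ring[RING_SIZE];
std::atomic<std::size_t> write_idx{0};

void publish(const Event& e)
{
    // the only shared atomic operation: claim a slot index
    std::size_t idx = write_idx.fetch_add(1, std::memory_order_relaxed) & (RING_SIZE - 1);
    ring[idx].ev = e;                          // then copy into the claimed entry and use it there
}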
The ring buffer
You can make get_next() much more efficient. This line does an atomic post-increment (fetch_add) and an atomic exchange:
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
I'm not even sure it's safe, because the xchg can maybe step on the fetch_add from another thread. Anyway, even if it's safe, it's not ideal.
You don't need that. Make sure the arena_size is always a power of 2, then you don't need to modulo the shared counter. You can just let it go, and have every thread modulo it for their own use. It will eventually wrap, but it's a binary integer so it will wrap at a power of 2, which is a multiple of your arena size.
I'd suggest storing an AND-mask instead of a size, so there's no risk of the % compiling to anything other than an and instruction, even if it's not a compile-time constant. This makes sure we avoid a 64-bit integer div instruction.
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t size_mask;  // maybe even make this a template parameter?
public:
    RingAllocator<T>(std::size_t size)
        : arena_idx(0), size_mask(size-1)
    {
        // verify that size is actually a power of two, so the mask is all-ones in the low bits, and all-zeros in the high bits.
        // so that i % size == i & size_mask for all i
        ...
    }
    ...
    T *get_next() {
        size_t idx = arena_idx.fetch_add(1, std::memory_order_relaxed); // still atomic, but we don't care which order different threads take blocks in
        idx &= size_mask;  // modulo our local copy of the idx
        return &arena[idx];
    }
};
Allocating the arena would be more efficient if you used calloc instead of new + memset. The OS already zeros pages before giving them to user-space processes (to prevent information leakage), so writing them all is just wasted work.
arena = new T[size];
std::memset(arena, 0, sizeof(T) * size);
// vs.
arena = (T*)calloc(size, sizeof(T));
Writing the pages yourself does fault them in, so they're all wired to real physical pages, instead of just copy-on-write mappings for a system-wide shared physical zero page (like they are after new/malloc/calloc). On a NUMA system, the physical page chosen might depend on which thread actually touched the page, rather than which thread did the allocation. But since you're reusing the pool, the first core to write a page might not be the one that ends up using it most.
Maybe something to look for in microbenchmarks / perf counters.
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. .... Any possible pitfalls with this approach?
The pitfall is, IIUC, that your statement is wrong.
If I have just 2 threads, and 10 elements in the ring buffer, the first thread could call pull_event once, and be "mid-pulling", and then the second thread could call push 10 times, overwriting what thread 1 is pulling.
Again, assuming I understand your code correctly.
Also, as mentioned above,
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
that arena_idx++ inside the exchange on the same variable, just looks wrong. And in fact is wrong. Two threads could increment it - ThreadA increments to 8 and threadB increments to 9, and then threadB exchanges it to 9, then threadA exchanges it to 8. whoops.
atomic(op1) # atomic(op2) != atomic(op1 # op2)
I worry about what else is wrong in code not shown. I don't mean that as an insult - lock-free is just not easy.
Have you looked at any of the C++ Disruptor (Java) ports that are available?
disruptor--
disruptor
Although they are not complete ports, they may offer all that you need. I am currently working on a more fully featured port; however, it's not quite ready.
As sort of a side project, I'm working on a multithreaded sum algorithm which would outperform std::accumulate when working on a large enough array. First I'm going to describe my thought process leading up to this, but if you want to skip straight to the problem, feel free to scroll down to that part.
I found many parallel sum algorithms online, most of which take the following approach:
template <typename T, typename IT>
T parallel_sum(IT _begin, IT _end, T _init) {
    const auto size = distance(_begin, _end);
    static const auto n = thread::hardware_concurrency();
    if (size < 10000 || n == 1) return accumulate(_begin, _end, _init);
    vector<future<T>> partials;
    partials.reserve(n);
    auto chunkSize = size / n;
    for (unsigned i{ 0 }; i < n; i++) {
        partials.push_back(async(launch::async, [](IT _b, IT _e){
            return accumulate(_b, _e, T{0});
        }, next(_begin, i*chunkSize), (i==n-1)?_end:next(_begin, (i+1)*chunkSize)));
    }
    for (auto& f : partials) _init += f.get();
    return _init;
}
Assuming there are 2 threads available (as reported by thread::hardware_concurrency()), this function would access the elements in memory the following way:
As a simple example, we are looking at 8 elements here. The two threads are indicated by red and blue. The arrows show the locations from which the threads wish to load data. Once the cells turn either red or blue, they have been loaded by the corresponding thread.
This approach (at least in my opinion) is not the best, since the threads load data from different parts of memory simultaneously. If you have many processing threads, say 16 on an 8-core hyper-threaded CPU, or even more than that, the CPU's prefetcher would have a very hard time keeping up with all these reads from completely different parts of memory (assuming the array is far too big to fit in cache). This is why I think the second example should be faster:
template <typename T, typename IT>
T parallel_sum2(IT _begin, IT _end, T _init) {
    const auto size = distance(_begin, _end);
    static const auto n = thread::hardware_concurrency();
    if (size < 10000 || n == 1) return accumulate(_begin, _end, _init);
    vector<future<T>> partials;
    partials.reserve(n);
    for (unsigned i{ 0 }; i < n; i++) {
        partials.push_back(async(launch::async, [](IT _b, IT _e, unsigned _s){
            T _ret{ 0 };
            for (; _b < _e; advance(_b, _s)) _ret += *_b;
            return _ret;
        }, next(_begin, i), _end, n));
    }
    for (auto& f : partials) _init += f.get();
    return _init;
}
This function accesses memory in a sort-of-sequential way, like so:
This way the prefetcher is always able to stay ahead, since all the threads access the same-ish part of memory, so there should be fewer cache misses and faster load times overall, at least I think so.
The problem is that while this is all fine and dandy in theory, actual compiled versions of these show a different result. The second one is way slower. I dug a little deeper into the problem, and found out that the assembly code that is produced for the actual addition is very different. These are the "hot loops" in each one that perform the addition (remember that the first one uses std::accumulate internally, so you're basically looking at that):
Please ignore the percentages and the colors, my profiler sometimes gets things wrong.
I noticed that std::accumulate, when compiled, uses an AVX2 vector instruction, vpaddq. This can add four 64-bit integers at once. I think the reason the second version cannot be vectorized is that each thread only accesses one element at a time and then skips over several. The vector addition would load several contiguous elements and then add them together. Clearly this cannot be done, since the threads don't load elements contiguously. I tried manually unrolling the for loop in the second version, and that vector instruction did appear in the assembly, but the whole thing became painfully slow for some reason.
The above results and assembly code comes from a gcc-compiled version, but the same kind of behavior can be observed with Visual Studio 2015 as well, although I haven't looked at the assembly it produces.
So is there a way to take advantage of vector instructions while retaining this sequential memory access model? Or is this memory access method something that would help at all when compared to the first version of the function?
I wrote a little benchmark program, which is ready to compile and run, just in case you want to see the performance yourself.
PS.: My primary target hardware is modern x86_64 (like haswell and such).
Each core has its own cache and prefetching.
You should look at each thread as an independently executing program. In this case the shortcomings of the second approach become clear: you do not access sequential data within a single thread. There are holes which should not be processed, so the thread cannot use vector instructions.
Another problem: the CPU prefetches data in chunks. Due to how the different cache levels work, changing some data within a chunk marks that cache line stale, and if another core tries to do some operation on the same chunk of data it has to wait until the first core writes its changes back and then retrieve the chunk again. Basically, in your second example the cache is always stale and you see raw memory access performance.
The best way to handle concurrent processing is to process data in large sequential chunks, as your first version already does.
Premise
I want to do some kind of computation involving k long vectors of data (each of length n), which I receive in main memory, and write some kind of result back to main memory. For simplicity suppose also that the computation is merely
for(i = 0; i < n; i++)
v_out[i] = foo(v_1[i],v_2[i], ... ,v_k[i])
or perhaps
for(i = 0; i < n; i++)
v_out[i] = bar_k( ... bar_2( bar_1( v_1[i] ), v_2[i] ) ..., v_k[i] )
(this is not code, it's pseudocode.) foo() and bar_i() functions have no side-effects. k is constant (known at compile-time), n is known only right before this computation occurs (and it is relatively big - at least a few times larger than the entire L2 cache size and maybe more).
Suppose I am on a single thread on a single core of an x86_64 processor (Intel, or AMD, or what-have-you; the choice does matter, probably). Finally, suppose foo() (resp. bar_i()) is not an intensive computation, i.e. the time to read data from memory and write it back is significant or even dominant relative to the n (resp. k×n) invocations of foo() (resp. bar_i()).
Question
How do I arrange this computation, so as to avoid:
The data from one input vector clearing out cached data for another vector.
Input vector data clearing out cached output vector data.
Intermediary results of bar_j(...bar_1(v_1[i])...) being evicted from registers or the L1 cache (when there is enough capacity to hold them there) before the data for v_{j+1}[i] ... v_k[i] arrives and allows us to complete the computation. Same for L2.
The L1 cache lines of the output vectors being cleared while we intend to continue working on elements in that cache line. Same for L2.
Under-utilization of memory bandwidth.
core idle time, to the extent possible.
Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).
Notes:
Any rearrangement of the input data counts towards the total computation time. The vectors will not be re-used in the alternate arrangement so it's basically a waste of time.
If it makes things easier for you to assume alignment, or lack of alignment, that's fine, just say so.
Computing with the bar_i functions allows more flexibility with the access patterns, but creates additional challenges w.r.t. the caching of v_out values.
The data from one input vector clearing out cached data for another vector.
If you are using v_1[i], v_2[i], ..., v_k[i] in the same function call, the input vectors will not clear out data cached for the other vectors. For each element you read from a vector, the CPU fetches just one cache line, not the entire vector. So if you read one element from each of the k vectors, you bring in k cache lines in total, one per vector.
Input vector data clearing out cached output vector data.
This is the same case as above: it won't happen.
Output vector data remaining in registers or L1 cache for as long as we need it (in the bar case).
You could try to use _mm_prefetch intrinsics to fetch the data before writing it.
Under-utilization of memory bandwidth.
To do that you need to maximize the number of full-width transactions: when the CPU fetches a cache line, all of its elements should be used right away. To achieve this you must rearrange your data. I would consider all the k vectors as a matrix of k x n elements, stored in column-major format.
type* pMat = (type*)aligned_alloc(CACHE_LINE_SIZE, n * k * sizeof(type));
v_0[i] = pMat[i * k + 0];
v_1[i] = pMat[i * k + 1];
// ...
v_k-1[i] = pMat[i * k + k-1];
That will put the v_0, ... v_k elements in SIMD registers and you might have the chance of better vectorization.
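A hedged sketch of how the main loop would then consume that layout (type, n, k, foo and v_out come from the question; the pointer arithmetic is only illustrative):
for (std::size_t i = 0; i < n; i++) {
    const type* elem = &pMat[i * k];   // v_1[i] ... v_k[i] are now contiguous
    v_out[i] = foo(elem[0], elem[1] /* , ..., elem[k-1] */);
}
Each fetched cache line is then fully consumed before the loop moves on to the next element.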
core idle time, to the extent possible.
Fewer cache misses and fewer transcendental instructions will lead to less idle time.
Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).
You could decrease the price of the sequences using prefetching (_mm_prefetch).
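A sketch of what that might look like, using names from the question (the prefetch distance is a guess that would need tuning on the target CPU; _MM_HINT_T0 requests the line into all cache levels):
#include <xmmintrin.h>   // _mm_prefetch

constexpr std::size_t PREFETCH_AHEAD = 16;   // illustrative distance, tune per CPU
for (std::size_t i = 0; i < n; i++) {
    if (i + PREFETCH_AHEAD < n)
        _mm_prefetch(reinterpret_cast<const char*>(&v_out[i + PREFETCH_AHEAD]), _MM_HINT_T0);
    v_out[i] = foo(v_1[i], v_2[i] /* , ..., v_k[i] */);
}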
To reduce the clearing out of cached data, you could rearrange your k vectors into one vector holding a structure with k members. That way, the loop will access those elements in sequence and not jump around in memory.
struct VectorData
{
    Type1 Var1;
    Type2 Var2;
    // ...
    TypeK VarK;
};

std::vector<VectorData> v_in;

for (i = 0; i < n; i++){
    v_out[i] = foo(v_in[i].Var1, v_in[i].Var2, ... , v_in[i].VarK);
    // Or just pass the whole element:
    v_out[i] = foo(v_in[i]);
}
I need to implement an LRU algorithm in a 3D renderer for texture caching. I write the code in C++ on Linux.
In my case I will use texture caching to store "tiles" of image data (16x16 pixel blocks). Now imagine that I do a lookup in the cache and get a hit (the tile is in the cache). How do I return the content of the cache for that entry to the function caller? Let me explain. I imagine that when I load a tile into the cache, I allocate the memory to store 16x16 pixels, for example, then load the image data for that tile. Now there are two options for passing the content of the cache entry to the function caller:
1) either return a pointer to the tile data (fast, memory efficient),
TileData *tileData = cache->lookup(tileId); // not safe?
2) or copy the tile data from the cache into memory allocated by the function caller (the copy can be slow).
void Cache::lookup(int tileId, float *&tileData)
{
    // find tile in cache, if not in cache load from disk add to cache, ...
    ...
    // now copy tile data, safe but isn't that slow?
    memcpy((char*)tileData, tileDataFromCache, sizeof(float) * 3 * 16 * 16);
}

float *tileData = new float[3 * 16 * 16]; // need to allocate the memory for that tile
// get tile data from cache, requires a copy
cache->lookup(tileId, tileData);
I would go with 1), but the problem is: what happens if the tile gets deleted from the cache just after the lookup, and the function then tries to access the data through the returned pointer? The only solution I see to this is to use a form of reference counting (auto_ptr?) where the data is only actually deleted when it is no longer used.
Also, the application might access more than one texture. I can't seem to find a way of creating a key which is unique to each texture and each tile of a texture. For example, I may have tile 1 from file1 and tile 1 from file2 in the cache, so searching on tileId=1 is not enough... I can build a string that contains the file name and the tile ID (FILENAME_TILEID), but wouldn't a string used as a key be much slower than an integer?
Finally, I have a question regarding time stamps. Many papers suggest using a time stamp for ordering the entries in the cache. What is a good function to use as a time stamp? The time() function, clock()? Is there a better way than using time stamps?
Sorry, I realise it's a very long message, but LRU doesn't seem as simple to implement as it sounds.
Answers to your questions:
1) Return a shared_ptr (or something logically equivalent to it). Then all of the "when-is-it-safe-to-delete-this-object" issues pretty much go away.
2) I'd start by using a string as a key, and see whether it actually is too slow or not. If the strings aren't too long (e.g. your filenames aren't too long) then you may find it's faster than you expect. If you do find that string keys aren't efficient enough, you could try something like computing a hashcode for the string and adding the tile ID to it... that would probably work in practice, although there would always be the possibility of a hash collision. But you could have a collision-check routine run at startup that generates all of the possible filename+tileID combinations and alerts you if any two of them map to the same key value, so that at least you'd know immediately during your testing when there is a problem and could do something about it (e.g. by adjusting your filenames and/or your hashcode algorithm). This assumes that all the filenames and tile IDs are going to be known in advance, of course.
3) I wouldn't recommend using a timestamp, it's unnecessary and fragile. Instead, try something like this (pseudocode):
typedef shared_ptr<TileData> TileDataPtr;   // automatic memory management!

linked_list<TileDataPtr> linkedList;
hash_map<data_key_t, TileDataPtr> hashMap;

// This is the method the calling code would call to get its tile data for a given key
TileDataPtr GetData(data_key_t theKey)
{
    if (hashMap.contains_key(theKey))
    {
        // The desired data is already in the cache, great!  Just move it to the head
        // of the LRU list (to reflect its popularity) and then return it.
        TileDataPtr ret = hashMap.get(theKey);
        linkedList.remove(ret);     // move this item to the head
        linkedList.push_front(ret); // of the linked list -- this is O(1)/fast
        return ret;
    }
    else
    {
        // Oops, the requested object was not in our cache, load it from disk or whatever
        TileDataPtr ret = LoadDataFromDisk(theKey);
        linkedList.push_front(ret);
        hashMap.put(theKey, ret);

        // Don't let our cache get too large -- delete
        // the least-recently-used item if necessary
        if (linkedList.size() > MAX_LRU_CACHE_SIZE)
        {
            TileDataPtr dropMe = linkedList.tail();
            hashMap.remove(dropMe->GetKey());
            linkedList.remove(dropMe);
        }
        return ret;
    }
}
In the same order as your questions:
Copying over the texture data does not seem reasonable from a performance standpoint. Reference counting sounds far better, as long as you can actually code it safely. The data memory would be freed as soon as it is neither used by the renderer nor has a reference stored in the cache.
I assume that you are going to use some sort of hash table for the look-up part of what you are describing. The common solution to your problem has two parts:
Using a suitable hashing function that combines multiple values, e.g. the texture file name and the tile ID. Essentially you create a composite key that is treated as one entity. The hashing function could be an XOR of the hashes of all elementary components, or something more complex (see the sketch after this list).
Selecting a suitable hash function is critical for performance reasons - if the said function is not random enough, you will have a lot of hash collisions.
Using a suitable composite equality check to handle the case of hash collisions.
This way you can look-up the combination of all attributes of interest in a single hash table look-up.
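A possible sketch of such a composite key (the member names and the hash-combine constant are illustrative, not from the question):
#include <cstdint>
#include <functional>
#include <string>

// Filename + tile ID treated as one key, with a combined hash and a full
// equality check to handle hash collisions.
struct TileKey {
    std::string file;
    uint32_t    tileId;
    bool operator==(const TileKey& o) const {
        return tileId == o.tileId && file == o.file;
    }
};

struct TileKeyHash {
    std::size_t operator()(const TileKey& k) const {
        std::size_t h = std::hash<std::string>()(k.file);
        // boost::hash_combine-style mixing of the two component hashes
        h ^= std::hash<uint32_t>()(k.tileId) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
        return h;
    }
};

// usage: std::unordered_map<TileKey, TileDataPtr, TileKeyHash> cache;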
Using timestamps for this is not going to work - period. Most sources regarding caching usually describe the algorithms in question with network resource caching in mind (e.g. HTTP caches). That is not going to work here for three reasons:
Using natural time only makes sense if you intend to implement caching policies that take it into account, e.g. dropping a cache entry after 10 minutes. Unless you are doing something very weird, something like this makes no sense within a 3D renderer.
Timestamps have a relatively low actual resolution, even if you use high precision timers. Most timer sources have a precision of about 1ms, which is a very long time for a processor - in that time your renderer would have worked through several texture entries.
Do you have any idea how expensive timer calls are? Abusing them like this could even make your system perform worse than not having any cache at all...
The usual solution to this problem is to not use a timer at all. The LRU algorithm only needs to know two things:
The maximum number of entries allowed.
The order of the existing entries w.r.t. their last access.
Item (1) comes from the configuration of the system and typically depends on the available storage space. Item (2) generally implies the use of a combined linked list/hash table data structure, where the hash table part provides fast access and the linked list retains the access order. Each time an entry is accessed, it is placed at the end of the list, while old entries are removed from its start.
Using a combined data structure, rather than two separate ones, allows entries to be removed from the hash table without having to go through a look-up operation. This improves the overall performance, but it is not absolutely necessary.
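To make that concrete, here is a minimal sketch of such a combined structure: the hash map stores list iterators, so an entry can be moved to the front or erased in O(1) without scanning the list. The class and member names are illustrative, not from the question.
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value>
class LruIndex {
    std::list<std::pair<Key, Value>> order_;   // front = most recently used
    std::unordered_map<Key, typename std::list<std::pair<Key, Value>>::iterator> index_;
    std::size_t capacity_;
public:
    explicit LruIndex(std::size_t capacity) : capacity_(capacity) {}

    // Insert or refresh an entry, evicting the least-recently-used one if needed.
    void put(const Key& k, const Value& v) {
        auto it = index_.find(k);
        if (it != index_.end()) {
            it->second->second = v;
            order_.splice(order_.begin(), order_, it->second);   // O(1) move to front
            return;
        }
        order_.emplace_front(k, v);
        index_[k] = order_.begin();
        if (order_.size() > capacity_) {
            index_.erase(order_.back().first);   // drop the LRU entry without a list scan
            order_.pop_back();
        }
    }

    // Returns nullptr if the key is absent; refreshes recency on a hit.
    Value* get(const Key& k) {
        auto it = index_.find(k);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }
};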
As promised I am posting my code. Please let me know if I have made mistakes or if I could improve it further. I am now going to look into making it work in a multi-threaded environment. Again thanks to Jeremy and Thkala for their help (sorry the code doesn't fit the comment block).
#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map>
#include <cstdint>
#include <iostream>

typedef uint32_t data_key_t;

class TileData
{
public:
    TileData(const data_key_t &key) : theKey(key) {}
    data_key_t theKey;
    ~TileData() { std::cerr << "delete " << theKey << std::endl; }
};

typedef std::shared_ptr<TileData> TileDataPtr; // automatic memory management!

TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
    return std::shared_ptr<TileData>(new TileData(theKey));
}

class CacheLRU
{
public:
    // the linked list keeps track of the order in which the data was accessed
    std::list<TileDataPtr> linkedList;
    // the hash map (unordered_map is part of C++0x while hash_map isn't?) gives quick access to the data
    std::unordered_map<data_key_t, TileDataPtr> hashMap;

    CacheLRU() : cacheMiss(0), cacheHit(0) {}

    TileDataPtr getData(data_key_t theKey)
    {
        std::unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
        if (iter != hashMap.end()) {
            TileDataPtr ret = iter->second;
            linkedList.remove(ret);
            linkedList.push_front(ret);
            ++cacheHit;
            return ret;
        }
        else {
            ++cacheMiss;
            TileDataPtr ret = loadDataFromDisk(theKey);
            linkedList.push_front(ret);
            hashMap.insert(std::make_pair(theKey, ret));
            if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
                const TileDataPtr dropMe = linkedList.back();
                hashMap.erase(dropMe->theKey);
                linkedList.remove(dropMe);
            }
            return ret;
        }
    }

    static const uint32_t MAX_LRU_CACHE_SIZE = 8;
    uint32_t cacheMiss, cacheHit;
};

int main(int argc, char **argv)
{
    CacheLRU cache;
    for (uint32_t i = 0; i < 238; ++i) {
        int key = random() % 32;
        TileDataPtr tileDataPtr = cache.getData(key);
    }
    std::cerr << "Cache hit: " << cache.cacheHit << ", cache miss: " << cache.cacheMiss << std::endl;
    return 0;
}