Imagine you have a pretty big array of double and a simple function avg(double*,size_t) that computes the average value (just a simple example: both the array and the function could be whatever data structure and algorithm). I would like that if the function is called a second time and the array is not changed in the meanwhile, the return value comes directly from the previous one, without going through the unchanged data.
To hold the previous value looks simple, I just need a static variable inside the function, right? But what about detecting the changes in the array? Do I need to write an interface to access the array which sets a flag to be read by the function? Can something smarter and more portable be done?
As Kerrek SB so astutely put it, this is known as "memoization." I'll cover my personal favorite method at the end (both with a raw double* array and the much easier DoubleArray), so you can skip to there if you just want to see code. However, there are many ways to solve this problem, and I wanted to cover them all, including those suggested by others.
The first part is some theory and alternate approaches. There are fundamentally four parts to the problem:
Prove the function is idempotent (calling a function once is the same as calling it any number of times)
Cache results keyed to the inputs
Search cached results given a new set of inputs
Invalidating cached results which are no longer accurate/current
The first step is easy for you: average is idempotent. It has no side effects.
Caching the results is a fun step. You obviously are going to create some "key" for the inputs that you can compare against the cached "keys." In Kerrek SB's memoization example, the key is a tuple of all of the arguments, compared against other keys with ==. In your system, the equivalent solution would be to have the key be the contents of the entire array. This means each key comparison is O(n), which is expensive. If the function was more expensive to calculate than the average function is, this price may be acceptable. However in the case of averaging, this key is terribly expensive.
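To make that cost concrete, here is a minimal sketch of content-keyed memoization (the function name and the use of std::map are my own illustration, not code from any answer here). Both building the key and each key comparison walk the whole array:

#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Memoize avg keyed on a copy of the array contents (assumes n > 0).
double avg_content_keyed(const double* array, std::size_t n)
{
    static std::map<std::vector<double>, double> cache;
    std::vector<double> key(array, array + n);        // O(n) copy just to build the key
    std::map<std::vector<double>, double>::iterator it = cache.find(key); // each comparison is O(n)
    if (it != cache.end())
        return it->second;
    double result = std::accumulate(array, array + n, 0.0) / n;
    cache[key] = result;
    return result;
}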
This leads one on the open-ended search for good keys. Dieter Lücking's answer was to key the array pointer. This is O(1), and wicked fast to boot. However, it also makes the assumption that once you've calculated the average for an array, that array's values never change, and that memory address is never re-used for another array. Solutions for this come later, in the invalidation portion of the task.
Another popular key is HotLick's (1) in the comments. You use a unique identifier for the array (a pointer or, better yet, a unique integer idx that will never be used again) as your key. Each array then has a "dirty bit for avg" that it is expected to set to true whenever a value is changed. Caches first look for the dirty bit. If it is true, they ignore the cached value, calculate the new value, cache the new value, then clear the dirty bit, indicating that the cached value is now valid. (This is really invalidation, but it fits well in this part of the answer.)
This technique assumes that there are more calls to avg than updates to the data. If the array is constantly dirty, then avg still has to keep recalculating, but we still pay the price of setting the dirty bit on every write (slowing it down).
This technique also assumes that there is only one function, avg, which needs cached results. If you have many functions, it starts to get expensive to keep all of the dirty bits up to date. The solution is an "epoch" counter. Instead of a dirty bit, you have an integer, which starts at 0. Every write increments it. When you cache a result, you cache not only the identity of the array, but its epoch as well. When you check to see if you have a cached value, you also check to see if the epoch changed. If it did change, you can't prove your old results are current, and have to throw them out.
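Here is a hypothetical sketch of the epoch idea (the EpochArray wrapper and all names are mine); the cache remembers which array and which epoch its value came from:

#include <cstddef>
#include <numeric>

struct EpochArray
{
    double*     data;
    std::size_t size;
    unsigned    epoch;                                   // incremented on every write
    void set(std::size_t i, double v) { data[i] = v; ++epoch; }
};

// Caches the most recent (array, epoch) pair only; assumes size > 0.
double avg_epoch(const EpochArray& a)
{
    static const EpochArray* cachedArray = 0;
    static unsigned          cachedEpoch = 0;
    static double            cachedValue = 0.0;
    if (cachedArray == &a && cachedEpoch == a.epoch)
        return cachedValue;                              // nothing changed since the last call
    cachedValue = std::accumulate(a.data, a.data + a.size, 0.0) / a.size;
    cachedArray = &a;
    cachedEpoch = a.epoch;
    return cachedValue;
}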
Storing the results is an interesting task. It is very easy to write a storing algorithm which uses up gobs of memory by remembering hundreds of thousands of old results to avg. Generally speaking, there needs to be a way to let the caching code know that an array has been destroyed, or a way to slowly remove old unused cache results. In the former case, the deallocator of the double arrays needs to let the cache code know that that array is being deallocated. In the latter case, it is common to limit a cache to 10 or 100 entries and evict old cache results.
The last piece is invalidation of caches. I spoke earlier regarding the dirty bit. The general pattern for this is that a value inside a cache must be marked invalid if the key it was stored under didn't change but the values in the array did change. This can obviously never happen if the key is a copy of the array, but it can occur when the key is an identifying integer or a pointer.
Generally speaking, invalidation is a way to add a requirement to your caller: if you want to use avg with caching, here's the extra work you are required to do to help the caching code.
Recently I implemented a system with such a cache invalidation scheme. It was very simple, and stemmed from one philosophy: the code which is calling avg is in a better position to determine if the array has changed than avg is itself.
There were two versions of the equivalent of avg: double avg(double* array, int n) and double avg(double* array, int n, CacheValidityObject& validity).
Calling the 2 argument version of avg never cached, because it had no guarantees that array had not changed.
Calling the 3 argument version of avg activated caching. The caller guarantees that, if it passes the same CacheValidityObject to avg without marking it dirty, then the arrays must be the same.
Putting the onus on the caller makes average trivial. CacheValidityObject is a very simple class to hold on to the results:
#include <boost/make_shared.hpp>
#include <boost/shared_ptr.hpp>

class CacheValidityObject
{
public:
CacheValidityObject(); // creates a new dirty CacheValidityObject
void invalidate(); // marks this object as dirty
// this function is used only by the `avg` algorithm. "friend" may
// be used here, but this example makes it public
boost::shared_ptr<void>& getData();
private:
boost::shared_ptr<void> mData;
};
inline void CacheValidityObject::invalidate()
{
mData.reset(); // blow away any cached data
}
double avg(double* array, int n); // defined as usual
double avg(double* array, int n, CacheValidityObject& validity)
{
// this function assumes validity.mData is null or a shared_ptr to a double
boost::shared_ptr<void>& data = validity.getData();
if (data) {
// The cached result, stored on the validity object, is still valid
return *boost::static_pointer_cast<double>(data);
} else {
// There was no cached result, or it was invalidated
double result = avg(array, n);
data = boost::make_shared<double>(result); // cache the result
return result;
}
}
// usage
{
double data[100];
fillWithRandom(data, 100);
CacheValidityObject dataCacheValidity;
double a = avg(data, 100, dataCacheValidity); // caches the average
double b = avg(data, 100, dataCacheValidity); // cache hit... uses cached result
data[0] = 0;
dataCacheValidity.invalidate();
double c = avg(data, 100, dataCacheValidity); // dirty.. caches new result
double d = avg(data, 100, dataCacheValidity); // cache hit.. uses cached result
// CacheValidityObject::~CacheValidityObject() will destroy the shared_ptr,
// freeing the memory used to cache the result
}
Advantages
Nearly the fastest caching possible (within a few opcodes)
Trivial to implement
Doesn't leak memory, saving cached values only when the caller thinks it may want to use them again
Disadvantages
Requires the caller to handle caching, instead of doing it implicitly for them.
If you wrap the double* array in a class, you can minimize the disadvantage. Assign each algorithm an index (this can be done at run time), and have the DoubleArray class maintain a map of cached values. Each modification to the DoubleArray invalidates the cached results. This is the easiest version to use, but it doesn't work with a naked array; you need a class to help you out:
class DoubleArray
{
public:
// all of the getters and setters and constructors.
// Special note: all setters MUST call invalidate()
double* getArray() { return mArray; }     // used by avg() below
int getSize() const { return mSize; }
CacheValidityObject getCache(int inIdx)
{
return mCaches[inIdx];
}
void setCache(int inIdx, const CacheValidityObject& inObj)
{
mCaches[inIdx] = inObj;
}
private:
void invalidate()
{
mCaches.clear();
}
std::map<int, CacheValidityObject> mCaches;
double* mArray;
int mSize;
};
inline int getNextAlgorithmIdx()
{
static int nextIdx = 1;
return nextIdx++;
}
static const int avgAlgorithmIdx = getNextAlgorithmIdx();
double avg(DoubleArray& inArray)
{
CacheValidityObject valid = inArray.getCache(avgAlgorithmIdx);
// use the 3 argument avg in the previous example
double result = avg(inArray.getArray(), inArray.getSize(), valid);
inArray.setCache(avgAlgorithmIdx, valid);
return result;
}
// usage
DoubleArray array(100);
fillRandom(array);
double a = avg(array); // calculates, and caches
double b = avg(array); // cache hit
array.set(0, 5); // invalidates caches
double c = avg(array); // calculates, and caches
double d = avg(array); // cache hit
#include <cstddef>
#include <limits>
#include <map>
// Note: You have to manage cached results - release it with avg(p, 0)!
double avg(double* p, std::size_t n) {
typedef std::map<double*, double> map;
static map results;
map::iterator pos = results.find(p);
if(n) {
// Calculate or get a cached value
if(pos == results.end()) {
double sum = 0.0;
for(std::size_t i = 0; i < n; ++i) sum += p[i];
pos = results.insert(map::value_type(p, sum / n)).first; // calculate and cache it
}
return pos->second;
}
// Erase a cached value (this is the avg(p, 0) release call)
if(pos != results.end()) results.erase(pos);
return std::numeric_limits<double>::quiet_NaN();
}
Related
I want to improve the performance of the following code. What aspect might affect the performance of the code when it's executed?
Also, considering that there is no limit to how many objects you can add to the container, what improvements could be made to “Object” or “addToContainer” to improve the performance of the program?
I was wondering whether std::vector::push_back in C++ affects the performance of the code in any way, especially if there is no limit to how much is added to the list.
#include <string>
#include <vector>
using namespace std;

struct Object {
string name;
string description;
};
vector<Object> container;
void addToContainer(Object object) {
container.push_back(object);
}
int main() {
addToContainer({ "Fira", "+5 ATTACK" });
addToContainer({ "Potion", "+10 HP" });
}
Before you do ANYTHING profile the code and get a benchmark. After you make a change profile the code and get a benchmark. Compare the benchmarks. If you do not do this, you're rolling dice. Is it faster? Who knows.
Profile profile profile.
With push_back you have two main concerns:
Resizing the vector when it fills up, and
Copying the object into the vector.
There are a number of improvements you can make to the resizing cost of push_back depending on how items are being added.
Strategic use of reserve to minimize the amount of resizing, for example. If you know how many items are about to be added, you can check the capacity and size to see if it's worth your time to reserve to avoid multiple resizes. Note this requires knowledge of vector's expansion strategy and that is implementation-specific. An optimization for one vector implementation could be a terribly bad mistake on another.
You can use insert to add multiple items at a time. Of course this is close to useless if you need to add another container into the code in order to bulk-insert.
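As a hedged sketch of both ideas, assuming the Object struct and global container vector from the question (note that for random-access iterators insert will generally handle the growth itself, so the explicit capacity check is mostly illustrative):

void addManyToContainer(const std::vector<Object>& incoming)
{
    // reserve only when the incoming batch would force a regrowth
    if (container.capacity() - container.size() < incoming.size())
        container.reserve(container.size() + incoming.size());
    // one range insert instead of incoming.size() push_back calls
    container.insert(container.end(), incoming.begin(), incoming.end());
}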
If you have no idea how many items are incoming, you might as well let vector do its job and optimize HOW the items are added.
For example
void addToContainer(Object object) // pass by value. Possible copy
{
container.push_back(object); // copy
}
Those copies can be expensive. Get rid of them.
void addToContainer(Object && object) //no copy and can still handle temporaries
{
container.push_back(std::move(object)); // moves rather than copies
}
std::string is often very cheap to move.
This variant of addToContainer can be used with
addToContainer({ "Fira", "+5 ATTACK" });
addToContainer({ "Potion", "+10 HP" });
and might just migrate a pointer and a few book-keeping variables per string. They are temporaries, so no one cares if it rips their guts out and throws away the corpses.
As for existing Objects
Object o{"Pizza pop", "+5 food"};
addToContainer(std::move(o));
If they are expendable, they get moved as well. If they aren't expendable...
void addToContainer(const Object & object) // no copy
{
container.push_back(object); // copy
}
You have an overload that does it the hard way.
Tossing this one out there
If you already have a number of items you know are going to be in the list, rather than appending them all one at a time, use an initialization list:
vector<Object> container{
{"Vorpal Cheese Grater", "Many little pieces"},
{"Holy Hand Grenade", "OMG Damage"}
};
push_back can be extremely expensive, but as with everything, it depends on the context. Take for example this terrible code:
std::vector<float> slow_func(const float* ptr)
{
std::vector<float> v;
for(size_t i = 0; i < 256; ++i)
v.push_back(ptr[i]);
return v;
}
each call to push_back has to do the following:
Check to see if there is enough space in the vector
If not, allocate new memory, and copy the old values into the new vector
copy the new item to the end of the vector
increment end
Now there are two big problems here wrt performance. Firstly each push_back operation depends upon the previous operation (since the previous operation modified end, and possibly the entire contents of the array if it had to be resized). This pretty much destroys any vectorisation possibilities in the code. Take a look here:
https://godbolt.org/z/RU2tM0
The func that uses push_back does not make for very pretty asm. It's effectively hamstrung into being forced to copy a single float at a time. Now if you compare that to an alternative approach where you resize first, and then assign; the compiler just replaces the whole lot with a call to new, and a call to memcpy. This will be a few orders of magnitude faster than the previous method.
std::vector<float> fast_func(const float* ptr)
{
std::vector<float> v(256);
for(size_t i = 0; i < 256; ++i)
v[i] = ptr[i];
return v;
}
BUT, and it's a big but, the relative performance of push_back very much depends on whether the items in the array can be trivially copied (or moved). If, for example, you do something silly like:
struct Vec3 {
float x = 0;
float y = 0;
float z = 0;
};
Well, now when we do this:
std::vector<Vec3> v(256);
The compiler will allocate memory, but also be forced to set all the values to zero (which is pointless if you are about to overwrite them again!). The obvious way around this is to use a different constructor:
std::vector<Vec3> v(ptr, ptr + 256);
So really, only use push_back (well, really you should prefer emplace_back in most cases) when either:
additional elements are added to your vector occasionally
or the objects you are adding are complex to construct (in which case, use emplace_back! A quick sketch follows.)
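A hedged sketch of that emplace_back suggestion (the constructor is my addition, because an aggregate Object cannot be emplaced from constructor arguments before C++20):

#include <string>
#include <utility>
#include <vector>

struct Object
{
    std::string name;
    std::string description;
    // a constructor lets emplace_back forward its arguments
    Object(std::string n, std::string d)
        : name(std::move(n)), description(std::move(d)) {}
};

std::vector<Object> container;

void addToContainer(std::string name, std::string description)
{
    // constructs the Object directly inside the vector's storage,
    // with no temporary Object to copy or move
    container.emplace_back(std::move(name), std::move(description));
}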
Without any other requirements, unfortunately this is the most efficient:
void addToContainer(Object) { }
To answer the rest of your question: in general, push_back will just add to the end of the allocated vector in O(1), but it will need to grow the vector on occasion, which is O(N) though it amortizes out.
Also, it would likely be more efficient not to use string but to keep a char*, although memory management might be tricky unless it is always a literal being added.
I have a large array (> millions) of Items, where each Item has the form:
struct Item { void *a; size_t b; };
There are a handful of distinct a fields—meaning there are many items with the same a field.
I would like to "factor" this information out to save about 50% memory usage.
However, the trouble is that these Items have a significant ordering, and that may change over time. Therefore, I can't just go ahead and make a separate Item[] for each distinct a, because that will lose the relative ordering of the items with respect to each other.
On the other hand, if I store the orderings of all the items in a size_t index; field, then I lose any memory savings from the removal of the void *a; field.
So is there a way for me to actually save memory here, or no?
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way. That one will require me to either use unaligned memory or to split every Item[] into two, which isn't great for memory locality, so I'd prefer something else.)
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way.)
This thinking is on the right track, but it's not that simple, since you will run into some nasty alignment/padding issues that will negate your memory gains.
At that point, when you start trying to scratch the last few bytes of a structure like this, you will probably want to use bit fields.
#include <climits>   // CHAR_BIT
#include <cstddef>   // size_t

#define A_INDEX_BITS 3
struct Item {
size_t a_index : A_INDEX_BITS;
size_t b : (sizeof(size_t) * CHAR_BIT) - A_INDEX_BITS;
};
Note that this will limit how many bits are available for b, but on modern platforms, where sizeof(size_t) is 8, stripping 3-4 bits from it is rarely an issue.
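For illustration, a usage sketch of my own (the a_table name and the linear scan are assumptions, not part of the answer above): the handful of distinct a pointers live in a small table, and each Item stores only an index into it.

// At most 2^A_INDEX_BITS distinct a values live in this table; each Item
// stores only an index into it, so sizeof(Item) stays one word.
void* a_table[1u << A_INDEX_BITS];

void* get_a(const struct Item* item) { return a_table[item->a_index]; }

size_t find_or_add_a(void* a)      // a linear scan is fine for a handful of values
{
    static size_t used = 0;
    for (size_t i = 0; i < used; ++i)
        if (a_table[i] == a) return i;
    a_table[used] = a;             // assumes the table never overflows
    return used++;
}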
Use a combination of lightweight compression schemes (see this for examples and some references) to represent the a values. #Frank's answer employs DICT followed by NS, for example. If you have long runs of the same pointer, you could consider RLE (Run-Length Encoding) on top of that.
This is a bit of a hack, but I've used it in the past with some success. The extra overhead for object access was compensated for by the significant memory reduction.
A typical use case is an environment where (a) values are actually discriminated unions (that is, they include a type indicator) with a limited number of different types and (b) values are mostly kept in large contiguous vectors.
With that environment, it is quite likely that the payload part of (some kinds of) values uses up all the bits allocated for it. It is also possible that the datatype requires (or benefits from) being stored in aligned memory.
In practice, now that aligned access is not required by most mainstream CPUs, I would just use a packed struct instead of the following hack. If you don't pay for unaligned access, then storing a { one-byte type + eight-byte value } as nine contiguous bytes is probably optimal; the only cost is that you need to multiply by 9 instead of 8 for indexed access, and that is trivial since the 9 is a compile-time constant.
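For reference, a sketch of that nine-byte packed layout (this is my illustration; the packed attribute shown is GCC/Clang-specific, MSVC would use #pragma pack instead):

#include <cstdint>

// 1-byte type tag + 8-byte payload, stored with no padding.
struct TaggedValue
{
    uint8_t  kind;
    uint64_t value;    // reading this member may be an unaligned access
} __attribute__((packed));

static_assert(sizeof(TaggedValue) == 9, "expected no padding");

// element i of a TaggedValue array lives at byte offset i * 9,
// and 9 is a compile-time constant, so indexed access stays cheap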
If you do have to pay for unaligned access, then the following is possible. Vectors of "augmented" values have the type:
// Assume that Payload has already been typedef'd. In my application,
// it would be a union of, eg., uint64_t, int64_t, double, pointer, etc.
// In your application, it would be b.
// Eight-byte payload version:
typedef struct Chunk8 { uint8_t kind[8]; Payload value[8]; } Chunk8;
// Four-byte payload version:
typedef struct Chunk4 { uint8_t kind[4]; Payload value[4]; } Chunk4;
// Whichever variant is in use below:
typedef Chunk8 Chunk;
Vectors are then vectors of Chunks. For the hack to work, they must be allocated on 8- (or 4-)byte aligned memory addresses, but we've already assumed that alignment is required for the Payload types.
The key to the hack is how we represent a pointer to an individual value, because the value is not contiguous in memory. We use a pointer to its kind member as a proxy:
typedef uint8_t* ValuePointer;
And then use the following low-but-not-zero-overhead functions:
#define P_SIZE 8U
#define P_MASK (P_SIZE - 1U)
// Internal function used to get the low-order bits of a ValuePointer.
static inline size_t vpMask(ValuePointer vp) {
return (uintptr_t)vp & P_MASK;
}
// Getters / setters. This version returns the address so it can be
// used both as a getter and a setter
static inline uint8_t* kindOf(ValuePointer vp) { return vp; }
static inline Payload* valueOf(ValuePointer vp) {
return (Payload*)(vp + 1 + (vpMask(vp) + 1) * (P_SIZE - 1));
}
// Increment / Decrement
static inline ValuePointer inc(ValuePointer vp) {
return vpMask(++vp) ? vp : vp + P_SIZE * P_SIZE;
}
static inline ValuePointer dec(ValuePointer vp) {
return vpMask(vp--) ? vp : vp - P_SIZE * P_SIZE; // skip back over the payload block when leaving kind[0]
}
// Simple indexed access from a Chunk pointer
static inline ValuePointer eltk(Chunk* ch, size_t k) {
return &ch[k / P_SIZE].kind[k % P_SIZE];
}
// Increment a value pointer by an arbitrary (non-negative) amount
static inline ValuePointer inck(ValuePointer vp, size_t k) {
size_t off = vpMask(vp);
return eltk((Chunk*)(vp - off), k + off);
}
I left out a bunch of the other hacks but I'm sure you can figure them out.
One cool thing about interleaving the pieces of the value is that it has moderately good locality of reference. For the 8-byte version, almost half of the time a random access to a kind and a value will only hit one 64-byte cacheline; the rest of the time two consecutive cachelines are hit, with the result that walking forwards (or backwards) through a vector is just as cache-friendly as walking through an ordinary vector, except that it uses fewer cachelines because the objects are half the size. The four byte version is even cache-friendlier.
I think I figured out the information-theoretically-optimal way to do this myself... it's not quite worth the gains in my case, but I'll explain it here in case it helps someone else.
However, it requires unaligned memory (in some sense).
And perhaps more importantly, you lose the ability to easily add new values of a dynamically.
What really matters here is the number of distinct Items, i.e. the number of distinct (a,b) pairs. After all, it could be that for one a there are a billion different bs, but for the other ones there are only a handful, so you want to take advantage of that.
If we assume that there are N distinct items to choose from, then we need n = ceil(log2(N)) bits to represent each Item. So what we really want is an array of n-bit integers, with n computed at run time. Then, once you get the n-bit integer, you can do a binary search in log(n) time to figure out which a it corresponds to, based on your knowledge of the count of bs for each a. (This may be a bit of a performance hit, but it depends on the number of distinct as.)
You can't do this in a nice memory-aligned fashion, but that isn't too bad. What you would do is make a uint_vector data structure with the number of bits per element being a dynamically-specifiable quantity. Then, to randomly access into it, you'd do a few divisions or mod operations along with bit-shifts to extract the required integer.
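A rough sketch of one possible layout (all names are mine, and this variant keeps every element within a single 64-bit word, wasting the leftover bits rather than straddling word boundaries):

#include <cstddef>
#include <cstdint>
#include <vector>

// n-bit elements packed into 64-bit words; the 64 % n leftover bits of every
// word are wasted. Assumes 0 < bits <= 64.
struct uint_vector
{
    std::vector<uint64_t> words;
    unsigned bits;                                 // n, chosen at run time

    uint64_t get(std::size_t i) const
    {
        unsigned    per_word = 64 / bits;          // division by a run-time value...
        std::size_t word     = i / per_word;       // ...and again here and below
        unsigned    slot     = i % per_word;
        uint64_t    mask     = (bits == 64) ? ~0ull : ((1ull << bits) - 1);
        return (words[word] >> (slot * bits)) & mask;
    }
};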
The caveat here is that dividing by a variable will probably severely damage your random-access performance (although it'll still be O(1)). The way to mitigate that would probably be to write a few different procedures for common values of n (C++ templates help here!) and then branch into them with various if (n == 33) { handle_case<33>(i); } or switch (n) { case 33: handle_case<33>(i); }, etc. so that the compiler sees the divisor as a constant and generates shifts/adds/multiplies as needed, rather than division.
This is information-theoretically optimal as long as you require a constant number of bits per element, which is what you would want for random-accessing. However, you could do better if you relax that constraint: you could pack multiple integers into k * n bits, then extract them with more math. This will probably kill performance too.
(Or, long story short: C and C++ really need a high-performance uint_vector data structure...)
A Structure-of-Arrays approach may be helpful. That is, have three vectors...
vector<A> vec_a;
vector<B> vec_b;
SomeType b_to_a_map;
You access your data as...
Item Get(int index)
{
Item retval;
retval.a = vec_a[b_to_a_map[index]];
retval.b = vec_b[index];
return retval;
}
Now all you need to do is choose something sensible for SomeType. For example, if vec_a.size() were 2, you could use vector<bool> or boost::dynamic_bitset. For more complex cases you could try bit-packing; for example, to support 4 values of A, we simply change our function with...
int a_index = b_to_a_map[index*2]*2 + b_to_a_map[index*2+1];
retval.a = vec_a[a_index];
You can always beat bit-packing by using range-packing, using div/mod to store a fractional bit length per item, but the complexity grows quickly.
A good guide can be found here http://number-none.com/product/Packing%20Integers/index.html
I need to implement a LRU algorithm in a 3D renderer for texture caching. I write the code in C++ on Linux.
In my case I will use texture caching to store "tiles" of image data (16x16 pixel blocks). Now imagine that I do a lookup in the cache and get a hit (the tile is in the cache). How do I return the content of the cache for that entry to the function caller? Let me explain. I imagine that when I load a tile into the cache memory, I allocate the memory to store 16x16 pixels, for example, then load the image data for that tile. Now there are two solutions for passing the content of the cache entry to the function caller:
1) either as pointer to the tile data (fast, memory efficient),
TileData *tileData = cache->lookup(tileId); // not safe?
2) or I need to recopy the tile data from the cache within a memory space allocated by the function caller (copy can be slow).
void Cache::lookup(int tileId, float *&tileData)
{
// find tile in cache, if not in cache load from disk add to cache, ...
...
// now copy tile data, safe but isn't that slow?
memcpy((char*)tileData, tileDataFromCache, sizeof(float) * 3 * 16 * 16);
}
float *tileData = new float[3 * 16 * 16]; // need to allocate the memory for that tile
// get tile data from cache, requires a copy
cache->lookup(tileId, tileData);
I would go with 1) but the problem is, what happens if the tile gets deleted from the cache just after the lookup, and the function tries to access the data using the returned pointer? The only solution I see to this is to use a form of reference counting (e.g. shared_ptr) where the data is actually only deleted when it's not used anymore?
The application might access more than one texture. I can't seem to find a way of creating a key which is unique to each texture and each tile of a texture. For example, I may have tile 1 from file1 and tile 1 from file2 in the cache, so searching on tileId=1 is not enough... but I can't seem to find a way of creating a key that accounts for the file name and the tileId. I can build a string that contains the file name and the tileId (FILENAME_TILEID), but wouldn't a string used as a key be much slower than an integer?
Finally I have a question regarding time stamps. Many papers suggest using a time stamp for ordering the entries in the cache. What is a good function to use for a time stamp? time(), clock()? Is there a better way than using time stamps?
Sorry, I realise it's a very long message, but LRU doesn't seem as simple to implement as it sounds.
Answers to your questions:
1) Return a shared_ptr (or something logically equivalent to it). Then all of the "when-is-it-safe-to-delete-this-object" issues pretty much go away.
2) I'd start by using a string as a key, and see if it actually is too slow or not. If the strings aren't too long (e.g. your filenames aren't too long) then you may find it's faster than you expect. If you do find out that string keys aren't efficient enough, you could try something like computing a hashcode for the string and adding the tile ID to it... that would probably work in practice although there would always be the possibility of a hash collision. But you could have a collision-check routine run at startup that would generate all of the possible filename+tileID combinations and alert you if any of them map to the same key value, so that at least you'd know immediately during your testing when there is a problem and could do something about it (e.g. by adjusting your filenames and/or your hashcode algorithm). This assumes that all the filenames and tile IDs are known in advance, of course.
3) I wouldn't recommend using a timestamp; it's unnecessary and fragile. Instead, try something like this (pseudocode):
typedef shared_ptr<TileData> TileDataPtr; // automatic memory management!
linked_list<TileDataPtr> linkedList;
hash_map<data_key_t, TileDataPtr> hashMap;
// This is the method the calling code would call to get its tile data for a given key
TileDataPtr GetData(data_key_t theKey)
{
if (hashMap.contains_key(theKey))
{
// The desired data is already in the cache, great! Just move it to the head
// of the LRU list (to reflect its popularity) and then return it.
TileDataPtr ret = hashMap.get(theKey);
linkedList.remove(ret); // move this item to the head
linkedList.push_front(ret); // of the linked list -- this is O(1)/fast
return ret;
}
else
{
// Oops, the requested object was not in our cache, load it from disk or whatever
TileDataPtr ret = LoadDataFromDisk(theKey);
linkedList.push_front(ret);
hashMap.put(theKey, ret);
// Don't let our cache get too large -- delete
// the least-recently-used item if necessary
if (linkedList.size() > MAX_LRU_CACHE_SIZE)
{
TileDataPtr dropMe = linkedList.tail();
hashMap.remove(dropMe->GetKey());
linkedList.remove(dropMe);
}
return ret;
}
}
In the same order as your questions:
Copying over the texture data does not seem reasonable from a performance standpoint. Reference counting sounds far better, as long as you can actually code it safely. The data memory would be freed as soon as it is neither used by the renderer nor referenced by the cache.
I assume that you are going to use some sort of hash table for the look-up part of what you are describing. The common solution to your problem has two parts:
Using a suitable hashing function that combines multiple values, e.g. the texture file name and the tile ID. Essentially you create a composite key that is treated as one entity. The hashing function could be an XOR operation of the hashes of all elementary components, or something more complex.
Selecting a suitable hash function is critical for performance reasons - if the said function is not random enough, you will have a lot of hash collisions.
Using a suitable composite equality check to handle the case of hash collisions.
This way you can look-up the combination of all attributes of interest in a single hash table look-up.
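As a sketch of that composite key (the names are mine; TileDataPtr is from the other answer, and the hash-mixing constant follows the boost::hash_combine style):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Composite key: texture file + tile index, hashed and compared as one entity.
struct TileKey
{
    std::string file;
    int         tileId;
    bool operator==(const TileKey& o) const        // equality check used on hash collisions
    { return tileId == o.tileId && file == o.file; }
};

struct TileKeyHash
{
    std::size_t operator()(const TileKey& k) const
    {
        std::size_t h = std::hash<std::string>()(k.file);
        return h ^ (std::hash<int>()(k.tileId) + 0x9e3779b9 + (h << 6) + (h >> 2));
    }
};

// usage: std::unordered_map<TileKey, TileDataPtr, TileKeyHash> cache;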
Using timestamps for this is not going to work - period. Most sources regarding caching usually describe the algorithms in question with network resource caching in mind (e.g. HTTP caches). That is not going to work here for three reasons:
Using natural time only makes sense if you intend to implement caching policies that take it into account, e.g. dropping a cache entry after 10 minutes. Unless you are doing something very weird, something like this makes no sense within a 3D renderer.
Timestamps have a relatively low actual resolution, even if you use high precision timers. Most timer sources have a precision of about 1ms, which is a very long time for a processor - in that time your renderer would have worked through several texture entries.
Do you have any idea how expensive timer calls are? Abusing them like this could even make your system perform worse than not having any cache at all...
The usual solution to this problem is to not use a timer at all. The LRU algorithm only needs to know two things:
The maximum number of entries allowed.
The order of the existing entries w.r.t. their last access.
Item (1) comes from the configuration of the system and typically depends on the available storage space. Item (2) generally implies the use of a combined linked list/hash table data structure, where the hash table part provides fast access and the linked list retains the access order. Each time an entry is accessed, it is placed at the end of the list, while old entries are removed from its start.
Using a combined data structure, rather than two separate ones allows entries to be removed from the hash table without having to go through a look-up operation. This improves the overall performance, but it is not absolutely necessary.
As promised I am posting my code. Please let me know if I have made mistakes or if I could improve it further. I am now going to look into making it work in a multi-threaded environment. Again thanks to Jeremy and Thkala for their help (sorry the code doesn't fit the comment block).
#include <cstdlib>
#include <cstdio>
#include <memory>
#include <list>
#include <unordered_map>
#include <cstdint>
#include <iostream>
typedef uint32_t data_key_t;
class TileData
{
public:
TileData(const data_key_t &key) : theKey(key) {}
data_key_t theKey;
~TileData() { std::cerr << "delete " << theKey << std::endl; }
};
typedef std::shared_ptr<TileData> TileDataPtr; // automatic memory management!
TileDataPtr loadDataFromDisk(const data_key_t &theKey)
{
return std::shared_ptr<TileData>(new TileData(theKey));
}
class CacheLRU
{
public:
// the linked list keeps track of the order in which the data was accessed
std::list<TileDataPtr> linkedList;
// the hash map (unordered_map is part of c++0x while hash_map isn't?) gives quick access to the data
std::unordered_map<data_key_t, TileDataPtr> hashMap;
CacheLRU() : cacheHit(0), cacheMiss(0) {}
TileDataPtr getData(data_key_t theKey)
{
std::unordered_map<data_key_t, TileDataPtr>::const_iterator iter = hashMap.find(theKey);
if (iter != hashMap.end()) {
TileDataPtr ret = iter->second;
linkedList.remove(ret);
linkedList.push_front(ret);
++cacheHit;
return ret;
}
else {
++cacheMiss;
TileDataPtr ret = loadDataFromDisk(theKey);
linkedList.push_front(ret);
hashMap.insert(std::make_pair(theKey, ret));
if (linkedList.size() > MAX_LRU_CACHE_SIZE) {
const TileDataPtr dropMe = linkedList.back();
hashMap.erase(dropMe->theKey);
linkedList.remove(dropMe);
}
return ret;
}
}
static const uint32_t MAX_LRU_CACHE_SIZE = 8;
uint32_t cacheMiss, cacheHit;
};
int main(int argc, char **argv)
{
CacheLRU cache;
for (uint32_t i = 0; i < 238; ++i) {
int key = random() % 32;
TileDataPtr tileDataPtr = cache.getData(key);
}
std::cerr << "Cache hit: " << cache.cacheHit << ", cache miss: " << cache.cacheMiss << std::endl;
return 0;
}
With my current project, I did my best to adhere to the principle that premature optimization is the root of all evil. However, now the code is tested, and it is time for optimization. I did some profiling, and it turns out my code spends almost 20% of its time in a function where it finds all possible children, puts them in a vector, and returns them. As a note, I am optimizing for speed, memory limitations are not a factor.
Right now the function looks like this:
void Board::GetBoardChildren(std::vector<Board> &children)
{
children.reserve(open_columns_.size()); // only reserve max number of children
UpdateOpenColumns();
for (auto i : open_columns_)
{
short position_adding_to = ColumnToPosition(i);
MakeMove(position_adding_to); // make the possible move
children.push_back(*this); // add to vector of children
ReverseMove(); // undo move
}
}
According to the profiling, my code spends about 40% of its time just on the line children.push_back(*this). I am calling the function like this:
std::vector<Board> current_children;
current_state.GetBoardChildren(current_children);
I was thinking since the maximum number of possible children is small (7), would it be better to just use an array? Or is there not a ton I can do to optimize this function?
From your responses to my comments, it seems very likely that most of the time is spent copying the board in
children.push_back(*this);
You need to find a way to avoid making all those copies, or a way to make them cheaper.
Simply changing the vector into an array or a list will likely not make any difference to performance.
The most important question is: do you really need all the child states of current_state at once?
If you just iterate over them once or twice in the default order, then there is no need for a vector, just generate them on demand.
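If that on-demand approach fits, a sketch (reusing the Board members from the question; the visitor idea and names are mine) could look like this:

// Inside class Board:
template <typename Visitor>
void ForEachChild(Visitor visit)
{
    UpdateOpenColumns();
    for (auto i : open_columns_)
    {
        short position_adding_to = ColumnToPosition(i);
        MakeMove(position_adding_to);   // make the move in place
        visit(*this);                   // the caller examines this child right now
        ReverseMove();                  // undo it before trying the next column
    }
}

// usage:
// current_state.ForEachChild([&](const Board& child) { /* evaluate child */ });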
If you really need it, here is the next step. Since Board is expensive to copy, a DifferenceBoard that only keeps track of the difference may be better. Pseudocode:
struct DifferenceBoard { // or maybe inherit from Board so that a DifferenceBoard
// can be built from another DifferenceBoard
Board *original;
int fromposition, toposition;
State state_at_position;
State get(int y, int x) const {
if ((x,y) == fromposition) return Empty;
if ((x,y) == toposition ) return state_at_position;
return original->get(y, x);
}
};
Dear all, I have implemented some functions and would like to ask some basic things, as I do not have a sound fundamental knowledge of C++. I hope you all would be kind enough to tell me what the good way should be, so I can learn from you. (Please, this is not homework and I do not have any experts around me to ask.)
What I did is: I read the input x, y, z point data (around a 3GB data set) from a file and then compute one single value for each point and store it inside a vector (result). Then it is used in the next loop. After that, the vector will not be used anymore and I need to get that memory back, as it holds a huge data set. I think I can do this in two ways.
(1) By just initializing a vector and later erasing it (see code-1). (2) By allocating dynamic memory and then later de-allocating it (see code-2). I heard this de-allocation is inefficient, as de-allocation itself again costs memory, or maybe I misunderstood.
Q1)
I would like to know what would be the optimized way in terms of memory and efficiency.
Q2)
Also, I would like to know whether function return by reference is a good way of giving output. (Please look at code-3)
code-1
int main(){
//read input data (my_data)
vector<double> result;
for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
// do some stuff and calculate a "double" value (say value)
//using each point coordinate
result.push_back(value);
// do some other stuff
}
//loop over result and use each value for some other stuff
for (int i=0; i<result.size(); i++){
//do some stuff
}
//result will not be used anymore and thus erase data
result.clear();
}
code-2
int main(){
//read input data
vector<double> *result = new vector<double>;
for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
// do some stuff and calculate a "double" value (say value)
//using each point coordinate
result->push_back(value);
// do some other stuff
}
//loop over result and use each value for some other stuff
for (int i=0; i<result->size(); i++){
//do some stuff
}
//de-allocate memory
delete result;
result = 0;
}
code03
vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
{
vector<Position3D> *points_at_grid_cutting = new vector<Position3D>;
vector<Position3D>::iterator point;
for (point=begin(); point!=end(); point++) {
//do some stuff
}
return (*points_at_grid_cutting);
}
For such huge data sets I would avoid using std containers at all and make use of memory mapped files.
If you prefer to go on with std::vector, use vector::clear() or the swap-with-an-empty-temporary idiom (std::vector<double>().swap(result)) to free the allocated memory.
clear will not free the memory used by the vector. It reduces the size but not the capacity, so the vector still holds enough memory for all those doubles.
The best way to make the memory available again is like your code-1, but let the vector go out of scope:
int main() {
{
vector<double> result;
// populate result
// use results for something
}
// do something else - the memory for the vector has been freed
}
Failing that, the idiomatic way to clear a vector and free the memory is:
vector<double>().swap(result);
This creates an empty temporary vector, then it exchanges the contents of that with result (so result is empty and has a small capacity, while the temporary has all the data and the large capacity). Finally, it destroys the temporary, taking the large buffer with it.
Regarding code03: it's not good style to return a dynamically-allocated object by reference, since it doesn't provide the caller with much of a reminder that they are responsible for freeing it. Often the best thing to do is return a local variable by value:
vector<Position3D> ReturnLabel(VoxelGrid grid, int segment) const
{
vector<Position3D> points_at_grid_cutting;
// do whatever to populate the vector
return points_at_grid_cutting;
}
The reason is that provided the caller uses a call to this function as the initialization for their own vector, then something called "named return value optimization" kicks in, and ensures that although you're returning by value, no copy of the value is made.
A compiler that doesn't implement NRVO is a bad compiler, and will probably have all sorts of other surprising performance failures, but there are some cases where NRVO doesn't apply - most importantly when the value is assigned to a variable by the caller instead of used in initialization. There are three fixes for this:
1) C++11 introduces move semantics, which basically sort it out by ensuring that assignment from a temporary is cheap.
2) In C++03, the caller can play a trick called "swaptimization". Instead of:
vector<Position3D> foo;
// some other use of foo
foo = ReturnLabel();
write:
vector<Position3D> foo;
// some other use of foo
ReturnLabel().swap(foo);
3) You write a function with a more complicated signature, such as taking a vector by non-const reference and filling the values into that, or taking an OutputIterator as a template parameter. The latter also provides the caller with more flexibility, since they need not use a vector to store the results, they could use some other container, or even process them one at a time without storing the whole lot at once.
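For that last option, a sketch of an OutputIterator-based signature (the function name is mine; VoxelGrid and Position3D are the question's types, and the body is only indicative):

#include <iterator>
#include <vector>

// Results go straight into whatever container or stream the caller chooses.
template <typename OutputIterator>
void ComputeLabels(const VoxelGrid& grid, int segment, OutputIterator out)
{
    // for each qualifying point p:  *out++ = p;
}

// usage, appending into a vector with no extra copy of the whole result:
// std::vector<Position3D> points;
// ComputeLabels(grid, segment, std::back_inserter(points));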
Your code seems like the computed value from the first loop is only used context-insensitively in the second loop. In other words, once you have computed the double value in the first loop, you could act immediately on it, without any need to store all values at once.
If that's the case, you should implement it that way. No worries about large allocations, storage or anything. Better cache performance. Happiness.
vector<double> result;
for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){
// do some stuff and calculate a "double" value (say value)
//using each point coordinate
result.push_back(value);
If the "result" vector will end up having thousands of values, this will result in many reallocations. It would be best if you initialize it with a large enough capacity to store, or use the reserve function :
vector<double) result (someSuitableNumber,0.0);
This will reduce the number of reallocation, and possible optimize your code further.
Also, instead of writing: vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
I would write it like this:
void vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment, vector<Position3D> & myVec_out) const //myVec_out is populated inside func
Your idea of returning a reference is correct, since you want to avoid copying.
Destructors in C++ must not fail; therefore deallocation does not allocate memory, because memory can't be allocated with the no-throw guarantee.
Aside: Instead of looping multiple times, it is probably better to do the operations in an integrated manner, i.e. instead of loading the whole dataset and then reducing the whole dataset, just read in the points one by one and apply the reduction directly. That is, instead of
load_my_data()
for_each (p : my_data)
result.push_back(p)
for_each (p : result)
reduction.push_back (reduce (p))
Just do
file f ("file")
while (f)
Point p = read_point (f)
reduction.push_back (reduce (p))
If you don't need to store those reductions, simply output them sequentially
file f ("file")
while (f)
Point p = read_point (f)
cout << reduce (p)
code-1 will work fine and is almost the same as code-2, with no major advantages or disadvantages.
code03: Somebody else should answer that, but I believe the difference between a pointer and a reference in this case would be marginal; I do prefer pointers though.
That being said, I think you might be approaching the optimization from the wrong angle. Do you really need all points to compute the output of a point in your first loop? Or can you rewrite your algorithm to read only one point, compute the value as you would in your first loop, and then use it immediately the way you want to? Maybe not with single points, but with batches of points. That could potentially cut back on your memory requirements quite a bit with only a small increase in processing time.