Boost.Flyweight memory consumption - c++

I'm reading an article about Boost.Flyweight performance.
According to the article, the overhead of the factory is:
- for hashed_factory: ~2.5 * sizeof(word)
- for set_factory: 4 * sizeof(word)
The basic question is: why 4 words for the set and not zero?
As far as I know, using a hash implies computing and storing a hash key, while using a set does not: it is implemented as a red-black tree, insertion and lookup take O(log n), so no extra values are stored and the memory overhead should be zero (with the drawback that instead of one comparison, as with a hash, you need log(n) comparisons). Where is the mistake?

Each node of the red-black tree contains a pointer to the left child, a pointer to the right child, the color, and one piece of data. The first three count as overhead, so the overhead cannot be 0. I'm not quite sure why they say 4 words when those three elements fit in 3 words, but they probably count something else, such as a parent pointer (not strictly necessary, but commonly stored) or, less likely, memory allocation overhead.
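A rough sketch of how such a node could be laid out, to make the word count concrete. The names NodeLinks, Node and overheadWords are hypothetical, and the layout is an assumption about a typical implementation, not something taken from the Boost.Flyweight article. With a parent pointer included, the one-byte color is padded up to pointer alignment, so the bookkeeping part usually occupies 4 words:

```cpp
#include <cstddef>

// Hypothetical red-black tree node layout (an assumption, not Boost's code).
enum class Color : unsigned char { Red, Black };

struct NodeLinks {
    Color color;        // 1 byte, padded to pointer alignment on typical ABIs
    NodeLinks* parent;  // not strictly required, but commonly stored
    NodeLinks* left;
    NodeLinks* right;
};

template <typename T>
struct Node {
    NodeLinks links;  // pure overhead
    T value;          // the actual payload
};

// Overhead per node, measured in pointer-sized words.
constexpr std::size_t overheadWords() {
    return sizeof(NodeLinks) / sizeof(void*);
}
```

On common 32-bit and 64-bit platforms overheadWords() evaluates to 4, which would match the figure quoted for set_factory.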

Related

Swapping suffix and prefix of collection

I have a list L (in the general sense, not std::list) of numbers and I also have i which is the index of the smallest element in L. I want to swap the two partitions separated by index i. What standard data structure and what operations should I perform such that I can do this as efficiently as possible (preferably in constant time)?
An example: let L be 9 6 -4 6 12. The smallest value is L[2] = -4, so i = 2. After swapping the two partitions, I want L to be -4 6 12 9 6.
The list will be pretty large (up to 10^3 elements) and I will also have to traverse it multiple times (up to 10^3 traversals in the worst case), so using std::list is not a good idea due to caching issues. On the other hand, std::vector will make it difficult to swap the two partitions. Is std::deque a good choice for this?
There are two aspects to your problem:
1- Constant time swap: Conceptually speaking, the best approach in terms of swapping is a doubly linked list (std::list).
The nodes always remain at their initial places in memory, however big your data is, and you only alter a constant number of pointers to do the type of swap you are describing.
2- Locality: We all know that a contiguously allocated space in memory is better for cache performance. This leans towards std::vector.
What is in the middle?
Resizable contiguous chunks of memory that can be allocated through a custom allocator. There are numerous ways to design these.
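The constant-time swap from point 1 can be sketched with std::list::splice, which relinks a range of nodes within the same list without copying them. The helper name swapPartitions is made up for this illustration, and the sketch finds the split position by index, which is linear; keeping an iterator to the smallest element would avoid that step.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Moves the prefix [begin, begin + i) behind the rest of the list.
// The splice itself only relinks nodes; locating position i is O(i) here.
inline void swapPartitions(std::list<int>& l, std::size_t i) {
    assert(i <= l.size());
    auto split = std::next(l.begin(), static_cast<std::ptrdiff_t>(i));
    l.splice(l.end(), l, l.begin(), split);
}
```

With L = 9 6 -4 6 12 and i = 2, this turns the list into -4 6 12 9 6, as in the question.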

Set all values of a row and/or column in c++ to 1 or 0

I have a problem which requires resetting all values in a row or column to 0 or 1. The code I am using is the naive approach of setting the values by iterating each time. Is there a faster implementation?
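// Board is N x N
i = 0;
cin >> x >> y; x--;   // convert the input to a 0-based index
if (query == "SetRow")
{
    while (i != N) { board[x][i] = y; i++; }   // every cell of row x
}
else
{
    while (i != N) { board[i][x] = y; i++; }   // every cell of column x
}
y can be 0 or 1.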
Well, other than iterating over the cells, there are a few optimizations you might want to make:
Iterating over columns is less efficient than iterating over rows (often several times slower) due to cache behavior. When iterating down a column you can get a cache miss for each element, while iterating along a row gives one miss per cache line (how many elements that covers depends on the architecture and the element size; a 64-byte line holds 16 four-byte integers).
Thus, if you iterate over columns more often than over rows, redesign the layout so that you iterate over rows more often. This thread discusses a similar issue.
Also, after you do that, you can use memset(), which I believe is better optimized for this task. Note that memset() writes bytes, so with int elements it is only correct for the value 0; std::fill works for any value.
(Note: Compilers might do that for you automatically in some cases)
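A minimal sketch of the row-friendly layout, assuming the board is stored row by row in one contiguous buffer. The Board type and its member names are hypothetical. Setting a row becomes a single std::fill over adjacent memory, while setting a column still has to stride through the buffer:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// N x N board stored row-major in one contiguous vector (hypothetical type).
struct Board {
    std::size_t n;
    std::vector<int> cells;

    explicit Board(std::size_t size) : n(size), cells(size * size, 0) {}

    // Contiguous write: the cells of row r are adjacent in memory.
    void setRow(std::size_t r, int v) {
        assert(r < n);
        auto first = cells.begin() + static_cast<std::ptrdiff_t>(r * n);
        std::fill(first, first + static_cast<std::ptrdiff_t>(n), v);
    }

    // Strided write: consecutive cells of column c are n elements apart.
    void setColumn(std::size_t c, int v) {
        assert(c < n);
        for (std::size_t r = 0; r < n; ++r) cells[r * n + c] = v;
    }
};
```

If column updates dominate, the same idea applies with the roles swapped, that is, storing the board column by column.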
You can use lazy initialization: there is an O(1) algorithm to initialize an array with a constant value, described in more detail here: initialize an array in constant time. This comes at the cost of roughly triple the amount of space and more expensive accesses later on.
The idea is to maintain an additional stack (logically; implemented as an array plus an index to the top) and an additional array. The additional array records, for each index, the stack position at which it was first written (a number from 0 to n), and the stack records which indices have already been written.
When you read array[i]: if additionalArray[i] < top && stack[additionalArray[i]] == i, the value is array[i]. Otherwise it is the "initialized" value.
When doing array[i] = x: if index i was not written yet (by the test above), set additionalArray[i] = top and stack[top] = i, then increase top.
This results in O(1) initialization, but as said it requires additional memory and makes each access more expensive.
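The scheme above can be sketched as follows. The class name LazyArray and its members are made up for illustration. To stay within defined behavior, this version zero-initializes its three vectors once at construction, so what it demonstrates is the O(1) "set every element to v" operation (fill) rather than O(1) allocation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array with an O(1) "set all elements to v" operation (hypothetical sketch).
class LazyArray {
public:
    explicit LazyArray(std::size_t n, int initial = 0)
        : data_(n), when_(n), stack_(n), top_(0), fill_(initial) {}

    // O(1): forget all individual writes and change the default value.
    void fill(int v) { fill_ = v; top_ = 0; }

    int get(std::size_t i) const {
        assert(i < data_.size());
        return written(i) ? data_[i] : fill_;
    }

    void set(std::size_t i, int v) {
        assert(i < data_.size());
        if (!written(i)) {       // first write since the last fill
            when_[i] = top_;     // remember the stack position of index i
            stack_[top_] = i;    // the stack confirms that i was written
            ++top_;
        }
        data_[i] = v;
    }

private:
    // True if index i was written after the most recent fill().
    bool written(std::size_t i) const {
        return when_[i] < top_ && stack_[when_[i]] == i;
    }

    std::vector<int> data_;
    std::vector<std::size_t> when_;   // the "additional array"
    std::vector<std::size_t> stack_;  // indices written since the last fill
    std::size_t top_;
    int fill_;
};
```

For the board problem, one such array per row (or per column) would make a whole-row reset a constant-time call to fill, at the cost of the extra memory and the slower element access mentioned above.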
The same principles described by the article regarding initializing an array in O(1) can also be applied here.
The problem is taken from running codechef long contest.... hail cheaters .. close this thread

Efficiently insert integers into a growing array if no duplicates

There is a data structure which acts like a growing array. An unknown number of integers will be inserted into it one by one, each one if and only if it is not already present in the data structure.
Initially I thought a std::set would suffice: it grows automatically as new integers come in and guarantees there are no duplicates.
But as the set grows large, the insertion speed goes down. So is there any other way to do this job besides hashing?
P.S.
I wonder whether tricks such as XORing all the elements or building a sparse table (as for RMQ) would apply?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
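A minimal sketch of the bit field idea, with one bit per possible value. The class name SeenBits and the universe parameter are assumptions for illustration; passing 2^32 as the universe gives the 512 MB case described above, while a smaller known range shrinks the table accordingly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per possible value in [0, universe) (hypothetical sketch).
class SeenBits {
public:
    explicit SeenBits(std::uint64_t universe)
        : words_(static_cast<std::size_t>((universe + 63) / 64), 0),
          universe_(universe) {}

    // Returns true if x was not present before and has now been recorded.
    bool insertIfNew(std::uint64_t x) {
        assert(x < universe_);
        std::uint64_t& word = words_[static_cast<std::size_t>(x / 64)];
        const std::uint64_t mask = std::uint64_t{1} << (x % 64);
        if (word & mask) return false;  // duplicate
        word |= mask;
        return true;
    }

private:
    std::vector<std::uint64_t> words_;
    std::uint64_t universe_;
};
```

A caller would append the integer to its growing array only when insertIfNew returns true.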
This site lists all the standard containers and the running time of each of their operations,
so maybe it will be useful:
http://en.cppreference.com/w/cpp/container
Seems like unordered_set as suggested is your best way.
You could try a std::unordered_set, which is implemented as a hash table. (I do not understand why you write "besides hash"; std::set is normally implemented as a balanced tree, which is probably the reason for the insufficient insertion performance.)
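A short sketch of how this could look, assuming the insertion order must be preserved in the growing array. The class name UniqueInts is made up; std::unordered_set::insert returns a pair whose second member tells whether the value was actually inserted, so one call both checks for a duplicate and records the value.

```cpp
#include <unordered_set>
#include <vector>

// Growing array that rejects duplicates (hypothetical wrapper).
class UniqueInts {
public:
    // Returns true if x was new and has been appended.
    bool add(int x) {
        if (!seen_.insert(x).second) return false;  // already present
        items_.push_back(x);
        return true;
    }

    const std::vector<int>& items() const { return items_; }

private:
    std::unordered_set<int> seen_;  // average O(1) membership test
    std::vector<int> items_;        // values in insertion order
};
```

If the expected number of elements is roughly known, calling reserve on the set up front avoids rehashing during growth.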
If there is some range the numbers fall in, then you can create several std::set as buckets.
EDIT: According to the range that you have specified, std::set should be fast enough. O(log n) is fast enough for most purposes, unless you have done some measurements and found it too slow for your case.
You can also use the pigeonhole principle along with sets to reject any possible duplicate (applicable when the set grows large).
A bit vector can be useful to detect duplicates
More requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
Alcott: 32-bit integers, with about 10,000,000 elements (i.e. any 10M out of 2^32)
It is a BST (binary search tree) where every node stores two values, the beginning and the end of a contiguous region: the first is the number where the region starts, the second is its last number. This arrangement allows big regions, in the hope that you reach your 10M limit with a very small tree height and therefore cheap searches. A node takes 8 bytes for the two values plus two 4-byte links (at most two children per node), so 10M single-element nodes, the worst case, need 80 MB for the values alone. And of course, if most of the possible values usually end up inserted, you can keep track of the ones which are not.
Now if you need to be very careful with space, and after running simulations and/or statistical checks you find that there are lots of small regions (less than 32 values long), you may want to change your node type to one number which starts the region, plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have contiguous chunks of 8 elements, then your memory requirement becomes 4+1 bytes for the node and 4+4 bytes for the children. Hope this helps.
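The region idea can be tried out with a small sketch. It swaps the hand-written BST for std::map (itself a balanced tree), keyed by the start of each region and storing its last value; the class name RangeSet is made up. Adjacent regions are merged on insertion, so a dense input collapses into few nodes:

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// Set of 32-bit integers stored as disjoint inclusive ranges (hypothetical).
class RangeSet {
public:
    // Returns true if x was not present before.
    bool insert(std::uint32_t x) {
        auto next = ranges_.upper_bound(x);  // first range starting after x
        auto prev = ranges_.end();
        if (next != ranges_.begin()) {
            prev = std::prev(next);          // last range starting at or before x
            if (x <= prev->second) return false;  // x is already covered
        }
        const bool joinsPrev = prev != ranges_.end() && prev->second + 1 == x;
        const bool joinsNext = next != ranges_.end() && x + 1 == next->first;
        if (joinsPrev && joinsNext) {        // x closes the gap between two ranges
            prev->second = next->second;
            ranges_.erase(next);
        } else if (joinsPrev) {
            prev->second = x;                // extend the left range upwards
        } else if (joinsNext) {
            const std::uint32_t last = next->second;
            ranges_.erase(next);             // re-key the right range to start at x
            ranges_.emplace(x, last);
        } else {
            ranges_.emplace(x, x);           // isolated value
        }
        return true;
    }

    std::size_t rangeCount() const { return ranges_.size(); }

private:
    std::map<std::uint32_t, std::uint32_t> ranges_;  // start -> last
};
```

Whether this beats a plain set depends entirely on how clustered the input is; with no adjacent values it degenerates to one node per element.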

How to map 13 bit value to 4 bit code?

I have a std::map for some packet processing program.
I didn't notice it before profiling, but unfortunately this map lookup alone consumes about 10% of the CPU time (it is called too many times).
Usually there are at most 10 keys in the input data, so I'm trying to implement a kind of key cache in front of the map.
The key is a 13-bit integer. I know there are only 8192 possible keys and that an array of 8192 items gives constant-time lookup, but I already feel ashamed and don't want to use such a naive approach :(
Now I'm looking for some hashing method that yields a 4-bit code for a 13-bit integer very fast.
Any cool idea?
Thanks in advance.
UPDATE
Besides my shame, I don't have total control over the source code, and creating a new array for this purpose is almost prohibited.
The project manager (who ran the profiler) said a linked list showed a small performance gain and recommended using std::list instead of std::map.
UPDATE
The key values are random (no relationship between them) and are not well distributed.
Sample:
1) 0x100, 0x101, 0x10, 0x0, 0xffe
2) 0x400, 0x401, 0x402, 0x403, 0x404, 0x405, 0xff
Assuming your table contains some basic type, it's almost no memory at all. Even with 8-byte entries on a 64-bit system it's only 64 KB of memory. There is no shame in using a lookup table like that; it has some of the best performance you can get.
You may want to go with a middle solution and the open addressing technique: one array of size 256. The index into the array is some simple hash function, like the XOR of the key's two bytes. Each element of the array is a struct {key, value}. Collisions are handled by storing the colliding element at the next available index. If you need to delete an element from the array, and deletion is rare, just recreate the array (create a temporary list from the remaining elements, then build the array from this list).
If you pick your hash function smartly, there will almost never be any collisions. For instance, from your two examples one such hash would be to XOR the low nibble of the high byte with the high nibble of the low byte (and do what you like with the remaining 13th bit).
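A minimal sketch of the 256-slot open addressing table described above, using the XOR of the two key bytes as the hash and linear probing for collisions. The class name SmallKeyCache and the 8-bit value type are assumptions; deletion is left out, as the answer suggests rebuilding the table in that case.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 256-slot open addressing table for 13-bit keys (hypothetical sketch).
class SmallKeyCache {
public:
    void put(std::uint16_t key, std::uint8_t value) {
        std::size_t i = hash(key);
        for (std::size_t n = 0; n < kSlots; ++n, i = (i + 1) % kSlots) {
            Slot& s = slots_[i];
            if (!s.used || s.key == key) {  // free slot or existing entry
                s.key = key;
                s.value = value;
                s.used = true;
                return;
            }
        }
        assert(false && "table is full");
    }

    // Returns true and writes the value to out if the key is present.
    bool get(std::uint16_t key, std::uint8_t& out) const {
        std::size_t i = hash(key);
        for (std::size_t n = 0; n < kSlots; ++n, i = (i + 1) % kSlots) {
            const Slot& s = slots_[i];
            if (!s.used) return false;      // an empty slot ends the probe
            if (s.key == key) { out = s.value; return true; }
        }
        return false;
    }

private:
    static constexpr std::size_t kSlots = 256;

    struct Slot {
        std::uint16_t key = 0;
        std::uint8_t value = 0;
        bool used = false;
    };

    // XOR of the high and low byte; always below 256 for 13-bit keys.
    static std::size_t hash(std::uint16_t key) {
        return static_cast<std::size_t>((key >> 8) ^ (key & 0xFF));
    }

    std::array<Slot, kSlots> slots_{};
};
```

With only about ten live keys in 256 slots, probe sequences stay very short even when two keys hash to the same index.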
Unless you're writing for some sort of embedded system where 8K is really significant, just use the array and move on. If you really insist on doing something else, you might consider a perfect hash generator (e.g., gperf).
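For comparison, the plain array approach is only a few lines. This is a sketch under assumptions: the names KeyTable, kKeyCount and kNoCode are invented here, each 4-bit code is stored in its own byte (8 KB in total), and 0xFF marks a key that has no code yet.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kKeyCount = 1u << 13;  // 8192 possible 13-bit keys
constexpr std::uint8_t kNoCode = 0xFF;       // marker for "no code assigned"

// Direct lookup table: one byte per possible key (hypothetical sketch).
class KeyTable {
public:
    KeyTable() { codes_.fill(kNoCode); }

    void set(std::uint16_t key, std::uint8_t code) {
        assert(key < kKeyCount && code < 16);
        codes_[key] = code;
    }

    // Constant time: a single array index, no hashing or comparison.
    std::uint8_t lookup(std::uint16_t key) const {
        assert(key < kKeyCount);
        return codes_[key];
    }

private:
    std::array<std::uint8_t, kKeyCount> codes_;
};
```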
If there are really only going to be something like 10 active entries in your table, you might seriously consider using an unsorted vector to hold this mapping. Something like this:
#include <cstddef>
#include <utility>
#include <vector>

typedef int key_type;
typedef int value_type;
std::vector<std::pair<key_type, value_type> > mapping;

// Overwrite the value if the key already exists, otherwise append a new pair.
inline void put(key_type key, value_type value) {
    for (std::size_t i = 0; i < mapping.size(); ++i) {
        if (mapping[i].first == key) {
            mapping[i].second = value;
            return;
        }
    }
    mapping.push_back(std::make_pair(key, value));
}

// Linear search over the (small) vector.
inline value_type get(key_type key) {
    for (std::size_t i = 0; i < mapping.size(); ++i) {
        if (mapping[i].first == key) {
            return mapping[i].second;
        }
    }
    // do something reasonable if not found?
    return value_type();
}
Now, the asymptotic speed of these algorithms (each O(n)) is much worse than you'd have with either a red-black tree (like std::map at O(log n)) or hash table (O(1)). But you're not talking about dealing with a large number of objects, so asymptotic estimates don't really buy you much.
Additionally, std::vector buys you both low overhead and locality of reference, which neither std::map nor std::list can offer. So it's more likely that a small std::vector will stay entirely within the L1 cache. As it's almost certainly the memory bottleneck that's causing your performance issues, using a std::vector with even my poor choice of algorithm will likely be faster than either a tree or linked list. Of course, only a few solid profiles will tell you for sure.
There are certainly algorithms that might be better choices: a sorted vector could potentially give even better performance; a well tuned small hash table might work as well. I suspect that you'll run into Amdahl's law pretty quickly trying to improve on a simple unsorted vector, however. Pretty soon you might find yourself running into function call overhead, or some other such concern, as a large contributor to your profile.
I agree with GWW, you don't use so much memory in the end...
But if you want, you could use an array of 11 or 13 linked lists and hash the keys with the % operator. If the number of keys is less than the array size, the complexity still tends to be O(1).
When you always just have about ten keys, use a list (or array). Do some benchmarking to find out whether or not using a sorted list (or array) and binary search will improve performance.
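The sorted-array variant worth benchmarking against a linear scan could look like this. It is a sketch: the SortedMapping name and the int key and value types are assumptions. Lookups use std::lower_bound for the binary search, and insertion keeps the vector ordered by key.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Small key/value mapping kept sorted by key (hypothetical sketch).
class SortedMapping {
public:
    void put(int key, int value) {
        auto it = std::lower_bound(entries_.begin(), entries_.end(), key, keyLess);
        if (it != entries_.end() && it->first == key) {
            it->second = value;                    // update the existing entry
        } else {
            entries_.insert(it, Entry(key, value));  // keeps the order intact
        }
    }

    // Returns a pointer to the value, or nullptr if the key is absent.
    const int* find(int key) const {
        auto it = std::lower_bound(entries_.begin(), entries_.end(), key, keyLess);
        return (it != entries_.end() && it->first == key) ? &it->second : nullptr;
    }

private:
    typedef std::pair<int, int> Entry;

    static bool keyLess(const Entry& e, int key) { return e.first < key; }

    std::vector<Entry> entries_;
};
```

For around ten entries the difference from a linear scan may well be within measurement noise, which is why benchmarking is the deciding step.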
You might first want to see if there are any unnecessary calls to the key lookup. You only want to do this once per packet ideally -- each time you call a function there is going to be some overhead, so getting rid of extra calls is good.
Map is generally pretty fast, but if there is any exploitable pattern in the way keys are mapped to items you could use that and potentially do better. Could you provide a bit more information about the keys and the associated 4-bit values? E.g. are they sequential, is there some sort of pattern?
Finally, as others have mentioned, a lookup table is very fast; 8192 values * 4 bits is only 4 KB, a tiny amount of memory indeed.
I would use a lookup table. It's tiny unless you are using a microcontroller or something.
Otherwise I would do this:
Generate a table of, say, 30 elements.
For each lookup, calculate a hash value of (key % 30) and compare the key with the one stored at that location in the table. If the key is there, you have found your value. If the slot is empty, add the key there. If the slot holds a different key, skip to the next cell and repeat.
With 30 cells and 10 keys, collisions should be rare, but if you get one it's fast to skip to the next cell, and normal lookups are simply a modulus and a compare operation, so fairly fast.

Indexing: Implementing Tree data structures with Arrays/Vectors

I have been implementing a heap in C++ using a vector. Since I have to access the children of a node n (2n, 2n+1) easily, I had to start at index 1. Is this the right way? In my implementation there is always a dummy element at index zero.
Your way works. Alternatively, you can put the root at index 0 and have the children of node n at 2n+1 and 2n+2.
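The 0-based variant can be sketched with three index helpers and a sift-up insertion for a min-heap. The function names are chosen for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Index arithmetic for a binary heap whose root lives at index 0.
inline std::size_t leftChild(std::size_t i) { return 2 * i + 1; }
inline std::size_t rightChild(std::size_t i) { return 2 * i + 2; }
inline std::size_t parent(std::size_t i) {
    assert(i > 0);  // the root has no parent
    return (i - 1) / 2;
}

// Appends value and sifts it up until the min-heap property holds again.
inline void pushHeap(std::vector<int>& heap, int value) {
    heap.push_back(value);
    std::size_t i = heap.size() - 1;
    while (i > 0 && heap[i] < heap[parent(i)]) {
        std::swap(heap[i], heap[parent(i)]);
        i = parent(i);
    }
}
```

Both conventions cost the same amount of index arithmetic; the 0-based one simply avoids the dummy element.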
While this works well for heaps, you end up using a huge amount of redundant memory for other tree data structures that do not necessarily have a full and complete Binary tree. For example, this means that if you have a Binary search tree of 20 nodes with a depth of 5, you end up having to use an array of 2^5=32 instead of 20. Now imagine if you need a tree of 25 nodes with a depth of 22. You end up using a huge array of 4194304, whereas you could have used a linked representation to store just the 25 nodes.
You can still use an array and not incur such a memory hit. Just allocate a large block of memory as an array and use array indices as pointers to the children.
Thus, where you had
node.left = (node.index*2)
node.right = (node.index*2+1)
You simply use
node.left = <index of left child>
node.right = <index of right child>
Or you can just use pointers/references instead of integer indices to an array if your language supports it.
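A minimal sketch of the index-linked layout for a binary search tree: all nodes live in one vector and the child links are indices into it, so memory grows with the number of nodes rather than with 2^depth. The names IndexTree, Node and kNone are hypothetical.

```cpp
#include <cstddef>
#include <vector>

constexpr int kNone = -1;  // marks a missing child

struct Node {
    int key;
    int left;   // index of the left child in the vector, or kNone
    int right;  // index of the right child in the vector, or kNone
};

// Unbalanced BST whose nodes are stored contiguously (hypothetical sketch).
class IndexTree {
public:
    // Returns true if key was inserted, false if it was already present.
    bool insert(int key) {
        if (nodes_.empty()) {
            nodes_.push_back(Node{key, kNone, kNone});
            return true;
        }
        std::size_t cur = 0;  // the root is always at index 0
        for (;;) {
            Node& n = nodes_[cur];
            if (key == n.key) return false;
            int& link = key < n.key ? n.left : n.right;
            if (link == kNone) {
                // Record the new index before push_back, which may reallocate
                // and invalidate the references n and link.
                link = static_cast<int>(nodes_.size());
                nodes_.push_back(Node{key, kNone, kNone});
                return true;
            }
            cur = static_cast<std::size_t>(link);
        }
    }

    std::size_t size() const { return nodes_.size(); }

private:
    std::vector<Node> nodes_;
};
```

Inserting 25 keys in increasing order produces a chain of depth 25 that still occupies only 25 vector slots.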
Edit:
It might not be obvious to everyone that a complete binary tree of depth d takes up O(2^d) memory. There are d levels, and every level has twice as many nodes as the level above it (because every node except those at the bottom has exactly two children, never one). A binary heap is a binary tree (but not a binary search tree) that is always complete by definition, so the array-based implementation outlined by the OP does not incur any real memory overhead. For a heap, that is the best way to implement it in code. OTOH, most other binary trees (especially binary search trees) are not guaranteed to be complete. So using this approach on them would need O(2^depth) memory, where depth can be as large as n, whereas a linked implementation needs only O(n) memory.
So my answer is: yes, this is the best way for a heap. Just don't try it for other binary trees (unless you're sure they will always be complete).