Data structure for sparse insertion - C++

I am asking this question mostly for confirmation, because I am not an expert in data structures, but I think the structure that suits my need is a hashmap.
Here is my problem (which I guess is typical?):
We are looking at pairwise interactions between a large number of objects (say N=90k), so think about the storage as a sparse matrix;
There is a process, say (P), which randomly starts from one object, and computes a model which may lead to another object: I cannot predict the pairs in advance, so I need to be able to "create" entries dynamically (arguably the performance is not very critical here);
The process (P) may "revisit" existing pairs and update the corresponding element in the matrix: this happens a lot, and therefore I need to be able to find and update as fast as possible.
Finally, the process (P) is repeated millions of times, but it only requires write access to the data structure; it never needs to read back the latest "state" of that storage. This feels intuitively like a detail that might be exploited to improve performance, but I don't think hashmaps exploit it.
This last point is actually the main reason for my question here: is there a data structure which satisfies the first three points (I'm thinking hash map, correct?), and which would also exploit the last feature for improved performance (I'm thinking of something like buffering operations and executing them in bulk asynchronously)?
EDIT: I am working with C++, and would prefer it if there was an existing library implementing that data structure. In addition, I am limited by the system requirements; I cannot use C++11 features.

I would use something like:
#include <boost/unordered_map.hpp>
#include <boost/functional/hash.hpp> // supplies boost::hash support for std::pair

class Data
{
    boost::unordered_map<std::pair<int, int>, double> map;
public:
    void update(int i, int j, double v)
    {
        map[std::make_pair(i, j)] += v;
    }
    void output(); // Prints data somewhere.
};
That will get you going (boost::hash, included above, already knows how to hash a std::pair, so the default hasher works). You might be able to speed things up by making the key type a 64-bit integer, packing the pair as ((int64_t)i << 32) | j.
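A sketch of that packed-key variant; the unsigned casts are my addition, to avoid sign-extension surprises if i or j can be negative:

#include <boost/cstdint.hpp> // fixed-width types without requiring C++11

inline boost::uint64_t make_key(int i, int j)
{
    // Widen through unsigned so a negative j cannot smear sign bits
    // across the upper half of the key.
    return (static_cast<boost::uint64_t>(static_cast<boost::uint32_t>(i)) << 32)
         | static_cast<boost::uint32_t>(j);
}

boost::unordered_map<boost::uint64_t, double> map; // default hash suffices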
If nearly all the updates go to a small fraction of the pairs, you could have two maps (small and large), and direct every update to the small map. Every time the size of small passes a threshold, you fold it into large and clear small. You would need to do some careful testing to see whether this helps. The only reason I think it might help is by improving cache locality.
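A minimal sketch of that two-map idea, reusing the make_key helper above; the threshold value is an arbitrary placeholder, to be tuned by measurement:

class BufferedData
{
    boost::unordered_map<boost::uint64_t, double> small, large;
    std::size_t threshold;
public:
    explicit BufferedData(std::size_t t = 100000) : threshold(t) {}
    void update(int i, int j, double v)
    {
        small[make_key(i, j)] += v;
        if (small.size() >= threshold)
            flush(); // fold the hot, cache-friendly small map into large
    }
    void flush()
    {
        typedef boost::unordered_map<boost::uint64_t, double>::const_iterator It;
        for (It it = small.begin(); it != small.end(); ++it)
            large[it->first] += it->second;
        small.clear();
    }
};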
Even if you end up using a different data structure, you can keep this class interface, and the rest of the code will be undisturbed. In particular, dropping sparsehash into the same structure will be very easy.

Related

Which data structure is sorted by insertion and has fast "contains" check?

I am looking for a data structure that preserves the order in which the elements were inserted and offers a fast "contains" predicate. I also need iterators and random access. Performance during insertion or deletion is not relevant. I am also willing to accept overhead in terms of memory consumption.
Background: I need to store a list of objects. The objects are instances of a class called Neuron and stored in a Layer. The Layer object has the following public interface:
class Layer {
public:
    Neuron *neuronAt(const size_t &index) const;
    NeuronIterator begin();
    NeuronIterator end();
    bool contains(const Neuron *const &neuron) const;
    void addNeuron(Neuron *const &neuron);
};
The contains() method is called quite often when the software runs; I've verified that using callgrind. I tried to circumvent some of the calls to contains(), but it is still a hot spot. Now I hope to optimize exactly this method.
I thought of using std::set, using the template argument to provide my own comparator struct. But the Neuron class itself does not expose its position within the Layer. Additionally, I'd like *someNeuronIterator = anotherNeuron to work without screwing up the order.
Another idea was to use a plain old C array. Since I do not care about the performance of adding a new Neuron object, I thought I could make sure that the Neuron objects are always stored contiguously in memory. But that would invalidate the pointer I pass to addNeuron(); at least I'd have to change it to point to the new copy created to keep the storage contiguous. Right?
Another idea was to use two data structures in the Layer object: a vector/list for the order, and a map/hash for lookup. But that would contradict my wish for an iterator that allows operator* without a const reference, wouldn't it?
I hope somebody can suggest a data structure or a concept that satisfies my needs, or at least give me an idea for an alternative. Thanks!
If this contains check is really where you need the fastest execution, and assuming you can be a little intrusive with the source code, the fastest way to check whether a Neuron belongs to a layer is to simply flag it when you insert it into a layer (e.g. a bit flag).
You have guaranteed O(1) checks at that point to see if a Neuron belongs in a layer and it's also fast at the micro-level.
If there can be numerous Layer objects, this gets a little trickier, as you'll need a separate bit for each potential layer a neuron can belong to, unless a Neuron can only belong to a single layer at once. This is reasonably manageable, however, if the number of layers is relatively fixed.
In the latter case, where a Neuron can only belong to one layer at once, all you need is a backpointer to the Layer*. To see if a Neuron belongs to a layer, simply check whether that backpointer points to the layer object.
If a Neuron can belong to multiple layers at once, but not too many at one time, then you could store like a little array of backpointers like so:
struct Neuron
{
    ...
    Layer* layers[4]; // inline storage: use whatever small size usually fits the common case
    Layer* ptr;       // points to layers, or to a heap-allocated array if there are more than 4
    int num_layers;   // how many entries of the active array are in use
};
Initialize ptr to point to layers if there are 4 or fewer layers to which the Neuron belongs. If there are more, allocate the array on the free store. In the destructor, free the memory if ptr != layers. You can also optimize away num_layers if the common case is 1 layer, in which case a null-terminated array might work better. To see if a Neuron belongs to a layer, simply do a linear search through the array ptr points to. That's practically constant time, provided neurons don't belong to a huge number of layers at once.
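A sketch of that membership check, assuming the struct above with ptr pointing at the active array:

bool belongs_to(const Neuron& n, const Layer* layer)
{
    // Linear scan over a handful of backpointers: effectively constant time.
    for (int i = 0; i < n.num_layers; ++i)
        if (n.ptr[i] == layer)
            return true;
    return false;
}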
You could also use a vector here, but that might cost you cache hits in the common-case scenarios, since a vector always puts its contents in a separate heap block, even if the Neuron only belongs to 1 or 2 layers.
This might be a bit different from what you were looking for with a general-purpose, non-intrusive data structure, but if your performance needs are really skewed towards these kinds of set operations, an intrusive solution is going to be the fastest in general. It's not quite as pretty and couples your element to the container, but hey, if you need max performance...
Another idea was to use a plain old C array. Since I do not care about the performance of adding a new Neuron object, I thought I could make sure that the Neuron objects are always stored contiguously in memory. But that would invalidate the pointer I pass to addNeuron(); [...]
Yes, but it won't invalidate indices. While not as convenient to use as pointers, if you're working with mass data like vertices of a mesh or particles of an emitter, it's common to use indices here to avoid the invalidation, and possibly to save an extra 32 bits per entry on 64-bit systems.
Update
Given that Neurons only exist in one Layer at a time, I'd go with the back pointer approach. Seeing if a neuron belongs to a layer becomes a simple matter of checking if the back pointer points to the same layer.
Since there's an API involved, I'd suggest (just because it sounds like you're pushing around a lot of data and have already profiled it) that you focus on an interface which revolves around aggregates (layers, e.g.) rather than individual elements (neurons). It'll leave you a lot of room to swap out underlying representations as long as your clients aren't performing operations at the level of individual elements.
With the O(1) contains implementation and the unordered requirement, I'd go with a simple contiguous structure like std::vector. However, you do expose yourself to potential invalidation on insertion.
Because of that, if you can, I'd suggest working with indices here. However, that becomes a little unwieldy, since it requires your clients to store both a pointer to the layer in which a neuron belongs and its index (though if you do this, the backpointer becomes unnecessary, as the client is tracking where things belong).
One way to mitigate this is to simply use something like std::vector<Neuron*> or ptr_vector if available. That said, it can expose you to cache misses and heap overhead, and if you want to optimize that, this is where a fixed allocator comes in handy. It's a bit of a pain with alignment issues and a bit of a research topic, though, and so far it seems your main goal is not to optimize insertion or sequential access so much as this contains check, so I'd start with the std::vector<Neuron*>.
You can get an O(1) contains check and O(1) insert while preserving insertion order. If you are using Java, look at LinkedHashMap. If you are not, look at how LinkedHashMap works and build the equivalent data structure, or implement it yourself.
It's just a hashmap combined with a doubly linked list: the linked list preserves order, and the hashmap allows O(1) access. When you insert an element, you append a node to the list, and the map entry for the key points to that node. To look up, you go to the hash table to find the pointer directly to your linked-list node (not the head), and get the value in O(1). To visit elements in insertion order, you just traverse the linked list.
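In C++, a minimal sketch of that combination might look like this (boost::unordered_map keeps it pre-C++11; names mirror the question's interface, and random access by index would still need something extra, e.g. a parallel vector):

#include <list>
#include <boost/unordered_map.hpp>

class Neuron;

class Layer
{
    std::list<Neuron*> order; // preserves insertion order
    boost::unordered_map<const Neuron*, std::list<Neuron*>::iterator> index; // O(1) lookup
public:
    void addNeuron(Neuron* n)
    {
        order.push_back(n);
        index[n] = --order.end();
    }
    bool contains(const Neuron* n) const
    {
        return index.find(n) != index.end();
    }
};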
A heap might sound useful: it's like a tree, but the newest element is inserted at the top and then sifts down based on its value. Note, though, that a heap only makes the minimum/maximum fast to find; a general contains check is still linear.
Otherwise, you could store a hash table (quick method to check if the neuron is contained in the table) with the key for the neuron, and values of: the neuron itself, and the timestep upon which the neuron was inserted (to check its chronological insertion time).

3D-Grid of bins: nested std::vector vs std::unordered_map

Pros, I need some performance opinions on the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Maybe Option A)
vector<vector<vector<deque<Obj> > > > grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add and delete deques as data is added or deleted. Probably less memory used, but maybe slower? What do you think?
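For concreteness, the key and its functors could look roughly like this (a sketch, not the asker's actual code):

#include <cstddef>

struct Pos3D { int x, y, z; };

struct Pos3DEqual
{
    bool operator()(const Pos3D& a, const Pos3D& b) const
    {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};

struct Pos3DHash
{
    std::size_t operator()(const Pos3D& p) const
    {
        // Any reasonable mixing of the three coordinates will do.
        std::size_t h = static_cast<std::size_t>(p.x);
        h = h * 31 + static_cast<std::size_t>(p.y);
        h = h * 31 + static_cast<std::size_t>(p.z);
        return h;
    }
};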
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different object types (ObjA, ObjB, ObjC) per grid point, so that my data essentially becomes 4D?
Using Option B I could just extend Pos3D to include the bucket number, to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure.
Give me all Objects out of ObjB-buckets for a set of grid-positions.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, while reading about nearest-neighbour approaches. Since my data is so regular, I thought I'd skip all the tree-building subdivision of cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask about the best way to store this grid here.
A question associated with this: if I have a map<int, Obj>, is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way building the above-mentioned tree?
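(As an aside: std::map keeps its keys ordered, so given std::map<int, Obj> m, that range query is just two O(log n) lookups plus iteration over the hits. A minimal sketch:)

std::map<int, Obj>::iterator lo = m.lower_bound(780); // first key >= 780
std::map<int, Obj>::iterator hi = m.upper_bound(790); // first key > 790
for (std::map<int, Obj>::iterator it = lo; it != hi; ++it)
{
    // ... use it->first / it->second ...
}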
EDIT
I ended up going with a 3D boost::multi_array with Fortran ordering. It's a bit like the chunks that games like Minecraft use, which in turn is a little like a kd-tree with a fixed leaf size and a fixed number of leaves. It works pretty fast now, so I'm happy with this approach.
Answer to 1st question
As #Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector (or a plain array, if you will). std::vector brings easier maintenance at minimal overhead, so I'd prefer that. It consumes O(N^3) space, where N is the number of grid points along one dimension. To get the best performance when iterating over the data, let the last index vary fastest (innermost loop), so that memory is accessed sequentially.
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That results in O(M) space, with M being the number of stored elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
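Usage is close to a drop-in replacement (a sketch; the header path varies between releases, and erase() requires a designated deleted key):

#include <sparsehash/sparse_hash_map> // older releases: <google/sparse_hash_map>

google::sparse_hash_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual> grid;
// Caveat: sparse_hash_map requires a reserved "deleted" key to be set
// before erase() may be called, e.g.:
// grid.set_deleted_key(pos_never_used_for_real_data);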
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed, large number of them. With option B) you'd indeed add the bucket index to the key; for A) you'd add another nested vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.

An integer hashing problem

I have a (C++) std::map<int, MyObject*> that contains a couple of million objects of type MyObject*. The maximum number of objects that I can have is around 100 million. The key is the object's id. During a certain process, these objects must somehow be marked (with a 0 or 1) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking). Since I know the minimum and maximum id (1 to 100,000,000), the first thought that occurred to me was to use a std::bitset<100000000> and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since each uses its own bitset. But I was wondering what the solution could be if I had to use something other than a 0-1 marking, e.g. what could I use if I had to mark all objects with an integer number?
Is there some form of data structure that can deal with this kind of problem in a compact (memory-wise) manner and also be fast? The main queries of interest are whether an object is marked, and what it was marked with.
Thank you.
Note: std::map<int, MyObject*> cannot be changed. Whatever data structure I use, must not deal with the map itself.
How about making the value_type of your map a std::pair<bool, MyObject*> instead of MyObject*?
If you're not concerned with memory, then a std::vector<int> (or whatever suits your need in place of an int) should work.
If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?
std::map<id,T> my_object_map;
std::map<id,int> my_marker_map;
If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:
struct T_wrapper
{
    int marker;
    T *p_x; // the original object
};
std::map<int, T_wrapper> my_map;
If you're going to need to do lookups anyway, then this will be no slower.
EDIT: As #tenfour suggests in his/her answer, a std::pair may be a cleaner solution here, as it saves the struct definition. Personally, I'm not a big fan of std::pairs, because you have to refer to everything as first and second, rather than by meaningful names. But that's just me...
The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly 100,000,000/(2*sizeof(int)), then just use another std::set or std::tr1::unordered_set (hash_set prior to TR1) to track which ones are marked (or unmarked).
Where does 2*sizeof(int) come from? It's a rough estimate of the per-element memory overhead of maintaining such a node-based container of the items that will be marked.
If it is larger, then use std::bitset as you were about to. Its overhead is effectively 0% for the scale of quantity you need. You'll need about 12.5 megabytes of contiguous RAM to hold the bitset.
If you need to store a marking as well as presence, then use std::tr1::unordered_map keyed by Object* with marker_type values. And again, if the percentage of marked nodes is higher than the aforementioned threshold, then you'll want some sort of bitset holding the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit of marker width.
A purpose-built object holding the bitset might be your best choice, given the clarification of the requirements.
Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base std::map structure is no longer permitted.
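A sketch of such a purpose-built holder, generalized from a 0/1 bitset to small integer marks (one byte per id here, so roughly 100 MB for 100 million ids; size the mark type to your needs):

#include <vector>
#include <cstddef>

class MarkerStore
{
    std::vector<unsigned char> marks; // index = object id, value = mark
public:
    explicit MarkerStore(std::size_t max_id) : marks(max_id + 1, 0) {}
    void mark(std::size_t id, unsigned char m) { marks[id] = m; }
    bool is_marked(std::size_t id) const { return marks[id] != 0; }
    unsigned char mark_of(std::size_t id) const { return marks[id]; }
};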
If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.

How fast is the code

I'm developing a game. I store my game objects in this map:
std::map<std::string, Object*> mObjects;
std::string is a key/name used to find the object later in the code. It's very easy to refer to particular objects, like: mObjects["Player"] = .... But I'm afraid it's too slow, due to the std::string allocation on each lookup in that map. So I decided to use int as the key in that map.
The first question: would that really be faster?
And the second: I don't want to give up my current way of accessing objects, so I found a way: use the CRC of the string as the key. For example:
// mObjects would become std::map<int, Object*>, keyed by the checksum.
Object *getObject(const std::string &key)
{
    int checksum = GetCrc32(key);
    // find() avoids operator[]'s side effect of inserting on a miss.
    std::map<int, Object*>::const_iterator it = mObjects.find(checksum);
    return it == mObjects.end() ? NULL : it->second;
}
Object *temp = getObject("Player");
Or is this a bad idea? For calculating the CRC I would use boost::crc. Or is calculating the checksum much slower than searching the map with an std::string key?
Calculating a CRC is sure to be slower than any single comparison of strings, but you can expect to do about log2(N) comparisons before finding the key (e.g. 10 comparisons for 1000 keys), so it depends on the size of your map. CRC can also result in collisions, so it's error-prone (you could detect collisions relatively easily, and possibly even handle them to get correct results anyway, but you'd have to be very careful to get it right).
You could try an unordered_map<> (possibly called hash_map) if your C++ environment provides one - it may or may not be faster but won't be sorted if you iterate. Hash maps are yet another compromise:
the time to hash is probably similar to the time for your CRC, but
afterwards they can often seek directly to the value instead of having to do the binary-tree walk in a normal map
they prebundle a bit of logic to handle collisions.
(Silly point, but if you can continue to use ints and they can be contiguous, then do remember that you can replace the lookup with an array which is much faster. If the integers aren't actually contiguous, but aren't particularly sparse, you could use a sparse index e.g. array of 10000 short ints that are indices into 1000 packed records).
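A sketch of that sparse-index idea (sizes and the Record type are illustrative only):

#include <algorithm> // std::fill

short index_of[10000];  // sparse: maps id -> slot in packed[], -1 = absent
Record packed[1000];    // dense: the actual records, tightly packed
int count = 0;

void init()
{
    std::fill(index_of, index_of + 10000, static_cast<short>(-1));
}

void insert(int id, const Record& r)
{
    packed[count] = r;
    index_of[id] = static_cast<short>(count++);
}

Record* lookup(int id)
{
    short i = index_of[id];
    return i < 0 ? 0 : &packed[i]; // 0 rather than nullptr, pre-C++11
}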
Bottom line is if you care enough to ask, you should implement these alternatives and benchmark them to see which really works best with your particular application, and if they really make any tangible difference. Any of them can be best in certain circumstances, and if you don't care enough to seriously compare them then it clearly means any of them will do.
For the actual performance you need to profile the code and see. But I would be tempted to use hash_map. Although it's not part of the C++ standard library, most of the popular implementations provide it. It provides very fast lookup.
The first question: would that really be faster?
Yes: you'd be comparing an int a few times per lookup, versus comparing strings of arbitrary length a few times per lookup in a potentially large map.
checksum: Or is this a bad idea?
It's definitely not guaranteed to be unique; it's a bug waiting to bite.
What I'd do:
Use multiple collections and embrace type safety:
// perhaps this simplifies things enough that t_player_id can be an int?
std::map<t_player_id, t_player> d_players;
std::map<t_ghoul_id, t_ghoul> d_ghouls;
std::map<t_carrot_id, t_carrot> d_carrots;
Faster searches, more type safety, smaller collections, smaller allocations/resizes, and so on. If your app is very trivial, this won't matter; use this approach going forward, and adjust after profiling as needed for existing programs.
Good luck.
If you really want to know, you have to profile your code and see how long the function getObject takes. Personally, I use valgrind and KCachegrind to profile and render the data on UNIX systems.
I think using an id would be faster. It's faster to compare an int than a string, so...

Need to store string as id for objects in some fast data structure

I'm implementing a session store for a web server. Keys are strings and stored objects are pointers. I tried using a map but need something faster. I will look up an object 5-20 times as often as I insert one.
I tried using a hash map but failed; I felt like I had more constraints than free time.
I'm coding c/c++ under Linux.
I don't want to commit to boost, since my web server is going to outlive boost. :)
This is a highly relevant question, since the hardware (SSD disks) is changing rapidly. What is the right solution now may not be in two years.
I was going to suggest a map, but I see you have already ruled this out.
I tried using map but need something faster.
These are the std::map performance bounds courtesy of the Wikipedia page:
Searching for an element takes O(log n) time
Inserting a new element takes O(log n) time
Incrementing/decrementing an iterator takes O(log n) time (amortized O(1))
Iterating through every element of a map takes O(n) time
Removing a single map element takes O(log n) time
Copying an entire map takes O(n log n) time.
How have you measured and determined that a map is not optimised sufficiently for you? It's quite possible that any bottlenecks you are seeing are in other parts of the code, and a map is perfectly adequate.
The above bounds seem like they would fit within all but the most stringent scalability requirements.
The type of data structure that will be used will be determined by the data you want to access. Some questions you should ask:
How many items will be in the session store? 50? 100000? 10000000000?
How large is each item in the store (byte size)?
What kind of string input is used for the key? ASCII-7? UTF-8? UCS2?
...
Hash tables generally perform very well for look ups. You can optimize them heavily for speed by writing them yourself (and yes, you can resize the table). Suggestions to improve performance with hash tables:
Choose a good hash function! It should distribute keys as evenly as possible across your hash table and should not be time-intensive to compute (this will depend on the format of the key input); a sample hash is sketched after this list.
If you are using bucket chaining, make sure bucket lengths do not exceed 6. If they do, your hash function probably isn't distributing evenly enough; a bucket length of < 3 is preferable.
Watch out for how you allocate your objects. If at all possible, try to allocate them near each other in memory to take advantage of locality of reference. If you need to, write your own sub-allocator/heap manager. Also keep to aligned boundaries for better access speeds (aligned is processor/bus dependent so you'll have to determine if you want to target a particular processor type).
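As an example of a simple, inexpensive string hash with decent distribution, here is 32-bit FNV-1a (one reasonable choice among many):

unsigned int fnv1a(const char* s)
{
    unsigned int h = 2166136261u; // FNV offset basis
    while (*s)
    {
        h ^= static_cast<unsigned char>(*s++);
        h *= 16777619u; // FNV prime
    }
    return h;
}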
B-trees are also very good and in general perform well. (Someone can insert info about B-trees here.)
I'd recommend looking at the data you are storing and making sure it is as small as possible: use shorts, unsigned chars, and bit fields as necessary. There are other ways to squeeze out improved performance as well, such as allocating your string data at the end of your struct at the same time that you allocate the struct, e.g.:
struct foo {
    int a;
    char my_string[0]; // zero-length array (compiler extension): allocate an
                       // instance of foo as one block sized
                       // sizeof(foo) + size of your string data, etc.
};
You may also find that implementing your own string compare routine can actually boost performance dramatically, however this will depend upon your input data.
It is possible to make your own. But you shouldn't have any problems with boost or std::tr1::unordered_map.
A ternary trie may be faster than a hash map for a smaller number of elements.