Looking for a data structure that's fast to initialize and fast to lookup (O(1)) - c++

I need a data structure in which to store information about which instances I have already processed during an action. Because of limitations I can't store it in the instance itself (e.g. because I may execute the action in parallel).
What's specific is that the instances for which I want to store information have a unique number, so instead of a pointer to the instance, I could use that unique number to store information.
My first solution was to use an std::set<Instance *>. Every time I process an instance, I add it to the set so that I know that I already processed that instance.
Advantage: this is very fast to initialize
Disadvantage: lookups are not O(1), but O(logN)
My second solution was to use an std::vector<bool> (actually std::vector<byte>, because std::vector<bool> has a specialized, bit-packed implementation that makes it slower than a non-bool vector). The unique number of the instance is used as an index into the vector, and the vector simply contains true or false to indicate whether we have already processed the instance (luckily my unique numbers start counting from 1).
Advantage: lookups are O(1)
Disadvantage: initialization is relatively slow, since std::vector needs to initialize every element explicitly (and probably also individually)
I could also use a C-style array (on which I could use memset), but since the number of instances (i.e. the number of unique numbers) is not known beforehand, I would need to write my own code to grow the array, memset the new part, and so on (which is not very hard, but is something I want to avoid).
Is there any other kind of data structure that is very fast to initialize, and has O(1) lookup time?

You may try boost::unordered_set or the new C++11 std::unordered_set. They are hash-based containers rather than trees like std::set.

Well, with such a simple identification method... I would use a hash table.
Can you not use boost::unordered_map or std::unordered_map?
Of course, you might prefer more sophisticated implementations if you want guaranteed O(1) insertion instead of amortized O(1) insertion, but it should get you started.

Related

Build large(ish) unordered set with all data available at the beginning

I have a situation where I need to optimize the creation of an unordered set. Expected number of elements is around 5-25M. My first thought is that I should prepare all data beforehand and do something like
unordered_set s(data);
instead of
for (auto& elem : data)
s.insert(elem);
Can the STL unordered set use bulk loading methods and speed up its creation? How can I tweak the parameters of the hash table (bucket size etc) if I know prior to the table construction the expected number of elements?
This question is quite broad and interesting.
First of all, there is a special member function called reserve - it allows you to pre-allocate the storage for a number of elements before actually inserting them. Pre-allocating sufficient memory (and avoiding reallocations during insertion) is a very powerful approach, which is commonly used for large data sets. Note that it is also available for various standard containers, including vector, unordered_map, etc.
Secondly, if you're using C++11, you might benefit from using move semantics when inserting the elements into your container (given, of course, that you don't need them in your feed once they are placed in the set, which should be true for 5 to 25 million objects).
These two techniques are a good start. You may need to tune further by choosing a different hash function, or even a different implementation of unordered_set. But at this point you should provide more information: what are your value objects and what is their life-cycle, and what insertion time do you find acceptable in your application?
EDIT: of course it's all about C++11, as unordered_set was not available prior to it. Shame on me :)
My focus now is on whether I can use functions like rehash to notify the table of the upcoming size
Suppose you call
unordered_set s(begin(data), end(data));
while the standard doesn't dictate an implementation, a good implementation will be able to discern the number of elements and preallocate the bucket count accordingly. If you look at the source code used by gcc (on my machine, /usr/include/c++/5/tr1/hashtable.h), for example, it uses
_M_bucket_count = std::max(_M_rehash_policy._M_next_bkt(__bucket_hint),
                           _M_rehash_policy._M_bkt_for_elements(
                               __detail::__distance_fw(__f, __l)));
_M_buckets = _M_allocate_buckets(_M_bucket_count);
so it already preallocates size based on the number of elements.
The problem might be different, though. If you look at the documentation, it states:
constructs the container with the contents of the range [first, last). Sets max_load_factor() to 1.0.
This saves space, but might cause collisions. To reduce the collisions, you could use
unordered_set s(begin(data), end(data), k * data.size());
where k > 1 is some constant. This corresponds to a load factor that is 1 / k. YMMV.

Would an unordered_map be a good choice?

I'm wondering if an unordered_map would be a good choice as container for my specific problem. What I've read about maps does not really cover my case, which is:
The container will store between 100 and 500 objects (not int/double...)
The size will never change.
The order is not important as the objects themselves contain some kind of "index".
Very often (!) I need to filter all elements in the container that have some property (e.g. have color==blue)
Currently I use vectors, which works. However, if e.g. an unordered_map would improve performance (with regard to "filtering"), I could imagine changing that.
std::unordered_map wouldn't really help you if you have multiple search criteria (sometimes color == blue, sometimes flavour == up), because maps only offer fast query on a single, pre-determined key.
I'd say std::vector is just fine for you, ideally wrapped in your own structure which will provide the lookup interface. If profiling later tells you this is not fast enough, you could build your own indexes above such data. You wouldn't even have to do that manually, boost::multi_index is a generic container designed for multiple-criterion lookup.
I would use a vector or simply an array for storing the actual data, and have a few maps that map a key to a pointer to the actual data.
This gives higher memory usage, but if searching by different indexes is needed often, it may be worth sacrificing a bit of memory.
A hash table (which std::unordered_map is) provides constant-time lookup for one key (key-value pair). However, its constant factors are always higher (i. e. the lookup is slower) than a simple array (which provides constant-time lookup for integer indices).
If you need to filter a collection of elements based on some criteria, then you need to inspect each individual element. In this case, a hash table would be strictly worse than an array/vector performance-wise, since its computational complexity is the same as that of array indexing, but with worse constant factors.
So no, there's no reason why you would want to use an unordered_map in this case.

What is an efficient way to keep track of a small number of struct types by int id in c++?

I have about 70-150 different structs X with an unsigned integral ID field. They are read in and initialized at program startup and never modified thereafter. What would be the fastest way to access them (which happens a lot) among the following (or some other method)?
Use a std::vector<X> v; where v[X.id] = X; access by doing X& x = v[id]; (this does a copy at the beginning, but later lookups by id are essentially flat-array accesses).
Same as above but with std::vector<X*> v; and X* x = v[id]; I am wary about this one because it has one extra level of indirection.
A std::map<int, X> - feels like overkill compared to the above?
Same as above but with an unordered_map - again, given 70-150 entries, it might not even beat suggestion 3.
Anything more clever? One problem I see with 1 is it might be a bit sparse in access patterns but not sure how to address that if that's the fastest way.
Using a vector will definitely be the fastest approach:
Lookup in a vector has complexity O(1). No need to search or hash; you immediately find your instance.
A map has complexity O(log N): it needs to compare your index with about log N other entries to find your instance.
An unordered_map has complexity O(1), but with some overhead for calculating the hash value (although for simple numbers it will not be much). Additionally, std::unordered_map may put multiple entries behind one hash bucket, so instead of comparing one index it may have to compare several (I think by default it's 4).
I think that for a small number of items there is no faster way to access than using an array or vector, because they provide constant-time access. In the case of many objects this approach is also the fastest, but also the most memory-expensive.
If you don't mind pre-allocating the space for all the structures in advance, and the IDs can be ordered sequentially so they are also indexes, just use approach 1.
If the structures are expensive to construct, delay construction either by not actually placing the construction code in the default constructor (and doing it explicitly via a method call later) or by using a raw array and placement new on memory "slots" within that array.
If IDs are not sequential, use std::unordered_map to prevent waste of space on "holes".

Which is the fastest STL container for find?

Alright, as a preface: I need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently, for performance reasons. This data is heavily used in a read-only sense, as it is referenced often by a much larger set of data in other tables.
I've written a class which will have the ability to store basically the entirety of the two tables in question in memory while listening for commit changes in conjunction with a thread safe callback mechanism for updating the cached objects.
My current implementation has two std::vectors, one for the elements of each table. The class provides both access to the entirety of each vector as well as convenience methods for searching for a specific element of table data via std::find, std::find_if, etc.
Does anyone know if using std::list, std::set, or std::map over std::vector for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.
I'm also open to using C++0x features supported by VS2010 or Boost.
For searching a particular value, std::set and std::map take O(log N) time, while the other two take O(N) time; so std::set or std::map are probably better. Since you have access to C++0x, you could also use std::unordered_set or std::unordered_map, which take constant time on average.
For find_if, there's little difference between them, because it takes an arbitrary predicate and containers cannot optimize arbitrarily, of course.
However if you will be calling find_if frequently with a certain predicate, you can optimize yourself: use a std::map or std::set with a custom comparator or special keys and use find instead.
A sorted vector using std::lower_bound can be just as fast as std::set if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.
Since from your (extended) requirements you need to search on multiple fields, I would point you to Boost.MultiIndex.
This Boost library lets you build one container (with only one copy of each element it contains) and index it over an arbitrary number of indices. It also lets you specify which indices to use.
To determine the kind of index to use, you'll need extensive benchmarks. 500 is a relatively low number of entries, so constant factors will matter more than asymptotic complexity. Furthermore, there can be a noticeable difference between single-threaded and multi-threaded usage (most hash-table implementations can collapse under MT usage because they do not rehash incrementally, and thus a single thread ends up rehashing the whole table, blocking all others).
I would recommend a sorted index (skip-list-like, if possible) to accommodate range requests (all names beginning with Abc?) if the performance difference is unnoticeable or simply does not matter.
If you only want to search for distinct values in one specific column of the table, then a hash container such as std::unordered_set is fastest.
If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector based approach with several hash tables or maps, one for each field to search for, where the value is either an index into the vector, or a direct pointer to the element in the vector.
Going further, if you want to be able to search for ranges, such as all occasions having a date in July, you need an ordered data structure from which you can extract a range.
Not an answer per se, but be sure to use a typedef to refer to the container type you do use, something like typedef std::vector< itemtype > data_table_cache; Then use your typedef type everywhere.

Dynamic array with id?

I need some sort of dynamic array in C++ where each element has its own id represented by an int.
The datatype needs these functions:
int Insert() - return ID
Delete(int ID)
Get(ID) - return Element
What datatype should I use? I've looked at Vector and List, but can't seem to find any sort of ID. I've also looked at map and hashtable; these may be useful. However, I'm not sure what to choose.
I would probably use a vector and a free-id list to handle deletions; then the index is the id. This is really fast to insert and get, and fairly easy to manage (the only trick is the free list for deleted items).
Otherwise you probably want to use a map and just keep track of the lowest unused id and assign it upon insertion.
A std::map could work for you; it allows you to associate a key with a value. The key would be your ID, but you would have to provide it yourself when adding an element to the map.
A hash table is the basic mechanism that can be used to implement an unordered map; it corresponds to std::unordered_map.
It seems that the best container to use is unordered_map.
It is based on a hash table. You can insert, delete, or search for elements in O(1) on average.
Prior to C++11, unordered_map was not part of the standard library; if you want to use a standard container, use std::map.
It is based on a tree. Inserts, deletes, and searches take O(log N).
Still, the container choice depends a lot on usage intensity. For example, if lookups are rare, vector or list could be fine. These containers do not have a find method, but the <algorithm> header provides std::find.
A vector gives constant-time random access; the "id" can simply be the offset (index) into the vector. A deque is similar, but doesn't store all items contiguously.
Either of these would be appropriate if the ID values start at 0 (or at a known offset from 0) and increment monotonically. Over time, if there are a large number of removals, either vector or deque can become sparsely populated, which may be detrimental.
std::map doesn't have the problem of becoming sparsely populated, but lookups move from constant time to logarithmic time, which could impact performance.
boost::unordered_map may be the best of all, as a hash table will likely have the best overall performance characteristics for this use case. However, use of the boost library may be necessary -- although there are also unordered container types in std::tr1, if available in your STL implementation.