Fast C++ simplified map/dictionary-like collection?

I need a simple API
int id = Put(value);
void Remove(int id);
// plus some kind of for_each that can iterate over all the data (roughly 1000 objects or fewer); iteration will be called far more often than Put and Remove
There is a great deal of criticism at C++ conferences about how slow std::map is, and I don't actually need most of its interface anyway. Under that impression, I'm looking for an alternative that would support my minimal needs.

It mostly comes down to what characteristics you need/care about. A map provides:
unique keys
fast lookup by key
in-order iteration by key
insertion/deletion of arbitrary items
order is always maintained
If you don't need all those, you might be better off with some other container. If you don't care about order or uniqueness, then you could just create a vector of int/pointer (or guid/pointer) pairs, and add new pairs as needed. This supports fast iteration, but makes finding an individual item relatively slow and does nothing to maintain uniqueness. If you want ordering at specific times, you can use sort to get that. Likewise, if you want uniqueness at specific times, you can use sort followed by unique to get that. When it's sorted, finding items by key is generally quite fast (usually faster than with map).
If you care about uniqueness but don't need the items in order when you iterate them, you might want to consider an unordered_map instead. Like a map, this still lets you find an individual item quickly, but iteration doesn't happen in a meaningful order. Like with a sorted vector, finding an individual item by key is usually faster than with map.
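For the Put/Remove/for_each shape described in the question, a plain vector of id/value pairs is often enough. Here is a minimal sketch along those lines (the IdVector name, the swap-and-pop removal and the monotonically increasing id are illustrative choices, not anything standard):

#include <cstdint>
#include <utility>
#include <vector>

// Minimal sketch of the asker's Put/Remove/for_each API on top of a plain
// vector of (id, value) pairs. Iteration is a contiguous scan; Remove is a
// linear search plus swap-and-pop, which is fine for ~1000 elements.
template <typename T>
class IdVector {
public:
    int Put(T value) {
        int id = next_id_++;
        items_.emplace_back(id, std::move(value));
        return id;
    }

    void Remove(int id) {
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (items_[i].first == id) {
                items_[i] = std::move(items_.back());  // swap-and-pop: order is not preserved
                items_.pop_back();
                return;
            }
        }
    }

    template <typename F>
    void ForEach(F&& f) const {
        for (const auto& [id, value] : items_) f(id, value);
    }

private:
    std::vector<std::pair<int, T>> items_;
    int next_id_ = 0;
};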

Related

Produce a secondary key for an unordered_map

I'm using strings as the key type for my unordered_map, but is it possible to associate a secondary unique key, independent of the primary one, so I can perform a find operation with the secondary key?
I was thinking that the key could be a hash number the internal hash algorithm came up with.
I thought of adding an id (increasing by 1 each time) to the structure I'm saving, but then I would still have to look up the string key first.
Reason behind this: I want to make lists that reference some of the elements in the unordered_map, and storing strings in those lists is very inefficient compared to storing an int or long long. (I would prefer not to use pointers but rather a bookkeeping-style approach.)
You can't use the number the internal hash algorithm comes up with, because the bucket an element lands in can change when the table grows in size (this is called rehashing), and hash values are not guaranteed to be unique in any case (collisions are unavoidable).
Keeping pointers to elements in your lists will work just fine, since unordered_map doesn't invalidate pointers to its elements. Deleting elements will still be awkward, though.
Boost has multi_index_container, which provides many useful database-like functions. It will be perfect for your task.
If you don't want to use Boost, you could use an unordered_map keyed by unique integer indices, plus another unordered_map holding string->index pairs for lookups by string key. Deletion will still be hard: either you check all your lists each time you delete a record, or you check whether the record still exists each time you traverse a list.
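A minimal sketch of that non-Boost approach, assuming a hypothetical Record type and a monotonically increasing id counter:

#include <string>
#include <unordered_map>
#include <utility>

// Sketch of the two-map approach: records live in an id-keyed map, and a
// second map translates string keys to ids. Lists elsewhere can then store
// the cheap integer ids instead of strings.
struct Record { std::string name; int payload; };

struct Registry {
    std::unordered_map<long long, Record> by_id;
    std::unordered_map<std::string, long long> id_by_name;
    long long next_id = 0;

    long long insert(std::string name, int payload) {
        long long id = next_id++;
        id_by_name[name] = id;
        by_id[id] = Record{std::move(name), payload};
        return id;
    }

    // Lists that hold ids must either be cleaned up here or must check
    // by_id.count(id) when they are traversed, as noted above.
    void erase(const std::string& name) {
        auto it = id_by_name.find(name);
        if (it == id_by_name.end()) return;
        by_id.erase(it->second);
        id_by_name.erase(it);
    }
};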

Is there a data structure that works like a map but also allows the sequence of values to be used independently of the keys?

Often I have a map for which the keys are only used for importing, exporting and setup. During the performance critical stage, the values are of importance, not the keys.
I would like to use a data structure that acts like a map but gives me the option to just use a simple std::vector of mapped values when the keys are irrelevant.
The implementation seems reasonably simple: just use a pair of equally sized vectors storing the keys and values respectively, kept sorted with respect to the key vector. Insertions and deletions will be less efficient than in boost::flat_map, but that is a compromise I'm willing to make in order to get instant access to the vector of values, unencumbered by keys.
Is this a terrible idea? If not, is there an existing implementation I can use?
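For what it's worth, a minimal sketch of the pair-of-vectors idea described in the question (the SplitFlatMap name is made up, and this is not an existing library):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sketch: two parallel vectors, kept sorted by key, so the values can be
// handed out as a plain std::vector when the keys are irrelevant.
// Insertion shifts both vectors, as with boost::flat_map.
template <typename Key, typename Value>
class SplitFlatMap {
public:
    void insert(const Key& key, Value value) {
        auto it = std::lower_bound(keys_.begin(), keys_.end(), key);
        auto pos = it - keys_.begin();
        keys_.insert(it, key);
        values_.insert(values_.begin() + pos, std::move(value));
    }

    Value* find(const Key& key) {
        auto it = std::lower_bound(keys_.begin(), keys_.end(), key);
        if (it == keys_.end() || *it != key) return nullptr;
        return &values_[it - keys_.begin()];
    }

    // The whole point: direct access to the values with no keys attached.
    const std::vector<Value>& values() const { return values_; }

private:
    std::vector<Key> keys_;
    std::vector<Value> values_;
};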

C++ Data Structure that would be best to hold a large list of names

Can you share your thoughts on what the best STL data structure would be for storing a large list of names and perform searches on these names?
Edit:
The names are not unique, and the list can grow as new names are continuously added to it. By large I mean from 1 million to 10 million names.
Since you want to search names, you want a structure that supports fast lookup by key. That rules out vector, deque and list. Also, vector/array are slow at random insertions into a sorted sequence because they have to shift items to make room for each inserted item; adding at the end is very fast, though.
Consider std::map, std::unordered_map or std::unordered_multimap (or their siblings std::set, std::unordered_set and std::unordered_multiset if you are only storing keys).
If you are purely going to do lookups of individual names, I'd start with one of the unordered_* containers.
If you need to store an ordered list of names and need to do range searches/iteration and sorted operations, a tree-based container like std::map or std::set should do better at iteration than a hash-based container, because the former stores items adjacent to their logical predecessors and successors. Lookup is O(log N), which is still decent.
Prior to std::unordered_*, I used std::map to hold large numbers of objects for an object cache, and though there are faster containers for lookup, it scaled well enough for our uses. The newer unordered_map is a hashed structure with average O(1) access time, so it should give you close to the best access times.
You could consider concatenating the names with a delimiter, but searching would take a hit; you would need to come up with an adjusted binary search.
But you should try the obvious solution first, which is a hash map, called unordered_map in the standard library. See if that meets your needs; searching should be plenty fast there, at a cost in memory.
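A minimal sketch of that "try the hash map first" advice; since the names are not unique, an unordered_multiset is used here (the reserve figure is just illustrative):

#include <iostream>
#include <string>
#include <unordered_set>

// Since the names are not unique, an unordered_multiset gives average O(1)
// insertion and lookup at the cost of extra memory for the hash table.
int main() {
    std::unordered_multiset<std::string> names;
    names.reserve(10'000'000);        // avoid repeated rehashing while loading millions of names

    names.insert("Alice");
    names.insert("Bob");
    names.insert("Alice");            // duplicates are allowed

    std::cout << names.count("Alice") << '\n';                  // 2
    std::cout << (names.find("Carol") != names.end()) << '\n';  // 0 (not found)
}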

When to choose std::vector over std::map for key-value data?

Considering the positive effect of caching and data locality when searching in primary memory, I tend to use std::vector<> with std::pair<>-like key-value items and perform linear searches for both, if I know that the total amount of key-value items will never be "too large" to severely impact performance.
Lately I've been in lots of situations where I know beforehand that I will have huge amounts of key-value items and have therefore opted for std::map<> from the beginning.
I'd like to know how you make your decisions for the proper container in situations like the ones described above.
Do you
always use std::vector<> (or similar)?
always use std::map<> (or similar)?
have a gut feeling for where in the item-count range one is preferable over the other?
something entirely different?
Thanks!
I only rarely use std::vector with a linear search (except in conjunction with binary searching as described below). I suppose for a small enough amount of data it would be better, but with that little data it's unlikely that anything is going to provide a huge advantage.
Depending on usage pattern, a binary search on an std::vector can make sense though. A std::map works well when you need to update the data regularly during use. In quite a few cases, however, you load up some data and then you use the data -- but after you've loaded the data, it mostly remains static (i.e., it changes very little, if at all).
In this case, it can make a lot of sense to load the data into a vector, sort it if necessary, and then do binary searches on the data (e.g. std::lower_bound, std::equal_range). This gives pretty much the best of both worlds -- low-complexity binary searches and good cache usage from high locality of reference (i.e., the vector is contiguous, as opposed to the linked structure of a std::map). The shortcoming, of course, is that insertions and deletions are slow -- but this is one time I have used your original idea -- store newly inserted data separately until it reaches some limit, and only then sort it in with the rest of the data, so a single search consists of a binary search of the main body of the data, followed by a linear search of the (small amount) of newly inserted data.
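A minimal sketch of that load, sort, then binary-search pattern, using std::lower_bound on a vector of pairs (the data and key are of course just illustrative):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Load key/value pairs into a vector, sort once by key, then answer lookups
// with binary search over contiguous memory.
using Entry = std::pair<std::string, int>;

int main() {
    std::vector<Entry> data = {
        {"cherry", 3}, {"apple", 1}, {"banana", 2},
    };

    // Sort once after loading; keep it sorted from here on.
    std::sort(data.begin(), data.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });

    // Binary search by key.
    std::string key = "banana";
    auto it = std::lower_bound(data.begin(), data.end(), key,
                               [](const Entry& e, const std::string& k) { return e.first < k; });
    if (it != data.end() && it->first == key) {
        // found: it->second == 2
    }
    return 0;
}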
I would never make the choice solely on (possibly bogus) "efficiency" grounds, but always on what I am actually going to do with the container. Do I want to store duplicates? Is insertion order important? Will I sometimes want to search for the value not the key? Those kind of things.
Have you considered using sorted data structures? They tend to offer logarithmic searches and inserts - a reasonable trade-off. Personally I don't have any hard and fast rules other than liking maps for the ability to key on a human-readable/understandable value.
Of course there's plenty of discussion as well on the efficiency of maps vs. lists/vectors (sorted and unsorted) - if your key is a string that's 10,000 characters, it can take longer to do a string compare than to search through a list of just a few items, so you want to make sure that you can efficiently compare keys as well.
I almost always prefer to use map (or unordered_map, when a hash container makes more sense) vs. a vector.
That being said, I think your reasoning is backwards. I would tend to use a vector only when there are huge amounts of data, since a vector will have a smaller memory footprint.
With the right kinds of datasets, you can load a vector and then sort it and binary_search it with a smaller footprint and similar performance characteristics to a map, especially if the dataset is stable after load.
Why are you not taking unordered_map into account?

Is there a data structure that doesn't allow duplicates and also maintains order of entry?

Duplicate: Choosing a STL container with uniqueness and which keeps insertion ordering
I'm looking for a data structure that acts like a set in that it doesn't allow duplicates to be inserted, but also knows the order in which the items were inserted. It would basically be a combination of a set and list/vector.
I would just use a list/vector and check for duplicates myself, but we need that duplicate verification to be fast as the size of the structure can get quite large.
Take a look at Boost.MultiIndex. You may have to write a wrapper over this.
A Boost.Bimap with the insertion order as an index should work (e.g. boost::bimap<size_t, Foo>). If you are removing objects from the data structure, you will need to track the next insertion-order value separately.
Writing your own class that wraps a vector and a set would seem the obvious solution - there is no C++ standard library container that does what you want.
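A minimal sketch of such a wrapper, assuming the element type is hashable and that removals aren't needed (the class name is illustrative):

#include <string>
#include <unordered_set>
#include <vector>

// An unordered_set answers the fast "have I seen this?" question, and a
// vector remembers insertion order. Removal is omitted; erasing from the
// middle of the vector would be O(n).
template <typename T>
class InsertionOrderedSet {
public:
    bool insert(const T& value) {
        if (!seen_.insert(value).second) return false;  // duplicate, rejected
        order_.push_back(value);
        return true;
    }

    bool contains(const T& value) const { return seen_.count(value) != 0; }

    // Iterate in the order the items were first inserted.
    const std::vector<T>& in_order() const { return order_; }

private:
    std::unordered_set<T> seen_;
    std::vector<T> order_;
};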
Java has this in the form of an ordered set. I don't think C++ has one, but it is not that difficult to implement yourself. What the Sun guys did with the Java class was to extend the hash table so that each item is simultaneously inserted into the hash table and kept in a doubly linked list. There is very little overhead in this, especially if you preallocate the nodes used to build the linked list.
If I were you, I would write a class that either uses a private vector to store the items or implements a hash table itself. When an item is to be inserted into the set, check whether it is already in the hash table; if it is, optionally replace the existing item there, then update the corresponding list node to point to the new element and you are done.
To insert a genuinely new element you do the same, except you have to append a new node to the list rather than reusing an old one.
To delete an item, relink the list around it and free the list node.
Note that each element should be able to give you its own list node directly, so you don't have to walk the chain every time you move or change an element.
If you anticipate a lot of churn during the program run, you might want to keep a free list of nodes, so you can simply take the head of that list rather than allocating memory each time you add a new element.
You might want to look at the dancing links algorithm.
I'd just use two data structures, one for order and one for identity. (One could point into the other if you store values, depending on which operation you want the fastest)
Sounds like a job for an OrderedDictionary.
Duplicate verification that's fast seems to be the critical part here. I'd use some type of map/dictionary and keep track of the insertion order myself as the mapped value: the key is the item you're inserting (which is hashed, and duplicate keys aren't allowed), and the mapped value is the current size of the map at the time of insertion. Of course this only works if you don't have any deletions. If you need deletions, just keep an external counter that you increment on every insertion, and the relative order will still tell you when things were inserted.
Not necessarily pretty, but not that hard to implement either.
Assuming that you're talking ANSI C++ here, I'd either write my own or use composition and delegation to wrap a map for data storage and a vector of the keys for order of insertion. Depending on the characteristics of the data, you might be able to use the insertion index as your map key and avoid using the vector.