C++ Data Structure that would be best to hold a large list of names - c++

Can you share your thoughts on what the best STL data structure would be for storing a large list of names and performing searches on these names?
Edit:
The names are not unique, and the list can grow, as new names can continuously be added to it. By large I mean from 1 million to 10 million names.

Since you want to search names, you want a structure that supports fast random access. That means vector, deque and list are all out of the question. Also, vector/array are slow on random adds/inserts for sorted sets because they have to shift items to make room for each inserted item. Adding to the end is very fast, though.
Consider std::map, std::unordered_map or std::unordered_multimap (or their siblings std::set, std::unordered_set and std::unordered_multiset if you are only storing keys).
If you are purely going to do unique, random access, I'd start with one of the unordered_* containers.
If you need to store an ordered list of names, and need to do range searches/iteration and sorted operations, a tree based container like std::map or std::set should do better with the iteration operation than a hash based container because the former will store items adjacent to their logical predecessors and successors. For random access, it is O(log N) which is still decent.
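To illustrate the range-search point, here is a minimal sketch, assuming the names live in a std::multiset (the list is not unique). The helper name count_with_prefix is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Count the names that start with `prefix` in an ordered multiset.
// Names sharing a prefix are adjacent in sorted order, so we seek with
// lower_bound (O(log N)) and then walk forward while the prefix matches.
// (count_with_prefix is a hypothetical helper, not a library function.)
std::size_t count_with_prefix(const std::multiset<std::string>& names,
                              const std::string& prefix) {
    std::size_t count = 0;
    for (auto it = names.lower_bound(prefix);
         it != names.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it) {
        ++count;
    }
    return count;
}
```

A hash-based container cannot do this without visiting every element, because it keeps no ordering between keys.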
Prior to std::unordered_*, I used std::map to hold large numbers of objects for an object cache, and though there are faster random access containers, it scaled well enough for our uses. The newer unordered_map is a hashed structure with O(1) average access time, so it should give you near-best access times.

You can consider concatenating those names using a delimiter, but searching might take a hit. You would need to come up with an adjusted binary search.
But you should try the obvious solution first, which is a hash map, called unordered_map in the STL. See if that meets your needs. Searching should be plenty fast there, but at a cost in memory.
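As a rough sketch of the hash map approach, assuming you only need to know whether (and how often) a name occurs: since names repeat, one option is to map each distinct name to a count. The function names below are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Build a name -> occurrence-count index. A repeated name costs one
// entry plus a counter instead of one entry per occurrence.
// (build_name_index and occurrences are hypothetical helper names.)
std::unordered_map<std::string, std::size_t>
build_name_index(const std::vector<std::string>& names) {
    std::unordered_map<std::string, std::size_t> index;
    index.reserve(names.size());  // pre-size the buckets to limit rehashing while loading
    for (const auto& name : names) {
        ++index[name];
    }
    return index;
}

// Average O(1) lookup; returns 0 when the name is absent.
std::size_t occurrences(const std::unordered_map<std::string, std::size_t>& index,
                        const std::string& name) {
    auto it = index.find(name);
    return it == index.end() ? 0 : it->second;
}
```

If every occurrence has to be kept as a separate entry, std::unordered_multiset would be the closer fit.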

Related

Implementation of an Unordered Map

I'm trying to understand Unordered Maps and Hashing. As I understand it, an Unordered Map has a Hash function inside it that takes an object of type T and returns an int, which is then used as an index into an internal array. Each array position holds a List of objects of type T, so that if there's something already in the spot, additions are inserted into the List.
Conceptually, would using a Set instead of a List improve efficiency?
(maybe somehow binary search and a Set being ordered helps over having a List)
Or maybe a Vector instead of the List?
(maybe random access helps over the List.)
The datatype should not matter much, because in most cases, the container at the hashed index only contains zero or one element. If you regularly have many elements there, the hash map degrades in performance anyway. The remedy for that is to resize the initial array, which std::unordered_map<> does itself. However, if you have a bad hash function which causes many hash collisions, switching the hash function is necessary for proper operation.
If there are often a lot of collisions in the same bucket, then using a set is more efficient than using a list, and indeed some Java hash table implementations have adopted sets for this reason. Vectors can't be used for std::unordered_map or std::unordered_set implementations, as they need to reallocate to a different memory area when grown past their capacity, whilst the Standard requires that the elements in an unordered container are never moved by other operations on the container.
That said, the nature of hash tables is that - with a high quality hash function - the statistical distribution of number-of-elements colliding in particular buckets relates only to the load factor. If you can't trust the collisions not to get out of control, perhaps you shouldn't be using that hash function.
Some details: Standard-library unordered containers have a default max_load_factor() (load_factor() is the ratio of size() to bucket_count()) of 1.0, and with a strong pseudo-randomizing hash function they'll have 1/e ~= 36.8% of buckets empty, as many with one element, half that with 2 elements (~18.4%), a third of that with 3 elements (~6.13%), a quarter of that with 4 elements (~1.53%), a fifth of that with 5 elements (~0.3%), a sixth of that with 6 elements (~0.05%). As you can hopefully see, it's incredibly rare to have to search through many elements (even in the worst case scenario where the hash table is at its max load factor), so a list approach is usually adequate.
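The percentages above match a Poisson distribution whose mean is the load factor. Here is a small sketch that reproduces them, assuming keys are spread uniformly at random over a large number of buckets. The function name is made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Probability that a bucket holds exactly k elements when keys are spread
// uniformly at random and the load factor is `load` (Poisson approximation):
//   P(k) = exp(-load) * load^k / k!
// (bucket_occupancy_probability is a hypothetical helper name.)
double bucket_occupancy_probability(unsigned k, double load = 1.0) {
    double p = std::exp(-load);
    for (unsigned i = 1; i <= k; ++i) {
        p *= load / i;  // multiply in load^k / k! one factor at a time
    }
    return p;
}
```

With load = 1.0 this gives roughly 0.368 for k = 0 and k = 1, 0.184 for k = 2, and about 0.0005 for k = 6.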

Unordered_map produce secondary key

I'm using strings as a type of key for my unordered_map but is it possible that I could associate a secondary unique key, independent from the primary, so I could perform a find operation with the second key?
I was thinking that the key could be a hash number the internal hash algorithm came up with.
I thought of including an id (increasing by 1 each time) in the structure I'm saving, but then again, I would first have to look up the key, which is a string.
Reason behind this: I want to make lists that reference some of the elements in the unordered_map, but saving strings in the lists is very inefficient compared to saving int or long long. (I would prefer not to use pointers but rather a bookkeeping style of procedure.)
You couldn't use a number produced by the internal hash algorithm as a key. The bucket an element is stored in changes when the table grows (this is called rehashing), and hash values are not guaranteed to be unique (they certainly won't be).
Keeping pointers to elements in your list will work just fine, since unordered_map doesn't invalidate pointers or references to its elements, even when it rehashes. Deleting elements will be hard, though.
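A quick way to convince yourself of the pointer stability is a check like the following sketch. The function name is invented, and the 10000 insertions are just an arbitrary number large enough to make the table grow.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Returns true if a pointer taken before many insertions still refers to
// the same element afterwards. Growing from 1 to about 10000 entries makes
// the table rehash along the way on typical implementations.
// (pointer_survives_growth is a hypothetical name for this demonstration.)
bool pointer_survives_growth() {
    std::unordered_map<std::string, int> table;
    table["first"] = 1;
    const int* p = &table.at("first");

    for (int i = 0; i < 10000; ++i) {
        table["key" + std::to_string(i)] = i;
    }

    return p == &table.at("first") && *p == 1;
}
```

Iterators are a different story: they may be invalidated by a rehash, so store pointers or references rather than iterators.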
Boost has multi_index_container, which provides many useful database-like functions. It will be perfect for your task.
If you don't want to use Boost, you could use unordered_map with unique integer indices, and another unordered_map which keeps string->index pairs for searching by string keys. The deletion will also be hard, because either you will check all your lists each time you delete a record, or you will check if the record still exists each time you are traversing the list.
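The two-map idea could look roughly like this. It is only a sketch: Record, Registry and all member names are hypothetical, and it assumes names are unique within the registry.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical record type, for illustration only.
struct Record {
    std::string name;
    int payload;
};

// Hypothetical registry: records are stored by integer id, and a second
// map translates the string key into that id.
class Registry {
public:
    // Inserts a record and returns its id; returns the existing id if the name is taken.
    std::uint64_t add(const std::string& name, int payload) {
        auto found = id_by_name_.find(name);
        if (found != id_by_name_.end()) return found->second;
        const std::uint64_t id = next_id_++;
        records_.emplace(id, Record{name, payload});
        id_by_name_.emplace(name, id);
        return id;
    }

    // Lookup by the secondary (integer) key; nullptr if the record was erased.
    const Record* find_by_id(std::uint64_t id) const {
        auto it = records_.find(id);
        return it == records_.end() ? nullptr : &it->second;
    }

    // Lookup by the primary (string) key.
    const Record* find_by_name(const std::string& name) const {
        auto it = id_by_name_.find(name);
        return it == id_by_name_.end() ? nullptr : find_by_id(it->second);
    }

    // Removes the record from both maps; ids are never reused.
    bool erase(std::uint64_t id) {
        auto it = records_.find(id);
        if (it == records_.end()) return false;
        id_by_name_.erase(it->second.name);
        records_.erase(it);
        return true;
    }

private:
    std::unordered_map<std::uint64_t, Record> records_;
    std::unordered_map<std::string, std::uint64_t> id_by_name_;
    std::uint64_t next_id_ = 0;
};
```

External lists can then store the integer ids and treat a nullptr from find_by_id as "this record has been deleted", which is the second of the two deletion strategies mentioned above.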

How efficient is iterating through an unordered_set?

Does iterating through an unordered_set require looking through each bucket of the hash table? If so, wouldn't that be very inefficient? If I want to frequently iterate over a set but still need remove in O(1) time is unordered_set still the best data structure to use?
As it happens, common implementations of std::unordered_set link all elements together much as a std::forward_list does, so traversing the container is basically equivalent to traversing a list. In any case, when in doubt, profile your program and see if the results meet your needs.
Will iterating through a hash table be slower than iterating through a vector? Yes. A vector will store its elements contiguously; hash tables need some way to identify if a bucket contains data or not. Some hash tables give each bucket a linked list of values that map to the same bucket; others use other methods. Either way, an unordered_set iterator needs to look at each bucket and determine if it's empty. That's not as fast as pointer arithmetic.
However, I would not classify the extra time spent looking at empty buckets as "very inefficient". Just because it's not as fast as a sorted vector doesn't mean it's inefficient. You still have cache locality on your side, since buckets probably don't take up that much memory, and testing for an empty one is just a single cached memory fetch.
In the end, every data structure has tradeoffs. If you want O(1) lookup and removal, a hash table is the only way to get that. That means iteration is going to take longer than it would for a vector. But not as long as it would for a set.
Hash tables store data in a vector, and everything is indexed by converting the key into a hash number (typically a long), which becomes the index in the vector of the desired element. There are also hash tables that use vectors inside vectors to do this.
If you iterate through a std::unordered_set, it costs only O(n), because it's like iterating through a std::vector.
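For the "iterate often but still remove cheaply" case from the question, a single pass that erases while iterating might look like this sketch. The helper name erase_matching is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Visit every element once and erase those for which `pred` returns true.
// erase(iterator) is average O(1) and returns the iterator to the next
// element, so no element is skipped or visited twice.
// (erase_matching is a hypothetical helper name.)
template <typename Pred>
std::size_t erase_matching(std::unordered_set<int>& values, Pred pred) {
    std::size_t removed = 0;
    for (auto it = values.begin(); it != values.end();) {
        if (pred(*it)) {
            it = values.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Whether this is fast enough compared with a vector is something only a profile of your own workload can tell you.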

Fast C++ simplified map\dictionary like collection?

I need a simple API
int id = Put(value);
void Remove(id);
// and some kind of for_each that could iterate over all data (there will be around 1000 objects or fewer); this operation will be called far more often than Put and Remove
There is a lot of criticism in many C++ conference talks about how slow std::map is. I do not actually need all of its members anyway, so I am looking for an alternative that supports my reduced needs.
It mostly comes down to what characteristics you need/care about. A map provides:
unique keys
fast lookup by key
in-order iteration by key
insertion/deletion of arbitrary items
order is always maintained
If you don't need all those, you might be better off with some other container. If you don't care about order or uniqueness, then you could just create a vector of int/pointer (or guid/pointer) pairs, and add new pairs as needed. This supports fast iteration, but makes finding an individual item relatively slow and does nothing to maintain uniqueness. If you want ordering at specific times, you can use sort to get that. Likewise, if you want uniqueness at specific times, you can use sort followed by unique to get that. When it's sorted, finding items by key is generally quite fast (usually faster than with map).
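A minimal sketch of the "vector of pairs, sorted on demand" approach, assuming int keys and string values purely for illustration (Entry, sort_and_dedupe and find_value are invented names):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;  // (key, value); hypothetical layout

// Sort by key and drop entries whose key repeats, keeping the first of each run.
void sort_and_dedupe(std::vector<Entry>& entries) {
    std::stable_sort(entries.begin(), entries.end(),
                     [](const Entry& a, const Entry& b) { return a.first < b.first; });
    entries.erase(std::unique(entries.begin(), entries.end(),
                              [](const Entry& a, const Entry& b) { return a.first == b.first; }),
                  entries.end());
}

// Binary search by key; only valid while `entries` is sorted by key.
const std::string* find_value(const std::vector<Entry>& entries, int key) {
    auto it = std::lower_bound(entries.begin(), entries.end(), key,
                               [](const Entry& e, int k) { return e.first < k; });
    return (it != entries.end() && it->first == key) ? &it->second : nullptr;
}
```

Appending with push_back breaks the ordering until sort_and_dedupe is called again, so this fits workloads that add in batches and search in between.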
If you care about uniqueness but don't need the items in order when you iterate them, you might want to consider an unordered_map instead. Like a map, this still lets you find an individual item quickly, but iteration doesn't happen in a meaningful order. Like with a sorted vector, finding an individual item by key is usually faster than with map.
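For the Put/Remove/for_each API in the question, one possible sketch (not the only way to do it) keeps the values in a contiguous vector for fast iteration and uses an unordered_map only to translate an id into a position. DenseStore and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical container: values live contiguously in a vector, and a
// side table maps each id to the value's current position.
template <typename T>
class DenseStore {
    struct Entry {
        int id;
        T value;
    };

public:
    // Appends the value and returns a fresh id (ids are never reused).
    int Put(T value) {
        const int id = next_id_++;
        index_of_[id] = items_.size();
        items_.push_back(Entry{id, std::move(value)});
        return id;
    }

    // Removes by id in average O(1): the last entry is moved into the hole
    // so the vector stays dense. Iteration order is not preserved.
    bool Remove(int id) {
        auto it = index_of_.find(id);
        if (it == index_of_.end()) return false;
        const std::size_t idx = it->second;
        index_of_.erase(it);
        if (idx != items_.size() - 1) {
            items_[idx] = std::move(items_.back());
            index_of_[items_[idx].id] = idx;
        }
        items_.pop_back();
        return true;
    }

    // Calls f on every stored value, walking contiguous memory.
    template <typename F>
    void ForEach(F&& f) {
        for (auto& e : items_) f(e.value);
    }

    std::size_t Size() const { return items_.size(); }

private:
    std::vector<Entry> items_;
    std::unordered_map<int, std::size_t> index_of_;
    int next_id_ = 0;
};
```

The design choice here is to pay a hash lookup on Put and Remove, which are rare, so that the frequent for_each is a plain vector walk.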

Maintaining an Ordered collection of objects

I have the following requirements for a collection of objects:
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
Ordered, but allowing reorder and insertion at arbitrary locations.
Allows for deletion
Indexed Access - Random Access
Count
The objects I am storing are not large, a couple of properties and a small array or two (256 booleans)
Are there any built-in classes I should know about before I go writing a linked list?
Original answer: That sounds like std::list (a doubly linked list) from the Standard Library.
New answer:
After your change to the specs, a std::vector might work as long as there aren't more than a few thousand elements and not a huge number of insertions and deletions in the middle of the vector. The linear complexity of insertion and deletion in the middle may be outweighed by the low constants on the vector operations. If you are doing a lot of insertions and deletions just at the beginning and end, std::deque might work as well.
-Insertion and Deletion: This is possible for any STL container, but the question is how long it takes to do it. Any node-based container (list, map, set) will do this without moving other elements: constant time for a list once you have the position, logarithmic for map and set, which are trees that must first locate the position. Array-like containers (vector) will do it in linear time (with constant-amortized allocation).
-Sorting: Considering that you can keep a sorted collection at all times, this isn't much of an issue, any STL container will allow that. For map and set, you don't have to do anything, they already take care of keeping the collection sorted at all times. For vector or list, you have to do that work, i.e. you have to do binary search for the place where the new elements go and insert them there (but STL Algorithms has all the pieces you need for that).
-Resorting: If you need to take the current collection you have sorted with respect to rule A, and resort the collection with respect to rule B, this might be a problem. Containers like map and set are parametrized (as a type) by the sorting rule, this means that to resort it, you would have to extract every element from the original collection and insert them in a new collection with a new sorting rule. However, if you use a vector container, you can just use the STL sort function anytime to resort with whatever rule you like.
-Random Access: You said you needed random access. Personally, my definition of random access means that any element in the collection can be accessed (by index) in constant time. With that definition (which I think is quite standard), any linked-list implementation does not qualify and it leaves you with the only option of using an array-like container (e.g. std::vector).
In conclusion, to have all those properties, it would probably be best to use a std::vector and implement your own sorted insertion and sorted deletion (performing binary search into the vector to find the element to delete or the place to insert the new element). If the objects that you need to store are of significant size, and the data according to which they are sorted (name, ID, etc.) is small, you might consider splitting the problem by holding an unsorted linked list of objects (with full information) and keeping a sorted vector of keys along with a pointer to the corresponding node in the linked list (in that case, of course, use std::list for the former, and std::vector for the latter).
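The sorted insertion and deletion described above can be sketched with the standard algorithms. The int element type and the function names are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Insert `value` so that `sorted` stays in ascending order.
// Finding the slot is O(log N); shifting the tail is O(N).
// (sorted_insert and sorted_erase are hypothetical helper names.)
void sorted_insert(std::vector<int>& sorted, int value) {
    sorted.insert(std::upper_bound(sorted.begin(), sorted.end(), value), value);
}

// Erase one occurrence of `value`; returns false if it is not present.
bool sorted_erase(std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    if (it == sorted.end() || *it != value) return false;
    sorted.erase(it);
    return true;
}
```

Indexed access stays a plain sorted[i], which is the property the linked containers cannot offer.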
BTW, I'm no grand expert with STL containers, so I might have been wrong in the above, just think for yourself. Explore the STL for yourself, I'm sure you will find what you need, or at least all the pieces that you need. Maybe, look at Boost libraries too.
You haven't specified enough of your requirements to select the best container.
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
STL containers are designed to grow as needed.
Ordered, but allowing reorder and insertion at arbitrary locations.
Allowing reorder? A std::map can't be reordered: you can delete from one std::map and insert into another using a different ordering, but as different template instantiations the two variables will have different types. std::list has a sort() member function [thanks Blastfurnace for pointing this out], particularly efficient for large objects. A std::vector can be resorted easily using the non-member std::sort() function, particularly efficient for tiny objects.
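To make the resorting options concrete, here is a small sketch with a made-up Person type and two made-up ordering rules. The same rule can be handed to std::sort for a vector or to the sort() member of a list.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <string>
#include <vector>

struct Person {  // hypothetical element type
    std::string name;
    int age;
};

// Two hypothetical ordering rules.
bool by_name(const Person& a, const Person& b) { return a.name < b.name; }
bool by_age(const Person& a, const Person& b) { return a.age < b.age; }

// A vector can be re-sorted in place under any rule with std::sort.
void resort(std::vector<Person>& people, bool (*rule)(const Person&, const Person&)) {
    std::sort(people.begin(), people.end(), rule);
}

// std::list has its own sort() member, which relinks nodes instead of moving elements.
void resort(std::list<Person>& people, bool (*rule)(const Person&, const Person&)) {
    people.sort(rule);
}
```

A std::map has no equivalent, because its ordering rule is part of its type.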
Efficient insertion at arbitrary locations can be done in a map or list, but how will you find those locations? In a list, searching is linear (you must start from somewhere you already know about and scan forwards or backwards element by element). std::map provides efficient searching, as does an already-sorted vector, but inserting into a vector involves shifting (copying) all the subsequent elements to make space: that's a costly operation in the scheme of things.
Allows for deletion
All containers allow for deletion, but you have the exact-same efficiency issues as you do for insertion (i.e. fast for list if you already know the location, fast for map, deletion in vectors is slow, though you can "mark" elements deleted without removing them, e.g. making a string empty, having a boolean flag in a struct).
Indexed Access - Random Access
vector is indexed numerically (but can be binary searched), map by an arbitrary key (but no numerical index). list is not indexed and must be searched linearly from a known element.
Count
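Before C++11, some implementations of std::list provided an O(n) size() function (so that they could provide O(1) splice), but you can easily track the size yourself (assuming you won't splice). Since C++11, size() is required to be O(1) for std::list as well. Other STL containers already have O(1) time for size().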
Conclusions
Consider whether using a std::list will result in lots of inefficient linear searches for the element you need. If not, then a list does give you efficient insertion and deletion. Resorting is good.
A map or hash map will allow quick lookup and easy insertion/deletion. It can't be resorted, but you can easily move the data out to another map with another sort criterion (with moderate efficiency).
A vector allows fast searching and in-place resorting, but the worst insert/deletion. It's the fastest for random-access lookup using the element index.