Best STL data structure to find unordered elements - C++

I'm currently trying to implement a hash table in C++ for a homework assignment...
I've chosen to use chaining (linking entries within each bucket) as the solution for collisions in the table...
and I'm looking for a good STL container that will find a specific entry in an unordered set of data.
I can't use an STL container that is based on trees (set, map, etc.)
Right now I'm using a vector. Is it a good choice? The search time will be linear, right? Can it be better?

Since, as you say, the buckets can get big, it's better to use std::list. Searching is linear in both cases, but adding elements is constant-time in std::list.
"I guess they're all the same, since the data isn't ordered" - No, they are not. If they were, there would be just one container. Each container has its own advantages and disadvantages; different containers are used for different situations.
A little information about vector:
std::vector has a capacity that is distinct from its size, which is why it has both capacity() and size() methods. Suppose the capacity is 4 and you have 2 elements; then size() will be 2. Adding another element just increments the size (to 3), and that is very fast.
But what happens when you have to add a 5th element and the capacity is 4? A completely new block of memory is allocated, all the old elements are copied into the new memory, and all the old elements are destroyed (their destructors are called, for user-defined types). Then the old memory has to be freed. These are expensive operations if adding/removing elements happens often.
You can avoid this by using the std::vector::reserve method to reserve some memory in advance, so the vector doesn't reallocate new memory all the time and copy everything over and over again. But this is only useful when you know the approximate size of these vectors, and I suppose you don't in your situation (reserving a lot of memory isn't a good solution either; you shouldn't waste memory just like that). So, again, I'd prefer std::list.
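For illustration, a minimal sketch of size() vs. capacity() and reserve() (the exact capacity growth is implementation-defined):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    v.push_back(1);
    v.push_back(2);
    std::cout << v.size() << ' ' << v.capacity() << '\n';   // e.g. "2 2" or "2 4"

    std::vector<int> w;
    w.reserve(100);                  // one allocation up front
    for (int i = 0; i < 100; ++i)
        w.push_back(i);              // no reallocations or copies happen here
    std::cout << w.size() << ' ' << w.capacity() << '\n';   // "100 100" (capacity may be larger)
}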
Or use double hashing (an alternative collision-resolution scheme).
Anyway, this allocating of new memory and copying of objects will not happen that often, as std::vector is "clever": when it allocates new space, it doesn't increase the capacity by only 1 element, but by a constant factor (many implementations roughly double it). That's why adding an element at the end takes what is called "amortized constant time".
NOTE: Whatever you choose, I'd suggest you pay attention to the hash function. It's the most important thing here. A hash container should NOT have too many elements with the same hash. So, my advice is to search for a good hash function; then the container choice will not matter that much.
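For example, here is a simple general-purpose string hash you could start from - FNV-1a (just a sketch; std::hash is another ready-made option):

#include <cstdint>
#include <string>

// FNV-1a: a simple, decent-quality 64-bit string hash.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ull;                   // FNV prime
    }
    return h;
}

// Bucket index for a table with bucket_count buckets:
// std::size_t index = fnv1a(key) % bucket_count;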
Hope that helped (:
EDIT: I'd recommend this article comparing std::vector and std::deque - it's excellent, comparing memory usage (allocating, deallocating, growing), CPU usage, etc. I'd recommend the whole site for such articles - there aren't many, but they are really well written.

std::tr1::unordered_set could be what you need.
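For example (std::tr1::unordered_set comes from TR1; since C++11 the same container exists as std::unordered_set):

#include <string>
#include <unordered_set>   // <tr1/unordered_set> and std::tr1:: pre-C++11

int main() {
    std::unordered_set<std::string> entries;
    entries.insert("alice");
    entries.insert("bob");
    // Hash-based lookup: average O(1), no ordering maintained.
    bool found = entries.find("alice") != entries.end();
    return found ? 0 : 1;
}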

Related

Memory efficient std::map alternative

I'm using a std::map to store about 20 million entries. If they were stored without any container overhead, it would take approximately 650MB of memory. However, since they are stored using std::map, it uses up about 15GB of memory (i.e. too much).
The reason I am using a std::map is that I need to find keys that are equal to / larger / smaller than x. This is why something like sparsehash wouldn't work (since, using that, I cannot find keys by comparison).
Is there an alternative to using std::map (or ordered maps in general) that would result in less memory usage?
EDIT: Writing performance is much more important than reading performance. It will probably only read ~10 entries, but I don't know which entries it will read.
One alternative would be to use flat_map from Boost.Container: it supports the same interface as std::map, but is backed by a sorted contiguous array (think std::vector) instead of a tree. Or hand-roll your own solution based on the same idea.
Its performance characteristics are of course different, due to the different back-end. It's up to you to evaluate whether it's usable in your case.
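A quick sketch of what that looks like (assuming Boost is available; flat_map supports reserve() precisely because it is vector-backed):

#include <boost/container/flat_map.hpp>
#include <cstdint>

int main() {
    boost::container::flat_map<std::uint64_t, int> m;
    m.reserve(20000000);        // one contiguous block, no per-node overhead
    m[7]  = 1;
    m[42] = 2;
    // Ordered queries still work, e.g. the first key >= 10:
    auto it = m.lower_bound(10);
    return (it != m.end() && it->first == 42) ? 0 : 1;
}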
Are you writing on-the-fly, or one time before the lookups are done? If the latter is the case, you don't need a map: you could use std::vector and a one-time sort.
You could just insert everything unsorted into the vector, sort once after everything is there (O(N log N), just like building the std::map, but with much better constants), and then look up in the sorted array (O(log N), the same as the std::map).
And especially if you know the number of elements before reading, you can reserve the vector's size upfront, which would work pretty well. Or, if you at least know some upper bound, reserve perhaps slightly more than actually needed and still avoid the reallocations.
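A minimal sketch of that pattern (the Entry type and the counts are hypothetical stand-ins):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Entry { std::uint64_t key; int value; };

int main() {
    std::vector<Entry> entries;
    entries.reserve(20000000);        // if the count (~20 million here) is known up front
    entries.push_back({3, 30});       // ... push_back all entries, unsorted ...
    entries.push_back({1, 10});
    entries.push_back({123, 99});
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.key < b.key; });
    // O(log N) lookup of the first entry with key >= x:
    std::uint64_t x = 123;
    auto it = std::lower_bound(entries.begin(), entries.end(), x,
                               [](const Entry& e, std::uint64_t k) { return e.key < k; });
    return (it != entries.end() && it->key == 123) ? 0 : 1;
}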
Given your requirements:
Insertion needs to be quick
There are many elements to read
Read-back can be slow
You only read back data once
I'd consider typedef std::pair<uint64, thirty_six_byte_struct> element; and populate a std::list<element>. That will be hard to beat in terms of performance.
For reading back, I'd simply traverse the linked list, checking at every node whether you need that element. That's an O(N) traversal, but as you say, you'll only do it once.
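Sketched out (thirty_six_byte_struct here is just a placeholder for the real 36-byte payload):

#include <cstdint>
#include <list>
#include <utility>

struct thirty_six_byte_struct { char data[36]; };
typedef std::pair<std::uint64_t, thirty_six_byte_struct> element;

int main() {
    std::list<element> elements;
    elements.push_back({42, {}});      // O(1) insertion, no reallocation ever
    for (const element& e : elements) {
        if (e.first == 42) { /* one of the ~10 keys we actually need */ }
    }
}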
Turns out the issue wasn't std::map.
I realized I was using 3 separate maps to represent various parts of the same data, and after slimming them down to 1, the difference in memory was entirely negligible.
Looking at the code a little more, I realized that code I had written to free a really expensive struct (one per map element) didn't actually work.
Fixing that part, it now uses <1GB of memory, as it should! :)
TL;DR: std::map's overhead is entirely negligible for this. The issue was my own.

Fast data structure that supports finding the minimum element and accessing, inserting, removing and updating data at any index

I'm looking for ideas to implement a templatized sequence container data structure which can beat the performance of std::vector in as many features as possible and potentially perform much faster. It should support the following:
Finding the minimum element (and returning its index)
Insertion at any index
Removal at any index
Accessing and updating any element by index (via operator[])
What would be some good ways to implement such a structure in C++?
You can generally be pretty sure that the STL implementations of all containers are very good at the range of tasks they were designed for. That is to say, you're unlikely to be able to build a container that is as robust as std::vector and quicker for all applications. However, generally speaking, it is almost always possible to beat a generic tool when optimizing for a specific application.
First, let's think about what a vector actually is. You can think of it as a pointer to a C-style array, except that its elements are stored on the heap. Unlike a C array, it also provides a bunch of methods that make it a little more convenient to manipulate. But like a C array, all of its data is stored contiguously in memory, so lookups are extremely cheap; the price is that changing its size may require the entire array to be copied elsewhere in memory to make room for the new elements.
Here are some ideas for how you could do each of the things you're asking for better than a vanilla std::vector:
1. Finding the minimum element: Search is typically O(N) for many containers, and certainly for a vector (you need to iterate through all elements to find the lowest). You can make it O(1), or very close to free, by simply keeping track of the smallest element at all times and only updating it when the container changes; a sketch follows this list.
2. Insertion at any index: If your elements are small and there are not many of them, I wouldn't bother tinkering here; just do what the vector does and keep elements contiguous in memory to keep lookups quick. If you have large elements, store pointers to the elements instead of the elements themselves (Boost's stable_vector will do this for you). Keep in mind that this makes lookups more expensive, because you now need to dereference a pointer, so whether you want to do this will depend on your application. If you know the number of elements you are going to insert, std::vector provides the reserve method, which preallocates some memory for you; what it doesn't do is let you decide how the allocated memory grows. So if your application warrants lots of push_back operations without enough information to intelligently call reserve, you might be able to beat the standard std::vector implementation by tailoring your container's growth function to your particular needs. Another option is a linked list (e.g. std::list), which will beat a std::vector at insertion for larger containers. However, the cost is that lookup (see 4.) becomes vastly slower (O(N) instead of O(1) for vectors), so you're unlikely to want to go down this path unless you plan to do more insertions/erasures than lookups.
3. Removal at any index: Similar considerations as for 2.
4. Accessing and updating any element by index (via operator[]): The only way you can beat std::vector here is by making sure your data is in the cache when you try to access it. This is because lookup in a vector is essentially an array lookup, which is just some pointer arithmetic and a pointer dereference. If you don't access your vector often, you might be able to squeeze out a few clock cycles by using a custom allocator (see Boost pools) and placing your pool close to the stack pointer.
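Here is a minimal sketch of the min-tracking idea from point 1 (hypothetical class and method names; writes go through update() rather than a mutable operator[] so the cached index stays consistent):

#include <cstddef>
#include <vector>

// A vector wrapper that caches the index of the minimum element. Reading the
// minimum is O(1); the cache is rebuilt (O(N)) only when the current minimum
// is removed or overwritten.
template <typename T>
class MinTrackingVector {
    std::vector<T> data_;
    std::size_t min_idx_ = 0;

    void rescan() {   // O(N) fallback when the cached minimum is invalidated
        min_idx_ = 0;
        for (std::size_t i = 1; i < data_.size(); ++i)
            if (data_[i] < data_[min_idx_]) min_idx_ = i;
    }

public:
    void insert(std::size_t index, const T& value) {
        bool was_empty = data_.empty();
        data_.insert(data_.begin() + index, value);
        if (was_empty) return;                     // min_idx_ is already 0
        if (index <= min_idx_) ++min_idx_;         // old minimum shifted right
        if (value < data_[min_idx_]) min_idx_ = index;
    }
    void remove(std::size_t index) {
        data_.erase(data_.begin() + index);
        if (data_.empty()) { min_idx_ = 0; return; }
        if (index == min_idx_) rescan();           // removed the minimum: O(N)
        else if (index < min_idx_) --min_idx_;     // minimum shifted left
    }
    void update(std::size_t index, const T& value) {
        data_[index] = value;
        if (index == min_idx_) rescan();           // may no longer be minimal
        else if (value < data_[min_idx_]) min_idx_ = index;
    }
    const T& operator[](std::size_t i) const { return data_[i]; }
    std::size_t min_index() const { return min_idx_; }
};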
I stopped writing mainly because there are dozens of ways in which you could approach this problem.
At the end of the day, this is probably more of an exercise in teaching you that the implementation of std::vector is likely to be extremely efficient for most compilers. All of these suggestions are essentially micro-optimizations (and premature optimization is the root of all evil), so please don't blindly apply them in important code; they're highly likely to end up costing you a lot of time and headache.
However, that's not to say you shouldn't tinker and learn for yourself, so by all means go ahead and try to beat it for your application and let us know how you go! Good luck :)

Initializing a std::map when the size is known in advance

I would like to initialize a std::map. For now I am using insert, but I feel I am wasting some computational time since I already know the size I want to allocate. Is there a way to allocate a fixed-size map and then fill it?
No, the members of the map are internally stored in a tree structure. There is no way to build the tree until you know the keys and values that are to be stored.
The short answer is: yes, this is possible, but it's not trivial. You need to define a custom allocator for your map. The basic idea is that your custom allocator will set aside a single block of memory for the map. As the map requires new nodes, the allocator will simply assign them addresses within the pre-allocated block. Something like this:
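// MyAllocator is a user-defined pool allocator (not shown); note that
// reserve() below is a method of that custom allocator, not something
// std::allocator provides.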
std::map<KeyType, ValueType, std::less<KeyType>, MyAllocator> myMap;
myMap.get_allocator().reserve( nodeSize * numberOfNodes );
There are a number of issues you'll have to deal with, however.
First, you don't really know the size of each map node or how many allocations the map will perform. These are internal implementation details. You can experiment to find out, but you can't assume that the results will hold across different compilers (or even future versions of the same compiler). Therefore, you shouldn't worry about allocating a "fixed" size map. Rather, your goal should be to reduce the number of allocations required to a handful.
Second, this strategy becomes quite a bit more complex if you want to support deletion.
Third, don't forget memory alignment issues. The pointers your allocator returns must be properly aligned for the various types of objects the memory will store.
All that being said, before you try this, make sure it's necessary. Memory allocation can be very expensive, but you still shouldn't assume that it's a problem for your program. Measure to find out. You should also consider alternative strategies that more naturally allow pre-allocation, for example a sorted vector or a std::unordered_map.
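For instance, std::unordered_map (std::tr1::unordered_map before C++11) lets you pre-size its bucket table:

#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> m;
    m.reserve(1000);    // sizes the bucket table for 1000 elements up front,
                        // so no rehashing occurs during the inserts below
                        // (per-node allocations still happen on each insert)
    for (int i = 0; i < 1000; ++i)
        m.emplace(i, "value");
    return m.size() == 1000 ? 0 : 1;
}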
Not sure if this answers your question, but Boost.Container has a flat_map in which you can reserve space. Basically you can see this as a sorted vector of (key, value) pairs. Tip: if you also know that your input is sorted, you can use insert with hint for maximal performance.
There are several good answers to this question already, but they miss some primary points.
Initialize the map directly
The map knows the size up front if initialized directly with iterators:
auto mymap = std::map(it_begin, it_end);
This is the best way to dodge the issue. If you are agnostic about the implementation, the map can then learn the size up front from the iterators, and you have moved the problem into the std:: implementation's hands.
Alternatively, use insert with a pair of iterators instead, that is:
mymap.insert(it_begin, it_end);
See: https://en.cppreference.com/w/cpp/container/map/insert
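Put together, a minimal sketch (building the input range first, then handing it to the map in one go; the range constructor shown relies on C++17 class template argument deduction):

#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<int, std::string>> input = {
        {1, "one"}, {2, "two"}, {3, "three"}
    };
    // Range construction (C++17 deduces the map's template arguments):
    auto mymap = std::map(input.begin(), input.end());
    // Or range insert into an existing map:
    std::map<int, std::string> other;
    other.insert(input.begin(), input.end());
    return mymap.size() == other.size() ? 0 : 1;
}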
Beware of Premature optimization
but I feel I am wasting some computational time.
This sounds a lot like you are optimizing prematurely (meaning you do not know where the bottleneck is - you are guessing, or seeing an issue that isn't really one). Instead, measure first and then optimize; repeat if necessary.
Memory allocation is likely already optimized, to a large degree
Rolling your own block allocator for the map could be close to fruitless. On modern systems (by which I include the OS/hardware and the C++ language level), memory allocation is already very well optimized for the general case, and you could be looking at little or no improvement from rolling your own block allocator. Even if you take a lot of care and get the map into one contiguous array - an improvement in itself - you could still face the problem that, in the end, the elements are placed randomly within the array (e.g. in insertion order) and are less cache-friendly anyway (this depends very much on your actual use case, though; I'm assuming a very large data set).
Use another container or third party map
If you are still facing this issue, the best approach is probably to use another container (e.g. a sorted std::vector, with std::lower_bound for lookups) or a third-party map optimized for how you are using the map. A good example is flat_map from Boost - see this answer.
Conclusion
Let the std::map worry about the issue.
When performance is the main issue: use a data structure (perhaps 3rd party) that best suits how your data is being used (random inserts or bulk inserts / mostly iteration or mostly lookups / etc.). You then need to profile and gather performance metrics to compare.
You are talking about block allocators, but they are hard to implement. Measure before thinking about such hard things. Anyway, Boost has some articles about implementing a block allocator, or you can use an already-implemented preallocated map such as Stree.

Something like a deque on large numbers of items, but small memory usage on small numbers?

I have a whole bunch of objects of a certain type, each of which may allocate a deque to hold other objects of that same type. I am using a deque because I need fast access at both ends, and because any particular object could possibly refer to many other objects.
However, it's likely the case that many or even most of the objects refer to very few other objects. In this case, the memory usage of deque is pretty big. The implementation I'm using is allocating 4096 bytes at a shot, as soon as I do the very first push_back(). Each element in the deque is only 8 bytes. That's a whole lot of wasted space, especially because I'm making many of these objects, and hence many of these deques.
At the same time, I pretty much need a deque (or something like it), because like I said, any particular object can actually refer to many other objects, despite the fact that most objects refer to very few other objects.
My first thought was using capacity() and reserve() to grow the deque myself, but my compiler informed me that there are no such functions on deque.
So, I was thinking perhaps to write a class with a deque-like interface, underlying which is a vector and a deque, with the vector used until (say) sixteen elements exist, after which the vector is thrown away and the deque is used from there on out.
Since the vector is only used when there are only a small number of elements, it shouldn't really matter too much that push_front() and pop_front() are going to be inefficient in terms of speed, and since I can control the vector with capacity() and reserve(), it shouldn't really matter too much that deque uses a lot of memory when more elements exist.
But, before rolling my own class like this, I wanted to check to see if something like this already exists. Also, if anybody knows of any reason I haven't thought of why something like this is a bad idea, or if anybody has any related suggestions, I'd love to hear it.
Thanks in advance.
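For reference, a compressed sketch of the hybrid described above (all names and the switch-over threshold are hypothetical; only a few operations are shown):

#include <cstddef>
#include <deque>
#include <vector>

template <typename T, std::size_t Threshold = 16>
class small_deque {
    std::vector<T> vec_;   // backing store while the container is small
    std::deque<T>  deq_;   // backing store once we outgrow the threshold
    bool big_ = false;

    void maybe_promote() {
        if (!big_ && vec_.size() >= Threshold) {
            deq_.assign(vec_.begin(), vec_.end());
            vec_.clear();
            vec_.shrink_to_fit();   // give the vector's memory back
            big_ = true;
        }
    }

public:
    void push_back(const T& v) {
        if (big_) { deq_.push_back(v); return; }
        vec_.push_back(v);
        maybe_promote();
    }
    void push_front(const T& v) {
        if (big_) { deq_.push_front(v); return; }
        vec_.insert(vec_.begin(), v);   // O(N), but N < Threshold here
        maybe_promote();
    }
    void pop_front() {
        if (big_) deq_.pop_front();
        else      vec_.erase(vec_.begin());
    }
    void pop_back() {
        if (big_) deq_.pop_back();
        else      vec_.pop_back();
    }
    T& front() { return big_ ? deq_.front() : vec_.front(); }
    T& back()  { return big_ ? deq_.back()  : vec_.back(); }
    std::size_t size() const { return big_ ? deq_.size() : vec_.size(); }
};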
You don't mention whether you need other capabilities of vector or deque, like random-access iterators. If you don't, this actually sounds like a good candidate for list, which has good performance inserting and removing at both ends.
You could use an (intrusive) list if you don't need random access by index. Lists allow quick O(1) push_front/push_back() and pop_front/pop_back().
If objects are not shared, that is, an object is only ever owned by at most one other object, then an intrusive list would be best. And since your objects are all of the same type, they can be allocated from one memory pool (a big array) to avoid any per-allocation memory overhead.
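A hand-rolled illustration of the intrusive idea (hypothetical names; Boost.Intrusive offers a production-quality version):

// The links live inside the object itself, so a node needs no separate
// allocation, and an object can be unlinked in O(1) given only its pointer.
struct Object {
    Object* prev = nullptr;
    Object* next = nullptr;
    // ... payload ...
};

struct IntrusiveList {
    Object* head = nullptr;
    Object* tail = nullptr;

    void push_back(Object* o) {
        o->prev = tail; o->next = nullptr;
        if (tail) tail->next = o; else head = o;
        tail = o;
    }
    void remove(Object* o) {
        if (o->prev) o->prev->next = o->next; else head = o->next;
        if (o->next) o->next->prev = o->prev; else tail = o->prev;
        o->prev = o->next = nullptr;
    }
};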

selection of data structure

I use C++. Say I want to store 40 usernames; I will simply use an array. However, if I want to store 40,000 usernames, is this still a good idea in terms of search speed? Which data structure should I use to improve this speed?
You need to specify what the insertion and removal requirements are. Do things need to be removed and inserted at random points in the sequence?
Also, why the requirement to search sequentially? Are you doing searches that aren't suitable for a hash table lookup?
At the moment I'd suggest a deque or a list. Often it's best to choose a container with the interface that makes for the simplest implementation for your algorithm and then only change the choice if the performance is inadequate and an alternative provides the necessary speedup.
A vector has two principal advantages: there is no per-object memory overhead (although vectors will over-allocate to prevent frequent copying), and objects are stored contiguously, so sequential access tends to be fast. These are also its disadvantages. Growing a vector requires reallocation and copying, and insertion and removal anywhere other than the end of the vector also require copying. Contiguous storage can be a problem for vectors with large numbers of objects, or with large objects, since the contiguous storage requirement can be hard to satisfy even with only mild memory fragmentation.
A list doesn't require contiguous storage, but list nodes usually have a per-object overhead of two pointers (in most implementations). This can be significant for lists of very small objects (e.g. in a list of pointers, each node is 3x the size of the data item). Insertion and removal from the middle of a list is very cheap, though, and list nodes never need to be moved in memory once created.
A deque uses chunked storage, so it has a low per-object overhead similar to a vector, but doesn't require contiguous storage over the whole container so doesn't have the same problem with fragmented memory spaces. It is often a very good choice for collections and is often overlooked.
As a rule of thumb, prefer vector to list or, deity forbid, a C-style array.
After the vector is filled, make sure it is properly ordered using the sort algorithm. You can then search for a particular record using either find, binary_search or lower_bound. (You don't need to sort to use find.)
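For example, a small sketch of that sort-then-search flow:

#include <algorithm>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> usernames = {"carol", "alice", "bob"};
    std::sort(usernames.begin(), usernames.end());
    // O(log N) membership test on the sorted vector:
    bool exists = std::binary_search(usernames.begin(), usernames.end(), "alice");
    // lower_bound also gives the position of the match:
    auto it = std::lower_bound(usernames.begin(), usernames.end(), "alice");
    return (exists && it != usernames.end()) ? 0 : 1;
}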
Seriously, unless you are in a resource-constrained environment (embedded platform, phone, or the like), use a std::map: save yourself the effort of sorting and searching and let the container take care of everything. This will typically be a sorted tree structure, probably balanced (e.g. red-black), which means you will get good search performance. Unless the size of your data is close to the size of one or two pointers, the memory overhead of whatever data structure you pick is negligible. Your graphics card probably has more memory than you are going to use for the data you are thinking about.
As others have said, there is very little good reason to use a vanilla array. If you don't want to use a map, use std::vector or std::list, depending on whether you need to insert/delete data (=> list) or not (=> vector).
Also consider whether you really need all that data in memory; how about putting it on disk via SQLite? You could even use SQLite for in-memory access. It all depends on what you need to do with your data.
std::vector and std::list seem good for this task. You can use an array if you know the maximum number of records beforehand.
If you only need sequential search and storage, then list is the proper container.
A vector wouldn't be a bad choice either.