Correct data structure for fast insert and fast search? - c++

I have an array and I need to insert items there as fast as possible. Before adding an item I need to see if it exists, so I do a full array scan. I can't use binary search since I can't sort the array after every insert.
Is there a more efficient data structure for this job?
Edit: In that array I store strings. Next to each string I store a 4-byte hash. I first compare the hashes, and only if they are equal do I compare the strings.

std::unordered_map (usually implemented as a hash table) will give you the best insert/search time (O(1) on average) but does not preserve nor provide any ordering.
std::map gives you O(log n) search and insert; it keeps the elements ordered by key (not by the order in which you inserted them) and is usually implemented as a balanced tree.
Custom balanced search trees are another option if you need sorted order and fast (O(log n)) insert/search.
A sorted std::vector is another option if O(n) insert time is acceptable but you need the smallest memory footprint and O(log n) search time. You'd need to insert each item at its sorted position, which is O(n) because the rest of the array has to be shifted, as sketched below.
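For example (a minimal sketch, with function names of my own choosing), keeping a std::vector sorted on every insert looks roughly like this:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Insert s into the sorted vector v, keeping it sorted and unique.
    // Returns false if s was already present.
    bool insert_sorted(std::vector<std::string>& v, const std::string& s) {
        auto it = std::lower_bound(v.begin(), v.end(), s);  // O(log n) search
        if (it != v.end() && *it == s)
            return false;                                   // already there
        v.insert(it, s);                                    // O(n): shifts the tail
        return true;
    }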
If you need to preserve the original order, you're stuck with O(n) for both insert and search if you use just an array (std::vector).
You can keep a separate std::unordered_map/std::unordered_set alongside the std::vector to perform the "is it already present?" check, gaining speed at the price of roughly 2-3x the memory and the need to update two structures when adding items. This array+hashtable combination gives you amortized O(1) insert and O(1) average search.
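A rough sketch of that combination, assuming the strings from your edit (the struct and member names are just illustrative):

    #include <string>
    #include <unordered_set>
    #include <vector>

    struct OrderedUniqueStrings {
        std::vector<std::string> items;        // preserves insertion order
        std::unordered_set<std::string> seen;  // average O(1) membership test

        // Returns false if s was already present.
        bool add(const std::string& s) {
            if (!seen.insert(s).second)        // average O(1) check-and-insert
                return false;
            items.push_back(s);                // amortized O(1) append
            return true;
        }
    };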

Related

Does there exist a data structure with constant access and insertion/deletion times? [duplicate]

From vector vs. list in STL:
std::vector: Insertions at the end are constant, amortized time, but insertions elsewhere are a costly O(n).
std::list: You cannot randomly access elements, so getting at a particular element in the list can be expensive.
I need a container such that you can both access the element at any index in O(1) time, but also insert/remove an element at any index in O(1) time. It must also be able to manage thousands of entries. Is there such a container?
Edit: If not O(1), some X << O(n)?
There's a theoretical result that says that any data structure representing an ordered list cannot have all of insert, lookup by index, remove, and update take time better than O(log n / log log n), so no such data structure exists.
There are data structures that get pretty close to this, though. For example, an order statistics tree lets you do insertions, deletions, lookups, and updates anywhere in the list in time O(log n) apiece. These are reasonably good in practice, and you may be able to find an implementation online.
Depending on your specific application, there may be alternative data structures that are more tailored toward your needs. For example, if you only care about finding the smallest/biggest element at each point in time, then a data structure like a Fibonacci heap might fit the bill. (Fibonacci heaps are usually slower in practice than a regular binary heap, but the related pairing heap tends to run extremely quickly.) If you're frequently updating ranges of elements by adding or subtracting from them, then a Fenwick tree might be a better call.
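For the Fenwick tree mentioned above, a minimal point-update / prefix-sum sketch (the class is my own illustration, not a library type) looks roughly like this; range-update variants build on the same idea, for example by storing differences:

    #include <vector>

    class FenwickTree {
        std::vector<long long> tree;  // 1-based internally
    public:
        explicit FenwickTree(int n) : tree(n + 1, 0) {}

        // Add delta to element i (0-based) in O(log n).
        void add(int i, long long delta) {
            for (int j = i + 1; j < static_cast<int>(tree.size()); j += j & -j)
                tree[j] += delta;
        }

        // Sum of elements in positions [0, i] in O(log n).
        long long prefix_sum(int i) const {
            long long s = 0;
            for (int j = i + 1; j > 0; j -= j & -j)
                s += tree[j];
            return s;
        }
    };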
Hope this helps!
Look at a couple of data structures.
The Rope
Tree of arrays. The tree is sorted by array index for fast index search.
B+Tree
Sorted tree of sorted arrays. This thing is used by almost every database ever.
Neither one is O(1) because that's impossible. But they are pretty good.

slow std::map for large entries

We have 4,816,703 entries in this format.
1 abc
2 def
...
...
4816702 blah
4816703 blah_blah
Since the number of entries is quite big, I am worried that std::map would take a lot of time during insertion, since it needs to rebalance on every insertion as well.
Only inserting these entries into the map takes a lot of time. I am doing
map[first] = second;
Two questions:
1. Am I correct in using std::map for this kind of case?
2. Am I correct in inserting the above way, or should I use map.insert()?
I am sorry for not doing the experiments and writing down the absolute numbers, but we want a general consensus on whether we are doing the right thing or not.
Also, the keys are not always consecutive.
P.S. Of course, later we will need to access that map as well to get the values corresponding to the keys.
If you don’t need to insert into the map afterwards, you can construct an unsorted vector of your data, sort it according to the key, and then search using functions like std::equal_range.
It’s the same complexity as std::map, but far fewer allocations.
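Something along these lines (a sketch only; the sizes and pair layout are assumptions based on the question):

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::pair<int, std::string>> data;
        data.reserve(4816703);               // one allocation up front

        // ... push_back all (key, value) pairs here, in any order ...
        data.push_back({2, "def"});
        data.push_back({1, "abc"});

        auto by_key = [](const auto& a, const auto& b) { return a.first < b.first; };

        // One O(N log N) sort instead of per-insert rebalancing.
        std::sort(data.begin(), data.end(), by_key);

        // O(log N) lookups afterwards.
        auto range = std::equal_range(data.begin(), data.end(),
                                      std::make_pair(1, std::string{}), by_key);
        if (range.first != range.second) {
            // range.first->second is "abc"
        }
    }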
Use an std::unordered_map, which has much better insertion time complexity than std::map, as the reference mentions:
Complexity
Single element insertions:
Average case: constant.
Worst case: linear in container size.
Multiple elements insertion:
Average case: linear in the number of elements inserted.
Worst case: N*(size+1): number of elements inserted times the container size plus one.
May trigger a rehash (not included in the complexity above).
That's better than the logarithmic time complexity of std::map's insertion.
Note: std::map's insertion can enjoy "amortized constant if a hint is given and the position given is the optimal.". If that's the case for you, then use a map (if a vector is not applicable).
@n.m. provides a representative live demo.
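A sketch of both routes, assuming int keys and string values as in the question:

    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        // Hash-table route: average O(1) per insert; reserving buckets up front
        // avoids most rehashes for ~4.8 million entries.
        std::unordered_map<int, std::string> um;
        um.reserve(4816703);
        um[1] = "abc";           // default-constructs the string, then assigns
        um.emplace(2, "def");    // constructs in place; both are O(1) on average

        // Ordered-map route with a hint: if the keys arrive in increasing order,
        // hinting the end position makes each insert amortized constant.
        std::map<int, std::string> m;
        m.emplace_hint(m.end(), 1, "abc");
        m.emplace_hint(m.end(), 2, "def");
    }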

Is find() function efficient for sets?

As far as I know, binary search is the most efficient way to determine whether a certain element x exists in a sorted array. Thus, I was wondering whether it is a good idea to make use of the find() or count() functions to perform this search, or whether it is more reasonable to use a sorted array rather than a set and apply the binary search method.
Yes it is efficient.
A set contains unique, sorted elements. Therefore find() performs a binary-search-like descent of the underlying tree and has O(log N) complexity in a set of N elements. Insertion is logarithmic too, in order to keep the set sorted and unique.
set::find() is fairly efficient, O(log n).
If you don't need to access the elements in order, you should consider using an unordered_set. unordered_set::find() is O(1) on average.
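For illustration (the values are arbitrary), the two lookups side by side:

    #include <set>
    #include <unordered_set>

    int main() {
        std::set<int> ordered = {1, 3, 5, 7};
        std::unordered_set<int> hashed = {1, 3, 5, 7};

        bool in_ordered = ordered.find(5) != ordered.end();  // O(log n), tree descent
        bool in_hashed  = hashed.find(5) != hashed.end();    // O(1) on average
        (void)in_ordered;
        (void)in_hashed;
    }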

Storing in std::map/std::set vs sorting a vector after storing all data

Language: C++
One thing I can do is allocate a vector of size n, store all the data, and then sort it using sort(begin(), end()). Alternatively, I can keep putting the data into a map or set, which are ordered themselves, so I don't have to sort afterwards. But in this case inserting an element may be more costly due to rearrangements (I guess).
So which is the optimal choice for minimal time for a wide range of n(no. of objects)
It depends on the situation.
map and set are usually implemented as red-black trees; they have to do extra work on every insertion to stay balanced, otherwise operations on them would become very slow. They also don't support random access, so if you only want to sort once, you shouldn't use them.
However, if you want to keep inserting elements into the container while maintaining order, map and set take O(log N) per insertion, while a sorted vector takes O(N). The latter is much slower, so if you insert and delete frequently, you should use map or set.
The difference between the two is noticeable!
Using a set, you get O(log N) complexity for each element you insert, so building the whole structure one insertion at a time costs O(N log N).
Appending everything to a vector is amortized O(1) per element, and sorting it is O(N log N), guaranteed since C++11 (before that, std::sort was only required to be O(N log N) on average).
Once sorted, you can use std::binary_search to get the same lookup complexity as a set.
The API of using a vector as a set isn't the friendliest, although it does give nice performance benefits. This of course is only useful when you can do a bulk insert of the data, or when the number of lookups is much larger than the number of modifications. Algorithms able to sort a partially sorted vector help when you have to extend it later on.
Finally, one has to remark that you don't have the same guarantees of iterator invalidation.
So, why are vectors better? Cache locality!
A vector has all its data in a single contiguous memory block, so the processor can prefetch it, while for a set the memory is scattered around and each node has to be fetched before the address of the next one is even known. This makes a sorted vector a better set implementation than std::set for large data, when you can live with the limitations.
To give you an idea: in the codebase I'm working on, we have several set and map implementations based on vectors, each with its own constraints on how it may be used (for example: no erase, or no operator[]).
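A minimal sketch of that bulk-insert-then-sort ("flat set") pattern, with made-up sizes and values:

    #include <algorithm>
    #include <vector>

    int main() {
        std::vector<int> v;
        v.reserve(1000000);

        // Phase 1: bulk insert, amortized O(1) per push_back.
        for (int i = 0; i < 1000000; ++i)
            v.push_back(999999 - i);

        // Phase 2: one O(N log N) sort, plus deduplication for set semantics.
        std::sort(v.begin(), v.end());
        v.erase(std::unique(v.begin(), v.end()), v.end());

        // Phase 3: O(log N) lookups on contiguous, cache-friendly storage.
        bool found = std::binary_search(v.begin(), v.end(), 42);
        (void)found;
    }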

sorting an STL List after all pushbacks or just use Multimap?

We were using a multimap<int,string> to store several hundred thousand items (>300K), when we realized we needed to add more data for analysis. So we created a class that held a few items and the necessary overridden operators for stl and used a multimap<ourStruct,String>. This worked fine and didn't take much longer than before (with some test data), when we then realized an stl <list> would do just fine, as long as we sorted it after we finished adding all the items. To our surprise, we found that adding all items to multimap still easily beats the total time to add all items to list, and then sort.
This doesn't make sense to us EE types, since by our thinking every insert into the multimap would have to traverse the list and then tack the item onto the end, whereas with a list we would just add onto the end (via push_back), and then hopefully the sort wouldn't take as long.
One more factoid: we initially did the comparison test without sorting the list and were thrilled to see significant speed-ups using the list. Then we added the sort, and were a bit stunned...
Any of the CS gurus out there care to weigh in?
std::multimap uses a balanced tree1, so it does not traverse the entire list when you insert an item. The number of items traversed for an insert is approximately the base 2 logarithm of the number of items in the collection.
Based on what you've said, your best bet would probably be to put your data in a vector, and then sort.
1 Technically, the standard doesn't directly require a balanced tree, but it requires the ability to traverse in sorted order and logarithmic worst-case complexity for insertions and deletions, and I'm not aware of many other data structures that can meet that requirement.
The balanced tree is why only a log2(n) traversal is required per insert.
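As a rough sketch of the vector-then-sort suggestion (the struct and field names are placeholders, not the poster's actual types):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Record {
        int key;               // stand-in for the poster's ourStruct
        std::string payload;
    };

    int main() {
        std::vector<Record> records;
        records.reserve(300000);   // the question mentions >300K items

        // push_back everything first (amortized O(1) each)...
        records.push_back({2, "def"});
        records.push_back({1, "abc"});

        // ...then sort once, O(N log N), with good cache behaviour.
        std::sort(records.begin(), records.end(),
                  [](const Record& a, const Record& b) { return a.key < b.key; });
    }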