unordered set, is it worth calling find before insert? - c++

When inserting elements into an std::unordered_set, is it worth calling std::unordered_set::find prior to std::unordered_set::insert? From my understanding, I should always just call insert, as it returns an std::pair containing a bool that tells whether the insertion succeeded.

Calling find before insert is essentially an anti-pattern, typically observed in poorly designed custom set implementations. It might be necessary in implementations that do not tell the caller whether the insertion actually occurred. std::unordered_set (like std::set) does provide you with this information, meaning that it is normally not necessary to perform this find-before-insert dance.
A typical implementation of insert already contains the full equivalent of find, meaning that the find-before-insert approach performs the search twice for no meaningful reason.
However, some other shortcomings of the standard set design do sometimes call for a find-before-insert sequence. For example, your set elements might contain fields that need to be modified if (and only if) the actual insertion occurred: you might have to allocate "permanent" memory for some pointer fields instead of the "temporary" (local) memory these fields were pointing to before the insertion. Unfortunately, this is impossible to do after the insertion, since the container only provides non-modifying access to its elements. One workaround is to do a find first, thus "predicting" whether an actual insertion will occur, and then set up the new element accordingly (e.g. allocating "permanent" memory for all fields) before doing the insert. This is ugly from a performance point of view, but acceptable in non-performance-critical code. That's just how things are with standard containers.

It's best to just attempt the insert, otherwise the effort of hashing and iterating over any elements that have collided in the hash bucket is unnecessarily repeated.

If your set is thread-safe and accessed concurrently, calling find first achieves very little: insert would be atomic, but a check-then-act sequence is susceptible to a race condition.
So in general, and especially in a multithreaded context, just insert.

Related

std::map::insert/operator[] hybrid for inserting mostly-ordered values to STL maps?

I need to merge and edit multiple Motorola S-Record (MOT/SREC) files that each consist of memory addresses and the associated contents, and are therefore generally ordered by the memory address (unless someone hand-edits it or something), and then write the result back out (in order) to a single S-Rec file.
To allow easy ordering and processing of the addresses, I decided to use a map with the memory address as an integer key and the full SREC line (including the address) as a string value, e.g. std::map<int,std::string> mymap;
Since the files may have overlapping regions, the 'merge' will have to allow subsequent values to overwrite any existing ones. Since std::map::insert will not replace an existing value, my thought was to use operator[], but then two things piqued my interest:
std::map::insert also has an overload that takes an iterator position 'hint' and returns an iterator, which seems especially suited to the kind of mostly-ordered insert I'll be doing: the data following an insert is very likely to follow it in order, so why not give the last position as a hint (i.e. just pass the iterator returned by the last insert call to the next insert call)?
From the SGI STL reference, "m[k] is equivalent to (*((m.insert(value_type(k, data_type()))).first)).second", which made me wonder if I could combine the hint form and the operator (w/ replacement) form... After all, complexity for both is logarithmic unless a good hint is provided for the insert, in which case it becomes an 'amortized constant'...
Which all comes down to: Is there an issue with combining insert-with-hint and operator[] by inserting a 'hint' into the operator[] syntax? e.g. (*(m.insert(hint, value_type(k, data_type())))).second = value; (This does still create a value using the default constructor before assigning the new value, which may be avoided somehow, but I'm not too worried about it right now for string values...)
Bonus: is there a reason why this isn't written as (m.insert(hint, value_type(k, data_type())))->second, or in the original example (*((m.insert(value_type(k, data_type()))).first)).second? Is there something special about the iterator or something?
(For what it's worth, "In STL maps, is it better to use map::insert than []?" deals with the choice between insert and operator[], and answers there tend to argue for readability in general, which is fine and is good practice in general, but this question deals especially with ordered (or mostly-ordered) data that may make some level of optimization worth it.)

Is the order of two identical unordered_maps the same?

In other words, if I fill two unordered_map, or unordered_set, objects with exactly the same content and the same hashing function, will iterating over them give the same sequence of key/value pairs?
If so, then what are the conditions for this to hold (e.g. same hashing function, same keys, not necessarily same values).
No. There is no requirement, for example, that objects that have the same hash be placed in any particular order. In fact, in general it's impossible for an unordered map to do this because the only information it has access to is the hash value.
The behaviour here is unspecified: in some situations the sequence will be the same, in others different, and you cannot rely on either. The types you mention are named unordered for a reason; using them as if they were ordered is very bad and dangerous style.
You may find that your particular compiler and library behave in some special way you would like to exploit, but you cannot rely on it. You do not know which conditions cause that behaviour, and any change of compiler or library version may change it.
What is simply forbidden in other languages is often merely left unspecified in C/C++, but you should treat it as forbidden too.
See the C FAQ on the problem of undefined behavior; the concept is common to both C and C++.
Well first I will quote MSDN:
The actual order of elements in the controlled sequence depends on the hash function, the comparison function, the order of insertion, the maximum load factor, and the current number of buckets. You cannot in general predict the order of elements in the controlled sequence. You can always be assured, however, that any subset of elements that have equivalent ordering are adjacent in the controlled sequence.
The above quote is the same for unordered_map and unordered_set: as the name implies, the sequence is unspecified. They do, however, mention that equivalent keys are adjacent.
If you need guaranteed ordering, this is not the data structure for you. In fact, if you're mainly going to be doing iteration, unordered_map/set is not the data structure for you either.
For iteration, a std::map will prove the better data structure, as going from one node to the next is algorithmically simpler. The order of iteration for the objects in a std::map is guaranteed by the spec (and is, in fact, a defining property of the structure itself), assuming you're using the same comparison operator. No hashing is involved in std::map.
Suffice to say, it sounds like you're barking up the wrong tree here. unordered_map should generally be used for benefits such as O(1) lookup, not for storing a list of objects and then iterating over them. It most definitely should not be used if you're trying to get a deterministic iteration order.

Why is inserting multiple elements into a std::set simultaneously faster?

I'm reading through "The C++ Standard Library: A Tutorial and Reference" by Nicolai M. Josuttis, and I'm in the section about Sets & Multisets. I came across a line regarding inserting and removing elements:
"Inserting and removing happens faster if, when working with multiple elements, you use a single call for all elements rather than multiple calls."
I'm far from a data structures master, but I'm aware that they're implemented with red-black trees. What I don't understand is how the STL implementers would write an algorithm to insert multiple elements at once in a faster manner.
Can anyone shed some light on why this quote is true?
My first thought was that it might rebalance the tree only after inserting/erasing the whole range. Since the whole operation is inlined in practice, that seems more likely than the number of function calls.
Checking the GCC headers on my local machine, this doesn't seem to be the case - and anyway, I don't know how the tradeoff between reduced rebalancing activity, and potentially increased search times for intermediate inserts to an unbalanced tree, would work out.
Maybe it's considered a QoI issue, but in any case, using the most expressive method is probably best, not just because it saves you writing a for loop and shows your intention most clearly, but because it leaves library writers latitude to optimise more aggressively in the future, without you needing to know and change your code.
There are two reasons:
1) A single call for multiple elements replaces N separate calls.
2) The insertion operation checks, for each inserted element, whether an element with the same value already exists in the container. This can be optimized when inserting multiple elements together.
What you read as quoted is misleading. Inserting into a std::set is O(log n), unless you use the insert() overload with a position hint, in which case it is amortized O(1) when the hint is correct. And if you use the range overload with already-sorted elements, you get O(n) insertion for the whole range.
Memory management could be a good reason: a single call could allocate the memory just once, while separate calls each try to allocate separately. As far as I know, most set and map implementations try to keep elements within the same page, or in nearby pages, to minimize page faults.
I'm not sure about this, but I think that if the number of elements inserted is smaller than the number of elements already in the set, it can be more efficient to sort the inserted range before performing the insertions. This way, all values can be inserted in a single pass over the tree, and duplicates in the inserted range can easily be eliminated (or inserted very fast, in the case of a multiset).
Of course, this optimization is only possible if the iterators allow sorting the input range (i.e. if they are random-access iterators).

Efficiently erasing elements in tr1::unordered_map

I am experimenting with tr1::unordered_map and stumbled upon the problem of how to efficiently delete elements. The 'erase' method offers to delete either by key or by iterator. I would assume the latter to be more efficient, since the former presumably involves an implicit find operation. On the other hand, my investigations on the internet have revealed that iterators may become invalid after calling the insert() method.
I am interested in a typical real-world situation, where objects put into a hash table have a life span long enough that calls to insert() happen during that life span. May I thus conclude that in such a situation deletion by key is the only option left? Are there any alternatives for deleting objects more efficiently? I am fully aware that the question only matters in applications where deletions happen often. Whether this will be the case for my current project remains to be seen, but I would rather learn about these issues while designing my project than when there is already a lot of code present.
The whole point of the unordered containers is to have the fastest possible lookup time. Worrying about the time it takes to erase an element by key sounds like the classic example of premature optimization.
If it matters a great deal to you, because you're keeping the iterator for some other reason, then C++0x says of std::unordered_map (quoting from the FDIS), in 23.2.5/11:
The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container's bucket count, and z is the container's maximum load factor.
I haven't checked whether the tr1 spec has the same guarantee, but it's fairly logical based on the expected implementation.
If you can use this guarantee, then you can protect your iterators up to a point. As Mark says, though, lookup in unordered_map is supposed to be fast. Keeping a key rather than an iterator is worse than keeping an index rather than an iterator in a vector, but better than the equivalent for map.
Yes, insert() can invalidate all iterators. Therefore, I don't think there's a way to avoid the (implicit) lookup. The good news is that the latter is likely to be cheap.

Least Recently Used cache using C++

I am trying to implement an LRU cache using C++. I would like to know the best design for implementing one. I know LRU should provide find(), add an element, and remove an element, where remove should remove the LRU element. What are the best ADTs to implement this?
For example: if I use a map with the element as value and a time counter as key, I can search in O(log n) time, inserting is O(n), and deleting is O(log n).
One major issue with LRU caches is that there are few "const" operations; most will change the underlying representation (if only because they bump the accessed element).
This is of course very inconvenient, because it means it's not a traditional STL container, and therefore any idea of exposing iterators is quite complicated: when the iterator is dereferenced, that is an access, which should modify the list we are iterating on... oh my.
And there are the performance considerations, both in terms of speed and memory consumption.
It is unfortunate, but you'll need some way to organize your data in a queue (LRU), with the possibility of removing elements from the middle, and this means your elements will have to be independent of one another. A std::list fits, of course, but it's more than you need; a singly-linked list is sufficient here, since you don't need to iterate the list backward (you just want a queue, after all).
However, one major drawback of those is their poor locality of reference; if you need more speed you'll need to provide your own custom (pool?) allocator for the nodes, so that they are kept as close together as possible. This will also alleviate heap fragmentation somewhat.
Next, you obviously need an index structure (for the cache bit). The most natural choice is a hash map. std::tr1::unordered_map, std::unordered_map, and boost::unordered_map are normally good-quality implementations, and some should be available to you. They also allocate extra nodes for hash-collision handling; you might prefer other kinds of hash maps. Check out Wikipedia's article on the subject and read about the characteristics of the various implementation techniques.
Continuing, there is the (obvious) question of threading support. If you don't need thread support, fine; if you do, however, it's a bit more complicated:
As I said, there are few const operations on such a structure, so you don't really need to differentiate read/write accesses.
Internal locking is fine, but you might find that it doesn't play nicely with your uses. The issue with internal locking is that it doesn't support the concept of a "transaction", since it relinquishes the lock between each call. If this is your case, transform your object into a mutex and provide a std::unique_ptr<Lock> lock() method (in debug builds, you can assert that the lock is held at the entry point of each method).
There is (among locking strategies) the issue of reentrance, i.e. the ability to "relock" the mutex from within the same thread; check Boost.Thread for more information about the various locks and mutexes available.
Finally, there is the issue of error reporting. Since it is expected that a cache may not be able to retrieve the data you put in, I would consider using an exception "poor taste". Consider either pointers (Value*) or Boost.Optional (boost::optional<Value&>). I would prefer Boost.Optional because its semantics are clear.
The best way to implement an LRU is to use the combination of a std::list and a hash map such as std::unordered_map (stdext::hash_map on MSVC; if you can only use the pre-C++11 standard library, std::map).
Store the data in the list so that the least recently used element is at the back, and use the map to point to the list items.
For "get", use the map to find the list node, retrieve the data, move the node to the front (since it was just used), and update the map.
For "insert", remove the last element from the list, add the new data to the front, and update the map.
This is the fastest you can get: with a hash map, almost all operations are O(1); with std::map, they take O(log n).
A very good implementation is available here
This article describes a couple of C++ LRU cache implementations (one using STL, one using boost::bimap).
When you say priority, I think "heap" which naturally leads to increase-key and delete-min.
I would not make the cache visible to the outside world at all if I could avoid it. I'd just have a collection (of whatever) and handle the caching invisibly, adding and removing items as needed, but the external interface would be exactly that of the underlying collection.
As far as the implementation goes, a heap is probably the most obvious. It has complexities roughly similar to a map, but instead of building a tree from linked nodes, it arranges items in an array and the "links" are implicit based on array indices. This increases the storage density of your cache and improves locality in the "real" (physical) processor cache.
I suggest a heap and maybe a Fibonacci Heap
I'd go with a normal heap in C++.
With std::make_heap (guaranteed by the standard to be O(n)), std::pop_heap, and std::push_heap from <algorithm>, implementing it would be a piece of cake. You only have to worry about increase-key.