Does binary search have logarithmic performance of deque C++ data structure? - c++

The standard says that std::binary_search(...) and the two related functions std::lower_bound(...) and std::upper_bound(...) are O(log n) if the data structure has random access. So, given that, I presume that these algorithms have O(log n) performance on std::deque (assuming its contents are kept sorted by the user).
However, it seems that the internal representation of std::deque is tricky (it's broken into chunks), so I was wondering: does the requirement of O(log n) search hold for std::deque.

Yes it still holds for deque because the container is required to provide access to any element in constant time (just a slightly higher constant than vector).
That doesn't relieve you of the obligation for the deque to be sorted though.

Yes. deque has constant time access for an index if it is there. It is organized in pages of an equal size. What you have is smth. like a vector of pointers to pages. If you have let's say 2 pages with 100 elements and you would like to access 103rd element than determine the page by 103/100 = 1 and determine the index in the page by 103 %100 => 3. Now use a constant time access in two vectors to get the element.
Here the statement from http://www.cplusplus.com/reference/stl/deque/:
Deques may be implemented by specific libraries in different ways, but in all cases they allow for the individual elements to be accessed through random access iterators, with storage always handled automatically (expanding and contracting as needed).

Just write a program to test that, I think deque's implementation of binary_search will slower than vector's, but it's complexity is still O(logn)

Related

What container to choose for fast search/insert with huge amounts of data?

So it's a thought experiment. I want to have a huge collection of structures such as:
struct
{
KeyType key;
ValueType value;
}
And I need fast access by a key and fast insertion of new values.
I would not use std::map cuz it has too big memory overhead for one structure and for huge amounts of data it might be drastical. Right?
So next I would consider using sorted std::vector and binary_search. It's fine for searching, but adding new values to the vector would be too slow. Imagine you need to add a new value to the beginning of the sorted array, you'd have to move data right aaaaaAAAALOT!
What if I use deque? As I know it has O(1) for push_back/push_front, but still O(n) for inserting (as it would have to move data anyway, less data though).
The questions are:
1) Is O(n) of inserting data in deque much faster in real situation than O(n) in vector?
2) What happens when you insert a value to Deque and the bucket it should go into is full?
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
Thanks!
I would not use std::map cuz it has too big memory overhead for one structure and for huge amounts of data it might be drastical. Right?
That depends on the size of your structs... the bigger they are the less the overheads are as a proportion of the overall memory use. For example, a std::map implementation might average say 20 bytes of housekeeping data per element (I just made that up - measure on your own system), so if your struct size is in the hundreds of bytes - who cares...? But, if the struct holds 2 ints, it's a big proportion....
So next I would consider using sorted std::vector and binary_search. It's fine for searching, but adding new values to the vector would be too slow. Imagine you need to add a new value to the beginning of the sorted array, you'd have to move data right aaaaaAAAALOT!
Totally unsuitable....
1) Is O(n) of inserting data in deque much faster in real situation than O(n) in vector?
As deque is likely implemented as a vector of fixed-sized arrays, insertion implies a shuffling of all elements towards the nearest end of the container. The shuffling's probably a tiny bit less cache efficient, but if inserting nearer the front of the container it would likely still end up faster.
2) What happens when you insert a value to Deque and the bucket it should go into is full?
As above, it'll need to shuffle, overflowing either:
the last element to become the first element of the next "bucket", moving all those elements along and overflowing into the next bucket, etc.
the first element to become the last element of the previous bucket, moving all those elements along and overflowing into the next bucket, etc.
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
unordered_map, which is implemented as a hash map. If you have small objects (e.g. less than 20 or 30 bytes) or a firm cap on the number of elements, you can normally easily outperform unordered_map with custom code, but it's rarely worth the effort unless the table access dominates you application's performance, and that performance is critical.
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
Consider using std::unordered_map, which is an implementation of a hash map. Insertion, lookup, and removal are all O(1) in the average case. This assumes that you will only ever look for an item based on its exact key; if your searches can have different constraints then you either need a different structure, or you need multiple maps to map the various keys you will search for to the corresponding object.
This requires that there is an available hash function for KeyType, either as part of the standard library or provided by you.
There's no container which would provide the best of all the worlds to you. Like you are saying you want best lookup/insertion with minimum amount of space needed for storing elements.
Below if the list of containers which you could consider for your implementation:-
VECTOR :-
Strengths:-
1) Space is allocated only for holding data.
2) Good for random access.
3) Container of choice if insertions/deletions are not in the middle of the container.
Weakness:-
1) poor performance if insertions/deletions are at the middle.
2) rellocations happen if reserve is not used properly.
DEQUE:-
Choose deque over vector in case insertions/deletions are at the beginning as well as end of the container.
MAP:-
Disadvantage over vector:-
1) more space is allocated for holding pointers.
Advantages over vector:-
1) better insertions/deletions/lookup as compared to vector.
If std::unordered_map is used then these dictionary operations would be amortized O(1).
Firstly, in order to directly answer your questions:
1) Is O(n) of inserting data in deque much faster in real situation
than O(n) in vector?
The number of elements that have to be moved is (on average) only half compared to vector. However, it can actually perform worse as the data is stored in non-contiguous memory, so copying/moving the same number of elements is much less efficient (it cannot e.g. be implemented in terms of a single memcopy operation).
2) What happens when you insert a value to Deque and the bucket it
should go into is full?
At least for the gnu gcc Libstdc++ implementation, every bucket except the first and last one is always full. I believe, that inserting in the middle means that all elements are moved/copied one slot to the closer end (front or back) and the effect ripples through all buckets until the first or last one is reached.
In summary, the only scenario, where std::deque is consistently better than vector is if you use it as (suprise) a queue (only inserting and removing elements from the front or end) and that's what the implementation is optimized for. It is not optimized for insertions in the middle.
3) Is there another preferable type of container in case you need to
store lots of data and need two fast operations: search and insertion?
As already stated by others: A hash table like std::unordered_map is the data structure you are looking for.
From what I've heard however, std::unordered_map is a slightly suboptimal implementation if it, as it uses buckets in order to resolve hash collisions and those buckets are implemented as linked lists (here is a very interesting talk from Chandler Carruth on the general topic of the performance of different data structures). For random access on big data structures, cache locality should matter a lot less, so this is probably not such a big issue in your case.
Finally I'd like to mention that if your value and key types are small PODs and depending on how big your huge collection is (are we talking about some million or rather billions of elements) and how often you actually have to insert/remove elements, there might still be cases, where a simple std::vector outperforms any other STL container. So as always: if your thought experiment ever becomes reality try out and measure.

STL What is Random access and Sequential access?

So I am curious to know, what is random access?
I searched a little bit, and couldn't find much. The understanding I have now is that the "blocks" in the container are placed randomly (as seen here). Random access then means I can access every block of the container no matter what position (so I can read what it says on position 5 without going through all blocks before that), while with sequential access, I have to go through 1st , 2nd, 3rd and 4th to get to the 5th block.
Am I right? Or if not, then can someone explain to me what random access is and sequential access is?
Sequential access means the cost of accessing the 5th element is 5 times the cost of accessing the first element, or at least that there is an increasing cost associated with an elements position in the set. This is because to access the 5th element of the set, you must first perform an operation to find the 1st, 2nd, 3rd, and 4th elements, so accessing the 5th element requires 5 operations.
Random access means that accessing any element in the set has the same cost as any other element in the set. Finding the 5th element of a set is still only a single operation.
So accessing a random element in a random access data-structure is going to have O(1) cost whereas accessing a random element in a sequential data-structure is going to have a O(n/2) -> O(n) cost. The n/2 comes from that if want to access a random element in a set 100 times, the average position of that element is going to be about halfway through the set. So for a set of n elements, that comes out to n/2 (Which in big O notation can just be approximated to n).
Something you might find cool:
Hashmaps are an example of a data structure which implements random access. A cool thing to note is that on hash collisions in a hash map, the two collided elements are stored in a sequential linked list in that bucket on the hash map. So that means that if you have 100% collisions for a hash map you actually end up with sequential storage.
Here's an image of a hashmap illustrating what I'm describing:
This means the worst case scenario for a hash map is actually O(n) for accessing an element, the same as average case for sequential storage, or put more correctly, finding an element in a hashmap is Ω(n), O(1), and Θ(1). Where Ω is worst case, Θ is best case, and O is average case.
So:
Sequential access: Finding a random element in a set of n elements is Ω(n), O(n/2), and Θ(1) which for very large numbers becomes Ω(n), O(n), and Θ(1).
Random access: Finding a random element in a set of n elements is Ω(n/2), O(1), and Θ(1) which for very large numbers becomes Ω(n), O(1), and Θ(1)
So random access has the benefit of giving better performance for accessing elements, however sequential storage data structures provide benefits in other areas.
Second Edit For #sumsar1812:
I want to preface with this is how I understand the advantages/use cases of sequential storage, but I am not as certain about my understanding of benefits of sequential containers as I am about my answer above. So please correct me anywhere I am mistaken.
Sequential storage is useful because the data will actually be stored sequentially in memory.
You can actually access the next member of a sequentially stored data set by offsetting a pointer to the previous element of that set by the amount of bytes it takes to store a single element of that type.
So since a signed int requires 8 bytes to store, if you have a fixed array of integers with a pointer pointing to the first integer:
int someInts[5];
someInts[1] = 5;
someInts is a pointer pointing to the first element of that array. Adding 1 to that pointer just offsets where it points to in memory by 8 bytes.
(someInts+1)* //returns 5
This means if you need to access every element in your data structure in that specific order its going to be much faster since each lookup for sequential storage is just adding a constant value to the pointer.
For random access storage, there is no guarantee that each element is stored even near the other elements. This means each lookup will be more expensive that just adding a constant amount.
Random access storage containers can still simulate what appears to be an ordered list of elements by using an iterator. However, as long as you allow random access lookups for elements there will be no guarantee that those elements are stored sequentially in memory. This means that even though a container can exhibit behavior of both a random access container and a sequential container, it will not exhibit the benefits of a sequential container.
So if the order of the elements in your container is supposed to be meaningful, or you plan on iterating and operating on every element in a data set then you might benefit from a sequential container.
In truth it still gets a little more complicated because a linked list, which is a sequential container doesn't actually store sequentially in memory whereas a vector, another sequential container, does. Here's a good article that explains use cases for each specific container better than I can.
There are two main aspects to this, and it's unclear which of the two is more relevant to your question. One of those aspects is accessing the content of an STL container via iterators, where those iterators allow either random access or forward (sequential) access. The other aspect is that of accessing a container or even just memory itself in random or sequential orders.
Iterators - Random Access vs. Sequential Access
To start with iterators, take two examples: std::vector<T> and std::list<T>. A vector stores an array of values, whereas a list stores a linked list of values. The former is stored sequentially in memory, and this allows arbitrary random access: calculating the location of any element is just as fast as calculating the location of the next element. Thus the sequential storage gives you efficient random access, and the iterator is a random access iterator.
By contrast, a list performs a separate allocation for each node, and each node only knows where its neighbors are. Thus calculating the location of a random non-neighbor node cannot be done directly. Any attempt to do so must traverse all the intermediate nodes, and thus algorithms that attempt to skip nodes may perform badly. The non-sequential storage yields randomized locations and thus only efficient sequential access. Thus the iterator that list provides is a bidirectional iterator, one of a few different sequential iterators.
Memory - Random Access vs. Sequential Access
However there's another wrinkle in your question. The iterator parts only address the traversal of the container. Underneath that, however, the CPU will be accessing memory itself in a particular pattern. While at a high level the CPU is capable of addressing any random address with no overhead of calculating where it is (it's like a big vector), in practice reading memory involves caching and lots of subtleties that make accessing different parts of memory take different amounts of time.
For example, once you start working with a rather large data set, even if you're working with a vector, it's more efficient to access all elements in sequential order than to access all elements in some random order. By contrast a list doesn't make this possible. Since the nodes of a list aren't even necessarily located sequential memory locations, even a sequential access of the list's items may not read memory sequentially, and can be more expensive because of this.
The terms themselves don't imply any performance characteristics as #echochamber says. The terms only refer to a method of access.
"Random Access" refers to accessing elements in a container in an arbitrary order. std::vector is an example of a C++ container that performs great for random access. std::stack is an example of a C++ container that doesn't even allow random access.
"Sequential Access" refers to accessing elements in order. This is only relevant for ordered containers. Some containers are optimized better for sequential access than random access, for example std::list.
Here's some code to show the difference:
// random access. picking elements out, regardless of ordering or sequencing.
// The important factor is that we are selecting items by some kind of
// index.
auto a = some_container[25];
auto b = some_container[1];
auto c = some_container["hi"];
// sequential access. Here, there is no need for an index.
// This implies that the container has a concept of ordering, where an
// element has neighbors
for(container_type::iterator it = some_container.begin();
it != some_container.end();
++ it)
{
auto d = *it;
}

C++: container replacement for vector/deque for huge sizes

so my applications has containers with 100 million and more elements.
I'm on the hunt for a container which behaves - time-wise - better than std::deque (let alone std::vector) with respect to frequent insertions and deletions all over the container ... including near the middle. Access time to the n-th element does not need to be as fast as vector, but should definetely be better than full traversal like in std::list (which has a huge memory overhead per element anyway).
Elements should be treated ordered by index (like vector, deque, list), so std::set or std::unordered_set also do not work well.
Before I sit down and code such a container myself: has anyone seen such a beast already? I'm pretty sure the STL hasn't anything like this, looking to BOOST I did not find something I could use but I may be wrong.
Any hints?
There's a whole STL replacement for big data, in case your app is centric to such data:
STXXL - http://stxxl.sourceforge.net/
edit: I was actually a bit fast to answer. 100 million is not really a large number. E.g., if each element is one byte, you could save it in a 96MiB array. So whether STXXL is any useful, the size of an element should be significantly bigger.
I think you can get the performance characteristics that you want with a skip list:
https://en.wikipedia.org/wiki/Skip_list#Indexable_skiplist
It's the "indexable" part that you're interested in, of course -- you don't actually want the items to be sorted. So some modification is needed that I leave as an exercise.
You might find that 100 million list nodes begins to strain a 32 bit address space, but probably not an issue in 64 bits.
1) If the data is highly sparse, i.e. has lots of zeroes or can be expressed as such, I would highly recommend a data structure that takes advantage of that:
sparselib++ for matrices
sparsehash for hash maps
2) Hash maps should do O(1) for all the operations you describe and the sparsehash implementation I mentioned earlier is particularly space-efficient; it also includes a sparsetable type which is a bit more low-level and can be used in place of an array.
3) If the strict ordering is not that important (it probably is, because you mentioned elements should be treated ordered by index), you can swap the elements you want to erase to the end of the vector and then resize to do removal in O(1). Insertion would just be push_back.
Try a hash map. The STL has several, all with the unordered naming prefix , such as unorderd_map, etc. It has constant time insertion and look up given a good hashing algorithm. With your 'huge' data set the hash map would most likely cover your needs. Making a slight change to the application to cover the differences in the interfaces is trivial.

Maintaining an Ordered collection of objects

I have the following requirements for a collection of objects:
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
Ordered, but allowing reorder and insertion at arbitrary locations.
Allows for deletion
Indexed Access - Random Access
Count
The objects I am storing are not large, a couple of properties and a small array or two (256 booleans)
Is there any built in classes I should know about before I go writing a linked list?
Original answer: That sounds like std::list (a doubly linked list) from the Standard Library.
New answer:
After your change to the specs, a std::vector might work as long as there aren't more than a few thousand elements and not a huge number of insertions and deletions in the middle of the vector. The linear complexity of insertion and deletion in the middle may be outweighed by the low constants on the vector operations. If you are doing a lot of insertions and deletions just at the beginning and end, std::deque might work as well.
-Insertion and Deletion: This is possible for any STL container, but the question is how long it takes to do it. Any linked-list container (list, map, set) will do this in constant time, while array-like containers (vector) will do it in linear time (with constant-amortized allocation).
-Sorting: Considering that you can keep a sorted collection at all times, this isn't much of an issue, any STL container will allow that. For map and set, you don't have to do anything, they already take care of keeping the collection sorted at all times. For vector or list, you have to do that work, i.e. you have to do binary search for the place where the new elements go and insert them there (but STL Algorithms has all the pieces you need for that).
-Resorting: If you need to take the current collection you have sorted with respect to rule A, and resort the collection with respect to rule B, this might be a problem. Containers like map and set are parametrized (as a type) by the sorting rule, this means that to resort it, you would have to extract every element from the original collection and insert them in a new collection with a new sorting rule. However, if you use a vector container, you can just use the STL sort function anytime to resort with whatever rule you like.
-Random Access: You said you needed random access. Personally, my definition of random access means that any element in the collection can be accessed (by index) in constant time. With that definition (which I think is quite standard), any linked-list implementation does not qualify and it leaves you with the only option of using an array-like container (e.g. std::vector).
Conclusion, to have all those properties, it would probably be best to use a std::vector and implement your own sorted insertion and sorted deletion (performing binary search into the vector to find the element to delete or the place to insert the new element). If your objects that you need to store are of significant size, and the data according to which they are sorted (name, ID, etc.) is small, you might consider splitting the problem by holding a unsorted linked-list of objects (with full information) and keeping a sorted vector of keys along with a pointer to the corresponding node in the linked-list (in that case, of course, use std::list for the former, and std::vector for the latter).
BTW, I'm no grand expert with STL containers, so I might have been wrong in the above, just think for yourself. Explore the STL for yourself, I'm sure you will find what you need, or at least all the pieces that you need. Maybe, look at Boost libraries too.
You haven't specified enough of your requirements to select the best container.
Dynamic size (in theory unlimited, but in practice a couple of thousand should be more than enough)
STL containers are designed to grow as needed.
Ordered, but allowing reorder and insertion at arbitrary locations.
Allowing reorder? A std::map can't be reordered: you can delete from one std::map and insert into another using a different ordering, but as different template instantiations the two variables will have different types. std::list has a sort() member function [thanks Blastfurnace for pointing this out], particularly efficient for large objects. A std::vector can be resorted easily using the non-member std::sort() function, particularly efficient for tiny objects.
Efficient insertion at arbitrary locations can be done in a map or list, but how will you find those locations? In a list, searching is linear (you must start from somewhere you already know about and scan forwards or backwards element by element). std::map provides efficient searching, as does an already-sorted vector, but inserting into a vector involves shifting (copying) all the subsequent elements to make space: that's a costly operation in the scheme of things.
Allows for deletion
All containers allow for deletion, but you have the exact-same efficiency issues as you do for insertion (i.e. fast for list if you already know the location, fast for map, deletion in vectors is slow, though you can "mark" elements deleted without removing them, e.g. making a string empty, having a boolean flag in a struct).
Indexed Access - Random Access
vector is indexed numerically (but can be binary searched), map by an arbitrary key (but no numerical index). list is not indexed and must be searched linearly from a known element.
Count
std::list provides an O(n) size() function (so that it can provide O(1) splice), but you can easily track the size yourself (assuming you won't splice). Other STL containers already have O(1) time for size().
Conclusions
Consider whether using a std::list will result in lots of inefficient linear searches for the element you need. If not, then a list does give you efficient insertion and deletion. Resorting is good.
A map or hash map will allow quick lookup and easy insertion/deletion, can't be resorted, but you can easily move the data out to another map with another sort criteria (with moderate efficiency.
A vector allows fast searching and in-place resorting, but the worst insert/deletion. It's the fastest for random-access lookup using the element index.

c++ container for checking whether ordered data is in a collection

I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++, what container will have the lowest complexity upon retrieval? In this case, the data does not grow after initiailization. In C# I would use a dictionary, in c++, I could either use a hash_map or set. If the data were unordered, I would use boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to detail a bit over what have already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees for my few versions of the STL) because of the requirements on insertion, retrieval and deletion operation (and notably because of the invalidation of iterators requirements).
However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:
minimal overhead (in term of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken done in 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
If you don't want to reinvent the wheel, you can also check Alexandrescu's [AssocVector][1]. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then a unordered_set will be faster, especially because integers are so trivial to hash size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the repartition of its elements
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(logn) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.