Looking for clarification on Hashing and BST functions and Big O notation

Looking for clarification on Hashing and BST functions and Big O notation - c++

So I am trying to understand the data types and Big O notation of some functions for a BST and Hashing.
So first off, how are BSTs and Hashing stored? Are BSTs usually arrays, or are they linked lists because they have to point to their left and right leaves?
What about Hashing? I've had the most trouble finding clear information regarding Hashing in terms of computation-based searching. I understand that Hashing is best implemented with an array of chains. Is this for faster searching or to decrease overhead on creating the allocated data type?
This following question might be just bad interpretation on my part, but what makes a traversal function different from a search function in BSTs, Hashing, and STL containers?
Is traversal Big O(N) for BSTS because you're actually visiting each node/data member, whereas search() can reduce its time by eliminating half the searching field?
And somewhat related, why is it that in the STL, list.insert() and list.erase() have a Big O(1) whereas the vector and deque counterparts are O(N)?
Lastly, why would a vector.push_back() be O(N)? I thought the function could be done something along the lines of this like O(1), but I've come across text saying it is O(N):
vector<int> vic(2,3);
vector<int>::const iterator IT = vic.end();
//wanna insert 4 to the end using push_back
IT++;
(*IT) = 4;
hopefully this works. I'm a bit tired but I would love any explanations why something similar to that wouldn't be efficient or plausible. Thanks

BST's (Ordered Binary Trees) are a series of nodes where a parent node points to its two children, which in turn point to their max-two children, etc. They're traversed in O(n) time because traversal visits every node. Lookups take O(log n) time. Inserts take O(1) time because internally they don't need to a bunch of existing nodes; just allocate some memory and re-aim the pointers. :)
Hashes (unordered_map) use a hashing algorithm to assign elements to buckets. Usually buckets contain a linked list so that hash collisions just result in several elements in the same bucket. Traversal will again be O(n), as expected. Lookups and inserts will be amortized O(1). Amortized means that on average, O(1), though an individual insert might result in a rehashing (redistribution of buckets to minimize collisions). But over time the average complexity is O(1). Note, however, that big-O notation doesn't really deal with the "constant" aspect; only order of growth. The constant overhead in the hashing algorithms can be high enough that for some data-sets the O(log n) binary trees outperform the hashes. Nevertheless, the hash's advantage is that its operations are constant time-complexity.
Search functions take advantage (in the case of binary trees) of the notion of "order"; a search through a BST has the same characteristics as a basic binary search over an ordered array. O(log n) growth. Hashes don't really "search". They compute the bucket, and then quickly run through the collisions to find the target. That's why lookups are constant time.
As for insert and erase; in array-based sequence containers, all elements that come after the target have to be bumped over to the right. Move semantics in C++11 can improve upon the performance, but the operation is still O(n). For linked sequence containers (list, forward_list, trees), insertion and erasing just means fiddling with some pointers internally. It's a constant-time process.
push_back() will be O(1) until you exceed the existing allocated capacity of the vector. Once the capacity is exceeded, a new allocation takes place to produce a container that is large enough to accept more elements. All the elements need to then be moved into the larger memory region, which is an O(n) process. I believe Move Semantics can help here as well, but it's still going to be O(n). Vectors and strings are implemented such that as they allocate space for a growing data set, they allocate more than they need, in anticipation of additional growth. This is an efficiency safeguard; it means that the typical push_back() won't trigger a new allocation and move of the entire data set into a larger container. But eventually after enough push_backs, the limit will be reached, and the vector's elements will be copied into a larger container, which again has some extra headroom left over for more efficient push_backs.

Traversal refers to visiting every node, whereas search is only to find a particular node, so your intuition is spot on there. O(N) complexity because you need to visit N nodes.
std::vector::insert is for insert in the middle, and it involves copying all subsequent elements over by one slot, inorder to make room for the element being inserted, hence O(N). Linked list doesnt have this issue, hence O(1). Similar logic for erase. deque properties are similar to vector
std::vector::push_back is a O(1) operation, for the most part, only deviates if capacity is exceeded and reallocations + copy are needed.

Related

Does there exist a data structure with constant access and insertion/deletion times? [duplicate]

By vector vs. list in STL:
std::vector: Insertions at the end are constant, amortized time, but insertions elsewhere are a costly O(n).
std::list: You cannot randomly access elements, so getting at a particular element in the list can be expensive.
I need a container such that you can both access the element at any index in O(1) time, but also insert/remove an element at any index in O(1) time. It must also be able to manage thousands of entries. Is there such a container?
Edit: If not O(1), some X << O(n)?

There's a theoretical result that says that any data structure representing an ordered list cannot have all of insert, lookup by index, remove, and update take time better than O(log n / log log n), so no such data structure exists.
There are data structures that get pretty close to this, though. For example, an order statistics tree lets you do insertions, deletions, lookups, and updates anywhere in the list in time O(log n) apiece. These are reasonably good in practice, and you may be able to find an implementation online.
Depending on your specific application, there may be alternative data structures that are more tailored toward your needs. For example, if you only care about finding the smallest/biggest element at each point in time, then a data structure like a Fibonacci heap might fit the bill. (Fibonacci heaps are usually slower in practice than a regular binary heap, but the related pairing heap tends to run extremely quickly.) If you're frequently updating ranges of elements by adding or subtracting from them, then a Fenwick tree might be a better call.
Hope this helps!

Look at a couple of data structures.
The Rope
Tree of arrays. The tree is sorted by array index for fast index search.
B+Tree
Sorted tree of sorted arrays. This thing is used by almost every database ever.
Neither one is O(1) because that's impossible. But they are pretty good.

std::list and std::vector - Best of both worlds?

By vector vs. list in STL:
std::vector: Insertions at the end are constant, amortized time, but insertions elsewhere are a costly O(n).
std::list: You cannot randomly access elements, so getting at a particular element in the list can be expensive.
I need a container such that you can both access the element at any index in O(1) time, but also insert/remove an element at any index in O(1) time. It must also be able to manage thousands of entries. Is there such a container?
Edit: If not O(1), some X << O(n)?

There's a theoretical result that says that any data structure representing an ordered list cannot have all of insert, lookup by index, remove, and update take time better than O(log n / log log n), so no such data structure exists.
There are data structures that get pretty close to this, though. For example, an order statistics tree lets you do insertions, deletions, lookups, and updates anywhere in the list in time O(log n) apiece. These are reasonably good in practice, and you may be able to find an implementation online.
Depending on your specific application, there may be alternative data structures that are more tailored toward your needs. For example, if you only care about finding the smallest/biggest element at each point in time, then a data structure like a Fibonacci heap might fit the bill. (Fibonacci heaps are usually slower in practice than a regular binary heap, but the related pairing heap tends to run extremely quickly.) If you're frequently updating ranges of elements by adding or subtracting from them, then a Fenwick tree might be a better call.
Hope this helps!

Look at a couple of data structures.
The Rope
Tree of arrays. The tree is sorted by array index for fast index search.
B+Tree
Sorted tree of sorted arrays. This thing is used by almost every database ever.
Neither one is O(1) because that's impossible. But they are pretty good.

Storing in std::map/std::set vs sorting a vector after storing all data

Language: C++
One thing I can do is allocate a vector of size n and store all data
and then sort it using sort(begin(),end()). Else, I can keep putting
the data in a map or set which are ordered itself so I don't have to
sort afterwards. But in this case inserting an element may be more
costly due to rearrangements(I guess).
So which is the optimal choice for minimal time for a wide range of n(no. of objects)

It depends on the situation.
map and set are usually red-black trees, they should do a lot of work to be balanced, or the operation on it will be very slow. And it doesn't support random access. so if you only want to sort one time, you shouldn't use them.
However, if you want to continue insert elements into the container and keep order, map and set will take O(logN) time, while the sorted vector is O(N). The latter is much slower, so if you want frequently insert and delete, you should use map or set.

The difference between the 2 is noticable!
Using a set, you get O(log(N)) complexity for each element you insert. So by result you get O(N log(N)), which is the complexity of an insertion sort.
Adding everything in a vector is of complexity O(1), and sorting it will be O(N log(N)) since C++11 (before it, std::sort have O(N log(N)) on average.).
Once sorted, you could use binary_search to have the same complexity as in a set.
The API of using a vector as set ain't the friendly, although it does give nice performance benefits. This off course is only useful when you can do a bulk insert of data or when the amount of lookups is much larger than the manipulations of the content. Algorithmsable to sort on partially sorted vector, when you have to extend later on.
Finally, one has to remark that you don't have the same guarantees of iterator invalidation.
So, why are vectors better? Cache locality!
A vector has all data in a single memory block, hence the processor can do prefetching while for a set, the memory is scattered around the place requireing the data to find the next address. This makes vector a better set implementation than std::set for large data when you can live with the limitations.
To give you an idea, on the codebase I'm working on, we have several set and map implementations based on vectors which have their own narratives to function in. (For example: no erase or no operator[])

Advantage of Binary Search Tree over vector in C++

What is the use of data structure Binary Search Tree, if vector (in sorted order) can support insert,delete and search in log(n) time (using binary search)??

The basic advantage of a tree is that insert and delete in a vector are not O(log(n)) - they are O(n). (They take log(n) comparisons, but n moves.)
The advantage of a vector is that the constant factor can be hugely in their favour (because they tend to be much more cache friendly, and cache misses can cost you a factor of 100 in performance).
Sorted vectors win when
Mostly searching.
Frequent updates but only a few elements in the container.
Objects have efficient move semantics
Trees win when
Lots of updates with many elements in the container.
Object move is expensive.
... and don't forget hashed containers which are O(1) search, and unordered vectors+linear search (which are O(n) for everything, but if small enough are actually fastest).

There won't be much difference in performance between a sorted vector and BST if there are only search operations after some initial insertions/deletions. As
binary search over vector will cost you same as searching a key in BST. In fact I would go for sorted vector in this case as it's more cache friendly.
However, if there are frequent insertions/deletions involved along with searching, then a sorted vector won't be good option as elements need to move back and forth after every insertion and deletion to keep vector sorted.

Theoretically there's impossible to do insert or delete in a sorted vector in O(log(n)). But if you really want the advantage of searching in BST vs vector, here's somethings I can think about:
BST and other tree structures take bulk of small memory allocations of "node", and each node is a fixed small memory chunk. While vector uses a big continuous memory block to hold all the items, and it double (or even triple) the memory usage while re-sizing. So in the system with very limited memory, or in the system where fragmentation happens frequently, it's possible that BST will successfully allocate enough memory chunks for all the nodes, while vector failed to allocate the memory.

STL priority_queue<pair> vs. map

I need a priority queue that will store a value for every key, not just the key. I think the viable options are std::multi_map<K,V> since it iterates in key order, or std::priority_queue<std::pair<K,V>> since it sorts on K before V. Is there any reason I should prefer one over the other, other than personal preference? Are they really the same, or did I miss something?

A priority queue is sorted initially, in O(N) time, and then iterating all the elements in decreasing order takes O(N log N) time. It is stored in a std::vector behind the scenes, so there's only a small coefficient after the big-O behavior. Part of that, though, is moving the elements around inside the vector. If sizeof (K) or sizeof (V) is large, it will be a bit slower.
std::map is a red-black tree (in universal practice), so it takes O(N log N) time to insert the elements, keeping them sorted after each insertion. They are stored as linked nodes, so each item incurs malloc and free overhead. Then it takes O(N) time to iterate over them and destroy the structure.
The priority queue overall should usually have better performance, but it's more constraining on your usage: the data items will move around during iteration, and you can only iterate once.
If you don't need to insert new items while iterating, you can use std::sort with a std::vector, of course. This should outperform the priority_queue by some constant factor.
As with most things in performance, the only way to judge for sure is to try it both ways (with real-world testcases) and measure.
By the way, to maximize performance, you can define a custom comparison function to ignore the V and compare only the K within the pair<K,V>.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Looking for clarification on Hashing and BST functions and Big O notation - c++

Related

Does there exist a data structure with constant access and insertion/deletion times? [duplicate]

std::list and std::vector - Best of both worlds?

Storing in std::map/std::set vs sorting a vector after storing all data

Advantage of Binary Search Tree over vector in C++

STL priority_queue<pair> vs. map

Categories

Resources