Implementing a Set Using a Sorted Array - C++

This is a data structures question, but it also concerns implementation. A set is typically implemented using a BST, but my professor wants us to know how to implement some data structures when given only limited options. So he wants us to understand how to create a set using only an array.
Using a standard (unsorted) array I understand the implementation/complexity...
void add(Student[] arr, Student findstu)
{
    int i = 0;
    boolean found = false;
    Student stu = arr[i];
    // linear scan until the first empty (null) slot
    while (stu != null)
    {
        if (stu == findstu)
        {
            found = true;
            break;
        }
        stu = arr[++i];
    }
    // i now indexes the first empty slot, so insert there
    if (!found)
    {
        arr[i] = findstu;
    }
}
Add/remove/contains are all pretty much the same code; all of them have that first while loop, which makes them O(n).
But if we used a sorted array, why would contains be O(log n) and add/remove O(n)?

Searching would be O(log N): because the array is sorted, you can apply binary search, which has O(log N) complexity.
Insertion and erasure would be O(N) (i.e., linear time): every time you insert or erase an element of a sorted array, you have to shift elements over by one position to keep the array sorted, and that shifting is O(N).
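To make the distinction concrete, here is a minimal sketch over a plain sorted int array (my simplification, not code from the question; it assumes the array has spare capacity): contains() is a textbook binary search, while add() must shift the tail right to make room, which is where the O(n) comes from.
#include <cstddef>

bool contains(const int arr[], size_t n, int x) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] == x) return true;
        if (arr[mid] < x) lo = mid + 1;
        else hi = mid;
    }
    return false;  // O(log n): the range halves every step
}

// returns the new element count; assumes capacity > n
size_t add(int arr[], size_t n, int x) {
    if (contains(arr, n, x)) return n;      // set semantics: no duplicates
    size_t pos = 0;
    while (pos < n && arr[pos] < x) ++pos;  // find the insert position
    for (size_t i = n; i > pos; --i)        // shift the tail right: O(n)
        arr[i] = arr[i - 1];
    arr[pos] = x;
    return n + 1;
}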

Related

Confusion about time complexity with hash maps

On LeetCode I find it is common to "ignore" the worst-case time complexity involving hash maps. I thought in software interviews it was standard to assume the worst case, as they often do. Below is my solution to a simple problem: find the first non-repeating char in a string. I understand that hash maps are on average O(1) per lookup... but when iterating over the string and looking up the hash map each time, why is the time complexity not O(N^2) and instead O(N)?
#include <string>
#include <unordered_map>

class Solution {
public:
    int firstUniqChar(std::string s) {
        // local map, so repeated calls don't accumulate stale counts
        std::unordered_map<char, int> m;
        // first pass: count occurrences of every character
        for (char c : s) {
            m[c]++;
        }
        // second pass: the first character with count 1 is the answer
        for (int i = 0; i < (int)s.length(); i++) {
            if (m[s[i]] == 1) {
                return i;
            }
        }
        return -1;
    }
};
It is on average O(N) because a hash map is on average O(1) per lookup and you do O(N) of them.
On average means averaging over all possible inputs. That means there might exist an input that breaks a particular hash function and achieves O(N) (or much worse) per lookup.
Worst-case behavior is heavily implementation specific - e.g. with hashing into buckets it depends on how the elements are stored in each bucket. If they are kept in a simple list, lookup is O(bucket size); a binary tree brings that down to O(log(bucket size)). There might also be a difference between searching for keys that are present and keys that are missing.
There is also the big assumption that hashed containers can grow with the number of elements stored, i.e. keep the occupancy of the buckets low.
It does not hurt to mention their worst cases in interviews; it demonstrates you know they can have limits.
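To illustrate the "breaks a particular hash" point, here is a toy of my own (not from the answer): a deliberately bad hash that sends every key to the same bucket, so each lookup degrades into a linear scan of one bucket, which is the O(N)-per-lookup worst case described above.
#include <cstddef>
#include <unordered_map>

struct BadHash {
    std::size_t operator()(char) const { return 0; }  // every key collides
};

std::unordered_map<char, int, BadHash> counts;  // lookups now scan one bucket: O(N), not O(1)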
The time complexity of the given problem is O(N). You can provide a perfect hash function for it, one where no collision ever happens; here static_cast<size_t>(256 + c) is such a function. Indeed, if you look at the fastest solutions to this problem on LeetCode, you will see that people use plain arrays.
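A sketch of the plain-array variant this alludes to: a fixed-size count table indexed directly by the character value, so no hashing at all. The size 256 assumes 8-bit chars, as in the static_cast<size_t>(256 + c) remark above.
#include <cstddef>
#include <string>

int firstUniqChar(const std::string& s) {
    int counts[256] = {0};              // one slot per possible char value
    for (unsigned char c : s)
        counts[c]++;
    for (std::size_t i = 0; i < s.size(); i++) {
        if (counts[static_cast<unsigned char>(s[i])] == 1)
            return static_cast<int>(i); // first character that occurs exactly once
    }
    return -1;
}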

Which container is most efficient for multiple insertions / deletions in C++?

I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
    // Insert one entry with value x
    void Insert(int x);
    // Erase one entry with value x, if one exists
    void Erase(int x);
    // Erase all entries, x, from <= x < to
    void Erase(int from, int to);
    // Return the count of all entries, x, from <= x < to
    size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
    std::vector<int> integerCollection;
};

void IntegerCollection::Insert(int x) {
    /* using lower_bound to find the right place for x to be inserted
       keeps the vector sorted and makes life much easier */
    auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
    integerCollection.insert(it, x);
}

void IntegerCollection::Erase(int x) {
    // find the location of the first element containing x and delete it if it exists
    auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
    if (it != integerCollection.end()) {
        integerCollection.erase(it);
    }
}

void IntegerCollection::Erase(int from, int to) {
    if (integerCollection.empty()) return;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    /* std::vector::erase deletes entries between the two iterators
       fromBound (included) and toBound (not included) */
    integerCollection.erase(fromBound, toBound);
}

size_t IntegerCollection::Count(int from, int to) const {
    if (integerCollection.empty()) return 0;
    int count = 0;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    // increment the iterator until fromBound == toBound (we don't count elements of value = to)
    while (fromBound != toBound) {
        ++count; ++fromBound;
    }
    return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is. deque only makes sense if I can guarantee insertion or deletion at the ends of the container, and list hogs memory. Is there something else that I'm completely overlooking?
We cannot know what would make the company happy. If they reject std::vector without giving a concrete reason, I wouldn't want to work for them anyway. Moreover, we don't really know the precise requirements. Were you asked to provide one reasonably well-performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the former, you can:
Roll your own. It is unlikely that the interface you were given can be implemented more efficiently than the std containers do it... unless your requirements are so specific that you can write something that performs well under that particular benchmark.
Use std::vector for data locality. See e.g. here for Bjarne himself advocating std::vector rather than linked lists.
Use std::set for ease of implementation. It seems like you want the container sorted, and the interface you have to implement fits that of std::set quite well.
Let's compare only insertion and erasure, assuming the container needs to stay sorted:

operation   std::set    std::vector
insert      O(log N)    O(N)
erase       O(log N)    O(N)
Note that the log(N) for the binary search to find the position to insert/erase in the vector is negligible compared to the N.
Now consider that the asymptotic complexities listed above completely neglect the non-linearity of memory access. In reality, data can be scattered in memory (std::set), leading to many cache misses, or it can be local, as with std::vector. The log(N) only wins for huge N. To get an idea of the difference: 500000/log(500000) is roughly 26410, while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for reasonably small container sizes, but at some point the log(N) wins over the cache advantage. The exact location of this turning point depends on many factors and can only be reliably determined by profiling and measuring.
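To make the std::set option concrete, here is a minimal sketch (mine, not the poster's) of the same interface on std::multiset - multiset rather than set, because the spec allows duplicate entries. Insert and single-element Erase drop to O(log N); the weak spot is Count, which still walks the range, so it stays linear in the number of counted elements.
#include <cstddef>
#include <iterator>
#include <set>

class IntegerCollection {
public:
    void Insert(int x) { values.insert(x); }

    void Erase(int x) {
        auto it = values.find(x);
        if (it != values.end()) values.erase(it);  // erase a single entry only
    }

    void Erase(int from, int to) {
        values.erase(values.lower_bound(from), values.lower_bound(to));
    }

    size_t Count(int from, int to) const {
        // set iterators are bidirectional, so std::distance walks
        // them one by one: still linear in the counted range
        return std::distance(values.lower_bound(from), values.lower_bound(to));
    }

private:
    std::multiset<int> values;
};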
Nobody knows which container is MOST efficient for multiple insertions/deletions. That is like asking for the most fuel-efficient car engine design possible: people are always innovating on car engines and make more efficient ones all the time. However, I would recommend a splay tree. The time required for an insertion or deletion in a splay tree is not constant; some insertions take a long time and some take only a very short time. However, the average time per insertion/deletion is always guaranteed to be O(log n), where n is the number of items stored in the splay tree. Logarithmic time is extremely efficient and should be good enough for your purposes.
The first thing that comes to mind is to hash the integer value so single look-ups can be done in constant time.
The integer value can be hashed to compute an index into an array of bools or bits, used to tell whether the integer value is in the container or not.
Counting and deleting large ranges could be sped up from there by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, each storing ints from 0 to 0xFFFF, and were using 32-bit integers, you could take the upper half of the int value and use it as an index to find the correct hash table to insert/delete values from:
// one table per 0x10000-wide slice of the 32-bit value range
IntHashTable containers[0x10000];
uint32_t hashIndex = (uint32_t)value / 0x10000;
uint32_t valueInTable = (uint32_t)value - (hashIndex * 0x10000);
containers[hashIndex].insert(valueInTable);
Count, for example, could be implemented like so, if each hash table kept a count of the number of elements it contains:
int indexStart = startRange / 0x10000;
int indexEnd = endRange / 0x10000;
int countTotal = 0;
// NOTE: the first and last tables may only partially overlap the range and
// would need a range-aware count; the full tables in between can use their
// cached element counts directly.
for (int i = indexStart; i <= indexEnd; ++i) {
    countTotal += containers[i].count();
}
I'm not sure sorting really is a requirement for removing the range; it might be based on position. Anyway, here is a link with some hints on which STL container to use.
In which scenario do I use a particular STL container?
Just FYI.
Vector may be a good choice, but it does a lot of reallocation, as you know. I prefer deque instead, as it doesn't require one big chunk of contiguous memory for all its items. For a requirement such as yours, list would probably fit better.
A basic solution for this problem might be std::map<int, int>, where the key is the integer you are storing and the value is the number of occurrences. The problem with this is that you cannot quickly remove/count ranges; in other words, that complexity is linear. A small sketch of the idea follows.
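Minimal sketch of the std::map<int, int> idea (mine, not the poster's):
#include <map>

// key = stored integer, value = its occurrence count
std::map<int, int> counts;

void insert(int x) { counts[x]++; }

void eraseOne(int x) {
    auto it = counts.find(x);
    if (it != counts.end() && --it->second == 0)
        counts.erase(it);  // drop the key once its count reaches zero
}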
For quick counting you would need to implement your own complete binary tree in which you can compute the number of nodes between two nodes (the upper- and lower-bound nodes): because you know the size of the tree and how many left and right turns you took to reach each bound node, the count can be derived. Note that this works for a complete binary tree; in a general binary tree you cannot make this calculation fast.
For quick range removal I do not know how to make it faster than linear.
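One standard structure that provides exactly this "number of stored entries between two bounds" bookkeeping is a Fenwick (binary indexed) tree over the value domain - my suggestion, not the original poster's. A rough sketch, assuming values fit in [0, maxValue]:
#include <cstddef>
#include <vector>

class FenwickCounter {
public:
    explicit FenwickCounter(std::size_t maxValue) : tree(maxValue + 2, 0) {}

    // add delta occurrences of value (use delta = -1 to erase one entry)
    void update(std::size_t value, long long delta) {
        for (std::size_t i = value + 1; i < tree.size(); i += i & (~i + 1))
            tree[i] += delta;
    }

    // number of stored entries x with from <= x < to, in O(log maxValue)
    long long count(std::size_t from, std::size_t to) const {
        return prefix(to) - prefix(from);
    }

private:
    // number of stored entries with value < bound
    long long prefix(std::size_t bound) const {
        long long sum = 0;
        for (std::size_t i = bound; i > 0; i -= i & (~i + 1))
            sum += tree[i];
        return sum;
    }

    std::vector<long long> tree;
};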

Binary search code fails efficiency check

I have been tasked to complete a technical assessment for a position involving a simple C++ coding exercise. The problem was to check if a number exists in a sorted array, where:
ints[] is the sorted array to be searched
size is the size of the array
k is the number to be checked
The requirement was to implement a solution that uses as few CPU cycles as possible. My solution was as follows:
static bool exists(int ints[], int size, int k)
{
    std::vector<int> v(ints, ints + size);
    if (std::binary_search(v.begin(), v.end(), k))
        return true;
    return false;
}
This failed the performance test with a million items in the array. I am a bit confused as to why. Is it the fact that I am creating a new structure (the vector) from the array? Does it involve copying all of the items to a new location in memory?
std::vector<int> v(ints,ints+size); is going to make a copy of your array. You really don't want to do this in a binary search function, since it is an O(N) operation. That totally dominates the O(log N) of a binary search and makes your algorithm equivalent to a linear search (only worse, since you are also consuming O(N) space). You should pass the array directly to binary_search, using the same pointer range you used to create the vector:
static bool exists(int ints[], int size, int k)
{
    return std::binary_search(ints, ints + size, k);
}

Z-sorted list of 3d objects

In a C++/OpenGL app, I have a bunch of translucent objects arranged in 3d space. Because of the translucency, the objects must be drawn in order from furthest to nearest. (For the reasons described in "Transparency Sorting.")
Luckily, the camera is fixed. So I plan to maintain a collection of pointers to the 3d objects, sorted by camera Z. Each frame, I'll iterate over the collection, drawing each object.
Fast insertion and deletion are important, because the objects in existence change frequently.
I'm considering using a std::list as the container. To insert, I'll use std::lower_bound to determine where the new object goes. Then I'll insert at the iterator returned by lower_bound.
Does this sound like a sane approach? Given the details I've provided, do you foresee any major performance issues I've overlooked?
I don't think a std::list would ever be a good choice for this use case. While the insertion itself is cheap once you have the position, you need to iterate through the list to find the right place for it, which makes the whole operation O(n).
If you want to keep it simple, a std::set would already be much better, and even simpler to apply than std::list. It's implemented as a balanced tree, so insertion is O(log n) complexity, and done by simply calling the insert() method on the container. The iterator gives you the elements in sorted order. It does have the downside of non-local memory access patterns during iteration, which makes it not cache friendly.
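A minimal sketch of that idea (Object3D and its z member are assumptions of mine, not from the question), using a multiset rather than a set so two objects at exactly the same depth don't reject each other:
#include <set>

struct Object3D { float z; /* ... */ };

struct FurtherFirst {
    bool operator()(const Object3D* a, const Object3D* b) const {
        return a->z > b->z;  // larger camera distance sorts first
    }
};

std::multiset<Object3D*, FurtherFirst> drawList;

// insert is O(log n); iteration visits objects furthest-to-nearest:
//   drawList.insert(obj);
//   for (const Object3D* obj : drawList) drawObject(*obj);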
Another approach comes to mind that intuitively should be very efficient. Its basic idea is similar to what @ratchet_freak already proposed, but it does not copy the entire vector on each iteration:
The container that contains the main part of the data is a std::vector, which is always kept sorted.
New elements are added to an "overflow" container, which could be a std::set, or another std::vector that is kept sorted. This is only allowed to reach a certain size.
While iterating, traverse the main and overflow containers simultaneously, using similar logic to a merge sort.
When the overflow container reaches the size limit, merge it with the main container, resulting in a new main container.
A rough sketch of the code for this:
const size_t OVERFLOW_SIZE = 32;

// Ping-pong between two vectors when merging.
std::vector<Entry> mainVecs[2];
unsigned activeIdx = 0;
std::vector<Entry> overflowVec;
overflowVec.reserve(OVERFLOW_SIZE);

void insert(const Entry& entry) {
    std::vector<Entry>::iterator pos =
        std::upper_bound(overflowVec.begin(), overflowVec.end(), entry);
    overflowVec.insert(pos, entry);
    if (overflowVec.size() == OVERFLOW_SIZE) {
        // merge into the inactive vector; back_inserter grows it as needed
        mainVecs[1 - activeIdx].clear();
        std::merge(mainVecs[activeIdx].begin(), mainVecs[activeIdx].end(),
                   overflowVec.begin(), overflowVec.end(),
                   std::back_inserter(mainVecs[1 - activeIdx]));
        mainVecs[activeIdx].clear();
        overflowVec.clear();
        activeIdx = 1 - activeIdx;
    }
}
void draw() {
    std::vector<Entry>::const_iterator mainIt = mainVecs[activeIdx].begin();
    std::vector<Entry>::const_iterator mainEndIt = mainVecs[activeIdx].end();
    std::vector<Entry>::const_iterator overflowIt = overflowVec.begin();
    std::vector<Entry>::const_iterator overflowEndIt = overflowVec.end();
    // walk both sorted sequences at once, always drawing the smaller front
    for (;;) {
        if (overflowIt == overflowEndIt) {
            if (mainIt == mainEndIt) {
                break;
            }
            draw(*mainIt);
            ++mainIt;
        } else if (mainIt == mainEndIt) {
            draw(*overflowIt);
            ++overflowIt;
        } else if (*mainIt < *overflowIt) {
            draw(*mainIt);
            ++mainIt;
        } else {
            draw(*overflowIt);
            ++overflowIt;
        }
    }
}
std::list is a non-random-access container. On the complexity of lower_bound:
On average, logarithmic in the distance between first and last: performs approximately log2(N)+1 element comparisons (where N is this distance). With non-random-access iterators, the iterator advances themselves add linear complexity in N on average.
So it does not seem like a good idea.
Using std::vector, you will get the correct complexity for lower_bound.
And you may get better performance for inserting/removing elements too (despite the worse asymptotic complexity).
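A quick sketch of that vector variant (Object3D is the same assumed type as in the multiset sketch above; none of this is from the original post): lower_bound is a genuine O(log n) binary search on random-access iterators, while the insert itself shifts the tail, which is O(n) but contiguous and cache-friendly.
#include <algorithm>
#include <vector>

std::vector<Object3D*> objects;  // kept sorted furthest-to-nearest

void insertSorted(Object3D* obj) {
    // binary search for the first element nearer than obj
    auto pos = std::lower_bound(
        objects.begin(), objects.end(), obj,
        [](const Object3D* a, const Object3D* b) { return a->z > b->z; });
    objects.insert(pos, obj);  // O(log n) search + O(n) shift
}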
Depending on how big the list is, you can keep a smaller "mutation set" for the objects that were added/changed in the last frame, alongside the big existing sorted set.
Then each frame you do a merge while drawing:
vector<GameObject*> newList;
newList.reserve(mutationSet.size() + ExistingSet.size());
sort(mutationSet.begin(), mutationSet.end(), byZCoord); // small list -> faster sort
auto mutationIt = mutationSet.begin();
for (auto it = ExistingSet.begin(); it != ExistingSet.end(); ++it) {
    if ((*it)->isRemoved()) {
        // release to pool and skip it
        continue;
    }
    // emit all mutated objects that sort before the current existing one
    while (mutationIt != mutationSet.end() && (*mutationIt)->getZ() < (*it)->getZ()) {
        (*mutationIt)->render();
        newList.push_back(*mutationIt);
        ++mutationIt;
    }
    (*it)->render();
    newList.push_back(*it);
}
// drain whatever is left of the mutation set
while (mutationIt != mutationSet.end()) {
    (*mutationIt)->render();
    newList.push_back(*mutationIt);
    ++mutationIt;
}
mutationSet.clear();
ExistingSet.clear();
swap(ExistingSet, newList);
You will be doing the iteration anyway, and sorting a small list then merging is faster than appending the new list and re-sorting everything: O(n + k + k log k) vs. O((n+k) log(n+k)).

C++ Search Performance

What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text which will be obtained from miscellaneous sources; that is, this file's contents will change each time the program is executed (~0.5MB). Essentially, I want to be able to paste some text into a text file and see which names from my list are found, kind of like the find function (Ctrl + F) but with 70,000 keywords.
In any case, what I have thus far is:
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    ifstream namesfile("names.txt"); // names list
    ifstream miscfile("misc.txt");   // misc text
    vector<string> vecnames; // vector to hold names
    vector<string> vecmisc;  // vector to hold misc text
    size_t found;
    string s;
    string t;
    while (getline(namesfile, s))
        vecnames.push_back(s);
    while (getline(miscfile, t))
        vecmisc.push_back(t);
    // outer loop iterates through the names list
    for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
        // inner loop iterates through the lines of the misc text file
        for (vector<string>::size_type j = 0; j != vecmisc.size(); ++j) {
            found = vecmisc[j].find(vecnames[i]);
            if (found != string::npos) {
                cout << vecnames[i] << endl;
                break;
            }
        }
    }
    cout << "SEARCH COMPLETE";
    // keep the console window open
    getchar();
    return 0;
}
Now this works great as far as extracting the data I need. However, it is terribly slow and obviously inefficient, since each name potentially requires searching the entire file again, which gives (70,000 x the number of lines in the misc text file) iterations. If anyone could help, I would certainly appreciate it. Some sample code is most welcome. Additionally, I'm using Dev C++, if that makes any difference. Thanks.
Use a hash set (std::unordered_set in C++11; the older, nonstandard hash_set on pre-C++11 compilers). Insert all your keywords into the set, then traverse the large document and, for each word, test whether the set contains it.
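A minimal sketch of that approach (file names reused from the question; note it matches whole whitespace-delimited words, unlike the substring find in the original code):
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    std::ifstream namesfile("names.txt");
    std::ifstream miscfile("misc.txt");
    std::unordered_set<std::string> names;
    std::string word;
    // one name per line, as in the question's names.txt
    while (std::getline(namesfile, word))
        names.insert(word);
    // stream the misc text word by word; each membership test is O(1) on average
    while (miscfile >> word)
        if (names.count(word))
            std::cout << word << '\n';
    return 0;
}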
Using a vector, the best-case search time you're going to get is O(log N), using binary search, and that only works on a sorted list. If you include the time it takes to make sorted insertions, the final amortized complexity for a sorted linear container (arrays, lists), as well as for non-linear containers such as binary search trees, is O(N log N). In other words, if you double the size of the list, it takes a little over double the time to sort it, while any individual search afterwards stays quick: to double the search time, the list would have to grow to the square of its current element count.
A good hash-table implementation on the other hand (such as std::unordered_map), along with a good hash algorithm that avoids too many collisions, has an amortized complexity of O(1): overall there is a constant look-up time for any given element, no matter how many elements there are, making searches very fast. The main penalty over a linear list or binary search tree is the actual memory footprint of the hash table. A good hash table, in order to avoid too many collisions, will want a size equal to some large prime number that is at least greater than 2*N, where N is the total number of elements you plan to store. But that "wasted space" is the trade-off for efficient and extremely fast look-ups.
While a map of any kind is the simplest solution, Scott Meyers makes a good case for a sorted vector and binary_search from <algorithm> (in Effective STL).
Using a sorted vector, your code would look something like
#include <algorithm>
...
int vecsize = vecnames.size();
sort(vecnames.begin(), vecnames.begin() + vecsize);
for (vector<string>::size_type j = 0; j != vecmisc.size(); ++j)
{
    bool found = binary_search(vecnames.begin(), vecnames.begin() + vecsize, vecmisc[j]);
    if (found) std::cout << vecmisc[j] << std::endl;
}
The advantages of using a sorted vector and binary_search are
1) There is no tree to traverse; binary_search begins at (end - start)/2 and keeps dividing by 2, so it takes at most log(n) steps to search the range.
2) There is no key/value pair. You get the simplicity of a vector without the overhead of a map.
3) The vector's elements are in a contiguous range (which is why you should use reserve before populating the vector; inserts are faster), so searching through the vector's elements rarely crosses page boundaries (slightly faster).
4) It's cool.