I am trying to verify the correctness of a sorting algorithm S that is sorting a large array A of at least 4 GB. Assuming S is sorting in non-decreasing order, checking only A[i - 1] <= A[i] for 1 <= i < n is not sufficient. This is because the keys produced by S, even though sorted, may contain one or more keys that do not belong to the original A.
I can think of at least two trivial ways to test the correctness:
Make a copy of A to A_copy before A is sorted, use std::sort on A_copy, and check A[i] == A_copy[i] for 0 <= i < n after the sort.
Maintain a std::unordered_map to store the frequency of the keys in A before sort, and verify with the frequency after the sort in addition to the non-decreasing order check.
There are obvious issues with the above approaches. std::sort is extremely slow for large data and requires O(n) additional memory. Using a map should be faster but also requires extra O(n) memory if the keys are unique.
My question: is there any better way to perform this sort correctness check that is both fast and uses O(1) extra memory?
Thanks.
You can treat your algorithm as a message being transferred over an unreliable channel and use error detection/correction methods. The main difference is that your data ends up out of its original order, while most error-correction schemes are sensitive to position, though not all of them.
One simple solution is to store the XOR of hash(a) for all a in A, though it can only reliably detect a single added element (if, for example, the same element was added twice, the two hashes cancel out and the check fails to notice it).
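int verification = 0;
// XOR in the hash of every element before sorting
for (const auto& a : A) {
    verification ^= hash(a);
}
mySort(A);
// XOR again after sorting: identical contents cancel out to 0
for (const auto& a : A) {
    verification ^= hash(a);
}
if (verification != 0) {
    // invalid
} else {
    // valid
}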
The literature contains many more options for detecting or even correcting errors on wires, which you can use here. These give you a trade-off between the amount of extra memory you use and the number of mistakes you are able to find.
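As a minimal sketch of the idea above, the sortedness check can be combined with an order-independent fingerprint that needs only O(1) extra memory. The names Fingerprint, mix, fingerprintOf, looksCorrectlySorted and sortChecked are hypothetical, int keys are assumed, and the mixing function is just one assumed choice of hash. Keeping a wrapping sum and a count next to the XOR catches the "same element added twice" case, but collisions remain possible, so this is a probabilistic check rather than a proof.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Order-independent summary of a collection of keys: O(1) memory.
struct Fingerprint {
    std::uint64_t xorOfHashes = 0;
    std::uint64_t sumOfHashes = 0;  // wraps modulo 2^64, which is fine here
    std::size_t count = 0;
    bool operator==(const Fingerprint& other) const {
        return xorOfHashes == other.xorOfHashes &&
               sumOfHashes == other.sumOfHashes && count == other.count;
    }
};

// Assumed hash: a 64-bit bit mixer (the splitmix64 finalizer).
inline std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

inline Fingerprint fingerprintOf(const std::vector<int>& a) {
    Fingerprint f;
    for (int key : a) {
        const std::uint64_t h = mix(static_cast<std::uint64_t>(key));
        f.xorOfHashes ^= h;
        f.sumOfHashes += h;
        ++f.count;
    }
    return f;
}

// True if `a` is non-decreasing and its fingerprint matches the one
// taken before sorting.
inline bool looksCorrectlySorted(const std::vector<int>& a, const Fingerprint& before) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        if (a[i - 1] > a[i]) return false;
    }
    return fingerprintOf(a) == before;
}

// Usage sketch: aborts (in builds with assertions enabled) if `sorter`
// leaves the data unsorted or changes its contents.
template <typename Sorter>
void sortChecked(std::vector<int>& a, Sorter sorter) {
    const Fingerprint before = fingerprintOf(a);
    sorter(a);
    assert(looksCorrectlySorted(a, before));
}
```

Both passes are linear scans, so the check costs two extra reads of A and a few machine words of state.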
Related
I am trying to solve the programming problem firstDuplicate on codesignal. The problem is "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has minimal index".
Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3
There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.
With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?
#include <algorithm>
int firstDuplicate(vector<int> a) {
vector<int> seen;
for (size_t i = 0; i < a.size(); ++i){
if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()){
return a[i];
}else{
seen.push_back(a[i]);
}
}
if (seen == a){
return -1;
}
}
Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be to use a hash table. In C++, the unordered_map and unordered_set classes are made for these kinds of coding exercises. The unordered_set is effectively a map of keys to bools.
Also, pass your vector by reference, not by value. Passing by value incurs the overhead of copying the entire vector.
Also, that comparison seems costly and unnecessary at the end.
This is probably closer to what you want:
#include <unordered_set>
int firstDuplicate(const vector<int>& a) {
std::unordered_set<int> seen;
for (int i : a) {
auto result_pair = seen.insert(i);
bool duplicate = (result_pair.second == false);
if (duplicate) {
return (i);
}
}
return -1;
}
std::find is linear time complexity in terms of distance between first and last element (or until the number is found) in the container, thus having a worst-case complexity of O(N), so your algorithm would be O(N^2).
Instead of storing your numbers in a vector and searching it every time, you should use something like a std::map to record the numbers encountered, and return a number as soon as you find it already present in the map while iterating.
std::map<int, int> hash;
for(const auto &i: a) {
if(hash[i])
return i;
else
hash[i] = 1;
}
Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since insertion time complexity is constant in average case as compared to logarithmic insertion complexity for std::map.
It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 256^20. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.
In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.
Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.
int firstDuplicate(std::vector<int> const &input) {
std::vector<bool> seen(input.size()+1);
for (auto i : input) {
if (seen[i])
return i;
seen[i] = true;
}
return -1;
}
The advantage here is fairly simple: at least in a typical case, std::vector<bool> uses a specialization to store bools in only one bit apiece. This way we're storing only one bit for each number of input, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.
Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.
We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...
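For illustration only, here is a minimal sketch of what such a hand-rolled bitmap could look like. The function name firstDuplicateBitmap is hypothetical, the word size of 64 bits is an assumption, and whether it actually beats vector<bool> would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a hand-rolled bitmap: one bit per possible value, packed
// into 64-bit words. Assumes every value lies in [1, input.size()].
inline int firstDuplicateBitmap(const std::vector<int>& input) {
    std::vector<std::uint64_t> seen(input.size() / 64 + 1, 0);
    for (int value : input) {
        assert(value >= 1 && static_cast<std::size_t>(value) <= input.size());
        const std::size_t word = static_cast<std::size_t>(value) / 64;
        const std::uint64_t bit = std::uint64_t{1} << (value % 64);
        if (seen[word] & bit) return value;  // bit already set: duplicate
        seen[word] |= bit;
    }
    return -1;  // no value occurred twice
}
```

The storage density is the same one bit per value as vector<bool>; the only difference is that the word and bit arithmetic is spelled out by hand.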
I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
// Insert one entry with value x
void Insert(int x);
// Erase one entry with value x, if one exists
void Erase(int x);
// Erase all entries, x, from <= x < to
void Erase(int from, int to);
// Return the count of all entries, x, from <= x < to
size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
std::vector<int> integerCollection;
};
void IntegerCollection::Insert(int x) {
/* using lower_bound to find the right place for x to be inserted
keeps the vector sorted and makes life much easier */
auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
integerCollection.insert(it, x);
}
void IntegerCollection::Erase(int x) {
// find the location of the first element containing x and delete if it exists
auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
if (it != integerCollection.end()) {
integerCollection.erase(it);
}
}
void IntegerCollection::Erase(int from, int to) {
if (integerCollection.empty()) return;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
/* std::vector::erase deletes entries between the two iterators
fromBound (included) and toBound (not included) */
integerCollection.erase(fromBound, toBound);
}
size_t IntegerCollection::Count(int from, int to) const {
if (integerCollection.empty()) return 0;
int count = 0;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
// increment pointer until fromBound == toBound (we don't count elements of value = to)
while (fromBound != toBound) {
++count; ++fromBound;
}
return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is? deque only makes sense if I can guarantee insertion or deletion to the ends of the container and list hogs memory. Is there something else that I'm completely overlooking?
We cannot know what would make the company happy. If they reject std::vector without giving clear reasoning, I wouldn't want to work for them anyway. Moreover, we don't really know the precise requirements. Were you asked to provide one reasonably well performing implementation? Did they expect you to squeeze the last percent out of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the first you can either
roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
std::vector for data locality. See eg here for Bjarne himself advocating std::vector rather than linked lists.
std::set for ease of implementation. It seems like you want the container sorted and the interface you have to implement fits that of std::set quite well.
Let's compare only insertion and erasure, assuming the container needs to stay sorted:
operation    std::set    std::vector
insert       log(N)      N
erase        log(N)      N
Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality data can be far apart in memory (std::set), leading to many cache misses, or it can be local, as with std::vector. The log(N) only wins for huge N. To get an idea of the difference (taking log base 2), 500000/log(500000) is roughly 26410 while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for considerably small container sizes, but at some point the log(N) wins over cache. The exact location of this turning point depends on many factors and can only be reliably determined by profiling and measuring.
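To make the std::set option concrete, here is a minimal sketch of the interface on top of std::multiset (a multiset rather than a set, on the assumption that duplicates must be stored). The class name IntegerCollectionSet is hypothetical, and whether this beats the sorted vector on the 500,000-operation test would need measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Sketch of the interface backed by std::multiset (duplicates allowed).
class IntegerCollectionSet {
public:
    // O(log N)
    void Insert(int x) { values_.insert(x); }

    // Erase a single entry equal to x, if any. O(log N).
    // Note: values_.erase(x) would remove *all* entries equal to x.
    void Erase(int x) {
        auto it = values_.find(x);
        if (it != values_.end()) values_.erase(it);
    }

    // Erase all entries with from <= x < to.
    void Erase(int from, int to) {
        assert(from <= to);
        values_.erase(values_.lower_bound(from), values_.lower_bound(to));
    }

    // Count entries with from <= x < to. The two bounds cost O(log N),
    // but std::distance walks the range, so this is linear in the result.
    std::size_t Count(int from, int to) const {
        assert(from <= to);
        return static_cast<std::size_t>(
            std::distance(values_.lower_bound(from), values_.lower_bound(to)));
    }

private:
    std::multiset<int> values_;
};
```

Note the trade-off in Count: with a sorted vector the same two lower_bound calls give random-access iterators, so their difference is available in constant time, whereas the tree has to walk the range.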
Nobody knows which container is MOST efficient for multiple insertions / deletions. That is like asking what the most fuel-efficient design for a car engine is. People are always innovating on car engines; they make more efficient ones all the time. However, I would recommend a splay tree. The time required for an insertion or deletion in a splay tree is not constant: some insertions take a long time and some take only a very short time. However, the amortized time per insertion/deletion is guaranteed to be O(log n), where n is the number of items stored in the splay tree. Logarithmic time is extremely efficient. It should be good enough for your purposes.
The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.
The integer value can be hashed to compute an index in to an array of bools or bits, used to tell if the integer value is in the container or not.
Counting and deleting large ranges could be sped up from there by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, that each stored ints from 0 to 0xFFFF and were using 32 bit integers you could then mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.
IntHashTable containers[0x10000];
u_int32 hashIndex = (u_int32)value / 0x10000;
u_int32 valueInTable = (u_int32)value - (hashIndex * 0x10000);
containers[hashIndex].insert(valueInTable);
Count for example could be implemented as so, if each hash table kept count of the number of elements it contained:
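indexStart = startRange / 0x10000;
indexEnd = endRange / 0x10000;
int countTotal = 0;
// Caveat: this sums whole tables, so the first and last table still need a
// partial count restricted to the requested range to get an exact result.
for (int i = indexStart; i<=indexEnd; ++i) {
countTotal += containers[i].count();
}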
Not sure if using sorting really is a requirement for removing the range. It might be based on position. Anyway, here is a link with some hints which STL container to use.
In which scenario do I use a particular STL container?
Just FYI.
Vector may be a good choice, but it does a lot of reallocation, as you know. I prefer deque instead, as it doesn't require one big chunk of memory to hold all items. For requirements such as yours, list probably fits better.
Basic solution for this problem might be std::map<int, int>
where the key is the integer you are storing and the value is the number of occurrences.
Problem with this is that you can not quickly remove/count ranges. In other words complexity is linear.
For quick count you would need to implement your own complete binary tree where you can know the number of nodes between 2 nodes(upper and lower bound node) because you know the size of tree, and you know how many left and right turns you took to upper and lower bound nodes. Note that we are talking about complete binary tree, in general binary tree you can not make this calculation fast.
For quick range remove I do not know how to make it faster than linear.
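The basic std::map<int, int> idea from this answer can be sketched as follows. The class name IntegerCollectionMap is hypothetical; the range operations stay linear in the number of distinct keys touched, as noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Sketch: key = stored integer, value = how many times it is stored.
class IntegerCollectionMap {
public:
    void Insert(int x) { ++counts_[x]; }

    // Remove one occurrence of x, dropping the key when it reaches zero.
    void Erase(int x) {
        auto it = counts_.find(x);
        if (it == counts_.end()) return;
        if (--it->second == 0) counts_.erase(it);
    }

    // Remove every key in [from, to).
    void Erase(int from, int to) {
        assert(from <= to);
        counts_.erase(counts_.lower_bound(from), counts_.lower_bound(to));
    }

    // Sum the multiplicities of the keys in [from, to): linear in the
    // number of distinct keys in the range.
    std::size_t Count(int from, int to) const {
        assert(from <= to);
        std::size_t total = 0;
        const auto last = counts_.lower_bound(to);
        for (auto it = counts_.lower_bound(from); it != last; ++it) {
            total += static_cast<std::size_t>(it->second);
        }
        return total;
    }

private:
    std::map<int, int> counts_;
};
```

Compared with storing every entry separately, heavily duplicated values cost a single node here, which helps when the data contains many repeats.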
I am solving a problem on LeetCode, but nobody has yet been able to explain my issue.
The problem is as such:
Given an arbitrary ransom note string and another string containing letters from all the magazines, write a function that will return true if the ransom note can be constructed from the magazines; otherwise, it will return false.
Each letter in the magazine string can only be used once in your ransom note.
Note:
You may assume that both strings contain only lowercase letters.
canConstruct("a", "b") -> false
canConstruct("aa", "ab") -> false
canConstruct("aa", "aab") -> true
My code (which takes 32ms):
class Solution {
public:
bool canConstruct(string ransomNote, string magazine) {
if(ransomNote.size() > magazine.size()) return false;
unordered_map<char, int> m;
for(int i = 0; i < magazine.size(); i++)
m[magazine[i]]++;
for(int i = 0; i < ransomNote.size(); i++)
{
if(m[ransomNote[i]] <= 0) return false;
m[ransomNote[i]]--;
}
return true;
}
};
The code (which I don't know why is faster - takes 19ms):
bool canConstruct(string ransomNote, string magazine) {
int lettersLeft = ransomNote.size(); // Remaining # of letters to be found in magazine
int arr[26] = {0};
for (int j = 0; j < ransomNote.size(); j++) {
arr[ransomNote[j] - 'a']++; // letter - 'a' gives a value of 0 - 25 for each lower case letter a-z
}
int i = 0;
while (i < magazine.size() && lettersLeft > 0) {
if (arr[magazine[i] - 'a'] > 0) {
arr[magazine[i] - 'a']--;
lettersLeft--;
}
i++;
}
if (lettersLeft == 0) {
return true;
} else {
return false;
}
}
Both of these have the same complexity and use the same structure to solve the problem, but I don't understand why one takes almost twice as much time as the other. The time to query a vector is O(1), but it's the same for an unordered_map. Same story with adding an entry/key to either of them.
Please, could someone explain why the run time varies so much?
First thing to note is, although the average time to query an unordered_map is constant, the worst case is not O(1). As you can see here it actually rises to the order of O(N), N denoting the size of the container.
Secondly, as vector allocates sequential portions of memory, access to that memory is highly efficient and actually is constant, even in the worst case (i.e. simple pointer arithmetic, as opposed to computing the result of a more complex hash function). There is also the possibility of various levels of caching of sequential memory (depending on the platform your code is running on), which may make code using vector even faster compared to code using unordered_map.
In essence, in terms of complexity, the worst-case performance of a vector is more efficient than that of unordered_map. On top of that, most hardware systems offer features such as caching which give usage of vector an even bigger edge. (i.e. lesser constant factors in O(1) operations)
Your second approach uses plain C array where accessing an element is a simple pointer dereference. But that is not the case with unordered_map. There are two points to note:
First, accessing an element is not a simple pointer dereference. It has to do other work to maintain its internal structure. An unordered_map is a hash table under the hood, and the C++ standard's bucket interface and reference-stability guarantees effectively mandate separate chaining, which means hashing the key and following pointers: far more work than a simple array access.
Second, O(1) access is on average but not on worst case.
For these reasons it is no wonder that the array version works better than unordered_map even though they have the same run-time complexity. This is another example where two pieces of code with the same run-time complexity perform differently.
You will see the benefit of unordered_map only when you have a large number of keys (as opposed to the fixed 26 here).
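As a small sketch of the array-indexing point, here is the counting approach written with a fixed 26-entry table in place of the hash table. The function name canConstructWithArray is hypothetical, and the code assumes, as the problem states, that both strings contain only lowercase letters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Counts the magazine letters in a fixed 26-entry table, then spends them.
// Assumes both strings contain only lowercase 'a'..'z'.
inline bool canConstructWithArray(const std::string& ransomNote,
                                  const std::string& magazine) {
    if (ransomNote.size() > magazine.size()) return false;
    std::array<int, 26> available{};  // zero-initialized
    for (char c : magazine) {
        assert(c >= 'a' && c <= 'z');
        ++available[static_cast<std::size_t>(c - 'a')];
    }
    for (char c : ransomNote) {
        assert(c >= 'a' && c <= 'z');
        const std::size_t index = static_cast<std::size_t>(c - 'a');
        if (available[index] == 0) return false;  // letter used up
        --available[index];
    }
    return true;
}
```

Each lookup is one subtraction and one indexed load into a table that fits in a couple of cache lines, with no hashing and no node traversal.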
"O(1)" means "constant time" -- that is, an algorithm that is (truly) O(1) will not get slower when there is more data (in this case, when there are more items in the map or array). It does not indicate how fast the algorithm runs -- it only indicates that it won't slow down if there is more data. Seeing different times for one O(1) algorithm vs. another does not mean that they are not O(1). You should not expect that one O(1) algorithm will run exactly as fast as another. But, if there is a difference, you should see the same difference if the maps/arrays have more data in them.
I was looking to get only the unique elements from a container. Let's say srcContainer is the container from which I want unique elements. I looked at three options:
1. Using std::unique
std::sort(srcContainer.begin(), srcContainer.end());
srcContainer.erase(std::unique(srcContainer.begin(), srcContainer.end()), srcContainer.end());
2. Using BOOST::unique
boost::erase(srcContainer, boost::unique<boost::return_found_end>(boost::sort(srcContainer)));
3. My own method
std::set<T> uniqueElems(srcContainer.begin(), srcContainer.end());
srcContainer.clear();
srcContainer.insert(srcContainer.end(), uniqueElems.begin(), uniqueElems.end());
The issue with 1. and 2. is that they change the order in which members occurred in the original srcContainer. With 3. there is no change in order, and in addition it gives much better performance than 1. and 2. above (is it because there is no explicit sorting in 3.?). The elapsed wall clock time for the 3 methods above and the number of elements in srcContainer are given below:
size of srcContainer (contains integers) = 1e+6
- std::unique = 1.04779 secs
- BOOST::unique = 1.04774 secs
- Own method = 0.481638 secs
size of srcContainer (contains integers) = 1e+8
- std::unique = 151.554 secs
- BOOST::unique = 151.474 secs
- Own method = 57.5693 secs
My question is:
Is there better way to find unique using std::unique or BOOST::unique or any other code and maintaining the original order in the container?
Any issue with using method 3. above.
For performance profiling srcContainer was created as follows:
std::vector<int> srcContainer;
int halfWay = numElems/2;
for (size_t k=0; k<numElems; ++k) {
if (k < halfWay)
srcContainer.push_back(k);
else
srcContainer.push_back(k - halfWay);
}
Edits:
Agreed with comments that method 3. also changes the order of elements. Is there a better way to get unique elements without changing order?
Thanks
EDIT based on info about source data:
The reason you're seeing the set insertion complete faster than sorting the vector is that your input data is two already sorted ranges. For quicksort (typically used by std::sort) this is a degenerate case and one of the worst possible inputs you could give it. For an input size of 1e8 changing the sort from std::sort to std::stable_sort cut the runtime from ~25s to <9s.
If you want to keep the original item order, you could try something like the following, which keeps a hash of all the items seen so far and uses it with remove_if. I have no idea what the performance of this would be:
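#include <algorithm>
#include <unordered_set>
// Assumes src holds ints and uses std::unordered_set as the hash of seen items.
struct Remover
{
    explicit Remover(std::unordered_set<int>& found_items) : found_items_(found_items) { }
    // insert() fails for a value that is already in the set, which marks a duplicate.
    bool operator()(int item) const { return !found_items_.insert(item).second; }
    std::unordered_set<int>& found_items_;
};
std::unordered_set<int> dup_finder;
Remover remover(dup_finder);
// erase is a member function of the container
src.erase(std::remove_if(src.begin(), src.end(), remover), src.end());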
Original components of my answer:
If the elements in the source container are already mostly sorted, you may see better performance with stable_sort rather than sort prior to calling unique. Without more information about your data set, I'm unable to guess what might cause option 3 to perform better than 1 and 2.
Option 3 should remove duplicates, but keep in mind that, in spite of what you assert, it will still reorder the items in exactly the same way that the first two options do.
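The hashing idea for keeping the original order can also be written as an explicit loop, which avoids relying on a predicate that carries state through remove_if. This is only a sketch: the function name uniquePreservingOrder is hypothetical, int elements are assumed, and its performance on the data sets above has not been measured.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Keeps the first occurrence of every value and preserves relative order.
// Average O(n) time, O(number of distinct values) extra memory.
inline void uniquePreservingOrder(std::vector<int>& values) {
    std::unordered_set<int> seen;
    std::size_t write = 0;
    for (std::size_t read = 0; read < values.size(); ++read) {
        // insert().second is true only the first time a value is seen.
        if (seen.insert(values[read]).second) {
            values[write++] = values[read];
        }
    }
    assert(write <= values.size());
    values.resize(write);  // drop the leftover tail
}
```

Unlike the sort-based options, this never compares elements against each other by order, so the first occurrence of each value stays where it was relative to the others.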
Does anyone know if it's possible to turn this from O(m * n) to O(m + n)?
vector<int> theFirst;
vector<int> theSecond;
vector<int> theMatch;
theFirst.push_back( -2147483648 );
theFirst.push_back(2);
theFirst.push_back(44);
theFirst.push_back(1);
theFirst.push_back(22);
theFirst.push_back(1);
theSecond.push_back(1);
theSecond.push_back( -2147483648 );
theSecond.push_back(3);
theSecond.push_back(44);
theSecond.push_back(32);
theSecond.push_back(1);
for( int i = 0; i < theFirst.size(); i++ )
{
for( int x = 0; x < theSecond.size(); x++ )
{
if( theFirst[i] == theSecond[x] )
{
theMatch.push_back( theFirst[i] );
}
}
}
Put the contents of the first vector into a hash set, such as std::unordered_set. That is O(m). Scan the second vector, checking if the values are in the unordered_set and keeping a tally of those that are. That is n lookups of a hash structure, so O(n). So, O(m+n). If you have l elements in the overlap, you may count O(l) for adding them to the third vector. std::unordered_set is in the C++0x draft and available in the latest gcc versions, and there is also an implementation in boost.
Edited to use unordered_set
Using C++2011 syntax:
unordered_set<int> firstMap(theFirst.begin(), theFirst.end());
for (const int& i : theSecond) {
if (firstMap.find(i)!=firstMap.end()) {
cout << "Duplicate: " << i << endl;
theMatch.push_back(i);
}
}
Now, the question still remains, what do you want to do with duplicates in the originals? Explicitly, how many times should 1 be in theMatch, 1, 2 or 4 times?
The code above outputs:
Duplicate: 1
Duplicate: -2147483648
Duplicate: 44
Duplicate: 1
Using this: http://www.cplusplus.com/reference/algorithm/set_intersection/
You should be able to achieve O(mlogm + nlogn) I believe. (set_intersection requires that the input ranges be already sorted).
This might perform a bit differently than your solution for duplicate elements, however.
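A minimal sketch of this approach, assuming it is acceptable to sort copies of the two vectors (the function name sortedIntersection is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Takes copies so the callers' vectors keep their original order.
inline std::vector<int> sortedIntersection(std::vector<int> first,
                                           std::vector<int> second) {
    std::sort(first.begin(), first.end());    // O(m log m)
    std::sort(second.begin(), second.end());  // O(n log n)
    std::vector<int> match;
    // A value present a times in one range and b times in the other
    // appears min(a, b) times in the output.
    std::set_intersection(first.begin(), first.end(),
                          second.begin(), second.end(),
                          std::back_inserter(match));
    // Postcondition: set_intersection emits its output in sorted order.
    assert(std::is_sorted(match.begin(), match.end()));
    return match;
}
```

For the vectors in the question this yields -2147483648, 1, 1, 44: the value 1 appears twice, whereas the nested loops push it four times.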
Please correct me if I am wrong,
you are suggesting following solution for the intersection problem:
sort the two vectors, then iterate over both sorted vectors in step so that we reach each common element,
so the overall complexity will be
(n*log(n) + m*log(m)) + (n + m)
assuming k*log(k) as the complexity of sorting.
Am I right?
Of course, the complexity will depend on the complexity of sorting.
I would sort the longer array O(n*log (n)), search for elements from the shorter array O(m*log (n)). Total is then O(n*log(n) + m*log (n) )
Assuming you want to produce theMatch from two data sets, and you don't care about the data sets themselves, put one in an unordered_map (available currently from Boost and listed in the final committee draft for C++11), mapping the key to an integer that is incremented each time the key is added and therefore keeps track of the number of times the key occurs. Then, when you get a hit on the other data set, you push_back the hit as many times as it occurred in the first data set.
You can get to O(n log n + m log m) by sorting the vectors first, or O((n + m) log n) by creating a std::map of one of them (n insertions plus m lookups, each O(log n)).
Caveat: these are not order-preserving operations, and theMatch will come out in different orders with different techniques. It looks to me like the order is likely considered arbitrary. If the order given in the code above is necessary, I don't think there's a better algorithm.
Edit:
Take data set A and data set B, of type Type. Create an unordered_map<Type, int>.
Go through data set A, and check each member to see if it's in the map. If not, add the element with the int 1 to the map. If it is, increment the int. Each of these operations is O(1) on the average, so this step is O(len A).
Go through data set B, and check each member to see if it's in the map. If not, go on to the next. If so, push_back the member onto the destination queue. The int is the number of times that value is in data set A, so do the push_back the number of times the member's in A to duplicate the behavior given. Each of these operations is on the average O(1), so this step is O(len B).
This is average behavior. If you always hit the worst case, you're back with O(m*n). I don't think there's a way to guarantee O(m + n).
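The steps above can be sketched as follows. The function name matchWithCounts is hypothetical and int elements are assumed; the result contains the same values as the nested loops, repeated the same number of times, but in the order of data set B rather than A.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Reproduces the nested-loop result (including repeats) in average
// O(len A + len B + output size) time.
inline std::vector<int> matchWithCounts(const std::vector<int>& a,
                                        const std::vector<int>& b) {
    std::unordered_map<int, std::size_t> timesInA;
    for (int value : a) ++timesInA[value];  // a missing key starts at 0

    std::vector<int> match;
    for (int value : b) {
        const auto it = timesInA.find(value);
        if (it == timesInA.end()) continue;  // not in A: go on to the next
        assert(it->second > 0);
        // One push_back per occurrence in A, as the nested loops would do.
        match.insert(match.end(), it->second, value);
    }
    return match;
}
```

For the vectors in the question this produces six entries, with 1 appearing four times because it occurs twice in each input.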
If the order of the elements in the resulting array/set doesn't matter then the answer is yes.
For the arbitrary types of elements with some order defined the best algorithm is O( max(m,n)*log(min(m,n)) ). For the numbers of limited size the best algorithm is O(m+n).
Construct the set of elements of the smaller array: for arbitrary elements, just sorting is OK, and for numbers of limited size it must be something similar to the intermediate table in a numeric sort.
Iterate through the larger array and check whether each element is in the set constructed earlier: for arbitrary elements binary search is OK (which is O(log(min(n,m)))), and for numbers a single check is O(1).
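For the arbitrary-element case, the two steps above can be sketched like this. The function name matchBySearch is hypothetical and int elements are assumed; each element of the larger array is reported once if it occurs anywhere in the smaller one, so the repeat counts differ from the nested loops.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort a copy of the smaller input, then binary-search it for each
// element of the larger one: O(max(m,n) * log(min(m,n))) overall.
inline std::vector<int> matchBySearch(const std::vector<int>& a,
                                      const std::vector<int>& b) {
    const bool aIsSmaller = a.size() <= b.size();
    std::vector<int> smaller = aIsSmaller ? a : b;  // copy, sorted below
    const std::vector<int>& larger = aIsSmaller ? b : a;
    std::sort(smaller.begin(), smaller.end());

    std::vector<int> match;
    for (int value : larger) {
        if (std::binary_search(smaller.begin(), smaller.end(), value)) {
            match.push_back(value);
        }
    }
    assert(match.size() <= larger.size());
    return match;
}
```

The output follows the order of the larger array, which is only acceptable when, as stated above, the order of the result does not matter.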