Is it possible to generically sort in linear time? - c++

I'm trying to solve a problem in O(n) time where, given two forward iterators to the front of a container and the back of a container, I want to remove all elements in the container that don't appear at least < this number > of times. For example, given a vector of strings such as ("john", "hello", "one", "yes", "hello", "one") and I wanted to remove all elements that appear less than 2 times, my final vector would then contain just ("hello", "one").
I was thinking that if I could generically sort in O(n) time I can accomplish this result (in O(n) time), but I'm having a hard time doing that with strings, ints, chars, or whatever else may be used (generically). Am I thinking about this correctly, or is there a simpler way to solve the problem?

Yes, you are not actually sorting but removing elements.
1). Store each word into a hashset.
2). Lookup and only add if not in hashset.

Short answer: no. Comparison based sorting takes O(n log n) time. (This can be formally proved.) If you know something about your input (e.g. the input is distributed uniformly at random within a known range) then you can use well known algorithms such as bucket sort or radix sort in O(n) time. Contrary to #Mooing Duck, there is no such thing as sorting in O(1) time (this should be obvious -- you must visit each element at least once for any sorting algorithm).
However, as several other posters have noted, your problem does not require a sorting algorithm ...

There is no need to sort
1) Populate std::unordered_map<string,vector<int>> indexOfStrings; - O(N)
2) For each string whose vector size() < 2, delete element - O(number of deletions) <= O(N)
indexOfStrings - stores the index of each occurance of the string. This allows for quick deletion from vector without the need for a search.

You don't need a sort, you just need an unordered_map:
unordered_map<string, int> counter;
vector<string> newvec;
for(string &s : v) {
if((++counter[s]) == 2) {
newvec.push_back(s);
}
}
Note that this is C++11 code. (Thanks #jogojapan for the code improvement suggestion).

Related

Optimal data structure (in C++) for random access and looping through elements

I have the following problem: I have a set of N elements (N being somewhere between several hundred and several thousand elements, let's say between 500 and 3000 elements). Out of these elements, small percentage will have some property "X", but the elements "gain" and "lose" this property in a semi-random fashion; so if I store them all in an array, and assign 1 to elements with property X, and zero otherwise, this array of N elements will have n 1's and the N-n zeros (n being small in the 20-50 range).
The problem is the following: these elements change very frequently in a semi-random way (meaning that any element can flip from 0 to 1 and vice versa, but the process that controls that is somewhat stable, so the total number "n" fluctuates a bit, but is reasonably stable in the 20-50 range); and I frequently need all the "X" elements of the set (in other words, indices of the array where value of the array is 1), to perform some task on them.
One simple and slow way to achieve this is to simply loop through the array and if index k has value 1, perform the task, but this is kinda slow because well over 95% of all the elements have value 1. The solution would be to put all the 1s into a different structure (with n elements) and then loop through that structure, instead of looping through all N elements. The question is what's the best structure to use?
Elements will flip from 0 to 1 and vice versa randomly (from several different threads), so there's no order there of any sort (time when element flipped from 0 to 1 is has nothing to do with time it will flip back), and when I loop through them (from another thread), I do not need to loop in any particular order (in other words, I just need to get them all, but it's nor relevant in which order).
Any suggestions what would be the optimal structure for this? "std::map" comes to mind, but since the keys of std::map are sorted (and I don't need that feature), the questions is if there is anything faster?
EDIT: To clarify, the array example is just one (slow) way to solve the problem. The essence of the problem is that out of one big set "S" with "N" elements, there is a continuously changing subset "s" of "n" elements (with n much smaller then N), and I need to loop though that set "s". Speed is of essence, both for adding/removing elements to "s", and for looping through them. So while suggestions like having 2 arrays and moving elements between them would be fast from iteration perspective, adding and removing elements to an array would be prohibitively slow. It sounds like some hash-based approach like std::set would work reasonably fast on both iteration and addition/removal fronts, the question is is there something better than that? Reading the documentation on "unordered_map" and "unordered_set" doesn't really clarify how much faster addition/removal of elements is relative to std::map and std::set, nor how much slower the iteration through them would be. Another thing to keep in mind is that I don't need a generic solution that works best in all cases, I need one that works best when N is in the 500-3000 range, and n is in the 20-50 range. Finally, the speed is really of essence; there are plenty slow ways of doing it, so I'm looking for the fastest way.
Since order doesn't appear to be important, you can use a single array and keep the elements with property X at the front. You will also need an index or iterator to the point in the array that is the transition from X set to unset.
To set X, increment the index/iterator and swap that element with the one you want to change.
To unset X, do the opposite: decrement the index/iterator and swap that element with the one you want to change.
Naturally with multiple threads you will need some sort of mutex to protect the array and index.
Edit: to keep a half-open range as iterators are normally used, you should reverse the order of the operations above: swap, then increment/decrement. If you keep an index instead of an iterator then the index does double duty as the count of the number of X.
N=3000 isn't really much. If you use a single bit for each of them, you have a structure smaller than 400 bytes. You can use std::bitset for that. If you use an unordered_set or a set however be mindful that you'll spend many more bytes for each of the n elements in your list: if you just allocate a pointer for each element in a 64bit architecture you'll use at least 8*50 = 400 bytes, much more than the bitset
#geza : perhaps I misunderstood what you meant by two arrays; I assume you meant something like have one std::vector (or something similar) in which I store all elements with property X, and another where I store the rest? In reality, I don't care about others, so I really need one array. Adding an element is obviously simple if I can just add it to the end of the array; now, correct me if I'm wrong here, but finding an element in that array is O(n) operation (since the array is unsorted), and then removing it from the array again requires shifting all the elements by one place, so this in average requires n/2 operations. If I use linked list instead of vector, then deleting an element is faster, but finding it still takes O(n). That's what I meant when I said it would be prohibitively slow; if I misunderstood you, please do clarify.
It sounds like std::unordered_set or std::unordered_map would be fastest in adding/deleting elements, since it's O(1) to find an element, but it's unclear to me how fast can one loop through all the keys; the documentation clearly states that iteration through keys of std::unordered_map is slower then iteration through keys of std::map, but it's not quantified in any way just how slow is "slower", and how fast is "faster".
And finally, to repeat one more time, I'm not interested in general solution, I'm interested in one for small "n". So if for example I have two solutions, one that's k_1*log(n), and second that's k_2*n^2, first one might be faster in principle (and for large n), but if k_1 >> k_2 (let's say for example k_1 = 1000 and k_2=2 and n=20), second one can still be faster for relatively small "n" (1000*log(20) is still larger than 2*20^2). So even if addition/deletion in std::unordered_map might be done in constant time O(1), for small "n" it still matters if that constant time is 1 nanosecond or 1 microsecond or 1 millisecond. So I'm really looking for suggestions that work best for small "n", not for in the asymptotic limit of large "n".
An alternative approach (in my opinion worth only if the number of element is increased at least tenfold) might be keeping a double index:
#include<algorithm>
#include<vector>
class didx {
// v == indexes[i] && v > 0 <==> flagged[v-1] == i
std::vector<ptrdiff_t> indexes;
std::vector<ptrdiff_t> flagged;
public:
didx(size_t size) : indexes(size) {}
// loop through flagged items using iterators
auto begin() { return flagged.begin(); }
auto end() { return flagged.end(); }
void flag(ptrdiff_t index) {
if(!isflagged(index)) {
flagged.push_back(index);
indexes[index] = flagged.size();
}
}
void unflag(ptrdiff_t index) {
if(isflagged(index)) {
// swap last item with item to be removed in "flagged", update indexes accordingly
// in "flagged" we swap last element with element at index to be removed
auto idx = indexes[index]-1;
auto last_element = flagged.back();
std::swap(flagged.back(),flagged[idx]);
std::swap(indexes[index],indexes[last_element]);
// remove the element, which is now last in "flagged"
flagged.pop_back();
indexes[index] = 0;
}
}
bool isflagged(ptrdiff_t index) {
return indexes[index] > 0;
}
};

Time complexity difference between two containsDuplicates algorithms

I completed two version of a leetcode algorithm and am wondering if my complexity analysis is correct, even though the online submission time in ms does not show it accurately. The goal is to take a vector of numbers as a reference and return true if it contains duplicate values and false if it does not.
The two most intuitive approaches are:
1.) Sort the vector and do one sweep to the second to last, and see if any neighboring elements are identical and return true if so.
2.) Use a hashtable and insert the values and if a key already exists in the table, return true.
I completed the first version first, and it was quick, but seeing as how the sort routine would take O(nlog(n)) and the hash table inserts & map.count()s would make the second version O(log(n) + N) = O(N) I would think the hashing version would be faster with very large data sets.
In the online judging I was proven wrong, however I assumed they weren't using large enough data sets to offset the std::map overhead. So I ran a lot of tests repeatedly filling vectors up to a size between 0 and 10000 incrementing by 2, adding random values in between 0 and 20000. I piped the output to a csv file and plotted it on linux and here's the image I got.
Is the provided image truly showing me the difference here, between an O(N) and an O(nlog(n)) algorithm? I just want to make sure my complexity analysis is correct on these?
Here are the algorithms run:
bool containsDuplicate(vector<int>& nums) {
if(nums.size() < 2) return false;
sort(nums.begin(), nums.end());
for(int i = 0; i < nums.size()-1; ++i) {
if(nums[i] == nums[i+1]) return true;
}
return false;
}
// Slightly slower in small cases because of data structure overhead I presume
bool containsDuplicateWithHashing(vector<int>& nums) {
map<int, int> map;
for (int i = 0; i < nums.size(); ++i) {
if(map.count(nums[i])) return true;
map.insert({nums[i], i});
}
return false;
}
std::map is sorted, and involves O(log n) cost for each insertion and lookup, so the total cost in the "no duplicates" case (or in the "first duplicate near the end of the vector" case) would have similar big-O to sorting and scanning: O(n log n); it's typically fragmented in memory, so overhead could easily be higher than that of an optimized std::sort.
It would appear much faster if duplicates were common though; if you usually find a duplicate in the first 10 elements, it doesn't matter if the input has 10,000 elements, because the map doesn't have time to grow before you hit a duplicate and duck out. It's just that a test that only works well when it succeeds is not a very good test for general usage (if duplicates are that common, the test seems a bit silly); you want good performance in both the contains duplicate and doesn't contain duplicate cases.
If you're looking to compare approaches with meaningfully different algorithmic complexity, try using std::unordered_set to replace your map-based solution (insert returns whether the key already existed as well, so you reduce work from one lookup followed by one insert to just one combined insert and lookup on each loop), which has average case O(1) insertion and lookup, for O(n) duplicate checking complexity.
FYI, another approach that would be O(n log n) but use a sort-like strategy that shortcuts when a duplicate is found early, would be to make a heap with std::make_heap (O(n) work), then repeatedly pop_heap (O(log n) per pop) from the heap and compare to the heap's .front(); if the value you just popped and the front are the same, you've got a duplicate and can exit immediately. You could also use the priority_queue adapter to simplify this into a single container, instead of manually using the utility functions on a std::vector or the like.

Efficient removal of a set of integers from another set

I have a (large) set of integers S, and I want to run the following pseudocode:
set result = {};
while(S isn't empty)
{
int i = S.getArbitraryElement();
result.insert(i);
set T = elementsToDelete(i);
S = S \ T; // set difference
}
The function elementsToDelete is efficient (sublinear in the initial size of S) and the size of T is small (assume it's constant). T may contain integers no longer in S.
Is there a way of implementing the above that is faster than O(|S|^2)? I suspect I should be able to get O(|S| k), where k is the time complexity of elementsToDelete. I can of course implement the above in a straightforward way using std::set_difference but my understanding is that set_difference is O(|S|).
Using std::set S;, you can do:
for (auto k : elementsToDelete(i)) {
S.erase(k);
}
Of course the lookup for erase is O(log(S.size())), not the O(1) you're asking for. That can be achieved with std::unordered_set, assuming not too many collisions (which is a big assumption in general but very often true in particular).
Despite the name, the std::set_difference algorithm doesn't have much to do with std::set. It works on anything you can iterate in order. Anyway it's not for in-place modification of a container. Since T.size() is small in this case, you really don't want to create a new container each time you remove a batch of elements. In another example where the result set is small enough, it would be more efficient than repeated erase.
The set_difference in C++ library has time complexity of O(|S|) hence it is not good for your purposes so i advice you to use S.erase() to delete set element in the S in O(logN) implemented as BST . Hence your time complexity reduces to O(NlogN)

Remove elements from first set element which second set contains without iteration

I have two sets of pairs ( I cannot use c++11)
std::set<std::pair<int,int> > first;
std::set<std::pair<int,int> > second;
and I need to remove from first set all elements which are in second set(if first contain element from second to remove). I can do this by iterating through second set and if first contains same element erase from first element, but I wonder is there way to do this without iteration ?
If I understand correctly, basically you want to calculate the difference of first and second. There is an <algorithm> function for that.
std::set<std::pair<int, int>> result;
std::set_difference(first.begin(), first.end(), second.begin(), second.end(), inserter(result, result.end()));
Yes, you can.
If you want to remove, not just to detect, that is here another <algorithm> function: remove_copy_if():
http://www.cplusplus.com/reference/algorithm/remove_copy_if/
imho. It's not so difficult to understand how it works.
I wonder is there way to do this without iteration.
No. Internally, sets are balanced binary trees - there's no way to operate on them without iterating over the structure. (I assume you're interested in the efficiency of implementation, not the convenience in code, so I've deliberately ignored library routines that must iterates internally).
Sets are sorted though, so you could do an iterations over each, removing as you went (so # operations is the sum of set sizes) instead of an iteration and a lookup for each element (where number of operations is the number of elements you're iterating over times log base 2 of the number of elements in the other set). Only if one of your sets is much smaller than the other will the iterate/find approach will win out. If you look at the implementation of your library's set_difference function )mentioned in Amen's answer) - it should show you how to do the two iterations nicely.
If you want something more efficient, you need to think about how to achieve that earlier: for example, storing your pairs as flags in identically sized two-dimension matrix such that you can AND with the negation of the second set. Whether that's practical depends on the range of int values you're storing, whether the amount of memory needed is ok for your purposes....

How to efficiently *nearly* sort a list?

I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the random is not especially good, e.g. it simply based on the chance ordering of the input, e.g. an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop and so the speed of sorting and calling random() are to be considered
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
std::swap(order[i],order[i+rand()%3]);
Use first two passes of JSort. Build heap twice, but do not perform insertion sort. If element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has time complexity dependent on randomness (the more random result is needed, the less time complexity). Use heapsort with Soft heap. For detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate then introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
return = random(2)-1 <-- -1,0,-1 randomly choosen
Here we have 1/1000 chances to "scamble" two elements, but that number strictly depends on the size of your container to sort.
Another thing to add in the 1000 case, could be to remove the "right" answer because that would not scramble the result!
Edit:
if random(100 * container_size) == 0 then <-- here I consider the container size
{
if element_1 < element_2
return random(1); <-- do not return the "correct" value of -1
else if element_1 > element_2
return random(1)-1; <-- do not return the "correct" value of 1
else
return random(1)==0 ? -1 : 1; <-- do not return 0
}
in my pseudocode:
random(x) = y where 0 <= y <=x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
If you are sure that element is at most k far away from where they should be, you can reduce quicksort N log(N) sorting time complexity down to N log(k)....
edit
More specifically, you would create k buckets, each containing N/k elements.
You can do quick sort for each bucket, which takes k * log(k) times, and then sort N/k buckets, which takes N/k log(N/k) time. Multiplying these two, you can do sorting in N log(max(N/k,k))
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that any element in the list is at most k indices away from their correct position after the sorting.
but I do not think you meant any restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing merged elements. For other merge iterations, do not compare the elements, but instead select element from the same part, as in the previous step. It is not necessary to use RNG to decide, how to treat each element. Just ignore sorting order for every N-th element.
Other variant of this approach nearly sorts an array nearly in-place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use standard C++ algorithm with appropriately modified iterator, like boost::permutation_iterator). Reserve some limited space at the end of the array. Merge parts, starting from the end. If merged part is going to overwrite one of the non-merged elements, just select this element. Otherwise select element in sorted order. Level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
pick a random index i
pick a random index k
if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For a unsorted array, you could pick a few random elements and bubble them up or down. (maybe by rotation, which is a bit more efficient) It will be hard to control the amount of (dis)order, even if you pick all N elements, you are not sure that the whole array will be sorted, because elements are moved and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.