I have a problem where I have a big list of number pairs, something like this:
(0, 1)
(10, 5)
(5, 6)
(8, 6)
(7, 5)
.....
I need to be able to do very fast lookups to check whether a pair exists in the list.
My first idea was to make a map container keyed on std::pair<int,int> and do searches using container.find().
My second idea was to make a vector<vector<int>> container, where I can check whether the pair exists using std::find(container[id1].begin(), container[id1].end(), id2);
The second way is a bit faster than the first, but I need a more efficient way, if that is possible.
So the question is: is there a more efficient way to find whether a number pair exists in the list?
I know the number of pairs when the program starts, so I don't care much about pair insertion/deletion; I just need very fast searches.
If you do not care about insertion, you could use a sorted std::vector with std::binary_search or std::lower_bound:
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

int main()
{
    using namespace std;

    vector<pair<int, int>> pairs;
    pairs.push_back(make_pair(1, 1));
    pairs.push_back(make_pair(3, 1));
    pairs.push_back(make_pair(3, 2));
    pairs.push_back(make_pair(4, 1));

    // lexicographic comparison: by first element, then by second
    auto compare = [](const pair<int, int>& lh, const pair<int, int>& rh)
    {
        return lh.first != rh.first ?
            lh.first < rh.first : lh.second < rh.second;
    };

    sort(begin(pairs), end(pairs), compare);

    auto lookup = make_pair(3, 1);
    bool has31 = binary_search(begin(pairs), end(pairs), lookup, compare);
    auto iter31 = lower_bound(begin(pairs), end(pairs), lookup, compare);
    if (iter31 != end(pairs) && *iter31 == lookup)
        cout << iter31->first << "; " << iter31->second << " at position "
             << distance(begin(pairs), iter31);
}
If you want faster-than-set lookups (faster than O(log n)) and don't care about items being in a random order, then a hash table is the way to go. A hash_set is not part of the old standard, but one is available with most compilers, and C++11 standardizes it as std::unordered_set.
If you want really fast searches, you can try a Bloom filter. However, it sometimes produces false positives (i.e. reporting that a pair is present when it is not), so a hit must either be tolerable or double-checked against the real data; in exchange, the filter itself is compact (one bit per slot below). A minimal single-hash version would be:
#include <utility>
#include <vector>

const int MAX_HASH = 23879519; // note it's prime; should be 2-5 times larger than the number of your pairs

std::vector<bool> Bloom(MAX_HASH); // vector<bool> compresses bools into bits

// multiply one value by a large-ish prime, add the second, take the result
// modulo another prime, and use that as the key into the filter
// (assumes non-negative inputs, as in the question)
int hash(long long a, long long b) {
    return (a * 15485863LL + b) % MAX_HASH;
}

// constant-time addition
void add_item(std::pair<int,int> p) {
    Bloom[hash(p.first, p.second)] = true;
}

// constant-time check
bool is_in_set(std::pair<int,int> p) {
    return Bloom[hash(p.first, p.second)];
}
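A usage sketch under the same assumptions (pairs stands in for the question's list):

// build the filter once at startup, then query in constant time
for (const std::pair<int,int> &p : pairs) add_item(p);
bool maybe = is_in_set(std::make_pair(10, 5)); // true can still be a false positive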
std::set is probably the way to go, and it should perform reasonably well even as the number of elements increases (whereas std::vector slows down quite quickly unless you sort it beforehand and do some sort of binary or tree search). Keep in mind that std::set needs a < operator for the key; for std::pair a suitable one is already provided.
If you can use C++0x, std::unordered_set might be worth a try as well, particularly if you don't care about order; pre-C++0x you'll find unordered_set in Boost. It doesn't require a < operator to be defined. If you make your unordered_set an appropriate size and define your own simple hash function that doesn't produce many collisions, it might be faster than even a binary search on a sorted vector.
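For instance, a sketch of such a hash (the combining step is an assumption modeled on boost::hash_combine; the standard provides no hash for pairs):

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <utility>

// combine the two ints into one hash; the magic constant and shifts
// follow the boost::hash_combine recipe
struct PairHash {
    std::size_t operator()(const std::pair<int, int> &p) const {
        std::size_t h = std::hash<int>()(p.first);
        h ^= std::hash<int>()(p.second) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

int main() {
    std::unordered_set<std::pair<int, int>, PairHash> pairs;
    pairs.reserve(1000000);                // size it up front, since N is known at startup
    pairs.insert({10, 5});
    bool found = pairs.count({10, 5}) > 0; // average O(1) lookup
    (void)found;
}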
You could use an implementation of hash_set to make it faster,
for instance boost::unordered_set with std::pair as the key (boost::hash already knows how to hash a std::pair).
This is the fastest of the easy approaches.
Here's another solution if your individual numbers are ints.
Construct a long long from the two ints (the first int as the high 32 bits and the second int as the low 32 bits).
Insert this into an unordered_set (or set, or sorted vector; profile to find your match).
Call find on it.
This should be some percentage faster than working with pairs/tuples, etc.
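A minimal sketch of the packing idea (assuming the numbers fit in 32 bits; casting through uint32_t keeps negative values well-defined):

#include <cstdint>
#include <unordered_set>

// pack two 32-bit ints into a single 64-bit key: first in the high word,
// second in the low word
inline std::uint64_t pack(std::int32_t a, std::int32_t b) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(a)) << 32)
         | static_cast<std::uint32_t>(b);
}

int main() {
    std::unordered_set<std::uint64_t> pairs;
    pairs.insert(pack(10, 5));
    bool found = pairs.count(pack(10, 5)) > 0; // hashes one word instead of a pair
    (void)found;
}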
Why not sort the tuples by first element, then second? Then a binary search is O(log n).
I am trying to solve the programming problem firstDuplicate on CodeSignal. The problem is: "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has the minimal index."
Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3
There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.
With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?
#include <algorithm>
int firstDuplicate(vector<int> a) {
    vector<int> seen;
    for (size_t i = 0; i < a.size(); ++i){
        if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()){
            return a[i];
        }else{
            seen.push_back(a[i]);
        }
    }
    if (seen == a){
        return -1;
    }
}
Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be: use a hash table. In C++, the unordered_map and unordered_set classes exist for exactly these kinds of coding exercises. An unordered_set is effectively a map from keys to bools.
Also, pass your vector by reference, not by value. Passing by value incurs the overhead of copying the entire vector.
Also, that comparison at the end is costly and unnecessary.
This is probably closer to what you want:
#include <unordered_set>
#include <vector>
using std::vector;

int firstDuplicate(const vector<int>& a) {
    std::unordered_set<int> seen;
    for (int i : a) {
        auto result_pair = seen.insert(i);
        bool duplicate = (result_pair.second == false); // insert failed: value already present
        if (duplicate) {
            return i;
        }
    }
    return -1;
}
std::find has linear time complexity in the distance between the first and last elements of the container (or until the number is found), i.e. a worst case of O(N), so your algorithm is O(N²) overall.
Instead of storing your numbers in a vector and searching them every time, you should use something like std::map to store the numbers already encountered, and return a number as soon as it is found to be present in the map while iterating.
std::map<int, int> hash;
for (const auto &i : a) {
    if (hash[i])
        return i;
    else
        hash[i] = 1;
}
Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since its insertion is constant time on average, compared with logarithmic insertion for std::map.
It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 256^20. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.
In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.
Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.
int firstDuplicate(std::vector<int> const &input) {
    std::vector<bool> seen(input.size() + 1);

    for (auto i : input) {
        if (seen[i])
            return i;
        seen[i] = true;
    }
    return -1;
}
The advantage here is fairly simple: at least in a typical case, std::vector<bool> uses a specialization to store bools in only one bit apiece. This way we're storing only one bit for each possible input value, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.
Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.
We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...
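For illustration, a minimal sketch of such a hand-rolled bitmap (the names are mine, and whether it actually beats vector<bool> would need measuring):

#include <cstddef>
#include <cstdint>
#include <vector>

// flat bitmap over 64-bit words, avoiding vector<bool>'s proxy references
struct Bitmap {
    std::vector<std::uint64_t> words;
    explicit Bitmap(std::size_t bits) : words((bits + 63) / 64, 0) {}
    bool test(std::size_t i) const { return (words[i >> 6] >> (i & 63)) & 1; }
    void set(std::size_t i)        { words[i >> 6] |= std::uint64_t(1) << (i & 63); }
};

int firstDuplicate(const std::vector<int> &input) {
    Bitmap seen(input.size() + 1);
    for (int i : input) {
        if (seen.test(i))
            return i;
        seen.set(i);
    }
    return -1;
}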
In a "self-avoiding random walk" situation, I have a 2-dimensional vector with a configuration of step-coordinates. I want to be able to check if a certain site has been occupied, but the problem is that the axis can be zero, so checking if the fabs() of the coordinate is true (or that it has a value), won't work. Therefore, I've considered looping through the steps and checking if my coordinate equals another coordinate on all axis, and if it does, stepping back and trying again (a so-called depth-first approach).
Is there a more efficient way to do this? I've seen someone use a boolean array with all possible coordinates, like so:
bool occupied[nMax][nMax]; // true if lattice site is occupied
for (int y = -rMax; y <= rMax; y++)
    for (int x = -rMax; x <= rMax; x++)
        occupied[index(y)][index(x)] = false;
But, in my program the number of dimensions is unknown, so would an approach such as:
typedef std::vector<std::vector<long int>> WalkVec;
WalkVec walk(1, std::vector<long int>(dof, 0));

siteVisited = false; counter = 0;
while (counter < (walkVec.back().size() - 1))
{
    tdof = 1;
    while (tdof <= dimensions)
    {
        if (walkHist.back().at(tdof-1) == walkHist.at(counter).at(tdof-1) || walkHist.back().at(tdof-1) == 0)
        {
            siteVisited = true;
        }
        else
        {
            siteVisited = false;
            break;
        }
        tdof++;
    }
work, where dof is the number of dimensions? (The check for zero checks whether the position is the origin; three zero coordinates, or three visited coordinates on the same step, is the only way to make it true.)
Is there a more efficient way of doing it?
You can do this check in O(log n) or O(1) time using STL's set or unordered_set respectively. The unordered_set container requires you to write a custom hash function for your coordinates, while the set container only needs you to provide a comparison function. The set implementation is particularly easy, and logarithmic time should be fast enough:
#include <iostream>
#include <set>
#include <vector>
#include <cassert>

class Position {
public:
    Position(const std::vector<long int> &c)
        : m_coords(c) { }

    size_t dim() const { return m_coords.size(); }

    // lexicographic comparison, so Position can be a std::set key
    bool operator <(const Position &b) const {
        assert(b.dim() == dim());
        for (size_t i = 0; i < dim(); ++i) {
            if (m_coords[i] < b.m_coords[i])
                return true;
            if (m_coords[i] > b.m_coords[i])
                return false;
        }
        return false;
    }

private:
    std::vector<long int> m_coords;
};

int main(int argc, const char *argv[])
{
    std::set<Position> visited;
    std::vector<long int> coords(3, 0);
    visited.insert(Position(coords));

    while (true) {
        std::cout << "x, y, z: ";
        std::cin >> coords[0] >> coords[1] >> coords[2];

        Position candidate(coords);
        if (visited.find(candidate) != visited.end())
            std::cout << "Already visited!" << std::endl;
        else
            visited.insert(candidate);
    }

    return 0;
}
Of course, as iavr mentions, any of these approaches will require O(n) storage.
Edit: The basic idea here is very simple. The goal is to store all the visited locations in a way that allows you to quickly check if a particular location has been visited. Your solution had to scan through all the visited locations to do this check, which makes it O(n), where n is the number of visited locations. To do this faster, you need a way to rule out most of the visited locations so you don't have to compare against them at all.
You can understand my set-based solution by thinking of a binary search on a sorted array. First you come up with a way to compare (sort) the D-dimensional locations. That's what the Position class' < operator is doing. As iavr pointed out in the comments, this is basically just a lexicographic comparison. Then, when all the visited locations are sorted in this order, you can run a binary search to check if the candidate point has been visited: you recursively check if the candidate would be found in the upper or lower half of the list, eliminating half of the remaining list from comparison at each step. This halving of the search domain at each step gives you logarithmic complexity, O(log n).
The STL set container is just a nice data structure that keeps your elements in sorted order as you insert and remove them, ensuring insertion, removal, and queries are all fast. In case you're curious, the STL implementation I use uses a red-black tree to implement this data structure, but from your perspective this is irrelevant; all that matters is that, once you give it a way to compare elements (the < operator), inserting elements into the collection (set::insert) and asking if an element is in the collection (set::find) are O(log n). I check against the origin by just adding it to the visited set--no reason to treat it specially.
The unordered_set is a hash table, an asymptotically more efficient data structure (O(1)), but a harder one to use because you must write a good hash function. Also, for your application, going from O(n) to O(log n) should be plenty good enough.
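If you do want to try the hash-table route, here is a sketch of one possible hash for the coordinate vector (the combining step is an assumption borrowed from the boost::hash_combine recipe, not something this answer prescribes):

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// hash a coordinate vector by folding its elements together
struct CoordHash {
    std::size_t operator()(const std::vector<long int> &c) const {
        std::size_t seed = c.size();
        for (long int v : c)
            seed ^= std::hash<long int>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// usage: std::unordered_set<std::vector<long int>, CoordHash> visited;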
Your question concerns the algorithm rather than the use of the (C++) language, so here is a generic answer.
What you need is a data structure to store a set (of point coordinates) with an efficient operation to query whether a new point is in the set or not.
Explicitly storing the set as a boolean array provides constant-time query (fastest), but at space that is exponential in the number of dimensions.
An exhaustive search (your second option) provides queries that are linear in the set size (walk length), at a space that is also linear in the set size and independent of dimensionality.
The other two common options are tree structures and hash tables, e.g. available as std::set (typically using a red-black tree) and std::unordered_set (the latter only in C++11). A tree structure typically has logarithmic-time query, while a hash table query can be constant-time in practice, almost bringing you back to the complexity of a boolean array. But in both cases the space needed is again linear in the set size and independent of dimensionality.
I am trying to solve the following problem: Numbers are being inserted into a container. Each time a number is inserted I need to know how many elements are in the container that are greater than or equal to the current number being inserted. I believe both operations can be done in logarithmic complexity.
My question:
Are there standard containers in a C++ library that can solve the problem?
I know that std::multiset can insert elements in logarithmic time, but how can you query it for the count? Or should I implement a data structure (e.g. a binary search tree) to solve it?
Great question. I do not think there is anything in the STL which suits your needs (provided you MUST have logarithmic time). I think the best solution then, as aschepler says in the comments, is to implement an RB tree with rank information. You may have a look at the STL source code, particularly stl_tree.h, to see whether you could reuse bits of it.
Better still, look at Rank Tree in C++, which contains a link to an implementation:
http://code.google.com/p/options/downloads/list
You should use a multiset for logarithmic-complexity insertion, yes. But computing the distance is the problem: set/map iterators are Bidirectional, not RandomAccess, so std::distance has O(n) complexity on them:

multiset<int> my_set;
...
auto it = my_set.lower_bound(3);
size_t count_inserted = distance(it, my_set.end()); // this is definitely O(n)
my_set.insert(3);
Your complexity issue is more subtle than it looks. Here is a full analysis:
If you want O(log n) complexity for each insertion, you need a sorted structure such as a set. If you also want the structure not to reallocate or move items when adding a new item, the distance computation to the insertion point will be O(n). And if you know the input size in advance, you do not need logarithmic insertion into a sorted container at all: you can insert all the items and then sort, which is O(n log n), exactly the same as n insertions at O(log n) each.
The only alternative is a dedicated container such as a weighted (order-statistics) RB-tree. Depending on your problem this may be the solution, or something really overkill.
Use multiset and distance, and you are O(n log n) on insertion (n insertions at log n each) but O(n²) on the distance computations, although each individual distance computation runs fast in practice.
If you know the inserted data size n in advance: use a vector, fill it, sort it, then return your distances; you are O(n log n) overall, and it is easy to code.
If you do not know n in advance, your n is likely huge and each item is memory-heavy, so you cannot afford O(n log n) with reallocation: then, if you have time to re-encode or reuse some non-standard code and really have to meet these complexity expectations, use a dedicated container. Also consider using a database; you will probably have issues maintaining all of this in memory.
Here's a quick way using Policy-Based Data Structures in C++:
There exists something called an Ordered Set, which lets you insert/remove elements in O(log N) time (and offers pretty much all other functions that std::set has to offer). It also gives two more features: find the K-th element, and find the rank of element X. The problem is that this doesn't allow duplicates :(
No worries though! We will map duplicates to a separate index/priority and define a new structure (call it an Ordered Multiset)! I've attached my implementation below for reference.
Finally, every time you want to find the number of elements greater than some x, call the function upper_bound (the number of elements less than or equal to x) and subtract this number from the size of your Ordered Multiset!
Note: PBDS trees use a lot of memory, so that is a constraint; if memory is tight, I'd suggest a hand-written balanced BST or a Fenwick tree instead.
#include <bits/stdc++.h>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
using namespace std;
using namespace __gnu_pbds;

struct ordered_multiset { // multiset supporting duplicate values
    int len = 0;
    static constexpr long long ADD = 1000010;       // max copies of any one value
    static constexpr long long MAXVAL = 1000000010; // shift so keys are non-negative
    unordered_map<long long, int> mp;               // value -> number of copies
    // keys are value*ADD + copy_index; long long keys avoid int overflow
    tree<long long, null_type, less<long long>, rb_tree_tag,
         tree_order_statistics_node_update> T;

    ordered_multiset() { len = 0; T.clear(), mp.clear(); }

    inline void insert(int x) {
        len++;
        long long v = (long long)x + MAXVAL;
        int c = mp[v]++;
        T.insert(v * ADD + c);
    }
    inline void erase(int x) {
        long long v = (long long)x + MAXVAL;
        int c = mp[v];
        if (c) {
            c--, mp[v]--, len--;
            T.erase(v * ADD + c);
        }
    }
    inline int kth(int k) {              // 1-based index; returns the k-th
        if (k < 1 || k > len) return -1; // smallest element, -1 if none exists
        auto it = T.find_by_order(--k);
        return (int)((*it) / ADD - MAXVAL);
    }
    inline int lower_bound(int x) {      // count of values < x
        long long v = (long long)x + MAXVAL;
        return (int)T.order_of_key(v * ADD);
    }
    inline int upper_bound(int x) {      // count of values <= x
        long long v = (long long)x + MAXVAL;
        int c = mp[v];
        return (int)T.order_of_key(v * ADD + c);
    }
    inline int size() { return len; }    // number of elements
};
Usage:
ordered_multiset s;
for (int i = 0; i < n; i++) {
    int x; cin >> x;
    s.insert(x);
    int ctr = s.size() - s.upper_bound(x);
    cout << ctr << " ";
}
Input (n = 5) : 10 1 3 3 2
Output : 0 1 1 1 3
Time Complexity : O(log n) per query/insert
References : mochow13's GitHub
Sounds like a case for count_if, although I admit this doesn't solve it at logarithmic complexity; that would require a sorted type.
vector<int> v = { 1, 2, 3, 4, 5 };
int some_value = 3;
int count = count_if(v.begin(), v.end(), [some_value](int n) { return n >= some_value; });
Edit done to fix syntactic problems with lambda function
If the whole range of numbers is sufficiently small (on the order of a few million), this problem can be solved relatively easily using a Fenwick tree.
Although Fenwick trees are not part of the STL, they are both very easy to implement and time efficient. The time complexity is O(log N) for both updates and queries and the constant factors are low.
You mention in a comment on another question, that you needed this for a contest. Fenwick trees are very popular tools in competitive programming and are often useful.
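Since the answer says Fenwick trees are easy to implement, here is a minimal sketch (the bound MAX and the demo input are my own assumptions; note this version counts elements greater than or equal to each inserted x, as the question asks):

#include <cstdio>
#include <vector>

// Fenwick (binary indexed) tree over values 1..n
struct Fenwick {
    std::vector<int> bit; // 1-based
    explicit Fenwick(int n) : bit(n + 1, 0) {}
    void add(int i) {                 // record one occurrence of value i
        for (; i < (int)bit.size(); i += i & -i) ++bit[i];
    }
    int prefix(int i) const {         // how many recorded values are <= i
        int s = 0;
        for (; i > 0; i -= i & -i) s += bit[i];
        return s;
    }
    int total() const { return prefix((int)bit.size() - 1); }
};

int main() {
    const int MAX = 1000000;          // assumed value range
    Fenwick f(MAX);
    const int xs[] = {10, 1, 3, 3, 2};
    for (int x : xs) {
        int ge = f.total() - f.prefix(x - 1); // elements already inserted that are >= x
        std::printf("%d ", ge);
        f.add(x);
    }                                 // prints: 0 1 1 2 3
}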
I'm looking for a data structure (array-like) that allows fast (faster than O(N)) arbitrary insertion of values into the structure. The data structure must be able to print out its elements in the way they were inserted. This is similar to something like List.Insert() (which is too slow as it has to shift every element over), except I don't need random access or deletion. Insertion will always be within the size of the 'array'. All values are unique. No other operations are needed.
For example, if Insert(x, i) inserts value x at index i (0-indexing). Then:
Insert(1, 0) gives {1}
Insert(3, 1) gives {1,3}
Insert(2, 1) gives {1,2,3}
Insert(5, 0) gives {5,1,2,3}
And it'll need to be able to print out {5,1,2,3} at the end.
I am using C++.
Use a skip list. Another option is a tiered vector. The skip list performs inserts in O(log n) and keeps the numbers in order; the tiered vector supports insert in O(sqrt(n)) and can likewise print the elements in order.
EDIT: per amit's comment, I will explain how you find the k-th element in a skip list:
Each element has a tower of links to later elements, and for each link you know how many elements it jumps over. So to look for the k-th element, you start at the head of the list and go down the tower until you find a link that jumps over no more than k elements. You move to the node pointed to by that link and decrease k by the number of elements you jumped over. Continue doing that until k = 0.
Did you consider using std::map or std::vector?
You could use a std::map with the rank of insertion as the key. And vector has a reserve member function.
You can use an std::map mapping (index, insertion-time) pairs to values, where insertion-time is an "autoincrement" integer (in SQL terms). The ordering on the pairs should be:
(i, t) < (i*, t*)  iff  i < i* or t > t*
In code:
#include <cstddef>
#include <map>
#include <utility>

struct lt {
    bool operator()(std::pair<std::size_t, std::size_t> const &x,
                    std::pair<std::size_t, std::size_t> const &y)
    {
        return x.first < y.first || x.second > y.second;
    }
};

typedef std::map<std::pair<std::size_t, std::size_t>, int, lt> array_like;

void insert(array_like &a, int value, std::size_t i)
{
    a[std::make_pair(i, a.size())] = value;
}
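A quick check (my own addition) that this reproduces the question's example; the insertion-time trick works because a.size() strictly increases:

#include <iostream>

int main() {
    array_like a;
    insert(a, 1, 0); // {1}
    insert(a, 3, 1); // {1,3}
    insert(a, 2, 1); // {1,2,3}
    insert(a, 5, 0); // {5,1,2,3}
    for (const auto &kv : a)
        std::cout << kv.second << ' '; // prints: 5 1 2 3
}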
Regarding your comment:
"List.Insert() (which is too slow as it has to shift every element over)"
Linked lists don't shift their values; they iterate over them to find the location where you want to insert. Be careful what you say; this can be confusing to newbies like me.
A solution that's included with GCC by default is the rope data structure (see the <ext/rope> documentation). Typically, ropes come to mind when working with long strings of characters. Here we have ints instead of characters, but it works the same: just use int as the template parameter. (It could also hold pairs, etc.)
Here's the description of rope on Wikipedia.
Basically, it's a binary tree that maintains how many elements are in the left and right subtrees (or equivalent information, which is what's referred to as order statistics), and these counts are updated appropriately as subtrees are rotated when elements are inserted and removed. This allows O(lg n) operations.
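A sketch of the rope idea (this relies on GCC's non-standard <ext/rope> extension; the single-element insert shown follows the SGI rope interface):

#include <ext/rope> // GCC extension, not standard C++
#include <iostream>

int main() {
    __gnu_cxx::rope<int> r;
    r.insert(0, 1); // {1}
    r.insert(1, 3); // {1,3}
    r.insert(1, 2); // {1,2,3}
    r.insert(0, 5); // {5,1,2,3}
    for (int x : r)
        std::cout << x << ' '; // prints: 5 1 2 3
}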
There's this data structure which pushes insertion time down from O(N) to O(sqrt(N)), but I'm not that impressed. I feel one should be able to do better, but I'll have to work at it a bit.
In C++ you can just use a map of vectors, like so:
#include <iostream>
#include <map>
#include <vector>
using namespace std;

int main() {
    map<int, vector<int> > data;

    data[0].push_back(1);
    data[1].push_back(3);
    data[1].push_back(2);
    data[0].push_back(5);

    map<int, vector<int> >::iterator it;
    for (it = data.begin(); it != data.end(); it++) {
        vector<int> v = it->second;
        for (int i = v.size() - 1; i >= 0; i--) {
            cout << v[i] << ' ';
        }
    }
    cout << '\n';
}
This prints:
5 1 2 3
Just like you want, and inserts are O(log n).
I was asked an interview question to find the number of distinct absolute values among the elements of an array. I came up with the following solution (in C++), but the interviewer was not happy with the code's run-time efficiency.
I would appreciate pointers as to how I can improve the run-time efficiency of this code.
Also, how do I calculate the efficiency of the code below? The for loop executes A.size() times. However, I am not sure about the efficiency of STL std::find (in the worst case it could be O(n), making this code O(n²)?).
Code is:
int countAbsoluteDistinct(const std::vector<int> &A) {
    using namespace std;
    list<int> x;
    vector<int>::const_iterator it;
    for (it = A.begin(); it < A.end(); it++)
        if (find(x.begin(), x.end(), abs(*it)) == x.end())
            x.push_back(abs(*it));
    return x.size();
}
To propose an alternative to the set-based code:
Note that we don't want to alter the caller's vector, so we take it by value; it's better to let the compiler copy for us than to make our own. If it's OK to destroy their values, we can take the vector by non-const reference instead.
#include <vector>
#include <algorithm>
#include <iterator>
#include <cstdlib>

using namespace std;

int count_distinct_abs(vector<int> v)
{
    // a lambda avoids the ambiguity between the abs overloads
    transform(v.begin(), v.end(), v.begin(),
              [](int x) { return abs(x); }); // O(n) where n = distance(v.begin(), v.end())

    sort(v.begin(), v.end()); // typically O(n log n); if your implementation's sort has an
                              // O(n^2) worst case, make_heap + sort_heap guarantees O(n log n)

    // unique takes a sorted range and moves things around to get the duplicated
    // items to the back; it returns an iterator to the end of the unique section of the range
    auto unique_end = unique(v.begin(), v.end()); // again n comparisons
    return distance(v.begin(), unique_end);       // constant time for random access iterators (like vector's)
}
The advantage here is that we only allocate/copy once if we decide to take by value, and the rest is all done in-place while still giving you an average complexity of O(n log n) on the size of v.
std::find() is linear (O(n)). I'd use a sorted associative container to handle this, specifically std::set.
#include <vector>
#include <set>
#include <cstdlib>

using namespace std;

int distinct_abs(const vector<int> &v)
{
    std::set<int> distinct_container;

    for (auto curr_int = v.begin(), end = v.end(); // no need to call v.end() multiple times
         curr_int != end;
         ++curr_int)
    {
        // std::set only allows single entries;
        // since that is what we want, we don't care that insertion fails
        // when the second (or further) copy of a value is attempted.
        distinct_container.insert(abs(*curr_int));
    }

    return distinct_container.size();
}
There is still some runtime penalty with this approach: using a separate container incurs the cost of dynamic allocations as the container size increases. You could do this in place and not incur this penalty; however, with code at this level it's sometimes better to be clear and explicit and let the optimizer (in the compiler) do its work.
Yes, this will be O(N²) -- you'll end up with a linear search for each element.
A couple of reasonably obvious alternatives would be to use an std::set or std::unordered_set. If you don't have C++0x, you can replace std::unordered_set with tr1::unordered_set or boost::unordered_set.
Each insertion in an std::set is O(log N), so your overall complexity is O(N log N).
With unordered_set, each insertion has constant (expected) complexity, giving linear complexity overall.
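A sketch of the unordered_set variant (expected O(N) overall):

#include <cstdlib>
#include <unordered_set>
#include <vector>

int count_distinct_abs(const std::vector<int> &v)
{
    std::unordered_set<int> seen;
    for (int x : v)
        seen.insert(std::abs(x)); // duplicates are rejected automatically
    return (int)seen.size();
}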
Basically, replace your std::list with a std::set. This gives you O(log(set.size())) searches plus O(1) insertions, if you do things properly. Also, for efficiency, it makes sense to cache the result of abs(*it), although this will have only a minimal (negligible) effect. The efficiency of this method is about as good as you can get without using a really nice hash (std::set uses binary trees) or more information about the values in the vector.
Since I was not happy with the previous answers, here is mine today. Your initial question does not mention how big your vector is. Suppose your std::vector<> is extremely large and has very few duplicates (why not?). This means that using another container (e.g. std::set<>) will basically duplicate your memory consumption. Why would you do that, when your goal is simply to count non-duplicates?
I like @Flame's answer, but I was not really happy with the call to std::unique. You've spent lots of time carefully sorting your vector, and then you simply discard the sorted array while you could be re-using it afterwards.
I could not find anything really elegant in the STD library, so here is my proposal (a mixture of std::transform + std::abs + std::sort, but without touching the sorted array afterwards).
// count the number of distinct values among the elements of a sorted range
// (apply std::abs beforehand if absolute values are wanted)
template<class ForwardIt>
typename std::iterator_traits<ForwardIt>::difference_type
count_unique(ForwardIt first, ForwardIt last)
{
    if (first == last)
        return 0;

    typename std::iterator_traits<ForwardIt>::difference_type
        count = 1;
    ForwardIt previous = first;
    while (++first != last) {
        if (!(*previous == *first)) ++count;
        ++previous;
    }
    return count;
}
A bonus point is that it works with forward iterators:
#include <iostream>
#include <iterator>
#include <list>

int main()
{
    std::list<int> nums {1, 3, 3, 3, 5, 5, 7, 8};
    std::cout << count_unique(std::begin(nums), std::end(nums)) << std::endl;

    const int array[] = {0, 0, 0, 1, 2, 3, 3, 3, 4, 4, 4, 4};
    const int n = sizeof array / sizeof *array;
    std::cout << count_unique(array, array + n) << std::endl;
    return 0;
}
Two points.
std::list is very bad for searching; each search is O(n).
Use std::set. Insert is logarithmic, it removes duplicates, and it keeps its elements sorted. Inserting every value is O(n log n); then use set::size to find how many distinct values there are.
EDIT:
To answer part 2 of your question, the C++ standard mandates the worst case for operations on containers and algorithms.
Find: since you are using the free-function version of find, which takes iterators, it cannot assume anything about the passed-in sequence (in particular, it cannot assume the range is sorted), so it must traverse every item until it finds a match, which is O(n).
If you use set::find, on the other hand, this member find can utilize the structure of the set, and its performance is required to be O(log N), where N is the size of the set.
To answer your second question first: yes, the code is O(n²), because the complexity of find is O(n).
You have options to improve it. If the range of numbers is small, you can just set up a large enough array and increment counts while iterating over the source data (a sketch of this follows below). If the range is larger but sparse, you can use a hash table of some sort to do the counting. Both of these options are linear in complexity.
Otherwise, I would do one iteration to take the abs value of each item, then sort them, and then do the aggregation in a single additional pass. The complexity here is n log(n) for the sort; the other passes don't matter for complexity.
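A sketch of the small-range counting option mentioned above (max_abs is an assumed bound on the absolute values; anything larger would need the hash-table variant instead):

#include <cstdlib>
#include <vector>

int count_distinct_abs_small_range(const std::vector<int> &v, int max_abs)
{
    std::vector<bool> seen(max_abs + 1, false); // one flag per possible value
    int count = 0;
    for (int x : v) {
        int a = std::abs(x);
        if (!seen[a]) {
            seen[a] = true;
            ++count;
        }
    }
    return count;
}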
I think a std::map could also be interesting:
#include <map>
#include <vector>
#include <cstdlib>

int absoluteDistinct(const std::vector<int> &A)
{
    std::map<int, char> my_map;

    for (std::vector<int>::const_iterator it = A.begin(); it != A.end(); it++)
    {
        my_map[std::abs(*it)] = 0;
    }

    return my_map.size();
}
As @Jerry said, to improve a little on the theme of most of the other answers: instead of using a std::map or std::set you could use a std::unordered_map or std::unordered_set (or the Boost equivalent).
This would reduce the runtime from O(n lg n) down to O(n).
Another possibility, depending on the range of the data given, you might be able to do a variant of a radix sort, though there's nothing in the question that immediately suggests this.
Sort the list with a radix-style sort for O(n)-ish efficiency, then compare adjacent values.
The best way is to customize the quicksort algorithm so that while partitioning, whenever we find two equal elements, we overwrite the second duplicate with the last element in the range and then shrink the range. This ensures you will not process duplicate elements twice. Also, after quicksort is done, the size of the remaining range is the answer.
Complexity is still O(n log n), BUT this should save at least two passes over the array.
Also, the savings are proportional to the percentage of duplicates. Imagine if they twist the original question with, say, "90% of the elements are duplicates"...
One more approach:
Space-efficient: use std::map. Insertion is O(log N) each, O(n log N) overall; just keep a count of the number of elements successfully inserted.
Time-efficient: use a hash table, O(n) overall for the inserts; again, just keep a count of the number of elements successfully inserted.
You have nested loops in your code. If you scan each element over the whole array, it gives you O(n²) time complexity, which is not acceptable in most scenarios. That is the reason merge sort and quicksort came up, to save processing cycles and machine effort. I suggest you go through the suggested links and redesign your program.