Efficient Data Structure for Insertion - c++

I'm looking for a data structure (array-like) that allows fast (faster than O(N)) arbitrary insertion of values into the structure. The data structure must be able to print out its elements in the way they were inserted. This is similar to something like List.Insert() (which is too slow as it has to shift every element over), except I don't need random access or deletion. Insertion will always be within the size of the 'array'. All values are unique. No other operations are needed.
For example, if Insert(x, i) inserts value x at index i (0-indexing). Then:
Insert(1, 0) gives {1}
Insert(3, 1) gives {1,3}
Insert(2, 1) gives {1,2,3}
Insert(5, 0) gives {5,1,2,3}
And it'll need to be able to print out {5,1,2,3} at the end.
I am using C++.

Use a skip list. Another option is a tiered vector. The skip list performs inserts in expected O(log n) time and keeps the elements in order. The tiered vector supports inserts in O(sqrt(n)) and can likewise print the elements in order.
EDIT: per amit's comment, here is how to find the k-th element in a skip list:
Each element has a tower of links to later elements, and for each link you know how many elements it jumps over. To find the k-th element, start at the head of the list and go down the tower until you find a link that jumps over no more than k elements. Follow that link to the node it points to and decrease k by the number of elements you jumped over. Continue until k = 0.

Did you consider using std::map or std::vector ?
You could use a std::map with the rank of insertion as key. And vector has a reserve member function.

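You can use a std::map mapping (index, insertion-time) pairs to values, where insertion-time is an "autoincrement" integer (in SQL terms). The ordering on the pairs should be
(i, t) < (i*, t*)
iff
i < i*, or i = i* and t > t*
The second clause has to require equal indices; otherwise the relation is not a strict weak ordering and std::map misbehaves. In code:
struct lt {
    bool operator()(std::pair<size_t, size_t> const &x,
                    std::pair<size_t, size_t> const &y) const
    {
        // Order by index first; among equal indices, the later insertion comes first.
        return x.first < y.first
            || (x.first == y.first && x.second > y.second);
    }
};
typedef std::map<std::pair<size_t, size_t>, int, lt> array_like;
void insert(array_like &a, int value, size_t i)
{
    // a.size() serves as the autoincrement insertion time.
    a[std::make_pair(i, a.size())] = value;
}
Note that this orders elements by the index they were given at insertion time and never shifts the stored indices of later elements. It reproduces the example in the question, but Insert(1, 0), Insert(2, 0), Insert(3, 1) yields {2,1,3} rather than {2,3,1}.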

Regarding your comment:
List.Insert() (which is too slow as it has to shift every element over),
Lists don't shift their values; they iterate over them to find the location where you want to insert. Be careful what you say, as this can be confusing to newbies like me.

A solution that's included with GCC by default is the rope data structure. Here is the documentation. Typically, ropes come to mind when working with long strings of characters. Here we have ints instead of characters, but it works the same. Just use int as the template parameter. (Could also be pairs, etc.)
Here's the description of rope on Wikipedia.
Basically, it's a binary tree that maintains how many elements are in the left and right subtrees (or equivalent information, which is what's referred to as order statistics), and these counts are updated appropriately as subtrees are rotated when elements are inserted and removed. This allows O(lg n) operations.
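The idea of a tree that tracks subtree sizes can be sketched without any GCC extension. The following is not the rope itself but an implicit treap, a different randomized tree that uses the same size bookkeeping; the names TreapNode, treap_split, treap_merge and PositionalList are made up for this illustration, and the fixed random seed is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>
#include <vector>

// Node of an implicit treap: the position of an element is implied by
// subtree sizes rather than stored as a key.
struct TreapNode {
    int value;
    unsigned priority;
    std::size_t count = 1;  // number of nodes in the subtree rooted here
    std::unique_ptr<TreapNode> left, right;
    TreapNode(int v, unsigned p) : value(v), priority(p) {}
};
using TreapPtr = std::unique_ptr<TreapNode>;

inline std::size_t count_of(const TreapPtr& t) { return t ? t->count : 0; }
inline void refresh(TreapPtr& t) {
    t->count = 1 + count_of(t->left) + count_of(t->right);
}

// Splits t into (the first k elements, the remaining elements).
inline std::pair<TreapPtr, TreapPtr> treap_split(TreapPtr t, std::size_t k) {
    if (!t) return std::pair<TreapPtr, TreapPtr>();
    if (count_of(t->left) >= k) {
        auto parts = treap_split(std::move(t->left), k);
        t->left = std::move(parts.second);
        refresh(t);
        return std::pair<TreapPtr, TreapPtr>(std::move(parts.first), std::move(t));
    }
    auto parts = treap_split(std::move(t->right), k - count_of(t->left) - 1);
    t->right = std::move(parts.first);
    refresh(t);
    return std::pair<TreapPtr, TreapPtr>(std::move(t), std::move(parts.second));
}

// Concatenates a and b, keeping every element of a before every element of b.
inline TreapPtr treap_merge(TreapPtr a, TreapPtr b) {
    if (!a) return b;
    if (!b) return a;
    if (a->priority > b->priority) {
        a->right = treap_merge(std::move(a->right), std::move(b));
        refresh(a);
        return a;
    }
    b->left = treap_merge(std::move(a), std::move(b->left));
    refresh(b);
    return b;
}

class PositionalList {
    TreapPtr root_;
    std::mt19937 rng_{12345};  // arbitrary fixed seed (assumption)
public:
    // Inserts value so that it ends up at index i, 0 <= i <= current size.
    void insert(int value, std::size_t i) {
        assert(i <= count_of(root_));
        auto parts = treap_split(std::move(root_), i);
        TreapPtr node(new TreapNode(value, static_cast<unsigned>(rng_())));
        root_ = treap_merge(treap_merge(std::move(parts.first), std::move(node)),
                            std::move(parts.second));
    }
    // In-order traversal gives the elements in positional order.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        collect(root_, out);
        return out;
    }
private:
    static void collect(const TreapPtr& t, std::vector<int>& out) {
        if (!t) return;
        collect(t->left, out);
        out.push_back(t->value);
        collect(t->right, out);
    }
};
```

Each insert performs one split and two merges, each of expected O(lg n) depth, and printing is a single in-order traversal.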

There's this data structure which pushes insertion time down from O(N) to O(sqrt(N)) but I'm not that impressed. I feel one should be able to do better but I'll have to work at it a bit.
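For a rough feel of how an O(sqrt(N)) bound can arise, here is a simplified sketch that keeps the sequence in blocks of bounded size. It is not the linked structure and not a full tiered vector; BlockedSequence and its max_block parameter are hypothetical names, and the bound only holds if max_block is chosen near sqrt(N).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sequence stored as consecutive blocks; an insert touches one block only.
class BlockedSequence {
    std::size_t max_block_;                  // split threshold (assumed tunable)
    std::vector<std::vector<int>> blocks_;
    std::size_t size_ = 0;
public:
    explicit BlockedSequence(std::size_t max_block = 512)
        : max_block_(max_block < 2 ? 2 : max_block) {}

    // Inserts value so that it ends up at index i, 0 <= i <= size().
    void insert(int value, std::size_t i) {
        assert(i <= size_);
        if (blocks_.empty()) blocks_.emplace_back();
        std::size_t b = 0;
        // Skip whole blocks until index i falls inside (or at the end of) block b.
        while (b + 1 < blocks_.size() && i > blocks_[b].size()) {
            i -= blocks_[b].size();
            ++b;
        }
        std::vector<int>& blk = blocks_[b];
        blk.insert(blk.begin() + static_cast<std::ptrdiff_t>(i), value);
        ++size_;
        if (blk.size() > max_block_) {
            // Split an oversized block into two halves.
            const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(blk.size() / 2);
            std::vector<int> tail(blk.begin() + half, blk.end());
            blk.resize(static_cast<std::size_t>(half));
            blocks_.insert(blocks_.begin() + static_cast<std::ptrdiff_t>(b) + 1,
                           std::move(tail));
        }
    }

    std::size_t size() const { return size_; }

    // Concatenating the blocks yields the elements in positional order.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        out.reserve(size_);
        for (const auto& blk : blocks_) out.insert(out.end(), blk.begin(), blk.end());
        return out;
    }
};
```

An insert walks at most N / max_block block headers and shifts at most max_block elements, which balances out at roughly sqrt(N) work when the two are about equal.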

In c++ you can just use a map of vectors, like so:
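#include <iostream>
#include <map>
#include <vector>
using namespace std;
int main() {
    // Each index maps to the values inserted at that index, oldest first.
    map<int, vector<int> > data;
    data[0].push_back(1);
    data[1].push_back(3);
    data[1].push_back(2);
    data[0].push_back(5);
    map<int, vector<int> >::iterator it;
    for (it = data.begin(); it != data.end(); it++) {
        const vector<int> &v = it->second;  // reference: avoid copying each bucket
        // Print newest first, since a later insert at the same index goes in front.
        for (int i = static_cast<int>(v.size()) - 1; i >= 0; i--) {
            cout << v[i] << ' ';
        }
    }
    cout << '\n';
}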
This prints:
5 1 2 3
Just like you want, and inserts are O(log n).

Related

Fast search and delete in a std::list of objects

I have a very large list of objects (nodes), and I want to be able to remove/delete elements of the list based on a set of values inside of them.
Preferably in constant time...
The objects (among other things) has values like:
long long int nodeID;
int depth;
int numberOfClusters;
double [] points;
double [][] clusters;
What I need to do is to look through the list, and check if there are any elements that has the same values in all fields except for nodeID.
Right now I'm doing something like this:
for(i = nodes.begin(); i != nodes.end(); i++)
{
for(j = nodes.begin(); j != nodes.end(); j++)
{
if(i != j)
{
if(compareNodes((*i), (*j)))
{
j = nodes.erase (j);
}
}
}
}
Where compareNodes() compares the values inside the two nodes. But this is wildly inefficient.
I'm using erase because that seems to be the only way to delete an element in the middle of a std::list.
Optimally, I would like to be able to find an element based on these values, and remove it from the list if it exists.
I am thinking of some sort of hash map to find the element (a pointer to the element) in constant time, but even if I can do that, I can't find a way to remove the element without iterating through the list.
It seems that I have to use erase, but that requires iterating through the list, which means linear complexity in the list size.
There is also remove_if, but again it has the same problem: linear complexity in the list size.
Is there no way to remove an element from a std::list without iterating through the whole list?
First off, you can speed up your existing solution by starting j at std::next(i) instead of nodes.begin() (assuming your compareNodes function is symmetric, i.e. the argument order does not matter).
Second, the hashmap approach sounds viable. But why keep a pointer to the element as a value in the map, when you can keep an iterator? They're both "a thing which references the element," but you can use the iterator to erase the element. And std::list iterators don't invalidate when the list is modified (they're most probably just pointers under the hood).
Thirdly, if you want to encapsulate/automate the lookup & sequential access, you can look into Boost.Multi-index to build a container with both sequential and hashed access.
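A minimal sketch of the second suggestion, a hash map whose values are list iterators, might look like this. The Node struct is a cut-down stand-in for the question's type, and content_key and NodeStore are hypothetical names; a real key would have to cover points and clusters as well.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Simplified stand-in for the question's node type (assumption).
struct Node {
    long long nodeID;
    int depth;
    int numberOfClusters;
};

// Hypothetical key built from every compared field, i.e. everything but nodeID.
inline std::string content_key(const Node& n) {
    return std::to_string(n.depth) + ":" + std::to_string(n.numberOfClusters);
}

class NodeStore {
    std::list<Node> nodes_;
    // Maps a content key to the iterator of the node holding that content.
    std::unordered_map<std::string, std::list<Node>::iterator> index_;
public:
    // Returns false (and stores nothing) if an equivalent node already exists.
    bool add(const Node& n) {
        const std::string key = content_key(n);
        if (index_.count(key) != 0) return false;
        nodes_.push_back(n);
        index_[key] = std::prev(nodes_.end());
        return true;
    }
    // Erases the node with the same content, if any, without scanning the list.
    bool remove(const Node& probe) {
        auto it = index_.find(content_key(probe));
        if (it == index_.end()) return false;
        nodes_.erase(it->second);  // erase by iterator: no traversal needed
        index_.erase(it);
        return true;
    }
    std::size_t size() const { return nodes_.size(); }
};
```

Because std::list iterators stay valid when other elements are inserted or erased, the stored iterators remain usable until their own element is removed.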

How to insert to a vector to ensure it remains sorted? [duplicate]

ALL,
This question is a continuation of this one.
I think the STL is missing this functionality, but that's just my opinion.
Now, to the question.
Consider the following code:
class Foo
{
public:
Foo();
int paramA, paramB;
std::string name;
};
struct Sorter
{
bool operator()(const Foo &foo1, const Foo &foo2) const
{
switch( paramSorter )
{
case 1:
return foo1.paramA < foo2.paramA;
case 2:
return foo1.paramB < foo2.paramB;
default:
return foo1.name < foo2.name;
}
}
int paramSorter;
};
int main()
{
std::vector<Foo> foo;
Sorter sorter;
sorter.paramSorter = 0;
// fill the vector
std::sort( foo.begin(), foo.end(), sorter );
}
At any given moment the vector can be re-sorted.
The class also has getter methods, which are used in the sorter structure.
What would be the most efficient way to insert a new element into the vector?
The situation I have is:
I have a grid (spreadsheet) that uses the sorted vector of a class. At any given time the vector can be re-sorted and the grid will display the sorted data accordingly.
Now I need to insert a new element into the vector/grid.
I can insert, then re-sort and then re-display the whole grid, but this is very inefficient, especially for a big grid.
Any help would be appreciated.
The simple answer to the question:
template< typename T >
typename std::vector<T>::iterator
insert_sorted( std::vector<T> & vec, T const& item )
{
return vec.insert
(
std::upper_bound( vec.begin(), vec.end(), item ),
item
);
}
Version with a predicate.
template< typename T, typename Pred >
typename std::vector<T>::iterator
insert_sorted( std::vector<T> & vec, T const& item, Pred pred )
{
return vec.insert
(
std::upper_bound( vec.begin(), vec.end(), item, pred ),
item
);
}
Where Pred is a strictly-ordered predicate on type T.
For this to work the input vector must already be sorted on this predicate.
The complexity of doing this is O(log N) for the upper_bound search (finding where to insert) but up to O(N) for the insert itself.
For a better complexity you could use std::set<T> if there are not going to be any duplicates or std::multiset<T> if there may be duplicates. These will retain a sorted order for you automatically and you can specify your own predicate on these too.
There are various other things you could do which are more complex, e.g. manage a vector and a set / multiset / sorted vector of newly added items then merge these in when there are enough of them. Any kind of iterating through your collection will need to run through both collections.
Using a second vector has the advantage of keeping your data compact. Here your "newly added" items vector will be relatively small, so the insertion time will be O(M), where M is the size of this vector, which might be more feasible than the O(N) of inserting into the big vector every time. The merge would be O(N+M), which is better than the O(NM) of inserting one at a time, so in total it would be O(N+M) + O(M²) to insert M elements and then merge.
You would probably keep the insertion vector at its capacity too, so as you grow that you will not be doing any reallocations, just moving of elements.
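The two-vector idea could be sketched as follows. This is only an illustration under assumptions: elements are plain ints compared with <, and BufferedSortedVector with its max_pending threshold is a made-up name, not something from the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

class BufferedSortedVector {
    std::vector<int> main_;     // large and sorted
    std::vector<int> pending_;  // small and sorted: recently added items
    std::size_t max_pending_;   // merge threshold (assumed tunable)
public:
    explicit BufferedSortedVector(std::size_t max_pending = 64)
        : max_pending_(max_pending) {
        pending_.reserve(max_pending_);  // avoid reallocations while buffering
    }

    // O(log M) search plus O(M) shifting inside the small buffer only.
    void insert(int item) {
        pending_.insert(std::upper_bound(pending_.begin(), pending_.end(), item), item);
        if (pending_.size() >= max_pending_) flush();
    }

    // Merges the buffer into the big vector in one pass.
    void flush() {
        const std::ptrdiff_t old_size = static_cast<std::ptrdiff_t>(main_.size());
        main_.insert(main_.end(), pending_.begin(), pending_.end());
        std::inplace_merge(main_.begin(), main_.begin() + old_size, main_.end());
        pending_.clear();
    }

    // Iteration has to combine both collections.
    std::vector<int> sorted() const {
        std::vector<int> out;
        out.reserve(main_.size() + pending_.size());
        std::merge(main_.begin(), main_.end(), pending_.begin(), pending_.end(),
                   std::back_inserter(out));
        return out;
    }

    std::size_t pending_count() const { return pending_.size(); }
};
```

Lookups would likewise need a binary search in each of the two vectors.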
If you need to keep the vector sorted all the time, first you might consider whether using std::set or std::multiset won't simplify your code.
If you really need a sorted vector and want to quickly insert an element into it, but do not want to enforce a sorting criterion to be satisfied all the time, then you can first use std::lower_bound() to find the position in a sorted range where the element should be inserted in logarithmic time, then use the insert() member function of vector to insert the element at that position.
If performance is an issue, consider benchmarking std::list vs std::vector. For small items, std::vector is known to be faster because of a higher cache hit rate, but the insert() operation itself is computationally faster on lists (no need to move elements around).
Just a note, you can use upper_bound as well depending on your needs. upper_bound will assure new entries that are equivalent to others will appear at the end of their sequence, lower_bound will assure new entries equivalent to others will appear at the beginning of their sequence. Can be useful for certain implementations (maybe classes that can share a "position" but not all of their details!)
Both will assure you that the vector remains sorted according to < result of elements, although inserting into lower_bound will mean moving more elements.
Example:
insert 7 # lower_bound of { 5, 7, 7, 9 } => { 5, *7*, 7, 7, 9 }
insert 7 # upper_bound of { 5, 7, 7, 9 } => { 5, 7, 7, *7*, 9 }
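To make the difference visible, the sketch below tags each entry with a letter and compares only the numeric key. Entry, key_less, insert_first and insert_last are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Entry = std::pair<int, char>;  // (sort key, tag to tell equal keys apart)

// Compares by key only, so entries with the same key are equivalent.
inline bool key_less(const Entry& a, const Entry& b) { return a.first < b.first; }

// lower_bound: the new entry goes before existing equivalents.
inline void insert_first(std::vector<Entry>& v, const Entry& e) {
    v.insert(std::lower_bound(v.begin(), v.end(), e, key_less), e);
}

// upper_bound: the new entry goes after existing equivalents.
inline void insert_last(std::vector<Entry>& v, const Entry& e) {
    v.insert(std::upper_bound(v.begin(), v.end(), e, key_less), e);
}
```

With keys { 5, 7, 7, 9 }, insert_first places a new 7 at index 1 and insert_last places it at index 3; the vector stays sorted by key either way.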
Instead of inserting and then sorting, you should do a find and then an insert.
Keep the vector sorted (sort once). When you have to insert:
Find the first element that compares greater than the one you are going to insert.
Do an insert just before that position.
This way the vector stays sorted.
Here is an example of how it goes (positions are 1-based).
start {} empty vector
insert 1 -> find first greater returns end() = 1 -> insert at 1 -> {1}
insert 5 -> find first greater returns end() = 2 -> insert at 2 -> {1,5}
insert 3 -> find first greater returns 2 -> insert at 2 -> {1,3,5}
insert 4 -> find first greater returns 3 -> insert at 3 -> {1,3,4,5}
When you want to switch between sort orders, you can use multiple index data structures, each of which you keep in sorted order (probably some kind of balanced tree, like std::map, which maps sort keys to vector indices, or std::set to store pointers to your objects, but with different comparison functions).
Here's a library which does this: http://www.boost.org/doc/libs/1_53_0/libs/multi_index/doc/index.html
For every change (insert of new elements or update of keys) you must update all index data structures, or flag them as invalid.
This works if there are not "too many" sort orders and not "too many" updates of your data structure. Otherwise, bad luck: you have to re-sort every time you want to change the order.
In other words: The more indices you need (to speed up lookup operations), the more time you need for update operations. And every index needs memory, of course.
To keep the count of indices small, you could use some query engine which combines the indices of several fields to support more complex sort orders over several fields. Like an SQL query optimizer. But that may be overkill...
Example: If you have two fields, a and b, you can support 4 sort orders:
1. a
2. b
3. first a then b
4. first b then a
with 2 indices (3. and 4.).
With more fields, the possible combinations of sort orders gets big, fast. But you can still use an index which sorts "almost as you want it" and, during the query, sort the remaining fields you couldn't catch with that index, as needed. For sorted output of the whole data, this doesn't help much. But if you only want to lookup some elements, the first "narrowing down" can help much.
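A hand-rolled version of the multiple-index idea, without Boost, might look like the sketch below. Foo is reduced to a plain struct here, and FooTable, ByParamA and ByName are hypothetical names; it keeps one std::multiset of pointers per sort order and updates all of them on every insert.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Simplified version of the question's class (assumption: plain aggregate).
struct Foo {
    int paramA;
    int paramB;
    std::string name;
};

struct ByParamA {
    bool operator()(const Foo* x, const Foo* y) const { return x->paramA < y->paramA; }
};
struct ByName {
    bool operator()(const Foo* x, const Foo* y) const { return x->name < y->name; }
};

class FooTable {
    std::deque<Foo> rows_;  // push_back on a deque keeps existing element addresses valid
    std::multiset<const Foo*, ByParamA> by_a_;
    std::multiset<const Foo*, ByName> by_name_;
public:
    FooTable() = default;
    FooTable(const FooTable&) = delete;             // copies would share stale pointers
    FooTable& operator=(const FooTable&) = delete;

    // Every index has to be updated on each insert: O(log n) per index.
    void insert(const Foo& f) {
        rows_.push_back(f);
        const Foo* p = &rows_.back();
        by_a_.insert(p);
        by_name_.insert(p);
    }
    // Switching the sort order is just a matter of walking a different index.
    std::vector<std::string> names_by_param_a() const {
        std::vector<std::string> out;
        for (const Foo* p : by_a_) out.push_back(p->name);
        return out;
    }
    std::vector<std::string> names_by_name() const {
        std::vector<std::string> out;
        for (const Foo* p : by_name_) out.push_back(p->name);
        return out;
    }
};
```

The trade-off described above shows up directly: each additional sort order costs one more set, one more insert per new row, and extra memory.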
Here is one I wrote for simplicity. Horribly slow for large sets but fine for small sets. It sorts as values are added:
void InsertionSortByValue(vector<int> &vec, int val)
{
vector<int>::iterator it;
for (it = vec.begin(); it < vec.end(); it++)
{
if (val < *it)
{
vec.insert(it, val);
return;
}
}
vec.push_back(val);
}
int main()
{
vector<int> vec;
for (int i = 0; i < 20; i++)
InsertionSortByValue(vec, rand()%20);
}
Here is another I found on some website. It sorts by array:
void InsertionSortFromArray(vector<int> &vec)
{
int elem;
unsigned int i, j, k, index;
for (i = 1; i < vec.size(); i++)
{
elem = vec[i];
if (elem < vec[i-1])
{
for (j = 0; j <= i; j++)
{
if (elem < vec[j])
{
index = j;
for (k = i; k > j; k--)
vec[k] = vec[k-1];
break;
}
}
}
else
continue;
vec[index] = elem;
}
}
int main()
{
vector<int> vec;
for (int i = 0; i < 20; i++)
vec.push_back(rand()%20);
InsertionSortFromArray(vec);
}
Assuming you really want to use a vector, and the sort criterion or keys don't change (so the order of already inserted elements always stays the same):
Insert the element at the end, then move it toward the front one step at a time, until the preceding element isn't bigger.
It can't be done faster (regarding asymptotic complexity, or "big O notation"), because you must move all bigger elements. And that's the reason why the STL doesn't provide this: it's inefficient on vectors, and you shouldn't use them if you need it.
Edit: Another assumption: Comparing the elements is not much more expensive than moving them. See comments.
Edit 2: As my first assumption doesn't hold (you want to change the sort criterium), scrap this answer and see my other one: https://stackoverflow.com/a/15843955/1413374

How to verify if a vector has a value at a certain index

In a "self-avoiding random walk" situation, I have a 2-dimensional vector with a configuration of step-coordinates. I want to be able to check if a certain site has been occupied, but the problem is that the axis can be zero, so checking if the fabs() of the coordinate is true (or that it has a value), won't work. Therefore, I've considered looping through the steps and checking if my coordinate equals another coordinate on all axis, and if it does, stepping back and trying again (a so-called depth-first approach).
Is there a more efficient way to do this? I've seen someone use a boolean array with all possible coordinates, like so:
bool occupied[nMax][nMax]; // true if lattice site is occupied
for (int y = -rMax; y <= rMax; y++)
for (int x = -rMax; x <= rMax; x++)
occupied[index(y)][index(x)] = false;
But, in my program the number of dimensions is unknown, so would an approach such as:
typedef std::vector<std::vector<long int>> WalkVec;
WalkVec walk(1, std::vector<long int>(dof,0));
siteVisited = false; counter = 0;
while (counter < (walkVec.back().size()-1))
{
tdof = 1;
while (tdof <= dimensions)
{
if (walkHist.back().at(tdof-1) == walkHist.at(counter).at(tdof-1) || walkHist.back().at(tdof-1) == 0)
{
siteVisited = true;
}
else
{
siteVisited = false;
break;
}
tdof++;
}
work, where dof is the number of dimensions? (The check for zero checks if the position is the origin. Three zero coordinates, or three visited coordinates on the same step, is the only way to make it true.)
Is there a more efficient way of doing it?
You can do this check in O(log n) or O(1) time using STL's set or unordered_set respectively. The unordered_set container requires you to write a custom hash function for your coordinates, while the set container only needs you to provide a comparison function. The set implementation is particularly easy, and logarithmic time should be fast enough:
#include <iostream>
#include <set>
#include <vector>
#include <cassert>
class Position {
public:
Position(const std::vector<long int> &c)
: m_coords(c) { }
size_t dim() const { return m_coords.size(); }
bool operator <(const Position &b) const {
assert(b.dim() == dim());
for (size_t i = 0; i < dim(); ++i) {
if (m_coords[i] < b.m_coords[i])
return true;
if (m_coords[i] > b.m_coords[i])
return false;
}
return false;
}
private:
std::vector<long int> m_coords;
};
int main(int argc, const char *argv[])
{
std::set<Position> visited;
std::vector<long int> coords(3, 0);
visited.insert(Position(coords));
while (true) {
std::cout << "x, y, z: ";
std::cin >> coords[0] >> coords[1] >> coords[2];
Position candidate(coords);
if (visited.find(candidate) != visited.end())
std::cout << "Already visited!" << std::endl;
else
visited.insert(candidate);
}
return 0;
}
Of course, as iavr mentions, any of these approaches will require O(n) storage.
Edit: The basic idea here is very simple. The goal is to store all the visited locations in a way that allows you to quickly check if a particular location has been visited. Your solution had to scan through all the visited locations to do this check, which makes it O(n), where n is the number of visited locations. To do this faster, you need a way to rule out most of the visited locations so you don't have to compare against them at all.
You can understand my set-based solution by thinking of a binary search on a sorted array. First you come up with a way to compare (sort) the D-dimensional locations. That's what the Position class' < operator is doing. As iavr pointed out in the comments, this is basically just a lexicographic comparison. Then, when all the visited locations are sorted in this order, you can run a binary search to check if the candidate point has been visited: you recursively check if the candidate would be found in the upper or lower half of the list, eliminating half of the remaining list from comparison at each step. This halving of the search domain at each step gives you logarithmic complexity, O(log n).
The STL set container is just a nice data structure that keeps your elements in sorted order as you insert and remove them, ensuring insertion, removal, and queries are all fast. In case you're curious, the STL implementation I use uses a red-black tree to implement this data structure, but from your perspective this is irrelevant; all that matters is that, once you give it a way to compare elements (the < operator), inserting elements into the collection (set::insert) and asking if an element is in the collection (set::find) are O(log n). I check against the origin by just adding it to the visited set--no reason to treat it specially.
The unordered_set is a hash table, an asymptotically more efficient data structure (O(1)), but a harder one to use because you must write a good hash function. Also, for your application, going from O(n) to O(log n) should be plenty good enough.
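For completeness, a hash-table version could be sketched like this. The hash combiner is one common ad hoc choice rather than anything mandated by the standard, and CoordHash, VisitedSet and mark_visited are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Hash for a coordinate vector of any dimension.
struct CoordHash {
    std::size_t operator()(const std::vector<long>& coords) const {
        std::size_t h = 0;
        for (long c : coords) {
            // Mix each coordinate into the running hash (ad hoc combiner, an assumption).
            h ^= std::hash<long>()(c) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
        }
        return h;
    }
};

using VisitedSet = std::unordered_set<std::vector<long>, CoordHash>;

// Records the site and returns true if it was free; returns false if it was
// already occupied. Average cost is O(1) in the number of visited sites.
inline bool mark_visited(VisitedSet& visited, const std::vector<long>& site) {
    return visited.insert(site).second;
}
```

Equality uses the built-in operator== of std::vector, so only the hash has to be supplied.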
Your question concerns the algorithm rather than the use of the (C++) language, so here is a generic answer.
What you need is a data structure to store a set (of point coordinates) with an efficient operation to query whether a new point is in the set or not.
Explicitly storing the set as a boolean array provides constant-time query (fastest), but at space that is exponential in the number of dimensions.
An exhaustive search (your second option) provides queries that are linear in the set size (walk length), at a space that is also linear in the set size and independent of dimensionality.
The other two common options are tree structures and hash tables, e.g. available as std::set (typically using a red-black tree) and std::unordered_set (the latter only since C++11). A tree structure typically has logarithmic-time query, while a hash table query can be constant-time in practice, almost bringing you back to the complexity of a boolean array. But in both cases the space needed is again linear in the set size and independent of dimensionality.

Effective search of number pairs

I have a problem where I have a big list of number pairs, something like this:
(0, 1)
(10, 5)
(5, 6)
(8, 6)
(7, 5)
.....
I need to be able to check very quickly whether a pair exists in the list.
My first idea was to make a map< std::pair<int,int> > container and do searches using container.find().
My second idea was to make a vector<vector<int>> container, where I can check whether the pair exists by using std::find(container[id1].begin(),container[id1].end(),id2);
The second way is a bit faster than the first, but I need a more efficient way if possible.
So the question is: is there a more efficient way to find out whether a number pair exists in the list?
I know the number of pairs when the program starts, so I don't care much about pair insertion/deletion; I just need very fast searches.
If you do not care about insertion you could use a sorted std::vector and std::binary_search, or std::lower_bound.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>
int main()
{
    using namespace std;
    vector<pair<int, int>> pairs;
    pairs.push_back(make_pair(1, 1));
    pairs.push_back(make_pair(3, 1));
    pairs.push_back(make_pair(3, 2));
    pairs.push_back(make_pair(4, 1));
    // Order by first, then by second (the same order pair's own operator< gives).
    auto compare = [](const pair<int, int>& lh, const pair<int, int>& rh)
    {
        return lh.first != rh.first ?
            lh.first < rh.first : lh.second < rh.second;
    };
    sort(begin(pairs), end(pairs), compare);
    auto lookup = make_pair(3, 1);
    // binary_search only answers yes/no; lower_bound also gives the position.
    bool has31 = binary_search(begin(pairs), end(pairs), lookup, compare);
    cout << "contains (3, 1): " << has31 << '\n';
    auto iter31 = lower_bound(begin(pairs), end(pairs), lookup, compare);
    if (iter31 != end(pairs) && *iter31 == lookup)
        cout << iter31->first << "; " << iter31->second << " at position "
             << distance(begin(pairs), iter31) << '\n';
}
If you want faster-than-set lookups (faster than O(lg n)) and don't care about items being in a random order, then a hashtable is the way to go.
Before C++11 this was not part of the standard, but a hash_set was available in most compilers (the reference for it is here). Since C++11 the standard library provides std::unordered_set.
If you want really fast searches, you can try a Bloom filter. However, it sometimes gives false positives (i.e. reporting that a pair is present when it is not), and it needs a bit table several times larger than the number of pairs. A simple single-hash variant could look like this:
const int MAX_HASH = 23879519; // note it's prime; must be 2-5 times larger than number of your pairs
vector<bool> Bloom(MAX_HASH); // vector<bool> compresses bools into bits
// multiply one by a large-ish prime, add the second, return modulo another prime
// then use it as the key for the filter
int hash(long long a, long long b) {
    long long h = (a*15485863LL + b) % MAX_HASH;
    // % can yield a negative result for negative inputs; keep the index in range
    return static_cast<int>(h < 0 ? h + MAX_HASH : h);
}
// constant-time addition
void add_item(pair<int,int> p) {
Bloom[hash(p.first, p.second)] = true;
}
// constant-time check
bool is_in_set(pair<int,int> p) {
return Bloom[hash(p.first, p.second)];
}
std::set is probably the way to go, and it should perform reasonably well even if the number of elements increase (whereas the performance of std::vector will slow down quite quickly unless you sort it beforehand and do some sort of binary or tree search). Keep in mind you'll have to define a < operator to use std::set.
If you can use c++0x, std::unordered_set might be worth a try also, particularly if you don't care about order. You'll find unordered_set in Boost. This doesn't require a < operator to be defined. If you make your unordered_set an appropriate size and define your own simple hash function that does not produce many collisions it might be faster than even a binary search on a sorted vector.
You could use some implementation of hash_set to make it faster,
for instance boost::unordered_set where the key is the std::pair.
This is the fastest of the easy approaches.
Here's another solution iff your individual numbers are ints.
Construct a long long with the two ints (the first int could be the high 32 bits and the second int the lower 32 bits)
Insert this into an unordered_set (or set, or sorted vector - profile to find your match)
find.
Should be some percentage faster than working with pairs/tuples etc. esp.
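The packing step can be sketched as follows, assuming int is 32 bits wide; pack_pair and PairSet are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

static_assert(sizeof(int) == 4, "this sketch assumes 32-bit int");

// First int goes into the high 32 bits, second int into the low 32 bits.
// Casting through uint32_t keeps negative values from smearing sign bits.
inline std::uint64_t pack_pair(int a, int b) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(a)) << 32)
         | static_cast<std::uint64_t>(static_cast<std::uint32_t>(b));
}

class PairSet {
    std::unordered_set<std::uint64_t> keys_;
public:
    explicit PairSet(std::size_t expected_pairs = 0) {
        keys_.reserve(expected_pairs);  // the pair count is known up front
    }
    void add(int a, int b) { keys_.insert(pack_pair(a, b)); }
    bool contains(int a, int b) const { return keys_.count(pack_pair(a, b)) != 0; }
};
```

The standard library already hashes 64-bit integers, so no custom hash function is needed, and the order of the two numbers is preserved: (5, 10) and (10, 5) map to different keys.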
Why not sort the tuples by the first element, then by the second? A binary search would then be O(log(n)).