Fast search and delete in a std::list of objects

I have a very large list of objects (nodes), and I want to be able to remove/delete elements of the list based on a set of values inside of them.
Preferably in constant time...
The objects have (among other things) fields like:
long long int nodeID;
int depth;
int numberOfClusters;
double [] points;
double [][] clusters;
What I need to do is look through the list and check whether any elements have the same values in all fields except nodeID.
Right now I'm doing something like this:
for(i = nodes.begin(); i != nodes.end(); i++)
{
    for(j = nodes.begin(); j != nodes.end(); j++)
    {
        if(i != j)
        {
            if(compareNodes((*i), (*j)))
            {
                j = nodes.erase (j);
            }
        }
    }
}
Where compareNodes() compares the values inside the two nodes. But this is wildly inefficient.
I'm using erase because that seems to be the only way to delete an element in the middle of a std::list.
Optimally, I would like to be able to find an element based on these values, and remove it from the list if it exists.
I am thinking of some sort of hash map to find the element (a pointer to the element) in constant time, but even if I can do that, I can't find a way to remove the element without iterating through the list.
It seems that I have to use erase, but that requires iterating through the list, which means linear complexity in the list size.
There is also remove_if, but it has the same problem: linear complexity in the list size.
Is there no way to remove an element from a std::list without iterating through the whole list?

First off, you can speed up your existing solution by starting j at std::next(i) instead of nodes.begin() (assuming your compareNodes function is symmetric, i.e. compareNodes(a, b) gives the same result as compareNodes(b, a)).
Second, the hashmap approach sounds viable. But why keep a pointer to the element as a value in the map, when you can keep an iterator? They're both "a thing which references the element," but you can use the iterator to erase the element. And std::list iterators are not invalidated when the list is modified, except for iterators to the erased element itself (they're most probably just pointers under the hood).
Thirdly, if you want to encapsulate/automate the lookup & sequential access, you can look into Boost.MultiIndex to build a container with both sequential and hashed access.
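The hash map plus iterator idea from the second point could be sketched roughly as follows. This is only an illustration under assumptions: Node is a simplified stand-in for the asker's class, and ContentHash, ContentEqual, remove_duplicates and erase_matching are hypothetical names, not anything from the question.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified node; the question's node has more fields.
struct Node {
    long long nodeID;
    int depth;
    std::vector<double> points;
};

// Hash over every compared field (nodeID is deliberately ignored).
struct ContentHash {
    std::size_t operator()(const Node* n) const {
        std::size_t h = std::hash<int>{}(n->depth);
        for (double p : n->points)
            h = h * 31 + std::hash<double>{}(p);
        return h;
    }
};

// Equality over the same fields, playing the role of compareNodes().
struct ContentEqual {
    bool operator()(const Node* a, const Node* b) const {
        return a->depth == b->depth && a->points == b->points;
    }
};

using NodeList = std::list<Node>;
using NodeIndex = std::unordered_map<const Node*, NodeList::iterator,
                                     ContentHash, ContentEqual>;

// One pass over the list: keep the first node of each content group,
// erase later duplicates, and return an index of the survivors.
NodeIndex remove_duplicates(NodeList& nodes) {
    NodeIndex index;
    for (auto it = nodes.begin(); it != nodes.end();) {
        if (index.emplace(&*it, it).second)
            ++it;                  // first time this content is seen
        else
            it = nodes.erase(it);  // duplicate: erase returns the next node
    }
    return index;
}

// Expected O(1): find the stored iterator by content, then erase through it.
bool erase_matching(NodeList& nodes, NodeIndex& index, const Node& probe) {
    auto found = index.find(&probe);
    if (found == index.end())
        return false;
    NodeList::iterator it = found->second;
    index.erase(found);  // drop the index entry while its key is still alive
    nodes.erase(it);
    return true;
}
```

The index stores pointers into the list, which is safe here only because std::list never relocates its elements; the same trick would not work with a std::vector.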

Related

How to avoid out of range exception when erasing vector in a loop?

My apologies for the lengthy explanation.
I am working on a C++ application that loads two files into two 2D string vectors, rearranges those vectors, builds another 2D string vector, and outputs it all in a report. The first element of the two vectors is a code that identifies the owner of the item and the item in the vector. I pass the owner's identification to the program on start and loop through the two vectors in a nested while loop to find those that have matching first elements. When I do, I build a third vector with components of the first two, and I then need to capture any that don't match.
I was using the syntax "vector.erase(vector.begin() + i)" to remove elements from the two original arrays when they matched. When the loop completed, I had my new third vector, and I was left with two vectors that only had elements which didn't match, and that is what I needed. This was working fine as I tried the various owners in the files (the program accepts one owner at a time). Then I tried one that generated an out of range error.
I could not figure out how to do the erase inside of the loop without throwing the error (it didn't seem that swap and pop or erase-remove were feasible solutions). I solved my problem for the program with two extra nested while loops after building my third vector in this one.
I'd like to know how to make the erase method work here (as it seems a simpler solution) or at least how to check for my out of range error (and avoid it). There were a lot of "rows" for this particular owner; so debugging was tedious. Before giving up and going on to the nested while solution, I determined that the second erase was throwing the error. How can I make this work, or are my nested whiles after the fact, the best I can do? Here is the code:
i = 0;
while (i < AIvector.size())
{
CHECK:
    j = 0;
    while (j < TRvector.size())
    {
        if (AIvector[i][0] == TRvector[j][0])
        {
            linevector.clear();
            // Add the necessary data from both vectors to Combo_outputvector
            for (x = 0; x < AIvector[i].size(); x++)
            {
                linevector.push_back(AIvector[i][x]); // add AI info
            }
            for (x = 3; x < TRvector[j].size(); x++) // Don't need the first three elements; so start with x=3.
            {
                linevector.push_back(TRvector[j][x]); // add TR info
            }
            Combo_outputvector.push_back(linevector); // build the combo vector
            // then erase these two current rows/elements from their respective vectors, this revises the AI and TR vectors
            AIvector.erase(AIvector.begin() + i);
            TRvector.erase(TRvector.begin() + j);
            goto CHECK; // jump from here because the erase will have changed the two increments
        }
        j++;
    }
    i++;
}

As already discussed, your goto jumps to the wrong position. Simply moving it out of the first while loop should solve your problems. But can we do better?
Erasing from a vector can be done cleanly with std::remove followed by the vector's erase member (the erase-remove idiom) for cheap-to-move objects, which vector and string both are. After some thought, however, I believe this isn't the best solution for you because you need a function that does more than just check if a certain row exists in both containers, and that is not easily expressed with the erase-remove idiom.
Retaining the current structure, then, we can use iterators for the loop condition. We have a lot to gain from this, because std::vector::erase returns an iterator to the next valid element after the erased one. Not to mention that it takes an iterator anyway. Conditionally erasing elements in a vector becomes as simple as
auto it = vec.begin();
while (it != vec.end()) {
    if (...)
        it = vec.erase(it);
    else
        ++it;
}
Because we assign erase's return value to it we don't have to worry about iterator invalidation. If we erase the last element, it returns vec.end() so that doesn't need special handling.
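As a concrete instance of that loop, here is a minimal sketch that removes negative numbers from a vector of ints. The function name erase_negatives and the condition are made up for illustration only.

```cpp
#include <vector>

// Removes every negative value from v, using erase's return value
// to continue from the element that follows the erased one.
void erase_negatives(std::vector<int>& v) {
    auto it = v.begin();
    while (it != v.end()) {
        if (*it < 0)
            it = v.erase(it);  // already points at the next element
        else
            ++it;              // only advance when nothing was erased
    }
}
```

Note that each erase still shifts the tail of the vector, so for a simple predicate like this one the erase-remove idiom would do less work; the loop form pays off when the body has to do more than test a condition.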
Your second loop can be removed altogether. The C++ standard library defines functions for searching inside containers. std::find_if searches a range for an element that satisfies a condition and returns an iterator to it, or end() if no such element exists. You haven't declared your types anywhere so I'm just going to assume the rows are std::vector<std::string>.
using row_t = std::vector<std::string>;
auto AI_it = AIVector.begin();
while (AI_it != AIVector.end()) {
    // Find a row in TRVector with the same first element as *AI_it
    auto TR_it = std::find_if (TRVector.begin(), TRVector.end(), [&AI_it](const row_t& row) {
        return row[0] == (*AI_it)[0];
    });
    // If a matching row was found
    if (TR_it != TRVector.end()) {
        // Copy the line from AIVector
        auto linevector = *AI_it;
        // Do NOT do this if you don't guarantee size >= 3
        assert(TR_it->size() >= 3);
        std::copy(TR_it->begin() + 3, TR_it->end(),
                  std::back_inserter(linevector));
        Combo_outputvector.emplace_back(std::move(linevector));
        AI_it = AIVector.erase(AI_it);
        TRVector.erase(TR_it);
    }
    else
        ++AI_it;
}
As you can see, switching to iterators completely sidesteps your initial problem of figuring out how not to access invalid indices. If you don't understand the syntax of the arguments for find_if, search for the term lambda. It is beyond the scope of this answer to explain what they are.
A few notable changes:
linevector is now encapsulated properly. There is no reason for it to be declared outside this scope and reused.
linevector simply copies the desired row from AIVector rather than push_back every element in it, as long as Combo_outputvector (and therefore linevector) contains the same type as AIVector and TRVector.
std::copy is used instead of a for loop. Apart from being slightly shorter, it is also more generic, meaning you could change your container type to anything that supports random access iterators and inserting at the back, and the copy would still work.
linevector is moved into Combo_outputvector. This can be a huge performance optimization if your vectors are large!
It is possible that you used a non-encapsulated linevector because you wanted to keep a copy of the last inserted row outside of the loop. That would prohibit moving it, however. For this reason it is faster and more descriptive to do it as I showed above and then simply do the following after the loop.
auto linevector = Combo_outputvector.back();

How to insert to a vector to ensure it remains sorted? [duplicate]

ALL,
This question is a continuation of this one.
I think that the STL is missing this functionality, but that is just my opinion.
Now, to the question.
Consider the following code:
class Foo
{
public:
    Foo();
    int paramA, paramB;
    std::string name;
};
struct Sorter
{
    bool operator()(const Foo &foo1, const Foo &foo2) const
    {
        switch( paramSorter )
        {
        case 1:
            return foo1.paramA < foo2.paramA;
        case 2:
            return foo1.paramB < foo2.paramB;
        default:
            return foo1.name < foo2.name;
        }
    }
    int paramSorter;
};
int main()
{
    std::vector<Foo> foo;
    Sorter sorter;
    sorter.paramSorter = 0;
    // fill the vector
    std::sort( foo.begin(), foo.end(), sorter );
}
At any given moment of time the vector can be re-sorted.
The class also has getter methods, which are used in the sorter structure.
What would be the most efficient way to insert a new element in the vector?
Situation I have is:
I have a grid (spreadsheet), that uses the sorted vector of a class. At any given time the vector can be re-sorted and the grid will display the sorted data accordingly.
Now I will need to insert a new element in the vector/grid.
I can insert, then re-sort and then re-display the whole grid, but this is very inefficient, especially for a big grid.
Any help would be appreciated.

The simple answer to the question:
template< typename T >
typename std::vector<T>::iterator
    insert_sorted( std::vector<T> & vec, T const& item )
{
    return vec.insert
        (
            std::upper_bound( vec.begin(), vec.end(), item ),
            item
        );
}
Version with a predicate.
template< typename T, typename Pred >
typename std::vector<T>::iterator
    insert_sorted( std::vector<T> & vec, T const& item, Pred pred )
{
    return vec.insert
        (
            std::upper_bound( vec.begin(), vec.end(), item, pred ),
            item
        );
}
Where Pred is a comparison that defines a strict weak ordering on type T.
For this to work the input vector must already be sorted on this predicate.
The complexity of doing this is O(log N) for the upper_bound search (finding where to insert) but up to O(N) for the insert itself.
For a better complexity you could use std::set<T> if there are not going to be any duplicates or std::multiset<T> if there may be duplicates. These will retain a sorted order for you automatically and you can specify your own predicate on these too.
There are various other things you could do which are more complex, e.g. manage a vector and a set / multiset / sorted vector of newly added items then merge these in when there are enough of them. Any kind of iterating through your collection will need to run through both collections.
Using a second vector has the advantage of keeping your data compact. Here your "newly added" items vector will be relatively small so the insertion time will be O(M) where M is the size of this vector and might be more feasible than the O(N) of inserting in the big vector every time. The merge would be O(N+M), which is better than the O(NM) it would be when inserting one at a time, so in total it would be O(N+M) + O(M²) to insert M elements then merge.
You would probably also reserve the insertion vector's capacity up front, so that as it grows you will not be doing any reallocations, just moving of elements.
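The "small pending vector, merged in when it fills up" idea could look roughly like the sketch below. The class name BufferedSortedVector, the int element type and the max_pending threshold are all assumptions made for illustration; std::inplace_merge is used for the merge step.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a large sorted vector plus a small sorted buffer
// of recent insertions that is merged in once it reaches max_pending.
class BufferedSortedVector {
public:
    explicit BufferedSortedVector(std::size_t max_pending)
        : max_pending_(max_pending) {
        pending_.reserve(max_pending);  // no reallocation while buffering
    }

    // O(log M) search plus O(M) shifting inside the small buffer.
    void insert(int value) {
        pending_.insert(
            std::upper_bound(pending_.begin(), pending_.end(), value), value);
        if (pending_.size() >= max_pending_)
            flush();
    }

    // Appends the buffer to the main vector and merges the two sorted runs.
    void flush() {
        const auto middle = static_cast<std::ptrdiff_t>(main_.size());
        main_.insert(main_.end(), pending_.begin(), pending_.end());
        std::inplace_merge(main_.begin(), main_.begin() + middle, main_.end());
        pending_.clear();
    }

    // Merges any buffered items first, so the result is fully sorted.
    const std::vector<int>& sorted() {
        flush();
        return main_;
    }

private:
    std::size_t max_pending_;
    std::vector<int> main_;
    std::vector<int> pending_;
};
```

For simplicity this sketch flushes before every read instead of iterating over both collections at once, which trades some of the benefit for a much smaller interface.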

If you need to keep the vector sorted all the time, first you might consider whether using std::set or std::multiset won't simplify your code.
If you really need a sorted vector and want to quickly insert an element into it, but do not want to enforce a sorting criterion to be satisfied all the time, then you can first use std::lower_bound() to find, in logarithmic time, the position in the sorted range where the element should be inserted, then use the insert() member function of vector to insert the element at that position.
If performance is an issue, consider benchmarking std::list vs std::vector. For small items, std::vector is known to be faster because of a higher cache hit rate, but the insert() operation itself is computationally faster on lists (no need to move elements around).

Just a note, you can use upper_bound as well depending on your needs. upper_bound will assure new entries that are equivalent to others will appear at the end of their sequence, lower_bound will assure new entries equivalent to others will appear at the beginning of their sequence. Can be useful for certain implementations (maybe classes that can share a "position" but not all of their details!)
Both will assure you that the vector remains sorted according to the elements' operator<, although inserting at the lower_bound position will mean moving more elements.
Example:
insert 7 # lower_bound of { 5, 7, 7, 9 } => { 5, *7*, 7, 7, 9 }
insert 7 # upper_bound of { 5, 7, 7, 9 } => { 5, 7, 7, *7*, 9 }
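With plain ints the two results are indistinguishable, so the sketch below uses a hypothetical Entry type whose ordering looks only at key, which makes the position of the new element visible through its tag. Entry, insert_first and insert_last are illustrative names, not part of the answer.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical element: ordered by key only, tag is extra payload.
struct Entry {
    int key;
    std::string tag;
};

inline bool operator<(const Entry& a, const Entry& b) {
    return a.key < b.key;
}

// New element goes before all existing elements with an equivalent key.
void insert_first(std::vector<Entry>& v, const Entry& e) {
    v.insert(std::lower_bound(v.begin(), v.end(), e), e);
}

// New element goes after all existing elements with an equivalent key.
void insert_last(std::vector<Entry>& v, const Entry& e) {
    v.insert(std::upper_bound(v.begin(), v.end(), e), e);
}
```

Starting from keys { 5, 7, 7, 9 }, insert_first places a new key-7 entry at index 1 and insert_last places it at index 3, matching the two lines of the example.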

Instead of inserting and then sorting, you should do a find and then an insert.
Keep the vector sorted (sort once). When you have to insert:
Find the first element that compares greater than the one you are going to insert.
Do an insert just before that position.
This way the vector stays sorted.
Here is an example of how it goes.
start {} empty vector
insert 1 -> find first greater returns end() = 1 -> insert at 1 -> {1}
insert 5 -> find first greater returns end() = 2 -> insert at 2 -> {1,5}
insert 3 -> find first greater returns 2 -> insert at 2 -> {1,3,5}
insert 4 -> find first greater returns 3 -> insert at 3 -> {1,3,4,5}

When you want to switch between sort orders, you can use multiple index data structures, each of which you keep in sorted order (probably some kind of balanced tree, like std::map, which maps sort-keys to vector-indices, or std::set to store pointers to your objects - but with different comparison functions).
Here's a library which does this: http://www.boost.org/doc/libs/1_53_0/libs/multi_index/doc/index.html
For every change (insert of new elements or update of keys) you must update all index datastructure, or flag them as invalid.
This works if there are not "too many" sort orders and not "too many" updates of your data structure. Otherwise - bad luck, you have to re-sort every time you want to change the order.
In other words: The more indices you need (to speed up lookup operations), the more time you need for update operations. And every index needs memory, of course.
To keep the count of indices small, you could use some query engine which combines the indices of several fields to support more complex sort orders over several fields. Like an SQL query optimizer. But that may be overkill...
Example: If you have two fields, a and b, you can support 4 sort orders:
1. a
2. b
3. first a then b
4. first b then a
with 2 indices (3. and 4.).
With more fields, the possible combinations of sort orders gets big, fast. But you can still use an index which sorts "almost as you want it" and, during the query, sort the remaining fields you couldn't catch with that index, as needed. For sorted output of the whole data, this doesn't help much. But if you only want to lookup some elements, the first "narrowing down" can help much.
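A hand-rolled version of the multiple-index idea, using only the standard library, might look like the sketch below. It assumes a reduced Foo with two fields and two fixed sort orders; FooTable, ByParamA and ByName are hypothetical names. Boost.MultiIndex packages the same idea more completely.

```cpp
#include <list>
#include <set>
#include <string>
#include <utility>

// Reduced stand-in for the question's Foo.
struct Foo {
    int paramA;
    std::string name;
};

struct ByParamA {
    bool operator()(const Foo* x, const Foo* y) const {
        return x->paramA < y->paramA;
    }
};

struct ByName {
    bool operator()(const Foo* x, const Foo* y) const {
        return x->name < y->name;
    }
};

// Objects live in a std::list (stable addresses); each index is a
// multiset of pointers kept sorted by its own comparison.
class FooTable {
public:
    // Every insert updates all indices: O(log n) per index.
    void insert(Foo foo) {
        storage_.push_back(std::move(foo));
        const Foo* p = &storage_.back();
        by_param_a_.insert(p);
        by_name_.insert(p);
    }

    const std::multiset<const Foo*, ByParamA>& by_param_a() const {
        return by_param_a_;
    }
    const std::multiset<const Foo*, ByName>& by_name() const {
        return by_name_;
    }

private:
    std::list<Foo> storage_;
    std::multiset<const Foo*, ByParamA> by_param_a_;
    std::multiset<const Foo*, ByName> by_name_;
};
```

Switching the displayed sort order then means iterating a different index rather than re-sorting. Changing a key field of a stored Foo would require removing and re-inserting its pointer in the affected index, which this sketch does not handle.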

Here is one I wrote for simplicity. Horribly slow for large sets but fine for small sets. It sorts as values are added:
void InsertionSortByValue(vector<int> &vec, int val)
{
    vector<int>::iterator it;
    for (it = vec.begin(); it < vec.end(); it++)
    {
        if (val < *it)
        {
            vec.insert(it, val);
            return;
        }
    }
    vec.push_back(val);
}
int main()
{
    vector<int> vec;
    for (int i = 0; i < 20; i++)
        InsertionSortByValue(vec, rand()%20);
}
Here is another I found on some website. It sorts by array:
void InsertionSortFromArray(vector<int> &vec)
{
    int elem;
    unsigned int i, j, k, index;
    for (i = 1; i < vec.size(); i++)
    {
        elem = vec[i];
        if (elem < vec[i-1])
        {
            for (j = 0; j <= i; j++)
            {
                if (elem < vec[j])
                {
                    index = j;
                    for (k = i; k > j; k--)
                        vec[k] = vec[k-1];
                    break;
                }
            }
        }
        else
            continue;
        vec[index] = elem;
    }
}
int main()
{
    vector<int> vec;
    for (int i = 0; i < 20; i++)
        vec.push_back(rand()%20);
    InsertionSortFromArray(vec);
}

Assuming you really want to use a vector, and the sort criterion or keys don't change (so the order of already inserted elements always stays the same):
Insert the element at the end, then move it towards the front one step at a time, until the preceding element isn't bigger.
It can't be done faster (regarding asymptotic complexity, or "big O notation"), because you must move all bigger elements. And that's the reason why STL doesn't provide this - because it's inefficient on vectors, and you shouldn't use them if you need it.
Edit: Another assumption: Comparing the elements is not much more expensive than moving them. See comments.
Edit 2: As my first assumption doesn't hold (you want to change the sort criterion), scrap this answer and see my other one: https://stackoverflow.com/a/15843955/1413374
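For the fixed-ordering case this answer started from, the "insert at the end, then move it forward" step might be sketched like this. The function name insert_by_shifting and the int element type are assumptions for illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Appends value, then swaps it towards the front until the element
// before it is no longer bigger. Assumes v is sorted ascending on entry.
void insert_by_shifting(std::vector<int>& v, int value) {
    v.push_back(value);
    for (std::size_t i = v.size() - 1; i > 0 && v[i - 1] > v[i]; --i)
        std::swap(v[i - 1], v[i]);
}
```

This does a linear number of comparisons in the worst case, which is why the answer's first edit adds the assumption that comparing is not much more expensive than moving.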

How to do fast sorting in sorted list when only one element is changed

I need a list of elements that is always sorted. The operations involved are quite simple. For example, if the list is sorted from high to low, I only need three operations in some loop task:
while true do {
    list.sort() //sort the list that has hundreds of elements
    val = list[0] //get the first/maximum value in the list
    list.pop_front() //remove the first/maximum element
    ...//do some work here
    list.push_back(new_elem)//insert a new element
    list.sort()
}
However, since I only add one element at a time and I am concerned about speed, I don't want the sort to go through all the elements (e.g., with a bubble sort). So I wonder whether there is a function to insert the element in order, or whether list::sort() is smart enough to use some kind of quick sort when only one element is added/modified.
Or maybe should I use deque for better speed performance if the above are all the operations needed?
Thanks a lot!

As mentioned in the comments, if you aren't locked into std::list then you should try std::set or std::multiset.
The std::list::insert method takes an iterator which specifies where to add the new item. You can use std::lower_bound to find the correct insertion point; it's not optimal without random access iterators but it still only does O(log n) comparisons.
P.S. don't use variable names that collide with built-in classes like list.
lst.sort(std::greater<T>()); // sort once, from high to low
while (true) {
    val = lst.front(); // get the first/maximum value in the list
    lst.pop_front(); // remove the first/maximum element
    ... // do some work here
    // find the position that keeps the high-to-low order
    std::list<T>::iterator it = std::lower_bound(lst.begin(), lst.end(), new_elem, std::greater<T>());
    lst.insert(it, new_elem); // insert a new element
    // lst is already sorted
}
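The std::multiset alternative mentioned at the start of this answer could be sketched as follows, assuming int elements ordered from high to low. The function name pop_max_and_push is made up for illustration, and the container must not be empty when it is called.

```cpp
#include <functional>
#include <set>

using DescendingSet = std::multiset<int, std::greater<int>>;

// One iteration of the loop: take the maximum, then add the replacement.
// Both steps are O(log n) and the container stays sorted throughout.
// Precondition: s is not empty.
int pop_max_and_push(DescendingSet& s, int new_elem) {
    int val = *s.begin();  // largest element comes first with std::greater
    s.erase(s.begin());    // remove only that one element
    s.insert(new_elem);
    return val;
}
```

Erasing through the iterator matters here: s.erase(val) would remove every element equal to val, not just the first.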

How to get index of bidirectional iterator with std::map?

What is the most effective way to get the index of an iterator of an std::vector? explains how to do it for std::vector or std::list but what about std::map?

The cleanest way to do this would be to use the std::distance function:
auto index = std::distance(myMap.begin(), myMapItr);
However, this runs in O(n) time, which is inefficient for large maps.
If you need to determine the index of an iterator into a map or other ordered collection, you may want to search for a library containing an order statistic tree, which is a modified binary search tree that supports efficient (O(log n)) lookup of the index of a particular value in the tree.
Alternatively, if you are manually iterating over the tree, you can just keep a counter lying around alongside the iterator that you increment every time you traverse from one element to the next. This gives O(1)-time lookup of the index of the iterator, but is not fully general.
Hope this helps!
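A small sketch of the std::distance approach, wrapped in a helper for a map from strings to ints. The name index_of and the choice to return -1 for a missing key are assumptions made for this example.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

// Returns the position of key in the map's sorted order, or -1 if absent.
// find is O(log n), but std::distance walks the map, so this is O(n).
std::ptrdiff_t index_of(const std::map<std::string, int>& m,
                        const std::string& key) {
    auto it = m.find(key);
    if (it == m.end())
        return -1;
    return std::distance(m.begin(), it);
}
```

If you are already looping over the map, keeping a counter next to the iterator gives the same number without the extra walk.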

Try this:
int IndexOf(Type *t)
{
    Type** data = vector.data();
    int index = 0;
    while(*data++ != t)
    {
        index ++;
    }
    return index ;
}

Efficient Data Structure for Insertion

I'm looking for a data structure (array-like) that allows fast (faster than O(N)) arbitrary insertion of values into the structure. The data structure must be able to print out its elements in the way they were inserted. This is similar to something like List.Insert() (which is too slow as it has to shift every element over), except I don't need random access or deletion. Insertion will always be within the size of the 'array'. All values are unique. No other operations are needed.
For example, if Insert(x, i) inserts value x at index i (0-indexing). Then:
Insert(1, 0) gives {1}
Insert(3, 1) gives {1,3}
Insert(2, 1) gives {1,2,3}
Insert(5, 0) gives {5,1,2,3}
And it'll need to be able to print out {5,1,2,3} at the end.
I am using C++.

Use a skip list. Another option is a tiered vector. The skip list performs inserts in expected O(log(n)) time and keeps the numbers in order. The tiered vector supports insert in O(sqrt(n)) and again can print the elements in order.
EDIT: per the comment of amit I will explain how you find the k-th element in a skip list:
For each element you have a tower of links to next elements, and for each link you know how many elements it jumps over. So looking for the k-th element you start with the head of the list and go down the tower until you find a link that jumps over no more than k elements. You go to the node pointed to by this link and decrease k by the number of elements you have jumped over. Continue doing that until you have k = 0.
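Neither a skip list nor a real tiered vector fits in a short example, so the sketch below swaps in a simpler square-root decomposition to illustrate the O(sqrt(n)) insertion idea: the sequence is cut into small blocks, an insert only shifts elements inside one block, and oversized blocks are split. The class name BlockedSequence and the block_size parameter are assumptions; the O(sqrt(n)) bound only holds if block_size is chosen near sqrt(n).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified sqrt-decomposition sequence (not a full tiered vector).
class BlockedSequence {
public:
    explicit BlockedSequence(std::size_t block_size = 64)
        : block_size_(block_size) {}

    // Inserts value so that it ends up at position index.
    // Precondition: index <= number of stored elements.
    void insert(std::size_t index, int value) {
        if (blocks_.empty())
            blocks_.emplace_back();
        std::size_t b = 0;
        // Skip whole blocks that lie entirely before the target position.
        while (b + 1 < blocks_.size() && index > blocks_[b].size()) {
            index -= blocks_[b].size();
            ++b;
        }
        std::vector<int>& block = blocks_[b];
        block.insert(block.begin() + static_cast<std::ptrdiff_t>(index), value);
        if (block.size() > 2 * block_size_)
            split(b);
    }

    // Concatenates the blocks to give the elements in sequence order.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (const std::vector<int>& block : blocks_)
            out.insert(out.end(), block.begin(), block.end());
        return out;
    }

private:
    // Moves the second half of block b into a new block right after it.
    void split(std::size_t b) {
        std::vector<int>& block = blocks_[b];
        auto half = block.begin() + static_cast<std::ptrdiff_t>(block.size() / 2);
        std::vector<int> tail(half, block.end());
        block.erase(half, block.end());
        blocks_.insert(blocks_.begin() + static_cast<std::ptrdiff_t>(b) + 1,
                       std::move(tail));
    }

    std::size_t block_size_;
    std::vector<std::vector<int>> blocks_;
};
```

Replaying the question's four inserts with this class yields 5, 1, 2, 3 when the blocks are concatenated.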

Did you consider using std::map or std::vector ?
You could use a std::map with the rank of insertion as key. And vector has a reserve member function.

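You can use an std::map mapping (index, insertion-time) pairs to values, where insertion-time is an "autoincrement" integer (in SQL terms). The ordering on the pairs should be
(i, t) < (i*, t*)
iff
i < i* or (i = i* and t > t*)
In code:
struct lt {
    bool operator()(std::pair<size_t, size_t> const &x,
                    std::pair<size_t, size_t> const &y) const
    {
        // Strict weak ordering: index ascending, then insertion time descending.
        return x.first < y.first
            || (x.first == y.first && x.second > y.second);
    }
};
typedef std::map<std::pair<size_t, size_t>, int, lt> array_like;
void insert(array_like &a, int value, size_t i)
{
    a[std::make_pair(i, a.size())] = value;
}
Note that this reproduces the example from the question, but the stored index of an element is never adjusted when a later insertion shifts it, so it does not handle arbitrary insertion sequences.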

Regarding your comment:
List.Insert() (which is too slow as it has to shift every element over),
Lists don't shift their values; they iterate over them to find the location where you want to insert. Be careful what you say. This can be confusing to newbies like me.

A solution that's included with GCC by default is the rope data structure. Here is the documentation. Typically, ropes come to mind when working with long strings of characters. Here we have ints instead of characters, but it works the same. Just use int as the template parameter. (Could also be pairs, etc.)
Here's the description of rope on Wikipedia.
Basically, it's a binary tree that maintains how many elements are in the left and right subtrees (or equivalent information, which is what's referred to as order statistics), and these counts are updated appropriately as subtrees are rotated when elements are inserted and removed. This allows O(lg n) operations.

There's this data structure which pushes insertion time down from O(N) to O(sqrt(N)) but I'm not that impressed. I feel one should be able to do better but I'll have to work at it a bit.

In C++ you can just use a map of vectors, like so:
#include <iostream>
#include <map>
#include <vector>
using namespace std;
int main() {
    map<int, vector<int> > data;
    data[0].push_back(1);
    data[1].push_back(3);
    data[1].push_back(2);
    data[0].push_back(5);
    map<int, vector<int> >::iterator it;
    for (it = data.begin(); it != data.end(); it++) {
        const vector<int> &v = it->second;
        // print each index's values newest first
        for (int i = v.size() - 1; i >= 0; i--) {
            cout << v[i] << ' ';
        }
    }
    cout << '\n';
}
This prints:
5 1 2 3
Just like you want, and inserts are O(log n).
