How to achieve better efficiency re-inserting into sets in C++


I need to modify an object that has already been inserted into a set. This isn't trivial because the iterator in the pair returned from an insertion of a single object is a const iterator and does not allow modifications. So, my plan was that if an insert failed I could copy that object into a temporary variable, erase it from the set, modify it locally and then insert my modified version.
insertResult = mySet.insert(newPep);
if( insertResult.second == false )
modifySet(insertResult.first, newPep);
void modifySet(set<Peptide>::iterator someIter, Peptide::Peptide newPep) {
Peptide tempPep = (*someIter);
someSet.erase(someIter);
// Modify tempPep - this does not modify the key
someSet.insert(tempPep);
}
This works, but I want to make my insert more efficient. I tried making another iterator and setting it equal to someIter in modifySet. Then after deleting someIter I would still have an iterator to that location in the set and I could use that as the insertion location.
void modifySet(set<Peptide>::iterator someIter, Peptide::Peptide newPep) {
Peptide tempPep = (*someIter);
anotherIter = someIter;
someSet.erase(someIter);
// Modify tempPep - this does not modify the key
someSet.insert(anotherIter, tempPep);
}
However, this results in a seg fault. I am hoping that someone can tell me why this insertion fails or suggest another way to modify an object that has already been inserted into a set.
The full source code can be viewed at github.

I agree with Peter that a map is probably a better model of what you are doing, specifically something like map<pep_key, Peptide::Peptide>, would let you do something like:
insertResult = myMap.insert(std::make_pair(newPep.keyField(), newPep));
if( insertResult.second == false )
insertResult.first->second = newPep;
To answer your question, the insert segfaults because erase invalidates an iterator, so inserting with it (or a copy of it) is analogous to dereferencing an invalid pointer. The only way I see to do what you want is with a const_cast
insertResult = mySet.insert(newPep);
if( insertResult.second == false )
const_cast<Peptide::Peptide&>(*(insertResult.first)) = newPep;
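The const_cast approach looks like it will work for what you are doing, but it is generally a bad idea: if the modification ever touches anything the comparison depends on, the set's ordering is silently broken.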

I hope it isn't bad form to answer my own question, but I would like it to be here in case someone else ever has this problem. The reason my attempt seg faulted was given by academicRobot, but here is the solution to make this work with a set. While I do appreciate the other answers and plan to learn about maps, this question was about efficiently re-inserting into a set.
void modifySet(set<Peptide>::iterator someIter, Peptide::Peptide newPep) {
    if( someIter == someSet.begin() ) {
        // No predecessor to use as a hint: fall back to a plain insert.
        Peptide tempPep = (*someIter);
        someSet.erase(someIter);
        // Modify tempPep - this does not modify the key
        someSet.insert(tempPep);
    }
    else {
        Peptide tempPep = (*someIter);
        // Step back to the predecessor before erasing, so the hint stays valid.
        set<Peptide>::iterator anotherIter = someIter;
        --anotherIter;
        someSet.erase(someIter);
        // Modify tempPep - this does not modify the key
        someSet.insert(anotherIter, tempPep);
    }
}
In my program this change dropped my run time by about 15%, from 32 seconds down to 27 seconds. My larger data set is currently running and I have my fingers crossed that the 15% improvement scales.
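With a C++11 compiler, a variation on this avoids the begin() special case: erase returns the iterator following the removed element, and that iterator is a valid hint for the re-insertion. The sketch below is only an illustration of that idea, not the code from the question. Item, ByKey and insertOrBump are hypothetical stand-ins for Peptide and its comparison, and the ordering is assumed to depend on key alone.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical stand-in for Peptide: only `key` takes part in the ordering.
struct Item {
    int key;
    int hits;
};

struct ByKey {
    bool operator()(const Item& a, const Item& b) const { return a.key < b.key; }
};

typedef std::set<Item, ByKey> ItemSet;

// Inserts `item`; if an element with the same key is already stored,
// replaces it with a copy whose `hits` has been increased.
ItemSet::iterator insertOrBump(ItemSet& items, const Item& item) {
    std::pair<ItemSet::iterator, bool> result = items.insert(item);
    if (result.second) {
        return result.first;  // new element, nothing to modify
    }
    Item updated = *result.first;  // copy out, since set elements are const
    updated.hits += item.hits;     // must not touch the key
    // The modified copy must still be equivalent to the stored element.
    assert(!ByKey()(updated, *result.first) && !ByKey()(*result.first, updated));
    // C++11: erase returns the iterator after the erased element, which
    // stays valid and is a good hint for the re-insertion.
    ItemSet::iterator hint = items.erase(result.first);
    return items.insert(hint, updated);
}
```

Whether this beats the predecessor-hint version above would need to be measured on the real data set.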

std::set::insert returns a pair<iterator, bool> as far as I know. In any case, directly modifying an element in any sort of set is risky. What if your modification causes the item to compare equal to another existing item? What if it changes the item's position in the total order of items in the set? Depending on the implementation, this will cause undefined behaviour.
If the item's key remains the same and only its properties change, then I think what you really want is a map or an unordered_map instead of a set.

As you realized, sets are a bit messy to deal with because you have no way to indicate which part of the object should be considered the key and which part you can modify safely.
The usual answer is to use a map or an unordered_map (if you have access to C++0x) and cut your object in two halves: the key and the satellite data.
Beware of the typical answer, std::map<key_type, Peptide>: while it seems easy, it means you need to guarantee that the key part of the Peptide object always matches the key it's associated with, and the compiler won't help.
So you have 2 alternatives:
Cut Peptide in two: Peptide::Key and Peptide::Data, then you can use the map safely.
Don't provide any method to alter the part of Peptide which defines the key, then you can use the typical answer.
Finally, note that there are two ways to insert in a map-like object.
insert: inserts, but fails if the key already exists
operator[]: inserts or updates (which requires creating an empty object first)
So, a solution would be:
class Peptide
{
public:
Peptide(int const id): mId(id) {}
int GetId() const;
void setWeight(float w);
void setLength(float l);
private:
int const mId;
float mWeight;
float mLength;
};
typedef std::unordered_map<int, Peptide> peptide_map;
Note that with operator[], an update means creating a new object (default constructor) and then assigning to it. This is not possible here: Peptide has no default constructor, and its const mId member makes it non-assignable, because assignment would potentially change the key part of the object.
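Since operator[] is ruled out for this class, an insert-or-update has to go through find and insert, with changes applied through the setters. The following is a minimal sketch under assumptions: the member bodies, the getWeight accessor and the setWeightFor helper are not part of the answer and are made up for illustration.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>

// Variant of the Peptide class above with member bodies filled in (assumed).
class Peptide {
public:
    explicit Peptide(int id) : mId(id), mWeight(0.0f), mLength(0.0f) {}
    int GetId() const { return mId; }
    float getWeight() const { return mWeight; }  // hypothetical accessor
    void setWeight(float w) { mWeight = w; }
    void setLength(float l) { mLength = l; }
private:
    int const mId;  // const member: the class is copyable but not assignable
    float mWeight;
    float mLength;
};

typedef std::unordered_map<int, Peptide> peptide_map;

// Hypothetical helper: insert-or-update without operator[].
void setWeightFor(peptide_map& peptides, int id, float weight) {
    peptide_map::iterator it = peptides.find(id);
    if (it == peptides.end()) {
        // Key and object are built from the same id, so they cannot disagree.
        it = peptides.insert(std::make_pair(id, Peptide(id))).first;
    }
    assert(it->second.GetId() == it->first);
    it->second.setWeight(weight);  // update through a setter, key untouched
}
```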

std::map will make your life a lot easier and I wouldn't be surprised if it outperforms std::set for this particular case. The storage of the key might seem redundant but can be trivially cheap (ex: pointer to immutable data in Peptide with your own comparison predicate to compare the pointee correctly). With that you don't have to fuss about with the constness of the value associated with a key.
If you can change Peptide's implementation, you can avoid redundancy completely by making Peptide into two separate classes: one for the key part and one for the value associated with the key.

Related

std::unordered_map::insert vs std::unordered_map::operator[]

I have a container of type unordered_map and I was wanting confirmation of which version I should use if I want to add an element to the map. I want it to overwrite the old value with the new presented if it exists and just add it if it does not.
I see that insert adds the element if it does not exist and also returns a pair of iterator and bool, where the bool indicates whether the insert was successful. I also see that operator[] adds the element if it does not exist and overwrites it if it does.
My question is basically whether I should be using operator[] for this purpose, or whether there are any gotchas that I haven't considered. Also, if my perception of these methods is wrong, please correct me.
Here is what I was going to do. Data is a scoped enum of storage type int.
void insertData(const Data _Data, const int _value)
{
int SC_val = static_cast<int>(_Data);
//sc val is now the integer value of the Data being added
//returns a pair of iterator and bool indicating whether the insert was successful
auto ret = baseData.insert(std::pair<int,int>(SC_val,_value));
if (ret.second == false)
{//if the insert was not successful(key already exists)
baseData[ret.first->first] = _value;
}
}
or should I just do
int index = static_cast<int>(_Data);
baseData[index] = _value;
I am leaning towards the operator[] version as I see no real difference and it is much less code. Please advise and thank you all in advance.

insert and operator[] are both very useful methods. They appear similar, however, the details make them very different.
operator[]
Returns a reference to the element you are searching for. When no element exists, it creates a new default element (so the mapped type requires a default constructor).
When used to insert an element: myMap[key] = value;, the value will override the old value for the key.
insert
Returns an iterator and a bool. The iterator is to the element. The bool indicates if a new element was inserted (true), or it already contained an element for the key (false).
Using insert doesn't require a default constructor.
When used to insert a new element: myMap.insert({key, value});, the old value does not get updated if key already exists in the map.
insert_or_assign
Thanks to Marc Glisse, who mentioned it in the comments.
This method (available since C++17) is similar to insert. The difference is in the behavior when the element already exists, in which case it will overwrite the existing value.
Returns an iterator and a bool. The iterator is to the element. The bool indicates if a new element was inserted (true), or it already contained an element for the key (false).
Using insert_or_assign doesn't require a default constructor.
When used to insert a new element: myMap.insert_or_assign(key, value);, the old value gets updated if key already exists in the map.
Building your map
Your use-case inserts data into the map and assumes that the key doesn't exist.
Writing baseData[index] = _value; will exactly do what you want.
However, if I had to write it, I would go with the insert variant (here via emplace) and check the result:
auto successfulInsert = baseData.emplace(SC_val, _value).second;
assert(successfulInsert && "Value has been inserted several times.");
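To make the three behaviors concrete, here is a small sketch that applies insert, insert_or_assign and operator[] to the same key and records the stored value after each step. The function name compareInsertionMethods and the sample values are invented for the illustration, and insert_or_assign assumes a C++17 compiler.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the value stored under key 1 after each step.
std::vector<std::string> compareInsertionMethods() {
    std::unordered_map<int, std::string> m;
    std::vector<std::string> seen;

    bool inserted = m.insert({1, "first"}).second;
    assert(inserted);                    // key was new
    inserted = m.insert({1, "second"}).second;
    assert(!inserted);                   // key existed, value left untouched
    seen.push_back(m.at(1));

    inserted = m.insert_or_assign(1, "third").second;  // C++17
    assert(!inserted);                   // key existed, value overwritten
    seen.push_back(m.at(1));

    m[1] = "fourth";                     // operator[] also overwrites
    seen.push_back(m.at(1));
    return seen;
}
```

The returned vector holds "first", "third" and "fourth", in that order.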

Just using operator [] perfectly fits for your case.
FYI: Quote from cppreference.com std::unordered_map:
std::unordered_map::operator[]
Returns a reference to the value that is mapped to a key equivalent to key, performing an insertion if such key does not already exist.
I see no real difference and it is much less code.
You're right!

It seems that you want to insert data only when it does not already exist in baseData.
You can use count() to check whether the key is in the map, like this:
int index = static_cast<int>(_Data);
if(!baseData.count(index))
{
    baseData[index] = _value;
}

Alternative to nested maps in standard namespace

I have nested map of type:
std::map<int,std::map<pointer,pointer>>
I am iterating over the map every frame and doing updates on it, so basically I have two nested if statements.
I have an array and I need to sort the data by two attributes. The first attribute is an integer, which is the first key; the second attribute is a pointer, which is the key of the nested map inside the main map. So my code is something like:
iterator = outermap.find();
if(iterator!=outermap.end()){
value = iterator->second;
it1 = value.find();
if(it1!=value.end()){
value1 = it1->second;
// do something
}
else{
// do something and add new value
}
}
else {
// do something and add the values
}
This is really slow and causing my application to drop frame rate. Is there any alternative to this? Can we use hash codes and linked list to achieve the same?

You can use std::unordered_map; it hashes the keys, so lookups complete faster on average. Using value = iterator->second copies your entire inner map into the 'value' variable. Using a reference avoids the unnecessary copy and is better for performance, e.g. auto & value = iterator->second.
Also std::map is guaranteed to be ordered. This can be used to your advantage since your keys are integers for the outermost map.
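The copy is not only a performance cost: anything added to the copy never reaches the outer map. The sketch below shows both variants side by side. It assumes int keys and values in place of the pointer types from the question, and the function names are made up for the example.

```cpp
#include <cassert>
#include <map>

typedef std::map<int, std::map<int, int> > Nested;

// Copies the inner map: the new entry lands in the copy and is lost.
void addToCopy(Nested& outer, int outerKey, int innerKey, int v) {
    Nested::iterator it = outer.find(outerKey);
    if (it != outer.end()) {
        std::map<int, int> value = it->second;  // full copy of the inner map
        value[innerKey] = v;
    }
}

// Binds a reference: no copy, and the update is visible through `outer`.
void addToReference(Nested& outer, int outerKey, int innerKey, int v) {
    Nested::iterator it = outer.find(outerKey);
    if (it != outer.end()) {
        std::map<int, int>& value = it->second;
        value[innerKey] = v;
        assert(it->second.count(innerKey) == 1);
    }
}
```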

Firstly, your question is a bit vague, so this may or may not fit your problem.
Now, you have a map<int, map<pointer, pointer>>, but you never operate on the inner map itself. All you do is look up a value by an int and a pointer. This is also exactly what you should do instead, use an aggregate of those two as key in a map. The type for that is pair<int, pointer>, the map then becomes a map<pair<int, pointer>, pointer>.
One more note: You seem to know the keys to search in the map in advance. If the check whether the element exists is not just for safety, you could also use the overloaded operator[] of the map. The lookup then becomes outermap[ikey][pkey] and, for a missing key, returns a value-initialized pointer (so a null pointer, if pointer really is a pointer). For the suggested combined map, the lookup would be outermap[make_pair(ikey, pkey)].
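A rough sketch of the combined-key map could look like the following. Node, KeyLess and findOrAdd are hypothetical names, and the custom comparator is an extra precaution rather than something the answer asks for: it routes the pointer comparison through std::less, which is guaranteed to order unrelated pointers consistently.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

struct Node {};  // hypothetical pointee type

typedef std::pair<int, Node*> Key;

// Orders by the int first, then by pointer value.
struct KeyLess {
    bool operator()(const Key& a, const Key& b) const {
        if (a.first != b.first) return a.first < b.first;
        return std::less<Node*>()(a.second, b.second);
    }
};

typedef std::map<Key, Node*, KeyLess> FlatMap;

// One insert call replaces the two nested finds: it returns the existing
// entry when the key is present and adds `fallback` otherwise.
Node* findOrAdd(FlatMap& m, int ikey, Node* pkey, Node* fallback) {
    std::pair<FlatMap::iterator, bool> r =
        m.insert(std::make_pair(Key(ikey, pkey), fallback));
    assert(!r.second || r.first->second == fallback);
    return r.first->second;
}
```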

Memory Allocation in C++, Using a Map of Linked Lists

The underlying data structure I am using is:
map<int, Cell>
struct Cell{ char c; Cell*next; };
In effect the data structure maps an int to a linked list. The map(in this case implemented as a hashmap) ensures that finding a value in the list runs in constant time. The Linked List ensures that insertion and deletion also run in constant time. At each processing iteration I am doing something like:
Cell *cellPointer1 = new Cell;
//Process cells, build linked list
Once the list is built, I put the Cell elements in the map. The structure was working just fine, and at the end of my program I deallocate the memory. For each Cell in the list:
delete cellPointer1;
But at the end of my program I have a memory leak!
To test memory leak I use:
#define _CRTDBG_MAP_ALLOC // must be defined before the includes to take effect
#include <stdlib.h>
#include <crtdbg.h>
_CrtDumpMemoryLeaks();
I'm thinking that somewhere along the way the fact that I am putting the Cells in the map does not allow me to deallocate the memory correctly. Does anyone have any ideas on how to solve this problem?

We'll need to see your code for insertion and deletion to be sure about it.
What I'd see as a memleak-free insert / remove code would be:
( NOTE: I'm assuming you don't store the Cells that you allocate in the map )
//
// insert
//
std::map<int, Cell> _map;
Cell a = Cell(); // no new here! value-initialized, so a.next starts out NULL
Cell *iter = &a;
while( condition )
{
    Cell *b = new Cell(); // value-initialized: b->next is NULL
    iter->next = b;
    iter = b;
}
_map[id] = a; // will 'copy' a into the container slot of the map
//
// cleanup:
//
std::map<int,Cell>::iterator i = _map.begin();
while( i != _map.end() )
{
    Cell &a = i->second;
    Cell *iter = a.next; // list of cells associated to 'a'.
    while( iter != NULL )
    {
        Cell *to_delete = iter;
        iter = iter->next;
        delete to_delete;
    }
    // Removes the Cell from the map. No need to 'delete'.
    // The post-increment advances i before the erased iterator is invalidated.
    _map.erase(i++);
}
Edit: there was a comment indicating that I might not have understood the problem completely. If you insert ALL the cells you allocate in the map, then the faulty thing is that your map contains Cell, not Cell*.
If you define your map as std::map<int, Cell *>, your problem would be solved under two conditions:
you insert all the Cells that you allocate in the map
the integer (the key) associated to each cell is unique (important!!)
Now the deletion is simply a matter of:
std::map<int, Cell*>::iterator i = _map.begin();
while( i != _map.end() )
{
    Cell *c = i->second;
    if ( c != NULL ) delete c;
    ++i; // advance, otherwise the loop never terminates
}
_map.clear();

I've built almost the exact same hybrid data structure you are after (list/map with the same algorithmic complexity if I were to use unordered_map instead) and have been using it from time to time for almost a decade though it's a kind of bulky structure (something I'd use with convenience in mind more than efficiency).
It's worth noting that this is quite different from just using std::unordered_map directly. For a start, it preserves the original order in which one inserts elements. Insertion, removal, and searches are guaranteed to happen in logarithmic time (or constant time depending on whether key searching is involved and whether you use a hash table or BST), iterators do not get invalidated on insertion/removal (the main requirement I needed which made me favor std::map over std::unordered_map), etc.
The way I did it was like this:
// I use this as the iterator for my container with
// the list being the main 'focal point' while I
// treat the map as a secondary structure to accelerate
// key searches.
typedef typename std::list<Value>::iterator iterator;
// Values are stored in the list.
std::list<Value> data;
// Keys and iterators into the list are stored in a map.
std::map<Key, iterator> accelerator;
If you do it like this, it becomes quite easy. push_back is a matter of pushing back to the list and adding an iterator to the last element to the map. Iterator removal is a matter of removing the key pointed to by the iterator from the map before removing the element from the list. Finding a key is a matter of searching the map and returning the associated value, which happens to be the list iterator. Key removal is just finding a key and then doing iterator removal, and so on.
If you want to improve all methods to constant time, then you can use std::unordered_map instead of the std::map I used here (though that comes with some caveats).
Taking an approach like this should simplify things considerably over an intrusive list-based solution where you're manually having to free memory.
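The operations described above can be sketched as a small class. This is an illustration, not the container the answer refers to: the name IndexedList is invented, and storing (key, value) pairs in the list is an assumption made so that an iterator can find its own key during removal.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <utility>

template <typename Key, typename Value>
class IndexedList {
public:
    typedef std::pair<Key, Value> Entry;
    typedef typename std::list<Entry>::iterator iterator;

    iterator begin() { return data.begin(); }
    iterator end() { return data.end(); }

    // Appends in insertion order; an existing key is left untouched.
    iterator push_back(const Key& key, const Value& value) {
        typename std::map<Key, iterator>::iterator found = accelerator.find(key);
        if (found != accelerator.end()) return found->second;
        data.push_back(Entry(key, value));
        iterator last = data.end();
        --last;
        accelerator.insert(std::make_pair(key, last));
        return last;
    }

    // Key search goes through the map; returns end() when the key is missing.
    iterator find(const Key& key) {
        typename std::map<Key, iterator>::iterator found = accelerator.find(key);
        return found == accelerator.end() ? data.end() : found->second;
    }

    // Iterator removal: drop the key from the map, then the list node.
    void erase(iterator it) {
        accelerator.erase(it->first);
        data.erase(it);
        assert(data.size() == accelerator.size());
    }

private:
    std::list<Entry> data;                // values, in insertion order
    std::map<Key, iterator> accelerator;  // key -> position in the list
};
```

Because std::list iterators stay valid when other elements are inserted or erased, the iterators stored in the map never dangle as long as erase keeps both structures in step.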

Is there a reason why you are not using built-in containers like, say, STL?
Anyhow, you don't show the code where the allocation takes place, nor the map definition (is this coming from a library?).
Are you sure you deallocate all of the previously allocated Cells, starting from the last one and going backwards up to the first?

You could do this using the STL (remove next from Cell):
std::unordered_map<int,std::list<Cell>>
Or if cell only contains a char
std::unordered_map<int,std::string>
If your compiler doesn't support std::unordered_map then try boost::unordered_map.
If you really want to use intrusive data structures, have a look at Boost Intrusive.
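As a minimal sketch of the first suggestion, assuming Cell is reduced to its char payload and using a made-up helper named appendCells, the container version needs no new or delete at all:

```cpp
#include <cassert>
#include <list>
#include <unordered_map>

struct Cell {
    char c;  // `next` removed: std::list owns the links
};

typedef std::unordered_map<int, std::list<Cell> > CellTable;

// Hypothetical helper: appends one cell per character of `text`.
void appendCells(CellTable& table, int id, const char* text) {
    assert(text != 0);
    std::list<Cell>& cells = table[id];  // creates an empty list on first use
    for (; *text != '\0'; ++text) {
        Cell cell = { *text };
        cells.push_back(cell);
    }
}
```

Erasing a key, or letting the table go out of scope, releases every Cell in the corresponding list.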

As others have pointed out, it may be hard to see what you're doing wrong without seeing your code.
Someone should mention, however, that you're not helping yourself by overlaying two container types here.
If you're using a hash_map, you already have constant insertion and deletion time on average; see the related Hash : How does it work internally? post. The only exception to the O(1) lookup time is if your implementation decides to resize the container, in which case you have added overhead regardless of your linked list addition. Having two addressing schemes is only going to make things slower (not to mention buggier).
Sorry if this doesn't point you to the memory leak, but I'm sure a lot of memory leaks / bugs come from not using stl / boost containers to their full potential. Look into that first.

You need to be very careful with what you are doing, because values in a C++ map need to be copyable and with your structure that has raw pointers, you must handle your copy semantics properly.
You would be far better off using std::list where you won't need to worry about your copy semantics.
If you can't change that then at least std::map<int, Cell*> will be a bit more manageable, although you would have to manage the pointers in your map because std::map will not manage them for you.
You could of course use std::map<int, shared_ptr<Cell> >, probably easiest for you for now.
If you also use shared_ptr within your Cell object itself, you will need to beware of circular references, and as Cell will know it's being shared_ptr'd you could derive it from enable_shared_from_this
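A sketch of the shared_ptr variant follows, under the assumption that the links inside Cell are shared_ptr as well. The live counter, the deleted copy constructor and makeList are additions for the demonstration only; they make it possible to check that every Cell is released without a single delete.

```cpp
#include <cassert>
#include <map>
#include <memory>

struct Cell {
    static int live;             // number of Cells currently alive (demo only)
    char c;
    std::shared_ptr<Cell> next;  // owning link: no manual delete needed

    explicit Cell(char ch) : c(ch) { ++live; }
    Cell(const Cell&) = delete;  // keeps the live count honest
    ~Cell() { assert(live > 0); --live; }
};
int Cell::live = 0;

typedef std::map<int, std::shared_ptr<Cell> > CellMap;

// Builds a singly linked list with one Cell per character of `text`.
std::shared_ptr<Cell> makeList(const char* text) {
    std::shared_ptr<Cell> head, tail;
    for (; *text != '\0'; ++text) {
        std::shared_ptr<Cell> cell = std::make_shared<Cell>(*text);
        if (!head) head = cell; else tail->next = cell;
        tail = cell;
    }
    return head;
}
```

The list is acyclic, so the circular-reference caveat does not bite here. Note that destroying the head releases the rest of the chain recursively, which could become a concern for very long lists.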
My final point will be that list is very rarely the correct collection type to use. It is the correct one to use sometimes, especially when you have an LRU cache situation and you want to move accessed elements to the end of the list fast. However that is the minority case and it probably doesn't apply here. Think of an alternative collection you really want. map< int, set<char> > perhaps? or map< int, vector< char > > ?
Your list has a lot of overhead for storing a few chars.

STL map insertion efficiency: [] vs. insert

There are two ways of map insertion:
m[key] = val;
Or
m.insert(make_pair(key, val));
My question is, which operation is faster?
People usually say the first one is slower, because the standard requires it to first insert a default-constructed element if 'key' does not exist in the map and then assign 'val' to that element.
But I don't see why the second way is better, because of 'make_pair'. make_pair is just a convenient way to make a 'pair' compared to pair<T1, T2>(key, val). Either way, it does two assignments: one assigns 'key' to 'pair.first' and the other assigns 'val' to 'pair.second'. After the pair is made, the map inserts an element initialized from 'pair.second'.
So the first way is: 1. default construction of typeof(val), 2. assignment.
The second way is: 1. assignment, 2. copy construction of typeof(val).

Both accomplish different things.
m[key] = val;
Will insert a new key-value pair if the key doesn't exist already, or it will overwrite the old value mapped to the key if it already exists.
m.insert(make_pair(key, val));
Will only insert the pair if key doesn't exist yet, it will never overwrite the old value. So, choose accordingly to what you want to accomplish.
For the question what is more efficient: profile. :P Probably the first way I'd say though. The assignment (aka copy) is the case for both ways, so the only difference lies in construction. As we all know and should implement, a default construction should basically be a no-op, and thus be very efficient. A copy is exactly that - a copy. So in way one we get a "no-op" and a copy, and in way two we get two copies.
Edit: In the end, trust what your profiling tells you. My analysis was off, as @Matthieu mentions in his comment, but that was my guess. :)
Then, we have C++0x coming, and the double-copy on the second way will be naught, as the pair can simply be moved now. So in the end, I think it falls back on my first point: Use the right way to accomplish the thing you want to do.

On a lightly loaded system with plenty of memory, this code:
#include <map>
#include <iostream>
#include <ctime>
#include <string>
using namespace std;
typedef map <unsigned int,string> MapType;
const unsigned int NINSERTS = 1000000;
int main() {
MapType m1;
string s = "foobar";
clock_t t = clock();
for ( unsigned int i = 0; i < NINSERTS; i++ ) {
m1[i] = s;
}
cout << clock() - t << endl;
MapType m2;
t = clock();
for ( unsigned int i = 0; i < NINSERTS; i++ ) {
m2.insert( make_pair( i, s ) );
}
cout << clock() - t << endl;
}
produces:
1547
1453
or similar values on repeated runs. So insert is (in this case) marginally faster.

Performance wise I think they are mostly the same in general. There may be some exceptions for a map with large objects, in which case you should use [] or perhaps emplace which creates fewer temporary objects than 'insert'. See the discussion here for details.
You can, however, get a performance bump in special cases if you use the 'hint' function on the insert operator. So, looking at option 2 from here:
iterator insert (const_iterator position, const value_type& val);
the 'insert' operation can be reduced to amortized constant time (from O(log n)) if you give a good hint (often the case if you know you are adding things at the back of your map).
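For the "adding things at the back" case, the hint is simply end(). A minimal sketch, reusing the MapType from the benchmark above and a made-up function name fillAscending:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

typedef std::map<unsigned int, std::string> MapType;

// Fills the map with keys 0..count-1 in ascending order. Each new key
// belongs at the back, so end() is always the right hint.
MapType fillAscending(unsigned int count, const std::string& value) {
    MapType m;
    for (unsigned int i = 0; i < count; ++i) {
        MapType::iterator pos = m.insert(m.end(), std::make_pair(i, value));
        assert(pos->first == i);
    }
    return m;
}
```

A wrong hint does not break anything: the element still ends up in the correct position, only without the speed-up.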

We have to refine the analysis by mentioning that the relative performance depends on the type(size) of the objects being copied as well.
I did a similar experiment (to nbt) with a map of (int -> set). I know it is a terrible thing to do, but, illustrative for this scenario. The "value", in this case a set of ints, has 20 elements in it.
I executed a million iterations of the []= vs. insert operations and computed RDTSC cycles per iteration:
[] = set            | 10731 cycles
insert(make_pair<>) | 26100 cycles
This shows the magnitude of the penalty added by the copying. Of course, C++11 (move constructors) will change the picture.

My take on it:
It is worth remembering that a map is typically a balanced binary tree, so most modifications and checks take O(log N).
It really depends on the problem you are trying to solve.
1) If you just want to insert the value knowing that it is not there yet, then [] would do two things:
a) check whether the item is there or not
b) if it is not there, create the pair and do what insert does (double work of O(log N)), so I would use insert.
2) If you are not sure whether it is there or not, then:
a) if you already checked whether the item is there by doing something like if( map.find( item ) == map.end() ) a couple of lines above, then use insert, because of the double work [] would perform
b) if you didn't check, then it depends, because insert won't modify the value if it is there and [] will; otherwise they are equal.

My answer is not on efficiency but on safety, which is relevant to choosing an insertion algorithm:
The [] and insert() calls can trigger destructors of the elements, for example when the temporary objects they create are destroyed. This may have dangerous side effects if, say, your destructors have critical behaviors inside.
After running into such a hazard, I stopped relying on STL's implicit lazy insertion features and always use explicit checks if my objects have behaviors in their ctors/dtors.
See this question:
Destructor called on object when adding it to std::list

STL map - insert or update

I have a map of objects and I want to update the object mapped to a key, or create a new object and insert into the map. The update is done by a different function that takes a pointer to the object (void update(MyClass *obj))
What is the best way to "insert or update" an element in a map?

The operator[]

With something like the following snippet:
std::map<Key, Value>::iterator i = amap.find(key);
if (i == amap.end())
amap.insert(std::make_pair(key, CreateFunction()));
else
UpdateFunction(&(i->second));
If you want to measure something that might improve performance, you might want to use .lower_bound() to find where an entry belongs and use that as a hint to insert in the case where you need to insert a new object.
std::map<Key, Value>::iterator i = amap.lower_bound(key);
if (i == amap.end() || i->first != key)
amap.insert(i, std::make_pair(key, CreateFunction()));
// Might need to check and decrement i.
// Only guaranteed to be amortized constant
// time if insertion is immediately after
// the hint position.
else
UpdateFunction(&(i->second));
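A self-contained version of the lower_bound pattern might look like the sketch below. The name addOrUpdate is hypothetical, and the key test uses the map's own key_comp() instead of operator!= so that it also works with custom comparators, which is a small departure from the snippet above.

```cpp
#include <cassert>
#include <map>

// Inserts (k, v) or overwrites the existing value, using a single tree search.
template <typename Map>
typename Map::iterator addOrUpdate(Map& m,
                                   const typename Map::key_type& k,
                                   const typename Map::mapped_type& v) {
    typename Map::iterator lb = m.lower_bound(k);  // first key not less than k
    if (lb != m.end() && !m.key_comp()(k, lb->first)) {
        lb->second = v;  // key already present: update in place
        return lb;
    }
    // Key absent: lb is where it belongs, so pass it as the hint.
    typename Map::iterator it = m.insert(lb, typename Map::value_type(k, v));
    assert(!m.key_comp()(k, it->first) && !m.key_comp()(it->first, k));
    return it;
}
```

As with any hinted insert, whether this is measurably faster than a plain find followed by insert depends on the map size and the element types.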

something like:
map<int,MyClass*> mymap;
map<int,MyClass*>::iterator it;
MyClass* dummy = new MyClass();
mymap.insert(pair<int,MyClass*>(2,dummy));
it = mymap.find(2);
update(it->second);
here a nice reference link

The operator[] already does what you want. See the reference for details.

The return value of insert is "a pair consisting of an iterator to the inserted element (or to the element that prevented the insertion) and a bool denoting whether the insertion took place."
Therefore you can simply do
auto result = values.insert({ key, CreateFunction()});
if (!result.second)
UpdateFunction(&(result.first->second));
NOTE:
Since your question involved raw pointers, and you said you wanted your Update function to take a pointer, I have made that assumption in my snippet. Assume that CreateFunction() returns a pointer and UpdateFunction() expects a pointer.
I'd strongly advise against using raw pointers though.

In C++17, the function insert_or_assign inserts the element if the key does not exist and updates the value if it does.
