Is it possible to process equality in a std::set comparator? - c++

I am sorry if the title isn't very descriptive, I was having a hard time figuring out how to name this question. This is pretty much the first time I need to use a set, though I've been using maps forever.
I don't think it is possible, but I need to ask. I would like to perform a specific action on a struct when I add it to my std::set, but only if equality is true.
For example, I can use a list and then sort() and unique() the list. In my predicate, I can do as I wish, since I will get the result if 2 values are equal.
Here is a quick example of what my list predicate looks like:
bool markovWeightOrdering (unique_ptr<Word>& w1, unique_ptr<Word>& w2) {
if (w1->word_ == w2->word_) {
w1->weight_++;
return true;
}
return false;
}
Does anyone have an idea how to achieve a similar result, while using a std::set for the obvious gain in performance (and simplicity), since my container needs to be unique anyways? Thank you for any help or guidance, it is much appreciated.

element in set are immutable, so you cannot modify them.
if you use set with pointer (or similar), the pointed object may be modified (but care to not modify the order). std::set::insert returns a pair with iterator and a boolean to tell if element has been inserted, so you may do something like:
auto p = s.insert(make_unique<Word>("test"));
if (p.second == false) {
(*p.first)->weight += 1;
}
Live example

Manipulating a compare operator is likely a bad idea.
You might use a std::set with a predicate, instead:
struct LessWord
{
bool operator () (const std::unique_ptr<Word>& w1, const std::unique_ptr<Word>& w2) {
return w1->key < w2->key;
}
};
typedef std::set<std::unique_ptr<Word>, LessWord> word_set;
Than you test at insert if the word is existing and increment the weight:
word_set words;
std::unique_ptr<Word> word_ptr;
auto insert = words.insert(word_ptr);
if( ! insert.second)
++(insert.first->get()->weight_);
Note: Doing this is breaking const correctness, logically. A set element is immutable, but the unique_ptr enables modifications (even a fatal modification of key values).

Related

Constraining remove_if on only part of a C++ list

I have a C++11 list of complex elements that are defined by a structure node_info. A node_info element, in particular, contains a field time and is inserted into the list in an ordered fashion according to its time field value. That is, the list contains various node_info elements that are time ordered. I want to remove from this list all the nodes that verify some specific condition specified by coincidence_detect, which I am currently implementing as a predicate for a remove_if operation.
Since my list can be very large (order of 100k -- 10M elements), and for the way I am building my list this coincidence_detect condition is only verified by few (thousands) elements closer to the "lower" end of the list -- that is the one that contains elements whose time value is less than some t_xv, I thought that to improve speed of my code I don't need to run remove_if through the whole list, but just restrict it to all those elements in the list whose time < t_xv.
remove_if() though does not seem however to allow the user to control up to which point I can iterate through the list.
My current code.
The list elements:
struct node_info {
char *type = "x";
int ID = -1;
double time = 0.0;
bool spk = true;
};
The predicate/condition for remove_if:
// Remove all events occurring at t_event
class coincident_events {
double t_event; // Event time
bool spk; // Spike condition
public:
coincident_events(double time,bool spk_) : t_event(time), spk(spk_){}
bool operator()(node_info node_event){
return ((node_event.time==t_event)&&(node_event.spk==spk)&&(strcmp(node_event.type,"x")!=0));
}
};
The actual removing from the list:
void remove_from_list(double t_event, bool spk_){
// Remove all events occurring at t_event
coincident_events coincidence(t_event,spk_);
event_heap.remove_if(coincidence);
}
Pseudo main:
int main(){
// My list
std::list<node_info> event_heap;
...
// Populate list with elements with random time values, yet ordered in ascending order
...
remove_from_list(0.5, true);
return 1;
}
It seems that remove_if may not be ideal in this context. Should I consider instead instantiating an iterator and run an explicit for cycle as suggested for example in this post?
It seems that remove_if may not be ideal in this context. Should I consider instead instantiating an iterator and run an explicit for loop?
Yes and yes. Don't fight to use code that is preventing you from reaching your goals. Keep it simple. Loops are nothing to be ashamed of in C++.
First thing, comparing double exactly is not a good idea as you are subject to floating point errors.
You could always search the point up to where you want to do a search using lower_bound (I assume you list is properly sorted).
The you could use free function algorithm std::remove_if followed by std::erase to remove items between the iterator returned by remove_if and the one returned by lower_bound.
However, doing that you would do multiple passes in the data and you would move nodes so it would affect performance.
See also: https://en.cppreference.com/w/cpp/algorithm/remove
So in the end, it is probably preferable to do you own loop on the whole container and for each each check if it need to be removed. If not, then check if you should break out of the loop.
for (auto it = event_heap.begin(); it != event_heap.end(); )
{
if (coincidence(*it))
{
auto itErase = it;
++it;
event_heap.erase(itErase)
}
else if (it->time < t_xv)
{
++it;
}
else
{
break;
}
}
As you can see, code can easily become quite long for something that should be simple. Thus, if you need to do that kind of algorithm often, consider writing you own generic algorithm.
Also, in practice you might not need to do a complete search for the end using the first solution if you process you data in increasing time order.
Finally, you might consider using an std::set instead. It could lead to simpler and more optimized code.
Thanks. I used your comments and came up with this solution, which seemingly increases speed by a factor of 5-to-10.
void remove_from_list(double t_event,bool spk_){
coincident_events coincidence(t_event,spk_);
for(auto it=event_heap.begin();it!=event_heap.end();){
if(t_event>=it->time){
if(coincidence(*it)) {
it = event_heap.erase(it);
}
else
++it;
}
else
break;
}
}
The idea to make erase return it (as already ++it) was suggested by this other post. Note that in this implementation I am actually erasing all list elements up to t_event value (meaning, I pass whatever I want for t_xv).

Erase an element from a map then put it back

Suppose I have a backtracking algorithm where I need to remove an element from a map, do something, then put it back. I am not sure if there is a good way to do it:
func(std::<K, V> map &dict) {
for (auto i : dict) {
remove i from dict;
func(dict);
put i back to dict;
}
}
I know there are ways to delete an element from map in here but I am not sure if there are ways to achieve what I described above. I am aware that I can create a new dict in for loop and pass it in func but I am wondering if it can be done using the same dict.
What you ask is definitely possible. Here is one way to do it while trying to keep things simple and efficient:
void func(std::map<K, V> &dict) {
for (auto i = dict.cbegin(); i != dict.cend(); ) {
auto old = *i;
i = dict.erase(i); // i now points to the next element (or dict.end())
some_other_func(dict);
dict.insert(i, old); // i is used a hint where to re-insert the old value
}
}
By calling std::map::erase with an iterator argument and std::map::insert with a hint, the complexity of each iteration through this loop is amortized constant. Note that I assumed your calling of func in line 5 was actually supposed to be some_other_func, because calling func recursively would invalidate the iterator you carefully set in line 4.
However, this is not the most efficient way to do this sort of processing (#rakurai has a suggestion that should be considered).
This question has an answer here.
However, your problem may be that you would like the function to ignore the key K while processing dict. How about a second parameter "K ignore" and make the function skip that pair? You could give it a default value in the declaration if that breaks other code, or overload the function.

Iterating over Unorderd_map using indexed for loop

I am trying to access values stored in an unorderd_map using a for loop, but I am stuck trying to access values using the current index of my loop. Any suggestion, or link to look-on? thanks. [Hint: I don't want to use an iterator].
my sample code:
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;
int main()
{
unordered_map<int,string>hash_table;
//filling my hash table
hash_table.insert(make_pair(1,"one"));
hash_table.insert(make_pair(2,"two"));
hash_table.insert(make_pair(3,"three"));
hash_table.insert(make_pair(4,"four"));
//now, i want to access values of my hash_table with for loop, `i` as index.
//
for (int i=0;i<hash_table.size();i++ )
{
cout<<"Value at index "<<i<<" is "<<hash_table[i].second;//I want to do something like this. I don't want to use iterator!
}
return 0;
}
There are two ways to access an element from an std::unordered_map.
An iterator.
Subscript operator, using the key.
I am stuck trying to access values using the current index of my loop
As you can see, accessing an element using the index is not listed in the possible ways to access an element.
I'm sure you realize that since the map is unordered the phrase element at index i is quite meaningless in terms of ordering. It is possible to access the ith element using the begin iterator and std::advance but...
Hint: I don't want to use an iterator].
Hint: You just ran out of options. What you want to do is not possible. Solution: Start wanting to use tools that are appropriate to achieving your objective.
If you want to iterate a std::unordered_map, then you use iterators because that's what they're for. If you don't want to use iterators, then you cannot iterate an std::unordered_map. You can hide the use of iterators with a range based for loop, but they're still used behind the scenes.
If you want to iterate something using a position - index, then what you need is an array such as a std::vector.
First, why would you want to use an index versus an iterator?
Suppose you have a list of widgets you want your UI to draw. Each widget can have its own list of child widgets, stored in a map. Your options are:
Make each widget draw itself. Not ideal since widgets are now coupled to the UI kit you are using.
Return the map and use an iterator in the drawing code. Not ideal because now the drawing code knows your storage mechanism.
An API that can avoid both of these might look like this.
const Widget* Widget::EnumerateChildren(size_t* io_index) const;
You can make this work with maps but it isn't efficient. You also can't guarantee the stability of the map between calls. So this isn't recommended but it is possible.
const Widget* Widget::EnumerateChildren(size_t* io_index) const
{
auto& it = m_children.begin();
std::advance(it, *io_index);
*io_index += 1;
return it->second;
}
You don't have to use std::advance and could use a for loop to advance the iterator yourself. Not efficient or very safe.
A better solution to the scenario I described would be to copy out the values into a vector.
void Widget::GetChildren(std::vector<Widget*>* o_children) const;
You can't do it without an iterator. An unordered map could store the contents in any order and move them around as it likes. The concept of "3rd element" for example means nothing.
If you had a list of the keys from the map then you could index into that list of keys and get what you want. However unless you already have it you would need to iterate over the map to generate the list of keys so you still need an iterator.
An old question.
OK, I'm taking the risk: here may be a workaround (not perfect though: it is just a workaround).
This post is a bit long because I explain why this may be needed. In fact one might want to use the very same code to load and save data from and to a file. This is very useful to synchronize the loading and saving of data, to keep the same data, the same order, and the same types.
For example, if op is load_op or save_op:
load_save_data( var1, op );
load_save_data( var2, op );
load_save_data( var3, op );
...
load_save_data hides the things performed inside. Maintenance is thus much more easy.
The problem is when it comes to containers. For example (back to the question) it may do this for sets (source A) to save data:
int thesize = theset.size();
load_save(thesize, load); // template (member) function with 1st arg being a typename
for( elem: theset) {
load_save_data( thesize, save_op );
}
However, to read (source B):
int thesize;
load_save_data( thesize, save);
for( int i=0; i<thesize, i++) {
Elem elem;
load_save_data( elem, load_op);
theset.insert(elem);
}
So, the whole source code would be something like this, with too loops:
if(op == load_op) { A } else { B }
The problem is there are two different kinds of loop, and it would be nice to merge them as one only. Ideally, it would be nice to be able to do:
int thesize;
load_save_data( thesize, save);
for( int i=0; i<thesize, i++) {
Elem elem;
if( op == save_op ) {
elem=theset[i]; // not possible
}
load_save_data( elem, op);
if( op == load_op ) {
theset.insert(elem);
}
}
(as this code is used in different contexts, care may be taken to provide enough information to the compiler to allow it the strip the unnecessary code (the right "if"), not obvious but possible)
This way, each call to load_save_data is in the same order, the same type. You forget a field for both or none, but everything is kept synchronized between save and load. You may add a variable, change a type, change the order etc in one place only. The code maintenance is thus easier.
A solution to the impossible "theset[i]" is indeed to use a vector or a map instead of a set but you're losing the properties of a set (avoid two identical items).
So a workaround (but it has a heavy price: efficiency and simplicity) is something like:
void ...::load_save( op )
{
...
int thesize;
set<...> tmp;
load_save_data( thesize, save);
for( int i=0; i<thesize, i++) {
Elem elem;
if( op == save_op ) {
elem=*(theset.begin()); \
theset.erase(elem); > <-----
tmp.insert(elem); /
}
load_save_data( elem, op);
if( op == load_op ) {
theset.insert(elem);
}
}
if(op == save_op) {
theset.insert(tmp.begin(), tmp.end()); <-----
}
...
}
Not very beautiful but it does the trick, and (IMHO) itis the closest answer to the question.

STL map insertion efficiency: [] vs. insert

There are two ways of map insertion:
m[key] = val;
Or
m.insert(make_pair(key, val));
My question is, which operation is faster?
People usually say the first one is slower, because the STL Standard at first 'insert' a default element if 'key' is not existing in map and then assign 'val' to the default element.
But I don't see the second way is better because of 'make_pair'. make_pair actually is a convenient way to make 'pair' compared to pair<T1, T2>(key, val). Anyway, both of them do two assignments, one is assigning 'key' to 'pair.first' and two is assigning 'val' to 'pair.second'. After pair is made, map inserts the element initialized by 'pair.second'.
So the first way is 1. 'default construct of typeof(val)' 2. assignment
the second way is 1. assignment 2. 'copy construct of typeof(val)'
Both accomplish different things.
m[key] = val;
Will insert a new key-value pair if the key doesn't exist already, or it will overwrite the old value mapped to the key if it already exists.
m.insert(make_pair(key, val));
Will only insert the pair if key doesn't exist yet, it will never overwrite the old value. So, choose accordingly to what you want to accomplish.
For the question what is more efficient: profile. :P Probably the first way I'd say though. The assignment (aka copy) is the case for both ways, so the only difference lies in construction. As we all know and should implement, a default construction should basically be a no-op, and thus be very efficient. A copy is exactly that - a copy. So in way one we get a "no-op" and a copy, and in way two we get two copies.
Edit: In the end, trust what your profiling tells you. My analysis was off like #Matthieu mentions in his comment, but that was my guessing. :)
Then, we have C++0x coming, and the double-copy on the second way will be naught, as the pair can simply be moved now. So in the end, I think it falls back on my first point: Use the right way to accomplish the thing you want to do.
On a lightly loaded system with plenty of memory, this code:
#include <map>
#include <iostream>
#include <ctime>
#include <string>
using namespace std;
typedef map <unsigned int,string> MapType;
const unsigned int NINSERTS = 1000000;
int main() {
MapType m1;
string s = "foobar";
clock_t t = clock();
for ( unsigned int i = 0; i < NINSERTS; i++ ) {
m1[i] = s;
}
cout << clock() - t << endl;
MapType m2;
t = clock();
for ( unsigned int i = 0; i < NINSERTS; i++ ) {
m2.insert( make_pair( i, s ) );
}
cout << clock() - t << endl;
}
produces:
1547
1453
or similar values on repeated runs. So insert is (in this case) marginally faster.
Performance wise I think they are mostly the same in general. There may be some exceptions for a map with large objects, in which case you should use [] or perhaps emplace which creates fewer temporary objects than 'insert'. See the discussion here for details.
You can, however, get a performance bump in special cases if you use the 'hint' function on the insert operator. So, looking at option 2 from here:
iterator insert (const_iterator position, const value_type& val);
the 'insert' operation can be reduced to constant time (from log(n)) if you give a good hint (often the case if you know you are adding things at the back of your map).
We have to refine the analysis by mentioning that the relative performance depends on the type(size) of the objects being copied as well.
I did a similar experiment (to nbt) with a map of (int -> set). I know it is a terrible thing to do, but, illustrative for this scenario. The "value", in this case a set of ints, has 20 elements in it.
I execute a million iterations of the []= Vs. insert operations and do RDTSC/iter-count.
[] = set | 10731 cycles
insert(make_pair<>) | 26100 cycles
It shows the magnitude of penalty added due to the copying. Of course, CPP11(move ctor's)
will change the picture.
My take on it:
Worth reminding that maps is a balanced binary tree, most of the modifications and checks take O(logN).
Depends really on the problem you are trying to solve.
1) if you just want to insert the value knowing that it is not there yet,
then [] would do two things:
a) check if the item is there or not
b) if it is not there will create pair and do what insert does (
double work of O( logN ) ), so I would use insert.
2) if you are not sure if it is there or not, then a) if you did check if the item is there by doing something like if( map.find( item ) == mp.end() ) couple of lines above somewhere, then use insert, because of double work [] would perform b) if you didn't check, then it depends, cause insert won't modify the value if it is there, [] will, otherwise they are equal
My answer is not on efficiency but on safety, which is relevant to choosing an insertion algorithm:
The [] and insert() calls would trigger destructors of the elements. This may have dangerous side effects if, say, your destructors have critical behaviors inside.
After such a hazard, I stopped relying on STL's implicit lazy insertion features and always use explicit checks if my objects have behaviors in their ctors/dtors.
See this question:
Destructor called on object when adding it to std::list

std::map insert or std::map find?

Assuming a map where you want to preserve existing entries. 20% of the time, the entry you are inserting is new data. Is there an advantage to doing std::map::find then std::map::insert using that returned iterator? Or is it quicker to attempt the insert and then act based on whether or not the iterator indicates the record was or was not inserted?
The answer is you do neither. Instead you want to do something suggested by Item 24 of Effective STL by Scott Meyers:
typedef map<int, int> MapType; // Your map type may vary, just change the typedef
MapType mymap;
// Add elements to map here
int k = 4; // assume we're searching for keys equal to 4
int v = 0; // assume we want the value 0 associated with the key of 4
MapType::iterator lb = mymap.lower_bound(k);
if(lb != mymap.end() && !(mymap.key_comp()(k, lb->first)))
{
// key already exists
// update lb->second if you care to
}
else
{
// the key does not exist in the map
// add it to the map
mymap.insert(lb, MapType::value_type(k, v)); // Use lb as a hint to insert,
// so it can avoid another lookup
}
The answer to this question also depends on how expensive it is to create the value type you're storing in the map:
typedef std::map <int, int> MapOfInts;
typedef std::pair <MapOfInts::iterator, bool> IResult;
void foo (MapOfInts & m, int k, int v) {
IResult ir = m.insert (std::make_pair (k, v));
if (ir.second) {
// insertion took place (ie. new entry)
}
else if ( replaceEntry ( ir.first->first ) ) {
ir.first->second = v;
}
}
For a value type such as an int, the above will more efficient than a find followed by an insert (in the absence of compiler optimizations). As stated above, this is because the search through the map only takes place once.
However, the call to insert requires that you already have the new "value" constructed:
class LargeDataType { /* ... */ };
typedef std::map <int, LargeDataType> MapOfLargeDataType;
typedef std::pair <MapOfLargeDataType::iterator, bool> IResult;
void foo (MapOfLargeDataType & m, int k) {
// This call is more expensive than a find through the map:
LargeDataType const & v = VeryExpensiveCall ( /* ... */ );
IResult ir = m.insert (std::make_pair (k, v));
if (ir.second) {
// insertion took place (ie. new entry)
}
else if ( replaceEntry ( ir.first->first ) ) {
ir.first->second = v;
}
}
In order to call 'insert' we are paying for the expensive call to construct our value type - and from what you said in the question you won't use this new value 20% of the time. In the above case, if changing the map value type is not an option then it is more efficient to first perform the 'find' to check if we need to construct the element.
Alternatively, the value type of the map can be changed to store handles to the data using your favourite smart pointer type. The call to insert uses a null pointer (very cheap to construct) and only if necessary is the new data type constructed.
There will be barely any difference in speed between the 2, find will return an iterator, insert does the same and will search the map anyway to determine if the entry already exists.
So.. its down to personal preference. I always try insert and then update if necessary, but some people don't like handling the pair that is returned.
I would think if you do a find then insert, the extra cost would be when you don't find the key and performing the insert after. It's sort of like looking through books in alphabetical order and not finding the book, then looking through the books again to see where to insert it. It boils down to how you will be handling the keys and if they are constantly changing. Now there is some flexibility in that if you don't find it, you can log, exception, do whatever you want...
If you are concerned about efficiency, you may want to check out hash_map<>.
Typically map<> is implemented as a binary tree. Depending on your needs, a hash_map may be more efficient.
I don't seem to have enough points to leave a comment, but the ticked answer seems to be long winded to me - when you consider that insert returns the iterator anyway, why go searching lower_bound, when you can just use the iterator returned. Strange.
Any answers about efficiency will depend on the exact implementation of your STL. The only way to know for sure is to benchmark it both ways. I'd guess that the difference is unlikely to be significant, so decide based on the style you prefer.
map[ key ] - let stl sort it out. That's communicating your intention most effectively.
Yeah, fair enough.
If you do a find and then an insert you're performing 2 x O(log N) when you get a miss as the find only lets you know if you need to insert not where the insert should go (lower_bound might help you there). Just a straight insert and then examining the result is the way that I'd go.