Efficient way to get subset of std::set - c++

I call a library function that accepts pointer to std::set and processes elements of it.
However, it processes only certain number of elements (let's say 100) and if the set has more elements it simply throws an exception. However, I receive set of much larger size. So I need efficient way to get subset of std::set.
Currently, I am copying 100 elements to temporary set and pass it to the function.
struct MyClass
{
// Class having considerably large size instance
};
// Library function that processes set having only 100 elements at a time
void ProcessSet (std::set<MyClass>* ptrMyClassObjectsSet);
void FunctionToProcessLargeSet (std::set<MyClass>& MyClassObjSet)
{
std::set<MyClass> MyClass100ObjSet;
// Cannot pass MyClassObject as it is to ProcessSet as it might have large number of elements
// So create set of 100 elements and pass it to the function
std::set<MyClass>::iterator it;
for (it = MyClassObjSet.begin(); it != MyClassObjSet.end(); ++it)
{
MyClass100ObjSet.insert (*it);
if (MyClass100ObjSet.size() == 100)
{
ProcessSet (&MyClass100ObjSet);
MyClass100ObjSet.clear();
}
}
// Prrocess remaining elments
ProcessSet (&MyClass100ObjSet);
MyClass100ObjSet.clear();
}
But it's impacting performance. Can anyone please suggest better ways to do this?

Since it looks like you are locked into having to use a subset. I have tweaked your code a little and I think it might be faster for you. It is still an O(n) operation but there is no branching in the for loop which should increase performance.
void FunctionToProcessLargeSet(std::set<MyClass>& MyClassObjSet)
{
int iteration = MyClassOgjSet.size() / 100; // get number of times we have collection of 100
auto it = MyClassObjSet.begin();
auto end = MyClassObjSet.begin();
for (; iteration == 0; --iteration)
{
std::advance(end, 100); // move end 100 away
std::set<MyClass> MyClass100ObjSet(it, std::advance(it, end)); // construct with iterator range
std::advance(it, 100); // advace it to end pos
ProcessSet(&MyClass100ObjSet); // process subset
}
if (MyClassOgjSet.size() % 100 != 0) // get last subset
{
std::set<MyClass> MyClass100ObjSet(it, MyClassObjSet.end());
// Prrocess remaining elments
ProcessSet(&MyClass100ObjSet);
}
}
Let me know if this runs faster for you.

Well, that sounds like a bad library design, but if you have to work with what you have then:
If library can accept a pair of iterators - that's the easy way to go using std::advance
If it's templated and can accept std::set<T>, then copying a part of your set to std::set<std::reference_wrapper<T>> might perform better if copying T is slow (see here to see that no copies are created)
If it only acceptsstd::set<ParticularObjectType>, I don't see a way around copying the data.
Hope this helps,
Rostislav.

Related

Alternatives to vector.size() after running remove_if()

I'm writing a piece of C++ that checks to see whether particular elements of a vector return true, and uses remove_if() to remove them if not. After this, I use vector.size() to check to see if there are any elements remaining in the vector, and then return the function if not.
At the moment, I do vector.erase() after remove_if(), as it doesn't actually reduce the size of the vector. However, this code needs to run fast, and recursively changing the size of a vector in memory is probably not ideal. However, returning if the vector is zero (instead of running the rest of the function) probably also saves time.
Is there a nice way to check how many elements remain in the vector without erasing?
here's the code:
auto remove = remove_if(sight.begin(), sight.end(), [](const Glance *a) {
return a->occupied;
});
sight.erase(remove, sight.end());
if (sight.size() == 0) {
// There's nowhere to move
return;
}
EDIT:
thanks for the help + the guidance. From the answers it's clear that the wording of the question isn't quite correct: erase() doesn't change the size of the vector in memory, but the capacity. I had mis-remembered the explanation from this post, which nicely articulates why erase() is slower than remove() for multiple removals (as you have to copy the location of the elements in the vector multiple times).
I used Instruments to benchmark the code I had originally against Johannes' suggestion, and the difference was marginal, though Johannes' was consistently slightly faster (~9.8% weight vs ~8.3% weight for the same code otherwise). The linked article should explain why. ✨
You can use std::distance(sight.begin(), remove); to get the number of remaining elements:
auto remove = remove_if(sight.begin(), sight.end(), [](const Glance *a) {
return a->occupied;
});
size_t remaining = std::distance(sight.begin(), remove);
if (remaining == 0) {
// There's nowhere to move
return;
}
But if you are only interested in 0 you can do:
auto remove = remove_if(sight.begin(), sight.end(), [](const Glance *a) {
return a->occupied;
});
if (remove == sight.begin()) {
// There's nowhere to move
return;
}
Instead of erasing elements that satisfy a criteria and then checking the number of remaining elements just to find out how many elements did not satisfy the criteria, you could simply iterate the container and count those elements. The standard library has an algorithm for that: std::count_if.

Constraining remove_if on only part of a C++ list

I have a C++11 list of complex elements that are defined by a structure node_info. A node_info element, in particular, contains a field time and is inserted into the list in an ordered fashion according to its time field value. That is, the list contains various node_info elements that are time ordered. I want to remove from this list all the nodes that verify some specific condition specified by coincidence_detect, which I am currently implementing as a predicate for a remove_if operation.
Since my list can be very large (order of 100k -- 10M elements), and for the way I am building my list this coincidence_detect condition is only verified by few (thousands) elements closer to the "lower" end of the list -- that is the one that contains elements whose time value is less than some t_xv, I thought that to improve speed of my code I don't need to run remove_if through the whole list, but just restrict it to all those elements in the list whose time < t_xv.
remove_if() though does not seem however to allow the user to control up to which point I can iterate through the list.
My current code.
The list elements:
struct node_info {
char *type = "x";
int ID = -1;
double time = 0.0;
bool spk = true;
};
The predicate/condition for remove_if:
// Remove all events occurring at t_event
class coincident_events {
double t_event; // Event time
bool spk; // Spike condition
public:
coincident_events(double time,bool spk_) : t_event(time), spk(spk_){}
bool operator()(node_info node_event){
return ((node_event.time==t_event)&&(node_event.spk==spk)&&(strcmp(node_event.type,"x")!=0));
}
};
The actual removing from the list:
void remove_from_list(double t_event, bool spk_){
// Remove all events occurring at t_event
coincident_events coincidence(t_event,spk_);
event_heap.remove_if(coincidence);
}
Pseudo main:
int main(){
// My list
std::list<node_info> event_heap;
...
// Populate list with elements with random time values, yet ordered in ascending order
...
remove_from_list(0.5, true);
return 1;
}
It seems that remove_if may not be ideal in this context. Should I consider instead instantiating an iterator and run an explicit for cycle as suggested for example in this post?
It seems that remove_if may not be ideal in this context. Should I consider instead instantiating an iterator and run an explicit for loop?
Yes and yes. Don't fight to use code that is preventing you from reaching your goals. Keep it simple. Loops are nothing to be ashamed of in C++.
First thing, comparing double exactly is not a good idea as you are subject to floating point errors.
You could always search the point up to where you want to do a search using lower_bound (I assume you list is properly sorted).
The you could use free function algorithm std::remove_if followed by std::erase to remove items between the iterator returned by remove_if and the one returned by lower_bound.
However, doing that you would do multiple passes in the data and you would move nodes so it would affect performance.
See also: https://en.cppreference.com/w/cpp/algorithm/remove
So in the end, it is probably preferable to do you own loop on the whole container and for each each check if it need to be removed. If not, then check if you should break out of the loop.
for (auto it = event_heap.begin(); it != event_heap.end(); )
{
if (coincidence(*it))
{
auto itErase = it;
++it;
event_heap.erase(itErase)
}
else if (it->time < t_xv)
{
++it;
}
else
{
break;
}
}
As you can see, code can easily become quite long for something that should be simple. Thus, if you need to do that kind of algorithm often, consider writing you own generic algorithm.
Also, in practice you might not need to do a complete search for the end using the first solution if you process you data in increasing time order.
Finally, you might consider using an std::set instead. It could lead to simpler and more optimized code.
Thanks. I used your comments and came up with this solution, which seemingly increases speed by a factor of 5-to-10.
void remove_from_list(double t_event,bool spk_){
coincident_events coincidence(t_event,spk_);
for(auto it=event_heap.begin();it!=event_heap.end();){
if(t_event>=it->time){
if(coincidence(*it)) {
it = event_heap.erase(it);
}
else
++it;
}
else
break;
}
}
The idea to make erase return it (as already ++it) was suggested by this other post. Note that in this implementation I am actually erasing all list elements up to t_event value (meaning, I pass whatever I want for t_xv).

Couple performance questions (one bigger vector vs smaller chunks vectors) and Is it worth to store iteration index for jump access of vector?

I am a bit curiuous about vector optimization and have couple questions about it. (I am still a beginner in programing)
example:
struct GameInfo{
EnumType InfoType;
// Other info...
};
int _lastPosition;
// _gameInfoV is sorted beforehand
std::vector<GameInfo> _gameInfoV;
// The tick function is called every game frame (in "perfect" condition it's every 1.0/60 second)
void BaseClass::tick()
{
for (unsigned int i = _lastPosition; i < _gameInfoV.size(); i++{
auto & info = _gameInfoV[i];
if( !info.bhasbeenAdded ){
if( DoWeNeedNow() ){
_lastPosition++;
info.bhasbeenAdded = true;
_otherPointer->DoSomething(info.InfoType);
// Do something more with "info"....
}
else return; //Break the cycle since we don't need now other "info"
}
}
}
The _gameInfoV vector size can be between 2000 and 5000.
My main 2 questions are:
Is it better to leave the way how it is or it's better to make smaller chunks of it, which is checked for every different GameInfo.InfoType
Is it worth the hassle of storing the last start position index of the vector instead of iterating from the beginning.
Note that if using smaller vectors there will be like 3 to 6 of them
The third thing is probably that I am not using vector iterators, but is it safe to use then like this?
std::vector<GameInfo>::iterator it = _gameInfoV.begin() + _lastPosition;
for (it = _gameInfoV.begin(); it != _gameInfoV.end(); ++it){
//Do something
}
Note: It will be used in smartphones, so every optimization will be appreciated, when targeting weaker phones.
-Thank you
Don't; except if you frequently move memory around
It is no hassle if you do it correctly:
std::vector<GameInfo>::const_iterator _lastPosition(gameInfoV.begin());
// ...
for (std::vector<GameInfo>::iterator info=_lastPosition; it!=_gameInfoV.end(); ++info)
{
if (!info->bhasbeenAdded)
{
if (DoWeNeedNow())
{
++_lastPosition;
_otherPointer->DoSomething(info->InfoType);
// Do something more with "info"....
}
else return; //Break the cycle since we don't need now other "i
}
}
Breaking one vector up into several smaller vectors in general doesn't improve performance. It could even slightly degrade performance because the compiler has to manage more variables, which take up more CPU registers etc.
I don't know about gaming so I don't understand the implication of GameInfo.InfoType. Your processing time and CPU resource requirements are going to increase if you do more total iterations through loops (where each loop iteration performs the same type of operation). So if separating the vectors causes you to avoid some loop iterations because you can skip entire vectors, that's going to increase performance of your app.
iterators are the most secure way to iterate through containers. But for a vector I often just use the index operator [] and my own indexer (a plain old unsigned integer).

a pushBack() function, as opposite to popFront()

Can I use popFront() and then eventually push back what was poped? The number of calls to popFront() might be more than one (but not much greater than it, say < 10, if does matter). This is also the number of calls which the imaginary pushBack() function will be called too.
for example:
string s = "Hello, World!";
int n = 5;
foreach(i; 0 .. n) {
// do something with s.front
s.popFront();
}
if(some_condition) {
foreach(i; 0 .. n) {
s.pushBack();
}
}
writeln(s); // should output "Hello, World!" since number of poped is same as pushed back.
I think popFront() does use .ptr but I'm not sure if it in D does makes any difference and can help anyway to reach my goal easily (i.e, in D's way and not write my own with a Circular buffer or so).
A completely different approach to reach it is very welcome too.
A range is either generative (e.g. if it's a list of random numbers), or it's a view into a container. In neither case does it make sense to push anything onto it. As you call popFront, you're iterating through the list and shrinking your view of the container. If you think of a range being like two C++ iterators for a moment, and you have something like
struct IterRange(T)
{
#property bool empty() { return iter == end; }
#property T front() { return *iter; }
void popFront() { ++iter; }
private Iterator iter;
private Iterator end;
}
then it will be easier to understand. If you called popFront, it would move the iterator forward by one, thereby changing which element you're looking at, but you can't add elements in front of it. That would require doing something like an insertion on the container itself, and maybe the iterator or range could be used to tell the container where you want an alement inserted, but the iterator or range can't do that itself. The same goes if you have a generative range like
struct IncRange(T)
{
#property bool empty() { value == T.max; }
#property T front() { return value; }
void popFront() { ++value; }
private T value;
}
It keeps incrementing the value, and there is no container backing it. So, it doesn't even have anywhere that you could push a value onto.
Arrays are a little bit funny because they're ranges but they're also containers (sort of). They have range semantics when popping elements off of them or slicing them, but they don't own their own memory, and once you append to them, you can get a completely different chunk of memory with the same values. So, it is sort of a range that you can add and remove elements from - but you can't do it using the range API. So, you could do something like
str = newChar ~ str;
but that's not terribly efficient. You could make it more efficient by creating a new array at the target size and then filling in its elements rather than concatenating repeatedly, but regardless, pushing something on the the front of an array is not a particularly idiomatic or efficient thing to be doing.
Now, if what you're looking to do is just reset the range so that it once again refers to the elements that were popped off rather than really push elements onto it - that is, open up the window again so that it shows what it showed before - that's a bit different. It's still not supported by the range API at all (you can never unpop anything that was popped off). However, if the range that you're dealing with is a forward range (and arrays are), then you can save the range before you pop off the elements and then use that to restore the previous state. e.g.
string s = "Hello, World!";
int n = 5;
auto saved = s.save;
foreach(i; 0 .. n)
s.popFront();
if(some_condition)
s = saved;
So, you have to explicitly store the previous state yourself in order to restore it instead of having something like unpopFront, but having the range store that itself (as would be required for unpopFront) would be very inefficient in most cases (much is it might work in the iterator case if the range kept track of where the beginning of the container was).
No, there is no standard way to "unpop" a range or a string.
If you were to pass a slice of a string to a function:
fun(s[5..10]);
You'd expect that that function would only be able to see those 5 characters. If there was a way to "unpop" the slice, the function would be able to see the entire string.
Now, D is a system programming language, so expanding a slice is possible using pointer arithmetic and GC queries. But there is nothing in the standard library to do this for you.

Deleting elements from a vector

The following C++ code fills a vector with a number of objects and then removes some of these objects, but it looks like it deletes the wrong ones:
vector<Photon> photons;
photons = source->emitPhotons(); // fills vector with 300 Photon objects
for (int i=0; i<photons.size(); i++) {
bool useless = false;
// process photon, set useless to true for some
// remove useless photons
if (useless) {
photons.erase(photons.begin()+i);
}
}
Am I doing this correctly? I'm thinking the line photons.erase(photons.begin()+i); might be the problem?
Definietly the wrong way of doing it, you never adjust i down as you delete..
Work with iterators, and this problem goes away!
e.g.
for(auto it = photons.begin(); it != photons.end();)
{
if (useless)
it = photons.erase(it);
else
++it;
}
There are other ways using algorithms (such as remove_if and erase etc.), but above is clearest...
the elegant way would be:
std::vector<Photon> photons = source->emitPhotons();
photons.erase(
std::remove_if(photons.begin(), photons.end(), isUseless),
photons.end());
and:
bool isUseless(const Photon& photon) { /* whatever */ }
The proper version will look like:
for (vector<Photon>::iterator i=photons.begin(); i!=photons.end(); /*note, how the advance of i is made below*/) {
bool useless = false;
// process photon, set useless to true for some
// remove useless photons
if (useless) {
i = photons.erase(i);
} else {
++i;
}
}
You should work with stl::list in this case. Quoting the STL docs:
Lists have the important property that insertion and splicing do not invalidate iterators to list elements, and that even removal invalidates only the iterators that point to the elements that are removed.
So this would go along the lines of:
std::list<Photon> photons;
photons = source->emitPhotons();
std::list<Photon>::iterator i;
for(i=photons.begin();i!=photons.end();++i)
{
bool useless=false;
if(useless)
photons.erase(i);
}
Erasing elements in the middle of a vector is very inefficient ... the rest of the elements need to be "shifted" back one slot in order to fill-in the "empty" slot in the vector created by the call to erase. If you need to erase elements in the middle of a list-type data-structure without incurring such a penalty, and you don't need O(1) random access time (i.e., you're just trying to store your elements in a list that you'll copy or use somewhere else later, and you always iterate through the list rather than randomly accessing it), you should look into std::list which uses an underlying linked-list for its implementation, giving it O(1) complexity for modifications to the list like insert/delete.