Remove elements from first set element which second set contains without iteration - c++

I have two sets of pairs (I cannot use C++11):
std::set<std::pair<int,int> > first;
std::set<std::pair<int,int> > second;
and I need to remove from the first set all elements that are also in the second set. I can do this by iterating through the second set and erasing each matching element from the first, but I wonder: is there a way to do this without iteration?

If I understand correctly, you basically want to calculate the difference of first and second. There is an <algorithm> function for that (std::inserter comes from <iterator>). Note that it writes the difference into a new set rather than modifying first:
std::set<std::pair<int, int> > result;
std::set_difference(first.begin(), first.end(), second.begin(), second.end(), std::inserter(result, result.end()));

Yes, you can.
If you want to remove the elements, not just detect them, there is another <algorithm> function: remove_copy_if():
http://www.cplusplus.com/reference/algorithm/remove_copy_if/
IMHO it's not so difficult to understand how it works.

I wonder is there way to do this without iteration.
No. Internally, sets are balanced binary trees - there's no way to operate on them without iterating over the structure. (I assume you're interested in the efficiency of implementation, not the convenience in code, so I've deliberately ignored library routines that must iterate internally).
Sets are sorted though, so you could do one iteration over each, removing as you go (so the number of operations is the sum of the set sizes) instead of an iteration plus a lookup for each element (where the number of operations is the number of elements you're iterating over times log base 2 of the number of elements in the other set). Only if one of your sets is much smaller than the other will the iterate/find approach win out. If you look at the implementation of your library's set_difference function (mentioned in Amen's answer), it should show you how to do the two iterations nicely.
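The two-iterator walk described above can be sketched as follows. This is a minimal C++03-compatible sketch, not a library routine; the names PairSet and erase_common are made up for illustration.

```cpp
#include <set>
#include <utility>

typedef std::set<std::pair<int, int> > PairSet;  // hypothetical alias

// Erase from `first` every element that also appears in `second`,
// walking both sorted sets once (about first.size() + second.size() steps).
void erase_common(PairSet& first, const PairSet& second)
{
    PairSet::iterator it1 = first.begin();
    PairSet::const_iterator it2 = second.begin();
    while (it1 != first.end() && it2 != second.end()) {
        if (*it1 < *it2) {
            ++it1;               // only in first: keep it
        } else if (*it2 < *it1) {
            ++it2;               // only in second: nothing to erase
        } else {
            first.erase(it1++);  // in both: step forward before the erase invalidates it1
            ++it2;
        }
    }
}
```

The post-increment inside erase() is the usual C++03 idiom, since set::erase(iterator) returns void before C++11.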
If you want something more efficient, you need to think about how to achieve that earlier: for example, storing your pairs as flags in identically sized two-dimensional matrices such that you can AND the first with the negation of the second. Whether that's practical depends on the range of int values you're storing, whether the amount of memory needed is ok for your purposes....

Related

C++ efficient way to store and update sorted items

I have an operation that continuously generates random solutions (std::vector<float>). I evaluate the solutions against a mathematical function to see their usefulness (float). I would like to store the top 10 solutions at all times. What would be the most efficient way to do this in C++?
I need to store both the solutions(std::vector) and their usefulness (float). I am performing several hundred thousands of evaluations and hence I am in need of an efficient solution.
Edit:
I am aware of sorting methods. I am looking for methods other than sorting and storing the values. Looking for better data structures if any.
You evaluate the float score() function for the current std::vector<T> solution and store both in a std::pair<vector<T>, float>.
You use a std::priority_queue< pair<vector<T>, float> > to store the 10 best solutions together with their scores. std::priority_queue is a heap, so it gives you cheap access to one extreme element according to a compare function that you supply. Set the comparison up on the score (score_a > score_b) so that the worst of the stored solutions, the current 10th, sits at top().
Store the first 10 pairs, then compare each new one with the top of the heap: if score(new) > score(10th), call p.pop() to get rid of the old 10th element and p.push(new) to insert the new one.
You keep doing this inside a loop until you run out of vector<T> solutions.
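A minimal sketch of that loop body, assuming the score is stored first in the pair so the comparator stays short. TopK, Scored and WorseOnTop are illustrative names, not anything from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

typedef std::pair<float, std::vector<float> > Scored;  // (score, solution)

// Comparator that puts the LOWEST score at top(), i.e. a min-heap on score.
struct WorseOnTop {
    bool operator()(const Scored& a, const Scored& b) const {
        return a.first > b.first;
    }
};

class TopK {
public:
    explicit TopK(std::size_t k) : k_(k) { assert(k > 0); }

    // Keep the candidate only if fewer than k are stored or it beats the worst one.
    void offer(float score, const std::vector<float>& solution) {
        if (heap_.size() < k_) {
            heap_.push(Scored(score, solution));
        } else if (score > heap_.top().first) {
            heap_.pop();                          // drop the current k-th best
            heap_.push(Scored(score, solution));
        }
    }

    std::size_t size() const { return heap_.size(); }
    float worst_kept_score() const { return heap_.top().first; }

private:
    std::size_t k_;
    std::priority_queue<Scored, std::vector<Scored>, WorseOnTop> heap_;
};
```

Most candidates are rejected by a single comparison against top(); only the occasional improvement pays the O(log k) cost of a pop and a push.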
Have a vector of pairs, where each pair holds the solution as one element and its usefulness as the other. Then write a custom comparator to compare elements in the vector.
Add the new element at the end, then sort this vector and remove the last element.
As @user4581301 mentioned in the comments, for 10 elements you don't need to sort. Just traverse the vector every time, or perform an ordered insert into the vector.
Here are some links to help you:
https://www.geeksforgeeks.org/sorting-vector-of-pairs-in-c-set-1-sort-by-first-and-second/
Comparator for vector<pair<int,int>>

Optimal data structure (in C++) for random access and looping through elements

I have the following problem: I have a set of N elements (N being somewhere between several hundred and several thousand elements, let's say between 500 and 3000). Out of these elements, a small percentage will have some property "X", but the elements "gain" and "lose" this property in a semi-random fashion; so if I store them all in an array, and assign 1 to elements with property X and 0 otherwise, this array of N elements will have n ones and N-n zeros (n being small, in the 20-50 range).
The problem is the following: these elements change very frequently in a semi-random way (meaning that any element can flip from 0 to 1 and vice versa, but the process that controls that is somewhat stable, so the total number "n" fluctuates a bit, but is reasonably stable in the 20-50 range); and I frequently need all the "X" elements of the set (in other words, indices of the array where value of the array is 1), to perform some task on them.
One simple and slow way to achieve this is to loop through the array and, if index k has value 1, perform the task; but this is rather slow because well over 95% of all the elements have value 0. The solution would be to put all the 1s into a different structure (with n elements) and then loop through that structure, instead of looping through all N elements. The question is: what's the best structure to use?
Elements will flip from 0 to 1 and vice versa randomly (from several different threads), so there's no order there of any sort (the time when an element flipped from 0 to 1 has nothing to do with the time it will flip back), and when I loop through them (from another thread), I do not need to loop in any particular order (in other words, I just need to get them all, but the order is not relevant).
Any suggestions what would be the optimal structure for this? "std::map" comes to mind, but since the keys of std::map are sorted (and I don't need that feature), the question is whether there is anything faster.
EDIT: To clarify, the array example is just one (slow) way to solve the problem. The essence of the problem is that out of one big set "S" with "N" elements, there is a continuously changing subset "s" of "n" elements (with n much smaller than N), and I need to loop through that set "s". Speed is of the essence, both for adding/removing elements to "s", and for looping through them. So while suggestions like having 2 arrays and moving elements between them would be fast from an iteration perspective, adding and removing elements to an array would be prohibitively slow. It sounds like some hash-based approach like std::unordered_set would work reasonably fast on both iteration and addition/removal fronts; the question is, is there something better than that? Reading the documentation on "unordered_map" and "unordered_set" doesn't really clarify how much faster addition/removal of elements is relative to std::map and std::set, nor how much slower the iteration through them would be. Another thing to keep in mind is that I don't need a generic solution that works best in all cases, I need one that works best when N is in the 500-3000 range, and n is in the 20-50 range. Finally, speed is really of the essence; there are plenty of slow ways of doing it, so I'm looking for the fastest way.
Since order doesn't appear to be important, you can use a single array and keep the elements with property X at the front. You will also need an index or iterator to the point in the array that is the transition from X set to unset.
To set X, increment the index/iterator and swap that element with the one you want to change.
To unset X, do the opposite: decrement the index/iterator and swap that element with the one you want to change.
Naturally with multiple threads you will need some sort of mutex to protect the array and index.
Edit: to keep a half-open range as iterators are normally used, you should reverse the order of the operations above: swap, then increment/decrement. If you keep an index instead of an iterator then the index does double duty as the count of the number of X.
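One way to read the swap scheme above is sketched below. It is single-threaded (the mutex mentioned above is left out), and it adds a position table so an element can be located in O(1); that table and the names FrontPartition, order_, pos_ and count_ are assumptions made for the example, not part of the answer.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Element ids 0..n-1 live in one array; the ids with property X occupy
// the half-open prefix [0, count_).
class FrontPartition {
public:
    explicit FrontPartition(std::size_t n) : order_(n), pos_(n), count_(0) {
        for (std::size_t i = 0; i < n; ++i) { order_[i] = i; pos_[i] = i; }
    }

    bool has(std::size_t id) const { return pos_[id] < count_; }

    void set(std::size_t id) {
        if (has(id)) return;
        swap_slots(pos_[id], count_);      // swap into the first non-X slot...
        ++count_;                          // ...then grow the prefix
    }

    void unset(std::size_t id) {
        if (!has(id)) return;
        swap_slots(pos_[id], count_ - 1);  // swap with the last X slot...
        --count_;                          // ...then shrink the prefix
    }

    std::size_t count() const { return count_; }
    std::size_t at(std::size_t i) const { return order_[i]; }  // valid for i < count()

private:
    void swap_slots(std::size_t a, std::size_t b) {
        std::swap(order_[a], order_[b]);
        pos_[order_[a]] = a;
        pos_[order_[b]] = b;
    }

    std::vector<std::size_t> order_;  // slot -> element id
    std::vector<std::size_t> pos_;    // element id -> slot
    std::size_t count_;               // number of elements with property X
};
```

Looping over the X elements is then a plain loop from 0 to count(), and both set and unset are a constant number of array writes.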
N=3000 isn't really much. If you use a single bit for each of them, you have a structure smaller than 400 bytes (3000 bits is 375 bytes). You can use std::bitset for that. If you use an unordered_set or a set, however, be mindful that you'll spend many more bytes for each of the n elements in your list: if you just allocate a pointer for each element on a 64-bit architecture you'll use at least 8*50 = 400 bytes, already more than the whole bitset.
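For illustration, scanning such a bitset for the flagged indices could look like the sketch below. kMaxElements and flagged_indices are hypothetical names, and whether a full scan of 3000 bits beats a hash container for n around 20-50 is something to measure rather than assume.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

const std::size_t kMaxElements = 3000;  // assumed upper bound on N

// Collect the indices whose bit is set, in increasing order.
std::vector<std::size_t> flagged_indices(const std::bitset<kMaxElements>& flags)
{
    std::vector<std::size_t> out;
    out.reserve(flags.count());
    for (std::size_t i = 0; i < flags.size(); ++i) {
        if (flags.test(i)) {
            out.push_back(i);
        }
    }
    return out;
}
```

Setting and clearing a flag is flags.set(i) and flags.reset(i); concurrent writers would still need external synchronization.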
@geza: perhaps I misunderstood what you meant by two arrays; I assume you meant something like having one std::vector (or something similar) in which I store all elements with property X, and another where I store the rest? In reality, I don't care about the others, so I really need one array. Adding an element is obviously simple if I can just add it to the end of the array; now, correct me if I'm wrong here, but finding an element in that array is an O(n) operation (since the array is unsorted), and then removing it from the array requires shifting all the following elements by one place, so this requires n/2 operations on average. If I use a linked list instead of a vector, then deleting an element is faster, but finding it still takes O(n). That's what I meant when I said it would be prohibitively slow; if I misunderstood you, please do clarify.
It sounds like std::unordered_set or std::unordered_map would be fastest in adding/deleting elements, since it's O(1) to find an element, but it's unclear to me how fast one can loop through all the keys; the documentation clearly states that iteration through the keys of std::unordered_map is slower than iteration through the keys of std::map, but it's not quantified in any way just how slow "slower" is, and how fast "faster" is.
And finally, to repeat one more time, I'm not interested in a general solution, I'm interested in one for small "n". So if for example I have two solutions, one that's k_1*log(n), and a second that's k_2*n^2, the first one might be faster in principle (and for large n), but if k_1 >> k_2 (let's say for example k_1 = 1000 and k_2 = 2 and n = 20), the second one can still be faster for relatively small "n" (1000*log(20) is still larger than 2*20^2). So even if addition/deletion in std::unordered_map might be done in constant time O(1), for small "n" it still matters if that constant time is 1 nanosecond or 1 microsecond or 1 millisecond. So I'm really looking for suggestions that work best for small "n", not in the asymptotic limit of large "n".
An alternative approach (in my opinion worthwhile only if the number of elements increases at least tenfold) might be keeping a double index:
#include <algorithm>
#include <cstddef>
#include <vector>
class didx {
    // v == indexes[i] && v > 0 <==> flagged[v-1] == i
    std::vector<std::ptrdiff_t> indexes;
    std::vector<std::ptrdiff_t> flagged;
public:
    didx(std::size_t size) : indexes(size) {}
    // loop through flagged items using iterators (auto return types need C++14)
    auto begin() { return flagged.begin(); }
    auto end() { return flagged.end(); }
    void flag(std::ptrdiff_t index) {
        if (!isflagged(index)) {
            flagged.push_back(index);
            indexes[index] = flagged.size();
        }
    }
    void unflag(std::ptrdiff_t index) {
        if (isflagged(index)) {
            // swap last item with item to be removed in "flagged", update indexes accordingly
            // in "flagged" we swap last element with element at index to be removed
            auto idx = indexes[index] - 1;
            auto last_element = flagged.back();
            std::swap(flagged.back(), flagged[idx]);
            std::swap(indexes[index], indexes[last_element]);
            // remove the element, which is now last in "flagged"
            flagged.pop_back();
            indexes[index] = 0;
        }
    }
    bool isflagged(std::ptrdiff_t index) const {
        return indexes[index] > 0;
    }
};

C++ Fixed Size Container to Store Most Recent Values

I would like to know what the most suitable data structure is for the following problem in C++
I want to store 100 floats ordered by recency. So when I add (push) a new item, the other elements are moved up one position. Every time an event is triggered I receive a value and then add it to my data structure.
When the number of elements reaches 100, I would like to remove (pop) the item at the end (the oldest).
I want to be able to iterate over all the elements and perform some mathematical operations on them.
I have looked at all the standard C++ containers but none of them fulfill all my needs. What's the easiest way to achieve this with standard C++ code?
You want a circular buffer. You can use Boost's implementation or make your own by allocating an array, and keeping track of the beginning and end of the used range. This boils down to doing indexing modulo 100.
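A hand-rolled version of that idea might look like the sketch below. RecentFloats and its members are hypothetical names chosen for the example, and the capacity is a constructor argument rather than a hard-coded 100.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity ring buffer; once full, each push overwrites the oldest value.
class RecentFloats {
public:
    explicit RecentFloats(std::size_t capacity)
        : data_(capacity), head_(0), size_(0) { assert(capacity > 0); }

    void push(float value) {
        data_[head_] = value;                // overwrite the oldest slot once full
        head_ = (head_ + 1) % data_.size();  // head_ is always the next write position
        if (size_ < data_.size()) ++size_;
    }

    std::size_t size() const { return size_; }

    // age 0 is the newest value, age size()-1 the oldest one still stored.
    float operator[](std::size_t age) const {
        assert(age < size_);
        return data_[(head_ + data_.size() - 1 - age) % data_.size()];
    }

private:
    std::vector<float> data_;
    std::size_t head_;
    std::size_t size_;
};
```

Iterating is a loop over ages 0 to size()-1; no element is ever moved and no memory is allocated after construction.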
Without creating your own or using a library, std::vector is the most efficient standard data structure for this. Once it has reached its maximum size, there will be no more dynamic memory allocations. The cost of moving up 100 floats is trivial compared to the cost of dynamic memory allocations. (This is why std::list is a slow data structure for this.) There is no push_front function for vector. Instead you have to use v.insert(v.begin(), f).
Of course this assumes what you are doing is performance-critical, which it probably isn't. In that case I would use std::deque for more convenient usage.
Just saw that you need to iterate over them. Use a list.
Your basic function would look something like this
std::list<int> list100;
void addToList(int value){
    list100.push_back(value);
    if(list100.size() > 100){
        list100.pop_front();
    }
}
Iterating over them is easy as well:
int sum = 0;
for(int val : list100){
    sum += val;
}
// Average, or whatever you need to do
Obviously, if you're using something besides int, you'll need to change that. Although this adds a little bit more functionality than you need, it's very efficient since it's a doubly linked list.
http://www.cplusplus.com/reference/list/list/
You can use either std::array, std::deque, std::list or std::priority_queue.
A map (std::map) should be able to solve your requirement. Use the object as the key and, as the value, the current push number nPushCount, which gets incremented whenever you add an element to the map.
When adding a new element to the map, if you have fewer than 100 elements, just add the number to the map as the key and nPushCount as the value.
If you have 100 elements already, check whether the number already exists in the map and do the following:
If the number already exists in the map, update its value to the current nPushCount;
If it doesn't, delete the entry with the lowest nPushCount value and then add the desired number with the updated nPushCount.

How to efficiently *nearly* sort a list?

I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the randomness is not especially good, e.g. if it is simply based on the chance ordering of the input, as in an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so the cost of both sorting and calling random() has to be considered.
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
    std::swap(order[i],order[i+rand()%3]);
Use first two passes of JSort. Build heap twice, but do not perform insertion sort. If element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has time complexity dependent on randomness (the more random result is needed, the less time complexity). Use heapsort with Soft heap. For detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate then introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
    return random(2)-1 <-- -1, 0 or 1, randomly chosen
Here we have a roughly 1/1000 chance to "scramble" two elements, but the right number strictly depends on the size of the container you are sorting.
Another thing to add in the 1000 case, could be to remove the "right" answer because that would not scramble the result!
Edit:
if random(100 * container_size) == 0 then <-- here I consider the container size
{
    if element_1 < element_2
        return random(1); <-- do not return the "correct" value of -1
    else if element_1 > element_2
        return random(1)-1; <-- do not return the "correct" value of 1
    else
        return random(1)==0 ? -1 : 1; <-- do not return 0
}
in my pseudocode:
random(x) = y where 0 <= y <=x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
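A minimal sketch of the same idea for integer keys rather than the Name field from the example: each value gets a perturbed copy of its key, and the sort only ever sees those copies. sort_with_jitter and max_jitter are illustrative names, and the sketch assumes max_jitter >= 0 and that value + jitter cannot overflow int.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sort by a randomly perturbed copy of each key. The comparison itself stays
// consistent, so std::sort's requirements are met.
std::vector<int> sort_with_jitter(const std::vector<int>& values,
                                  int max_jitter,
                                  std::mt19937& rng)
{
    std::uniform_int_distribution<int> jitter(-max_jitter, max_jitter);
    std::vector<std::pair<int, int> > keyed;  // (perturbed key, original value)
    keyed.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        keyed.push_back(std::make_pair(values[i] + jitter(rng), values[i]));
    }

    std::sort(keyed.begin(), keyed.end());    // ties fall back to the original value

    std::vector<int> result;
    result.reserve(keyed.size());
    for (std::size_t i = 0; i < keyed.size(); ++i) {
        result.push_back(keyed[i].second);
    }
    return result;
}
```

Two values that differ by more than 2 * max_jitter always come out in the right relative order, so max_jitter directly bounds how "wrong" the result can be.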
If you are sure that each element is at most k positions away from where it should be, you can reduce the quicksort N log(N) sorting time complexity down to N log(k)....
edit
More specifically, you would create N/k buckets, each containing k elements.
You can quicksort each bucket, which takes k * log(k) time, and then sort the N/k buckets among themselves, which takes N/k log(N/k) time. Adding these up (N/k buckets at k * log(k) each, plus the bucket-level sort), you can do the sorting in roughly N log(max(N/k,k)).
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that any element in the list is at most k indices away from their correct position after the sorting.
but I do not think you meant any restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing merged elements. For other merge iterations, do not compare the elements, but instead select element from the same part, as in the previous step. It is not necessary to use RNG to decide, how to treat each element. Just ignore sorting order for every N-th element.
Other variant of this approach nearly sorts an array nearly in-place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use standard C++ algorithm with appropriately modified iterator, like boost::permutation_iterator). Reserve some limited space at the end of the array. Merge parts, starting from the end. If merged part is going to overwrite one of the non-merged elements, just select this element. Otherwise select element in sorted order. Level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
pick a random index i
pick a random index k
if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
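In C++ the loop above might look like this sketch. nearly_sort and count_inversions are illustrative names, and std::mt19937 is just one choice of generator; a cheaper one could be swapped in for a tight loop.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Run `iterations` random compare-and-swap steps; more steps give a more sorted array.
void nearly_sort(std::vector<int>& a, std::size_t iterations, std::mt19937& rng)
{
    if (a.size() < 2) return;
    std::uniform_int_distribution<std::size_t> pick(0, a.size() - 1);
    for (std::size_t m = 0; m < iterations; ++m) {
        std::size_t i = pick(rng);
        std::size_t k = pick(rng);
        // Swap when index order and value order disagree.
        if ((i < k) != (a[i] < a[k])) {
            std::swap(a[i], a[k]);
        }
    }
}

// Number of out-of-order pairs: 0 means fully sorted. Quadratic, for diagnostics only.
std::size_t count_inversions(const std::vector<int>& a)
{
    std::size_t inversions = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = i + 1; j < a.size(); ++j) {
            if (a[i] > a[j]) ++inversions;
        }
    }
    return inversions;
}
```

Each swap exchanges a pair that is out of order (or equal), so the inversion count never increases; count_inversions is only there to measure how sorted the result is while tuning M.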
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For an unsorted array, you could pick a few random elements and bubble them up or down (maybe by rotation, which is a bit more efficient). It will be hard to control the amount of (dis)order: even if you pick all N elements, you cannot be sure that the whole array will be sorted, because elements are moved and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.

Sort, pack and remap array of indexed values to minimize overlapping

Situation:
overview:
I have something like this:
std::vector<SomeType> values;
std::vector<int> indexes;
struct Range{
    int firstElement;//first element to be used in indexes array
    int numElements;//number of elements to be used from indexes array
    int minIndex;/*minimum index encountered between firstElement
        and firstElement+numElements*/
    int maxIndex;/*maximum index encountered between firstElement
        and firstElement+numElements*/
    Range()
    :firstElement(0), numElements(0), minIndex(0), maxIndex(0){
    }
};
std::vector<Range> ranges;
I need to sort values, remap indexes, and recalculate ranges to minimize maxIndex-minIndex for each range.
details:
values is an array (okay, "vector") of some type (irrelevant which one). Elements in values may be unique, but this is not guaranteed.
indexes is a vector of ints. Each element in "indexes" is an index that corresponds to some element in values. Elements in indexes are not unique; one value may repeat multiple times. And indexes.size() >= values.size().
Now, ranges correspond to a "chunk" of data from indexes. firstElement is the index of the first element to be used from indexes (i.e. used like this: indexes[range.firstElement]), numElements is (obviously) the number of elements to be used, minIndex is the minimum in (indexes[firstElement]...indexes[firstElement+numElements-1]) and maxIndex is the maximum in (indexes[firstElement]...indexes[firstElement+numElements-1]). Ranges never overlap.
I.e. for every two ranges a, b
((a.firstElement >= b.firstElement) && (a.firstElement < (b.firstElement+b.numElements))) == false
Obviously, when I do any operation on values (swap two elements, etc.), I need to update indexes (so they keep pointing to the same value), and recalculate the corresponding range, so the range's minIndex and maxIndex are correct.
Now, I need to rearrange values in the way that will minimize Range.maxIndex - Range.minIndex. I do not need the "best" result after packing, having "probably the best" or "good" packing will be enough.
problem:
Remapping indexes and recalculating ranges is easy. The problem is that I'm not sure how to sort elements in values, because same index may be encountered in multiple ranges.
Any ideas about how to proceed?
restrictions:
Changing the container type is not allowed. Containers should be array-like. No maps, no lists.
But you're free to use whatever container you want during the sorting.
Also, no boost or external libraries - pure C++/STL, I really need only an algorithm.
additional info:
There is no greater/less comparison defined for SomeType - only equality/non-equality.
But there should be no need to ever compare two values, only indexes.
The goal of algorithm is to make sure that output of
for (int i = 0; i < indexes.size(); i++){
    print(values[indexes[i]]); //hypothetical print function
}
Will be identical before and after sorting, while also making sure that for each range
Range.maxIndex-Range.minIndex (after sorting) is as small as possible to achieve with reasonable effort.
I'm not looking for a "perfect" or "most optimal" solution, having a "probably perfect" or "probably most optimal" solution should be enough.
P.S. This is NOT a homework.
This is not an algorithm, just some thinking aloud. It will probably break if there are too many duplicates.
If there were no duplicates, you'd simply rearrange the values so the indexes are 0,1,2, and so on. So as a starting point, let's exclude the values that are double-referenced and arrange the rest.
Since there are duplicates, you need to figure out where to stick them. Suppose the duplicate is referred to by ranges r1, r2, r3. Now, as long as you insert the duplicate between min([r1,r2,r3].minIndex)-1 and max([r1,r2,r3].maxIndex)+1, the sum of maxIndex-minIndex will be the same no matter where you insert it. Moving the insertion point to the left will reduce max-min for all ranges to the left, but increment it for all ranges to the right. So, I think the sensible thing to do is to insert the duplicate at the left edge (minIndex) of the rightmost range (the one with the largest minIndex) of r1,r2,r3. Repeat with all duplicates.
Okay, it looks like there is only one way to reliably solve this problem:
Make sure that no index is ever used by two ranges at once by duplicating values.
I.e. scan the entire array of indexes, and when you find an index (of a value) that is being used in more than one range, you add a copy of that value for each range - each with a unique index. After that the problem becomes trivial - you simply sort values in a way that makes sure that the values array first contains values used only by the first range, then values for the 2nd range, and so on. I.e. this will get maximum packing.
Because in my app it is more important to minimize sum(ranges[i].maxIndex-ranges[i].minIndex) than to minimize the number of values, this approach works for me.
I do not think that there is another reliable way to solve the problem - it is quite easy to get a situation where there are indexes used by every range, and in this case it will not be possible to "pack" the data no matter what you do. Even allowing an index to be used by two ranges at once will lead to problems - you can get ranges a, b and c where a and b, b and c, a and c will have common indexes. In this case it also won't be possible to pack the data.
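A sketch of the duplication approach described above, under two assumptions that are not stated in the question: every entry of indexes belongs to exactly one range, and a std::map is acceptable as a temporary container during packing. pack_by_duplication is a made-up name, and Range is reduced here to a plain struct without the constructor.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Range {
    int firstElement;
    int numElements;
    int minIndex;
    int maxIndex;
};

// Build a new values array in which every range references its own
// contiguous block; values shared between ranges are copied once per range.
// indexes and ranges are updated in place; the new values array is returned.
template <typename T>
std::vector<T> pack_by_duplication(const std::vector<T>& values,
                                   std::vector<int>& indexes,
                                   std::vector<Range>& ranges)
{
    std::vector<T> packed;
    for (std::size_t r = 0; r < ranges.size(); ++r) {
        Range& range = ranges[r];
        if (range.numElements <= 0) continue;
        std::map<int, int> remap;  // old index -> new index, local to this range
        range.minIndex = static_cast<int>(packed.size());
        const int end = range.firstElement + range.numElements;
        for (int e = range.firstElement; e < end; ++e) {
            const int old_index = indexes[e];
            std::map<int, int>::iterator it = remap.find(old_index);
            if (it == remap.end()) {
                const int new_index = static_cast<int>(packed.size());
                it = remap.insert(std::make_pair(old_index, new_index)).first;
                packed.push_back(values[old_index]);
            }
            indexes[e] = it->second;
        }
        range.maxIndex = static_cast<int>(packed.size()) - 1;
    }
    return packed;
}
```

After the call, maxIndex - minIndex for each range equals the number of distinct values that range uses minus one, which is the tightest packing possible for that range, at the cost of a larger values array.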