Boost multi_index: retrieve unique values of a non-unique key - c++

I have a boost::multi_index_container whose elements are structs like this:
struct Elem {
A a;
B b;
C c;
};
The main key (in a database sense) is a composite_key of a and b. Other
keys exist to perform various types of queries.
I now need to retrieve a set of all different values of c. These values are
by all means not unique, but iterating through all entries (albeit ordered),
or using std::unique seems quite a waste, considering that
the number of different values of c is expected to be << than the total
number of entries (say, 10 to 1000).
Am I missing a simple way to obtain this result more efficiently?

I scoured the Boost.MultiIndex documentation and can't seem to find a way to do what you want. I'm interested in knowing if it's doable.
Perhaps the best you can do is maintain a std::map<C, size_t> (or hash map) alongside your multi_index_container and keep them both "synchronized".
The map associates a C value with its occurrence count (frequency). It's essentially a histogram of C values. Each time you add an Elem to your multi_index_container, you increment the corresponding frequency in the histogram. When you remove an Elem from your multi_index_container, you decrement the corresponding frequency in the histogram. When the frequency reaches zero, you delete that entry from the histogram.
To retrieve the set of distinct C values, you simply iterate through the <key,value> pairs in the histogram and look at the key part of each pair. If you used a std::map, then the distinct C values will come out sorted.
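Here is a minimal sketch of such a histogram, using only the standard library. DistinctTracker and its member names are made up for illustration; the idea is that you call add() and remove() next to every insertion into and erasure from your multi_index_container.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: a histogram of C values kept next to the main container.
// Call add() on every insertion and remove() on every erasure of an Elem.
template <typename C>
class DistinctTracker {
public:
    void add(const C& c) { ++counts_[c]; }

    void remove(const C& c) {
        auto it = counts_.find(c);
        if (it == counts_.end()) return;           // value was never tracked
        if (--it->second == 0) counts_.erase(it);  // last occurrence is gone
    }

    // Distinct values, sorted because std::map keeps its keys ordered.
    std::vector<C> distinct() const {
        std::vector<C> out;
        out.reserve(counts_.size());
        for (const auto& kv : counts_) out.push_back(kv.first);
        return out;
    }

private:
    std::map<C, std::size_t> counts_;
};
```

Retrieving the distinct values then costs O(M) for M distinct values, at the price of one extra map operation per insert or erase.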
If you're going to examine the set of distinct C values only once (or rarely) then the approach I described above may be overkill. A simpler approach would be to insert all C values into a std::set<C> and then iterate through the set to retrieve the distinct C values.
You said that the set of distinct C's is much smaller than the total number of C's. The std::set<C> approach should therefore waste much less space than copying the C's to a std::vector, sorting the vector, then running std::unique.
Let's compare the time complexity of copying to a set versus copying to a vector, sorting, then running unique. Let N be the total number of C values, and let M be the number of distinct C values. The set approach, by my reckoning, should have a time complexity of O(N*log(M)). Since M is small and does not grow much with higher N's, the complexity effectively becomes O(N). The sorting+unique technique, on the other hand, should have a time complexity of O(N*log(N)).
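To make the comparison concrete, here are both variants as a rough sketch. The function names are mine, and the input is assumed to be the C values already copied out of the container into a std::vector.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// O(N log M): every insertion touches a tree holding at most M distinct values.
template <typename C>
std::vector<C> distinct_via_set(const std::vector<C>& values) {
    std::set<C> seen(values.begin(), values.end());
    return std::vector<C>(seen.begin(), seen.end());
}

// O(N log N): sort a full copy, then drop adjacent duplicates.
template <typename C>
std::vector<C> distinct_via_sort(std::vector<C> values) {
    std::sort(values.begin(), values.end());
    values.erase(std::unique(values.begin(), values.end()), values.end());
    return values;
}
```

Both return the same sorted list of distinct values; only the intermediate memory and the log factor differ.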

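The approach I took to solving this problem was to use Boost range adaptors, as follows. Note that uniqued only drops adjacent duplicates, so IndexType has to refer to an ordered index on the key in question: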
const auto& indexedContainer = container.get<IndexType>();
const auto uniqueIndexRange = indexedContainer
| boost::adaptors::transformed([&](auto&& v) {
return indexedContainer.key_extractor()(v); })
| boost::adaptors::uniqued;

Related

Given a set of positive integers <=k and its n subsets , find union of which pairs of subsets give the original set

I have a set A which consists of the first p positive integers (1 to p), and I am given n subsets of this set. How can I find how many pairs of subsets have a union equal to the original set A?
Of course this can be done naively by checking the size of the union of each pair: if it is equal to p, the union must make up the set A. But is there a more elegant way of doing this that reduces the time complexity?
std::set_union in C++ performs up to 2*(size(set 1) + size(set 2)) - 1 comparisons, which is not good for nC2 pairs.
If we need to cope with a worst-case scenario then some ideas about this problem:
I suppose that a plain std::bitset, without any further optimization, would be sufficient for this task because its union operation is much faster. If it is not, avoid variable-size vectors; use simple p-length 0-1 arrays/vectors or unordered_sets instead. I don't think variable-size vectors without an O(1) find operation would do better in worst-case scenarios.
Use heuristics to minimize the number of subset unions. The simplest heuristic is checking the sizes of the subsets: we need only those pairs (A, B) of subsets where size(A) + size(B) >= p.
In addition to heuristics, we can count (in O(n^2)) how often every number appears in the subsets. After that, we can check for the presence of the numbers in increasing order of frequency. We can also exclude those numbers that appear in every subset.
If you fix some subset A (in the outer loop, for example) and compute unions with the other subsets, you need to check only those numbers that do not appear in A. If A is large enough, this can dramatically reduce the number of operations needed.
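A rough sketch of the bitset idea combined with the size heuristic could look like this. kMaxP and count_covering_pairs are names I made up, and the code assumes p <= kMaxP and that bit x-1 of a subset is set exactly when x is in it (no bits at or above p are set).

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxP = 64;  // assumed upper bound on p for this sketch

// Counts pairs (i, j), i < j, whose union covers {1, ..., p}.
// Each subset is a bitset where bit x-1 is set when x belongs to the subset.
std::size_t count_covering_pairs(const std::vector<std::bitset<kMaxP>>& subsets,
                                 std::size_t p) {
    std::vector<std::size_t> sizes;
    sizes.reserve(subsets.size());
    for (const auto& s : subsets) sizes.push_back(s.count());

    std::size_t pairs = 0;
    for (std::size_t i = 0; i < subsets.size(); ++i) {
        for (std::size_t j = i + 1; j < subsets.size(); ++j) {
            if (sizes[i] + sizes[j] < p) continue;  // size heuristic
            if ((subsets[i] | subsets[j]).count() == p) ++pairs;
        }
    }
    return pairs;
}
```

This is still O(n^2) pairs, but each union is a handful of word-wise ORs instead of a merge over two sorted ranges.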
Just a possible improvement to your approach, instead of binary searching you can keep a boolean array to find out if some x appears in array i in O(1).
For example,
Let's say, when taking input you save all the appearances for an array i. Meaning, if x appears in array i, then isThere[i][x] should be true else false.
This can save some time.

How to efficiently *nearly* sort a list?

I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the randomness is not especially good, e.g. if it is simply based on the chance ordering of the input, as in an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so the cost of both the sort and the calls to random() has to be considered.
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
std::swap(order[i],order[i+rand()%3]);
Use first two passes of JSort. Build heap twice, but do not perform insertion sort. If element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has time complexity dependent on randomness (the more random result is needed, the less time complexity). Use heapsort with Soft heap. For detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate then introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
return random(2)-1 <-- -1, 0 or 1, randomly chosen
Here we have a 1/1000 chance to "scramble" two elements, but the right number strictly depends on the size of the container you are sorting.
Another thing to add in the 1000 case could be to remove the "right" answer, because returning it would not scramble the result!
Edit:
if random(100 * container_size) == 0 then <-- here I consider the container size
{
if element_1 < element_2
return random(1); <-- do not return the "correct" value of -1
else if element_1 > element_2
return random(1)-1; <-- do not return the "correct" value of 1
else
return random(1)==0 ? -1 : 1; <-- do not return 0
}
in my pseudocode:
random(x) = y where 0 <= y <=x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
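For numeric sort keys, the same idea can be sketched as below. nearly_sorted and the jitter parameter are my own names, and double keys are assumed purely for illustration. The noise is fixed before sorting, so std::sort sees a consistent ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sorts `items` by value plus a random offset in [-jitter, +jitter).
// The offsets are fixed before sorting, so the comparison stays consistent.
std::vector<double> nearly_sorted(const std::vector<double>& items,
                                  double jitter, std::mt19937& rng) {
    std::uniform_real_distribution<double> noise(-jitter, jitter);
    std::vector<std::pair<double, double>> keyed;  // (perturbed key, original)
    keyed.reserve(items.size());
    for (double v : items) keyed.emplace_back(v + noise(rng), v);

    std::sort(keyed.begin(), keyed.end());

    std::vector<double> out;
    out.reserve(keyed.size());
    for (const auto& kv : keyed) out.push_back(kv.second);
    return out;
}
```

Elements whose values differ by more than 2*jitter always keep their correct relative order, so jitter directly bounds how far the result can stray from a true sort.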
If you are sure that each element is at most k positions away from where it should be, you can reduce the quicksort N log(N) sorting time complexity down to N log(k)....
edit
More specifically, you would create k buckets, each containing N/k elements.
You can do quick sort for each bucket, which takes k * log(k) times, and then sort N/k buckets, which takes N/k log(N/k) time. Multiplying these two, you can do sorting in N log(max(N/k,k))
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that any element in the list is at most k indices away from their correct position after the sorting.
but I do not think you meant any restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing merged elements. For other merge iterations, do not compare the elements, but instead select element from the same part, as in the previous step. It is not necessary to use RNG to decide, how to treat each element. Just ignore sorting order for every N-th element.
Other variant of this approach nearly sorts an array nearly in-place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use standard C++ algorithm with appropriately modified iterator, like boost::permutation_iterator). Reserve some limited space at the end of the array. Merge parts, starting from the end. If merged part is going to overwrite one of the non-merged elements, just select this element. Otherwise select element in sorted order. Level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
pick a random index i
pick a random index k
if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For an unsorted array, you could pick a few random elements and bubble them up or down (maybe by rotation, which is a bit more efficient). It will be hard to control the amount of (dis)order: even if you pick all N elements, you cannot be sure that the whole array will be sorted, because elements are moved and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.

What is the difference between std::set and std::vector?

I am learning the STL now and have read about the set container. My question is: when would you want to use a set? After reading its description, it looks useless, because it can be replaced by a vector. Could you list the pros and cons of vector vs. set? Thanks.
A set is ordered. It is guaranteed to remain in a specific ordering, according to a functor that you provide. No matter what elements you add or remove (unless you add a duplicate, which is not allowed in a set), it will always be ordered.
A vector has exactly and only the ordering you explicitly give it. Items in a vector are where you put them. If you put them in out of order, then they're out of order; you now need to sort the container to put them back in order.
Admittedly, set has relatively limited use. With proper discipline, one could insert items into a vector and keep it ordered. However, if you are constantly inserting and removing items from the container, vector will run into many issues. It will be doing a lot of copying/moving of elements and so forth, since it is effectively just an array.
The time it takes to insert an item into a vector is proportional to the number of items already in the vector. The time it takes to insert an item into a set is proportional to the log₂ of the number of items. If the number of items is large, that's a huge difference. log₂(100,000) is ~16; that's a major speed improvement. The same goes for removal.
However, if you do all of your insertions at once, at initialization time, then there's no problem. You can insert everything into the vector, sort it (paying that price once), and then use standard algorithms for sorted vectors to find elements and iterate over the sorted list. And while iteration over the elements of a set isn't exactly slow, iterating over a vector is faster.
So there are cases where a sorted vector beats a set. That being said, you really shouldn't bother with the expense of this kind of optimization unless you know that it is necessary. So use a set unless you have experience with the kind of system you're writing (and thus know that you need that performance) or have profiling data in hand that tells you that you need a vector and not a set.
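As a small illustration of the "insert everything, sort once, then only query" case, here is a sketch of a sorted vector used as a lookup table. SortedLookup is a made-up name and int elements are assumed for brevity.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// A vector that is filled once, sorted once, and then only queried.
class SortedLookup {
public:
    explicit SortedLookup(std::vector<int> items) : items_(std::move(items)) {
        std::sort(items_.begin(), items_.end());  // pay the O(n log n) cost once
    }

    // O(log n) membership test, same complexity class as std::set::find.
    bool contains(int x) const {
        return std::binary_search(items_.begin(), items_.end(), x);
    }

    // Sorted contents, stored contiguously for fast iteration.
    const std::vector<int>& items() const { return items_; }

private:
    std::vector<int> items_;
};
```

Lookups stay logarithmic, and iteration walks contiguous memory; what you give up is cheap insertion and removal after construction.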
They are different things: you decide how vectors are ordered, and you can also put as many equal things into a vector as you please. Sets are ordered in accordance to that set's internal rules (you may set the rules, but the set will deal with the ordering), and you cannot put multiple equal items into a set.
Of course you could maintain a vector of unique items, but your performance would suffer a lot when you do set-oriented operations. For example, assume that you have a set of 10000 items and a vector of 10000 distinct unordered items. Now suppose that you need to check if a value X is among the values in the set (or among the values in the vector). When X is not among the items, searching the vector would be some 100 times slower. You would see similar performance differences on calculating set unions and intersections.
To summarize, sets and vectors have different purposes. You can use a vector instead of a set, but it would require more work, and would likely hurt the performance rather severely.
From cplusplus.com
set:
Sets are containers that store unique elements following a specific
order.
So a set is ordered AND its items are uniquely represented,
while vector:
Vectors are sequence containers representing arrays that can change in
size.
So a vector is in the order you fill it AND can hold multiple identical items.
Prefer set:
if you wish to filter out multiple identical values
if you wish to process items in a specified order (doing this with a vector requires explicitly sorting it).
Prefer vector:
if you want to keep identical values
if you wish to process items in the same order as you pushed them (assuming you don't reorder the vector)
The simple difference is that set can contain only unique values, and it is sorted. So you can use it for the cases where you need to continuously sort the values after every insert / delete.
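#include <cstdlib>
#include <iostream>
#include <set>
#include <vector>
using namespace std;

int main()
{
    set<int> a;
    vector<int> b;
    for (int i = 0; i < 10; ++i)
    {
        int val = rand() % 10; // random digit 0..9
        a.insert(val);         // duplicates are silently dropped
        b.push_back(val);      // duplicates are kept, in insertion order
    }
    cout << "--SET---\n"; for (auto i : a) cout << i << ","; cout << endl;
    cout << "--VEC---\n"; for (auto j : b) cout << j << ","; cout << endl;
}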
The output is
--SET---
0,1,2,4,7,8,9,
--VEC---
1,7,4,0,9,4,8,8,2,4,
It is faster to search for an item in a set than in a vector (O(log(n)) vs O(n)). To find an item in an unsorted vector you need to iterate over its items, whereas a set uses a red-black tree to optimize the search, so only a few items are examined to find a match.
A set is ordered, which means you can only iterate over it from the smallest element to the largest, or in reverse order.
A vector is unordered in that sense: you traverse it in insertion order.

What is the fastest way to return x,y coordinates that are present in both list A and list B?

I have two lists (list A and list B) of x,y coordinates where 0 < x < 4000, 0 < y < 4000, and they will always be integers. I need to know what coordinates are in both lists. What would be your suggestion for how to approach this?
I have been thinking about representing the lists as two grids of bits and doing bitwise & possibly?
List A has about 1000 entries and changes maybe once every 10,000 requests. List B will vary wildly in length and will be different on every run through.
EDIT: I should mention that no coordinate will be in lists twice; 1,1 cannot be in list A more than once for example.
Represent (x,y) as a single 24 bit number as described in the comments.
Maintain A in numerical order (you said it doesn't vary much, so this should be hardly any cost).
For each B do a binary search on the list. Since A is about 1000 items big, you'll need at most 10 integer comparisons (in the worst case) to check for membership.
If you have a bit more memory (about 2MB) to play with, you could create a bit-vector covering all possible 24-bit numbers and then perform a single bit operation per item to test for membership. A would be represented by a single 2^24-bit number with a bit set if the value is there (otherwise 0). To test for membership you would just use an appropriate bitwise AND operation.
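A sketch of the bit-vector variant follows. The names pack and CoordinateSet are mine, x is assumed to go in the high 12 bits, and std::vector<bool> stands in for a raw bit array (it is usually bit-packed, so the table is about 2 MB).

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Packs (x, y) with 0 <= x, y < 4096 into one 24-bit key: x in the high 12 bits.
inline std::uint32_t pack(const Point& p) {
    return (static_cast<std::uint32_t>(p.first) << 12) |
           static_cast<std::uint32_t>(p.second);
}

// Membership table for list A: one bit per possible 24-bit key.
class CoordinateSet {
public:
    explicit CoordinateSet(const std::vector<Point>& a)
        : bits_(std::size_t{1} << 24, false) {
        for (const Point& p : a) bits_[pack(p)] = true;
    }

    bool contains(const Point& p) const { return bits_[pack(p)]; }

    // Returns the points of b that are also in A, in b's order.
    std::vector<Point> common(const std::vector<Point>& b) const {
        std::vector<Point> out;
        for (const Point& p : b)
            if (contains(p)) out.push_back(p);
        return out;
    }

private:
    std::vector<bool> bits_;  // stand-in for a raw bit array
};
```

Since A changes only rarely, the table can be rebuilt on each change and reused across many B lists.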
Put the coordinates of list A into some kind of set (probably a hash, BST, or heap); then you can quickly see whether a coordinate from list B is present.
Whether you expect the coordinates to be present in the set or not would determine which underlying data structure you use.
Hashes are good at telling you if something is in them, though, depending on how they are implemented, they could behave poorly when trying to find something that isn't there.
BSTs and heaps are equally good at telling you whether something is in them or not, but don't perform theoretically as well as hashes when something is in them.
Since A is rather static, you may consider building a query structure and checking for all elements of B whether they occur in A. One example would be a std::set<std::pair<int, int> > A, which you can query like A.find(element_from_b) != A.end() ...
So the total running time is worst case O(b log a) (where b is the number of elements in B, and a the number in A). Note also that since a is always about 1000, log a is basically constant.
Define an ordering based on their lexicographic order (sort first on x then on y). Sort both lists based on that ordering in O(n log n) time, where n is the larger of the number of elements of each list. Set a pointer to the first element of each list and advance the one that points to the lesser element; when the pointers reference elements with the same value, put them into a set (to avoid multiplicities within each list). This last part can be done in O(n) time (or O(m log m) where m is the number of elements common to both lists).
Update (based on comment below and edit above): Since no point appears more than once in each list, you can use a list, vector, deque or some other container with (amortized) constant-time insertion to hold the points common to both, realizing the O(n) time performance regardless of the number of common elements.
This is easy if you implement an STL predicate which orders two pairs (i.e. return (R.x < L.x || (R.x==L.x && R.y < L.y))). You can then call std::list::sort to order them, and std::set_intersection to find the common elements. No need to write the algorithms yourself.
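A short sketch of that approach, assuming the points are stored as std::pair<int, int> in vectors rather than a std::list. std::pair already compares lexicographically, so no custom predicate is needed in that case. The function name is mine.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Sorts both lists lexicographically (x first, then y) and intersects them.
// std::pair already provides this ordering through operator<.
std::vector<Point> common_points(std::vector<Point> a, std::vector<Point> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<Point> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

If A is kept sorted between requests, only B has to be sorted on each run.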
This is the kind of problem that just screams "Bloom Filter" at me.
If I understand correctly, you want the common coordinates in X and Y -- the intersection of (sets) lists A and B? If you are using STL:
#include <algorithm>
#include <iterator>
#include <set>
using namespace std;
// ...
set<int> a; // x coord (assumed populated elsewhere)
set<int> b; // y coord (assumed populated elsewhere)
set<int> in; // intersection
// ...
set_intersection(a.begin(), a.end(), b.begin(), b.end(), insert_iterator<set<int> >(in, in.begin()));
I think hashing is your best bet.
//Pseudocode:
INPUT: two lists, each with (x,y) coordinates
1. find the list that's longer, call it A
2. hash each element in A
3. go to the other list, call it B
4. hash each element in B and look it up in the table.
5. if there's a match, return/store (x,y) somewhere
6. repeat #4 till the end
Assuming length of A is m and B's length is n, run time is O(m + n), which is O(m) since A is the longer list
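The steps above could be sketched with std::unordered_set as follows. The function name is made up, the first argument is the list that gets hashed (the caller decides which one), and coordinates are assumed to fit in 12 bits each, as in the question.

```cpp
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Hash one list, then probe it with each element of the other list.
// Expected O(m + n) time. Assumes 0 <= x, y < 4096 so a point fits in 24 bits.
std::vector<Point> common_points_hashed(const std::vector<Point>& a,
                                        const std::vector<Point>& b) {
    auto key = [](const Point& p) {
        return (static_cast<std::uint32_t>(p.first) << 12) |
               static_cast<std::uint32_t>(p.second);
    };

    std::unordered_set<std::uint32_t> table;
    table.reserve(a.size());
    for (const Point& p : a) table.insert(key(p));

    std::vector<Point> out;
    for (const Point& p : b)
        if (table.count(key(p)) != 0) out.push_back(p);
    return out;
}
```

Packing each point into a single integer avoids having to write a hash function for std::pair, which the standard library does not provide.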

Sort, pack and remap array of indexed values to minimize overlapping

Situation:
overview:
I have something like this:
std::vector<SomeType> values;
std::vector<int> indexes;
struct Range{
    int firstElement;//first element to be used in indexes array
    int numElements;//number of elements to be used from indexes array
    int minIndex;/*minimum index encountered between firstElement
    and firstElement+numElements*/
    int maxIndex;/*maximum index encountered between firstElement
    and firstElement+numElements*/
    Range()
    :firstElement(0), numElements(0), minIndex(0), maxIndex(0){
    }
};
std::vector<Range> ranges;
I need to sort values, remap indexes, and recalculate ranges to minimize maxIndex-minIndex for each range.
details:
values is an array(okay, "vector") of some type (irrelevant which one). elements in values may be unique, but this is not guaranteed.
indexes is a vector of ints. Each element in "indexes" is an index that corresponds to some element in values. Elements in indexes are not unique; one value may repeat multiple times. And indexes.size() >= values.size().
Now, ranges correspond to a "chunk" of data from indexes. firstElement is an index of the first element to be used from indexes (i.e. used like this: indexes[range.firstElement]), numElements is (obviously) the number of elements to be used, minIndex is the minimum in (indexes[firstElement]...indexes[firstElement+numElements-1]) and maxIndex is the maximum in (indexes[firstElement]...indexes[firstElement+numElements-1]). Ranges never overlap.
I.e. for every two ranges a, b
((a.firstElement >= b.firstElement) && (a.firstElement < (b.firstElement+b.numElements))) == false
Obviously, when I do any operation on values (swap two elements, etc), I need to update indexes (so they keep pointing to the same value), and recalculate the corresponding range, so the range's minIndex and maxIndex are correct.
Now, I need to rearrange values in the way that will minimize Range.maxIndex - Range.minIndex. I do not need the "best" result after packing, having "probably the best" or "good" packing will be enough.
problem:
Remapping indexes and recalculating ranges is easy. The problem is that I'm not sure how to sort elements in values, because same index may be encountered in multiple ranges.
Any ideas about how to proceed?
restrictions:
Changing container type is not allowed. Containers should be array-like. No maps, not lists.
But you're free to use whatever container you want during the sorting.
Also, no boost or external libraries - pure C++/STL, I really need only an algorithm.
additional info:
There is no greater/less comparison defined for SomeType - only equality/non-equality.
But there should be no need to ever compare two values, only indexes.
The goal of algorithm is to make sure that output of
for (int i = 0; i < indexes.size(); i++){
print(values[indexes[i]]); //hypothetical print function
}
Will be identical before and after sorting, while also making sure that for each range
Range.maxIndex-Range.minIndex (after sorting) is as small as possible to achieve with reasonable effort.
I'm not looking for a "perfect" or "most optimal" solution, having a "probably perfect" or "probably most optimal" solution should be enough.
P.S. This is NOT a homework.
This is not an algorithm, just some thinking aloud. It will probably break if there are too many duplicates.
If there were no duplicates, you'd simply rearrange the values so the indexes are 0,1,2, and so on. So, as a starting point, let's exclude the values that are referenced more than once and arrange the rest.
Since there are duplicates, you need to figure out where to stick them. Suppose the duplicate is referred to by ranges r1, r2, r3. Now, as long as you insert the duplicate between min([r1,r2,r3].minIndex)-1 and max([r1,r2,r3].maxIndex)+1, the sum of maxIndex-minIndex will be the same no matter where you insert it. Moving the insertion point to the left will reduce max-min for all ranges to the left, but increment it for all ranges to the right. So, I think the sensible thing to do is to insert the duplicate at the left edge (minindex) of the rightmost range (one with largest minIndex) of r1,r2,r3. Repeat with all duplicates.
Okay, it looks like there is only one way to reliably solve this problem:
Make sure that no index is ever used by two ranges at once by duplicating values.
I.e. scan the entire array of indexes, and when you find an index (of a value) that is used in more than one range, add a copy of that value for each range, each with a unique index. After that the problem becomes trivial: you simply sort values so that the values array first contains the values used only by the first range, then the values for the 2nd range, and so on. This gives maximum packing.
Because in my app it is more important to minimize sum(ranges[i].maxIndex-ranges[i].minIndex) than to minimize the number of values, this approach works for me.
I do not think that there is other reliable way to solve the problem - it is quite easy to get situation when there are indexes used by every range, and in this case it will not be possible to "pack" data no matter what you do. Even allowing index to be used by two ranges at once will lead to problems - you can get ranges a, b and c where a and b, b and c, a and c will have common indexes. In this case it also won't be possible to pack the data.
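For reference, the duplication approach can be sketched roughly like this. pack_by_duplication is a made-up name, Range is a trimmed copy of the struct from the question, and the code assumes every position of indexes belongs to exactly one range. A std::unordered_map is used only as scratch space during the repacking.

```cpp
#include <unordered_map>
#include <vector>

struct Range {
    int firstElement = 0;  // first position used in indexes
    int numElements = 0;   // number of positions used
    int minIndex = 0;      // smallest index value inside the range
    int maxIndex = 0;      // largest index value inside the range
};

// Rebuilds values so that each range refers to its own contiguous block.
// A value shared by several ranges is copied once per range.
// Assumes every position of indexes lies in exactly one range.
template <typename T>
void pack_by_duplication(std::vector<T>& values, std::vector<int>& indexes,
                         std::vector<Range>& ranges) {
    std::vector<T> packed;
    for (Range& r : ranges) {
        if (r.numElements == 0) continue;
        std::unordered_map<int, int> remap;  // old index -> new index, per range
        r.minIndex = static_cast<int>(packed.size());
        for (int pos = r.firstElement; pos < r.firstElement + r.numElements; ++pos) {
            auto found = remap.find(indexes[pos]);
            if (found == remap.end()) {
                found = remap.emplace(indexes[pos],
                                      static_cast<int>(packed.size())).first;
                packed.push_back(values[indexes[pos]]);
            }
            indexes[pos] = found->second;
        }
        r.maxIndex = static_cast<int>(packed.size()) - 1;
    }
    values.swap(packed);
}
```

After the call, maxIndex - minIndex + 1 for each range equals the number of distinct values that range uses, which is the tightest packing possible once sharing between ranges is given up.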