Data structure for matching sets - c++

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
with unique numbers, all less than 50.
I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}
Data items 1, 3 and 4 match the set because they contain all items in the set.
I need to design a data structure that is super fast at identifying whether a data item includes all the members of a set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.
My current implementation has my sets and data as unsigned 64 bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space efficient but it's slow (O(n)) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?
Edit:
Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first, and I then get the data items one by one. I need to check whether the data item matches one of the sets.
The sets are very likely to be 'clumpy'; for example, for a given problem, 1, 3 and 9 might be contained in 95% of sets. I can predict this to some degree in advance (but not well).
For those suggesting memoization: this is the data structure for a memoized function. The sets represent general solutions that have already been computed, and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.
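For reference, a minimal sketch of the linear-scan baseline described above might look like this (illustrative code, not the poster's actual implementation):
#include <cstdint>
#include <vector>

// Baseline: each set/data item is a 64-bit mask with bit k set when k is in the set.
bool matches_any(const std::vector<std::uint64_t>& sets, std::uint64_t data)
{
    for (std::uint64_t set : sets)
        if ((set & data) == set)   // data contains every element of set
            return true;
    return false;
}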

I see another solution which is dual to yours (i.e., testing a data item against every set) and that is using a binary tree where each node tests whether a specific item is included or not.
For instance if you had the sets A = { 2, 3 } and B = { 4 } and C = { 1, 3 } you'd have the following tree
              ___NOT_HAVE____[1]____HAVE_______
              |                               |
      _______[2]_______               _______[2]_______
      |               |               |               |
  ___[3]___       ___[3]___       ___[3]___       ___[3]___
  |       |       |       |       |       |       |       |
 [4]     [4]     [4]     [4]     [4]     [4]     [4]     [4]
 / \     / \     / \     / \     / \     / \     / \     / \
.   B   .   B   .   B   A   AB  .   B   C   BC  .   B   AC  ABC
(A leaf lists the sets contained in the data item described by that path; "AB" means both A and B, "." means none.)
After making the tree, you'd simply need to make 50 comparisons, or however many items you can have in a set.
For instance, for { 1, 4 }, you branch through the tree : right (the set has 1), left (doesn't have 2), left, right, and you get [ B ], meaning only set B is included in { 1, 4 }.
This is basically called a "Binary Decision Diagram". If you are offended by the redundancy in the nodes (as you should be, because 2^50 is a lot of nodes...) then you should consider the reduced form, which is called a "Reduced, Ordered Binary Decision Diagram" and is a commonly used data-structure. In this version, nodes are merged when they are redundant, and you no longer have a binary tree, but a directed acyclic graph.
The Wikipedia page on ROBDDs can provide you with more information, as well as links to libraries which implement this data structure for various languages.
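For a feel of the idea, here is a tiny sketch of the unreduced lookup table for the A/B/C example above (the universe has only 4 elements, so the full table has 2^4 entries; a real 50-element universe would need the reduced ROBDD form instead):
#include <iostream>
#include <string>
#include <vector>

int main()
{
    const int A = 0b0110;                   // {2, 3}  (element k -> bit k-1)
    const int B = 0b1000;                   // {4}
    const int C = 0b0101;                   // {1, 3}
    std::vector<std::string> table(16);     // one entry per possible data bitmask
    for (int data = 0; data < 16; ++data) {
        if ((A & data) == A) table[data] += "A";
        if ((B & data) == B) table[data] += "B";
        if ((C & data) == C) table[data] += "C";
    }
    int item = 0b1001;                      // data item {1, 4}
    std::cout << table[item] << '\n';       // prints "B"
}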

I can't prove it, but I'm fairly certain that there is no solution that can easily beat the O(n) bound. Your problem is "too general": every set has m = 50 properties (namely, property k is that it contains the number k) and the point is that all these properties are independent of each other. There aren't any clever combinations of properties that can predict the presence of other properties. Sorting doesn't work because the problem is very symmetric; any permutation of your 50 numbers will give the same problem but screw up any kind of ordering. Unless your input has a hidden structure, you're out of luck.
However, there is some room for speed / memory tradeoffs. Namely, you can precompute the answers for small queries. Let Q be a query set, and supersets(Q) be the collection of sets that contain Q, i.e. the solution to your problem. Then, your problem has the following key property
Q ⊆ P => supersets(Q) ⊇ supersets(P)
In other words, the results for P = {1,3,4} are a subcollection of the results for Q = {1,3}.
Now, precompute all answers for small queries. For demonstration, let's take all queries of size <= 3. You'll get a table
supersets({1})
supersets({2})
...
supersets({50})
supersets({1,2})
supersets({2,3})
...
supersets({1,2,3})
supersets({1,2,4})
...
supersets({48,49,50})
with O(m^3) entries. To compute, say, supersets({1,2,3,4}), you look up supersets({1,2,3}) and run your linear algorithm on this collection. The point is that on average, supersets({1,2,3}) will not contain the full n = 50,000 elements, but only a fraction n/2^3 = 6250 of those, giving an 8-fold increase in speed.
(This is a generalization of the "reverse index" method that other answers suggested.)
Depending on your data set, memory use will be rather terrible, though. But you might be able to omit some rows or speed up the algorithm by noting that a query like {1,2,3,4} can be calculated from several different precomputed answers, like supersets({1,2,3}) and supersets({1,2,4}), and you'll use the smallest of these.
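As a rough illustration of this precomputation (my own sketch, using pairs instead of triples for brevity; the sample bitmasks and names are made up, not the poster's data):
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::uint64_t> sets = { /* your 50,000 bitmask sets */ 0x5F, 0x3A };
    const int m = 50;
    // pair_table[a][b] lists indices of stored sets containing both a and b (a < b).
    std::vector<std::vector<std::vector<int>>> pair_table(
        m, std::vector<std::vector<int>>(m));
    for (int i = 0; i < (int)sets.size(); ++i)
        for (int a = 0; a < m; ++a)
            for (int b = a + 1; b < m; ++b)
                if ((sets[i] >> a & 1) && (sets[i] >> b & 1))
                    pair_table[a][b].push_back(i);

    // Query: pick two elements of the data item, fetch the (hopefully short)
    // candidate list, and run the usual (set & data) == set check on it only.
    std::uint64_t data = 0x7F;
    int a = 1, b = 3;                       // two elements known to be in 'data'
    for (int i : pair_table[a][b])
        if ((sets[i] & data) == sets[i])
            std::cout << "set " << i << " matches\n";
}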

If you're going to improve performance, you're going to have to do something fancy to reduce the number of set comparisons you make.
Maybe you can partition the data items so that you have all those where 1 is the smallest element in one group, and all those where 2 is the smallest item in another group, and so on.
When it comes to searching, you find the smallest value in the search set, and look at the group where that value is present.
Or, perhaps, group them into 50 groups by 'this data item contains N' for N = 1..50.
When it comes to searching, you find the size of each group that holds each element of the set, and then search just the smallest group.
The concern with this - especially the latter - is that the overhead of reducing the search time might outweigh the performance benefit from the reduced search space.

You could use an inverted index of your data items. For your example
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}
the inverted index will be
1: {1, 4}
2: {1, 5}
3: {2}
4: {1, 2, 3, 4, 5}
5: {}
6: {2, 5}
...
So, for any particular set {x_0, x_1, ..., x_i} you need to intersect the lists for x_0, x_1 and the others. For example, for the set {2,3,4} you need to intersect {1,5} with {2} and with {1,2,3,4,5}. Because you can keep the lists in the inverted index sorted, each intersection takes time proportional to the length of the shorter of the two lists being intersected.
There could be an issue here if you have very 'popular' items (like 4 in our example) with a huge index set.
Some words about intersecting. You could use sorted lists in the inverted index and intersect them in pairs (in increasing order of length). Or, since you have no more than 50K items, you could use compressed bit sets (about 6 KB per number, less for sparse bit sets, and only about 50 numbers, so not too memory-hungry) and intersect the bit sets bitwise. For sparse bit sets that should be efficient, I think.
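For illustration, a small sketch of intersecting the inverted-index lists in increasing order of length (the names are mine; the lists would be built from your data items as shown above):
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    // inverted[x] = sorted ids of data items containing x (from the example above)
    std::vector<std::vector<int>> inverted(51);
    inverted[2] = {1, 5};
    inverted[3] = {2};
    inverted[4] = {1, 2, 3, 4, 5};

    std::vector<int> query = {2, 4};
    // start with the shortest list so every intermediate intersection stays small
    std::sort(query.begin(), query.end(), [&](int a, int b) {
        return inverted[a].size() < inverted[b].size();
    });
    std::vector<int> result = inverted[query[0]];
    for (std::size_t i = 1; i < query.size() && !result.empty(); ++i) {
        std::vector<int> next;
        std::set_intersection(result.begin(), result.end(),
                              inverted[query[i]].begin(), inverted[query[i]].end(),
                              std::back_inserter(next));
        result.swap(next);
    }
    for (int id : result) std::cout << id << '\n';  // prints 1 and 5 for the query {2, 4}
}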

A possible way to divvy up the list of bitmaps would be to create an array of Compiled Nibble Indicators (CNIs).
Let's say one of your 64-bit bitmaps has bit 0 to bit 4 set.
In hex we can look at it as 0x000000000000001F
Now, let's transform that into a simpler and smaller representation.
Each 4 bit Nibble, either has at least one bit set, or not.
If it does, we represent it as a 1, if not we represent it as a 0.
So the hex value reduces to the bit pattern 0000000000000011, as the right-hand 2 nibbles are the only ones that have bits set in them. Create an array that holds 65536 values, and use them as heads of linked lists, or a set of large arrays....
Compile each of your bitmaps into its compact CNI. Add it to the correct list, until all of the bitmaps have been compiled.
Then take your needle. Compile it into its CNI form. Use that value to subscript to the head of the list. All bitmaps in that list have a possibility of being a match.
All bitmaps in the other lists can not match.
That is a way to divvy them up.
Now in practice, I doubt a linked list would meet your performance requirements.
If you write a function to compile a bit map to CNI, you could use it as a basis to sort your array by the CNI. Then have your array of 65536 heads, simply subscript into the original array as the start of a range.
Another technique would be to just compile a part of the 64 bit bitmap, so you have fewer heads. Analysis of your patterns should give you an idea of what nibbles are most effective in partitioning them up.
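For illustration, a small sketch of compiling a 64-bit bitmap into its 16-bit CNI (my own code, following the description above):
#include <cstdint>
#include <iostream>

std::uint16_t compile_cni(std::uint64_t bits)
{
    std::uint16_t cni = 0;
    for (int nibble = 0; nibble < 16; ++nibble)
        if ((bits >> (nibble * 4)) & 0xF)           // any bit set in this 4-bit nibble?
            cni |= std::uint16_t(1) << nibble;
    return cni;
}

int main()
{
    std::cout << std::hex << compile_cni(0x000000000000001F) << '\n'; // prints 3 (binary 11)
}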
Good luck to you, and please let us know what you finally end up doing.
Evil.

The indexes of the sets that match the search criterion resemble the sets themselves. Instead of having unique indexes less than 50, we have unique indexes less than 50000. Since you don't mind using a bit of memory, you can precompute matching sets in a 50-element array of 50000-bit integers. Then you index into the precomputed matches and basically just do your ((set & data) == set), but on the 50000-bit numbers which represent the matching sets. Here's what I mean.
#include <cstdint>
#include <cstring>
#include <iostream>

enum
{
    max_sets = 50000, // should be >= 64
    num_boxes = max_sets / 64 + 1,
    max_entry = 50
};

uint64_t sets_containing[max_entry][num_boxes];

#define _(x) (uint64_t(1) << x)

uint64_t sets[] =
{
    _(1) | _(2) | _(4) | _(7) | _(8) | _(12) | _(18) | _(23) | _(29),
    _(3) | _(4) | _(6) | _(7) | _(15) | _(23) | _(34) | _(38),
    _(4) | _(7) | _(12) | _(18),
    _(1) | _(4) | _(7) | _(12) | _(13) | _(14) | _(15) | _(16) | _(17) | _(18),
    _(2) | _(4) | _(6) | _(7) | _(13) | _(15),
    0,
};

void big_and_equals(uint64_t lhs[num_boxes], uint64_t rhs[num_boxes])
{
    static int comparison_counter = 0;
    for (int i = 0; i < num_boxes; ++i, ++comparison_counter)
    {
        lhs[i] &= rhs[i];
    }
    std::cout
        << "performed "
        << comparison_counter
        << " comparisons"
        << std::endl;
}

int main()
{
    // Precompute matches
    memset(sets_containing, 0, sizeof(uint64_t) * max_entry * num_boxes);
    int set_number = 0;
    for (uint64_t* p = &sets[0]; *p; ++p, ++set_number)
    {
        int entry = 0;
        for (uint64_t set = *p; set; set >>= 1, ++entry)
        {
            if (set & 1)
            {
                std::cout
                    << "sets_containing["
                    << entry
                    << "]["
                    << (set_number / 64)
                    << "] gets bit "
                    << set_number % 64
                    << std::endl;
                uint64_t& flag_location =
                    sets_containing[entry][set_number / 64];
                flag_location |= _(set_number % 64);
            }
        }
    }
    // Perform search for a key
    int key[] = {4, 7, 12, 18};
    uint64_t answer[num_boxes];
    memset(answer, 0xff, sizeof(uint64_t) * num_boxes);
    for (int i = 0; i < int(sizeof(key) / sizeof(key[0])); ++i)
    {
        big_and_equals(answer, sets_containing[key[i]]);
    }
    // Display the matches
    for (int set_number = 0; set_number < max_sets; ++set_number)
    {
        if (answer[set_number / 64] & _(set_number % 64))
        {
            std::cout
                << "set "
                << set_number
                << " matches"
                << std::endl;
        }
    }
    return 0;
}
Running this program yields:
sets_containing[1][0] gets bit 0
sets_containing[2][0] gets bit 0
sets_containing[4][0] gets bit 0
sets_containing[7][0] gets bit 0
sets_containing[8][0] gets bit 0
sets_containing[12][0] gets bit 0
sets_containing[18][0] gets bit 0
sets_containing[23][0] gets bit 0
sets_containing[29][0] gets bit 0
sets_containing[3][0] gets bit 1
sets_containing[4][0] gets bit 1
sets_containing[6][0] gets bit 1
sets_containing[7][0] gets bit 1
sets_containing[15][0] gets bit 1
sets_containing[23][0] gets bit 1
sets_containing[34][0] gets bit 1
sets_containing[38][0] gets bit 1
sets_containing[4][0] gets bit 2
sets_containing[7][0] gets bit 2
sets_containing[12][0] gets bit 2
sets_containing[18][0] gets bit 2
sets_containing[1][0] gets bit 3
sets_containing[4][0] gets bit 3
sets_containing[7][0] gets bit 3
sets_containing[12][0] gets bit 3
sets_containing[13][0] gets bit 3
sets_containing[14][0] gets bit 3
sets_containing[15][0] gets bit 3
sets_containing[16][0] gets bit 3
sets_containing[17][0] gets bit 3
sets_containing[18][0] gets bit 3
sets_containing[2][0] gets bit 4
sets_containing[4][0] gets bit 4
sets_containing[6][0] gets bit 4
sets_containing[7][0] gets bit 4
sets_containing[13][0] gets bit 4
sets_containing[15][0] gets bit 4
performed 782 comparisons
performed 1564 comparisons
performed 2346 comparisons
performed 3128 comparisons
set 0 matches
set 2 matches
set 3 matches
3128 uint64_t comparisons beats 50000 comparisons so you win. Even in the worst case, which would be a key which has all 50 items, you only have to do num_boxes * max_entry comparisons which in this case is 39100. Still better than 50000.

Since the numbers are less than 50, you could build a one-to-one hash using a 64-bit integer and then use bitwise operations to test the sets in O(1) time. The hash creation would also be O(1). I think either an XOR followed by a test for zero or an AND followed by a test for equality would work. (If I understood the problem correctly.)

Put your sets into an array (not a linked list) and SORT THEM. The sorting criteria can be either 1) the number of elements in the set (number of 1-bits in the set representation), or 2) the lowest element in the set. For example, let A={7, 10, 16} and B={11, 17}. Then B<A under criterion 1), and A<B under criterion 2). Sorting is O(n log n), but I assume that you can afford some preprocessing time, i.e., that the search structure is static.
When a new data item arrives, you can use binary search (logarithmic time) to find the starting candidate set in the array. Then you search linearly through the array and test the data item against the set in the array until the data item becomes "greater" than the set.
You should choose your sorting criterion based on the spread of your sets. If all sets have 0 as their lowest element, you shouldn't choose criterion 2). Vice-versa, if the distribution of set cardinalities is not uniform, you shouldn't choose criterion 1).
Yet another, more robust, sorting criterion would be to compute the span of elements in each set, and sort them according to that. For example, the lowest element in set A is 7, and the highest is 16, so you would encode its span as 0x1007; similarly B's span would be 0x110B. Sort the sets according to the "span code" and again use binary search to find all sets with the same "span code" as your data item.
Computing the "span code" is slow in ordinary C, but it can be done fast if you resort to assembly -- most CPUs have instructions that find the most/least significant set bit.

This is not a real answer, more an observation: this problem looks like it could be efficiently parallelized or even distributed, which would at least reduce the running time to O(n / number of cores).

You can build a reverse index of "haystack" lists that contain each element:
std::set<int> needle; // {4, 7, 12, 18}
std::vector<std::set<int>> haystacks;
// A list of each of your data sets:
// 1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
// 2 {3, 4, 6, 7, 15, 23, 34, 38}
// 3 {4, 7, 12, 18}
// 4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
// 5 {2, 4, 6, 7, 13, 15}
std::unordered_map<int, std::set<int>> element_haystacks;
// element_haystacks maps each integer to the sets that contain it
// (the key is the integers from the haystacks sets, and
// the set values are the index into the 'haystacks' vector):
// 1 -> {1, 4} Element 1 is in sets 1 and 4.
// 2 -> {1, 5} Element 2 is in sets 1 and 5.
// 3 -> {2} Element 3 is in set 2.
// 4 -> {1, 2, 3, 4, 5} Element 4 is in sets 1 through 5.
std::set<int> answer_sets; // The list of haystack sets that contain your set.
bool first = true;
for (std::set<int>::const_iterator it = needle.begin(); it != needle.end(); ++it) {
  const std::set<int> &new_answer = element_haystacks[*it];
  if (first) {
    answer_sets = new_answer;  // seed with the haystacks of the first needle element
    first = false;
  } else {
    std::set<int> existing_answer;
    std::swap(existing_answer, answer_sets);
    // Remove all answers that don't occur in the new element list.
    std::set_intersection(existing_answer.begin(), existing_answer.end(),
                          new_answer.begin(), new_answer.end(),
                          std::inserter(answer_sets, answer_sets.begin()));
  }
  if (answer_sets.empty()) break; // No matches :(
}
// answer_sets now lists the haystack_ids that include all your needle elements.
for (std::set<int>::const_iterator it = answer_sets.begin(); it != answer_sets.end(); ++it) {
  cout << "matching haystack: " << *it << endl;
}
If I'm not mistaken, this will have a max runtime of O(k*m), where k is the avg number of sets that an integer belongs to and m is the avg size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).
I'm sure you could improve this a bit if you stored sorted vectors instead of sets and element_haystacks could be a 50 element vector instead of a hash_map.

I'm surprised no one has mentioned that the STL contains an algorithm to handle this sort of thing for you. Hence, you should use std::includes. As documented, it performs at most 2*(N+M)-1 comparisons, for a worst-case performance of O(M+N).
Hence:
bool isContained = includes( myVector.begin(), myVector.end(), another.begin(), another.end() );
If you need O(log N) time, I'll have to yield to the other responders.
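For illustration, a self-contained version of that call on the example data (note that std::includes requires both ranges to be sorted):
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> data = {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}; // data item 4, sorted
    std::vector<int> set  = {4, 7, 12, 18};                        // the set, sorted
    bool isContained = std::includes(data.begin(), data.end(), set.begin(), set.end());
    std::cout << std::boolalpha << isContained << '\n';            // prints true
}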

Another idea is to completely prehunt your elephants.
Setup
Create a 64 bit X 50,000 element bit array.
Analyze your search set, and set the corresponding bits in each row.
Save the bit map to disk, so it can be reloaded as needed.
Searching
Load the element bit array into memory.
Create a bit map array, 1 X 50000, and set all of the values to 1. This is the search bit array.
Take your needle, and walk through each value. Use it as a subscript into the element bit array. Take the corresponding bit array, then AND it into the search array.
Do that for all values in your needle, and your search bit array will hold a 1 for every matching element.
Reconstruct
Walk through the search bit array, and for each 1, you can use the element bit array, to reconstruct the original values.

How many data items do you have? Are they really all unique? Could you cache popular data items, or use a bucket/radix sort before the run to group repeated items together?
Here is an indexing approach:
1) Divide the 50-bit field into e.g. 10 5-bit sub-fields. If you really have 50K sets then 3 17-bit chunks might be nearer the mark.
2) For each set, choose a single subfield. A good choice is the sub-field where that set has the most bits set, with ties broken almost arbitrarily - e.g. use the leftmost such sub-field.
3) For each possible bit-pattern in each sub-field note down the list of sets which are allocated to that sub-field and match that pattern, considering only the sub-field.
4) Given a new data item, divide it into its 5-bit chunks and look each up in its own lookup table to get a list of sets to test against. If your data is completely random you get a factor of two speedup or more, depending on how many bits are set in the densest sub-field of each set. If an adversary gets to make up random data for you, perhaps they find data items that almost but not quite match loads of sets and you don't do very well at all.
Possibly there is scope for taking advantage of any structure in your sets, by numbering bits so that sets tend to have two or more bits in their best sub-field - e.g. do cluster analysis on the bits, treating them as similar if they tend to appear together in sets. Or if you can predict patterns in the data items, alter the allocation of sets to sub-fields in step(2) to reduce the number of expected false matches.
Addition:
How many tables would you need to guarantee that any 2 bits always fall into the same table? If you look at the combinatorial definition in http://en.wikipedia.org/wiki/Projective_plane, you can see that there is a way to extract collections of 7 bits from 57 (= 1 + 7 + 49) bits in 57 different ways so that for any two bits at least one collection contains both of them. Probably not very useful, but it's still an answer.


Is there a simd instruction/intrinsic/builtin for partial shift of elements?

A minimal example would be more beneficial:
Say I have 8 sorted ints = {10, 20, 30, 40, 50, 60, 70, 80}. (My use case is for sorted integers, but I am not sure if that information is valuable considering vector instructions act on the entire dataset.)
There are a few operations required:
Insert and shift.
-> insert 25 at its sorted location.
-> becomes insert 25 at index 2 and shift rest.
10, 20, 30, 40, 50, 60, 70, 80 becomes: 10, 20, 25, 30, 40, 50, 60, 70
Remove and shift and insert at back.
-> remove 20 from the array and insert 90 at back if 20 is found and removed.
10, 20, 30, 40, 50, 60, 70, 80 becomes 10, 30, 40, 50, 60, 70, 80, 90
Or would a set of instructions make it work?
I am trying the insert and shift part with multiple steps for a descending sorted array. https://godbolt.org/z/_WCxkW
One general approach to do what you want is (the general idea is the same for [u]int_{8,16,32,64} or even float/double):
Insert x into input:
// Shift your input array (e.g. "abcefghi") to the right:
out = ShiftRight(input); // out = 0abcefgh
// broadcast the to-be-inserted element (e.g., 'd')
insert = broadcast(x); // insert = dddddddd
// compute
out = min(max(out,insert),input)
// == min(max(0abcefgh,dddddddd),abcefghi)
// == min(ddddefgh,abcefghi) == abcdefgh
Remove first element not smaller than x from input:
// shift input (e.g., "abcdefgh") to the left (insert something at the end)
out = ShiftLeft(input); // out = bcdefghX
// determine elements smaller than `x` (e.g., "f") by broadcast and compare
mask = input < broadcast(x); // mask = 11111000
// take masked elements from `input` and other values from `out` (using a blend instruction)
out = blend(mask, input, out); // == abcdeghX
If the number of elements to be removed is not guaranteed to be 1 (i.e., it may not exist or exist multiple times), this is more difficult, since every output value potentially depends on every input value. One idea might be to compare for equality and count the number of elements (using a maskmove and popcount).
For shifting you can use
SSE2 and only one 128bit register: pslldq, psrldq
SSSE3 and a sequence of 128bit registers: palignr
AVX2 and one 256bit register: vpermd with a pre-determined index vector (there is no AVX2 equivalent of the previous instructions which works over the entire 256bit register)
If your input is stored in memory, load it again with one element offset (this requires a "safe" element beyond each end of the array -- and it may introduce a significant write-read latency if you perform these operations multiple times)
For broadcasting, I suggest just using the _mm[256]_set1_epi32 intrinsic and let the compiler figure out what is most efficient (without AVX2, this will likely require a shuffle)
Min/max operators exist for various sizes/types (depending on the SSE/AVX version) -- just search for instructions starting with pmin/pmax.
As far as I know, there are no unsigned comparisons before AVX512, but of course you can use signed comparison, if no values are bigger than the biggest signed value. Or you can workaround by flipping the upper bit before comparing (I assume there is a related question on stackoverflow).
Finally, blending is done by pblendvb if you have SSE4.1. Otherwise you need to do some bitwise-and/andnot/or operations.
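Putting the insert step together, a minimal AVX2 sketch might look like this (my own code, assuming 8 positive int32s held in one register and compiled with AVX2 enabled; the function name is made up):
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// Insert x into a sorted ascending register of 8 positive int32s, dropping the largest.
__m256i sorted_insert(__m256i input, int32_t x)
{
    const __m256i rot = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    __m256i out = _mm256_permutevar8x32_epi32(input, rot);    // rotate right one lane
    out = _mm256_blend_epi32(out, _mm256_setzero_si256(), 1); // lane 0 := 0 (sentinel; assumes values > 0)
    const __m256i ins = _mm256_set1_epi32(x);                 // broadcast x
    out = _mm256_max_epi32(out, ins);                         // "0abcefgh" -> "ddddefgh"
    return _mm256_min_epi32(out, input);                      // min with "abcefghi" -> "abcdefgh"
}

int main()
{
    alignas(32) int32_t v[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    __m256i r = sorted_insert(_mm256_load_si256((const __m256i*)v), 25);
    _mm256_store_si256((__m256i*)v, r);
    for (int i = 0; i < 8; ++i) std::printf("%d ", v[i]);     // 10 20 25 30 40 50 60 70
    std::printf("\n");
}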

C++: Recursive function for variations with repetitions, ordered by amount of different letters

I have a function that generates variations like this: 111, 112, ..., 133, 211, 212, ..., 233, 311, ..., 333. Length of generated sequences always matches length of dictionary; with 4 symbols it'd be 1111 to 4444.
This is done in a brute-force algorithm for graph coloring. We're trying to find the right sequence that has as few different colors as possible, i.e. if both 12343 and 12321 are solutions, we'd prefer the latter.
Right now I go and check each and every sequence to see if it's right, and then store the best result in the process. It's not really good code.
So my professor asked me to write a function that generates variations in a specific order. These sequences should come ordered by their amount of different numbers, like this: 111, 222, 333; 112, 113, 121, …, 323; 123, 213. In this case, if we found out that, say, 121 is right, we just stop, because we already know that it's the best solution.
The idea is to skip as much sequence checks as possible so the code would run faster. Please help :)
Right now I use this code:
init function
std::vector<int> res; //contains the "alphabet"
res.reserve(V);
for (int i = V - 1; i >= 0; i--) {
    res.push_back(i);
}
std::vector<int> index(res.size());
std::vector<int> bestresult; //here goes the best answer if it's found
bestresult.reserve(V);
for (int i = V - 1; i >= 0; i--) {
    bestresult.push_back(i);
}
int bestcolors = V;
permutate(res, index, 0, bestresult, bestcolors);
result = bestresult;
permutate:
void Graph::permutate(const std::vector<int>& s, std::vector<int>& index, std::size_t depth, std::vector<int>& bestres, int &lowestAmountOfColors)
{
    if (depth == s.size()) {
        //doing all needed checks and saving bestresult here;
        return;
    }
    for (std::size_t i = 0; i < s.size(); ++i) {
        index[depth] = i;
        permutate(s, index, depth + 1, bestres, lowestAmountOfColors);
    }
}
How can I alter these functions?
The challenge is to find all permutations of colors so that you can test if they are a valid graph coloring. Unfortunately, it is exponential. So we need to search the permutations in a way that we check the smallest solutions first, and we need to prune the solution space dramatically.
To find the smallest solutions first, we must limit the number of colors available, and exhaust those permutations before we grow the number of colors. Pretty simple. We just need a function that considers n colors for N vertices. The number of vertices remains fixed, but we consider n=1, then n=2, etc.
Within the function, we know that we need various combinations of 1, 2, ... n with enough repetition to get a total of N different values. So I made a vector of counts. This vector has n entries, and the values sum up to N.
For example, if we are considering three-color solutions for a graph with 8 vertices, one possible counts array would be {4, 3, 1}, which would be used to generate the candidate {1, 1, 1, 1, 2, 2, 2, 3}. Color 1 appears 4 times. Color 2 appears 3 times. Color 3 appears 1 time.
The cool thing about this counts array is that as long as it is sorted greatest to least, its combinations cannot duplicate any other combination we have considered, because colors are interchangeable. (Okay, not entirely accurate; there are some duplications when colors have the same count, but we eliminated a lot of permutations from ever being looked at, which is the whole point.)
Once you reduce the counts array to an actual candidate solution, you can find all orderings using combinations, not permutations. This will generate fewer candidates. Google next_combination to find some good code showing how to do this.
When we generate the counts array, I initialized all values to 1, then added all the remaining counts to the first color. I search ALL combinations which meet the counts array. Then I get the next candidate by shifting the counts to the right in such a way that it remains sorted.
So to sum up, find_minimum_graph_coloring has a for loop which calls solve_for_n. That function generates all the possible counts-arrays for that value of n, and calls another function. That function checks all combinations for that counts-array.
The first for loop checks smaller numbers of colors first, so we can return immediately upon finding a solution. The counts-array notation eliminates many equivalent colorations, so if we consider {1, 1, 2} then we will never try {2, 2, 1}.
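As a rough sketch of how the counts-array enumeration could look (my own code and naming, not the answerer's; it generates every greatest-to-least counts array for N vertices and n colors):
#include <algorithm>
#include <iostream>
#include <vector>

void counts_arrays(int remaining, int parts, int max_part, std::vector<int>& cur,
                   std::vector<std::vector<int>>& out)
{
    if (parts == 1) {
        if (remaining >= 1 && remaining <= max_part) {
            cur.push_back(remaining);
            out.push_back(cur);
            cur.pop_back();
        }
        return;
    }
    // keep the array sorted greatest-to-least and leave at least 1 per remaining part
    for (int c = std::min(max_part, remaining - (parts - 1)); c >= 1; --c) {
        cur.push_back(c);
        counts_arrays(remaining - c, parts - 1, c, cur, out);
        cur.pop_back();
    }
}

int main()
{
    std::vector<std::vector<int>> out;
    std::vector<int> cur;
    counts_arrays(8, 3, 8, cur, out);  // 8 vertices, 3 colors
    for (const auto& v : out) {        // prints {6,1,1}, {5,2,1}, {4,3,1}, {4,2,2}, {3,3,2}
        for (int c : v) std::cout << c << ' ';
        std::cout << '\n';
    }
}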

Determine unique values across multiple sets

In this project, there are multiple sets which hold values from 1 - 9. Within this, I need to efficiently determine if there are values that are unique to one set and not in the others.
For Example:
std::set<int> s_1 = { 1, 2, 3, 4, 5 };
std::set<int> s_2 = { 2, 3, 4 };
std::set<int> s_3 = { 2, 3, 4, 6 };
Note: The number of sets is unknown until runtime.
As you can see, s_1 contains the unique values 1 and 5, and s_3 contains the unique value 6.
After determining the unique values, the aforementioned sets should then just contain the unique values like:
// s_1 { 1, 5 }
// s_2 { 2, 3, 4 }
// s_3 { 6 }
What I've tried so far is to loop through all the sets and record the count of the numbers that have appeared. However I wanted to know if there is a more efficient solution out there.
There are standard algorithms in the C++ standard library for intersection, difference, and union operations on 2 sets.
If I understood your problem correctly, you could do this:
do an intersection on all sets (in a loop) to determine a base, and then apply a difference between each set and the base.
You could benchmark this against your current implementation. It should be faster.
Check out this answer.
Getting Union, Intersection, or Difference of Sets in C++
EDIT: cf. Tony D.'s comment: You can basically do the same operation using a std::bitset<> and binary operators (&, |, etc.), which should be faster.
Depending on the actual size of your input, might be well worth a try.
I would suggest something in C# like this:
Dictionary<int, int> result = new Dictionary<int, int>();
foreach (var set in sets) {
    foreach (int i in set) {
        if (!result.ContainsKey(i))
            result.Add(i, 1);
        else
            result[i] = result[i] + 1;
    }
}
Now the numbers with a count of only 1 are unique; then find the sets that contain these numbers...
I would suggest:
Start by inserting all the elements of all the sets into a multimap.
Here each element is a key, and the set name will be the value.
Once your multimap is filled with all the elements of all the sets,
loop through the multimap and take the count of each element in the
multimap.
If the count is 1 for any key, this means it is unique, and the value
for that key will be the set name.
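For illustration, a minimal C++ sketch of the counting approach described in the answers above (my own code; it only finds which values are unique to a single set):
#include <iostream>
#include <map>
#include <set>
#include <vector>

int main()
{
    std::vector<std::set<int>> sets = {
        {1, 2, 3, 4, 5}, {2, 3, 4}, {2, 3, 4, 6}
    };
    // Count how many sets each value appears in.
    std::map<int, int> counts;
    for (const auto& s : sets)
        for (int v : s)
            ++counts[v];
    // A value is "unique" if it appears in exactly one set.
    for (std::size_t i = 0; i < sets.size(); ++i) {
        std::cout << "s_" << i + 1 << " unique:";
        for (int v : sets[i])
            if (counts[v] == 1)
                std::cout << ' ' << v;
        std::cout << '\n';     // s_1 unique: 1 5 / s_2 unique: / s_3 unique: 6
    }
}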

High Performance Computing: the use of shared_array vs atomics?

I'm curious if anyone here has knowledge on the efficiency of atomics, specifically std::atomic<int>. My problem goes as follows:
I have a data set, say data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} that is passed into an algorithm algo(begin(data), end(data)). algo partitions the data into chunks and executes each chunk asynchronously, so algo would perform its operation on, say, 4 different chunks:
{1, 2, 3}
{4, 5, 6}
{7, 8, 9}
{10, 11, 12}
In each separate partition I need to return, at the end of that partition, the count of elements that satisfy a predicate op:
//partition lambda function
{
    //'it' corresponds to the position in its respective partition
    if( op(*it) )
        count++;
    //return the count at the end of this partition
    return count;
}
the problem is that I'm going to run into a data race just incrementing 1 variable with 4 chunks executing asynchronously. I was thinking of two possible solutions:
use a std::atomic
the problem here is I know very little about C++'s atomics, and from what I've heard they can be inefficient. Is this true? What results should I expect to see from using atomics to keep track of a count?
use a shared array, where the size is the partition count
I know my shared arrays pretty well so this idea doesn't seem too bad, but I'm unsure how it would hold up when a very small chunk size is given, which would make the shared array keeping track of the count at the end of each partition quite large. It would be useful however, as the algorithm doesn't have to wait for anything to finish to increment; it simply places its respective count in the shared array.
So with both of my ideas, it could possibly be implemented as:
//partition lambda function, count is now atomic
{
    //'it' corresponds to the position in its respective partition
    if( op(*it) )
        count++;
    //return the count at the end of this partition
    return count.load();
}
//partition lambda function, count is in a shared array that will be accessed later
//instead of returned
{
    int count = 0;
    //'it' corresponds to the position in its respective partition
    if( op(*it) )
        count++;
    //total count at end of each partition. ignore fact that partition_id = 0 wouldn't work
    shared_arr[partition_id] = shared_arr[partition_id - 1] + count;
}
any ideas on atomic vs shared_array?
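As an illustration of the two options (not an answer from the thread, just a sketch), here is a small comparison of one shared std::atomic<int> versus one slot per partition, counted with std::async; 'op' is a stand-in predicate:
#include <atomic>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    auto op = [](int x) { return x % 2 == 0; };
    const std::size_t parts = 4, chunk = data.size() / parts;

    // Option 1: one shared std::atomic<int>; every increment contends on it.
    std::atomic<int> atomic_count{0};
    // Option 2: one slot per partition; no sharing until the final sum.
    std::vector<int> per_partition(parts, 0);

    std::vector<std::future<void>> tasks;
    for (std::size_t p = 0; p < parts; ++p) {
        tasks.push_back(std::async(std::launch::async, [&, p] {
            int local = 0;
            for (std::size_t i = p * chunk; i < (p + 1) * chunk; ++i)
                if (op(data[i])) { ++local; ++atomic_count; }
            per_partition[p] = local;   // each task writes only its own slot
        }));
    }
    for (auto& t : tasks) t.get();

    std::cout << atomic_count.load() << ' '
              << std::accumulate(per_partition.begin(), per_partition.end(), 0) << '\n';
}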

Subsequence search

I have a large number of lists (35 MB in total) which I would like to search for subsequences: each term must appear in order but not necessarily consecutively. So 1, 2, 3 matches each of
1, 2, 3, 4, 5, 6
1, 2, 2, 3, 3, 3
but not
6, 5, 4, 3, 2, 1
123, 4, 5, 6, 7
(, is a delimiter, not characters to match.)
Short of running a regex (/1, ([^,]+, )*2, ([^,]+, )*3/ for the example) on tens or hundreds of thousands of sequences, how can I determine which sequences are a match? I can preprocess the sequences, though memory usage needs to stay reasonable (within a constant factor of the existing sequence size, say). The longest sequence is short, less than a kilobyte, so you can assume queries are short as well.
This reminds me of sequence alignment from bioinformatics, where you try to match a small snippet of DNA against a large database. The differences are your presumably larger alphabet, and your increased tolerance for arbitrarily long gaps.
You may find some inspiration looking at the existing tools and algorithms, notably Smith-Waterman and BLAST.
If the individual numbers are spread out over the file and not occurring on the majority of lines then a simple indexing of the line number where they occur could give you a speed up. This will however be slower if your data are lines of the same numbers repeated in different orders.
To build the index would only require a single pass of the data along these lines:
Hash<int, List<int>> index
line_number = 0
foreach(line in filereader)
{
    line_number += 1
    foreach(parsed_number in line)
        index[parsed_number].append(line_number)
}
That index could be stored and reused for the dataset. To search on it you would only need something like this. Please excuse the mixed pseudocode; I've tried to make it as clear as possible. It "return"s when it's out of possible matches and "yield"s a line number when all of the elements of the subsequence occur on that line.
// prefilled hash linking number searched for to a list of line numbers
// the lines should be in ascending order
Hash<int, List<int>> index

// The subsequence we're looking for
List<int> subsequence = {1, 2, 3}
int len = subsequence.length()

// Take all the lists from the index that match the numbers we're looking for
List<List<int>> lines = index[number] for number in subsequence

// holder for our current search row
// has the current lowest line number each element occurs on
int[] search = new int[len]
for(i = 0; i < len; i++)
    search[i] = lines[i].pop()

while(true)
{
    // minimum line number, substring position and whether they're equal
    min, pos, eq = search[0], 0, true
    // find the lowest line number and whether they all match
    for(i = 0; i < len; i++)
    {
        if(search[i] < min)
            min, pos, eq = search[i], i, false
        else if (search[i] > min)
            eq = false
    }
    // if they do all match, every one of the numbers occurs on that row
    if(eq)
    {
        yield min; // line has all the elements
        foreach(list in lines)
            if(list.empty()) // one of the numbers isn't in any more lines
                return
        // update the search to the next lowest line number for every substring element
        for(i = 0; i < len; i++)
            search[i] = lines[i].pop()
    }
    else
    {
        // the lowest line number for each element is not the same, so discard the lowest one
        if(lines[pos].empty()) // there are no more lines for the element we'd be updating
            return
        search[pos] = lines[pos].pop();
    }
}
Notes:
This could trivially be extended to store the position in the line as well as the line number and then only a little extra logic at the "yield" point would be able to determine an actual match instead of just that all the items are present.
I've used "pop" to show how it's moving through the line numbers but you don't actually want to be destroying your index every search.
I've assumed the numbers all fit into ints here. Extend it to longs or even map the string representation of each number to an int if you have really huge numbers.
There are some speedups to be had, especially in skipping lines at the "pop" stages, but I went for the clearer explanation.
Whether using this or another method you could also chop down the computation depending on the data. A single pass to work out whether each line is ascending, descending, all odd, all even, or what the highest and lowest numbers are could be used to cut down the search space for each substring. Whether these would be useful depends entirely on your dataset.
Maybe I misunderstood, but isn't this straightforward, like this?
search = [1, 2, 3]
for sequence in sequences:
    sidx = 0
    for item in sequence:
        if item == search[sidx]:
            sidx++
            if sidx >= len(search): break
    if sidx >= len(search):
        print sequence + " matches"
it seems to be O(N) for N sequences
and O(M) for searching for subsequence length M
not sure if this would be that much faster than a regex though?
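For reference, a C++ rendering of that straightforward scan (a sketch, not the poster's code):
#include <iostream>
#include <vector>

// true if 'needle' appears in 'seq' in order, not necessarily consecutively
bool is_subsequence(const std::vector<int>& needle, const std::vector<int>& seq)
{
    std::size_t sidx = 0;
    for (int item : seq) {
        if (sidx < needle.size() && item == needle[sidx])
            ++sidx;
        if (sidx == needle.size())
            return true;
    }
    return sidx == needle.size();
}

int main()
{
    std::vector<int> search = {1, 2, 3};
    std::vector<std::vector<int>> sequences = {
        {1, 2, 3, 4, 5, 6}, {1, 2, 2, 3, 3, 3}, {6, 5, 4, 3, 2, 1}, {123, 4, 5, 6, 7}
    };
    for (const auto& s : sequences)
        std::cout << (is_subsequence(search, s) ? "matches" : "no match") << '\n';
}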