Pattern for random subsetting with domains in Chapel

Here is my set up. In hockey, players form into "lines" which are on the ice at the same time. A "forward" line is a trio of Left Wing, Center and Right Wing. A "D" line is a pair of Left D and Right D. In beer leagues, you typically dress 13 skaters = 3 Forward lines, 2 D lines plus one Goalie.
Suppose I have 20 people who want to play. I want to construct the lines from 13 random skaters. I have to preserve their names and jersey numbers. I'm wondering if this is the job of Chapel Domains. For instance, something like
var player_ids: domain(1) = {1..20};
var jerseys: [player_ids] int = [71, 99, 97, ...];
var names: [player_ids] string = ['Alice', 'Bonobo', 'Changarakoo', ...];
It's a simple idea, but now I want to
1. Pick three random players and assign them to Line 1 F
2. Pick three from the remainder and assign them to Line 2 F
...
n-1: Use the player ids to create an indicator matrix (details aren't important)
n: WIN!
The point of n-1 is that I have to be able to reference the player id and jersey number at the end.
What is the correct pattern for this in Chapel?

Here is my suggestion. To draw players without replacement, I conceptually think of shuffling a deck of cards - where each card has a player's name on it. So this code uses Random.shuffle.
use Random;
var player_ids = {1..20};
// jersey number and name are simply keyed off player_id
// Generate an array of player IDs
var ids_array = [i in player_ids] i;
// Randomly shuffle player IDs
shuffle(ids_array);
// Now select from the shuffled IDs the players for the game.
var cur = 1;
getLine(cur, "Forward1", 3, ids_array);
getLine(cur, "Forward2", 3, ids_array);
getLine(cur, "Forward3", 3, ids_array);
getLine(cur, "D1", 2, ids_array);
getLine(cur, "D2", 2, ids_array);
getLine(cur, "Goalie", 1, ids_array);
proc getLine(ref curIndex, lineName, playersNeeded, ids_array) {
  writeln("Line ", lineName, ":");
  for i in 1..playersNeeded {
    writeln(" player ", ids_array[curIndex]); // would use name & jersey..
    curIndex += 1;
  }
}

The beer-hockey team coach can use this concept as well (the numbers will vary between runs, since no fixed RNG seed was set).
Let's go a bit deeper into the process. The randomised selection is the mathematically harder part of the story (compliance will complicate matters more in domains outside the skating rink than for the coach himself; see the legal warning below).
So, let's accept that the team setup is a static map, where players' ordinals map onto F_line{1..3,1..3}, D_line{1..2,1..2}, G, Rest{1..7}.
use Random;
var aRandomTEAM = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 ]; // a known, contiguous ENUM of <anonymised_HASH_ID#s>
permutation( aRandomTEAM ); // a known, static MAP of aRandomTEAM -> F_line{1..3}, D_line1{1..2}, G, Rest
for id in {1..13} {
  writeln( " a Static MAP position of TEAM[",
           id, "]: will be played by anonymised_HASH_ID#( ",
           aRandomTEAM[id], " )"
         );
}
For human-readable inspection, this produces:
a Static MAP position of TEAM[1]: will be played by anonymised_HASH_ID#( 20 )
a Static MAP position of TEAM[2]: will be played by anonymised_HASH_ID#( 5 )
a Static MAP position of TEAM[3]: will be played by anonymised_HASH_ID#( 11 )
a Static MAP position of TEAM[4]: will be played by anonymised_HASH_ID#( 4 )
a Static MAP position of TEAM[5]: will be played by anonymised_HASH_ID#( 15 )
a Static MAP position of TEAM[6]: will be played by anonymised_HASH_ID#( 7 )
a Static MAP position of TEAM[7]: will be played by anonymised_HASH_ID#( 16 )
a Static MAP position of TEAM[8]: will be played by anonymised_HASH_ID#( 12 )
a Static MAP position of TEAM[9]: will be played by anonymised_HASH_ID#( 8 )
a Static MAP position of TEAM[10]: will be played by anonymised_HASH_ID#( 18 )
a Static MAP position of TEAM[11]: will be played by anonymised_HASH_ID#( 19 )
a Static MAP position of TEAM[12]: will be played by anonymised_HASH_ID#( 17 )
a Static MAP position of TEAM[13]: will be played by anonymised_HASH_ID#( 3 )
However, machine-readable post-processing may map these onto the requested arrays while keeping the sensitive personal details safe and separate, with only the GUUID# reference links pointing to names and all other details. The referential integrity is both cheap and safe, and implementing a static, unique associative mapping from (intentionally) contiguous ordinals onto a proxy-anonymising hash table is trivial (see Opaque Domains and Arrays for possible further inspiration).
Legal Warning:
Due care ought to be taken when using randomisation in regulated domains, where compliance has to be documented and positive proofs of the methods' robustness performed and validated.
The documentation may provide more details on known risks of using the current randomisation implementations in some legally demanding domains:
Permuted Linear Congruential Random Number Generator
This module provides PCG random number generation routines. See http://www.pcg-random.org/ and the paper, PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation by M.E. O'Neill.
Particular attention ought to be paid to some known potential restrictions, like:
Note
For integers, this class uses a strategy for generating a value in a particular range that has not been subject to rigorous study and may have statistical problems.
For real numbers, this class generates a random value in [max, min] by computing a random value in [0,1] and scaling and shifting that value. Note that not all possible floating point values in the interval [min, max] can be constructed in this way.
Remarks like this should always attract due attention of Compliance Officers, so as to carefully pre-validate a feasibility of use within their intended ( regulated ) problem-domain mandatory practices and controlled-environments' requirements.

Related

Chapel domains : differences between `low/high` and `first/last` methods

Chapel domains have two sets of methods
domain.low, domain.high
and
domain.first, domain.last
What are the various cases where these return different results (i.e. when is domain.first != domain.low and domain.last != domain.high)?

First, note that these queries are supported not just on domains, but also on ranges (a simpler type representing an integer sequence upon which many domains, and their domain queries, are based). For that reason, my answer will initially focus on ranges for simplicity, before returning to dense rectangular domains (which are defined using a range per dimension).
As background, first and last on a range are designed to specify the indices that you'll get when iterating over that range. In contrast, low and high specify the minimal and maximal indices that define the range.
For a simple range, like 1..10, first and low will be the same, evaluating to 1, while last and high will both evaluate to 10.
The way you iterate through a range in reverse order in Chapel is by using a negative stride like 1..10 by -1. For this range, low and high will still be 1 and 10 respectively, but first will be 10 and last will be 1 since the range represents the integers 10, 9, 8, ..., 1.
Chapel also supports non-unit strides, and they can also result in differences. For example for the range 1..10 by 2, low and high will still be 1 and 10 respectively, and first will still be 1 but last will be 9 since this range only represents the odd values between 1 and 10.
The following program demonstrates these cases along with 1..10 by -2 which I'll leave as an exercise for the reader (you can also try it online (TIO)):
proc printBounds(r) {
  writeln("For range ", r, ":");
  writeln(" first = ", r.first);
  writeln(" last = ", r.last);
  writeln(" low = ", r.low);
  writeln(" high = ", r.high);
  writeln();
}
printBounds(1..10);
printBounds(1..10 by -1);
printBounds(1..10 by 2);
printBounds(1..10 by -2);
Dense rectangular domains are defined using a range per dimension. Queries like low, high, first, and last on such domains return a tuple of values, one per dimension, corresponding to the results of the queries on the respective ranges. As an example, here's a 4D domain defined in terms of the ranges above (TIO):
const D = {1..10, 1..10 by -1, 1..10 by 2, 1..10 by -2};
writeln("low = ", D.low);
writeln("high = ", D.high);
writeln("first = ", D.first);
writeln("last = ", D.last);

Return a random object from a list based on proprieties

This is quite a strange issue for me because I can't visualize my problem correctly. Just so that you know, I'm not really asking for code but just for an idea to write an appropriate algorithm that would generate some weather based on their probability of occurring.
Here's what I want to achieve :
Let's say I have a WeatherClass, with a parameter called "Probability". I want to have different weather instances with their own probability of "happening".
enum Probability {
Never = -1,
Low = 0,
Normal = 1,
Always = 2
};
std::vector<WeatherClass> WeatherContainer;
WeatherClass Sunny = WeatherClass();
Sunny.Probability = Probability.Normal;
WeatherClass Rainy = WeatherClass();
Rainy.Probability = Probability.Low;
WeatherClass Cloudy = WeatherClass();
Cloudy.Probability = Probability.Normal;
WeatherContainer.push_back(Sunny);
WeatherContainer.push_back(Rainy);
WeatherContainer.push_back(Cloudy);
Now, my question is : what is the most clever way to return some weather based on its own probability of happening?
I don't know why but I can't figure this out.. My first guess would be to have some kind of "luck" variable and compare it with the probability of each element or something similar.
Any hint or advice would be really helpful.
Greets,
required

Generally speaking, assuming you have an integer sequence of numbers representing a linear increase in probability (starting from 1, not 0!):
1,2,3,4,5,6...n
Marking pn for some specific integer (weather in your scheme, say "6"), and the sum of all the enum integers Sn, a linear probability could easily be defined as:
pn/Sn
This of course means the weather associated with "1" is least likely, and the one with "n" is most likely. Other schemes are possible, such as exponential - just need to normalize properly. Also, if you forgot your math:
Sn=(1+n)*n/2
Now you need to roll from this probability. One option, disregarding efficiency, to help you think about this:
Make a giant set, where each weather (or integer) appears as many times as the associated integer. 1 appears once, ..., n appears n times. This list is of size Sn by definition. Now use the random library:
int choice = rand() % Sn; // index between 0 and Sn-1: the chosen probability indicator
You could of course randomize the list as well for extra randomness.
An example: in our array we have probmap={1,2,2,3,3,3}. If choice==4, then probmap[4]==3. Suppose 3 corresponds to Sunny, then we have our result! There are of course ways to improve on this, such as choosing different probability functions, but I think this is a good start.
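To make the "giant set" idea concrete, here is a small Python sketch (Python only to keep it short; the question is C++). The weights 1, 2, 3 and their mapping to weather names are assumptions chosen to match the probmap example above, and `build_probmap` and `pick` are hypothetical helper names.

```python
import random

def build_probmap(weights):
    """Repeat each item as many times as its integer weight."""
    probmap = []
    for item, weight in weights.items():
        probmap.extend([item] * weight)
    return probmap

def pick(probmap, rng=random):
    """Pick a uniformly random slot; heavier items own more slots."""
    return probmap[rng.randrange(len(probmap))]

# Assumed weights: Rainy is least likely, Sunny most likely.
weights = {"Rainy": 1, "Cloudy": 2, "Sunny": 3}
probmap = build_probmap(weights)

rng = random.Random(42)
counts = {name: 0 for name in weights}
for _ in range(6000):
    counts[pick(probmap, rng)] += 1
print(probmap)
print(counts)
```

The list has Sn = 6 slots here, so Sunny should come up about half of the time. For large weights the list gets big, which is where a cumulative table becomes the better choice.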

You can generate a random number between 0 and 3, subtract 1, cast it to Probability and search your vector for a matching entry.
auto result = rand();
result %= 4;
--result;
auto prob = (Probability)result;
auto index = -1;
for( auto I = 0 ; I < WeatherContainer.size() ; ++I )
    if( WeatherContainer [ I ].Probability == prob )
    {
        index = I;
        break;
    }
if( index != -1 )
{
    // Do your thing
}

Creating probability by value from a dict

I'm creating a type of survival game in python, and am creating the different animals you can encounter and "battle" with (users will only encounter one animal at a time). I want certain animals with a lower number associated with them to have a higher probability of showing up than the animals with a higher number associated with them:
POSSIBLE_ANIMALS_ENCOUNTERED = {
"deer": 50, "elk": 75, "mountain lion": 150,
"rat": 5, "raccoon": 15, "squirrel": 5,
"black bear": 120, "grizzly bear": 200, "snake": 5,
}
So, I would like a rat to appear more than a deer and just as much as a squirrel or snake. How can I go about creating probability for the animals inside of the dict based on their value? I'd like it so that the user will see a higher percentage of animals with lower value, and a lower percentage of animals with a higher value.
For example:
The user should see an animal with a value from 1 to 5 50% of the time (0.5), an animal with a value from 6 to 50 25% of the time (0.25), and any animal with a value higher than that 10% of the time (0.10).

You need to codify the percentage likelihood of an encounter based on "health points" (encounter_ranges tuples are from the data in your edit), then do a weighted random choice from your elements. I've put comments inline:
from random import random
from bisect import bisect
POSSIBLE_ANIMALS_ENCOUNTERED = {
    "deer": 50, "elk": 75, "mountain lion": 150,
    "rat": 5, "raccoon": 15, "squirrel": 5,
    "black bear": 120, "grizzly bear": 200, "snake": 5,
}
# codify your ranges. (min, max, encounter %)
# both bounds are inclusive, so the ranges must not overlap
encounter_ranges = (
    (51, float('inf'), .10),
    (6, 50, .25),
    (1, 5, .50)
)
# create a list of their probability to be encountered
# replace .items() with .iteritems() if python 2.x
ANIMAL_ENCOUNTER_PERCENTAGE_PER_HP_LIST = []
for k, v in POSSIBLE_ANIMALS_ENCOUNTERED.items():
    for encounter_chance in encounter_ranges:
        if (v >= encounter_chance[0]) and (v <= encounter_chance[1]):
            # multiplied by 100 because we're going to use it in a weighted random below
            ANIMAL_ENCOUNTER_PERCENTAGE_PER_HP_LIST.append([k, encounter_chance[2] * 100])
# With our percentages in hand, we need to do a weighted random distribution
# if you're only planning on encountering one animal at a time.
# stolen from http://stackoverflow.com/a/4322940/559633
def weighted_choice(choices):
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    x = random() * total
    i = bisect(cum_weights, x)
    return values[i]
# run this function to return the animal you will encounter:
weighted_choice(ANIMAL_ENCOUNTER_PERCENTAGE_PER_HP_LIST)
Note that this approach will always return something -- a 100% chance encounter of some animal. If you want to make the encounters more random, that's a much simpler problem, but I didn't include it as you'd need to specify how you expect this game mechanic to work (random 0-100% chance of any encounter? a percentage chance of the encounter based on the return (50% chance of encounter if rat is returned)? etc).
Note that I wrote this for Python 3, as if you're new to Python, you really should be using 3.x, but I left a comment in what you need to switch if you decide to stick with 2.x.

What is a possible way to bias a random number generator?

I built a word generator, it picks a length and then randomly picks letters of the alphabet to make up words.
The program works but 99% of the output is rubbish as it is not observing the constructs of the English language, I am getting as many words with x and z in as I do e.
What are my options for biasing the RNG so it will use common letters more often?
I am using rand() from the stl seeded with the time.

The output will still be rubbish because biasing the random number generator is not enough to construct proper English words. But one approach to biasing the rng is:
Make a histogram of the occurrences of letters in a large English text (the corpus). You'll get something like 500 'e', 3 'x', 1 'q', 450 'a', 200 'b' and so on.
Divide an interval into ranges where each letter gets a slice, with the length of the slice being the number of occurrences of that letter in the corpus. a gets [0,450), b [450,650), ..., q [3500,3501).
Generate a random number between 0 and the total length of the interval and check where it lands. Any number within 450-650 gives you a b, but only 3500 gives you a 'q'.
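The three steps above can be sketched in a few lines of Python (used here for brevity; the same structure carries over to C++ with a sorted array and std::upper_bound). The sample corpus string and the helper names `build_slices` and `pick_letter` are assumptions for illustration.

```python
import bisect
import itertools
import random
from collections import Counter

def build_slices(corpus):
    """Histogram the letters, then turn the counts into cumulative slice ends."""
    counts = Counter(ch for ch in corpus.lower() if "a" <= ch <= "z")
    letters = sorted(counts)
    cumulative = list(itertools.accumulate(counts[l] for l in letters))
    return letters, cumulative

def pick_letter(letters, cumulative, rng=random):
    """Draw x in [0, total) and find the slice [start, end) that contains it."""
    x = rng.randrange(cumulative[-1])
    return letters[bisect.bisect_right(cumulative, x)]

# Made-up stand-in for a large English text.
corpus = "the quick brown fox jumps over the lazy dog and sees the evergreen trees"
letters, cumulative = build_slices(corpus)
rng = random.Random(1)
print("".join(pick_letter(letters, cumulative, rng) for _ in range(20)))
```

The binary search makes each draw O(log k) for k letters, which hardly matters for 26 letters but does for larger tables such as n-grams.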

One method would be to use the letter frequency. For each letter define a range: a = [0, 2] (if the letter 'a' has 2% chance of being used), b = [2, 5] (3% chance), and so forth.. then generate a random number between 0 and 100 and choose a letter.
Another method is to use a nondeterministic finite automaton where you can define certain transitions (you could parse the Bible and build your probabilities). So you have a lot of transitions like this: e.g. the transition from 'a' to 'b' is 5%. Then you walk through the automaton and generate some words.
I just saw that the proper term is Markov chain, which is probably better than an NFA.

You can do an n-gram analysis of some body of text and use that as a base for the bias. You can do this either by letters or by syllables. Doing the analysis by syllables is probably more complicated.
To do it by letters, it's easy. You iterate through each character in the source text and keep track of the last n-1 characters you came across. Then, for each next character, you add the last n-1 characters and this new one (a n-gram) to your table of frequencies.
What does this table of frequencies look like? You can use a map mapping the n-grams to their frequencies. But this approach is not very good for the algorithm I suggest below. For that it's better to map each (n-1)-grams to a map of the last letter of an n-gram to its frequency. Something like: std::map<std::string, std::map<char,int>>.
Having made the analysis and collected the statistics, the algorithm would go like this:
pick a random starting n-gram. Your previous analysis may contain weighted data for which letters usually start words;
from all the n-grams that start with previous n-1 letters, pick a random last letter (considering the weights from the analysis);
repeat until you reach the end of a word (either using a predefined length or from data about word ending frequencies);
To pick random values from a set of values with different weights, you can start by setting up a table of the cumulative frequencies. Then you pick a random number no greater than the sum of the frequencies, and see in what interval it falls.
For example:
A happens 10 times;
B happens 7 times;
C happens 9 times;
You build the following table: { A: 10, B: 17, C: 26 }. You pick a number between 1 and 26. If it is at most 10, it's A; if it's greater than 10, but at most 17, it's B; if it's greater than 17, it's C.
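As a rough sketch of the whole n-gram approach, here is a compact Python version (the C++ equivalent would use the std::map<std::string, std::map<char,int>> layout described above). The marker characters `^` and `$` for word start and end, the function names, and the training word in the usage line are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # assumed markers for word start and word end

def train(words, n=3):
    """Map each (n-1)-letter context to a Counter of the letters that follow it."""
    table = defaultdict(Counter)
    for word in words:
        padded = START * (n - 1) + word + END
        for i in range(len(padded) - n + 1):
            context, nxt = padded[i:i + n - 1], padded[i + n - 1]
            table[context][nxt] += 1
    return table

def generate(table, n=3, rng=random, max_len=12):
    """Walk the table from the start context, picking weighted next letters."""
    context, out = START * (n - 1), []
    while len(out) < max_len:
        followers = table[context]
        letters = list(followers)
        nxt = rng.choices(letters, weights=[followers[l] for l in letters])[0]
        if nxt == END:
            break
        out.append(nxt)
        context = (context + nxt)[1:]  # keep only the last n-1 characters
    return "".join(out)

table = train(["banana"], n=3)
print(generate(table, n=3, rng=random.Random(1)))
```

Here `random.choices` does the weighted pick, so the cumulative table is built internally. Training on a real word list instead of one word is what makes the output start to look English-like.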

You may want to use the English language's letter frequency to have a more realistic output : http://en.wikipedia.org/wiki/Letter_frequency.
But if you want pronounceable words, you should probably generate them from syllables. You can find more information online, e.g. here : http://spell.psychology.wustl.edu/SyllStructDistPhon/CVC.html

You could derive a Markov Model by reading a source text and then generate words which are "like" the source.
This also works for generating sentences from words. Well, sort of works.

If you want to change just the letter frequency in the words, without further lexical analysis (like the qu pair), get a list of English language letter frequencies.
Then create a weighted random generator that will have a higher chance of outputting an e (1 in 7 chance) than an x (around 1 in 1000).
To generate a weighted random generator (rand generates integers, IIRC):
1. Normalize the letter frequencies, so that they are all integers (for the Wikipedia frequencies basically multiply by 100000)
2. Make some sort of lookup table, where to each letter you assign a certain range, like the table below
letter | weight | start | end
a | 8.17% | 0 | 8167
b | 1.49% | 8168 | 9659
c | 2.78% | 9660 | 12441
d | 4.25% | 12442 | 16694
e | 12.70% | 16695 | 29396
f | 2.23% | 29397 | 31624
g | 2.02% | 31625 | 33639
.....
z | 0.07% | 99926 | 99999
3. Generate a random number between 0 and 99999, and use that to find the corresponding letter. This way, you will have the correct letter frequencies.

First, you need a table with the letters and their weights, something
like:
#include <cstdlib>   // rand
#include <iterator>  // std::begin, std::end
using std::begin;
using std::end;
struct WeightedLetter
{
    char letter;
    int weight;
};
static WeightedLetter const letters[] =
{
    { 'a', 82 },
    { 'b', 15 },
    { 'c', 28 },
    // ...
};
char getLetter()
{
    // sum all the weights
    int totalWeight = 0;
    for ( WeightedLetter const* iter = begin( letters );
            iter != end( letters );
            ++ iter ) {
        totalWeight += iter->weight;
    }
    int choice = rand() % totalWeight;
    // but you probably want a better generator
    // walk the table, subtracting each weight, until choice falls inside a slice
    WeightedLetter const* result = begin( letters );
    while ( choice >= result->weight ) {
        choice -= result->weight;
        ++ result;
    }
    return result->letter;
}
This is just off the top of my head, so it's likely to contain errors;
at the very least, the second loop requires some verification. But it
should give you the basic idea.
Of course, this still isn't going to result in English-like words. The
sequence "uq" is just as likely as "qu", and there's nothing to prevent
a word without a vowel, or a ten letter word with just vowels. The Wikipedia page on English Phonology has some good information as to what combinations can occur where, but it doesn't have any statistics on them. On the other hand, if you're trying to make up possible words, like Jabberwocky, then that may not be a problem: choose a random number of syllables, from 1 to some maximum, then an onset, a nucleus and a coda. (Don't forget that the onset and the coda can be empty.)

If you want to create pronounceable words do not try and join letters together.
Join sounds. Make a list of sounds to select from: "abe", "ape", "gre", etc.

Data structure for matching sets

I have an application where I have a number of sets. A set might be
{4, 7, 12, 18}
unique numbers and all less than 50.
I then have several data items:
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}
Data items 1, 3 and 4 match the set because they contain all items in the set.
I need to design a data structure that is super fast at identifying whether a data item includes all the members of a set (so the data item is a superset of the set). My best estimates at the moment suggest that there will be fewer than 50,000 sets.
My current implementation has my sets and data as unsigned 64 bit integers and the sets stored in a list. Then to check a data item I iterate through the list doing a ((set & data) == set) comparison. It works and it's space efficient but it's slow (O(n)) and I'd be happy to trade some memory for some performance. Does anyone have any better ideas about how to organize this?
Edit:
Thanks very much for all the answers. It looks like I need to provide some more information about the problem. I get the sets first and I then get the data items one by one. I need to check whether the data item matches one of the sets.
The sets are very likely to be 'clumpy' for example for a given problem 1, 3 and 9 might be contained in 95% of sets; I can predict this to some degree in advance (but not well).
For those suggesting memoization: this is the data structure for a memoized function. The sets represent general solutions that have already been computed and the data items are new inputs to the function. By matching a data item to a general solution I can avoid a whole lot of processing.

I see another solution which is dual to yours (i.e., testing a data item against every set) and that is using a binary tree where each node tests whether a specific item is included or not.
For instance if you had the sets A = { 2, 3 } and B = { 4 } and C = { 1, 3 } you'd have the following tree
               ___NOT_HAVE____[1]____HAVE_______
               |                               |
       _______[2]_______               _______[2]_______
       |               |               |               |
   ___[3]___       ___[3]___       ___[3]___       ___[3]___
   |       |       |       |       |       |       |       |
  [4]     [4]     [4]     [4]     [4]     [4]     [4]     [4]
  / \     / \     / \     / \     / \     / \     / \     / \
 .   B   .   B   .   B   A   A   .   B   C   B   .   B   A   A
                             B               C           C   B
                                                             C
After making the tree, you'd simply need to make 50 comparisons---or how ever many items you can have in a set.
For instance, for { 1, 4 }, you branch through the tree : right (the set has 1), left (doesn't have 2), left, right, and you get [ B ], meaning only set B is included in { 1, 4 }.
This is basically called a "Binary Decision Diagram". If you are offended by the redundancy in the nodes (as you should be, because 2^50 is a lot of nodes...) then you should consider the reduced form, which is called a "Reduced, Ordered Binary Decision Diagram" and is a commonly used data-structure. In this version, nodes are merged when they are redundant, and you no longer have a binary tree, but a directed acyclic graph.
The Wikipedia page on ROBDDs can provide you with more information, as well as links to libraries which implement this data-structure for various languages.

I can't prove it, but I'm fairly certain that there is no solution that can easily beat the O(n) bound. Your problem is "too general": every set has m = 50 properties (namely, property k is that it contains the number k) and the point is that all these properties are independent of each other. There aren't any clever combinations of properties that can predict the presence of other properties. Sorting doesn't work because the problem is very symmetric, any permutation of your 50 numbers will give the same problem but screw up any kind of ordering. Unless your input has a hidden structure, you're out of luck.
However, there is some room for speed / memory tradeoffs. Namely, you can precompute the answers for small queries. Let Q be a query set, and supersets(Q) be the collection of sets that contain Q, i.e. the solution to your problem. Then, your problem has the following key property
Q ⊆ P => supersets(Q) ⊇ supersets(P)
In other words, the results for P = {1,3,4} are a subcollection of the results for Q = {1,3}.
Now, precompute all answers for small queries. For demonstration, let's take all queries of size <= 3. You'll get a table
supersets({1})
supersets({2})
...
supersets({50})
supersets({1,2})
supersets({2,3})
...
supersets({1,2,3})
supersets({1,2,4})
...
supersets({48,49,50})
with O(m^3) entries. To compute, say, supersets({1,2,3,4}), you look up supersets({1,2,3}) and run your linear algorithm on this collection. The point is that on average (assuming each number appears in about half of the sets), supersets({1,2,3}) will not contain the full n = 50,000 elements, but only a fraction n/2^3 = 6250 of those, giving an 8-fold increase in speed.
(This is a generalization of the "reverse index" method that other answers suggested.)
Depending on your data set, memory use will be rather terrible, though. But you might be able to omit some rows or speed up the algorithm by noting that a query like {1,2,3,4} can be calculated from several different precomputed answers, like supersets({1,2,3}) and supersets({1,2,4}), and you'll use the smallest of these.
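A small Python sketch of this precomputation, under some assumptions: queries of size up to 2 are precomputed instead of 3 to keep the table small, the universe is 1..49, and `precompute` and `supersets` are names invented here. The sample collection is the one from the question.

```python
from itertools import combinations

def precompute(collection, universe, max_size=2):
    """Map every query of up to max_size numbers to the sets containing it."""
    table = {}
    for size in range(1, max_size + 1):
        for combo in combinations(universe, size):
            query = frozenset(combo)
            table[query] = [s for s in collection if query <= s]
    return table

def supersets(query, collection, table, max_size=2):
    """Return the sets in `collection` that contain every number in `query`."""
    query = frozenset(query)
    if not query:
        return list(collection)
    if len(query) <= max_size:
        return table[query]
    # Any precomputed sub-query gives a valid candidate list; scan the shortest.
    candidates = min(
        (table[frozenset(c)] for c in combinations(query, max_size)),
        key=len,
    )
    return [s for s in candidates if query <= s]

collection = [frozenset(s) for s in (
    {1, 2, 4, 7, 8, 12, 18, 23, 29},
    {3, 4, 6, 7, 15, 23, 34, 38},
    {4, 7, 12, 18},
    {1, 4, 7, 12, 13, 14, 15, 16, 17, 18},
    {2, 4, 6, 7, 13, 15},
)]
table = precompute(collection, range(1, 50))
matches = supersets({4, 7, 12, 18}, collection, table)
print([sorted(s) for s in matches])
```

Picking the shortest candidate list among all sub-queries is the "use the smallest of these" refinement mentioned above; the final linear scan only runs over that list.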

If you're going to improve performance, you're going to have to do something fancy to reduce the number of set comparisons you make.
Maybe you can partition the data items so that you have all those where 1 is the smallest element in one group, and all those where 2 is the smallest item in another group, and so on.
When it comes to searching, you find the smallest value in the search set, and look at the group where that value is present.
Or, perhaps, group them into 50 groups by 'this data item contains N' for N = 1..50.
When it comes to searching, you find the size of each group that holds each element of the set, and then search just the smallest group.
The concern with this - especially the latter - is that the overhead of reducing the search time might outweigh the performance benefit from the reduced search space.

You could use inverted index of your data items. For your example
1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
2 {3, 4, 6, 7, 15, 23, 34, 38}
3 {4, 7, 12, 18}
4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
5 {2, 4, 6, 7, 13, 15}
the inverted index will be
1: {1, 4}
2: {1, 5}
3: {2}
4: {1, 2, 3, 4, 5}
5: {}
6: {2, 5}
...
So, for any particular set {x_0, x_1, ..., x_i} you need to intersect the index sets for x_0, x_1 and the others. For example, for the set {2,3,4} you need to intersect {1,5} with {2} and with {1,2,3,4,5}. Because all the sets in the inverted index can be kept sorted, you could intersect them in time on the order of the minimum of the lengths of the sets being intersected.
There could be an issue here if you have very 'popular' items (such as 4 in our example) with a huge index set.
Some words about intersecting. You could use sorted lists in the inverted index and intersect the sets in pairs (in increasing length order). Or, as you have no more than 50K items, you could use compressed bit sets (about 6 KB for every number, fewer for sparse bit sets; with about 50 numbers that is not too greedy) and intersect the bit sets bitwise. For sparse bit sets that will be efficient, I think.
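Here is a minimal Python sketch of the inverted index, using plain Python sets for the postings rather than sorted lists or compressed bit sets (that substitution is mine, to keep the example short). `build_inverted_index` and `matching_items` are hypothetical names; the data is the example above.

```python
def build_inverted_index(items):
    """Map each number to the set of item IDs that contain it."""
    index = {}
    for item_id, members in items.items():
        for number in members:
            index.setdefault(number, set()).add(item_id)
    return index

def matching_items(query, index):
    """IDs of the items that contain every number in `query`."""
    # Intersect the shortest postings first so the running result stays small.
    postings = sorted((index.get(n, set()) for n in query), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
        if not result:
            break  # nothing can match any more
    return result

items = {
    1: {1, 2, 4, 7, 8, 12, 18, 23, 29},
    2: {3, 4, 6, 7, 15, 23, 34, 38},
    3: {4, 7, 12, 18},
    4: {1, 4, 7, 12, 13, 14, 15, 16, 17, 18},
    5: {2, 4, 6, 7, 13, 15},
}
index = build_inverted_index(items)
print(sorted(matching_items({4, 7, 12, 18}, index)))
print(sorted(matching_items({2, 3, 4}, index)))
```

Starting from the shortest postings list also softens the 'popular item' problem, since a huge list like the one for 4 is only ever intersected against an already small result.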

A possible way to divvy up the list of bitmaps would be to create an array of Compiled Nibble Indicators (CNIs).
Let's say one of your 64 bit bitmaps has bit 0 to bit 4 set.
In hex we can look at it as 0x000000000000001F
Now, let's transform that into a simpler and smaller representation.
Each 4 bit nibble either has at least one bit set, or not.
If it does, we represent it as a 1, if not we represent it as a 0.
So the hex value reduces to the bit pattern 0000000000000011, as the right hand 2 nibbles are the only ones that have bits set in them. Create an array that holds 65536 values, and use them as heads of linked lists, or a set of large arrays....
Compile each of your bitmaps into its compact CNI. Add it to the correct list, until all of the lists have been compiled.
Then take your needle. Compile it into its CNI form. Use that value to subscript to the head of the list. All bitmaps in that list have a possibility of being a match.
All bitmaps in the other lists can not match.
That is a way to divvy them up.
Now in practice, I doubt a linked list would meet your performance requirements.
If you write a function to compile a bit map to CNI, you could use it as a basis to sort your array by the CNI. Then have your array of 65536 heads, simply subscript into the original array as the start of a range.
Another technique would be to just compile a part of the 64 bit bitmap, so you have fewer heads. Analysis of your patterns should give you an idea of what nibbles are most effective in partitioning them up.
Good luck to you, and please let us know what you finally end up doing.
Evil.

The index of the sets that match the search criterion resemble the sets themselves. Instead of having the unique indexes less than 50, we have unique indexes less than 50000. Since you don't mind using a bit of memory, you can precompute matching sets in a 50 element array of 50000 bit integers. Then you index into the precomputed matches and basically just do your ((set & data) == set) but on the 50000 bit numbers which represent the matching sets. Here's what I mean.
#include <cstdint>  // uint64_t
#include <cstring>  // memset
#include <iostream>
enum
{
max_sets = 50000, // should be >= 64
num_boxes = max_sets / 64 + 1,
max_entry = 50
};
uint64_t sets_containing[max_entry][num_boxes];
#define _(x) (uint64_t(1) << x)
uint64_t sets[] =
{
_(1) | _(2) | _(4) | _(7) | _(8) | _(12) | _(18) | _(23) | _(29),
_(3) | _(4) | _(6) | _(7) | _(15) | _(23) | _(34) | _(38),
_(4) | _(7) | _(12) | _(18),
_(1) | _(4) | _(7) | _(12) | _(13) | _(14) | _(15) | _(16) | _(17) | _(18),
_(2) | _(4) | _(6) | _(7) | _(13) | _(15),
0,
};
void big_and_equals(uint64_t lhs[num_boxes], uint64_t rhs[num_boxes])
{
static int comparison_counter = 0;
for (int i = 0; i < num_boxes; ++i, ++comparison_counter)
{
lhs[i] &= rhs[i];
}
std::cout
<< "performed "
<< comparison_counter
<< " comparisons"
<< std::endl;
}
int main()
{
// Precompute matches
memset(sets_containing, 0, sizeof(uint64_t) * max_entry * num_boxes);
int set_number = 0;
for (uint64_t* p = &sets[0]; *p; ++p, ++set_number)
{
int entry = 0;
for (uint64_t set = *p; set; set >>= 1, ++entry)
{
if (set & 1)
{
std::cout
<< "sets_containing["
<< entry
<< "]["
<< (set_number / 64)
<< "] gets bit "
<< set_number % 64
<< std::endl;
uint64_t& flag_location =
sets_containing[entry][set_number / 64];
flag_location |= _(set_number % 64);
}
}
}
// Perform search for a key
int key[] = {4, 7, 12, 18};
uint64_t answer[num_boxes];
memset(answer, 0xff, sizeof(uint64_t) * num_boxes);
for (int i = 0; i < sizeof(key) / sizeof(key[0]); ++i)
{
big_and_equals(answer, sets_containing[key[i]]);
}
// Display the matches
for (int set_number = 0; set_number < max_sets; ++set_number)
{
if (answer[set_number / 64] & _(set_number % 64))
{
std::cout
<< "set "
<< set_number
<< " matches"
<< std::endl;
}
}
return 0;
}
Running this program yields:
sets_containing[1][0] gets bit 0
sets_containing[2][0] gets bit 0
sets_containing[4][0] gets bit 0
sets_containing[7][0] gets bit 0
sets_containing[8][0] gets bit 0
sets_containing[12][0] gets bit 0
sets_containing[18][0] gets bit 0
sets_containing[23][0] gets bit 0
sets_containing[29][0] gets bit 0
sets_containing[3][0] gets bit 1
sets_containing[4][0] gets bit 1
sets_containing[6][0] gets bit 1
sets_containing[7][0] gets bit 1
sets_containing[15][0] gets bit 1
sets_containing[23][0] gets bit 1
sets_containing[34][0] gets bit 1
sets_containing[38][0] gets bit 1
sets_containing[4][0] gets bit 2
sets_containing[7][0] gets bit 2
sets_containing[12][0] gets bit 2
sets_containing[18][0] gets bit 2
sets_containing[1][0] gets bit 3
sets_containing[4][0] gets bit 3
sets_containing[7][0] gets bit 3
sets_containing[12][0] gets bit 3
sets_containing[13][0] gets bit 3
sets_containing[14][0] gets bit 3
sets_containing[15][0] gets bit 3
sets_containing[16][0] gets bit 3
sets_containing[17][0] gets bit 3
sets_containing[18][0] gets bit 3
sets_containing[2][0] gets bit 4
sets_containing[4][0] gets bit 4
sets_containing[6][0] gets bit 4
sets_containing[7][0] gets bit 4
sets_containing[13][0] gets bit 4
sets_containing[15][0] gets bit 4
performed 782 comparisons
performed 1564 comparisons
performed 2346 comparisons
performed 3128 comparisons
set 0 matches
set 2 matches
set 3 matches
3128 uint64_t comparisons beat 50000 comparisons, so you win. Even in the worst case, a key that has all 50 items, you only have to do num_boxes * max_entry comparisons, which in this case is 39100. Still better than 50000.

Since the numbers are less than 50, you could build a one-to-one hash using a 64-bit integer and then use bitwise operations to test the sets in O(1) time. The hash creation would also be O(1). I think either an XOR followed by a test for zero or an AND followed by a test for equality would work. (If I understood the problem correctly.)

Put your sets into an array (not a linked list) and SORT THEM. The sorting criteria can be either 1) the number of elements in the set (number of 1-bits in the set representation), or 2) the lowest element in the set. For example, let A={7, 10, 16} and B={11, 17}. Then B<A under criterion 1), and A<B under criterion 2). Sorting is O(n log n), but I assume that you can afford some preprocessing time, i.e., that the search structure is static.
When a new data item arrives, you can use binary search (logarithmic time) to find the starting candidate set in the array. Then you search linearly through the array and test the data item against the set in the array until the data item becomes "greater" than the set.
You should choose your sorting criterion based on the spread of your sets. If all sets have 0 as their lowest element, you shouldn't choose criterion 2). Vice-versa, if the distribution of set cardinalities is not uniform, you shouldn't choose criterion 1).
Yet another, more robust, sorting criterion would be to compute the span of elements in each set, and sort them according to that. For example, the lowest element in set A is 7 and the highest is 16, so you would encode its span as 0x1007; similarly, B's span would be 0x110B. Sort the sets according to the "span code" and again use binary search to find all sets with the same "span code" as your data item.
Computing the "span code" is slow in ordinary C, but it can be done fast if you resort to assembly -- most CPUs have instructions that find the most/least significant set bit.

This is not a real answer, more an observation: this problem looks like it could be efficiently parallelized or even distributed, which would at least reduce the running time to O(n / number of cores).

You can build a reverse index of "haystack" lists that contain each element:
std::set<int> needle; // {4, 7, 12, 18}
std::vector<std::set<int>> haystacks;
// A list of each of your data sets:
// 1 {1, 2, 4, 7, 8, 12, 18, 23, 29}
// 2 {3, 4, 6, 7, 15, 23, 34, 38}
// 3 {4, 7, 12, 18}
// 4 {1, 4, 7, 12, 13, 14, 15, 16, 17, 18}
// 5 {2, 4, 6, 7, 13,
std::unordered_map<int, std::set<int>> element_haystacks;
// element_haystacks maps each integer to the sets that contain it
// (the keys are the integers from the haystack sets, and
// the set values are indexes into the 'haystacks' vector):
// 1 -> {1, 4} Element 1 is in sets 1 and 4.
// 2 -> {1, 5} Element 2 is in sets 1 and 5.
// 3 -> {2} Element 3 is in set 2.
// 4 -> {1, 2, 3, 4, 5} Element 4 is in sets 1 through 5.
std::set<int> answer_sets; // The list of haystack sets that contain your set.
bool first = true;
for (std::set<int>::const_iterator it = needle.begin(); it != needle.end(); ++it) {
    const std::set<int> &new_answer = element_haystacks[*it];
    if (first) {
        // Seed the answer with every set that contains the first element.
        answer_sets = new_answer;
        first = false;
    } else {
        std::set<int> existing_answer;
        std::swap(existing_answer, answer_sets);
        // Remove all answers that don't occur in the new element list.
        std::set_intersection(existing_answer.begin(), existing_answer.end(),
                              new_answer.begin(), new_answer.end(),
                              std::inserter(answer_sets, answer_sets.begin()));
    }
    if (answer_sets.empty()) break; // No matches :(
}
// answer_sets now lists the haystack_ids that include all your needle elements.
for (std::set<int>::const_iterator it = answer_sets.begin(); it != answer_sets.end(); ++it) {
    std::cout << "set: " << *it << std::endl;
}
If I'm not mistaken, this will have a max runtime of O(k*m), where k is the avg number of sets that an integer belongs to and m is the avg size of the needle set (<50). Unfortunately, it'll have a significant memory overhead due to building the reverse mapping (element_haystacks).
I'm sure you could improve this a bit if you stored sorted vectors instead of sets and element_haystacks could be a 50 element vector instead of a hash_map.

I'm surprised no one has mentioned that the STL contains an algorithm to handle this sort of thing for you. Hence, you should use std::includes. As its documentation describes, it performs at most 2*(N+M)-1 comparisons, for a worst-case performance of O(M+N). Note that both ranges must be sorted.
Hence:
bool isContained = std::includes( myVector.begin(), myVector.end(), another.begin(), another.end() );
If you need O( log N ) time, I'll have to yield to the other responders.

Another idea is to completely prehunt your elephants.
Setup
Create a 64 bit X 50,000 element bit array.
Analyze your search set, and set the corresponding bits in each row.
Save the bit map to disk, so it can be reloaded as needed.
Searching
Load the element bit array into memory.
Create a bit map array, 1 X 50000. Set all of the values to 1. This is the search bit array.
Take your needle, and walk through each value. Use it as a subscript into the element bit array. Take the corresponding bit array, then AND it into the search array.
Do that for all values in your needle, and your search bit array will hold a 1 for every matching element.
Reconstruct
Walk through the search bit array, and for each 1, you can use the element bit array to reconstruct the original values.

How many data items do you have? Are they really all unique? Could you cache popular data items, or use a bucket/radix sort before the run to group repeated items together?
Here is an indexing approach:
1) Divide the 50-bit field into e.g. 10 5-bit sub-fields. If you really have 50K sets then 3 17-bit chunks might be nearer the mark.
2) For each set, choose a single subfield. A good choice is the sub-field where that set has the most bits set, with ties broken almost arbitrarily - e.g. use the leftmost such sub-field.
3) For each possible bit-pattern in each sub-field note down the list of sets which are allocated to that sub-field and match that pattern, considering only the sub-field.
4) Given a new data item, divide it into its 5-bit chunks and look each up in its own lookup table to get a list of sets to test against. If your data is completely random you get a factor of two speedup or more, depending on how many bits are set in the densest sub-field of each set. If an adversary gets to make up random data for you, perhaps they find data items that almost but not quite match loads of sets and you don't do very well at all.
Possibly there is scope for taking advantage of any structure in your sets, by numbering bits so that sets tend to have two or more bits in their best sub-field - e.g. do cluster analysis on the bits, treating them as similar if they tend to appear together in sets. Or if you can predict patterns in the data items, alter the allocation of sets to sub-fields in step(2) to reduce the number of expected false matches.
Addition:
How many tables would you need to guarantee that any 2 bits always fall into the same table? If you look at the combinatorial definition in http://en.wikipedia.org/wiki/Projective_plane, you can see that there is a way to extract collections of 8 bits from 57 (= 1 + 7 + 49) bits in 57 different ways so that for any two bits at least one collection contains both of them. Probably not very useful, but it's still an answer.
