Split up a collection, with each subset respecting probabilities for properties of its items - C++

For a small game (for which I am a bit forced to use C++, so STL-based solutions can be interesting here), I encountered the following neat problem. I was wondering if there is any literature on the subject that I could read, or clever implementations.
Collection S of unique items {E1, E2, E3}, each item E having a set of properties, {P1, P2, P3...}
This collection should be split up into S1, S2, S3, S4. The exact sizes of S1..S4 are given. For the remainder of the problem we can assume the collection can indeed be split up into those sizes.
Now, for S1, a number of constraints can appear, {C1, C2..}, which specify, for instance, that no items with the property P1 may appear in it. Another constraint could be that it should favour the items with property P2 with a factor of 0.8 (we can assume these types of constraints are normalized for all of the subsets per property).
The "weighting" is not that hard to implement. I simply fill some array with candidate numbers, the ones with higher weight are represented more in this array. I then select a random element of the array. the size of the array determines accuracy/granularity (in my case, a small array suffices).
The problem is the constraints that forbid certain items from appearing. They can easily lead to a situation where one item in S still needs to be placed in one of the subsets S1, S2, S3, or S4, but this is no longer possible because the subsets are either all full, or the ones that are not full have a constraint forbidding that item. So you have to backtrack a placement. Doing so too often may distort the weighted probabilities too much.
What is this problem called, or does it easily map to another (probably NP-hard) problem?
EDIT: Example:
S = {A, B, C, D, E, F, G, H, I, J, K, L, M }
S1 = [ 0.8 probability of having VOWEL, CANNOT HAVE I or K, SIZE = 6 ]
S2 = [ 0.2 probability of having VOWEL, CANNOT HAVE M, B, E, SIZE = 7 ]
Now, suppose we start filling by FOR(LETTER IN S):
LETTER A, create a fill array based on property constraints (0.8 vs 0.2):
[ 1, 1, 1, 1, 1, 1, 1, 2, 2].
Pick a random element from that array: 1.
Now, put A in S1.
For letter I, for instance, the only candidate would be 2, since S1 has a constraint that I cannot appear in it.
Keep doing this, eventually you might end up with:
C = { M } // one more letter to distribute
S1 = A, B, D, E, F, G
S2 = C, H, I, J, K, L
Now, where to place M? It cannot be placed in S1, since that one is full, and it cannot be placed in S2 because it has a constraint that M cannot be placed in it.
The only way is to backtrack some placement, but then we might mess with the weighted distribution too much (for instance, giving S2 a vowel from S1, which flips the intended distribution around).
Note that this becomes slightly more complex (in the sense that more backtracks would be needed) when more subsets are in play, instead of just 2.

This resembles a constraint satisfaction problem (CSP) with hard and soft constraints. There are a couple of standard algorithms for that, but you have to check whether they apply to your particular problem instance.
Check Wikipedia for starters.

How about this heuristic:
1 Taking into consideration limitations due to constraints and full sets, locate any elements that only meet the criteria for a single set and place them there. If at any point one of these insertions causes a set to become full, re-evaluate the elements for meeting the criteria for only a single set.
2 Now look only at elements that could fit in exactly two sets. For each element, compute the difference in the resulting probabilities for each set if you added that element vs. if you did not. Insert the element into the set where the insert gives the best short-term result (first fit / greedy algorithm). If an insert fills up a set, re-evaluate the elements for meeting the criteria for only two sets.
3 Continue for elements that fit in 3 sets, 4 sets ... n sets.
At this point all elements will be placed into sets meeting all the constraints, but the probabilities are probably not optimal. You could continue by swapping elements between the sets (only allowing swaps that don't violate constraints), using a gradient descent or random-restart hill climbing algorithm on a function describing how closely all the probabilities are met. This will tend to converge towards the optimal solution but is not guaranteed to reach it. Continue until you meet your requirements to within an acceptable amount, until a fixed time limit is reached, or until the possible improvement drops below a set threshold.
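A rough, hedged sketch of what such an improvement pass could look like; the Item/Bucket representation, the scoring function (sum of |actual fraction - target fraction| over sets and properties), and the random swap selection are all assumptions for illustration, not a prescribed implementation:

#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical representation: each item has boolean properties; each bucket has
// a target fraction per property and a list of forbidden properties.
struct Item { std::vector<bool> props; };
struct Bucket {
    std::vector<Item> items;
    std::vector<double> target;   // target fraction per property
    std::vector<bool> forbidden;  // property index -> may not appear in this bucket
};

static bool allowed(const Bucket& b, const Item& it) {
    for (std::size_t p = 0; p < it.props.size(); ++p)
        if (it.props[p] && b.forbidden[p]) return false;
    return true;
}

// Sum over buckets and properties of |actual fraction - target fraction| (lower is better).
static double score(const std::vector<Bucket>& buckets) {
    double s = 0.0;
    for (const Bucket& b : buckets) {
        if (b.items.empty()) continue;
        for (std::size_t p = 0; p < b.target.size(); ++p) {
            double count = 0.0;
            for (const Item& it : b.items) count += it.props[p] ? 1.0 : 0.0;
            s += std::fabs(count / b.items.size() - b.target[p]);
        }
    }
    return s;
}

// One hill-climbing pass: try random constraint-respecting swaps, keep only improvements.
void improve(std::vector<Bucket>& buckets, int attempts) {
    for (int a = 0; a < attempts; ++a) {
        std::size_t x = std::rand() % buckets.size();
        std::size_t y = std::rand() % buckets.size();
        if (x == y || buckets[x].items.empty() || buckets[y].items.empty()) continue;
        std::size_t i = std::rand() % buckets[x].items.size();
        std::size_t j = std::rand() % buckets[y].items.size();
        if (!allowed(buckets[x], buckets[y].items[j]) ||
            !allowed(buckets[y], buckets[x].items[i])) continue;
        double before = score(buckets);
        std::swap(buckets[x].items[i], buckets[y].items[j]);
        if (score(buckets) > before)                              // got worse:
            std::swap(buckets[x].items[i], buckets[y].items[j]); // undo the swap
    }
}

Random restarts would wrap improve() in an outer loop that reshuffles the buckets (respecting constraints) and keeps the best scoring arrangement seen.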

Related

WHATWG: Infra Standard - Clarification (Must indices be consecutive? Is the "get the indices" definition incorrect?)

I'm creating a List data structure in PHP based on the WHATWG Infra Standard as a programming exercise, and I'm having some issues trying to clarify a couple of items.
I don't see anywhere that implies that indices in a list must be consecutive. However, it seems to be implied in the definition of getting the indices of a list:
To get the indices of a list, return the range from 0 to the list’s size, exclusive.
In the definition of getting the indices (quoted above) it says to return the range from 0 to the list's size, but wouldn't this include one more index than is actually present? I see the sentence ends with ", exclusive", but I don't know what that means in this context.
Any insight is much appreciated!
There is no operation that would allow for a "sparse" list if that's what you are asking for.
Removing items from a list will make the following items "move" to fill the now empty positions.
And regarding "return the range from 0 to the list’s size, exclusive." "range" is a link to https://infra.spec.whatwg.org/#the-exclusive-range, which states
The range n to m, exclusive, creates a new ordered set containing all of the integers from n up to and including m − 1 in consecutively increasing order, as long as m is greater than n. If m equals n, then it creates an empty ordered set.
Don't be afraid of following all the links in the specs if you want to implement it.
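If it helps, here is a tiny sketch (in C++ rather than PHP, purely for illustration) of what that exclusive range produces:

#include <vector>

// "The range n to m, exclusive": all integers from n up to and including m - 1.
// For n = 0 and m = size this is exactly the set of valid indices of the list.
std::vector<int> exclusiveRange(int n, int m)
{
    std::vector<int> out;
    for (int i = n; i < m; ++i)  // stops before m, hence "exclusive"
        out.push_back(i);
    return out;                  // empty when m == n
}

So a list of size 3 has indices 0, 1, 2.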

Simultaneously partitioning two ranges of elements the same way in C++

Is there a C++ standard method for partitioning two ranges of elements simultaneously, partitioning the second range according to the partitioning of the first? As with std::partition, the partitioning operates on one range of elements, and the other range is partitioned in the same way. Or is there an efficient way of doing this without having to copy and adapt the std::partition implementation and without zipping the two ranges?
Example:
[a, b, c, d, e] -> [a, c, e, b, d]
[f, g, h, i, j] -> [f, h, j, g, i]
While I don't have the time to code this in detail right now, I have the following idea. From what I can see in the docs for std::{stable_}partition, it in no way reports the swaps it performs to the caller. Thus, you can't extract that information and apply it to the second range. Anything relying on comparing and manually reordering the elements would probably be too slow.
Approach 1 (simpler)
A way simpler implementation occurred to me. Above I state that we can't get the swaps out of std::partition. Well, I think we can, at the cost of using some additional memory.
Turn the leading vector (the one the partition is based on) into a std::vector<std::pair<T, int>>.
At the start, each tag equals the element position in the vector.
Call std::partition with a predicate that only takes the std::pair's first member (i.e. the T) into account.
After std::partition exits, reorder the second vector as follows: the tag values in the leading vector are the element indexes in the second vector, while the element indexes of the leading vector are the new element positions. Thus, a new ordering is known, and we can reorder the second vector accordingly by performing swaps smartly.
This approach takes O(n) additional memory (n times the size of an int, or of a char for a short input vector). But it's easy to implement: only one custom predicate and one loop of swaps.
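A hedged sketch of this approach, assuming both ranges are std::vectors of the same length. Instead of a final loop of swaps it simply rebuilds both vectors from the tags, which costs the same O(n) extra memory but is less fiddly:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Tag each element of the leading range with its original index, partition the
// tagged copy on the T part only, then apply the resulting permutation to both ranges.
template <class T, class U, class Pred>
void partition_two(std::vector<T>& lead, std::vector<U>& follow, Pred pred)
{
    std::vector<std::pair<T, std::size_t>> tagged;
    tagged.reserve(lead.size());
    for (std::size_t i = 0; i < lead.size(); ++i)
        tagged.emplace_back(lead[i], i);

    std::partition(tagged.begin(), tagged.end(),
                   [&](const std::pair<T, std::size_t>& p) { return pred(p.first); });

    // The tag at new position i tells us which old element belongs there.
    std::vector<T> newLead(lead.size());
    std::vector<U> newFollow(follow.size());
    for (std::size_t i = 0; i < tagged.size(); ++i) {
        newLead[i]   = tagged[i].first;
        newFollow[i] = follow[tagged[i].second];
    }
    lead.swap(newLead);
    follow.swap(newFollow);
}

std::stable_partition could be dropped in instead of std::partition if the relative order within each group matters.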
Approach 2 (more complicated)
Based on the same assumption that swaps aren't visible from the outside, we can arrive at a solution as follows:
"glue" the two ranges together into a single linear data structure (possibly by defining a proxy adapter class)
launch std::partition on that, with a custom predicate that compares the first elements in the "zipped" DS (along the lines of pair::first); since the two arrays are now unified by means of an adapter, the swaps occur on the elements of both arrays
remove the temporary DS if required ("un-glue" the two arrays)
Edit: Given that std::partition uses std::swap, a key part of this approach will be to redefine std::swap for your pair element type (is this still called "overloading" for templates?). As far as I can see, this function can be overloaded for custom types, which means it should be possible to construct an algorithm as described in approach 2 using std::partition, by redefining both the swap function and the comparison predicate.

Finding shortest path in a graph, with additional restrictions

I have a graph with 2n vertices where every edge has a defined length. (The question included a picture of the graph, which is not reproduced here.)
I'm trying to find the length of the shortest path from u to v (smallest sum of edge lengths), with 2 additional restrictions:
The number of blue edges that the path contains is the same as the number of red edges.
The number of black edges that the path contains is not greater than p.
I have come up with an exponential-time algorithm that I think would work. It iterates through all binary combinations of length n - 1 that represent the path starting from u in the following way:
0 is a blue edge
1 is a red edge
There's a black edge whenever
the combination starts with 1. The first edge (from u) is then the first black one on the left.
the combination ends with 0. The last edge (to v) is then the last black one on the right.
adjacent digits are different. That means we went from a blue edge to a red edge (or vice versa), so there's a black one in between.
This algorithm would ignore the paths that don't meet the 2 requirements mentioned earlier, calculate the length for the ones that do, and then find the shortest one. However, doing it this way would probably be awfully slow, and I'm looking for tips on coming up with a faster algorithm. I suspect it's possible to achieve with dynamic programming, but I don't really know where to start. Any help would be very appreciated. Thanks.
Seems like a dynamic programming problem to me.
Here, v and u denote arbitrary nodes.
Source node: s
Target node: t
For a node v, such that its outgoing edges are (v,u1) [red/blue], (v,u2) [black].
D(v,i,k) = min { ((v,u1) is red ? D(u1,i+1,k) : D(u1,i-1,k)) + w(v,u1),
                 D(u2,i,k-1) + w(v,u2) }
D(t,0,k) = 0          for k <= p
D(v,i,k) = infinity   for k > p    // note: for any v
D(t,i,k) = infinity   for i != 0
Explanation:
v - the current node
i - #reds_traversed - #blues_traversed
k - #black_edges_left
The stop clauses are at the target node: you end when reaching it, and you only allow reaching it with i = 0 and k <= p.
The recursive call checks at each point "what is better: going through black, or going through red/blue?", and chooses the better of the two options.
The idea is, D(v,i,k) is the optimal result to go from v to the target (t), #reds-#blues used is i, and you can use up to k black edges.
From this, we can conclude D(s,0,p) is the optimal result to reach the target from the source.
Since |i| <= n and k <= p <= n, the total run time of the algorithm is O(n^3), assuming it is implemented with dynamic programming (memoization).
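A hedged sketch of how this recurrence might be memoized in C++. The graph representation (adjacency list with edge colours), the assumption that the graph is acyclic in the forward direction (as the ladder-shaped graph in the question appears to be), and the convention that k counts remaining black edges are all my assumptions, not part of the answer above:

#include <algorithm>
#include <limits>
#include <map>
#include <tuple>
#include <vector>

enum class Colour { Red, Blue, Black };
struct Edge { int to; Colour colour; long long w; };

const long long INF = std::numeric_limits<long long>::max() / 4;

// D(v, i, k): cheapest way to reach target t from v, where i is the red-minus-blue
// balance accumulated so far and k is the number of black edges still allowed.
long long best(int v, int i, int k, int t,
               const std::vector<std::vector<Edge>>& adj,
               std::map<std::tuple<int, int, int>, long long>& memo)
{
    if (v == t) return i == 0 ? 0 : INF;   // may only end with balance 0
    auto key = std::make_tuple(v, i, k);
    auto it = memo.find(key);
    if (it != memo.end()) return it->second;

    long long res = INF;
    for (const Edge& e : adj[v]) {
        long long sub;
        if (e.colour == Colour::Red)       sub = best(e.to, i + 1, k, t, adj, memo);
        else if (e.colour == Colour::Blue) sub = best(e.to, i - 1, k, t, adj, memo);
        else if (k > 0)                    sub = best(e.to, i, k - 1, t, adj, memo);
        else                               continue;   // no black edges left
        if (sub < INF) res = std::min(res, sub + e.w);
    }
    return memo[key] = res;
}

// The answer to the question would then be best(u, 0, p, v, adj, memo).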
Edit: Somehow I looked at the "Finding shortest path" phrase in the question and missed the "length of" phrase, where the original question later clarified its intent. So both my answers below store lots of extra data in order to easily backtrack the correct path once you have computed its length. If you don't need to backtrack after computing the length, my crude version can change its first dimension from N to 2 and just store one odd J and one even J, overwriting anything older. My faster version can drop all the complexity of managing J,R interactions and also just store its outer level as [0..1][0..H]. None of that changes the time much, but it changes the storage a lot.
To understand my answer, first understand a crude N^3 answer. (I can't figure out whether my actual answer has a better worst case than the crude N^3 one, but it has a much better average case.)
Note that N must be odd, represent that as N=2H+1. (P also must be odd. Just decrement P if given an even P. But reject the input if N is even.)
Store costs using 3 real coordinates and one implied coordinate:
J = column 0 to N
R = count of red edges 0 to H
B = count of black edges 0 to P
S = side, odd or even (S is just B%2)
We will compute/store cost[J][R][B] as the lowest cost way to reach column J using exactly R red edges and exactly B black edges. (We also used J-R blue edges, but that fact is redundant).
For convenience write to cost directly but read it through an accessor c(j,r,b) that returns BIG when r<0 || b<0 and returns cost[j][r][b] otherwise.
Then the innermost step is just:
If (S)
cost[J+1][R][B] = red[J]+min( c(J,R-1,B), c(J,R-1,B-1)+black[J] );
else
cost[J+1][R][B] = blue[J]+min( c(J,R,B), c(J,R,B-1)+black[J] );
Initialize cost[0][0][0] to zero and for the super crude version initialize all other cost[0][R][B] to BIG.
You could super crudely just loop through in increasing J sequence and whatever R,B sequence you like computing all of those.
At the end, we can find the answer as:
min( min(cost[N][H][all odd]), black[N]+min(cost[N][H][all even]) )
But half the R values aren't really part of the problem. In the first half any R>J are impossible and in the second half any R<J+H-N are useless. You can easily avoid computing those. With a slightly smarter accessor function, you could avoid using the positions you never computed in the boundary cases of ones you do need to compute.
If any new cost[J][R][B] is not smaller than a cost of the same J, R, and S but lower B, that new cost is useless data. If the last dim of the structure were map instead of array, we could easily compute in a sequence that drops that useless data from both the storage space and the time. But that reduced time is then multiplied by log of the average size (up to P) of those maps. So probably a win on average case, but likely a loss on worst case.
Give a little thought to the data type needed for cost and the value needed for BIG. If some precise value in that data type is both as big as the longest path and as small as half the max value that can be stored in that data type, then that is a trivial choice for BIG. Otherwise you need a more careful choice to avoid any rounding or truncation.
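For concreteness, a literal (and heavily hedged) transcription of the crude version into C++. The arrays red, blue, and black holding the per-column edge costs, and the exact indexing, are my assumptions based on the description above, not code from the answer:

#include <algorithm>
#include <vector>

// Crude table fill following the pseudocode above.
// red[J], blue[J], black[J]: assumed per-column edge costs, J = 0..N; N assumed odd.
long long solveCrude(int N, int P,
                     const std::vector<long long>& red,
                     const std::vector<long long>& blue,
                     const std::vector<long long>& black)
{
    const long long BIG = 1LL << 60;
    const int H = (N - 1) / 2;   // N = 2H + 1
    // cost[J][R][B]: cheapest way to reach column J using exactly R red and B black edges.
    std::vector<std::vector<std::vector<long long>>> cost(
        N + 1, std::vector<std::vector<long long>>(H + 1, std::vector<long long>(P + 1, BIG)));
    cost[0][0][0] = 0;

    // accessor that returns BIG for out-of-range R or B
    auto c = [&](int j, int r, int b) { return (r < 0 || b < 0) ? BIG : cost[j][r][b]; };

    for (int J = 0; J < N; ++J)
        for (int R = 0; R <= H; ++R)
            for (int B = 0; B <= P; ++B) {
                int S = B % 2;   // which side we are on after B black edges
                long long from = S
                    ? std::min(c(J, R - 1, B), c(J, R - 1, B - 1) + black[J])  // arrive via a red edge
                    : std::min(c(J, R, B),     c(J, R, B - 1)     + black[J]); // arrive via a blue edge
                long long edge = S ? red[J] : blue[J];
                if (from < BIG) cost[J + 1][R][B] = std::min(cost[J + 1][R][B], from + edge);
            }

    long long bestOdd = BIG, bestEven = BIG;
    for (int B = 0; B <= P; ++B) {
        if (B % 2) bestOdd  = std::min(bestOdd,  cost[N][H][B]);
        else       bestEven = std::min(bestEven, cost[N][H][B]);
    }
    return std::min(bestOdd, bestEven == BIG ? BIG : bestEven + black[N]);
}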
If you followed all that, you probably will understand one of the better ways that I thought was too hard to explain: This will double the element size but cut the element count to less than half. It will get all the benefits of the std::map tweak to the basic design without the log(P) cost. It will cut the average time way down without hurting the time of pathological cases.
Define a struct CB that contains cost and black count. The main storage is a vector<vector<CB>>. The outer vector has one position for every valid J,R combination. Those are in a regular pattern, so we could easily compute the position in the vector of a given J,R or the J,R of a given position. But it is faster to keep those incrementally, so J and R are implied rather than directly used. The vector should be reserved to its final size, which is approximately N^2/4. It may be best if you pre-compute the index for H,0.
Each inner vector has C,B pairs in strictly increasing B sequence and, within each S, strictly decreasing C sequence. Inner vectors are generated one at a time (in a temp vector), then copied to their final location, and only read (not modified) after that. Within generation of each inner vector, candidate C,B pairs will be generated in increasing B sequence. So keep the position of bestOdd and bestEven while building the temp vector. Then each candidate is pushed into the vector only if it has a lower C than best (or best doesn't exist yet). We can also treat all B<P+J-N as if B==S, so a lower C in that range replaces rather than pushing.
The implied (never stored) J,R pairs of the outer vector start with (0,0) (1,0) (1,1) (2,0) and end with (N-1,H-1) (N-1,H) (N,H). It is fastest to work with those indexes incrementally, so while we are computing the vector for implied position J,R, we would have V as the actual position of J,R and U as the actual position of J-1,R and minU as the first position of J-1,? and minV as the first position of J,? and minW as the first position of J+1,?
In the outer loop, we trivially copy minV to minU and minW to both minV and V, and pretty easily compute the new minW and decide whether U starts at minU or minU+1.
The loop inside that advances V up to (but not including) minW, while advancing U each time V is advanced, and in typical positions using the vector at position U-1 and the vector at position U together to compute the vector for position V. But you must cover the special case of U==minU in which you don't use the vector at U-1 and the special case of U==minV in which you use only the vector at U-1.
When combining two vectors, you walk through them in sync by B value, using one or the other to generate a candidate (see above) based on which B values you encounter.
Concept: Assuming you understand how a value with implied J,R and explicit C,B is stored: its meaning is that there exists a path to column J at cost C using exactly R red branches and exactly B black branches, and there does not exist a path to column J using exactly R red branches and the same S in which one of C' or B' is better and the other not worse.
Your exponential algorithm is essentially a depth-first search tree, where you keep track of the cost as you descend.
You could make it branch-and-bound by keeping track of the best solution seen so far, and pruning any branches that would go beyond the best so far.
Or, you could make it a breadth-first search, ordered by cost, so as soon as you find any solution, it is among the best.
The way I've done this in the past is depth-first, but with a budget.
I prune any branches that would go beyond the budget.
Then I run it with budget 0.
If it doesn't find any solutions, I run it with budget 1.
I keep incrementing the budget until I get a solution.
This might seem like a lot of repetition, but since each run visits many more nodes than the previous one, the previous runs are not significant.
This is exponential in the cost of the solution, not in the size of the network.
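For illustration, a generic sketch of that budgeted depth-first search, assuming strictly positive integer edge costs (so the depth is bounded by the budget) and that some path to the goal exists; it leaves out the colour-balance bookkeeping from the question for brevity:

#include <vector>

struct Arc { int to; int cost; };

// Depth-first search that prunes any branch whose accumulated cost would exceed the budget.
static bool dfs(int v, int goal, int budget, const std::vector<std::vector<Arc>>& adj)
{
    if (budget < 0) return false;   // over budget: prune
    if (v == goal) return true;
    for (const Arc& a : adj[v])
        if (dfs(a.to, goal, budget - a.cost, adj))
            return true;
    return false;
}

// Iterative deepening on the budget: the first budget that succeeds is the cost of a
// cheapest path (never terminates if the goal is unreachable; a sketch, not production code).
int cheapestCost(int start, int goal, const std::vector<std::vector<Arc>>& adj)
{
    for (int budget = 0; ; ++budget)
        if (dfs(start, goal, budget, adj))
            return budget;
}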

Why Capacity-1 in integer knapsack?

The dynamic programming solution to the integer knapsack problem,
for a knapsack of capacity C and n items, where the ith item has size Si and value Vi, is:
M(C)=max(M(C-1), M(C-Si)+Vi), where i goes from 1 to n
Here M is an array. M(C) denotes the maximum value of a knapsack of capacity C.
What is the use of M(C-1) in this relation? I mean the solution should just be this:
M(C)=max(M(C-Si)+Vi), where i goes from 1 to n
I think all the cases that M(C-1) covers are covered in M(C).
If I'm wrong, please give me an example situation.
I think you have the setup of the formula a bit confused - specifically, you've mixed up the capacity of the bag with a subproblem of n-1 items. Let's redefine a bit.
Let P denote the problem, as represented by a list of n items.
Further, let Pk represent the subproblem consisting of items at indices 1...k from the original problem, where 1 <= k <= n. Thus Pn is equivalent to P.
For each item at index i, let Vi denote the value of that item and Si denote the size of that item.
Let C be the capacity of the bag, C >= 0
Let M(Pk, C) denote the optimal solution to the problem described by Pk with a bag of capacity C. M(Pk, C) returns the list of items included in the solution (and thus also returns the value of the optimal solution and the excess capacity in the bag).
For each item, we could either include it in the optimal solution, or not include it in the optimal solution. Clearly, the optimal solution is whichever of these two options is preferable. The only corner case to consider is if the item in question cannot fit in the bag. In this case we must exclude it.
We can rely on recursion to cover every item for us, so there is no need for iteration. All together:
M(Pk,C) = if(Sk > C) M(P(k-1), C) else max(M(P(k-1),C), Vk + M(P(k-1),C-Sk))
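As a concrete illustration, here is the same recurrence written as a standard bottom-up 0/1 knapsack table in C++ (0-based item indices; nothing here is specific to the original question beyond the recurrence itself):

#include <algorithm>
#include <vector>

// M[k][c]: best value using only the first k items with capacity c.
// Follows the recurrence above: skip item k, or take it if it fits.
int knapsack(const std::vector<int>& size, const std::vector<int>& value, int capacity)
{
    const int n = static_cast<int>(size.size());
    std::vector<std::vector<int>> M(n + 1, std::vector<int>(capacity + 1, 0));
    for (int k = 1; k <= n; ++k)
        for (int c = 0; c <= capacity; ++c) {
            M[k][c] = M[k - 1][c];                 // exclude item k
            if (size[k - 1] <= c)                  // include it if it fits
                M[k][c] = std::max(M[k][c], value[k - 1] + M[k - 1][c - size[k - 1]]);
        }
    return M[n][capacity];
}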

What is the fastest way to return x,y coordinates that are present in both list A and list B?

I have two lists (list A and list B) of x,y coordinates where 0 < x < 4000, 0 < y < 4000, and they will always be integers. I need to know what coordinates are in both lists. What would be your suggestion for how to approach this?
I have been thinking about representing the lists as two grids of bits and doing bitwise & possibly?
List A has about 1000 entries and changes maybe once every 10,000 requests. List B will vary wildly in length and will be different on every run through.
EDIT: I should mention that no coordinate will be in a list twice; (1,1) cannot be in list A more than once, for example.
Represent (x,y) as a single 24 bit number as described in the comments.
Maintain A in numerical order (you said it doesn't vary much, so this should be hardly any cost).
For each element of B, do a binary search on the list. Since A is about 1000 items, you'll need at most about 10 integer comparisons (in the worst case) to check for membership.
If you have a bit more memory (about 2MB) to play with, you could create a bit-vector covering all possible 24-bit numbers and then perform a single bit operation per item to test for membership. So A would be represented by a 2^24-entry bit vector, with a bit set if the value is there (otherwise 0). To test for membership you would just check the appropriate bit.
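A hedged sketch of that bit-vector variant; the 12-bits-per-coordinate key layout is an assumption (the encoding "described in the comments" isn't shown here):

#include <cstdint>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Pack (x, y) with 1 <= x, y < 4000 into a single 24-bit key (12 bits per coordinate).
static std::uint32_t key(const Point& p)
{
    return (static_cast<std::uint32_t>(p.first) << 12) | static_cast<std::uint32_t>(p.second);
}

// Mark every key from list A in a 2^24-entry bit vector, then test each element of B.
std::vector<Point> intersect(const std::vector<Point>& a, const std::vector<Point>& b)
{
    std::vector<bool> present(1u << 24, false);   // ~2 MB as a packed bit vector
    for (const Point& p : a)
        present[key(p)] = true;

    std::vector<Point> common;
    for (const Point& p : b)
        if (present[key(p)])
            common.push_back(p);
    return common;
}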
Put the coordinates of list A into some kind of a set (probably a hash, bst, or heap), then you can quickly see if the coordinate from list B is present.
Whether you expect lookups mostly to hit or mostly to miss can influence which underlying data structure you choose.
Hashes are good at telling you whether something is in them, though depending on how they're implemented they can behave poorly when probing for something that isn't there.
BSTs (unlike heaps, which don't support efficient lookup) are equally good at telling you whether something is present or not, but theoretically don't perform as well as hashes when it is.
Since A is rather static, you may consider building a query structure and checking for every element of B whether it occurs in A. One example would be a std::set<std::pair<int,int>> A, which you can query like A.find(element_from_b) != A.end() ...
So the total running time is worst case O(b log a) (where b is the number of elements in B, and a the number in A). Note also that since a is always about 1000, log a is basically constant.
Define an ordering based on lexicographic order (sort first on x, then on y). Sort both lists based on that ordering in O(n log n) time, where n is the larger of the two list sizes. Set a pointer to the first element of each list and advance the one that points to the lesser element; when the pointers reference elements with the same value, put that value into a set (to avoid multiplicities within each list). This last part can be done in O(n) time (or O(m log m), where m is the number of elements common to both lists).
Update (based on the comment below and the edit above): since no point appears more than once in each list, you can use a list, vector, deque, or any other structure with (amortized) constant-time insertion to hold the points common to both, realizing the O(n) time performance regardless of the number of common elements.
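A short sketch of that sort-and-merge intersection (essentially what std::set_intersection does, as the next answer notes); the Point alias and the use of the pair's built-in lexicographic ordering are assumptions for illustration:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;   // lexicographic operator< sorts on x, then y

// Sort both lists, then advance two pointers, collecting equal elements.
std::vector<Point> commonPoints(std::vector<Point> a, std::vector<Point> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<Point> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if      (a[i] < b[j]) ++i;               // advance the lesser side
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }  // equal: record and advance both
    }
    return out;
}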
This is easy if you implement an STL predicate which orders two pairs (i.e. return R.x < L.x || (R.x == L.x && R.y < L.y);). You can then call std::list::sort to order them, and std::set_intersection to find the common elements. No need to write the algorithms yourself.
This is the kind of problem that just screams "Bloom Filter" at me.
If I understand correctly, you want the coordinates common to both lists -- the intersection of (the sets) list A and list B? If you are using the STL:
#include <algorithm>
#include <iterator>
#include <set>
#include <utility>
using namespace std;
// ...
set<pair<int, int> > a;  // coordinates from list A (assumed populated elsewhere)
set<pair<int, int> > b;  // coordinates from list B (assumed populated elsewhere)
set<pair<int, int> > in; // intersection
// ...
set_intersection(a.begin(), a.end(), b.begin(), b.end(), inserter(in, in.begin()));
I think hashing is your best bet.
// Pseudocode:
INPUT: two lists, each with (x,y) coordinates
find the list that's longer, call it A
hash each element in A
go to the other list, call it B
hash each element in B and look it up in the table.
if there's a match, return/store (x,y) somewhere
repeat #4 till the end
Assuming length of A is m and B's length is n, run time is O(m + n) --> O(n)
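A hedged C++ sketch of that hashing approach; it encodes each coordinate into a single integer key (layout assumed) and always hashes list A rather than explicitly picking the longer list, which is a simplification of step 1:

#include <unordered_set>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Hash-based intersection: insert every element of one list into a hash set,
// then probe with the other list. Runs in expected O(m + n) time.
std::vector<Point> hashedIntersection(const std::vector<Point>& a, const std::vector<Point>& b)
{
    auto key = [](const Point& p) { return p.first * 4000 + p.second; }; // 0 < x, y < 4000
    std::unordered_set<int> table;
    table.reserve(a.size());
    for (const Point& p : a)
        table.insert(key(p));

    std::vector<Point> common;
    for (const Point& p : b)
        if (table.count(key(p)))
            common.push_back(p);
    return common;
}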