Finding zero crossings from a list in python

Finding zero crossings from a list in python - python-2.7

I did write a code for doing this:
for i in xrange(len(Derivative)):
if ((Derivative[i-1] > Derivative[i]) and (Derivative[i+1] < Derivative[i]) and (Derivative[i-1] > 0.0) and (Derivative[i+1] < 0.0) and (Derivative[i] > 0.0)):
print str(i+1)
Here Derivative is list in which I have to detect zero crossing mainly the ones where the values just before zero crossing is positive and one after zero crossing to be negative.
I have attached the graph of Derivative to further elucidate the problem!
I wish to know if there is better way of doing this in Python, I mean shorter and precise code ?

You only need to compare the current derivative with the one following it. As a result, you can delete any references to Derivative[i-1]. If Derivative[i] is greater than zero, and Derivative[i+1] is less than zero, then necessarily Derivative[i+1] < Derivative[i]. So you can delete that condition as well. This leaves:
for i in xrange(len(Derivative)-1):
if Derivative[i] > 0 and Derivative[i+1] < 0:
print "crossing between indexes {} and {}".format(i, i+1)
Also, I shortened the argument to xrange by one. Otherwise, you'd get list index out of range.

Related

Random selection of a number from a list of numbers that is biased toward the lowest number

At every moment t, I have a different set of positive integers. I need to randomly select one of them, satisfying the criteria that the probability of a particular number to be selected from the set must be proportionally higher, the lower is the value of the number. At moment t+1, we have another set of positive integers, and again we need to select one of them satisfying the same criteria. So on, so forth. How to do this in c++?

One way would be to assign ranges to each number, randomize a value, and pick the number which has the range in which the randomized value is included.
Example
Input:
1 2 3
Ranges:
[0,300) [300, 450) [450, 550)
Probability for each number:
~55% ~27% ~18%
With random value between 0 and 550, ex. 135 the selected number would be 1 since 0 <= 135 < 300.
As to how you do this in C++, try it for yourself first.

What you're talking about is basically the Geometric distribution, or at least a chopped and normalized variant of it.
C++11 includes a geometric distribution generator, and you can use that directly if you like, but to keep things simple, you can also do something like this:
int genRandom(int count)
{
while(true)
{
int x = 0;
while(x < count-1)
{
if(rand() < RAND_MAX/2) // with probability 0.5...
{
return x;
}
x++;
}
}
}
Note that here I'm using rejection sampling to ensure that a value gets picked, rather than return count-1 if I run past it (which would give the last two items equal probability).

Given a string, find two identical subsequences with consecutive indexes C++

I need to construct an algorithm (not necessarily effective) that given a string finds and prints two identical subsequences (by print I mean color for example). What more, the union of the sets of indexes of these two subsequences has to be a set of consecutive natural numbers (a full segment of integers).
In mathematics, the thing what I am looking for is called "tight twins", if it helps anything. (E.g., see the paper (PDF) here.)
Let me give a few examples:
1) consider string 231213231
It has two subsequences I am looking for in the form of "123". To see it better look at this image:
The first subsequence is marked with underlines and the second with overlines. As you can see they have all the properties I need.
2) consider string 12341234
3) consider string 12132344.
Now it gets more complicated:
4) consider string: 13412342
It is also not that easy:
I think that these examples explain well enough what I meant.
I've been thinking a long time about an algorithm that could do that but without success.
For coloring, I wanted to use this piece of code:
using namespace std;
HANDLE hConsole;
hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hConsole, k);
where k is color.
Any help, even hints, would be highly appreciated.

Here's a simple recursion that tests for tight twins. When there's a duplicate, it splits the decision tree in case the duplicate is still part of the first twin. You'd have to run it on each substring of even length. Other optimizations for longer substrings could include hashing tests for char counts, as well as matching the non-duplicate portions of the candidate twins (characters that only appear twice in the whole substring).
Explanation of the function:
First, a hash is created with each character as key and the indexes it appears in as values. Then we traverse the hash: if a character count is odd, the function returns false; and indexes of characters with a count greater than 2 are added to a list of duplicates - characters half of which belong in one twin but we don't know which.
The basic rule of the recursion is to only increase i when a match for it is found later in the string, while maintaining a record of chosen matches (js) that i must skip without looking for a match. It works because if we find n/2 matches, in order, by the time j reaches the end, that's basically just another way of saying the string is composed of tight twins.
JavaScript code:
function isTightTwins(s){
var n = s.length,
char_idxs = {};
for (var i=0; i<n; i++){
if (char_idxs[s[i]] == undefined){
char_idxs[s[i]] = [i];
} else {
char_idxs[s[i]].push(i);
}
}
var duplicates = new Set();
for (var i in char_idxs){
// character with odd count
if (char_idxs[i].length & 1){
return false;
}
if (char_idxs[i].length > 2){
for (let j of char_idxs[i]){
duplicates.add(j);
}
}
}
function f(i,j,js){
// base case positive
if (js.size == n/2 && j == n){
return true;
}
// base case negative
if (j > n || (n - j < n/2 - js.size)){
return false;
}
// i is not less than j
if (i >= j) {
return f(i,j + 1,js);
}
// this i is in the list of js
if (js.has(i)){
return f(i + 1,j,js);
// yet to find twin, no match
} else if (s[i] != s[j]){
return f(i,j + 1,js);
} else {
// maybe it's a twin and maybe it's a duplicate
if (duplicates.has(j)) {
var _js = new Set(js);
_js.add(j);
return f(i,j + 1,js) | f(i + 1,j + 1,_js);
// it's a twin
} else {
js.add(j);
return f(i + 1,j + 1,js);
}
}
}
return f(0,1,new Set());
}
console.log(isTightTwins("1213213515")); // true
console.log(isTightTwins("11222332")); // false

WARNING: Commenter גלעד ברקן points out that this algorithm gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I'll try to figure out what the problem is, but in the meantime DO NOT TRUST THIS ALGORITHM!
I've thought of a solution that will take O(n^3) time and O(n^2) space, which should be usable on strings of up to length 1000 or so. It's based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity I'll describe how to find a minimal-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting at the next position in the input string.
"Self-avoiding" common subsequences
If the length-2n input string S has the "tight twin" (TT) property, then it has a common subsequence with itself (or equivalently, two copies of S have a common subsequence) that:
is of length n, and
obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact we can safely tighten the latter constraint to no character position in the first copy of S is ever matched to an equal or lower character position in the second copy, due to the fact that we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimal-length TT substring, it's always possible to assign characters to the two subsequences A and B so that for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since clearly you can't cram more than 2 sets of n characters into a length-2n string), so if such a length-n SACS exists then it must be of maximum possible length. So to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself, and check whether this in fact has length n.
Computation by dynamic programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
f(0, _) = 0
f(_, 0) = 0
f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and n, and the computation required outside of recursive steps is O(1). (Actually we need only compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal will be dominated by the corresponding cell (j, i) above it -- though that doesn't alter the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n -- that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j=1 to j=2n lets us fill out the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (since if there had been, we would have found it earlier and already terminated).

Let the string length be N.
There are two approaches.
Approach 1. This approach is always exponential-time.
For each possible subsequence of length 1..N/2, list all occurences of this subsequence. For each occurence, list positions of all characters.
For example, for 123123 it should be:
(1, ((1), (4)))
(2, ((2), (5)))
(3, ((3), (6)))
(12, ((1,2), (4,5)))
(13, ((1,3), (4,6)))
(23, ((2,3), (5,6)))
(123, ((1,2,3),(4,5,6)))
(231, ((2,3,4)))
(312, ((3,4,5)))
The latter two are not necessary, as their appear only once.
One way to do it is to start with subsequences of length 1 (i.e. characters), then proceed to subsequences of length 2, etc. At each step, drop all subsequences which appear only once, as you don't need them.
Another way to do it is to check all 2**N binary strings of length N. Whenever a binary string has not more than N/2 "1" digits, add it to the table. At the end drop all subsequences which appear only once.
Now you have a list of subsequences which appear more than 1 time. For each subsequence, check all the pairs, and check whether such a pair forms a tight twin.
Approach 2. Seek for tight twins more directly. For each N*(N-1)/2 substrings, check whether the substring is even length, and each character appears in it even number of times, and then, being its length L, check whether it contains two tight twins of the length L/2. There are 2**L ways to divide it, the simplest you can do is to check all of them. There are more interesting ways to seek for t.t., though.

I would like to approach this as a dynamic programming/pattern matching problem. We deal with characters one at a time, left to right, and we maintain a herd of Non-Deterministic Finite Automata / NDFA, which correspond to partial matches. We start off with a single null match, and with each character we extend each NDFA in every possible way, with each NDFA possibly giving rise to many children, and then de-duplicate the result - so we need to minimise the state held in the NDFA to put a bound on the size of the herd.
I think a NDFA needs to remember the following:
1) That it skipped a stretch of k characters before the match region.
2) A suffix which is a p-character string, representing characters not yet matched which will need to be matched by overlines.
I think that you can always assume that the p-character string needs to be matched with overlines because you can always swap overlines and underlines in an answer if you swap throughout the answer.
When you see a new character you can extend NDFAs in the following ways:
a) An NDFA with nothing except skips can add a skip.
b) An NDFA can always add the new character to its suffix, which may be null
c) An NDFA with a p character string whose first character matches the new character can turn into an NDFA with a p-1 character string which consists of the last p-1 characters of the old suffix. If the string is now of zero length then you have found a match, and you can work out what it was if you keep links back from each NDFA to its parent.
I thought I could use a neater encoding which would guarantee only a polynomial herd size, but I couldn't make that work, and I can't prove polynomial behaviour here, but I notice that some cases of degenerate behaviour are handled reasonably, because they lead to multiple ways to get to the same suffix.

Calculating the mean of the data

/***************************************************************************
Description : Calculates the trimmed mean of the data.
Comments : trim defaults to 0. Trim = 0.5 now gives the median.
***************************************************************************/
Real StatData::mean(Real trim) const
{
check_trim(trim);
if (size() < 1)
err << "StatData::mean: no data" << fatal_error;
Real result = 0;
const_cast<StatData&>(*this).items.sort();
int low = (int)(size()*trim); // starting at 0
int high = size() - low;
if (low == high) {
low--; high++;
}
for(int k = low; k < high; k++)
result += items[k];
ASSERT(2*low < size()); // Make sure we're not dividing by zero.
return result / (size() - 2*low);
}
I have three questions to ask:
1) Is *this referring to StatData?
2) Why is ASSERT(2*low < size()) checking for not dividing by zero?
3) The mean value usually means the total sum divided by the total size. but why are we doing size()-2*low?

Before we start, let's take a little bit of time to explain what the parameter trim is.
trim denotes how much fraction of data you want to cut off from both ends of the data before you want to compute what you need, assuming this is in sorted order. By doing trim = 0.5, you are cutting everything off except for considering the middle, which is the median. By doing trim = 0.1 for example, the first 10% and the last 10% of the data are discarded, and you only compute the mean within the remaining 80% of the data. Note that this is a normalized fraction between [0,1]. This fraction is then multiplied by size() to determine which index in your data we need to start from when computing the mean - denoted by low, and also which index to stop at - denoted by high. high is simply computed by size() - low, as the amount of data to cut off on both sides needs to be symmetric. This is actually sometimes called the alpha trimmed mean, or more commonly known as the truncated mean. The reason why it is also called alpha trimmed mean is because alpha defines how much of a fraction you want to cut off from the beginning and end of your sorted data. Equivalently in our case, alpha = trim.
Now onto your questions.
Question #1
The *this is referring to an instance of the current class which is of type StatData, and is ultimately trying to access items, which seems to be a container that contains some numbers of type Real. However, as Neil Kirk explained in his comment, and with what Hi I'm Dan has said, this is a very unsafe way of using const_cast so that you're able to access items so that you can sort these items. This is very bad.
Question #2
This is basically to ensure that when you're calculating the mean, you aren't dividing by zero. This condition will never be > 2*low because the size of your data will never get higher than this point. They check to see if size() < 2*low to ensure that you are going to divide the summation of your data by a number > 0, which is what we expect from the arithmetic mean. Should this condition fail, this means that computing the mean is not possible, and it should output an error.
Question #3
You are dividing by size() - 2*low because you are using trim to discard the proportion of data from the beginning and from the end of your data you don't need. This exactly corresponds to low on the one side and low on the other side. Take note that high computes where we need to stop accumulating at the upper end, and the proportion of data that exists after this point is low. As such, the combination of these proportions that are eliminated is 2*low, which is why you need to subtract this away from size() as you aren't using that data anymore.

The function is marked const, so the developer used a rather ugly const_cast to cast the const away in order to call sort.
ASSERT appears to be a macro (due to it being in capital letters) that most likely calls assert, which terminates the program if the expression evaluates to zero.
For a summary of what trimmed mean means, refer to this page.
The 10% trimmed mean is the mean computed by excluding the 10% largest
and 10% smallest values from the sample and taking the arithmetic mean
of the remaining 80% of the sample ...

Generating random integers with a difference constraint

I have the following problem:
Generate M uniformly random integers from the range 0-N, where N >> M, and where no pair has a difference less than K. where M >> K.
At the moment the best method I can think of is to maintain a sorted list, then determine the lower bound of the current generated integer and test it with the lower and upper elements, if it's ok to then insert the element in between. This is of complexity O(nlogn).
Would there happen to be a more efficient algorithm?
An example of the problem:
Generate 1000 uniformly random integers between zero and 100million where the difference between any two integers is no less than 1000
A comprehensive way to solve this would be to:
Determine all the combinations of n-choose-m that satisfy the constraint, lets called it set X
Select a uniformly random integer i in the range [0,|X|).
Select the i'th combination from X as the result.
This solution is problematic when the n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
Note: The following is a C++ implementation of the solution provided by pentadecagon
std::vector<int> generate_random(const int n, const int m, const int k)
{
if ((n < m) || (m < k))
return std::vector<int>();
std::random_device source;
std::mt19937 generator(source());
std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);
std::vector<int> result_list;
result_list.reserve(m);
for (int i = 0; i < m; ++i)
{
result_list.push_back(distribution(generator));
}
std::sort(std::begin(result_list),std::end(result_list));
for (int i = 0; i < m; ++i)
{
result_list[i] += (i * k);
}
return result_list;
}
http://ideone.com/KOeR4R
.

EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.
Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers
b_i=a_i + i*(K-1)
Given the construction, those numbers b_i have the required gaps, because the a_i already have gaps of at least 1. In order to make sure those b values cover exactly the required range [1..N], you must ensure a_i are picked from a range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers. Well, as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad. Sorting is typically very fast. In Python it looks like this:
import random
def random_list( N, M, K ):
s = set()
while len(s) < M:
s.add( random.randint( 1, N-(M-1)*(K-1) ) )
res = sorted( s )
for i in range(M):
res[i] += i * (K-1)
return res

First off: this will be an attempt to show that there's a bijection between the (M+1)- compositions (with the slight modification that we will allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.
Bijection:
Let
Then the xi form an M+1-composition (with 0 addends allowed) of the value on the left (notice that the xi do not have to be monotonically increasing!).
From this we get a valid solution
by setting the values mi as follows:
We see that the distance between mi and mi + 1 is at least K, and mM is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use the xM as a way to make the sum turn out right, we don't use it for the construction of the mi.)
To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let
be a given solution fulfilling your conditions. To get the composition this is constructed from, define the xi as follows:
Now first, all xi are at least 0, so that's alright. To see that they form a valid composition (again, every xi is allowed to be 0) of the value given above, consider:
The third equality follows since we have this telescoping sum that cancels out almost all mi.
So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
Picking a composition uniformly at random
Each of the described compositions can be uniquely identified in the following way (compare this for illustration): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)- composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x0 be the number of | before the first comma, xM+1 the number of | after the last comma, and all other xi the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval[1; N - (M-1)*K + M] uniformly at random, which we can do for example with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition) since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
Note: #DavidEisenstat suggested that there are more space efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.
You can get an error-proof algorithm out of this by doing the simple input validation we get from the construction above that N ≥ (M-1) * K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).

Why not do this:
for (int i = 0; i < M; ++i) {
pick a random number between K and N/M
add this number to (N/M)* i;
Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It's in O(n) time. As an added bonus, it's already sorted. :-)
EDIT:
Actually, the "pick a random number" part shouldn't be between K and N/M, but between min(K, [K - (N/M * i - previous value)]). That would ensure that the differences are still at least K, and not exclude values that should not be missed.
Second EDIT:
Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.
Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.
Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex - you'll have to augment each range based on how much remainder there is divided by M.
There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
Third and Final EDIT:
For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:
int prevValue = 0;
int maxRange;
for (int i = 0; i < M; ++i) {
maxRange = N - (((M - 1) - i) * K) - prevValue;
int nextValue = random(0, maxRange);
prevValue += nextValue;
store previous value;
prevValue += K;
}
This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.
Come to think of it, here's one other idea. It requires a lot more data maintenance, but is still O(M) and is probably the most fair distribution:
What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(NlogN) because N >> M.
Rather than pseudocode, let me work an example using OP's original example:
0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
If the element is, e.g. 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill]
If the element is, e.g. 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
The weight for the ranges are their length over the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678))
In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.
By the time it's done, we're no longer acting in O(M), but we're still only dependent on the time of M instead of N. And this actually is both uniform and fair distribution.
Hope one of these ideas works for you!

Algorithm for finding the maximum number of non-overlapping lines on the x axis

I'm not exactly sure how to ask this, but I'll try to be as specific as possible.
Imagine a tetris screen with only rectangles, of different shapes, falling to the bottom.
I want to compute the maximum number of rectangles that I can fit one next to the other without any overlapping ones. I've named them lines in the title because I'm actually only interested in the length of the rectangle when computing, or the line parallel to the x axis that it's falling towards.
So basically I have a custom type with a start and end, both integers between 0 and 100. Say we have a list of these rectangles ranging from 1 to n. rectangle_n.start (unless it's the rectangle closest to the origin) has to be > rectangle_(n-1).end so that they will never overlap.
I'm reading the rectangle coordinates (both are x axis coordinates) from a file with random numbers.
As an example:
consider this list of rectangle type objects
rectangle_list {start, end} = {{1,2}, {3,5}, {4,7} {9,12}}
We can observe that the 3rd object has its start coordinate 4 < the previous rectangle's end coordinate which is 5. So in sorting this list, I would have to remove the 2nd or the 3rd object so that they don't overlap.
I'm not sure if there is a type for this kind of problem so I didn't know how else to name it. I'm interested in an algorithm that can be applied on a list of such objects and would sort them out accordingly.
I've tagged this with c++ because the code I'm writing is c++ but any language would do for the algorithm.

You are essentially solving the following problem. Suppose we have n intervals {[x_1,y_1),[x_2,y_2),...,[x_n,y_n)} with x_1<=x_2<=...<=x_n. We want to find a maximal subset of these intervals such that there are no overlaps between any intervals in the subset.
The naive solution is dynamic programming. It guarantees to find the best solution. Let f(i), 0<=i<=n, be the size of the maximal subset up to interval [x_i,y_i). We have equation (this is latex):
f(i)=\max_{0<=j<i}{f(j)+d(i,j)}
where d(i,j)=1 if and only if [x_i,y_i) and [x_j,y_j) have no overlaps; otherwise d(i,j) takes zero. You can iteratively compute f(i), starting from f(0)=0. f(n) gives the size of the maximal subset. To get the actual subset, you need to keep a separate array s(i)=\argmax_{0<=j<i}{f(j)+d(i,j)}. You then need to backtrack to get the 'path'.
This is an O(n^2) algorithm: you need to compute each f(i) and for each f(i) you need i number of tests. I think there should be a O(nlogn) algorithm, but I am not so sure.
EDIT: an implementation in Lua:
function find_max(list)
local ret, f, b = {}, {}, {}
f[0], b[0] = 0, 0
table.sort(list, function(a,b) return a[1]<b[1] end)
-- dynamic programming
for i, x in ipairs(list) do
local max, max_j = 0, -1
x = list[i]
for j = 0, i - 1 do
local e = j > 0 and list[j][2] or 0
local score = e <= x[1] and 1 or 0
if f[j] + score > max then
max, max_j = f[j] + score, j
end
end
f[i], b[i] = max, max_j
end
-- backtrack
local max, max_i = 0, -1
for i = 1, #list do
if f[i] > max then -- don't use >= here
max, max_i = f[i], i
end
end
local i, ret = max_i, {}
while true do
table.insert(ret, list[i])
i = b[i]
if i == 0 then break end
end
return ret
end
local l = find_max({{1,2}, {4,7}, {3,5}, {8,11}, {9,12}})
for _, x in ipairs(l) do
print(x[1], x[2])
end

The name of this problem is bin packing, it is usually considered as a hard problem but can be computed reasonably well for small number of bins.
Here is a video explaining common approaches to this problem
EDIT : By hard problem, I mean that some kind of brute force has to be employed. You will have to evaluate a lot of solutions and reject most of them, so usually you need some kind of evaluation mechanism. You need to be able to compare solution, such as "This solution packs 4 rectangles with area of 15" is better than "This solution packs 3 rectangles with area of 16".

I can't think of a shortcut, so you may have to enumerate the power set in descending order of size and stop on the first match.
The straightforward way to do this is to enumerate combinations of decreasing size. You could do something like this in C++11:
template <typename I>
std::set<Span> find_largest_non_overlapping_subset(I start, I finish) {
std::set<Span> result;
for (size_t n = std::distance(start, finish); n-- && result.empty();) {
enumerate_combinations(start, finish, n, [&](I begin, I end) {
if (!has_overlaps(begin, end)) {
result.insert(begin, end);
return false;
}
return true;
});
}
return result;
}
The implementation of enumerate_combination is left as an exercise. I assume you already have has_overlap.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Finding zero crossings from a list in python - python-2.7

Related

Random selection of a number from a list of numbers that is biased toward the lowest number

Given a string, find two identical subsequences with consecutive indexes C++

Calculating the mean of the data

Generating random integers with a difference constraint

Algorithm for finding the maximum number of non-overlapping lines on the x axis

Categories

Resources