Algorithm to produce a difference of two collections of intervals - C++

Problem
Suppose I have two collections of intervals, named A and B. How would I find their difference (a relative complement) in the most time- and memory-efficient way?
Picture for illustration: (image not included)
Interval endpoints are integers (≤ 2^128 − 1) and they are always both 2^n long and aligned on the m·2^n lattice (so you can make a binary tree out of them).
Intervals can overlap in the input, but this does not affect the output (the result, if flattened, would be the same).
The problem is that there are MANY intervals in both collections (up to 100,000,000), so naïve implementations will probably be too slow.
The input is read from two files and is sorted in such a way that smaller sub-intervals (if overlapping) come immediately after their parents, in order of size. For example:
[0,7]
[0,3]
[4,7]
[4,5]
[8,15]
...
What have I tried?
So far, I've been working on an implementation that generates a binary search tree and, while doing so, aggregates neighbouring intervals ( [0,3],[4,7] => [0,7] ) from both collections, then traverses the second tree and "bumps out" the intervals that are present in both (subdividing the larger intervals in the first tree if necessary).
While this appears to work for small collections, it requires more and more RAM to hold the tree itself, not to mention the time needed for insertion into and removal from the tree.
I figured that since the intervals come pre-sorted, I could use some dynamic algorithm and finish in one pass. I am not sure this is possible, however.
So, how would I go about solving this problem efficiently?
Disclaimer: This is not homework, but a modification/generalization of an actual real-life problem I am facing. I am programming in C++, but I can accept an algorithm in any [imperative] language.

Recall one of the first programming exercises we all had back in school - writing a calculator program: taking an arithmetic expression from the input line, parsing it, and evaluating it. Remember keeping track of the parentheses depth? So here we go.
Analogy: interval start points are opening parentheses, end points are closing parentheses, and we keep track of the parentheses depth (nesting). A depth of two marks an intersection of intervals; a depth of one marks a point covered by exactly one collection. For the relative complement A − B you also need to know which collection that is, so keep a separate depth counter per set.
Algorithm:
Merge all start points and end points and sort them in ascending order, remembering for each point whether it came from A or B
Set both parentheses depth counters (one for A, one for B) to zero
Iterate through the points, starting from the smallest one. If it is a start point, increment the owning set's counter; if it is an end point, decrement it
Keep track of the regions where A's depth is positive and B's depth is zero: those make up the difference A − B. The regions where both counters are positive are A∩B intersections
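Here is a minimal sketch of that sweep in C++ (all names are made up for illustration). It assumes the events fit in memory, whereas with the question's sizes you would stream them from the sorted files; it also uses uint64_t endpoints for brevity, while the question's 128-bit endpoints would need a wider type such as unsigned __int128. Closed intervals [s,e] are assumed to have been converted to half-open [s,e+1) events by the caller.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// One interval endpoint event.
struct Event {
    uint64_t pos;     // coordinate of the endpoint
    bool     isStart; // true if an interval opens here
    bool     fromA;   // which collection the interval belongs to
};

// Emits the pieces of A - B as half-open intervals.
void diff(std::vector<Event> events) {
    std::sort(events.begin(), events.end(),
              [](const Event& x, const Event& y) {
                  if (x.pos != y.pos) return x.pos < y.pos;
                  return x.isStart < y.isStart; // close before open at equal pos
              });
    int depthA = 0, depthB = 0; // per-set nesting depth (inputs may self-overlap)
    uint64_t openAt = 0;
    for (const Event& e : events) {
        const bool wasInDiff = depthA > 0 && depthB == 0;
        (e.fromA ? depthA : depthB) += e.isStart ? 1 : -1;
        const bool nowInDiff = depthA > 0 && depthB == 0;
        if (!wasInDiff && nowInDiff)
            openAt = e.pos;                    // a difference region opens here
        else if (wasInDiff && !nowInDiff && e.pos > openAt)
            std::cout << "[" << openAt << "," << e.pos << ")\n"; // region closes
    }
}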

Your intervals are sorted which is great. You can do this in linear time with almost no memory.
Start by "flattening" your two sets. That is, for set A, start from the lowest interval and combine any overlapping intervals until you have an interval set that has no overlaps. Then do the same for B.
Now take your two sets and start with the first two intervals. We'll call these the interval indices for A and B, Ai and Bi.
Ai indexes the first interval in A, Bi the first interval in B.
While there are intervals to process do the following:
Consider the start points of both intervals. Are they the same? If so, advance the start point of both intervals to the end point of the smaller interval and emit nothing (that stretch of A is covered by B). Advance the index of the smaller interval (that is, if Ai ends before Bi, then Ai advances to the next interval). If both intervals end in the same place, advance both Ai and Bi and emit nothing.
Is one start point earlier than the other? What happens next depends on which set it belongs to. If the interval at Ai starts first, emit the interval from the start of Ai to either a) the start of Bi or b) the end of Ai, whichever comes first; in case b, advance Ai, otherwise move the start of Ai up to the start of Bi. If the interval at Bi starts first, emit nothing: advance the start of Bi to the start of Ai, or, if Bi ends before Ai starts, advance Bi to the next interval.
Repeat until all intervals are consumed.
P.S. I assume you don't have spare memory to flatten the two interval sets into separate buffers. Do this with two functions: a "get next interval" function that advances the interval indices and does the flattening as necessary, and feeds flattened data to the differencing function.
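In code, the merge scan might look like the sketch below, assuming both inputs are already flattened into sorted, non-overlapping, half-open intervals (Interval and difference are made-up names, and the vectors stand in for the streaming readers described above; a closed interval like the question's [0,7] maps to the half-open [0,8) here):
#include <cstdint>
#include <iostream>
#include <vector>

struct Interval { uint64_t lo, hi; }; // half-open [lo, hi)

// Emits A - B, assuming both vectors are sorted and internally disjoint.
void difference(std::vector<Interval> a, const std::vector<Interval>& b) {
    size_t ai = 0, bi = 0;
    while (ai < a.size()) {
        if (bi == b.size() || a[ai].hi <= b[bi].lo) {
            // No B interval overlaps the rest of a[ai]: emit what is left.
            std::cout << "[" << a[ai].lo << "," << a[ai].hi << ")\n";
            ++ai;
        } else if (b[bi].hi <= a[ai].lo) {
            // b[bi] lies entirely before a[ai]: skip it.
            ++bi;
        } else {
            // Overlap: emit the uncovered prefix of a[ai], if any...
            if (a[ai].lo < b[bi].lo)
                std::cout << "[" << a[ai].lo << "," << b[bi].lo << ")\n";
            // ...then clip a[ai] to whatever extends past b[bi].
            if (a[ai].hi > b[bi].hi) { a[ai].lo = b[bi].hi; ++bi; }
            else ++ai;
        }
    }
}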

What you are looking for is a Sweep line algorithm.
Simple logic tells you when the sweep line is intersecting intervals from both A and B and when it intersects only one set.
This is very similar to this problem. Just consider that you have a set of vertical lines passing through the end points of B's segments.
The algorithm's complexity is O((m+n) log(m+n)), which is the cost of the initial sort; the sweep itself takes O(m+n) on a sorted set, and since your input files are already sorted you may be able to skip the sort entirely.

I think you should use boost.icl (Interval Container Library)
http://www.boost.org/doc/libs/1_50_0/libs/icl/doc/html/index.html
#include <iostream>
#include <boost/icl/interval_set.hpp>

using namespace boost::icl;

int main()
{
    typedef interval_set<int> TIntervalSet;

    TIntervalSet intSetA;
    TIntervalSet intSetB;

    intSetA += discrete_interval<int>::closed( 0,  2);
    intSetA += discrete_interval<int>::closed( 9, 15);
    intSetA += discrete_interval<int>::closed(12, 15);

    intSetB += discrete_interval<int>::closed( 1,  2);
    intSetB += discrete_interval<int>::closed( 4,  7);
    intSetB += discrete_interval<int>::closed( 9, 10);
    intSetB += discrete_interval<int>::closed(12, 13);

    std::cout << intSetA << std::endl;
    std::cout << intSetB << std::endl;
    std::cout << intSetA - intSetB << std::endl; // operator- is the set difference

    return 0;
}
This prints:
{[0,2][9,15]}
{[1,2][4,7][9,10][12,13]}
{[0,1)(10,12)(13,15]}

Related

Closest point from List for every point of other List

I have a population of so-called "Dots" that search for food. Every Dot has a sight_ value, which indicates the range in which it can see food.
The position of each Dot is saved as a pair<uint16_t,uint16_t>. The positions of all food sources are in a vector<pair<uint16_t,uint16_t>>.
Now I want to calculate the closest food source that each Dot can see, and I don't want to calculate the distance of every combination.
My idea was to create a copy of the food vector, sort one copy by x and the other by y, then find the interval [x-sight, x+sight] respectively [y-sight, y+sight] in the vectors and build the intersection of both.
I've read about set_intersection, but it requires both ranges to be sorted with the same rule.
Any ideas how I could do this? It could also be that my idea is just the wrong approach.
Thanks
IceFreez3r
Edit:
I did some runtime approximations:
Sort Food: n log n
Find Interval for one Coordinate and one Dot: 2 log n (lower and upper bound)
If we assume an equal distribution of food sources, we can calculate the bound that is estimated to be closer to the middle first and then calculate the second bound in the remaining interval. This would reduce the runtime to log n + log(n/2). (Just realized this is probably not *that* powerful: log(n/2) =~ log(n) - 1.)
Build intersection: #x * #y =~ (n * sight/testgroundsize)^2
Compute exact Distance for every Food in Intersection: n * (sight/testgroundsize)^2
Sum: 2 n log n + 2 * #Dots * (log n + log(n/2) + (n * sight/testgroundsize)^2 + n * (sight/testgroundsize)^2)
Sum with just limiting one coordinate: n log n + #Dots * (log n + log(n/2) + n * sight/testgroundsize)
I did some tests and just calculated the above formulas at runtime:
int dots = dots_.size();
// n = the number of food sources and sum_sight = the summed sight of all dots, both defined elsewhere
int sum = 2 * n * log(n) + 2 * dots * (log(n) + log(n / 2) + pow(n * (sum_sight / dots) / testground_size_, 2) + n * pow((sum_sight / dots) / testground_size_, 2));
int sum2 = n * log(n) + dots * (log(n) + log(n / 2) + n * (sum_sight / dots) / testground_size_);
cout << n * dots << endl << sum << endl << sum2 << endl;
It turned out the intersection idea is just bad, while the idea of limiting just one coordinate is at least better than brute force.
I didn't think about the grid idea yet, @Daniel Jour.
You're stepping into a whole field of interesting approaches to this problem. Terms to Google are binary space partitioning, quadtrees, ... and of course nearest neighbour search.
A relatively simple but effective approach when the dots are spread much more widely than their "visible range":
Select a value "grid size".
Create a map from grid coordinates to a list/set of entities
For each food source: put them in the map at their grid coordinates
For each dot: put them in the map at their grid coordinates and also in the neighbouring grid cells. The size of the neighbourhood depends on the grid size and the dot's sight value
For each entry in the map which contains at least one dot: Either do this algorithm recursively with a smaller grid size or use the brute force approach: check each dot in that grid cell against each food source in that grid cell.
This is a linear algorithm, compared with the quadratic brute force approach.
Calculation of grid coordinates: grid_x = int(x / grid_size) ... same for other coordinate.
Neighbourhood: steps = ceil(sight_value / grid_size); the neighbourhood is a square with side length 2×steps + 1, centred at the dot's grid coordinates. A sketch of the lookup follows below.
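Here is a minimal sketch of that lookup in C++, under the stated assumptions (cellOf, foodGrid and nearestFood are invented names; foodGrid is assumed to have been filled by inserting each food source's index at its grid cell):
#include <cmath>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Point = std::pair<uint16_t, uint16_t>;
using Cell  = std::pair<int, int>;

Cell cellOf(Point p, int gridSize) {
    return { p.first / gridSize, p.second / gridSize };
}

// Returns the index of the nearest visible food source, or -1 if none.
int nearestFood(Point dot, double sight,
                const std::map<Cell, std::vector<int>>& foodGrid,
                const std::vector<Point>& food, int gridSize) {
    Cell c = cellOf(dot, gridSize);
    int steps = static_cast<int>(std::ceil(sight / gridSize));
    int best = -1;
    double bestD2 = sight * sight; // compare squared distances, no sqrt needed
    for (int dx = -steps; dx <= steps; ++dx)
        for (int dy = -steps; dy <= steps; ++dy) {
            auto it = foodGrid.find({c.first + dx, c.second + dy});
            if (it == foodGrid.end()) continue;
            for (int i : it->second) {
                double ddx = double(food[i].first)  - dot.first;
                double ddy = double(food[i].second) - dot.second;
                double d2 = ddx * ddx + ddy * ddy;
                if (d2 <= bestD2) { bestD2 = d2; best = i; }
            }
        }
    return best;
}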
I believe your approach is incorrect, and this can be verified mathematically. What you can do instead is calculate the magnitude of the vector joining the dot with the food source (Pythagoras' theorem) and check that this magnitude is less than the sight limit. As for efficiency, the first order of business is to determine whether the alternative approach is actually faster in measured time, even if some individual calculations become cheaper; the ideal, of course, is that the total time decreases, not that the work is merely shuffled around by refactoring.
Note also that keeping sorted values relative to each dot would seem to require n*2 data structures, where n is the number of dots in the environment, and it is unclear whether that approach would even work, let alone be optimal. You state the constraint that the solution shall not compute the distance from every dot to every food source, but to achieve this, other procedures must be implemented in order to derive the correct results. You may therefore be better off simply calculating the distance in each case. This is somewhat elegant.
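For what it's worth, the distance check recommended here can be done without the square root by comparing squared magnitudes; a minimal sketch (canSee is a made-up name):
#include <cstdint>
#include <utility>

using Point = std::pair<uint16_t, uint16_t>;

// Compare squared distance against the squared sight radius; this avoids
// the sqrt while giving the same visibility answer.
bool canSee(const Point& dot, const Point& food, double sight) {
    double dx = double(food.first)  - double(dot.first);
    double dy = double(food.second) - double(dot.second);
    return dx * dx + dy * dy <= sight * sight;
}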

Linear interpolation of two vector arrays with different lengths

I have two curves: one hand-drawn, and one a smoothed version of it.
The data of each curve is stored in two separate vector arrays.
A time delta is also stored in the hand-drawn curve's vector, so I can replay the drawing process and make it look natural.
Now I need to transfer the time delta from curve 1 (raw input) to curve 2 (the already smoothed curve).
Sometimes the first vector is larger and sometimes smaller than the second vector (depending on the input drawing speed).
So my question is: how do I fill the vector PenSmoot.time with the correct values?
Case 1: Input vector is larger
PenInput.time[0] = 0 PenSmoot.time[0] = 0
PenInput.time[1] = 5 PenSmoot.time[1] = ?
PenInput.time[2] = 12 PenSmoot.time[2] = ?
PenInput.time[3] = 2 PenSmoot.time[3] = ?
PenInput.time[4] = 50 PenSmoot.time[4] = ?
PenInput.time[5] = 100
PenInput.time[6] = 20
PenInput.time[7] = 3
PenInput.time[8] = 9
PenInput.time[9] = 33
Case 2: Input vector is smaller
PenInput.time[0] = 0 PenSmoot.time[0] = 0
PenInput.time[1] = 5 PenSmoot.time[1] = ?
PenInput.time[2] = 12 PenSmoot.time[2] = ?
PenInput.time[3] = 2 PenSmoot.time[3] = ?
PenInput.time[4] = 50 PenSmoot.time[4] = ?
PenSmoot.time[5] = ?
PenSmoot.time[6] = ?
PenSmoot.time[7] = ?
PenSmoot.time[8] = ?
PenSmoot.time[9] = ?
Simplified representation:
PenInput holds the whole data of a drawn curve (raw input):
PenInput.x        // X coordinate
PenInput.y        // Y coordinate
PenInput.pressure // The pressure of the pen
PenInput.timetotl // Total elapsed time
PenInput.timepart // Time fragments
PenSmoot holds the data of the massaged (smoothed, evenly distributed) curve of PenInput:
PenSmoot.x        // X coordinate
PenSmoot.y        // Y coordinate
PenSmoot.pressure // Unknown - The pressure of the pen
PenSmoot.timetotl // Unknown - Total elapsed time
PenSmoot.timepart // Unknown - Time fragments
This is the struct that I have:
struct Pencil
{
    sf::VertexArray vertices;
    std::vector<int> pressure;
    std::vector<sf::Int32> timetotl;
    std::vector<sf::Int32> timepart;
};
[This answer has been extensively revised based on editing to the question.]
Okay, it seems to me that you just about need to interpolate the time stamps in parallel with the points.
I'm going to guess that the incoming data is something on the order of an array of points (e.g., X, Y coordinates) and an array of time deltas with the same number of each, so time-delta N tells you the time it took to get from point N-1 to point N.
When you interpolate the points, you're probably going to want to do it intelligently. For example, in the shape shown in the question, we have what look like two nearly straight lines, one with positive slope, and the other with negative slope. According to the picture, that's composed of 263 points. We could reduce that to three points and still have a fairly reasonable representation of the original shape by choosing the two end-points plus one point where the two lines meet.
We probably don't need to go quite that far though. Especially taking time into account, we'd probably want to use at least 7 points for the output--one for each end-point of each colored segment. That would give us 6 straight line segments. Let's say those are at points 0, 30, 140, 180, 200, 250, and 263.
We'd then use exactly the same segmentation on the time deltas. Add up the deltas from 0 to 30 to get an average speed for the first segment. Add up the deltas for 31 through 140 to get an average speed for the second segment (and so on to the end).
Increasing the number of points works out roughly the same way. We need to look at exactly which input points were used to create a pair of output points. For a simplistic example, let's assume we produced output that was precisely double the number of input points. We'd then interpolate time deltas exactly halfway between each pair of input points.
In the case shown in the question, we start with unevenly distributed inputs, but produce evenly distributed outputs. So the second output point might be an average of the first four input points. The next output point might be an average of three input points (and so on). In many cases, it's likely that neither end-point of a segment in the output corresponds precisely to any point in the input.
That's fine too. We interpolate between two points of the input to figure out the time hack for the starting point of the output segment. Likewise for the ending point. Then we can compute the total time it should have taken to travel between them based on the time delta between the points.
If you want to get fancy, you could use a higher-order interpolation instead of linear. That does require more input points per interpolation, but it looks like you probably have plenty to do something like a quadratic or cubic interpolation (in most cases). This is likely to make the most difference at transitions: places where the "pen" was accelerating or decelerating quickly. In such a place, linear interpolation can give somewhat misleading results (though, given the number of points you seem to be working with, it may not make enough difference to notice).
As an illustration, let's consider a straight line. We're going to start from 5 input points, and produce 7 output points.
So, the input points are [0, 2, 7, 10, 15], and the associated time deltas are [0, 1, 4, 8, 3] (delta N is the time taken to travel from point N-1 to point N).
Our total distance traveled is 15, and we want our 7 output points to be evenly distributed, which gives 6 output segments, so the distance between output points will be 15/6 = 2.5.
Obviously the first output point and time are both 0. The second output point lies at 2.5. To compute its time delta, we take the entirety of the time to the first input point (0->2), which is 1, plus the fraction of the second input segment we cross: 0.5 / (7-2) * 4 = 0.4. So our first output time delta is 1.4.
The next output point lies at a distance of 5.0. Since the second input segment goes from 2 to 7, this entire output segment lies within the second input segment. So we take 2.5 / (7-2), telling us that this output segment occupies 0.5 of the input segment; multiplying by the time for the second input segment gives its time delta: 0.5 * 4 = 2.0.
[...and it continues on the same way until we reach the end.]
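As a sketch of the overall (linear) procedure in C++: the assumptions are that timepart[i] holds the time from point i-1 to point i and that the smoothed curve should take the same total time as the raw one; all names (Pt, arcLen, transferTime) are invented for illustration.
#include <cmath>
#include <vector>

struct Pt { float x, y; };

// Cumulative arc length along a polyline.
static std::vector<float> arcLen(const std::vector<Pt>& p) {
    std::vector<float> s(p.size(), 0.f);
    for (size_t i = 1; i < p.size(); ++i)
        s[i] = s[i-1] + std::hypot(p[i].x - p[i-1].x, p[i].y - p[i-1].y);
    return s;
}

// For each smoothed point, linearly interpolate the cumulative time of the
// raw curve at the same fraction of total arc length, then differentiate
// back into per-segment time deltas.
std::vector<float> transferTime(const std::vector<Pt>& raw,
                                const std::vector<float>& rawDelta,
                                const std::vector<Pt>& smooth) {
    std::vector<float> sRaw = arcLen(raw), sSm = arcLen(smooth);
    std::vector<float> tRaw(raw.size(), 0.f);          // cumulative raw times
    for (size_t i = 1; i < raw.size(); ++i) tRaw[i] = tRaw[i-1] + rawDelta[i];

    std::vector<float> tSm(smooth.size(), 0.f);
    size_t j = 1; // raw segment cursor; advances monotonically
    for (size_t i = 1; i < smooth.size(); ++i) {
        // Map this smoothed point to the same fraction of the total length.
        float target = sSm[i] / sSm.back() * sRaw.back();
        while (j + 1 < raw.size() && sRaw[j] < target) ++j;
        float span = sRaw[j] - sRaw[j-1];
        float f = span > 0.f ? (target - sRaw[j-1]) / span : 0.f;
        tSm[i] = tRaw[j-1] + f * (tRaw[j] - tRaw[j-1]);
    }
    std::vector<float> delta(smooth.size(), 0.f);      // back to per-segment deltas
    for (size_t i = 1; i < smooth.size(); ++i) delta[i] = tSm[i] - tSm[i-1];
    return delta;
}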

Snake game - random number generator for food tiles

I am trying to make a 16x16 LED Snake game using Arduino (C++).
I need to assign a random grid index for the next food tile.
What I have is a list of indices that are occupied by the snake (snakeSquares).
So, my thought is that I need to generate a list of potential foodSquares. Then I can pick a random index from that list, and use the value there for my next food square.
I have some ideas for this, but they seem kind of clunky, so I was looking for some feedback. I am using the Arduino LinkedList.h library for my lists in lieu of the standard library (and random() in place of rand()):
Generate a list (foodSquares) containing the integers [0, 255] so that the indices correspond to the values in the list (I don't know of a quick way to do this, will probably need to use a for loop).
When generating list of snakeSquares, set foodSquares[i] = -1. Afterwards, loop through foodSquares and remove all elements that equal -1.
Then, generate a random number randNum from [0, foodSquares.size()-1] and make the next food square index equal to foodSquares[randNum].
So I guess my question is, will this approach work, and is there a better way to do it?
Thanks.
A potential approach that won't require more lists:
Calculate a random integer representing the number of steps.
Take the head or tail as a starting tile.
For each step, move to a random free adjacent tile.
I couldn't completely understand your question, as some of those points are quite a waste of processor time (i.e. points 1 and 2). But the first point can be solved quite easily, in time proportional to n, as follows:
for (uint16_t i = 0; i < 256; i++) {
    // assuming there is a list of food_squares
    // (a uint8_t counter would wrap around at 255 and never terminate)
    food_squares[i] = i;
}
Then, to the second point: you would have to set every food_square to -1, but for what? Anyway, a way you could implement this would be as VTT has said, and I will describe it further:
Take a random number in [0..255].
Is it one of the snake_squares? If so, go back to step one; otherwise, go on to step three.
This is the same as your third point: use this random number to set the position of the food in food_squares (food_squares[random_number] = some_value).
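A minimal sketch of that rejection-sampling loop (isSnakeSquare is a hypothetical membership test against the snakeSquares list; Arduino's random(min, max) returns a value in [min, max-1]):
#include <stdint.h>

bool isSnakeSquare(uint8_t tileIndex); // hypothetical: checks the snakeSquares list

// Draw random tile indices until one is not occupied by the snake.
uint8_t pickFoodSquare() {
    for (;;) {
        uint8_t candidate = (uint8_t) random(0, 256); // tile index in [0, 255]
        if (!isSnakeSquare(candidate))                // free tile found
            return candidate;
    }
}
The expected number of retries stays small as long as the snake does not cover most of the board.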

Fortran beginner - writing variable to output file

I am starting to work with a CFD Fortran program, and want to update the variables that it writes to an output file.
I want to output several columns: I and J coordinates (IL and JL), water surface elevation (SURFEL), bottom elevation of the coordinate (BELV), depth of water (HP) and finally, and this is where I have the question, the maximum water surface elevation of the coordinate during the simulation (SURFELMAX). L refers to a specific I,J coordinate; LA is the last coordinate in the simulation.
So far I have:
DO L=2,LA
  SURFEL=BELV(L)+HP(L)
  IF (SURFEL.GT.SURFELMAX) THEN
    SURFELMAX=SURFEL
  ELSE IF (SURFELMAX.GT.SURFEL) THEN
    SURFELMAX=SURFELMAX
    WRITE(10,200)IL(L),JL(L),SURFEL,SURFELMAX
  ENDIF
ENDDO
Everything works OK other than SURFELMAX: the highest recorded surface elevation that occurred at any coordinate in the whole domain is written for every coordinate, i.e. the column is filled with the same value, the highest experienced in the whole domain during the simulation.
Would I need to first allocate an array for SURFELMAX, and have SURFEL checked against it each time to see whether it has increased? If so, could somebody point me in the right direction?
If I understand the requirements correctly, then you want to calculate SURFELMAX before you start writing out. This could simply be:
SURFELMAX = MAXVAL(BELV(2:LA)+HP(2:LA))
WRITE(10,200) (IL(L), JL(L), BELV(L)+HP(L), SURFELMAX, L=2,LA)
(or even as a single line).
It appears I didn't understand correctly; I'll try again - keeping the above as a warning to others.
It seems that you do indeed want SURFELMAX(2:LA) where each element is the highest in a given cell to date.
do L=2, LA
  SURFELMAX(L) = MAX(SURFELMAX(L), BELV(L)+HP(L)) ! Store the historical maximum
  WRITE (10,200) IL(L), JL(L), BELV(L)+HP(L), SURFELMAX(L)
end do
where, initially, SURFELMAX has been set to a sufficiently small value. You could also explicitly calculate SURFEL if that is needed.
If this is time dependent, then you will have to define a 2-d array SURFELMAX of size (1:T,1:LA) (T = number of time steps, LA = number of active coordinates).
Then increment the time step (say its iterator is called I_T) outside of the loop through the domain.
Finally, assign the maximum value at each coordinate to SURFELMAX(I_T,L).

Algorithm for sorting a two-dimensional array based on similarities of adjacent objects

I'm writing a program that is supposed to sort a number of square tiles (each side of which is colored in one of five colors: red, orange, blue, green and yellow) that are lying next to each other (e.g. 8 rows and 12 columns) in a way that as many sides with the same color connect as possible. So, for instance, a tile whose right side is colored red should have a tile on its right that has a red left side.
The result is evaluated by counting how many non-matching pairs of sides exist on the board. I'm pretty much done with the actual program; I just have some trouble with my sorting algorithm. Right now I'm using a bubble-sort-based algorithm that compares every piece on the board with every other piece, and if switching those two reduces the number of non-matching pairs of sides on the board, it switches them. Here is an abstracted version of the sorting function as it is now:
for(int i = 0; i < DimensionOfBoard.cx * DimensionOfBoard.cy; i++)
    for(int j = 0; j < DimensionOfBoard.cx * DimensionOfBoard.cy; j++)
    {
        // Comparing a piece with itself is useless
        if(i == j)
            continue;

        // v1 is the number of non-matching sides of both pieces
        // (max is 8, since we have 2 pieces with 4 sides each)
        int v1 = Board[i].GetNonmatchingSides() + Board[j].GetNonmatchingSides();

        // Switch the pieces; if this decreases the value of the board
        // (increases the number of non-matching sides) we'll switch back
        SwitchPieces(Board[i], Board[j]);

        // If switching worsened the situation ...
        if(v1 < Board[i].GetNonmatchingSides() + Board[j].GetNonmatchingSides())
            // ... we switch back to the initial state
            SwitchPieces(Board[i], Board[j]);
    }
As an explanation: Board is a pointer to an array of Piece objects. Each Piece has four Piece pointers that point to the four adjacent pieces (or NULL, if the Piece is a side/corner piece). And switching actually doesn't switch the pieces themselves, but rather switches the colors (instead of exchanging the pieces, it scrapes off the color of both and switches that).
This algorithm doesn't work too badly: it significantly improves the value of the board, but it doesn't optimize it as it should. I assume that's because side and corner pieces can't have more than three/two wrong adjacent pieces, since one/two of their sides are empty. I tried to compensate for that (by multiplying Board[i].GetMatchingPieces() with Board[i].GetHowManyNonemptySides() before comparing), but that didn't help a bit.
And that's where I need help. I don't know very many sorting algorithms, let alone ones that work with two-dimensional arrays. So does anyone know of an algorithmic concept that might help me improve my work? Or can anyone see a problem that I haven't found yet? Any help is appreciated. Thank you.
If there was a switch, you have to re-evaluate the board, because there might be previous positions where you could now find an improvement.
Note that you are only going to find a local minimum with those swappings. You might not be able to find any further improvements, but that doesn't mean you have the best board configuration.
One way to find a better configuration is to shuffle the board and search for a new local minimum, or to use an algorithm that allows bigger jumps in the state space, e.g. simulated annealing; a sketch follows below.
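Here is a sketch of the simulated-annealing variant, using hypothetical BoardCost() and SwitchPieces(i, j) helpers modeled on the names in the question:
#include <cmath>
#include <random>

// Hypothetical helpers, assumed to exist elsewhere:
int BoardCost();                 // number of non-matching side pairs on the board
void SwitchPieces(int i, int j); // swap the colors of two tiles by index

// Simulated annealing: accept worsening swaps with a probability that
// shrinks as the temperature cools, so the search can escape the local
// minima that plain greedy swapping gets stuck in.
void Anneal(int pieces, double t0, double cooling, int iterations) {
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> pick(0, pieces - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    double t = t0;
    int cost = BoardCost();
    for (int k = 0; k < iterations; ++k, t *= cooling) {
        int i = pick(rng), j = pick(rng);
        if (i == j) continue;
        SwitchPieces(i, j);
        int next = BoardCost();
        // Keep improvements; keep worsenings with probability exp(-delta/t).
        if (next <= cost || unit(rng) < std::exp((cost - next) / t))
            cost = next;
        else
            SwitchPieces(i, j); // revert the swap
    }
}
The starting temperature and cooling rate are tuning knobs; a common refinement is to remember and restore the best configuration seen so far.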