Has anyone used the Boolean functions of the Boost Polygon library?
The documentation says the algorithm is O(n log n) in time, where n is the number of points.
I input 200000 randomly generated polygons (with 5~8 points each),
but the OR and XOR operations take about half an hour (yes, with just a single call to the library function).
The result is correct, but the running time is horrible.
Has anyone run into this problem?
Although it would always help to post the code that exhibits the described behavior, I assume that each of the i = 1..n polygons has some (unique) crossings with each of the previous 1..(i-1) polygons. This implies that the number of points that result from XOR'ing the polygons is quadratic in n. The documented O(P * log(P)) bound is in terms of the total number of points P the sweep has to process, and here P = O(n^2), so the total complexity in terms of the number of polygons is O(n^2 * log(n)).
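If part of the slowdown comes from combining polygons one at a time, the OR case can at least be batched. A minimal sketch, assuming Boost.Polygon's polygon_set_data (which, as I understand it, defers its boolean sweep until the result is extracted); unionAll is a hypothetical helper, and note that inserting everything into one set yields the union, not the XOR:

#include <boost/polygon/polygon.hpp>
#include <vector>

namespace bp = boost::polygon;

// Collect all polygons into one set, then extract once: the O(P log P)
// sweep runs a single time instead of once per incremental OR.
std::vector<bp::polygon_data<int>> unionAll(
        const std::vector<bp::polygon_data<int>>& polys) {
    bp::polygon_set_data<int> set;
    for (const auto& p : polys)
        set.insert(p);                 // only stores edges; no sweep yet
    std::vector<bp::polygon_data<int>> merged;
    set.get(merged);                   // the boolean sweep happens here
    return merged;
}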
I got the task of showing the time taken by the merge sort algorithm theoretically (n log(n)) and practically (by program) on a graph, using different values of n and the time taken.
In the program, I'm printing the time difference, in microseconds, between just before calling the function and just after it returns. I want to know what n log(n) means here.
I tried with these values:
n (number of values)   program time (microseconds)   n*log2(n) formula
10000                  12964                          132877
20000                  24961                          285754
30000                  35905                          446180
40000                  47870                          611508
50000                  88764                          780482
60000                  67848                          952360
70000                  81782                          1.12665e+06
80000                  97739                          1.30302e+06
90000                  111702                         1.48119e+06
100000                 119682                         1.66096e+06
code:
#include <chrono>
#include <iostream>

// time a single mergeSort call on n elements
auto start = std::chrono::high_resolution_clock::now();
mergeSort(arr, 0, n - 1);
auto elapsed = std::chrono::high_resolution_clock::now() - start;
long long microseconds =
    std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
std::cout << microseconds << " ";
The graph I got (blue: measured program time, red: the n*log(n) values): [image omitted]
What does time complexity actually mean?
I interpret your question in the following way:
Why is the actual time needed by the program not K*n*log(n) microseconds?
The answer is: Because on modern computers, the same step (such as comparing two numbers) does not need the same time if it is executed multiple times.
If you look at the times for 50,000 and 60,000 numbers, you can see that sorting 50,000 numbers actually took more time than sorting 60,000.
The reason might be some interrupt that occurred while the 50,000 numbers were being sorted; I assume you'd get a time between those for 40,000 and 60,000 numbers if you ran your program a second time.
In other words: External influences (like interrupts) have more impact on the time needed by your program than the program itself.
I got the task of showing the time taken by the merge sort algorithm theoretically (n log(n)) and practically (by program) on a graph by using different values of n and time taken.
I'd pick a number of elements that takes about one second to sort. Say sorting 3 million numbers takes one second; then I would sort 3, 6, 9, 12, ..., 30 million numbers and measure the time.
This reduces the influence of interrupts etc. on the measurement. However, you'll still see some effect of the memory cache in this case.
You can use your existing measurements (especially the ones for 50,000 and 60,000) to show that for a small number of elements to be sorted, there are other factors that influence the run time.
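A minimal sketch of that measurement strategy, under the stated assumptions; medianSortTimeUs is a hypothetical helper, std::sort stands in for the asker's mergeSort, and the median over repetitions is what damps the interrupt noise discussed above:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Time one sort of n random ints, repeated `reps` times; return the
// median so that a single interrupt cannot distort the measurement.
long long medianSortTimeUs(int n, int reps) {
    std::vector<long long> times;
    std::mt19937 rng(42);
    for (int r = 0; r < reps; ++r) {
        std::vector<int> arr(n);
        for (int& x : arr) x = static_cast<int>(rng());
        auto start = std::chrono::high_resolution_clock::now();
        std::sort(arr.begin(), arr.end());        // stand-in for mergeSort
        auto elapsed = std::chrono::high_resolution_clock::now() - start;
        times.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
    }
    std::nth_element(times.begin(), times.begin() + reps / 2, times.end());
    return times[reps / 2];
}

int main() {
    for (int n = 3000000; n <= 30000000; n += 3000000)
        std::cout << n << " " << medianSortTimeUs(n, 5) << "\n";
}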
Note that a graph of y = x log(x) is surprisingly close to a straight line.
This is because the gradient at any point x is 1 + log(x), which is a slowly growing function of x.
In other words, it's difficult within the bounds of experimental error to distinguish between O(N) and O(N log N).
The fact that the blue line is pretty straight is a reasonable verification that the algorithm is not O(N * N), but really without better statistical analysis and program control set-up, one can't say much else.
The difference between the red and blue line is down to "big O" not concerning itself with proportionality constants and other coefficients.
The time complexity is the time a program takes to execute, as a function of the problem size.
The problem size is usually expressed as the number of input elements, but some other measures can sometimes be used (e.g. algorithms on matrices of size NxN can be rated in terms of N instead of N²).
The time can effectively be measured in units of time (seconds), but it is often assessed instead by counting the number of atomic operations of some kind performed (e.g. the number of comparisons or of array accesses).
In fact, for theoretical studies the exact time is not relevant information, because it is not "portable": it strongly depends on the performance of the computer used and on implementation details.
This is why algorithmicians do not really care about exact figures, but rather about how the time varies with increasing problem sizes. This leads to the concept of asymptotic complexity, which measures the running time up to an unknown constant factor; for mathematical convenience, an approximation of the running time is often used to make the computations tractable.
If you study the complexity by pure benchmarking (timing), you can obtain experimental points, which you could call empirical complexity. But some statistical rigor should be applied.
(Some of the other answers do merge the concepts of complexity and asymptotic complexity, but this is not correct.)
In this discussion of complexity, you can replace time by space to study the memory footprint of the program instead.
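To illustrate counting atomic operations instead of seconds, here is a small sketch of my own (not from the answer) that counts comparisons through a custom comparator:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// "Empirical complexity" in machine-independent units: count the
// comparisons a sort performs instead of timing it.
int main() {
    std::mt19937 rng(123);
    for (int n : {10000, 20000, 40000, 80000}) {
        std::vector<int> v(n);
        for (int& x : v) x = static_cast<int>(rng());
        std::uint64_t comparisons = 0;
        std::sort(v.begin(), v.end(),
                  [&](int a, int b) { ++comparisons; return a < b; });
        std::cout << n << " elements: " << comparisons << " comparisons\n";
    }
}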
Time complexity has nothing to do with actual time.
It's just a way that helps us to compare different algorithms - which algorithm will run faster.
For example, in the case of sorting: bubble sort has time complexity O(n^2) and merge sort has time complexity O(n log n). So, with the help of time complexity, we can say that merge sort is much better than bubble sort for sorting things.
Big-O notation was created so that we have a generalized way of comparing the speed of different algorithms, one which is not machine-dependent.
I am looking to generate derangements uniformly at random. In other words: shuffle a vector so that no element stays in its original place.
Requirements:
uniform sampling (each derangement is generated with equal probability)
a practical implementation is faster than the rejection method (i.e. keep generating random permutations until we find a derangement)
None of the answers I found so far are satisfactory in that they either don't sample uniformly (or fail to prove uniformity) or do not make a practical comparison with the rejection method. About 1/e = 37% of permutations are derangements, which gives a clue about what performance one might expect at best relative to the rejection method.
The only reference I found which makes a practical comparison is in this thesis which benchmarks 7.76 s for their proposed algorithm vs 8.25 s for the rejection method (see page 73). That's a speedup by a factor of only 1.06. I am wondering if something significantly better (> 1.5) is possible.
I could implement and verify various algorithms proposed in papers, and benchmark them. Doing this correctly would take quite a bit of time. I am hoping that someone has done it, and can give me a reference.
Here is an idea for an algorithm that may work for you. Generate the derangement in cycle notation. So (1 2) (3 4 5) represents the derangement 2 1 4 5 3. (That is (1 2) is a cycle and so is (3 4 5).)
Put the first element in the first place (in cycle notation you can always do this) and take a random permutation of the rest. Now we just need to find out where the parentheses go for the cycle lengths.
As https://mathoverflow.net/questions/130457/the-distribution-of-cycle-length-in-random-derangement notes, in a random permutation the length of the cycle containing a given element is uniformly distributed; this is not true in derangements. But the number of derangements of length m is m!/e, rounded up for even m and down for odd m. So what we can do is pick a length uniformly distributed in the range 2..n and accept it with the probability that the remaining elements would, proceeding randomly, form a derangement. This cycle length will be correctly distributed. And once we have the first cycle length, we repeat for the next until we are done.
The procedure done the way I described is simpler to implement but mathematically equivalent to taking a random derangement (by rejection), and writing down the first cycle only. Then repeating. It is therefore possible to prove that this produces all derangements with equal probability.
With this approach done naively, we will be taking an average of 3 rolls before accepting a length. However we then cut the problem in half on average. So the number of random numbers we need to generate for placing the parentheses is O(log(n)). Compared with the O(n) random numbers for constructing the permutation, this is a rounding error. However it can be optimized by noting that the highest probability for accepting is 0.5. So if we accept with twice the probability of randomly getting a derangement if we proceeded, our ratios will still be correct and we get rid of most of our rejections of cycle lengths.
If most of the time is spent in the random number generator, for large n this should run at approximately 3x the rate of the rejection method. In practice it won't be as good because switching from one representation to another is not actually free. But you should get speedups of the order of magnitude that you wanted.
This is just an idea, but I think it can produce uniformly distributed derangements.
You need a helper buffer with at most around N/2 elements, where N is the number of items to be arranged.
First, choose a random(1, N) position for value 1.
Note: 1 to N instead of 0 to N-1 for simplicity.
Then, for value 2, the position will be random(1, N-1) if value 1 fell on position 2, and random(1, N-2) otherwise.
The algorithm walks the list, counting only the not-yet-used positions, until it reaches the chosen random position for value 2; position 2 itself is of course skipped.
For value 3, the algorithm checks whether position 3 is already used: if it is, pos3 = random(1, N-2); if not, pos3 = random(1, N-3).
Again, the algorithm walks the list, counting only the not-yet-used positions, until the count reaches pos3, and then places value 3 there.
This goes on for the following values until all values are placed in positions.
That should generate derangements with uniform probability.
Optimization should focus on how the algorithm reaches the chosen position quickly.
Instead of walking the list and counting the not-yet-used positions one by one, the algorithm could use some heap-like search over the positions not yet used (or any method other than heap-like searching). This is a separate problem to be solved: how to reach an unused item given its position count in a list of unused items.
I'm curious ... and mathematically uninformed. So I ask innocently, why wouldn't a "simple shuffle" be sufficient?
for i from array_size - 1 downto 1:   # assume zero-based arrays
    j = random(0, i - 1)
    swap_elements(i, j)
Since the random function will never produce a value equal to i it will never leave an element where it started. Every element will be moved "somewhere else."
Let d(n) be the number of derangements of an array A of length n.
d(n) = (n-1) * (d(n-1) + d(n-2))
The d(n) derangements are achieved by:
1. First, swapping A[0] with one of the remaining n-1 elements.
2. Next, either deranging all n-1 remaining elements, or deranging the n-2 remaining elements that exclude the index which received A[0] in step 1.
How can we generate a derangement uniformly at random?
1. Perform the swap of step 1 above.
2. Randomly decide which path we're taking in step 2,
with probability d(n-1)/(d(n-1)+d(n-2)) of deranging all remaining elements.
3. Recurse down to derangements of sizes 2 and 3, which are both precomputed.
Wikipedia has d(n) = floor(n!/e + 0.5) (exactly). You can use this to calculate the probability of step 2 exactly in constant time for small n. For larger n the factorial can be slow, but all you need is the ratio. It's approximately (n-1)/n. You can live with the approximation, or precompute and store the ratios up to the max n you're considering.
Note that (n-1)/n converges very quickly.
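A runnable sketch of steps 1-3, done iteratively from the back of the array (this is essentially the algorithm of Martínez, Panholzer and Prodinger for random derangements; the function name and the normalization e[k] = d(k)/k!, which avoids factorial overflow, are my own choices):

#include <random>
#include <utility>
#include <vector>

// Instead of d(k) itself (which overflows quickly), track e[k] = d(k)/k!,
// which converges to 1/e. The step-2 probability of closing a 2-cycle,
// d(u-2)/(d(u-1)+d(u-2)), then equals e[u-2]/(u*e[u]).
std::vector<int> randomDerangement(int n, std::mt19937& rng) {
    std::vector<double> e(n + 1);
    e[0] = 1.0;
    if (n >= 1) e[1] = 0.0;
    for (int k = 2; k <= n; ++k)
        e[k] = ((k - 1) * e[k - 1] + e[k - 2]) / k;  // from d(k)=(k-1)(d(k-1)+d(k-2))

    std::vector<int> a(n);
    for (int i = 0; i < n; ++i) a[i] = i;
    std::vector<bool> done(n, false);
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    int u = n;                               // elements still to derange
    for (int i = n - 1; u >= 2; --i) {
        if (done[i]) continue;
        int j;
        do {                                 // step 1: random swap partner
            j = std::uniform_int_distribution<int>(0, i - 1)(rng);
        } while (done[j]);
        std::swap(a[i], a[j]);
        // step 2: with probability d(u-2)/(d(u-1)+d(u-2)), close the
        // 2-cycle (i j) and derange only the remaining u-2 elements
        if (coin(rng) < e[u - 2] / (u * e[u])) {
            done[j] = true;
            --u;
        }
        --u;
    }
    return a;
}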
I have two sets of points (cv::Point2f): setA and setB. For each point in setA, I want to find its nearest neighbor in setB. I have tried two methods:
Linear search: for each point in setA, simply scan through all points in setB to find the nearest one.
Using the OpenCV kd-tree:
First I built a kd-tree for setB using OpenCV's FLANN:
cv::flann::KDTreeIndexParams indexParams;
cv::flann::Index kdTree(cv::Mat(setB).reshape(1), indexParams);
Then, for each point in setA, I query for the nearest neighbor:
kdTree.knnSearch(point_in_setA, indices, dists, maxPoints);
Note: I set maxPoints to 1, because I only need the nearest one.
I did a bit of studying and came up with the time complexity for each case:
Linear search: O(M*N)
Kd-tree: O(N*log(N) + M*log(N)), where the first term is for building the kd-tree and the second is for querying.
Here M is the number of points in setA and N the number in setB; N ranges over 100~1000 and M over 10000~100000.
So the kd-tree should run much faster than the linear search. However, when I run a real test on my laptop, the kd-tree method turns out to be slower than linear search (0.4~0.5 s vs 0.02~0.03 s).
When I profile, the hot spot is the knnSearch() function: it takes 20.3% of CPU time, compared to 7.9% for the linear search.
I've read some online articles saying that querying a kd-tree usually takes O(log N), but I'm not sure how OpenCV implements it.
Does anyone know what's wrong here? Is there a parameter I should adjust in the kd-tree, or did I make a mistake somewhere in the code or the computation?
Taken from the FLANN documentation: for low-dimensional data you should use KDTreeSingleIndexParams.
KDTreeSingleIndexParams
When passing an object of this type, the index will contain a single kd-tree optimized for searching lower-dimensionality data (for example 3D point clouds, or in your case 2D points). You can play with the leaf_max_size parameter and profile your results.
struct KDTreeSingleIndexParams : public IndexParams
{
KDTreeSingleIndexParams( int leaf_max_size = 10 );
};
max leaf size: the maximum number of points to have in a leaf, below which the tree is not branched any further.
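A minimal sketch of how that could look for the asker's setup; nearestNeighbors is a hypothetical helper, and leaf_max_size = 10 is just the documented default, to be profiled rather than taken as the best value:

#include <cstddef>
#include <opencv2/core.hpp>
#include <opencv2/flann.hpp>
#include <vector>

// Build a single kd-tree over setB (2D points) and query the nearest
// neighbor of every point in setA.
void nearestNeighbors(const std::vector<cv::Point2f>& setA,
                      const std::vector<cv::Point2f>& setB,
                      std::vector<int>& nearestIdx) {
    cv::Mat features = cv::Mat(setB).reshape(1);   // N x 2, CV_32F
    cv::flann::Index kdTree(features, cv::flann::KDTreeSingleIndexParams(10));

    nearestIdx.resize(setA.size());
    cv::Mat indices, dists;
    for (std::size_t i = 0; i < setA.size(); ++i) {
        cv::Mat query = (cv::Mat_<float>(1, 2) << setA[i].x, setA[i].y);
        kdTree.knnSearch(query, indices, dists, 1, cv::flann::SearchParams());
        nearestIdx[i] = indices.at<int>(0, 0);
    }
}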
O(log(N)) doesn't necessarily mean it is faster than O(N).
This is only true for sufficiently big N.
Your N is a rather small number. If your kd-tree contained millions of elements, you'd probably see the difference between a linear scan and a logarithmic search.
So my guess is that you spend a lot of time with overhead like building the tree, which is slower for small N than just scanning this rather small list without any overhead.
I have input array A
A[0], A[1], ... , A[N-1]
I want a function Max(T, A) which returns an array B holding the max value of A over the previous moving window of size T, where
B[i+T] = max(A[i], A[i+1], ..., A[i+T])
By using a max-heap to keep track of the max value in the current moving window A[i] to A[i+T], this algorithm is O(N log(T)) in the worst case.
I would like to know: is there a better algorithm, maybe an O(N) one?
O(N) is possible using a deque that holds (Value, Index) pairs. At every step:
if (!Deque.Empty) and (Deque.Head.Index <= CurrentIndex - T) then
  Deque.ExtractHead;
  // the head is too old; it has left the window
while (!Deque.Empty) and (Deque.Tail.Value <= CurrentValue) do
  Deque.ExtractTail;
  // remove elements that have no chance of becoming the maximum in the window
Deque.AddTail(CurrentValue, CurrentIndex);
CurrentMax = Deque.Head.Value
// the head value is the maximum in the current window
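The same idea as a runnable C++ sketch (windowMax is a hypothetical name); the deque keeps values in decreasing order, so its head is always the maximum of the current window:

#include <deque>
#include <utility>
#include <vector>

// Sliding-window maximum in O(N): each element enters and leaves the
// deque at most once, so the whole pass is amortized linear.
std::vector<int> windowMax(const std::vector<int>& a, int t) {
    std::vector<int> b;
    std::deque<std::pair<int, int>> dq;            // (value, index)
    for (int i = 0; i < static_cast<int>(a.size()); ++i) {
        if (!dq.empty() && dq.front().second <= i - t)
            dq.pop_front();                        // head has left the window
        while (!dq.empty() && dq.back().first <= a[i])
            dq.pop_back();                         // can never be the maximum
        dq.emplace_back(a[i], i);
        if (i >= t - 1)
            b.push_back(dq.front().first);         // max of a[i-t+1..i]
    }
    return b;
}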
This is called RMQ (range minimum/maximum query). I once wrote an article about it (with C++ code); see http://attiix.com/2011/08/22/4-ways-to-solve-%C2%B11-rmq/
or you may prefer the Wikipedia article on Range Minimum Query.
After the preparation step, you can get the max of any given range in O(1).
There is a sub-field in image processing called Mathematical Morphology. The operation you are implementing is a core concept in this field, called dilation. Obviously, this operation has been studied extensively and we know how to implement it very efficiently.
The most efficient algorithm for this problem was proposed in 1992 and 1993, independently by van Herk, and Gil and Werman. This algorithm needs exactly 3 comparisons per sample, independently of the size of T.
Some years later, Gil and Kimmel further refined the algorithm to need only 2.5 comparisons per sample. Though the increased complexity of the method might offset the fewer comparisons (I find that more complex code runs more slowly). I have never implemented this variant.
The HGW algorithm, as it's called, needs two intermediate buffers of the same size as the input. For ridiculously large inputs (billions of samples), you could split up the data into chunks and process it chunk-wise.
In short, you walk through the data forward, computing the cumulative max over chunks of size T, then do the same walking backward. Each of these passes requires one comparison per sample. Finally, each output value is the maximum of one value from each of these two temporary arrays. For data locality, you can do the two passes over the input at the same time.
I guess you could even do a running version, where the temporary arrays are of length 2*T, but that would be more complex to implement.
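A minimal sketch of the two passes (hgwMaxFilter is a hypothetical name; b[i] is the max of a[i..i+T-1], and the input is split into blocks of length T):

#include <algorithm>
#include <vector>

// van Herk / Gil-Werman max filter: a forward prefix-max pass and a
// backward suffix-max pass per block, then one max per output sample,
// i.e. roughly 3 comparisons per sample independent of t.
std::vector<int> hgwMaxFilter(const std::vector<int>& a, int t) {
    int n = static_cast<int>(a.size());
    std::vector<int> g(n), h(n), b(n - t + 1);
    for (int i = 0; i < n; ++i)           // forward: prefix max within block
        g[i] = (i % t == 0) ? a[i] : std::max(g[i - 1], a[i]);
    for (int i = n - 1; i >= 0; --i)      // backward: suffix max within block
        h[i] = (i % t == t - 1 || i == n - 1) ? a[i] : std::max(h[i + 1], a[i]);
    for (int i = 0; i + t - 1 < n; ++i)   // combine the two partial maxima
        b[i] = std::max(h[i], g[i + t - 1]);
    return b;
}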
van Herk, "A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels", Pattern Recognition Letters 13(7):517-521, 1992 (doi)
Gil, Werman, "Computing 2-D min, median, and max filters", IEEE Transactions on Pattern Analysis and Machine Intelligence 15(5):504-507 , 1993 (doi)
Gil, Kimmel, "Efficient dilation, erosion, opening, and closing algorithms", IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12):1606-1617, 2002 (doi)
(Note: cross-posted from this related question on Code Review.)
I wrote two matrix multiplication programs in C++: Regular MM (source), and Strassen's MM (source), both of which operate on square matrices of size 2^k x 2^k (in other words, square matrices whose side is a power of two).
Results are just terrible. For a 1024 x 1024 matrix, Regular MM takes 46.381 s, while Strassen's MM takes 1484.303 s (25 minutes!).
I tried to keep the code as simple as possible. Other Strassen's MM examples found on the web are not that different from my code. One issue with Strassen's code is obvious: I don't have a cutoff point that switches to regular MM.
What other issues does my Strassen's MM code have?
Thanks!
Direct links to sources
http://pastebin.com/HqHtFpq9
http://pastebin.com/USRQ5tuy
Edit 1.
First, a lot of great advice. Thank you for taking the time to share your knowledge.
I implemented the changes (kept all of my code) and added a cutoff point.
MM of a 2048 x 2048 matrix with cutoff 512 already gives good results:
Regular MM: 191.49 s
Strassen's MM: 112.179 s
A significant improvement.
Results were obtained on a prehistoric Lenovo X61 TabletPC with an Intel Centrino processor, using Visual Studio 2012.
I will do more checks (to make sure I got correct results), and will publish the results.
One issue with Strassen's code is obvious: I don't have a cutoff point that switches to regular MM.
It's fair to say that recursing down to 1 point is the bulk of (if not the entire) problem. Trying to guess at other performance bottlenecks without addressing this is almost moot due to the massive performance hit that it brings. (In other words, you're comparing Apples to Oranges.)
As discussed in the comments, cache alignment could have an effect, but not on this scale. Furthermore, cache alignment would likely hurt the regular algorithm more than the Strassen algorithm, since the latter is cache-oblivious.
void strassen(int **a, int **b, int **c, int tam) {
    // trivial case: when the matrix is 1 x 1:
    if (tam == 1) {
        c[0][0] = a[0][0] * b[0][0];
        return;
    }
That's far too small. While the Strassen algorithm has a smaller asymptotic complexity, it has a much bigger constant hidden in the Big-O. For one thing, you have function call overhead all the way down to 1 element.
This is analogous to using merge or quick sort and recursing all the way down to one element. To be efficient you need to stop the recursion when the size gets small and fall back to the classic algorithm.
In quick/merge sort, you'd fall back to a low-overhead O(n^2) insertion or selection sort. Here you would fall back to the normal O(n^3) matrix multiply.
The threshold at which you fall back to the classic algorithm should be tunable; it will likely vary depending on the hardware and on the ability of the compiler to optimize the code.
For something like Strassen multiplication, where the advantage is only O(n^2.8074) over the classic O(n^3), don't be surprised if this threshold turns out to be very high (thousands of elements?).
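Concretely, the fallback could look like the following sketch against the asker's strassen() signature (CUTOFF is a placeholder to tune, and the recursive body is elided):

// Below a tunable threshold, fall back to the classic O(n^3) triple loop.
const int CUTOFF = 64;   // placeholder; tune per machine and compiler

void classicMultiply(int **a, int **b, int **c, int tam) {
    for (int i = 0; i < tam; ++i)
        for (int j = 0; j < tam; ++j) {
            int sum = 0;
            for (int k = 0; k < tam; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

void strassen(int **a, int **b, int **c, int tam) {
    if (tam <= CUTOFF) {                 // was: if (tam == 1)
        classicMultiply(a, b, c, tam);
        return;
    }
    // ... the seven recursive Strassen products follow unchanged ...
}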
In some applications there can be many algorithms each with decreasing complexity but increasing Big-O. The result is that multiple algorithms become optimal at different sizes.
Large integer multiplication is a notorious example of this:
Grade-school multiplication: O(N^2), optimal below ~100 digits*
Karatsuba multiplication: O(N^1.585), faster than the above at ~100 digits*
Toom-Cook 3-way: O(N^1.465), faster than Karatsuba at ~3000 digits*
Floating-point FFT: slightly above O(N log(N)), faster than Karatsuba/Toom-3 at ~700 digits*
Schönhage–Strassen algorithm (SSA): O(N log(N) log(log(N))), faster than FFT at ~a billion digits*
Fixed-width Number-Theoretic Transform: O(N log(N)), faster than SSA at ~a few billion digits?*
*Note these example thresholds are approximate and can vary drastically - often by more than a factor of 10.
So there may be more problems than this, but your first problem is that you're using arrays of pointers to arrays. And since you're using array sizes that are powers of 2, this is an especially big performance hit compared to allocating the elements contiguously and using integer division to fold the long array of numbers into rows.
Anyway, that's my first guess as to a problem. As I said, there may be more, and I'll add to this answer as I discover them.
Edit: This likely only contributes a small amount to the problem. The problem is likely the one Luchian Grigore refers to involving cache line contention issues with powers of two.
I verified that my concern is valid for the naive algorithm. The time for the naive algorithm goes down by almost 50% if the array is contiguous instead. Here is the code for this (using a SquareMatrix class that is C++11 dependent) on pastebin.
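For reference, a contiguous layout can be as simple as the following sketch (a hypothetical stand-in, not the pastebin SquareMatrix class):

#include <cstddef>
#include <vector>

// One allocation for the whole matrix; rows are located by index
// arithmetic instead of chasing a pointer per row.
class SquareMatrix {
public:
    explicit SquareMatrix(int n) : n_(n), data_(static_cast<std::size_t>(n) * n) {}
    int* operator[](int row) {
        return data_.data() + static_cast<std::size_t>(row) * n_;
    }
    const int* operator[](int row) const {
        return data_.data() + static_cast<std::size_t>(row) * n_;
    }
    int size() const { return n_; }
private:
    int n_;
    std::vector<int> data_;
};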