I have an implementation of Kruskal's algorithm in C++ (using a disjoint-set data structure). I'm trying to find ways of creating worst-case test cases for the total running time of the algorithm. I'm confused, however, about what actually drives the algorithm towards its worst case, and was wondering whether anyone here knows of scenarios that would really make Kruskal's algorithm struggle.
As of now, the main test I've considered that might theoretically stress Kruskal's algorithm is one where all the edge weights are the same. An example would be the following:
4 4
(4, 4) 4 //(4,4) vertex and weight = 4
(4, 4) 4
(4, 4) 4
(4, 4) 4
What I end up running into is that, regardless of what I do, when I try to slow the algorithm down I just end up with no minimum spanning tree, and so I fail to actually test the bounds of the algorithm.
To stress Kruskal's algorithm, you need a graph with as many redundant edges as possible, and at least one necessary edge that will be considered last (since Kruskal's algorithm sorts the edges by weight). Here's an example.
The edges with weight 1 are necessary, and will be taken first. The edges with weight 2 are redundant and will cause Kruskal's algorithm to waste time before getting to the edge with weight 3.
Note that the running time of Kruskal's algorithm is determined primarily by the time to sort the edges by weight. Adding additional redundant edges of medium weight will increase the sort time as well as the search time.
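For instance, a small generator along the lines described above might look like this; the plain "u v w" edge-list output format is an assumption, so adapt it to whatever your program actually reads:

#include <cstdio>

int main() {
    const int n = 1000;  // number of vertices; raise this to stress the implementation harder
    // Weight-1 chain over vertices 0..n-2: necessary, taken first.
    // Weight-2 complete graph over vertices 0..n-2: all redundant, all rejected.
    // Weight-3 edge (n-2, n-1): necessary, but considered last after sorting.
    long long m = (n - 2) + 1LL * (n - 1) * (n - 2) / 2 + 1;
    std::printf("%d %lld\n", n, m);
    for (int v = 0; v + 1 < n - 1; ++v)
        std::printf("%d %d 1\n", v, v + 1);
    for (int u = 0; u < n - 1; ++u)
        for (int v = u + 1; v < n - 1; ++v)
            std::printf("%d %d 2\n", u, v);
    std::printf("%d %d 3\n", n - 2, n - 1);
    return 0;
}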
Kruskal's algorithm consists of two phases: sorting the edges and then performing union-find. If you implement the second phase as a disjoint-set forest with the path-compression and union-by-rank heuristics, the sorting will be much slower than the second phase. Thus, to create a worst-case scenario for Kruskal, you should simply generate a worst-case input for the sorting algorithm you are using. If you use the built-in sort, note that it may have optimizations that make it run much faster on an already sorted array.
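For reference, a minimal disjoint-set forest with union by rank and path compression might look like this sketch (names are illustrative, not from the original code):

#include <numeric>
#include <utility>
#include <vector>

struct DisjointSet {
    std::vector<int> parent, rank_;
    explicit DisjointSet(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);  // every element starts as its own root
    }
    int find(int x) {
        if (parent[x] != x)
            parent[x] = find(parent[x]);             // path compression
        return parent[x];
    }
    bool unite(int a, int b) {                       // returns false if a and b were already connected
        a = find(a); b = find(b);
        if (a == b) return false;
        if (rank_[a] < rank_[b]) std::swap(a, b);    // union by rank: attach shorter tree to taller
        parent[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
        return true;
    }
};

In Kruskal, an edge is kept exactly when unite() returns true; redundant edges are the ones for which it returns false.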
Related
I'm working on implementing a ModelClass for any 3D model in my DirectX 11/12 pipeline.
My specific problem lies within calculating the min and max for the BoundingBox structure I wish to use as a member of the ModelClass.
I have two approaches to calculating them.
Approach 1.
As each vertex is read from the file, keep running min x/y/z and max x/y/z values and compare each loaded vertex against the current min/max.
Approach 2.
After all the vertices have been loaded, sort them by x, then by y, then by z, taking the lowest and highest values for each axis.
Which Approach would you recommend and why?
Approach 1
Time complexity is in O(n) and memory complexity is O(1).
It is simple to implement.
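For example, a minimal running min/max sketch (the Vertex layout here is a placeholder, not your actual ModelClass types):

#include <algorithm>
#include <cfloat>

struct Vertex { float x, y, z; };   // placeholder vertex layout

struct Bounds {
    float minX = FLT_MAX, minY = FLT_MAX, minZ = FLT_MAX;
    float maxX = -FLT_MAX, maxY = -FLT_MAX, maxZ = -FLT_MAX;

    // Call once per vertex as it is read from the file.
    void extend(const Vertex& v) {
        minX = std::min(minX, v.x); maxX = std::max(maxX, v.x);
        minY = std::min(minY, v.y); maxY = std::max(maxY, v.y);
        minZ = std::min(minZ, v.z); maxZ = std::max(maxZ, v.z);
    }
};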
Approach 2
Time complexity is O(n log n) and memory complexity is at least linear if you copy the arrays or use merge sort, or O(1) extra (ignoring the recursion stack) if you use an in-place sorting algorithm like quicksort.
This has to be done three times, once for each dimension.
All in all, Approach 1 is best in every scenario I can think of.
Sorting is generally not a cheap operation, especially as your models get larger, so Approach 1 looks more efficient to me; if you are unsure, measure both and see which one takes longer.
If you are using a library like Assimp, I believe it takes care of bounding boxes for you, but that might not be an option if you are building the pipeline as a learning exercise.
I created a C++ program that outputs the input size vs. execution time (microseconds) of algorithms and writes the results to a .csv file. Upon importing the .csv in LibreOffice Calc and plotting graphs,
I noticed that binary search, for input sizes up to 10000, appears to run in constant time even though I search for an element that is not in the array. Similarly, up to the same input size, merge sort seems to run in linear time instead of the linearithmic (n log n) time it takes in all cases.
Insertion Sort and Bubble Sort run just fine and the output plot resembles their worst case quadratic complexity very closely.
I provide the input arrays from a file. For n = 5, the contents of the file are as follows. Each line represents an input array:
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
The results.csv file on running insertion sort is:
Input,Time(ms)
5,4
4,3
3,2
2,2
1,2
The graph of binary search for maximum input 100 is here.
Also the graph of merge sort for maximum input 1000 is here which looks a lot like it is linear (the values in the table also suggest so).
Any help as to why this is happening will be greatly appreciated.
Here is a link to the github repository for the source code: https://github.com/dhanraj-s/Time-Complexity
Complexity is about asymptotic worst case behaviour.
...worst case...
Even a quadratic algorithm may run in linear time if the input allows it; insertion sort on an already sorted array is the classic example. Its complexity is still quadratic, because in the worst case the algorithm can only guarantee a quadratic runtime.
...asymptotic...
It might well be that the asymptotic behaviour for the algorithms starts to settle in only for input sizes much bigger than what you chose.
That being said, in practice complexity alone is not the most useful metric; if you care about performance, you need to measure.
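One likely reason the small inputs look constant-time is that a single binary search over 10,000 elements finishes far below the resolution and noise floor of the clock, so repeating each measurement many times and averaging helps. A rough sketch of such a harness (the sizes and repetition count are arbitrary choices of mine):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    for (int n = 1000; n <= 1000000; n *= 10) {
        std::vector<int> a(n);
        for (int i = 0; i < n; ++i) a[i] = 2 * i;               // sorted array of even numbers
        const int reps = 100000;                                 // repeat to rise above clock resolution
        int hits = 0;
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < reps; ++r)
            hits += std::binary_search(a.begin(), a.end(), 1);   // odd key: never found
        auto stop = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(stop - start).count() / reps;
        std::printf("%d,%.4f,%d\n", n, us, hits);                // size, average microseconds, sink
    }
    return 0;
}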
The current algorithm is a genetic algorithm using mutation and ordered crossover. We modified the original ordered crossover by removing the depot (the end points), performing the crossover, and then adding them back in afterwards. The parent selection uses roulette selection with
goodness = 1/time_to_travel_route.
Without the Crossover, the algorithm produces good results (using only mutations), but adding it in significantly worsens them. Here is a link to a post with a similar problem: Why does adding Crossover to my Genetic Algorithm gives me worse results?
Following the advice given in the post, the goodness was changed to
goodness = 1/(time_to_travel_route)^n with varying n
However, this still did not produce a favorable result.
Population Size: tried from 100 to 10,000
Stop Condition: tried from 10 generations to 1000
Fitness Algorithm: tried 1/(time_to_travel_route)^n with varying n from 1 to Big Numbers
Mutation Algorithm: The algorithm uses 2-opt. All offspring are mutated. The mutation step tries different mutations until it finds a better solution; however, if it only finds a worse solution, it may still return that worse solution with probability p. This is done to add some randomness and escape local minima. We varied p from 5 to 20 percent.
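For reference, here is a sketch of the ordered crossover step as described above (depots stripped before the crossover and re-attached afterwards); all names and types are illustrative, not taken from the actual code:

#include <algorithm>
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

// Ordered crossover (OX) on routes of the form depot, c1, ..., ck, depot.
// Assumes both parents visit the same set of distinct cities and k >= 1.
std::vector<int> orderedCrossover(const std::vector<int>& p1,
                                  const std::vector<int>& p2,
                                  std::mt19937& rng) {
    const int depot = p1.front();
    std::vector<int> a(p1.begin() + 1, p1.end() - 1);    // parent 1 without depots
    std::vector<int> b(p2.begin() + 1, p2.end() - 1);    // parent 2 without depots
    const int n = static_cast<int>(a.size());

    std::uniform_int_distribution<int> dist(0, n - 1);
    int lo = dist(rng), hi = dist(rng);
    if (lo > hi) std::swap(lo, hi);

    std::vector<int> child(n, -1);
    std::unordered_set<int> used;
    for (int i = lo; i <= hi; ++i) {                      // copy a slice from parent 1
        child[i] = a[i];
        used.insert(a[i]);
    }
    int pos = (hi + 1) % n;
    for (int i = 0; i < n; ++i) {                         // fill the rest from parent 2, in order
        int city = b[(hi + 1 + i) % n];
        if (used.count(city)) continue;
        child[pos] = city;
        pos = (pos + 1) % n;
    }
    child.insert(child.begin(), depot);                   // re-attach the depots
    child.push_back(depot);
    return child;
}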
In the wikipedia article on sorting algorithms,
http://en.wikipedia.org/wiki/Sorting_algorithm#Summaries_of_popular_sorting_algorithms
under Bubble sort it says: "Bubble sort can also be used efficiently on a list of any length that is nearly sorted (that is, the elements are not significantly out of place)."
So my question is: without first sorting the list with a sorting algorithm, how can one know whether it is nearly sorted or not?
Are you familiar with the general sorting lower bound? You can prove that any comparison-based sorting algorithm must make Ω(n log n) comparisons in the average case. The proof is an information-theoretic argument: there are n! possible permutations of the input array, and since the only way to learn which permutation you got is to make comparisons, you have to make at least lg(n!) ≈ n lg n comparisons in order to be certain that you know the structure of your input permutation.
I haven't worked out the math on this, but I suspect that you could make similar arguments to show that it's difficult to learn how sorted a particular array is. Essentially, if you don't do a large number of comparisons, then you wouldn't be able to tell apart an array that's mostly sorted from an array that is actually quite far from sorted. As a result, all the algorithms I'm aware of that measure "sortedness" take a decent amount of time to do so.
For example, one measure of the level of "sortedness" in an array is the number of inversions in that array. You can count the number of inversions in an array in time O(n log n) using a divide-and-conquer algorithm based on mergesort, but with that runtime you could just sort the array instead.
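A sketch of that mergesort-based inversion count:

#include <vector>

// Counts inversions (pairs i < j with a[i] > a[j]) while merge-sorting a[lo, hi) into tmp.
long long countInversions(std::vector<int>& a, int lo, int hi, std::vector<int>& tmp) {
    if (hi - lo < 2) return 0;
    int mid = lo + (hi - lo) / 2;
    long long inv = countInversions(a, lo, mid, tmp) + countInversions(a, mid, hi, tmp);
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (a[i] <= a[j]) tmp[k++] = a[i++];
        else { tmp[k++] = a[j++]; inv += mid - i; }   // a[i..mid) are all greater than a[j]
    }
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    for (int t = lo; t < hi; ++t) a[t] = tmp[t];
    return inv;
}

long long countInversions(std::vector<int> a) {        // pass by value: caller's array untouched
    std::vector<int> tmp(a.size());
    return countInversions(a, 0, static_cast<int>(a.size()), tmp);
}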
Typically, the way that you'd know that your array was mostly sorted was to know something a priori about how it was generated. For example, if you're looking at temperature data gathered from 8AM - 12PM, it's very likely that the data is already mostly sorted (modulo some variance in the quality of the sensor readings). If your data looks at a stock price over time, it's also likely to be mostly sorted unless the company has a really wonky trajectory. Some other algorithms also partially sort arrays; for example, it's not uncommon for quicksort implementations to stop sorting when the size of the array left to sort is small and to follow everything up with a final insertion sort pass, since every element won't be very far from its final position then.
I don't believe there exists any standardized measure of how sorted or random an array is.
You can come up with your own measure - for example, count the number of adjacent pairs that are out of order (suggested in a comment), or count the number of larger values that occur before smaller values in the array (the inversion count from above, which is trickier than a simple single pass).
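The adjacent-pairs measure, for instance, is a single O(n) pass; a small sketch:

#include <cstddef>
#include <vector>

// Fraction of adjacent pairs that are out of order: 0.0 for a sorted array,
// 1.0 for a strictly decreasing one.
double adjacentDisorder(const std::vector<int>& a) {
    if (a.size() < 2) return 0.0;
    int outOfOrder = 0;
    for (std::size_t i = 1; i < a.size(); ++i)
        if (a[i - 1] > a[i]) ++outOfOrder;
    return static_cast<double>(outOfOrder) / (a.size() - 1);
}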
I recently added a third version of Dijkstra's algorithm for single-source shortest paths to my project.
I realize that there are many different implementations which vary strongly in performance and also vary in the quality of the result on large graphs. With my data set (>100,000 vertices) the runtime varies from 20 minutes to a few seconds. The shortest paths also vary by 1-2%.
Which is the best implementation you know?
EDIT:
My data is a hydraulic network, with 1 to 5 edges per node. It's comparable to a street map. I made some modifications to an already accelerated algorithm (using a sorted list for all remaining nodes) and now get the same results in a fraction of the time. I have been searching for such a thing for quite a while, and I wonder if such an implementation already exists.
I cannot explain the slight differences in results. I know that Dijkstra is not a heuristic, yet all the implementations seem to be correct. The faster solutions produce the results with the shorter paths. I use double-precision math exclusively.
EDIT 2:
I found out that the differences in the resulting paths were indeed my fault. I had inserted special handling for some vertices (only valid in one direction) and had forgotten about it in the other implementation.
But I'm still more than surprised that Dijkstra can be accelerated so dramatically by the following change:
In general a Dijkstra algorithm contains a loop like:
MyListType toDoList; // list sorted by smallest tentative distance
InsertAllNodes(toDoList); // every node goes into the list up front
while (!toDoList.empty())
{
    MyNodeType *node = *toDoList.first();
    toDoList.erase(toDoList.first());
    ...
}
If you change this a little bit, it works the same, but performs better:
MyListType toDoList; // list sorted by smallest tentative distance
toDoList.insert(startNode); // only the start node is inserted up front
while (!toDoList.empty())
{
    MyNodeType *node = *toDoList.first();
    toDoList.erase(toDoList.first());
    for (MyNeighborType *x = node->Neighbors; x != NULL; x++)
    {
        ...
        toDoList.insert(x->Node); // nodes enter the list only once they are reached
    }
}
It seems that this modification reduces the runtime not by an order of magnitude, but by an order of complexity. It reduced my runtime from 30 seconds to less than 2. I cannot find this modification in any literature. It is also very clear why: the reason lies in the sorted list; insert/erase performs much worse with 100,000 elements than with a handful.
ANSWER:
After a lot of googling I found it myself. The answer is clearly:
the Boost Graph Library. Amazing - I had not found this for quite a while. If you think that there is no performance variation between Dijkstra implementations, see Wikipedia.
The best implementations known for road networks (>1 million nodes) have query times expressed in microseconds; see the 9th DIMACS Implementation Challenge (2006) for more details. Note that these are not simply Dijkstra, of course, as the whole point was to get results faster.
Maybe I am not answering your question. My point is: why use Dijkstra when there are considerably more efficient algorithms for your problem? If your graph fulfills the triangle inequality (it is a Euclidean graph),
|ab| + |bc| >= |ac|
(the distance from node a to node b plus the distance from node b to node c is at least the distance from node a to node c), then you can apply the A* algorithm.
This algorithm is pretty efficient. Otherwise consider using heuristics.
The implementation is not the major issue; the choice of algorithm matters more.
Two points I'd like to make:
1) Dijkstra vs A*
Dijkstra's algorithm is a dynamic-programming algorithm, not a heuristic. A* is a heuristic search because it also uses a heuristic function (let's say h(x)) to estimate how close a point x is to the end point. This information is exploited in subsequent decisions about which nodes to explore next.
For cases such as a Euclidean graph, A* works well because the heuristic function is easy to define (one can simply use the Euclidean distance, for example). However, for non-Euclidean graphs it may be harder to define the heuristic function, and a wrong definition can lead to a non-optimal path.
Therefore, Dijkstra has the advantage over A* that it works for any general graph (although A* can be faster in some cases). It could well be that certain implementations use these algorithms interchangeably, resulting in different results.
2) Dijkstra's algorithm (and others such as A*) uses a priority queue to obtain the next node to explore. A good implementation may use a binary heap instead of a plain queue, and an even better one may use a Fibonacci heap. This could explain the different run times.
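For illustration, a sketch of the binary-heap variant using std::priority_queue with lazy deletion of stale entries (my own naming, not taken from any particular implementation):

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Adjacency list: adj[u] holds (neighbour, edge weight) pairs.
std::vector<double> dijkstra(const std::vector<std::vector<std::pair<int, double>>>& adj,
                             int source) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(adj.size(), INF);
    using Entry = std::pair<double, int>;                          // (tentative distance, vertex)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    dist[source] = 0.0;
    pq.push({0.0, source});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d > dist[u]) continue;                                 // stale entry: skip (lazy deletion)
        for (auto [v, w] : adj[u]) {
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                pq.push({dist[v], v});                             // push again instead of decrease-key
            }
        }
    }
    return dist;
}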
The last time I checked, Dijkstra's Algorithm returns an optimal solution.
All "true" implementations of Dijkstra's should return the same result each time.
Similarly, asymptotic analysis shows us that minor optimisations to particular implementations are not going to affect performance significantly as the input size increases.
It's going to depend on a lot of things. How much do you know about your input data? Is it dense, or sparse? That will change which versions of the algorithm are the fastest.
If it's dense, just use an adjacency matrix. If it's sparse, you might want to look at more efficient data structures for finding the next closest vertex. If you have more information about your data set than just the graph connectivity, see whether a different algorithm such as A* would work better.
Problem is, there isn't "one fastest" version of the algorithm.