I know how these algorithms work, but I can't decide when to use which. Are there any guidelines or considerations for when one performs better than the other?
Thanks very much.
If you want to find a solution with the fewest steps, or if your tree has infinite (or very large) depth, you should use breadth-first.
If you have a finite tree and want to traverse all possible solutions using the smallest amount of memory, then you should use depth-first.
If you are searching for the best chess move to play, you could use iterative deepening, which is a combination of both.
IDDFS combines depth-first search's space-efficiency and breadth-first search's completeness (when the branching factor is finite).
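A minimal sketch of the iterative-deepening idea, assuming a hypothetical Node struct:

```cpp
#include <vector>

// Hypothetical tree node used for illustration.
struct Node {
    int value;
    std::vector<Node*> children;
};

// Depth-limited DFS: explores at most `limit` levels below `node`.
bool dfsLimited(const Node* node, int target, int limit) {
    if (node == nullptr) return false;
    if (node->value == target) return true;
    if (limit == 0) return false;
    for (const Node* child : node->children)
        if (dfsLimited(child, target, limit - 1)) return true;
    return false;
}

// IDDFS: repeat depth-limited DFS with growing limits. Memory use is
// proportional to the current depth (like DFS), but the shallowest
// match is found first (like BFS).
bool iddfs(const Node* root, int target, int maxDepth) {
    for (int limit = 0; limit <= maxDepth; ++limit)
        if (dfsLimited(root, target, limit)) return true;
    return false;
}
```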
BFS is generally useful in cases where the graph has some meaningful "natural layering" (e.g., closer nodes represent "closer" results) and your goal result is likely to be located closer to the starting point or the starting points are "cheaper to search".
When you want to find the shortest path, BFS is a natural choice.
If your graph is infinite or programmatically generated, you would probably want to search closer layers before venturing afield, as the cost of exploring remote nodes before getting to the closer ones is prohibitive.
If accessing more remote nodes would be more expensive due to memory/disk/locality issues, BFS may again be better.
Which method to use usually depends on the application (i.e., the reason you have to search a graph). For example, topological sorting uses depth-first search, whereas the Edmonds-Karp variant of the Ford-Fulkerson algorithm for finding maximum flow uses breadth-first search.
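For instance, a sketch of the DFS-based topological sort, assuming an adjacency-list graph (names are mine):

```cpp
#include <vector>

// Topological sort by DFS: a node is appended to `order` only after
// all nodes reachable from it have been appended, so reversing the
// result gives a valid topological order (assumes the graph is a DAG).
void dfs(int u, const std::vector<std::vector<int>>& adj,
         std::vector<bool>& visited, std::vector<int>& order) {
    visited[u] = true;
    for (int v : adj[u])
        if (!visited[v]) dfs(v, adj, visited, order);
    order.push_back(u);
}

std::vector<int> topologicalSort(const std::vector<std::vector<int>>& adj) {
    std::vector<bool> visited(adj.size(), false);
    std::vector<int> order;
    for (int u = 0; u < static_cast<int>(adj.size()); ++u)
        if (!visited[u]) dfs(u, adj, visited, order);
    // `order` currently lists nodes in reverse topological order.
    return {order.rbegin(), order.rend()};
}
```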
If you are traversing a tree, depth-first will use memory proportional to its depth. If the tree is reasonably balanced (or has some other limit on its depth), it may be convenient to use recursive depth-first traversal.
However, don't do this for traversing a general graph; it will likely cause a stack overflow. For unbounded trees or general graphs, you will need some kind of auxiliary storage that can expand to a size proportional to the number of input nodes. In this case, breadth-first traversal is simple and convenient.
If your problem provides a reason to choose one node over another, you might consider using a priority queue, instead of a stack (for depth-first) or a FIFO (for breadth-first). A priority queue will take O(log K) time (where K is the number of elements currently in the queue) to find the best node at each step, but the optimization may be worth it.
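To make the container point concrete, here is a sketch of a best-first traversal; the frontier container is the only thing distinguishing it from BFS or DFS (names and the score array are mine):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Best-first traversal: structurally identical to BFS/DFS, except the
// frontier is a priority queue, so the node with the best (lowest)
// score is expanded next. `score` is a hypothetical per-node priority.
void bestFirst(int start, const std::vector<std::vector<int>>& adj,
               const std::vector<int>& score) {
    using Entry = std::pair<int, int>;  // (score, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> frontier;
    std::vector<bool> visited(adj.size(), false);
    frontier.push({score[start], start});
    while (!frontier.empty()) {
        int u = frontier.top().second;  // best node; O(log n) to remove
        frontier.pop();
        if (visited[u]) continue;
        visited[u] = true;
        // ... process u here ...
        for (int v : adj[u])
            if (!visited[v]) frontier.push({score[v], v});
    }
}
```

Swapping the priority queue for a std::stack gives depth-first order, and a std::queue gives breadth-first; everything else stays the same.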
I'm doing a homework problem that, simplified, is grouping stars into constellations given their x, y coordinates and a minimum distance. Any star can be a constellation by itself, so if, e.g., 5 stars can't connect to each other, it will return that there are 5 constellations.
I initially made an algorithm that checks every pair of points, with a runtime of O(n^2). I want to make it faster, and saw that DBSCAN runs in O(n log n) time.
My question is: if I were to use DBSCAN, which is said to run in O(n log n) time, but my minPts is 1 (the minimum size of my clusters), will that negate the efficiency of DBSCAN and make it run in O(n^2)?
As far as I'm concerned, the running time of DBSCAN in this case depends on the neighbour calculation, which is performed for each point within the given distance. As you mentioned, if you perform a linear scan you get O(n^2) in total. Nevertheless, to speed up the search you can use an index-based structure that answers each neighbour query in O(log n) time.
Please check out spatial databases.
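As an illustration of the index idea, here is a sketch of a uniform-grid index for fixed-radius neighbour queries (all names are mine); with roughly uniform data, each query touches a few cells instead of all n points:

```cpp
#include <cmath>
#include <unordered_map>
#include <vector>

struct Point { double x, y; };

// Uniform grid: each point is filed under a cell of side eps, so every
// neighbour within eps of a query point lies in the 3x3 block of cells
// around it.
class GridIndex {
    double eps;
    std::vector<Point> pts;
    std::unordered_map<long long, std::vector<int>> cells;

    static long long key(long long cx, long long cy) {
        // Spatial-hash combine; any collision is filtered out by the
        // exact distance test in neighbours().
        return cx * 73856093LL ^ cy * 19349663LL;
    }

    long long cell(double v) const { return (long long)std::floor(v / eps); }

public:
    GridIndex(const std::vector<Point>& points, double eps)
        : eps(eps), pts(points) {
        for (int i = 0; i < (int)pts.size(); ++i)
            cells[key(cell(pts[i].x), cell(pts[i].y))].push_back(i);
    }

    // Indices of all points within eps of pts[q] (excluding q itself).
    std::vector<int> neighbours(int q) const {
        std::vector<int> out;
        long long cx = cell(pts[q].x), cy = cell(pts[q].y);
        for (long long dx = -1; dx <= 1; ++dx)
            for (long long dy = -1; dy <= 1; ++dy) {
                auto it = cells.find(key(cx + dx, cy + dy));
                if (it == cells.end()) continue;
                for (int j : it->second) {
                    double ex = pts[j].x - pts[q].x;
                    double ey = pts[j].y - pts[q].y;
                    if (j != q && ex * ex + ey * ey <= eps * eps)
                        out.push_back(j);
                }
            }
        return out;
    }
};
```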
The runtime depends on epsilon being small enough so that the result sizes are small, as well as an index being able to accelerate these queries. There is no requirement on minpts, so it will work for the "degenerate" case of Single-Link clustering.
But in your case, you may simply want to use the spatial index for the neighbour search directly, rather than going through DBSCAN as a proxy?
I'm reading up on datastructures, especially immutable ones like the append-only B+ tree used in CouchDB and the Hash array mapped trie used in Clojure and some other functional programming languages.
The main reason datastructures that work well in memory might not work well on disk appears to be time spent on disk seeks due to fragmentation, as with a normal binary tree.
However, a HAMT is also very shallow, so it doesn't require any more seeks than a B tree.
Another suggested reason is that deletions from an array mapped trie are more expensive than from a B tree. This is based on the assumption that we're talking about a dense vector, and it doesn't apply when using either as a hash map.
What's more, it seems that a B tree does more rebalancing, so using it in an append-only manner produces more garbage.
So why do CouchDB and practically every other database and filesystem use B trees?
[edit] fractal trees? log-structured merge tree? mind = blown
[edit] Real-life B trees use a degree in the thousands, while a HAMT has a degree of 32. A HAMT of degree 1024 would be possible, but slower, because popcnt handles only 32 or 64 bits at a time.
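For reference, the degree-32 figure comes from indexing children with a 32-bit bitmap plus a population count; a simplified sketch of one level's lookup (names are mine):

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

// One HAMT node level, simplified: `bitmap` has a bit set for each of
// the 32 possible child slots that is occupied; occupied children are
// stored densely in `children`.
struct HamtNode {
    uint32_t bitmap = 0;
    std::vector<HamtNode*> children;

    // Child for the 5-bit slice `slot` (0..31) of a hash, or nullptr.
    HamtNode* child(unsigned slot) const {
        uint32_t bit = 1u << slot;
        if (!(bitmap & bit)) return nullptr;
        // Rank of this slot among the occupied ones = number of set
        // bits below it; this is the popcnt the edit refers to.
        unsigned index = std::popcount(bitmap & (bit - 1));
        return children[index];
    }
};
```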
B-trees are used because they are a well-understood algorithm that achieves "ideal" sorted-order read-cost. Because keys are sorted, moving to the next or previous key is very cheap.
HAMTs, and hash-based storage in general, store keys in random order. Keys are retrieved by their exact value, and there is no efficient way to find the next or previous key.
Regarding degree, it is normally selected indirectly, by selecting page size. HAMTs are most often used in memory, with pages sized for cache lines, while B-trees are most often used with secondary storage, where page sizes are related to IO and VM parameters.
Log-Structured Merge (LSM) is a different approach to sorted-order storage that achieves better write efficiency by trading off some read efficiency. That hit to read efficiency can be a problem for read-modify-write workloads, but the fewer uncached reads there are, the better LSM's overall throughput is compared to a B-tree, at the cost of higher worst-case read latency.
LSM also offers the promise of a wider performance envelope. Putting new data into its proper place is deferred, offering the possibility of tuning read-to-write efficiency by controlling the proportion of deferred cleanup work to live work. In theory, an ideal LSM with zero deferral is a B-tree, and with 100% deferral it is a log.
However, LSM is more of a "family" of algorithms than a specific algorithm like a B-tree. Their usage is growing in popularity, but it is hindered by the lack of a de-facto optimal LSM design. LevelDB/RocksDB is one of the more practical LSM implementations, but it is far from optimal.
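The core write path shared by the LSM family is small enough to sketch: writes go into an in-memory sorted buffer (the memtable), which is periodically flushed as an immutable sorted run, and reads consult the memtable and then the runs, newest first. A toy version, with no compaction and all names mine:

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

class ToyLsm {
    std::map<std::string, std::string> memtable;           // sorted write buffer
    std::vector<std::map<std::string, std::string>> runs;  // immutable, newest last
    std::size_t flushThreshold;
public:
    explicit ToyLsm(std::size_t threshold = 1024) : flushThreshold(threshold) {}

    // Writes are cheap: one sorted-buffer insert, no in-place page update.
    void put(const std::string& key, const std::string& value) {
        memtable[key] = value;
        if (memtable.size() >= flushThreshold) {
            runs.push_back(std::move(memtable));  // becomes an immutable run
            memtable.clear();
        }
    }

    // Reads pay for the deferral: check the memtable, then each run from
    // newest to oldest (a real LSM bounds this with compaction and
    // Bloom filters).
    std::optional<std::string> get(const std::string& key) const {
        if (auto it = memtable.find(key); it != memtable.end()) return it->second;
        for (auto r = runs.rbegin(); r != runs.rend(); ++r)
            if (auto it = r->find(key); it != r->end()) return it->second;
        return std::nullopt;
    }
};
```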
Another approach to achieving write-throughput efficiency is to write-optimize B-trees through write-deferral, while attempting to maintain their optimal read-throughput.
Fractal-trees, shuttle-trees, stratified-trees are this type of design, and represent a hybrid gray area between B-tree and LSM. Rather than deferring writes to an offline process, they amortize write-deferral in a fixed way. For example, such a design might represent a fixed 60%-write-deferral fraction. This means they can't achieve the 100% write-deferral performance of an LSM, but they also have a more predictable read-performance, making them more practical drop-in replacements for B-trees. (As in the commercial Tokutek MySQL and MongoDB fractal-tree backends)
B-trees are ordered by their key, while in a hash map similar keys have very different hash values and so are stored far from each other. Now think of a query that does a range scan, "give me yesterday's sales": with a hash map you have to scan the whole map to find them; with a B-tree on the sales_dtm column you'll find them nicely clustered, and you know exactly where to start and stop reading.
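The range-scan difference is visible even with the standard containers: std::map (ordered, like a B-tree index) supports the scan directly, while std::unordered_map forces a full pass. A sketch with hypothetical sales data:

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>

// Sales keyed by timestamp (e.g., seconds since epoch) -> amount.
using Ordered = std::map<int64_t, double>;            // ordered by key
using Hashed  = std::unordered_map<int64_t, double>;  // hash order

// Ordered container: jump straight to the start of the range and stop
// at its end -- only the matching rows are touched.
double salesInRange(const Ordered& sales, int64_t from, int64_t to) {
    double total = 0;
    for (auto it = sales.lower_bound(from); it != sales.end() && it->first < to; ++it)
        total += it->second;
    return total;
}

// Hash container: no order to exploit, so every entry must be examined.
double salesInRange(const Hashed& sales, int64_t from, int64_t to) {
    double total = 0;
    for (const auto& [ts, amount] : sales)
        if (ts >= from && ts < to) total += amount;
    return total;
}
```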
What is the most efficient method, in terms of execution time, to search a small array of about 4 to 16 elements for a given element, in C++? The element being searched for is a pointer in this case, so it's relatively small.
(My purpose is to prevent points in a point cloud from creating edges with points that already share an edge with them. The edge array is small for each point, but there can be a massive number of points. Also, I'm just curious too!)
Your best bet is to profile your specific application with a variety of mechanisms and see which performs best.
I suspect that, given it's unsorted, a straight linear search will be best for you. If the array updates infrequently or never, you could sort it once up front and then use a binary search.
Try a linear search; try starting with one or more binary-chop stages. The former involves more comparisons on average; the latter has more scope for cache misses and branch mispredictions, and requires that the array be pre-sorted.
Only by measuring can you tell which is faster, and then only on the platform you measured on.
If you have to do this search more than once, and the array doesn't change often/at all, sort it and then use a binary search.
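Both candidates are only a few lines, so timing them on your own data is cheap. A sketch, using the pointer element type from the question:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Unsorted small array: a straight linear scan. For 4-16 elements this
// is cache-friendly and often hard to beat.
bool containsLinear(const std::vector<void*>& a, void* p) {
    return std::find(a.begin(), a.end(), p) != a.end();
}

// If the array changes rarely, sort it once by address (std::less<void*>
// guarantees a total order over pointers) and binary-search thereafter.
void sortOnce(std::vector<void*>& a) {
    std::sort(a.begin(), a.end(), std::less<void*>());
}

bool containsBinary(const std::vector<void*>& a, void* p) {
    return std::binary_search(a.begin(), a.end(), p, std::less<void*>());
}
```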
I have an implementation of Dijkstra's Algorithm, based on the code on this website. Basically, I have a number of nodes (say 10000), and each node can have 1 to 3 connections to other nodes.
The nodes are generated randomly within a 3D space. The connections are also randomly generated; however, it always tries to find connections with its closest neighbors first and slowly increases the search radius. Each connection is given a distance of one. (I doubt any of this matters, but it's just background.)
In this case then, the algorithm is just being used to find the shortest number of hops from the starting point to all the other nodes. And it works well for 10,000 nodes. The problem I have is that, as the number of nodes increases, say towards 2 million, I use up all of my computers memory when trying to build the graph.
Does anyone know of an alternative way of implementing the algorithm to reduce the memory footprint, or is there another algorithm out there that uses less memory?
According to your comment above, you are representing the edges of the graph with a distance matrix long dist[GRAPHSIZE][GRAPHSIZE]. This will take O(n^2) memory, which is too much for large values of n. It is also not a good representation in terms of execution time when you only have a small number of edges: it will cause Dijkstra's algorithm to take O(n^2) time (where n is the number of nodes) when it could potentially be faster, depending on the data structures used.
Since in your case you said each node is connected to at most 3 other nodes, you shouldn't use this matrix; instead, for each node, store a list of the nodes it is connected to. Then, when you want to go over the neighbors of a node, you just iterate over this list.
In some specific cases you don't even need to store this list because it can be calculated for each node when needed. For example, when the graph is a grid and each node is connected to the adjacent grid nodes, it's easy to find a node's neighbors on the fly.
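A sketch of the adjacency-list representation with Dijkstra's algorithm on top; with 2 million nodes and at most 3 edges each, this is on the order of tens of megabytes, instead of the terabytes a 2M x 2M matrix would need:

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

// Adjacency list: node -> (neighbor, edge length). With <= 3 edges per
// node this is O(V + E) memory instead of the O(V^2) matrix.
struct Edge { int to; uint32_t len; };
using Graph = std::vector<std::vector<Edge>>;

std::vector<uint32_t> dijkstra(const Graph& g, int source) {
    const uint32_t INF = std::numeric_limits<uint32_t>::max();
    std::vector<uint32_t> dist(g.size(), INF);
    using Entry = std::pair<uint32_t, int>;  // (distance, node)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    dist[source] = 0;
    pq.push({0, source});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d != dist[u]) continue;          // stale queue entry, skip
        for (const Edge& e : g[u])
            if (d + e.len < dist[e.to]) {
                dist[e.to] = d + e.len;
                pq.push({dist[e.to], e.to});
            }
    }
    return dist;
}
```

Incidentally, since all your edges have length 1, a plain breadth-first search over the same adjacency list would give the same distances without the heap.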
If you really cannot afford the memory, even after minimizing your graph representation, you may develop a variation of Dijkstra's algorithm using a divide-and-conquer approach.
The idea is to split the data into small chunks, so you'll be able to run Dijkstra's algorithm within each chunk, for each of the points in it.
For each solution generated from these small chunks, treat it as a single node in another data chunk, from which you'll start another run of Dijkstra's algorithm.
For example, consider the points below:
.B .C
.E
.A .D
.F .G
You can select the points closest to a given node, say, within two hops, and then use the solution as part of the extended graph, treating the former points as a single set of points with a distance equal to the resulting distance of the Dijkstra solution.
Say you start from D:
select the closest points to D within a given number of hops;
use Dijkstra's algorithm upon the selected entries, commencing from D;
use the solution as a graph with the central node D and the last nodes in the shortest paths as nodes directly linked to D;
extend the graph, repeating the algorithm until all the nodes have been considered.
Although there's costly extra processing here, you'd be able to overcome the memory limitation and, if you have other machines available, even distribute the work.
Please note this is just the idea of the process; what I've described is not necessarily the best way to do it. You may find something interesting by looking into distributed Dijkstra's algorithms.
I like boost::graph a lot. Its memory consumption is very decent (I've used it on road networks with 10 million nodes and 2 GB of RAM).
It has a Dijkstra implementation, but if the goal is to implement and understand it by yourself, you can still use their graph representation (I suggest adjacency list) and compare your result with theirs to be sure your result is correct.
Some people mentioned other algorithms. I don't think this will play a big role in memory usage, but more likely in speed: with 2M nodes, if the topology is close to a street network, the running time from one node to all others will be less than a second.
http://www.boost.org/doc/libs/1_52_0/libs/graph/doc/index.html
I have been trying to understand the difference between the two as they apply to different algorithms used to choose tasks to run in some CPU schedulers.
What is the difference between an RB tree that places lowest time needed process on the left and chooses nodes from the left to run, and a queue that places them in a shortest job first order?
A single queue has a time complexity of O(1) for finding the next process, because it can just pop the next one to execute. Insertion is also O(1), as it places the new item at the end of the queue. This kind of round-robin scheduler was used, e.g., in early Linux kernels. The downside was that all tasks were executed in the same order every time.
To fix this, a simple improvement is to keep popping the head of the queue in O(1), but search for a suitable slot on insert, by priority and/or time requirements, which is O(n). Some schedulers keep multiple queues (or even a priority queue), whose operation times vary depending on the implementation and needs.
A red-black tree, on the other hand, has O(log n) time complexity both for getting the next process and for inserting one. The principal idea of a red-black tree is that it keeps itself balanced with every operation, thus remaining efficient without any separate rebalancing passes. A priority queue can also be implemented using a red-black tree internally.
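To make the contrast concrete, here is a sketch using standard containers: std::deque for the round-robin queue, and std::multiset (typically implemented as a red-black tree) for the ordered run queue; the Task fields are invented for illustration:

```cpp
#include <deque>
#include <set>

struct Task {
    int id;
    long runtime;  // invented field: time consumed so far ("vruntime"-like)
    bool operator<(const Task& other) const { return runtime < other.runtime; }
};

// Round-robin queue: O(1) pick (pop the front), O(1) insert (push at the
// back), but tasks always run in the same cyclic order.
Task pickRoundRobin(std::deque<Task>& q) {
    Task t = q.front();
    q.pop_front();
    return t;
}

// Ordered run queue: std::multiset is typically a red-black tree. Picking
// the leftmost (least-run) task and re-inserting it afterwards are both
// O(log n), and the tree rebalances itself on every operation.
Task pickLeastRun(std::multiset<Task>& rq) {
    auto it = rq.begin();  // leftmost node = smallest runtime
    Task t = *it;
    rq.erase(it);
    return t;
}

void charge(std::multiset<Task>& rq, Task t, long ran) {
    t.runtime += ran;      // account for the time it just used
    rq.insert(t);          // O(log n); keeps the queue ordered
}
```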
A good starting point on (Linux) schedulers is the CFS article on IBM's site, which has a nice set of references, as well.