Scaling a Dijkstra's algorithm implementation - C++

I have a graph in which each edge has a weight.
I have implemented Dijkstra's algorithm to find the shortest path from vertex A to B.
The weights for the graph are read from a key/value DB (redis.io).
Each weights DB is around 2 GB.
There are 50 weight DBs (i.e. 50 different 2 GB files of weight values, which I have stored in Redis).
To find the shortest path, the function FindPath(Start, End, DB_name) is used.
Dijkstra's reads the weight values from memory (Redis is an in-memory key/value store), but my RAM is only 6 GB. It is not possible to hold 2 GB × 50 DBs in memory at the same time.
Path requests can be random and concurrent.
What is the best way to store the weights DBs?
Is increasing the RAM the only option to speed up the program?
EDIT
Number of edges: 462,505

If speed is the concern, the main option is to increase RAM. You cannot achieve similar performance with a NoSQL DB (e.g. MongoDB). Another option would be to try to parallelize the algorithm on a multi-core system, but this is very tough, as the final solution is global.
[EDIT]
The fastest way to store the weights is a contiguous array of weights indexed by edge number, one array per DB. If all arrays cannot fit in your RAM, you can design some basic caching mechanism, swapping DBs from file to array (hoping not all DBs are accessed simultaneously).
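A minimal sketch of that caching idea, assuming each DB can be dumped to a flat binary file of doubles indexed by edge number (WeightCache, loadWeights and the eviction budget are made-up names for illustration, not anything provided by Redis or by the original FindPath):

#include <cstddef>
#include <fstream>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

class WeightCache {
public:
    explicit WeightCache(std::size_t maxLoadedDbs) : maxLoaded_(maxLoadedDbs) {}

    // Return the contiguous weight array for one DB, loading it on demand
    // and evicting the least recently used array when over budget.
    const std::vector<double>& get(const std::string& dbFile) {
        auto it = cache_.find(dbFile);
        if (it != cache_.end()) { touch(dbFile); return it->second; }
        if (cache_.size() >= maxLoaded_) evictOldest();
        cache_[dbFile] = loadWeights(dbFile);
        lru_.push_front(dbFile);
        return cache_[dbFile];
    }

private:
    // Read a whole file of raw doubles into memory (no real error handling;
    // this is only a sketch).
    static std::vector<double> loadWeights(const std::string& file) {
        std::ifstream in(file, std::ios::binary | std::ios::ate);
        if (!in) return {};
        std::size_t bytes = static_cast<std::size_t>(in.tellg());
        std::vector<double> w(bytes / sizeof(double));
        in.seekg(0);
        in.read(reinterpret_cast<char*>(w.data()), static_cast<std::streamsize>(bytes));
        return w;
    }
    void touch(const std::string& key) { lru_.remove(key); lru_.push_front(key); }
    void evictOldest() { cache_.erase(lru_.back()); lru_.pop_back(); }

    std::size_t maxLoaded_;
    std::list<std::string> lru_;                               // most recent first
    std::unordered_map<std::string, std::vector<double>> cache_;
};

With something like this, FindPath(Start, End, DB_name) would call cache.get(DB_name) once per query and then read weights as plain array lookups; with 6 GB of RAM a budget of about two resident 2 GB DBs is realistic.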

Related

Comparison of HoG with CNN

I am working on a comparison of Histogram of Oriented Gradients (HoG) and a Convolutional Neural Network (CNN) for weed detection. I have two datasets of two different weeds.
The CNN architecture is a 3-layer network.
1) The first dataset contains two classes and has 18 images.
The dataset is enlarged using data augmentation (rotation, adding noise, illumination changes).
Using the CNN I get a test accuracy of 77%, and for HoG with SVM 78%.
2) The second dataset contains leaves of two different plants.
Each class contains 2,500 images, without data augmentation.
For this dataset, using the CNN I get a test accuracy of 94%, and for HoG with SVM 80%.
My question is: why am I getting higher accuracy for HoG on the first dataset? The CNN should be much better than HoG.
The only reason that comes to my mind is that the first dataset has only 18 images and is less diverse compared to the second dataset. Is that correct?
Yes, your intuition is right: having such a small data set (just 18 images before data augmentation) can cause the worse performance. In general, CNNs usually need at least thousands of images. The SVM does not perform as badly because of the regularization (which you most probably use) and because of the probably much lower number of parameters in the model. There are ways to regularize deep nets, e.g., with your first data set you might want to give dropout a try, but I would rather try to acquire a substantially larger data set.

Python Array is very large, and processing runs out of memory

I am trying to create a distance matrix to run the DBSCAN algorithm for clustering purposes. The final distance matrix has 174,000 × 174,000 entries, all floating-point numbers between 0 and 1. I have the individual lists (all 174,000 of them) saved, with the numbers stored as ints, but when trying to consolidate them into a single array, I keep running out of memory.
Is there a way to compress the data (I have tried HDF5, but that also seems to struggle) so that it can deal with such a large data set?

3D lookup table to discretize the volume

I have a depth camera that returns measured distance values of the volume in millimeters. I need to create a 3D lookup table to store all possible distance values for each pixel in the image, so I end up with an array of size 640×480×2048. This approach is very memory-consuming, and if I use integers in C++ it takes about 2.5 GB of RAM. In addition, I also have some parameters for each item in the volume, so altogether it reaches the maximum capacity of my 4 GB of memory.
My question is: is there a good way to store and manage the data set described above?
P.S. Please don't consider the option of file storage; it doesn't suit my case.
Thanks in advance

Efficient algorithm to deal with big-data network files for computing n nearest nodes

Problem:
I have two network files (say NET1 and NET2); each has a set of nodes with a unique ID for each node and geographic coordinates X and Y. Each node in NET2 is to have n connections to NET1, and the IDs of those n nodes will be determined by the minimum straight-line distance. The output will have three fields: the IDs of the nodes in NET1 and NET2, and the distance between them. All the files are in tab-delimited format.
One way forward..
One way to implement this is, for each node in NET2, to loop through each node in NET1 and compute all NET1-NET2 distance combinations, sort them by NET2 node ID and by distance, and write out the first four records for each node. But the problem is that there are close to 2 million nodes in NET1 and 2,000 nodes in NET2 - that is 4 billion distances to be calculated and written in the first step of this algorithm... and the runtime is quite prohibitive!
Request:
I was curious whether any of you folks out there have faced a similar issue. I would love to hear from y'all about any algorithms and data structures that can be used to speed up the processing. I know the scope of this question is very broad, but I hope someone can point me in the right direction, as I have very limited experience optimizing code for data of this scale.
Languages:
I am trying in C++, Python and R.
Please pitch in with ideas! Help greatly appreciated!
A kd-tree is one of the options. It allows you to find the nearest neighbor (or a set of nearest neighbors) in reasonable time. Of course, you have to build the tree in the beginning, which takes some time, but generally a kd-tree is suitable if you don't have to add/remove nodes at runtime, which seems to be your case. It also performs better in low dimensions (in your case the dimension is 2).
Another possible data structure is the octree (quadtree in 2D); it's a simpler data structure (quite easy to implement), but the kd-tree can be more efficient.
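Since C++ is among the languages mentioned, here is a rough, illustrative sketch of the 2D kd-tree approach (buildKd, nearest and the Point layout are made-up names; the tree is stored implicitly in the sorted vector, and the n-nearest variant would keep a small bounded heap of candidates instead of a single best):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; int id; };

// Build the tree in place: points[lo, hi) becomes an implicit balanced
// kd-tree, splitting alternately on x (even depth) and y (odd depth).
void buildKd(std::vector<Point>& pts, std::size_t lo, std::size_t hi, int depth = 0) {
    if (hi - lo <= 1) return;
    std::size_t mid = lo + (hi - lo) / 2;
    int axis = depth % 2;
    auto cmp = [axis](const Point& a, const Point& b) {
        return axis == 0 ? a.x < b.x : a.y < b.y;
    };
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi, cmp);
    buildKd(pts, lo, mid, depth + 1);
    buildKd(pts, mid + 1, hi, depth + 1);
}

// Recursive nearest-neighbour query over the implicit tree built above;
// `best`/`bestDist` carry the best candidate found so far.
void nearest(const std::vector<Point>& pts, std::size_t lo, std::size_t hi,
             const Point& q, int depth, std::size_t& best, double& bestDist) {
    if (lo >= hi) return;
    std::size_t mid = lo + (hi - lo) / 2;
    double dx = pts[mid].x - q.x, dy = pts[mid].y - q.y;
    double d = std::hypot(dx, dy);
    if (d < bestDist) { bestDist = d; best = mid; }
    double diff = (depth % 2 == 0) ? dx : dy;   // signed offset to the splitting plane
    // Descend into the half the query lies in first, then the other half
    // only if the splitting plane is closer than the best distance so far.
    if (diff > 0) {
        nearest(pts, lo, mid, q, depth + 1, best, bestDist);
        if (diff < bestDist) nearest(pts, mid + 1, hi, q, depth + 1, best, bestDist);
    } else {
        nearest(pts, mid + 1, hi, q, depth + 1, best, bestDist);
        if (-diff < bestDist) nearest(pts, lo, mid, q, depth + 1, best, bestDist);
    }
}

// Usage sketch (NET1 as the point set, one query per NET2 node):
//   buildKd(net1, 0, net1.size());
//   std::size_t best = 0;
//   double bestDist = std::numeric_limits<double>::infinity();
//   nearest(net1, 0, net1.size(), net2node, 0, best, bestDist);

Building once over the ~2 million NET1 nodes and querying once per NET2 node turns the 4 billion distance computations into roughly 2,000 logarithmic-time searches.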

Select all points in a matrix within 30m of another point

So if you look at my other posts, it's no surprise I'm building a robot that can collect data in a forest, and stick it on a map. We have algorithms that can detect tree centers and trunk diameters and can stick them on a cartesian XY plane.
We're planning to use certain 'key' trees as natural landmarks for localizing the robot, using triangulation and trilateration among other methods, but programming this and keeping data straight and efficient is getting difficult using just Matlab.
Is there a technique for sub-setting an array or matrix of points? Say I have 1000 trees stored over 1 km (1000 m); is there a way to, say, select only the points within a 30 m radius of my current location and work only with those?
I would just use a GIS, but I'm doing this in Matlab and I'm unaware of any GIS plugins for Matlab.
I forgot to mention, this code is going online, meaning it's going on a robot for real-time execution. I don't know if, as the map grows to several miles, using a different data structure will help or if calculating every distance to a random point is what a spatial database is going to do anyway.
I'm thinking of mirroring the array of trees into two arrays, one sorted by X and the other by Y, then bubble sorting to determine the 30 m range in each. I'd do the same for both arrays, X and Y, and then have a third cross-link table to select the individual values. But I don't know what that's called or how to program it, and I'm sure someone already has, so I don't want to reinvent the wheel.
Cartesian Plane
GIS
You are looking for a spatial database like a quadtree or a kd-tree. I found two kd-tree implementations here and here, but didn't find any quadtree implementations for Matlab.
The simple solution of calculating all the distances and scanning through seems to run almost instantaneously:
lim = 1;
num_trees = 1000;
trees = randn(num_trees,2); %# list of trees as Nx2 matrix
cur = randn(1,2); %# current point as 1x2 vector
dists = hypot(trees(:,1) - cur(1), trees(:,2) - cur(2)); %# distance from all trees to current point
nearby = trees((dists <= lim),:); %# find the nearby trees, pull them from the original matrix
On a 1.2 GHz machine, I can process 1 million trees (1 MTree?) in < 0.4 seconds.
Are you running the Matlab code directly on the robot? Are you using the Real-Time Workshop or something? If you need to translate this to C, you can replace hypot with sqr(trees[i].x - pos.x) + sqr(trees[i].y - pos.y), and replace the limit check with < lim^2. If you really only need to deal with 1 KTree, I don't know that it's worth your while to implement a more complicated data structure.
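For reference, the C translation being described might look roughly like this in C++ (Tree, trees and pos are placeholder names; the point is only the squared-distance comparison, which avoids a sqrt/hypot call per tree):

#include <cstddef>
#include <vector>

struct Tree { double x, y; };

// Collect the indices of all trees within `lim` of `pos`, comparing squared
// distances against lim*lim so no square root is needed.
std::vector<std::size_t> treesWithin(const std::vector<Tree>& trees,
                                     const Tree& pos, double lim) {
    std::vector<std::size_t> nearby;
    const double lim2 = lim * lim;
    for (std::size_t i = 0; i < trees.size(); ++i) {
        const double dx = trees[i].x - pos.x;
        const double dy = trees[i].y - pos.y;
        if (dx * dx + dy * dy <= lim2)
            nearby.push_back(i);
    }
    return nearby;
}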
You can transform your Cartesian coordinates into polar coordinates with CART2POL. Then selecting points inside a certain radius is straightforward:
[THETA,RHO] = cart2pol(X-X0,Y-Y0);
selected = RHO < 30;
where X0, Y0 are coordinates of the current location.
My guess is that the trees are distributed roughly evenly through the forest. If that is the case, simply use 30x30 (or 15x15) grid blocks as hash keys into a closed hash table. Look up the keys for all blocks intersecting the search circle, and check all hash entries starting at that key until one is flagged as the last in its "bucket."
0---------10---------20---------30--------40---------50----- address # line
(0,0) (0,30) (0,60) (30,0) (30,30) (30,60) hash key values
(1,3) (10,15) (3,46) (24,9.) (23,65.) (15,55.) tree coordinates + "." flag
For example, to get the trees in (0,0)…(30,30), map (0,0) to the address 0 and read entries (1,3), (10,15), reject (3,46) because it's out of bounds, read (24,9), and stop because it's flagged as the last tree in that sector.
To get trees in (0,60)…(30,90), map (0,60) to address 20. Skip (24, 9), read (23, 65), and stop as it's last.
This will be quite memory efficient as it avoids storing pointers, which would otherwise be of considerable size relative to the actual data. Nevertheless, closed hashing requires leaving some empty space.
The illustration isn't "to scale" as in reality there would be space for several entries between the hash key markers. So you shouldn't have to skip any entries unless there are more trees than average in a local preceding sector.
This does use hash collisions to your advantage, so it's not as random as a hash function typically is. (Not every entry corresponds to a distinct hash value.) However, as dense sections of forest will often be adjacent, you should randomize the mapping of sectors to "buckets," so a given dense sector will hopefully overflow into a less dense one, or the next, or the next.
Additionally, there is the issue of empty sectors and terminating iteration. You could insert a dummy tree into each sector to mark it as empty, or some other simple hack.
Sorry for the long explanation. This kind of thing is simpler to implement than to document. But the performance and the footprint can be excellent.
Use some sort of spatially partitioned data structure. A simple solution would be to create a 2D array of lists, each containing all objects within a 30 m x 30 m region. Worst case, you then only need to compare against the objects in the few lists (at most a 3x3 block of cells) that the 30 m circle can overlap.
Plenty of more complex (and potentially better) solutions could also be used - something like bi-trees is a bit more complex to implement (not by much, though), but could give better performance (especially where the density of objects varies considerably).
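To make that concrete, here is a rough C++ sketch of such a grid of lists (a hash map keyed by cell coordinates rather than a literal 2D array; GridIndex, TreePt and the key packing are made up for illustration):

#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct TreePt { double x, y; int id; };

class GridIndex {
public:
    explicit GridIndex(double cellSize) : cell_(cellSize) {}

    void insert(const TreePt& t) { cells_[key(t.x, t.y)].push_back(t); }

    // Return all trees within `radius` of (qx, qy) by scanning the 3x3 block
    // of cells around the query; that is sufficient while radius <= cell size.
    std::vector<TreePt> query(double qx, double qy, double radius) const {
        std::vector<TreePt> out;
        const double r2 = radius * radius;
        const long cx = coord(qx), cy = coord(qy);
        for (long gx = cx - 1; gx <= cx + 1; ++gx)
            for (long gy = cy - 1; gy <= cy + 1; ++gy) {
                auto it = cells_.find(pack(gx, gy));
                if (it == cells_.end()) continue;
                for (const TreePt& t : it->second) {
                    const double dx = t.x - qx, dy = t.y - qy;
                    if (dx * dx + dy * dy <= r2) out.push_back(t);
                }
            }
        return out;
    }

private:
    long coord(double v) const { return static_cast<long>(std::floor(v / cell_)); }
    static long long pack(long gx, long gy) {
        return (static_cast<long long>(gx) << 32) ^ (gy & 0xffffffffLL);
    }
    long long key(double x, double y) const { return pack(coord(x), coord(y)); }

    double cell_;
    std::unordered_map<long long, std::vector<TreePt>> cells_;
};

Usage would be along the lines of GridIndex grid(30.0); insert every tree once; then grid.query(robotX, robotY, 30.0) touches only the handful of cells the circle can overlap.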
You could look at the Voronoi diagram support in Matlab:
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/voronoi.html
If you base the Voronoi polygons on your key trees and cluster the neighbouring trees into those polygons, that would partition your search space by proximity (finding the enclosing polygon for a given non-key point is fast), but ultimately you're going to get down to computing key-to-non-key distances by Pythagoras or trig and comparing them.
For a few thousand points (trees), brute force might be fast enough if you have a reasonable processor on board. Compute the distance of every other tree from tree n, then select those within 30 m. This is the same as having all the trees in the same Voronoi polygon.
It's been a few years since I worked in GIS, but I found the following useful: Computational Geometry in C, Joseph O'Rourke, ISBN 0-521-44592-2 (paperback).