Python Array is very large, and processing runs out of memory - python-2.7

I am trying to create a distance matrix to run the DBSCAN algorithm for clustering purposes. The final distance matrix has 174,000 × 174,000 entries, all floating-point numbers between 0 and 1. I have the individual lists (all 174,000 of them) saved with the numbers stored as ints, but when I try to consolidate them into an array, I keep running out of memory.
Is there a way to compress the data that can deal with such a large data set? (I have tried HDF5, but that also seems to struggle.)
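For scale, a dense 174,000 × 174,000 float64 matrix needs roughly 242 GB, which is why any in-memory approach fails. A rough Python sketch of sizing the matrix and building it on disk in row blocks with a memory-mapped file (the file name, block size, and the small demo dimensions are illustrative, not from the question):

```python
import numpy as np

n = 174_000
print("float64: %.0f GB" % (n * n * 8 / 1e9))   # ~242 GB
print("float32: %.0f GB" % (n * n * 4 / 1e9))   # ~121 GB

# Even float32 is far beyond typical RAM, so build the matrix on disk in
# row blocks via a memory-mapped file (small sizes here so the demo runs):
n_small, block = 1000, 200
pts = np.random.rand(n_small, 2).astype(np.float32)
dist = np.memmap("dist.dat", dtype=np.float32, mode="w+",
                 shape=(n_small, n_small))
for start in range(0, n_small, block):
    rows = pts[start:start + block]                # (block, 2)
    diff = rows[:, None, :] - pts[None, :, :]      # (block, n_small, 2)
    dist[start:start + block] = np.sqrt((diff ** 2).sum(axis=-1))
dist.flush()
```

Note that many DBSCAN implementations (e.g. scikit-learn with a spatial index) can work from the raw points and a neighborhood radius, avoiding the full pairwise matrix entirely.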

Related

Memory saving system for molecule calculations

I am currently working on a MD simulation. It stores the molecule positions in a vector. For each time step, that vector is stored for display in a second vector, resulting in
std::vector<std::vector<molecule> > data;
The size of data is time_steps * num_molecules * sizeof(molecule), where sizeof(molecule) is (already reduced to) 3 * sizeof(double) for the position vector. Still, I run into memory problems for larger numbers of time steps and molecules.
Is there any further way to decrease the amount of data? My current workflow is to calculate all the molecules first, store them, and then render them using the data of each molecule at each step. The rendering is done with Irrlicht (maybe later with Blender).
If the trajectories are smooth, you can consider compressing the data by storing only every Nth step and restoring the intermediate positions by interpolation.
If the time step is small, linear interpolation may suffice; top quality is provided by cubic splines. However, computing the spline coefficients is a global operation that you can only perform at the end, and it requires extra storage (!), so you might prefer cardinal splines, which can be built locally from four consecutive positions.
You could gain a factor of 2 improvement by storing the positions in single precision rather than double - it will be sufficient for rendering, if not for the simulation.
But ultimately you will need to store the results in a file and render offline.
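A sketch of the every-Nth-step idea above, using a Catmull-Rom (cardinal) spline rebuilt locally from four consecutive stored positions (Python; the helix trajectory and N = 8 are illustrative assumptions):

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """Cardinal (Catmull-Rom) spline: interpolate between p1 and p2 at
    parameter t in [0, 1], using only four consecutive positions."""
    t2, t3 = t * t, t * t * t
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t3)

# Keep only every Nth frame of a smooth trajectory (a helix here) ...
N = 8
s = np.linspace(0, 4 * np.pi, 257)
full = np.stack([np.cos(s), np.sin(s), s], axis=1)   # (257, 3) positions
kept = full[::N]                                     # stored keyframes

# ... then reconstruct an intermediate frame halfway between keyframes
# i and i+1 from the four surrounding keyframes:
i = 10
approx = catmull_rom(kept[i - 1], kept[i], kept[i + 1], kept[i + 2], 0.5)
exact = full[i * N + N // 2]
```

For rendering, the reconstruction can be done per frame on the fly, so only the keyframes ever need to be held in memory.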

Visualization of large 2d-grid data with very large height value

I have large scientific data z = f(i,j), where i and j are integers and z typically ranges from, e.g., 0 to 1e20, and it cannot all fit in memory. I'd like to visualize this with a heat map (on a log scale). The question is whether there's a framework that can both manage the data structure and visualize it, like OpenCV (though I don't think it can handle arbitrarily large z values; I know very little about OpenCV).
If I were to implement it myself, I'd split the z height values into tiles, then build smaller versions by averaging them into progressively coarser tiles. These could then be used to pan and zoom interactively. Some compression might also be needed to reduce disk usage.
Any good libraries to do this? C++ is preferred, and GPL would not work for me. Thanks in advance. I just noticed STXXL and HDF5 for the data structures; would these help?
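The tile-averaging idea from the question can be sketched as a simple overview pyramid (Python; the array size and the log10 scaling for the huge z range are illustrative choices):

```python
import numpy as np

def build_pyramid(z, levels):
    """Build coarser overview tiles by repeatedly averaging 2x2 blocks."""
    pyramid = [z]
    for _ in range(levels):
        h, w = pyramid[-1].shape
        cur = pyramid[-1][:h - h % 2, :w - w % 2]   # trim to even size
        coarser = 0.25 * (cur[0::2, 0::2] + cur[1::2, 0::2]
                          + cur[0::2, 1::2] + cur[1::2, 1::2])
        pyramid.append(coarser)
    return pyramid

# z can span 0..1e20, so store and display log10(z + 1) for the heat map.
z = np.random.rand(1024, 1024) * 1e20
levels = build_pyramid(np.log10(z + 1.0), 4)
```

For data that exceeds memory, the same averaging can be applied tile by tile (e.g. per HDF5 chunk) rather than to the whole array at once, and the viewer picks the pyramid level matching the current zoom.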

Scaling a Dijkstra's Algorithm implementation

I have a graph where each edge has a weight.
I have implemented Dijkstra's algorithm to find the shortest path from vertex A to B.
Weights for the graph are read from a key/value DB (Redis).
Each weights DB is around 2 GB.
There are 50 weight DBs (50 different files of about 2 GB each, which I have loaded into Redis).
To find the shortest path, the function FindPath(Start, End, DB_name) is used.
Dijkstra's reads the weight values from memory (Redis is an in-memory key/value store), but my RAM is only 6 GB, so it is not possible to hold all 50 × 2 GB DBs in memory at the same time.
Path requests can be random and concurrent.
What is the best way to store the weight DBs?
Is increasing the RAM the only option to speed up the program?
EDIT
Number of edges: 462,505
If speed is the concern, the main option is to increase RAM. You cannot achieve similar performance with a NoSQL DB (e.g. MongoDB). Another option would be to try to parallelize the algorithm on a multi-core system, but this is very tough, as the final solution is global.
[EDIT]
The fastest way to store the weights is a contiguous array of weights indexed by edge number, one array per DB. If all the arrays cannot fit in your RAM, you can design a basic caching mechanism, swapping DBs between file and array (hoping that not all DBs are accessed simultaneously).
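A minimal sketch of that caching mechanism (Python; the loader callback, DB names, and capacity are illustrative assumptions, not part of the original setup):

```python
import numpy as np
from collections import OrderedDict

class WeightCache:
    """Keep at most `capacity` weight arrays resident, evicting the least
    recently used.  `loader(db_name)` must return the contiguous array of
    weights indexed by edge number (e.g. np.load of a per-DB .npy file)."""
    def __init__(self, loader, capacity=3):
        self.loader = loader
        self.capacity = capacity
        self.cache = OrderedDict()

    def weight(self, db_name, edge_id):
        if db_name in self.cache:
            self.cache.move_to_end(db_name)          # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)       # evict the LRU array
            self.cache[db_name] = self.loader(db_name)
        return self.cache[db_name][edge_id]

# Demo with fake in-memory "DBs"; a real loader would np.load or np.memmap
# one 2 GB weight file per DB.
dbs = {"db%d" % k: np.arange(462_505, dtype=np.float32) * k for k in range(5)}
wc = WeightCache(lambda name: dbs[name], capacity=2)
```

With np.memmap as the loader, the OS pages in only the touched parts of each weight file, which softens the cost of an eviction.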

Square/cubic root lookup table

I'm wondering about the best way to create two lookup tables, for the square root and cube root of float values in the range [0.0, 1.0).
I have already profiled the code and saw that this is a significant performance bottleneck (I need to compute roots for several tens of thousands of values each time). Then I remembered lookup tables and thought they would help me increase performance.
Since my values are in a small range, I was thinking of splitting the range into steps of, say, 0.0025 (hoping that's precise enough), but I'm unsure of the most efficient way to retrieve the values.
I can easily populate the lookup table, but I need a way to efficiently get the correct entry for a given float (which does not fall exactly on any step). Any suggestions or well-known approaches to this problem?
I'm working with a mobile platform, just to specify.
Thanks in advance
You have (1.0 - 0.0) / 0.0025 = 400 steps.
Just create a 400-entry table and index it by multiplying the float whose square/cube root you want by 400.
For instance, to look up the root of 0.0075, multiply 0.0075 by 400 to get 3, which is your index into the table:
double table_sqrt(double v)
{
    /* Map v in [0.0, 1.0) to an index in [0, 399]; clamp to guard
       against rounding at the top of the range. */
    unsigned int i = (unsigned int)(v / 0.0025);
    return table[i < 400 ? i : 399];
}
You could multiply the values by whatever precision that you want, and then use a hash-table since the results would be integral values.
For instance, rather than using a floating point key-value for something like 0.002, give yourself a precision of three or four decimal places, making your key value for 0.002 equal to 200 or 2000. Then you can quickly look-up the resulting floating point value for the square and cubic root stored in the hash-table key for the 2000 slot.
If you also want values in the non-discrete ranges in between slots, you could use an array or tree rather than a hash table, so that you can generate in-between values by interpolating between the roots stored at two adjacent slots.
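A sketch of that interpolating variant (Python, using the 0.0025 step size proposed in the question):

```python
import math

STEP = 0.0025
N = 400                                   # (1.0 - 0.0) / 0.0025 steps
# One extra entry at 1.0 so interpolation works at the top of the range.
SQRT_TABLE = [math.sqrt(i * STEP) for i in range(N + 1)]

def table_sqrt(v):
    """Approximate sqrt for v in [0.0, 1.0) by linear interpolation
    between the two nearest table entries."""
    x = v / STEP
    i = min(int(x), N - 1)                # clamp against rounding at 1.0
    frac = x - i
    return SQRT_TABLE[i] * (1.0 - frac) + SQRT_TABLE[i + 1] * frac
```

A cube-root table works identically with `(i * STEP) ** (1.0 / 3.0)` as the entry formula. Note that both roots are steep near 0, so linear interpolation is least accurate there; a finer step near 0 (or interpolating in the squared/cubed domain) helps if that matters.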
If you only need to split into 10 different stripes, find the inputs which correspond to the thresholds between stripes, and use an unrolled binary search to test against those 9 values. Or is there additional computation required before the threshold test is done, so that the looked-up value isn't the final result?

Select all points in a matrix within 30m of another point

So if you look at my other posts, it's no surprise I'm building a robot that can collect data in a forest, and stick it on a map. We have algorithms that can detect tree centers and trunk diameters and can stick them on a cartesian XY plane.
We're planning to use certain 'key' trees as natural landmarks for localizing the robot, using triangulation and trilateration among other methods, but programming this and keeping data straight and efficient is getting difficult using just Matlab.
Is there a technique for sub-setting an array or matrix of points? Say I have 1000 trees stored over 1km (1000m), is there a way to say, select only points within 30m radius of my current location and work only with those?
I would just use a GIS, but I'm doing this in Matlab and I'm unaware of any GIS plugins for Matlab.
I forgot to mention, this code is going online, meaning it's going on a robot for real-time execution. I don't know if, as the map grows to several miles, using a different data structure will help or if calculating every distance to a random point is what a spatial database is going to do anyway.
I'm thinking of mirroring the array of trees into two arrays, one sorted by X and the other by Y, then searching each sorted array to find the points within the 30 m range. I'd do the same for both the X and Y arrays, and then use a third cross-link table to select the individual values. But I don't know what that's called or how to program it, and I'm sure someone already has, so I don't want to reinvent the wheel.
You are looking for a spatial data structure such as a quadtree or a k-d tree. I found two k-d tree implementations here and here, but didn't find any quadtree implementations for Matlab.
The simple solution of calculating all the distances and scanning through seems to run almost instantaneously:
lim = 1; %# search radius (30 for the real 30 m data)
num_trees = 1000;
trees = randn(num_trees,2); %# list of trees as Nx2 matrix
cur = randn(1,2); %# current point as 1x2 vector
dists = hypot(trees(:,1) - cur(1), trees(:,2) - cur(2)); %# distance from every tree to the current point
nearby = trees((dists <= lim),:); %# select the nearby trees from the original matrix
On a 1.2 GHz machine, I can process 1 million trees (1 MTree?) in < 0.4 seconds.
Are you running the Matlab code directly on the robot? Are you using Real-Time Workshop or something similar? If you need to translate this to C, you can replace hypot with the squared distance (trees[i].x - pos.x)*(trees[i].x - pos.x) + (trees[i].y - pos.y)*(trees[i].y - pos.y) and compare it against lim*lim, avoiding the square root. If you really only need to handle 1 KTree, I don't know that it's worth your while to implement a more complicated data structure.
You can transform your Cartesian coordinates into polar coordinates with CART2POL. Then selecting the points inside a certain radius is straightforward:
[THETA,RHO] = cart2pol(X-X0,Y-Y0);
selected = RHO < 30;
where X0, Y0 are coordinates of the current location.
My guess is that the trees are distributed roughly evenly through the forest. If that is the case, simply use 30x30 (or 15x15) grid blocks as hash keys into a closed hash table. Look up the keys for all blocks intersecting the search circle, and check all hash entries starting at each key until one is flagged as the last in its "bucket."
0---------10---------20---------30--------40---------50----- address # line
(0,0) (0,30) (0,60) (30,0) (30,30) (30,60) hash key values
(1,3) (10,15) (3,46) (24,9.) (23,65.) (15,55.) tree coordinates + "." flag
For example, to get the trees in (0,0)…(30,30), map (0,0) to the address 0 and read entries (1,3), (10,15), reject (3,46) because it's out of bounds, read (24,9), and stop because it's flagged as the last tree in that sector.
To get trees in (0,60)…(30,90), map (0,60) to address 20. Skip (24, 9), read (23, 65), and stop as it's last.
This will be quite memory efficient as it avoids storing pointers, which would otherwise be of considerable size relative to the actual data. Nevertheless, closed hashing requires leaving some empty space.
The illustration isn't "to scale" as in reality there would be space for several entries between the hash key markers. So you shouldn't have to skip any entries unless there are more trees than average in a local preceding sector.
This does use hash collisions to your advantage, so it's not as random as a hash function typically is. (Not every entry corresponds to a distinct hash value.) However, as dense sections of forest will often be adjacent, you should randomize the mapping of sectors to "buckets," so a given dense sector will hopefully overflow into a less dense one, or the next, or the next.
Additionally, there is the issue of empty sectors and terminating iteration. You could insert a dummy tree into each sector to mark it as empty, or some other simple hack.
Sorry for the long explanation. This kind of thing is simpler to implement than to document. But the performance and the footprint can be excellent.
Use some sort of spatially partitioned data structure. A simple solution is a 2D array of lists containing all objects within each grid cell; with a cell size of at least the search diameter (60 m x 60 m for a 30 m radius), the worst case is that you only need to compare against the objects in four of those lists.
Plenty of more complex (and potentially more beneficial) solutions could also be used. Tree structures such as quadtrees are a bit more complex to implement (though not by much) but can give better performance, especially in cases where the density of objects varies considerably.
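A minimal sketch of the grid-of-lists approach (Python; the 60 m cell size assumes a 30 m search radius, so the search circle always fits inside the 3x3 block of cells around the query point):

```python
from collections import defaultdict
from math import floor, hypot

CELL = 60.0   # cell edge >= search radius; 60 m cells for a 30 m radius

def build_grid(points, cell=CELL):
    """Hash each (x, y) point into the list for its grid cell."""
    grid = defaultdict(list)
    for (x, y) in points:
        grid[(floor(x / cell), floor(y / cell))].append((x, y))
    return grid

def within_radius(grid, x, y, r, cell=CELL):
    """Return all points within distance r of (x, y).  Because the cell
    edge is at least r, the 3x3 block of cells around the query covers
    the whole search circle."""
    cx, cy = floor(x / cell), floor(y / cell)
    found = []
    for gx in (cx - 1, cx, cx + 1):
        for gy in (cy - 1, cy, cy + 1):
            for (px, py) in grid.get((gx, gy), ()):
                if hypot(px - x, py - y) <= r:
                    found.append((px, py))
    return found
```

Insertion is O(1) per tree, so the grid can be updated incrementally as the robot maps new trees, and each query only touches the few cells near the current location regardless of total map size.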
You could look at the voronoi diagram support in matlab:
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/voronoi.html
If you base the Voronoi polygons on your key trees and cluster the neighbouring trees into those polygons, that partitions your search space by proximity (finding the enclosing polygon for a given non-key point is fast), but ultimately you're going to end up computing key-to-non-key distances by Pythagoras or trigonometry and comparing them.
For a few thousand points (trees), brute force might be fast enough if you have a reasonable processor on board. Compute the distance of every other tree from tree n, then select those within 30 m. This is the same as having all the trees in the same Voronoi polygon.
It's been a few years since I worked in GIS, but I found the following useful: 'Computational Geometry in C', Joseph O'Rourke, ISBN 0-521-44592-2 (paperback).