PCA for data compression

I am in a discussion on whether you can save disk space by doing PCA on your data. Suppose you have a covariance matrix and your data vectors are of length 1000. The compression method to cut space by 50% would be:
derive a matrix that rotates the covariance matrix into diagonal form such that the eigenvalues are along the diagonal.
drop the smallest 500 diagonal elements - replace by zero
rotate the result using the transpose of the original rotation.
Me: This doesn't save any space for the vectors, because there will still be nonzero elements in all 1000 components after rotating back. There is no compression. The data are probably simplified, but that is a different thing.
Him: just take the first 500 elements in the result - that is your "compression".
I know I am right but plenty of people say in the literature that they are doing compression with PCA - here is an example:
I think that this tutorial is mostly right and is a nice description, but the conclusion on compression is wrong. But how could something so obvious be overlooked by people who clearly work with data? It makes me think that I am wrong.
Can anyone help me understand their viewpoint?

In my opinion:
1- Yes, you can compress data with PCA, because the dimension of each vector you have to store is smaller than the original. Of course, you also have to store the matrix needed to decompress the data, but if your original dataset is large enough, this is insignificant compared to the data itself.
2- Of course there is a drawback. The compression is not lossless. You lose the original data forever, and your new version after decompression won't be exactly the same as the original. It will be an approximation.
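To put numbers on point 1, here is a minimal sketch of the storage count. It assumes that each compressed vector is stored as k coefficients, and that the k-by-d basis and a d-element mean vector are stored once for the whole dataset. The mean vector and the function names are assumptions made for illustration, not something stated above.

```cpp
#include <cassert>
#include <cstddef>

// Values stored for n vectors of dimension d without compression.
std::size_t raw_values(std::size_t n, std::size_t d) {
    return n * d;
}

// Values stored after keeping k principal components:
// n coefficient vectors of length k, one k-by-d basis, one d-element mean
// (the mean is an assumption: PCA is usually applied to centered data).
std::size_t pca_values(std::size_t n, std::size_t d, std::size_t k) {
    assert(k <= d);
    return n * k + k * d + d;
}

// Compressed size as a fraction of the raw size.
double pca_ratio(std::size_t n, std::size_t d, std::size_t k) {
    return static_cast<double>(pca_values(n, d, k)) /
           static_cast<double>(raw_values(n, d));
}
```

With d = 1000 and k = 500, a million vectors give a ratio of about 0.5005, so the basis is negligible. With only 500 vectors the ratio is above 1, because the basis alone is as large as the data.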
At this point here's my advice:
If you have a lot of data of the same form (vectors of the same dimension...), your interest in this data is qualitative (you don't care about the exact numbers, only approximate ones), and some of the data shows collinearity (dependency between vectors), PCA is a way to save storage space.
It is imperative to check how much of the variance of the original data you lose, because losing too much of it is the signal that you are choosing too much compression.
Anyway, the main purpose of PCA is not saving storage space... it is to do heavy operations with the data quicker to obtain a very similar result.
I hope this is helpful for you.


Representation of a symmetric diagonal matrix

Let's assume we have a huge symmetric diagonal matrix. What is an efficient way to implement this?
The only way that I could think of is to use the symmetry property, where Xij = Xji, to reduce the size of this matrix by half. But then representing this matrix using a 2D array would be inefficient, since we can't reduce the matrix size by using arrays.
Another thing: representing this matrix using an adjacency list would also be inefficient, because if we relate this matrix to a graph, it would be a dense graph, and adjacency list operations such as removing, inserting and searching take a lot of time.
But what about using heaps?
There is no one answer until you decide what you are going to do with this matrix (or maybe matrices?).
If you are just going to store and remember it, then just store it sequentially, leaving out the redundant entries. (Your code knows how to access it, because that is all it does, right?)
More probably, you want to do normal matrix operations on it. In that case, are you trying to make the storage efficient, or the execution? In the latter case, I don't see many opportunities based on it being symmetric: the multiplies are the expensive thing and you probably still need all of those. If it is the storage, then are you limiting yourself to operations that only take symmetric in and symmetric out? Sounds awfully specific. If so, then you only need to do the calculations for the part you are storing, because, by definition, the other entries are symmetric, so just write your code to generate that part of the matrix and you are done.
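As an illustration of "store it sequentially, leaving out the redundant entries", here is a minimal sketch that keeps only the lower triangle (including the diagonal) in one flat array. The class name and interface are hypothetical, and the element type double is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Symmetric n-by-n matrix that stores only the lower triangle, row by row.
class PackedSymmetric {
public:
    explicit PackedSymmetric(std::size_t n)
        : n_(n), data_(n * (n + 1) / 2, 0.0) {}

    // Read/write access; (i, j) and (j, i) refer to the same stored value.
    double& at(std::size_t i, std::size_t j) { return data_[index(i, j)]; }

    // Number of values actually stored.
    std::size_t stored() const { return data_.size(); }

private:
    std::size_t index(std::size_t i, std::size_t j) const {
        assert(i < n_ && j < n_);
        if (i < j) std::swap(i, j);   // fold the upper triangle onto the lower
        return i * (i + 1) / 2 + j;   // rows 0..i-1 hold 1 + 2 + ... + i values
    }

    std::size_t n_;
    std::vector<double> data_;
};
```

A 4-by-4 matrix then takes 10 values instead of 16, and writing at(1, 3) is visible through at(3, 1). If the matrix really is diagonal, a plain array of n values is enough.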

Partition large amount of 3D point data

I need to partition a large set of 3D points (using C++). The points are stored on the HDD as binary float array, and the files are usually larger than 10GB.
I need to divide the set into smaller subsets that have a size less than 1GB.
The points in the subset should still have the same neighborhood because I need to perform certain algorithms on the data (e.g., object detection).
I thought I could use a KD-Tree. But how can I construct the KD-Tree efficiently if I can't load all the points into RAM? Maybe I could map the file as virtual memory. Then I could save a pointer to each 3D point that belongs to a segment and store it in a node of the KD-Tree. Would that work? Any other ideas?
Thank you for your help. I hope you can understand the problem :D
You basically need an out-of-core algorithm for computing (approximate) medians. Given a large file, find its median and then partition it into two smaller files. A k-d tree is the result of applying this process recursively along varying dimensions (and when the smaller files start to fit in memory, you don't have to bother with the out-of-core algorithm any more).
To approximate the median of a large file, use reservoir sampling to grab a large but in-memory sample, then run an in-core median finding algorithm. Alternatively, for an exact median, compute the (e.g.) approximate 45th and 55th percentiles, then make another pass to extract the data points between them and compute the median exactly (unless the sample was unusually non-random, in which case retry). Details are in the Motwani--Raghavan book on randomized algorithms.
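A minimal sketch of the approximate-median step, assuming the coordinate values along the current split dimension are fed in one at a time while the file is being read. The class name, the seed parameter and the use of float are assumptions for illustration. The sampling is the classic reservoir scheme (Algorithm R), and the in-core median uses std::nth_element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Keeps a uniform random sample of at most `capacity` values from a stream
// of unknown length and reports the median of that sample.
class ReservoirMedian {
public:
    ReservoirMedian(std::size_t capacity, unsigned seed)
        : capacity_(capacity), rng_(seed) {
        assert(capacity > 0);
        sample_.reserve(capacity);
    }

    // Feed one value from the stream.
    void add(float value) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(value);          // reservoir not full yet
            return;
        }
        // Replace a random slot with probability capacity / seen.
        std::uniform_int_distribution<std::size_t> pick(0, seen_ - 1);
        std::size_t slot = pick(rng_);
        if (slot < capacity_) sample_[slot] = value;
    }

    // Median of the current sample (upper median for even sizes).
    float median() const {
        assert(!sample_.empty());
        std::vector<float> copy = sample_;
        auto mid = copy.begin() + copy.size() / 2;
        std::nth_element(copy.begin(), mid, copy.end());
        return *mid;
    }

private:
    std::size_t capacity_;
    std::size_t seen_ = 0;
    std::mt19937_64 rng_;
    std::vector<float> sample_;
};
```

One pass over the file fills the reservoir. A second pass then writes each point to the "below" or "above" file according to the returned value, and the process repeats on each half along the next dimension.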

Handle very large distance matrix in C (or C++ if it could help)

I am implementing this clustering algorithm http://www.sciencemag.org/content/344/6191/1492.full (free access version) in C in my software and I need to build a distance matrix, but in some cases, the size of the dataset (after redundancy removal) is huge (n > 1 500 000 and it is even larger, going up to 4 000 000 on more complex cases). My problem is, even allocating the upper triangular matrix would be ( (1500000*1500000) - 1500000) * 0.5 * sizeof(float) =~ 4.5e12 bytes. So, memory allocation fails (even on our computing nodes with 256 GB of RAM) and writing to disk is not an option in this case.
Besides cutting down the size of the dataset to cluster (which I will look into), does anybody have an idea of a technique I could use to approximate and store this amount of information?
N.B. Like I said in the title, I am using C and I can also use C++. Also, if anybody has another clustering algorithm (where the number of clusters is determined with the algorithm itself) to use, please suggest it to me.
Thanks in advance for your time,
You probably have to step back and reconsider your algorithm.
First, perhaps you don't need to have distance matrix between all pairs of data points. Perhaps you could group together similar data points into data bins and then create a matrix of distances between bins.
That is, start by computing pairwise distances between points, but keep only relatively small distances and pointers to "the other" point. Kind of a very sparse matrix of shorter distances. This is straightforward to do in parallel.
Then create data bins that contain groups of points with mutually small distances between them. For example, if you threshold "short" distances in such manner that bins would hold on average, say, 50 data points you'd get 1500000/50=30000 bins.
Then go through your data again and compute distances between bins. That would produce 30000^2 distances, which, stored as 4-byte floats, is a matrix of about 4GB (3.6e9 bytes). In addition you still have 30000 bins with 50^2 distances within each bin, which is another 300MB. This amount of data is quite manageable.
If replacing the distance between data points with a distance between the corresponding bins is sufficient precision for your application that would work. It all depends on the kind of data you are dealing with and the precision requirements of your application.
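The sizes involved can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes distances are stored as float with sizeof(float) equal to 4.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for the strict upper triangle of an n-by-n float distance matrix.
std::uint64_t upper_triangle_bytes(std::uint64_t n) {
    assert(n > 0);
    return n * (n - 1) / 2 * sizeof(float);
}

// Bytes for a full bins-by-bins matrix of bin-to-bin distances plus,
// for every bin, a per_bin-by-per_bin block of within-bin distances.
std::uint64_t binned_bytes(std::uint64_t bins, std::uint64_t per_bin) {
    assert(bins > 0 && per_bin > 0);
    return (bins * bins + bins * per_bin * per_bin) * sizeof(float);
}
```

For n = 1500000 the triangle needs roughly 4.5e12 bytes, while 30000 bins of 50 points need about 3.9e9 bytes in total, a reduction of about three orders of magnitude.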

Is there a data structure with these characteristics?

I'm looking for a data structure that would allow me to store an M-by-N 2D matrix of values contiguously in memory, such that the distance in memory between any two points approximates the Euclidean distance between those points in the matrix. That is, in a typical row-major representation as a one-dimensional array of M * N elements, the memory distance differs between adjacent cells in the same row (1) and adjacent cells in neighbouring rows (N).
I'd like a data structure that reduces or removes this difference. Really, the name of such a structure is sufficient—I can implement it myself. If answers happen to refer to libraries for this sort of thing, that's also acceptable, but they should be usable with C++.
I have an application that needs to perform fast image convolutions without hardware acceleration, and though I'm aware of the usual optimisation techniques for this sort of thing, I feel a specialised data structure or data ordering could improve performance.
Given the requirement that you want to store the values contiguously in memory, I'd strongly suggest you research space-filling curves, especially Hilbert curves.
To give a bit of context, such curves are sometimes used in database indexes to improve the locality of multidimensional range queries (e.g., "find all items with x/y coordinates in this rectangle"), thereby aiming to reduce the number of distinct pages accessed. A bit similar to the R-trees that have been suggested here already.
Either way, it looks like you're bound to an M*N array of values in memory, so the whole question is about how to arrange the values in that array, I figure. (Unless I misunderstood the question.)
So in fact, such orderings would probably still only change the characteristics of the distance distribution. The average distance for any two randomly chosen points from the matrix should not change, so I have to agree with Oli there. Potential benefit depends largely on your specific use case, I suppose.
I would guess "no"! And if the answer happens to be "yes", then it's almost certainly so irregular that it'll be way slower for a convolution-type operation.
To qualify my guess, take an example. Let's say we store a[0][0] first. We want a[k][0] and a[0][k] to be similar distances, and proportional to k, so we might choose to interleave the storage of first row and first column (i.e. a[0][0], a[1][0], a[0][1], a[2][0], a[0][2], etc.) But how do we now do the same for e.g. a[1][0]? All the locations near it in memory are now taken up by stuff that's near a[0][0].
Whilst there are other possibilities than my example, I'd wager that you always end up with this kind of problem.
If your data is sparse, then there may be scope to do something clever (re Cubbi's suggestion of R-trees). However, it'll still require irregular access and pointer chasing, so will be significantly slower than straightforward convolution for any given number of points.
You might look at space-filling curves, in particular the Z-order curve, which (mostly) preserves spatial locality. It might be computationally expensive to look up indices, however.
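For reference, a Z-order index can be computed by interleaving the bits of the row and column. The sketch below assumes coordinates that fit in 16 bits. The function names are hypothetical, and the bit-spreading trick is one common way to do the interleaving, not the only one.

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 16 bits of v so that a zero bit separates each of them.
std::uint32_t spread_bits(std::uint32_t v) {
    assert(v <= 0xFFFFu);
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order (Morton) index: column bits land in even positions,
// row bits in odd positions.
std::uint32_t morton_index(std::uint32_t row, std::uint32_t col) {
    return (spread_bits(row) << 1) | spread_bits(col);
}
```

The four cells of the top-left 2x2 block map to indices 0 to 3, so they sit next to each other in memory. For a matrix whose sides are not equal powers of two, the backing array has to be padded, which is part of the lookup cost mentioned above.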
If you are using this to try and improve cache performance, you might try a technique called "bricking", which is a little bit like one or two levels of the space filling curve. Essentially, you subdivide your matrix into nxn tiles, (where nxn fits neatly in your L1 cache). You can also store another level of tiles to fit into a higher level cache. The advantage this has over a space-filling curve is that indices can be fairly quick to compute. One reference is included in the paper here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=
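A single level of bricking can be sketched as follows, assuming square tiles and a column count that is a multiple of the tile size. The function name and parameters are hypothetical; tiles are stored one after another, each in row-major order.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (row, col) in a tiled layout with tile-by-tile bricks.
// `cols` is the matrix width in elements and must be a multiple of `tile`.
std::size_t tiled_index(std::size_t row, std::size_t col,
                        std::size_t cols, std::size_t tile) {
    assert(tile > 0 && cols % tile == 0 && col < cols);
    std::size_t tiles_per_row = cols / tile;
    std::size_t tile_id = (row / tile) * tiles_per_row + (col / tile);
    std::size_t within = (row % tile) * tile + (col % tile);
    return tile_id * tile * tile + within;
}
```

With a power-of-two tile size the divisions and remainders reduce to shifts and masks, which is why the index stays cheap compared with a full space-filling curve.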
This sounds like something that could be helped by an R-tree or one of its variants. There is nothing like that in the C++ Standard Library, but it looks like there is an R-tree in the boost candidate library Boost.Geometry (not a part of boost yet). I'd take a look at that before writing my own.
It is not possible to "linearize" a 2D structure into an 1D structure and keep the relation of proximity unchanged in both directions. This is one of the fundamental topological properties of the world.
Having said that, it is true that the standard row-wise or column-wise storage order normally used for 2D array representation is not the best one when you need to preserve the proximity (as much as possible). You can get better results by using various discrete approximations of fractal curves (space-filling curves).
Z-order curve is a popular one for this application: http://en.wikipedia.org/wiki/Z-order_(curve)
Keep in mind though that regardless of which approach you use, there will always be elements that violate your distance requirement.
You could think of your 2D matrix as a big spiral, starting at the center and progressing to the outside. Unwind the spiral, and store the data in that order, and distance between addresses at least vaguely approximates Euclidean distance between the points they represent. While it won't be very exact, I'm pretty sure you can't do a whole lot better either. At the same time, I think even at very best, it's going to be of minimal help to your convolution code.
The answer is no. Think about it - memory is 1D. Your matrix is 2D. You want to squash that extra dimension in - with no loss? It's not going to happen.
What's more important is that once you get a certain distance away, it takes the same time to load into cache. If you have a cache miss, it doesn't matter if it's 100 away or 100000. Fundamentally, you cannot get more contiguous/better performance than a simple array, unless you want to get an LRU for your array.
I think you're forgetting that distance in computer memory is not accessed by a computer cpu operating on foot :) so the distance is pretty much irrelevant.
It's random access memory, so really you have to figure out what operations you need to do, and optimize the accesses for that.
You need to reconvert the addresses from memory space to the original array space to accomplish this. Also, you've stressed distance only, which may still cause you some problems (no direction)
If I have an array of R x C, and two cells at locations [r,c] and [c,r], the distance from some arbitrary point, say [0,0] is identical. And there's no way you're going to make one memory address hold two things, unless you've got one of those fancy new qubit machines.
However, you can take into account that in a row-major array of R x C, each row is C * sizeof(yourdata) bytes long. Conversely, if you treat the address as an element offset from the start of the array (the byte offset divided by sizeof(yourdata)), the original coordinates of any address within the bounds of the array are
r = (address / C)
c = (address % C)
r1 = (address1 / C)
r2 = (address2 / C)
c1 = (address1 % C)
c2 = (address2 % C)
dx = r1 - r2
dy = c1 - c2
dist = sqrt(dx^2 + dy^2)
(this is assuming you're using zero based arrays)
(crush all this together to make it run more optimally)
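Written out as C++, the conversion above might look like the sketch below. It assumes the two "addresses" are zero-based element offsets into the row-major array rather than byte addresses, and the function name is made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Euclidean distance between two cells, given their element offsets
// in a row-major array with C columns.
double cell_distance(std::size_t offset1, std::size_t offset2, std::size_t C) {
    assert(C > 0);
    // Recover (row, column) from each offset, then take the differences.
    double dr = static_cast<double>(offset1 / C) - static_cast<double>(offset2 / C);
    double dc = static_cast<double>(offset1 % C) - static_cast<double>(offset2 % C);
    return std::hypot(dr, dc);
}
```

With C = 10, offsets 9 and 10 are adjacent in memory but are the cells (0, 9) and (1, 0), which are about 9.06 apart in the matrix. That is exactly the mismatch the question is about.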
For a lot more ideas here, go look for any 2D image manipulation code that uses a calculated value called 'stride', which is basically an indicator that they're jumping back and forth between memory addresses and array addresses
This is not exactly related to closeness but might help. It certainly helps for minimization of disk accesses.
One way to get better "closeness" is to tile the image. If your convolution kernel is smaller than the size of a tile, you typically touch at most 4 tiles at worst. You can recursively tile in bigger sections so that localization improves. A Stokes-like (at least I think it's Stokes) argument (or some calculus of variations) can show that for rectangles the best shape (meaning for examination of arbitrary sub-rectangles) is a smaller rectangle of the same aspect ratio.
Quick intuition: think about a square. If you tile the larger square with smaller squares, the fact that a square encloses maximal area for a given perimeter means that square tiles have minimal border length. When you transform the large square, I think you can show you should transform the tile the same way. (You might also be able to do a simple multivariate differentiation.)
The classic example is zooming in on spy satellite images and convolving them for enhancement. The extra computation to tile is really worth it if you keep the data around and you go back to it.
It's also really worth it for the different compression schemes such as cosine transforms. (That's why when you download an image it frequently comes up as it does, in smaller and smaller squares until the final resolution is reached.)
There are a lot of books on this area and they are helpful.

std::vector<std::vector<type>> for sparse matrix structure or something else?

I am implementing a sparse matrix class in compressed row format. This means I have a fixed number of rows and each row consists of a number of elements (this number can be different for different rows but will not change after the matrix has been initialised).
Is it suitable to implement this via vectors of vectors or will this somehow fragment the memory?
How would I implement the allocation of this so I will have one big chunk of memory?
Thanks for sharing your wisdom!
You can use existing libraries, like boost sparse matrix (http://www.boost.org/doc/libs/1_43_0/libs/numeric/ublas/doc/matrix_sparse.htm)
The general rule with sparse matrices is that you pick a structure that best fits the location of non-zeros in matrix; so maybe you could write a bit more about the matrix structure, and also what (sort of) algorithms will be invoked on it.
About the memory: if this matrix is not too big, it is better to allocate it as one big chunk and do some index magic; this may result in better cache use.
If it is huge, then the chunked version should be used since it will better fit in memory.
You won't get 'one big chunk' with vector-of-vectors. Each row will be a contiguous chunk, but each row's data could be located at a different place on the heap.
Keep in mind that this might be a good thing if you're going to be dealing with huge matrices. The larger the matrix, the less likely you'll be able to acquire a contiguous chunk to fit the whole thing.
I've implemented the compressed column format with 3 std::vector<int> (2 for row & column indices and one for the value). Why do you need std::vector<std::vector<int>>?
Are you sure you're implementing the right format? A description of the compressed column format (and code for matrix-vector multiplication) can be found here.
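For comparison, here is a minimal sketch of the three-array idea applied to the row-oriented format from the question. The struct and function names are hypothetical, and double values are an assumption. All nonzeros live in one contiguous array, and row_start[r] marks where row r begins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row matrix held in three flat arrays.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_start;  // rows + 1 entries; last one = number of nonzeros
    std::vector<std::size_t> col_index;  // column of each stored value
    std::vector<double> value;           // stored values, row by row
};

// y = A * x
std::vector<double> multiply(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_start.size() == a.rows + 1);
    assert(a.col_index.size() == a.value.size());
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        // The nonzeros of row r occupy [row_start[r], row_start[r + 1]).
        for (std::size_t k = a.row_start[r]; k < a.row_start[r + 1]; ++k) {
            y[r] += a.value[k] * x[a.col_index[k]];
        }
    }
    return y;
}
```

Because the number of elements per row does not change after initialisation, the three vectors can be sized once and never reallocated, which gives the "one big chunk" per array that a vector of vectors cannot.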