How to create n-dimensional test data for cluster analysis? - c++

I'm working on a C++ implementation of k-means and therefore need n-dimensional test data. To begin with, 2D points are sufficient, since they can be visualized easily in a 2D image, but I'd ultimately prefer a general approach that supports n dimensions.
There was an answer here on stackoverflow, which proposed concatenating sequential vectors of random numbers with different offsets and spreads, but I'm not sure how to create those, especially without including a 3rd party library.
Below is the method declaration I have so far; it contains the parameters which should vary. They can be changed if necessary - with the exception of data, which needs to be a pointer type since I'm using OpenCL.
auto populateTestData(float** data, uint8_t dimension, uint8_t clusters, uint32_t elements) -> void;
Another problem that came to my mind was the efficient detection/avoidance of collisions when generating random numbers. Couldn't that be a performance bottleneck, e.g. if one's generating 100k numbers in a domain of 1M values, i.e. if the ratio of generated numbers to the size of the number space isn't small enough?
QUESTION
How can I efficiently create n-dimensional test data for cluster analysis? What are the concepts I need to follow?

It's possible to use the C++11 (or Boost) random facilities to create clusters, but it's a bit of work.
1. std::normal_distribution can generate univariate normal distributions with zero mean.
2. Using 1., you can sample a normal vector (just create an n-dimensional vector of such samples).
3. If you take a vector x from 2. and output A x + b, then you've moved the center to b and reshaped the distribution by A. (In particular, for 2 and 3 dimensions it's easy to build A as a rotation matrix.) So, repeatedly sampling from 2. and performing this transformation gives you samples centered at b.
4. Choose k pairs of A, b, and generate your k clusters.
Notes
You can generate different clustering scenarios using different types of A matrices. E.g., if A is a non-length-preserving matrix multiplied by a rotation matrix, then you can get "ellipsoid" clusters (it's actually interesting to make them wider along the vectors connecting the centers).
You can either hard-code the "center" vectors b, or draw them from a distribution as well, like the one used for the x vectors above (though a uniform distribution is perhaps more appropriate for this).

Related

Representation of a symmetric diagonal matrix

Let's assume we have a huge symmetric matrix. What is an efficient way to implement this?
The only way I could think of is to use the symmetry property, where Xij = Xji, to cut the storage roughly in half. But then representing this matrix using a 2D array would be inefficient, since we can't store only half the matrix in a rectangular array.
Representing this matrix using an adjacency list would also be inefficient: interpreting the matrix as a graph yields a dense graph, and adjacency-list operations such as removing, inserting and searching take lots of time.
But what about using heaps?
There is no one answer until you decide what you are going to do with this matrix (or maybe matrices?).
If you are just going to store and remember it, then just store it sequentially, leaving out the redundant entries. (Your code knows how to access it, because that is all it does, right?)
More probably, you want to do normal matrix operations on it. In that case, are you trying to make the storage efficient, or the execution? In the latter case, I don't see many opportunities arising from symmetry: the multiplies are the expensive part, and you probably still need all of those. If it is the storage, then are you limiting yourself to operations that only take symmetric matrices in and produce symmetric matrices out? That sounds awfully specific. If so, then you only need to do the calculations for the part you are storing, because by definition the other entries mirror it; just write your code to generate that triangle of the matrix and you are done.

The art of interpolation over a subset of multiple-dimensions (c++)

I have been looking for an answer to this for quite a while, but I am not able to find one.
The problem:
I have an n-dimensional (e.g. n = 9) function which is extremely computationally burdensome to evaluate, but for which I need a huge number of evaluations. I want to use interpolation for this case.
However, k < n dimensions (e.g. k = 7) are discrete (mostly binary) and therefore I do not need to interpolate over these, which leaves me with m dimensions (e.g. m = 2) over which I want to interpolate. I am mostly interested in basic linear interpolation, similar to http://rncarpio.github.io/linterp/.
The question:
(Option A) Should I invoke d1 x d2 x ... x dk interpolation functions (e.g. 2^7= 128) which then only interpolate over the two dimensions I need, but I need to look for the right interpolation function every time I need a value, ...
... (Option B) or should I invoke one interpolation function which could possibly interpolate between all dimensions, but which I will then only use to interpolate across the two dimensions I need (for all others I fully provide the grid with function values)?
I think it is important to emphasize that I am really interested in linear interpolation and that the answer will most likely differ in other cases. Furthermore, in the application I want to use this, I need not 128 functions but rather over 10,000 functions.
Additionally, should option A be the answer, how should I store the interpolation functions in C++? Should I use a map with a tuple as a key (drawing on the boost library), a multidimensional array (again drawing on the boost library), or is there an easier way?
I'd likely choose Option A, but not maps. If you have binary data, just use an array of size 2 (this is one of the rare cases where using a plain array is right); if you have a small domain, consider two parallel vectors, one for keys and one for values, because searching a vector can be made extremely efficient, at least on x86/x64 architectures. Be sure to hide this implementation detail behind an accessor function (i.e., const value& T::lookup(const key&)). I'd vote against using a tuple as a map key, as it makes the lookup both slower and more complicated. If you need to optimize to the extreme and your domains are so small that the product of their cardinalities fits within 64 bits, you might just manually create an index key (like (1<<3) * key4bits + (1<<2) * keyBinary + key2bits) instead of using a map (or two vectors).

Armadillo C++:- Efficient access of columns in a cube structure

Using the Armadillo matrix library, I am aware that the efficient way of accessing a column in a 2D matrix is via a simple call to .col(i).
I am wondering is there an efficient way of extracting a column stored in a "cube", without first having to call the slice command?
I need the most efficient possible way of accessing the data stored in, for instance (using Matlab notation), A(:,i,j). I will be doing this millions of times on a very large dataset, so speed and efficiency are of high priority.
I think you want
B = A.subcube( span::all, span(i), span(j) );
or equivalently
B = A.subcube( span(), span(i), span(j) );
where B will be a row or column vector of the same type as A (e.g. containing double by default, or a number of other available types).
.slice() should be pretty quick. It simply provides a reference to the underlying Mat class. You could try something along these lines:
cube C(4,3,2);
double* mem = C.slice(1).colptr(2);
Also, bear in mind that Armadillo has range checks enabled by default. If you want to avoid the range checks, use the .at() element accessors:
cube C(4,3,2);
C.at(3,2,1) = 456;
Alternatively, you can store your matrices in a field class:
field<mat> F(100);
F(0).ones(12,34);
Corresponding element access:
F(0)(1,2); // with range checks
F.at(0).at(1,2); // without range checks
You can also compile your code with ARMA_NO_DEBUG defined, which will remove all run-time debugging (such as range checks). This will give you a speedup, but it is only recommended once you have debugged all your code (i.e. verified that your algorithm is working correctly). The debugging checks are very useful for picking up mistakes.

Efficient partial reductions given arrays of elements, offsets to and lengths of sublists

For my application I have to handle a bunch of objects (let's say ints) that get subsequently divided and sorted into smaller buckets. To this end, I store the elements in a single contiguous array
arr = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...}
and the information about the buckets (sublists) is given by offsets to the first element in the respective bucket and the lengths of the sublist.
So, for instance, given
offsets = {0,3,8,..}
sublist_lengths = {3,5,2,...}
would result in the following splits:
0 1 2 || 3 4 5 6 7 || 8 9 || ...
What I am looking for is a somewhat general and efficient way to run algorithms, like reductions, on the buckets only using either custom kernels or the thrust library. Summing the buckets should give:
3 || 25 || 17 || ...
What I've come up with:
option 1: custom kernels require quite a bit of tinkering: copies into shared memory, proper choice of block and grid sizes, and their own implementation of the algorithms, like scan, reduce, etc. Also, every single operation would require its own custom kernel. In general it is clear to me how to do this, but after having used thrust for the last couple of days I have the impression that there might be a smarter way.
option 2: generate an array of keys from the offsets ({0,0,0,1,1,1,1,1,2,2,3,...} in the above example) and use thrust::reduce_by_key. I don't like the extra list generation, though.
option 3: Use thrust::transform_iterator together with thrust::counting_iterator to generate the above key list on the fly. Unfortunately, I can't come up with an implementation that doesn't require incrementing an index into the offset list on the device, which defeats parallelism.
What would be the most sane way to implement this?
Within Thrust, I can't think of a better solution than Option 2. The performance will not be terrible, but it's certainly not optimal.
Your data structure bears similarity to the Compressed Sparse Row (CSR) format for storing sparse matrices, so you could use techniques developed for computing sparse matrix-vector multiplies (SpMV) for such matrices if you want better performance. Note that the "offsets" array of the CSR format has length (N+1) for a matrix with N rows (i.e. buckets in your case) where the last offset value is the length of arr. The CSR SpMV code in Cusp is a bit convoluted, but it serves as a good starting point for your kernel. Simply remove any reference to Aj or x from the code and pass offsets and arr into the Ap and Av arguments respectively.
You didn't mention how big the buckets are. If the buckets are big enough, maybe you can get away with copying the offsets and sublist_lengths to the host, iterating over them and doing a separate Thrust call for each bucket. Fermi can have 16 kernels in flight at the same time, so on that architecture you might be able to handle smaller buckets and still get good utilization.

Is there a data structure with these characteristics?

I'm looking for a data structure that would allow me to store an M-by-N 2D matrix of values contiguously in memory, such that the distance in memory between any two points approximates the Euclidean distance between those points in the matrix. That is, in a typical row-major representation as a one-dimensional array of M * N elements, the memory distance differs between adjacent cells in the same row (1) and adjacent cells in neighbouring rows (N).
I'd like a data structure that reduces or removes this difference. Really, the name of such a structure is sufficient—I can implement it myself. If answers happen to refer to libraries for this sort of thing, that's also acceptable, but they should be usable with C++.
I have an application that needs to perform fast image convolutions without hardware acceleration, and though I'm aware of the usual optimisation techniques for this sort of thing, I feel a specialised data structure or data ordering could improve performance.
Given the requirement that you want to store the values contiguously in memory, I'd strongly suggest you research space-filling curves, especially Hilbert curves.
To give a bit of context, such curves are sometimes used in database indexes to improve the locality of multidimensional range queries (e.g., "find all items with x/y coordinates in this rectangle"), thereby aiming to reduce the number of distinct pages accessed. A bit similar to the R-trees that have been suggested here already.
Either way, it looks like you're bound to an M*N array of values in memory, so the whole question is about how to arrange the values in that array, I figure. (Unless I misunderstood the question.)
In fact, such orderings would probably still only change the shape of the distance distribution: the average distance between two randomly chosen points of the matrix should not change, so I have to agree with Oli there. The potential benefit depends largely on your specific use case, I suppose.
I would guess "no"! And if the answer happens to be "yes", then it's almost certainly so irregular that it'll be way slower for a convolution-type operation.
EDIT
To qualify my guess, take an example. Let's say we store a[0][0] first. We want a[k][0] and a[0][k] to be similar distances, and proportional to k, so we might choose to interleave the storage of first row and first column (i.e. a[0][0], a[1][0], a[0][1], a[2][0], a[0][2], etc.) But how do we now do the same for e.g. a[1][0]? All the locations near it in memory are now taken up by stuff that's near a[0][0].
Whilst there are other possibilities than my example, I'd wager that you always end up with this kind of problem.
EDIT
If your data is sparse, then there may be scope to do something clever (re Cubbi's suggestion of R-trees). However, it'll still require irregular access and pointer chasing, so will be significantly slower than straightforward convolution for any given number of points.
You might look at space-filling curves, in particular the Z-order curve, which (mostly) preserves spatial locality. It might be computationally expensive to look up indices, however.
If you are using this to try and improve cache performance, you might try a technique called "bricking", which is a little bit like one or two levels of a space-filling curve. Essentially, you subdivide your matrix into n-by-n tiles (where one tile fits neatly in your L1 cache). You can also store another level of tiles to fit into a higher-level cache. The advantage this has over a space-filling curve is that indices can be fairly quick to compute. One reference is included in the paper here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.8959
This sounds like something that could be helped by an R-tree, or one of its variants. There is nothing like that in the C++ Standard Library, but it looks like there is an R-tree in the Boost candidate library Boost.Geometry (not a part of Boost yet). I'd take a look at that before writing my own.
It is not possible to "linearize" a 2D structure into a 1D structure and keep the relation of proximity unchanged in both directions. This is one of the fundamental topological properties of the world.
Having said that, it is true that the standard row-wise or column-wise storage order normally used for 2D array representation is not the best one when you need to preserve proximity (as much as possible). You can get better results by using various discrete approximations of fractal curves (space-filling curves).
Z-order curve is a popular one for this application: http://en.wikipedia.org/wiki/Z-order_(curve)
Keep in mind though that regardless of which approach you use, there will always be elements that violate your distance requirement.
You could think of your 2D matrix as a big spiral, starting at the center and progressing to the outside. Unwind the spiral, and store the data in that order, and distance between addresses at least vaguely approximates Euclidean distance between the points they represent. While it won't be very exact, I'm pretty sure you can't do a whole lot better either. At the same time, I think even at very best, it's going to be of minimal help to your convolution code.
The answer is no. Think about it - memory is 1D. Your matrix is 2D. You want to squash that extra dimension in - with no loss? It's not going to happen.
What's more important is that once you get a certain distance away, it takes the same time to load into cache. If you have a cache miss, it doesn't matter if it's 100 away or 100000. Fundamentally, you cannot get more contiguous/better performance than a simple array, unless you want to get an LRU for your array.
I think you're forgetting that distance in computer memory is not covered by a CPU operating on foot :) so the distance is pretty much irrelevant.
It's random access memory, so really you have to figure out what operations you need to do, and optimize the accesses for that.
You need to reconvert the addresses from memory space to the original array space to accomplish this. Also, you've stressed distance only, which may still cause you some problems (no direction)
If I have an array of R x C, and two cells at locations [r,c] and [c,r], the distance from some arbitrary point, say [0,0] is identical. And there's no way you're going to make one memory address hold two things, unless you've got one of those fancy new qubit machines.
However, you can take into account that in a row-major array of R x C, each row is C * sizeof(yourdata) bytes long. Conversely, the original coordinates of any element offset within the bounds of the array are
r = (address / C)
c = (address % C)
so
r1 = address1 / C;
r2 = address2 / C;
c1 = address1 % C;
c2 = address2 % C;
dr = r1 - r2;
dc = c1 - c2;
dist = sqrt(dr*dr + dc*dc);
(this is assuming you're using zero-based arrays)
(crush all this together to make it run more optimally)
For a lot more ideas here, go look at any 2D image manipulation code that uses a calculated value called "stride", which is basically a sign that it's jumping back and forth between memory addresses and array coordinates.
This is not exactly related to closeness but might help. It certainly helps with minimization of disk accesses.
One way to get better "closeness" is to tile the image. If your convolution kernel is smaller than a tile, you typically touch at most 4 tiles in the worst case. You can recursively tile in bigger sections so that locality improves. A Stokes-like argument (at least I think it's Stokes), or some calculus of variations, can show that for rectangles the best shape (meaning for examination of arbitrary sub-rectangles) is a smaller rectangle of the same aspect ratio.
Quick intuition: think about a square. If you tile the larger square with smaller squares, the fact that a square encloses maximal area for a given perimeter means that square tiles have minimal border length. When you transform the large square, I think you can show you should transform the tiles the same way. (You might also be able to do a simple multivariate differentiation.)
The classic example is zooming in on spy satellite data images and convolving it for enhancement. The extra computation to tile is really worth it if you keep the data around and you go back to it.
It's also really worth it for different compression schemes such as cosine transforms. (That's why, when you download an image, it frequently appears in smaller and smaller squares until the final resolution is reached.)
There are a lot of books on this area and they are helpful.