Reduce the length of a feature vector for comparison - c++

I have a problem, where several different objects are each described by a vector of real numbers, between 0 and 100, and a length (dimension) of 1000 elements.
Then I want to compare a new vector of equal characteristics with the set of vectors above, to find the most similar, with the Mahalanobis distance.
My question is:
How can I reduce the length of the vectors to the N most relevant elements (say, 100 of the 1000) without affecting the quality of the answers too much, i.e., so that the distance does not vary too much?
Remember that each vector is a description of a different object, unrelated to others.
I thought about using PCA, but after studying it, I saw that I needed at least two samples per object, or so I understood.
Any ideas? For the coding, I'm using C++ and OpenCV.
Thanks in advance.

Related

3D-Grid of bins: nested std::vector vs std::unordered_map

Pros, I need some opinions on performance for the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>>>> grid; // (SizeX, SizeY, SizeZ);
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add&delete deques when data is added/deleted. Probably less memory used, but maybe less fast? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, say object types ObjA, ObjB, ObjC per grid point, then my data essentially becomes 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of grid-positions
Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, reading about nearest neighbour approaches. Since my data is so regular, I had thought I'd skip all the tree-building (dividing the cells into smaller pieces) and just make a static 3D grid of the final leaves. That's how I came to ask here about the best way to store this grid.
Question associated with this, if I have a map<int, Obj> is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way the building of the above mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array that has fortran-ordering. It's a little bit like the chunks games like minecraft use. Which is a little like using a kd-tree with fixed leaf-size and fixed amount of leaves? Works pretty fast now so I'm happy with this approach.
Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector, or an array if you will. std::vector brings easier maintenance at a minimal overhead, so I'd prefer that. It consumes O(N^3) space, where N is the number of grid points along one dimension. To get the best performance when iterating over the data, make the loop order match the storage order: the outermost loop runs over the first index and the innermost loop over the last index, so that consecutive accesses touch neighbouring memory.
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in space O(N), with N being the number of elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
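To make the two options concrete, here is a rough sketch of both layouts. Obj is a placeholder for the asker's type, DenseGrid is a hypothetical wrapper that flattens the nested vectors into one allocation, and the hash combination in Pos3DHash is just one simple choice, not a tuned one:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_map>
#include <vector>

struct Obj { int id; };  // placeholder for the real object type

// Option A: every cell exists, stored contiguously with z varying fastest.
class DenseGrid {
public:
    DenseGrid(std::size_t sx, std::size_t sy, std::size_t sz)
        : sy_(sy), sz_(sz), cells_(sx * sy * sz) {}
    std::deque<Obj>& at(std::size_t x, std::size_t y, std::size_t z) {
        return cells_[(x * sy_ + y) * sz_ + z];
    }
    std::size_t cell_count() const { return cells_.size(); }
private:
    std::size_t sy_, sz_;
    std::vector<std::deque<Obj>> cells_;
};

// Option B: only occupied cells are stored.
struct Pos3D { int x, y, z; };
struct Pos3DEqual {
    bool operator()(const Pos3D& a, const Pos3D& b) const {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};
struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        // Simple polynomial combination of the three coordinate hashes.
        std::size_t h = std::hash<int>{}(p.x);
        h = h * 31 + std::hash<int>{}(p.y);
        h = h * 31 + std::hash<int>{}(p.z);
        return h;
    }
};
using SparseGrid = std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual>;
```

With the dense layout, grid.at(x, y, z).push_back(obj) never allocates a new cell. With the sparse one, map[Pos3D{x, y, z}].push_back(obj) creates the deque on first use.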
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed large number of elements. With option 1B) you'd indeed add the bucket index; for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a k-d tree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is used by the Point Cloud Library, which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.

How to handle very large matrices (e.g. 1000000 by 1000000)

My question is very general, and it is not a duplicate.
When we declare something like int mat[1000000][1000000];
it will certainly give an error saying the matrix size is too large.
I have seen many problems on competitive programming websites where we seem to need a 2D matrix with 10^6 rows and columns. I know there is always some trick to reduce the matrix size.
So I just want to ask: what are the possible ways or tricks to minimize the size in such cases? I mean, which types of algorithms are generally required to solve them, like DP or something else?
In DP, if the current row depends only on the previous row, you can use int mat[2][1000000];. After calculating the current row, you can immediately discard the previous row and swap the roles of current and previous.
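As an illustration of the two-row idea, here is a sketch using the longest common subsequence as the example problem (my choice of problem, not one from the question). The full table would be (|a|+1) x (|b|+1), but only two rows are ever alive:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common subsequence of a and b, keeping two DP rows.
std::size_t lcs_length(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1
                                            : std::max(prev[j], cur[j - 1]);
        }
        std::swap(prev, cur);  // the finished row becomes "previous"
    }
    return prev[b.size()];
}
```

Memory drops from O(|a| * |b|) to O(|b|), while the running time stays the same.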
Sometimes, it is possible to use std::map instead of 2D array.
I have encountered many such questions in programming contests, and the solutions differ from case to case, so if you mention a specific case, I can possibly give you a better targeted solution.
That depends very much on the specific task. There is no universal "trick" that will always work. You'll have to look for something in the particular problem that allows you to solve it in a different way.
That said, if I could really see no other way, I'd start thinking about how many elements of that matrix will really be non-zero (perhaps I can use a sparse array or a map (dictionary) instead). Or maybe I don't need to store all the elements in memory, but can instead re-calculate them every time I need them.
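A sparse representation along those lines could look like the following sketch. SparseMatrix is a hypothetical name, and std::map keyed by (row, column) is just one possible container; only non-zero entries take up memory:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical sparse matrix: entries that were never set read as 0.
class SparseMatrix {
public:
    using Key = std::pair<std::int64_t, std::int64_t>;
    void set(std::int64_t r, std::int64_t c, int v) {
        const Key k(r, c);
        if (v == 0) cells_.erase(k);  // storing a zero frees the slot
        else cells_[k] = v;
    }
    int get(std::int64_t r, std::int64_t c) const {
        auto it = cells_.find(Key(r, c));
        return it == cells_.end() ? 0 : it->second;
    }
    std::size_t stored() const { return cells_.size(); }
private:
    std::map<Key, int> cells_;
};
```

Whether this helps depends entirely on how many cells the problem actually touches.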
At any rate, a matrix that large (or any kind of fake representation of it) will NOT be useful. Not just because you don't have enough memory, but also because filling up such an array with data will take anywhere from several hours to many months. That should be your first concern: figuring out how to solve the task with less data and fewer computations. Once you figure that out, you'll also see what data structure is appropriate.

Fast Median Filter in C / C++ for `UINT16` 2D Array

Does anyone know a fast median filter algorithm for 16-bit (unsigned short) arrays in c++?
http://nomis80.org/ctmf.html
This one seems quite promising, but it only seems to work with byte arrays. Does anyone know either how to modify it to work with shorts or an alternative algorithm?
The technique in the paper relies on creating a histogram with 256 bins for an 8 bit pixel channel. Converting to 16 bits per channel would require a histogram with 65536 bins, and a histogram is required for each column of the image. Inflating the memory requirements by 256 makes this a less efficient algorithm overall, but still probably doable with today's hardware.
Using their proposed optimization of breaking the histogram into coarse and fine sections should further reduce the runtime hit to only 16x.
For small radius values I think you'll find traditional methods of median filtering will be more performant.
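To give a feel for the coarse/fine split on 16-bit data, here is a sketch of a single two-level histogram: 256 coarse bins indexed by the high byte sit on top of 65536 fine bins. Histogram16 is a hypothetical name, and this shows only the median lookup, not the paper's per-column histogram bookkeeping:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-level histogram for 16-bit samples.
class Histogram16 {
public:
    Histogram16() : coarse_{}, fine_(65536, 0), count_(0) {}
    void add(std::uint16_t v) { ++coarse_[v >> 8]; ++fine_[v]; ++count_; }
    void remove(std::uint16_t v) { --coarse_[v >> 8]; --fine_[v]; --count_; }
    // Lower median of the current contents; requires at least one sample.
    std::uint16_t median() const {
        const std::size_t target = (count_ - 1) / 2;  // zero-based rank of the median
        std::size_t seen = 0;
        std::size_t c = 0;
        // Skip whole coarse bins until the one containing the target rank.
        while (seen + coarse_[c] <= target) { seen += coarse_[c]; ++c; }
        // Scan at most 256 fine bins inside that coarse bin.
        std::size_t v = c << 8;
        while (seen + fine_[v] <= target) { seen += fine_[v]; ++v; }
        return static_cast<std::uint16_t>(v);
    }
private:
    std::array<std::uint32_t, 256> coarse_;
    std::vector<std::uint32_t> fine_;
    std::size_t count_;
};
```

A lookup scans at most 256 coarse bins plus 256 fine bins instead of up to 65536 bins.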
Fast Median Search - An ANSI C implementation (PDF) is a paper aimed at C. The author claims it's O(log(n)) and also provides some code, so maybe it'll help you. It's not better than your suggested code, but it may be worth a look.
std::vector<unsigned short> v{4, 2, 5, 1, 3};
std::vector<unsigned short> h(v.size()/2+1);
std::partial_sort_copy(v.begin(), v.end(), h.begin(), h.end());
int median = h.back();
Runs in O(N·log(N/2+1)) and does not modify your input.
This article describes a method for median filtering of images that runs in O(log r) time per pixel, where r is the filter radius, and works for any data type (be it 8 bit integers or doubles):
Fast Median and Bilateral Filtering
I know this question is somewhat old but I also got interested in median filtering. If one is working with signals or images, then there will be a large overlap of data for the processing window. This can be taken advantage of.
I've posted some benchmark code here: 1D moving median filtering in C++
It's template based so it should work with most POD data types.
According to my results, std::nth_element has poor performance for a moving median, as it has to process the whole window of values again each time.
However, using a pool of values that is kept sorted, one can compute the median with three operations:
Remove oldest value out of the pool (calls std::lower_bound)
Insert new value into pool (calls std::lower_bound)
Store new value in history buffer
The median is now the middle value in the pool.
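The three steps above can be sketched roughly as follows. MovingMedian is a hypothetical name (this is not the linked benchmark code), and a sorted std::vector is assumed for the pool:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

template <typename T>
class MovingMedian {
public:
    // A window of 0 is treated as 1 to keep push() well defined.
    explicit MovingMedian(std::size_t window) : window_(window == 0 ? 1 : window) {}

    // Adds one sample and returns the median of the samples currently in the window.
    T push(T value) {
        if (history_.size() == window_) {
            // Remove the oldest value from the sorted pool.
            auto old = std::lower_bound(pool_.begin(), pool_.end(), history_.front());
            pool_.erase(old);
            history_.pop_front();
        }
        // Insert the new value at its sorted position.
        pool_.insert(std::lower_bound(pool_.begin(), pool_.end(), value), value);
        // Remember arrival order so the oldest value can be found later.
        history_.push_back(value);
        return pool_[(pool_.size() - 1) / 2];  // lower median for even sizes
    }

private:
    std::size_t window_;
    std::deque<T> history_;  // samples in arrival order
    std::vector<T> pool_;    // the same samples, kept sorted
};
```

Each push costs two binary searches plus the element shifting inside the vector, which is cheap for the small windows typical of median filters.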
I hope someone finds this interesting and contributes their ideas!
See equations 4 and 5 in the following paper. The complexity is O(N*W) where W is the width of the filter and N is the number of samples.
See Noise Reduction by Vector Median Filtering.

Is there a data structure with these characteristics?

I'm looking for a data structure that would allow me to store an M-by-N 2D matrix of values contiguously in memory, such that the distance in memory between any two points approximates the Euclidean distance between those points in the matrix. That is, in a typical row-major representation as a one-dimensional array of M * N elements, the memory distance differs between adjacent cells in the same row (1) and adjacent cells in neighbouring rows (N).
I'd like a data structure that reduces or removes this difference. Really, the name of such a structure is sufficient—I can implement it myself. If answers happen to refer to libraries for this sort of thing, that's also acceptable, but they should be usable with C++.
I have an application that needs to perform fast image convolutions without hardware acceleration, and though I'm aware of the usual optimisation techniques for this sort of thing, I feel a specialised data structure or data ordering could improve performance.
Given the requirement that you want to store the values contiguously in memory, I'd strongly suggest you research space-filling curves, especially Hilbert curves.
To give a bit of context, such curves are sometimes used in database indexes to improve the locality of multidimensional range queries (e.g., "find all items with x/y coordinates in this rectangle"), thereby aiming to reduce the number of distinct pages accessed. A bit similar to the R-trees that have been suggested here already.
Either way, it looks like you're bound to an M*N array of values in memory, so the whole question is about how to arrange the values in that array, I figure. (Unless I misunderstood the question.)
So in fact, such orderings would probably still only change the characteristics of the distance distribution. The average distance for any two randomly chosen points from the matrix should not change, so I have to agree with Oli there. The potential benefit depends largely on your specific use case, I suppose.
I would guess "no"! And if the answer happens to be "yes", then it's almost certainly so irregular that it'll be way slower for a convolution-type operation.
EDIT
To qualify my guess, take an example. Let's say we store a[0][0] first. We want a[k][0] and a[0][k] to be similar distances, and proportional to k, so we might choose to interleave the storage of first row and first column (i.e. a[0][0], a[1][0], a[0][1], a[2][0], a[0][2], etc.) But how do we now do the same for e.g. a[1][0]? All the locations near it in memory are now taken up by stuff that's near a[0][0].
Whilst there are other possibilities than my example, I'd wager that you always end up with this kind of problem.
EDIT
If your data is sparse, then there may be scope to do something clever (re Cubbi's suggestion of R-trees). However, it'll still require irregular access and pointer chasing, so will be significantly slower than straightforward convolution for any given number of points.
You might look at space-filling curves, in particular the Z-order curve, which (mostly) preserves spatial locality. It might be computationally expensive to look up indices, however.
If you are using this to try and improve cache performance, you might try a technique called "bricking", which is a little bit like one or two levels of the space filling curve. Essentially, you subdivide your matrix into nxn tiles, (where nxn fits neatly in your L1 cache). You can also store another level of tiles to fit into a higher level cache. The advantage this has over a space-filling curve is that indices can be fairly quick to compute. One reference is included in the paper here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.8959
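To show why the index computation stays cheap with bricking, here is a sketch for a single level of square tiles. tiled_index is a hypothetical helper, and it assumes the matrix width is a multiple of the tile size:

```cpp
#include <cstddef>

// Hypothetical helper: linear offset of (row, col) in a tiled ("bricked") layout.
// Assumes cols is a multiple of tile; tiles are stored one after another,
// and each tile is row-major internally.
std::size_t tiled_index(std::size_t row, std::size_t col,
                        std::size_t cols, std::size_t tile) {
    const std::size_t tiles_per_row = cols / tile;
    const std::size_t tile_id = (row / tile) * tiles_per_row + (col / tile);
    const std::size_t within = (row % tile) * tile + (col % tile);
    return tile_id * tile * tile + within;
}
```

If tile is a power of two, the divisions and modulos reduce to shifts and masks.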
This sounds like something that could be helped by an R-tree or one of its variants. There is nothing like that in the C++ Standard Library, but it looks like there is an R-tree in the boost candidate library Boost.Geometry (not a part of boost yet). I'd take a look at that before writing my own.
It is not possible to "linearize" a 2D structure into a 1D structure and keep the relation of proximity unchanged in both directions. This is one of the fundamental topological properties of the world.
Having said that, it is true that the standard row-wise or column-wise storage order normally used for 2D array representation is not the best one when you need to preserve proximity (as much as possible). You can get better results by using various discrete approximations of fractal curves (space-filling curves).
Z-order curve is a popular one for this application: http://en.wikipedia.org/wiki/Z-order_(curve)
Keep in mind though that regardless of which approach you use, there will always be elements that violate your distance requirement.
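For reference, a Z-order (Morton) index is obtained by interleaving the bits of the two coordinates. A minimal sketch for 16-bit coordinates, with spread_bits and morton_index as hypothetical helper names:

```cpp
#include <cstdint>

// Spreads the low 16 bits of v so that they occupy the even bit positions.
std::uint32_t spread_bits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order index: bits of x go to even positions, bits of y to odd positions.
std::uint32_t morton_index(std::uint16_t x, std::uint16_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

Cells in the same 2x2, 4x4, 8x8, ... aligned block end up in one contiguous index range, which is where the locality comes from. Neighbours across a block boundary can still be far apart.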
You could think of your 2D matrix as a big spiral, starting at the center and progressing to the outside. Unwind the spiral, and store the data in that order, and distance between addresses at least vaguely approximates Euclidean distance between the points they represent. While it won't be very exact, I'm pretty sure you can't do a whole lot better either. At the same time, I think even at very best, it's going to be of minimal help to your convolution code.
The answer is no. Think about it - memory is 1D. Your matrix is 2D. You want to squash that extra dimension in - with no loss? It's not going to happen.
What's more important is that once you get a certain distance away, it takes the same time to load into cache. If you have a cache miss, it doesn't matter if it's 100 away or 100000. Fundamentally, you cannot get more contiguous/better performance than a simple array, unless you want to get an LRU for your array.
I think you're forgetting that distance in computer memory is not accessed by a computer cpu operating on foot :) so the distance is pretty much irrelevant.
It's random access memory, so really you have to figure out what operations you need to do, and optimize the accesses for that.
You need to reconvert the addresses from memory space to the original array space to accomplish this. Also, you've stressed distance only, which may still cause you some problems (no direction)
If I have an array of R x C, and two cells at locations [r,c] and [c,r], the distance from some arbitrary point, say [0,0] is identical. And there's no way you're going to make one memory address hold two things, unless you've got one of those fancy new qubit machines.
However, you can take into account that in a row-major array of R x C, each row is C * sizeof(yourdata) bytes long. Conversely, if address is the element offset from the start of the array (the byte offset divided by sizeof(yourdata)), the original coordinates of any element within the bounds of the array are
r = (address / C)
c = (address % C)
so
r1 = (address1 / C)
r2 = (address2 / C)
c1 = (address1 % C)
c2 = (address2 % C)
dx = r1 - r2
dy = c1 - c2
dist = sqrt(dx*dx + dy*dy)
(this is assuming you're using zero based arrays)
(crush all this together to make it run more optimally)
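Crushed together, the computation above might look like this sketch. cell_distance is a hypothetical name, and the two arguments are element indices into the row-major array (not raw byte addresses):

```cpp
#include <cmath>
#include <cstddef>

// Euclidean distance between two cells of a row-major array with `cols`
// columns, given their zero-based linear element indices.
double cell_distance(std::size_t index1, std::size_t index2, std::size_t cols) {
    const double dr = static_cast<double>(index1 / cols) - static_cast<double>(index2 / cols);
    const double dc = static_cast<double>(index1 % cols) - static_cast<double>(index2 % cols);
    return std::sqrt(dr * dr + dc * dc);
}
```

The conversion to double happens before the subtraction so that the unsigned indices cannot wrap around.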
For a lot more ideas here, go look for any 2D image manipulation code that uses a calculated value called 'stride', which is basically an indicator that they're jumping back and forth between memory addresses and array addresses
This is not exactly related to closeness but might help. It certainly helps to minimize disk accesses.
One way to get better "closeness" is to tile the image. If your convolution kernel is smaller than a tile, you typically touch at most 4 tiles. You can recursively tile in bigger sections so that localization improves. A Stokes-like (at least I think it's Stokes) argument (or some calculus of variations) can show that for rectangles the best shape (meaning best for examination of arbitrary sub-rectangles) is a smaller rectangle of the same aspect ratio.
Quick intuition: think about a square. If you tile the larger square with smaller squares, the fact that a square encloses maximal area for a given perimeter means that square tiles have minimal border length. When you transform the large square, I think you can show you should transform the tile the same way. (You might also be able to do a simple multivariate differentiation.)
The classic example is zooming in on spy satellite data images and convolving it for enhancement. The extra computation to tile is really worth it if you keep the data around and you go back to it.
It's also really worth it for the different compression schemes such as cosine transforms. (That's why, when you download an image, it frequently appears in smaller and smaller squares until the final resolution is reached.)
There are a lot of books on this area and they are helpful.

What are the differences between the various boost ublas sparse vectors?

In boost::numeric::ublas, there are three sparse vector types.
I can see that the mapped_vector is essentially a std::map from index to value, which considers all not-found values to be 0 (or whatever is the common value).
But the documentation is sparse (ha ha) on information about compressed_vector and coordinate_vector.
Is anyone able to clarify? I'm trying to figure out the algorithmic complexity of adding items to the various vectors, and also of dot products between two such vectors.
A very helpful answer offered that compressed_vector is very similar to compressed_matrix. But it seems that, for example, compressed row storage is only for storing matrices -- not just vectors.
I see that unbounded_array is the storage type, but I'm not quite sure what the specification is for that, either. If I create a compressed_vector with size 200,000,000, but with only 5 non-zero locations, is this less efficient in any way than creating a compressed_vector with size 10 and 5 non-zero locations?
Many thanks!
Replace matrix with vector and you have the answers:
http://www.guwi17.de/ublas/matrix_sparse_usage.html