I have 5+ million records for predicting people's race. One textual feature gives rise to tens of thousands more: for example, the name 'Smith' gives rise to 'sm', 'mi', 'it', etc. I then need to transform these into a sparse matrix:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X2 = vec.fit_transform(measurements)
Because of the tens of thousands of generated features, I can't use the following to get a dense array; it gives me an out-of-memory error.
X = vec.fit_transform(measurements).toarray()
As far as I can tell, a lot of other functions/modules in scikit-learn only allow data in array format to be fitted, for example http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA and http://scikit-learn.org/stable/modules/feature_selection.html for dimensionality reduction and feature selection.
pca = PCA(n_components=2)
pca.fit(X) # X works but not X2, though I can't get X with my big data set because of out-of-memory error
I am not certain that this will help, but you can try to slice your X2 into smaller parts (but still as big as possible) and use IncrementalPCA on them:
from sklearn.utils import gen_batches
from sklearn.decomposition import IncrementalPCA
pca = IncrementalPCA()
n_samples, n_features = X2.shape
batchsize = n_features*5
for batch in gen_batches(n_samples, batchsize):
    pca.partial_fit(X2[batch].toarray())
You may change that constant 5 to some bigger number if your RAM size allows it.
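If you then need the reduced representation itself, one possibility (a sketch, reusing the pca, X2 and batchsize from above) is to transform batch by batch as well and stack the results:
import numpy as np
# transform each batch with the incrementally fitted PCA, then stack the dense results
X_reduced = np.vstack([pca.transform(X2[batch].toarray())
                       for batch in gen_batches(n_samples, batchsize)])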
As you noticed, you probably won't be able to convert your text features into a numpy array.
So you'll need to focus on techniques that can handle sparse data.
PCA is not one of them.
The reason is that PCA performs centering of the data, which makes the data dense (picture a sparse matrix, then subtract 0.5 from every element).
This SO answer provides more explanation and an alternative:
To clarify: PCA is mathematically defined as centering the data (removing the mean value from each feature) and then applying truncated SVD on the centered data.
As centering the data would destroy the sparsity and force a dense representation that often does not fit in memory any more, it is common to directly do truncated SVD on sparse data (without centering). This resembles PCA but it's not exactly the same.
In the context of text data, performing SVD after a TfidfVectorizer or a CountVectorizer is actually a well-known technique called latent semantic analysis.
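Concretely, a minimal sketch of that approach on the sparse matrix X2 from above (the number of components is just an illustrative choice):
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=100)  # accepts scipy sparse input directly, no toarray() needed
X_reduced = svd.fit_transform(X2)     # dense (n_samples, 100) array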
As for the feature selection part, you'll probably have to modify the source code of your scoring function (e.g. chi2) so that it handles sparse matrices without making them dense.
It is possible; it's mostly a trade-off between keeping the matrices sparse and using efficient array operations.
In your case, though, I'd try throwing this at a classifier first to see if the extra work is worth your time.
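For example, a minimal sketch with a linear classifier that works directly on sparse matrices (y here stands for a hypothetical array of race labels, one per row of X2):
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.2, random_state=0)
clf = SGDClassifier()  # linear SVM trained with SGD; no dense conversion required
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))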
I am working on a C++ project that needs to perform FFTs on large 2D raster data (10 to 100 GB). In particular, the performance is quite bad when applying the FFT to each column, whose elements are not contiguous in memory (they are placed with a stride of the width of the data).
Currently, I'm doing this: since the data does not fit in memory, I read several columns, namely n columns, into memory with their orientation transposed (so that a column in the file becomes a row in memory) and apply the FFT with an external library (MKL). I read (fread) n pixels, move on to the next row (fseek by width - n), read n pixels, jump to the next row, and so on. When the operation (FFT) is done with the column chunk, I write it back to the file in the same manner: I write n pixels, jump to the next row, and so on. This way of reading and writing the file takes too much time, so I want to find some way of speeding it up.
I have considered transposing the whole file beforehand, but the entire process includes both row-major and column-major FFT operations, so transposing would not pay off.
I'd like to hear any experience or ideas about this kind of column-major operation on large data. Any suggestions related particularly to FFT or MKL would help as well.
Why not work with both transposed and non-transposed data at the same time? That will double the memory requirement, but it may be worth it.
Consider switching to a Hadamard transform. As a complete IPS, the transform involves no multiplications, since all of the coefficients in the transform are plus or minus one. If you need the resulting transform in a Fourier basis, a matrix multiplication will change bases.
I have a netcdf file which contains a float array (21600, 43200). I don't want to read the entire array into RAM because it's too large, so I'm using the Dataset object from the netCDF4 library to access the array.
I would like to calculate the mean of a subset of this array, selected using two 1D numpy arrays (x_coords, y_coords) of 300-400 coordinates.
I don't think I can use basic indexing, because the coordinates I have aren't contiguous. What I'm currently doing is just feeding the arrays directly into the object, like so:
ncdf_data = Dataset(file, 'r')
mean = np.mean(ncdf_data.variables['q'][x_coords, y_coords])
The above code takes far too long for my liking (~3-4 seconds depending on the coordinates I'm using), and I'd like to speed this up somehow. Is there a pythonic way that I can use to directly work out the mean from such a subset without triggering fancy indexing?
I know h5py warns about the slow speed of fancy indexing (docs.h5py.org/en/latest/high/dataset.html#fancy-indexing); netCDF probably has the same problem.
Can you load a contiguous slice that contains all the values, and apply the faster numpy advanced indexing to that subset? Or you may have to work with chunks.
numpy advanced indexing is slower than its basic slicing, but it is still quite a bit faster than fancy indexing directly off the file.
However you do it, np.mean will be operating on data in memory, not directly on data in the file. The slowness of fancy indexing is because it has to access data scattered throughout the file. Loading the data into an array in memory isn't the slow part; the slow part is seeking and reading from the file.
Putting the file on a faster drive (e.g. a solid state one) might help.
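As a minimal sketch of the contiguous-slice idea (assuming the variable 'q' and the x_coords/y_coords arrays from the question, and that the bounding box of the points fits in memory):
import numpy as np
from netCDF4 import Dataset
ncdf_data = Dataset(file, 'r')
var = ncdf_data.variables['q']
x0, x1 = x_coords.min(), x_coords.max() + 1  # bounding box of the requested points
y0, y1 = y_coords.min(), y_coords.max() + 1
block = var[x0:x1, y0:y1]                    # one contiguous read from the file
mean = np.mean(block[x_coords - x0, y_coords - y0])  # fancy indexing on the in-memory array
If the bounding box covers most of the 21600 x 43200 grid, the block won't fit in RAM either, in which case you'll have to fall back to working in chunks as mentioned above.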
I'm writing up an implementation of backpropagation for a feedforward neural network in C++, and I'm using the Armadillo library. Right now, I'm loading training data with the load method of Armadillo's matrix class. Two questions:
1) Is this a reasonable choice for storing pre-formatted (CSV), numeric data that fits into main memory (<2GB)? Certainly some ways of doing this are better than others, and it'd be nice to know if this is not good practice. Part of me feels like this isn't a good choice for holding the data, as there are likely more data-ish structures/frameworks (like I should be accessing some SQL database or something). Another part of me feels like numeric data is by definition just matrices, so this should be wonderful.
2) I need to sample without replacement from a data set in my implementation and I see two routes: either I could shuffle the rows of the data set or shuffle an array that indexes the data set. There is a shuffle method for the matrix class in the Armadillo library and I'm suspicious that what is shuffled is addresses and not the rows themselves. Wouldn't that be just as efficient as shuffling an indexing array?
1) Yes, this is fine and it's how I would do it, but note that Armadillo matrices are column-major and thus you may need to transpose the CSV that you load. If your data is sufficiently large that it won't fit in main memory, you could consider writing a custom CSV parser that looks at the data in a streaming sense (i.e. one point at a time), thus reducing your RAM footprint, or you could even use mmap() to map a file full of packed doubles as your matrix and let the kernel work out what needs to be swapped in when.
2) Because all matrix data is stored contiguously (i.e. double* not double**), shuffle() will be moving the elements in the matrix. What I generally do in this type of situation is create a vector of indices and shuffle it:
uvec indices = linspace<uvec>(0, n - 1, n); // indices 0, 1, ..., n - 1
indices = shuffle(indices); // shuffle() returns a shuffled copy
// Now loop over each shuffled point...
for (uword i = 0; i < n; ++i)
{
  // access the point with data.col(indices[i]) and do whatever
}
(The above code isn't tested, but it should work or easily be adapted into something that works.)
For what it's worth, mlpack (http://www.mlpack.org/) does have a not-yet-stable neural network infrastructure that uses Armadillo, and it may be worth your time to check out; the link below is to the relevant source directly, but poking around on Github and the mlpack website should reveal better documentation.
https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/ann
I am using a C++ function to compute the inverse Fourier transform.
int inYSize = 170; int inXSize = 2280;
float* outData = new float[inYSize*inXSize];
fftwf_plan mReverse = fftwf_plan_dft_c2r_2d(inYSize, inXSize, (fftwf_complex*)temp, outData,
                                            FFTW_ESTIMATE);
fftwf_execute(mReverse);
My input is a 2D array temp of complex numbers. All the elements have real value 1 and imaginary value 0.
So I am expecting the inverse FFT of such an array to be a 2D array of real values, with a spike at (0,0) and all other values 0. But I am getting all sorts of different values in the output array, even after normalizing by the total size of the array. What could be the reason?
FFTW is not that trivial to deal with when it comes to multidimensional DFTs and complex-to-real transforms.
When doing a C2R transform of an MxN row-major array, the second dimension of the complex data is cut roughly in half (N/2+1 elements) because of the Hermitian symmetry of the result, so a full MxN complex array is about twice as big as needed; but that is not the reason for your problem.
More info about this tortuous matter: http://www.fftw.org/doc/One_002dDimensional-DFTs-of-Real-Data.html
"Good Guy Advice" : Use only the C2C "easier" way of doing things, take the modulus of the output if you don't know how to process the results, but don't waste your time on n-D Complex to Real transforms.
Because of limited precision, because of the numerical implementation of the DFT, and because of unsubordinated drunk bits, you can get values that are not exactly 0, even if they are very small. This is the normal behavior of an FFT algorithm.
Besides reading the user manual carefully (http://www.fftw.org/doc/), even if it's a real pain (I lost a few days with this library just to get a 3D transform working and to understand how the data was scaled), you should try a C2C 1D transform before going to C2C 2D and C2R 2D, just to be sure you have some idea of what you're doing.
What's the inverse FFT of a planar constant, where every bin of the "frequency plane" is filled with a one? Are you looking for a new way to define +inf or -inf? In that case I would rather start with the easier division by 0 ^^. The direct FFT should be as you described, with the spike correctly scaled to 1; I'm pretty sure the inverse is not.
Do not hesitate to add more detail to your question, and good luck with FFTW.
With this little information it is hard to tell. What I could imagine is that you are getting spectral leakage due to the window selection (see this Wikipedia article for details about leakage). What you could do is try using another window function to reduce leakage, or change your window size.
I would like to know the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable lengths. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important), or in your spectral element code where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you generally can't do much of the hard work on each element or block until a substantial amount of that block's data is already in cache. You're already going to have had to flush most of block N's data out of cache before getting much work done on block N+1 anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - e.g., elementwise operations on cell values.) So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want each individual block/element to be contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies going to wipe your cache, the copies themselves are very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1 GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's 0.2 GB (read + write) / ~20 GB/s ~ 20 ms, compared to reloading (say) 16k cache lines at ~100 ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to do the refinement/derefinement very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice: the first time with the elements in order, computing on them 1-2, and the second time doing the operation in the "wrong" order, 2-1, and time the two computations several times.
For each cell, store the offset at which to find the cell's data in a contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is much better for external analysis, because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be managed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
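A minimal sketch of that layout (illustrative only, with made-up cell sizes; nVars and N_k are the quantities defined in the question):
import numpy as np
nVars = 5
N_k = np.array([4, 2, 8, 3])                       # points per direction for each cell (example values)
sizes = N_k**2 * nVars                             # number of stored values per cell
offsets = np.concatenate(([0], np.cumsum(sizes)))  # offsets[k] = start of cell k's block
data = np.zeros(offsets[-1])                       # one contiguous array for the whole grid
k = 2
cell_k = data[offsets[k]:offsets[k + 1]].reshape(N_k[k]**2, nVars)  # view of cell k, no copy
When cells are added, removed, or change their N_k, you rebuild offsets, allocate a fresh data array, interpolate the old values into it, and discard the old arrays.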