I have a collection of 512-dimensional std::vectors that store face embeddings. I create my index and perform training on a subset of the data.
int d = 512;
size_t nb = this->templates.size(); // 95000
size_t nt = 50000; // training data size
std::vector<float> training_set(nt * d);
faiss::IndexFlatIP coarse_quantizer(d);
int ncentroids = int(4 * sqrt(nb));
faiss::IndexIVFPQ index(&coarse_quantizer, d, ncentroids, 4, 8);
Each entry of this->templates has an index value as element [0] (first) and the 512-D vector as element [1] (second). My question is about the training and indexing. I have this currently:
int v = 0;
for (auto const& element : this->templates)
{
    std::vector<double> enrollment_template = element.second;
    for (int i = 0; i < d; i++) {
        training_set[(v * d) + i] = (float)enrollment_template.at(i);
    }
    v++;
}
index.train(nt, training_set.data());
The FAISS Index::train function is documented as:
virtual void train(idx_t n, const float *x)
Perform training on a representative set of vectors.
Parameters:
n – number of training vectors
x – training vectors, size n * d
Is that the proper way of adding the 512-D vector data into Faiss for training? It seems to me that if I have 2 face embeddings that are 512-D in size, the training_set would be laid out like this:
training_set[0..511] - Face #1's 512-D vector
training_set[512..1023] - Face #2's 512-D vector
and since Faiss knows we are working with 512-D vectors, it will parse them out of the flat array accordingly.
Here's a more efficient way to write it:
int v = 0;
for (auto const& element : this->templates)
{
    auto& enrollment_template = element.second; // reference, not a copy
    if (v + d > (int)training_set.size()) {
        break; // prevent overflow; "nt" is smaller than templates.size()
    }
    for (int i = 0; i < d; i++) {
        training_set[v] = (float)enrollment_template[i]; // not at()
        v++;
    }
}
We avoid a copy with auto& enrollment_template, avoid extra branching with enrollment_template[i] (you know you won't be out of bounds), and simplify the address computation with training_set[v] by making v a count of elements rather than rows.
Further efficiency could be gained if templates can be changed to store floats rather than doubles; then you'd just be bitwise-copying 512 floats rather than converting doubles to floats.
Also, be sure to declare d as constexpr to give the compiler the best chance of optimizing the loop.
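If templates can indeed be switched to float storage, each row becomes a straight std::copy into the flat buffer. Below is a minimal sketch (not the original code) of the full populate/train/add flow under that assumption; nt, nb, index and templates are the objects from the snippets above, and the loop bookkeeping here is illustrative:

// Sketch: assumes templates now maps ids to std::vector<float>,
// and that nt, nb, coarse_quantizer and index are set up as above.
// Requires <algorithm> for std::copy.
constexpr int d = 512;
std::vector<float> training_set(nt * d);
std::vector<float> database(nb * d);

size_t t = 0, b = 0;
for (auto const& element : this->templates)
{
    const std::vector<float>& emb = element.second;
    // every vector goes into the database buffer...
    std::copy(emb.begin(), emb.end(), database.begin() + b * d);
    ++b;
    // ...and the first nt of them also serve as the training subset
    if (t < nt) {
        std::copy(emb.begin(), emb.end(), training_set.begin() + t * d);
        ++t;
    }
}

index.train(t, training_set.data()); // n = number of training vectors
index.add(b, database.data());       // add all nb vectors to the IVFPQ index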
Related
I am currently trying to implement matrix multiplication methods using the Microsoft SEAL library. I have created a vector<vector<double>> as the input matrix and encoded it with CKKSEncoder. However, the encoder packs an entire vector into a single Plaintext, so I just have a vector<Plaintext>, which makes me lose the 2D structure (and then of course I'll have a vector<Ciphertext> after encryption). Having a 1D vector allows me to access whole rows, but not the columns.
I managed to transpose the matrices before encoding. This allowed me to multiply component-wise the rows of the first matrix with the columns (rows in transposed form) of the second matrix, but I am unable to sum the elements of the resulting vector together since it's packed into a single Ciphertext. I just need to figure out how to make the vector dot product work in SEAL to perform matrix multiplication. Am I missing something, or is my method wrong?
KyoohyungHan suggested in https://github.com/microsoft/SEAL/issues/138 that the problem can be solved with rotations, by rotating the output vector and summing it up repeatedly.
For example:
// my_output_vector is the Ciphertext holding the component-wise products;
// element_count (illustrative name) is the number of values packed into it
vector<Ciphertext> rotations_output(element_count);
for (int steps = 0; steps < element_count; steps++)
{
    evaluator.rotate_vector(my_output_vector, steps, galois_keys, rotations_output[steps]);
}
Ciphertext sum_output;
evaluator.add_many(rotations_output, sum_output);
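Putting it together, a dot product of two packed vectors is a component-wise multiply followed by this rotate-and-sum. A minimal sketch of that step, assuming the Evaluator, RelinKeys and GaloisKeys have already been created and that element_count values are packed into each ciphertext (the function name is illustrative):

#include "seal/seal.h"
using namespace seal;

// Sketch only: ct_a and ct_b each encrypt one row/column of element_count packed values.
// (CKKS rescaling after the multiplication is omitted for brevity.)
Ciphertext dot_product(Evaluator &evaluator,
                       const RelinKeys &relin_keys,
                       const GaloisKeys &galois_keys,
                       const Ciphertext &ct_a,
                       const Ciphertext &ct_b,
                       int element_count)
{
    // Component-wise product of the two packed vectors
    Ciphertext product;
    evaluator.multiply(ct_a, ct_b, product);
    evaluator.relinearize_inplace(product, relin_keys);

    // Rotate-and-sum so that slot 0 ends up holding the sum of all element_count products
    Ciphertext result = product;
    for (int steps = 1; steps < element_count; steps++)
    {
        Ciphertext rotated;
        evaluator.rotate_vector(product, steps, galois_keys, rotated);
        evaluator.add_inplace(result, rotated);
    }
    return result; // slot 0 contains the dot product
}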
A vector of vectors is not the same as a two-dimensional array (a matrix).
While a one-dimensional vector<double>'s data() points to a contiguous block of memory (e.g., you can memcpy to or from it), each of the subvectors allocates its own, separate buffer. Therefore the data() of a vector<vector<double>> makes no sense and cannot be used as a matrix.
In C++, a two-dimensional array array2D[W][H] is stored in memory identically to a one-dimensional array[W*H]. Therefore both can be processed by the same routines (when that makes sense). Consider the following example:
#include <cstddef>
#include <iostream>

using std::cout;

void fill_array(double *array, size_t size, double value) {
for (size_t i = 0; i < size; ++i) {
array[i] = value;
}
}
int main(int argc, char *argv[])
{
constexpr size_t W = 10;
constexpr size_t H = 5;
double matrix[W][H];
// using 2D array as 1D to fill all elements with 5.
fill_array(&matrix[0][0], W * H, 5);
for (const auto &row: matrix) {
for (const auto v : row) {
cout << v << '\t';
}
cout << '\n';
}
return 0;
}
In the above example, you can substitute double matrix[W][H]; with vector<double> matrix(W * H); and feed matrix.data() into fill_array(). However, you cannot do the same with a vector of W vectors of length H.
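For illustration, here is a sketch of that substitution: the same fill_array routine fed with a flat vector, where element (r, c) lives at index r * H + c:

#include <cstddef>
#include <iostream>
#include <vector>

using std::cout;
using std::vector;

void fill_array(double *array, size_t size, double value) {
    for (size_t i = 0; i < size; ++i) {
        array[i] = value;
    }
}

int main()
{
    constexpr size_t W = 10;
    constexpr size_t H = 5;
    vector<double> matrix(W * H);
    // same routine, fed with the vector's contiguous buffer
    fill_array(matrix.data(), W * H, 5);
    // element (r, c) lives at index r * H + c
    for (size_t r = 0; r < W; ++r) {
        for (size_t c = 0; c < H; ++c) {
            cout << matrix[r * H + c] << '\t';
        }
        cout << '\n';
    }
    return 0;
}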
P.S. There are plenty of C++ implementations of math vector and matrix. You can use one of those if you don't want to deal with C-style arrays.
I am trying to create an array of X pointers referencing matrices of dimensions Y by 16. Is there any way to accomplish this in C++ without the use of triple pointers?
Edit: Adding some context for the problem.
There are a number of geometries on the screen, each with a transform that has been flattened to a 1x16 array. Each snapshot represents the transforms for each of a number of components. So the matrix dimensions are 16 by num_components by num_snapshots, where the latter two dimensions are known at run-time. In the end, we have many geometries with motion applied.
I'm creating a function that takes a triple pointer argument, though I cannot use triple pointers in my situation. What other ways can I pass this data (possibly via multiple arguments)? Worst case, I thought about flattening this entire 3D matrix to an array, though it seems like a sloppy thing to do. Any better suggestions?
What I have now:
function(..., double ***snapshot_transforms, ...)
What I want to accomplish:
function (..., <1+ non-triple pointer parameters>, ...)
Below isn't the function I'm creating that takes the triple pointer, but shows what the data is all about.
static double ***snapshot_transforms_function (int num_snapshots, int num_geometries)
{
double component_transform[16];
double ***snapshot_transforms = new double**[num_snapshots];
for (int i = 0; i < num_snapshots; i++)
{
snapshot_transforms[i] = new double*[num_geometries];
for (int j = 0; j < num_geometries; j++)
{
snapshot_transforms[i][j] = new double[16];
// 4x4 transform put into a 1x16 array with dummy values for each component for each snapshot
for (int k = 0; k < 16; k++)
snapshot_transforms[i][j][k] = k;
}
}
return snapshot_transforms;
}
Edit2: I cannot create new classes, nor use C++ features like std, as the exposed function prototype in the header file is getting put into a wrapper (that doesn't know how to interpret triple pointers) for translation to other languages.
Edit3: After everyone's input in the comments, I think going with a flattened array is probably the best solution. I was hoping there would be some way to split this triple pointer and organize this complex data across multiple data pieces neatly using simple data types including single pointers. Though I don't think there is a pretty way of doing this given my caveats here. I appreciate everyone's help =)
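For reference, a minimal sketch of the flattened-array approach mentioned in Edit3: a single double* of size num_snapshots * num_geometries * 16, with element (i, j, k) at index (i * num_geometries + j) * 16 + k. The function names here are illustrative, not part of the actual API:

// Sketch: flattened 3D data behind a single pointer (no std::, no classes, wrapper-friendly).
static double *snapshot_transforms_flat(int num_snapshots, int num_geometries)
{
    double *snapshot_transforms = new double[num_snapshots * num_geometries * 16];
    for (int i = 0; i < num_snapshots; i++)
        for (int j = 0; j < num_geometries; j++)
            for (int k = 0; k < 16; k++)
                snapshot_transforms[(i * num_geometries + j) * 16 + k] = k; // dummy values
    return snapshot_transforms;
}

// Consumers take the flat pointer plus the run-time dimensions:
static double get_transform_element(const double *snapshot_transforms,
                                    int num_geometries,
                                    int i, int j, int k)
{
    return snapshot_transforms[(i * num_geometries + j) * 16 + k];
}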
It is easier, better, and less error-prone to use std::vector. You are using C++ and not C, after all. I replaced all of the C-style array pointers with vectors. The typedef doublecube means you don't have to type vector<vector<vector<double>>> over and over again. Other than that the code basically stays the same as what you had.
If you don't actually need the dummy values, you could remove that innermost k loop and use reserve instead of resize: reserve only sets aside capacity for the real data, whereas resize (used below) actually creates the elements the loop writes to.
#include <vector>
using std::vector; // so we can just call it "vector"
typedef vector<vector<vector<double>>> doublecube;
static doublecube snapshot_transforms_function (int num_snapshots, int num_geometries)
{
// I deleted component_transform. It was never used
doublecube snapshot_transforms(num_snapshots);
for (int i = 0; i < num_snapshots; i++)
{
    snapshot_transforms.at(i).resize(num_geometries);
    for (int j = 0; j < num_geometries; j++)
    {
        snapshot_transforms.at(i).at(j).resize(16);
        // 4x4 transform put into a 1x16 array with dummy values for each component for each snapshot
        for (int k = 0; k < 16; k++)
            snapshot_transforms.at(i).at(j).at(k) = k;
    }
}
return snapshot_transforms;
}
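A quick usage sketch (illustrative values):

doublecube cube = snapshot_transforms_function(10, 4);
double v = cube.at(2).at(3).at(7); // transform element 7 of geometry 3 in snapshot 2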
Adding a little bit of object-orientation usually makes the code easier to manage; for example, here's some code that creates an array of 100 Matrix objects with varying numbers of rows per Matrix. (You could vary the number of columns in each Matrix too if you wanted to, but I left them at 16):
#include <algorithm> // for std::fill
#include <memory>    // for shared_ptr (not strictly necessary, but used in main() to avoid unnecessary copying of Matrix objects)
#include <vector>
/** Represents a (numRows x numCols) 2D matrix of doubles */
class Matrix
{
public:
// constructor
Matrix(int numRows = 0, int numCols = 0)
: _numRows(numRows)
, _numCols(numCols)
{
_values.resize(_numRows*_numCols);
std::fill(_values.begin(), _values.end(), 0.0);
}
// copy constructor (copies the values as well as the dimensions)
Matrix(const Matrix & rhs)
: _numRows(rhs._numRows)
, _numCols(rhs._numCols)
, _values(rhs._values)
{
// nothing else to do
}
/** Returns the value at (row/col) */
double get(int row, int col) const {return _values[(row*_numCols)+col];}
/** Sets the value at (row/col) to the specified value */
double set(int row, int col, double val) {return _values[(row*_numCols)+col] = val;}
/** Assignment operator */
Matrix & operator = (const Matrix & rhs)
{
_numRows = rhs._numRows;
_numCols = rhs._numCols;
_values = rhs._values;
return *this;
}
private:
int _numRows;
int _numCols;
std::vector<double> _values;
};
int main(int, char **)
{
const int numCols = 16;
std::vector< std::shared_ptr<Matrix> > matrixList;
for (int i=0; i<100; i++) matrixList.push_back(std::make_shared<Matrix>(i, numCols));
return 0;
}
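For example, an individual Matrix in the list can then be read and written through get() and set() (a small illustrative snippet):

matrixList[5]->set(2, 3, 1.5);       // write row 2, column 3
double v = matrixList[5]->get(2, 3); // v == 1.5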
I am trying to utilize sparse matrices in Armadillo, and am noticing a significant difference in access times with SpMat<int> compared to equivalent code using Mat<int>.
Description:
Below are two methods, which are identical in every respect except that Method_One uses regular matrices and Method_Two uses sparse matrices.
Both methods take the following arguments:
WS, DS: pointers to NN-element arrays
WW: ~13,000 [max(WS)]
DD: ~1,700 [max(DS)]
NN: ~2,300,000
TT: 50
I am using Visual Studio 2017 to compile the code into a .mexw64 MEX file that can be called from Matlab.
Code:
void Method_One(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
Mat<int> WP(WW, TT, fill::zeros); // (13000 x 50) matrix
Mat<int> DP(DD, TT, fill::zeros); // (1700 x 50) matrix
Col<int> ZZ(NN, fill::zeros); // 2,300,000 column vector
for (int n = 0; n < NN; n++)
{
int w_n = (int) WS[n] - 1;
int d_n = (int) DS[n] - 1;
int t_n = rand() % TT;
WP(w_n, t_n)++;
DP(d_n, t_n)++;
ZZ(n) = t_n + 1;
}
return;
}
void Method_Two(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
SpMat<int> WP(WW, TT); // (13000 x 50) matrix
SpMat<int> DP(DD, TT); // (1700 x 50) matrix
Col<int> ZZ(NN, fill::zeros); // 2,300,000 column vector
for (int n = 0; n < NN; n++)
{
int w_n = (int) WS[n] - 1;
int d_n = (int) DS[n] - 1;
int t_n = rand() % TT;
WP(w_n, t_n)++;
DP(d_n, t_n)++;
ZZ(n) = t_n + 1;
}
return;
}
Timing:
I am timing both methods using Armadillo's wall_clock timer object. For example:
wall_clock timer;
timer.tic();
Method_One(WW, DD, TT, NN, WS, DS);
double t = timer.toc();
Results:
Timing elapsed for Method_One using Mat<int>: 0.091 sec
Timing elapsed for Method_Two using SpMat<int>: 30.227 sec (almost 300 times slower)
Any insights into this are highly appreciated!
UPDATE:
This issue has been fixed with newer version (8.100.1) of Armadillo.
Here are the new results:
Timing elapsed for Method_One using Mat<int>: 0.141 sec
Timing elapsed for Method_Two using SpMat<int>: 2.127 sec (15 times slower, which is acceptable!)
Thanks to Conrad and Ryan.
As hbrerkere already mentioned, the problem stems from the fact that the values of the matrix are stored in a packed format (CSC) that makes it time-consuming to
Find the index of an already existing entry: Depending on whether the column entries are sorted by their row index you need either linear or binary search.
Insert a value that was previously zero: Here you need to find the insertion point for your new value and move all elements after that, leading to Ω(n) worst case time for a single insertion!
All these operations are constant-time operations for dense matrices, which mostly explains the runtime difference.
My usual solution was to use a separate sparse matrix type for assembly (where you typically access an element multiple times), based on the coordinate format (storing (i, j, value) triples) and using a map such as std::map or std::unordered_map to look up the triple's index from its (i, j) position in the matrix.
Some similar approaches are also discussed in this question about matrix assembly.
Example from my most recent use:
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

class DynamicSparseMatrix {
using Number = double;
using Index = std::size_t;
using Entry = std::pair<Index, Index>;
std::vector<Number> values;
std::vector<Index> rows;
std::vector<Index> cols;
std::map<Entry, Index> map; // unordered_map might be faster,
// but you need a suitable hash function
// like boost::hash<Entry> for this.
Index num_rows;
Index num_cols;
...
Number& value(Index row, Index col) {
// just to prevent misuse
assert(row >= 0 && row < num_rows);
assert(col >= 0 && col < num_cols);
// Find the entry in the matrix
Entry e{row, col};
auto it = map.find(e);
// If the entry hasn't previously been stored
if (it == map.end()) {
// Add a new entry by adding its value and coordinates
// to the end of the storage vectors.
it = map.insert(std::make_pair(e, values.size())).first;
rows.push_back(row);
cols.push_back(col);
values.push_back(0);
}
// Return the value
return values[(*it).second];
}
...
};
After assembly you can store all the values from rows, cols, values (which actually represent the matrix in Coordinate format), possibly sort them and do a batch insertion into your Armadillo matrix.
Sparse matrices are stored in a compressed format (CSC). Every time a non-zero element is inserted into a sparse matrix, the entire internal representation has to be updated. This is time-consuming.
It's much faster to construct the sparse matrix using batch constructors.
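As a rough sketch of what that could look like for Method_Two's WP matrix (not the original MEX code): accumulate the counts in a map first, then hand the unique (row, col) locations and values to the SpMat batch constructor in one go. In the real code the same t_n would of course also drive DP and ZZ.

#include <armadillo>
#include <cstdlib>
#include <map>
#include <utility>

using namespace arma;

SpMat<int> build_WP(int WW, int TT, int NN, const double* WS)
{
    // (row, col) -> count, accumulated in ordinary memory
    std::map<std::pair<uword, uword>, int> counts;
    for (int n = 0; n < NN; n++)
    {
        uword w_n = (uword)WS[n] - 1;
        uword t_n = (uword)(rand() % TT);
        counts[{w_n, t_n}]++;
    }

    // Batch-constructor inputs: a 2 x nnz matrix of locations and a vector of values
    umat locations(2, counts.size());
    Col<int> values(counts.size());
    uword k = 0;
    for (const auto& c : counts)
    {
        locations(0, k) = c.first.first;  // row index
        locations(1, k) = c.first.second; // column index
        values(k) = c.second;
        ++k;
    }

    // Single batch insertion: the CSC structure is built once
    return SpMat<int>(locations, values, WW, TT);
}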
I'm using a particle physics library written in C++ for a game.
In order to draw the particles, I must get an array of all their positions like so:
b2Vec2* particlePositionBuffer = world->GetParticlePositionBuffer();
This returns an array of b2Vec2 objects (which represent 2 dimensional vectors in the physics engine).
Also I can get and set their colour using
b2ParticleColor* particleColourBuffer = world->GetParticleColorBuffer();
I would like to get the 10% of the particles with the highest Y values (and then change their colour).
My idea is:
1. Make an array of structs the same size as the particlePositionBuffer array; each struct just contains an int (the particle's index in the particlePositionBuffer array) and a float (the particle's y position).
2. Then I sort the array by the y position.
3. Then I use the int in the structs from the top 10% of my struct array to change their colour in the particleColourBuffer array.
Could someone show me how to sort an array of structs like that in C++?
Also, do you think this is a decent way of going about this? I only need to do it once (not every frame).
The following may help:
// Functor to compare indices according to Y value.
struct comp
{
explicit comp(b2Vec2* particlePositionBuffer) :
particlePositionBuffer(particlePositionBuffer)
{}
bool operator()(int lhs, int rhs) const
{
// b2Vec2 exposes its coordinates as public members x and y.
// Note that rhs and lhs are swapped so that higher values come first.
return particlePositionBuffer[rhs].y < particlePositionBuffer[lhs].y;
}
b2Vec2* particlePositionBuffer;
};
void foo()
{
const std::size_t size = world->GetParticleCount(); // How do you get Count ?
const std::size_t subsize = size / 10; // check for not zero ?
std::vector<std::size_t> indices(size);
for (std::size_t i = 0; i != size; ++i) {
indices[i] = i;
}
std::nth_element(indices.begin(), indices.begin() + subsize, indices.end(),
comp(world->GetParticlePositionBuffer()));
b2ParticleColor* particleColourBuffer = world->GetParticleColorBuffer();
for (std::size_t i = 0; i != subsize; ++i) {
    changeColor(particleColourBuffer[indices[i]]); // indices[i] is the particle's original index
}
}
If your particle count is low, it won't matter much either way, and sorting them all first with a simple STL sort routine would be fine.
If the number were large though, I'd create a binary search tree whose maximum size was 10% of the number of your particles. Then I'd maintain the minY actually stored in the tree for quick rejection purposes. Then this algorithm should do it:
Walk through your original array and add items to the tree until it is full (10%)
Update your minY
For remaining items in original array
If item.y is less than minY, go to next item (quick rejection)
Otherwise
Remove the currently smallest Y value from the tree
Add the larger Y item to the tree
Update MinY
A binary search tree has a nice advantage of quick insert, quick search, and maintained ordering. If you want to be FAST, this is better than a complete sort on the entire array.
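A minimal sketch of that idea using std::multiset (a balanced binary search tree) to keep only the top k entries by Y; the raw array of y values stands in for whatever you pull out of the b2Vec2 position buffer, and the function name is illustrative:

#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Returns the indices of the k particles with the largest y values.
std::vector<std::size_t> top_k_by_y(const float* ys, std::size_t count, std::size_t k)
{
    // Ordered by y; *tree.begin() is always the current minY kept in the tree.
    std::multiset<std::pair<float, std::size_t>> tree;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (tree.size() < k) {
            tree.insert({ys[i], i});            // fill the tree up to k entries
        } else if (!tree.empty() && ys[i] > tree.begin()->first) {
            tree.erase(tree.begin());           // drop the smallest y currently kept
            tree.insert({ys[i], i});            // keep the larger one
        }
        // otherwise: quick rejection, ys[i] is not in the top k
    }
    std::vector<std::size_t> indices;
    indices.reserve(tree.size());
    for (const auto& entry : tree) {
        indices.push_back(entry.second);
    }
    return indices;
}

Calling it with k = count / 10 gives the indices whose colours should be changed.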
According to Visual Studio's performance analyzer, the following function is consuming what seems to me to be an abnormally large amount of processor power, seeing as all it does is add between 1 and 3 numbers from several vectors and store the result in one of those vectors.
//Relevant class members:
//vector<double> cache (~80,000);
//int inputSize;
//Notes:
//RealFFT::real is a typedef for POD double.
//RealFFT::RealSet is a wrapper class for a C-style array of RealFFT::real.
//This is because of the FFT library I'm using (FFTW).
//Its bracket operator is overloaded to return a const reference to the appropriate array element.
vector<RealFFT::real> Convolver::store(vector<RealFFT::RealSet>& data)
{
int cr = inputSize; //'cache' read position
int cw = 0; //'cache' write position
int di = 0; //index within 'data' vector (ex. data[di])
int bi = 0; //index within 'data' element (ex. data[di][bi])
int blockSize = irBlockSize();
int dataSize = data.size();
int cacheSize = cache.size();
//Basically, this takes the existing values in 'cache', sums them with the
//values in 'data' at the appropriate positions, and stores them back in
//the cache at a new position.
while (cw < cacheSize)
{
int n = 0;
if (di < dataSize)
n = data[di][bi];
if (di > 0 && bi < inputSize)
n += data[di - 1][blockSize + bi];
if (++bi == blockSize)
{
di++;
bi = 0;
}
if (cr < cacheSize)
n += cache[cr++];
cache[cw++] = n;
}
//Take the first 'inputSize' number of values and return them to a new vector.
return Common::vecTake<RealFFT::real>(inputSize, cache, 0);
}
Granted, the vectors in question have sizes of around 80,000 items, but by comparison, a function which multiplies similar vectors of complex numbers (complex multiplication requires 4 real multiplications and 2 additions each) consumes about 1/3 the processor power.
Perhaps it has something to do with the fact that it has to jump around within the vectors rather than just accessing them linearly? I really have no idea though. Any thoughts on how this could be optimized?
Edit: I should mention I also tried writing the function to access each vector linearly, but this requires more total iterations and actually the performance was worse that way.
Turn on compiler optimization as appropriate. A guide for MSVC is here:
http://msdn.microsoft.com/en-us/library/k1ack8f1.aspx
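For instance, when compiling from a developer command prompt, something like the following enables full optimization (an illustrative command line; in the IDE the equivalent settings live under project properties, C/C++ > Optimization, and /fp:fast is only appropriate if relaxed floating-point semantics are acceptable):

cl /O2 /fp:fast /EHsc convolver.cpp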