Fast symmetric binary matrix multiplication using vector extensions - c++

I'm using a binary matrix representing an undirected graph and toying with gcc's vector extensions to see what can be done to produce a matrix product (replacing +/* operations with |/&) efficiently.
The following attempt assumes both input matrices are symmetric about the diagonal:
typedef unsigned char __attribute__((vector_size(8))) vec;
vec example_input = {
0b11001000
, 0b11001001
, 0b00100100
, 0b00010000
, 0b11001000
, 0b00100100
, 0b00000010
, 0b01000001
};
void symmetric_product( const vec& left, const vec& right, vec& result ) {
for( unsigned ii = 0; ii < 8; ++ii ) {
vec tmp{};
// broadcast row ii across all rows
tmp -= 1;
tmp &= left[ii];
// compute first half of dot product of
// all rows in 'right' with row 'ii'
tmp &= right;
// The rest does the 'tallying'; I believe
// the rest could be replaced with the
// pext intrinsic
result[ii] = 0;
for( unsigned jj = 0; jj < 8; ++jj ) {
result[ii] |= (0 != tmp[jj]) << jj;
}
}
}
I was researching something similar several months back and thought I'd seen a slick way to pull this off, but all I'm finding now is the pext* family of instructions.
If that's the only way then so be it; my hope is someone knows of another way that doesn't require hardware-specific intrinsics.
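For checking any vectorized attempt, a plain scalar reference can be handy. The following is only a sketch under stated assumptions: the names Rows, is_symmetric and symmetric_product_reference are made up for illustration, and it assumes bit j of row i holds the entry in column j. It relies on the symmetry of the right-hand matrix: column j equals row j, so entry (i, j) of the product is set exactly when left[i] and right[j] share a set bit.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One byte per row; bit j of row i is the entry in column j (assumed convention).
using Rows = std::array<std::uint8_t, 8>;

// True when entry (i, j) equals entry (j, i) for every pair.
inline bool is_symmetric(const Rows& m) {
    for (unsigned i = 0; i < 8; ++i)
        for (unsigned j = 0; j < 8; ++j)
            if (((m[i] >> j) & 1u) != ((m[j] >> i) & 1u))
                return false;
    return true;
}

// Boolean product with | and & in place of + and *.
// Because 'right' is symmetric, its column j equals its row j, so
// entry (i, j) is set exactly when left[i] and right[j] share a bit.
inline Rows symmetric_product_reference(const Rows& left, const Rows& right) {
    assert(is_symmetric(right));
    Rows result{};
    for (unsigned i = 0; i < 8; ++i)
        for (unsigned j = 0; j < 8; ++j)
            if (left[i] & right[j])
                result[i] |= static_cast<std::uint8_t>(1u << j);
    return result;
}
```

This uses no intrinsics and no vector extensions, so it is slow, but it gives a known-good answer to compare a faster version against.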

Related

FAISS with C++ indexing 512D vectors

I have a collection of 512D std::vector to store face embeddings. I create my index and perform training on a subset of the data.
int d = 512;
size_t nb = this->templates.size(); // 95000
size_t nt = 50000; // training data size
std::vector<float> training_set(nt * d);
faiss::IndexFlatIP coarse_quantizer(d);
int ncentroids = int(4 * sqrt(nb));
faiss::IndexIVFPQ index(&coarse_quantizer,d,ncentroids,4,8);
The this->templates has an index value in [0] and the 512D vectors in [1]. My question is about the training and indexing. I have this currently:
int v=0;
for (auto const& element : this->templates)
{
std::vector<double> enrollment_template = element.second;
for (int i=0;i<d;i++){
training_set[(v*d)+i] = (float)enrollment_template.at(i);
}
v++;
}
index.train(nt,training_set.data());
FAISS Index.Train function
virtual void train(idx_t n, const float *x)
Perform training on a representative set of vectors
Parameters:
n – nb of training vectors
x – training vectors, size n * d
Is that the proper way of adding the 512D vector data into Faiss for training? It seems to me that if I have 2 face embeddings that are 512D in size, the training_set would be like this:
training_set[0-511] - Face #1's 512D vectors
training_set[512-1023] - Face #2's 512D vectors
and since Faiss knows we are working with 512D vectors, it will intelligently parse them out of the array.
Here's a more efficient way to write it:
int v = 0;
for (auto const& element : this->templates)
{
auto& enrollment_template = element.second; // not copy
if (v + d > training_set.size()) {
break; // prevent overflow, "nt" is smaller than templates.size()
}
for (int i = 0; i < d; i++) {
training_set[v] = enrollment_template[i]; // not at()
v++;
}
}
We avoid a copy with auto& enrollment_template, avoid extra branching with enrollment_template[i] (you know you won't be out of bounds), and simplify the address computation with training_set[v] by making v a count of elements rather than rows.
Further efficiency could be gained if templates can be changed to store floats rather than doubles--then you'd just be bitwise-copying 512 floats rather than converting doubles to floats.
Also, be sure to declare d as constexpr to give the compiler the best chance of optimizing the loop.
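The same packing can be sketched as a standalone function that needs no Faiss headers. This is an illustration under assumptions, not the asker's actual code: flatten_templates is a hypothetical name, and templates is assumed to be a std::map<int, std::vector<double>>, which the question does not state. It produces the row-major layout described above, with vector r occupying elements [r * d, (r + 1) * d).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Packs up to max_rows embeddings into one row-major float buffer:
// row r occupies flat[r * d] .. flat[r * d + d - 1].
inline std::vector<float> flatten_templates(
    const std::map<int, std::vector<double>>& templates,  // assumed container type
    std::size_t d, std::size_t max_rows) {
    std::vector<float> flat;
    flat.reserve(std::min(max_rows, templates.size()) * d);
    std::size_t rows = 0;
    for (const auto& entry : templates) {
        if (rows == max_rows) break;           // stop once the training subset is full
        const std::vector<double>& v = entry.second;
        assert(v.size() == d);                 // every embedding must have d components
        for (double x : v) flat.push_back(static_cast<float>(x));
        ++rows;
    }
    return flat;
}
```

With this, the number of training vectors passed to index.train would be flat.size() / d, and the pointer would be flat.data().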

Accumulating Doubles Into Bins via intrinsics

I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.
A naive implementation is below:
const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS]; // Observations
int8_t offsets[N_OBS];
double acc[N_BINS] = {0};
for (int i = 0; i < N_OBS; ++i) {
acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}
Is this possible using simd/avx intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.
Modern CPUs are surprisingly good at running your naïve version. On AMD Zen3, I’m getting 48 ms for 100M random numbers on input, which is 18 GB/sec of RAM read bandwidth. That’s about 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).
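The arithmetic behind those figures can be sketched as follows. The helper names are made up for illustration, and the sketch assumes 1 GB = 1e9 bytes and that each observation costs one 8-byte double plus one 1-byte offset of memory traffic.

```cpp
// Bytes the loop must read per observation: one double plus one int8_t offset.
constexpr double bytes_per_observation = sizeof(double) + 1.0;

// Achieved read bandwidth in GB/s for n observations processed in 'seconds'.
constexpr double achieved_gb_per_s(double n, double seconds) {
    return n * bytes_per_observation / seconds / 1e9;
}

// Theoretical DDR4 peak: mega-transfers per second times 8 bytes per transfer, per channel.
constexpr double ddr4_peak_gb_per_s(double mega_transfers_per_s, int channels) {
    return mega_transfers_per_s * 1e6 * 8.0 * channels / 1e9;
}
```

For 100M observations in 48 ms this gives 18.75 GB/s against a dual-channel DDR4-3200 peak of 51.2 GB/s, which is in line with the rough percentage quoted above.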
SIMD isn’t going to help, I’m afraid. Still, the best version I got is the following. Compile with OpenMP support; the switch depends on your C++ compiler.
void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
// Count of OpenMP threads = CPU cores to use
constexpr int ompThreadsCount = 4;
// Use independent set of accumulators per thread, otherwise concurrency gonna corrupt data.
// Aligning by 64 = cache line, we want to assign cache lines to CPU cores, sharing them is extremely expensive
alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
memset( &accumulators, 0, sizeof( accumulators ) );
// Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
for( int i = 0; i < ompThreadsCount; i++ )
{
// Grab a slice of the output buffer
double* const acc = &accumulators[ i * 16 ];
// Compute a slice of the source data for this thread
const size_t first = i * length / ompThreadsCount;
const size_t last = ( i + 1 ) * length / ompThreadsCount;
// Accumulate into thread-local portion of the buffer
for( size_t k = first; k < last; k++ )
{
const int8_t idx = indices[ k ];
acc[ idx ] += rsi[ k ];
}
}
// Reduce 16*N scalars to 16 with a few AVX instructions
for( int i = 0; i < 16; i += 4 )
{
__m256d v = _mm256_load_pd( &accumulators[ i ] );
for( int j = 1; j < ompThreadsCount; j++ )
{
__m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
v = _mm256_add_pd( v, v2 );
}
_mm256_storeu_pd( rdi + i, v );
}
}
The above version takes 20.5 ms, which translates to 88% of the RAM bandwidth limit.
P.S. I have no idea why the optimal thread count is 4 here; I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.
If the offsets indeed stay unchanged for thousands (probably even tens) of runs, it is likely worthwhile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.
Essentially, what you are doing originally is a sparse-matrix times dense-vector product with the matrix in compressed-column-storage format (without explicitly storing the 1-values).
As shown in this answer, sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).
Untested example implementation:
using sparse_matrix = std::vector<std::vector<int> >;
// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
sparse_matrix res;
res.resize(n_bins);
// collect the observation indices that belong to each bin:
for(int i=0; i<n_obs; ++i) {
// assert(offsets[i] < n_bins);
res[offsets[i]].push_back(i);
}
return res;
}
void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
for(std::size_t row=0; row<indexes.size(); ++row) {
double sum = 0;
for(int col : indexes[row]) {
// you can manually vectorize this using _mm256_i32gather_pd,
// but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
sum += obs[col];
}
acc[row] = sum;
}
}
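A variation on the same idea keeps all indices in one contiguous array with a table of row starts, which is closer to the usual compressed-row layout than a vector of vectors. This is a sketch only: BinnedIndex, build_index and accumulate_binned are hypothetical names, and the offsets are assumed to be int8_t values in [0, n_bins) as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Observation indices grouped by bin, stored contiguously (CSR-like layout).
struct BinnedIndex {
    std::vector<std::size_t> row_start;  // n_bins + 1 entries; bin b owns [row_start[b], row_start[b + 1])
    std::vector<std::uint32_t> cols;     // observation indices, bin by bin
};

// Call once per set of offsets: a counting sort of the observation indices by bin.
inline BinnedIndex build_index(const std::int8_t* offsets, std::size_t n_obs, int n_bins) {
    BinnedIndex idx;
    idx.row_start.assign(static_cast<std::size_t>(n_bins) + 1, 0);
    for (std::size_t i = 0; i < n_obs; ++i) {
        assert(offsets[i] >= 0 && offsets[i] < n_bins);
        ++idx.row_start[static_cast<std::size_t>(offsets[i]) + 1];  // count per bin
    }
    for (std::size_t b = 0; b + 1 < idx.row_start.size(); ++b)
        idx.row_start[b + 1] += idx.row_start[b];                   // prefix sum
    idx.cols.resize(n_obs);
    std::vector<std::size_t> next(idx.row_start.begin(), idx.row_start.end() - 1);
    for (std::size_t i = 0; i < n_obs; ++i)
        idx.cols[next[static_cast<std::size_t>(offsets[i])]++] = static_cast<std::uint32_t>(i);
    return idx;
}

// Call for every new set of observations: each bin sums over one contiguous index range.
inline void accumulate_binned(const BinnedIndex& idx, const double* obs, double* acc) {
    for (std::size_t b = 0; b + 1 < idx.row_start.size(); ++b) {
        double sum = 0.0;
        for (std::size_t k = idx.row_start[b]; k < idx.row_start[b + 1]; ++k)
            sum += obs[idx.cols[k]];
        acc[b] = sum;
    }
}
```

Whether this beats the vector-of-vectors version is something to measure; the intent is only to avoid one allocation per bin and keep the index data in a single block.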

Having a hard time figuring out logic behind array manipulation

I am given a filled array of size WxH and need to create a new array by scaling both the width and the height by a power of 2. For example, 2x3 becomes 8x12 when scaled by 4, 2^2. My goal is to make sure all the old values in the array are placed in the new array such that 1 value in the old array fills up multiple new corresponding parts in the scaled array. For example:
old_array = [[1,2],
[3,4]]
becomes
new_array = [[1,1,2,2],
[1,1,2,2],
[3,3,4,4],
[3,3,4,4]]
when scaled by a factor of 2. Could someone explain to me the logic on how I would go about programming this?
It's actually very simple. I use a vector of vectors for simplicity, noting that this is not an efficient way to store a 2D matrix. Any 2D matrix class that offers [] indexing syntax can be substituted, and for efficiency it should be.
#include <vector>
using std::vector;
int main()
{
vector<vector<int>> vin{ {1,2},{3,4},{5,6} };
size_t scaleW = 2;
size_t scaleH = 3;
vector<vector<int>> vout(scaleH * vin.size(), vector<int>(scaleW * vin[0].size()));
for (size_t i = 0; i < vout.size(); i++)
for (size_t ii = 0; ii < vout[0].size(); ii++)
vout[i][ii] = vin[i / scaleH][ii / scaleW];
auto x = vout[8][3]; // last element s/b 6
}
Here is my take. It is very similar to @Tudor's but I figure between our two, you can pick what you like or understand best.
First, let's define a suitable 2D array type because C++'s standard library is very lacking in this regard. I've limited myself to a rather simple struct, in case you don't feel comfortable with object oriented programming.
#include <vector>
// using std::vector
struct Array2d
{
unsigned rows, cols;
std::vector<int> data;
};
This print function should give you an idea how the indexing works:
#include <cstdio>
// using std::putchar, std::printf, std::fputs
void print(const Array2d& arr)
{
std::putchar('[');
for(std::size_t row = 0; row < arr.rows; ++row) {
std::putchar('[');
for(std::size_t col = 0; col < arr.cols; ++col)
std::printf("%d, ", arr.data[row * arr.cols + col]);
std::fputs("]\n ", stdout);
}
std::fputs("]\n", stdout);
}
Now to the heart, the array scaling. The amount of nesting is … bothersome.
Array2d scale(const Array2d& in, unsigned rowfactor, unsigned colfactor)
{
Array2d out;
out.rows = in.rows * rowfactor;
out.cols = in.cols * colfactor;
out.data.resize(std::size_t(out.rows) * out.cols);
for(std::size_t inrow = 0; inrow < in.rows; ++inrow) {
for(unsigned rowoff = 0; rowoff < rowfactor; ++rowoff) {
std::size_t outrow = inrow * rowfactor + rowoff;
for(std::size_t incol = 0; incol < in.cols; ++incol) {
std::size_t in_idx = inrow * in.cols + incol;
int inval = in.data[in_idx];
for(unsigned coloff = 0; coloff < colfactor; ++coloff) {
std::size_t outcol = incol * colfactor + coloff;
std::size_t out_idx = outrow * out.cols + outcol;
out.data[out_idx] = inval;
}
}
}
}
return out;
}
Let's pull it all together for a little demonstration:
int main()
{
Array2d in;
in.rows = 2;
in.cols = 3;
in.data.resize(in.rows * in.cols);
for(std::size_t i = 0; i < in.rows * in.cols; ++i)
in.data[i] = static_cast<int>(i);
print(in);
print(scale(in, 3, 2));
}
This prints
[[0, 1, 2, ]
[3, 4, 5, ]
]
[[0, 0, 1, 1, 2, 2, ]
[0, 0, 1, 1, 2, 2, ]
[0, 0, 1, 1, 2, 2, ]
[3, 3, 4, 4, 5, 5, ]
[3, 3, 4, 4, 5, 5, ]
[3, 3, 4, 4, 5, 5, ]
]
To be honest, I'm incredibly bad at algorithms, but I gave it a shot.
I am not sure if this can be done using only one matrix, or if it can be done with lower time complexity.
Edit: You can estimate the number of operations this will make as W*H*S*S, where S is the scale factor, W is the width and H is the height of the input matrix.
I used 2 matrices m and r, where m is your input and r is your result/output. All that needs to be done is to copy each element from m at position [i][j] and turn it into a square of elements with the same value, of size scale_factor, inside r.
Simply put:
int main()
{
Matrix<int> m(2, 2);
// initial values in your example
m[0][0] = 1;
m[0][1] = 2;
m[1][0] = 3;
m[1][1] = 4;
m.Print();
// pick some scale factor and create the new matrix
unsigned long scale = 2;
Matrix<int> r(m.rows*scale, m.columns*scale);
// i know this is bad but it is the most
// straightforward way of doing this
// it is also the only way i can think of :(
for(unsigned long i1 = 0; i1 < m.rows; i1++)
for(unsigned long j1 = 0; j1 < m.columns; j1++)
for(unsigned long i2 = i1*scale; i2 < (i1+1)*scale; i2++)
for(unsigned long j2 = j1*scale; j2 < (j1+1)*scale; j2++)
r[i2][j2] = m[i1][j1];
// the output in your example
std::cout << "\n\n";
r.Print();
return 0;
}
I do not think it is relevant to the question, but I used a class Matrix to store all the elements of the extended matrix. I know it is a distraction, but this is still C++ and we have to manage memory. What you are trying to achieve with this algorithm needs a lot of memory if the scale factor is big, so I wrapped it up using this:
template <typename type_t>
class Matrix
{
private:
type_t** Data;
public:
// should be private and have Getters but
// that would make the code larger...
unsigned long rows;
unsigned long columns;
// 2d Arrays get big pretty fast with what you are
// trying to do.
Matrix(unsigned long rows, unsigned long columns)
{
this->rows = rows;
this->columns = columns;
Data = new type_t*[rows];
for(unsigned long i = 0; i < rows; i++)
Data[i] = new type_t[columns];
}
// It is true, a copy constructor is needed
// as HolyBlackCat pointed out
Matrix(const Matrix& m)
{
rows = m.rows;
columns = m.columns;
Data = new type_t*[rows];
for(unsigned long i = 0; i < rows; i++)
{
Data[i] = new type_t[columns];
for(unsigned long j = 0; j < columns; j++)
// m is const and operator[] is non-const, so read its storage directly
Data[i][j] = m.Data[i][j];
}
}
~Matrix()
{
for(unsigned long i = 0; i < rows; i++)
delete [] Data[i];
delete [] Data;
}
void Print()
{
for(unsigned long i = 0; i < rows; i++)
{
for(unsigned long j = 0; j < columns; j++)
std::cout << Data[i][j] << " ";
std::cout << "\n";
}
}
type_t* operator [] (unsigned long row)
{
return Data[row];
}
};
First of all, having a suitable 2D matrix class is presumed but not the question. But I don't know the API of yours, so I'll illustrate with something typical:
struct coord {
size_t x; // x position or column count
size_t y; // y position or row count
};
template <typename T>
class Matrix2D {
⋮ // implementation details
public:
⋮ // all needed special members (ctors dtor, assignment)
Matrix2D (coord dimensions);
coord dimensions() const; // return height and width
const T& cell (coord position) const; // read-only access
T& cell (coord position); // read-write access
// handy synonym:
const T& operator[](coord position) const { return cell(position); }
T& operator[](coord position) { return cell(position); }
};
I just showed the public members I need: create a matrix with a given size, query the size, and indexed access to the individual elements.
So, given that, your problem description is:
template<typename T>
Matrix2D<T> scale_pow2 (const Matrix2D<T>& input, size_t pow)
{
const auto scale_factor = size_t{1} << pow;
const auto size_in = input.dimensions();
Matrix2D<T> result ({size_in.x*scale_factor,size_in.y*scale_factor});
⋮
⋮ // fill up result
⋮
return result;
}
OK, so now the problem is precisely defined: what code goes in the big blank immediately above?
Each cell in the input gets put into a bunch of cells in the output. So you can either iterate over the input and write a clump of cells in the output all having the same value, or you can iterate over the output and each cell you need the value for is looked up in the input.
The latter is simpler since you don't need a nested loop (or pair of loops) to write a clump.
for (coord outpos : /* ?? every cell of the output ?? */) {
coord frompos {
outpos.x >> pow,
outpos.y >> pow };
result[outpos] = input[frompos];
}
Now that's simple!
Calculating the from position for a given output must match the way the scale was defined: you will have pow bits giving the position relative to this clump, and the higher bits will be the index of where that clump came from
Now, we want to set outpos to every legal position in the output matrix indexes. That's what I need. How to actually do that is another sub-problem and can be pushed off with top-down decomposition.
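With plain nested loops standing in for the position iteration, the lookup described above can be sketched on a flat row-major buffer. Grid is a hypothetical stand-in for a real matrix class, not the Matrix2D API from this answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a real matrix class: row-major cells.
struct Grid {
    std::size_t width;
    std::size_t height;
    std::vector<int> cells;  // cells[y * width + x]
};

// Scales both dimensions by 2^pow. Each output cell looks up its source cell
// by dropping the low 'pow' bits of its coordinates.
inline Grid scale_pow2(const Grid& in, unsigned pow) {
    assert(in.cells.size() == in.width * in.height);
    Grid out{in.width << pow, in.height << pow, {}};
    out.cells.resize(out.width * out.height);
    for (std::size_t y = 0; y < out.height; ++y)
        for (std::size_t x = 0; x < out.width; ++x)
            out.cells[y * out.width + x] = in.cells[(y >> pow) * in.width + (x >> pow)];
    return out;
}
```

Applied to the 2x2 input {1,2,3,4} with pow = 1, this fills the 4x4 result from the question's example.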
a bit more advanced
Maybe nested loops is the easiest way to get that done, but I won't put those directly into this code, pushing my nesting level even deeper. And looping 0..max is not the simplest thing to write in bare C++ without libraries, so that would just be distracting. And, if you're working with matrices, this is something you'll have a general need for, including (say) printing out the answer!
So here's the double-loop, put into its own code:
struct all_positions {
coord current {0,0};
coord end;
all_positions (coord end) : end{end} {}
bool next() {
if (++current.x < end.x) return true; // not reached the end yet
current.x = 0; // reset to the start of the row
if (++current.y < end.y) return true;
return false; // I don't have a valid position now.
}
};
This does not follow the iterator/collection API that you could use in a range-based for loop. For information on how to do that, see my article on Code Project or use the Ranges stuff in the C++20 standard library.
Given this "old fashioned" iteration helper, I can write the loop as:
all_positions scanner {result.dimensions()}; // starts at {0,0}
const auto& outpos= scanner.current;
do {
⋮
} while (scanner.next());
Because of the simple implementation, the helper starts at {0,0}, and advancing it also tests at the same time: next() returns false when it can't advance any more. Thus, you have to declare it (which gives the first cell), use it, then advance and test. That is a test-at-the-end loop. A for loop in C++ checks the condition before each use and advances at the end, using different functions. So making it compatible with the for loop is more work, and surprisingly, making it work with the ranged-for is not much more work on top of that. Separating out the test and the advance the right way is the real work; the rest is just naming conventions.
As long as this is "custom", you can further modify it for your needs. For example, add a flag inside to tell you when the row changed, or that it's the first or last of a row, to make it handy for pretty-printing.
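For comparison, the range-based for compatibility mentioned above can be sketched with a minimal iterator. This is an illustration only: position_range is a hypothetical name, and it provides just the three operations a range-based for loop needs (dereference, pre-increment and inequality) rather than a full standard iterator.

```cpp
#include <cassert>
#include <cstddef>

struct coord {
    std::size_t x;  // x position or column count
    std::size_t y;  // y position or row count
};

// Hypothetical range over every position of a dims.x by dims.y matrix, row by row.
class position_range {
public:
    class iterator {
    public:
        iterator(coord pos, std::size_t width) : pos_(pos), width_(width) {}
        coord operator*() const { return pos_; }
        iterator& operator++() {
            assert(width_ != 0);       // an empty range must never be advanced
            if (++pos_.x == width_) {  // end of the row: wrap to the next one
                pos_.x = 0;
                ++pos_.y;
            }
            return *this;
        }
        bool operator!=(const iterator& other) const {
            return pos_.x != other.pos_.x || pos_.y != other.pos_.y;
        }
    private:
        coord pos_;
        std::size_t width_;
    };

    explicit position_range(coord dims) : dims_(dims) {}
    iterator begin() const { return iterator(coord{0, 0}, dims_.x); }
    // One past the last cell is the start of row dims.y; a zero-width range ends at {0,0}.
    iterator end() const {
        const std::size_t rows = (dims_.x == 0) ? 0 : dims_.y;
        return iterator(coord{0, rows}, dims_.x);
    }
private:
    coord dims_;
};
```

A loop such as for (coord outpos : position_range(dims)) then visits each position once, with the test and the advance separated as the for loop expects.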
summary
You need a bunch of things working in addition to the little piece of code you actually want to write. Here, it's a usable Matrix class. Very often, it's prompting for input, opening files, handling command-line options, and that kind of stuff. It distracts from the real problem, so get that out of the way first.
Write your code (the real code you came for) in its own function, separate from any other stuff you also need in order to house it. Get it elsewhere if you can; it's not part of the lesson and just serves as a distraction. Worse, it may be "hard" in ways you are not prepared for (or to do well) as it's unrelated to the actual lesson being worked on.
Figure out the algorithm (flowchart, pseudocode, whatever) in a general way before translating that to legal syntax and API on the objects you are using. If you're just learning C++, don't get bogged down in the formal syntax when you are trying to figure out the logic. Until you naturally start to think in C++ when doing that kind of planning, don't force it. Use whiteboard doodles, tinkertoys, whatever works for you.
Get feedback and review of the idea, the logic of how to make it happen, from your peers and mentors if available, before you spend time coding. Why write up an idea that doesn't work? Fix the logic, not the code.
Finally, sketch the needed control flow, functions and data structures you need. Use pseudocode and placeholder notes.
Then fill in the placeholders and replace the pseudo with the legal syntax. You already planned it out, so now you can concentrate on learning the syntax and library details of the programming language. You can concentrate on "how do I express (some tiny detail) in C++" rather than keeping the entire program in your head. More generally, isolate a part that you will be learning; be learning/practicing one thing without worrying about the entire edifice.
To a large extent, some of those ideas translate to the code as well. Top-Down Design means you state things at a high level and then implement that elsewhere, separately. It makes code readable and maintainable, as well as easier to write in the first place. Functions should be written this way: the function explains how to do (what it does) as a list of details that are just one level of detail further down. Each of those steps then becomes a new function. Functions should be short and expressed at one semantic level of abstraction. Don't dive down into the most primitive details inside the function that explains the task as a set of simpler steps.
Good luck, and keep it up!

unexpected results with word2vec algorithm

I implemented word2vec in c++.
I found the original syntax to be unclear, so I figured I'd re-implement it, using all the benefits of c++ (std::map, std::vector, etc)
This is the method that actually gets called every time a sample is trained (l1 denotes the index of the first word, l2 the index of the second word, label indicates whether it is a positive or negative sample, and neu1e acts as the accumulator for the gradient)
void train(int l1, int l2, double label, std::vector<double>& neu1e)
{
// Calculate the dot-product between the input words weights (in
// syn0) and the output word's weights (in syn1neg).
auto f = 0.0;
for (int c = 0; c < m__numberOfFeatures; c++)
f += syn0[l1][c] * syn1neg[l2][c];
// This block does two things:
// 1. Calculates the output of the network for this training
// pair, using the expTable to evaluate the output layer
// activation function.
// 2. Calculate the error at the output, stored in 'g', by
// subtracting the network output from the desired output,
// and finally multiply this by the learning rate.
auto z = 1.0 / (1.0 + exp(-f));
auto g = m_learningRate * (label - z);
// Multiply the error by the output layer weights.
// (I think this is the gradient calculation?)
// Accumulate these gradients over all of the negative samples.
for (int c = 0; c < m__numberOfFeatures; c++)
neu1e[c] += (g * syn1neg[l2][c]);
// Update the output layer weights by multiplying the output error
// by the hidden layer weights.
for (int c = 0; c < m__numberOfFeatures; c++)
syn1neg[l2][c] += g * syn0[l1][c];
}
This method gets called by
void train(const std::string& s0, const std::string& s1, bool isPositive, std::vector<double>& neu1e)
{
auto l1 = m_wordIDs.find(s0) != m_wordIDs.end() ? m_wordIDs[s0] : -1;
auto l2 = m_wordIDs.find(s1) != m_wordIDs.end() ? m_wordIDs[s1] : -1;
if(l1 == -1 || l2 == -1)
return;
train(l1, l2, isPositive ? 1 : 0, neu1e);
}
which in turn gets called by the main training method.
Full code can be found at
https://github.com/jorisschellekens/ml/tree/master/word2vec
With complete example at
https://github.com/jorisschellekens/ml/blob/master/main/example_8.hpp
When I run this algorithm, the top 10 words 'closest' to father are:
father
Khan
Shah
forgetful
Miami
rash
symptoms
Funeral
Indianapolis
impressed
This is the method to calculate the nearest words:
std::vector<std::string> nearest(const std::string& s, int k) const
{
// calculate distance
std::vector<std::tuple<std::string, double>> tmp;
for(auto &t : m_unigramFrequency)
{
tmp.push_back(std::make_tuple(t.first, distance(t.first, s)));
}
// sort
std::sort(tmp.begin(), tmp.end(), [](const std::tuple<std::string, double>& t0, const std::tuple<std::string, double>& t1)
{
return std::get<1>(t0) < std::get<1>(t1);
});
// take top k
std::vector<std::string> out;
for(int i=0; i<k; i++)
{
out.push_back(std::get<0>(tmp[tmp.size() - 1 - i]));
}
// return
return out;
}
Which seems weird.
Is something wrong with my algorithm?
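Are you sure that you get the "nearest" words (not the farthest)? The vector is sorted in ascending order of distance, but the loop below takes its k elements from the end, which would be the largest distances: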
...
// take top k
std::vector<std::string> out;
for(int i=0; i<k; i++)
{
out.push_back(std::get<0>(tmp[tmp.size() - 1 - i]));
}
...
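If distance() really returns smaller values for closer words, taking the k smallest entries could look like the sketch below. nearest_k is a hypothetical helper, not part of the linked repository; for a similarity score such as cosine similarity, where larger means closer, the comparison would need to be flipped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns the k words with the smallest score, closest first.
// Assumes smaller values mean "closer".
inline std::vector<std::string> nearest_k(
    std::vector<std::pair<std::string, double>> scored, std::size_t k) {
    k = std::min(k, scored.size());
    // Only the first k positions need to end up sorted.
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end(),
                      [](const std::pair<std::string, double>& a,
                         const std::pair<std::string, double>& b) {
                          return a.second < b.second;
                      });
    std::vector<std::string> out;
    out.reserve(k);
    for (std::size_t i = 0; i < k; ++i)
        out.push_back(scored[i].first);
    return out;
}
```

Clamping k to the number of scored words also avoids indexing past the end when the vocabulary is smaller than k.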

how to eliminate the "doubled" elements of a vector in c++

I'm using HoughLines to detect lines in a frame. The line information is saved in a cv::vector<cv::Vec2f>, which I handle as a two-dimensional array. I'm interested in the second component, which is the angle of the line. I want to keep only the lines whose angles differ by more than 1.5 rad. Here is what I did:
.............................
cv::vector<cv::Vec2f> lineQ;
..............................
// ordering the vector based on the angle value in rad
for ( int i = 0 ; i< lineQ.size()-1; i++){
for(int j= i+1;j<lineQ.size();j++){
if(lineQ[i][1] > lineQ[j][1]){
tmp = lineQ[i];
lineQ[i] = lineQ[j];
lineQ[j] = tmp;
}
}
}
Now I want to compare the vector elements with each other based on the angle:
cv::vector<cv::Vec2f> line;
for ( int i = 0 ; i< lineQ.size()-1; i++){
for ( int j= i+1; j<lineQ.size(); j++){
if(fabs(lineQ[i][1] - lineQ[j][1])>1.5){
line.push_back(lineQ[i]);
}
}
}
This works for 2 lines, but when I get 3 with, let's say, 1.3 rad as an angle, the size of line
is then 2. I thought of using erase, but this changes the size of my vector!
One option is to supply a soft "equals" to std::unique_copy:
std::unique_copy(lineQ.begin(), lineQ.end(), std::back_inserter(line),
[](const cv::Vec2f & a, const cv::Vec2f & b) {
return b[1] - a[1] <= 1.5;
});
Sidenote: You can also avoid the effort of writing your own sort (Bubble sort is just about the worst choice.) and use the standard library. Something like this ought to work:
std::sort(lineQ.begin(), lineQ.end(),
[](const cv::Vec2f & a, const cv::Vec2f & b) {
return a[1] < b[1];
});
(The above code assumes C++11, which most of us have by now. If you're stuck on an earlier version, you can write a couple of functor classes instead.)
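The same filtering can also be written as an explicit loop, which makes the rule visible: after sorting by angle, a line is kept only if its angle exceeds the last kept angle by more than the threshold. This is a sketch under assumptions: Line is a hypothetical stand-in for cv::Vec2f so that the example compiles without OpenCV, and keep_distinct_angles is a made-up name. An explicit loop also sidesteps the fact that the standard describes the predicate of std::unique_copy as an equivalence relation, which a tolerance-based comparison is not in general.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for cv::Vec2f: (rho, theta) of one detected line.
struct Line {
    float rho;
    float theta;  // angle in radians
};

// Sorts by angle, then keeps a line only if its angle exceeds the
// last kept angle by more than min_gap.
inline std::vector<Line> keep_distinct_angles(std::vector<Line> lines, float min_gap) {
    assert(min_gap >= 0.0f);
    std::sort(lines.begin(), lines.end(),
              [](const Line& a, const Line& b) { return a.theta < b.theta; });
    std::vector<Line> kept;
    for (const Line& l : lines)
        if (kept.empty() || l.theta - kept.back().theta > min_gap)
            kept.push_back(l);
    return kept;
}
```

The input is taken by value, so the caller's vector keeps its size and order, which was the concern with erase.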