Fast integer matrix multiplication with bit-twiddling hacks

Fast integer matrix multiplication with bit-twiddling hacks - c++

I am asking if it is possible to improve considerably integer matrix multiplication with bitwise operations. The matrices are small, and the elements are small nonnegative integers (small means at most 20).
To keep us focused, let's be extremely specific, and say that I have two 3x3 matrices, with integer entries 0<=x<15.
The following naive C++ implementation executed a million times performs around 1s, measured with linux time.
#include <random>
int main() {
//Random number generator
std::random_device rd;
std::mt19937 eng(rd());
std::uniform_int_distribution<> distr(0, 15);
int A[3][3];
int B[3][3];
int C[3][3];
for (int trials = 0; trials <= 1000000; trials++) {
//Set up A[] and B[]
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
A[i][j] = distr(eng);
B[i][j] = distr(eng);
C[i][j] = 0;
}
}
//Compute C[]=A[]*B[]
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
for (int k = 0; k < 3; ++k) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
}
return 0;
}
Notes:
The matrices are not necessarily sparse.
Strassen-like comments does not help here.
Let's try not to use the circumstantial observation, that in this specific problem the matrices A[] and B[] can be encoded as a single 64 bit integer. Think of what would happen for just a bit larger matrices.
Computation is single-threaded.
Related: Binary matrix multiplication bit twiddling hack and What is the optimal algorithm for the game 2048?

The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.
For adding 2-bit elements, it might be plausible (and faster than unpacking) to add basically from scratch, with XOR (carryless-add), then generate the carry with AND, shift, and mask off carry across element boundaries.
A 3rd bit would require detecting when adding the carry produces yet another carry. I don't think it would be a win to emulating even a 3 bit adder or multiplier, compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For add, you might try using a normal add and then try to undo the carry between element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.
packed vs. unpacked-to-bytes storage formats
If you have very many of these tiny matrices, storing them in memory in compressed form (e.g. packed 4bit elements) can help with cache footprint / memory bandwidth. 4bit elements are fairly easy to unpack to having each element in a separate byte element of a vector.
Otherwise, store them with one matrix element per byte. From there, you can easily unpack them to 16bit or 32bit per element if needed, depending on what element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format to reuse across multiplies, but pack them back into 4bits per element for storage in an array.
Compilers suck at this with uint8_t in scalar C code for x86. See comments on #Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output for a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage that leaves outside the low 8 bits of the destination register.
The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.
You're probably going to get best results from some kind of SIMD.
Intel SSSE3 has a vector byte multiply, but only with adding of adjacent elements. Using it would require unpacking your matrix into a vector with some zeros between rows or something, so you don't get data from one row mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.
More likely to be useful is SSE2 PMADDWD, if you unpack to each matrix element in a separate 16bit vector element. So given a row in one vector, and a transposed-column in another vector, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].
Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.

You may find that reducing the data size gives you a considerable performance improvement if you are performing this calculation over a large number of matrices:
#include <cstdint>
#include <cstdlib>
using T = std::uint_fast8_t;
void mpy(T A[3][3], T B[3][3], T C[3][3])
{
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
for (int k = 0; k < 3; ++k) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
}
The pentium can move and sign-extend an 8-bit value in one instruction. This means you're getting 4 times as many matricies per cache line.
UPDATE: curiosity piqued, I wrote a test:
#include <random>
#include <utility>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <typeinfo>
template<class T>
struct matrix
{
static constexpr std::size_t rows = 3;
static constexpr std::size_t cols = 3;
static constexpr std::size_t size() { return rows * cols; }
template<class Engine, class U>
matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
: matrix(std::make_index_sequence<size()>(), engine, dist)
{}
template<class U>
matrix(std::initializer_list<U> li)
: matrix(std::make_index_sequence<size()>(), li)
{
}
matrix()
: _data { 0 }
{}
const T* operator[](std::size_t i) const {
return std::addressof(_data[i * cols]);
}
T* operator[](std::size_t i) {
return std::addressof(_data[i * cols]);
}
private:
template<std::size_t...Is, class U, class Engine>
matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
: _data { (void(Is), dist(eng))... }
{}
template<std::size_t...Is, class U>
matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
: _data { ((Is < li.size()) ? *(li.begin() + Is) : 0)... }
{}
T _data[rows * cols];
};
template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
matrix<T> C;
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 3; ++j) {
for (int k = 0; k < 3; ++k) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
return C;
}
static constexpr std::size_t test_size = 1000000;
template<class T, class Engine>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<T>& dist)
{
v.clear();
v.reserve(test_size);
generate_n(std::back_inserter(v), test_size,
[&] { return matrix<T>(eng, dist); });
}
template<class T>
void test(std::random_device& rd)
{
std::mt19937 eng(rd());
std::uniform_int_distribution<T> distr(0, 15);
std::vector<matrix<T>> As, Bs, Cs;
fill(As, eng, distr);
fill(Bs, eng, distr);
fill(Cs, eng, distr);
auto start = std::chrono::high_resolution_clock::now();
auto ia = As.cbegin();
auto ib = Bs.cbegin();
for (auto&m : Cs)
{
m = *ia++ * *ib++;
}
auto stop = std::chrono::high_resolution_clock::now();
auto diff = stop - start;
auto millis = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
std::cout << "for type " << typeid(T).name() << " time is " << millis << "us" << std::endl;
}
int main() {
//Random number generator
std::random_device rd;
test<std::uint64_t>(rd);
test<std::uint32_t>(rd);
test<std::uint16_t>(rd);
test<std::uint8_t>(rd);
}
example output (recent macbook pro, 64-bit, compiled with -O3)
for type y time is 32787us
for type j time is 15323us
for type t time is 14347us
for type h time is 31550us
summary:
on this platform, int32 and int16 proved to be as fast as each other. int64 and int8 were equally slow (the 8-bit result surprised me).
conclusion:
As ever, express intent to the compiler and let the optimiser do its thing. If the program is running too slowly in production, take measurements and optimise the worst-offenders.

Related

Hint the compiler that float-vector count is divisible by 8?

static inline void R1_sub_R0(float *vec, size_t cnt, float toSubtract){
for(size_t i=0; cnt; ++i){
vec[i] -= toSubtract;
}
}
I know that cnt will always be divisible by 8, therefore the code could be vectorized via SSE and AVX. In other words, we can iterate over *vec as a __m256 type.
But compiler will probably not know this. How to re-assure the compiler that this count is guaranteed to be divisible by 8?
Will something like this help it? (if we stick it at the start of the function)
assert(((cnt*sizeof(float)) % sizeof(__m256)) ==0 ); //checks that it's "multiple of __m256 type".
Of course, I could have simply written the whole thing as a vectorized code:
static inline void R1_sub_R0(float *vec, size_t cnt, float toSubtract){
assert(cnt*sizeof(float) % sizeof(__m256) == 0);//check that it's "multiple of __m256 type".
assert(((uintptr_t)(const void *)(POINTER)) % (16) == 0);//assert that 'vec' is 16-byte aligned
__m256 sToSubtract = _mm256_set1_ps(toSubtract);
__m256 *sPtr = (__m256*)vec;
const __m256 *sEnd = (const __m256*)(vec+cnt);
for(sPtr; sPtr != sEnd; ++sPtr){
*sPtr = _mm256_sub_ps(*sPtr, sToSubtract);
}
}
However, it runs 10% slower than the original version.
So I just want to give the compiler extra bit of information. That way it could vectorize the code even more efficiently.

Hint the compiler that float-vector count is divisible by 8?
You could semi-unroll the loop by nesting another:
for(size_t i=0; i < cnt; i += 8){
for(size_t j=0; j < 8; j++){
vec[i + j] -= toSubtract;
}
}
The compiler can easily see that the inner loop has constant iterations and can unroll it and potentially use SIMD if it so chooses.
Hint the compiler that float-vector count is [16-byte aligned]?
This is quite a bit more tricky.
You could use something like:
struct alignas(16) sse {
float arr[8];
};
// cnt is now number of structs which is 8th fraction of original cnt
R1_sub_R0(sse *vec, size_t cnt, float toSubtract) {
for(size_t i=0; i < cnt; i ++){
for(size_t j=0; j < 8; j++){
vec[i].arr[j] -= toSubtract;
}
}
Other than that, there are compiler extensions such as __builtin_assume_aligned that can be used with the plain float array.

I would not try to microoptimize here, the compiler will generate the fastest code if you just communicate to it what you want to be done, not how you want it to be done. Therefore i would use std::transform.
std::transform(vec, vec + cnt, vec,
[toSubstract](float f){return f - toSubstract;});

C++ - Efficiently computing a vector-matrix product

I need to compute a product vector-matrix as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector and a class Matrix which also wraps a std::vector (for efficiency).
The naive approach (the one that I am using at the moment) is to have something like
Vector<T> timesMatrix(Matrix<T>& matrix)
{
Vector<unsigned int> result(matrix.columns());
// constructor that does a resize on the underlying std::vector
for(unsigned int i = 0 ; i < vector.size() ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
result[j] += (vector[i] * matrix.getElementAt(i, j));
// getElementAt accesses the appropriate entry
// of the underlying std::vector
}
}
return result;
}
It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.
The next step was trying to parallelize the computation: if I have N threads then I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread will compute a 499-sized Vector and the final result will be their entry-wise sum.
First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:
Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
unsigned int rowsToExtract = end - start + 1;
std::vector<T> tmp;
tmp.reserve(rowsToExtract * numColumns);
for(unsigned int i = start * numColumns ; i < (end+1) * numColumns ; ++i)
{
tmp.push_back(matrix[i]);
}
return Matrix<T>(rowsToExtract, numColumns, tmp);
}
Then I defined a thread routine
void timesMatrixThreadRoutine
(Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
// newRow is supposed to contain the partial result
// computed by a thread
newRow.resize(matrix.columns());
for(unsigned int i = start ; i < end + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
}
}
}
And finally I modified the code of the timesMatrix method that I showed above:
Vector<T> timesMatrix(Matrix<T>& matrix)
{
static const unsigned int NUM_THREADS = 4;
unsigned int matRows = matrix.rows();
unsigned int matColumns = matrix.columns();
unsigned int rowsEachThread = vector.size()/NUM_THREADS;
std::thread threads[NUM_THREADS];
Vector<T> tmp[NUM_THREADS];
unsigned int start, end;
// all but the last thread
for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
{
start = i*rowsEachThread;
end = (i+1)*rowsEachThread - 1;
threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
}
// last thread
start = (NUM_THREADS-1)*rowsEachThread;
end = matRows - 1;
threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS-1]));
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
threads[i].join();
}
Vector<unsigned int> result(matColumns);
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
result = result + tmp[i]; // the operator+ is overloaded
}
return result;
}
It still works but now it takes nearly 30000 microseconds, which is almost three times as much as before.
Am I doing something wrong? Do you think there is a better approach?
EDIT - using a "lightweight" VirtualMatrix
Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as
VirtualMatrix(Matrix<T>& m)
{
numRows = m.rows();
numColumns = m.columns();
matrixData = m.pointerToData();
// pointerToData() returns underlyingVector.data();
}
Then there is a method to retrieve a specific entry of the matrix:
inline T getElementAt(unsigned int row, unsigned int column)
{
return *(matrixData + row*numColumns + column);
}
Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now
void timesMatrixThreadRoutine
(VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
unsigned int matColumns = matrix.columns();
newRow.resize(matColumns);
for(unsigned int i = startRow ; i < endRow + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matColumns ; ++j)
{
newRow[j] += (vector[i] * matrix.getElementAt(i, j));
}
}
}
and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong but is "computed" in less than 500 microseconds. This to say that now passing the arguments takes almost no time and the heavy part is really the computation.
According to you, is there any way to make it even faster?

Actually you make a partial copy of matrix for each thread in extractSomeRows. It takes a lot of time.
Redesign it so that "some rows" become virtual matrix pointing at data located in original matrix.

Use vectorized assembly instructions for an architecture by making it more explicit that you want to multiply in 4's, i.e. for the x86-64 SSE2+ and possibly ARM'S NEON.
C++ compilers can often unroll the loop into vectorized code if you explicitly make an operation happen in contingent elements:
Simple and fast matrix-vector multiplication in C / C++
There is also the option of using libraries specifically made for matrix multipication. For larger matrices, it may be more efficient to use special implementations based on the Fast Fourier Transform, alternate algorithms like Strassen's Algorithm, etc. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.

c++ speed comparison iterator vs index

I am currently writing a linalg library in c++, for educational purposes and personal use. As part of it I implemented a custom matrix class with custom row and column iterators. While providing very nice feature to work with std::algorithm and std::numeric functions, I performed a speed comparison for a matrix multiplication between an index and iterator/std::inner_product approach. The results differ significantly:
// used later on for the custom iterator
template<class U>
struct EveryNth {
bool operator()(const U& ) { return m_count++ % N == 0; }
EveryNth(std::size_t i) : m_count(0), N(i) {}
EveryNth(const EveryNth& element) : m_count(0), N(element.N) {}
private:
int m_count;
std::size_t N;
};
template<class T,
std::size_t rowsize,
std::size_t colsize>
class Matrix
{
private:
// Data is stored in a MVector, a modified std::vector
MVector<T> matrix;
std::size_t row_dim;
std::size_t column_dim;
public:
// other constructors, this one is for matrix in the computation
explicit Matrix(MVector<T>&& s): matrix(s),
row_dim(rowsize),
column_dim(colsize){
}
// other code...
typedef boost::filter_iterator<EveryNth<T>,
typename std::vector<T>::iterator> FilterIter;
// returns an iterator that skips elements in a range
// if "to" is to be specified, then from has to be set to a value
// # param "j" - j'th column to be requested
// # param "from" - starts at the from'th element
// # param "to" - goes from the from'th element to the "to'th" element
FilterIter begin_col( std::size_t j,
std::size_t from = 0,
std::size_t to = rowsize ){
return boost::make_filter_iterator<EveryNth<T> >(
EveryNth<T>( cols() ),
matrix.Begin() + index( from, j ),
matrix.Begin() + index( to, j )
);
}
// specifies then end of the iterator
// so that the iterator can not "jump" past the last element into undefines behaviour
FilterIter end_col( std::size_t j,
std::size_t to = rowsize ){
return boost::make_filter_iterator<EveryNth<T> >(
EveryNth<T>( cols() ),
matrix.Begin() + index( to, j ),
matrix.Begin() + index( to, j )
);
}
FilterIter begin_row( std::size_t i,
std::size_t from = 0,
std::size_t to = colsize ){
return boost::make_filter_iterator<EveryNth<T> >(
EveryNth<T>( 1 ),
matrix.Begin() + index( i, from ),
matrix.Begin() + index( i, to )
);
}
FilterIter end_row( std::size_t i,
std::size_t to = colsize ){
return boost::make_filter_iterator<EveryNth<T> >(
EveryNth<T>( 1 ),
matrix.Begin() + index( i, to ),
matrix.Begin() + index( i, to )
);
}
// other code...
// allows to access an element of the matrix by index expressed
// in terms of rows and columns
// # param "r" - r'th row of the matrix
// # param "c" - c'th column of the matrix
std::size_t index(std::size_t r, std::size_t c) const {
return r*cols()+c;
}
// brackets operator
// return an elements stored in the matrix
// # param "r" - r'th row in the matrix
// # param "c" - c'th column in the matrix
T& operator()(std::size_t r, std::size_t c) {
assert(r < rows() && c < matrix.size() / rows());
return matrix[index(r,c)];
}
const T& operator()(std::size_t r, std::size_t c) const {
assert(r < rows() && c < matrix.size() / rows());
return matrix[index(r,c)];
}
// other code...
// end of class
};
Now in the main function in run the following:
int main(int argc, char *argv[]){
Matrix<int, 100, 100> a = Matrix<int, 100, 100>(range<int>(10000));
std::clock_t begin = clock();
double b = 0;
for(std::size_t i = 0; i < a.rows(); i++){
for (std::size_t j = 0; j < a.cols(); j++) {
std::inner_product(a.begin_row(i), a.end_row(i),
a.begin_column(j),0);
}
}
// double b = 0;
// for(std::size_t i = 0; i < a.rows(); i++){
// for (std::size_t j = 0; j < a.cols(); j++) {
// for (std::size_t k = 0; k < a.rows(); k++) {
// b += a(i,k)*a(k,j);
// }
// }
// }
std::clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
std::cout << elapsed_secs << std::endl;
std::cout << "--- End of test ---" << std::endl;
std::cout << std::endl;
return 0;
}
For the std::inner_product/iterator approach it takes:
bash-3.2$ ./main
3.78358
--- End of test ---
and for the index (// out) approach:
bash-3.2$ ./main
0.106173
--- End of test ---
which is almost 40 times faster then the iterator approach. Do you see anything in the code that could slow down iterator computation that much? I should mention that I tried out both methods and they produce correct results.
Thank you for your ideas.

What you have to understand is that matrix operations are VERY well-understood, and compilers are VERY good at doing optimizations of the things that are involved in matrix operations.
Consider C = AB, where C is MxN, A is MxQ, B is QxN.
double a[M][Q], b[Q][N], c[M][N];
for(unsigned i = 0; i < M; i++){
for (unsigned j = 0; j < N; j++) {
double temp = 0.0;
for (unsigned k = 0; k < Q; k++) {
temp += a[i][k]*b[k][j];
}
c[i][j] = temp;
}
}
(You would not believe how tempted I was to write the above in FORTRAN IV.)
The compiler looks at this, and notices that what is really happening is that he is walking through a and c with a stride of 1 and b with a stride of Q. He eliminates the multiplications in the subscript calculations and does straight indexing.
At that point, the inner loop is of the form:
temp += a[r1] * b[r2];
r1 += 1;
r2 += Q;
And you have loops around that to (re)initialize r1 and r2 for each pass.
That is the absolute minimum computation you can do to do a straightforward matrix multiplication. You cannot do any less than that, because you have to do those multiplications and additions and index adjustments.
All you can do is add overhead.
That's what the iterator and std::inner_product() approach does: it adds metric tonnes of overhead.

This is just some additional information and general advice for low-level code optimization.
To conclusively find out where time is spent in low-level code (tight loops and hotspots),
You must be able to implement multiple versions of the code for computing the same result, using different implementation strategies.
You will need broad mathematical and computational knowledge to do this.
You must inspect the disassembly (machine code).
You must also run your code under an instruction-level sampling profiler to see which part of the machine code is executed the most heavily (i.e. the hotspots).
In order to collect a sufficient number of profiler samples, you will need to run the code in a tight loop, in millions or billions of times.
You must compare the disassembly of the hotspots between different versions of code (from different implementation strategies).
Based on the information above, you can arrive at the conclusion that some implementation strategies are less efficient (more wasteful or redundant) than others.
If you arrive at this step, you can now publish and share your findings with others.
Some possibilities:
Using boost::filter_iterator to implement an iterator that skips every N element is wasteful. The internal implementation must increment by one at a time. If N is large, visiting the next element via boost::filter_iterator becomes an O(N) operation, as opposed to a simple iterator arithmetic which would be an O(1) operation.
Your boost::filter_iterator implementation uses the modulo operator. Although integer division and modulo operations are fast on modern CPUs, it is still not as fast as simple integer arithmetic.
Simply speaking,
Increments, decrements, additions and subtractions are fastest, for both integers and floating point.
Multiplication and bit shifts are slightly slower.
Divisions and modulo operations will bet yet slower.
Finally, floating point trigonometric and transcendental functions, especially those that require calling out to standard mathematical library functions, will be the slowest.

Optimized way to find M largest elements in an NxN array using C++

I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
right now I'm doing this:
struct SourcePoint {
Point point;
float value;
}
SourcePoint* maxValues = new SourcePoint[ M ];
maxCoefficients = new SourcePoint*[
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
if (sample > maxValues[0].value) {
int q = 1;
while ( sample > maxValues[q].value && q < M ) {
maxValues[q-1] = maxValues[q]; // shuffle the values back
q++;
}
maxValues[q-1].value = sample;
maxValues[q-1].point = Point(i,j);
}
}
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encoutered so far. This gives us a quick and easy bailout if sample <= maxValues, we don't do anything. The issue I'm having is the shuffling every time a new better value is found. It works its way all the way down maxValues until it finds it's spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm since this is called many many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)

A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very little bit (not as much as I'd thought there would be).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Look at how the data is created/stored/organized may give you patterns or information to reduce the amount of time to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
int x;
int y;
float value;
int test; //Play with manual/compiler padding if needed
};

If you want to go into micro-optimizations at this point, the a simple first step should be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to being a power of two long, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?

(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think OPs insertion sort probably has decent real world performance (at least when the recommendations of the other posteres in this thread are taken into account).
Look-up in the case of failure should be constant time: If the current element is less than the minimum element of the heap (containing the max M elements) we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
T nodes[size];
static const unsigned last = size - 1;
static unsigned parent(unsigned i) { return (i - 1) / 2; }
static unsigned left(unsigned i) { return i * 2; }
static unsigned right(unsigned i) { return i * 2 + 1; }
void bubble_down(unsigned int i) {
for (;;) {
unsigned j = i;
if (left(i) < size && nodes[left(i)] < nodes[i])
j = left(i);
if (right(i) < size && nodes[right(i)] < nodes[j])
j = right(i);
if (i != j) {
swap(nodes[i], nodes[j]);
i = j;
} else {
break;
}
}
}
void bubble_up(unsigned i) {
while (i > 0 && nodes[i] < nodes[parent(i)]) {
swap(nodes[parent(i)], nodes[i]);
i = parent(i);
}
}
public:
m_heap() {
for (unsigned i = 0; i < size; i++) {
nodes[i] = numeric_limits<T>::min();
}
}
void add(const T& x) {
if (x < nodes[0]) {
// reject outright
return;
}
nodes[0] = x;
swap(nodes[0], nodes[last]);
bubble_down(0);
}
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
using namespace std;
// INCLUDE TEMPLATED CLASS FROM ABOVE
typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }
int main()
{
int N = 2000;
vf v;
for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX);
static const int M = 50;
m_heap<M, float> h;
for (int i = 0; i < N; i++) h.add( v[i] );
sort(v.begin(), v.end(), compare);
vf heap(h.get(), h.get() + M); // assume public in m_heap: T* get() { return nodes; }
sort(heap.begin(), heap.end(), compare);
cout << "Real\tFake" << endl;
for (int i = 0; i < M; i++) {
cout << v[i] << "\t" << heap[i] << endl;
if (fabs(v[i] - heap[i]) > 1e-5) abort();
}
}

You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.

A quick optimization would be to add a sentinel value to yourmaxValues array. If you have maxValues[M].value equal to std::numeric_limits<float>::max() then you can eliminate the q < M test in your while loop condition.

One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try - if it works good enough, you don't have as much "magic". In particular, you don't resort to micro optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
static const int M = 15;
static const int N = 20;
// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
Sample( float *arr, size_t row, size_t col )
: m_arr( arr ),
m_row( row ),
m_col( col )
{
}
inline operator float() const {
return m_arr[m_row * N + m_col];
}
bool operator<( const Sample &rhs ) const {
return (float)other < (float)*this;
}
int row() const {
return m_row;
}
int col() const {
return m_col;
}
private:
float *m_arr;
size_t m_row;
size_t m_col;
};
int main()
{
// Setup a demo array
float arr[N][N];
memset( arr, 0, sizeof( arr ) );
// Put in some sample values
arr[2][1] = 5.0;
arr[9][11] = 2.0;
arr[5][4] = 4.0;
arr[15][7] = 3.0;
arr[12][19] = 1.0;
// Setup the sequence of references into this array; you could keep
// a copy of this sequence around to reuse it later, I think.
std::vector<Sample> samples;
samples.reserve( N * N );
for ( size_t row = 0; row < N; ++row ) {
for ( size_t col = 0; col < N; ++col ) {
samples.push_back( Sample( (float *)arr, row, col ) );
}
}
// Let partial_sort find the M largest entry
std::partial_sort( samples.begin(), samples.begin() + M, samples.end() );
// Print out the row/column of the M largest entries.
for ( std::vector<Sample>::size_type i = 0; i < M; ++i ) {
std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
}
}

First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
for (int j = 0; j < rows; j++) {
float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.

You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.

I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.

Use a linked list to store the best yet M values. You'll still have to iterate over it to find the right spot, but the insertion is O(1). It would probably even be better than binary search and insertion O(N)+O(1) vs O(lg(n))+O(N).
Interchange the fors, so you're not accessing every N element in memory and trashing the cache.
LE: Throwing another idea that might work for uniformly distributed values.
Find the min, max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix place it in bucket[(int)(value-min)/range], range=max-min.
Finally create a set starting from the highest bucket to the lowest, add elements from other buckets to it while |current set| + |next bucket| <=M.
If you get M elements you're done.
You'll likely get less elements than M, let's say P.
Apply your algorithm for the remaining bucket and get biggest M-P elements out of it.
If elements are uniform and you use N^2 buckets it's complexity is about 3.5*(N^2) vs your current solution which is about O(N^2)*ln(M).

How to speed up matrix multiplication in C++?

I'm performing matrix multiplication with this simple algorithm. To be more flexible I used objects for the matricies which contain dynamicly created arrays.
Comparing this solution to my first one with static arrays it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.
matrix mult_std(matrix a, matrix b) {
matrix c(a.dim(), false, false);
for (int i = 0; i < a.dim(); i++)
for (int j = 0; j < a.dim(); j++) {
int sum = 0;
for (int k = 0; k < a.dim(); k++)
sum += a(i,k) * b(k,j);
c(i,j) = sum;
}
return c;
}
EDIT
I corrected my Question avove! I added the full source code below and tried some of your advices:
swapped k and j loop iterations -> performance improvement
declared dim() and operator()() as inline -> performance improvement
passing arguments by const reference -> performance loss! why? so I don't use it.
The performance is now nearly the same as it was in the old porgram. Maybe there should be a bit more improvement.
But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW
c99 main.c -o matrix -O3
NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr
g++ main.cpp matrix.cpp -o matrix -O3.
EDIT
Here are some results. Comparison between standard algorithm (std), swapped order of j and k loop (swap) and blocked algortihm with block size 13 (block).

Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:
matrix mult_std(matrix a, matrix b) {
matrix c(a.dim(), false, false);
for (int i = 0; i < a.dim(); i++)
for (int k = 0; k < a.dim(); k++)
for (int j = 0; j < a.dim(); j++) // swapped order
c(i,j) += a(i,k) * b(k,j);
return c;
}
That's because a k index on the inner-most loop will cause a cache miss in b on every iteration. With j as the inner-most index, both c and b are accessed contiguously, while a stays put.

Make sure that the members dim() and operator()() are declared inline, and that compiler optimization is turned on. Then play with options like -funroll-loops (on gcc).
How big is a.dim() anyway? If a row of the matrix doesn't fit in just a couple cache lines, you'd be better off with a block access pattern instead of a full row at-a-time.

You say you don't want to modify the algorithm, but what does that mean exactly?
Does unrolling the loop count as "modifying the algorithm"? What about using SSE/VMX whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?
If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.
Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.

Use SIMD if you can. You absolutely have to use something like VMX registers if you do extensive vector math assuming you are using a platform that is capable of doing so, otherwise you will incur a huge performance hit.
Don't pass complex types like matrix by value - use a const reference.
Don't call a function in each iteration - cache dim() outside your loops.
Although compilers typically optimize this efficiently, it's often a good idea to have the caller provide a matrix reference for your function to fill out rather than returning a matrix by type. In some cases, this may result in an expensive copy operation.

Here is my implementation of the fast simple multiplication algorithm for square float matrices (2D arrays). It should be a little faster than chrisaycock code since it spares some increments.
static void fastMatrixMultiply(const int dim, float* dest, const float* srcA, const float* srcB)
{
memset( dest, 0x0, dim * dim * sizeof(float) );
for( int i = 0; i < dim; i++ ) {
for( int k = 0; k < dim; k++ )
{
const float* a = srcA + i * dim + k;
const float* b = srcB + k * dim;
float* c = dest + i * dim;
float* cMax = c + dim;
while( c < cMax )
{
*c++ += (*a) * (*b++);
}
}
}
}

Pass the parameters by const reference to start with:
matrix mult_std(matrix const& a, matrix const& b) {
To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.
The problem is undoubtedly yours as this problem has been solved a million times before.
Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.
Without the code we are just guessing.
Edit
After fixing the main bug in the original C code (a buffer over-run)
I have update the code to run the test side by side in a fair comparison:
// INCLUDES -------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
// DEFINES -------------------------------------------------------------------
// The original problem was here. The MAXDIM was 500. But we were using arrays
// that had a size of 512 in each dimension. This caused a buffer overrun that
// the dim variable and caused it to be reset to 0. The result of this was causing
// the multiplication loop to fall out before it had finished (as the loop was
// controlled by this global variable.
//
// Everything now uses the MAXDIM variable directly.
// This of course gives the C code an advantage as the compiler can optimize the
// loop explicitly for the fixed size arrays and thus unroll loops more efficiently.
#define MAXDIM 512
#define RUNS 10
// MATRIX FUNCTIONS ----------------------------------------------------------
class matrix
{
public:
matrix(int dim)
: dim_(dim)
{
data_ = new int[dim_ * dim_];
}
inline int dim() const {
return dim_;
}
inline int& operator()(unsigned row, unsigned col) {
return data_[dim_*row + col];
}
inline int operator()(unsigned row, unsigned col) const {
return data_[dim_*row + col];
}
private:
int dim_;
int* data_;
};
// ---------------------------------------------------
void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
for (int r = 0; r < MAXDIM; r++)
for (int c = 0; c < MAXDIM; c++)
matrix[r][c] = rand() % 100;
}
void random_matrix_class(matrix& matrix) {
for (int r = 0; r < matrix.dim(); r++)
for (int c = 0; c < matrix.dim(); c++)
matrix(r, c) = rand() % 100;
}
template<typename T, typename M>
float run(T f, M const& a, M const& b, M& c)
{
float time = 0;
for (int i = 0; i < RUNS; i++) {
struct timeval start, end;
gettimeofday(&start, NULL);
f(a,b,c);
gettimeofday(&end, NULL);
long s = start.tv_sec * 1000 + start.tv_usec / 1000;
long e = end.tv_sec * 1000 + end.tv_usec / 1000;
time += e - s;
}
return time / RUNS;
}
// SEQ MULTIPLICATION ----------------------------------------------------------
int* mult_seq(int const(&a)[MAXDIM][MAXDIM], int const(&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
for (int r = 0; r < MAXDIM; r++) {
for (int c = 0; c < MAXDIM; c++) {
z[r][c] = 0;
for (int i = 0; i < MAXDIM; i++)
z[r][c] += a[r][i] * b[i][c];
}
}
}
void mult_std(matrix const& a, matrix const& b, matrix& z) {
for (int r = 0; r < a.dim(); r++) {
for (int c = 0; c < a.dim(); c++) {
z(r,c) = 0;
for (int i = 0; i < a.dim(); i++)
z(r,c) += a(r,i) * b(i,c);
}
}
}
// MAIN ------------------------------------------------------------------------
using namespace std;
int main(int argc, char* argv[]) {
srand(time(NULL));
int matrix_a[MAXDIM][MAXDIM];
int matrix_b[MAXDIM][MAXDIM];
int matrix_c[MAXDIM][MAXDIM];
random_matrix(matrix_a);
random_matrix(matrix_b);
printf("%d ", MAXDIM);
printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));
matrix a(MAXDIM);
matrix b(MAXDIM);
matrix c(MAXDIM);
random_matrix_class(a);
random_matrix_class(b);
printf("%d ", MAXDIM);
printf("%f \n", run(mult_std, a, b, c));
return 0;
}
The results now:
$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000
$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000
From this we see the C code is about twice as fast as the C++ code when fully optimized. I can not see the reason in the code.

I'm taking a wild guess here, but if you dynamically allocating the matrices makes such a huge difference, maybe the problem is fragmentation. Again, I've no idea how the underlying matrix is implemented.
Why don't you allocate the memory for the matrices by hand, ensuring it's contiguous, and build the pointer structure yourself?
Also, does the dim() method have any extra complexity? I would declare it inline, too.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js