I'm trying to use the GCC vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html) to speed up matrix multiplication. The idea is to use SIMD instructions to multiply and add four float numbers at once. A minimal working example is listed below. The example works fine when multiplying a (M=10,K=12) matrix by a (K=12,N=12) matrix. When I change the parameters (say N=9), however, I get a segmentation fault.
I suspect this is due to memory alignment issues. In my understanding, when using a SIMD for a vector with 16 bytes (in this case float4), the target memory address should be a multiple of 16. There are already discussions on memory alignment issues with SIMD instructions (e.g. Relationship between SSE vectorization and Memory alignment). In the example below, when &b(0,0) is 0x810e10, &b(1,0) is 0x810e34, which is not a multiple of 16.
My questions are,
Is it true that I'm getting the segfault for the memory alignment issues?
Can anyone tell me how to fix the problem easily? I've thought of using a two-dimensional array instead of one array, but I don't want to do this so as not to change the rest of the codes.
Minimal Working Example
#include <iostream>
#include <cstdlib>
#include <stdio.h>
#include <cstring>
#include <assert.h>
#include <algorithm>
using namespace std;
typedef float float4 __attribute__((vector_size (16)));
static inline void * alloc64(size_t sz) {
void * a = 0;
if (posix_memalign(&a, 64, sz) != 0) {
perror("posix_memalign");
exit(1);
}
return a;
}
struct Mat {
size_t m,n;
float * a;
Mat(size_t m_, size_t n_, float f) {
m = m_;
n = n_;
a = (float*) malloc(sizeof(float) * m * n);
fill(a,a + m * n,f);
}
/* a(i,j) */
float& operator()(long i, long j) {
return a[i * n + j];
}
};
Mat operator* (Mat a, Mat b) {
Mat c(a.m, b.n,0);
assert(a.n == b.m);
for (long i = 0; i < a.m; i++) {
for(long k = 0; k < a.n; k++){
float aa = a(i,k);
float4 a4 = {aa,aa,aa,aa};
long j;
for (j = 0; j <= b.n-4; j+=4) {
*((float4 *)&c(i,j)) = *((float4 *)&c(i,j)) + a4 * (*(float4 *)&b(k,j));
}
while(j < b.n){
c(i,j) += aa * b(k,j);
j++;
}
}
}
return c;
}
const int M = 10;
const int K = 12;
const int N = 12;
int main(){
Mat a(M,K,1);
Mat b(K,N,1);
Mat c = a * b;
for(int i = 0; i < M; i++){
for(int j = 0; j < N; j++)
cout << c(i,j) << " ";
cout << endl;
}
cout << endl;
}
In my understanding, when using a SIMD for a vector with 16 bytes (in
this case float4), the target memory address should be a multiple of
16.
That is incorrect on x64 processors. There are instructions that require alignment, but you can perfectly well write and read SIMD registers from unaligned memory locations without penalty and with absolute safety using the right instructions.
Is it true that I'm getting the segfault for the memory alignment
issues?
Yes.
But it is not related to SIMD instructions. In C/C++, it is undefined behavior to write *((float4 *)&c) = ... the way you do, and can certainly crash, but you can reproduce the problem without vectorization... Given the right circumstances, the following basic code will crash...
char * c = ...
*(int *) c = 1;
Can anyone tell me how to fix the problem easily? I've thought of
using a two-dimensional array instead of one array, but I don't want
to do this so as not to change the rest of the codes.
The typical workaround is to use memcpy. Let us look at a code example...
#include <string.h>
typedef float float4 __attribute__((vector_size (16)));
void writeover(float * x, float4 y) {
*(float4 * ) x = y;
}
void writeover2(float * x, float4 y) {
memcpy(x,&y,sizeof(y));
}
With, say, clang++, these two functions compile to vmovaps and vmovups, respectively. The two instructions do the same job, but the first one will crash if your pointer is not aligned to sizeof(float4) bytes. Both are very fast on recent hardware.
The point is that you can often rely on memcpy to generate code that is nearly optimally fast. Of course, the amount of overhead you get (if any) will depend on the compiler you are using.
If you do get performance problems, then you can use Intel intrinsics or assembly instead... but chances are good that memcpy will serve you well.
A different fix is to only work in terms of float4 * pointers. This forces all your matrices to have dimensions that are divisible by four, but if you pad the leftover with zeroes you will probably get simple and really fast code.
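As a rough sketch of the memcpy workaround applied to the inner loop of the question, one row update could look like the following. The helper name axpy_row is made up for illustration, and the code assumes GCC or Clang, since it relies on the vector_size attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

typedef float float4 __attribute__((vector_size(16)));

// Computes c[j] += aa * b[j] for j in [0, n), four floats at a time.
// c and b only need the alignment of float; memcpy does the unaligned access.
void axpy_row(float* c, const float* b, float aa, std::size_t n) {
    assert(c != nullptr && b != nullptr);
    const float4 a4 = {aa, aa, aa, aa};
    std::size_t j = 0;
    // "j + 4 <= n" avoids the unsigned wrap-around of "n - 4" when n < 4.
    for (; j + 4 <= n; j += 4) {
        float4 cv, bv;
        std::memcpy(&cv, c + j, sizeof(cv));   // unaligned load
        std::memcpy(&bv, b + j, sizeof(bv));   // unaligned load
        cv += a4 * bv;
        std::memcpy(c + j, &cv, sizeof(cv));   // unaligned store
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (; j < n; ++j) {
        c[j] += aa * b[j];
    }
}
```

In operator*, the two inner loops over j would then presumably reduce to a single call such as axpy_row(&c(i,0), &b(k,0), aa, b.n).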
I am learning C++ at the moment and currently I am experimenting with pointers and structures. In the following code, I am copying vector A into a buffer of size 100 bytes. Afterwards I copy vector B into the same buffer with an offset, so that the vectors are right next to each other in the buffer. Afterward, I want to find the vectors in the buffer again and calculate the dot product between the vectors.
#include <iostream>
const short SIZE = 5;
typedef struct vector {
float vals[SIZE];
} vector;
void vector_copy (vector* v, vector* target) {
for (int i=0; i<SIZE; i++) {
target->vals[i] = v->vals[i];
}
}
float buffered_vector_product (char buffer[]) {
float scalar_product = 0;
int offset = SIZE * 4;
for (int i=0; i<SIZE; i=i+4) {
scalar_product += buffer[i] * buffer[i+offset];
}
return scalar_product;
}
int main() {
char buffer[100] = {};
vector A = {{1, 1.5, 2, 2.5, 3}};
vector B = {{0.5, -1, 1.5, -2, 2.5}};
vector_copy(&A, (vector*) buffer);
vector_copy(&B, (vector*) (buffer + sizeof(vector)));
float prod = buffered_vector_product(buffer);
std::cout << prod <<std::endl;
return 0;
}
Unfortunately this doesn't work yet. The problem lies within the function buffered_vector_product. I am unable to get the float values back from the buffer. Each float value should need 4 bytes. I don't know, how to access these 4 bytes and convert them into a float value. Can anyone help me out? Thanks a lot!
In the function buffered_vector_product, change the lines
int offset = SIZE * 4;
for (int i=0; i<SIZE; i=i+4) {
scalar_product += buffer[i] * buffer[i+offset];
}
to
for ( int i=0; i<SIZE; i++ ) {
scalar_product += ((float*)buffer)[i] * ((float*)buffer)[i+SIZE];
}
If you want to calculate the offsets manually, you can instead replace it with the following:
size_t offset = SIZE * sizeof(float);
for ( int i=0; i<SIZE; i++ ) {
scalar_product += *(float*)(buffer+i*sizeof(float)) * *(float*)(buffer+i*sizeof(float)+offset);
}
However, with both solutions, you should beware of both the alignment restrictions and the strict aliasing rule.
The problem with the alignment restrictions can be solved by changing the line
char buffer[100] = {};
to the following:
alignas(float) char buffer[100] = {};
The strict aliasing rule is a much more complex issue, because the exact rule has changed significantly between different C++ standards and is (or at least was) different from the strict aliasing rule in the C language. See the link in the comments section for further information on this issue.
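One way to sidestep both the alignment and the aliasing concerns is to copy the bytes into a real float object with std::memcpy. The following is a minimal sketch: read_float is a hypothetical helper, and it assumes the floats were stored back to back in the buffer, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const std::size_t SIZE = 5;

// Reads the index-th float stored in a raw byte buffer.
// memcpy is valid for any alignment and does not violate strict aliasing.
float read_float(const char* buffer, std::size_t index) {
    assert(buffer != nullptr);
    float value;
    std::memcpy(&value, buffer + index * sizeof(float), sizeof(float));
    return value;
}

// Dot product of two SIZE-element float vectors stored consecutively.
float buffered_vector_product(const char* buffer) {
    float scalar_product = 0.0f;
    for (std::size_t i = 0; i < SIZE; ++i) {
        scalar_product += read_float(buffer, i) * read_float(buffer, i + SIZE);
    }
    return scalar_product;
}
```

Compilers typically turn such a fixed-size memcpy into a plain load, so this should cost little or nothing compared with the pointer cast.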
I am attempting to load in a .mat file containing a tensor of known dimensions in C++; 144x192x256.
I have adjusted the linear index for the read operation to be column major as in MATLAB. However I am still getting memory access issues.
void FeatureLoader::readMat(const std::string &fname, Image< std::vector<float> > *out) {
//Read MAT file.
const char mode = 'r';
MATFile *matFile = matOpen(fname.c_str(), &mode);
if (matFile == NULL) {
throw std::runtime_error("Cannot read MAT file.");
}
//Copy the data from column major to row major storage.
float *newData = newImage->GetData();
const mxArray *arr = matGetVariable(matFile, "map");
if (arr == NULL) {
throw std::runtime_error("Cannot read variable.");
}
double *arrData = (double*)mxGetPr(arr);
#pragma omp parallel for
for (int i = 0; i < 144; i++) {
#pragma omp parallel for
for (int j = 0; j < 192; j++) {
for (int k = 0; k < 256; k++) {
int rowMajIdx = (i * 192 + j) * 256 + k;
int colMajIdx = (j * 144 + i) * 256 + k;
newData[rowMajIdx] = static_cast<float>(arrData[colMajIdx]);
}
}
}
}
In the above snippet, am I right to be accessing the data linearly as with a flattened 3D array in C++? For example:-
idx_row_major = (x*WIDTH + y)*DEPTH + z
idx_col_major = (y*HEIGHT + x)*DEPTH + z
Is this the underlying representation that MATLAB uses?
You have some errors in the indexing of rowMajIdx and colMajIdx. Additionally, naively accessing the data can be very slow because of non-sequential memory access (memory latency is key!).
The best way to pass from MATLAB to C++ types (From 3D to 1D) is following the example below.
In this example we illustrate how to take a double real-type 3D matrix from MATLAB, and pass it to a C double* array.
The main objectives of this example are showing how to obtain data from MATLAB MEX arrays and to highlight some small details in matrix storage and handling.
matrixIn.cpp
#include "mex.h"
void mexFunction(int nlhs , mxArray *plhs[],
int nrhs, mxArray const *prhs[]){
// check amount of inputs
if (nrhs!=1) {
mexErrMsgIdAndTxt("matrixIn:InvalidInput", "Invalid number of inputs to MEX file.");
}
// check type of input
if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])){
mexErrMsgIdAndTxt("matrixIn:InvalidType", "Input matrix must be a double, non-complex array.");
}
// extract the data
double const * const matrixAux= static_cast<double const *>(mxGetData(prhs[0]));
// Get matrix size
const mwSize *sizeInputMatrix= mxGetDimensions(prhs[0]);
// allocate array in C. Note: it is a 1D array, not 3D, even if our input is 3D
double* matrixInC= (double*)malloc(sizeInputMatrix[0] *sizeInputMatrix[1] *sizeInputMatrix[2]* sizeof(double));
// MATLAB is column major, not row major (as C). We need to reorder the numbers
// Basically permutes dimensions
// NOTE: the ordering of the loops is optimized for fastest memory access!
// This improves the speed in about 300%
const int size0 = sizeInputMatrix[0]; // Const makes compiler optimization kick in
const int size1 = sizeInputMatrix[1];
const int size2 = sizeInputMatrix[2];
for (int j = 0; j < size2; j++)
{
int jOffset = j*size0*size1; // this saves re-computation time
for (int k = 0; k < size0; k++)
{
int kOffset = k*size1; // this saves re-computation time
for (int i = 0; i < size1; i++)
{
int iOffset = i*size0;
matrixInC[i + jOffset + kOffset] = matrixAux[iOffset + jOffset + k];
}
}
}
// we are done!
// Use your C matrix here
// free memory
free(matrixInC);
return;
}
The relevant concepts to be aware of:
MATLAB matrices are all 1D in memory, no matter how many dimensions they have when used in MATLAB. This is also true for most (if not all) of the main matrix representations in C/C++ libraries, as it allows optimization and faster execution.
You need to explicitly copy matrices from MATLAB to C in a loop.
MATLAB matrices are stored in column-major order, as in Fortran, but C/C++ and most modern languages are row-major. It is important to permute the input matrix, or else the data will look completely different.
The relevant functions in this example are:
mxIsDouble checks whether the input is of type double.
mxIsComplex checks whether the input is complex (has an imaginary part).
mxGetData returns a pointer to the real data in the input array, or NULL if there is no real data.
mxGetDimensions returns a pointer to a mwSize array holding the size of each dimension.
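For the tensor in the question, the full three-dimensional reordering can be sketched without the MEX API. The function name colMajorToRowMajor and the parameter names H, W and D are assumptions made for this example, and the loop order is chosen for readability rather than speed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Converts an H x W x D array from column-major (MATLAB) order
// to row-major (C) order, narrowing double to float on the way.
std::vector<float> colMajorToRowMajor(const std::vector<double>& src,
                                      std::size_t H, std::size_t W,
                                      std::size_t D) {
    assert(src.size() == H * W * D);
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < H; ++i) {
        for (std::size_t j = 0; j < W; ++j) {
            for (std::size_t k = 0; k < D; ++k) {
                // Column-major: the first index varies fastest.
                const std::size_t colMajIdx = i + H * (j + W * k);
                // Row-major: the last index varies fastest.
                const std::size_t rowMajIdx = (i * W + j) * D + k;
                dst[rowMajIdx] = static_cast<float>(src[colMajIdx]);
            }
        }
    }
    return dst;
}
```

With H = 144, W = 192 and D = 256, element (i,j,k) of the MATLAB array sits at i + 144*(j + 192*k), which differs from the (j*144 + i)*256 + k used in the question.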
I do a blas matrix/vector product with this simple code:
#include "mkl.h"
#include <stdio.h>
int main(){
const int M = 2;
const int N = 3;
double *x = new double[N];
double *A = new double[M*N];
double *b = new double[M];
for (int i = 0; i < M; i++){
b[i] = 0.0; //Not necessary but anyway...
for (int j = 0; j < N; j++){
A[j * M + i] = i + j * 2;
}
}
for (int j = 0; j < N; j++)
x[j] = j*j;
const int incr = 1;
const double alpha = 1.0;
const double beta = 0.0;
const char no = 'N';
dgemv(&no, &M, &N, &alpha, A, &M, x, &incr, &beta, b, &incr );
printf("b = [%e %e]'\n",b[0],b[1]);
delete[] x;
delete[] A;
delete[] b;
}
While the displayed result is as expected ([18, 23]), Intel Inspector finds one invalid memory access and 2 invalid partial memory access when calling dgemv. The invalid memory access and one invalid partial memory access are related to memory allocated corresponding to the vector b. The second invalid partial memory access is related with the memory allocated for A. I do not get any error if I use a static array.
It also happens with other MKL functions, such as dgesv or when I try to use cblas_dgemv. I use Intel Inspector XE 2016 and intel C++ Compiler 16.0 with MKL sequential.
Is my dgemv call wrong, or is this a false positive? Has anyone experienced this?
Thanks
EDIT:
As suggested by Josh Milthorpe: the error appears only on small-size arrays, probably because MKL is trying to access memory in large chunks for efficiency.
I did several tests, and M needs to be at least 20 in order to not get an error. N can be any positive number. I suppose that this is not a bug, and MKL is just accessing memory outside of the allocated space for the matrix, but does not alter or really use it.
I have that error but I'm sure I have the same data type and I didn't do anything wrong I suppose. It's for calculating the determinant of a matrix. Someone help. I really can't think of why I have this error :(
#include <iostream>
#include <stdio.h>
#include <cmath>
using namespace std;
double determinant(double matrix[100][100], int order)
{
double det, temp[100][100]; int row, col;
if (order == 1)
return matrix[0][0];
else if (order == 2)
return ((matrix[0][0] * matrix[1][1]) - (matrix[0][1] * matrix[1][0]));
else
{
for (int r = 0; r < order; r++)
{
row = 0;
col = 0;
for (int i = 1; i < order; i++)
{
for (int j = 0; j < order; j++)
{
if (j == r)
continue;
temp[row][col] = matrix[i][j];
col++;
}
row++;
}
det += (matrix[0][r] * pow(-1, r) * determinant(temp, order - 1));
}
return det;
}
}
int main()
{
int n;
cout << "Enter the dimension: ";
cin >> n;
double elem[n][n];
for (int i = 0; i < n; i++)
{
cout << "Enter row " << i << ": ";
for (int j = 0; j < n; j++)
{
cin >> elem[i][j];
}
cout << endl;
}
cout << determinant(elem, n);
return 0;
}
Your prototype is
double determinant(double matrix[100][100], int order)
and you call it with
determinant(elem, n);
where elem is declared as
double elem[n][n]; // a variable-length ("dynamic") array, so not 100x100
It seems the compiler assumes n is 1 at compile time, and
a double[1][1] array obviously can't be converted to double[100][100].
As you wrote it, even if your input matrix is 1x1, you have to store it in a 100x100 array.
Just declare double elem[100][100];
Finally, at run time, make sure the user's input satisfies n <= 100 to avoid an out-of-bounds access.
You have three problems.
First, the size of elem is unknown at compile time. You should use elem[100][100] if you really want the variable on the stack and the size of the matrix really is 100x100.
Second, your determinant function creates a 10,000-element matrix on the stack and it is recursive, which means you'll get a lot of them and will likely run out of stack space. You should consider using a single temp matrix and reusing it for each recursive step.
Third, since you need the matrix size to be dynamic, declare it on the heap. Something like:
double* elem = new double[n * n];
Strictly speaking you do not need to do this, but it will not waste as much memory as a 100x100 matrix if you are calculating the determinant of small matrices.
If you use a one dimensional array, you can pass in an array of any size to determinant (the determinant function should also take a one-dimensional array or double* instead of double[100][100]). You will have to calculate the index yourself using matrix[order*j+i].
double elem[n][n]; is illegal in C++. Arrays must have dimensions known at compile time.
Your bizarre error message is a result of a compiler attempting to support double elem[n][n] as an extension, but not doing a very good job of it.
One way to fix this would be to change your code to be double elem[100][100]; .
To fix it without wasting memory and sticking to Standard C++, you should use std::vector instead of a C-style array. It is simpler to code to use a vector of vectors, although for performance reasons you may want to use a 1-D vector.
Also, you would need to refactor determinant slightly as you don't really want to be allocating new memory each time you do another step of the recursion. The determinant function needs to know what dimension of memory is allocated, as well as what dimension you want to calculate the determinant on.
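A possible shape for such a refactoring is sketched below, assuming a flat row-major std::vector<double> of n*n elements. Instead of copying minors into temporary matrices, it marks which columns are already used, so no matrix is allocated during the recursion. The helper name expand is invented for this example, and the cofactor expansion still takes factorial time, so it only suits small matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace {
// Expands along row `row`, considering only the columns not yet marked used.
double expand(const std::vector<double>& m, std::size_t n,
              std::size_t row, std::vector<bool>& used) {
    if (row == n) {
        return 1.0;  // determinant of the empty (0x0) minor
    }
    double det = 0.0;
    double sign = 1.0;  // alternates over the remaining columns, in order
    for (std::size_t col = 0; col < n; ++col) {
        if (used[col]) {
            continue;
        }
        used[col] = true;
        det += sign * m[row * n + col] * expand(m, n, row + 1, used);
        used[col] = false;
        sign = -sign;
    }
    return det;
}
}  // namespace

// Determinant of an n x n matrix stored row-major in a flat vector.
double determinant(const std::vector<double>& m, std::size_t n) {
    assert(m.size() == n * n);
    std::vector<bool> used(n, false);
    return expand(m, n, 0, used);
}
```

In main, the input would then be read into a std::vector<double> elem(n * n) and addressed as elem[i * n + j].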
I am a Fortran user and do not know C++ well enough. I need to make some additions into an existing C++ code. I need to create a 2d matrix (say A) of type double whose size (say m x n) is known only during the run. With Fortran this can be done as follows
real*8, allocatable :: A(:,:)
integer :: m, n
read(*,*) m
read(*,*) n
allocate(a(m,n))
A(:,:) = 0.0d0
How do I create a matrix A(m,n) in C++ when m and n are not known at the time of compilation? I believe the operator new in C++ can be useful, but I am not sure how to use it with doubles. Also, when I use the following in C++
int * x;
x = new int [10];
and check the size of x using sizeof(x)/sizeof(x[0]), I do not have 10, any comments why?
To allocate dynamically a construction similar to 2D array use the following template.
#include <iostream>
int main()
{
int m, n;
std::cout << "Enter the number of rows: ";
std::cin >> m;
std::cout << "Enter the number of columns: ";
std::cin >> n;
double **a = new double * [m];
for ( int i = 0; i < m; i++ ) a[i] = new double[n]();
//...
for ( int i = 0; i < m; i++ ) delete []a[i];
delete []a;
}
Also you can use class std::vector instead of the manually allocated pointers.
#include <iostream>
#include <vector>
int main()
{
int m, n;
std::cout << "Enter the number of rows: ";
std::cin >> m;
std::cout << "Enter the number of columns: ";
std::cin >> n;
std::vector<std::vector<double>> v( m, std::vector<double>( n ) );
//...
}
As for this code snippet
int * x;
x = new int [10];
then x has type int * and x[0] has type int. So if the size of a pointer is 4 and the size of an object of type int is also 4, then sizeof( x ) / sizeof( x[0] ) will yield 1. Pointers do not keep track of whether they point to a single object or to the first object of some sequence of objects.
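The difference can be illustrated with a small sketch. The function names pointerRatio and vectorLength are made up for this example; the first repeats the expression from the question, the second shows that std::vector keeps track of its own length.

```cpp
#include <cstddef>
#include <vector>

// The expression from the question: it divides the size of a pointer
// by the size of an int, regardless of how many ints were allocated.
std::size_t pointerRatio() {
    int* x = new int[10];
    const std::size_t ratio = sizeof(x) / sizeof(x[0]);
    delete[] x;
    return ratio;
}

// A std::vector stores its element count and reports it via size().
std::size_t vectorLength() {
    std::vector<int> v(10);
    return v.size();
}

// For a true array with a compile-time size, the sizeof idiom does work.
int fixedArray[10];
static_assert(sizeof(fixedArray) / sizeof(fixedArray[0]) == 10,
              "sizeof sees the whole array here");
```

The value returned by pointerRatio depends on the platform, since it is simply sizeof(int*) / sizeof(int).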
I would recommend using std::vector and avoid all the headache of manually allocating and deallocating memory.
Here's an example program:
#include <iostream>
#include <vector>
typedef std::vector<double> Row;
typedef std::vector<Row> Matrix;
void testMatrix(int M, int N)
{
// Create a row with all elements set to 0.0
Row row(N, 0.0);
// Create a matrix with all elements set to 0.0
Matrix matrix(M, row);
// Test accessing the matrix.
for ( int i = 0; i < M; ++i )
{
for ( int j = 0; j < N; ++j )
{
matrix[i][j] = i+j;
std::cout << matrix[i][j] << " ";
}
std::cout << std::endl;
}
}
int main()
{
testMatrix(10, 20);
}
The formal C++ way of doing it would be this:
std::vector<std::vector<int>> a;
This creates a container which contains a zero-size set of sub-containers. C++11 and later provide std::array for fixed-size containers, but you specified runtime sizing.
We now have to impart our dimensions on this. Let's size the top level:
a.resize(10);
(you can also push or insert elements)
What we now have is a vector of 10 vectors. Unfortunately, they are all independent, so you would need to:
for (size_t i = 0; i < a.size(); ++i) {
a[i].resize(10);
}
We now have a 10x10. We can also use the vector constructor:
std::vector<std::vector<int>> a(xSize, std::vector<int>(ySize)); // assuming you want a[x][y]
Note that vectors are fully dynamic, so we can resize elements as we need:
a[1].push_back(10); // push value '10' onto a[1], creating an 11th element in a[1]
a[2].erase(a[2].begin() + 2); // remove element 2 from a[2], reducing a[2]'s size to 9
To get the size of a particular slot:
a.size(); // returns 10
a[1].size(); // returns 11 after the above
a[2].size(); // returns 9 after the above.
Unfortunately C++ doesn't provide a strong, first-class way to allocate an array that retains size information. But you can always create a simple C-style array on the stack:
int a[10][10];
std::cout << "sizeof a is " << sizeof(a) <<'\n';
But using an allocator, that is placing the data onto the heap, requires /you/ to track size.
int* pointer = new int[10];
At this point, "pointer" is a numeric value: the location in memory where the first of your 10 consecutive integer storage spaces is located. (If not enough memory is available, a plain new throws std::bad_alloc rather than returning zero.)
The use of the pointer decorator syntax tells the compiler that this integer value will be used as a pointer to store addresses and so allow pointer operations via the variable.
The important thing here is that all we have is an address, and the original C standard didn't specify how the memory allocator would track size information, and so there is no way to retrieve the size information. (OK, technically there is, but it requires using compiler/os/implementation specific information that is subject to frequent change)
These integers must be treated as a single object when interfacing with the memory allocation system -- you can't, for example:
delete pointer + 5;
to delete the 5th integer. They are a single allocation unit; this notion allows the system to track blocks rather than individual elements.
To delete an array, the C++ syntax is
delete[] pointer;
To allocate a 2-dimensional array, you will need to either:
Flatten the array and handle sizing/offsets yourself:
static const size_t x = 10, y = 10;
int* pointer = new int[x * y];
pointer[0] = 0; // position 0, the 1st element.
pointer[x * 1] = 0; // pointer[1][0]
or you could use
int access_2d_array_element(int* pointer, const size_t xSize, const size_t ySize, size_t x, size_t y)
{
assert(x < xSize && y < ySize);
return pointer[y * xSize + x];
}
That's kind of a pain, so you would probably be steered towards encapsulation:
class Array2D
{
int* m_pointer;
const size_t m_xSize, m_ySize;
public:
Array2D(size_t xSize, size_t ySize)
: m_pointer(new int[xSize * ySize])
, m_xSize(xSize)
, m_ySize(ySize)
{}
int& at(size_t x, size_t y)
{
assert(x < m_xSize && y < m_ySize);
return m_pointer[y * m_xSize + x];
}
// total number of elements.
size_t arrsizeof() const
{
return m_xSize * m_ySize;
}
// total size of all data elements.
size_t sizeInBytes() const // "sizeof" is a keyword and cannot be used as a method name
{
// this sizeof syntax makes the code more generic.
return arrsizeof() * sizeof(*m_pointer);
}
~Array2D()
{
delete[] m_pointer;
}
};
Array2D a(10, 10);
a.at(1, 3) = 13;
int x = a.at(1, 3);
Or,
For each Nth dimension (N < dimensions) allocate an array of pointers-to-pointers, only allocating actual ints for the final dimension.
const size_t xSize = 10, ySize = 10;
int** pointer = new int*[xSize]; // the first level of indirection.
for (size_t i = 0; i < xSize; ++i) {
    pointer[i] = new int[ySize]; // one array of ints per row
}
pointer[0][0] = 0;
for (size_t i = 0; i < xSize; ++i) {
    delete[] pointer[i];
}
delete[] pointer;
This last is more-or-less doing the same work, it just creates more memory fragmentation than the former.
-----------EDIT-----------
To answer the question "why do I not have 10": "x" is a pointer to int, not an array of 10 ints, so sizeof(x) is the size of a pointer. You're probably compiling in 64-bit mode, where pointers are 64 bits long while ints are 32 bits, so the expression yields 2.
The C++ equivalent of your Fortran code is:
int cols, rows;
if ( !(std::cin >> cols >> rows) ) {
    // error handling...
}
std::vector<double> A(cols * rows);
To access an element of this array you would need to write A[r * cols + c] (or you could do it in a column-major fashion, that's up to you).
The element access is a bit clunky, so you could write a class that wraps up holding this vector and provides a 2-D accessor method.
In fact your best bet is to find a free library that already does this, instead of reinventing the wheel. There isn't a standard Matrix class in C++, because somebody would always want a different option (e.g. some would want row-major storage, some column-major, particular operations provided, etc. etc.)
Someone suggested boost::multi_array; that stores all its data contiguously in row-major order and is probably suitable. If you want standard matrix operations consider something like Eigen, again there are a lot of alternatives out there.
If you want to roll your own then it could look like:
struct FortranArray2D // actually easily extensible to any number of dimensions
{
FortranArray2D(size_t n_cols, size_t n_rows)
: n_cols(n_cols), n_rows(n_rows), content(n_cols * n_rows) { }
double &operator()(size_t col, size_t row)
{ return content.at(row * n_cols + col); }
void resize(size_t new_cols, size_t new_rows)
{
FortranArray2D temp(new_cols, new_rows);
// insert some logic to move values from old to new...
*this = std::move(temp);
}
private:
size_t n_cols, n_rows; // declared in the same order as the constructor initializes them
std::vector<double> content;
};
Note in particular that by avoiding new you avoid the thousand and one headaches that come with manual memory management. Your class is copyable and movable by default. You could add further methods to replicate any functionality that the Fortran array has which you need.
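If the data has to keep Fortran's memory layout (for instance to hand it to a Fortran routine), the accessor can use column-major indexing instead. The class below is only a sketch along those lines: the name ColMajorMatrix and its members are invented for this example, indices are 0-based as usual in C++, and the elements start at zero like A(:,:) = 0.0d0 in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A rows x cols matrix of doubles stored contiguously in column-major order.
class ColMajorMatrix {
public:
    ColMajorMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Element access with 0-based (row, column) indices.
    double& operator()(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[c * rows_ + r];  // consecutive rows of a column are adjacent
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Pointer to the contiguous storage, e.g. for passing to other libraries.
    const double* data() const { return data_.data(); }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};
```

After reading m and n at run time, ColMajorMatrix A(m, n); followed by A(i, j) = value; would play the role of the allocate statement and element assignment in the Fortran snippet.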
int ** x;
x = new int* [10];
for(int i = 0; i < 10; i++)
x[i] = new int[5];
Unfortunately you'll have to store the size of matrix somewhere else.
C/C++ won't do it for you. sizeof() works only when the compiler knows the size, which is not the case for dynamic arrays.
And if you want to achieve it with something safer than dynamic arrays:
#include <vector>
// ...
std::vector<std::vector<int>> vect(10, std::vector<int>(5));
vect[3][2] = 1;