Optimizing square matrix multiplication with std::thread

I'm trying to implement matrix multiplication with std::thread in C++.
Currently, my kernel code looks like this:
void multiply(const int* a, const int* b, int* c, int rowLength, int start) {
for (auto i = start; i < rowLength; i += threadCount) {
const auto rowI = i * rowLength;
for (auto j = 0; j < rowLength; j++) {
auto result = 0;
const auto rowJ = j * rowLength;
for (auto k = 0; k < rowLength; k++) {
result += a[rowI + k] * b[rowJ + k];
}
c[rowI + j] = result;
}
}
}
As you can see, I'm multiplying matrix A by an already transposed matrix B (the transposition is done during input). Currently, I'm using a one-dimensional approach. Are there any optimizations I can make to my current code?
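One way such a kernel might be launched is sketched below. It is only an illustration under assumptions that are not in the question: the thread count is passed as a parameter (the original reads a threadCount variable defined elsewhere), each thread receives a contiguous block of rows instead of interleaved rows, and the names multiplyRows and multiplyParallel are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Computes rows [rowBegin, rowEnd) of c = a * b, where bT holds b already transposed.
// All matrices are n x n and stored row by row in one-dimensional arrays.
void multiplyRows(const int* a, const int* bT, int* c, int n, int rowBegin, int rowEnd) {
    for (int i = rowBegin; i < rowEnd; ++i) {
        const int* rowA = a + i * n;
        for (int j = 0; j < n; ++j) {
            const int* rowB = bT + j * n;
            int sum = 0; // accumulate locally, write to c once
            for (int k = 0; k < n; ++k) {
                sum += rowA[k] * rowB[k];
            }
            c[i * n + j] = sum;
        }
    }
}

// Hypothetical helper: splits the rows into contiguous blocks, one per thread.
void multiplyParallel(const int* a, const int* bT, int* c, int n, int threadCount) {
    assert(n > 0 && threadCount > 0);
    const int rowsPerThread = (n + threadCount - 1) / threadCount; // ceiling division
    std::vector<std::thread> workers;
    for (int begin = 0; begin < n; begin += rowsPerThread) {
        const int end = std::min(begin + rowsPerThread, n);
        workers.emplace_back(multiplyRows, a, bT, c, n, begin, end);
    }
    for (auto& worker : workers) {
        worker.join(); // every thread writes a disjoint set of rows, so no locking is needed
    }
}
```

Contiguous blocks keep each thread writing to its own region of c, which may reduce cache-line sharing between threads compared with interleaved rows. std::thread::hardware_concurrency() can suggest a thread count, although it may return 0 when the value cannot be determined. With GCC or Clang on Linux, linking may require -pthread.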

Related

Memory leak in the implementation of the matrix multiplication operation

Memory leak in the implementation of the matrix multiplication operation. The class begins like this:
template <typename T>
class Matrix
{
private:
T *data = nullptr;
size_t rows;
size_t cols;
Here is the multiplication operation itself:
Matrix<T> operator*(const Matrix<T> &other)
{
Matrix<T> result(rows, other.cols);
if (cols == other.rows)
{
for (size_t i = 0; i < rows; i++)
{
for (size_t j = 0; j < other.cols; j++)
{
for (size_t k = 0; k < cols; k++)
{
result.data[i * other.cols + j] += data[i * cols + k] * other.data[k * other.cols + j];
}
}
}
}
else
{
throw std::logic_error("Matrix sizes do not match");
}
return result;
}
How can I change this method so that it works correctly (and does not fail the tests)?
Here is a link to the class: https://godbolt.org/z/4PPYx4Y3j. For some reason, everything works well there, but when I run a test:
TEST(testMatrixCalculations, testMultiplication)
{
myMatrix::Matrix<int> mat1(3, 3);
myMatrix::Matrix<int> mat2(3, 3);
for (auto &it: mat1)
{
it = 3;
}
for (auto &it : mat2)
{
it = 3;
}
mat1.printMatrix();
mat2.printMatrix();
myMatrix::Matrix<int> mat3 = mat1 * mat2;
mat3.printMatrix();
for (auto it : mat3)
{
ASSERT_EQ(it, 27);
}
}
Outputs this:
3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3
-1119477653 32718 -1119477653 32718 775685387 21966 775685387 21966 27
Failure
Expected equality of these values:
it
Which is: -1119477653
27

Your result.data is not initialized to 0, but you apply a += operation to it, so each sum starts from whatever value happens to be in memory. You must either initialize your Matrix::data member to zero in the main Matrix constructor, or set each element to zero in your multiplication loop before accumulating into it:
for (size_t i = 0; i < rows; i++) {
for (size_t j = 0; j < other.cols; j++) {
result.data[i * other.cols + j] = 0;
for (size_t k = 0; k < cols; k++) {
result.data[i * other.cols + j] += data[i * cols + k] * other.data[k * other.cols + j];
}
}
}
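A variant of the same fix accumulates each dot product in a local variable and stores it once, so the result never depends on what the buffer held before. The sketch below is self-contained and uses a free function on std::vector rather than the Matrix class from the question; the name multiplyFlat is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies a (rows x inner) by b (inner x cols); all matrices are row-major.
template <typename T>
std::vector<T> multiplyFlat(const std::vector<T>& a, const std::vector<T>& b,
                            std::size_t rows, std::size_t inner, std::size_t cols) {
    assert(a.size() == rows * inner && b.size() == inner * cols);
    std::vector<T> result(rows * cols); // std::vector value-initializes its elements
    for (std::size_t i = 0; i < rows; i++) {
        for (std::size_t j = 0; j < cols; j++) {
            T sum = T(); // starts from zero for arithmetic types
            for (std::size_t k = 0; k < inner; k++) {
                sum += a[i * inner + k] * b[k * cols + j];
            }
            result[i * cols + j] = sum; // plain assignment, no dependence on old contents
        }
    }
    return result;
}
```

If the class allocates its buffer with new[], writing new T[rows * cols]() (with the trailing parentheses) value-initializes the elements as well; whether that applies here is an assumption, since the constructor is not shown in the question.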

Performance with matrix class in C++

I was performance profiling our library and noticed that most time is spent in matrix manipulations.
I wanted to see whether I could improve performance by changing the order of the matrix loops or by changing the matrix class definition from row major to column major.
Questions:
Below I test 2 cases. Test case 1 is always the fastest, no matter whether my matrix is row major or column major. Why is that?
Turning on vectorization improves Test case 1 by a factor of 2. Why is that?
Performance profiling is done with Very Sleepy.
I used Visual Studio 2019 – platformtoolset v142, and compiled in 32-bit.
Our library defines a matrix template whose underlying storage is a dynamic array with column-major ordering (full code follows below):
Type& operator()(int row, int col)
{
return pArr[row + col * m_rows];
}
Type operator()(int row, int col) const
{
return pArr[row + col * m_rows];
}
We also have a matrix class specific for doubles:
class DMatrix : public TMatrix<double>
{
public:
// Constructors:
DMatrix() : TMatrix<double>() { }
DMatrix(int rows, int cols) : TMatrix<double>(rows, cols, true) {}
};
I ran 2 test cases that perform nested loop operations on randomly filled matrices. The difference between Test case 1 and 2 is the order of the inner loops.
int nrep = 10000; // Large number of calculations
int nstate = 400;
int nstep = 400;
int nsec = 3; // 100 times smaller than nstate and nstep
DMatrix value(nstate, nsec);
DMatrix Rc(nstate, 3 * nstep);
DMatrix rhs(nstate, nsec);
// Test case 1
for (int k = 0; k < nrep; k++) {
for (int n = 0; n < nstep; n++) {
int diag = 3 * n + 1;
for (int i = 1; i < nstate; i++) {
for (int j = 0; j < nsec; j++) {
value(i, j) = (rhs(i, j) - Rc(i, diag - 1) * value(i - 1, j)) / Rc(i, diag);
}
}
}
}
// Test case 2
for (int k = 0; k < nrep; k++) {
for (int n = 0; n < nstep; n++) {
int diag = 3 * n + 1;
for (int j = 0; j < nsec; j++) {
for (int i = 1; i < nstate; i++) {
value(i, j) = (rhs(i, j) - Rc(i, diag - 1) * value(i - 1, j)) / Rc(i, diag);
}
}
}
}
Since the matrix is column major, I expected that I would get the best performance when the inner loop follows a column, due to nearby elements being CPU cached, but instead it is doing the opposite. Note that nstep and nstate are typically 100 times larger than nsec.
When I turn on vectorization (“Advanced Vector Extensions 2” under Code Generation/Enable Enhanced Instruction Set), the performance difference gets even larger:
When I turn off the vectorization and make the matrix row major:
Type& operator()(int row, int col)
{
return pArr[col + row*m_cols];
}
Type operator()(int row, int col) const
{
return pArr[col + row*m_cols];
}
I don’t get any difference in performance compared to when the matrix was column major:
With vector optimizations:
The full code. matrix.h:
#ifndef __MATRIX_H
#define __MATRIX_H
#include <assert.h>
#include <iostream>
template<class Type>
class TMatrix
{
public:
TMatrix(); // Default constructor
TMatrix(int rows, int cols, bool init = false); // Constructor with dimensions + flag to default initialize or not
TMatrix(const TMatrix& mat); // Copy constructor
TMatrix& operator=(const TMatrix& mat); // Assignment operator
~TMatrix(); // Destructor
// Move constructor/assignment
TMatrix(TMatrix&& mat) noexcept;
TMatrix& operator=(TMatrix&& mat) noexcept;
// Get matrix dimensions
int no_rows() const { return m_rows; }
int no_columns() const { return m_cols; }
Type& operator()(int row, int col)
{
assert(row >= 0 && row < m_rows&& col >= 0 && col < m_cols);
return pArr[row + col * m_rows]; // elements in a column lay next to each other
//return pArr[col + row*m_cols]; // elements in a row lay next to each other
}
Type operator()(int row, int col) const
{
assert(row >= 0 && row < m_rows&& col >= 0 && col < m_cols);
return pArr[row + col * m_rows];
// return pArr[col + row*m_cols];
}
protected:
void clear();
Type* pArr;
int m_rows, m_cols;
};
//**************************************************************
// Implementation of TMatrix
//**************************************************************
// Default constructor
template<class Type>
TMatrix<Type>::TMatrix()
{
m_rows = 0;
m_cols = 0;
pArr = 0;
}
// Constructor with matrix dimensions (rows, cols)
template<class Type>
TMatrix<Type>::TMatrix(int rows, int cols, bool init)
{
pArr = 0;
m_rows = rows;
m_cols = cols;
if (m_rows > 0 && m_cols > 0)
if (init)
pArr = new Type[m_rows * m_cols]();
else
pArr = new Type[m_rows * m_cols]; // TODO: check for p = NULL (memory allocation error, which will trigger a GPF)
else
{
m_rows = 0;
m_cols = 0;
}
}
// Copy constructor
template<class Type>
TMatrix<Type>::TMatrix(const TMatrix& mat)
{
pArr = 0;
m_rows = mat.m_rows;
m_cols = mat.m_cols;
if (m_rows > 0 && m_cols > 0)
{
int dim = m_rows * m_cols;
pArr = new Type[dim];
for (int i = 0; i < dim; i++)
pArr[i] = mat.pArr[i];
}
else
{
m_rows = m_cols = 0;
}
}
// Move constructors
template<class Type>
TMatrix<Type>::TMatrix(TMatrix&& mat) noexcept
{
m_rows = mat.m_rows;
m_cols = mat.m_cols;
if (m_rows > 0 && m_cols > 0)
{
pArr = mat.pArr;
}
else
{
m_rows = m_cols = 0;
pArr = 0;
}
mat.pArr = 0;
}
// Clear the matrix
template<class Type>
void TMatrix<Type>::clear()
{
delete[] pArr;
pArr = 0;
m_rows = m_cols = 0;
}
// Destructor
template<class Type>
TMatrix<Type>::~TMatrix()
{
clear();
}
// Move assignment
template<class Type>
TMatrix<Type>& TMatrix<Type>::operator=(TMatrix&& mat) noexcept
{
if (this != &mat) // Check for self assignment
{
clear();
m_rows = mat.m_rows;
m_cols = mat.m_cols;
if (m_rows > 0 && m_cols > 0)
{
pArr = mat.pArr;
}
else
{
m_rows = m_cols = 0;
}
mat.pArr = nullptr;
}
return *this;
}
// Assignment operator with check for self-assignment
template<class Type>
TMatrix<Type>& TMatrix<Type>::operator=(const TMatrix& mat)
{
if (this != &mat) // Guard against self assignment
{
clear();
m_rows = mat.m_rows;
m_cols = mat.m_cols;
if (m_rows > 0 && m_cols > 0)
{
int dim = m_rows * m_cols;
pArr = new Type[dim];
for (int i = 0; i < dim; i++)
pArr[i] = mat.pArr[i];
}
else
{
m_rows = m_cols = 0;
}
}
return *this;
}
#endif
dmatrix.h:
#ifndef __DMATRIX_H
#define __DMATRIX_H
#include "matrix.h"
class DMatrix : public TMatrix<double>
{
public:
// Constructors:
DMatrix() : TMatrix<double>() { }
DMatrix(int rows, int cols) : TMatrix<double>(rows, cols, true) {}
};
#endif
Main:
#include <cstdlib> // std::rand, RAND_MAX
#include <iostream>
#include "dmatrix.h"
int main()
{
int nrep = 10000; // Large number of calculations
int nstate = 400;
int nstep = 400;
int nsec = 3; // 100 times smaller than nstate and nstep
DMatrix value(nstate, nsec);
DMatrix Rc(nstate, 3 * nstep);
DMatrix rhs(nstate, nsec);
// Give some random input
for (int i = 0; i < Rc.no_rows(); i++) {
for (int j = 0; j < Rc.no_columns(); j++) {
Rc(i, j) = double(std::rand()) / RAND_MAX;
}
}
for (int i = 0; i < value.no_rows(); i++) {
for (int j = 0; j < value.no_columns(); j++) {
value(i, j) = 1 + double(std::rand()) / RAND_MAX;
}
}
for (int i = 0; i < rhs.no_rows(); i++) {
for (int j = 0; j < rhs.no_columns(); j++) {
rhs(i, j) = 1 + double(std::rand()) / RAND_MAX;
}
}
// Test case 1
for (int k = 0; k < nrep; k++) {
for (int n = 0; n < nstep; n++) {
int diag = 3 * n + 1;
for (int i = 1; i < nstate; i++) {
for (int j = 0; j < nsec; j++) { // Expectation: this is fast - inner loop follows row
value(i, j) = (rhs(i, j) - Rc(i, diag - 1) * value(i - 1, j)) / Rc(i, diag);
}
}
}
}
// Test case 2
for (int k = 0; k < nrep; k++) {
for (int n = 0; n < nstep; n++) {
int diag = 3 * n + 1;
for (int j = 0; j < nsec; j++) {
for (int i = 1; i < nstate; i++) { // Expectation: this is slow - inner loop walks down column
value(i, j) = (rhs(i, j) - Rc(i, diag - 1) * value(i - 1, j)) / Rc(i, diag);
}
}
}
}
return 0;
}
Thanks in advance for your help.
Best regards,
Nele

As I mentioned in a comment, after some testing:
Rc is the largest matrix here (by roughly a factor of 100), and it is reasonable to assume that most of the running time is spent on handling it. When the inner loop is on j, you get significant improvement because Rc(i, diag - 1) and Rc(i, diag) can be reused in all iterations of the inner loop.
To make sure that this is the case, I changed the loops to the following:
// Test case 1
for (int k = 0; k < nrep; k++) {
for (int i = 1; i < nstate; i++) {
for (int j = 0; j < nsec; j++) { // Expectation: this is fast - inner loop follows row
value(i, j) = (rhs(i, j) - value(i - 1, j));
}
}
}
// Test case 2
for (int k = 0; k < nrep; k++) {
for (int j = 0; j < nsec; j++) {
for (int i = 1; i < nstate; i++) { // Expectation: this is slow - inner loop walks down column
value(i, j) = (rhs(i, j) - value(i - 1, j)) ;
}
}
}
With this calculation (and different matrix sizes: 2000 by 2000, for 200 repetitions), one test case runs roughly 10 times faster than the other (no fancy profiling, but Linux's time command gives 18s vs. ~2s).
When I switch between row-major and column-major, the trend is reversed.
EDIT:
Conclusion: you need to select row-major or column-major based on what works best for Rc, and always use Test case 1 (if this represents the problems you're actually trying to solve).
Regarding vectorization - I'm not sure how this works. Maybe someone else can offer an explanation.
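The reuse of Rc(i, diag - 1) and Rc(i, diag) described above can be made explicit by reading both entries once per row, before the loop over j. The sketch below is self-contained and uses a minimal column-major stand-in for DMatrix; the names ColMajor and sweep are hypothetical. An optimizing compiler may already perform this hoisting on its own, so treat it as an illustration of the access pattern rather than a guaranteed speedup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal column-major matrix, used only for this illustration.
struct ColMajor {
    int rows;
    int cols;
    std::vector<double> data;

    ColMajor(int r, int c, double init)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, init) {}

    double& operator()(int row, int col) {
        return data[row + static_cast<std::size_t>(col) * rows];
    }
    double operator()(int row, int col) const {
        return data[row + static_cast<std::size_t>(col) * rows];
    }
};

// One sweep of Test case 1 for a fixed diag, with the two Rc reads hoisted.
void sweep(ColMajor& value, const ColMajor& rhs, const ColMajor& Rc, int diag) {
    assert(diag >= 1 && diag < Rc.cols);
    for (int i = 1; i < value.rows; i++) {
        const double lower = Rc(i, diag - 1); // read once, reused for every j
        const double pivot = Rc(i, diag);     // read once, reused for every j
        for (int j = 0; j < value.cols; j++) {
            value(i, j) = (rhs(i, j) - lower * value(i - 1, j)) / pivot;
        }
    }
}
```

With the inner loop on i instead (Test case 2), the same two Rc entries have to be fetched again for each of the nsec columns, which is where the extra memory traffic comes from.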

Simulated annealing: too slow with poor results

I'm trying to use simulated annealing to solve the following problem:
[image: optimization problem]
I already have the c_i,j,f values stored in a 1D array, so that
c_i,j,f <=> c[i + j * n + f * n * n]
My simulated annealing function looks like this:
int annealing(int n, int k_max, int c[]){
// Initial point (verifying the constraints )
int x[n * n * n];
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++){
for (int f = 0; f < n; f++){
if (i == j && j == f && f == i){
x[i + j * n + f * n * n] = 1;
}else{
x[i + j * n + f * n * n] = 0;
}
}
}
}
// Drawing y in the local neighbourhood of x : random permutation by keeping the constraints verified
int k = 0;
double T = 0.01; // initial temperature
double beta = 0.9999999999; // cooling factor
int y[n * n * n];
int permutation_i[n];
int permutation_j[n];
while (k <= k_max){ // k_max = maximum number of iterations allowed
Permutation(permutation_i, n);
Permutation(permutation_j, n);
for (int f = 0; f < n; f++){
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++){
y[i + j * n + f * n * n] = x[permutation_i[i] + permutation_j[j] * n + f * n * n];
}
}
}
if (f(y, c, n) < f(x, c, n) || rand()/(double)(RAND_MAX) <= pow(M_E, -(f(y, c, n)-f(x, c, n))/T)){
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++){
for (int f = 0; f < n; f++){
x[i + j * n + f * n * n] = y[i + j * n + f * n * n];
}
}
}
}
T *= beta;
++k;
}
return f(x, c, n);
}
The procedure Permutation(int permutation[], n) fills the array permutation with a random permutation of [[0,n-1]] (for example, it would transform [0,1,2,3,4] into [3,0,4,2,1]).
The problem is that it takes too much time with 1000000 iterations, and the values of the objective function oscillate between 78 and 79, whereas I should get 0 as a solution.
I was also thinking I could do better when it comes to complexity...
Could someone help me, please?
Thanks in advance!

I would use std::vector<int>, instead of arrays (and define a couple of constants):
#include <vector>
#include <algorithm>
#include <numeric> // std::iota
#include <random>
int annealing(int n, int k_max, const std::vector<int>& c) {
const int N2 = n * n;
const int N3 = N2 * n;
std::vector<int> x(N3);
std::vector<int> y(N3);
std::vector<int> permutation_i(n);
std::vector<int> permutation_j(n);
// ...
The initial nested loops boil down to:
for (int i = 0; i < n; i++){
x[(i*N2) + (i + (i * n))] = 1;
}
This could be your Permutation function:
void Permutation(std::vector<int> & x)
{
// seed the generator once and reuse it across calls
static std::mt19937 g(std::random_device{}());
std::shuffle(x.begin(), x.end(), g);
}
Initialize vectors before use (0 to n-1):
std::iota(permutation_i.begin(), permutation_i.end(), 0);
std::iota(permutation_j.begin(), permutation_j.end(), 0);
I have no idea what your f function is, but you should edit it to accept std::vector as its first two arguments.
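Independently of the container choice, the loop in the question evaluates f up to four times per iteration. A minimal sketch of an acceptance test that takes the two costs, each computed once by the caller, is shown below. The name acceptMove and the choice to pass the generator as a parameter are assumptions for illustration; f itself is not shown in the question.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Metropolis acceptance rule: always accept an improvement, otherwise accept
// with probability exp(-(costY - costX) / T).
bool acceptMove(double costX, double costY, double T, std::mt19937& gen) {
    assert(T > 0.0);
    if (costY < costX) {
        return true;
    }
    std::uniform_real_distribution<double> uniform(0.0, 1.0); // values in [0, 1)
    return uniform(gen) < std::exp(-(costY - costX) / T);
}
```

In the annealing loop, the cost of x can be kept in a variable and updated only when a move is accepted, so that each iteration calls f once, for the candidate y.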

Matrix scalar product

I have a task to calculate a scalar product
s=(B*(r+q+r), A*A*p)
As I understand it, I need to calculate 2 vectors: first B*(r+q+r), second A*A*p, and then calculate their scalar product.
#include <iostream>
#include <vector>
using namespace std;
using matrix = vector<vector<double>>;
matrix add(matrix A, matrix B) {
matrix C;
C.resize(A.size());
for (int i = 0; i< A.size(); i++) {
C[i].resize(B.size());
for (int j = 0; j < B.size(); j++) {
C[i][j] = A[i][j] + B[i][j];
}
}
return C;
}
matrix multiple(matrix A, matrix B)
{
matrix C;
C.reserve(100);
C.resize(B.size());
for (int i = 0; i < A.size(); i++) {
C[i].resize(B.size());
for (int j = 0; j < B.size(); j++) {
for (int k = 0; k < B.size(); k++)
C[i][j] += A[i][k] * B[k][j];
}
}
return C;
}
void main() {
matrix A = { {1,2,3}, {1,2,1}, {3,2,0} };
matrix B = { {4,1,2},{0,4,3},{1,1,1} };
matrix r = { {-0.7f, 1.3, 0.2} };
matrix q = { { -1.6f, 0.8, 1.1} };
matrix p = { {0.1, 1.7, -1.5} };
matrix r_q = add(r, q);
for (int i = 0; i < r_q.size(); i++) {
for (int j = 0; j < r_q.size(); j++) {
cout << r_q[i][j] << "\t";
}
cout << "\n";
}
matrix a_a = multiple(A, A);
matrix a_a_p = multiple(a_a,p);
getchar();
}
Problems:
The add method does not work correctly: the result contains only one number, the sum of the first items.
Multiplying matrices with the same dimensions (A*A) works correctly. Multiplying matrices with different dimensions (a_a * p) fails with the error "vector subscript out of range".
Thanks for any advice.

The OP chose to implement both matrices and vectors using a std::vector<std::vector<double>>.
This may not be a good design choice in general. In this particular case, to be consistent with the mathematical meaning of the operations involved, all the vectors should be treated (and declared) as "column" matrices (Nx1 matrices):
matrix r = { {-0.7}, {1.3}, {0.2} };
matrix q = { {-1.6}, {0.8}, {1.1} };
matrix p = { {0.1}, {1.7}, {-1.5} };
Then, in the functions that perform the calculations, special attention should be paid to the correct sizes of rows and columns to avoid out of bounds accesses.
matrix add(matrix const &A, matrix const &B)
{
if (A.size() != B.size() || A.size() == 0)
throw std::runtime_error("number of rows mismatch");
size_t columns = A[0].size();
matrix C(A.size(), std::vector<double>(columns, 0.0));
for (size_t i = 0; i < A.size(); i++)
{
if ( A[i].size() != columns || B[i].size() != columns )
throw std::runtime_error("number of columns mismatch");
for (size_t j = 0; j < columns; j++)
{
C[i][j] = A[i][j] + B[i][j];
}
}
return C;
}
matrix multiple(matrix const &A, matrix const &B)
{
if ( A.size() == 0 || B.size() == 0 || B[0].size() == 0)
throw std::runtime_error("size mismatch");
size_t columns = B[0].size();
// every row of B must have the same number of columns
for (size_t k = 0; k < B.size(); k++)
{
if ( B[k].size() != columns )
throw std::runtime_error("number of columns mismatch");
}
matrix C(A.size(), std::vector<double>(columns, 0.0));
for (size_t i = 0; i < A.size(); i++)
{
// the number of columns of A must equal the number of rows of B
if ( A[i].size() != B.size() )
throw std::runtime_error("inner size mismatch");
for (size_t j = 0; j < columns; j++)
{
for (size_t k = 0; k < B.size(); k++)
C[i][j] += A[i][k] * B[k][j];
}
}
return C;
}
The compiler should also have warned the OP about the incorrect use of void main instead of int main and about the comparisons between signed and unsigned integer expressions (I used size_t instead of int).
From a mathematical point of view, it's worth noting that to solve the OP's problem, that is, to calculate the scalar product s = (B(r+q+r), AAp), the only operations that really need to be implemented are the sum of vectors, the product of a matrix by a vector (easier and more efficient than matrix multiplication) and the dot product of two vectors:
t = r + q + r
b = Bt
u = Ap
a = Au
s = (b, a)
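The five steps above can be sketched as follows. This is a minimal illustration rather than a complete solution: vectors are stored as plain std::vector<double> instead of Nx1 matrices, and the function names add, multiply, dot and scalarProduct are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using vec = std::vector<double>;
using matrix = std::vector<vec>;

// Element-wise sum of two vectors of equal length.
vec add(const vec& a, const vec& b) {
    assert(a.size() == b.size());
    vec result(a.size());
    for (std::size_t i = 0; i < a.size(); i++)
        result[i] = a[i] + b[i];
    return result;
}

// Product of a matrix by a vector.
vec multiply(const matrix& M, const vec& v) {
    vec result(M.size(), 0.0);
    for (std::size_t i = 0; i < M.size(); i++) {
        assert(M[i].size() == v.size());
        for (std::size_t k = 0; k < v.size(); k++)
            result[i] += M[i][k] * v[k];
    }
    return result;
}

// Dot product of two vectors of equal length.
double dot(const vec& a, const vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); i++)
        sum += a[i] * b[i];
    return sum;
}

// s = (B(r+q+r), AAp), computed with matrix-vector products only.
double scalarProduct(const matrix& A, const matrix& B,
                     const vec& r, const vec& q, const vec& p) {
    const vec t = add(add(r, q), r); // t = r + q + r
    const vec b = multiply(B, t);    // b = Bt
    const vec u = multiply(A, p);    // u = Ap
    const vec a = multiply(A, u);    // a = Au
    return dot(b, a);                // s = (b, a)
}
```

Computing A(Ap) with two matrix-vector products avoids forming the matrix AA at all, which is what makes this cheaper than the matrix multiplication route.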

Apply memmove function to a 3d array

I am trying to implement the fftshift function (from MATLAB) in C++ with for loops, and it's really time-consuming. Here is my code:
const int a = 3;
const int b = 4;
const int c = 5;
int i, j, k;
int aa = a / 2;
int bb = b / 2;
int cc = c / 2;
double ***te, ***tempa;
te = new double **[a];
tempa = new double **[a];
for (i = 0; i < a; i++)
{
te[i] = new double *[b];
tempa[i] = new double *[b];
for (j = 0; j < b; j++)
{
te[i][j] = new double [c];
tempa[i][j] = new double [c];
for (k = 0; k < c; k++)
{
te[i][j][k] = i + j+k;
}
}
}
/*for the row*/
if (c % 2 == 1)
{
for (i = 0; i < a; i++)
{
for (j = 0; j < b; j++)
{
for (k = 0; k < cc; k++)
{
tempa[i][j][k] = te[i][j][k + cc + 1];
tempa[i][j][k + cc] = te[i][j][k];
tempa[i][j][c - 1] = te[i][j][cc];
}
}
}
}
else
{
for (i = 0; i < a; i++)
{
for (j = 0; j < b; j++)
{
for (k = 0; k < cc; k++)
{
tempa[i][j][k] = te[i][j][k + cc];
tempa[i][j][k + cc] = te[i][j][k];
}
}
}
}
for (i = 0; i < a; i++)
{
for (j = 0; j < b; j++)
{
for (k = 0; k < c; k++)
{
te[i][j][k] = tempa[i][j][k];
}
}
}
/*for the column*/
if (b % 2 == 1)
{
for (i = 0; i < a; i++)
{
for (j = 0; j < bb; j++)
{
for (k = 0; k < c; k++)
{
tempa[i][j][k] = te[i][j + bb + 1][k];
tempa[i][j + bb][k] = te[i][j][k];
tempa[i][b - 1][k] = te[i][bb][k];
}
}
}
}
else
{
for (i = 0; i < a; i++)
{
for (j = 0; j < bb; j++)
{
for (k = 0; k < c; k++)
{
tempa[i][j][k] = te[i][j + bb][k];
tempa[i][j + bb][k] = te[i][j][k];
}
}
}
}
for (i = 0; i < a; i++)
{
for (j = 0; j < b; j++)
{
for (k = 0; k < c; k++)
{
te[i][j][k] = tempa[i][j][k];
}
}
}
/*for the third dimension*/
if (a % 2 == 1)
{
for ( i = 0; i < aa; i++)
{
for (j = 0; j < b; j++)
{
for ( k = 0; k < c; k++)
{
tempa[i][j][k] = te[i + aa + 1][j][k];
tempa[i + aa][j][k] = te[i][j][k];
tempa[a - 1][j][k] = te[aa][j][k];
}
}
}
}
else
{
for (i = 0; i < aa; i++)
{
for ( j = 0; j < b; j++)
{
for ( k = 0; k < c; k++)
{
tempa[i][j][k] = te[i + aa][j][k];
tempa[i + aa][j][k] = te[i][j][k];
}
}
}
}
for (i = 0; i < a; i++)
{
for (j = 0; j < b; j++)
{
for (k = 0; k < c; k++)
{
cout << te[i][j][k] << ' ';
}
cout << endl;
}
cout << "\n";
}
cout << "and then" << endl;
for (i = 0; i < a; i++)
{
for (j = 0; j < b; j++)
{
for (k = 0; k < c; k++)
{
cout << tempa[i][j][k] << ' ';
}
cout << endl;
}
cout << "\n";
}
Now I want to rewrite it with memmove to improve the running time.
For the 3rd dimension, I use:
memmove(tempa, te + aa, sizeof(double)*(a - aa));
memmove(tempa + aa+1, te, sizeof(double)* aa);
This code works well with 1D and 2D arrays, but doesn't work for the 3D array. Also, I do not know how to move the column and row elements with memmove. Can anyone help me with this? Thanks so much!
Now I have modified the code as below:
double ***te, ***tempa1,***tempa2, ***tempa3;
te = new double **[a];
tempa1 = new double **[a];
tempa2 = new double **[a];
tempa3 = new double **[a];
for (i = 0; i < a; i++)
{
te[i] = new double *[b];
tempa1[i] = new double *[b];
tempa2[i] = new double *[b];
tempa3[i] = new double *[b];
for (j = 0; j < b; j++)
{
te[i][j] = new double [c];
tempa1[i][j] = new double [c];
tempa2[i][j] = new double [c];
tempa3[i][j] = new double [c];
for (k = 0; k < c; k++)
{
te[i][j][k] = i + j+k;
}
}
}
/*for the third dimension*/
memmove(tempa1, te + (a-aa), sizeof(double**)*aa);
memmove(tempa1 + aa, te, sizeof(double**)* (a-aa));
//memmove(te, tempa, sizeof(double)*a);
/*for the row*/
for (i = 0; i < a; i++)
{
memmove(tempa2[i], tempa1[i] + (b - bb), sizeof(double*)*bb);
memmove(tempa2[i] + bb, tempa1[i], sizeof(double*)*(b - bb));
}
/*for the column*/
for (j = 0; i < a; i++)
{
for (k = 0; j < b; j++)
{
memmove(tempa3[i][j], tempa2[i][j] + (c - cc), sizeof(double)*cc);
memmove(tempa3[i][j] + cc, tempa2[i][j], sizeof(double)*(c-cc));
}
}
But the problem is that I define too many new dynamic arrays, and the results for tempa3 are also incorrect. Could anyone give some suggestions?

I believe you want something like this:
memmove(tempa, te + (a - aa), sizeof(double**) * aa);
memmove(tempa + aa, te, sizeof(double**) * (a - aa));
or
memmove(tempa, te + aa, sizeof(double**) * (a - aa));
memmove(tempa + (a - aa), te, sizeof(double**) * aa);
depending on whether you want the first half to be "rounded up or down" when swapping (I assume you want it rounded up, in which case it's the first version).
I don't really like your code's design though:
First and foremost, avoid dynamic allocation and use std::vector or std::array when possible.
You could argue it would prevent you from safely using memmove instead of swap for the first dimensions (well, it should work, but I'm not 100% sure it isn't implementation-defined), but I don't think memmove would improve efficiency that much.
Besides, if you want to have an N-dimensional array, I usually prefer avoiding "chaining pointers" (although with your algorithm, you can actually use this structure, so it's not that bad).
For instance, if you're adamant about dynamically allocating your array with new, you might use something like this instead to reduce memory usage (the difference might be negligible though; it's also probably slightly faster, but again, probably negligibly so):
#include <cstddef>
#include <iostream>
typedef std::size_t index_t;
constexpr index_t width = 3;
constexpr index_t height = 4;
constexpr index_t depth = 5;
// the cells (i, j, k) and (i, j, k+1) are adjacent in memory
// the rows (i, j, _) and (i, j+1, _) are adjacent in memory
// the "slices" (i, _, _) and (i+1, _, _) are adjacent in memory
constexpr index_t cell_index(index_t i, index_t j, index_t k) {
return (i * height + j) * depth + k;
}
int main() {
int* array = new int[width * height * depth]();
for( index_t i = 0 ; i < width ; ++i )
for( index_t j = 0 ; j < height ; ++j )
for( index_t k = 0 ; k < depth ; ++k ) {
// do something on the cell (i, j, k)
array[cell_index(i, j, k)] = i + j + k;
std::cout << array[cell_index(i, j, k)] << ' ';
}
std::cout << '\n';
// alternatively you can do this:
//*
for( index_t index = 0 ; index < width * height * depth ; ++index) {
index_t i = index / (height * depth);
index_t j = (index / depth) % height;
index_t k = index % depth;
array[index] = i + j + k;
std::cout << array[index] << ' ';
}
std::cout << '\n';
//*/
delete[] array;
}
The difference is the organization in memory. Here you have one big block of 60*sizeof(int) bytes (240 bytes with the usual 4-byte int), whereas with your method you would have:
- 1 block of 3*sizeof(int**) bytes
- 3 blocks of 4*sizeof(int*) bytes
- 12 blocks of 5*sizeof(int) bytes
(120 more bytes on a 64 bit architecture, two additional indirections for each cell access, and more code for allocating/deallocating all that memory)
Granted, you can't do array[i][j][k] anymore, but still...
The same stands with vectors (you can either make an std::vector<std::vector<std::vector<int>>> or a std::vector<int>)
There is also a bit too much code repetition: your algorithm basically swaps the two halves of your table three times (once for each dimension), but you wrote the same thing three times with a few differences.
There is also too much memory allocation and copying: your array-of-pointers structure would let the algorithm swap whole rows or slices by simply swapping pointers, which avoids the copies, but your code doesn't take advantage of it.
You should choose more explicit variable names, that helps. For instance use width, height, depth instead of a, b, c.
For instance, here is an implementation with vectors (I didn't know MATLAB's fftshift function, but according to your code and this page, I assume it's basically "swapping the corners"):
(also, compile with -std=c++11)
#include <cstddef>
#include <iostream>
#include <vector>
#include <algorithm>
typedef std::size_t index_t;
typedef double element_t;
typedef std::vector<element_t> row_t;
typedef std::vector<row_t> slice_t;
typedef std::vector<slice_t> array_3d_t;
// for one dimension
// you might overload this for a std::vector<double>& and use memmove
// as you originally wanted to do here
template<class T>
void fftshift_dimension(std::vector<T>& row)
{
using std::swap;
const index_t size = row.size();
if(size <= 1)
return;
const index_t halved_size = size / 2;
// swap the two halves
for(index_t i = 0, j = size - halved_size ; i < halved_size ; ++i, ++j)
swap(row[i], row[j]);
// if the size is odd, rotate the right part
if(size % 2)
{
swap(row[halved_size], row[size - 1]);
const index_t n = size - 2;
for(index_t i = halved_size ; i < n ; ++i)
swap(row[i], row[i + 1]);
}
}
// base case
template<class T>
void fftshift(std::vector<T>& array) {
fftshift_dimension(array);
}
// reduce the problem for a dimension N+1 to a dimension N
template<class T>
void fftshift(std::vector<std::vector<T>>& array) {
fftshift_dimension(array);
for(auto& slice : array)
fftshift(slice);
}
// overloads operator<< to print a 3-dimensional array
std::ostream& operator<<(std::ostream& output, const array_3d_t& input) {
const index_t width = input.size();
for(index_t i = 0; i < width ; i++)
{
const index_t height = input[i].size();
for(index_t j = 0; j < height ; j++)
{
const index_t depth = input[i][j].size();
for(index_t k = 0; k < depth; k++)
output << input[i][j][k] << ' ';
output << '\n';
}
output << '\n';
}
return output;
}
int main()
{
constexpr index_t width = 3;
constexpr index_t height = 4;
constexpr index_t depth = 5;
array_3d_t input(width, slice_t(height, row_t(depth)));
// initialization
for(index_t i = 0 ; i < width ; ++i)
for(index_t j = 0 ; j < height ; ++j)
for(index_t k = 0 ; k < depth ; ++k)
input[i][j][k] = i + j + k;
std::cout << input;
// in place fftshift
fftshift(input);
std::cout << "and then" << '\n' << input;
}
You could probably make a slightly more efficient algorithm by avoiding swapping the same cell multiple times and/or by using memmove, but I think it's already fast enough for many uses (on my machine fftshift takes roughly 130ms for a 1000x1000x100 table).
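One way to avoid the repeated swaps, sketched here as an alternative and not as part of the implementation above, is to express the shift of one dimension as a single std::rotate. The name fftshiftRotate is hypothetical. It should produce the same ordering as fftshift_dimension: the element at index i ends up at index (i + size / 2) % size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Circularly shifts one dimension so that the second half comes first.
// For an odd size, the middle element ends up at the start of the second part.
template <class T>
void fftshiftRotate(std::vector<T>& row) {
    const std::size_t size = row.size();
    if (size <= 1)
        return;
    // index of the element that becomes the new first element: ceil(size / 2)
    const std::size_t newFirst = size - size / 2;
    std::rotate(row.begin(), row.begin() + newFirst, row.end());
}
```

Because std::rotate only moves or swaps elements, the same function can be applied to a std::vector of rows or slices, in which case whole inner vectors are exchanged without copying their contents.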
