Matrix Multiplication using SIMD vectors in C++

I am currently reading an article on GitHub about performance optimisation using Clang's extended vector syntax. The author gives the following code snippet:
The templated code below implements the innermost loops that calculate a patch of size regA x regB in matrix C. The code loads regA scalars from matrix A and regB SIMD-width vectors from matrix B. The program uses Clang's extended vector syntax.
/// Compute a RAxRB block of C using a vectorized dot product, where RA is the
/// number of registers to load from matrix A, and RB is the number of registers
/// to load from matrix B.
template <unsigned regsA, unsigned regsB>
void matmul_dot_inner(int k, const float *a, int lda, const float *b, int ldb,
                      float *c, int ldc) {
  float8 csum[regsA][regsB] = {{0.0}};
  for (int p = 0; p < k; p++) {
    // Perform the DOT product.
    for (int bi = 0; bi < regsB; bi++) {
      float8 bb = LoadFloat8(&B(p, bi * 8));
      for (int ai = 0; ai < regsA; ai++) {
        float8 aa = BroadcastFloat8(A(ai, p));
        csum[ai][bi] += aa * bb;
      }
    }
  }
  // Accumulate the results into C.
  for (int ai = 0; ai < regsA; ai++) {
    for (int bi = 0; bi < regsB; bi++) {
      AdduFloat8(&C(ai, bi * 8), csum[ai][bi]);
    }
  }
}
The code outlined below confuses me the most. I read the full article and understood the logic behind using blocking and calculating a small patch, but I can't entirely understand what this bit means:
// Perform the DOT product.
for (int bi = 0; bi < regsB; bi++) {
  float8 bb = LoadFloat8(&B(p, bi * 8)); // the pointer to the range of values?
  for (int ai = 0; ai < regsA; ai++) {
    float8 aa = BroadcastFloat8(A(ai, p));
    csum[ai][bi] += aa * bb;
  }
}
Can anyone elaborate on what's going on here?
The article can be found here.

The 2nd comment on the article links to https://github.com/pytorch/glow/blob/405e632ef138f1d49db9c3181182f7efd837bccc/lib/Backends/CPU/libjit/libjit_defs.h#L26 which defines the float8 type as
typedef float float8 __attribute__((ext_vector_type(8)));
(similar to how immintrin.h defines __m256). It also defines the load/broadcast functions, similar to _mm256_load_ps and _mm256_set1_ps. With that header, you should be able to compile the code in the article.
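Paraphrasing that header (so treat the exact signatures as approximate), the helpers used in the snippet are thin wrappers around the vector type:
typedef float float8 __attribute__((ext_vector_type(8)));

/// Load 8 consecutive floats into a vector register.
inline float8 LoadFloat8(const float *ptr) {
  return *(const float8 *)ptr;
}

/// Broadcast one scalar across all 8 lanes.
inline float8 BroadcastFloat8(float v) {
  return (float8)v; // casting a scalar to an ext_vector_type splats it
}

/// Add a vector into 8 consecutive floats in memory.
inline void AdduFloat8(float *ptr, float8 v) {
  *(float8 *)ptr = *(const float8 *)ptr + v;
}
So BroadcastFloat8(A(ai, p)) splats one element of A into all 8 lanes, LoadFloat8(&B(p, bi * 8)) grabs 8 consecutive floats from a row of B, and csum[ai][bi] += aa * bb performs 8 multiply-adds at once.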
See Clang's native vector documentation. GNU C native vector syntax is a nice way to get an overloaded * operator. I don't know what clang's ext_vector_type does that GCC/clang/ICC float __attribute__((vector_size(32))) (32-byte width) wouldn't.
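For example, a minimal sketch of the GNU-style equivalent (the type name v8sf is my own):
typedef float v8sf __attribute__((vector_size(32))); // 8 floats in 32 bytes

v8sf scale(v8sf v, float s) {
  return v * s; // element-wise; the scalar operand is broadcast to every lane
}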
The article could have added one small section to explain that, but it seems it was more focused on the performance details and wasn't really interested in explaining how to use the syntax.
Most of the discussion in the article is about how to manually vectorize matmul for cache efficiency with SIMD vectors. That part looks good from the quick skim I gave it.
You can do those things with any of several ways to manually vectorize: GNU C native vectors, clang's very similar "extended" vectors, or portable Intel intrinsics.

Related

How to access matrix data in OpenCV by another Mat with locations (indexing)

Suppose I have a Mat of indices (locations) called B. We can say that this Mat has dimensions of 1 x 100, and suppose we have another Mat, called A, full of data, with the same dimensions as B.
Now, I would like to access the data of A through B. Usually I would create a for loop and, for each element of B, take the corresponding element of A. To be concrete, this is the code I would write:
for (int i = 0; i < B.cols; i++) {
    int index = B.at<int>(0, i);
    std::cout << A.at<int>(0, index) << std::endl;
}
Ok, now that I have shown you what I can do, I ask you if there is a way to access matrix A, still using the B indices, in a smarter and faster way, as one can do in Python thanks to the numpy.take() function.
This operation is called remapping. In OpenCV, you can use function cv::remap for this purpose.
Below I present a very basic example of how the remap algorithm works; please note that I don't handle border conditions in this example, but cv::remap does - it allows you to use mirroring, clamping, etc. to specify what happens if the indices exceed the dimensions of the image. I also don't show how interpolation is done; check the cv::remap documentation that I've linked to above.
If you are going to use remapping you will probably have to convert the indices to floating point; you will also have to introduce another array of indices that should be trivial (all equal to 0) if your image is one-dimensional. If this starts to be a problem because of performance, I'd suggest you implement the 1-D remap equivalent yourself. But benchmark first before optimizing, of course.
For all the details, check the documentation, which covers everything you need to know to use the algorithm.
cv::Mat_<float> remap_example(const cv::Mat_<float>& image,
                              const cv::Mat_<float>& positions_x,
                              const cv::Mat_<float>& positions_y)
{
    // sizes of the positions arrays must be the same
    int size_x = positions_x.cols;
    int size_y = positions_x.rows;

    cv::Mat_<float> out(size_y, size_x);

    for (int y = 0; y < size_y; ++y)
        for (int x = 0; x < size_x; ++x)
        {
            // Mat_ element access is (row, col), i.e. (y, x)
            float ps_x = positions_x(y, x);
            float ps_y = positions_y(y, x);

            // Use interpolation to determine the intensity at image(ps_y, ps_x);
            // at this point also handle border conditions, e.g.:
            // float interpolated = bilinear_interpolation(image, ps_x, ps_y);
            float interpolated = 0.0f; // placeholder; interpolation omitted here
            out(y, x) = interpolated;
        }

    return out;
}
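For the 1 x 100 case from the question, a minimal sketch of calling cv::remap directly could look like this (assuming A holds the data, B holds the integer indices, and A's element type is one that cv::remap supports, e.g. CV_8U or CV_32F; INTER_NEAREST turns the lookup into an exact index pick):
cv::Mat mapX, mapY, result;
B.convertTo(mapX, CV_32F);                  // x-coordinates: the indices in B
mapY = cv::Mat::zeros(mapX.size(), CV_32F); // a single row, so y is always 0
cv::remap(A, result, mapX, mapY, cv::INTER_NEAREST);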
One fast way is to use pointers for both A (data) and B (indexes).
const int* pA = A.ptr<int>(0);
const int* pIndexB = B.ptr<int>(0);

int sum = 0;
for (int i = 0; i < B.cols; ++i)
{
    sum += pA[*pIndexB++];
}
Note: Be careful with the pixel type; in this case (as you wrote in your code) it is int!
Note2: Using cout for each point access makes the optimization useless!
Note3: In this article Satya compares four methods for pixel access, and the fastest seems to be forEach: https://www.learnopencv.com/parallel-pixel-access-in-opencv-using-foreach/

NEON increasing run time

I am currently trying to optimize some of my image processing code to use NEON instructions.
Let's say I have two very large float arrays and I want to multiply each value of the first one with three consecutive values of the second one. (The second one is three times as large.)
float* l_ptrGauss_pf32 = [...];
float* l_ptrLaplace_pf32 = [...]; // Three times as large

for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
    float l_weight_f32 = *l_ptrGauss_pf32;

    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;

    ++l_ptrLaplace_pf32;
    ++l_ptrGauss_pf32;
}
So when I replace the above code with NEON intrinsics, the run time is about 10% longer:
float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;

for (uint64_t k = 0; k < (l_lastPixelInBlock_ui64 / 4); ++k)
{
    l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
    l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
    l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32 + 4);
    l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32 + 8);

    l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
    l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
    l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);

    vst1q_f32(l_ptrLaplace_pf32,     l_laplElem1_f32x4);
    vst1q_f32(l_ptrLaplace_pf32 + 4, l_laplElem2_f32x4);
    vst1q_f32(l_ptrLaplace_pf32 + 8, l_laplElem3_f32x4);

    l_ptrLaplace_pf32 += 12;
    l_ptrGauss_pf32 += 4;
}
Both versions are compiled with -Ofast using Apple LLVM 8.0. Is the compiler really so good at optimizing this code even without NEON intrinsics?
Your code contains relatively many vector load operations and only a few multiplications. So I would recommend optimizing the loading of the vectors. There are two steps:
Use aligned memory in your arrays.
Use prefetch.
In order to do this I would recommend using the following function:
#include <arm_neon.h>

inline float32x4_t Load(const float * p)
{
    // Prefetch data a few cache lines ahead:
    __builtin_prefetch(p + 256);
    // Tell the compiler that the address is 16-byte aligned:
    const float * _p = (const float *)__builtin_assume_aligned(p, 16);
    return vld1q_f32(_p);
}
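With that helper, the loads in the loop above become (a sketch, assuming the buffers really are 16-byte aligned):
l_gaussElem_f32x4 = Load(l_ptrGauss_pf32);
l_laplElem1_f32x4 = Load(l_ptrLaplace_pf32);
l_laplElem2_f32x4 = Load(l_ptrLaplace_pf32 + 4);
l_laplElem3_f32x4 = Load(l_ptrLaplace_pf32 + 8);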

iOS - C/C++ - Speed up Integral Image calculation

I have a method which calculates an integral image (description here) commonly used in computer vision applications.
float *Integral(unsigned char *grayscaleSource, int height, int width, int widthStep)
{
    // convert the image to single channel 32f
    unsigned char *img = grayscaleSource;

    // set up variables for data access
    int step = widthStep / sizeof(float);
    uint8_t *data = (uint8_t *)img;
    float *i_data = (float *)malloc(height * width * sizeof(float));

    // first row only
    float rs = 0.0f;
    for (int j = 0; j < width; j++)
    {
        rs += (float)data[j];
        i_data[j] = rs;
    }

    // remaining cells are sum above and to the left
    for (int i = 1; i < height; ++i)
    {
        rs = 0.0f;
        for (int j = 0; j < width; ++j)
        {
            rs += data[i * step + j];
            i_data[i * step + j] = rs + i_data[(i - 1) * step + j];
        }
    }

    // return the integral image
    return i_data;
}
I am trying to make it as fast as possible. It seems to me like this should be able to take advantage of Apple's Accelerate.framework, or perhaps ARM's NEON intrinsics, but I can't see exactly how. It seems like that nested loop is potentially quite slow (for real-time applications at least).
Does anyone think this is possible to speed up using any other techniques?
You can certainly vectorize the row by row summation. That is vDSP_vadd(). The horizontal direction is vDSP_vrsum().
If you want to write your own vector code, the horizontal sum might be sped up by something like psadbw, but that is Intel. Also, take a look at prefix sum algorithms, which are famously parallelizable.
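As a minimal sketch of the vertical part (assuming the per-row running sums are still computed as in the question, with rowSum being a hypothetical buffer holding the current row's prefix sums), a single vDSP_vadd call can replace the inner loop's addition of the previous integral row:
#include <Accelerate/Accelerate.h>

// For row i: i_data[i*width + j] = rowSum[j] + i_data[(i-1)*width + j], all j at once.
vDSP_vadd(rowSum, 1, &i_data[(i - 1) * width], 1, &i_data[i * width], 1, width);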

Fastest way to calculate minimum euclidean distance between two matrices containing high dimensional vectors

I started a similar question on another thread, but then I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I want.
I have two matrices. Matrix a is 2782x128 and matrix b is 4000x128, both holding unsigned char values. The values are stored in a single array. For each vector in a, I need the index of the vector in b with the closest Euclidean distance.
Ok, now my code to achieve this:
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <cstdio>
#include <math.h>
#include <time.h>
#include <sys/timeb.h>
#include <climits> // for INT_MAX
#include <iostream>
#include <fstream>
#include "main.h"

using namespace std;

int main(int argc, char* argv[])
{
    int a_size;
    unsigned char* a = NULL;
    read_matrix(&a, a_size, "matrixa");

    int b_size;
    unsigned char* b = NULL;
    read_matrix(&b, b_size, "matrixb");

    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency(&liPerfFreq);
    QueryPerformanceCounter(&liStart);

    int* indexes = NULL;
    min_distance_loop(&indexes, b, b_size, a, a_size);

    QueryPerformanceCounter(&liEnd);
    cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

    delete[] a;       // allocated with new[] in read_matrix
    delete[] b;
    free(indexes);    // allocated with malloc in min_distance_loop

    return 0;
}
void read_matrix(unsigned char** matrix, int& matrix_size, const char* matrixPath)
{
    FILE* pFile = fopen(matrixPath, "r");
    fscanf(pFile, "%d", &matrix_size);

    *matrix = new unsigned char[matrix_size * 128];
    for (int i = 0; i < matrix_size * 128; ++i)
    {
        unsigned int matPtr;
        fscanf(pFile, "%u", &matPtr);
        (*matrix)[i] = (unsigned char)matPtr; // note: (*matrix)[i], not matrix[i]
    }
    fclose(pFile);
}
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;

    *indexes = (int*)malloc(a_size * sizeof(int));
    int dataIndex = 0;
    int vocIndex = 0;
    int min_distance;
    int distance;
    int multiply;
    unsigned char* dataPtr;
    unsigned char* vocPtr;
    for (int i = 0; i < a_size; ++i)
    {
        min_distance = INT_MAX; // distance is an int, so INT_MAX, not LONG_MAX
        for (int j = 0; j < b_size; ++j)
        {
            distance = 0;
            dataPtr = &a[dataIndex];
            vocPtr = &b[vocIndex];
            for (int k = 0; k < descrSize; ++k)
            {
                multiply = *dataPtr++ - *vocPtr++;
                distance += multiply * multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance > min_distance)
                    break;
            }
            // if the distance is smaller
            if (distance < min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
And attached are the files with sample matrices.
matrixa
matrixb
I am using windows.h just to measure the elapsed time, so if you want to test the code on a platform other than Windows, just change the windows.h header and change the way of measuring time.
This code takes about 0.5 seconds on my computer. The problem is that I have another code in Matlab that does this same thing in 0.05 seconds. In my experiments, I am receiving several matrices like matrix a every second, so 0.5 seconds is too much.
Now the matlab code to calculate this:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);
Ok. The Matlab code is using the identity (x-a)^2 = x^2 + a^2 - 2*a*x.
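Written out for whole descriptor vectors, the same identity reads
||x - a||^2 = sum_k (x_k - a_k)^2 = ||x||^2 + ||a||^2 - 2*(x . a),
which is why the Matlab code only needs the row norms aa and bb plus the single matrix product ab = a*b'.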
So my next attempt was to do the same thing. I replaced my own code with those calculations, but it took approximately 1.2 seconds.
Then, I tried to use different external libraries. The first attempt was Eigen:
const int descrSize = 128;
MatrixXi a(a_size, descrSize);
MatrixXi b(b_size, descrSize);
MatrixXi ab(a_size, b_size);

unsigned char* dataPtr = matrixa;
for (int i = 0; i < a_size; ++i)
{
    for (int j = 0; j < descrSize; ++j)
    {
        a(i, j) = (int)*dataPtr++;
    }
}
unsigned char* vocPtr = matrixb;
for (int i = 0; i < b_size; ++i)
{
    for (int j = 0; j < descrSize; ++j)
    {
        b(i, j) = (int)*vocPtr++;
    }
}

ab = a * b.transpose();
MatrixXi aa = a.cwiseProduct(a).rowwise().sum(); // cwiseProduct returns a new
MatrixXi bb = b.cwiseProduct(b).rowwise().sum(); // matrix; it is not in-place

// d holds the squared distances; no sqrt or abs is needed to find the argmin.
MatrixXi d = aa.replicate(1, b_size) + bb.transpose().replicate(a_size, 1) - 2 * ab;

int* index = (int*)malloc(a_size * sizeof(int));
for (int i = 0; i < a_size; ++i)
{
    d.row(i).minCoeff(&index[i]);
}
This Eigen code takes approximately 1.2 seconds just for the line ab = a*b.transpose();
Similar code using OpenCV was also tried, and there the cost of ab = a*b.transpose(); was 0.65 seconds.
So, it is really annoying that Matlab is able to do this same thing so quickly and I am not able to in C++! Of course being able to run my experiment would be great, but I think the lack of knowledge is what is really annoying me. How can I achieve at least the same performance as in Matlab? Any kind of solution is welcome. I mean, any external library (free if possible), loop unrolling, templates, SSE instructions (I know they exist), cache tricks. As I said, my main purpose is to increase my knowledge so that I can code things like this with faster performance.
Thanks in advance
EDIT: more code suggested by David Hammen. I cast the arrays to int before making any calculations. Here is the code:
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;

    int* a_int;
    int* b_int;
    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency(&liPerfFreq);
    QueryPerformanceCounter(&liStart);

    a_int = (int*)malloc(a_size * descrSize * sizeof(int));
    b_int = (int*)malloc(b_size * descrSize * sizeof(int));

    for (int i = 0; i < descrSize * a_size; ++i)
        a_int[i] = (int)a[i];
    for (int i = 0; i < descrSize * b_size; ++i)
        b_int[i] = (int)b[i];

    QueryPerformanceCounter(&liEnd);
    cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

    *indexes = (int*)malloc(a_size * sizeof(int));
    int dataIndex = 0;
    int vocIndex = 0;
    int min_distance;
    int distance;
    int multiply;
    int* dataPtr;
    int* vocPtr;
    for (int i = 0; i < a_size; ++i)
    {
        min_distance = INT_MAX;
        for (int j = 0; j < b_size; ++j)
        {
            distance = 0;
            dataPtr = &a_int[dataIndex];
            vocPtr = &b_int[vocIndex];
            for (int k = 0; k < descrSize; ++k)
            {
                multiply = *dataPtr++ - *vocPtr++;
                distance += multiply * multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance > min_distance)
                    break;
            }
            // if the distance is smaller
            if (distance < min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
The entire process now takes 0.6 seconds, and the casting loops at the beginning take 0.001 seconds. Maybe I did something wrong?
EDIT2: What about Eigen? When I look for external libs people always talk about Eigen and its speed. Did I do something wrong? Here is a simple code using Eigen that shows it is not so fast. Maybe I am missing some config or some flag, or ...
MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X = A * B; // the product whose time is reported below
This code takes about 0.9 seconds.
As you observed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05 s. This represents a rate of 57 GFLOPS, showing that it is not only using vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp with gcc). On my 5-year-old computer (2.66 GHz Core2), using floats and 4 threads, your product takes about 0.053 s, and 0.16 s without OpenMP, so there must be something wrong with your compilation flags. To summarize, to get the best of Eigen:
compile in 64bits mode
use floats (doubles are twice as slow owing to vectorization)
enable OpenMP
if your CPU has hyper-threading, then either disable it or define the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important, otherwise the performance will be very bad!)
if you have other tasks running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
use the most recent compiler that you can; GCC, clang and ICC are best, MSVC is usually slower.
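For example, a plausible gcc invocation consistent with this advice (the exact flags depend on your toolchain) would be:
g++ -O3 -fopenmp -march=native -DNDEBUG -m64 main.cpp -o main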
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char to int conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a and char* b arrays. Work with these int* arrays rather than your char* arrays.
Another problem might be your use of int versus long. I don't work on Windows, so this might not be applicable. On the machines I work on, int is 32 bits and long is now 64 bits. 32 bits is more than enough because 255*255*128 < 256*256*128 = 2^23.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2782 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints - and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;. I suspect that this is actually a dis-optimization. Having if tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a and b matrices into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into the 32 KB L1 data cache of a modern Intel chip. Since 250 divides 4000, look into logically splitting that b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break; back in, but do so at a chunk level rather than at the byte-by-byte level; a sketch of this chunking appears below.
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.
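A minimal sketch of the chunking idea (a hypothetical fragment: it assumes the a_int/b_int arrays and the *indexes allocation from the edited code above, plus a best_dist array of length a_size initialized to INT_MAX):
const int chunkRows = 250; // 16 chunks; one chunk of b stays cache-resident
for (int j0 = 0; j0 < b_size; j0 += chunkRows)
{
    const int j1 = std::min(j0 + chunkRows, b_size); // std::min from <algorithm>
    for (int i = 0; i < a_size; ++i)   // reuse the hot b-chunk for every row of a
    {
        const int* ai = &a_int[i * descrSize];
        for (int j = j0; j < j1; ++j)
        {
            const int* bj = &b_int[j * descrSize];
            int dist = 0;
            for (int k = 0; k < descrSize; ++k) // branch-free inner product
            {
                const int diff = ai[k] - bj[k];
                dist += diff * diff;
            }
            if (dist < best_dist[i])
            {
                best_dist[i] = dist;
                (*indexes)[i] = j;
            }
        }
    }
}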
Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.
Your matrix already IS stored transposed. By transposing it back into the normal order and then using a normal matrix multiply, you are absolutely killing performance.
Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior). And pass your compiler whatever options it has for enabling auto-vectorization.
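For this problem, that amounts to the following access pattern (a sketch; since both a_int and b_int store one 128-element descriptor per row, both operands are walked contiguously, and ab is a hypothetical a_size x b_size output buffer):
for (int i = 0; i < a_size; ++i)
    for (int j = 0; j < b_size; ++j)
    {
        int dot = 0;
        for (int k = 0; k < descrSize; ++k) // both rows read sequentially
            dot += a_int[i * descrSize + k] * b_int[j * descrSize + k];
        ab[i * b_size + j] = dot;           // entry (i, j) of a * b^T
    }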

Complex matrix exponential in C++

Is it actually possible to calculate the matrix exponential of a complex matrix in C/C++?
I've managed to take the product of two complex matrices using BLAS functions from the GNU Scientific Library. For matC = matA * matB:
gsl_blas_zgemm (CblasNoTrans, CblasNoTrans, GSL_COMPLEX_ONE, matA, matB, GSL_COMPLEX_ZERO, matC);
And I've managed to get the matrix exponential of a matrix by using the undocumented
gsl_linalg_exponential_ss(&m.matrix, &em.matrix, .01);
But this doesn't seem to accept complex arguments.
Is there any way to do this? I used to think C++ was capable of anything. Now I think it's outdated and cryptic...
Several options:
modify the gsl_linalg_exponential_ss code to accept complex matrices
write your complex NxN matrix as real 2N x 2N matrix
Diagonalize the matrix, take the exponential of the eigenvalues, and rotate the matrix back to the original basis
Using the complex matrix product that is available, implement the matrix exponential according to its definition: exp(A) = sum_(n=0)^(infinity) A^n/(n!)
You have to check which methods are appropriate for your problem.
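One more option worth knowing about (my addition, not from the list above): Eigen's unsupported MatrixFunctions module provides a matrix exponential that accepts complex matrices directly. A minimal sketch:
#include <iostream>
#include <complex>
#include <Eigen/Dense>
#include <unsupported/Eigen/MatrixFunctions> // provides MatrixBase::exp()

int main()
{
    Eigen::Matrix2cd A;
    A << std::complex<double>(0.0, 1.0), std::complex<double>(1.0, 0.0),
         std::complex<double>(1.0, 0.0), std::complex<double>(0.0, -1.0);

    Eigen::Matrix2cd eA = A.exp(); // complex matrix exponential
    std::cout << eA << std::endl;
}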
C++ is a general purpose language. As mentioned above, if you need specific functionality you have to find a library that can do it or implement it yourself. Alternatively you can use software like Matlab and Mathematica. If that's too expensive there are open source alternatives, e.g. Sage and Octave.
"I used to think c++ was capable of anything" - if a general-purpose language had built-in complex matrix math in its core, then something would be wrong with that language.
For such very specific tasks there is a well-accepted solution: libraries. Either write your own or, much better, use an already existing one.
I myself rarely need complex matrices in C++; I have always used Matlab and similar tools for that. However, this http://www.mathtools.net/C_C__/Mathematics/index.html might be of interest to you if you know Matlab.
There are a couple of other libraries which might be of help:
http://eigen.tuxfamily.org/index.php?title=Main_Page
http://math.nist.gov/lapack++/
I was also thinking of doing the same; writing your complex NxN matrix as a real 2N x 2N matrix is the best way to solve the problem, and then use gsl_linalg_exponential_ss().
Suppose A = Ar + i*Ai, where A is the complex matrix and Ar and Ai are real matrices. Then write the new matrix B = [Ar Ai; -Ai Ar] (here the matrix is written in Matlab notation). Now calculate the exponential of B, that is, eB = [eB1 eB2; eB3 eB4]; then the exponential of A is given by eA = eB1 + i*eB2 (summing the matrices eB1 and i*eB2).
I have written code to calculate the matrix exponential of complex matrices with the gsl function gsl_linalg_exponential_ss(&m.matrix, &em.matrix, .01);
Here you have the complete code and the compilation results. I have checked the result against Matlab and the results agree.
#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_complex.h>
#include <gsl/gsl_complex_math.h>

void my_gsl_complex_matrix_exponential(gsl_matrix_complex *eA, gsl_matrix_complex *A, int dimx)
{
    int j, k = 0;
    gsl_complex temp;
    gsl_matrix *matreal = gsl_matrix_alloc(2 * dimx, 2 * dimx);
    gsl_matrix *expmatreal = gsl_matrix_alloc(2 * dimx, 2 * dimx);

    // Convert the complex matrix into a real one using A=[Areal, Aimag; -Aimag, Areal]
    for (j = 0; j < dimx; j++)
        for (k = 0; k < dimx; k++)
        {
            temp = gsl_matrix_complex_get(A, j, k);
            gsl_matrix_set(matreal, j, k, GSL_REAL(temp));
            gsl_matrix_set(matreal, dimx + j, dimx + k, GSL_REAL(temp));
            gsl_matrix_set(matreal, j, dimx + k, GSL_IMAG(temp));
            gsl_matrix_set(matreal, dimx + j, k, -GSL_IMAG(temp));
        }

    gsl_linalg_exponential_ss(matreal, expmatreal, .01);

    double realp;
    double imagp;
    for (j = 0; j < dimx; j++)
        for (k = 0; k < dimx; k++)
        {
            realp = gsl_matrix_get(expmatreal, j, k);
            imagp = gsl_matrix_get(expmatreal, j, dimx + k);
            gsl_matrix_complex_set(eA, j, k, gsl_complex_rect(realp, imagp));
        }

    gsl_matrix_free(matreal);
    gsl_matrix_free(expmatreal);
}

int main()
{
    int dimx = 4;
    int i, j;
    gsl_matrix_complex *A = gsl_matrix_complex_alloc(dimx, dimx);
    gsl_matrix_complex *eA = gsl_matrix_complex_alloc(dimx, dimx);

    for (i = 0; i < dimx; i++)
    {
        for (j = 0; j < dimx; j++)
        {
            gsl_matrix_complex_set(A, i, j, gsl_complex_rect(i + j, i - j));
            if ((i - j) >= 0)
                printf("%d+%di ", i + j, i - j);
            else
                printf("%d%di ", i + j, i - j);
        }
        printf(";\n");
    }

    my_gsl_complex_matrix_exponential(eA, A, dimx);

    printf("\nPrinting the complex matrix exponential\n");
    gsl_complex compnum;
    for (i = 0; i < dimx; i++)
    {
        for (j = 0; j < dimx; j++)
        {
            compnum = gsl_matrix_complex_get(eA, i, j);
            if (GSL_IMAG(compnum) >= 0)
                printf("%f+%fi\t ", GSL_REAL(compnum), GSL_IMAG(compnum));
            else
                printf("%f%fi\t ", GSL_REAL(compnum), GSL_IMAG(compnum));
        }
        printf("\n");
    }
    return 0;
}