Memory Allocation and Mir

I'm working in an environment where I need to minimize dynamic allocations, and I was curious how the sliced function works in terms of memory allocation. I've looked around and haven't found much.
Is it possible to have slices allocated on the stack if they are constant size?

import std;
@nogc void main()
{
import mir.ndslice;
import mir.ndslice.topology;
import mir.blas : gemv, gemm;
scope double[9] aData = 0;
scope double[3] vData = [1, 2, 1];
scope double[3] cData;
auto A = aData.sliced.sliced(3, 3);
auto v = vData.sliced.sliced(3, 1);
auto c = cData.sliced.sliced(3, 1);
A.diagonal[0] = 2;
A.diagonal[1] = 1;
A.diagonal[2] = 0;
// c = 1 * A * v + 0 * c
gemm!double(1, A, v, 0, c);
}
I experimented around for quite a bit and got this as my zero-GC (minus writeln, of course) linear algebra. It looks pretty clunky, and I was hoping for something nicer, like what slice gives you.
Again, I only recently found out about D and was trying to see whether it had any chance of doing this stuff deterministically for some robotics projects where linear algebra algorithms are necessary.
I probably shouldn't be using gemm for a matrix-vector product, but I didn't feel like digging up gemv. Not as elegant as something like slice!double(3, 3).
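For reference, here's a sketch of what the gemv version might look like. The helper name matVec is mine, and I'm assuming mir.blas.gemv follows the same (alpha, A, x, beta, y) convention as gemm, taking 1-D slices for x and y; I haven't verified the exact overload:
import mir.ndslice;
import mir.blas : gemv;

@nogc void matVec()
{
    scope double[9] aData = 0;
    scope double[3] vData = [1, 2, 1];
    scope double[3] cData;
    auto A = aData[].sliced(3, 3); // [] takes a slice of the stack array, no allocation
    auto v = vData[].sliced;       // 1-D vector views
    auto c = cData[].sliced;
    A.diagonal[0] = 2;
    A.diagonal[1] = 1;
    // c = 1 * A * v + 0 * c
    gemv!double(1, A, v, 0, c);
}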
Adam said something about slices never being allocated but I'm still wary about letting the GC loose at all.
I typically work with a lot of DC motors doing kinematic simulations. Millisecond-level timing is critical, and D seemed to promise it could deliver that, so I checked it out because C++ is painful to work with.
Though I have to ask why this doesn't work.
double[9] a_data;
scope A = a_data.sliced(3, 3);
when this does...
auto a_data = new double[9]; // Gah new?!
scope A = a_data.sliced(3, 3);
While the first would be far more elegant, without the new.
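For what it's worth, I believe the difference is that sliced is templated on a dynamic array (double[]), and a static array like double[9] doesn't match that parameter directly; slicing it with [] yields a double[] view of the same stack storage, with no allocation:
double[9] a_data;
scope A = a_data[].sliced(3, 3); // a_data[] is a double[] view of the stack array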

Related

How to efficiently store matmul results into another matrix in Eigen

I would like to store multiple matmul results as row vector into another matrix, but my current code seems to take a lot of memory space. Here is my pseudo code:
for (int i = 0; i < C_row; ++i) {
C.row(i) = (A.transpose() * B).reshaped(1, C_col);
}
In this case, C is actually a Map of a pre-allocated array, declared as Map<Matrix<float, -1, -1, RowMajor>> C(C_array, C_row, C_col);.
Therefore, I expect the calculated matmul results can directly go to memory space of C and do not create temporary copies. In other words, the total memory usage should be the same with or without the above code. But I found that with the above code, the memory usage is increased significantly.
I tried to use C.row(i).noalias() to directly assign the results to each row of C, but there was no difference in memory usage. How can I make this code more efficient so that it takes less memory?
The reshaped call is the culprit. It cannot be folded into the matrix multiplication, so it results in a temporary allocation for the product. Ideally you would put it on the left-hand side of the assignment:
C.row(i).reshaped(A.cols(), B.cols()).noalias() = A.transpose() * B;
However, that does not compile. Reshaped doesn't seem to fulfil the required interface. It's a pretty new addition to Eigen, so I'm not overly surprised. You might want to open a feature request on their bug tracker.
Anyway, as a workaround, try this:
Eigen::Map<Eigen::MatrixXf> reshaped(C.row(i).data(), A.cols(), B.cols());
reshaped.noalias() = A.transpose() * B;
(Note that the Map's scalar type has to match C's float here, hence MatrixXf rather than MatrixXd.)
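For reference, here is a minimal, self-contained sketch of that workaround; the matrix sizes and the row-major map are assumptions chosen to mirror the question's setup:
#include <Eigen/Dense>
#include <vector>

int main()
{
    // Assumed sizes: each A.transpose() * B product is A_cols x B_cols,
    // flattened into one C_col-wide row of C.
    const int A_rows = 4, A_cols = 2, B_cols = 3;
    const int C_row = 5, C_col = A_cols * B_cols;

    Eigen::MatrixXf A = Eigen::MatrixXf::Random(A_rows, A_cols);
    Eigen::MatrixXf B = Eigen::MatrixXf::Random(A_rows, B_cols);

    // Pre-allocated storage mapped as a row-major C_row x C_col matrix, as in the question.
    std::vector<float> C_array(C_row * C_col);
    Eigen::Map<Eigen::Matrix<float, -1, -1, Eigen::RowMajor>> C(C_array.data(), C_row, C_col);

    for (int i = 0; i < C_row; ++i)
    {
        // View row i as an A_cols x B_cols matrix and write the product straight into it.
        Eigen::Map<Eigen::MatrixXf> reshaped(C.row(i).data(), A.cols(), B.cols());
        reshaped.noalias() = A.transpose() * B;
    }
}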

Eigen: Efficiently storing the output of a matrix evaluation in a raw pointer

I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to provide a function of the form:
const int N = ...;
T * func(T * x) {
// TODO Put N elements in x
return x + N;
}
where this function should write the result into x, and then return x.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
// Do a lot of operations that result in some matrices like
Eigen::Matrix<T, N, 1 > A = ...
Eigen::Matrix<T, N, 1 > B = ...
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind and then view the results with KCachegrind, the line
constraint = A - B;
is almost always the bottleneck. This is sort of understandable, because such a line could potentially be doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for(int i=0; i<N; ++i)
x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling and explicit vectorization (both depend on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler), and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies on this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).

Error Iterating Through Members of a Struct in Vector

I have a struct and two vectors in my .h file:
struct FTerm {
int m_delay;
double m_weight;
};
std::vector<FTerm> m_xterms;
std::vector<FTerm> m_yterms;
I've already read in a file to populate m_xterms and m_yterms, and I'm trying to iterate through those values:
vector<FTerm>::iterator terms;
for (terms = m_xterms.begin(); terms < m_xterms.end(); terms++)
{
int delaylength = m_xterms->m_delay * 2; // Assume stereo
double weight = m_xterms->m_weight;
}
Although I'm pretty sure I have the logic wrong, I currently get the error "expression must have a pointer type". I've been stuck on this for a while, thanks.
Change
int delaylength = m_xterms->m_delay * 2;
double weight = m_xterms->m_weight;
to
int delaylength = terms->m_delay * 2;
// ^^^^^
double weight = terms->m_weight;
// ^^^^^
as you want to access values through
vector<FTerm>::iterator terms;
within the loop
for (terms = m_xterms.begin(); terms < m_xterms.end(); terms++)
// ^^^^^
"Although I'm pretty sure I have the logic wrong, ..."
That can't be answered, unless you give more context about the requirements for the logic.
Along with the problem πάντα ῥεῖ pointed out, your code currently has another problem: it simply doesn't accomplish anything except wasting some time.
Consider:
for (terms = m_xterms.begin(); terms < m_xterms.end(); terms++)
{
int delaylength = m_xterms->m_delay * 2; // Assume stereo
double weight = m_xterms->m_weight;
}
Both delaylength and weight are created upon entry to the block, and destroyed on exit--so we create a pair of values, then destroy them, and repeat for as many items as there are in the vector--but never do anything with the values we compute. They're just computed, then destroyed.
Assuming you fix that, I'd also write the code enough differently that this problem simply isn't likely to happen to start with. For example, let's assume you really wanted to modify each item in your array, instead of just computing something from it and throwing away the result. You could do that with code like this:
std::transform(m_xterms.begin(), m_xterms.end(), // source
m_xterms.begin(), // destination
[](FTerm const &t) -> FTerm { return {t.m_delay * 2, t.m_weight}; }); // computation
(Note the explicit -> FTerm return type: a lambda can't deduce its return type from a braced-init-list.)
Now the code actually accomplishes something, and it seems a lot less likely that we'd end up accidentally writing it incorrectly.
Bottom line: standard algorithms are your friends. Unlike human friends, they love to be used.
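For completeness, here is a minimal runnable sketch of the corrected loop (the sample values are made up); a range-based for avoids the iterator mistake entirely, and the computed values are actually used:
#include <iostream>
#include <vector>

struct FTerm {
    int m_delay;
    double m_weight;
};

int main()
{
    std::vector<FTerm> m_xterms{{10, 0.5}, {20, 0.25}}; // made-up sample data

    for (const FTerm &term : m_xterms)
    {
        int delaylength = term.m_delay * 2; // assume stereo
        double weight = term.m_weight;
        std::cout << delaylength << ' ' << weight << '\n'; // actually use the results
    }
}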

Using arrays in C++ CT reconstruction algorithm

I'm developing a CT reconstruction algorithm using C++. I'm using C++ because I need to use a library written in C++ that will let me read/write a specific file format.
This reconstruction algorithm involves working with 3D and 2D images. I've written similar algorithms in C and MATLAB using arrays. However, I've read that, in C++, arrays are "evil" (see http://www.parashift.com/c++-faq-lite/containers.html). The way I use arrays to manipulate images (in C) is the following (this creates a 3D array that will be used as a 3D image):
int i,j;
int *** image; /* another way to make a 5x12x27 array */
image = (int ***) malloc(depth * sizeof(int **));
for (i = 0; i < depth; ++i) {
image[i] = (int **) malloc(height * sizeof(int *));
for (j = 0; j < height; ++j) {
image[i][j] = (int *) malloc(width * sizeof(int));
}
}
or I use 1-dimensional arrays and do index arithmetic to simulate 3D data. At the end, I free the necessary memory.
I have read that there are equivalent ways of doing this in C++. I've seen that I could create my own matrix class that uses vectors of vectors (from STL) or that I could use the boost-matrix library. The problem is that this makes my code look bloated.
My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?
Working at a higher level always saves you time given equal familiarity with both types of code. It's usually simpler and you might not need to bother with some tasks like deleting.
That said, if you already have the C code and are basically converting malloc to new (or leaving it as-is) then it makes perfect sense to leave it. No reason to duplicate work for no advantage. If you're going to be extending it and adding more features you might want to think about a rewrite. Image manipulation is often an intensive process and I see straight code like yours all the time for performance reasons.
Arrays have a purpose, vectors have a purpose, and so on. You seem to understand the tradeoffs so I won't go into that. Understanding the context of what you're doing is necessary; anyone who says that arrays are always bad or vectors are always too much overhead (etc.) probably doesn't know what they're talking about.
I know it looks difficult at first, and your code seems simple - but eventually yours is going to hurt.
Use a library like boost, or consider a custom 3D image toolkit like vtk
If the 3D canvas has a fixed size, you won't win much by using containers. I would avoid allocating the matrix in small chunks as you do, though, and instead just do
#define DIM_X 5
#define DIM_Y 12
#define DIM_Z 27
#define SIZE (DIM_X * DIM_Y * DIM_Z)
#define OFFS(x, y, z) ((x) + (y) * DIM_X + (z) * (DIM_Y * DIM_X))
and then
class Image3D { // a class name can't start with a digit, so "3DImage" becomes Image3D
public:
    int & operator()(int x, int y, int z) { return pixel_data[OFFS(x, y, z)]; }
private:
    int pixel_data[SIZE];
};
after which you can do e.g.
Image3D img;
img(1,1,1) = 10;
img(2,2,2) = img(1,1,1) + 2;
without any memory allocation or algorithm overhead. But as some others have noted, the choice of data structure also depends on what kind of algorithms you are planning to run on the images. You can, however, always adapt a third-party algorithm (e.g. for matrix inversion) with a proper facade class if needed; and this flat representation is much faster than the nested arrays of pointers you wrote.
If the dimensions are not fixed at compile time, you can obviously still use exactly the same approach; it's just that you need to allocate pixel_data dynamically and store the dimensions in the Image3D object itself. Here's that version:
class Image3D {
public:
    Image3D(int xd, int yd, int zd)
        : dim_x(xd), dim_y(yd), dim_z(zd),
          pixel_data(new int[xd * yd * zd]) {}
    virtual ~Image3D() { delete[] pixel_data; } // delete[] to match new[]
    int & operator()(int x, int y, int z) {
        return pixel_data[x + y * dim_x + z * dim_y * dim_x];
    }
private:
    int dim_x, dim_y, dim_z;
    int *pixel_data;
};
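For completeness, a usage sketch of the dynamic version (the dimensions reuse the earlier #defines; note that the class as written still needs a copy constructor and assignment operator, or deleted copies, since it owns a raw pointer):
Image3D img(DIM_X, DIM_Y, DIM_Z);
img(1, 1, 1) = 10;
img(2, 2, 2) = img(1, 1, 1) + 2; // reads and writes both go through operator()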
My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
I personally prefer to use basic arrays. By basic I mean a 1D linear array. Say you have a 512 x 512 image with 5 slices; then the image array looks like the following:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
float* img = new float[sizeX * sizeY * sizeZ];
To access a pixel/voxel at location (x, y, z), you would do:
float val = img[z*sizeX*sizeY + y*sizeX + x];
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
Using containers is more of a programming thing (easier, safer, exception handling, ...). If you are an algorithm guy, it might not be your concern at all. However, as one example of using <vector> in C++, you can always do this:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
std::vector<float> img(sizeX * sizeY * sizeZ);
float* p = &img[0];
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?
I don't see why arrays would make you less productive. Of course, C++ folks would prefer vectors to raw arrays. But again, it is just a programming thing.
Hope this helps.
Supplement:
The easiest way to do a 2D/3D CT recon would be MATLAB/Python + C/C++. But again, this requires enough experience to know when to use which. MATLAB has built-in FFT/IFFT, so you don't have to write C/C++ code for that. I used KissFFT before, and it was no problem.

Multiply Large Complex Number Vector by Scalar efficiently C++

I'm currently trying to find the most efficient way to do an in-place multiplication of an array of complex numbers (memory-aligned the same way std::complex would be, but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.
The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays of hundreds of millions of elements, so it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.
I'm thinking, however, that SIMD might yield much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (register int idx = start; idx < end; ++idx)
{
values[idx].real *= scalar[idx];
values[idx].imag *= scalar[idx];
}
}
fcomplex is defined as follows:
struct fcomplex
{
float real;
float imag;
};
I've tried manually unrolling the loop, since my final loop count will always be a power of 2, but the compiler is already doing that for me (I've unrolled as far as 32). I've tried a const float reference to the scalar, thinking I'd save one access, and that proved to be equal to what the compiler was already doing. I've tried STL and transform, which gave close results, but still worse. I've also tried casting to std::complex and letting it use the overloaded scalar * complex operator for the multiplication, but this ultimately produced the same results.
So, does anyone have any ideas? Much appreciation is given for your time in considering this! The target platform is Windows, and I'm using Visual Studio 2008. The product cannot contain GPL code, either! Thanks so much.
You can do this fairly easily with SSE, e.g.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (int idx = start; idx < end; idx += 2)
{
__m128 vc = _mm_load_ps((float *)&values[idx]);
__m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
vc = _mm_mul_ps(vc, vk);
_mm_store_ps((float *)&values[idx], vc);
}
}
Note that values and scalar need to be 16-byte aligned.
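If you control the allocations, one way to guarantee that alignment on the asker's platform (Windows / Visual Studio) is _aligned_malloc; a minimal sketch, with error handling omitted and the buffer size made up:
#include <malloc.h>  // _aligned_malloc / _aligned_free (MSVC-specific)
#include <cstddef>

struct fcomplex { float real; float imag; };

int main()
{
    const size_t count = 1 << 20; // example size only

    // 16-byte-aligned buffers, safe for _mm_load_ps / _mm_store_ps.
    fcomplex *values = static_cast<fcomplex *>(_aligned_malloc(count * sizeof(fcomplex), 16));
    float *scalar = static_cast<float *>(_aligned_malloc(count * sizeof(float), 16));

    // ... fill and process the buffers ...

    _aligned_free(scalar);
    _aligned_free(values);
}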
Or you could just use the Intel ICC compiler and let it do the hard work for you.
UPDATE
Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors:
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (int idx = start; idx < end; idx += 4)
{
__m128 vc0 = _mm_load_ps((float *)&values[idx]);
__m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
__m128 vk = _mm_load_ps(&scalar[idx]);
__m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);
__m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);
vc0 = _mm_mul_ps(vc0, vk0);
vc1 = _mm_mul_ps(vc1, vk1);
_mm_store_ps((float *)&values[idx], vc0);
_mm_store_ps((float *)&values[idx + 2], vc1);
}
}
Your best bet will be to use an optimised BLAS which will take advantage of whatever is available on your target platform.
One problem I see is that, inside the function, it's hard for the compiler to prove that the scalar pointer is not pointing into the middle of the complex array (scalar could in theory be pointing at the real or imaginary part of one of the complex values).
This potential aliasing forces the order of evaluation.
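One way to promise the compiler that the arrays don't overlap is a restrict qualifier; here is a sketch using the non-standard but widely supported __restrict keyword (available in MSVC and gcc/clang), assuming the caller really does pass non-overlapping buffers:
// Promising no aliasing lets the compiler reorder and vectorize more freely.
void cmult_scalar_inplace(fcomplex * __restrict values,
                          const int start, const int end,
                          const float * __restrict scalar)
{
    for (int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}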
Another problem I see is that here the computation is so simple that other factors will influence the raw speed; therefore, if you really care about performance, the only solution in my opinion is to implement several variations and test them at runtime on the user's machine to discover which is the fastest.
What I'd consider is using different unrolling sizes, and also playing with the alignment of scalar and values (the memory access pattern can have a big influence on caching effects).
For the unwanted-serialization problem, one option is to look at the generated code for something like
float r0 = values[i].real, i0 = values[i].imag, s0 = scalar[i];
float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
values[i].real = r0*s0; values[i].imag = i0*s0;
values[i+1].real = r1*s1; values[i+1].imag = i1*s1;
values[i+2].real = r2*s2; values[i+2].imag = i2*s2;
because here the optimizer has in theory a little bit more freedom.
Do you have access to Intel's Integrated Performance Primitives? They have a number of functions that handle cases like this with pretty decent performance. You might have some success with your particular problem, but I would not be surprised if your compiler already does a decent job of optimizing the code.