I've switched recently from matlab to c++ in order to run simulations faster, however it still runs slow. I'm pretty positive that there is much to improve in terms of memory usage.
Consider the following code, it shows an example of two array/vector declaration, that I use in a simulation.
One with known fixed length (array01) and another with unknown length (array02) that changes during the run.
The question here is what is the best/proper/efficient way of declaring variables ( for both array types) in terms of memory usage and performance.
# include <iostream>
# include <vector>
# include <ctime>
# include <algorithm>
using namespace std;
const int n = 1000;
const int m= 100000;
int main()
{
srand((unsigned)time(NULL));
vector <double> array02;
vector <vector<double>> Array01(n,m);
for (unsigned int i=0; i<n; i++)
{
for (unsigned int j=0; j<m;j++)
{
array02.clear();
rr = rand() % 10;
for (unsigned int l = 0 ; l<rr <l++)
{
array02.pushback(l);
}
// perform some calculation with array01 and array02
}
}
}
You should consider defining your own Matrix class with a void resize(unsigned width, unsigned height) member function, and a double get(unsigned i, unsigned j) inlined member function and/or a double& at(unsigned i, unsigned j) inlined member function (both giving Mi,j element). The matrix internal data could be a one-dimensional array or vector of doubles. Using a vector of vectors (all of the same size) is not the best (or fastest) way to represent a matrix.
class Matrix {
std::vector<double> data;
unsigned width, height;
public:
Matrix() : data(), width(0), height(0) {};
~Matrix() = default;
/// etc..., see rule of five
void resize(unsigned w, unsigned h) {
data.resize(w*h);
width = w; height = h;
}
double get(unsigned i, unsigned j) const {
assert(i<width && j<height);
return data[i*width+j];
}
double& at(unsigned i, unsigned j) {
assert(i<width && j<height);
return data[i*width+j];
}
}; // end class Matrix
Read also about the rule of five.
You could also try scilab (it is free software). It is similar to Matlab and might have different performances. Don't forget to use a recent version.
BTW, there are tons of existing C++ numerical libraries dealing with matrices. Consider using one of them. If performance is of paramount importance, don't forget to ask your compiler to optimize your code after you have debugged it.
Assuming you are on Linux (which I recommend for numerical computations; it is significant that most supercomputers run Linux), compile using g++ -std=c++11 -Wall -Wextra -g during the debugging phase, then use g++ -std=c++11 -Wall -Wextra -mtune=native -O3 during benchmarking. Don't forget to profile, and remember that premature optimization is evil (you first need to make your program correct).
You might even spend weeks, or months and perhaps many years, of work to use techniques like OpenMP, OpenCL, MPI, pthreads or std::thread for parallelization (which is a difficult subject you'll need years to master).
If your matrix is big, and/or have additional properties (is sparse, triangular, symmetric, etc...) there are many mathematical and computer science knowledge to master to improve the performance. You can make a PhD on that, and spend your entire life on the subject. So go to your University library to read some books on numerical analysis and linear algebra.
For random numbers C++11 gives you <random>; BTW use C++11 or C++14, not some earlier version of C++.
Read also http://floating-point-gui.de/ and a good book about C++ programming.
PS. I don't claim any particular expertise on numerical computation. I prefer much symbolic computation.
First of all don't try to reinvent the wheel :) Try to use some heavily optimized numerical library, for example
Intel MKL (Fastest and most used math library for Intel and compatible processors)
LAPACK++ (library for high performance linear algebra)
Boost (not only numerical, but solves almost any problem)
Second: If you need a matrix for a very simple program, use vector[i + width * j] notation. It's faster because you save an extra memory allocation.
Your example doesn't event compile. I tried to rewrite it a little:
#include <vector>
#include <ctime>
int main()
{
const int rowCount = 1000;
const int columnCount = 1000;
srand(time(nullptr));
// Declare matrix
std::vector<double> matrix;
// Preallocate elemts (faster insertion later)
matrix.reserve(rowCount * columnCount);
// Insert elements
for (size_t i = 0; i < rowCount * columnCount; ++i) {
matrix.push_back(rand() % 10);
}
// perform some calculation with matrix
// For example this is a matrix element at matrix[1, 3]:
double element_1_3 = matrix[3 + 1 * rowCount];
return EXIT_SUCCESS;
}
Now the speed depends on rand() (which is slow).
As people said:
Prefer a 1d array instead of 2d array for matrices.
Don't reinvent the wheel, use existing library: I think that Eigen library is the best suite for you, judging from your code. It also have very, very optimized code generated since it use C++ template static calculation when ever possible.
Related
I am trying to compute colsum(N * P), where N is a sparse, 1M by 2500 matrix, and P is a dense 2500 by 1.5M matrix. I am using the Eigen C++ library with Intel's MKL library. The issue is that the matrix N*P can't actually exist in memory, it's way too big (~10 TB). My question is whether Eigen will be able to handle this computation through some combination of lazy evaluation and parallelism? It says here that Eigen won't make temporary matrices unnecessarily: http://eigen.tuxfamily.org/dox-devel/TopicLazyEvaluation.html
But does Eigen know to compute N * P in piecewise chunks that will actually fit in memory? IE: it will have to do something like colsum(N * P_1) ++ colsum(N * P_2) ++ .. ++ colsum(N * P_n), where P is split into n different submatrices column-wise and "++" is concatenation.
I am working with 128 GB RAM.
I gave it a try but ended up with a bad malloc (I'm only running on 8GB on Win8). I set up my main() and used a not inline colsum function I wrote.
int main(int argc, char *argv[])
{
Eigen::MatrixXd dense = Eigen::MatrixXd::Random(1000, 100000);
Eigen::SparseMatrix<double> sparse(100000, 1000);
typedef Triplet<int> Trip;
std::vector<Trip> trps(dense.rows());
for(int i = 0; i < dense.rows(); i++)
{
trps[i] = Trip(20*i, i, 2);
}
sparse.setFromTriplets(trps.begin(), trps.end());
VectorXd res = colsum(sparse, dense);
std::cout << res;
std::cin >> argc;
return 0;
}
The attempt was simply:
__declspec(noinline) VectorXd
colsum(const Eigen::SparseMatrix<double> &sparse, const Eigen::MatrixXd &dense)
{
return (sparse * dense).colwise().sum();
}
That had a bad malloc. Sol it looks like you have to split it up manually on your own (unless someone else has a better solution).
EDIT
I improved the function a bit, but the get the same bad malloc:
__declspec(noinline) VectorXd
colsum(const Eigen::SparseMatrix<double> &sparse, const Eigen::MatrixXd &dense)
{
return (sparse * dense).topRows(4).colwise().sum();
}
EDIT 2
Another option would be to make the sparse matrix dense and force a lazy evaluation. I don't think that it would work with a sparse matrix (oh well).
__declspec(noinline) VectorXd
colsum(const Eigen::SparseMatrix<double> &sparse, const Eigen::MatrixXd &dense)
{
Eigen::MatrixXd denseSparse(sparse);
return denseSparse.lazyProduct(dense).colwise().sum();
}
This doesn't give me the bad malloc, but computes a lot of pointless 0*x_i expressions.
To answer your question: Especially, when products are involved, Eigen often evaluates parts of expressions into temporaries. In some situations this could be optimized but is not implemented yet, in some cases this is essentially the most efficient way to implement it.
However, in your case you could simply calculate the colsum of N (a 1 x 2500 vector) and multiply that by P.
Maybe future versions of Eigen will be able to make this kind of optimization themselves, but most of the time it is a good idea to make problem-specific optimizations oneself before letting the computer do the rest of the work.
Btw: I'm afraid sparse.colwise() is not implemented yet, so you must compute that manually. If you are lazy, you can instead compute Eigen::RowVectorXd Nsum = Eigen::RowVectorXd::Ones(N.rows())*P; (I have not checked it, but this might actually get optimized to near optimal code, with the most recent versions of Eigen).
I'm trying to learn about vectorisation, and rather than reinvet the wheel I'm using Agner Fog's vector library
Here's my original C++/STL code
#include <vector>
#include <vectorclass.h>
template<typename T>
double mean_v1(T begin,T end) {
float mean = 0;
std::for_each(begin,end,[&mean](const double& d) { mean+=d; });
return mean / std::distance(begin,end);
}
double mean_v2(T begin,T end) {
float mean = 0;
const int distance = std::distance(begin,end); // This is expensive
const int loop = ( distance >> 2)+1; // divide by 4
const int partial = distance & 2; // remainder 4
Vec4d vec;
for(int i = 0; i < loop;++i) {
if(i == (loop-1)) {
vec.load_partial(partial,&*begin);
mean = horizontal_add(vec);
}
else {
vec.load(&*begin);
mean = horizontal_add(vec);
begin+=4; // This is expensive
}
}
return mean / distance;
}
int main(int argc,char**argv) {
using namespace boost::assign;
std::vector<float> numbers;
// Note 13 numbers, which won't fit into a sse register perfectly
numbers+=39.57,39.57,39.604,39.58,39.61,31.669,31.669,31.669,31.65,32.09,33.54,32.46,33.45;
const float mean1 = mean_v1(numbers.begin(),numbers.end());
const float mean2 = mean_v2(numbers.begin(),numbers.end());
return 0;
}
Both v1 and v2 work correctly and they both take about the same time. However profiling it shows the the std::distance() and moving the iterator along takes almost 45% of the total time. The vector adds is just 0.8% which is significantly faster than v1.
Searching the web, all the examples seem to deal with perfect number of values that fit precisely into the SSE registers. How do people deal with odd numbers of values eg for this example where setting up the loop is taking a lot longer than the calculation.
I'm thinking there must be best practices or ideas on how to deal with this scenario.
Assume I can't change the interface of mean() to take float[], but must use iterators
You're mixing float & double unnecessarily, especially as you don't let your accumulator be double your precision is totally destroyed and won't be close to satisfactory for larger series.
As the arithmetic is super light weight what's destroying your performance here is most likely memory access, read up on memory cache lines and how they work. Basically what you need to do here is probe ahead, some processors have explicit instructions for pulling stuff into your cache, otherwise you can perform a load at a memory location ahead of time. Create another level of nesting in your loop and at regular intervals prime the cache with data you know you will get to in a few iterations.
What people do to maximize performance is that they spend a lot of time actually designing their data layout. You shouldn't need to do an intermediate transformation on your data. So what people do is they allocate aligned memory ( most SIMD instruction sets either requires or imposes grave penalties for reading / writing to unaligned memory ), and then they try to aggregate data in such a way that it fits the instruction set. In fact it's often a win to pad your data up to whatever register size the instruction set supports. So if lets say you're going to process 3 dimensional vectors, padding with an extra element which is unused will almost always be a big win.
I'm trying to sort an array using Thrust, but it doesn't work if the array is too big. (I have a GTX460 1GB memory)
I'm using cuda with c++ integration on VS2012, Here is my code :
my .cpp
extern "C" void thrust_sort(uint32_t *data, int n);
int main(int argc, char **argv){
int n = 2<<26;
uint32_t * v = new uint32_t[n];
srand(time(NULL));
for (int i = 0; i < n; ++i) {
v[i] = rand()%n;
}
thrust_sort(v, n);
delete [] v;
return 0;
}
my .cu
extern "C"
void thrust_sort(uint32_t *data, int n){
thrust::device_vector<uint32_t> d_data(data, data + n);
thrust::stable_sort(d_data.begin(), d_data.end());
thrust::copy(d_data.begin(), d_data.end(), data);
}
The program stop working at the start of stable_sort().
How much more memory does stable_sort() need ?
Is there a way to fix this ? (even if it makes it a bit slower or whatever)
Is there another sorting algorithm that doesn't require more memory than the original array ?
Thanks for your help :)
There are in the literature some techniques used to deal with the problem of sorting data that is too big to fit in RAM, such as saving partial values in files, and so on. An example: Sorting a million 32-bit integers in 2MB of RAM using Python
Your problem is less complicated since your input fits in RAM but is too much for your GPU. You can solve this problem by using the strategy parallel by Regular Sampling. You can see here an example of this technique applied to quicksort.
Long story short, you divide the array into smaller sub-arrays that fit on the memory of the GPU. Then you sort each of the sub-arrays, and in the end, you merge the results base on the premises of the Regular Sampling approach.
You can use a hybrid approach, sorting some of the sub-arrays in the CPU by assigning each one to a different core (using multi-threading), and at the same time, sending others sub-arrays to the GPU. You can even subdivide this work also to different processors using a message passing interface such as MPI. Or you can simply sort each sub-array one-by-one on the GPU and do the final merge step using the CPU, taking (or not) advantage of the multi-cores.
Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested by Herb's assertion that C++ is "safe and fast" because I write a lot of performance-critical code.
As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.
I came up with the following C++11 code that uses the built-in complex type and vector collection:
#include <complex>
#include <vector>
using namespace std;
// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);
void fft(int sign, vector<complex<double>> &zs) {
unsigned int j=0;
// Warning about signed vs unsigned comparison
for(unsigned int i=0; i<zs.size()-1; ++i) {
if (i < j) {
auto t = zs.at(i);
zs.at(i) = zs.at(j);
zs.at(j) = t;
}
int m=zs.size()/2;
j^=m;
while ((j & m) == 0) { m/=2; j^=m; }
}
for(unsigned int j=1; j<zs.size(); j*=2)
for(unsigned int m=0; m<j; ++m) {
auto t = pi * sign * m / j;
auto w = complex<double>(cos(t), sin(t));
for(unsigned int i = m; i<zs.size(); i+=2*j) {
complex<double> zi = zs.at(i), t = w * zs.at(i + j);
zs.at(i) = zi + t;
zs.at(i + j) = zi - t;
}
}
}
Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
As I understand it, the traditional xs[i] syntax from C for indexing a vector does not do bounds checking and, consequently, is not memory safe and can be a source of memory errors such as non-deterministic corruption and memory access violations. So I used xs.at(i) instead.
Now, I want this code to be "safe and fast" but I am not a C++11 expert so I'd like to ask for improvements to this code that would make it more idiomatic or efficient?
I think you are being overly "safe" in your use of at(). In most of your cases the index used is trivially verifable as being constrained by the size of the container in the for loop.
e.g.
for(unsigned int i=0; i<zs.size()-1; ++i) {
...
auto t = zs.at(i);
The only ones I'd leave as at()s are the (i + j)s. It's not immediately obvious whether they would always be constrained (although if I was really unsure I'd probably manually check - but I'm not familiar with FFTs enough to have an opinion in this case).
There are also some fixed computations being repeated for each loop iteration:
int m=zs.size()/2;
pi * sign
2*j
And the zs.at(i + j) is computed twice.
It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.
Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined as, at least, a direct call to an internal member variable - but given how many times you call it I'd also experiment with introducing local variables for zs.size() and zs.size()-1 upfront. They're more likely to be put into registers that way too.
I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.
As for being idiomatic my only comment, really, is that size() returns a std::size_t (which is usually a typedef for an unsigned int - but it's more idiomatic to use that type instead). If you did want to use auto but avoid the warning you could try adding the ul suffix to the 0 - not sure I'd say that is idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
Update
I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.
Overall I got a >10% perf. improvement with those suggestions.
I suspect more could be had if the second block of loops was refactored a little to avoid the jumps - and having done so enabling SSE2 instruction set may give a significant boost (I did try it as is and saw a slight slowdown).
I think that refactoring, along with using something like MKL for the cos and sin calls should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).
Update 2
I forgot to mention that pre-calculating zs.size() did make a difference.
Update 3
Also forgot to say (until reminded by #xeo in comment to OP) that the block following the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case should inline to the same code as written. Indeed when I did it I saw no change in the performance. In other cases it can lead to a performance gain if move constructors are available.
I'm developing a CT reconstruction algorithm using C++. I'm using C++ because I need to use a library written in C++ that will let me read/write a specific file format.
This reconstruction algorithm involves working with 3D and 2D images. I've written similar algorithms in C and MATLAB using arrays. However, I've read that, in C++, arrays are "evil" (see http://www.parashift.com/c++-faq-lite/containers.html). The way I use arrays to manipulate images (in C) is the following (this creates a 3D array that will be used as a 3D image):
int i,j;
int *** image; /* another way to make a 5x12x27 array */
image = (int ***) malloc(depth * sizeof(int **));
for (i = 0; i < depth; ++i) {
image[i] = (int **) malloc(height * sizeof(int *));
for (j = 0; j < height; ++j) {
image[i][j] = (int *) malloc(width * sizeof(int));
}
}
or I use 1-dimensional arrays and do index arithmetic to simulate 3D data. At the end, I free the necessary memory.
I have read that there are equivalent ways of doing this in C++. I've seen that I could create my own matrix class that uses vectors of vectors (from STL) or that I could use the boost-matrix library. The problem is that this makes my code look bloated.
My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?
Working at a higher level always saves you time given equal familiarity with both types of code. It's usually simpler and you might not need to bother with some tasks like deleting.
That said, if you already have the C code and are basically converting malloc to new (or leaving it as-is) then it makes perfect sense to leave it. No reason to duplicate work for no advantage. If you're going to be extending it and adding more features you might want to think about a rewrite. Image manipulation is often an intensive process and I see straight code like yours all the time for performance reasons.
Arrays have a purpose, vectors have a purpose, and so on. You seem to understand the tradeoffs so I won't go into that. Understanding the context of what you're doing is necessary; anyone who says that arrays are always bad or vectors are always too much overhead (etc.) probably doesn't know what they're talking about.
I know it looks difficult at first, and your code seems simple - but eventually yours is going to hurt.
Use a library like boost, or consider a custom 3D image toolkit like vtk
if the 3D canvas has a fixed size you won't win much by using containers. I would avoid allocating the matrix in small chunks as you do, though, and just instead do
#define DIM_X 5
#define DIM_Y 12
#define DIM_Z 27
#define SIZE (DIM_X * DIM_Y * DIM_Z)
#define OFFS(x, y, z) (x + y * DIM_X + z * (DIM_Y * DIM_X))
and then
class 3DImage {
private unsigned int pixel_data[SIZE];
int & operator()(int x, int y, int z) { return pixel_data[OFFS(x,y,z)]; }
}
after which you can do e.g.
3DImage img;
img(1,1,1) = 10;
img(2,2,2) = img(1,1,1) + 2;
without having any memory allocation or algorithm overhead. But as some others have noted, the choice of the data structure also depends on what kind of algorithms you are planning to run on the images. You can always however adapt a third-party algorithm e.g. for matrix inversion with a proper facade class if needed; and this flat representation is much faster than the nested arrays of pointers you wrote.
If the dimensions are not fixed compile time, you can obviously still use exactly the same approach, it's just that you need to allocate pixel_data dynamically and store the dimensions in the 3DImage object itself. Here's that version:
class 3DImage {
private unsigned int *pixel_data;
unsigned int dim_x, dim_y, dim_z;
3DImage(int xd, int yd, int zd) { dim_x = xd; dim_y = yd; dim_z = zd;
pixel_data = new int[dim_x * dim_y * dim_z];
}
virtual ~3DImage() { delete pixel_data; }
int & operator(int x, int y, int z) {
return pixel_data[x + y * dim_x + z * dim_y * dim_x];
}
}
My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
I personally prefer to use basic arrays. By basic I mean a 1D linear array. Say you have a 512 X 512 image, and you have 5 slices, then the image array looks like following:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
float* img = new float[sizeX * sizeY * sizeZ]
To access of a pixel/voxel at location (x,y,z), you would need to do:
float val = img[z*sizeX*sizeY + y*sizeX + sizeX];
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
To use containers is more like a programming thing (easier, safer, exception catching....). If you are an algorithm guy, then it might not be your concern at all. However, one example to use <vector> in C++, you can always do this:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
std::vector<float> img(sizeX * sizeY * sizeZ);
float* p = &img[0];
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?
I don't see why array makes you less productive. Of course, c++ guys would prefer to use vectors to raw arrays. But again, it is just a programming thing.
Hope this helps.
Supplement:
The easiest way to do a 2D/3D CT recon would be to use MATLAB/python + C/C++; But again, this would require you sufficient experience when to use which. MATLAB has built in FFT/IFFT, so you don't have to write a C/C++ code for that. I remember I used KissFFT before, and it was no problem.