I am writing a basic MD code in C++ using LJ potential for an NVE system. The starting configuration is FCC and the starting velocities are randomly generated.
I am facing a strange problem in that the evolution of the system seems to be independent of the time step I implement, it is my understanding that the energy losses are smaller for small time steps and larger for larger time steps. However I am getting the same result at the end of the simulation in terms of energy whether I run (0.0001step)*(10000steps) or 0.001*1000 and so on.
The entire code is to big for me to post here, so I am posting what I think is relevant and leaving out binning etc., kindly let me know if any additional information is required. I have been through countless codes available online and though they look similar to mine I just am not able to figure out what the difference is and where I am going wrong.
The main cpp contains the following loop
for (int i=0; i<t;i++)
{
potential_calc(neighlist,fromfile, run_parameters,i);//calculating the force fields
velverlet(neighlist,fromfile, run_parameters, bin, dt);//calculating the velocities
}
The declarations of the 2 cpp files for potential calculation & verlet integration are
void potential_calc(neighborlist_type *neighlist, config_type *fromfile, potential *run_parameters, int t)
void velverlet(neighborlist_type *neighlist, config_type *fromfile, potential *run_parameters, bin_type *bin, double dt)
The code for calculating the force - potential_calc.cpp is below
for (long i=0; i<fromfile->N; i++)
{
long atom_p=i;
for (long j=0; j<neighlist[i].countsn; j++)
{
long atom_s=neighlist[i].numb[j];
for (int k=0; k<Dim; k++)
{
dist[k]= fromfile->r[atom_p][k] - (fromfile->r[atom_s][k] + neighlist[atom_p].xyz[j][k]*fromfile->L[k]);
//the .xyz indicates the image being considered real or mirror(if mirror then in which direction)
}
disp2 = pow(dist[0],2)+pow(dist[1],2)+pow(dist[2],2);
if (disp2<rb2)
{
int c1=fromfile->c[atom_p];
int c2=fromfile->c[atom_s];
double long force_temp;
disp=pow(disp2,0.5);
sig_r6=pow(run_parameters->sigma[c1-1][c2-1]/disp,6);//(sigma/r)^6
sig_r8=pow(run_parameters->sigma[c1-1][c2-1]/disp,8);//(sigma/r)^8
run_parameters->pe[atom_p] += (4*run_parameters->epsilon[c1-1][c2-1]*((sig_r6*sig_r6)-sig_r6)) - potential_correction[c1-1][c2-1];
force_temp=(-1*((48*run_parameters->epsilon[c1-1][c2-1])/pow(run_parameters->sigma[c1-1][c2-1],2)*((sig_r6*sig_r8)-((sig_r8)*0.5))));
for (int k=0; k<Dim;k++)
{
run_parameters->force[atom_p][k]+=force_temp*(-1*dist[k]);
}
}
}
//calculating kinetic energy
run_parameters->ke[atom_p] = 0.5*(pow(fromfile->v[atom_p][0],2)+pow(fromfile->v[atom_p][1],2)+pow(fromfile->v[atom_p][2],2));
}
Once the force calculation is done it goes to the updation of velocity and position in the velverlet.cpp
for (long i=0; i<fromfile->N; i++)
{
for (int j=0; j<Dim; j++)
{
fromfile->v[i][j] += (dt*run_parameters->force[i][j]);
}
}
for (long i=0; i<fromfile->N; i++)
{
for (int j=0; j<Dim; j++)
{
fromfile->r[i][j] += dt*fromfile->v[i][j];
}
}
There may be slight differences in how velocity verlet is implemented by different people but I can't figure out how I am getting time step independent results.
Please help. Any input is appreciated
Sorry if any formatting/tagging is wrong, this is the first time I am posting here
Related
Searching around on how I can improve my waveform generation code, I've come across SIMD and the libsimdpp library, but I have no idea how to use it. If I got it right using raw SIMD will require me to write code for each architecture while libsimdpp will handle that for me.
What I need to do is, to calculate the squared and rms value of a chunk of samples, which I managed to boost the process using vectorization, which worked perfectly until I introduced the same calculation for both left and right channel of an audio file.
So, my question and what I need help with, is how can I use libsimdpp (or any library that will make simdp easier for me) to improve the bellow code?
// STRAT: vector containing all the audio samples
std::vector<double> samples;
nv_samples = samples.size();
// END
// START: Loop through the samples vector, incrementing ecah time the index with samples_per_pixel
for (int i = 0; i < nb_samples; i+= samples_per_pixel)
{
// START: Create chunk of samples with the size of samples_per_pixel
double* chunk = &samplesL[i];
// END
// START: Calculate rms and sqared sum
float sum = 0;
float squaredsum = 0;
/// there are multiple definitions of above for both channels but I won't include them
//// to make the code easier to be read
for (int j = 0; j < samples_per_pixel; j++)
{
if (chunk[j] < 0)
sum += -chunk[j]
else
sum += chunk[j]
squaredsum += chunk[j] * chunk[j]
}
/// average
float average_point = (sumL * 2) / samples_per_pixel;
// rms
float meanL = squaredsumL / samples_per_pixel;
rms_pointL = qSqrt(meanL);
/// Drawing of both avearge point and rms
//// [...]
// END
}
I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own, simple function which is similar to LUT, but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
Non parallel function (calib(cv::Mat) -an object of calibRI functor class) takes about 0.15s. I decided to use cv::parallel_for_ to make it shorter.
First I implemented it as image tiling -according to >> this document. The time was reduced to 0.12s (4 threads).
virtual void operator()(const cv::Range& range) const
{
for(int i = range.start; i < range.end; i++)
{
// divide image in 'thr' number of parts and process simultaneously
cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
cv::Mat in = img(roi);
cv::Mat out = retVal(roi);
out = calib(in); //loops over all pixels and does out[u,v]=calibRI::corr(in[u,v])
}
I though that running my function in parallel for subimages/tiles/ROIs is not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
typedef boost::function<T(T)> pixelProcessingFuntionPtr;
private:
cv::Mat& image; //source and result image (to be overwritten)
bool cont; //if the image is continuous
size_t rows;
size_t cols;
size_t threads;
std::vector<cv::Range> ranges;
pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modif. function
public:
ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
: image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), pixelProcessingFunction(fun), threads(thr)
{
int groupSize = 1;
if (cont) {
cols *= rows;
rows = 1;
groupSize = ceil( cols / threads );
}
else {
groupSize = ceil( rows / threads );
}
int t = 0;
for(t=0; t<threads-1; ++t) {
ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
}
ranges.push_back( cv::Range( t*groupSize, rows<=1?cols:rows ) ); //last range must be to the end of image (ceil used before)
}
virtual void operator()(const cv::Range& range) const
{
for(int r = range.start; r < range.end; r++)
{
T* Ip = nullptr;
cv::Range ran = ranges.at(r);
if(cont) {
Ip = image.ptr<T>(0);
for (int j = ran.start; j < ran.end; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
else {
for(int i = ran.start; i < ran.end; ++i)
{
Ip = image.ptr<T>(i);
for (int j = 0; j < cols; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
}
}
}
};
Then I run it on 1280x1024 64FC1 image, on i5 processor, Win8, and get the time in range of 0.4s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why is my implementation so much slower than iterating all the pixels in subimages... Is there a bug in my code or the OpenCV ROIs are optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!
Generally it's really hard to say why using cv::parallel_for failed to speed up whole process. One possibility is that the problem is not related to processing/multithreading, but to time measurement. About 2 months ago i tried to optimize this algorithm and i noticed strange thing - first time i use it, it takes x ms, but if use use it second, third, ... time (of course without restarting application) it takes about x/2 (or even x/3) ms. I'm not sure what causes this behaviour - most likely (in my opinion) it's causes by branch prediction - when code is executed first time branch predictor "learns" which paths are usually taken, so next time it can predict which branch to take(and usually the guess will be correct). You can read more about it here - it's really good question and it can open your eyes for some quite important thing.
So, in your situation i would try few things:
measure it many times - 100 or 1000 should be enough (if it takes 0.12-0.4s it won't take much time) and see whether the last version of you code still is the slowest one. So just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
ParallelPixelLoop loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
test it on bigger image. Maybe in your situation you just "don't need" 4 cores, but on bigger image 4 cores will make positive difference.
Use profiler (for example Very Sleepy) to see what part of your code is critical
So I have an assignment where I have to get bugs to draw to the screen (no problem) and then shoot off from the bugbag drawn at the bottom of the screen (also no problem). However, my issue is that when the code loops, the speed picks up for some reason and everything I've tried to move around to fix the issue has proved fruitless. I either get it to slow was day (but not loop through the amount taken in) or nothing changes. Here is a code snippet from where the loop resides, I can provide more if need be, but I'm positive the problem stems from this method.
Point2D creatureThrow(Creature& myCreature, BugBag& theBag)
{
//Added for creature
Point2D creatureLocation = Point2D(CREATURE_DRAW_LEFT, 0);
int startingBugCount = theBag.getBugCount();
//std::vector<Bug> deadBugs(startingBugCount);
//Create bug
Bug* bug = new Bug();
// Display all the bugs
for (int bugNumber = 1; bugNumber <= startingBugCount; bugNumber++)
{
bug->moveTo(0, -90);
for (int step = 0; step < BUG_STEP_SIZE; step++) // 20 is the number of steps
{
gdsWindow.clear();
for (int j=0; j < bugNumber; j++) //move bugs
{
//bug->draw();
bug->moveBy(0, 8);
}
myCreature.draw();
theBag.draw();
bug->draw();
Sleep(FRAME_SLEEP);
}
}
return creatureLocation;
}
I have a method which calculates an integral image (description here) commonly used in computer vision applications.
float *Integral(unsigned char *grayscaleSource, int height, int width, int widthStep)
{
// convert the image to single channel 32f
unsigned char *img = grayscaleSource;
// set up variables for data access
int step = widthStep/sizeof(float);
uint8_t *data = (uint8_t *)img;
float *i_data = (float *)malloc(height * width * sizeof(float));
// first row only
float rs = 0.0f;
for(int j=0; j<width; j++)
{
rs += (float)data[j];
i_data[j] = rs;
}
// remaining cells are sum above and to the left
for(int i=1; i<height; ++i)
{
rs = 0.0f;
for(int j=0; j<width; ++j)
{
rs += data[i*step+j];
i_data[i*step+j] = rs + i_data[(i-1)*step+j];
}
}
// return the integral image
return i_data;
}
I am trying to make it as fast as possible. It seems to me like this should be able to take advantage of Apple's Accelerate.framework, or perhaps ARMs neon intrinsics, but I can't see exactly how. It seems like that nested loop is potentially quite slow (for real time applications at least).
Does anyone think this is possible to speed up using any other techniques??
You can certainly vectorize the row by row summation. That is vDSP_vadd(). The horizontal direction is vDSP_vrsum().
If you want to write your own vector code, the horizontal sum might be sped up by something like psadbw, but that is Intel. Also, take a look at prefix sum algorithms, which are famously parallelizable.
I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get 8x improvement. However, I am only getting 2x ... what might I be doing wrong?
Full Source on GitHub
void transpose(int size, matrix m) {
int i, j;
for (i = 0; i < size; i++)
for (j = 0; j < size; j++)
std::swap(m.element[i][j], m.element[j][i]);
}
void mm(matrix a, matrix b, matrix result) {
int i, j, k;
int size = a.size;
long long before, after;
before = wall_clock_time();
// Do the multiplication
transpose(size, b); // transpose the matrix to reduce cache miss
for (i = 0; i < size; i++)
for (j = 0; j < size; j++) {
int tmp = 0; // save memory writes
for(k = 0; k < size; k++)
tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
}
after = wall_clock_time();
fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before))/1000000000);
}
Am I doing things right so far?
FYI: The next optimization I need to do is use SIMD/Intel SSE3
Am I doing things right so far?
No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations it always a good idea to use the naive but suboptimal implementation as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.
Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.
Yet another optimization that will help the speedup is to cache some pointers. This is still quite slow:
for(k = 0; k < size; k++)
tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
An optimizer might see a way around the pointer problems, but probably not. At least not if you don't use the nonstandard __restrict__ keyword to tell the compiler that your matrices don't overlap. Cache pointers so you don't have to do a.element[i], b.element[j], and result.element[i]. And it still might help to tell the compiler that these arrays don't overlap with the __restrict__ keyword.
Addendum
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using struct rather than class, malloc rather than new, typedef struct rather than just struct, C headers rather than C++ headers.
Because of your implementation of your struct matrix, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls m(a, a, a_squared) to get the square of matrix a. You will get burned even worse if some expects m(a, a, a) to do an in-place computation of a2.
Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.
Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.
If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it does the transpose TWO times, resulting in no transpose at all. The j=loop should not read
j=0; j<size; j++
but
j=0; j<i; j++
Transposing is not necessary to avoid processing the elements of one of the factor-matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance-tuning, the basic loop-structure should be:
for (int i=0; i<size; i++)
{
for (int k=0; k<size; k++)
{
double tmp = a[i][k];
for (int j=0; j<size; j++)
{
result[i][j] += tmp * b[k][j];
}
}
}