Operator= slowing down simulation - c++

I am running a Monte Carlo simulation of a polymer. The entire configuration of the current state of the system is given by the object called Grid. This is my definition of Grid:
class Grid{
public:
std::vector <Polymer> PolymersInGrid; // all the polymers in the grid
int x; // length of x-edge of grid
int y; // length of y-edge of grid
int z; // length of z-edge of grid
double kT; // energy factor
double Emm_n ; // monomer-monomer interaction when not aligned
double Emm_a ; // monomer-monomer interaction when aligned
double Ems; // monomer-solvent interaction
double Energy; // energy of grid
std::map <std::vector <int>, Particle> OccupancyMap; // a map that gives the particle given the location
Grid(int xlen, int ylen, int zlen, double kT_, double Emm_a_, double Emm_n_, double Ems_): x (xlen), y (ylen), z (zlen), kT (kT_), Emm_n(Emm_n_), Emm_a (Emm_a_), Ems (Ems_) { // Constructor of class
// this->instantiateOccupancyMap();
};
// Destructor of class
~Grid(){
};
// assignment operator that allows for a correct transfer of properties. Important to functioning of program.
Grid& operator=(Grid other){
std::swap(PolymersInGrid, other.PolymersInGrid);
std::swap(Energy, other.Energy);
std::swap(OccupancyMap, other.OccupancyMap);
return *this;
}
.
.
.
}
I can go into the details of the object Polymer and Particle, if required.
In my driver code, this is what I am doing:
Define maximum number of iterations.
Defining a complete Grid G.
Creating a copy of G called G_.
I am perturbing the configuration of G_.
If the perturbance on G_ is accepted per the Metropolis criterion, I assign G_ to G (G=G_).
Repeat steps 1-4 until maximum number of iterations is achieved.
This is my driver code:
auto start = std::chrono::high_resolution_clock::now();
Grid G_ (G);
int acceptance_count = 0;
for (int i{1}; i< (Nmov+1); i++){
// choose a move
G_ = MoveChooser(G, v);
if ( MetropolisAcceptance (G.Energy, G_.Energy, G.kT) ) {
// accepted
// replace old config with new config
acceptance_count++;
std::cout << "Number of acceptances is " << acceptance_count << std::endl;
G = G_;
}
else {
// continue;
}
if (i % dfreq == 0){
G.dumpPositionsOfPolymers (i, dfile) ;
G.dumpEnergyOfGrid(i, efile, call) ;
}
// G.PolymersInGrid.at(0).printChainCoords();
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds> (stop-start);
std::cout << "\n\nTime taken for simulation: " << duration.count() << " milliseconds" << std::endl;
This is the interesting part: if I run the simulation under conditions that do not produce many "acceptances" (low temperatures, bad solvent), the simulation runs pretty fast. However, if there are a large number of acceptances, the simulation gets incredibly slow. My hypothesis is that my assignment operator = is slowing down my simulation.
I ran some tests:
number of acceptances = 25365, wall-clock time = 717770 milliseconds (!)
number of acceptances = 2165, wall-clock time = 64412 milliseconds
number of acceptances = 3000, wall-clock time = 75550 milliseconds
And the trend continues.
Could anyone advise me on how to possibly make this more efficient? Is there a way to bypass the slowdown I am experiencing, I think, due to the = operator?
I would really appreciate any advice you have for me!

One thing that you can certainly do to improve performance is to force moving G_ rather than copying it to G:
G = std::move(G_);
After all, at this stage you don't need G_ any more.
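One caveat, sketched here under the assumption that Grid keeps its by-value operator=: std::move only avoids the copy if the class has a move constructor. A class that declares its own destructor or copy assignment operator, as the Grid in the question does, gets no implicit move constructor, so both the copy and move constructors would have to be declared explicitly (defaulted ones would do). Counted below is a hypothetical stand-in for Grid that counts copy constructions, not the real class.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for Grid: counts how often the copy constructor runs.
struct Counted {
    std::vector<int> data;
    static int copies;

    Counted() = default;
    Counted(const Counted& other) : data(other.data) { ++copies; }
    Counted(Counted&& other) noexcept : data(std::move(other.data)) {}

    // Same by-value signature as Grid::operator= in the question.
    Counted& operator=(Counted other) {
        std::swap(data, other.data);
        return *this;
    }
};
int Counted::copies = 0;

// Number of copy constructions triggered by `dst = src`.
int copies_on_copy_assign() {
    Counted src, dst;
    src.data.assign(1000, 1);
    Counted::copies = 0;
    dst = src;  // the by-value parameter is copy-constructed from src
    assert(dst.data.size() == 1000);  // sanity check: the data arrived
    return Counted::copies;
}

// Number of copy constructions triggered by `dst = std::move(src)`.
int copies_on_move_assign() {
    Counted src, dst;
    src.data.assign(1000, 1);
    Counted::copies = 0;
    dst = std::move(src);  // the by-value parameter is move-constructed
    assert(dst.data.size() == 1000);  // sanity check: the data arrived
    return Counted::copies;
}
```

With the move constructor present, the copy assignment costs one copy construction and the move assignment costs none. If the move constructor is removed from Counted, the second function falls back to copying as well.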
Side remark: the fact that you don't need to copy all member data in operator= indicates that your design of Grid is far from perfect, but keep it if the program is small and you're sure you control everything. Anyway, rather than using operator=, you should define and use a member function with a meaningful name, like "fast_and_dirty_swap" :-) Then you can define operator= the way suggested by @Jarod42, that is, using = default.
An alternative approach that I used before C++11 is to operate on pointers. In this scenario one would have two Grids, one "real" and one treated as a buffer, or sandbox, and on acceptance one would simply swap the pointers, so that the "buffer" filled with MoveChooser would become the real, current Grid.
A pseudocode:
Create two buffers, previous and current, each capable of storing a simulation state
Initialize current
Create two pointers, p_prev = &previous, p_curr = &current
For as many steps as you wish
compute the next state from *p_curr and store it in *p_prev (e.g. monte_carlo_step(p_curr, p_prev))
swap the pointers: now the current system state is at p_curr and the previous at p_prev.
analyze the results stored at *p_curr
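The pseudocode above could be sketched in C++ roughly as follows. State, propose and the "accept if the energy drops" rule are hypothetical stand-ins (the real code would use Grid, MoveChooser and MetropolisAcceptance), and this variant swaps the pointers only on acceptance, as in the question's driver loop.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical simulation state; the real Grid would hold polymers, energy, etc.
struct State {
    std::vector<int> sites;
    double energy = 0.0;
};

// Hypothetical move: writes a perturbed version of `from` into `to`.
// Assigning into an existing buffer lets the vector reuse its allocation.
void propose(const State& from, State& to) {
    to = from;
    to.sites[0] += 1;
    to.energy = from.energy - 1.0;
}

// Runs `steps` trial moves using buffers `a` (initial state) and `b` (scratch).
// Returns a pointer to whichever buffer holds the final accepted state.
const State* run_chain(State& a, State& b, int steps, int& accepted) {
    assert(!a.sites.empty());
    State* current = &a;
    State* scratch = &b;
    accepted = 0;
    for (int i = 0; i < steps; ++i) {
        propose(*current, *scratch);
        // Stand-in acceptance rule; a real run would apply the Metropolis test.
        if (scratch->energy < current->energy) {
            std::swap(current, scratch);  // O(1): no state-sized copy on acceptance
            ++accepted;
        }
    }
    return current;
}
```

The design point is that acceptance costs a pointer swap regardless of how large the state is; the only state-sized work left is whatever the move itself needs to write into the scratch buffer.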

Related

Hole Filling methods takes over 20 minutes

When trying to fill holes in a mesh with a border that is highly complex, the application takes 20 minutes in the hole filling call.
It can be any of the calls shown here.
The code I'm using is this:
int main()
{
std::ifstream input("V:/tobehealed2.off");
Triangle_mesh mesh;
input >> mesh;
//////////////
std::vector<std::vector<int>> indices(mesh.num_faces());
std::vector<Point_3> vertices(mesh.num_vertices());
int i = 0;
for (auto& p : mesh.points()) {
vertices[i++] = p;
}
i = 0;
for (auto& f : mesh.faces()) {
std::vector<int> triangle(3);
int j = 0;
for (auto v : mesh.vertices_around_face(mesh.halfedge(f))) {
triangle[j++] = v;
}
indices[i++] = triangle;
}
mesh.clear();
CGAL::Polygon_mesh_processing::repair_polygon_soup(vertices, indices);
CGAL::Polygon_mesh_processing::orient_polygon_soup(vertices, indices);
CGAL::Polygon_mesh_processing::polygon_soup_to_polygon_mesh(vertices, indices, mesh);
CGAL::Polygon_mesh_processing::keep_largest_connected_components(mesh, 1);
bool hasHoles = true;
std::vector<face_descriptor> face_out;
std::vector<vertex_descriptor> vertex_out;
while (hasHoles) {
hasHoles = false;
for (auto& hh : mesh.halfedges()) {
if (mesh.is_border(hh)) {
hasHoles = true;
CGAL::Polygon_mesh_processing::triangulate_and_refine_hole(mesh, hh, std::back_inserter(face_out), std::back_inserter(vertex_out));
break;
}
}
face_out.clear();
vertex_out.clear();
}
CGAL::Polygon_mesh_processing::keep_largest_connected_components(mesh, 1);
CGAL::Polygon_mesh_processing::remove_isolated_vertices(mesh);
CGAL::Polygon_mesh_processing::remove_self_intersections(mesh);
CGAL::Surface_mesh_simplification::Count_stop_predicate<Triangle_mesh> stop(60000);
int r = CGAL::Surface_mesh_simplification::edge_collapse(mesh, stop);
mesh.collect_garbage();
std::ofstream out2("V:/healed.off");
out2 << mesh;
}
Application takes over 20 minutes in the call to triangulate_and_refine_hole.
Tested model is available for download here:
https://drive.google.com/file/d/1t2dwJBs5vNpg2jLOVprHEY-8tCHvCErK/view?usp=sharing
My goal is just to be able to check beforehand if the model has a hole so complex the closing of it will take several minutes, so I can skip the hole filling attempt. Also, if there is a way to exit the function call after some threshold time it would be nice.
The size of the model doesn't matter so much. If I use a mesh 3 times larger, it can fill a not-so-complex-hole in just a few seconds.
Also, if there is a way to exit the function call after some threshold time it would be nice.
What always works is to start the task in another thread and monitor that thread, killing it if necessary after some time.
It really wouldn't be a pretty solution, though. CGAL is still a maintained library, so I'm fairly sure you are just providing it with some incorrect data. Or are you sure that it actually hangs?
36Mb is quite a solid size for a model and hole-fixing is a task that grows with model complexity if I recall right.
EDIT:
Ah well, if you can afford to wait 20 minutes in general before having a repaired model, threading is the way to go. It will just run in the background.
If you cannot afford for it to take that long, then it is not so easy. Either you find some significantly better implementation (not that likely), or you will have to make some trade-offs: either simplifying the model or living with less correct hole fixes (assuming there are some heuristic algorithms for this task).
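A minimal sketch of the "run it in another thread and monitor it" idea, using only the standard library. Note the limits: standard C++ has no portable way to kill a thread, so this only detects that the time limit passed and lets the caller move on while the worker keeps running in the background. finished_within is a hypothetical helper name, and the task must own or copy everything it touches (for example its own copy of the mesh), because the caller may continue before it finishes.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <memory>
#include <thread>

// Runs `task` on a detached thread and waits at most `limit` for it.
// Returns true if the task finished in time, false if it is still running.
template <class Task>
bool finished_within(Task task, std::chrono::milliseconds limit) {
    assert(limit >= std::chrono::milliseconds::zero());
    // The promise lives in a shared_ptr so it outlives this function
    // if the worker is still running when we give up waiting.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> result = done->get_future();
    std::thread([done, task]() mutable {
        task();
        done->set_value();
    }).detach();
    return result.wait_for(limit) == std::future_status::ready;
}
```

In the question's setting, the task would wrap the hole-filling call on a private copy of the mesh and hand the result back (for instance through a std::promise carrying the mesh) only if it completes in time.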

c++ stack efficient for multicore application

I am trying to code a multicore Markov Chain in C++, and while I am trying to take advantage of the many CPUs (up to 24) to run a different chain on each one, I have a problem picking the right container to gather the results of the numerical evaluations on each CPU. What I am trying to measure is basically the average value of an array of boolean variables. I have tried coding a wrapper around a `std::vector` object that looks like this:
struct densityStack {
vector<int> density; //will store the sum of boolean variables
int card; //will store the amount of elements we summed over for normalizing at the end
densityStack(int size){ //constructor taking as only parameter the size of the array, usually size = 30
density = vector<int> (size, 0);
card = 0;
}
void push_back(vector<int> & toBeAdded){ //method summing a new array (of measurements) to our stack
for(auto valStack = density.begin(), newVal = toBeAdded.begin(); valStack != density.end(); ++valStack, ++ newVal)
*valStack += *newVal;
card++;
}
void savef(const char * fname){ //method outputting into a file
ofstream out(fname);
out.precision(10);
out << card << "\n"; //saving the cardinal in first line
for(auto val = density.begin(); val != density.end(); ++val)
out << (double) *val/card << "\n";
out.close();
}
};
Then, in my code I use a single densityStack object and every time a CPU core has data (can be 100 times per second) it will call push_back to send the data back to densityStack.
My issue is that this seems to be slower than the first raw approach, where each core stored each array of measurements in a file and I then used a Python script to average and clean them (I was unhappy with it because it stored too much information and put too much useless stress on the hard drives).
Do you see where I could be losing a lot of performance? I mean, is there an obvious source of overhead? Because to me, copying back the vector even at frequencies of 1000 Hz should not be too much.
How are you synchronizing your shared densityStack instance?
From the limited info here my guess is that the CPUs are blocked waiting to write data every time they have a tiny chunk of data. If that is the issue, a simple technique to improve performance would be to reduce the number of writes. Keep a buffer of data for each CPU and write to the densityStack less frequently.
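A minimal sketch of the per-CPU buffering idea, assuming the shared accumulator is protected by a mutex (the question does not show how it is synchronized). SharedDensity, run_chain, run_all and flush_every are hypothetical names; the inner loop uses a dummy measurement in which every site is true.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Shared accumulator: touched only under the mutex, and only once per batch.
struct SharedDensity {
    std::vector<long long> density;
    long long card = 0;
    std::mutex guard;

    explicit SharedDensity(std::size_t size) : density(size, 0) {}

    // Adds a whole thread-local batch in one locked step.
    void merge(const std::vector<long long>& local, long long local_card) {
        std::lock_guard<std::mutex> lock(guard);
        for (std::size_t k = 0; k < density.size(); ++k) density[k] += local[k];
        card += local_card;
    }
};

// One chain: sums measurements locally and merges every `flush_every` samples.
void run_chain(SharedDensity& shared, int samples, int flush_every) {
    assert(flush_every > 0);
    std::vector<long long> local(shared.density.size(), 0);
    long long local_card = 0;
    for (int s = 0; s < samples; ++s) {
        for (auto& v : local) v += 1;  // dummy measurement: every site is true
        ++local_card;
        if (local_card == flush_every) {
            shared.merge(local, local_card);
            local.assign(local.size(), 0);
            local_card = 0;
        }
    }
    if (local_card > 0) shared.merge(local, local_card);  // flush the remainder
}

// Runs `n_threads` chains concurrently against one shared accumulator.
void run_all(SharedDensity& shared, int n_threads, int samples, int flush_every) {
    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t)
        pool.emplace_back(run_chain, std::ref(shared), samples, flush_every);
    for (auto& th : pool) th.join();
}
```

With a batch size of 64, each chain takes the lock roughly 64 times less often than it would when pushing every measurement individually; the right batch size is something to measure rather than guess.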

slower loops in a function when using floor and std::set

I'm writing a class on Windows using Visual Studio. One of its public functions has a big for loop that looks like this:
void brain_network_opencl::block_filter_fcd_all(int m)
{
const int m_block_len = m * block_len;
time_t start, end;
for (int j = 0; j < shift_2d_gpu[1]; j++) // local work size/number of rows per block
{
for (int i = 0; i < masksize; i++) // number of extracted voxels
{
if (j + m_block_len != i)
{
//if (floor(dst_ptr_gpu[i + j * masksize] * power_up) > threadhold_fcd)
if ((int)(dst_ptr_gpu[i + j * masksize] * power_up) > threadhold_fcd)
{
org_row = mask_ind[j + m_block_len];
org_col = mask_ind[i];
nodes.insert(org_row);
conns.insert(make_pair(org_row, org_col));
}
}
}
}
end = clock();
cout << end - start << "ms" << " for block" << j << endl;
}
where nodes is a std::set<int>, conns is a std::multimap<int, int>, and mask_ind is a std::vector<int>. They are declared as private member variables, as are masksize and shift_2d_gpu.
Most of the time is spent in floor and .insert.
The problem is that the same code (with all the variables) placed in a main function takes only between 1/5 of the time and the same time as when it is called from here. And if I replace (int) with floor in both this function and main(), it costs much more in this function.
What causes this problem, and do I have to write it all inside main()?
By the way, does it have something to do with the overloads?
floor shows +3 overloads and .insert shows +5 overloads.
updates
I copied the code of this function into the main function of another, new console project.
It's still much slower than my first test (where the code was also in main)!
Now I'm confused...
Are there any settings that make floor and .insert faster?
updates 2014/03/31
It's because of the setting in Project Properties->Configuration Properties->C/C++->General->Debug Information Format. This value is set to Program Database for Edit And Continue (/ZI) by default, which is incompatible with a lot of optimizations according to MSDN. If the value is set to Program Database (/Zi), floor no longer costs 10 times as much as (int).
(I looked at the disassembly and found that the generated code (call floor -> jmp floor -> different code) differs in length when the setting is changed; that is why floor and .insert took much more time than they should.)
As Gassa has pointed out, to optimize the tight loop use a custom floor function.
set<int> isn't cache friendly, but to replace it with a cache-friendly structure you might need to alter the algorithm. Still, unordered_set<int>, with a decent space reserved to it, should be a bit better, having less cache misses per insert than a binary tree.
P.S. Non-virtual overloads in C++ are resolved at compile time and have no effect on performance
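A minimal sketch of both suggestions, under stated assumptions: fast_floor is one common way to write a custom floor and is only valid for values that fit in an int, and indices_above is a hypothetical helper that stands in for the question's loop, collecting into an unordered_set with space reserved up front.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <unordered_set>
#include <vector>

// Floor for values that fit in an int; avoids a library call to floor().
inline int fast_floor(double x) {
    assert(x >= std::numeric_limits<int>::min() &&
           x <= std::numeric_limits<int>::max());
    const int truncated = static_cast<int>(x);   // rounds toward zero
    return truncated - (x < truncated ? 1 : 0);  // step down for negative fractions
}

// Hypothetical helper: indices whose scaled value exceeds `threshold`.
std::unordered_set<int> indices_above(const std::vector<double>& values,
                                      double scale, int threshold) {
    std::unordered_set<int> nodes;
    nodes.reserve(values.size());  // reserve buckets up front to avoid rehashing
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (fast_floor(values[i] * scale) > threshold) {
            nodes.insert(static_cast<int>(i));
        }
    }
    return nodes;
}
```

For non-negative inputs the plain (int) cast already equals the floor, so fast_floor only differs from the cast for negative non-integers. Whether either change is measurable should be checked in an optimized build.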

Calculating moving average in C++

I am trying to calculate the moving average of a signal. The signal value (a double) is updated at random times.
I am looking for an efficient way to calculate its time-weighted average over a time window, in real time. I could do it myself, but it is more challenging than I thought.
Most of the resources I've found on the internet calculate the moving average of a periodic signal, but mine updates at random times.
Does anyone know good resources for that?
Thanks
The trick is the following: you get updates at random times via void update(int time, float value). However, you also need to track when an update falls off the time window, so you set an "alarm" that is called at time + N and removes the previous update from ever being considered again in the computation.
If this happens in real time, you can ask the operating system to call a method void drop_off_oldest_update(int time) at time + N.
If this is a simulation, you cannot get help from the operating system and you need to do it manually. In a simulation you would call methods with the time supplied as an argument (which does not correlate with real time). However, a reasonable assumption is that the calls are guaranteed to be such that the time arguments are increasing. In this case you need to maintain a sorted list of alarm time values, and for each update and read call you check if the time argument is greater than the head of the alarm list. While it is greater you do the alarm related processing (drop off the oldest update), remove the head and check again until all alarms prior to the given time are processed. Then do the update call.
I have so far assumed it is obvious what you would do for the actual computation, but I will elaborate just in case. I assume you have a method float read (int time) that you use to read the values. The goal is to make this call as efficient as possible. So you do not compute the moving average every time the read method is called. Instead you precompute the value as of the last update or the last alarm, and "tweak" this value by a couple of floating point operations to account for the passage of time since the last update. (i. e. a constant number of operations except for perhaps processing a list of piled up alarms).
Hopefully this is clear -- this should be a quite simple algorithm and quite efficient.
Further optimization: one of the remaining problems is if a large number of updates happen within the time window, then there is a long time for which there are neither reads nor updates, and then a read or update comes along. In this case, the above algorithm will be inefficient in incrementally updating the value for each of the updates that is falling off. This is not necessary because we only care about the last update beyond the time window so if there is a way to efficiently drop off all older updates, it would help.
To do this, we can modify the algorithm to do a binary search of updates to find the most recent update before the time window. If there are relatively few updates that need to be "dropped", then one can incrementally update the value for each dropped update. But if there are many updates that need to be dropped, then one can recompute the value from scratch after dropping the old updates.
Appendix on Incremental Computation: I should clarify what I mean by incremental computation above in the sentence "tweak" this value by a couple of floating point operations to account for the passage of time since the last update. Initial non-incremental computation:
start with
sum = 0;
updates_in_window = /* set of all updates within window */;
prior_update' = /* most recent update prior to window with timestamp tweaked to window beginning */;
relevant_updates = /* union of prior_update' and updates_in_window */,
then iterate over relevant_updates in order of increasing time:
for each update EXCEPT last {
sum += update.value * time_to_next_update;
},
and finally
moving_average = (sum + last_update * time_since_last_update) / window_length;.
Now if exactly one update falls off the window but no new updates arrive, adjust sum as:
sum -= prior_update'.value * time_to_next_update + first_update_in_last_window.value * time_from_first_update_to_new_window_beginning;
(note it is prior_update' which has its timestamp modified to start of last window beginning). And if exactly one update enters the window but no new updates fall off, adjust sum as:
sum += previously_most_recent_update.value * corresponding_time_to_next_update.
As should be obvious, this is a rough sketch but hopefully it shows how you can maintain the average such that it is O(1) operations per update on an amortized basis. But note further optimization in previous paragraph. Also note stability issues alluded to in an older answer, which means that floating point errors may accumulate over a large number of such incremental operations such that there is a divergence from the result of the full computation that is significant to the application.
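As a reference point, here is a minimal C++ sketch of the non-incremental computation described above: it recomputes the sum on every read rather than adjusting it incrementally. It assumes the signal is piecewise constant (each value holds until the next update), that times are non-decreasing, and that the time before the first recorded update contributes zero. TimeWeightedAverage and its members are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

class TimeWeightedAverage {
public:
    explicit TimeWeightedAverage(double window) : window_(window) {
        assert(window > 0.0);
    }

    // Records that the signal changed to `value` at time `t`.
    void update(double t, double value) {
        assert(updates_.empty() || t >= updates_.back().first);
        updates_.emplace_back(t, value);
    }

    // Average of the piecewise-constant signal over [now - window, now].
    double read(double now) {
        assert(updates_.empty() || now >= updates_.back().first);
        const double begin = now - window_;
        // Drop updates superseded before the window begins, keeping the most
        // recent one at or before `begin` (its value still holds in the window).
        while (updates_.size() >= 2 && updates_[1].first <= begin) {
            updates_.pop_front();
        }
        double sum = 0.0;
        for (std::size_t k = 0; k < updates_.size(); ++k) {
            const double from = std::max(updates_[k].first, begin);
            const double to = (k + 1 < updates_.size()) ? updates_[k + 1].first : now;
            if (to > from) sum += updates_[k].second * (to - from);
        }
        return sum / window_;
    }

private:
    double window_;
    std::deque<std::pair<double, double>> updates_;  // (time, value), time-ordered
};
```

For example, with a window of 10, a value of 2 set at time 0 and a value of 4 set at time 15, reading at time 20 weights each value by 5 time units and yields 3. A full-recompute version like this is also useful as a test oracle for an incremental implementation, given the drift concerns mentioned above.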
If an approximation is OK and there's a minimum time between samples, you could try super-sampling. Have an array that represents evenly spaced time intervals that are shorter than the minimum, and at each time period store the latest sample that was received. The shorter the interval, the closer the average will be to the true value. The period should be no greater than half the minimum or there is a chance of missing a sample.
#include <map>
#include <iostream>
// Sample - the type of a single sample
// Date - the type of a time notation
// DateDiff - the type of difference of two Dates
template <class Sample, class Date, class DateDiff = Date>
class TWMA {
private:
typedef std::map<Date, Sample> qType;
const DateDiff windowSize; // The time width of the sampling window
qType samples; // A set of sample/date pairs
Sample average; // The answer
public:
// windowSize - The time width of the sampling window
TWMA(const DateDiff& windowSize) : windowSize(windowSize), average(0) {}
// Call this each time you receive a sample
void
Update(const Sample& sample, const Date& now) {
// First throw away all old data
Date then(now - windowSize);
samples.erase(samples.begin(), samples.upper_bound(then));
// Next add new data
samples[now] = sample;
// Compute average: note: this could move to Average(), depending upon
// precise user requirements.
Sample sum = Sample();
for(typename qType::iterator it = samples.begin();
it != samples.end();
++it) {
DateDiff duration(it->first - then);
sum += duration * it->second;
then = it->first;
}
average = sum / windowSize;
}
// Call this when you need the answer.
const Sample& Average() { return average; }
};
int main () {
TWMA<double, int> samples(10);
samples.Update(1, 1);
std::cout << samples.Average() << "\n"; // 1
samples.Update(1, 2);
std::cout << samples.Average() << "\n"; // 1
samples.Update(1, 3);
std::cout << samples.Average() << "\n"; // 1
samples.Update(10, 20);
std::cout << samples.Average() << "\n"; // 10
samples.Update(0, 25);
std::cout << samples.Average() << "\n"; // 5
samples.Update(0, 30);
std::cout << samples.Average() << "\n"; // 0
}
Note: Apparently this is not the way to approach this. Leaving it here for reference on what is wrong with this approach. Check the comments.
UPDATED - based on Oli's comment... not sure about the instability that he is talking about though.
Use a sorted map of "arrival times" against values. Upon arrival of a value, add the arrival time to the sorted map along with its value and update the moving average.
warning this is pseudo-code:
SortedMapType< int, double > timeValueMap;
void onArrival(double value)
{
timeValueMap.insert( (int)time(NULL), value);
}
//for example this runs every 10 seconds and the moving window is 120 seconds long
void recalcRunningAverage()
{
// you know that the oldest thing in the list is
// going to be 129.9999 seconds old
int expireTime = (int)time(NULL) - 120;
int removeFromTotal = 0;
MapIterType i;
for( i = timeValueMap.begin();
(i != timeValueMap.end() && i->first < expireTime) ; ++i )
{
}
// NOW REMOVE PAIRS TO LEFT OF i
// Below needs to apply your time-weighting to the remaining values
runningTotal = calculateRunningTotal(timeValueMap);
average = runningTotal/timeValueMap.size();
}
There... Not fully fleshed out but you get the idea.
Things to note:
As I said the above is pseudo code. You'll need to choose an appropriate map.
Don't remove the pairs as you iterate through as you will invalidate the iterator and will have to start again.
See Oli's comment below also.

Counting down in for-loops

I believe (from some research reading) that counting down in for-loops is actually more efficient and faster in runtime. My full software code is C++
I currently have this:
for (i=0; i<domain; ++i) {
my 'i' is unsigned register int,
also 'domain' is unsigned int
in the for-loop i is used for going through an array, e.g.
array[i] = do stuff
converting this to count down messes up the expected/correct output of my routine.
I can imagine the answer being quite trivial, but I can't get my head round it.
UPDATE: 'do stuff' does not depend on previous or later iterations. The calculations within the for-loop are independent for that iteration of i. (I hope that makes sense).
UPDATE: To achieve a runtime speedup with my for-loop, do I count down and, if so, remove the unsigned part when declaring my int, or is there some other method?
Please help.
There is only one correct method of looping backwards using an unsigned counter:
for( i = n; i-- > 0; )
{
// Use i as normal here
}
There's a trick here, for the last loop iteration you will have i = 1 at the top of the loop, i-- > 0 passes because 1 > 0, then i = 0 in the loop body. On the next iteration i-- > 0 fails because i == 0, so it doesn't matter that the postfix decrement rolled over the counter.
Very non obvious I know.
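A small sketch to make the idiom concrete: reverse_visit_order is a hypothetical helper that records the indices the loop visits, so the order and the empty case can be checked directly.

```cpp
#include <cassert>
#include <vector>

// Returns the indices visited by the unsigned count-down idiom, in order.
std::vector<unsigned> reverse_visit_order(unsigned n) {
    std::vector<unsigned> order;
    // The test reads i before the decrement, so the body sees n-1, ..., 0
    // and the loop stops when i was already 0 at the test.
    for (unsigned i = n; i-- > 0; ) {
        order.push_back(i);
    }
    assert(order.size() == n);  // every index is visited exactly once
    return order;
}
```

For n = 3 the body sees 2, 1, 0, and for n = 0 it never runs, which is exactly the case where the naive `i >= 0` condition goes wrong with an unsigned counter.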
I'm guessing your backward for loop looks like this:
for (i = domain - 1; i >= 0; --i) {
In that case, because i is unsigned, it will always be greater than or equal to zero. When you decrement an unsigned variable that is equal to zero, it will wrap around to a very large number. The solution is either to make i signed, or change the condition in the for loop like this:
for (i = domain - 1; i >= 0 && i < domain; --i) {
Or count from domain to 1 rather than from domain - 1 to 0:
for (i = domain; i >= 1; --i) {
array[i - 1] = ...; // notice you have to subtract 1 from i inside the loop now
}
This is not an answer to your problem, because you don't seem to have a problem.
This kind of optimization is completely irrelevant and should be left to the compiler (if done at all).
Have you profiled your program to check that your for-loop is a bottleneck? If not, then you do not need to spend time worrying about this. Even more so, having "i" as a "register" int, as you write, makes no real sense from a performance standpoint.
Even without knowing your problem domain, I can guarantee you that both the reverse-looping technique and the "register" int counter will have negligible impact on your program's performance. Remember, "Premature optimization is the root of all evil".
That said, better spent optimization time would be on thinking about the overall program structure, data structures and algorithms used, resource utilization, etc.
Checking to see if a number is zero can be quicker or more efficient than a comparison. But this is the sort of micro-optimization you really shouldn't worry about - a few clock cycles will be greatly dwarfed by just about any other perf issue.
On x86:
dec eax
jnz Foo
Instead of:
inc eax
cmp eax, 15
jl Foo
It has nothing to do with counting up or down. What can be faster is counting toward zero. Michael's answer shows why — x86 gives you a comparison with zero as an implicit side effect of many instructions, so after you adjust your counter, you just branch based on the result instead of doing an explicit comparison. (Maybe other architectures do that, too; I don't know.)
Borland's Pascal compilers are notorious for performing that optimization. The compiler transforms this code:
for i := x to y do
foo(i);
into an internal representation more akin to this:
tmp := Succ(y - x);
i := x;
while tmp > 0 do begin
foo(i);
Inc(i);
Dec(tmp);
end;
(I say notorious not because the optimization affects the outcome of the loop, but because the debugger displays the counter variable incorrectly. When the programmer inspects i, the debugger may display the value of tmp instead, causing no end of confusion and panic for programmers who think their loops are running backward.)
The idea is that even with the extra Inc or Dec instruction, it's still a net win, in terms of running time, over doing an explicit comparison. Whether you can actually notice that difference is up for debate.
But note that the conversion is something the compiler would do automatically, based on whether it deemed the transformation worthwhile. The compiler is usually better at optimizing code than you are, so don't spend too much effort competing with it.
Anyway, you asked about C++, not Pascal. C++ "for" loops aren't quite as easy to apply that optimization to as Pascal "for" loops are because the bounds of Pascal's loops are always fully calculated before the loop runs, whereas C++ loops sometimes depend on the stopping condition and the loop contents. C++ compilers need to do some amount of static analysis to determine whether any given loop could fit the requirements for the kind of transformation Pascal loops qualify for unconditionally. If the C++ compiler does the analysis, then it could do a similar transformation.
There's nothing stopping you from writing your loops that way on your own:
for (unsigned i = 0, tmp = domain; tmp > 0; ++i, --tmp)
array[i] = do stuff
Doing that might make your code run faster. Like I said before, though, you probably won't notice. The bigger cost you pay by manually arranging your loops like that is that your code no longer follows established idioms. Your loop is a perfectly ordinary "for" loop, but it no longer looks like one — it has two variables, they're counting in opposite directions, and one of them isn't even used in the loop body — so anyone reading your code (including you, a week, a month, or a year from now when you've forgotten the "optimization" you were hoping to achieve) will need to spend extra effort proving to himself or herself that the loop is indeed an ordinary loop in disguise.
(Did you notice that my code above used unsigned variables with no danger of wrapping around at zero? Using two separate variables allows that.)
Three things to take away from all this:
Let the optimizer do its job; on the whole it's better at it than you are.
Make ordinary code look ordinary so that the special code doesn't have to compete to get attention from people reviewing, debugging, or maintaining it.
Don't do anything fancy in the name of performance until testing and profiling show it to be necessary.
If you have a decent compiler, it will optimize "counting up" just as effectively as "counting down". Just try a few benchmarks and you'll see.
So you "read" that counting down is more efficient? I find this very difficult to believe unless you show me some profiler results and the code. I can buy it under some circumstances, but in the general case, no. Seems to me like this is a classic case of premature optimization.
Your comment about "register int i" is also very telling. Nowadays, the compiler always knows better than you how to allocate registers. Don't bother using the register keyword unless you have profiled your code.
When you're looping through data structures of any sort, cache misses have a far bigger impact than the direction you're going. Concern yourself with the bigger picture of memory layout and algorithm structure instead of trivial micro-optimisations.
You may try the following, which compiler will optimize very efficiently:
#define for_range(_type, _param, _A1, _B1) \
for (_type _param = _A1, _finish = _B1,\
_step = static_cast<_type>(2*(((int)_finish)>(int)_param)-1),\
_stop = static_cast<_type>(((int)_finish)+(int)_step); _param != _stop; \
_param = static_cast<_type>(((int)_param)+(int)_step))
Now you can use it:
for_range (unsigned, i, 10,0)
{
cout << "backwards i: " << i << endl;
}
for_range (char, c, 'z','a')
{
cout << c << endl;
}
enum Count { zero, one, two, three };
for_range (Count, c, three, zero)
{
cout << "backwards: " << c << endl;
}
You may iterate in any direction:
for_range (Count, c, zero, three)
{
cout << "forward: " << c << endl;
}
The loop
for_range (unsigned,i,b,a)
{
// body of the loop
}
will produce the following code:
mov esi,b
L1:
; body of the loop
dec esi
cmp esi,a-1
jne L1
Hard to say with information given but... reverse your array, and count down?
Jeremy Ruten rightly pointed out that using an unsigned loop counter is dangerous. It's also unnecessary, as far as I can tell.
Others have also pointed out the dangers of premature optimization. They're absolutely right.
With that said, here is a style I used when programming embedded systems many years ago, when every byte and every cycle did count for something. These forms were useful for me on the particular CPUs and compilers that I was using, but your mileage may vary.
// Start out pointing to the last elem in array
pointer_to_array_elem_type p = array + (domain - 1);
for (int i = domain; --i >= 0 ; ) { // start at domain: the predecrement runs before the first test
*p-- = (... whatever ...)
}
This form takes advantage of the condition flag that is set on some processors after arithmetical operations -- on some architectures, the decrement and testing for the branch condition can be combined into a single instruction. Note that using predecrement (--i) is the key here -- using postdecrement (i--) would not have worked as well.
Alternatively,
// Start out pointing *beyond* the last elem in array
pointer_to_array_elem_type p = array + domain;
for (pointer_to_array_type p = array + domain; p - domain > 0 ; ) {
*(--p) = (... whatever ...)
}
This second form takes advantage of pointer (address) arithmetic. I rarely see the form (pointer - int) these days (for good reason), but the language guarantees that when you subtract an int from a pointer, the pointer is decremented by (int * sizeof (*pointer)).
I'll emphasize again that whether these forms are a win for you depends on the CPU and compiler that you're using. They served me well on Motorola 6809 and 68000 architectures.
In some later ARM cores, decrement and compare takes only a single instruction. This makes decrementing loops more efficient than incrementing ones.
I don't know why there isn't an increment-compare instruction also.
I'm surprised that this post was voted -1 when it's a true issue.
Everyone here is focusing on performance. There is actually a logical reason to iterate towards zero that can result in cleaner code.
Iterating over the last element first is convenient when you delete invalid elements by swapping with the end of the array. For bad elements not adjacent to the end we can swap into the end position, decrease the end bound of the array, and keep iterating. If you were to iterate toward the end then swapping with the end could result in swapping bad for bad. By iterating end to 0 we know that the element at the end of the array has already been proven valid for this iteration.
For further explanation...
If:
You delete bad elements by swapping with one end of the array and changing the array bounds to exclude the bad elements.
Then obviously:
You would swap with a good element i.e. one that has already been tested in this iteration.
So this implies:
If we iterate away from the variable bound then elements between the variable bound and the current iteration pointer have been proven good. Whether the iteration pointer gets ++ or -- doesn't matter. What matters is that we're iterating away from the variable bound so we know that the elements adjacent to it are good.
So finally:
Iterating towards 0 allows us to use only one variable to represent the array bounds. Whether this matters is a personal decision between you and your compiler.
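Here is a small sketch of that idea, assuming a `std::vector<int>` in which negative values count as "bad" (the function name `remove_negatives` and the badness test are invented for illustration). The vector's own size is the only bound that changes, and element order is not preserved:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical example: drop every negative value from v.
// We iterate from the last element toward index 0. Everything between
// i and the end has already been tested in this pass, so the element
// swapped in from the back is known to be good and needs no re-test.
void remove_negatives(std::vector<int>& v) {
    for (std::size_t i = v.size(); i-- > 0; ) {
        if (v[i] < 0) {
            std::swap(v[i], v.back());  // move the bad element to the end
            v.pop_back();               // shrink the bound to exclude it
        }
    }
}
```

If the loop ran from 0 upward instead, the element swapped in from the back would not have been tested yet, so index i would have to be re-checked before moving on.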
What matters much more than whether you're increasing or decreasing your counter is whether or not you're going up memory or down memory. Most caches are optimized for going up memory, not down memory. Since memory access time is the bottleneck that most programs today face, this means that changing your program so that you go up memory can result in a performance boost even if this requires comparing your counter to a non-zero value. In some of my programs, I saw a significant improvement in performance by changing my code to go up memory instead of down it.
Skeptical? Here's the output that I got:
sum up = 705046256
sum down = 705046256
Ave. Up Memory = 4839 mus
Ave. Down Memory = 5552 mus
sum up = inf
sum down = inf
Ave. Up Memory = 18638 mus
Ave. Down Memory = 19053 mus
from running this program:
#include <chrono>
#include <cstddef>
#include <iostream>
#include <limits>
#include <random>
#include <vector>
template<class Iterator, typename T>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, T a, T b) {
std::random_device rnd_device;
std::mt19937 generator(rnd_device());
std::uniform_int_distribution<T> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class Iterator>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, double a, double b) {
std::random_device rnd_device;
std::mt19937_64 generator(rnd_device());
std::uniform_real_distribution<double> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class RAI, class T>
inline void sum_abs_up(RAI first, RAI one_past_last, T &total) {
T sum = 0;
auto it = first;
do {
sum += *it;
it++;
} while (it != one_past_last);
total += sum;
}
template<class RAI, class T>
inline void sum_abs_down(RAI first, RAI one_past_last, T &total) {
T sum = 0;
auto it = one_past_last;
do {
it--;
sum += *it;
} while (it != first);
total += sum;
}
template<class T> std::chrono::nanoseconds TimeDown(
std::vector<T> &vec, const std::vector<T> &vec_original,
std::size_t num_repititions, T &running_sum) {
std::chrono::nanoseconds total{0};
for (std::size_t i = 0; i < num_repititions; i++) {
auto start_time = std::chrono::high_resolution_clock::now();
sum_abs_down(vec.begin(), vec.end(), running_sum);
total += std::chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
template<class T> std::chrono::nanoseconds TimeUp(
std::vector<T> &vec, const std::vector<T> &vec_original,
std::size_t num_repititions, T &running_sum) {
std::chrono::nanoseconds total{0};
for (std::size_t i = 0; i < num_repititions; i++) {
auto start_time = std::chrono::high_resolution_clock::now();
sum_abs_up(vec.begin(), vec.end(), running_sum);
total += std::chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
int main() {
std::size_t num_repititions = 1 << 10;
{
typedef int ValueType;
auto lower = std::numeric_limits<ValueType>::min();
auto upper = std::numeric_limits<ValueType>::max();
std::vector<ValueType> vec(1 << 24);
FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
const auto vec_original = vec;
ValueType sum_up = 0, sum_down = 0;
auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
std::cout << "sum up = " << sum_up << '\n';
std::cout << "sum down = " << sum_down << '\n';
std::cout << "Ave. Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
std::cout << "Ave. Down Memory = "<< time_down/(num_repititions * 1000) << " mus"
<< std::endl;
}
{
typedef double ValueType;
auto lower = std::numeric_limits<ValueType>::min();
auto upper = std::numeric_limits<ValueType>::max();
std::vector<ValueType> vec(1 << 24);
FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
const auto vec_original = vec;
ValueType sum_up = 0, sum_down = 0;
auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
std::cout << "sum up = " << sum_up << '\n';
std::cout << "sum down = " << sum_down << '\n';
std::cout << "Ave. Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
std::cout << "Ave. Down Memory = "<< time_down/(num_repititions * 1000) << " mus"
<< std::endl;
}
return 0;
}
Both sum_abs_up and sum_abs_down do the same thing and are timed the same way, with the only difference being that sum_abs_up goes up memory while sum_abs_down goes down memory. I even pass vec by reference so that both functions access the same memory locations. Nevertheless, sum_abs_up is consistently faster than sum_abs_down. Give it a run yourself (I compiled it with g++ -O3).
FYI vec_original is there for experimentation, to make it easy for me to change sum_abs_up and sum_abs_down in a way that makes them alter vec while not allowing these changes to affect future timings.
It's important to note how tight the loop that I'm timing is. If a loop's body is large, then it likely won't matter whether its iterator goes up or down memory, since the time it takes to execute the body will likely dominate. It's also worth mentioning that for some rare loops, going down memory is sometimes faster than going up it. But even with such loops, it was rarely the case that going up was always slower than going down. By contrast, small-bodied loops that go up memory were very often consistently faster than the equivalent down-memory loops; a small handful of times they were even 40+% faster.
The point is, as a rule of thumb, if you have the option, if the loop's body is small, and if there's little difference between having your loop go up memory instead of down it, then you should go up memory.