Does divide-and-conquer really win against the increased memory allocation? - c++

I've just finished coding some classical divide-and-conquer algorithms, and I came up with the following question (more out of curiosity):
Admittedly, in many cases divide-and-conquer algorithms are faster than the traditional approach; for example, the Fast Fourier Transform improves the complexity from N^2 to N*log2(N). However, while coding them I found that, because of the "dividing", we get more subproblems, which means we have to create more containers and perform additional memory allocations for those subproblems. Just think about it: in merge sort we have to create a left and a right array in each recursion, and in the Fast Fourier Transform we have to create an odd and an even array in each recursion. This means the algorithm allocates more memory as it runs.
So, my question is: in reality, such as in C++, do algorithms like divide-and-conquer really win when we also have to pay the extra cost of the memory allocation? (Or does memory allocation take no run time at all, i.e. is its cost zero?)
Thanks for helping me out!

Almost everything when it comes to optimisation is a compromise between one resource vs. another - in traditional engineering it's typically "cost vs. material".
In computing, it's often "time vs. memory usage" that is the compromise.
I don't think there is one simple answer to your actual question - it really depends on the algorithm - and in real life, this may lead to compromise solutions where a problem is divided into smaller pieces, but not ALL the way down to the minimal size, only "until it's no longer efficient to divide it".
Memory allocation isn't a zero-cost operation, if we are talking about new and delete. Stack memory is near zero cost once the actual stack memory has been populated with physical memory by the OS - it's at most one extra instruction on most architectures to make some space on the stack, and sometimes one extra instruction at exit to give the memory back.
The real answer is, as nearly always when it comes to performance, to benchmark the different solutions.

It is useful to understand that getting "one level better" in big-O terms (like going from n^2 to n, or from n to log n) usually matters a lot. Consider your Fourier example.
At O(n^2), with n=100 you're looking at 10000, and with n=1000 you get a whole million, 1000000. On the other hand, with O(n*log2(n)) you get about 664 for n=100 and 9965 for n=1000. The slower growth should be obvious.
Of course memory allocation costs resources, as does the other code needed in divide-and-conquer, such as combining the parts. But the whole idea is that the overhead from the extra allocations and such is far, far smaller than the extra time the asymptotically slower algorithm would need.
The time for extra allocations isn't usually a concern, but the memory use itself can be. That is one of the fundamental programming tradeoffs. You have to choose between speed and memory usage. Sometimes you can afford the extra memory to get faster results, sometimes you must save all the memory. This is one of the reasons why there's no 'ultimate algorithm' for many problems. Say, mergesort is great, running in O(n * log(n)) even in the worst-case scenario, but it needs extra memory. Unless you use the in-place version, which then runs slower. Or maybe you know your data is likely already near-sorted and then something like smoothsort suits you better.
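To make the allocation point concrete, here is a minimal merge sort sketch of my own (not taken from either answer above): it allocates a single scratch buffer up front instead of creating new left/right arrays in every recursion, and it stops dividing below a small cutoff where plain insertion sort is cheaper. The cutoff of 32 is an arbitrary illustrative value; benchmark to tune it.

    #include <algorithm>
    #include <vector>

    // Below this size, dividing further is not worth it; just insertion-sort.
    constexpr std::size_t kCutoff = 32;  // illustrative value, tune by benchmarking

    static void insertion_sort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo + 1; i < hi; ++i)
            for (std::size_t j = i; j > lo && a[j] < a[j - 1]; --j)
                std::swap(a[j], a[j - 1]);
    }

    // Sorts a[lo, hi) using scratch[lo, hi) as the merge buffer.
    static void merge_sort_impl(std::vector<int>& a, std::vector<int>& scratch,
                                std::size_t lo, std::size_t hi) {
        if (hi - lo <= kCutoff) {          // stop dividing when it no longer pays off
            insertion_sort(a, lo, hi);
            return;
        }
        const std::size_t mid = lo + (hi - lo) / 2;
        merge_sort_impl(a, scratch, lo, mid);
        merge_sort_impl(a, scratch, mid, hi);
        std::merge(a.begin() + lo, a.begin() + mid,
                   a.begin() + mid, a.begin() + hi,
                   scratch.begin() + lo);
        std::copy(scratch.begin() + lo, scratch.begin() + hi, a.begin() + lo);
    }

    void merge_sort(std::vector<int>& a) {
        std::vector<int> scratch(a.size());  // one allocation for the whole sort
        merge_sort_impl(a, scratch, 0, a.size());
    }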

Related

Most efficient way to grow array C++

Apologies if this has been asked before; I can't find a question that fully answers what I want to know. Existing questions mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take, therefore I don't know how long my time arrays will be. The maximum time will be 100,000 s, but the time step could be as small as 0.001 s, so the arrays could be as long as 1e8 doubles in the worst case (which is not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and its size-increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird, the OS's memory allocation should take care of most fragmentation issues, including the fact that it can be hard to find an 800 MB block of free memory.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
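A minimal sketch of the difference, assuming that page-on-demand behaviour: the fill constructor or resize writes every element and therefore touches all the pages up front, while reserve followed by push_back only commits physical memory as the simulation actually produces values.

    #include <vector>

    int main() {
        const std::size_t max_steps = 100'000'000;  // 1e8 doubles ~ 800 MB of address space

        // Touches everything up front: every page gets written (and hence committed).
        // std::vector<double> eager(max_steps, 0.0);

        // Reserves address space only; pages are committed as push_back reaches them.
        std::vector<double> times;
        times.reserve(max_steps);

        double t = 0.0, dt = 0.001;
        while (t < 100.0) {                // tiny stand-in for the real time loop
            times.push_back(t);            // no reallocation ever happens: capacity is fixed
            t += dt;
        }
    }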
Yet I tend to avoid this solution for the most part at these kinds of input scales, for a few simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia, and the third might just be plain dumb. Yet at least on operating systems like Windows, a debug build initializes the memory in its entirety early on, so the allocated pages get mapped to DRAM immediately on reserving capacity for such a vector. That can lead to a slight startup delay and a task manager showing 800 megabytes of memory usage for the debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
We can aggregate these chunks together either by linking them (forming an unrolled list) or by storing a vector of pointers to them in a separate aggregate, depending on whether random access is needed or merely sequential access will suffice. For the random-access case this incurs a slight overhead, yet one I've tended to find relatively small at these input scales, where the time is often dominated by the upper levels of the memory hierarchy rather than the register and instruction level.
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
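For what it's worth, a bare-bones version of that chunked approach might look like the sketch below (a vector of pointers to fixed-size chunks of 512 doubles each; all names are made up for illustration). std::deque does something very similar internally, which is why it appears among the options in the next answer.

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Grow-on-demand storage for doubles: a vector of fixed-size chunks.
    class ChunkedArray {
    public:
        static constexpr std::size_t kChunkSize = 512;  // 512 * 8 bytes = one 4 KB page

        void push_back(double v) {
            if (count_ % kChunkSize == 0)                // current chunk is full (or none yet)
                chunks_.push_back(std::make_unique<double[]>(kChunkSize));
            chunks_[count_ / kChunkSize][count_ % kChunkSize] = v;
            ++count_;
        }

        double operator[](std::size_t i) const {         // random access: one extra indirection
            return chunks_[i / kChunkSize][i % kChunkSize];
        }

        std::size_t size() const { return count_; }

    private:
        std::vector<std::unique_ptr<double[]>> chunks_;
        std::size_t count_ = 0;
    };

    // Usage: ChunkedArray a; a.push_back(3.14); double x = a[0];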
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.
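If it helps, a rough benchmark harness along these lines is enough to compare the three candidates; the timings and conclusions will of course vary with machine, compiler, and optimization flags.

    #include <chrono>
    #include <cstdio>
    #include <deque>
    #include <vector>

    template <typename F>
    static long long time_ms(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }

    int main() {
        const std::size_t n = 100'000'000;  // worst-case number of time steps

        std::printf("vector + reserve(max): %lld ms\n", time_ms([&] {
            std::vector<double> v; v.reserve(n);
            for (std::size_t i = 0; i < n; ++i) v.push_back(0.001 * i);
        }));

        std::printf("vector + small reserve: %lld ms\n", time_ms([&] {
            std::vector<double> v; v.reserve(1'000'000);  // conservative ballpark guess
            for (std::size_t i = 0; i < n; ++i) v.push_back(0.001 * i);  // reallocates as it grows
        }));

        std::printf("deque: %lld ms\n", time_ms([&] {
            std::deque<double> d;
            for (std::size_t i = 0; i < n; ++i) d.push_back(0.001 * i);
        }));
    }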

3D FFT with data larger than cache

I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points which is much much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of overhead accessing memory.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data plus the exponentials and the output array are several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (32 bit complex values = 2gbytes, same again for the output array, plus exponentials, etc).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
It does not matter that your arrays are very large; 50% misses are excessive. The processor should avoid misses by detecting that you are iterating over an array and loading the data elements you are likely to use ahead of time.
Either the pattern of array accesses is not regular, so the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache associativity problem, that is, elements in your iteration are mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, that is, your accesses are exactly 1 MB apart, then the lower 20 bits are always the same, so every element goes to the same cache slot and each new element shares that slot with the previous ones. When you get to the fifth element, all four ways are used up and from then on it is only misses; in that case your cache is effectively a single slot. If you stride by half the cache size, the effective number of slots is two, which might be enough to avoid misses entirely, or give 100% misses, or anything in between, depending on whether your access pattern requires both slots simultaneously or not.
To convince yourself of this, write a toy program with varying stride sizes; you will see that strides which divide or are multiples of the cache size increase the misses. You can measure it with valgrind --tool=cachegrind.
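A minimal sketch of such a toy program (sizes are illustrative); run it under "valgrind --tool=cachegrind ./stride <stride>" and compare miss rates for different strides. Every run touches the same number of elements, only the access order changes.

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main(int argc, char** argv) {
        const std::size_t bytes  = 64u * 1024u * 1024u;                   // 64 MB working set
        const std::size_t n      = bytes / sizeof(double);
        const std::size_t stride = (argc > 1) ? std::atoi(argv[1]) : 1;   // stride in doubles

        std::vector<double> data(n, 1.0);
        double sum = 0.0;

        auto t0 = std::chrono::steady_clock::now();
        // Touch every element, but in a strided order: for stride > 1 we sweep the
        // array several times, each pass hitting a different offset.
        for (std::size_t offset = 0; offset < stride; ++offset)
            for (std::size_t i = offset; i < n; i += stride)
                sum += data[i];
        auto t1 = std::chrono::steady_clock::now();

        std::printf("stride %zu: sum=%f, %lld ms\n", stride, sum,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        return 0;
    }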
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
Arrange the data in a two-dimensional array
Do an FFT along each row
Apply twiddle factors
Do a matrix transpose
Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey in an efficient way even when the high-level routines aren't doing so.
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs.
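For what it's worth, here is a hedged sketch of that four-step structure for N = N1 * N2 (my own illustration, not MKL's API). The small_dft function is a naive O(n^2) stand-in for the library's small transforms, and the index split follows n = N1*n2 + n1, k = N2*k1 + k2; a real implementation would call the library for the row FFTs and use a cache-blocked transpose.

    #include <cassert>
    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cd = std::complex<double>;
    static const double PI = std::acos(-1.0);

    // Naive DFT of length n on a contiguous row -- stand-in for a library FFT.
    static void small_dft(const cd* in, cd* out, std::size_t n) {
        for (std::size_t k = 0; k < n; ++k) {
            cd sum = 0.0;
            for (std::size_t j = 0; j < n; ++j)
                sum += in[j] * std::polar(1.0, -2.0 * PI * double(j * k) / double(n));
            out[k] = sum;
        }
    }

    // Four-step FFT: X[N2*k1 + k2] = sum over n1,n2 of x[N1*n2 + n1] * twiddles.
    std::vector<cd> four_step_fft(const std::vector<cd>& x, std::size_t N1, std::size_t N2) {
        const std::size_t N = N1 * N2;
        assert(x.size() == N);
        std::vector<cd> A(N), B(N), out(N);

        // Step 1: gather into an N1 x N2 matrix, A[n1][n2] = x[N1*n2 + n1].
        for (std::size_t n1 = 0; n1 < N1; ++n1)
            for (std::size_t n2 = 0; n2 < N2; ++n2)
                A[n1 * N2 + n2] = x[N1 * n2 + n1];

        // Step 2: length-N2 FFT along each of the N1 rows (rows fit in cache if N2 is small).
        for (std::size_t n1 = 0; n1 < N1; ++n1)
            small_dft(&A[n1 * N2], &B[n1 * N2], N2);

        // Step 3: multiply by the twiddle factors w_N^(n1*k2).
        for (std::size_t n1 = 0; n1 < N1; ++n1)
            for (std::size_t k2 = 0; k2 < N2; ++k2)
                B[n1 * N2 + k2] *= std::polar(1.0, -2.0 * PI * double(n1 * k2) / double(N));

        // Step 4: transpose to N2 x N1 so the second pass is again row-wise.
        std::vector<cd> C(N);
        for (std::size_t n1 = 0; n1 < N1; ++n1)
            for (std::size_t k2 = 0; k2 < N2; ++k2)
                C[k2 * N1 + n1] = B[n1 * N2 + k2];

        // Step 5: length-N1 FFT along each of the N2 rows, then scatter to natural order.
        std::vector<cd> D(N);
        for (std::size_t k2 = 0; k2 < N2; ++k2)
            small_dft(&C[k2 * N1], &D[k2 * N1], N1);
        for (std::size_t k2 = 0; k2 < N2; ++k2)
            for (std::size_t k1 = 0; k1 < N1; ++k1)
                out[N2 * k1 + k2] = D[k2 * N1 + k1];
        return out;
    }

    // e.g. four_step_fft(x, 32, 32) transforms a length-1024 input; compare with a direct DFT to verify.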

How to evaluate the quality of custom memory allocator?

What characteristics should be checked when evaluating memory allocation?
Performance of allocation and de-allocation? Are simple stress tests enough? How do you check the quality of the allocations?
For example, I found Oracle's test for malloc, but it's only Oracle's view of the problem, and that test is oriented only toward multi-threaded performance.
How do people usually check their allocators?
Just to give more focus to the "how" rather than the "what", which the other answers seem to deal with: here's how I would do it.
First Step - Make it possible to compare approaches
Determine what qualities you value. Make a list, prioritize and finally, make a value function.
That is, figure out which measurements are the most useful indicators of quality, in your view/case. A few good measurements could be average time to allocate a memory block, total runtime of the application (if applicable), average frame rate, total or average memory consumption ... It all depends on what you wish to achieve.
Then, create a function which, given these measurements from a test run, gives you a value which can be used as quality measure. The simplest case would be to simply decide a weight factor for each of the measurements. These weight factors should embody both the importance of each measurement and, if they use different units (such as nanoseconds for average allocation time and bytes for average memory consumption), attempt to scale them to compare fairly.
Second Step - Devise a test scenario
This should be as close to a realistic case as possible. The best would be simply the actual code that you want to use your memory allocator for, with added code for taking all the measurements needed to compute your value function.
Third Step - Test
Write a bunch of different allocators and test them all against each other, as well as the default or without any allocator (if applicable). Measure all results, compute the value function for each and rank them according to the results. Keep in mind all the different considerations that you always need to think of when performing performance measurements.
Fourth Step - Evaluate and re-iterate
Look at how the different solutions stack up against each other. Apply some critical thinking: do these results actually correspond to how you experienced the quality of each allocator during the tests?
For example, if the one which seemed blazing fast and gave a total runtime half a minute shorter than the rest ends up with a mediocre score, then it's time to scrutinize your approach. Perhaps there's a bug in your measuring? Or perhaps you need to re-evaluate your chosen value function... Re-iterate steps one through four until the results are clear and in accordance with your actual experience testing them.
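As a hedged sketch of steps one to three: run each candidate through the same workload, collect the raw measurements, and fold them into a single weighted score. The weights, metric names, and the two-call allocator interface here are all made up for illustration; the baseline candidate simply wraps malloc/free so the sketch compiles and runs.

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    // Raw measurements gathered from one test run of one allocator candidate.
    struct Measurements {
        double avg_alloc_ns;      // average time per allocation
        double total_runtime_ms;  // total runtime of the scenario
    };

    // Step one: the value function. Lower is better; the weights encode both the
    // priorities and a rough normalisation between the different units.
    static double score(const Measurements& m) {
        return 1.0 * m.avg_alloc_ns + 2.0 * m.total_runtime_ms;
    }

    // A trivial baseline candidate; your real candidates would expose the same two calls.
    struct MallocAllocator {
        void* allocate(std::size_t n) { return std::malloc(n); }
        void deallocate(void* p)      { std::free(p); }
    };

    // Steps two and three: run the (here, toy) scenario and collect measurements.
    template <typename Allocator>
    static Measurements run_scenario(Allocator& alloc) {
        const int kAllocs = 1'000'000;
        std::vector<void*> blocks;
        blocks.reserve(kAllocs);

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kAllocs; ++i)
            blocks.push_back(alloc.allocate(64));   // workload: many small blocks
        for (void* p : blocks)
            alloc.deallocate(p);
        auto t1 = std::chrono::steady_clock::now();

        double total_ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        return { total_ns / kAllocs, total_ns / 1e6 };
    }

    int main() {
        MallocAllocator baseline;
        Measurements m = run_scenario(baseline);
        std::printf("avg alloc %.1f ns, runtime %.1f ms, score %.1f\n",
                    m.avg_alloc_ns, m.total_runtime_ms, score(m));
    }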
Usually, the performance of a memory allocator is about the speed of finding and creating a memory chunk in the heap, depending on the size of the manipulated memory blocks, and also (more recently) about how it behaves under multi-threaded allocation. You can find interesting studies and benchmarks in the following list:
ptmalloc - a multi-thread malloc implementation
Benchmarks of the Lockless Memory Allocator
Dynamic memory allocator implementations in Linux system libraries
A Scalable Concurrent malloc(3) Implementation for FreeBSD
... probably many others ...
I guess my answer is not genius but - it depends.
If you are writing a custom memory allocator, you probably know what its characteristics should be. E.g. if you want an allocator that lets you quickly allocate lots of small objects and you don't really care about memory overhead, you should probably run different tests than when you are creating an allocator for big objects and want to save as much memory as possible, even at the cost of CPU time.
Stress tests are always good because they can help you find race conditions and check that your allocator is bug-free, but performance tests depend on what you want to achieve.
Here are the metrics that one should consider when optimizing/analyzing the dynamic memory allocation mechanism in the system.
Implementation overhead - how much memory it costs to keep the allocator's internal data structures operational, and whether these structures grow over time or are pre-allocated once (both approaches have pros and cons and both are valid).
Operational efficiency - how long it takes to allocate and free a memory block. Allocation is usually the challenge here, because it is almost never constant time and depends on the characteristics of the previously allocated blocks. Freeing a block looks straightforward, but combined with memory de-fragmentation it deserves further attention (not covered here).
Thread safety - this has less to do with allocation itself and more with the decision to use a certain solution in the system. Basically, if you don't have threads, there is nothing to worry about; if you do have threads, make sure that your allocation will not be interrupted while at work.
Memory fragmentation - the actual layout of the allocated memory. Here two completely contradicting requirements come into play: either you allocate as soon as you find the right spot in your buffer, or you make sure you cause as little fragmentation as possible. The former is fast, whilst the latter is more resource-friendly (and also potentially slower).
Garbage collection - a separate topic and a field of study on its own, mentioned only for the sake of completeness. What is important to understand is that even if you don't plan on releasing allocated memory very often, a GC-style pass can still help analyse the already allocated memory and prepare the internal data structures for the next efficient allocation. Idle CPU time is arguably the best moment for this housekeeping task. This topic, however, is out of scope of this question.
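To ground the first two metrics, here is a deliberately simple sketch of a fixed-size block pool (my illustration, not a production allocator): the implementation overhead is one pointer per free block embedded in the blocks themselves, and allocation/free are constant-time pops and pushes on a free list. It is not thread-safe and handles only one block size, which is exactly the kind of tradeoff the list above asks you to weigh.

    #include <cstddef>
    #include <vector>

    // Fixed-size block pool: constant-time allocate/free, no per-block header,
    // but it can only hand out blocks of one size and is not thread-safe.
    class FixedPool {
    public:
        FixedPool(std::size_t block_size, std::size_t block_count)
            : block_size_(block_size < sizeof(void*) ? sizeof(void*) : block_size),
              storage_(block_size_ * block_count) {
            // Thread the free list through the blocks themselves (zero extra memory).
            for (std::size_t i = 0; i < block_count; ++i) {
                void* block = storage_.data() + i * block_size_;
                *static_cast<void**>(block) = free_list_;
                free_list_ = block;
            }
        }

        void* allocate() {                    // O(1): pop the head of the free list
            if (!free_list_) return nullptr;  // pool exhausted
            void* block = free_list_;
            free_list_ = *static_cast<void**>(block);
            return block;
        }

        void deallocate(void* block) {        // O(1): push back onto the free list
            *static_cast<void**>(block) = free_list_;
            free_list_ = block;
        }

    private:
        std::size_t block_size_;
        std::vector<char> storage_;
        void* free_list_ = nullptr;
    };

    // Usage: FixedPool pool(64, 1024); void* p = pool.allocate(); pool.deallocate(p);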

Can function overhead slow down a program by a factor of 50x?

I have a code that I'm running for a project. It is O(N^2), where N is 200 for my case. There is an algorithm that turns this O(N^2) to O(N logN). This means that, with this new algorithm, it should be ~100 times faster. However, I'm only getting a factor of 2-fold increase (aka 2x faster).
I'm trying to narrow down things to see if I messed something up, or whether it's something inherent to the way I coded this program. For starters, I have a lot of function overhead within nested classes. For example, I have a lot of this (within many loops):
energy = globals->pair_style->LJ->energy();
Since I'm getting the right results when it comes to the actual data, just the wrong speed increase, I'm wondering if function overhead can actually cause that much of a slowdown - as much as 50-fold.
Thanks!
Firstly, your interpretation that O(N logN) is ~100 times faster than O(N^2) for N=200 is incorrect. The big-Oh notation deals with upper bounds and behaviour in the limit, and doesn't account for any multiplicative constants in the complexity.
Secondly, yes, on modern hardware function calls tend to be relatively expensive due to pipeline disruption. To find out how big a factor this is in your case, you'd have to come up with some microbenchmarks.
The absolute biggest hit is cache misses. An L1 cache miss is relatively cheap, but when you miss in L2 (or L3 if you have it) you may be losing hundreds or even thousands of cycles to the incoming stall.
The thing is, though, this may only be part of the problem. Do not optimise your code until you have profiled it. Identify the slow areas and then figure out WHY they are slow. Once you understand why it is running slowly, you have a good chance of optimising it.
As an aside, O notation is very handy but is not the be-all and end-all. I've seen O(n^2) algorithms work significantly faster than O(n log n) ones for small amounts of data (and small may mean less than several thousand elements) because they cache far more effectively.
The important thing about Big O notation is that it only specifies the limit of the execution time, as the data set size increases - any constants are thrown away. While O(N^2) is indeed slower than O(N log N), the actual run times might be N^2 vs. 1000N log N - that is, an O(N^2) can be faster than O(N log N) on some data sets.
Without more details, it's hard to say more - yes, function calls do indeed have a fair amount of overhead, and that might be why you're not seeing a bigger increase in performance - or it might just be the case that your O(N log N) doesn't perform quite as well on a data set of your size.
I've worked on image processing algorithms, and calling a function per pixel (i.e. 307,200 calls for a 640x480 image) can significantly reduce performance. Try declaring your function inline, or making it a macro; that can quickly show you whether the function calls are to blame. Also try some profiling tools: VS 2010 comes with some nice ones, and there are also Intel VTune and GlowCode. They can show you where you are spending time.
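A rough microbenchmark along these lines can tell you what the pointer chasing and call overhead actually cost on your machine. The Globals/PairStyle/LJ chain mirrors the accessor in the question but is entirely made up here, and at -O2 the compiler may devirtualise or inline more than you expect, so treat the numbers as a ballpark only.

    #include <chrono>
    #include <cstdio>

    // Stand-ins for the nested classes in the question (hypothetical names).
    struct LJ {
        double sigma = 1.0, eps = 1.0;
        virtual ~LJ() = default;
        virtual double energy(double r2) const {          // out-of-line, virtual call
            double s6 = (sigma * sigma / r2); s6 = s6 * s6 * s6;
            return 4.0 * eps * (s6 * s6 - s6);
        }
    };
    struct PairStyle { LJ* lj = nullptr; };
    struct Globals   { PairStyle* pair_style = nullptr; };

    int main() {
        LJ lj; PairStyle ps{&lj}; Globals g{&ps};
        const long long n = 100'000'000;
        double sum = 0.0;

        auto t0 = std::chrono::steady_clock::now();
        for (long long i = 0; i < n; ++i)
            sum += g.pair_style->lj->energy(1.0 + (i & 7) * 0.1);  // chained deref + virtual call
        auto t1 = std::chrono::steady_clock::now();

        auto t2 = std::chrono::steady_clock::now();
        for (long long i = 0; i < n; ++i) {                         // same math, inlined by hand
            double r2 = 1.0 + (i & 7) * 0.1;
            double s6 = 1.0 / r2; s6 = s6 * s6 * s6;
            sum += 4.0 * (s6 * s6 - s6);
        }
        auto t3 = std::chrono::steady_clock::now();

        std::printf("virtual chain: %lld ms, inlined: %lld ms (sum=%f)\n",
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count(), sum);
    }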
IMHO I don't think that 1600 function calls should reduce performance much at all (200 log 200)
I suggest profiling it using:
gprof (requires compile-time instrumentation)
valgrind --tool=callgrind and kcachegrind; an excellent tool with excellent visualization
The big FAQ topic on profiling is here: How can I profile C++ code running in Linux?

Data type size impact on performance

I'm running intensive numerical simulations. I often use long integers, but I realized it would be safe to use integers instead. Would this improve the speed of my simulations significantly?
Depends. If you have lots and lots of numbers consecutively in memory, they're more likely to fit in the L2 cache, so you have fewer cache misses. L2 cache misses - depending on your platform - can be a significant performance impact, so it's definitely a good idea to make things fit in cache as much as possible (and prefetch if you can). But don't expect your code to fly all of a sudden because your types are smaller.
EDIT: One more thing - if you choose an awkward type (like a 16-bit integer on a 32-bit or 64-bit platform), you may end up with worse performance because the CPU will have to surgically extract the 16-bit value and turn it into something it can work with. But typically, ints are a good choice.
Depends on your data set sizes. Obviously, halving the size of your integers could double the amount of data that fits into the CPU caches, and thus access to the data would be faster. For more details I suggest you read Ulrich Drepper's famous paper What Every Programmer Should Know About Memory.
This is why typedef is your friend. :-)
If mathematically possible, try using floats instead of integers. I read somewhere that floating point arithmetic (esp. multiplication) can actually be faster on some processors.
The best thing is to experiment and benchmark. It's damn near impossible to figure out analytically which micro-optimizations work best.
EDIT: This post discusses the performance difference between integer and float.
All the answers have already treated the CPU cache issue: if your data is two times smaller, then in some cases it can fit into L2 cache completely, yielding performance boost.
However, there is another very important and more general factor: memory bandwidth. If your algorithm is linear (i.e. O(N) complexity) and accesses memory sequentially, then it may be memory-bound. That means memory reads and writes are the bottleneck, and the CPU simply wastes a lot of cycles waiting for memory operations to complete. In such a case, halving the total memory size would yield a reliable 2x performance boost.
Moreover, in such cases switching to bytes may yield an even bigger boost, despite the fact that CPU computations may be slower on bytes, as one of the other answers has already mentioned.
In general, the answer depends on several things: the total size of the data your algorithm works with, the memory access pattern (random vs. sequential), the algorithm's asymptotic complexity, and the computation-to-memory ratio (mostly relevant for linear algorithms).
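A quick way to see whether you are in that memory-bound regime is to sum a large array with both element widths and compare; this is only a hedged sketch with illustrative sizes, but if the 32-bit version runs close to twice as fast, your loop is limited by bandwidth rather than arithmetic.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    template <typename T>
    static long long sum_ms(const std::vector<T>& v, long long& out) {
        auto t0 = std::chrono::steady_clock::now();
        out = std::accumulate(v.begin(), v.end(), 0LL);   // sequential, bandwidth-bound pass
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    }

    int main() {
        const std::size_t n = 100'000'000;                // ~800 MB as int64, ~400 MB as int32
        std::vector<std::int64_t> wide(n, 1);
        std::vector<std::int32_t> narrow(n, 1);

        long long s1 = 0, s2 = 0;
        long long t_wide = sum_ms(wide, s1);
        long long t_narrow = sum_ms(narrow, s2);
        std::printf("int64: %lld ms, int32: %lld ms (sums %lld / %lld)\n",
                    t_wide, t_narrow, s1, s2);
    }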