This Reddit thread drew my attention to custom memory allocators. User Rohmboid says, for instance:
People wouldn't be writing their own pool allocators if there wasn't a clear benefit.
How do they know there is one?
I don’t want to waste my time/money/energy on writing a custom allocator if the time spent managing memory only accounts for less than 1% of the duration of my program. Neither do I want to switch to a custom allocator and be unable to tell the speedup. So I am wondering: how can I measure (or at least, estimate) the time spent allocating/freeing/fetching memory?
How do I know there is one?
Profile your code.
There's no point optimizing something that isn't a hot path in your code.
If the allocator (call it A) takes 5% of your CPU time and the rest of your app (call it B) takes the other 95%, making the allocator twice as fast saves you 5%/2 = 2.5% of the total run time. Now try speeding up B by even a modest fraction instead.
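To put rough numbers on that comparison (a back-of-the-envelope with Amdahl's law, using the 5%/95% split assumed above):

overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of time spent in the part you optimize and s is how much faster you make it

allocator: p = 0.05, s = 2.0  ->  1 / (0.95 + 0.025) ≈ 1.026, about 2.5% faster overall
rest (B):  p = 0.95, s = 1.1  ->  1 / (0.05 + 0.864) ≈ 1.09, about 9% faster overall

So making everything else just 10% faster already beats halving the allocator's cost.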
How?
The easiest way is to use your IDE's built-in profiler; the MSVS one is rather decent, although I use Intel VTune, which is really easy to use and simply points you at the places to optimize.
Using a profiler has an additional benefit: you don't have to modify your code at all, and you don't have to recompile when you want to change profiling options and run again. That said, timers in your application can also give good results, although they rarely need to be placed directly inside the allocator; it's better to progressively narrow down the places where the program spends the most time.
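If you go the manual-timer route, something as small as this scoped timer (a minimal sketch of my own, not from any particular library) is usually enough to start narrowing things down:

#include <chrono>
#include <cstdio>

// Minimal RAII timer: prints the elapsed time of the enclosing scope.
struct ScopedTimer {
    const char* label;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    explicit ScopedTimer(const char* l) : label(l) {}
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld us\n", label, static_cast<long long>(us));
    }
};

void process() {
    ScopedTimer t("process()");   // times everything until the end of this scope
    // ... code under suspicion goes here ...
}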
Related
I have a native C++ application which performs heavy calculations and consumes a lot of memory. My goal is to optimize it, mainly reduce its run time.
After several cycles of profiling-optimizing, I tried the Profile Guided Optimization which I never tried before.
I followed the steps described on MSDN, Profile-Guided Optimizations, changing the compile (/GL) and link (/LTCG) flags. After adding /GENPROFILE, I ran the application to create the .pgc and .pgd files, then changed the linker options to /USEPROFILE and watched the additional linker messages reporting that the profiling data had been used:
3> 0 of 0 ( 0.0%) original invalid call sites were matched.
3> 0 new call sites were added.
3> 116 of 27096 ( 0.43%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
3> 63583 of 345025 inline instances were from dead/cold paths
3> 27096 of 27096 functions (100.0%) were optimized using profile data
3> 608324578581 of 608324578581 instructions (100.0%) were optimized using profile data
3> Finished generating code
Everything looked promising, until I measured the program's performance.
The results were absolutely counterintuitive to me: performance went down instead of up! The program was 4% to 5% slower than without Profile-Guided Optimization (comparing runs with and without the /USEPROFILE option).
Even when running the exact same scenario that was used with /GENPROFILE to create the Profile Guided Optimization data files, it ran 4% slower.
What is going on?
Looking at the sparse doc here, profile-guided optimization doesn't seem to include any memory optimizations.
If your program uses 2 GiB of memory, I'd speculate that execution speed is limited by memory access and not by the CPU itself. (You also mentioned that maps are used; those are memory-bound as well.)
Memory access is difficult for PGO to optimize, because it can't change your malloc calls to (for example) put frequently used data into the same pages or make sure it ends up on the same CPU cache line.
In addition, the optimizer may introduce extra memory accesses when trying to improve the raw CPU performance of your program.
The doc mentions "virtual call speculation"; I would speculate that this (and maybe other features, like inlining) could introduce additional memory traffic, degrading overall performance because memory bandwidth is already the limiting factor.
Don't look at it as a black box. If the program can be sped up, it's because it is doing things it doesn't need to do.
Those things will hide from the profile-guided or any other optimizer, and they will certainly hide from your guesser.
They won't hide from this. Many people use it.
I'm trying to resist the temptation to guess, but I'm failing.
Here's what I see in big C++ apps, no matter how well-written they are.
When people could use a simple data structure like an array, instead they use an abstract container class, with iterators and whatnot. Where does the time go?
That's where it goes.
Another thing they do is write "powerful functions and methods". The writer of the function is so proud of it, that it does so much, that he/she expects it will be called reverently and sparingly.
The user of the function (which could be the same person) thinks "Look how useful this function is! See how much I can get done in a single line of code? The more I use it the more productive I will be."
See how this can easily do needless work?
There's another thing that happens in software - layers of abstraction.
If the pattern above is repeated over several layers of abstraction, the slowdown factors multiply.
The good news is, if you can find those and fix them, you can get an enormous speedup. The bad news is you could get a reputation as "not a team player".
What are the basic tips and tricks that a C++ programmer should know when trying to optimize his code in the context of Caching?
Here's something to think about:
For instance, I know that reducing a function's footprint would make the code run a bit faster, since there would be fewer instructions overall in the processor's instruction cache.
When allocating an std::array<char, <size>>, what would be the ideal size to make your reads from and writes to the array faster?
How big can an object be to decide to put it on the heap instead of the stack?
In most cases, knowing the correct answer to your question will gain you less than 1% overall performance.
Some (data-)cache optimizations that come to my mind are:
For arrays: use less RAM. Try shorter data types or a simple compression scheme like RLE. This can save CPU at the same time, or conversely waste CPU cycles on data-type conversions; floating-point-to-integer conversions in particular can be quite expensive.
Avoid accessing the same cache line (usually around 64 bytes) from different threads, unless all of the access is read-only (see the sketch after this list).
Group members that are often used together next to each other, and prefer sequential access over random access.
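To make the last two points concrete, here is a hypothetical sketch (the struct names and the 64-byte line size are my assumptions, not from the list above):

#include <atomic>

// Bad: two counters written by different threads share one cache line,
// so every increment invalidates the other core's copy (false sharing).
struct CountersBad {
    std::atomic<long> thread_a_hits;
    std::atomic<long> thread_b_hits;
};

// Better: give each independently written counter its own cache line
// (64 bytes is typical; check your target CPU).
struct alignas(64) PaddedCounter {
    std::atomic<long> value;
};

// Grouping members that are used together: fields touched in the hot loop
// sit next to each other, cold data goes at the end (or in another struct).
struct Particle {
    float x, y, z;       // position, read/written every iteration
    float vx, vy, vz;    // velocity, read/written every iteration
    int   debug_id;      // cold, only used when logging
};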
If you really want to know all about caches, read What Every Programmer Should Know About Memory. While I disagree with the title, it's a great in-depth document.
Because your question suggests that you actually expect gains from just following the tips above (in which case you will be disappointed), here are some general optimization tips:
Tip #1: About 90% of your code should be optimized for readability, not performance. If you decide to attempt an optimization for performance, make sure you actually measure the gain; when it is below 5%, I usually go back to the more readable version.
Tip #2: If you have an existing codebase, profile it first. If you don't profile it, you will miss some very effective optimizations. Usually there are some calls to time-consuming functions that can be completely eliminated, or the result cached.
If you don't want to use a profiler, at least print the current time in a couple of places, or interrupt the program with a debugger a couple of times to check where it is most often spending its time.
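As an illustration of the "cache the result" case from Tip #2, a hypothetical sketch (expensive_lookup is a made-up stand-in for whatever the profiler points at):

#include <map>
#include <string>

std::string expensive_lookup(int key);   // hypothetical slow function found by profiling

// Wrap it with a simple cache so repeated calls with the same key are nearly free.
const std::string& cached_lookup(int key) {
    static std::map<int, std::string> cache;
    auto it = cache.find(key);
    if (it == cache.end())
        it = cache.emplace(key, expensive_lookup(key)).first;
    return it->second;
}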
It's been shown to me that it is possible for the RAM to be read, or even taken over, by a RAM bypass without the system crashing. http://www.google.com/patents/US6745308
However, the patent notes over and over that if a component isn't idle, it cannot be bypassed. This seems to have been confirmed: https://electronics.stackexchange.com/a/70881/17872
Is it possible for C++ to prevent the RAM controller from becoming idle while allowing the program to otherwise operate normally? If so, how?
I understand that this could be a huge amount of code if it is possible at all, so please feel free to provide pseudocode (but actual code gets the check mark in the long run).
It really depends on what you mean by "kept from becoming idle", and probably on a whole range of system parameters (bus speed, memory controller speed, CPU/GPU speed, etc.). A trivial attempt may simply be to allocate a large amount of memory and write to every cell in it as fast as the processor can manage. Multiple threads doing this may be required to saturate the bus, as a single core may not issue enough write operations.
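For example, something along these lines (a rough sketch of that trivial attempt; the buffer size, the 64-byte stride, and the never-ending loop are my own arbitrary choices):

#include <cstddef>
#include <thread>
#include <vector>

// Keep the memory controller busy: several threads endlessly writing through a
// buffer much larger than the last-level cache, so the writes keep missing
// cache and have to go out to RAM.
int main() {
    const std::size_t total = 512u * 1024u * 1024u;   // 512 MiB total, adjust to taste
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 2;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([total, nthreads] {
            std::vector<char> buf(total / nthreads);
            volatile char* p = buf.data();            // volatile so the writes aren't optimized away
            for (;;)                                  // runs forever; a real program would interleave useful work
                for (std::size_t i = 0; i < buf.size(); i += 64)   // one write per cache line
                    p[i]++;
        });
    }
    for (auto& w : workers) w.join();                 // never reached; the loops above don't terminate
}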
Having said that, I'm not sure that's necessarily a critical factor. Yes, if someone writes pathologically bad code, the patented method doesn't provide any benefit. But it also doesn't carry much of a drawback compared with not having it: yes, a few more gates, but it doesn't look like an extremely complex piece of logic (compared to everything else that goes into a modern processor or GPU). The key point is that quite often systems are not 100% saturated, so the bypass will succeed and provide a benefit.
I may of course have misunderstood what your question is, and why you are asking it....
I want to dynamically allocate a memory block for an array in C/C++, and this array will be accessed at a high frequency. So I want this array to stay on chip, i.e., in the Cache. How can I do this explicitly with code in C/C++?
There is no standard C++ language feature that allows you to do this.
Depending on your compiler and CPU, you may be able to use an arch-specific CPU instruction in an asm block:
T* p = new T(...);
size_t n = sizeof(T);
asm {
// pseudocode: substitute the actual cache/prefetch instruction for your target CPU
"CACHE n bytes at address p"
}
...or some builtin compiler function ("intrinsic") that does this.
You will need to consult your CPU manual and/or your compiler manual.
As an example, x86 CPUs have a set of instructions starting with PREFETCH.
As another example, GCC has a builtin function called __builtin_prefetch. See GCC Data Prefetch Support.
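For instance, with GCC the hint could look roughly like this (a sketch; the 64-byte stride is an assumption, and a prefetch only hints, it does not pin anything in cache):

#include <cstddef>

// Walk the buffer one cache line at a time and ask the CPU to pull each line
// in for reading (rw = 0) with high temporal locality (locality = 3).
void warm(const char* data, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 64)
        __builtin_prefetch(data + i, 0, 3);
}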
I will try to answer this question from a slightly different perspective: do you really need to do this, and even if there were a way, would it be worth it? Imagine there were a "magic" void * malloc_and_lock_in_cache( int cacheLevel ) function. What are you going to do with that data? If your application is just a while (1) loop doing random array accesses from a single thread, you will get that behaviour anyway thanks to compiler optimisation and the CPU architecture. If you think about more real-world code, there is always logic around the accesses: locking for multithreading, various conditions, and so on. So the question becomes: are the rest of your application's algorithms so perfect that the only thing left to do is allocate the array in cache?
Are all the other access/sorting/lookup functions such state-of-the-art logic that there is nothing left to review, rather than chasing the very limited performance kickback of trying to second-guess the CPU's own optimisation?
Also, do you intend to run your application without ANY operating system, on raw hardware, so that you don't have to care about how your allocation affects OS behaviour and the rest of the applications running alongside yours?
And what should happen if your application runs inside a virtual machine or an environment like Xen?
I can remember a similar popular subject 15-18 years ago: physical memory usage and disk caching utilities. Tools like MS-DOS SMARTDrive and similar utilities were REALLY useful and sped things up a lot. Usenet was full of "tuning advice" and performance analyses for things like write-through/write-back settings.
Especially if your DOS application processed large amounts of data and implemented some memory-swapping logic of its own (I am talking about times when 4 MB of RAM was a luxury), it mostly turned into a drama: on the one hand you want as much memory as you can get, on the other hand you need to swap, so you swap, but the swapping goes through the cache, and so on.
But what happened next? We got VM386 mode, disk caching and memory swapping integrated into the OS, and nobody cared anymore about tuning SMARTDrive or RAM disks. In general it became "cheaper" to allocate as much virtual memory as you need than to implement your own voodoo algorithms for swapping physical memory blocks (although that functionality is still in the WinAPI).
So I would really recommend concentrating your effort on algorithms and application design rather than on very low-level features with rather unpredictable results, unless you are developing some new microkernel OS.
I don't think you can. First, which cache? L3, L2, L1? You can prefetch, align your data so that access to it is better optimized, and maybe touch it periodically so it stays warm and doesn't get LRU'd out, but you can't really force it to stay in cache.
First you have to know the architecture of the machine you want to run the code on, and then check whether it has an instruction that does that kind of thing.
Actually, using the memory heavily will make the cache controller keep that region in cache anyway.
And there are three rules of optimizing, you may want to know them first :)
http://c2.com/cgi/wiki?RulesOfOptimization
I have a program written in C++ that runs a number of for loops per second without using anything that would make it wait for any reason. It consistently uses 2-10% of the CPU. Is there any way to force it to use more of the CPU and do a greater number of calculations without making the program more complex? Additionally, I compile with C::B on a Windows computer. Essentially, I'm asking whether there is a way to make my program faster by increasing usage of CPU, and if so, how.
That depends on why it's only using 10% of the CPU. If it's because you're using a multi-CPU machine and your program is using only one CPU, then no, you will have to introduce concurrency into your code to use that additional horsepower.
If it's being limited by something else (e.g. copying data to and from the disk), then you don't need to focus on CPU, you need to focus on whatever the bottleneck is. Most likely, the limiter will be reading from the disk, which you can improve by using better caching mechanisms.
Assuming your application has the power (the PROCESS_SET_INFORMATION access right), you can use SetPriorityClass to bump up your priority (to the usual detriment of all other processes, of course).
You can go ABOVE_NORMAL_PRIORITY_CLASS (try this one first), HIGH_PRIORITY_CLASS (be very careful with this one) or REALTIME_PRIORITY_CLASS (I would strongly suggest that you probably shouldn't give this one a shot).
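In code, that amounts to little more than this (a minimal sketch; real code should decide how to handle the failure case):

#include <windows.h>
#include <cstdio>

int main() {
    // Raise the priority of the current process one notch above normal.
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
        std::fprintf(stderr, "SetPriorityClass failed: %lu\n", GetLastError());
    // ... the rest of the program runs as before ...
    return 0;
}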
If you try the higher priorities and it's still clocking pretty low, then that's probably because you're not CPU-bound (such as if you're writing data to an output file). If that's the case, you'll probably have to find a way to make yourself CPU bound.
Just keep in mind that doing so may not be necessary (or even desirable). If you're running at a higher priority than other threads and you're still not sucking up a lot of CPU, it's probably because Windows has (most likely, rightfully) decided you don't need it.
It's really not the program's right or responsibility to demand additional resources from the system. That's the OS' job, as resource scheduler.
If it is necessary to use more CPU time than the OS sees fit, you should request that from the OS using the platform-dependent API. In this case, that seems to be something along the lines of SetPriorityClass or SetThreadPriority.
Creating a thread and giving it a higher priority might be one way.
If you use C++, consider using Intel Threading Building Blocks. You can find some examples here.
Some profilers give very nice indications of where the bottlenecks in your code are. For example, CodeAnalyst (for AMD chips only) shows the instructions-per-cycle ratio. I'm sure Intel's profilers are similar.
As Billy O'Neal says though, if you're running on an 8-core machine, being stuck at 10 percent of CPU is about right. If this is your problem, then Windows MSVC++ has a parallel mode (the Parallel Patterns Library) for the standard algorithms. This can give you parallelisation for free if you have written your loops the C++ way (it's still your responsibility to make sure the loops are thread safe). I've not used the MSVC version, but __gnu_parallel::for_each and friends work a treat.
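For reference, the GNU parallel-mode call looks roughly like this (a sketch; it needs GCC with -fopenmp, heavy() is a made-up stand-in for your loop body, and that body must be thread safe):

#include <parallel/algorithm>   // GNU libstdc++ parallel mode
#include <vector>

void heavy(double& x);          // hypothetical per-element work

void process_all(std::vector<double>& v) {
    // Runs the loop body on multiple cores; same contract as std::for_each,
    // except the order is unspecified and the callable must be thread safe.
    __gnu_parallel::for_each(v.begin(), v.end(), [](double& x) { heavy(x); });
}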