how to limit cache memory ubuntu? - c++

I have written a C++ code which I have to run on many low configuration computers. Now, my PC is very high configuration. I am using Ubuntu 10.04 & I set hard limit on some resources i.e. on memory & virtual memory. Now my question is:
1) how to set limit on the cache size and cache line size ?
2) what other limits I should put to check my code is OK or not ?
I am using command:
ulimit -H -m 1000000
ulimit -H -v 500000

You can't limit cache size, that's a (mostly transparent) hardware feature.
The good news is this shouldn't matter, since you can't run out of cache - it just spills and your program runs more slowly.
If your concern is avoiding spills, you could investigate valgrind --tool=cachegrind - it may be possible to examine the likely behaviour on your target hardware cache.

To PROPERLY simulate running on low-end machines (although not with low cache-limits), you can run the code in a virtual machine rather than the real hardware of your machine. This will show you what happens on a machine with small memory much more than if you limit using ulimit, as ulimit simply limits what YOUR application will get. So it shows that your application doesn't run out of memory when running a particular set of tests. But it doesn't show how the application and system behaves together when there isn't a huge amount of memory in the first place.
A machine with low amount of physical memory will behave quite differently when it comes to for example swapping behaviour, and filesystem caching, just to mention a couple of things that change between a "large memory, but application is limited" vs "small memory in the first place".
I'm not sure if Ubuntu comes with any flavour of Virtual Machine setup, but for example VirtualBox is pretty easy to configure and set up on any Linux/Windows machine. As long as you have a modern enough processor to run hardware virtualization instructions.
As Useless not at all uselessly stated, cache-memory will not "run out" or in any other way cause a failure. It will run a little slower, but not massive amounts (about 10x for any given operation, but this will be averaged over a large number of other instructions in most cases, unless you are really working hard at proving how important cache is, such as very large matrix multiplications).
One tip might also be to look around for some old hardware. There are usually computers for sale that are several years old, for next to nothing at a "computer recycling shop" or similar. Set such a system up, install your choice of OS, and see what happens.

Related

does building from a RAM drive truly yield speed increase?

I'm working on a project that has thousands of .cpp files plus thousands more .h and .hpp and the build takes 28min running from an SSD.
We inherited this project from a different company just weeks ago but perusing the makefiles, they explicitly disabled parallel builds via the .NOPARALLEL phony target; we're trying to find out if they have a good reason.
Worst case, the only way to speed this up is to use a RAM drive.
So I followed the instructions from Tekrevue and installed Imdisk and then ran benchmarks using CrystalDiskMark:
SSD
RAM Drive
I also ran dd using Cygwin and there's a significant speedup (at least 3x) on the RAM drive compared to my SSD.
However, my build time changes not one minute!
So then I thought: maybe my proprietary compiler calls some Windows API and causes a huge slowdown so I built fftw from source on Cygwin.
What I expected is that my processor usage would increase to some max and stay there for the duration of the build. Instead, my usage was very spiky: one for each file compiled. I understand even Cygwin still has to interact with windows so the fact that I still got spiky proc usage makes me assume that it's not my compiler that's the issue.
Ok. New theory: invoking compiler for each source-file has some huge overhead in Windows so, I copy-pasted from my build-log and passed 45 files to my compiler and compared it to invoking the compiler 45 times separately. Invoking ONCE was faster but only by 4 secs total for the 45 files.
And I saw the same "spiky" processor usage as when invoking compiler once for each file.
Why can't I get the compiler to run faster even when running from RAM drive? What's the overhead?
UPDATE #1
Commenters have been saying, I think, that the RAM drive thing is kind of unnecessary bc windows will cache the input and output files in RAM anyway.
Plus, maybe the RAM drive implementation (ie drivers) is sub-optimal.
So, I'm not using the RAM drive anymore.
Also, people have said that I should run the 45-file build multiple times so as to remove the overhead for caching: I ran it 4 times and each time it was 52secs.
CPU usage (taken 5 secs before compilation ended)
Virtual memory usage
When the compiler spits out stuff to disk, it's actually cached in RAM first, right?
Well then this screenshot indicates that IO is not an issue or rather, it's as fast as my RAM.
Question:
So since everything is in RAM, why isn't the CPU % higher more of the time?
Is there anything I can do to make single- threaded/job build go faster?
(Remember this is single-threaded build for now)
UPDATE 2
It was suggested below that I should set the affinity, of my compile-45-files invocation, to 1 so that windows won't bounce around the invocation to multiple cores.
The result:
100% single-core usage! for the same 52secs
So it wasn't the hard drive, RAM or the cache but CPU that's the bottleneck.
**THANK YOU ALL! ** for your help
========================================================================
My machine: Intel i7-4710MQ # 2.5GHz, 16GB RAM
I don't see why you are blaming so much the operating system, besides sequential, dumb IO (to load sources/save intermediate output - which should be ruled out by seeing that an SSD and a ramdisk perform the same) and process starting (ruled out by compiling a single giant file) there's very little interaction between the compiler and the operating system.
Now, once you ruled out "disk" and processor, I expect the bottleneck to be the memory speed - not for the RAM-disk IO part (which probably was already mostly saturated by the SSD), but for the compilation process itself.
That's actually quite a common problem, at this moment of time processors are usually faster than memory, which is often the bottleneck (that's the reason why currently it's critical to write cache-friendly code). The processor is probably wasting some significant time waiting for out of cache data to be fetched from main memory.
Anyway, this is all speculation. If you want a reliable answer, as usual you have to profile. Grab some sampling profiler from a list like this and go see where the compiler is wasting time. Personally, I expect to see a healthy dose of cache misses (or even page faults if you burned too much RAM for the ramdisk), but anything can be.
Reading your source code from the drive is a very, very small part of the overhead of compiling software. Your CPU speed is far more relevant, as parsing and generating binaries are the slowest part of the process.
**Update
Your graphs show a very busy CPU, I am not sure what you expect to see. Unless the build is multithreaded AND your kernel stops scheduling other, less intensive threads, this is certainly the graph of a busy processor.
Your trace is showing 23% CPU usage. Your CPU has 4 actual cores (with hyperthreading to make it look like 8). So, you're using exactly one core to its absolute maximum (plus or minus 2%, which is probably better accuracy than you can really expect).
The obvious conclusion from this would be that your build process is CPU bound, so improving your disk speed is unlikely to make much difference.
If you want substantially faster builds, you need to either figure out what's wrong with your current makefiles or else write entirely new ones without the problems, so you can support both partial and parallel builds.
That can gain you a lot. Essentially anything else you do (speeding up disks, overclocking the CPU, etc.) is going to give minor gains at best (maybe 20% if you're really lucky, where a proper build environment will probably give at least a 20:1 improvement for most typical builds).

in a worst case how much QPI latency can slow-down arbitrary application?

I'm developing low-latency HFT trading application.
I'm using single-CPU machine. Because it's much easier to configure and maintain, (no need to tune NUMA). Also, obviously, assuming we have enough resources, it should be definitely not slower than dual-CPU setup, and likely it will be a little bit faster, cause no QPI/NUMA latency.
HFT requires a lot of resources and now I realize I want to have much more cores. Also colocating two 1U single CPU machines is much more expensive than colocating one 1U dual-cpu machine, so even assuming I can "split" my program to two it's still make sense to use 1U dual-CPU machine.
So how fear QPI/NUMA latency is? If I move my application from single-CPU machine to dual-CPU machine how much slower it can be? Maximum I can afford several-microseconds delay, but not more. Can QPI/Numa introduce significant delay if not tuned correctly and how significant this delay would be?
Is it possible to write such application which runs much slower (more than several microseconds slower) on dual-CPU setup than single-CPU setup? I.e runs much slower on a faster computer? (of course assuming we have the same processors, memory, network card and everything else)
This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?
Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?
How does tasks get scheduled, how and when does the OS decide to move threads from one CPU socket to the next?
Does the code and data fit in cache?
Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".
As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situati8on. Benchmark your application, check performance counters and tweak based on that. [Given the price for a machine of the specs you describe in comments above, I'd expect the supplier would allow some sort of test, demo, "try before you buy", etc].
Assuming you have a worst case scenario, a memory access will be straddling two cache-lines (unaligned access of a 8-byte value, for example), which is split between your worst placed CPUs, and the MMU needs reloading, each of those page-table entries also being in the worst possible CPUs, and since the memory for that pair of memory locations is in different locations, needing new TLB entries for each of the two 4-byte reads to load your 64-bit value. (Each TLB entry is a separate location).
This means 2 x 4 x n, where n is something like 50-100 ns. So one memory access could, at least in theory take 1600 ns. So 1.6 microseconds. It's unlikely that you will get MUCH worse than this for a single operation. The overhead is a lot less than for example swapping to disk, which can add milliseconds to your execution time.
It is not very hard to write code that updates the same cache-line on multiple CPUs and thus causing dramatic reduction in performance - I remember a long time back when I first had an Athlon SMP system running a simple benchmark, where the author did this for a Dhrystone benchmark
int numberOfRuns[MAX_CPUS];
Now, numberOfRuns is the outer loop-counter, and updating that for each loop, on either CPU, would cause "false sharing" (so each time the counter was updated, the other CPU had to flush that cache-line).
Running this on 2 core SMP system gave 30% of the single CPU performance. So 3 times SLOWER than the one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so memory may be a little "off" on the exact details, but the essense of this story is still true - a badly written application can run slower on multiple cores compared to single core).
I'd expect at least that bad performance on a modern system where you have false sharing of commonly used variables.
In comparison, well-written code should run near N times faster, if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded, calculator for weird numbers, which gives near n-times performance gain both on my single-socket system at home and my two-socket system at work.
$ time ./weird -t 1 -e 100000
real 0m22.641s
user 0m22.660s
sys 0m0.003s
$ time ./weird -t 6 -e 100000
real 0m5.096s
user 0m25.333s
sys 0m0.005s
So about 11% overhead. That is sharing one variable [current number] which is atomically updated between threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.

Make a program run slowly

Is there any way to run a C++ program slower by changing any OS parameters in Linux? In this way I would like to simulate what will happen if that particular program happens to run on a real slower machine.
In other words, a faster machine should behave as a slower machine to that particular program.
Lower the priority using nice (and/or renice). You can also do it programmatically using nice() system call. This will not slow down the execution speed per se, but will make Linux scheduler allocate less (and possibly shorter) execution time frames, preempt more often, etc. See Process Scheduling (Chapter 10) of Understanding the Linux Kernel for more details on scheduling.
You may want to increase the timer interrupt frequency to put more load on the kernel, which will in turn slow everything down. This requires a kernel rebuild.
You can use CPU Frequency Scaling mechanism (requires kernel module) and control (slow down, speed up) the CPU using the cpufreq-set command.
Another possibility is to call sched_yield(), which will yield quantum to other processes, in performance critical parts of your program (requires code change).
You can hook common functions like malloc(), free(), clock_gettime() etc. using LD_PRELOAD, and do some silly stuff like burn a few million CPU cycles with rep; hop;, insert memory barriers etc. This will slow down the program for sure. (See this answer for an example of how to do some of this stuff).
As #Bill mentioned, you can always run Linux in a virtualization software which allows you to limit the amount of allocated CPU resources, memory, etc.
If you really want your program to be slow, run it under Valgrind (may also help you find some problems in your application like memory leaks, bad memory references, etc).
Some slowness can be achieved by recompiling your binary with disabled optimizations (i.e. -O0 and enable assertions (i.e. -DDEBUG).
You can always buy an old PC or a cheap netbook (like One Laptop Per Child, and don't forget to donate it to a child once you are done testing) with a slow CPU and run your program.
Hope it helps.
QEMU is a CPU emulator for Linux. Debian has packages for it (I imagine most distros will). You can run a program in an emulator and most of them should support slowing things down. For instance, Miroslav Novak has patches to slow down QEMU.
Alternatively, you could cross compile to another CPU-linux (arm-none-gnueabi-linux, etc) and then have QEMU translate that code to run.
The nice suggestion is simple and may work if you combine it with another process which will consume cpu.
nice -19 test &
while [ 1 ] ; do sha1sum /boot/vmlinuz*; done;
You did not say if you need graphics, file and/or network I/O? Do you know something about the class of error you are looking for? Is it a race condition, or does the code just perform poorly at a customer site?
Edit: You can also use signals like STOP and CONT to start and stop your program. A debugger can also do this. The issue is that the code runs a full speed and then gets stopped. Most solutions with the Linux scheduler will have this issue. There was some sort of thread analyzer from Intel afair. I see Vtune Release Notes. This is Vtune, but I was pretty sure there is another tool to analyze thread races. See: Intel Thread Checker, which can check for some thread race conditions. But we don't know if the app is multi-threaded?
Use cpulimit:
Cpulimit is a tool which limits the CPU usage of a process (expressed in percentage, not in CPU time). It is useful to control batch jobs, when you don't want them to eat too many CPU cycles. The goal is prevent a process from running for more than a specified time ratio. It does not change the nice value or other scheduling priority settings, but the real CPU usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly.
The control of the used cpu amount is done sending SIGSTOP and SIGCONT POSIX signals to processes.
All the children processes and threads of the specified process will share the same percent of CPU.
It's in the Ubuntu repos. Just
apt-get install cpulimit
Here're some examples on how to use it on an already-running program:
Limit the process 'bigloop' by executable name to 40% CPU:
cpulimit --exe bigloop --limit 40
cpulimit --exe /usr/local/bin/bigloop --limit 40
Limit a process by PID to 55% CPU:
cpulimit --pid 2960 --limit 55
Get an old computer
VPS hosting packages tend to run slowly, have lots of interruptions, and wildly varying latencies. The cheaper you go the worse the hardware will be. Unlike truly old hardware, there is a good chance they will contain instruction sets (SSE4) that are not usually found on old hardware. Neverthless, if you want a system that walks slowly and shutters often, a cheap VPS host will be the quickest start.
If you just want to simulate your program to analyze its behavior on really slow machine, you can try making your whole program to run as a thread of some other main program.
In this manner you can prioritize same code with different priorities in few threads at once and collect data of your analysis. I have used this in game development for frame-processing analysis.
Use sleep or wait inside of your code. Its not the brightest way to do but acceptable in all kind of computer with different speeds.
The simplest possible way to do it would be to wrap your main runable code in a while loop with a sleep at the end of it.
For example:
void main()
{
while 1
{
// Logic
// ...
usleep(microseconds_to_sleep)
}
}
As people will mention, this isn't the most accurate way, since your logic code will still run at normal speed but with delays between runs. Also, it assumes that your logic code is something that runs in a loop.
But it is both simple and configurable.
You can increase load on CPUs via launching tools like stress or stress-ng

How to prepare a constant benchmark environment

When I am doing graphics benchmark performance test (C++), I find the application is sometimes a little faster or slower. And this is related with current operating system status/caches/memory usage, and graphics hardware status.
I am using Win7. I am wondering if there is some guideline to tell me how to get a stable/constant environment for benchmark performance testing?
There are many ways to do that - what I tend to do for my testing, is using WAIK (Windows Automated Installation Kit, available free of charge from Microsoft), to deploy a minimal windows 7 system on a separate workstation.
Then, the following configuration items need to be considered/changed (try not to deviate too much from a typical user machine, thou, otherwise your benchmark would not be constructive):
Set Paging File to static 2x RAM
Disable Automatic Updates
Disable Drive Indexing
These represent a reasonably optimal environment for testing, that is still attainable by enthusiasts, and thus can be representative of a Power-User (even if I use Automatic Updates and Drive Indexing, I schedule them both for when I'm away/sleeping)
As for caches and memory usages - at least in Win7 Professional, you can script remote startup - so for instance, I would have a script run my benchmark overnight (for large regression tests), restarting the OS after each run. Or I would run the same benchmark 5-10 times without rebooting, to see if cache usage changes.
Finally, there are bootloader switches to control the number of processors and the amount of available RAM - my test machine is an AMD Phenom X6 with 16GB of RAM, but we need to test how performance changes with the number of cores (some users would have single-core systems, and some would have multi-core systems), and with the amount of RAM (from 1-16GB).
This is usually done prior to a checkpoint release, to see if recommended or minimal recommendation need to be adjusted due to both extra features and additional optimization that happened since.

How to avoid HDD thrashing

I am developing a large program which uses a lot of memory. The program is quite experimental and I add and remove big chunks of code all the time. Sometimes I will add a routine that is rather too memory hungry and the HDD drive will start thrashing and the program (and the whole system) will slow to a snails pace. It can easily take 5 mins to shut it down!
What I would like is a mechanism for avoiding this scenario. Either a run time procedure or even something to be done before running the program, which can say something like "If you run this program there is a risk of HDD thrashing - aborting now to avoid slowing to a snails pace".
Any ideas?
EDIT: Forgot to mention, my program uses multiple threads.
You could consider using SetProcessWorkingSetSize . This would be useful in debugging, because your app will crash with a fatal exception when it runs out of memory instead of dragging your machine into a thrashing situation.
http://msdn.microsoft.com/en-us/library/ms686234%28VS.85%29.aspx
Similar SO question
Set Windows process (or user) memory limit
Windows XP is terrible when there are multiple threads or processes accessing the disk at the same time. This is effectively what you experience when your application begins to swap, as the OS is writing out some pages while reading in others. Windows XP (and Server 2003 for that matter) is utterly trash for this. This is a real shame, as it means that swapping is almost synonymous with thrashing on these systems.
Your options:
Microsoft fixed this problem in Vista and Server 2008. So stop using a 9 year old OS. :)
Use unbuffered I/O to read/write data to a file, and implement your own paging inside your application. Implementing your own "swap" like this enables you to avoid thrashing.
See here many more details of this problem: How to obtain good concurrent read performance from disk
I'm not familiar with Windows programming, but under Unix you can limit the amount of memory that a program can use with setrlimit(). Maybe there is something similar. The goal is to get the program to abort once it uses to much memory, rather than thrashing. The limit would be a bit less than the total physical memory on the machine. I would guess somewhere between 75% and 90%, but some experimentation would be necessary to find the optimal setting.
Chances are your program could use some memory management. While there are a few programs that do need to hold everything in memory at once, odds are good that with a little bit of foresight you might be able to rework your program to reuse or discard a lot of the memory you need.
Your program will run much faster too. If you are using that much memory, then basically all of your built-in first and second level caches are likely overflowing, meaning the CPU is mostly waiting on memory loads instead of processing your code's instructions.
I'd rather determine reasonable minimum requirements for the computer your program is supposed to run on, and during installation either warn the user if there's not enough memory available, or refuse to install.
Telling him each time he's starting the program is nonsensical.