I haven't really found a solid answer to this question other than "get more RAM". Is there a way to reduce the memory used by g++ during compilation? I am (for reasons) trying to compile WebKitGTK on a G4 Mac mini with 1 GB of RAM, which can't be upgraded. Current compilation options are
-Os -mabi=altivec -mcpu=native -mtune=native.
The machine has 1 GB of RAM and 1 GB of swap but just runs out of memory. While I could theoretically keep adding swap space, in practice that gets very slow, and I want to minimize it.
WebKitGTK is notoriously demanding of RAM (and time) during compilation. The WebKitGTK build instructions link to some suggestions, which might be useful. But the overall impression those pages give is that you need considerably more than 1 GB of RAM, unless you are prepared to let the build run for some time, possibly days.
Perhaps you have access to one or more other computers. In that case, you could consider setting up cross-compilation environments and maybe even installing distcc in order to make use of these additional resources.
Setting up a cross-compilation environment for an OS X target is a bit of a project, but once you've got it set up, distcc is pretty straightforward. And it won't take many compiles to pay back your investment of time through significantly reduced compile times.
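A minimal sketch of such a distcc setup, assuming a helper machine on the local network with a PowerPC cross-g++ already installed; the host address, slot count, compiler names and -j value are all illustrative:

```shell
# Hypothetical helper box at 192.168.1.50 offering 8 compile slots;
# localhost still handles preprocessing and linking.
export DISTCC_HOSTS="localhost 192.168.1.50/8"

# Prefix the compilers with distcc; keep the local -j modest, since
# the 1 GB machine still runs the preprocessor and the linker.
if command -v distcc >/dev/null 2>&1 && [ -f Makefile ]; then
    make -j4 CC="distcc powerpc-linux-gnu-gcc" \
             CXX="distcc powerpc-linux-gnu-g++"
fi
```

The slot suffix (`/8`) tells distcc how many jobs the helper can take at once.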
I'm working on a project that has thousands of .cpp files plus thousands more .h and .hpp and the build takes 28min running from an SSD.
We inherited this project from a different company just weeks ago, but perusing the makefiles, they explicitly disabled parallel builds via the .NOTPARALLEL phony target; we're trying to find out whether they had a good reason.
Worst case, the only way to speed this up is to use a RAM drive.
So I followed the instructions from TekRevue, installed ImDisk, and then ran benchmarks using CrystalDiskMark:
SSD
RAM Drive
I also ran dd using Cygwin and there's a significant speedup (at least 3x) on the RAM drive compared to my SSD.
However, my build time doesn't change by even a minute!
So then I thought: maybe my proprietary compiler calls some Windows API and causes a huge slowdown, so I built fftw from source under Cygwin.
What I expected was that my processor usage would rise to some maximum and stay there for the duration of the build. Instead, my usage was very spiky: one spike for each file compiled. I understand that even Cygwin still has to interact with Windows, so the fact that I still got spiky processor usage makes me assume it's not my compiler that's the issue.
OK. New theory: invoking the compiler for each source file has some huge overhead in Windows. So I copy-pasted from my build log and passed 45 files to my compiler in one go, then compared that to invoking the compiler 45 times separately. Invoking it ONCE was faster, but only by about 4 seconds total for the 45 files.
And I saw the same "spiky" processor usage as when invoking the compiler once for each file.
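The two invocation styles being compared can be sketched like this; `cc` and the file list are stand-ins, and the commands are echoed rather than executed:

```shell
FILES="a.c b.c c.c"   # stand-in for the 45 files from the build log

# Style 1: one compiler process per file (pays startup cost each time)
for f in $FILES; do
    echo "cc -c $f"
done

# Style 2: one compiler process for all files (startup cost paid once)
echo "cc -c $FILES"
```

The 4-second difference over 45 files puts the per-process overhead at well under 100 ms each, which is why eliminating it barely moves the total.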
Why can't I get the compiler to run faster even when running from RAM drive? What's the overhead?
UPDATE #1
Commenters have been saying, I think, that the RAM drive is somewhat unnecessary because Windows will cache the input and output files in RAM anyway.
Plus, maybe the RAM drive implementation (i.e. the drivers) is sub-optimal.
So, I'm not using the RAM drive anymore.
Also, people said I should run the 45-file build multiple times to remove caching overhead: I ran it 4 times, and each time it took 52 seconds.
CPU usage (taken 5 secs before compilation ended)
Virtual memory usage
When the compiler writes output to disk, it's actually cached in RAM first, right?
Well then, this screenshot indicates that I/O is not an issue, or rather, that it's as fast as my RAM.
Question:
So since everything is in RAM, why isn't the CPU % higher more of the time?
Is there anything I can do to make a single-threaded/single-job build go faster?
(Remember, this is a single-threaded build for now.)
UPDATE 2
It was suggested below that I set the affinity of my compile-45-files invocation to 1, so that Windows won't bounce the process around between cores.
The result:
100% single-core usage, for the same 52 seconds!
So it wasn't the hard drive, the RAM, or the cache, but the CPU that was the bottleneck.
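For reference, the pinning trick looks like this. The Windows command is presumably the one used above; taskset is the rough Linux equivalent (availability of either is platform-dependent):

```shell
# Windows (cmd.exe): restrict the build to core 0
#   start /affinity 1 make.exe
# Linux: taskset with an explicit CPU list
if command -v taskset >/dev/null 2>&1; then
    taskset -c 0 echo "pinned to core 0"   # replace echo with the build command
fi
```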
**Thank you all** for your help!
========================================================================
My machine: Intel i7-4710MQ @ 2.5 GHz, 16 GB RAM
I don't see why you're blaming the operating system so much. Apart from sequential, dumb I/O to load sources and save intermediate output (which should be ruled out by seeing that an SSD and a RAM disk perform the same) and process startup (ruled out by compiling many files in a single invocation), there's very little interaction between the compiler and the operating system.
Now, once you've ruled out "disk" and process startup, I expect the bottleneck to be memory speed: not for the RAM-disk I/O part (which was probably already mostly saturated by the SSD), but for the compilation process itself.
That's actually quite a common problem: these days, processors are usually faster than memory, which is often the bottleneck (and is why it's currently critical to write cache-friendly code). The processor is probably wasting significant time waiting for out-of-cache data to be fetched from main memory.
Anyway, this is all speculation. If you want a reliable answer, as usual, you have to profile. Grab a sampling profiler from a list like this and see where the compiler is spending its time. Personally, I'd expect to see a healthy dose of cache misses (or even page faults if you burned too much RAM on the RAM disk), but it could be anything.
Reading your source code from the drive is a very, very small part of the overhead of compiling software. Your CPU speed is far more relevant: parsing and generating binaries are the slowest parts of the process.
**Update**
Your graphs show a very busy CPU; I'm not sure what you expect to see. Unless the build is multithreaded AND your kernel stops scheduling other, less intensive threads, this is certainly the graph of a busy processor.
Your trace is showing 23% CPU usage. Your CPU has 4 actual cores (with hyperthreading to make it look like 8). So, you're using exactly one core to its absolute maximum (plus or minus 2%, which is probably better accuracy than you can really expect).
The obvious conclusion from this would be that your build process is CPU bound, so improving your disk speed is unlikely to make much difference.
If you want substantially faster builds, you need to either figure out what's wrong with your current makefiles or else write entirely new ones without the problems, so you can support both partial and parallel builds.
That can gain you a lot. Essentially anything else you do (speeding up disks, overclocking the CPU, etc.) will give minor gains at best (maybe 20% if you're really lucky, whereas a proper build environment will probably give at least a 20:1 improvement for most typical builds).
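Once the makefiles support it, the partial and parallel builds described above are one-liners (a sketch assuming GNU make and coreutils' nproc; file names illustrative):

```shell
if [ -f Makefile ]; then
    # Full parallel build: one job per core
    make -j"$(nproc)"

    # Partial rebuild: touch one file; only its dependents recompile
    touch src/some_file.cpp
    make -j"$(nproc)"
fi
```

The 20:1 figure comes mostly from the partial rebuild: recompiling one file instead of thousands.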
My colleague likes to use gcc with '-g -O0' for building production binaries, because debugging is easy if a core dump happens. He says there is no need to use compiler optimization or tweak the code, because he finds the process in production does not have a high CPU load, e.g. around 30%.
I asked him the reasoning behind that, and he told me: if the CPU load is not high, the bottleneck must not be our code's performance, but some I/O (disk/network). So using gcc -O2 would be of no use in improving latency or throughput. He also says that indicates we don't have much to improve in the code, because the CPU is not the bottleneck. Is that correct?
About CPU usage ~ optimisation
I would expect most optimisation problems in a program to correlate to higher-than-usual CPU load, because we say that a sub-optimal program does more than it theoretically needs to. But "usual" here is a complicated word. I don't think you can pick a hard value of system-wide CPU load percentage at which optimisation becomes useful.
If my program reallocates a char buffer in a loop, when it doesn't need to, my program might run ten times slower than it needs to, and my CPU usage may be ten times higher than it needs to be, and optimising the function may yield ten-fold increases in application performance … but the CPU usage may still only be 0.5% of the whole system capacity.
Even if we were to choose a CPU load threshold at which to begin profiling and optimising, on a general-purpose server I'd say that 30% is far too high. But it depends on the system, because if you're programming for an embedded device that only runs your program, and has been chosen and purchased because it has just enough power to run your program, then 30% could be relatively low in the grand scheme of things.
Further still, not all optimisation problems will indeed have anything to do with higher-than-usual CPU load. Perhaps you're just waiting in a sleep longer than you actually need to, causing message latency to increase but substantially reducing CPU usage.
tl;dr: Your colleague's view is simplistic, and probably doesn't match reality in any useful way.
About build optimisation levels
Relating to the real crux of your question, though, it's fairly unusual to deploy a release build with all compiler optimisations turned off. Compilers are designed to emit pretty naive code at -O0, and to do the sort of optimisations that are pretty much "standard" in 2016 at -O1 and -O2. You're generally expected to turn these on for production use, otherwise you're wasting a huge portion of a modern compiler's capability.
Many folks also tend not to use -g in a release build, so that the deployed binary is smaller and easier for your customers to handle. Dropping a 45 MB executable to 1 MB this way is no small saving.
Does this make debugging more difficult? Yes, it can. Generally, if a bug is located, you want to receive reproduction steps that you can then repeat in a debug-friendly version of your application and analyse the stack trace that comes out of that.
But if the bug cannot be reproduced on demand, or it can only be reproduced in a release build, then you may have a problem. It may therefore seem reasonable to keep basic optimisations on (-O1) but also keep debug symbols in (-g); the optimisations themselves shouldn't vastly hinder your ability to analyse the core dump provided by your customer, and the debug symbols will allow you to correlate the information to source code.
That being said, you can have your cake and eat it too:
Build your application with -O2 -g
Copy the resulting binary
Run strip on one of those copies to remove the debug symbols; the binaries will otherwise be identical
Store them both forever
Deploy the stripped version
When you have a core dump to analyse, debug it against your original, non-stripped version
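A runnable sketch of those steps, using a throwaway program so the size difference is visible; names and paths are illustrative:

```shell
tmp=$(mktemp -d)
cat > "$tmp/hello.cpp" <<'EOF'
#include <cstdio>
int main() { std::puts("hello"); return 0; }
EOF

if command -v g++ >/dev/null 2>&1; then
    g++ -O2 -g -o "$tmp/myapp" "$tmp/hello.cpp"  # optimized, with symbols
    cp "$tmp/myapp" "$tmp/myapp.debug"           # store this copy forever
    strip "$tmp/myapp"                           # deploy this smaller copy
    ls -l "$tmp/myapp" "$tmp/myapp.debug"        # stripped one is smaller
    # Later, given a customer core dump:
    #   gdb myapp.debug core.12345
fi
```

Because both binaries come from the same compile, addresses in the customer's core dump line up exactly with the symbol-rich copy.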
You should also have sufficient logging in your application to be able to track down most bugs without needing any of this.
Under certain circumstances he could be correct, under others mostly incorrect (and under some, totally correct).
Assume the task runs for 1 s: the CPU is busy for 0.3 s and waits for something else for 0.7 s. If you optimized the code and got, say, a 100% improvement, the CPU would complete what took 0.3 s in 0.15 s, and the task would finish in 0.85 s instead of 1 s (given that the wait for something else takes the same time).
However, in a multicore situation, CPU load is sometimes defined as the fraction of total processing power being used. If one core runs at 100% and two are idling, the overall CPU load is 33%, so in such a scenario 30% CPU load may simply mean the program is only able to use one core. In that case, optimizing the code could improve performance drastically.
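The arithmetic above is just Amdahl's law applied to the 0.3 s CPU portion; a quick sketch (awk assumed available):

```shell
# Total 1.0 s, of which 0.3 s is CPU work; the CPU part is made 2x faster.
awk 'BEGIN {
    total = 1.0; cpu = 0.3; speedup = 2.0
    new_total = (total - cpu) + cpu / speedup
    printf "new total: %.2f s (was %.2f s)\n", new_total, total
}'
# prints: new total: 0.85 s (was 1.00 s)
```

Raising `speedup` shows the diminishing returns: even an infinitely fast CPU part only gets you down to the 0.7 s of waiting.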
Note that sometimes what is thought to be an optimization is actually a pessimization, which is why it's important to measure. I've seen a few "optimizations" that reduced performance. Sometimes optimizations also alter behaviour (in particular when you "improve" the source code), so you should make sure nothing breaks by having proper tests. After measuring performance, you should decide whether it's worth trading debuggability for speed.
A possible improvement might be to compile with gcc -Og -g using a recent GCC. The -Og optimization is debugger-friendly.
Also, you can compile with gcc -O1 -g; you get many (simple) optimizations, so performance is usually 90% of -O2 (with of course some exceptions, where even -O3 matters). And the core dump is usually debuggable.
And it really depends upon the kind of software and the required reliability and ease of debugging. Numerical code (HPC) is quite different from small database post-processing.
Finally, using -g3 instead of -g might help (e.g. gcc -Wall -O1 -g3).
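Collected in one place, the flag combinations suggested above (a fragment, not a complete build line):

```shell
# Debugger-friendly optimization (GCC 4.8 and later):
CXXFLAGS="-Og -g"
# Or light optimization with maximal debug info (-g3 also records macros):
CXXFLAGS="-Wall -O1 -g3"
echo "$CXXFLAGS"
```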
BTW, synchronization issues and deadlocks might be more likely to appear in optimized code than in non-optimized code.
It's really simple: CPU time is not free. We like to think that it is, but it's patently false. There are all sorts of magnification effects that make every cycle count in some scenarios.
Suppose that you develop an app that runs on a million mobile devices. Every second your code wastes adds up, across the fleet, to 1-2 years' worth of continuous use of a 4-core device. Even at 0% CPU utilization, wall-clock latency costs you backlight time, and that's not to be ignored either: the backlight uses about 30% of a device's power.
Suppose that you develop an app that runs in a data center. Every 10% of the core that you're using is what someone else won't be using. At the end of the day, you've only got so many cores on a server, and that server has power, cooling, maintenance and amortization costs. Every 1% of CPU usage has costs that are simple to determine, and they aren't zero!
On the other hand: developer time isn't free either, and every second of a developer's attention requires commensurate energy and resource inputs just to keep him or her alive, fed, well and happy. Yet in this case, all the developer needs to do is flip a compiler switch. I personally don't buy the "easier debugging" myth. Modern debugging information is expressive enough to capture register use, value liveness, code replication and such. Optimizations don't really get in the way as they did 15 years ago.
If your business has a single, underutilized server, then what the developer is doing might be OK, practically speaking. But all I see here really is an unwillingness to learn how to use the debugging tools or proper tools to begin with.
I have written some C++ code that I have to run on many low-spec computers, while my own PC is very high-spec. I am using Ubuntu 10.04, and I have set hard limits on some resources, i.e. on memory & virtual memory. Now my questions are:
1) How do I set a limit on the cache size and cache line size?
2) What other limits should I set to check that my code is OK?
I am using command:
ulimit -H -m 1000000
ulimit -H -v 500000
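One detail worth knowing: ulimit applies to the current shell and its children, so running the program under test in a subshell keeps your login shell usable. The values are in kilobytes and illustrative; note also that on many modern Linux kernels the -m (resident set size) limit is silently ignored, so -v is the one that actually bites:

```shell
(
    ulimit -v 500000       # cap virtual memory at roughly 488 MB
    ulimit -v              # confirm the new limit
    # ./your_program       # would run under the cap (name illustrative)
)
```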
You can't limit cache size, that's a (mostly transparent) hardware feature.
The good news is this shouldn't matter, since you can't run out of cache - it just spills and your program runs more slowly.
If your concern is avoiding spills, you could investigate valgrind --tool=cachegrind: it can simulate the likely behaviour of your target hardware's cache.
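A sketch of such a cachegrind run, with the cache geometry overridden to model a smaller target; the sizes, associativity, line length and ./your_program are all placeholders:

```shell
if command -v valgrind >/dev/null 2>&1 && [ -x ./your_program ]; then
    valgrind --tool=cachegrind \
        --I1=16384,2,32 --D1=16384,2,32 --LL=262144,8,64 \
        ./your_program
    # cg_annotate cachegrind.out.<pid> then shows misses per source line
fi
```

Each option takes size,associativity,line-size; pull the target machine's actual values from its datasheet or /sys/devices/system/cpu.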
To PROPERLY simulate running on low-end machines (although not with low cache limits), you can run the code in a virtual machine rather than on your machine's real hardware. This will show you what happens on a machine with little memory much better than limiting with ulimit will, because ulimit simply limits what YOUR application will get. So it shows that your application doesn't run out of memory under a particular set of tests, but it doesn't show how the application and the system behave together when there isn't a huge amount of memory in the first place.
A machine with a small amount of physical memory will behave quite differently when it comes to, for example, swapping behaviour and filesystem caching, just to mention a couple of things that change between "large memory, but application is limited" and "small memory in the first place".
I'm not sure whether Ubuntu comes with any flavour of virtual machine setup, but VirtualBox, for example, is pretty easy to configure and set up on any Linux/Windows machine, as long as you have a modern enough processor with hardware virtualization instructions.
As Useless not at all uselessly stated, cache memory will not "run out" or in any other way cause a failure. The program will run a little slower, but not massively so (roughly 10x for any given operation, and that is averaged over a large number of other instructions in most cases, unless you are working really hard to prove how important cache is, as with very large matrix multiplications).
One tip might also be to look around for some old hardware. There are usually computers for sale that are several years old, for next to nothing at a "computer recycling shop" or similar. Set such a system up, install your choice of OS, and see what happens.
I want to try to speed up the compile time of our C++ projects. They have about 3M lines of code.
Of course, I don't always need to compile every project, but sometimes a lot of source files are modified by others and I need to recompile all of them (for example, when someone updates an ASN.1 source file).
I've measured that compiling a mid-sized project (one that doesn't involve all the source files) takes about three minutes. I know that's not much, but sometimes it's really boring waiting for a compile.
I've tried moving the source code to an SSD (an old OCZ Vertex 3 60 GB) that, benchmarked, is 5 to 60 times faster than the HDD (especially in random reads/writes). Anyway, the compile time is almost the same (maybe 2-3 seconds faster, but that could be chance).
Maybe moving the Visual Studio binaries to the SSD would give an additional performance gain?
Just to complete the question: I have a Xeon W3520 @ 2.67 GHz and 12 GB of DDR3 ECC.
This all greatly depends on your build environment and other setup. For example, on my main compile server, I have 96 GiB of RAM and 16 cores. The HDD is rather slow, but that doesn't really matter, as just about everything is cached in RAM.
On my desktop (where I also compile sometimes) I have only 8 GiB of RAM and six cores. The same parallel build there could be sped up greatly, because six compilers running in parallel eat up enough memory that the SSD speed difference becomes very noticeable.
There are many things that influence the build times, including the ratio of CPU to I/O "boundness". In my experience (GCC on Linux) they include:
Complexity of the code. Lots of metatemplates make it use more CPU time; more C-like code might make the I/O of generated objects (more) dominant.
Compiler settings for temporary files, like -pipe for GCC.
Optimization being used. Usually, the more optimization, the more the CPU work dominates.
Parallel builds. Compiling a single file at a time will likely never produce enough I/O to push even today's slowest hard disk to its limit. Compiling with eight or more cores at once, however, might.
OS/filesystem being used. It seems that some filesystems in the past have choked on the access pattern for many files built in parallel, essentially putting the I/O bottleneck into the filesystem code, rather than the underlying hardware.
Available RAM for buffering. The more aggressively an OS can buffer your I/O, the less important HDD speed gets. This is why sometimes a make -j6 can be slower than a make -j4, despite there being enough idle cores.
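Because these factors interact, the honest way to settle them is to measure; a quick sketch, assuming a GNU make project in the current directory:

```shell
if [ -f Makefile ]; then
    for j in 1 4 6 8; do
        make -s clean
        echo "building with -j$j"
        time make -s -j"$j"
    done
fi
```

Keep whichever -j level wins; on memory-constrained machines, the optimum is often below the core count.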
In short: it depends on enough things that any "yes, it will help" or "no, it won't" is pure speculation, so if you have the chance to try it out, do so. But don't spend too much time on it: for every hour you spend trying to cut your compile time in half, estimate how often you (or your coworkers, if you have any) could have rebuilt the project, and how that relates to the time possibly saved.
C++ compilation/linking is limited by processing speed, not HDD I/O. That's why you're not seeing any increase in compilation speed.
(Moving the compiler/linker binaries to the SSD will do nothing. When you compile a big project, the compiler/linker and the necessary library are read into memory once and stay there.)
I have seen some minor speedups from moving the working directory to an SSD or RAM disk when compiling C projects (which is a lot less time-consuming than C++ projects that make heavy use of templates etc.), but not enough to make it worthwhile.
I found that compiling a project of around 1 million lines of C++ sped up by about a factor of two when the code was on an SSD (system with an eight-core Core i7, 12 GB RAM). Actually, the best possible performance we got was with one SSD for the system and a second one for the source -- it wasn't that the build was much faster, but the OS was much more responsive while a big build was underway.
The other thing that made a huge difference was enabling parallel building. Note that there are two separate options that both need to be enabled:
Menu Tools → Options → Projects and Solutions → maximum number of parallel project builds
Project properties → C++/General → Multi-processor compilation
Multi-processor compilation is incompatible with a couple of other flags (including minimal rebuild, I think), so check the output window for warnings. I found that with the /MP compilation flag set, all cores were hitting close to 100% load, so you can at least see the CPU is being used aggressively.
One point not mentioned: when using ccache together with a highly parallel build, you'll see benefits from using an SSD.
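A minimal setup of that kind might look like this; the cache location and job count are illustrative:

```shell
export CCACHE_DIR=/mnt/ssd/ccache       # put the object cache on the SSD
if command -v ccache >/dev/null 2>&1 && [ -f Makefile ]; then
    make -j8 CC="ccache gcc" CXX="ccache g++"
    ccache -s                           # show hit/miss statistics
fi
```

With many jobs hitting the cache at once, the cache directory sees exactly the small, random I/O that SSDs are much better at than spinning disks.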
I replaced my hard disk drive with an SSD, hoping it would reduce the compilation time of my C++ project. Simply swapping in the SSD did not solve the problem; compilation times with both were almost the same.
However, after these initial failures, I succeeded in speeding up the compilation by approximately six times.
The following steps were done to increase the compilation speed.
Turned off hibernation: "powercfg -h off" in command prompt
Turned off drive indexing on C drive
Shrank the page file to 800 min / 1024 max (it was initially set to a system-managed size of 8092).
Apparently the speed of the C++ linker in Visual Studio 2010 hasn't improved much (about 25% in our case). This means we're still stuck with link times between 30 seconds and two minutes. Surely there are linkers out there that perform better? Does anyone have experience with switching to another linker, or even a complete toolset, and seeing link times go down drastically?
Cheers,
Sebastiaan
You may well find a faster linker but, unless it's ten times as fast and I'm linking thirty times an hour, I think I'd prefer to use the tools that Microsoft has tested with.
I would rather have relatively slow link times than potentially unstable software.
And you kids are spoilt nowadays. In my day, we had to submit our 80-column cards to the computer centre and, if we were lucky, the operator would get it typed in by next Thursday and we could start debugging from the hardcopy output :-)
When we checked linker speed, we identified disk speed as the most limiting factor. The amount of file traffic is huge, especially because of debugging info (just check the size of the .pdb).
For us the solution was:
Install insane amounts of RAM, so that a lot of file traffic can be cached (go for 4 GB or even more if you are on a 64-bit OS). Note: you may need to change some system settings so that the system can dedicate more memory to the cache.
Use a very fast hard drive (connecting several of them in RAID may help even more).
We also experimented with an SSD, but the SSD we tried had very slow write performance, so the net effect was negative. That may have changed in the meantime, especially with the best of today's SSDs.
As a first step, I would suggest launching Process Explorer (or even Task Manager will do) and checking your CPU load and I/O traffic during the link phase, so that you can verify whether you are CPU-limited or I/O-limited.
There may be, but I imagine you'd be talking about improvements in the range of a few percentage points. You're unlikely to find anything orders of magnitude faster (which is, I assume, what you'd like).
There are ways to improve your link times, though. What options do you have turned on? Things like "Enable Incremental Linking" and "Enable Function-Level Linking" can have a dramatic effect on linking performance (obviously the first link will be a "full" link, but subsequent links can be made much quicker with these settings).
Wow, and I get nervous when my link time is above 10 seconds.
Use modern SSDs. I have two 60 GB OCZ Vertex 2 E disks in RAID 0, and I/O is not a problem anymore. SSDs are now good enough for daily use, even for heavy writes.
And get a few gigabytes of memory. I can't see any reason anymore to work with less than 8 GB of RAM.
Turn on incremental linking, and linking should take no more than a second.