I'm working on a project that has thousands of .cpp files plus thousands more .h and .hpp and the build takes 28min running from an SSD.
We inherited this project from a different company just weeks ago but perusing the makefiles, they explicitly disabled parallel builds via the .NOPARALLEL phony target; we're trying to find out if they have a good reason.
Worst case, the only way to speed this up is to use a RAM drive.
So I followed the instructions from Tekrevue and installed Imdisk and then ran benchmarks using CrystalDiskMark:
SSD
RAM Drive
I also ran dd using Cygwin and there's a significant speedup (at least 3x) on the RAM drive compared to my SSD.
However, my build time changes not one minute!
So then I thought: maybe my proprietary compiler calls some Windows API and causes a huge slowdown so I built fftw from source on Cygwin.
What I expected is that my processor usage would increase to some max and stay there for the duration of the build. Instead, my usage was very spiky: one for each file compiled. I understand even Cygwin still has to interact with windows so the fact that I still got spiky proc usage makes me assume that it's not my compiler that's the issue.
Ok. New theory: invoking compiler for each source-file has some huge overhead in Windows so, I copy-pasted from my build-log and passed 45 files to my compiler and compared it to invoking the compiler 45 times separately. Invoking ONCE was faster but only by 4 secs total for the 45 files.
And I saw the same "spiky" processor usage as when invoking compiler once for each file.
Why can't I get the compiler to run faster even when running from RAM drive? What's the overhead?
UPDATE #1
Commenters have been saying, I think, that the RAM drive thing is kind of unnecessary bc windows will cache the input and output files in RAM anyway.
Plus, maybe the RAM drive implementation (ie drivers) is sub-optimal.
So, I'm not using the RAM drive anymore.
Also, people have said that I should run the 45-file build multiple times so as to remove the overhead for caching: I ran it 4 times and each time it was 52secs.
CPU usage (taken 5 secs before compilation ended)
Virtual memory usage
When the compiler spits out stuff to disk, it's actually cached in RAM first, right?
Well then this screenshot indicates that IO is not an issue or rather, it's as fast as my RAM.
Question:
So since everything is in RAM, why isn't the CPU % higher more of the time?
Is there anything I can do to make single- threaded/job build go faster?
(Remember this is single-threaded build for now)
UPDATE 2
It was suggested below that I should set the affinity, of my compile-45-files invocation, to 1 so that windows won't bounce around the invocation to multiple cores.
The result:
100% single-core usage! for the same 52secs
So it wasn't the hard drive, RAM or the cache but CPU that's the bottleneck.
**THANK YOU ALL! ** for your help
========================================================================
My machine: Intel i7-4710MQ # 2.5GHz, 16GB RAM
I don't see why you are blaming so much the operating system, besides sequential, dumb IO (to load sources/save intermediate output - which should be ruled out by seeing that an SSD and a ramdisk perform the same) and process starting (ruled out by compiling a single giant file) there's very little interaction between the compiler and the operating system.
Now, once you ruled out "disk" and processor, I expect the bottleneck to be the memory speed - not for the RAM-disk IO part (which probably was already mostly saturated by the SSD), but for the compilation process itself.
That's actually quite a common problem, at this moment of time processors are usually faster than memory, which is often the bottleneck (that's the reason why currently it's critical to write cache-friendly code). The processor is probably wasting some significant time waiting for out of cache data to be fetched from main memory.
Anyway, this is all speculation. If you want a reliable answer, as usual you have to profile. Grab some sampling profiler from a list like this and go see where the compiler is wasting time. Personally, I expect to see a healthy dose of cache misses (or even page faults if you burned too much RAM for the ramdisk), but anything can be.
Reading your source code from the drive is a very, very small part of the overhead of compiling software. Your CPU speed is far more relevant, as parsing and generating binaries are the slowest part of the process.
**Update
Your graphs show a very busy CPU, I am not sure what you expect to see. Unless the build is multithreaded AND your kernel stops scheduling other, less intensive threads, this is certainly the graph of a busy processor.
Your trace is showing 23% CPU usage. Your CPU has 4 actual cores (with hyperthreading to make it look like 8). So, you're using exactly one core to its absolute maximum (plus or minus 2%, which is probably better accuracy than you can really expect).
The obvious conclusion from this would be that your build process is CPU bound, so improving your disk speed is unlikely to make much difference.
If you want substantially faster builds, you need to either figure out what's wrong with your current makefiles or else write entirely new ones without the problems, so you can support both partial and parallel builds.
That can gain you a lot. Essentially anything else you do (speeding up disks, overclocking the CPU, etc.) is going to give minor gains at best (maybe 20% if you're really lucky, where a proper build environment will probably give at least a 20:1 improvement for most typical builds).
Related
I noticed earlier that when my VS is building my big c++ solution, my CPU usage was less than 25%. Wondering if I can set VS to always use 100% CPU, I did some research:
Found two options that can be configured for this purpose:
Maximum Number Of Parallel Project Builds
Maximum Concurrent C++ Compilations
What is the difference?
And to achieve my goal, bonus question is how can I set VS to use more CPU when it builds?
It can often be insightful to see all the files considered in a build. It's not uncommon to access 10.000 files, especially when using bigger libraries.
The fastest way to access these files would be if they're in RAM already, i.e. the OS file cache. Otherwise, an SSD is a reasonable alternative. But if they have to come from a mechanical HDD, the CPU will spend a large amount of time just sleeping while it waits for files to be read.
Hence, the way to improve CPU utilization is to make sure it's not waiting for I/O. Hardware is much cheaper than C++ programmers; get a fast SSD and sufficient RAM.
I'm developing low-latency HFT trading application.
I'm using single-CPU machine. Because it's much easier to configure and maintain, (no need to tune NUMA). Also, obviously, assuming we have enough resources, it should be definitely not slower than dual-CPU setup, and likely it will be a little bit faster, cause no QPI/NUMA latency.
HFT requires a lot of resources and now I realize I want to have much more cores. Also colocating two 1U single CPU machines is much more expensive than colocating one 1U dual-cpu machine, so even assuming I can "split" my program to two it's still make sense to use 1U dual-CPU machine.
So how fear QPI/NUMA latency is? If I move my application from single-CPU machine to dual-CPU machine how much slower it can be? Maximum I can afford several-microseconds delay, but not more. Can QPI/Numa introduce significant delay if not tuned correctly and how significant this delay would be?
Is it possible to write such application which runs much slower (more than several microseconds slower) on dual-CPU setup than single-CPU setup? I.e runs much slower on a faster computer? (of course assuming we have the same processors, memory, network card and everything else)
This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?
Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?
How does tasks get scheduled, how and when does the OS decide to move threads from one CPU socket to the next?
Does the code and data fit in cache?
Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".
As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situati8on. Benchmark your application, check performance counters and tweak based on that. [Given the price for a machine of the specs you describe in comments above, I'd expect the supplier would allow some sort of test, demo, "try before you buy", etc].
Assuming you have a worst case scenario, a memory access will be straddling two cache-lines (unaligned access of a 8-byte value, for example), which is split between your worst placed CPUs, and the MMU needs reloading, each of those page-table entries also being in the worst possible CPUs, and since the memory for that pair of memory locations is in different locations, needing new TLB entries for each of the two 4-byte reads to load your 64-bit value. (Each TLB entry is a separate location).
This means 2 x 4 x n, where n is something like 50-100 ns. So one memory access could, at least in theory take 1600 ns. So 1.6 microseconds. It's unlikely that you will get MUCH worse than this for a single operation. The overhead is a lot less than for example swapping to disk, which can add milliseconds to your execution time.
It is not very hard to write code that updates the same cache-line on multiple CPUs and thus causing dramatic reduction in performance - I remember a long time back when I first had an Athlon SMP system running a simple benchmark, where the author did this for a Dhrystone benchmark
int numberOfRuns[MAX_CPUS];
Now, numberOfRuns is the outer loop-counter, and updating that for each loop, on either CPU, would cause "false sharing" (so each time the counter was updated, the other CPU had to flush that cache-line).
Running this on 2 core SMP system gave 30% of the single CPU performance. So 3 times SLOWER than the one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so memory may be a little "off" on the exact details, but the essense of this story is still true - a badly written application can run slower on multiple cores compared to single core).
I'd expect at least that bad performance on a modern system where you have false sharing of commonly used variables.
In comparison, well-written code should run near N times faster, if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded, calculator for weird numbers, which gives near n-times performance gain both on my single-socket system at home and my two-socket system at work.
$ time ./weird -t 1 -e 100000
real 0m22.641s
user 0m22.660s
sys 0m0.003s
$ time ./weird -t 6 -e 100000
real 0m5.096s
user 0m25.333s
sys 0m0.005s
So about 11% overhead. That is sharing one variable [current number] which is atomically updated between threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.
I'm trying to find out whether I could use an Intel Xeon Phi coprocessor to "parallelize" the following problem:
Say I have 2000 files that need to be processed by a single-threaded executable. For each file, the executable reads it, does its thing and outputs it to a correspoinding output file, then exits.
For instance:
FILES=/path/to/*
for f in $FILES
do
# take action on each file
./executable $f outFileCorrespondingTo_f
done
The tools are not coded for multi-threaded execution, or looping through the files, nor do we wish to change anything in their code for now. They're written in C with some external libraries.
My questions are:
Could this kind of "script-looping" be run on the Xeon Phi's native OS in such a way that it parallelizes the calls to the executable, so they run concurrently on all of its cores? Is it "general-purpose" enough for that?
The files themselves are rather small, so its 8GB memory would be more than enough for storing the data at runtime, but not for keeping all of the output on the device, so I would need to output on the host. So my second quetion is: is this kind of memory exchange possible "externally"?
i.e. not coded into the tool, but managed by the host OS and the device, for every execution of the executable.
If this is possible, could it provide a performance boost in any way, or would the memory and thread allocation bottlenecks be too intensive? Basically each execution takes a few seconds, depending on the length of the input file, but I'm pretty confident this is a few orders of magnitude longer than how much it would take to transfer the file.
Xeon phi co-processors run a very feature-full version of the Linux operating system, so most of what you are used to on a Linux box is likely to work on Xeon Phi as well.
Now, for your specific issue, I guess that GNU Parallel should just permit you to do what you want in a breath. Simply, you'll have to have your file system mounted on the card so that you can access the files directly, but this is just standard stuff for a Xeon Phi node. And be aware that this will generate some traffic on the PCIe link between the host and the co-processor for the file transfers.
Regarding performance, this is hard to tell: the lower single-threaded performance of Xeon Phi cores along with the transfer times are definitely suggesting a big hit in this domain, but the level of parallelism you can extract from the device might very well overcome this, depending on how compute intensive your workload is. Best answer is for you to give it a try...
This is an addition to the answer given by Gilles.
Yes, the Xeon Phi should be able to do what you want at a basic operational level.
Even so, I think it is the wrong platform for your purpose for a few reasons.
Each core on the Xeon Phi is a Pentium core. Though it is enhanced (4 threads/core, 512 bit vector engine, etc), it is still a Pentium. That means it runs scalar code as a Pentium. Your task sounds like a whole bunch of serial processes running in parallel. So each process will run as if it is running on a Pentium.
To achieve remarkable performance, you need code that parallelizes well (read that as OpenMP, light weight threads, and thread pooling) and also vectorizes (takes advantage of the 512-bit vector engine). Without both of those enhancements, you are running on a Pentium, abet a lot of Pentiums.
Moving data across the PCIe bus is slow. If you are transferring a lot of files, this can be even slower though you can reduce the contention a little by hiding latency (depending upon your application). If you are hitting the PCIe bus with 244 file read requests on start up, that's quite a lot of contention. Even in a steady state condition, it sounds like you'll be reading more than 20 files at any given time (and I suspect even more given that we are executing scalar code as a Pentium).
Now the KNL architecture might be more appropriate for your needs, but that isn't out yet.
If you still think the Xeon Phi might be appropriate for what you want to do, you can ask the Xeon Phi Intel forum experts. If your application is proprietary/sensitive, you can ask the Intel experts as a private message.
I want to try to speed up my compile-time of our C++ projects. They have about 3M lines of code.
Of course, I don't need to always compile every project, but sometimes there are lot of source files modified by others, and I need to recompile all of them (for example, when someone updates an ASN.1 source file).
I've measured that compiling a mid-project (that does not involves all the source files) takes about three minutes. I know that's not too much, but sometimes it's really boring waiting for a compile..
I've tried to move the source code to an SSD (an old OCZ Vertex 3 60 GB) that, benchmarked, it's from 5 to 60 times faster than the HDD (especially in random reading/writing). Anyway, the compile-time is almost the same (maybe 2-3 seconds faster, but it should be a chance).
Maybe moving the Visual Studio bin to SSD would grant additional increment in performance?
Just to complete the question: I've a W3520 Xeon #2.67 GHz and 12 GB of DDR3 ECC.
This all greatly depends on your build environment and other setup. For example, on my main compile server, I have 96 GiB of RAM and 16 cores. The HDD is rather slow, but that doesn't really matter as about everything is cached in RAM.
On my desktop (where I also compile sometimes) I only have 8 Gib of RAM, and six cores. Doing the same parallel build there could be greatly sped up, because six compilers running in parallel eat up enough memory for the SSD speed difference being very noticeable.
There are many things that influence the build times, including the ratio of CPU to I/O "boundness". In my experience (GCC on Linux) they include:
Complexity of code. Lots of metatemplates make it use more CPU time, more C-like code might make the I/O of generated objects (more) dominant
Compiler settings for temporary files, like -pipe for GCC.
Optimization being used. Usually, the more optmization, the more the CPU work dominates.
Parallel builds. Compiling a single file at a time will likely never produce enough I/O to get today's slowest harddisk to any limit. Compiling with eight cores (or more) at once however might.
OS/filesystem being used. It seems that some filesystems in the past have choked on the access pattern for many files built in parallel, essentially putting the I/O bottleneck into the filesystem code, rather than the underlying hardware.
Available RAM for buffering. The more aggressively an OS can buffer your I/O, the less important the HDD speed gets. This is why sometimes a make -j6 can be a slower than a make -j4 despite having enough idle cores.
To make it short: It depends on enough things to make any "yes, it will help you" or "no, it will help you not" pure speculation, so if you have the possibility to try it out, do it. But don't spend too much time on it, for every hour you try to cut your compile times into half, try to estimate how often you (or your coworkers if you have any) could have rebuilt the project, and how that relates to the possible time saved.
C++ compilation/linking is limited by processing speed, not HDD I/O. That's why you're not seeing any increase in compilation speed.
(Moving the compiler/linker binaries to the SSD will do nothing. When you compile a big project, the compiler/linker and the necessary library are read into memory once and stay there.)
I have seen some minor speedups from moving the working directory to an SSD or ramdisk when compiling C projects (which is a lot less time consuming than C++ projects that make heavy use of templates etc), but not enough to make it worth it.
I found that compiling a project of around 1 million lines of C++ sped up by about a factor of two when the code was on an SSD (system with an eight-core Core i7, 12 GB RAM). Actually, the best possible performance we got was with one SSD for the system and a second one for the source -- it wasn't that the build was much faster, but the OS was much more responsive while a big build was underway.
The other thing that made a huge difference was enabling parallel building. Note that there are two separate options that both need to be enabled:
Menu Tools → Options → Projects and Solutions → maximum number of parallel project builds
Project properties → C++/General → Multi-processor compilation
The multiprocessor compilation is incompatible with a couple of other flags (including minimal rebuild, I think) so check the output window for warnings. I found that with the MP compilation flag set all cores were hitting close to 100% load, so you can at least see the CPU is being used aggressively.
One point not mentioned is that when using ccache and a highly parallel build, you'll see benefits to using an SSD.
I did replace my hard disk drive with an SSD hoping that it will reduce the compilation time of my C++ project. Simply replacing the hard disk drive with an SSD did not solve the problem and compilation time with both were almost the same.
However, after initial failures, I got success in speeding up the compilation by approximately six times.
The following steps were done to increase the compilation speed.
Turned off hibernation: "powercfg -h off" in command prompt
Turned off drive indexing on C drive
Shrunk page file to 800 min/1024 max (it was initially set to system managed size of 8092).
I have a larger C++ program which starts out by reading thousands of small text files into memory and storing data in stl containers. This takes about a minute. Periodically, a compilation will exhibit behavior where that initial part of the program will run at about 22-23% CPU load. Once that step is over, it goes back to ~100% CPU. It is more likely to happen with O2 flag turned on but not consistently. It happens even less often with the -p flag which makes it almost impossible to profile. I did capture it once but the gprof output wasn't helpful - everything runs with the same relative speed just at low cpu usage.
I am quite certain that this has nothing to do with multiple cores. I do have a quad-core cpu, and most of the code is multi-threaded, but I tested this issue running a single thread. Also, when I run the problematic step in multiple threads, each thread only runs at ~20% CPU.
I apologize ahead of time for the vagueness of the question but I have run out of ideas as to how to troubleshoot it further, so any hints might be helpful.
UPDATE: Just to make sure it's clear, the problematic part of the code does sometimes (~30-40% of the compilations) run at 100% CPU, so it's hard to buy the (otherwise reasonable) argument that I/O is the bottleneck
It's the buffer cache
My guess is that you are seeing the results of the Linux buffer cache in operation.
Those thousands of files will take a long time to read in from the disk and the CPU will mostly be waiting on rotational and seek latencies. Reported CPU-time used will be low expressed as a percentage. (But probably greater overall.)
But once read, those small files are completely cached in memory and accessing each file (in subsequent runs) becomes a purely CPU-bound activity.
Whether the blocks remain in the cache depends on intervening activity, such as recompiles. When new programs are run and other files are read, the programs and the files will be cached and old blocks will be dropped, and obviously, memory-intensive workload will also clear out the buffer cache.
Since you're reading a ton of small files, your program is blocked waiting on disk I/O for the majority of the time. Since the CPU isn't busy while it's waiting for the disk to ship the data to it, you're seeing a load of significantly less than 100%. Once that's over, now you're CPU-bound, and your program will eat all available CPU time.
The fact that it works faster sometimes is because (as Jarryd & DigitalRoss mention) once you've read them into system memory, they're in the OS's cache, so subsequent loads will be an order of magnitude faster, unless they've been evicted by other disk I/O. So if you run the program back-to-back, the 2nd run will probably be much faster. If you wait a while (and do other stuff in the meantime), there may have been enough other disk I/O to evict those files from the cache, in which case it will take a long time to read them again.
In addition to other answers mentionning the buffer cache, if you want to understand what is going on during a compilation, you could pass some of the below flags to GCC (i.e. to g++, probably as a CXXFLAGS setting in your Makefile):
-v to ask g++ to show the involved subprocesses (e.g. cc1plus for the proper C++ compiler)
-time to ask g++ to report the time of each sub-process
-ftime-report to ask g++ (actually cc1plus) to report the time of internal phases or passes inside the compiler.