How to optimize large data manipulation in parallel - c++

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using a AMD Phenom II X4 965 Black Edition, so with decent amount of different caches.
I've developed both ST and MT version of the functions to perform all the single operations and, not surprisingly, in the best case the MT version are 2x faster than the ST, even when using 4 cores.
Given I'm a fan of using 100% of available resources, I was pissed about the fact just 2x, I'd want 4x.
For this reason I've spent already quite a considerable amount of time with -pg and valgrind, using the cache simulator and callgraph. The program is working as expected and cores are sharing the input process data (i.e. operations to apply on data) and the cache misses are reported (as expected sic.) when the different threads load the data to be processed (millions of entities or rows if now you have an idea what I'm trying to do :-) ).
Eventually I've used different compilers, g++ and clang++, with -O3 both, and performance is identical.
My conclusion is that due to the large amount of data (GB of data) to process, given the fact the data has got to be loaded eventually in the CPU, this is real wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)

What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If compiling with GCC and if you want to help the processor cache prefetching, consider using with care and parsimony __builtin_prefetch (but using it too much or badly would decrease performance).

Related

Can bad vectorized code impact on scalability?

I have parallelized an already existing code for computer vision applications using OpenMP. I think that I well designed it because:
The workload is well-balanced
There is no synchronization/locking mechanism
I parallelized the outer most loops
All the cores are used for most of the time (there are no idle cores)
There is enough work for each thread
Now, the application doesn't scale when using many cores, e.g. it doesn't scale well after 15 cores.
The code uses external libraries (i.e. OpenCV and IPP) where the code is already optimized and vectorized, while I manually vectorized some portions of the code as best as I could. However, according to Intel Advisor, the code isn't well vectorized, but there is no much left to do: I already vectorized the code where I could and I can't improve the external libraries.
So my question is: is it possible that vectorization is the reason why the code doesn't scale well at some point? If so, why?
In line with comments from Adam Nevraumont, VTune Amplifier can do a lot to pinpoint memory bandwidth issues: https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis.
It may be useful to start at a higher level of analysis than that though, like looking at hot spots. If it turns out that most of your time is spent in OpenCV or similar like you're concerned about, finding that out early might save some time vs. digging into memory bottlenecks directly.

does building from a RAM drive truly yield speed increase?

I'm working on a project that has thousands of .cpp files plus thousands more .h and .hpp and the build takes 28min running from an SSD.
We inherited this project from a different company just weeks ago but perusing the makefiles, they explicitly disabled parallel builds via the .NOPARALLEL phony target; we're trying to find out if they have a good reason.
Worst case, the only way to speed this up is to use a RAM drive.
So I followed the instructions from Tekrevue and installed Imdisk and then ran benchmarks using CrystalDiskMark:
SSD
RAM Drive
I also ran dd using Cygwin and there's a significant speedup (at least 3x) on the RAM drive compared to my SSD.
However, my build time changes not one minute!
So then I thought: maybe my proprietary compiler calls some Windows API and causes a huge slowdown so I built fftw from source on Cygwin.
What I expected is that my processor usage would increase to some max and stay there for the duration of the build. Instead, my usage was very spiky: one for each file compiled. I understand even Cygwin still has to interact with windows so the fact that I still got spiky proc usage makes me assume that it's not my compiler that's the issue.
Ok. New theory: invoking compiler for each source-file has some huge overhead in Windows so, I copy-pasted from my build-log and passed 45 files to my compiler and compared it to invoking the compiler 45 times separately. Invoking ONCE was faster but only by 4 secs total for the 45 files.
And I saw the same "spiky" processor usage as when invoking compiler once for each file.
Why can't I get the compiler to run faster even when running from RAM drive? What's the overhead?
UPDATE #1
Commenters have been saying, I think, that the RAM drive thing is kind of unnecessary bc windows will cache the input and output files in RAM anyway.
Plus, maybe the RAM drive implementation (ie drivers) is sub-optimal.
So, I'm not using the RAM drive anymore.
Also, people have said that I should run the 45-file build multiple times so as to remove the overhead for caching: I ran it 4 times and each time it was 52secs.
CPU usage (taken 5 secs before compilation ended)
Virtual memory usage
When the compiler spits out stuff to disk, it's actually cached in RAM first, right?
Well then this screenshot indicates that IO is not an issue or rather, it's as fast as my RAM.
Question:
So since everything is in RAM, why isn't the CPU % higher more of the time?
Is there anything I can do to make single- threaded/job build go faster?
(Remember this is single-threaded build for now)
UPDATE 2
It was suggested below that I should set the affinity, of my compile-45-files invocation, to 1 so that windows won't bounce around the invocation to multiple cores.
The result:
100% single-core usage! for the same 52secs
So it wasn't the hard drive, RAM or the cache but CPU that's the bottleneck.
**THANK YOU ALL! ** for your help
========================================================================
My machine: Intel i7-4710MQ # 2.5GHz, 16GB RAM
I don't see why you are blaming so much the operating system, besides sequential, dumb IO (to load sources/save intermediate output - which should be ruled out by seeing that an SSD and a ramdisk perform the same) and process starting (ruled out by compiling a single giant file) there's very little interaction between the compiler and the operating system.
Now, once you ruled out "disk" and processor, I expect the bottleneck to be the memory speed - not for the RAM-disk IO part (which probably was already mostly saturated by the SSD), but for the compilation process itself.
That's actually quite a common problem, at this moment of time processors are usually faster than memory, which is often the bottleneck (that's the reason why currently it's critical to write cache-friendly code). The processor is probably wasting some significant time waiting for out of cache data to be fetched from main memory.
Anyway, this is all speculation. If you want a reliable answer, as usual you have to profile. Grab some sampling profiler from a list like this and go see where the compiler is wasting time. Personally, I expect to see a healthy dose of cache misses (or even page faults if you burned too much RAM for the ramdisk), but anything can be.
Reading your source code from the drive is a very, very small part of the overhead of compiling software. Your CPU speed is far more relevant, as parsing and generating binaries are the slowest part of the process.
**Update
Your graphs show a very busy CPU, I am not sure what you expect to see. Unless the build is multithreaded AND your kernel stops scheduling other, less intensive threads, this is certainly the graph of a busy processor.
Your trace is showing 23% CPU usage. Your CPU has 4 actual cores (with hyperthreading to make it look like 8). So, you're using exactly one core to its absolute maximum (plus or minus 2%, which is probably better accuracy than you can really expect).
The obvious conclusion from this would be that your build process is CPU bound, so improving your disk speed is unlikely to make much difference.
If you want substantially faster builds, you need to either figure out what's wrong with your current makefiles or else write entirely new ones without the problems, so you can support both partial and parallel builds.
That can gain you a lot. Essentially anything else you do (speeding up disks, overclocking the CPU, etc.) is going to give minor gains at best (maybe 20% if you're really lucky, where a proper build environment will probably give at least a 20:1 improvement for most typical builds).

Intel Xeon Phi - running multiple single-threaded executables

I'm trying to find out whether I could use an Intel Xeon Phi coprocessor to "parallelize" the following problem:
Say I have 2000 files that need to be processed by a single-threaded executable. For each file, the executable reads it, does its thing and outputs it to a correspoinding output file, then exits.
For instance:
FILES=/path/to/*
for f in $FILES
do
# take action on each file
./executable $f outFileCorrespondingTo_f
done
The tools are not coded for multi-threaded execution, or looping through the files, nor do we wish to change anything in their code for now. They're written in C with some external libraries.
My questions are:
Could this kind of "script-looping" be run on the Xeon Phi's native OS in such a way that it parallelizes the calls to the executable, so they run concurrently on all of its cores? Is it "general-purpose" enough for that?
The files themselves are rather small, so its 8GB memory would be more than enough for storing the data at runtime, but not for keeping all of the output on the device, so I would need to output on the host. So my second quetion is: is this kind of memory exchange possible "externally"?
i.e. not coded into the tool, but managed by the host OS and the device, for every execution of the executable.
If this is possible, could it provide a performance boost in any way, or would the memory and thread allocation bottlenecks be too intensive? Basically each execution takes a few seconds, depending on the length of the input file, but I'm pretty confident this is a few orders of magnitude longer than how much it would take to transfer the file.
Xeon phi co-processors run a very feature-full version of the Linux operating system, so most of what you are used to on a Linux box is likely to work on Xeon Phi as well.
Now, for your specific issue, I guess that GNU Parallel should just permit you to do what you want in a breath. Simply, you'll have to have your file system mounted on the card so that you can access the files directly, but this is just standard stuff for a Xeon Phi node. And be aware that this will generate some traffic on the PCIe link between the host and the co-processor for the file transfers.
Regarding performance, this is hard to tell: the lower single-threaded performance of Xeon Phi cores along with the transfer times are definitely suggesting a big hit in this domain, but the level of parallelism you can extract from the device might very well overcome this, depending on how compute intensive your workload is. Best answer is for you to give it a try...
This is an addition to the answer given by Gilles.
Yes, the Xeon Phi should be able to do what you want at a basic operational level.
Even so, I think it is the wrong platform for your purpose for a few reasons.
Each core on the Xeon Phi is a Pentium core. Though it is enhanced (4 threads/core, 512 bit vector engine, etc), it is still a Pentium. That means it runs scalar code as a Pentium. Your task sounds like a whole bunch of serial processes running in parallel. So each process will run as if it is running on a Pentium.
To achieve remarkable performance, you need code that parallelizes well (read that as OpenMP, light weight threads, and thread pooling) and also vectorizes (takes advantage of the 512-bit vector engine). Without both of those enhancements, you are running on a Pentium, abet a lot of Pentiums.
Moving data across the PCIe bus is slow. If you are transferring a lot of files, this can be even slower though you can reduce the contention a little by hiding latency (depending upon your application). If you are hitting the PCIe bus with 244 file read requests on start up, that's quite a lot of contention. Even in a steady state condition, it sounds like you'll be reading more than 20 files at any given time (and I suspect even more given that we are executing scalar code as a Pentium).
Now the KNL architecture might be more appropriate for your needs, but that isn't out yet.
If you still think the Xeon Phi might be appropriate for what you want to do, you can ask the Xeon Phi Intel forum experts. If your application is proprietary/sensitive, you can ask the Intel experts as a private message.

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation as the server runs on two machines. One is a PC the other is an embedded device with an Arm processor. Even the embedded device only gets to about 2% CPU usage but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there anyway to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described eg. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instructions counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather sufficient number of samples to get detailed results.

Fastest way to run a program in a 64 bit environment?

It's been a couple of decades since I've done any programming. As a matter of fact the last time I programmed was in an MS-DOS environment before Windows came out. I've had this programming idea that I have wanted to try for a few years now and I thought I would give it a try. The amount of calculations are enormous. Consequently I want to run it in the fastest environment I can available to a general hobby programmer.
I'll be using a 64 bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the windows environment then then in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the 6800 Motorola series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning new. Just want to take advantage of speed. I'd prefer to spend no money on this project. So I'm looking for a free or very close to free compiler. I've downloaded Microsoft Visual Studio C++ Express. But I've got a feeling that the completed compiled code will have to be run in the Windows environment. Which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim
Speed generally comes with the price of either portability or complexity.
If your programming idea involves lots of computation, then if you're using Intel CPU, you might want to use Intel's compiler, which might benefit from some hidden processor features that might make your program faster. Otherwise, if portability is your goal, then use GCC (GNU Compiler Collection), which can cross-compile well optimized executable to practically any platform available on earth. If your computation can be parallelizable, then you might want to look at SIMD (Single Input Multiple Data) and GPGPU/CUDA/OpenCL (using graphic card for computation) techniques.
However, I'd recommend you should just try your idea in the simpler languages first, e.g. Python, Java, C#, Basic; and see if the speed is good enough. Since you've never programmed for decades now, it's likely your perception of what was an enormous computation is currently miniscule due to the increased processor speed and RAM. Nowadays, there is not much noticeable difference in running in GUI environment and command line environment.
Tthere is no substantial performance penalty to operating under Windows and a large quantity of extremely high performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, then you won't have any realistic performance picture at all. You should make like most people- start with an easy to learn but not very fast programming language like C#, create the program, then prove that it runs too slowly, then make several optimization passes with tools such as profilers, then you may decide that the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program that you write will be slow because you're a novice programmer from modern perspectives, not because of your choice of language or environment. Creating very high performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operations of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.
The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would need to re-implement yourself otherwise)
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so does Linux and OSX. And if you were to ditch all existing OS'es and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.
A digression: that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a quite "peculiar" idea to me: of course one hopes that when no jobs are running the OS is not wasting energy. Otherwise one measure the speed/throughput of an OS when it is providing a service (i.e. mediating the access to hardware/resources).
To avoid an annoying MS Windows vs Linux vs Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access but it implements policies. Such policies does not always favor the performance of your application-specific access mode to a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf) retaining its multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So bypassing the usual OS access policies they gained, for a customized web server, an increase of about 800% in performance.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending by hardware/OS/application.
I assume that the OP is correct in its rough estimation of computing power required. The OP does not specify the nature of such intensive computation, so its difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think is required a compiled language or a fast interpreted language with native array operators, like APL, or modern variant such as J, A+ or K (potentially, at least: I do not know if they are taking advantage of modern hardware).
Anyway, the first advice is to spend some time in researching fast algorithms for your specific problem (but when comparing algorithms remember that asymptotic notation disregards constant factors that sometimes are not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. For problems (or some parts of them) where it can be formulated a data representation that can be mapped to bit vectors (where a value is represented by 1 or more bits); so each instruction from the instruction set is potentially a parallel instruction which operates on multiple data items (SIMD). Especially interesting on a machine with 64 bits (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction set extensions like SSE, successors, predecessors and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs on current desktops/laptops to have more then one core (2/4) sharing the same memory pool (shared-memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops. If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a more high-level programming model which masks threads.
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.
Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I have some program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware, and choose your language. And write your code in a clear way because now, compilers are better at optimizing code developers, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).
What are you trying to do? I believe all the stuff should be compiled in 64bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: As for computation intense stuff you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU which can transfer lots of information at a time compared to the CPU.
If your last points of reference are M68K and PCs running DOS then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows or that the compiler / optimizer generated code isn't sufficient, then you could look at other OSes and/or hand coded ASM. If you're looking for free, Linux / gcc is a good place to start.
I am the original poster of this thread.
I am once again reiterating the emphasis that this program will have enormous number of calculations.
Windows & Ubuntu are multi-tasking environments. There are processes running and many of them are using processor resources. True many of them are seen as inactive. But still the Windows environment by the nature of multi-tasking is constantly monitoring the need to start up each process. For example currently there are 62 processes showing in the Windows Task Manager. According the task manager three are consuming CPU resouces. So we have three ongoing processes that are consuming CPU processing. There are an addition 59 showing active but consuming no CPU processing. So that is 63 being monitored by Windows and then there is the Windows that also is checking on various things.
I was hoping to find some way to just be able to run a program on the bare machine level. Side stepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim