Measure or profile use of AVX2 (and other advanced instruction sets) instructions used by programm

Measure or profile use of AVX2 (and other advanced instruction sets) instructions used by programm - profiling

We are chasing some weird hardware failures on AMD Threadrippers. I came across some evidence that AVX2/AVX-512 instructions can lead to weird behaviour (https://news.ycombinator.com/item?id=22382946).
Is there a generic way of measuring or profiling the use of AVX2/AVX-512 instructions of a running program or machine? For now it would be enough for me to get a ball-park of how many of these instructions are being used in a given time frame. I do not necessarily need to pin it down to the actual program using them. The more detailed the profiling / attribution of AVX2/AVX-512 instruction use by program or time is the better.
I would prefer tools that run in Linux.

Related

Performance counter for Halide?

Is there a performance counter available for code written in the Halide language? I would like to know how many loads, stores, and ALU operations are performed by my code.
The Halide tutorial for scheduling multi-stage pipelines compares different schedules by comparing the amount of allocated memory, loads, stores, and calls to halide Funcs, but I don't see how this information was collected. I suppose it might be possible to use trace_stores, trace_loads, and trace_realizations to print to the console every time one of these operations occurs. This isn't a great option though because it would greatly slow down the program's execution and would require some kind of counting script to compile the long list of console outputs into the desired counts for loads, stores, and ALU operations.

I'm pretty sure they just used the trace_xxx output and ran some scripts/programs on it.
If you're looking for real performance numbers on a X86 platform, I would go with Intel VTune Amplifier. It's pretty expensive, but may be free if you're in academia (student, teacher, researcher) or it's for an open source project.
Other than that, look at the lowered statement code by setting HL_DEBUG_CODEGEN=1 in the environment and you can get a better idea of the loop structure and data use. Note that this output goes to stderr, not stdout.
EDIT: For Linux, there's perf.

We do not have any perf counter based support at present. It is fairly difficult to make it portable. (And on mobile devices, often the OS simply doesn't allow access to the hardware.) The support in Profiling.cpp and src/profiling.cpp could likely be used to drive perf counter operation. The profiling lowering pass adds code to call routines in the runtime which update information about Func and Pipeline execution. This information is collected and aggregated by another thread.
If tracing is run to a file (e.g. using HL_TRACE_FILE) a binary format is used and it is a bit more efficient. See utils/HalideTraceViz for a tool to work with the binary format. This is generally how analyses are done within the team.
There was a small amount of investigation of OProfile, which looked promising but I don't think we ever got code working.

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation as the server runs on two machines. One is a PC the other is an embedded device with an Arm processor. Even the embedded device only gets to about 2% CPU usage but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there anyway to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.

Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described eg. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instructions counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather sufficient number of samples to get detailed results.

How to profile a C++ function at assembly level?

I have a function that is the bottleneck of my program. It requires no access to memory and requires only calculation. It is the inner loop and called many times so any small gains to this function is big wins for my program.
I come from a background in optimizing SPU code on the PS3 where you take a SPU program and run it through a pipeline analyzer where you can put each assembly statement in its own column and you minimize the amount of cycles the function takes. Then you overlay loops so you can minimized pipeline dependencies even more. With that program and a list of all the cycles each assembly instruction takes I could optimize much better then the compiler ever could.
On a different platform it had events I could register (cache misses, cycles, etc.) and I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.

There are some great free tools available, mainly AMD's CodeAnalyst (from my experiences on my i7 vs my phenom II, its a bit handicapped on the Intel processor cause it doesn't have access to the direct hardware specific counters, though that might have been bad config).
However, a lesser know tool is the Intel Architecture Code Analyser (which is free like CodeAnalyst), which is similar to the spu tool you described, as it details latency, throughput and port pressure (basically the request dispatches to the ALU's, MMU and the like) line by line for your programs assembly. Stan Melax gave a nice talk on it and x86 optimization at this years GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, avaibale under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune, from what I can see, its free.
Its also a good idea to have read the intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can be easily made up for using Agner Fog's Optimization manuals, which also contain many other gems.

You might want to try some of the utilities included in valgrind like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And kcachegrind is a nice GUI, and will show the dumps including assembly and number of hits per instruction etc.

From you description it sounds like you problem may be embarrassingly parallel, have you considered using ppl's parallel_for?

Fastest way to run a program in a 64 bit environment?

It's been a couple of decades since I've done any programming. As a matter of fact the last time I programmed was in an MS-DOS environment before Windows came out. I've had this programming idea that I have wanted to try for a few years now and I thought I would give it a try. The amount of calculations are enormous. Consequently I want to run it in the fastest environment I can available to a general hobby programmer.
I'll be using a 64 bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the windows environment then then in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the 6800 Motorola series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning new. Just want to take advantage of speed. I'd prefer to spend no money on this project. So I'm looking for a free or very close to free compiler. I've downloaded Microsoft Visual Studio C++ Express. But I've got a feeling that the completed compiled code will have to be run in the Windows environment. Which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim

Speed generally comes with the price of either portability or complexity.
If your programming idea involves lots of computation, then if you're using Intel CPU, you might want to use Intel's compiler, which might benefit from some hidden processor features that might make your program faster. Otherwise, if portability is your goal, then use GCC (GNU Compiler Collection), which can cross-compile well optimized executable to practically any platform available on earth. If your computation can be parallelizable, then you might want to look at SIMD (Single Input Multiple Data) and GPGPU/CUDA/OpenCL (using graphic card for computation) techniques.
However, I'd recommend you should just try your idea in the simpler languages first, e.g. Python, Java, C#, Basic; and see if the speed is good enough. Since you've never programmed for decades now, it's likely your perception of what was an enormous computation is currently miniscule due to the increased processor speed and RAM. Nowadays, there is not much noticeable difference in running in GUI environment and command line environment.

Tthere is no substantial performance penalty to operating under Windows and a large quantity of extremely high performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, then you won't have any realistic performance picture at all. You should make like most people- start with an easy to learn but not very fast programming language like C#, create the program, then prove that it runs too slowly, then make several optimization passes with tools such as profilers, then you may decide that the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program that you write will be slow because you're a novice programmer from modern perspectives, not because of your choice of language or environment. Creating very high performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operations of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.

The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would need to re-implement yourself otherwise)
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so does Linux and OSX. And if you were to ditch all existing OS'es and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.

A digression: that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a quite "peculiar" idea to me: of course one hopes that when no jobs are running the OS is not wasting energy. Otherwise one measure the speed/throughput of an OS when it is providing a service (i.e. mediating the access to hardware/resources).
To avoid an annoying MS Windows vs Linux vs Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access but it implements policies. Such policies does not always favor the performance of your application-specific access mode to a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf) retaining its multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So bypassing the usual OS access policies they gained, for a customized web server, an increase of about 800% in performance.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending by hardware/OS/application.
I assume that the OP is correct in its rough estimation of computing power required. The OP does not specify the nature of such intensive computation, so its difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think is required a compiled language or a fast interpreted language with native array operators, like APL, or modern variant such as J, A+ or K (potentially, at least: I do not know if they are taking advantage of modern hardware).
Anyway, the first advice is to spend some time in researching fast algorithms for your specific problem (but when comparing algorithms remember that asymptotic notation disregards constant factors that sometimes are not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. For problems (or some parts of them) where it can be formulated a data representation that can be mapped to bit vectors (where a value is represented by 1 or more bits); so each instruction from the instruction set is potentially a parallel instruction which operates on multiple data items (SIMD). Especially interesting on a machine with 64 bits (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction set extensions like SSE, successors, predecessors and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs on current desktops/laptops to have more then one core (2/4) sharing the same memory pool (shared-memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops. If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a more high-level programming model which masks threads.
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.

Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I have some program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware, and choose your language. And write your code in a clear way because now, compilers are better at optimizing code developers, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).

What are you trying to do? I believe all the stuff should be compiled in 64bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: As for computation intense stuff you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU which can transfer lots of information at a time compared to the CPU.

If your last points of reference are M68K and PCs running DOS then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows or that the compiler / optimizer generated code isn't sufficient, then you could look at other OSes and/or hand coded ASM. If you're looking for free, Linux / gcc is a good place to start.

I am the original poster of this thread.
I am once again reiterating the emphasis that this program will have enormous number of calculations.
Windows & Ubuntu are multi-tasking environments. There are processes running and many of them are using processor resources. True many of them are seen as inactive. But still the Windows environment by the nature of multi-tasking is constantly monitoring the need to start up each process. For example currently there are 62 processes showing in the Windows Task Manager. According the task manager three are consuming CPU resouces. So we have three ongoing processes that are consuming CPU processing. There are an addition 59 showing active but consuming no CPU processing. So that is 63 being monitored by Windows and then there is the Windows that also is checking on various things.
I was hoping to find some way to just be able to run a program on the bare machine level. Side stepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim

How to profile multi-threaded C++ application on Linux?

I used to do all my Linux profiling with gprof.
However, with my multi-threaded application, it's output appears to be inconsistent.
Now, I dug this up:
http://sam.zoy.org/writings/programming/gprof.html
However, it's from a long time ago and in my gprof output, it appears my gprof is listing functions used by non-main threads.
So, my questions are:
In 2010, can I easily use gprof to profile multi-threaded Linux C++ applications? (Ubuntu 9.10)
What other tools should I look into for profiling?

Edit: added another answer on poor man's profiler, which IMHO is better for multithreaded apps.
Have a look at oprofile. The profiling overhead of this tool is negligible and it supports multithreaded applications---as long as you don't want to profile mutex contention (which is a very important part of profiling multithreaded applications)

Have a look at poor man's profiler. Surprisingly there are few other tools that for multithreaded applications do both CPU profiling and mutex contention profiling, and PMP does both, while not even requiring to install anything (as long as you have gdb).

Try modern linux profiling tool, the perf (perf_events): https://perf.wiki.kernel.org/index.php/Tutorial and http://www.brendangregg.com/perf.html:
perf record ./application
# generates profile file perf.data
perf report

Have a look at Valgrind.

A Paul R said, have a look at Zoom. You can also use lsstack, which is a low-tech approach but surprisingly effective, compared to gprof.
Added: Since you clarified that you are running OpenGL at 33ms, my prior recommendation stands. In addition, what I personally have done in situations like that is both effective and non-intuitive. Just get it running with a typical or problematic workload, and just stop it, manually, in its tracks, and see what it's doing and why. Do this several times.
Now, if it only occasionally misbehaves, you would like to stop it only while it's misbehaving. That's not easy, but I've used an alarm-clock interrupt set for just the right delay. For example, if one frame out of 100 takes more than 33ms, at the start of a frame, set the timer for 35ms, and at the end of a frame, turn it off. That way, it will interrupt only when the code is taking too long, and it will show you why. Of course, one sample might miss the guilty code, but 20 samples won't miss it.

I tried valgrind and gprof. It is a crying shame that none of them work well with multi-threaded applications. Later, I found Intel VTune Amplifier. The good thing is, it handles multi-threading well, works with most of the major languages, works on Windows and Linux, and has many great profiling features. Moreover, the application itself is free. However, it only works with Intel processors.

You can randomly run pstack to find out the stack at a given point. E.g. 10 or 20 times.
The most typical stack is where the application spends most of the time (according to experience, we can assume a Pareto distribution).
You can combine that knowledge with strace or truss (Solaris) to trace system calls, and pmap for the memory print.
If the application runs on a dedicated system, you have also sar to measure cpu, memory, i/o, etc. to profile the overall system.

Since you didn't mention non-commercial, may I suggest Intel's VTune. It's not free but the level of detail is very impressive (and the overhead is negligible).

Putting a slightly different twist on matters, you can actually get a pretty good idea as to what's going on in a multithreaded application using ftrace and kernelshark. Collecting the right trace and pressing the right buttons and you can see the scheduling of individual threads.
Depending on your distro's kernel you may have to build a kernel with the right configuration (but I think that a lot of them have it built in these days).

Microprofile is another possible answer to this. It requires hand-instrumentation of the code, but it seems like it handles multi-threaded code pretty well. And it also has special hooks for profiling graphics pipelines, including what's going on inside the card itself.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js