Disable hardware prefetcher on Xeon Phi - xeon-phi

Is it possible to disable hardware prefetcher on Xeon Phi preferably via programming? I want to measure the percent performance improvement provided by the hardware prefetcher for the STREAM benchmark. I do not want to change the original STREAM benchmark, I want to add some lines of code or a compiler flag that disables the hardware prefetcher.
I am aware of the answer at https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/520185. I wonder whether there is an update or a geneious workaround.

Related

How can I break down the memory-only time and computation-only time for a program on Xeon Phi?

Modern processors overlap memory accesses with computations. I want to study this overlap on Intel Xeon Phi. A conventional way to do so is to modify the code and make two versions: memory-only and computation-only, like the approach used in this slide for GPU: http://www.nvidia.com/content/GTC-2010/pdfs/2012_GTC2010.pdf.
However, my program has complex control flows and data dependencies. It's very hard for me to make such two versions.
Is there any convenient way to measure this overlap? I'm considering the Vtune profile, but I'm still not sure about what HW counters should I look at.

How to Adjust Processor Bus Multiplier

I am looking for a windows function, structure, API that control the multiplier of the bus speed of the processor. In other word, I am trying to adjust frequency of the CPU by varying the multiplier.Currently, I am adjusting the CPU speed by modifying the power scheme using the following function.
PowerWriteDCValueIndex(…,…,…,…)
And adjust the
THROTTLE_MAXIMUM; & THROTTLE_MINIMUM;
However, this allows me to vary the processor speed as % which is not accurate.
Hope my question is clear and you can help.
Thanks.
The handling of power states in the OS is handled by a kernel driver module, which will be specific at least to the particular CPU vendor, sometimes also to the CPU Model (e.g. the operations is done differently in an 64-bit AMD processor than it is in a 32-bit AMD processor. I once played with a Linux drivers to set the clock speeds of AMD processors).
This driver will be controlled by a "governor" process that takes as input the configuration settings (policy) you're already using, the current load on the CPU (and often some "load history" too, to reduce too many switches) and other sources such as CPU temperature, power left in battery (if applicable). [In mobile devices, the temperature of the CPU is definitely an input into the equation, since most modern CPUs and GPUs are able to draw much more power than the device can dissipate, thus overheating the chip if the power setting is left on a high setting for too long]
Unfortunately, you need to know a lot more details than "I want to run this fast" before you can do this. There are BIOS tables (ACPI and/or other vendor specific tables) that define what voltage to use at what frequency, and you will need to set the voltage first, then the clock-speed when going up in speed, and the clock speed then voltage when going down in speed. The tables will most often not contain ALL the speeds the CPU can go at, but a "full speed", "medium speed" and "slow speed" setting. [And there will be multiple tables for different types of CPU's, since the BIOS doesn't know if the person building the system will use a high power, high speed CPU or a low speed, low power CPU that].
There are also registers that need to be programmed to determine how long the CPU should "sleep" before it switches to the new speed, to allow the PLL's (that control the clock multipliers) to stabilize. This means that you don't want to switch too often.
The system also needs to know that the clock-frequency has changed so that any processing that depends on the CPU-speed can be adjusted (e.g. things that use the RDTSC instruciton on x86 to measure short times will need to adjust their timings based on a new setting).
If you don't get ALL of these things perfect, you will have an unstable system (and in a mobile device, you could even "fry" the chip - or the user!).
It is not clear what you intend to do, but in general, it's better to leave these things to the governor that already is in the system than to try to make a better system - almost all attempts to make this "better" will fail.

How to measure read/cycle or instructions/cycle?

I want to thoroughly measure and tune my C/C++ code to perform better with caches on a x86_64 system. I know how to measure time with a counter (QueryPerformanceCounter on my Windows machine) but I'm wondering how would one measure the instructions per cycle or reads/write per cycle with respect to the working set.
How should I proceed to measure these values?
Modern processors (i.e., those not very constrained that are less than some 20 years old) are superscalar, i.e., they execute more than one instruction at a time (given correct instruction ordering). Latest x86 processors translate the CISC instructions into internal RISC instructions, reorder them and execute the result, have even several regster banks so instructions using "the same registers" can be done in parallel. There isn't any reasonable way to define the "time the instruction execution takes" today.
The current CPUs are much faster than memory (a few hundred instructions is the typical cost of accessing memory), they are all heavily dependent on cache for performance. And then you have all kinds of funny effects of cores sharing (or not) parts of cache, ...
Tuning code for maximal performance starts with the software architecture, goes on to program organization, algorithm and data structure selection (here a modicum of cache/virtual memory awareness is useful too), careful programming and (as te most extreme measures to squeeze out the last 2% of performance) considerations like the ones you mention (and the other favorite, "rewrite in assembly"). And the ordering is that one because the first levels give more performance for the same cost. Measure before digging in, programmers are notoriously unreliable in finding bottlenecks. And consider the cost of reorganizing code for performance, both in the work itself, in convincing yourself this complex code is correct, and maintenance. Given the relative costs of computers and people, extreme performance tuning rarely makes any sense (perhaps for heavily travelled code paths in popular operating systems, in common code paths generated by a compiler, but almost nowhere else).
If you are really interested in where your code is hitting cache and where it is hitting memory, and the processor is less than about 10-15 years old in its design, then there are performance counters in the processor. You need driver level software to access these registers, so you probably don't want to write your own tools for this. Fortunately, you don't have to.
There is tools like VTune from Intel, CodeAnalyst from AMD and oprofile for Linux (works with both AMD and Intel processors).
There are a whole range of different registers that count the number of instructions actually completed, the number of cycles the processor is waiting for . You can also get a count of things like "number of memory reads", "number of cache misses", "number of TLB misses", "number of FPU instructions".
The next, more tricky part, is of course to try to fix any of these sort of issues, and as mentioned in another answer, programmers aren't always good at tweaking these sort of things - and it's certainly time consuming, not to mention that what works well on processor model X will not necessarily run fast on model Y (there were some tuning tricks for early Pentium 4 that works VERY badly on AMD processors - if on the other hand, you tune that code for AMD processors of that age, you get code that runs well on the same generation Intel processor too!)
You might be interested in the rdtsc x86 instruction, which reads a relative number of cycles.
See http://www.fftw.org/cycle.h for an implementation to read the counter in many compilers.
However, I'd suggest simply measuring using QueryPerformanceCounter. It is rare that the actual number of cycles is important, to tune code you typically only need to be able to compare relative time measurements, and rdtsc has many pitfalls (though probably not applicable to the situation you described):
On multiprocessor systems, there is not a single coherent cycle counter value.
Modern processors often adjust the frequency, changing the rate of change in time with respect to the rate of change in cycles.

Determine Values AND/OR Address of Values in CPU Cache

Is there a way to determine exactly what values, memory addresses, and/or other information currently resides in the CPU cache (L1, L2, etc.) - for current or all processes?
I've been doing quite a bit a reading which shows how to optimize programs to utilize the CPU cache more effectively. However, I'm looking for a way to truly determine if certain approaches are effective.
Bottom line: is it possible to be 100% certain what does and does not make it into the CPU cache.
Searching for this topic returns several results on how to determine the cache size, but not contents.
Edit: To clarify some of the comments below: Since software would undoubtedly alter the cache, do CPU manufactures have a tool / hardware diagnostic system (built-in) which provides this functionality?
Without using specialized hardware, you cannot directly inspect what is in the CPU cache. The act of running any software to inspect the CPU cache would alter the state of the cache.
The best approach I have found is simply to identify real hot spots in your application and benchmark alternative algorithms on hardware the code will run on in production (or on a range of likely hardware if you do not have control over the production environment).
In addition to Eric J.'s answer, I'll add that while I'm sure the big chip manufacturers do have such tools it's unlikely that such a "debug" facility would be made available to regular mortals like you and I, but even if it were, it wouldn't really be of much help.
Why? It's unlikely that you are having performance issues that you've traced to cache and which cannot be solved using the well-known and "common sense" techniques for maintaining high cache-hit ratios.
Have you really optimized all other hotspots in the code and poor cache behavior by the CPU is the problem? I very much doubt that.
Additionally, as food for thought: do you really want to optimize your program's behavior to only one or two particular CPUs? After all, caching algorithms change all the time, as do the parameters of the caches, sometimes dramatically.
If you have a relatively modern processor running Windows then take a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
and see if that might provide some of what you are looking for.
To optimize for one specific CPU cache size is usually in vain since this optimization will break when your assumptions about the CPU cache sizes are wrong when you execute on a different CPU.
But there is a way out there. You should optimize for certain access patterns to allow the CPU to easily predict what memory locations should be read next (the most obvious one is a linear increasing read). To be able to fully utilize a CPU you should read about cache oblivious algorithms where most of them follow a divide and conquer strategy where a problem is divided into sub parts to a certain extent until all memory accesses fit completly into the CPU cache.
It is also noteworthy to mention that you have a code and data cache which are separate. Herb Sutter has a nice video online where he talks about the CPU internals in depth.
The Visual Studio Profiler can collect CPU counters dealing with memory and L2 counters. These options are available when you select instrumentation profiling.
Intel has also a paper online which talks in greater detail about these CPU counters and what the task manager of Windows and Linux do show you and how wrong it is for todays CPUs which do work internally asynchronous and parallel at many diffent levels. Unfortunatley there is no tool from intel to display this stuff directly. The only tool I do know is the VS profiler. Perhaps VTune has similar capabilities.
If you have gone this far to optimize your code you might look as well into GPU programming. You need at least a PHD to get your head around SIMD instructions, cache locality, ... to get perhaps a factor 5 over your original design. But by porting your algorithm to a GPU you get a factor 100 with much less effort ony a decent graphics card. NVidia GPUs which do support CUDA (all today sold cards do support it) can be very nicely programmed in a C dialect. There are even wrapper for managed code (.NET) to take advantage of the full power of GPUs.
You can stay platform agnostic by using OpenCL but NVidia OpenCL support is very bad. The OpenCL drivers are at least 8 times slower than its CUDA counterpart.
Almost everything you do will be in the cache at the moment when you use it, unless you are reading memory that has been configured as "uncacheable" - typically, that's frame buffer memory of your graphics card. The other way to "not hit the cache" is to use specific load and store instructions that are "non-temporal". Everything else is read into the L1 cache before it reaches the target registers inside the CPU itself.
For nearly all cases, CPU's do have a fairly good system of knowing what to keep and what to throw away in the cache, and the cache is nearly always "full" - not necessarily of useful stuff, if, for example you are working your way through an enormous array, it will just contain a lot of "old array" [this is where the "non-temporal" memory operations come in handy, as they allow you to read and/or write data that won't be stored in the cache, since next time you get back to the same point, it won't be in the cache ANYWAYS].
And yes, processors usually have special registers [that can be accessed in kernel drivers] that can inspect the contents of the cache. But they are quite tricky to use without at the same time losing the content of the cache(s). And they are definitely not useful as "how much of array A is in the cache" type checking. They are specifically for "Hmm, it looks like cache-line 1234 is broken, I'd better read the cached data to see if it's really the value it should be" when processors aren't working as they should.
As DanS says, there are performance counters that you can read from suitable software [need to be in the kernel to use those registers too, so you need some sort of "driver" software for that]. In Linux, there's "perf". And AMD has a similar set of performance counters that can be used to find out, for example "how many cache misses have we had over this period of time" or "how many cache hits in L" have we had, etc.

Measure how often a branch is mispredicted

Assuming I have a if-else branch in C++ how can I (in-code) measure how often the branch is mispredicted? I would like to add some calls or macros around the branch (similar to how you do bottom-up profiling) that would report branch mispredictions.
It would be nice to have a generic method, but lets do Intel i5 2500k for starters.
If you are using an AMD CPU, AMD's CodeAnalyst is just what you need (works on windows and Linux)*.
if your not, then you may need to fork out for a VTune licence or build something using the on CPU performance registers and counters details in the instruction manuals.
You can also check out gperf & OProfile (linux only), see how well they perform (I've never used these, but I see them referred to quite a bit).
*CodeAnalyst should work on an Intel CPU, you just don't get all then nice CPU level analysis.
I wonder if it would be possible to extract this information from g++ -fprofile-arcs? It has to measure exactly this in order to feed back into the optimizer in order to optimize branching.
OProfile
OProfile is pretty complex, but it can profile anything your CPU tracks.
Look through the Event Type Reference and look for your particular CPU.
For instance here is the core2 events. After a quick search I don't see any event counters for missed branch prediction on the core2 architecture.