To check whether AVX or SSE is available, the compiler usually defines a corresponding macro (e.g. __SSE__ or __AVX__). Is there a similar option for obtaining the size of the cache at compile time?
Edit:
I rephrased the question a little, since it was not clear enough.
I want information about the cache at compile time. The more the better. I would like to have it for optimization purposes (i.e. cache blocking, ...). In my current task, spatial blocking, the size of the cache is what matters most. But the comments below rightly asked which level of cache I am asking about. In addition, caches can behave very differently if you consider their eviction strategy, cache line size, number of levels, how they are shared between cores... The list goes on.
So my general question: how do I obtain any information about the cache at compile time?
For my current task it would be sufficient to read /proc/cpuinfo and use the cache size given there. However, the general question is far more interesting.
How do I obtain information about the CPU (with a focus on its cache) at compile time?
(I am not considering cross-compiling at all. The compiled code will run on the SAME machine.)
Apparently, no compiler has such detailed information about the CPU AND makes it available to the program being compiled. However, I have found a library that seems to do exactly that, sadly again only at runtime. Still, it is at least a comfortable solution.
(I found it by accident. I was not looking for it :) If you ever look for something, 1) google 2) look here)
You can compile several versions of your library for various cache sizes and then pick them dynamically. That is the only possible approach for adding cache size info at compile-time.
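As an illustration, here is a minimal C++ sketch of that approach; the kernel and the block sizes are hypothetical. A blocked kernel is instantiated at compile time for a few candidate block sizes, and the best fitting instantiation is chosen at startup from the cache size detected at runtime.

    #include <cstddef>

    // Hypothetical kernel, instantiated at compile time for a given block size.
    template <std::size_t BlockSize>
    double sum_blocked(const double* data, std::size_t n) {
        double total = 0.0;
        for (std::size_t start = 0; start < n; start += BlockSize)
            for (std::size_t i = start; i < start + BlockSize && i < n; ++i)
                total += data[i];
        return total;
    }

    // At startup, pick the instantiation that best fits the detected cache size.
    double sum(const double* data, std::size_t n, std::size_t cache_bytes) {
        if (cache_bytes >= 8 * 1024 * 1024) return sum_blocked<1 << 20>(data, n);
        if (cache_bytes >= 1 * 1024 * 1024) return sum_blocked<1 << 17>(data, n);
        return sum_blocked<1 << 15>(data, n);
    }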
Related
I am executing my C++ code on Linux. In my code, there is a large 2D array of some structure. The array is accessed randomly. I have to find how many cache misses occur when that 2D array is accessed. Is there any solution other than valgrind (as it takes too much time to compute results) that can help me find the cache misses and the cache miss rate of this array?
Generally, your program does not just access the 2D array; the memory traffic includes other variables as well, which may skew the results more or less.
If the array access is intensive (or well encapsulated), maybe you can use a simple simulator to evaluate the miss rate.
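For illustration, here is a minimal sketch of such a simulator, assuming a direct-mapped cache with hypothetical parameters; you feed it the addresses your code touches and it counts hits and misses.

    #include <cstdint>
    #include <vector>

    // Minimal direct-mapped cache model: one tag per line, no associativity.
    struct DirectMappedCache {
        std::size_t line_bytes;             // e.g. 64
        std::vector<std::uint64_t> tags;    // one tag per cache line
        std::vector<bool> valid;
        std::size_t hits = 0, misses = 0;

        DirectMappedCache(std::size_t cache_bytes, std::size_t line)
            : line_bytes(line),
              tags(cache_bytes / line),
              valid(cache_bytes / line, false) {}

        void access(std::uint64_t addr) {
            std::uint64_t line = addr / line_bytes;   // which line the address maps to
            std::size_t   set  = line % tags.size();  // direct-mapped: one slot per line
            if (valid[set] && tags[set] == line) { ++hits; return; }
            valid[set] = true;
            tags[set]  = line;
            ++misses;
        }
    };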
Otherwise, you may need some tools. I suggest Pin, a dynamic binary instrumentation tool whose principle is similar to valgrind's; I think the overhead is acceptable, but you have to write the analysis code yourself.
A better choice may be VTune, Intel's performance analysis tool; try its cache analysis. Still, you need some time to learn it.
Simulating the cache risks not reproducing the exact same behavior due to minor differences (in replacement policy, protocols, etc.).
A better option is profiling the code - the simplest profiling tool for Linux is probably perf. There's a pretty decent tutorial here - https://perf.wiki.kernel.org/index.php/Tutorial
Also see Are there any way to profile cache miss in linux kernel?; it includes a list of specific example statistics to check for cache performance.
I want to find out how cache-efficient my C++ code is. I am running it on Ubuntu. How do I find the number of cache hits or cache misses?
Another question: using the time command, I found that one part of my code gives 2133 (minor) page faults and another gives 2361 (minor) page faults. Are (minor) page faults related to cache misses? If so, how are they related? I also have to perform some I/O; can that cause (minor) page faults?
The most comprehensive profiling tool for Linux is oprofile, which can profile single applications or your entire system and can give you detailed information about cache misses (and where they occur) on processors that support performance counters for events like cache misses (pretty much all x86 processors made in the past 20 years support such counters).
Page faults have nothing to do with cache misses, though they are also a potential source of performance problems.
Writing my own answer as it was getting longer than a single comment to Chris Dodd's answer...
You can use oprofile or perf (which uses basically the same kernel functionality as oprofile); either will give you cache hits vs. misses. Note that it's hard to say how many cache misses or cache hits an application "should" have; you can really only compare the numbers with another run of the same application (after tweaks, and seeing how much they vary from one run to another can also be quite useful).
I don't think the exact number of page-faults is that critical - it tends to vary a little bit depending on "luck".
Page faults are caused by all manner of things. For example, when you allocate a large chunk of memory, that memory will be "not present" when a page is used for the first time; at that point it is marked present (and probably also filled with zeros). You also get page faults from loading functions from shared libraries, as they are "not present" when the shared library is initialized. In some cases, part of the application may also use mmap to map in a file instead of using file read/write operations (which is probably what you are thinking of with regard to file I/O).
Is there a way I could write a "tool" which could analyse the x86 assembly produced from a C/C++ program and measure the performance in such a way that it wouldn't matter if I ran it on a 1 GHz or a 3 GHz processor?
I am thinking more along the lines of instruction throughput. How could I write such a tool? Would it be possible?
I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.
Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. IF you want static analysis you'll be in "undecidable land" as the other answer pointed out (you can't even detect if a given code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.
It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, all memory reads are found in the L1 data cache and have 4-cycle latency; all instructions are in the L1 instruction cache but in the trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details of everything else differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they take (and the resources they occupy during this time) also varies. The accuracy of branch prediction is hard to predict, and different processors take different numbers of cycles to recover from a misprediction. Caches are different sizes, take different times to access, and have different algorithms for deciding what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. The more you can narrow down the processor you are targeting, and the more you constrain the code you are evaluating, the better you can predict how the code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly little-known) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.
I'm currently porting a VS2005 C++ application from CE5 to CE6 and I'm experiencing severe performance problems. This goes so far that a single HTTP request retrieving dynamic content takes 40ms on CE5 and 350ms on CE6. These values used to be worse due to a bunch of inefficiencies that I already cleaned up, improving performance on both systems, but at the moment I'm stuck at that latency. For the record, both tests are made on the same machine and the webserver is not the one supplied with CE but a custom one implemented in C++. Note also that the problem is not the network IO, CE6 even outperforms CE5 on the same machine when serving static files, but it's the dynamic content handling.
While trying to figure out why the program performs so badly, I stumbled across something that puzzled me: under CE5, the Interlocked* API for x86 uses neither the compiler intrinsics nor real function calls but inline assembly code. This code has a comment saying that the intrinsic includes lock prefixes that are only required for multi-processor systems and that slow down code running on just a single core like CE5. On CE6, these functions are implemented using the compiler intrinsics including the lock prefix. Since these functions are used by e.g. Boost and STLport, both of which are used inside the webserver, I was wondering if those could be the culprit.
Another thing I noticed was that some string parsing functions take extremely long. Worse, calling the same function a second time takes less time than the first call, so it seems as if some kind of caching is going on. Since this is a short (<1kB) string received via TCP that is parsed in memory, I can't imagine which cache could be responsible for that. The only candidate could be the instruction cache, but the program is not larger than the CE5 version, and if the code were running from uncached memory it would not show these caching effects.
TLDR - Questions:
Is CE6 capable of handling multiple processors at all?
Is there an easy way to tell the compiler that it should omit the lock prefix? My current approach to achieve that is to simply copy the inline assembly from the CE5 SDK, but that's beyond ugly.
I'd also appreciate any other suggestions what to look at or what to try. Many thanks in advance!
Summary: The problem does not depend on the executable, let alone on the Interlocked API; running the same executable proved that. However, running on a different machine with a different platform setup made a difference. We're now back in Platform Builder, trying to figure out the differences between the two platforms.
No. WEC7 is required for SMP support. Most likely in CE6 the OEM has disabled the other cores.
None that I am aware of.
Either use the performance profiling tools or instrument your code with timing calls to narrow down where things are taking too long.
I have finally found the reason for the performance behaviour: it's simply paging. CE6 has a pool manager (see http://blogs.msdn.com/b/ce_base/archive/2008/01/19/paging-and-the-windows-ce-paging-pool.aspx) which handles paging out unused mapped DLLs and EXEs. When the amount of mapped binaries exceeds a certain size, it starts (with low priority) to page out memory. The limit at which it starts paging out is just 3 MiB by default, which is rather low for current applications. Also, the cache is not an LRU cache; it simply discards pages in the order they were loaded.
It turns out that our system exceeded this limit, which causes the paging to begin. Due to the algorithm used, it keeps throwing out pages that are still in use and that then have to be paged in again. The code that serves static files is small, so it wasn't affected much by this limit. The code that serves dynamic pages is much larger, though, so the resulting I/O wreaks havoc on the overall system. This also explains why the problem couldn't be attributed to a specific piece of code: it wasn't the code itself but loading it.
I have detected this via IOCTL_HAL_GET_POOL_PARAMETERS, which gave me the relevant configuration parameters, current state, how often the pageout-thread ran and for how long (although the latter is only the time it took to swap out pages). I should be able to find the resulting page faults in the kernel tracker, too, now that I know what I'm looking for. I could also observe that the activity LED on the CF card adapter now lights up when first loading a file, but not on subsequent requests, where it is taken from cache. This used to always cause the LED to flash on dynamic pages.
The simple solution is to increase the limit for the pool manager so it doesn't start throwing things out. This can be done easily in config.bib by patching kernel.dll with the corresponding values. Alternatively, reducing the executable size would help, but that's not so easy.
Is there a way in C++ to determine the CPU's cache size? I have an algorithm that processes a lot of data and I'd like to break this data down into chunks such that they fit into the cache. Is this possible?
Can you give me any other hints on programming with cache-size in mind (especially in regard to multithreaded/multicore data processing)?
Thanks!
According to "What every programmer should know about memory", by Ulrich Drepper you can do the following on Linux:
Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen what the kernel publishes about the hardware. A program has to find the directory:
/sys/devices/system/cpu/cpu*/cache
This is listed in Section 6: What Programmers Can Do.
He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.
There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is available on Linux (as a glibc extension) and is supposed to return the L2 cache size, although it doesn't seem to be well documented.
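A minimal sketch of those sysconf queries (glibc extensions; they may return 0 or -1 when the value is not known):

    #include <unistd.h>
    #include <cstdio>

    int main() {
        // glibc-specific cache queries; not guaranteed to be filled in on every system.
        long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        long l1_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        long l2_size = sysconf(_SC_LEVEL2_CACHE_SIZE);
        std::printf("L1d: %ld bytes (line %ld bytes), L2: %ld bytes\n",
                    l1_size, l1_line, l2_size);
    }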
C++ itself doesn't "care" about CPU caches, so there's no support for querying cache sizes built into the language. If you are developing for Windows, there's the GetLogicalProcessorInformation() function, which can be used to query information about the CPU caches.
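A rough sketch of how that query usually looks: call once to learn the required buffer size, then again to fill it, and keep the entries whose relationship is RelationCache.

    #include <windows.h>
    #include <cstdio>
    #include <vector>

    int main() {
        DWORD bytes = 0;
        // First call fails with ERROR_INSUFFICIENT_BUFFER but reports the size needed.
        GetLogicalProcessorInformation(nullptr, &bytes);
        std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
            bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
        if (!GetLogicalProcessorInformation(info.data(), &bytes))
            return 1;
        for (const auto& e : info)
            if (e.Relationship == RelationCache)
                std::printf("L%u cache: %lu bytes, line size %u bytes\n",
                            (unsigned)e.Cache.Level,
                            (unsigned long)e.Cache.Size,
                            (unsigned)e.Cache.LineSize);
    }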
Preallocate a large array, then access its elements while growing the portion you touch and record the time per access. Ideally there will be a jump in access time once the working set no longer fits in a cache level; from that you can estimate your L1 cache size. It might not work, but it's worth trying.
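A rough sketch of that measurement; the results are noisy, and hardware prefetching or other processes can blur the jump, so repeat the runs.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t stride = 64 / sizeof(int);   // assume 64-byte cache lines
        for (std::size_t kb = 4; kb <= 32 * 1024; kb *= 2) {
            std::vector<int> buf(kb * 1024 / sizeof(int), 1);
            volatile int sink = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t rep = 0; rep < 100; ++rep)
                for (std::size_t i = 0; i < buf.size(); i += stride)
                    sink += buf[i];
            auto t1 = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count()
                        / (100.0 * (buf.size() / stride));
            // Access time jumps once the working set exceeds a cache level.
            std::printf("%6zu KiB: %.2f ns per access\n", kb, ns);
        }
    }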
Read the CPUID of the CPU (x86) and then determine the cache size from a look-up table. The table has to be filled with the cache sizes the CPU manufacturer publishes in its programming manuals.
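For illustration, a sketch using GCC/Clang's <cpuid.h>: on most newer x86 CPUs, leaf 0x80000006 reports the L2 size directly in ECX, so the look-up table described above is mainly needed for the older descriptor-based leaves.

    #include <cpuid.h>   // GCC/Clang, x86 only
    #include <cstdio>

    int main() {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (__get_cpuid(0x80000006, &eax, &ebx, &ecx, &edx)) {
            unsigned l2_kib    = (ecx >> 16) & 0xFFFF;  // L2 size in KiB
            unsigned line_size = ecx & 0xFF;            // line size in bytes
            std::printf("L2: %u KiB, line size %u bytes\n", l2_kib, line_size);
        }
    }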
Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.
TBB includes cache aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation, I couldn't find any direct link).
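A minimal usage sketch (the header and allocator name are from current TBB/oneTBB releases):

    #include <vector>
    #include <tbb/cache_aligned_allocator.h>

    int main() {
        // A vector whose storage starts on a cache-line boundary, which helps
        // avoid false sharing when different threads own different containers.
        std::vector<float, tbb::cache_aligned_allocator<float>> data(1024, 0.0f);
        data[0] = 1.0f;  // used exactly like a normal std::vector
    }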
Interestingly enough, I wrote a program to do this a while ago (in C, though I'm sure it would be easy to incorporate into C++ code).
http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c
The get_cache_line function is the interesting one: it returns the location right before the biggest spike in the timing data of the array accesses. It guessed correctly on my machine! If nothing else, it can help you build your own.
It's based on this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/
You can see this thread: http://software.intel.com/en-us/forums/topic/296674
The short answer is in this other thread:
On modern IA-32 hardware, the cache line size is 64. The value 128 is a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium D) where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false sharing perspective, the effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)
IIRC, GCC has a __builtin_prefetch hint.
http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html
has an excellent section on this. Basically, it suggests:
__builtin_prefetch (&array[i + LookAhead], rw, locality);
where rw is 0 (prepare for a read) or 1 (prepare for a write), and locality is a number from 0 to 3, where 0 means no locality and 3 means very strong locality.
Both are optional. LookAhead would be the number of elements to look ahead. If a memory access takes 100 cycles and the unrolled loop iterations are two cycles apart, LookAhead could be set to 50 or 51.
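Put together, the pattern looks roughly like this (GCC/Clang only; the LookAhead value here is a hypothetical tuning constant you would pick as described above):

    #include <cstddef>

    double sum_with_prefetch(const double* array, std::size_t n) {
        const std::size_t LookAhead = 16;   // hypothetical tuning value
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + LookAhead < n)
                __builtin_prefetch(&array[i + LookAhead], 0, 1);  // read, low locality
            sum += array[i];
        }
        return sum;
    }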
There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?
Determining the cache-size at compile-time
For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simply looking up the size and hard-coding it is an option (this could be automated in the build system). On most machines today, the L1 cache line should be 64 bytes.
If you want to avoid that complexity, or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It provides a compile-time estimate of the cache line size, but be aware of its limitations: the compiler cannot guess perfectly when it creates the binary, as the cache line size is, in general, architecture dependent.
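A minimal sketch of using it with a feature-test check, since not every standard library ships it yet:

    #include <cstddef>
    #include <new>       // std::hardware_constructive_interference_size (C++17)

    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t kCacheLineEstimate = std::hardware_constructive_interference_size;
    #else
    constexpr std::size_t kCacheLineEstimate = 64;  // common value on current x86/ARM
    #endif

    // Example: use the estimate as a default blocking granularity.
    constexpr std::size_t kBlockElements = kCacheLineEstimate / sizeof(double);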
Determining the cache-size at runtime
At runtime, you have the advantage that you know the exact machine, but you will need platform-specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, macOS). In a similar fashion, you can also read the L2 cache size at runtime.
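On Linux, for example, the cache geometry is also exported through sysfs; a small sketch (index2 is typically the unified L2, and the size file holds a string such as "512K"):

    #include <fstream>
    #include <string>
    #include <cstdio>

    int main() {
        // Linux only: index0/index1 are usually the L1 data/instruction caches,
        // index2 the L2 cache of cpu0.
        std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index2/size");
        std::string size;
        if (f >> size)
            std::printf("L2 cache: %s\n", size.c_str());
    }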
I would advise against trying to guess the cache line by running benchmarks at startup and measuring which one performed best. It might well work, but it is also error-prone if the CPU is used by other processes.
Combining both approaches
If you have to ship one binary, and the machines it will later run on feature a range of different architectures with varying cache sizes, you could create specialized code paths for each cache size and then dynamically (at application startup) choose the best fitting one.
The cache will usually do the right thing. The only real worry for a normal programmer is false sharing, and you can't take care of that at runtime because it requires compiler directives.
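For completeness, the usual source-level fix for false sharing is padding/alignment, as in this sketch (64 bytes is assumed here; std::hardware_destructive_interference_size can replace the constant where available):

    #include <atomic>

    // Each counter occupies its own cache line, so two threads updating
    // different counters do not invalidate each other's lines.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[8];   // e.g. one per worker thread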