I'm using google/benchmark for a project, and I just started playing around with the --benchmark_perf_counters flag. Clearly I'm doing something wrong, since the perf counters are often negative. I'm assuming it's an issue with overflow, but I still don't quite understand how the counters work to begin with.
For example, how is CACHE-MISSES 0 on the first benchmark, and then -372k on the second? Neither of those values makes sense to me.
(the two benchmarks have very similar parameters and runtime)
I'm running on Ubuntu 18.04 with an Intel(R) Xeon(R) Gold 6138 CPU. The Google Benchmark version is 1.6.1 and I have libpfm4-dev installed. I'm calling my benchmark binary with --benchmark_perf_counters=CYCLES,INSTRUCTIONS,CACHE-MISSES
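For reference, the benchmarks themselves use the standard Google Benchmark registration; a minimal sketch (BM_shift_left and its body are placeholders here, not my actual code) looks like this:

#include <benchmark/benchmark.h>
#include <vector>

// Placeholder benchmark body; the real code benchmarks bit/boost/std shift_left.
static void BM_shift_left(benchmark::State& state) {
    std::vector<unsigned> bits(1024, 0xDEADBEEFu);
    for (auto _ : state) {
        bits[0] <<= 1;
        benchmark::DoNotOptimize(bits.data());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_shift_left);
BENCHMARK_MAIN();
// Invoked as: ./bench --benchmark_perf_counters=CYCLES,INSTRUCTIONS,CACHE-MISSES

With libpfm4 available, the counters requested via --benchmark_perf_counters show up in the UserCounters column, as in the output below.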
-----------------------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
bit::shift_left (small) (AA)      3.15 ns         3.15 ns    221185726 CACHE-MISSES=0 CYCLES=11.0005 INSTRUCTIONS=15
bit::shift_left (small) (UU)      2.65 ns         2.65 ns    254254663 CACHE-MISSES=-372.709k CYCLES=553.131k INSTRUCTIONS=372.709k
boost::shift_left (small) (AA)    2.71 ns         2.71 ns    258007443 CACHE-MISSES=-367.288k CYCLES=-367.288k INSTRUCTIONS=3.87586n
std::shift_left (small)           23.5 ns         23.5 ns     29812478 CACHE-MISSES=-3.17853M CYCLES=-102.703 INSTRUCTIONS=-972.747n
I had this same problem! Upgrading to the latest version of Google Benchmark (1.7.0) fixed it and now my perf counter results make sense.
This might be a very common and simple question, but I need some explanation of the curve I just obtained from a cache benchmark. The goal here is to find the cache line size. I used the code from here:
(https://github.com/jiewmeng/cs3210-assign1/blob/master/cache-l1-line.cpp)
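The core of that benchmark is just a strided pass over a large array, timed once per stride; a minimal sketch of the idea (not the exact code from the link) is:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t kArraySize = 64 * 1024 * 1024;   // much larger than any cache level
    std::vector<char> data(kArraySize, 1);
    for (std::size_t stride = 1; stride <= 1024; stride *= 2) {
        auto t0 = std::chrono::steady_clock::now();
        // Touch one byte every `stride` bytes; while the stride stays within one
        // cache line, the extra accesses hit the already-fetched line, so the
        // total time stays roughly flat even though fewer bytes are touched.
        for (int rep = 0; rep < 4; ++rep)
            for (std::size_t i = 0; i < kArraySize; i += stride)
                data[i]++;
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %4zu: %lld ms\n", stride,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    }
}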
This is the curve I obtained from running the code on my machine (MacBook Pro with a Core i7; the cache line size is 64 bytes and the L1 data cache is 32 KB).
Time vs. stride size curve
I think the peak happens at 128 bytes and not at 64 bytes. If that is true, I want to know why.
Why is the time reduced at 512 bytes?
Update:
I also ran a code to determine the sizes of the L1 and L2 caches. Here is the figure, just to document the data. As you can see, there are two peaks, at 32 KB (the L1 cache size) and 256 KB (the L2 cache size).
Question:
I am wondering if there is any way to find the size of the L3 shared cache.
Cache size figure.
Thanks
I'm guessing that the 128-byte peak is most likely due to spatial prefetching. You can see this in Intel's Optimization Guide, section 2.1.5.4:
This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk
It wouldn't be a clean jump since this prefetcher is not always firing, and even when it does, it only prefetches into the L2, but that is still much better than fetching from memory. To make sure this is the case, you can disable the prefetchers (through the BIOS or other means, although some systems may not support that) and check again.
As for the L3 size: you didn't specify your exact model, but I'm guessing you have more than 4 MB of L3. Just keep the curve going and see if it jumps.
EDIT
Just noticed another thing - your k*i expression is probably overflowing int at the max range, which means your access pattern might not be cyclic as you expect.
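For example (a hypothetical reconstruction, since the exact types in the linked code would need checking), a 32-bit multiply wraps long before the intended range, while widening the operands first gives the expected index:

#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical values: if k and i are 32-bit ints, k * i is evaluated in
    // 32 bits and wraps past INT_MAX, so the index no longer cycles as expected.
    std::int32_t k = 70000, i = 70000;
    std::int64_t wrapped  = static_cast<std::int32_t>(std::uint32_t(k) * std::uint32_t(i));  // value a 32-bit multiply yields
    std::int64_t intended = static_cast<std::int64_t>(k) * i;                                // widen *before* multiplying
    std::printf("wrapped: %lld, intended: %lld\n", (long long)wrapped, (long long)intended);
}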
My BusSpeed benchmark was intended to identify cache sizes and performance at different strides, to show burst reading on buses:
http://www.roylongbottom.org.uk/busspd2k%20results.htm
Following are results on a Core i7 with 8 MB L3:
Memory Reg2 Reg2 Reg2 Reg2 Reg1 Reg2 Reg1 Reg2 Reg1 Reg8
KBytes Inc64 Inc32 Inc16 Inc8 Inc4 Inc4 Inc4 Inc4 Inc8 Inc8
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 10025 10800 11262 11498 11612 11634 5850 11635 23093 23090
8 10807 11267 11505 11627 11694 11694 5871 11694 23299 23297
16 11251 11488 11620 11614 11712 11719 5873 11718 23391 23398
32 9893 9853 10890 11170 11558 11492 5872 11466 21032 21025
64 3219 4620 7289 9479 10805 10805 5875 10797 14426 14426
128 3213 4805 7305 9467 10811 10810 5875 10805 14442 14408
256 3144 4592 7231 9445 10759 10733 5870 10743 14336 14337
512 2005 3497 5980 9056 10466 10467 5871 10441 13906 13905
1024 2003 3482 5974 9017 10468 10466 5874 10467 13896 13818
2048 2004 3497 5958 9088 10447 10448 5870 10447 13857 13857
4096 1963 3398 5778 8870 10328 10328 5851 10328 13591 13630
8192 1729 3045 5322 8270 9977 9963 5728 9965 12923 12892
16384 692 1402 2495 4593 7811 7782 5406 7848 8335 8337
32768 695 1406 2492 4584 7820 7826 5401 7792 8317 8322
65536 695 1414 2488 4584 7823 7826 5403 7800 8321 8321
131072 696 1402 2491 4575 7827 7824 5411 7846 8322 8323
262144 696 1413 2498 4594 7791 7826 5409 7829 8333 8334
524288 693 1416 2498 4595 7841 7842 5411 7847 8319 8285
1048576 704 1415 2478 4591 7845 7840 5410 7853 8290 8283
End of test Fri Jul 30 16:44:29 2010
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106A5
Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz Measured 2807 MHz
My C++ program is consuming a lot of CPU, and more so as it runs. I used Google Performance Tools to profile CPU usage, and this is what I got:
(pprof) top
Total: 1343 samples
1330 99.0% 99.0% 1330 99.0% 0x0000000801dcb11c
7 0.5% 99.6% 7 0.5% 0x0000000801dcb11e
4 0.3% 99.9% 4 0.3% program::threadWorker
1 0.1% 99.9% 1 0.1% 0x0000000801dcb110
1 0.1% 100.0% 1 0.1% 0x00007fffffffffc0
However, only 1 out of the 5 entries shown here is an actual function name; the rest are addresses. How can I find out what these addresses pertain to? (Of course, I am most interested in the first address shown above.)
Edit: This is how I ran the profiler:
env CPUPROFILE=prof.out ./a.out
[kill program]
pprof ./a.out prof.out
Also, I found the root cause by code inspection. But it would still be nice to have the profiler pinpoint the culprit function rather than an address.
Is it possible you haven't specified the executable when loading the results in google-pprof?
I run it as:
$ google-pprof executable /tmp/executable.hprof --text | less
and can see the function names just fine. Or could it be that those methods are in some shared library that is not in your path when you run google-pprof?
I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionality. Since performance is an issue, I did some benchmarking and would like to know if the approach I took is legitimate.
I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:
Numpy, making use only of the functionality of dot.
Python, calling the BLAS functionalities through a shared object.
C++, calling the BLAS functionalities through a shared object.
Scenario
I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5, and the matrices m1 and m2 are set up like this:
m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)
1. Numpy
The code used looks like this:
tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))
2. Python, calling BLAS through a shared object
With the function
import ctypes
from ctypes import byref, c_char, c_int, c_float

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")

def Mul(m1, m2, i, r):
    # sgemm_ follows the Fortran calling convention: every argument is passed by reference.
    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n),
                    byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n),
                    m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero),
                    r.ctypes.data_as(ctypes.c_void_p), byref(n))
the test code looks like this:
r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))
3. C++, calling BLAS through a shared object
The C++ code is naturally a little longer, so I reduce the information to a minimum.
I load the function with
void* handle = dlopen("libblas.so", RTLD_LAZY);
// Cast the symbol to the Fortran sgemm_ signature (all arguments passed by pointer).
typedef void (*sgemm_t)(char*, char*, int*, int*, int*, float*, float*, int*, float*, int*, float*, float*, int*);
sgemm_t f = (sgemm_t) dlsym(handle, "sgemm_");
I measure the time with gettimeofday like this:
gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);
where j is the index of a loop that runs 20 times. I calculate the elapsed time with
double CalcTime(timeval start, timeval end)
{
    double factor = 1000000;
    // (end - start) expressed in microseconds, converted back to seconds.
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}
Results
The result is shown in the plot below:
Questions
Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
Would you expect that the result would show such a huge discrepancy between the C++ and Python approaches? Both are using shared objects for their calculations.
Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?
Download
The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)
UPDATE (30.07.2014):
I re-ran the benchmark on our new HPC.
Both the hardware and the software stack changed from the setup in the original answer.
I put the results in a Google spreadsheet (it also contains the results from the original answer).
Hardware
Our HPC has two different nodes, one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:
Sandy (MKL, OpenBLAS, ATLAS):
CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
RAM: 64 GB
Ivy (MKL, OpenBLAS, ATLAS):
CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
RAM: 256 GB
Software
The software stack is the same for both nodes. Instead of GotoBLAS2, OpenBLAS is used, and there is also a multi-threaded ATLAS BLAS set to 8 threads (hardcoded).
OS: Suse
Intel Compiler: ictce-5.3.0
Numpy: 1.8.0
OpenBLAS: 0.2.6
ATLAS: 3.8.4
Dot-Product Benchmark
The benchmark code is the same as below. However, for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL --> Nehalem MKL, Netlib BLAS --> Nehalem Netlib BLAS, etc.).
Single threaded performance:
Multi threaded performance (8 threads):
Threads vs Matrix size (Ivy Bridge MKL):
Benchmark Suite
Single threaded performance:
Multi threaded (8 threads) performance:
Conclusion
The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of the eigenvalue test.
The eigenvalue test performs only reasonably well with OpenBLAS in single-threaded mode.
In multi-threaded mode the performance is worse.
The "Matrix size vs threads" chart also shows that although MKL as well as OpenBLAS generally scale well with the number of cores/threads, this depends on the size of the matrix. For small matrices, adding more cores won't improve performance very much.
There is also an approximately 30% performance increase from Sandy Bridge to Ivy Bridge, which might be due to the higher clock rate (+0.8 GHz) and/or the better architecture.
Original Answer (04.10.2011):
Some time ago I had to optimize some linear algebra calculations/algorithms written in Python using numpy and BLAS, so I benchmarked/tested different numpy/BLAS configurations.
Specifically I tested:
Numpy with ATLAS
Numpy with GotoBlas2 (1.13)
Numpy with MKL (11.1/073)
Numpy with Accelerate Framework (Mac OS X)
I ran two different benchmarks:
simple dot product of matrices with different sizes
Benchmark suite which can be found here.
Here are my results:
Machines
Linux (MKL, ATLAS, No-MKL, GotoBlas2):
OS: Ubuntu Lucid 10.04 64 Bit.
CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
RAM: 24 GB
Intel Compiler: 11.1/073
Scipy: 0.8
Numpy: 1.5
Mac Book Pro (Accelerate Framework):
OS: Mac OS X Snow Leopard (10.6)
CPU: 1 Intel Core 2 Duo 2.93 GHz (2 Cores)
RAM: 4 GB
Scipy: 0.7
Numpy: 1.3
Mac Server (Accelerate Framework):
OS: Mac OS X Snow Leopard Server (10.6)
CPU: 4 x Intel(R) Xeon(R) E5520 @ 2.26 GHz (8 Cores)
RAM: 4 GB
Scipy: 0.8
Numpy: 1.5.1
Dot product benchmark
Code:
import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)
Results:
System | size = 1000 | size = 2000 | size = 3000 |
netlib BLAS | 1350 ms | 10900 ms | 39200 ms |
ATLAS (1 CPU) | 314 ms | 2560 ms | 8700 ms |
MKL (1 CPUs) | 268 ms | 2110 ms | 7120 ms |
MKL (2 CPUs) | - | - | 3660 ms |
MKL (8 CPUs) | 39 ms | 319 ms | 1000 ms |
GotoBlas2 (1 CPU) | 266 ms | 2100 ms | 7280 ms |
GotoBlas2 (2 CPUs)| 139 ms | 1009 ms | 3690 ms |
GotoBlas2 (8 CPUs)| 54 ms | 389 ms | 1250 ms |
Mac OS X (1 CPU) | 143 ms | 1060 ms | 3605 ms |
Mac Server (1 CPU)| 92 ms | 714 ms | 2130 ms |
Benchmark Suite
Code:
For additional information about the benchmark suite see here.
Results:
System | eigenvalues | svd | det | inv | dot |
netlib BLAS | 1688 ms | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU) | 1210 ms | 5897 ms | 170 ms | 560 ms | 893 ms |
MKL (1 CPUs) | 691 ms | 4475 ms | 141 ms | 450 ms | 736 ms |
MKL (2 CPUs) | 552 ms | 2718 ms | 96 ms | 267 ms | 423 ms |
MKL (8 CPUs) | 525 ms | 1679 ms | 60 ms | 137 ms | 197 ms |
GotoBlas2 (1 CPU) | 2124 ms | 4636 ms | 147 ms | 456 ms | 743 ms |
GotoBlas2 (2 CPUs)| 1560 ms | 3278 ms | 116 ms | 295 ms | 460 ms |
GotoBlas2 (8 CPUs)| 741 ms | 2914 ms | 82 ms | 262 ms | 192 ms |
Mac OS X (1 CPU) | 948 ms | 4339 ms | 151 ms | 318 ms | 566 ms |
Mac Server (1 CPU)| 1033 ms | 3645 ms | 99 ms | 232 ms | 342 ms |
Installation
Installation of MKL included installing the complete Intel Compiler Suite, which is pretty straightforward. However, because of some bugs/issues, configuring and compiling numpy with MKL support was a bit of a hassle.
GotoBlas2 is a small package which can be easily compiled as a shared library. However, because of a bug, you have to re-create the shared library after building it in order to use it with numpy.
In addition, building it for multiple target platforms didn't work for some reason, so I had to create an .so file for each platform for which I want an optimized libgoto2.so file.
If you install numpy from Ubuntu's repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (Fortran, etc.).
If you install numpy on a Mac OS X machine with Fink or MacPorts it will either configure numpy to use ATLAS or Apple's Accelerate Framework.
You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().
Conclusions
MKL performs best, closely followed by GotoBlas2.
In the eigenvalue test, GotoBlas2 performs worse than expected. I'm not sure why this is the case.
Apple's Accelerate Framework performs really well, especially in single-threaded mode (compared to the other BLAS implementations).
Both GotoBlas2 and MKL scale very well with the number of threads. So if you have to deal with big matrices, running them on multiple threads will help a lot.
In any case, don't use the default netlib BLAS implementation because it is way too slow for any serious computational work.
On our cluster I also installed AMD's ACML and performance was similar to MKL and GotoBlas2. I don't have any numbers though.
I personally would recommend using GotoBlas2 because it's easier to install and it's free.
If you want to code in C++/C, also check out Eigen3, which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.
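For example, a minimal Eigen3 version of the same single-precision matrix product looks like this (just a sketch; compile with optimizations, e.g. g++ -O3, and with Eigen's headers on the include path):

#include <Eigen/Dense>
#include <iostream>

int main() {
    // Single-precision matrices, using the 500x500 size from the question's benchmark.
    Eigen::MatrixXf m1 = Eigen::MatrixXf::Random(500, 500);
    Eigen::MatrixXf m2 = Eigen::MatrixXf::Random(500, 500);
    Eigen::MatrixXf r  = m1 * m2;   // Eigen's own GEMM kernel, no external BLAS needed
    std::cout << r(0, 0) << "\n";   // use the result so the product isn't optimized away
}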
I've run your benchmark. There is no difference between C++ and numpy on my machine:
Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
It seems fair, as there is no difference in the results.
Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
No.
Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?
Make sure that numpy uses an optimized version of the BLAS/LAPACK libraries on your system.
Here's another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip
http://dl.dropbox.com/u/5453551/blas_call_benchmark.png
I do not see essentially any difference between the different methods for large matrices, between Numpy, Ctypes and Fortran. (Fortran instead of C++ --- and if this matters, your benchmark is probably broken.)
Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should be instead ... - ((double)start.tv_usec)). Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as number of threads, or between real time and CPU time?
EDIT: failed to count the braces in the CalcTime function -- it's OK.
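As an aside, one way to avoid this kind of hand-written microsecond arithmetic entirely is std::chrono, which keeps the unit conversion in the type system; a small sketch (not part of the original benchmark):

#include <chrono>

// Times a single call of f and returns the elapsed wall-clock time in seconds.
template <class F>
double CalcTimeChrono(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

The timing loop from the question would then read dTimes[j] = CalcTimeChrono([&]{ f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim); });.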
As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.
To find out which BLAS Numpy is linked against, do:
$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>>
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
linux-vdso.so.1 => (0x00007fff5ebff000)
libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)
UPDATE: If you can't import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing!
The reply from @Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.
To fix the situation, you need either ATLAS or MKL --- check these instructions: http://scipy.org/Installing_SciPy/Linux Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).
Given the rigor you've shown with your analysis, I'm surprised by the results thus far. I put this as an 'answer' but only because it's too long for a comment and does provide a possibility (though I expect you've considered it).
I would've thought the numpy/Python approach wouldn't add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that Python participates in should be small. I'm more interested in the results on the right-hand side of the graph, but the orders-of-magnitude discrepancy shown there would be disturbing.
I wonder if you're using the best algorithms that numpy can leverage. From the compilation guide for linux:
"Build FFTW (3.1.2):
SciPy Versions >= 0.7 and Numpy >= 1.2:
Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack.
There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis.
Downgrade to a Numpy/Scipy version that includes support.
Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example."
Did you compile numpy with MKL? (http://software.intel.com/en-us/articles/intel-mkl/). If you're running on Linux, the instructions for compiling numpy with MKL are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of the URL). The key part is:
[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core
If you're on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.
Caveat, in either case, there are many licensing issues and so on to be aware of, but the intel page explains those. Again, I imagine you've considered this, but if you meet the licensing requirements (which on linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I'll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.
Let me contribute a somewhat weird finding.
My numpy is linked to MKL, as shown by numpy.show_config(). I have no idea what kind of libblas.so was used for the C++/BLAS case. I hope someone can tell me a way to figure it out.
I think the results strongly depend on the library that was used. I cannot isolate the efficiency of C++/BLAS.