Benchmarking (python vs. c++ using BLAS) and (numpy)

I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionality. Since performance is an issue, I did some benchmarking and would like to know whether the approach I took is legitimate.
I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:
Numpy, making use only of the functionality of dot.
Python, calling the BLAS functionalities through a shared object.
C++, calling the BLAS functionalities through a shared object.
Scenario
I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 in increments of 5, and the matrices m1 and m2 are set up like this:
m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)
1. Numpy
The code used looks like this:
tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))
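As an aside: timeit.repeat(20, 1) returns 20 individual wall-clock times, and the usual convention is to keep the minimum as the least-disturbed measurement. Since an n x n product performs roughly 2*n^3 floating-point operations, the timings can also be expressed as GFLOP/s. A minimal sketch, assuming the rNumpy list of (i, timings) pairs built above:
# Hedged sketch: summarize the (i, [timings]) pairs collected above.
for n, times in rNumpy:
    best = min(times)                 # fastest of the 20 repetitions
    gflops = 2.0 * n**3 / best / 1e9  # ~2*n^3 flops per n x n matrix product
    print("n=%4d  %.6f s  %.2f GFLOP/s" % (n, best, gflops))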
2. Python, calling BLAS through a shared object
With the function
import ctypes
from ctypes import byref, c_char, c_int, c_float

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")

def Mul(m1, m2, i, r):
    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    # Fortran sgemm_ with alpha = 1, beta = 0; every argument is passed by reference.
    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n),
                    byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n),
                    m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero),
                    r.ctypes.data_as(ctypes.c_void_p), byref(n))
the test code looks like this:
r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))
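As an aside: NumPy arrays are row-major while Fortran BLAS assumes column-major storage, so this sgemm_ call effectively fills r with m2·m1 rather than m1·m2 (the cost is identical, so the timing is unaffected). A minimal sanity check of the wrapper, assuming the Mul, m1, m2, i and r defined above:
# Hedged sketch: verify the ctypes sgemm_ wrapper against numpy.dot.
# The row-major buffers are read as column-major by Fortran BLAS, so r ends up
# holding the product in reversed order, i.e. m2.dot(m1).
Mul(m1, m2, i, r)
assert numpy.allclose(r, numpy.dot(m2, m1), rtol=1e-3), "BLAS result differs from numpy.dot"
If this check passes, any speed difference between the two Python variants cannot be blamed on the wrapper computing something different.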
3. c++, calling BLAS through a shared object
Now the C++ code is naturally a little longer, so I reduce the information to a minimum.
I load the function with
void* handle = dlopen("libblas.so", RTLD_LAZY);
typedef void (*sgemm_t)(char*, char*, int*, int*, int*, float*, float*, int*, float*, int*, float*, float*, int*);
sgemm_t f = (sgemm_t) dlsym(handle, "sgemm_");  // f is the function pointer called below
I measure the time with gettimeofday like this:
gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);
where j is a loop running 20 times. I calculate the time passed with
double CalcTime(timeval start, timeval end)
{
    double factor = 1000000;
    return (((double)end.tv_sec) * factor + ((double)end.tv_usec)
            - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}
Results
The result is shown in the plot below:
Questions
Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?
Download
The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)

UPDATE (30.07.2014):
I re-ran the benchmark on our new HPC.
Both the hardware and the software stack changed from the setup in the original answer.
I put the results in a Google spreadsheet (it also contains the results from the original answer).
Hardware
Our HPC has two different nodes, one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:
Sandy (MKL, OpenBLAS, ATLAS):
CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
RAM: 64 GB
Ivy (MKL, OpenBLAS, ATLAS):
CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
RAM: 256 GB
Software
The software stack is the same for both nodes. Instead of GotoBLAS2, OpenBLAS is used, and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).
OS: Suse
Intel Compiler: ictce-5.3.0
Numpy: 1.8.0
OpenBLAS: 0.2.6
ATLAS: 3.8.4
Dot-Product Benchmark
The benchmark code is the same as below. However, for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL --> Nehalem MKL, Netlib Blas --> Nehalem Netlib BLAS, etc.).
Single threaded performance:
Multi threaded performance (8 threads):
Threads vs Matrix size (Ivy Bridge MKL):
Benchmark Suite
Single threaded performance:
Multi threaded (8 threads) performance:
Conclusion
The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of the eigenvalue test.
The eigenvalue test performs only reasonably well on OpenBLAS in single-threaded mode.
In multi-threaded mode the performance is worse.
The "Matrix size vs threads" chart also shows that although MKL as well as OpenBLAS generally scale well with the number of cores/threads, it depends on the size of the matrix. For small matrices adding more cores won't improve performance very much.
There is also an approximately 30% performance increase from Sandy Bridge to Ivy Bridge, which might be due to the higher clock rate (+0.8 GHz) and/or the better architecture.
Original Answer (04.10.2011):
Some time ago I had to optimize some linear algebra calculations/algorithms which were written in Python using numpy and BLAS, so I benchmarked/tested different numpy/BLAS configurations.
Specifically I tested:
Numpy with ATLAS
Numpy with GotoBlas2 (1.13)
Numpy with MKL (11.1/073)
Numpy with Accelerate Framework (Mac OS X)
I did run two different benchmarks:
simple dot product of matrices with different sizes
Benchmark suite which can be found here.
Here are my results:
Machines
Linux (MKL, ATLAS, No-MKL, GotoBlas2):
OS: Ubuntu Lucid 10.04 64 Bit.
CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
RAM: 24 GB
Intel Compiler: 11.1/073
Scipy: 0.8
Numpy: 1.5
Mac Book Pro (Accelerate Framework):
OS: Mac OS X Snow Leopard (10.6)
CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
RAM: 4 GB
Scipy: 0.7
Numpy: 1.3
Mac Server (Accelerate Framework):
OS: Mac OS X Snow Leopard Server (10.6)
CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 GHz (8 Cores)
RAM: 4 GB
Scipy: 0.8
Numpy: 1.5.1
Dot product benchmark
Code:
import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)
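As an aside, %timeit is an IPython magic; the same measurement can be reproduced outside IPython with the standard timeit module. A minimal sketch, with size as a placeholder value:
import timeit
import numpy as np

size = 1000  # placeholder; the benchmark used 1000, 2000 and 3000
a = np.random.random_sample((size, size))
b = np.random.random_sample((size, size))

# 10 repetitions of a single np.dot call; report the fastest one.
best = min(timeit.repeat(lambda: np.dot(a, b), repeat=10, number=1))
print("size=%d: %.1f ms" % (size, best * 1e3))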
Results:
System | size = 1000 | size = 2000 | size = 3000 |
netlib BLAS | 1350 ms | 10900 ms | 39200 ms |
ATLAS (1 CPU) | 314 ms | 2560 ms | 8700 ms |
MKL (1 CPUs) | 268 ms | 2110 ms | 7120 ms |
MKL (2 CPUs) | - | - | 3660 ms |
MKL (8 CPUs) | 39 ms | 319 ms | 1000 ms |
GotoBlas2 (1 CPU) | 266 ms | 2100 ms | 7280 ms |
GotoBlas2 (2 CPUs)| 139 ms | 1009 ms | 3690 ms |
GotoBlas2 (8 CPUs)| 54 ms | 389 ms | 1250 ms |
Mac OS X (1 CPU) | 143 ms | 1060 ms | 3605 ms |
Mac Server (1 CPU)| 92 ms | 714 ms | 2130 ms |
Benchmark Suite
Code:
For additional information about the benchmark suite see here.
Results:
System | eigenvalues | svd | det | inv | dot |
netlib BLAS | 1688 ms | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU) | 1210 ms | 5897 ms | 170 ms | 560 ms | 893 ms |
MKL (1 CPUs) | 691 ms | 4475 ms | 141 ms | 450 ms | 736 ms |
MKL (2 CPUs) | 552 ms | 2718 ms | 96 ms | 267 ms | 423 ms |
MKL (8 CPUs) | 525 ms | 1679 ms | 60 ms | 137 ms | 197 ms |
GotoBlas2 (1 CPU) | 2124 ms | 4636 ms | 147 ms | 456 ms | 743 ms |
GotoBlas2 (2 CPUs)| 1560 ms | 3278 ms | 116 ms | 295 ms | 460 ms |
GotoBlas2 (8 CPUs)| 741 ms | 2914 ms | 82 ms | 262 ms | 192 ms |
Mac OS X (1 CPU) | 948 ms | 4339 ms | 151 ms | 318 ms | 566 ms |
Mac Server (1 CPU)| 1033 ms | 3645 ms | 99 ms | 232 ms | 342 ms |
Installation
Installation of MKL included installing the complete Intel Compiler Suite, which is pretty straightforward. However, because of some bugs/issues, configuring and compiling numpy with MKL support was a bit of a hassle.
GotoBlas2 is a small package which can be easily compiled as a shared library. However, because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition to this, building it for multiple target platforms didn't work for some reason, so I had to create an .so file for each platform for which I wanted an optimized libgoto2.so file.
If you install numpy from Ubuntu's repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (fortran, etc).
If you install numpy on a Mac OS X machine with Fink or Mac Ports it will either configure numpy to use ATLAS or Apple's Accelerate Framework.
You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().
Conclusions
MKL performs best, closely followed by GotoBlas2.
In the eigenvalue test GotoBlas2 performs surprisingly worse than expected. Not sure why this is the case.
Apple's Accelerate Framework performs really well, especially in single-threaded mode (compared to the other BLAS implementations).
Both GotoBlas2 and MKL scale very well with the number of threads. So if you have to deal with big matrices, running it on multiple threads will help a lot.
In any case don't use the default netlib BLAS implementation because it is way too slow for any serious computational work.
On our cluster I also installed AMD's ACML and performance was similar to MKL and GotoBlas2. I don't have any numbers though.
I personally would recommend using GotoBlas2 because it's easier to install and it's free.
If you want to code in C++/C also check out Eigen3, which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.

I've run your benchmark. There is no difference between C++ and numpy on my machine:
Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
It seems fair, since there is no difference in the results.
Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
No.
Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?
Make sure that numpy uses optimized version of BLAS/LAPACK libraries on your system.
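For example, a minimal check along these lines (numpy.show_config() prints the BLAS/LAPACK build configuration):
import numpy

# An optimized setup typically reports MKL, OpenBLAS, ATLAS or Accelerate here;
# an empty/netlib-only report suggests the slow reference BLAS is being used.
numpy.show_config()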

Here's another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip
http://dl.dropbox.com/u/5453551/blas_call_benchmark.png
I do not see essentially any difference between the different methods for large matrices, between Numpy, Ctypes and Fortran. (Fortran instead of C++ --- and if this matters, your benchmark is probably broken.)
Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should be instead ... - ((double)start.tv_usec)). Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as number of threads, or between real time and CPU time?
EDIT: failed to count the braces in the CalcTime function -- it's OK.
As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.
To find out which BLAS Numpy is linked against, do:
$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>>
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
linux-vdso.so.1 => (0x00007fff5ebff000)
libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)
UPDATE: If you can't import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing!
The reply from @Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.
To fix the situation, you need either ATLAS or MKL --- check these instructions: http://scipy.org/Installing_SciPy/Linux Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).

Given the rigor you've shown with your analysis, I'm surprised by the results thus far. I put this as an 'answer' but only because it's too long for a comment and does provide a possibility (though I expect you've considered it).
I would've thought the numpy/python approach wouldn't add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that Python participates in should be small. I'm more interested in the results on the right-hand side of the graph, but the orders-of-magnitude discrepancy shown there would be disturbing.
I wonder if you're using the best algorithms that numpy can leverage. From the compilation guide for linux:
"Build FFTW (3.1.2):
SciPy Versions >= 0.7 and Numpy >= 1.2:
Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack.
There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis.
Downgrade to a Numpy/Scipy version that includes support.
Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example."
Did you compile numpy with mkl? (http://software.intel.com/en-us/articles/intel-mkl/). If you're running on linux, the instructions for compiling numpy with mkl are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of url). The key part is:
[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core
If you're on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.
Caveat, in either case, there are many licensing issues and so on to be aware of, but the intel page explains those. Again, I imagine you've considered this, but if you meet the licensing requirements (which on linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I'll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.

Let me contribute a somewhat weird finding.
My numpy is linked to MKL, as given by numpy.show_config(). I have no idea what kind of libblas.so the C++/BLAS test used. I hope someone can tell me a way to figure it out.
I think the results strongly depend on the library that was used, so I cannot isolate the efficiency of C++/BLAS.
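One way to check this on Linux: load the library the same way the C++ code's dlopen("libblas.so") does and inspect /proc/self/maps to see which file the dynamic linker actually resolved (running ldd on the compiled C++ binary gives the same information). A minimal sketch:
import ctypes

# Load "libblas.so" just like the C++ benchmark does, then list which
# shared objects containing "blas" are mapped into this process.
ctypes.CDLL("libblas.so")
with open("/proc/self/maps") as maps:
    paths = sorted({line.split()[-1] for line in maps if "blas" in line})
print("\n".join(paths))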

Related

Google benchmark negative perf counters

I'm using Google/benchmark for a project, and I just started playing around with the --benchmark_perf_counters flag. Clearly I'm doing something wrong, since the perf counters are often negative. I'm assuming it's an issue w/ overflow, but I still don't quite understand how the counters work to begin with.
For example, how is CACHE-MISSES 0 on the first benchmark, and then -372k on the second? Neither of those values make sense to me.
(the two benchmarks have very similar parameters and runtime)
I'm running on Ubuntu 18.04 w/ an Intel(R) Xeon(R) Gold 6138 CPU. Google benchmark version is 1.6.1 and I have libpfm4-dev installed. I'm calling my benchmark binary w/ --benchmark_perf_counters=CYCLES,INSTRUCTIONS,CACH-MISSES
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
bit::shift_left (small) (AA) 3.15 ns 3.15 ns 221185726 CACHE-MISSES=0 CYCLES=11.0005 INSTRUCTIONS=15
bit::shift_left (small) (UU) 2.65 ns 2.65 ns 254254663 CACHE-MISSES=-372.709k CYCLES=553.131k INSTRUCTIONS=372.709k
boost::shift_left (small) (AA) 2.71 ns 2.71 ns 258007443 CACHE-MISSES=-367.288k CYCLES=-367.288k INSTRUCTIONS=3.87586n
std::shift_left (small) 23.5 ns 23.5 ns 29812478 CACHE-MISSES=-3.17853M CYCLES=-102.703 INSTRUCTIONS=-972.747n
I had this same problem! Upgrading to the latest version of Google Benchmark (1.7.0) fixed it and now my perf counter results make sense.

Strange behaviour of Parallel Boost Graph Library example code

I have set up simple tests with Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour I would like to explain.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 mln vertices and 100 mln edges);
Build modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp
Example was slightly extended to proceed with Eager, Crauser and delta-stepping algorithms.
Note: building of the example required some obscure workaround about the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (example code is in fact incompatible with the new mutable_queue)
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
Run
mpiexec -n 1 mytest.exe mydata.me
mpiexec -n 2 mytest.exe mydata.me
mpiexec -n 4 mytest.exe mydata.me
mpiexec -n 8 mytest.exe mydata.me
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 device thread (processor load 12.5%)
delta stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
crash in the stage of data load.
-n 4:
mem usage: 40+ Gb in roughly equal parts in 4 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged in the margins of observation error.
-n 8:
mem usage: 44+ Gb in roughly equal parts in 8 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged in the margins of observation error.
So, apart from the inappropriate memory usage and very low total performance, the only changes I observe when more MPI processes are running are a slightly increased total memory consumption and a linear rise of processor load.
The fact that the initial graph is somehow partitioned between processes (probably by vertex number ranges) is nevertheless evident.
What is wrong with this test (and, probably, with my idea of MPI usage as a whole)?
My environment:
- one Win 10 PC with 64 GB RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset 141;
- boost 1.71
N.B. See original example code here.

How to troubleshoot cudaErrorUnknown from cudaDeviceSynchronize()?

I have a large code base that performs RGB to YUV color conversion with CUDA kernels. Since I am doing a lot of parallel conversions, I use streams (maybe that is relevant here). The code is running on Linux. It works fine on a Quadro K4200 GPU, but I recently got a new Quadro P4000 GPU on which I constantly get cudaErrorUnknown when calling cudaDeviceSynchronize(). Before this happens, the only things I do are a call to cuMemcpy2DAsync to copy the pixel data and, after that, a call to my kernel. The code base is large and I can share some relevant parts, but can anyone give advice on how I could troubleshoot this? Since I was working with the K4200 all the time, I haven't changed the CUDA compiler flags. Should I do that? I am currently compiling the same code for both cards with the following flags:
--compiler-bindir /usr/bin/gcc-4.9 -gencode=arch=compute_30,code=\"sm_30,compute_30\" -cudart static -maxrregcount=0 --machine 64 --compile -g -G -std=c++11 -D_MWAITXINTRIN_H_INCLUDED
But in that case is it even possible to make a single object that runs on different GPUs?
This is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:04:00.0 Off | N/A |
| 46% 39C P0 29W / 105W | 0MiB / 8112MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K4200 Off | 00000000:84:00.0 Off | N/A |
| 30% 40C P0 26W / 110W | 0MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Should I disable the old card, could the driver seeing both cards start to behave incorrectly? Are there any internal NVIDIA logs/tools that I can use to get a more detailed description of what is failing?
How to troubleshoot ... ?
By transforming your program into a Minimal, Complete, Verifiable Example (MCVE) of this issue manifesting.
This will focus your "list of suspects" to very few CUDA API calls, which should either be enough for you to figure out the problem by yourself or would make it possible for you to post the whole thing (in a different question) here and get proper help. Or you'll find out the problem goes away as you drop supposedly-irrelevant parts of the code, meaning that it lies with what you've just removed.
Recompiling the kernel with the correct architecture flags -gencode=arch=compute_61,code=sm_61 as suggested by @tera fixed it for the Quadro P4000. However, now the same code fails on the Quadro K4200, but this time with a reasonable error, cudaErrorNoKernelImageForDevice:
This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration.
So apparently my biggest problem was the lack of knowledge to understand what could be causing the cudaErrorUnknown.

pthread multithreading in Mac OS vs windows multithreaing

I've developed a multi-platform program (using the FLTK toolkit) and implemented multithreading to do background intensive tasks.
I have followed the FLTK tutorials/examples on multithreading, which involve using pthread on Mac, i.e. the function pthread_create, and Windows threading on Windows, i.e. _beginthread.
What I have noticed is that the performance is much higher on Windows, i.e. 3 to 4 times faster in these background threads (in the time to execute them).
Why might this be? Is it the threading libraries I'm using? Surely there shouldn't be such a difference? Or could it be the runtime libraries underneath it all?
Here are my machine details
Mac:
Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
16 GB DDR3 1600 MHz
Model MacBookPro9,1
OS: Mac OSX 10.8.5
Windows:
Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
16 GB DDR3 1600 MHz
Model: Dell Latitude E5530
OS: Windows 7 Service Pack 1
EDIT
To just do a basic speed comparison I compiled this on both machines running from the command line
#include <cmath>
#include <ctime>
#include <iomanip>
#include <iostream>
#include <sstream>

int main(int argc, char **argv)
{
    time_t t = time(NULL);
    tm* tt = localtime(&t);
    std::stringstream s;
    s<< std::setfill('0')<<std::setw(2)<<tt->tm_mday<<"/"<<std::setw(2)<<tt->tm_mon+1<<"/"<< std::setw(4)<<tt->tm_year+1900<<" "<< std::setw(2)<<tt->tm_hour<<":"<<std::setw(2)<<tt->tm_min<<":"<<std::setw(2)<<tt->tm_sec;
    std::cout<<"1: "<<s.str()<<std::endl;

    double sum=0;
    for (int i=0;i<100000000;i++){
        double ii=i*0.123456789;
        sum=sum+sin(ii)*cos(ii);
    }

    t = time(NULL);
    tt=localtime(&t);
    s.str("");
    s<< std::setfill('0')<<std::setw(2)<<tt->tm_mday<<"/"<<std::setw(2)<<tt->tm_mon+1<<"/"<< std::setw(4)<<tt->tm_year+1900<<" "<< std::setw(2)<<tt->tm_hour<<":"<<std::setw(2)<<tt->tm_min<<":"<<std::setw(2)<<tt->tm_sec;
    std::cout<<"2: "<<s.str()<<std::endl;
}
Windows takes less than a second. Mac takes 4-5 seconds. Any ideas?
On Mac I'm compiling with g++, and with Visual Studio 2013 on Windows.
SECOND EDIT
if I change the line
std::cout<<"2: "<<s.str()<<std::endl;
to
std::cout<<"2: "<<s.str()<<" "<<sum<<std::endl;
Then all of a sudden Windows takes a little bit longer...
This makes me think that the whole thing might be compiler optimisation. So the question would be: is g++ (4.2 is the version I have) worse at optimisation, or do I need to provide additional flags?
THIRD(!) AND FINAL EDIT
I can report that I achieve comparable performance by ensuring the g++ optimisation flag -O was provided at compile time. It's one of those annoying things that happens so often:
A: Im tearing my hair out on problem x
B: Are you sure you're not doing y?
A: That works, why is this information not plastered all over the place and in every tutorial on problem x on the web?
B: Did you read the manual?
A: No, if I completely read the manual for every single bit of code/program I used I would never actually get round to doing anything...
Meh.

Hyper-threading Performance Comparison

I have written a project, which uses some basic functions in openssl such as RAND_bytes and des_ecb_encrypt.
My computer has an i7-2600 (4 cores and 8 logical CPUs). When I run my project with 4 threads, it costs 10 seconds. When I run it with 8 threads, it also costs 10 seconds.
What I mean is that hyper-threading doesn't give me any performance improvement. On Linux, the experiment result is the same.
The page I found here tells me that hyper-threading doesn't give an improvement in some situations. Also, the page I found here gives me some intuitive results.
However, I have tried to write some simple tests to find simple examples that show hyper-threading won't give an apparent improvement. Sadly, I haven't found any.
So, my question is whether there are some simple tests that show hyper-threading won't give any performance improvement.
You may find that hyperthreading helps more on code that is using large amounts of memory, so that the processor is regularly blocked on fetching from memory.
In my experience, it's quite hard to find "simple code" that shows benefits from hyperthreading. It tends to be more complex examples that show the benefit. Still, the benefit will most likely not be 2x that of "no hyperthreading". Count on getting perhaps 20-30% improvement.
Hyper-threading takes advantage of the fact that the CPU has many execution components; without hyper-threading, when one is in use the others just sit there idle. You can try writing two types of threads: one doing integer calculations (which will hopefully use the ALU) and one doing floating point arithmetic (which will hopefully use the FPU).
I did not try this myself, but it seems that in such a scenario hyper-threading should improve the performance.
To show the opposite you can use only one type of thread (either threads doing only integer operations or threads doing only floating point operations).
It may also be that your test is flawed, but in order to know if that is the case we'll need more information about that test.
I have written a project, which use some basic functions in openssl such as RAND_bytes and des_ecb_encrypt... My computer has i7-2600(4 cores and 8 logic CPU). When I run my project with 4 threads, it will costs 10 seconds. When I run it with 8 threads, it also costs 10 seconds.
When using RDRAND (which RAND_bytes will do in this case), the bus is the limiting factor. You should peak at around 800 MB/sec. It does not matter how many threads you have - the bus cannot transfer data fast enough. See Intel rdrand instruction revisited.
If you used AES, then you might see a better speedup over the DES/3DES observations. Your Ivy Bridge has AES-NI, and it can achieve almost 1.3 cycles/byte, which should be about double or triple the speed of AES in software. To ensure you are using the AES-NI instructions, you have to use the EVP_* interfaces.
I found here tells me that hyper-threading doesn't give me some improvement in some situations. Also, I found here give me some intuitive results.
I think @selalerer and @Mats Petersson answered your question. The problem does not scale linearly and there's a maximum speedup you will encounter. Intel states it's about 30%.
Intel's newest architecture favors Out-Of-Order execution over Hyper-Threading execution because it's supposed to be more efficient. Read about the Silvermont processor cores.
But if you want a formal deep dive, then see a book on computer engineering. Here's the book we used when I studied it in college: Computer Organization and Design (it's probably a bit dated now).
However, I have tried to write some simple tests and found some simple examples which will show hyper-threading won't give me apparent improvement.
OpenSSL also has a benchmarking app. See the source code in <openssl source>/apps/speed.c.
Also, benchmarking apps have their own personalities. An encryption stress test may not reveal the differences as predominantly as you hope to see them. See, for example, Benchmarking Tools.
Following are details and results of my MP benchmarks for Linux and Windows, which can behave differently. There is not much on HT, but the Linux tests include an Atom (1 core, 2 threads) and Windows has Core i7 results (4 cores + 4 HT).
http://www.roylongbottom.org.uk/linux%20multithreading%20benchmarks.htm
http://www.roylongbottom.org.uk/quad%20core%208%20thread.htm
Take your pick, depending on what you want to prove about whether HT provides better or worse performance. Following are RandMem results on the i7 (Linux seems better on this test). For CPUs such as the i7, you also need to consider Turbo Boost, which might be lower with multiple threads.
CPUs MBytes Per Second Using Threads Gain At Threads
/HTs 1 2 4 6 8 2 4 6 8
Serial RD
Core i7 4/8 L1 11458 22661 37039 43717 46374 2.0 3.2 3.8 4.0
930 L2 10380 20832 32853 41711 42839 2.0 3.2 4.0 4.1
#### MHz L3 8828 17743 29610 38414 40330 2.0 3.4 4.4 4.6
Win 764 RAM 4266 8712 17347 24946 25589 2.0 4.1 5.8 6.0
Serial RW
Core i7 4/8 L1 15282 13724 16240 16209 18379 0.9 1.1 1.1 1.2
930 L2 12223 18216 25326 28104 27047 1.5 2.1 2.3 2.2
#### MHz L3 10234 19266 21931 24450 26351 1.9 2.1 2.4 2.6
Win 764 RAM 4533 7656 13876 14543 13390 1.7 3.1 3.2 3.0
Random RD
Core i7 4/8 L1 11266 22548 38174 45592 47141 2.0 3.4 4.0 4.2
930 L2 6233 12463 20059 24986 25667 2.0 3.2 4.0 4.1
#### MHz L3 3499 6915 9211 10002 9531 2.0 2.6 2.9 2.7
Win 764 RAM 459 909 1241 1398 1364 2.0 2.7 3.0 3.0
Random RW
Core i7 4/8 L1 14375 3027 2780 2901 3297 0.2 0.2 0.2 0.2
930 L2 5887 4555 6117 6693 7281 0.8 1.0 1.1 1.2
#### MHz L3 3104 4604 4721 5047 4933 1.5 1.5 1.6 1.6
Win 764 RAM 428 860 899 948 1026 2.0 2.1 2.2 2.4
#### 2.8 GHz running at up to 3.06 GHz via Turbo Boost, dual channel 1066 MHz DDR3 RAM
Then there is the MP Whetstone benchmark, which shows real gains:
MWIPS MFLOP MFLOP MFLOP COS EXP FIXPT IF EQUAL
CPU MHz 1 2 3 MOPS MOPS MOPS MOPS MOPS
Core i7 1 Thrd #### 3115 1065 886 738 79.3 39.7 2447 2936 1154
Core i7 Win7 #### 21690 8676 7621 5844 531 291 16643 12027 5034
Quad Core Thread 1 1091 1027 728 66.4 36.5 2050 1501 629
Plus HT Thread 2 1089 1037 742 66.0 36.5 2090 1507 630
Thread 3 1090 946 742 66.8 36.5 2069 1534 631
Thread 4 1092 1037 727 66.6 36.6 2031 1501 630
Thread 5 1042 959 736 66.4 36.5 1912 1483 630
Thread 6 1091 874 723 66.6 36.1 2049 1507 629
Thread 7 1090 867 725 65.6 36.3 2094 1516 631
Thread 8 1091 874 722 66.3 36.3 2350 1476 624
Gain % 696 815 860 792 670 733 680 410 436