I have a function which is apparently the bottleneck of my whole program. I thought that parallelization with OpenMP could be helpful.
Here is an working example of my computation (sorry, the function is a little bit long). In my program, some of the work before the 5 nested loops is done somewhere else and is not a problem at all for the efficiency.
#include <vector>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <chrono>
#include "boost/dynamic_bitset.hpp"
using namespace std::chrono;
void compute_mddr(unsigned Ns, unsigned int block_size, unsigned int sector)
{
std::vector<unsigned int> basis;
for (std::size_t s = 0; s != std::pow(2,Ns); s++) {
boost::dynamic_bitset<> s_bin(Ns,s);
if (s_bin.count() == Ns/2) {
basis.push_back(s);
}
}
std::vector<double> gs(basis.size());
for (unsigned int i = 0; i != gs.size(); i++)
gs[i] = double(std::rand())/RAND_MAX;
unsigned int ns_A = block_size;
unsigned int ns_B = Ns-ns_A;
boost::dynamic_bitset<> mask_A(Ns,(1<<ns_A)-(1<<0));
boost::dynamic_bitset<> mask_B(Ns,((1<<ns_B)-(1<<0))<<ns_A);
// Find the basis of the A block
unsigned int NAsec = sector;
std::vector<double> basis_NAsec;
for (unsigned int s = 0; s < std::pow(2,ns_A); s++) {
boost::dynamic_bitset<> s_bin(ns_A,s);
if (s_bin.count() == NAsec)
basis_NAsec.push_back(s);
}
unsigned int bs_A = basis_NAsec.size();
// Find the basis of the B block
unsigned int NBsec = (Ns/2)-sector;
std::vector<double> basis_NBsec;
for (unsigned int s = 0; s < std::pow(2,ns_B); s++) {
boost::dynamic_bitset<> s_bin(ns_B,s);
if (s_bin.count() == NBsec)
basis_NBsec.push_back(s);
}
unsigned int bs_B = basis_NBsec.size();
std::vector<std::vector<double> > mddr(bs_A);
for (unsigned int i = 0; i != mddr.size(); i++) {
mddr[i].resize(bs_A);
for (unsigned int j = 0; j != mddr[i].size(); j++) {
mddr[i][j] = 0.0;
}
}
// Main calculation part
for (unsigned int mu_A = 0; mu_A != bs_A; mu_A++) { // loop 1
boost::dynamic_bitset<> mu_A_bin(ns_A,basis_NAsec[mu_A]);
for (unsigned int nu_A = mu_A; nu_A != bs_A; nu_A++) { // loop 2
boost::dynamic_bitset<> nu_A_bin(ns_A,basis_NAsec[nu_A]);
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);
for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
boost::dynamic_bitset<> si_bin(Ns,basis[si]);
boost::dynamic_bitset<> si_A_bin = si_bin & mask_A;
si_A_bin.resize(ns_A);
if (si_A_bin != mu_A_bin)
continue;
boost::dynamic_bitset<> si_B_bin = (si_bin & mask_B)>>ns_A;
si_B_bin.resize(ns_B);
if (si_B_bin != mu_B_bin)
continue;
for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
boost::dynamic_bitset<> sj_bin(Ns,basis[sj]);
boost::dynamic_bitset<> sj_A_bin = sj_bin & mask_A;
sj_A_bin.resize(ns_A);
if (sj_A_bin != nu_A_bin)
continue;
boost::dynamic_bitset<> sj_B_bin = (sj_bin & mask_B)>>ns_A;
sj_B_bin.resize(ns_B);
if (sj_B_bin != mu_B_bin)
continue;
sum += gs[si]*gs[sj];
}
}
}
mddr[nu_A][mu_A] = mddr[mu_A][nu_A] = sum;
}
}
}
int main()
{
unsigned int l = 8;
unsigned int Ns = 2*l;
unsigned block_size = 6; // must be between 1 and l
unsigned sector = (block_size%2 == 0) ? block_size/2 : (block_size+1)/2;
high_resolution_clock::time_point t1 = high_resolution_clock::now();
compute_mddr(Ns,block_size,sector);
high_resolution_clock::time_point t2 = high_resolution_clock::now();
duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
std::cout << "Function took " << time_span.count() << " seconds.";
std::cout << std::endl;
}
The compute_mddr function is basically entirely filling up the matrix mddr, and this corresponds to the outermost loops 1 and 2.
I decided to parallelize the loop 3, since it's essentially computing a sum. To give order of magnitudes, the loop 3 is over ~50-100 elements in the basis_NBsec vector, while the two innermost loops si and sj run over the ~10000 elements for the vector basis.
However, when running the code (compiled with -O3 -fopenmp on gcc 5.4.0, ubuntu 16.0.4 and i5-4440 cpu) I see either no speed-up (2 threads) or a very limited gain (3 and 4 threads):
time OMP_NUM_THREADS=1 ./a.out
Function took 230.435 seconds.
real 3m50.439s
user 3m50.428s
sys 0m0.000s
time OMP_NUM_THREADS=2 ./a.out
Function took 227.754 seconds.
real 3m47.758s
user 7m2.140s
sys 0m0.048s
time OMP_NUM_THREADS=3 ./a.out
Function took 181.492 seconds.
real 3m1.495s
user 7m36.056s
sys 0m0.036s
time OMP_NUM_THREADS=4 ./a.out
Function took 150.564 seconds.
real 2m30.568s
user 7m56.156s
sys 0m0.096s
If I understand correctly the numbers from user, for 3 and 4 threads the cpu usage is not good (and indeed, when the code is running I get ~250% cpu usage for 3 threads and barely 300% for 4 threads).
It's my first use of OpenMP, I just played with it very quickly on simple examples. Here, as far as I can see, I'm not modifying any of the shared vectors basis_NAsec, basis_NBsec and basis in the parallel part, only reading (this was an aspect pointed out in several related questions I read).
So, what am I doing wrong ?
Taking a quick look at the performance of your program with perf record shows that, regaradless of the number of threads, most of the time is spent in malloc & free. That's generally a bad sign, and it also inhibits parallelization.
Samples: 1M of event 'cycles:pp', Event count (approx.): 743045339605
Children Self Command Shared Object Symbol
+ 17.14% 17.12% a.out a.out [.] _Z12compute_mddrjjj._omp_fn.0
+ 15.45% 15.43% a.out libc-2.23.so [.] __memcmp_sse4_1
+ 15.21% 15.19% a.out libc-2.23.so [.] __memset_avx2
+ 13.09% 13.07% a.out libc-2.23.so [.] _int_free
+ 11.66% 11.65% a.out libc-2.23.so [.] _int_malloc
+ 10.21% 10.20% a.out libc-2.23.so [.] malloc
The cause for malloc & free is the constant creation of boost::dynamic_bitset objects, which are basically std::vectors. Note: With perf, can be challenging to find the callers of a certain function. You can just run in gdb, interrupt during the execution phase, break balloc, continue to figure out the callers.
The direct approach to improving performance, is trying to keep alive those objects as long as possible to avoid reallocation over and over again. This goes against the usual good practice to declare variables as locally as possible. The transformation of reusing the dynamic_bitset objects could look like the following:
#pragma omp parallel for reduction(+:sum)
for (unsigned int mu_B = 0; mu_B < bs_B; mu_B++) { // loop 3
boost::dynamic_bitset<> mu_B_bin(ns_B,basis_NBsec[mu_B]);
boost::dynamic_bitset<> si_bin(Ns);
boost::dynamic_bitset<> si_A_bin(Ns);
boost::dynamic_bitset<> si_B_bin(Ns);
boost::dynamic_bitset<> sj_bin(Ns);
boost::dynamic_bitset<> sj_A_bin(Ns);
boost::dynamic_bitset<> sj_B_bin(Ns);
for (unsigned int si = 0; si != basis.size(); si++) { // loop 4
si_bin = basis[si];
si_A_bin = si_bin;
assert(si_bin.size() == Ns);
assert(si_A_bin.size() == Ns);
assert(mask_A.size() == Ns);
si_A_bin &= mask_A;
si_A_bin.resize(ns_A);
if (si_A_bin != mu_A_bin)
continue;
si_B_bin = si_bin;
assert(si_bin.size() == Ns);
assert(si_B_bin.size() == Ns);
assert(mask_B.size() == Ns);
// Optimization note: dynamic_bitset::operator&
// does create a new object, operator&= does not
// Same for >>
si_B_bin &= mask_B;
si_B_bin >>= ns_A;
si_B_bin.resize(ns_B);
if (si_B_bin != mu_B_bin)
continue;
for (unsigned int sj = 0; sj < basis.size(); sj++) { // loop 5
sj_bin = basis[sj];
sj_A_bin = sj_bin;
assert(sj_bin.size() == Ns);
assert(sj_A_bin.size() == Ns);
assert(mask_A.size() == Ns);
sj_A_bin &= mask_A;
sj_A_bin.resize(ns_A);
if (sj_A_bin != nu_A_bin)
continue;
sj_B_bin = sj_bin;
assert(sj_bin.size() == Ns);
assert(sj_B_bin.size() == Ns);
assert(mask_B.size() == Ns);
sj_B_bin &= mask_B;
sj_B_bin >>= ns_A;
sj_B_bin.resize(ns_B);
if (sj_B_bin != mu_B_bin)
continue;
sum += gs[si]*gs[sj];
}
}
}
This already reduces the single threaded runtime on my system from ~289 s to ~39 s. Also the program scales almost perfectly up to ~10 threads (4.1 s).
For more threads, there are load balance issues in the parallel loop. which can be mitigated a bit by adding schedule(dynamic), but I'm not sure how relevant that is for you.
More importantly, you should consider using std::bitset. Even without the extremely expensive boost::dynamic_bitset constructor, it is very expensive. Most of the time is pent in superflous dynamic_bitest/vector code and memmove/memcmp on a single word.
+ 32.18% 32.15% ope_gcc_dyn ope_gcc_dyn [.] _ZNSt6vectorImSaImEEaSERKS1_
+ 29.13% 29.10% ope_gcc_dyn ope_gcc_dyn [.] _Z12compute_mddrjjj._omp_fn.0
+ 21.65% 0.00% ope_gcc_dyn [unknown] [.] 0000000000000000
+ 16.24% 16.23% ope_gcc_dyn ope_gcc_dyn [.] _ZN5boost14dynamic_bitsetImSaImEE6resizeEmb.constprop.102
+ 10.25% 10.23% ope_gcc_dyn libc-2.23.so [.] __memcmp_sse4_1
+ 9.61% 0.00% ope_gcc_dyn libc-2.23.so [.] 0xffffd47cb9d83b78
+ 7.74% 7.73% ope_gcc_dyn libc-2.23.so [.] __memmove_avx_unaligned
That basically goes away if you just use very few words of a std::bitset. Maybe 64 bit will always be enough for you. If it is dynamic over a large range, you could make a template of the entire function and instantiate it for a number of different bitsiszes of which you dynamically select the appropriate one. I suspect you should gain another order of magnitude in performance. This may in turn reduce parallel efficiency, requiring another round of performance analysis.
It's very important to use tools to understand the performance of your codes. There are very simple and very good tools for all sorts of cases. In your case a simple one such as perf is sufficient.
Related
I was not satisfied with the performance of the below thrust::reduce_by_key, so I rewrote it in a variety of ways with little gained benefit (including removing the permutation iterator). However, it wasn't until after replacing it with a thrust::for_each() (see below) that capitalizes on atomicAdd(), that I gained almost a 75x speedup! The two versions produce the exact same results. What could be the biggest cause for the dramatic performance differences?
Complete code for comparison between the two approaches:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <ctime>
#include <iostream>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/sort.h>
constexpr int NumberOfOscillators = 100;
int SeedRange = 500;
struct GetProduct
{
template<typename Tuple>
__host__ __device__
int operator()(const Tuple & t)
{
return thrust::get<0>(t) * thrust::get<1>(t);
}
};
int main()
{
using namespace std;
using namespace thrust::placeholders;
/* BEGIN INITIALIZATION */
thrust::device_vector<int> dv_OscillatorsVelocity(NumberOfOscillators);
thrust::device_vector<int> dv_outputCompare(NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Strength((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_Active((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connections_TerminalOscillatorID_Map(0);
thrust::device_vector<int> dv_Permutation_Connections_To_TerminalOscillators((NumberOfOscillators - 1) * NumberOfOscillators);
thrust::device_vector<int> dv_Connection_Keys((NumberOfOscillators - 1) * NumberOfOscillators);
srand((unsigned int)time(NULL));
thrust::fill(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), 0);
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
{
dv_Connections_Strength[c] = (rand() % SeedRange) - (SeedRange / 2);
dv_Connections_Active[c] = 0;
}
int curOscillatorIndx = -1;
for (int c = 0; c < NumberOfOscillators * NumberOfOscillators; c++)
{
if (c % NumberOfOscillators == 0)
{
curOscillatorIndx++;
}
if (c % NumberOfOscillators != curOscillatorIndx)
{
dv_Connections_TerminalOscillatorID_Map.push_back(c % NumberOfOscillators);
}
}
for (int n = 0; n < NumberOfOscillators; n++)
{
for (int p = 0; p < NumberOfOscillators - 1; p++)
{
thrust::copy_if(
thrust::device,
thrust::make_counting_iterator<int>(0),
thrust::make_counting_iterator<int>(dv_Connections_TerminalOscillatorID_Map.size()), // indices from 0 to N
dv_Connections_TerminalOscillatorID_Map.begin(), // array data
dv_Permutation_Connections_To_TerminalOscillators.begin() + (n * (NumberOfOscillators - 1)), // result will be written here
_1 == n);
}
}
for (int c = 0; c < NumberOfOscillators * (NumberOfOscillators - 1); c++)
{
dv_Connection_Keys[c] = c / (NumberOfOscillators - 1);
}
/* END INITIALIZATION */
/* BEGIN COMPARISON */
auto t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
{
thrust::reduce_by_key(
thrust::device,
//dv_Connection_Keys = 0,0,0,...1,1,1,...2,2,2,...3,3,3...
dv_Connection_Keys.begin(), //keys_first The beginning of the input key range.
dv_Connection_Keys.end(), //keys_last The end of the input key range.
thrust::make_permutation_iterator(
thrust::make_transform_iterator(
thrust::make_zip_iterator(
thrust::make_tuple(
dv_Connections_Strength.begin(),
dv_Connections_Active.begin()
)
),
GetProduct()
),
dv_Permutation_Connections_To_TerminalOscillators.begin()
), //values_first The beginning of the input value range.
thrust::make_discard_iterator(), //keys_output The beginning of the output key range.
dv_OscillatorsVelocity.begin() //values_output The beginning of the output value range.
);
}
std::cout << "iterations time for original: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
thrust::copy(dv_OscillatorsVelocity.begin(), dv_OscillatorsVelocity.end(), dv_outputCompare.begin());
t = clock();
for (int x = 0; x < 5000; ++x) //Set x maximum to a reasonable number while testing performance.
{
thrust::for_each(
thrust::device,
thrust::make_counting_iterator(0),
thrust::make_counting_iterator(0) + dv_Connections_Active.size(),
[
s = dv_OscillatorsVelocity.size() - 1,
dv_b = thrust::raw_pointer_cast(dv_OscillatorsVelocity.data()),
dv_c = thrust::raw_pointer_cast(dv_Permutation_Connections_To_TerminalOscillators.data()), //3,6,9,0,7,10,1,4,11,2,5,8
dv_ppa = thrust::raw_pointer_cast(dv_Connections_Active.data()),
dv_pps = thrust::raw_pointer_cast(dv_Connections_Strength.data())
] __device__(int i) {
const int readIndex = i / s;
atomicAdd(
dv_b + readIndex,
(dv_ppa[dv_c[i]] * dv_pps[dv_c[i]])
);
}
);
}
std::cout << "iterations time for new: " << (clock() - t) * (1000.0 / CLOCKS_PER_SEC) << "ms\n" << endl << endl;
std::cout << "***" << (dv_OscillatorsVelocity == dv_outputCompare ? "success" : "fail") << "***\n";
/* END COMPARISON */
return 0;
}
Extra info.:
My results are using a single GTX 980 TI.
There are 100 * (100 - 1) = 9,900 elements in all of the "Connection" vectors.
Each of the 100 unique keys found in dv_Connection_Keys has 99 elements each.
Use this compiler option: --expt-extended-lambda
What could be the biggest cause for the dramatic performance differences?
You are evidently building a debug project, that is your compilation settings include the -G switch. Although you were asked for your compilation settings in the comments, you didn't mention this.
It's important.
CUDA device code can have dramatically different performance characteristics when compiled with -G.
Don't evaluate performance of a debug project, or code compiled with -G.
When I compile and run your code without -G, I get:
iterations time for original: 210ms
iterations time for new: 70ms
***success***
When I compile your code with the debug switch -G, and run, I get:
iterations time for original: 12330ms
iterations time for new: 320ms
***success***
returning to your question, that accounts for the biggest factor of the difference.
The following answer tries to explain or at least motivate the remaining difference in performance after going from a debug build to a release build as explained in Robert Crovella's answer.
Coalescing
As the accesses in both kernels are not coalesced due to the permutation_iterator/indirection through dv_c, going by the the plain number of accesses will overestimate the performance in this case. thrust::reduce_by_key (or pretty much all Thrust algorithms) is not and can not be optimized for general permutations of the input as the performance of these bandwidth-bound kernels depends strongly on coalesced memory access. Naturally the algorithms are written such that accesses are coalesced for normal continuous input. So if you need to access the permuted state order of the data more than once (which might happen in a single reduction algorithm), it could be faster to actually permute the data in memory using thrust::gather or thrust::scatter once so at least all following accesses are efficient. I would not expect the for_each solution to beat reduce_by_key without that permutation.
Atomics
Newer versions of nvcc will try to use automatically use warp-aggregated-atomics to reduce the number of actual atomic instructions on the same address. As neighboring threads (same warp) tend to atomically write to the same address, this optimization is crucial for the performance of your custom reduction. Another important detail is that s = NumberOfOscillators is relatively small (100) in your code compared to typical thread-block sizes (256, 512, 1024; locality of atomic writes) and the amount of parallelism in the for_each (~NumberOfOscillators^2). So for smaller NumberOfOscillators I expect your custom reduction to get worse than reduce_by_key due to the vanishing amount of parallelism, while for bigger NumberOfOscillators you get both much more parallelism and more thread blocks/warps writing to the same location, so it is not quite clear which one will win without benchmarking it for given hardware and compiler.
I have a threading issue under windows.
I am developing a program that runs complex physical simulations for different conditions. Say a condition per hour of the year, would be 8760 simulations. I am grouping those simulations per thread such that each thread runs a for loop of 273 simulations (on average)
I bought an AMD ryzen 9 5950x with 16 cores (32 threads) for this task. On Linux, all the threads seem to be between 98% to 100% usage, while under windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's concurrency visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and windows executions.
I made my best to make the objects immutable to avoid issues and that provided a big gain with my old 8-thread intel i7. However with many more threads, this issue arises.
For threading, I have tried a custom parallel for, and the taskflow library. Both perform identically for what I want to do.
Is there something fundamental about windows threads that produces this behaviour?
The custom parallel for code:
/**
* parallel for
* #tparam Index integer type
* #tparam Callable function type
* #param start start index of the loop
* #param end final +1 index of the loop
* #param func function to evaluate
* #param nb_threads number of threads, if zero, it is determined automatically
*/
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {
// Estimate number of threads in the pool
if (nb_threads == 0) nb_threads = getThreadNumber();
// Size of a slice for the range functions
Index n = end - start + 1;
Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
slice = std::max(slice, Index(1));
// [Helper] Inner loop
auto launchRange = [&func] (int k1, int k2) {
for (Index k = k1; k < k2; k++) {
func(k);
}
};
// Create pool and launch jobs
std::vector<std::thread> pool;
pool.reserve(nb_threads);
Index i1 = start;
Index i2 = std::min(start + slice, end);
for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
pool.emplace_back(launchRange, i1, i2);
i1 = i2;
i2 = std::min(i2 + slice, end);
}
if (i1 < end) {
pool.emplace_back(launchRange, i1, end);
}
// Wait for jobs to finish
for (std::thread &t : pool) {
if (t.joinable()) {
t.join();
}
}
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"
void fillA(Matrix& A){
Random rnd;
rnd.setTimeBasedSeed();
for(int i=0; i < A.getRows(); ++i)
for(int j=0; j < A.getRows(); ++j)
A(i, j) = (int) rnd.randInt(0, 1000);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
for(const int& t: time_indices){
Matrix b = input_data.getAt(t);
Matrix A(input_data.getDim(), input_data.getDim());
fillA(A);
Matrix x = A * b;
output_data.setAt(t, x);
}
}
void process(int time_steps, int dim, int n_threads){
InputData input_data(time_steps, dim);
OutputData output_data(time_steps, dim);
// correct the number of threads
if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }
// generate indices
std::vector<int> time_indices = arrange<int>(time_steps);
// compute the split of indices per core
std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );
// run in parallel
ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
// run chunk
worker(input_data, output_data, chunks[k].indices, k );
} );
}
int main(){
process(8760, 5000, 0);
return 0;
}
The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: Green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocations and freeing memory accesses a shared resource, causing the threads to wait for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt it to your example code, you could replace the b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
{
for (int i = 0; i < x.getRows(); ++i) {
for (int j = 0; j < x.getCols(); ++j) {
x(i, j) = 0.0;
for (int k = 0; k < A.getCols(); ++k) {
x(i,j) += (A(i,k) * b(k,j));
}
}
}
}
void getAt(int t, Matrix const & input_data, Matrix & b) {
for (int i = 0; i < input_data.getRows(); ++i)
b(i, 0) = input_data(i, t);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
Matrix A(input_data.getDim(), input_data.getDim());
Matrix b(input_data.getDim(), 1);
Matrix x(input_data.getDim(), 1);
for (const int & t: time_indices) {
getAt(t, input_data.getMat(), b);
fillA(A);
MatrixVectorProduct(A, b, x);
output_data.setAt(t, x);
}
std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I had very good experience with is mimalloc (there are others such as hoard or the one from TBB). You do not need to modify your source code, you just need to link with a specific library as described in the documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements are located behind each other in memory, allowing more efficient access.
But in general I suggest to use a dedicated linear algebra library such as Eigen instead of the hand rolled matrix implementation to exploit vectorization (SSE2, AVX,...) and to get the benefits of a highly optimized library.
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.
You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These are in a tight loop. You should fix this loop to access b by reference, and reuse the other 2 matrices for every iteration.
I am trapped in a wired situation; my c++ code keeps consuming more memory (reaching around 70G), until the whole process got killed.
I am invoking a C++ code from Python, which implements the Longest common subsequence length algorithm.
The C++ code is shown below:
#define MAX(a,b) (((a)>(b))?(a):(b))
#include <stdio.h>
int LCSLength(long unsigned X[], long unsigned Y[], int m, int n)
{
int** L = new int*[m+1];
for(int i = 0; i < m+1; ++i)
L[i] = new int[n+1];
printf("i am hre\n");
int i, j;
for(i=0; i<=m; i++)
{
printf("i am hre1\n");
for(j=0; j<=n; j++)
{
if(i==0 || j==0)
L[i][j] = 0;
else if(X[i-1]==Y[j-1])
L[i][j] = L[i-1][j-1]+1;
else
L[i][j] = MAX(L[i-1][j],L[i][j-1]);
}
}
int tt = L[m][n];
printf("i am hre2\n");
for (i = 0; i < m+1; i++)
delete [] L[i];
delete [] L;
return tt;
}
And my Python code is like this:
from ctypes import cdll
import ctypes
lib = cdll.LoadLibrary('./liblcs.so')
la = 36840
lb = 833841
a = (ctypes.c_ulong * la)()
b = (ctypes.c_ulong * lb)()
for i in range(la):
a[i] = 1
for i in range(lb):
b[i] = 1
print "test"
lib._Z9LCSLengthPmS_ii(a, b, la, lb)
IMHO, in the C++ code, after the new operation which could allocate a large amount of memory on the heap, there would be not more additional memory consumption inside the loop.
However, to my surprise, I observed that the used memory keeps increasing during the loop. (I am using top on Linux, and it keeps print i am her1 before the process got killed)
It is really confused me at this point, as I guess after the memory allocation, there are only some arithmetic operations inside the loop, why does the code take more memory?
Am I clear enough? Could anyone give me some help on this issue? Thank you!
Your consuming too much memory. The reason why the system does not die on allocation is because Linux allows you to allocate more memory than you can use
http://serverfault.com/questions/141988/avoid-linux-out-of-memory-application-teardown
I just did the same thing on a test machine. I was able to get past the uses of new and start the loop, only when the system decided that I was eating too much of the available RAM did it kill me.
This is what I got. A lovely OOM message in dmesg.
[287602.898843] Out of memory: Kill process 7476 (a.out) score 792 or sacrifice child
[287602.899900] Killed process 7476 (a.out) total-vm:2885212kB, anon-rss:907032kB, file-rss:0kB, shmem-rss:0kB
On Linux you would see something like this in your kernel logs or as the output from dmesg...
[287585.306678] Out of memory: Kill process 7469 (a.out) score 787 or sacrifice child
[287585.307759] Killed process 7469 (a.out) total-vm:2885208kB, anon-rss:906912kB, file-rss:4kB, shmem-rss:0kB
[287602.754624] a.out invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[287602.755843] a.out cpuset=/ mems_allowed=0
[287602.756482] CPU: 0 PID: 7476 Comm: a.out Not tainted 4.5.0-x86_64-linode65 #2
[287602.757592] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[287602.759461] 0000000000000000 ffff88003d845780 ffffffff815abd27 0000000000000000
[287602.760689] 0000000000000282 ffff88003a377c58 ffffffff811d0e82 ffff8800397f8270
[287602.761915] 0000000000f7d192 000105902804d798 ffffffff81046a71 ffff88003d845780
[287602.763192] Call Trace:
[287602.763532] [<ffffffff815abd27>] ? dump_stack+0x63/0x84
[287602.774614] [<ffffffff811d0e82>] ? dump_header+0x59/0x1ed
[287602.775454] [<ffffffff81046a71>] ? kvm_clock_read+0x1b/0x1d
[287602.776322] [<ffffffff8112b046>] ? ktime_get+0x49/0x91
[287602.777127] [<ffffffff81156c83>] ? delayacct_end+0x3b/0x60
[287602.777970] [<ffffffff81187c11>] ? oom_kill_process+0xc0/0x367
[287602.778866] [<ffffffff811882c5>] ? out_of_memory+0x3bf/0x406
[287602.779755] [<ffffffff8118c646>] ? __alloc_pages_nodemask+0x8fc/0xa6b
[287602.780756] [<ffffffff811c095d>] ? alloc_pages_current+0xbc/0xe0
[287602.781686] [<ffffffff81186c1d>] ? filemap_fault+0x2d3/0x48b
[287602.782561] [<ffffffff8128adea>] ? ext4_filemap_fault+0x37/0x51
[287602.783511] [<ffffffff811a9d56>] ? __do_fault+0x68/0xb1
[287602.784310] [<ffffffff811adcaa>] ? handle_mm_fault+0x6a4/0xd1b
[287602.785216] [<ffffffff810496cd>] ? __do_page_fault+0x33d/0x398
[287602.786124] [<ffffffff819c6ab8>] ? async_page_fault+0x28/0x30
Take a look at what you are doing:
#include <iostream>
int main(){
int m = 36840;
int n = 833841;
unsigned long total = 0;
total += (sizeof(int) * (m+1));
for(int i = 0; i < m+1; ++i){
total += (sizeof(int) * (n+1));
}
std::cout << total << '\n';
}
You're simply consuming too much memory.
If the size of your int is 4 bytes, you are allocating 122 GB.
I've got a function that is typically run 50 times (to run 50 simulations). Usually this is done sequentially single threaded but I'd like to speed things up using multiple threads. The threads don't need to access each others memory or data so I don't think racing is an issue. Essentially the thread should just complete its task, and return to main thats it's finished, also returning a double value.
First of all, looking through all the boost documentation and examples has really convoluted me and I'm not sure what I'm looking for anymore. boost::thread ? boost future? Could someone give an example of what is applicable in my case. Additionally, I don't understand how to specify how many threads to run, is it more like I would run 50 threads and the OS handles when to execute them?
If your code is completely CPU-bound (no network/disk IO), then you would benefit from starting as many background threads as you have CPUs. Use Boost's hardware_concurrency() function to determine that number and/or allow the user to set it. Just starting a bunch of threads is not helpful, as that will increase the overhead caused by creating, switching and terminating threads.
The code starting the threads is a simple loop, followed by another loop to wait for the thread's completion. You can also use the thread_group class for that. If the number of jobs is not known and can't be distributed on thread startup, consider using a thread pool where you just start a sensible number of threads and then give them jobs while they come up.
Read the Boost.Thread Futures docs for an idea of using futures and async to achieve this. It also shows how to do it manually (the hard way) using thread objects.
Given this serial code:
double run_sim(Data*);
int main()
{
const unsigned ntasks = 50;
double results[ntasks];
Data data[ntasks];
for (unsigned i=0; i<ntasks; ++i)
results[i] = run_sim(data[i]);
}
A naive parallel version would be:
#define BOOST_THREAD_PROVIDES_FUTURE
#include <boost/thread/future.hpp>
#include <boost/bind.hpp>
double run_task(Data*);
int main()
{
const unsigned nsim = 50;
Data data[nsim];
boost::future<int> futures[nsim];
for (unsigned i=0; i<nsim; ++i)
futures[i] = boost::async(boost::bind(&run_sim, &data[i]));
double results[nsim];
for (unsigned i=0; i<nsim; ++i)
results[i] = futures[i].get();
}
Because boost::async doesn't yet support deferred functions every async call will create a new thread, so this will spawn 50 thread at once. This might perform quite badly, so you could split it up into smaller blocks:
#define BOOST_THREAD_PROVIDES_FUTURE
#include <boost/thread/future.hpp>
#include <boost/thread/thread.hpp>
#include <boost/bind.hpp>
double run_sim(Data*);
int main()
{
const unsigned nsim = 50;
unsigned nprocs = boost::thread::hardware_concurrency();
if (nprocs == 0)
nprocs = 2; // cannot determine number of cores, let's say 2
Data data[nsim];
boost::future<int> futures[nsim];
double results[nsim];
for (unsigned i=0; i<nsim; ++i)
{
if ( ((i+1) % nprocs) != 0 )
futures[i] = boost::async(boost::bind(&run_sim, &data[i]));
else
results[i] = run_sim(&data[i]);
}
for (unsigned i=0; i<nsim; ++i)
if ( ((i+1) % nprocs) != 0 )
results[i] = futures[i].get();
}
If hardware_concurrency() returns 4, this will create three new threads then call run_sim synchronously in the main() thread, then create another three new threads then call run_sim synchronously. This will prevent 50 threads all being created at once, as the main thread stops to do some of the work, which will allow some of the other threads to complete.
The code above requires quite a recent version of Boost, it's slightly easier using Standard C++ if you can use C++11:
#include <future>
double run_sim(Data*);
int main()
{
const unsigned nsim = 50;
Data data[nsim];
std::future<int> futures[nsim];
double results[nsim];
unsigned nprocs = std::thread::hardware_concurrency();
if (nprocs == 0)
nprocs = 2;
for (unsigned i=0; i<nsim; ++i)
{
if ( ((i+1) % nprocs) != 0 )
futures[i] = std::async(boost::launch::async, &run_sim, &data[i]);
else
results[i] = run_sim(&data[i]);
}
for (unsigned i=0; i<nsim; ++i)
if ( ((i+1) % nprocs) != 0 )
results[i] = futures[i].get();
}
I like some features of D, but would be interested if they come with a
runtime penalty?
To compare, I implemented a simple program that computes scalar products of many short vectors both in C++ and in D. The result is surprising:
D: 18.9 s [see below for final runtime]
C++: 3.8 s
Is C++ really almost five times as fast or did I make a mistake in the D
program?
I compiled C++ with g++ -O3 (gcc-snapshot 2011-02-19) and D with dmd -O (dmd 2.052) on a moderate recent linux desktop. The results are reproducible over several runs and standard deviations negligible.
Here the C++ program:
#include <iostream>
#include <random>
#include <chrono>
#include <string>
#include <vector>
#include <array>
typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename _T>
long time_since(std::chrono::time_point<_T>& time) {
long tm = std::chrono::duration_cast<millisecs>( std::chrono::system_clock::now() - time).count();
time = std::chrono::system_clock::now();
return tm;
}
const long N = 20000;
const int size = 10;
typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;
inline value_type scalar_product(const vector_t& x, const vector_t& y) {
value_type res = 0;
size_type siz = x.size();
for (size_type i = 0; i < siz; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = std::chrono::system_clock::now();
// 1. allocate and fill randomly many short vectors
vector_t* xs = new vector_t [N];
for (int i = 0; i < N; ++i) {
xs[i] = vector_t(size);
}
std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;
std::mt19937 rnd_engine;
std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
for (int i = 0; i < N; ++i)
for (int j = 0; j < size; ++j)
xs[i][j] = runif_gen(rnd_engine);
std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;
// 2. compute all pairwise scalar products:
time_since(tm_before);
result_type avg = 0;
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg = avg / N*N;
auto time = time_since(tm_before);
std::cout << "result: " << avg << std::endl;
std::cout << "time: " << time << " ms" << std::endl;
}
And here the D version:
import std.stdio;
import std.datetime;
import std.random;
const long N = 20000;
const int size = 10;
alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;
value_type scalar_product(const ref vector_t x, const ref vector_t y) {
value_type res = 0;
size_type siz = x.length;
for (size_type i = 0; i < siz; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime();
// 1. allocate and fill randomly many short vectors
vector_t[] xs;
xs.length = N;
for (int i = 0; i < N; ++i) {
xs[i].length = size;
}
writefln("allocation: %i ", (Clock.currTime() - tm_before));
tm_before = Clock.currTime();
for (int i = 0; i < N; ++i)
for (int j = 0; j < size; ++j)
xs[i][j] = uniform(-1000, 1000);
writefln("random: %i ", (Clock.currTime() - tm_before));
tm_before = Clock.currTime();
// 2. compute all pairwise scalar products:
result_type avg = cast(result_type) 0;
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg = avg / N*N;
writefln("result: %d", avg);
auto time = Clock.currTime() - tm_before;
writefln("scalar products: %i ", time);
return 0;
}
To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:
-O -inline -release -noboundscheck
EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The commandline I used was gdmd -O -release -inline (gdmd is a wrapper around gdc which accepts dmd options).
Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.
One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.
I've put some effort into improving GC performance lately and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.
This is a very instructive thread, thanks for all the work to the OP and helpers.
One note - this test is not assessing the general question of abstraction/feature penalty or even that of backend quality. It focuses on virtually one optimization (loop optimization). I think it's fair to say that gcc's backend is somewhat more refined than dmd's, but it would be a mistake to assume that the gap between them is as large for all tasks.
Definitely seems like a quality-of-implementation issue.
I ran some tests with the OP's code and made some changes. I actually got D going faster for LDC/clang++, operating on the assumption that arrays must be allocated dynamically (xs and associated scalars). See below for some numbers.
Questions for the OP
Is it intentional that the same seed be used for each iteration of C++, while not so for D?
Setup
I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the type of the numbers used to access and modify the size of arrays.
After this, I made the following changes:
Used uninitializedArray to avoid default inits for scalars in xs (probably made the biggest difference). This is important because D normally default-inits everything silently, which C++ does not.
Factored out printing code and replaced writefln with writeln
Changed imports to be selective
Used pow operator (^^) instead of manual multiplication for final step of calculating average
Removed the size_type and replaced appropriately with the new index_type alias
...thus resulting in scalar2.cpp (pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;
alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
immutable long N = 20000;
immutable int size = 10;
// Replaced for loops with appropriate foreach versions
value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
value_type res = 0;
for(index_type i = 0; i < size; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime;
auto countElapsed(in string taskName) { // Factor out printing code
writeln(taskName, ": ", Clock.currTime - tm_before);
tm_before = Clock.currTime;
}
// 1. allocate and fill randomly many short vectors
vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
for(index_type i = 0; i < N; ++i)
xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
countElapsed("allocation");
for(index_type i = 0; i < N; ++i)
for(index_type j = 0; j < size; ++j)
xs[i][j] = uniform(-1000, 1000);
countElapsed("random");
// 2. compute all pairwise scalar products:
result_type avg = 0;
for(index_type i = 0; i < N; ++i)
for(index_type j = 0; j < N; ++j)
avg += scalar_product(xs[i], xs[j]);
avg /= N ^^ 2;// Replace manual multiplication with pow operator
writeln("result: ", avg);
countElapsed("scalar products");
return 0;
}
After testing scalar2.d (which prioritized optimization for speed), out of curiousity I replaced the loops in main with foreach equivalents, and called it scalar3.d (pastebin):
import std.stdio : writeln;
import std.datetime : Clock, Duration;
import std.array : uninitializedArray;
import std.random : uniform;
alias result_type = long;
alias value_type = int;
alias vector_t = value_type[];
alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint
immutable long N = 20000;
immutable int size = 10;
// Replaced for loops with appropriate foreach versions
value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
value_type res = 0;
for(index_type i = 0; i < size; ++i)
res += x[i] * y[i];
return res;
}
int main() {
auto tm_before = Clock.currTime;
auto countElapsed(in string taskName) { // Factor out printing code
writeln(taskName, ": ", Clock.currTime - tm_before);
tm_before = Clock.currTime;
}
// 1. allocate and fill randomly many short vectors
vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
foreach(ref x; xs)
x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
countElapsed("allocation");
foreach(ref x; xs)
foreach(ref val; x)
val = uniform(-1000, 1000);
countElapsed("random");
// 2. compute all pairwise scalar products:
result_type avg = 0;
foreach(const ref x; xs)
foreach(const ref y; xs)
avg += scalar_product(x, y);
avg /= N ^^ 2;// Replace manual multiplication with pow operator
writeln("result: ", avg);
countElapsed("scalar products");
return 0;
}
I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:
clang 3.6.0-3
ldc 1:0.15.1-4
dtools 2.067.0-2
I used the following commands to compile each:
C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>
Results
The results (screenshot of raw console output) of each version of the source as follows:
scalar.cpp (original C++):
allocation: 2 ms
random generation: 12 ms
result: 29248300000
time: 2582 ms
C++ sets the standard at 2582 ms.
scalar.d (modified OP source):
allocation: 5 ms, 293 μs, and 5 hnsecs
random: 10 ms, 866 μs, and 4 hnsecs
result: 53237080000
scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs
This ran for ~2957 ms. Slower than the C++ implementation, but not too much.
scalar2.d (index/length type change and uninitializedArray optimization):
allocation: 2 ms, 464 μs, and 2 hnsecs
random: 5 ms, 792 μs, and 6 hnsecs
result: 59
scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs
In other words, ~1860 ms. So far this is in the lead.
scalar3.d (foreaches):
allocation: 2 ms, 911 μs, and 3 hnsecs
random: 7 ms, 567 μs, and 8 hnsecs
result: 189
scalar products: 2 secs, 182 ms, and 366 μs
~2182 ms is slower than scalar2.d, but faster than the C++ version.
Conclusion
With the correct optimizations, the D implementation actually went faster than its equivalent C++ implementation using the LLVM-based compilers available. The current gap between D and C++ for most applications seems only to be based on limitations of current implementations.
dmd is the reference implementation of the language and thus most work is put into the frontend to fix bugs rather than optimizing the backend.
"in" is faster in your case cause you are using dynamic arrays which are reference types. With ref you introduce another level of indirection (which is normally used to alter the array itself and not only the contents).
Vectors are usually implemented with structs where const ref makes perfect sense. See smallptD vs. smallpt for a real-world example featuring loads of vector operations and randomness.
Note that 64-Bit can also make a difference. I once missed that on x64 gcc compiles 64-Bit code while dmd still defaults to 32 (will change when the 64-Bit codegen matures). There was a remarkable speedup with "dmd -m64 ...".
Whether C++ or D is faster is likely to be highly dependent on what you're doing. I would think that when comparing well-written C++ to well-written D code, they would generally either be of similar speed, or C++ would be faster, but what the particular compiler manages to optimize could have a big effect completely aside from the language itself.
However, there are a few cases where D stands a good chance of beating C++ for speed. The main one which comes to mind would be string processing. Thanks to D's array slicing capabalities, strings (and arrays in general) can be processed much faster than you can readily do in C++. For D1, Tango's XML processor is extremely fast, thanks primarily to D's array slicing capabilities (and hopefully D2 will have a similarly fast XML parser once the one that's currently being worked on for Phobos has been completed). So, ultimately whether D or C++ is going to be faster is going to be very dependent on what you're doing.
Now, I am suprised that you're seeing such a difference in speed in this particular case, but it is the sort of thing that I would expect to improve as dmd improves. Using gdc might yield better results and would likely be a closer comparison of the language itself (rather than the backend) given that it's gcc-based. But it wouldn't surprise me at all if there are a number of things which could be done to speed up the code that dmd generates. I don't think that there's much question that gcc is more mature than dmd at this point. And code optimizations are one of the prime fruits of code maturity.
Ultimately, what matters is how well dmd performs for your particular application, but I do agree that it would definitely be nice to know how well C++ and D compare in general. In theory, they should be pretty much the same, but it really depends on the implementation. I think that a comprehensive set of benchmarks would be required to really test how well the two presently compare however.
You can write C code is D so as far as which is faster, it will depend on a lot of things:
What compiler you use
What feature you use
how aggressively you optimize
Differences in the first aren't fair to drag in. The second might give C++ an advantage as it, if anything, has fewer heavy features. The third is the fun one: D code in some ways is easier to optimize because in general it is easier to understand. Also it has the ability to do a large degree of generative programing allowing things like verbose and repetitive but fast code to be written in a shorter forms.
Seems like a quality of implementation issue. For example, here's what I've been testing with:
import std.datetime, std.stdio, std.random;
version = ManualInline;
immutable N = 20000;
immutable Size = 10;
alias int value_type;
alias long result_type;
alias value_type[] vector_type;
result_type scalar_product(in vector_type x, in vector_type y)
in
{
assert(x.length == y.length);
}
body
{
result_type result = 0;
foreach(i; 0 .. x.length)
result += x[i] * y[i];
return result;
}
void main()
{
auto startTime = Clock.currTime();
// 1. allocate vectors
vector_type[] vectors = new vector_type[N];
foreach(ref vec; vectors)
vec = new value_type[Size];
auto time = Clock.currTime() - startTime;
writefln("allocation: %s ", time);
startTime = Clock.currTime();
// 2. randomize vectors
foreach(ref vec; vectors)
foreach(ref e; vec)
e = uniform(-1000, 1000);
time = Clock.currTime() - startTime;
writefln("random: %s ", time);
startTime = Clock.currTime();
// 3. compute all pairwise scalar products
result_type avg = 0;
foreach(vecA; vectors)
foreach(vecB; vectors)
{
version(ManualInline)
{
result_type result = 0;
foreach(i; 0 .. vecA.length)
result += vecA[i] * vecB[i];
avg += result;
}
else
{
avg += scalar_product(vecA, vecB);
}
}
avg = avg / (N * N);
time = Clock.currTime() - startTime;
writefln("scalar products: %s ", time);
writefln("result: %s", avg);
}
With ManualInline defined I get 28 seconds, but without I get 32. So the compiler isn't even inlining this simple function, which I think it's clear it should be.
(My command line is dmd -O -noboundscheck -inline -release ....)