I have n sets A0, A1, ..., An-1 holding items of a set E.
I define a configuration C as an integer made of n bits, so C takes values between 0 and 2^n-1. Now, I define the following:
(C) an item e of E is in configuration C
<=> for each bit b of C, if b==1 then e is in Ab, else e is not in Ab
For instance for n=3, the configuration C=011 corresponds to items of E that are in A0 and A1 but NOT in A2 (the NOT is important)
C[bitmap] is the count of elements that have exactly that presence/absence pattern in the sets. C[001] is the number of elements in A0 that aren't also in any other sets.
Another possible definition is :
(V) an item e of E is in configuration V
<=> for each bit b of V, if b==1 then e is in Ab
For instance for n=3, the (V) configuration V=011 corresponds to items of E that are in A0 and A1
V[bitmap] is the count of the intersection of the selected sets. (i.e. the count of how many elements are in all of the sets where the bitmap is true.) V[001] is the number of elements in A0. V[011] is the number of elements in A0 and A1, regardless of whether or not they're also in A2.
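To make the two definitions concrete, here is a minimal sketch that builds both count vectors directly from the items. The names `Counts` and `countConfigurations` are hypothetical, and it assumes each item is given as a membership bitmask (bit b set means the item is in Ab) whose value is below 2^n.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the two count vectors, each of size 2^n.
struct Counts {
    std::vector<std::size_t> C;  // (C): exact presence/absence pattern
    std::vector<std::size_t> V;  // (V): intersection of the selected sets
};

// masks[k] is the membership bitmask of item k; every mask must be < 2^n.
Counts countConfigurations(const std::vector<std::uint32_t>& masks, unsigned n)
{
    const std::size_t q = std::size_t(1) << n;
    Counts r{std::vector<std::size_t>(q, 0), std::vector<std::size_t>(q, 0)};
    for (std::uint32_t m : masks) {
        ++r.C[m];  // the item matches exactly one (C) configuration
        // The item matches every (V) configuration that is a submask of m,
        // including the empty one.
        for (std::uint32_t s = m; ; s = (s - 1) & m) {
            ++r.V[s];
            if (s == 0) break;
        }
    }
    return r;
}
```

For example, with n=3 and items whose masks are 001, 011, 011, 111 and 100, this gives C[011]=2 but V[011]=3, because the item with mask 111 is also in A0 and A1.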
In the following, the first picture shows items of sets A0, A1 and A2, the second picture shows size of (C) configurations and the third picture shows size of (V) configurations.
I can also represent the configurations by either of two vectors:
C[001]= 5 V[001]=14
C[010]=10 V[010]=22
C[100]=11 V[100]=24
C[011]= 2 V[011]= 6
C[101]= 3 V[101]= 7
C[110]= 6 V[110]=10
C[111]= 4 V[111]= 4
What I want is to write a C/C++ function that transforms C into V as efficiently as possible. A naive approach could be the following 'transfo' function that is obviously in O(4^n) :
#include <vector>
#include <cstdio>
using namespace std;

vector<size_t> transfo (const vector<size_t>& C)
{
    vector<size_t> V (C.size());
    for (size_t i=0; i<C.size(); i++)
    {
        V[i] = 0;
        // C[j] contributes to V[i] when every bit set in i is also set in j
        for (size_t j=0; j<C.size(); j++)
        {
            if ((j&i)==i) { V[i] += C[j]; }
        }
    }
    return V;
}

int main()
{
    vector<size_t> C = {
        /* 000 */ 0,
        /* 001 */ 5,
        /* 010 */ 10,
        /* 011 */ 2,
        /* 100 */ 11,
        /* 101 */ 3,
        /* 110 */ 6,
        /* 111 */ 4
    };
    vector<size_t> V = transfo (C);
    for (size_t i=1; i<V.size(); i++) { printf ("[%2zu] C=%2zu V=%2zu\n", i, C[i], V[i]); }
}
My question is: is there a more efficient algorithm than the naive one for transforming a vector C into a vector V? And what would be the complexity of such a "good" algorithm?
Note that I would also be interested in any SIMD solution.
Well, you are trying to compute 2^n values, so you cannot do better than O(2^n).
The naive approach starts from the observation that V[X] is obtained by fixing all the 1 bits in X and iterating over all the possible values where the 0 bits are. For example,
V[010] = C[010] + C[011] + C[110] + C[111]
But this approach performs O(2^n) additions for every element of V, yielding a total complexity of O(4^n).
Here is an O(n × 2^n) algorithm. I too am curious whether an O(2^n) algorithm exists.
Let n = 4. Let us consider the full table of V versus C. Each line in the table below corresponds to one value of V and this value is calculated by summing up the columns marked with a *. The layout of * symbols can be easily deduced from the naive approach.
0000| * | * | * | * | * | * | * | * || * | * | * | * | * | * | * | *
0001| | * | | * | | * | | * || | * | | * | | * | | *
0010| | | * | * | | | * | * || | | * | * | | | * | *
0011| | | | * | | | | * || | | | * | | | | *
0100| | | | | * | * | * | * || | | | | * | * | * | *
0101| | | | | | * | | * || | | | | | * | | *
0110| | | | | | | * | * || | | | | | | * | *
0111| | | | | | | | * || | | | | | | | *
1000| | | | | | | | || * | * | * | * | * | * | * | *
1001| | | | | | | | || | * | | * | | * | | *
1010| | | | | | | | || | | * | * | | | * | *
1011| | | | | | | | || | | | * | | | | *
1100| | | | | | | | || | | | | * | * | * | *
1101| | | | | | | | || | | | | | * | | *
1110| | | | | | | | || | | | | | | * | *
1111| | | | | | | | || | | | | | | | *
Notice that the top-left, top-right and bottom-right corners contain identical layouts. Therefore, we can perform some calculations in bulk as follows:
Compute the bottom half of the table (the bottom-right corner).
Add the values to the top half.
Compute the top-left corner.
If we let q = 2^n, the recurrence for the complexity is
T(q) = 2T(q/2) + O(q)
which solves, using the Master Theorem, to
T(q) = O(q log q)
or, in terms of n,
T(n) = O(n × 2^n)
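The recursion above can also be unrolled into one pass per bit. The following is a minimal sketch rather than a tuned implementation; the name `supersetSums` is made up for illustration, and it assumes the vector length is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place transform C -> V. Afterwards v[i] holds the sum of the original
// v[j] over all j with (j & i) == i. Runs in O(n * 2^n) for q = 2^n entries.
void supersetSums(std::vector<std::size_t>& v)
{
    const std::size_t q = v.size();
    assert(q != 0 && (q & (q - 1)) == 0);  // q must be a power of two
    for (std::size_t bit = 1; bit < q; bit <<= 1) {
        for (std::size_t i = 0; i < q; ++i) {
            // Entries with this bit cleared absorb their partner with the
            // bit set; the partner itself is not modified during this pass.
            if ((i & bit) == 0) v[i] += v[i | bit];
        }
    }
}
```

Applied to the C vector from the question, (0, 5, 10, 2, 11, 3, 6, 4), this yields (41, 14, 22, 6, 24, 7, 10, 4), which matches the V column for indices 001 to 111.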
Following the great observation of @CătălinFrâncu, I wrote two recursive implementations of the transformation (see code below):
transfo_recursive: a very straightforward recursive implementation
transfo_avx2: still recursive, but uses AVX2 for the last step of the recursion (n=3)
I assume here that the counters are 32 bits wide and that n can grow up to 28.
I also wrote an iterative implementation (transfo_iterative) based on my own observation of the recursion's behaviour. Actually, I guess it is close to the non-recursive implementation proposed by @chtz.
Here is the benchmark code:
// compiled with: g++ -O3 intersect.cpp -march=native -mavx2 -lpthread -DNDEBUG
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <thread>
#include <algorithm>
#include <sys/times.h>
#include <immintrin.h>
#include <boost/align/aligned_allocator.hpp>
using namespace std;
typedef u_int32_t Count;
// Note: alignment is important for AVX2
typedef std::vector<Count,boost::alignment::aligned_allocator<Count, 8*sizeof(Count)>> CountVector;
typedef void (*callback) (CountVector::pointer C, size_t q);
typedef vector<pair<const char*, callback>> FunctionsVector;
unsigned int randomSeed = 0;
double timestamp()
{
    struct timespec timet;
    clock_gettime(CLOCK_MONOTONIC, &timet);
    return timet.tv_sec + (timet.tv_nsec/ 1000000000.0);
}

CountVector getRandomVector (size_t n)
{
    // We use the same seed, so we'll get the same random values
    srand (randomSeed);
    // We fill a vector of size q=2^n with random values
    CountVector C(1ULL<<n);
    for (size_t i=0; i<C.size(); i++) { C[i] = rand() % (1ULL<<(8*sizeof(Count))); }
    return C;
}

void copy_add_block (CountVector::pointer C, size_t q)
{
    for (size_t i=0; i<q/2; i++) { C[i] += C[i+q/2]; }
}

void copy_add_block_avx2 (CountVector::pointer C, size_t q)
{
    __m256i* target = (__m256i*) (C);
    __m256i* source = (__m256i*) (C+q/2);
    size_t imax = q/(2*8);
    for (size_t i=0; i<imax; i++)
    {
        target[i] = _mm256_add_epi32 (source[i], target[i]);
    }
}
// Naive approach : O(4^n)
CountVector transfo_naive (const CountVector& C)
{
    CountVector V (C.size());
    for (size_t i=0; i<C.size(); i++)
    {
        V[i] = 0;
        for (size_t j=0; j<C.size(); j++)
        {
            if ((j&i)==i) { V[i] += C[j]; }
        }
    }
    return V;
}

// Recursive approach : O(n.2^n)
void transfo_recursive (CountVector::pointer C, size_t q)
{
    if (q>1)
    {
        transfo_recursive (C+q/2, q/2);
        transfo_recursive (C, q/2);
        copy_add_block (C, q);
    }
}

// Iterative approach : O(n.2^n)
void transfo_iterative (CountVector::pointer C, size_t q)
{
    size_t i = 0;
    for (size_t n=q; n>1; n>>=1, i++)
    {
        size_t d = size_t(1)<<i;
        for (ssize_t j=q-1-d; j>=0; j--)
        {
            if ( ((j>>i)&1)==0) { C[j] += C[j+d]; }
        }
    }
}
// Recursive AVX2 approach : O(n.2^n)
#define ROTATE1(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,7,6,5,4,3,2,1))
#define ROTATE2(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,0,7,6,5,4,3,2))
#define ROTATE4(s) _mm256_permutevar8x32_epi32 (s, _mm256_set_epi32(0,0,0,0,7,6,5,4))

void transfo_avx2 (CountVector::pointer V, size_t N)
{
    __m256i k1 = _mm256_set_epi32 (0,0xFFFFFFFF,0,0xFFFFFFFF,0,0xFFFFFFFF,0,0xFFFFFFFF);
    __m256i k2 = _mm256_set_epi32 (0,0,0xFFFFFFFF,0xFFFFFFFF,0,0,0xFFFFFFFF,0xFFFFFFFF);
    __m256i k4 = _mm256_set_epi32 (0,0,0,0,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF);
    if (N==8)
    {
        // Last step of the recursion: the 8 counters fit in one AVX2 register
        __m256i* source = (__m256i*) (V);
        *source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE1(*source),k1));
        *source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE2(*source),k2));
        *source = _mm256_add_epi32 (*source, _mm256_and_si256(ROTATE4(*source),k4));
    }
    else // if (N>8)
    {
        transfo_avx2 (V+N/2, N/2);
        transfo_avx2 (V, N/2);
        copy_add_block_avx2 (V, N);
    }
}
#define ROTATE1_AND(s) _mm256_srli_epi64 ((s), 32) // odd 32bit elements
#define ROTATE2_AND(s) _mm256_bsrli_epi128 ((s), 8) // high 64bit halves
// gcc doesn't have _mm256_zextsi128_si256
// and _mm256_castsi128_si256 doesn't guarantee zero extension
// vperm2i128 can do the same job as vextracti128, but is slower on Ryzen
#ifdef __clang__ // high 128bit lane
#define ROTATE4_AND(s) _mm256_zextsi128_si256(_mm256_extracti128_si256((s),1))
#else
//#define ROTATE4_AND(s) _mm256_castsi128_si256(_mm256_extracti128_si256((s),1))
#define ROTATE4_AND(s) _mm256_permute2x128_si256((s),(s),0x81) // high bit set = zero that lane
#endif

void transfo_avx2_pcordes (CountVector::pointer C, size_t q)
{
    if (q==8)
    {
        __m256i* source = (__m256i*) (C);
        __m256i tmp = *source;
        tmp = _mm256_add_epi32 (tmp, ROTATE1_AND(tmp));
        tmp = _mm256_add_epi32 (tmp, ROTATE2_AND(tmp));
        tmp = _mm256_add_epi32 (tmp, ROTATE4_AND(tmp));
        *source = tmp;
    }
    else //if (q>8)
    {
        transfo_avx2_pcordes (C+q/2, q/2);
        transfo_avx2_pcordes (C, q/2);
        copy_add_block_avx2 (C, q);
    }
}
// Template specialization (same as transfo_avx2_pcordes)
template <int n>
void transfo_template (__m256i* C)
{
    const size_t q = 1ULL << n;
    transfo_template<n-1> (C);
    transfo_template<n-1> (C + q/2);
    __m256i* target = (__m256i*) (C);
    __m256i* source = (__m256i*) (C+q/2);
    for (size_t i=0; i<q/2; i++)
    {
        target[i] = _mm256_add_epi32 (source[i], target[i]);
    }
}

template <>
void transfo_template<0> (__m256i* C)
{
    __m256i* source = (__m256i*) (C);
    __m256i tmp = *source;
    tmp = _mm256_add_epi32 (tmp, ROTATE1_AND(tmp));
    tmp = _mm256_add_epi32 (tmp, ROTATE2_AND(tmp));
    tmp = _mm256_add_epi32 (tmp, ROTATE4_AND(tmp));
    *source = tmp;
}

void transfo_recur_template (CountVector::pointer C, size_t q)
{
#define CASE(n) case 1ULL<<n: transfo_template<n> ((__m256i*)C); break;
    q = q / 8; // 8 is the number of 32 bits items in the AVX2 registers
    // We have to 'link' the dynamic value of q with a static template specialization
    switch (q)
    {
        CASE( 1); CASE( 2); CASE( 3); CASE( 4); CASE( 5); CASE( 6); CASE( 7); CASE( 8); CASE( 9);
        CASE(10); CASE(11); CASE(12); CASE(13); CASE(14); CASE(15); CASE(16); CASE(17); CASE(18); CASE(19);
        CASE(20); CASE(21); CASE(22); CASE(23); CASE(24); CASE(25); CASE(26); CASE(27); CASE(28); CASE(29);
        default: printf ("transfo_template undefined for q=%zu\n", q); break;
    }
}

// Recursive approach multithread : O(n.2^n)
void transfo_recur_thread (CountVector::pointer C, size_t q)
{
    std::thread t1 (transfo_recur_template, C+q/2, q/2);
    std::thread t2 (transfo_recur_template, C, q/2);
    // Both halves must be finished before they are combined
    t1.join();
    t2.join();
    copy_add_block_avx2 (C, q);
}
void header (const char* title, const FunctionsVector& functions)
{
    printf ("\n");
    for (size_t i=0; i<functions.size(); i++) { printf ("------------------"); } printf ("\n");
    printf ("%s\n", title);
    for (size_t i=0; i<functions.size(); i++) { printf ("------------------"); } printf ("\n");
    printf ("%3s\t", "# n");
    for (auto fct : functions) { printf ("%20s\t", fct.first); }
    printf ("\n");
}

// Check that alternative implementations provide the same result as the naive one
void check (const FunctionsVector& functions, size_t nmin, size_t nmax)
{
    header ("CHECK (0 values means similar to naive approach)", functions);
    for (size_t n=nmin; n<=nmax; n++)
    {
        printf ("%3zu\t", n);
        CountVector reference = transfo_naive (getRandomVector(n));
        for (auto fct : functions)
        {
            // We call the (in place) transformation
            CountVector C = getRandomVector(n);
            (*fct.second) (C.data(), C.size());
            int nbDiffs= 0;
            for (size_t i=0; i<C.size(); i++)
            {
                if (reference[i]!=C[i]) { nbDiffs++; }
            }
            printf ("%20d\t", nbDiffs);
        }
        printf ("\n");
    }
}

// Performance test
void performance (const FunctionsVector& functions, size_t nmin, size_t nmax)
{
    header ("PERFORMANCE", functions);
    for (size_t n=nmin; n<=nmax; n++)
    {
        printf ("%3zu\t", n);
        for (auto fct : functions)
        {
            // We compute the median time over several executions
            // We use more executions for small n values in order
            // to have more accurate results
            size_t nbRuns = 1ULL<<(2+nmax-n);
            vector<double> timeValues;
            // We run the test several times
            for (size_t r=0; r<nbRuns; r++)
            {
                // We don't want to measure time for vector fill
                CountVector C = getRandomVector(n);
                double t0 = timestamp();
                (*fct.second) (C.data(), C.size());
                double t1 = timestamp();
                timeValues.push_back (t1-t0);
            }
            // We sort the vector of times in order to get the median value
            std::sort (timeValues.begin(), timeValues.end());
            double median = timeValues[timeValues.size()/2];
            // We print log2 of the median time in microseconds
            printf ("%20lf\t", log(1000.0*1000.0*median)/log(2));
        }
        printf ("\n");
    }
}
int main (int argc, char* argv[])
{
    size_t nmin = argc>=2 ? atoi(argv[1]) : 14;
    size_t nmax = argc>=3 ? atoi(argv[2]) : 28;
    // We get a common random seed
    randomSeed = time(NULL);
    FunctionsVector functions = {
        make_pair ("transfo_recursive", transfo_recursive),
        make_pair ("transfo_iterative", transfo_iterative),
        make_pair ("transfo_avx2", transfo_avx2),
        make_pair ("transfo_avx2_pcordes", transfo_avx2_pcordes),
        make_pair ("transfo_recur_template", transfo_recur_template),
        make_pair ("transfo_recur_thread", transfo_recur_thread)
    };
    // We check for some n that alternative implementations
    // provide the same result as the naive approach
    check (functions, 5, 15);
    // We run the performance test
    performance (functions, nmin, nmax);
}
And here is the performance graph:
One can observe that the simple recursive implementation is pretty good, even compared to the AVX2 version. The iterative implementation is a little bit disappointing but I made no big effort to optimize it.
Finally, for my own use case with 32 bits counters and for n values up to 28, these implementations are obviously ok for me compared to the initial "naive" approach in O(4^n).
Following some remarks from @PeterCordes and @chtz, I added the following implementations:
transfo_avx2_pcordes: the same as transfo_avx2 with some AVX2 optimizations
transfo_recur_template: the same as transfo_avx2_pcordes but using C++ template specialization to implement the recursion
transfo_recur_thread: multithreading for the two initial recursive calls of transfo_recur_template
Here is the updated benchmark result:
A few remarks about this result:
the AVX2 implementations are logically the best options but maybe not with the maximum potential x8 speedup with counters of 32 bits
among the AVX2 implementations, the template specialization brings a little speedup but it almost fades for bigger values for n
the simple two-threads version has bad results for n<20; for n>=20, there is always a little speedup but far from a potential 2x.
I have a code to compute a Gaussian Mixture Model with Expectation Maximization in order to identify the clusters from a given input data sample.
One part of the code repeats the computation of the model for a number of trials Ntrials (each independent of the others but using the same input data) in order to finally pick the best solution (the one maximizing the total likelihood of the model). This concept generalizes to many other clustering algorithms (e.g. k-means).
I want to parallelize the part of the code that has to be repeated Ntrials times through multi-threading with C++11 such that each thread will execute one trial.
A code example, assuming an input Eigen::ArrayXXd sample of (Ndimensions x Npoints) can be of the type:
double bestTotalModelProbability = 0;
Eigen::ArrayXd clusterIndicesFromSample(Npoints);
for (int i=0; i < Ntrials; i++)
{
    double totalModelProbability = computeGaussianMixtureModel(sample);
    // Check if this trial is better than the previous one.
    // If so, update the results (cluster index for each point
    // in the sample) and keep them.
    if (totalModelProbability > bestTotalModelProbability)
    {
        bestTotalModelProbability = totalModelProbability;
        clusterIndicesFromSample = obtainClusterMembership(sample);
    }
}
where I pass sample by reference (Eigen::Ref), not by value, to both functions computeGaussianMixtureModel() and obtainClusterMembership().
My code is heavily based on Eigen arrays, and the N-dimensional problems I work on typically have on the order of 10-100 dimensions and 500-1000 sample points. I am looking for examples of how to create a multi-threaded version of this code using Eigen arrays and std::thread of C++11, but I could not find anything, and I am struggling to write even simple examples that manipulate Eigen arrays.
I am not even sure Eigen can be used within std::thread in C++11.
Can someone help me, even with a simple example, to understand the syntax?
I am using clang++ as compiler in Mac OSX on a CPU with 6 cores (12 threads).
OP's question attracted my attention because number-crunching with speed-up earned by multi-threading is one of the top todo's on my personal list.
I must admit that my experience with the Eigen library is very limited. (I once used the decomposition of 3×3 rotation matrices to Euler angles, which is solved very cleverly in a general way in the Eigen library.)
Hence, I defined another sample task consisting of a stupid counting of values in a sample data set.
This is done multiple times (using the same evaluation function):
single threaded (to get a value for comparison)
starting each sub-task in an extra thread (in an admittedly rather stupid approach)
starting threads with interleaved access to sample data
starting threads with partitioned access to sample data.
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <limits>
#include <thread>
#include <vector>
// a sample function to process a certain amount of data
template <typename T>
size_t countFrequency(
  size_t n, const T data[], const T &begin, const T &end)
{
  size_t result = 0;
  for (size_t i = 0; i < n; ++i) result += data[i] >= begin && data[i] < end;
  return result;
}

typedef std::uint16_t Value;
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds MuSecs;
typedef decltype(std::chrono::duration_cast<MuSecs>(Clock::now() - Clock::now())) Time;

Time duration(const Clock::time_point &t0)
{
  return std::chrono::duration_cast<MuSecs>(Clock::now() - t0);
}

std::vector<Time> makeTest()
{
  const Value SizeGroup = 4, NGroups = 10000, N = SizeGroup * NGroups;
  const size_t NThreads = std::thread::hardware_concurrency();
  // make a test sample
  std::vector<Value> sample(N);
  for (Value &value : sample) value = (Value)rand();
  // prepare result vectors
  std::vector<size_t> results4[4] = {
    std::vector<size_t>(NGroups, 0),
    std::vector<size_t>(NGroups, 0),
    std::vector<size_t>(NGroups, 0),
    std::vector<size_t>(NGroups, 0)
  };
  // make test
  std::vector<Time> times{
    [&]() { // single threading
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[0];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment single-threaded
      for (size_t i = 0; i < NGroups; ++i) {
        results[i] = countFrequency(data.size(), data.data(),
          (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
      }
      // done
      return duration(t0);
    }(),
    [&]() { // multi-threading - stupid approach
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[1];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment multi-threaded
      std::vector<std::thread> threads(NThreads);
      for (Value i = 0; i < NGroups;) {
        size_t nT = 0;
        for (; nT < NThreads && i < NGroups; ++nT, ++i) {
          threads[nT] = std::move(std::thread(
            [i, &results, &data, SizeGroup]() {
              size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
              results[i] = result;
            }));
        }
        for (size_t iT = 0; iT < nT; ++iT) threads[iT].join();
      }
      // done
      return duration(t0);
    }(),
    [&]() { // multi-threading - interleaved
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[2];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment multi-threaded
      std::vector<std::thread> threads(NThreads);
      for (Value iT = 0; iT < NThreads; ++iT) {
        threads[iT] = std::move(std::thread(
          [iT, &results, &data, NGroups, SizeGroup, NThreads]() {
            for (Value i = iT; i < NGroups; i += NThreads) {
              size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
              results[i] = result;
            }
          }));
      }
      for (std::thread &threadI : threads) threadI.join();
      // done
      return duration(t0);
    }(),
    [&]() { // multi-threading - grouped
      // make a copy of test sample
      std::vector<Value> data(sample);
      std::vector<size_t> &results = results4[3];
      // remember start time
      const Clock::time_point t0 = Clock::now();
      // do experiment multi-threaded
      std::vector<std::thread> threads(NThreads);
      for (Value iT = 0; iT < NThreads; ++iT) {
        threads[iT] = std::move(std::thread(
          [iT, &results, &data, NGroups, SizeGroup, NThreads]() {
            for (Value i = iT * NGroups / NThreads,
              iN = (iT + 1) * NGroups / NThreads; i < iN; ++i) {
              size_t result = countFrequency(data.size(), data.data(),
                (Value)(i * SizeGroup), (Value)((i + 1) * SizeGroup));
              results[i] = result;
            }
          }));
      }
      for (std::thread &threadI : threads) threadI.join();
      // done
      return duration(t0);
    }()
  };
  // check results (must be equal for any kind of computation)
  const unsigned nResults = sizeof results4 / sizeof *results4;
  for (unsigned i = 1; i < nResults; ++i) {
    size_t nErrors = 0;
    for (Value j = 0; j < NGroups; ++j) {
      if (results4[0][j] != results4[i][j]) {
        ++nErrors;
#ifdef _DEBUG
        std::cerr
          << "results4[0][" << j << "]: " << results4[0][j]
          << " != results4[" << i << "][" << j << "]: " << results4[i][j]
          << "!\n";
#endif // _DEBUG
      }
    }
    if (nErrors) std::cerr << nErrors << " errors in results4[" << i << "]!\n";
  }
  // done
  return times;
}

int main()
{
  std::cout << "std::thread::hardware_concurrency(): "
    << std::thread::hardware_concurrency() << '\n';
  // heat up
  std::cout << "Heat up...\n";
  for (unsigned i = 0; i < 3; ++i) makeTest();
  // repeat NTrials times
  const unsigned NTrials = 10;
  std::cout << "Measuring " << NTrials << " runs...\n"
    << "   "
    << " | " << std::setw(10) << "Single"
    << " | " << std::setw(10) << "Multi 1"
    << " | " << std::setw(10) << "Multi 2"
    << " | " << std::setw(10) << "Multi 3"
    << '\n';
  std::vector<double> sumTimes;
  for (unsigned i = 0; i < NTrials; ++i) {
    std::vector<Time> times = makeTest();
    std::cout << std::setw(2) << (i + 1) << ".";
    for (const Time &time : times) {
      std::cout << " | " << std::setw(10) << time.count();
    }
    std::cout << '\n';
    sumTimes.resize(times.size(), 0.0);
    for (size_t j = 0; j < times.size(); ++j) sumTimes[j] += times[j].count();
  }
  std::cout << "Average Values:\n   ";
  for (const double &sumTime : sumTimes) {
    std::cout << " | "
      << std::setw(10) << std::fixed << std::setprecision(1)
      << sumTime / NTrials;
  }
  std::cout << '\n';
  std::cout << "Ratio:\n   ";
  for (const double &sumTime : sumTimes) {
    std::cout << " | "
      << std::setw(10) << std::fixed << std::setprecision(3)
      << sumTime / sumTimes.front();
  }
  std::cout << '\n';
  // done
  return 0;
}
Compiled and tested on cygwin64 on Windows 10:
$ g++ --version
g++ (GCC) 7.3.0
$ g++ -std=c++11 -O2 -o test-multi-threading test-multi-threading.cc
$ ./test-multi-threading
std::thread::hardware_concurrency(): 8
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 384008 | 1052937 | 130662 | 138411
2. | 386500 | 1103281 | 133030 | 132576
3. | 382968 | 1078988 | 137123 | 137780
4. | 395158 | 1120752 | 138731 | 138650
5. | 385870 | 1105885 | 144825 | 129405
6. | 366724 | 1071788 | 137684 | 130289
7. | 352204 | 1104191 | 133675 | 130505
8. | 331679 | 1072299 | 135476 | 138257
9. | 373416 | 1053881 | 138467 | 137613
10. | 370872 | 1096424 | 136810 | 147960
Average Values:
| 372939.9 | 1086042.6 | 136648.3 | 136144.6
Ratio:
| 1.000 | 2.912 | 0.366 | 0.365
I did the same on coliru.com. (I had to reduce the heat up cycles and the sample size as I exceeded the time limit with the original values.):
g++ (GCC) 8.1.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
std::thread::hardware_concurrency(): 4
Heat up...
Measuring 10 runs...
| Single | Multi 1 | Multi 2 | Multi 3
1. | 224684 | 297729 | 48334 | 39016
2. | 146232 | 337222 | 66308 | 59994
3. | 195750 | 344056 | 61383 | 63172
4. | 198629 | 317719 | 62695 | 50413
5. | 149125 | 356471 | 61447 | 57487
6. | 155355 | 322185 | 50254 | 35214
7. | 140269 | 316224 | 61482 | 53889
8. | 154454 | 334814 | 58382 | 53796
9. | 177426 | 340723 | 62195 | 54352
10. | 151951 | 331772 | 61802 | 46727
Average Values:
| 169387.5 | 329891.5 | 59428.2 | 51406.0
Ratio:
| 1.000 | 1.948 | 0.351 | 0.303
Live Demo on coliru
I am a bit surprised that the ratios on coliru (with only 4 threads) are even better than on my PC (with 8 threads). Actually, I don't know how to explain this.
However, there are a lot of other differences between the two setups which may or may not be responsible. At least, both measurements show a speed-up of roughly 3 for the 3rd and 4th approaches, whereas the 2nd approach eats up all of the potential speed-up (probably through the cost of starting and joining all those threads).
Looking at the sample code, you will notice that there is no mutex or any other explicit locking. This is intentional. As I learned (many, many years ago), every attempt at parallelization may cause a certain amount of extra communication overhead (for concurrent tasks which have to exchange data). If the communication overhead becomes too big, it simply consumes the speed advantage of concurrency. So, the best speed-up can be achieved by:
least communication overhead i.e. concurrent tasks operate on independent data
least effort for post-merging the concurrently computed results.
In my sample code, I
prepared all data and storage before starting the threads,
never change shared data that is read while the threads are running,
write data as if it were thread-local (no two threads write to the same address),
evaluate the computed results only after all threads have been joined.
Concerning 3., I wondered for a while whether this is legal, i.e. whether data written in threads is guaranteed to appear correctly in the main thread after joining. (The fact that something seems to work is deceptive in general, and especially so in multi-threading.)
cppreference.com provides the following explanations
for std::thread::thread()
The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution.
for std::thread::join()
The completion of the thread identified by *this synchronizes with the corresponding successful return from join().
In Stack Overflow, I found the following related Q/A's:
Does relaxed memory order effect can be extended to after performing-thread's life?
Are memory fences required here?
Is there an implicit memory barrier with synchronized-with relationship on thread::join?
which convinced me, it is OK.
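Applied to OP's situation (independent trials, keep the best one), the same lock-free pattern could look like the following sketch. `bestTrial` is a hypothetical helper and the callable `trial` stands in for something like computeGaussianMixtureModel(sample). It assumes nTrials and nThreads are at least 1 and that `trial` may safely be called from several threads at once, i.e. it only reads shared data.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Runs trial(i) for i in [0, nTrials) spread over nThreads threads and
// returns the index of the trial with the highest score.
template <typename Trial>
std::size_t bestTrial(std::size_t nTrials, std::size_t nThreads, Trial trial)
{
    // One result slot per trial, allocated before any thread starts.
    std::vector<double> scores(nTrials, 0.0);
    std::vector<std::thread> threads;
    threads.reserve(nThreads);
    for (std::size_t t = 0; t < nThreads; ++t) {
        threads.emplace_back([&scores, &trial, t, nTrials, nThreads]() {
            // Interleaved distribution: no two threads write the same slot.
            for (std::size_t i = t; i < nTrials; i += nThreads)
                scores[i] = trial(i);
        });
    }
    // join() synchronizes with the completion of each thread, so the
    // scores written by the workers are visible here afterwards.
    for (std::thread& th : threads) th.join();
    std::size_t best = 0;
    for (std::size_t i = 1; i < nTrials; ++i)
        if (scores[i] > scores[best]) best = i;
    return best;
}
```

Only the scalar score is stored per trial here. To keep the cluster memberships too, each trial could write them into its own pre-allocated slot in the same way, or the winning trial could simply be recomputed afterwards.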
However, the drawback is that
the creation and joining of threads causes additional effort (and it's not that cheap).
An alternative approach could be the usage of a thread pool to overcome this. I googled a bit and found e.g. Jakob Progsch's ThreadPool on github. However, I guess, with a thread pool the locking issue is back “in the game”.
Whether this will work for Eigen functions as well, depends on how the resp. Eigen functions are implemented. If there are accesses to global variables in them (which become shared when the same function is called concurrently), this will cause a data race.
Googling a bit, I found the following doc.
Eigen and multi-threading – Using Eigen in a multi-threaded application:
In the case your own application is multithreaded, and multiple threads make calls to Eigen, then you have to initialize Eigen by calling the following routine before creating the threads:
#include <Eigen/Core>
int main(int argc, char** argv)
{
  Eigen::initParallel();
  // ... rest of the application
}
With Eigen 3.3, and a fully C++11 compliant compiler (i.e., thread-safe static local variable initialization), then calling initParallel() is optional.
note that all functions generating random matrices are not re-entrant nor thread-safe. Those include DenseBase::Random(), and DenseBase::setRandom() despite a call to Eigen::initParallel(). This is because these functions are based on std::rand which is not re-entrant. For thread-safe random generator, we recommend the use of boost::random or c++11 random feature.
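As an illustration of the "c++11 random feature" recommendation, here is a minimal sketch in which every thread owns its private engine, so no generator state is shared. The function name `randomPerThread` and the seeding scheme (base seed plus thread index) are assumptions made for this example only; they are not taken from the Eigen documentation.

```cpp
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// Fills one vector of `count` uniform samples per thread. Each thread uses
// its own std::mt19937, so there is no shared state as with std::rand.
std::vector<std::vector<double>> randomPerThread(
    std::size_t nThreads, std::size_t count, unsigned baseSeed)
{
    std::vector<std::vector<double>> out(nThreads, std::vector<double>(count));
    std::vector<std::thread> threads;
    threads.reserve(nThreads);
    for (std::size_t t = 0; t < nThreads; ++t) {
        threads.emplace_back([&out, t, baseSeed]() {
            // Thread-private engine; the seed derivation is a simplification.
            std::mt19937 engine(baseSeed + static_cast<unsigned>(t));
            std::uniform_real_distribution<double> dist(0.0, 1.0);
            for (double& x : out[t]) x = dist(engine);
        });
    }
    for (std::thread& th : threads) th.join();
    return out;
}
```

The generated values could then be copied (or mapped) into Eigen arrays instead of calling setRandom() from several threads.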
Suppose I have a class X whose functionality requires a lot of constant table values, say an array A[1024]. I have a recurrence function f that computes its values, something like
A[x] = f(A[x - 1]);
Suppose that A[0] is a known constant, therefore the rest of the array is constant too. What is the best way to calculate these values beforehand, using features of modern C++, and without storing a file with hardcoded values of this array? My workaround was a const static dummy variable:
const bool X::dummy = X::SetupTables();

bool X::SetupTables() {
    A[0] = 1;
    for (size_t i = 1; i < A.size(); ++i)
        A[i] = f(A[i - 1]);
    return true;
}
But I believe it's not the most beautiful way to go.
Note: I emphasize that array is rather big and I want to avoid monstrosity of the code.
Since C++14, for loops are allowed in constexpr functions. Moreover, since C++17, std::array::operator[] is constexpr too.
So you can write something like this:
#include <array>

template<class T, size_t N, class F>
constexpr auto make_table(F func, T first)
{
    std::array<T, N> a {first};
    for (size_t i = 1; i < N; ++i)
        a[i] = func(a[i - 1]);
    return a;
}
Example: https://godbolt.org/g/irrfr2
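For the class from the question, the table could then become a static constexpr member. The sketch below repeats make_table so that it is self-contained; the recurrence `step`, the class layout and the table size of 8 are placeholders chosen for illustration, and it assumes a C++17 compiler.

```cpp
#include <array>
#include <cstddef>

template <class T, std::size_t N, class F>
constexpr std::array<T, N> make_table(F func, T first)
{
    std::array<T, N> a{first};
    for (std::size_t i = 1; i < N; ++i)
        a[i] = func(a[i - 1]);
    return a;
}

// Placeholder recurrence standing in for f from the question.
constexpr unsigned step(unsigned x) { return 3u * x + 1u; }

struct X {
    // Evaluated by the compiler; no dummy variable or setup function needed.
    static constexpr std::array<unsigned, 8> A = make_table<unsigned, 8>(step, 1u);
};

// Compile-time checks that the table really is a constant expression.
static_assert(X::A[0] == 1u, "first element is the seed");
static_assert(X::A[2] == 13u, "1 -> 4 -> 13");
```

Because A is declared constexpr, the compiler has to evaluate make_table at compile time, and the static_assert lines fail the build if it cannot.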
I think this way is more readable:
#include <array>
#include <cstddef>

constexpr int f(int a) { return a + 1; }

constexpr void init(auto &A)
{
    A[0] = 1;
    for (std::size_t i = 1; i < A.size(); i++) {
        A[i] = f(A[i - 1]);
    }
}

int main() {
    std::array<int, 1024> A;
    A[0] = 1;
    init(A);
}
I need to add a disclaimer: for big array sizes, it is not guaranteed that the array is generated at compile time. The accepted answer is more likely to generate the full array during template expansion.
But the approach I propose has a number of advantages:
It is quite safe: the compiler will not eat up all your memory and fail to expand the template.
Compilation is significantly faster.
You use a C++-ish interface when you use the array.
The code is in general more readable.
In a particular example when you need only one value, the variant with templates generated for me only a single number, while the variant with std::array generated a loop.
Thanks to Navin I found a way to force compile time evaluation of the array.
You can force it to run at compile time if you return by value: std::array A = init();
So with slight modification the code looks as follows:
#include <array>

constexpr int f(int a) { return a + 1; }

constexpr auto init()
{
    // Need to initialize the array
    std::array<int, SIZE> A = {0};
    A[0] = 1;
    for (unsigned i = 1; i < A.size(); i++) {
        A[i] = f(A[i - 1]);
    }
    return A;
}

int main() {
    auto A = init();
    return A[SIZE - 1];
}
Compiling this requires C++17 support; otherwise operator[] of std::array is not constexpr. I also updated the measurements.
On assembly output
As I mentioned earlier the template variant is more concise. Please look here for more detail.
In the template variant, when I just pick the last value of the array, the whole assembly looks as follows:
mov eax, 1024
While for std::array variant I have a loop:
subq $3984, %rsp
movl $1, %eax
leal 1(%rax), %edx
movl %edx, -120(%rsp,%rax,4)
addq $1, %rax
cmpq $1024, %rax
jne .L2
movl 3972(%rsp), %eax
addq $3984, %rsp
With std::array and return by value, the assembly is identical to the version with templates:
mov eax, 1024
On compilation speed
I compared these two variants:
#include <utility>

constexpr int f(int a) { return a + 1; }

template<int... Idxs>
constexpr void init(int* A, std::integer_sequence<int, Idxs...>) {
    auto discard = {A[Idxs] = f(A[Idxs - 1])...};
    static_cast<void>(discard); // silence the unused-variable warning
}

int main() {
    int A[SIZE];
    A[0] = 1;
    init(A + 1, std::make_integer_sequence<int, sizeof A / sizeof *A - 1>{});
}

#include <array>
#include <cstddef>

constexpr int f(int a) { return a + 1; }

constexpr void init(auto &A)
{
    A[0] = 1;
    for (std::size_t i = 1; i < A.size(); i++) {
        A[i] = f(A[i - 1]);
    }
}

int main() {
    std::array<int, SIZE> A;
    A[0] = 1;
    init(A);
}
The results are:
| Size | Templates (s) | std::array (s) | by value (s) |
| 1024 | 0.32 | 0.23 | 0.38 |
| 2048 | 0.52 | 0.23 | 0.37 |
| 4096 | 0.94 | 0.23 | 0.38 |
| 8192 | 1.87 | 0.22 | 0.46 |
| 16384 | 3.93 | 0.22 | 0.76 |
How I generated the numbers:
for SIZE in 1024 2048 4096 8192 16384
do
    echo $SIZE
    time g++ -DSIZE=$SIZE test2.cpp
    time g++ -DSIZE=$SIZE test.cpp
    time g++ -std=c++17 -DSIZE=$SIZE test3.cpp
done
And if you enable optimizations, compilation of the template variant gets even slower:
| Size | Templates (s) | std::array (s) | by value (s) |
| 1024 | 0.92 | 0.26 | 0.29 |
| 2048 | 2.81 | 0.25 | 0.33 |
| 4096 | 10.94 | 0.23 | 0.36 |
| 8192 | 52.34 | 0.24 | 0.39 |
| 16384 | 211.29 | 0.24 | 0.56 |
How I generated the numbers:
for SIZE in 1024 2048 4096 8192 16384
do
    echo $SIZE
    time g++ -O3 -march=native -DSIZE=$SIZE test2.cpp
    time g++ -O3 -march=native -DSIZE=$SIZE test.cpp
    time g++ -O3 -std=c++17 -march=native -DSIZE=$SIZE test3.cpp
done
My gcc version:
$ g++ --version
g++ (Debian 7.2.0-1) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
One example:
#include <utility>
constexpr int f(int a) { return a + 1; }
template<int... Idxs>
constexpr void init(int* A, std::integer_sequence<int, Idxs...>) {
    auto discard = {A[Idxs] = f(A[Idxs - 1])...};
}
int main() {
    int A[1024];
    A[0] = 1;
    init(A + 1, std::make_integer_sequence<int, sizeof A / sizeof *A - 1>{});
}
Requires -ftemplate-depth=1026 g++ command line switch.
Example how to make it a static member:
struct B
{
    int A[1024];
    B() {
        A[0] = 1;
        init(A + 1, std::make_integer_sequence<int, sizeof A / sizeof *A - 1>{});
    }
};
struct C
{
    static B const b;
};
B const C::b;
Just for fun, a compact C++17 one-liner might be (requires a std::array A, or some other memory-contiguous tuple-like type):
std::apply( [](auto, auto&... x){ ( ( x = f((&x)[-1]) ), ... ); }, A );
Note that this can be used in a constexpr function too.
That said, since C++14 we can use loops in constexpr functions, so we can write a constexpr function that returns a std::array directly, written (almost) the usual way.
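A minimal sketch of that last suggestion, assuming C++17 (so that the non-const operator[] of std::array can be used in a constant expression). The helper name make_table and the constant kTable are made up for illustration; f is the same toy function used in the posts above.

```cpp
#include <array>
#include <cstddef>

constexpr int f(int a) { return a + 1; }

// Hypothetical helper: fills the table with an ordinary loop.
template <std::size_t N>
constexpr std::array<int, N> make_table() {
    static_assert(N > 0, "the table needs at least one element");
    std::array<int, N> A{};  // must be initialised inside a constant expression
    A[0] = 1;
    for (std::size_t i = 1; i < N; ++i) {
        A[i] = f(A[i - 1]);
    }
    return A;
}

// Declaring the result constexpr forces the loop to run in the compiler.
constexpr auto kTable = make_table<1024>();
static_assert(kTable[1023] == 1024, "last element is computed at compile time");
```

Because kTable is itself constexpr, the compiler has to evaluate the loop; a plain `auto A = make_table<1024>();` would only allow it to.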
I tried to enable vectorization of an often-used function to improve the performance.
The algorithm should do the following and is called ~4.000.000 times!
Input: double* cellvalue
Output: int8* Output (8 bit integer, c++ char)
if (*cellvalue > upper_threshold)
    *output = 1;
else if (*cellvalue < lower_threshold)
    *output = -1;
else
    *output = 0;
My first vectorization approach to compute 2 doubles in parallel looks like:
__m128d lowerThresh = _mm_set1_pd(m_lowerThreshold);
__m128d upperThresh = _mm_set1_pd(m_upperThreshold);
__m128d vec = _mm_load_pd(cellvalue);
__m128d maskLower = _mm_cmplt_pd(vec, lowerThresh); // less than
__m128d maskUpper = _mm_cmpgt_pd(vec, upperThresh); // greater than
static const tInt8 negOne = -1;
static const tInt8 posOne = 1;
output[0] = (negOne & *((tInt8*)&maskLower.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper.m128d_f64[0]));
output[1] = (negOne & *((tInt8*)&maskLower.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper.m128d_f64[1]));
Does this make sense to you? It works, but I think the last part to create the output is very complicated. Is there any faster method to do this?
Also I tried to compute 8 values at once with nearly the same code. Will this perform better? Does the order of instructions make sense?
__m128d lowerThresh = _mm_set1_pd(m_lowerThreshold);
__m128d upperThresh = _mm_set1_pd(m_upperThreshold);
// load 4 times
__m128d vec0 = _mm_load_pd(cellValue);
__m128d vec1 = _mm_load_pd(cellValue + 2);
__m128d vec2 = _mm_load_pd(cellValue + 4);
__m128d vec3 = _mm_load_pd(cellValue + 6);
__m128d maskLower0 = _mm_cmplt_pd(vec0, lowerThresh); // less than
__m128d maskLower1 = _mm_cmplt_pd(vec1, lowerThresh); // less than
__m128d maskLower2 = _mm_cmplt_pd(vec2, lowerThresh); // less than
__m128d maskLower3 = _mm_cmplt_pd(vec3, lowerThresh); // less than
__m128d maskUpper0 = _mm_cmpgt_pd(vec0, upperThresh); // greater than
__m128d maskUpper1 = _mm_cmpgt_pd(vec1, upperThresh); // greater than
__m128d maskUpper2 = _mm_cmpgt_pd(vec2, upperThresh); // greater than
__m128d maskUpper3 = _mm_cmpgt_pd(vec3, upperThresh); // greater than
static const tInt8 negOne = -1;
static const tInt8 posOne = 1;
output[0] = (negOne & *((tInt8*)&maskLower0.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper0.m128d_f64[0]));
output[1] = (negOne & *((tInt8*)&maskLower0.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper0.m128d_f64[1]));
output[2] = (negOne & *((tInt8*)&maskLower1.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper1.m128d_f64[0]));
output[3] = (negOne & *((tInt8*)&maskLower1.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper1.m128d_f64[1]));
output[4] = (negOne & *((tInt8*)&maskLower2.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper2.m128d_f64[0]));
output[5] = (negOne & *((tInt8*)&maskLower2.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper2.m128d_f64[1]));
output[6] = (negOne & *((tInt8*)&maskLower3.m128d_f64[0])) | (posOne & *((tInt8*)&maskUpper3.m128d_f64[0]));
output[7] = (negOne & *((tInt8*)&maskLower3.m128d_f64[1])) | (posOne & *((tInt8*)&maskUpper3.m128d_f64[1]));
Hopefully you can help me to understand the vectorization thing a bit better ;)
_mm_cmplt_pd and _mm_cmpgt_pd produce a result that is already either 0 or -1; anding it with -1 does nothing, and anding it with 1 is equivalent to negating it. Thus, if upper_threshold > lower_threshold (so that both conditions are never true), you can just write*:
_mm_storeu_si128((__m128i*)output, _mm_sub_epi64(_mm_castpd_si128(maskLower), _mm_castpd_si128(maskUpper)));
(*) it's unclear what an "int8" is in your code; that's not a standard type in C++. It could be an 8-byte int, which is the behavior I've used here. If it's an 8-bit int instead, you'll want to pack up a bunch of results to store together.
Questioner clarifies that they intend int8 to be an 8-bit integer. In that case, you can do the following for a quick implementation:
__m128i result = _mm_sub_epi64(_mm_castpd_si128(maskLower), _mm_castpd_si128(maskUpper));
output[0] = (tInt8)result.m128i_i64[0]; // .m128i_i64 is an oddball MSVC-ism, so
output[1] = (tInt8)result.m128i_i64[1]; // I'm not 100% sure about the syntax here.
but you may also want to try packing eight result vectors together and store them with a single store operation.
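The packing idea can be sketched roughly as follows. This is an untuned illustration that assumes an x86 target with SSE2, 8-bit outputs, and lower <= upper; the names classify4 and classify_thresholds are hypothetical. It narrows the 64-bit compare masks to 32 bits with a shuffle, subtracts them as in the answer above, and then packs sixteen results into a single 128-bit store.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Classify 4 doubles into four 32-bit lanes holding -1, 0 or +1.
static inline __m128i classify4(const double* x, __m128d lo, __m128d hi) {
    const __m128d v0 = _mm_loadu_pd(x);
    const __m128d v1 = _mm_loadu_pd(x + 2);
    // Each compare yields 64-bit masks; the shuffle keeps the low 32 bits of each.
    const __m128 below = _mm_shuffle_ps(_mm_castpd_ps(_mm_cmplt_pd(v0, lo)),
                                        _mm_castpd_ps(_mm_cmplt_pd(v1, lo)),
                                        _MM_SHUFFLE(2, 0, 2, 0));
    const __m128 above = _mm_shuffle_ps(_mm_castpd_ps(_mm_cmpgt_pd(v0, hi)),
                                        _mm_castpd_ps(_mm_cmpgt_pd(v1, hi)),
                                        _MM_SHUFFLE(2, 0, 2, 0));
    // below is 0/-1 and above is 0/-1, so below - above is -1, 0 or +1.
    return _mm_sub_epi32(_mm_castps_si128(below), _mm_castps_si128(above));
}

void classify_thresholds(const double* x, std::int8_t* out, std::size_t n,
                         double lower, double upper) {
    assert(lower <= upper);  // otherwise both conditions could hold at once
    const __m128d lo = _mm_set1_pd(lower);
    const __m128d hi = _mm_set1_pd(upper);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        // 4 x int32 -> 8 x int16 -> 16 x int8, element order is preserved.
        const __m128i ab = _mm_packs_epi32(classify4(x + i, lo, hi),
                                           classify4(x + i + 4, lo, hi));
        const __m128i cd = _mm_packs_epi32(classify4(x + i + 8, lo, hi),
                                           classify4(x + i + 12, lo, hi));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_packs_epi16(ab, cd));
    }
    for (; i < n; ++i) {  // scalar tail, same branch-free formula
        out[i] = static_cast<std::int8_t>((x[i] > upper) - (x[i] < lower));
    }
}
```

Unaligned loads and stores are used so the sketch makes no assumption about the alignment of the input or output; whether it actually beats the compiler's own vectorization would have to be measured.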
If you change the code not to branch, then a modern compiler will do the vectorization for you.
Here's the test I ran:
#include <stdint.h>
#include <iostream>
#include <random>
#include <vector>
#include <chrono>
using Clock = std::chrono::steady_clock;
using std::chrono::milliseconds;
typedef double Scalar;
typedef int8_t Integer;
const Scalar kUpperThreshold = .5;
const Scalar kLowerThreshold = .2;
void compute_comparisons1(int n, const Scalar* xs, Integer* ys) {
    #pragma simd
    for (int i=0; i<n; ++i) {
        Scalar x = xs[i];
        ys[i] = (x > kUpperThreshold) - (x < kLowerThreshold);
    }
}
void compute_comparisons2(int n, const Scalar* xs, Integer* ys) {
    for (int i=0; i<n; ++i) {
        Scalar x = xs[i];
        Integer& y = ys[i];
        if (x > kUpperThreshold)
            y = 1;
        else if(x < kLowerThreshold)
            y = -1;
        else
            y = 0;
    }
}
const int N = 4000000;
auto random_generator = std::mt19937{0};
int main() {
    std::vector<Scalar> xs(N);
    std::vector<Integer> ys1(N);
    std::vector<Integer> ys2(N);
    std::uniform_real_distribution<Scalar> dist(0, 1);
    for (int i=0; i<N; ++i)
        xs[i] = dist(random_generator);
    auto time0 = Clock::now();
    compute_comparisons1(N, xs.data(), ys1.data());
    auto time1 = Clock::now();
    compute_comparisons2(N, xs.data(), ys2.data());
    auto time2 = Clock::now();
    std::cout << "v1: " << std::chrono::duration_cast<milliseconds>(time1 - time0).count() << "\n";
    std::cout << "v2: " << std::chrono::duration_cast<milliseconds>(time2 - time1).count() << "\n";
    for (int i=0; i<N; ++i) {
        if (ys1[i] != ys2[i]) {
            std::cout << "Error!\n";
            return -1;
        }
    }
    return 0;
}
If you compile with a recent version of gcc (I used 4.8.3) and use the flags "-O3 -std=c++11 -march=native -S", you can verify by looking at the assembly that it vectorizes the code. And it runs much faster (3 milliseconds vs 16 milliseconds on my machine.)
Also, I'm not sure what your requirements are, but if you can live with less precision, then using float instead of double will further improve the speed (double takes 1.8x as long on my machine).
I'm trying to implement a naive version of LU decomposition in OpenCL. To start, I have implemented a sequential version in C++ and constructed methods to verify my result (i.e., multiplication methods). Next I implemented my algorithm in a kernel and tested it with manually verified input (i.e., a 5x5 matrix). This works fine.
However, when I run my algorithm on a randomly generated matrix bigger than 5x5 I get strange results. I've cleaned my code, checked the calculations manually but I can't figure out where my kernel is going wrong. I'm starting to think that it might have something to do with the floats and the stability of the calculations. By this I mean that error margins get propagated and get bigger and bigger. I'm well-aware that I can swap rows to get the biggest pivot value and such, but the error margin is way off sometimes. And in any case I would have expected the result - albeit a wrong one - to be the same as the sequential algorithm. I would like some help identifying where I could be doing something wrong.
I'm using a single dimensional array so addressing a matrix with two dimensions happens like this:
A(row, col) = A[row * matrix_width + col].
About the results I might add that I decided to merge the L and U matrices into one. So, given L and U:
L:       U:
1 0 0    A B C
X 1 0    0 D E
Y Z 1    0 0 F
I display them as:
A B C
X D E
Y Z F
The kernel is the following:
The parameter source is the original matrix I want to decompose.
The parameter destin is the destination. matrix_size is the total size of the matrix (so that would be 9 for a 3x3) and matrix_width is the width (3 for a 3x3 matrix).
__kernel void matrix(
    __global float * source,
    __global float * destin,
    unsigned int matrix_size,
    unsigned int matrix_width
)
{
    unsigned int index = get_global_id(0);
    int col_idx = index % matrix_width;
    int row_idx = index / matrix_width;
    if (index >= matrix_size)
        return;
    // First of all, copy our value to the destination.
    destin[index] = source[index];
    // Iterate over all the pivots.
    for(int piv_idx = 0; piv_idx < matrix_width; piv_idx++)
    {
        // We have to be the row below the pivot row
        // And we have to be the column of the pivot
        // or right of that column.
        if(col_idx < piv_idx || row_idx <= piv_idx)
            continue;
        // Calculate the divisor.
        float pivot_value = destin[(piv_idx * matrix_width) + piv_idx];
        float below_pivot_value = destin[(row_idx * matrix_width) + piv_idx];
        float divisor = below_pivot_value/ pivot_value;
        // Get the value in the pivot row on this column.
        float pivot_row_value = destin[(piv_idx * matrix_width) + col_idx];
        float current_value = destin[index];
        destin[index] = current_value - (pivot_row_value * divisor);
        // Write the divisor to the memory (we won't use these values anymore!)
        // if we are the value under the pivot.
        if(col_idx == piv_idx)
        {
            int divisor_location = (row_idx * matrix_width) + piv_idx;
            destin[divisor_location] = divisor;
        }
    }
}
This is the sequential version:
// Decomposes a matrix into L and U but in the same matrix.
float * decompose(float* A, int matrix_width)
{
    int total_length = matrix_width*matrix_width;
    float *U = new float[total_length];
    for (int i = 0; i < total_length; i++)
        U[i] = A[i];
    for (int row = 0; row < matrix_width; row++)
    {
        int pivot_idx = row;
        float pivot_val = U[pivot_idx * matrix_width + pivot_idx];
        for (int r = row + 1; r < matrix_width; r++)
        {
            float below_pivot = U[r*matrix_width + pivot_idx];
            float divisor = below_pivot / pivot_val;
            for (int row_idx = pivot_idx; row_idx < matrix_width; row_idx++)
            {
                float value = U[row * matrix_width + row_idx];
                U[r*matrix_width + row_idx] = U[r*matrix_width + row_idx] - (value * divisor);
            }
            U[r * matrix_width + pivot_idx] = divisor;
        }
    }
    return U;
}
An example output I get is the following:
Workgroup size: 1
Array dimension: 6
Original unfactorized:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 507.000000 | 718.000000 | 670.000000 | 753.000000 | 122.000000 | 941.000000 |
| 597.000000 | 449.000000 | 596.000000 | 742.000000 | 491.000000 | 212.000000 |
| 159.000000 | 944.000000 | 797.000000 | 717.000000 | 822.000000 | 219.000000 |
| 266.000000 | 755.000000 | 33.000000 | 231.000000 | 824.000000 | 785.000000 |
| 724.000000 | 408.000000 | 652.000000 | 863.000000 | 663.000000 | 113.000000 |
Sequential:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869324 | -571.573853 | -1663.892090 | -2006.823730 | -355.306763 |
| 3.392045 | -0.006397 | -869.627747 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.893066 | 860.526367 | -2059.689209 |
| 1.511364 | 1.654343 | -0.376231 | -2.570729 | 4476.049805 | -5097.599121 |
| 4.113636 | -0.415427 | 1.562076 | -0.065806 | 0.003290 | 52.263515 |
Sequential multiplied matching with original?:
GPU:
| 176.000000 | 133.000000 | 431.000000 | 839.000000 | 739.000000 | 450.000000 |
| 2.880682 | 334.869293 | -571.573914 | -1663.892212 | -2006.823975 | -355.306885 |
| 3.392045 | -0.006397 | -869.627808 | -2114.569580 | -2028.558716 | -1316.693359 |
| 0.903409 | 2.460203 | -2.085742 | -357.892578 | 5091.575684 | -2059.688965 |
| 1.511364 | 1.654343 | -0.376232 | -2.570732 | 16116.155273 | -5097.604980 |
| 4.113636 | -0.415427 | -0.737347 | 2.005755 | -3.655331 | -237.480438 |
GPU multiplied matching with original?:
Values differ: 5053.05 -- 822
Values differ: 5091.58 -- 860.526
Correct solution? 0
Okay, I understand why it was not working before, I think. The reason is that I only synchronize on each workgroup. When I would call my kernel with a workgroup size equal to the number of items in my matrix it would always be correct, because then the barriers would work properly. However, I decided to go with the approach as mentioned in the comments. Enqueue multiple kernels and wait for each kernel to finish before starting the next one. This would then map onto an iteration over each row of the matrix and multiplying it with the pivot element. This makes sure that I do not modify or read elements that are being modified by the kernel at that point.
Again, this works but only for small matrices. So I think I was wrong in assuming that it was the synchronization only. As per the request of Baiz I am posting my entire main here that calls the kernel:
int main(int argc, char *argv[])
{
    try {
        if (argc != 5) {
            std::ostringstream oss;
            oss << "Usage: " << argv[0] << " <kernel_file> <kernel_name> <workgroup_size> <array width>";
            throw std::runtime_error(oss.str());
        }
        // Read in arguments.
        std::string kernel_file(argv[1]);
        std::string kernel_name(argv[2]);
        unsigned int workgroup_size = atoi(argv[3]);
        unsigned int array_dimension = atoi(argv[4]);
        int total_matrix_length = array_dimension * array_dimension;
        // Print parameters
        std::cout << "Workgroup size: " << workgroup_size << std::endl;
        std::cout << "Array dimension: " << array_dimension << std::endl;
        // Create matrix to work on.
        // Create a random array.
        int matrix_width = sqrt(total_matrix_length);
        float* input_matrix = randomMatrix(total_matrix_length);
        /// Debugging
        //float* input_matrix = new float[9];
        //int matrix_width = 3;
        //total_matrix_length = matrix_width * matrix_width;
        //input_matrix[0] = 10; input_matrix[1] = -7; input_matrix[2] = 0;
        //input_matrix[3] = -3; input_matrix[4] = 2; input_matrix[5] = 6;
        //input_matrix[6] = 5; input_matrix[7] = -1; input_matrix[8] = 5;
        // Allocate memory on the host and populate source
        float *gpu_result = new float[total_matrix_length];
        // OpenCL initialization
        std::vector<cl::Platform> platforms;
        std::vector<cl::Device> devices;
        cl::Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        cl::Context context(devices);
        cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);
        // Load the kernel source.
        std::string file_text;
        std::ifstream file_stream(kernel_file.c_str());
        if (!file_stream) {
            std::ostringstream oss;
            oss << "There is no file called " << kernel_file;
            throw std::runtime_error(oss.str());
        }
        file_text.assign(std::istreambuf_iterator<char>(file_stream), std::istreambuf_iterator<char>());
        // Compile the kernel source.
        std::string source_code = file_text;
        std::pair<const char *, size_t> source(source_code.c_str(), source_code.size());
        cl::Program::Sources sources;
        sources.push_back(source);
        cl::Program program(context, sources);
        try {
            program.build(devices);
        }
        catch (cl::Error& e) {
            std::string msg;
            program.getBuildInfo<std::string>(devices[0], CL_PROGRAM_BUILD_LOG, &msg);
            std::cerr << "Your kernel failed to compile" << std::endl;
            std::cerr << "-----------------------------" << std::endl;
            std::cerr << msg;
        }
        // Allocate memory on the device
        cl::Buffer source_buf(context, CL_MEM_READ_ONLY, total_matrix_length*sizeof(float));
        cl::Buffer dest_buf(context, CL_MEM_WRITE_ONLY, total_matrix_length*sizeof(float));
        // Create the actual kernel.
        cl::Kernel kernel(program, kernel_name.c_str());
        // transfer source data from the host to the device
        queue.enqueueWriteBuffer(source_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), input_matrix);
        for (int pivot_idx = 0; pivot_idx < matrix_width; pivot_idx++)
        {
            // set the kernel arguments
            kernel.setArg<cl::Memory>(0, source_buf);
            kernel.setArg<cl::Memory>(1, dest_buf);
            kernel.setArg<cl_uint>(2, total_matrix_length);
            kernel.setArg<cl_uint>(3, matrix_width);
            kernel.setArg<cl_int>(4, pivot_idx);
            // execute the code on the device
            std::cout << "Enqueueing new kernel for " << pivot_idx << std::endl;
            cl::Event evt;
            queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(total_matrix_length), cl::NDRange(workgroup_size), 0, &evt);
            // wait for this pivot's kernel to finish before starting the next one
            evt.wait();
            std::cout << "Iteration " << pivot_idx << " done" << std::endl;
        }
        // transfer destination data from the device to the host
        queue.enqueueReadBuffer(dest_buf, CL_TRUE, 0, total_matrix_length*sizeof(float), gpu_result);
        // Calculate sequentially.
        float* sequential = decompose(input_matrix, matrix_width);
        // Print out the results.
        std::cout << "Sequential:\n";
        printMatrix(total_matrix_length, sequential);
        // Print out the results.
        std::cout << "GPU:\n";
        printMatrix(total_matrix_length, gpu_result);
        std::cout << "Correct solution? " << equalMatrices(gpu_result, sequential, total_matrix_length);
        // compute the data throughput in GB/s
        //float throughput = (2.0*total_matrix_length*sizeof(float)) / t; // t is in nano seconds
        //std::cout << "Achieved throughput: " << throughput << std::endl;
        // Cleanup
        // Deallocate memory
        delete[] gpu_result;
        delete[] input_matrix;
        delete[] sequential;
        return 0;
    }
    catch (cl::Error& e) {
        std::cerr << e.what() << ": " << jc::readable_status(e.err());
        return 3;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
        return 2;
    }
    catch (...) {
        std::cerr << "Unexpected error. Aborting!\n" << std::endl;
        return 1;
    }
}
As maZZZu already stated, due to the parallel execution of the work items you cannot be sure whether an element in the array has been read or written yet.
This can be ensured using barriers such as barrier(CLK_LOCAL_MEM_FENCE) or barrier(CLK_GLOBAL_MEM_FENCE);
however, these mechanisms only work on work items within the same work group.
There is no way to synchronize work items from different work groups.
Your problem most likely is:
you use multiple work groups for an algorithm which is most likely only executable by a single work group
you do not use enough barriers
if you already use only a single work group, try adding a barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); to all parts where you read/write from/to destin.
You should restructure your algorithm:
have only one work group perform the algorithm on your matrix
use local memory for better performance(since you repeatedly access elements)
use barriers everywhere. Once the algorithm works, you can work out which ones you don't need and start removing them.
Could you post your kernel call and the working sizes?
From your algorithm I came up with this code.
I haven't tested it and I doubt it'll work right away.
But it should help you in understanding how to parallelize a sequential algorithm.
It will decompose the matrix with only one kernel launch.
Some restrictions:
This code only works with a single work group.
It will only work for matrices whose size does not exceed your maximum local work-group size (probably between 256 and 1024).
If you want to change that, you should refactor the algorithm to use only as many work items as the width of the matrix.
Kernel arguments (just adapt them to your kernel.setArg(...) code):
int nbElements = width*height;
clSetKernelArg (kernel, 0, sizeof(A), &A);
clSetKernelArg (kernel, 1, sizeof(U), &U);
clSetKernelArg (kernel, 2, sizeof(float) * width * height, NULL); // Local memory
clSetKernelArg (kernel, 3, sizeof(int), &width);
clSetKernelArg (kernel, 4, sizeof(int), &height);
clSetKernelArg (kernel, 5, sizeof(int), &nbElements);
Kernel code:
inline int indexFrom2d(const int u, const int v, const int width)
{
    return width*v + u;
}
kernel void decompose(global float* A,
                      global float* U,
                      local float* localBuffer,
                      const int widthMat,
                      const int heightMat,
                      const int nbElements)
{
    int gidx = get_global_id(0);
    int col = gidx%widthMat;
    int row = gidx/widthMat;
    if(gidx >= nbElements)
        return;
    // Copy from global to local memory
    localBuffer[gidx] = A[gidx];
    // Sync copy process
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int rowOuter = 0; rowOuter < widthMat; ++rowOuter)
    {
        int pivotIdx = rowOuter;
        float pivotValue = localBuffer[indexFrom2d(pivotIdx, pivotIdx, widthMat)];
        // Data for all work items in the row
        float belowPivot = localBuffer[indexFrom2d(pivotIdx, row, widthMat)];
        float divisor = belowPivot / pivotValue;
        float value = localBuffer[indexFrom2d(col, rowOuter, widthMat)];
        // Every work item must finish reading before anyone writes
        barrier(CLK_LOCAL_MEM_FENCE);
        // Only work items below pivot and from pivot to the right
        // (chained comparisons such as a > b >= c do not work in C, so spell them out)
        if(col >= pivotIdx && col < widthMat &&
           row >= pivotIdx + 1 && row < heightMat)
        {
            localBuffer[indexFrom2d(col, row, widthMat)] = localBuffer[indexFrom2d(col, row, widthMat)] - (value * divisor);
            if(col == pivotIdx)
                localBuffer[indexFrom2d(pivotIdx, row, widthMat)] = divisor;
        }
        // Every write must be visible before the next pivot is read
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Write back to global memory
    U[gidx] = localBuffer[gidx];
}
The errors are way too big to be caused by floating-point arithmetic.
Without any deeper understanding of your algorithm, I would say that the problem is that you are using values from the destination buffer. With sequential code this is fine, because you know what values are there. But with OpenCL, work items are executed in parallel, so you cannot tell whether another work item has already stored its value to the destination buffer or not.
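One way to sidestep that hazard, sketched here in plain C++ rather than OpenCL: treat each pivot as a separate pass that reads only from an input buffer and writes only to an output buffer, then swap the two. The function names pivot_step and lu_by_steps are hypothetical, and mapping this to a device (one kernel launch per pivot with two swapped buffers) is an assumption about a possible fix, not a tested OpenCL implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pivot pass. Every element of `out` is computed from `in` alone, so the
// order in which (row, col) pairs are processed cannot change the result.
void pivot_step(const std::vector<float>& in, std::vector<float>& out,
                std::size_t n, std::size_t piv) {
    assert(in.size() == n * n);
    out = in;  // untouched elements are carried over unchanged
    const float pivot = in[piv * n + piv];
    for (std::size_t row = piv + 1; row < n; ++row) {
        const float multiplier = in[row * n + piv] / pivot;
        out[row * n + piv] = multiplier;  // L factor stored below the diagonal
        for (std::size_t col = piv + 1; col < n; ++col) {
            out[row * n + col] = in[row * n + col] - in[piv * n + col] * multiplier;
        }
    }
}

// Merged LU (no pivoting): one pass per pivot, buffers swapped in between.
std::vector<float> lu_by_steps(std::vector<float> a, std::size_t n) {
    std::vector<float> next;
    for (std::size_t piv = 0; piv + 1 < n; ++piv) {
        pivot_step(a, next, n, piv);
        a.swap(next);
    }
    return a;
}
```

The result uses the same merged layout as the sequential decompose function above, so the two can be compared element by element.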
I'm trying to learn C++, so I'm trying to write a function that calculates the binomial coefficient. The code works up to n = 12; for larger values the generated result is incorrect. I'm grateful for your input.
long double binomial(int n, int k) {
    int d = n-k;
    int i = 1, t = 1, n1 = 1, n2 = 1;
    if (d == 0) {
        return 1;
    } else if (n==0) {
        return 1;
    } else {
        while (i <=n) {
            t *= i;
            if (i == d) {
                n1 = t;
                cout << t;
            }
            if (i == k) {
                n2 = t;
                cout << t;
            }
            i++;
        }
        return t/n1/n2;
    }
}
int main() {
    int n, k;
    cout << "Select an integer n: \n";
    cin >> n;
    cout << "Select an integer k: \n";
    cin >> k;
    long double v = binomial(n,k);
    cout << "The binomial coefficient is: " << v << "\n";
    return 0;
}
An int variable can only hold numbers up to a certain size. This varies from compiler to compiler and platform to platform but a typical limit would be around 2 billion. Your program is using numbers bigger than that so you get errors.
If you want to compute with big integers the answer is to get a big integer library. GMP is a popular one.
If int is 32 bits long on your system (very common nowadays), then the factorial of 13 doesn't fit into it (6227020800 > 2147483647).
Either transition to something bigger (unsigned long long, anyone?), or use a bigint library, or come up with a better/more clever algorithm that doesn't involve computing large factorials, at least not directly.
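A sketch of the "more clever algorithm" route, assuming a 64-bit unsigned result is acceptable. Instead of computing three factorials, it multiplies and divides step by step, so the running value is always itself a binomial coefficient and stays far smaller than n!. The signature below is an illustration, not a drop-in replacement for the function in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// C(n, k) via the multiplicative formula; no factorial is ever formed.
std::uint64_t binomial(unsigned n, unsigned k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;  // C(n, k) == C(n, n - k); use the shorter loop
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i) {
        const std::uint64_t factor = n - k + i;
        // The intermediate product must still fit in 64 bits.
        assert(result <= std::numeric_limits<std::uint64_t>::max() / factor);
        // After this step result == C(n - k + i, i), so the division is exact.
        result = result * factor / i;
    }
    return result;
}
```

For example, binomial(13, 6) returns 1716, a case where the factorial-based version already overflows a 32-bit int.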
One suggestion would be to use some other type.
Here is a list of integer types with their typical sizes and limits.
|type |size (B)|Limits |
|long long |8 |–9,223,372,036,854,775,808 to 9,223,372,036,854,775,807|
|unsigned long long |8 |0 to 18,446,744,073,709,551,615 |
|int |4 |–2,147,483,648 to 2,147,483,647 |
|unsigned int |4 |0 to 4,294,967,295 |
|short |2 |–32,768 to 32,767 |
|unsigned short |2 |0 to 65,535 |
|char |1 |–128 to 127 |
|unsigned char |1 |0 to 255 |
Note that long and int are the same size on many platforms (for example 64-bit Windows), but not on all of them: on 64-bit Linux, long is usually 8 bytes.
Note that those limits aren't the same on all architectures; the standard guarantees only the following relations between sizes (plus a minimum range for each type):
1 = sizeof(char) = sizeof(unsigned char)
sizeof(short) = sizeof(unsigned short) <= sizeof(int) = sizeof(unsigned int) <= sizeof(long) = sizeof(unsigned long) <= sizeof(long long) = sizeof(unsigned long long)
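Rather than relying on a table, the limits can be queried on the actual platform. The sketch below assumes C++14 or later; the helper name max_factorial_arg is made up for illustration. It checks the guaranteed size relations at compile time and uses std::numeric_limits to find the largest n whose factorial still fits in a given type.

```cpp
#include <cstdint>
#include <limits>

// The relations the standard does guarantee, checked at compile time.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(short) <= sizeof(int) && sizeof(int) <= sizeof(long) &&
              sizeof(long) <= sizeof(long long), "guaranteed size ordering");

// Hypothetical helper: largest n such that n! is representable in T.
template <typename T>
constexpr int max_factorial_arg() {
    T factorial = 1;
    int n = 1;
    // Multiply only while the next product cannot exceed the maximum of T.
    while (factorial <= std::numeric_limits<T>::max() / static_cast<T>(n + 1)) {
        ++n;
        factorial = static_cast<T>(factorial * n);
    }
    return n;
}

// A 32-bit int holds factorials up to 12!, which matches the question.
static_assert(max_factorial_arg<std::int32_t>() == 12, "12! fits, 13! does not");
```

With a 64-bit type the same helper reports 20, so switching to unsigned long long only postpones the overflow.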
Another option is to use a bigint library; the calculations will take more time, but the results will fit.