Improving speed of affine transformation of an array using intrinsics - c++

In performance-sensitive code, I have to perform an affine transformation of a vector:
Y=a*X+b
where Y and X are vectors and a and b are scalars.
As a quick-and-dirty way to improve the speed of the computation, I delegated the vectorization to OpenMP's #pragma omp simd directive.
Having some spare time lately, I tried to implement it directly using intrinsics, and got more or less the same performance as the OpenMP solution.
Is there a way to beat the OpenMP vectorization? I can use instructions up to AVX2.
The code below was tested under Windows 10, compiled with VS 2019.
#include <iostream>
#include <armadillo>
#include <chrono>
#include <immintrin.h>
#include <algorithm> // std::max
#include <stdexcept> // std::runtime_error
///Computes y=alpha*x+beta
inline void SumAndSetOmp(
arma::Col<double>& y /**< Result*/,
const arma::Col<double>& x /**< Input*/,
const double& alpha /**< Coefficient*/,
const double& beta /**< Offset*/)
{
auto* __restrict lhs = y.memptr();
const auto* __restrict add_rhs = x.memptr();
const auto& n = x.n_elem;
#pragma omp simd
for (arma::uword i = 0; i < n; ++i)
{
lhs[i] = add_rhs[i] * alpha + beta;
}
}
inline void SumAndSetSerial(
arma::Col<double>& y /**< Result*/,
const arma::Col<double>& x /**< Input*/,
const double& alpha /**< Coefficient*/,
const double& beta /**< Offset*/)
{
auto* lhs = y.memptr();
const auto* add_rhs = x.memptr();
const auto& n = x.n_elem;
for (arma::uword i = 0; i < n; ++i)
{
lhs[i] = add_rhs[i] * alpha + beta;
}
}
inline void SumAndSetAVX(arma::Col<double>& y /**< Result*/,
const arma::Col<double>& x /**< Input*/,
const double& alpha /**< Coefficient*/,
const double& beta /**< Offset*/)
{
//Allocate coefficients
const auto alphas = _mm256_set1_pd(alpha);
const auto betas = _mm256_set1_pd(beta);
//Extracting memory addresses
auto* __restrict pos_lhs = y.memptr();
const auto* __restrict pos_rhs = x.memptr();
//Computing sizes
const unsigned int length_array = 4;
const unsigned long long n_aligned = x.n_elem / length_array;
const unsigned int remainder = x.n_elem % length_array;
//Performing AVX instruction
for (unsigned long long i = 0; i < n_aligned; i++) {
const __m256d x_avx = _mm256_loadu_pd(pos_rhs);
const __m256d y_avx = _mm256_fmadd_pd(x_avx, alphas, betas);
_mm256_storeu_pd(pos_lhs, y_avx);
pos_rhs += length_array;
pos_lhs += length_array;
}
//Process the rest serially
for (unsigned int i = 0; i < remainder; i++) {
pos_lhs[i] = alpha * pos_rhs[i] + beta;
}
}
enum method
{
serial,
omp,
avx
};
arma::vec perform_test(const arma::vec& x, const method mtd, int trials = 100, const double alpha = 3.0, const double beta = 5.0)
{
arma::Col<double> res(x.n_elem);
const auto beg = std::chrono::steady_clock::now();
switch (mtd) {
case serial:
for (int i = 0; i < trials; i++)
SumAndSetSerial(res, x, alpha, beta);
break;
case omp:
for (int i = 0; i < trials; i++)
SumAndSetOmp(res, x, alpha, beta);
break;
case avx:
for (int i = 0; i < trials; i++)
SumAndSetAVX(res, x, alpha, beta);
break;
}
std::cout << "time:" << std::chrono::duration<double>(std::chrono::steady_clock::now() - beg).count() << "s\n";
return res;
}
//Benchmarking
double test_fun(long long int n,int trials=100, const double alpha = 3.0, const double beta = 5.0)
{
const arma::Col<double> x(n, arma::fill::randn);
const arma::Col<double> reference = alpha*x + beta;
std::cout << "Serial: ";
const auto res_serial = perform_test(x, method::serial, trials, alpha, beta);
std::cout << "OMP: ";
const auto res_omp = perform_test(x, method::omp, trials, alpha, beta);
std::cout << "AVX: ";
const auto res_avx = perform_test(x, method::avx, trials, alpha, beta);
// errors wrt the reference
const double err_serial = arma::max(arma::abs(reference - res_serial));
const double err_avx = arma::max(arma::abs(reference - res_avx));
const double err_omp = arma::max(arma::abs(reference - res_omp));
//Largest error
const double error = std::max(std::max(err_serial, err_avx), err_omp);
if (error> 1e-6)
{
throw std::runtime_error("Something is wrong!");
}
return error;
}
int main()
{
test_fun(10000000);
}

I had a go at improving on your solutions, but I think they are optimal, so you're best off sticking with OpenMP.
The following is speculative and goes beyond your question:
If it's an option, you could try OpenMP's multithreading. I'm always surprised at how few elements are needed before you get a reasonable boost, though I expect 1000 elements with a simple affine transform would be too few. If other parts of your algorithm are parallelisable, this is more likely to be helpful.
If you can afford to change your problem and don't need double precision, you could work with floats: a 256-bit AVX register holds eight floats but only four doubles.
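The multithreading suggestion could look roughly like the sketch below. It is an illustration, not a tuned implementation: affine_parallel is a hypothetical name, a plain std::vector<double> stands in for arma::Col<double> to keep the example self-contained, and it assumes OpenMP is enabled at compile time (without it the pragma is ignored and the loop simply runs serially).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: y = alpha * x + beta, with the loop split across OpenMP threads.
// If OpenMP is not enabled, the pragma is ignored and the loop runs on one thread.
inline void affine_parallel(std::vector<double>& y, const std::vector<double>& x,
                            double alpha, double beta)
{
    y.resize(x.size());
    double* out = y.data();
    const double* in = x.data();
    // Signed index: older OpenMP implementations only accept signed loop counters.
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    // Each thread receives one contiguous chunk; the body can still be vectorized.
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        out[i] = in[i] * alpha + beta;
    }
}
```

Whether this beats the single-threaded SIMD loop depends on the array size: the operation is memory-bound, so for small inputs the cost of starting the threads is likely to dominate.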

Related

Why is multi-threading of matrix calculation not faster than single-core?

This is my first time using multi-threading to speed up a heavy calculation.
Background: The idea is to calculate a kernel covariance matrix by reading a list of 3D points x_test and calculating the corresponding matrix, which has dimensions x_test.size() x x_test.size().
I already sped up the calculations by only calculating the lower triangular matrix. Since all the calculations are independent of each other, I tried to speed up the process (x_test.size() = 27000 in my case) by splitting the calculation of the matrix entries row-wise, assigning a range of rows to each thread.
On a single core the calculations took about 280 seconds each time; on 4 cores they took 270-290 seconds.
main.cpp
int main(int argc, char *argv[]) {
double sigma0sq = 1;
double lengthScale [] = {0.7633, 0.6937, 3.3307e+07};
const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i=1; i<x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
/* Spreading calculations to multiple threads */
std::vector<std::thread> threads;
for(std::size_t i = 1; i < indices.size(); ++i){
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
}
for(auto & th: threads){
th.join();
}
return 0;
}
As you can see, each thread performs the following calculations on the data assigned to it:
void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
char buffer[8192];
std::ofstream out("lower_half_matrix_" + std::to_string(threadCounter) +".csv");
out.rdbuf()->pubsetbuf(buffer, sizeof(buffer)); // size must match the 8192-byte buffer
for(int i = start; i < stop; ++i){
for(int j = 0; j < i+1; ++j){
double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
if (j!=0)
out << ',';
out << kij;
}
if(i!=xtest.size()-1 )
out << '\n';
}
out.close();
}
and
double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
double sum(0);
for(std::size_t i=0; i<x1.size();i++){
sum += pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
}
return sigma0sq*exp(-0.5*sum);
}
Aspects I considered
locking by simultaneous access to data vector -> I don't pass a reference to the threads, but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know this should prevent simultaneous data access since every thread has its own copy
Output -> every thread writes its part of the lower triangular matrix to its own file. My task manager doesn't indicate a full SSD utilization in the slightest
Compiler and machine
Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)
There are many details that can be improved in your code, but I think the two biggest issues are:
using vectors of vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.
The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:
using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;
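If the CSV parser still returns std::vector<std::vector<double>>, the data can be converted once after loading. This is a minimal sketch, assuming every row holds exactly three values; toPoints is a hypothetical helper and not part of the original code:

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Point3D = std::array<double, 3>;

// Hypothetical helper: copies nested rows into one contiguous block of Point3D.
// Throws if a row does not hold exactly three coordinates.
std::vector<Point3D> toPoints(const std::vector<std::vector<double>>& rows)
{
    std::vector<Point3D> points;
    points.reserve(rows.size());
    for (const auto& row : rows) {
        if (row.size() != 3) {
            throw std::invalid_argument("expected exactly 3 coordinates per point");
        }
        points.push_back({row[0], row[1], row[2]});
    }
    return points;
}
```

After the conversion, all coordinates live in a single allocation instead of one heap block per point.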
The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.
Here is a way to do exactly that:
void calculateKMatrixCpp(
std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
// ...
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
// ...
}
// ...
threads.push_back(std::thread(
calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
i, indices[i-1], indices[i], std::ref(kMatrix)
));
Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.
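To illustrate why no synchronization is needed, here is a reduced sketch in which each thread fills its own range of rows of a shared row-major buffer. fillRows and fillMatrix are hypothetical names, the kernel is replaced by the placeholder value i * n + j so the result is easy to check, and the rows are split evenly rather than balanced by element count:

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical worker: writes rows [start, stop) of an n x n row-major matrix.
// A placeholder value is stored instead of a real kernel result.
void fillRows(std::size_t n, std::size_t start, std::size_t stop, std::vector<double>& m)
{
    for (std::size_t i = start; i < stop; ++i)
        for (std::size_t j = 0; j <= i; ++j)   // lower triangle only
            m[i * n + j] = static_cast<double>(i * n + j);
}

// Splits the rows into numThreads contiguous ranges. The ranges never overlap,
// so no two threads write to the same element and no mutex is required.
std::vector<double> fillMatrix(std::size_t n, std::size_t numThreads)
{
    std::vector<double> m(n * n, 0.0);         // sized once, before any thread starts
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        const std::size_t start = t * n / numThreads;
        const std::size_t stop = (t + 1) * n / numThreads;
        threads.emplace_back(fillRows, n, start, stop, std::ref(m));
    }
    for (auto& th : threads)
        th.join();
    return m;
}
```

The important detail is that the buffer is never resized while the threads run; only the element values change.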
Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.
Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.
If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.
#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>
using Point3D = std::array<double, 3>;
auto generateSampleData() -> std::vector<Point3D> {
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i) {
data.push_back({ d(g), d(g), d(g) });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance*distance;
}
return sigma0sq * std::exp(-0.5*sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
std::cout << "start of thread " << threadCounter << "\n" << std::flush;
for(int i = start; i < stop; ++i) {
for(int j = 0; j < i+1; ++j) {
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
}
}
std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = {0.7633, 0.6937, 3.3307e+07};
const std::vector<Point3D> x_test = generateSampleData();
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i = 1; i < x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);
std::vector<std::thread> threads;
for (std::size_t i = 1; i < indices.size(); ++i) {
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;
start = std::chrono::system_clock::now();
constexpr int buffer_size = 131072;
char buffer[buffer_size];
std::ofstream out("matrix.csv");
out.rdbuf()->pubsetbuf(buffer, buffer_size);
for (int i = 0; i < x_test.size(); ++i) {
for (int j = 0; j < i + 1; ++j) {
if (j != 0) {
out << ',';
}
out << kMatrix[i * x_test.size() + j];
}
if (i != x_test.size() - 1) {
out << '\n';
}
}
end = std::chrono::system_clock::now();
elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}
OK, I've written an implementation with optimized formatting.
On my system, @Nelfeal's code took around 250 seconds to complete, with the write time dominating by far. Or rather, the std::ofstream formatting takes most of the time.
I've written a C++20 version using std::format_to/std::format. It is a multi-threaded version that takes around 25-40 seconds to complete all the computations, formatting, and writing. Run in a single thread, it takes around 70 seconds on my system. The same performance should be achievable with the fmt library on C++11/14/17.
Here is the code:
import <vector>;
import <thread>;
import <iostream>;
import <string>;
import <cmath>;
import <array>;
import <random>;
import <fstream>;
import <chrono>;
import <format>;
import <filesystem>;
using Point3D = std::array<double, 3>;
auto generateSampleData(Point3D scale) -> std::vector<Point3D>
{
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i)
{
data.push_back({ d(g)* scale[0], d(g)* scale[1], d(g)* scale[2] });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance * distance;
}
return sigma0sq * std::exp(-0.5 * sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::filesystem::path localPath)
{
using namespace std::string_view_literals;
std::vector<char> buffer;
buffer.reserve(15'000);
std::ofstream out(localPath);
std::cout << std::format("starting thread {}: from {} to {}\n"sv, threadCounter, start, stop);
for (int i = start; i < stop; ++i)
{
for (int j = 0; j < i; ++j)
{
double kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}, "sv, kij);
}
double kii = seKernel(xtest[i], xtest[i], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}\n"sv, kii);
out.write(buffer.data(), buffer.size());
buffer.clear();
}
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = { 0.7633, 0.6937, 3.3307e+07 };
const std::vector<Point3D> x_test = generateSampleData(lengthScale);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size() * (x_test.size()+1) / 2;
const int numThreads = 3;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for (std::size_t i = 1; i < x_test.size() + 1; ++i) {
int prod = i * (i + 1) / 2 - j * (j + 1) / 2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if (indices.size() == numThreads - 1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<std::thread> threads;
using namespace std::string_view_literals;
for (std::size_t i = 1; i < indices.size(); ++i)
{
threads.push_back(std::thread(calculateKMatrixCpp, std::ref(x_test), lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::format("./matrix_{}.csv"sv, i-1)));
}
for (auto& t : threads)
{
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start);
std::cout << std::format("total elapsed time: {}"sv, elapsed_seconds);
return 0;
}
Note: I used 6 digits of precision here as it is the default for std::ofstream. More digits means more writing time to disk and lower performance.
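To get a feel for how the precision affects the amount of text that has to be written, the length of a formatted value can be measured directly. This is a small sketch; formattedLength is a hypothetical helper, and printf-style %g is used here as a stand-in for the {:.6g} format specifier:

```cpp
#include <cstdio>

// Hypothetical helper: number of characters that "%.<digits>g" produces for value.
// snprintf with a null buffer and size 0 only reports the required length.
int formattedLength(double value, int digits)
{
    return std::snprintf(nullptr, 0, "%.*g", digits, value);
}
```

For 1.0 / 3.0 this reports 8 characters at 6 digits and 19 characters at 17 digits, so a full-precision file would be more than twice as large for values like that.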

How to optimize my C++ OpenMp Matrix Multiplication code

I have written a C++ OpenMP matrix multiplication code that multiplies two 1000x1000 matrices.
So far I have gotten a 0.700 sec execution time using OpenMP, but I want to see if there are other ways I can make it faster using OpenMP.
I appreciate any advice or tips you can give me.
Here is my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
void Multiply()
{
//initialize matrices with random numbers
int aMatrix[1000][1000], i, j;
for( i = 0; i < 1000; ++i)
{for( j = 0; j < 1000; ++j)
{aMatrix[i][j] = rand();}
}
int bMatrix[1000][1000], i1, j2;
for( i1 = 0; i1 < 1000; ++i1)
{for( j2 = 0; j2 < 1000; ++j2)
{bMatrix[i1][j2] = rand();}
}
//Result Matrix
int product[1000][1000] = {0};
for (int row = 0; row < 1000; row++) {
for (int col = 0; col < 1000; col++) {
// Multiply the row of A by the column of B to get the row, column of product.
for (int inner = 0; inner < 1000; inner++) {
product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];
}
}
}
}
int main() {
time_t begin, end;
time(&begin);
Multiply();
time(&end);
time_t elapsed = end - begin;
cout << "Time measured: " << elapsed << " seconds.\n"; // the printf-style call only printed elapsed
return 0;
}
The following things can be done for speedup:
1. Use OpenMP to parallelize the outer loop, like you did (and like I also did in my code below). Alternatively, use std::async for multi-threading, as was done in another answer.
2. Transpose the B matrix. This increases L1 cache hits, because each B column (a row in the transposed variant) is then read from sequential memory.
3. Use vectorized SIMD instructions, which perform several multiplications (and additions) with a single instruction. Compilers often auto-vectorize such loops well enough without your help, but I did it explicitly in my code.
4. Run several independent SIMD instructions within the loop. This helps to keep the whole SIMD pipeline of the CPU busy. I did so in my code by using four SIMD registers r0, r1, r2, r3. In compilers this is usually called loop unrolling.
5. Align the starting address of your matrix on a 64-byte boundary. This lets SIMD instructions do fast aligned reads and writes.
6. Align the starting address of each matrix row on a 64-byte boundary. I did this in my code by padding each row with zeros up to a multiple of 64 bytes. This also lets SIMD instructions do fast aligned reads and writes.
In the code below I applied all of steps 1-6 above. I achieved the 64-byte memory alignment by implementing AlignmentAllocator, which is used with std::vector. I also took time measurements for float/double/int.
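Steps 2 and 4 can be illustrated without intrinsics. The sketch below is a simplified scalar version, not the SIMD code of this answer: mulTransposed is a hypothetical name, the matrices are row-major std::vector<int>, and four independent accumulators stand in for the four SIMD registers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: C = A * B for row-major matrices,
// where A is rows x inner and B is inner x cols.
std::vector<int> mulTransposed(const std::vector<int>& A, const std::vector<int>& B,
                               std::size_t rows, std::size_t inner, std::size_t cols)
{
    // Step 2: transpose B so that the inner loop reads both operands sequentially.
    std::vector<int> Bt(cols * inner);
    for (std::size_t k = 0; k < inner; ++k)
        for (std::size_t j = 0; j < cols; ++j)
            Bt[j * inner + k] = B[k * cols + j];

    std::vector<int> C(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        const int* a = A.data() + i * inner;
        for (std::size_t j = 0; j < cols; ++j) {
            const int* b = Bt.data() + j * inner;
            // Step 4: four independent partial sums, so consecutive
            // multiply-adds do not have to wait for each other.
            int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
            std::size_t k = 0;
            for (; k + 4 <= inner; k += 4) {
                s0 += a[k] * b[k];
                s1 += a[k + 1] * b[k + 1];
                s2 += a[k + 2] * b[k + 2];
                s3 += a[k + 3] * b[k + 3];
            }
            int sum = s0 + s1 + s2 + s3;
            for (; k < inner; ++k)             // leftover elements
                sum += a[k] * b[k];
            C[i * cols + j] = sum;
        }
    }
    return C;
}
```

An optimizing compiler may already vectorize this form on its own; the explicit intrinsics give control over alignment and register usage instead of leaving it to the optimizer.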
On my old 4-core laptop I got the following time measurements for multiplying a 1000x1000 matrix by a 1000x1000 matrix:
float: time 0.1569 sec
double: time 0.3168 sec
int: time 0.1565 sec
To put my hardware in perspective, I also timed the other answer, by @doug, for the int case:
Threads w transpose 0.2164 secs.
As one can see, my solution is about 1.4 times faster than the other answer, I guess due to the 64-byte memory alignment and maybe due to using explicit SIMD (instead of relying on compiler auto-vectorization of a loop).
To compile my program, don't forget to add the -fopenmp -lgomp options (for OpenMP support) and the -march=native -O3 -std=c++20 options (for SIMD support, optimizations and the language standard) if you're compiling with GCC/Clang. MSVC does not enable OpenMP automatically either: it needs the /openmp option, otherwise the pragma is ignored (use /O2 /GL /std:c++latest for optimizations and the standard in MSVC).
In my code I only implemented SSE2/SSE4/AVX/AVX2 instructions for SIMD. If you have a more powerful machine, tell me and I will also implement FMA/AVX-512, which could give up to twice the speed again.
My Mul() function is quite generic: it is templated, and you just pass pointers to the matrices and the row/column counts, so the calling side may store the matrices in any way (std::vector, std::array or a plain 2D array).
At the start of the Run() function you may change the number of rows and columns if you need a bigger test. Note that all my functions are meant to support any number of rows and columns; you may even multiply a matrix of size 1234x2345 by one of size 2345x3456.
#include <cstdint>
#include <cstdlib> // std::aligned_alloc, std::free
#include <cstring>
#include <cmath> // std::abs for float/double
#include <stdexcept>
#include <iostream>
#include <iomanip>
#include <vector>
#include <memory>
#include <string>
#include <immintrin.h>
#define USE_OPENMP 1
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#if defined(_MSC_VER)
#define IS_MSVC 1
#else
#define IS_MSVC 0
#endif
#if USE_OPENMP
#include <omp.h>
#endif
template <typename T, std::size_t N>
class AlignmentAllocator {
public:
typedef T value_type;
typedef std::size_t size_type;
typedef std::ptrdiff_t difference_type;
typedef T * pointer;
typedef const T * const_pointer;
typedef T & reference;
typedef const T & const_reference;
public:
inline AlignmentAllocator() throw() {}
template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
inline ~AlignmentAllocator() throw() {}
inline pointer adress(reference r) { return &r; }
inline const_pointer adress(const_reference r) const { return &r; }
inline pointer allocate(size_type n);
inline void deallocate(pointer p, size_type);
inline void construct(pointer p, const value_type & wert);
inline void destroy(pointer p) { p->~value_type(); }
inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};
template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
#if IS_MSVC
auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
#else
auto p = (pointer)std::aligned_alloc(N, (n * sizeof(value_type) + N - 1) / N * N); // round the size up to a multiple of the alignment
#endif
ASSERT(p);
return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
#if IS_MSVC
_aligned_free(p);
#else
std::free(p);
#endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
new (p) value_type(wert);
}
template <typename T>
using AlignedVector = std::vector<T, AlignmentAllocator<T, 64>>;
template <typename T>
struct RegT;
#ifdef __AVX__
template <> struct RegT<float> { static size_t constexpr bisize = 256; using type = __m256; static type zero() { return _mm256_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 256; using type = __m256d; static type zero() { return _mm256_setzero_pd(); } };
inline void MulAddReg(float const * a, float const * b, __m256 & c) {
c = _mm256_add_ps(c, _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m256d & c) {
c = _mm256_add_pd(c, _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b)));
}
inline void StoreReg(float * dst, __m256 const & src) { _mm256_store_ps(dst, src); }
inline void StoreReg(double * dst, __m256d const & src) { _mm256_store_pd(dst, src); }
#else // SSE2
template <> struct RegT<float> { static size_t constexpr bisize = 128; using type = __m128; static type zero() { return _mm_setzero_ps(); } };
template <> struct RegT<double> { static size_t constexpr bisize = 128; using type = __m128d; static type zero() { return _mm_setzero_pd(); } };
inline void MulAddReg(float const * a, float const * b, __m128 & c) {
c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
}
inline void MulAddReg(double const * a, double const * b, __m128d & c) {
c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
}
inline void StoreReg(float * dst, __m128 const & src) { _mm_store_ps(dst, src); }
inline void StoreReg(double * dst, __m128d const & src) { _mm_store_pd(dst, src); }
#endif
#ifdef __AVX2__
template <> struct RegT<int32_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
inline void MulAddReg(int32_t const * a, int32_t const * b, __m256i & c) {
c = _mm256_add_epi32(c, _mm256_mullo_epi32(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m256i & c) {
// c = _mm256_add_epi64(c, _mm256_mullo_epi64(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
//}
inline void StoreReg(int32_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
#else // SSE2
template <> struct RegT<int32_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
//template <> struct RegT<int64_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
inline void MulAddReg(int32_t const * a, int32_t const * b, __m128i & c) {
c = _mm_add_epi32(c, _mm_mullo_epi32(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
}
//inline void MulAddReg(int64_t const * a, int64_t const * b, __m128i & c) {
// c = _mm_add_epi64(c, _mm_mullo_epi64(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
//}
inline void StoreReg(int32_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
//inline void StoreReg(int64_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
#endif
template <typename T>
void Mul(T const * A0, size_t A_rows, size_t A_cols, T const * B0, size_t B_rows, size_t B_cols, T * C) {
size_t constexpr reg_cnt = RegT<T>::bisize / 8 / sizeof(T), block = 4 * reg_cnt;
ASSERT(A_cols == B_rows);
size_t const A_cols_aligned = (A_cols + block - 1) / block * block, B_rows_aligned = (B_rows + block - 1) / block * block;
// Copy aligned A
AlignedVector<T> Av(A_rows * A_cols_aligned);
for (size_t i = 0; i < A_rows; ++i)
std::memcpy(&Av[i * A_cols_aligned], &A0[i * A_cols], sizeof(Av[0]) * A_cols);
T const * A = Av.data();
// Transpose B
AlignedVector<T> Bv(B_cols * B_rows_aligned);
for (size_t j = 0; j < B_cols; ++j)
for (size_t i = 0; i < B_rows; ++i)
Bv[j * B_rows_aligned + i] = B0[i * B_cols + j];
T const * Bt = Bv.data();
ASSERT(uintptr_t(A) % 64 == 0 && uintptr_t(Bt) % 64 == 0);
ASSERT(uintptr_t(&A[A_cols_aligned]) % 64 == 0 && uintptr_t(&Bt[B_rows_aligned]) % 64 == 0);
// Multiply
#pragma omp parallel for
for (size_t i = 0; i < A_rows; ++i) {
// Aligned Reg storage
AlignedVector<T> Regs(block);
for (size_t j = 0; j < B_cols; ++j) {
T const * Arow = &A[i * A_cols_aligned + 0], * Btrow = &Bt[j * B_rows_aligned + 0];
using Reg = typename RegT<T>::type;
Reg r0 = RegT<T>::zero(), r1 = RegT<T>::zero(), r2 = RegT<T>::zero(), r3 = RegT<T>::zero();
size_t const k_hi = A_cols - A_cols % block;
for (size_t k = 0; k < k_hi; k += block) {
MulAddReg(&Arow[k + reg_cnt * 0], &Btrow[k + reg_cnt * 0], r0);
MulAddReg(&Arow[k + reg_cnt * 1], &Btrow[k + reg_cnt * 1], r1);
MulAddReg(&Arow[k + reg_cnt * 2], &Btrow[k + reg_cnt * 2], r2);
MulAddReg(&Arow[k + reg_cnt * 3], &Btrow[k + reg_cnt * 3], r3);
}
StoreReg(&Regs[reg_cnt * 0], r0);
StoreReg(&Regs[reg_cnt * 1], r1);
StoreReg(&Regs[reg_cnt * 2], r2);
StoreReg(&Regs[reg_cnt * 3], r3);
T sum1 = 0, sum2 = 0, sum3 = 0;
for (size_t k = 0; k < Regs.size(); ++k)
sum1 += Regs[k];
//for (size_t k = 0; k < A_cols - A_cols % block; ++k) sum3 += Arow[k] * Btrow[k];
for (size_t k = k_hi; k < A_cols; ++k)
sum2 += Arow[k] * Btrow[k];
C[i * B_cols + j] = sum2 + sum1; // C has B_cols columns, so its row stride is B_cols
}
}
}
#include <random>
#include <thread>
#include <chrono>
#include <type_traits>
template <typename T>
void Test(T const * A, size_t A_rows, size_t A_cols, T const * B, size_t B_rows, size_t B_cols, T const * C, T eps) {
for (size_t i = 0; i < A_rows / 16; ++i)
for (size_t j = 0; j < B_cols / 16; ++j) {
T sum = 0;
for (size_t k = 0; k < A_cols; ++k)
sum += A[i * A_cols + k] * B[k * B_cols + j];
ASSERT_MSG(std::abs(C[i * B_cols + j] - sum) <= eps * A_cols, "i " + std::to_string(i) + " j " + std::to_string(j) +
" C " + std::to_string(C[i * B_cols + j]) + " ref " + std::to_string(sum));
}
}
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
template <typename T>
void Run() {
size_t constexpr A_rows = 1000, A_cols = 1000, B_rows = 1000, B_cols = 1000;
std::string const tname = std::is_same_v<T, float> ? "float" : std::is_same_v<T, double> ?
"double" : std::is_same_v<T, int32_t> ? "int" : "<unknown>";
bool const is_int = tname == "int";
std::mt19937_64 rng{123};
std::vector<T> A(A_rows * A_cols), B(B_rows * B_cols), C(A_rows * B_cols);
for (size_t i = 0; i < A.size(); ++i)
A[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
for (size_t i = 0; i < B.size(); ++i)
B[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
auto tim = Time();
Mul(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0]);
tim = Time() - tim;
std::cout << std::setw(6) << tname << ": time " << std::fixed << std::setprecision(4) << tim << " sec" << std::endl;
Test(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0], tname == "float" ? T(1e-7) : tname == "double" ? T(1e-15) : T(0));
}
int main() {
try {
#if USE_OPENMP
omp_set_num_threads(std::thread::hardware_concurrency());
#endif
Run<float>();
Run<double>();
Run<int32_t>();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
}
}
Output:
float: time 0.1569 sec
double: time 0.3168 sec
int: time 0.1565 sec
Here's straight C++ code that runs in 0.08 s with ints and 0.14 s with floats or doubles. My system is 10 years old with relatively slow memory. Good at the time, but now is now.
I agree with @VictorEijkhout that the best results would come from tuned library code. A huge amount of work has gone into optimizing those libraries.
#include <vector>
#include <array>
#include <random>
#include <cassert>
#include <iostream>
#include <iomanip>
#include <thread>
#include <future>
#include <chrono>
struct Timer {
std::chrono::system_clock::time_point snapTime;
Timer() { snapTime = std::chrono::system_clock::now(); }
operator double() { return std::chrono::duration<double>(std::chrono::system_clock::now() - snapTime).count(); }
};
using DataType = int;
using std::array, std::vector;
constexpr int N = 1000, THREADS = 12;
static auto launchType = std::launch::async;
using Matrix = vector<array<DataType, N>>;
Matrix create_matrix() { return Matrix(N); };
Matrix product(Matrix const& v0, Matrix const& v1, double& time)
{
Matrix ret = create_matrix();
Matrix v2 = create_matrix();
Timer timer;
for (size_t r = 0; r < N; r++) // transpose first
for (size_t c = 0; c < N; c++)
v2[c][r] = v1[r][c];
// lambda to process sets of rows in separate threads
auto do_row_set = [&v0, &v2, &ret](size_t start, size_t last) {
for (size_t row = start; row < last; row++)
for (size_t col = 0; col < N; col++)
{
DataType tmp{}; // separate tmp variable significantly improves optimization
for (size_t col_t = 0; col_t < N; col_t++)
tmp += v0[row][col_t] * v2[col][col_t];
ret[row][col] = tmp;
}
};
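vector<size_t> seq;
// Row boundaries for THREADS contiguous ranges that together cover all N rows;
// the first N % THREADS ranges get one extra row, so no trailing rows are skipped.
const size_t NN = N / THREADS, extra = N % THREADS;
seq.push_back(0);
for (size_t thread_n = 0; thread_n < THREADS; thread_n++)
seq.push_back(seq.back() + NN + (thread_n < extra ? 1 : 0));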
vector<std::future<void>> results; results.reserve(THREADS);
for (size_t i = 0; i < THREADS; i++)
results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
for (auto& x : results)
x.get();
time = timer;
return ret;
}
bool operator==(Matrix const& v0, Matrix const& v1)
{
for (size_t r = 0; r < N; r++)
for (size_t c = 0; c < N; c++)
if (v0[r][c] != v1[r][c])
return false;
return true;
}
int main()
{
auto fill = [](Matrix& v) {
std::mt19937_64 r(1);
std::normal_distribution dist(1.);
for (size_t row = 0; row < N; row++)
for (size_t col = 0; col < N; col++)
v[row][col] = DataType(dist(r));
};
Matrix m1 = create_matrix(), m2 = create_matrix(), m3 = create_matrix();
fill(m1); fill(m2);
auto process_test = [&m1, &m2](Matrix& out) {
const int rpt_count = 4;
double sum = 0;
for (int i = 0; i < rpt_count; i++)
{
double time;
out = product(m1, m2, time);
sum += time / rpt_count;
}
return sum;
};
std::cout << std::fixed << std::setprecision(4);
double time{};
time = process_test(m3);
std::cout << "Threads w transpose " << time << " secs.\n";
}

Memory leakage C++ threading

I have a problem, probably a memory leak, in C++ threads: the program fails at run time with error code 11. I am writing an optimization algorithm that tunes the parameters of 2D reactors. It launches instances of a reforming function, each of which creates a Reformer object. A reformer has 2 parameters that can vary locally within it; they are passed to the reforming function from the main function. Specifically, each reformer is divided into a fixed number of zones (same dimensions and locations in every reformer), and each zone can have different parameter values. The size of each of the 2 vectors is therefore [NUMBER OF REFORMERS] * [NUMBER OF ZONES]. The reforming function then creates Segment objects, one per zone.
I assume the issue is that the threads try to access the same vector simultaneously, and I would really appreciate a solution.
Remarks:
If I change the main.cpp to substitute the threads with a usual loop, no error is returned.
If I comment out the setProp method in the set_segments functions, no error is returned (with threads).
Threads are highly desirable here because a single Reformer takes a long time to compute and I have access to multi-core computing units.
To clarify, I will explain everything with a minimal reproducible example:
input.h:
#include <iostream>
#include <fstream>
#include <vector>
#include <thread>
int reactor_no = 2; // number of reformers
int zones_X = 5; // number of zones in a single reformer, X direction
int zones_Y = 2; // number of zones in a single reformer, Y direction
double dim_X = 0.5; // reactor's length
double dim_Y = 0.2; // reactor's height
double wall_t = 0.1; // thickness of the reactor wall
size_t zones = zones_X * zones_Y;
Reformer.h:
#include "input.h"
class Reformer {
public:
Reformer() {}
Reformer(const double& L, const double& Y, const double& wall_t,
const int& zones_X = 1, const int& zones_Y = 1) {
length_ = L;
height_ = Y;
zonesX_ = zones_X;
zonesY_ = zones_Y;
wall_thickness_ = wall_t;
dx_ = length_ / static_cast<double> (zonesX_);
dr_ = height_ / static_cast<double> (zonesY_);
}
private:
double wall_thickness_; // wall thickness (m)
double length_; // reactor length (m)
double height_; // reactor height (m) (excluding wall thickness)
int zonesX_; // number of segments in the X direction
int zonesY_; // number of segments in the Y direction
double dx_; // segment width (m)
double dr_; // segment height (m)
};
Segment.h:
#include "input.h"
class Segment{
public:
Segment() : Segment(0, 0) {}
Segment(int i, int j) {
i_ = i;
j_ = j;
}
void setXR(const double& dx, const double& dr, const int& SL, const int& SR) {
x0_ = i_ * dx;
x1_ = x0_ + dx;
r0_ = j_ * dr;
r1_ = r0_ + dr;
if (i_ == SL - 1) {
x1_ = length;
}
if (j_ == SR - 1) {
r1_ = radius;
}
}
void setWall() {
x0_ = 0;
x1_ = length;
r0_ = radius;
r1_ = radius + wall_t;
}
void setProp(const double& por, const double& por_s, const bool& cat) {
porosity_ = por;
catalyst_ = cat;
}
private:
size_t i_; //segment column no.
size_t j_; //segment row no.
double x0_; //beginning of segment - x coordinate (m)
double x1_; //ending of segment - x coordinate (m)
double r0_; //beginning of segment - r coordinate (m)
double r1_; //ending of segment - r coordinate (m)
int catalyst_; //1 - catalytic, 0 - non-catalytic
double porosity_; //porosity (-)
};
main.cpp:
#include "input.h"
int main() {
int zones = zones_X * zones_Y;
size_t pop_size = reactor_no * zones;
std::vector<int> cat;
cat.reserve(pop_size);
std::vector<double> porosity;
porosity.reserve(pop_size); // the values in the vectors are not important, therefore I will just fill them with 1s
for (int i = 0; i < pop_size; i++) {
cat[i] = 1;
porosity[i] = 1.0;
}
std::vector<std::thread> Ref;
Ref.reserve(reactor_no);
for (k = 0; k < reactor_no; k++) {
Ref.emplace_back(reforming, k, cat, porosity);
}
for (auto &X : Ref) { X.join(); }
}
reforming.cpp:
#include "input.h"
void reforming(const int m, const std::vector<int>& cat_check, const std::vector<double>& por) {
Reformer reactor(length, radius, wall_t, zonesX, zonesY);
std::vector<Segment> seg; // vector holding segment objects
seg.reserve(zones);
set_segments(seg, reactor, zones, m, por, por_s, cat_check);
}
set_segments function:
#include "input.h"
void set_segments(std::vector<Segment> &seg, Reformer &reac, const int m,
const std::vector<double> &por, const std::vector<int> &check) {
int i, j, k, n;
double dx = dim_X / static_cast<double> (zones_X);
double dy = dim_Y / static_cast<double> (zones_Y);
std::vector<Segment*> ptr_seg;
ptr_seg.reserve(zones);
k = 0;
for (i = 0; i < zones_X; i++) {
for (j = 0; j < zones_Y; j++) {
n = m * zones + (i * zones_Y + j);
seg.emplace_back(Segment(i, j));
seg[k].setProp(por[n], check[n]);
seg[k].setXR(dx, dy, zones_X, zones_Y);
k++;
}
}
}
Adding std::ref() to the reforming function call parameters solved the problem.
for (k = 0; k < spec_max; k++) {
Ref.emplace_back(reforming, k, std::ref(cat), std::ref(porosity));
}
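A plausible reading of why std::ref helps, offered as a sketch rather than a full diagnosis: std::thread copies its arguments unless they are wrapped in std::ref or std::cref, and reserve() only allocates capacity without creating elements. A copy of a vector that was only reserved is therefore empty, so indexing it inside the thread reads out of bounds. With std::ref the thread sees the caller's buffer, which still holds the written values, but writing through operator[] past size() remains undefined behavior. The helper names below (count_elements, size_seen_in_thread, reserved_only, resized) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical worker: records how many elements of v are visible inside the thread.
void count_elements(const std::vector<double>& v, std::size_t& seen) {
    seen = v.size();
}

// Launches the worker either with a copy of v (the default std::thread behavior)
// or with a reference to the caller's vector (std::cref).
std::size_t size_seen_in_thread(const std::vector<double>& v, bool by_reference) {
    std::size_t seen = 0;
    if (by_reference) {
        std::thread t(count_elements, std::cref(v), std::ref(seen));
        t.join();
    } else {
        std::thread t(count_elements, v, std::ref(seen));  // v is copied for the thread
        t.join();
    }
    return seen;
}

// reserve() only allocates capacity: size() stays 0, so v[i] = ... would be undefined behavior.
std::vector<double> reserved_only(std::size_t n) {
    std::vector<double> v;
    v.reserve(n);
    return v;
}

// resize() (or the sizing constructor) actually creates n elements that can be indexed.
std::vector<double> resized(std::size_t n, double value) {
    std::vector<double> v;
    v.resize(n, value);
    return v;
}
```

Under this reading, replacing reserve(pop_size) with resize(pop_size) in main (or using push_back in the fill loop) makes the vectors valid to index whether or not std::ref is used; std::ref then only avoids the per-thread copies.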

Is it possible to use CUDA parallelizing this nested for loop?

I want to speed up this nested for loop. I have just started learning CUDA; how could I use CUDA to parallelize this C++ code?
#define PI 3.14159265
using namespace std;
int main()
{
int nbint = 2;
int hits = 20;
int nbinp = 2;
float _theta, _phi, _l, _m, _n, _k = 0, delta = 5;
float x[20],y[20],z[20],a[20],t[20];
for (int i = 0; i < hits; ++i)
{
x[i] = rand() / (float)(RAND_MAX / 100);
}
for (int i = 0; i < hits; ++i)
{
y[i] = rand() / (float)(RAND_MAX / 100);
}
for (int i = 0; i < hits; ++i)
{
z[i] = rand() / (float)(RAND_MAX / 100);
}
for (int i = 0; i < hits; ++i)
{
a[i] = rand() / (float)(RAND_MAX / 100);
}
float maxforall = 1e-6;
float theta0;
float phi0;
for (int i = 0; i < nbint; i++)
{
_theta = (0.5 + i)*delta;
for (int j = 0; j < nbinp; j++)
{
_phi = (0.5 + j)*delta / _theta;
_l = sin(_theta* PI / 180.0)*cos(_phi* PI / 180.0);
_m = sin(_theta* PI / 180.0)*sin(_phi* PI / 180.0);
_n = cos(_theta* PI / 180.0);
for (int k = 0; k < hits; k++)
{
_k = -(_l*x[k] + _m*y[k] + _n*z[k]);
t[k] = a[k] - _k;
}
qsort(t, 0, hits - 1);
float max = t[0];
for (int k = 0; k < hits; k++)
{
if (max < t[k])
max = t[k];
}
if (max > maxforall)
{
maxforall = max;
}
}
}
return 0;
}
I want to parallelize the innermost for loop and the sort (or maybe the whole nested loop). After sorting each array I find the maximum over all arrays; I use the maximum here only to simplify the code. The reason I need the sort is that all the arrays contain time information, and what the maximum stands in for is a continuous time interval. Sorting orders the times from lowest to highest; I then compare over a specific time interval rather than a single value. The comparison is much like choosing a maximum, but over a continuous interval instead of a single value.
Your 3 nested loops calculate nbint*nbinp*hits values. Since each of those values is independent from each other, all values can be calculated in parallel.
You stated in your comments that you have a commutative and associative "filter condition" which reduces the output to a single scalar value. This can be exploited to avoid sorting and storing the temporary values. Instead, we can calculate the values on-the-fly and then apply a parallel reduction to determine the end result.
This can be done in "raw" CUDA; below I implemented this idea using Thrust. The main idea is to run grid_op nbint*nbinp*hits times in parallel. To recover the three original "loop indices" from the single scalar index that is passed to grid_op, the algorithm from this SO question (see the URL in the code comment) is used.
thrust::transform_reduce performs the on-the-fly transformation and the subsequent parallel reduction (here thrust::maximum is used as a substitute).
#include <cmath>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/tuple.h>
// ### BEGIN utility for demo ####
#include <iostream>
#include <thrust/random.h>
thrust::host_vector<float> random_vector(const size_t N)
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
thrust::host_vector<float> temp(N);
for(size_t i = 0; i < N; i++) {
temp[i] = u01(rng);
}
return temp;
}
// ### END utility for demo ####
template <typename... Iterators>
thrust::zip_iterator<thrust::tuple<Iterators...>> zip(Iterators... its)
{
return thrust::make_zip_iterator(thrust::make_tuple(its...));
}
template <typename ZipIterator>
class grid_op
{
public:
grid_op(ZipIterator zipIt, std::size_t dim1, std::size_t dim2) : zipIt(zipIt), dim1(dim1), dim2(dim2){}
__host__ __device__
float operator()(std::size_t index) const
{
const auto coords = unflatten_3d_index(index, dim1, dim2);
const auto values = zipIt[thrust::get<2>(coords)];
const float delta = 5;
const float _theta = (0.5f + thrust::get<0>(coords))*delta;
const float _phi = (0.5f + thrust::get<1>(coords))*delta / _theta;
const float _l = sin(_theta* M_PI / 180.0)*cos(_phi* M_PI / 180.0);
const float _m = sin(_theta* M_PI / 180.0)*sin(_phi* M_PI / 180.0);
const float _n = cos(_theta* M_PI / 180.0);
const float _k = -(_l*thrust::get<0>(values) + _m*thrust::get<1>(values) + _n*thrust::get<2>(values));
return (thrust::get<3>(values) - _k);
}
private:
__host__ __device__
thrust::tuple<std::size_t, std::size_t, std::size_t>
unflatten_3d_index(std::size_t index, std::size_t dim1, std::size_t dim2) const
{
// taken from https://stackoverflow.com/questions/29142417/4d-position-from-1d-index
std::size_t x = index % dim1;
std::size_t y = ( ( index - x ) / dim1 ) % dim2;
std::size_t z = ( ( index - y * dim1 - x ) / (dim1 * dim2) );
return thrust::make_tuple(x,y,z);
}
ZipIterator zipIt;
std::size_t dim1;
std::size_t dim2;
};
template <typename ZipIterator>
grid_op<ZipIterator> make_grid_op(ZipIterator zipIt, std::size_t dim1, std::size_t dim2)
{
return grid_op<ZipIterator>(zipIt, dim1, dim2);
}
int main()
{
const int nbint = 3;
const int nbinp = 4;
const int hits = 20;
const std::size_t N = nbint * nbinp * hits;
thrust::device_vector<float> d_x = random_vector(hits);
thrust::device_vector<float> d_y = random_vector(hits);
thrust::device_vector<float> d_z = random_vector(hits);
thrust::device_vector<float> d_a = random_vector(hits);
auto zipIt = zip(d_x.begin(), d_y.begin(), d_z.begin(), d_a.begin());
auto countingIt = thrust::counting_iterator<std::size_t>(0);
auto unary_op = make_grid_op(zipIt, nbint, nbinp);
auto binary_op = thrust::maximum<float>();
const float init = 0;
float max = thrust::transform_reduce(
countingIt, countingIt+N,
unary_op,
init,
binary_op
);
std::cout << "max = " << max << std::endl;
}
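The index arithmetic and the "reduce on the fly" idea can be checked on the host without a GPU. The following is a minimal sequential sketch, not the Thrust implementation: unflatten uses arithmetic equivalent to unflatten_3d_index, while flatten and max_over_indices are hypothetical helpers added here for the check, with max_over_indices acting as a plain-loop stand-in for thrust::transform_reduce with thrust::maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>

// Equivalent arithmetic to unflatten_3d_index, on the host:
// x runs fastest (0..dim1-1), then y (0..dim2-1), then z.
std::tuple<std::size_t, std::size_t, std::size_t>
unflatten(std::size_t index, std::size_t dim1, std::size_t dim2) {
    std::size_t x = index % dim1;
    std::size_t y = (index / dim1) % dim2;
    std::size_t z = index / (dim1 * dim2);
    return std::make_tuple(x, y, z);
}

// Inverse mapping, used only to check the round trip.
std::size_t flatten(std::size_t x, std::size_t y, std::size_t z,
                    std::size_t dim1, std::size_t dim2) {
    return x + dim1 * (y + dim2 * z);
}

// Sequential stand-in for a transform-reduce with a maximum:
// evaluates op(i) for i in [0, n) and keeps only the running maximum,
// so no temporary array has to be stored or sorted.
template <typename Op>
float max_over_indices(std::size_t n, float init, Op op) {
    float best = init;
    for (std::size_t i = 0; i < n; ++i)
        best = std::max(best, op(i));
    return best;
}
```

With dim1 = nbint and dim2 = nbinp, the third component ranges over the hits, which is how grid_op picks the element of the zipped input to read.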

Multithreading computation of Mandelbrot set

I have created a program which creates a Mandelbrot set. Now I'm trying to make it multithreaded.
// mandelbrot.cpp
// compile with: g++ -std=c++11 mandelbrot.cpp -o mandelbrot
// view output with: eog mandelbrot.ppm
#include <fstream>
#include <complex> // if you make use of complex number facilities in C++
#include <iostream>
#include <cstdlib>
#include <thread>
#include <mutex>
#include <vector>
using namespace std;
template <class T> struct RGB { T r, g, b; };
template <class T>
class Matrix {
public:
Matrix(const size_t rows, const size_t cols) : _rows(rows), _cols(cols) {
_matrix = new T*[rows];
for (size_t i = 0; i < rows; ++i) {
_matrix[i] = new T[cols];
}
}
Matrix(const Matrix &m) : _rows(m._rows), _cols(m._cols) {
_matrix = new T*[m._rows];
for (size_t i = 0; i < m._rows; ++i) {
_matrix[i] = new T[m._cols];
for (size_t j = 0; j < m._cols; ++j) {
_matrix[i][j] = m._matrix[i][j];
}
}
}
~Matrix() {
for (size_t i = 0; i < _rows; ++i) {
delete [] _matrix[i];
}
delete [] _matrix;
}
T *operator[] (const size_t nIndex)
{
return _matrix[nIndex];
}
size_t width() const { return _cols; }
size_t height() const { return _rows; }
protected:
size_t _rows, _cols;
T **_matrix;
};
// Portable PixMap image
class PPMImage : public Matrix<RGB<unsigned char> >
{
public:
unsigned int size;
PPMImage(const size_t height, const size_t width) : Matrix(height, width) { }
void save(const std::string &filename)
{
std::ofstream out(filename, std::ios_base::binary);
out <<"P6" << std::endl << _cols << " " << _rows << std::endl << 255 << std::endl;
for (size_t y=0; y<_rows; y++)
for (size_t x=0; x<_cols; x++)
out << _matrix[y][x].r << _matrix[y][x].g << _matrix[y][x].b;
}
};
/*Draw mandelbrot according to the provided parameters*/
void draw_Mandelbrot(PPMImage & image, const unsigned width, const unsigned height, double cxmin, double cxmax, double cymin, double cymax,unsigned int max_iterations)
{
for (std::size_t ix = 0; ix < width; ++ix)
for (std::size_t iy = 0; iy < height; ++iy)
{
std::complex<double> c(cxmin + ix / (width - 1.0)*(cxmax - cxmin), cymin + iy / (height - 1.0)*(cymax - cymin));
std::complex<double> z = 0;
unsigned int iterations;
for (iterations = 0; iterations < max_iterations && std::abs(z) < 2.0; ++iterations)
z = z*z + c;
image[iy][ix].r = image[iy][ix].g = image[iy][ix].b = iterations;
}
}
int main()
{
const unsigned width = 1600;
const unsigned height = 1600;
PPMImage image(height, width);
int parts = 8;
std::vector<int>bnd (parts, image.size);
std::thread *tt = new std::thread[parts - 1];
time_t start, end;
time(&start);
//Lauch parts-1 threads
for (int i = 0; i < parts - 1; ++i) {
tt[i] = std::thread(draw_Mandelbrot,ref(image), width, height, -2.0, 0.5, -1.0, 1.0, 10);
}
//Use the main thread to do part of the work !!!
for (int i = parts - 1; i < parts; ++i) {
draw_Mandelbrot(ref(image), width, height, -2.0, 0.5, -1.0, 1.0, 10);
}
//Join parts-1 threads
for (int i = 0; i < parts - 1; ++i)
tt[i].join();
time(&end);
std::cout << difftime(end, start) << " seconds" << std::endl;
image.save("mandelbrot.ppm");
delete[] tt;
return 0;
}
Now every thread draws the complete fractal (look in main()). How can I let the threads draw different parts of the fractal?
You're making this (quite a lot) harder than it needs to be. This is the sort of task to which OpenMP is almost perfectly suited. For this task it gives almost perfect scaling with a bare minimum of effort.
I modified your draw_Mandelbrot by inserting a pragma before the outer for loop:
#pragma omp parallel for
for (int ix = 0; ix < width; ++ix)
for (int iy = 0; iy < height; ++iy)
Then I simplified your main down to:
int main() {
const unsigned width = 1600;
const unsigned height = 1600;
PPMImage image(height, width);
clock_t start = clock();
draw_Mandelbrot(image, width, height, -2.0, 0.5, -1.0, 1.0, 10);
clock_t stop = clock();
std::cout << (double(stop - start) / CLOCKS_PER_SEC) << " seconds\n";
image.save("mandelbrot.ppm");
return 0;
}
On my (fairly slow) machine, your original code ran in 4.73 seconds. My modified code ran in 1.38 seconds. That's an improvement of 3.4x out of code that's nearly indistinguishable from a trivial single-threaded version.
Just for what it's worth, I did a bit more rewriting to get this:
// mandelbrot.cpp
// compile with: g++ -std=c++11 -fopenmp mandelbrot.cpp -o mandelbrot
// view output with: eog mandelbrot.ppm
#include <fstream>
#include <complex> // if you make use of complex number facilities in C++
#include <iostream>
#include <cstdlib>
#include <ctime> // clock(), CLOCKS_PER_SEC
#include <thread>
#include <mutex>
#include <vector>
using namespace std;
template <class T> struct RGB { T r, g, b; };
template <class T>
struct Matrix
{
std::vector<T> data;
size_t rows;
size_t cols;
class proxy {
Matrix &m;
size_t index_1;
public:
proxy(Matrix &m, size_t index_1) : m(m), index_1(index_1) { }
T &operator[](size_t index) { return m.data[index * m.rows + index_1]; }
};
class const_proxy {
Matrix const &m;
size_t index_1;
public:
const_proxy(Matrix const &m, size_t index_1) : m(m), index_1(index_1) { }
T const &operator[](size_t index) const { return m.data[index * m.rows + index_1]; }
};
public:
Matrix(size_t rows, size_t cols) : data(rows * cols), rows(rows), cols(cols) { }
proxy operator[](size_t index) { return proxy(*this, index); }
const_proxy operator[](size_t index) const { return const_proxy(*this, index); }
};
template <class T>
std::ostream &operator<<(std::ostream &out, Matrix<T> const &m) {
out << "P6" << std::endl << m.cols << " " << m.rows << std::endl << 255 << std::endl;
for (size_t y = 0; y < m.rows; y++)
for (size_t x = 0; x < m.cols; x++) {
T pixel = m[y][x];
out << pixel.r << pixel.g << pixel.b;
}
return out;
}
/*Draw Mandelbrot according to the provided parameters*/
template <class T>
void draw_Mandelbrot(T & image, const unsigned width, const unsigned height, double cxmin, double cxmax, double cymin, double cymax, unsigned int max_iterations) {
#pragma omp parallel for
for (int ix = 0; ix < width; ++ix)
for (int iy = 0; iy < height; ++iy)
{
std::complex<double> c(cxmin + ix / (width - 1.0)*(cxmax - cxmin), cymin + iy / (height - 1.0)*(cymax - cymin));
std::complex<double> z = 0;
unsigned int iterations;
for (iterations = 0; iterations < max_iterations && std::abs(z) < 2.0; ++iterations)
z = z*z + c;
image[iy][ix].r = image[iy][ix].g = image[iy][ix].b = iterations;
}
}
int main() {
const unsigned width = 1600;
const unsigned height = 1600;
Matrix<RGB<unsigned char>> image(height, width);
clock_t start = clock();
draw_Mandelbrot(image, width, height, -2.0, 0.5, -1.0, 1.0, 255);
clock_t stop = clock();
std::cout << (double(stop - start) / CLOCKS_PER_SEC) << " seconds\n";
std::ofstream out("mandelbrot.ppm", std::ios::binary);
out << image;
return 0;
}
On my machine, this code runs in about 0.5 to 0.6 seconds.
As to why I made these changes: mostly to make it faster, cleaner, and simpler. Your Matrix class allocated a separate block of memory for each row (or perhaps column; I didn't pay very close attention). Mine allocates one contiguous block for the entire matrix instead. This eliminates a level of indirection to get to the data and increases locality of reference, thus improving cache usage. It also reduces the total amount of data used.
Changing from using time to using clock to do the timing was to measure CPU time instead of wall time (and typically improve precision substantially as well).
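One caveat when timing multithreaded code: std::clock reports processor time, and on POSIX systems that is summed over all threads of the process, so it may not shrink when more cores are used. A small sketch that records both figures side by side is shown below; the Timing struct and the measure helper are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical result type: elapsed wall-clock time and processor time, in seconds.
struct Timing {
    double wall_seconds;
    double cpu_seconds;
};

// Runs work() once and measures it with both a monotonic wall clock and std::clock.
template <class F>
Timing measure(F&& work) {
    const auto wall0 = std::chrono::steady_clock::now();
    const std::clock_t cpu0 = std::clock();
    work();
    const std::clock_t cpu1 = std::clock();
    const auto wall1 = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(wall1 - wall0).count(),
             static_cast<double>(cpu1 - cpu0) / CLOCKS_PER_SEC };
}
```

Wrapping the draw_Mandelbrot call in a lambda and passing it to measure would show whether a speedup appears in the wall-clock figure, the CPU figure, or both on a given platform.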
Getting rid of the PPMImage class was done simply because (IMO) having a PPMImage class that derives from a Matrix class just doesn't make much (if any) sense. I suppose it works (for a sufficiently loose definition of "work") but it doesn't strike me as good design. If you insist on doing it at all, it should at least be private derivation, because you're just using the Matrix as a way of implementing your PPMImage class, not (at least I certainly hope not) trying to make assertions about properties of PPM images.
If, for whatever reason, you decide to handle the threading manually, the obvious way of dividing the work up between threads would still be by looking at the loops inside of draw_Mandelbrot. The obvious one would be to leave your outer loop alone, but send the computation for each iteration off to a thread pool:
for (int ix = 0; ix < width; ++ix)
compute_thread(ix);
where the body of compute_thread is basically this chunk of code:
for (int iy = 0; iy < height; ++iy)
{
std::complex<double> c(cxmin + ix / (width - 1.0)*(cxmax - cxmin), cymin + iy / (height - 1.0)*(cymax - cymin));
std::complex<double> z = 0;
unsigned int iterations;
for (iterations = 0; iterations < max_iterations && std::abs(z) < 2.0; ++iterations)
z = z*z + c;
image[iy][ix].r = image[iy][ix].g = image[iy][ix].b = iterations;
}
There would obviously be a little work involved in passing the correct data to the compute thread (each thread should be passed a reference to a slice of the resulting picture), but that would be an obvious and fairly clean place to divide things up. In particular it divides the job up into enough tasks that you semi-automatically get pretty good load balancing (i.e., you can keep all the cores busy) but large enough that you don't waste massive amounts of time on communication and synchronization between the threads.
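As a rough illustration of the manual approach, here is a minimal sketch that hands out one column at a time to a fixed set of worker threads through a shared atomic counter (a simple stand-in for a real thread pool). The names mandel_iterations, compute_column and render, the flat row-major buffer, and the hard-coded view rectangle are assumptions made for the example, not part of the code above.

```cpp
#include <atomic>
#include <cassert>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

// Iteration count for one point; same escape test as the loop body above.
unsigned mandel_iterations(std::complex<double> c, unsigned max_iterations) {
    std::complex<double> z = 0;
    unsigned it = 0;
    for (; it < max_iterations && std::abs(z) < 2.0; ++it)
        z = z * z + c;
    return it;
}

// Hypothetical stand-in for compute_thread(ix): fills one column of a row-major buffer.
void compute_column(std::vector<unsigned>& out, std::size_t ix,
                    std::size_t width, std::size_t height,
                    double cxmin, double cxmax, double cymin, double cymax,
                    unsigned max_iterations) {
    for (std::size_t iy = 0; iy < height; ++iy) {
        std::complex<double> c(cxmin + ix / (width - 1.0) * (cxmax - cxmin),
                               cymin + iy / (height - 1.0) * (cymax - cymin));
        out[iy * width + ix] = mandel_iterations(c, max_iterations);
    }
}

// Each worker repeatedly claims the next unprocessed column from a shared counter,
// so slow columns do not hold up threads that happened to get fast ones.
std::vector<unsigned> render(std::size_t width, std::size_t height,
                             unsigned n_threads, unsigned max_iterations) {
    std::vector<unsigned> out(width * height);
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (std::size_t ix = next.fetch_add(1); ix < width; ix = next.fetch_add(1))
            compute_column(out, ix, width, height, -2.0, 0.5, -1.0, 1.0, max_iterations);
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
    return out;
}
```

Every column writes to its own set of elements, so the workers need no locking beyond the atomic counter.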
As to the result, with the number of iterations set to 255, I get the following (scaled to 25%):
[image: rendered Mandelbrot set]
...which is pretty much as I'd expect.
One of the big issues with this approach is that different regions take different amounts of time to calculate.
A more general approach is:
Start 1 source thread.
Start N worker threads.
Start 1 sink thread.
Create 2 thread safe queues (call them the source queue and the sink queue).
Divide the image into M (many more than N) pieces.
The source thread pushes pieces into the source queue
The workers pull pieces from the source queue, convert the pieces into result fragments, and push those fragments into the sink queue.
The sink thread takes fragments from the sink queue and combines them into the final image.
By dividing up the work this way, all the worker threads will be busy all the time.
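The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: SafeQueue, Piece, Fragment, compute_row and run_pipeline are hypothetical names, each image row is reduced to a single int to keep the example short, and piece_rows is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue; close() lets consumers drain the queue and then stop.
template <class T>
class SafeQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns false once closed and empty.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
};

struct Piece    { std::size_t first_row, last_row; };                 // half-open row range
struct Fragment { std::size_t first_row; std::vector<int> values; };  // finished rows

// Hypothetical per-row computation standing in for the Mandelbrot kernel.
int compute_row(std::size_t row) { return static_cast<int>(row * row); }

std::vector<int> run_pipeline(std::size_t rows, std::size_t piece_rows, unsigned n_workers) {
    SafeQueue<Piece> source;
    SafeQueue<Fragment> sink;
    std::vector<int> image(rows, -1);

    // Source thread: cuts the image into pieces and queues them.
    std::thread source_thread([&] {
        for (std::size_t r = 0; r < rows; r += piece_rows)
            source.push({r, std::min(rows, r + piece_rows)});
        source.close();
    });
    // Workers: turn pieces into fragments until the source queue is drained.
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w)
        workers.emplace_back([&] {
            Piece p;
            while (source.pop(p)) {
                Fragment f{p.first_row, {}};
                for (std::size_t r = p.first_row; r < p.last_row; ++r)
                    f.values.push_back(compute_row(r));
                sink.push(std::move(f));
            }
        });
    // Sink thread: the only writer of the final image.
    std::thread sink_thread([&] {
        Fragment f;
        while (sink.pop(f))
            for (std::size_t i = 0; i < f.values.size(); ++i)
                image[f.first_row + i] = f.values[i];
    });

    source_thread.join();
    for (auto& w : workers) w.join();
    sink.close();  // all fragments have been pushed by now
    sink_thread.join();
    return image;
}
```

Choosing piece_rows so that there are many more pieces than workers is what gives the load balancing described above: a worker that draws a cheap piece simply comes back for another one.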
You can divide the fractal into pieces by dividing the distance between the start and end of the fractal by the screen dimension:
$this->stepsRe = (double)((($this->startRe * -1) + ($this->endeRe)) / ($this->size_x-1));
$this->stepsIm = (double)((($this->startIm * -1) + ($this->endeIm)) / ($this->size_y-1));