I currently have a project in which a large dataset has to be created using HDF5. The naive implementation is all nice and dandy, but very slow. The slow part is the calculation (about 10x slower than the write), which I cannot speed up any further, but maybe parallelization is possible.
I guess I could use a simple #pragma omp parallel for, but the dataspace.write(..) method should stay sequential for speed reasons (maybe it doesn't even matter). See the diagram below for an example.
It should be noted that, because of the dimensionality, the write function uses a chunked layout of the same size as the buffer (in reality around 1 MB).
/*
------------NAIVE IMPLEMENTATION-----------------
|T:<calc0><W0><calc1><W1><calc2><W2>............|
|-----------------------------------------------|
|----------PARALLEL IMPLEMENTATION--------------|
|-----------------------------------------------|
|T0:<calc0----><W0><calc4>.....<W4>.............|
|T1:<calc1---->....<W1><calc5->....<W5>.........|
|T2:<calc2--->.........<W2>calc6-->....<W6>.....|
|T3:<calc3----->...........<W3><calc7-->...<W7>.|
------------DIFFERENT IMPLEMENTATION-------------
i.e.: Queuesize=4
T0:.......<W0><W1><W2><W3><W4><W5><W6>..........|
T1:<calc0><calc3>.....<calc6>...................|
T2:<calc1>....<calc4>.....<calc7>...............|
T3:<calc2>........<calc5>.....<calc8>...........|
T Thread
<calcn---> Calculation time
<Wn> Write data n. Order *important*
. Waiting
*/
Code example:
#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
double calculate(float *buf, const struct options *opts) {
// dummy function just to get a time reference
double res = 0;
for (size_t i = 0; i < 10000; i++)
res += std::sin(i);
return 1 / (1 + res);
}
struct options {
size_t idx[6];
};
class Dataspace {
public:
void selectHyperslab(){}; // selects region in disk space
void write(float *buf){}; // write buf to selected disk space
};
int main() {
size_t N = 6;
size_t dims[6] = {4 * N, 4 * N, 4 * N, 4 * N, 4 * N, 4 * N},
buf_offs[6] = {4, 4, 4, 4, 4, 4};
// dims: size of each dimension, multiple of 4
// buf_offs: size of buffer in each dimension
// Calculate buffer size and allocate.
// The size of the buffer is usually around 1 MB
// and holds a compound datatype rather than a float.
size_t buf_size = buf_offs[0];
for (auto off : buf_offs)
buf_size *= off;
std::unique_ptr<float[]> buf{new float[buf_size]};
struct options opts; // options parameters, passed to calculation fun
struct Dataspace dataspace; // dummy Dataspace. Supplied by HDF5
size_t i = 0;
size_t idx0, idx1, idx2, idx3, idx4, idx5;
auto t_start = std::chrono::high_resolution_clock::now();
std::cout << "[START]" << std::endl;
for (idx0 = 0; idx0 < dims[0]; idx0 += buf_offs[0])
for (idx1 = 0; idx1 < dims[1]; idx1 += buf_offs[1])
for (idx2 = 0; idx2 < dims[2]; idx2 += buf_offs[2])
for (idx3 = 0; idx3 < dims[3]; idx3 += buf_offs[3])
for (idx4 = 0; idx4 < dims[4]; idx4 += buf_offs[4])
for (idx5 = 0; idx5 < dims[5]; idx5 += buf_offs[5]) {
i++;
opts.idx[0] = idx0;
opts.idx[1] = idx1;
opts.idx[2] = idx2;
opts.idx[3] = idx3;
opts.idx[4] = idx4;
opts.idx[5] = idx5;
dataspace.selectHyperslab(/**/); // function from HDF5
calculate(buf.get(), &opts); // populate buf with data
dataspace.write(buf.get()); // has to be sequential
}
std::cout << "[DONE] " << i << " calls" << std::endl;
std::chrono::duration<double> diff =
std::chrono::high_resolution_clock::now() - t_start;
std::cout << "Time: " << diff.count() << std::endl;
return 0;
}
Code should work right out of the box.
I already took a quick look at OpenMP, but I can't wrap my head around it yet. Can anyone give me a hint/working example? I am not good with parallelization, but wouldn't a writer thread with a buffer queue work? Or is OpenMP overkill anyway and would pthreads suffice?
Any help is kindly appreciated,
cheers
Your first parallel implementation idea is by far the simplest to implement. Making a queue and a dedicated I/O thread might perform better, but is significantly more difficult to implement using OpenMP (a rough sketch of that variant follows at the end of this answer).
Below is a simple example of what a parallel version could look like. The most important aspects are:
Shared data: Make sure that there is no race condition on any data that is shared among threads. For example, each thread must have its own buf and opts, as they are clearly modified in parallel with no restriction. The simplest way is to define these variables locally within the parallel region. The loop variables idxn (at least for the inner loops) and i must also be defined locally. You cannot compute i the way you did - that would create a dependency between loop iterations and prevent parallelization.
Apply #pragma omp for worksharing to the loop. Due to the small number of iterations in each dimension, it is advisable to apply collapse. This will distribute the work of multiple nested loops. The optimal value for collapse exposes enough parallel work for the number of threads available to your program, but does not create too much overhead or hinder single-thread optimization of the inner loops. You might want to try different values.
Protect writing the data with a critical section. Only one thread at a time will enter the section. This is most likely necessary for correctness (depending on how it is implemented in HDF5). Apparently selectHyperslab controls how write will operate, so it must be inside the same critical section.
Put together, it could look like this:
#pragma omp parallel
{
// define EVERYTHING that is modified locally to each thread!
std::unique_ptr<float[]> buf{new float[buf_size]};
struct options opts;
// Try different values for collapse if performance is not satisfactory
#pragma omp for collapse(3)
for (size_t idx0 = 0; idx0 < dims[0]; idx0 += buf_offs[0])
for (size_t idx1 = 0; idx1 < dims[1]; idx1 += buf_offs[1])
for (size_t idx2 = 0; idx2 < dims[2]; idx2 += buf_offs[2])
for (size_t idx3 = 0; idx3 < dims[3]; idx3 += buf_offs[3])
for (size_t idx4 = 0; idx4 < dims[4]; idx4 += buf_offs[4])
for (size_t idx5 = 0; idx5 < dims[5]; idx5 += buf_offs[5]) {
size_t i = idx5 + idx4 * dims[5] + ...;
opts.idx[0] = idx0;
opts.idx[1] = idx1;
opts.idx[2] = idx2;
opts.idx[3] = idx3;
opts.idx[4] = idx4;
opts.idx[5] = idx5;
calculate(buf.get(), &opts); // populate buf with data
#pragma omp critical
{
// I assume that this function selects where/how the data
// will be written, so you *must* protect it.
// Only one thread can do this at a time.
dataspace.selectHyperslab(/**/); // function from HDF5
dataspace.write(buf.get()); // has to be sequential
}
}
}
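For completeness, here is a minimal sketch of the queue + dedicated writer-thread variant mentioned at the top of this answer, using plain std::thread and a condition variable instead of OpenMP. It reuses the dummy Dataspace/options/calculate names from the question; the queue bound is arbitrary, and the write ordering and hyperslab bookkeeping are deliberately left out, so treat it as a starting point rather than a drop-in implementation.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Assumes the Dataspace, options and calculate() definitions from the question
// are in scope. "Item" bundles one finished buffer with the indices needed to
// select its hyperslab.
struct Item {
    std::vector<float> data;
    struct options opts;
};

void write_with_dedicated_thread(Dataspace &dataspace) {
    std::queue<Item> queue;      // finished buffers waiting to be written
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Dedicated writer thread: pops buffers and writes them one at a time.
    std::thread writer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !queue.empty() || done; });
            if (queue.empty() && done)
                break;
            Item item = std::move(queue.front());
            queue.pop();
            lock.unlock();
            cv.notify_all();                    // wake producers waiting on a full queue
            dataspace.selectHyperslab(/* item.opts.idx ... */);
            dataspace.write(item.data.data());  // only this thread ever writes
        }
    });

    // Each calculation thread would do something like:
    //   Item item; item.data.resize(buf_size);
    //   /* fill item.opts.idx */  calculate(item.data.data(), &item.opts);
    //   { std::unique_lock<std::mutex> lock(m);
    //     cv.wait(lock, [&] { return queue.size() < 4; });   // bound the queue
    //     queue.push(std::move(item)); }
    //   cv.notify_all();

    // Once all calculation threads have finished:
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    writer.join();
}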
I have this self-contained example of a TBB application that I run on a 2-NUMA-node CPU. It repeatedly performs a simple vector addition on dynamic arrays, and it recreates an issue that I am having with a somewhat more complicated example.
I am trying to divide the computations cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas that are linked to separate NUMA nodes through TBB's NUMA API. The subsequent parallel execution should then be conducted so that memory accesses are performed on data that is local to the CPU that computes its task. A control example uses a simple parallel_for with a static_partitioner to perform the computation, while my intended example invokes, per task_arena, a task which invokes a parallel_for to compute the vector addition of the designated region, i.e. the half of the dynamic array that was initialized before in the corresponding NUMA node.
This example always takes twice as much time to perform the vector addition compared to the control example. It cannot be the overhead of creating the tasks for the task_arenas that will invoke the parallel_for algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints are applied. Could anyone explain to me what happens and why this performance penalty is so harsh? A direction to resources would also be helpful, as I am doing this for a university project.
#include <iostream>
#include <iomanip>
#include <cmath>    // std::ceil
#include <cstdlib>  // malloc/free
#include <tbb/tbb.h>
#include <vector>
int main(){
std::vector<int> numa_indexes = tbb::info::numa_nodes();
std::vector<tbb::task_arena> arenas(numa_indexes.size());
std::vector<tbb::task_group> task_groups(numa_indexes.size()); // used in the EDIT below
std::size_t numa_nodes = numa_indexes.size();
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
std::size_t size = 10000000;
std::size_t part_size = std::ceil((float)size/numa_nodes);
double * A = (double *) malloc(sizeof(double)*size);
double * B = (double *) malloc(sizeof(double)*size);
double * C = (double *) malloc(sizeof(double)*size);
double * D = (double *) malloc(sizeof(double)*size);
//DATA INITIALIZATION
for(unsigned k = 0; k < numa_indexes.size(); k++)
arenas[k].execute(
[&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
C[i] = D[i] = 0;
A[i] = B[i] = 1;
}, tbb::static_partitioner());
});
//PARALLEL ALGORITHM
tbb::tick_count t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++)
tbb::parallel_for(static_cast<std::size_t>(0), size,
[&](std::size_t i)
{
C[i] += A[i] + B[i];
}, tbb::static_partitioner());
tbb::tick_count t1 = tbb::tick_count::now();
std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;
//TASK ARENA & PARALLEL ALGORITHM
t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++){
for(unsigned k = 0; k < numa_indexes.size(); k++){
arenas[k].execute(
[&](){
for(unsigned i=0; i<numa_indexes.size(); i++)
task_groups[i].wait();
task_groups[k].run([&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
D[i] += A[i] + B[i];
});
});
});
}
}
t1 = tbb::tick_count::now();
std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;
double sum1 = 0;
double sum2 = 0;
for(int i = 0; i<size; i++){
sum1 += C[i];
sum2 += D[i];
}
std::cout << sum1 << std::endl;
std::cout << sum2 << std::endl;
return 0;
}
Performance with:
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07
Performance without constraints:
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07
EDIT: I implemented the use of task_group as found in #AlekseiFedotov's suggested resources, but the issue still remains.
The part of the provided example where the work with arenas happens is not a one-to-one match with the example from the docs, "Setting the preferred NUMA node" section.
Looking further into the specification of the task_arena::execute() method, we can see that task_arena::execute() is a blocking API, i.e. it does not return until the passed lambda completes.
On the other hand, the specification of the task_group::run() method reveals that it is asynchronous, i.e. it returns immediately, without waiting for the passed functor to complete.
That is where the problem lies, I guess. The code executes the two parallel loops within the arenas one by one, in a serial manner so to speak. Consider following the example from the docs carefully.
BTW, the oneTBB project, which is the revamped version of the TBB, can be found here.
EDIT answer for the EDITED question:
See the comment to the question.
The waiting should happen after the work is submitted, not before it. Also, there is no need to go to another arena's task group to do the wait within the loop: just submit the work in the NUMA loop via arena[i].execute( [&, i] { task_group[i].run( [&, i] { /*...*/ } ); } ), then, in a second loop, wait for each task_group within the corresponding task_arena (see the sketch below).
Please note how I capture the NUMA loop iteration variable by copy. Otherwise, the code might refer to the wrong data inside the lambda body.
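A minimal sketch of that submit-then-wait pattern, reusing the arenas, task_groups, part_size, size and array names from the question (the loop body is just the vector addition from the example):
// Submit the work first: one asynchronous task per NUMA node.
for (unsigned k = 0; k < numa_indexes.size(); k++) {
    arenas[k].execute([&, k] {                 // capture k by copy
        task_groups[k].run([&, k] {
            std::size_t local_start = k * part_size;
            std::size_t local_end = std::min(local_start + part_size, size);
            tbb::parallel_for(local_start, local_end, [&](std::size_t i) {
                D[i] += A[i] + B[i];
            });
        });
    });
}
// Only then wait, again inside the corresponding arena.
for (unsigned k = 0; k < numa_indexes.size(); k++) {
    arenas[k].execute([&, k] { task_groups[k].wait(); });
}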
What is the fastest way to access random (non-sequential) elements in an array if the access pattern is known beforehand? The access is random for different needs at every step, so rearranging the elements is an expensive option. The code below represents an important sample of the whole application.
#include <iostream>
#include "chrono"
#include <cstdlib>
#define NN 1000000
struct Astr{
double x[3], v[3];
int i, j, k;
long rank, p, q, r;
};
int main ()
{
struct Astr *key;
key = new Astr[NN];
int ii, *sequence;
sequence = new int[NN]; // access pattern is stored here
float frac ;
// create array of structs
// create array for random numbers between 0 to NN to access 'key'
for(int i=0; i < NN; i++){
key[i].x[1] = static_cast<double>(i);
key[i].p = static_cast<long>(i);
frac = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
sequence[i] = static_cast<int>(frac * static_cast<float>(NN));
}
// part to check and improve
// =========================================Random=======================================================
std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
double tmp;
long rnk;
for(int j=0; j < 1000; j++)
for(int i=0; i < NN; i++){
ii = sequence[i];
tmp = key[ii].x[1];
rnk = key[ii].p;
key[ii].x[1] = tmp * 1.01;
key[ii].p = rnk * 1.01;
}
std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
double time_uni = static_cast<double>(duration.count()) / 1000000;
std::cout << "\n Random array access " << time_uni << "s \n" ;
// ==========================================Sequential======================================================
TstartMain = std::chrono::high_resolution_clock::now();
for(int j=0; j < 1000; j++)
for(int i=0; i < NN; i++){
tmp = key[i].x[1];
rnk = key[i].p;
key[i].x[1] = tmp * 1.01;
key[i].p = rnk * 1.01;
}
TendMain = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
time_uni = static_cast<double>(duration.count()) / 1000000;
std::cout << " Sequential array access " << time_uni << "s \n" ;
// ================================================================================================
delete [] key;
delete [] sequence;
}
As expected, sequential access is faster; the results on my machine are:
Random array access 21.3763s
Sequential array access 8.7755s
The main question is whether random access can be made any faster.
The code improvement could be in terms of the container itself (e.g. a list/vector rather than a plain array). Could software prefetching be implemented?
In theory it is possible to help guide the prefetcher to speed up random access (well, on those CPUs that support it - e.g. _mm_prefetch for Intel/AMD). In practice, however, this is often a complete waste of time and will, more often than not, slow down your code.
The general theory is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two prior to using the value. There are, however, problems with this:
It is likely that you'll end up tuning the code for your CPU. When running that same code on other platforms, you'll probably find that different CPU cache layouts/sizes mean that your prefetch optimisations are now actually slowing the performance down.
The additional prefetch instructions will end up using up more of your instruction cache, and most likely your uop cache as well. You may find this alone slows the code down.
This assumes the CPU actually pays attention to the _mm_prefetch instruction. It is only a hint, so there are no guarantees it will be respected by the CPU.
If you want to speed up random memory access, there are better methods than prefetching imho.
Reduce the size of the data (i.e. use shorts/float16s in place of int/float, eradicate any erroneous padding in your structs, etc). By reducing the size of the structs, you have less memory to read, so it will go quicker! (Simple compression schemes aren't a bad idea either!) A sketch of this idea follows below.
Sort your data so that instead of doing random access, you are processing the data sequentially.
Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access (the only exception: you are optimising code for a ~2001 Pentium 4, where prefetching was basically required).
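As an illustration of the first option above, here is a hedged sketch of a slimmed-down Astr from the question. It assumes the i/j/k and rank/p/q/r fields actually fit into 32 bits and that float precision is acceptable for x and v, which only you can judge:
#include <cstdint>

// Hypothetical slimmed-down layout: only valid if the values really fit.
struct AstrSmall {
    float x[3], v[3];            // float instead of double: 24 bytes instead of 48
    std::int32_t i, j, k;        // 12 bytes, as before
    std::int32_t rank, p, q, r;  // 16 bytes instead of 32, and no alignment padding
};
// On a typical 64-bit Linux build sizeof(Astr) is 96 (including 4 bytes of padding
// before the longs), while sizeof(AstrSmall) is 52, so the random-access loop
// touches roughly half as many cache lines.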
To give an example of what #robthebloke says, the following code gives a ~15% improvement on my machine:
#include <immintrin.h>
void do_it(struct Astr *key, const int *sequence) {
for(int i = 0; i < NN-8; ++i) {
_mm_prefetch((const char *)(key + sequence[i+8]), _MM_HINT_NTA);
struct Astr *ki = key+sequence[i];
ki->x[1] *= 1.01;
ki->p *= 1.01;
}
for(int i = NN-8; i < NN; ++i) {
struct Astr *ki = key+sequence[i];
ki->x[1] *= 1.01;
ki->p *= 1.01;
}
}
So I want to optimize the sum of a really big array, and in order to do that I have written multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread instead of 2, 3, or 4 threads...
Can someone explain why this happens?
(Also, I've only started coding in C++ this semester; until then I only knew C, so I'm sorry for possible dumb mistakes.)
This is the thread code
*localSum = 0.0;
for (size_t i = 0; i < stop; i++)
*localSum += v[i];
Main process code
int numThreads = atoi(argv[1]);
int N = 100000000;
// create the input vector v and put some values in v
vector<double> v(N);
for (int i = 0; i < N; i++)
v[i] = i;
// this vector will contain the partial sum for each thread
vector<double> localSum(numThreads, 0);
// create threads. Each thread will compute part of the sum and store
// its result in localSum[threadID] (threadID = 0, 1, ... numThread-1)
startChrono();
vector<thread> myThreads(numThreads);
for (int i = 0; i < numThreads; i++){
int start = i * v.size() / numThreads;
myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i],v.size()/numThreads);
}
for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));
// calculate global sum
double globalSum = 0.0;
for (int i = 0; i < numThreads; i++)
globalSum += localSum[i];
cout.precision(12);
cout << "Sum = " << globalSum << endl;
cout << "Runtime: " << stopChrono() << endl;
exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough. Vectorized streaming add will be really hard to beat. You need a more complex function than add to really see results. Or a very large array.
2- Related: the overhead of all the thread creation and joining is going to swamp any performance gains from the threading. Adding is really fast, and you can easily saturate the CPU's functional units. For the threads to help, each one can't even be a hyperthread on the same core; it would need to be on a different core entirely (as the hyperthreads would both compete for the floating point units).
To test this, you can try to create all the threads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
3- All your localSum entries share the same cache line. Better would be to accumulate into a variable on the stack and only store the final result into the array at the end, instead of adding into the array element directly: https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
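A minimal sketch of what that change to the thread function could look like (the threadsum signature here is guessed from how it is invoked in the main code above, so treat the parameter list as an assumption):
#include <cstddef>

// Accumulate into a stack-local variable and write to the shared array only
// once, so the threads never fight over the cache line holding localSum[].
// threadID and numThreads are kept only to match the call in main.
void threadsum(int threadID, int numThreads, const double *v,
               double *localSum, size_t stop) {
    double sum = 0.0;            // stays in a register / on this thread's stack
    for (size_t i = 0; i < stop; i++)
        sum += v[i];
    *localSum = sum;             // single store at the end
}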
If for some reason, you need to keep the sum observable to others in that array, pad the localsum vector entries like this so they don't share the same cache line:
struct localsumentry {
double sum;
char pad[56]; // 8 + 56 = 64 bytes, so each entry owns a full cache line
};
I am carrying out a 3D matrix by 1D vector multiplication within a class in C++. All variables are contained within the class. When I create one instance of the class on a single thread and carry out the multiplication 100 times, the multiplication operation takes ~0.8ms each time.
When I create 4 instances of the class, each on a separate thread, and run the multiplication operation 25 times on each, the operation takes ~1.7ms each time. The operations on each thread are being carried out on separate data, and are running on separate cores.
As expected, however, the overall time to complete the 100 matrix multiplications is reduced with 4 threads over a single thread.
My questions are:
1) What is the cause of the slowdown in the multiplication operation when multiple threads are used?
2) Is there any way in which the operation can be sped up?
EDIT:
To clarify the problem:
The overall time to carry out 100 matrix products does decrease when I split them over 4 threads - threading does make the overall program faster.
The timing in question is the actual matrix multiplication within the already created threads (see code). This time excludes thread creation and memory allocation & deletion. This is the time that doubles when I use 4 threads rather than 1. The overall time to carry out all multiplications halves when I use 4 threads. My question is why the individual matrix products are slower when running on 4 threads rather than 1.
Below is a code sample. It is not my actual code, but a simplified example I have written to demonstrate the problem.
Multiply.h
class Multiply
{
public:
Multiply ();
~Multiply ();
void
DoProduct ();
private:
double *a;
};
Multiply.cpp
#include "Multiply.h"
#include <chrono>
#include <cstring>
#include <iostream>

Multiply::Multiply ()
{
a = new double[100 * 100 * 100];
// note: memset sets individual bytes, so this does not fill the array with 1.0
std::memset(a,1,100*100*100*sizeof(double));
}
void
Multiply::DoProduct ()
{
double *result = new double[100 * 100];
double *b = new double[100];
std::memset(result,0,100*100*sizeof(double));
std::memset(b,1,100*sizeof(double));
//Timer starts here, i.e. excluding memory allocation and thread creation and the rest
auto start_time = std::chrono::high_resolution_clock::now ();
//matrix product
for (int i = 0; i < 100; ++i)
for (int j = 0; j < 100; ++j)
{
double t = 0;
for (int k = 0; k < 100; ++k)
t = t + a[k + j * 100 + i * 100 * 100] * b[k];
result[j + 100 * i] = result[j + 100 * i] + t;
}
//Timer stops here, i.e. before memory deletion
int time = std::chrono::duration_cast < std::chrono::microseconds > (std::chrono::high_resolution_clock::now () - start_time).count ();
std::cout << "Time: " << time << std::endl;
delete []result;
delete []b;
}
Multiply::~Multiply ()
{
delete[] a;
}
Main.cpp
#include "Multiply.h"
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

void
threadWork (int iters)
{
Multiply *m = new Multiply ();
for (int i = 0; i < iters; i++)
{
m->DoProduct ();
}
}
int
main ()
{
int numProducts = 100;
int numThreads = 1; //4;
std::vector<std::thread> t(numThreads);
auto start_time = std::chrono::high_resolution_clock::now ();
for (int i = 0; i < numThreads; i++)
t[i] = std::thread (threadWork, numProducts / numThreads);
for (int i = 0; i < numThreads; i++)
t[i].join ();
int time = std::chrono::duration_cast < std::chrono::microseconds > (std::chrono::high_resolution_clock::now () - start_time).count ();
std::cout << "Time total: " << time << std::endl;
}
Async and thread calls are quite expensive compared to ordinary function calls. So pre-launch threads and create a thread pool: you push your functions as tasks, and the pool's workers fetch those tasks from a priority queue (a minimal sketch follows below).
The tasks can be given priorities so that they execute in the proper order, avoiding the use of mutexes and locks and hence the delays they cause.
You are also launching too many threads; keep the count below the maximum supported by your system to avoid bottlenecks.
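A minimal sketch of such a pool, using a plain FIFO queue rather than a priority queue to keep it short. The ThreadPool class and its submit() API are made up for illustration; the DoProduct calls from the question would be pushed as tasks:
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal thread pool: workers are launched once and reused,
// so the per-task cost is a queue push instead of a thread creation.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { workerLoop(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
        for (auto &w : workers) w.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (tasks.empty()) return;       // done and drained
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();                              // e.g. [&m0]{ m0.DoProduct(); }
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};
Usage would be along the lines of: construct one ThreadPool(numThreads) before starting the timer, submit the 100 DoProduct calls as tasks (e.g. pool.submit([&m]{ m.DoProduct(); })), and let the destructor drain the queue, so thread creation and joining no longer dominate the measurement.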
I am trying to measure the speedup in a parallel section using one or four threads. As my parallel section is relatively simple, I expect a near-fourfold speedup. (This is following my question:
openMp: severe perfomance loss when calling shared references of dynamic arrays )
As my parallel section runs twice as fast on four cores compared to just one, I believe I still have not found the reason for the performance loss.
I want to parallelise my function iter as well as possible. The function uses entries of dynamic arrays and private quantities to change the entries of other dynamic arrays. Because every iteration step only uses the array entries of the respective loop step, I don't have different threads accessing the same array entry.
Furthermore, I put some thought into false sharing due to accessing entries in the same cache line. My guess is that this is a minor effect, as my double arrays are 5*10^5 entries long and, by choosing a reasonable chunk size for the schedule(dynamic,chunk) clause, I don't expect the very few entries in a given cache line to be accessed at the same time by different threads. In my simulation I have about 80 such arrays, so allocating them on the stack is not comfortable, and making private copies for every thread is out of the question too.
Does anybody have an idea, how to improve this? I want to fully understand why this is so slow, before starting with compiler optimisations.
What also surprised me was: calling iter(parallel), with parallel = false, is slower than calling it with parallel = true and omp_set_num_threads(1).
main.cpp:
#include <cstdio>
#include "mathClass.h"

int main(){
mathClass m;
m.fillArrays();
double timeCount = 0.0;
for(int j = 0; j<1000; j++){
timeCount += m.iter(true);
}
printf("meam time difference = %fms\n",timeCount);
return 0;
}
mathClass.h:
class mathClass{
private:
double* A;
double* B;
double* C;
int length;
public:
double* D;
mathClass();
double iter(bool parallel);
void fillArrays();
};
mathClass.cpp:
#include "mathClass.h"
#include <cstdlib>  // rand
#include <omp.h>

mathClass::mathClass(){
length = 5000000;
A = new double[length];
B = new double[length];
C = new double[length];
D = new double[length];
}
void mathClass::fillArrays(){
int temp;
for ( int i=0; i<length; i++){
temp = rand() % 100;
A[i] = double(temp);
temp = rand() % 100;
B[i] = double(temp);
temp = rand() % 100;
C[i] = double(temp);
}
}
double mathClass::iter(bool parallel){
double startTime;
double endTime;
omp_set_num_threads(4);
startTime = omp_get_wtime();
#pragma omp parallel if(parallel)
{
int alpha; // private in all threads
#pragma omp for schedule(static)
for (int i=0; i<length; i++){
alpha = 15*A[i];
D[i] = C[i]*alpha + B[i]*alpha*alpha;
}
}
endTime = omp_get_wtime();
return endTime - startTime;
}