For the following code, here is a bit of context.
Mat img0; // 1280x960 grayscale
--
timer.start();
for (int i = 0; i < img0.rows; i++)
{
    vector<double> v;
    uchar* p = img0.ptr<uchar>(i);
    for (int j = 0; j < img0.cols; ++j)
    {
        v.push_back(p[j]);
    }
}
cout << "Single thread " << timer.end() << endl;
and
timer.start();
concurrency::parallel_for(0, img0.rows, [&img0](int i) {
    vector<double> v;
    uchar* p = img0.ptr<uchar>(i);
    for (int j = 0; j < img0.cols; ++j)
    {
        v.push_back(p[j]);
    }
});
cout << "Multi thread " << timer.end() << endl;
The result:
Single thread 0.0458856
Multi thread 0.0329856
The speedup is hardly noticeable.
My processor is Intel i5 3.10 GHz
RAM 8 GB DDR3
EDIT
I also tried a slightly different approach.
vector<Mat> imgs = split(img0, 2,1); // `split` is my custom function that, in this case, splits `img0` into two images, its left and right half
--
timer.start();
concurrency::parallel_for(0, (int)imgs.size(), [imgs](int i) {
    Mat img = imgs[i];
    vector<double> v;
    for (int row = 0; row < img.rows; row++)
    {
        uchar* p = img.ptr<uchar>(row);
        for (int col = 0; col < img.cols; ++col)
        {
            v.push_back(p[col]);
        }
    }
});
cout << " Multi thread Sectored " << timer.end() << endl;
And I get a much better result:
Multi thread Sectored 0.0232881
So it looks like I was creating 960 work items (one per row) when I ran
parallel_for(0, img0.rows, ...
and that didn't work well.
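For reference, the same banding idea can be written directly with parallel_for over a few row bands, without my custom split function. This is just a sketch of the approach (the band count is illustrative, and I did not benchmark this exact code):

const int nBands = 4; // e.g. one band per core
const int rowsPerBand = (img0.rows + nBands - 1) / nBands;
concurrency::parallel_for(0, nBands, [&img0, rowsPerBand](int b) {
    const int rowBegin = b * rowsPerBand;
    int rowEnd = rowBegin + rowsPerBand;
    if (rowEnd > img0.rows) rowEnd = img0.rows;
    vector<double> v;
    v.reserve(static_cast<size_t>(rowEnd - rowBegin) * img0.cols);
    for (int row = rowBegin; row < rowEnd; ++row)
    {
        uchar* p = img0.ptr<uchar>(row);
        for (int col = 0; col < img0.cols; ++col)
        {
            v.push_back(p[col]);
        }
    }
});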
(I must add that Kenney's comment is correct: do not read too much into the specific numbers I stated here. When measuring intervals this small, the variation is high. But in general, the approach from the edit, splitting the image in half, did improve performance compared to the old approach.)
I think your problem is that you are limited by memory bandwidth. Your second snippet is basically reading from the whole of the image, and that has got to come out of main memory into cache. (Or out of L2 cache into L1 cache).
You need to arrange your code so that all four cores are working on the same bit of memory at once (I presume you are not actually trying to optimize this code - it is just a simple example).
Edit: Insert crucial "not" in last parenthetical remark.
Related
Here is the matrix multiplication C++ OpenMP code that I have written. I am trying to use OpenMP to optimize the program. The sequential version ran in 7 seconds, but when I added the OpenMP statements it only got about 3 seconds faster. I thought it was going to get much faster and don't understand whether I'm doing it right.
The OpenMP statements are in the fill_random function and in the matrix multiplication triple for loop section in main.
I would appreciate any help or advice you can give to understand this!
#include <iostream>
#include <cassert>
#include <omp.h>
#include <chrono>
using namespace std::chrono;
double** fill_random(int rows, int cols )
{
    double** mat = new double* [rows]; //Allocate rows.
    #pragma omp parallell collapse(2)
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols]; // added
        for( int j = 0; j < cols; ++j)
        {
            mat[i][j] = rand() % 10;
        }
    }
    return mat;
}
double** create_matrix(int rows, int cols)
{
    double** mat = new double* [rows]; //Allocate rows.
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols](); //Allocate each row and zero initialize..
    }
    return mat;
}
void destroy_matrix(double** &mat, int rows)
{
    if (mat)
    {
        for (int i = 0; i < rows; ++i)
        {
            delete[] mat[i]; //delete each row..
        }
        delete[] mat; //delete the rows..
        mat = nullptr;
    }
}
int main()
{
    int rowsA = 1000; // number of rows
    int colsA= 1000; // number of columns
    double** matA = fill_random(rowsA, colsA);
    int rowsB = 1000; // number of rows
    int colsB = 1000; // number of columns
    double** matB = fill_random(rowsB, colsB);
    //Checking matrix multiplication qualification
    assert(colsA == rowsB);
    double** matC = create_matrix(rowsA, colsB);
    //measure the multiply only
    const auto start = high_resolution_clock::now();
    //Multiplication
    #pragma omp parallel for
    for(int i = 0; i < rowsA; ++i)
    {
        for(int j = 0; j < colsB; ++j)
        {
            for(int k = 0; k < colsA; ++k) //ColsA..
            {
                matC[i][j] += matA[i][k] * matB[k][j];
            }
        }
    }
    const auto stop = high_resolution_clock::now();
    const auto duration = duration_cast<seconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " seconds" << std::endl;
    //Clean up..
    destroy_matrix(matA, rowsA);
    destroy_matrix(matB, rowsB);
    destroy_matrix(matC, rowsA);
    return 0;
}
Your problem is rather small.
The collapse in the matrix creation does nothing because the loops are not perfectly nested. On the other hand, in the multiplication routine you should add a collapse(2) directive.
Creating a matrix as an array of row pointers means that the expression matB[k][j] dances all over memory. Allocate each matrix as a single contiguous array and use i*N+j as the indexing expression. (Of course I would wrap that in a macro or a small helper.)
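For illustration, a minimal sketch of such a contiguous layout (the helper names below are mine, not from the question):

double* create_flat_matrix(int rows, int cols)
{
    return new double[static_cast<size_t>(rows) * cols](); // one zero-initialized block
}

inline double& at(double* mat, int cols, int i, int j)
{
    return mat[static_cast<size_t>(i) * cols + j]; // i*N+j indexing
}

Then matB[k][j] becomes at(matB, colsB, k, j), and cleanup is a single delete[] per matrix.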
A 1000x1000 matrix of double (64-bit) elements takes 8 MB. Multiplying two matrices means reading 16 MB, and writing the third matrix brings the total working set to 24 MB.
If the L3 cache is smaller than 24 MB, then RAM bandwidth is the bottleneck. A single thread may not have saturated that bandwidth, but with OpenMP the RAM bandwidth is fully used; in your case there was only about 50% headroom.
The naive version does not use the cache well. You need to swap the order of two loops to get better cache reuse:
for i
    for k
        for j
            C[i][j] += A[i][k] * B[k][j]
Although accumulating into C no longer reuses a register in this ordering, it reuses cache, which is more important here. With this change it should take roughly 100-200 milliseconds of computation time even single-threaded.
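In code, the swapped ordering for the multiplication loop from the question might look like this (a sketch keeping the double** layout to isolate the loop-order change; only the outer loop is parallelized, because collapsing the i and k loops would create a race on matC[i][j]):

#pragma omp parallel for
for (int i = 0; i < rowsA; ++i)
{
    for (int k = 0; k < colsA; ++k)
    {
        const double aik = matA[i][k]; // loop-invariant in the j loop
        for (int j = 0; j < colsB; ++j)
        {
            matC[i][j] += aik * matB[k][j]; // C and B are both scanned row-wise
        }
    }
}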
Also if you need performance, don't do this:
//Allocate each row and zero initialize..
Allocate the whole matrix at once so that it is not scattered in memory.
To use more threads efficiently, you can compute the full product from sub-matrix (blocked) multiplications. Scan-line multiplication is not good for load balancing between threads; multiplying sub-matrices gives a better load distribution thanks to caching and more floating-point operations per element fetched from memory.
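A rough sketch of that blocked scheme, using the contiguous layout and the at() helper sketched above (the block size of 64 is only an illustration and needs tuning for the target cache sizes; std::min requires <algorithm>):

const int BS = 64;
#pragma omp parallel for collapse(2)
for (int ii = 0; ii < rowsA; ii += BS)        // each thread owns whole (ii, jj) tiles of C,
    for (int jj = 0; jj < colsB; jj += BS)    // so there is no race on matC
        for (int kk = 0; kk < colsA; kk += BS)
            for (int i = ii; i < std::min(ii + BS, rowsA); ++i)
                for (int k = kk; k < std::min(kk + BS, colsA); ++k)
                {
                    const double aik = at(matA, colsA, i, k);
                    for (int j = jj; j < std::min(jj + BS, colsB); ++j)
                        at(matC, colsB, i, j) += aik * at(matB, colsB, k, j);
                }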
Edit:
Swapping the order of the loops also lets the compiler vectorize the innermost loop, because one of the inputs (the A[i][k] element) stays constant for the duration of the innermost loop.
I have this self-contained example of a TBB application that I run on a 2-NUMA-node CPU. It performs a simple vector addition repeatedly on dynamic arrays, and it recreates an issue that I am having with a somewhat more complicated example.

I am trying to divide the computations cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas that are linked to separate NUMA nodes through TBB's NUMA API. The subsequent parallel execution should then be conducted so that memory accesses are performed on data that is local to the CPU that computes its task. A control example uses a simple parallel_for with a static_partitioner to perform the computation, while my intended example invokes, per task_arena, a task which invokes a parallel_for to compute the vector addition of the designated region, i.e. the half of the dynamic arrays that was initialized beforehand on the corresponding NUMA node.

This example always takes twice as much time to perform the vector addition compared to the control example. It cannot be the overhead of creating the tasks for the task_arenas that invoke the parallel_for algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints are applied. Could anyone explain to me what happens and why this performance penalty is so harsh? A direction to resources would also be helpful, as I am doing this for a university project.
#include <iostream>
#include <iomanip>
#include <tbb/tbb.h>
#include <vector>
#include <algorithm> // std::min
#include <cmath>     // std::ceil
#include <cstdlib>   // malloc
int main(){
    std::vector<int> numa_indexes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_indexes.size());
    // one task_group per NUMA arena, used by the second timing loop below
    std::vector<tbb::task_group> task_groups(numa_indexes.size());
    std::size_t numa_nodes = numa_indexes.size();
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
    }
    std::size_t size = 10000000;
    std::size_t part_size = std::ceil((float)size/numa_nodes);
    double * A = (double *) malloc(sizeof(double)*size);
    double * B = (double *) malloc(sizeof(double)*size);
    double * C = (double *) malloc(sizeof(double)*size);
    double * D = (double *) malloc(sizeof(double)*size);
    //DATA INITIALIZATION
    for(unsigned k = 0; k < numa_indexes.size(); k++)
        arenas[k].execute(
            [&](){
                std::size_t local_start = k*part_size;
                std::size_t local_end = std::min(local_start + part_size, size);
                tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                    [&](std::size_t i)
                    {
                        C[i] = D[i] = 0;
                        A[i] = B[i] = 1;
                    }, tbb::static_partitioner());
            });
    //PARALLEL ALGORITHM
    tbb::tick_count t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++)
        tbb::parallel_for(static_cast<std::size_t>(0), size,
            [&](std::size_t i)
            {
                C[i] += A[i] + B[i];
            }, tbb::static_partitioner());
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;
    //TASK ARENA & PARALLEL ALGORITHM
    t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++){
        for(unsigned k = 0; k < numa_indexes.size(); k++){
            arenas[k].execute(
                [&](){
                    for(unsigned i=0; i<numa_indexes.size(); i++)
                        task_groups[i].wait();
                    task_groups[k].run([&](){
                        std::size_t local_start = k*part_size;
                        std::size_t local_end = std::min(local_start + part_size, size);
                        tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                            [&](std::size_t i)
                            {
                                D[i] += A[i] + B[i];
                            });
                    });
                });
        }
    }
    t1 = tbb::tick_count::now();
    std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;
    double sum1 = 0;
    double sum2 = 0;
    for(int i = 0; i<size; i++){
        sum1 += C[i];
        sum2 += D[i];
    }
    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;
    return 0;
}
Performance with:
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07
Performance without constraints:
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07
EDIT: I implemented the use of task_group as found in #AlekseiFedotov's suggested resources, but the issue still remains.
The part of the provided example where the work with arenas happens is not a one-to-one match with the example from the docs (the "Setting the preferred NUMA node" section).
Looking further into the specification of the task_arena::execute() method, we can see that task_arena::execute() is a blocking API, i.e. it does not return until the passed lambda completes.
On the other hand, the specification of the task_group::run() method reveals that it is asynchronous, i.e. it returns immediately, without waiting for the passed functor to complete.
That is where the problem lies, I guess. The code executes the two parallel loops within the arenas one after the other, in a serial manner so to say. Consider following the example from the docs carefully.
BTW, the oneTBB project, which is the revamped version of the TBB, can be found here.
EDIT answer for the EDITED question:
See the comment to the question.
The waiting should happen after the work is submitted, not before it. Also, there is no need to go to another arena's task group to do the wait within the loop; just submit the work in the NUMA loop via arena[i].execute( [&, i] { task_group[i].run( [&, i] { /*...*/ } ); } ), then, in another loop, wait for each task_group within the corresponding task_arena.
Please note how I capture the NUMA loop iteration variable by copy. Otherwise, the code might be referring to the wrong data inside the lambda body.
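A minimal sketch of that structure, reusing the names from the question (arenas, task_groups, A, B, D, part_size, size); this shows the shape of the fix rather than tested code:

for (int rep = 0; rep < 100; rep++) {
    // submit the work to every NUMA arena first; task_group::run() returns immediately
    for (unsigned k = 0; k < arenas.size(); k++) {
        arenas[k].execute([&, k] {
            task_groups[k].run([&, k] {
                std::size_t local_start = k * part_size;
                std::size_t local_end = std::min(local_start + part_size, size);
                tbb::parallel_for(local_start, local_end, [&](std::size_t i) {
                    D[i] += A[i] + B[i];
                });
            });
        });
    }
    // only after everything has been submitted, wait for each task_group inside its own arena
    for (unsigned k = 0; k < arenas.size(); k++) {
        arenas[k].execute([&, k] { task_groups[k].wait(); });
    }
}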
My program opens a file which contains 100,000 numbers and parses them out into a 10,000 x 10 array correlating to 10,000 sets of 10 physical parameters. The program then iterates through each row of the array, performing overlap calculations between that row and every other row in the array.
The process is quite simple, and being new to C++, I programmed it in the most straightforward way that I could think of. However, I know that I'm not doing this in the most optimal way possible, which is something that I would love to fix, as the program is going to face off against my cohort's identical program, coded in Fortran, in a "race".
I have a feeling that I am going to need to implement multithreading to accomplish my goal of speeding up the program, but not only am I new to C++, I am also new to multithreading, so I'm not sure how I should go about creating new threads in a beneficial way, or whether it would even give me that much "gain on investment", so to speak.
The program has the potential to be run on a machine with over 50 cores, but because the program is so simple, I'm not convinced that more threads are necessarily better. I think that if I implement two threads to compute the complex parameters of the two gaussians, one thread to compute the overlap between the gaussians, and one thread dedicated to writing to the file, I could speed up the program significantly, but I could also be wrong.
CODE:
cout << "Working...\n";
double **gaussian_array;
gaussian_array = (double **)malloc(N*sizeof(double *));
for(int i = 0; i < N; i++){
    gaussian_array[i] = (double *)malloc(10*sizeof(double));
}
fstream gaussians;
gaussians.open("GaussParams", ios::in);
if (!gaussians){
    cout << "File not found.";
}
else {
    //generate the array of gaussians -> [10000][10]
    int i = 0;
    while(i < N) {
        char ch;
        string strNums;
        string Num;
        string strtab[10];
        int j = 0;
        getline(gaussians, strNums);
        stringstream gaussian(strNums);
        while(gaussian >> ch) {
            if(ch != ',') {
                Num += ch;
                strtab[j] = Num;
            }
            else {
                Num = "";
                j += 1;
            }
        }
        for(int c = 0; c < 10; c++) {
            stringstream dbl(strtab[c]);
            dbl >> gaussian_array[i][c];
        }
        i += 1;
    }
}
gaussians.close();
//Below is the process to generate the overlap file between all gaussians:
string buffer;
ofstream overlaps;
overlaps.open("OverlapMatrix", ios::trunc);
overlaps.precision(15);
for(int i = 0; i < N; i++) {
    for(int j = 0 ; j < N; j++){
        double r1[6][2];
        double r2[6][2];
        double ol[2];
        //compute complex parameters from the two gaussians
        compute_params(gaussian_array[i], r1);
        compute_params(gaussian_array[j], r2);
        //compute overlap between the gaussians using the complex parameters
        compute_overlap(r1, r2, ol);
        //write to file
        overlaps << ol[0] << "," << ol[1];
        if(j < N - 1)
            overlaps << " ";
        else
            overlaps << "\n";
    }
}
overlaps.close();
return 0;
Any suggestions are greatly appreciated. Thanks!
This loop:
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
finishes in 0 ms, while this one:
long n = 0;
unsigned int i, j, innerLoopLength = argc;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
takes 35 ms.
No matter what innerLoopLength is, the first version is always pretty fast, while the second gets slower and slower.
Does anybody know why, and is there a way to speed up the second version? I'm grateful for every ms.
Full code:
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
int main(int argc, char *argv[]) {
vector<long> v;
cout << "argc: " << argc << endl;
for (long l = 1; l <= argc; l++) {
v.push_back(l);
}
auto start = chrono::steady_clock::now();
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
auto end = chrono::steady_clock::now();
cout << "duration: " << chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000.0 << " ms" << endl;
cout << "n: " << n << endl;
return 0;
}
Compiled with -std=c++1z and -O3.
The fixed-length loop was far quicker due to loop unrolling:
Loop unrolling, also known as loop unwinding, is a loop transformation
technique that attempts to optimize a program's execution speed at the
expense of its binary size, which is an approach known as space–time
tradeoff. The transformation can be undertaken manually by the
programmer or by an optimizing compiler.
The goal of loop unwinding is to increase a program's speed by
reducing or eliminating instructions that control the loop, such as
pointer arithmetic and "end of loop" tests on each iteration; reducing
branch penalties; as well as hiding latencies, including the delay in
reading data from memory. To eliminate this computational overhead,
loops can be re-written as a repeated sequence of similar independent
statements.
Essentially, the inner loop of your C(++) code is transformed by the compiler into the following:
for (i = 0; i < 10000000; i++) {
n += v[0];
n += v[1];
n += v[2];
n += v[3];
}
As you can see, it is a little bit faster.
In your specific case, there is yet another source of optimization: you add the same values to n 10000000 times. gcc can detect this (since around version 3.x) and converts it to a multiplication. You can check that: doing the same loop 100000000000 times will still finish in about 0 ms. You can also check it at the ASM level (g++ -S -o bench.s bench.c -O3); you will see only a multiplication, not an addition in a loop. To avoid this, you would have to add something that cannot be converted to a multiplication so easily (for example, a value that depends on i).
Neither of these optimizations can be applied in the second case. Thus, at the ASM level, you have to deal with a lot of conditional expressions (conditional jumps). These are costly on a modern CPU, because a mispredicted branch forces the CPU pipeline to be flushed.
What can help:
If you know something about innerLoopLength, for example that it is always divisible by 4, you can unroll the loop yourself (see the sketch after this list).
Some gcc (g++) optimization flags can tell the compiler that you need fast code here. Compile with at least -O3 -funroll-loops.
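For the first point, a hand-unrolled sketch of the inner loop, assuming innerLoopLength is known to be a multiple of 4 (otherwise a small remainder loop has to handle the leftover elements):

for (i = 0; i < 10000000; i++) {
    for (j = 0; j + 3 < innerLoopLength; j += 4) {
        n += v[j] + v[j + 1] + v[j + 2] + v[j + 3];
    }
}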
So I want to optimize the sum of a really big array, and in order to do that I have written multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread instead of 2, 3 or 4 threads...
Can someone explain to me why this happens?
(Also, I've only started coding in C++ this semester; until then I only knew C, so I'm sorry for possible dumb mistakes.)
This is the thread code
*localSum = 0.0;
for (size_t i = 0; i < stop; i++)
*localSum += v[i];
Main process code
int numThreads = atoi(argv[1]);
int N = 100000000;
// create the input vector v and put some values in v
vector<double> v(N);
for (int i = 0; i < N; i++)
v[i] = i;
// this vector will contain the partial sum for each thread
vector<double> localSum(numThreads, 0);
// create threads. Each thread will compute part of the sum and store
// its result in localSum[threadID] (threadID = 0, 1, ... numThread-1)
startChrono();
vector<thread> myThreads(numThreads);
for (int i = 0; i < numThreads; i++){
int start = i * v.size() / numThreads;
myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i],v.size()/numThreads);
}
for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));
// calculate global sum
double globalSum = 0.0;
for (int i = 0; i < numThreads; i++)
globalSum += localSum[i];
cout.precision(12);
cout << "Sum = " << globalSum << endl;
cout << "Runtime: " << stopChrono() << endl;
exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough. Vectorized streaming add will be really hard to beat. You need a more complex function than add to really see results. Or a very large array.
2- Related: the overhead of all the thread creation and joining is going to swamp any performance gains from the threading. Adding is really fast, and you can easily saturate the CPU's functional units. For a second thread to help, it can't even be a hyperthread on the same core; it would need to be on a different core entirely (as the hyperthreads would both compete for the floating-point units).
To test this, you can try to create all the threads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
3- All your localSum variables share the same cache line. It would be better to accumulate into a local variable on the stack and only store the final result into the array, instead of adding directly into the array element: https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
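A sketch of the thread function written that way, assuming a signature matching how the threads are launched in the question (the exact signature is my guess):

void threadsum(int threadID, int numThreads, const double* v, double* localSum, size_t stop)
{
    double sum = 0.0;              // accumulate in a register / on this thread's stack
    for (size_t i = 0; i < stop; i++)
        sum += v[i];
    *localSum = sum;               // one write to the shared vector at the end
}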
If, for some reason, you need to keep the sums observable to others in that array, pad the localSum vector entries like this so they don't share the same cache line:
struct localsumentry {
double sum;
char pad[56];
};
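(A double is 8 bytes, so the 56 bytes of padding round each entry up to 64 bytes, a typical cache-line size. The partial results can then be declared as std::vector<localsumentry> localSum(numThreads), with each thread writing only its own entry's sum field.)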