My program needs to compute, in parallel, all distinct rotations of words and longer texts.
If you do not know what this means: Rotations of "BANANA" are
BANANA
ANANAB
NANABA
ANABAN
NABANA
ABANAN
(each step simply moves the first letter to the end.)
vector<string> rotate_sequentiell( string* word )
{
    vector<string> all_rotations;
    for ( unsigned int i = 0; i < word->size(); i++ )
    {
        // rotation i: the tail starting at i, followed by the first i characters
        string rotated = word->substr( i ) + word->substr( 0, i );
        all_rotations.push_back( rotated );
    }
    if ( verbose ) { printVec( &all_rotations, "Rotations" ); }
    return all_rotations;
}
We should be able to make this parallel. Instead of moving just one letter to the end, I want to move several letters at once, so for example we take BANANA, move the "BA" to the end, and get NANABA, which is the third entry in the list above.
I implemented it like this
vector<string> rotate_parallel( string* word )
{
    // pre-size the vector so each iteration can write to its own slot
    vector<string> all_rotations( word->size() );
    #pragma omp parallel for
    for ( unsigned int i = 0; i < word->size(); i++ )
    {
        string rotated = word->substr( i ) + word->substr( 0, i );
        all_rotations[i] = rotated;
    }
    if ( verbose ) { printVec( &all_rotations, "Rotations" ); }
    return all_rotations;
}
I pre-calculated the number of possible rotations and used #pragma omp parallel for, so it should do what I think it does.
To test these functions, I have a 40KB text file which is meant to be "rotated". I want all the distinct rotations of a large text.
What happens now is that the sequential version takes about 4.3 seconds and the parallel version about 6.5 seconds.
Why is that? What am I doing wrong?
This is how I measure time:
clock_t start, finish;
start = clock();
bwt_encode_parallel( &glob_word, &seperator );
finish = clock();
cout << "Time (seconds): "
<< ((double)(finish - start))/CLOCKS_PER_SEC;
I compile my code with
g++ -O3 -g -Wall -lboost_regex -fopenmp -fmessage-length=0
The parallel version has two sources of additional work compared to the sequential version:
(1) the overhead of starting the threads, and
(2) coordination and locking between the threads.
The impact of (1) should diminish as the data set grows larger, and probably can't account for 2 seconds anyway, but it does set the lower limit on how small a job is still worth parallelizing.
(2) is in your case probably caused mostly by OMP assigning tasks to the threads, and by the different threads doing memory allocation for the two intermediate substrings and the final string rotated: the memory allocation routine probably has to take a global lock before it can reserve a piece of the heap for you.
Preallocating the final storage in a single thread and telling OMP to run the parallel loop in large blocks (2048 iterations per chunk) tilts the result in favor of the parallel execution. I get about 700 ms for the single-threaded and 330 ms for the multithreaded version with the code below:
enum { SZ = 40960 };
std::string word;
word.resize(SZ);
for (int i = 0; i < SZ; i++) {
    word[i] = (i & 127) + 1; // put stuff into the word
}
std::vector<std::string> all_rotations(SZ);
clock_t start, finish;
start = clock();
// preallocate the result strings in a single thread
for (int i = 0; i < (int)word.size(); i++) {
    all_rotations[i].reserve(SZ);
}
// hand each thread large static chunks of 2048 iterations
#pragma omp parallel for schedule (static, 2048)
for (int i = 0; i < (int)word.size(); i++) {
    std::string rotated = word.substr(i) + word.substr(0, i);
    all_rotations[i] = rotated;
}
finish = clock();
printf("Time (seconds): %0.3lf\n", ((double)(finish - start)) / CLOCKS_PER_SEC);
Last, when you need the results for the Burrows-Wheeler transform, you don't necessarily want N copies of a string that contains N characters. It would save space and processing to treat the string as a ring buffer and read each rotation from a different offset in the buffer.
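A rough sketch of the ring-buffer idea (my own illustration, not code from the question): store the text once and address rotation i by reading modulo the length.

#include <string>

// Rotation 'rotation', character j is simply word[(rotation + j) % N],
// so no rotation ever has to be materialized as its own string.
char rotation_char(const std::string& word, size_t rotation, size_t j)
{
    return word[(rotation + j) % word.size()];
}

// Example: rotation_char("BANANA", 2, 0) == 'N', the first character
// of the third rotation "NANABA".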
I am generating objects of a class and putting them into an std::vector. Before adding a new one, I need to check whether it intersects any of the already generated objects. As I plan to have millions of them, I need to parallelize this function, because it takes a lot of time (the function must check each new object against all previously generated ones).
Unfortunately, the speed increase is not significant. The profiler also shows very low efficiency (all overhead). Any advice would be appreciated.
bool
Generator::_check_cube (std::vector<Cube> &cubes, const Cube &cube)
{
    auto ptr_cube = &cube;
    auto npol = cubes.size();
    auto ptr_cubes = cubes.data();
    const auto nthreads = omp_get_max_threads();
    bool check = false;
    #pragma omp parallel shared(ptr_cube, ptr_cubes, npol, check)
    {
        #pragma omp single nowait
        {
            // split the existing cubes into one contiguous batch per thread
            const auto batch_size = npol / nthreads;
            for (int32_t i = 0; i < nthreads; i++)
            {
                const auto bstart = batch_size * i;
                // the last task takes any remainder so no cube is skipped
                const auto bend = (i == nthreads - 1) ? npol : bstart + batch_size;
                #pragma omp task firstprivate(i, bstart, bend) shared(check)
                {
                    struct bd bd1{}, bd2{};
                    bd1 = allocate_bd();
                    bd2 = allocate_bd();
                    for (auto j = bstart; j < bend; j++)
                    {
                        // bail out early if another task already found a hit
                        bool loc_check;
                        #pragma omp atomic read
                        loc_check = check;
                        if (loc_check) break;
                        if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
                        {
                            #pragma omp atomic write
                            check = true;
                            break;
                        }
                    }
                    free_bd(&bd1);
                    free_bd(&bd2);
                }
            }
        }
    }
    return check;
}
UPDATE: The Cube is actually made of smaller Cuboid objects, each of which has a size (L, W, H), position coordinates, and a rotation. The intersect function:
bool
Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
{
    const auto nom = number_of_cuboids();
    const auto onom = other.number_of_cuboids();
    // test every cuboid of this cube against every cuboid of the other
    for (int32_t i = 0; i < nom; i++)
    {
        get_mcoord(i, bd1);
        for (int32_t j = 0; j < onom; j++)
        {
            other.get_mcoord(j, bd2);
            if (check_gjk_intersection(bd1, bd2))
            {
                return true;
            }
        }
    }
    return false;
}
// get_mcoord copies the vertices of one cuboid into the bd buffer
void
Cube::get_mcoord(int32_t index, struct bd *bd) const
{
    for (int32_t i = 0; i < 8; i++)
    {
        for (int32_t j = 0; j < 3; j++)
        {
            bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
        }
    }
}
inline struct bd
allocate_bd()
{
    struct bd bd{};
    bd.numpoints = 8;
    // 8 vertices x 3 coordinates, allocated row by row
    bd.coord = (double **) malloc(8 * sizeof(double *));
    for (int32_t i = 0; i < 8; i++)
    {
        bd.coord[i] = (double *) malloc(3 * sizeof(double));
    }
    return bd;
}
Typical values: npol > 1 million, 32 threads; each of the npol Cubes consists of 1-3 smaller cuboids, which are checked directly against each other for intersection.
The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus one task may break early, but all the others will still go through their full search.
With recent versions of OpenMP (5, I think) there is a solution for that:
(Not sure about this one: make your tasks much more fine-grained, for instance one per intersection test);
Spawn your tasks in a taskloop;
Once you find your intersection (or any condition that causes you to break), cancel with #pragma omp cancel taskgroup (a taskloop runs inside an implicit taskgroup); see the sketch after this list.
Small problem: cancellation is disabled by default. Set the environment variable OMP_CANCELLATION to true.
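A minimal, self-contained sketch of that pattern, assuming the program is run with OMP_CANCELLATION=true (the intersects function is a hypothetical stand-in for the cube test):

#include <omp.h>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for ptr_cube->cube_intersecting(ptr_cubes[j], ...).
static bool intersects(int value) { return value == 700000; }

int main()
{
    std::vector<int> cubes(1000000);
    for (size_t i = 0; i < cubes.size(); i++) cubes[i] = (int)i;

    bool found = false;
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(4096) shared(found)
    for (size_t j = 0; j < cubes.size(); j++)
    {
        if (intersects(cubes[j]))
        {
            #pragma omp atomic write
            found = true;
            #pragma omp cancel taskgroup           // stop the remaining tasks
        }
        #pragma omp cancellation point taskgroup   // let running tasks notice
    }
    std::printf("found: %d\n", (int)found);
}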
Do you have more intersections being true or more being false? If most are true, you're flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid the shared resource, so there is no mutex: let all threads run, and at the end take the decision from their combined results. This will likely run faster, but the benefit also depends on parameters such as nthreads and the number of cuboids.
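For illustration, a hedged sketch of that variant using an OpenMP logical-OR reduction (cube_hits is a hypothetical per-index intersection test, not a function from the question):

// Each thread accumulates its own private flag; OpenMP combines the
// per-thread flags with || at the end, so there are no shared writes
// during the loop itself.
bool check = false;
#pragma omp parallel for reduction(||: check)
for (long j = 0; j < (long)npol; j++)
    check = check || cube_hits(j);   // hypothetical per-index test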
It is possible that on another architecture (e.g., a GPU) your algorithm works well as it is. It may be worth benchmarking it on a GPU to see whether you would benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).
You also have a complexity problem: for every new cuboid you compare against up to the whole set of existing cuboids. One way to address this is to gather all the cuboid extents (ranges) by dimension, keep them ordered, and insert each new cuboid's ranges in order. If there is an intersection in one dimension, you test the next one, and so on; you can also run the dimensions in parallel. Before walking through the ranges, test whether you hit inside the global range at all; if not, it is useless to test the local intersections. A sketch of the one-dimensional pre-filter follows below.
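A minimal sketch of that pre-filter for one dimension, under the assumption that ranges are kept sorted by their lower bound (Range and overlaps_1d are my own illustrative names):

#include <algorithm>
#include <vector>

struct Range { double lo, hi; int id; };

// 'sorted' is ascending by lo; returns true if [lo, hi] overlaps any range.
// Only those candidates would still need the full GJK test.
bool overlaps_1d(const std::vector<Range>& sorted, double lo, double hi)
{
    // everything from 'stop' onwards starts after our interval ends
    auto stop = std::upper_bound(sorted.begin(), sorted.end(), hi,
        [](double v, const Range& r) { return v < r.lo; });
    for (auto it = sorted.begin(); it != stop; ++it)
        if (it->hi >= lo) return true;
    return false;
}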
Here, and in general, you want to parallelize with a minimum of dependencies (shared resources, mutexes). So you want to find a point of view from which that is possible: parallelizing over dimensions of ordered ranges (segments) might be better than parallelizing over cuboids.
Algorithms and the benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are irrelevant, but that one may find a smarter approach given those values.
I think your code is memory bound, so its bottleneck is memory reads/writes, not calculations. This can be the main reason for the poor speed increase. As already mentioned by @Soleil, different hardware (a GPU) can be beneficial here.
You mentioned in the comments that Generator::_check_cube is called many times. To reduce OpenMP overheads, my suggestion is to move the parallel region out of this function; you can even put it in your main function:
int main() {
    #pragma omp parallel
    #pragma omp single nowait
    {
        // your code
    }
}
In this case you have to use #pragma omp taskwait to wait for the tasks to complete.
for (int32_t i = 0; i < nthreads; i++)
{
    #pragma omp task default(none) firstprivate(...) shared(...)
    {
        // your code comes here
    }
}
#pragma omp taskwait
I also suggest using the default(none) clause in the #pragma omp task directive, so you have to state the sharing attribute of every variable explicitly.
Do you really need the function get_mcoord? It looks like a redundant memory copy to me. It may be better to write a check_gjk_intersection function that takes _cuboids or their indices as parameters. That way you get rid of the many memory allocations/deallocations for bd1 and bd2, which can also be time consuming, as @Victor pointed out.
I'm trying to optimize the performance of a C++ program by using the TBB library.
My program only contains a couple of small for loops, so I know it can be a challenge to optimize this case, but I have to use TBB.
I tried to use a partitioner, which made the program two times faster with TBB than without the partitioner, but it's still slower than the original program without parallelism.
In my code, I print when a loop iteration starts and ends, with its id, to see whether there is parallelism. The output shows that the loop is in fact executed sequentially, for example: start 1 end 1, start 2 end 2, etc. (the list has 200 elements). The ids do not appear in the shuffled order you would expect from a parallelized program.
Here is an example of how I used the library:
tbb::global_control c(tbb::global_control::max_allowed_parallelism, 1000);
size_t grainsize = 1000;
size_t changes = 0;
tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, points.size(), grainsize),
    [&](const tbb::blocked_range<std::size_t>& r) {
        for (size_t id = r.begin(); id < r.end(); ++id) {
            std::cout << "start:" << id << std::endl;
            double disto = std::numeric_limits<double>::max();
            size_t cluster_id = 0;
            const Point& point = points.at(id);
            // find the closest origin for this point
            for (size_t i = 0; i < origins.size(); i++) {
                const Point& origin = origins[i];
                double disto2 = point.dist(origin);
                if (disto2 < disto) {
                    disto = disto2;
                    cluster_id = i;
                }
            }
            if (m[id] != cluster_id) {
                m[id] = cluster_id;
                changes++;   // note: racy if the loop actually runs in parallel
            }
            disto_list[id] = disto;
            std::cout << "end:" << id << std::endl;
        }
    }
);
Is there a way to improve the performance of a C++ program composed of multiple small for loops using the TBB library? And why are the loops not parallelized?
If you are using task_scheduler_init in your program, then TBB keeps using the same threads throughout the program until the task_scheduler_init objects are destroyed.
As you are passing max_allowed_parallelism as a parameter to global_control: if it is set to 1, it will make your application run sequentially.
You can refer to the below link:
https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html
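As a quick illustration (my own minimal sketch, not code from the question), this is how max_allowed_parallelism caps the concurrency:

#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include <cstdio>

int main() {
    // With the limit set to 1 the loop below is forced to run sequentially;
    // raise the limit (or drop the control object) to allow real parallelism.
    tbb::global_control c(tbb::global_control::max_allowed_parallelism, 1);
    tbb::parallel_for(0, 8, [](int i) { std::printf("iteration %d\n", i); });
}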
It would be helpful if you provided a complete reproducer, so we could figure out where exactly the issue takes place.
I'm new to OpenMP and multi-threading.
I have been given a task to implement static, dynamic, and guided scheduling by hand, without using the OpenMP for construct, which means I can't use the schedule clauses.
I could create parallel threads with the parallel construct and assign loop iterations to the threads equally, but how do I make the scheduling static, dynamic (with a block size of 1000), and guided?
void static_scheduling_function(const int start_count,
                                const int upper_bound,
                                int *results)
{
    int i, tid, numt;
    #pragma omp parallel private(i, tid, numt)
    {
        int from, to;
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        // each thread takes one contiguous block of iterations
        from = (upper_bound / numt) * tid;
        to = (upper_bound / numt) * (tid + 1) - 1;
        if (tid == numt - 1)
            to = upper_bound - 1;
        for (i = from; i <= to; i++)   // 'to' is an inclusive bound
        {
            // compute one iteration (i)
            int start = i;
            int end = i + 1;
            compute_iterations(start, end, results);
        }
    }
}
======================================
For dynamic scheduling I have tried something like this:
void chunk_scheduling_function(const int start_count, const int upper_bound, int* results)
{
    for (int shared_lower_iteration_counter = start_count; shared_lower_iteration_counter < upper_bound; )
    {
        #pragma omp parallel shared(shared_lower_iteration_counter)
        {
            int from, to;
            int chunk = 1000;
            #pragma omp critical
            {
                from = shared_lower_iteration_counter;       // 10, 1010
                to = shared_lower_iteration_counter + chunk; // 1010,
                // critical is important while incrementing the shared
                // variable that decides the next iteration
                shared_lower_iteration_counter = shared_lower_iteration_counter + chunk;
            }
            for (int i = from; i < to && i < upper_bound; i++)
            {
                // i < upper_bound prevents threads from running past the end
                int start = i;
                int end = i + 1;
                compute_iterations(start, end, results);
            }
        }
    }
}
This looks like a university assignment (and a very good one, IMO), so I will not provide the complete solution; instead, I will point out what you should be looking for.
The static scheduler looks okay; nevertheless, it can be improved by taking the chunk size into account as well.
The dynamic and guided schedulers can be implemented using a variable (let us name it shared_iteration_counter) that marks the next loop iteration to be picked up by a thread. When a thread needs a new task to work on (i.e., a new loop iteration), it queries and advances that variable. In pseudo code this would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while (thread_current_iteration < MAX_SIZE)
{
    // do work
    thread_current_iteration = shared_iteration_counter++;
}
The pseudo code assumes a chunk size of 1 (i.e., shared_iteration_counter++); you will have to adapt it to your use-case. Now, because that variable is shared among threads and every thread updates it, you need to ensure mutual exclusion during the updates. Fortunately, OpenMP offers means to achieve that, for instance #pragma omp critical, explicit locks, and atomic operations. The latter is the best option for your use-case, and since you also need the value the counter had before the increment, use the capture form:
#pragma omp atomic capture
thread_current_iteration = shared_iteration_counter++;
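Put together, a hedged sketch of the dynamic scheduler with a chunk of 1000 could look like this (compute_iterations is taken from your question; everything else is illustrative):

#include <omp.h>

void dynamic_scheduling_function(const int start_count,
                                 const int upper_bound, int *results)
{
    int shared_iteration_counter = start_count;
    #pragma omp parallel shared(shared_iteration_counter)
    {
        const int chunk = 1000;
        while (1)
        {
            int from;
            // atomically grab the next chunk of iterations
            #pragma omp atomic capture
            { from = shared_iteration_counter; shared_iteration_counter += chunk; }
            if (from >= upper_bound)
                break;
            int to = from + chunk;
            if (to > upper_bound)
                to = upper_bound;   // clip the last, partial chunk
            compute_iterations(from, to, results);
        }
    }
}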
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies the minimum chunk size to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case you not only have to guarantee mutual exclusion on the variable that counts the current loop iteration to be picked up by threads, but also on the chunk size variable, since it changes as well.
Without giving too much away, bear in mind that you may need to consider how to deal with edge cases such as thread_current_iteration = 1000 and chunk_size = 1000 with MAX_SIZE = 1500: then thread_current_iteration + chunk_size > MAX_SIZE, but there are still 500 iterations to be computed. A sketch of the guided variant follows below.
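For illustration only, a minimal sketch of such a guided scheduler under the assumptions above (min_chunk is a hypothetical floor; the shrinking chunk mirrors remaining/num_threads):

#include <omp.h>

void guided_scheduling_function(const int start_count,
                                const int upper_bound, int *results)
{
    int shared_iteration_counter = start_count;
    const int min_chunk = 100;   // hypothetical minimum chunk size
    #pragma omp parallel shared(shared_iteration_counter)
    {
        const int numt = omp_get_num_threads();
        while (1)
        {
            int from, to;
            // counter and chunk size are read and updated together,
            // so a critical section is simpler than two atomics here
            #pragma omp critical
            {
                from = shared_iteration_counter;
                int chunk = (upper_bound - from) / numt;  // shrinks over time
                if (chunk < min_chunk)
                    chunk = min_chunk;
                shared_iteration_counter += chunk;
                to = from + chunk;
            }
            if (from >= upper_bound)
                break;
            if (to > upper_bound)
                to = upper_bound;   // clip the last, partial chunk
            compute_iterations(from, to, results);
        }
    }
}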
I am trying to parallelize my own C++ implementation of the Travelling Salesman Problem using OpenMP.
I have a function cost() that calculates the cost of a route, and a vector [0, 1, 2, ..., N], where N is the number of nodes of the route.
In main(), I try to find the best route:
do
{
    cost();
} while (std::next_permutation(permutation_base, permutation_base + operations_number));
I tried to use #pragma omp parallel to parallelize that code, but it only made it slower.
Is there any way to parallelize this code?
#pragma omp parallel doesn't automatically divide the computation onto separate threads. If you want to divide the computation, you additionally need to use #pragma omp for; otherwise the whole computation is done multiple times, once for each thread. For instance, the following code prints "Hello World!" four times on my laptop, since it has 4 cores.
#include <iostream>

int main(int argc, char* argv[]) {
    #pragma omp parallel
    std::cout << "Hello World!\n";
}
The same thing happens to your code if you simply write #pragma omp parallel: it gets executed multiple times, once per thread, and therefore your program won't be faster. If you want to divide the work between the threads (each thread does different things), you have to use something like #pragma omp parallel for.
Now we can look at your code. It isn't suited for parallelization as it stands. Let's see why. You start with your array permutation_base and calculate the costs. Then you manipulate permutation_base with next_permutation. You have to wait for the cost computation to finish before you are allowed to manipulate the array, because otherwise the cost computation would be wrong. So the whole thing can't run on separate threads as written.
One possible solution is to keep multiple copies of your array permutation_base, where each copy runs through a part of all permutations. For instance:
vector<int> permutation_base{1, 2, 3, 4};
int n = permutation_base.size();

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    // make a copy of permutation_base
    auto perm = permutation_base;
    // rotate the i'th element to the front,
    // keeping the other elements sorted
    std::rotate(perm.begin(), perm.begin() + i, perm.begin() + i + 1);
    // now go through all permutations of the last n-1 elements,
    // keeping the first element fixed
    do {
        cost();
    } while (std::next_permutation(perm.begin() + 1, perm.end()));
}
Most definitely.
The big problem with parallelizing these permutation problems is that in order to parallelize well, you need to "index" into an arbitrary permutation. In short, you need to find the kth permutation. You can take advantage of some cool math properties and you'll find this:
// fact(n) is assumed to be a helper returning n! as a long long
std::vector<int> kth_perm(long long k, std::vector<int> V) {
    long long int index;
    long long int next;
    std::vector<int> new_v;
    while (V.size()) {
        // each output position is chosen by dividing k by (remaining - 1)!
        index = k / fact(V.size() - 1);
        new_v.push_back(V.at(index));
        next = k % fact(V.size() - 1);
        V.erase(V.begin() + index);
        k = next;
    }
    return new_v;
}
So then your logic might look something like this:
long long int start = (numperms*threadnum)/ numthreads;
long long int end = threadnum == numthreads-1 ? numperms : (numperms*(threadnum+1))/numthreads;
perm = kth_perm(start, perm); // perm is your list of permutations
for (int j = start; j < end; ++j){
if (is_valid_tour(adj_list, perm, startingVertex, endingVertex)) {
isValidTour=true;
return perm;
}
std::next_permutation(perm.begin(),perm.end());
}
isValidTour = false;
return perm;
Obviously there's a lot of code, but the idea of parallelizing it can be captured by the little code I've posted. You can visualize "indexing" like this:
|--------------------------------|
^ ^ ^
t1 t2 ... tn
Find the ith permutation and let each thread call std::next_permutation until it reaches the starting point of the next thread.
Note that you'll want to wrap the function that contains the bottom code in a #pragma omp parallel region.
Recently I started using OpenMP. I am doing a numerical calculation involving 3D matrices created in C++ as vectors, and I used parallel for loops to speed up the code. But it runs slower than the serial code. I compile the code with Code::Blocks on Windows 7. The code is something like this:
int main() {
    vector<vector<vector<float> > > Dx;
    /* create 3d array Dx[IE][JE][KE] as vectors */
    Dx.resize(IE);
    for (int i = 0; i < IE; ++i) {
        Dx[i].resize(JE);
        for (int j = 0; j < JE; ++j) {
            Dx[i][j].resize(KE);
        }
    }
    // declare and initialize more matrices like this
    .
    .
    .
    float curl_h;
    double wtime = omp_get_wtime(); // start time
    // matrix calculations using a parallel for loop
    #pragma omp parallel for
    for (int i = 1; i < IE; ++i) {
        for (int j = 1; j < JE; ++j) {
            for (int k = 1; k < KE; ++k) {
                curl_h = ( Hz[i][j][k] - Hz[i][j-1][k] - Hy[i][j][k] + Hy[i][j][k-1] );
                idxl[i][j][k] = idxl[i][j][k] + curl_h;
                Dx[i][j][k] = gj3[j] * gk3[k] * Dx[i][j][k]
                            + gj2[j] * gk2[k] * 0.5 * (curl_h + gi1[i] * idxl[i][j][k]);
            }
        }
    }
    wtime = omp_get_wtime() - wtime;
}
But the code with parallel loops runs slower than the serial code. Any ideas?
Thanks.
The loop uses the variable curl_h, which is not declared as thread-private. This is both a bug and the reason for your perceived performance problem:
As there is only one place in memory where curl_h is stored, all threads constantly and concurrently try to read and write it. One CPU core will load the value into its cache, the next one will issue a write to it, invalidating the cache of the first CPU, which will again grab the cacheline when it itself tries to use curl_h (read or write, both will require the cacheline to be in the local cache).
The point is, that the fierce pretense put up by the hardware that there is only one memory location called curl_h demands its tribute. You get a huge amount of chatter in the cache coherency protocol, and keep your memory buses busy with constantly refetching the same cacheline from memory. All your threads are really doing is fighting over that one cacheline.
Of course, the constant races between the threads are a big bug, as no process can be certain that the value it's currently using is actually the one it calculated in the statement above.
So, just add the correct private() declaration to your omp parallel for statement, and you'll fix both the bug and the performance issue; a sketch follows below.
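A minimal sketch of the fix applied to the loop from the question (declaring curl_h inside the loop body would work just as well, since block-local variables are automatically private):

float curl_h;
#pragma omp parallel for private(curl_h)
for (int i = 1; i < IE; ++i) {
    for (int j = 1; j < JE; ++j) {
        for (int k = 1; k < KE; ++k) {
            // each thread now works on its own copy of curl_h
            curl_h = ( Hz[i][j][k] - Hz[i][j-1][k] - Hy[i][j][k] + Hy[i][j][k-1] );
            idxl[i][j][k] = idxl[i][j][k] + curl_h;
            Dx[i][j][k] = gj3[j] * gk3[k] * Dx[i][j][k]
                        + gj2[j] * gk2[k] * 0.5 * (curl_h + gi1[i] * idxl[i][j][k]);
        }
    }
}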