FFTW plan creation using OpenMP - c++

I am trying to perform several FFTs in parallel. I am using FFTW and OpenMP. Each FFT is different, so I'm not relying on FFTW's built-in multithreading (which I know uses OpenMP).
int m;
// assume:
// int numberOfColumns = 100;
// int numberOfRows = 100;
#pragma omp parallel for default(none) private(m) shared(numberOfColumns, numberOfRows)// num_threads(4)
for(m = 0; m < 36; m++){
// create pointers
double *inputTest;
fftw_complex *outputTest;
fftw_plan testPlan;
// preallocate vectors for FFTW
outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
inputTest = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);
// confirm that preallocation worked
if (inputTest == NULL || outputTest == NULL){
logger_.log_error("\t\t FFTW memory not allocated on m = %i", m);
}
// EDIT: insert data into inputTest
inputTest = someDataSpecificToThisIteration(m); // same size for all m
// create FFTW plan
#pragma omp critical (make_plan)
{
testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);
}
// confirm that plan was created correctly
if (testPlan == NULL){
logger_.log_error("\t\t failed to create plan on m = %i", m);
}
// execute plan
fftw_execute(testPlan);
// clean up
fftw_free(inputTest);
fftw_free(outputTest);
fftw_destroy_plan(testPlan);
}// end parallelized for loop
This all works fine. However, if I remove the critical construct from around the plan creation (fftw_plan_dft_r2c_2d) my code will fail. Can someone explain why? fftw_plan_dft_r2c_2d isn't really an "orphan", right? Is it because two threads might both try to hit the numberOfRows or numberOfColumns memory location at the same time?

It's pretty much all written in the FFTW documentation about thread safety:
... but some care must be taken because the planner routines share data (e.g. wisdom and trigonometric tables) between calls and plans.
The upshot is that the only thread-safe (re-entrant) routine in FFTW is fftw_execute (and the new-array variants thereof). All other routines (e.g. the planner) should only be called from one thread at a time. So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread. We do not think this should be an important restriction (FFTW is designed for the situation where the only performance-sensitive code is the actual execution of the transform), and the benefits of shared data between plans are great.
In a typical FFT application, plans are constructed seldom, so it doesn't really matter if you have to synchronise their creation. In your case you don't need to create a new plan at each iteration, unless the dimensions of the data change. You would rather do the following:
#pragma omp parallel default(none) private(m) shared(numberOfColumns, numberOfRows)
{
// create pointers
double *inputTest;
fftw_complex *outputTest;
fftw_plan testPlan;
// preallocate vectors for FFTW
outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
inputTest = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);
// confirm that preallocation worked (m is not yet set here, so report the thread instead)
if (inputTest == NULL || outputTest == NULL){
logger_.log_error("\t\t FFTW memory not allocated in thread %i", omp_get_thread_num());
}
// create FFTW plan
#pragma omp critical (make_plan)
testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);
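#pragma omp for
for (m = 0; m < 36; m++) {
// copy the data for iteration m into inputTest here, then execute the plan
fftw_execute(testPlan);
}
// clean up
fftw_free(inputTest);
fftw_free(outputTest);
// like plan creation, plan destruction is not thread-safe
#pragma omp critical (make_plan)
fftw_destroy_plan(testPlan);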
}
Now the plans are created only once in each thread and the serialisation overhead would diminish with each execution of fftw_execute(). If running on a NUMA system (e.g. a multi-socket AMD64 or Intel (post-)Nehalem system), then you should enable thread binding in order to achieve maximum performance.

Related

Problem of sorting OpenMP threads into NUMA nodes by experiment

I'm attempting to create a std::vector<std::set<int>> with one set for each NUMA-node, containing the thread-ids obtained using omp_get_thread_num().
Topo:
Idea:
Create data which is larger than L3 cache,
set first touch using thread 0,
perform multiple experiments to determine the minimum access time of each thread,
extract the threads into nodes based on sorted access times and information about the topology.
Code: (Intel compiler, OpenMP)
// create data which will be shared by multiple threads
const auto part_size = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto size = 2 * part_size;
auto container = std::unique_ptr<double[]>(new double[size]); // array form, so that delete[] is used
// open a parallel section
auto thread_count = 0;
auto thread_id_min_duration = std::multimap<double, int>{};
#ifdef DECIDE_THREAD_COUNT
#pragma omp parallel num_threads(std::thread::hardware_concurrency())
#else
#pragma omp parallel
#endif
{
// perform first touch using thread 0
const auto thread_id = omp_get_thread_num();
if (thread_id == 0)
{
thread_count = omp_get_num_threads();
for (auto index = std::size_t{}; index < size; ++index)
{
container.get()[index] = static_cast<double>(std::rand() % 10 + 1);
}
}
#pragma omp barrier
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// calculate the minimum access time of this thread
auto this_thread_min_duration = std::numeric_limits<double>::max();
for (auto experiment_counter = std::size_t{}; experiment_counter < 250; ++experiment_counter)
{
const auto* data = experiment_counter % 2 == 0 ? container.get() : container.get() + part_size;
const auto start_timestamp = omp_get_wtime();
for (auto index = std::size_t{}; index < part_size; ++index)
{
static volatile auto exceedingly_interesting_value_wink_wink = data[index];
}
const auto end_timestamp = omp_get_wtime();
const auto duration = end_timestamp - start_timestamp;
if (duration < this_thread_min_duration)
{
this_thread_min_duration = duration;
}
}
#pragma omp critical
{
thread_id_min_duration.insert(std::make_pair(this_thread_min_duration, thread_id));
}
}
} // #pragma omp parallel
Not shown here is code which outputs the minimum access times sorted into the multimap.
Env. and Output
How do OMP_PLACES and OMP_PROC_BIND work?
I am attempting to not use SMT by using export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=24. However, I'm getting this output:
What's puzzling me is that I'm having the same access times on all threads. Since I'm trying to spread them across the 2 NUMA nodes, I expect to neatly see 12 threads with access time, say, x and another 12 with access time ~2x.
Why is the above happening?
Additional Information
Even more puzzling are the following environments and their outputs:
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=26
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=48
Any help in understanding this phenomenon would be much appreciated.
In short, the benchmark is flawed.
perform multiple experiments to determine the minimum access time of each thread
The term "minimum access time" is unclear here. I assume you mean "latency". The thing is that your benchmark does not measure latency. volatile only tells the compiler that the accesses must actually be performed; the loads and stores still go through the memory hierarchy. The processor is free to keep the value in its cache, and x86-64 processors actually do that (like almost all modern processors).
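To make the distinction concrete: a common way to measure latency rather than throughput is pointer chasing, where each load depends on the result of the previous one so that the prefetchers cannot hide the access time. The following is only a sketch of that idea, not the code used in this thread; the function names are hypothetical, and the array would have to be much larger than the L3 cache (and placed on the NUMA node of interest) for the number to mean anything.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random permutation that forms one single cycle (Sattolo's algorithm),
// so that following next[i] visits every slot exactly once before returning.
std::vector<std::size_t> make_single_cycle(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 gen(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(gen)]);
    }
    return next;
}

// Follow the chain for `steps` dependent loads and return the final index.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t index = 0;
    for (std::size_t s = 0; s < steps; ++s) {
        index = next[index];  // the next address is unknown until this load completes
    }
    return index;
}

// Average time per dependent load in nanoseconds.
double average_load_ns(const std::vector<std::size_t>& next, std::size_t steps) {
    assert(steps > 0);
    const auto start = std::chrono::steady_clock::now();
    volatile std::size_t sink = chase(next, steps);  // keep the result alive
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / steps;
}
```

Each thread would call average_load_ns on the shared chain; because the chain is one cycle of length n, n steps touch every element exactly once.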
How do OMP_PLACES and OMP_PROC_BIND work?
Both variables are described in the OpenMP specification. In short, I strongly advise you to set OMP_PROC_BIND=TRUE and OMP_PLACES="{0},{1},{2},..." based on the values reported by hwloc. More specifically, you can get these from hwloc-calc, which is a really great tool (consider using --li --po, and PU rather than CORE, because this is what OpenMP runtimes expect). For example, you can query the PU identifiers of a given NUMA node. Note that some machines have very weird non-linear OS PU numbering, and OpenMP runtimes sometimes fail to map the threads correctly. IOMP (the OpenMP runtime of ICC) should use hwloc internally, but I found some bugs related to that in the past. To check that the mapping is correct, I advise you to use hwloc-ps.
Note that OMP_PLACES=cores does not guarantee that threads do not migrate from one core to another (even one on a different NUMA node) unless OMP_PROC_BIND=TRUE (or a similar setting) is set. You can also use numactl to control the NUMA policies of your process. For example, you can tell the OS not to use a given NUMA node, or to interleave the allocations. The first-touch policy is not the only one and may not be the default on all platforms (on some Linux platforms, the OS can move pages between NUMA nodes to improve locality).
Why is the above happening?
The code takes 4.38 ms to read 50 MiB of memory in each thread. With 24 threads, this means 1200 MiB are read from node 0, assuming the first-touch policy is applied. Thus the throughput should be about 267 GiB/s. While this seems fine at first glance, it is a pretty big throughput for such a processor, especially assuming only one NUMA node is used. This is certainly because part of the fetches are served from the L3 cache and not from RAM. Indeed, the cache can hold part of the array, and certainly does, resulting in faster fetches thanks to the cache associativity and a good cache policy. This is especially true as the cache lines are not invalidated, since the array is only read. I advise you to use a significantly bigger array to prevent this complex effect from happening.
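The arithmetic behind that figure, written out as a small helper (a sketch; the function name is hypothetical, and the inputs are the 24 threads, 50 MiB per thread and 4.38 ms quoted above):

```cpp
#include <cassert>

// Aggregate read throughput in GiB/s when `threads` threads each read
// `mib_per_thread` MiB within `seconds` seconds.
double aggregate_gib_per_s(int threads, double mib_per_thread, double seconds) {
    assert(threads > 0 && seconds > 0.0);
    const double total_mib = threads * mib_per_thread;  // 24 * 50 = 1200 MiB
    return total_mib / seconds / 1024.0;                // MiB/s -> GiB/s
}
```

Calling aggregate_gib_per_s(24, 50.0, 4.38e-3) gives roughly 267.5 GiB/s, which is the number to compare against what a single NUMA node can plausibly deliver from RAM.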
You certainly expect one NUMA node to have a smaller throughput due to remote NUMA memory access. This is not always true in practice. In fact, this is often wrong on modern 2-socket systems since the socket interconnect is often not a limiting factor (this is the main source of throughput slowdown on NUMA systems).
NUMA effects arise on modern platforms because of unbalanced saturation of the NUMA memory nodes and non-uniform latency. The former is not a problem in your application since all the PUs use the same NUMA memory node. The latter is not a problem either, because of the linear memory access pattern, the CPU caches and the hardware prefetchers: the latency should be completely hidden.
Even more puzzling are the following environments and their outputs
Using 26 threads on a 24-core machine means that 4 threads have to share two cores. Hyper-threading should not help much in such a case, so threads sharing the same core will be slowed down. Because IOMP certainly pins threads to cores and the workload is unbalanced, those 4 threads will be about twice as slow.
Having 48 threads causes all the threads to be slower because each core then has twice the workload.
Let me address your first sentence. A C++ std::vector is different from a C malloc. Malloc'ed space is not "instantiated": only when you touch the memory does the virtual-to-physical address mapping get established. This is known as "first touch". And that is why in C with OpenMP you initialize an array in parallel, so that the socket touching a part of the array gets the pages of that part. In C++, the "array" in a vector is initialized by a single thread, so the pages wind up on the socket of that thread.
Here's a solution:
template<typename T>
struct uninitialized {
uninitialized() {} // user-provided and empty, so val is left uninitialized
T val;
constexpr operator T() const { return val; }
T operator=( const T& v ) { val = v; return val; }
};
Now you can create a vector<uninitialized<double>> and the array memory is not touched until you explicitly initialize it:
vector<uninitialized<double>> x(N),y(N);
#pragma omp parallel for
for (int i=0; i<N; i++)
y[i] = x[i] = 0.;
x[0] = 0; x[N-1] = 1.;
Now, I'm not sure how this goes if you have a vector of sets. Just thought I'd point out the issue.
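If a std::vector is not strictly needed, a plainer route to the same effect is to allocate with new double[n], which default-initializes the elements (that is, does not write them), and then write them from a statically scheduled parallel loop. This is only a sketch: it assumes the first-touch policy is in effect and that later loops over the data use the same static schedule, and first_touch_zero is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Allocate n doubles without touching them, then let each thread write
// (and thereby first-touch) its own statically assigned slice.
std::unique_ptr<double[]> first_touch_zero(std::size_t n) {
    std::unique_ptr<double[]> data(new double[n]);  // no element is written here
    double* raw = data.get();
    const long count = static_cast<long>(n);
    assert(count >= 0);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < count; ++i) {
        raw[i] = 0.0;  // the page lands on the node of the writing thread
    }
    return data;
}
```

Without -fopenmp the pragma is ignored and the loop simply runs on one thread, so the result is the same but all pages end up on one node.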
After more investigation, I note the following:
work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
memory page migration is a thing on modern NUMA systems.
Following this, I started using the work-load manager's own thread binding/pinning system, and adapted my benchmark to lock the memory page(s) on which my data lay. Furthermore, giving in to my programmer's paranoia, I ditched the std::unique_ptr for fear that it may lay its own first touch after allocating the memory.
// create data which will be shared by multiple threads
const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto total_size = thread_count * size_per_thread;
double* data = nullptr;
posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), total_size * sizeof(double));
if (data == nullptr)
{
throw std::runtime_error("could_not_allocate_memory_error");
}
// perform first touch using thread 0
#pragma omp parallel num_threads(thread_count)
{
if (omp_get_thread_num() == 0)
{
#pragma omp simd safelen(8)
for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
{
data[d_index] = -1.0;
}
}
} // #pragma omp parallel
mlock(data, total_size * sizeof(double)); // length is in bytes; page migration is a real thing...
// open a parallel section
auto thread_id_avg_latency = std::multimap<double, int>{};
auto generator = std::mt19937(); // heavy object can be created outside parallel
#pragma omp parallel num_threads(thread_count) private(generator)
{
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// seed each thread's generator
generator.seed(thread_counter + 1);
// calculate the average access latency of this thread
auto this_thread_avg_latency = 0.0;
const auto experiment_count = 250;
for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
{
const auto start_timestamp = omp_get_wtime() * 1E+6;
for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
{
const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread-1)(generator);
auto& datapoint = data[thread_counter * size_per_thread + index];
datapoint += index;
}
const auto end_timestamp = omp_get_wtime() * 1E+6;
this_thread_avg_latency += end_timestamp - start_timestamp;
}
this_thread_avg_latency /= experiment_count;
#pragma omp critical
{
thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
}
}
} // #pragma omp parallel
std::free(data);
With these changes, I am noticing the difference I expected.
Further notes:
this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
there is no reliable cross-platform way of doing this (requires kernel-APIs),
OpenMP seems to number the threads exactly as hwloc/lstopo, numactl and lscpu seem to number them (logical ID?)
The most astonishing things are that the difference in latencies is very low, and that memory page migration may happen, which begs the question, why should we care about first-touch and all the rest of the NUMA concerns at all?

How to optimize omp parallelization when batching

I am generating class Objects and putting them into std::vector. Before adding, I need to check if they intersect with the already generated objects. As I plan to have millions of them, I need to parallelize this function as it takes a lot of time (The function must check each new object against all previously generated).
Unfortunately, the speed increase is not significant. The profiler also shows very low efficiency (all overhead). Any advice would be appreciated.
bool
Generator::_check_cube (std::vector<Cube> &cubes, const Cube &cube)
{
auto ptr_cube = &cube;
auto npol = cubes.size();
auto ptr_cubes = cubes.data();
const auto nthreads = omp_get_max_threads();
bool check = false;
#pragma omp parallel shared (ptr_cube, ptr_cubes, npol, check)
{
#pragma omp single nowait
{
const auto batch_size = npol / nthreads;
for (int32_t i = 0; i < nthreads; i++)
{
const auto bstart = batch_size * i;
// the last batch also takes the remainder of npol / nthreads
const auto bend = (i == nthreads - 1) ? npol : bstart + batch_size;
#pragma omp task firstprivate(i, bstart, bend) shared (check)
{
struct bd bd1{}, bd2{};
bd1 = allocate_bd();
bd2 = allocate_bd();
for (auto j = bstart; j < bend; j++)
{
bool loc_check;
#pragma omp atomic read
loc_check = check;
if (loc_check) break;
if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
{
#pragma omp atomic write
check = true;
break;
}
}
free_bd(&bd1);
free_bd(&bd2);
}
}
}
}
return check;
}
UPDATE: The Cube is actually made of smaller objects, Cuboids, each of which has a size (L, W, H), position coordinates and a rotation. The intersect function:
bool
Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
{
const auto nom = number_of_cuboids();
const auto onom = other.number_of_cuboids();
for (int32_t i = 0; i < nom; i++)
{
get_mcoord(i, bd1);
for (int32_t j = 0; j < onom; j++)
{
other.get_mcoord(j, bd2);
if (check_gjk_intersection(bd1, bd2))
{
return true;
}
}
}
return false;
}
//get_mcoord calculates vertices of the cuboids
void
Cube::get_mcoord(int32_t index, struct bd *bd) const
{
for (int32_t i = 0; i < 8; i++)
{
for (int32_t j = 0; j < 3; j++)
{
bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
}
}
}
inline struct bd
allocate_bd()
{
struct bd bd{};
bd.numpoints = 8;
bd.coord = (double **) malloc(8 * sizeof(double *));
for (int32_t i = 0; i < 8; i++)
{
bd.coord[i] = (double *) malloc(3 * sizeof(double));
}
return bd;
}
Typical values: npol > 1 million, threads 32, and each npol Cube consists of 1 - 3 smaller cuboids which are directly checked against other if intersect.
The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus, maybe one task will break early, but all the other will go through their full search.
With recent versions of OpenMP (5, I think) there is a solution for that.
(Not sure about this one: Make your tasks much more fine-grained, for instance one for each intersection test);
Spawn your tasks in a taskloop;
Once you find your intersection (or any condition that causes you to break), do cancel taskgroup (a taskloop runs inside an implicit taskgroup, and that is what gets cancelled).
Small problem: cancelling is disabled by default. Set the environment variable OMP_CANCELLATION to true.
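The steps above could look roughly like this. It is a sketch on a plain integer search rather than on the Cube class from the question: contains_value and its grain parameter are hypothetical, and the equality test stands in for the intersection test. If OMP_CANCELLATION is not set to true, the cancel directives have no effect and the loop simply runs to completion with the same result.

```cpp
#include <cassert>
#include <vector>

// Returns true if `target` occurs in `values`, stopping the remaining
// taskloop chunks early once a match is found (when cancellation is enabled).
bool contains_value(const std::vector<int>& values, int target, int grain) {
    assert(grain > 0);
    bool found = false;
    const int* data = values.data();
    const long n = static_cast<long>(values.size());
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskloop grainsize(grain) shared(found)
        for (long i = 0; i < n; ++i) {
            // chunks that are already running notice the cancellation here
            #pragma omp cancellation point taskgroup
            if (data[i] == target) {
                #pragma omp atomic write
                found = true;
                // discard the chunks of the implicit taskgroup that have not started
                #pragma omp cancel taskgroup
            }
        }
    }
    return found;
}
```

Note that there is no break in the loop body: a taskloop needs a canonical loop, so leaving early is expressed through cancellation only.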
Do you have more intersections being true or more being false? If most are true, you're flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid the shared resource altogether, so that there is no mutex: let all threads run and take the decision at the end, given their results. This will likely run faster, but the benefit also depends on a few parameters (e.g., nthreads, ncuboids).
It is possible that on another architecture (e.g., GPU) your algorithm works well as it is. It may be worth benchmarking it on a GPU to see whether you would benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).
You also have a complexity problem: for every new cuboid, you compare against up to the whole set of existing cuboids. One way to address this is to gather the extents (ranges) of all the cuboids per dimension and keep them ordered, inserting the ranges of each new cuboid in order. If there is an intersection in one dimension, you test the next one, and so on. You can also run these tests in parallel. Before running through the ranges, test whether you are inside the global range at all; if not, it is useless to test the local intersections.
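The per-dimension range test can be sketched as an axis-aligned bounding box check that runs before the expensive GJK test. This is an illustration only: Aabb, bounding_box and boxes_overlap are hypothetical names, and the 8x3 vertex layout is assumed to match bd.coord from the question.

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned bounding box: one [lo, hi] range per dimension.
struct Aabb {
    double lo[3];
    double hi[3];
};

// Compute the per-dimension ranges covered by the 8 vertices of a cuboid.
Aabb bounding_box(const double vertices[8][3]) {
    Aabb box;
    for (int d = 0; d < 3; ++d) {
        box.lo[d] = box.hi[d] = vertices[0][d];
        for (int v = 1; v < 8; ++v) {
            box.lo[d] = std::min(box.lo[d], vertices[v][d]);
            box.hi[d] = std::max(box.hi[d], vertices[v][d]);
        }
        assert(box.lo[d] <= box.hi[d]);
    }
    return box;
}

// Two boxes can only intersect if their ranges overlap in every dimension;
// a single separated dimension is enough to skip the exact test.
bool boxes_overlap(const Aabb& a, const Aabb& b) {
    for (int d = 0; d < 3; ++d) {
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) {
            return false;
        }
    }
    return true;
}
```

Only pairs for which boxes_overlap returns true would then be handed to check_gjk_intersection; keeping the boxes sorted along one dimension additionally lets you skip most pairs without testing them at all.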
Here, and in general, you want to parallelize with a minimum of dependencies (shared resources, mutexes), so you want to find a point of view where this is the case. Parallelizing over dimensions, over ordered ranges (segments), might be better than parallelizing over cuboids.
Algorithms and benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are not relevant, but that one may find a smarter approach given those values.
I think your code is memory bound, so its bottleneck is memory reads/writes, not calculations. This can be the main reason for the poor speed increase. As already mentioned by @Soleil, different hardware (a GPU) can be beneficial here.
You mentioned in the comments that Generator::_check_cube is called many times. To reduce OpenMP overheads, my suggestion is to move the parallel region out of this function; you can even put it in your main function:
int main(){
#pragma omp parallel
#pragma omp single nowait
{
//your code
}
}
In this case you have to use #pragma omp taskwait to wait for the tasks to complete.
for (int32_t i = 0; i < nthreads; i++)
{
#pragma omp task default(none) firstprivate(...) shared (..)
{
//your code comes here
}
}
#pragma omp taskwait
I also suggest using default(none) clause in #pragma omp task directive so you have to explicitly tell the sharing attribute of all your variables.
Do you really need the function get_mcoord? It seems a redundant memory copy to me. I think it may be better to write a check_gjk_intersection function which takes _cuboids or its indices as parameters. In this case you get rid of many memory allocations/deallocations of bd1 and bd2, which can also be time consuming, as @Victor pointed out.

Multithreaded Program for Sparse Matrices

I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix. In my code I call the vector-vector dot product and the matrix-vector product as subroutines many times to arrive at the final solution. I am trying to parallelise the code using OpenMP (especially the above two subroutines).
I also have sequential code in between which I do not intend to parallelise.
My question is how do I handle the threads created when the sub routine is called. Should I put a barrier at the end of every sub routine call.
Also where should I set the number of threads?
Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
rm[i] = b[i] - rm[i];
#pragma omp barrier
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
xm[i] = x0[i];
#pragma omp barrier
double* pm = (double*) malloc(numcols*sizeof(double));
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
pm[i] = rm[i];
#pragma omp barrier
scalarProd(rm,rm,numcols);
Thanks
EDIT:
for the scalar dotproduct, I am using the following piece of code:
double scalarProd(double* vec1, double* vec2, int n){
double prod = 0.0;
int chunk = 10;
int i;
//double* c = (double*) malloc(n*sizeof(double));
omp_set_num_threads(4);
// #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
#pragma omp parallel
{
double pprod = 0.0;
#pragma omp for
for(i=0;i<n;i++) {
pprod += vec1[i]*vec2[i];
}
//#pragma omp for reduction (+:prod)
#pragma omp critical
for(i=0;i<n;i++) {
prod += pprod;
}
}
return prod;
}
I have now added the time calculation code in my ConjugateGradient function as below:
start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);
Observed results: time taken for the dot product. Sequential version: 0.000007 s. Parallel version: 0.002110 s.
I am doing a simple compile using gcc -fopenmp command on Linux OS on my Intel I7 laptop.
I am currently using a matrix of size n = 5000.
I am getting a huge slowdown overall, since the same dot product gets called many times until convergence is achieved (around 80k times).
Please suggest some improvements. Any help is much appreciated!
Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallels you are using. Every time you try and split up the work among your threads, there is an OpenMP overhead. Try and avoid this whenever possible.
So in your case at the very least I would try:
Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region
// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
rm[i] = b[i] - rm[i]; // (1)
xm[i] = x0[i]; // (2) does not require (1)
pm[i] = rm[i]; // (3) requires (1) at this i, not (2)
}
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel
scalarProd(rm,rm,numcols);
Notice how I show that no barriers are actually necessary between your loops anyway.
If the majority of your time had been spent in this computation stage, you will surely be seeing considerable improvement. However, I'm reasonably confident that the majority of your time is being spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll be saving is probably minimal.
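Going one step further, the rm·rm dot product that follows can be folded into the same loop with a reduction, so that this whole stage costs a single parallel region. A sketch under the assumption that rm already holds the result of Mat_Vec_Mult on entry; init_residual is a hypothetical name and std::vector replaces the raw arrays of the question:

```cpp
#include <cassert>
#include <vector>

// Computes rm = b - rm, xm = x0, pm = rm and returns rm . rm,
// all in one pass over the data and one parallel region.
double init_residual(const std::vector<double>& b, const std::vector<double>& x0,
                     std::vector<double>& rm, std::vector<double>& xm,
                     std::vector<double>& pm) {
    assert(x0.size() == b.size() && rm.size() == b.size());
    assert(xm.size() == b.size() && pm.size() == b.size());
    const long n = static_cast<long>(b.size());
    double rm_rm = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:rm_rm)
    for (long i = 0; i < n; ++i) {
        rm[i] = b[i] - rm[i];      // residual
        xm[i] = x0[i];             // current iterate
        pm[i] = rm[i];             // first search direction
        rm_rm += rm[i] * rm[i];    // each thread sums privately, combined at the end
    }
    return rm_rm;
}
```

Whether this pays off still depends on n: for n = 5000 the work per call is tiny, so the cost of starting the parallel region may dominate either way.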
** EDIT **
And as per your edit, I am seeing a few problems. (1) Always compile with -O3 when you are testing the performance of your algorithm. (2) You won't be able to improve the runtime of something that takes 0.000007 s to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is inherently a sequential algorithm, but there are certainly research papers detailing parallel CG. (3) Your implementation of the scalar product is not optimal, and it is also incorrect: inside the critical section each thread adds its partial sum pprod to prod n times rather than once, so the result is n times too large. I suspect your implementation of the matrix-vector product is not optimal either. I would personally do the following:
double scalarProd(double* vec1, double* vec2, int n) {
double prod = 0.0;
int i;
// omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
#pragma omp parallel for private(i) reduction(+:prod)
for (i = 0; i < n; ++i) {
prod += vec1[i]*vec2[i];
}
return prod;
}
(4) There are entire libraries (LAPACK, BLAS, etc.) that have highly optimized matrix-vector, vector-vector, etc. operations. Most linear algebra libraries are built upon them. Therefore, I'd suggest looking at using one of those libraries for your two operations before you start reinventing the wheel here and trying to implement your own.

OpenMP share file handler

I've got a loop, which I parallelize using OpenMP. In this loop, I read a triangle from a file, and perform some operations on this data. These operations are independent from each triangle to another, so I thought this would be easy to parallelize, as long as I kept the actual reading of files in a critical section.
Order in which triangles are read is not important
Some triangles are read and get discarded pretty quickly, some need some more algorithmic work (bbox construction, ...)
I'm doing binary I/O
Using C++ ifstream tri_data
I'm testing this on an SSD
ReadTriangle calls file.read() and reads 12 floats from an ifstream.
#pragma omp parallel for shared (tri_data)
for(int i = 0; i < ntriangles ; i++) {
vec3 v0,v1,v2,normal;
#pragma omp critical
{
readTriangle(tri_data,v0,v1,v2,normal);
}
(working with the triangle here)
}
Now, the behaviour I'm observing is that with OpenMP enabled, the whole process is slower.
I've added some timers to my code to track time spent in the I/O method, and time spent in the loop itself.
Without OpenMP:
Total IO IN time : 41.836 s.
Total algorithm time : 15.495 s.
With OpenMP:
Total IO IN time : 48.959 s.
Total algorithm time : 44.61 s.
My guess is, since the reading is in a critical section, the threads are just waiting for each other to finish using the file handler, resulting in a longer waiting time.
Any pointers on how to resolve this? My program would really benefit from the possibility to process read triangles with multiple processes. I've tried toying with thread scheduling and related stuff, but that doesn't seem to help a lot in this instance.
Since I'm working on an out-of-core algorithm, introducing a buffer to hold a multitude of triangles is not really an option.
So, the solution I would propose is based on a master/slave strategy, where:
the master (thread 0) performs all the I/O
the slaves do some work on the retrieved data
The pseudo-code would read something like the following:
#include<omp.h>
vector<vec3> v0;
vector<vec3> v1;
vector<vec3> v2;
vector<vec3> normal;
vector<int> tdone;
int nthreads;
int triangles_read = 0;
/* ... */
#pragma omp parallel shared(tri_data)
{
int id = omp_get_thread_num();
/*
* Initialize all the buffers in the master thread.
* Notice that the size in memory is similar to your example.
*/
#pragma omp single
{
nthreads = omp_get_num_threads();
v0.resize(nthreads);
v1.resize(nthreads);
v2.resize(nthreads);
normal.resize(nthreads);
tdone.resize(nthreads,1);
}
if ( id == 0 ) { // Producer thread
int next = 1;
while( triangles_read != ntriangles ) {
#pragma omp flush // Re-read the flags written by the consumers
if ( tdone[next] ) { // If the next thread is free
readTriangle(tri_data,v0[next],v1[next],v2[next],normal[next]); // Read data and fill the correct buffer
triangles_read++;
tdone[next] = 0; // Set a flag for thread next to start working
#pragma omp flush // Flush it (a flush list cannot name array elements)
}
next = next%(nthreads - 1) + 1; // Set next
} // while
} else { // Consumer threads
while( true ) { // Wait for work
#pragma omp flush // Re-read the flags written by the producer
if( tdone[id] == 0) {
/* ... do work here on v0[id], v1[id], v2[id], normal[id] ... */
tdone[id] = 1; // Mark this thread as free again
#pragma omp flush // Flush it
}
if( tdone[id] == 1 && triangles_read == ntriangles) break; // Work finished for all
}
}
#pragma omp barrier
}
I am not sure if this is still valuable to you but that was a nice teaser anyhow!
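A variation on the same producer/consumer idea that avoids hand-written flags is to let one thread do all the reading and spawn an OpenMP task per triangle. This is a sketch, not the code above: Triangle, read_triangle, serialize and sum_over_triangles are hypothetical names, the "work" is just a sum of the 12 floats, and serialize exists only to build in-memory input for trying it out. Note that nothing here bounds the number of queued tasks, which matters for an out-of-core workload.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

struct Triangle {
    float v[12];  // v0, v1, v2 and the normal, 3 floats each
};

// Read one binary triangle; returns false at end of input or on a short read.
bool read_triangle(std::istream& in, Triangle& t) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(t.v), sizeof t.v));
}

// Helper for building test input only: the raw bytes of one triangle.
std::string serialize(const Triangle& t) {
    return std::string(reinterpret_cast<const char*>(t.v), sizeof t.v);
}

// One thread reads sequentially; every triangle becomes a task that any
// thread of the team may execute. Returns the sum of all floats read.
double sum_over_triangles(std::istream& in) {
    double total = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        Triangle t;
        while (read_triangle(in, t)) {
            assert(in.good());
            #pragma omp task firstprivate(t) shared(total)
            {
                double work = 0.0;  // stand-in for the real per-triangle work
                for (int k = 0; k < 12; ++k) {
                    work += t.v[k];
                }
                #pragma omp atomic
                total += work;
            }
        }
    }  // all tasks have finished once the parallel region ends
    return total;
}
```

Because each task gets its own copy of the triangle through firstprivate, no critical section around the file is needed: only the single thread ever touches the stream.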

How can I parallelize a for using boost?

To optimize the execution of some libraries I am making, I have to parallelize some calculations.
Unfortunately, I can not use openmp for that, so I am trying to do some similar alternative using boost::thread.
Does anyone know of an implementation like this?
I have particular problems with sharing variables between threads (defining variables as 'shared' and 'private', as in OpenMP). Any suggestions?
As far as I know you'll have to do that explicitly with anything other than OpenMP.
As an example if we have a parallelized loop in OpenMP
int i;
size_t length = 10000;
int* someArray = new int[length];
#pragma omp parallel private(i)
{
#pragma omp for schedule(dynamic, 8)
for (i = 0; i < length; ++i) {
someArray[i] = i*i;
}
}
You'll have to factor out the logic into a "generic" loop that can work on a sub-range of your problem, and then explicitly schedule the threads. Each thread will then work on a chunk of the whole problem. In that way you explicitly declare the "private" variables- the ones that go into the subProblem function.
void subProblem(int* someArray, size_t startIndex, size_t subLength) {
size_t end = startIndex+subLength;
for (size_t i = startIndex; i < end; ++i) {
someArray[i] = i*i;
}
}
void algorithm() {
size_t length = 10000;
int* someArray = new int[length];
int numThreads = 4; // how to subdivide (assumes length is divisible by numThreads)
// a vector of all threads working on the problem
std::vector<boost::thread> threadVector;
for(int thread = 0; thread < numThreads; ++thread) {
// size of subproblem
size_t subLength = length / numThreads;
size_t startIndex = subLength*thread;
// use move semantics to create a thread in the vector
// requires c++11. If you can't use c++11,
// perhaps look at boost::move?
threadVector.emplace_back(boost::bind(subProblem, someArray, startIndex, subLength));
}
// threads are now working on subproblems
// now go through the thread vector and join with the threads
for (size_t t = 0; t < threadVector.size(); ++t) {
threadVector[t].join();
}
delete[] someArray;
}
The above is one of many scheduling algorithms- it just cuts the problem into as many chunks as you have threads.
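When the length is not a multiple of the thread count, the split has to hand the leftover elements to someone. One way to do it, as a sketch (chunkRange is a hypothetical helper, not part of boost), is to give the first few chunks one extra element each:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open index range [first, second) that `thread` should process when
// `length` elements are divided among `numThreads` threads as evenly as possible.
std::pair<std::size_t, std::size_t> chunkRange(std::size_t length,
                                               std::size_t numThreads,
                                               std::size_t thread) {
    assert(numThreads > 0 && thread < numThreads);
    const std::size_t base = length / numThreads;
    const std::size_t extra = length % numThreads;  // the first `extra` chunks get one more
    const std::size_t start = thread * base + std::min(thread, extra);
    const std::size_t size = base + (thread < extra ? 1 : 0);
    return std::make_pair(start, start + size);
}
```

Each thread would then be started with subProblem(someArray, range.first, range.second - range.first), so no element is skipped and no two threads overlap.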
The OpenMP way is more complicated: with schedule(dynamic, 8) it cuts the problem into many small chunks (of 8 iterations in my example) and hands these chunks out on demand to the threads in a thread pool, as each thread finishes its previous chunk. The difficulty of implementing the OpenMP way is that you need "persistent" threads that wait for work (a thread pool). Hope this makes sense.
An even simpler way would be to do async on every iteration (scheduling a piece of work for each iteration). This can work, if the each iteration is very expensive and takes a long time. However, if it's small pieces of work with MANY iterations, most of the overhead will go into the scheduling and thread creation, rendering the parallelization useless.
In conclusion, depending on your problem, there are many ways to schedule the work; it's up to you to find out what works best for your problem.
TL;DR:
Try Intel Threading Building Blocks (or Microsoft PPL) which schedule for you, provided you give the "sub-range" function:
http://cache-www.intel.com/cd/00/00/30/11/301132_301132.pdf#page=14