I'm trying to write a large 4-dimensional HDF5 file where the values are computed in parallel and written under mutual exclusion.
This is a minimal working example of my problem:
#include <fstream>
#include <iostream>
#include <algorithm>
#include <vector>
#include <omp.h>
#include "H5Cpp.h"

using namespace H5;

int main(int argc, char *argv[])
{
    H5File file("out.h5", H5F_ACC_TRUNC);
    const uint F_RANK = 4;
    const uint M_RANK = 3;
    const uint D = 8;
    const uint MD = 2;
    std::vector<hsize_t> fdim = {D, D, 6, 1024};
    std::vector<hsize_t> count = {1, 1, 6, 1024};
    std::vector<hsize_t> mdim = {MD, 6, 1024};
    float fillvalue = -1.0;
    DSetCreatPropList plist;
    plist.setFillValue(PredType::NATIVE_FLOAT, &fillvalue);
    DataSpace fspace(F_RANK, fdim.data());
    DataSet dataset = file.createDataSet("data", PredType::NATIVE_FLOAT, fspace, plist);

    #pragma omp parallel
    {
        uint id = omp_get_thread_num();
        std::vector<hsize_t> start(F_RANK, 0);
        DataSpace private_fspace = dataset.getSpace();
        DataSpace private_mspace(M_RANK, mdim.data());
        private_mspace.selectAll();
        std::cout << "THREAD " << id << " STARTS" << std::endl;

        #pragma omp for ordered schedule(static)
        for (uint b = 0; b < D*D / MD; b++)
        {
            // Assume this is an expensive computation
            std::vector<float> data(MD*6*1024, b);

            #pragma omp critical
            {
                private_fspace.selectNone();
                std::cout << "THREAD " << id << " SELECTED NONE" << std::endl;
                // Store batch
                for (uint p = 0; p < MD; p++)
                {
                    start[0] = b / (D*D) + p;
                    start[1] = p;
                    private_fspace.selectHyperslab(H5S_SELECT_OR, count.data(), start.data());
                    std::cout << "THREAD " << id << " SELECTED " << start[0] << ' ' << start[1] << std::endl;
                }
                std::cout << "ABOUT TO WRITE " << private_fspace.getSelectNpoints() << std::endl;
                dataset.write(data.data(), PredType::NATIVE_FLOAT, private_mspace, private_fspace);
                private_fspace.selectNone();
            }
            std::cout << "WRITTEN" << std::endl;
        }
    }
}
I'm compiling it using h5c++ example.cpp -fopenmp.
When run, it can fail in many different ways: sometimes with a segmentation fault, sometimes with an abort, and sometimes even with a bus error.
There are even HDF5 library errors like these:
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,E,SL,FL,FL,.....
Sometimes it fails before writing anything, sometimes it finishes without any errors. Of course, it only fails when OMP_NUM_THREADS is set to more than 1, so the problem lies in the parallelism. However, I've made every variable I could private to each thread, and everything happens within a critical section, so I don't see how this can fail.
Debugging with GDB shows that HDF5 can segfault inside DataSet::getSpace(), but there are so many ways it can fail that I'd say the problem is more general than that.
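One thing worth noting about the example: if the installed HDF5 library is the default, non-thread-safe build (an assumption, but it is the usual case), then every HDF5 call has to be serialized, not just the write. In the code above, dataset.getSpace(), the DataSpace constructors and selectAll() run concurrently in all threads before the loop, and the DataSpace destructors run concurrently at the end of the parallel region. A minimal sketch of one way to restructure the loop under that assumption, keeping only the computation parallel:

    // Sketch only: assumes a default (non-thread-safe) HDF5 build, so every
    // HDF5 call -- construction, selection, write, destruction -- happens
    // while holding the critical-section lock.
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (uint b = 0; b < D*D / MD; b++)
        {
            // The expensive computation stays parallel
            std::vector<float> data(MD*6*1024, b);

            #pragma omp critical
            {
                DataSpace private_fspace = dataset.getSpace();
                DataSpace private_mspace(M_RANK, mdim.data());
                private_fspace.selectNone();

                std::vector<hsize_t> start(F_RANK, 0);
                for (uint p = 0; p < MD; p++)
                {
                    start[0] = b / (D*D) + p;   // same indexing as in the question
                    start[1] = p;
                    private_fspace.selectHyperslab(H5S_SELECT_OR, count.data(), start.data());
                }
                dataset.write(data.data(), PredType::NATIVE_FLOAT, private_mspace, private_fspace);
            }   // DataSpace destructors (also HDF5 calls) run before the lock is released
        }
    }

As far as I know, the thread-safe HDF5 build (--enable-threadsafe) only guarantees the C API and has to be combined with --enable-unsupported to build the C++ bindings at all, so serializing all HDF5 calls by hand is the safer assumption here.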
I'm trying to test the speedup of OpenMP on an array-sum program.
The elements are generated with a random number generator to prevent the compiler from optimizing the sum away.
The array length is also set large enough to show the performance difference.
The program is built with g++ -fopenmp -g -O0 -o main main.cpp; -g -O0 are used to avoid optimization.
However, the OpenMP parallel for code is significantly slower than the sequential code.
Test result:
Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>

using namespace std;

long long llsum(char *vec, size_t size, int threadCount) {
    long long result = 0;
    size_t i;
    #pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
    for (i = 0; i < size; ++i) {
        result += vec[i];
    }
    return result;
}

int main(int argc, char **argv) {
    int threadCount = 12;
    omp_set_num_threads(threadCount);
    cout << "Your thread count is: " << threadCount << endl;

    const size_t TEST_SIZE = 8000000000;
    char *testArray = new char[TEST_SIZE];

    std::mt19937 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);

    cout << "Filling arrays\n";
    auto fillingStartTime = clock();
    for (int i = 0; i < TEST_SIZE; ++i) {
        testArray[i] = dist6(rng);
    }
    auto fillingEndTime = clock();
    auto fillingTime = fillingEndTime - fillingStartTime;
    cout << "filling time:" << fillingTime << endl;

    // test omp time
    for (int i = 1; i <= 3; ++i) {
        cout << "Now running omp code\n";
        auto ompStartTime = clock();
        auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
        auto ompEndTime = clock();
        auto ompTime = ompEndTime - ompStartTime;
        cout << i * 2 << "thread omp time:" << ompTime << endl << "result: " << ompResult << endl;
    }

    // test sequential addition time
    cout << "Now running sequential code\n";
    auto seqStartTime = clock();
    long long expectedResult = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedResult += testArray[i];
    }
    auto seqEndTime = clock();
    auto seqTime = seqEndTime - seqStartTime;
    cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;

    delete[] testArray;
    return 0;
}
As pointed out by @High Performance Mark, I should use omp_get_wtime() instead of clock().
clock() gives 'active processor time' (CPU time summed over all threads of the process), not elapsed wall-clock time.
See
OpenMP time and clock() give two different results
https://en.cppreference.com/w/c/chrono/clock
After switching to omp_get_wtime() and changing the int loop indices to size_t, the results are more meaningful:
Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788
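For reference, a minimal sketch of the two changes described above (only the measurement and the loop indices in main change; llsum already used a size_t index, and the sequential summing loop needs the same size_t fix):

double fillingStartTime = omp_get_wtime();
for (size_t i = 0; i < TEST_SIZE; ++i) {   // size_t: TEST_SIZE does not fit in an int
    testArray[i] = dist6(rng);
}
cout << "filling time:" << omp_get_wtime() - fillingStartTime << endl;

for (int i = 1; i <= 3; ++i) {
    cout << "Now running omp code\n";
    double ompStartTime = omp_get_wtime();               // wall-clock seconds
    auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
    cout << i * 2 << "thread omp time:" << omp_get_wtime() - ompStartTime
         << endl << "result: " << ompResult << endl;
}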
I am trying to compute a 2D FFT on 100 million complex data points (100000x1000), which takes approximately 4.6 seconds, and I want to reduce that time. So I tried computing it with FFTW threads, but the computation time increased instead (8.5 s with 2 threads and 16.5 s with 4 threads).
I am using the FFTW3 library with C++ on Ubuntu 18.04.
The C++ code is attached below:
#include <iostream>
#include <cstdint>   // uint32_t
#include <cstdlib>   // calloc
#include <time.h>
#include <fftw3.h>

using namespace std;

#define ROW 100000
#define COL 1000

int main() {
    fftwf_complex *in  = (fftwf_complex *)calloc(ROW*COL, sizeof(fftwf_complex));
    fftwf_complex *out = (fftwf_complex *)calloc(ROW*COL, sizeof(fftwf_complex));

    // generating random data
    for (uint32_t i = 0; i < ROW*COL; i++) {
        in[i][0] = i+1;
        in[i][1] = i+2;
    }

    int thread_number = 2;
    fftwf_plan_with_nthreads(thread_number);
    int h = fftwf_init_threads();

    fftwf_plan p = fftwf_plan_dft_2d(ROW, COL, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(p);

    fftwf_destroy_plan(p);
    fftwf_cleanup_threads();
}
I am getting no errors; I just want to reduce the execution time. Can anyone help me reduce the time to compute the 2D FFT on 100 million data points?
How did you measure the execution time? Note that the actual FFT is done by fftwf_execute; the rest is initialization and cleanup. See the code below (if you are not on Linux, modify time_in_secs to fit your system). On my computer the code below takes around 10 seconds with one thread,
6 seconds with two threads and around 3.6 seconds with four threads. That's for the FFT part (t3-t2).
#include <iostream>
#include <cstdint>   // uint32_t
#include <cstdlib>   // calloc
#include <time.h>
#include <fftw3.h>

#define ROW 100000
#define COL 1000

double time_in_secs()
{
    struct timespec t;
    clock_gettime( CLOCK_MONOTONIC /* CLOCK_PROCESS_CPUTIME_ID */, &t );
    return (double)t.tv_sec + 1.0E-09 * (double)t.tv_nsec;
}

int main() {
    fftwf_complex *in  = (fftwf_complex *)calloc(ROW*COL, sizeof(fftwf_complex));
    fftwf_complex *out = (fftwf_complex *)calloc(ROW*COL, sizeof(fftwf_complex));

    // generating random data
    for (uint32_t i = 0; i < ROW*COL; i++) {
        in[i][0] = i+1;
        in[i][1] = i+2;
    }

    int thread_number = 6;

    double t1 = time_in_secs();
    fftwf_plan_with_nthreads(thread_number);
    int h = fftwf_init_threads();
    fftwf_plan p = fftwf_plan_dft_2d(ROW, COL, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    double t2 = time_in_secs();

    fftwf_execute(p);
    double t3 = time_in_secs();

    fftwf_destroy_plan(p);
    fftwf_cleanup_threads();

    std::cout << "Time for init: " << t2-t1 << " sec\n";
    std::cout << "Time for FFT: " << t3-t2 << " sec\n";
    std::cout << "Total time: " << t3-t1 << " sec\n";
    std::cout << "# threads: " << thread_number << '\n';
}
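Note that the threaded routines usually need explicit linking; with the single-precision API used here that is typically something like g++ main.cpp -lfftw3f_threads -lfftw3f -lpthread -lm, though the exact libraries depend on how FFTW was built and installed.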
Speeding up the initialization can be done using wisdom, as shown below. On the first run of the program the wisdom file will not be found and computing the plan takes its time; on subsequent runs the wisdom is used to build the plan much faster. Notice that fftwf_init_threads must be called before the wisdom file gets read.
double t1 = time_in_secs();
fftwf_plan_with_nthreads(thread_number);
int h = fftwf_init_threads();

const char *wisdom_file = "fftw_wisdom.dat";
FILE *w_file = fopen( wisdom_file, "r" );
if( w_file )
{
    int ec = fftwf_import_wisdom_from_file( w_file );
    fclose( w_file );
    std::cout << "Read wisdom file " << ec << '\n';
}
else
{
    std::cout << "No wisdom file found\n";
}

fftwf_plan p = fftwf_plan_dft_2d(ROW, COL, in, out, FFTW_FORWARD, FFTW_MEASURE);

w_file = fopen( wisdom_file, "w" );
if( w_file )
{
    fftwf_export_wisdom_to_file( w_file );
    fclose( w_file );
    std::cout << "Wrote wisdom file\n";
}
double t2 = time_in_secs();
Compared to the initial example we have set the planner flag to FFTW_MEASURE. This makes the effect of wisdom storage more pronounced.
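As a side note, FFTW also provides filename-based wisdom helpers that shorten the bookkeeping above; a sketch using them (assuming FFTW 3.3 or newer and the same wisdom_file name as above):

if (fftwf_import_wisdom_from_filename(wisdom_file))
    std::cout << "Read wisdom file\n";
else
    std::cout << "No wisdom file found\n";

fftwf_plan p = fftwf_plan_dft_2d(ROW, COL, in, out, FFTW_FORWARD, FFTW_MEASURE);

if (!fftwf_export_wisdom_to_filename(wisdom_file))
    std::cout << "Could not write wisdom file\n";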
I'm messing around with multithreading in C++ and here is my code:
#include <iostream>
#include <vector>
#include <string>
#include <thread>

void read(int i);

bool isThreadEnabled;
std::thread threads[100];

int main()
{
    isThreadEnabled = true; // I change this to compare the threaded vs non threaded method
    if (isThreadEnabled)
    {
        for (int i = 0; i < 100; i++) // this for loop is what I'm confused about
        {
            threads[i] = std::thread(read, i);
        }
        for (int i = 0; i < 100; i++)
        {
            threads[i].join();
        }
    }
    else
    {
        for (int i = 0; i < 100; i++)
        {
            read(i);
        }
    }
}

void read(int i)
{
    int w = 0;
    while (true) // wasting cpu cycles to actually see the difference between the threaded and non threaded
    {
        ++w;
        if (w == 100000000) break;
    }
    std::cout << i << std::endl;
}
In the for loop that uses threads, the console prints values in a random order, e.g. 5, 40, 26, ..., which is expected and totally fine since the threads don't run in the order they were started.
What confuses me is that the printed values are sometimes larger than the maximum value i can reach (which is 100): values like 8000, 2032 and 274 also show up in the console even though i never reaches those numbers. I don't understand why.
This line:
std::cout << i << std::endl;
is actually equivalent to
std::cout << i;
std::cout << std::endl;
And thus while thread safe (meaning there's no undefined behaviour), the order of execution is undefined. Given two threads the following execution is possible:
T20: std::cout << 20
T32: std::cout << 32
T20: std::cout << std::endl
T32: std::cout << std::endl
which results in 2032 in the console (the two numbers glued together) and an empty line.
The simplest (not necessarily the best) fix for that is to wrap this line with a shared mutex:
{
    std::lock_guard lg { mutex };
    std::cout << i << std::endl;
}
(the brackets for a separate scope are not needed if the std::cout << i << std::endl; is the last line in the function)
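For completeness, a minimal sketch of that fix applied to the read function from the question (the mutex name is made up; any std::mutex shared by the threads will do):

#include <iostream>
#include <mutex>

std::mutex cout_mutex;                   // shared by all threads

void read(int i)
{
    int w = 0;
    while (true)                         // same busy loop as in the question
    {
        ++w;
        if (w == 100000000) break;
    }
    std::lock_guard lg{ cout_mutex };    // C++17 deduction; otherwise std::lock_guard<std::mutex>
    std::cout << i << std::endl;         // only one thread prints at a time
}

In C++20 there is also std::osyncstream (from <syncstream>), which buffers each thread's output and flushes it atomically without an explicit mutex.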
I'm attempting to write a parallel vector fill, using the following code:
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <algorithm>

using namespace std;
using namespace std::chrono;

void fill_part(vector<double> &v, int ii, int num_threads)
{
    fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0);
}

int main()
{
    vector<double> v(200*1000*1000);

    high_resolution_clock::time_point t = high_resolution_clock::now();
    fill(v.begin(), v.end(), 0);
    duration<double> d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in serial.\n";

    unsigned num_threads = thread::hardware_concurrency() ? thread::hardware_concurrency() : 1;
    cout << "Num threads: " << num_threads << '\n';

    vector<thread> threads;
    t = high_resolution_clock::now();
    for (int ii = 0; ii < num_threads; ++ii)
    {
        threads.emplace_back(fill_part, std::ref(v), ii, num_threads);
    }
    for (auto &t : threads)
    {
        if (t.joinable()) t.join();
    }
    d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in parallel.\n";
}
I tried this code on four different architectures (all Intel CPUs, but that shouldn't matter).
The first had 4 CPUs and the parallelization gave no speedup; the second had 4 and was 4 times as fast; the third had 4 and was twice as fast; and the last had 2 and gave no speedup.
My hypothesis is that the differences arise from whether a single CPU can saturate the RAM bus, but is this correct? How can I predict which architectures will benefit from this parallelization?
Bonus question: The void fill_part function is awkward, so I wanted to do it with a lambda:
threads.emplace_back([&]{fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0); });
This compiles but terminates with a bus error; what's wrong with the lambda syntax?
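A likely explanation for the bus error (a guess based only on the code shown): with [&], the lambda captures ii and num_threads by reference, so each thread reads ii through a reference to a loop variable that keeps changing and is destroyed when the loop ends, which is undefined behaviour. A sketch of a capture that avoids this, taking ii and num_threads by value and only v by reference:

threads.emplace_back([&v, ii, num_threads] {
    fill(v.begin() + ii * v.size() / num_threads,
         v.begin() + (ii + 1) * v.size() / num_threads, 0);
});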
I have a program that currently generates large arrays and matrices that can be upwards of 10GB in size. The program uses MPI to parallelize workloads, but is limited by the fact that each process needs its own copy of the array or matrix in order to perform its portion of the computation. The memory requirements make this problem unfeasible with a large number of MPI processes and so I have been looking into Boost::Interprocess as a means of sharing data between MPI processes.
So far, I have come up with the following which creates a large vector and parallelizes the summation of its elements:
#include <cstdlib>
#include <ctime>
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <unistd.h>   // sleep

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/tuple/tuple_comparison.hpp>

#include <mpi.h>

typedef boost::interprocess::allocator<double, boost::interprocess::managed_shared_memory::segment_manager> ShmemAllocator;
typedef boost::interprocess::vector<double, ShmemAllocator> MyVector;

const std::size_t vector_size = 1000000000;
const std::string shared_memory_name = "vector_shared_test.cpp";

int main(int argc, char **argv) {
    int numprocs, rank;

    MPI::Init();
    numprocs = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();

    if(numprocs >= 2) {
        if(rank == 0) {
            std::cout << "On process rank " << rank << "." << std::endl;
            std::time_t creation_start = std::time(NULL);

            boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
            boost::interprocess::managed_shared_memory segment(boost::interprocess::create_only, shared_memory_name.c_str(), size_t(12000000000));

            std::cout << "Size of double: " << sizeof(double) << std::endl;
            std::cout << "Allocated shared memory: " << segment.get_size() << std::endl;

            const ShmemAllocator alloc_inst(segment.get_segment_manager());
            MyVector *myvector = segment.construct<MyVector>("MyVector")(alloc_inst);
            std::cout << "myvector max size: " << myvector->max_size() << std::endl;

            for(int i = 0; i < vector_size; i++) {
                myvector->push_back(double(i));
            }

            std::cout << "Vector capacity: " << myvector->capacity() << " | Memory Free: " << segment.get_free_memory() << std::endl;
            std::cout << "Vector creation successful and took " << std::difftime(std::time(NULL), creation_start) << " seconds." << std::endl;
        }

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        std::time_t summing_start = std::time(NULL);
        std::cout << "On process rank " << rank << "." << std::endl;

        boost::interprocess::managed_shared_memory segment(boost::interprocess::open_only, shared_memory_name.c_str());
        MyVector *myvector = segment.find<MyVector>("MyVector").first;

        double result = 0;
        for(int i = rank; i < myvector->size(); i = i + numprocs) {
            result = result + (*myvector)[i];
        }

        double total = 0;
        MPI::COMM_WORLD.Reduce(&result, &total, 1, MPI::DOUBLE, MPI::SUM, 0);

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        if(rank == 0) {
            std::cout << "On process rank " << rank << "." << std::endl;
            std::cout << "Vector summing successful and took " << std::difftime(std::time(NULL), summing_start) << " seconds." << std::endl;
            std::cout << "The arithmetic sum of the elements in the vector is " << total << std::endl;
            segment.destroy<MyVector>("MyVector");
        }

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();
        boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
    }

    sleep(300);

    MPI::Finalize();
    return 0;
}
I noticed that this causes the entire shared object to be mapped into each process's virtual memory space, which is an issue on our computing cluster since it limits virtual memory to the same size as physical memory. Is there a way to share this data structure without having to map the entire shared memory space, perhaps by sharing some kind of pointer? Would trying to access unmapped shared memory even be defined behavior? Unfortunately, the operations we perform on the array mean that each process eventually needs to access every element in it (although not concurrently; I suppose it's possible to break the shared array into pieces and trade portions of the array for the ones you need, but this is not ideal).
Since the data you want to share is so large, it may be more practical to treat the data as a true file and use file operations to read the data you need. Then you do not need shared memory at all; just let each process read directly from the file system.
ifstream file ("data.dat", ios::in | ios::binary);
file.seekg(someOffset, ios::beg);
file.read(array, sizeof(array));
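To make that concrete, here is a rough sketch of how each MPI rank could read only its own contiguous slice of doubles from such a file and reduce the partial sums. The file name, element count and blocked decomposition are assumptions for illustration, and the plain C MPI API is used instead of the C++ bindings from the question.

#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, numprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    const std::size_t total_elements = 1000000000;   // assumed to match the file contents
    const std::size_t chunk = total_elements / numprocs;
    const std::size_t begin = rank * chunk;
    const std::size_t count = (rank == numprocs - 1) ? total_elements - begin : chunk;

    // Each rank seeks to its own slice and reads only count doubles.
    std::vector<double> slice(count);
    std::ifstream file("data.dat", std::ios::in | std::ios::binary);
    file.seekg(static_cast<std::streamoff>(begin * sizeof(double)), std::ios::beg);
    file.read(reinterpret_cast<char *>(slice.data()),
              static_cast<std::streamsize>(count * sizeof(double)));

    double local_sum = 0;
    for (double x : slice) local_sum += x;

    double total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::cout << "sum = " << total << std::endl;

    MPI_Finalize();
    return 0;
}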