I have a program that generates large arrays and matrices, upwards of 10 GB in size. The program uses MPI to parallelize workloads, but is limited by the fact that each process needs its own copy of the array or matrix to perform its portion of the computation. The memory requirements make this problem infeasible with a large number of MPI processes, so I have been looking into Boost::Interprocess as a means of sharing data between MPI processes.
So far, I have come up with the following which creates a large vector and parallelizes the summation of its elements:
#include <cstdlib>
#include <ctime>
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/tuple/tuple_comparison.hpp>
#include <mpi.h>
#include <unistd.h> // for sleep()
typedef boost::interprocess::allocator<double, boost::interprocess::managed_shared_memory::segment_manager> ShmemAllocator;
typedef boost::interprocess::vector<double, ShmemAllocator> MyVector;
const std::size_t vector_size = 1000000000;
const std::string shared_memory_name = "vector_shared_test.cpp";
int main(int argc, char **argv) {
int numprocs, rank;
MPI::Init();
numprocs = MPI::COMM_WORLD.Get_size();
rank = MPI::COMM_WORLD.Get_rank();
if(numprocs >= 2) {
if(rank == 0) {
std::cout << "On process rank " << rank << "." << std::endl;
std::time_t creation_start = std::time(NULL);
boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
boost::interprocess::managed_shared_memory segment(boost::interprocess::create_only, shared_memory_name.c_str(), size_t(12000000000));
std::cout << "Size of double: " << sizeof(double) << std::endl;
std::cout << "Allocated shared memory: " << segment.get_size() << std::endl;
const ShmemAllocator alloc_inst(segment.get_segment_manager());
MyVector *myvector = segment.construct<MyVector>("MyVector")(alloc_inst);
std::cout << "myvector max size: " << myvector->max_size() << std::endl;
for(std::size_t i = 0; i < vector_size; i++) {
myvector->push_back(double(i));
}
std::cout << "Vector capacity: " << myvector->capacity() << " | Memory Free: " << segment.get_free_memory() << std::endl;
std::cout << "Vector creation successful and took " << std::difftime(std::time(NULL), creation_start) << " seconds." << std::endl;
}
std::flush(std::cout);
MPI::COMM_WORLD.Barrier();
std::time_t summing_start = std::time(NULL);
std::cout << "On process rank " << rank << "." << std::endl;
boost::interprocess::managed_shared_memory segment(boost::interprocess::open_only, shared_memory_name.c_str());
MyVector *myvector = segment.find<MyVector>("MyVector").first;
double result = 0;
for(std::size_t i = rank; i < myvector->size(); i += numprocs) {
result = result + (*myvector)[i];
}
double total = 0;
MPI::COMM_WORLD.Reduce(&result, &total, 1, MPI::DOUBLE, MPI::SUM, 0);
std::flush(std::cout);
MPI::COMM_WORLD.Barrier();
if(rank == 0) {
std::cout << "On process rank " << rank << "." << std::endl;
std::cout << "Vector summing successful and took " << std::difftime(std::time(NULL), summing_start) << " seconds." << std::endl;
std::cout << "The arithmetic sum of the elements in the vector is " << total << std::endl;
segment.destroy<MyVector>("MyVector");
}
std::flush(std::cout);
MPI::COMM_WORLD.Barrier();
boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
}
sleep(300);
MPI::Finalize();
return 0;
}
I noticed that this causes the entire shared object to be mapped into each process's virtual memory space - which is an issue on our computing cluster, as it limits virtual memory to be the same as physical memory. Is there a way to share this data structure without having to map the entire shared memory space - perhaps by sharing a pointer of some kind? Would trying to access unmapped shared memory even be defined behavior? Unfortunately, the operations we are performing on the array mean that each process eventually needs to access every element in it (although not concurrently - I suppose it's possible to break the shared array into pieces and trade portions of the array for those you need, but this is not ideal).
Since the data you want to share is so large, it may be more practical to treat it as a real file and use file operations to read just the data you need. Then you do not need shared memory at all: let each process read directly from the file system.
std::ifstream file("data.dat", std::ios::in | std::ios::binary);
file.seekg(someOffset, std::ios::beg);
file.read(reinterpret_cast<char *>(array), sizeof(array));
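Putting the pieces together, here is a minimal sketch of that approach under stated assumptions: the array has already been written to a raw binary file of doubles (the name data.dat and the element count below are placeholders), the element count divides evenly among the ranks, and the C MPI bindings are used (the C++ bindings in your code were deprecated in MPI-2.2 and later removed). Each rank reads only its own contiguous block, so nothing needs to be mapped into every address space:
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, numprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    const std::size_t total_elems = 1000000000;       // assumed element count of the file
    const std::size_t chunk = total_elems / numprocs; // assumes an even split

    // Each rank seeks to its own offset and reads only its block.
    std::vector<double> block(chunk);
    std::ifstream file("data.dat", std::ios::in | std::ios::binary);
    std::streamoff offset = static_cast<std::streamoff>(rank) *
                            static_cast<std::streamoff>(chunk * sizeof(double));
    file.seekg(offset, std::ios::beg);
    file.read(reinterpret_cast<char *>(block.data()),
              static_cast<std::streamsize>(chunk * sizeof(double)));

    double partial = 0;
    for (std::size_t i = 0; i < chunk; ++i) partial += block[i];

    // Combine the per-rank partial sums on rank 0.
    double sum = 0;
    MPI_Reduce(&partial, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::cout << "sum = " << sum << std::endl;

    MPI_Finalize();
    return 0;
}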
I am new to OpenCL and I have a problem displaying CL_DEVICE_MAX_WORK_ITEM_SIZES as a whole number/value; instead I get a memory address.
Initially, I tried declaring string and integer output variables to display the maximum work item size, but eventually I found out that the query returns size_t values. So I created a vector to store them, but it still outputs a memory address.
Also, my output shows the device numbers in reverse order (Device #1 before Device #0 - the number is used to select a device in a later part of my program).
Any help would be appreciated. Thank you.
#include <iostream>
#include <string>
#include <vector>
#include <CL/cl.hpp> // OpenCL C++ wrapper
int main()
{
std::vector<cl::Platform> platforms; // available platforms
std::vector<cl::Device> devices; // devices available to a platform
std::string outputString; // string for output
std::vector<::size_t> maxWorkItems[3];
unsigned int i, j; // counters
std::string choice; // user input choice
cl::Platform::get(&platforms);
std::cout << "Do you want to use a CPU or GPU device: ";
std::cin >> choice;
if (choice == "CPU" || choice == "cpu")
{
// for each platform
for (i = 0; i < platforms.size(); i++)
{
// get all CPU devices available to the platform
platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices);
for (j = 0; j < devices.size(); j++)
{
cl_device_type type;
devices[j].getInfo(CL_DEVICE_TYPE, &type);
if (type == CL_DEVICE_TYPE_CPU) {
std::cout << "\tDevice #" << j << std::endl;
//outputs the device type
std::cout << "\tType: " << "CPU" << std::endl;
// get and output device name
outputString = devices[j].getInfo<CL_DEVICE_NAME>();
std::cout << "\tName: " << outputString << std::endl;
// get and output device vendor
outputString = devices[j].getInfo<CL_DEVICE_VENDOR>();
std::cout << "\tVendor: " << outputString << std::endl;
//get and output compute units
std::cout << "\tNumber of compute units: " << devices[j].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() << std::endl;
//get and output workgroup size
std::cout << "\tMaximum work group size: " << devices[j].getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>() << std::endl;
//get and output workitem size
maxWorkItems[0] = devices[j].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
std::cout << "\tMaximum work item size: " << maxWorkItems << std::endl;
//get and output local memory size
std::cout << "\tLocal memory size: " << devices[j].getInfo<CL_DEVICE_LOCAL_MEM_SIZE>() << std::endl;
std::cout << std::endl;
}
}
}
}
return 0;
}
Below is the undesired output of my code:
The max work item size is in hexadecimal format, and the device numbers are in reverse order.
The CL_DEVICE_MAX_WORK_ITEM_SIZES property is of array type, specifically size_t[]. You shouldn't be expecting a scalar value, but an array of CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS elements. With the OpenCL C++ wrapper, you're on the right track with the vector, but you've actually declared an array of 3 vectors:
std::vector<::size_t> maxWorkItems[3];
You in fact just want the one vector that will hold all the returned values:
std::vector<::size_t> maxWorkItems;
The property query becomes:
maxWorkItems = devices[j].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
Then you should be able to query the max work items in each dimension using maxWorkItems[0], maxWorkItems[1], etc.
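For example, a minimal sketch of the fixed query and printout, assuming the same devices[j] loop as in your code:
std::vector<::size_t> maxWorkItems = devices[j].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
for (::size_t dim = 0; dim < maxWorkItems.size(); ++dim) {
    std::cout << "\tMax work items in dimension " << dim << ": " << maxWorkItems[dim] << std::endl;
}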
I'm writing a program where I need to write a large string into memory.
I used a stringstream object to do so, but something odd happens: even though the size of the underlying string buffer has not exceeded the maximum size of a std::string, my program crashes with a BAD_ACCESS error.
I've created a test program like this:
#include <sstream> // std::stringstream
#include <iostream> // std::cout
int main(int argc, const char * argv[]) {
std::stringstream stream;
std::string string;
std::cout << "Max string size: " << string.max_size() << "\n";
for (int i = 0; true; i++) {
if (i >= 644245094) {
stream.seekg(0, std::ios::end);
std::stringstream::pos_type size = stream.tellg();
stream.seekg(0, std::ios::beg);
std::cout << "Size of stringstream: " << size << "\n";
}
stream << "hello";
}
return 0;
}
That if (i >= 644245094) inside the loop is only used to print the size of the stringstream buffer just before the program crashes. I used my debugger to find the number of the last iteration, then used it to print the size of the buffer just before the crash happens.
This is the output I get:
Max string size: 18446744073709551599
Size of stringstream: 3221225470
After this the program crashes.
I thought the cause might be that the program fills up my computer's RAM, but this program uses only ~6.01 GB, which is not enough to exhaust it. For the record, I have a 16 GB MacBook Pro.
What could be the problem? Am I missing something about how the << operator works?
Thank you in advance!
The behaviour of a std::stringstream when it gets full and fails may not be consistent across all platforms.
I modified your code and ran it on Yocto 3.19.0-32 64-bit, with gcc 5.4.1. I did not get an exception thrown; rather, the stream set one of its failure-mode bits.
The code I ran was:
#include <sstream> // std::stringstream
#include <iostream> // std::cout
std::stringstream::pos_type get_size(std::stringstream& stream)
{
stream.seekg(0, std::ios::end);
std::stringstream::pos_type size = stream.tellg();
stream.seekg(0, std::ios::beg);
return size;
}
int main(int argc, const char * argv[])
{
std::stringstream stream;
std::string string;
std::cout << "Max string size: " << string.max_size() << std::endl;
std::stringstream::pos_type size;
for (unsigned long i = 0; true; ++i)
{
size = get_size(stream);
stream.write("x", 1);
if (stream.fail())
{
std::cout << "Fail after " << i + 1 << " insertions" << std::endl;
std::cout << "Size of stringstream just before fail: " << size << std::endl;
break;
}
}
size = get_size(stream);
std::cout << "Size of stringstream just after fail: " << size << std::endl;
return 0;
}
And I got the following output, which shows that my stringstream filled and failed 56 bytes short of 8GB:
Max string size: 4611686018427387897
Fail after 8589934536 insertions
Size of stringstream just before fail: 8589934535
Size of stringstream just after fail: -1
Can you not use a different container and pre-allocate the memory, instead of using such a large stringstream?
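For example, if you can bound the final size, a plain std::string with reserve() makes one up-front allocation instead of growing repeatedly - a sketch, where expected_size is a hypothetical bound you would supply:
std::string buffer;
buffer.reserve(expected_size); // single up-front allocation (expected_size is hypothetical)
buffer.append("hello");        // subsequent appends reuse the reserved capacity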
I am learning how vectors work in C++, and wrote a sample program to try to learn how memory in vectors is handled.
#include <iostream>
#include <vector>
int main()
{
//Test 1:
double n = 3.5;
std::vector<double> test;
std::cout << sizeof(test) << std::endl;
test.push_back(n);
std::cout << sizeof(test) << std::endl;
std::cout << std::endl;
std::cout << std::endl;
std::cout << std::endl;
//Test 2
std::vector<int> test2;
std::cout << sizeof(test2) << std::endl;
for (int i = 0; i < 1000; i++) {
test2.push_back(i);
}
std::cout << sizeof(test2) << std::endl;
}
Interestingly, the program prints out 24 as the number of bytes each time, despite new elements being added to the vector. How is the amount of memory that the vector occupies when it is initially declared the same as after I have added elements to it?
Internally, the vector object has a pointer to dynamically-allocated memory that contains the elements. When you use sizeof(test) you're just getting the size of the structure that contains the pointer, the size of the memory that it points to is not included.
This memory has to be dynamically-allocated so that the vector can grow and shrink as needed. It's not possible for a class object to change its size.
To get the amount of memory being used by the data storage, use sizeof(double) * test.capacity().
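For example, a rough sketch you could add after the push_back loop in your Test 1 (this counts the handle plus its heap storage, not the allocator's bookkeeping overhead):
std::cout << "handle: " << sizeof(test)
          << " bytes, heap storage: " << sizeof(double) * test.capacity()
          << " bytes" << std::endl;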
I am trying to write a simple C++ program that takes an array, sends equal portions of it to different processes, has those processes do computations on the components, and then sends the portions back to the master process to be combined into the final array.
I have started with a simple case of an array of size 2: process 1 adds 1 to the first component, and process 2 adds 2 to the second.
Here is what I have:
#include <cstdlib>
#include <iostream>
#include <iomanip>
#include <ctime>
#include <fstream>
#include "mpi.h"
using namespace std;
ofstream debug("DEBUG");
ofstream debug1("DEBUG1");
ofstream debug2("DEBUG2");
// Declare the array
double arr[2];
int main(int argc, char *argv[])
{
MPI::Init(argc, argv);
// Make the array
arr[0] = 1;
arr[1] = 2;
int rank = MPI::COMM_WORLD.Get_rank();
int npes = MPI::COMM_WORLD.Get_size();
if ( rank == 0 ) {
cout << "Running on "<< npes << " Processes "<< endl;
double arr1;
double arr2;
MPI::COMM_WORLD.Recv(&arr1, 1, MPI::DOUBLE, 0, 0);
debug << "arr1: " << arr1 << endl;
/* ... Program freezes here. I'd like to combine arr1 and arr2 into arr. */
}
if ( rank == 1){
debug1 << "This is process " << rank << endl;
double arr1 = arr[0];
debug1 << "arr1: " << arr1 << endl;
arr1 = arr1 + 1;
debug1 << "arr1+1: " << arr1 << endl;
MPI::COMM_WORLD.Send(&arr1, 1, MPI::DOUBLE, 0, 0);
}
if ( rank == 2){
debug2 << "This is process " << rank << endl;
double arr2 = arr[1];
debug2 << "arr2: " << arr2 << endl;
arr2 = arr2 + 2;
debug2 << "arr2+2: " << arr2 << endl;
}
cout << "Greetings from process " << rank << endl;
MPI::Finalize();
}
I am compiling with
mpiCC test.cpp -o test
and running with
mpirun -np 3 test
since I wish to use 2 processes to operate on arr and 1 process (process 0) to gather the components.
My issue is that the program freezes when using
MPI::COMM_WORLD.Recv(&arr1, 1, MPI::DOUBLE, 0, 0);
on process 0.
Does anyone know why this would happen? I'd simply like to distribute computations on an array over processors and thought this would be a good example to start with.
The immediate reason for the freeze is that rank 0 posts Recv with source 0, i.e. it waits for a message from itself; the message sent by rank 1 arrives with source 1 and never matches, so the Recv blocks forever. More generally, MPI has collective operations designed for exactly this kind of task: MPI_Scatter and MPI_Reduce (or MPI_Gather, to collect an array rather than a combined value). They let you divide your array among n workers, do your computations, and get the results back to the coordinator.
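Here is a minimal sketch of that pattern for your two-element example. Assumptions: the C bindings (the MPI C++ bindings were deprecated and later removed from the standard), and exactly two ranks (mpirun -np 2), with every rank doing a share of the work instead of a dedicated coordinator:
#include <cstdio>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double arr[2] = { 1, 2 }; // only meaningful on the root
    double mine = 0;

    // Hand each rank one element of arr.
    MPI_Scatter(arr, 1, MPI_DOUBLE, &mine, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    mine += rank + 1; // rank 0 adds 1, rank 1 adds 2

    // Collect the transformed elements back into arr on the root.
    MPI_Gather(&mine, 1, MPI_DOUBLE, arr, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("arr = { %g, %g }\n", arr[0], arr[1]);

    MPI_Finalize();
    return 0;
}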
I wanted to read an array of double values from a file into an array. I have 128^3 values. My program worked just fine as long as I stayed at 128^2 values, but now I get a segmentation fault, even though 128^3 = 2,097,152 is far below the maximum value of an int. So how many values can you actually put into an array of doubles?
#include <iostream>
#include <fstream>
int LENGTH = 128;
int main(int argc, const char * argv[]) {
// insert code here...
const int arrLength = LENGTH*LENGTH*LENGTH;
std::string filename = "density.dat";
std::cout << "opening file" << std::endl;
std::ifstream infile(filename.c_str());
std::cout << "creating array with length " << arrLength << std::endl;
double* densdata[arrLength];
std::cout << "Array created"<< std::endl;
for(int i=0; i < arrLength; ++i){
double a;
infile >> a;
densdata[i] = &a;
std::cout << "read value: " << a << " at line " << (i+1) << std::endl;
}
return 0;
}
You are allocating the array on the stack, and stack size is limited (by default, stack limit tends to be in single-digit megabytes).
You have several options:
increase the size of the stack (ulimit -s on Unix);
allocate the array on the heap using new;
move to using std::vector (see the sketch below).
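A minimal sketch of the std::vector option, which as a side effect also fixes a second bug in the posted code: densdata[i] = &a stores the address of a loop-local variable rather than the value that was read.
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    const int LENGTH = 128;
    const std::size_t arrLength = std::size_t(LENGTH) * LENGTH * LENGTH;
    std::ifstream infile("density.dat");
    std::vector<double> densdata(arrLength); // elements live on the heap, not the stack
    for (std::size_t i = 0; i < arrLength; ++i) {
        infile >> densdata[i]; // store the value itself, not a pointer to it
    }
    std::cout << "Read " << densdata.size() << " values" << std::endl;
    return 0;
}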