Intel Xeon Phi offload code + STL vector - c++

I would like to copy data stored in an STL vector to an Intel Xeon Phi coprocessor. In my code, I created a class that contains a vector with the data needed for the computation. I want to create the class object on the host, initialize the data on the host too, and then send the object to the coprocessor. This simple code illustrates what I want to do. After the object is copied to the coprocessor, the vector is empty. What could be the problem? How do I do this correctly?
#pragma offload_attribute (push, target(mic))
#include <vector>
#include "offload.h"
#include <stdio.h>
#pragma offload_attribute (pop)
class A
{
public:
A() {}
std::vector<int> V;
};
int main()
{
A* wsk = new A();
wsk->V.push_back(1);
#pragma offload target(mic) in(wsk)
{
printf("%zu", wsk->V.size());
printf("END OFFLOAD");
}
return 0;
}

When an object is copied to the coprocessor, only the memory of the object itself (here, of type A) is copied. std::vector allocates a separate block of memory to store its elements, so copying the std::vector embedded within A does not copy its elements. I would recommend against trying to use std::vector directly. You can copy its elements, but not the vector itself.
int main()
{
A* wsk = new A();
wsk->V.push_back(1);
int* data = &wsk->V[0];
int size = wsk->V.size();
#pragma offload target(mic) in(data : length(size))
{
printf("%d", size);
printf("END OFFLOAD");
}
return 0;
}
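To see why only the vector's bookkeeping travels with the object, here is a host-only sketch (no offload pragmas, so any C++11 compiler should accept it). It checks that the element storage lies outside the bytes occupied by the A object, which is why a byte-wise copy of the object does not carry the elements along. The helper names elements_outside_object and check_layout are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class A
{
public:
    std::vector<int> V;
};

// Hypothetical helper: true if the vector's element storage is not
// located inside the bytes occupied by the A object itself.
bool elements_outside_object(const A& a)
{
    const std::uintptr_t obj_begin = reinterpret_cast<std::uintptr_t>(&a);
    const std::uintptr_t obj_end = obj_begin + sizeof(A);
    const std::uintptr_t data = reinterpret_cast<std::uintptr_t>(a.V.data());
    return data < obj_begin || data >= obj_end;
}

// Builds an A with 1000 elements and checks both observations.
void check_layout()
{
    A a;
    a.V.assign(1000, 7);
    assert(a.V.size() == 1000);
    assert(elements_outside_object(a));  // storage is a separate heap block
    assert(sizeof(a) == sizeof(A));      // object size does not grow with the element count
}
```

Calling check_layout() from main should pass silently. Copying sizeof(A) bytes therefore transfers only the vector's internal pointers and counters, which are meaningless in the coprocessor's address space.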

Related

Refactor to reduce "frequency" (amount of time) of tiny memory allocation "ClassC[massive]::ClassB[]::int[tiny]"

I want to store int[] inside B and store B[] inside C.
There are also other fields that act the same way.
In a certain run-time situation, I know that all vectors (.bf1, .bf2, .cf1, .cf2 and cs) created in a code scope must have certain sizes (e.g. num_bf1, num_bf2, num_cf1, num_cf2 and num_c).
#include <iostream>
#include <vector>
class B{public:
std::vector<int> bf1; //b's field 1 <---------
std::vector<float> bf2;//b's field 2
//... many of it
};
class C{public:
std::vector<B> cf1; // <---------
std::vector<int> cf2;
//... many of it
};
int main(){
//"num_xxx" are known only after some complex algo only at run time.
// Their value are also different for each game time-step.
int num_bf1=3; // <---------
int num_bf2=4;
int num_cf1=2; // <---------
int num_cf2=8;
int num_c=6;
std::vector<C> cs;
for(int m=0;m<num_c;m++){
C c;
for(int n=0;n<num_cf1;n++){
B b;
b.bf1.resize(num_bf1); // <---------
b.bf2.resize(num_bf2);
c.cf1.push_back(b); // <---------
}
c.cf2.resize(num_cf2);
cs.push_back(c); // "cs" is now finished as i wish
}
}
It works, but this code has an ugly bottleneck.
In the real case, memory allocation alone uses ~35% of CPU time.
My poor approach
I would like to avoid this by allocating one large block of memory and splitting it up.
With known values of num_xxx, I can calculate the total number of ints that all C::B::bf1 fields need.
Therefore, I can write std::vector<int> bf1; bf1.resize(num_c*num_cf1*num_bf1); .
Here is the full code :-
#include <iostream>
#include <string>
#include <vector>
#include <cmath>
class B{public:
int* bf1; //b's field 1
float* bf2;//b's field 2
//... many of it
};
class C{public:
B* cf1;
int* cf2;
//... many of it
};
Here is the new main :-
int main(){
int num_bf1=3;
int num_bf2=4;
int num_cf1=2;
int num_cf2=8;
int num_c=6;
std::vector<C> cs;
std::vector<int> bf1; bf1.resize(num_c*num_cf1*num_bf1);
std::vector<float> bf2; bf2.resize(num_c*num_cf1*num_bf2);
std::vector<B> cf1; cf1.resize(num_c*num_cf1);
std::vector<int> cf2; cf2.resize(num_c*num_cf2);
for(int m=0;m<num_c;m++){
C c;
for(int n=0;n<num_cf1;n++){
B* b=&(cf1[m*num_cf1+n]);
b->bf1=&bf1[(m*num_cf1+n)*num_bf1];
// .... something about bf2 that is hard ...
}
c.cf1=&(cf1[m*num_cf1+0]);
// .... something about cf2 that is hard ...
cs.push_back(c);
}
}
This makes my program much faster, but it is very hard to code and becomes a nightmare for maintainability and readability.
My poor approach V.2
I tried to use a custom stack allocator (and plan to use a one-frame allocator).
The issue is that, with multithreading, the allocator's mutex is locked so often that threads frequently block.
A solution is to create one stack allocator for each thread.
However, I believe this approach is too low-level. It patches one place to solve a problem that originates somewhere else. I think the correct solution is to fix the tiny-array creation, not to create a library that supports such bad practice(?).
Question
How can I solve this tiny-array problem elegantly?

Using placement new with an std::function doesn't work

This code crashes at (*function)(). I'm running Visual Studio 2019 and compiling C++17 for Windows 10 x86_64. I've tested it on Linux with GCC (-std=c++17) and it works fine. I'm wondering if this is a problem with Visual Studio's C++ compiler or something that I'm not seeing.
#include <vector>
#include <array>
#include <functional>
#include <iostream>
int main() {
const size_t blockSize = sizeof(std::function<void()>);
using block = std::array<char, blockSize>;
std::vector<block> blocks;
auto lambda = [](){
std::cout << "The lambda was successfully called.\n";
};
blocks.emplace_back();
new (&blocks[0]) std::function<void()>(lambda);
blocks.emplace_back();
new (&blocks[1]) std::function<void()>(lambda);
std::function<void()> *function = (std::function<void()> *)blocks[0].data();
(*function)();
return 0;
}
The error is a read access violation in the std::function internals.
When you append an element to a std::vector and that element would make the vector's size greater than its capacity, the vector has to allocate new storage and copy/move all of its elements to the new space. While you've constructed complex std::function objects in the char arrays held in your vector, the vector doesn't know that. The underlying bytes that make up the std::function object will get copied to the vector's new storage, but the std::function's copy/move constructor won't get called.
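The reallocation itself can be observed with a plain std::vector<int>. The sketch below fills a vector up to its capacity, remembers the address of its storage, and then appends one more element. The function name storage_moves_on_growth is made up for this illustration, and the check assumes the default std::allocator, which obtains the new block while the old one is still alive.

```cpp
#include <cassert>
#include <vector>

// Returns true if appending one element beyond the current capacity
// moved the vector's storage to a different address.
bool storage_moves_on_growth()
{
    std::vector<int> v;
    v.push_back(1);
    // Fill up to capacity so the next push_back must reallocate.
    while (v.size() < v.capacity())
        v.push_back(0);
    assert(v.size() == v.capacity());
    const int* before = v.data();
    v.push_back(2);  // exceeds capacity: new storage is allocated
    const int* after = v.data();
    return before != after;
}
```

Any object that was placement-constructed in the old storage is carried over only as raw bytes, which is what breaks the std::function objects in the question.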
The best solution would be to just use a std::vector<std::function<void()>> directly. If you have to use a vector of blocks of raw storage for some reason, then you'll need to pre-allocate space before you start inserting elements, for example:
int main() {
const size_t blockSize = sizeof(std::function<void()>);
using block = std::array<char, blockSize>;
std::vector<block> blocks;
auto lambda = [](){
std::cout << "The lambda was successfully called.\n";
};
blocks.reserve(2); // pre-allocate space for 2 blocks
blocks.emplace_back();
new (&blocks[0]) std::function<void()>(lambda);
blocks.emplace_back();
new (&blocks[1]) std::function<void()>(lambda);
std::function<void()> *function = (std::function<void()> *)blocks[0].data();
(*function)();
return 0;
}
Alternatively you could use a different data structure such as std::list that maintains stable addresses when elements are added or removed.
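For comparison, here is a minimal sketch of the first suggestion, storing real std::function objects in the vector. When the vector grows, it moves the elements through their own constructors, so every stored function stays callable. The function name call_all_after_growth and the counting lambda are illustrative assumptions, not part of the original code.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Stores n callables in a vector that owns real std::function objects,
// then invokes every one of them. Returns the number of calls made.
int call_all_after_growth(int n)
{
    assert(n >= 0);
    int calls = 0;
    std::vector<std::function<void()>> functions;
    for (int i = 0; i < n; ++i)
        functions.push_back([&calls]() { ++calls; });  // may reallocate; elements are moved properly
    for (auto& f : functions)
        f();
    return calls;
}
```

No reserve call is needed here, because the vector knows the element type and relocates each element correctly.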

C++ program with derived data type (nested class object) container using MPI/OpenMP

I have developed a program in C++11 and I want to speed up the performance.
I will use a simple example to show the structure of the program (not complete).
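//main.cpp
#include <vector>
#include "a.h"
int main()
{
const int num_objects = 10000; // "10K" instances
const int long_time = 100; // number of time steps (placeholder value)
std::vector<a> a_container;
for (int i = 0; i < num_objects; i++)
{
a a_obj;
a_container.push_back(a_obj);
}
for (int time = 1; time < long_time; time++)
{
//i used openmp here already
for (int i = 0; i < num_objects; i++)
{
a_container[i].dosomething();
}
for (int i = 0; i < num_objects; i++)
{
a_container[i].update();
}
}
return 0;
}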
//a.cpp
//a.h
#include "b.h"
class a
{
int d;
b b_obj;
public:
int dosomething();
void update();
};
//b.cpp
//b.h
class b
{
int c;
double d;
public:
int dosomething();
};
So in order to speed up the program, I want to use both MPI and OpenMP, mainly for the loop (could be up to 1 million~1 billion instances).
The class object a and b both contain complex member variables (standard and other containers, etc.) and functions.
By using OpenMP, I can take advantage of one HPC node with all cores/threads. But if I want to use MPI, I need to distribute all the instances to many nodes.
I haven't found a good solution to this yet; the closest things I have found so far are
http://mpi-forum.org/docs/mpi-2.2/mpi22-report/node83.htm#Node83
and https://blogs.cisco.com/performance/how-to-send-cxx-stl-objects-in-mpi
Any suggestions would be appreciated. Thanks.
Sending non-trivially-copyable objects over MPI is just the same as sending them over any other byte transport: you have to serialize. You can use stringstream to hold the buffer on either end, if it helps.
However, it’s very likely that you shouldn’t do this at all. The data needed to create your objects (e.g., loop bounds and initial values) is probably much smaller and simpler than the form used for ongoing computation. Send that instead, and you can create your complicated objects in parallel as well as reducing communication. (If the parameters are known statically, you don’t have to send anything: each process can just start working on the known initialization.)
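As a rough illustration of the serialization step, the sketch below packs an object with a std::vector member into a flat byte buffer using string streams and unpacks it again. The Particle type and the serialize/deserialize names are hypothetical, and the format assumes sender and receiver share the same architecture (same endianness and type sizes). The MPI calls themselves are left out; the resulting bytes would be what gets handed to the transport, for example as a buffer of MPI_BYTE.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical object with a non-trivially-copyable member.
struct Particle
{
    int id;
    std::vector<double> history;
};

// Write the object into a flat byte buffer: id, element count, elements.
std::string serialize(const Particle& p)
{
    std::ostringstream out(std::ios::binary);
    const std::size_t n = p.history.size();
    out.write(reinterpret_cast<const char*>(&p.id), sizeof p.id);
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    if (n > 0)
        out.write(reinterpret_cast<const char*>(p.history.data()), n * sizeof(double));
    return out.str();
}

// Rebuild the object from a buffer produced by serialize().
Particle deserialize(const std::string& bytes)
{
    std::istringstream in(bytes, std::ios::binary);
    Particle p;
    std::size_t n = 0;
    in.read(reinterpret_cast<char*>(&p.id), sizeof p.id);
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    p.history.resize(n);
    if (n > 0)
        in.read(reinterpret_cast<char*>(p.history.data()), n * sizeof(double));
    assert(in.good());  // the buffer contained everything we expected
    return p;
}
```

A real class with nested containers needs one such pair of functions per type, which is part of why sending only the construction parameters is usually simpler.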

Why does creating pointers to more than a certain number (30) of instances of a struct with an stxxl::vector member fail?

I am using STXXL Library's stxxl::vector in my code as :
struct B
{
typedef stxxl::VECTOR_GENERATOR<float>::result vector;
vector content;
};
And then creating many instances of the above declared Structure in a loop using the following code snippet :
for(i=0;i<50;i++)
{
B* newVect= new B();
// Passing the above '*newVect' to some other function
}
But this snippet will not create 'newVect' beyond a certain number of instances (30 in this case).
However, I tried the same thing after replacing stxxl::vector with ordinary in-memory data types:
struct B
{
float a,b,c;
int f,g,h;
};
The structure above works fine even for 100000 new instances:
for(i=0;i<100000;i++)
{
B* newVect= new B();
// Passing the above '*newVect' to some other function
}
with all system resources remaining the same.
Please help me with this.
Can stxxl iterators help here or work as an alternative?
What kind of behavior does stxxl::vector have in this case?
UPDATE
I tried removing the function call from each iteration and calling it once outside the loop, but it did not help.
Example Code:
#include <stxxl/vector>
#include <iostream>
using namespace std;
struct buff
{
typedef stxxl::VECTOR_GENERATOR<float>::result vector;
vector content;
};
struct par
{
buff* b[35];
};
void f(par *p)
{
for(int h=0;h<35;h++)
{
std::cout<<endl<<"In func: "<<(*p).b[h];
}
}
int main()
{
par parent;
for(int h=0;h<35;h++)
{
buff* b=new buff();
parent.b[h]=b;
cout<<endl<<"IN main: "<<parent.b[h];
}
cout << endl << endl;
f(&parent);
return 0;
}
Each stxxl::vector costs a specific amount of internal memory, since it is basically a paging system for blocks in external memory.
With the default settings, this is 8 (CachePages) * 4 (PageSize) * 2 MiB (BlockSize) = 64 MiB of RAM per stxxl::vector.
So, you are basically running out of RAM.
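The arithmetic above can be written out as a small sketch. The three constants are the default figures quoted in this answer. The helper names and the 2048 MiB budget mentioned afterwards are assumptions made for illustration, not values taken from the STXXL documentation.

```cpp
#include <cassert>
#include <cstddef>

// Default figures quoted in the answer above.
constexpr std::size_t kCachePages = 8;
constexpr std::size_t kPageSize = 4;      // blocks per page
constexpr std::size_t kBlockSizeMiB = 2;  // MiB per block

// Internal memory needed per stxxl::vector, in MiB.
constexpr std::size_t mib_per_vector()
{
    return kCachePages * kPageSize * kBlockSizeMiB;
}

// Internal memory needed for `count` vectors, in MiB.
constexpr std::size_t mib_for_vectors(std::size_t count)
{
    return count * mib_per_vector();
}

// How many vectors fit into a given RAM budget (in MiB).
constexpr std::size_t vectors_that_fit(std::size_t ram_mib)
{
    return ram_mib / mib_per_vector();
}

static_assert(mib_per_vector() == 64, "8 * 4 * 2 MiB = 64 MiB");
```

With these figures, 30 vectors need 1920 MiB and the 35 from the example need 2240 MiB, while a hypothetical 2048 MiB budget holds 32 of them, which is consistent with the failure appearing at around 30 instances.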

Need help creating an array of objects

I am trying to create an array of class objects taking an integer argument. I cannot see what is wrong with this simple little code. Could someone help?
#include <fstream>
#include <iostream>
using namespace std;
typedef class Object
{
int var;
public:
Object(const int& varin) : var(varin) {}
} Object;
int main (int argc, char * const argv[])
{
for(int i = 0; i < 10; i++)
{
Object o(i)[100];
}
return 0;
}
In C++ you don't need typedefs for classes and structs. So:
class Object
{
int var;
public:
Object(const int& varin) : var(varin) {}
};
Also, descriptive names are always preferable; Object is much abused.
int main (int argc, char * const argv[])
{
int var = 1;
Object obj_array[10]; // works only if Object has a default constructor
return 0;
}
Otherwise, in your case:
int main (int argc, char * const argv[])
{
int var = 1;
Object init(var);
Object obj_array[10] = { var, ..., var }; // initialize manually
return 0;
}
Though, really, you should look at std::vector:
#include <vector>
int main (int argc, char * const argv[])
{
int var = 1;
std::vector<Object> obj_vector(10, var); // initialize 10 objects with var value
return 0;
}
dirkgently's rundown is a fairly accurate representation of arrays of items in C++, but whereas he initializes all the items in the array with the same value, it looks like you are trying to initialize each with a distinct value.
To answer your question about creating an array of objects that take an int constructor parameter: you can't do it directly. The objects are created when the array is allocated, and without a default constructor (or an initializer for every element) your compiler will complain. You can, however, initialize an array of pointers to your objects, but you really get a lot more flexibility with a vector, so my following examples will use std::vector.
You will need to initialize each object separately if you want each Object to have a distinct value. You can do this in one of two ways: on the stack or on the heap. Let's look at on-the-stack first.
Any constructor that takes a single argument and is not marked explicit can be used as a converting constructor. This means that anywhere an object of that type is expected, you can instead pass a value of the single parameter's type. In this example we create a vector of your Object class and add 100 Objects to it (push_back adds items to a vector); we pass an integer to push_back, which implicitly creates an Object from that integer.
#include <vector>
int main() {
std::vector<Object> v;
for(int i = 0; i < 100; i++) {
v.push_back(i);
}
}
Or to be explicit about it:
#include <vector>
int main() {
std::vector<Object> v;
for(int i = 0; i < 100; i++) {
v.push_back(Object(i));
}
}
In these examples, all of the Object objects are allocated on the stack in the scope of the for loop, so a copy happens when the object is pushed into the vector. Copying a large number of objects can cause some performance issues especially if your object is expensive to copy.
One way to get around this performance issue is to allocate the objects on the heap and store pointers to the objects in your vector:
#include <vector>
int main() {
std::vector<Object*> v;
for(int i = 0; i < 100; i++) {
v.push_back(new Object(i));
}
for(int i = 0; i < 100; i++) {
delete v[i];
}
}
Since our objects were created on the heap, we need to make sure that we delete them to call their destructors and free their memory; this code does that in the second loop.
Manually calling delete has its own caveats: if you pass these pointers to other code, you can quickly lose track of who owns the pointers and who should delete them. An easier way to solve this problem is to use a smart pointer to track the lifetime of the object; see boost::shared_ptr or tr1::shared_ptr (std::shared_ptr since C++11), which are reference-counted pointers:
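#include <memory>
#include <vector>
int main() {
// std::shared_ptr requires C++11; boost::shared_ptr is used the same way
std::vector<std::shared_ptr<Object> > v;
for(int i = 0; i < 100; i++) {
Object* o = new Object(i);
v.push_back(std::shared_ptr<Object>(o));
}
}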
You'll notice that the shared_ptr constructor is explicit; this is intentional, to make sure that the developer deliberately hands their pointer to the shared pointer. When all references to an object are released, the object is automatically deleted by the shared_ptr, freeing us from the need to worry about its lifetime.
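If a C++11 compiler is available, another way to avoid the extra copy without resorting to heap-allocated pointers is to construct each element directly inside the vector. The sketch below is one possible shape for that, not the only one. The function make_objects and the value() accessor are additions made for illustration, and the constructor is marked explicit here to show that emplace_back does not rely on the implicit conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class Object
{
    int var;
public:
    explicit Object(int varin) : var(varin) {}
    int value() const { return var; }  // accessor added for this example
};

// Builds n Objects with distinct values, constructing each one directly
// inside the vector's storage.
std::vector<Object> make_objects(int n)
{
    assert(n >= 0);
    std::vector<Object> v;
    v.reserve(static_cast<std::size_t>(n));  // one allocation up front
    for (int i = 0; i < n; ++i)
        v.emplace_back(i);                   // calls Object(int) in place, no temporary
    return v;
}
```

Calling make_objects(100) yields a vector whose element i holds the value i, and the vector still owns the objects, so no manual delete is required.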
If you want to stick to arrays, then you must either initialize manually or use the default constructor. However, you can get some control by creating a constructor with a default argument, which the compiler treats as a default constructor. For example, the following code prints out the numbers 0, ..., 9 in order. (The standard requires the elements of an array to be constructed in subscript order, so the output does not depend on the implementation.)
#include <iostream>
using namespace std;
struct A {
int _val;
A(int val = initializer()) : _val(val) {}
static int initializer() { static int v = 0; return v++; }
};
int main()
{
A a[10];
for(int i = 0; i < 10; i++)
cout << a[i]._val << endl;
}