This is my first question on Stack Overflow, and it's quite a long question. The tl;dr version is: How do I work with a thrust::device_vector<BaseClass> if I want it to store objects of different types DerivedClass1, DerivedClass2, etc, simultaneously?
I want to take advantage of polymorphism with CUDA Thrust. I'm compiling for an -arch=sm_30 GPU (GeForce GTX 670).
Let us take a look at the following problem: Suppose there are 80 families in town. 60 of them are married couples, 20 of them are single-parent households. Each family has, therefore, a different number of members. It's census time and households have to state the parents' ages and the number of children they have. Therefore, an array of Family objects is constructed by the government, namely thrust::device_vector<Family> familiesInTown(80), such that information of families familiesInTown[0] to familiesInTown[59] corresponds to married couples, the rest (familiesInTown[60] to familiesInTown[79]) being single-parent households.
Family is the base class - the number of parents in the household (1 for single parents and 2 for couples) and the number of children they have are stored here as members.
SingleParent, derived from Family, includes a new member - the single parent's age, unsigned int ageOfParent.
MarriedCouple, also derived from Family, however, introduces two new members - both parents' ages, unsigned int ageOfParent1 and unsigned int ageOfParent2.
#include <iostream>
#include <stdio.h>
#include <thrust/device_vector.h>
class Family
{
protected:
unsigned int numParents;
unsigned int numChildren;
public:
__host__ __device__ Family() {};
__host__ __device__ Family(const unsigned int& nPars, const unsigned int& nChil) : numParents(nPars), numChildren(nChil) {};
__host__ __device__ virtual ~Family() {};
__host__ __device__ unsigned int showNumOfParents() {return numParents;}
__host__ __device__ unsigned int showNumOfChildren() {return numChildren;}
};
class SingleParent : public Family
{
protected:
unsigned int ageOfParent;
public:
__host__ __device__ SingleParent() {};
__host__ __device__ SingleParent(const unsigned int& nChil, const unsigned int& age) : Family(1, nChil), ageOfParent(age) {};
__host__ __device__ unsigned int showAgeOfParent() {return ageOfParent;}
};
class MarriedCouple : public Family
{
protected:
unsigned int ageOfParent1;
unsigned int ageOfParent2;
public:
__host__ __device__ MarriedCouple() {};
__host__ __device__ MarriedCouple(const unsigned int& nChil, const unsigned int& age1, const unsigned int& age2) : Family(2, nChil), ageOfParent1(age1), ageOfParent2(age2) {};
__host__ __device__ unsigned int showAgeOfParent1() {return ageOfParent1;}
__host__ __device__ unsigned int showAgeOfParent2() {return ageOfParent2;}
};
If I were to naïvely initialize the objects in my thrust::device_vector<Family> with the following functors:
struct initSlicedCouples : public thrust::unary_function<unsigned int, MarriedCouple>
{
__device__ MarriedCouple operator()(const unsigned int& idx) const
// I use a thrust::counting_iterator to get idx
{
return MarriedCouple(idx % 3, 20 + idx, 19 + idx);
// Couple 0: Ages 20 and 19, no children
// Couple 1: Ages 21 and 20, 1 child
// Couple 2: Ages 22 and 21, 2 children
// Couple 3: Ages 23 and 22, no children
// etc
}
};
struct initSlicedSingles : public thrust::unary_function<unsigned int, SingleParent>
{
__device__ SingleParent operator()(const unsigned int& idx) const
{
return SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initSlicedCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSlicedSingles());
return 0;
}
I would definitely be guilty of some classic object slicing...
So, I asked myself, what about a vector of pointers that may give me some sweet polymorphism? Smart pointers in C++ are a thing, and thrust iterators can do some really impressive things, so let's give it a shot, I figured. The following code compiles.
struct initCouples : public thrust::unary_function<unsigned int, MarriedCouple*>
{
__device__ MarriedCouple* operator()(const unsigned int& idx) const
{
return new MarriedCouple(idx % 3, 20 + idx, 19 + idx); // Memory issues?
}
};
struct initSingles : public thrust::unary_function<unsigned int, SingleParent*>
{
__device__ SingleParent* operator()(const unsigned int& idx) const
{
return new SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family*> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSingles());
Family A = *(familiesInTown[2]); // Compiles, but object slicing takes place (in theory)
std::cout << A.showNumOfParents() << "\n"; // Segmentation fault
return 0;
}
Seems like I've hit a wall here. Am I understanding memory management correctly? (VTables, etc). Are my objects being instantiated and populated on the device? Am I leaking memory like there is no tomorrow?
For what it's worth, in order to avoid object slicing, I tried a dynamic_cast<DerivedPointer*>(basePointer). That's why I made my Family destructor virtual.
Family *pA = familiesInTown[2];
MarriedCouple *pB = dynamic_cast<MarriedCouple*>(pA);
The following lines compile, but, unfortunately, a segfault is thrown again, and cuda-memcheck won't tell me why.
std::cout << "Ages " << (pB -> showAgeOfParent1()) << ", " << (pB -> showAgeOfParent2()) << "\n";
and
MarriedCouple B = *pB;
std::cout << "Ages " << B.showAgeOfParent1() << ", " << B.showAgeOfParent2() << "\n";
In short, what I need is a class interface for objects that will have different properties and different numbers of members, but that I can store in one common vector (that's why I want a base class) and manipulate on the GPU. My intention is to work with them both in thrust transformations and in CUDA kernels via thrust::raw_pointer_cast, which has worked flawlessly for me until I needed to branch my classes out into a base one and several derived ones. What is the standard procedure for that?
Thanks in advance!
I am not going to attempt to answer everything in this question; it is just too large. Having said that, here are some observations about the code you posted which might help:
The GPU side new operator allocates memory from a private runtime heap. As of CUDA 6, that memory cannot be accessed by the host side CUDA APIs. You can access the memory from within kernels and device functions, but that memory cannot be accessed by the host. So using new inside a thrust device functor is a broken design that can never work. That is why your "vector of pointers" model fails.
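To illustrate that point, here is a minimal sketch (mine, not from the original post; error checking mostly omitted): the host can read back the pointer value itself, but any host-side API access to the pointee fails, because it lives in the device runtime heap.
#include <cstdio>
__device__ int* d_ptr; // will hold a device-heap pointer
__global__ void alloc_kernel()
{
    d_ptr = new int(42); // allocated on the device runtime heap
}
int main()
{
    alloc_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    int* p = nullptr;
    cudaMemcpyFromSymbol(&p, d_ptr, sizeof(p)); // reading the pointer value is fine
    int value = 0;
    // This copy fails (typically cudaErrorInvalidValue): p points into the
    // device runtime heap, which host-side CUDA APIs cannot access.
    cudaError_t err = cudaMemcpy(&value, p, sizeof(value), cudaMemcpyDeviceToHost);
    std::printf("cudaMemcpy: %s\n", cudaGetErrorString(err));
    return 0;
}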
Thrust is fundamentally intended to allow data parallel versions of typical STL algorithms to be applied to POD types. Building a codebase around complex polymorphic objects and trying to cram those through Thrust containers and algorithms might be made to work, but it isn't what Thrust was designed for, and I wouldn't recommend it. If you do, don't be surprised when you break Thrust in unexpected ways.
CUDA supports a lot of C++ features, but the compilation and object models are much simpler than even the C++98 standard upon which they are based. CUDA lacks several key features (RTTI for example) which make complex polymorphic object designs workable in C++. My suggestion is use C++ features sparingly. Just because you can do something in CUDA doesn't mean you should. The GPU is a simple architecture and simple data structures and code are almost always more performant than functionally similar complex objects.
Having skim-read the code you posted, my overall recommendation is to go back to the drawing board. If you want to look at some very elegant CUDA/C++ designs, spend some time reading the code bases of CUB and CUSP. They are both very different, but there is a lot to learn from both (and CUSP is built on top of Thrust, which makes it even more relevant to your use case, I suspect).
I completely agree with @talonmies' answer. (For example, I don't know that thrust has been extensively tested with polymorphism.) Furthermore, I have not fully parsed your code. I post this answer to add additional info, in particular that I believe some level of polymorphism can be made to work with thrust.
A key observation I would make is that it is not allowed to pass as an argument to a __global__ function an object of a class with virtual functions. This means that polymorphic objects created on the host cannot be passed to the device (via thrust, or in ordinary CUDA C++). (One basis for this limitation is the requirement for virtual function tables in the objects, which will necessarily be different between host and device, coupled with the fact that it is illegal to directly take the address of a device function in host code).
However, polymorphism can work in device code, including thrust device functions.
The following example demonstrates this idea, restricting ourselves to objects created on the device, although we can certainly initialize them with host data. I have created two classes, Triangle and Rectangle, derived from a base class Polygon which includes a virtual function area. Triangle and Rectangle inherit the function set_values from the base class but override the virtual area function.
We can then manipulate objects of those classes polymorphically as demonstrated here:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/sequence.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#define N 4
class Polygon {
protected:
int width, height;
public:
__host__ __device__ void set_values (int a, int b)
{ width=a; height=b; }
__host__ __device__ virtual int area ()
{ return 0; }
};
class Rectangle: public Polygon {
public:
__host__ __device__ int area ()
{ return width * height; }
};
class Triangle: public Polygon {
public:
__host__ __device__ int area ()
{ return (width * height / 2); }
};
struct init_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
(thrust::get<0>(arg)).set_values(thrust::get<1>(arg), thrust::get<2>(arg));}
};
struct setup_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
if (thrust::get<0>(arg) == 0)
thrust::get<1>(arg) = &(thrust::get<2>(arg));
else
thrust::get<1>(arg) = &(thrust::get<3>(arg));}
};
struct area_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
thrust::get<1>(arg) = (thrust::get<0>(arg))->area();}
};
int main () {
thrust::device_vector<int> widths(N);
thrust::device_vector<int> heights(N);
thrust::sequence( widths.begin(), widths.end(), 2);
thrust::sequence(heights.begin(), heights.end(), 3);
thrust::device_vector<Rectangle> rects(N);
thrust::device_vector<Triangle> trgls(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(rects.begin(), widths.begin(), heights.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(rects.end(), widths.end(), heights.end())),
                 init_f());
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(trgls.begin(), widths.begin(), heights.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(trgls.end(), widths.end(), heights.end())),
                 init_f());
thrust::device_vector<Polygon *> polys(N);
thrust::device_vector<int> selector(N);
for (int i = 0; i<N; i++) selector[i] = i%2;
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(selector.begin(), polys.begin(), rects.begin(), trgls.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(selector.end(), polys.end(), rects.end(), trgls.end())),
                 setup_f());
thrust::device_vector<int> areas(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(polys.begin(), areas.begin())),
                 thrust::make_zip_iterator(thrust::make_tuple(polys.end(), areas.end())),
                 area_f());
thrust::copy(areas.begin(), areas.end(), std::ostream_iterator<int>(std::cout, "\n"));
return 0;
}
I suggest compiling the above code for a cc2.0 or newer architecture. I tested with CUDA 6 on RHEL 5.5.
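For reference, a typical build line (a sketch; the file name is assumed) would be:
nvcc -arch=sm_30 -o polygons polygons.cu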
(The polymorphic example idea, and some of the code, was taken from here.)
I read about the built-in comparison operators. I wondered why there are no ordering operators (<, <=, >, >=) for member pointers. It is valid to compare the addresses of two members of an instance of a struct.
http://en.cppreference.com/w/cpp/language/operator_comparison:
3) If, within an object of non-union class type, two pointers point to different non-static data members with the same member access, or to subobjects or array elements of such members, recursively, the pointer to the later declared member compares greater. In other words, class members in each of the three member access modes are positioned in memory in order of declaration.
With the use of the address-of operator (&) and the member pointer dereference operator (.*) it is possible to compare the addresses, but an instance is needed.
My questions:
Why are there no built-in ordering operators for member pointers?
How to compare two member pointers without an instance?
My approach:
#include <iostream>
template<class S, class T>
int cmp_memberptr(T S::* a, T S::* b) {
//S s; // works, but needs an instantiation
//S& s = std::declval<S>(); // error
S& s = *(S*)nullptr; // no instantiation, works (on my machine), but undefined behavior because of the nullptr dereference (most compilers warn directly)!
// note: the precedence of .*:
return int(&(s.*a) < &(s.*b)) - int(&(s.*a) > &(s.*b));
}
struct Point { int x, y; };
int main(int argc, char const* const* argv) {
Point p;
#define tst(t) std::cout << #t " is " << ((t) ? "true" : "false") << '\n'
tst(&p.x < &p.y);
//tst(&Point::x < &Point::y); // the main problem!
tst(cmp_memberptr(&Point::x, &Point::y) < 0);
#undef tst
}
I considered the offsetof macro, but it does not take member pointers as parameters.
Member pointers are more complex beasts than you might think. They consist of an index into the potentially existing vtable and an offset (MSVC is broken in that regard without specifying extra options).
That is due to the existence of virtual inheritance, which means the exact offset of the virtual base sub-object depends on the most derived type, instead of the static type used for access.
Even the order of virtual bases depends on that.
So, you can create a total order for member pointers pointing to elements of the same virtual base, or pointing to elements outside any virtual base. A specific implementation might even guarantee more (accepting the inefficiency that forces), but that's outside the purview of the standard.
In the end, you cannot rely on even having a total order without knowing implementation details and having additional guarantees.
Example on coliru:
#include <iostream>
struct B {
int x;
};
struct M : virtual B {};
struct D : M {
int y;
};
static void print_offset(const M& m) {
std::cout << "offset of m.x: " << ((char*)&m.x - (char*)&m) << '\n';
}
int main() {
print_offset(M{});
print_offset(D{});
}
Example output:
offset of m.x: 8
offset of m.x: 12
I don't know how standards-conformant this is, but according to Godbolt, the following code compiles cleanly in clang, gcc and MSVC and generates the right code (push 4, basically, for m2) in an efficient way:
#include "stdio.h"
template <typename T, typename M> int member_offset (M m)
{
const void *pv = nullptr;
const T *pT = static_cast <const T *> (pv);
return static_cast <int> (reinterpret_cast <const char *> (&(pT->*m)) - reinterpret_cast <const char *> (pT));
}
class x
{
public:
int m1;
int m2;
};
int main (void)
{
int m1_offset = member_offset <x> (&x::m1);
int m2_offset = member_offset <x> (&x::m2);
printf ("m1_offset=%d, m2_offset=%d\n", m1_offset, m2_offset);
}
Output from Wandbox:
Start
m1_offset=0, m2_offset=4
0
Finish
Now you can just use or compare the member_offset results to do whatever you want.
EDIT
As pointed out by Caleth and Deduplicator above, this doesn't work with virtual inheritance. See my last comment to Deduplicator's answer for the reason why. As an aside, it's interesting to me that there is significant runtime overhead when accessing instance variables in the base class when using virtual inheritance. I hadn't realised that.
Also, the following simple macro is easier to use and works correctly with multiple inheritance with clang (so much for all those fancy templates):
#define member_offset(classname, member) \
((int) ((char *) &((classname*) nullptr)->member - (char *) nullptr))
You can test this with gcc and clang at Wandbox:
#include "stdio.h"
#define member_offset(classname, member) \
((int) ((char *) &((classname *) nullptr)->member - (char *) nullptr))
struct a { int m1; };
struct b { int m2; };
struct c : a, b { int m3; };
int main (void)
{
int m1_offset = member_offset (c, m1);
int m2 = member_offset (c, m2);
int m3 = member_offset (c, m3);
printf ("m1_offset=%d, m2=%d, m3=%d\n", m1_offset, m2, m3);
}
Output:
m1_offset=0, m2=4, m3=8
But if you use this macro with a class that uses virtual inheritance, you get a SEGFAULT (because the compiler needs to look inside the object to find the offset of that object's data members and there is no object there - just a nullptr).
So, the answer to the OP's question is that you do need an instance to do this in the general case. Maybe have a special constructor that does nothing to create one of these with minimal overhead.
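To sketch that idea (mine, not the OP's): a function-local static dummy instance gives you an object to take addresses in, at the cost of requiring S to be default-constructible, and with the same ordering caveats discussed above (same access level, no virtual bases in between):
#include <iostream>
template<class S, class T>
int cmp_memberptr_safe(T S::* a, T S::* b) {
    static S s{}; // one shared dummy instance per S; no nullptr dereference
    const auto* pa = &(s.*a);
    const auto* pb = &(s.*b);
    return int(pa < pb) - int(pa > pb);
}
struct Point { int x, y; };
int main() {
    std::cout << cmp_memberptr_safe(&Point::x, &Point::y) << '\n'; // prints 1 (x is declared before y)
}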
SECOND EDIT
I thought about this some more, and it occurred to me that instead of saying:
int mo = member_offset (c, m);
You should instead say:
constexpr int mo = member_offset (c, m);
Then the compiler should alert you if class c is using virtual inheritance.
Unfortunately, neither clang nor gcc will compile this for any kind of class, virtual inheritance or not. MSVC, on the other hand, does, and only generates a compiler error if class c uses virtual inheritance.
I don't know which compiler is doing the right thing here, insofar as the standard goes, but MSVC's behaviour is obviously the most sensible.
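As an aside, if the classes involved are standard-layout, the underlying question ("where is this member?") is already answered by offsetof, which is well-defined without any instance; it just takes the member name rather than a member pointer, as the OP noted. A minimal sketch:
#include <cstddef>
#include <cstdio>
struct Point { int x, y; };
int main() {
    // offsetof is well-defined for standard-layout classes and needs no instance
    std::printf("x at %zu, y at %zu\n", offsetof(Point, x), offsetof(Point, y));
}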
I recently modified a program to use virtual functions (in place of a sequence of if-else conditions with static calls). The modified program runs 8% slower than the original. That seems like too high a cost for using virtual functions, so I must be doing something inefficient in the way I set up the class hierarchy and virtual functions; but I'm at a loss for how to track down the problem. (I see similar performance degradation using both clang on my Mac and gcc on Linux.)
The program is used to study different community detection algorithms. It uses a nested loop to apply a series of user-specified objective functions to a variety of (graph, partition) pairs.
Here is a rough outline of the original code
int main(int argc, char* argv[]) {
bool use_m1;
bool use_m2;
...
bool use_m10;
// set the various "use" flags based on argv
for (Graph& g : graphsToStudy()) {
for (Partition& p : allPartitions()) {
if (use_m1) {
M1::evaluate(g, p);
}
if (use_m2) {
M2::evaluate(g,p);
}
// and so on
}
}
}
To make the code easier to maintain, I created a class structure for the different objective functions, and iterated through an array of pointers:
class ObjectiveFunction {
public:
virtual double eval(Graph& g, Partition& p) = 0;
};
class ObjFn1 : public ObjectiveFunction {
public:
virtual double eval(Graph& g, Partition& p) {
return M1::evaluate(g, p);
}
};
class ObjFn2 : public ObjectiveFunction {
public:
virtual double eval(Graph& g, Partition& p) {
return M2::evaluate(g, p);
}
};
int main(int argc, char* argv[]) {
vector<ObjectiveFunction*> funcs;
fill_funcs_based_on_opts(funcs, argc, argv);
for (Graph& g : graphsToStudy()) {
for (Partition& p : allPartitions()) {
// funcs contains one object for each function selected by user.
for (ObjectiveFunction* fp : funcs) {
fp->eval(g, p);
}
}
}
}
Given that generating graphs and partitions, as well as the objective functions themselves, are moderately computationally intensive, the addition of the virtual function call should be almost unnoticeable. Any ideas what I may have done wrong, or how to track it down? I tried using callgrind, but am not seeing any insights.
Perhaps I am just incorrectly interpreting the output of callgrind_annotate. In the example below, Neo::Context::evaluatePartition is analogous to ObjFn1::eval in the example above.
1. Why is this function listed four different times with different source files? This method is only ever called from function main in timeMetrics.cpp.
2. What does src/lib/PartitionIterator.h:main refer to? There is no main function in PartitionIterator.h.
3. Why does 414,219,420 appear twice in the source code listing for evaluatePartition? Isn't the first number supposed to represent the overhead of the function call?
35,139,513,913 PROGRAM TOTALS
17,029,020,600 src/lib/metrics/Neo.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool) [bin/timeMetrics_v]
7,168,741,865 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
4,418,473,884 src/lib/Partition.h:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList() [bin/timeMetrics_v]
1,459,239,657 src/lib/PartitionIterator.h:main
1,288,682,640 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
1,058,560,740 src/lib/Partition.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
1,012,736,608 src/perfEval/timeMetrics.cpp:main [bin/timeMetrics_v]
443,847,782 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/vector:main
368,372,912 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/memory:gvcd::Partition<unsigned int, unsigned int>::buildMembershipList()
322,170,738 /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/ostream:main
92,048,760 src/lib/SmallGraph.h:gvcd::metrics::Neo::Context<unsigned int, unsigned char, unsigned int>::evaluatePartition(gvcd::Partition<unsigned int, unsigned int> const&, bool)
84,549,144 ???:szone_free_definite_size [/usr/lib/system/libsystem_malloc.dylib]
54,212,938 ???:tiny_free_list_add_ptr [/usr/lib/system/libsystem_malloc.dylib]
. virtual double
414,219,420 evaluatePartition(const Partition <VertexID, SetBitmap> &p, bool raw = false) {
414,219,420 uint_wn_t raw_answer = Neo::evaluatePartition(*(this->g), p);
. return (double) (raw ? raw_answer : max_neo - raw_answer);
. }
. }; // end Context
Let's fix the obvious first:
In both versions you do this:
for (Graph g : graphsToStudy()) {
for (Partition p : allPartitions()) {
Unless Graph/Partition are easy and small to copy then most of your work will be here.
for (Graph& g : graphsToStudy()) {
//        ^
for (Partition& p : allPartitions()) {
//            ^
My second observation: this does not seem like the correct usage of virtual functions. Your original code looks totally fine in this use case, where multiple versions of evaluate() are being called on each (g, p) object pair.
Now if you always called every one of the evaluate() functions, it might be a better use case, but then you would no longer need that inner loop:
for (ObjectiveFunction* fp : funcs) {
It's expensive because you're actually using polymorphism, which defeats the branch predictor.
It may help the branch predictor if you replace collection iteration with an intrinsic linked list:
class ObjectiveFunction
{
ObjectiveFunction* m_next;
protected:
ObjectiveFunction(ObjectiveFunction* next = nullptr) : m_next(next) {}
// for gcc use __attribute__((always_inline))
// for MSVC use __forceinline
void call_next(Graph& g, Partition& p)
{
if (m_next) m_next->eval(g, p);
}
public:
virtual ~ObjectiveFunction() {}
virtual void eval(Graph& g, Partition& p) = 0;
};
Now, instead of one line of code inside the loop reaching many different functions, the call_next() function (which should be the last step of each individual eval overload) should be inlined into each of those overloads, and at runtime, each inlined copy of that indirect-call instruction will repeatedly call just one function, resulting in 100% branch prediction.
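A concrete node in that chain might look like this (my sketch, reusing the OP's M1 as the metric and the base class above; the chain is built once from the user's flags):
// One node of the intrusive list: do this node's work, then forward along
// the chain. The indirect call inside the inlined call_next() always targets
// the same next object, so it predicts essentially perfectly.
class ObjFn1 : public ObjectiveFunction
{
public:
    explicit ObjFn1(ObjectiveFunction* next = nullptr) : ObjectiveFunction(next) {}
    void eval(Graph& g, Partition& p) override
    {
        M1::evaluate(g, p); // this node's work
        call_next(g, p);    // last step: hand off to the next node
    }
};
// usage: build the chain once, then a single call walks every selected function
// ObjectiveFunction* chain = new ObjFn1(new ObjFn2());
// chain->eval(g, p);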
Where I can, I prefer static over dynamic dispatch -- dynamic dispatch can cost you by preventing optimizations like function inlining, and with the double-dereference involved with vtables you can suffer from poor locality (instruction cache misses).
I suspect the lion's share of the difference in performance comes from losing the benefits of optimizations performed on static dispatch. It might be interesting to try disabling inlining on the original code to see how much of a benefit you were enjoying.
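One cheap way to test that hypothesis (a sketch of mine; the attribute spelling is gcc/clang, MSVC would use __declspec(noinline), and Graph/Partition are stubbed here):
struct Graph {};
struct Partition {};
// If marking the statically-dispatched metrics noinline makes the original
// if-else version slow down to match the virtual version, the 8% gap was
// mostly lost inlining rather than the vtable lookup itself.
struct M1 {
    __attribute__((noinline))
    static double evaluate(Graph&, Partition&) {
        return 0.0; // stand-in for the real metric body
    }
};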
In a function that takes several arguments of the same type, how can we guarantee that the caller doesn't mess up the ordering?
For example
void allocate_things(int num_buffers, int pages_per_buffer, int default_value ...
and later
// uhmm.. lets see which was which uhh..
allocate_things(40,22,80,...
A typical solution is to put the parameters in a structure, with named fields.
AllocateParams p;
p.num_buffers = 1;
p.pages_per_buffer = 10;
p.default_value = 93;
allocate_things(p);
You don't have to use fields, of course. You can use member functions or whatever you like.
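For example, a chainable-setter variant of the same idea (a sketch; all names are invented here):
#include <cstdio>
struct AllocateParams {
    int num_buffers = 0;
    int pages_per_buffer = 0;
    int default_value = 0;
    AllocateParams& buffers(int n)  { num_buffers = n; return *this; }
    AllocateParams& pages(int n)    { pages_per_buffer = n; return *this; }
    AllocateParams& defaults(int n) { default_value = n; return *this; }
};
void allocate_things(const AllocateParams& p) {
    std::printf("%d buffers, %d pages each, default %d\n",
                p.num_buffers, p.pages_per_buffer, p.default_value);
}
int main() {
    // the call site names every value and stays a single expression
    allocate_things(AllocateParams().buffers(40).pages(22).defaults(80));
}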
If you have a C++11 compiler, you could use user-defined literals in combination with user-defined types. Here is a naive approach:
struct num_buffers_t {
constexpr num_buffers_t(int n) : n(n) {} // constexpr constructor (C++11)
int n;
};
struct pages_per_buffer_t {
constexpr pages_per_buffer_t(int n) : n(n) {}
int n;
};
constexpr num_buffers_t operator"" _buffers(unsigned long long int n) {
return num_buffers_t(n);
}
constexpr pages_per_buffer_t operator"" _pages_per_buffer(unsigned long long int n) {
return pages_per_buffer_t(n);
}
void allocate_things(num_buffers_t num_buffers, pages_per_buffer_t pages_per_buffer) {
// do stuff...
}
template <typename S, typename T>
void allocate_things(S, T) = delete; // forbid calling with other types, eg. integer literals
int main() {
// now we see which is which ...
allocate_things(40_buffers, 22_pages_per_buffer);
// the following does not compile (see the 'deleted' function):
// allocate_things(40, 22);
// allocate_things(40, 22_pages_per_buffer);
// allocate_things(22_pages_per_buffer, 40_buffers);
}
Two good answers so far, one more: another approach would be to try to leverage the type system wherever possible and create strong typedefs. For instance, using boost strong typedef (http://www.boost.org/doc/libs/1_61_0/libs/serialization/doc/strong_typedef.html).
BOOST_STRONG_TYPEDEF(int , num_buffers);
BOOST_STRONG_TYPEDEF(int , num_pages);
void func(num_buffers b, num_pages p);
Calling func with arguments in the wrong order would now be a compile error.
A couple of notes on this. First, boost's strong typedef is rather dated in its approach; you can do much nicer things with variadic CRTP and avoid macros completely. Second, this obviously introduces some overhead, as you often have to convert explicitly, so generally you don't want to overuse it. It's really nice for things that come up over and over again in your library, not so good for things that come up as a one-off. So for instance, if you are writing a GPS library, you should have a strong double typedef for distances in metres, a strong int64 typedef for time past epoch in nanoseconds, and so on.
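A minimal macro-free version of the idea might look like this (a sketch; boost's version also generates comparisons and conversions):
#include <cstdio>
// A tagged wrapper makes each parameter its own type; the otherwise-unused
// Tag parameter distinguishes otherwise-identical instantiations.
template<typename T, typename Tag>
struct Strong {
    explicit Strong(T v) : value(v) {}
    T value;
};
using NumBuffers = Strong<int, struct NumBuffersTag>;
using NumPages   = Strong<int, struct NumPagesTag>;
void func(NumBuffers b, NumPages p) {
    std::printf("%d buffers, %d pages\n", b.value, p.value);
}
int main() {
    func(NumBuffers(40), NumPages(22));
    // func(NumPages(22), NumBuffers(40)); // does not compile: wrong order
}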
(Note: this post was originally tagged 'C'.)
C99 onwards allows an extension to @Dietrich Epp's idea: compound literals.
struct things {
int num_buffers;
int pages_per_buffer;
int default_value;
};
void allocate_things(struct things);
// Use a compound literal
allocate_things((struct things){.default_value=80, .num_buffers=40, .pages_per_buffer=22});
Could even pass the address of the structure.
void allocate_things(struct things *);
// Use a compound literal
allocate_things(&((struct things){.default_value=80,.num_buffers=40,.pages_per_buffer=22}));
You can't. That's why it is recommended to have as few function arguments as possible.
In your example you could have separate functions like set_num_buffers(int num_buffers), set_pages_per_buffer(int pages_per_buffer) etc.
You probably have noticed yourself that allocate_things is not a good name because it doesn't express what the function is actually doing. Especially I would not expect it to set a default value.
Just for completeness, you could use named arguments, where your call becomes:
allocate_things(num_buffers=20, pages_per_buffer=40, default_value=20);
// or equivalently
allocate_things(pages_per_buffer=40, default_value=20, num_buffers=20);
However, with current C++ this requires quite a bit of code to be implemented (in the header file declaring allocate_things(), which must also declare appropriate external objects num_buffers etc. providing an operator= that returns a unique suitable object).
---------- working example (for sergej)
#include <iostream>
struct a_t { int x=0; a_t(int i): x(i){} };
struct b_t { int x=0; b_t(int i): x(i){} };
struct c_t { int x=0; c_t(int i): x(i){} };
// implement using all possible permutations of the arguments.
// for many more arguments, better use a variadic template.
void func(a_t a, b_t b, c_t c)
{ std::cout<<"a="<<a.x<<" b="<<b.x<<" c="<<c.x<<std::endl; }
inline void func(b_t b, c_t c, a_t a) { func(a,b,c); }
inline void func(c_t c, a_t a, b_t b) { func(a,b,c); }
inline void func(a_t a, c_t c, b_t b) { func(a,b,c); }
inline void func(c_t c, b_t b, a_t a) { func(a,b,c); }
inline void func(b_t b, a_t a, c_t c) { func(a,b,c); }
struct make_a { a_t operator=(int i) { return {i}; } } a;
struct make_b { b_t operator=(int i) { return {i}; } } b;
struct make_c { c_t operator=(int i) { return {i}; } } c;
int main()
{
func(b=2, c=10, a=42);
}
Are you really going to try to QA all the combinations of arbitrary integers? And throw in all the checks for negative/zero values etc?
Just create two enum types, one for minimum, medium and maximum number of buffers, and one for small, medium and large buffer sizes. Then let the compiler do the work and let your QA folks take an afternoon off:
allocate_things(MINIMUM_BUFFER_CONFIGURATION, LARGE_BUFFER_SIZE, 42);
Then you only have to test a limited number of combinations and you'll have 100% coverage. The people working on your code 5 years from now will only need to know what they want to achieve and not have to guess the numbers they might need or which values have actually been tested in the field.
It does make the code slightly harder to extend, but it sounds like the parameters are for low-level performance tuning, so twiddling the values should not be perceived as cheap/trivial/not needing thorough testing. A code review of a change from
allocate_something(25, 25, 25);
...to
allocate_something(30, 80, 42);
...will likely get just a shrug/blown off, but a code review of a new enum value EXTRA_LARGE_BUFFERS will likely trigger all the right discussions about memory use, documentation, performance testing etc.
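In C++11 you can tighten this further with scoped enums, which convert neither from integers nor from each other (a sketch; names invented):
// Scoped enums make both the values and the parameter slots type-safe.
enum class BufferConfig { Minimum, Medium, Maximum };
enum class BufferSize   { Small, Medium, Large };
void allocate_things(BufferConfig config, BufferSize size, int default_value);
// allocate_things(BufferConfig::Minimum, BufferSize::Large, 42); // OK
// allocate_things(BufferSize::Large, BufferConfig::Minimum, 42); // compile error
// allocate_things(0, 1, 42);                                     // compile error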
I have the following structures that I need to work with, and I have to write my own wrapper around them for an assignment. I am going beyond the assignment (since I am going to have to use this wrapper I make) and want to overload the subscript operator in my wrapper in order to use it with the double-pointer array. What I mean in code is this:
What I have:
From given header for library:
typedef struct { // A pixel stores 3 bytes of data:
byte red; // intensity of the red component
byte green; // intensity of the green component
byte blue; // intensity of the blue component
} pixel;
typedef struct {
int rows, cols; /* pic size */
pixel **pixels; /* image data */
} image;
My class (of course included in header):
pixels& MyWrapper::operator[] (const int nIndex) {
return Image.pixels[nIndex]; // where Image is of type image
}
Of course this won't work, since the double pointer returns a pointer, which is not what I'm telling it to return, yet returning *pixels& doesn't return it either. Just to sate my curiosity and to help me understand why this isn't possible, could someone tell me how this would be implemented, if it can be at all, and why it is that way? Keep in mind I don't understand pointers very well yet (I know the basics of how they work, but that's it), and I am hoping to use this to broaden my understanding.
This is more typical for C++:
#include <vector>
namespace AA {
class t_point {
public:
t_point(const size_t& X, const size_t& Y) : d_x(X), d_y(Y) {
}
const size_t& x() const { return this->d_x; }
const size_t& y() const { return this->d_y; }
private:
size_t d_x;
size_t d_y;
};
class t_array {
public:
// abusive, but possible. prefer `at`
const int& operator[](const t_point& idx) const {
return this->at(idx.x(), idx.y());
}
const int& at(const t_point& idx) const {
return this->at(idx.x(), idx.y());
}
const int& at(const size_t& x, const size_t& y) const {
return this->d_objects[x][y];
}
private:
// or use your c image representation...
std::vector<std::vector<int> > d_objects;
private:
static const int& ctest(const t_array& arr) {
const t_point pt(1, 2);
// only the first return executes; the rest just show the alternatives
return arr[pt];
return arr.at(pt);
return arr.at(pt.x(), pt.y());
}
};
}
The big problem with using one index in this case is that it is not clear which index (pixel) you are attempting to access, while pushing all the coordinate calculations off to the client. If it were a single pointer, you'd still push the problem onto the client, but you'd be able to access the index predictably.
With a double pointer, the layout in memory can vary; it is not necessarily contiguous. It's just a bad design to publish it as a single index (logically, as a 1D array) rather than as a 2D array or a point (for example).
It's not clear why you are using double indirection in the first place.
If pixels is a double pointer to an array of pixels, you can do
pixel& MyWrapper::operator[] (const int nIndex) {
return (*Image.pixels)[nIndex]; // where Image is of type image
}
If pixels is a pointer to an array of pointers to arrays, then you need two indices:
pixel& MyWrapper::operator() ( int xIndex, int yIndex ) {
return Image.pixels[yIndex][xIndex]; // where Image is of type image
}
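And if you specifically want the wrapper[y][x] spelling with the pixel** representation, one common trick (a sketch against the structs in the question) is to have operator[] return the row pointer and let the built-in subscript supply the second index:
// pixels[yIndex] is a pixel*, i.e. one row; the second [] is then plain
// pointer indexing. No bounds checking, just as with raw arrays.
pixel* MyWrapper::operator[] (int yIndex) {
    return Image.pixels[yIndex];
}
// usage:
// pixel p = myWrapper[2][3]; // row 2, column 3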
There are a few weird things going on here.
typedef class { } identifier is not good C++. Use class identifier { };, or else the class has no name, so you cannot define member functions outside the class { } scope. (Among other problems.)
There is no reason to make a parameter type const int. Plain int accomplishes the same thing.
There is no apparent reason for the double indirection. Typically in C++ we avoid the direct use of pointers. There is probably a prepackaged standard structure you can use instead.
Use boost::multi_array
Suppose you have a function, and you call it many times, and every time the function returns a big object. I've optimized the problem using a functor that returns void and stores the return value in a public member:
#include <vector>
const int N = 100;
std::vector<double> fun(const std::vector<double> & v, const int n)
{
std::vector<double> output = v;
output[n] *= output[n];
return output;
}
class F
{
public:
F() : output(N) {};
std::vector<double> output;
void operator()(const std::vector<double> & v, const int n)
{
output = v;
output[n] *= output[n];
}
};
int main()
{
std::vector<double> start(N,10.);
std::vector<double> end(N);
double a;
// first solution
for (unsigned long int i = 0; i != 10000000; ++i)
a = fun(start, 2)[3];
// second solution
F f;
for (unsigned long int i = 0; i != 10000000; ++i)
{
f(start, 2);
a = f.output[3];
}
}
Yes, I could inline it or optimize this in some other way, but here I want to stress this point: with the functor I declare and construct the output variable output only once, whereas with the function I do that every time it is called. The second solution is twice as fast as the first with g++ -O1 or g++ -O2. What do you think about it - is it an ugly optimization?
Edit:
to clarify my aim: I have to evaluate the function more than 10M times, but I only need the output a few random times. It's important that the input is not changed; in fact, I declared it as a const reference. In this example the input is always the same, but in the real world it changes and is a function of the previous outputs of the function.
A more common scenario is to create an object with a large enough reserved size outside the function, and pass that large object to the function by pointer or by reference. You can then reuse this object across several calls to your function, thereby reducing continual memory allocation.
In both cases you are allocating a new vector many, many times.
What you should do is to pass both input and output objects to your class/function:
void fun(const std::vector<double> & in, const int n, std::vector<double> & out)
{
out[n] *= in[n];
}
This way you separate your logic from the algorithm. You create a new std::vector once and can pass it to the function as many times as you want. Notice that there is no unnecessary copy/allocation made.
P.S. It's been a while since I did C++; this may not compile right away.
It's not an ugly optimization. It's actually a fairly decent one.
I would, however, hide output and make an operator[] member to access its members. Why? Because you just might be able to perform a lazy evaluation optimization by moving all the math to that function, thus only doing that math when the client requests that value. Until the user asks for it, why do it if you don't need to?
Edit:
Just checked the standard: the behavior of the assignment operator is defined in terms of insert(), and the notes for that function state that an allocation occurs if the new size exceeds the current capacity. That does not seem to explicitly disallow an implementation from reallocating even when it doesn't have to, but I'm pretty sure you'll find none that do, and I'm sure the standard says something about it somewhere else. Thus you've improved speed by removing allocation calls.
You should still hide the internal vector. You'll have a better chance of changing the implementation later if you use encapsulation. You could also return a reference (maybe const) to the vector from the function and retain the original syntax.
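Concretely, that last suggestion might look like this (a sketch based on the OP's F):
#include <vector>
// Same reuse-the-buffer optimization, but the buffer is private and the call
// returns a const reference, so the call site keeps value-like syntax with
// no copy.
class F {
public:
    explicit F(std::size_t n) : output(n) {}
    const std::vector<double>& operator()(const std::vector<double>& v, int n) {
        output = v;             // reuses existing capacity; no reallocation
        output[n] *= output[n];
        return output;
    }
private:
    std::vector<double> output;
};
// usage:
// F f(N);
// double a = f(start, 2)[3]; // original call syntax preserved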
I played with this a bit, and came up with the code below. I keep thinking there's a better way to do this, but it's escaping me for now.
The key differences:
I'm allergic to public member variables, so I made output private, and put getters around it.
Having the operator return void isn't necessary for the optimization, so I have it return the value as a const reference so we can preserve return value semantics.
I took a stab at generalizing the approach into a templated base class, so you can then define derived classes for a particular return type, and not re-define the plumbing. This assumes the object you want to create takes a one-arg constructor, and the function you want to call takes in one additional argument. I think you'd have to define other templates if this varies.
Enjoy...
#include <vector>
template<typename T, typename ConstructArg, typename FuncArg>
class ReturnT
{
public:
ReturnT(ConstructArg arg): output(arg){}
virtual ~ReturnT() {}
const T& operator()(const T& in, FuncArg arg)
{
output = in;
this->doOp(arg);
return this->getOutput();
}
const T& getOutput() const {return output;}
protected:
T& getOutput() {return output;}
private:
virtual void doOp(FuncArg arg) = 0;
T output;
};
class F : public ReturnT<std::vector<double>, std::size_t, const int>
{
public:
F(std::size_t size) : ReturnT<std::vector<double>, std::size_t, const int>(size) {}
private:
virtual void doOp(const int n)
{
this->getOutput()[n] *= this->getOutput()[n];
}
};
int main()
{
const int N = 100;
std::vector<double> start(N,10.);
double a;
// second solution
F f(N);
for (unsigned long int i = 0; i != 10000000; ++i)
{
a = f(start, 2)[3];
}
}
It seems quite strange (I mean the need for optimization at all) - I think that a decent compiler should perform return value optimization in such cases. Maybe all you need is to enable it.