CUDA template kernel wrapper - c++

I started writing a simulation and decided to try a more object-oriented approach. As part of that, I decided to use a template parameter in the CUDA kernel to indicate the spatial dimension of the simulation. The problem is that, because template functions normally have to be implemented in header files, I had to use a complicated approach to keep the kernel wrapper callable from .cpp source files.
My approach was to overload the wrapper function for 2 and 3 dimensions. I then have a wrapper class which deals with initializing and managing the kernel resources. Unfortunately, because of the restriction I mentioned, I have to keep two members for the template classes, i.e.
struct KernelWrapper{
    KernelWrapper(Simulation<2> *simulation):
        d_(2),
        simulation2d_(simulation)
    {}
    KernelWrapper(Simulation<3> *simulation):
        d_(3),
        simulation3d_(simulation)
    {}
    void process(void){ //wrapper function for kernel launching
        switch(d_){
        case 2:
            kernel<2><<<..., ...>>>(...);
            break;
        case 3:
            kernel<3><<<..., ...>>>(...);
            break;
        default:
            break;
        }
    }
    int d_;
    union{
        Simulation<2> *simulation2d_;
        Simulation<3> *simulation3d_;
    };
    union{
        Lattice<2> *lattice2d_d;
        Lattice<3> *lattice3d_d;
    };
};
I was thus wondering if you know of a better way to achieve what I'm trying to do, that is, to make a wrapper for a template CUDA kernel.
UPDATE: I'd like to add one more solution I found after making the above post. As indicated by the C++ FAQ (points 13-15), one can put the template implementation in a source file and explicitly instantiate the templates that are needed, i.e. in my case for 2 and 3 dimensions. Using C++11, one can take this a step further and add an extern template declaration to the header, which saves some compile/link time, as also explained here.
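A minimal sketch of the explicit-instantiation idea from the update, written in plain C++ so it compiles without nvcc. The names `launch` and `cellsPerSide` and the two file names are hypothetical; in the real project the template body would contain the kernel launch and live in a .cu file, while the header would carry only the declarations.

```cpp
#include <cassert>

// --- would live in kernel_wrapper.h (hypothetical file name) ---
// Declaration only: callers in .cpp files never see the body.
template <int D>
int launch(int cellsPerSide);

// C++11 explicit instantiation declarations: these instantiations are
// provided elsewhere, so including files do not instantiate them again.
extern template int launch<2>(int);
extern template int launch<3>(int);

// --- would live in kernel_wrapper.cu (hypothetical file name) ---
// Definition; the real code would launch kernel<D> here instead.
template <int D>
int launch(int cellsPerSide) {
    int cells = 1;
    for (int i = 0; i < D; ++i) cells *= cellsPerSide;  // cellsPerSide^D
    return cells;
}

// Explicit instantiation definitions: only D = 2 and D = 3 are emitted.
template int launch<2>(int);
template int launch<3>(int);
```

A .cpp file that includes only the header part can then call `launch<2>(n)` or `launch<3>(n)`; asking for any other dimension fails at link time, which is usually the desired behaviour here.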

"The problem is, because of the restriction of implementing template functions in the header files, I had to use a complicated approach to keep the kernel wrapper callable from .cpp source files."
It is legal to write template code in a .cpp file.
Whether KernelWrapper is in a .hpp or a .cpp file, you should have code that looks like this:
template<int d_>
struct KernelWrapper
{
    KernelWrapper(Simulation<d_> *simulation) : simulation_(simulation)
    {}
    void process(void)
    {
        kernel<d_><<<..., ...>>>(...);
    }
    Simulation<d_>* simulation_;
    Lattice<d_>* lattice_;
};
Also avoid using switch/case to select a kernel, use something like:
int const max_dimension = 4;
template<int static_dimension>
void select_kernel(int dynamic_dimension)
{
    if(dynamic_dimension == static_dimension)
    {
        call_kernel<static_dimension>();
        return; // stop once the matching kernel has been launched
    }
    select_kernel<static_dimension+1>(dynamic_dimension);
}
// terminates the recursion: no supported dimension matched
template<>
void select_kernel<max_dimension>(int dynamic_dimension)
{
    // error message
}
void select_kernel(int dynamic_dimension)
{
    select_kernel<1>(dynamic_dimension);
}
If such a selection happens frequently, it makes sense not to use templates.
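To see the recursion in action, here is a self-contained sketch in which `call_kernel` is a stand-in that only records the dimension it was instantiated for (the real one would launch the CUDA kernel). The `last_launched` variable, the `_impl` suffix and the `bool` return value are assumptions added for the demonstration. The function returns as soon as a match is found, so the terminating specialization is reached only for unsupported dimensions.

```cpp
#include <cassert>

const int max_dimension = 4;
int last_launched = 0;  // hypothetical bookkeeping, for demonstration only

// Stand-in for the real kernel launcher.
template <int D>
void call_kernel() { last_launched = D; }

// Primary template: compare against D, otherwise try D + 1.
template <int D>
bool select_kernel_impl(int dynamic_dimension) {
    if (dynamic_dimension == D) {
        call_kernel<D>();
        return true;
    }
    return select_kernel_impl<D + 1>(dynamic_dimension);
}

// Specialization that ends the recursion: nothing matched.
template <>
bool select_kernel_impl<max_dimension>(int) { return false; }

// Runtime entry point; returns false for unsupported dimensions.
bool select_kernel(int dynamic_dimension) {
    return select_kernel_impl<1>(dynamic_dimension);
}
```

With `max_dimension = 4`, the dimensions 1, 2 and 3 are dispatched and everything else is reported as unsupported.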

Related

c++ : How to make a specific binary (executable) for each trait?

I am working on a project that has a lot of trait types. If I compile every trait in the same code base, the released binary would be quite big.
I am thinking about using a macro to build a binary for each specific trait; from a business logic perspective, this makes perfect sense.
However, I realized that, if I want to cut down the code base, I need to have this long if/elif pile at the end of each template cpp file. This sounds very tedious.
I am wondering if you have encountered this kind of problem before. What's the neatest solution here?
#include "MyTraits.hpp"
#include "Runner.hpp"
int main(){
#if defined USE_TRAIT_1
Runner<Trait1> a;
#elif defined USE_TRAIT_2
Runner<Trait2> a;
#elif defined USE_TRAIT_3
Runner<Trait3> a;
#endif
return 0;
}
If you want to explicitly instantiate templates in specific compilation units, you should use an extern template declaration.
// Runner.hpp
// define your template class
template <class runner_trait>
class Runner {
...
};
// This tells the compiler not to instantiate the template
// when it is encountered, but to link to an instantiation in another compilation unit.
// If none is found, you will get a linker error.
extern template class Runner<Trait1>;
extern template class Runner<Trait2>;
extern template class Runner<Trait3>;
// Runner_trait1.cpp
#include "Runner.hpp"
// "template class" tells the compiler to instantiate the template in this compilation unit.
template class Runner<Trait1>;
// The files Runner_trait2.cpp and Runner_trait3.cpp look identical,
// except for the trait after Runner
I'm not entirely sure that my answer will address the root cause of the problem, but the proposed solution may at least look a bit more "neat".
The basic idea behind the proposal by @JeffCharter makes sense, but I don't like the idea of embedding code (in this case type names) into the makefile. So I elaborated on it a bit, with the following goals in mind:
Make the contents of main() short and easy to understand
Avoid polluting the makefile
Avoid macros as far as possible
I ended up with the following solution, which requires a single numeric macro that can be defined in the makefile. Be aware that it uses C++17's constexpr if, so if you find it useful, make sure your compiler supports it.
constexpr int traitID = TRAIT_ID; // TRAIT_ID is a macro defined somewhere else.
template <typename T>
struct Wrapped // helper struct
{
using Type = T;
};
auto trait()
{
// Although it may look not that different from macros, the main difference
// is that here all the code below gets compiled.
if constexpr (traitID == 1)
return Wrapped<Trait1>{};
else if constexpr (traitID == 2)
return Wrapped<Trait2>{};
else if constexpr (traitID == 3)
return Wrapped<Trait3>{};
// add more cases if necessary
}
int main() // the contents of 'main' seems to have become more readable
{
using Trait = decltype(trait())::Type;
Runner<Trait> a;
return 0;
}
Also, here's a live example at Coliru.
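For compilers without C++17, roughly the same mapping from a numeric ID to a trait type can be expressed with explicit specializations. This is a sketch rather than part of the answer above: `TraitSelector`, the `name()` members, and the stand-in definitions of `Trait1` to `Trait3` and `Runner` are assumptions made so the example compiles on its own.

```cpp
#include <cassert>
#include <string>

// Stand-ins for the real trait types and Runner (assumed for this sketch).
struct Trait1 { static std::string name() { return "Trait1"; } };
struct Trait2 { static std::string name() { return "Trait2"; } };
struct Trait3 { static std::string name() { return "Trait3"; } };

template <class Trait>
struct Runner {
    std::string run() const { return Trait::name(); }
};

// The primary template is left undefined, so an unknown ID is a compile error.
template <int Id> struct TraitSelector;
template <> struct TraitSelector<1> { typedef Trait1 Type; };
template <> struct TraitSelector<2> { typedef Trait2 Type; };
template <> struct TraitSelector<3> { typedef Trait3 Type; };

#ifndef TRAIT_ID
#define TRAIT_ID 2  // normally supplied by the build, e.g. -DTRAIT_ID=2
#endif

typedef TraitSelector<TRAIT_ID>::Type SelectedTrait;
```

`main()` then only needs `Runner<SelectedTrait> a;`. As with the constexpr if version, every trait is still parsed, but only the selected `Runner` is instantiated.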

Efficient configuration of class hierarchy at compile-time

This question is specifically about C++ architecture on embedded, hard real-time systems. This implies that large parts of the data-structures as well as the exact program-flow are given at compile-time, performance is important and a lot of code can be inlined. Solutions preferably use C++03 only, but C++11 inputs are also welcome.
I am looking for established design-patterns and solutions to the architectural problem where the same code-base should be re-used for several, closely related products, while some parts (e.g. the hardware-abstraction) will necessarily be different.
I will likely end up with a hierarchical structure of modules encapsulated in classes that might then look somehow like this, assuming 4 layers:
Product A        Product B
Toplevel_A       Toplevel_B       (different for A and B, but with common parts)
Middle_generic   Middle_generic   (same for A and B)
Sub_generic      Sub_generic      (same for A and B)
Hardware_A       Hardware_B       (different for A and B)
Here, some classes inherit from a common base class (e.g. Toplevel_A from Toplevel_base) while others do not need to be specialized at all (e.g. Middle_generic).
Currently I can think of the following approaches:
(A): If this was a regular desktop application, I would use inheritance with virtual functions and create the instances at run-time, using e.g. an Abstract Factory.
Drawback: The *_B classes will never be used in product A, so dereferencing all the virtual function calls and members whose addresses are only resolved at run-time adds considerable overhead for nothing.
(B) Using template specialization as inheritance mechanism (e.g. CRTP)
template<class Derived>
class Toplevel { /* generic stuff ... */ };
class Toplevel_A : public Toplevel<Toplevel_A> { /* specific stuff ... */ };
Drawback: Hard to understand.
(C): Use different sets of matching files and let the build-scripts include the right one
// common/toplevel_base.h
class Toplevel_base { /* ... */ };
// product_A/toplevel.h
class Toplevel : Toplevel_base { /* ... */ };
// product_B/toplevel.h
class Toplevel : Toplevel_base { /* ... */ };
// build_script.A
compiler -Icommon -Iproduct_A
Drawback: Confusing, tricky to maintain and test.
(D): One big typedef (or #define) file
//typedef_A.h
typedef Toplevel_A Toplevel_to_be_used;
typedef Hardware_A Hardware_to_be_used;
// etc.
// sub_generic.h
class sub_generic {
Hardware_to_be_used the_hardware;
// etc.
};
Drawback: One file has to be included everywhere, and another mechanism is still needed to actually switch between different configurations.
(E): A similar, "Policy based" configuration, e.g.
template <class Policy>
class Toplevel {
Middle_generic<Policy> the_middle;
// ...
};
// ...
template <class Policy>
class Sub_generic {
typename Policy::Hardware_to_be_used the_hardware;
// ...
};
// used as
struct Policy_A {
typedef Hardware_A Hardware_to_be_used;
};
Toplevel<Policy_A> the_toplevel;
Drawback: Everything is a template now; a lot of code needs to be re-compiled every time.
(F): Compiler switch and preprocessor
// sub_generic.h
class Sub_generic {
#if PRODUCT_IS_A
Hardware_A _hardware;
#endif
#if PRODUCT_IS_B
Hardware_B _hardware;
#endif
};
Drawback: Brrr..., only if all else fails.
Is there any (other) established design-pattern or a better solution to this problem, such that the compiler can statically allocate as many objects as possible and inline large parts of the code, knowing which product is being built and which classes are going to be used?
I'd go for A. Until it's PROVEN that this is not good enough, make the same decisions as you would on the desktop (within reason: allocating several kilobytes on the stack, or using global variables that are many megabytes large, is obviously not going to work). Yes, there is SOME overhead in calling virtual functions, but I would go for the most obvious and natural C++ solution FIRST, then redesign if it's not "good enough". Measure performance early on, and use tools like a sampling profiler to determine where you are spending time rather than guessing; humans are notoriously poor at guessing where the time goes.
I'd then move to option B if A is proven to not work. This is indeed not entirely obvious, but it is, roughly, how LLVM/Clang solves this problem for combinations of hardware and OS, see:
https://github.com/llvm-mirror/clang/blob/master/lib/Basic/Targets.cpp
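For readers unfamiliar with option B, a minimal CRTP sketch follows. The member names (`run`, `init_hardware`) and the returned numbers are invented for illustration. The point is that the base class calls into the derived class through a `static_cast`, so the call is resolved at compile time and can be inlined, with no vtable involved.

```cpp
#include <cassert>

template <class Derived>
class Toplevel {
public:
    // Generic part shared by all products.
    int run() {
        // Statically dispatched to the product-specific class: no virtual call.
        return static_cast<Derived*>(this)->init_hardware() + 100;
    }
};

class Toplevel_A : public Toplevel<Toplevel_A> {
public:
    int init_hardware() { return 1; }  // product A specifics
};

class Toplevel_B : public Toplevel<Toplevel_B> {
public:
    int init_hardware() { return 2; }  // product B specifics
};
```

Only the `Toplevel_X` actually instantiated in a given product ends up in the binary, and the sketch stays within C++03.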
First I would like to point out that you basically answered your own question in the question :-)
Next I would like to point out that in C++, "the exact program-flow are given at compile-time, performance is important and a lot of code can be inlined" is called templates. The other approaches that leverage language features as opposed to build system features will serve only as a logical way of structuring the code in your project to the benefit of developers.
Further, as noted in other answers, C is more common than C++ for hard real-time systems, and in C it is customary to rely on macros to make this kind of optimization at compile time.
Finally, you have noted under your B solution above that template specialization is hard to understand. I would argue that this depends on how you do it and also on how much experience your team has with C++ and templates. I find many "template ridden" projects to be extremely hard to read and the error messages they produce to be unholy at best, but I still manage to make effective use of templates in my own projects because I respect the KISS principle while doing it.
So my answer to you is: go with B or ditch C++ for C.
I understand that you have two important requirements :
Data types are known at compile time
Program-flow is known at compile time
The CRTP wouldn't really address the problem you are trying to solve as it would allow the HardwareLayer to call methods on the Sub_generic, Middle_generic or TopLevel and I don't believe it is what you are looking for.
Both of your requirements can be met using the Trait pattern (another reference). Here is an example proving both requirements are met. First, we define empty shells representing two Hardwares you might want to support.
class Hardware_A {};
class Hardware_B {};
Then let's consider a class that describes a general case which corresponds to Hardware_A.
template <typename Hardware>
class HardwareLayer
{
public:
typedef long int64_t;
static int64_t getCPUSerialNumber() {return 0;}
};
Now let's see a specialization for Hardware_B :
template <>
class HardwareLayer<Hardware_B>
{
public:
typedef int int64_t;
static int64_t getCPUSerialNumber() {return 1;}
};
Now, here is a usage example within the Sub_generic layer :
template <typename Hardware>
class Sub_generic
{
public:
typedef HardwareLayer<Hardware> HwLayer;
typedef typename HwLayer::int64_t int64_t;
int64_t doSomething() {return HwLayer::getCPUSerialNumber();}
};
And finally, a short main that executes both code paths and uses both data types:
#include <iostream>
int main(int argc, const char * argv[]) {
    std::cout << "Hardware_A : " << Sub_generic<Hardware_A>().doSomething() << std::endl;
    std::cout << "Hardware_B : " << Sub_generic<Hardware_B>().doSomething() << std::endl;
}
Now if your HardwareLayer needs to maintain state, here is another way to implement the HardwareLayer and Sub_generic layer classes.
template <typename Hardware>
class HardwareLayer
{
public:
typedef long hwint64_t;
hwint64_t getCPUSerialNumber() {return mySerial;}
private:
hwint64_t mySerial = 0;
};
template <>
class HardwareLayer<Hardware_B>
{
public:
typedef int hwint64_t;
hwint64_t getCPUSerialNumber() {return mySerial;}
private:
hwint64_t mySerial = 1;
};
template <typename Hardware>
class Sub_generic : public HardwareLayer<Hardware>
{
public:
typedef HardwareLayer<Hardware> HwLayer;
typedef typename HwLayer::hwint64_t hwint64_t;
hwint64_t doSomething() {return HwLayer::getCPUSerialNumber();}
};
And here is a last variant where only the Sub_generic implementation changes :
template <typename Hardware>
class Sub_generic
{
public:
typedef HardwareLayer<Hardware> HwLayer;
typedef typename HwLayer::hwint64_t hwint64_t;
hwint64_t doSomething() {return hw.getCPUSerialNumber();}
private:
HwLayer hw;
};
On a similar train of thought to F, you could just have a directory layout like this:
Hardware/
common/inc/hardware.h
hardware1/src/hardware.cpp
hardware2/src/hardware.cpp
Simplify the interface to only assume a single hardware exists:
// sub_generic.h
class Sub_generic {
Hardware _hardware;
};
And then only compile the folder that contains the .cpp files for the hardware for that platform.
The benefits to this approach are:
It's simple to understand what's happening and to add a hardware3
hardware.h still serves as your API
It takes away the abstraction from the compiler (for your speed concerns)
Compiler 1 doesn't need to compile hardware2.cpp or hardware3.cpp which may contain things Compiler 1 can't do (like inline assembly, or some other specific Compiler 2 thing)
hardware3 might be much more complicated for some reason you haven't considered yet.. so giving it a whole directory structure encapsulates it.
Since this is for a hard real-time embedded system, you would usually go for a C type of solution, not C++.
With modern compilers I'd say that the overhead of C++ is not that great, so it's not entirely a matter of performance, but embedded systems tend to prefer C over C++.
What you are trying to build would resemble a classic device driver library (like the one for FTDI chips).
The approach there would be (since it's written in C) something similar to your F, but with no compile-time options: you would specialize the code at runtime, based on something like PID, VID, SN, etc.
Now if you want to use C++ for this, templates should probably be your last option (code readability usually ranks higher than any advantage templates bring to the table). So you would probably go for something similar to A: a basic class inheritance scheme, with no particularly fancy design pattern required.
Hope this helps...
I am going to assume that these classes only need to be created a single time, and that their instances persist throughout the entire program run time.
In this case I would recommend using the Object Factory pattern since the factory will only get run one time to create the class. From that point on the specialized classes are all a known type.
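A rough sketch of that idea, using the hardware class names from the question plus an invented `Hardware_base` interface, `ProductId` enum and `make_hardware` factory. The factory is meant to run once at start-up; afterwards the rest of the code only holds a reference to the object it returned. Note that calls through the reference are still virtual, which is the cost the question was worried about.

```cpp
#include <cassert>

enum ProductId { PRODUCT_A, PRODUCT_B };  // hypothetical product selector

class Hardware_base {  // assumed common interface
public:
    virtual ~Hardware_base() {}
    virtual int serial() const = 0;
};

class Hardware_A : public Hardware_base {
public:
    int serial() const { return 0; }
};

class Hardware_B : public Hardware_base {
public:
    int serial() const { return 1; }
};

// Factory using function-local statics, so no heap allocation is needed;
// each object is constructed the first time its branch is taken.
Hardware_base& make_hardware(ProductId id) {
    if (id == PRODUCT_A) {
        static Hardware_A a;
        return a;
    }
    static Hardware_B b;
    return b;
}
```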

C++ handling specific impl - #ifdef vs private inheritance vs tag dispatch

I have some classes implementing some computations which I have
to optimize for different SIMD implementations, e.g. Altivec and
SSE. I don't want to pollute the code with #ifdef ... #endif blocks
for each method I have to optimize, so I tried a couple of other
approaches, but unfortunately I'm not very satisfied with how they turned
out, for reasons I'll try to clarify. So I'm looking for some advice
on how I could improve what I have already done.
1. Different implementation files with crude includes
I have the same header file describing the class interface with different
"pseudo" implementation files for plain C++, Altivec and SSE only for the
relevant methods:
// Algo.h
#ifndef ALGO_H_INCLUDED_
#define ALGO_H_INCLUDED_
class Algo
{
public:
Algo();
~Algo();
void process();
protected:
void computeSome();
void computeMore();
};
#endif
// Algo.cpp
#include "Algo.h"
Algo::Algo() { }
Algo::~Algo() { }
void Algo::process()
{
computeSome();
computeMore();
}
#if defined(ALTIVEC)
#include "Algo_Altivec.cpp"
#elif defined(SSE)
#include "Algo_SSE.cpp"
#else
#include "Algo_Scalar.cpp"
#endif
// Algo_Altivec.cpp
void Algo::computeSome()
{
}
void Algo::computeMore()
{
}
... same for the other implementation files
Pros:
the split is quite straightforward and easy to do
there is no "overhead"(don't know how to say it better) to objects of my class
by which I mean no extra inheritance, no addition of member variables etc.
much cleaner than #ifdef-ing all over the place
Cons:
I have three additional files for maintenance; I could put the Scalar
implementation in the Algo.cpp file though and end up with just two, but the
inclusion part will look and feel a bit dirtier
they are not compilable units per se and have to be excluded from the
project structure
if I do not have the specific optimized implementation yet for, let's say,
SSE, I would have to duplicate some code from the plain (Scalar) C++ implementation file
I cannot fall back to the plain C++ implementation if needed; is it even possible
to do that in the described scenario?
I do not feel any structural cohesion in the approach
2. Different implementation files with private inheritance
// Algo.h
class Algo : private AlgoImpl
{
... as before
};
// AlgoImpl.h
#ifndef ALGOIMPL_H_INCLUDED_
#define ALGOIMPL_H_INCLUDED_
class AlgoImpl
{
protected:
AlgoImpl();
~AlgoImpl();
void computeSomeImpl();
void computeMoreImpl();
};
#endif
// Algo.cpp
...
void Algo::computeSome()
{
computeSomeImpl();
}
void Algo::computeMore()
{
computeMoreImpl();
}
// Algo_SSE.cpp
AlgoImpl::AlgoImpl()
{
}
AlgoImpl::~AlgoImpl()
{
}
void AlgoImpl::computeSomeImpl()
{
}
void AlgoImpl::computeMoreImpl()
{
}
Pros:
the split is quite straightforward and easy to do
much cleaner than #ifdef-ing all over the place
still there is no "overhead" to my class - EBCO should kick in
the semantics of the class are much cleaner, at least compared to the above;
that is, private inheritance == is implemented in terms of
the different files are compilable, can be included in the project
and selected via the build system
Cons:
I have three additional files for maintenance
if I do not have the specific optimized implementation yet for let's say
SSE I would have to duplicate some code from the plain(Scalar) C++ implementation file
I cannot fall back to the plain C++ implementation if needed
3. Basically method 2, but with virtual functions in the AlgoImpl class. That
would allow me to overcome the duplicate implementation of plain C++ code if needed
by providing an empty implementation in the base class and overriding it in the derived class,
although I will have to disable that behavior when I actually implement the optimized
version. Also the virtual functions will bring some "overhead" to objects of my class.
4. A form of tag dispatching via enable_if<>
Pros:
the split is quite straightforward and easy to do
much cleaner than #ifdef ing all over the place
still there is no "overhead" to my class
will eliminate the need for different files for different implementations
Cons:
templates will be a bit more "cryptic" and seem to bring an unnecessary
overhead(at least for some people in some contexts)
if I do not have the specific optimized implementation yet for let's say
SSE I would have to duplicate some code from the plain(Scalar) C++ implementation
I cannot fallback to the plain C++ implementation if needed
What I couldn't figure out yet for any of the variants is how to properly and
cleanly fallback to the plain C++ implementation.
Also I don't want to over-engineer things and in that respect the first variant
seems the most "KISS" like even considering the disadvantages.
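One possible way to get the fallback with tag dispatching (option 4) is to let the tags inherit from each other. This is only a sketch under assumptions: the tag names, the `computeSomeFor`/`computeMoreFor` helpers and the returned strings are made up for illustration. An optimized overload is picked when it exists; otherwise overload resolution converts the tag to its base and the scalar version is used.

```cpp
#include <cassert>
#include <string>

struct scalar_tag {};
struct sse_tag : scalar_tag {};      // SSE falls back to scalar
struct altivec_tag : scalar_tag {};  // Altivec falls back to scalar

#if defined(ALTIVEC)
typedef altivec_tag active_tag;
#elif defined(SSE)
typedef sse_tag active_tag;
#else
typedef scalar_tag active_tag;
#endif

class Algo {
public:
    std::string computeSome() { return computeSome(active_tag()); }
    std::string computeMore() { return computeMore(active_tag()); }

    // Exposed only so the sketch can show the dispatch for a chosen tag.
    template <class Tag> std::string computeSomeFor() { return computeSome(Tag()); }
    template <class Tag> std::string computeMoreFor() { return computeMore(Tag()); }

private:
    // Plain C++ versions: always present.
    std::string computeSome(scalar_tag) { return "scalar some"; }
    std::string computeMore(scalar_tag) { return "scalar more"; }
    // Only computeSome has an SSE version so far; computeMore(sse_tag())
    // resolves to the scalar overload through the derived-to-base conversion.
    std::string computeSome(sse_tag) { return "sse some"; }
};
```

In real code the body of an optimized overload would still need an #ifdef guard, or its own source file, because the intrinsics will not compile on other platforms; the dispatch itself needs no duplication of the scalar code.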
You could use a policy based approach with templates kind of like the way the standard library does for allocators, comparators and the like. Each implementation has a policy class which defines computeSome() and computeMore(). Your Algo class takes a policy as a parameter and defers to its implementation.
template <class policy_t>
class algo_with_policy_t {
policy_t policy_;
public:
algo_with_policy_t() { }
~algo_with_policy_t() { }
void process()
{
policy_.computeSome();
policy_.computeMore();
}
};
struct altivec_policy_t {
void computeSome();
void computeMore();
};
struct sse_policy_t {
void computeSome();
void computeMore();
};
struct scalar_policy_t {
void computeSome();
void computeMore();
};
// let user select exact implementation
typedef algo_with_policy_t<altivec_policy_t> algo_altivec_t;
typedef algo_with_policy_t<sse_policy_t> algo_sse_t;
typedef algo_with_policy_t<scalar_policy_t> algo_scalar_t;
// let user have default implementation
typedef
#if defined(ALTIVEC)
algo_altivec_t
#elif defined(SSE)
algo_sse_t
#else
algo_scalar_t
#endif
algo_default_t;
This lets you have all the different implementations defined within the same file (like solution 1) and compiled into the same program (unlike solution 1). It has no performance overheads (unlike virtual functions). You can either select the implementation at run time or get a default implementation chosen by the compile time configuration.
template <class algo_t>
void use_algo(algo_t algo)
{
algo.process();
}
void select_algo(bool use_scalar)
{
if (!use_scalar) {
use_algo(algo_default_t());
} else {
use_algo(algo_scalar_t());
}
}
As requested in the comments, here's a summary of what I did:
Set up policy_list helper template utility
This maintains a list of policies, and gives them a "runtime check" call before calling the first suitable implementation.
#include <cassert>
template <typename P, typename N=void>
struct policy_list {
static void apply() {
if (P::runtime_check()) {
P::impl();
}
else {
N::apply();
}
}
};
template <typename P>
struct policy_list<P,void> {
static void apply() {
assert(P::runtime_check());
P::impl();
}
};
Set up specific policies
These policies implement both a runtime test and an actual implementation of the algorithm in question. For my actual problem impl took another template parameter that specified what exactly it was they were implementing; here, though, the example assumes there is only one thing to be implemented. The runtime tests are cached in a static bool because for some (e.g. the Altivec one I used) the test was really slow. For others (e.g. the OpenCL one) the test is actually "is this function pointer NULL?" after one attempt at setting it with dlsym().
#include <iostream>
// runtime SSE detection (That's another question!)
extern bool have_sse();
struct sse_policy {
static void impl() {
std::cout << "SSE" << std::endl;
}
static bool runtime_check() {
static bool result = have_sse();
// have_sse lives in another TU and does some cpuid asm stuff
return result;
}
};
// Runtime OpenCL detection
extern bool have_opencl();
struct opencl_policy {
static void impl() {
std::cout << "OpenCL" << std::endl;
}
static bool runtime_check() {
static bool result = have_opencl();
// have_opencl lives in another TU and does some LoadLibrary or dlopen()
return result;
}
};
struct basic_policy {
static void impl() {
std::cout << "Standard C++ policy" << std::endl;
}
static bool runtime_check() { return true; } // All implementations do this
};
Set per architecture policy_list
This trivial example sets one of two possible lists based on the ARCH_HAS_SSE preprocessor macro. You might generate this from your build script, or use a series of typedefs, or hack support for "holes" in the policy_list that might be void on some architectures, skipping straight to the next one without trying to check for support. GCC sets some preprocessor macros for you that might help, e.g. __SSE2__.
#ifdef ARCH_HAS_SSE
typedef policy_list<opencl_policy,
policy_list<sse_policy,
policy_list<basic_policy
> > > active_policy;
#else
typedef policy_list<opencl_policy,
policy_list<basic_policy
> > active_policy;
#endif
You can use this to compile multiple variants on the same platform too, e.g. an SSE and a no-SSE binary on x86.
Use the policy list
Fairly straightforward, call the apply() static method on the policy_list. Trust that it will call the impl() method on the first policy that passes the runtime test.
int main() {
active_policy::apply();
}
If you take the "per operation template" approach I mentioned earlier it might be something more like:
int main() {
Matrix m1, m2;
Vector v1;
active_policy::apply<matrix_mult_t>(m1, m2);
active_policy::apply<vector_mult_t>(m1, v1);
}
In that case you end up making your Matrix and Vector types aware of the policy_list so that they can decide how/where to store the data. You can also use heuristics for this, e.g. "small vector/matrix lives in main memory no matter what", and make runtime_check() or another function test whether a particular implementation is appropriate for a specific instance.
I also had a custom allocator for containers, which produced suitably aligned memory always on any SSE/Altivec enabled build, regardless of if the specific machine had support for Altivec. It was just easier that way, although it could be a typedef in a given policy and you always assume that the highest priority policy has the strictest allocator needs.
Example have_altivec():
I've included a sample have_altivec() implementation for completeness, simply because it's the shortest and therefore most appropriate for posting here. The x86/x86_64 CPUID one is messy because you have to support the compiler specific ways of writing inline ASM. The OpenCL one is messy because we check some of the implementation limits and extensions too.
#if HAVE_SETJMP_H && !(defined(__APPLE__) && defined(__MACH__))
jmp_buf jmpbuf;
void illegal_instruction(int sig) {
// Bad in general - https://www.securecoding.cert.org/confluence/display/seccode/SIG32-C.+Do+not+call+longjmp%28%29+from+inside+a+signal+handler
// But actually Ok on this platform in this scenario
longjmp(jmpbuf, 1);
}
#endif
bool have_altivec()
{
volatile sig_atomic_t altivec = 0;
#ifdef __APPLE__
int selectors[2] = { CTL_HW, HW_VECTORUNIT };
int hasVectorUnit = 0;
size_t length = sizeof(hasVectorUnit);
int error = sysctl(selectors, 2, &hasVectorUnit, &length, NULL, 0);
if (0 == error)
altivec = (hasVectorUnit != 0);
#elif HAVE_SETJMP_H
void (*handler) (int sig);
handler = signal(SIGILL, illegal_instruction);
if (setjmp(jmpbuf) == 0) {
asm volatile ("mtspr 256, %0\n\t" "vand %%v0, %%v0, %%v0"::"r" (-1));
altivec = 1;
}
signal(SIGILL, handler);
#endif
return altivec;
}
Conclusion
Basically you pay no penalty for platforms that can never support an implementation (the compiler generates no code for them) and only a small penalty (potentially just a test/jmp pair that the CPU predicts very well, if your compiler is half-decent at optimising) for platforms that could support something but don't. You pay no extra cost for platforms that the first choice implementation runs on. The details of the runtime tests vary with the technology in question.
If the virtual function overhead is acceptable, option 3 plus a few ifdefs seems a good compromise IMO. There are two variations that you could consider: one with abstract base class, and the other with the plain C implementation as the base class.
Having the C implementation as the base class lets you gradually add the vector optimized versions, falling back on the non-vectorized versions as you please, using an abstract interface would be a little cleaner to read.
Also, having separate C++ and vectorized versions of your class lets you easily write unit tests that
Ensure that the vectorized code is giving the right result (easy to mess this up, and vector floating registers can have different precision than FPU, causing different results)
Compare the performance of the C++ vs the vectorized. It's often good to make sure the vectorized code is actually doing you any good. Compilers can generate very tight C++ code that sometimes does as well or better than vectorized code.
Here's one with the plain-c++ implementations as the base class. Adding an abstract interface would just add a common base class to all three of these:
// Algo.h:
class Algo_Impl // Default plain C++ implementation
{
public:
    virtual ~Algo_Impl() {}
    virtual void ComputeSome();
    virtual void ComputeSomeMore();
    ...
};
// Algo_SSE.h:
class Algo_Impl_SSE : public Algo_Impl // SSE implementation
{
public:
    virtual void ComputeSome();
    virtual void ComputeSomeMore();
    ...
};
// Algo_Altivec.h:
class Algo_Impl_Altivec : public Algo_Impl // Altivec implementation
{
public:
    virtual void ComputeSome();
    virtual void ComputeSomeMore();
    ...
};
// Client.cpp:
Algo_Impl *myAlgo = 0;
#if defined(SSE)
myAlgo = new Algo_Impl_SSE;
#elif defined(ALTIVEC)
myAlgo = new Algo_Impl_Altivec;
#else
myAlgo = new Algo_Impl; // plain C++ fallback
#endif
...
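The unit-test point made earlier in this answer can be sketched as follows. Everything here is illustrative: `sum` stands in for a real computation, `Algo_Impl_Fast` merely uses a different summation order to mimic the rounding differences a SIMD version can show, and the tolerance is arbitrary.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

class Algo_Impl {  // plain C++ reference implementation
public:
    virtual ~Algo_Impl() {}
    virtual float sum(const std::vector<float>& v) const {
        float s = 0.0f;
        for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
        return s;
    }
};

class Algo_Impl_Fast : public Algo_Impl {  // stand-in for an SSE/Altivec version
public:
    virtual float sum(const std::vector<float>& v) const {
        // Four partial sums, as a 4-wide SIMD loop would accumulate them.
        float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (std::size_t i = 0; i < v.size(); ++i) lane[i % 4] += v[i];
        return (lane[0] + lane[1]) + (lane[2] + lane[3]);
    }
};

// True if both implementations agree within a relative tolerance.
bool results_match(const Algo_Impl& ref, const Algo_Impl& opt,
                   const std::vector<float>& input, float rel_tol) {
    const float a = ref.sum(input);
    const float b = opt.sum(input);
    return std::fabs(a - b) <= rel_tol * std::fabs(a);
}
```

Comparing with a tolerance rather than `==` matters because the two summation orders generally do not round identically.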
You may consider employing the Adapter pattern. There are a few types of adapters and it's quite an extensible concept. Here is an interesting article, Structural Patterns: Adapter and Façade,
that discusses a matter very similar to the one in your question: the Accelerate framework as an example of the Adapter pattern.
I think it is a good idea to discuss a solution on the level of design patterns without focusing on implementation details like the C++ language. Once you decide that the adapter is the right solution for you, you can look for variants specific to your implementation. For example, in the C++ world there is a known adapter variant called the generic adapter pattern.
This isn't really a whole answer: just a variant on one of your existing options. In option 1 you've assumed that you include algo_altivec.cpp &c. into algo.cpp, but you don't have to do this. You could omit algo.cpp entirely, and have your build system decide which of algo_altivec.cpp, algo_sse.cpp, &c. to build. You'd have to do something like this anyway whichever option you use, since each platform can't compile every implementation; my suggestion is only that whichever option you choose, instead of having #if ALTIVEC_ENABLED everywhere in the source, where ALTIVEC_ENABLED is set from the build system, you just have the build system decide directly whether to compile algo_altivec.cpp .
This is a bit trickier to achieve in MSVC than make, scons, &c., but still possible. It's commonplace to switch in a whole directory rather than individual source files; that is, instead of algo_altivec.cpp and friends, you'd have platform/altivec/algo.cpp, platform/sse/algo.cpp, and so on. This way, when you have a second algorithm you need platform-specific implementations for, you can just add the extra source file to each directory.
Although my suggestion's mainly intended to be a variant of option 1, you can combine this with any of your options, to let you decide in the build system and at runtime which options to offer. In that case, though, you'll probably need implementation-specific header files too.
In order to hide the implementation details you can use an abstract interface with a static creator function and provide three implementation classes:
// --------------------- Algo.h ---------------------
#pragma once
#include <string>
#include <boost/shared_ptr.hpp>
typedef boost::shared_ptr<class Algo> AlgoPtr;
class Algo
{
public:
static AlgoPtr Create(std::string type);
virtual ~Algo(); // virtual: instances are destroyed through the base class
void process();
protected:
virtual void computeSome() = 0;
virtual void computeMore() = 0;
};
// --------------------- Algo.cpp ---------------------
class PlainAlgo: public Algo { ... };
class AltivecAlgo: public Algo { ... };
class SSEAlgo: public Algo { ... };
AlgoPtr Algo::Create(std::string type) { /* Factory implementation */ }
Please note that since the PlainAlgo, AltivecAlgo and SSEAlgo classes are defined in Algo.cpp, they are visible only within that compilation unit, and therefore the implementation details are hidden from the outside world.
Here is how one can use your class then:
AlgoPtr algo = Algo::Create("SSE");
algo->process();
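The factory body is elided in the code above. Here is a minimal sketch of what it might look like, assuming C++11 `std::shared_ptr` in place of `boost::shared_ptr` and placeholder implementation classes. The `name()` method is not part of the original interface; it is added purely so that the factory's choice can be observed.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Public interface (what would live in Algo.h).
class Algo {
public:
    virtual ~Algo() {}
    static std::shared_ptr<Algo> Create(const std::string& type);
    virtual const char* name() const = 0;  // hypothetical, for illustration only
};

// Implementation classes (what would live in Algo.cpp). The unnamed
// namespace keeps them invisible to other translation units.
namespace {
class PlainAlgo : public Algo {
public:
    const char* name() const override { return "plain"; }
};
class SSEAlgo : public Algo {
public:
    const char* name() const override { return "sse"; }
};
}  // namespace

// Factory: map a type string to an implementation, falling back to the
// plain version for unknown names.
std::shared_ptr<Algo> Algo::Create(const std::string& type) {
    if (type == "SSE") return std::make_shared<SSEAlgo>();
    return std::make_shared<PlainAlgo>();
}
```

Falling back to the plain implementation for an unrecognised string is a design choice of this sketch; returning a null pointer or throwing would be equally reasonable.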
It seems to me that your first strategy, with separate C++ files and #including the specific implementation, is the simplest and cleanest. I would only add some comments to your Algo.cpp indicating which methods are in the #included files.
e.g.
// Algo.cpp
#include "Algo.h"
Algo::Algo() { }
Algo::~Algo() { }
void Algo::process()
{
computeSome();
computeMore();
}
// The following methods are implemented in separate,
// platform-specific files.
// void Algo::computeSome()
// void Algo::computeMore()
#if defined(ALTIVEC)
#include "Algo_Altivec.cpp"
#elif defined(SSE)
#include "Algo_SSE.cpp"
#else
#include "Algo_Scalar.cpp"
#endif
Policy-like templates (mixins) work well until you need to fall back to a default implementation. That fallback is a runtime operation and should be handled by runtime polymorphism; the Strategy pattern handles this fine.
There's one drawback to this approach: an algorithm implemented as a Strategy cannot be inlined. Such inlining can give a worthwhile performance improvement in rare cases. If this is an issue, you'll need to move the Strategy boundary up so that it covers the higher-level logic.
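As a rough sketch of the Strategy idea with a runtime fallback: the names below (`SumStrategy`, `make_sum_strategy`, the `fast_path_available` flag) are invented for illustration, and the "fast" version is only a hand-unrolled stand-in for a real SIMD implementation. The virtual call is made once per batch rather than once per element, which is what keeps the lost inlining cheap.

```cpp
#include <cassert>
#include <memory>

// Strategy interface: one virtual call per batch, not per element.
struct SumStrategy {
    virtual ~SumStrategy() {}
    virtual float sum(const float* data, int n) const = 0;
};

// Default implementation, always available.
struct ScalarSum : SumStrategy {
    float sum(const float* data, int n) const override {
        float total = 0.0f;
        for (int i = 0; i < n; ++i) total += data[i];
        return total;
    }
};

// Stand-in for a platform-specific version; here it just unrolls by two.
struct UnrolledSum : SumStrategy {
    float sum(const float* data, int n) const override {
        float a = 0.0f, b = 0.0f;
        int i = 0;
        for (; i + 1 < n; i += 2) { a += data[i]; b += data[i + 1]; }
        if (i < n) a += data[i];  // leftover element when n is odd
        return a + b;
    }
};

// Chosen at runtime; fast_path_available is a hypothetical capability flag.
std::unique_ptr<SumStrategy> make_sum_strategy(bool fast_path_available) {
    if (fast_path_available) return std::unique_ptr<SumStrategy>(new UnrolledSum);
    return std::unique_ptr<SumStrategy>(new ScalarSum);
}
```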

Automatically Instantiating over a bunch of types in C++

In our library we have a number of "plugins", which are implemented in their own cpp files. Each plugin defines a template function, and should instantiate this function over a whole bunch of types. The number of types can be quite large, 30-100 of them, and can change depending on some compile-time options. Each instance really has to be compiled and optimized individually; doing so improves performance by a factor of 10-100. The question is what the best way is to instantiate all of these functions.
Each plugin is written by a scientist who does not really know C++, so the code inside each plugin must be hidden inside macros or some simple construct. I have a half-baked solution based on a "database" of instances:
template<int plugin_id, class T>
struct S
{
typedef T (*ftype)(T);
static ftype fp;
};
// By default we don't have any instances
template<int plugin_id, class T>
typename S<plugin_id, T>::ftype S<plugin_id, T>::fp = 0;
Now a user that wants to use a plugin can check the value of
S<SOME_PLUGIN,double>::fp
to see if there is a version of this plugin for the double type. The template instantiation of fp will generate a weak reference, so the linker will use the "real" instance if we define it in a plugin implementation file. Inside the implementation of SOME_PLUGIN we will have an instantiation
template<> S<SOME_PLUGIN,double>::ftype S<SOME_PLUGIN,double>::fp =
some_plugin_implementation;
This seems to work. The question is whether there is some way to automatically repeat this last statement for all types of interest. The types can be stored in a template class or generated by a template loop. I would prefer something that can be hidden by a macro. Of course this can be solved by an external code generator, but it's hard to do this portably and it interferes with the build systems of the people that use the library. Putting all the plugins in header files solves the problem, but makes the compiler explode (needing many gigabytes of memory and a very long compilation time).
I've used Boost.Preprocessor (http://www.boost.org/doc/libs/1_44_0/libs/preprocessor/doc/index.html) for such magic, in particular BOOST_PP_SEQ_FOR_EACH.
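To give an idea of the shape of the macro approach without pulling in Boost, here is a sketch that uses a plain "X-macro" list as a dependency-free stand-in for a Boost.Preprocessor sequence. It repeats the explicit specialization of `fp` once per listed type. `SOME_PLUGIN`'s value, `PLUGIN_TYPES`, `REGISTER_PLUGIN` and the doubling plugin body are all made up for the example.

```cpp
#include <cassert>

template <int plugin_id, class T>
struct S {
    typedef T (*ftype)(T);
    static ftype fp;
};
// By default there is no instance.
template <int plugin_id, class T>
typename S<plugin_id, T>::ftype S<plugin_id, T>::fp = 0;

const int SOME_PLUGIN = 1;  // hypothetical plugin id

// The plugin body, written once by the plugin author.
template <class T>
T some_plugin_implementation(T x) { return x + x; }

// The list of types, written once (it could depend on compile-time options).
#define PLUGIN_TYPES(X) X(int) X(float) X(double)

// Expands to one explicit specialization of fp per listed type.
#define REGISTER_PLUGIN(T) \
    template <> S<SOME_PLUGIN, T>::ftype S<SOME_PLUGIN, T>::fp = \
        some_plugin_implementation<T>;
PLUGIN_TYPES(REGISTER_PLUGIN)
#undef REGISTER_PLUGIN
```

With this in place, `S<SOME_PLUGIN, double>::fp` points at the doubling function, while `S<SOME_PLUGIN, char>::fp` stays null because `char` is not in the list.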
You could use a type list from Boost.MPL and then create a class template that recursively eats that list and instantiates every type. This would however make them all nested structs of that class template.
Hmm, I don't think I understand your problem correctly, so apologies if this answer is way off the mark, but could you not have a static member of S, which has a static instance of ftype, and return a reference to that, this way, you don't need to explicitly have an instance defined in your implementation files... i.e.
template<int plugin_id, class T>
struct S
{
typedef T (*ftype)(T);
static ftype& instance()
{
static ftype _fp = T::create();
return _fp;
}
};
and instead of accessing S<SOME_PLUGIN,double>::fp, you'd do S<SOME_PLUGIN,double>::instance(). To instantiate, at some point you have to call S<>::instance(). Do you need this to happen automagically as well?
EDIT: just noticed that you have a copy constructor, for ftype, changed the above code.. now you have to define a factory method in T called create() to really create the instance.
EDIT: Okay, I can't think of a clean way of doing this automatically, i.e. I don't believe there is a way to (at compile time) build a list of types, and then instantiate. However you could do it using a mix... Hopefully the example below will give you some ideas...
#include <iostream>
#include <typeinfo>
#include <boost/fusion/include/vector.hpp>
#include <boost/fusion/algorithm.hpp>
using namespace std;
// This simply calls the static instantiate function
struct instantiate
{
template <typename T>
void operator()(T const& x) const
{
T::instance();
}
};
// Shared header, presumably all plugin developers will use this header?
template<int plugin_id, class T>
struct S
{
typedef T (*ftype)(T);
static ftype& instance()
{
cout << "S: " << typeid(S<plugin_id, T>).name() << endl;
static ftype _fp; // = T::create();
return _fp;
}
};
// This is an additional struct, each plugin developer will have to implement
// one of these...
template <int plugin_id>
struct S_Types
{
// All they have to do is add the types that they will support to this vector
static void instance()
{
boost::fusion::vector<
S<plugin_id, double>,
S<plugin_id, int>,
S<plugin_id, char>
> supported_types;
boost::fusion::for_each(supported_types, instantiate());
}
};
// This is a global register, so once a plugin has been developed,
// add it to this list.
struct S_Register
{
S_Register()
{
// Add each plugin here, you'll only have to do this when a new plugin
// is created, unfortunately you have to do it manually, can't
// think of a way of adding a type at compile time...
boost::fusion::vector<
S_Types<0>,
S_Types<1>,
S_Types<2>
> plugins;
boost::fusion::for_each(plugins, instantiate());
}
};
int main(void)
{
// single instance of the register, defining this here, effectively
// triggers calls to instance() of all the plugins and supported types...
S_Register reg;
return 0;
}
Basically uses a fusion vector to define all the possible instances that could exist. It will take a little bit of work from you and the developers, as I've outlined in the code... hopefully it'll give you an idea...

Where do you find templates useful?

At my workplace, we tend to use iostream, string, vector, map, and the odd algorithm or two. We haven't actually found many situations where template techniques were a best solution to a problem.
What I am looking for here are ideas, and optionally sample code that shows how you used a template technique to create a new solution to a problem that you encountered in real life.
As a bribe, expect an up vote for your answer.
General info on templates:
Templates are useful anytime you need to use the same code but operating on different data types, where the types are known at compile time. And also when you have any kind of container object.
A very common usage is for just about every type of data structure. For example: Singly linked lists, doubly linked lists, trees, tries, hashtables, ...
Another very common usage is for sorting algorithms.
One of the main advantages of using templates is that you can remove code duplication. Code duplication is one of the biggest things you should avoid when programming.
You could implement a function Max either as a macro or as a template, but the template implementation would be type safe and therefore better.
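A small sketch of the difference, with made-up names (`MAX_MACRO`, `max_value` and the two counting helpers). Besides type safety, the macro version evaluates an argument twice, which the template version does not:

```cpp
#include <cassert>

// Macro version: arguments are pasted textually, so one of them is evaluated twice.
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

// Template version: arguments are evaluated once and must share one type.
template <typename T>
const T& max_value(const T& a, const T& b) {
    return (a < b) ? b : a;
}

// Each helper returns how many times its argument expression i++ was executed.
inline int macro_increments() {
    int i = 5, j = 0;
    int m = MAX_MACRO(i++, j);  // i++ runs in the comparison and again in the result
    (void)m;
    return i - 5;
}
inline int template_increments() {
    int i = 5, j = 0;
    int m = max_value(i++, j);  // i++ runs exactly once
    (void)m;
    return i - 5;
}
```

A call such as `max_value(1, 2.5)` is rejected at compile time because `T` cannot be deduced consistently, whereas the macro would silently mix the two types.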
And now onto the cool stuff:
Also see template metaprogramming, which is a way of pre-evaluating code at compile time rather than at run time. Template metaprogramming has only immutable values: once something is defined it cannot change. Because of this, template metaprogramming can be seen as a form of functional programming.
Check out this example of template metaprogramming from Wikipedia. It shows how templates can be used to execute code at compile time. Therefore at runtime you have a pre-calculated constant.
template <int N>
struct Factorial
{
enum { value = N * Factorial<N - 1>::value };
};
template <>
struct Factorial<0>
{
enum { value = 1 };
};
// Factorial<4>::value == 24
// Factorial<0>::value == 1
void foo()
{
int x = Factorial<4>::value; // == 24
int y = Factorial<0>::value; // == 1
}
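For comparison, since C++11 the same compile-time computation can often be written as a `constexpr` function instead of a recursive struct. This is an alternative to the Wikipedia example, not part of it, and `factorial` is just an illustrative name:

```cpp
// An ordinary-looking function that the compiler may evaluate at compile time.
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// static_assert only accepts constant expressions, so these lines show
// that the values are computed during compilation.
static_assert(factorial(4) == 24, "4! should be 24");
static_assert(factorial(0) == 1, "0! should be 1");
```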
I've used a lot of template code, mostly in Boost and the STL, but I've seldom had a need to write any.
One of the exceptions, a few years ago, was in a program that manipulated Windows PE-format EXE files. The company wanted to add 64-bit support, but the ExeFile class that I'd written to handle the files only worked with 32-bit ones. The code required to manipulate the 64-bit version was essentially identical, but it needed to use a different address type (64-bit instead of 32-bit), which caused two other data structures to be different as well.
Based on the STL's use of a single template to support both std::string and std::wstring, I decided to try making ExeFile a template, with the differing data structures and the address type as parameters. There were two places where I still had to use #ifdef WIN64 lines (slightly different processing requirements), but it wasn't really difficult to do. We've got full 32- and 64-bit support in that program now, and using the template means that every modification we've done since automatically applies to both versions.
One place that I do use templates to create my own code is to implement policy classes as described by Andrei Alexandrescu in Modern C++ Design. At present I'm working on a project that includes a set of classes that interact with Oracle's (formerly BEA's) Tuxedo TP monitor.
One facility that Tuxedo provides is transactional persistent queues, so I have a class TpQueue that interacts with the queue:
class TpQueue {
public:
void enqueue(...);
void dequeue(...);
...
};
However, as the queue is transactional I need to decide what transaction behaviour I want; this could be done separately outside of the TpQueue class, but I think it's more explicit and less error-prone if each TpQueue instance has its own policy on transactions. So I have a set of TransactionPolicy classes such as:
class OwnTransaction {
public:
void begin(...); // Suspend any open transaction and start a new one
void commit(...); // Commit my transaction and resume any suspended one
void abort(...);
};
class SharedTransaction {
public:
void begin(...); // Join the currently active transaction or start a new one if there isn't one
...
};
And the TpQueue class gets re-written as
template <typename TXNPOLICY = SharedTransaction>
class TpQueue : public TXNPOLICY {
...
};
So inside TpQueue I can call begin(), abort(), commit() as needed but can change the behaviour based on the way I declare the instance:
TpQueue<SharedTransaction> queue1 ;
TpQueue<OwnTransaction> queue2 ;
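Here is a self-contained sketch of the same policy idea that can actually be compiled. It makes no Tuxedo calls: the two policies merely record what they were asked to do, and `Queue`, `log` and `size()` are names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Two hypothetical policies that only record what they were asked to do.
struct OwnTransaction {
    std::vector<std::string> log;
    void begin()  { log.push_back("suspend+begin"); }
    void commit() { log.push_back("commit+resume"); }
};
struct SharedTransaction {
    std::vector<std::string> log;
    void begin()  { log.push_back("join"); }
    void commit() { log.push_back("commit"); }
};

// The queue inherits begin()/commit() from whichever policy it is given.
template <typename TxnPolicy = SharedTransaction>
class Queue : public TxnPolicy {
public:
    void enqueue(int item) {
        this->begin();   // this-> is needed because the base is a dependent type
        items_.push_back(item);
        this->commit();
    }
    std::size_t size() const { return items_.size(); }
private:
    std::vector<int> items_;
};
```

Declaring `Queue<>` gives the shared behaviour and `Queue<OwnTransaction>` the isolated one; the body of `enqueue` is identical in both cases.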
I used templates (with the help of Boost.Fusion) to achieve type-safe integers for a hypergraph library that I was developing. I have a (hyper)edge ID and a vertex ID both of which are integers. With templates, vertex and hyperedge IDs became different types and using one when the other was expected generated a compile-time error. Saved me a lot of headache that I'd otherwise have with run-time debugging.
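The answer above used Boost.Fusion; the sketch below shows the general idea with a plain tag template instead, so it is not the author's implementation. `TypedId`, the tag structs and `degree_lookup_key` are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>

// A thin integer wrapper; Tag exists only to make each instantiation a distinct type.
template <typename Tag>
class TypedId {
public:
    explicit TypedId(int value) : value_(value) {}
    int value() const { return value_; }
    bool operator==(const TypedId& other) const { return value_ == other.value_; }
private:
    int value_;
};

struct VertexTag {};
struct EdgeTag {};
typedef TypedId<VertexTag> VertexId;
typedef TypedId<EdgeTag> EdgeId;

// Hypothetical function that only accepts vertex ids.
inline int degree_lookup_key(VertexId v) { return v.value(); }

// The two id types do not convert into each other, so mixing them is a
// compile-time error rather than a run-time bug.
static_assert(!std::is_convertible<EdgeId, VertexId>::value, "ids must not mix");

// degree_lookup_key(EdgeId(7));  // would not compile: EdgeId is not a VertexId
// degree_lookup_key(7);          // would not compile: the constructor is explicit
```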
Here's one example from a real project. I have getter functions like this:
bool getValue(wxString key, wxString& value);
bool getValue(wxString key, int& value);
bool getValue(wxString key, double& value);
bool getValue(wxString key, bool& value);
bool getValue(wxString key, StorageGranularity& value);
bool getValue(wxString key, std::vector<wxString>& value);
And then a variant with a 'default' value. It returns the value for key if it exists, or the default value if it doesn't. A template saved me from having to write six new functions myself.
template <typename T>
T get(wxString key, const T& defaultValue)
{
T temp;
if (getValue(key, temp))
return temp;
else
return defaultValue;
}
Templates I regularly consume are a multitude of container classes, Boost smart pointers, scope guards, and a few STL algorithms.
Scenarios in which I have written templates:
custom containers
memory management, implementing type safety and CTor/DTor invocation on top of void * allocators
common implementation for overloads with different types, e.g.
bool ContainsNan(float * , int)
bool ContainsNan(double *, int)
which both just call a (local, hidden) helper function
template <typename T>
bool ContainsNanT(T * values, int len) { ... actual code goes here }
Specific algorithms that are independent of the type, as long as the type has certain properties, e.g. binary serialization.
template <typename T>
void BinStream::Serialize(T & value) { ... }
// to make a type serializable, you need to implement
void SerializeElement(BinStream & stream, Foo & element);
void DeserializeElement(BinStream & stream, Foo & element);
Unlike virtual functions, templates allow more optimizations to take place.
Generally, templates let you implement one concept or algorithm for a multitude of types, with the differences resolved at compile time.
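To make the ContainsNan case concrete, here is one way the hidden helper might be filled in. It assumes C++11's `std::isnan` from `<cmath>`, and it takes `const` pointers, which the signatures quoted above do not.

```cpp
#include <cassert>
#include <cmath>

namespace {
// Shared implementation, kept local to the .cpp file.
template <typename T>
bool ContainsNanT(const T* values, int len) {
    for (int i = 0; i < len; ++i) {
        if (std::isnan(values[i])) return true;
    }
    return false;
}
}  // namespace

// Public overloads simply forward to the template.
bool ContainsNan(const float* values, int len)  { return ContainsNanT(values, len); }
bool ContainsNan(const double* values, int len) { return ContainsNanT(values, len); }
```

Callers only ever see the two plain overloads, so the template never has to appear in a header.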
We use COM and accept a pointer to an object that can implement another interface either directly or via [IServiceProvider](http://msdn.microsoft.com/en-us/library/cc678965(VS.85).aspx). This prompted me to create this cast-like helper function.
// Get interface either via QueryInterface or via QueryService
template <class IFace>
CComPtr<IFace> GetIFace(IUnknown* unk)
{
CComQIPtr<IFace> ret = unk; // Try QueryInterface
if (ret == NULL) { // Fallback to QueryService
if(CComQIPtr<IServiceProvider> ser = unk)
ser->QueryService(__uuidof(IFace), __uuidof(IFace), (void**)&ret);
}
return ret;
}
I use templates to specify function object types. I often write code that takes a function object as an argument -- a function to integrate, a function to optimize, etc. -- and I find templates more convenient than inheritance. So my code receiving a function object -- such as an integrator or optimizer -- has a template parameter to specify the kind of function object it operates on.
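A minimal sketch of that style, using a midpoint-rule integrator as the example. The names `integrate`, `Power` and `square` are illustrative only. The same template accepts a plain function, a function object with state, or a lambda, with no common base class required:

```cpp
#include <cassert>
#include <cmath>

// Midpoint-rule integrator; F is any callable taking and returning double.
template <typename F>
double integrate(F f, double a, double b, int steps) {
    const double h = (b - a) / steps;
    double total = 0.0;
    for (int i = 0; i < steps; ++i) {
        total += f(a + (i + 0.5) * h);  // sample the middle of each slice
    }
    return total * h;
}

// A function object carrying its own state.
struct Power {
    double exponent;
    double operator()(double x) const { return std::pow(x, exponent); }
};

// A plain function works just as well.
inline double square(double x) { return x * x; }
```

Because `F` is a template parameter, the call to `f` can be inlined, which an interface with a virtual `evaluate()` method would generally prevent.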
The obvious reasons (like preventing code-duplication by operating on different data types) aside, there is this really cool pattern that's called policy based design. I have asked a question about policies vs strategies.
Now, what's so nifty about this feature? Consider that you are writing an interface for others to use. You know that your interface will be used, because it is a module in its own domain. But you don't know yet how people are going to use it. Policy-based design strengthens your code for future reuse; it makes you independent of the data types a particular implementation relies on. The code is just "slurped in". :-)
Traits are per se a wonderful idea. They can attach particular behaviour, data and type data to a model. Traits allow complete parameterization of all three of these fields. And best of all, it's a very good way to make code reusable.
I once saw the following code:
void doSomethingGeneric1(SomeClass * c, SomeClass & d)
{
// three lines of code
callFunctionGeneric1(c) ;
// three lines of code
}
repeated ten times:
void doSomethingGeneric2(SomeClass * c, SomeClass & d)
void doSomethingGeneric3(SomeClass * c, SomeClass & d)
void doSomethingGeneric4(SomeClass * c, SomeClass & d)
// Etc
Each function having the same 6 lines of code copy/pasted, and each time calling another function callFunctionGenericX with the same number suffix.
There was no way to refactor the whole thing at once, so I kept the refactoring local.
I changed the code this way (from memory):
template<typename T>
void doSomethingGenericAnything(SomeClass * c, SomeClass & d, T t)
{
// three lines of code
t(c) ;
// three lines of code
}
And modified the existing code with:
void doSomethingGeneric1(SomeClass * c, SomeClass & d)
{
doSomethingGenericAnything(c, d, callFunctionGeneric1) ;
}
void doSomethingGeneric2(SomeClass * c, SomeClass & d)
{
doSomethingGenericAnything(c, d, callFunctionGeneric2) ;
}
Etc.
This is somewhat hijacking the template thing, but in the end, I guess it's better than playing with typedefed function pointers or using macros.
I personally have used the Curiously Recurring Template Pattern as a means of enforcing some form of top-down design and bottom-up implementation. An example would be a specification for a generic handler where certain requirements on both form and interface are enforced on derived types at compile time. It looks something like this:
template <class Derived>
struct handler_base {
void pre_call() {
// do any universal pre_call handling here
static_cast<Derived *>(this)->pre_call();
}
// Derived is still incomplete while handler_base<Derived> is being
// instantiated, so its typedefs are reached through a defaulted template
// parameter (C++11) that is only resolved when the member is called.
template <class D = Derived>
void post_call(typename D::result_type & result) {
static_cast<Derived *>(this)->post_call(result);
// do any universal post_call handling here
}
template <class D = Derived>
typename D::result_type
operator() (typename D::arg_pack const & args) {
pre_call();
typename D::result_type temp = static_cast<Derived *>(this)->eval(args);
post_call(temp);
return temp;
}
};
Something like this can be used then to make sure your handlers derive from this template and enforce top-down design and then allow for bottom-up customization:
struct my_handler : handler_base<my_handler> {
typedef int result_type; // required to compile
typedef tuple<int, int> arg_pack; // required to compile
void pre_call(); // required to compile
void post_call(int &); // required to compile
int eval(arg_pack const &); // required to compile
};
This then allows you to have generic polymorphic functions that deal with only handler_base<> derived types:
template <class T, class Arg0, class Arg1>
typename T::result_type
invoke(handler_base<T> & handler, Arg0 const & arg0, Arg1 const & arg1) {
return handler(make_tuple(arg0, arg1));
};
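A compact, self-contained version of the same idea that can be compiled as is, assuming `std::tuple` in place of the unqualified `tuple` used above. `adder`, `calls` and `invoke_handler` are illustrative names. The defaulted template parameter `D` is one way to let the base's signature mention typedefs of a class that is still incomplete at the point where the base is instantiated.

```cpp
#include <cassert>
#include <tuple>

template <class Derived>
struct handler_base {
    // D defaults to Derived and is only looked into when operator() is called,
    // by which point Derived is a complete type.
    template <class D = Derived>
    typename D::result_type operator()(typename D::arg_pack const& args) {
        ++calls;                                         // universal bookkeeping
        return static_cast<Derived*>(this)->eval(args);  // derived-specific work
    }
    int calls = 0;
};

struct adder : handler_base<adder> {
    typedef int result_type;                // required by handler_base
    typedef std::tuple<int, int> arg_pack;  // required by handler_base
    int eval(arg_pack const& args) { return std::get<0>(args) + std::get<1>(args); }
};

// Generic entry point that accepts any handler_base-derived type.
template <class T, class Arg0, class Arg1>
typename T::result_type invoke_handler(handler_base<T>& handler,
                                       Arg0 const& a0, Arg1 const& a1) {
    return handler(std::make_tuple(a0, a1));
}
```

Removing either typedef from `adder` turns the call to `invoke_handler` into a compile-time error, which is the enforcement the pattern is after.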
It's already been mentioned that you can use templates as policy classes to do something. I use this a lot.
I also use them, with the help of property maps (see boost site for more information on this), in order to access data in a generic way. This gives the opportunity to change the way you store data, without ever having to change the way you retrieve it.