CUDA: Wrapping device memory allocation in C++ - c++

I'm starting to use CUDA at the moment and have to admit that I'm a bit disappointed with the C API. I understand the reasons for choosing C but had the language been based on C++ instead, several aspects would have been a lot simpler, e.g. device memory allocation (via cudaMalloc).
My plan was to do this myself, using overloaded operator new with placement new and RAII (two alternatives). I'm wondering if there are any caveats that I haven't noticed so far. The code seems to work but I'm still wondering about potential memory leaks.
The usage of the RAII code would be as follows:
CudaArray<float> device_data(SIZE);
// Use `device_data` as if it were a raw pointer.
Perhaps a class is overkill in this context (especially since you'd still have to use cudaMemcpy, the class only encapsulating RAII) so the other approach would be placement new:
float* device_data = new (cudaDevice) float[SIZE];
// Use `device_data` …
operator delete [](device_data, cudaDevice);
Here, cudaDevice merely acts as a tag to trigger the overload. However, since in normal placement new this would indicate the placement, I find the syntax oddly consistent and perhaps even preferable to using a class.
I'd appreciate criticism of every kind. Does somebody perhaps know if something in this direction is planned for the next version of CUDA (which, as I've heard, will improve its C++ support, whatever they mean by that)?
So, my question is actually threefold:
1. Is my placement new overload semantically correct? Does it leak memory?
2. Does anybody have information about future CUDA developments that go in this general direction (let's face it: C interfaces in C++ s*ck)?
3. How can I take this further in a consistent manner (there are other APIs to consider, e.g. there's not only device memory but also a constant memory store and texture memory)?
// Singleton tag for CUDA device memory placement.
struct CudaDevice {
static CudaDevice const& get() { return instance; }
private:
static CudaDevice const instance;
CudaDevice() { }
CudaDevice(CudaDevice const&);
CudaDevice& operator =(CudaDevice const&);
} const& cudaDevice = CudaDevice::get();
CudaDevice const CudaDevice::instance;
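inline void* operator new [](std::size_t nbytes, CudaDevice const&) {
void* ret = 0;
// cudaMalloc reports failure through its return code, so check it
// instead of returning an unset pointer (std::bad_alloc needs <new>).
if (cudaMalloc(&ret, nbytes) != cudaSuccess)
throw std::bad_alloc();
return ret;
}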
inline void operator delete [](void* p, CudaDevice const&) throw() {
cudaFree(p);
}
template <typename T>
class CudaArray {
public:
explicit
CudaArray(std::size_t size) : size(size), data(new (cudaDevice) T[size]) { }
operator T* () { return data; }
~CudaArray() {
operator delete [](data, cudaDevice);
}
private:
std::size_t const size;
T* const data;
CudaArray(CudaArray const&);
CudaArray& operator =(CudaArray const&);
};
About the singleton employed here: Yes, I'm aware of its drawbacks. However, these aren't relevant in this context. All I needed here was a small type tag that wasn't copyable. Everything else (i.e. multithreading considerations, time of initialization) doesn't apply.
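On the leak question, one point can be checked on the host. When a constructor throws inside a placement new-expression, the compiler calls the placement operator delete that takes the same extra arguments; in every other case the caller has to release the memory explicitly, as the question does. The sketch below is only an illustration: HostTag is a hypothetical tag, std::malloc/std::free stand in for cudaMalloc/cudaFree so that it builds without the CUDA toolkit, and the counter g_live exists purely for the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>

struct HostTag {};        // hypothetical stand-in for the cudaDevice tag
static int g_live = 0;    // number of blocks currently allocated (demo only)

// Stand-in for the cudaMalloc-based overload; fails loudly instead of silently.
inline void* operator new[](std::size_t nbytes, HostTag const&) {
    void* p = std::malloc(nbytes);
    if (!p) throw std::bad_alloc();
    ++g_live;
    return p;
}

// Matching placement delete: called automatically only when a constructor throws.
inline void operator delete[](void* p, HostTag const&) noexcept {
    if (p) { --g_live; std::free(p); }
}

struct Thrower {
    Thrower() { throw std::runtime_error("constructor failed"); }
};

// The allocation made for the array is released by the placement delete above.
inline int live_after_failed_construction() {
    HostTag tag;
    try {
        Thrower* t = new (tag) Thrower[4];
        (void)t;   // never reached
    } catch (std::runtime_error const&) {
    }
    return g_live;
}

// The normal path: the caller must invoke the placement delete by hand.
inline int live_after_manual_release() {
    HostTag tag;
    float* f = new (tag) float[16];
    assert(g_live == 1);
    operator delete[](f, tag);
    return g_live;
}
```

Both functions should leave the counter at zero. Forgetting the explicit `operator delete[]` call in the second one is the leak to watch for, which is an argument for the RAII class.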

In the meantime there were some further developments (not so much in terms of the CUDA API, but at least in terms of projects attempting an STL-like approach to CUDA data management).
Most notably there is a project from NVIDIA research: thrust

I would go with the placement new approach. Then I would define a class that conforms to the std::allocator<> interface. In theory, you could pass this class as a template parameter into std::vector<> and std::map<> and so forth.
Beware, I have heard that doing such things is fraught with difficulty, but at least you will learn a lot more about the STL this way. And you do not need to re-invent your containers and algorithms.
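To give an idea of how little the allocator interface requires in C++11, here is a minimal sketch. CountingAllocator and g_outstanding are hypothetical names, and std::malloc/std::free stand in for the device calls so the example runs on the host. A real device allocator would call cudaMalloc/cudaFree in allocate/deallocate, and it would still have to deal with std::vector constructing elements from host code, which is presumably part of the difficulty mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

static std::size_t g_outstanding = 0;   // allocations not yet freed (demo only)

template <typename T>
struct CountingAllocator {
    typedef T value_type;

    CountingAllocator() noexcept {}
    // Converting constructor so containers can rebind the allocator.
    template <typename U>
    CountingAllocator(CountingAllocator<U> const&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = std::malloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        ++g_outstanding;
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
        --g_outstanding;
        std::free(p);
    }
};

// Stateless allocators always compare equal.
template <typename T, typename U>
bool operator==(CountingAllocator<T> const&, CountingAllocator<U> const&) noexcept {
    return true;
}
template <typename T, typename U>
bool operator!=(CountingAllocator<T> const&, CountingAllocator<U> const&) noexcept {
    return false;
}

// Builds a vector with the custom allocator and reports what is left afterwards.
inline std::size_t outstanding_after_vector_use() {
    {
        std::vector<int, CountingAllocator<int> > v(100, 7);
        assert(v.size() == 100 && v[99] == 7);
        assert(g_outstanding >= 1);
    }
    return g_outstanding;
}
```

The remaining members of the old std::allocator interface (rebind, construct, destroy and so on) are filled in by std::allocator_traits in C++11, which is why the sketch can stay this short.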

Does anybody have information about future CUDA developments that go in this general direction (let's face it: C interfaces in C++ s*ck)?
Yes, I've done something like that:
https://github.com/eyalroz/cuda-api-wrappers/
NVIDIA's Runtime API for CUDA is intended for use both in C and C++ code. As such, it uses a C-style API, the lowest common denominator (with a few notable exceptions of templated function overloads).
This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of C++ (including some C++11) for using the runtime API - but without reducing expressivity or increasing the level of abstraction (as in, e.g., the Thrust library). Using cuda-api-wrappers, you still have your devices, streams, events and so on - but they will be more convenient to work with in more C++-idiomatic ways.

There are several projects that attempt something similar, for example CUDPP.
In the meantime, however, I've implemented my own allocator and it works well and was straightforward (> 95% boilerplate code).

Related

Idiomatic way to handle T[]-like objects in C++

I am using some C-library in my C++ code. The library wants me to allocate some amount of the memory and pass the pointer to the library. Unfortunately, the exact required memory size is not known in advance, so the library also requires me to provide C-callback with the following signature:
void* callback_realloc(void* ptr, size_t new_size);
where ptr is previously passed memory and new_size is required size, and the callback must return the pointer to newly allocated memory. There is no direct way to store the allocator state. Instead, I need to rely on pointer arithmetic somehow as the following:
template<class T>
struct o_s {
std::aligned_storage_t<sizeof(T), alignof(T)> data;
};
template<class Alloc>
struct o_i: private Alloc {
std::size_t allocated_size;
const Alloc& get_allocator() const { return *this; }
};
template<class T>
struct o: public o_i, public o_s<T> {
void* ptr() {
return &data;
}
// Additionally, override class operator new and operator delete...
};
void* my_callback(void* ptr, size_t new_size) {
auto meta = static_cast<o*>(reinterpret_cast<o_s<char>*>(ptr));
// access to the allocator state ...
}
Then sizeof(o_i) + initial_size memory is allocated and ptr() is passed to the C library.
At this point, I understand that I am not the first person in the world who needs this pattern. Unfortunately (and surprisingly) I have not found anything suitable for this in Boost or the STL. I would like to use a ready-made implementation to avoid hidden pitfalls.
The simplest solution is to allocate the memory using std::malloc, and use std::realloc as the callback. C APIs such as this are a case where using them makes sense in C++.
You don't necessarily have to use std::malloc however. You can implement a custom allocator if you want to. Using some of the allocated storage is one way of storing allocation metadata, and it's an efficient way. That's not necessary either though, since you can also store the metadata separately in a map-like structure.
I don't think you'd find an idiomatic way to do this in C++, as this is a C idiom.
Conceptually, you have two ways of going about this:
Store the metadata alongside the buffer you've allocated (as you seem to be doing). I feel using double inheritance etc. is a bit overkill here, when you could just allocate a char buffer of sizeof(size_t)+allocation_size and use the first part for your metadata.
Allocate and return to the API a raw buffer as needed, and use a separate static data structure to manage this with a map of ptr->allocation size. I suppose this is what malloc/realloc is doing behind the scenes anyway.
Both are valid, and the one you chose depends on the specific details of your application. When dealing with memory it's ok to actually deal with memory, be it pointer arithmetic or what not.
The important thing is probably to keep the ugly bit to a single location, and provide a C++ style API to this on the C++ side of things, so that the client code isn't exposed to implementation details.
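The first option (metadata stored in front of the buffer) could look roughly like the following sketch. Header, header_of, allocated_size and release are hypothetical names; only the callback signature comes from the question. The header is padded to the strictest fundamental alignment so that the pointer handed to the library stays suitably aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Metadata kept directly in front of the user-visible buffer.
struct alignas(std::max_align_t) Header {
    std::size_t size;   // number of bytes the library asked for
};

// Recover the header from the pointer the library knows about.
inline Header* header_of(void* user) {
    return static_cast<Header*>(user) - 1;
}

// Callback with the signature the C library expects.
extern "C" void* callback_realloc(void* ptr, std::size_t new_size) {
    Header* old_header = ptr ? header_of(ptr) : nullptr;
    void* raw = std::realloc(old_header, sizeof(Header) + new_size);
    if (!raw) return nullptr;          // the old block stays valid on failure
    Header* header = static_cast<Header*>(raw);
    header->size = new_size;
    return header + 1;                 // the library only ever sees the payload
}

inline std::size_t allocated_size(void* user) { return header_of(user)->size; }

inline void release(void* user) {
    if (user) std::free(header_of(user));
}

// Grows a buffer once and reports the size recorded in its header.
inline std::size_t grow_example() {
    void* p = callback_realloc(nullptr, 8);
    assert(p != nullptr);
    std::memcpy(p, "1234567", 8);
    p = callback_realloc(p, 64);
    assert(p != nullptr && std::memcmp(p, "1234567", 8) == 0);
    std::size_t n = allocated_size(p);
    release(p);
    return n;
}
```

All pointer arithmetic lives in header_of, so the rest of the C++ code never needs to know about the layout.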

Compiler or Standard C++ Library - new and delete

I am writing C++ code for a kernel, without any library. I am confused about the new and delete operators. I have implemented KMalloc() and KFree(). Now, I want to know whether the following code will work without any standard C++ library.
void *mem = KMalloc(sizeof(__type__));
Object *obj = new (mem) ();
If this will not work, then how will I setup the vtables or whatever object structure there is in a preallocated space without any Std Lib.
You should first decide which C++ standard you are targeting. I guess it is at least C++11.
Then, if you code in C++ for some operating system kernel, beware and study carefully the relevant ABI specifications (the details depend even on the version of your C++ compiler, and gory details such as exception handling and stack unwinding matter a lot).
Notice that the Linux kernel ABI is not C++ friendly (it is not the same as Linux user-land ABI for x86-64). So coding in C++ for the Linux kernel is not reasonable.
You probably want
void *mem = KMalloc(sizeof(Object));
Object *obj = new (mem) Object();
The second statement uses the placement new feature of C++, which will run the constructor on the (more or less "uninitialized") memory zone passed as placement.
(notice that bitwise copy of C++ objects -e.g. with memcpy- is undefined behavior in general (except for PODs); you need to use constructors and assignment operators)
There is no "placement delete", but you can explicitly run the destructor: obj->~Object() after which any use of the object pointed by the obj pointer is undefined behavior.
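As a host-side sketch of the sequence just described (allocate, placement new, explicit destructor call, free): KMalloc and KFree below are hypothetical stand-ins built on std::malloc/std::free, since the real kernel routines are the asker's own, and Shape/Triangle are invented for the demonstration. The sketch includes <new> for the placement form of operator new; an environment without that header would, as far as I know, have to provide `void* operator new(std::size_t, void*)` itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>   // placement operator new

// Hypothetical stand-ins for the kernel's own allocation routines.
inline void* KMalloc(std::size_t n) { return std::malloc(n); }
inline void KFree(void* p) { std::free(p); }

struct Shape {
    virtual ~Shape() {}
    virtual int sides() const = 0;
};

struct Triangle : Shape {
    static int alive;               // live instances (demo only)
    Triangle() { ++alive; }
    ~Triangle() { --alive; }
    int sides() const { return 3; }
};
int Triangle::alive = 0;

// Construct in preallocated storage, use through the vtable, destroy, free.
inline int construct_use_destroy() {
    void* mem = KMalloc(sizeof(Triangle));
    assert(mem != nullptr);
    Shape* s = new (mem) Triangle();   // the constructor sets up the vtable pointer
    int n = s->sides();                // virtual dispatch works from here on
    assert(Triangle::alive == 1);
    s->~Shape();                       // virtual destructor runs ~Triangle
    KFree(mem);                        // storage is released separately
    return n;
}
```

The virtual call only works because the constructor ran; skipping it and casting the raw memory to Shape* would leave the vtable pointer unset.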
Now, I want to know if that code will work without any Standard C++ Library.
It might be much harder than what you believe. You need to understand all the details of the ABI targeted by your compiler, and that is hard.
Notice that properly running constructors (and destructors), in a good enough order, is of paramount importance for C++; practically speaking, they notably initialize the (implicit) vtable field[s], without which your program can crash (as soon as any virtual member function or destructor gets called).
Read also about the rule of five (for C++11).
Coding your own kernel in C++ practically requires understanding a lot of details about your C++ implementation (and ABI).
NB: practically speaking, bitwise copy with memcpy of smart pointers, of std::stream-s, of std::mutex-es, of std::thread-s - and perhaps even of standard containers and of std::string-s etc...- is very likely to make a disaster. If you dare doing such bad things, you really need to look into the details of your particular implementations...
In addition to what other answers have already said, you might want to overload operator new and operator delete so that you don't need to do the KMalloc() plus placement new trick all the time.
// In the global namespace.
void* operator new
(
size_t size
)
{
/* You might also check for `KMalloc()`'s return value and throw
* an exception like the standard `operator new`. This, however,
* requires kernel-mode exception support, which is not that easy
* to get up and running.
*/
return KMalloc( size );
}
void* operator new[]
(
size_t size
)
{
return KMalloc( size );
}
void operator delete
(
void* what
)
{
KFree( what );
}
void operator delete[]
(
void* what
)
{
KFree( what );
}
Then, code like the following will work by calling your KMalloc() and KFree() routines when necessary, along with all necessary constructors like placement new would do.
template<typename Type>
class dumb_smart_pointer
{
public:
dumb_smart_pointer()
: pointer( nullptr )
{}
explicit dumb_smart_pointer
(
Type* pointer
)
: pointer( pointer )
{}
~dumb_smart_pointer()
{
if( this->pointer != nullptr )
{
delete this->pointer;
}
}
Type& operator*()
{
return *this->pointer;
}
Type* operator->()
{
return this->pointer;
}
private:
Type* pointer;
};
// The constructor is explicit, so direct initialization is required here.
dumb_smart_pointer<int> my_pointer( new int( 123 ) );
*my_pointer += 42;
KConsoleOutput << *my_pointer << '\n';
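One caveat with dumb_smart_pointer as written: it is copyable, so copying it would lead to the same pointer being deleted twice. Below is a sketch of a variant that forbids copying and allows moving, in line with the rule of five mentioned earlier. The name owning_pointer is hypothetical, and the example uses <cassert> only to check itself on a hosted compiler; the class itself needs no library support.

```cpp
#include <cassert>

template <typename Type>
class owning_pointer {
public:
    owning_pointer() noexcept : pointer(nullptr) {}
    explicit owning_pointer(Type* p) noexcept : pointer(p) {}

    // Copying would make two owners delete the same object.
    owning_pointer(owning_pointer const&) = delete;
    owning_pointer& operator=(owning_pointer const&) = delete;

    // Moving transfers ownership and leaves the source empty.
    owning_pointer(owning_pointer&& other) noexcept : pointer(other.pointer) {
        other.pointer = nullptr;
    }
    owning_pointer& operator=(owning_pointer&& other) noexcept {
        if (this != &other) {
            delete pointer;
            pointer = other.pointer;
            other.pointer = nullptr;
        }
        return *this;
    }

    ~owning_pointer() { delete pointer; }   // deleting a null pointer is a no-op

    Type& operator*() const { return *pointer; }
    Type* operator->() const { return pointer; }
    Type* get() const noexcept { return pointer; }

private:
    Type* pointer;
};

// Same arithmetic as the example above, followed by an ownership transfer.
inline int add_through_owner() {
    owning_pointer<int> p(new int(123));
    *p += 42;
    // static_cast to an rvalue reference avoids needing std::move from <utility>.
    owning_pointer<int> q(static_cast<owning_pointer<int>&&>(p));
    assert(p.get() == nullptr);
    return *q;
}
```

With the global operator new/delete overloads in place, the `new int(123)` and the `delete` in the destructor would go through KMalloc() and KFree().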

How to write a C wrapper for delete that would be fast yet free any type given to it with out telling it what type

I have a whole list of C wrappers for OpenCV C++ functions like the one below, and all of them return a pointer created with new. I can't change them because they are becoming part of OpenCV and it would make my library perfect to have a consistently updated skeleton to wrap around.
Mat* cv_create_Mat() {
return new Mat();
}
I can't rewrite the C wrapper for the C++ function, so I wrote a delete wrapper like the one below. The memory I'm trying to free is a Mat*, where Mat is an OpenCV C++ class, and the delete wrapper below works. There is absolutely no memory leakage at all.
I have a lot of other C wrappers for OpenCV C++ functions, though, that return a new pointer; there are at least 10 or 15, and my intention is to not have to write a separate delete wrapper for all of them. If you can show me how to write one delete wrapper that frees any pointer without being told its type, and is fast too, that would be awesome.
Those are my intentions and I know you great programmers can help me with that solution:)...in a nutshell...I have CvSVMParams*, Brisk*, RotatedRect*, CVANN_MLP* pointers there are a few others as well that all need to be free'd with one wrapper...one go to wrapper for C++'s delete that would free anything...Any help at this is greatly valued.
void delete_ptr(void* ptr) {
delete (Mat*)ptr;
}
Edit: I need one of the two of you I messaged to tell me exactly how to run your posted code. The registry version doesn't work when I place it above main in Emacs, compile it with g++ and run it with Free(cv_create_Mat);. A new Mat* creator together with the stub gives 5 error messages when run the same way. I need exact compile instructions; my intention is to be able to compile this to a .so file. This post has received a lot of attention, though, and I do appreciate it. Thank you.
How about this, and then let the compiler deal with all the specializations:
template <typename T>
void delete_ptr(T *ptr) {
delete ptr;
}
The delete operator doesn't just free memory, it also calls destructors, and it has to be called on a typed pointer (not void*) so that it knows which class's destructor to call. You'll need a separate wrapper for each type.
For POD types that don't have destructors, you can allocate with malloc() instead of new, so that the caller can just use free().
I would advise against having a generic delete_ptr function.
Since creation and deletion come in pairs, I would create one for creation and for deletion of specific types.
Mat* cv_create_Mat();
void cv_delete_Mat(Mat*);
If you do this, there will be less ambiguity about the kind of object you are dealing with. Plus, the implementation of cv_delete_Mat(Mat*) will be less error prone and has to assume less.
void cv_delete_Mat(Mat* m)
{
delete m;
}
Generic operations like this can only be implemented in C by removing type information (void*) or by individually ensuring all of the wrapper functions exist.
C's ABI doesn't allow function overloading, and the C++ delete keyword is exactly the sort of generic wrapper you are asking for.
That said, there are things you can do, but none of them are any simpler than what you are already proposing. Any generic C++ code you write will be uncallable from C.
You could add members to your objects which know how to destroy themselves, e.g.
class CVersionOfFoo : public Foo {
...
static void deleter(CVersionOfFoo* p) { delete p; }
};
But that's not accessible from C.
Your last option is to set up some form of manual registry, where objects register their pointer along with a delete function. But that's going to be more work and harder to debug than just writing wrappers.
--- EDIT ---
Registry example; if you have C++11:
#include <functional>
#include <map>
/* not thread safe */
typedef std::map<void*, std::function<void(void*)>> ObjectMap;
static ObjectMap s_objectMap;
struct Foo {
int i;
};
struct Bar {
char x[30];
};
template<typename T>
T* AllocateAndRegister() {
T* t = new T();
s_objectMap[t] = [](void* ptr) { delete reinterpret_cast<T*>(ptr); };
return t;
}
Foo* AllocateFoo() {
return AllocateAndRegister<Foo>();
}
Bar* AllocateBar() {
return AllocateAndRegister<Bar>();
}
void Free(void* ptr) {
auto it = s_objectMap.find(ptr);
if (it != s_objectMap.end()) {
it->second(ptr); // call the lambda
s_objectMap.erase(it);
}
}
If you don't have C++11... You'll have to create a delete function.
As I said, it's more work than the wrappers you were creating.
It's not a case of C++ can't do this - C++ is perfectly capable of doing this, but you're trying to do this in C and C does not provide facilities for doing this automatically.
The core problem is that delete in C++ requires a type, and passing a pointer through a C interface loses that type. The question is how to recover that type in a generic way. Here are some choices.
Bear in mind that delete does two things: call the destructor and free the memory.
1. Separate functions for each type. Last resort, what you're trying to avoid.
2. For types that have a trivial destructor, you can cast your void pointer to anything you like because all it does is free the memory. That reduces the number of functions. [This is undefined behaviour, but it should work.]
3. Use run-time type information to recover the type_info of the pointer, and then dynamic cast it to the proper type to delete.
4. Modify your create functions to store the pointer in a dictionary with its type_info. On delete, retrieve the type and use it with dynamic cast to delete the pointer.
For all that I would probably use option 1 unless there were hundreds of the things. You could write a C++ template with explicit instantiation to reduce the amount of code needed, or a macro with token pasting to generate unique names. Here is an example (edited):
#define stub(T) T* cv_create_ ## T() { return new T(); } \
void cv_delete_ ## T(void *p) { delete (T*)p; }
stub(Mat);
stub(Brisk);
One nice thing about the dictionary approach is for debugging. You can track new and delete at run-time and make sure they match correctly. I would pick this option if the debugging was really important, but it takes more code to do.
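To see that per-type wrappers generated by a macro really run the right destructor, here is a self-contained sketch. Widget and Gadget are hypothetical stand-ins for the OpenCV classes (so it builds without OpenCV), and their `live` counters exist only for the check. The macro is the same idea as stub above, with `extern "C"` added so the generated functions get C linkage.

```cpp
#include <cassert>

// Hypothetical stand-ins for OpenCV classes; each counts its live instances.
struct Widget {
    static int live;
    Widget() { ++live; }
    ~Widget() { --live; }
};
struct Gadget {
    static int live;
    Gadget() { ++live; }
    ~Gadget() { --live; }
};
int Widget::live = 0;
int Gadget::live = 0;

// Token pasting generates one create/delete pair per type.
// The delete wrapper takes void* but casts back to the exact type,
// so the correct destructor runs before the memory is freed.
#define STUB(T)                                              \
    extern "C" T* cv_create_##T() { return new T(); }        \
    extern "C" void cv_delete_##T(void* p) { delete static_cast<T*>(p); }

STUB(Widget)
STUB(Gadget)

// Creates one object of each type, destroys both, and reports what is left.
inline int live_after_round_trip() {
    Widget* w = cv_create_Widget();
    Gadget* g = cv_create_Gadget();
    assert(Widget::live == 1 && Gadget::live == 1);
    cv_delete_Widget(w);
    cv_delete_Gadget(g);
    return Widget::live + Gadget::live;
}
```

The cost is one line per wrapped type, which for 10 or 15 types is probably less work than a registry.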

Different behaviors for different size in C++ (Firebreath source code)

I ran into something confusing while going through the source code of FireBreath (src/ScriptingCore/Variant.h):
// function pointer table
struct fxn_ptr_table {
const std::type_info& (*get_type)();
void (*static_delete)(void**);
void (*clone)(void* const*, void**);
void (*move)(void* const*,void**);
bool (*less)(void* const*, void* const*);
};
// static functions for small value-types
template<bool is_small>
struct fxns
{
template<typename T>
struct type {
static const std::type_info& get_type() {
return typeid(T);
}
static void static_delete(void** x) {
reinterpret_cast<T*>(x)->~T();
}
static void clone(void* const* src, void** dest) {
new(dest) T(*reinterpret_cast<T const*>(src));
}
static void move(void* const* src, void** dest) {
reinterpret_cast<T*>(dest)->~T();
*reinterpret_cast<T*>(dest) = *reinterpret_cast<T const*>(src);
}
static bool lessthan(void* const* left, void* const* right) {
T l(*reinterpret_cast<T const*>(left));
T r(*reinterpret_cast<T const*>(right));
return l < r;
}
};
};
// static functions for big value-types (bigger than a void*)
template<>
struct fxns<false>
{
template<typename T>
struct type {
static const std::type_info& get_type() {
return typeid(T);
}
static void static_delete(void** x) {
delete(*reinterpret_cast<T**>(x));
}
static void clone(void* const* src, void** dest) {
*dest = new T(**reinterpret_cast<T* const*>(src));
}
static void move(void* const* src, void** dest) {
(*reinterpret_cast<T**>(dest))->~T();
**reinterpret_cast<T**>(dest) = **reinterpret_cast<T* const*>(src);
}
static bool lessthan(void* const* left, void* const* right) {
return **reinterpret_cast<T* const*>(left) < **reinterpret_cast<T* const*>(right);
}
};
};
template<typename T>
struct get_table
{
static const bool is_small = sizeof(T) <= sizeof(void*);
static fxn_ptr_table* get()
{
static fxn_ptr_table static_table = {
fxns<is_small>::template type<T>::get_type
, fxns<is_small>::template type<T>::static_delete
, fxns<is_small>::template type<T>::clone
, fxns<is_small>::template type<T>::move
, fxns<is_small>::template type<T>::lessthan
};
return &static_table;
}
};
The question is why the implementation of static functions for the big value-types (bigger than void*) is different from the small ones.
For example, static_delete for small value-type is just to invoke destructor on T instance, while that for big value-type is to use 'delete'.
Is there some trick? Thanks in advance.
It looks like Firebreath uses a dedicated memory pool for its small objects, while large objects are allocated normally in the heap. Hence the different behaviour. Notice the placement new in clone() for small objects, for instance: this creates the new object in a specified memory location without allocating it. When you create an object using placement new, you must explicitly call the destructor on it before deallocating memory, and this is what static_delete() does.
Memory is not actually deallocated because, as I say, it looks like a dedicated memory pool is in use. Memory management must be performed somewhere else. This kind of memory pool is a common optimisation for small objects.
What does the internal documentation say? If the author hasn't
documented it, he probably doesn't know himself.
Judging from the code, the interface to small objects is different from
that to large objects; the pointer you pass for a small object is a
pointer to the object itself, whereas the one you pass for a large
object is a pointer to a pointer to the object.
The author, however, doesn't seem to know C++ very well (and I would
avoid using any code like this). For example, in move, he explicitly
destructs the object, then assigns to it: this is guaranteed undefined
behavior, and probably won't work reliably for anything but the simplest
built-in types. Also the distinction small vs. large objects is largely
irrelevant; some “small” objects can be quite expensive to
copy. And of course, given that everything here is a template anyway,
there's absolutely no reason to use void* for anything.
I have edited your question to include a link to the original source file, since obviously most of those answering here have not read it to see what is actually going on. I admit that this is probably one of the most confusing pieces of code in FireBreath; at the time, I was trying to avoid using boost and this has worked really well.
Since then I've considered switching to boost::any (for those itching to suggest it, no, boost::variant wouldn't work and I'm not going to explain why here; ask another question if you really care) but we have customized this class a fair amount to make it exactly what we need and boost::any would be difficult to customize in a similar manner. More than anything, we've been following the old axiom: if it ain't broke, don't fix it!
First of all, you should know that several C++ experts have gone over this code; yes, it uses some practices that many consider dubious, but they are very carefully considered and they are consistent and reliable on the compilers supported by FireBreath. We have done extensive testing with valgrind, visual leak detector, LeakFinder, and Rational Purify and have never found any leaks in this code. It is more than a bit confusing; it's amazing to me that people who don't understand code assume the author doesn't know C++. In this case, Christopher Diggins (who wrote the code you quoted and the original cdiggins::any class that this is taken from) seems to know C++ extremely well as evidenced by the fact that he was able to write this code. The code is used internally and is highly optimized -- perhaps more than FireBreath needs, in fact. However, it has served us well.
I will try to explain the answer to your question as best I remember; keep in mind that I don't have a lot of time and it's been a while since I really dug in deep with this. The main reason for "small" types using a different static class is that "small" types are pretty much built-in types; an int, a char, a long, etc. Anything bigger than void* is assumed to be an object of some sort. This is an optimization to allow it to reuse memory whenever possible rather than deleting and reallocating it.
If you look at the code side-by-side it's a lot clearer. If you look at delete and clone you'll see that on "large" objects it's dynamically allocating the memory; it calls "delete" in delete and in clone it uses a regular "new". In the "small" version it just stores the memory internally and reuses it; it never "delete"s the memory, it just calls the destructor or the constructor of the correct type on the memory that it has internally. Again, this is just done for the sake of efficiency. In move on both types it calls the destructor of the old object and then assigns the new object data.
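The small/large split can be reduced to the following sketch, which is not FireBreath's code: storage, fits_in_slot and round_trip are hypothetical names. A value that fits in a void* is constructed directly inside the pointer-sized slot with placement new and only has its destructor called; a larger value lives on the heap and the slot holds a pointer to it. Reading the small value back through reinterpret_cast relies on behaviour that mainstream compilers accept but that a strict reading of the standard would want std::launder for.

```cpp
#include <cassert>
#include <new>
#include <string>

// True when T can live directly inside a pointer-sized slot.
template <typename T>
struct fits_in_slot {
    static const bool value =
        sizeof(T) <= sizeof(void*) && alignof(T) <= alignof(void*);
};

template <bool Small>
struct storage;

// Small types: the slot itself is the object's storage.
template <>
struct storage<true> {
    template <typename T>
    static void create(void** slot, T const& v) { new (static_cast<void*>(slot)) T(v); }
    template <typename T>
    static T& get(void** slot) { return *reinterpret_cast<T*>(slot); }
    template <typename T>
    static void destroy(void** slot) { reinterpret_cast<T*>(slot)->~T(); }  // no free
};

// Large types: the slot holds a pointer to a heap object.
template <>
struct storage<false> {
    template <typename T>
    static void create(void** slot, T const& v) { *slot = new T(v); }
    template <typename T>
    static T& get(void** slot) { return *static_cast<T*>(*slot); }
    template <typename T>
    static void destroy(void** slot) { delete static_cast<T*>(*slot); *slot = 0; }
};

// Stores a value in a slot, reads it back, and cleans up.
template <typename T>
T round_trip(T const& v) {
    typedef storage<fits_in_slot<T>::value> S;
    void* slot = 0;
    S::template create<T>(&slot, v);
    T copy = S::template get<T>(&slot);
    S::template destroy<T>(&slot);
    return copy;
}

inline void storage_self_check() {
    assert(round_trip(42) == 42);   // int is stored inside the slot
    assert(round_trip(std::string("larger than one pointer")) ==
           "larger than one pointer");   // std::string goes to the heap
}
```

This also explains the different pointer levels in the original: for small types a void** is really a pointer to the object, for large types it is a pointer to a pointer.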
The object itself is stored as a void* because we don't actually know what type the object will be; to get the object back out you have to specify the type, in fact. This is part of what allows the container to hold absolutely any type of data. That is the reason there are so many reinterpret_cast calls there -- many people see that and say "oh, no! The author must be clueless!" However, when you have a void* that you need to dereference, that's exactly the operator that you would use.
Anyway, all of that said, cdiggins has actually put out a new version of his any class this year; I'll need to take a look at it and probably will try to pull it in to replace the current one. The trick is that I have customized the current one (primarily to add a comparison operator so it can be put in a STL container and to add convert_cast) so I need to make sure I understand the new version well enough to do that safely.
Hope that helps; the article I got it from is here: http://www.codeproject.com/KB/cpp/dynamic_typing.aspx
Note that the article has been updated and it doesn't seem to be possible to get to the old one with the original anymore.
EDIT
Since I wrote this we have confirmed some issues with the old variant class and it has been updated and replaced with one that utilizes boost::any. Thanks to dougma for most of the work on this. FireBreath 1.7 (current master branch as of the time of this writing) contains that fix.

Alternative to boost::shared_ptr in an embedded environment

I'm using C++ in an embedded linux environment which has GCC version 2.95.
I just can't extract boost::shared_ptr files with bcp, it is just too heavy.
What I'd like would be a simple smart pointer implementation of boost::shared_ptr but without all boost overheads (if it is possible...).
I could come up with my own version reading boost source but I fear missing one or more points, it seems easy to make a faulty smart pointer and I can't afford to have a buggy implementation.
So, does a "simple" implementation or implementation example of boost::shared_ptr (or any reference counting equivalent smart pointer) exists that I could use or that I could take as an inspiration?
If you don't need to mix shared and weak pointers, and don't need custom deleters, you can just use the quick and dirty my_shared_ptr:
template<class T>
class my_shared_ptr
{
template<class U>
friend class my_shared_ptr;
public:
my_shared_ptr() :p(), c() {}
explicit my_shared_ptr(T* s) :p(s), c(new unsigned(1)) {}
my_shared_ptr(const my_shared_ptr& s) :p(s.p), c(s.c) { if(c) ++*c; }
my_shared_ptr& operator=(const my_shared_ptr& s)
{ if(this!=&s) { clear(); p=s.p; c=s.c; if(c) ++*c; } return *this; }
template<class U>
my_shared_ptr(const my_shared_ptr<U>& s) :p(s.p), c(s.c) { if(c) ++*c; }
~my_shared_ptr() { clear(); }
void clear()
{
if(c)
{
if(*c==1) delete p;
if(!--*c) delete c;
}
c=0; p=0;
}
T* get() const { return (c)? p: 0; }
T* operator->() const { return get(); }
T& operator*() const { return *get(); }
private:
T* p;
unsigned* c;
};
For anyone interested in make_my_shared<X>, it can be trivially implemented as
template<class T, class... U>
auto make_my_shared(U&&... u)
{
return my_shared_ptr<T>(new T{std::forward<U>(u)...});
}
to be called as
auto pt = make_my_shared<T>( ... );
There is also std::tr1::shared_ptr, which is TR1's adoption of Boost's shared_ptr (later standardized as std::shared_ptr in C++11). You could pick that if it is allowed, or write your own with reference counting.
I'd suggest you can use shared_ptr in isolation. However, if you are looking for simple implementation
with some unit tests
not thread safe
basic support for polymorphic assignment <-- let me know if you're interested
custom deleters (i.e. array deleters, or custom functors for special resources)
Have a look here: Creating a non-thread safe shared_ptr
Which "boost overheads" are you concerned about, and in what way is shared_ptr "too heavy" for your application? There are no overheads simply from using Boost (at least if you only use header-only libraries, such as the smart pointer library); the only overheads I can think of concerning shared_ptr are:
Thread-safe reference counting; since you're using an ancient compiler, I assume you also have an ancient runtime library and kernel, so this may be very inefficient. If your application is single threaded, then you can disable this by #defining BOOST_DISABLE_THREADS.
A separate allocation for the reference-counting structure, in addition to the managed object. You can eliminate the extra allocation by creating the object using make_shared() or allocate_shared().
In many cases, you can eliminate the overhead from speed-critical code by not creating objects or copying shared pointers - pass pointers by reference and only copy them when you actually need to.
If you need thread safety for some pointers, but not others, and (after profiling, and removing all unnecessary allocations and pointer copying) you find that the use of shared pointers is still causing significant overhead, then you could consider using intrusive_ptr, and manage your own reference counting within the object.
There may also be benefit to updating to a modern GNU/Linux version, if that's feasible. Thread synchronisation is much more efficient since the introduction of futexes in Linux 2.6. You may find this helps in other ways too; there have been many improvements over the last decade. A more modern compiler would also provide the standard (TR1 or C++11) shared pointer, so you wouldn't need Boost for that.
You can use shared_ptr without all the Boost overhead: it's a header-only implementation. If you don't use any other classes, only shared_ptr will be compiled in.
The implementation of shared_ptr is already pretty lean, but if you want to avoid the intermediate reference count block and a (potential) virtual call to deleter function, you can use boost::intrusive_ptr instead, which is more suited for embedded environments: it operates on a reference counter embedded in the object itself, you just have to provide a couple of functions to increment/decrement it. The downside is that you won't be able to use weak_ptr.
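To illustrate what "a reference counter embedded in the object itself" means, here is a hand-rolled sketch in the spirit of intrusive_ptr. It is not Boost's implementation: RefCounted, intrusive_handle and Sensor are hypothetical names, the count is not thread safe, and the code sticks to C++98 constructs on the assumption that an old compiler is the target.

```cpp
#include <cassert>

// Base class that carries the count inside the managed object.
class RefCounted {
public:
    RefCounted() : refs_(0) {}
    void add_ref() { ++refs_; }
    bool release() { return --refs_ == 0; }   // true when the last owner is gone
    unsigned use_count() const { return refs_; }
protected:
    ~RefCounted() {}                          // only deleted through derived types
private:
    RefCounted(RefCounted const&);            // non-copyable (declared, not defined)
    RefCounted& operator=(RefCounted const&);
    unsigned refs_;
};

// Smart pointer that needs no separate control block.
template <class T>
class intrusive_handle {
public:
    intrusive_handle() : p_(0) {}
    explicit intrusive_handle(T* p) : p_(p) { if (p_) p_->add_ref(); }
    intrusive_handle(intrusive_handle const& o) : p_(o.p_) { if (p_) p_->add_ref(); }
    intrusive_handle& operator=(intrusive_handle const& o) {
        intrusive_handle tmp(o);   // copy-and-swap handles self-assignment
        swap(tmp);
        return *this;
    }
    ~intrusive_handle() { if (p_ && p_->release()) delete p_; }

    void swap(intrusive_handle& o) { T* t = p_; p_ = o.p_; o.p_ = t; }
    T* get() const { return p_; }
    T* operator->() const { return p_; }
    T& operator*() const { return *p_; }
private:
    T* p_;
};

struct Sensor : RefCounted {
    static int alive;              // live instances (demo only)
    Sensor() { ++alive; }
    ~Sensor() { --alive; }
};
int Sensor::alive = 0;

// Two handles share one object; it is destroyed when both are gone.
inline int alive_after_scope() {
    {
        intrusive_handle<Sensor> a(new Sensor);
        intrusive_handle<Sensor> b(a);
        assert(a->use_count() == 2);
        assert(a.get() == b.get());
    }
    return Sensor::alive;
}
```

Compared with my_shared_ptr above, there is one allocation per object instead of two, at the price of every managed type having to derive from (or otherwise embed) the counter.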
I can't comment how well gcc 2.95 would inline/fold template instantiations (it's a very old compiler), more recent versions of gcc handle it rather well, so you are on your own here.