Recently I discovered the Boost Pool library and started adapting it to my code. One thing the library's documentation mentions as missing is a base class that would override the new/delete operators for any class and use the pool for memory management. I wrote my own implementation, and with some template metaprogramming it actually came out looking very decent (any class with a size between 1 and 1024 bytes is supported simply by deriving from the base class).
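For reference, a minimal sketch of the idea (boost::singleton_pool is the real Boost API; the CRTP base class and its names here are my own, and my actual version adds the size bucketing mentioned above):

#include <boost/pool/singleton_pool.hpp>
#include <cstddef>
#include <new>

// CRTP base: each derived type gets its own singleton pool,
// sized exactly for the derived class.
template <class Derived>
struct PoolAllocated
{
    static void* operator new(std::size_t n)
    {
        typedef boost::singleton_pool<PoolAllocated, sizeof(Derived)> pool;
        if (n != sizeof(Derived))     // e.g. a further-derived class
            return ::operator new(n); // fall back to the global heap
        if (void* p = pool::malloc())
            return p;
        throw std::bad_alloc();
    }

    static void operator delete(void* p, std::size_t n)
    {
        typedef boost::singleton_pool<PoolAllocated, sizeof(Derived)> pool;
        if (n != sizeof(Derived))
            ::operator delete(p);
        else
            pool::free(p);
    }
};

struct Widget : PoolAllocated<Widget> { int a, b, c, d; };
// new Widget comes from the pool; delete returns it to the pool.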
I mention these things because so far this was really cool and exciting, and then I found this post from the Boost mailing list. It appears some people really hammer the Pool library, and in particular they point out the inefficiency of the free() method, which they say runs in O(n) time. I stepped through the code and found this to be the implementation of that method:
void free(void * const chunk)
{
  // Push the chunk onto the head of the free list.
  nextof(chunk) = first;
  first = chunk;
}
To me this looks like O(1), and I really don't see the inefficiency they are talking about. One thing I did notice is that if you are using multiple instances of singleton_pool (i.e. different tags and/or allocation sizes), they all share the same mutex (a critical section, to be more precise), and this could be optimized a bit. But if you were using regular heap operations, they would use the same form of synchronization anyway.
So, does anyone else consider the Pool library to be inefficient and obsolete?
That free sure does look constant time to me. Perhaps the author of the post was referring to ordered_free, which has this implementation:
void ordered_free(void * const chunk)
{
  // This (slower) implementation of 'free' places the memory
  // back in the list in its proper order.

  // Find where "chunk" goes in the free list
  void * const loc = find_prev(chunk);

  // Place either at beginning or in middle/end
  if (loc == 0)
    (free)(chunk);
  else
  {
    nextof(chunk) = nextof(loc);
    nextof(loc) = chunk;
  }
}
Here find_prev is as follows:
template <typename SizeType>
void * simple_segregated_storage<SizeType>::find_prev(void * const ptr)
{
  // Handle border case
  if (first == 0 || std::greater<void *>()(first, ptr))
    return 0;
  void * iter = first;
  while (true)
  {
    // if we're about to hit the end or
    // if we've found where "ptr" goes
    if (nextof(iter) == 0 || std::greater<void *>()(nextof(iter), ptr))
      return iter;
    iter = nextof(iter);
  }
}
Since find_prev walks the free list from its head, ordered_free is O(n) in the number of free chunks, while the plain free shown in the question really is O(1).
I am relatively new to modern C++ and working with a foreign code base. There is a function that takes a std::unordered_map and checks whether a key is present in the map. The code is roughly as follows:
uint32_t getId(std::unordered_map<uint32_t, uint32_t> &myMap, uint32_t id)
{
    if (myMap.contains(id))
    {
        return myMap.at(id);
    }
    else
    {
        std::cerr << "\n\n\nOut of Range error for map: " << id << "\t not found" << std::flush;
        exit(74);
    }
}
It seems like calling contains() followed by at() is inefficient, since it requires a double lookup. So my question is: what is the most efficient way to accomplish this? I also have a follow-up question: assuming the map is fairly large (~60k elements) and this method gets called frequently, how problematic is the above approach?
After some searching, it seems like the following paradigms are more efficient than the above, but I am not sure which would be best.
Calling myMap.at() inside a try-catch construct
Pros: at automatically throws an exception if the key does not exist
Cons: try-catch is apparently fairly costly, and it also constrains what the optimizer can do with the code
Using find
Pros: one call, no try-catch overhead
Cons: involves using an iterator; more overhead than just returning the value
auto findit = myMap.find(id);
if (findit == myMap.end())
{
    // error message;
    exit(74);
}
else
{
    return findit->first;
}
You can do
// stuff before
{
    auto findit = myMap.find(id);
    if (findit != myMap.end()) {
        return findit->first;
    } else {
        exit(74);
    }
}
// stuff after
or, with the new C++17 init-statement syntax:
// stuff before
if (auto findit = myMap.find(id); findit != myMap.end()) {
    return findit->first;
} else {
    exit(74);
}
// stuff after
Both define the iterator only in local scope. As the iterator use is almost certainly optimized away, I would go with this. Doing a second hash calculation will be slower almost for sure.
Also note that findit->first returns the key, not the value. I was not sure what you expect the code to do, but one of the code snippets in the question returns the value, while the other one returns the key.
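If the value is what is wanted, a corrected sketch (keeping the question's exit(74) error handling) would be:

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <unordered_map>

uint32_t getId(const std::unordered_map<uint32_t, uint32_t>& myMap, uint32_t id)
{
    auto findit = myMap.find(id);
    if (findit == myMap.end())
    {
        std::cerr << "Out of Range error for map: " << id << " not found" << std::flush;
        std::exit(74);
    }
    return findit->second; // the mapped value, not the key
}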
In case you don't get enough speedup from removing the extra lookup alone, and there are millions of calls to getId in a multi-threaded program, you can use an N-way map to parallelize the id checks:
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

template<int N>
class NwayMap
{
public:
    NwayMap(uint32_t hintMaxSize = 60000)
    {
        // hint about max size to optimize initial allocations
        for (int i = 0; i < N; i++)
            shard[i].reserve(hintMaxSize / N);
    }

    void addIdValuePairThreadSafe(const uint32_t id, const uint32_t val)
    {
        // select shard
        const uint32_t selected = id % N; // can do id & (N-1) for a power-of-2 N value
        std::lock_guard<std::mutex> lg(mut[selected]);
        auto it = shard[selected].find(id);
        if (it == shard[selected].end())
        {
            shard[selected].emplace(id, val);
        }
        else
        {
            // already added, update?
        }
    }

    uint32_t getIdMultiThreadSafe(const uint32_t id)
    {
        // select shard
        const uint32_t selected = id % N; // can do id & (N-1) for a power-of-2 N value
        // lock only the selected shard; others can work in parallel
        std::lock_guard<std::mutex> lg(mut[selected]);
        auto it = shard[selected].find(id);
        // we expect the id to be found, so take the likely branch first
        if (it != shard[selected].end())
        {
            return it->second;
        }
        else
        {
            exit(74);
        }
    }

private:
    std::unordered_map<uint32_t, uint32_t> shard[N];
    std::mutex mut[N];
};
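Hypothetical usage, so the pros and cons below have something concrete to refer to:

NwayMap<8> map;                             // 8 shards, 8 independent locks
map.addIdValuePairThreadSafe(42u, 7u);
uint32_t v = map.getIdMultiThreadSafe(42u); // returns 7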
Pros:
if you serve each shard's getId calls from their own CPU threads, then you benefit from N times the L1 cache size.
even in the single-threaded case, you can still interleave multiple id-check operations and benefit from instruction-level parallelism, because checking id 0 takes an independent code path from checking id 1, and the CPU can execute them out of order (if the pipeline is long enough).
Cons:
if many checks from different threads collide on a shard, their operations are serialized and the locking mechanism adds extra latency.
when id values are mostly strided, the parallelization is inefficient due to unbalanced placement across shards.
Calling myMap.at() inside a try-catch construct
Pros: at automatically throws an exception if the key does not exist
Cons: try-catch is apparently fairly costly, and it also constrains what the optimizer can do with the code
Your implementation of getId terminates the application, so who cares about exception overhead?
Please note that most compilers (AFAIK all) implement C++ exceptions so that they cost nothing as long as no exception is thrown. The penalty comes when an exception is thrown, the stack is unwound, and a matching handler is located. I read somewhere that a thrown exception costs roughly 40x as much as unwinding the stack by simple returns (with possible error codes).
Since you just want to terminate the application, this overhead is negligible.
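In other words, something like this sketch would serve (same hypothetical getId signature as in the question):

#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <unordered_map>

uint32_t getId(const std::unordered_map<uint32_t, uint32_t>& myMap, uint32_t id)
{
    try
    {
        return myMap.at(id); // throws std::out_of_range if id is missing
    }
    catch (const std::out_of_range&)
    {
        std::exit(74); // the costly path runs at most once, right before exiting
    }
}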
Technically, in C++ we have the possibility to use curly braces to declare a new scope. For example, in this function, which swaps two numbers,
void swap_int(int& first, int& second)
{
    int temp = first;
    first = second;
    second = temp;
}
we could also declare temp inside its own block:
void swap_int(int& first, int& second)
{
    // Do stuff...
    {
        int temp = first;
        first = second;
        second = temp;
    }
    // Do other stuff...
}
This obviously has the advantage that temp goes out of scope as soon as it is no longer needed.
However, in the code I write, I never use this. Also, in code from third-party libraries, I almost never see it at all.
Why is it not widely used? Does it bring any performance increase at all, or does it just mean additional typing work?
I don't see anything wrong per se with naked braces. They're a part of the language, and they're well defined. Historically, one place I have found them useful is when working with code that uses status codes instead of exceptions, while keeping const goodness:
const StatusCode statusCode = DoThing();
if (statusCode == STATUS_SUCCESS)
    Foo();
else
    Bar();

const StatusCode statusCode2 = DoAnotherThing(); // Eww, variable name.
...
The alternative would be:
{
    const StatusCode statusCode = DoThing();
    if (statusCode == STATUS_SUCCESS)
        Foo();
    else
        Bar();
}
{
    // Same variable name, used for the same purpose, easy to
    // find/replace, and has const guarantees. Great success.
    const StatusCode statusCode = DoAnotherThing();
    ...
}
The same applies to objects that use RAII, such as thread lockers (mutex objects, semaphores, etc.), or generally any kind of resource that you may want to have an extremely short lifetime (file handles, for example).
Personally, I think the reason it's rare is that it can be indicative of a code smell (although not always). Where there are naked braces, there may be an opportunity to factor out a function.
To take your example: if there is more than one job for swap_int, then the function is doing more than one thing. By extracting the actual swap code into another function, you can encourage reuse! For example:
template <typename T>
void swap_anything(T& first, T& second)
{
    T temp = first;
    first = second;
    second = temp;
}

// -------------------------------------------

void swap_int(int& first, int& second)
{
    // Do stuff...
    swap_anything(first, second);
    // Do other stuff...
}
Sometimes it's good practice (although I dislike that term, as it's subjective and context-specific), but like many things in life, taking it to the extreme (on either end of the spectrum) is a bad idea.
You'll see new scopes introduced sometimes in C++ where RAII is important, like when dealing with thread locks. Sometimes the precise moment of when an object is created and destroyed is very important and needs to be controlled. In those situations, introducing a new scope is a very useful and often-used way of accomplishing this.
But that's not frequently the case. Most of the objects we (the broad programming community) use don't have strict lifetimes that need to be so carefully managed. As such, it's not worth arbitrarily introducing new scopes to manage lifetimes of objects whose lifetimes aren't worth micromanaging.
If you do, you'll decrease the signal-to-noise ratio, and people will have a hard time telling which scopes are introduced to carefully control important resources and which are not. That can make it easy to introduce bugs when refactoring or developing code across teams. At the very least, you'll make programming a whole lot more tedious, which sucks, and you should generally avoid that if you can, or else a violent psychopath may take it out on you.
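For illustration, a minimal sketch of the thread-lock case (the mutex, the shared value, and the slow work are placeholders):

#include <mutex>

std::mutex m;
int shared_value = 0;

void update(int v)
{
    // The extra scope keeps the lock held only for the assignment...
    {
        std::lock_guard<std::mutex> lock(m);
        shared_value = v;
    } // ...and releases it here, before the slow work below runs.

    // ... do slow work that must not hold the lock ...
}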
I am looking at the Roslyn September 2012 CTP with Reflector, and I noticed the following depth-first traversal of the syntax tree:
private IEnumerable<CommonSyntaxNode> DescendantNodesOnly(TextSpan span,
    Func<CommonSyntaxNode, bool> descendIntoChildren, bool includeSelf)
{
    if (includeSelf && IsInSpan(span, FullSpan))
    {
        yield return this;
    }

    if ((descendIntoChildren != null) && !descendIntoChildren(this))
    {
        yield break;
    }

    var queue = new Queue<StrongBox<IEnumerator<CommonSyntaxNode>>>();
    var stack = new Stack<StrongBox<IEnumerator<CommonSyntaxNode>>>();
    stack.Push(new StrongBox<IEnumerator<CommonSyntaxNode>>(ChildNodes().GetEnumerator()));
    while (stack.Count > 0)
    {
        var enumerator = stack.Peek();
        StrongBox<IEnumerator<CommonSyntaxNode>> childEnumerator;
        if (enumerator.Value.MoveNext())
        {
            var current = enumerator.Value.Current;
            if (IsInSpan(span, current.FullSpan))
            {
                yield return current;
                if ((descendIntoChildren == null) || descendIntoChildren(current))
                {
                    childEnumerator = queue.Count == 0
                        ? new StrongBox<IEnumerator<CommonSyntaxNode>>()
                        : queue.Dequeue();
                    childEnumerator.Value = current.ChildNodes().GetEnumerator();
                    stack.Push(childEnumerator);
                }
            }
        }
        else
        {
            childEnumerator = stack.Pop();
            childEnumerator.Value = null;
            queue.Enqueue(childEnumerator);
        }
    }
}
I am guessing that the queue is there to spare the runtime from allocating and deallocating so many instances of IEnumerator<CommonSyntaxNode>.
However, I am not sure why IEnumerator<CommonSyntaxNode> is wrapped in StrongBox<>. What sort of performance and safety trade-offs are involved in wrapping IEnumerator<CommonSyntaxNode>, which is usually a value type, inside the reference type StrongBox<>?
CommonSyntaxNode is an abstract class which contains a lot of value types and can be inherited into a big object.
IEnumerator<CommonSyntaxNode> contains only a reference to a CommonSyntaxNode, so it seems like the size of the CommonSyntaxNode won't affect the enumerator's size. But:
since the enumerator's MoveNext() is implemented with yield return, each iteration causes the method to save its state until the next iteration;
since that saved method state is heavy enough, and it might hold CommonSyntaxNode properties in order to do the MoveNext() logic, the whole IEnumerator<CommonSyntaxNode> might be pretty heavy on memory.
Using StrongBox<> means the Queue or Stack holds only a small reference object (the StrongBox<>) instead of the potentially memory-heavy IEnumerator<CommonSyntaxNode>; therefore the GC can reclaim the contained IEnumerator<CommonSyntaxNode> sooner, reducing the application's total memory consumption.
Note that CommonSyntaxNode's enumerator is a struct, so working with it directly means copying the whole struct. It's a small struct, so it's not really heavy, but still...
The advantage of StrongBox<T> is that once an item is dequeued, the StrongBox clears out its internal content, so the GC can collect the instance of T being held by the StrongBox, and the Queue<T> ends up holding just an instance of StrongBox (instead of an instance of T).
The use of IEnumerator was a mistake. The code should have been using ChildSyntaxList.Enumerator, which is a struct. The use of StrongBox is for perf, to keep from needing to push & pop the enumerators from the end of the stack when they change.
I have code with a large number of mallocs and device-specific API mallocs (I'm programming on a GPU, so cudaMalloc).
Basically, the beginning of my code is a big smorgasbord of allocation calls, while my closing section is deallocation calls.
As I've encapsulated my global data in structures, the deallocations are quite long, but at least I can break them out into a separate function. Still, I would like a shorter solution. Additionally, an automatic deallocator would reduce the risk of the memory leaks I'd create by forgetting to write the matching deallocation for something in the global allocator function.
I was wondering whether it'd be possible to write some sort of templated class wrapper that allows me to "register" variables during the malloc/cudaMalloc process, and then at the end of the simulation do a mass loop-based deallocation (deregistration). To be clear, I don't want to type out individual deallocations (free/cudaFree), because again this is long and undesirable, and the assumption is that anything I register won't be deallocated until the device simulation is complete and main is terminating.
A benefit here is that if I register a new simulation-duration variable, it will be deallocated automatically, so there's no danger of me forgetting to deallocate it and creating a memory leak.
Is such a wrapper possible?
Would you suggest doing it?
If so, how?
Thanks in advance!
An idea:
Create two functions. The first allocates memory and returns valid pointers after registering them in a "list" of allocated pointers; the second loops over this list and deallocates them all:
// ask for a new allocated pointer that is registered automatically in the list of pointers
pointer1 = allocatePointer(size, listOfPointers);
pointer2 = allocatePointer(size, listOfPointers);
...

// deallocate all pointers
deallocatePointers(listOfPointers);

You may even use different pointer lists, depending on your simulation scope:

listOfPointers1 = getNewListOfPointers();
listOfPointers2 = getNewListOfPointers();
...

p1 = allocatePointer(size, listOfPointers1);
p2 = allocatePointer(size, listOfPointers2);
...

deallocatePointers(listOfPointers1);
...
deallocatePointers(listOfPointers2);
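A concrete sketch of that idea in C++ (the class and its names are hypothetical; cudaMalloc/cudaFree are the real CUDA runtime calls):

#include <cstddef>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical registry: every allocation is recorded, and everything
// is freed in one place when the registry goes out of scope.
class PointerRegistry
{
public:
    void* allocateHost(std::size_t bytes)
    {
        void* p = std::malloc(bytes);
        if (p) hostPtrs.push_back(p);
        return p;
    }

    void* allocateDevice(std::size_t bytes)
    {
        void* p = NULL;
        if (cudaMalloc(&p, bytes) != cudaSuccess) return NULL;
        devicePtrs.push_back(p);
        return p;
    }

    ~PointerRegistry()
    {
        for (std::size_t i = 0; i < hostPtrs.size(); ++i) std::free(hostPtrs[i]);
        for (std::size_t i = 0; i < devicePtrs.size(); ++i) cudaFree(devicePtrs[i]);
    }

private:
    std::vector<void*> hostPtrs;
    std::vector<void*> devicePtrs;
};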
There are many ways to skin a cat, as they say.
I would recommend Thrust's device_vector as a memory management tool. It abstracts allocation, deallocation, and memcpy in CUDA. It also gives you access to all the algorithms that Thrust provides.
I wouldn't recommend keeping random lists of unrelated pointers as Tio Pepe recommends. Instead, you should encapsulate related data into a class. Even if you use thrust::device_vector, you may want to encapsulate multiple related vectors, and the operations on them, into a class.
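For illustration, a minimal sketch of the thrust::device_vector approach (the array size and values are placeholders):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

int main()
{
    // Allocation happens in the constructor, deallocation in the
    // destructor: no explicit cudaMalloc/cudaFree anywhere.
    thrust::device_vector<float> d_data(1 << 20, 0.0f);

    // Device-to-host copy, also without explicit cudaMemcpy.
    thrust::host_vector<float> h_data = d_data;
    return 0;
}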
The best choice is probably to use the smart pointers from the C++ Boost library, if that is an option.
If not, the best you can hope for in C is a program design that allows you to write allocation and deallocation in one place. Perhaps something like the following pseudo-code:
while (!terminate_program)
{
  switch (state_machine)
  {
    case STATE_PREOPERATIONAL:
      myclass_init(); /* only necessary for non-global/static objects */
      myclass_mem_manager();
      state_machine = STATE_RUNNING;
      break;

    case STATE_RUNNING:
      myclass_do_stuff();
      ...
      break;

    ...

    case STATE_EXIT:
      myclass_mem_manager();
      terminate_program = true;
      break;
  }
}

void myclass_init (void)
{
  ptr_x = NULL;
  ptr_y = NULL;
  /* Where ptr_x, ptr_y are some of the many objects to allocate/deallocate.
     If ptr is global/static (static storage duration), it is
     already set to NULL automatically and this function isn't
     necessary. */
}

void myclass_mem_manager (void)
{
  ptr_x = mem_manage (ptr_x, items_x * sizeof(Type_x));
  ptr_y = mem_manage (ptr_y, items_y * sizeof(Type_y));
}

/* Toggles: allocates on the first call, frees on the second. */
static void* mem_manage (void* ptr, size_t bytes_n)
{
  if (ptr == NULL)
  {
    ptr = malloc(bytes_n);
    if (ptr == NULL)
    {} /* error handling */
  }
  else
  {
    free(ptr);
    ptr = NULL;
  }
  return ptr;
}
I've stumbled across this great post about validating parameters in C#, and now I wonder how to implement something similar in C++. The main thing I like about that approach is that it costs nothing until the first validation fails, as the Begin() function returns null and the other functions check for this.
Obviously, I can achieve something similar in C++ using Validate* v = 0; IsNotNull(v, ...).IsInRange(v, ...), having each function pass on the v pointer and return a proxy object for which I duplicate all functions.
Now I wonder whether there is a similar way to achieve this without temporary objects until the first validation fails. (Though I'd guess that allocating something like a std::vector on the stack should be essentially free. Is this actually true? I'd suspect an empty vector does no allocations on the heap, right?)
Other than the fact that C++ does not have extension methods (which prevents adding new validations as easily), it shouldn't be too hard.
#include <algorithm>
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

using std::exception;
using std::string;
using std::vector;

class Validation
{
    vector<string> *errors;

    void AddError(const string &error)
    {
        // allocate lazily: no cost until the first validation fails
        if (errors == NULL) errors = new vector<string>();
        errors->push_back(error);
    }

public:
    Validation() : errors(NULL) {}
    Validation(const Validation &rhs)
        : errors(rhs.errors ? new vector<string>(*rhs.errors) : NULL) {}
    ~Validation() { delete errors; }

    const Validation &operator=(const Validation &rhs)
    {
        if (errors == NULL && rhs.errors == NULL) return *this;
        if (rhs.errors == NULL)
        {
            delete errors;
            errors = NULL;
            return *this;
        }
        vector<string> *temp = new vector<string>(*rhs.errors);
        std::swap(temp, errors);
        delete temp; // free the old error list
        return *this;
    }

    void Check()
    {
        if (errors)
            throw exception();
    }

    template <typename T>
    Validation &IsNotNull(T *value)
    {
        if (value == NULL) AddError("Cannot be null!");
        return *this;
    }

    template <typename T, typename S>
    Validation &IsLessThan(T valueToCheck, S maxValue)
    {
        if (!(valueToCheck < maxValue)) AddError("Value is too big!");
        return *this;
    }

    // etc..
};

class Validate
{
public:
    static Validation Begin() { return Validation(); }
};
Usage:
Validate::Begin().IsNotNull(somePointer).IsLessThan(4, 30).Check();
Can't say much to the rest of the question, but I did want to point out this:
Though I'd guess that allocating something like a std::vector on the stack should be for free (is this actually true? I'd suspect an empty vector does no allocations on the heap, right?)
No. You still have to set up the vector's other members (such as the storage for its length), and I believe it's up to the implementation whether it pre-allocates any room for vector elements upon construction. Either way, you are allocating SOMETHING, and while it may not be much, allocation is never "free", regardless of whether it takes place on the stack or the heap.
That being said, I would imagine that the time taken to do such things is so minimal that it will only really matter if you are doing it many, many times over in quick succession.
I recommend taking a look at Boost.Exception, which provides basically the same functionality (adding arbitrarily detailed exception information to a single exception object).
Of course, you'll need to write some utility methods so you can get the interface you want. But beware: dereferencing a null pointer in C++ results in undefined behavior, and null references must not even exist. So you cannot return a null pointer in the way your linked example uses null references in C# extension methods.
As for the zero-cost requirement: a simple stack allocation is quite cheap, and a boost::exception object does no heap allocation itself, only when you attach error_info<> objects to it. So it is not exactly zero cost, but nearly as cheap as it can get (one vtable pointer for the exception object, plus sizeof(intrusive_ptr<>)).
Therefore, this should be the last place where one tries to optimize further...
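A sketch of what that might look like (the exception type and tag names here are made up; boost::error_info and the operator<< attachment are the real Boost.Exception API):

#include <boost/exception/all.hpp>
#include <exception>
#include <string>

// One exception object that detail gets attached to.
struct validation_error : virtual boost::exception, virtual std::exception {};
typedef boost::error_info<struct tag_failed_check, std::string> failed_check;

void require_not_null(const void *p)
{
    if (p == 0)
        throw validation_error() << failed_check("pointer was null");
}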
Re the linked article: apparently, the overhead of creating objects in C# is so great that function calls are free in comparison.
I'd personally propose a syntax like
Validate().ISNOTNULL(src).ISNOTNULL(dst);
Validate() constructs a temporary object which is basically just a std::list of problems. Empty lists are quite cheap (no nodes, size = 0). ~Validate will throw if the list is not empty. If profiling shows that even this is too expensive, then you just change the std::list to a hand-rolled list. Remember, a pointer is an object too; you're not saving an object just by sticking to the unfortunate syntax of a raw pointer. Conversely, the overhead of wrapping a raw pointer in a nicer syntax is purely a compile-time price.
PS. ISNOTNULL(x) would be a #define for IsNotNull(x, #x), similar to how assert() prints out the failed condition without having to repeat it.
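A sketch of the proposed interface (the names are hypothetical; note that the throwing destructor is the design described above and needs care in real code):

#include <list>
#include <stdexcept>
#include <string>

class Validate
{
public:
    Validate &IsNotNull(const void *p, const char *expr)
    {
        if (p == 0) problems.push_back(std::string(expr) + " is null");
        return *this;
    }

    // Throws when any validation failed. Caveat: a throwing destructor
    // terminates the program if it runs during stack unwinding.
    ~Validate() noexcept(false)
    {
        if (!problems.empty()) throw std::runtime_error(problems.front());
    }

private:
    std::list<std::string> problems; // empty list: no heap nodes
};

// Like assert(), the macro stringizes the expression for the message.
#define ISNOTNULL(x) IsNotNull((x), #x)

// Usage: Validate().ISNOTNULL(src).ISNOTNULL(dst);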