A particular usage of StrongBox - roslyn

I am looking at the Roslyn September 2012 CTP with Reflector, and I noticed the following depth-first traversal of the syntax tree:
private IEnumerable<CommonSyntaxNode> DescendantNodesOnly(TextSpan span,
    Func<CommonSyntaxNode, bool> descendIntoChildren, bool includeSelf)
{
    if (includeSelf && IsInSpan(span, FullSpan))
    {
        yield return this;
    }
    if ((descendIntoChildren != null) && !descendIntoChildren(this))
    {
        yield break;
    }
    var queue = new Queue<StrongBox<IEnumerator<CommonSyntaxNode>>>();
    var stack = new Stack<StrongBox<IEnumerator<CommonSyntaxNode>>>();
    stack.Push(new StrongBox<IEnumerator<CommonSyntaxNode>>(ChildNodes().GetEnumerator()));
    while (stack.Count > 0)
    {
        var enumerator = stack.Peek();
        StrongBox<IEnumerator<CommonSyntaxNode>> childEnumerator;
        if (enumerator.Value.MoveNext())
        {
            var current = enumerator.Value.Current;
            if (IsInSpan(span, current.FullSpan))
            {
                yield return current;
                if ((descendIntoChildren == null) || descendIntoChildren(current))
                {
                    // Reuse a recycled StrongBox from the queue if one is available.
                    childEnumerator = queue.Count == 0
                        ? new StrongBox<IEnumerator<CommonSyntaxNode>>()
                        : queue.Dequeue();
                    childEnumerator.Value = current.ChildNodes().GetEnumerator();
                    stack.Push(childEnumerator);
                }
            }
        }
        else
        {
            // Enumerator exhausted: clear the box and recycle it.
            childEnumerator = stack.Pop();
            childEnumerator.Value = null;
            queue.Enqueue(childEnumerator);
        }
    }
}
I am guessing that the queue is there to spare the runtime from allocating and deallocating so many instances of IEnumerator<CommonSyntaxNode>.
However, I am not sure why IEnumerator<CommonSyntaxNode> is wrapped in StrongBox<>. What sort of performance and safety trade-offs are involved in wrapping IEnumerator<CommonSyntaxNode>, which is usually a value type, inside the reference type StrongBox<>?

CommonSyntaxNode is an abstract class that contains a lot of value-type members and can be inherited into a big object.
An IEnumerator<CommonSyntaxNode> holds only a reference to a CommonSyntaxNode, so the node's size should not affect the enumerator's size. But:
- because the enumerator is implemented with an iterator method (yield return), each iteration requires the method to save its state until the next MoveNext() call;
- since that saved state is heavy enough, and it may hold on to CommonSyntaxNode properties needed for the MoveNext() logic, the whole IEnumerator<CommonSyntaxNode> might be pretty heavy in memory.
Using StrongBox<> means the Queue or Stack holds only a small reference object (the StrongBox<>) instead of the potentially memory-heavy IEnumerator<CommonSyntaxNode>; therefore, the GC can reclaim the enumerator that the Queue or Stack used to hold sooner, reducing the application's total memory consumption.
Note that CommonSyntaxNode's enumerator is a struct, so working with it directly means copying the whole struct. It's a small struct, so it's not really heavy, but still...

The advantage of StrongBox<T> is that once an enumerator is spent, the code clears out the box's Value before recycling the box into the queue, so the GC can collect the instance of T being held, and the queue ends up holding just small StrongBox instances (instead of instances of T).
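For reference, StrongBox<T> is essentially just a class exposing a single public Value field. A rough C++ analogue of the recycling pattern might look like the sketch below; the Payload and Box names are invented for illustration, and this is not Roslyn's code:

#include <memory>
#include <queue>
#include <stack>
#include <vector>

// "Payload" stands in for the potentially heavy enumerator state;
// "Box" plays the role of StrongBox<T>: a tiny, reusable reference
// object whose contents can be released independently of the box.
struct Payload { std::vector<int> state; };
struct Box { std::unique_ptr<Payload> value; };

int main() {
    std::stack<std::unique_ptr<Box>> stack; // active traversal state
    std::queue<std::unique_ptr<Box>> pool;  // recycled, emptied boxes

    stack.push(std::make_unique<Box>());
    stack.top()->value = std::make_unique<Payload>();

    // Done with this payload: free it immediately, then recycle the
    // cheap box instead of allocating a new one next time.
    stack.top()->value.reset();
    pool.push(std::move(stack.top()));
    stack.pop();
}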

The use of IEnumerator was a mistake. The code should have been using ChildSyntaxList.Enumerator, which is a struct. The use of StrongBox is for perf, to keep from needing to push & pop the enumerators from the end of the stack when they change.


Most efficient paradigm for checking if a key exists in a c++ std::unordered_map?

I am relatively new to modern C++ and working with a foreign code base. There is a function that takes a std::unordered_map and checks whether a key is present in the map. The code is roughly as follows:
uint32_t getId(std::unordered_map<uint32_t, uint32_t> &myMap, uint32_t id)
{
    if (myMap.contains(id))
    {
        return myMap.at(id);
    }
    else
    {
        std::cerr << "\n\n\nOut of Range error for map: " << id << "\t not found" << std::flush;
        exit(74);
    }
}
It seems like calling contains() followed by at() is inefficient, since it requires a double lookup. So, my question is: what is the most efficient way to accomplish this? I also have a follow-up question: assuming the map is fairly large (~60k elements) and this method gets called frequently, how problematic is the above approach?
After some searching, it seems like the following paradigms are more efficient than the above, but I am not sure which would be best.
1. Calling myMap.at() inside of a try-catch construct
   Pros: at automatically throws an error if the key does not exist
   Cons: try-catch is apparently fairly costly and also constrains what the optimizer can do with the code
2. Use find
   Pros: one call, no try-catch overhead
   Cons: involves using an iterator; more overhead than just returning the value
auto findit = myMap.find(id);
if (findit == myMap.end())
{
    // error message;
    exit(74);
}
else
{
    return findit->first;
}
You can do
// stuff before
{
    auto findit = myMap.find(id);
    if (findit != myMap.end()) {
        return findit->first;
    } else {
        exit(74);
    }
}
// stuff after
or with the new C++17 init statement syntax
// stuff before
if (auto findit = myMap.find(id); findit != myMap.end()) {
    return findit->first;
} else {
    exit(74);
}
// stuff after
Both define the iterator only in local scope. As the iterator use is almost certainly optimized away, I would go with this. Doing a second hash calculation will almost surely be slower.
Also note that findit->first returns the key, not the value. I was not sure what you expect the code to do, but one of the code snippets in the question returns the value, while the other one returns the key.
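If the intent is to return the mapped value, as the original getId with at() does, the find-based version would read like this (a sketch, using the C++17 syntax from above):

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <unordered_map>

uint32_t getId(const std::unordered_map<uint32_t, uint32_t> &myMap, uint32_t id)
{
    // ->second is the mapped value; ->first would be the key
    if (auto findit = myMap.find(id); findit != myMap.end()) {
        return findit->second;
    }
    std::cerr << "Out of Range error for map: " << id << " not found";
    exit(74);
}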
In case you don't get enough speedup from removing the extra lookup alone, and there are millions of calls to getId in a multi-threaded program, you can use an N-way map to parallelize the id checks:
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

template<int N>
class NwayMap
{
public:
    NwayMap(uint32_t hintMaxSize = 60000)
    {
        // hint about max size to optimize initial allocations
        for (int i = 0; i < N; i++)
            shard[i].reserve(hintMaxSize / N);
    }

    void addIdValuePairThreadSafe(const uint32_t id, const uint32_t val)
    {
        // select shard
        const uint32_t selected = id % N; // can do id&(N-1) for power-of-2 N value
        std::lock_guard<std::mutex> lg(mut[selected]);
        auto it = shard[selected].find(id);
        if (it == shard[selected].end())
        {
            shard[selected].emplace(id, val);
        }
        else
        {
            // already added, update?
        }
    }

    uint32_t getIdMultiThreadSafe(const uint32_t id)
    {
        // select shard
        const uint32_t selected = id % N; // can do id&(N-1) for power-of-2 N value
        // lock only the selected shard; others can work in parallel
        std::lock_guard<std::mutex> lg(mut[selected]);
        auto it = shard[selected].find(id);
        // we expect it to be found, so handle the hit first
        if (it != shard[selected].end())
        {
            return it->second;
        }
        else
        {
            exit(74);
        }
    }
private:
    std::unordered_map<uint32_t, uint32_t> shard[N];
    std::mutex mut[N];
};
Pros:
- If you serve each shard's getId from its own CPU thread, you benefit from N times the L1 cache size.
- Even in a single-threaded use case, you can still interleave multiple id-check operations and benefit from instruction-level parallelism, because checking id 0 takes an independent code path from checking id 1, and the CPU can execute them out of order (if the pipeline is long enough).
Cons:
- If a lot of checks from different threads collide, their operations are serialized and the locking mechanism adds extra latency.
- When id values are mostly strided, the parallelization is inefficient due to unbalanced placement across shards.
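For illustration, usage of the NwayMap above might look like this (a sketch; the thread setup is mine, not part of the answer):

#include <cstdint>
#include <thread>
#include <vector>

int main()
{
    NwayMap<8> map; // 8 shards; power-of-2 N allows id & 7 instead of id % 8

    // Fill from several threads; each id maps to exactly one shard.
    std::vector<std::thread> writers;
    for (int t = 0; t < 4; ++t) {
        writers.emplace_back([&map, t] {
            for (uint32_t id = t; id < 60000; id += 4)
                map.addIdValuePairThreadSafe(id, id * 2);
        });
    }
    for (auto &w : writers) w.join();

    return map.getIdMultiThreadSafe(123) == 246 ? 0 : 1;
}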
Calling myMap.at() inside of a try-catch construct
Pros: at automatically throws an error if the key does not exist
Cons: try-catch is apparently fairly costly and also constrains what the optimizer can do with the code
Your implementation of getId terminates the application, so who cares about exception overhead?
Please note that most compilers (AFAIK all) implement C++ exceptions to have zero cost when no exception is thrown. The cost appears when an exception is thrown, the stack is unwound, and a matching handler is found. I read somewhere that the penalty when an exception is thrown is around 40x compared to unwinding the stack by simple returns (with possible error codes).
Since you want to just terminate the application, this overhead is negligible.
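For completeness, the at()-based variant described in the question would look roughly like this (a sketch):

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <unordered_map>

uint32_t getId(const std::unordered_map<uint32_t, uint32_t> &myMap, uint32_t id)
{
    try {
        return myMap.at(id); // single lookup; throws std::out_of_range if absent
    } catch (const std::out_of_range &) {
        std::cerr << "Out of Range error for map: " << id << " not found";
        exit(74);
    }
}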

How to access array of objects inside member function in C++?

I'm writing an object-oriented version of the FCFS scheduling algorithm, and I've hit a problem. I need to know if there's any way to access an array of objects inside a member function definition without passing it as a parameter explicitly.
I've tried using the "this" pointer, but since calculating the finish time of the current process requires the finish time of the previous one, "this" won't work. Or at least I think it won't: I have no idea how to access the "previous" object using "this".
void Process::scheduleProcess(int pid) {
    if (pid == 0) finishTime = burstTime;
    else finishTime = burstTime + this->[pid-1].finishTime;
    turnAroundTime = finishTime - arrivalTime;
    waitingTime = turnAroundTime - burstTime;
}
I can obviously send the array of objects as a parameter and use it directly. I just want to know if there's a better way to do this:
This is the part that's calling the aforementioned function:
for (int clockTime = 0; clockTime <= maxArrivalTime(process); clockTime++) {
    // If clockTime occurs in arrivalTime, return pid of that process
    int pid = arrivalTimeOf(clockTime, process);
    if (pid >= 0) {
        process[pid].scheduleProcess(pid);
    } else continue;
}
Since I'm calling scheduleProcess() through process[pid], where process is a vector of objects, I can manipulate the members of the process[pid] object. But how do I access process[pid-1] in the function itself (without passing the process vector as an argument)?
Since scheduleProcess is a member of Process, it only knows what the Process object knows. The previous process is unknown at this level. There are ways around this that rely on undefined behavior and extra assumptions about your code, but they should be avoided.
One portable solution to avoid all that is to simply pass in the previous process's finish time as a parameter, since you know this value at the point of the call to scheduleProcess. Where there is not a previous process (the first entry in the array), this finish time would be 0.
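A minimal sketch of that approach, reusing the member names from the question (the surrounding class and data are invented for illustration):

#include <cstddef>
#include <vector>

struct Process {
    int arrivalTime = 0, burstTime = 0;
    int finishTime = 0, turnAroundTime = 0, waitingTime = 0;

    // The member function receives the one thing it cannot know by
    // itself: the finish time of the previous process (0 if none).
    void scheduleProcess(int prevFinishTime) {
        finishTime = burstTime + prevFinishTime;
        turnAroundTime = finishTime - arrivalTime;
        waitingTime = turnAroundTime - burstTime;
    }
};

int main() {
    std::vector<Process> process = {{0, 5}, {1, 3}, {2, 8}};
    for (std::size_t pid = 0; pid < process.size(); ++pid) {
        // The caller owns the vector, so it knows the neighbor.
        int prevFinish = (pid == 0) ? 0 : process[pid - 1].finishTime;
        process[pid].scheduleProcess(prevFinish);
    }
}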

Serial allocators/deallocators

I have code with a large number of mallocs and device-specific API mallocs (I'm programming on a GPU, so cudaMalloc).
Basically, the beginning of my code is a big smorgasbord of allocation calls, while the closing section is all deallocation calls.
As I've encapsulated my global data in structures, the deallocations are quite long, but at least I can break them out into a separate function. Still, I would like a shorter solution. Additionally, an automatic deallocator would reduce the risk of memory leaks created if I forget to explicitly write the matching deallocation for something allocated in the global allocator function.
I was wondering whether it'd be possible to write some sort of templated class wrapper that allows me to "register" variables during the malloc/cudaMalloc process, and then at the end of the simulation do a mass loop-based deallocation (deregistration). To be clear, I don't want to type out individual deallocations (free/cudaFree), because again this is long and undesirable, and the assumption is that anything I register won't be deallocated until the device simulation is complete and main is terminating.
A benefit here is that if I register a new simulation-duration variable, it will be deallocated automatically, so there's no danger of my forgetting to deallocate it and creating a memory leak.
Is such a wrapper possible?
Would you suggest doing it?
If so, how?
Thanks in advance!
An idea:
Create two functions: one that allocates memory and returns valid pointers after registering them in a "list" of allocated pointers, and a second that loops over this list and deallocates all the pointers:
// ask for new allocated pointer that will be registered automatically in list of pointers.
pointer1 = allocatePointer(size, listOfPointers);
pointer2 = allocatePointer(size, listOfPointers);
...
// deallocate all pointers
deallocatePointers(listOfPointers);
You may even use different pointer lists, depending on your simulation scope:
listOfPointers1 = getNewListOfPointers();
listOfPointers2 = getNewListOfPointers();
....
p1 = allocatePointer(size, listOfPointers1);
p2 = allocatePointer(size, listOfPointers2);
...
deallocatePointers(listOfPointers1);
...
deallocatePointers(listOfPointers2);
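A minimal sketch of that registration idea, assuming host-side malloc/free (allocatePointer and deallocatePointers are the names from the pseudocode above; the same shape works with cudaMalloc/cudaFree):

#include <cstddef>
#include <cstdlib>
#include <vector>

using ListOfPointers = std::vector<void *>;

// Allocate and register in one step, so nothing can be forgotten.
void *allocatePointer(std::size_t size, ListOfPointers &list)
{
    void *p = std::malloc(size);
    if (p != nullptr)
        list.push_back(p);
    return p;
}

// Free everything that was registered, in one loop.
void deallocatePointers(ListOfPointers &list)
{
    for (void *p : list)
        std::free(p); // for device memory this would be cudaFree(p)
    list.clear();
}

int main()
{
    ListOfPointers listOfPointers;
    double *pointer1 = static_cast<double *>(allocatePointer(100 * sizeof(double), listOfPointers));
    int *pointer2 = static_cast<int *>(allocatePointer(50 * sizeof(int), listOfPointers));
    (void)pointer1;
    (void)pointer2;
    deallocatePointers(listOfPointers); // mass deallocation at shutdown
}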
There are many ways to skin a cat, as they say.
I would recommend thrust's device_vector as a memory management tool. It abstracts allocation, deallocation, and memcpy in CUDA. It also gives you access to all the algorithms that Thrust provides.
I wouldn't recommend keeping random lists of unrelated pointers as Tio Pepe recommends. Instead you should encapsulate related data into a class. Even if you use thrust::device_vector you may want to encapsulate multiple related vectors and operations on them into a class.
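For instance, a device_vector manages its own device allocation (a brief sketch; needs the Thrust headers shipped with CUDA):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>

int main()
{
    // Allocation happens in the constructor, deallocation in the
    // destructor: no explicit cudaMalloc/cudaFree anywhere.
    thrust::device_vector<float> d_vec(1000);
    thrust::sequence(d_vec.begin(), d_vec.end()); // fills 0, 1, 2, ...
    float sum = thrust::reduce(d_vec.begin(), d_vec.end());
    return sum > 0 ? 0 : 1;
} // d_vec's device storage is freed here automatically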
The best choice is probably to use the smart pointers from the C++ Boost library, if that is an option.
If not, the best you can hope for in C is a program design that allows you to write allocation and deallocation in one place. Perhaps something like the following pseudocode:
while (!terminate_program)
{
    switch (state_machine)
    {
    case STATE_PREOPERATIONAL:
        myclass_init(); // only necessary for non-global/static objects
        myclass_mem_manager();
        state_machine = STATE_RUNNING;
        break;
    case STATE_RUNNING:
        myclass_do_stuff();
        ...
        break;
    ...
    case STATE_EXIT:
        myclass_mem_manager();
        terminate_program = true;
        break;
    }
}

void myclass_init()
{
    ptr_x = NULL;
    ptr_y = NULL;
    /* Where ptr_x, ptr_y are some of the many objects to
       allocate/deallocate. If ptr is global/static (static storage
       duration), it is already set to NULL automatically and this
       function isn't necessary. */
}

void myclass_mem_manager()
{
    ptr_x = mem_manage(ptr_x, items_x * sizeof(Type_x));
    ptr_y = mem_manage(ptr_y, items_y * sizeof(Type_y));
}

/* Allocates on the first call and frees on the next, so allocation
   and deallocation live in a single place. */
static void *mem_manage(void *ptr, size_t bytes_n)
{
    if (ptr == NULL)
    {
        ptr = malloc(bytes_n);
        if (ptr == NULL)
        {} // error handling
    }
    else
    {
        free(ptr);
        ptr = NULL;
    }
    return ptr;
}

Crash using concurrent_unordered_map

I've got a concurrent_unordered_map. I use the insert function (and no other) to try to insert into the map concurrently. However, many times, this crashes deep in the insert function internals. Here is some code:
class ModuleBase {
public:
    virtual Wide::Parser::AST* GetAST() = 0;
    virtual ~ModuleBase() {}
};

struct ModuleContents {
    ModuleContents() {}
    ModuleContents(ModuleContents&& other)
        : access(other.access)
        , base(std::move(other.base)) {}
    Accessibility access;
    std::unique_ptr<ModuleBase> base;
};

class Module : public ModuleBase {
public:
    // Follows Single Static Assignment form. Once it's been written, do not write again.
    Concurrency::samples::concurrent_unordered_map<Unicode::String, ModuleContents> contents;
    Wide::Parser::AST* GetAST() { return AST; }
    Wide::Parser::NamespaceAST* AST;
};
This is the function I use to actually insert into the map. There is more but it doesn't touch the map, only uses the return value of insert.
void CollateModule(Parser::NamespaceAST* module, Module& root, Accessibility access_level) {
    // Build the new module, then try to insert it. If it comes back as existing, then we discard. Else, it was inserted and we can process.
    Module* new_module = nullptr;
    ModuleContents m;
    {
        if (module->dynamic) {
            auto dyn_mod = MakeUnique<DynamicModule>();
            dyn_mod->libname = module->libname->contents;
            new_module = dyn_mod.get();
            m.base = std::move(dyn_mod);
        } else {
            auto mod = MakeUnique<Module>();
            new_module = mod.get();
            m.base = std::move(mod);
        }
        new_module->AST = module;
        m.access = access_level;
    }
    auto result = root.contents.insert(std::make_pair(module->name->name, std::move(m)));
This is the root function. It is called in parallel from many threads on different inputs, but with the same root.
void Collater::Context::operator()(Wide::Parser::NamespaceAST* input, Module& root) {
    std::for_each(input->contents.begin(), input->contents.end(), [&](Wide::Parser::AST* ptr) {
        if (auto mod_ptr = dynamic_cast<Wide::Parser::NamespaceAST*>(ptr)) {
            CollateModule(mod_ptr, root, Accessibility::Public);
        }
    });
}
I'm not entirely sure what is going on here. I've got one bit of shared state, and I only ever access it atomically, so why is my code dying?
Edit: This is actually completely my own fault. The crash was on the insert line, which I assumed to be the problem, but it wasn't, and it wasn't related to the concurrency at all. I tested the return value of insert the wrong way around, i.e., as true when the value already existed and false when it did not, whereas the Standard defines true to mean the insertion succeeded, i.e., the value did not exist before. This mucked up the memory management significantly, causing a crash, although exactly how it led to a crash inside the unordered_map code I don't know. Once I inserted the correct negation, it worked flawlessly. This slipped through because I hadn't properly tested the single-threaded version before jumping the concurrency fence.
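For reference, insert on the standard-style maps returns a pair whose bool member is true only when the insertion actually happened; concurrent_unordered_map follows the same convention. A minimal illustration with std::unordered_map:

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> map;
    auto result = map.insert(std::make_pair(std::string("key"), 1));
    if (result.second) {
        std::cout << "inserted: the key did not exist before\n";
    } else {
        std::cout << "not inserted: an equivalent key was already present\n";
    }
}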
One possibility is that you are crashing because of some problem with move semantics. Is the crash caused by a null pointer dereference? That would happen if you inadvertently accessed an object (e.g., ModuleContents) after it's been moved.
It's also possible that the crash is the result of a concurrency bug. The concurrent_unordered_map is thread safe in the sense that insertion and retrieval are atomic. However, whatever you are storing inside it is not automatically protected. So if multiple threads retrieve the same ModuleContents object, they will share the AST tree that's inside a Module. I'm not sure which references are modifiable, since I don't see any const pointers or references. Anything that is shared and modifiable must be protected by some synchronization mechanism (for instance, locks).

Lazy object creation in C++, or how to do zero-cost validation

I've stumbled across this great post about validating parameters in C#, and now I wonder how to implement something similar in C++. The main thing I like about this approach is that it does not cost anything until the first validation fails, as the Begin() function returns null and the other functions check for this.
Obviously, I can achieve something similar in C++ using Validate* v = 0; IsNotNull(v, ...).IsInRange(v, ...) and have each function pass on the v pointer, plus return a proxy object for which I duplicate all the functions.
Now I wonder whether there is a similar way to achieve this without temporary objects until the first validation fails. (Though I'd guess that allocating something like a std::vector on the stack should be free. Is this actually true? I'd suspect an empty vector does no allocations on the heap, right?)
Other than the fact that C++ does not have extension methods (which prevents adding new validations as easily), it shouldn't be too hard:
#include <algorithm>
#include <exception>
#include <string>
#include <vector>

using namespace std;

class Validation
{
    vector<string> *errors;

    void AddError(const string &error)
    {
        if (errors == NULL) errors = new vector<string>();
        errors->push_back(error);
    }
public:
    Validation() : errors(NULL) {}
    ~Validation() { delete errors; }

    const Validation &operator=(const Validation &rhs)
    {
        if (errors == NULL && rhs.errors == NULL) return *this;
        if (rhs.errors == NULL)
        {
            delete errors;
            errors = NULL;
            return *this;
        }
        vector<string> *temp = new vector<string>(*rhs.errors);
        std::swap(temp, errors);
        delete temp; // release the old error list after the swap
        return *this;
    }

    void Check()
    {
        if (errors)
            throw exception();
    }

    template <typename T>
    Validation &IsNotNull(T *value)
    {
        if (value == NULL) AddError("Cannot be null!");
        return *this;
    }

    template <typename T, typename S>
    Validation &IsLessThan(T valueToCheck, S maxValue)
    {
        // the error case is a value that is NOT less than the maximum
        if (valueToCheck >= maxValue) AddError("Value is too big!");
        return *this;
    }
    // etc..
};

class Validate
{
public:
    static Validation Begin() { return Validation(); }
};
Use..
Validate::Begin().IsNotNull(somePointer).IsLessThan(4, 30).Check();
Can't say much to the rest of the question, but I did want to point out this:
"Though I'd guess that allocating something like a std::vector on the stack should be for free (is this actually true? I'd suspect an empty vector does no allocations on the heap, right?)"
No. You still have to initialize the vector's other members (such as the storage for its length), and I believe it's up to the implementation whether any room for vector elements is pre-allocated upon construction. Either way, you are allocating SOMETHING, and while it may not be much, allocation is never "free", regardless of whether it takes place on the stack or the heap.
That being said, I would imagine that the time taken to do such things will be so minimal that it will only really matter if you are doing it many, many times over in quick succession.
I recommend taking a look at Boost.Exception, which provides basically the same functionality (adding arbitrarily detailed exception information to a single exception object).
Of course, you'll need to write some utility methods so you can get the interface you want. But beware: dereferencing a null pointer in C++ results in undefined behavior, and null references must not even exist. So you cannot return a null pointer in the way your linked example uses null references in C# extension methods.
As for the zero-cost aspect: a simple stack allocation is quite cheap, and a boost::exception object does no heap allocation itself, only if you attach error_info<> objects to it. So it is not exactly zero-cost, but nearly as cheap as it can get (one vtable pointer for the exception object, plus sizeof(intrusive_ptr<>)).
Therefore, this should be the last place where one tries to optimize further...
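The basic idiom looks like this (a sketch; the tag and function names are invented for illustration):

#include <boost/exception/all.hpp>
#include <exception>
#include <iostream>
#include <string>

// An error_info attaches a named piece of data to the exception object.
typedef boost::error_info<struct tag_failed_value, std::string> failed_value;

struct validation_error : virtual boost::exception, virtual std::exception {};

void check_not_empty(const std::string &s)
{
    if (s.empty())
        throw validation_error() << failed_value("string was empty");
}

int main()
{
    try {
        check_not_empty("");
    } catch (const boost::exception &e) {
        // Prints the attached error_info along with type information.
        std::cerr << boost::diagnostic_information(e);
    }
}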
Re the linked article: apparently, the overhead of creating objects in C# is so great that function calls are free in comparison.
I'd personally propose a syntax like
Validate().ISNOTNULL(src).ISNOTNULL(dst);
Validate() constructs a temporary object, which is basically just a std::list of problems. Empty lists are quite cheap (no nodes, size = 0). ~Validate will throw if the list is not empty. If profiling shows even this is too expensive, you can swap the std::list for a hand-rolled list. Remember, a pointer is an object too; you're not saving an object just by sticking to the unfortunate syntax of a raw pointer. Conversely, the overhead of wrapping a raw pointer with a nice syntax is purely a compile-time price.
PS. ISNOTNULL(x) would be a #define for IsNotNull(x, #x), similar to how assert() prints out the failed condition without having to repeat it.
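A sketch of that macro idea (simplified to print instead of collecting failures into a list; all names here are invented for illustration):

#include <cstdio>

struct Validate {
    template <typename T>
    Validate &IsNotNull(T *value, const char *expr) {
        if (value == nullptr)
            std::printf("validation failed: %s is null\n", expr);
        return *this;
    }
};

// #x stringifies the argument, so the failure message can name the
// exact expression that was null, just like assert() does.
#define ISNOTNULL(x) IsNotNull((x), #x)

int main() {
    int *src = nullptr;
    int dst = 0;
    Validate().ISNOTNULL(src).ISNOTNULL(&dst); // reports only src
}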