Event based state machine in c++ with coroutines - c++

Coroutines in C++ are a really powerful technique for implementing state machines. However, the examples I find on the internet are overly simplistic: they usually represent some kind of iterator which, after a call to some "Next" routine, moves along, dependent only on the initial arguments of the coroutine. In reasonably complicated event-based state machines, however, each next step depends on the specific event that caused the coroutine to resume, and some default event handlers must also be implemented for events that can occur at any time.
Suppose we have a simple phone state machine.
[STATE: HOOK OFF] --> [EVT: DIAL TONE] --> [STATE: DIALING] --> [EVT: NUMBER DIALED] --> [STATE: TALKING]
Now I would like a coroutine that would look something like this:
PhoneSM()
{
HookOf();
Yield_Till(DialTone_Event);
Dial();
Yield_Till(EndOfDial_Event);
Talk();
...
}
Requirements:
Yield_Till should only continue if the specific event was received when the coroutine is resumed (how?). If not, it should yield again.
Yield_Till must know how to route events such as Hangup_Event to default handlers, because they can really happen at any time and it would be cumbersome to add them to every yield call.
Any help with a C++ (only!) implementation, or ready-made infrastructure that meets these requirements, will be highly appreciated.

It seems to me that you are trying to code an event-driven state machine as a sequential flow-chart. The difference between state-charts and flow-charts is quite fundamental, and is explained, for example in the article "A Crash Course in UML State Machines":
A state machine should rather be coded as a one-shot function that processes the current event and returns without yielding or blocking. If you wish to use coroutines, you can call this state-machine function from a coroutine, which then yields after each event.
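A minimal sketch of such a one-shot event handler for the phone example might look like the following. The State and Event enumerations and the PhoneSM class are hypothetical names chosen for illustration; they do not come from the article. The hang-up check at the top plays the role of a default handler that applies in every state.

```cpp
#include <cassert>

// Hypothetical names, modelled on the phone example in the question.
enum class State { HookOff, Dialing, Talking, Idle };
enum class Event { DialTone, EndOfDial, Hangup };

class PhoneSM {
public:
    State state() const { return state_; }

    // Processes exactly one event and returns; it never blocks or yields.
    void dispatch(Event e) {
        // Default handler: a hang-up is accepted in every state.
        if (e == Event::Hangup) {
            state_ = State::Idle;
            return;
        }
        switch (state_) {
        case State::HookOff:
            if (e == Event::DialTone) state_ = State::Dialing;
            break;
        case State::Dialing:
            if (e == Event::EndOfDial) state_ = State::Talking;
            break;
        case State::Talking:
        case State::Idle:
            break;  // no transitions for the remaining events in this sketch
        }
    }

private:
    State state_ = State::HookOff;
};
```

Events that are not expected in the current state are simply ignored here; a real machine might log or reject them instead.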

This is an old question, but it was the first I found when searching for how to achieve this with C++20 coroutines. As I have already implemented it a few times with different approaches, I will still try to answer it for future readers.
First, some background on why this is actually a state machine. You might skip this part if you are only interested in how to implement it. State machines were introduced as a standard way to write code that is called once in a while with new events and advances some internal state. Since in this case the program counter and state variables obviously can't live in registers and on the stack, some additional code is required to continue where you left off. State machines are a standard way to achieve this without extreme overhead. However, it is possible to write coroutines for the same task, and every state machine can be transformed into such a coroutine, where each state is a label and the event-handling code ends with a goto to the next state, at which point it yields. As every developer knows, goto code is spaghetti code, and there is a cleaner way to express the intent with flow-control structures. In fact, I have yet to see a state machine that couldn't be written in a more compact and easier-to-understand way using coroutines and flow control. That being said: how can this be implemented in C/C++?
There are a few approaches to do coroutines: it could be done with a switch statement inside a loop like in Duff's device, there were POSIX coroutines which are now obsolete and removed from the standard and C++20 brings modern C++ based coroutines. In order to have a full event-handling state-machine there are a few additional requirements. First of all the coroutine has to yield a set of events that will continue it. Then there needs to be a way to pass the actually occurred event together with its arguments back into the coroutine. And finally there has to be some driver code which manages the events and registers event-handlers, callbacks or signal-slot connections on the awaited events and calls the coroutine once such an event occurred.
In my latest implementations I used event objects that reside inside the coroutine and are yielded by reference/pointer. This way the coroutine is able to decide when such an event is of interest to it, even when it might not be in a state where it is able to process it (e.g. a response to a previously sent request has arrived but isn't to be processed yet). It also allows using different event types that might need different approaches to listening for events, independent of the driver code used (which can be simplified this way).
Here is a small Duff's device coroutine for the state-machine in the question (with an extra occupied event for demonstration purpose):
class PhoneSM
{
    enum State { Start, WaitForDialTone, WaitForEndOfDial, … };
    State state = Start;
    std::unique_ptr<DialTone_Event> dialToneEvent;
    std::unique_ptr<EndOfDial_Event> endOfDialEvent;
    std::unique_ptr<Occupied_Event> occupiedEvent;
public:
    // Returns the events to wait for next; an empty vector means the machine has finished.
    std::vector<Event*> operator()(Event *lastEvent = nullptr)
    {
        while (1) {
            switch (state) {
            case Start:
                HookOf();
                dialToneEvent = std::make_unique<DialTone_Event>();
                state = WaitForDialTone;
                // yield ( dialToneEvent )
                return std::vector<Event*>{ dialToneEvent.get() };
            case WaitForDialTone:
                // Compare raw pointers: lastEvent is an Event*, the members are unique_ptrs.
                assert(lastEvent == dialToneEvent.get());
                dialToneEvent.reset();
                Dial();
                endOfDialEvent = std::make_unique<EndOfDial_Event>();
                occupiedEvent = std::make_unique<Occupied_Event>();
                state = WaitForEndOfDial;
                // yield ( endOfDialEvent, occupiedEvent )
                return std::vector<Event*>{ endOfDialEvent.get(), occupiedEvent.get() };
            case WaitForEndOfDial:
                if (lastEvent == occupiedEvent.get()) {
                    // Just return from the coroutine
                    return std::vector<Event*>();
                }
                assert(lastEvent == endOfDialEvent.get());
                occupiedEvent.reset();
                endOfDialEvent.reset();
                Talk();
                …
            }
        }
    }
};
Of course implementing all the coroutine handling makes this overly complicated. A real coroutine would be much simpler. The following is pseudocode:
coroutine std::vector<Event*> PhoneSM() {
    HookOf();
    {
        DialToneEvent dialTone;
        yield { & dialTone };
    }
    Dial();
    {
        EndOfDialEvent endOfDial;
        OccupiedEvent occupied;
        Event *occurred = yield { & endOfDial, & occupied };
        if (occurred == & occupied) {
            return;
        }
    }
    Talk();
    …
}

Most coroutine libraries do not offer a sophisticated yield function. They simply yield, and your coroutine will get control back at some arbitrary point. Hence, after a yield you will have to test the appropriate conditions in your code and yield again if they are not met. In this code you would also put tests for events like hangup, in which case you would terminate your coroutine.
There are a number of implementations in the public domain, and some operating systems (e.g. Windows) offer coroutine services. Just search for "co-routine" or "fiber".

Related

Scheduling a coroutine with a context

There are plenty of tutorials that explain how easy it is to use coroutines in C++, but I've spent a lot of time working out how to schedule "detached" coroutines.
Assume, I have the following definition of coroutine result type:
struct task {
struct promise_type {
auto initial_suspend() const noexcept { return std::suspend_never{}; }
auto final_suspend() const noexcept { return std::suspend_never{}; }
void return_void() const noexcept { }
void unhandled_exception() const { std::terminate(); }
task get_return_object() const noexcept { return {}; }
};
};
And there is also a method that runs "detached" coroutine, i.e. runs it asynchronously.
/// Handler should have overloaded operator() returning task.
template<class Handler>
void schedule_coroutine(Handler &&handler) {
std::thread([handler = std::forward<Handler>(handler)]() { handler(); }).detach();
}
Obviously, I can not pass lambda-functions or any other functional object that has a state into this method, because once the coroutine is suspended, the lambda passed into std::thread method will be destroyed with all the captured variables.
task coroutine_1() {
std::vector<object> objects;
// ...
schedule_coroutine([objects]() -> task {
// ...
co_await something;
// ...
co_return;
});
// ...
co_return;
}
int main() {
// ...
schedule_coroutine(coroutine_1);
// ...
}
I think there should be a way to save the handler somehow (preferably near or within the coroutine promise) so that the next time the coroutine is resumed it won't try to access the destroyed object's data. But unfortunately I have no idea how to do it.
I think your problem is a general (and common) misunderstanding of how co_await coroutines are intended to work.
When a function performs co_await <expr>, this (generally) means that the function suspends execution until expr resumes its execution. That is, your function is waiting until some process completes (and typically returns a value). That process, represented by expr, is the one who is supposed to resume the function (generally).
The whole point of this is to make code that executes asynchronously look like synchronous code as much as possible. In synchronous code, you would do something like <expr>.wait(), where wait is a function that waits for the task represented by expr to complete. Instead of "waiting" on it, you "a-wait" or "asynchronously wait" on it. The rest of your function executes asynchronously relative to your caller, based on when expr completes and how it decides to resume your function's execution. In this way, co_await <expr> looks and appears to act very much like <expr>.wait().
Compiler Magic™ then goes in behind the scenes to make it asynchronous.
So the idea of launching a "detached coroutine" doesn't make sense within this framework. The caller of a coroutine function (usually) isn't the one who determines where the coroutine executes; it's the processes the coroutine invokes during its execution that decide that.
Your schedule_coroutine function really ought to just be a regular "execute a function asynchronously" operation. It shouldn't have any particular association with coroutines, nor any expectation that the given functor is or represents some asynchronous task or if it happens to invoke co_await. The function is just going to create a new thread and execute a function on it.
Just as you would have done pre-C++20.
If your task type represents an asynchronous task, then in proper RAII style, its destructor ought to wait until the task is completed before exiting (this includes any resumptions of coroutines scheduled by that task, throughout the entire execution of said task. The task isn't done until it is entirely done). Therefore, if handler() in your schedule_coroutine call returns a task, then that task will be initialized and immediately destroyed. Since the destructor waits for the asynchronous task to complete, the thread will not die until the task is done. And since the thread's functor is copied/moved from the function object given to the thread constructor, any captures will continue to exist until the thread itself exits.
I hope I got you right, but I think there might be a couple of misconceptions here. First off, you clearly cannot detach a coroutine, that would not make any sense at all. But you can execute asynchronous tasks inside a coroutine for sure, even though in my opinion this defeats its purpose entirely.
But let's take a look at the second block of code you posted. Here you invoke std::async and forward a handler to it. Now, in order to prevent any kind of early destruction you should use std::move instead and pass the handler to the lambda so it will be kept alive for as long as the scope of the lambda function is valid. This should probably already answer your final question as well, because the place where you want this handler to be stored would be the lambda capture itself.
Another thing that bothers me is the usage of std::async. The call returns a std::future whose destructor will block until the lambda has been executed. But this only happens if you set the launch policy to std::launch::async. The default policy is std::launch::async | std::launch::deferred, which lets the implementation choose; if it chooses deferred execution, the lambda fires lazily (meaning: only when you actually request the result via .get() or .wait()).
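To make the difference between the two launch policies concrete, here is a small sketch. The helper name ran_on_other_thread is made up for this illustration; it reports whether the task passed to std::async ran on a thread other than the caller's. Note that when no policy is given, the standard default is std::launch::async | std::launch::deferred, so either behaviour is permitted.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Returns true if the task ran on a thread other than the calling one.
inline bool ran_on_other_thread(std::launch policy) {
    const std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> f =
        std::async(policy, [] { return std::this_thread::get_id(); });
    // With std::launch::deferred the lambda only runs here, inside get(),
    // on the calling thread. With std::launch::async it runs as if on a new thread.
    return f.get() != caller;
}
```

With std::launch::async the function is expected to return true, and with std::launch::deferred false.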
So, in your case and if you really wanted to use coroutines that way, I would suggest to use a std::thread instead and store it for a later join() somewhere pseudo-globally. But again, I don't think you would really want to use coroutines mechanics that way.
Your question makes perfect sense; the misunderstanding is that C++20 coroutines are actually generators that mistakenly occupy the coroutine header name.
Let me explain how generators work and then answer how to schedule detached coroutine.
How generators work
Your question, "scheduling a detached coroutine", then becomes "how to schedule a detached generator", and the answer is: it is not possible, because a special convention transforms a regular function into a generator function.
What is not obvious here is that yielding a value must take place inside the generator function's body. When you want to call a helper function that yields a value for you, you can't. Instead, you also make the helper function a generator and then await it instead of just calling it. This effectively chains generators and might feel like writing synchronous code that executes asynchronously.
In JavaScript the special convention is the async keyword. In Python the special convention is using yield instead of return.
C++20 coroutines are a low-level mechanism that allows implementing JavaScript-like async/await.
There is nothing wrong with including this low-level mechanism in the C++ language, except for placing it in a header named coroutine.
How to schedule detached coroutine
This question makes sense if you want to have green threads or fibers and you are writing scheduler logic that uses symmetric or asymmetric coroutines to accomplish this.
Now others might ask: why should anyone bother with fibers (not Windows fibers ;) when you have generators? The answer is that you can encapsulate the concurrency and parallelism logic, meaning the rest of your team isn't required to learn and apply additional mental gymnastics while working on the project.
The result is true asynchronous programming where the rest of the team writes linear code, without callbacks and such, using a simple concept of concurrency, for example a single spawn() library function, and avoiding locks/mutexes and other multithreading complexity.
The beauty of encapsulation is seen when all details are hidden in low level i/o methods. All context switching, scheduling, etc. happens deep inside i/o classes like Channel, Queue or File.
Everyone involved in async programming should experience working like this. The feeling is intense.
To accomplish this, instead of C++20 coroutines use Boost.Fiber, which includes a scheduler, or Boost.Context, which allows symmetric coroutines. Symmetric coroutines can suspend and switch to any other coroutine, while asymmetric coroutines can only suspend back to the coroutine that resumed them.

What is the best way to compose a state machine using Boost SML to manage ASIO objects and threads?

First of all, I'd like to clarify that by ASIO objects I mean a simple timer for the time being, but further down the line I want to create a state machine to deal with sockets and data transmission.
I have been exploring the Boost SML library for a few weeks now and trying out different things. I quite like it; however, the documentation doesn't help in my use case and its source is not exactly inviting for someone still fairly new to metaprogramming.
For the time being, I'd like to create a state machine that manages an ASIO timer (to wait asynchronously). The interface would provide a start call where you can tell it how long it should wait, a cancel call (to cancel an ongoing wait), and some callback to be invoked when the timer fires.
I have already achieved this in one way, following both sml examples in this repository and it works well - I have a class which manages the timer and contains a state machine. The public interface provides means to inject the appropriate events into the FSM and query its state. The private interface provides means to start and stop the timer. The FSM class is a friend to the controller so it has access to the private functions.
However I was wondering if there is a way to take some of the controller functionality and move it into the FSM - it would hold all ASIO objects and run the io_context/io_service in a thread it has spawned.
(1) First problem I'd encounter is the fact that the state machine is copied - ASIO objects don't allow this, but this can be worked around by wrapping them in shared pointers.
(2) Next is, sending events to the FSM from within. I figured out how to do it from actions by obtaining a callable boost::sml::back::process<> object and using this to "post" the event to the queue, but this would be useless from an ASIO handler as this by default would not be invoked from an action. I suppose a way around this is to capture the callable into the timer handler with a lambda, like this:
// Some pseudo code (this would all be done in one class):
// Transition table
"idle"_s + event<Wait> / start_waiting = "waiting"_s
// Events
struct Wait { std::chrono::seconds duration; };
struct Cancel {};
struct Expire {};
// Actions
std::function<void(Wait const&, boost::sml::back::process<Cancel, Expire>)> start_waiting =
[this] (Wait const& e, boost::sml::back::process<Cancel, Expire> p) {
run_timer(e, p);
};
// Private function
void run_timer(Wait const& e, boost::sml::back::process<Cancel, Expire>& p) {
m_timer->expires_after(e.duration);
auto timerHandler = [&p] (asio::error_code const& e) {
if (e == asio::error::operation_aborted)
p(Cancel{});
else
p(Expire{});
};
m_timer->async_wait(timerHandler);
}
But this feels like a bit of a botch.
(3) The last thing that worries me is how the state machine will handle threads. Obviously the timer handler will be executed in its own thread. If that posts an event to the queue of the FSM, will that event be processed by the same thread that posted it? I'm assuming yes, because I can't see any mentions of threads (other than thread safety) in the header. This will dictate how I go about managing the threads' lifetime.
Any tips on alternative ways to architect this, and their pros and cons, would be of great help.

What are coroutines in C++20?

What are coroutines in C++20?
In what ways are they different from "Parallelism2" and/or "Concurrency2" (see the image below)?
The below image is from ISOCPP.
https://isocpp.org/files/img/wg21-timeline-2017-03.png
At an abstract level, Coroutines split the idea of having an execution state off of the idea of having a thread of execution.
SIMD (single instruction multiple data) has multiple "threads of execution" but only one execution state (it just works on multiple data). Arguably parallel algorithms are a bit like this, in that you have one "program" run on different data.
Threading has multiple "threads of execution" and multiple execution states. You have more than one program, and more than one thread of execution.
Coroutines have multiple execution states, but do not own a thread of execution. You have a program, and the program has state, but it has no thread of execution.
The easiest example of coroutines are generators or enumerables from other languages.
In pseudo code:
function Generator() {
for (i = 0 to 100)
produce i
}
The Generator is called, and the first time it is called it returns 0. Its state is remembered (how much state varies with implementation of coroutines), and the next time you call it it continues where it left off. So it returns 1 the next time. Then 2.
Finally it reaches the end of the loop and falls off the end of the function; the coroutine is finished. (What happens here varies based on language we are talking about; in python, it throws an exception).
Coroutines bring this capability to C++.
There are two kinds of coroutines; stackful and stackless.
A stackless coroutine only stores local variables in its state and its location of execution.
A stackful coroutine stores an entire stack (like a thread).
Stackless coroutines can be extremely light weight. The last proposal I read involved basically rewriting your function into something a bit like a lambda; all local variables go into the state of an object, and labels are used to jump to/from the location where the coroutine "produces" intermediate results.
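To illustrate the idea of "locals become object state, labels mark the resume points", here is a hand-written sketch of the Generator pseudocode above. It only illustrates the concept and is not what any particular compiler emits; the class name CountingGenerator and its next() interface are made up for this example. The loop counter lives in the object, and a stored resume point lets next() jump back into the middle of the loop.

```cpp
#include <cassert>

// Hand-written equivalent of:  for (i = 0; i < limit; ++i) produce i;
class CountingGenerator {
public:
    explicit CountingGenerator(int limit) : limit_(limit) {}

    // Writes the next value to `out` and returns true, or returns false once finished.
    bool next(int& out) {
        switch (resume_point_) {
        case 0:  // first call: start the loop from the top
            for (i_ = 0; i_ < limit_; ++i_) {
                out = i_;
                resume_point_ = 1;
                return true;  // "yield": hand the value back to the caller
        case 1:;              // later calls re-enter the loop body right here
            }
            resume_point_ = 2;
            [[fallthrough]];
        default:  // finished: every further call reports the end
            return false;
        }
    }

private:
    int limit_;
    int i_ = 0;             // the "local variable", now part of the object state
    int resume_point_ = 0;  // the saved location of execution
};
```

The jump into the loop body is legal here because no variable declaration is skipped: the loop counter is a member, not a local.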
The process of producing a value is called "yield", as coroutines are a bit like cooperative multithreading; you are yielding the point of execution back to the caller.
Boost has an implementation of stackful coroutines; it lets you call a function to yield for you. Stackful coroutines are more powerful, but also more expensive.
There is more to coroutines than a simple generator. You can await a coroutine in a coroutine, which lets you compose coroutines in a useful manner.
Coroutines, like if, loops and function calls, are another kind of "structured goto" that lets you express certain useful patterns (like state machines) in a more natural way.
The specific implementation of Coroutines in C++ is a bit interesting.
At its most basic level, it adds a few keywords to C++: co_return co_await co_yield, together with some library types that work with them.
A function becomes a coroutine by having one of those in its body. So from their declaration they are indistinguishable from functions.
When one of those three keywords is used in a function body, some standard-mandated examination of the return type and arguments occurs and the function is transformed into a coroutine. This examination tells the compiler where to store the function state when the function is suspended.
The simplest coroutine is a generator:
generator<int> get_integers( int start=0, int step=1 ) {
for (int current=start; true; current+= step)
co_yield current;
}
co_yield suspends the function's execution, stores that state in the generator<int>, then returns the value of current through the generator<int>.
You can loop over the integers returned.
co_await meanwhile lets you splice one coroutine onto another. If you are in one coroutine and you need the results of an awaitable thing (often a coroutine) before progressing, you co_await on it. If they are ready, you proceed immediately; if not, you suspend until the awaitable you are waiting on is ready.
std::future<std::expected<std::string>> load_data( std::string resource )
{
auto handle = co_await open_resource(resource);
while( auto line = co_await read_line(handle)) {
if (std::optional<std::string> r = parse_data_from_line( line ))
co_return *r;
}
co_return std::unexpected( resource_lacks_data(resource) );
}
load_data is a coroutine that generates a std::future when the named resource is opened and we manage to parse to the point where we found the data requested.
open_resource and read_line are probably async coroutines that open a file and read lines from it. The co_await connects the suspending and ready state of load_data to their progress.
C++ coroutines are much more flexible than this, as they were implemented as a minimal set of language features on top of user-space types. The user-space types effectively define what co_return, co_await and co_yield mean. I've seen people use it to implement monadic optional expressions such that a co_await on an empty optional automatically propagates the empty state to the outer optional:
modified_optional<int> add( modified_optional<int> a, modified_optional<int> b ) {
co_return (co_await a) + (co_await b);
}
instead of
std::optional<int> add( std::optional<int> a, std::optional<int> b ) {
if (!a) return std::nullopt;
if (!b) return std::nullopt;
return *a + *b;
}
A coroutine is like a C function which has multiple return statements and, when called a second time, does not start execution at the beginning of the function but at the first instruction after the previously executed return. This execution location is saved together with all automatic variables that would live on the stack in non-coroutine functions.
A previous experimental coroutine implementation from Microsoft did use copied stacks, so you could even return from deeply nested functions. But this version was rejected by the C++ committee. You can get this kind of implementation, for example, with the Boost.Fiber library.
Coroutines are supposed to be (in C++) functions that are able to "wait" for some other routine to complete and to provide whatever is needed for the suspended, paused, waiting routine to go on. The feature that is most interesting to C++ folks is that coroutines would ideally take no stack space. C# can already do something like this with await and yield, but C++ might have to be rebuilt to get it in.
Concurrency is heavily focused on the separation of concerns, where a concern is a task that the program is supposed to complete. This separation of concerns may be accomplished by a number of means, usually by delegation of some sort. The idea of concurrency is that a number of processes could run independently (separation of concerns) and a 'listener' would direct whatever is produced by those separated concerns to wherever it is supposed to go. This is heavily dependent on some sort of asynchronous management. There are a number of approaches to concurrency, including aspect-oriented programming and others. C# has the 'delegate' keyword, which works quite nicely.
Parallelism sounds like concurrency and may be involved, but is actually a physical construct involving many processors arranged in a more or less parallel fashion, with software that is able to direct portions of code to different processors where they will be run, and the results will be received back synchronously.

How to control thread lifetime using C++11 atomics

Following on from this question, I'd like to know what's the recommended approach we should take to replace the very common pattern we have in legacy code.
We have plenty of places where a primary thread is spawning one or more background worker threads and periodically pumping out some work for them to do, using a suitably synchronized queue. So the general pattern for a worker thread will look like this:
There will be an event HANDLE and a bool defined somewhere (usually as member variables) -
HANDLE hDoSomething = CreateEvent(NULL, FALSE, FALSE, NULL);
volatile bool bEndThread = false;
Then the worker thread function waits for the event to be signalled before doing work, but checks for a termination request inside the main loop -
unsigned int ThreadFunc(void *pParam)
{
// typical legacy implementation of a worker thread
while (true)
{
// wait for event
WaitForSingleObject(hDoSomething, INFINITE);
// check for termination request
if (bEndThread) break;
// ... do background work ...
}
// normal termination
return 0;
}
The primary thread can then give some work to the background thread like this -
// ... put some work on a synchronized queue ...
// pulse worker thread to do the work
SetEvent(hDoSomething);
And it can finally terminate the worker thread like so -
// to terminate the worker thread
bEndThread = true;
SetEvent(hDoSomething);
// wait for worker thread to die
WaitForSingleObject(hWorkerThreadHandle, dwSomeSuitableTimeOut);
In some cases, we've used two events (one for work, one for termination) and WaitForMultipleObjects instead, but the general pattern is the same.
So, looking at replacing the volatile bool with a C++11 standard equivalent, is it as simple as replacing this
volatile bool bEndThread = false;
with this?
std::atomic<bool> bEndThread{false};
I'm sure it will work, but it doesn't seem enough. Also, it doesn't affect the case where we use two events and no bool.
Note, I'm not intending to replace all this legacy stuff with the PPL and/or Concurrency Runtime equivalents because although we use these for new development, the legacy codebase is end-of-life and just needs to be compatible with the latest development tools (the original question I linked above shows where my concern arose).
Can someone give me a rough example of C++11 standard code we could use for this simple thread management pattern to rewrite our legacy code without too much refactoring?
If it ain't broken, don't fix it (especially if this is a legacy code base).
VS-style volatile will be around for a few more years. Given that MFC isn't dead, this won't be dead any time soon. A cursory Google search says you can control it with /volatile:ms.
Atomics might do the job of volatile; especially if this is a counter, there might be little performance overhead.
Many Windows native functions have different performance characteristics when compared to their C++11 counterparts. For example, Windows timer queues and multimedia timers have precision that is not possible to achieve with C++11.
For example, std::this_thread::sleep_for(std::chrono::milliseconds(5)) will sleep for about 15 ms (and not 5 or 6). This can be solved with a mysterious call to timeBeginPeriod. Another example is that unlocking on a condition variable can be slow to respond. Interfaces to fix these aren't exposed to C++11 on Windows.
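For anyone who does want a portable version of the pattern from the question, here is a rough C++11 sketch, written under my own assumptions rather than taken from the legacy code. The event HANDLE becomes a std::condition_variable, and the termination flag becomes a plain bool guarded by the same mutex as the queue. Because the flag is only touched under the lock, it does not need to be std::atomic here; std::atomic<bool> would be the right tool if the flag were polled without a lock. The Worker class and its int work items are placeholders.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

class Worker {
public:
    Worker() : thread_([this] { run(); }) {}
    ~Worker() { stop(); }
    Worker(const Worker&) = delete;
    Worker& operator=(const Worker&) = delete;

    // Replaces "put work on the queue, then SetEvent(hDoSomething)".
    void post(int item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(item);
        }
        wake_.notify_one();
    }

    // Replaces "bEndThread = true; SetEvent(...); wait for the thread handle".
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            end_ = true;
        }
        wake_.notify_one();
        if (thread_.joinable()) thread_.join();
    }

    int total() const { return total_.load(); }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return end_ || !queue_.empty(); });
            if (queue_.empty()) return;  // only reachable once end_ is set
            int item = queue_.front();
            queue_.pop();
            lock.unlock();
            total_ += item;              // stand-in for the background work
            lock.lock();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<int> queue_;
    bool end_ = false;                   // guarded by mutex_
    std::atomic<int> total_{0};
    std::thread thread_;                 // declared last so it starts after the other members exist
};
```

Two deliberate differences from the legacy loop: this worker drains the queue before it exits, and join() has no timeout, unlike WaitForSingleObject with dwSomeSuitableTimeOut.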

Is a thread-safe queue a good approach?

I am looking for a way to optimize the following code, for an open source project that I develop, or make it more performant by moving the heavy work to another thread.
void ProfilerCommunication::AddVisitPoint(ULONG uniqueId)
{
CScopedLock<CMutex> lock(m_mutexResults);
m_pVisitPoints->points[m_pVisitPoints->count].UniqueId = uniqueId;
if (++m_pVisitPoints->count == VP_BUFFER_SIZE)
{
SendVisitPoints();
m_pVisitPoints->count=0;
}
}
The above code is used by the OpenCover profiler (an open source code coverage tool for .NET written in C++) each time a visit point is hit. The mutex protects some shared memory (a 64K block shared between several processes, 32/64 bit and C++/C#); when the block is full, the host process is signalled. Obviously this is quite heavy for each instrumentation point and I'd like to make the impact lighter.
I am thinking of using a queue which is pushed to by the above method and a thread to pop the data and populate the shared memory.
Q. Is there a thread-safe queue in C++ (Windows STL) that I can use - or a lock-less queue as I wouldn't want to replace one issue with another? Do people consider my approach sensible?
EDIT 1: I have just found concurrent_queue.h in the include folder - could this be my answer...?
Okay I'll add my own answer - concurrent_queue works very well
Using the details described in this MSDN article, I implemented a concurrent queue (and tasks, and my first C++ lambda expression :) ). I didn't spend long thinking about it though, as it is a spike.
inline void AddVisitPoint(ULONG uniqueId) { m_queue.push(uniqueId); }
...
// somewhere else in code
m_tasks.run([this]
{
ULONG id;
while(true)
{
while (!m_queue.try_pop(id))
Concurrency::Context::Yield();
if (id==0) break; // 0 is an unused number so is used to close the thread/task
CScopedLock<CMutex> lock(m_mutexResults);
m_pVisitPoints->points[m_pVisitPoints->count].UniqueId = id;
if (++m_pVisitPoints->count == VP_BUFFER_SIZE)
{
SendVisitPoints();
m_pVisitPoints->count=0;
}
}
});
Results:
Application without instrumentation = 9.3
Application with old instrumentation handler = 38.6
Application with new instrumentation handler = 16.2
Here it mentions that not all container operations are thread-safe on Windows, only a limited number of methods. And I don't believe the C++ standard says anything about thread-safe containers. I may be wrong, but I checked the standard and nothing came up.
Would it be possible to offload the client's communication into a separate thread? Then the inspection points can use thread local storage to record their hits and only need to communicate with a local thread to pass off a reference when full. The communication thread can then take its time to pass on the data to the actual collector since it's not on the hot path anymore.
You could use a lock free queue. Herb Sutter has some articles here.
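As a rough illustration of the per-thread buffering idea suggested above (not of a lock-free queue), here is a sketch in standard C++. Each thread records hits into its own buffer without any locking and only takes the shared mutex when a full batch is handed over. BatchSink and LocalRecorder are hypothetical names, and forwarding the batches into the shared memory block is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Shared collector: the only place that takes a lock.
class BatchSink {
public:
    void submit(std::vector<unsigned long> batch) {
        std::lock_guard<std::mutex> lock(mutex_);
        total_ += batch.size();
        batches_.push_back(std::move(batch));
    }
    std::size_t batch_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return batches_.size();
    }
    std::size_t total() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return total_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::vector<unsigned long>> batches_;
    std::size_t total_ = 0;
};

// One recorder per thread (for example, a thread_local instance).
class LocalRecorder {
public:
    LocalRecorder(BatchSink& sink, std::size_t batch_size)
        : sink_(sink), batch_size_(batch_size) {
        buffer_.reserve(batch_size_);
    }

    void add(unsigned long id) {
        buffer_.push_back(id);  // hot path: no lock taken
        if (buffer_.size() >= batch_size_) flush();
    }

    // Hands the current buffer to the sink; call once more before the thread exits.
    void flush() {
        if (buffer_.empty()) return;
        sink_.submit(std::move(buffer_));
        buffer_.clear();        // reset the moved-from vector to a known state
        buffer_.reserve(batch_size_);
    }

private:
    BatchSink& sink_;
    std::size_t batch_size_;
    std::vector<unsigned long> buffer_;
};
```

With a batch size of N, the mutex is taken once per N visit points instead of once per point; whether that is enough depends on how expensive the final hand-off to the host process is.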