C++ How to Design Command Queue

I have 8 threads providing commands that need to be executed on the GPU, and I want the GPU execution to happen on the same thread via a buffer filled by the 8 other threads. I thought it would be fairly simple but am having issues designing it correctly.
struct CommandStruct {
CComPtr<IDirect3DPixelShader9> Shader;
const char* Param[9];
bool IsCompleted;
};
concurrent_queue<CommandStruct*> Shader::cmdBuffer;
std::mutex Shader::startLock;
std::thread* Shader::WorkerThread = NULL;
void Shader::AddCommandToQueue(CommandStruct* cmd) {
// Add command to queue.
cmdBuffer.push(cmd);
if (WorkerThread == NULL) {
startLock.lock();
if (WorkerThread == NULL) {
WorkerThread = new std::thread(StartWorkerThread, env);
}
startLock.unlock();
}
}
StartWorkerThread is a function that takes all elements within cmdBuffer and executes them one by one until the buffer is empty, and stops after being idle for a few seconds.
The problem I'm having is how to make the caller wait until execution is completed? At first I tried SetEvent, but performance is VERY poor. Then I tried looping with Sleep(10), but then it wastes the whole CPU.
Ideally AddCommandToQueue would return once the command is completed, but then I have the same problem about how to wait for completion.
What's the right way of implementing such a queue? It needs to be fast because it contains many instructions for the GPU. I tried with an event, I tried with an IsCompleted flag on the structure, but the design isn't working yet.

I realize I'm a few years late, but I thought I'd provide some thoughts. Since you're looking at a command system, I would take the time to make sure your commands are type-erased or abstract. Here is an example system that may be useful to you. You might also take a look at his blogs on external and internal references:
https://blog.molecular-matters.com/2014/12/16/stateless-layered-multi-threaded-rendering-part-3-api-design-details/
The reason I'd keep your command system type-erased or abstract, is to maintain some level of a layered architecture. Here is a typical diagram of how a realtime game engine is architected: http://hightalestudios.com/2017/03/game-engine-architecture-2nd-edition-overview-ch-1-part-2/
One thing to note in the above diagram is that an entire layer is dedicated to platform independence. Above that level, I would avoid using constructs such as COM pointers, since they are largely platform-dependent. Instead, think about creating a resource management system using a pool or arena allocator where you can store, cache, and reuse resources, vertex buffers, and such. You will set yourself up for success if you take this step.
This also sets you up for success when creating your command system. At a high level, think of your commands as small structs that hold handles to data held by the data manager. Again, by doing it this way, you won't need to load/reload VAOs, VBOs, IBOs, shaders, materials, etc. You'll get O(1) access to the corresponding data in the data manager if you set up your handle system correctly, and your commands will be small, which allows you to sort them. Sorting will become important when it comes time to render draw calls, since you'll want to draw items that are far away before drawing items that are closer.
If you follow the above advice, the content of your command should be nothing but a small amount of data (handles/pointers only) and a dispatch function. So here is a concrete example of what some of the commands in my homemade GUI/game engine look like:
struct make_viewport
{
int width = platform::default_viewport_width;
int height = platform::default_viewport_height;
static auto dispatch(const void* data) {
auto command = static_cast<const make_viewport*>(data);
auto handle = platform::make_viewport(command->width, command->height);
resource::manager::store(handle);
};
};
In the above example, the screen width and height are small enough that it would not make sense to use a handle and store the sizes in the resource manager. Also notice that the command knows nothing about the platform specifics at this level. There is a custom linker-based interface that is implemented for various operating systems, cpus, gpus, etc. and compiled on the host machine. There are other ways to do this, but this was the solution I thought of first. Here is a library I found for abstracting away several platform-specific things you may be interested in: https://github.com/ThePhD/infoware
So what exactly does this dispatch function do? Most command systems use some form of dispatch mechanism to execute commands lazily. A few common ways I've seen these implemented are as follows:
auto execute() noexcept -> void
{
// do stuff with command data
}
auto dispatch() noexcept -> void
{
// do stuff with command data
}
auto operator()() noexcept -> void
{
// do stuff with command data
}
The signature of your dispatch function is not important, but it is important to keep the signature consistent between each of your commands so that you can create polymorphic arrays of your commands. You will need to be able to sort and execute this polymorphic array, and its contents cannot be determined at compile time, so you won't be able to use something like std::tuple (you can use std::variant, but your data locality won't be as great as it could be).
If I had to make a suggestion, I would tend toward overloading operator(). Why? If you notice, this would make any command a valid functor. This allows for deep integration with modern C++ techniques such as lambdas and standard algorithms. In other words, rewriting the make_viewport command from above could be as simple as:
auto make_viewport = [width = platform::default_viewport_width, height = platform::default_viewport_height]()
{
    // The captures replace the struct members, so no cast from void* is needed.
    auto handle = platform::make_viewport(width, height);
    resource::manager::store(handle);
};
Of course, you'd need to find a way to deal with polymorphism/sorting/executing, but if you find a way to do this, the lambda approach could be pretty interesting.
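One possible way to store and execute such lambdas (not necessarily the one I would pick for a hot path) is to type-erase them with std::function. The sketch below is only an illustration: command, command_queue, submit, flush, and demo_lambda_commands are hypothetical names, and std::function may allocate, so this gives up the POD property discussed further down.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is an illustration, not engine code.
using command = std::function<void()>;

class command_queue
{
public:
    // Accept any callable that takes no arguments (lambda, functor, function pointer).
    template <typename callable_t>
    void submit(callable_t&& c) { commands_.emplace_back(std::forward<callable_t>(c)); }

    // Run every queued command in submission order, then empty the queue.
    // Returns how many commands were executed.
    std::size_t flush()
    {
        const std::size_t count = commands_.size();
        for (auto& c : commands_) c();
        commands_.clear();
        return count;
    }

private:
    std::vector<command> commands_;
};

// Usage: two lambdas capture their parameters by value and are executed later.
inline int demo_lambda_commands()
{
    int total = 0;
    command_queue queue;
    queue.submit([&total, width = 640] { total += width; });
    queue.submit([&total, height = 480] { total += height; });
    queue.flush();
    return total;
}
```

Sorting is the part this does not solve: a std::function carries no sort key, so you would have to store one next to it.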
If you're interested in ways to handle type erasure, here's how I did it:
struct command_packet
{
void* command = nullptr;
dispatch_type dispatch = nullptr;
};
template<typename command_t> [[nodiscard]]
constexpr auto make_packet(command_t&& command) noexcept
{
return command_packet
{
.command = &command,
.dispatch = command.dispatch
};
}
You can also use a virtual base class to do this, but I wanted the ability to have POD commands so I could reason about the size of the commands. I then create an array of commands to execute every frame. This array should ideally have thread-local storage, which makes it difficult to fill from multiple threads. The good news is that you can create multiple command queues depending on your needs. For instance, draw calls will always need to be sorted by depth and therefore will need to be in the same queue. But if you're clever, you can choose to sort this queue using a radix sort. When implementing your radix sort, you can sort data into buckets from multiple threads, since radix sort does not require any comparison operations. If your commands are small enough and numerous enough, you'll drastically outperform std::sort in graphics applications.
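To make the radix sort idea more concrete, here is a minimal single-threaded sketch of a least-significant-digit radix sort over 32-bit sort keys. The keyed_command struct, its fields, and the radix_sort function are assumptions made for illustration; the multi-threaded bucket filling mentioned above is not shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical command layout: a sort key plus a handle into a resource manager.
struct keyed_command
{
    std::uint32_t key;    // e.g. quantized depth and material id packed together
    std::uint32_t handle; // index into some resource manager
};

// Least-significant-digit radix sort, one byte per pass (4 passes for 32 bits).
// It is stable and never compares two keys with each other.
inline void radix_sort(std::vector<keyed_command>& commands)
{
    std::vector<keyed_command> scratch(commands.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        // offsets[d + 1] first counts how many keys have digit d ...
        std::array<std::size_t, 257> offsets{};
        for (const auto& c : commands) ++offsets[((c.key >> shift) & 0xFFu) + 1];
        // ... then the prefix sum turns offsets[d] into the first output slot for digit d.
        for (std::size_t i = 1; i < offsets.size(); ++i) offsets[i] += offsets[i - 1];
        for (const auto& c : commands) scratch[offsets[(c.key >> shift) & 0xFFu]++] = c;
        commands.swap(scratch);
    }
}
```

Because each pass is stable, commands with equal keys keep their submission order, which matters if that order carries meaning.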
And finally, if you need to perform culling, merging, and redundancy optimizations, you can add steps to the pipeline to do this. I have not yet implemented these optimizations on my system, but it should not be difficult to imagine that commands can be removed/added/modified/merged/chained with a little extra logic, making the aforementioned optimizations possible.
In summary, a command system that is likely to fit your needs would have the following attributes:
Commands are POD or their size can be reasoned about
Commands are polymorphic
Commands do not contain platform-specific code
Command system is coupled with a resource manager
Commands are designed to be aligned with cache lines (in other words, sizeof(command) is a whole multiple of sizeof(void*), and your allocator provides the appropriate padding/alignment to avoid cache misses)
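Several of these attributes can be checked at compile time. The following is a small sketch under assumptions: draw_command and its fields are hypothetical, and the 64-byte cache line is an assumed value that you would replace with the one for your platform.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical command: handles only, no platform-specific types.
struct draw_command
{
    std::uint64_t sort_key;
    std::uint32_t mesh_handle;
    std::uint32_t material_handle;
};

// Assumed cache line size; adjust for the target platform.
constexpr std::size_t assumed_cache_line = 64;

// The command can be copied with memcpy and overwritten without running destructors.
static_assert(std::is_trivially_copyable_v<draw_command>, "commands should be trivially copyable");
static_assert(std::is_standard_layout_v<draw_command>, "commands should have a predictable layout");

// The size is a whole multiple of the pointer size ...
static_assert(sizeof(draw_command) % sizeof(void*) == 0, "command size should be pointer-aligned");
// ... and a whole number of commands fits in one (assumed) cache line.
static_assert(assumed_cache_line % sizeof(draw_command) == 0, "commands should tile a cache line");
```

If a later change to the struct breaks one of these properties, the build fails instead of the frame time quietly getting worse.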
Your command pipeline would look something like this:
Load data (preferably asynchronously, since handles can be returned immediately without waiting for resources to load) into the resource manager and obtain a handle to that data.
Create commands using the aforementioned handles using a task/job system.
Shove each command into your command queue (if you use a radix sort, you can directly stuff them into the first bucket for improved efficiency).
At the end of the frame, sort the data (preferably using a radix sort).
Remove any redundant commands if so desired.
Sequentially submit commands from your queue on a single thread. I'm sure you're already aware based on the question, but graphics APIs are typically stateful and require you to configure the state before each render call (which makes it impossible to perform in parallel... not to mention any kind of hardware-based limitations on parallelism that might exist on your platform). The thing to take note of here is that if you set up your sort criteria correctly, you should only need to swap out shaders/textures/materials/vertex buffers/etc. when they have changed. This further improves efficiency by getting rid of redundant API calls and copies.
Clear your queue. If using std::vector<T, Allocator>, for instance, a call to std::vector<T, Allocator>::clear() would be sufficient. The idea is to not deallocate memory, but to move the current index back to the beginning of the container. You do not even need to delete the commands if they are POD, you can simply overwrite it on the next frame if you're careful about how you set this up.
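The last few steps of the pipeline above (sort, submit on one thread while skipping redundant state changes, clear) can be sketched roughly as follows. This is only an outline under assumptions: render_command, frame_stats, and frame_queue are hypothetical names, std::stable_sort stands in for a radix sort, and counters stand in for real graphics API calls.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical command holding only handles.
struct render_command
{
    std::uint32_t shader; // handle into a resource manager
    std::uint32_t mesh;   // handle into a resource manager
};

struct frame_stats
{
    int shader_binds = 0;
    int draws = 0;
};

class frame_queue
{
public:
    void push(render_command c) { commands_.push_back(c); }

    // Sort so that commands sharing a shader are adjacent, submit them in order
    // on the calling thread, and rebind the shader only when it changes.
    frame_stats submit_and_clear()
    {
        std::stable_sort(commands_.begin(), commands_.end(),
                         [](const render_command& a, const render_command& b) { return a.shader < b.shader; });

        frame_stats stats;
        bool has_bound = false;
        std::uint32_t bound = 0;
        for (const auto& c : commands_)
        {
            if (!has_bound || c.shader != bound)
            {
                bound = c.shader; // a real backend would bind the shader here
                has_bound = true;
                ++stats.shader_binds;
            }
            ++stats.draws; // a real backend would issue the draw call here
        }

        commands_.clear(); // resets the size; the storage is reused next frame
        return stats;
    }

private:
    std::vector<render_command> commands_;
};
```

With four commands that alternate between two shaders, the sorted submission performs two binds instead of four.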
And finally, I just want to leave you with the note that what you're doing can be as difficult as you want it to be. Start small and simply get your system working. You can slowly add some of the features I've talked about as you get each piece working and have verified that it works. Graphics commands are what I would call a critical bottleneck in most systems that use them, so do yourself a favor and set yourself up for success by creating an architecture that can be optimized if you find a reason to do so. Many command systems do not allow for sorting, for instance, which is a huge missed opportunity. Even if you don't implement sorting immediately, if you roughly follow the cookbook, you can implement sorting at a later time without too much pain. What you're doing almost requires a data-oriented approach to be efficient, so think about laying out your data in a manner that offers O(1) access, O(1) storage, doesn't result in redundancies, and is aligned to the cache lines on your platform. You will save yourself a world of hurt by spending some time on this. And perhaps you may find some additional useful tips in this book: https://www.amazon.com/Engine-Architecture-Third-Jason-Gregory/dp/1138035459/ref=pd_lpo_14_t_0/137-2195898-2699658?_encoding=UTF8&pd_rd_i=1138035459&pd_rd_r=100afaaf-5542-476a-89eb-87ae8bed3571&pd_rd_w=AaXTZ&pd_rd_wg=E6BRu&pf_rd_p=7b36d496-f366-4631-94d3-61b87b52511b&pf_rd_r=KQ0CFH5Y2JCA4817ATNM&psc=1&refRID=KQ0CFH5Y2JCA4817ATNM

Related

How to templatize code for a class member function based on constructor parameter

I have a highly performance-sensitive (read low latency requirement) C++ 17 class for logging that has member functions that can either log locally or can log remotely depending upon the flags with which the class is implemented. "Remote Logging" or "Local Logging" functionality is fully defined at the time when the object is constructed.
The code looks something like this
class Logger {
public:
Logger(bool aIsTx):isTx_(aIsTx) {init();}
~Logger() {}
uint16_t fbLog(const fileId_t aId, const void *aData, const uint16_t aSz){
if (isTx_)
// do remote logging
return remoteLog(aId, aData, aSz);
else
// do local logging
return fwrite(aData, aSz, 1,fd_[aId]);
}
protected:
bool isTx_;
};
What I would like to do is
Some way of removing the if(isTx_) such that the code to be used gets defined at the time of instantiating.
Since the class objects are used by multiple other modules, I would not like to templatize the class because this will require me to wrap two templatized implementations of the class in an interface wrapper which will result in v-table call every time a member function is called.

You cannot "templatize" the behaviour, since you want the choice to be made at runtime.
In case you want to get rid of the if because of performance, rest assured that it will have negligible impact compared to disk access or network communication. Same goes for virtual function call.
If you need low latency, I recommend considering asynchronous logging: The main thread would simply copy the message into an internal buffer. Memory is way faster than disk or network, so there will be much less latency. You can then have a separate service thread that waits for the buffer to receive messages, and handles the slow communication.
As a bonus, you don't need branches or virtual functions in the main thread since it is the service thread that decides what to do with the messages.
Asynchronicity is not an easy approach, however. There are many cases that must be taken into consideration:
How to synchronise the access to the buffer (I suggest trying out a lock free queue instead).
How much memory should the buffer be allowed to occupy? Without limit it can consume too much if the program logs faster than can be written.
If the buffer limit is reached, what should the main thread do? It either needs to fall back to synchronously waiting while the buffer is being processed or messages need to be discarded.
How to flush the buffer when the program crashes? If it is not possible, then the last messages may be lost - which probably are what you need to figure out why the program crashed in the first place.
Regardless of choice: If performance is critical, then try out multiple approaches and measure.
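As a starting point for measuring, here is a minimal sketch of the asynchronous approach. It makes several assumptions: it uses a mutex and condition variable instead of the lock-free queue suggested above (simpler to get right, likely slower under contention), it drops messages when the buffer is full, and async_logger, sink_t, and max_pending are hypothetical names. The sink is where the local or remote decision would live.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

class async_logger
{
public:
    // The sink does the slow work (file write or network send) on the service thread.
    using sink_t = std::function<void(const std::string&)>;

    async_logger(sink_t sink, std::size_t max_pending)
        : sink_(std::move(sink)), max_pending_(max_pending), worker_([this] { run(); }) {}

    // Drains whatever is still buffered, then stops the service thread.
    ~async_logger()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();
    }

    // Called from the latency-sensitive thread: only copies into the buffer.
    // Returns false (message discarded) when the buffer limit is reached.
    bool log(std::string message)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (pending_.size() >= max_pending_) return false;
            pending_.push_back(std::move(message));
        }
        wake_.notify_one();
        return true;
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;)
        {
            wake_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
            if (pending_.empty()) return; // stopping was requested and the buffer is drained
            std::string message = std::move(pending_.front());
            pending_.pop_front();
            lock.unlock(); // do the slow I/O without holding the lock
            sink_(message);
            lock.lock();
        }
    }

    sink_t sink_;
    std::size_t max_pending_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<std::string> pending_;
    bool stopping_ = false;
    std::thread worker_; // declared last so every other member exists before it starts
};
```

Note that this sketch does nothing about the crash case: messages still in the buffer when the process dies are lost.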

shared_ptr Real life use-cases

shared_ptr is to be used when we have a scenario where it is desirable to have multiple owners of a dynamically allocated item.
The problem is, I can't imagine any scenario where we require multiple owners. Every use-case I can imagine can be solved with a unique_ptr.
Could someone provide a real life use-case example with code where shared_ptr is required (and by required, I mean the optimal choice as a smart pointer)? And by "real life" I mean some practical and pragmatic use-case, not something overly abstract and fictitious.

In our simulator product, we use a framework to deliver messages between simulation components (called endpoints). These endpoints could reside on multiple threads within a process, or even on multiple machines in a simulation cluster with messages routed through a mesh of RDMA or TCP connections. The API looks roughly like:
class Endpoint {
public:
// Fill in sender address, etc., in msg, then send it to all
// subscribers on topic.
void send(std::unique_ptr<Message> msg, TopicId topic);
// Register this endpoint as a subscriber to topic, with handler
// called on receiving messages on that topic.
void subscribe(TopicId topic,
std::function<void(std::shared_ptr<const Message>)> handler);
};
In general, once the sender endpoint has executed send, it does not need to wait for any response from any receiver. So, if we were to try to keep a single owner throughout the message routing process, it would not make sense to keep ownership in the caller of send; otherwise, send would have to wait until all receivers are done processing the message, which would introduce an unnecessary round-trip delay. On the other hand, if multiple receivers are subscribed to the topic, it also wouldn't make sense to assign unique ownership to any single one of them, as we don't know which one of them needs the message the longest. That would leave the routing infrastructure itself as the unique owner; but in that case, the routing infrastructure would have to wait for all receivers to be done, even though it may have numerous messages to deliver to multiple threads and wants to hand each message off to its receivers and move on to the next one. Another alternative would be to keep a set of unique pointers to sent messages that are waiting for threads to process them, and have the receivers notify the message router when they're done; but that would also introduce unnecessary overhead.
On the other hand, by using shared_ptr here, once the routing infrastructure is done delivering messages to incoming queues of the endpoints, it can then release ownership to be shared between the various receivers. Then, the thread-safe reference counting ensures that the Message gets freed once all the receivers are done processing it. And in the case that there are subscribers on remote machines, the serialization and transmission component could be another shared owner of the message while it's doing its work; then, on the receiving machine(s), the receiving and deserialization component can pass off ownership of the Message copy it creates to shared ownership of the receivers on that machine.
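A heavily simplified, single-process sketch of that hand-off might look like the following. It is not our actual framework: Router, Handler, and demo_fan_out are hypothetical names, topics and threads are left out, and the two vectors stand in for the receivers' incoming queues.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Message
{
    std::string payload;
};

class Router
{
public:
    using Handler = std::function<void(std::shared_ptr<const Message>)>;

    void subscribe(Handler handler) { handlers_.push_back(std::move(handler)); }

    // Takes unique ownership from the sender, converts it to shared ownership,
    // and gives every subscriber its own reference to the same Message.
    void send(std::unique_ptr<Message> msg)
    {
        std::shared_ptr<const Message> shared = std::move(msg);
        for (auto& handler : handlers_) handler(shared);
        // `shared` is destroyed here, so the router no longer co-owns the message.
    }

private:
    std::vector<Handler> handlers_;
};

// Usage: two receivers queue the same message; neither the sender nor the
// router has to wait for them to finish with it.
inline long demo_fan_out()
{
    std::vector<std::shared_ptr<const Message>> inbox_a;
    std::vector<std::shared_ptr<const Message>> inbox_b;

    Router router;
    router.subscribe([&inbox_a](std::shared_ptr<const Message> m) { inbox_a.push_back(std::move(m)); });
    router.subscribe([&inbox_b](std::shared_ptr<const Message> m) { inbox_b.push_back(std::move(m)); });
    router.send(std::make_unique<Message>(Message{"tick"}));

    // Only the two inboxes own the message now.
    return inbox_a.front().use_count();
}
```

The Message is freed when the last inbox drops its reference, whichever receiver that turns out to be.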

In a CAD app, I use shared_ptr to save RAM and VRAM when multiple models happen to share the same mesh (e.g. after the user copy-pasted these models). As a bonus, multiple threads can access meshes at the same time, because both shared_ptr and weak_ptr are thread safe when used correctly.
Below’s a trivial example. The real code is way more complex due to numerous reasons (GPU buffers, mouse picking, background processing triggered by some user input, and many others) but I hope that’s enough to give you an idea where shared_ptr is justified.
// Can be hundreds of megabytes in these vectors
class Mesh
{
std::string name;
std::vector<Vector3> vertices;
std::vector<std::array<uint32_t, 3>> indices;
BoundingBox bbox;
};
// Just 72 or 80 bytes, very cheap to copy.
// Can e.g. pass copies to another thread for background processing.
// A scene owns a collection of these things.
class Model
{
std::shared_ptr<Mesh> mesh;
Matrix transform;
};

In my program's user interface, I have the concept of "control point values" (a control point value represents the current state of a control on the hardware my program controls), and (of course) the concept of "widgets" (a widget is a GUI component that renders the current state of a control point to the monitor, for the user to see and/or manipulate).
Since the system it needs to control is pretty elaborate, we have
lots of different types of control point values (floats, ints, strings, booleans, binary blobs, etc)
lots of different types of widget (text displays, faders, meters, knobs, buttons, etc)
lots of different ways that a given widget could choose to render a particular control point value as text (upper case, lower case, more or fewer digits of precision, etc)
If we just did the obvious thing and wrote a new subclass every time we needed a new combination of the above, we'd end up with a geometric explosion of thousands of subclasses, and therefore a very large codebase that would be difficult to understand or maintain.
To avoid that, I separate out the knowledge of "how to translate a control point value into human-readable text in some particular way" into its own separate immutable object that can be used by anyone to do that translation, e.g.
// My abstract interface
class IControlPointToTextObject
{
public:
    // Virtual destructor so implementations can be destroyed through the interface
    virtual ~IControlPointToTextObject() = default;
    virtual std::string getTextForControlPoint(const ControlPoint & cp) const = 0;
};
// An example implementation
class RenderFloatingPointValueAsPercentage : public IControlPointToTextObject
{
public:
    RenderFloatingPointValueAsPercentage(int precision) : m_precision(precision)
    {
        // empty
    }
    std::string getTextForControlPoint(const ControlPoint & cp) const override
    {
        // code to create and return a percentage with (m_precision) digits after the decimal point goes here....
    }
private:
    const int m_precision;
};
... so far, so good; now e.g. when I want a text widget to display a control point value as a percentage with 3 digits after the decimal point, I can do it like this:
TextWidget * myTextWidget = new TextWidget;
myTextWidget->setTextRenderer(std::unique_ptr<IControlPointToTextObject>(new RenderFloatingPointValueAsPercentage(3)));
... and I get what I want. But my GUIs can get rather elaborate, and they might have a large number (thousands) of widgets, and with the above approach I would have to create a separate RenderFloatingPointValueAsPercentage object for each widget, even though most of the RenderFloatingPointValueAsPercentage objects will end up being identical to each other. That's kind of wasteful, so I change my widget classes to accept a std::shared_ptr instead, and now I can do this:
std::shared_ptr<IControlPointToTextObject> threeDigitRenderer = std::make_shared<RenderFloatingPointValueAsPercentage>(3);
myWidget1->setTextRenderer(threeDigitRenderer);
myWidget2->setTextRenderer(threeDigitRenderer);
myWidget3->setTextRenderer(threeDigitRenderer);
[...]
No worries about object lifetimes, no dangling pointers, no memory leaks, no unnecessary creation of duplicate renderer objects. C'est bon :)

Take any lambda created within a member function f of a class C, where you want to work with an object that you would otherwise capture by reference ([&]). If the lambda runs asynchronously and the C object goes out of scope before the lambda finishes, the captured reference dangles. A segmentation fault is as close as you get to defined behavior when the lambda next accesses the reference. You cannot move a unique pointer into the lambda either, because you could no longer access it from f once it's moved. The solution: a shared pointer and [=]. I code the core of a database. We need shared pointers all the time in a multi-threaded infrastructure. Don't forget about the atomic reference counter. But your general scepticism is appreciated. Shared pointers are nearly always used when one doesn't need them.
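A small sketch of that pattern, using std::async as a stand-in for whatever runs the lambda later. Record, measure_async, and demo_lambda_lifetime are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <future>
#include <memory>
#include <string>

struct Record
{
    std::string value;
};

// Starts work that may outlive the caller's scope. Capturing `record` by value
// copies the shared_ptr, so the Record stays alive until the task has finished.
inline std::future<std::size_t> measure_async(std::shared_ptr<const Record> record)
{
    return std::async(std::launch::async, [record] { return record->value.size(); });
}

inline std::size_t demo_lambda_lifetime()
{
    std::future<std::size_t> result;
    {
        auto record = std::make_shared<Record>(Record{"hello"});
        result = measure_async(record);
    } // the local shared_ptr is gone here; the lambda's copy keeps the Record alive
    return result.get();
}
```

Had the lambda captured a reference to a local Record instead, the access inside the task could happen after the object was destroyed.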

Suppose I want to implement a GLR parser for a language that is or contains a recursive "expression" definition. And the parsing must not just check whether the input conforms to the grammar, but also output something that can be used to do analysis, evaluations, compilations, etc. I'll need something to represent the result of each expression or subexpression grammar symbol. The actual semantic meaning of each grammar rule can be represented by polymorphism, so this will need to be some sort of pointer to a base class Expression.
The natural representation is then a std::shared_ptr<Expression>. An Expression object can be a subexpression of another compound Expression, in which case the compound Expression is the owner of the subexpression. Or an Expression object can be owned by the parse stack of the algorithm in progress, for a grammar production that has not yet been combined with other pieces. But not really both at the same time. If I were writing a LALR parser, I could probably do with std::unique_ptr<Expression>, transferring the subexpressions from the parse stack to the compound expression constructors as each grammar symbol is reduced.
The specific need for shared_ptr comes up with the GLR algorithm. At certain points, when there is more than one possible parse for the input scanned so far, the algorithm will duplicate the parse stack in order to try out tentative parses of each possibility. And as the tentative parsings proceed, each possibility may need to use up some of those intermediate results from its own parse stack to form subexpressions of some compound expression, so now we might have the same Expression being used by both some number of parse stacks and some number of different compound Expression objects. Hopefully all but one tentative parsing will eventually fail, which means the failed parse stacks get discarded. The Expression objects directly and indirectly contained by discarded parse stacks should possibly be destroyed at that time, but some of them may be used directly or indirectly by other parse stacks.
It would be possible to do all this with just std::unique_ptr, but quite a bit more complicated. You could do a deep clone whenever parse stacks need to split, but that could be wasteful. You could have them owned by some other master container and have the parse stacks and/or compound expressions just use dumb pointers to them, but knowing when to clean them up would be difficult (and possibly end up essentially duplicating a simplified implementation of std::shared_ptr). I think std::shared_ptr is the clear winner here.
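To illustrate just the ownership part (no actual parsing), here is a tiny sketch in which a parse stack is split and one branch is later discarded. Expression here is a plain struct rather than a polymorphic base class, and ParseStack and demo_split_stacks are hypothetical names.

```cpp
#include <memory>
#include <vector>

struct Expression
{
    int value = 0;
    std::vector<std::shared_ptr<const Expression>> children;
};

using ParseStack = std::vector<std::shared_ptr<const Expression>>;

inline long demo_split_stacks()
{
    ParseStack first;
    first.push_back(std::make_shared<Expression>(Expression{1, {}}));

    // Splitting copies the pointers only; both stacks refer to the same Expression.
    ParseStack second = first;

    // The second tentative parse reduces the top of its stack into a compound
    // expression, which now co-owns the original subexpression.
    auto compound = std::make_shared<Expression>(Expression{2, {second.back()}});
    second.back() = compound;

    // The first tentative parse fails and its stack is discarded.
    first.clear();

    // The subexpression survives because the compound expression still owns it.
    return compound->children.front().use_count();
}
```

Nothing in this code has to track which stack or expression was the last user; the reference count does that.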

See this real life example. The current frame is shared across multiple consumers and with a smart pointer things get easy.
#include <memory>
#include <vector>
class frame { };
class consumer { public: virtual ~consumer() = default; virtual void draw(std::shared_ptr<frame>) = 0; };
class screen_consumer_t : public consumer { public: void draw(std::shared_ptr<frame>) override {} };
class matrox_consumer_t : public consumer { public: void draw(std::shared_ptr<frame>) override {} };
class decklink_consumer_t : public consumer { public: void draw(std::shared_ptr<frame>) override {} };
int main() {
    std::shared_ptr<frame> current_frame = std::make_shared<frame>();
    std::shared_ptr<consumer> screen_consumer = std::make_shared<screen_consumer_t>();
    std::shared_ptr<consumer> matrox_consumer = std::make_shared<matrox_consumer_t>();
    std::shared_ptr<consumer> decklink_consumer = std::make_shared<decklink_consumer_t>();
    // The vector must hold pointers: consumer is abstract and cannot be stored by value.
    std::vector<std::shared_ptr<consumer>> consumers;
    consumers.push_back(screen_consumer);
    consumers.push_back(matrox_consumer);
    consumers.push_back(decklink_consumer);
    //screen_consumer->draw(current_frame);
    //matrox_consumer->draw(current_frame);
    //decklink_consumer->draw(current_frame);
    for (const auto& c : consumers) c->draw(current_frame);
}
Edited:
Another example is a Minimax tree. To avoid reference cycles, weak_ptr can be used in conjunction with shared_ptr:
struct node_t
{
std::unique_ptr<board_t> board_;
std::weak_ptr<node_t> parent_;
std::vector<std::shared_ptr<node_t>> children_;
};
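Here is a small sketch of why the parent link is a weak_ptr. tree_node mirrors node_t above but leaves out the board to stay self-contained, and add_child and demo_no_cycle are hypothetical helpers.

```cpp
#include <memory>
#include <vector>

struct tree_node
{
    int score = 0;
    std::weak_ptr<tree_node> parent;                  // non-owning back reference
    std::vector<std::shared_ptr<tree_node>> children; // owning references
};

inline std::shared_ptr<tree_node> add_child(const std::shared_ptr<tree_node>& parent, int score)
{
    auto child = std::make_shared<tree_node>();
    child->score = score;
    child->parent = parent; // does not increase the parent's owner count
    parent->children.push_back(child);
    return child;
}

inline bool demo_no_cycle()
{
    std::weak_ptr<tree_node> watch_root;
    std::shared_ptr<tree_node> leaf;
    {
        auto root = std::make_shared<tree_node>();
        leaf = add_child(root, 7);
        watch_root = root;
    } // root's only owner is gone; the child's weak parent link does not keep it alive

    return watch_root.expired() && leaf->parent.expired();
}
```

If parent were a shared_ptr, root and leaf would keep each other alive and neither would ever be freed.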

Have you checked these articles about copy-on-write vector:
https://iheartcoding.net/blog/2016/07/11/copy-on-write-vector-in-c/
copy-on-write PIMPL:
https://crazycpp.wordpress.com/2014/09/13/pimplcow/
and generic copy-on-write pointer:
https://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Copy-on-write
All of them use shared_ptr internally.
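The core of the idea fits in a few lines. This is a single-threaded sketch and not the code from any of the linked articles; the cow class template and its member names are made up for illustration, and relying on use_count() like this is not safe when several threads share the same object.

```cpp
#include <memory>
#include <utility>
#include <vector>

template <typename T>
class cow
{
public:
    explicit cow(T value) : data_(std::make_shared<T>(std::move(value))) {}

    // Reading never copies.
    const T& read() const { return *data_; }

    // Writing first detaches from any other cow that shares the same data.
    T& write()
    {
        if (data_.use_count() > 1) data_ = std::make_shared<T>(*data_);
        return *data_;
    }

    bool shares_with(const cow& other) const { return data_ == other.data_; }

private:
    std::shared_ptr<T> data_;
};

inline bool demo_cow()
{
    cow<std::vector<int>> a(std::vector<int>{1, 2, 3});
    cow<std::vector<int>> b = a; // cheap: both refer to one vector
    const bool shared_before = a.shares_with(b);

    b.write().push_back(4); // b makes its own copy; a is untouched

    return shared_before && !a.shares_with(b) && a.read().size() == 3 && b.read().size() == 4;
}
```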

std::shared_ptr is an implementation of reference counting technique in C++. For use-cases of reference counting see linked wikipedia article. One usage of reference counting is garbage collection in programming languages. So if you decide to write a new programming language with garbage collection in C++ you can implement it with std::shared_ptr, though you will also have to deal with cycles.

Simply put: there isn't really any.
For a more detailed explanation, let's turn to formal reasoning. As we all know, C++ is a Turing-complete deterministic language. A popular simple example of an equally computationally powerful tool is Brainfuck (often very convenient in establishing Turing-completeness of your favorite language of choice). If we look into Brainfuck's description (which is very small indeed, which makes it very handy for the purposes noted heretofore), we'll soon find out that there is not a single notion of anything resembling shared_ptr in there. So the answer is: no, there is no real-life example where they would be absolutely required. Everything computable can be done without shared_ptrs.
If we continue the process thoroughly, we'll get rid equally easily of other unnecessary concepts, i.e. unique_ptr, std::unordered_map, exceptions, range-loops and so forth.

Efficiently handling options

I have a function, which is executed hundreds of millions of times in a typical program run. This function performs a certain main task, but, if the user so desires, it should perform some slight variations of that main task. The obvious way to implement this would be something like this:
void f(bool do_option)
{
// Do the first part
if (do_option)
{
// Optional extra code
}
// Continue normal execution
}
However, this is not very elegant, since the value of do_option does not change during a program run. The if statement is unnecessarily being performed very often.
I solved it by turning do_option into a template parameter. I recompile the program every time I want to change it. Right now, this workflow is acceptable: I don't change these options very often and I am the sole user/developer. In the future though, both these things will change, so I want a single binary with command-line switches.
Question is: what is the best or most elegant way to deal with this situation? I don't mind having a large binary with many copies of f. I could create a map from a set of command-line parameters to a function to execute, or perhaps use a switch. But I don't want to maintain that map by hand -- there will probably be more than five such parameters.
By the way, I know somebody is going to say that I'm prematurely optimizing. That is correct, I tested it. In my specific case, the performance of runtime ifs is not much worse than my template construction. That doesn't mean I'm not interested if nicer solutions are possible.

On a modern (non-embedded) CPU, the branch predictor will be smart enough to recognize that the same path is taken every time, so an if statement is a perfectly acceptable (and readable) way of handling your situation.
On an embedded processor, compiler optimizations should be smart enough to get rid of most of the overhead of the if statement.
If you're really picky, you can use the template method that you mentioned earlier, and have an if statement select which version of the function to execute.
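That last option could look roughly like the sketch below, which needs C++17 for if constexpr. The arithmetic inside f is a placeholder for the real work, and run_loop and run are hypothetical names; the point is only that the runtime branch is taken once, outside the hot loop.

```cpp
// The option is a template parameter, so each instantiation contains only the
// code it needs.
template <bool do_option>
long f(long x)
{
    long result = x * 2;       // placeholder for "the first part"
    if constexpr (do_option)
    {
        result += 1;           // placeholder for the optional extra code
    }
    return result;             // placeholder for "continue normal execution"
}

// The hot loop is instantiated once per option value.
template <bool do_option>
long run_loop(long iterations)
{
    long sum = 0;
    for (long i = 0; i < iterations; ++i) sum += f<do_option>(i);
    return sum;
}

// The only runtime branch: pick the instantiation from a command-line flag.
inline long run(bool do_option, long iterations)
{
    return do_option ? run_loop<true>(iterations) : run_loop<false>(iterations);
}
```

With more than a handful of options the number of instantiations doubles per flag, so it is worth measuring whether this buys anything over the plain if.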

Observer Design Pattern Issues

I am working on a large project in C++ that will have a graphical user interface.
The user interface will use some design pattern (MVVM/MVC) that will rely on the observer pattern.
My problem is that I currently have no way of predicting which parts of the Model should be observable. And there are many, many parts.
I find myself being pulled in several directions due to this issue:
If I develop back-end classes that do not support notification I will find myself violating the Open-Closed principle.
If I do provide support for notification to all Model classes and all of their data members it will have a massive performance cost that is unjustified since only a fraction of this support will actually be needed (even though this fraction is unknown).
The same is true if I only provide support for extension by making all non-const methods virtual and accessing these methods through base-pointers. This will also have a cost in readability.
I feel that out of these 3, (1.) is probably the lesser evil.
However, I feel like an ideal solution should actually exist in some language (definitely not C++), but I don't know if it's supported anywhere.
The unicorn solution I was thinking of is something like this:
Given a class Data, shouldn't it be possible for clients that seek to make Data observable to do something like
#MakeObservable(Data)
as a compile-time construct? This in turn would make it possible to call addObserver on Data objects and modify all assignments to data members with notifiers. It would also make you pay in performance only for what you get.
So my question is two-fold:
Am I right to assume that out of the 3 options I stated, (1.) is the lesser but necessary evil?
Does my unicorn solution exist anywhere? Is it being worked on? Or would it be impossible to implement for some reason?

If I understand correctly, you're concerned with the cost of providing a signal/notification for potentially every observable property of every object.
Fortunately you're in luck, since storing a general thread-safe notifier with every single property of every object would generally be extremely expensive in any language or system.
Instead of getting all clever and trying to solve this problem at compile time, which I think would shut out some potentially very useful options for a large-scale project (e.g. plugins and scripting), I'd suggest thinking about how to make this cheaper at runtime. You want your signals to be stored at a coarser level than the individual properties of an object.
If you store just one signal with the object, and have it pass along which property was modified during a property change event so that it can filter which clients to notify, things get a lot cheaper. We're paying with some additional branching and larger aggregates for the connected slots, but we get a significantly smaller object in exchange, with potentially faster read access, and I'd suggest this is a very valuable exchange in practice.
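As a rough illustration of the "one signal per object" idea, here is a minimal single-threaded sketch. All names (ObservableObject, PropertyId, Person) are hypothetical and not part of any library, and property IDs are plain ints for brevity. The object owns a single slot container, and each slot is filtered by the ID of the property that changed:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; property IDs are plain ints here.
using PropertyId = int;

class ObservableObject {
public:
    using Slot = std::function<void(PropertyId)>;

    // Connect a slot that is interested in one property. Returns a handle.
    std::size_t connect(PropertyId prop, Slot slot) {
        slots_.push_back({prop, std::move(slot)});
        return slots_.size() - 1;
    }

    // Called by setters: one comparison per connected slot filters by property.
    void notify(PropertyId changed) const {
        for (const auto& entry : slots_)
            if (entry.prop == changed && entry.slot)
                entry.slot(changed);
    }

private:
    struct Entry { PropertyId prop; Slot slot; };
    std::vector<Entry> slots_;  // one container per object, not one per property
};

struct Person : ObservableObject {
    enum : PropertyId { Name, Age };
    void set_age(int a) { age_ = a; notify(Age); }
    void set_name(std::string n) { name_ = std::move(n); notify(Name); }
    int age() const { return age_; }
private:
    std::string name_;
    int age_ = 0;
};
```

The trade-off described above is visible here: notify branches once per connected slot, but Person carries a single vector rather than a notifier per data member.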
You can still design your public interface and even the event notification mechanism so that clients work with the system in a way that feels like they're connecting to properties rather than the whole object, perhaps even calling methods in a property (if it's an object/proxy) to connect slots if you need or can afford a back pointer to the object from a property.
If you're not sure, I would err on the side of attaching event slots to properties as well as modifying them as part of the object interface rather than property interface, as you'll have a lot more breathing room to optimize in exchange for a slightly different client aesthetic (one that I really don't think is less convenient so much as just 'different', or at least potentially worth the cost of eliminating a back pointer per property).
That's in the realm of convenience and wrapper-type things. But you don't need to violate the open-closed principle to achieve MVP designs in C++. Don't get crammed into a corner by data representation. You have a lot of flexibility at the public interface level.
Memory Compaction -- Paying for What We Use
On discovering that efficiency plays an important role here, I'd suggest some basic ways of thinking to help with that.
First, just because an object has some accessor like something() does not mean that the associated data has to be stored in that object. It doesn't even have to be stored anywhere until that method is called. If memory is your concern, it can be stored at some level outside.
Most software breaks down into hierarchies of aggregates owning resources. For example, in a 3D software, a vertex is owned by a mesh which is owned by the scene graph which is owned by the application root.
If you want designs where you pay almost no memory cost whatsoever for things that are not being used, then you want to associate the data to the object at a coarser level. If you store it directly in the object, then every object pays for what something() returns regardless of whether it is needed. If you store it indirectly in the object with a pointer, then you pay for the pointer to something() but not for the full cost of it unless it is used. If you associate it to the owner of the object, then retrieving it has a lookup cost, but one that is not as expensive as associating it to the owner of the owner of the object.
So there are always ways to get something very close to free for things you don't use if you associate at a coarse enough level. At granular levels you mitigate lookup and indirection overhead; at coarse levels you mitigate costs for things you don't use.
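To make the "associate at a coarser level" option concrete, here is a small sketch using the mesh/vertex example. The Mesh, Vertex and label names are made up for illustration. The rarely used per-vertex data lives in a side table inside the owner, so vertices that never use it pay nothing, at the price of a hash lookup on access:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// The element itself stays small: no optional data, no pointer to it.
struct Vertex { float x, y, z; };

class Mesh {
public:
    std::size_t add_vertex(Vertex v) {
        vertices_.push_back(v);
        return vertices_.size() - 1;
    }

    // Rarely used data is stored in the owner, keyed by vertex index.
    void set_label(std::size_t vertex, std::string label) {
        labels_[vertex] = std::move(label);
    }

    // Returns nullptr when no label was ever attached to this vertex.
    const std::string* label(std::size_t vertex) const {
        auto it = labels_.find(vertex);
        return it == labels_.end() ? nullptr : &it->second;
    }

    std::size_t labelled_count() const { return labels_.size(); }

private:
    std::vector<Vertex> vertices_;
    std::unordered_map<std::size_t, std::string> labels_;  // side table
};
```

Moving labels_ one level further up (say, into the scene graph) would make unused meshes cheaper still, but each lookup would then need a (mesh, vertex) key.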
Massive Scale Events
Given massive scalability concerns with millions to billions of elements being processed, and still the desire for potentially some of them to generate events, I'd really recommend an asynchronous design here if you can use one. You can have a lock-free per-thread event queue to which only objects with a single bit flag set push events. If the bit flag is not set, they don't.
This kind of deferred, async design is useful at such scale since it gives you periodic intervals (or possibly just other threads, though you'd need write locks in that case, as well as read locks, though writing is what needs to be cheap) in which to poll and devote full resources to bulk processing the queue while the more time-critical processing can continue without synchronizing with the event/notifier system.
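The flag-gated push can be sketched as follows. This is only an illustration under simple assumptions: Event, Context and Particle are hypothetical names, and the queue is a plain vector touched only by its owning thread, which is what makes the push lock-free here:

```cpp
#include <cstdint>
#include <vector>

struct Event { std::uint32_t object_id; std::uint32_t property_id; };

// Hypothetical per-thread context. Only the owning thread touches the
// queue, so pushing needs no lock.
struct Context { std::vector<Event> queue; };

constexpr std::uint32_t kMassProperty = 0;  // illustrative property ID

class Particle {
public:
    explicit Particle(std::uint32_t id) : id_(id) {}
    void set_observed(bool on) { observed_ = on; }

    void set_mass(Context& ctx, float m) {
        mass_ = m;
        if (observed_)  // the single-flag check; unobserved objects stop here
            ctx.queue.push_back({id_, kMassProperty});
    }
    float mass() const { return mass_; }

private:
    std::uint32_t id_;
    float mass_ = 0.0f;
    bool observed_ = false;  // could be packed into a spare bit elsewhere
};
```

An unobserved object pays one predictable branch per write and stores no notifier state at all.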
Basic Example
// Interned strings are very useful here for fast lookups
// and reduced redundancy in memory.
// They're basically just indices or pointers to an
// associative string container (ex: hash or trie).
// Some contextual class for the thread storing things like a handle
// to its event queue, thread-local lock-free memory allocator,
// possible error codes triggered by functions called in the thread,
// etc. This is optional and can be replaced by thread-local storage
// or even just globals with an appropriate lock. However, while
// inconvenient, passing this down a thread's callstack is usually
// the most efficient and reliable, lock-free way.
// There may be times when passing around this contextual parameter
// is too impractical. There TLS helps in those exceptional cases.
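class Context;
// Forward declarations of the other types used below (their
// definitions are omitted here).
class InternedString;
class Variant;
class Property;
class Interface;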
// Variant is some generic store/get/set anything type.
// A basic implementation is a void pointer combined with
// a type code to at least allow runtime checking prior to
// casting along with deep copying capabilities (functionality
// mapped to the type code). A more sophisticated one is
// abstract and overridden by subtypes like VariantInt
// or VariantT<int>
typedef void EventFunc(Context& ctx, int argc, Variant** argv);
// Your universal object interface. This is purely abstract:
// I recommend a two-tier design here:
// -- ObjectInterface->Object->YourSubType
// It'll give you room to use a different rep for
// certain subtypes without affecting ABI.
class ObjectInterface
{
public:
virtual ~ObjectInterface() {}
// Leave it up to the subtype to choose the most
// efficient rep.
virtual bool has_events(Context& ctx) const = 0;
// Connect a slot to the object's signal (or its property
// if the event_id matches the property ID, e.g.).
// Returns a connection handle for the slot. Note: RAII
// is useful here as failing to disconnect can have
// grave consequences if the slot is invalidated prior to
// the signal.
virtual int connect(Context& ctx, InternedString event_id, EventFunc func, const Variant& slot_data) = 0;
// Disconnect the slot from the signal.
virtual int disconnect(Context& ctx, int slot) = 0;
// Fetches the property with the specified ID in O(n) integral compares.
// Recommended: make properties stateless proxies pointing
// back to the object (more room for backend optimization).
// Properties can have set<T>/get<T> methods (can build this
// on top of your Variant if desired, but a bit more overhead
// if so).
// If even interned string compares are not fast enough for
// desired needs, then an alternative, less convenient interface
// to memoize property indices from an ID might be appropriate in
// addition to these.
virtual Property operator[](InternedString prop_id) = 0;
// Returns the nth property through an index.
virtual Property operator[](int n) = 0;
// Returns the number of properties for introspection/reflection.
virtual int num_properties() const = 0;
// Set the value of the specified property. This can generate
// an event with the matching property name to indicate that it
// changed.
virtual void set_value(Context& ctx, InternedString prop_id, const Variant& new_value) = 0;
// Returns the value of the specified property.
virtual const Variant& value(Context& ctx, InternedString prop_id) = 0;
// Poor man's RTTI. This can be ignored in favor of dynamic_cast
// for a COM-like design to retrieve additional interfaces the
// object supports if RTTI can be allowed for all builds/modes.
// I use this anyway for higher ABI compatibility with third
// parties.
virtual Interface* fetch_interface(Context& ctx, InternedString interface_id) = 0;
};
I'll avoid going into the nitty gritty details of the data representation -- the whole point is that it's flexible. What's important is to buy yourself room to change it as needed. Keeping the object abstract, keeping the property as a stateless proxy (with the exception of the backpointer to the object), etc. gives a lot of breathing room to profile and optimize away.
For async event handling, each thread should have a queue associated which can be passed down the call stack through this Context handle. When events occur, such as a property change, objects can push events to this queue through it if has_events() == true. Likewise, connect doesn't necessarily add any state to the object. It can create an associative structure, again through Context, which maps the object/event_id to the client. disconnect also removes it from that central thread source. Even the act of connecting/disconnecting a slot to/from a signal can be pushed to the event queue for a central, global place to process and make the appropriate associations (again preventing objects without observers from paying any memory cost).
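A minimal sketch of such an associative structure follows, assuming integer object and event IDs in place of real handles and interned strings. ConnectionTable is a hypothetical name, and in the design above an instance would be reached through Context. Objects without observers have no entry and therefore cost nothing:

```cpp
#include <functional>
#include <map>
#include <utility>
#include <vector>

using ObjectId = int;  // stand-in for an object handle
using EventId = int;   // stand-in for an interned string

class ConnectionTable {
public:
    using Slot = std::function<void()>;

    // Associates a slot with (object, event) and returns a connection handle.
    int connect(ObjectId object, EventId event, Slot slot) {
        const int handle = next_handle_++;
        slots_[{object, event}].push_back({handle, std::move(slot)});
        return handle;
    }

    // Removes one connection; returns false if it was not found.
    bool disconnect(ObjectId object, EventId event, int handle) {
        auto it = slots_.find({object, event});
        if (it == slots_.end()) return false;
        auto& list = it->second;
        for (auto s = list.begin(); s != list.end(); ++s) {
            if (s->first == handle) {
                list.erase(s);
                if (list.empty()) slots_.erase(it);  // no observers, no entry
                return true;
            }
        }
        return false;
    }

    // Invokes every slot connected to (object, event); returns how many ran.
    int emit(ObjectId object, EventId event) const {
        auto it = slots_.find({object, event});
        if (it == slots_.end()) return 0;
        for (const auto& s : it->second) s.second();
        return static_cast<int>(it->second.size());
    }

private:
    using Key = std::pair<ObjectId, EventId>;
    std::map<Key, std::vector<std::pair<int, Slot>>> slots_;
    int next_handle_ = 0;
};
```

This version is not thread-safe on its own. It assumes that connects, disconnects and emits are all processed in one place, as suggested above.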
When using this type of design, each thread should have at its entry point an exit handler for the thread which transfers the events pushed to the thread event queue from the thread-local queue to some global queue. This requires a lock but can be done not-too-frequently to avoid heavy contention and to allow each thread to not be slowed down by the event processing during performance-critical areas. Some kind of thread_yield kind of function should likewise be provided with such a design which also transfers from the thread-local queue to the global queue for long-lived threads/tasks.
The global queue is processed in a different thread, triggering the appropriate signals to the connected slots. There it can focus on bulk processing the queue when it isn't empty, sleeping/yielding when it is. The whole point of all of this is to aid performance: pushing to a queue is extremely cheap compared to the potential of sending synchronous events every time an object property is modified, and when working with massive scale inputs, this can be a very costly overhead. So simply pushing to a queue allows that thread to avoid spending its time on the event handling, deferring it to another thread.
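The hand-off from thread-local queues to the global queue might look roughly like this. GlobalQueue, Context, absorb and drain are illustrative names, and a real dispatcher thread would sleep or wait on a condition variable between calls to drain:

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

struct Event { int object_id; int property_id; };

class GlobalQueue {
public:
    // Called by worker threads at exit or at a yield point. This is the
    // only place a worker takes the lock.
    void absorb(std::vector<Event>& local) {
        if (local.empty()) return;
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.insert(pending_.end(), local.begin(), local.end());
        local.clear();
    }

    // Called by the dispatcher thread: swap the batch out under the lock,
    // then run the handler without holding it. Returns the batch size.
    std::size_t drain(const std::function<void(const Event&)>& handler) {
        std::vector<Event> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (const Event& e : batch) handler(e);
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<Event> pending_;
};

// Hypothetical per-thread context, created at the thread's entry point.
struct Context {
    explicit Context(GlobalQueue& g) : global(g) {}
    ~Context() { flush(); }                     // exit handler for the thread
    void push(Event e) { local.push_back(e); }  // no lock on the hot path
    void flush() { global.absorb(local); }      // the thread_yield-style transfer
    GlobalQueue& global;
    std::vector<Event> local;
};
```

Workers call push freely and flush only occasionally, so contention on the mutex is limited to those transfer points.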

Templatious library can help you with completely decoupling GUI and domain logic and making flexible/extendable/easy to maintain message systems and notifiers.
However, there's one downside - you need C++11 support for this.
Check out this article taming qt
And the example on GitHub: DecoupledGuiExamples
So you probably don't need notifiers on every class. You can just send messages from inside the functions of specific classes, and make any class send any message you want to the GUI.

Shared, one-side-mutable state in multithreaded plugin architecture

I have an application with a very simple plugin system that builds around a core that takes care of the heavy lifting, but leaves any processing beyond the basics to the plugins. Now I'd like to make that system multi threaded, or at the very least allow individual plugins to run their own threads so that they can block individually without freezing the core.
Naturally, that means making the core thread safe, so that plugins can freely operate on thread safe member functions of said core. This is not that hard for many cases, but the problem comes in when the result of one of these member functions is a (const) reference to some inner environment maintained by the core. Plugins must not modify it, but the core running in another thread may update it at any point in time while the plugin is still holding on to the reference and possibly still mid-processing.
Now I could just expose a mutex in the core and have plugins lock it for as long as they need the data to remain unchanged, but since that mutex will have to block the core event processing, holding it for too long will cause all kinds of unpleasantries. I'd really like to avoid that.
Another solution might be having the function that would return a reference, return a copy instead, but the environment can grow pretty large with many containers that contain yet more things, and copying all that is expensive.
Apart from steering clear of any direct sharing and instead talking over sockets or pipes, these seem to be my options, although I cannot decide which one I'd choose.
Which would be my best bet? What options that I have not considered might help me here?

This question is quite opinion-based, but based on past experience, I would recommend using copy-on-write. You could do it at the top level, copying the entire object, or you could do it piece by piece, which might address the overhead of copying. For example:
struct BigData;
struct Shared
{
std::shared_ptr<const BigData> a;
std::shared_ptr<const BigData> b;
std::shared_ptr<const BigData> c;
};
You would then only copy the parts you modify and send over a copy of Shared to the plugin. So if you change something in a, b and c would still point to the same object but a would point to a new BigData which was copied from the original and then modified.
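A slightly fuller sketch of that idea, under some assumptions: BigData is given a dummy payload, and the Core class and with_appended_to_a helper are hypothetical names added for illustration. The core holds its lock only long enough to copy or swap three pointers, and a plugin keeps whatever snapshot it took for as long as it likes:

```cpp
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct BigData { std::vector<int> values; };  // placeholder payload

struct Shared {
    std::shared_ptr<const BigData> a;
    std::shared_ptr<const BigData> b;
    std::shared_ptr<const BigData> c;
};

// Builds a new snapshot in which only `a` is replaced. The old snapshot
// is untouched, so readers still holding it are unaffected.
inline Shared with_appended_to_a(const Shared& old, int value) {
    auto copy = std::make_shared<BigData>(*old.a);  // deep-copy this part only
    copy->values.push_back(value);
    Shared next = old;         // copies three pointers, not the data
    next.a = std::move(copy);  // converts to shared_ptr<const BigData>
    return next;
}

class Core {
public:
    // Plugins call this; the lock is held only while copying pointers.
    Shared snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }
    // The core calls this to make a new state visible.
    void publish(Shared next) {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(next);
    }
private:
    mutable std::mutex mutex_;
    Shared current_;
};
```

Old BigData instances are freed automatically once the last plugin drops its snapshot, since each part is reference-counted.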
