MPI distribution layer - C++

I used MPI to write a distribution layer. Let's say we have n data sources and k data consumers. In my approach, each of the n MPI processes reads data, then distributes it to one (or more) of the k data consumers (other MPI processes) according to some given logic.
So it seems very generic, and my question is: does something like this already exist?
It seems simple, but it can get very complicated. Say the distribution logic checks which of the data consumers is ready to work (dynamic work distribution), or distributes data according to a given algorithm based on the data itself. There are plenty of possibilities, and, like everyone else, I do not want to reinvent the wheel.

As far as I know, there is no generic implementation for it, other than the MPI API itself. You should use the correct functions according to the problem's constraints.
If what you're trying to build is a simple n-producers-and-k-consumers synchronized job/data queue, then of course there are already many implementations out there (just google it and you should find a few).
However, the way you present it seems very general - sometimes you want the data to only be sent to one consumer, sometimes to all of them, etc. In that case, you should figure out what you want and when, and use either point-to-point communication functions, or collective communication functions, accordingly (and of course everyone has to know what to expect - you can't have a consumer waiting for data from a single source, while the producer wishes to broadcast the data...).
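For the dynamic-work-distribution case, a common point-to-point pattern is to have each consumer announce readiness with a small message, and have the producer reply to whichever consumer asked first. Here is a minimal sketch of that idea; the tag values and the double payload are invented for illustration:

```cpp
// Sketch: dynamic work distribution over MPI point-to-point messages.
// Consumers announce readiness; the producer hands the next item to
// whichever consumer asked first. Tag values are invented.
#include <mpi.h>
#include <vector>

const int TAG_READY = 1;  // consumer -> producer: "I'm idle"
const int TAG_WORK  = 2;  // producer -> consumer: payload follows
const int TAG_STOP  = 3;  // producer -> consumer: no more data

void producer(int num_consumers, const std::vector<double>& items) {
    for (double payload : items) {
        MPI_Status st;
        // Wait for any consumer to report that it is ready...
        MPI_Recv(nullptr, 0, MPI_BYTE, MPI_ANY_SOURCE, TAG_READY,
                 MPI_COMM_WORLD, &st);
        // ...and hand the next item to exactly that consumer.
        MPI_Send(&payload, 1, MPI_DOUBLE, st.MPI_SOURCE, TAG_WORK,
                 MPI_COMM_WORLD);
    }
    // Answer each consumer's final "ready" with a shutdown message.
    for (int c = 0; c < num_consumers; ++c) {
        MPI_Status st;
        MPI_Recv(nullptr, 0, MPI_BYTE, MPI_ANY_SOURCE, TAG_READY,
                 MPI_COMM_WORLD, &st);
        MPI_Send(nullptr, 0, MPI_BYTE, st.MPI_SOURCE, TAG_STOP,
                 MPI_COMM_WORLD);
    }
}

void consumer(int producer_rank) {
    for (;;) {
        MPI_Send(nullptr, 0, MPI_BYTE, producer_rank, TAG_READY,
                 MPI_COMM_WORLD);
        MPI_Status st;
        double item;
        MPI_Recv(&item, 1, MPI_DOUBLE, producer_rank, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;
        // ... process item ...
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        producer(size - 1, std::vector<double>(100, 3.14));
    else
        consumer(0);
    MPI_Finalize();
}
```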
All that aside, here is one implementation that comes to mind that seems to answer all of your requirements:
Make a synchronized queue, producers pushing data in one end, consumers taking it from the other (decide on all kinds of behaviors for the queue as you need - is the queue size limited, does adding an element to a full queue block or fail, does removing an element from an empty queue block or fail, etc.).
Assuming the data contains some flag that tells the consumers if this data is for everyone or just for one of them, the consumers peek and either remove the element, or leave it there and just note that they already did it (either by keeping its id locally, or by changing a flag in the data itself).
If you don't want a single piece of collective data to block until everyone dealt with it, you can use 2 queues, one for each type of data, and the consumers would take data from one of the queues at a time (either by choosing a different queue each time, randomly choosing a queue, prioritizing one of the queues, or by some accepted order that is deducible from the data (e.g. lowest id first)).
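As a concrete sketch of the queue described above (here as a bounded, blocking queue between threads; across MPI processes you would realize the same policies with messages, as in the earlier sketch):

```cpp
// Minimal sketch: a bounded, blocking, synchronized queue.
// Push blocks when full, pop blocks when empty; other policies
// (fail instead of block, unbounded, etc.) are easy variations.
#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class SyncQueue {
public:
    explicit SyncQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T value) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push_back(std::move(value));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return value;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> q_;
    const std::size_t capacity_;
};
```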
Sorry for the long answer, and I hope this helps :)

Related

What is the difference between queue and queue discipline in ns3?

I found that there are two queue-related classes in ns-3.
The first is https://github.com/nsnam/ns-3-dev-git/blob/master/src/network/utils/queue.h, which is named queue; a drop-tail queue is implemented based on it.
The other is https://github.com/nsnam/ns-3-dev-git/blob/master/src/traffic-control/model/queue-disc.h, which is named queue discipline, and many queue disciplines are implemented based on it.
I now want to know: what is the difference between these two notations?
First, I want to encourage you to read the ns-3 tutorial. You will find many of the answers to your questions there.
Queue and QueueDisc are not merely notations, they're distinct objects that serve distinct purposes. According to the ns-3 tutorial,
Architecturally, ns-3 separates the device layer from the IP layers or traffic control layers of an Internet host. Since recent releases of ns-3, outgoing packets traverse two queueing layers before reaching the channel object. The first queueing layer encountered is what is called the ‘traffic control layer’ in ns-3; here, active queue management (RFC7567) and prioritization due to quality-of-service (QoS) takes place in a device-independent manner through the use of queueing disciplines. The second queueing layer is typically found in the NetDevice objects. Different devices (e.g. LTE, Wi-Fi) have different implementations of these queues.
So, Queues are the lowest-level objects that actually store packets. A QueueDisc is an abstract class that provides a queue-like interface, but it actually implements Active Queue Management (AQM). More information about QueueDiscs can be found in the 'Detailed Description' section of the QueueDisc API documentation.
An implementation detail specific to ns-3: QueueDiscs actually encapsulate a Queue. This makes sense, since a QueueDisc still needs to store the packets it gets.
You will find many of ns-3's interfaces attempt to mirror the network subsystem of Linux.
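To make the separation tangible, here is roughly how the two layers are configured independently in a simulation script, modeled on ns-3's traffic-control examples (exact attribute names vary between ns-3 releases, so treat this as a sketch):

```cpp
// Sketch: the NetDevice's Queue and the traffic-control QueueDisc are
// set up separately, one per layer (based on ns-3 example style).
#include "ns3/core-module.h"
#include "ns3/internet-module.h"
#include "ns3/network-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/traffic-control-module.h"

using namespace ns3;

int main(int argc, char* argv[]) {
    NodeContainer nodes;
    nodes.Create(2);

    // Device layer: the NetDevice's own Queue (a drop-tail queue here).
    PointToPointHelper p2p;
    p2p.SetQueue("ns3::DropTailQueue", "MaxSize", StringValue("100p"));
    NetDeviceContainer devices = p2p.Install(nodes);

    InternetStackHelper stack;   // adds the traffic-control layer
    stack.Install(nodes);

    // Traffic-control layer: the QueueDisc (CoDel AQM here) sits above
    // the device queue and is installed separately.
    TrafficControlHelper tch;
    tch.SetRootQueueDisc("ns3::CoDelQueueDisc");
    tch.Install(devices);

    Simulator::Run();
    Simulator::Destroy();
    return 0;
}
```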

Event Sourcing/CQRS doubts about aggregates, atomicity, concurrency and eventual consistency

I'm studying event sourcing and command/query segregation and I have a few doubts that I hope someone with more experience will easily answer:
A) should a command handler work with more than one aggregate? (a.k.a. should they coordinate things between several aggregates?)
B) If my command handler generates more than one event to store, how do you guys push all those events atomically to the event store? (how can I guarantee no other command handler will "interleave" events in between?)
C) In many articles I read, people suggest using optimistic locking to write the new events, but in my use case I will have around 100 requests/second. This makes me think that a lot of requests will just fail at huge rates (a lot of ConcurrencyExceptions); how do you guys deal with this?
D) How to deal with the fact that the command handler can crash after storing the events in the event store but before publishing them to the event bus? (how to eventually push those "confirmed" events back to the event bus?)
E) How do you guys deal with the eventual consistency in the projections? Do you just live with it? Or do people in some cases lock things there too? (waiting for an update, for example)
I made a sequence diagram to better illustrate all those questions
(and sorry for the bad English)
If my command handler generates more than one event to store, how do you guys push all those events atomically to the event store?
Most reasonable event store implementations will allow you to batch multiple events into the same transaction.
In many articles I read, people suggest using optimistic locking to write the new events, but in my use case I will have around 100 requests/second.
If you have lots of parallel threads trying to maintain a complex invariant, something has gone badly wrong.
For "events" that aren't expected to establish or maintain any invariant, then you are just writing things to the end of a stream. In other words, you are probably not trying to write an event into a specific position in the stream. So you can probably use batching to reduce the number of conflicting writes, and a simple retry mechanism. In effect, you are using the same sort of "fan-in" patterns that appear when you have concurrent writers inserting into a queue.
For the cases where you are establishing/maintaining an invariant, you don't normally have many concurrent writers. Instead, specific writers have authority to write events (think "sharding"); the concurrency controls there are primarily to avoid making a mess in abnormal conditions.
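To make both answers concrete, here is a hypothetical in-memory event store sketch: the append takes the whole batch plus the expected stream version, commits all-or-nothing, and a thin retry loop absorbs occasional conflicts. Every name here is invented, but real event stores expose a similar expected-version parameter:

```cpp
// Hypothetical sketch of atomic batched appends with optimistic
// concurrency. A real event store would persist durably; the shape
// of the API is the point, not the storage.
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct Event { std::string type, payload; };

struct ConcurrencyException : std::runtime_error {
    ConcurrencyException() : std::runtime_error("stream version conflict") {}
};

class InMemoryEventStore {
public:
    // Appends the whole batch atomically, or not at all.
    void append(const std::string& stream_id,
                std::size_t expected_version,
                const std::vector<Event>& batch) {
        std::lock_guard<std::mutex> lock(m_);
        auto& stream = streams_[stream_id];
        if (stream.size() != expected_version)
            throw ConcurrencyException();          // someone wrote first
        stream.insert(stream.end(), batch.begin(), batch.end());
    }

    std::size_t version(const std::string& stream_id) {
        std::lock_guard<std::mutex> lock(m_);
        return streams_[stream_id].size();
    }

private:
    std::mutex m_;
    std::unordered_map<std::string, std::vector<Event>> streams_;
};

// A simple retry loop around a command: reload, decide, append.
template <typename DecideFn>
void handle_with_retry(InMemoryEventStore& store,
                       const std::string& stream_id,
                       DecideFn decide, int max_attempts = 3) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        std::size_t v = store.version(stream_id);
        std::vector<Event> events = decide(v);     // produce new events
        try {
            store.append(stream_id, v, events);
            return;
        } catch (const ConcurrencyException&) {
            // conflict: another writer appended; reload and retry
        }
    }
    throw ConcurrencyException();
}
```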
How to deal with the fact that the command handler can crash after storing the events in the event store but before publishing them to the event bus?
Use pull, rather than push, as the primary subscription mechanism. Make sure that subscribers can handle duplicate messages safely (aka "idempotent"). Don't use a message subscription that can re-order events when you need events strictly ordered.
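A minimal sketch of such a pull-based subscriber, with the event log and checkpoint as in-memory stand-ins for whatever durable storage you actually use:

```cpp
// Sketch: pull-based subscription with a checkpoint and idempotent
// handling. Because the subscriber owns its own position, a writer
// crashing between "store" and "publish" cannot lose events for this
// reader; at worst an event is seen twice, which the guard absorbs.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

struct StoredEvent { std::uint64_t position; std::string payload; };

std::vector<StoredEvent> g_log;            // stand-in for the event store
std::uint64_t g_checkpoint = 0;            // stand-in for a durable checkpoint
std::unordered_set<std::uint64_t> g_seen;  // idempotence guard

void handle(const StoredEvent& e) {
    if (!g_seen.insert(e.position).second) return;  // duplicate: skip safely
    std::cout << "handled " << e.payload << "\n";
}

// One polling pass: pull everything after the checkpoint, in order,
// then advance the checkpoint only after handling succeeded.
void pollOnce() {
    for (const auto& e : g_log) {
        if (e.position < g_checkpoint) continue;
        handle(e);
        g_checkpoint = e.position + 1;
    }
}

int main() {
    g_log = {{0, "a"}, {1, "b"}};
    pollOnce();
    g_log.push_back({2, "c"});
    pollOnce();   // re-reads nothing it already handled
}
```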
How do you guys deal with the eventual consistency in the projections? Do you just live with it?
Pretty much. Views and reports carry metadata to let you know at what fixed point in "time" the report was accurate.
Unless you lock out all writers while a report is being consumed, there's a potential for any data being out of date, regardless of whether you are using events vs some other data model, regardless of whether you are using a single data model or several.
It's all part of the tradeoff; we accept that there will be a larger window between report time and current time in exchange for lower response latency, an "immutable" event history, etc.
should a command handler work with more than one aggregate?
Probably not - which isn't the same thing as "never".
The usual framing goes something like this: aggregate isn't a domain-modeling pattern, like entity. It's a lifecycle pattern, used to make sure that all of the changes we make at one time are consistent.
In the case where you find that you want a command handler to modify multiple domain entities at the same time, and those entities belong to different aggregates, then have you really chosen the correct aggregate boundaries?
What you can do sometimes is have a single command handler that manages multiple transactions, updating a different aggregate in each. But it might be easier, in the long run, to have two different command handlers that each receive a copy of the command and decide what to do, independently.

What's the most efficient way to async send data while async receiving with 0MQ?

I've got a ROUTER/DEALER setup where both ends need to be able to receive and send data asynchronously, as soon as it's available. The model is pretty much 0MQ's async C++ server: http://zguide.zeromq.org/cpp:asyncsrv
Both the client and the server workers poll, when there's data available they call a callback. While this happens, from another thread (!) I'm putting data in a std::deque. In each poll-forever thread, I check the deque (under lock), and if there are items there, I send them out to the specified DEALER id (the id is placed in the queue).
But I can't help thinking that this is not idiomatic 0MQ. The mutex is possibly a design problem. Plus, memory consumption can probably get quite high if enough time passes between polls (and data accumulates in the deque).
The only alternative I can think of is having another DEALER thread connect to an inproc each time I want to send out data, and just have it send it and exit. However, this implies a connect per item of data sent + construction and destruction of a socket, and it's probably not ideal.
Is there an idiomatic 0MQ way to do this, and if so, what is it?
I don't fully understand your design, but I do understand your concern about using locks.
In most cases you can redesign your code to remove the use of locks using ZeroMQ PAIR sockets and inproc.
Do you really need a std::deque? If not, you could just use a ZeroMQ queue, as it's just a queue that you can read/write from different threads using sockets.
If you really need the deque, then encapsulate it into its own thread (a class would be nice) and make its API (push etc.) accessible via inproc sockets.
So, like I said before, I may be on the wrong track, but in 99% of the cases I have come across, you can always remove the locks completely with some ZMQ_PAIR/inproc if you need signalling.
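For instance, a hand-off between two threads over a PAIR/inproc socket pair looks roughly like this, using the plain libzmq C API (both sockets must share one context for inproc to work; the endpoint name is illustrative):

```cpp
// Sketch: replacing a mutex-guarded std::deque with a PAIR/inproc
// socket pair. The writer thread sends; the poll loop receives on the
// same socket set it already polls, so no lock is needed.
#include <zmq.h>
#include <cassert>
#include <thread>

int main() {
    void* ctx = zmq_ctx_new();

    // The receiving end, owned by the poll-forever thread.
    void* rx = zmq_socket(ctx, ZMQ_PAIR);
    zmq_bind(rx, "inproc://outgoing");

    std::thread producer([ctx] {
        // The sending end, owned by the data-producing thread.
        void* tx = zmq_socket(ctx, ZMQ_PAIR);
        zmq_connect(tx, "inproc://outgoing");
        const char msg[] = "payload";
        zmq_send(tx, msg, sizeof msg, 0);
        zmq_close(tx);
    });

    char buf[64];
    int n = zmq_recv(rx, buf, sizeof buf, 0);   // would sit in zmq_poll
    assert(n > 0);

    producer.join();
    zmq_close(rx);
    zmq_ctx_term(ctx);
}
```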
A 0MQ queue has a limited buffer size (the high-water mark), and it can be controlled. So memory use will grow to some point, and then messages will be dropped or senders will block, depending on the socket type. For that reason you may consider using the conflate option, which leaves only the most recent data in the queue.
For a single server communicating with many threads within a single machine, I suggest using the publish/subscribe model: with the conflate option you will receive the newest data whenever you read, you won't have to worry about memory, and it removes the blocking-queue problem.
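Setting the conflate option is a one-line socket option, applied before connecting. Continuing with the ctx from the previous sketch, a subscriber that only ever sees the latest message would look like this (note that ZMQ_CONFLATE is only valid for certain socket types, e.g. SUB, PULL, PUSH, PUB, DEALER, and it does not work with multi-part messages):

```cpp
// Sketch: a subscriber whose receive queue holds only the newest message.
void* sub = zmq_socket(ctx, ZMQ_SUB);
int conflate = 1;
zmq_setsockopt(sub, ZMQ_CONFLATE, &conflate, sizeof conflate);  // keep last only
zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);                      // all topics
zmq_connect(sub, "inproc://feed");
```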
As for your implementation, you are quite right: it is not the best design, but it is hard to avoid. I suggest checking the question "Access std::deque from 3 threads"; while it addresses your problem, it may not be the best approach.

Boost: is there an interprocess::message_queue-like mechanism for thread-only communication?

The boost::interprocess::message_queue mechanism seems primarily designed for just that: interprocess communication.
The problem is that it serializes the objects in the message:
"A message queue just copies raw bytes between processes and does not send objects."
This makes it completely unsuitable for fast and repeated interthread communication with large composite objects being passed.
I want to create a message with a ref/shared_ptr/pointer to a known and previously-created object and safely pass it from one thread to the next.
You CAN use asio::io_service and post with bind completions, but that's rather clunky AND requires that the thread in question be using asio, which seems a bit odd.
I've already written my own, sadly based on asio::io_service, but would prefer to switch over to a Boost-supported general mechanism.
You need a mechanism designed for interprocess communication because separate processes have separate address spaces, and you cannot simply pass pointers except in very special cases. For thread communication you can use standard containers like std::stack, std::queue, and std::priority_queue; you just need to provide proper synchronization through mutexes. Or you can use lock-free containers, which are also provided by Boost. What else would you need for interthread communication?
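As a sketch of the lock-free variant: Boost's lock-free queue requires a trivially copyable element type, so the large object travels as a raw pointer, and only the pointer is ever copied between threads.

```cpp
// Sketch: a Boost lock-free queue carrying raw pointers to large
// objects. Ownership passes with the pointer; no payload is copied
// and no mutex is taken.
#include <boost/lockfree/queue.hpp>
#include <thread>
#include <vector>

struct BigComposite { std::vector<double> samples; };

boost::lockfree::queue<BigComposite*> q(128);   // fixed capacity of 128

void producer() {
    auto* obj = new BigComposite;
    obj->samples.resize(1000000);
    while (!q.push(obj)) { /* queue full: spin, yield, or drop */ }
}

void consumer() {
    BigComposite* obj = nullptr;
    while (!q.pop(obj)) { /* queue empty: spin or yield */ }
    // ... use *obj: the million doubles were never copied ...
    delete obj;
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```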
Whilst I'm no expert in Boost per se, there is a fundamental difficulty in communicating between processes or threads via a pipe, message queue, etc., especially if it is assumed that a program's data lives in classes containing dynamically allocated memory (which is pretty much the case for things written with Boost; a string is not a simple object like it is in C...).
Copying of Data in Classes
Message queues and pipes are indeed just a way of passing a collection of bytes from one thread/process to another thread/process. Generally when you use them you're looking for the destination thread to end up with a copy of the original data, not just a copy of the references to the data (which would be pointing back at the original data).
With a simple C struct containing no pointers at all it's easy; a copy of the struct contains all the data, no problem. But a C++ class with complex data types like strings is now a structure containing references / pointers to allocated memory. Copy that structure and you haven't actually copied the data in the allocated memory.
That's where serialisation comes in. For interprocess communications, where both processes can't ordinarily share the same memory, serialisation serves as a way of parcelling up the structure to be sent, plus all the data it refers to, into a stream of bytes that can be unpacked at the other end. For threads it's no different if you don't want the two threads accessing the same memory at the same time. Serialisation is a convenient way of saving yourself from having to navigate through a class to see exactly what needs to be copied.
Efficiency
I don't know what Boost uses for serialisation, but clearly serialising to XML would be painfully inefficient. A binary serialisation like ASN.1 BER would be much faster.
Also, copying data through pipes and message queues is no longer as inefficient as it used to be. Traditionally programmers have avoided it because of the perceived waste of time spent copying the data repeatedly just to share it with another thread. On a single-core machine that involves a lot of slow and wasteful memory accesses.
However, if one considers what "memory access" is in these days of QPI, Hypertransport, and so forth, it's not so very different to just copying the data in the first place. In both cases it involves data being sent over a serial bus from one core's memory controller to another core's cache.
Today's CPUs are really NUMA machines with memory access protocols layered on top of serial networks to fake an SMP environment. Programming in the style of copying messages through pipes, message queues, etc. is definitely edging towards saying that one is content with the idea of NUMA, and that really you don't need SMP at all.
Also, if you do all your inter-thread communications as message queues, they're not so very different to pipes, and pipes aren't so different to network sockets (at least that's the case on Not-Windows). So if you write your code carefully you can end up with a program that can be redeployed across a distributed network of computers or across a number of threads within a single process. That's a nice way of getting scalability because you're not changing the shape or feel of your program in any significant way when you scale up.
Fringe Benefits
Depending on the serialisation technology used there can be some fringe benefits. With ASN.1 you specify a message schema in which you set out the valid ranges of the message's contents. You can say, for example, that a message contains an integer, and it can have values between 0 and 10. The encoders and decoders generated by decent ASN.1 tools will automatically check that the data you're sending or receiving meets that constraint, and returns errors if not.
I would be surprised if other serialisers like Google Protocol Buffers didn't do a similar constraints check for you.
The benefit is that if you have a bug in your program and you try and send an out of spec message, the serialiser will automatically spot that for you. That can save a ton of time in debugging. Also it is something you definitely don't get if you share a memory buffer and protect it with a semaphore instead of using a message queue.
CSP
Communicating Sequential Processes and the Actor model are based on sending copies of data through message queues, pipes, etc. just like you're doing. CSP in particular is worth paying attention to because it's a good way of avoiding a lot of the pitfalls of multi-threaded software that can lurk undetected in source code.
There are some CSP implementations you can just use. There's JCSP, a class library for Java, and C++CSP, built on top of Boost to do CSP for C++. They're both from the University of Kent.
C++CSP looks quite interesting. It has a template class called csp::mobile, which is kind of like a Boost smart pointer. If you send one of these from one thread to another via a channel (CSP's word for a message queue) you're sending the reference, not the data. However, the template records which thread 'owns' the data. So a thread receiving a mobile now owns the data (which hasn't actually moved), and the thread that sent it can no longer access it. So you get the benefits of CSP without the overhead of copying the data.
It also looks like C++CSP is able to do channels over TCP; that's a very attractive feature, because scaling up becomes a really simple possibility. JCSP works over network connections too.

C++ IOCP server container information

I have been passing a few ideas around in my head about how to actually contain large numbers of connections using an IO type of architecture while maintaining KISS. Through examples on the web, it seems like most use a doubly/singly linked list with CONTAINING_RECORD. And, as a newbie in IO servers (though improving every day), I too use a linked-list container for an IO architecture.
My question is: instead of using a singly/doubly linked list for my connections, why can't I just build a large array and use CONTAINING_RECORD? Can I use an STL vector? Would that work? Also, what other types of containers work best with a massive IO server?
I'm in the process of re-writing the server architecture for my game server (after many revisions), and would like to head in the right direction this time around, because I'd rather not have to rewrite it again in the near future.
Thank you for your time and replies.
Edit: Currently my server architecture is (in a nutshell):
Main thread listening and accepting -> Pass over the socket into a container.
Worker threads (2-3) grab IO events for the container of sockets.
Worker threads Read/Write Data on that container.
Main thread and worker threads all use a linked-list. I want to get away from this.
Your "connection list" will probably have removals from any position, not just the end. For std::vector, removing elements in the middle is an O(N) operation, but for linked lists it can be O(1). (For single-linked lists this isn't trivial and may require an inconvenient API).
std::map may be an interesting choice as it offers both O(log N) finding and removing of elements.
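A sketch of what that looks like in practice (the Connection fields are invented; in a real multithreaded server the container itself still needs synchronization):

```cpp
// Sketch: connection bookkeeping in an associative container keyed by
// the socket handle, so an IO event can be mapped to its state in
// O(log N) (std::map) or amortized O(1) (std::unordered_map), and
// removal from the middle is cheap, unlike std::vector.
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

using SocketHandle = std::uintptr_t;   // stand-in for SOCKET / fd

struct Connection {                    // invented per-connection state
    std::string peer;
    std::uint64_t bytes_in = 0, bytes_out = 0;
};

// NB: in a multithreaded server this map needs a lock of its own.
std::unordered_map<SocketHandle, Connection> connections;

void onAccept(SocketHandle s, const std::string& peer) {
    connections.emplace(s, Connection{peer});
}

void onReadCompleted(SocketHandle s, std::size_t n) {
    auto it = connections.find(s);     // look up state by descriptor
    if (it != connections.end())
        it->second.bytes_in += n;
}

void onClose(SocketHandle s) {
    connections.erase(s);              // cheap removal from the middle
}

int main() {
    onAccept(42, "10.0.0.7");
    onReadCompleted(42, 512);
    onClose(42);
}
```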
As with all data structures, it depends very much on what you want to do with it.
In a previous job I spent most of my time working on a hugely multithreaded C++ server which, in its Windows incarnation, used IO Completion Ports (the Solaris backend used /dev/poll, which is not that dissimilar in several essentials). That one stored connection-related data structures in a map-like structure dating from before the STL, using the file descriptors as the key values. Thus whenever we got an event on a connection we could look up its related data structures by the descriptor the IO layer handed us. New connections were easy to handle - just add an entry to the dictionary - and closed connections could also be cleaned up quite trivially.
Naturally one has to be careful about cross-thread access to these structures and about operation ordering - since IO is inherently effectful, the ordering of operations is crucial. Fortunately IOCP won't give you another event on another thread for the same socket until you put the socket back into the CP. The Solaris implementation, however, also had to keep a structure linking file descriptors to worker threads, in order to ensure that we only processed one event per socket at a time, and in strict order. We also tried to inject subsequent events for a socket into the same thread, to avoid having to potentially switch the socket's structures onto another processor, which is a disaster for cache hit rates.
The basic summary though is that we found an appropriately-designed dictionary class to be incredibly useful for this sort of thing.