C++ IOCP server container information

I have been passing a few ideas around in my head about how to actually contain large numbers of connections using an IO type of architecture while maintaining KISS. Through examples on the web, it seems like most use a double/single linked list with CONTAINING_RECORD. And, as a newbie in IO servers (though improving every day), I too use a linked-list container for an IO architecture.
My question is, instead of using a single/double linked list for my connections, why can't I just build a large array and use CONTAINING_RECORD? Can I use an STL vector? Would that work? Also, what other types of containers work best for a massive IO server?
I'm in the process of re-writing the server architecture for my game server (after many revisions), and would like to head in the right direction this time around, because I'd rather not have to rewrite it again in the near future.
Thank you for your time, and replies.
Edit: Currently my server architecture is (in a nutshell):
Main thread listening and accepting -> Pass over the socket into a container.
Worker threads (2-3) grab IO events for the container of sockets.
Worker threads Read/Write Data on that container.
Main thread and worker threads all use a linked-list. I want to get away from this.

Your "connection list" will probably have removals from any position, not just the end. For std::vector, removing elements in the middle is an O(N) operation, but for linked lists it can be O(1). (For single-linked lists this isn't trivial and may require an inconvenient API).
std::map may be an interesting choice as it offers both O(log N) finding and removing of elements.
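To make the trade-off concrete, here is a minimal sketch; the Connection struct and the SOCKET alias are illustrative stand-ins rather than any particular IOCP framework's types:

```cpp
// A minimal sketch of the removal-cost trade-off described above.
#include <cstdint>
#include <list>
#include <map>

using SOCKET = std::uintptr_t;                   // stand-in for the Winsock handle type
struct Connection { SOCKET s; /* buffers, OVERLAPPED state, ... */ };

// Linked list: O(1) removal, but only if you kept the iterator around.
std::list<Connection> connections;
using ConnHandle = std::list<Connection>::iterator;
void drop(ConnHandle h) { connections.erase(h); }            // O(1)

// std::map keyed by the socket: O(log N) find and erase, no iterator bookkeeping.
std::map<SOCKET, Connection> by_socket;
void drop_by_key(SOCKET s) { by_socket.erase(s); }           // O(log N)
```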

As with all data structures, it depends very much on what you want to do with it.
In a previous job I spent most of my time working on a hugely multithreaded C++ server which, in its Windows incarnation, used IO Completion Ports (the Solaris backend used /dev/poll, which is not that dissimilar in several essentials). That one stored connection-related data structures in a map-like structure dating from before the STL, using the file descriptors as the key values. Thus whenever we got an event on a connection we could look up its related data structures by the descriptor the IO layer handed us. New connections were easy to handle - just add an entry to the dictionary - and closed connections could also be cleaned up quite trivially.
Naturally one has to be careful about cross-thread access to these structures and about operation ordering - since IO is inherently effectful, the ordering of operations is crucial. Fortunately IOCP won't give you another event on another thread for the same socket until you put the socket back into the completion port. The Solaris implementation, however, also had to keep a structure linking file descriptors to worker threads in order to ensure that we only processed one event per socket at a time, and in strict order. We also tried to inject subsequent events for a socket into the same thread, to avoid having to switch the socket's structures onto another processor, which is a disaster for cache hit rates.
The basic summary though is that we found an appropriately-designed dictionary class to be incredibly useful for this sort of thing.
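As a rough illustration of that dictionary idea (not the original pre-STL class, just a sketch with assumed names), a mutex-guarded map keyed by the descriptor might look like this:

```cpp
// Hedged sketch: per-connection state looked up by descriptor, with the
// cross-thread locking mentioned above. All names are illustrative.
#include <memory>
#include <mutex>
#include <unordered_map>

struct ConnectionState { /* buffers, protocol state, ... */ };

class ConnectionTable {
public:
    void add(int fd, std::shared_ptr<ConnectionState> state) {
        std::lock_guard<std::mutex> lock(m_);
        table_[fd] = std::move(state);
    }
    std::shared_ptr<ConnectionState> find(int fd) {
        std::lock_guard<std::mutex> lock(m_);
        auto it = table_.find(fd);
        return it == table_.end() ? nullptr : it->second;
    }
    void remove(int fd) {
        std::lock_guard<std::mutex> lock(m_);
        table_.erase(fd);
    }
private:
    std::mutex m_;
    std::unordered_map<int, std::shared_ptr<ConnectionState>> table_;
};

// Worker thread, per completion event:
//   if (auto conn = table.find(fd)) { /* process the IO event for conn */ }
```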


What's the most efficient way to async send data while async receiving with 0MQ?

I've got a ROUTER/DEALER setup where both ends need to be able to receive and send data asynchronously, as soon as it's available. The model is pretty much 0MQ's async C++ server: http://zguide.zeromq.org/cpp:asyncsrv
Both the client and the server workers poll; when there's data available, they call a callback. While this happens, from another thread (!) I'm putting data in a std::deque. In each poll-forever thread, I check the deque (under lock), and if there are items there, I send them out to the specified DEALER id (the id is placed in the queue).
But I can't help thinking that this is not idiomatic 0MQ. The mutex is possibly a design problem. Plus, memory consumption can probably get quite high if enough time passes between polls (and data accumulates in the deque).
The only alternative I can think of is having another DEALER thread connect to an inproc each time I want to send out data, and just have it send it and exit. However, this implies a connect per item of data sent + construction and destruction of a socket, and it's probably not ideal.
Is there an idiomatic 0MQ way to do this, and if so, what is it?
I don't fully understand your design, but I do understand your concern about using locks.
In most cases you can redesign your code to remove the use of locks using zeromq PAIR sockets and inproc.
Do you really need a std::deque? If not, you could just use a zeromq queue, as it's just a queue that you can read from / write to from different threads using sockets.
If you really need the deque then encapsulate it into its own thread (a class would be nice) and make its API (push etc) accessible via inproc sockets.
So, like I said before, I may be on the wrong track, but in 99% of the cases I have come across you can remove the locks completely with some ZMQ_PAIR/inproc sockets if you need signalling.
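As a hedged sketch of that suggestion, assuming the cppzmq header (zmq.hpp) and an illustrative inproc endpoint name, the PAIR/inproc replacement for a mutex-protected deque could look roughly like this:

```cpp
// Hedged sketch: replacing a locked std::deque with a PAIR/inproc pipe.
#include <zmq.hpp>
#include <cstring>
#include <string>
#include <thread>

int main() {
    zmq::context_t ctx(1);

    // The end owned by the thread that used to push into the locked deque.
    zmq::socket_t push_end(ctx, ZMQ_PAIR);
    push_end.bind("inproc://outgoing");       // bind before the peer connects

    std::thread sender([&ctx]() {
        // The end owned by the poll-forever thread that talks to the DEALER.
        zmq::socket_t pull_end(ctx, ZMQ_PAIR);
        pull_end.connect("inproc://outgoing");

        zmq::message_t msg;
        pull_end.recv(&msg);                   // no mutex: the socket is the queue
        // ...forward msg out of the DEALER socket here...
    });

    std::string payload = "data for some dealer id";
    zmq::message_t msg(payload.size());
    std::memcpy(msg.data(), payload.data(), payload.size());
    push_end.send(msg);

    sender.join();
    return 0;
}
```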
A 0MQ queue has a limited buffer size, and that size can be controlled; once the limit is reached, data will be dropped rather than memory growing indefinitely. For that reason you may consider using the conflate option, which keeps only the most recent data in the queue.
In the case of a single server communicating with many threads within a single machine, I suggest using the publish/subscribe model; with the conflate option you will receive new data as soon as you read the buffer and won't have to worry about memory. It also removes the blocking-queue problem.
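For example, a subscriber with the conflate option set might look like the following sketch (cppzmq assumed; the endpoint name is illustrative, and a matching PUB socket bound elsewhere in the process is implied):

```cpp
// Hedged sketch of the conflate idea: a SUB socket that keeps only the most
// recent message in its queue.
#include <zmq.hpp>

int main() {
    zmq::context_t ctx(1);
    zmq::socket_t sub(ctx, ZMQ_SUB);

    int conflate = 1;
    // Set before connecting; with conflate the incoming queue holds one message.
    sub.setsockopt(ZMQ_CONFLATE, &conflate, sizeof(conflate));
    sub.setsockopt(ZMQ_SUBSCRIBE, "", 0);
    sub.connect("inproc://updates");           // assumes a PUB bound here already

    zmq::message_t msg;
    sub.recv(&msg);    // always the latest published update; older ones were dropped
    return 0;
}
```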
As for your implementation, you are quite right: it is not the best design, but it is quite unavoidable. I suggest checking the question "Access std::deque from 3 threads"; while it addresses your problem, it may not be the best approach.

Boost: is there an interprocess::message_queue-like mechanism for thread-only communication?

The boost::interprocess::message_queue mechanism seems primarily designed for just that: interprocess communication.
The problem is that it serializes the objects in the message:
"A message queue just copies raw bytes between processes and does not send objects."
This makes it completely unsuitable for fast and repeated interthread communication with large composite objects being passed.
I want to create a message with a ref/shared_ptr/pointer to a known and previously-created object and safely pass it from one thread to the next.
You CAN use asio::io_service and post with bind completions, but that's rather clunky AND requires that the thread in question be using asio, which seems a bit odd.
I've already written my own, sadly based on asio::io_service, but would prefer to switch over to a Boost-supported general mechanism.
You need a mechanism designed for interprocess communication because separate processes have separate address spaces, and you cannot simply pass pointers except in very special cases. For thread communication you can use standard containers like std::stack, std::queue, and std::priority_queue to communicate between threads; you just need to provide proper synchronization through mutexes. Or you can use lock-free containers, which Boost also provides. What else would you need for interthread communication?
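For instance, a minimal thread-only queue along those lines, passing shared_ptrs so only the pointer crosses threads (all names here are illustrative, not a Boost facility), could be sketched as:

```cpp
// Hedged sketch: a thread-only "message queue" passing shared_ptrs, built
// from std::queue + mutex + condition_variable.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>

struct Payload { /* large composite object */ };

class ThreadQueue {
public:
    void push(std::shared_ptr<Payload> item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();                      // wake one waiting consumer
    }

    std::shared_ptr<Payload> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        auto item = std::move(q_.front());
        q_.pop();
        return item;                           // only the pointer crosses threads
    }

private:
    std::queue<std::shared_ptr<Payload>> q_;
    std::mutex m_;
    std::condition_variable cv_;
};
```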
Whilst I'm no expert in Boost per se, there is a fundamental difficulty in communicating between processes and threads via a pipe, message queue, etc., especially if it is assumed that a program's data is held in classes containing dynamically allocated memory (which is pretty much the case for things written with Boost; a string is not a simple object like it is in C...).
Copying of Data in Classes
Message queues and pipes are indeed just a way of passing a collection of bytes from one thread/process to another thread/process. Generally when you use them you're looking for the destination thread to end up with a copy of the original data, not just a copy of the references to the data (which would be pointing back at the original data).
With a simple C struct containing no pointers at all it's easy; a copy of the struct contains all the data, no problem. But a C++ class with complex data types like strings is now a structure containing references / pointers to allocated memory. Copy that structure and you haven't actually copied the data in the allocated memory.
That's where serialisation comes in. For interprocess communications, where both processes can't ordinarily share the same memory, serialisation serves as a way of parcelling up the structure to be sent, plus all the data it refers to, into a stream of bytes that can be unpacked at the other end. For threads it's no different if you don't want the two threads accessing the same memory at the same time. Serialisation is a convenient way of saving yourself from having to navigate through a class to see exactly what needs to be copied.
Efficiency
I don't know what Boost uses for serialisation, but clearly serialising to XML would be painfully inefficient. A binary serialisation like ASN.1 BER would be much faster.
Also, copying data through pipes and message queues is no longer as inefficient as it used to be. Traditionally programmers haven't done it because of the perceived waste of time spent copying the data repeatedly just to share it with another thread. On a single-core machine that involves a lot of slow and wasteful memory accesses.
However, if one considers what "memory access" is in these days of QPI, Hypertransport, and so forth, it's not so very different to just copying the data in the first place. In both cases it involves data being sent over a serial bus from one core's memory controller to another core's cache.
Today's CPUs are really NUMA machines with memory access protocols layered on top of serial networks to fake an SMP environment. Programming in the style of copying messages through pipes, message queues, etc. is definitely edging towards saying that one is content with the idea of NUMA, and that really you don't need SMP at all.
Also, if you do all your inter-thread communications as message queues, they're not so very different to pipes, and pipes aren't so different to network sockets (at least that's the case on Not-Windows). So if you write your code carefully you can end up with a program that can be redeployed across a distributed network of computers or across a number of threads within a single process. That's a nice way of getting scalability because you're not changing the shape or feel of your program in any significant way when you scale up.
Fringe Benefits
Depending on the serialisation technology used there can be some fringe benefits. With ASN.1 you specify a message schema in which you set out the valid ranges of the message's contents. You can say, for example, that a message contains an integer and that it can have values between 0 and 10. The encoders and decoders generated by decent ASN.1 tools will automatically check that the data you're sending or receiving meets that constraint, and return errors if not.
I would be surprised if other serialisers like Google Protocol Buffers didn't do a similar constraints check for you.
The benefit is that if you have a bug in your program and you try and send an out of spec message, the serialiser will automatically spot that for you. That can save a ton of time in debugging. Also it is something you definitely don't get if you share a memory buffer and protect it with a semaphore instead of using a message queue.
CSP
Communicating Sequential Processes and the Actor model are based on sending copies of data through message queues, pipes, etc. just like you're doing. CSP in particular is worth paying attention to because it's a good way of avoiding a lot of the pitfalls of multi-threaded software that can lurk undetected in source code.
There are some CSP implementations you can just use. There's JCSP, a class library for Java, and C++CSP, built on top of Boost to do CSP for C++. They're both from the University of Kent.
C++CSP looks quite interesting. It has a template class called csp::mobile, which is kind of like a Boost smart pointer. If you send one of these from one thread to another via a channel (CSP's word for a message queue) you're sending the reference, not the data. However, the template records which thread 'owns' the data. So a thread receiving a mobile now owns the data (which hasn't actually moved), and the thread that sent it can no longer access it. So you get the benefits of CSP without the overhead of copying the data.
It also looks like C++CSP is able to do channels over TCP; that's a very attractive feature, as scaling up then becomes a really simple possibility. JCSP works over network connections too.

MPI distribution layer

I used MPI to write a distribution layer. Let's say we have n data sources and k data consumers. In my approach, each of the n MPI processes reads data, then distributes it to one (or many) of the k data consumers (other MPI processes) in a given manner (logic).
So it seems to be very generic, and my question is: has something like that already been done?
It seems simple, but it might be very complicated. Let's say that the distribution checks which of the data consumers is ready to work (dynamic work distribution). It may distribute data according to a given algorithm based on the data itself. There are plenty of possibilities, and, like everyone else, I do not want to reinvent the wheel.
As far as I know, there is no generic implementation for it, other than the MPI API itself. You should use the correct functions according to the problem's constraints.
If what you're trying to build is a simple n-producers-and-k-consumers synchronized job/data queue, then of course there are already many implementations out there (just google it and you should find a few).
However, the way you present it seems very general - sometimes you want the data to only be sent to one consumer, sometimes to all of them, etc. In that case, you should figure out what you want and when, and use either point-to-point communication functions, or collective communication functions, accordingly (and of course everyone has to know what to expect - you can't have a consumer waiting for data from a single source, while the producer wishes to broadcast the data...).
All that aside, here is one implementation that comes to mind that seems to answer all of your requirements:
Make a synchronized queue, producers pushing data in one end, consumers taking it from the other (decide on all kinds of behaviors for the queue as you need - is the queue size limited, does adding an element to a full queue block or fail, does removing an element from an empty queue block or fail, etc.).
Assuming the data contains some flag that tells the consumers if this data is for everyone or just for one of them, the consumers peek and either remove the element, or leave it there and just note that they already did it (either by keeping its id locally, or by changing a flag in the data itself).
If you don't want a single piece of collective data to block until everyone has dealt with it, you can use 2 queues, one for each type of data, and the consumers would take data from one of the queues at a time (either by choosing a different queue each time, randomly choosing a queue, prioritizing one of the queues, or by some accepted order that is deducible from the data (e.g. lowest id first)).
Sorry for the long answer, and I hope this helps :)
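For what it's worth, here is a hedged sketch of the dynamic, point-to-point style of distribution mentioned above, with a single producer at rank 0 and the remaining ranks as consumers; the tags and the reduction to one producer are assumptions of the example:

```cpp
// Hedged sketch of dynamic work distribution with plain point-to-point MPI:
// consumers ask for data, the producer replies to whoever asked first.
#include <mpi.h>
#include <cstddef>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int TAG_REQUEST = 1, TAG_DATA = 2, TAG_DONE = 3;

    if (rank == 0) {                              // single producer for simplicity
        std::vector<int> work = {10, 20, 30, 40, 50};
        std::size_t next = 0;
        int finished = 0;
        while (finished < size - 1) {
            int dummy;
            MPI_Status st;
            // Wait for any consumer to say it is ready.
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < work.size()) {
                MPI_Send(&work[next++], 1, MPI_INT, st.MPI_SOURCE, TAG_DATA,
                         MPI_COMM_WORLD);
            } else {
                int stop = 0;
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD);
                ++finished;
            }
        }
    } else {                                      // consumers
        for (;;) {
            int ready = 1, item;
            MPI_Status st;
            MPI_Send(&ready, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            // ...process item...
        }
    }

    MPI_Finalize();
    return 0;
}
```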

Writing a server application that Pushes to clients (TCP)

I'm writing a client-server application, and one of the requirements is that the server, upon receiving an update from one of the clients, be able to push out new data to all the other clients. This is a C++ (Qt) application meant to run on Linux (both client and server), but I'm more looking for high-level conceptual ideas of how this should work (though low-level thoughts are good, too).
Server:
It needs to (among its other duties) keep a socket open listening for incoming packets from potentially n different clients, presumably on a background thread (I haven't written much in terms of socket code other than some rinky-dink examples in school). Upon getting this data from a client, it processes it and then spits it out to all its clients, right?
Of course, I'm not sure how it actually does this. I'm guessing this means it has to keep persistent connections with every single client (at least the active clients), but I don't understand even conceptually how to maintain this connection (or the list of these connections).
So, how should I approach this?
In general when you have multiple clients, there are a few ways to handle this.
First of all, in TCP, when a client connects to you they're placed into a queue until they can be serviced. This is a given; you don't need to do anything except call the accept system call to receive a new client. Once the client is received, you'll be given a socket which you use to read and write. Who reads/writes first is entirely dependent on your protocol, but both sides need to know the protocol (which is up to you to define).
Once you've got the socket, you can do a few things. In a simple case, you just read some data, process it, write back to the socket, close the socket, and serve the next client. Unfortunately this means you can only serve one client at a time, thus no "push" updates are possible. Another strategy is to keep a list of all the open sockets. Any "updates" simply iterate over the list and write to each socket. This may present a problem though because it only allows push updates (if a client sent a request, who would be watching for it?)
The more advanced approach is to assign one thread to each socket. In this scenario, each time a socket is created, you spin up a new thread whose whole purpose is to serve exactly one client. This cuts down on latency and utilizes multiple cores (if available), but is far more difficult to program. Also if you have 10,000 clients connecting, that's 10,000 threads which gets to be too much. Pushing an update to a single client (in this scenario) is very simple (a thread just writes to its respective socket). Pushing to all of them at once is a little more tricky (requires either a thread event or a producer / consumer queue, neither of which are very fun to implement)
There are, of course, a million other ways to handle this (one process per client, a thread pool, a load-balancing proxy, you name it). Suffice it to say there's no way to cover all of these in one answer. I hope this answers your basic questions, let me know if you need me to clarify anything. It's a very large subject. However if I might make a suggestion, handling multiple clients is a wheel that has been re-invented a million times. There are very good libraries out there that are far more efficient and programmer-friendly than raw socket IO. I suggest libevent, which turns network requests into an event-driven paradigm (much more like GUI programming, which might be nice for you), and is incredibly efficient.
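For reference, here is a minimal POSIX sketch of the "keep a list of open sockets" strategy, using poll() so one loop can both notice client requests and push to every connected client (the port number and buffer size are arbitrary, and most error handling is omitted):

```cpp
// Hedged sketch: one poll() loop that accepts clients, reads their requests,
// and broadcasts each update to every connected client.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

static void broadcast(const std::vector<int>& clients, const std::string& msg) {
    for (int fd : clients)
        send(fd, msg.data(), msg.size(), 0);       // push to every connected client
}

int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, SOMAXCONN);

    std::vector<int> clients;
    for (;;) {
        std::vector<pollfd> fds{{listener, POLLIN, 0}};
        for (int fd : clients) fds.push_back({fd, POLLIN, 0});

        poll(fds.data(), fds.size(), -1);

        if (fds[0].revents & POLLIN)                           // new client
            clients.push_back(accept(listener, nullptr, nullptr));

        for (std::size_t i = 1; i < fds.size(); ++i) {
            if (!(fds[i].revents & POLLIN)) continue;
            char buf[512];
            ssize_t n = recv(fds[i].fd, buf, sizeof(buf), 0);
            if (n <= 0) {                                      // client went away
                close(fds[i].fd);
                clients.erase(std::find(clients.begin(), clients.end(), fds[i].fd));
            } else {
                // A client sent an update: process it, then push it to everyone.
                broadcast(clients, std::string(buf, n));
            }
        }
    }
}
```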
From what I understand, I think you need to keep an infinite loop going (at least until the program terminates) that answers connection requests from your clients. It would be best to add them to an array of some sort. Use an event to see when a new client is added to that array, and wait for one of them to send data. Then you do what you have to do with that data and spit it back out.

Fast Cross Platform Inter Process Communication in C++

I'm looking for a way to get two programs to efficiently transmit a large amount of data to each other, which needs to work on Linux and Windows, in C++. The context here is a P2P network program that acts as a node on the network and runs continuously, and other applications (which could be games hence the need for a fast solution) will use this to communicate with other nodes in the network. If there's a better solution for this I would be interested.
boost::asio is a cross-platform library handling asynchronous IO over sockets. You can combine this with, for instance, Google Protocol Buffers for your actual messages.
Boost also provides you with boost::interprocess for interprocess communication on the same machine, but asio lets you do your communication asynchronously and you can easily have the same handlers for both local and remote connections.
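As a rough sketch of that combination, a boost::asio acceptor with an asynchronous read (the port is arbitrary, and the bytes read could be a serialized Protocol Buffer; io_service matches the older asio naming used elsewhere in this thread) might look like:

```cpp
// Hedged sketch: accept one connection asynchronously and read whatever the
// peer sends (e.g. a serialized protobuf message).
#include <boost/asio.hpp>
#include <array>
#include <iostream>

using boost::asio::ip::tcp;

int main() {
    boost::asio::io_service io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 5000));
    tcp::socket socket(io);
    std::array<char, 1024> buf;

    acceptor.async_accept(socket, [&](const boost::system::error_code& ec) {
        if (ec) return;
        socket.async_read_some(boost::asio::buffer(buf),
            [&](const boost::system::error_code& ec, std::size_t n) {
                if (!ec)
                    std::cout << "received " << n << " bytes\n";
            });
    });

    io.run();   // the same io_service can also run local (interprocess) handlers
    return 0;
}
```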
I have been using ICE by ZeroC (www.zeroc.com), and it has been fantastic. Super easy to use, and it's not only cross platform, but has support for many languages as well (python, java, etc) and even an embedded version of the library.
Well, if we can assume the two processes are running on the same machine, then the fastest way for them to transfer large quantities of data back and forth is by keeping the data inside a shared memory region; with that setup, the data is never copied at all, since both processes can access it directly. (If you wanted to go even further, you could combine the two programs into one program, with each former 'process' now running as a thread inside the same process space instead. In that case they would be automatically sharing 100% of their memory with each other)
Of course, just having a shared memory area isn't sufficient in most cases: you would also need some sort of synchronization mechanism so that the processes can read and update the shared data safely, without tripping over each other. The way I would do that would be to create two double-ended queues in the shared memory region (one for each process to send with). Either use a lockless FIFO-queue class, or give each double-ended queue a semaphore/mutex that you can use to serialize pushing data items into the queue and popping data items out of the queue.
(Note that the data items you'd be putting into the queues would only be pointers to the actual data buffers, not the data itself... otherwise you'd be back to copying large amounts of data around, which you want to avoid. It's a good idea to use shared_ptrs instead of plain C pointers, so that "old" data will be automatically freed when the receiving process is done using it.)
Once you have that, the only other thing you'd need is a way for process A to notify process B when it has just put an item into the queue for B to receive (and vice versa)... I typically do that by writing a byte into a pipe that the other process is select()-ing on, to cause the other process to wake up and check its queue, but there are other ways to do it as well.
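A reduced sketch of that idea using Boost pieces (the segment and queue names are illustrative, and cleanup/error handling are omitted): the payload lives in a managed shared-memory segment, and only a small handle to it travels through a boost::interprocess::message_queue:

```cpp
// Hedged sketch: put the large data in shared memory, pass only a handle
// (an offset within the segment) through an interprocess message queue.
#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/ipc/message_queue.hpp>
#include <cstring>

namespace bip = boost::interprocess;

int main() {
    // Producer side.
    bip::managed_shared_memory segment(bip::open_or_create, "demo_segment", 1 << 20);
    bip::message_queue mq(bip::open_or_create, "demo_queue", 64,
                          sizeof(bip::managed_shared_memory::handle_t));

    // Put the actual (possibly large) data in shared memory...
    char* payload = static_cast<char*>(segment.allocate(4096));
    std::strcpy(payload, "large block of data");

    // ...and send only a handle to it over the queue.
    auto handle = segment.get_handle_from_address(payload);
    mq.send(&handle, sizeof(handle), 0);

    // Consumer side (normally another process): turn the handle back into a pointer.
    bip::managed_shared_memory::handle_t received;
    bip::message_queue::size_type recvd_size;
    unsigned int priority;
    mq.receive(&received, sizeof(received), recvd_size, priority);
    char* data = static_cast<char*>(segment.get_address_from_handle(received));
    // ...use data, then segment.deallocate(data) when the consumer is done...
    (void)data;
    return 0;
}
```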
This is a hard problem.
The bottleneck is the internet, and the fact that your clients might be behind NAT.
If you are not talking about the internet, or if you explicitly don't have clients behind carrier-grade evil NATs, you need to say so.
Because it boils down to: use TCP. Suck it up.
I would strongly suggest Protocol Buffers on top of TCP or UDP sockets.
So, while the other answers cover part of the problem (socket libraries), they're not telling you about the NAT issue. Rather than have your users tinker with their routers, it's better to use some techniques that should get you through a vaguely sane router with no extra configuration. You need to use all of these to get the best compatibility.
First, ICE library here is a NAT traversal technique that works with STUN and/or TURN servers out in the network. You may have to provide some infrastructure for this to work, although there are some public STUN servers.
Second, use both UPnP and NAT-PMP. One library here, for example.
Third, use IPv6. Teredo, which is one way of running IPv6 over IPv4, often works when none of the above do, and who knows, your users may have working IPv6 by some other means. Very little code to implement this, and increasingly important. I find about half of Bittorrent data arrives over IPv6, for example.