Architectural Suggestions in a Linux App

I've done quite a bit of programming on Windows but now I have to write my first Linux app.
I need to talk to a hardware device using UDP. I have to send 60 packets a second, each 40 bytes in size. If I send fewer than 60 packets within 1 second, bad things will happen.
The data for the packets may take a while to generate. But if the data isn't ready to send out on the wire, it's ok to send the same data that was sent out last time.
The computer is a command-line only setup and will only run this program.
I don't know much about Linux so I was hoping to get a general idea how you might set up an app to meet these requirements.
I was hoping for an answer like:
Make 2 threads, one for sending packets and the other for the calculations.
But I'm not sure it's that simple (maybe it is). Maybe it would be more reliable to make some sort of daemon that just sent out packets from shared memory or something and then have another app do the calculations? If it is some multiple process solution, what communication mechanism would you recommend?
Is there some way I can give my app more priority than normal or something similar?
PS: The more bulletproof the better!

I've done a similar project: a simple piece of software on an embedded Linux computer, sending out CAN messages at a regular rate.
I would go for the two threads approach. Give the sending thread a slightly higher priority, and make it send out the same data block once again if the other thread is slow in computing those blocks.
60 UDP packets per second is pretty relaxed on most systems (including embedded ones), so I would not spend much sweat on optimizing the sharing of the data between the threads and the sending of the packets.
In fact, I would say: keep it simple! If you really are the only app in the system, and you have reasonable control over that system, you have nothing to gain from a complex IPC scheme and other tricks. Keeping it simple will help you produce better code with fewer defects and in less time, which actually means more time for testing.

Two threads as you've suggested would work. If you have a pipe() between them, then your calculating thread can provide packets as they are generated, while your comms thread uses select() to see if there is any new data. If not, then it just sends the last one from its cache.
I may have oversimplified the issue a little...
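A minimal sketch of the comms-thread side of that pipe()/select() idea, assuming POSIX and a producer that writes each 40-byte packet with one write() call. The names Packet and refresh_from_pipe are made up for illustration, not part of any library.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <sys/select.h>
#include <unistd.h>

// Hypothetical 40-byte payload, matching the packet size in the question.
using Packet = std::array<std::uint8_t, 40>;

// Pull every complete packet waiting in the pipe into `cache`, so the cache
// ends up holding the newest one. Returns true if the cache changed.
// Assumption: the producer writes whole packets with a single write() each,
// which a pipe delivers atomically because 40 bytes is far below PIPE_BUF.
bool refresh_from_pipe(int read_fd, Packet& cache) {
    assert(read_fd >= 0 && read_fd < FD_SETSIZE);
    bool updated = false;
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(read_fd, &readable);
        timeval no_wait = {0, 0};  // zero timeout: check and return at once
        if (select(read_fd + 1, &readable, nullptr, nullptr, &no_wait) <= 0) {
            return updated;  // nothing pending (or error): keep the old packet
        }
        Packet incoming;
        ssize_t n = read(read_fd, incoming.data(), incoming.size());
        if (n != static_cast<ssize_t>(incoming.size())) {
            return updated;  // EOF, error, or short read: ignore it
        }
        cache = incoming;
        updated = true;
    }
}
```

The comms thread would call refresh_from_pipe() once per tick and then send whatever is in the cache, so a slow producer simply results in the previous packet being repeated.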

The suggestion to use a pair of threads sounds like it will do the trick, as long as the burden of performing the calculations is not too great.
Instead of using the pipe() as suggested by Cogsy, I would be inclined to use a mutex to lock a chunk of memory that you use to contain the output of your calculation thread - using it as a transfer area between the threads.
When your calculation thread is ready to output to the buffer it would grab the mutex, write to the transfer buffer and release the mutex.
When your transmit thread is ready to send a packet, it would "try" to lock the mutex.
If it gets the lock, take a copy of the transfer buffer and send it.
If it doesn't get the lock, send the last copy.
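The transfer-buffer scheme described above could look roughly like this with std::mutex. This is only a sketch: Packet and TransferBuffer are illustrative names, and the 40-byte size is taken from the question.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical 40-byte payload, matching the packet size in the question.
using Packet = std::array<std::uint8_t, 40>;

class TransferBuffer {
public:
    // Calculation thread: publish a freshly computed packet.
    void publish(const Packet& p) {
        std::lock_guard<std::mutex> lock(mutex_);
        shared_ = p;
    }

    // Transmit thread: refresh the private copy if the lock is free,
    // otherwise fall back to whatever was sent last time.
    const Packet& latest() {
        std::unique_lock<std::mutex> lock(mutex_, std::try_to_lock);
        if (lock.owns_lock()) {
            last_sent_ = shared_;
        }
        return last_sent_;
    }

private:
    std::mutex mutex_;
    Packet shared_{};     // written by the calculation thread
    Packet last_sent_{};  // touched only by the transmit thread
};
```

Because the transmit thread never waits for the mutex, a slow or stalled calculation thread cannot delay a send; the worst case is that the previous packet goes out again.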
You can control the priority of your process by using "nice" and specifying a negative adjustment figure to give it higher priority. Note that you will need to do this as superuser (either as root, or using 'sudo') to be able to specify negative values.
edit: Forgot to add - this is a good tutorial on pthreads on linux. Also describes the use of mutexes.

I didn't quite understand how hard your 60 packets/sec requirement is. Does a burst of 60 packets once per second fulfil the requirement? Or is a sharp 1/60 second interval between packets required?
This might be a bit off topic, but another important issue is how you configure the Linux box. I would myself use a real-time Linux kernel and disable all unneeded services. Otherwise there is a real risk that your application misses a packet at some time, regardless of what architecture you choose.
Anyway, two threads should work well.

I posted this answer to illustrate a quite different approach to the "obvious" one, in the hope that someone discovers it to be exactly what they need. I didn't expect it to be selected as the best answer! Treat this solution with caution, because there are potential dangers and concurrency issues...
You can use the setitimer() system call to have a SIGALRM (alarm signal) sent to your program after a specified interval (given in seconds and microseconds). Signals are asynchronous events (a bit like messages) that interrupt the executing program to let a signal handler run.
A set of default signal handlers is installed by the OS when your program begins, but you can install a custom signal handler using sigaction().
So all you need is a single thread; use global variables so that the signal handler can access the necessary information and send off a new packet or repeat the last packet as appropriate.
Here's an example for your benefit:
#include <stdio.h>
#include <signal.h>
#include <sys/time.h>

//Written by main() and read by the handler, hence volatile sig_atomic_t
volatile sig_atomic_t ticker = 0;

//printf() is not async-signal-safe; it is tolerable in this demo only
//because the interrupted main loop never calls stdio itself
void timerTick(int dummy)
{
    (void)dummy;
    printf("The value of ticker is: %d\n", (int)ticker);
}

int main()
{
    int i;
    struct sigaction action;
    struct itimerval time;

    //Here is where we specify the SIGALRM handler
    action.sa_handler = &timerTick;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;

    //Register the handler for SIGALRM
    sigaction(SIGALRM, &action, NULL);

    //For 60 ticks per second, use tv_sec = 0 and tv_usec = 16667 instead
    time.it_interval.tv_sec = 1;  //Timing interval in seconds
    time.it_interval.tv_usec = 0; //and microseconds
    time.it_value.tv_sec = 0;     //Initial timer value in seconds
    time.it_value.tv_usec = 1;    //and microseconds

    //Set off the timer
    setitimer(ITIMER_REAL, &time, NULL);

    //Be busy
    while(1)
        for(ticker = 0; ticker < 1000; ticker++)
            for(i = 0; i < 60000000; i++)
                ;
}

Two threads would work. You will need to make sure you lock your shared data structure so the sending thread doesn't see it halfway through an update.
60 per second doesn't sound too tricky.
If you are really concerned about scheduling, set the sending thread's scheduling policy to SCHED_FIFO and mlockall() its memory. That way, nothing will be able to stop it sending a packet (they could still go out late though if other things are being sent on the wire at the same time)
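A rough sketch of that SCHED_FIFO plus mlockall() setup, meant to be called from the sending thread itself. It assumes Linux, where sched_setscheduler() with pid 0 applies to the calling thread. The function name make_sender_realtime is made up for illustration, and both calls normally need root or the matching capabilities.

```cpp
#include <cassert>
#include <cerrno>
#include <sched.h>
#include <sys/mman.h>

// Returns 0 on success or an errno value on failure (typically EPERM when
// the process is not privileged enough).
int make_sender_realtime(int priority) {
    assert(priority >= sched_get_priority_min(SCHED_FIFO) &&
           priority <= sched_get_priority_max(SCHED_FIFO));

    // Pin current and future pages in RAM so the sender never waits on a
    // page fault.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        return errno;
    }

    // Switch the calling thread to the real-time FIFO policy.
    sched_param param{};
    param.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        return errno;
    }
    return 0;
}
```

Checking the return value matters here: if the call fails with EPERM the program still runs, just without the real-time guarantees, so it is worth logging that loudly at startup.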
There has to be some tolerance of the device - 60 packets per second is fine, but what is the device's tolerance? 20 per second? If the device will fail if it doesn't receive one, I'd send them at three times the rate it requires.

I would stay away from threads and use processes and (maybe) signals and files. Since you say "bad things" may happen if you don't send, you need to avoid lock ups and race conditions. And that is easier to do with separate processes and data saved to files.
Something along the line of one process saving data to a file, then renaming it and starting anew. And the other process picking up the current file and sending its contents once per second.
Unlike Windows, you can copy (move) over the file while it's open.
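The write-then-rename step could be sketched as below. publish_file and the ".tmp" suffix are illustrative choices, and the sketch relies on POSIX rename() replacing the target atomically, so the reading process sees either the old file or the new one, never a half-written one.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Producer side: write the new data to a temporary file, then rename it
// over the file the sender reads.
bool publish_file(const std::string& path, const std::string& data) {
    assert(!path.empty());
    const std::string tmp = path + ".tmp";  // hypothetical naming scheme
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        out.close();
        if (!out) {
            return false;  // open, write, or flush failed
        }
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

Both files need to live on the same filesystem for the rename to be a simple atomic replace.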

Follow long-time Unix best practices: keep it simple and modular, decouple the actions, and let the OS do as much work for you as possible.
Many of the answers here are on the right track, but I think they can be even simpler:
Use two separate processes, one to create the data and write it to stdout, and one to read data from stdin and send it. Let the basic I/O libraries handle the data stream buffering between processes, and let the OS deal with the thread management.
Build the basic sender first using a timer loop and a buffer of bogus data and get it sending to the device at the right frequency.
Next make the sender read data from stdin - you can redirect data from a file, e.g. "sender < textdata"
Build the data producer next and pipe its output to the sender, e.g. "producer | sender".
Now you have the ability to create new producers as necessary without messing with the sender side. This answer assumes one-way communication.
Keeping the answer as simple as possible will get you more success, especially if you aren't very fluent in Linux/Unix based systems yet. This is a great opportunity to learn a new system, but don't overdo it. It is easy to jump to complex answers when the tools are available, but why use a bulldozer when a simple trowel is plenty? Mutexes, semaphores, shared memory, etc. are all useful and available, but add complexity that you may not really need.
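For the "timer loop" in the basic sender, one possible shape is a loop that sleeps until absolute deadlines spaced 1/60 s apart. This is a sketch only: run_sender and the ticks parameter are invented for illustration (a real sender would loop forever), and send stands in for whatever actually writes the UDP packet.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Calls send(i) once per 1/60 s tick, `ticks` times.
template <typename SendFn>
void run_sender(int ticks, SendFn send) {
    assert(ticks >= 0);
    using Clock = std::chrono::steady_clock;
    const Clock::duration period = std::chrono::duration_cast<Clock::duration>(
        std::chrono::nanoseconds(1000000000LL / 60));
    Clock::time_point deadline = Clock::now();
    for (int i = 0; i < ticks; ++i) {
        send(i);
        // Advance an absolute deadline instead of sleeping for a fixed
        // amount, so the time spent inside send() does not add up as drift.
        deadline += period;
        std::this_thread::sleep_until(deadline);
    }
}
```

If one iteration overruns, the next sleep_until() returns immediately, so the loop catches up rather than permanently falling behind.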

I agree with the two thread approach. I would also have two static buffers and a shared enum. The sending thread should have this logic:
loop
    wait for timer
    grab mutex
    check enum {0, 1}
    send buffer 0 or 1 based on enum
    release mutex
end loop
The other thread would have this logic:
loop
    check enum
    choose buffer 1 or 0 based on enum (opposite of other thread)
    generate data
    grab mutex
    flip enum
    release mutex
end loop
This way the sender always has a valid buffer for the entire time it is sending data. Only the generator thread can change the buffer pointer and it can only do that if a send is not in progress. Additionally, the enum flip should never take so many cycles as to delay the higher priority sender thread for very long.
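The two loops above could be backed by something like the following. It is a sketch, not a drop-in implementation: DoubleBuffer and its method names are invented here, and an int index plays the role of the enum.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical 40-byte payload, matching the packet size in the question.
using Packet = std::array<std::uint8_t, 40>;

class DoubleBuffer {
public:
    // Sender thread: `send` runs while the mutex is held, so the generator
    // cannot flip the index in the middle of a send.
    template <typename SendFn>
    void send_current(SendFn send) {
        std::lock_guard<std::mutex> lock(mutex_);
        send(buffers_[ready_]);
    }

    // Generator thread: the buffer the sender is NOT using. It can be
    // filled without the lock because only the generator ever changes
    // ready_.
    Packet& scratch() { return buffers_[1 - ready_]; }

    // Generator thread: make the scratch buffer the one that gets sent.
    void publish() {
        std::lock_guard<std::mutex> lock(mutex_);
        ready_ = 1 - ready_;
    }

private:
    std::mutex mutex_;
    std::array<Packet, 2> buffers_{};
    int ready_ = 0;  // index of the buffer the sender should transmit
};
```

The generator holds the lock only for the index flip, so the higher-priority sender is never blocked for longer than that.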

Thanks everyone, I will be using everyone's advice. I wish I could select more answers than 1!
For those who are curious: I don't have source for the device; it's a proprietary, locked-down system. I haven't done enough testing yet to see how picky it is about the 60 packets a second. All their limited docs say is "60 packets a second". Due to the nature of the device, though, bursts of packets will be a bad thing. I think I will be able to get away with sending more than 60 a second to make up for the occasional missed packet.

Related

Using IOCP with UDP?

I'm pretty familiar with what Input/Output Completion Ports are for when it comes to TCP.
But what if I am, for example, coding an FPS game, or anything where the need for low latency can be a deal breaker? I want an immediate response for the player to provide the best playing experience, even at the cost of losing some spatial data along the way. It seems obvious that I should use UDP and, aside from sending coordinate updates frequently, I should also implement some kind of semi-reliable protocol (AFAIK TCP induces packet loss in UDP, so we should avoid mixing the two) to handle events like chat messages or gunshots, where packet loss may be critical.
Let's say I'm aiming at the kind of performance needed by an MMOFPS game that lets hundreds of players meet in one persistent world and, aside from fighting with guns, lets them communicate through chat messages etc. Something like this actually exists and works well: check out PlanetSide 2.
Many articles on the net (e.g. those from MSDN) say overlapped sockets are the best and IOCP is a god-tier concept, but they don't seem to distinguish the cases where we use protocols other than TCP.
So there is almost no reliable information about the I/O techniques used when developing such a server. I've looked at this, but the topic seems to be highly controversial. I've also seen this, but considering the discussions in the first link, I don't know whether I should follow the assumptions of the second one, whether I should use IOCP with UDP at all, and, if not, what the most scalable and efficient I/O concept is when it comes to UDP.
Or maybe I am just making another premature optimization and no thinking ahead is required for the moment?
Thought about posting it on gamedev.stackexchange.com, but this question better applies to general-purpose networking I think.

I do not recommend using this, but technically the most efficient way to receive UDP datagrams would be to just block in recvfrom (or WSARecvFrom if you will). Of course, you'll need a dedicated thread for that, or not much will happen otherwise while you block.
Unlike with TCP, you do not have a connection built into the protocol, and you do not have a stream without defined borders. That means you get the sender's address with every datagram that comes in, and you get a whole message or nothing. Always. No exceptions.
Now, blocking on recvfrom means one context switch to the kernel, and one context switch back when something was received. It won't go any faster by having several overlapped reads in flight either, because only one datagram can arrive on the wire at the same time, which is by far the most limiting factor (CPU time is not the bottleneck!). Using an IOCP means at least 4 context switches, two for the receive and two for the notification. Alternatively, an overlapped receive with completion callback is not much better either, because you must NtTestAlert or SleepEx to run the APC queue, so again you have at least 2 extra context switches (though, it's only +2 for all notifications together, and you might incidentally already sleep anyway).
However:
Using an IOCP and overlapped reads is nevertheless the best way to do it, even if it is not the most efficient one. Completion ports are not tied to TCP; they work just fine with UDP, too. As long as you use an overlapped read, it does not matter what protocol you use (or even whether it's network or disk, or some other waitable or alertable kernel object).
It also does not really matter for either latency or CPU load whether you burn a few hundred cycles extra for the completion port. We're talking about "nano" versus "milli" here, a factor of one to one million. On the other hand, completion ports are overall a very comfortable, sound, and efficient system.
You can for example trivially implement logic for resending when you did not receive an ACK in time (which you must do when a form of reliability is desired, UDP does not do it for you), as well as keepalive.
For keepalive, add a waitable timer (maybe firing after 15 or 20 seconds) that you reset every time you receive anything. If your completion port ever tells you that this timer went off, you know the connection is dead.
For resends, you could e.g. set a timeout on GetQueuedCompletionStatus, and every time you wake up find all packets that are more than so-and-so old and have not been ACKed yet.
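To make the "find all packets that are more than so-and-so old and have not been ACKed yet" step concrete, here is a small bookkeeping sketch in portable C++. UnackedTracker and its method names are made up for illustration; it only tracks sequence numbers and send times and leaves the actual resend to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <vector>

class UnackedTracker {
public:
    using Clock = std::chrono::steady_clock;

    // Record (or refresh) the send time of a datagram.
    void on_send(std::uint32_t seq, Clock::time_point now) {
        sent_at_[seq] = now;
    }

    // The peer acknowledged this datagram: stop tracking it.
    void on_ack(std::uint32_t seq) { sent_at_.erase(seq); }

    // Sequence numbers sent at least `timeout` ago and still unacknowledged.
    std::vector<std::uint32_t> overdue(Clock::time_point now,
                                       Clock::duration timeout) const {
        std::vector<std::uint32_t> result;
        for (const auto& entry : sent_at_) {
            if (now - entry.second >= timeout) {
                result.push_back(entry.first);
            }
        }
        return result;
    }

private:
    std::map<std::uint32_t, Clock::time_point> sent_at_;
};
```

On each wake-up caused by the timeout, the loop would call overdue(), resend those datagrams, and call on_send() again for each of them to restart their clocks.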
The entire logic happens in one place, which is very nice. It's versatile, efficient, and hard to do wrong.
You can even have several threads (and, indeed, more threads than your CPU has cores) block on the completion port. Many threads sounds like an unwise design, but it is in fact the best thing to do.
A completion port wakes up to N threads in last-in-first-out order, N being the number of cores unless you tell it to do something different. If any of these threads block, another one is woken to handle outstanding events. This means that in the worst case, an extra thread may be running for a short time, but this is tolerable. In the average case, it keeps processor usage close to 100% as long as there is some work to do and zero otherwise, which is very nice. LIFO waking is favourable for processor caches and keeps switching thread contexts low.
This means you can block and wait for an incoming datagram and handle it (decrypt, decompress, perform logic, read something from disk, whatever) and another thread will be immediately ready to handle the next datagram that might come in the next microsecond. You can use overlapped disk IO with the same completion port, too. If you have compute work (such as AI) to do that can be split into tasks, you can manually post (PostQueuedCompletionStatus) those on the completion port as well and you have a parallel task scheduler for free. All you have to do is wrap an OVERLAPPED into a structure that has some extra data after it, and use a key that you will recognize. No worrying about thread synchronization, it just magically works (you don't even strictly need to have an OVERLAPPED in your custom structure when posting your own notifications, it will work with any structure you pass, but I don't like lying to the operating system, you never know...).
It does not even matter much whether you block, for example when reading from disk. Sometimes this just happens and you can't help it. So what, one thread blocks, but your system still receives messages and reacts to them! The completion port automatically pulls another thread from its pool when necessary.
About TCP inducing packet loss on UDP, this is something that I am inclined to call an urban myth (although it is somewhat correct). The way this common mantra is worded is however misleading. It may have been true once upon a time (there exists research on that matter, which is, however, close to a decade old) that routers would drop UDP in favour of TCP, thereby inducing packet loss. That is, however, certainly not the case nowadays.
A more truthful point of view is that anything you send induces packet loss. TCP induces packet loss on TCP and UDP induces packet loss on TCP and vice versa, this is a normal condition (it's how TCP implements congestion control, by the way). A router will generally forward one incoming packet if the cable on the other plug is "silent", it will queue a few packets with a hard deadline (buffers are often deliberately small), optionally it may apply some form of QoS, and it will simply and silently drop everything else.
A lot of applications with rather harsh realtime requirements (VoIP, video streaming, you name it) nowadays use UDP, and while they cope well with a lost packet or two, they do not at all like significant, recurring packet loss. Still, they demonstrably work fine on networks that have a lot of TCP traffic. My phone (like the phones of millions of people) works exclusively over VoIP, data going over the same router as internet traffic. There is no way I can provoke a dropout with TCP, no matter how hard I try.
From that everyday observation, one can tell for certain that UDP is definitely not dropped in favour of TCP. If anything, QoS might favour UDP over TCP, but it most certainly doesn't penalize it.
Otherwise, services like VoIP would stutter as soon as you open a website and be unavailable altogether if you download something the size of a DVD ISO file.
EDIT:
To give somewhat of an idea of how simple life with IOCP can be (somewhat stripped down, utility functions missing):
for(;;)
{
    if(GetQueuedCompletionStatus(iocp, &n, &k, (OVERLAPPED**)&o, 100) == 0)
    {
        if(o == 0) // ---> timeout, mark and sweep
        {
            CheckAndResendMarkedDgrams(); // resend those from last pass
            MarkUnackedDgrams();          // mark new ones
        }
        else
        {   // zero return value but lpOverlapped is not null:
            // this means an error occurred
            HandleError(k, o);
        }
        continue;
    }
    if(n == 0 && k == 0 && o == 0)
    {
        // zero size and zero handle is my termination message
        // re-post, then break, so all threads on the IOCP will
        // one by one wake up and exit in a controlled manner
        PostQueuedCompletionStatus(iocp, 0, 0, 0);
        break;
    }
    else if(n == -1) // my magic value for "execute user task"
    {
        TaskStruct *t = (TaskStruct*)o;
        t->funcptr(t->arg);
    }
    else
    {
        /* received data or finished file I/O, do whatever you do */
    }
}
Note how the entire logic for handling completion messages, user tasks, and thread control happens in one simple loop: no obscure stuff, no complicated paths. Every thread executes this same, identical loop.
The same code works for 1 thread serving 1 socket, or for 16 threads out of a pool of 50 serving 5,000 sockets, 10 overlapped file transfers, and executing parallel computations.

I've seen the code for many FPS games that use UDP as the networking protocol.
The standard solution is to send all the data you need to update a single game frame in one large UDP packet. That packet should include a frame number and a checksum. The packet should of course be compressed.
Generally the UDP packet contains the positions and velocities for every entity near the player, any chat messages that were sent, and all recent state changes (e.g. new entity created, entity destroyed, etc.).
Then the client listens for UDP packets. It will use only the packet with the highest frame number. So if out of order packets appear, the older packets are simply ignored.
Any packets with wrong checksums are also ignored.
Each packet should contain all the information to synchronize the client's game state with the server.
Chat messages get sent repeatedly over several packets, and each message has a unique message ID. For example, you retransmit the same chat message for, say, a full second's worth of frames. If a client misses a chat message after being sent it 60 times, then the quality of the network channel is just too low to play the game. Clients will display any messages they get in a UDP packet that have a message ID they have not yet displayed.
Similarly for objects being created or destroyed. All created or destroyed objects have a unique object Id set by the server. Objects get created or destroyed if the object id they correspond to has not been acted on before.
So the key here is to send data redundantly, and key all state transitions to unique id's set by the server.
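On the client side, the "highest frame number wins" and "act on each ID once" rules might be sketched like this. The structs and the SnapshotFilter name are invented for illustration; checksum validation is assumed to have happened before accept() is called, and frame-number wraparound is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical decoded packet contents.
struct ChatMessage {
    std::uint32_t id;
    std::string text;
};
struct Snapshot {
    std::uint32_t frame;
    std::vector<ChatMessage> chat;
};

class SnapshotFilter {
public:
    // Returns the chat messages to display for the first time. A snapshot
    // that is not newer than the last accepted one is ignored entirely.
    std::vector<std::string> accept(const Snapshot& s) {
        std::vector<std::string> fresh;
        if (have_frame_ && s.frame <= last_frame_) {
            return fresh;  // out-of-order or duplicate packet
        }
        have_frame_ = true;
        last_frame_ = s.frame;
        for (const auto& m : s.chat) {
            // insert().second is true only the first time an ID is seen
            if (seen_ids_.insert(m.id).second) {
                fresh.push_back(m.text);
            }
        }
        return fresh;
    }

    std::uint32_t last_frame() const { return last_frame_; }

private:
    bool have_frame_ = false;
    std::uint32_t last_frame_ = 0;
    std::unordered_set<std::uint32_t> seen_ids_;
};
```

Object creation and destruction events could be handled the same way, with a second set keyed on the server-assigned object ID.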
Edit: Another poster mentioned that for chat messages you might want to use a different protocol on a different port, and they may be right about that being optimal. That is, for message types where latency is not critical but reliability is more important, you might want to open up a different port and use TCP. But I'd leave that as a later exercise. It is certainly easier and cleaner at first for your game to use just one channel, and to figure out the vagaries of multiple ports and multiple channels, with their various failure modes, later (e.g. what happens if the UDP channel is working, but the chat channel goes down? What if you succeed in opening one port and not the other?).

When I did this for a client we used ENet as the base reliable UDP protocol and re-implemented this from scratch to use IOCP for the server side whilst using the freely available ENet code for the client side.
IOCP works fine with UDP and integrates nicely with any TCP connections that you might also be handling (we have TCP, WebSocket or UDP client connections in and TCP connections between server nodes and being able to plug all of these into the same thread pool if we want is handy).
If absolute latency and UDP packet processing speed are most important (and it's unlikely they really are), then using the new Server 2012 RIO API might be worth it, but I'm not convinced yet (see here for some preliminary performance tests and some example servers).
You probably want to look at using GetQueuedCompletionStatusEx() for dealing with your inbound data as it reduces the context switches per datagram as you can pull multiple datagrams back with a single call.

A couple things:
1) As a general rule if you need reliability, you are best off just using TCP. A competitive and perhaps even superior solution on top of UDP is possible, but it is extremely difficult to get right and have it perform properly. The main thing people implementing reliability on top of UDP don't bother with is proper flow control. You must have flow control if you intend to send large amounts of data and want it to gracefully take advantage of the bandwidth that is available at the moment (which changes continuously with route conditions). In practice, implementing anything other than essentially the same algorithm TCP uses is likely to be unfriendly to other protocols on the network as well. It's unlikely you will do a better job at implementing that algorithm than TCP does.
2) As for running TCP and UDP in parallel, it is not as huge a concern these days, as others have noted. At one time I heard that overloaded routers along the way preferentially dropped UDP packets before TCP packets, which makes sense in some ways, since a dropped TCP packet will just be resent anyway, and a lost UDP packet often isn't. That said, I am skeptical that this actually happens. In particular, dropping a TCP packet will cause the sender to throttle back, so it may make more sense to drop the TCP packet.
The one case where TCP may interfere with UDP is that TCP, by the nature of its algorithm, is continuously trying to go faster and faster until it reaches a point where it loses packets; then it throttles back and repeats the process. As the TCP connection continuously bumps against that bandwidth ceiling, it is just as likely to cause UDP loss as TCP loss, which in theory would appear as if the TCP traffic was sporadically causing UDP loss.
However, this is a problem you will run into even if you put your own reliable mechanism on top of UDP (assuming you do flow control properly). If you wanted to avoid this condition, you could intentionally throttle the reliable data at the application layer. Typically in a game the reliable data rate is limited to the rate at which the client or server actually needs to send reliable data, which is often well below the bandwidth capabilities of the pipe, and thus the interference never occurs, regardless of whether it is TCP or UDP-reliable based.
Where things get a bit more difficult is if you are making a game that streams its assets. For a game like FreeRealms which does this, the assets are downloaded from a CDN via HTTP/TCP and the download will attempt to use all available bandwidth, which will increase packet loss on the main game channel (which is typically UDP). I have generally found the interference low enough that I don't think you should be worrying about it too much.
3) As for IOCP, my experience with them is very limited, but having done extensive game networking in the past, I am skeptical that they add value in the case of UDP. Typically the server will have a single UDP socket that is handling all incoming data. With hundreds of users connected, the rate at which the data is coming into the server is very high. Having a background thread doing a blocking call on the socket as others have suggested and then quickly moving the data into a queue for the main application thread to pick up is a reasonable solution, but somewhat unnecessary, since in practice the data is coming in so fast when under load that there is not much point in ever sleeping the thread when it blocks.
Let me put this another way: if the blocking socket call polled a single packet and then put the thread to sleep until the next packet came in, it would be context-switching to that thread thousands of times per second when the data rate got high. Either that, or by the time the unblocked thread executed and cleared the data, there would already be additional data ready to be processed as well. Instead, I prefer to put the socket in non-blocking mode and then have a background thread spin at around 100fps processing it (sleeping between polls as needed to achieve the frame rate). In this manner, the socket buffer will build up incoming packets for 10ms and then the background thread will wake up once and process all that data in bulk, then go back to sleep, thus preventing gratuitous context switches. I then have that same background thread do other send-related processing when it wakes up as well. Being entirely event-driven loses many of its benefits when the data volume gets the least bit high.
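The "wake up, drain everything, go back to sleep" step might look like the sketch below. It assumes POSIX sockets and uses MSG_DONTWAIT on each call instead of switching the whole socket to non-blocking mode; drain_socket and the 2048-byte buffer size are illustrative choices.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <vector>

// Collect every datagram currently queued on the socket without blocking.
std::vector<std::string> drain_socket(int fd) {
    assert(fd >= 0);
    std::vector<std::string> datagrams;
    char buf[2048];  // assumed upper bound on datagram size
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, MSG_DONTWAIT);
        if (n < 0) {
            if (errno == EINTR) {
                continue;  // interrupted by a signal: retry
            }
            break;  // EAGAIN/EWOULDBLOCK means the queue is empty
        }
        datagrams.emplace_back(buf, static_cast<std::size_t>(n));
    }
    return datagrams;
}
```

The background thread would call drain_socket() roughly every 10 ms, hand the batch to the per-connection logic, and sleep until the next tick. A server that needs the sender's address would use recvfrom() in the same loop.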
In the case of TCP, the story is quite different, since you need an efficient mechanism to figure out which of hundreds of connections the incoming data is coming from, and polling them all is very slow, even on a periodic basis.
So, in the case of UDP with a home-grown UDP-reliable mechanism on top of it, I typically have a background thread playing the same role that the OS plays: whereas the OS gets the data from the network card and then distributes it to various logical TCP connections internally for processing, my background thread gets the data from the solitary UDP socket (via periodic polling) and distributes it to my own internal logical connection objects for processing. Those internal logical connections then put the application-level packet data into a thread-safe master queue, flagged with the logical connection it came from. The main application thread then processes that master queue, routing the packets directly to the game-level objects associated with that connection. From the main application thread's point of view, it simply has an event-driven queue it is processing.
The bottom line is that, given that the poll call to the solitary UDP socket rarely comes up empty, it is difficult to imagine there is going to be a more efficient way to solve this problem. The only thing you lose with this method is that you wait up to 10ms to wake up when in theory you could be waking up the instant the data first arrived, but that is only meaningful if you were under extremely light load anyway. Plus, the main application thread isn't going to be making use of the data until its next frame cycle anyway, so the difference is moot, and I think the overall system performance is enhanced by this technique.

I wouldn't hold a game as old as PlanetSide up as a paragon of modern network implementation. Especially not having seen the insides of their networking library. :)
Different types of communication require different methodologies. One of the answers above talks around the differences between frame/position updates and chat messages, without recognizing that using the same transport for both is probably silly. You should most definitely use a connected TCP socket between your chat implementation and the chat server, for text-style chat. Don't argue, just do it.
So, for your game client doing updates via arriving UDP packets, the most efficient path from the network adapter through the kernel and into your application is (most likely) going to be a blocking recv. Create a thread that rips packets off the network, verifies their validity (checksum match, sequence number increasing, whatever other checks you have), deserializes the data into an internal object, then queues the object on an internal queue to the application thread that handles those sorts of updates.
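The internal queue between the receiving thread and the application thread could be as small as this sketch built on a mutex and a condition variable. UpdateQueue is an invented name, and T stands for whatever deserialized object type the game uses.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class UpdateQueue {
public:
    // Receiver thread: hand over one deserialized object.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();  // wake the consumer if it is waiting
    }

    // Application thread: block until an object is available, then take
    // the oldest one.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};
```

A game loop that must not block would want a non-blocking try_pop() variant as well; that is left out here to keep the sketch short.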
But don't take my word for it: test it! Write a small program that can receive and deserialize 3 or 4 kinds of packets, using a blocking thread and a queue to deliver the objects, then re-write it using a single thread and IOCPs, with the deserialization and queueing in the completion routine. Pound enough packets through it to get the run time up in the minute range, and test which one is fastest. Make sure something (i.e. some thread) in your test app is consuming the objects off the queue so you get a full picture of the relative performance.
Post back here when you have the two test programs done, and let us know which worked out best, mm'kay? Which was fastest, which would you rather maintain in the future, which took the longest to get it working, etc.

If you want to support many simultaneous connections, you need to use an event-driven networking approach. I know of two good libraries: libev (used by nodeJS) and libevent. They are very portable and easy to use. I have successfully used libevent in an application supporting hundreds of parallel TCP/UDP(DNS) connections.
I believe using event-driven network I/O is not premature optimization in a server; it should be the default design pattern. If you want to do a quick prototype implementation, it may be better to start in a higher-level language. For JavaScript there is nodeJS and for Python there is Twisted. I can personally recommend both.

How about NodeJS?
It supports UDP and it is highly scalable.

Is there a way to communicate data between computers without while loops? C++

I have been struggling to find an answer to this on Google, as I don't know the exact terms to search for.
If someone were to build an MSN Messenger-like program, is it possible to have always-open connections and no while(true) loop? If so, could someone point me in the direction of how this is achieved?

Using the boost::asio library for socket handling, I think it is possible to define callbacks that run upon data reception.

The one single magic word you're looking for is asynchronous I/O. This can be achieved either by using asynchronous APIs (functions such as ReadThis() that return immediately and signal on success/failure, like but not limited to boost::asio) or by deferring blocking calls to different threads. Picking either method requires careful weighing of both the underlying implementation and the scale of your operations.

You want to use ACE. It has a Reactor pattern which will notify you when data is available to be used.
Reactor Pattern

You could have:
while(1) {
usleep(100 * 1000); // 100 ms (note: POSIX sleep() takes seconds, Windows Sleep() takes milliseconds)
// check if there is a message
// process message
//...
}
This is ok, but there is an overhead on servers running 10000s of threads since threads come out of sleep and check for a message, causing context-switching. Instead, operating systems provide functions like select and epoll on Linux, which allow a thread to wait on an event.
while(1) {
// wait for message
// process message
//...
}
Using wait, the thread is not "woken up" unless a message is received.
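On Linux, the "wait for message" step could be written with poll(), which puts the thread to sleep until the descriptor becomes readable. This is only a minimal sketch; wait_readable is a hypothetical helper name, and a real loop would also handle errors and interrupted calls.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Blocks until `fd` has data to read or `timeout_ms` elapses
// (pass -1 to wait indefinitely). Returns true only if data can be
// read; a timeout or a poll() error returns false.
bool wait_readable(int fd, int timeout_ms) {
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    int rc = poll(&p, 1, timeout_ms);
    return rc > 0 && (p.revents & POLLIN) != 0;
}
```

A message loop would then call wait_readable(fd, -1) at the top of each iteration and only read and process a message once it returns true, so the thread consumes no CPU while idle.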

You can only hide your while loop (or some kind of loop) somewhere buried in some library or restart the waiting for next IO in an event callback, but you aren't going to be able to completely avoid it.

That's a great question. Like nj said, you want to use asynchronous I/O. Too many programs use a polling strategy. It is not uncommon to have 1000 threads running on a system. If all of them were polling, you would have a slow system. Use asynchronous I/O whenever possible.

What about UDP communication? You don't have to wait in a while loop for every client.
Just open one socket on the specified port and call the receive method.

With a single file descriptor, Is there any performance difference between select, poll and epoll and ...?

The title really says it all.
The "and ..." means pselect and ppoll are included too.
The server project I'm working on is basically structured with multiple threads. Each thread handles one or more sessions. All the threads are identical. The protocol takes care of which thread will host the session.
I'm using an in-house socket class that wraps things up. The point of interest is a checkread call which calls either poll (Linux) or select (Windows).
In summary each thread currently calls poll on a single socket. From what I can tell, using epoll would only be of benefit if this thread was looking at multiple sockets such as what you'd get in say an HTTP server. That's not what I'm doing in my case. And the class only handles a single socket at a time.
There is some brief discussion about edge and level triggering in the man pages for epoll. I'm not really sure what it means. In the socket class I see an optimization in the windows part of the code that shortcuts the select call with an ioctlsocket & FIONREAD to check if there is any data. Wondering if that would return > 0 even if a complete UDP packet hadn't arrived at the time of the call. Is this what edge triggering is in epoll?
In some rudimentary testing, I'm also seeing no noticeable difference between using select and poll.
I can see that using ppoll might be of benefit though due to greater precision in the timeout. Any thoughts?
And yes, I am trying to optimize throughput for a session that is receiving lots of data. The server is more Network & Disk bound than CPU.

The main difference between epoll vs select or poll is that epoll scales a lot better when run in a single thread. I don't know how this would compare to using a multithreaded server using select or poll.
Look at this http://monkey.org/~provos/libevent/libevent-benchmark2.jpg
The reason for this (as far as I can tell) is that when you are using select or poll you must loop through all the connected sockets to determine which ones have data to be read. When you are using epoll, it keeps a separate list which contains references only to sockets which have data to be read. This saves you lots of loop cycles, and the difference becomes more and more noticeable the more sockets that are connected.
Another thing to look into if performance ever becomes a major issue is I/O completion ports (Windows only) and kqueue (the BSDs and macOS). It's also important to remember that epoll is Linux only. In most cases select or poll will work just fine.
In the case of a single file descriptor, select and poll are more efficient than epoll due to being much simpler. (epoll has some overhead which doesn't make itself useful with only a single socket)

According to this link: http://www.intelliproject.net/articles/showArticle/index/io_multiplexing
If you use only one descriptor:
select: 201 microseconds
poll: 159 microseconds
epoll: 176 microseconds
It seems poll is the better choice in this situation.

If you have only a single socket, what's the point of polling in the first place? Wouldn't the best performance then be by just using blocking read/write?
With regard to performance, with only a single file descriptor I don't think there is much, if any, difference between the various approaches. If you really care, I suppose you could measure, but I find it hard to believe that this would particularly matter for the overall performance of your program.
Level/edge triggering. Consider you're monitoring a signal, for simplicity say some voltage in a line. Edge triggering means that something triggers when the voltage goes over or under some specific limit. Level triggering means that something is considered to be in a triggered state as long as the voltage is over/under the limit. That is, edge triggering triggers when some event happens (crossing some threshold), level triggering reflects the state of some "thing" (in this case, voltage).
To get back to network programming, an edge-triggered system might be one where you get some kind of signal when a packet is received. If you don't handle the event then the signal is lost. A level-triggered system, OTOH, is something like asking "is there data waiting in the buffer for me?"; if you don't handle the event and ask again, the data will still be there waiting for you.
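The difference can be observed directly with epoll on Linux. The following is only a minimal sketch: count_reports is a hypothetical helper (not part of any library) that registers a descriptor, then asks epoll twice in a row without reading the pending data, and counts how often the descriptor is reported.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/epoll.h>
#include <unistd.h>

// Hypothetical helper: registers `fd` with a fresh epoll instance using
// `flags`, then calls epoll_wait twice with a zero timeout WITHOUT reading
// from `fd`. Returns how many of the two calls reported the descriptor,
// or -1 if setting up epoll failed.
int count_reports(int fd, std::uint32_t flags) {
    int ep = epoll_create1(0);
    if (ep < 0) return -1;

    epoll_event ev{};
    ev.events = flags;
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) != 0) {
        close(ep);
        return -1;
    }

    int reports = 0;
    for (int i = 0; i < 2; ++i) {
        epoll_event out{};
        if (epoll_wait(ep, &out, 1, 0) == 1) ++reports;
    }
    close(ep);
    return reports;
}
```

With unread data sitting on the descriptor, registering with EPOLLIN (level-triggered, the default) should report it on both calls, while EPOLLIN | EPOLLET (edge-triggered) should report it only once, until new data arrives.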

Multi-reader IPC solution?

I'm working on a framework in C++ (just for fun for now) that lets the user write plugins that use a standard API to stream data between each other. There are going to be three basic transport mechanisms for the data: files, sockets, and some kind of IPC piping system. The system is set up so that for the non-file transports, each stream can have multiple readers, i.e. once a server socket is set up, multiple computers can connect and stream the data. I'm a little stuck at the multi-reader IPC system though.
All my plugins run in threads (though I may want to go to a process-based system eventually), so they live in the same address space and some kind of shared memory system would work fine. I was thinking I'd write my own circular buffer with a write pointer and read pointers chasing it around the buffer, but I have my doubts that I can achieve the same performance as something like Linux pipes.
I'm curious what people would suggest for a multi-reader solution to something like this? Is the overhead for pipes or domain sockets low enough that I could just open a connection to each reader and issue separate writes to each reader? This is intended to be significant volumes of data (tens of mega-samples/sec), so performance is a must.

I develop a media server, and I usually use a single reader for a group of all active sockets of the same class. You can use the select() function (in blocking or non-blocking mode) for each group to find the sockets that have become ready to read. When socket data is ready or a new connection occurs, I just call a notify callback function to handle it.
Each reader (that controls a group of sockets) could be managed by a separate thread, avoiding your main threads to block while waiting for new connections or socket data.

If I understand the description correctly, it seems to me that using a circular queue as you mention would be a good IPC solution. I think it could scale very well and would ultimately be better than individual pipes or individual shared memory for each client. One (of several) of the issues of using a single queue/buffer for multiple clients is to synchronize access to the buffers. A client needs to be able to successfully read an entry in the queue without the server changing it. Here is a possible mechanism for implementing that.
This requires that the server know how many active clients there are. That, I assume, would be possible as long as the clients are doing some kind of registration/login with the server (almost certainly true if they are in-process but not necessarily true for out-of-process clients).
Suppose there are N clients. For this example, assume 100 active clients.
Maintain two counting semaphores for each entry in the circular queue. If using out-of-process clients, these need to be shared between processes. Call the semaphores SemReady and SemDone.
Use SemReady to indicate that the buffer is ready for clients to read. The server writes to the buffer entry and then sets the value of the semaphore to the number of clients (100 in this case). More on this in a bit.
When a client wants to read an entry in the queue, it waits on the associated SemReady semaphore. If the initial value is at 100, then all 100 clients can successfully get the semaphore and “concurrently” read the data.
When a client is done reading/using the entry, it increments/releases the SemDone semaphore.
When a server wants to write to a buffer entry, it needs to make sure of two things: a) no clients are currently reading it, and b) no clients start to read it once the server is writing to it.
Therefore, first, block any further access to the buffer by waiting on the SemReady semaphore until the count is zero (obviously, use a zero timeout). When it hits zero, the server knows that no additional clients will start reading it.
To know that clients are done with the buffer, the server uses the SemDone semaphore. It checks SemDone and waits until its value is N minus the number of waits it did on SemReady. In other words, if SemReady was at zero, then it means all clients read the buffer entry, therefore, SemDone should be at N (100) when they are done. If, though, the server waited 10 times on SemReady, then SemDone should be at 90 (N-10) when all clients are done.
The above step needs some kind of timeout and status check on client “liveness” in case a client crashes/quits after getting SemReady and before releasing SemDone. Also, it would need to account for the possibility of a new client registering during that step in order to keep the semaphore count values in sync.
Once the server has found no more clients are reading the buffer, it can reset SemDone to zero, write new data to the entry, and set SemReady to N (100).
Rinse and repeat.
Note 1 There are other synchronization issues to maintain the head/tail of the circular queue so that clients know where it is.
Note 2 SemDone could probably be an integer counter handled with atomic increments… I think it could anyway. Needs a bit of thought.
Note 3 It might make sense to have multiple threads in the server writing to the buffer entries. That way, if the server has to wait/timeout a bit on a crashed client that started reading but did not finish, it would not block subsequent queue entries that other clients might already be waiting for.
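The bookkeeping for one queue entry can be sketched as follows. This is only an illustration under stated assumptions: it is in-process, it uses std::atomic counters in place of real counting semaphores (so a "wait" becomes a non-blocking attempt), and the names Entry, publish, try_read, begin_reclaim and reclaim_finished are made up for this example. It leaves out the timeouts, liveness checks and head/tail management mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// One slot of the circular queue.
struct Entry {
    std::string data;
    std::atomic<int> ready{0};  // plays the role of SemReady: read permits left
    std::atomic<int> done{0};   // plays the role of SemDone: readers finished
};

// Server: write the entry, then hand out one read permit per client.
void publish(Entry& e, const std::string& data, int n_clients) {
    e.data = data;
    e.done.store(0);
    e.ready.store(n_clients);
}

// Client: take one permit if any is left, copy the data, signal completion.
// Returns false when no permit is available (non-blocking stand-in for a wait).
bool try_read(Entry& e, std::string& out) {
    int permits = e.ready.load();
    while (permits > 0) {
        // On failure, compare_exchange_weak reloads `permits` and we retry.
        if (e.ready.compare_exchange_weak(permits, permits - 1)) {
            out = e.data;
            e.done.fetch_add(1);
            return true;
        }
    }
    return false;
}

// Server: revoke all unused permits so no new reader can start. Returns the
// value `done` must reach before the entry may be overwritten
// (N minus the number of permits the server took back).
int begin_reclaim(Entry& e, int n_clients) {
    int unused = e.ready.exchange(0);
    return n_clients - unused;
}

// Server: true once every reader that obtained a permit has finished.
bool reclaim_finished(const Entry& e, int expected_done) {
    return e.done.load() == expected_done;
}
```

A server loop would call begin_reclaim, poll reclaim_finished (with a timeout in real code), and only then call publish again for that entry.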

Network Multithreading

I'm programming an online game for two reasons, one to familiarize myself with server/client requests in a realtime environment (as opposed to something like a typical web browser, which is not realtime) and to actually get my hands wet in that area, so I can proceed to actually properly design one.
Anywho, I'm doing this in C++, and I've been using winsock to handle my basic, basic network tests. I obviously want to use a framelimiter and have 3D going and all of that at some point, and my main issue is that when I do a send() or recv(), the program kindly idles there and waits for a response. That would lead to maybe 8 fps on even the best internet connection.
So the obvious solution to me is to take the networking code out of the main process and start it up in its own thread. Ideally, I would call a "send" in my main process which would pass the networking thread a pointer to the message, and then periodically (every frame) check to see if the networking thread had received the reply, or timed out, or what have you. In a perfect world, I would actually have 2 or more networking threads running simultaneously, so that I could say run a chat window and do a background download of a piece of armor and still allow the player to run around all at once.
The bulk of my problem is that this is a new thing to me. I understand the concept of threading, but I can see some serious issues, like what happens if two threads try to read/write the same memory address at the same time, etc. I know that there are already methods in place to handle this sort of thing, so I'm looking for suggestions on the best way to implement something like this. Basically, I need thread A to be able to start a process in thread B by sending a chunk of data, poll thread B's status, and then receive the reply, also as a chunk of data, ideally without any major crashing going on. ^_^ I'll worry about what that data actually contains and how to handle dropped packets, etc. later; I just need to get that happening first.
Thanks for any help/advice.
PS: Just thought about this, may make the question simpler. Is there a way to use the windows event handling system to my advantage? Like, would it be possible to have thread A initialize data somewhere, then trigger an event in thread B to have it pick up the data, and vice versa for thread B to tell thread A it was done? That would probably solve a lot of my problems, since I don't really need both threads to be able to work on the data at the same time, more of a baton pass really. I just don't know if this is possible between two different threads. (I know one thread can create its own messages for the event handler.)

The easiest thing
for you to do would be to simply invoke the Windows API QueueUserWorkItem. All you have to specify is the function that the thread will execute and the input passed to it. A thread pool will be automatically created for you and the jobs executed in it. New threads will be created as and when required.
http://msdn.microsoft.com/en-us/library/ms684957(VS.85).aspx
More Control
You could have a more detailed control using another set of API's which can again manage the thread pool for you -
http://msdn.microsoft.com/en-us/library/ms686980(VS.85).aspx
Do it yourself
If you want to control all aspects of your thread creation and the pool management, you would have to create the threads yourself, decide how they should end, how many to create, etc. (_beginthreadex is the API you should be using to create threads. If you use MFC, you should use the AfxBeginThread function.)
Send jobs to worker threads - Io completion Ports
In this case, you would also have to worry about how to communicate your jobs. I would recommend I/O completion ports for that. It is the most scalable notification mechanism that I currently know of made for this purpose. It has the additional advantage that it is implemented in the kernel, so you avoid all kinds of deadlock situations you would encounter if you decided to hand-roll something yourself.
This article will show you how with code samples -
http://blogs.msdn.com/larryosterman/archive/2004/03/29/101329.aspx
Communicate Back - Windows Messages
You could use Windows messages to communicate the status back to your parent thread since it is doing the message wait anyway. Use the PostMessage function to do this (and check for errors).
ps : You could also allocate the data that needs to be sent out on a dedicated pointer and then the worker thread could take care of deleting it after sending it out. That way you avoid the return pointer traffic too.

BlodBath's suggestion of non-blocking sockets is potentially the right approach.
If you're trying to avoid using a multithreaded approach, then you could investigate the use of setting up overlapped I/O on your sockets. They will not block when you do a transmit or receive, but have the added bonus of giving you the option of waiting for multiple events within your single event loop. When your transmit has finished, you will receive an event. (see this for some details)
This is not incompatible with a multithreaded approach, so there's the option of changing your mind later. ;-)
On the design of your multithreaded app, the best thing to do is to work out all of the external activities that you want to be alerted to. For example, so far in your question you've listed network transmits, network receives, and user activity.
Depending on the number of concurrent connections you're going to be dealing with you'll probably find it conceptually simpler to have a thread per socket (assuming small numbers of sockets), where each thread is responsible for all of the processing for that socket.
Then you can implement some form of messaging system between your threads as RC suggested.
Arrange your system so that when a message is sent to a particular thread, an event is also sent. Your threads can then be put to sleep waiting for one of those events (as well as any other stimulus, like socket events, user events, etc.).
You're quite right that you need to be careful of situations where more than one thread is trying to access the same piece of memory. Mutexes and semaphores are the things to use there.
Also be aware of the limitations that your GUI has when it comes to multithreading.
Some discussion on the subject can be found in this question.
But the abbreviated version is that most GUIs (and Windows is one of these) don't allow multiple threads to perform GUI operations simultaneously. To get around this problem you can make use of the message pump in your application, by sending custom messages to your GUI thread to get it to perform GUI operations.

I suggest looking into non-blocking sockets for the quick fix. Using non-blocking sockets send() and recv() do not block, and using the select() function you can get any waiting data every frame.
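As a rough illustration of that per-frame check, here is a minimal sketch using POSIX sockets (an assumption; the question uses winsock, where ioctlsocket with FIONBIO is the counterpart of the fcntl call below). The helper names set_nonblocking and poll_once are made up for this example.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Puts a descriptor into non-blocking mode so send()/recv() never stall.
bool set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    return flags >= 0 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
}

// Intended to be called once per frame. Uses select() with a zero timeout,
// so it returns immediately. Returns the number of bytes read, 0 if nothing
// was waiting, or -1 on error. (A real program would also need to tell
// "nothing waiting" apart from a closed connection or an empty datagram.)
ssize_t poll_once(int fd, char* buf, size_t len) {
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd, &readable);
    timeval no_wait{0, 0};

    int rc = select(fd + 1, &readable, nullptr, nullptr, &no_wait);
    if (rc < 0) return -1;
    if (rc == 0 || !FD_ISSET(fd, &readable)) return 0;
    return recv(fd, buf, len, 0);
}
```

The game loop would call poll_once at the top of each frame and carry on rendering whether or not anything arrived.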

See it as a producer-consumer problem: when receiving, your network communication thread is the producer whereas the UI thread is the consumer. When sending, it's just the opposite. Implement a simple buffer class which gives you methods like push and pop (pop should be blocking for the network thread and non-blocking for the UI thread).
Rather than using the Windows event system, I would prefer something that is more portable, for example Boost condition variables.
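A minimal sketch of such a buffer is shown below. It swaps in the standard library's std::mutex and std::condition_variable for the Boost equivalents mentioned above; the class name MessageBuffer and the try_pop method are illustrative choices, not an existing API.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class MessageBuffer {
public:
    // Called by the producer: enqueue an item and wake one waiting consumer.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        not_empty_.notify_one();
    }

    // Blocking pop, suited to the network thread: sleeps until an item exists.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

    // Non-blocking pop, suited to the UI thread: returns false if empty.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::deque<T> items_;
};
```

One buffer per direction keeps the roles clear: the UI thread pushes outgoing messages and calls try_pop on the incoming buffer once per frame, while the network thread blocks in pop on the outgoing buffer.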

I don't code games, but I've used a system similar to what pukku suggested. It lends nicely to doing things like having the buffer prioritize your messages to be processed if you have such a need.
I think of them as mailboxes per thread. You want to send a packet? Have the ProcessThread create a "thread message" with the payload to go on the wire and "send" it to the NetworkThread (i.e. push it on the NetworkThread's queue/mailbox and signal the condition variable of the NetworkThread so he'll wake up and pull it off). When the NetworkThread receives the response, package it up in a thread message and send it back to the ProcessThread in the same manner. Difference is the ProcessThread won't be blocked on a condition variable, just polling on mailbox.empty( ) when you want to check for the response.
You may want to push and pop directly, but a more convenient way for larger projects is to implement a toThreadName, fromThreadName scheme in a ThreadMsg base class, and a PostOffice that threads register their Mailbox with. The PostOffice then has a send(ThreadMsg*) function that pushes each message to the appropriate Mailbox based on the to and from fields. Mailbox (the buffer/queue class) contains a ThreadMsg* receiveMessage() function, which basically pops the next message off the underlying queue.
Depending on your needs, you could have ThreadMsg contain a virtual function process(..) that could be overridden accordingly in derived classes, or just have an ordinary ThreadMessage class with a to, from members and a getPayload( ) function to get back the raw data and deal with it directly in the ProcessThread.
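The scheme described above might look roughly like this. It is a sketch under assumptions: std::unique_ptr replaces the raw ThreadMsg* for ownership, the message is a plain struct rather than a base class with a virtual process(..), the method names deliver and registerMailbox are invented here, and the condition-variable signalling for a blocked NetworkThread is left out to keep it short.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// A message addressed from one named thread to another.
struct ThreadMsg {
    std::string to;
    std::string from;
    std::string payload;
};

// Per-thread queue of incoming messages.
class Mailbox {
public:
    void deliver(std::unique_ptr<ThreadMsg> msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(msg));
    }
    bool empty() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.empty();
    }
    // Pops the oldest message, or returns nullptr if the mailbox is empty.
    std::unique_ptr<ThreadMsg> receiveMessage() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return nullptr;
        std::unique_ptr<ThreadMsg> msg = std::move(queue_.front());
        queue_.pop();
        return msg;
    }
private:
    mutable std::mutex mutex_;
    std::queue<std::unique_ptr<ThreadMsg>> queue_;
};

// Routes messages to the mailbox registered under the recipient's name.
class PostOffice {
public:
    void registerMailbox(const std::string& name, Mailbox* box) {
        std::lock_guard<std::mutex> lock(mutex_);
        boxes_[name] = box;
    }
    // Returns false if no mailbox is registered for msg->to.
    bool send(std::unique_ptr<ThreadMsg> msg) {
        Mailbox* box = nullptr;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = boxes_.find(msg->to);
            if (it == boxes_.end()) return false;
            box = it->second;
        }
        box->deliver(std::move(msg));
        return true;
    }
private:
    std::mutex mutex_;
    std::map<std::string, Mailbox*> boxes_;
};
```

The ProcessThread can then check mailbox.empty() once per frame and call receiveMessage() only when something has arrived.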
Hope this helps.

Some topics you might be interested in:
mutex: A mutex allows you to lock access to specific resources for one thread only
semaphore: A way to determine how many users a certain resource still has (=how many threads are accessing it) and a way for threads to access a resource. A mutex is a special case of a semaphore.
critical section: a mutex-protected piece of code (street with only one lane) that can only be travelled by one thread at a time.
message queue: a way of distributing messages in a centralized queue
inter-process communication (IPC) - a way of threads and processes to communicate with each other through named pipes, shared memory and many other ways (it's more of a concept than a special technique)
All topics in bold print can be easily looked up on a search engine.
