Minimizing dropped UDP packets at high packet rates (Windows 10)


IMPORTANT NOTE: I'm aware that UDP is an unreliable protocol. But, as I'm not the manufacturer of the device that delivers the data, I can only try to minimize the impact. Hence, please don't post any more statements about UDP being unreliable. I need suggestions to reduce the loss to a minimum instead.
I've implemented a C++ application that needs to receive a large number of UDP packets in a short time and has to run on Windows (Winsock). The program works, but it seems to drop packets once the data rate (or packet rate) per UDP stream reaches a certain level. Note that I cannot change the camera interface to use TCP.
Details: It's a client for Gigabit-Ethernet cameras, which send their images to the computer using UDP packets. The data rate per camera is often close to the capacity of the network interface (~120 Megabytes per second), which means even with 8KB-Jumbo Frames the packet rate is at 10'000 to 15'000 per camera. Currently we have connected 4 cameras to one computer... and this means up to 60'000 packets per second.
The software handles all cameras at the same time. The stream receiver for each camera is implemented as a separate thread and has its own receiving UDP socket.
At a certain frame rate the software seems to miss a few UDP frames every few minutes, even though only about 60-70% of the network capacity is in use.
Hardware Details
Cameras are from foreign manufacturers! They send UDP streams to a configurable UDP endpoint via ethernet. No TCP-support...
Cameras are connected via their own dedicated network interface (1GBit/s)
Direct connection, no switch used (!)
Cables are CAT6e or CAT7
Implementation Details
So far I set the SO_RCVBUF to a large value:
int32_t rbufsize = 4100 * 3100 * 2; // two 12 MP images
if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, (char*)&rbufsize, sizeof(rbufsize)) == -1) {
    perror("SO_RCVBUF");
    throw runtime_error("Could not set socket option SO_RCVBUF.");
}
The error is not thrown. Hence, I assume the value was accepted.
I also set the priority of the main process to HIGH_PRIORITY_CLASS by using the following code:
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
However, I didn't find any possibility to change the thread priorities. The threads are created after the process priority is set...
The receiver threads use blocking IO to receive one packet at a time (with a 1000 ms timeout to allow the thread to react to a global shutdown signal). If a packet is received, it's stored in a buffer and the loop immediately continues to receive any further packets.
Questions
Is there any other way to reduce the probability of packet loss? Is there a way to receive all packets that are stored in the socket's buffer with one call? (I don't need any information about the sender side, just the contained payload.)
Maybe, you can also suggest some registry/network card settings to check...

To increase the UDP Rx performance for GigE cameras on Windows you may want to look into writing a custom filter driver (NDIS). This allows you to intercept the messages in the kernel, stop them from reaching userspace, pack them into some buffer and then send them to userspace via a custom ioctl to your application. I have done this; it took about a week of work. There is a sample available from Microsoft which I used as a base for it.
It is also possible to use an existing generic driver, such as pcap, which I also tried and which took about half a week. This is not as good because pcap cannot determine when the frames end, so packet grouping will be suboptimal.
I would suggest first digging deep in the network stack settings and making sure that the PC is not starved for resources. Look at guides for tuning e.g. Intel network cards for this type of load, that could potentially have a larger impact than a custom driver.
(I know this is an older thread and you have probably solved your problem. But things like this are good to document for future adventurers.)

Use IOCP with WSARecv in overlapped mode; you can post around 60k WSARecv calls up front.
On the thread that handles GetQueuedCompletionStatus, process the data and then post another WSARecv from that same thread to compensate for the one that was just consumed by the received data.
Please note that your UDP packet size should stay below the MTU. Above it, datagrams get fragmented, which can cause drops depending on all the network hardware between the camera and the software.
Write some UDP testers that mimic the camera to test the network, just to be sure that the hardware will support the load.
https://www.winsocketdotnetworkprogramming.com/winsock2programming/winsock2advancediomethod5e.html

Related

How to calculate the network bandwidth of the device?

To achieve an effective data transfer mechanism, I need to find out how many bits can fill up a network link.
Let me explain the situation.
Once I send data (application protocol), the receiver replies with an ACK after it has processed the data (in the application layer). If the RTT is high (e.g. 500 ms), it takes a long time for the ACK to come back. Until the ACK is received, no further data is sent and the connection sits idle. To rectify the situation, I need to keep some data in flight in the meantime.
So I decided to keep transferring data until the amount sent reaches the bandwidth-delay product (how many bits can fill up a network link):
BDP = Bandwidth(bits per sec) x RTT ( in secs).
How can I find the network bandwidth of the device?
Is there any Windows API or other way to find the bandwidth of the link?
PS: I am a newbie to network programming.
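The BDP formula in the question can be worked through numerically. The following is only a minimal sketch: the function name bdp_bytes and the example figures are illustrative assumptions, not part of the question.

```cpp
#include <cassert>
#include <cstdint>

// Bandwidth-delay product: how many bytes can be "in flight" on a link
// before the first acknowledgement can possibly come back.
// bandwidth_bps: link bandwidth in bits per second
// rtt_ms:        round-trip time in milliseconds
std::uint64_t bdp_bytes(std::uint64_t bandwidth_bps, std::uint64_t rtt_ms) {
    assert(bandwidth_bps > 0);
    const std::uint64_t bits = bandwidth_bps * rtt_ms / 1000;  // bits in flight
    return bits / 8;                                           // convert to bytes
}
```

For an assumed 100 Mbit/s link and the 500 ms RTT mentioned above, bdp_bytes(100000000, 500) yields 6250000, i.e. about 6.25 MB would have to be outstanding to keep the link busy.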

You do not calculate bandwidth. The bandwidth is a property of the network interface. A 100 Mbps ethernet interface always has a 100 Mbps bandwidth. You are using the incorrect term.
If you are using TCP, the sender will constantly increase the send/congestion window until there is a problem, then it exponentially reduces the window size, and again starts increasing it until there is again a problem, repeating that over and over. Only a sender will know this window.
The receiver has a buffer that is the receive window, and it will communicate the current window size to the sender in every acknowledgement. The receive window will shrink as the buffer is filled, and grows as the buffer is emptied. The receive window determines how much data the sender is allowed to send before stopping to wait for an acknowledgement.
TCP handles all that automatically, calculating the SRTT and automatically adjusting to give you a good throughput for the conditions. You seem to want to control what TCP inherently does for you. You can tweak things like the receive buffer to increase the throughput, but you need to write your own transport protocol to do what you propose because you will overrun the receive buffer, losing data or crashing the receiving host.
Also, remember that TCP creates a connection between two equal TCP peers. Both are senders and both are receivers. Either side can send and receive, and either side can initiate closing the connection or kill it with a RST.

The Windows API function GetIpNetworkConnectionBandwidthEstimates() returns "historical" bandwidth "estimates" for a network connection (which is more relevant than the whole interface/link) on the specified interface.

Using ASIO to capture lots of UDP packets

I'm using the asio (non-Boost version) library to capture incoming UDP packets via a 10 Gb Ethernet adapter.
150k packets a second is fine, but I start getting dropped packets when I go to higher rates like 300k packets/sec.
I'm pretty sure the bottleneck is in DMA'ing 300k separate transfers from the network card to the host system. The transfers aren't big, only 1400 bytes each, so it's not a bandwidth issue.
Ideally I would like a mechanism to coalesce the data from multiple packets into a single DMA transfer to the host. Currently I am using asio::receive to do synchronous transfers, which gives better performance than async_receive.
I have tried using the receive command with a larger buffer, or using an array of multiple buffers, but I always seem to get a single read of 1400 bytes.
Is there any way around this?
Ideally I would like to read some multiple of the 1400 bytes at a time, so long as it didn't take too long for the total to be filled,
i.e. wait up to 4 ms and then return 4 x 1400 bytes, or simply return after 4 ms with however many bytes are available...
I do not control the entire network, so I cannot force jumbo frames :(
Cheers,

I would remove the asio layer and go direct to the metal.
If you're on Linux you should use recvmmsg(2) rather than recvmsg() or recvfrom(), as it at least allows for the possibility of transferring multiple messages at a time within the kernel, which the others don't.
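As a rough illustration of the recvmmsg(2) suggestion, here is a minimal Linux-only sketch. The helper name receive_batch and its parameters are made up for this example; it simply collects whatever datagrams are already queued on the socket in one system call.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // recvmmsg() and struct mmsghdr are GNU/Linux extensions
#endif
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>
#include <sys/socket.h>
#include <sys/uio.h>

// Receive up to `max_msgs` datagrams from `fd` with a single system call.
// Each datagram is truncated to `max_len` bytes if it is longer.
// Returns the payloads of the datagrams that were received (possibly none).
std::vector<std::string> receive_batch(int fd, std::size_t max_msgs, std::size_t max_len) {
    assert(max_msgs > 0 && max_len > 0);
    std::vector<std::vector<char>> bufs(max_msgs, std::vector<char>(max_len));
    std::vector<iovec> iov(max_msgs);
    std::vector<mmsghdr> msgs(max_msgs);
    for (std::size_t i = 0; i < max_msgs; ++i) {
        iov[i].iov_base = bufs[i].data();
        iov[i].iov_len = max_len;
        msgs[i] = mmsghdr{};                // zeroed: no sender address, no control data
        msgs[i].msg_hdr.msg_iov = &iov[i];  // one buffer per datagram
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    // MSG_DONTWAIT: return immediately with whatever is already queued.
    const int n = recvmmsg(fd, msgs.data(), static_cast<unsigned>(max_msgs),
                           MSG_DONTWAIT, nullptr);
    std::vector<std::string> out;
    for (int i = 0; i < n; ++i)             // n is -1 on error or when nothing is queued
        out.emplace_back(bufs[i].data(), msgs[i].msg_len);
    return out;
}
```

Each datagram still lands in its own buffer, so message boundaries are preserved; the saving is in the number of user/kernel transitions, not in the number of packets.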
If you can't do either of these things, you need to at least moderate your expectations. recvfrom() and recvmsg() and whatever lies over them in asio will never deliver more than one UDP datagram at a time. You need to:
speed up your receiving loop as much as possible, eliminating all possible overhead, especially dynamic memory allocation and I/O to other sockets or files.
ensure that the socket receive buffer is as large as possible, at least a megabyte, via setsockopt()/SO_RCVBUF, and don't assume that what you set was what you got: get it back via getsockopt() to see if the platform has limited you in some way.
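The "set it, then read it back" advice can be sketched as follows, assuming POSIX sockets (on Winsock the same two calls exist, but the option pointer is a char*). The function name set_and_verify_rcvbuf is hypothetical.

```cpp
#include <cassert>
#include <sys/socket.h>

// Ask for a receive buffer of `requested` bytes, then read back what the
// kernel actually granted. Returns the granted size, or -1 on error.
// Note: Linux reports about double the requested value (it reserves room
// for bookkeeping) and silently caps requests at net.core.rmem_max.
int set_and_verify_rcvbuf(int fd, int requested) {
    assert(requested > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0)
        return -1;
    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
        return -1;
    return granted;
}
```

A successful setsockopt() only means the request was accepted, not that the full size was granted, so the caller should compare the returned value with the amount it needs and log or adjust system limits if it falls short.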

Maybe you can try a workaround with tcpdump using the libpcap library (http://www.tcpdump.org/) and a filter so that you only receive UDP packets.

Intermittent Delays in C++ TCP Communication in Linux

I have a device which sends data every 20 milliseconds over TCP. I have an application which connects to this device and starts the socket communication. My application listens on a separate thread and reads the data as soon as it is ready, puts the data aside, and some other thread processes it. The device is directly connected to the computer via an Ethernet cable.
I see a strange problem and I am trying to understand the reason for it: roughly once a minute, it takes approximately 50 milliseconds to receive a packet from the device. I do a blocking read that waits for up to a second and returns as soon as data is ready. Normally it takes approximately 20 ms, as I would expect, but as I said there are times when it takes 50 ms, even though this is very rare (1 in 3000). What I noticed is that the packets after a late packet arrive immediately, which makes me think that there is some delay in the network layer. I also examined the timestamps of the packets (which are assigned by the device), and they increase consistently by 20 ms.
Is it normal to see delays like that when the device is directly connected to the computer? Since it is TCP, there might be a lot going on under the hood (CRC checks, out-of-order packets, retransmissions, etc.). I would still like to find a way to prevent this delay rather than accept that it might happen.
Any insights will be greatly appreciated.

It's probably a result of Nagle's algorithm, which is turned on by default for TCP/IP sockets.
Use setsockopt() to set the TCP_NODELAY flag on the socket that sends the data to turn it off.
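A minimal sketch of that setsockopt() call, assuming POSIX sockets on Linux. The helper names disable_nagle and nagle_disabled are made up for this example; note that the option has to be set on the sending side, which here is the device.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on a TCP socket so small writes are sent
// immediately instead of being coalesced. Returns true on success.
bool disable_nagle(int fd) {
    assert(fd >= 0);
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == 0;
}

// Read the flag back; returns true if TCP_NODELAY is currently set.
bool nagle_disabled(int fd) {
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0)
        return false;
    return value != 0;
}
```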

Using IOCP with UDP?

I'm pretty familiar with what Input/Output Completion Ports are for when it comes to TCP.
But what, if I am for example coding a FPS game, or anything where need for low latency can be a deal breaker - I want immediate response to the player to provide the best playing experience, even at cost of losing some spatial data on the go. It becomes obvious that I should use UDP and aside from sending coordinate updates frequently, I should also implement kind of semi-reliable protocol (afaik TCP induces packet loss in UDP so we should avoid mixing these two) to handle such events like chat messages, or gunshots where packet loss may be crucial.
Let's say I'm aiming at performance which would apply to MMOFPS game that allows to meet hundreds of players in one, persistent world, and aside from fighting with guns, it allows them to communicate through chat messages etc. - something like this actually exists and works well - check out PlanetSide 2.
Many articles out there on the net (e.g. these from MSDN) say overlapped sockets are the best and IOCP is a god-tier concept, but they don't seem to distinguish the cases where we use protocols other than TCP.
So there is almost no reliable information about the I/O techniques used when developing such a server. I've looked at this, but the topic seems to be highly controversial, and I've also seen this, but considering the discussions in the first link, I don't know if I should follow the assumptions of the second one, whether I should use IOCP with UDP at all, and if not, what the most scalable and efficient I/O concept is when it comes to UDP.
Or maybe I am just making another premature optimization and no thinking ahead is required for the moment?
Thought about posting it on gamedev.stackexchange.com, but this question better applies to general-purpose networking I think.

I do not recommend using this, but technically the most efficient way to receive UDP datagrams would be to just block in recvfrom (or WSARecvFrom if you will). Of course, you'll need a dedicated thread for that, or not much will happen otherwise while you block.
Unlike with TCP, you do not have a connection built into the protocol, and you do not have a stream without defined borders. That means you get the sender's address with every datagram that comes in, and you get a whole message or nothing. Always. No exceptions.
Now, blocking on recvfrom means one context switch to the kernel, and one context switch back when something was received. It won't go any faster by having several overlapped reads in flight either, because only one datagram can arrive on the wire at the same time, which is by far the most limiting factor (CPU time is not the bottleneck!). Using an IOCP means at least 4 context switches, two for the receive and two for the notification. Alternatively, an overlapped receive with completion callback is not much better either, because you must NtTestAlert or SleepEx to run the APC queue, so again you have at least 2 extra context switches (though it's only +2 for all notifications together, and you might incidentally already sleep anyway).
However:
Using an IOCP and overlapped reads is nevertheless the best way to do it, even if it is not the most efficient one. Completion ports are independent of TCP; they work just fine with UDP, too. As long as you use an overlapped read, it does not matter what protocol you use (or even whether it's network or disk, or some other waitable or alertable kernel object).
It also does not really matter for either latency or CPU load whether you burn a few hundred cycles extra for the completion port. We're talking about "nano" versus "milli" here, a factor of one to one million. On the other hand, completion ports are overall a very comfortable, sound, and efficient system.
You can for example trivially implement logic for resending when you did not receive an ACK in time (which you must do when a form of reliability is desired, UDP does not do it for you), as well as keepalive.
For keepalive, add a waitable timer (maybe firing after 15 or 20 seconds) that you reset every time you receive anything. If your completion port ever tells you that this timer went off, you know the connection is dead.
For resends, you could e.g. set a timeout on GetQueuedCompletionStatus, and every time you wake up find all packets that are more than so-and-so old and have not been ACKed yet.
The entire logic happens in one place, which is very nice. It's versatile, efficient, and hard to do wrong.
You can even have several threads (and, indeed, more threads than your CPU has cores) block on the completion port. Many threads sounds like an unwise design, but it is in fact the best thing to do.
A completion port wakes up to N threads in last-in-first-out order, N being the number of cores unless you tell it to do something different. If any of these threads block, another one is woken to handle outstanding events. This means that in the worst case, an extra thread may be running for a short time, but this is tolerable. In the average case, it keeps processor usage close to 100% as long as there is some work to do and zero otherwise, which is very nice. LIFO waking is favourable for processor caches and keeps switching thread contexts low.
This means you can block and wait for an incoming datagram and handle it (decrypt, decompress, perform logic, read something from disk, whatever) and another thread will be immediately ready to handle the next datagram that might come in the next microsecond. You can use overlapped disk IO with the same completion port, too. If you have compute work (such as AI) to do that can be split into tasks, you can manually post (PostQueuedCompletionStatus) those on the completion port as well and you have a parallel task scheduler for free. All you have to do is wrap an OVERLAPPED into a structure that has some extra data after it, and use a key that you will recognize. No worrying about thread synchronization, it just magically works (you don't even strictly need to have an OVERLAPPED in your custom structure when posting your own notifications, it will work with any structure you pass, but I don't like lying to the operating system, you never know...).
It does not even matter much whether you block, for example when reading from disk. Sometimes this just happens and you can't help it. So what, one thread blocks, but your system still receives messages and reacts to it! The completion port automatically pulls another thread from its pool when it's necessary.
About TCP inducing packet loss on UDP, this is something that I am inclined to call an urban myth (although it is somewhat correct). The way this common mantra is worded is however misleading. It may have been true once upon a time (there exists research on that matter, which is, however, close to a decade old) that routers would drop UDP in favour of TCP, thereby inducing packet loss. That is, however, certainly not the case nowadays.
A more truthful point of view is that anything you send induces packet loss. TCP induces packet loss on TCP and UDP induces packet loss on TCP and vice versa, this is a normal condition (it's how TCP implements congestion control, by the way). A router will generally forward one incoming packet if the cable on the other plug is "silent", it will queue a few packets with a hard deadline (buffers are often deliberately small), optionally it may apply some form of QoS, and it will simply and silently drop everything else.
A lot of applications with rather harsh realtime requirements (VoIP, video streaming, you name it) nowadays use UDP, and while they cope well with a lost packet or two, they do not at all like significant, recurring packet loss. Still, they demonstrably work fine on networks that have a lot of TCP traffic. My phone (like the phones of millions of people) works exclusively over VoIP, data going over the same router as internet traffic. There is no way I can provoke a dropout with TCP, no matter how hard I try.
From that everyday observation, one can tell for certain that UDP is definitely not dropped in favour of TCP. If anything, QoS might favour UDP over TCP, but it most certainly doesn't penalize it.
Otherwise, services like VoIP would stutter as soon as you open a website and be unavailable altogether if you download something the size of a DVD ISO file.
EDIT:
To give somewhat of an idea of how simple life with IOCP can be (somewhat stripped down, utility functions missing):
for(;;)
{
    if(GetQueuedCompletionStatus(iocp, &n, &k, (OVERLAPPED**)&o, 100) == 0)
    {
        if(o == 0) // ---> timeout, mark and sweep
        {
            CheckAndResendMarkedDgrams(); // resend those from last pass
            MarkUnackedDgrams(); // mark new ones
        }
        else
        {   // zero return value but lpOverlapped is not null:
            // this means an error occurred
            HandleError(k, o);
        }
        continue;
    }
    if(n == 0 && k == 0 && o == 0)
    {
        // zero size and zero handle is my termination message
        // re-post, then break, so all threads on the IOCP will
        // one by one wake up and exit in a controlled manner
        PostQueuedCompletionStatus(iocp, 0, 0, 0);
        break;
    }
    else if(n == -1) // my magic value for "execute user task"
    {
        TaskStruct *t = (TaskStruct*)o;
        t->funcptr(t->arg);
    }
    else
    {
        /* received data or finished file I/O, do whatever you do */
    }
}
Note how the entire logic for both handling completion messages, user tasks, and thread control happens in one simple loop, no obscure stuff, no complicated paths, every thread only executes this same, identical loop.
The same code works for 1 thread serving 1 socket, or for 16 threads out of a pool of 50 serving 5,000 sockets, 10 overlapped file transfers, and executing parallel computations.
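The loop above leaves CheckAndResendMarkedDgrams() and MarkUnackedDgrams() undefined. One possible shape for the bookkeeping behind them is sketched below in portable C++. The class ResendTracker and its methods are assumptions made for illustration, not the author's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical mark-and-sweep bookkeeping for unacknowledged datagrams.
class ResendTracker {
public:
    // Remember a datagram that was just sent and still needs an ACK.
    // Each sequence number is registered once; resends keep the same entry.
    void OnSent(std::uint32_t seq) {
        assert(pending_.count(seq) == 0);
        pending_[seq] = false;  // not yet marked
    }

    // The peer acknowledged `seq`: stop tracking it.
    void OnAck(std::uint32_t seq) { pending_.erase(seq); }

    // Sweep: return every datagram that was marked on a previous timeout
    // and is still unacknowledged. The caller retransmits these.
    std::vector<std::uint32_t> CollectMarked() const {
        std::vector<std::uint32_t> due;
        for (const auto& entry : pending_)
            if (entry.second) due.push_back(entry.first);
        return due;
    }

    // Mark: flag everything still pending, so that it is resent on the
    // next timeout unless an ACK arrives first.
    void MarkUnacked() {
        for (auto& entry : pending_) entry.second = true;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::unordered_map<std::uint32_t, bool> pending_;  // seq -> marked?
};
```

With the 100 ms timeout used in the loop, a datagram is retransmitted only after it has survived one full timeout interval without an ACK. If several threads run the loop, access to such a tracker would need a lock, which is omitted here.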

I've seen the code to many FPS games that use UDP as the networking protocol.
The standard solution is to send all the data you need to update a single game frame in one large UDP packet. That packet should include a frame number, and a checksum. The packet should of course be compressed.
Generally the UDP packet contains the positions and velocities for every entity near the player, any chat messages that were sent, and all recent state changes (e.g. new entity created, entity destroyed, etc.).
Then the client listens for UDP packets. It will use only the packet with the highest frame number. So if out of order packets appear, the older packets are simply ignored.
Any packets with wrong checksums are also ignored.
Each packet should contain all the information to synchronize the client's game state with the server.
Chat messages get sent repeatedly over several packets, and each message has a unique message ID. For example, you retransmit the same chat message for, say, a full second's worth of frames. If a client misses a chat message after it was sent 60 times, then the quality of the network channel is just too low to play the game. Clients will display any messages they get in a UDP packet that have a message ID they have not yet displayed.
Similarly for objects being created or destroyed. All created or destroyed objects have a unique object Id set by the server. Objects get created or destroyed if the object id they correspond to has not been acted on before.
So the key here is to send data redundantly, and key all state transitions to unique id's set by the server.
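The client-side rules described above (ignore bad checksums, ignore anything older than the newest frame, act on each server-assigned ID only once) might look roughly like this. The packet layout, the names Snapshot, ChatMessage, Client and MakeSnapshot, and the use of FNV-1a as a stand-in checksum are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wire format; field names are illustrative only.
struct ChatMessage {
    std::uint32_t id;   // unique ID assigned by the server
    std::string text;
};

struct Snapshot {
    std::uint32_t frame;            // increases with every packet sent
    std::uint32_t checksum;         // covers frame number and chat text
    std::vector<ChatMessage> chat;  // repeated redundantly over many frames
};

// Toy checksum (FNV-1a); a real game would more likely use a CRC.
std::uint32_t ComputeChecksum(const Snapshot& s) {
    std::uint32_t h = 2166136261u;
    auto mix = [&h](std::uint8_t b) { h = (h ^ b) * 16777619u; };
    for (int i = 0; i < 4; ++i) mix(static_cast<std::uint8_t>(s.frame >> (8 * i)));
    for (const auto& m : s.chat)
        for (unsigned char c : m.text) mix(c);
    return h;
}

// Server-side helper: build a snapshot with a valid checksum.
Snapshot MakeSnapshot(std::uint32_t frame, std::vector<ChatMessage> chat) {
    Snapshot s{frame, 0, std::move(chat)};
    s.checksum = ComputeChecksum(s);
    return s;
}

class Client {
public:
    // Returns the chat lines to display for this packet. Corrupt or
    // out-of-date packets yield nothing and leave the state untouched.
    std::vector<std::string> OnPacket(const Snapshot& s) {
        std::vector<std::string> fresh;
        if (s.checksum != ComputeChecksum(s)) return fresh;       // corrupt
        if (have_frame_ && s.frame <= last_frame_) return fresh;  // stale
        have_frame_ = true;
        last_frame_ = s.frame;
        for (const auto& m : s.chat)
            if (seen_ids_.insert(m.id).second)  // true only for a new ID
                fresh.push_back(m.text);
        return fresh;
    }

private:
    bool have_frame_ = false;
    std::uint32_t last_frame_ = 0;
    std::set<std::uint32_t> seen_ids_;
};
```

Object creation and destruction events would be deduplicated the same way as chat messages, keyed by the server-assigned object ID.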
Edit: Another poster mentioned that for chat messages you might want to use a different protocol on a different port. And they may be right about that probably being optimal. That is, for message types where latency is not critical but reliability is more important, you might want to open up a different port and use TCP. But I'd leave that as a later exercise. It is certainly easier and cleaner at first for your game to use just one channel, and figure out the vagaries of multiple ports, multiple channels, with their various failure modes later (e.g. what happens if the UDP channel is working, but the chat channel goes down? What if you succeed in opening one port and not the other?).

When I did this for a client we used ENet as the base reliable UDP protocol and re-implemented this from scratch to use IOCP for the server side whilst using the freely available ENet code for the client side.
IOCP works fine with UDP and integrates nicely with any TCP connections that you might also be handling (we have TCP, WebSocket or UDP client connections in and TCP connections between server nodes and being able to plug all of these into the same thread pool if we want is handy).
If absolute latency and UDP packet processing speed is most important (and it's unlikely it really is) then using the new Server 2012 RIO API might be worth it, but I'm not convinced yet (see here for some preliminary performance tests and some example servers).
You probably want to look at using GetQueuedCompletionStatusEx() for dealing with your inbound data as it reduces the context switches per datagram as you can pull multiple datagrams back with a single call.

A couple things:
1) As a general rule if you need reliability, you are best off just using TCP. A competitive and perhaps even superior solution on top of UDP is possible, but it is extremely difficult to get right and have it perform properly. The main thing people implementing reliability on top of UDP don't bother with is proper flow control. You must have flow control if you intend to send large amounts of data and want it to gracefully take advantage of the bandwidth that is available at the moment (which changes continuously with route conditions). In practice, implementing anything other than essentially the same algorithm TCP uses is likely to be unfriendly to other protocols on the network as well. It's unlikely you will do a better job at implementing that algorithm than TCP does.
2) As for running TCP and UDP in parallel, it is not as huge of a concern these days as others have noted. At one time I heard that overloaded routers along the way were biased toward dropping UDP packets before TCP packets, which makes sense in some ways, since a dropped TCP packet will just be resent anyway, and a lost UDP packet often isn't. That said, I am skeptical that this actually happens. In particular, dropping a TCP packet will cause the sender to throttle back, so it may make more sense to drop the TCP packet.
The one case where TCP may interfere with UDP is that TCP, by the nature of its algorithm, is continuously trying to go faster and faster until it reaches a point where it loses packets; then it throttles back and repeats the process. As the TCP connection continuously bumps against that bandwidth ceiling, it is just as likely to cause UDP loss as TCP loss, which in theory would appear as if the TCP traffic was sporadically causing UDP loss.
However, this is a problem you will run into even if you put your own reliable mechanism on top of UDP (assuming you do flow control properly). If you wanted to avoid this condition, you could intentionally throttle the reliable data at the application layer. Typically in a game the reliable data rate is limited to the rate at which the client or server actually needs to send reliable data, which is often well below the bandwidth capabilities of the pipe, and thus the interference never occurs, regardless of whether it is TCP or UDP-reliable based.
Where things get a bit more difficult is if you are making a streaming asset game. For a game like FreeRealms which does this, the assets are downloaded from a CDN via HTTP/TCP and it will attempt to use all available bandwidth, which will increase packet loss on the main game channel (which is typically UDP). I have generally found the interference low enough that I don't think you should be worrying about it too much.
3) As for IOCP, my experience with them is very limited, but having done extensive game networking in the past, I am skeptical that they add value in the case of UDP. Typically the server will have a single UDP socket that is handling all incoming data. With hundreds of users connected, the rate at which the data is coming into the server is very high. Having a background thread doing a blocking call on the socket as others have suggested and then quickly moving the data into a queue for the main application thread to pick up is a reasonable solution, but somewhat unnecessary, since in practice the data is coming in so fast when under load that there is not much point in ever sleeping the thread when it blocks.
Let me put this another way: if the blocking socket call polled a single packet and then put the thread to sleep until the next packet came in, it would be context-switching to that thread thousands of times per second when the data rate got high. Either that, or by the time the unblocked thread executed and cleared the data, there would already be additional data ready to be processed as well. Instead, I prefer to put the socket in non-blocking mode and then have a background thread spin at around 100fps processing it (sleeping between polls as needed to achieve the frame rate). In this manner, the socket buffer will build up incoming packets for 10ms and then the background thread will wake up once and process all that data in bulk, then go back to sleep, thus preventing gratuitous context switches. I then have that same background thread do other send-related processing when it wakes up as well. Being entirely event-driven loses many of its benefits when the data volume gets the least bit high.
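The "wake up periodically and drain the socket in bulk" idea can be sketched as follows, assuming POSIX sockets (on Winsock, ioctlsocket with FIONBIO plays the role of fcntl). The helper names SetNonBlocking and DrainSocket are hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <vector>
#include <fcntl.h>
#include <sys/socket.h>

// Put a socket into non-blocking mode. Returns true on success.
bool SetNonBlocking(int fd) {
    const int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
}

// Read every datagram currently queued on `fd` and return the payloads.
// Intended for non-blocking datagram sockets only: the loop ends when the
// socket buffer is empty (recv fails with EAGAIN/EWOULDBLOCK).
std::vector<std::string> DrainSocket(int fd, std::size_t max_len = 2048) {
    assert(max_len > 0);
    std::vector<std::string> out;
    std::vector<char> buf(max_len);
    for (;;) {
        const ssize_t n = recv(fd, buf.data(), buf.size(), 0);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            break;                         // buffer empty, or a real error
        }
        out.emplace_back(buf.data(), static_cast<std::size_t>(n));
    }
    return out;
}
```

The background thread would call DrainSocket() once per tick, hand the results to the logical connection objects, and then sleep for the remainder of the 10 ms interval. The socket's receive buffer has to be large enough to hold everything that can arrive during one interval.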
In the case of TCP, the story is quite different, since you need an efficient mechanism to figure out which of hundreds of connects the incoming data is coming from and polling them all is very slow, even on a periodic basis.
So, in the case of UDP with a home-grown UDP-reliable mechanism on top of it, I typically have a background thread playing the same role that the OS plays... whereas the OS gets the data from the network card then distributes it to various logical TCP connections internally for processing, my background thread gets the data from the solitary UDP socket (via periodic polling) and distributes it to my own internal logical connection objects for processing. Those internal logical connections then put the application-level packet data into a thread-safe master-queue flagged with the logical connection they came from. The main application thread then processes that master-queue, routing the packets directly to the game-level objects associated with that connection. From the main application thread's point of view, it simply has an event-driven queue it is processing.
The bottom line is that given that the poll call to the solitary UDP socket rarely comes up empty, it is difficult to imagine there is going to be a more efficient way to solve this problem. The only thing you lose with this method is you wait up to 10ms to wake up when in theory you could be waking up the instant the data first arrived, but that is only meaningful if you were under extremely light load anyway. Plus, the main application thread isn't going to be making use of the data until its next frame cycle anyway, so the difference is moot, and I think the overall system performance is enhanced by this technique.

I wouldn't hold a game as old as PlanetSide up as a paragon of modern network implementation. Especially not having seen the insides of their networking library. :)
Different types of communication require different methodologies. One of the answers above talks around the differences between frame/position updates and chat messages, without recognizing that using the same transport for both is probably silly. You should most definitely use a connected TCP socket between your chat implementation and the chat server, for text-style chat. Don't argue, just do it.
So, for your game client doing updates via arriving UDP packets, the most efficient path from the network adapter through the kernel and into your application is (most likely) going to be a blocking recv. Create a thread that rips packets off the network, verifies their validity (chksum match, sequence number increasing, whatever other checks you have), de-serializes the data into an internal object, then queue the object on an internal queue to the application thread that handles those sorts of updates.
But don't take my word for it: test it! Write a small program that can receive and deserialize 3 or 4 kinds of packets, using a blocking thread and a queue to deliver the objects, then re-write it using a single thread and IOCPs, with the deserialization and queueing in the completion routine. Pound enough packets through it to get the run time up in the minute range, and test which one is fastest. Make sure something (i.e. some thread) in your test app is consuming the objects off the queue so you get a full picture of the relative performance.
Post back here when you have the two test programs done, and let us know which worked out best, mm'kay? Which was fastest, which would you rather maintain in the future, which took the longest to get it working, etc.

If you want to support many simultaneous connections, you need to use an event-driven networking approach. I know of two good libraries: libev (used by nodeJS) and libevent. They are very portable and easy to use. I have successfully used libevent in an application supporting hundreds of parallel TCP/UDP(DNS) connections.
I believe using event-driven network i/o is not premature optimization in a server - it should be the default design pattern. If you want to do a quick prototype implementation it may be better to start in a higher level language. For JavaScript there is nodeJS and for Python there is Twisted. Both I can personally recommend.

How about NodeJS
It supports UDP and it is highly scalable.

Using TCP, why do large blocks of data get transmitted with a lower bandwidth than small blocks of data?

Using 2 PC's with Windows XP, 64kB Tcp Window size, connected with a crossover cable
Using Qt 4.5.3, QTcpServer and QTcpSocket
Sending 2000 messages of 40kB takes 2 seconds (40MB/s)
Sending 1 message of 80MB takes 80 seconds (1MB/s)
Does anyone have an explanation for this? I would expect the larger message to go faster, since the lower layers can then fill the TCP packets more efficiently.

This is hard to comment on without seeing your code.
How are you timing this on the sending side? When do you know you're done?
How does the client read the data, does it read into fixed sized buffers and throw the data away or does it somehow know (from the framing) that the "message" is 80MB and try and build up the "message" into a single data buffer to pass up to the application layer?
It's unlikely to be the underlying Windows sockets code that's making this work poorly.

TCP, from the application side, is stream-based which means there are no packets, just a sequence of bytes. The kernel may collect multiple writes to the connection before sending it out and the receiving side may make any amount of the received data available to each "read" call.
TCP, on the IP side, is packets. Since standard Ethernet has an MTU (maximum transmission unit) of 1500 bytes and both TCP and IP have 20-byte headers, each packet transferred over Ethernet will pass 1460 bytes (or less) of the TCP stream to the other side. 40KB or 80MB writes from the application will make no difference here.
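The header arithmetic above can be checked with a few lines of C++. This is just a worked example; the function names are made up, and it assumes 20-byte IPv4 and TCP headers without options, as in the text.

```cpp
#include <cassert>
#include <cstdint>

// Payload bytes per TCP segment for a given MTU, assuming 20-byte IPv4
// and 20-byte TCP headers without options.
std::uint64_t PayloadPerSegment(std::uint64_t mtu) {
    assert(mtu > 40);
    return mtu - 20 - 20;
}

// Number of segments needed to carry `bytes` of application data.
std::uint64_t SegmentsFor(std::uint64_t bytes, std::uint64_t mtu) {
    const std::uint64_t payload = PayloadPerSegment(mtu);
    return (bytes + payload - 1) / payload;  // round up
}
```

Taking 40KB as 40 * 1024 bytes and 80MB as 80 * 1024 * 1024 bytes, a 40KB write needs 29 segments and an 80MB write needs 57457, both at 1460 payload bytes each. The per-segment size is the same however the application chunks its writes.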
How long it appears to take data to transfer will depend on how and where you measure it. Writing 40KB will likely return immediately since that amount of data will simply get dropped in TCP's "send window" inside the kernel. An 80MB write will block waiting for it all to get transferred (well, all but the last 64KB which will fit, pending, in the window).
TCP transfer speed is also affected by the receiver. It has a "receive window" that contains everything received from the peer but not fetched by the application. The amount of space available in this window is passed to the sender with every return ACK so if it's not being emptied quickly enough by the receiving application, the sender will eventually pause. WireShark may provide some insight here.
In the end, both methods should transfer in the same amount of time since an application can easily fill the outgoing window faster than TCP can transfer it no matter how that data is chunked.
I can't speak for the operation of Qt, however.

Bug in Qt 4.5.3
