Low Throughput on Windows Named Pipe Over WAN

Low Throughput on Windows Named Pipe Over WAN - c++

I'm having problems with low performance using a Windows named pipe. The throughput drops off rapidly as the network latency increases. There is a roughly linear relationship between messages sent per second and round trip time. It seems that the client must ack each message before the server will send the next one. This leads to very poor performance, I can only send 5 (~100 byte) messages per second over a link with an RTT of 200 ms.
The pipe is asynchronous, using multiple overlapped write operations (and multiple overlapped reads at the client end), but this is not improving throughput. Is it possible to send messages in parallel over a named pipe? The pipe is created using PIPE_TYPE_MESSAGE, would PIPE_READMODE_BYTE work better? Is there any other way I can improve performance?
This is a deployed solution, so I can't simply replace the pipe with a socket connection (I've read that Windows named pipe aren't recommended for use over a WAN, and I'm wondering if this is why). I'd be grateful for any help with this matter.

We found that Named Pipes had poor performance from Windows XP onwards.
I don't have a solution for you. But I am concurring with the notion of Named Pipes being useless from XP onwards. We changed our software (in terms of IPC) completely because of it.
Is your comms factored into a separate DLL? Perhaps you could replace the DLL with an interface that looks the same but behaves differently?

I've implemented a work around, introducing a small (~1ms) fixed delay to buffer up as much data as possible before writing to the pipe. Over a network link with a RTT of 200ms, I can send ten times as much data in about a third of the time.
I send a message down the pipe when it first connects, so the client can determine the comms mode supported by the server and send data accordingly.

I would imagine that some of the WAN optimisation gear out there would be able to boost performance, as one of the things they do is understand protocols and reduce their chattiness. Given the latency of many WAN links, this alone can boost throughput and reduce timeouts.

Related

How to send and receive data up to SO_SNDTIMEO and SO_RCVTIMEO without corrupting connection?

I am currently planning how to develop a man in the middle network application for TCP server that would transfer data between server and client. It would behave as regular client for server and server for remote client without modifying any data. It will be optionally used to detect and measure how long server or client is not able to receive data that is ready to be received in situation when connection is inactive.
I am planning to use blocking send and recv functions. Before any data transfer I would call a setsockopt function to set SO_SNDTIMEO and SO_RCVTIMEO to about 10 - 20 miliseconds assuming it will force blocking send and recv functions to return early in order to let another active connection data to be routed. Running thread per connection looks too expensive. I would not use async sockets here because I can not find guarantee that they will get complete in a parts of second especially when large data amount is being sent or received. High data delays does not look good. I would use very small buffers here but calling function for each received byte looks overkill.
My next assumption would be that is safe to call send or recv later if it has previously terminated by timeout and data was received less than requested.
But I am confused by contradicting information available at msdn.
send function
https://msdn.microsoft.com/en-us/library/windows/desktop/ms740149%28v=vs.85%29.aspx
If no error occurs, send returns the total number of bytes sent, which
can be less than the number requested to be sent in the len parameter.
SOL_SOCKET Socket Options
https://msdn.microsoft.com/en-us/library/windows/desktop/ms740532%28v=vs.85%29.aspx
SO_SNDTIMEO - The timeout, in milliseconds, for blocking send calls.
The default for this option is zero, which indicates that a send
operation will not time out. If a blocking send call times out, the
connection is in an indeterminate state and should be closed.
Are my assumptions correct that I can use these functions like this? Maybe there is more effective way to do this?
Thanks for answers

While you MIGHT implement something along the ideas you have given in your question, there are preferable alternatives on all major systems.
Namely:
kqueue on FreeBSD and family. And on MAC OSX.
epoll on linux and related types of operating systems.
IO completion ports on Windows.
Using those technologies allows you to process traffic on multiple sockets without timeout logics and polling in an efficient, reactive manner. They all can be considered successors of the ancient select() function in socket API.
As for the quoted documentation for send() in your question, it is not really confusing or contradicting. Useful network protocols implement a mechanism to create "backpressure" for situations where a sender tries to send more data than a receiver (and/or the transport channel) can accomodate for. So, an application can only provide more data to send() if the network stack has buffer space ready for it.
If, for example an application tries to send 3Kb worth of data and the tcp/ip stack has only room for 800 bytes, send() might succeed and return that it used 800 bytes of the 3k offered bytes.
The basic approach to forwarding the data on a connection is: Do not read from the incoming socket until you know you can send that data to the outgoing socket. If you read greedily (and buffer on application layer), you deprive the communication channel of its backpressure mechanism.
So basically, the "send capability" should drive the receive actions.
As for using timeouts for this "middle man", there are 2 major scenarios:
You know the sending behavior of the sender application. I.e. if it has some intent on sending any data within your chosen receive timeout at any time. Some applications only send sporadically and any chosen value for a receive timeout could be wrong. Even if it is supposed to send at a specific time interval, your timeouts will cause trouble once someone debugs the sending application.
You want the "middle man" to work for unknown applications (which must not use some encryption for middle man to have a chance, of course). There, you cannot pick any "adequate" timeout value because you know nothing about the sending behavior of the involved application(s).

As a previous poster has suggested, I strongly urge you to reconsider the design of your server so that it employs an asynchronous I/O strategy. This may very well require that you spend significant time learning about each operating systems' preferred approach. It will be time well-spent.
For anything other than a toy application, using blocking I/O in the manner that you suggest will not perform well. Even with short timeouts, it sounds to me as though you won't be able to service new connections until you have completed the work for the current connection. You may also find (with short timeouts) that you're burning more CPU time spinning waiting for work to do than actually doing work.
A previous poster wisely suggested taking a look at Windows I/O completion ports. Take a look at this article I wrote in 2007 for Dr. Dobbs. It's not perfect, but I try to do a decent job of explaining how you can design a simple server that uses a small thread pool to handle potentially large numbers of connections:
Windows I/O Completion Ports
http://www.drdobbs.com/cpp/multithreaded-asynchronous-io-io-comple/201202921
If you're on Linux/FreeBSD/MacOSX, take a look at libevent:
Libevent
http://libevent.org/
Finally, a good, practical book on writing TCP/IP servers and clients is "Practical TCP/IP Sockets in C" by Michael Donahoe and Kenneth Calvert. You could also check out the W. Richard Stevens texts (which cover the topic completely for UNIX.)
In summary, I think you should take some time to learn more about asynchronous socket I/O and the established, best-of-breed approaches for developing servers.
Feel free to private message me if you have questions down the road.

Looking for best approach to sending the same data to multiple destinations using sockets

Looking for the best approach to sending the same message to multiple destinations using TCP/IP sockets. I'm working with an existing VS 2010 C++ application on Windows. Hoping to use a standard library/design pattern approach that has many of the complexities already worked out if possible.
Here's one approach I'm thinking about.. One main thread retrieves messages from a database and adds them to some sort of thread safe queue. The application also has one thread for each client socket connection to some destination server. Each one of these threads would read from the thread safe queue, and send the message over a tcp/ip socket.
There may be better/simpler/more robust approaches than this one though..
The issues I have to be concerned about mostly are latency. The destinations could be anywhere, and there may be significant latency between one socket connection and another.
The messages must go in an exact FIFO order to all the destinations.
Also one destination will be considered the primary destination.. all messages must get to this destination, no exceptions. For the other destinations, i.e. non-primary, the messages are just copies and it's not absolutely critical if the non-primary destinations do not receive a few messages. At any point, one of the non-primary destinations could become the primary destination. If one of the destinations falls too far behind, then that thread would need to catch up to the primary destination, but skipping some messages.
Looking for any suggestions. Preliminary research so far, my situation appears to be something akin to a single producer and multiple consumers pattern, or possibly master-worker pattern in Java.
I need to implement this in C++ on Windows, and the application must use tcp/ip sockets using an existing defined protocol.
Any help at all would be greatly appreciated.

You need exactly two threads, one that saturates the IO channel to the database and another that saturates the IO channel to the network leading to the 12 servers. Unless you have multiple network interfaces (which you should think about!) you don't send things faster by using multiple threads. Also, since you don't have multiple threads taking care of the network, you don't have to sync them.
What you definitely need to know about is select(). In the case of WinSock, also take a look at WSAEventSelect/WaitForMultipleObjects. Basically, you take a message from the queue and then send it to all clients when they're ready. select() tells you when one of a set of sockets is ready to accept data, so you don't waste time waiting or block trying to send data. What you need to come up with is a schema to reconnect after broken connections, when to drop messages to lagging clients etc. Also, in case the throughput to the different targets varies a lot, you need to think about handling multiple messages in parallel. If they are small (less than a network packet's payload) it makes sense combining them anyway to avoid overhead.
I hope this short overview helps getting you started, otherwise I can elaborate on the details.

Using IOCP with UDP?

I'm pretty familiar with what Input/Output Completion Ports are for when it comes to TCP.
But what, if I am for example coding a FPS game, or anything where need for low latency can be a deal breaker - I want immediate response to the player to provide the best playing experience, even at cost of losing some spatial data on the go. It becomes obvious that I should use UDP and aside from sending coordinate updates frequently, I should also implement kind of semi-reliable protocol (afaik TCP induces packet loss in UDP so we should avoid mixing these two) to handle such events like chat messages, or gunshots where packet loss may be crucial.
Let's say I'm aiming at performance which would apply to MMOFPS game that allows to meet hundreds of players in one, persistent world, and aside from fighting with guns, it allows them to communicate through chat messages etc. - something like this actually exists and works well - check out PlanetSide 2.
Many articles there on the net (e.g. these from msdn) say overlapped sockets are the best and IOCP is god-tier concept, but they don't seem to distinguish the cases where we use other protocols than TCP.
So there is almost no reliable information about I/O techniques used when developing such a server, I've looked at this, but the topic seems to be highly controversial, and I've also seen this , but considering discussions in the first link, I don't know if I should follow assumptions of the second one, whether I should use IOCP with UDP at all, and if not, what is the most scalable and efficient I/O concept when it comes to UDP.
Or maybe am I just making another premature optimization and no thinking ahead is required for the moment ?
Thought about posting it on gamedev.stackexchange.com, but this question better applies to general-purpose networking I think.

I do not recommend using this, but technically the most efficient way to receive UDP datagrams would be to just block in recvfrom (or WSARecvFrom if you will). Of course, you'll need a dedicated thread for that, or not much will happen otherwise while you block.
Other than with TCP, you do not have a connection built into the protocol, and you do not have a stream without defined borders. That means you get the sender's address with every datagram that comes in, and you get a whole message or nothing. Always. No exceptions.
Now, blocking on recvfrom means one context switch to the kernel, and one context switch back when something was received. It won't go any faster by having several overlapped reads in flight either, because only one datagram can arrive on the wire at the same time, which is by far the most limiting factor (CPU time is not the bottleneck!). Using an IOCP means at least 4 context switches, two for the receive and two for the notification. Alternatively, an overlapped receive with completion callback is not much better either, because you must NtTestAlert or SleepEx to run the APC queue, so again you have at least 2 extra context switches (though, it's only +2 for all notifications together, and you might incidentially already sleep anyway).
However:
Using an IOCP and overlapped reads is nevertheless the best way to do it, even if it is not the most efficient one. Completion ports are irrespective from using TCP, they work just fine with UDP, too. As long as you use an overlapped read, it does not matter what protocol you use (or even whether it's network or disk, or some other waitable or alertable kernel object).
It also does not really matter for either latency or CPU load whether you burn a few hundred cycles extra for the completion port. We're talking about "nano" versus "milli" here, a factor of one to one million. On the other hand, completion ports are overall a very comfortable, sound, and efficient system.
You can for example trivially implement logic for resending when you did not receive an ACK in time (which you must do when a form of reliability is desired, UDP does not do it for you), as well as keepalive.
For keepalive, add a waitable timer (maybe firing after 15 or 20 seconds) that you reset every time you receive anything. If your completion port ever tells you that this timer went off, you know the connection is dead.
For resends, you could e.g. set a timeout on GetQueuedCompletionStatus, and every time you wake up find all packets that are more than so-and-so old and have not been ACKed yet.
The entire logic happens in one place, which is very nice. It's versatile, efficient, and hard to do wrong.
You can even have several threads (and, indeed, more threads than your CPU has cores) block on the completion port. Many threads sounds like an unwise design, but it is in fact the best thing to do.
A completion port wakes up to N threads in last-in-first-out order, N being the number of cores unless you tell it to do something different. If any of these threads block, another one is woken to handle outstanding events. This means that in the worst case, an extra thread may be running for a short time, but this is tolerable. In the average case, it keeps processor usage close to 100% as long as there is some work to do and zero otherwise, which is very nice. LIFO waking is favourable for processor caches and keeps switching thread contexts low.
This means you can block and wait for an incoming datagram and handle it (decrypt, decompress, perform logic, read someting from disk, whatever) and another thread will be immediately ready to handle the next datagram that might come in the next microsecond. You can use overlapped disk IO with the same completion port, too. If you have compute work (such as AI) to do that can be split into tasks, you can manually post (PostQueuedCompletionStatus) those on the completion port as well and you have a parallel task scheduler for free. All you have to do is wrap an OVERLAPPED into a structure that has some extra data after it, and use a key that you will recognize. No worrying about thread synchronization, it just magically works (you don't even strictly need to have an OVERLAPPED in your custom structure when posting your own notifications, it will work with any structure you pass, but I don't like lying to the operating system, you never know...).
It does not even matter much whether you block, for example when reading from disk. Sometimes this just happens and you can't help it. So what, one thread blocks, but your system still receives messages and reacts to it! The completion port automatically pulls another thread from its pool when it's necessary.
About TCP inducing packet loss on UDP, this is something that I am inclined to call an urban myth (although it is somewhat correct). The way this common mantra is worded is however misleading. It may have been true once upon a time (there exists research on that matter, which is, however, close to a decade old) that routers would drop UDP in favour of TCP, thereby inducing packet loss. That is, however, certainly not the case nowadays.
A more truthful point of view is that anything you send induces packet loss. TCP induces packet loss on TCP and UDP induces packet loss on TCP and vice versa, this is a normal condition (it's how TCP implements congestion control, by the way). A router will generally forward one incoming packet if the cable on the other plug is "silent", it will queue a few packets with a hard deadline (buffers are often deliberately small), optionally it may apply some form of QoS, and it will simply and silently drop everything else.
A lot of applications with rather harsh realtime requirements (VoIP, video streaming, you name it) nowadays use UDP, and while they cope well with a lost packet or two, they do not at all like significant, recurring packet loss. Still, they demonstrably work fine on networks that have a lot of TCP traffic. My phone (like the phones of millions of people) works exclusively over VoIP, data going over the same router as internet traffic. There is no way I can provoke a dropout with TCP, no matter how hard I try.
From that everyday observation, one can tell for certain that UDP is definitively not dropped in favour of TCP. If anything, QoS might favour UDP over TCP, but it most certainly doesn't penaltize it.
Otherwise, services like VoIP would stutter as soon as you open a website and be unavailable alltogether if you download something the size of a DVD ISO file.
EDIT:
To give somewhat of an idea of how simple life with IOCP can be (somewhat stripped down, utility functions missing):
for(;;)
{
if(GetQueuedCompletionStatus(iocp, &n, &k, (OVERLAPPED**)&o, 100) == 0)
{
if(o == 0) // ---> timeout, mark and sweep
{
CheckAndResendMarkedDgrams(); // resend those from last pass
MarkUnackedDgrams(); // mark new ones
}
else
{ // zero return value but lpOverlapped is not null:
// this means an error occurred
HandleError(k, o);
}
continue;
}
if(n == 0 && k == 0 && o == 0)
{
// zero size and zero handle is my termination message
// re-post, then break, so all threads on the IOCP will
// one by one wake up and exit in a controlled manner
PostQueuedCompletionStatus(iocp, 0, 0, 0);
break;
}
else if(n == -1) // my magic value for "execute user task"
{
TaskStruct *t = (TaskStruct*)o;
t->funcptr(t->arg);
}
else
{
/* received data or finished file I/O, do whatever you do */
}
}
Note how the entire logic for both handling completion messages, user tasks, and thread control happens in one simple loop, no obscure stuff, no complicated paths, every thread only executes this same, identical loop.
The same code works for 1 thread serving 1 socket, or for 16 threads out of a pool of 50 serving 5,000 sockets, 10 overlapped file transfers, and executing parallel computations.

I've seen the code to many FPS games that use UDP as the networking protocol.
The standard solution is to send all the data you need to update a single game frame in one large UDP packet. That packet should include a frame number, and a checksum. The packet should of course be compressed.
Generally the UDP packet contains the positions and velicities for every entity near the player, any chat messages that were sent, and all recent state changes. ( e.g. new entity created, entity destrouyed etc. )
Then the client listens for UDP packets. It will use only the packet with the highest frame number. So if out of order packets appear, the older packets are simply ignored.
Any packets with wrong checksums are also ignored.
Each packet should contain all the information to synchronize the client's game state with the server.
Chat messages get sent repeatedly over several packets, and each message has a unique message id For example, you retransmit the same chat message for say a full second worth of frames. If a client misses a chat message after getting it 60 times - then the quality of the network channel is just too low to play the game. Clients will display any messages they get in a UDP packet that have a message ID they have not yet displayed.
Similarly for objects being created or destroyed. All created or destroyed objects have a unique object Id set by the server. Objects get created or destroyed if the object id they correspond to has not been acted on before.
So the key here is to send data redundantly, and key all state transitions to unique id's set by the server.
#edit: Another poster mentioned that for chat messages you might want to use a different protocol on a different port. And they may be right about that probably being optimal. That is for message types where latency is not critical, but reliability is more important you might want to open up a different port and use TCP. But I'd leave that as a later excercise. It is certainly easier and cleaner at first for your game to use just one channel, and figure out the vagaries of multiple ports, multiple channels, with their various failure modes later. (e.g. what happens if the UDP channel is working, but the chat channel goes goes down? What if you succeed in opening one port and not the other? )

When I did this for a client we used ENet as the base reliable UDP protocol and re-implemented this from scratch to use IOCP for the server side whilst using the freely available ENet code for the client side.
IOCP works fine with UDP and integrates nicely with any TCP connections that you might also be handling (we have TCP, WebSocket or UDP client connections in and TCP connections between server nodes and being able to plug all of these into the same thread pool if we want is handy).
If absolute latency and UDP packet processing speed is most important (and it's unlikely it really is) then a using the new Server 2012 RIO API might be worth it, but I'm not convinced yet (see here for some preliminary performance tests and some example servers).
You probably want to look at using GetQueuedCompletionStatusEx() for dealing with your inbound data as it reduces the context switches per datagram as you can pull multiple datagrams back with a single call.

A couple things:
1) As a general rule if you need reliability, you are best off just using TCP. A competitive and perhaps even superior solution on top of UDP is possible, but it is extremely difficult to get right and have it perform properly. The main thing people implementing reliability on top of UDP don't bother with is proper flow control. You must have flow control if you intend to send large amounts of data and want it to gracefully take advantage of the bandwidth that is available at the moment (which changes continuously with route conditions). In practice, implementing anything other than essentially the same algorithm TCP uses is likely to be unfriendly to other protocols on the network as well. It's unlikely you will do a better job at implementing that algorithm than TCP does.
2) As for running TCP and UDP in parallel, it is not as huge of a concern these days as others have noted. At one time I heard that overloaded routers along the way were bias dropping UDP packets before TCP packets, which makes sense in some ways, since a dropped TCP packet will just be resent anyways, and a lost UDP packet often isn't. That said, I am skeptical that this actually happens. In particular, dropping a TCP packet will cause the sender to throttle back, so it may make more sense to drop the TCP packet.
The one case where TCP may interfere with UDP is that TCP by nature of it's algorithm is continuously trying to go faster and faster, unless it reaches a point where it loses packets, then it throttles back and repeats the process. As the TCP connection continuously bumps against that bandwidth ceiling, it is just as likely to cause UDP loss as TCP loss, which in theory would appear as if the TCP traffic was sporadically causing UDP loss.
However, this is a problem you will run into even if you put your own reliable mechanism on top of UDP (assuming you do flow control properly). If you wanted to avoid this condition, you could intentionally throttle the reliable data at the application layer. Typically in a game the reliable data rate is limited to the rate at which the client or server actually needs to send reliable data, which is often well below the bandwidth capabilities of the pipe, and thus the interference never occurs, regardless of whether it is TCP or UDP-reliable based.
Where things get a bit more difficult is if you are making a streaming asset game. For a game like FreeRealms which does this, the assets are downloaded from a CDN via HTTP/TCP and it will attempt to use all available bandwidth, which will increase packetloss on the main game channel (which is typically UDP). I have generally found the interference low enough that I don't think you should be worrying about it too much.
3) As for IOCP, my experience with them is very limited, but having done extensive game networking in the past, I am skeptical that they add value in the case of UDP. Typically the server will have a single UDP socket that is handling all incoming data. With hundreds of users connected, the rate at which the data is coming into the server is very high. Having a background thread doing a blocking call on the socket as others have suggested and then quickly moving the data into a queue for the main application thread to pick up is a reasonable solution, but somewhat unnecessary, since in practice the data is coming in so fast when under load that there is not much point in ever sleeping the thread when it blocks.
Let me put this another way, if the blocking socket call polled a single packet and then put the thread to sleep until the next packet came in, it would be context-switching to that thread thousands of times per second when the data rate got high. Either that, or by the time the unblocked thread executed and cleared the data, there would already be additional data ready to be processed as well. Instead, I prefer to put the socket in non-blocking mode and then have a background thread spin at around 100fps processing it (sleeping between polls as needed to achieve the frame rate). In this manner, the socket buffer will build up incoming packets for 10ms and then the background thread will wake up once and process all that data in bulk, then go back to sleep, thus preventing gratuitous context switches. I then have that same background thread do other send-related processing when it wakes up as well. Being entirely event-driven loses many of it's benefits when the data volume gets the least bit high.
In the case of TCP, the story is quite different, since you need an efficient mechanism to figure out which of hundreds of connects the incoming data is coming from and polling them all is very slow, even on a periodic basis.
So, in the case of UDP with a home-grown UDP-reliable mechanism on top of it, I typically have a background thread playing the same role that the OS plays... whereas the OS gets the data from the network card then distributes it to various logical TCP connections internally for processing, my background thread gets the data from the solitary UDP socket (via periodic polling) and distributes it to my own internal logical connection objects for processing. Those internal logical connections then put the application-level packet data into a thread-safe master-queue flagged with the logical connection they came from. The main application thread then processes that master-queue in, routing the packets directly to the game-level objects associated with that connection. From the main application threads point of view, it simply has an event driven queue it is processing.
The bottom line is that given that the poll call to the solitary UDP socket rarely comes up empty, it is difficult to imagine there is going to be a more efficient way to solve this problem. The only thing you lose with this method is you wait up to 10ms to wake up when in theory you could be waking up the instant the data first arrived, but that is only meaningful if you were under extremely light load anyways. Plus, the main application thread isn't going to be making use of the data until it's next frame cycle anyways, so the difference is moot, and I think the overall system performance is enhanced by this technique.

I wouldn't hold a game as old as PlanetSide up as a paragon of modern network implementation. Especially not having seen the insides of their networking library. :)
Different types of communication require different methodologies. One of the answers above talks around the differences between frame/position updates and chat messages, without recognizing that using the same transport for both is probably silly. You should most definitely use a connected TCP socket between your chat implementation and the chat server, for text-style chat. Don't argue, just do it.
So, for your game client doing updates via arriving UDP packets, the most efficient path from the network adapter through the kernel and into your application is (most likely) going to be a blocking recv. Create a thread that rips packets off the network, verifies their validity (chksum match, sequence number increasing, whatever other checks you have), de-serializes the data into an internal object, then queue the object on an internal queue to the application thread that handles those sorts of updates.
But don't take my word for it: test it! Write a small program that can receive and deserialize 3 or 4 kinds of packets, using a blocking thread and a queue to deliver the objects, then re-write it using a single thread and IOCPs, with the deserialization and queueing in the completion routine. Pound enough packets through it to get the run time up in the minute range, and test which one is fastest. Make sure something (i.e. some thread) in your test app is consuming the objects off the queue so you get a full picture of the relative performance.
Post back here when you have the two test programs done, and let us know which worked out best, mm'kay? Which was fastest, which would you rather maintain in the future, which took the longest to get it working, etc.

If you want to support many simultaneous connections, you need to use an event-driven networking approach. I know of two good libraries: libev (used by nodeJS) and libevent. They are very portable and easy to use. I have successfully used libevent in an application supporting hundreds of parallel TCP/UDP(DNS) connections.
I believe using event-driven network i/o is not premature optimization in a server - it should be the default design pattern. If you want to do a quick prototype implementation it may be better to start in a higher level language. For JavaScript there is nodeJS and for Python there is Twisted. Both I can personally recommend.

How about NodeJS
It supports UDP and it is highly scalable.

usb disk write latency (windows)

I am writing to USB disk from a lowest priority thread, using chunked buffer writing and still, from time to time the system in overall lags on this operation. If I disable writing to disk only, everything works fine. I can't use Windows file operations API calls, only C write. So I thought maybe there is a WinAPI function to turn on/off USB disk write caching which I could use in conjunction with FlushBuffers or similar alternatives? The number of drives for operations is undefined.
Ideally I would like to never be lagging using write call and the caching, if it will be performed transparently is ok too.
EDIT: would _O_SEQUENTIAL flag on write only operations be of any use here?

Try to reduce I/O priority for the thread.
See this article: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx
In particular use THREAD_MODE_BACKGROUND_BEGIN for your IO thread.
Warning: this doesn't work in Windows XP

The thread priority won't affect the delay that happens in the process of writing the media, because it's done in the kernel mode by the file system/disk drivers that don't pay attention to the priority of the calling thread.
You might try to use "T" flag (_O_SHORTLIVED) and flush the buffers at the end of the operation, also try to decrease the buffer size.

There are different types of data transfer for USB, for data there are 3:
1.Bulk Transfer,
2.Isochronous Transfer, and
3.Interrupt Transfer.
Bulk Transfers Provides:
Used to transfer large bursty data.
Error detection via CRC, with guarantee of delivery.
No guarantee of bandwidth or minimum latency.
Stream Pipe - Unidirectional
Full & high speed modes only.
Bulk transfer is good for data that does not require delivery in a guaranteed amount of time The USB host controller gives a lower priority to bulk transfer than the other types of transfer.
Isochronous Transfers Provides:
Guaranteed access to USB bandwidth.
Bounded latency.
Stream Pipe - Unidirectional
Error detection via CRC, but no retry or guarantee of delivery.
Full & high speed modes only.
No data toggling.
Isochronous transfers occur continuously and periodically. They typically contain time sensitive information, such as an audio or video stream. If there were a delay or retry of data in an audio stream, then you would expect some erratic audio containing glitches. The beat may no longer be in sync. However if a packet or frame was dropped every now and again, it is less likely to be noticed by the listener.
Interrupt Transfers Provides:
Guaranteed Latency
Stream Pipe - Unidirectional
Error detection and next period retry.
Interrupt transfers are typically non-periodic, small device "initiated" communication requiring bounded latency. An Interrupt request is queued by the device until the host polls the USB device asking for data.
From the above, it seems that you want a Guaranteed Latency, so you should use Isochronous mode. There are some libraries that you can use like libusb, or you can read more in msdn

To find out what is letting your system hang you first need to drill down to the Windows hang. What was Windows doing while you did experience the hang?
To find this out you can take a kernel dump. How to get and analyze a Kernel Dump read here.
Depending on the findings you get there you then need to decide if there is anything under your control you can do about. Since you are using a third party library to to the writing there is little you can do except to set the IO priority, thread priority on thread or process level. If the library you were given links against a specific CRT you could try to build your own customized version of it to e.g. flush after every write to prevent write combining by the OS to write only data in big chunks back to disc.
Edit1
Your best bet would be to flush the device after every write. This could force the OS to flush any pending data and write the current pending writes to disc without caching the writes up to certain amount.
The second best thing would be to simply wait after each write to give the OS the chance to write pending changes though small back to disc after a certain time interval.
If you are deeper into performance you should try out XPerf which has a nice GUI and shows you even the call stack where your process did hang. The Windows Team and many other teams at MS use this tool to troubleshoot hang experiences. The latest edition with many more features comes with the Windows 8 SDK. But beware that Xperf only works on OS > Vista.

With a single file descriptor, Is there any performance difference between select, poll and epoll and ...?

The title really says it all.
The and ... means also include pselect and ppoll..
The server project I'm working on basically structured with multiple threads. Each
thread handles one or more sessions. All the threads are identical. The protocol
takes care of which thread will host the session.
I'm using an inhouse socket class that wraps things up. The point of interest is a checkread call which calls either poll (linux) or select (windows).
In summary each thread currently calls poll on a single socket. From what I can tell, using epoll would only be of benefit if this thread was looking at multiple sockets such as what you'd get in say an HTTP server. That's not what I'm doing in my case. And the class only handles a single socket at a time.
There is some brief discussion about edge and level triggering in the man pages for epoll. I'm not really sure what it means. In the socket class I see an optimization in the windows part of the code that shortcuts the select call with an ioctlsocket & FIONREAD to check if there is any data. Wondering if that would return > 0 even if a complete UDP packet hadn't arrived at the time of the call. Is this what edge triggering is in epoll?
In some rudimentary testing, I'm also seeing no noticeable difference between using select and poll.
I can see that using ppoll might be of benefit though due to greater precision in the timeout. Any thoughts?
And yes, I am trying to optimize throughput for a session that is receiving lots of data. The server is more Network & Disk bound than CPU.

The main difference between epoll vs select or poll is that epoll scales a lot better when run in a single thread. I don't know how this would compare to using a multithreaded server using select or poll.
Look at this http://monkey.org/~provos/libevent/libevent-benchmark2.jpg
The reason for this(as far as I can tell) is that when you are using select or poll you must loop through all the connected sockets to determine which ones have data to be read. When you are using epoll, it keeps a seperate array which contains references only to sockets which have data to be read. This saves you lots of loop cycles, and the difference becomes more and more noticeable the more sockets that are connected.
Another thing to look into if performance ever becomes a major issue is io completion ports(windows only) and kqueue(FreeBSD only). It's also important to remember that epoll is linux only. In most cases select or poll will work just fine.
In the case of a single file descriptor, select and poll are more efficient than epoll due to being much simpler. (epoll has some overhead which doesn't make itself useful with only a single socket)

According to the link: http://www.intelliproject.net/articles/showArticle/index/io_multiplexing.
If you use only one descriptor:
select: 201 micro seconds.
poll: 159 micro seconds.
epoll: 176 micro seconds.
Seems poll will be a better solution in such situation.

If you have only a single socket, what's the point of polling in the first place? Wouldn't the best performance then be by just using blocking read/write?
Wrt. the performance, with only a single file descriptor I don't think there is much, if any, difference between the various approaches. If you really care, I suppose you could measure, but I find it difficult that this would particularly matter for the overall performance of your program.
Level/edge triggering. Consider you're monitoring a signal, for simplicity say some voltage in a line. Edge triggering means that something triggers when the voltage goes over or under some specific limit. Level triggering means that something is considered to be in a triggered state as long as the voltage is over/under the limit. That is, edge triggering triggers when some event happens (crossing some threshold), level triggering reflects the state of some "thing" (in this case, voltage).
To get back to network programming, and edge triggered system might be one where you get some kind of signal when a packet is received. If you don't handle the event then the signal is lost. A level triggered system, OTOH, is something like asking "is there data waiting in the buffer for me?"; if you don't handle the event and ask again, the data will still be there waiting for you.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js