Network delays and Application->ProcessMessages() - C++

I am writing a networking DLL that I use in my C++Builder project. This DLL works with remote FTP servers. I noticed a strange behavior when recv() is called: sometimes it returns 0, but when recv() is called on the same socket in another thread, data is received as expected.
What does this mean? I also noticed that calling Application->ProcessMessages() inside the DLL thread speeds up data receiving.
But what is wrong? Doesn't ProcessMessages() just process window messages, or am I missing something?
Thank you

If I understood you correctly and you are trying to recv() on the same SOCKET from parallel threads, then don't do that; there is nothing to gain from it. The data you recv() has already been buffered by the underlying system, and you are just accessing that buffer. What you could do instead is keep multiple buffers for recv(), so that when it returns data you can hand one buffer to the "upper levels" for processing and use the other one for the next recv() call. You could also use a single large buffer with notifications about which part is being processed and which part is being used for receiving. The system probably has locks that forbid multiple reads on the same socket, which is why one recv() returns 0. If it didn't have those locks, you would probably end up with almost randomly split data.
EDIT: Full and long explanation
I think that using multiple threads to read from a single socket is not useful.
Sockets are a software-regulated thing. Your network device doesn't create any "connections"; it just processes the data received and wraps/unwraps it into IP (or any other supported Internet Layer) packets. (This depends on the network device; some are almost entirely software-emulated by the OS and actually perform just the basic "write to tx / read from rx" services, but to us it's the same deal.) The WinSock2 service recognizes packets with specific data (as you have already noticed), so that you may use one network device to communicate with multiple peers simultaneously. WinSock2 actively monitors the traffic before handing it out to you. In other words: by the time a recv() succeeds, the data was already there; the underlying system has checked the socket you passed as a parameter to recv() and handed over only the data that has already been marked as the data for that socket. If the system didn't have locks, reading from one socket with multiple threads (leaving aside the almost useless MSG_PEEK) would make it copy an unknown number of bytes to the location supplied to recv() in thread one and permanently advance its internal data pointer by the number of copied bytes; then, before all the data available to recv() had been copied to location 1, the other thread would kick in and copy another unknown number of bytes, advancing the internal data pointer by that many bytes as well. The ideal result of this type of reading would be half of the data stored at the location supplied by thread 1 and the other half at the location supplied by thread 2. Since even that ideal result is uncertain (the time the system allocates to these two threads is not guaranteed to be equal), you would end up with unsorted data and no means of sorting it, since the information the underlying system uses to know which data belongs to which socket is not available to you.
Since your system is most likely faster than your network device, I stand by my two solutions, the first one preferred, as I have been using this method for both big and small chunks of data transfer (a minimal sketch of it follows after the two options):
Make one reading thread per connected socket and one circular buffer. The size of the buffer depends on the size of the chunks you expect to receive and on the time you will need to process them further. Save the current read position and a "to process" count. When data is received, notify the thread/threads that are supposed to process the data in the buffer, and save the position of the data being used for reading. Continue with recv() if there is buffer space not being processed; otherwise wait until there is (you must implement this in case your computer chokes somewhere; in normal situations it shouldn't happen). You must sync the receiving thread with the processing thread/threads when they access the "to_process_count" and "current read pos" variables, as those will tell you which bytes you can reuse in your circular buffer.
Create and connect one socket per desired reading thread, so that the system will know how to regulate the data on its own.
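Here is a minimal sketch of the first solution, assuming one connected socket and one processing thread; the variable names (toProcessCount, readPos) mirror the description above, kBufSize is an arbitrary illustrative size, and error handling is trimmed:

```cpp
#define NOMINMAX             // keep Windows' min/max macros out of the way
#include <winsock2.h>
#include <algorithm>
#include <condition_variable>
#include <mutex>

const int kBufSize = 64 * 1024;   // tune to your expected chunk sizes

char ringBuf[kBufSize];
int  writePos = 0;           // where recv() stores the next bytes
int  readPos  = 0;           // current read position, chasing writePos
int  toProcessCount = 0;     // bytes received but not yet processed
std::mutex bufMutex;
std::condition_variable dataReady, spaceFree;

void receiverLoop(SOCKET sock)
{
    for (;;) {
        std::unique_lock<std::mutex> lock(bufMutex);
        // Wait while the whole buffer is still being processed.
        spaceFree.wait(lock, [] { return toProcessCount < kBufSize; });
        // Largest contiguous free region starting at writePos.
        int chunk = std::min(kBufSize - writePos, kBufSize - toProcessCount);
        lock.unlock();

        int n = recv(sock, ringBuf + writePos, chunk, 0);
        if (n <= 0) return;      // 0 = peer closed, SOCKET_ERROR = error

        lock.lock();
        writePos = (writePos + n) % kBufSize;
        toProcessCount += n;
        dataReady.notify_one();  // tell the processing thread there is work
    }
}

void processorLoop()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(bufMutex);
        dataReady.wait(lock, [] { return toProcessCount > 0; });
        // Largest contiguous readable region starting at readPos.
        int n = std::min(toProcessCount, kBufSize - readPos);
        lock.unlock();

        // ... process ringBuf[readPos .. readPos + n) here ...

        lock.lock();
        readPos = (readPos + n) % kBufSize;
        toProcessCount -= n;
        spaceFree.notify_one();  // those bytes may now be reused by recv()
    }
}
```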
The thing you are referring to as random threads reading from a single socket might be achievable through the following scenario:
One thread enumerates the socket to see if there is data available; when data is available, it uses a mutex to wait if some thread is already in the reading state, then starts a new thread to read and process the existing data.
Or it can be achieved with something like this:
A thread does its recv(); as soon as it has done a successful recv() (yay, the data is in the buffer), it starts another thread from some thread pool to do the next recv(), then continues to process the data and ends itself.
These are the only ways I can imagine that "reading with multiple threads on a single socket" is achievable. And yes, there won't be multiple threads calling recv() at the same time.
Sorry for the long post and for the spelling and grammar errors; I hope this helps you a bit.

Ensure that the socket is properly bound to the handle you are using in the recv() function.
You cannot speed up data reception unless there is a channel to receive the data.

Related

Separating messages in a simple TCP echo server using Winsock DLL

Please consider a simple echo server using TCP and the Winsock DLL. The client application sends messages from multiple threads. The recv call on the server sometimes returns with multiple messages stored in the passed buffer. At this point, there's no chance for the server to know whether this is one huge message or multiple small messages.
I've read that one could use setsockopt in combination with the TCP_NODELAY option. But besides the fact that MSDN states this option is implemented for backward compatibility only, it doesn't even change the behavior described above.
Of course, I could introduce some kind of delimiter at the end of each message and split the message on the server side. But I don't think that's the way one should do it. So, what is the right way to do it?
Firstly, TCP_NODELAY is not the right way to do this... TCP is a byte-stream protocol, and any given connection only maintains the byte ordering, not necessarily the boundaries of any given send/write. It's inherently broken to rely on multiple threads that don't use any synchronisation being able to even keep the messages they want to send together on the stream. For example, say thread 1 wants to send the two-byte message "AB" and thread 2 wants to send "XY". Say thread 1 starts first and the output buffer only has room for one byte: send will enqueue "A" and let thread 1 know it's only sent one byte (so it should loop and retry, preferably after waiting for notification that the output queue has more space). Then thread 2 might get some or all of "XY" into the queue before thread 1 can get "B" in. These sorts of problems become more severe on slower connections and on slow, loaded machines (e.g. perhaps a low-powered phone that's playing video and multitasking while your app runs over 3G).
The ways to ensure the logical messages stay together over TCP include:
have a single sending thread that picks up messages sequentially from a shared queue (a mutex might be used to let the threads enqueue messages; see the sketch after this list)
contend for a lock (mutex) so that each thread's send can loop uninterrupted until a complete message is sent (this wouldn't suit some apps because any of the threads could be held up for quite a while doing comms work)
use a separate TCP connection per thread
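A minimal sketch of the first approach, assuming a connected Winsock SOCKET; enqueueMessage, sendAll and senderLoop are illustrative names, and error handling is trimmed:

```cpp
#include <winsock2.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

std::mutex queueMutex;
std::condition_variable queueCv;
std::queue<std::vector<char>> messageQueue;
bool shuttingDown = false;   // set under queueMutex to stop the sender

// Any worker thread may enqueue one complete logical message.
void enqueueMessage(std::vector<char> msg)
{
    std::lock_guard<std::mutex> lock(queueMutex);
    messageQueue.push(std::move(msg));
    queueCv.notify_one();
}

// Loop until every byte of one message is on the stream (send() may be partial).
bool sendAll(SOCKET sock, const char* data, int len)
{
    int total = 0;
    while (total < len) {
        int n = send(sock, data + total, len - total, 0);
        if (n == SOCKET_ERROR)
            return false;
        total += n;
    }
    return true;
}

// The single sender thread: messages hit the stream in queue order, whole.
void senderLoop(SOCKET sock)
{
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !messageQueue.empty() || shuttingDown; });
        if (messageQueue.empty())
            return;  // shutting down and fully drained
        std::vector<char> msg = std::move(messageQueue.front());
        messageQueue.pop();
        lock.unlock();  // don't hold the lock during the (possibly slow) send
        if (!sendAll(sock, msg.data(), (int)msg.size()))
            return;
    }
}
```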

C++: How to measure real upload rate on non-blocking sockets

I'm writing a program on Linux in C++, using non-blocking sockets with epoll and waiting for EPOLLOUT in order to call send() for some data.
My question is: I've read that on non-blocking mode the data is copied to the kernel's buffer, thus a send() call may return immediately indicating that all the data has been sent, where in reality it was only copied to the kernel's buffer.
How do I know when the data was actually sent and received by the remote peer, for knowing the real transfer rate?
Whether in non-blocking mode or not, send will return as soon as the data is copied into the kernel buffer. The difference between blocking and non-blocking mode appears when the buffer is full: in that case, blocking mode will suspend the current thread until the write takes place, while non-blocking mode will return immediately with EAGAIN or EWOULDBLOCK.
In a TCP connection, the kernel buffer is normally equal to the window size, so as soon as too much data remains unacknowledged, the connection blocks. This means that the sender is aware of how fast the remote end is receiving data.
With UDP it is a bit more complex, because there are no acknowledgements. Here only the receiving end is capable of measuring the true speed, since sent data may be lost en route.
In both the TCP and UDP cases, the kernel will not attempt to send data that the link layer is unable to process. The link layer can also throttle the data if the network is congested.
Getting back to your case: when using non-blocking sockets, you can measure the network speed provided you handle the EAGAIN or EWOULDBLOCK errors correctly (a sketch follows below). This is certainly true for TCP, where you send more data than the current window size (probably 64K or so), and you can get an idea of the link-layer speed with UDP sockets as well.
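For illustration, a minimal sketch of the EAGAIN handling, assuming a non-blocking TCP socket fd already registered with an epoll instance epfd (the names are illustrative):

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <cerrno>

// Try to queue `len` bytes; returns bytes accepted by the kernel, 0 if the
// buffer is full (EPOLLOUT armed), or -1 on a real error.
ssize_t trySend(int epfd, int fd, const char* data, size_t len)
{
    ssize_t n = send(fd, data, len, 0);
    if (n >= 0)
        return n;
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        // The kernel buffer (roughly the TCP window) is full: ask epoll to
        // report EPOLLOUT when there is room again instead of busy-looping.
        epoll_event ev{};
        ev.events = EPOLLIN | EPOLLOUT;
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
        return 0;
    }
    return -1;
}
```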
You can get the current amount of data in the kernel's socket buffers using an ioctl. This would allow you to check what has actually been sent. I'm not sure it matters that much, though; unless you have MASSIVE buffers and a tiny amount of data to send, it's probably not of interest.
Investigate the TIOCOUTQ/TIOCINQ ioctls on your socket fd.
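On Linux, that looks like the sketch below (SIOCOUTQ is the socket form of TIOCOUTQ; for TCP it reports bytes accepted by the kernel but not yet acknowledged by the peer):

```cpp
#include <sys/ioctl.h>
#include <linux/sockios.h>   // SIOCOUTQ / SIOCINQ

// Bytes still sitting in the socket's send queue, or -1 on error.
int unsentBytes(int fd)
{
    int pending = 0;
    if (ioctl(fd, SIOCOUTQ, &pending) < 0)
        return -1;
    return pending;
}
```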
My question is: I've read that on non-blocking mode the data is copied to the kernel's buffer
That happens in all modes, not just non-blocking mode. I suggest you review your reading matter.
thus a send() call may return immediately indicating that all the data has been sent, where in reality it was only copied to the kernel's buffer.
Again that is true in all modes.
How do I know when the data was actually sent and received by the remote peer, for knowing the real transfer rate?
When you've sent all the data, shut down the socket for output, then either set blocking mode and read, or keep selecting for 'readable'; in either case, read the EOS that should result. That functions as the peer's acknowledgement of the close. Then stop the timer.
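A minimal sketch of that timing approach, assuming a blocking, connected TCP socket and a peer that closes its end once it has consumed everything:

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>

// Returns the end-to-end rate in bytes/second, or -1.0 on error.
double timedSendAll(int fd, const char* data, size_t len)
{
    auto start = std::chrono::steady_clock::now();

    size_t total = 0;
    while (total < len) {
        ssize_t n = send(fd, data + total, len - total, 0);
        if (n < 0) return -1.0;
        total += (size_t)n;
    }

    shutdown(fd, SHUT_WR);           // tell the peer we're done sending

    char buf[256];
    while (read(fd, buf, sizeof buf) > 0)
        ;                            // drain until EOS (read returns 0)

    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return len / elapsed.count();    // bytes/second actually delivered
}
```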
send() merely puts data into the kernel's buffer and then exits, letting the kernel perform the actual transmission in the background, so all you can really do is measure the speed at which the kernel is accepting your outgoing data. You can't really measure the actual transmission speed unless the peer sends an acknowledgement for every buffer received (and there is no way to detect when TCP's own ACKs are received). But using the fact that send() can block when too much data is still in flight can help you figure out how fast your code is passing outgoing data to send().
send() tells you how many bytes were accepted, so it is very easy to calculate an approximate acceptance speed: divide the number of bytes accepted by the amount of time elapsed since the previous call to send(). When you call send() to send X bytes and Y bytes are accepted, record the time as time1; call send() again, and when another Y bytes are accepted, record the time as time2. Your code is then handing off data at roughly Y / (time2 - time1) bytes per millisecond, which you can scale to B/KB/MB/GB per ms/sec/min/hr as needed. Over the lifetime of the data transfer, that gives you a fairly good idea of your app's general transmission speed.
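A minimal sketch of that bookkeeping, wrapping send() in a small meter (the names are illustrative):

```cpp
#include <sys/socket.h>
#include <chrono>

struct SendMeter {
    using clock = std::chrono::steady_clock;
    clock::time_point start = clock::now();
    size_t totalAccepted = 0;

    // Forward to send() and record how many bytes the kernel accepted.
    ssize_t trackedSend(int fd, const char* data, size_t len) {
        ssize_t n = ::send(fd, data, len, 0);
        if (n > 0)
            totalAccepted += (size_t)n;
        return n;
    }

    // Average acceptance rate in bytes/second since the meter was created.
    double rate() const {
        std::chrono::duration<double> elapsed = clock::now() - start;
        return elapsed.count() > 0 ? totalAccepted / elapsed.count() : 0.0;
    }
};
```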

Multi-reader IPC solution?

I'm working on a framework in C++ (just for fun for now) that lets the user write plugins that use a standard API to stream data between each other. There are going to be three basic transport mechanisms for the data: files, sockets, and some kind of IPC piping system. The system is set up so that, for the non-file transports, each stream can have multiple readers, i.e. once a server socket is set up, multiple computers can connect and stream the data. I'm a little stuck on the multi-reader IPC system, though.
All my plugins run in threads (though I may want to go to a process-based system eventually), so they live in the same address space, and some kind of shared-memory system would work fine. I was thinking I'd write my own circular buffer with a write pointer and read pointers chasing it around the buffer, but I have my doubts that I can achieve the same performance as something like Linux pipes.
I'm curious what people would suggest for a multi-reader solution to something like this. Is the overhead for pipes or domain sockets low enough that I could just open a connection to each reader and issue separate writes to each reader? This is intended to handle significant volumes of data (tens of mega-samples/sec), so performance is a must.
I develop a media server, and I usually use a single reader for a group of all active sockets of the same class. You can use the select() function (in blocking or non-blocking mode) on each group to read the sockets that have become ready. When socket data is ready or a new connection occurs, I just call a notification callback function to handle it.
Each reader (controlling a group of sockets) can be managed by a separate thread, which keeps your main threads from blocking while waiting for new connections or socket data.
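A minimal sketch of one such reader thread, with onReadable standing in for the notification callback mentioned above:

```cpp
#include <sys/select.h>
#include <vector>

void readerLoop(const std::vector<int>& sockets, void (*onReadable)(int))
{
    for (;;) {
        fd_set readSet;
        FD_ZERO(&readSet);
        int maxFd = -1;
        for (int fd : sockets) {        // select() mutates the set, so
            FD_SET(fd, &readSet);       // rebuild it on every iteration
            if (fd > maxFd) maxFd = fd;
        }
        // Block until at least one socket in the group has data.
        if (select(maxFd + 1, &readSet, nullptr, nullptr, nullptr) <= 0)
            continue;
        for (int fd : sockets)
            if (FD_ISSET(fd, &readSet))
                onReadable(fd);         // notify: this socket is ready to recv()
    }
}
```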
If I understand the description correctly, it seems to me that using a circular queue as you mention would be a good IPC solution. I think it could scale very well and would ultimately be better than individual pipes or individual shared memory for each client. One (of several) of the issues of using a single queue/buffer for multiple clients is to synchronize access to the buffers. A client needs to be able to successfully read an entry in the queue without the server changing it. Here is a possible mechanism for implementing that.
This requires that the server know how many active clients there are. That, I assume, would be possible as long as the clients are doing some kind of registration/login with the server (almost certainly true if they are in-process but not necessarily true for out-of-process clients).
Suppose there are N clients. For this example, assume 100 active clients.
Maintain two counting semaphores for each entry in the circular queue. If using out-of-process clients, these need to be shared between processes. Call the semaphores SemReady and SemDone.
Use SemReady to indicate that the buffer is ready for clients to read. The server writes to the buffer entry and then sets the value of the semaphore to the number of clients (100 in this case). More on this in a bit.
When a client wants to read an entry in the queue, it waits on the associated SemReady semaphore. If the initial value is at 100, then all 100 clients can successfully get the semaphore and “concurrently” read the data.
When a client is done reading/using the entry, it increments/releases the SemDone semaphore.
When a server wants to write to a buffer entry, it needs to make sure of two things: a) no clients are currently reading it, and b) no clients start to read it once the server is writing to it.
Therefore, first, block any further access to the buffer by repeatedly waiting on the SemReady semaphore until its count is zero (obviously, use a zero timeout). When it hits zero, the server knows that no additional clients will start reading it.
To know that clients are done with the buffer, the server uses the SemDone semaphore. It checks SemDone and waits until its value is at N minus the number of waits it did on SemReady. In other words, if SemReady was already at zero, then all clients read the buffer entry, and SemDone should therefore be at N (100) when they are done. If, though, the server waited 10 times on SemReady, then SemDone should be at 90 (N-10) when all clients are done.
The above step needs some kind of timeout and status check on client “liveness” in case a client crashes/quits after getting SemReady and before releasing SemDone. It would also need to account for the possibility of new clients registering during that step, in order to keep the semaphore count values in sync.
Once the server has found no more clients are reading the buffer, it can reset SemDone to zero, write new data to the entry, and set SemReady to N (100).
Rinse and repeat. (A minimal sketch of this handshake follows after the notes below.)
Note 1 There are other synchronization issues in maintaining the head/tail of the circular queue so that clients know where it is.
Note 2 SemDone could probably be an integer counter handled with atomic increments… I think it could anyway. Needs a bit of thought.
Note 3 It might make sense to have multiple threads in the server writing to the buffer entries. That way, if the server has to wait/timeout a bit on a crashed client that started reading but did not finish, it would not block subsequent queue entries that other clients might already be waiting for.
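A minimal sketch of the SemReady/SemDone handshake for a single queue entry, assuming in-process clients, a fixed client count N, and C++20 std::counting_semaphore; the liveness timeouts and registration changes described above are omitted:

```cpp
#include <semaphore>
#include <vector>

constexpr int N = 100;                        // number of registered clients

struct Entry {
    std::vector<char> data;
    std::counting_semaphore<N> semReady{0};   // permits: clients allowed to read
    std::counting_semaphore<N> semDone{N};    // starts "all done": no readers yet
};

// Server side: wait out the readers of the previous round, then publish.
void serverWrite(Entry& entry, std::vector<char> newData)
{
    // a) Drain leftover read permits so no new client starts reading,
    //    counting how many permits we removed.
    int drained = 0;
    while (entry.semReady.try_acquire())
        ++drained;
    // b) Wait until every client that did start reading has finished:
    //    SemDone must deliver N minus the permits we drained.
    for (int i = 0; i < N - drained; ++i)
        entry.semDone.acquire();
    // The entry is now exclusively ours: write and re-open it for N readers.
    entry.data = std::move(newData);
    entry.semReady.release(N);
}

// Client side: each client reads each published entry at most once.
void clientRead(Entry& entry)
{
    entry.semReady.acquire();   // take one read permit
    // ... read entry.data; the server won't touch it while permits are out ...
    entry.semDone.release();    // tell the server we're done
}
```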

How can I slow down a TCP connection on Windows?

I am developing a Windows proxy program where two TCP sockets, connected through different adapters, are bridged by my program. That is, my program reads from one socket and writes to the other, and vice versa. Each socket is handled by its own thread. When one socket reads data, it is queued for the other socket to write. The problem I have is the case where one link runs at 100Mb and the other runs at 10Mb: I read data from the 100Mb link faster than I can write it to the 10Mb link. How can I "slow down" the faster connection so that it is essentially running at the slower link speed? Changing the faster link to a slower speed is not an option. --Thanks
Create a fixed-length queue between the reading and writing threads. Block on enqueue when the queue is full and on dequeue when it's empty. A regular semaphore or a mutex/condition variable should work (see the sketch below). Play with the queue size so that the slower thread is always busy.
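A minimal sketch of such a queue using a mutex and two condition variables; Chunk is a stand-in for whatever unit of data the reading thread hands to the writing thread:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

using Chunk = std::vector<char>;

class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}

    // Reader thread: blocks while the queue is full, which is exactly what
    // throttles the fast 100Mb side down to the 10Mb side's pace.
    void enqueue(Chunk chunk) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return queue_.size() < capacity_; });
        queue_.push(std::move(chunk));
        notEmpty_.notify_one();
    }

    // Writer thread: blocks while the queue is empty.
    Chunk dequeue() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !queue_.empty(); });
        Chunk chunk = std::move(queue_.front());
        queue_.pop();
        notFull_.notify_one();
        return chunk;
    }

private:
    size_t capacity_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    std::queue<Chunk> queue_;
};
```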
If this is a problem, then you're writing your program incorrectly.
You can't put more than 10Mb/s onto a 10Mb/s link, so the thread that writes to the slower link should start to block as you write. As long as that thread uses the same size read buffer as write buffer, it will only consume data as quickly as it can push it back out the 10Mb/s pipe. Any flow control needed to keep the remote sender from putting more than 10Mb/s into the 100Mb/s pipe will be taken care of automatically by the TCP protocol.
So it just shouldn't be an issue as long as your read and write buffers are the same size in that thread (or any thread).
Stop reading the data when you are not able to write it.
There is a queue of bytes coming into your program from the 100Mb/s link, and a queue out of your program to the 10Mb/s link. When the outgoing queue is full, stop reading from the incoming queue, and TCP will throttle back the client on the 100Mb/s link.
You can use an internal queue between the reader and the writer to implement this cleanly.
A lot of complicated - and correct - solutions have been expounded. But really, to get to the crux of the matter: why do you have two threads? If you did the socket-100 read and socket-10 write in a single thread, it would naturally block on the write and you wouldn't have to design anything complicated.
If you are doing a non-blocking, select()-style event loop: only call FD_SET(readSocket, &readSet) if your outgoing-data queue is smaller than some hard-coded maximum size.
That way, when the outgoing socket falls behind, your proxy will stop reading data from the faster client until it catches back up. The TCP protocol will take care of the rest (in particular, it will tell your faster client to slow down for a while). A sketch of this follows below.
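A minimal sketch of that conditional FD_SET loop, with kMaxQueued as an arbitrary illustrative cap and error handling trimmed:

```cpp
#define NOMINMAX
#include <winsock2.h>
#include <algorithm>
#include <deque>
#include <vector>

const size_t kMaxQueued = 64 * 1024;   // assumed cap on buffered bytes

void proxyLoop(SOCKET fastSock, SOCKET slowSock)
{
    std::deque<char> outQueue;  // bytes read from fastSock, awaiting slowSock
    for (;;) {
        fd_set readSet, writeSet;
        FD_ZERO(&readSet);
        FD_ZERO(&writeSet);
        // The key line: only poll the fast side for reading while the queue
        // has room, so TCP flow control pushes back on the fast sender.
        if (outQueue.size() < kMaxQueued)
            FD_SET(fastSock, &readSet);
        if (!outQueue.empty())
            FD_SET(slowSock, &writeSet);

        // The first select() argument is ignored on Windows.
        if (select(0, &readSet, &writeSet, nullptr, nullptr) <= 0)
            continue;

        if (FD_ISSET(fastSock, &readSet)) {
            char buf[4096];
            int n = recv(fastSock, buf, sizeof buf, 0);
            if (n <= 0) return;
            outQueue.insert(outQueue.end(), buf, buf + n);
        }
        if (FD_ISSET(slowSock, &writeSet)) {
            int take = (int)std::min<size_t>(outQueue.size(), 4096);
            std::vector<char> chunk(outQueue.begin(), outQueue.begin() + take);
            int n = send(slowSock, chunk.data(), (int)chunk.size(), 0);
            if (n <= 0) return;
            outQueue.erase(outQueue.begin(), outQueue.begin() + n);
        }
    }
}
```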

How to cope with high frequency data?

I have a C++ application which receives stock data and forwards it to another application via a socket (acting as a server).
Actually, the WSASend function returns with error code 10055 after a few seconds, and I found that this is the error message:
"No buffer space available. An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full".
The problem arises only when I run the application after market hours, as we then receive the whole day's data (approximately 130 MB) in a few minutes (I assume this is relatively large).
I am doing this as a robustness test.
I tried to increase the send buffer SO_SNDBUF using the setsockopt function, but the same problem is still there.
How can I solve this problem? Is it related to the receiver's buffer?
Sending details:
For each complete message, I call the send method, which uses overlapped sockets.
EDIT:
Can someone give general guidelines for handling high-frequency data in C++?
The flow control of TCP will cause the internal send buffer to fill up if the receiver is not processing its end of the socket fast enough. From the error message, it would seem that you are sending data without regard for how quickly the Winsock stack can process it. It would be helpful if you could state exactly how you are sending the data: are you waiting for all the data to arrive and then sending one big block, or sending it piecemeal?
Are you sending via a non-blocking or an overlapped socket? In either case, after each send you should probably wait for a notification that the socket is in a state where it can send more data: either select()/WaitForMultipleObjects() indicates it can (for non-blocking sockets), or the overlapped I/O completes, signalling that the data has been successfully copied to the socket's internal send buffers.
You can overlap sends, i.e. queue up more than one buffer at a time - that's what overlapped I/O is for - but you need to pay careful regard to the memory implications of locking large numbers of pages and potentially exhausting the non-paged pool.
Nick's answer pretty much hits the nail on the head; you're most likely exhausting the 'locked pages limit' by starting too many overlapped sends at once. Ideally you need to buffer your data in your own memory buffers and only have a set number of overlapped sends pending at any one time. I talk about how my IOCP framework allows you to deal with this kind of situation here: http://www.lenholgate.com/blog/2008/07/write-completion-flow-control.html, and about the related TCP receive-window flow-control issues here: http://www.lenholgate.com/blog/2008/06/data-distribution-servers.html and here: http://www.serverframework.com/asynchronousevents/2011/06/tcp-flow-control-and-asynchronous-writes.html.
My preferred solution is to allow a configurable number of pending overlapped sends at any one time and, once this limit is exceeded, to start buffering data, then use the completion of the pending overlapped sends to drive the sending of the buffered data. This allows you to strictly control the amount of non-paged pool and the number of 'locked pages' used, and makes it possible to have lots of connections sending as fast as possible yet still control the resources that they use. A simplified sketch of this pattern follows.
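A simplified, hypothetical sketch of that pattern (the names and structure are illustrative, not the framework from the links above): allow at most kMaxPendingSends overlapped WSASend calls in flight per connection, buffer everything beyond that, and drain the buffer from the completion handler.

```cpp
#include <winsock2.h>
#include <deque>
#include <mutex>
#include <vector>

const int kMaxPendingSends = 4;   // assumed per-connection in-flight limit

struct Connection {
    SOCKET sock;
    std::mutex mutex;
    int pendingSends = 0;                   // overlapped sends in flight
    std::deque<std::vector<char>> backlog;  // data waiting for a free slot
};

struct SendOp {
    OVERLAPPED ov{};
    std::vector<char> data;
    WSABUF buf{};
};

// Start one overlapped send; completion arrives via the IOCP the socket is
// associated with, whose handler should call onSendComplete() and delete op.
void issueSend(Connection& c, std::vector<char> data)
{
    SendOp* op = new SendOp;
    op->data = std::move(data);
    op->buf.buf = op->data.data();
    op->buf.len = (ULONG)op->data.size();
    WSASend(c.sock, &op->buf, 1, nullptr, 0, &op->ov, nullptr);
}

void sendOrBuffer(Connection& c, std::vector<char> data)
{
    std::lock_guard<std::mutex> lock(c.mutex);
    if (c.pendingSends < kMaxPendingSends) {
        ++c.pendingSends;                      // claim a slot; pages are
        issueSend(c, std::move(data));         // locked only for real sends
    } else {
        c.backlog.push_back(std::move(data));  // our memory, not locked pages
    }
}

// Called from the I/O completion handler when one overlapped send finishes.
void onSendComplete(Connection& c)
{
    std::lock_guard<std::mutex> lock(c.mutex);
    if (!c.backlog.empty()) {
        // Reuse the slot we just freed to push the next buffered chunk.
        std::vector<char> next = std::move(c.backlog.front());
        c.backlog.pop_front();
        issueSend(c, std::move(next));
    } else {
        --c.pendingSends;                      // the slot becomes free
    }
}
```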