libusb_interrupt_transfer LIBUSB_ERROR_TIMEOUT - c++

I have a general design question, the final software will eventually run on Linux and Windows...
I am trying to read 8 bytes on an endpoint with libusb_interrupt_transfer, and LIBUSB_ERROR_TIMEOUT can occur in the middle of data being received. Will the data be broken up? The docs warn against specifying anything other than the actual endpoint data size for the 'length' parameter of the data to be received, since that can lead to a buffer overrun. The docs also say that if a timeout occurs you should check the 'transferred' variable, because not all the data may have been received. Those two things being true, how am I supposed to deal with partially received data? If LIBUSB_ERROR_TIMEOUT occurs and my packet is only 8 bytes, will all 8 bytes always be received? Am I supposed to always supply an 8-byte buffer, even if I am only requesting the next 2 bytes to complete a previously timed-out read request? And if I do supply that 8-byte buffer and only request 2 bytes, is it possible that I end up with 6 bytes of the next incoming data packet, even though I only requested 2 bytes? Any info is greatly appreciated.
http://libusb.sourceforge.net/api-1.0/group__syncio.html#gac412bda21b7ecf57e4c76877d78e6486
The libusb docs state: "Also check transferred when dealing with a timeout error code. libusb may have to split your transfer into a number of chunks to satisfy underlying O/S requirements"
http://libusb.sourceforge.net/api-1.0/packetoverflow.html
docs states: "When requesting data on a bulk endpoint, libusb requires you to supply a buffer and the maximum number of bytes of data that libusb can put in that buffer. However, the size of the buffer is not communicated to the device - the device is just asked to send any amount of data."
Then it also states: "Overflows can only happen if the final packet in an incoming data transfer is smaller than the actual packet that the device wants to transfer. Therefore, you will never see an overflow if your transfer buffer size is a multiple of the endpoint's packet size: the final packet will either fill up completely or will be only partially filled."
unsigned char data[8];
int timeout = 250; //timeout in milliseconds
int xmtcnt = 0;
int rcvcnt = 0;
//EP OUT (Send data to USB Device)
//0x02 = Endpoint Type 0x00 + Endpoint Number 2
r = libusb_interrupt_transfer(devh,0x02, data, sizeof(data), &xmtcnt, timeout);
if(r != 0 || xmtcnt != 8){printf("XMT libusb_interrupt_transfer error %d\n",r); goto out_release;}
//EP IN (Recv data from USB device)
//0x81 = Endpoint Type 0x80 + Endpoint Number 1
//-----IS IT POSSIBLE TO RECEIVE LESS THAN 8 BYTES IF WE TIMEOUT?----
r = libusb_interrupt_transfer(devh,0x81, data, sizeof(data), &rcvcnt, timeout);
if(r != 0 || rcvcnt != 8){printf("RCV libusb_interrupt_transfer error %d\n",r); goto out_release;}
//show data received
CONSOLE("data: %d %d %d %d %d %d %d %d xmt:%d rcv:%d\n",data[0],data[1],data[2],data[3],data[4],data[5],data[6],data[7],xmtcnt,rcvcnt);

Related

rte_eth_tx_burst() descriptor/mbuf management guarantees vs. free thresholds

The rte_eth_tx_burst() function is documented as:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
I have a small test program where this doesn't seem to hold true (when using the ixgbe driver on a vfio X553 1GbE NIC).
So my program sets up one transmit queue like this:
uint16_t tx_ring_size = 1024-32;
rte_eth_dev_configure(port_id, 0, 1, &port_conf);
r = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &rx_ring_size, &tx_ring_size);
struct rte_eth_txconf txconf = dev_info.default_txconf;
r = rte_eth_tx_queue_setup(port_id, 0, tx_ring_size,
rte_eth_dev_socket_id(port_id), &txconf);
The transmit mbuf packet pool is created like this:
struct rte_mempool *pkt_pool = rte_pktmbuf_pool_create("pkt_pool", 1023, 341, 0,
RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
That way, when sending packets, I run out of TX descriptors before I run out of packet buffers (the program generates packets with just one segment).
My expectation is that when I call rte_eth_tx_burst() in a loop (to send one packet after another) that it never fails since it transparently frees mbufs of already sent packets.
However, this doesn't happen.
I basically have a transmit loop like this:
for (unsigned i = 0; i < 2048; ++i) {
struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
// error check, prepare packet etc.
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
// error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
I use the default threshold values, i.e. the queried values are (from dev_info.default_txconf):
tx thresh : 32
tx rs thresh: 32
wthresh : 0
So the main question now is: How hard is rte_eth_tx_burst() supposed to try to free mbuf buffers (and thus descriptors)?
I mean, it could busy loop until the transmission of previously supplied mbufs is completed.
Or it could just quickly check if some descriptors are free again. But if not, just give up.
Related question: Are the default threshold values appropriate for this use case?
So I work around it like this:
for (;;) {
    uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
    if (l == 1) {
        break;
    } else {
        RTE_LOG(ERR, USER1, "cannot send packet\n");
        int r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
        if (r < 0) {
            rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
        }
        RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
    }
}
With that I get output like this:
USER1: cannot send packet
USER1: 1086. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1118. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1150. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 1182. cleaned up 0 descriptors ...
[..]
USER1: cannot send packet
USER1: 1950. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1982. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 2046. cleaned up 32 descriptors ...
Meaning that it frees at most 32 descriptors at a time, and that the cleanup doesn't always succeed, but a subsequent call then frees some.
Side question: Is there a better, more DPDK-idiomatic way to handle the recycling of mbufs?
When I change the code such that I run out of mbufs before I run out of transmit descriptors (i.e. tx ring created with 1024 descriptors, mbuf pool still has 1023 elements), I have to change the alloc part like this:
struct rte_mbuf *pkt;
do {
    pkt = rte_pktmbuf_alloc(args.pkt_pool);
    if (!pkt) {
        r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
        if (r < 0) {
            rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
        }
        RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
    }
} while (!pkt);
The output is similar, e.g.:
USER1: 1023. cleaned up 95 descriptors ...
USER1: 1118. cleaned up 32 descriptors ...
USER1: 1150. cleaned up 32 descriptors ...
USER1: 1182. cleaned up 32 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 32 descriptors ...
[..]
That means the freeing of descriptors/mbufs is so 'slow' that it has to busy loop up to 3 times.
Again, is this a valid approach, or are there better dpdk ways to solve this?
Since rte_eth_tx_done_cleanup() might return -ENOTSUP, this may be a hint that my usage of it isn't the best solution.
Incidentally, even with the ixgbe driver it fails for me when I disable checksum offloads!
Apparently, ixgbe_dev_tx_done_cleanup() then invokes ixgbe_tx_done_cleanup_vec() instead of ixgbe_tx_done_cleanup_full(), and the former unconditionally returns -ENOTSUP:
static int
ixgbe_tx_done_cleanup_vec(struct ixgbe_tx_queue *txq __rte_unused,
			  uint32_t free_cnt __rte_unused)
{
	return -ENOTSUP;
}
Does this make sense?
So then perhaps the better strategy is to make sure that there are fewer descriptors than pool elements (e.g. 1024-32 < 1023) and just re-call rte_eth_tx_burst() until it returns 1?
That means like this:
for (;;) {
    uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
    if (l == 1) {
        break;
    } else {
        RTE_LOG(ERR, USER1, "%u. cannot send packet - retry\n", i);
    }
}
This works, and the output shows again that the descriptors are freed 32 at a time, e.g.:
USER1: 1951. cannot send packet - retry
USER1: 1951. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2047. cannot send packet - retry
USER1: 2047. cannot send packet - retry
I know that I also can use rte_eth_tx_burst() to submit bigger bursts. But I want to get the simple/edge cases right and understand the dpdk semantics, first.
I'm on Fedora 33 and DPDK 20.11.2.
Recommendation/solution: after confirming (with either rte_mempool_list_dump or dpdk-procinfo) that the cause of the issue is indeed the TX descriptors, use rte_eth_tx_buffer_flush or change the TX threshold settings.
Explanation:
The mbuf-free behaviour varies across PMDs, and even within the same NIC it differs between PF and VF. The following points help to understand it properly (a sketch of requesting the fast-free offload follows this list):
rte_mempool can be created with or without cache elements.
When created with cache elements, the configured mbufs are added to the per-core cache, depending on the available lcores (EAL options) and the number of cache elements per core.
When the HW offload DEV_TX_OFFLOAD_MBUF_FAST_FREE is available and enabled, the contract is that every mbuf has a ref_cnt of 1.
So whenever tx_burst is invoked (whether it succeeds or fails), the threshold levels are checked to see whether free mbufs/mbuf segments can be pushed back to the pool.
With DEV_TX_OFFLOAD_MBUF_FAST_FREE enabled, the driver blindly puts the elements into the lcore cache.
Without DEV_TX_OFFLOAD_MBUF_FAST_FREE, the generic approach is used: each mbuf is validated (nb_segs and ref_cnt are checked) and then pushed back to the mempool.
In either case, either a fixed number of mbufs (32 is, I believe, the default for all PMDs) or whatever free mbufs are available is pushed back to the cache or pool.
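For illustration, this is roughly how the offload is requested at configure time when the PMD advertises it (a sketch against DPDK 20.11; port_id and tx_ring_size are taken from the question's snippet):
#include <rte_ethdev.h>

struct rte_eth_dev_info dev_info;
struct rte_eth_conf port_conf = { 0 };

rte_eth_dev_info_get(port_id, &dev_info);

/* Only request the offload if the PMD actually advertises it. */
if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
    port_conf.txmode.offloads |= DEV_TX_OFFLOAD_MBUF_FAST_FREE;

rte_eth_dev_configure(port_id, 0, 1, &port_conf);

/* The per-queue offload flags should mirror the port-level ones. */
struct rte_eth_txconf txconf = dev_info.default_txconf;
txconf.offloads = port_conf.txmode.offloads;
rte_eth_tx_queue_setup(port_id, 0, tx_ring_size,
                       rte_eth_dev_socket_id(port_id), &txconf);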
Facts:
In the case of the ixgbe VF driver, DEV_TX_OFFLOAD_MBUF_FAST_FREE is not available. This means that whenever the thresholds are met, each individual mbuf is checked and pushed back to the mempool.
As per the code snippet, rte_eth_dev_configure is configured with a TX queue only, and the pool created by rte_pktmbuf_pool_create has 341 elements in its cache.
The assumption is that there is only one lcore (which runs the alloc-and-tx loop).
Code Snippet-1:
for (unsigned i = 0; i < 2048; ++i) {
struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
// error check, prepare packet etc.
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
// error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
[Observation] If the mbufs were really running out, rte_pktmbuf_alloc should fail before rte_eth_tx_burst. But failing at 1086 is an interesting phenomenon, because the total number of mbufs created is 1023, and the failure happens after about 2 iterations of 32 mbufs being released back to the mempool. Analyzing the ixgbe driver code, the only place where tx_xmit_pkts returns 0 is:
/* Only use descriptors that are available */
nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
if (unlikely(nb_pkts == 0))
return 0;
Even though tx_ring_size is set to 992 in the configuration, rte_eth_dev_adjust_nb_rx_tx_desc internally raises it to the maximum of *nb_desc and desc_lim->nb_min. Based on the code, the failure is not because there are no free mbufs, but because the number of free TX descriptors is low or zero.
In all other cases, rte_eth_tx_done_cleanup or rte_eth_tx_buffer_flush actually pushes any pending descriptors out of the SW PMD to be DMAed immediately. This internally frees up more descriptors, which makes tx_burst much smoother.
To identify the root cause whenever the DPDK tx_burst API returns 0, either:
invoke rte_mempool_list_dump, or
make use of the mempool dump via dpdk-procinfo.
Note: most PMDs amortize the cost of descriptor (PCIe payload) writes by batching at least 4 packets at a time (in the SSE case). Hence, even if tx_burst returns 1 for a single packet, that packet may not yet have been pushed out of the NIC. To make sure it is, use rte_eth_tx_buffer_flush.
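For reference, the rte_eth_tx_buffer API mentioned in the recommendation is used roughly like this (a sketch, not the poster's code; port_id and pkt_pool are assumed from the question's snippet, and by default packets that cannot be sent at flush time are dropped and freed by the built-in error callback):
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_malloc.h>

#define BURST 32

/* Allocate and initialise a TX buffer able to hold one burst of packets. */
struct rte_eth_dev_tx_buffer *txb =
    rte_zmalloc_socket("tx_buffer", RTE_ETH_TX_BUFFER_SIZE(BURST),
                       0, rte_eth_dev_socket_id(port_id));
rte_eth_tx_buffer_init(txb, BURST);

for (unsigned i = 0; i < 2048; ++i) {
    struct rte_mbuf *pkt = rte_pktmbuf_alloc(pkt_pool);
    /* ... error check, prepare packet etc. ... */
    /* Queue the mbuf; a full burst is transmitted automatically once BURST
     * packets have accumulated. Returns how many were sent by this call. */
    rte_eth_tx_buffer(port_id, 0, txb, pkt);
}

/* Push out whatever is still sitting in the software buffer. */
rte_eth_tx_buffer_flush(port_id, 0, txb);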
Say you invoke rte_eth_tx_burst() to send one small packet (single mbuf, no offloads). Suppose the driver indeed pushes the packet to the HW. Doing so eats up one descriptor in the ring: the driver "remembers" that this packet mbuf is associated with that descriptor. But the packet is not sent instantly. The HW typically has some means to notify the driver of completions. Just imagine: if the driver checked for completions on every rte_eth_tx_burst() invocation (thus ignoring any thresholds), then calling rte_eth_tx_burst() one more time, in a tight loop, for another packet would likely consume one more descriptor rather than recycle the first one. So, given this fact, I would not use a tight loop when investigating tx_free_thresh semantics. And it shouldn't matter whether you invoke rte_eth_tx_burst() once per packet or once per batch of them.
Now. Say, you have a Tx ring of size N. Suppose, tx_free_thresh is M. And you have a mempool of size Z. What you do is allocate a burst of N - M - 1 small packets and invoke rte_eth_tx_burst() to send this burst (no offloads; each packet is assumed to eat up one Tx descriptor). Then you wait for some wittingly sufficient (for completions) amount of time and check the number of free objects in the mempool. This figure should read Z - (N - M - 1). Then you allocate and send one extra packet. Then wait again. This time, the number of spare objects in the mempool should read Z - (N - M). Finally, you allocate and send one more packet (again!) thus crossing the threshold (the number of spare Tx descriptors becomes less than M). During this invocation of rte_eth_tx_burst(), the driver should detect crossing the threshold and start checking for completions. This should make the driver free (N - M) descriptors (consumed by two previous rte_eth_tx_burst() invocations) thus clearing up the whole ring. Then the driver proceeds to push the new packet in question to the HW thus spending one descriptor. You then check the mempool: this should report Z - 1 free objects.
So, the short of it: no loop, just three rte_eth_tx_burst() invocations with sufficient waiting time between them. And you check the spare object count in the mempool after each send operation. Theoretically, this way, you'll be able to understand the corner case semantics. That's the gist of it. However, please keep in mind that the actual behaviour may vary across different vendors / PMDs.
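For concreteness, a sketch of that experiment (send_burst() is a hypothetical helper, and the N/M/Z values are only examples matching the question's setup; rte_mempool_avail_count() reads the spare-object figure):
#include <rte_cycles.h>
#include <rte_mempool.h>
#include <stdio.h>

/* N = Tx ring size, M = tx_free_thresh, Z = mempool size (assumed values). */
enum { N = 992, M = 32, Z = 1023 };

/* Hypothetical helper: allocates 'n' mbufs from 'pool', prepares them and
 * passes them to rte_eth_tx_burst() on queue 0 of 'port_id'. */
extern void send_burst(uint16_t port_id, struct rte_mempool *pool, uint16_t n);

send_burst(port_id, pool, N - M - 1);
rte_delay_ms(100);                /* wait long enough for completions */
printf("spare mbufs: %u (expect Z - (N - M - 1) = %d)\n",
       rte_mempool_avail_count(pool), Z - (N - M - 1));

send_burst(port_id, pool, 1);
rte_delay_ms(100);
printf("spare mbufs: %u (expect Z - (N - M) = %d)\n",
       rte_mempool_avail_count(pool), Z - (N - M));

send_burst(port_id, pool, 1);     /* this one crosses tx_free_thresh */
rte_delay_ms(100);
printf("spare mbufs: %u (expect Z - 1 = %d)\n",
       rte_mempool_avail_count(pool), Z - 1);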
Relying on rte_eth_tx_done_cleanup() really isn't an option since many PMDs don't implement it. Mostly the Intel PMDs provide it, but e.g. the SFC, MLX* and af_packet ones don't.
However, it's still unclear why the ixgbe PMD doesn't support cleanup when no offloads are enabled.
The requirements on rte_eth_tx_burst() with respect to freeing are really light - from the API docs:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
[..]
* @return
* The number of output packets actually stored in transmit descriptors of
* the transmit ring. The return value can be less than the value of the
* *tx_pkts* parameter when the transmit ring is full or has been filled up.
So just attempting to free (but not waiting on the results of that attempt) and returning 0 (since 0 is less than tx_pkts) is covered by that 'contract'.
FWIW, no example distributed with DPDK loops around rte_eth_tx_burst() to re-submit not-yet-sent packets. There are some examples that use rte_eth_tx_burst() and discard unsent packets, though.
AFAICS, besides rte_eth_tx_done_cleanup() and rte_eth_tx_burst() there is no other function for requesting the release of mbufs previously submitted for transmission.
Thus, it's advisable to size the mbuf packet pool larger than the configured ring size in order to survive situations where all mbufs are inflight and can't be recovered because there is no mbuf left for calling rte_eth_tx_burst() again.

C++ nonblocking sockets - wait for all recv data

I wasn't running into this problem on my local system (of course), but now that I am setting up a virtual server, I am having some issues with a part of my code.
In order to receive all data from a nonblocking TCP recv(), I have this function
ssize_t Server::recvAll(int sockfd, const void *buf, size_t len, int flags) {
    // just showing here that they are non-blocking sockets
    u_long iMode = 1;
    ioctlsocket(sockfd, FIONBIO, &iMode);
    ssize_t result;
    char *pbuf = (char *)buf;
    while (len > 0) {
        result = recv(sockfd, pbuf, len, flags);
        printf("\tRES: %d", result);
        if (result <= 0) break;
        pbuf += result;
        len -= result;
    }
    return result;
}
I noticed that recvAll will usually print RES: 1024 (1024 being the amount of bytes I'm sending) and it works great. But less frequently, there is data loss and it prints only RES: 400 (where 400 is some number greater than 0 and less than 1024) and my code does not work, as it expects all 1024 bytes.
I tried also printing WSAGetLastError() and also running in debug, but it looks like the program runs slow enough due to the print/debug that I don't come across this issue.
I assume this function works great for blocking sockets, but not non-blocking sockets.
Any suggestions on measurements I can take to make sure that I do receive all 1024 bytes without data loss on non-blocking sockets?
If you use non-blocking mode, then you read only the data that has already arrived at your system. Once you have read all of it, recv returns an error whose code depends on the system:
EWOULDBLOCK (on POSIX systems)
WSAEWOULDBLOCK (on Windows sockets)
Once you get this error you need to wait for more data to arrive. You can do that in several ways:
Wait with a special function like select/poll/epoll
Sleep for some time and try to recv again (user-space polling)
If you need to reduce delay, select/poll/epoll is preferable. Sleep is simpler to implement.
Also you need to consider that TCP is a stream protocol and does NOT preserve framing. This means that you can send, for example, 256 bytes and then another 256 bytes, but receive all 512 bytes in a single read. The opposite is also true: you may send 512 bytes at once and receive 256 bytes on the first read and the other 256 bytes on the next read.
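Putting both points together, a sketch of the select-based approach (Winsock flavour, to match the question's ioctlsocket(); the function below is an illustration, not a drop-in fix for the posted recvAll):
#include <winsock2.h>

// Sketch: read exactly 'len' bytes from a non-blocking socket, waiting in
// select() whenever the receive buffer is momentarily empty.
int recvExactly(SOCKET sockfd, void *buf, int len, int flags)
{
    char *pbuf = static_cast<char *>(buf);
    int total = 0;

    while (total < len) {
        int n = recv(sockfd, pbuf + total, len - total, flags);
        if (n > 0) {
            total += n;                     // got some bytes, keep going
        } else if (n == 0) {
            return total;                   // peer closed the connection
        } else if (WSAGetLastError() == WSAEWOULDBLOCK) {
            // Nothing buffered yet: block in select() until readable.
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(sockfd, &rfds);
            if (select(0, &rfds, NULL, NULL, NULL) < 0)
                return -1;                  // select failed
        } else {
            return -1;                      // real socket error
        }
    }
    return total;
}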

Poco WebSocket can't read large data

I use Net::SocketReactor to process connections. When data arrives on the socket, something like the following code is called:
int WebSocketWrapper::DoRecieve(void *buf) {
    try {
        int flags;
        const auto size = m_sock.available();
        const auto ret = m_sock.receiveFrame(buf, size, flags);
        if (size != ret) {
            logger.warning("Read less than available");
        }
        return ret;
    }
    catch (WebSocketException& exc) {
        logger.log(exc);
        switch (exc.code()) {
        case pnet::WebSocket::WS_ERR_HANDSHAKE_UNSUPPORTED_VERSION:
            logger.debug("unsupported version");
            break;
        // fallthrough
        case pnet::WebSocket::WS_ERR_NO_HANDSHAKE:
        case pnet::WebSocket::WS_ERR_HANDSHAKE_NO_VERSION:
        case pnet::WebSocket::WS_ERR_HANDSHAKE_NO_KEY:
            logger.debug("Bad request");
            break;
        }
    }
    return 0;
}
It works well when the data size is less than 1400 bytes and the TCP packets are not fragmented. But when I try to send more than 1400 bytes I get a WebSocketException: "Insufficient buffer for payload size". I explored the Poco::Net::WebSocket source code and found the conflict: WebSocket::receiveFrame determines the frame size from the frame header, but I only have part of the frame, and I can only request as many bytes as StreamSocket::available() returns.
How do I read large data from the WebSocket?
WebSockets operate in frames and you will always receive a frame or nothing. With that said, don't bother figuring out the amount of available data (you're probably hitting the Ethernet 1500-byte MTU); instead, provide storage to accommodate the largest frame you expect to receive and call receiveFrame(). If a message is fragmented across multiple frames, you'll have to deal with that at the application level. See the documentation:
Receives a frame from the socket and stores it
in buffer. Up to length bytes are received. If
the frame's payload is larger, a WebSocketException
is thrown and the WebSocket connection must be
terminated.
The upcoming 1.7 release will have a receiveFrame() that resizes the buffer automatically to accommodate the frame.
To understand fragmented messages, see Receiving Data in RFC 6455. While WebSockets are conceived as a messaging protocol, some musings on whether they are really messaging or streaming can be found here.
Also, the code you posted does not compile, and the idea of writing an unknown number of bytes into a buffer of unknown size seems hazardous, to put it mildly.
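To illustrate the approach described above, here is a sketch that reads one complete message into a fixed, sufficiently large per-frame buffer and reassembles fragmented frames by checking the FIN bit (the 64 KiB frame limit is an assumed upper bound for this example):
#include <Poco/Net/WebSocket.h>
#include <vector>

// Sketch: receive one logical WebSocket message, reassembling continuation
// frames until a frame with the FIN bit arrives.
std::vector<char> receiveMessage(Poco::Net::WebSocket& ws)
{
    std::vector<char> message;
    char frame[65536];          // assumed largest frame we expect to receive
    int flags = 0;

    do {
        int n = ws.receiveFrame(frame, (int)sizeof(frame), flags);
        if (n <= 0)
            break;                                  // connection closed or empty frame
        message.insert(message.end(), frame, frame + n);
    } while ((flags & Poco::Net::WebSocket::FRAME_FLAG_FIN) == 0);

    return message;
}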

What is the size of a socket send buffer in Windows?

Based on my understanding, each socket is associated with two buffers, a send buffer and a receive buffer, so when I call the send() function, what happens is that the data to send will be placed into the send buffer, and it is the responsibility of Windows now to send the content of this send buffer to the other end.
In a blocking socket, the send() function does not return until the entire data supplied to it has been placed into the send buffer.
So what is the size of the send buffer?
I performed the following test (sending 1 GB worth of data):
#include <stdio.h>
#include <WinSock2.h>
#pragma comment(lib, "ws2_32.lib")
#include <Windows.h>

int main()
{
    // Initialize Winsock
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);
    // Create socket
    SOCKET s = socket(AF_INET, SOCK_STREAM, 0);
    //----------------------
    // Connect to 192.168.1.7:12345
    sockaddr_in address;
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = inet_addr("192.168.1.7");
    address.sin_port = htons(12345);
    connect(s, (sockaddr*)&address, sizeof(address));
    //----------------------
    // Create 1 GB buffer ("AAAAAA...A")
    char *buffer = new char[1073741824];
    memset(buffer, 0x41, 1073741824);
    // Send buffer
    int i = send(s, buffer, 1073741824, 0);
    printf("send() has returned\nReturn value: %d\nWSAGetLastError(): %d\n", i, WSAGetLastError());
    //----------------------
    getchar();
    return 0;
}
Output:
send() has returned
Return value: 1073741824
WSAGetLastError(): 0
send() has returned immediately; does this mean that the send buffer has a size of at least 1 GB?
This is some information about the test:
I am using a TCP blocking socket.
I have connected to a LAN machine.
Client Windows version: Windows 7 Ultimate 64-bit.
Server Windows version: Windows XP SP2 32-bit (installed on Virtual Box).
Edit: I have also attempted to connect to Google (173.194.116.18:80) and I got the same results.
Edit 2: I have discovered something strange, setting the send buffer to a value between 64 KB and 130 KB will make send() work as expected!
int send_buffer = 64 * 1024; // 64 KB
int send_buffer_sizeof = sizeof(int);
setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)send_buffer, send_buffer_sizeof);
Edit 3: It turned out (thanks to Harry Johnston) that I have used setsockopt() in an incorrect way, this is how it is used:
setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)&send_buffer, send_buffer_sizeof);
Setting the send buffer to a value between 64 KB and 130 KB does not make send() work as expected, but rather setting the send buffer to 0 makes it block (this is what I noticed anyway, I don't have any documentation for this behavior).
So my question now is: where can I find a documentation on how send() (and maybe other socket operations) work under Windows?
After investigating this subject, this is what I believe to be the correct answer:
When calling send(), there are two things that could happen:
If there are pending data which are below SO_SNDBUF, then send() would return immediately (and it does not matter whether you are sending 5 KB or you are sending 500 MB).
If there are pending data which are above or equal SO_SNDBUF, then send() would block until enough data has been sent to restore the pending data to below SO_SNDBUF.
Note that this behavior is only applicable to Windows sockets, and not to POSIX sockets. I think that POSIX sockets only use one fixed sized send buffer (correct me if I'm wrong).
Now back to your main question "What is the size of a socket send buffer in Windows?". I guess if you have enough memory it could grow beyond 1 GB if necessary (not sure what is the maximum limit though).
I can reproduce this behaviour, and using Resource Monitor it is easy to see that Windows does indeed allocate 1GB of buffer space when the send() occurs.
An interesting feature is that if you do a second send immediately after the first one, that call does not return until both sends have completed. The buffer space from the first send is released once that send has completed, but the second send() continues to block until all the data has been transferred.
I suspect the difference in behaviour is because the second call to send() was already blocking when the first send completed. The third call to send() returns immediately (and 1GB of buffer space is allocated) just as the first one did, and so on, alternating.
So I conclude that the answer to the question ("how large are the send buffers?") is "as large as Windows sees fit". The upshot is that, in order to avoid exhausting the system memory, you should probably restrict blocking sends to no more than a few hundred megabytes.
Your call to setsockopt() is incorrect; the fourth argument is supposed to be a pointer to an integer, not an integer converted to a pointer. Once this is corrected, it turns out that setting the buffer size to zero causes send() to always block.
To summarize, the observed behaviour is that send() will return immediately provided:
there is enough memory to buffer all the provided data
there is not a send already in progress
the buffer size is not set to zero
Otherwise, it will return once the data has been sent.
KB214397 describes some of this - thanks Hans! In particular it describes that setting the buffer size to zero disables Winsock buffering, and comments that "If necessary, Winsock can buffer significantly more than the SO_SNDBUF buffer size."
(The completion notification described does not quite match up to the observed behaviour, depending I guess on how you interpret "previously buffered send". But it's close.)
Note that apart from the risk of inadvertently exhausting the system memory, none of this should matter. If you really need to know whether the code at the other end has received all your data yet, the only reliable way to do that is to get it to tell you.
In a blocking socket, the send() function does not return until the entire data supplied to it has been placed into the send buffer.
That is not guaranteed. If there is available buffer space, but not enough space for the entire data, the socket can (and usually will) accept whatever data it can and ignore the rest. The return value of send() tells you how many bytes were actually accepted. You have to call send() again to send the remaining data.
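In other words, a portable pattern is to loop until everything has been accepted (a sketch with minimal error handling; sendAll is just an illustrative name):
#include <winsock2.h>

// Sketch: keep calling send() until the whole buffer has been accepted,
// since a single send() may take only part of it.
int sendAll(SOCKET s, const char *data, int len)
{
    int total = 0;
    while (total < len) {
        int n = send(s, data + total, len - total, 0);
        if (n == SOCKET_ERROR)
            return SOCKET_ERROR;   // caller can inspect WSAGetLastError()
        total += n;
    }
    return total;
}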
So what is the size of the send buffer?
Use getsockopt() with the SO_SNDBUF option to find out.
Use setsockopt() with the SO_SNDBUF option to specify your own buffer size. However, the socket may impose a max cap on the value you specify. Use getsockopt() to find out what size was actually assigned.
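For example (a sketch; `s` is the connected socket from the question's code, and the re-check after setsockopt() is the point, since the stack may clamp or round the requested value):
#include <winsock2.h>
#include <stdio.h>

// Query the current send buffer size, request a new one, then re-check,
// because the socket may cap the value you asked for.
int sndbuf = 0;
int optlen = sizeof(sndbuf);
getsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)&sndbuf, &optlen);
printf("current SO_SNDBUF: %d\n", sndbuf);

sndbuf = 256 * 1024;   // requested size (example value)
setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)&sndbuf, sizeof(sndbuf));

optlen = sizeof(sndbuf);
getsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)&sndbuf, &optlen);
printf("effective SO_SNDBUF: %d\n", sndbuf);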

Calculating socket upload speed

I'm wondering if anyone knows how to calculate the upload speed of a Berkeley socket in C++. My send call isn't blocking and takes 0.001 seconds to send 5 megabytes of data, but takes a while to recv the response (so I know it's uploading).
This is a TCP socket to a HTTP server and I need to asynchronously check how many bytes of data have been uploaded / are remaining. However, I can't find any API functions for this in Winsock, so I'm stumped.
Any help would be greatly appreciated.
I solved my issue thanks to bdolan suggesting to reduce SO_SNDBUF. However, to use this code you must note that it requires Winsock 2 (for overlapped sockets and WSASend). In addition, your SOCKET handle must have been created similarly to:
SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0, WSA_FLAG_OVERLAPPED);
Note the WSA_FLAG_OVERLAPPED flag as the final parameter.
In this answer I will go through the stages of uploading data to a TCP server and tracking each upload chunk and its completion status. This concept requires splitting your upload buffer into chunks (minimal modification of existing code required), uploading it piece by piece, and then tracking each chunk.
My code flow
Global variables
Your code document must have the following global variables:
#define UPLOAD_CHUNK_SIZE 4096
int g_nUploadChunks = 0;
int g_nChunksCompleted = 0;
WSAOVERLAPPED *g_pSendOverlapped = NULL;
int g_nBytesSent = 0;
float g_flLastUploadTimeReset = 0.0f;
Note: in my tests, decreasing UPLOAD_CHUNK_SIZE results in increased upload speed accuracy, but decreases overall upload speed. Increasing UPLOAD_CHUNK_SIZE results in decreased upload speed accuracy, but increases overall upload speed. 4 kilobytes (4096 bytes) was a good compromise for a file ~500 kB in size.
Callback function
This function increments the bytes sent and chunks completed variables (called after a chunk has been completely uploaded to the server)
void CALLBACK SendCompletionCallback(DWORD dwError, DWORD cbTransferred, LPWSAOVERLAPPED lpOverlapped, DWORD dwFlags)
{
    g_nChunksCompleted++;
    g_nBytesSent += cbTransferred;
}
Prepare socket
Initially, the socket must be prepared by reducing SO_SNDBUF to 0.
Note: In my tests, any value greater than 0 will result in undesirable behaviour.
int nSndBuf = 0;
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char*)&nSndBuf, sizeof(nSndBuf));
Create WSAOVERLAPPED array
An array of WSAOVERLAPPED structures must be created to hold the overlapped status of all of our upload chunks. To do this I simply:
// Calculate the amount of upload chunks we will have to create.
// nDataBytes is the size of data you wish to upload
g_nUploadChunks = ceil(nDataBytes / float(UPLOAD_CHUNK_SIZE));
// Overlapped array, should be delete'd after all uploads have completed
g_pSendOverlapped = new WSAOVERLAPPED[g_nUploadChunks];
memset(g_pSendOverlapped, 0, sizeof(WSAOVERLAPPED) * g_nUploadChunks);
Upload data
All of the data that needs to be sent, for example purposes, is held in a variable called pszData. Then, using WSASend, the data is sent in blocks defined by the constant UPLOAD_CHUNK_SIZE.
WSABUF dataBuf;
DWORD dwBytesSent = 0;
int err;
int i, j;
for(i = 0, j = 0; i < nDataBytes; i += UPLOAD_CHUNK_SIZE, j++)
{
    int nTransferBytes = min(nDataBytes - i, UPLOAD_CHUNK_SIZE);
    dataBuf.buf = &pszData[i];
    dataBuf.len = nTransferBytes;
    // Now upload the data
    int rc = WSASend(sock, &dataBuf, 1, &dwBytesSent, 0, &g_pSendOverlapped[j], SendCompletionCallback);
    if ((rc == SOCKET_ERROR) && (WSA_IO_PENDING != (err = WSAGetLastError())))
    {
        fprintf(stderr, "WSASend failed: %d\n", err);
        exit(EXIT_FAILURE);
    }
}
The waiting game
Now we can do whatever we wish while all of the chunks upload.
Note: the thread which called WSASend must be regularly put into an alertable state, so that our 'transfer completed' callback (SendCompletionCallback) is dequeued from the APC (Asynchronous Procedure Call) list.
In my code, I continuously looped until g_nUploadChunks == g_nChunksCompleted. This is to show the end-user upload progress and speed (can be modified to show estimated completion time, elapsed time, etc.)
Note 2: this code uses Plat_FloatTime as a second counter, replace this with whatever second timer your code uses (or adjust accordingly)
g_flLastUploadTimeReset = Plat_FloatTime();
// Clear the line on the screen with some default data
printf("(0 chunks of %d) Upload speed: ???? KiB/sec", g_nUploadChunks);
// Keep looping until ALL upload chunks have completed
while(g_nChunksCompleted < g_nUploadChunks)
{
    // Wait for 10ms so that we aren't repeatedly updating the screen
    SleepEx(10, TRUE);
    // Update chunk count
    printf("\r(%d chunks of %d) ", g_nChunksCompleted, g_nUploadChunks);
    // Not enough time passed?
    if(g_flLastUploadTimeReset + 1 > Plat_FloatTime())
        continue;
    // Reset timer
    g_flLastUploadTimeReset = Plat_FloatTime();
    // Calculate how many kibibytes have been transmitted in the last second
    float flByteRate = g_nBytesSent/1024.0f;
    printf("Upload speed: %.2f KiB/sec", flByteRate);
    // Reset byte count
    g_nBytesSent = 0;
}
// Delete overlapped data (not used anymore)
delete [] g_pSendOverlapped;
// Note that the transfer has completed
Msg("\nTransfer completed successfully!\n");
Conclusion
I really hope this has helped somebody in the future who has wished to calculate upload speed on their TCP sockets without any server-side modifications. I have no idea how performance detrimental SO_SNDBUF = 0 is, although I'm sure a socket guru will point that out.
You can get a lower bound on the amount of data received and acknowledged by subtracting the value of the SO_SNDBUF socket option from the number of bytes you have written to the socket. This buffer may be adjusted using setsockopt, although in some cases the OS may choose a length smaller or larger than you specify, so you must re-check after setting it.
To get more precise than that, however, you must have the remote side inform you of progress, as winsock does not expose an API to retrieve the amount of data currently pending in the send buffer.
Alternately, you could implement your own transport protocol on UDP, but implementing rate control for such a protocol can be quite complex.
Since you don't have control over the remote side, and you want to do it in code, I'd suggest a very simple approximation. I assume a long-lived program/connection. One-shot uploads would be too skewed by ARP, DNS lookups, socket buffering, TCP slow start, etc.
Have two counters - length of the outstanding queue in bytes (OB), and number of bytes sent (SB):
increment OB by number of bytes to be sent every time you enqueue a chunk for upload,
decrement OB and increment SB by the number returned from send(2) (modulo -1 cases),
on a timer sample both OB and SB - either store them, log them, or compute running average,
compute outstanding bytes a second/minute/whatever, same for sent bytes.
Network stack does buffering and TCP does retransmission and flow control, but that doesn't really matter. These two counters will tell you the rate your app produces data with, and the rate it is able to push it to the network. It's not the method to find out the real link speed, but a way to keep useful indicators about how good the app is doing.
If the data production rate is below the network output rate, everything is fine. If it's the other way around and the network cannot keep up with the app, there's a problem: you need either a faster network, a slower app, or a different design.
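A sketch of the two counters described above (the names follow the OB/SB description; the timer hook and function names are illustrative):
#include <atomic>
#include <cstdio>

// Two counters: bytes queued by the app (OB) and bytes accepted by send() (SB).
std::atomic<long long> g_outstandingBytes{0};  // OB
std::atomic<long long> g_sentBytes{0};         // SB

void onEnqueue(size_t chunkLen)
{
    g_outstandingBytes += (long long)chunkLen;  // app produced data
}

void onSendReturned(int sendResult)
{
    if (sendResult > 0) {                       // ignore the -1 / would-block case
        g_outstandingBytes -= sendResult;
        g_sentBytes += sendResult;
    }
}

// Called from a periodic timer (e.g. once per second): report and reset.
void onTimerTick()
{
    long long sent = g_sentBytes.exchange(0);
    std::printf("app backlog: %lld bytes, pushed to network: %lld bytes/s\n",
                (long long)g_outstandingBytes, sent);
}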
For one-time experiments just take periodic snapshots of netstat -sp tcp output (or whatever that is on Windows) and calculate the send-rate manually.
Hope this helps.
If your app uses packet headers like
0001234DT
where 000123 is the packet length for a single packet, you can consider using MSG_PEEK + recv() to get the length of the packet before you actually read it with recv().
The problem is send() is NOT doing what you think - it is buffered by the kernel.
int flag = 0;
socklen_t sz = sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &flag, &sz));
fprintf(stdout, "%s: listener socket send buffer = %d\n", now(), flag);
sz = sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &flag, &sz));
fprintf(stdout, "%s: listener socket recv buffer = %d\n", now(), flag);
See what these show for you.
When you recv on a NON-blocking socket that has data, it normally does not have MBs of data parked in the buffer ready to recv. Most of what I have experienced is that the socket has ~1500 bytes of data per recv. Since you are probably reading on a blocking socket, it takes a while for the recv() to complete.
Socket buffer size is probably the single best predictor of socket throughput. setsockopt() lets you alter the socket buffer size, up to a point. Note: these buffers are shared among sockets in a lot of OSes, like Solaris. You can kill performance by twiddling these settings too much.
Also, I don't think you are measuring what you think you are measuring. The real efficiency of send() is measured by the throughput on the recv() end, not the send() end.
IMO.