rte_eth_tx_burst() descriptor/mbuf management guarantees vs. free thresholds - dpdk

The rte_eth_tx_burst() function is documented as:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
I have a small test program where this doesn't seem to hold true (when using the ixgbe driver on a vfio X553 1GbE NIC).
So my program sets up one transmit queue like this:
uint16_t tx_ring_size = 1024-32;
rte_eth_dev_configure(port_id, 0, 1, &port_conf);
r = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &rx_ring_size, &tx_ring_size);
struct rte_eth_txconf txconf = dev_info.default_txconf;
r = rte_eth_tx_queue_setup(port_id, 0, tx_ring_size,
rte_eth_dev_socket_id(port_id), &txconf);
The transmit mbuf packet pool is created like this:
struct rte_mempool *pkt_pool = rte_pktmbuf_pool_create("pkt_pool", 1023, 341, 0,
RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
In that way, when sending packets I rather run out of TX descriptors before I run out of packet buffers. (the program generates packets with just one segment)
My expectation is that when I call rte_eth_tx_burst() in a loop (to send one packet after another) that it never fails since it transparently frees mbufs of already sent packets.
However, this doesn't happen.
I basically have a transmit loop like this:
for (unsigned i = 0; i < 2048; ++i) {
struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
// error check, prepare packet etc.
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
// error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
I use the default threshold values, i.e. the queried values are (from dev_info.default_txconf):
tx thresh : 32
tx rs thresh: 32
wthresh : 0
So the main question now is: How hard is rte_eth_tx_burst() supposed to try to free mbuf buffers (and thus descriptors)?
I mean, it could busy loop until the transmission of previously supplied mbufs is completed.
Or it could just quickly check if some descriptors are free again. But if not, just give up.
Related question: Are the default threshold values appropriate for this use case?
So I work around this like that:
for (;;) {
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
if (l == 1) {
break;
} else {
RTE_LOG(ERR, USER1, "cannot send packet\n");
int r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
if (r < 0) {
rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
}
RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
}
}
With that I get output like this:
USER1: cannot send packet
USER1: 1086. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1118. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1150. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 1182. cleaned up 0 descriptors ...
[..]
USER1: cannot send packet
USER1: 1950. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 1982. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 0 descriptors ...
USER1: cannot send packet
USER1: 2014. cleaned up 32 descriptors ...
USER1: cannot send packet
USER1: 2046. cleaned up 32 descriptors ...
Meaning that it frees at most 32 descriptors like this. And that it doesn't always succeed, but then the next rte_eth_tx_burst() succeeds freeing some.
Side question: Is there a better more dpdk-idiomatic way to handle the recycling of mbufs?
When I change the code such that I run out of mbufs before I run out of transmit descriptors (i.e. tx ring created with 1024 descriptors, mbuf pool still has 1023 elements), I have to change the alloc part like this:
struct rte_mbuf *pkt;
do {
pkt = rte_pktmbuf_alloc(args.pkt_pool);
if (!pkt) {
r = rte_eth_tx_done_cleanup(args.port_id, 0, 256);
if (r < 0) {
rte_panic("%u. cannot cleanup tx descs: %s\n", i, rte_strerror(-r));
}
RTE_LOG(WARNING, USER1, "%u. cleaned up %d descriptors ...\n", i, r);
}
} while (!pkt);
The output is similar, e.g.:
USER1: 1023. cleaned up 95 descriptors ...
USER1: 1118. cleaned up 32 descriptors ...
USER1: 1150. cleaned up 32 descriptors ...
USER1: 1182. cleaned up 32 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 0 descriptors ...
USER1: 1214. cleaned up 32 descriptors ...
[..]
That means the freeing of descriptors/mbufs is so 'slow' that it has to busy loop up to 3 times.
Again, is this a valid approach, or are there better dpdk ways to solve this?
Since rte_eth_tx_done_cleanup() might return -ENOTSUP, this may point to the direction that my usage of it might not be the best solution.
Incidentally, even with the ixgbe driver it fails for me when I disable checksum offloads!
Apparently, ixgbe_dev_tx_done_cleanup() then invokes ixgbe_tx_done_cleanup_vec() instead of ixgbe_tx_done_cleanup_full() which unconditionally returns -ENOTSUP:
static int
ixgbe_tx_done_cleanup_vec(struct ixgbe_tx_queue *txq __rte_unused,
uint32_t free_cnt __rte_unused)
{
return -ENOTSUP;
}
Does this make sense?
So then perhaps the better strategy is then to make sure that there are less descriptors than pool elements (e.g. 1024-32 < 1023) and just re-call rte_eth_tx_burst() until it returns one?
That means like this:
for (;;) {
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
if (l == 1) {
break;
} else {
RTE_LOG(ERR, USER1, "%u. cannot send packet - retry\n", i);
}
}
This works, and the output shows again that the descriptors are freed 32 at a time, e.g.:
USER1: 1951. cannot send packet - retry
USER1: 1951. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 1983. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2015. cannot send packet - retry
USER1: 2047. cannot send packet - retry
USER1: 2047. cannot send packet - retry
I know that I also can use rte_eth_tx_burst() to submit bigger bursts. But I want to get the simple/edge cases right and understand the dpdk semantics, first.
I'm on Fedora 33 and DPDK 20.11.2.

Recommendation/Solution: after analyzing the cause of the issue is indeed with TX descriptor with either rte_mempool_list_dump or dpdk-procinfo, please use rte_eth_tx_buffer_flush or change the settings for TX thresholds.
Explanation:
The behaviour mbuf_free is varied across PMD, and within the same NIC PF and VF also varies. Follow are some points to understand this propely
rte_mempool can be created with or without cache elements.
when created with cached elements, depending upon the available lcores (eal_options) and number of cache elements per core parameter, the configured mbufs are added per core cache.
When HW offload DEV_TX_OFFLOAD_MBUF_FAST_FREE is available and enabled, the agreement is the mbuf will have ref_cnt as 1.
So when ever tx_burst (success or failure is invoked) threshold levels are checked if free mbuf/mbuf-segments can be pushed back to pool.
With DEV_TX_OFFLOAD_MBUF_FAST_FREE enabled the driver blindly puts the elements into lcore cache.
while in case of no DEV_TX_OFFLOAD_MBUF_FAST_FREE, generic approach of validating the MBUF ensuring the nb_segments and ref_cnt are checked, then pushed to mempool.
But always the either fixed (32 I believe is the default set for all PMD) or available free mbuf is pushed to cache or pool always.
Facts:
In the case of the IXGBE VF driver the option DEV_TX_OFFLOAD_MBUF_FAST_FREE is not available. Which means each time whenever thresholds are met, each individual mbuf are checked and pushed to the mempool.
as per the code snippet rte_eth_dev_configure is configured only for TX, and rte_pktmbuf_pool_create is created to have 341 elements as cache.
Assumption has to be made, that there is only 1 Lcore based (which runs the loop of alloc and tx).
Code Snippet-1:
for (unsigned i = 0; i < 2048; ++i) {
struct rte_mbuf *pkt = rte_pktmbuf_alloc(args.pkt_pool);
// error check, prepare packet etc.
uint16_t l = rte_eth_tx_burst(args.port_id, 0, &pkt, 1);
// error check etc.
}
After 1086 transmitted packets (of ~ 300 bytes each), rte_eth_tx_burst() returns 0.
[Observation] If indeed the mbuf were running, the rte_pktmbuf_alloc should be failing before rte_eth_tx_burst. But failing at 1086, creates an interesting phenomenon because total mbuf created is 1023, and failure happens are 2 iteration of 32 mbuf_release to mempool. Analyzing the driver code for ixgbe, it can be found that (only place return as 0) in tx_xmit_pkts is
/* Only use descriptors that are available */
nb_pkts = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
if (unlikely(nb_pkts == 0))
return 0;
Even though in config tx_ring_size is set to 992, internally rte_eth_dev_adjust_nb_desc sets to max of *nb_desc, desc_lim->nb_min. Based on the code it is not because there are no free mbuf, but it due to TX descriptor is low or not availble.
while in all other cases, whenever rte_eth_tx_done_cleanup or rte_eth_tx_buffer_flush these actually pushes any pending descriptors to be DMA immediately out of SW PMD. This internally frees up more descriptors which makes the tx_burst much smoother.
To identify the root cause, whenever DPDK API tx_burst return either
invoke rte_mempool_list_dump or
make use of mempool dump via dpdk-procinfo
Note: most PMD operates on amortizing the cost of the descriptor (PCIe payload) write by batching and bunching for at least 4 (in case of SSE). Hence a single packet even if DPDK tx_burst returning 1 will not be pushing the packet out of NIC. Hence to ensure use rte_eth_tx_buffer_flush.

Say, you invoke rte_eth_tx_burst() to send one small packet (single mbuf, no offloads). Suppose, the driver indeed pushes the packet to the HW. Doing so eats up one descriptor in the ring: the driver "remembers" that this packet mbuf is associated with that descriptor. But the packet is not sent instantly. The HW typically has some means to notify the driver of completions. Just imagine: if the driver checked for completions on every rte_eth_tx_burst() invocation (thus ignoring any thresholds), then calling rte_eth_tx_burst() one more time in a tight loop manner for another packet would likely consume one more descriptor rather than recycle the first one. So, given this fact, I'd not use tight loop when investigating tx_free_thresh semantics. And it shouldn't matter whether you invoke rte_eth_tx_burst() once per a packet or once per a batch of them.
Now. Say, you have a Tx ring of size N. Suppose, tx_free_thresh is M. And you have a mempool of size Z. What you do is allocate a burst of N - M - 1 small packets and invoke rte_eth_tx_burst() to send this burst (no offloads; each packet is assumed to eat up one Tx descriptor). Then you wait for some wittingly sufficient (for completions) amount of time and check the number of free objects in the mempool. This figure should read Z - (N - M - 1). Then you allocate and send one extra packet. Then wait again. This time, the number of spare objects in the mempool should read Z - (N - M). Finally, you allocate and send one more packet (again!) thus crossing the threshold (the number of spare Tx descriptors becomes less than M). During this invocation of rte_eth_tx_burst(), the driver should detect crossing the threshold and start checking for completions. This should make the driver free (N - M) descriptors (consumed by two previous rte_eth_tx_burst() invocations) thus clearing up the whole ring. Then the driver proceeds to push the new packet in question to the HW thus spending one descriptor. You then check the mempool: this should report Z - 1 free objects.
So, the short of it: no loop, just three rte_eth_tx_burst() invocations with sufficient waiting time between them. And you check the spare object count in the mempool after each send operation. Theoretically, this way, you'll be able to understand the corner case semantics. That's the gist of it. However, please keep in mind that the actual behaviour may vary across different vendors / PMDs.

Relying on rte_eth_tx_done_cleanup() really isn't an option since many PMDs don't implement it. Mostly Intel PMD's provide it, but e.g. SFC, MLX* and af_packet ones don't.
However, it's still unclear why the ixgbe PMD doesn't support cleanup when no offloads are enabled.
The requirements on rte_eth_tx_burst() with respect to freeing are really light - from the API docs:
* It is the responsibility of the rte_eth_tx_burst() function to
* transparently free the memory buffers of packets previously sent.
* This feature is driven by the *tx_free_thresh* value supplied to the
* rte_eth_dev_configure() function at device configuration time.
* When the number of free TX descriptors drops below this threshold, the
* rte_eth_tx_burst() function must [attempt to] free the *rte_mbuf* buffers
* of those packets whose transmission was effectively completed.
[..]
* #return
* The number of output packets actually stored in transmit descriptors of
* the transmit ring. The return value can be less than the value of the
* *tx_pkts* parameter when the transmit ring is full or has been filled up.
So just attempting to free (but not waiting on the results of that attempt) and returning 0 (since 0 is less than tx_pkts) is covered by that 'contract'.
FWIW, no example distributed with dpdk loops around rte_eth_tx_burst() to re-submit not-yet-sent packages. There are some examples that use rte_eth_tx_burst() and discard unsent packages, though.
AFAICS, besides rte_eth_tx_done_cleanup() and rte_eth_tx_burst() there is no other function for requesting the release of mbufs previously submitted for transmission.
Thus, it's advisable to size the mbuf packet pool larger than the configured ring size in order to survive situations where all mbufs are inflight and can't be recovered because there is no mbuf left for calling rte_eth_tx_burst() again.

Related

Qt QTcpSocket Reading Data Overlap Causes Invalid TCP Behavior During High Bandwidth Reading and Writing

Summary: Some of the memory within the TCP socket to be overwritten by other incoming data.
Application:
A client/server system that utilizes TCP within Qt (QTcpSocket and QTcpServer). The client request a frame from the server(just a simple string message), and the response (Server -> Client) which consists of that frame (614400 bytes for testing purposes). Frame sizes are established in advance and are fixed.
Implementation Details:
From the guarantees of the TCP protocol (Server -> Client), I know that I should be able to read the 614400 bytes from the socket and that they are in order. If any either of these two things fails, the connection must have failed.
Important Code:
Assuming the socket is connected.
This code requests a frame from the server. Known as the GetFrame() function.
// Prompt the server to send a frame over
if(socket->isWritable() && !is_receiving) { // Validate that socket is ready
is_receiving = true; // Forces only one request to go out at a time
qDebug() << "Getting frame from socket..." << image_no;
int written = SafeWrite((char*)"ReadyFrame"); // Writes then flushes the write buffer
if (written == -1) {
qDebug() << "Failed to write...";
return temp_frame.data();
}
this->SocketRead();
is_receiving = false;
}
qDebug() << image_no << "- Image Received";
image_no ++;
return temp_frame.data();
This code waits for the frame just requested to be read. This is the SocketRead() function
size_t byte_pos = 0;
qint64 bytes_read = 0;
do {
if (!socket->waitForReadyRead(500)) { // If it timed out return existing frame
if (!(socket->bytesAvailable() > 0)) {
qDebug() << "Timed Out" << byte_pos;
break;
}
}
bytes_read = socket->read((char*)temp_frame.data() + byte_pos, frame_byte_size - byte_pos);
if (bytes_read < 0) {
qDebug() << "Reading Failed" << bytes_read << errno;
break;
}
byte_pos += bytes_read;
} while (byte_pos < frame_byte_size && is_connected); // While we still have more pixels
qDebug() << "Finished Receiving Frame: " << byte_pos;
As shown in the code above, I read until the frame is fully received (where the number of bytes read is equal to the number of bytes in the frame).
The issue that I'm having is that the QTcpSocket read operation is skipping bytes in ways that are not in line with the guarantees of the TCP protocol. Since I skip bytes I end up not reaching the end of the while loop and just "Time Out". Why is this happening?
What I have done so far:
The data that the server sends is directly converted into uint16_t (short) integers which are used in other parts of the client. I have changed the server to simply output data that just counts up adding one for each number sent. Since the data type is uint16_t and the number of bytes exceeds that maximum number for that integer type, the int-16's will loop every 65535.
This is a data visualization software so this debugging configuration (on the client side) leads to something like this:
I have determined (and as you can see a little at the bottom of the graphic) that some bytes are being skipped. In the memory of temp_frame it is possible to see the exact point at which the memory skipped:
Under correct circumstances, this should count up sequentially.
From Wireshark and following this specific TCP connection I have determined that all of the bytes are in fact arriving (all 6114400), and that all the numbers are in order (I used a python script to ensure counting was sequential).
This is work on an open source project so this is the whole code base for the client.
Overall, I don't see how I could be doing something wrong in this solution, all I am doing is reading from the socket in the standard way.
Caveat: This isn't a definitive answer to your problem, but some things to try (it's too large for a comment).
With (e.g.) GigE, your data rate is ~100MB/s. With a [total] amount of kernel buffer space of 614400, this will be refilled ~175 times per second. IMO, this is still too small. When I've used SO_RCVBUF [for a commercial product], I've used a minimum of 8MB. This allows a wide(er) margin for task switch delays.
Try setting something huge like 100MB to eliminate this as a factor [during testing/bringup].
First, it's important to verify that the kernel and NIC driver can handle the throughput/latency.
You may be getting too many interrupts/second and the ISR prolog/epilog overhead may be too high. The NIC card driver can implement polled vs interrupt driver with NAPI for ethernet cards.
See: https://serverfault.com/questions/241421/napi-vs-adaptive-interrupts
See: https://01.org/linux-interrupt-moderation
You process/thread may not have high enough priority to be scheduled quickly.
You can use the R/T scheduler with sched_setscheduler, SCHED_RR, and a priority of (e.g.) 8. Note: going higher than 11 kills the system because at 12 and above you're at a higher priority than most internal kernel threads--not a good thing.
You may need to disable IRQ balancing and set the IRQ affinity to a single CPU core.
You can then set your input process/thread locked to that core [with sched_setaffinity and/or pthread_setaffinity].
You might need some sort of "zero copy" to bypass the kernel copying from its buffers into your userspace buffers.
You can mmap the kernel socket buffers with PACKET_MMAP. See: https://sites.google.com/site/packetmmap/
I'd be careful about the overhead of your qDebug output. It looks like an iostream type implementation. The overhead may be significant. It could be slowing things down significantly.
That is, you're not measuring the performance of your system. You're measuring the performance of your system plus the debugging code.
When I've had to debug/trace such things, I've used a [custom] "event" log implemented with an in-memory ring queue with a fixed number of elements.
Debug calls such as:
eventadd(EVENT_TYPE_RECEIVE_START,some_event_specific_data);
Here eventadd populates a fixed size "event" struct with the event type, event data, and a hires timestamp (e.g. struct timespec from clock_gettime(CLOCK_MONOTONIC,...).
The overhead of each such call is quite low. The events are just stored in the event ring. Only the last N are remembered.
At some point, your program triggers a dump of this queue to a file and terminates.
This mechanism is similar to [and modeled on] a H/W logic analyzer. It is also similar to dtrace
Here's a sample event element:
struct event {
long long evt_tstamp; // timestamp
int evt_type; // event type
int evt_data; // type specific data
};

Boost.Asio - how to check that all intermediate handlers performed in a strand?

I have client-server app that uses Boost.Asio and SSL. Server has many threads. Server uses async_read and sometimes I get invoked async handler with bytes_received != bytes_expected and ec == 0. Most of the time app misses 1-500 bytes out of 2048-16384 bytes (usual size of logical packets in the app).
My issue similar to this one - Why boost::asio::async_read completion sometimes executed with bytes_transferred=0 and ec=0?.
To fix it I made dedicated io_service::strand for each ssl::stream
and one io_service::strand for ssl::context (which is shared among threads). I wrapped all async invocations with io_service::strand::wrap() (async_read, async_write, async_wait). Routine launched via io_service::strand::post(). I made debug checks before each invocation of socket method (io_service::strand::running_in_this_thread()). But I still have broken packets...
Due to this answer https://stackoverflow.com/a/12801042/1802974 during the processing of async_read/async_write Boost.Asio can make intermediate handlers which must be inside the strand. This is the last chance to fix this weird issue. How can I check that all intermediate handlers actually performed inside dedicated socket strand?
UPDATE
After a lot of debugging I found my issue - it is not connected to SSL at all. That was because of wrong manipulation of buffers...
In my app I use std::vector<char> as a buffer (two vectors for each socket - one for read operations and one for write operations). Vectors created right after socket opened with default (zero) length.
I have a binary protocol. Each packet consists of a header and payload. Header contains length of payload. After server received header of a packet it checks that read buffer is sufficient to receive payload. If buffer is smaller than packet payload - server increases the buffer.
Code that was used to manage buffer size (do not do this ever):
if (buf.capacity() >= newSize)
{
return;
}
buf.resize(newSize);
After that I constructed asio buffer like this:
boost::asio::buffer(myVector, expectedPacketLength)
I.e. this overload was used: boost.org
Now we will see the error:
- Client sends packet of the length 400
- Server increases size of the buffer to 400 (size == 400, capacity == 400)
- Server successfully receives packet
- Client sends packet of the length 415
- Server increases size of the buffer to 415.
But `std::vector` has its own logic to increase capacity and reallocate data.
After 'resizing' actual parameters - capacity == 800, size == 415.
- Server successfully receives packet
- Client sends packet of the length 430
- Server try to increase size of a buffer, but capacity is already bigger than 430.
Nothing is done.
- Server creates buffer.
But mentioned overload of `boost::asio::buffer` has its own logic - size of created buffer 415.
- voila, we miss 15 last bytes of the packet
Why I post it here?
Be carefull with the buffers )) Just use std::vector::reserve to increase size and use most basic overload of boost::asio::buffer boost.org

What is the size of a socket send buffer in Windows?

Based on my understanding, each socket is associated with two buffers, a send buffer and a receive buffer, so when I call the send() function, what happens is that the data to send will be placed into the send buffer, and it is the responsibility of Windows now to send the content of this send buffer to the other end.
In a blocking socket, the send() function does not return until the entire data supplied to it has been placed into the send buffer.
So what is the size of the send buffer?
I performed the following test (sending 1 GB worth of data):
#include <stdio.h>
#include <WinSock2.h>
#pragma comment(lib, "ws2_32.lib")
#include <Windows.h>
int main()
{
// Initialize Winsock
WSADATA wsa;
WSAStartup(MAKEWORD(2, 2), &wsa);
// Create socket
SOCKET s = socket(AF_INET, SOCK_STREAM, 0);
//----------------------
// Connect to 192.168.1.7:12345
sockaddr_in address;
address.sin_family = AF_INET;
address.sin_addr.s_addr = inet_addr("192.168.1.7");
address.sin_port = htons(12345);
connect(s, (sockaddr*)&address, sizeof(address));
//----------------------
// Create 1 GB buffer ("AAAAAA...A")
char *buffer = new char[1073741824];
memset(buffer, 0x41, 1073741824);
// Send buffer
int i = send(s, buffer, 1073741824, 0);
printf("send() has returned\nReturn value: %d\nWSAGetLastError(): %d\n", i, WSAGetLastError());
//----------------------
getchar();
return 0;
}
Output:
send() has returned
Return value: 1073741824
WSAGetLastError(): 0
send() has returned immediately, does this means that the send buffer has a size of at least 1 GB?
This is some information about the test:
I am using a TCP blocking socket.
I have connected to a LAN machine.
Client Windows version: Windows 7 Ultimate 64-bit.
Server Windows version: Windows XP SP2 32-bit (installed on Virtual Box).
Edit: I have also attempted to connect to Google (173.194.116.18:80) and I got the same results.
Edit 2: I have discovered something strange, setting the send buffer to a value between 64 KB and 130 KB will make send() work as expected!
int send_buffer = 64 * 1024; // 64 KB
int send_buffer_sizeof = sizeof(int);
setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)send_buffer, send_buffer_sizeof);
Edit 3: It turned out (thanks to Harry Johnston) that I have used setsockopt() in an incorrect way, this is how it is used:
setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)&send_buffer, send_buffer_sizeof);
Setting the send buffer to a value between 64 KB and 130 KB does not make send() work as expected, but rather setting the send buffer to 0 makes it block (this is what I noticed anyway, I don't have any documentation for this behavior).
So my question now is: where can I find a documentation on how send() (and maybe other socket operations) work under Windows?
After investigating on this subject. This is what I believe to be the correct answer:
When calling send(), there are two things that could happen:
If there are pending data which are below SO_SNDBUF, then send() would return immediately (and it does not matter whether you are sending 5 KB or you are sending 500 MB).
If there are pending data which are above or equal SO_SNDBUF, then send() would block until enough data has been sent to restore the pending data to below SO_SNDBUF.
Note that this behavior is only applicable to Windows sockets, and not to POSIX sockets. I think that POSIX sockets only use one fixed sized send buffer (correct me if I'm wrong).
Now back to your main question "What is the size of a socket send buffer in Windows?". I guess if you have enough memory it could grow beyond 1 GB if necessary (not sure what is the maximum limit though).
I can reproduce this behaviour, and using Resource Monitor it is easy to see that Windows does indeed allocate 1GB of buffer space when the send() occurs.
An interesting feature is that if you do a second send immediately after the first one, that call does not return until both sends have completed. The buffer space from the first send is released once that send has completed, but the second send() continues to block until all the data has been transferred.
I suspect the difference in behaviour is because the second call to send() was already blocking when the first send completed. The third call to send() returns immediately (and 1GB of buffer space is allocated) just as the first one did, and so on, alternating.
So I conclude that the answer to the question ("how large are the send buffers?") is "as large as Windows sees fit". The upshot is that, in order to avoid exhausting the system memory, you should probably restrict blocking sends to no more than a few hundred megabytes.
Your call to setsockopt() is incorrect; the fourth argument is supposed to be a pointer to an integer, not an integer converted to a pointer. Once this is corrected, it turns out that setting the buffer size to zero causes send() to always block.
To summarize, the observed behaviour is that send() will return immediately provided:
there is enough memory to buffer all the provided data
there is not a send already in progress
the buffer size is not set to zero
Otherwise, it will return once the data has been sent.
KB214397 describes some of this - thanks Hans! In particular it describes that setting the buffer size to zero disables Winsock buffering, and comments that "If necessary, Winsock can buffer significantly more than the SO_SNDBUF buffer size."
(The completion notification described does not quite match up to the observed behaviour, depending I guess on how you interpret "previously buffered send". But it's close.)
Note that apart from the risk of inadvertently exhausting the system memory, none of this should matter. If you really need to know whether the code at the other end has received all your data yet, the only reliable way to do that is to get it to tell you.
In a blocking socket, the send() function does not return until the entire data supplied to it has been placed into the send buffer.
That is not guaranteed. If there is available buffer space, but not enough space for the entire data, the socket can (and usually will) accept whatever data it can and ignore the rest. The return value of send() tells you how many bytes were actually accepted. You have to call send() again to send the remaining data.
So what is the size of the send buffer?
Use getsockopt() with the SO_SNDBUF option to find out.
Use setsockopt() with the SO_SNDBUF option to specify your own buffer size. However, the socket may impose a max cap on the value you specify. Use getsockopt() to find out what size was actually assigned.

Winsock2 tcp/ip - some data packets are ignored probably due to null terminator from the previous packet

I wrote a simple client-server program. Network.h is a header file which uses Winsock2.h (TCP/IP mode) to create socket, accept/connect in blocking mode, send/recv in non-blocking mode. I made it so that the function string TNetwork::Recv(int size) will return the string "Nothing" if it gets WSAWOULDBLOCK error (no data is received yet)
Here is my main function:
int main(){
string Ans;
TNetwork::StartUp(); //WSA start up, etc
cin >> Ans;
if (Ans == "0"){ // 0 --> server
TNetwork::SetupAsServer(); //accept connection (in blocking mode!)
while (true){
TNetwork::Send("\nAss" + '\0'); //without null terminator, the client may read extra bytes, causing undefined behavior (?)
TNetwork::Send("embly" + '\0');
cin >> Ans;
}
}
else{ // others --> regard Ans as IP address. e.g. I can type "127.0.0.1"
TNetwork::SetupAsClient(Ans);
string Rec;
while (true){
Rec = TNetwork::Recv(1000);
if (Rec != "Nothing"){
cout << Rec;
}
}
}
system("PAUSE");
}
Supposedly, the client would print "Assembly" when connected, and when the server enters anything to its console window. Sometimes, though, the client would only print out "\nAss" in the console without the "embly.
To my understanding, TCP/IP ensures all data to be sent and in the correct order, so I guess what happens is that both packets arrive at the same time, which happen quite often over the unstable internet. And due to this null terminator, the client would ignore the "embly", since the Recv() function stopped reading when it hits a null terminator.
So, how can I ensure that the client will always read all data packets correctly?
Yes, the network stack will send the data in the correct order and doesn't care what termination type you use. This has to do with how you're receiving and processing the data stream (note: not packets, stream). If you receive all 11 bytes and print it to the screen, the print function will stop when it reaches the zero, but the rest of the data is still there.
Note: since it's a stream, what happens if you received only 10 bytes of data from the stream? You need to scan what you receive for the zero to know if you've received a full "zero-terminated string" if that's how you want to communicate your data.
EDIT: Also, I don't think "\nAss" + '\0' is doing what you think it is. Instead of adding a 0 character to the end of the string (which already has one, by the way), it's adding 0 to your string pointer.
As #mark points out, TCP is all about streams, not packets. TCP takes care of ensuring that data is reliably transmitted from A to B and that the data is delivered to the consumer in the order in which it was transmitted. Yes, the data is packetized on the wire, but the TCP stack on the system takes those packets and builds the stream which it makes available to you through the recv() function. The TCP stack handles out-of-order data, missing data, and duplicated data such that by the time your application sees it, the stream is a mirror-copy of when the sender sent.
To properly receive TCP data, you will typically need some kind of loop that reads data from the socket when it becomes available. The way I normally do this is to have a thread that is dedicated to servicing the socket. In the thread function is a loop that reads data from the socket when it becomes available and is idle otherwise. This loop reads data into a buffer of, say, 1 KB. Once the data is received from the socket into this buffer, the buffer is copied to another thread for processing. In the thread function for the processing thread is a loop that receives the 1 KB buffers from the socket thread and adds them to the back end of a master buffer of, say, 1 MB. The processing thread then processes the messages out of this master buffer and makes them available to the application.
For a simple demo application, two threads may be overkill. The two threads I've described could be certainly be combined into one, but for my application, it is more efficient to have two threads and take advantage of the multiple cores on my system. The point is, if you're going to have a front-end UI, there's not going to be a way around using at least one thread and still have the UI be responsive.
One other thing. There are two commonly-used mechanisms for protocol design. You're using one, namely, a marker (e.g., a null terminator, etc.) to signal the begin/end of a message. I don't prefer this mechanism mainly because the marker may actually need to be part of the message at some point. The other mechanism is to have a header on each message that tells, at a minimum, how long the message is. I prefer this mechanism and include in my headers a sync word and the message type as well. For example,
struct Header
{
__int16 _sync; // a hex pattern, e.g., 0xABCD
__int16 _type;
__int32 _length;
}
That's a total of 8 bytes. So when processing from the master buffer, I read the first 8 bytes, verify the sync word, and get the length. I determine if there are 'length' bytes available in the master buffer. If not, I have to wait until the socket thread provides me more data before checking again. If so, I extract 'length' bytes from the master buffer and pass that to an object created according to the specified type, which knows how to interpret that particular message. Then repeat.
As I mentioned, I use a master buffer of 1 MB or so. As messages are processed, it is important to remove them from the master buffer so there is additional space available for new data on the back end. This involves simply copying the unprocessed data, if any, to the beginning of the buffer. In cases where data comes in faster than you can process it, the master buffer may need the ability to resize itself to accommodate the additional data.
I hope that's not overwhelming. Start simple and add as you go.

Calculating socket upload speed

I'm wondering if anyone knows how to calculate the upload speed of a Berkeley socket in C++. My send call isn't blocking and takes 0.001 seconds to send 5 megabytes of data, but takes a while to recv the response (so I know it's uploading).
This is a TCP socket to a HTTP server and I need to asynchronously check how many bytes of data have been uploaded / are remaining. However, I can't find any API functions for this in Winsock, so I'm stumped.
Any help would be greatly appreciated.
EDIT: I've found the solution, and will be posting as an answer as soon as possible!
EDIT 2: Proper solution added as answer, will be added as solution in 4 hours.
I solved my issue thanks to bdolan suggesting to reduce SO_SNDBUF. However, to use this code you must note that your code uses Winsock 2 (for overlapped sockets and WSASend). In addition to this, your SOCKET handle must have been created similarily to:
SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0, WSA_FLAG_OVERLAPPED);
Note the WSA_FLAG_OVERLAPPED flag as the final parameter.
In this answer I will go through the stages of uploading data to a TCP server, and tracking each upload chunk and it's completion status. This concept requires splitting your upload buffer into chunks (minimal existing code modification required) and uploading it piece by piece, then tracking each chunk.
My code flow
Global variables
Your code document must have the following global variables:
#define UPLOAD_CHUNK_SIZE 4096
int g_nUploadChunks = 0;
int g_nChunksCompleted = 0;
WSAOVERLAPPED *g_pSendOverlapped = NULL;
int g_nBytesSent = 0;
float g_flLastUploadTimeReset = 0.0f;
Note: in my tests, decreasing UPLOAD_CHUNK_SIZE results in increased upload speed accuracy, but decreases overall upload speed. Increasing UPLOAD_CHUNK_SIZE results in decreased upload speed accuracy, but increases overall upload speed. 4 kilobytes (4096 bytes) was a good comprimise for a file ~500kB in size.
Callback function
This function increments the bytes sent and chunks completed variables (called after a chunk has been completely uploaded to the server)
void CALLBACK SendCompletionCallback(DWORD dwError, DWORD cbTransferred, LPWSAOVERLAPPED lpOverlapped, DWORD dwFlags)
{
g_nChunksCompleted++;
g_nBytesSent += cbTransferred;
}
Prepare socket
Initially, the socket must be prepared by reducing SO_SNDBUF to 0.
Note: In my tests, any value greater than 0 will result in undesirable behaviour.
int nSndBuf = 0;
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char*)&nSndBuf, sizeof(nSndBuf));
Create WSAOVERLAPPED array
An array of WSAOVERLAPPED structures must be created to hold the overlapped status of all of our upload chunks. To do this I simply:
// Calculate the amount of upload chunks we will have to create.
// nDataBytes is the size of data you wish to upload
g_nUploadChunks = ceil(nDataBytes / float(UPLOAD_CHUNK_SIZE));
// Overlapped array, should be delete'd after all uploads have completed
g_pSendOverlapped = new WSAOVERLAPPED[g_nUploadChunks];
memset(g_pSendOverlapped, 0, sizeof(WSAOVERLAPPED) * g_nUploadChunks);
Upload data
All of the data that needs to be send, for example purposes, is held in a variable called pszData. Then, using WSASend, the data is sent in blocks defined by the constant, UPLOAD_CHUNK_SIZE.
WSABUF dataBuf;
DWORD dwBytesSent = 0;
int err;
int i, j;
for(i = 0, j = 0; i < nDataBytes; i += UPLOAD_CHUNK_SIZE, j++)
{
int nTransferBytes = min(nDataBytes - i, UPLOAD_CHUNK_SIZE);
dataBuf.buf = &pszData[i];
dataBuf.len = nTransferBytes;
// Now upload the data
int rc = WSASend(sock, &dataBuf, 1, &dwBytesSent, 0, &g_pSendOverlapped[j], SendCompletionCallback);
if ((rc == SOCKET_ERROR) && (WSA_IO_PENDING != (err = WSAGetLastError())))
{
fprintf(stderr, "WSASend failed: %d\n", err);
exit(EXIT_FAILURE);
}
}
The waiting game
Now we can do whatever we wish while all of the chunks upload.
Note: the thread which called WSASend must be regularily put into an alertable state, so that our 'transfer completed' callback (SendCompletionCallback) is dequeued out of the APC (Asynchronous Procedure Call) list.
In my code, I continuously looped until g_nUploadChunks == g_nChunksCompleted. This is to show the end-user upload progress and speed (can be modified to show estimated completion time, elapsed time, etc.)
Note 2: this code uses Plat_FloatTime as a second counter, replace this with whatever second timer your code uses (or adjust accordingly)
g_flLastUploadTimeReset = Plat_FloatTime();
// Clear the line on the screen with some default data
printf("(0 chunks of %d) Upload speed: ???? KiB/sec", g_nUploadChunks);
// Keep looping until ALL upload chunks have completed
while(g_nChunksCompleted < g_nUploadChunks)
{
// Wait for 10ms so then we aren't repeatedly updating the screen
SleepEx(10, TRUE);
// Updata chunk count
printf("\r(%d chunks of %d) ", g_nChunksCompleted, g_nUploadChunks);
// Not enough time passed?
if(g_flLastUploadTimeReset + 1 > Plat_FloatTime())
continue;
// Reset timer
g_flLastUploadTimeReset = Plat_FloatTime();
// Calculate how many kibibytes have been transmitted in the last second
float flByteRate = g_nBytesSent/1024.0f;
printf("Upload speed: %.2f KiB/sec", flByteRate);
// Reset byte count
g_nBytesSent = 0;
}
// Delete overlapped data (not used anymore)
delete [] g_pSendOverlapped;
// Note that the transfer has completed
Msg("\nTransfer completed successfully!\n");
Conclusion
I really hope this has helped somebody in the future who has wished to calculate upload speed on their TCP sockets without any server-side modifications. I have no idea how performance detrimental SO_SNDBUF = 0 is, although I'm sure a socket guru will point that out.
You can get a lower bound on the amount of data received and acknowledged by subtracting the value of the SO_SNDBUF socket option from the number of bytes you have written to the socket. This buffer may be adjusted using setsockopt, although in some cases the OS may choose a length smaller or larger than you specify, so you must re-check after setting it.
To get more precise than that, however, you must have the remote side inform you of progress, as winsock does not expose an API to retrieve the amount of data currently pending in the send buffer.
Alternately, you could implement your own transport protocol on UDP, but implementing rate control for such a protocol can be quite complex.
Since you don't have control over the remote side, and you want to do it in the code, I'd suggest doing very simple approximation. I assume a long living program/connection. One-shot uploads would be too skewed by ARP, DNS lookups, socket buffering, TCP slow start, etc. etc.
Have two counters - length of the outstanding queue in bytes (OB), and number of bytes sent (SB):
increment OB by number of bytes to be sent every time you enqueue a chunk for upload,
decrement OB and increment SB by the number returned from send(2) (modulo -1 cases),
on a timer sample both OB and SB - either store them, log them, or compute running average,
compute outstanding bytes a second/minute/whatever, same for sent bytes.
Network stack does buffering and TCP does retransmission and flow control, but that doesn't really matter. These two counters will tell you the rate your app produces data with, and the rate it is able to push it to the network. It's not the method to find out the real link speed, but a way to keep useful indicators about how good the app is doing.
If data production rate is bellow the network output rate - everything is fine. If it's the other way around and the network cannot keep up with the app - there's a problem - you need either faster network, slower app, or different design.
For one-time experiments just take periodic snapshots of netstat -sp tcp output (or whatever that is on Windows) and calculate the send-rate manually.
Hope this helps.
If your app uses packet headers like
0001234DT
where 000123 is the packet length for a single packet, you can consider using MSG_PEEK + recv() to get the length of the packet before you actually read it with recv().
The problem is send() is NOT doing what you think - it is buffered by the kernel.
getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &flag, &sz));
fprintf(STDOUT, "%s: listener socket send buffer = %d\n", now(), flag);
sz=sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &flag, &sz));
fprintf(STDOUT, "%s: listener socket recv buffer = %d\n", now(), flag);
See what these show for you.
When you recv on a NON-blocking socket that has data, it normally does not have MB of data parked in the buufer ready to recv. Most of what I have experienced is that the socket has ~1500 bytes of data per recv. Since you are probably reading on a blocking socket it takes a while for the recv() to complete.
Socket buffer size is the probably single best predictor of socket throughput. setsockopt() lets you alter socket buffer size, up to a point. Note: these buffers are shared among sockets in a lot of OSes like Solaris. You can kill performance by twiddling these settings too much.
Also, I don't think you are measuring what you think you are measuring. The real efficiency of send() is the measure of throughput on the recv() end. Not the send() end.
IMO.