UDP recvfrom thread uses too much CPU - c++

I am writing a Windows 7 Visual C++ server application which should receive UDP datagrams at 3.6 MB/s.
I have a main thread where recvfrom() receives the data. The socket is non-blocking and has a 64 kB receive buffer. If no data has been received on the socket, the thread executes a Sleep(1).
My problem is that the thread uses up almost 50% of my dual-core processor, and I have no idea how to decrease that. Wireshark uses only 20%, so my main goal is to achieve a similar percentage.
Do you have any ideas?

Rather than polling, you could use a select-like approach to wait for either data to arrive at your socket or for the client to decide to shut down:
First make your socket non-blocking:
u_long nonBlocking = 1;                   // nonzero = enable non-blocking mode
WSAEventSelect(sock, NULL, 0);            // clear any prior event association first
ioctlsocket(sock, FIONBIO, &nonBlocking);
then use WSAWaitForMultipleEvents to wait until either data arrives or you want to cancel the recv:
int32_t MyRecv(SOCKET sock, WSAEVENT* recvCancelEvt,
               uint8_t* buffer, uint32_t bufferBytes)
{
    int32_t bytesReceived;
    WSAEVENT evt;
    DWORD ret;
    WSAEVENT handles[2];

    evt = WSACreateEvent();
    if (WSA_INVALID_EVENT == evt) {
        return -1;
    }
    if (0 != WSAEventSelect(sock, evt, FD_READ|FD_CLOSE)) {
        WSACloseEvent(evt);
        return -1;
    }
    // Try a receive first; if it would block, wait for data or cancellation.
    bytesReceived = recv(sock, (char*)buffer, bufferBytes, 0);
    if (SOCKET_ERROR == bytesReceived && WSAEWOULDBLOCK == WSAGetLastError()) {
        handles[0] = evt;
        handles[1] = *recvCancelEvt;
        ret = WSAWaitForMultipleEvents(2, handles, FALSE, WSA_INFINITE, FALSE);
        if (WSA_WAIT_EVENT_0 == ret) {
            bytesReceived = recv(sock, (char*)buffer, bufferBytes, 0);
        }
    }
    WSACloseEvent(evt);
    return bytesReceived;
}
Client code would call WSASetEvent on recvCancelEvt if it wanted to cancel a recv.
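A hedged usage sketch for MyRecv above; the cancel-event setup and buffer size are illustrative, not from the original answer:
// Create the cancel event once and share it with whichever thread may abort.
WSAEVENT cancelEvt = WSACreateEvent();

uint8_t buf[65536];
int32_t n = MyRecv(sock, &cancelEvt, buf, sizeof(buf));
if (n > 0) {
    // process n bytes from buf
}

// From another thread, to abort a receive that is blocked waiting:
WSASetEvent(cancelEvt);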

While solutions based on select() or blocking sockets are the correct approach, the reason you're running one core at 100% is the behaviour of Sleep.
Look at the docs for the WinAPI Sleep() function:
This function causes a thread to relinquish the remainder of its time
slice and become unrunnable for an interval based on the value of
dwMilliseconds. The system clock "ticks" at a constant rate. If
dwMilliseconds is less than the resolution of the system clock, the
thread may sleep for less than the specified length of time.
So, if you're polling, you either need to use a much larger sleep time (maybe 20 ms, which is typically a bit greater than the Windows tick interval), or use the more accurate multimedia timer.
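If you go the timer route, here is a minimal sketch using the winmm multimedia-timer API to raise the tick resolution; this is an illustration added here, not code from the question:
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

// Request 1 ms scheduler granularity so Sleep(1) sleeps roughly 1 ms
// instead of rounding up to a full default tick (~15.6 ms).
timeBeginPeriod(1);

// ... polling loop with Sleep(1) ...

timeEndPeriod(1);  // always restore the default granularity when done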

I would recommend using a boost::asio::io_service. We receive about 200 MB/s of UDP multicast traffic while maxing out a modern CPU. This includes a full reliability protocol and data dispatch to the application. In profiling, the bottleneck is the processing, not the boost::asio receive. Code here
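The linked code isn't reproduced here, but a minimal boost::asio UDP receive loop looks roughly like this; the port number and buffer size are assumptions:
#include <boost/asio.hpp>
#include <functional>

int main() {
    boost::asio::io_service io;
    boost::asio::ip::udp::endpoint local(boost::asio::ip::udp::v4(), 9000); // port assumed
    boost::asio::ip::udp::socket sock(io, local);
    boost::asio::ip::udp::endpoint sender;
    char buf[65536];

    // Re-arm the asynchronous receive after each datagram arrives.
    std::function<void(const boost::system::error_code&, std::size_t)> onRecv;
    onRecv = [&](const boost::system::error_code& ec, std::size_t n) {
        if (!ec) {
            // process n bytes in buf
        }
        sock.async_receive_from(boost::asio::buffer(buf), sender, onRecv);
    };
    sock.async_receive_from(boost::asio::buffer(buf), sender, onRecv);

    io.run();  // blocks servicing completions; no polling, no Sleep()
    return 0;
}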

It looks like most of the time your call to recvfrom does not return data. Sleeping 1 ms is not much. You could increase the sleep time (cheap, but not the best solution) or, better, switch to an event-driven approach: use select() or the Windows API to block until the socket is signalled or some other event you are interested in occurs, then call recvfrom. You might need to redesign your program's main loop for that.
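A minimal sketch of that event-driven loop, assuming 'sock' is the question's UDP socket: block in select() until the socket is readable, then call recvfrom().
fd_set rfds;
for (;;) {
    FD_ZERO(&rfds);
    FD_SET(sock, &rfds);
    // On Winsock the first argument to select() is ignored.
    if (select(0, &rfds, NULL, NULL, NULL) == SOCKET_ERROR)
        break;
    char buf[65536];
    int n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
    if (n > 0) {
        // process n bytes; no Sleep() needed anywhere
    }
}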

Related

UDP real time sending and receiving on Linux on command from control computer

I am currently working on a project written in C++ involving UDP real time connection. I receive UDP packets from a control computer containing commands to start/stop an infinite while loop that reads data from an IMU and sends that data to the control computer.
My problem is the following: at first I implemented an exit condition for the loop using recvfrom() and read(), but the control computer only sends a UDP packet every second, so the blocking call delayed the whole loop and made sending the data at the desired 5 ms interval impossible.
I tried to fix this by using fcntl(fd, F_SETFL, O_NONBLOCK); and using only read(), which actually works fine, but I am unsure whether this is a wise idea, since I am not checking for errors anymore. Is there any elegant way to solve this problem? I thought about using Pthreads or something like that, but I have never worked with threads or parallel programming, so I would have to spend some time learning that.
I appreciate any advice on that problem you could give me.
Here is a code example:
//include
...
int main() {
    RNet cmd;          // RNet: struct that contains all the information of the UDP header and the command
    RNet* pCmd = &cmd;
    ssize_t b;
    int fd2;
    struct sockaddr_in snd;  // sender is control computer
    socklen_t length;

    // further declaration of variables, connecting to socket, etc...
    ...

    fcntl(fd2, F_SETFL, O_NONBLOCK);

    while (1)
    {
        // read messages from control computer
        if ((b = read(fd2, pCmd, 19)) > 0) {
            memcpy(&cmd, pCmd, b);
        }

        // transmission
        while (cmd.CLout.MotionCommand == 1) // MotionCommand: 1 - send messages; 0 - do nothing
        {
            if (time_elapsed >= 5) // elapsed time in ms
            {
                // update sensor values
                ...
                // sendto()
                ...
                // update control time, timestamp, etc.
                ...
            }
            if (recvfrom(fd2, pCmd, (int)sizeof(pCmd), 0, (struct sockaddr*)&snd, &length) < 0) {
                perror("error receiving data");
                return 0;
            }
            // checking Control Model Command
            if ((b = read(fd2, pCmd, 19)) > 0) {
                memcpy(&cmd, pCmd, b);
            }
        }
    }
}
I really like the "blocking calls on multiple threads" design. It gives you distinct, independent tasks, and you don't have to worry about how each task might disturb the others. It has some drawbacks, but it is usually a good fit for many needs.
To do that, just use pthread_create to create a new thread for each task (you may keep the main thread for one of them). In your case, you would have one thread to receive commands and another to send your data. The receiving thread also needs to notify the sending thread of the commands; for that you can use a synchronization tool, like a mutex.
Overall, you would have the receiving thread blocking on recvfrom and the sending thread waiting for a signal from the mutex (technically, waiting for the mutex to be freed). When the receiving thread receives a start command, it signals the mutex and goes back to recvfrom (optionally setting a variable to provide more information to the other thread).
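A minimal sketch of this design on POSIX threads, assuming a one-byte start/stop command; the names g_motion, receiver, and sender are illustrative, not from the original post:
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cv   = PTHREAD_COND_INITIALIZER;
static int g_motion = 0;               /* 1 = send data, 0 = idle */

static void* receiver(void* arg) {
    int fd = *(int*)arg;
    char buf[64];
    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof buf, 0, NULL, NULL); /* blocks */
        if (n <= 0) continue;
        pthread_mutex_lock(&g_lock);
        g_motion = (buf[0] == '1');    /* decode start/stop (assumed format) */
        pthread_cond_signal(&g_cv);
        pthread_mutex_unlock(&g_lock);
    }
    return NULL;
}

static void* sender(void* arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&g_lock);
        while (!g_motion)
            pthread_cond_wait(&g_cv, &g_lock); /* sleep until a start command */
        pthread_mutex_unlock(&g_lock);
        /* read the IMU and sendto() the control computer here */
        usleep(5000);                  /* pace the 5 ms transmission loop */
    }
    return NULL;
}
You would start both with pthread_create(&tid, NULL, receiver, &fd2) and pthread_create(&tid2, NULL, sender, NULL).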
As a side note, remember that UDP is 1-to-many, so your code here will react to any packet sent to you (even from some random or malicious host). You may want to filter on the remote sockaddr after recvfrom, or use connect + recv. It depends on what you want.

How do I make libpcap/pcap_loop non-blocking?

I'm currently using libpcap to sniff traffic in promiscuous mode
int main()
{
    // some stuff
    printf("Opening device: %s\n", devname.c_str());
    handle = pcap_open_live(devname.c_str(), 65536, 1, 0, errbuf);
    if (handle == NULL)
    {
        fprintf(stderr, "Couldn't open device %s : %s...", devname.c_str(), errbuf);
        return 1;
    }
    printf(" Done\n");
    pcap_loop(handle, -1, process_packet, NULL);
    // here run a thread to do some stuff. however, pcap_loop is blocking
    return 0;
}
I'd like to add an external thread to do some other stuff. How do I change the code above to make it non-blocking?
When you use non-blocking mode with libpcap you have to use pcap_dispatch. Note that pcap_dispatch can work in either blocking or non-blocking mode, depending on how you configure libpcap; to make libpcap non-blocking, use the function pcap_setnonblock:
int pcap_setnonblock(pcap_t *p, int nonblock, char *errbuf);
The difference between blocking and non-blocking is not whether a loop runs forever: in blocking mode pcap_dispatch waits for packets and only returns once some have been received, whereas in non-blocking mode it returns immediately even if no packets are available, and your callback processes whatever did arrive.
In "non-blocking" mode, an attempt to read from the capture
descriptor with pcap_dispatch() will, if no packets are currently
available to be read, return 0 immediately rather than blocking
waiting for packets to arrive. pcap_loop() and pcap_next() will not
work in "non-blocking" mode.
http://www.tcpdump.org/manpages/pcap_setnonblock.3pcap.html
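A minimal sketch combining pcap_setnonblock() with a pcap_dispatch() loop; 'running' is an assumed flag, while handle and process_packet come from the question's code:
/* Switch the capture handle to non-blocking mode. */
char errbuf2[PCAP_ERRBUF_SIZE];
if (pcap_setnonblock(handle, 1, errbuf2) == -1) {
    fprintf(stderr, "pcap_setnonblock: %s\n", errbuf2);
    return 1;
}
while (running) {
    int n = pcap_dispatch(handle, -1, process_packet, NULL);
    if (n < 0)
        break;          /* error, or pcap_breakloop() was called */
    if (n == 0)
        usleep(1000);   /* nothing captured right now; yield briefly */
    /* ... do the other work here ... */
}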
pcap_loop is meant to go on until all input ends; by design it never returns, as it is meant to keep searching for more data. If you don't want that behaviour, call pcap_dispatch in a loop instead.
I use pcap_next_ex. It returns a result indicating whether a packet was read, so I can manage the acquisition in my own thread. See an example here. The read_timeout passed to pcap_open also affects this function.

Multi-threaded Server handling multiple clients in one thread

I wanted to create a multi-threaded socket server using C++11 and the standard Linux C libraries.
The easiest way to do this would be to open a new thread for each incoming connection, but there must be another way, because Apache doesn't do that. As far as I know, Apache handles more than one connection per thread. How do you realise such a system?
I thought of having one thread always listening for new clients and assigning each new client to a thread. But if all threads are currently executing a select() with an infinite timeout, and none of the already-assigned clients is doing anything, it could take a while for the new client to become usable.
So the select() needs a timeout. Setting the timeout to 0.5 ms would be nice, but I guess the workload could rise too much, couldn't it?
Can any of you tell me how you would realise such a system, handling more than one client per thread?
PS: Hope my English is good enough for you to understand what I mean ;)
The standard method to multiplex multiple requests onto a single thread is to use the Reactor pattern. A central object (typically called a SelectServer, SocketServer, or IOService), monitors all the sockets from running requests and issues callbacks when the sockets are ready to continue reading or writing.
As others have stated, rolling your own is probably a bad idea. Handling timeouts, errors, and cross-platform compatibility (e.g. epoll for Linux, kqueue for BSD, IOCP for Windows) is tricky. Use boost::asio or libevent for production systems.
Here is a skeleton SelectServer (compiles but not tested) to give you an idea:
#include <sys/select.h>
#include <functional>
#include <map>

class SelectServer {
 public:
  enum ReadyType {
    READABLE = 0,
    WRITABLE = 1
  };

  void CallWhenReady(ReadyType type, int fd, std::function<void()> closure) {
    SocketHolder holder;
    holder.fd = fd;
    holder.type = type;
    holder.closure = closure;
    socket_map_[fd] = holder;
  }

  void Run() {
    fd_set read_fds;
    fd_set write_fds;
    while (1) {
      if (socket_map_.empty()) break;
      int max_fd = -1;
      FD_ZERO(&read_fds);
      FD_ZERO(&write_fds);
      for (const auto& pr : socket_map_) {
        if (pr.second.type == READABLE) {
          FD_SET(pr.second.fd, &read_fds);
        } else {
          FD_SET(pr.second.fd, &write_fds);
        }
        if (pr.second.fd > max_fd) max_fd = pr.second.fd;
      }
      int ret_val = select(max_fd + 1, &read_fds, &write_fds, 0, 0);
      if (ret_val <= 0) {
        // TODO: Handle error.
        break;
      } else {
        for (auto it = socket_map_.begin(); it != socket_map_.end(); ) {
          if (FD_ISSET(it->first, &read_fds) ||
              FD_ISSET(it->first, &write_fds)) {
            it->second.closure();
            socket_map_.erase(it++);
          } else {
            ++it;
          }
        }
      }
    }
  }

 private:
  struct SocketHolder {
    int fd;
    ReadyType type;
    std::function<void()> closure;
  };

  std::map<int, SocketHolder> socket_map_;
};
First off, have a look at using poll() instead of select(): it works better when you have large number of file descriptors used from different threads.
To get threads currently waiting in I/O out of waiting I'm aware of two methods:
You can send a suitable signal to the thread using pthread_kill(). The call to poll() fails and errno is set to EINTR.
Some systems allow a file descriptor to be obtained from a thread control device. poll()ing the corresponding file descriptor for input succeeds when the thread control device is signalled. See, e.g., Can we obtain a file descriptor for a semaphore or condition variable?.
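A minimal sketch of the poll()-based loop recommended above, multiplexing a listening socket and its clients in one thread; MAX_CLIENTS, listen_fd, and the handler comments are illustrative, not from the post:
#include <poll.h>
#include <errno.h>

#define MAX_CLIENTS 64

void serve(int listen_fd) {
    struct pollfd fds[MAX_CLIENTS + 1];
    int nfds = 1;
    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;
    for (;;) {
        int n = poll(fds, nfds, -1);       /* block until something is ready */
        if (n < 0) {
            if (errno == EINTR) continue;  /* woken by a signal, e.g. pthread_kill() */
            break;
        }
        if (fds[0].revents & POLLIN) {
            /* accept() the new client and append it to fds[] */
        }
        for (int i = 1; i < nfds; ++i) {
            if (fds[i].revents & (POLLIN | POLLHUP)) {
                /* recv() from fds[i].fd, or remove it on hangup */
            }
        }
    }
}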
This is not a trivial task.
In order to achieve that, you need to maintain a list of all opened sockets (the server socket and the sockets to current clients). You then use the select() function to which you can give a list of sockets (file descriptors). With correct parameters, select() will wait until any event happen on one of the sockets.
You then must find the socket(s) which caused select() to exit and process the event(s). For the server socket, it can be a new client. For client sockets, it can be requests, termination notification, etc.
Regarding what you say in your question, I think you are not understanding the select() API very well. It is OK to have concurrent select() calls in different threads, as long as they are not waiting on the same sockets. Then if the clients are not doing anything, it doesn't prevent the server select() from working and accepting new clients.
You only need to give select() a timeout if you want to be able to do things even when clients are not doing anything. For example, you may have a timer to send periodic info to the clients. You then give select() a timeout corresponding to your first timer to expire, and process the expired timer when select() returns (along with any other concurrent events).
I suggest you have a long read of the select manpage.

C++ non blocking socket select send too slow?

I have a program that maintains a list of "streaming" sockets. These sockets are configured to be non-blocking sockets.
Currently, I have used a list to store these streaming sockets. I have some data that I need to send to all these streaming sockets hence I used the iterator to loop through this list of streaming sockets and calling the send_TCP_NB function below:
The issue is that my own program buffer, which stores the data before it is handed to this send_TCP_NB function, slowly decreases in free size, indicating that sending is slower than the rate at which data is put into the buffer. Data is put into the program buffer at about 1000 items per second, and each item is quite small, about 100 bytes.
Hence, I am not sure whether my send_TCP_NB function is working efficiently or correctly:
int send_TCP_NB(int cs, char data[], int data_length) {
    bool sent = false;
    FD_ZERO(&write_flags);    // initialize the writer socket set
    FD_SET(cs, &write_flags); // set the write notification for the socket based on the current state of the buffer
    int status;
    int err;
    struct timeval waitd;     // set the time limit for waiting
    waitd.tv_sec = 0;
    waitd.tv_usec = 1000;
    err = select(cs+1, NULL, &write_flags, NULL, &waitd);
    if (err == 0)
    {
        // time limit expired
        printf("Time limit expired!\n");
        return 0; // send failed
    }
    else
    {
        while (!sent)
        {
            if (FD_ISSET(cs, &write_flags))
            {
                FD_CLR(cs, &write_flags);
                status = send(cs, data, data_length, 0);
                sent = true;
            }
        }
        int nError = WSAGetLastError();
        if (nError != WSAEWOULDBLOCK && nError != 0)
        {
            printf("Error sending non blocking data\n");
            return 0;
        }
        else
        {
            if (nError == WSAEWOULDBLOCK)
            {
                printf("%d\n", nError);
            }
            return 1;
        }
    }
}
One thing that would help is if you thought out exactly what this function is supposed to do. What it actually does is probably not what you wanted, and has some bad features.
The major features of what it does that I've noticed are:
Modify some global state
Wait (up to 1 millisecond) for the write buffer to have some empty space
Abort if the buffer is still full
Send 1 or more bytes on the socket (ignoring how much was sent)
If there was an error (including the send decided it would have blocked despite the earlier check), obtain its value. Otherwise, obtain a random error value
Possibly print something to screen, depending on the value obtained
Return 0 or 1, depending on the error value.
Comments on these points:
Why is write_flags global?
Did you really intend to block in this function?
This is probably fine
Surely you care how much of the data was sent?
I do not see anything in the documentation that suggests that this will be zero if send succeeds
If you cleared up what the actual intent of this function was, it would probably be much easier to ensure that this function actually fulfills that intent.
That said
I have some data that I need to send to all these streaming sockets
What precisely is your need?
If your need is that the data must be sent before proceeding, then using a non-blocking write is inappropriate*, since you're going to have to wait until you can write the data anyways.
If your need is that the data must be sent sometime in the future, then your solution is missing a very critical piece: you need to create a buffer for each socket which holds the data that needs to be sent, and then you periodically need to invoke a function that checks the sockets to try writing whatever it can. If you spawn a new thread for this latter purpose, this is the sort of thing select is very useful for, since you can make that new thread block until it is able to write something. However, if you don't spawn a new thread and just periodically invoke a function from the main thread to check, then you don't need to bother. (just write what you can to everything, even if it's zero bytes)
*: At least, it is a very premature optimization. There are some edge cases where you could get slightly more performance by using the non-blocking writes intelligently, but if you don't understand what those edge cases are and how the non-blocking writes would help, then guessing at it is unlikely to get good results.
EDIT: as another answer implied, this is something the operating system is good at anyways. Rather than try to write your own code to manage this, if you find your socket buffers filling up, then make the system buffers larger. And if they're still filling up, you should really give serious thought to the idea that your program needs to block anyways, so that it stops sending data faster than the other end can handle it. i.e. just use ordinary blocking sends for all of your data.
Some general advice:
Keep in mind you are multiplying data: if you get 1 MB/s in, you output N MB/s with N clients. Are you sure your network card can take it? It gets worse with smaller packets, as you get more per-packet overhead. You may want to consider broadcasting.
You are using non-blocking sockets, but you block while they are not free. If you want to be non-blocking, it is better to discard the packet immediately if the socket is not ready.
What would be better is to select() more than one socket at once: do everything you are doing, but for all the sockets that are available. You write to each "ready" socket, then repeat while there are sockets that are not ready. This way you proceed with the available sockets first, and with some luck the busy sockets will become available in the meantime.
The while (!sent) loop is useless and probably buggy: since you are checking only one socket, FD_ISSET will always be true, and it is wrong to check FD_ISSET again after an FD_CLR.
Keep in mind that your OS has internal buffers for the sockets, and that there are ways to extend them (not easy on Linux, though; to get large values you need to do some configuration as root).
There are some socket libraries that will probably work better than what you can implement in a reasonable time (boost::asio and zmq for the ones I know).
If you need to implement it yourself, (i.e. because for instance zmq has its own packet format), consider using a threadpool library.
EDIT:
Sleeping 1 millisecond is probably a bad idea. Your thread will probably get descheduled and it will take much more than that before you get some CPU time again.
This is just a horrible way to do things. The select serves no purpose but to waste time. If the send is non-blocking, it can mangle data on a partial send. If it's blocking, you still waste arbitrarily much time waiting for one receiver.
You need to pick a sensible I/O strategy. Here is one: Set all sockets non-blocking. When you need to send data to a socket, just call write. If all the data writes, lovely. If not, save the portion of data that wasn't sent for later and add the socket to your write set. When you have nothing else to do, call select. If you get a hit on any socket in your write set, write as many bytes as you can from what you saved. If you write all of them, remove that socket from the write set.
(If you need to write to a data that's already in your write set, just add the data to the saved data to be sent. You may need to close the connection if too much data gets buffered.)
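A minimal sketch of that strategy on POSIX non-blocking sockets; the names pending, send_or_queue, and flush_ready are illustrative, not from the original answer:
#include <sys/select.h>
#include <unistd.h>
#include <iterator>
#include <map>
#include <string>

std::map<int, std::string> pending;  // fd -> bytes not yet written

// Try to write immediately; queue whatever the socket would not take.
void send_or_queue(int fd, const char* data, size_t len) {
    ssize_t n = 0;
    if (pending[fd].empty())                    // nothing queued: try a direct write
        n = write(fd, data, len);
    if (n < 0) n = 0;                           // EWOULDBLOCK etc.: queue everything
    if ((size_t)n < len)
        pending[fd].append(data + n, len - n);  // keep the unsent tail, in order
    if (pending[fd].empty()) pending.erase(fd);
}

// Call when idle: block until some queued socket is writable, then drain it.
void flush_ready() {
    if (pending.empty()) return;
    fd_set wfds;
    FD_ZERO(&wfds);
    int maxfd = -1;
    for (const auto& p : pending) {
        FD_SET(p.first, &wfds);
        if (p.first > maxfd) maxfd = p.first;
    }
    if (select(maxfd + 1, NULL, &wfds, NULL, NULL) <= 0) return;
    for (auto it = pending.begin(); it != pending.end(); ) {
        if (FD_ISSET(it->first, &wfds)) {
            ssize_t n = write(it->first, it->second.data(), it->second.size());
            if (n > 0) it->second.erase(0, (size_t)n);
        }
        it = it->second.empty() ? pending.erase(it) : std::next(it);
    }
}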
A better idea might be to use a library that already does all these things. Boost::asio is a good one.
You are calling select() before calling send(). Do it the other way around: call select() only if send() reports WSAEWOULDBLOCK, e.g.:
int send_TCP_NB(int cs, char data[], int data_length)
{
    int status;
    int err;
    struct timeval waitd;
    char *data_ptr = data;

    while (data_length > 0)
    {
        status = send(cs, data_ptr, data_length, 0);
        if (status > 0)
        {
            data_ptr += status;
            data_length -= status;
            continue;
        }

        err = WSAGetLastError();
        if (err != WSAEWOULDBLOCK)
        {
            printf("Error sending non blocking data\n");
            return 0; // send failed
        }

        FD_ZERO(&write_flags);
        FD_SET(cs, &write_flags); // set the write notification for the socket based on the current state of the buffer
        waitd.tv_sec = 0;
        waitd.tv_usec = 1000;

        status = select(cs+1, NULL, &write_flags, NULL, &waitd);
        if (status > 0)
            continue;

        if (status == 0)
            printf("Time limit expired!\n");
        else
            printf("Error waiting for time limit!\n");
        return 0; // send failed
    }
    return 1;
}

Calculating socket upload speed

I'm wondering if anyone knows how to calculate the upload speed of a Berkeley socket in C++. My send call isn't blocking and takes 0.001 seconds to send 5 megabytes of data, but takes a while to recv the response (so I know it's uploading).
This is a TCP socket to a HTTP server and I need to asynchronously check how many bytes of data have been uploaded / are remaining. However, I can't find any API functions for this in Winsock, so I'm stumped.
Any help would be greatly appreciated.
EDIT: I've found the solution, and will be posting as an answer as soon as possible!
EDIT 2: Proper solution added as answer, will be added as solution in 4 hours.
I solved my issue thanks to bdolan's suggestion to reduce SO_SNDBUF. However, note that this code requires Winsock 2 (for overlapped sockets and WSASend). In addition, your SOCKET handle must have been created similarly to:
SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0, WSA_FLAG_OVERLAPPED);
Note the WSA_FLAG_OVERLAPPED flag as the final parameter.
In this answer I will go through the stages of uploading data to a TCP server, tracking each upload chunk and its completion status. This concept requires splitting your upload buffer into chunks (minimal modification to existing code required), uploading it piece by piece, and then tracking each chunk.
My code flow
Global variables
Your code document must have the following global variables:
#define UPLOAD_CHUNK_SIZE 4096
int g_nUploadChunks = 0;
int g_nChunksCompleted = 0;
WSAOVERLAPPED *g_pSendOverlapped = NULL;
int g_nBytesSent = 0;
float g_flLastUploadTimeReset = 0.0f;
Note: in my tests, decreasing UPLOAD_CHUNK_SIZE results in increased upload-speed accuracy but decreases overall upload speed; increasing it does the opposite. 4 kilobytes (4096 bytes) was a good compromise for a file ~500 kB in size.
Callback function
This function increments the bytes sent and chunks completed variables (called after a chunk has been completely uploaded to the server)
void CALLBACK SendCompletionCallback(DWORD dwError, DWORD cbTransferred,
                                     LPWSAOVERLAPPED lpOverlapped, DWORD dwFlags)
{
    g_nChunksCompleted++;
    g_nBytesSent += cbTransferred;
}
Prepare socket
Initially, the socket must be prepared by reducing SO_SNDBUF to 0.
Note: In my tests, any value greater than 0 will result in undesirable behaviour.
int nSndBuf = 0;
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char*)&nSndBuf, sizeof(nSndBuf));
Create WSAOVERLAPPED array
An array of WSAOVERLAPPED structures must be created to hold the overlapped status of all of our upload chunks. To do this I simply:
// Calculate the amount of upload chunks we will have to create.
// nDataBytes is the size of data you wish to upload
g_nUploadChunks = ceil(nDataBytes / float(UPLOAD_CHUNK_SIZE));
// Overlapped array, should be delete'd after all uploads have completed
g_pSendOverlapped = new WSAOVERLAPPED[g_nUploadChunks];
memset(g_pSendOverlapped, 0, sizeof(WSAOVERLAPPED) * g_nUploadChunks);
Upload data
All of the data that needs to be send, for example purposes, is held in a variable called pszData. Then, using WSASend, the data is sent in blocks defined by the constant, UPLOAD_CHUNK_SIZE.
WSABUF dataBuf;
DWORD dwBytesSent = 0;
int err;
int i, j;

for (i = 0, j = 0; i < nDataBytes; i += UPLOAD_CHUNK_SIZE, j++)
{
    int nTransferBytes = min(nDataBytes - i, UPLOAD_CHUNK_SIZE);
    dataBuf.buf = &pszData[i];
    dataBuf.len = nTransferBytes;

    // Now upload the data
    int rc = WSASend(sock, &dataBuf, 1, &dwBytesSent, 0, &g_pSendOverlapped[j], SendCompletionCallback);
    if ((rc == SOCKET_ERROR) && (WSA_IO_PENDING != (err = WSAGetLastError())))
    {
        fprintf(stderr, "WSASend failed: %d\n", err);
        exit(EXIT_FAILURE);
    }
}
The waiting game
Now we can do whatever we wish while all of the chunks upload.
Note: the thread which called WSASend must regularly be put into an alertable state so that our 'transfer completed' callback (SendCompletionCallback) is dequeued from the APC (Asynchronous Procedure Call) list.
In my code, I continuously looped until g_nUploadChunks == g_nChunksCompleted. This is to show the end-user upload progress and speed (can be modified to show estimated completion time, elapsed time, etc.)
Note 2: this code uses Plat_FloatTime as a second counter, replace this with whatever second timer your code uses (or adjust accordingly)
g_flLastUploadTimeReset = Plat_FloatTime();

// Clear the line on the screen with some default data
printf("(0 chunks of %d) Upload speed: ???? KiB/sec", g_nUploadChunks);

// Keep looping until ALL upload chunks have completed
while (g_nChunksCompleted < g_nUploadChunks)
{
    // Wait for 10ms so then we aren't repeatedly updating the screen
    SleepEx(10, TRUE);

    // Update chunk count
    printf("\r(%d chunks of %d) ", g_nChunksCompleted, g_nUploadChunks);

    // Not enough time passed?
    if (g_flLastUploadTimeReset + 1 > Plat_FloatTime())
        continue;

    // Reset timer
    g_flLastUploadTimeReset = Plat_FloatTime();

    // Calculate how many kibibytes have been transmitted in the last second
    float flByteRate = g_nBytesSent / 1024.0f;
    printf("Upload speed: %.2f KiB/sec", flByteRate);

    // Reset byte count
    g_nBytesSent = 0;
}

// Delete overlapped data (not used anymore)
delete [] g_pSendOverlapped;

// Note that the transfer has completed
Msg("\nTransfer completed successfully!\n");
Conclusion
I really hope this helps somebody in the future who wants to calculate upload speed on their TCP sockets without any server-side modifications. I have no idea how detrimental to performance SO_SNDBUF = 0 is, although I'm sure a socket guru will point that out.
You can get a lower bound on the amount of data received and acknowledged by subtracting the value of the SO_SNDBUF socket option from the number of bytes you have written to the socket. This buffer may be adjusted using setsockopt, although in some cases the OS may choose a length smaller or larger than you specify, so you must re-check after setting it.
To get more precise than that, however, you must have the remote side inform you of progress, as winsock does not expose an API to retrieve the amount of data currently pending in the send buffer.
Alternately, you could implement your own transport protocol on UDP, but implementing rate control for such a protocol can be quite complex.
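A minimal sketch of the lower-bound estimate described above; 'totalBytesWritten' is an assumed counter that your send path maintains, not something from the original answer:
int sndbuf = 0;
int optlen = sizeof(sndbuf);
getsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char*)&sndbuf, &optlen);
// Anything beyond one send-buffer's worth can no longer be sitting in the
// local buffer, so it must have been received and acknowledged by the peer.
long long ackedAtLeast = totalBytesWritten - sndbuf;
if (ackedAtLeast < 0) ackedAtLeast = 0;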
Since you don't have control over the remote side and you want to do it in code, I'd suggest a very simple approximation. I assume a long-lived program/connection; one-shot uploads would be too skewed by ARP, DNS lookups, socket buffering, TCP slow start, etc.
Have two counters - length of the outstanding queue in bytes (OB), and number of bytes sent (SB):
increment OB by number of bytes to be sent every time you enqueue a chunk for upload,
decrement OB and increment SB by the number returned from send(2) (modulo -1 cases),
on a timer sample both OB and SB - either store them, log them, or compute running average,
compute outstanding bytes a second/minute/whatever, same for sent bytes.
Network stack does buffering and TCP does retransmission and flow control, but that doesn't really matter. These two counters will tell you the rate your app produces data with, and the rate it is able to push it to the network. It's not the method to find out the real link speed, but a way to keep useful indicators about how good the app is doing.
If the data production rate is below the network output rate, everything is fine. If it's the other way around and the network cannot keep up with the app, there's a problem: you need either a faster network, a slower app, or a different design.
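A minimal sketch of the two-counter bookkeeping, assuming a 1-second sampling timer; the names ob, sb, and sample_rates are illustrative:
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <sys/socket.h>

std::atomic<long> ob{0};   // outstanding bytes: enqueued but not yet sent
std::atomic<long> sb{0};   // bytes accepted by send() since the last sample

void enqueue_chunk(size_t len) { ob += (long)len; }

ssize_t send_chunk(int fd, const char* buf, size_t len) {
    ssize_t n = send(fd, buf, len, 0);
    if (n > 0) { ob -= n; sb += n; }  // modulo the -1 error case
    return n;
}

// Invoked from a 1-second timer: report, then reset the sent-byte counter.
void sample_rates() {
    printf("outstanding = %ld B, send rate = %ld B/s\n", ob.load(), sb.exchange(0));
}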
For one-time experiments just take periodic snapshots of netstat -sp tcp output (or whatever that is on Windows) and calculate the send-rate manually.
Hope this helps.
If your app uses packet headers like
0001234DT
where 000123 is the packet length for a single packet, you can consider using MSG_PEEK + recv() to get the length of the packet before you actually read it with recv().
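A hedged sketch of that MSG_PEEK idea, assuming a fixed 9-byte header whose leading six digits carry the length (the exact layout in the answer's example is ambiguous, so treat the widths as illustrative):
char hdr[16];
int n = recv(sock, hdr, 9, MSG_PEEK);  /* inspect the header, leave it queued */
if (n == 9) {
    char lenField[7] = {0};
    memcpy(lenField, hdr, 6);          /* assumed 6-digit ASCII length prefix */
    int pktLen = atoi(lenField);
    /* once pktLen + 9 bytes are available, recv() the whole packet for real */
}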
The problem is that send() is NOT doing what you think: its output is buffered by the kernel.
int flag;
socklen_t sz = sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &flag, &sz));
fprintf(stdout, "%s: listener socket send buffer = %d\n", now(), flag);
sz = sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &flag, &sz));
fprintf(stdout, "%s: listener socket recv buffer = %d\n", now(), flag);
See what these show for you.
When you recv() on a non-blocking socket that has data, it normally does not have megabytes of data parked in the buffer ready to recv. In my experience the socket typically holds ~1500 bytes of data per recv. Since you are probably reading on a blocking socket, it takes a while for the recv() to complete.
Socket buffer size is the probably single best predictor of socket throughput. setsockopt() lets you alter socket buffer size, up to a point. Note: these buffers are shared among sockets in a lot of OSes like Solaris. You can kill performance by twiddling these settings too much.
Also, I don't think you are measuring what you think you are measuring. The real efficiency of send() is the measure of throughput on the recv() end. Not the send() end.
IMO.