Calculating socket upload speed


I'm wondering if anyone knows how to calculate the upload speed of a Berkeley socket in C++. My send call isn't blocking and takes 0.001 seconds to send 5 megabytes of data, but takes a while to recv the response (so I know it's uploading).
This is a TCP socket to an HTTP server and I need to asynchronously check how many bytes of data have been uploaded / are remaining. However, I can't find any API functions for this in Winsock, so I'm stumped.
Any help would be greatly appreciated.
EDIT: I've found the solution, and will be posting as an answer as soon as possible!
EDIT 2: Proper solution added as answer, will be added as solution in 4 hours.

I solved my issue thanks to bdolan's suggestion to reduce SO_SNDBUF. Note that this code requires Winsock 2 (for overlapped sockets and WSASend). In addition, your SOCKET handle must have been created similarly to:
SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0, WSA_FLAG_OVERLAPPED);
Note the WSA_FLAG_OVERLAPPED flag as the final parameter.
In this answer I will go through the stages of uploading data to a TCP server and tracking each upload chunk and its completion status. This concept requires splitting your upload buffer into chunks (minimal existing code modification required) and uploading it piece by piece, then tracking each chunk.
My code flow
Global variables
Your code document must have the following global variables:
#define UPLOAD_CHUNK_SIZE 4096
int g_nUploadChunks = 0;
int g_nChunksCompleted = 0;
WSAOVERLAPPED *g_pSendOverlapped = NULL;
int g_nBytesSent = 0;
float g_flLastUploadTimeReset = 0.0f;
Note: in my tests, decreasing UPLOAD_CHUNK_SIZE results in increased upload speed accuracy, but decreases overall upload speed. Increasing UPLOAD_CHUNK_SIZE results in decreased upload speed accuracy, but increases overall upload speed. 4 kilobytes (4096 bytes) was a good compromise for a file ~500kB in size.
Callback function
This function increments the bytes sent and chunks completed variables (called after a chunk has been completely uploaded to the server)
void CALLBACK SendCompletionCallback(DWORD dwError, DWORD cbTransferred, LPWSAOVERLAPPED lpOverlapped, DWORD dwFlags)
{
g_nChunksCompleted++;
g_nBytesSent += cbTransferred;
}
Prepare socket
Initially, the socket must be prepared by reducing SO_SNDBUF to 0.
Note: In my tests, any value greater than 0 will result in undesirable behaviour.
int nSndBuf = 0;
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char*)&nSndBuf, sizeof(nSndBuf));
Create WSAOVERLAPPED array
An array of WSAOVERLAPPED structures must be created to hold the overlapped status of all of our upload chunks. To do this I simply:
// Calculate the amount of upload chunks we will have to create.
// nDataBytes is the size of data you wish to upload
g_nUploadChunks = ceil(nDataBytes / float(UPLOAD_CHUNK_SIZE));
// Overlapped array, should be delete'd after all uploads have completed
g_pSendOverlapped = new WSAOVERLAPPED[g_nUploadChunks];
memset(g_pSendOverlapped, 0, sizeof(WSAOVERLAPPED) * g_nUploadChunks);
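As a side note on the chunk count above: the float-based ceil works for small uploads, but a float only has 24 bits of mantissa, so very large values of nDataBytes could round. A minimal sketch of an integer-only alternative, assuming a hypothetical helper name ChunkCount (not part of the original code):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of chunks needed to cover data_bytes,
// computed with integer arithmetic only (no float rounding).
inline std::size_t ChunkCount(std::size_t data_bytes, std::size_t chunk_size) {
    assert(chunk_size > 0);  // a zero chunk size would divide by zero
    return data_bytes / chunk_size + (data_bytes % chunk_size != 0 ? 1 : 0);
}
```

With a chunk size of 4096, a 500000-byte upload would need 123 chunks, the last one being partial.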
Upload data
All of the data that needs to be sent, for example purposes, is held in a variable called pszData. Then, using WSASend, the data is sent in blocks of the size defined by the constant UPLOAD_CHUNK_SIZE.
WSABUF dataBuf;
DWORD dwBytesSent = 0;
int err;
int i, j;
for(i = 0, j = 0; i < nDataBytes; i += UPLOAD_CHUNK_SIZE, j++)
{
int nTransferBytes = min(nDataBytes - i, UPLOAD_CHUNK_SIZE);
dataBuf.buf = &pszData[i];
dataBuf.len = nTransferBytes;
// Now upload the data
int rc = WSASend(sock, &dataBuf, 1, &dwBytesSent, 0, &g_pSendOverlapped[j], SendCompletionCallback);
if ((rc == SOCKET_ERROR) && (WSA_IO_PENDING != (err = WSAGetLastError())))
{
fprintf(stderr, "WSASend failed: %d\n", err);
exit(EXIT_FAILURE);
}
}
The waiting game
Now we can do whatever we wish while all of the chunks upload.
Note: the thread which called WSASend must be regularly put into an alertable state, so that our 'transfer completed' callback (SendCompletionCallback) is dequeued from the APC (Asynchronous Procedure Call) queue.
In my code, I continuously looped until g_nUploadChunks == g_nChunksCompleted. This is to show the end-user upload progress and speed (can be modified to show estimated completion time, elapsed time, etc.)
Note 2: this code uses Plat_FloatTime as a second counter, replace this with whatever second timer your code uses (or adjust accordingly)
g_flLastUploadTimeReset = Plat_FloatTime();
// Clear the line on the screen with some default data
printf("(0 chunks of %d) Upload speed: ???? KiB/sec", g_nUploadChunks);
// Keep looping until ALL upload chunks have completed
while(g_nChunksCompleted < g_nUploadChunks)
{
// Wait for 10ms so then we aren't repeatedly updating the screen
SleepEx(10, TRUE);
// Update chunk count
printf("\r(%d chunks of %d) ", g_nChunksCompleted, g_nUploadChunks);
// Not enough time passed?
if(g_flLastUploadTimeReset + 1 > Plat_FloatTime())
continue;
// Reset timer
g_flLastUploadTimeReset = Plat_FloatTime();
// Calculate how many kibibytes have been transmitted in the last second
float flByteRate = g_nBytesSent/1024.0f;
printf("Upload speed: %.2f KiB/sec", flByteRate);
// Reset byte count
g_nBytesSent = 0;
}
// Delete overlapped data (not used anymore)
delete [] g_pSendOverlapped;
// Note that the transfer has completed
Msg("\nTransfer completed successfully!\n");
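Plat_FloatTime is specific to my code base. If you need a portable replacement, the following is a minimal sketch using std::chrono::steady_clock. The names SecondsNow and RateMeter are hypothetical, and unlike the loop above this version divides by the actual elapsed time instead of assuming exactly one second has passed:

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for Plat_FloatTime(): seconds from a monotonic clock.
inline double SecondsNow() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Hypothetical helper: accumulates sent bytes and reports KiB/s per interval.
class RateMeter {
public:
    explicit RateMeter(double start_seconds, double interval_seconds = 1.0)
        : last_reset_(start_seconds), interval_(interval_seconds) {}

    // Call from the completion callback with cbTransferred.
    void AddBytes(std::size_t n) { bytes_ += n; }

    // Returns true and writes the rate once at least one interval has elapsed;
    // the byte counter and the timer are then reset.
    bool Sample(double now_seconds, double* kib_per_sec) {
        const double elapsed = now_seconds - last_reset_;
        if (elapsed < interval_) return false;
        *kib_per_sec = (bytes_ / 1024.0) / elapsed;
        bytes_ = 0;
        last_reset_ = now_seconds;
        return true;
    }

private:
    double last_reset_;
    double interval_;
    std::size_t bytes_ = 0;
};
```

In the wait loop you would construct it with RateMeter meter(SecondsNow()) and call meter.Sample(SecondsNow(), &rate) after each SleepEx.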
Conclusion
I really hope this helps somebody in the future who wishes to calculate upload speed on their TCP sockets without any server-side modifications. I have no idea how detrimental SO_SNDBUF = 0 is to performance, although I'm sure a socket guru will point that out.

You can get a lower bound on the amount of data received and acknowledged by subtracting the value of the SO_SNDBUF socket option from the number of bytes you have written to the socket. This buffer may be adjusted using setsockopt, although in some cases the OS may choose a length smaller or larger than you specify, so you must re-check after setting it.
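A minimal sketch of that subtraction, assuming the stack never holds more than SO_SNDBUF bytes of unsent data (the helper name AckedLowerBound is hypothetical, and sndbuf_bytes should be the value read back with getsockopt):

```cpp
#include <cstdint>

// Hypothetical helper: lower bound on the bytes that have left the local
// send buffer, given the total written so far and the SO_SNDBUF value.
// Clamped at zero while everything written could still be sitting in the buffer.
inline std::uint64_t AckedLowerBound(std::uint64_t bytes_written,
                                     std::uint64_t sndbuf_bytes) {
    return bytes_written > sndbuf_bytes ? bytes_written - sndbuf_bytes : 0;
}
```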
To get more precise than that, however, you must have the remote side inform you of progress, as winsock does not expose an API to retrieve the amount of data currently pending in the send buffer.
Alternately, you could implement your own transport protocol on UDP, but implementing rate control for such a protocol can be quite complex.

Since you don't have control over the remote side, and you want to do it in the code, I'd suggest doing a very simple approximation. I assume a long-lived program/connection. One-shot uploads would be too skewed by ARP, DNS lookups, socket buffering, TCP slow start, etc.
Have two counters - length of the outstanding queue in bytes (OB), and number of bytes sent (SB):
increment OB by number of bytes to be sent every time you enqueue a chunk for upload,
decrement OB and increment SB by the number returned from send(2) (modulo -1 cases),
on a timer sample both OB and SB - either store them, log them, or compute running average,
compute outstanding bytes a second/minute/whatever, same for sent bytes.
The network stack does buffering and TCP does retransmission and flow control, but that doesn't really matter. These two counters will tell you the rate at which your app produces data, and the rate at which it is able to push it to the network. It's not a method for finding the real link speed, but a way to keep useful indicators of how well the app is doing.
If the data production rate is below the network output rate, everything is fine. If it's the other way around and the network cannot keep up with the app, there's a problem: you need either a faster network, a slower app, or a different design.
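The two counters could be sketched roughly like this. All names here (UploadCounters, RateSample, ComputeRates) are made up for illustration, and a running total of enqueued bytes is kept alongside OB so the production rate can be computed from two snapshots:

```cpp
#include <cstdint>

// Hypothetical bookkeeping for the OB/SB scheme described above.
struct UploadCounters {
    std::uint64_t outstanding = 0;     // OB: bytes queued but not yet accepted by send()
    std::uint64_t sent = 0;            // SB: bytes accepted by send() so far
    std::uint64_t enqueued_total = 0;  // total bytes ever queued (for the production rate)

    void OnEnqueue(std::uint64_t n) {
        outstanding += n;
        enqueued_total += n;
    }

    // Pass the return value of send(); errors (-1) and zero are ignored here.
    void OnSend(long long send_result) {
        if (send_result <= 0) return;
        std::uint64_t n = static_cast<std::uint64_t>(send_result);
        if (n > outstanding) n = outstanding;  // defensive clamp
        outstanding -= n;
        sent += n;
    }
};

struct RateSample {
    double produced_per_sec;  // bytes/s the app queued
    double sent_per_sec;      // bytes/s handed to the network stack
};

// Rates between two snapshots taken elapsed_seconds apart (must be > 0).
inline RateSample ComputeRates(const UploadCounters& prev,
                               const UploadCounters& cur,
                               double elapsed_seconds) {
    RateSample r;
    r.produced_per_sec = (cur.enqueued_total - prev.enqueued_total) / elapsed_seconds;
    r.sent_per_sec = (cur.sent - prev.sent) / elapsed_seconds;
    return r;
}
```

If produced_per_sec stays above sent_per_sec over several samples, outstanding keeps growing and the network is the bottleneck.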
For one-time experiments just take periodic snapshots of netstat -sp tcp output (or whatever that is on Windows) and calculate the send-rate manually.
Hope this helps.

If your app uses packet headers like
0001234DT
where 000123 is the packet length for a single packet, you can consider using MSG_PEEK + recv() to get the length of the packet before you actually read it with recv().
The problem is send() is NOT doing what you think - it is buffered by the kernel.
sz=sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &flag, &sz));
fprintf(STDOUT, "%s: listener socket send buffer = %d\n", now(), flag);
sz=sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &flag, &sz));
fprintf(STDOUT, "%s: listener socket recv buffer = %d\n", now(), flag);
See what these show for you.
When you recv on a NON-blocking socket that has data, it normally does not have MB of data parked in the buffer ready to recv. In my experience, the socket mostly has ~1500 bytes of data per recv. Since you are probably reading on a blocking socket, it takes a while for the recv() to complete.
Socket buffer size is probably the single best predictor of socket throughput. setsockopt() lets you alter socket buffer size, up to a point. Note: these buffers are shared among sockets in a lot of OSes like Solaris. You can kill performance by twiddling these settings too much.
Also, I don't think you are measuring what you think you are measuring. The real efficiency of send() is the measure of throughput on the recv() end. Not the send() end.
IMO.

Related

Qt QTcpSocket Reading Data Overlap Causes Invalid TCP Behavior During High Bandwidth Reading and Writing

Summary: Some of the memory within the TCP socket appears to be overwritten by other incoming data.
Application:
A client/server system that utilizes TCP within Qt (QTcpSocket and QTcpServer). The client requests a frame from the server (just a simple string message), and the response (Server -> Client) consists of that frame (614400 bytes for testing purposes). Frame sizes are established in advance and are fixed.
Implementation Details:
From the guarantees of the TCP protocol (Server -> Client), I know that I should be able to read the 614400 bytes from the socket and that they are in order. If either of these two things fails, the connection must have failed.
Important Code:
Assuming the socket is connected.
This code requests a frame from the server. Known as the GetFrame() function.
// Prompt the server to send a frame over
if(socket->isWritable() && !is_receiving) { // Validate that socket is ready
is_receiving = true; // Forces only one request to go out at a time
qDebug() << "Getting frame from socket..." << image_no;
int written = SafeWrite((char*)"ReadyFrame"); // Writes then flushes the write buffer
if (written == -1) {
qDebug() << "Failed to write...";
return temp_frame.data();
}
this->SocketRead();
is_receiving = false;
}
qDebug() << image_no << "- Image Received";
image_no ++;
return temp_frame.data();
This code waits for the frame just requested to be read. This is the SocketRead() function
size_t byte_pos = 0;
qint64 bytes_read = 0;
do {
if (!socket->waitForReadyRead(500)) { // If it timed out return existing frame
if (!(socket->bytesAvailable() > 0)) {
qDebug() << "Timed Out" << byte_pos;
break;
}
}
bytes_read = socket->read((char*)temp_frame.data() + byte_pos, frame_byte_size - byte_pos);
if (bytes_read < 0) {
qDebug() << "Reading Failed" << bytes_read << errno;
break;
}
byte_pos += bytes_read;
} while (byte_pos < frame_byte_size && is_connected); // While we still have more pixels
qDebug() << "Finished Receiving Frame: " << byte_pos;
As shown in the code above, I read until the frame is fully received (where the number of bytes read is equal to the number of bytes in the frame).
The issue that I'm having is that the QTcpSocket read operation is skipping bytes in ways that are not in line with the guarantees of the TCP protocol. Since I skip bytes I end up not reaching the end of the while loop and just "Time Out". Why is this happening?
What I have done so far:
The data that the server sends is directly converted into uint16_t (short) integers which are used in other parts of the client. I have changed the server to simply output data that just counts up adding one for each number sent. Since the data type is uint16_t and the number of bytes exceeds that maximum number for that integer type, the int-16's will loop every 65535.
This is a data visualization software so this debugging configuration (on the client side) leads to something like this:
I have determined (and as you can see a little at the bottom of the graphic) that some bytes are being skipped. In the memory of temp_frame it is possible to see the exact point at which the memory skipped:
Under correct circumstances, this should count up sequentially.
From Wireshark and following this specific TCP connection I have determined that all of the bytes are in fact arriving (all 614400), and that all the numbers are in order (I used a Python script to ensure counting was sequential).
This is work on an open source project so this is the whole code base for the client.
Overall, I don't see how I could be doing something wrong in this solution, all I am doing is reading from the socket in the standard way.

Caveat: This isn't a definitive answer to your problem, but some things to try (it's too large for a comment).
With (e.g.) GigE, your data rate is ~100MB/s. With a [total] amount of kernel buffer space of 614400, this will be refilled ~175 times per second. IMO, this is still too small. When I've used SO_RCVBUF [for a commercial product], I've used a minimum of 8MB. This allows a wide(er) margin for task switch delays.
Try setting something huge like 100MB to eliminate this as a factor [during testing/bringup].
First, it's important to verify that the kernel and NIC driver can handle the throughput/latency.
You may be getting too many interrupts/second and the ISR prolog/epilog overhead may be too high. The NIC driver can implement polled vs. interrupt-driven operation with NAPI for ethernet cards.
See: https://serverfault.com/questions/241421/napi-vs-adaptive-interrupts
See: https://01.org/linux-interrupt-moderation
Your process/thread may not have high enough priority to be scheduled quickly.
You can use the R/T scheduler with sched_setscheduler, SCHED_RR, and a priority of (e.g.) 8. Note: going higher than 11 kills the system because at 12 and above you're at a higher priority than most internal kernel threads--not a good thing.
You may need to disable IRQ balancing and set the IRQ affinity to a single CPU core.
You can then set your input process/thread locked to that core [with sched_setaffinity and/or pthread_setaffinity].
You might need some sort of "zero copy" to bypass the kernel copying from its buffers into your userspace buffers.
You can mmap the kernel socket buffers with PACKET_MMAP. See: https://sites.google.com/site/packetmmap/
I'd be careful about the overhead of your qDebug output. It looks like an iostream type implementation. The overhead may be significant. It could be slowing things down significantly.
That is, you're not measuring the performance of your system. You're measuring the performance of your system plus the debugging code.
When I've had to debug/trace such things, I've used a [custom] "event" log implemented with an in-memory ring queue with a fixed number of elements.
Debug calls such as:
eventadd(EVENT_TYPE_RECEIVE_START,some_event_specific_data);
Here eventadd populates a fixed size "event" struct with the event type, event data, and a hires timestamp (e.g. struct timespec from clock_gettime(CLOCK_MONOTONIC,...)).
The overhead of each such call is quite low. The events are just stored in the event ring. Only the last N are remembered.
At some point, your program triggers a dump of this queue to a file and terminates.
This mechanism is similar to [and modeled on] a H/W logic analyzer. It is also similar to dtrace.
Here's a sample event element:
struct event {
long long evt_tstamp; // timestamp
int evt_type; // event type
int evt_data; // type specific data
};
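To make the idea concrete, here is a rough sketch of such a ring in C++. It is not the implementation I used: EventRing, Add, and Snapshot are illustrative names, std::chrono::steady_clock stands in for clock_gettime(CLOCK_MONOTONIC, ...), and it is not thread-safe, so with multiple threads you would need one ring per thread or some locking.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <vector>

struct Event {
    long long evt_tstamp;  // timestamp in nanoseconds (monotonic clock)
    int evt_type;          // event type
    int evt_data;          // type specific data
};

// Fixed-size ring: only the last N events are remembered.
template <std::size_t N>
class EventRing {
    static_assert(N > 0, "ring must hold at least one event");

public:
    // Cheap to call: one clock read and one struct store, no I/O.
    void Add(int type, int data) {
        using clock = std::chrono::steady_clock;
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                            clock::now().time_since_epoch()).count();
        ring_[next_ % N] = Event{static_cast<long long>(ns), type, data};
        ++next_;
    }

    // Copies out the remembered events, oldest first, e.g. for dumping to a file.
    std::vector<Event> Snapshot() const {
        std::vector<Event> out;
        const std::size_t count = next_ < N ? next_ : N;
        const std::size_t start = next_ - count;
        for (std::size_t i = 0; i < count; ++i) {
            out.push_back(ring_[(start + i) % N]);
        }
        return out;
    }

private:
    std::array<Event, N> ring_{};
    std::size_t next_ = 0;  // total number of events ever added
};
```

The point of the design is that all formatting and file output happens once, at dump time, so the traced code path is barely disturbed.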

What means blocking for boost::asio::write?

I'm using boost::asio::write() to write data from a buffer to a COM port. It's a serial port with a baud rate of 115200, which means (as far as my understanding goes) that I can effectively write 11520 bytes/s or 11.52 KB/s of data to the port.
Now I have quite a big chunk of data (10015 bytes) which I want to write. I think that this should take a little less than a second to actually write to the port. But boost::asio::write() already returns 300 microseconds after the call, reporting 10015 transferred bytes. I think this is impossible with that baud rate?
So my question is what is it actually doing? Really writing it to the port, or just some other kind of buffer maybe, which later writes it to the port.
I'd like the write() to only return after all the bytes have really been written to the port.
EDIT with code example:
The problem is that I always run into the timeout for the future/promise, because sending the message alone takes more than 100 ms, but I think the timer should only start after the last byte is sent. Because write() is supposed to block?
void serial::write(std::vector<uint8_t> message) {
//create new promise for the request
promise = new boost::promise<deque<uint8_t>>;
boost::unique_future<deque<uint8_t>> future = promise->get_future();
// --- Write message to serial port --- //
boost::asio::write(serial_,boost::asio::buffer(message));
//wait for data or timeout
if (future.wait_for(boost::chrono::milliseconds(100))==boost::future_status::timeout) {
cout << "ACK timeout!" << endl;
//delete pointer and set it to 0
delete promise;
promise=nullptr;
}
//delete pointer and set it to 0 after getting a message
delete promise;
promise=nullptr;
}
How can I achieve this?
Thanks!

In short, boost::asio::write() blocks until all data has been written to the stream; it does not block until all data has been transmitted. To wait until data has been transmitted, consider using tcdrain().
Each serial port has both a receive and transmit buffer within kernel space. This allows the kernel to buffer received data if a process cannot immediately read it from the serial port, and allows data written to a serial port to be buffered if the device cannot immediately transmit it. To block until the data has been transmitted, one could use tcdrain(serial_.native_handle()).
These kernel buffers allow for the write and read rates to exceed that of the transmit and receive rates. However, while the application may write data at a faster rate than the serial port can transmit, the kernel will transmit at the appropriate rates.
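To put numbers on the gap between "written" and "transmitted", here is a small back-of-the-envelope sketch. It assumes 10 bits on the wire per byte (8 data bits plus start and stop bit, i.e. 8N1 framing, which matches the 11520 bytes/s figure in the question); SerialTransmitSeconds is a hypothetical helper name.

```cpp
#include <cstddef>

// Hypothetical helper: time the UART needs to shift out `bytes` bytes.
// bits_per_frame = 10 assumes 8N1 framing (1 start + 8 data + 1 stop bit).
inline double SerialTransmitSeconds(std::size_t bytes, double baud,
                                    int bits_per_frame = 10) {
    return static_cast<double>(bytes) * bits_per_frame / baud;
}
```

For the 10015 bytes at 115200 baud from the question, this comes to roughly 0.87 seconds, so a write() that returns after 300 microseconds can only have copied the data into the kernel buffer. It also means a 100 ms ACK timeout started right after write() will expire long before the last byte leaves the port unless you wait for the drain first.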

What is the size of a socket send buffer in Windows?

Based on my understanding, each socket is associated with two buffers, a send buffer and a receive buffer, so when I call the send() function, what happens is that the data to send will be placed into the send buffer, and it is the responsibility of Windows now to send the content of this send buffer to the other end.
In a blocking socket, the send() function does not return until the entire data supplied to it has been placed into the send buffer.
So what is the size of the send buffer?
I performed the following test (sending 1 GB worth of data):
#include <stdio.h>
#include <WinSock2.h>
#pragma comment(lib, "ws2_32.lib")
#include <Windows.h>
int main()
{
// Initialize Winsock
WSADATA wsa;
WSAStartup(MAKEWORD(2, 2), &wsa);
// Create socket
SOCKET s = socket(AF_INET, SOCK_STREAM, 0);
//----------------------
// Connect to 192.168.1.7:12345
sockaddr_in address;
address.sin_family = AF_INET;
address.sin_addr.s_addr = inet_addr("192.168.1.7");
address.sin_port = htons(12345);
connect(s, (sockaddr*)&address, sizeof(address));
//----------------------
// Create 1 GB buffer ("AAAAAA...A")
char *buffer = new char[1073741824];
memset(buffer, 0x41, 1073741824);
// Send buffer
int i = send(s, buffer, 1073741824, 0);
printf("send() has returned\nReturn value: %d\nWSAGetLastError(): %d\n", i, WSAGetLastError());
//----------------------
getchar();
return 0;
}
Output:
send() has returned
Return value: 1073741824
WSAGetLastError(): 0
send() has returned immediately. Does this mean that the send buffer has a size of at least 1 GB?
This is some information about the test:
I am using a TCP blocking socket.
I have connected to a LAN machine.
Client Windows version: Windows 7 Ultimate 64-bit.
Server Windows version: Windows XP SP2 32-bit (installed on Virtual Box).
Edit: I have also attempted to connect to Google (173.194.116.18:80) and I got the same results.
Edit 2: I have discovered something strange, setting the send buffer to a value between 64 KB and 130 KB will make send() work as expected!
int send_buffer = 64 * 1024; // 64 KB
int send_buffer_sizeof = sizeof(int);
setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)send_buffer, send_buffer_sizeof);
Edit 3: It turned out (thanks to Harry Johnston) that I have used setsockopt() in an incorrect way, this is how it is used:
setsockopt(s, SOL_SOCKET, SO_SNDBUF, (char*)&send_buffer, send_buffer_sizeof);
Setting the send buffer to a value between 64 KB and 130 KB does not make send() work as expected, but rather setting the send buffer to 0 makes it block (this is what I noticed anyway, I don't have any documentation for this behavior).
So my question now is: where can I find documentation on how send() (and maybe other socket operations) work under Windows?

After investigating this subject, this is what I believe to be the correct answer:
When calling send(), there are two things that could happen:
If the amount of pending data is below SO_SNDBUF, then send() returns immediately (and it does not matter whether you are sending 5 KB or 500 MB).
If the amount of pending data is above or equal to SO_SNDBUF, then send() blocks until enough data has been sent to bring the pending data back below SO_SNDBUF.
Note that this behavior is only applicable to Windows sockets, and not to POSIX sockets. I think that POSIX sockets only use one fixed sized send buffer (correct me if I'm wrong).
Now back to your main question "What is the size of a socket send buffer in Windows?". I guess if you have enough memory it could grow beyond 1 GB if necessary (not sure what is the maximum limit though).

I can reproduce this behaviour, and using Resource Monitor it is easy to see that Windows does indeed allocate 1GB of buffer space when the send() occurs.
An interesting feature is that if you do a second send immediately after the first one, that call does not return until both sends have completed. The buffer space from the first send is released once that send has completed, but the second send() continues to block until all the data has been transferred.
I suspect the difference in behaviour is because the second call to send() was already blocking when the first send completed. The third call to send() returns immediately (and 1GB of buffer space is allocated) just as the first one did, and so on, alternating.
So I conclude that the answer to the question ("how large are the send buffers?") is "as large as Windows sees fit". The upshot is that, in order to avoid exhausting the system memory, you should probably restrict blocking sends to no more than a few hundred megabytes.
Your call to setsockopt() is incorrect; the fourth argument is supposed to be a pointer to an integer, not an integer converted to a pointer. Once this is corrected, it turns out that setting the buffer size to zero causes send() to always block.
To summarize, the observed behaviour is that send() will return immediately provided:
there is enough memory to buffer all the provided data
there is not a send already in progress
the buffer size is not set to zero
Otherwise, it will return once the data has been sent.
KB214397 describes some of this - thanks Hans! In particular it describes that setting the buffer size to zero disables Winsock buffering, and comments that "If necessary, Winsock can buffer significantly more than the SO_SNDBUF buffer size."
(The completion notification described does not quite match up to the observed behaviour, depending I guess on how you interpret "previously buffered send". But it's close.)
Note that apart from the risk of inadvertently exhausting the system memory, none of this should matter. If you really need to know whether the code at the other end has received all your data yet, the only reliable way to do that is to get it to tell you.

In a blocking socket, the send() function does not return until the entire data supplied to it has been placed into the send buffer.
That is not guaranteed. If there is available buffer space, but not enough space for the entire data, the socket can (and usually will) accept whatever data it can and ignore the rest. The return value of send() tells you how many bytes were actually accepted. You have to call send() again to send the remaining data.
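A minimal sketch of such a retry loop follows. To keep it independent of any particular socket API, it takes a callable with the shape of send() already bound to the socket; SendAll and send_fn are hypothetical names, not part of Winsock.

```cpp
#include <cstddef>

// Hypothetical helper: keeps calling send_fn until all of `data` is accepted.
// send_fn(ptr, len) must behave like send() bound to one socket: it returns
// the number of bytes accepted, or a value <= 0 on error.
// Returns the number of bytes accepted in total; less than `len` means that
// send_fn reported an error part-way through.
template <typename SendFn>
std::size_t SendAll(SendFn&& send_fn, const char* data, std::size_t len) {
    std::size_t total = 0;
    while (total < len) {
        const long long n = send_fn(data + total, len - total);
        if (n <= 0) break;  // error: caller inspects WSAGetLastError()/errno
        total += static_cast<std::size_t>(n);
    }
    return total;
}

// Example binding for a Winsock socket `s` (assumed to exist in your code):
//   SendAll([&](const char* p, std::size_t n) {
//               return send(s, p, static_cast<int>(n), 0);
//           }, buffer, size);
```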
So what is the size of the send buffer?
Use getsockopt() with the SO_SNDBUF option to find out.
Use setsockopt() with the SO_SNDBUF option to specify your own buffer size. However, the socket may impose a max cap on the value you specify. Use getsockopt() to find out what size was actually assigned.

How to receive more than 40Kb in C++ socket using read()

I am developing a client-server application (TCP) in Linux using C++. This application is in charge of testing the network performance.
The connection between client and server is established only once, and then data are transmitted/received using write()/read() with an own-defined protocol.
When data exceeds 40Kb I receive just a part of the data only once. (i.e. I receive about 48KB)
Please find down the relevant part of the code:
while (1) {
servMtx.lock();
...
serv_bytes = (byte *) malloc(size_bytes);
n = read(newsockfd, serv_bytes,size_bytes);
if (n != (int)size_bytes ) {
std::cerr << "No enough data available for msg. Received just: " << n << std::endl;
continue;
}
receivedBytes += n + size_header_bytes + sizeof(ssize_t);
....
}
I increased the kernel buffer size to become 1MB using:
int buffsize = 1024*1024;
setsockopt(newsockfd, SOL_SOCKET, SO_RCVBUF, &buffsize, sizeof(buffsize));
and modified sysctl variables too:
sysctl -w net.core.rmem_max=8388608;
sysctl -w net.core.wmem_max=8388608;
as mentioned on this How to recive more than 65000 bytes in C++ socket using recv() but nothing was changed. Also, I tried to change the package size to no avail.

You should read or recv in several chunks (in general; if you are unlucky, the "several" becomes "one"). So you need to manage your buffering and keep (and use) the count of received bytes.
So at some point, you'll code
int nbrecv = recv(s, buffer + off, bufsize, 0);
if (nbrecv>0) { off += nbrecv; bufsize -= nbrecv; }
and you probably should do that in your event loop (often around poll(2)...). It often happens that nbrecv is a lot less than bufsize, and you should handle that common case.
TCP does not guarantee that you'll get all the bytes in the same recv! It could depend on external factors (routing, network hardware, ...); it is a stream-oriented protocol, not a message-packet one. If your application wants messages, it should buffer the input and chunk that input into messages according to the content. Look at HTTP or SMTP: their messages have a well-defined boundary given by header information (Content-Length: in HTTP) or by an ending convention (a line with a single . in SMTP).
Please read carefully read(2), recv(2), socket(7), tcp(7), some sockets tutorial, Advanced Linux Programming.
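For a blocking socket where the message size is known in advance (as size_bytes is in the question), the loop could be sketched like this. ReadExact is a hypothetical helper name; it relies only on POSIX read(2) and is not suitable as-is for non-blocking sockets, where EAGAIN would need separate handling.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: reads exactly n bytes from a blocking descriptor.
// Returns n on success, a smaller count if the peer closed the connection
// early (EOF), or -1 on a read error.
ssize_t ReadExact(int fd, void* buf, std::size_t n) {
    char* p = static_cast<char*>(buf);
    std::size_t off = 0;
    while (off < n) {
        const ssize_t r = read(fd, p + off, n - off);
        if (r > 0) {
            off += static_cast<std::size_t>(r);  // partial read: keep going
        } else if (r == 0) {
            break;                               // EOF before n bytes arrived
        } else if (errno == EINTR) {
            continue;                            // interrupted by a signal: retry
        } else {
            return -1;                           // real error, see errno
        }
    }
    return static_cast<ssize_t>(off);
}
```

In the question's loop, n = ReadExact(newsockfd, serv_bytes, size_bytes) would then only differ from size_bytes on EOF or error, not merely because the data arrived in several segments.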

C++ non blocking socket select send too slow?

I have a program that maintains a list of "streaming" sockets. These sockets are configured to be non-blocking sockets.
Currently, I have used a list to store these streaming sockets. I have some data that I need to send to all these streaming sockets hence I used the iterator to loop through this list of streaming sockets and calling the send_TCP_NB function below:
The issue is that the free space in my own program buffer, which stores the data before it is passed to this send_TCP_NB function, slowly decreases, indicating that the send is slower than the rate at which data is put into the program buffer. Data is put into the program buffer at a rate of about 1000 items per second. Each item is quite small, about 100 bytes.
Hence, I am not sure whether my send_TCP_NB function is working efficiently or correctly.
int send_TCP_NB(int cs, char data[], int data_length) {
bool sent = false;
FD_ZERO(&write_flags); // initialize the writer socket set
FD_SET(cs, &write_flags); // set the write notification for the socket based on the current state of the buffer
int status;
int err;
struct timeval waitd; // set the time limit for waiting
waitd.tv_sec = 0;
waitd.tv_usec = 1000;
err = select(cs+1, NULL, &write_flags, NULL, &waitd);
if(err==0)
{
// time limit expired
printf("Time limit expired!\n");
return 0; // send failed
}
else
{
while(!sent)
{
if(FD_ISSET(cs, &write_flags))
{
FD_CLR(cs, &write_flags);
status = send(cs, data, data_length, 0);
sent = true;
}
}
int nError = WSAGetLastError();
if(nError != WSAEWOULDBLOCK && nError != 0)
{
printf("Error sending non blocking data\n");
return 0;
}
else
{
if(nError == WSAEWOULDBLOCK)
{
printf("%d\n", nError);
}
return 1;
}
}
}

One thing that would help is if you thought out exactly what this function is supposed to do. What it actually does is probably not what you wanted, and has some bad features.
The major features of what it does that I've noticed are:
Modify some global state
Wait (up to 1 millisecond) for the write buffer to have some empty space
Abort if the buffer is still full
Send 1 or more bytes on the socket (ignoring how much was sent)
If there was an error (including the send decided it would have blocked despite the earlier check), obtain its value. Otherwise, obtain a random error value
Possibly print something to screen, depending on the value obtained
Return 0 or 1, depending on the error value.
Comments on these points:
Why is write_flags global?
Did you really intend to block in this function?
This is probably fine
Surely you care how much of the data was sent?
I do not see anything in the documentation that suggests that this will be zero if send succeeds
If you cleared up what the actual intent of this function was, it would probably be much easier to ensure that this function actually fulfills that intent.
That said
I have some data that I need to send to all these streaming sockets
What precisely is your need?
If your need is that the data must be sent before proceeding, then using a non-blocking write is inappropriate*, since you're going to have to wait until you can write the data anyways.
If your need is that the data must be sent sometime in the future, then your solution is missing a very critical piece: you need to create a buffer for each socket which holds the data that needs to be sent, and then you periodically need to invoke a function that checks the sockets to try writing whatever it can. If you spawn a new thread for this latter purpose, this is the sort of thing select is very useful for, since you can make that new thread block until it is able to write something. However, if you don't spawn a new thread and just periodically invoke a function from the main thread to check, then you don't need to bother. (just write what you can to everything, even if it's zero bytes)
*: At least, it is a very premature optimization. There are some edge cases where you could get slightly more performance by using the non-blocking writes intelligently, but if you don't understand what those edge cases are and how the non-blocking writes would help, then guessing at it is unlikely to get good results.
EDIT: as another answer implied, this is something the operating system is good at anyways. Rather than try to write your own code to manage this, if you find your socket buffers filling up, then make the system buffers larger. And if they're still filling up, you should really give serious thought to the idea that your program needs to block anyways, so that it stops sending data faster than the other end can handle it. i.e. just use ordinary blocking sends for all of your data.

Some general advice:
Keep in mind you are multiplying data. So if you get 1 MB/s in, you output N MB/s with N clients. Are you sure your network card can take it? It gets worse with smaller packets, since you get more per-packet overhead. You may want to consider broadcasting.
You are using non blocking sockets, but you block while they are not free. If you want to be non blocking, better discard the packet immediately if the socket is not ready.
What would be better is to "select" more than one socket at once. Do everything that you are doing but for all the sockets that are available. You'll write to each "ready" socket, then repeat again while there are sockets that are not ready. This way, you'll proceed with the sockets that are available first, and then with some chance, the busy sockets will become themselves available.
The while (!sent) loop is useless and probably buggy. Since you are checking only one socket, FD_ISSET will always be true. It is wrong to check FD_ISSET again after an FD_CLR.
Keep in mind that your OS has some internal buffers for the sockets and that there are ways to extend them (not easy on Linux, though; to get large values you need to do some config as root).
There are some socket libraries that will probably work better than what you can implement in a reasonable time (boost::asio and zmq for the ones I know).
If you need to implement it yourself, (i.e. because for instance zmq has its own packet format), consider using a threadpool library.
EDIT:
Sleeping 1 millisecond is probably a bad idea. Your thread will probably get descheduled and it will take much more than that before you get some CPU time again.

This is just a horrible way to do things. The select serves no purpose but to waste time. If the send is non-blocking, it can mangle data on a partial send. If it's blocking, you still waste arbitrarily much time waiting for one receiver.
You need to pick a sensible I/O strategy. Here is one: Set all sockets non-blocking. When you need to send data to a socket, just call write. If all the data writes, lovely. If not, save the portion of data that wasn't sent for later and add the socket to your write set. When you have nothing else to do, call select. If you get a hit on any socket in your write set, write as many bytes as you can from what you saved. If you write all of them, remove that socket from the write set.
(If you need to write to a socket that's already in your write set, just add the data to the saved data to be sent. You may need to close the connection if too much data gets buffered.)
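The per-socket bookkeeping for this strategy might look roughly like the sketch below. PendingWriter and its methods are made-up names; write_fn stands for a non-blocking write/send bound to one socket that returns the number of bytes accepted, or a value <= 0 when nothing could be written. For brevity the sketch treats every failure as "would block", so real code would still need to tell WSAEWOULDBLOCK/EAGAIN apart from fatal errors.

```cpp
#include <cstddef>
#include <string>

// Hypothetical per-socket state: holds whatever the socket has not accepted yet.
class PendingWriter {
public:
    // Try to write immediately; keep whatever was not accepted for later.
    template <typename WriteFn>
    void Send(WriteFn&& write_fn, const char* data, std::size_t len) {
        if (!pending_.empty()) {
            // Older data is still queued: append to preserve ordering.
            pending_.append(data, len);
            return;
        }
        const long long n = write_fn(data, len);
        std::size_t accepted = n > 0 ? static_cast<std::size_t>(n) : 0;
        if (accepted > len) accepted = len;  // defensive clamp
        pending_.assign(data + accepted, len - accepted);
    }

    // Call when select() reports the socket as writable.
    template <typename WriteFn>
    void Flush(WriteFn&& write_fn) {
        if (pending_.empty()) return;
        const long long n = write_fn(pending_.data(), pending_.size());
        if (n > 0) pending_.erase(0, static_cast<std::size_t>(n));
    }

    // True while the socket belongs in the select() write set.
    bool WantsWrite() const { return !pending_.empty(); }

    // Lets the caller drop a connection whose backlog grows too large.
    std::size_t PendingBytes() const { return pending_.size(); }

private:
    std::string pending_;
};
```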
A better idea might be to use a library that already does all these things. Boost::asio is a good one.

You are calling select() before calling send(). Do it the other way around. Call select() only if send() reports WSAEWOULDBLOCK, eg:
int send_TCP_NB(int cs, char data[], int data_length)
{
int status;
int err;
struct timeval waitd;
char *data_ptr = data;
while (data_length > 0)
{
status = send(cs, data_ptr, data_length, 0);
if (status > 0)
{
data_ptr += status;
data_length -= status;
continue;
}
err = WSAGetLastError();
if (err != WSAEWOULDBLOCK)
{
printf("Error sending non blocking data\n");
return 0; // send failed
}
FD_ZERO(&write_flags);
FD_SET(cs, &write_flags); // set the write notification for the socket based on the current state of the buffer
waitd.tv_sec = 0;
waitd.tv_usec = 1000;
status = select(cs+1, NULL, &write_flags, NULL, &waitd);
if (status > 0)
continue;
if (status == 0)
printf("Time limit expired!\n");
else
printf("Error waiting for time limit!\n");
return 0; // send failed
}
return 1;
}
