How to receive more than 40Kb in C++ socket using read() - c++

I am developing a client-server application (TCP) in Linux using C++. This application is in charge of testing the network performance.
The connection between client and server is established only once, and then data are transmitted/received using write()/read() with an own-defined protocol.
When data exceeds 40Kb I receive just a part of the data only once. (i.e. I receive about 48KB)
Please find down the relevant part of the code:
while (1) {
servMtx.lock();
...
serv_bytes = (byte *) malloc(size_bytes);
n = read(newsockfd, serv_bytes,size_bytes);
if (n != (int)size_bytes ) {
std::cerr << "No enough data available for msg. Received just: " << n << std::endl;
continue;
}
receivedBytes += n + size_header_bytes + sizeof(ssize_t);
....
}
I increased the kernel buffer size to become 1MB using:
int buffsize = 1024*1024;
setsockopt(newsockfd, SOL_SOCKET, SO_RCVBUF, &buffsize, sizeof(buffsize));
and modified sysctl variables too:
sysctl -w net.core.rmem_max=8388608;
sysctl -w net.core.wmem_max=8388608;
as mentioned on this How to recive more than 65000 bytes in C++ socket using recv() but nothing was changed. Also, I tried to change the package size to no avail.

You should read or recv in several chunks (in general; if you are unlucky, the "several" becomes "one"). So you need to manage your buffering and keep (and use) the count of received bytes.
So at some point, you'll code
int nbrecv = recv(s, buffer + off, bufsize, 0);
if (nbrec>0) { off += nbrecv; bufsize -= nbrecv; }
and you probably should do that in your event loop (often around poll(2)...). And it does happen that nbrec is a lot less than bufsize and you should be handling that common case.
TCP does not guarantee that you'll get all the bytes in the same recv! It could depend on external factors (routing, network hardware, ...); it is a stream-oriented protocol, not a message-packet one. If your application wants messages it should buffer the input and chunk that input into messages according to the content. Look at HTTP or SMTP: their message have a well defined boundary given by header information (Content-Length: in HTTP) or by ending convention (line with a single . in SMTP).
Please read carefully read(2), recv(2), socket(7), tcp(7), some sockets tutorial, Advanced Linux Programming.

Related

Reading and storing a stream of UDP datagrams in c++ 98 and berkeley sockets

Setup: This is a dynamic library made only for openSUSE linux, using C++ 98 due to strict requirements, and Berkeley Sockets. I have a stream of UDP datagrams sent at 90hz from another pc.
I'm following the great Beej's Guide to Network Programming for this.
Intented behaviour: I'm providing a dinamic library as the interface with the network and it should store the latest datagram received so that when you ask for it, it will be returned.
Options
Option A) PASSIVE LIBRARY: Make them call an Update() method at those 90Hz (more or less) to make it read the datagrams from the socket and get the latest one. If not called frequently enough it will not work well.
Option B) ACTIVE LIBRARY: make the library perform the check itself at 90Hz (more or less). I guess I would need to use a thread for this, am I right?. I have no idea how to make it sleep so that it doesn't waste CPU and resources. No idea how to do this.
Problem: I created two little apps, one sends the datagrams at 90hz, another reads them. Active wait works, but as soon as I introduce a while loop reading the socket all the time, it returns no datagrams. The socket is always empty. I've been told to use "Select()" with a timeout. But I guess both do something similar. Why isn't there anything in the socket when I do this?
for(;;)
{
numbytes = recvfrom(sockfd, buf, MAXBUFLEN-1 , 0, p->ai_addr, &p->ai_addrlen);
if (numbytes == 0)
{
printf("Error. Sender closed connection?\n");
break;
}
else if (numbytes > 0)
{
printf("listener: got packet from %s\n",inet_ntop(p->ai_family, get_in_addr(p->ai_addr), s, sizeof s));
printf("listener: packet is %d bytes long\n", numbytes);
buf[numbytes] = '\0';
printf("listener: packet contains \"%s\"\n", buf);
}
}
Questions
Are the datagrams lost if I don't wait actively for them?
In case they are stored, why aren't they available when I fetch them with the while loop?
In case they are stored, how many are stored?
Is there any way to make the socket store the latest package by itself and letting me fetch the last one?

Qt QTcpSocket Reading Data Overlap Causes Invalid TCP Behavior During High Bandwidth Reading and Writing

Summary: Some of the memory within the TCP socket to be overwritten by other incoming data.
Application:
A client/server system that utilizes TCP within Qt (QTcpSocket and QTcpServer). The client request a frame from the server(just a simple string message), and the response (Server -> Client) which consists of that frame (614400 bytes for testing purposes). Frame sizes are established in advance and are fixed.
Implementation Details:
From the guarantees of the TCP protocol (Server -> Client), I know that I should be able to read the 614400 bytes from the socket and that they are in order. If any either of these two things fails, the connection must have failed.
Important Code:
Assuming the socket is connected.
This code requests a frame from the server. Known as the GetFrame() function.
// Prompt the server to send a frame over
if(socket->isWritable() && !is_receiving) { // Validate that socket is ready
is_receiving = true; // Forces only one request to go out at a time
qDebug() << "Getting frame from socket..." << image_no;
int written = SafeWrite((char*)"ReadyFrame"); // Writes then flushes the write buffer
if (written == -1) {
qDebug() << "Failed to write...";
return temp_frame.data();
}
this->SocketRead();
is_receiving = false;
}
qDebug() << image_no << "- Image Received";
image_no ++;
return temp_frame.data();
This code waits for the frame just requested to be read. This is the SocketRead() function
size_t byte_pos = 0;
qint64 bytes_read = 0;
do {
if (!socket->waitForReadyRead(500)) { // If it timed out return existing frame
if (!(socket->bytesAvailable() > 0)) {
qDebug() << "Timed Out" << byte_pos;
break;
}
}
bytes_read = socket->read((char*)temp_frame.data() + byte_pos, frame_byte_size - byte_pos);
if (bytes_read < 0) {
qDebug() << "Reading Failed" << bytes_read << errno;
break;
}
byte_pos += bytes_read;
} while (byte_pos < frame_byte_size && is_connected); // While we still have more pixels
qDebug() << "Finished Receiving Frame: " << byte_pos;
As shown in the code above, I read until the frame is fully received (where the number of bytes read is equal to the number of bytes in the frame).
The issue that I'm having is that the QTcpSocket read operation is skipping bytes in ways that are not in line with the guarantees of the TCP protocol. Since I skip bytes I end up not reaching the end of the while loop and just "Time Out". Why is this happening?
What I have done so far:
The data that the server sends is directly converted into uint16_t (short) integers which are used in other parts of the client. I have changed the server to simply output data that just counts up adding one for each number sent. Since the data type is uint16_t and the number of bytes exceeds that maximum number for that integer type, the int-16's will loop every 65535.
This is a data visualization software so this debugging configuration (on the client side) leads to something like this:
I have determined (and as you can see a little at the bottom of the graphic) that some bytes are being skipped. In the memory of temp_frame it is possible to see the exact point at which the memory skipped:
Under correct circumstances, this should count up sequentially.
From Wireshark and following this specific TCP connection I have determined that all of the bytes are in fact arriving (all 6114400), and that all the numbers are in order (I used a python script to ensure counting was sequential).
This is work on an open source project so this is the whole code base for the client.
Overall, I don't see how I could be doing something wrong in this solution, all I am doing is reading from the socket in the standard way.
Caveat: This isn't a definitive answer to your problem, but some things to try (it's too large for a comment).
With (e.g.) GigE, your data rate is ~100MB/s. With a [total] amount of kernel buffer space of 614400, this will be refilled ~175 times per second. IMO, this is still too small. When I've used SO_RCVBUF [for a commercial product], I've used a minimum of 8MB. This allows a wide(er) margin for task switch delays.
Try setting something huge like 100MB to eliminate this as a factor [during testing/bringup].
First, it's important to verify that the kernel and NIC driver can handle the throughput/latency.
You may be getting too many interrupts/second and the ISR prolog/epilog overhead may be too high. The NIC card driver can implement polled vs interrupt driver with NAPI for ethernet cards.
See: https://serverfault.com/questions/241421/napi-vs-adaptive-interrupts
See: https://01.org/linux-interrupt-moderation
You process/thread may not have high enough priority to be scheduled quickly.
You can use the R/T scheduler with sched_setscheduler, SCHED_RR, and a priority of (e.g.) 8. Note: going higher than 11 kills the system because at 12 and above you're at a higher priority than most internal kernel threads--not a good thing.
You may need to disable IRQ balancing and set the IRQ affinity to a single CPU core.
You can then set your input process/thread locked to that core [with sched_setaffinity and/or pthread_setaffinity].
You might need some sort of "zero copy" to bypass the kernel copying from its buffers into your userspace buffers.
You can mmap the kernel socket buffers with PACKET_MMAP. See: https://sites.google.com/site/packetmmap/
I'd be careful about the overhead of your qDebug output. It looks like an iostream type implementation. The overhead may be significant. It could be slowing things down significantly.
That is, you're not measuring the performance of your system. You're measuring the performance of your system plus the debugging code.
When I've had to debug/trace such things, I've used a [custom] "event" log implemented with an in-memory ring queue with a fixed number of elements.
Debug calls such as:
eventadd(EVENT_TYPE_RECEIVE_START,some_event_specific_data);
Here eventadd populates a fixed size "event" struct with the event type, event data, and a hires timestamp (e.g. struct timespec from clock_gettime(CLOCK_MONOTONIC,...).
The overhead of each such call is quite low. The events are just stored in the event ring. Only the last N are remembered.
At some point, your program triggers a dump of this queue to a file and terminates.
This mechanism is similar to [and modeled on] a H/W logic analyzer. It is also similar to dtrace
Here's a sample event element:
struct event {
long long evt_tstamp; // timestamp
int evt_type; // event type
int evt_data; // type specific data
};

C++ nonblocking sockets - wait for all recv data

I wasn't running into this problem on my local system (of course), but now that I am setting up a virtual server, I am having some issues with a part of my code.
In order to receive all data from a nonblocking TCP recv(), I have this function
ssize_t Server::recvAll(int sockfd, const void *buf, size_t len, int flags) {
// just showing here that they are non-blocking sockets
u_long iMode=1;
ioctlsocket(sockfd,FIONBIO,&iMode);
ssize_t result;
char *pbuf = (char *)buf;
while ( len > 0 ) {
result = recv(sockfd,pbuf,len,flags);
printf("\tRES: %d", result);
if ( result <= 0 ) break;
pbuf += result;
len -= result;
}
return result;
}
I noticed that recvAll will usually print RES: 1024 (1024 being the amount of bytes I'm sending) and it works great. But less frequently, there is data loss and it prints only RES: 400 (where 400 is some number greater than 0 and less than 1024) and my code does not work, as it expects all 1024 bytes.
I tried also printing WSAGetLastError() and also running in debug, but it looks like the program runs slow enough due to the print/debug that I don't come across this issue.
I assume this function works great for blocking sockets, but not non-blocking sockets.
Any suggestions on measurements I can take to make sure that I do receive all 1024 bytes without data loss on non-blocking sockets?
If you use non-blocking mode then you read all data that has already arrived to the system. Once you read out all data recv returns error and reason is depending on system:
EWOULDBLOCK (in posix system)
WSAEWOULDBLOCK in windows sockets system
Once you receive this error you need to wait arrival of another data. You can do it in several ways:
Wait with special function like select/poll/epoll
Sleep some time and try to recv again (user-space polling)
If you need to reduce delay select/poll/epoll is preferable. Sleep is much more simple to implement.
Also you need consider that TCP is stream protocol and does NOT keep framing. This means that you can send, for example, 256 bytes then another 256 bytes but receive 512 bytes at once. This also true in opposite way: you may send 512 bytes at once and receive 256 bytes with first read and another 256 bytes in next read.

Calculating socket upload speed

I'm wondering if anyone knows how to calculate the upload speed of a Berkeley socket in C++. My send call isn't blocking and takes 0.001 seconds to send 5 megabytes of data, but takes a while to recv the response (so I know it's uploading).
This is a TCP socket to a HTTP server and I need to asynchronously check how many bytes of data have been uploaded / are remaining. However, I can't find any API functions for this in Winsock, so I'm stumped.
Any help would be greatly appreciated.
EDIT: I've found the solution, and will be posting as an answer as soon as possible!
EDIT 2: Proper solution added as answer, will be added as solution in 4 hours.
I solved my issue thanks to bdolan suggesting to reduce SO_SNDBUF. However, to use this code you must note that your code uses Winsock 2 (for overlapped sockets and WSASend). In addition to this, your SOCKET handle must have been created similarily to:
SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, NULL, 0, WSA_FLAG_OVERLAPPED);
Note the WSA_FLAG_OVERLAPPED flag as the final parameter.
In this answer I will go through the stages of uploading data to a TCP server, and tracking each upload chunk and it's completion status. This concept requires splitting your upload buffer into chunks (minimal existing code modification required) and uploading it piece by piece, then tracking each chunk.
My code flow
Global variables
Your code document must have the following global variables:
#define UPLOAD_CHUNK_SIZE 4096
int g_nUploadChunks = 0;
int g_nChunksCompleted = 0;
WSAOVERLAPPED *g_pSendOverlapped = NULL;
int g_nBytesSent = 0;
float g_flLastUploadTimeReset = 0.0f;
Note: in my tests, decreasing UPLOAD_CHUNK_SIZE results in increased upload speed accuracy, but decreases overall upload speed. Increasing UPLOAD_CHUNK_SIZE results in decreased upload speed accuracy, but increases overall upload speed. 4 kilobytes (4096 bytes) was a good comprimise for a file ~500kB in size.
Callback function
This function increments the bytes sent and chunks completed variables (called after a chunk has been completely uploaded to the server)
void CALLBACK SendCompletionCallback(DWORD dwError, DWORD cbTransferred, LPWSAOVERLAPPED lpOverlapped, DWORD dwFlags)
{
g_nChunksCompleted++;
g_nBytesSent += cbTransferred;
}
Prepare socket
Initially, the socket must be prepared by reducing SO_SNDBUF to 0.
Note: In my tests, any value greater than 0 will result in undesirable behaviour.
int nSndBuf = 0;
setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (char*)&nSndBuf, sizeof(nSndBuf));
Create WSAOVERLAPPED array
An array of WSAOVERLAPPED structures must be created to hold the overlapped status of all of our upload chunks. To do this I simply:
// Calculate the amount of upload chunks we will have to create.
// nDataBytes is the size of data you wish to upload
g_nUploadChunks = ceil(nDataBytes / float(UPLOAD_CHUNK_SIZE));
// Overlapped array, should be delete'd after all uploads have completed
g_pSendOverlapped = new WSAOVERLAPPED[g_nUploadChunks];
memset(g_pSendOverlapped, 0, sizeof(WSAOVERLAPPED) * g_nUploadChunks);
Upload data
All of the data that needs to be send, for example purposes, is held in a variable called pszData. Then, using WSASend, the data is sent in blocks defined by the constant, UPLOAD_CHUNK_SIZE.
WSABUF dataBuf;
DWORD dwBytesSent = 0;
int err;
int i, j;
for(i = 0, j = 0; i < nDataBytes; i += UPLOAD_CHUNK_SIZE, j++)
{
int nTransferBytes = min(nDataBytes - i, UPLOAD_CHUNK_SIZE);
dataBuf.buf = &pszData[i];
dataBuf.len = nTransferBytes;
// Now upload the data
int rc = WSASend(sock, &dataBuf, 1, &dwBytesSent, 0, &g_pSendOverlapped[j], SendCompletionCallback);
if ((rc == SOCKET_ERROR) && (WSA_IO_PENDING != (err = WSAGetLastError())))
{
fprintf(stderr, "WSASend failed: %d\n", err);
exit(EXIT_FAILURE);
}
}
The waiting game
Now we can do whatever we wish while all of the chunks upload.
Note: the thread which called WSASend must be regularily put into an alertable state, so that our 'transfer completed' callback (SendCompletionCallback) is dequeued out of the APC (Asynchronous Procedure Call) list.
In my code, I continuously looped until g_nUploadChunks == g_nChunksCompleted. This is to show the end-user upload progress and speed (can be modified to show estimated completion time, elapsed time, etc.)
Note 2: this code uses Plat_FloatTime as a second counter, replace this with whatever second timer your code uses (or adjust accordingly)
g_flLastUploadTimeReset = Plat_FloatTime();
// Clear the line on the screen with some default data
printf("(0 chunks of %d) Upload speed: ???? KiB/sec", g_nUploadChunks);
// Keep looping until ALL upload chunks have completed
while(g_nChunksCompleted < g_nUploadChunks)
{
// Wait for 10ms so then we aren't repeatedly updating the screen
SleepEx(10, TRUE);
// Updata chunk count
printf("\r(%d chunks of %d) ", g_nChunksCompleted, g_nUploadChunks);
// Not enough time passed?
if(g_flLastUploadTimeReset + 1 > Plat_FloatTime())
continue;
// Reset timer
g_flLastUploadTimeReset = Plat_FloatTime();
// Calculate how many kibibytes have been transmitted in the last second
float flByteRate = g_nBytesSent/1024.0f;
printf("Upload speed: %.2f KiB/sec", flByteRate);
// Reset byte count
g_nBytesSent = 0;
}
// Delete overlapped data (not used anymore)
delete [] g_pSendOverlapped;
// Note that the transfer has completed
Msg("\nTransfer completed successfully!\n");
Conclusion
I really hope this has helped somebody in the future who has wished to calculate upload speed on their TCP sockets without any server-side modifications. I have no idea how performance detrimental SO_SNDBUF = 0 is, although I'm sure a socket guru will point that out.
You can get a lower bound on the amount of data received and acknowledged by subtracting the value of the SO_SNDBUF socket option from the number of bytes you have written to the socket. This buffer may be adjusted using setsockopt, although in some cases the OS may choose a length smaller or larger than you specify, so you must re-check after setting it.
To get more precise than that, however, you must have the remote side inform you of progress, as winsock does not expose an API to retrieve the amount of data currently pending in the send buffer.
Alternately, you could implement your own transport protocol on UDP, but implementing rate control for such a protocol can be quite complex.
Since you don't have control over the remote side, and you want to do it in the code, I'd suggest doing very simple approximation. I assume a long living program/connection. One-shot uploads would be too skewed by ARP, DNS lookups, socket buffering, TCP slow start, etc. etc.
Have two counters - length of the outstanding queue in bytes (OB), and number of bytes sent (SB):
increment OB by number of bytes to be sent every time you enqueue a chunk for upload,
decrement OB and increment SB by the number returned from send(2) (modulo -1 cases),
on a timer sample both OB and SB - either store them, log them, or compute running average,
compute outstanding bytes a second/minute/whatever, same for sent bytes.
Network stack does buffering and TCP does retransmission and flow control, but that doesn't really matter. These two counters will tell you the rate your app produces data with, and the rate it is able to push it to the network. It's not the method to find out the real link speed, but a way to keep useful indicators about how good the app is doing.
If data production rate is bellow the network output rate - everything is fine. If it's the other way around and the network cannot keep up with the app - there's a problem - you need either faster network, slower app, or different design.
For one-time experiments just take periodic snapshots of netstat -sp tcp output (or whatever that is on Windows) and calculate the send-rate manually.
Hope this helps.
If your app uses packet headers like
0001234DT
where 000123 is the packet length for a single packet, you can consider using MSG_PEEK + recv() to get the length of the packet before you actually read it with recv().
The problem is send() is NOT doing what you think - it is buffered by the kernel.
getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &flag, &sz));
fprintf(STDOUT, "%s: listener socket send buffer = %d\n", now(), flag);
sz=sizeof(int);
ERR_CHK(getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &flag, &sz));
fprintf(STDOUT, "%s: listener socket recv buffer = %d\n", now(), flag);
See what these show for you.
When you recv on a NON-blocking socket that has data, it normally does not have MB of data parked in the buufer ready to recv. Most of what I have experienced is that the socket has ~1500 bytes of data per recv. Since you are probably reading on a blocking socket it takes a while for the recv() to complete.
Socket buffer size is the probably single best predictor of socket throughput. setsockopt() lets you alter socket buffer size, up to a point. Note: these buffers are shared among sockets in a lot of OSes like Solaris. You can kill performance by twiddling these settings too much.
Also, I don't think you are measuring what you think you are measuring. The real efficiency of send() is the measure of throughput on the recv() end. Not the send() end.
IMO.

Using Boost.Asio to get "the whole packet"

I have a TCP client connecting to my server which is sending raw data packets. How, using Boost.Asio, can I get the "whole" packet every time (asynchronously, of course)? Assume these packets can be any size up to the full size of my memory.
Basically, I want to avoid creating a statically sized buffer.
Typically when you build a custom protocol on the top of TCP/IP you use a simple message format where first 4 bytes is an unsigned integer containing the message length and the rest is the message data. If you have such a protocol then the reception loop is as simple as below (not sure what is ASIO notation, so it's just an idea)
for(;;) {
uint_32_t len = 0u;
read(socket, &len, 4); // may need multiple reads in non-blocking mode
len = ntohl(len);
assert (len < my_max_len);
char* buf = new char[len];
read(socket, buf, len); // may need multiple reads in non-blocking mode
...
}
typically, when you do async IO, your protocol should support it.
one easy way is to prefix a byte array with it's length at the logical level, and have the reading code buffer up until it has a full buffer ready for parsing.
if you don't do it, you will end up with this logic scattered all over the place (think about reading a null terminated string, and what it means if you just get a part of it every time select/poll returns).
TCP doesn't operate with packets. It provides you one contiguous stream. You can ask for the next N bytes, or for all the data received so far, but there is no "packet" boundary, no way to distinguish what is or is not a packet.