C++ TCP send() slow

I have the code like this:
std::string msg = "blablabla"; // a large string, even 100-200 KB
int total = 0;
int left = msg.size();
int resp_len;
struct timeval start, end;
while (total < (int)msg.size())
{
    gettimeofday(&start, NULL);
    resp_len = send(*p_fd, msg.substr(total).c_str(), left, 0);
    gettimeofday(&end, NULL);
    cout << end.tv_usec - start.tv_usec << endl; // each send takes like 2-4 ms
    if (resp_len == -1)
        break;
    total += resp_len;
    left -= resp_len;
}
As you can see from that comment, each send() call takes around 2-4 milliseconds, which is a bit too much for my needs, and I'm pretty sure it could be faster, but I don't really know how... The socket is in non-blocking mode, but other than that I haven't set anything. On the client side I get the message just fine, but when I measure how long it takes to get the whole message it comes to about 8-18 ms. Usually the message goes out in 3 send() calls, so that is already 6-12 ms; the connect time and the processing time are practically 0.
Any idea how I could make that send() call faster? I've seen other programs send 100-200 KB much faster...
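(For reference: msg.substr(total) allocates and copies the remaining tail of the string on every iteration, which is extra work for a 100-200 KB message. A minimal sketch of the same loop sending straight from the string's own buffer, with the same msg and p_fd as above and error handling kept to a break, would look like this:)
size_t total = 0;
while (total < msg.size())
{
    ssize_t n = send(*p_fd, msg.data() + total, msg.size() - total, 0);
    if (n == -1)
        break; // with a non-blocking socket, also check for EAGAIN/EWOULDBLOCK here
    total += (size_t)n;
}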

Related

Sending a compressed string through winsock

Hi, everyone! I have a simple TCP server and client using the winsock2 library in C++. The server simply sends string messages; the client simply receives them. Everything is fine there. But when I use the zlib library to compress the string, the data gets corrupted and I can't receive it properly on the client to unzip it. Can someone help me?
Server:
{
std::lock_guard<std::mutex> lock(mtx);
std::cout << "Client connected\n";
int k = rand() % strings.size();
msg = strings[k];
msg_size = msg.size();
msgl_size = msg_size + msg_size*0.1 + 12;
msgl = new unsigned char[msgl_size + 1]{0};
if (Z_OK != compress((Bytef*)msgl,
&msgl_size,
reinterpret_cast<const unsigned char*>(msg.c_str()),
msg.size()))
{
std::cout << "Compression error! " << std::endl;
exit(2);
}
}
std::thread * thread = new std::thread([&newConnection, msgl, msgl_size, msg_size, msg]() {
std::lock_guard<std::mutex> lock(mtx);
send(newConnection, (char*)&msgl_size, sizeof(unsigned long), NULL);
send(newConnection, (char*)&msg_size, sizeof(unsigned long), NULL);
int res;
do {
res = send(newConnection, (char*)(msgl), sizeof(msgl_size), NULL);
}
while (msgl_size != res);
});
Client:
std::lock_guard<std::mutex> lock(mtxx);
unsigned long msgl_size, msg_size;
recv(Connection, (char*)&msg_size, sizeof(unsigned long), NULL);
recv(Connection, (char*)&msgl_size, sizeof(unsigned long), NULL);
unsigned char * msgl = new unsigned char[msgl_size + 1]{0};
int res;
do {
res = recv(Connection, reinterpret_cast<char*>(msgl), msgl_size, NULL);
}
while (msgl_size != res);
char * msg = new char[msg_size + 1];
if (Z_OK == uncompress(reinterpret_cast<unsigned char*>(msg),
&msg_size,
reinterpret_cast<unsigned char*>(msgl),
msgl_size))
{
msg[msg_size] = '\0';
std::cout << msg << std::endl;
std::cout << "Compress ratio: " << msgl_size / (float)msg_size << std::endl;
}
delete[] msgl;
Client side:
recv only returns whatever data is immediately available, or blocks until data becomes available; getting the whole message in one call is unlikely with a large file or a slow network. Quite likely recv will block until the first network packet arrives, and depending on the underlying network that could be anywhere from a few hundred bytes to tens of thousands. Maybe the message fits in that and maybe not.
Setting recv's flags parameter to MSG_WAITALL is useful for shorter messages because you will either get exactly the number of bytes you asked for or an error. Because of the possibility of error you always have to test the return value.
To repeat: Always check the return value.
recv's return value is either negative for socket failure, 0 for socket shutdown, or the number of bytes read. For more, consult the winsock documentation for recv.
So...
recv(Connection, (char*)&msg_size, sizeof(unsigned long), NULL);
and
recv(Connection, (char*)&msgl_size, sizeof(unsigned long), NULL);
do not check the return value. The socket could have failed or the call to recv could have returned less than what was requested and the remainder of the program will be operating on garbage.
These are a decent place to use MSG_WAITALL, but it's possible that the socket is fine and you were interrupted by a signal. Not sure if this can happen on Windows, but it can on Linux. Beware.
if (recv(Connection, (char*)&msg_size, sizeof(unsigned long), MSG_WAITALL) != sizeof(unsigned long) ||
    recv(Connection, (char*)&msgl_size, sizeof(unsigned long), MSG_WAITALL) != sizeof(unsigned long))
{
    // log error
    // exit function, loop, or whatever.
}
Next,
do {
res = recv(Connection, reinterpret_cast<char*>(msgl), msgl_size, NULL);
} while (msgl_size != res);
will loop until a single recv call returns exactly the right amount. That is unlikely, and even if it happens it has to happen on the first read, because each iteration writes over whatever the previous read stored.
Say only half of the message is read from the socket on the first try. Since this isn't the full message, the loop enters and tries to read again, overwriting the first half of the message with the second half and perhaps enough bytes from the subsequent message to satisfy the requested number of bytes. This amalgam of two messages will not decompress.
For a payload of potentially great size, loop until the program has it all.
char * bufp = reinterpret_cast<char*>(msgl);
int msg_remaining = msgl_size;
while (msg_remaining > 0)
{
    res = recv(Connection, bufp, msg_remaining, NULL);
    if (res <= 0)
    {
        // log error
        // exit function, loop, or whatever.
    }
    msg_remaining -= res; // reduce message remaining
    bufp += res;          // move next insert point in msgl
}
There may be problems with the decompression. I don't know enough about that to be able to answer. I suggest removing it and sending easily-debuggable plaintext until you have all of the network issues worked out.
Server side:
Like recv, send sends what it can. You may have to loop the send to make sure you didn't hand the socket a message too large for it to swallow in one shot. And again like recv, send can fail. Always check the return value to see what really happened. Check the documentation for send for more information.
It looks to me like you have the right basic idea: send the size of data to expect, followed by the data itself. On the receiving side, read the size first, then read the specified amount of data.
Unfortunately, you've made a mistake or two when it came to the details of implementing that intent. The first big one is when you send the data:
do {
res = send(newConnection, (char*)(msgl), sizeof(msgl_size), NULL);
}
while (msgl_size != res);
This has a couple of problems. First of all, it uses sizeof(msgl_size), so it's only trying to send the size of an unsigned long (at least I'm guessing that msgl_size is an unsigned long).
What I'm pretty sure you intended here was to send the entire buffer instead:
unsigned long sent = 0;
unsigned long remaining = msgl_size;
do {
    res = send(newConnection, (char*)(msgl + sent), remaining, NULL);
    sent += res;
    remaining -= res;
} while (msgl_size != sent);
With this, we start sending from the beginning of the buffer. If send returns after sending only part of that (as it's allowed to), we record how much was sent. Then on the next iteration, we re-start sending from the point where it left off. Meanwhile, we keep track of how much remains to be sent, and only attempt to send that much on each subsequent iteration.
At least at first glance, it looks like your receive loop probably needs roughly the same kind of repair, keeping track of the total received rather than trying to wait for a single transfer of the entire amount.
Oh, and of course for real code you also want to check for res being 0 or negative. As it stands right now, this doesn't even attempt to detect or properly react to most network errors.
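For reference, here is a sketch of that same send loop with those checks added (same newConnection, msgl and msgl_size as above; what you do on failure is up to you):
unsigned long sent = 0;
while (sent < msgl_size)
{
    int res = send(newConnection, (char*)(msgl + sent), (int)(msgl_size - sent), 0);
    if (res <= 0)
    {
        // 0 or SOCKET_ERROR: log WSAGetLastError() and stop sending.
        break;
    }
    sent += res;
}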

Winsock using send() and FD_WRITE

I'm new to Winsock. I am writing an HTTP server and I am aiming to send a large file by reading and sending in chunks. To optimize this process, I attempt to use a nonblocking socket where I read the next chunk from the disk while I send the current one.
My problem now is that I get an FD_WRITE message even when it seems the buffer should be full, and my function deletes the associated data from memory prematurely. I believe this causes my responses to contain less data than they should: send() stops sending prematurely and the client (which is a well-known one) receives only about 70% of the data. When I use blocking sockets it works fine, it just takes longer.
I tried using Wget, a simple HTTP client, to get a better idea of what's going on. From what I can see, the data stream thins out whenever I detect WSAEWOULDBLOCK errors when checking for errors after calling send(). It looks like during those sends, not all the data gets sent.
When I set the sleep time after checking for the FD_WRITE message to over 2000 ms, everything works, as it then basically comes down to using a blocking socket. I also tried times around 100-200 ms, but those fail as well. As it is, the condition checking for FD_WRITE always returns valid before entering the loop.
WSAEVENT event = WSACreateEvent();
const int sendBufferSize = 1024 * 64;
int connectionSpeed = 5; //estimated, in MBytes/s
const int sleepTime = sendBufferSize / (connectionSpeed * 1024 * 1024);
size = 0;
const int bufSize = 1024 * 1024 * 35;
int lowerByteIndex = 0;
int upperByteIndex = bufSize;
size = bufSize;
int totalSIZE = 0;
unsigned char* p;
unsigned char* pt;
clock_t t = std::clock();
p = getFileBytesC(resolveLocalPath(path), size, lowerByteIndex, upperByteIndex);
lowerByteIndex += bufSize;
upperByteIndex += bufSize;
totalSIZE += size;
while (upperByteIndex <= fileSize + bufSize)
{
int ret = send(socket, (char*)p, size, 0);
pt = getFileBytesC(resolveLocalPath(path), size, lowerByteIndex, upperByteIndex);
totalSIZE += size;
lowerByteIndex += bufSize;
upperByteIndex += bufSize;
if (ret == SOCKET_ERROR && WSAGetLastError() == WSAEWOULDBLOCK)
{
while (SOCKET_ERROR == WSAEventSelect(socket, event, FD_WRITE))
{
Sleep(50);
}
}
Sleep(sleepTime); //should be around 30-50ms. Wait for the buffer to be empty
delete[] p;
p = pt;
std::cout << std::endl << (std::clock() - t) / (double)CLOCKS_PER_SEC;
}
send(socket, (char*)p, size, 0);
delete[] p;
std::cout << std::endl << (std::clock() - t) / (double)CLOCKS_PER_SEC ;
if (totalSIZE == fileSize) std::cout << std::endl << "finished transfer. UBI=" << upperByteIndex;
else
{
std::cout << std::endl << "ERROR: finished transfer\r\nBytes read=" << totalSIZE;
}
Sleep(2000);
closeSocket(socket);
You can't write correct non-blocking send() code without storing the value returned in a variable. It is the number of bytes actually sent. You can't assume the entire buffer was sent in non-blocking mode.
If send() returns -1 with WSAGetLastError() == WSAEWOULDBLOCK, then it is the time to call select(), or WSAEventSelect() if you must, but with a timeout. Otherwise, i.e. if it returns a positive count, you should just advance your offset and decrement your length by the amount sent, and repeat until there is nothing left to send.
Your sleeps are literally just a waste of time.
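A sketch of what that might look like on a non-blocking Winsock socket (the names are illustrative, and error handling is reduced to a bool):
bool SendAll(SOCKET s, const char* data, int len)
{
    int sent = 0;
    while (sent < len)
    {
        int n = send(s, data + sent, len - sent, 0);
        if (n != SOCKET_ERROR)
        {
            sent += n; // a partial send is fine: advance and continue
            continue;
        }
        if (WSAGetLastError() != WSAEWOULDBLOCK)
            return false; // real error
        // Send buffer is full: wait, with a timeout, until the socket is writable.
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(s, &wfds);
        timeval tv = { 5, 0 }; // 5-second timeout
        if (select(0, NULL, &wfds, NULL, &tv) <= 0)
            return false; // timeout or select error
    }
    return true;
}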
But I would question the whole approach. Sending on a blocking-mode socket is asynchronous anyway. There is little to be gained by your present approach. Just read the file in chunks and send the chunks in blocking mode.
The TransmitFile function exists to solve exactly this problem for you. It does the whole thing entirely in kernel mode so it's always going to beat a hand-crafted version.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms740565(v=vs.85).aspx
Edit: On Linux there's the somewhat similar sendfile call.
http://linux.die.net/man/2/sendfile
(The ubiquity of web servers helped to motivate OS designers to solve this problem.)
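For what it's worth, a minimal sketch of the TransmitFile route (assuming a connected SOCKET and a readable file; not production code):
#include <winsock2.h>
#include <windows.h>
#include <mswsock.h> // TransmitFile; link with Mswsock.lib

bool SendWholeFile(SOCKET s, const wchar_t* path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return false;
    // Zero for the byte counts means "send the whole file and let the system
    // pick the chunk size"; no OVERLAPPED and no header/trailer buffers.
    BOOL ok = TransmitFile(s, file, 0, 0, NULL, NULL, 0);
    CloseHandle(file);
    return ok == TRUE;
}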

Odd results when adding artificial delays to C++ code. Embedded Linux

I have been looking at the performance of our C++ server application running on embedded Linux (ARM). The pseudo code for the main processing loop of the server is this -
for i = 1 to 1000
Process item i
Sleep for 20 ms
The processing for one item takes about 2 ms. The "Sleep" here is really a call to the Poco library to do a "tryWait" on an event. If the event is fired (which it never is in my tests) or the time expires, it returns. I don't know what system call this equates to. Although we ask for a 2 ms block, it turns out to be roughly 20 ms. I can live with that - that's not the problem. The sleep is just an artificial delay so that other threads in the process are not starved.
The loop takes about 24 seconds to go through 1000 items.
The problem is, we changed the way the sleep is used so that we had a bit more control. I mean - 20ms delay for 2ms processing doesn't allow us to do much processing. With this new parameter set to a certain value it does something like this -
For i = 1 to 1000
Process item i
if i % 50 == 0 then sleep for 1000ms
That's the rough code, in reality the number of sleeps is slightly different and it happens to work out at a 24s cycle to get through all the items - just as before.
So we are doing exactly the same amount of processing in the same amount of time.
Problem 1 - the CPU usage for the original code is reported at around 1% (it varies a little but that's about average) and the CPU usage reported for the new code is about 5%. I think they should be the same.
Well, perhaps this CPU reporting isn't accurate, so I thought I'd sort a large text file at the same time and see how much it's slowed down by our server. This is a CPU-bound process (98% CPU usage according to top). The results are very odd. With the old code, the time taken to sort the file goes up by 21% when our server is running.
Problem 2 - If the server is only using 1% of the CPU then wouldn't the time taken to do the sort be pretty much the same?
Also, the time taken to go through all the items doesn't change - it's still 24 seconds with or without the sort running.
Then I tried the new code; it only slows the sort down by about 12%, but it now takes about 40% longer to get through all the items it has to process.
Problem 3 - Why do the two ways of introducing an artificial delay cause such different results? It seems that the server which sleeps more frequently, but for a minimum time, is getting more priority.
I have a half-baked theory on the last one - whatever system call is used to do the "sleep" switches back to the server process when the time has elapsed. This gives the process another bite at the time slice on a regular basis.
Any help appreciated. I suspect I'm just not understanding it correctly and that things are more complicated than I thought. I can provide more details if required.
Thanks.
Update: replaced tryWait(2) with usleep(2000) - no change. In fact, sched_yield() does the same.
Well I can at least answer problem 1 and problem 2 (as they are the same issue).
After trying out various options in the actual server code, we came to the conclusion that the CPU reporting from the OS is incorrect. It's quite a surprising result, so to make sure, I wrote a standalone program that doesn't use Poco or any of our code, just plain Linux system calls and standard C++ features. It implements the pseudo code above. The processing is replaced with a tight loop just checking the elapsed time to see if 2 ms is up. The sleeps are proper sleeps.
The small test program shows exactly the same problem, i.e. doing the same amount of processing but splitting up the way the sleep function is called produces very different results for CPU usage. In the case of the test program, the reported CPU usage was 0.0078 seconds using 1000 20 ms sleeps but 1.96875 seconds when a less frequent 1000 ms sleep was used. The amount of processing done is the same.
Running the test on a Linux PC did not show the problem. Both ways of sleeping produced exactly the same CPU usage.
So it's clearly a problem with our embedded system and the way it measures CPU time when a process yields so often (you get the same problem with sched_yield instead of a sleep).
Update: Here's the code. RunLoop is where the main bit is done -
#include <iostream>
#include <ctime>
#include <sys/time.h>
#include <unistd.h>

int sleepCount;
double getCPUTime( )
{
clockid_t id = CLOCK_PROCESS_CPUTIME_ID;
struct timespec ts;
if ( id != (clockid_t)-1 && clock_gettime( id, &ts ) != -1 )
return (double)ts.tv_sec +
(double)ts.tv_nsec / 1000000000.0;
return -1;
}
double GetElapsedMilliseconds(const timeval& startTime)
{
timeval endTime;
gettimeofday(&endTime, NULL);
double elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // us to ms
return elapsedTime;
}
void SleepMilliseconds(int milliseconds)
{
timeval startTime;
gettimeofday(&startTime, NULL);
usleep(milliseconds * 1000);
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > milliseconds + 0.3)
std::cout << "Sleep took longer than it should " << elapsedMilliseconds;
sleepCount++;
}
void DoSomeProcessingForAnItem()
{
timeval startTime;
gettimeofday(&startTime, NULL);
double processingTimeMilliseconds = 2.0;
double elapsedMilliseconds;
do
{
elapsedMilliseconds = GetElapsedMilliseconds(startTime);
} while (elapsedMilliseconds <= processingTimeMilliseconds);
if (elapsedMilliseconds > processingTimeMilliseconds + 0.1)
std::cout << "Processing took longer than it should " << elapsedMilliseconds;
}
void RunLoop(bool longSleep)
{
int numberOfItems = 1000;
timeval startTime;
gettimeofday(&startTime, NULL);
timeval startMainLoopTime;
gettimeofday(&startMainLoopTime, NULL);
for (int i = 0; i < numberOfItems; i++)
{
DoSomeProcessingForAnItem();
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > 100)
{
std::cout << "Item count = " << i << "\n";
if (longSleep)
{
SleepMilliseconds(1000);
}
gettimeofday(&startTime, NULL);
}
if (longSleep == false)
{
// Does 1000 * 20 ms sleeps.
SleepMilliseconds(20);
}
}
double elapsedMilliseconds = GetElapsedMilliseconds(startMainLoopTime);
std::cout << "Main loop took " << elapsedMilliseconds / 1000 <<" seconds\n";
}
void DoTest(bool longSleep)
{
timeval startTime;
gettimeofday(&startTime, NULL);
double startCPUtime = getCPUTime();
sleepCount = 0;
int runLoopCount = 1;
for (int i = 0; i < runLoopCount; i++)
{
RunLoop(longSleep);
std::cout << "**** Done one loop of processing ****\n";
}
double endCPUtime = getCPUTime();
std::cout << "Elapsed time is " <<GetElapsedMilliseconds(startTime) / 1000 << " seconds\n";
std::cout << "CPU time used is " << endCPUtime - startCPUtime << " seconds\n";
std::cout << "Sleep count " << sleepCount << "\n";
}
void testLong()
{
std::cout << "Running testLong\n";
DoTest(true);
}
void testShort()
{
std::cout << "Running testShort\n";
DoTest(false);
}
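(For completeness, a trivial entry point to run both variants of the test; this was not part of the original post:)
int main()
{
    testShort(); // 1000 sleeps of 20 ms each
    testLong();  // fewer, 1000 ms sleeps
    return 0;
}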

Select() sometimes does not wait

I'm writing a chat program, and my receive function sometimes does not wait at all. Here is the receiving code; the important parts are basically in the first half, but I've added the whole function just in case. (Edit: the comments are for myself, not notes to you guys reading! Sorry!)
ReceiveStatus Server::Receive(PacketInternal*& packetInternalOut)
{
fd_set fds ;
int n ;
struct timeval tv ;
// Set up the file descriptor set.
FD_ZERO(&fds) ;
FD_SET(*p_socket, &fds) ;
// Set up the struct timeval for the timeout.
tv.tv_sec = NETWORKTIMEOUTSEC ;
tv.tv_usec = NETWORKTIMEOUTUSEC ;
// Wait until timeout or data received.
n = select ( *p_socket, &fds, NULL, NULL, &tv ) ;
if ( n == 0)
{
return ReceiveStatus::ReceiveTimeout;
}
else if( n == -1 )
{
return ReceiveStatus::ReceiveSocketError;
}
//need to make this more flexible so it can support others
sockaddr_in fromAddr;
int flags = 0;
int fromLength = sizeof(fromAddr);
char dataIn[TOTALPACKETSIZE];
int bytesIn = recvfrom(*p_socket, dataIn, TOTALPACKETSIZE, flags, (SOCKADDR*)&fromAddr, &fromLength);
// Convert fromAddr into ip, port
if(bytesIn == SOCKET_ERROR)
{
return ReceiveStatus::ReceiveSocketError;
}
if(bytesIn > 0)
{
memcpy(packetInternalOut,dataIn,bytesIn);
return ReceiveStatus::ReceiveSuccessful;
}
else
{
return ReceiveStatus::ReceiveEmpty;
}
}
Is there anything that could affect whether or not this works? My chat program can either be a server or a client; they both use this same code. The server, when waiting for a connection, sits on Select() for 100 seconds, as NETWORKTIMEOUTSEC = 100. But in the chat program, whenever I want to send a message, I first send a transfer request and then wait for an acknowledgement (for the acknowledgement packet, I need to call Receive again). This is the step that does not wait: my ReceiveAck function calls Receive(), and Receive just runs straight over the entire code. I can test this by creating a client and no server. If I send a message when there is no server, it should wait 100 seconds for an acknowledgement and then time out. But instead, as soon as I hit enter, it says it timed out.
I can't work out what would be making it skip this step. I have debugged my chat program in both its server and client states. The values of tv and fds are the same in both, yet the server will wait and the client won't...
The first parameter to select() is one greater than the last socket. So you need:
n = select ( *p_socket + 1, &fds, NULL, NULL, &tv ) ;
Select also returns early (i.e. without any of the sockets having data present) when your application is hit by a signal. So if your app uses a lot of usleep() and friends in a different thread, you might be in for a surprise.
select() should always be used in a loop. You must check its return for three conditions:
-1 (an error), which you must evaluate to determine if it is fatal. EINTR is an example of a non-fatal error.
A zero, in which case some indeterminate amount of time has passed and, if you care about how long it's been, you need to check the time separately.
A positive value, in which case you should check all of the flagged descriptors and act on them.
In all cases, you should check whether any other conditions exist which might make you want to exit the loop, such as how much time has actually passed.
Note that the first parameter to select() should generally be the constant FD_SETSIZE. There is little to be gained in setting it to anything else.
Also note that just because you received a datagram doesn't mean you received the datagram you wanted. You need a way to check that you did not get some random datagram that happened to be floating around on the network (it happens). Along those lines, make sure TOTALPACKETSIZE is 65536, because that's theoretically (approximately) how big a random packet could be.
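Putting the three cases together, a minimal sketch of such a loop, using the socket and timeout constants from the question (this is an illustration, not the poster's code):
for (;;)
{
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(*p_socket, &fds);
    timeval tv;
    tv.tv_sec = NETWORKTIMEOUTSEC;
    tv.tv_usec = NETWORKTIMEOUTUSEC;
    // First parameter per the first answer above; Winsock ignores it anyway.
    int n = select(*p_socket + 1, &fds, NULL, NULL, &tv);
    if (n == -1)
    {
        // Check the error: EINTR (on Linux) is non-fatal, so continue; otherwise bail out.
        break;
    }
    if (n == 0)
    {
        // Timed out. Some indeterminate amount of time has passed; check the clock if it matters.
        break;
    }
    if (FD_ISSET(*p_socket, &fds))
    {
        // A datagram is ready: call recvfrom() and verify it is the packet you were waiting for.
        break;
    }
}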

After sending a lot, my send() call causes my program to stall completely. How is this possible?

So basically I'm making an MMO server in C++ that runs on Linux. It works fine at first, but after maybe 40 seconds with 50 clients it will completely pause. When I debug it I find that basically the last frame it's on before it stops responding is syscall(), at which point it disappears into the kernel. Once it disappears into the kernel it never even returns a value... it's completely baffling.
The 50 clients are each sending 23 bytes every 250 milliseconds. These 23 bytes are then broadcast to all the other 49 clients. This process begins to slow down and then eventually comes to a complete halt, where the kernel never returns from the syscall for the send() call. What are some possible reasons here? This is truly driving me nuts!
One option I found is Nagle's algorithm, which forces delays. I've tried toggling it, but it still happens.
Edit: The program is stuck here. Specifically, in the send, which in turn calls syscall()
bool EpollManager::s_send(int curFD, unsigned char buf[], int bufLen, int flag)
// Meant to counteract partial sends
{
int sendRetVal = 0;
int bytesSent = 0;
while(bytesSent != bufLen)
{
print_buffer(buf, bufLen);
sendRetVal = send(curFD, buf + bytesSent, bufLen - bytesSent, flag);
cout << sendRetVal << " ";
if(sendRetVal == -1)
{
perror("Sending failed");
return false;
}
else
bytesSent += sendRetVal;
}
return true;
}
Also this is the method which calls the s_send.
void EpollManager::broadcast(unsigned char msg[], int bytesRead, int sender)
{
for(iMap = connections.begin(); iMap != connections.end(); iMap++)
{
if(sender != iMap->first)
{
if(s_send(iMap->first, msg, bytesRead, 0)) // MSG_NOSIGNAL
{
if(debug)
{
print_buffer(msg, bytesRead);
cout << "sent on file descriptor " << iMap->first << '\n';
}
}
}
}
if(connections.find(sender) != connections.end())
connections[sender]->reset_batch();
}
And to clarify, connections is an instance of Boost's unordered_map. The data that the program chokes on is not unique in any way either; it has been broadcast successfully to other file descriptors, but chokes on an (at least seemingly) random one.
TCP flow control, together with Nagle's algorithm and a full send buffer (see the SO_SNDBUF socket option), will cause send() and similar operations to block.
The lazy way around this is to implement a separate thread per socket, but this does not scale too far. On Linux you should use non-blocking sockets with poll() or similar; on Windows you would investigate IO completion ports. Look at middleware libraries to simplify this: libevent is a popular cross-platform example with recent inclusion of Windows IOCP support; alternatively, Boost.Asio for C++.
A useful article to read on IO scalability would be The C10K problem.
Note that you really do not want to disable Nagle's algorithm on Internet traffic; even on a LAN you might see major problems without some form of congestion feedback.
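A sketch of that non-blocking send() plus poll() approach (Linux; the names are illustrative, and in a real epoll-based server you would normally queue the unsent bytes and wait for EPOLLOUT rather than poll inside the send path):
#include <cerrno>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>

bool send_all_nonblocking(int fd, const unsigned char* buf, int len)
{
    // Ensure the descriptor is non-blocking (normally done once, at accept time).
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    int sent = 0;
    while (sent < len)
    {
        ssize_t n = send(fd, buf + sent, len - sent, MSG_NOSIGNAL);
        if (n > 0)
        {
            sent += (int)n;
            continue;
        }
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        {
            // The send buffer is full: wait (with a timeout) for the socket to
            // become writable instead of letting one slow client stall everyone.
            pollfd pfd = { fd, POLLOUT, 0 };
            if (poll(&pfd, 1, 5000) <= 0)
                return false; // timeout or poll error
            continue;
        }
        return false; // real send error
    }
    return true;
}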
The kernel keeps a finite buffer for sending data. If the receiver isn't receiving, that buffer will fill up and the sender will block. Could that be the problem?
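If you want to see how much room there is before that happens, the size of that buffer can be queried (and enlarged) through SO_SNDBUF; a small illustrative sketch, not taken from the answer:
#include <cstdio>
#include <sys/socket.h>

void print_send_buffer_size(int fd)
{
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0)
        std::printf("SO_SNDBUF is %d bytes\n", sndbuf);
    // setsockopt(fd, SOL_SOCKET, SO_SNDBUF, ...) can enlarge it, but a bigger
    // buffer only postpones the stall if a receiver stops reading.
}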