C++ network sockets, SCTP and packet size

I'm currently developing a server using connection-oriented SCTP to serve a small number of clients. After finishing the first prototype with a naive implementation, I'm now profiling the application to optimize. As it turns out, one of the two main consumers of CPU time is the networking part.
There are two questions about the efficiency of the application-level protocol I have implemented:
1) Packet size
Currently, I use a maximum packet size of 64 bytes. You can find many posts discussing packet sizes that are too big, but can they be too small? As SCTP allows me to read one packet at a time - similarly to UDP - while guaranteeing in-order delivery - similarly to TCP - this simplified the implementation significantly. However, if I understand correctly, this will cost one syscall for each and every packet I send. Does the number of syscalls have a significant impact on performance? Would I be able to shave off a lot of CPU cycles by bundling the messages into bigger packets, i.e. 1024 - 8192 bytes?
2) Reading and writing the buffers
I'm currently using memcpy to move data into and out of the application-level network buffers. I found many conflicting posts about what is more efficient, memcpy or normal assignment. I'm wondering if one approach will be significantly faster than the other in this scenario:
Option 1
void Network::ReceivePacket(char* packet)
{
    uint8_t param1;
    uint16_t param2;
    uint32_t param3;
    memcpy(&param1, packet, 1);
    memcpy(&param2, packet + 1, 2);
    memcpy(&param3, packet + 3, 4);
    // Handle the packet here
}
void Network::SendPacket(uint8_t param1, uint16_t param2, uint32_t param3)
{
    char packet[7];
    memcpy(packet, &param1, 1);
    memcpy(packet + 1, &param2, 2);
    memcpy(packet + 3, &param3, 4);
    // Send the packet here
}
Option 2
void Network::ReceivePacket(char* packet)
{
    uint8_t param1;
    uint16_t param2;
    uint32_t param3;
    param1 = *(uint8_t*)packet;
    param2 = *(uint16_t*)(packet + 1); // note: unaligned access
    param3 = *(uint32_t*)(packet + 3); // note: unaligned access
    // Handle the packet here
}
void Network::SendPacket(uint8_t param1, uint16_t param2, uint32_t param3)
{
    char packet[7];
    *(uint8_t*)packet = param1;
    *(uint16_t*)(packet + 1) = param2;
    *(uint32_t*)(packet + 3) = param3;
    // Send the packet here
}
The first one seems a lot cleaner to me, but I've found many posts indicating that maybe the second one is quite a bit faster.
Any kind of feedback is of course welcome.

As far as I know, compilers specifically optimize memcpy calls, so you should probably use it.
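For illustration, here is a rough sketch of what the memcpy style looks like with fixed-width types (the 7-byte layout just mirrors the question; the helper name is made up). For small constant sizes like these, modern compilers typically lower each memcpy to a single load or store, and unlike the raw pointer casts in Option 2 it sidesteps unaligned access and strict-aliasing issues:
#include <cstdint>
#include <cstring>

// Unpack a 7-byte packet laid out as [u8][u16][u32]; offsets mirror the question.
void unpack(const char* packet, uint8_t& p1, uint16_t& p2, uint32_t& p3)
{
    std::memcpy(&p1, packet,     sizeof p1);
    std::memcpy(&p2, packet + 1, sizeof p2);
    std::memcpy(&p3, packet + 3, sizeof p3);
    // The values arrive in whatever byte order the sender used;
    // convert with ntohs/ntohl here if the wire format is big-endian.
}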
About your first question:
Summary: Make your packets as big as you reasonably can, to avoid wasting CPU time on per-packet overhead.
A syscall (system call) means your request is handed over to the OS and executed in the kernel, and each one carries a moderate amount of overhead.
To be honest I am not familiar with SCTP; I haven't done socket programming since I last built a server over TCP. I remember the MTU of the relevant physical layer was 1500 bytes, and I implemented my packet size as roughly 1450-1460, trying to get the largest packet that still fit under that 1500-byte cap.
So what I'm saying is: if I were you, I would keep the OS as uninvolved as possible so it doesn't drag down CPU performance.

If you're looking to minimize the number of system calls, and you do find yourself sending and receiving multiple messages at a time, you might want to look at using the (Linux-only) sendmmsg() and recvmmsg() calls.
To use these, you would likely need to enqueue messages internally, which might add latency that would not exist otherwise.
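For what it's worth, a minimal sketch of the sending side with sendmmsg(), assuming a connected socket and the fixed 64-byte messages from the question (the function name and the batch limit of 16 are arbitrary choices for illustration):
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

int send_batch(int sockfd, char (*msgs)[64], size_t count)
{
    struct mmsghdr hdrs[16];
    struct iovec iovs[16];
    if (count > 16)
        count = 16;

    memset(hdrs, 0, sizeof(hdrs));
    for (size_t i = 0; i < count; ++i) {
        iovs[i].iov_base = msgs[i];
        iovs[i].iov_len  = 64;               // fixed 64-byte messages, as in the question
        hdrs[i].msg_hdr.msg_iov    = &iovs[i];
        hdrs[i].msg_hdr.msg_iovlen = 1;
    }
    // One syscall submits up to 'count' messages; the return value is how many were queued.
    return sendmmsg(sockfd, hdrs, (unsigned int)count, 0);
}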

I wouldn't go over 1024 for a buffer size personally. I've experienced some runtime problems when using packets over 1500, but 1024 is a power of two (2^10), which makes it a convenient size to work with.
It is possible, but I wouldn't advise it. I would make a separate thread for receiving packets, using recvmsg() so you can use multiple streams. I've found this to work wonderfully.
The main point of SCTP is multiple streams, so I would take full advantage of that. You just have to ensure that the data is put back in the correct order once everything is received (which takes some work to do).
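To make the multi-stream idea concrete, here is a rough sketch using sctp_recvmsg() from lksctp-tools (<netinet/sctp.h>, link with -lsctp); the buffer size and dispatching are placeholders:
#include <netinet/sctp.h>
#include <sys/socket.h>
#include <stdio.h>

void receive_loop(int sockfd)
{
    char buf[1024];
    struct sctp_sndrcvinfo sinfo;
    int flags = 0;

    // Note: for sinfo to be filled in, the SCTP_EVENTS socket option normally has
    // to be set with sctp_data_io_event enabled.
    for (;;) {
        int n = sctp_recvmsg(sockfd, buf, sizeof(buf),
                             NULL, NULL,     // peer address not needed here
                             &sinfo, &flags);
        if (n <= 0)
            break;
        // Dispatch by stream number so each logical stream keeps its own ordering.
        printf("got %d bytes on stream %u\n", n, (unsigned)sinfo.sinfo_stream);
    }
}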

Related

What will this C++ function do when run?

Can anybody tell me what the following function send_signal will do in real terms? I'm assuming it will send 1 million bytes of data to COM2, but that doesn't appear to match up with its real world implementation.
Update. The hardware on COM2 is a robotic arm and the function in question triggers the arm. However, the arm action only lasts for circa 10 seconds - but I'm guessing that 1 million bytes of data will take longer than 10 seconds to send and thus would trigger multiple actions (but it doesn't).
#define TXDATA 0x2F8 // COM2
#define SIGNAL 0x00  // This value gives +12V for 1 millisecond
void send_signal(void)
{
    long count;
    for (count = 0; count < 1000000; count++)
        _outp(TXDATA, (char)(SIGNAL + SIGNAL));
}
On an appropriate embedded system, it will indeed place one million bytes in the COM2 transmitter holding register. Since it does so faster than the port can actually transmit them, most of them will be lost. To actually send all one million bytes, it is necessary to read the status register and check for bit 5 (THR is empty) before copying each byte.
On a computer running a modern OS, it will fault because userspace is not permitted direct access to I/O ports, it has to go through a driver.
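As a sketch of that status-register check (assuming a 16550-compatible UART, where the Line Status Register sits at base + 5 and bit 5 means the transmitter holding register is empty; _inp/_outp are the same legacy MSVC port I/O calls as in the question):
#include <conio.h> // _inp / _outp (legacy MSVC; obsolete on modern systems)

#define TXDATA 0x2F8        // COM2 transmitter holding register
#define LSR    (TXDATA + 5) // line status register (16550-compatible UART assumed)
#define THRE   0x20         // bit 5: transmitter holding register empty

void send_byte_when_ready(char byte)
{
    // Busy-wait until the UART reports the holding register is empty, then write.
    // Without this check, bytes are overwritten faster than the port can shift them out.
    while ((_inp(LSR) & THRE) == 0)
        ;
    _outp(TXDATA, byte);
}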
It calls the _outp function with 0x2F8 as the first argument and 0x00 as the second argument one million times.
Without context one can only speculate. My guess would be for it to set a high state on COM2 for a certain amount of time.
This code sends 1,000,000 bytes of 0x00 to port 0x2F8.
Note that the _outp function is obsolete.

Best practice to implement fixed byte serial protocol in C++?

I have a device connected via a serial interface to a BeagleBone computer. It communicates in a simple binary format like:
| MessageID (1 byte) | Data (n bytes) | Checksum (2 bytes) |
The message length is fixed for each command, so it is known how many bytes to read after the first byte of a command has been received. After some initial setup communication it sends packets of data every 20 ms.
My approach would be to use either termios or something like a serial library and then run a loop along these lines:
while (keepRunning)
{
    char buffer[256];
    serial.read(buffer, 1);
    switch (buffer[0])
    {
    case COMMAND1:
        serial.read(&buffer[1], sizeof(MessageHello) + 2); // Read data + checksum
        if (calculateChecksum(buffer, sizeof(MessageHello) + 3))
        {
            extractDatafromCommand(buffer);
        }
        else
        {
            doSomeErrorHandling(buffer[0]);
        }
        break;
    case COMMAND2:
        serial.read(&buffer[1], sizeof(MessageFoo) + 2);
        [...]
    }
}
extractDatafromCommand would then create some structs like:
struct MessageHello
{
    char name[20];
    int version;
};
Put everything in an own read thread and signal the availability of a new packet to other parts of the program using a semaphore (or a simple flag).
Is this a viable solution, or are there better approaches (I assume so)?
Maybe make an abstract class Message and derive the other messages from it?
It really depends. The two major ways would be threaded (like you mentioned) and evented.
Threaded code is tricky because you can easily introduce race conditions. Code that you tested a million times could occasionally stumble and do the wrong thing after working for days or weeks or years. It's hard to 'prove' that things will always behave correctly. Seemingly trivial things like "i++" suddenly become leaky abstractions. (See why is i++ not thread safe on a single core machine? )
The other alternative is evented programming. Basically, you have a main loop that does a select() on all your file handles. Anything that is ready gets looked at, and you try to read/write as many bytes as you can without blocking. (pass O_NONBLOCK). There are two tricky parts: 1) you must never do long calculations without having a way to yield back to the main loop, and 2) you must never do a blocking operation (where the kernel stops your process waiting for a read or write).
In practice, most programs don't have long computations and it's easier to audit a small amount of your code for blocking calls than for races. (Although doing DNS without blocking is trickier than it should be.)
The upside of evented code is that there's no need for locking (no other threads to worry about) and it wastes less memory (in the general case where you're creating lots of threads.)
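A bare-bones sketch of that evented shape, assuming a non-blocking serial file descriptor (the fd, chunk size and frame handling are placeholders):
#include <sys/select.h>
#include <fcntl.h>
#include <unistd.h>

void event_loop(int serial_fd)
{
    fcntl(serial_fd, F_SETFL, O_NONBLOCK); // never block inside read()

    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(serial_fd, &readfds);

        if (select(serial_fd + 1, &readfds, NULL, NULL, NULL) < 0)
            break; // real code would inspect errno

        if (FD_ISSET(serial_fd, &readfds)) {
            char chunk[256];
            ssize_t n = read(serial_fd, chunk, sizeof(chunk));
            if (n > 0) {
                // Append to a reassembly buffer and extract complete
                // |ID|Data|Checksum| frames from it here.
            }
        }
    }
}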
Most likely, you want to use a serial lib. termios processing is just overhead and a chance for stray bytes to do bad things.

fread speeds managed unmanaged

Ok, so I'm reading a binary file into a char array I've allocated with malloc.
(btw the code here isn't the actual code, I just wrote it on the spot to demonstrate, so any mistakes here are probably not mistakes in the actual program.) This method reads at about 50 million bytes per second.
main
char *buffer = (char*)malloc(file_length_in_bytes*sizeof(char));
memset(buffer,0,file_length_in_bytes*sizeof(char));
//start time here
read_whole_file(buffer);
//end time here
free(buffer);
read_whole_buffer
void read_whole_buffer(char* buffer)
{
//file already opened
fseek(_file_pointer, 0, SEEK_SET);
int a = sizeof(buffer[0]);
fread(buffer, a, file_length_in_bytes*a, _file_pointer);
}
I've written something similar in managed C++, using a FileStream and its ReadByte() function, I believe, to read the entire file byte by byte, and it also reads at around 50 million bytes per second.
Also, I have a SATA and an IDE drive in my computer, and I've tried loading the file from both; it doesn't make any difference at all (which is weird, because I was under the assumption that SATA reads much faster than IDE).
Question
Maybe you can all understand why this doesn't make any sense to me. As far as I knew, it should be much faster to fread a whole file into an array than to read it byte by byte. On top of that, through testing I've discovered that managed C++ is slower (only noticeable, though, if you are benchmarking your code and you require speed).
SO
Why in the world am I reading at the same speed with both applications? Also, is 50 million bytes per second from a file into an array quick?
Maybe my motherboard is bottlenecking me? That just doesn't seem to make much sense either.
Is there maybe a faster way to read a file into an array?
thanks.
My 'script timer'
Records start and end times with millisecond resolution... most importantly, it's not a timer.
#pragma once
#ifndef __Script_Timer__
#define __Script_Timer__
#include <sys/timeb.h>
extern "C"
{
struct Script_Timer
{
unsigned long milliseconds;
unsigned long seconds;
struct timeb start_t;
struct timeb end_t;
};
void End_ST(Script_Timer *This)
{
ftime(&This->end_t);
This->seconds = This->end_t.time - This->start_t.time;
This->milliseconds = (This->seconds * 1000) + (This->end_t.millitm - This->start_t.millitm);
}
void Start_ST(Script_Timer *This)
{
ftime(&This->start_t);
}
}
#endif
Read buffer thing
char face = 0;
char comp = 0;
char nutz = 0;
for (int i = 0; i < (_length * sizeof(char)); ++i)
{
    face = buffer[i];
    if (face == comp)
        nutz = (face + comp) / (i + 1); // avoid dividing by zero on the first iteration
    comp++;
}
Transfers from or to main memory run at speeds of gigabytes per second. Inside the CPU data flows even faster. It is not surprising that, whatever you do at the software side, the hard drive itself remains the bottleneck.
Here are some numbers from my system, using PerformanceTest 7.0:
hard disk: Samsung HD103SI 5400 rpm: sequential read/write at 80 MB/s
memory: 3 * 2 GB at 400 MHz DDR3: read/write around 2.2 GB/s
So if your system is a bit older than mine, a hard drive speed of 50 MB/s is not surprising. The connection to the drive (IDE/SATA) is not all that relevant; it's mainly about the number of bits passing the drive heads per second, purely a hardware thing.
Another thing to keep in mind is your OS's filesystem cache. It could be that the second time round, the hard drive isn't accessed at all.
The 180 MB/s memory read speed that you mention in your comment does seem a bit on the low side, but that may well depend on the exact code. Your CPU's caches come into play here. Maybe you could post the code you used to measure this?
The FILE* API uses buffered streams, so even if you read byte by byte, the API internally reads buffer by buffer. So your comparison will not make a big difference.
The low level IO API (open, read, write, close) is unbuffered, so using this one will make a difference.
It may also be faster for you, if you do not need the automatic buffering of the FILE* API!
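For comparison, a sketch of the unbuffered route on a POSIX system (the function name and chunked-read loop are just illustrative):
#include <fcntl.h>
#include <unistd.h>

// Read a whole file into 'buffer' using the low-level open/read/close API.
ssize_t read_whole_file_raw(const char* path, char* buffer, size_t length)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < length) {
        ssize_t n = read(fd, buffer + total, length - total);
        if (n <= 0) // 0 = end of file, < 0 = error
            break;
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}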
I've done some tests on this, and after a certain point, the effect of increased buffer size goes down the bigger the buffer. There is usually an optimum buffer size you can find with a bit of trial and error.
Note also that fread() (or more specifically the C or C++ I/O library) will probably be doing its own buffering. If your system supports it, a plain read() may (or may not) be a bit faster.

How to use protocol buffers?

Could someone please help and tell me how to use protocol buffers? I want to exchange data through sockets between a program running on unix and another running on windows in order to run simulation studies.
The programs that use sockets to exchange data are written in C/C++, and I would be glad if someone could help me use protocol buffers to exchange data in the form of:
struct snd_data {
    char *var = "temp";
    int var1 = 1;
    float var2;
    double var2;
};
I tried several ways, but the data is still not exchanged correctly. Any help would be much appreciated.
Thanks for your help,
You start by defining your message in a .proto file:
package foo;
message snd_data {
required string var= 1;
required int32 var1 = 2;
optional float var2 = 3;
optional double var3 = 4;
}
(I guess the float and double actually are different variables...)
Then you compile it using protoc, which generates the code implementing your message.
For further information see: http://code.google.com/apis/protocolbuffers/docs/cpptutorial.html
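Assuming the definition above lives in a file called snd_data.proto and is compiled with protoc --cpp_out=., a minimal round trip with the generated class could look roughly like this (the header name follows from the file name):
#include "snd_data.pb.h" // generated by protoc from the .proto above
#include <string>

std::string make_wire_data()
{
    foo::snd_data msg;
    msg.set_var("temp");
    msg.set_var1(1);
    msg.set_var2(3.14f);
    msg.set_var3(2.718);

    std::string out;
    msg.SerializeToString(&out); // 'out' is the byte sequence to put on the socket
    return out;
}

bool read_wire_data(const std::string& in)
{
    foo::snd_data msg;
    if (!msg.ParseFromString(in))
        return false; // malformed or truncated message
    return msg.var1() == 1;
}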
How are you writing your messages to the socket? Protobuf is not endian-sensitive itself, but neither does it define a transport mechanism -- protobuf defines a mapping between a message and its serialized form (which is a sequence of (8-bit) bytes), and it is your responsibility to transfer this sequence of bytes to the remote host.
In our case, we define a very simple transport protocol: first we write the message size as a 32-bit integer (big endian), then comes the message itself. (Also remember that protobuf messages are not self-identifying, which means that you need to know which message you are sending. This is typically managed by having a wrapper message containing optional fields for all messages you want to send. See the protobuf website and mailing list archives for more info about this technique.)
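A sketch of that framing on the sending side, assuming the generated foo::snd_data class from the answer above, a connected socket and a POSIX environment (a real implementation should loop on short writes):
#include "snd_data.pb.h"
#include <arpa/inet.h>  // htonl
#include <sys/socket.h>
#include <cstdint>
#include <string>

// Write one length-prefixed message: 4-byte big-endian size, then the payload.
bool send_framed(int sockfd, const foo::snd_data& msg)
{
    std::string body;
    if (!msg.SerializeToString(&body))
        return false;

    uint32_t len = htonl(static_cast<uint32_t>(body.size()));
    std::string frame(reinterpret_cast<const char*>(&len), sizeof(len));
    frame += body;

    return send(sockfd, frame.data(), frame.size(), 0) == (ssize_t)frame.size();
}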
Endianess is handled within protobuf.
See:
https://groups.google.com/forum/?fromgroups#!topic/protobuf/XbzBwCj4yL8
How cross-platform is Google's Protocol Buffer's handling of floating-point types in practice?
Are both machines x86? Otherwise you need to watch for big-endian versus little-endian differences. It's also worth paying attention to struct packing. Passing pointers can be problematic too, because pointers are different sizes on different platforms. All in all, there is far too little information in your post to say for certain what is going wrong ...
The answer lies in the endianness of the data being transmitted; this is something you need to consider very carefully and check. Endianness can cause data to get messed up on both the receiver and the sender. There is no guarantee that data sent from a unix box will arrive on the windows box with the same in-memory layout, and the structure padding on the unix box may differ from the padding on the windows box; it boils down to how the compiler switches are used, think structure alignment.

Should network packet payload data be aligned on proper boundries?

If you have the following class as a network packet payload:
class Payload
{
char field0;
int field1;
char field2;
int field3;
};
Does using a class like Payload leave the recipient of the data susceptible to alignment issues when receiving the data over a socket? I would think that the class would either need to be reordered or add padding to ensure alignment.
Either reorder:
class Payload
{
int field1;
int field3;
char field0;
char field2;
};
or add padding:
class Payload
{
    char field0;
    char pad0[3];
    int field1;
    char field2;
    char pad1[3];
    int field3;
};
If reordering doesn't make sense for some reason, I would think adding the padding would be preferred since it would avoid alignment issues even though it would increase the size of the class.
What is your experience with such alignment issues in network data?
Correct, blindly ignoring alignment can cause problems, even on the same operating system, if two components were compiled with different compilers or different compiler versions.
It is better to...
1) Pass your data through some sort of serialization process.
2) Or pass each of your primitives individually, while still paying attention to byte ordering == Endianness
A good place to start would be Boost Serialization.
You should look into Google protocol buffers, or Boost::serialize like another poster said.
If you want to roll your own, please do it right.
If you use types from stdint.h (i.e. uint32_t, int8_t, etc.) and make sure every variable has "native alignment" (meaning its address is evenly divisible by its size: int8_ts can go anywhere, uint16_ts go on even addresses, and uint32_ts go on addresses divisible by 4), you won't have to worry about alignment or packing.
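For example, re-ordering the Payload class from the question with fixed-width types keeps every member at an offset that is a multiple of its own size (the explicit reserved field is my addition, just to make the tail padding visible):
#include <stdint.h>

struct Payload {
    uint32_t field1;      // offset 0
    uint32_t field3;      // offset 4
    uint8_t  field0;      // offset 8
    uint8_t  field2;      // offset 9
    uint8_t  reserved[2]; // explicit tail padding so sizeof(Payload) is 12 everywhere
};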
At a previous job we had all structures sent over our databus (ethernet or CANbus or byteflight or serial ports) defined in XML. There was a parser that would validate alignment on the variables within the structures (alerting you if someone wrote bad XML), and then generate header files for various platforms and languages to send and receive the structures. This worked really well for us, we never had to worry about hand-writing code to do message parsing or packing, and it was guaranteed that all platforms wouldn't have stupid little coding errors. Some of our datalink layers were pretty bandwidth constrained, so we implemented things like bitfields, with the parser generating the proper code for each platform. We also had enumerations, which was very nice (you'd be surprised how easy it is for a human to screw up coding bitfields on enumerations by hand).
Unless you need to worry about it running on 8051s and HC11s with C, or over data link layers that are very bandwidth constrained, you are not going to come up with something better than protocol buffers, you'll just spend a lot of time trying to be on par with them.
We use packed structures that are overlaid directly over the binary packet in memory today and I am rueing the day that I decided to do that. The only way that we have gotten this to work is by:
carefully defining bit-width specific types based on the compilation environment (typedef unsigned int uint32_t)
inserting the appropriate compiler-specific pragmas to specify tight packing of structure members
requiring that everything is in one byte order (use network or big-endian ordering)
carefully writing both the server and client code
If you are just starting out, I would advise you to skip the whole mess of trying to represent what's on the wire with structures. Just serialize each primitive element separately. If you choose not to use an existing library like Boost Serialize or a middleware like TibCo, then save yourself a lot of headache by writing an abstraction around a binary buffer that hides the details of your serialization method. Aim for an interface like:
class ByteBuffer {
public:
ByteBuffer(uint8_t *bytes, size_t numBytes) {
buffer_.assign(&bytes[0], &bytes[numBytes]);
}
void encode8Bits(uint8_t n);
void encode16Bits(uint16_t n);
//...
void overwrite8BitsAt(unsigned offset, uint8_t n);
void overwrite16BitsAt(unsigned offset, uint16_t n);
//...
void encodeString(std::string const& s);
void encodeString(std::wstring const& s);
uint8_t decode8BitsFrom(unsigned offset) const;
uint16_t decode16BitsFrom(unsigned offset) const;
//...
private:
std::vector<uint8_t> buffer_;
};
Each of your packet classes would then have a method to serialize to a ByteBuffer or be deserialized from a ByteBuffer and offset. This is one of those things that I absolutely wish I could go back in time and correct. I cannot count the number of times that I have spent time debugging an issue that was caused by forgetting to swap bytes or not packing a struct.
The other trap to avoid is using a union to represent bytes or memcpying to an unsigned char buffer to extract bytes. If you always use Big-Endian on the wire, then you can use simple code to write the bytes to the buffer and not worry about the htonl stuff:
void ByteBuffer::encode8Bits(uint8_t n) {
buffer_.push_back(n);
}
void ByteBuffer::encode16Bits(uint16_t n) {
encode8Bits(uint8_t((n & 0xff00) >> 8));
encode8Bits(uint8_t((n & 0x00ff) ));
}
void ByteBuffer::encode32Bits(uint32_t n) {
encode16Bits(uint16_t((n & 0xffff0000) >> 16));
encode16Bits(uint16_t((n & 0x0000ffff) ));
}
void ByteBuffer::encode64Bits(uint64_t n) {
encode32Bits(uint32_t((n & 0xffffffff00000000) >> 32));
encode32Bits(uint32_t((n & 0x00000000ffffffff) ));
}
This remains nicely platform agnostic since the numerical representation is always logically Big-Endian. This code also lends itself very nicely to using templates based on the size of the primitive type (think encode<sizeof(val)>((unsigned char const*)&val))... not so pretty, but very, very easy to write and maintain.
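For completeness, a matching decode under the same Big-Endian convention might look like this (a sketch; decode32BitsFrom is not in the interface above, it just extends the same pattern):
uint8_t ByteBuffer::decode8BitsFrom(unsigned offset) const {
    return buffer_[offset];
}
uint16_t ByteBuffer::decode16BitsFrom(unsigned offset) const {
    // Reassemble from two bytes, most significant first (Big-Endian wire order).
    return uint16_t((uint16_t(buffer_[offset]) << 8) | buffer_[offset + 1]);
}
uint32_t ByteBuffer::decode32BitsFrom(unsigned offset) const {
    return (uint32_t(decode16BitsFrom(offset)) << 16) | decode16BitsFrom(offset + 2);
}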
My experience is that the following approaches are to be preferred (in order of preference):
Use a high level framework like Tibco, CORBA, DCOM or whatever that will manage all these issues for you.
Write your own libraries on both sides of the connection that are aware of packing, byte order and other issues.
Communicate only using string data.
Trying to send raw binary data without any mediation will almost certainly cause lots of problems.
You practically can't use a class or structure for this if you want any sort of portability. In your example, the ints may be 32-bit or 64-bit depending on your system. You're most likely using a little endian machine, but the older Apple macs are big endian. The compiler is free to pad as it likes too.
In general you'll need a method that writes each field to the buffer a byte at a time, after ensuring you get the byte order right with htons, htonl or (where available) htonll.
If you don't have natural alignment in the structures, compilers will usually insert padding so that alignment is proper. If, however, you use pragmas to "pack" the structures (remove the padding), there can be very harmful side effects. On PowerPCs, non-aligned floats generate an exception. If you're working on an embedded system that doesn't handle that exception, you'll get a reset. If there is a routine to handle that interrupt, it can DRASTICALLY slow down your code, because it'll use a software routine to work around the misalignment, which will silently cripple your performance.