sendmsg + raw ethernet + several frames - c++

I use Linux 3.x and a modern glibc (2.19).
I would like to send several Ethernet frames without switching back and forth between kernel and user space.
I have MTU = 1500, and I want to send 800 KB.
I init receiver address like this:
struct sockaddr_ll socket_address;
socket_address.sll_ifindex = if_idx.ifr_ifindex;
socket_address.sll_halen = ETH_ALEN;
socket_address.sll_addr[0] = MY_DEST_MAC0;
//...
After that I can call sendto/sendmsg 800 KB / 1500 ~= 500 times and all works fine, but this requires user space <-> kernel transitions ~500 * 25 times per second. I want to avoid that.
I tried to init struct msghdr::msg_iov with the appropriate info,
but got the error "message too long"; it looks like msghdr::msg_iov cannot describe anything with size > MTU.
So the question is: is it possible to send many raw Ethernet frames on Linux from user space at once?
PS: The data (800 KB) comes from a file and is read into memory, so struct iovec suits me well: I can create a suitable number of Ethernet headers and use two iovecs per 1500-byte packet, one pointing to the data and one pointing to the Ethernet header.
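For illustration, here is a sketch of the kind of msg_iov setup meant above (eth_header, payload, payload_len, sockfd, and socket_address are placeholders, not the actual code):

#include <net/ethernet.h>   // struct ether_header
#include <sys/socket.h>     // sendmsg, struct msghdr
#include <sys/uio.h>        // struct iovec

struct iovec iov[2];
iov[0].iov_base = eth_header;                  // 14-byte Ethernet header
iov[0].iov_len  = sizeof(struct ether_header);
iov[1].iov_base = payload;                     // payload for this one frame
iov[1].iov_len  = payload_len;                 // must stay <= MTU - header

struct msghdr msg = {};
msg.msg_name    = &socket_address;
msg.msg_namelen = sizeof(socket_address);
msg.msg_iov     = iov;
msg.msg_iovlen  = 2;

// Note: a single sendmsg() on a packet socket apparently sends exactly one
// frame; the kernel concatenates all iovec entries into one datagram, which
// seems to be why anything larger than the MTU fails with EMSGSIZE.
ssize_t n = sendmsg(sockfd, &msg, 0);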

Whoa.
My last company made realtime hidef video encoding hardware. In the lab, we had to blast 200MB / second across a bonded link, so I have some experience with this. What follows is based upon that.
Before you can tune, you must measure. You don't want to do multiple syscalls, but can you prove with timing measurement that the overhead is significant?
I use a wrapper routine around clock_gettime that gives back the time of day with nanosecond precision (e.g. (tv_sec * 1000000000) + tv_nsec). Call this [herein] "nanotime".
So, for any given syscall, you need a measurement:
tstart = nanotime();
syscall();
tdif = nanotime() - tstart;
For send/sendto/sendmsg/write, do this with small data so you're sure you're not blocking [or use O_NONBLOCK, if applicable]. This gives you the syscall overhead.
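For example, a minimal sketch of such a wrapper and measurement (this assumes CLOCK_MONOTONIC; fdsock, buf, and len are placeholders):

#include <stdint.h>
#include <time.h>
#include <unistd.h>

static inline uint64_t nanotime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

uint64_t tstart = nanotime();
ssize_t n = write(fdsock, buf, len);    // or send/sendto/sendmsg
uint64_t tdif = nanotime() - tstart;    // syscall cost in nanoseconds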
Why are you going directly to ethernet frames? TCP [or UDP] is usually fast enough and modern NIC cards can do the envelope wrap/strip in hardware. I'd like to know if there is a specific situation that requires ethernet frames, or was it that you weren't getting the performance you wanted and came up with this as a solution. Remember, you're doing 800KB/s (~1MB/s) and my project was doing 100x-200x more than that over TCP.
What about using two plain write calls to the socket? One for header, one for data [all 800KB]. write can be used on a socket and doesn't have the EMSGSIZE error or restriction.
Further, why do you need your header to be in a separate buffer? When you allocate your buffer, just do:
size_t datamax = 800 * 1024; // or whatever
size_t buflen = sizeof(struct header) + datamax;
char *buf = (char *) malloc(buflen);

while (1) {
    ssize_t datalen = read(fdfile, &buf[sizeof(struct header)], datamax);
    if (datalen <= 0)
        break;                              // EOF or read error
    // fill in header ...
    write(fdsock, buf, sizeof(struct header) + datalen);
}
This works even for the ethernet frame case.
One of the things you can also do is use setsockopt to increase the size of the kernel buffer for your socket. Otherwise, you can send data, but it will be dropped in the kernel before the receiver can drain it. More on this below.
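For example (a sketch; the 4 MB values are arbitrary and the kernel clamps them to net.core.wmem_max / net.core.rmem_max):

#include <sys/socket.h>

int sndbuf = 4 * 1024 * 1024;           // sender side
setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

int rcvbuf = 4 * 1024 * 1024;           // receiver side
setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));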
To measure the performance of the wire, add some fields to your header:
u64 send_departure_time; // set by sender from nanotime
u64 recv_arrival_time; // set by receiver when the packet arrives
So, the sender sets the departure time and does the write [just the header for this test]. Call this packet Xs. The receiver stamps it when it arrives and immediately sends back a message to the sender [call it Xr] with its own departure stamp and the contents of Xs. When the sender gets this, it stamps it with an arrival time.
With the above we now have:
T1 -- time packet Xs departed sender
T2 -- time packet Xs arrived at receiver
T3 -- time packet Xr departed receiver
T4 -- time packet Xr arrived at sender
Assuming you do this on a relatively quiet connection with little to no other traffic and you know the link speed (e.g. 1 Gb/s), with T1/T2/T3/T4 you can calculate the overhead.
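A sketch of the arithmetic, assuming the path is symmetric (all values in nanoseconds):

uint64_t round_trip = (T4 - T1) - (T3 - T2);   // wire + stack time, both directions
uint64_t one_way    = round_trip / 2;          // assumes a symmetric path

// Compare one_way against the ideal serialization time for the packet:
// at 1 Gb/s that's roughly 1 bit per nanosecond, so ideal_ns ~= packet_bits.
// Anything above that is driver/stack/syscall overhead.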
You can repeat the measurement for TCP/UDP vs ETH. You may find that it doesn't buy you as much as you think. Once again, can you prove it with precise measurement?
I "invented" this algorithm while working at the aforementioned company, only to find out that it was already part of a video standard for sending raw video across a 100Gb Ethernet NIC card and the NIC does the timestamping in hardware.
One of the other things you may have to do is add some throttle control. This is similar to what bittorrent does or what the PCIe bus does.
When PCIe bus nodes first start up, they communicate how much free buffer space they have available for "blind write". That is, the sender is free to blast up to this much, without any ACK message. As the receiver drains its input buffer, it sends periodic ACK messages to the sender with the number of bytes it was able to drain. sender can add this value back to the blind write limit and keep going.
For your purposes, the blind write limit is the size of the receiver's kernel socket buffer.
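A sketch of that credit scheme (have_data, next_chunk_size, send_chunk, and wait_for_ack_bytes are hypothetical helpers; the starting credit is the receiver's socket buffer size):

size_t credit = receiver_sockbuf_size;      // learned at startup, as on PCIe

while (have_data()) {
    size_t chunk = next_chunk_size();
    if (chunk > credit) {                   // out of credit: wait for an ACK
        credit += wait_for_ack_bytes();     // receiver reports bytes it drained
        continue;
    }
    send_chunk(chunk);                      // blind write within the credit
    credit -= chunk;
}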
UPDATE
Based upon some of the additional information from your comments [the actual system configuration should be added, in more complete form, as an edit at the bottom of your question]:
You do have a need for a raw socket and sending an ethernet frame. You can reduce the overhead by setting a larger MTU via ifconfig or an ioctl call with SIOCSIFMTU. I recommend the ioctl. You may not need to set MTU to 800KB. Your CPU's NIC card has a practical limit. You can probably increase MTU from 1500 to 15000 easily enough. This would reduce syscall overhead by 10x and that may be "good enough".
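A sketch of the ioctl (needs CAP_NET_ADMIN; "eth0" and 9000 are placeholders, and the NIC/driver must actually support the value):

#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

struct ifreq ifr;
memset(&ifr, 0, sizeof(ifr));
strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
ifr.ifr_mtu = 9000;                         // jumbo-frame sized MTU
if (ioctl(sockfd, SIOCSIFMTU, &ifr) < 0)
    perror("SIOCSIFMTU");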
You probably will have to use sendto/sendmsg. The two write calls were based on conversion to TCP/UDP. But, I suspect sendmsg with msg_iov will have more overhead than sendto. If you search, you'll find that most example code for what you want uses sendto. sendmsg seems like less overhead for you, but it may cause more overhead for the kernel. Here's an example that uses sendto: http://hacked10bits.blogspot.com/2011/12/sending-raw-ethernet-frames-in-6-easy.html
In addition to improving syscall overhead, a larger MTU might improve the efficiency of the "wire", even though this doesn't seem like a problem in your use case. I have experience with CPU + FPGA systems and communicating between them, but I am still puzzled by one of your comments about "not using a wire". FPGA connected to ethernet pins of the CPU I get--sort of. More precisely, do you mean FPGA pins connected to the ethernet pins of the NIC card/chip of the CPU?
Are the CPU/NIC on the same PC board and the FPGA pins are connected via PC board traces? Otherwise, I don't understand "not using a wire".
However, once again, I must say that you must be able to measure your performance before you blindly try to improve it.
Have you run the test case I suggested for determining the syscall overhead? If it is small enough, trying to optimize for it may not be worth it and doing so may actually hurt performance more severely in other areas that you didn't realize when you started.
As an example, I once worked on a system that had a severe performance problem, such that, the system didn't work. I suspected the serial port driver was slow, so I recoded from a high level language (e.g. like C) into assembler.
I increased the driver performance by 2x, but it contributed less than a 5% performance improvement to the system. It turned out the real problem was that other code was accessing non-existent memory which just caused a bus timeout, slowing the system down measurably [it did not generate an interrupt that would have made it easy to find as on modern systems].
That's when I learned the importance of measurement. I had done my optimization based on an educated guess, rather than hard data. After that: lesson learned!
Nowadays, I never try large optimization until I can measure first. In some cases, I add an optimization that I'm "sure" will make things better (e.g. inlining a function). When I measure it [and because I can measure it], I find out that the new code is actually slower and I have to revert the change. But, that's the point: I can prove/disprove this with hard performance data.
What CPU are you using: x86, arm, mips, etc. At what clock frequency? How much DRAM? How many cores?
What FPGA are you using (e.g. Xilinx, Altera)? What specific type/part number? What is the maximum clock rate? Is the FPGA devoted entirely to logic or do you also have a CPU inside it such as microblaze, nios, arm? Does the FPGA have access to DRAM of its own [and how much DRAM]?
If you increase the MTU, can the FPGA handle it, from either a buffer/space standpoint or a clock speed standpoint? If you increase the MTU, you may need to add an ack/sync protocol as I suggested in the original post.
Currently, the CPU is doing a blind write of the data, hoping the FPGA can handle it. This means you have an open race condition between CPU and FPGA.
This may be mitigated, purely as a side effect of sending small packets. If you increase MTU too much, you might overwhelm the FPGA. In other words, it is the very overhead you're trying to optimize away, that allows the FPGA to keep up with the data rate.
This is what I meant by unintended consequences of blind optimization. It can have unintended and worse side effects.
What is the nature of the data being sent to the FPGA? You're sending 800KB, but how often?
I am assuming that it is not the FPGA firmware itself for a few reasons. You said the firmware was already almost full [and it is receiving the ethernet data]. Also, firmware is usually loaded via the I2C bus, a ROM, or an FPGA programmer. So, am I correct?
You're sending the data to the FPGA from a file. This implies that it is only being sent once, at the startup of your CPU's application. Is that correct? If so, optimization is not needed because it's an init/startup cost that has little impact on the running system.
So, I have to assume that the file gets loaded many times, possibly a different file each time. Is that correct? If so, you may need to consider the impact of the read syscall. Not just from syscall overhead, but optimal read length. For example, IIRC, the optimal transfer size for a disk-to-disk or file-to-file copy/transfer is 64KB, depending upon the filesystem or underlying disk characteristics.
So, if you're looking to reduce overhead, reading the data from a file may have considerably more overhead than having the application generate the data [if that's possible].
The kernel syscall interface is designed to be very low overhead. Kernel programmers [I happen to be one] spend a great deal of time ensuring the overhead is low.
You say your system is using a lot of CPU time for other things. Can you measure the other things? How is your application structured? How many processes? How many threads? How do they communicate? What is the latency/throughput? You may be able to find [can quite probably find] the larger bottlenecks and recode those and you'll get an overall reduction in CPU usage that far exceeds the maximum benefit you'll get from the MTU tweak.
Trying to optimize the syscall overhead may be like my serial port optimization. A lot of effort, and yet the overall results are/were disappointing.
When considering performance, it is important to consider it from an overall system standpoint. In your case, this means CPU, FPGA, and anything else in it.
You say that the CPU is doing a lot of things. Could/should some of those algorithms go into the FPGA? Is the reason they're not because the FPGA is almost out of space, otherwise you would? Is the FPGA firmware 100% done? Or, is there more RTL to be written? If you're at 90% space utilization in the FPGA, and you'll need more RTL, you may wish to consider going to an FPGA part that has more space for logic, possibly with a higher clock rate.
In my video company, we used FPGAs. We used the largest/fastest state-of-the-art part the FPGA vendor had. We also used virtually 100% of the space for logic and required the part's maximum clock rate. We were told by the vendor that we were the largest consumer of FPGA resources of any of their client companies worldwide. Because of this, we were straining the vendors development tools. Place-and-route would frequently fail and have to be rerun to get correct placement and meet timing.
So, when an FPGA is almost full with logic, the place-and-route can be difficult to achieve. It might be a reason to consider a larger part [if possible].

Related

Hard disk contention using multiple threads

I have not performed any profile testing of this yet, but what would the general consensus be on the advantages/disadvantages of resource loading from the hard disk using multiple threads vs one thread? Note. I am not talking about the main thread.
I would have thought that using more than one "other" thread to do the loading would be pointless because the HD cannot do 2 things at once, and it would therefore surely only cause disk contention.
Not sure which way to go architecturally, appreciate any advice.
EDIT: Apologies, I meant an SSD drive, not a magnetic drive. Both are HDs to me, but I am more interested in the case of a system with a single SSD drive.
As pointed out in the comments, one advantage of using multiple threads is that a large file load will not delay the delivery of a smaller file to the consumer of the loader thread. In my case this is a big advantage, so even if it costs a little performance, having multiple threads is desirable.
I know there are no simple answers, but the real question I am asking is: what kind of performance penalty (%) would there be for serializing the parallel disk accesses (in the OS layer) as opposed to allowing only one resource-loader thread? And what are the factors that drive this? I don't mean platform, manufacturer etc. I mean, technically, what aspects of the OS/HD interaction influence this penalty (in theory)?
FURTHER EDIT:
My exact use case is texture-loading threads which only exist to load from the HD and then "pass" the textures on to OpenGL, so there is minimal computation in the threads (maybe some type conversion etc.). In this case, the threads would spend most of their time waiting for the HD (I would have thought), and therefore how the OS-HD interaction is managed is important to understand. My OS is Windows 10.
Note. I am not talking about the main thread.
Main vs non-main thread makes zero difference to the speed of reading a disk.
I would have thought that using more than one "other" thread to do the loading would be pointless because the HD cannot do 2 things at once, and it would therefore surely only cause disk contention.
Indeed. Not only are the attempted parallel reads forced to wait for each other (and thus not actually parallel), but they will also make the access pattern of the disk random as opposed to sequential, which is much, much slower due to disk head seek time.
Of course, if you were to deal with multiple hard disks, then one thread dedicated for each drive would probably be optimal.
Now, if you were using a solid state drive instead of a hard drive, the situation isn't quite so clear cut. Multiple threads may be faster, slower, or comparable. There are probably many factors involved such as firmware, file system, operating system, speed of the drive relative to some other bottleneck, etc.
In either case, RAID might invalidate assumptions made here.
It depends on how much processing of the data you're going to do. This will determine whether the application is I/O bound or compute bound.
For example, if all you are going to do to the data is some simple arithmetic, e.g. add 1, then you will end up being I/O bound. The CPU can add 1 to data far quicker than any I/O system can deliver flows of data.
However, if you're going to do a large amount of work on each batch of data, e.g. a FFT, then a filter, then a convolution (I'm picking random DSP routine names here), then it's likely that you will end up being compute bound; the CPU cannot keep up with the data being delivered by the I/O subsystem which owns your SSD.
It is quite an art to judge just how an algorithm should be structured to match the capabilities of the underlying machine, and vice versa. There are profiling tools like ftrace/KernelShark and Intel's VTune, which are both useful in analysing exactly what is going on. Google does a lot to measure how many searches-per-Watt their hardware accomplishes, power being their biggest cost.
In general I/O of any sort, even a big array of SSDs, is painfully slow. Even the main memory in a PC (DDR4) is painfully slow in comparison to what the CPU can consume. Even the L3 and L2 caches are sluggards in comparison to the CPU cores. It's hard to design and multi-threadify an algorithm just right so that the right amount of work is done on each data item whilst it is in L1 cache so that the L2, L3 caches, DDR4 and I/O subsystems can deliver the next data item to the L1 caches just in time to keep the CPU cores busy. And the ideal software design for one machine is likely hopeless on another with a different CPU, or SSD, or memory SIMMs. Intel design for good general purpose computer performance, and actually extracting peak performance from a single program is a real challenge. Libraries like Intel's MKL and IPP are very big helps in doing this.
General Guidance
In general one should look at it in terms of data bandwidth required by any particular arrangement of threads and work those threads are doing.
This means benchmarking your program's inner processing loop and measuring how much data it processes and how quickly it manages to do so, choosing a number of data items that makes sense but is much larger than the size of the L3 cache. A single 'data item' is an amount of input data, the amount of corresponding output data, and any variables used in processing the input to the output, the total size of which fits in L1 cache (with some room to spare). And no cheating - use the CPU's SSE/AVX instructions where appropriate, don't forego them by writing plain C or not using something like Intel's IPP/MKL. [Though if one is using IPP/MKL, it kinda does all this for you to the best of its ability.]
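As a rough sketch of such a benchmark (process_item, items, and N are placeholders; choose N so the working set is well above the L3 size):

#include <chrono>
#include <cstddef>
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
size_t bytes = 0;
for (size_t i = 0; i < N; ++i) {
    process_item(items[i]);                 // the work done per data item
    bytes += sizeof(items[i]);
}
auto t1 = std::chrono::steady_clock::now();
double secs = std::chrono::duration<double>(t1 - t0).count();
std::printf("%.2f GB/s\n", bytes / secs / 1e9);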
These days DDR4 memory is going to be good for anything between 20 and 100 GByte/second (depending on what CPU, number of SIMMs, etc), so long as you're not making random, scattered accesses to the data. By saturating the L3 you are forcing yourself into being bound by the DDR4 speed. Then you can start changing your code, increasing the work done by each thread on a single data item. Keep increasing the work per item and the speed will eventually start increasing; you've reached the point where you are no longer limited by the speed of DDR4, then L3, then L2.
If after this you can still see ways of increasing the work per data item, then keep going. You eventually get to a data bandwidth somewhere near that of the IO subsystems, and only then will you be getting the absolute most out of the machine.
It's an iterative process, and experience allows one to short cut it.
Of course, if one runs out of ideas for things to increase the work done per data item then that's the end of the design process. More performance can be achieved only by improving the bandwidth of whatever has ended up being the bottleneck (almost certainly the SSD).
For those of us who like doing this sort of thing, the PS3's Cell processor was a dream. No need to second guess the cache, there was none. One had complete control over what data and code was where and when it was there.
A lot of people will tell you that an HD can't do more than one thing at once. This isn't quite true because modern IO systems have a lot of indirection. Saturating them is difficult to do with one thread.
Here are three scenarios that I have experienced where multi-threading the IO helps.
Sometimes the IO reading library has a non-trivial amount of computation, think about reading compressed videos, or parity checking after the transfer has happened. One example is using robocopy with multiple threads. It's not unusual to launch robocopy with 128 threads!
Many operating systems are designed so that a single process can't saturate the IO, because this would lead to system unresponsiveness. In one case I got a 3% read speed improvement because I came closer to saturating the IO. This is doubly true if some system policy exists to stripe the data to different drives, as might be set on a Lustre drive in an HPC cluster. For my application, the optimal number of threads was two.
More complicated IO, like a RAID card, contains a substantial cache that keeps the HD head constantly reading and writing. To get optimal throughput you need to be sure that whenever the head is spinning it's constantly reading/writing and not just moving. The only way to do this, in practice, is to saturate the card's on-board RAM.
So, many times you can overlap some minor amount of computation by using multiple threads, and stuff starts getting tricky with larger disk arrays.
Not sure which way to go architecturally, appreciate any advice.
Determining the amount of work per thread is the most common architectural optimization. Write code so that it's easy to increase the IO worker count. You're going to need to benchmark.
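A sketch of what I mean by keeping the worker count easy to change (next_request and load_resource are hypothetical; benchmark with 1, 2, 4, ... workers):

#include <thread>
#include <vector>

void run_loaders(unsigned num_workers)
{
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < num_workers; ++i)
        workers.emplace_back([] {
            while (auto req = next_request())   // hypothetical blocking work queue
                load_resource(*req);            // read from disk, hand off to the GL thread
        });
    for (auto &w : workers)
        w.join();
}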

Detecting maximum throughput of my network using C++

My application aims to detect the network throughput. Leaving the C++ code aside, I am looking for a reliable method of throttling my network that returns the exact value of the maximum accepted baud rate. I could write the code later.
After checking several ideas from the internet, I didn't find one that suits my case.
I tried to send as much data as possible using TCP/IP and then check the baud rate after every 10 MB sent. Here is the pseudocode for my algorithm:
while ((sentBytes = send(...)) > 0) {
    tempSendBytes += sentBytes;
    if (tempSendBytes > 10 * 1024 * 1024) {   // check every 10 MB
        if (baudrate > predefinedThreshold)
            usleep(calculateNeededTime());
        tempSendBytes = 0;
    }
}
But when predefinedThreshold is reached, the buffers fill up and my program gets stuck without returning any error. As a matter of fact, checking the baud rate on every sent message would decrease my bandwidth to its minimum, so I preferred to check every 10 MB.
PS: There are no other technical problems in my code, nor a memory leak. In addition, my program runs normally (sending and receiving 100% of the data) if I decrease predefinedThreshold.
My question:
Is there a way to detect the maximum bandwidth (on both loopback and a real network) without overflowing buffers or getting stuck?
Yes, you can detect maximum throughput on both loopback and real interface. The real interface may require a TCP server running remotely with sufficient bandwidth to provide an accurate estimate of max throughput.
If you are strictly looking for theoretical, you may be able to run the test on the real interface the same way you run it on localhost, with the server bound to the real interface's IP and the client running on the same computer. I'm not sure what your OS's networking stack will do to this traffic, but it should treat it as if it were coming from off the box.
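A crude sketch of such a probe (sockfd is an already-connected TCP socket; the 100 MB total and 64 KB chunk size are arbitrary, and the number only means something if the receiver actually drains the data):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <sys/socket.h>

char chunk[64 * 1024] = {};
const size_t total = 100ull * 1024 * 1024;
size_t sent = 0;
auto t0 = std::chrono::steady_clock::now();
while (sent < total) {
    ssize_t n = send(sockfd, chunk, sizeof(chunk), 0);
    if (n <= 0)
        break;                              // error or peer closed the connection
    sent += n;
}
auto t1 = std::chrono::steady_clock::now();
double secs = std::chrono::duration<double>(t1 - t0).count();
std::printf("~%.1f Mbit/s\n", sent * 8 / secs / 1e6);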
A lot of factors contribute to max theoretical throughput for TCP. TCP MTU, OS send buffer, OS receive buffer, etc. Wikipedia has a good high level overview, but it sounds like you may have already read it. http://en.wikipedia.org/wiki/Measuring_network_throughput You might also find this TCP tuning overview helpful, http://en.wikipedia.org/wiki/TCP_tuning
IPerf is commonly used to accurately measure bandwidth and the techniques it uses are rather comprehensive. It is written in C++ and its code base may be a good starting point for you. https://github.com/esnet/iperf
I know none of this provides an exact discussion of the theory, but hopefully helps clarify some things.

libpcap to capture 10 Gbps NIC

I want to capture packets from 10Gbps network card with 0 packet loss.
I am using libpcap for a 100 Mbps NIC and it is working fine.
Will libpcap be able to handle 10Gbps NIC traffic?
If not, what are the other alternative ways to achieve this?
Whether or not libpcap will handle 10 Gbps with 0 packet loss is a matter of the machine that you are using and the libpcap version. If the machine, CPU and HDD I/O are fast enough, you may get 0 packet loss. Otherwise you may need to perform the following actions:
Update your libpcap to the most recent version. Libpcap 1.0.0 or later supports the zero-copy (memory-mapped) mechanism. It means that there is a buffer that's in both the kernel's address space and the application's address space, so that data doesn't need to be copied from a kernel-mode buffer to a user-mode buffer. Packets are still copied from the skbuff (Linux) into the shared buffer, so it's really more like "one-copy", but that's still one fewer copy, so that could reduce the CPU time required to receive captured packets. Moreover, more packets can be fetched from the buffer per application wake-up call.
If you observe a high CPU utilization, it is probably your CPU that cannot handle the packet arrival rate. You can use xosview (a system load visualization tool) to check your system resources during the capture.
If the CPU drops packets, you can use PF_RING. PF_RING is an extension of libpcap with a circular buffer: http://www.ntop.org/products/pf_ring/. It is way faster and can capture at 10 Gbps with commodity NICs: http://www.ntop.org/products/pf_ring/hardware-packet-filtering/.
Another approach is to get a NIC that has an on-board memory and a specific HW design for packet capturing, see http://en.wikipedia.org/wiki/DAG_Technology.
If the CPU is no longer your problem, you need to test the disk data transfer speed. hdparm is the simplest tool on Linux. Some distros come with a GUI, otherwise:
$ sudo hdparm -tT /dev/hda
If you are developing your own application based on libpcap:
a) Use pcap_stats to identify (i) the number of packets dropped because there was no room in the operating system's buffer when they arrived, because packets weren't being read fast enough, and (ii) the number of packets dropped by the network interface or its driver.
b) Libpcap 1.0.0 has an API that lets an application set the buffer size, on platforms where the buffer size can be set (a short sketch follows below). If you find it hard to set the buffer, you can use libpcap 1.1.0 or later, in which the default capture buffer size has been increased from 32K to 512K.
c) If you are just using tcpdump, use 4.0.0 or later and use the -B flag to set the size of the buffer.
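A sketch combining (a) and (b), assuming libpcap >= 1.0 ("eth0" and the 512 MB buffer are placeholders):

#include <pcap/pcap.h>
#include <cstdio>

char errbuf[PCAP_ERRBUF_SIZE];
pcap_t *p = pcap_create("eth0", errbuf);
pcap_set_snaplen(p, 65535);
pcap_set_buffer_size(p, 512 * 1024 * 1024);   // capture buffer, in bytes
pcap_set_timeout(p, 1000);                    // read timeout, in ms
pcap_activate(p);

// ... run the capture loop ...

struct pcap_stat st;
if (pcap_stats(p, &st) == 0)
    std::printf("recv=%u drop=%u ifdrop=%u\n", st.ps_recv, st.ps_drop, st.ps_ifdrop);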
You don't say which operating system or CPU. It doesn't matter whether you choose libpcap or not, the underlying network performance is still burdened by the operating system's memory management and its network driver. libpcap has kept pace and can handle 10 Gbps, but there's more.
If you want the best CPU so that you can do number-crunching and run virtual machines while capturing packets, go with an AMD Opteron CPU, which still outperforms the Intel Xeon Quadcore 5540 2.53GHz (despite Intel's XIO/DDIO introduction and mostly because of Intel's dual-core sharing of the same L2 cache). For the best ready-made OS, go with the latest FreeBSD as-is (which still outperforms Linux 3.10 networking using basic hardware). Otherwise, Intel and Linux will work just fine for basic drop-free 10 Gbps capture, provided you are eager to roll up your sleeves.
If you're pushing for breakneck speed all the time while doing financial-like or stochastic or large matrix predictive computational crunching (or something), then read-on...
As Red Hat has discovered, 67.2 nanoseconds is all you get to process one minimal-sized packet at a 10 Gbps rate. I assert it's closer to 81.6 nanoseconds for a 64-byte Ethernet payload, but they are talking about the 46-byte minimum as a theoretical figure.
To cut it short, you WON'T be able to DO or USE any of the following if you want 0% packet drop at full-rate by staying under 81.6 ns for each packet:
Make an SKB call for each packet (to minimize that overhead, amortize it over several hundreds of packets)
TLB (Translation lookaside buffer, to avoid that, use HUGE page allocations)
Short latency (you did say 'capture', so latency is irrelevant here). It's called Interrupt Coalesce
(ethtool -C rx-frames 1024+).
Float processes across multi-CPU (must lock them down, one per network interface interrupt)
libc malloc() (must replace it with a faster one, preferably HUGE-based one)
So, Linux has an edge over FreeBSD to capture at the 10 Gbps rate with a 0% drop rate AND run several virtual machines (and other overheads). It just requires a new memory management (MM) of some sort for a specific network device, and not necessarily the whole operating system. Most new super-high-performance network drivers now make devices use HUGE pages allocated in userland, then use driver calls to pass a bundle of packets at a time.
Many new network drivers with repurposed MM are out (in no particular order):
netmap
PF-RING
PF-RING+netmap
OpenOnload
DPDK
PacketShader
The maturity level of each code is highly dependent on which Linux (or distro) version you choose. I've tried a few of them and once I understood the basic design, it became apparent what I needed. YMMV.
Updated: White paper on high speed packet architecture: https://arxiv.org/pdf/1901.10664.pdf
Good luck.
PF_RING is a good solution; an alternative can be netsniff-ng (http://netsniff-ng.org/).
For both projects, the performance gain is achieved by zero-copy mechanisms. Obviously, the bottleneck can then be the HD and its data transfer rate.
If you have the time, then move to Intel DPDK. It allows zero-copy access to the NIC's hardware registers. I was able to achieve 0% drops at 10 Gbps, 1.5 Mpps, on a single core.
You'll be better off in the long run.

Unexpected Socket CPU Utilization

I'm having a performance issue that I don't understand. The system I'm working on has two threads that look something like this:
Version A:
Thread 1: Data Processing -> Data Selection -> Data Formatting -> FIFO
Thread 2: FIFO -> Socket
Where 'Selection' thins down the data and the FIFO at the end of thread 1 is the FIFO at the beginning of thread 2 (the FIFOs are actually TBB Concurrent Queues). For performance reasons, I've altered the threads to look like this:
Version B:
Thread 1: Data Processing -> Data Selection -> FIFO
Thread 2: FIFO -> Data Formatting -> Socket
Initially, this optimization proved to be successful. Thread 1 is capable of much higher throughput. I didn't look too hard at Thread 2's performance because I expected the CPU usage would be higher and (due to data thinning) it wasn't a major concern. However, one of my colleagues asked for a performance comparison of version A and version B. To test the setup I had thread 2's socket (a boost asio tcp socket) write to an instance of iperf on the same box (127.0.0.1) with the goal of showing the maximum throughput.
To compare the two set ups I first tried forcing the system to write data out of the socket at 500 Mbps. As part of the performance testing I monitored top. What I saw surprised me. Version A did not show up on 'top -H' nor did iperf (this was actually as suspected). However, version B (my 'enhanced version') was showing up on 'top -H' with ~10% cpu utilization and (oddly) iperf was showing up with 8%.
Obviously, that implied to me that I was doing something wrong. I can't seem to prove that I am though! Things I've confirmed:
Both versions are giving the socket 32k chunks of data
Both versions are using the same boost library (1.45)
Both have the same optimization setting (-O3)
Both receive the exact same data, write out the same data, and write it at the same rate.
Both use the same blocking write call.
I'm testing from the same box with the exact same setup (Red Hat)
The 'formatting' part of thread 2 is not the issue (I removed it and reproduced the problem)
Small packets across the network is not the issue (I'm using TCP_CORK and I've confirmed via wireshark that the TCP Segments are all ~16k).
Putting a 1 ms sleep right after the socket write makes the CPU usage on both the socket thread and iperf(?!) go back to 0%.
Poor man's profiler reveals very little (the socket thread is almost always sleeping).
Callgrind reveals very little (the socket write barely even registers)
Switching iperf for netcat (writing to /dev/null) doesn't change anything (actually netcat's cpu usage was ~20%).
The only thing I can think of is that I've introduced a tighter loop around the socket write. However, at 500 Mbps I wouldn't expect the CPU usage of both my process and iperf to be increased.
I'm at a loss to why this is happening. My coworkers and I are basically out of ideas. Any thoughts or suggestions? I'll happily try anything at this point.
This is going to be very hard to analyze without code snippets or actual data quantities.
One thing that comes to mind: if the pre-formatted data stream is significantly larger than post-format, you may be expending more bandwidth/cycles copying a bunch more data through the FIFO (socket) boundary.
Try estimating or measuring the data rate at each stage. If the data rate is higher at the output of 'selection', consider the effects of moving formatting to the other side of the boundary. Is it possible that no copy is required for the select->format transition in configuration A, and configuration B imposes lots of copies?
... just guesses without more insight into the system.
What if the FIFO was the bottleneck in version A? Then both threads would sit and wait for the FIFO most of the time. And in version B, you'd be handing the data off to iperf faster.
What exactly do you store in the FIFO queues? Do you store packets of data, i.e. buffers?
In version A, you were writing formatted data (probably bytes) to the queue. So, sending it on the socket involved just writing out a fixed size buffer.
However, in version B you are storing high-level data in the queues. Formatting it now creates bigger buffers that are written directly to the socket. This causes the TCP/IP stack to spend CPU cycles on fragmentation and other overhead...
This is my theory based on what you have said so far.

What is the optimal size of a UDP packet for maximum throughput?

I need to send packets from one host to another over a potentially lossy network. In order to minimize packet latency, I'm not considering TCP/IP. But I wish to maximize the throughput using UDP. What should be the optimal size of a UDP packet to use?
Here are some of my considerations:
The MTU size of the switches in the network is 1500. If I use a large packet, for example 8192, this will cause fragmentation. Loss of one fragment will result in the loss of the entire packet, right?
If I use smaller packets, I'll incur the overhead of the UDP and IP header
If I use a really large packet, what is the largest that I can use? I read that the largest datagram size is 65507. What is the buffer size I should use to allow me to send such sizes? Would that help to bump up my throughput?
What are the typical maximum datagram sizes supported by the common OSes (e.g. Windows, Linux, etc.)?
Updated:
Some of the receivers of the data are embedded systems for which TCP/IP stack is not implemented.
I know that this place is filled with people who are very adamant about using what's available. But I hope for better answers than ones that just focus on the MTU alone.
The best way to find the ideal datagram size is to do exactly what TCP itself does to find the ideal packet size: Path MTU discovery.
TCP also has a widely used option where both sides tell the other what their MSS (basically, MTU minus headers) is.
Alternative answer: be careful to not reinvent the wheel.
TCP is the product of decades of networking experience. There is a reason for everything, or almost everything, it does. It has several algorithms most people do not think about often (congestion control, retransmission, buffer management, dealing with reordered packets, and so on).
If you start reimplementing all the TCP algorithms, you risk ending up with a (paraphrasing Greenspun's Tenth Rule) "ad hoc, informally-specified, bug-ridden, slow implementation of TCP".
If you have not done so yet, it could be a good idea to look at some recent alternatives to TCP/UDP, like SCTP or DCCP. They were designed for niches where neither TCP nor UDP was a good match, precisely to allow people to use an already "debugged" protocol instead of reinventing the wheel for every new application.
The easiest workaround to find the MTU in C# is to send UDP packets with the DontFragment flag set to true. If it throws an exception, reduce the packet size. Do this until no exception is thrown. You can start with a packet size of 1500.
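On Linux in C/C++, the rough equivalent is the IP_MTU_DISCOVER socket option: with IP_PMTUDISC_DO the kernel sets DF and refuses to fragment, so an oversized send() fails with EMSGSIZE. A sketch (udpfd is a connected UDP socket; the step size is arbitrary):

#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>

int val = IP_PMTUDISC_DO;
setsockopt(udpfd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));

char probe[1500] = {};
int size = 1500 - 28;                       // start near Ethernet MTU minus IP/UDP headers
while (size > 0 && send(udpfd, probe, size, 0) < 0 && errno == EMSGSIZE)
    size -= 16;                             // shrink until a probe goes through

// Note: the kernel may only learn the real path MTU after ICMP feedback comes
// back, so in practice this needs a retry loop with small delays. On a
// connected socket, getsockopt(udpfd, IPPROTO_IP, IP_MTU, ...) reports the
// path MTU the kernel has learned so far.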
Another thing to consider is that some network devices don't handle fragmentation very well. We've seen many routers that drop fragmented UDP packets or packets that are too big. The suggestion by CesarB to use Path MTU is a good one.
Maximum throughput is not driven only by the packet size (though this contributes, of course). Minimizing latency and maximizing throughput are often at odds with one another. In TCP you have the Nagle algorithm which is designed (in part) to increase overall throughput. However, some protocols (e.g., telnet) often disable Nagle (i.e., set the No Delay bit) in order to improve latency.
Do you have some real time constraints for the data? Streaming audio is different than pushing non-realtime data (e.g., logging information) as the former benefits more from low latency while the latter benefits from increased throughput and perhaps reliability. Are there reliability requirements? If you can't miss packets and have to have a protocol to request retransmission, this will reduce overall throughput.
There are a myriad of other factors that go into this and (as was suggested in another response) at some point you get a bad implementation of TCP. That being said, if you want to achieve low latency and can tolerate loss, using UDP with an overall packet size set to the path MTU (be sure to set the payload size to account for headers) is likely the optimal solution (especially if you can ensure that UDP can get from one end to the other).
Well, I've got a non-MTU answer for you. Using a connected UDP socket should speed things up for you. There are two reasons to call connect on your UDP socket. The first is efficiency. When you call sendto on an unconnected UDP socket what happens is that the kernel temporarily connects the socket, sends the data and then disconnects it. I read about a study indicating that this takes up nearly 30% of processing time when sending. The other reason to call connect is so that you can get ICMP error messages. On an unconnected UDP socket the kernel doesn't know what application to deliver ICMP errors to and so they just get discarded.
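For example (a sketch; the address, port, payload, and payload_len are placeholders):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

struct sockaddr_in dest = {};
dest.sin_family = AF_INET;
dest.sin_port   = htons(5000);                        // placeholder port
inet_pton(AF_INET, "192.0.2.10", &dest.sin_addr);     // placeholder address

int udpfd = socket(AF_INET, SOCK_DGRAM, 0);
connect(udpfd, (struct sockaddr *) &dest, sizeof(dest));

// After connect(), use send()/recv(); the kernel caches the route and ICMP
// errors for this destination are reported back on this socket.
send(udpfd, payload, payload_len, 0);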
The IP header is >= 20 bytes (but usually 20) and the UDP header is 8 bytes. This leaves you 1500 - 28 = 1472 bytes for your data. Path MTU discovery finds the smallest possible MTU on the way to the destination. But this does not necessarily mean that, when you use the smallest MTU, you will get the best possible performance. I think the best way is to do a benchmark. Or maybe you should not care about the smallest MTU on the way at all. A network device may very well use a small MTU and also transfer packets very fast. And its value may very well change in the future, so you cannot discover it once and save it somewhere to use later on; you have to do it periodically. If I were you, I would set the MTU to something like 1440 and benchmark the application...
Even though the MTU at the switch is 1500, you can have situations (like tunneling through a VPN) that wrap a few extra headers around the packet - you may do better to reduce the size slightly and go with 1450 or so.
Can you simulate the network and test performance with different packet sizes?