I'm having a performance issue that I don't understand. The system I'm working on has two threads that look something like this:
Version A:
Thread 1: Data Processing -> Data Selection -> Data Formatting -> FIFO
Thread 2: FIFO -> Socket
Where 'Selection' thins down the data and the FIFO at the end of thread 1 is the FIFO at the beginning of thread 2 (the FIFOs are actually TBB Concurrent Queues). For performance reasons, I've altered the threads to look like this:
Version B:
Thread 1: Data Processing -> Data Selection -> FIFO
Thread 2: FIFO -> Data Formatting -> Socket
Initially, this optimization proved to be successful. Thread 1 is capable of much higher throughput. I didn't look too hard at Thread 2's performance because I expected the CPU usage would be higher and (due to data thinning) it wasn't a major concern. However, one of my colleagues asked for a performance comparison of version A and version B. To test the setup I had thread 2's socket (a boost asio tcp socket) write to an instance of iperf on the same box (127.0.0.1) with the goal of showing the maximum throughput.
To compare the two setups, I first tried forcing the system to write data out of the socket at 500 Mbps. As part of the performance testing I monitored top. What I saw surprised me. Version A did not show up in 'top -H', nor did iperf (this was actually as expected). However, version B (my 'enhanced version') was showing up in 'top -H' with ~10% CPU utilization and (oddly) iperf was showing up with 8%.
Obviously, that implied to me that I was doing something wrong. I can't seem to prove that I am though! Things I've confirmed:
Both versions are giving the socket 32k chunks of data
Both versions are using the same boost library (1.45)
Both have the same optimization setting (-O3)
Both receive the exact same data, write out the same data, and write it at the same rate.
Both use the same blocking write call.
I'm testing from the same box with the exact same setup (Red Hat)
The 'formatting' part of thread 2 is not the issue (I removed it and reproduced the problem)
Small packets across the network are not the issue (I'm using TCP_CORK and I've confirmed via Wireshark that the TCP segments are all ~16k).
Putting a 1 ms sleep right after the socket write makes the CPU usage on both the socket thread and iperf(?!) go back to 0%.
Poor man's profiler reveals very little (the socket thread is almost always sleeping).
Callgrind reveals very little (the socket write barely even registers)
Switching iperf for netcat (writing to /dev/null) doesn't change anything (actually netcat's cpu usage was ~20%).
The only thing I can think of is that I've introduced a tighter loop around the socket write. However, at 500 Mbps I wouldn't expect the CPU usage of both my process and iperf to increase.
I'm at a loss as to why this is happening. My coworkers and I are basically out of ideas. Any thoughts or suggestions? I'll happily try anything at this point.
This is going to be very hard to analyze without code snippets or actual data quantities.
One thing that comes to mind: if the pre-formatted data stream is significantly larger than post-format, you may be expending more bandwidth/cycles copying a bunch more data through the FIFO (socket) boundary.
Try estimating or measuring the data rate at each stage. If the data rate is higher at the output of 'selection', consider the effects of moving formatting to the other side of the boundary. Is it possible that no copy is required for the select->format transition in configuration A, and configuration B imposes lots of copies?
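One way to act on that suggestion is to count the bytes crossing each stage boundary and turn the count into a rate. The sketch below is only illustrative: `StageMeter` and its members are hypothetical names, not part of the poster's code, and it assumes each meter is updated from a single thread.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: counts the bytes that pass one stage boundary
// (for example the output of 'Selection' or the output of 'Formatting').
// Not thread-safe; use one meter per thread.
class StageMeter {
public:
    // Record that `bytes` more bytes crossed this boundary.
    void add(std::uint64_t bytes) { total_bytes_ += bytes; }

    std::uint64_t total_bytes() const { return total_bytes_; }

    // Average rate in megabits per second over the given elapsed time.
    double mbps(std::chrono::duration<double> elapsed) const {
        if (elapsed.count() <= 0.0) return 0.0;
        return (total_bytes_ * 8.0) / (elapsed.count() * 1e6);
    }

private:
    std::uint64_t total_bytes_ = 0;
};
```

Placing one meter just before the FIFO and one just before the socket write in each version would show whether version B pushes more bytes through the queue than version A.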
... just guesses without more insight into the system.
What if the FIFO was the bottleneck in version A? Then both threads would sit and wait for the FIFO most of the time. And in version B, you'd be handing the data off to iperf faster.
What exactly do you store in the FIFO queues? Do you store packets of data, i.e. buffers?
In version A, you were writing formatted data (probably bytes) to the queue. So, sending it on the socket involved just writing out a fixed size buffer.
However, in version B, you are storing high-level data in the queues. Formatting it now creates bigger buffers that are written directly to the socket. This causes the TCP/IP stack to spend CPU cycles on fragmentation and overhead...
This is my theory based on what you have said so far.
I use Linux 3.x and a modern glibc (2.19).
I would like to send several Ethernet frames without switching back and forth between kernel and user space.
I have MTU = 1500, and I want to send 800 KB.
I initialize the receiver address like this:
struct sockaddr_ll socket_address;
socket_address.sll_ifindex = if_idx.ifr_ifindex;
socket_address.sll_halen = ETH_ALEN;
socket_address.sll_addr[0] = MY_DEST_MAC0;
//...
After that I can call sendto/sendmsg 800 KB / 1500 ~= 500 times and it all works fine, but this requires ~500 * 25 user space <-> kernel transitions per second. I want to avoid that.
I tried to initialize struct msghdr::msg_iov with the appropriate info,
but got the error "message too long"; it looks like msghdr::msg_iov cannot describe anything larger than the MTU.
So the question is: is it possible to send many raw Ethernet frames at once from user space on Linux?
PS
I get the data (800 KB) from a file and read it into memory. So struct iovec suits me: I can create a suitable number of Ethernet headers and have two iovecs per 1500-byte packet, one pointing to the data and one pointing to the Ethernet header.
Whoa.
My last company made realtime hidef video encoding hardware. In the lab, we had to blast 200MB / second across a bonded link, so I have some experience with this. What follows is based upon that.
Before you can tune, you must measure. You don't want to do multiple syscalls, but can you prove with timing measurement that the overhead is significant?
I use a wrapper routine around clock_gettime that gives back the time of day with nanosecond precision (e.g. (tv_sec * 1000000000) + tv_nsec). Call this [herein] "nanotime".
So, for any given syscall, you need a measurement:
tstart = nanotime();
syscall();
tdif = nanotime() - tstart;
For send/sendto/sendmsg/write, do this with small data so you're sure you're not blocking [or use O_NONBLOCK, if applicable]. This gives you the syscall overhead.
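A minimal sketch of such a wrapper and of the measurement loop follows. Two things here are assumptions rather than the answerer's actual code: it uses CLOCK_MONOTONIC instead of the time of day (it is unaffected by wall-clock adjustments, which suits interval timing), and it uses getppid() as a cheap stand-in for the syscall under test. `average_syscall_ns` is a hypothetical name.

```cpp
#include <cstdint>
#include <time.h>
#include <unistd.h>

// Current time in nanoseconds from CLOCK_MONOTONIC (an assumption; the
// original wrapper returned time of day).
std::uint64_t nanotime() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<std::uint64_t>(ts.tv_sec) * 1000000000ULL
         + static_cast<std::uint64_t>(ts.tv_nsec);
}

// Average cost of one cheap syscall, in nanoseconds, over `iterations` calls.
// Averaging over many calls reduces the influence of the timer's own cost.
std::uint64_t average_syscall_ns(int iterations) {
    if (iterations <= 0) return 0;
    const std::uint64_t start = nanotime();
    for (int i = 0; i < iterations; ++i) {
        (void)getppid();   // stand-in: replace with the send/write being measured
    }
    return (nanotime() - start) / static_cast<std::uint64_t>(iterations);
}
```

Multiplying the per-call figure by the number of calls per second (roughly 500 * 25 in the question) gives an estimate of how much CPU time the syscall transitions actually cost.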
Why are you going directly to ethernet frames? TCP [or UDP] is usually fast enough and modern NIC cards can do the envelope wrap/strip in hardware. I'd like to know if there is a specific situation that requires ethernet frames, or was it that you weren't getting the performance you wanted and came up with this as a solution. Remember, you're doing 800KB/s (~1MB/s) and my project was doing 100x-200x more than that over TCP.
What about using two plain write calls to the socket? One for header, one for data [all 800KB]. write can be used on a socket and doesn't have the EMSGSIZE error or restriction.
Further, why do you need your header to be in a separate buffer? When you allocate your buffer, just do:
datamax = 800 * 1024; // or whatever
buflen = sizeof(struct header) + datamax;
buf = malloc(buflen);
while (1) {
    // read the payload directly behind the space reserved for the header
    datalen = read(fdfile,&buf[sizeof(struct header)],datamax);
    if (datalen <= 0)
        break; // EOF or read error
    // fill in header ...
    write(fdsock,buf,sizeof(struct header) + datalen);
}
This works even for the ethernet frame case.
Another thing you can do is use setsockopt to increase the size of the kernel buffer for your socket. Otherwise, you can send data, but it will be dropped in the kernel before the receiver can drain it. More on this below.
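As a sketch of that call: the helper below requests a buffer size and then reads back what the kernel actually granted. `set_socket_buffer` is a hypothetical name; SO_SNDBUF and SO_RCVBUF are the standard socket-level options. On Linux the granted value can differ from the request, since the kernel doubles it for bookkeeping and caps it at a system-wide limit.

```cpp
#include <sys/socket.h>

// Request a kernel buffer of `bytes` for `fd` and return the size the kernel
// reports afterwards, or -1 on error.
// `option` is SO_SNDBUF (sender side) or SO_RCVBUF (receiver side).
int set_socket_buffer(int fd, int option, int bytes) {
    if (setsockopt(fd, SOL_SOCKET, option, &bytes, sizeof(bytes)) != 0)
        return -1;

    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, option, &granted, &len) != 0)
        return -1;
    return granted;
}
```

Checking the returned value matters because a silently capped buffer would make the later "blind write limit" calculation wrong.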
To measure the performance of the wire, add some fields to your header:
u64 send_departure_time; // set by sender from nanotime
u64 recv_arrival_time; // set by receiver when the packet arrives
So, the sender sets the departure time and does the write [just send the header for this test]. Call this packet Xs. The receiver stamps it when it arrives. The receiver immediately sends back a message to the sender [call it Xr] with its own departure stamp and the contents of Xs. When the sender gets this, it stamps it with an arrival time.
With the above we now have:
T1 -- time packet Xs departed sender
T2 -- time packet Xs arrived at receiver
T3 -- time packet Xr departed receiver
T4 -- time packet Xr arrived at sender
Assuming you do this on a relatively quiet connection with little to no other traffic and you know the link speed (e.g. 1 Gb/s), with T1/T2/T3/T4 you can calculate the overhead.
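The calculation can be sketched as follows. The names (`PingTimes`, `round_trip_ns`, `wire_time_ns`, `overhead_ns`) are hypothetical, all timestamps are assumed to be in nanoseconds, and the per-direction figure assumes the two directions are symmetric. Each subtraction pairs two stamps taken on the same machine, so the sender and receiver clocks do not need to be synchronized.

```cpp
#include <cstdint>

// T1 and T4 come from the sender's clock, T2 and T3 from the receiver's.
struct PingTimes {
    std::uint64_t t1, t2, t3, t4;
};

// Time spent in transit in both directions: total elapsed time at the sender
// minus the time the receiver held the packet before replying.
std::uint64_t round_trip_ns(const PingTimes& p) {
    return (p.t4 - p.t1) - (p.t3 - p.t2);
}

// Ideal time to put `bytes` on a link running at `bits_per_sec`.
double wire_time_ns(std::uint64_t bytes, double bits_per_sec) {
    return bytes * 8.0 * 1e9 / bits_per_sec;
}

// Per-direction overhead: half the round trip minus the ideal wire time.
double overhead_ns(const PingTimes& p, std::uint64_t bytes, double bits_per_sec) {
    return round_trip_ns(p) / 2.0 - wire_time_ns(bytes, bits_per_sec);
}
```

For example, a 1250-byte packet on a 1 Gb/s link ideally takes 10000 ns per direction, so a measured round trip of 24000 ns would leave 2000 ns of overhead each way.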
You can repeat the measurement for TCP/UDP vs ETH. You may find that it doesn't buy you as much as you think. Once again, can you prove it with precise measurement?
I "invented" this algorithm while working at the aforementioned company, only to find out that it was already part of a video standard for sending raw video across a 100Gb Ethernet NIC card and the NIC does the timestamping in hardware.
One of the other things you may have to do is add some throttle control. This is similar to what bittorrent does or what the PCIe bus does.
When PCIe bus nodes first start up, they communicate how much free buffer space they have available for "blind write". That is, the sender is free to blast up to this much, without any ACK message. As the receiver drains its input buffer, it sends periodic ACK messages to the sender with the number of bytes it was able to drain. sender can add this value back to the blind write limit and keep going.
For your purposes, the blind write limit is the size of the receiver's kernel socket buffer.
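The bookkeeping on the sender side can be sketched like this. `BlindWriteWindow` is a hypothetical name, and the ACK transport itself is left out; the class only tracks how many bytes may still be sent without hearing from the receiver.

```cpp
#include <cstddef>

// Sender-side credit counter for "blind write" throttling.
class BlindWriteWindow {
public:
    // Start with credit equal to the receiver's advertised buffer space.
    explicit BlindWriteWindow(std::size_t receiver_buffer_bytes)
        : credit_(receiver_buffer_bytes) {}

    // Returns how many of `wanted` bytes may be sent right now and deducts
    // them from the remaining credit. Returns 0 when the sender must wait.
    std::size_t reserve(std::size_t wanted) {
        const std::size_t granted = wanted < credit_ ? wanted : credit_;
        credit_ -= granted;
        return granted;
    }

    // The receiver reported that it drained `bytes`; give that credit back.
    void on_ack(std::size_t bytes) { credit_ += bytes; }

    std::size_t credit() const { return credit_; }

private:
    std::size_t credit_;
};
```

The sender calls `reserve` before each write and `on_ack` whenever an ACK message arrives, so it never has more unacknowledged bytes in flight than the receiver can buffer.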
UPDATE
Based upon some of the additional information from your comments [the actual system configuration should go, in a more complete form, as an edit to your question at the bottom].
You do have a need for a raw socket and sending an ethernet frame. You can reduce the overhead by setting a larger MTU via ifconfig or an ioctl call with SIOCSIFMTU. I recommend the ioctl. You may not need to set MTU to 800KB. Your CPU's NIC card has a practical limit. You can probably increase MTU from 1500 to 15000 easily enough. This would reduce syscall overhead by 10x and that may be "good enough".
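A minimal sketch of the ioctl route, assuming Linux: `set_mtu` is a hypothetical helper name, while `struct ifreq`, `ifr_mtu`, and SIOCSIFMTU are the standard interface-request definitions. The call needs CAP_NET_ADMIN (typically root), and the NIC driver must accept the requested size.

```cpp
#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

// Set the MTU of interface `ifname`. Returns true on success, false if the
// interface does not exist, permission is denied, or the driver refuses.
bool set_mtu(const char* ifname, int mtu) {
    // Any socket can serve as the handle for interface ioctls.
    const int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return false;

    ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);  // stays NUL-terminated
    ifr.ifr_mtu = mtu;

    const bool ok = ioctl(fd, SIOCSIFMTU, &ifr) == 0;
    close(fd);
    return ok;
}
```

Checking the return value is the practical way to find the limit of a given NIC: try the size you want and fall back to a smaller one if the driver rejects it.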
You probably will have to use sendto/sendmsg. The two write calls were based on conversion to TCP/UDP. But, I suspect sendmsg with msg_iov will have more overhead than sendto. If you search, you'll find that most example code for what you want uses sendto. sendmsg seems like less overhead for you, but it may cause more overhead for the kernel. Here's an example that uses sendto: http://hacked10bits.blogspot.com/2011/12/sending-raw-ethernet-frames-in-6-easy.html
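For comparison, here is a sketch of the sendmsg variant with a separate header and payload buffer, as described in the question. `send_header_and_payload` is a hypothetical name. Note that one sendmsg call on a datagram or packet socket still produces a single frame, so the iovec lengths together must fit within the MTU; exceeding it is what yields "message too long" (EMSGSIZE).

```cpp
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

// Send ONE datagram/frame gathered from two buffers with a single syscall.
// Returns the number of bytes sent, or -1 on error.
ssize_t send_header_and_payload(int fd,
                                const void* header, std::size_t header_len,
                                const void* payload, std::size_t payload_len) {
    iovec iov[2];
    iov[0].iov_base = const_cast<void*>(header);
    iov[0].iov_len = header_len;
    iov[1].iov_base = const_cast<void*>(payload);
    iov[1].iov_len = payload_len;

    msghdr msg = {};
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    // For an AF_PACKET socket, msg_name/msg_namelen would point at the
    // sockaddr_ll from the question; they are left unset here, which is
    // only valid for a connected socket.
    return sendmsg(fd, &msg, 0);
}
```

The gather avoids copying the payload next to the header in user space, but it does not reduce the number of syscalls per frame, which is why timing it against plain sendto is worthwhile.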
In addition to improving syscall overhead, a larger MTU might improve the efficiency of the "wire", even though this doesn't seem like a problem in your use case. I have experience with CPU + FPGA systems and communicating between them, but I am still puzzled by one of your comments about "not using a wire". FPGA connected to ethernet pins of CPU I get--sort of. More precisely, do you mean "FPGA pins connected to ethernet pins of NIC card/chip of CPU"?
Are the CPU/NIC on the same PC board and the FPGA pins are connected via PC board traces? Otherwise, I don't understand "not using a wire".
However, once again, I must say that you must be able to measure your performance before you blindly try to improve it.
Have you run the test case I suggested for determining the syscall overhead? If it is small enough, trying to optimize for it may not be worth it and doing so may actually hurt performance more severely in other areas that you didn't realize when you started.
As an example, I once worked on a system that had a severe performance problem, such that, the system didn't work. I suspected the serial port driver was slow, so I recoded from a high level language (e.g. like C) into assembler.
I increased the driver performance by 2x, but it contributed less than a 5% performance improvement to the system. It turned out the real problem was that other code was accessing non-existent memory which just caused a bus timeout, slowing the system down measurably [it did not generate an interrupt that would have made it easy to find as on modern systems].
That's when I learned the importance of measurement. I had done my optimization based on an educated guess, rather than hard data. After that: lesson learned!
Nowadays, I never try large optimization until I can measure first. In some cases, I add an optimization that I'm "sure" will make things better (e.g. inlining a function). When I measure it [and because I can measure it], I find out that the new code is actually slower and I have to revert the change. But, that's the point: I can prove/disprove this with hard performance data.
What CPU are you using: x86, arm, mips, etc. At what clock frequency? How much DRAM? How many cores?
What FPGA are you using (e.g. Xilinx, Altera)? What specific type/part number? What is the maximum clock rate? Is the FPGA devoted entirely to logic or do you also have a CPU inside it such as microblaze, nios, arm? Does the FPGA have access to DRAM of its own [and how much DRAM]?
If you increase the MTU, can the FPGA handle it, from either a buffer/space standpoint or a clock speed standpoint??? If you increase MTU, you may need to add an ack/sync protocol as I suggested in the original post.
Currently, the CPU is doing a blind write of the data, hoping the FPGA can handle it. This means you have an open race condition between CPU and FPGA.
This may be mitigated, purely as a side effect of sending small packets. If you increase MTU too much, you might overwhelm the FPGA. In other words, it is the very overhead you're trying to optimize away, that allows the FPGA to keep up with the data rate.
This is what I meant by unintended consequences of blind optimization. It can have unintended and worse side effects.
What is the nature of the data being sent to the FPGA? You're sending 800KB, but how often?
I am assuming that it is not the FPGA firmware itself for a few reasons. You said the firmware was already almost full [and it is receiving the ethernet data]. Also, firmware is usually loaded via the I2C bus, a ROM, or an FPGA programmer. So, am I correct?
You're sending the data to the FPGA from a file. This implies that it is only being sent once, at the startup of your CPU's application. Is that correct? If so, optimization is not needed because it's an init/startup cost that has little impact on the running system.
So, I have to assume that the file gets loaded many times, possibly a different file each time. Is that correct? If so, you may need to consider the impact of the read syscall. Not just from syscall overhead, but optimal read length. For example, IIRC, the optimal transfer size for a disk-to-disk or file-to-file copy/transfer is 64KB, depending upon the filesystem or underlying disk characteristics.
So, if you're looking to reduce overhead, reading data from a file may have considerably more overhead than having the application generate the data [if that's possible].
The kernel syscall interface is designed to be very low overhead. Kernel programmers [I happen to be one] spend a great deal of time ensuring the overhead is low.
You say your system is using a lot of CPU time for other things. Can you measure the other things? How is your application structured? How many processes? How many threads? How do they communicate? What is the latency/throughput? You may be able to find [can quite probably find] the larger bottlenecks and recode those and you'll get an overall reduction in CPU usage that far exceeds the maximum benefit you'll get from the MTU tweak.
Trying to optimize the syscall overhead may be like my serial port optimization. A lot of effort, and yet the overall results are/were disappointing.
When considering performance, it is important to consider it from an overall system standpoint. In your case, this means CPU, FPGA, and anything else in it.
You say that the CPU is doing a lot of things. Could/should some of those algorithms go into the FPGA? Is the reason they're not because the FPGA is almost out of space, otherwise you would? Is the FPGA firmware 100% done? Or, is there more RTL to be written? If you're at 90% space utilization in the FPGA, and you'll need more RTL, you may wish to consider going to an FPGA part that has more space for logic, possibly with a higher clock rate.
In my video company, we used FPGAs. We used the largest/fastest state-of-the-art part the FPGA vendor had. We also used virtually 100% of the space for logic and required the part's maximum clock rate. We were told by the vendor that we were the largest consumer of FPGA resources of any of their client companies worldwide. Because of this, we were straining the vendor's development tools. Place-and-route would frequently fail and have to be rerun to get correct placement and meet timing.
So, when an FPGA is almost full with logic, the place-and-route can be difficult to achieve. It might be a reason to consider a larger part [if possible].
I am transferring data from an FPGA over PCIe through DMA, which is very fast. I have 500 data blocks, each comprising 80000 bytes. Receiving all 500 blocks and saving them in a .bin file takes 0.5 seconds. If I do the same with a .txt file (which is my final goal), it takes 15 seconds.
So what I want now is to use threads in C++, where one thread (I call it the master thread) takes the DMA data (one block at a time) and simultaneously starts 500 other threads (one for each file); each file-saving thread waits for some trigger event, etc. (I don't have much of an idea here, since a CPU inherently runs sequentially, which is a problem for an FPGA designer who works in the parallel domain.)
The case I explain below could be the solution, but I need to know how to implement it in C++, if it is correct.
case
The 1st data block (through DMA) arrives in the master thread (where global memory is allocated using malloc()) -> the thread for file 1 is waiting for a TRIGGER and, as soon as it gets this trigger, copies the memory contents to its own allocated memory and then starts saving to the file; meanwhile it also triggers the 'master thread' to increment its counter and receive the next data block, and the process continues for all 500 blocks.
I am mostly an FPGA guy and this is my first time using C++ at this high level. I am determined but stuck. I have spent two days reading loads of material about threads (in C++), starting from createthreads() and going on and on. I thought WaitForSingleObject might be the solution, but I cannot understand how to implement it...
Any idea would be appreciated. I do not seek any code, I just seek the way to implement it. For example, those familiar with VHDL might know that in VHDL we can use
Code: wait until abc'event and abc = '1';
but what to do here?
Thanks
sraza
The performance measurements you give show that the problem has nothing to do with DMA or threads. What's slow is converting from binary to string data.
Not surprising, since C++ iostreams are miserably slow and even the C stdio functions are significantly suboptimal.
Use an optimized function for number->string conversion, and your 15 second time for writing a text file will get a lot closer to that 0.5 second time you have for binary. I'd expect 1.0 seconds or less, from this single change.
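As an illustration of what such a conversion routine might look like, the sketch below formats a whole block into one buffer that can then be written to the file in a single call. The text layout is an assumption (unsigned 32-bit samples, one decimal value per line), and `write_u32` and `format_samples` are hypothetical names; adapt both to the real data format.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Write `v` in decimal starting at `out` and return one past the last
// character written. `out` must have room for at least 10 characters.
char* write_u32(std::uint32_t v, char* out) {
    char tmp[10];
    int n = 0;
    do {                                   // digits come out lowest first
        tmp[n++] = static_cast<char>('0' + v % 10);
        v /= 10;
    } while (v != 0);
    while (n > 0) *out++ = tmp[--n];       // copy them back in reading order
    return out;
}

// Format all samples into one string, one value per line, so the file can be
// written with one large write instead of one stream insertion per value.
std::string format_samples(const std::vector<std::uint32_t>& samples) {
    std::string text(samples.size() * 11, '\0');   // 10 digits + '\n' each
    char* p = &text[0];
    for (std::uint32_t v : samples) {
        p = write_u32(v, p);
        *p++ = '\n';
    }
    text.resize(static_cast<std::size_t>(p - &text[0]));
    return text;
}
```

With a C++17 compiler, `std::to_chars` from `<charconv>` could replace the hand-written `write_u32`.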
I am writing C++ socket code and I need some help!
In my program I don't know what the size of the message will be. It may send part of a file or the whole file, and the file may be huge. So should I specify a maximum size for the packet, and should I split the message into more than one packet if it exceeds the maximum?
It's never constructive to think about "packets" and "messages" when using TCP because:
The network engine has its own ways of deciding the best segment size
Segment size makes no difference to application code: the receiving TCP is free to coalesce segments before passing data to the receiving process
You should view TCP the way it was designed: as a reliable, in-order byte-stream service. So just write large-enough blocks and the engine and its myriad of rules will take care of it.
The problem is a little vague, but the approach seems universal. The transmitter should send an indication of how many bytes the receiver should expect. The receiver should expect to see this indication, and then prepare to receive that many bytes.
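One common way to send that indication is a fixed-size length prefix. The sketch below assumes a 4-byte big-endian prefix, which limits one message to 4 GiB; a huge file would need a wider prefix or several messages. `frame_message` and `extract_message` are hypothetical names, and the socket I/O itself is left out so that only the framing logic is shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Prefix `payload` with its length as a 4-byte big-endian integer.
std::string frame_message(const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    out.reserve(4 + payload.size());
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += payload;
    return out;
}

// `stream` holds the bytes received so far. If it starts with a complete
// message, move that message into `payload`, remove it from `stream`, and
// return true. Return false if more bytes must be received first.
bool extract_message(std::string& stream, std::string& payload) {
    if (stream.size() < 4) return false;            // prefix incomplete
    std::uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n = (n << 8) | static_cast<unsigned char>(stream[i]);
    if (stream.size() - 4 < n) return false;        // body incomplete
    payload.assign(stream, 4, n);
    stream.erase(0, 4 + static_cast<std::size_t>(n));
    return true;
}
```

The receiver appends whatever each recv call returns to `stream` and calls `extract_message` in a loop, which works regardless of how TCP split or coalesced the bytes on the way.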
As far as packet size, generally an application does not worry about how the bytes are delivered on the network per se, but the application may care about not calling send and recv system calls too many times. This is particularly important on a concurrent server, when efficiency is key to scalability. So, you want a buffer that is big enough to avoid making too many system calls, but not so big as to cause you to block for a long time waiting for the data to drain into the kernel buffer. Matching the send/recv socket buffer size is usually sufficient for that, but it depends on other factors, like the bandwidth and latency of the network, and how quickly the receiver is draining the data, and the timeslice you want to allow per connection being handled during concurrency.
I have been using select to handle connections. Recently there was a change in our socket library and select was replaced by epoll on the Linux platform.
My application architecture is such that I make only one, or at most two, socket connections and epoll/select on them in a single thread.
Now, with the recent switch to epoll, I noticed that the performance of the application has diminished. I was surprised, as I was expecting performance to go up or remain the same. I tried looking at various other parts, and this is the only piece of code that has changed.
Does epoll have a performance penalty in terms of speed if used for a very small number of sockets (like 1 or 2)?
Another thing to note is that I run around 125 such processes on the same box (8 CPU cores).
Could it be that too many processes are doing epoll_wait on the same machine? The setup was similar when I was using select.
I noticed on the box that the load average is much higher but CPU usage is about the same, which makes me think that more time is spent in I/O, probably coming from the epoll-related changes.
Any ideas/pointers on what I should look at further to identify the problem?
Although the absolute latency increase is quite small, about 1 millisecond on average, this is a realtime system and this kind of latency is generally unacceptable.
Thanks
Hi,
Updating this question with the latest findings: apart from the switch from select to epoll, I found another related change. Earlier the timeout with select was 10 ms, but with epoll the timeout is much smaller than before (like 1 microsecond). Can setting too low a timeout in select or epoll decrease performance in any way?
thanks
From the sounds of it, throughput may be unaffected with epoll() vs select(), but you're finding extra latency in individual requests that seems to be related to the use of epoll().
I think that in the case of watching only one or two sockets, epoll() should perform much like select(). epoll() is supposed to scale well as you watch more descriptors, because the cost of a wait depends on the number of ready descriptors rather than the number watched, whereas select() scans every watched descriptor on each call (& has a hard limit, FD_SETSIZE, on the descriptors it can watch). So it's not that epoll() has a penalty for a small # of descriptors, but it loses its performance advantage over select() in this case.
Can you change the code so you can easily go back & forth between the two event notification mechanisms? Get more data about the performance difference. If you conclusively find that select() has less latency & same throughput in your situation, then I'd just switch back to the "old & deprecated" API without hesitation :) To me it's fairly conclusive if you measure a performance difference from this specific code change. Perhaps previous testing of epoll() versus select() has focused on throughput versus latency of individual requests?
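A sketch of what such a switch could look like is below, assuming Linux. `WaitApi` and `wait_readable` are hypothetical names, and the epoll instance is created per call only to keep the example self-contained; a real event loop would create it once and reuse it. One detail worth checking against the reported timeout change: epoll_wait takes its timeout in whole milliseconds, so a sub-millisecond timeout can only be passed as 0, which returns immediately and turns the loop into a busy poll.

```cpp
#include <sys/epoll.h>
#include <sys/select.h>
#include <unistd.h>

enum class WaitApi { Select, Epoll };

// Wait until `fd` is readable or `timeout_ms` elapses, using the chosen API.
// Returns 1 if readable, 0 on timeout, -1 on error.
int wait_readable(WaitApi api, int fd, int timeout_ms) {
    if (api == WaitApi::Select) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fd, &readable);
        timeval tv;
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        const int n = select(fd + 1, &readable, nullptr, nullptr, &tv);
        return n > 0 ? 1 : n;
    }

    const int ep = epoll_create1(0);
    if (ep < 0) return -1;
    epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    int result = -1;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
        epoll_event out;
        const int n = epoll_wait(ep, &out, 1, timeout_ms);  // timeout in ms
        result = n > 0 ? 1 : n;
    }
    close(ep);
    return result;
}
```

Running the same workload with each `WaitApi` value and the same timeout separates the effect of the API change from the effect of the timeout change.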
I've developed a mini HTTP server in C++, using boost::asio, and now I'm load testing it with multiple clients and I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance, and getting about 50% usage of one CPU, 20% of another, and the remaining two are idle (according to htop).
Details:
The server fires up one thread per core
Requests are received, parsed, processed, and responses are written out
The requests are for data, which is read out of memory (read-only for this test)
I'm 'loading' the server using two machines, each running a java application, running 25 threads, sending requests
I'm seeing about 230 requests/sec throughput (this is application requests, which are composed of many HTTP requests)
So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.
Ideas I've had:
The requests are very small, and often fulfilled in a few ms, I could modify the client to send/compose bigger requests (perhaps using batching)
I could modify the HTTP server to use the Select design pattern, is this appropriate here?
I could do some profiling to try to understand what the bottlenecks are
boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp which means that only one thread can call into the kernel's epoll syscall at a time. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).
Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.
As you are using EC2, all bets are off.
Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.
I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.
From your comments on network utilization,
You do not seem to have much network movement.
3 + 2.5 MiB/sec is around the 50Mbps ball-park (compared to your 1Gbps port).
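The conversion behind that estimate, as a small sketch (`mib_per_sec_to_mbps` is a hypothetical helper name):

```cpp
// Convert MiB/s (2^20 bytes per second) to Mbit/s (10^6 bits per second).
double mib_per_sec_to_mbps(double mib_per_sec) {
    return mib_per_sec * 1048576.0 * 8.0 / 1e6;
}
```

For 3 + 2.5 MiB/sec this gives about 46 Mbps, i.e. under 5% of a 1 Gbps port.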
I'd say you are having one of the following two problems,
Insufficient work-load (low request-rate from your clients)
Blocking in the server (interfered response generation)
Looking at cmeerw's notes and your CPU utilization figures
(idling at 50% + 20% + 0% + 0%)
it seems most likely a limitation in your server implementation.
I second cmeerw's answer (+1).
230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation - get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of un-needed locking may get things up to speed.
This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?
ASIO is fine for small to medium tasks but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows but if you are experienced you will always be better than ASIO. Either way there is a lot of overhead with all of those methods, just more with ASIO.
For what it is worth, using raw socket calls, my custom HTTP server can serve 800K dynamic requests per second on a 4-core i7. It is serving from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second; its performance is quite variable and mostly bound in my app. The post by cmeerw mostly explains why.
One way to improve performance is by implementing a UDP proxy. Intercepting HTTP requests and then routing them over UDP to your backend UDP-HTTP server you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of a HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact. You just need a couple more servers to implement it. This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. And FWIW You should always have a frontend-backend setup for any performant site because the frontends can cache data without slowing down the more important dynamic requests backend.
The future seems to be writing your own driver that implements its own network stack so you can get as close to the requests as possible and implement your own protocol there. Which probably isn't what most programmers want to hear as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this, however you will need more servers - though if you are doing this many requests per second you will usually need multiple network cards and multiple frontends to handle the bandwidth so having a couple lightweight UDP proxies in there isn't that big a deal.
Hope some of this can be useful to you.
How many instances of io_service do you have? Boost asio has an example that creates an io_service per CPU and use them in the manner of RoundRobin.
You can still create four threads and assign one per CPU, but each thread can poll on its own io_service.