I'm implementing a protocol over serial ports on Linux. The protocol is based on a request/answer scheme, so the throughput is limited by the time it takes to send a packet to a device and get an answer. The devices are mostly ARM based and run Linux >= 3.0. I'm having trouble reducing the round trip time below 10 ms (115200 baud, 8 data bits, no parity, 7 bytes per message).
Which IO interfaces will give me the lowest latency: select, poll, epoll, or polling by hand with ioctl? Does blocking or non-blocking IO impact latency?
I tried setting the low_latency flag with setserial. But it seemed like it had no effect.
Are there any other things I can try to reduce latency? Since I control all devices, it would even be possible to patch the kernel, but it's preferred not to.
---- Edit ----
The serial controller used is a 16550A.
Request/answer schemes tend to be inefficient, and it shows up quickly on a serial port. If you are interested in throughput, look at windowed protocols, like the Kermit file sending protocol.
Now if you want to stick with your protocol and reduce latency, select, poll, read will all give you roughly the same latency, because as Andy Ross indicated, the real latency is in the hardware FIFO handling.
If you are lucky, you can tweak the driver behaviour without patching, but you still need to look at the driver code. However, having the ARM handle a 10 kHz interrupt rate will certainly not be good for the overall system performance...
Another option is to pad your packets so that you hit the FIFO threshold every time. It will also confirm whether or not it is a FIFO threshold problem.
10 ms at 115200 baud is enough to transmit about 100 bytes (assuming 8N1), so what you are seeing is probably because the low_latency flag is not set. Try
setserial /dev/<tty_name> low_latency
It will set the low_latency flag, which is used by the kernel when moving data up in the tty layer:
void tty_flip_buffer_push(struct tty_struct *tty)
{
        unsigned long flags;

        spin_lock_irqsave(&tty->buf.lock, flags);
        if (tty->buf.tail != NULL)
                tty->buf.tail->commit = tty->buf.tail->used;
        spin_unlock_irqrestore(&tty->buf.lock, flags);

        if (tty->low_latency)
                flush_to_ldisc(&tty->buf.work);
        else
                schedule_work(&tty->buf.work);
}
The schedule_work call might be responsible for the 10 msec latency you observe.
Having talked to some more engineers about the topic, I came to the conclusion that this problem is not solvable in user space. Since we need to cross the bridge into kernel land, we plan to implement a kernel module which talks our protocol and gives us latencies < 1 ms.
--- edit ---
Turns out I was completely wrong. All that was necessary was to increase the kernel tick rate. The default 100 Hz tick added the 10 ms delay. 1000 Hz and a negative nice value for the serial process give me the timing behavior I wanted to reach.
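For reference, a minimal sketch of how the negative nice value could also be set from inside the process itself (the value -10 is just an illustrative choice; this needs root or CAP_SYS_NICE):

#include <sys/resource.h>   // setpriority
#include <cstdio>           // perror

int main() {
    // Give the calling process a negative nice value so the scheduler favors it.
    if (setpriority(PRIO_PROCESS, 0, -10) != 0)
        perror("setpriority");
    // ... open the serial port and run the request/answer loop here ...
    return 0;
}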
Serial ports on Linux are "wrapped" into Unix-style terminal constructs, which hits you with 1 tick of lag, i.e. 10 ms. Try whether stty -F /dev/ttySx raw low_latency helps; no guarantees, though.
On a PC, you can go hardcore and talk to standard serial ports directly: issue setserial /dev/ttySx uart none to unbind the Linux driver from the serial port hardware and control the port via inb/outb to the port registers. I've tried that; it works great.
The downside is you don't get interrupts when data arrives, and you have to poll the registers. Often.
You should be able to do the same on the ARM device side; it may be much harder on exotic serial port hardware.
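For illustration, a rough sketch of that register-polling approach (assuming the legacy COM1 base address 0x3F8; x86-only, needs root for ioperm, and the UART must already be configured):

#include <sys/io.h>   // ioperm, inb (x86 Linux)
#include <cstdio>

int main() {
    const unsigned short base = 0x3F8;                 // legacy COM1 base, an assumption
    if (ioperm(base, 8, 1) != 0) { perror("ioperm"); return 1; }
    for (;;) {
        if (inb(base + 5) & 0x01) {                    // LSR bit 0: "data ready"
            unsigned char byte = inb(base);            // read the RBR directly, no driver in the path
            printf("got 0x%02x\n", byte);
        }
        // Busy-polling: burns a CPU core, but avoids any interrupt/scheduling latency.
    }
}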
Here's what setserial does to set low latency on a file descriptor of a port:
#include <sys/ioctl.h>
#include <linux/serial.h>

struct serial_struct serial;

ioctl(fd, TIOCGSERIAL, &serial);      /* read the current settings */
serial.flags |= ASYNC_LOW_LATENCY;    /* the same flag setserial's low_latency sets */
ioctl(fd, TIOCSSERIAL, &serial);      /* write them back */
In short: Use a USB adapter and ASYNC_LOW_LATENCY.
I've used an FT232RL-based USB adapter on Modbus at 115.2 kbit/s.
I get about 5 transactions (to 4 devices) in about 20 ms total with ASYNC_LOW_LATENCY. This includes two transactions to a slow-poke device (4 ms response time).
Without ASYNC_LOW_LATENCY the total time is about 60 ms.
With FTDI USB adapters, ASYNC_LOW_LATENCY sets the inter-character timer on the chip itself to 1 ms (instead of the default 16 ms).
I'm currently using a home-brewed USB adapter and I can set the latency for the adapter itself to whatever value I want. Setting it to 200 µs shaves another ms off that 20 ms.
None of those system calls have an effect on latency. If you want to read and write one byte as fast as possible from userspace, you really aren't going to do better than a simple read()/write() pair. Try replacing the serial stream with a socket from another userspace process and see if the latencies improve. If they don't, then your problems are CPU speed and hardware limitations.
Are you sure your hardware can do this at all? It's not uncommon to find UARTs with a buffer design that introduces many bytes worth of latency.
At those line speeds you should not be seeing latencies that large, regardless of how you check for readiness.
You need to make sure the serial port is in raw mode (so you do "noncanonical reads") and that VMIN and VTIME are set correctly. You want to make sure that VTIME is zero so that an inter-character timer never kicks in. I would probably start with setting VMIN to 1 and tune from there.
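Something like this minimal sketch (error checking omitted; the baud rate and device name are just examples):

#include <termios.h>
#include <fcntl.h>
#include <unistd.h>

int open_serial_raw(const char *dev)        // e.g. "/dev/ttyS0"
{
    int fd = open(dev, O_RDWR | O_NOCTTY);
    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);                        // noncanonical mode, no echo, no line editing
    cfsetispeed(&tio, B115200);
    cfsetospeed(&tio, B115200);
    tio.c_cc[VMIN]  = 1;                    // read() returns as soon as 1 byte is available
    tio.c_cc[VTIME] = 0;                    // no inter-character timer
    tcsetattr(fd, TCSANOW, &tio);
    return fd;
}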
The syscall overhead is nothing compared to the time on the wire, so select() vs. poll(), etc. is unlikely to make a difference.
Related
I have a multi-threaded C++ network application which listens for UDP packets as input. This data then hops through the application, is processed via various queues and threads, and is finally pushed out on a TCP socket.
What I am seeing is that if inputs come in slowly, let's say 5/sec, the total response time (in-to-out) is slow (let's say 100 ms), and if the inputs come in fast, e.g. 20/sec, the response time is also fast (~50 ms). This observation is really weird, and it also messes up the response times in the fast case because the first response is always slow. Just to make sure: the application is doing exactly the same amount of work in both the slow and fast cases.
Things that have been tried to investigate this:
It's a dual-Xeon box running a Linux 2.6 kernel; I disabled Turbo Boost and made sure the processors are in C0 states.
I eliminated network causes; the root cause is within the box (in software or hardware).
I have a fake input going through the system from input to output on a timer to keep the application "warm" - no effect. (The application's worker threads are busy-waiting and pinned to cores.)
perf data indicate that EVERYTHING gets slower, which basically means that the processors are slowing down when not under continuous load, but nothing else (i7z/turbostat) suggests that, or I am reading them incorrectly.
Does someone have insight into what might be happening?
In my experience, this would be the nasty doings of Nagle's devilish device. It sounds extremely plausible to me that more data in the buffer fills up a TCP packet, which then gets sent. Without much data, the TCP packet is waiting for the ACK from the other side, which is, as we all know, delayed.
Solution: learn to make sure Nagle's program killer is disabled as the first thing after you create any send-capable TCP socket.
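For example, something along these lines right after the socket is created (a sketch, not the asker's actual code):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>    // TCP_NODELAY

int make_tcp_socket_without_nagle()
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    // Disable Nagle so small writes are sent immediately instead of waiting for an ACK.
    setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    return s;
}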
I'm using Interix on Windows XP to more readily port my C++ Linux application to Windows XP. My application sends and receives packets over a socket to and from a nearby machine running Linux. When sending, I'm only getting throughput of around 180 KB/sec, and when receiving I'm getting around 525 KB/sec. The same code running on Linux gets closer to 2,500 KB/sec.
When I attempt to send at a higher rate than 180 KB/sec, packets get dropped to bring the rate back down to about that level.
I feel like I should be able to get better throughput on sending than 180 KB/sec but am not sure how to go about determining what is the cause of the dropped packets.
How might I go about investigating this slowness in the hopes of improving throughput?
--Some More History--
To reach the above numbers, I have already improved the throughput a bit by doing the following (these made no difference on Linux, but helped throughput on Interix):
I changed SO_RCVBUF and SO_SNDBUF from 256 KB to 25 MB; this improved throughput by about 20%.
I ran an optimized build instead of a debug build; this improved throughput by about 15%.
I turned off all logging messages going to stdout and a log file; this doubled throughput.
So it would seem that CPU is a limiting factor on Interix, but not on Linux. Further, I am running on a virtual machine hosted in a hypervisor. The Windows XP guest is given 2 cores and 2 GB of memory.
I notice that the profiler shows the CPU on the two cores never exceeding 50% utilization on average. This even occurs when I have two instances of my application running; it still hovers around 50% on both cores. Perhaps my application, which is multi-threaded with a dedicated thread to read from the UDP socket and a dedicated thread to write to the UDP socket (only one is active at any given time), is not being scheduled well on Interix and thus my packets are being dropped?
In answering your question, I am making the following assumptions based on your description of the problem:
(1) You are using the exact same program in Linux when achieving the throughput of 2,500 KB/sec, other than the socket library, which is, of course, going to be different between Windows and Linux. If this assumption is correct, we probably shouldn't have to worry about other pieces of your code affecting the throughput.
(2) When using Linux to achieve 2,500 KB/sec throughput, the node is in the exact same location in the network. If this assumption is correct, we don't have to worry about network issues affecting your throughput.
Given these two assumptions, I would say that you likely have a problem in your socket settings on the Windows side. I would suggest checking the size of the send-buffer first. The size of the send-buffer is 8192 bytes by default. If you increase this, you should see an increase in throughput. Use setsockopt() to change this. Here is the usage manual: http://msdn.microsoft.com/en-us/library/windows/desktop/ms740476(v=vs.85).aspx
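For instance, a minimal POSIX-style sketch (the same call should be available under Interix; the 1 MB value is just an illustration to tune from):

#include <sys/socket.h>

// Ask the stack for a larger send buffer; the OS may clamp or adjust the value.
void grow_send_buffer(int sock)
{
    int size = 1 * 1024 * 1024;   // example value, tune for your traffic
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
}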
EDIT: It looks like I misread your post going through it too quickly the first time. I just noticed you're using Interix, which means you're probably not using a different socket library. Nevertheless, I suggest checking the send buffer size first.
I am working on a C++ application that can be qualified as a router. This application receives UDP packets on a given port (nearly 37 bytes each second) and must multicast them to other destinations within a 10 ms period. However, sometimes after packet reception the retransmission exceeds the 10 ms limit and can reach 100 ms. These out-of-limit delays are random.
The application receives on the same Ethernet interface, but on a different port, another kind of packets (up to 200 packets of nearly 100 bytes each second). I am not sure that this latter flow is disrupting the other one, because these delay peaks are too scarce (2 packets among 10,000 packets).
What can be the causes of these sporadic delays? And how can I solve them?
P.S. My application is running on Linux 2.6.18-238.el5PAE. Delays are measured between the reception of the packet and the successful completion of the transmission!
(An image illustrating the delay measurements was attached here to make this clearer.)
10ms is a tough deadline for a non-realtime OS.
Assign your process to one of the realtime scheduling policies, e.g. SCHED_RR or SCHED_FIFO (some reading). It can be done in the code via sched_setscheduler() or from command line via chrt. Adjust the priority as well, while you're at it.
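A minimal sketch of doing it from the code (the priority 50 is an arbitrary example; this needs root or CAP_SYS_NICE):

#include <sched.h>
#include <cstdio>

int main() {
    struct sched_param sp = {};
    sp.sched_priority = 50;                            // 1..99 for SCHED_FIFO; 50 is arbitrary here
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)   // 0 = the calling process
        perror("sched_setscheduler");
    // ... receive and multicast the UDP packets here ...
    return 0;
}

From the command line, chrt -f 50 ./your_router (the binary name being a placeholder) does the equivalent.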
Make sure your code doesn't consume CPU more than it has to, or it will affect entire system performance.
You may also need RT_PREEMPT patch.
Overall, the task of generating Ethernet traffic on a precise schedule on Linux is not an easy one. E.g., see BRUTE, a high-performance traffic generator; maybe you'll find something useful in its code or in the research paper.
I have to write a C++ application that reads from the serial port byte by byte. This is an important requirement, as it is receiving messages over radio transmission using Modbus, and the end of a transmission is defined by a silent period of 3.5 character lengths, so I MUST be able to get the message byte by byte. The current system uses DOS to do this, which relies on hardware interrupts. We wish to move to Linux as the OS for this software, but we lack expertise in this area.
I have tried a number of things to do this: polling with a non-blocking read, using select with very short timeout values, setting the size of the read buffer of the serial port to one byte, and even using a signal handler on SIGIO, but none of these provides quite what I require. My boss informs me that the DOS application we currently run uses hardware interrupts to get notified when there is something available to read from the serial port, and that the hardware is accessible directly.
Is there any way that I can get this functionality from a user-space Linux application? Could I do this if I wrote a custom driver (despite never having done this before and having close to zero knowledge of how the kernel works)? I have heard that Linux is a very popular OS for hardware control and embedded devices, so I am guessing that this kind of thing must be possible somehow, but I have spent literally weeks on this so far and still have no concrete idea of how best to proceed.
I'm not quite sure how reading byte-by-byte helps you with fractional-character reception, unless it's that there is information encoded in the duration of intervals between characters, so you need to know the timing of when they are received.
At any rate, I do suspect you are going to need to make custom modifications to the serial port kernel driver; that's really not all that bad as projects go, and you will learn a lot. You will probably also need to change the configuration of the UART "chip" (really just a tiny corner of some larger device) to make it interrupt after only a single byte (i.e. emulate a 16450) instead of when its typically-16-byte buffer (emulating a 16550) is partway full. The code of the DOS program might actually be a help there. An alternative, if the baud rate is not too fast, would be to poll the hardware in the kernel or a realtime extension (or, if it is really, really slow, as it might be on an HF radio link, maybe even in userspace).
If I'm right about needing to know the timing of the character reception, another option would be to offload the reception to a microcontroller with dual UARTs (or even better, one UART and one USB interface). You could then have the micro watch the serial stream and output to the PC (either on the other serial port at a much faster baud rate, or on the USB) little packages of data that include one received character and a timestamp - or even have it decode the protocol for you. The nice thing about this is that it would get you operating system independence, and it would work on legacy-free machines (byte-by-byte access is probably going to fail with an off-the-shelf USB-serial dongle). You can probably even make it out of some cheap eval board, rather than having to manufacture any custom hardware.
I'm writing a program that implements the RFC 2544 network test. As the part of the test, I must send UDP packets at a specified rate.
For example, I should send 64 byte packets at 1 Gb/s. That means that I should send a UDP packet every 0.5 microseconds. Pseudocode for "sending UDP packets at a specified rate" could look like this:
while (true) {
    some_sleep(0.5);
    Send_UDP();
}
But I'm afraid there is no some_sleep() function, on Windows or on Linux, that can give me 0.5 microsecond resolution.
Is it possible to do this task in C++, and if yes, what is the right way to do it?
Two approaches:
Implement your own sleep by busy-looping using a high-resolution timer such as Windows' QueryPerformanceCounter.
Allow slight variations in rate, insert Sleep(1) when you're enough ahead of the calculated rate. Use timeBeginPeriod to get 1ms resolution.
For both approaches, you can't rely on the sleeps being exact. You will need to keep running totals and adjust the sleep period as you get ahead of or behind the calculated rate.
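A sketch of that bookkeeping using std::chrono as a portable stand-in for the Windows timer calls (send_udp() is a placeholder for the real send): the loop tracks an ideal deadline so errors don't accumulate, sleeps coarsely while well ahead, and busy-waits the final stretch.

#include <chrono>
#include <thread>

void send_at_rate(long packets_per_second, long total_packets)
{
    using clock = std::chrono::steady_clock;
    const auto interval = std::chrono::nanoseconds(1000000000L / packets_per_second);
    std::chrono::time_point<clock, std::chrono::nanoseconds> next = clock::now();
    for (long i = 0; i < total_packets; ++i) {
        // send_udp();                                        // placeholder for the real send call
        next += interval;                                     // ideal deadline for the next packet
        if (next - clock::now() > std::chrono::milliseconds(2))
            std::this_thread::sleep_for(std::chrono::milliseconds(1));   // coarse sleep while well ahead
        while (clock::now() < next) { }                       // busy-wait the last stretch
    }
}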
This might be helpful, but I doubt it's directly portable to anything but Windows. Implement a Continuously Updating, High-Resolution Time Provider for Windows by Johan Nilsson.
However, do keep in mind that for packets that small, the IP and UDP overhead is going to account for a large fraction of the actual on-the-wire data. This may be what you intended, or not. A very quick scan of RFC 2544 suggests that much larger packets are allowed; you may be better off going that route instead. Consistently delaying for as little as 0.5 microseconds between each Send_UDP() call is going to be difficult at best.
To transmit 64-byte Ethernet frames at line rate, you actually want to send every 672 ns: with the 8-byte preamble and 12-byte inter-frame gap, each frame occupies 84 bytes, i.e. 672 bits, which at 1 Gbit/s takes 672 ns. I think the only way to do that is to get really friendly with the hardware. You'll be running up against bandwidth limitations with the PCI bus, etc. The system calls to send one packet will take significantly longer than 672 ns. A sleep function is the least of your worries.
I guess you should be able to do it with Boost.Asio's timer functions. I haven't tried it myself, but I guess that deadline_timer would take a boost::posix_time::nanosec as well as the boost::posix_time::second.
Check out an example here
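For what it's worth, an untested sketch along those lines, using the older io_service-based Asio API (Send_UDP() is the placeholder from the question; whether the OS actually honours a 0.5 µs deadline is a separate matter):

#include <boost/asio.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>

int main() {
    boost::asio::io_service io;
    boost::asio::deadline_timer timer(io);
    for (long i = 0; i < 1000000; ++i) {
        timer.expires_from_now(boost::posix_time::microseconds(500));
        // Send_UDP();                     // placeholder from the question's pseudocode
        timer.wait();                      // blocks until the deadline; real resolution is OS-dependent
    }
    return 0;
}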
Here's a native Windows implementation of nanosleep. If GPL is acceptable you can reuse the code, else you'll have to reimplement.