usb disk write latency (windows)

usb disk write latency (windows) - c++

I am writing to USB disk from a lowest priority thread, using chunked buffer writing and still, from time to time the system in overall lags on this operation. If I disable writing to disk only, everything works fine. I can't use Windows file operations API calls, only C write. So I thought maybe there is a WinAPI function to turn on/off USB disk write caching which I could use in conjunction with FlushBuffers or similar alternatives? The number of drives for operations is undefined.
Ideally I would like to never be lagging using write call and the caching, if it will be performed transparently is ok too.
EDIT: would _O_SEQUENTIAL flag on write only operations be of any use here?

Try to reduce I/O priority for the thread.
See this article: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx
In particular use THREAD_MODE_BACKGROUND_BEGIN for your IO thread.
Warning: this doesn't work in Windows XP

The thread priority won't affect the delay that happens in the process of writing the media, because it's done in the kernel mode by the file system/disk drivers that don't pay attention to the priority of the calling thread.
You might try to use "T" flag (_O_SHORTLIVED) and flush the buffers at the end of the operation, also try to decrease the buffer size.

There are different types of data transfer for USB, for data there are 3:
1.Bulk Transfer,
2.Isochronous Transfer, and
3.Interrupt Transfer.
Bulk Transfers Provides:
Used to transfer large bursty data.
Error detection via CRC, with guarantee of delivery.
No guarantee of bandwidth or minimum latency.
Stream Pipe - Unidirectional
Full & high speed modes only.
Bulk transfer is good for data that does not require delivery in a guaranteed amount of time The USB host controller gives a lower priority to bulk transfer than the other types of transfer.
Isochronous Transfers Provides:
Guaranteed access to USB bandwidth.
Bounded latency.
Stream Pipe - Unidirectional
Error detection via CRC, but no retry or guarantee of delivery.
Full & high speed modes only.
No data toggling.
Isochronous transfers occur continuously and periodically. They typically contain time sensitive information, such as an audio or video stream. If there were a delay or retry of data in an audio stream, then you would expect some erratic audio containing glitches. The beat may no longer be in sync. However if a packet or frame was dropped every now and again, it is less likely to be noticed by the listener.
Interrupt Transfers Provides:
Guaranteed Latency
Stream Pipe - Unidirectional
Error detection and next period retry.
Interrupt transfers are typically non-periodic, small device "initiated" communication requiring bounded latency. An Interrupt request is queued by the device until the host polls the USB device asking for data.
From the above, it seems that you want a Guaranteed Latency, so you should use Isochronous mode. There are some libraries that you can use like libusb, or you can read more in msdn

To find out what is letting your system hang you first need to drill down to the Windows hang. What was Windows doing while you did experience the hang?
To find this out you can take a kernel dump. How to get and analyze a Kernel Dump read here.
Depending on the findings you get there you then need to decide if there is anything under your control you can do about. Since you are using a third party library to to the writing there is little you can do except to set the IO priority, thread priority on thread or process level. If the library you were given links against a specific CRT you could try to build your own customized version of it to e.g. flush after every write to prevent write combining by the OS to write only data in big chunks back to disc.
Edit1
Your best bet would be to flush the device after every write. This could force the OS to flush any pending data and write the current pending writes to disc without caching the writes up to certain amount.
The second best thing would be to simply wait after each write to give the OS the chance to write pending changes though small back to disc after a certain time interval.
If you are deeper into performance you should try out XPerf which has a nice GUI and shows you even the call stack where your process did hang. The Windows Team and many other teams at MS use this tool to troubleshoot hang experiences. The latest edition with many more features comes with the Windows 8 SDK. But beware that Xperf only works on OS > Vista.

Related

Recording both rendering and recording device

I'm writing a program in C++, on Windows. I need to support Windows Vista+.
I want to record both the microphone and speaker simultaneously.
I'm using the WASAPI and can record the microphone and speaker separately, but I would like to have just one stream supplying me the input from both streams (for example, for recording a client play the guitar along with the music he hears on his headphones), instead of merging the two buffers together somehow (which I guess will lead me to timing issues).
Is there a way to do this?

I'm actually working on a library which can do exactly that, merge streams from multiple devices. You might want to give it a try: see xt-audio.com. If you're implementing this yourself, here's some things to consider:
If you're capturing the speakers through a WASAPI loopback interface you're operating in shared mode, in this case latency might be unacceptable for live performance. If possible stick to exclusive mode and use a loopback cable or hardware loopback device if you have one (e.g. the old fashioned "stereo mix" devices etc).
If you're merging buffers then yes, you're going to have timing issues. This is generally unavoidable when syncing independent devices. Pops/clicks can largely be avoided using a secondary intermediate buffer which introduces additional latency, but eventually you're going to have to pad/drop some samples to keep streams in sync.
Do NOT use separate threads for each independent stream. This will increase context switches and thereby increase the minimum achievable latency. Instead, designate one device as the master device, wait for that device's event to be raised, then read input from all devices whether they are "ready" or not (this is were dropping/padding comes into play).
In general you can get really decent performance from WASAPI exclusive mode, even running multiple streams together. But for something as critical as live performance you might want to consider a pro audio interface with ASIO drivers where everything just ticks off the same clock, or synchronization is at least handled at the driver level.

Idendify the reason for a 200 ms freezing in a time critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could interfere too with the software, I need a tool/technique to test what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread and one buffer is used to read the data and save it to disc in the second thread. If all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffer switches, the data collection thread blocks for about 200 ms, which is far too long. However, it happens once in a while, that the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU #3.2 GHz (4 cores), 16 GB DDR3 (800 MHz)).
My guess is, that the performance problem is linked to the data being exchanged between cores. Only if the threads run on the same core by chance, the exchange would be much faster. I thought about setting the thread affinity mask in a way to force both threads to run on the same core, but this also means, that I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer to switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course during the toggling all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms are very reproducible for each switch event.

Locking and releasing a mutex just to switch one bool variable will not take 200ms.
Main problem is probably that two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically this occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead parallelism you have two thread waiting for each other to finish their part of work, having similar effect as in single threaded approach.
For further reading I recommend this article for a read, which describes lock contention with more detailed level.

Since you are running on windows maybe you use visual studio? if yes I would resort to VS profiler which is quite good (IMHO) in such cases, once you don't need to check data/instruction caches (then the Intel's vTune is a natural choice). From my experience VS is good enough to catch contention problems as well as CPU bottlenecks. you can run it directly from VS or as standalone tool. you don't need the VS installed on your test machine you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
then you can open the file and inspect it in visual studio. if you don't see anything making your CPU busy switch to contention profiling - just change the "start" argument from "SAMPLE" to "CONCURRENCY"
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\, AFAIR it is available from VS2010
Good luck

After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified. Since it saves data to disc, it is the one that with file access. (Look at Memory->Hard Faults)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port io command could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch data command via the virtual serial port pair gets sometimes lost. This might be linked to very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received and the first as well as the second attempt to receive the data timed out with their 100 ms timeout time.

Getting notification that the serial port is ready to be read from

I have to write a C++ application that reads from the serial port byte by byte. This is an important need as it is receiving messages over radio transmission using modbus and the end of transmission is defined by 3.5 character length duration so I MUST be able to get the message byte by byte. The current system utilises DOS to do this which uses hardware interrupts. We wish to transfer to use Linux as the OS for this software, but we lack expertise in this area. I have tried a number of things to do this - firstly using polling with non-blocking read, using select with very short timeout values, setting the size of the read buffer of the serial port to one byte, and even using a signal handler on SIGIO, but none of these things provide quite what I require. My boss informs me that the DOS application we currently run uses hardware interrupts to get notification when there is something available to read from the serial port and that the hardware is accessible directly. Is there any way that I can get this functionality from a user space Linux application? Could I do this if I wrote a custom driver (despite never having done this before and having close to zero knowledge of how the kernel works) ??. I have heard that Linux is a very popular OS for hardware control and embedded devices so I am guessing that this kind of thing must be possible to do somehow, but I have spent literally weeks on this so far and still have no concrete idea of how best to proceed.

I'm not quite sure how reading byte-by-byte helps you with fractional-character reception, unless it's that there is information encoded in the duration of intervals between characters, so you need to know the timing of when they are received.
At any rate, I do suspect you are going to need to make custom modifications to the serial port kernel driver; that's really not all that bad as a project goes, and you will learn a lot. You will probably also need to change the configuration of the UART "chip" (really just a tiny corner of some larger device) to make it interrupt after only a single byte (ie emulate a 16450) instead of when it's typically 16-byte (emulating at 16550) buffer is partway full. The code of the dos program might actually be a help there. An alternative if the baud rate is not too fast would be to poll the hardware in the kernel or a realtime extension (or if it is really really slow as it might be on an HF radio link, maybe even in userspace)
If I'm right about needing to know the timing of the character reception, another option would be offload the reception to a micro-controller with dual UARTS (or even better, one UART and one USB interface). You could then have the micro watch the serial stream, and output to the PC (either on the other serial port at a much faster baud rate, or on the USB) little packages of data that include one received character and a timestamp - or even have it decode the protocol for you. The nice thing about this is that it would get you operating system independence, and would work on legacy free machines (byte-by-byte access is probably going to fail with an off-the-shelf USB-serial dongle). You can probably even make it out of some cheap eval board, rather than having to manufacture any custom hardware.

What could delay pre-emption of a VxWorks task?

In my current project, I have two levels of tasking, in a VxWorks system, a higher priority (100) task for number crunching and other work and then a lower priority (200) task for background data logging to on-board flash memory. Logging is done using the fwrite() call, to a file stored on a TFFS file system. The high priority task runs at a periodic rate and then sleeps to allow background logging to be done.
My expectation was that the background logging task would run when the high priority task sleeps and be preempted as soon as the high priority task wakes.
What appears to be happening is a significant delay in suspending the background logging task once the high priority task is ready to run again, when there is sufficient data to keep the logging task continuously occupied.
What could delay the pre-emption of a lower priority task under VxWorks 6.8 on a Power PC architecture?

You didn't quantify significant, so the following is just speculation...
You mention writing to flash. One of the issue is that writing to flash typically requires the driver to poll the status of the hardware to make sure the operation completes successfully.
It is possible that during certain operations, the file system temporarily disables preemption to insure that no corruption occurs - coupled with having to wait for hardware to complete, this might account for the delay.
If you have access to the System Viewer tool, that would go a long way towards identifying the cause of the delay.

I second the suggestion of using the System Viewer, It'll show all the tasks involved in TFFS stack and you may be surprised how many layers there are. If you're making an fwrite with a large block of data, the flash access may be large (and slow as Benoit said). You may try a bunch of smaller fwrites. I suggest doing a test to see how long fwrites() take for various sizes, and you may see differences from test to test with the same sizea as you cross flash block boundaries.

Low Throughput on Windows Named Pipe Over WAN

I'm having problems with low performance using a Windows named pipe. The throughput drops off rapidly as the network latency increases. There is a roughly linear relationship between messages sent per second and round trip time. It seems that the client must ack each message before the server will send the next one. This leads to very poor performance, I can only send 5 (~100 byte) messages per second over a link with an RTT of 200 ms.
The pipe is asynchronous, using multiple overlapped write operations (and multiple overlapped reads at the client end), but this is not improving throughput. Is it possible to send messages in parallel over a named pipe? The pipe is created using PIPE_TYPE_MESSAGE, would PIPE_READMODE_BYTE work better? Is there any other way I can improve performance?
This is a deployed solution, so I can't simply replace the pipe with a socket connection (I've read that Windows named pipe aren't recommended for use over a WAN, and I'm wondering if this is why). I'd be grateful for any help with this matter.

We found that Named Pipes had poor performance from Windows XP onwards.
I don't have a solution for you. But I am concurring with the notion of Named Pipes being useless from XP onwards. We changed our software (in terms of IPC) completely because of it.
Is your comms factored into a separate DLL? Perhaps you could replace the DLL with an interface that looks the same but behaves differently?

I've implemented a work around, introducing a small (~1ms) fixed delay to buffer up as much data as possible before writing to the pipe. Over a network link with a RTT of 200ms, I can send ten times as much data in about a third of the time.
I send a message down the pipe when it first connects, so the client can determine the comms mode supported by the server and send data accordingly.

I would imagine that some of the WAN optimisation gear out there would be able to boost performance, as one of the things they do is understand protocols and reduce their chattiness. Given the latency of many WAN links, this alone can boost throughput and reduce timeouts.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js