Buffering to the hard disk

I am receiving a large quantity of data at a fixed rate. I need to do some processing on this data on a different thread, but this may run slower than the data is coming in, so I need to buffer the data. Due to the quantity of data coming in the available RAM would be quickly exhausted, so it needs to overflow onto the hard disk. What I could do with is something like a filesystem-backed pipe, so the writer could be blocked by the filesystem, but not by the reader running too slowly.
Here's a rough set of requirements:
Writing should not be blocked by the reader running too slowly.
If data is read slowly enough that the available RAM is exhausted, it should overflow to the filesystem. It's OK for writes to the disk to block.
Reading should block if no data is available unless the stream has been closed by the writer.
If the reader is able to keep up with the data then it should never hit the hard disk as the RAM buffer would be sufficient (nice but not essential).
Disk space should be recovered as the data is consumed (or soon after).
Does such a mechanism exist in Windows?

This looks like a classic message queue. Did you consider MSMQ or similar? MSMQ has all the properties you are asking for. You may want to use direct addressing to avoid Active Directory (http://msdn.microsoft.com/en-us/library/ms700996(v=vs.85).aspx) and use a local or TCP/IP queue address.

Use an actual file. Write to the file as the data is received, and in another process read the data from the file and process it.
You even get the added benefits of no multithreading.
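For the in-process variant the question asks about, here is a minimal sketch of a queue that keeps chunks in RAM up to a limit and spills the rest to a file. The class name SpillQueue, the length-prefixed record layout, and the "truncate only once the file is fully drained" policy are all assumptions made for illustration, not an existing Windows facility. The lock is held only for queue and file operations, so the reader's processing time never blocks the writer.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <fstream>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical sketch: FIFO of byte chunks, RAM first, overflow to a spill file.
class SpillQueue {
public:
    SpillQueue(std::string path, std::size_t ram_limit)
        : path_(std::move(path)), ram_limit_(ram_limit) { reopen(); }

    // Never waits for the reader; may block on disk I/O when spilling.
    void push(const std::string& chunk) {
        std::lock_guard<std::mutex> lock(m_);
        const bool disk_pending = read_off_ < write_off_;
        // Once anything is on disk, newer chunks must follow it to keep FIFO order.
        if (!disk_pending && ram_bytes_ + chunk.size() <= ram_limit_) {
            ram_.push_back(chunk);
            ram_bytes_ += chunk.size();
        } else {
            const std::uint64_t n = chunk.size();
            file_.seekp(write_off_);
            file_.write(reinterpret_cast<const char*>(&n), sizeof n);
            file_.write(chunk.data(), static_cast<std::streamsize>(n));
            file_.flush();
            write_off_ += static_cast<std::streamoff>(sizeof n + n);
        }
        cv_.notify_one();
    }

    // Called by the writer when no more data will arrive.
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }

    // Blocks until a chunk is available; returns nullopt once closed and drained.
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return !ram_.empty() || read_off_ < write_off_ || closed_; });
        if (!ram_.empty()) {                       // RAM chunks are always the oldest
            std::string chunk = std::move(ram_.front());
            ram_.pop_front();
            ram_bytes_ -= chunk.size();
            return chunk;
        }
        if (read_off_ < write_off_) {
            std::uint64_t n = 0;
            file_.seekg(read_off_);
            file_.read(reinterpret_cast<char*>(&n), sizeof n);
            std::string chunk(static_cast<std::size_t>(n), '\0');
            file_.read(&chunk[0], static_cast<std::streamsize>(n));
            read_off_ += static_cast<std::streamoff>(sizeof n + n);
            if (read_off_ == write_off_) {         // drained: give the disk space back
                reopen();
                read_off_ = write_off_ = 0;
            }
            return chunk;
        }
        return std::nullopt;                       // closed and empty
    }

private:
    void reopen() {                                // (re)creates an empty spill file
        if (file_.is_open()) file_.close();
        file_.open(path_, std::ios::in | std::ios::out | std::ios::binary | std::ios::trunc);
        assert(file_.is_open() && "could not open spill file");
    }

    std::string path_;
    std::size_t ram_limit_;
    std::size_t ram_bytes_ = 0;
    std::deque<std::string> ram_;
    std::fstream file_;
    std::streamoff read_off_ = 0, write_off_ = 0;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
};
```

If the reader keeps up, nothing is ever written to the file. Disk space is only reclaimed when the file is completely drained, which is simpler than trimming the front of the file but means a reader that never fully catches up lets the file grow.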

Related

Is there a way prevent libcurl from buffering?

I am using libcurl with CURLOPT_WRITEFUNCTION to download a certain file.
I ask for a certain buffer size using CURLOPT_BUFFERSIZE.
When my callback function is called the first time and I get about that many bytes, much more data has actually been downloaded.
For example, if I ask for 1024 bytes of data, by the time I first get them the process has already consumed 100K of data (based on Process Explorer and similar tools; I can see the continuous stream of data and ACKs in Wireshark), so I assume it is downloading in advance and buffering the data.
The thing I am trying to achieve here is to be able to cancel the retrieval based on first few chunks of data without downloading anything that is unnecessary.
Is there a way to prevent that sort of buffering and only download the next chunk of data once I have finished processing the current one (or at least not to buffer tens and hundreds of kilobytes)?
I would prefer the solution to be server agnostic, so CURLOPT_RANGE won't actually work here.

How to determine whether data has been retrieved from disk or from caches?

I have written a program in C/C++ which needs to fetch data from the disk. After some time, the operating system stores some of the data in its caches. Is there some way to figure out, from within a C/C++ program, whether the data has been retrieved from the caches or from the disk?

A simple solution would be to time the read operation. Disk reads are significantly slower. You can read a group of file blocks (4K) twice to get an estimate.
The problem is that if you run the program again or copy the file in a shell, the OS will cache it.
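A rough sketch of the timing idea, assuming standard C++ streams and std::chrono. The names timed_read and first_read_looks_uncached, and the factor of 10 used as a threshold, are made up for illustration; the result is only a heuristic, since the stream's own buffering and other system activity also affect the timings.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct TimedRead {
    std::size_t bytes;   // bytes actually read
    double micros;       // elapsed wall-clock time in microseconds
};

// Reads up to `count` bytes starting at `offset` and reports how long it took.
TimedRead timed_read(const std::string& path, std::streamoff offset, std::size_t count) {
    std::vector<char> buf(count);
    std::ifstream in(path, std::ios::binary);
    assert(in && "could not open file");
    const auto t0 = std::chrono::steady_clock::now();
    in.seekg(offset);
    in.read(buf.data(), static_cast<std::streamsize>(count));
    const auto t1 = std::chrono::steady_clock::now();
    return { static_cast<std::size_t>(in.gcount()),
             std::chrono::duration<double, std::micro>(t1 - t0).count() };
}

// Heuristic: a first read much slower than the repeat read suggests the first
// one went to disk. The default ratio is an arbitrary assumption.
bool first_read_looks_uncached(const TimedRead& first, const TimedRead& second,
                               double ratio = 10.0) {
    return first.micros > ratio * second.micros;
}
```

Call timed_read twice on the same 4K block and pass both results to first_read_looks_uncached; if the file was already cached before the first call, both timings will be similar and the function returns false.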

Is FindFirstChangeNotification API doing any disk access? [duplicate]

I've used FileSystemWatcher in the past. However, I am hoping someone can explain how it actually is working behind the scenes.
I plan to utilize it in an application I am making and it would monitor about 5 drives and maybe 300,000 files.
Does the FileSystemWatcher actually do "checking" on the drive, as in, will it cause wear and tear on the drive? Also, does it impact the hard drive's ability to "sleep"?
This is where I do not understand how it works: is it scanning the drives on a timer, or is it waiting for some type of notification from the OS before it does anything?
I just do not want to implement something that is going to cause extra reads on a drive and keep the drive from sleeping.

Nothing like that. The file system driver simply monitors the normal file operations requested by other programs that run on the machine and checks them against the filters you've selected. If there's a match, it adds an entry to an internal buffer that records the operation and the filename. That completes the driver request and causes an event to run in your program. The details of the operation are passed to you from that buffer.
So nothing actually happens to the operations themselves, and there is no extra disk activity at all. It is all just software that runs. The overhead is minimal; nothing slows down noticeably.

The short answer is no. The FileSystemWatcher calls the ReadDirectoryChangesW API passing it an asynchronous flag. Basically, Windows will store data in an allocated buffer when changes to a directory occur. This function returns the data in that buffer and the FileSystemWatcher converts it into nice notifications for you.

usb disk write latency (windows)

I am writing to a USB disk from a lowest-priority thread, using chunked buffered writes, and still, from time to time, the whole system lags during this operation. If I disable only the writing to disk, everything works fine. I can't use Windows file operation API calls, only C write. So I thought maybe there is a WinAPI function to turn USB disk write caching on or off, which I could use in conjunction with FlushFileBuffers or similar alternatives? The number of drives for operations is undefined.
Ideally the write call would never lag; caching is fine too, as long as it is performed transparently.
EDIT: would _O_SEQUENTIAL flag on write only operations be of any use here?

Try reducing the I/O priority of the thread.
See this article: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx
In particular use THREAD_MODE_BACKGROUND_BEGIN for your IO thread.
Warning: this doesn't work in Windows XP

Thread priority won't affect the delay incurred while writing to the media, because that work is done in kernel mode by the file system and disk drivers, which don't pay attention to the priority of the calling thread.
You might try the "T" flag (_O_SHORT_LIVED) and flush the buffers at the end of the operation; also try decreasing the buffer size.

There are different types of data transfer for USB; for data there are three:
1. Bulk Transfer,
2. Isochronous Transfer, and
3. Interrupt Transfer.
Bulk transfers provide:
Used to transfer large bursty data.
Error detection via CRC, with guarantee of delivery.
No guarantee of bandwidth or minimum latency.
Stream Pipe - Unidirectional
Full & high speed modes only.
Bulk transfer is good for data that does not require delivery in a guaranteed amount of time. The USB host controller gives a lower priority to bulk transfer than to the other types of transfer.
Isochronous transfers provide:
Guaranteed access to USB bandwidth.
Bounded latency.
Stream Pipe - Unidirectional
Error detection via CRC, but no retry or guarantee of delivery.
Full & high speed modes only.
No data toggling.
Isochronous transfers occur continuously and periodically. They typically contain time sensitive information, such as an audio or video stream. If there were a delay or retry of data in an audio stream, then you would expect some erratic audio containing glitches. The beat may no longer be in sync. However if a packet or frame was dropped every now and again, it is less likely to be noticed by the listener.
Interrupt transfers provide:
Guaranteed Latency
Stream Pipe - Unidirectional
Error detection and next period retry.
Interrupt transfers are typically non-periodic, small device "initiated" communication requiring bounded latency. An Interrupt request is queued by the device until the host polls the USB device asking for data.
From the above, it seems that you want guaranteed latency, so you should use isochronous mode. There are libraries you can use, such as libusb, or you can read more on MSDN.

To find out what is making your system hang, you first need to drill down into the Windows hang. What was Windows doing when you experienced the hang?
To find this out you can take a kernel dump. For how to get and analyze a kernel dump, read here.
Depending on what you find there, you then need to decide whether there is anything under your control you can do about it. Since you are using a third-party library to do the writing, there is little you can do except set the I/O priority or thread priority at the thread or process level. If the library you were given links against a specific CRT, you could try to build your own customized version of it to, for example, flush after every write, which prevents the OS from combining writes and writing the data back to disk only in big chunks.
Edit1
Your best bet would be to flush the device after every write. This could force the OS to write the currently pending data to disk right away instead of caching writes up to a certain amount.
The second best thing would be to simply wait after each write, to give the OS the chance to write the pending changes, though small, back to disk after a certain time interval.
If you want to dig deeper into performance, you should try XPerf, which has a nice GUI and even shows you the call stack where your process hung. The Windows team and many other teams at MS use this tool to troubleshoot hangs. The latest edition, with many more features, comes with the Windows 8 SDK. But beware that XPerf only works on Windows Vista or later.

Loading large multi-sample audio files into memory for playback - how to avoid temporary freezing

I am writing an application that needs to use large audio multi-samples, usually around 50 MB in size. One file contains approximately 80 individual short sound recordings, which can get played back by my application at any time. For this reason all the audio data gets loaded into memory for quick access.
However, loading one of these files into memory can take many seconds, meaning my program is temporarily frozen. What is a good way to avoid this happening? It must be compatible with Windows and OS X. It freezes at this call: myMultiSampleClass->open(); which has to do a lot of dynamic memory allocation and reading from the file using ifstream.
I have thought of two possible options:
Open the file and load it into memory in another thread so my application process does not freeze. I have looked into the Boost library to do this but need to do quite a lot of reading before I am ready to implement. All I would need to do is call the open() function in the thread then destroy the thread afterwards.
Come up with a scheme to make sure I don't load the entire file into memory at any one time, I just load on the fly so to speak. The problem is any sample could be triggered at any time. I know some other software has this kind of system in place but I'm not sure how it works. It depends a lot on individual computer specifications, it could work great on my computer but someone with a slow HDD/Memory could get very bad results. One idea I had was to load x samples of each audio recording into memory, then if I need to play, begin playback of the samples that already exist whilst loading the rest of the audio into memory.
Any ideas or criticisms? Thanks in advance :-)

Use a memory mapped file. Loading time is initially "instant", and the overhead of I/O will be spread over time.

I like solution 1 as a first attempt -- simple & to the point.
If you are under Windows, you can do asynchronous file operations -- what they call OVERLAPPED -- to tell the OS to load a file & let you know when it's ready.
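Since the question needs Windows and OS X, here is a minimal portable sketch of option 1 using std::async from C++11 instead of Boost or OVERLAPPED. The function load_file is a hypothetical stand-in for myMultiSampleClass->open(), and load_async and is_ready are names invented for this example.

```cpp
#include <cassert>
#include <chrono>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the slow open(): reads the whole file into memory.
std::vector<char> load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Starts the load on another thread and returns immediately.
std::future<std::vector<char>> load_async(std::string path) {
    return std::async(std::launch::async, load_file, std::move(path));
}

// Non-blocking check that the main loop can call while the load is running.
bool is_ready(const std::future<std::vector<char>>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

The main thread keeps running its event loop, polls is_ready, and calls get() on the future once it returns true. The thread is created and cleaned up by the library, so there is nothing to destroy by hand.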

I think the best solution is to load a small chunk or a single sample of wave data at a time during playback, using asynchronous I/O (as John Dibling mentioned), into a fixed-size playback buffer.
The strategy is to fill the playback buffer first and then play (this adds a small delay but guarantees continuous playback). While one buffer is playing, you refill another playback buffer on a different thread (overlapped). You need at least two playback buffers, one for playing and one for refilling in the background, and you switch between them in real time.
Later you can choose the playback buffer size based on the client PC's performance (it is a trade-off between memory size and processing power; a faster CPU can get by with a smaller buffer and thus a lower delay).

You might want to consider a producer-consumer approach. This basically involves reading the sound data into a buffer using one thread, and streaming the data from the buffer to your sound card using another thread.
The data reader is the producer, and streaming the data to the sound card is the consumer. You need high-water and low-water marks so that, if the buffer gets full, the producer stops reading, and if the buffer gets low, the producer starts reading again.
A C++ Producer-Consumer Concurrency Template Library
http://www.bayimage.com/code/pcpaper.html
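To make the high-water and low-water idea concrete, here is a minimal sketch using a mutex and two condition variables. It is not taken from the library linked above; the class name WatermarkBuffer, the float sample type, and the method names are assumptions for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical sketch: the producer pauses at the high-water mark and only
// resumes once the consumer has drained the buffer down to the low-water mark.
class WatermarkBuffer {
public:
    WatermarkBuffer(std::size_t low, std::size_t high) : low_(low), high_(high) {
        assert(low < high);
    }

    // Producer side: blocks while the buffer is in the "paused" state.
    void put(float sample) {
        std::unique_lock<std::mutex> lock(m_);
        can_fill_.wait(lock, [&] { return !paused_; });
        q_.push_back(sample);
        if (q_.size() >= high_) paused_ = true;    // reached high-water: stop reading
        has_data_.notify_one();
    }

    // Consumer side: blocks until at least one sample is available.
    float take() {
        std::unique_lock<std::mutex> lock(m_);
        has_data_.wait(lock, [&] { return !q_.empty(); });
        const float s = q_.front();
        q_.pop_front();
        if (paused_ && q_.size() <= low_) {        // fell to low-water: resume reading
            paused_ = false;
            can_fill_.notify_all();
        }
        return s;
    }

    bool producer_paused() const {
        std::lock_guard<std::mutex> lock(m_);
        return paused_;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(m_);
        return q_.size();
    }

private:
    const std::size_t low_, high_;
    mutable std::mutex m_;
    std::condition_variable can_fill_, has_data_;
    std::deque<float> q_;
    bool paused_ = false;
};
```

The gap between the two marks is what keeps the reader thread from waking up for every single sample consumed: it refills in bursts instead. In a real audio callback you would usually avoid blocking in take() and output silence on underrun instead.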
EDIT: I should add that this sort of thing is tricky. If you are building a sample player, the load on the system varies continuously as a function of which keys are being played, how many sounds are playing at once, how long the duration of each sound is, whether the sustain pedal is being pressed, and other factors such as hard disk speed and buffering, and amount of processor horsepower available. Some programming optimizations that you eventually employ will not be obvious at first glance.
