Writing many large files quickly in C++ - c++

I have a program which gets a stream of raw data from different cameras and writes it to disk. The program runs these sorts of recordings for ~2 minutes and then another program is used to process the frames.
Each raw frame is 2MB and the frame rate is 30fps (i.e. the data rate is around 60MB/s), and I'm writing to an SSD which can easily handle a sustained > 150MB/s (tested by copying 4000 2MB files from another disk, which took 38 seconds, and Process Explorer shows constant IO activity).
My issue is that occasionally calls to fopen(), fwrite() and fclose() stall for up to 5 seconds, which means that 300MB of frames build up in memory as a backlog, and after a few of these delays I hit the 4GB limit of a 32-bit process. (When the delay happens, Process Explorer shows a gap in IO activity.)
There is a thread which runs a loop calling this function for every new frame which gets added to a queue:
void writeFrame(char* data, size_t dataSize, char* filepath)
{
    // Time block 2: open the output file
    FILE* pFile = NULL;
    fopen_s(&pFile, filepath, "wb");
    // End Time block 2
    // Time block 3: write the whole frame in one call
    fwrite(data, 1, dataSize, pFile);
    // End Time block 3
    // Time block 4: close (flushes any buffered data)
    fclose(pFile);
    // End Time block 4
}
(There's error checking too in the actual code but it makes no difference to this issue)
I'm logging the time it takes for each of the blocks and the total time it takes to run the function and I get results which most of the time look like this: (times in ms)
TotalT,5, FOpenT,1, FWriteT,2, FCloseT,2
TotalT,4, FOpenT,1, FWriteT,1, FCloseT,2
TotalT,5, FOpenT,1, FWriteT,2, FCloseT,2
i.e. ~5ms to run the whole function, ~1ms to open the file, ~2ms to call write and ~2ms to close the file.
Occasionally however (on average about 1 in every 50 frames, but sometimes it can be thousands of frames between this problem occurring), I get frames which take over 4000ms:
TotalT,4032, FOpenT,4023, FWriteT,6, FCloseT,3
and
TotalT,1533, FOpenT,1, FWriteT,2, FCloseT,1530
All the frames are the same size and it's never fwrite that takes the extra time, always fopen or fclose.
No other process is reading/writing to/from this SSD (confirmed with Process Monitor).
Does anyone know what could be causing this issue and/or any way of avoiding/mitigating this problem?

I'm going to side with X.J., you're probably writing too many files to a single directory.
A solution could be to create a new directory for each batch of frames. Also consider calling SetEndOfFile directly after creating the file, as that will help Windows allocate sufficient space in a single operation.
FAT isn't a real solution as it's doing even worse on large directories.
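A rough sketch of the directory-per-batch idea, using std::filesystem (C++17). The helper names, the "batch_NNNN/frame_NNNNNNNN.raw" naming scheme and the batch size are made up for illustration, and the SetEndOfFile preallocation is not shown here since it is Windows-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper: maps a frame index to
// "<root>/batch_0003/frame_00003042.raw" so that no single directory
// ever holds more than framesPerDir files.
fs::path framePath(const fs::path& root, std::size_t frameIndex,
                   std::size_t framesPerDir)
{
    assert(framesPerDir > 0);
    char dirName[32];
    char fileName[32];
    std::snprintf(dirName, sizeof dirName, "batch_%04zu",
                  frameIndex / framesPerDir);
    std::snprintf(fileName, sizeof fileName, "frame_%08zu.raw", frameIndex);
    return root / dirName / fileName;
}

// Creates the batch directory on first use and returns the frame's path.
fs::path prepareFramePath(const fs::path& root, std::size_t frameIndex,
                          std::size_t framesPerDir)
{
    fs::path p = framePath(root, frameIndex, framesPerDir);
    fs::create_directories(p.parent_path()); // no-op if it already exists
    return p;
}
```

The writer thread would call prepareFramePath() once per frame and pass the result to the existing write function. Whether 1000 frames per directory is a good number is something you would have to measure.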

Prepare empty files in advance (2 MB files filled with zeros) so that the space is already allocated, then just overwrite these files. Or create one file that holds a batch of several frames, so you reduce the number of files.
There are also libraries for compressing, decompressing and playing back video:
libTheora may be useful because it compresses the frames as it goes (you will need to output the video to a single file) and does so pretty fast (the compression is lossy, by the way).
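To illustrate the "batch of several frames in one file" suggestion, here is a minimal sketch that stores fixed-size frames back to back in a single file, so there is only one open and one close per recording. The class name FrameContainer and its interface are invented for this example, and it assumes every frame has exactly the same size, as in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Sketch: frame number i lives at byte offset i * frameSize.
class FrameContainer {
public:
    FrameContainer(const std::string& path, std::size_t frameSize)
        : frameSize_(frameSize),
          file_(path, std::ios::binary | std::ios::in | std::ios::out |
                          std::ios::trunc)
    {
        assert(frameSize_ > 0);
    }

    bool ok() const { return file_.good(); }

    // Writes one frame; `data` must point to frameSize bytes.
    bool write(std::size_t index, const char* data)
    {
        file_.seekp(static_cast<std::streamoff>(index * frameSize_));
        file_.write(data, static_cast<std::streamsize>(frameSize_));
        return file_.good();
    }

    // Reads one frame back, e.g. from the processing program.
    bool read(std::size_t index, std::vector<char>& out)
    {
        out.resize(frameSize_);
        file_.seekg(static_cast<std::streamoff>(index * frameSize_));
        file_.read(out.data(), static_cast<std::streamsize>(frameSize_));
        return file_.good();
    }

private:
    std::size_t frameSize_;
    std::fstream file_;
};
```

The processing program then needs to know the frame size to find frame i, which is the price for avoiding thousands of small files.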

Related

C++ Video Capturing using Sink Writer - Memory consumption

I am writing a C++ program (Win64) using C++ Builder 11.1.5 that captures video from a web cam and stores the captured frames in a WMV file using the sink writer interface as described in the following tutorial:
https://learn.microsoft.com/en-gb/windows/win32/medfound/tutorial--using-the-sink-writer-to-encode-video?redirectedfrom=MSDN
The video doesn't need to be real time using 30 Frames per second as the process being recorded is a slow one so I have set the FPS to be 5 (which is fine.)
The recording needs to run for about 8-12 hours at a time and, using the algorithms in the sink writer tutorial, I have seen the memory consumption of the program go up dramatically after 10 minutes of recording (in excess of 10 GB of memory). I have also seen that the final WMV file only becomes populated when the Finalize routine is called. Because of the memory consumption, the program starts to slow down after a while.
First Question: Is it possible to flush the sink writer to free up ram while it is recording?
Second Question: Maybe it would be more efficient to save the video in pieces and finalize the recording every 10 minutes or so then start another recording using a different file name such that when 8 hours is done the program could combine all the saved WMV files? How would one go about combining numerous WMV files into one large file?

Reading small separated chunks of a large file (C++)

I am reading a proprietary binary data file format. The format is basically header, data, size_of_previous_data, header, data, size_of_previous_data, header, data, size_of_previous_data, ...
Part of the header includes the number of bytes of the next chunk of data as well as its size being listed immediately after the data. The header is 256 bytes, the data is typically ~ 2MB and the size_of_previous_data is a 32 bit int.
The files are generally large (~GB), and I often have to search through tens of them for the data I want. In order to do this, the first thing I do in my code is index each of the files, i.e. read in just the headers and record the location of the associated data (file and byte number). My code basically reads the header using fstream::read(), checks the data size, skips the data using fstream::seekg(), then reads in the size_of_previous_data, then repeats until I reach the end of the file.
My problem is that this indexing is painfully slow. The data is on an internal 7200 rpm hard drive on my Windows 10 laptop and Task manager shows that my hard drive usage is maxed out, but I am only getting read speeds of about 1.5 MB/s with response times typically >70 ms. I am reading the file using a std::fstream using fstream::get() to read the headers and fstream::seekg() to move to the next header.
I have profiled my code and almost the entire time is spent in the fstream::read() code to read the size_of_previous_data value. I presume that when I do this the data immediately after this is buffered so my fstream::read() to get the next header takes practically no time.
So I am wondering if there is a way to optimise this? Almost my entire buffer in any buffered read is likely to be wasted (97% of it, if it is an 8kB buffer). Is there a way to shrink this and is it likely to be worth it (perhaps underlying OS buffers too in a way I cannot change)?
Assuming that a disk seek takes about 10 ms (from Latency Numbers Every Programmer Should Know), your file is 11 GB consisting of 2 MB chunks, the theoretical minimum running time is 5500 * 10 ms = 55 seconds.
If you're already in that order of magnitude, the most effective way of speeding this up might be to buy an SSD.
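The estimate above as a small function, in case you want to plug in your own numbers. The function name is made up, and the 10 ms seek time is the same rough figure as above, not a measurement of your drive.

```cpp
#include <cassert>
#include <cstdint>

// Lower bound on the indexing time if reading each header costs one seek.
// All three inputs are estimates.
double minIndexSeconds(std::uint64_t fileBytes, std::uint64_t chunkBytes,
                       double seekMs)
{
    assert(chunkBytes > 0);
    const std::uint64_t chunks = fileBytes / chunkBytes; // ~one header each
    return static_cast<double>(chunks) * seekMs / 1000.0;
}
```

For an 11 GB file of 2 MB chunks, minIndexSeconds(11'000'000'000, 2'000'000, 10.0) gives 5500 seeks and 55 seconds. With the >70 ms response times reported in the question, the same formula gives about 385 seconds for that file.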

Multithreading a File Map into an Array of Buffers

I'm trying to work with nasty large xml and text documents: ~40GBs.
I'm using Visual Studio 2012 on Windows 7.
I'm going to use 'Xerces' to snag the header/'footer tag' from the xmls.
I want to map an area of the file, say.. 60-120MBs.
Split the Map into (3 * processors/cores) equal parts. Setting each part as a buffer and loading the buffers into an array.
Then, using (#processors/cores) while statements in new threads, I will synchronously count characters/lines/xml cycles while chewing through the buffer array. When one buffer is completed, the process will jump to the next 'available' buffer and the completed buffer will be dropped out of memory. At the end I will add the total results into a project log.
Afterwards, I will reference the log, Split the files by character count/size(Or other option) to the nearest line or cycle and drop in the header and 'footer tag' to all the splits.
I'm doing this so I can import massive data to a MySQL server over a network with multiple computers.
My Question is, how do I create the buffer array and the file map with new threads?
Can I use :
win CreateFile
win CreateFileMapping
win MapViewOfFile
with standard ifstream operations and char buffers or should I opt something else?
Further clarification:
My thinking is that if I can have the hard drive streaming the file into memory from one place and in one direction, then I can use the full processing power of the machine to chew through separate but equal buffers.
~Flavor: It's kind of like being a shepherd trying to scoop food out from one huge bin with 3-6 large buckets with only two arms for X sheep that need to stay inside the fenced area. But they all move at the speed of light.
A few ideas or pointers might help me along here.
Any thoughts are Most Welcome. Thanks.
while(getline(my_file, myStr))
{
    characterCount += myStr.length();
    lineCount++;
    if(my_file.eof()){
        break;
    }
}
This was the only code at run time for the test. It took 2 hours, 30+ min, at 45-50% total processor for the program running it, on a dual core 1.6GHz laptop with 2GB RAM. Most of the RAM loaded right now is 600+MB from ~50 tabs open in Firefox, Visual Studio at 60MB, and so on.
IMPORTANT: During the test, the program running the code, which is only a window and a dialog box, seemed to dump its own working and private set of RAM, down to around 300K, and didn't respond for the length of the test. I need to make another thread for the while statement, I'm sure. But this means that NONE of the file was read into a buffer. The CPU was struggling for the entire run to keep up with the tiniest effort from the hard drive.
P.S. Further proof of CPU bottlenecking: it might take me 20min to transfer that entire file to another computer over my wireless network, which includes the read process and a socket catch to write process on the other computer.
UPDATE
I used this adorable little thing to go from the previous test time to about 15-20min which is in line with what Mats Petersson was saying.
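// read() reports failure on the final, partially filled chunk, so also
// check gcount() or the last chunk would never be counted.
while (my_file.read(&bufferOne[0], bufferOne.size()) || my_file.gcount() > 0)
{
    std::streamsize cc = my_file.gcount();
    for (std::streamsize i = 0; i < cc; i++)
    {
        if (bufferOne[i] == '\n')
            lineCount++;
        characterCount++;
    }
    // Update the progress bar once per chunk
    currentPercent = characterCount / onePercent;
    SendMessage(GetDlgItem(hDlg, IDC_GENPROGRESS), PBM_SETPOS, currentPercent, 0);
}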
Granted this is a single loop and it actually behaved much more appropriately than the previous test. This test was ~800% faster than the tight loop shown above this one with Getline. I set the buffer for this loop at 20MB. I jacked this code from: SOF - Fastest Example
BUT...
I would like to point out that while polling the process in Resource Monitor and Task Manager, it clearly showed the first core at 75-90% usage, the second fluctuating between 25-50% (pretty standard for some minor background stuff that I have open), and the hard drive at.. wait for it... 50%. There were some 100% disk time spikes but also some lows at 25%. All of which basically means that splitting the buffer processing between two different threads could very well be a benefit. It will use all the system resources but.. that's what I want. I'll update later today when I have the working prototype.
MAJOR UPDATE:
Finally finished my project after a bunch of learning. No File Map needed. Only a bunch of vector char's. I have successfully built a dynamically executing file stream line and character counter.
The good news, went from the previous 10-15min marker to ~3-4min on a 5.8GB file, BOOYA!~
Very short answer: Yes, you can use those functions.
For reading data, it's likely the most efficient method to map the file content into memory, since it saves having to copy the memory into a buffer in the application, just read it straight into the place it's supposed to go. So, no problem as long as you have enough address space available - 64-bit machines should certainly have plenty, in a 32-bit system it may be more of a scarce resource - but for sections of a few hundred MB, it shouldn't be a huge issue.
However, using multiple threads, I'm not at all convinced. I have a fair idea that reading more than one part of a very large file will be counter productive. This will increase the amount of head movement on the disk, which is a large portion of transfer rate. You can count on some 50-100MB/s transfer rates for "ordinary" systems. If the system has some sort of raid controller or some such, maybe around double that - very exotic raid controllers may achieve three times.
So reading 40GB will take somewhere in the order of 3-15 minutes.
The CPU is probably not going to be very busy, and running multiple threads is quite likely to worsen the overall performance of the system.
You may want to keep a thread for reading and one for writing, and only actually write out the data once you have a sufficient amount of it, again, to avoid unnecessary moves of the read/write head on the disk(s).

Replaying stored data at a fixed rate

I am working on a problem where I want to replay data stored in a file at a specified rate.
E.g. 25,000 records/second.
The file is in ascii format. Currently, I read each line of the file and apply a regex to
extract the data. 2- 4 lines make up a record. I timed this operation and it takes close to
15 microseconds for generating each record.
The time taken to publish each record is 6 microseconds.
If I perform the reading and writing sequentially, then I would end up with 21 microseconds to publish each record. So effectively, this means my upper bound is ~47K records per second.
If I decide to multithread the reading and writing, then I will be able to send out a packet every 9 microseconds (neglecting the locking penalty, since reader and writer share the same queue), which gives a throughput of 110K ticks per second.
Is my previous design correct ?
What kind of Queue and locking construct has minimum penalty when a single producer and consumer share a queue ?
If I would like to scale beyond this what's the best approach ?
My application is in C++
If it takes 15uS to read/prepare a record then your maximum throughput will be about 1sec/15uSec = 67k/sec. You can ignore the 6uSec part as the single thread reading the file cannot generate more records than that. (try it, change the program to only read/process and discard the output) not sure how you got 9uS.
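The arithmetic behind those figures, as a sketch. The function names are made up, and the pipelined number ignores queue and locking overhead, so treat it as an upper bound.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Records per second when all stages run back to back on one thread.
double sequentialRate(const std::vector<double>& stageMicros)
{
    double total = 0.0;
    for (double t : stageMicros) total += t;
    assert(total > 0.0);
    return 1e6 / total;
}

// Records per second when each stage has its own thread:
// the slowest stage sets the pace.
double pipelinedRate(const std::vector<double>& stageMicros)
{
    assert(!stageMicros.empty());
    const double slowest =
        *std::max_element(stageMicros.begin(), stageMicros.end());
    assert(slowest > 0.0);
    return 1e6 / slowest;
}
```

With stages of 15 and 6 microseconds, sequentialRate gives about 47.6k records/sec (1e6 / 21) and pipelinedRate gives about 66.7k records/sec (1e6 / 15), which is where the 67k ceiling comes from.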
To make this fly beyond 67k/sec ...
A) Estimate the maximum records per second you can read from the disk to be formatted. While this depends a lot on hardware, a figure of 20 MB/sec is typical for an average laptop. This number will give you the upper bound to aim for, and as you get close you can ease off trying.
B) Create a single thread just to read the file and incur the IO delay. This thread should write to large preallocated buffers, say 4 MB each. See http://en.wikipedia.org/wiki/Circular_buffer for a way of managing these. You are looking to hold maybe 1000 records per buffer (a guess, but not just 8-ish records!). Pseudo code:
while not EOF
    Allocate big buffer
    While not EOF and not buffer full
        Read file using fgets() or whatever
        Apply only very small preprocessing, ideally none
        Save into buffer
    Release buffer for other threads
C) Create another thread (or several if the order of records is not important) to process a ring buffer when it is full; this is your regex step. This thread in turn writes to another set of output ring buffers (tip: keep the ring buffer control structures apart in memory).
While run-program
    Wait/get an input buffer to process, semaphores/mutex/whatever you prefer
    Allocate output buffer
    Process records from input buffer,
    Place result in output buffer
    Release output buffer for next thread
    Release input buffer for reading thread
D) Create your final thread to consume the data. It isn't clear if this output is being written to disk or network, so this might affect the disk reading thread.
Wait/get input buffer from processed records pool
Output records to wherever
Return buffer to processed records pool
Notes.
Preallocate all buffers and pass them back to where they came from. E.g. you might have 4 buffers between the file reading thread and the processing threads; when all 4 are in use, the file reader waits for one to be free, it doesn't just allocate new buffers.
Try not to memset() buffers if you can avoid it, waste of memory bandwidth.
You won't need many buffers, 6? Per ring buffer?
The system will auto tune to slowest thread ( http://en.wikipedia.org/wiki/Theory_of_constraints ) so if you can read and prepare data faster than you want to output it, all the buffers will fill up and everything will pause except the output.
As the threads are passing reasonable amounts of data each sync point, the overhead of this will not matter too much.
The above design is how some of my code reads CSV files as quickly as possible; basically it all comes down to input IO bandwidth as the limiting factor.
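A cut-down sketch of the buffer recycling described in the notes, with only two stages (a producer thread and a consumer) and a plain mutex/condition-variable queue. The names BufferQueue and runPipeline are invented for the example, integers stand in for records, and an empty buffer is used as an end-of-input marker, which is an assumption of this sketch rather than part of the design above.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Buffer = std::vector<int>; // stand-in for a block of raw records

// Minimal blocking queue: pop() waits until a buffer is available.
class BufferQueue {
public:
    void push(Buffer b)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(b));
        }
        cv_.notify_one();
    }
    Buffer pop()
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Buffer b = std::move(q_.front());
        q_.pop();
        return b;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Buffer> q_;
};

// Producer fills recycled buffers, consumer sums them and hands them back.
long long runPipeline(int totalRecords, std::size_t poolSize,
                      std::size_t recordsPerBuffer)
{
    assert(poolSize > 0 && recordsPerBuffer > 0);
    BufferQueue freePool, filled;
    for (std::size_t i = 0; i < poolSize; ++i) {
        Buffer b;
        b.reserve(recordsPerBuffer); // allocate once, reuse afterwards
        freePool.push(std::move(b));
    }

    std::thread producer([&] {
        int next = 0;
        while (next < totalRecords) {
            Buffer b = freePool.pop(); // blocks while all buffers are in use
            b.clear();                 // keeps the capacity
            while (next < totalRecords && b.size() < recordsPerBuffer)
                b.push_back(next++);
            filled.push(std::move(b));
        }
        filled.push(Buffer{}); // empty buffer = end of input
    });

    long long sum = 0;
    for (;;) {
        Buffer b = filled.pop();
        if (b.empty()) break;
        for (int v : b) sum += v;
        freePool.push(std::move(b)); // return the buffer to the producer
    }
    producer.join();
    return sum;
}
```

When the consumer is the slow side, the producer simply blocks in freePool.pop() once every buffer is in flight, which is the "auto tune to the slowest thread" behaviour from the notes.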

Loading large multi-sample audio files into memory for playback - how to avoid temporary freezing

I am writing an application that needs to use large audio multi-samples, usually around 50 MB in size. One file contains approximately 80 individual short sound recordings, which can get played back by my application at any time. For this reason all the audio data gets loaded into memory for quick access.
However, when loading one of these files, it can take many seconds to put into memory, meaning my program is temporarily frozen. What is a good way to avoid this happening? It must be compatible with Windows and OS X. It freezes at this: myMultiSampleClass->open(); which has to do a lot of dynamic memory allocation and reading from the file using ifstream.
I have thought of two possible options:
Open the file and load it into memory in another thread so my application process does not freeze. I have looked into the Boost library to do this but need to do quite a lot of reading before I am ready to implement. All I would need to do is call the open() function in the thread then destroy the thread afterwards.
Come up with a scheme to make sure I don't load the entire file into memory at any one time, I just load on the fly so to speak. The problem is any sample could be triggered at any time. I know some other software has this kind of system in place but I'm not sure how it works. It depends a lot on individual computer specifications, it could work great on my computer but someone with a slow HDD/Memory could get very bad results. One idea I had was to load x samples of each audio recording into memory, then if I need to play, begin playback of the samples that already exist whilst loading the rest of the audio into memory.
Any ideas or criticisms? Thanks in advance :-)
Use a memory mapped file. Loading time is initially "instant", and the overhead of I/O will be spread over time.
I like solution 1 as a first attempt -- simple & to the point.
If you are under Windows, you can do asynchronous file operations -- what they call OVERLAPPED -- to tell the OS to load a file & let you know when it's ready.
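For solution 1, a portable sketch using std::async from the standard library instead of Boost or OVERLAPPED I/O. loadFile and loadFileAsync are placeholder names standing in for whatever your open() does internally.

```cpp
#include <cassert>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Reads a whole file into memory; returns an empty vector if it cannot
// be opened.
std::vector<char> loadFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Starts the load on another thread and returns immediately.
std::future<std::vector<char>> loadFileAsync(const std::string& path)
{
    assert(!path.empty());
    return std::async(std::launch::async, loadFile, path);
}
```

The UI thread can poll the future with wait_for(std::chrono::seconds(0)) and only call get() once it reports std::future_status::ready, so it never blocks on the load.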
I think the best solution is to load a small chunk or a single sample of wave data at a time during playback, using asynchronous I/O (as John Dibling mentioned), into a fixed-size playback buffer.
The strategy is to fill the playback buffer first and then play (this adds a small delay but guarantees continuous playback). While one buffer is playing, you refill another playback buffer on a different thread (overlapped). You need at least two playback buffers, one for playing and one for refilling in the background, and you switch between them in real time.
Later you can tune the playback buffer size to the client PC's performance (it is a trade-off between memory size and processing power; a faster CPU can get by with a smaller buffer and thus a lower delay).
You might want to consider a producer-consumer approach. This basically involves reading the sound data into a buffer using one thread, and streaming the data from the buffer to your sound card using another thread.
The data reader is the producer, and streaming the data to the sound card is the consumer. You need high-water and low-water marks so that, if the buffer gets full, the producer stops reading, and if the buffer gets low, the producer starts reading again.
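The high-water/low-water logic on its own, as a small sketch. WatermarkGate is a made-up name and the thresholds are whatever unit your buffer level is measured in (bytes, blocks, samples). It only decides whether the producer should currently be reading, and leaves the actual threading to you.

```cpp
#include <cassert>
#include <cstddef>

// Hysteresis between a low-water and a high-water mark.
class WatermarkGate {
public:
    WatermarkGate(std::size_t low, std::size_t high)
        : low_(low), high_(high), producing_(true)
    {
        assert(low < high);
    }

    // Call whenever the buffer fill level changes.
    // Returns true if the producer should be reading.
    bool update(std::size_t level)
    {
        if (level >= high_)
            producing_ = false; // full enough: stop reading
        else if (level <= low_)
            producing_ = true;  // running low: start reading again
        return producing_;      // in between: keep the previous state
    }

private:
    std::size_t low_;
    std::size_t high_;
    bool producing_;
};
```

The gap between the two marks is what stops the producer from switching on and off for every block the consumer takes out.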
A C++ Producer-Consumer Concurrency Template Library
http://www.bayimage.com/code/pcpaper.html
EDIT: I should add that this sort of thing is tricky. If you are building a sample player, the load on the system varies continuously as a function of which keys are being played, how many sounds are playing at once, how long the duration of each sound is, whether the sustain pedal is being pressed, and other factors such as hard disk speed and buffering, and amount of processor horsepower available. Some programming optimizations that you eventually employ will not be obvious at first glance.