I am reading a proprietary binary data file format. The format is basically header, data, size_of_previous_data, header, data, size_of_previous_data, header, data, size_of_previous_data, ...
The header includes the number of bytes in the next chunk of data, and that same size is also written immediately after the data. The header is 256 bytes, the data is typically ~2 MB, and the size_of_previous_data field is a 32-bit int.
The files are generally large (~GB), and I often have to search through tens of them for the data I want. To do this, the first thing my code does is index each file, i.e. read in just the headers and record the location of the associated data (file and byte offset). My code basically reads a header using fstream::read(), checks the data size, skips the data using fstream::seekg(), reads the size_of_previous_data, and repeats until it reaches the end of the file.
My problem is that this indexing is painfully slow. The data is on an internal 7200 rpm hard drive in my Windows 10 laptop, and Task Manager shows that my hard drive usage is maxed out, but I am only getting read speeds of about 1.5 MB/s with response times typically >70 ms. I am reading the file with a std::fstream, using fstream::read() for the headers and fstream::seekg() to move to the next header.
I have profiled my code and almost the entire time is spent in the fstream::read() call that reads the size_of_previous_data value. I presume that when I do this, the data immediately following it is buffered, so the fstream::read() that fetches the next header takes practically no time.
So I am wondering if there is a way to optimise this. Almost my entire buffer in any buffered read is likely to be wasted (97% of it, if it is an 8 kB buffer). Is there a way to shrink this buffer, and is it likely to be worth it (perhaps the underlying OS buffers in a way I cannot change anyway)?
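In outline, the indexing loop looks something like this (simplified: I'm assuming here that the data size sits in the first 4 bytes of the header, and the struct/function names are just for illustration):

#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

struct IndexEntry
{
    std::streamoff dataOffset;  // where the data block starts in the file
    std::uint32_t  dataSize;    // size of the data block, taken from the header
};

// Indexing loop as described above; the position of the size field inside
// the 256-byte header depends on the real (proprietary) format.
std::vector<IndexEntry> indexFile(const std::string& path)
{
    std::vector<IndexEntry> index;
    std::ifstream in(path, std::ios::binary);
    std::streamoff pos = 0;
    char header[256];
    while (in.read(header, sizeof(header)))
    {
        std::uint32_t dataSize = 0;
        std::memcpy(&dataSize, header, sizeof(dataSize));   // assumed field position
        index.push_back({ pos + std::streamoff(sizeof(header)), dataSize });
        in.seekg(dataSize, std::ios::cur);                  // skip the ~2 MB of data
        std::uint32_t sizeOfPreviousData = 0;               // trailing 32-bit size
        in.read(reinterpret_cast<char*>(&sizeOfPreviousData), sizeof(sizeOfPreviousData));
        pos += std::streamoff(sizeof(header)) + std::streamoff(dataSize) + 4;  // header + data + trailing size
    }
    return index;
}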
Assuming that a disk seek takes about 10 ms (from Latency Numbers Every Programmer Should Know) and that your file is 11 GB consisting of 2 MB chunks, the theoretical minimum running time is 5500 * 10 ms = 55 seconds.
If you're already in that order of magnitude, the most effective way of speeding this up might be to buy an SSD.
Related
I have a program which gets a stream of raw data from different cameras and writes it to disk. The program runs these sorts of recordings for ~2 minutes and then another program is used to process the frames.
Each raw frame is 2MB and the frame rate is 30fps (ie. data rate is around 60MB/s) and I'm writing to an SSD which can easily handle a sustained > 150MB/s (tested by copying 4000 2MB files from another disk which took 38 seconds and Process Explorer shows constant IO activity).
My issue is that occasionally calls to fopen(), fwrite() and fclose() stall for up to 5 seconds which means that 300MB of frames build up in memory as a back log, and after a few of these delays I hit the 4GB limit of a 32 bit process. (When the delay happens, Process Explorer shows a gap in IO activity)
There is a thread which runs a loop calling this function for every new frame which gets added to a queue:
void writeFrame(char* data, size_t dataSize, char* filepath)
{
    // Time block 2
    FILE* pFile = NULL;
    fopen_s(&pFile, filepath, "wb");   // MSVC-style; plain fopen() returns the FILE* instead
    // End Time block 2

    // Time block 3
    fwrite(data, 1, dataSize, pFile);
    // End Time block 3

    // Time block 4
    fclose(pFile);
    // End Time block 4
}
(There's error checking too in the actual code but it makes no difference to this issue)
I'm logging the time it takes for each of the blocks and the total time it takes to run the function and I get results which most of the time look like this: (times in ms)
TotalT,5, FOpenT,1, FWriteT,2, FCloseT,2
TotalT,4, FOpenT,1, FWriteT,1, FCloseT,2
TotalT,5, FOpenT,1, FWriteT,2, FCloseT,2
ie. ~5 ms to run the whole function, ~1 ms to open the file, ~2 ms to call write and ~2 ms to close the file.
Occasionally however (on average about 1 in every 50 frames, but sometimes thousands of frames can go by between occurrences), I get frames which take over 4000 ms:
TotalT,4032, FOpenT,4023, FWriteT,6, FCloseT,3
and
TotalT,1533, FOpenT,1, FWriteT,2, FCloseT,1530
All the frames are the same size, and it's never fwrite that takes the extra time, always fopen or fclose.
No other process is reading/writing to/from this SSD (confirmed with Process Monitor).
Does anyone know what could be causing this issue and/or any way of avoiding/mitigating this problem?
I'm going to side with X.J.: you're probably writing too many files to a single directory.
A solution could be to create a new directory for each batch of frames. Also consider calling SetEndOfFile directly after creating the file, as that will help Windows allocate sufficient space in a single operation.
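A rough sketch of that pre-allocation (error handling trimmed; this is just the standard CreateFile / SetFilePointerEx / SetEndOfFile / WriteFile sequence, not drop-in replacement code):

#include <windows.h>

// Create the frame file, extend it to its final size up front with
// SetEndOfFile, then write the frame data.
bool writeFramePreallocated(const char* data, size_t dataSize, const char* filepath)
{
    HANDLE h = CreateFileA(filepath, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER size;
    size.QuadPart = static_cast<LONGLONG>(dataSize);
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);   // move to the intended end of file
    SetEndOfFile(h);                               // let Windows allocate the space now

    LARGE_INTEGER zero = {};
    SetFilePointerEx(h, zero, NULL, FILE_BEGIN);   // back to the start for the real write

    DWORD written = 0;
    BOOL ok = WriteFile(h, data, static_cast<DWORD>(dataSize), &written, NULL);
    CloseHandle(h);
    return ok && written == dataSize;
}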
FAT isn't a real solution either, as it does even worse on large directories.
Prepare empty files (2 MB files filled with zeros) so that the space is already "ready", then just overwrite these files. Or create one file that holds a batch of several frames, so you can reduce the number of files.
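For example, something along these lines could pre-create the zero-filled files before recording starts (the 2 MB frame size and the naming scheme are assumptions):

#include <fstream>
#include <string>
#include <vector>

// Create numFrames zero-filled 2 MB files up front, so capture time only
// ever overwrites existing, already-allocated files.
void preallocateFrameFiles(const std::string& dir, int numFrames)
{
    const std::vector<char> zeros(2 * 1024 * 1024, 0);   // one empty 2 MB frame
    for (int i = 0; i < numFrames; ++i)
    {
        std::ofstream out(dir + "/frame_" + std::to_string(i) + ".raw",
                          std::ios::binary);
        out.write(zeros.data(), static_cast<std::streamsize>(zeros.size()));
    }
}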
There are libraries for compressing, decompressing and playing back video:
libTheora may be useful because it already compresses frames (though you will need to output the video as a single file) and does that pretty fast (lossy compression, by the way).
I am working on a project where we can have an input data stream at 100 Mbps.
My program can be left running overnight to capture this data, and so it generates a huge data file. My program logic, which interprets this data, is complex and can process only 1 Mb of data per second.
We also dump the bytes to a log file after they are processed. We do not want to lose any incoming data, and at the same time we want the program to work in real time, so we are maintaining a circular buffer which acts like a cache.
Right now the only way to keep incoming data from getting lost is to increase the size of this buffer.
Please suggest a better way to do this, and also what alternative ways of caching I could try.
Stream the input to a file. Really, there is no other choice. It comes in faster than you can process it.
You could create one file per second of input data. That way you can directly start processing old files while new files are still being streamed to disk.
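Something along these lines, for instance (the file naming and rollover logic are only a sketch):

#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>

// Roll over to a new output file every second, so the processing side can
// start on completed one-second files while capture continues.
class SecondRollingWriter
{
public:
    explicit SecondRollingWriter(std::string prefix) : prefix_(std::move(prefix)) {}

    void write(const char* data, std::size_t size)
    {
        const long long now = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count();
        if (now != currentSecond_)                 // a new second: start a new file
        {
            currentSecond_ = now;
            if (out_.is_open())
                out_.close();
            out_.open(prefix_ + "_" + std::to_string(currentSecond_) + ".bin",
                      std::ios::binary);
        }
        out_.write(data, static_cast<std::streamsize>(size));
    }

private:
    std::string   prefix_;
    std::ofstream out_;
    long long     currentSecond_ = -1;
};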
I'm writing a program to reconstruct TCP streams captured by Snort. Most of the examples I've read regarding session reconstruction either:
load the entire pcap file into memory to start with (not a solution because of hardware constraints and the fact that some of the capture files are 10 GB in size), or
cache each packet in memory as they read through the capture, discarding the irrelevant ones as they go; this presents basically the same problems as reading the entire file into memory
My current solution was to write my own pcap file parser, since the format is simple. I save the offsets of each packet in a vector and can reload each one after I've passed it. Like libpcap, this only streams one packet into memory at a time; I am only using sequence numbers and flags for ordering, NOT the packet data. Unlike libpcap, it is noticeably slower: processing a 570 MB capture with libpcap takes roughly 0.9 seconds, whereas my code takes 3.2 seconds. However, I have the advantage of being able to seek backwards without reloading the entire capture.
If I were to stick with libpcap for speed, I was thinking I could just make a currentOffset variable with an initial value of 24 (the size of the pcap global file header), push it to a vector every time I load a new packet, and increment it by the size of the packet + 16 (the size of the pcap record header) every time I call pcap_next_ex. Then, whenever I wanted to read an individual packet, I could seek to packetOffsets[packetNumber] and load it using conventional means.
Is there a better way to do this using libpcap?
Solved the problem myself.
Before I call pcap_next_ex, I push ftell(pcap_file(myPcap)) into a vector<unsigned long>. I manually parse the packets after that as needed.
EZPZ. It just took 24+ hours of racking my brain...
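In outline, the indexing pass looks like this (assuming the capture was already opened with pcap_open_offline; names are illustrative):

#include <pcap.h>
#include <cstdio>
#include <vector>

// Record the file offset of every pcap record before reading it, so that
// individual packets can later be re-read with fseek()/fread() as needed.
std::vector<long> indexPackets(pcap_t* myPcap)
{
    std::vector<long> packetOffsets;
    struct pcap_pkthdr* header = nullptr;
    const u_char* data = nullptr;

    FILE* f = pcap_file(myPcap);              // underlying FILE* of the offline capture
    while (true)
    {
        long offset = std::ftell(f);          // offset of the next pcap record
        int rc = pcap_next_ex(myPcap, &header, &data);
        if (rc != 1)                          // 1 = packet read OK, -2 = end of savefile
            break;
        packetOffsets.push_back(offset);
        // ... inspect sequence numbers / flags here for stream ordering ...
    }
    return packetOffsets;
}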
I am writing large files, in the range of 70 to 700 GB. Does anyone have experience with whether memory-mapped files would be more efficient than regular writing in chunks?
The code will be in C++ and run on Linux 2.6.
If you are writing the file from the beginning and onwards, there is nothing to be gained from memory mapping the file.
If you are writing the file in any other pattern, please update the question :)
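For reference, plain sequential writing in chunks is as simple as this (the 4 MB chunk size is an arbitrary choice, and fillChunk is a placeholder for whatever produces your data):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Write a large file sequentially in fixed-size chunks.
void writeSequentially(const char* path, long long totalBytes)
{
    const std::size_t chunkSize = 4 * 1024 * 1024;        // 4 MB per write, arbitrary
    std::vector<char> chunk(chunkSize);

    std::FILE* out = std::fopen(path, "wb");
    if (!out)
        return;

    for (long long written = 0; written < totalBytes; )
    {
        std::size_t n = static_cast<std::size_t>(std::min<long long>(
            static_cast<long long>(chunkSize), totalBytes - written));
        // fillChunk(chunk.data(), n);                    // placeholder data source
        std::fwrite(chunk.data(), 1, n, out);
        written += static_cast<long long>(n);
    }
    std::fclose(out);
}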
Typical sustained transfer speeds for consumer-grade hard drives are around 60 megabytes per second, with the sun shining, a stiff breeze at your back and the file system not too fragmented, so the disk drive head doesn't have to seek too often.
So a hard lower limit on the amount of time it takes to write 700 gigabytes is 700 * 1024 / 60 = 11947 seconds, or 3 hours and 20 minutes. No amount of buffering is going to fix that; any buffer will quickly be overwhelmed by the drastic mismatch between the disk write speed and the processor's ability to fill the fire hose. Start looking for a problem in your code or the disk drive state only when it takes a couple of times longer than that.
(This isn't my program, but I'll try to provide all the relevant information to the best of my knowledge.)
There is a program which reads binary files that are roughly 300MB in size, processes them and outputs some information. The program uses ifstream for file input and streams are correctly initialized and closed for each read.
The program has to read each file multiple times. Reading a file for the first time takes about 3 seconds, and each consecutive read takes about 0.1 seconds. If several files are processed, going back to the first file will still yield fast read speeds, but after some time re-reading a file becomes slow.
Additionally, if a file is copied to another location, the speed of the first read of the new file is roughly 0.1 seconds.
If you do the math, the speed of consecutive reads is roughly the advertised read speed of the hard drive.
All this looks like file locations are cached by either the OS or the hard drive, so that on consecutive reads you don't have to seek out file locations.
Does anyone know what exactly is causing the slowdown on the initial read, and if it can be prevented? Three seconds may not seem like a lot, but they add about 5 hours to the total time needed to correctly process every file.
Also, the program runs on Fedora 14 and Scientific Linux, with both OSes using their default file systems.
Any ideas would be appreciated.
Linux will try to keep a copy of the file in RAM to make the next read faster; I am guessing this is what is happening. The initial read actually comes off the disk, while subsequent reads are served from the file cache because the entire file has been copied to RAM.
The OS (Linux) has a disk cache. After you read the file once, it's in the cache.
My guess would be that maybe the first time it reads the file it takes longer because it loads some information into the cache?
After the first time, it just uses some of the information in the cache.
Yes, the data gets cached. You might force that caching with the readahead syscall (or simply by having another process read the file first). If you are using mmap, you could also use madvise.
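For example, on Linux (readahead is a Linux-specific syscall; posix_fadvise with POSIX_FADV_WILLNEED is the portable alternative):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE                              // readahead() is a GNU extension
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Ask the kernel to pull the whole file into the page cache before the
// first real read, so the initial pass is no slower than later ones.
static void prefetchFile(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;

    struct stat st;
    if (fstat(fd, &st) == 0)
    {
        readahead(fd, 0, st.st_size);                             // populate the page cache
        // posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED); // portable alternative
    }
    close(fd);
}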