Writing a single large data file, or multiple smaller files: Which is faster? - c++

I am developing a C++ program that writes a large amount of data to disk. The following function gzips the data and writes it out to a file; the compressed data is on the order of 100 GB.
void constructSNVFastqData(string const& fname) {
  ofstream fastq_gz(fname.c_str());
  stringstream ss;
  for (int64_t i = 0; i < snvId->size(); i++) {
    consensus_pair &cns_pair = snvId->getPair(i);
    string qual(cns_pair.non_mutated.size(), '!');
    ss << "#" + cns_pair.mutated + "[" + to_string(cns_pair.left_ohang) +
          ";" + to_string(cns_pair.right_ohang) + "]\n"
          + cns_pair.non_mutated + "\n+\n" + qual + "\n";
  }
  boost::iostreams::filtering_streambuf<boost::iostreams::input> out;
  out.push(boost::iostreams::gzip_compressor());
  out.push(ss);
  boost::iostreams::copy(out, fastq_gz);
  fastq_gz.close();
}
The function writes data to a string stream, which I then
write out to a file (fastq_gz) using boost's filtering_streambuf.
The file is not a log file. After the file has been written
it will be read by a child process. The file does not need to be viewed
by humans.
Currently, I am writing the data out to a single, large file (fastq_gz). This is taking a while, and the file system - according to our system manager - is very busy. I wonder whether, instead of writing out a single large file, I should write out a number of smaller files. Would that approach be faster, or reduce the load on the file system?
Please note that it is not the compression that is slow - I have benchmarked it.
I am running on a Linux system and do not need to consider generalising the implementation to a Windows filesystem.

So what your code is probably doing is (a) generating your file into memory swap space, (b) loading from swap space and compressing on the fly, (c) writing compressed data as you get it to the outfile.
(b) and (c) are great; (a) is going to kill you. It is two round trips of the uncompressed data, one of which happens while competing with your output file generation.
I cannot find one in Boost.Iostreams, but what you need is an istream (a source), i.e. a device that pulls data from you on demand. Someone must have written one (it seems so useful), but I didn't spot it in five minutes of looking at the Boost.Iostreams docs.
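For illustration, here is a minimal sketch of such a device, assuming the Boost.Iostreams Source concept (a class with a read(char*, streamsize) member) and reusing snvId, consensus_pair and getPair() from the question; the class name FastqSource and the record-at-a-time buffering are made up:
#include <boost/iostreams/categories.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>

// A Source that produces one record at a time, so the 100 GB of uncompressed
// text never has to sit in memory all at once.
class FastqSource {
public:
    typedef char char_type;
    typedef boost::iostreams::source_tag category;

    std::streamsize read(char* s, std::streamsize n) {
        // Refill the small staging buffer from the next record when exhausted.
        while (pos_ == buffer_.size() && i_ < snvId->size()) {
            consensus_pair& p = snvId->getPair(i_++);
            std::string qual(p.non_mutated.size(), '!');
            buffer_ = "#" + p.mutated + "[" + std::to_string(p.left_ohang) + ";"
                    + std::to_string(p.right_ohang) + "]\n"
                    + p.non_mutated + "\n+\n" + qual + "\n";
            pos_ = 0;
        }
        if (pos_ == buffer_.size()) return -1;   // no more records: signal EOF
        std::streamsize count =
            std::min<std::streamsize>(n, buffer_.size() - pos_);
        std::memcpy(s, buffer_.data() + pos_, count);
        pos_ += count;
        return count;
    }

private:
    std::string buffer_;
    std::size_t pos_ = 0;
    int64_t     i_   = 0;
};

void constructSNVFastqData(std::string const& fname) {
    std::ofstream fastq_gz(fname.c_str(), std::ios::binary);
    boost::iostreams::filtering_streambuf<boost::iostreams::input> out;
    out.push(boost::iostreams::gzip_compressor());
    out.push(FastqSource());                     // compressor pulls data on demand
    boost::iostreams::copy(out, fastq_gz);
}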

0.) Devise an algorithm to divide the data into multiple files so that it can be recombined later.
1.) Write the data to multiple files on separate threads in the background, perhaps starting around n = 10 threads at a time (see the sketch after this list).
2.) Query the futures of the shared writer objects to check whether writing is done (i.e. the file has reached its target size, e.g. 1 GB).
3.) Once that is the case, recombine the data when it is queried by the child process.
4.) I would recommend starting a new file after every 1 GB of data.
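A rough sketch of the background-writer idea (not the poster's exact design), assuming C++11 std::async; makeChunk() and the ".part" file naming are placeholders for however the data actually gets divided:
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

std::string makeChunk(std::size_t index);   // placeholder: produce the i-th ~1 GB chunk

void writeChunksInBackground(const std::string& prefix, std::size_t nChunks) {
    std::vector<std::future<void>> pending;
    for (std::size_t i = 0; i < nChunks; ++i) {
        pending.push_back(std::async(std::launch::async, [prefix, i] {
            std::ofstream out(prefix + "." + std::to_string(i) + ".part", std::ios::binary);
            std::string chunk = makeChunk(i);
            out.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        }));
        if (pending.size() >= 10) {             // keep roughly 10 writers in flight
            pending.front().get();              // wait for the oldest writer to finish
            pending.erase(pending.begin());
        }
    }
    for (auto& f : pending) f.get();            // wait for the remaining writers
}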

Related

C++ / Fast random access skipping in a large file

I have large files, containing a small number of large datasets. Each dataset contains a name and the dataset size in bytes, allowing me to skip it and go to the next dataset.
I want to build an index of dataset names very quickly. An example file is about 21 MB and contains 88 datasets. Reading the 88 names by using a std::ifstream and seekg() to skip between datasets takes about 1300 ms, which I would like to reduce.
So in fact, I'm reading 88 chunks of about 30 bytes each, at given positions in a 21 MB file, and it takes 1300 ms.
Is there a way to improve this, or is it an OS and filesystem limitation? I'm running the test under Windows 7 64bit.
I know that having a complete index at the beginning of the file would be better, but the file format does not have this, and we can't change it.
You could use a memory-mapped file interface (I recommend Boost's implementation).
This maps the file into your application's virtual address space, giving quicker lookup times without an explicit read from disk for every access.
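A minimal sketch with boost::iostreams::mapped_file_source, assuming a record layout of a 4-byte name length, the name bytes, then an 8-byte dataset size; adapt the parsing to the real format:
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

std::vector<std::string> indexDatasets(const std::string& path) {
    boost::iostreams::mapped_file_source file(path);       // whole file mapped read-only
    const char* data = file.data();
    std::size_t n    = file.size();

    std::vector<std::string> names;
    std::size_t pos = 0;
    while (pos < n) {
        std::uint32_t nameLen;                              // assumed: 4-byte name length
        std::memcpy(&nameLen, data + pos, sizeof nameLen);  pos += sizeof nameLen;
        names.emplace_back(data + pos, nameLen);            pos += nameLen;
        std::uint64_t dsSize;                               // assumed: 8-byte dataset size
        std::memcpy(&dsSize, data + pos, sizeof dsSize);    pos += sizeof dsSize;
        pos += dsSize;                                      // skip over the dataset body
    }
    return names;
}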
You could scan the file once and build your own header, with each dataset's key and its offset, in a separate file. Depending on your use case you can do this once at program start, and again every time the file changes.
Before accessing the big data, a lookup in the smaller file gives you the needed index; a sketch follows.
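A sketch of building such a sidecar index, assuming a hypothetical readDatasetHeader() that parses one dataset header from the real format:
#include <cstdint>
#include <fstream>
#include <string>

struct Header { std::string name; std::uint64_t size; };
Header readDatasetHeader(std::istream& in);            // placeholder for the real header parse

void buildIndex(const std::string& dataPath, const std::string& indexPath) {
    std::ifstream in(dataPath, std::ios::binary);
    std::ofstream idx(indexPath);
    while (in) {
        std::uint64_t offset = static_cast<std::uint64_t>(in.tellg());
        Header h = readDatasetHeader(in);
        if (!in) break;
        idx << h.name << ' ' << offset << '\n';        // one "name offset" line per dataset
        in.seekg(static_cast<std::streamoff>(h.size), std::ios::cur);  // skip the payload
    }
}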
You may be able to do a buffer-queuing process with multithreading. You could create a custom struct that would store various amounts of data.
You said:
Each dataset contains a name and the dataset size in bytes, allowing me to skip it and go to the next dataset.
So, as opening and closing the file over and over again is slow, you could read the file in one go, store it into a full buffer object and then parse it, or store it in batches. This also depends on whether you are reading in text or binary mode and on how easy the file is to parse. I'll demonstrate the latter, populating multiple batches while reading in a buffered amount of data from the file.
Pseudo Code
struct Batch {
    std::string name;              // Name of the dataset
    unsigned size;                 // Size of the dataset
    unsigned indexOffset;          // Index of the next read location
    bool empty = true;             // Flag telling whether this batch is full or empty
    std::vector<DataType> dataset; // Container of data
};

std::vector<Batch> finishedBatches;

// This doesn't depend on the size of the dataset; it is just a buffer size for
// how much memory to digest at a time when reading the file.
const unsigned bufferSize = /* set to your preference: 1 MB - 4 MB etc. */;

void loadDataFromFile( const std::string& filename, unsigned bufferSize, std::vector<Batch>& batches ) {
    // Set the ifstream's buffer size.
    // Open the file for reading and read in up to bufferSize bytes.
    // Spawn a different thread to populate the batches, and while that batch is loading
    // read in that much buffer data again. You will need a couple of local
    // stack batches to work with, so that when a batch is complete and you have reached
    // the next index location in the file you can fill another batch.
    // When a single batch is complete, push it into the vector that stores finished batches.
    // Change its flag and clear its vector, and you can then use that empty batch again.
    // Continue this until you reach the end of the file.
}
This would be a two-threaded system: the main thread opens, reads from and seeks in the file, while a worker thread fills the batches, pushes completed batches into the container, and swaps to the next empty batch. A rough sketch of such a producer/consumer arrangement follows.
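For what it's worth, a condensed sketch of that scheme with a mutex/condition-variable queue; Batch is trimmed down and parseBlockIntoBatches() stands in for the real record parsing:
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Batch { std::string name; unsigned size = 0; std::vector<char> dataset; };
void parseBlockIntoBatches(const std::vector<char>& block, std::vector<Batch>& out);  // placeholder

void loadDataFromFile(const std::string& filename, std::size_t bufferSize,
                      std::vector<Batch>& batches) {
    std::queue<std::vector<char>> pending;   // raw blocks handed from reader to parser
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread worker([&] {                 // worker: turn queued blocks into batches
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !pending.empty() || done; });
            if (pending.empty() && done) return;
            std::vector<char> block = std::move(pending.front());
            pending.pop();
            lk.unlock();
            parseBlockIntoBatches(block, batches);
        }
    });

    std::ifstream in(filename, std::ios::binary);
    std::vector<char> block(bufferSize);
    while (in) {                             // main thread: read fixed-size blocks
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::streamsize got = in.gcount();
        if (got <= 0) break;
        {
            std::lock_guard<std::mutex> lk(m);
            pending.emplace(block.begin(), block.begin() + got);
        }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_one();
    worker.join();                           // batches is safe to use after this point
}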

Fast and efficient way of writing an array of structs to a text file

I have a binary file. I am reading a block of data from that file into an array of structs using fread. My struct looks like the one below.
struct Num {
    uint64_t key;
    uint64_t val;
};
My main goal is to write the array into a different text file with space separated key and value pairs in each line as shown below.
Key1 Val1
Key2 Val2
Key3 Val3
I have written a simple function to do this.
Num *buffer = new Num[buffer_size];
// Read a block of data from the binary file into the buffer array.
ofstream out_file(OUT_FILE, ios::out);
for (size_t i = 0; i < buffer_size; i++)
    out_file << buffer[i].key << ' ' << buffer[i].val << '\n';
The code works. But it's slow. One more approach would be to create the entire string first and write to file only once at the end.
But I want to know if there are any best ways to do this. I found some info about ostream_iterator. But I am not sure how it works.
The most efficient method to write structures to a file is to write as many as you can in the fewest transactions.
Usually this means using an array and writing entire array with one transaction.
The file is a stream device and is most efficient when data is continuously flowing in the stream. This can be as simple as writing the array in one call, or as involved as using threads. You will save more time by performing block or burst I/O than by worrying about which function call to use.
Also, in my own programs, I have observed that placing formatted text into a buffer (array) and then block-writing the buffer is faster than using a function that writes formatted text directly to the file. When formatting on the fly, the data stream may pause during the formatting; when writing already-formatted data from a buffer, the flow of data through the stream stays continuous.
There are other factors involved in writing to a file, such as allocating space on the media, other tasks running on your system and any sharing of the file media.
By using the above techniques, I was able to write GBs of data in minutes instead of the previous duration of hours.
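As a concrete illustration of the buffered-formatting idea (the buffer size and snprintf formatting are arbitrary choices; Num is the struct from the question):
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

struct Num { uint64_t key; uint64_t val; };

void writePairs(const Num* buffer, std::size_t count, const char* path) {
    std::ofstream out(path, std::ios::binary);
    std::vector<char> block;
    block.reserve(1 << 20);                   // collect ~1 MB of formatted text per flush

    char line[64];
    for (std::size_t i = 0; i < count; ++i) {
        int len = std::snprintf(line, sizeof line, "%llu %llu\n",
                                static_cast<unsigned long long>(buffer[i].key),
                                static_cast<unsigned long long>(buffer[i].val));
        block.insert(block.end(), line, line + len);
        if (block.size() > (1 << 20) - 64) {  // flush the block before it would overflow
            out.write(block.data(), static_cast<std::streamsize>(block.size()));
            block.clear();
        }
    }
    out.write(block.data(), static_cast<std::streamsize>(block.size()));  // final partial block
}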

Fortran unformatted output with each MPI process writing part of an array

In my parallel program, there was a big matrix. Each process computed and stored a part of it. Then the program wrote the matrix to a file by letting each process write its own part of the matrix in the correct order. The output file is in "unformatted" form. But when I tried to read the file in a serial code (with the correct size of the big matrix allocated), I got an error which I don't understand.
My question is: in an MPI program, how do you get a binary file as the serial version output for a big matrix which is stored by different processes?
Here is my attempt:
if(ThisProcs == RootProcs) then
  open(unit = file_restart%unit, file = file_restart%file, form = 'unformatted')
  write(file_restart%unit) psi
  close(file_restart%unit)
endif
#ifdef USEMPI
call mpi_barrier(mpi_comm_world, MPIerr)
#endif
do i = 1, NProcs - 1
  if(ThisProcs == i) then
    open(unit = file_restart%unit, file = file_restart%file, form = 'unformatted', &
         status = 'old', position = 'append')
    write(file_restart%unit) psi
    close(file_restart%unit)
  endif
#ifdef USEMPI
  call mpi_barrier(mpi_comm_world, MPIerr)
#endif
enddo
Psi is the big matrix, it is allocated as:
Psi(N_lattice_points, NPsiStart:NPsiEnd)
But when I tried to load the file in a serial code:
open(2,file=File1,form="unformatted")
read(2)psi
forrtl: severe (67): input statement requires too much data, unit 2 (I am using MSVS 2012+intel fortran 2013)
How can I fix the parallel part to make the binary file readable for the serial code? Of course one can combine them into one big matrix in the MPI program, but is there an easier way?
Edit 1
The two answers are really nice. I'll use access = "stream" to solve my problem. And I just figured I can use inquire to check whether the file is "sequential" or "stream".
This isn't a problem specific to MPI, but would also happen in a serial program which took the same approach of writing out chunks piecemeal.
Ignore the opening and closing for each process and look at the overall connection and transfer statements. Your connection is an unformatted file using sequential access. It's unformatted because you explicitly asked for that, and sequential because you didn't ask for anything else.
Sequential file access is based on records. Each of your write statements transfers out a record consisting of a chunk of the matrix. Conversely, your input statement attempts to read from a single record.
Your problem is that while you try to read the entire matrix from the first record of the file, that record doesn't contain the whole matrix. It doesn't contain anything like the correct amount of data. End result: "input statement requires too much data".
So, you need to either read in the data based on the same record structure, or move away from record files.
The latter is simple: use stream access
open(unit = file_restart%unit, file = file_restart%file, &
form = 'unformatted', access='stream')
Alternatively, read with a similar loop structure:
do i=1, NPROCS
! read statement with a slice
end do
This of course requires understanding the correct slicing.
Alternatively, one can consider using MPI-IO for output, which is very similar to using stream output. Read this back in with stream access. You can find about this concept elsewhere on SO.
Fortran unformatted sequential writes in record files are not quite completely raw data. Each write will have data before and after the record in a processor dependent form. The size of your reads cannot exceed the record size of your writes. This means if psi is written in two writes, you will need to read it back in two reads, you cannot read it in at once.
Perhaps the most straightforward option is to instead use stream access instead of sequential. A stream file is indexed by bytes (generally) and does not contain record start and end information. Using this access method you can split the write but read all at once. Stream access is a feature of Fortran 2003.
If you stick with sequential access, you'll need to know how many MPI ranks wrote the file and loop over properly sized records to read the data as it was written. You could make the user specify the number of ranks or store that as the first record in the file and read that first to determine how to read the rest of the data.
If you are writing MPI, why not MPI-IO? Each process will call MPI_File_set_view to set a subarray view of the file, then each process can collectively write the data with MPI_FILE_WRITE_ALL . This approach is likely to scale really well on big machines (though your approach will be fine up to oh, maybe 100 processors.)
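For illustration, a rough sketch of that MPI-IO pattern using the C API (the Fortran binding is analogous); it assumes each rank owns a contiguous block of the matrix, as with Psi(N_lattice_points, NPsiStart:NPsiEnd) in the question:
#include <mpi.h>
#include <vector>

// Each rank writes its slice at the right byte offset, so the file ends up as
// one raw (stream-like) dump that a serial code can read back in a single read.
void writeMatrixSlice(const std::vector<double>& mySlice,
                      MPI_Offset myOffsetInElements, const char* path) {
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Displacement is in bytes; etype and filetype are both MPI_DOUBLE because
    // each rank's slice is contiguous in the file.
    MPI_File_set_view(fh, myOffsetInElements * static_cast<MPI_Offset>(sizeof(double)),
                      MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    MPI_File_write_all(fh, mySlice.data(), static_cast<int>(mySlice.size()),
                       MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}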

Fastest and efficient way of parsing raw data from file

I'm working on a project and I'm wondering which way is the most efficient to read a huge amount of data from a file (I'm speaking of files from 100 lines up to approximately 3 billion lines, possibly more). Once read, the data will be stored in a structured data set (vector<entry>, where "entry" defines a structured line).
A structured line of this file may look like :
string int int int string string
which also ends with the appropriate platform EOL and is TAB delimited
What I wish to accomplish is :
Read file into memory (string) or vector<char>
Read raw data from my buffer and format it into my data set.
I need to consider memory footprint and have a fast parsing rate.
I'm already avoiding usage of stringstream, as it seems too slow.
I'm also avoiding multiple I/O call to my file by using :
// open the stream
std::ifstream is(filename);
// determine the file length
is.seekg(0, ios_base::end);
std::size_t size = is.tellg();
is.seekg(0, std::ios_base::beg);
// "out" can be a std::string or vector<char>
out.reserve(size / sizeof (char));
out.resize(size / sizeof (char), 0);
// load the data
is.read((char *) &out[0], size);
// close the file
is.close();
I've thought of taking this huge std::string and then looping through it line by line, extracting the line information (string and integer parts) into a data set row. Is there a better way of doing this?
EDIT : This application may run on a 32bit, 64bit computer, or on a super computer for bigger files.
Any suggestions are very welcome.
Thank you
Some random thoughts:
Use vector::resize() at the beginning (you did that)
Read large blocks of file data at a time, at least 4k, better still 256k. Read them into a memory buffer and parse that buffer into your vector (see the sketch after this list).
Don't read the whole file at once, this might needlessly lead to swapping.
sizeof(char) is always 1 :)
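To make the block-reading suggestion concrete, here is a sketch assuming a hypothetical parseLine() for the tab-separated "string int int int string string" records:
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct entry { std::string a; int x, y, z; std::string b, c; };  // "string int int int string string"
entry parseLine(const std::string& line);                        // placeholder for field extraction

void loadEntries(const std::string& filename, std::vector<entry>& out) {
    std::ifstream is(filename, std::ios::binary);
    std::vector<char> block(256 * 1024);   // 256k block buffer
    std::string carry;                     // partial line left over from the previous block

    while (is) {
        is.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::streamsize got = is.gcount();
        if (got <= 0) break;

        carry.append(block.data(), static_cast<std::size_t>(got));
        std::size_t start = 0, nl;
        while ((nl = carry.find('\n', start)) != std::string::npos) {
            out.push_back(parseLine(carry.substr(start, nl - start)));
            start = nl + 1;
        }
        carry.erase(0, start);             // keep the incomplete tail for the next block
    }
    if (!carry.empty()) out.push_back(parseLine(carry));  // last line without trailing newline
}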
While I cannot speak for supercomputers, with 3 billion lines you will get nowhere holding everything in memory on a desktop machine.
I think you should first try to figure out all the operations on that data. You should try to design all algorithms to operate sequentially; if you need random access you will be swapping all the time. This algorithm design will have a big impact on your data model.
So do not start with reading all the data just because that is the easy part; design the whole system with a clear view of what data is in memory during the whole processing.
Update
When you do all the processing in a single run over the stream and separate the data processing into stages (read - preprocess - ... - write), you can utilise multithreading effectively.
Finally
Whatever you want to do in a loop over the data, try to keep the number of loops to a minimum. Averaging, for example, you can certainly do in the read loop.
Immediately make up a test file of the size you expect to be the worst case, and time two different approaches:
time
  loop
    read line from disk
time
  loop
    process line (counting words per line)
time
  loop
    write data (word count) from line to disk
time

versus:

time
  loop
    read line from disk
    process line (counting words per line)
    write data (word count) from line to disk
time
If you already have the algorithms, use yours; otherwise make one up (like counting words per line). If the write stage does not apply to your problem, skip it. This test takes less than an hour to write but can save you a lot of time.
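A tiny timing harness along those lines, using std::chrono; the stage functions in the usage comment are hypothetical:
#include <chrono>
#include <iostream>

template <class F>
double timeSeconds(F&& stage) {            // run one stage and return its wall-clock time
    auto t0 = std::chrono::steady_clock::now();
    stage();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Usage (hypothetical stage functions):
//   double staged = timeSeconds(readAll) + timeSeconds(processAll) + timeSeconds(writeAll);
//   double fused  = timeSeconds(readProcessWriteInOnePass);
//   std::cout << staged << " s staged vs " << fused << " s fused\n";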

Delaying file data for several minutes

On my machine, I have a file which is regenerated by an application every second - it contains different data each time, as it's based on some realtime data.
I would like to have a copy of this file, which would contain what the original file contained 5 minutes ago. Is this somehow easily achievable? I would be happy to do this using some BASH scripting magic, but adding some wise, memory efficient code to that original application (written in c++) would also satisfy me :)
You tagged your question with both linux and unix. This answer only applies to Linux.
You may be able to use inotify-tools (inotifywait man page) or incron (incrontab(5) man page) to watch the directory and make copies of the files as they are closed.
If disk space isn't an issue, you could make the program create a new file every second instead of writing to the same file. You would need a total of 300 files (5 min * 60 sec/min). The file name to write to would be $somename + timestamp() % 300. That way, to get the file 5 minutes ago, you would just access the file $somename + (timestamp()+1) % 300.
In order to achieve that, you need the space to hold each of the 300 (5*60) files. Since you indicate that the files are only about 50K in size, this is doable in 15 MB of memory (if you don't want to clutter your filesystem).
It should be as simple as: (something like)
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Ring buffer of the last 300 one-second snapshots; error checking omitted. */
static struct { char* buf; size_t size; } hist[300]; /* zero-initialized */

int main(void) {
    int n = 0;
    struct stat st;
    for (;; sleep(1)) {
        int ifd = open("file", O_RDONLY);
        int ofd = open("file-lag", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        fstat(ifd, &st);
        hist[n].size = st.st_size;
        if (hist[n].buf)
            free(hist[n].buf);
        hist[n].buf = malloc(hist[n].size);
        read(ifd, hist[n].buf, hist[n].size);  /* remember the current contents */
        n = (n + 1) % 300;
        if (hist[n].buf)                       /* slot filled 300 s (5 minutes) ago */
            write(ofd, hist[n].buf, hist[n].size);
        close(ofd);
        close(ifd);
    }
}