I have a relatively simple question. There has been an ongoing debate across many programming languages about which method provides the fastest file read, mostly between read() and mmap(). As someone who has participated in these debates, I failed to find an answer to my current problem, because most answers address the situation where the file to read is huge (for example, how to read a 10 TB text file).
But my problem is a bit different: I have lots of files, let's say 100 million. I want to read the first one or two lines from each of them. Whether a file is 10 KB or 100 TB is irrelevant; I just want the first one or two lines from every file. So I want to avoid reading or buffering the unnecessary parts of the files. My knowledge was not enough to thoroughly test which method is faster, or to discover all my options in the first place.
What I am doing right now (multithreaded for the moment):
std::string line;
for (const auto& p : std::filesystem::recursive_directory_iterator(path)) {
    if (!std::filesystem::is_directory(p)) {
        std::ifstream read_file(p.path().string());
        if (read_file.is_open()) {
            while (getline(read_file, line)) {
                // Get two lines here.
            }
        }
    }
}
What do C++ or the Linux environment provide me in this situation? Is there a faster or more efficient way to read small portions of millions of files?
Thank you for your time.
Info: I have access to C++20 and Ubuntu 18.04
You can save one underlying call to fstat by not testing whether the path is a directory, and relying on the is_open test instead:
#include <iostream>
#include <fstream>
#include <filesystem>
#include <string>

int main()
{
    std::string line, path = ".";
    for (const auto& p : std::filesystem::recursive_directory_iterator(path)) {
        std::ifstream read_file(p.path().string());
        if (read_file.is_open()) {
            std::cout << "opened: " << p.path().string() << '\n';
            while (getline(read_file, line)) {
                // Get two lines here.
            }
        }
    }
}
At least on Windows this code skips directories. And, as suggested in the comments, the is_open test can even be skipped, since getline doesn't read anything from a directory either.
Not the cleanest, but if it can save time it's worth it.
Any function a program uses to access a file under Linux will result in calling some "system calls" (for example read()).
All other functions available in a programming language (like fread(), fgets(), std::filesystem ...) call functions or methods that in turn call some system calls.
For this reason, you can't be faster than calling the system calls directly.
I'm not 100% sure, but I think in most cases the combination open(), read(), close() will be the fastest method for reading data from the start of a file.
(If the data is not located at the start of the file, pread() might be faster than read(); I'm not sure.)
Note that read() does not read a certain number of lines but a certain number of bytes (e.g. into an array of char), so you have to find the end(s) of the line(s) "manually" by searching for the '\n' character(s) and/or the end of file in that array of char.
Unfortunately, a line may be much longer than you expect, so the first N bytes of the file may not contain the first M lines, and you have to call read() again.
In this case it depends on your system (e.g. the file system or even the hard disks) how many bytes you should read in each call to read() to get the maximum performance.
Example: let's say that in 75% of all files, the first N lines are found in the first 512 bytes of the file; in the other 25% of all files, the first N lines are longer than 512 bytes in total.
On some computers, reading 1024 bytes at once might take nearly the same time as reading 512 bytes, but reading 512 bytes twice will be much slower than reading 1024 bytes at once; on such computers it makes sense to read() 1024 bytes at once: you save a lot of time for 25% of the files and lose only very little time for the other 75%.
On other computers, reading 512 bytes is significantly faster than reading 1024 bytes; on such computers it would be better to read() 512 bytes: reading 1024 bytes would save you only a little time when processing the 25% of files but cost you a lot of time when processing the other 75%.
I think in most cases this "optimal value" will be a multiple of 512 bytes, because most modern file systems organize files in units that are a multiple of 512 bytes.
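A minimal sketch of the open()/read()/close() approach (Linux-specific; read_first_two_lines is a hypothetical helper, error handling is abbreviated, and the single-read size is exactly the tunable discussed above):

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Hypothetical helper: read the start of `path` once and report how many
// bytes of `buf` belong to the first two lines (-1 on error). If no second
// newline was found within this chunk, the caller must read() again,
// exactly as discussed above.
ssize_t read_first_two_lines(const char* path, char* buf, size_t bufsize)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    ssize_t n = read(fd, buf, bufsize);   // one read of e.g. 512 or 1024 bytes
    close(fd);
    if (n <= 0) return n;

    // Find the ends of the lines "manually" by searching for '\n'.
    const char* end = buf + n;
    const char* p = static_cast<const char*>(std::memchr(buf, '\n', n));
    if (p && p + 1 < end) {
        const char* q = static_cast<const char*>(
            std::memchr(p + 1, '\n', end - (p + 1)));
        if (q) return (q + 1) - buf;      // both lines complete within this chunk
    }
    return n;                             // second line may continue past this chunk
}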
I was just typing something similar to Martin Rosenau's answer (when his popped up): an unstructured read of the maximum length of two lines. But I would go further: queue that text buffer together with the corresponding file name and let another thread parse/analyze it. If parsing takes about the same time as reading, you can save half of the time. If it takes longer (unlikely), you can use multiple threads and save even more.
Side note: you should not parallelize the reading (I tried that).
It may be worth experimenting: can you open one file and read it asynchronously while proceeding to open the next one? I don't know if any OS can overlap those things.
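A minimal sketch of that hand-off, assuming one std::thread reader and one parser connected by a mutex-guarded queue (the path list and the 1024-byte guess at "maximum length of two lines" are placeholders):

#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// One queued unit of work: a file name plus its first chunk of bytes.
struct Chunk { std::string file, bytes; };

int main() {
    std::vector<std::string> paths = {/* fill from the directory iteration */};

    std::queue<Chunk> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Reader: one short read per file, then hand the bytes off.
    std::thread reader([&] {
        for (const auto& path : paths) {
            std::ifstream in(path, std::ios::binary);
            std::string buf(1024, '\0');          // usually enough for two lines
            in.read(&buf[0], buf.size());
            buf.resize(in.gcount());
            { std::lock_guard<std::mutex> lk(m); q.push({path, std::move(buf)}); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    // Parser: pull chunks and extract the first two lines from each.
    std::thread parser([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty()) break;                 // done and nothing left
            Chunk c = std::move(q.front());
            q.pop();
            lk.unlock();
            auto first = c.bytes.find('\n');
            auto second = first == std::string::npos
                              ? first : c.bytes.find('\n', first + 1);
            std::cout << c.file << ": " << c.bytes.substr(0, second) << '\n';
        }
    });

    reader.join();
    parser.join();
}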
Related
I have to write 5 to 6 large arrays of the same size to text files in C++. I am using only one for loop for writing all the arrays to the text files. The following is the code I'm using.
ofstream kx_ecl("COARSE_PERMX.GRDECL");
ofstream ky_ecl("COARSE_PERMY.GRDECL");
ofstream kz_ecl("COARSE_PERMZ.GRDECL");
ofstream ntg_c("COARSE_NTG.GRDECL");
ofstream phi_c("COARSE_PORO.GRDECL");
ofstream multbv("MULTBRV.GRDECL");
kx_ecl<<"PERMX"<<endl;
ky_ecl<<"PERMY"<<endl;
kz_ecl<<"PERMZ"<<endl;
ntg_c<<"NTG"<<endl;
phi_c<<"PORO"<<endl;
multbv<<"MULTBRV"<<endl;
for (i = 0; i < N; i++) {
    kx_ecl << new_KX[i] << endl;
    ky_ecl << new_KY[i] << endl;
    kz_ecl << new_KZ[i] << endl;
    ntg_c << new_NTG[i] << endl;
    phi_c << new_por[i] << endl;
    multbv << MULTBRV[i] << endl;
}
kx_ecl<<"/";
ky_ecl<<"/";
kz_ecl<<"/";
ntg_c<<"/";
phi_c<<"/";
multbv<<"/";
kx_ecl.close();
ky_ecl.close();
kz_ecl.close();
ntg_c.close();
phi_c.close();
multbv.close();
My question here is: is it faster to output each of them individually, one by one (in which case I would have many for loops), or to leave it as it is? Also, is there any way to write the arrays to a file without using a for loop? I am asking because the output to the files takes a lot of time compared to the actual calculation in my code.
Are you aware that std::endl flushes your ofstreams constantly? You could output '\n' characters instead and just flush each stream once at the end.
I/O might still end up being the biggest piece of your time spent, but as written you are working against the buffering and do not seem, in this example at least, to be gaining anything from it. Going one array/file at a time might give you infinitesimally faster performance due to better memory locality (unless you're using so much memory that you're swapping pages to disk), but the difference probably gets completely lost in the huge cost of the file writing.
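For one of the six streams, the pattern would look like this (a sketch; the array here is a stand-in for the real data):

#include <fstream>

int main() {
    const int N = 1000;
    double new_KX[N] = {};                   // stand-in for the real array

    std::ofstream kx_ecl("COARSE_PERMX.GRDECL");
    kx_ecl << "PERMX" << '\n';
    for (int i = 0; i < N; i++)
        kx_ecl << new_KX[i] << '\n';         // '\n' does not flush; std::endl does
    kx_ecl << "/" << std::flush;             // one explicit flush at the end
}   // the destructor closes (and flushes) the file anyway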
I'd like to read and process (e.g. print) the entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 characters, and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok().
#include <stdio.h>
#include <string.h>

int main() {
    int i;
    char ch, buf[256];
    FILE *fp = fopen("test.csv", "r");
    for (;;) {
        for (i = 0; ; i++) {
            ch = fgetc(fp);
            if (ch == ',') {
                buf[i] = '\0';
                puts(buf);
                break;
            } else if (ch == '\n') {
                buf[i] = '\0';
                puts(buf);
                fclose(fp);
                return 0;
            } else buf[i] = ch;
        }
    }
}
Is this method as efficient and correct as possible?
What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF, feof(), ferror(), etc.).
Can I perform the same task using C++ file I/O with no loss of efficiency?
What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".
That having been said, there are a few things you could try:
Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a better-optimized string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, and so on. For more complicated cases, you might consider a full regular-expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.
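A sketch combining the last two suggestions, assuming the whole file (or a large chunk) is already in a buffer; print_first_row is a hypothetical helper that writes each field of the first line straight from the input buffer:

#include <cstdio>
#include <cstring>

// Print the comma-separated fields of the first line of `buf`,
// one per line, without copying them into a temporary buffer.
void print_first_row(const char* buf, size_t len)
{
    const char* end = static_cast<const char*>(std::memchr(buf, '\n', len));
    if (!end) end = buf + len;             // no newline: whole buffer is the row

    const char* field = buf;
    while (field < end) {
        const char* comma = static_cast<const char*>(
            std::memchr(field, ',', end - field));
        const char* stop = comma ? comma : end;
        std::fwrite(field, 1, stop - field, stdout);  // field, straight from buf
        std::fputc('\n', stdout);
        field = stop + 1;
    }
}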
When in doubt, though, try a few possibilities and profile, profile, profile.
Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).
Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.
If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)
It all depends on whether you're going to want or need that additional functionality or safety in the future.
Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.
But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)
One method, provided you are going to scan through the file serially, is to use two buffers of a decent size (16K is the optimal size for SSDs and 4K for HDDs, IIRC, but 16K should suffice for both). You start by performing an asynchronous load of the first 16K into buffer 0 (on Windows look up Overlapped I/O; on Unix/macOS use POSIX AIO, i.e. aio_read, since O_NONBLOCK has no effect on regular files), and then start another load of bytes 16K to 32K into buffer 1. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1 instead), wait for any further loads into buffer 1 to complete, and perform an asynchronous load of bytes 32K to 48K into buffer 0, and so on. This way you have far less chance of ever having to wait for a load to complete, as it should happen while you are processing the previous 16K.
I moved over to a scheme like this in my XML parser, having previously used fopen and fgetc, and the speedup was huge. Loading a 15 MB XML file and processing it went from minutes to seconds. Of course, your mileage may vary.
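A portable approximation of that scheme, using std::async as a stand-in for the platform-specific Overlapped I/O / POSIX AIO calls (the file name is a placeholder, and the "processing" here is just an fwrite):

#include <cstdio>
#include <fstream>
#include <future>
#include <vector>

int main() {
    constexpr size_t kBlock = 16 * 1024;            // 16K blocks, as suggested
    std::ifstream in("input.xml", std::ios::binary);
    std::vector<char> bufs[2] = {std::vector<char>(kBlock),
                                 std::vector<char>(kBlock)};

    // Each read runs on a worker thread; only one read is in flight at a time.
    auto read_block = [&in](std::vector<char>& b) {
        in.read(b.data(), b.size());
        return static_cast<size_t>(in.gcount());
    };

    int cur = 0;
    std::future<size_t> pending = std::async(std::launch::async,
                                             read_block, std::ref(bufs[cur]));
    for (;;) {
        size_t n = pending.get();                   // wait for the current block
        if (n == 0) break;
        int next = 1 - cur;
        pending = std::async(std::launch::async,    // start loading the next block
                             read_block, std::ref(bufs[next]));
        // ... process bufs[cur][0..n) here, while the next block loads ...
        std::fwrite(bufs[cur].data(), 1, n, stdout);
        cur = next;
    }
}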
Use fgets to read one line at a time. C++ file I/O is basically wrapper code with some compiler optimization tucked inside (and much unwanted functionality). Unless you are reading millions of lines and measuring the time, it does not matter.
This question came to mind when I was trying to solve this problem.
I have harddrive with capacity 120 GB, of which 100 GB is occupied by a single huge file. So 20 GB is still free.
My question is, how can we split this huge file into smaller ones, say 1 GB each? I see that if I had ~100 GB of free space, it would probably be possible with a simple algorithm. But given only 20 GB of free space, we can write up to twenty 1 GB files. I have no idea how to delete contents from the bigger file while reading from it.
Any solution?
It seems I have to truncate the file by 1 GB once I finish writing one file, but that boils down to this question:
Is it possible to truncate a part of a file? How exactly?
I would like to see an algorithm (or an outline of an algorithm) that works in C or C++ (preferably Standard C and C++), so I may know the lower level details. I'm not looking for a magic function, script or command that can do this job.
According to this question (Partially truncating a stream) you should be able to use, on a system that is POSIX compliant, a call to int ftruncate(int fildes, off_t length) to resize an existing file.
Modern implementations will probably resize the file "in place" (though this is unspecified in the documentation). The only gotcha is that you may have to do some extra work to ensure that off_t is a 64-bit type (provisions exist within the POSIX standard for 32-bit off_t types).
You should take steps to handle error conditions, just in case it fails for some reason, since obviously, any serious failure could result in the loss of your 100GB file.
Pseudocode (assume, and take steps to ensure, all data types are large enough to avoid overflows):
open (string filename) // opens a file, returns a file descriptor
file_size (descriptor file) // returns the absolute size of the specified file
seek (descriptor file, position p) // moves the caret to specified absolute point
copy_to_new_file (descriptor file, string newname)
// creates file specified by newname, copies data from specified file descriptor
// into newfile until EOF is reached
set descriptor = open ("MyHugeFile")
set gigabyte = 2^30 // 1024 * 1024 * 1024 bytes
set filesize = file_size(descriptor)
set blocks = (filesize + gigabyte - 1) / gigabyte
loop (i = blocks; i > 0; --i)
    set truncpos = gigabyte * (i - 1)
    seek (descriptor, truncpos)
    copy_to_new_file (descriptor, "MyHugeFile" + i)
    ftruncate (descriptor, truncpos)
Obviously some of this pseudocode is analogous to functions found in the standard library. In other cases, you will have to write your own.
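A sketch of that pseudocode in POSIX C/C++ (split_tail is a hypothetical name; error handling is trimmed, and a real version must check every read and write before truncating, as noted above):

#define _FILE_OFFSET_BITS 64          /* ensure off_t is 64-bit on 32-bit Linux */
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int split_tail(const char *name)
{
    int fd = open(name, O_RDWR);
    if (fd == -1) return -1;

    off_t gigabyte = (off_t)1 << 30;
    off_t filesize = lseek(fd, 0, SEEK_END);
    off_t blocks = (filesize + gigabyte - 1) / gigabyte;
    char buf[1 << 16];

    for (off_t i = blocks; i > 0; --i) {
        off_t truncpos = gigabyte * (i - 1);

        char outname[64];
        std::snprintf(outname, sizeof outname, "%s.%lld", name, (long long)i);
        int out = open(outname, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (out == -1) { close(fd); return -1; }

        /* copy [truncpos, EOF) into the new file */
        lseek(fd, truncpos, SEEK_SET);
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            write(out, buf, (size_t)n);
        close(out);

        /* chop the copied tail off the big file */
        if (ftruncate(fd, truncpos) == -1) { close(fd); return -1; }
    }
    close(fd);
    return 0;
}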
There is no standard function for this job.
For Linux you can use ftruncate, while for Windows you can use _chsize or SetEndOfFile. A simple #ifdef will make it cross-platform, as sketched below.
Also read this Q&A.
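For example (a sketch; truncate_fd is a hypothetical wrapper name, and on Windows the descriptor would come from _open or _fileno):

#ifdef _WIN32
#include <io.h>
#define truncate_fd(fd, len) _chsize_s((fd), (len))
#else
#include <unistd.h>
#define truncate_fd(fd, len) ftruncate((fd), (len))
#endif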
I have data files with about 1.5 Gb worth of floating-point numbers stored as ASCII text separated by whitespace, e.g., 1.2334 2.3456 3.4567 and so on.
Before processing such numbers I first translate the original file to binary format. This is helpful because I can choose whether to use float or double, reduce file size (to about 800 MB for double and 400 MB for float), and read in chunks of the appropriate size once I am processing the data.
I wrote the following function to make the ASCII-to-binary translation:
template<typename RealType = float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst) {
    RealType value;
    std::fstream src(fsrc.c_str(), std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst.c_str(), std::fstream::out | std::fstream::binary);
    while (src >> value) {
        dst.write((char*)&value, sizeof(RealType));
    }
    // RAII closes both files
}
I would like to speed up ascii_to_binary, but I seem unable to come up with anything. I tried reading the file in chunks of 8192 bytes and then processing the buffer in another subroutine. This seems very complicated, because the last few characters in the buffer may be whitespace (in which case all is good) or a truncated number (which is very bad); the logic to handle the possible truncation hardly seems worth it.
What would you do to speed up this function? I would rather rely on standard C++ (C++11 is OK) with no additional dependencies, like boost.
Thank you.
Edit:
@DavidSchwarts:
I tried to implement your suggestion as follows:
template<typename RealType = float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst) {
    std::vector<RealType> buffer;
    typedef typename std::vector<RealType>::iterator VectorIterator;
    buffer.reserve(65536);
    std::fstream src(fsrc, std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst, std::fstream::out | std::fstream::binary);
    while (true) {
        size_t k = 0;
        while (k < 65536 && src >> buffer[k]) k++;
        dst.write((char*)&buffer[0], buffer.size());
        if (k < 65536) {
            break;
        }
    }
}
But it does not seem to be writing the data! I'm working on it...
I did exactly the same thing, except that my fields were separated by tab ('\t') and I also had to handle non-numeric comments at the end of each line and header rows interspersed with the data.
Here is the documentation for my utility.
And I also had a speed problem. Here are the things I did to improve performance by around 20x:
Replace explicit file reads with memory-mapped files. Map two blocks at once. When you are in the second block after processing a line, remap with the second and third blocks. This way a line that straddles a block boundary is still contiguous in memory. (Assumes that no line is larger than a block, you can probably increase blocksize to guarantee this.)
Use SIMD instructions such as _mm_cmpeq_epi8 to search for line endings or other separator characters. In my case, any line containing an '=' character was a metadata row that needed different processing.
Use a barebones number parsing function (I used a custom one for parsing times in HH:MM:SS format, strtod and strtol are perfect for grabbing ordinary numbers). These are much faster than istream formatted extraction functions.
Use the OS file write API instead of the standard C++ API.
If you dream of throughput in the 300,000 lines/second range, then you should consider a similar approach.
Your executable also shrinks when you don't use the C++ standard streams. Mine is 205 KB, including a graphical interface, and depends only on DLLs that ship with Windows (no MSVCRTxx.dll needed). And looking again, I see I still use C++ streams for status reporting.
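As a sketch of the third point applied to this question's parsing (parse_all is a hypothetical helper; it assumes the whole input is already in a NUL-terminated buffer and lets strtod do the whitespace skipping):

#include <cstdlib>
#include <vector>

// Parse every whitespace-separated number in `text` with strtod.
// strtod itself skips leading whitespace, so no manual scanning is needed;
// it sets `end == text` when no further conversion is possible.
template <typename RealType = float>
std::vector<RealType> parse_all(const char* text)
{
    std::vector<RealType> values;
    char* end = nullptr;
    for (double v = std::strtod(text, &end); end != text;
         text = end, v = std::strtod(text, &end))
        values.push_back(static_cast<RealType>(v));
    return values;
}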
Aggregate the writes into a fixed buffer, using a std::vector of RealType. Your logic should work like this:
Allocate a std::vector<RealType> with 65,536 default-constructed entries.
Read up to 65,536 entries into the vector, replacing the existing entries.
Write out as many entries as you were able to read in.
If you read in exactly 65,536 entries, go to step 2.
Stop, you are done.
This will prevent you from alternating reads and writes to two different files, minimizing the seek activity significantly. It will also allow you to make far fewer write calls, reducing copying and buffering logic.
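A sketch of those steps (it also fixes the two problems in the question's edit: reserve does not create elements, so the vector must actually be sized, and write needs a byte count, not an element count):

#include <fstream>
#include <string>
#include <vector>

template<typename RealType = float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst)
{
    std::vector<RealType> buffer(65536);          // step 1: default-constructed entries
    std::ifstream src(fsrc, std::ios::binary);
    std::ofstream dst(fdst, std::ios::binary);

    for (;;) {
        size_t k = 0;
        while (k < buffer.size() && src >> buffer[k])   // step 2: overwrite entries
            ++k;
        dst.write(reinterpret_cast<const char*>(buffer.data()),
                  k * sizeof(RealType));                // step 3: bytes, not elements
        if (k < buffer.size())                          // steps 4/5: short read = EOF
            break;
    }
}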
I'm solving a problem which requires very fast input/output. More precisely, the input data file will be up to 15 MB. Is there a fast way to read/print integer values?
Note: I don't know if it helps, but the input file has the following form:
line 1: a number n
line 2..n+1: three numbers a,b,c;
line n+2: a number r
line n+3..n+4+r: four numbers a,b,c,d
Note 2: The input file will be stdin.
Edit: Something like the following isn't fast enough:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cctype>

void fast_scan(int &n) {
    char buffer[16];                       // room for sign, digits, '\n', '\0'
    fgets(buffer, sizeof buffer, stdin);   // gets() is unsafe and was removed from the standard
    n = atoi(buffer);
}

void fast_scan_three(int &a, int &b, int &c) {
    char buffval[3][20], buffer[60];
    fgets(buffer, sizeof buffer, stdin);
    int n = strlen(buffer);
    int buffindex = 0, curindex = 0;
    for (int i = 0; i < n; ++i) {
        if (!isdigit(buffer[i]) && !isspace(buffer[i])) break;
        if (isspace(buffer[i])) {
            buffval[buffindex][curindex] = '\0';   // NUL-terminate the field for atoi
            if (++buffindex == 3) break;           // all three fields collected
            curindex = 0;
        } else {
            buffval[buffindex][curindex++] = buffer[i];
        }
    }
    a = atoi(buffval[0]);
    b = atoi(buffval[1]);
    c = atoi(buffval[2]);
}
The general input/output optimization principle is to perform as few I/O operations as possible, reading or writing as much data per operation as possible.
So a performance-aware solution typically looks like this:
Read all data from the device into some buffer (following the principle above).
Process the data, generating the results into some buffer (in place or a separate one).
Output the results from the buffer to the device (following the principle above).
E.g. you could use std::basic_istream::read to input data in big chunks instead of doing it line by line. The same idea applies to output: generate a single string as the result, adding line-feed characters manually, and output it all at once.
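For example, a sketch of the input half (slurp_stdin is a hypothetical helper that reads all of stdin in large chunks; the parsing then works on the in-memory string):

#include <cstdio>
#include <string>

// Read all of stdin into one string using large fread() chunks.
std::string slurp_stdin()
{
    std::string data;
    char chunk[1 << 16];                   // 64K per read call
    size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, stdin)) > 0)
        data.append(chunk, n);
    return data;
}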
If you want to minimize the physical I/O overhead, load the whole file into memory with a technique called memory-mapped files. I doubt you'll get a noticeable performance gain, though; parsing will most likely be a lot costlier.
Consider using threads. Threading is useful for lots of things, but this is exactly the kind of problem that motivated the invention of threads.
The underlying idea is to separate the input, processing, and output, so these different operations can run in parallel. Do it right and you will see a significant speedup.
Have one thread doing close to pure input. It reads lines into a buffer of lines. Have a second thread do a quick pre-parse and organize the raw input into blocks. You have two things that need to be parsed: the line that contains the number of lines that contain triples, and the line that contains the number of lines that contain quads. This thread forms the raw input into blocks that are still mostly text. A third thread parses the triples and quads, re-forming the input into fully parsed structures. Since the data are now organized into independent blocks, you can have multiple instances of this third operation, so as to better take advantage of the multiple processors on your computer. Finally, other threads operate on these fully parsed structures. Note: it might be better to combine some of these operations, for example combining the input and pre-parsing into one thread.
Put several input lines in a buffer, split them, and then parse them simultaneously in different threads.
It's only 15MB. I would just slurp the whole thing into a memory buffer, then parse it.
The parsing looks something like this, approximately:
#define DIGIT(c) ((c) >= '0' && (c) <= '9')

while (*p == ' ') p++;
if (DIGIT(*p)) {
    a = 0;
    while (DIGIT(*p)) {
        a *= 10; a += (*p++ - '0');
    }
}
// and so on...
You should be able to write this kind of code in your sleep.
I don't know if that's any faster than atoi, but it's not fussing around figuring out where numbers begin and end. I would stay away from scanf because it goes through nine yards of figuring out its format string.
If you run this whole thing in a loop 1000 times and grab some stack samples, you should see that it is spending nearly 100% of its time reading in the file and generating output (which you didn't mention).
You just can't beat that.
If you do see that noticeable time is spent in the actual parsing, it might be possible to do overlapped I/O, but the machine would have to be really slow, or the I/O really fast (like from a solid state drive) before that would make sense.