Does my C++ code handle 100GB+ file copying? [closed] - c++

I need a cross-platform portable function that is able to copy a 100GB+ binary file to a new destination. My first solution was this:
void copy(const string &src, const string &dst)
{
FILE *f;
char *buf;
long len;
f = fopen(src.c_str(), "rb");
fseek(f, 0, SEEK_END);
len = ftell(f);
rewind(f);
buf = (char *) malloc((len+1) * sizeof(char));
fread(buf, len, 1, f);
fclose(f);
f = fopen(dst.c_str(), "a");
fwrite(buf, len, 1, f);
fclose(f);
}
Unfortunately, the program was very slow. I suspect the buffer had to keep 100GB+ in the memory. I'm tempted to try the new code (taken from Copy a file in a sane, safe and efficient way):
std::ifstream src_(src, std::ios::binary);
std::ofstream dst_ = std::ofstream(dst, std::ios::binary);
dst_ << src_.rdbuf();
src_.close();
dst_.close();
My question is about this line:
dst_ << src_.rdbuf();
What does the C++ standard say about it? Is the code compiled to a byte-by-byte transfer or to a whole-buffer transfer (like my first example)?
I'm curious whether << compiles to something useful for me. Maybe I don't have to invest my time in something else and can just let the compiler do the job inside the operator. If the operator translates to looping for me, why should I do it myself?
PS: std::filesystem::copy is impossible as the code has to work for C++11.

The crux of your question is what happens when you do this:
dst_ << src_.rdbuf();
Clearly this is two function calls: one to istream::rdbuf(), which simply returns a pointer to a streambuf, followed by one to ostream::operator<<(streambuf*), which is documented as follows:
After constructing and checking the sentry object, checks if sb is a null pointer. If it is, executes setstate(badbit) and exits. Otherwise, extracts characters from the input sequence controlled by sb and inserts them into *this until one of the following conditions are met: [...]
Reading this, the answer to your question is that copying a file in this way will not require buffering the entire file contents in memory--rather it will read a character at a time (perhaps with some chunked buffering, but that's an optimization that shouldn't change our analysis).
Here is one implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01075_source.html (__copy_streambufs). Essentially it is a loop calling sgetc() and sputc() repeatedly until EOF is reached. The memory required is small and constant.

The C++ standard (I checked C++98, so this should be extremely compatible) says in [lib.ostream.inserters]:
basic_ostream<charT,traits>& operator<<
(basic_streambuf<charT,traits> *sb);
Effects: If sb is null calls setstate(badbit) (which may throw ios_base::failure).
Gets characters from sb and inserts them in *this. Characters are read from sb and inserted until any of the following occurs:
end-of-file occurs on the input sequence;
inserting in the output sequence fails (in which case the character to be inserted is not extracted);
an exception occurs while getting a character from sb.
If the function inserts no characters, it calls setstate(failbit) (which may throw ios_base::failure (27.4.4.3)). If an exception was thrown while extracting a character, the function sets failbit in error state, and if failbit is on in exceptions() the caught exception is rethrown.
Returns: *this.
This description says << on rdbuf works on a character-by-character basis. In particular, if inserting a character fails, that exact character remains unread in the input sequence. This implies that an implementation cannot just extract the whole contents into a single huge buffer upfront.
So yes, there's a loop somewhere in the internals of the standard library that does a byte-by-byte (well, charT really) transfer.
However, this does not mean that the whole thing is completely unbuffered. This is simply about what operator<< does internally. Your ostream object will still accumulate data internally until its buffer is full, then call write (or whatever low-level function your OS uses).
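As a minimal sketch of how that inserter might be wrapped with basic error checks (assuming C++11; copy_with_rdbuf is a hypothetical name, not something from the standard library): the empty-file case is handled separately because, per the wording quoted above, inserting no characters sets failbit on the destination. Note that this sketch only detects failures on the output side.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: copies src to dst through the streambuf inserter.
// Returns true on success, false if a file cannot be opened or writing fails.
bool copy_with_rdbuf(const std::string& src, const std::string& dst)
{
    std::ifstream in(src, std::ios::binary);
    if (!in)
        return false;                 // source could not be opened

    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!out)
        return false;                 // destination could not be opened

    // An empty source inserts no characters, and operator<< would then set
    // failbit on `out`, so treat the empty file as a successful copy.
    if (in.peek() == std::ifstream::traits_type::eof())
        return true;

    out << in.rdbuf();                // the library loops internally
    out.flush();                      // push buffered data to the OS
    return out.good();
}
```

A call such as copy_with_rdbuf("big.bin", "big_copy.bin") uses only the small, fixed-size buffers owned by the two streams, whatever the file size.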

Unfortunately, the program was very slow.
Your first solution is wrong for a very simple reason: it reads the entire source file into memory, then writes it out in one go.
Files were invented (perhaps in the 1960s) to handle data that doesn't fit in memory (and has to live in some "slower" storage: at that time hard disks or drums, or perhaps even tapes). And they have always been copied in "chunks".
The current (Unix-like) definition of a file (as a sequence of bytes that is opened, read, written and closed) is more recent than the 1960s, probably from the late 1970s or early 1980s. It comes with the notion of streams (which has been standardized in C with <stdio.h> and in C++ with std::fstream).
So your program has to work (like every file copying program today) for files much bigger than the available memory. You need a loop that reads into a buffer, writes the buffer out, and repeats.
The size of the buffer is very important. If it is too small, you'll make too many IO operations (e.g. system calls). If it is too big, IO might be inefficient or even not work.
In practice, the buffer today should be much smaller than your RAM, typically several megabytes.
Your code is more C like than C++ like because it uses fopen. Here is a possible solution in C with <stdio.h>. If you code in genuine C++, adapt it to <fstream>:
#include <stdio.h>
#include <stdlib.h>

void copyfile(const char *destpath, const char *srcpath) {
    // experiment with various buffer sizes
#define MYBUFFERSIZE (4*1024*1024) /* four megabytes */
    char *buf = malloc(MYBUFFERSIZE);
    if (!buf) { perror("malloc buf"); exit(EXIT_FAILURE); }
    FILE *filsrc = fopen(srcpath, "rb"); // binary mode matters on Windows
    if (!filsrc) { perror(srcpath); exit(EXIT_FAILURE); }
    FILE *fildest = fopen(destpath, "wb");
    if (!fildest) { perror(destpath); exit(EXIT_FAILURE); }
    for (;;) {
        size_t rdsiz = fread(buf, 1, MYBUFFERSIZE, filsrc);
        if (rdsiz == 0) { // end of file or input error (size_t is never negative)
            if (ferror(filsrc)) { perror("fread"); exit(EXIT_FAILURE); }
            break;
        }
        size_t wrsiz = fwrite(buf, rdsiz, 1, fildest);
        if (wrsiz != 1) { perror("fwrite"); exit(EXIT_FAILURE); }
    }
    if (fclose(filsrc)) { perror("fclose source"); exit(EXIT_FAILURE); }
    if (fclose(fildest)) { perror("fclose dest"); exit(EXIT_FAILURE); }
    free(buf);
}
For simplicity, I am reading the buffer in byte components and writing it as a whole. A better solution is to handle partial writes.
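For the <fstream> adaptation mentioned above, a sketch might look like the following. It assumes C++11; the name copy_chunked and the 4 MiB default are illustrative choices, not recommendations from a benchmark.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: copies src to dst in fixed-size chunks.
// Returns true only if the whole source was read and written.
bool copy_chunked(const std::string& src, const std::string& dst,
                  std::size_t chunk_size = 4 * 1024 * 1024)
{
    if (chunk_size == 0)
        return false;
    std::ifstream in(src, std::ios::binary);
    if (!in)
        return false;
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!out)
        return false;

    std::vector<char> buf(chunk_size);
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();  // short on the last chunk
        if (got > 0 && !out.write(buf.data(), got))
            return false;                         // write failed (e.g. disk full)
    }
    out.flush();
    // A clean finish means reading stopped at end of file, not on an error.
    return in.eof() && !in.bad() && out.good();
}
```

Memory use is bounded by chunk_size regardless of the file size, so the third argument is the knob to experiment with.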
Apparently dst_ << src_.rdbuf(); does some looping internally (I have to admit I never used it and did not understand that at first; thanks to melpomene for correcting me). But the actual buffer size matters a lot. The two other answers (by John Swinck and by melpomene) focus on that rdbuf() thing. My answer focuses on explaining why copying can be slow when you do it as in your first solution, why you need to loop, and why the buffer size matters a lot.
If you really care about performance, you need to understand implementation details and operating-system-specific things. So read Operating Systems: Three Easy Pieces. Then understand how, on your particular operating system, the various layers of buffering work (there are several involved: your program's buffers, the standard stream buffers, the kernel buffers, the page cache). Don't expect your C++ standard library to buffer in an optimal fashion.
Don't even dream of coding in standard C++ (without operating system specific stuff) an optimal or very fast copying function. If performance matters, you need to dive in OS specific details.
On Linux, you might use time(1), oprofile(1), perf(1) to measure your program's performance. You could use strace(1) to understand the various system calls involved (see syscalls(2) for a list). You might even code (in a Linux specific way) using directly the open(2), read(2), write(2), close(2) and perhaps readahead(2), mmap(2), posix_fadvise(2), madvise(2), sendfile(2) system calls.
Finally, large file copies are limited by disk IO (which is the bottleneck). So even if you spend days optimizing OS-specific code, you won't win much. The hardware is the limitation. You should probably write whatever code is most readable for you (it might be that dst_ << src_.rdbuf(); thing, which does the looping) or use some library providing file copy. You might win a tiny amount of performance by tuning the various buffer sizes.
If the operator translates to looping for me, why should I do it myself?
Because you have no explicit guarantee on the actual buffering done (at various levels). As I explained, buffering matters for performance. Perhaps the actual performance is not that critical for you, and the ordinary settings of your system and standard library (and their default buffers sizes) might be enough.
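If you want some say over the stream buffers while keeping dst_ << src_.rdbuf();, one option is pubsetbuf. The sketch below assumes C++11 and that pubsetbuf is called before the file is opened; its effect on file streams is implementation-defined, so a library is allowed to ignore the request. The name copy_with_big_buffers and the 1 MiB default are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: same streambuf-to-stream copy, but with caller-sized
// stream buffers. Returns true on success.
bool copy_with_big_buffers(const std::string& src, const std::string& dst,
                           std::size_t buffer_size = 1 << 20)
{
    // Declared before the streams so the buffers outlive them.
    std::vector<char> in_buf(buffer_size), out_buf(buffer_size);
    std::ifstream in;
    std::ofstream out;

    // Assumption: the request is made before open(); the library may ignore it.
    in.rdbuf()->pubsetbuf(in_buf.data(),
                          static_cast<std::streamsize>(in_buf.size()));
    out.rdbuf()->pubsetbuf(out_buf.data(),
                           static_cast<std::streamsize>(out_buf.size()));

    in.open(src, std::ios::binary);
    if (!in)
        return false;
    out.open(dst, std::ios::binary | std::ios::trunc);
    if (!out)
        return false;

    // Copying nothing would set failbit on `out`, so handle empty input first.
    if (in.peek() == std::ifstream::traits_type::eof())
        return true;

    out << in.rdbuf();
    out.flush();
    return out.good();
}
```

Whether a larger buffer helps at all is something to measure on the target system rather than assume.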
PS. Your question contains at least 3 different questions (but related ones). I don't find it clear (so downvoted it), because I did not understand what is the most relevant one. Is it : performance? robustness? meaning of dst_ << src_.rdbuf();? Why is the first solution slow? How to copy large files quickly?

Related

WriteFile overlapped and fwrite equivalent

On Windows, the WriteFile() function has a parameter called lpOverlapped which lets you specify an offset at which to write to the file.
I was wondering, is there a cross-platform fwrite() equivalent of that?
I see that if the file is opened with the rb+ flag, I might be able to use fseek() to write to a particular offset. My question is - will this approach be equivalent to the overlapped WriteFile(), and will it produce the same behaviour on all platforms?
Background
The reason I need this is because I am writing blocked compressed data streams to a file, and I want to be able to load a specific block from the file and be able to decompress it. So, basically if I keep track of where the block begins in a file, I can load the block and decompress it in a more efficient manner. I know that there are probably better ways to do this, but I need this solution for some backwards compatibility.
Assuming you are okay with using POSIX functions and not just things from the C or C++ standard libraries, the solution is pwrite (aka: positioned write).
ssize_t rc = pwrite(file_handle, data_ptr, data_size, destination_offset);
I think you are confusing "overlapped" and "overwrite"/"offset." I didn't study up on the specifics of why Microsoft explicitly says overlapped writes include a parameter for offset (I think it makes sense as I describe below). In general, when Microsoft talks about "overlapped" IO, they are talking about how to synchronize events like starting to write the file, receiving notification that the write completed, and starting another write to the file which might or might not overlap with a previous write. In this last case, by overlap I mean what you would think that overlap means, ie overlaps within the contents of the file. Whereas Microsoft means that writing the file overlaps in time with your thread running, or not. Note that this gets very complicated if more than one thread can write the same file.
If possible, and surely if you want portable code, you want to avoid all this nonsense and just do the simplest write possible in each context, which means avoid Microsoft optimizations like "overlapped IO" unless you really need performance. (And if you need absolutely optimal performance, you might want to cache the file yourself and manage the overlaps, then write it once from start to finish.)
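For the block-at-a-known-offset use case from the question, the pwrite call shown above pairs naturally with pread. The following is a sketch for POSIX systems only; write_block_at and read_block_at are hypothetical helper names, and the loops are there because both calls may transfer fewer bytes than requested.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Writes all of `data` at byte `offset`; the file position is not moved.
bool write_block_at(int fd, const std::string& data, off_t offset)
{
    std::size_t done = 0;
    while (done < data.size()) {
        ssize_t n = pwrite(fd, data.data() + done, data.size() - done,
                           offset + static_cast<off_t>(done));
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, try again
            return false;                  // errno holds the reason
        }
        done += static_cast<std::size_t>(n);
    }
    return true;
}

// Reads exactly `size` bytes starting at `offset` into `out`.
// Returns false on error or if the file ends before the block does.
bool read_block_at(int fd, off_t offset, std::size_t size, std::string& out)
{
    out.assign(size, '\0');
    std::size_t done = 0;
    while (done < size) {
        ssize_t n = pread(fd, &out[done], size - done,
                          offset + static_cast<off_t>(done));
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        if (n == 0) return false;          // hit end of file too early
        done += static_cast<std::size_t>(n);
    }
    return true;
}
```

Keeping a table of (offset, size) pairs for the compressed blocks is then enough to load any single block without touching the rest of the file.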
While pwrite is probably the best solution, there is an alternative that sticks with stdio functions. Unfortunately, to make it thread-safe, you're using non-standard "stdio" to take direct control of the FILE*'s internal lock, and the names aren't portable. Specifically, POSIX defines one set of "take/release file lock" names and Windows defines another set (_lock_file/_unlock_file).
That said, you could use these semi-portable constructs to use stdio functions to ensure no buffering conflicts (pwrite to fileno(some_FILE_star) could cause problems if the FILE* buffer overlaps the pwrite location, since pwrite won't fix up the buffer):
// Error checking omitted; you should actually check returns in real code
size_t pfwrite(const void *ptr, size_t size, size_t n,
size_t offset, FILE *stream) {
// Take FILE*'s lock and hold it for entire transaction
flockfile(stream); // _lock_file on Windows
// Record position
long origpos = ftell(stream);
// Seek to desired offset and write
fseek(stream, offset, SEEK_SET); // Possibly offset * size, not just offset?
size_t written = fwrite(ptr, size, n, stream);
// Seek back to original position
fseek(stream, origpos, SEEK_SET);
// Release FILE*'s lock now that transaction complete
funlockfile(stream); // _unlock_file on Windows
return written;
}

How to copy a file from one location to another in a fast way with C++ program? [duplicate]

I am trying to understand the code behind the copy command, which copies a file from one place to another. I studied C++ file system basics and have written the following code for my task.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
cout<<"Copy file\n";
string from,to;
cout<<"Enter file address: ";
cin>>from;
ifstream in(from,ios::in | ios::binary);
if(!in)
{
cout<<"could not find file "<<from<<endl;
return 1;
}
cout<<"Enter file destination: ";
cin>>to;
ofstream out(to,ios::out | ios::binary);
char ch;
while(in.get(ch))
{
out.put(ch);
}
cout<<"file has been copied\n";
in.close();
out.close();
}
This code works, but it is much slower than the copy command of my OS, which is Windows. I want to know how I can make my program faster, to reduce the gap between my program's time and my OS's copy command time.
Reading one byte at time is going to waste a lot of time in function calls... use a bigger buffer:
char ch[4096];
while(in) {
in.read(ch, sizeof(ch));
out.write(ch, in.gcount());
}
(you may want to add some more error handling, e.g. out may go in a bad state and the like)
(the most C++-ish way is reported here, but takes advantage of streambuf functionalities that typically a beginner rarely has reason to know, and to me is also way less instructive)
You have correctly opened the file for binary read and binary write. However, instead of reading characters (which is not meaningful in binary format), use istream::read and ostream::write.
Like other answers say, use bigger buffers. I'd go for 1MB.
But there's a lot more to it.
Also, avoid stream lib and FILE stuff. They buffer the data so you get 2 memcpy calls instead of 1.
Disabling buffering on the streams can achieve a similar result, but I think you're better off using the system calls directly.
And one last thing, on the "do it yourself" front. You must check the return values from read and write calls. They may read/write fewer bytes than you ask them to.
If you can manage a circular buffer, you should switch between read and write whenever the function returns short: the disk may be more ready for reading or for writing, so there is no point in wasting time waiting instead of switching to the other thing you have to do.
And now the very last thing you might want to explore- look into the sendfile system call. It was built to speed up web servers by doing all the copy in the kernel and avoiding context switches and memcpys, but may serve here if it works with two disk file descriptors.
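The advice about checking return values can be sketched as follows, for POSIX systems only. copy_fd is a hypothetical name and the 1 MiB default follows the buffer size suggested above; the inner loop is what handles a write that accepts fewer bytes than it was given.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

// Copies everything readable from in_fd to out_fd.
// Returns true on success, false on any read or write error.
bool copy_fd(int in_fd, int out_fd, std::size_t buffer_size = 1 << 20)
{
    if (buffer_size == 0)
        return false;
    std::vector<char> buf(buffer_size);
    for (;;) {
        ssize_t got = read(in_fd, buf.data(), buf.size());
        if (got == 0)
            return true;                    // end of file
        if (got < 0) {
            if (errno == EINTR) continue;   // interrupted, retry the read
            return false;
        }
        ssize_t off = 0;
        while (off < got) {                 // write() may be short: keep going
            ssize_t put = write(out_fd, buf.data() + off,
                                static_cast<std::size_t>(got - off));
            if (put < 0) {
                if (errno == EINTR) continue;
                return false;
            }
            off += put;
        }
    }
}
```

The descriptors would typically come from open() with O_RDONLY for the source and O_WRONLY | O_CREAT | O_TRUNC for the destination.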

Efficient way to read file multiple line one time?

I am trying to handle a large file (several GB), so I am thinking of using multiple threads. The file consists of multiple lines of data like:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 attr2.2 attr2.3
data3 attr3.1 attr3.2 attr3.3
I am thinking of having one thread read multiple lines into buffer1 first, and then another thread handle the data in buffer1 line by line while the reading thread starts to read the file into buffer2. The handling thread then continues when buffer2 is ready, and the reading thread reads into buffer1 again.
I have finished the handler part using fread for a small file (several KB), but I am not sure how to make the buffer contain only complete lines instead of splitting a line at the end of the buffer, like this:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 att
Also, I found that fgets or ifstream getline can read a file line by line, but would that be very costly since it does many IOs?
Now I am struggling to figure out the best way to do this. Is there an efficient way to read multiple lines at one time? Any advice is appreciated.
C stdio and C++ iostream functions use buffered I/O. Small reads only have function-call and locking overhead, not read(2) system call overhead.
Without knowing the line length ahead of time, fgets has to either use a buffer or read one byte at a time. Luckily, the C/C++ I/O semantics allow it to use buffering, so every mainstream implementation does. (According to the docs, mixing stdio and I/O on the underlying file descriptors gives undefined results. This is what allows buffering.)
You're right that it would be a problem if every fgets required a system call.
You might find it useful for one thread to read lines and put the lines into some kind of data structure that's useful for the processing thread.
If you don't have to do much processing on each line, doing the I/O in the same thread as the processing will keep everything in the L1 cache of that CPU, though. Otherwise data will end up in L1 of the I/O thread, and then have to make it to L1 of the core running the processing thread.
Depending on what you want to do with your data, you can minimize copying by memory-mapping the file in-place. Or read with fread, or skip the stdio layer entirely and just use POSIX open / read, if you don't need your code to be as portable. Scanning a buffer for newlines might have less overhead than what the stdio functions do.
You can handle the leftover line at the end of the buffer by copying it to the front of the buffer, and calling the next fread with a reduced buffer size. (Or, make your buffer ~1k bigger than the size of your fread calls, so you can always read multiples of the memory and filesystem page size (typically 4kiB), unless the trailing part of the line is > 1k.)
Or use a circular buffer, but reading from a circular buffer means checking for wraparound every time you touch it.
It all depends on what processing you want to do afterwards: do you need to keep a copy of the lines? Do you intend to process the input as std::strings? And so on.
Here are some general remarks that could help you further:
* istream::getline() and fgets() are buffered operations, so the I/O is already reduced and you can assume the performance is already reasonable.
* std::getline() is also buffered. Nevertheless, if you don't need to process std::strings, the function costs you a considerable number of memory allocations/deallocations, which might impact performance.
* Block operations like read() or fread() can achieve economies of scale if you can afford large buffers. This can be especially efficient if you use the data in a throw-away fashion (because you can avoid copying the data and work directly in the buffer), but at the cost of extra complexity.
But with all these considerations, don't forget that the performance is very much affected by the library implementation that you use.
I've done a little informal benchmark reading a million lines in the format you've shown:
* With MSVC2015 on my PC, read() is twice as fast as fgets(), and almost 4 times faster than std::string.
* With GCC on CodingGround, compiling with O3, fgets() and both getline() variants are approximately the same, and read() is slower.
Here is the full code if you want to play around.
Here is the code that shows how to move the buffer around.
int nr = 0;        // number of bytes kept from the previous incomplete line
bool last = false; // true once the final (short) read has happened
while (!last)
{
    // here nr contains the number of bytes kept from the incomplete line
    last = !ifs.read(buffer + nr, szb - nr);
    nr = nr + ifs.gcount();
    char *s, *p = buffer, *pe = p + nr;
    do { // process complete lines in buffer
        for (s = p; p != pe && *p != '\n'; p++)
            ;
        if (p != pe || (p == pe && last)) {
            if (p != pe)
                *p++ = '\0';
            lines++;          // TO DO: here s is a null-terminated line to process
            sln += strlen(s); // (dummy operation for the example)
            s = p;            // line consumed: do not carry it over to the next read
        }
    } while (p != pe); // until end of buffer is reached
    std::copy(s, pe, buffer); // copy last (incomplete) line to the beginning of the buffer
    nr = pe - s;              // and prepare the info for the next iteration
}

What's the difference between read() and getc()

I have two code segments:
while((n=read(0,buf,BUFFSIZE))>0)
if(write(1,buf,n)!=n)
err_sys("write error");
while((c=getc(stdin))!=EOF)
if(putc(c,stdout)==EOF)
err_sys("write error");
Some claims on the internet confuse me. I know that standard I/O does buffering automatically, but I have passed a buf to read(), so read() is also doing buffering, right? And it seems that getc() reads data char by char, so how much data will the buffer hold before all the data is sent out?
Thanks
While both functions can be used to read from a file, they are very different. First of all, on many systems read is a lower-level function, and may even be a system call directly into the OS. The read function also isn't standard C or C++; it's part of e.g. POSIX. It can also read arbitrarily sized blocks, not only one byte at a time. There's no buffering (except maybe at the OS/kernel level), and it doesn't distinguish between "binary" and "text" data. And on POSIX systems, where read is a system call, it can be used to read from all kinds of devices and not only files.
The getc function is a higher-level function. It usually uses buffered input (so input is read in blocks into a buffer, sometimes by using read, and the getc function gets its characters from that buffer). It also only returns a single character at a time. It's part of the C and C++ specifications as part of the standard library. Also, there may be conversions between the data read and the data returned by the function, depending on whether the file was opened in text or binary mode.
Another difference is that read is also always a function, while getc might be a preprocessor macro.
Comparing read and getc doesn't really make much sense, more sense would be comparing read with fread.

Proper, efficient file reading

I'd like to read and process (e.g. print) entries from the first row of a CSV file one at a time. I assume Unix-style \n newlines, that no entry is longer than 255 chars and (for now) that there's a newline before EOF. This is meant to be a more efficient alternative to fgets() followed by strtok().
#include <stdio.h>
#include <string.h>
int main() {
int i;
char ch, buf[256];
FILE *fp = fopen("test.csv", "r");
for (;;) {
for (i = 0; ; i++) {
ch = fgetc(fp);
if (ch == ',') {
buf[i] = '\0';
puts(buf);
break;
} else if (ch == '\n') {
buf[i] = '\0';
puts(buf);
fclose(fp);
return 0;
} else buf[i] = ch;
}
}
}
Is this method as efficient and correct as possible?
What is the best way to test for EOF and file reading errors using this method? (Possibilities: testing against the character macro EOF, feof(), ferror(), etc.).
Can I perform the same task using C++ file I/O with no loss of efficiency?
What is most efficient is going to depend a lot on the operating system, standard libraries (e.g. libc), and even the hardware you are running on. This makes it nearly impossible to tell you what's "most efficient".
That having been said, there are a few things you could try:
Use mmap() or a local operating system equivalent (Windows has CreateFileMapping / OpenFileMapping / MapViewOfFile, and probably others). Then you don't do explicit file reads: you simply access the file as if it were already in memory, and anything that isn't there will be faulted in by the page fault mechanism.
Read the entire file into a buffer manually, then work on that buffer. The fewer times you call into file read functions, the fewer function-call overheads you take, and likely also fewer application/OS domain switches. Obviously this uses more memory, but may very well be worth it.
Use a more optimal string scanner for your problem and platform. Going character-by-character yourself is almost never as fast as relying on something existing that's close to your problem domain. For example, you can bet that strchr and memchr are probably better-optimized than most code you can roll yourself, doing things like reading entire cachelines or words at once, scanning using better algorithms for this kind of search, etc. For more complicated cases, you might consider a full regular expression engine that could compile your regex to something fast for your complicated case.
Avoid copying your string around. It may be helpful to think in terms of "find delimiters" and then "output between delimiters". You could for example use strchr to find the next character of interest, and then fwrite or something to write to stdout directly from your input buffer. Then you're keeping most of your work in a few local registers, rather than using a stack or heap buf.
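The "find delimiters, then output between delimiters" idea above can be sketched with memchr, assuming the first row is already in a memory buffer (read or mapped by whatever means). field_spans is a hypothetical helper; it returns pointer/length pairs into the caller's buffer, so no field text is copied.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Returns (start, length) pairs for the comma-separated fields of the first
// line in `data`. The pairs point into `data`; nothing is copied.
// An empty buffer yields no fields.
std::vector<std::pair<const char*, std::size_t>>
field_spans(const char* data, std::size_t size)
{
    std::vector<std::pair<const char*, std::size_t>> spans;
    if (size == 0)
        return spans;

    // The row ends at the first newline, or at the end of the buffer.
    const void* nl = std::memchr(data, '\n', size);
    const char* end = nl ? static_cast<const char*>(nl) : data + size;

    const char* p = data;
    for (;;) {
        const void* hit = std::memchr(p, ',', static_cast<std::size_t>(end - p));
        const char* stop = hit ? static_cast<const char*>(hit) : end;
        spans.emplace_back(p, static_cast<std::size_t>(stop - p));
        if (stop == end)
            break;          // last field of the row
        p = stop + 1;       // skip the comma
    }
    return spans;
}
```

Each span can then be written straight from the input buffer, for example with fwrite(span.first, 1, span.second, stdout) followed by a newline.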
When in doubt, though, try a few possibilities and profile, profile, profile.
Also for this kind of problem, be very aware of differences between runs that are caused by OS and hardware caches: profile a bunch of runs rather than just one after each change -- and if possible, use tests that will either likely always hit caches (if you're trying to measure best-case performance) or tests that will likely miss (if you're trying to measure worst-case performance).
Regarding C++ file IO (fstream and such), just be aware that they're larger, more complicated beasts. They tend to include things such as locale management, automatic buffering, and the like -- as well as being less prone to particular types of coding mistakes.
If you're doing something pretty simple (like what you describe here), I tend to find C++ library stuff gets in the way. (Use a debugger and "step instruction" through a stringstream method versus some C string functions some time, you'll get a good feel for this quickly.)
It all depends on whether you're going to want or need that additional functionality or safety in the future.
Finally, the obligatory "don't sweat the small stuff". Only spend time on optimizing here if it's really important. Otherwise trust the libraries and OS to do the right thing for you most of the time -- if you get too far into micro-optimizations you'll find you're shooting yourself in the foot later. This is not to discourage you from thinking in terms of "should I read the whole file in ahead of time, will that break future use cases" -- because that's macro, rather than micro.
But generally speaking if you're not doing this kind of "make it faster" investigation for a good reason -- i.e. "need this app to perform better now that I've written it, and this code shows up as slow in profiler", or "doing this for fun so I can better understand the system" -- well, spend your time elsewhere first. =)
One method, provided you are going to scan through the file serially, is to use 2 buffers of a decent enough size (16K is the optimal size for SSDs and 4K for HDDs IIRC, but 16K should suffice for both). You start off by performing an asynchronous load of the first 16K into buffer 0 (on Windows look up Overlapped I/O; on Unix/OSX, O_NONBLOCK generally has no effect on regular files, so look at POSIX AIO or a separate reader thread instead) and then start another load into buffer 1 of bytes 16K to 32K. When your read position hits 16K, swap the buffers (so you are now reading from buffer 1 instead), wait for any outstanding load into buffer 1 to complete, and perform an asynchronous load of bytes 32K to 48K into buffer 0, and so on. This way, you have far less chance of ever having to wait for a load to complete, as it should be happening while you are processing the previous 16K.
I moved over to a scheme like this in my XML parser, having previously used fopen and fgetc, and the speedup was huge. Loading and processing a 15 MB XML file went from minutes to seconds. Of course, your mileage may vary.
Use fgets to read one line at a time. C++ file I/O is basically wrapper code with some compiler optimization tucked inside (and a lot of unwanted functionality). Unless you are reading millions of lines and measuring the time, it does not matter.