C++ reading buffer size

Suppose that this file is 2 and 1/2 blocks long, with a block size of 1024.
int aBlock = 1024;
char* buffer = new char[aBlock];
while (!myFile.eof()) {
    myFile.read(buffer, aBlock);
    //do more stuff
}
The third time it reads, it will only fill half of the buffer, leaving the other half with invalid data. Is there a way to know how many bytes were actually written to the buffer?

istream::gcount returns the number of bytes read by the previous read.

Your code is both overly complicated and error-prone.
Reading in a loop and checking only for eof is a logic error since this will result in an infinite loop if there is an error while reading (for whatever reason).
Instead, you need to check all fail states of the stream, which can be done by simply checking for the istream object itself.
Since this is already returned by the read function, you can (and, indeed, should) structure any reader loop like this:
while (myFile.read(buffer, aBlock))
    process(buffer, aBlock);
process(buffer, myFile.gcount());
This is at the same time shorter, doesn’t hide bugs and is more readable since the check-stream-state-in-loop is an established C++ idiom.
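For completeness, a minimal self-contained version of that loop might look like this; the file name, block size and the body of process() are placeholders, not anything mandated by the answer above:
#include <fstream>
#include <iostream>

// Placeholder: do something with the n valid bytes in data.
void process(const char* data, std::streamsize n)
{
    std::cout.write(data, n);
}

int main()
{
    const int aBlock = 1024;
    char buffer[aBlock];
    std::ifstream myFile("input.bin", std::ios::binary);

    while (myFile.read(buffer, aBlock))   // full blocks
        process(buffer, aBlock);
    process(buffer, myFile.gcount());     // final, partial block (possibly 0 bytes)
}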

You could also look at istream::readsome, which actually returns the number of bytes read.

Does my C++ code handle 100GB+ file copying? [closed]

I need a cross-platform portable function that is able to copy a 100GB+ binary file to a new destination. My first solution was this:
void copy(const string &src, const string &dst)
{
    FILE *f;
    char *buf;
    long len;
    f = fopen(src.c_str(), "rb");
    fseek(f, 0, SEEK_END);
    len = ftell(f);
    rewind(f);
    buf = (char *) malloc((len+1) * sizeof(char));
    fread(buf, len, 1, f);
    fclose(f);
    f = fopen(dst.c_str(), "a");
    fwrite(buf, len, 1, f);
    fclose(f);
}
Unfortunately, the program was very slow. I suspect the buffer had to keep 100GB+ in the memory. I'm tempted to try the new code (taken from Copy a file in a sane, safe and efficient way):
std::ifstream src_(src, std::ios::binary);
std::ofstream dst_ = std::ofstream(dst, std::ios::binary);
dst_ << src_.rdbuf();
src_.close();
dst_.close();
My question is about this line:
dst_ << src_.rdbuf();
What does the C++ standard say about it? Does this compile to a byte-by-byte transfer, or to a whole-buffer transfer (like my first example)?
I'm curious whether the << compiles to something useful for me. Maybe I don't have to invest my time in something else and can just let the compiler do the job inside the operator? If the operator translates to looping for me, why should I do it myself?
PS: std::filesystem::copy is impossible as the code has to work for C++11.
The crux of your question is what happens when you do this:
dst_ << src_.rdbuf();
Clearly this is two function calls: one to istream::rdbuf(), which simply returns a pointer to a streambuf, followed by one to ostream::operator<<(streambuf*), which is documented as follows:
After constructing and checking the sentry object, checks if sb is a null pointer. If it is, executes setstate(badbit) and exits. Otherwise, extracts characters from the input sequence controlled by sb and inserts them into *this until one of the following conditions are met: [...]
Reading this, the answer to your question is that copying a file in this way will not require buffering the entire file contents in memory--rather it will read a character at a time (perhaps with some chunked buffering, but that's an optimization that shouldn't change our analysis).
Here is one implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01075_source.html (__copy_streambufs). Essentially it is a loop calling sgetc() and sputc() repeatedly until EOF is reached. The memory required is small and constant.
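As a conceptual sketch only (this is not the libstdc++ source, and the function name here is made up), the loop boils down to something like this; note how a character that cannot be inserted is left unread, matching the standard wording quoted below:
#include <ios>
#include <streambuf>
#include <string>

std::streamsize copy_streambufs_sketch(std::streambuf* in, std::streambuf* out)
{
    std::streamsize copied = 0;
    int c = in->sgetc();                                   // peek, don't consume yet
    while (c != std::char_traits<char>::eof())
    {
        if (out->sputc(std::char_traits<char>::to_char_type(c))
                == std::char_traits<char>::eof())
            break;                                         // insertion failed: character stays unread
        in->sbumpc();                                      // only now consume it
        ++copied;
        c = in->sgetc();
    }
    return copied;
}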
The C++ standard (I checked C++98, so this should be extremely compatible) says in [lib.ostream.inserters]:
basic_ostream<charT,traits>& operator<<
(basic_streambuf<charT,traits> *sb);
Effects: If sb is null calls setstate(badbit) (which may throw ios_base::failure).
Gets characters from sb and inserts them in *this. Characters are read from sb and inserted until any of the following occurs:
end-of-file occurs on the input sequence;
inserting in the output sequence fails (in which case the character to be inserted is not extracted);
an exception occurs while getting a character from sb.
If the function inserts no characters, it calls setstate(failbit) (which may throw ios_base::failure (27.4.4.3)). If an exception was thrown while extracting a character, the function sets failbit in error state, and if failbit is on in exceptions() the caught exception is rethrown.
Returns: *this.
This description says << on rdbuf works on a character-by-character basis. In particular, if inserting of a character fails, that exact character remains unread in the input sequence. This implies that an implementation cannot just extract the whole contents into a single huge buffer upfront.
So yes, there's a loop somewhere in the internals of the standard library that does a byte-by-byte (well, charT really) transfer.
However, this does not mean that the whole thing is completely unbuffered. This is simply about what operator<< does internally. Your ostream object will still accumulate data internally until its buffer is full, then call write (or whatever low-level function your OS uses).
Unfortunately, the program was very slow.
Your first solution is wrong for a very simple reason: it reads the entire source file into memory, then writes it out entirely.
Files were invented (perhaps in the 1960s) to handle data that doesn't fit in memory (and has to live in some "slower" storage, at that time hard disks or drums, or perhaps even tapes). And they have always been copied in "chunks".
The current (Unix-like) definition of a file (as a sequence of bytes that is open-ed, read, write-n, close-d) is more recent than the 1960s, probably the late 1970s or early 1980s. And it comes with the notion of streams (which have been standardized in C with <stdio.h> and in C++ with std::fstream).
So your program has to work (like every file-copying program today) for files much bigger than the available memory. You need some loop to read into a buffer, write it out, and repeat.
The size of the buffer is very important. If it is too small, you'll make too many IO operations (e.g. system calls). If it is too big, IO might be inefficient or even not work.
In practice, the buffer should today be much less than your RAM, typically several megabytes.
Your code is more C-like than C++-like because it uses fopen. Here is a possible solution in C with <stdio.h>. If you code in genuine C++, adapt it to <fstream> (a sketch of such an adaptation follows the C version):
#include <stdio.h>
#include <stdlib.h>

void copyfile(const char *destpath, const char *srcpath) {
    /* experiment with various buffer sizes */
#define MYBUFFERSIZE (4*1024*1024) /* four megabytes */
    char *buf = malloc(MYBUFFERSIZE);
    if (!buf) { perror("malloc buf"); exit(EXIT_FAILURE); }
    FILE *filsrc = fopen(srcpath, "rb");
    if (!filsrc) { perror(srcpath); exit(EXIT_FAILURE); }
    FILE *fildest = fopen(destpath, "wb");
    if (!fildest) { perror(destpath); exit(EXIT_FAILURE); }
    for (;;) {
        size_t rdsiz = fread(buf, 1, MYBUFFERSIZE, filsrc);
        if (rdsiz == 0) {  /* end of file, or a read error */
            if (ferror(filsrc)) { perror("fread"); exit(EXIT_FAILURE); }
            break;
        }
        size_t wrsiz = fwrite(buf, rdsiz, 1, fildest);
        if (wrsiz != 1) { perror("fwrite"); exit(EXIT_FAILURE); }
    }
    if (fclose(filsrc)) { perror("fclose source"); exit(EXIT_FAILURE); }
    if (fclose(fildest)) { perror("fclose dest"); exit(EXIT_FAILURE); }
    free(buf);
}
For simplicity, I read the buffer in byte-sized elements and write it as a single item. A better solution would handle partial writes.
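If you prefer genuine C++, the same chunked loop adapted to <fstream> might look roughly like this; it is only a sketch, and the buffer size, function name and error-handling policy (exceptions here) are choices of mine, not part of the answer above:
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

void copyfile_cxx(const std::string& srcpath, const std::string& destpath)
{
    std::vector<char> buf(4 * 1024 * 1024);   // experiment with this size
    std::ifstream src(srcpath.c_str(), std::ios::binary);
    std::ofstream dst(destpath.c_str(), std::ios::binary);
    if (!src || !dst)
        throw std::runtime_error("cannot open files");

    while (src) {
        src.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = src.gcount();   // bytes actually read; the last chunk is short
        if (got > 0 && !dst.write(buf.data(), got))
            throw std::runtime_error("write failed");
    }
    if (src.bad())                            // a real I/O error, not just end of file
        throw std::runtime_error("read failed");
}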
Apparently dst_ << src_.rdbuf(); might do some loop internally (I have to admit I never used it and did not understand that at first; thanks to melpomene for correcting me). But the actual buffer size matters a great deal. The two other answers (by John Swinck and by melpomene) focus on that rdbuf() thing. My answer focuses on explaining why copying can be slow when you do it like in your first solution, why you need a loop, and why the buffer size matters so much.
If you really care about performance, you need to understand implementation details and operating system specific things. So read Operating systems: three easy pieces. Then understand how, on your particular operating system, the various buffering is done (there are several layers of buffers involved: your program buffers, the standard stream buffers, the kernel buffers, the page cache). Don't expect your C++ standard library to buffer in an optimal fashion.
Don't even dream of coding in standard C++ (without operating system specific stuff) an optimal or very fast copying function. If performance matters, you need to dive in OS specific details.
On Linux, you might use time(1), oprofile(1), perf(1) to measure your program's performance. You could use strace(1) to understand the various system calls involved (see syscalls(2) for a list). You might even code (in a Linux specific way) using directly the open(2), read(2), write(2), close(2) and perhaps readahead(2), mmap(2), posix_fadvise(2), madvise(2), sendfile(2) system calls.
Finally, large file copies are limited by disk IO (which is the bottleneck). So even after spending days optimizing OS-specific code, you won't win much. The hardware is the limitation. You should probably write whatever code is most readable for you (it might be that dst_ << src_.rdbuf(); thing, which does loop) or use some library providing file copy. You might win a tiny amount of performance by tuning the various buffer sizes.
If the operator translates to looping for me, why should I do it myself?
Because you have no explicit guarantee on the actual buffering done (at the various levels). As I explained, buffering matters for performance. Perhaps the actual performance is not that critical for you, and the ordinary settings of your system and standard library (and their default buffer sizes) might be enough.
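One buffering knob standard C++ does give you is the stream's own buffer, via pubsetbuf(). Whether and how the request is honored is implementation-defined, and for file streams it generally has to be issued before open() to have any effect, so treat this sketch as a hint rather than a guarantee; the file names and sizes are placeholders:
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> inbuf(1 << 20), outbuf(1 << 20);     // 1 MiB each
    std::ifstream src;
    std::ofstream dst;
    src.rdbuf()->pubsetbuf(inbuf.data(), inbuf.size());    // request larger buffers
    dst.rdbuf()->pubsetbuf(outbuf.data(), outbuf.size());  // before opening the files
    src.open("big.bin", std::ios::binary);
    dst.open("copy.bin", std::ios::binary);
    dst << src.rdbuf();                                    // same copy as before
}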
PS. Your question contains at least 3 different (but related) questions. I don't find it clear (so I downvoted it), because I did not understand which one is the most relevant: performance? robustness? the meaning of dst_ << src_.rdbuf();? why the first solution is slow? how to copy large files quickly?

How to decompress less than original size with Lz4 library?

I'm using LZ4 library and when decompressing data with:
int LZ4_decompress_fast_continue (void* LZ4_streamDecode, const char* source, char* dest, int originalSize);
I need only first n bytes of the originally encoded N bytes, where n < N. So in order to improve the performance, it makes sense to decompress only a part of the original buffer.
I wonder if I can pass n instead of N to the originalSize argument of the function?
My initial test showed that it's not possible (I got incorrectly decompressed data). Though maybe there is a way, for example if n is a multiple of some CHUNK_SIZE? All of the original N bytes were compressed with a single call of a compression function.
LZ4_decompress_safe_continue() and LZ4_decompress_fast_continue() can only decode full blocks. They consider a partial block as an error, and report it as such. They also consider that if there is not enough room to decompress a full block, it's also an error.
The functionality you are looking for doesn't exist yet. But there is a close cousin that might help.
LZ4_decompress_safe_partial() can decode a part of a block.
Note that, in contrast with _continue() variants, it only works on independent blocks.
Note also that the compressed block must nonetheless be complete, and the output buffer must nonetheless have enough space to decode the entire block. So the only advantage provided by this function is speed: if you want only the first 10 bytes, it will stop as soon as it has generated enough bytes.
"as soon as" doesn't mean "exactly at 10". It could be much later, and in the worst case, it could be after decoding the entire block. That's because the internal decoding engine is still the same : it decodes entire sequences, and doesn't "break them" in the middle, for speed considerations.
If you need to extract fewer bytes than a full block in order to save some memory, I'm afraid there is no solution yet. Report it as a feature request upstream.
This seems to have been implemented in lz4 1.8.3.
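A minimal usage sketch of LZ4_decompress_safe_partial(), assuming a reasonably recent lz4 (1.8.3 or later) and a single independently compressed block; the helper name is made up here, and you should check the lz4.h shipped with your version for the exact contract (in particular, how far past the target size the decoder may run, as discussed above):
#include <lz4.h>

// Decompress only the first `want` bytes of one independently compressed block.
// Returns the number of bytes actually written to out, or a negative value on error.
int decompress_prefix(const char* compressed, int compressedSize,
                      char* out, int outCapacity, int want)
{
    return LZ4_decompress_safe_partial(compressed, out,
                                       compressedSize, want, outCapacity);
}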

Efficient way to read multiple lines of a file at one time?

I am now trying to handle a large file (several GB), so I am thinking of using multiple threads. The file consists of multiple lines of data, like:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 attr2.2 attr2.3
data3 attr3.1 attr3.2 attr3.3
I am thinking of using one thread to read multiple lines into buffer1 first, then another thread to handle the data in buffer1 line by line while the reading thread starts reading the file into buffer2. The handling thread then continues when buffer2 is ready, and the reading thread reads into buffer1 again.
I have finished the handler part using fread for a small file (several KB), but I am not sure how to make the buffer contain complete lines instead of splitting part of a line at the end of the buffer, like this:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 att
Also, I find that fgets or ifstream getline can read a file line by line, but wouldn't that be very costly, since it does many IOs?
Now I am struggling to figure out the best way to do this. Is there an efficient way to read multiple lines at one time? Any advice is appreciated.
C stdio and C++ iostream functions use buffered I/O. Small reads only have function-call and locking overhead, not read(2) system call overhead.
Without knowing the line length ahead of time, fgets has to either use a buffer or read one byte at a time. Luckily, the C/C++ I/O semantics allow it to use buffering, so every mainstream implementation does. (According to the docs, mixing stdio and I/O on the underlying file descriptors gives undefined results. This is what allows buffering.)
You're right that it would be a problem if every fgets required a system call.
You might find it useful for one thread to read lines and put the lines into some kind of data structure that's useful for the processing thread.
If you don't have to do much processing on each line, doing the I/O in the same thread as the processing will keep everything in the L1 cache of that CPU, though. Otherwise data will end up in L1 of the I/O thread, and then have to make it to L1 of the core running the processing thread.
Depending on what you want to do with your data, you can minimize copying by memory-mapping the file in place. Or read with fread, or skip the stdio layer entirely and just use POSIX open / read, if you don't need your code to be as portable. Scanning a buffer for newlines might have less overhead than what the stdio functions do.
You can handle the leftover line at the end of the buffer by copying it to the front of the buffer, and calling the next fread with a reduced buffer size. (Or, make your buffer ~1k bigger than the size of your fread calls, so you can always read multiples of the memory and filesystem page size (typically 4kiB), unless the trailing part of the line is > 1k.)
Or use a circular buffer, but reading from a circular buffer means checking for wraparound every time you touch it.
It all depends on what you want to do with the data afterwards: do you need to keep a copy of the lines? Do you intend to process the input as std::strings? etc.
Here are some general remarks that could help you further:
istream::getline() and fgets() are buffered operations. So I/O is already reduced and you can assume the performance is already reasonable.
std::getline() is also buffered. Nevertheless, if you don't need to produce std::strings, the function costs you a considerable number of memory allocations/deallocations, which might impact performance.
Block operations like read() or fread() can achieve economies of scale if you can afford large buffers. This can be especially efficient if you use the data in a throw-away fashion (because you can avoid copying the data and work directly in the buffer), but at the cost of some extra complexity.
But all these considerations should not make you forget that performance is very much affected by the library implementation that you use.
I've done a little informal benchmark reading a million lines in the format you've shown:
* With MSVC2015 on my PC, read() is twice as fast as fgets(), and almost 4 times faster than std::string.
* With GCC on CodingGround, compiling with -O3, fgets() and both getline() variants are approximately the same, and read() is slower.
Here is the full code if you want to play around.
Here is the code that shows how to move the buffer around.
// Context assumed from the benchmark above: an std::ifstream ifs opened on the
// file, a char buffer[] of szb bytes, and the counters lines and sln.
// Also assumes every line is shorter than szb (needs <cstring> and <algorithm>).
int nr = 0;          // number of bytes kept from the previous incomplete line
bool last = false;   // true after the last (short) read
while (!last)
{
    // the first nr bytes of buffer hold the carried-over partial line
    last = !ifs.read(buffer + nr, szb - nr);
    nr = nr + static_cast<int>(ifs.gcount());
    char *s, *p = buffer, *pe = p + nr;
    do { // process the complete lines in the buffer
        for (s = p; p != pe && *p != '\n'; p++)
            ;
        if (p != pe || (last && s != pe)) {
            *p = '\0';            // overwrite '\n' (on the last read, pe is still inside the buffer)
            if (p != pe)
                p++;
            lines++;              // TO DO: here s is a null-terminated line to process
            sln += strlen(s);     // (dummy operation for the example)
            s = p;                // this line is consumed, nothing to carry over
        }
    } while (p != pe);            // until the end of the buffer is reached
    std::copy(s, pe, buffer);     // copy the last (incomplete) line to the front of the buffer
    nr = pe - s;                  // and prepare the info for the next iteration
}

String error-checking

I am using a lot of string functions like strncpy, strncat, sprintf, etc. in my code. I know there are better alternatives to these, but I was handed an old project where these functions were used, so I have to stick with them for compatibility and consistency. My supervisor is very fussy about error checking and robustness, and insists that I check for buffer-overflow violations every time I use these functions. This has created a lot of if-else statements in my code, which do not look pretty. My question is: is it really necessary to check for overflow every time I call one of these functions? Even if I know that a buffer overflow can't possibly occur, e.g. when storing an integer in a string using the sprintf function
sprintf(buf,"%d",someInteger);
I know that the maximum length of an unsigned integer on a 64-bit system is 20 digits. buf, on the other hand, is well over 20 characters long. Should I still check for buffer overflow in this case?
I think the way to go is using exceptions. Exceptions are very useful when you must decouple the normal control-flow of a program and error-checking.
What you can do is create a wrapper for every string function in which you perform the error checking and throw an exception if a buffer overflow would occur.
Then, in your client code, you can simply call your wrappers inside a try block, and then check for exceptions and return error codes inside the catch block.
Sample code (not tested):
#include <cstdarg>
#include <cstdio>

// my_buffer_exception is assumed to be defined elsewhere
int sprintf_wrapper( char *buffer, int buffer_size, const char *format, ... )
{
    va_list arg_ptr;
    va_start( arg_ptr, format );
    // vsnprintf writes at most buffer_size bytes and returns the length the
    // full output would have had, so truncation can be detected afterwards
    int ret = vsnprintf( buffer, buffer_size, format, arg_ptr );
    va_end( arg_ptr );
    if( ret < 0 || ret >= buffer_size )  /* a buffer overflow would have occurred */
        throw my_buffer_exception();
    return ret;
}
Error foo()
{
    //...
    try {
        sprintf_wrapper(buf1, 100, "%d", i1);
        sprintf_wrapper(buf2, 100, "%d", i2);
        sprintf_wrapper(buf3, 100, "%d", i3);
    }
    catch( my_buffer_exception& )
    {
        return err_code;
    }
}
Maybe write a test function that you can invoke to test the buffer, to reduce code duplication and ugliness.
You could abstract the if/else statements into a method of another class, and then pass in the buffer and the expected length.
By nature, these buffers are VERY susceptible to overwrites, so be careful ANY TIME you take input from a user or other outside source. You could also try getting the string length (using strlen), or checking for the '\0' end-of-string character yourself, and comparing that to the buffer size. If you loop looking for the '\0' character and it's not there, you will get into an infinite loop unless you constrain the maximum size of your loop by the expected buffer size, so check for this too.
Another option is to refactor the code so that every time those functions are used, you replace them with a length-safe version you write, which performs those checks (but you have to pass the buffer size to it). This may not be possible for some projects, as the added complexity may be very hard to unit test.
Let me address your last paragraph first: you write code once, in contrast to how long it will be maintained and used. Guess how long you think your code will be in use, then multiply that by 10-20 to figure out how long it will actually be in use. At the end of that window it's entirely likely that an integer could be much bigger and overflow your buffer, so yes, you must do buffer checking.
Given that you have a few options:
Use the "n" series of functions like snprintf to prevent buffer overflows and tell your users that it's undefined what will happen if the buffers overflow.
Consider it fatal and either abort() or throw an uncaught exception when a length violation occurs.
Try to notify the user there's a problem and either abort the operation or attempt to let the user modify input and retry.
The first two approaches are definitely going to be easier to implement and maintain because you don't have to worry about getting the right information back to the user in a reasonable way. In any of these cases the check could most likely be factored into a function, as suggested in other answers.
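A minimal sketch of the first option, using snprintf's return value to detect truncation; the helper name is purely illustrative:
#include <cstddef>
#include <cstdio>

// Returns true if the full text fit into buf, false if it would have overflowed.
bool format_int_checked(char* buf, std::size_t bufsize, int value)
{
    int needed = std::snprintf(buf, bufsize, "%d", value);
    // needed < 0: encoding error; needed >= bufsize: output was truncated
    return needed >= 0 && static_cast<std::size_t>(needed) < bufsize;
}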
Finally, let me say: since you've tagged this question C++ and not C, think long and hard about slowly migrating your code base to C++ (because your code base is C right now) and utilizing the C++ facilities, which totally remove the need for these buffer checks, as it happens automatically for you.
You can use gcc's -D_FORTIFY_SOURCE=1 or -D_FORTIFY_SOURCE=2 for buffer overflow detection.
https://securityblog.redhat.com/2014/03/26/fortify-and-you/

ifstream operator >> and error handling

I want to use ifstream to read data from a named pipe. I would like to use its operator>> to read formatted data (typically, an int).
However, I am a bit confused about the way error handling works.
Imagine I want to read an int but only 3 bytes are available. Error bits would be set, but what will happen to these 3 bytes? Will they "disappear", or will they be put back into the stream for later extraction?
As has been pointed out, you can't read binary data over an istream.
But concerning the number-of-available-bytes issue (since you'll probably want to use basic_ios<char> and streambuf for your binary streams): istream and ostream use a streambuf for the actual sourcing and sinking of the bytes. And streambufs normally buffer: the procedure is, if a byte is in the buffer, return it; otherwise, try to reload the buffer, waiting until the reloading has finished or definitively failed. In case of definitive failure, the streambuf returns end of file, and that terminates the input; istream will memorize the end of file and not attempt any more input. So if the type you are reading needs four bytes, it will request four bytes from the streambuf, and will normally wait until those four bytes are there. No error will be set (because there isn't an error); you will simply not return from operator>> until those four bytes arrive.
If you implement your own binary streams, I would strongly recommend using the same pattern; it will allow direct use of already existing standard components like std::ios_base and (perhaps) std::filebuf, and will provide other programmers with an idiom they are familiar with.
If the blocking is a problem, the simplest solution is just to run the input in a separate thread, communicating via a message queue or something similar. (Boost has support for asynchronous IO. This avoids threads, but is globally much more complicated, and doesn't work well with the classical stream idiom.)
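A rough C++11 sketch of that separate-thread idea, using a plain mutex-protected queue; the pipe path and the int payload are purely illustrative, and a real program would need its own shutdown and error handling:
#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

int main()
{
    std::queue<int> values;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Reader thread: blocks on the pipe, never on the consumer.
    std::thread reader([&] {
        std::ifstream pipe("/tmp/my_fifo");          // hypothetical named pipe
        int v;
        while (pipe >> v) {                          // blocks until a whole int is parsed
            std::lock_guard<std::mutex> lock(m);
            values.push(v);
            cv.notify_one();
        }
        std::lock_guard<std::mutex> lock(m);
        done = true;
        cv.notify_one();
    });

    // Consumer: waits on the queue, not on the pipe.
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !values.empty() || done; });
        if (values.empty() && done)
            break;
        int v = values.front();
        values.pop();
        lock.unlock();
        std::cout << "got " << v << '\n';
    }
    reader.join();
}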