What's the difference between read() and getc() - c++

I have two code segments:
while((n=read(0,buf,BUFFSIZE))>0)
if(write(1,buf,n)!=n)
err_sys("write error");
while((c=getc(stdin))!=EOF)
if(putc(c,stdout)==EOF)
err_sys("write error");
Some sayings on internet make me confused. I know that standard I/O does buffering automatically, but I have passed a buf to read(), so read() is also doing buffering, right? And it seems that getc() read data char by char, how much data will the buffer have before sending all the data out?
Thanks

While both functions can be used to read from a file, they are very different. First of all on many systems read is a lower-level function, and may even be a system call directly into the OS. The read function also isn't standard C or C++, it's part of e.g. POSIX. It also can read arbitrarily sized blocks, not only one byte at a time. There's no buffering (except maybe at the OS/kernel level), and it doesn't differ between "binary" and "text" data. And on POSIX systems, where read is a system call, it can be used to read from all kind of devices and not only files.
The getc function is a higher level function. It usually uses buffered input (so input is read in blocks into a buffer, sometimes by using read, and the getc function gets its characters from that buffer). It also only returns a single characters at a time. It's also part of the C and C++ specifications as part of the standard library. Also, there may be conversions of the data read and the data returned by the function, depending on if the file was opened in text or binary mode.
Another difference is that read is also always a function, while getc might be a preprocessor macro.
Comparing read and getc doesn't really make much sense, more sense would be comparing read with fread.

Related

Does my C++ code handle 100GB+ file copying? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 4 years ago.
Improve this question
I need a cross-platform portable function that is able to copy a 100GB+ binary file to a new destination. My first solution was this:
void copy(const string &src, const string &dst)
{
FILE *f;
char *buf;
long len;
f = fopen(src.c_str(), "rb");
fseek(f, 0, SEEK_END);
len = ftell(f);
rewind(f);
buf = (char *) malloc((len+1) * sizeof(char));
fread(buf, len, 1, f);
fclose(f);
f = fopen(dst.c_str(), "a");
fwrite(buf, len, 1, f);
fclose(f);
}
Unfortunately, the program was very slow. I suspect the buffer had to keep 100GB+ in the memory. I'm tempted to try the new code (taken from Copy a file in a sane, safe and efficient way):
std::ifstream src_(src, std::ios::binary);
std::ofstream dst_ = std::ofstream(dst, std::ios::binary);
dst_ << src_.rdbuf();
src_.close();
dst_.close();
My question is about this line:
dst_ << src_.rdbuf();
What does the C++ standard say about it? Does the code compiled to byte-by-byte transfer or just whole-buffer transfer (like my first example)?
I'm curious does the << compiled to something useful for me? Maybe I don't have to invest my time on something else, and just let the compiler do the job inside the operator? If the operator translates to looping for me, why should I do it myself?
PS: std::filesystem::copy is impossible as the code has to work for C++11.
The crux of your question is what happens when you do this:
dst_ << src_.rdbuf();
Clearly this is two function calls: one to istream::rdbuf(), which simply returns a pointer to a streambuf, followed by one to ostream::operator<<(streambuf*), which is documented as follows:
After constructing and checking the sentry object, checks if sb is a null pointer. If it is, executes setstate(badbit) and exits. Otherwise, extracts characters from the input sequence controlled by sb and inserts them into *this until one of the following conditions are met: [...]
Reading this, the answer to your question is that copying a file in this way will not require buffering the entire file contents in memory--rather it will read a character at a time (perhaps with some chunked buffering, but that's an optimization that shouldn't change our analysis).
Here is one implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01075_source.html (__copy_streambufs). Essentially it a loop calling sgetc() and sputc() repeatedly until EOF is reached. The memory required is small and constant.
The C++ standard (I checked C++98, so this should be extremely compatible) says in [lib.ostream.inserters]:
basic_ostream<charT,traits>& operator<<
(basic_streambuf<charT,traits> *sb);
Effects: If sb is null calls setstate(badbit) (which may throw ios_base::failure).
Gets characters from sb and inserts them in *this. Characters are read from sb and inserted until any of the following occurs:
end-of-file occurs on the input sequence;
inserting in the output sequence fails (in which case the character to be inserted is not extracted);
an exception occurs while getting a character from sb.
If the function inserts no characters, it calls setstate(failbit) (which may throw ios_base::failure (27.4.4.3)). If an exception was thrown while extracting a character, the function set failbit in error state, and if failbit is on in exceptions() the caught exception is rethrown.
Returns: *this.
This description says << on rdbuf works on a character-by-character basis. In particular, if inserting of a character fails, that exact character remains unread in the input sequence. This implies that an implementation cannot just extract the whole contents into a single huge buffer upfront.
So yes, there's a loop somewhere in the internals of the standard library that does a byte-by-byte (well, charT really) transfer.
However, this does not mean that the whole thing is completely unbuffered. This is simply about what operator<< does internally. Your ostream object will still accumulate data internally until its buffer is full, then call write (or whatever low-level function your OS uses).
Unfortunately, the program was very slow.
Your first solution is wrong for a very simple reason: it reads the entire source file in memory, then write it entirely.
Files have been invented (perhaps in the 1960s) to handle data that don't fit in memory (and has to be in some "slower" storage, at that time hard disks or drums, or perhaps even tapes). And they have always been copied by "chunks".
The current (Unix-like) definition of file (as a sequence of bytes than is open-ed, read, write-n, close-d) is more recent than 1960s. Probably the late 1970s or early 1980s. And it comes with the notion of streams (which has been standardized in C with <stdio.h> and in C++ with std::fstream).
So your program has to work (like every file copying program today) for files much bigger than the available memory.You need some loop to read some buffer, write it, and repeat.
The size of the buffer is very important. If it is too small, you'll make too many IO operations (e.g. system calls). If it is too big, IO might be inefficient or even not work.
In practice, the buffer should today be much less than your RAM, typically several megabytes.
Your code is more C like than C++ like because it uses fopen. Here is a possible solution in C with <stdio.h>. If you code in genuine C++, adapt it to <fstream>:
void copyfile(const char*destpath, const char*srcpath) {
// experiment with various buffer size
#define MYBUFFERSIZE (4*1024*1024) /* four megabytes */
char* buf = malloc(MYBUFFERSIZE);
if (!buf) { perror("malloc buf"); exit(EXIT_FAILURE); };
FILE* filsrc = fopen(srcpath, "r");
if (!filsrc) { perror(srcpath); exit(EXIT_FAILURE); };
FILE* fildest = fopen(destpath, "w");
if (!fildest) { perror(destpath); exit(EXIT_FAILURE); };
for (;;) {
size_t rdsiz = fread(buf, 1, MYBUFFERSIZE, filsrc);
if (rdsiz==0) // end of file
break;
else if (rdsiz<0) // input error
{ perror("fread"); exit(EXIT_FAILURE); };
size_t wrsiz = fwrite(buf, rdsiz, 1, fildest);
if (wrsiz != 1) { perror("fwrite"); exit(EXIT_FAILURE); };
}
if (fclose(filsrc)) { perror("fclose source"); exit(EXIT_FAILURE); };
if (fclose(fildest)) { perror("fclose dest"); exit(EXIT_FAILURE); };
}
For simplicity, I am reading the buffer in byte components and writing it as a whole. A better solution is to handle partial writes.
Apparently dst_ << src_.rdbuf(); might do some loop internally (I have to admit I never used it and did not understand that at first; thanks to Melpopene for correcting me). But the actual buffer size matters a big lot. The two other answers (by John Swinck and by melpomene) focus on that rdbuf() thing. My answer focus on explaining why copying can be slow when you do it like in your first solution, and why you need to loop and why the buffer size matters a big lot.
If you really care about performance, you need to understand implementation details and operating system specific things. So read Operating systems: three easy pieces. Then understand how, on your particular operating system, the various buffering is done (there are several layers of buffers involved: your program buffers, the standard stream buffers, the kernel buffers, the page cache). Don't expect your C++ standard library to buffer in an optimal fashion.
Don't even dream of coding in standard C++ (without operating system specific stuff) an optimal or very fast copying function. If performance matters, you need to dive in OS specific details.
On Linux, you might use time(1), oprofile(1), perf(1) to measure your program's performance. You could use strace(1) to understand the various system calls involved (see syscalls(2) for a list). You might even code (in a Linux specific way) using directly the open(2), read(2), write(2), close(2) and perhaps readahead(2), mmap(2), posix_fadvise(2), madvise(2), sendfile(2) system calls.
At last, large file copying are limited by disk IO (which is the bottleneck). So even by spending days in optimizing OS specific code, you won't win much. The hardware is the limitation. You probably should code what is the most readable code for you (it might be that dst_ << src_.rdbuf(); thing which is looping) or use some library providing file copy. You might win a tiny amount of performance by tuning the various buffer sizes.
If the operator translates to looping for me, why should I do it myself?
Because you have no explicit guarantee on the actual buffering done (at various levels). As I explained, buffering matters for performance. Perhaps the actual performance is not that critical for you, and the ordinary settings of your system and standard library (and their default buffers sizes) might be enough.
PS. Your question contains at least 3 different questions (but related ones). I don't find it clear (so downvoted it), because I did not understand what is the most relevant one. Is it : performance? robustness? meaning of dst_ << src_.rdbuf();? Why is the first solution slow? How to copy large files quickly?

WriteFile overlapped and fwrite equivalent

On Windows, the WriteFile() function has a parameter called lpOverlapped which lets you specify an offset at which to write to the file.
I was wondering, is there is an fwrite() cross-platform equivalent of that?
I see that if the file is opened with the rb+ flag, I might be able to use fseek() to write to a particular offset. My question is - will this approach be equivalent to the overlapped WriteFile(), and will it produce the same behaviour on all platforms?
Background
The reason I need this is because I am writing blocked compressed data streams to a file, and I want to be able to load a specific block from the file and be able to decompress it. So, basically if I keep track of where the block begins in a file, I can load the block and decompress it in a more efficient manner. I know that there are probably better ways to do this, but I need this solution for some backwards compatibility.
Assuming you are okay with using POSIX functions and not just things from the C or C++ standard libraries, the solution is pwrite (aka: positioned write).
ssize_t rc = pwrite(file_handle, data_ptr, data_size, destination_offset);
I think you are confusing "overlapped" and "overwrite"/"offset." I didn't study up on the specifics of why Microsoft explicitly says overlapped writes include a parameter for offset (I think it makes sense as I describe below). In general, when Microsoft talks about "overlapped" IO, they are talking about how to synchronize events like starting to write the file, receiving notification that the write completed, and starting another write to the file which might or might not overlap with a previous write. In this last case, by overlap I mean what you would think that overlap means, ie overlaps within the contents of the file. Whereas Microsoft means that writing the file overlaps in time with your thread running, or not. Note that this gets very complicated if more than one thread can write the same file.
If possible, and surely if you want portable code, you want to avoid all this nonsense and just do the simplest write possible in each context, which means avoid Microsoft optimizations like "overlapped IO" unless you really need performance. (And if you need absolutely optimal performance, you might want to cache the file yourself and manage the overlaps, then write it once from start to finish.)
While pwrite is probably the best solution, there is an alternative that sticks with stdio functions. Unfortunately, to make it thread-safe, you're using non-standard "stdio" to take direct control of the FILE*'s internal lock, and the names aren't portable. Specifically, POSIX defines one set of "take/release file lock" names and Windows defines another set (_lock_file/_unlock_file).
That said, you could use these semi-portable constructs to use stdio functions to ensure no buffering conflicts (pwrite to fileno(some_FILE_star) could cause problems if the FILE* buffer overlaps the pwrite location, since pwrite won't fix up the buffer):
// Error checking omitted; you should actually check returns in real code
size_t pfwrite(const void *ptr, size_t size, size_t n,
size_t offset, FILE *stream) {
// Take FILE*'s lock and hold it for entire transaction
flockfile(stream); // _lock_file on Windows
// Record position
long origpos = ftell(stream);
// Seek to desired offset and write
fseek(stream, offset, SEEK_SET); // Possibly offset * size, not just offset?
size_t written = fwrite(ptr, size, n, stream);
// Seek back to original position
fseek(stream, origpos, SEEK_SET);
// Release FILE*'s lock now that transaction complete
funlockfile(stream); // _unlock_file on Windows
return written;
}

ifstream operator >> and error handling

I want to use ifstream to read data from a named piped. I would like to use its operator>> to read formatted data (typically, an int).
However, I am a bit confused in the way error handling works.
Imagine I want to read an int but only 3 bytes are available. Errors bits would be set, but what will happen to theses 3 bytes ? Will they "disappear", will they be put back into the stream for later extraction ?
Thanks,
As has been pointed out, you can't read binary data over an istream.
But concerning the number of available bytes issue (since you'll
probably want to use basic_ios<char> and streambuf for your binary
streams): istream and ostream use a streambuf for the actual
sourcing and sinking of the bytes. And streambuf normally buffer: the
procedure is: if a byte is in the buffer, return it, otherwise, try to
reload the buffer, waiting until the reloading has finished, or
definitively failed. In case of definitive failure, the streambuf
returns end of file, and that terminates the input; istream will
memorize the end of file, and not attempt any more input. So if the
type you are reading needs four bytes, it will request four bytes from
the streambuf, and will normally wait until those four bytes are
there. No error will be set (because there isn't an error); you will
simply not return from the operator>> until those four bytes arrive.
If you implement your own binary streams, I would strongly recommend
using the same pattern; it will allow direct use of already existing
standard components like std::ios_base and (perhaps) std::filebuf,
and will provide other programmers with an idiom they are familiar with.
If the blocking is a problem, the simplest solution is just to run the
input in a separate thread, communicating via a message queue or
something similar. (Boost has support for asynchronous IO. This avoids
threads, but is globally much more complicated, and doesn't work well
with the classical stream idiom.)

Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a column of a matrix into a file. I've discovered that it pads the file with 4 bytes on either end, however I don't really understand why, or how to control this behavior. Is there a way to remove the padding?
For unformated IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most but not all compilers use four bytes. This aids in reading records, e.g., length at the end assists with a backspace operation. You can suppress this with the new Stream IO mode of Fortran 2003, which was added for compatibility with other languages. Use access='stream' in your open statement.
I never used sequential access with unformatted output for this exact reason. However it depends on the application and sometimes it is convenient to have a record length indicator (especially for unstructured data). As suggested by steabert in Looking at binary output from fortran on gnuplot, you can avoid this by using keyword argument ACCESS = 'DIRECT', in which case you need to specify record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). Following example writes an unformatted file whose size equals the size of the array:
REAL(KIND=4),DIMENSION(10) :: a = 3.141
INTEGER :: reclen
INQUIRE(iolength=reclen)a
OPEN(UNIT=10,FILE='direct.out',FORM='UNFORMATTED',&
ACCESS='DIRECT',RECL=reclen)
WRITE(UNIT=10,REC=1)a
CLOSE(UNIT=10)
END
Note that this is not the ideal aproach in sense of portability. In an unformatted file written with direct access, there is no information about the size of each element. A readme text file that describes the data size does the job fine for me, and I prefer this method instead of padding in sequential mode.
Fortran IO is record based, not stream based. Every time you write something through write() you are not only writing the data, but also beginning and end markers for that record. Both record markers are the size of that record. This is the reason why writing a bunch of reals in a single write (one record: one begin marker, the bunch of reals, one end marker) has a different size with respect to writing each real in a separate write (multiple records, each of one begin marker, one real, and one end marker). This is extremely important if you are writing down large matrices, as you could balloon the occupation if improperly written.
Fortran Unformatted IO I am quite familiar with differing outputs using the Intel and Gnu compilers. Fortunately my vast experience dating back to 1970's IBM's allowed me to decode things. Gnu pads records with 4 byte integer counters giving the record length. Intel uses a 1 byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long record lengths even though only 1 byte is used.
I have software compiled by the Gnu compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the "old' IBM days) takes "forever" using Gnu's fgetc or opening the file in stream mode. Converting the file to what Gnu expects results in a factor of up to 100 times faster. It depends on your file size if you want to bother with detection and conversion or not. I reduced my program startup time (which opens a large unformatted file) from 5 minutes down to 10 seconds. I had to add in options to reconvert back again if the user wants to take the file back to an Intel compiled program. It's all a pain, but there you go.

Will fseek function flush data in the buffer in C++?

We know that call to functions like fprintf or fwrite will not write data to the disk immediately, instead, the data will be buffered until a threshold is reached. My question is, if I call the fseek function, will these buffered data writen to disk before seeking to the new position? Or the data is still in the buffer, and is writen to the new position?
cheng
I'm not aware if the buffer is guaranteed to be flushed, it may not if you seek to a position close enough. However there is no way that the buffered data will be written to the new position. The buffering is just an optimization, and as such it has to be transparent.
Yes; fseek() ensures that the file will look like it should according to the fwrite() operations you've performed.
The C standard, ISO/IEC 9899:1999 ยง7.19.9.2 fseek(), says:
The fseek function sets the file position indicator for the stream pointed to by stream.
If a read or write error occurs, the error indicator for the stream is set and fseek fails.
I don't believe that it's specified that the data must be flushed on a fseek but when the data is actually written to disk it must be written at that position that the stream was at when the write function was called. Even if the data is still buffered, that buffer can't be written to a different part of the file when it is flushed even if there has been a subsequent seek.
It seems that your real concern is whether previously-written (but not yet flushed) data would end up in the wrong place in the file if you do an fseek.
No, that won't happen. It'll behave as you'd expect.
I have vague memories of a requirement that you call fflush before
fseek, but I don't have my copy of the C standard available to verify.
(If you don't it would be undefined behavior or implementation defined,
or something like that.) The common Unix standard specifies that:
If the most recent operation, other than ftell(), on a given stream is
fflush(), the file offset in the underlying open file description
shall be adjusted to reflect the location specified by fseek().
[...]
If the stream is writable and buffered data had not been written to
the underlying file, fseek() shall cause the unwritten data to be
written to the file and shall mark the st_ctime and st_mtime fields of
the file for update.
This is marked as an extention to the ISO C standard, however, so you can't count on it except on Unix platforms (or other platforms which make similar guarantees).