strange file buffer size in chromium - c++

Reading the Chromium sources, I found this interesting code for comparing the contents of two files. The interesting part is the stack-allocated buffer size:
const int BUFFER_SIZE = 2056;
char buffer1[BUFFER_SIZE], buffer2[BUFFER_SIZE];
do {
  file1.read(buffer1, BUFFER_SIZE);
  file2.read(buffer2, BUFFER_SIZE);

  if ((file1.eof() != file2.eof()) ||
      (file1.gcount() != file2.gcount()) ||
      (memcmp(buffer1, buffer2, static_cast<size_t>(file1.gcount())))) {
    file1.close();
    file2.close();
    return false;
  }
} while (!file1.eof() || !file2.eof());
The first question is why such a strange buffer size was chosen. git blame shows nothing interesting about it. The only guess I have is that 2056 = 2048 + 8 is supposed to induce read-ahead behavior from such a high level of abstraction. In other words, the logic would be: on the first read we get one full buffer of 2048 bytes plus 8 more, and if the internal system I/O buffer size is 2048, those extra 8 bytes force the next block to be read. By the time we issue the next read, the next buffer has already been fetched by the implementation, and so on by induction.
The second question is why exactly 2048 is so commonly chosen as the ubiquitous buffer size. Why not something like PAGE_SIZE or BUFSIZ?

I asked on the Chromium development mailing list, and here are some of the answers:
Scott Hess shess#chromium.org
I doubt there was any reason other than that the buffer needs to have
some size. I'd have chosen 4096, myself, since most filesystem block
sizes are that these days. But iostream already has internal
buffering so it's not super important.
So it seems there was no particular reason for exactly this buffer size.

Related

Does my C++ code handle 100GB+ file copying? [closed]

I need a cross-platform portable function that is able to copy a 100GB+ binary file to a new destination. My first solution was this:
void copy(const string &src, const string &dst)
{
    FILE *f;
    char *buf;
    long len;

    f = fopen(src.c_str(), "rb");
    fseek(f, 0, SEEK_END);
    len = ftell(f);
    rewind(f);

    buf = (char *) malloc((len+1) * sizeof(char));
    fread(buf, len, 1, f);
    fclose(f);

    f = fopen(dst.c_str(), "a");
    fwrite(buf, len, 1, f);
    fclose(f);
}
Unfortunately, the program was very slow. I suspect the buffer had to keep 100GB+ in memory. I'm tempted to try the new code below (taken from Copy a file in a sane, safe and efficient way):
std::ifstream src_(src, std::ios::binary);
std::ofstream dst_ = std::ofstream(dst, std::ios::binary);
dst_ << src_.rdbuf();
src_.close();
dst_.close();
My question is about this line:
dst_ << src_.rdbuf();
What does the C++ standard say about it? Does the code compile to a byte-by-byte transfer or to a whole-buffer transfer (like my first example)?
I'm curious whether << compiles to something useful for me. Maybe I don't have to invest my time in anything else and can just let the compiler do the job inside the operator? If the operator translates to looping for me, why should I do it myself?
PS: std::filesystem::copy is impossible, as the code has to work with C++11.
The crux of your question is what happens when you do this:
dst_ << src_.rdbuf();
Clearly this is two function calls: one to istream::rdbuf(), which simply returns a pointer to a streambuf, followed by one to ostream::operator<<(streambuf*), which is documented as follows:
After constructing and checking the sentry object, checks if sb is a null pointer. If it is, executes setstate(badbit) and exits. Otherwise, extracts characters from the input sequence controlled by sb and inserts them into *this until one of the following conditions are met: [...]
Reading this, the answer to your question is that copying a file this way will not require buffering the entire file contents in memory; rather, it will read a character at a time (perhaps with some chunked buffering, but that's an optimization that shouldn't change our analysis).
Here is one implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01075_source.html (__copy_streambufs). Essentially it is a loop calling sgetc() and sputc() repeatedly until EOF is reached. The memory required is small and constant.
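As a rough illustration, the loop such an implementation performs boils down to something like the following sketch (this is not the actual libstdc++ code, just the shape of it; real implementations batch the work with sgetn()/sputn() or direct get-area access):

#include <fstream>
#include <streambuf>

// Copies everything from `in` to `out` through the two stream buffers, one
// character at a time. Memory use is a couple of fixed-size stream buffers,
// never the whole file.
bool copy_streambuf(std::streambuf* in, std::streambuf* out) {
    typedef std::char_traits<char> traits;
    bool wrote_any = false;
    for (int c = in->sgetc(); c != traits::eof(); c = in->snextc()) {
        if (out->sputc(traits::to_char_type(c)) == traits::eof())
            break;          // insertion failed; the failing character stays unread
        wrote_any = true;
    }
    return wrote_any;       // mirrors the "inserts no characters" failbit rule
}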
The C++ standard (I checked C++98, so this should be extremely compatible) says in [lib.ostream.inserters]:
basic_ostream<charT,traits>& operator<<
(basic_streambuf<charT,traits> *sb);
Effects: If sb is null calls setstate(badbit) (which may throw ios_base::failure).
Gets characters from sb and inserts them in *this. Characters are read from sb and inserted until any of the following occurs:
end-of-file occurs on the input sequence;
inserting in the output sequence fails (in which case the character to be inserted is not extracted);
an exception occurs while getting a character from sb.
If the function inserts no characters, it calls setstate(failbit) (which may throw ios_base::failure (27.4.4.3)). If an exception was thrown while extracting a character, the function sets failbit in error state, and if failbit is on in exceptions() the caught exception is rethrown.
Returns: *this.
This description says << on rdbuf works on a character-by-character basis. In particular, if inserting of a character fails, that exact character remains unread in the input sequence. This implies that an implementation cannot just extract the whole contents into a single huge buffer upfront.
So yes, there's a loop somewhere in the internals of the standard library that does a byte-by-byte (well, charT really) transfer.
However, this does not mean that the whole thing is completely unbuffered. This is simply about what operator<< does internally. Your ostream object will still accumulate data internally until its buffer is full, then call write (or whatever low-level function your OS uses).
Unfortunately, the program was very slow.
Your first solution is wrong for a very simple reason: it reads the entire source file into memory, then writes it out in one go.
Files were invented (perhaps in the 1960s) to handle data that doesn't fit in memory (and has to live in some "slower" storage, at that time hard disks or drums, or perhaps even tapes). And they have always been copied in "chunks".
The current (Unix-like) definition of a file (as a sequence of bytes that is open-ed, read, write-n, close-d) is more recent than the 1960s, probably the late 1970s or early 1980s. And it comes with the notion of streams (standardized in C with <stdio.h> and in C++ with std::fstream).
So your program has to work (like every file-copying program today) for files much bigger than the available memory. You need some loop to read into a buffer, write it out, and repeat.
The size of the buffer is very important. If it is too small, you'll make too many IO operations (e.g. system calls). If it is too big, IO might be inefficient or even not work.
In practice, the buffer should today be much less than your RAM, typically several megabytes.
Your code is more C-like than C++-like because it uses fopen. Here is a possible solution in C with <stdio.h>. If you code in genuine C++, adapt it to <fstream>:
#include <stdio.h>
#include <stdlib.h>

void copyfile(const char *destpath, const char *srcpath) {
    // experiment with various buffer sizes
#define MYBUFFERSIZE (4*1024*1024) /* four megabytes */
    char *buf = malloc(MYBUFFERSIZE);
    if (!buf) { perror("malloc buf"); exit(EXIT_FAILURE); };
    FILE *filsrc = fopen(srcpath, "rb");   /* binary mode matters on Windows */
    if (!filsrc) { perror(srcpath); exit(EXIT_FAILURE); };
    FILE *fildest = fopen(destpath, "wb");
    if (!fildest) { perror(destpath); exit(EXIT_FAILURE); };
    for (;;) {
        size_t rdsiz = fread(buf, 1, MYBUFFERSIZE, filsrc);
        if (rdsiz == 0) {                  /* end of file or read error */
            if (ferror(filsrc)) { perror("fread"); exit(EXIT_FAILURE); };
            break;
        }
        size_t wrsiz = fwrite(buf, rdsiz, 1, fildest);
        if (wrsiz != 1) { perror("fwrite"); exit(EXIT_FAILURE); };
    }
    if (fclose(filsrc)) { perror("fclose source"); exit(EXIT_FAILURE); };
    if (fclose(fildest)) { perror("fclose dest"); exit(EXIT_FAILURE); };
    free(buf);
}
For simplicity, I am reading the buffer in byte components and writing it as a whole. A better solution is to handle partial writes.
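For reference, the same loop adapted to <fstream> might look roughly like this (a sketch with minimal error handling; the 4 MB buffer and the function name are arbitrary choices):

#include <fstream>
#include <string>
#include <vector>

// Chunked copy using C++ streams: read up to buf.size() bytes, then write
// exactly the gcount() bytes actually obtained (the last chunk is short).
bool copyfile_cxx(const std::string& srcpath, const std::string& destpath) {
    std::ifstream src(srcpath.c_str(), std::ios::binary);
    std::ofstream dst(destpath.c_str(), std::ios::binary);
    if (!src || !dst) return false;

    std::vector<char> buf(4 * 1024 * 1024);   // experiment with the size
    while (src) {
        src.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize n = src.gcount();
        if (n > 0 && !dst.write(buf.data(), n))
            return false;                      // write error
    }
    return src.eof() && dst.good();            // clean end: EOF on input, no output error
}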
Apparently dst_ << src_.rdbuf(); might do some loop internally (I have to admit I never used it and did not understand that at first; thanks to melpomene for correcting me). But the actual buffer size matters a great deal. The two other answers (by John Swinck and by melpomene) focus on that rdbuf() thing. My answer focuses on explaining why copying can be slow when you do it as in your first solution, why you need to loop, and why the buffer size matters so much.
If you really care about performance, you need to understand implementation details and operating-system-specific things. So read Operating Systems: Three Easy Pieces. Then understand how, on your particular operating system, the various layers of buffering work (there are several involved: your program's buffers, the standard stream buffers, the kernel buffers, the page cache). Don't expect your C++ standard library to buffer in an optimal fashion.
Don't even dream of coding, in standard C++ (without operating-system-specific stuff), an optimal or very fast copying function. If performance matters, you need to dive into OS-specific details.
On Linux, you might use time(1), oprofile(1), perf(1) to measure your program's performance. You could use strace(1) to understand the various system calls involved (see syscalls(2) for a list). You might even code (in a Linux-specific way) directly against the open(2), read(2), write(2), close(2) and perhaps readahead(2), mmap(2), posix_fadvise(2), madvise(2), sendfile(2) system calls.
Finally, copying large files is limited by disk IO, which is the bottleneck. So even after spending days optimizing OS-specific code, you won't win much. The hardware is the limitation. You should probably write whatever is the most readable code for you (it might be that dst_ << src_.rdbuf(); thing, which loops internally) or use some library providing file copy. You might win a tiny amount of performance by tuning the various buffer sizes.
If the operator translates to looping for me, why should I do it myself?
Because you have no explicit guarantee about the actual buffering done (at the various levels). As I explained, buffering matters for performance. Perhaps the actual performance is not that critical for you, and the ordinary settings of your system and standard library (and their default buffer sizes) might be enough.
PS. Your question contains at least three different (but related) questions, and I don't find it clear (so I downvoted it), because I could not tell which one is the most relevant: performance? robustness? the meaning of dst_ << src_.rdbuf();? why the first solution is slow? how to copy large files quickly?

How to read 4GB file on 32bit system

In my case I have various files; let's assume I have a >4GB file with data. I want to read that file line by line and process each line. One of my restrictions is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (min 4GB). You can also assume that processing these lines isn't the bottleneck.
In my current solution I read the file with an ifstream and copy each line into a string. Here is a snippet of how it looks:
std::ifstream file(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}
And OK, that works, but too slowly. Here is the time for my 3.6 GB of data:
real 1m4.155s
user 0m0.000s
sys 0m0.030s
I'm looking for a method that will be much faster than that. For example, I found How to parse space-separated floats in C++ quickly? and I loved the presented solution with boost::mapped_file, but I ran into another problem: what if my file is too big? In my case a 1GB file was enough to bring the whole process down. I have to care about how much data is in memory at once; the people who will be using this tool probably don't have more than 4 GB of RAM installed.
So I found mapped_file from Boost, but how do I use it in my case? Is it possible to read that file partially and still receive these lines?
Maybe you have another, much better solution. I just have to process each line.
Thanks,
Bart
Nice to see you found my benchmark at How to parse space-separated floats in C++ quickly?
It seems you're really looking for the fastest way to count lines (or perform any linear single-pass analysis). I've done a similar analysis, and a benchmark of exactly that, here:
Fast textfile reading in c++
Interestingly, you'll see that the most performant code there does not need to rely on memory mapping at all.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// abort with a message on a failed system call
static void handle_error(const char* msg) { perror(msg); exit(EXIT_FAILURE); }

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, 1);  // FDADVICE_SEQUENTIAL

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    close(fd);
    return lines;
}
The case of a 64-bit system with little memory should be fine for loading a large file into a mapping; it's all about address space, although it may well be slower than the "fastest" option in that case. It really depends on what else is in memory and how much of the address space is available for mapping the file into. On a 32-bit system it won't work, since the pointers into the file mapping can't go beyond about 3.5GB at the very most, and typically around 2GB is the maximum, again depending on what memory addresses are available to the OS to map the file into.
However, the benefit of memory-mapping a file is pretty small: the vast majority of the time is spent actually reading the data. The saving from memory mapping comes from not having to copy the data once it is in RAM (with other file-reading mechanisms, the read function copies the data into the buffer you supply, whereas a memory-mapped file puts it straight into the correct location).
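On the "is it possible to read that file partially" point: you do not have to map the whole file at once. Below is a rough POSIX sketch that maps one window at a time, so only the window's worth of address space is needed regardless of file size (Boost's mapped_file accepts similar offset/length parameters, and the Windows equivalent is CreateFileMapping/MapViewOfFile with an explicit offset). The function name and the 64 MB window are my own choices; on 32-bit Linux you would also compile with -D_FILE_OFFSET_BITS=64:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Counts '\n' bytes by mapping the file one window at a time. mmap offsets
// must be page-aligned; 64 MB is a multiple of any common page size.
// Error handling is kept minimal on purpose.
uintmax_t count_lines_windowed(const char* fname, size_t window = 64u << 20) {
    int fd = open(fname, O_RDONLY);
    if (fd == -1) { perror("open"); return 0; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); close(fd); return 0; }

    uintmax_t lines = 0;
    for (off_t off = 0; off < st.st_size; off += (off_t)window) {
        size_t len = (size_t)std::min<off_t>((off_t)window, st.st_size - off);
        void* p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) { perror("mmap"); break; }

        const char* s = (const char*)p;
        for (const char* nl = s;
             (nl = (const char*)memchr(nl, '\n', (s + len) - nl)) != NULL;
             ++nl)
            ++lines;

        munmap(p, len);
    }
    close(fd);
    return lines;
}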
You might want to look at increasing the buffer of the ifstream; the default buffer is often rather small, which leads to lots of expensive reads.
You should be able to do this using something like:
char buffer[1024*1024];
std::ifstream file;
// Note: with some implementations, pubsetbuf() only takes effect if it is
// called before the file is opened (or at least before any I/O is done).
file.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
file.open(filename_xml.c_str());

uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}
See this question for more info:
How to get IOStream to perform better?
Since this is Windows, you can use the native Windows file functions with the "Ex" suffix:
windows file management functions
specifically functions like GetFileSizeEx(), SetFilePointerEx(), etc. The plain read and write functions are limited to 32-bit byte counts, and the read and write "Ex" functions are intended for asynchronous I/O rather than for handling large files.
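As a rough sketch of how those calls fit together for a single sequential pass over a file larger than 4 GB (the file name, chunk size and line counting are my own additions; SetFilePointerEx() would only be needed for random access, since ReadFile() advances the file pointer by itself):

#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    HANDLE h = CreateFileA("huge.xml", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(h, &size);            // 64-bit file size, even on 32-bit Windows

    std::vector<char> buf(1 << 20);     // 1 MB chunks, well under the 32-bit limit
    unsigned long long lines = 0, total = 0;
    DWORD got = 0;
    while (ReadFile(h, &buf[0], (DWORD)buf.size(), &got, NULL) && got > 0) {
        for (DWORD i = 0; i < got; ++i)
            if (buf[i] == '\n') ++lines;
        total += got;
    }
    CloseHandle(h);
    std::printf("%llu lines, %llu of %lld bytes\n",
                lines, total, (long long)size.QuadPart);
    return 0;
}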

Reading in a file in 1kB blocks and writing to another file with cstdio

I'm trying to write a simple program that reads data from a file (input) in blocks of 1024 bytes, then writes the data to another file (output). The program works so far, but the problem I've run into is that if it reaches the end of the file and the last read isn't a neat block of 1024 bytes, then it outputs garbage data for the rest of the array. I've got this working fine using fstream functions, but the problem appears when I use the cstdio functions (the assignment is to use fread and fwrite). Here's my code thus far:
#include <cstdio>
using namespace std;

int main()
{
    FILE* fin;
    FILE* fout;
    char block[1024];

    fin = fopen("input", "r");
    fout = fopen("output", "w+");

    while (!feof(fin))
    {
        fread(block, 1024, 1, fin);
        fwrite(block, 1, 1024, fout);
    }

    fclose(fin);
    fclose(fout);
    return 0;
}
I'm sure it's a simple fix, but I can't seem to find any information about it on cplusplus.com and I can't figure out how to word the question on google. I appreciate your thoughts on this.
You have your size and count arguments the wrong way around in the fread. If you attempt to fread one item of size 1K and there are only fifteen bytes left in the file, you'll get nothing, and that tail of the file will forever remain unread. That is, until your fwrite calls fill up the disk; then you'll know about it.
In other words, you'll never see that last fifteen bytes. That's because, while fread will happily give you less elements than you ask for, it will only give you whole elements, not partial ones.
What you need to do is to try and read 1024 items of size one byte each (rather than one item of 1024 bytes).
fread also returns the actual number of items read (which, as noted above, may be less than what you asked for) and that's what you should pass to fwrite (a):
size_t bytCount;

while (!feof(fin)) {
    bytCount = fread(block, 1, sizeof(block), fin);
    fwrite(block, 1, bytCount, fout);
}
You'll see I've also changed the magic number 1024 to sizeof(block) - this will minimise the source code changes needed if you ever up the buffer size.
(a) If you wanted to be truly robust, fwrite also returns the number of items written, which may be less than what you asked for. Perfect code would check this condition as well.
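A sketch of that "truly robust" variant, which also retries partial writes, might look like this (the function name and the bool return convention are my own):

#include <cstdio>

// Copies fin to fout, checking both sides. fwrite may accept fewer bytes
// than requested (e.g. disk full), so the inner loop keeps writing the
// remainder until everything is out or an error occurs.
bool copy_stream(FILE* fin, FILE* fout) {
    char block[1024];
    size_t got;
    while ((got = fread(block, 1, sizeof(block), fin)) > 0) {
        size_t written = 0;
        while (written < got) {
            size_t n = fwrite(block + written, 1, got - written, fout);
            if (n == 0)
                return false;          // write error
            written += n;
        }
    }
    return !ferror(fin);               // true only if the read loop ended at EOF
}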
1024 is your buffer size. fread returns the actual number of bytes read from the file (put in the buffer). Make use of it.

Adding control symbols to a byte stream

Problem: sometimes we have to interleave multiple streams into one, in which case it's necessary to provide a way to identify block boundaries within a stream. What kind of format would be good for such a task?
(All processing has to be purely sequential, and I/O operations are block-wise and aligned.)
1. From the decoding side, the best way is to have length prefixes for the blocks. But on the encoding side, this requires either random access to the output file (seek back to the stream start and write the header), or being able to cache whole streams, which is, in general, impossible.
2. Alternatively, we can add length headers (plus some flags) to blocks of cacheable size. That is surely a solution, but handling is much more complex than [1], especially at encoding (presuming that I/O operations are done with aligned fixed-size blocks).
Well, one possible implementation is to write a 0 byte into the buffer, then stream data until it is filled. So a prefix byte of 0 would mean that there are bufsize-1 bytes of stream data next, and a non-zero prefix would mean there is less... in which case we would be able to insert another prefix byte if end-of-stream is reached. This would only work with bufsize = 32k or so, because otherwise the block length would require 3+ bytes to store, and there would be a problem handling the case where end-of-stream is reached with only one byte of free space left in the buffer. (One solution to that would be storing 2-byte prefixes in each buffer and adding a 3rd byte when necessary; another is to provide a 2-byte encoding for some special block lengths like bufsize-2.)
Either way it's not so good, because even 1 extra byte per 64k accumulates to a noticeable number with large files (1526 bytes per 100M). Also, hardcoding the block size into the format is bad.
3. Escape prefix. E.g. EC 4B A7 00 = EC 4B A7, EC 4B A7 01 = end-of-stream. Now this is really easy to encode, but decoding is pretty painful; it requires a messy state machine even to extract single bytes. But overall it adds the least overhead, so it seems we still need to find a good implementation for buffered decoding.
4. Escape prefix with all-same bytes (e.g. FF FF FF). Much easier to check, but runs of the same byte in the stream would produce a huge overhead (like 25%), and that is not unlikely with any byte value chosen as the escape code.
5. Escape postfix. Store the payload byte before the marker; then the decoder just has to skip 1 byte before a masked marker, and 4 bytes for a control code. So this basically introduces a fixed 4-byte delay in the decoder, while [3] has a complex path where marker bytes have to be returned one by one. Still, with [3] the encoder is much simpler (it just has to write an extra 0 when the marker matches), and this doesn't really simplify the buffer processing.
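For reference, the encoder side of [3] really is tiny. A minimal sketch, assuming the EC 4B A7 marker from the example above (the function names and the caller-held matched counter are my own):

#include <cstdio>

static const unsigned char kMark[3] = { 0xEC, 0x4B, 0xA7 };

// Writes one payload byte and appends an escape 00 whenever the payload
// happens to spell out the full marker. *matched is the number of marker
// bytes matched so far and is the encoder's only state.
void PutByte(FILE* out, int c, size_t* matched) {
    putc(c, out);
    if ((unsigned char)c == kMark[*matched]) {
        if (++*matched == 3) {          // payload accidentally formed the marker
            putc(0x00, out);            // escape it
            *matched = 0;
        }
    } else {
        *matched = ((unsigned char)c == kMark[0]) ? 1 : 0;
    }
}

// Marker followed by 01 terminates the stream.
void EndStream(FILE* out, size_t* matched) {
    fwrite(kMark, 1, 3, out);
    putc(0x01, out);
    *matched = 0;
}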
Update: Actually I'm pretty sure that [3] or [5] would be the option I'd use; I only listed the other options in the hope of getting more alternatives (for example, it would be fine if the redundancy is 1 bit per block on average). So the main question at the moment is how to parse the stream for [3]... the current state machine looks like this:
int saved_c;
int o_last, c_last;

int GetByte( FILE* f ) {
    int c;
Start:
    if( o_last>=10 ) {
        if( c_last>=(o_last-10) ) { c=saved_c; o_last=0; }
        else c = byte("\xEC\x4B\xA7"[c_last++]);
    } else {
        c = getc(f);
        if( o_last<3 ) {
            if( char(c)==("\xEC\x4B\xA7"[o_last]) ) { o_last++; goto Start; }
            else if( o_last>0 ) { saved_c=c; c_last=0; o_last+=10; goto Start; } // 11,12
            // else just return c
        } else {
            if( c>0 ) { c=-1-c; o_last=0; printf( "c=%i\n", c ); }
            else { saved_c=0xA7; c_last=0; o_last+=10-1; goto Start; } // 12
        }
    }
    return c;
}
and it's certainly ugly (and slow).
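For what it's worth, the marker EC 4B A7 has no self-overlapping prefix, so the decoder can be written without goto: on a mismatch, flush the partially matched marker bytes and re-examine the current byte from scratch. A sketch of that idea (C++11, names mine; GetByte() returns -1 for end-of-stream, whether from the 01 control code or from physical EOF):

#include <cstdio>
#include <deque>

class EscapeDecoder {
    static const unsigned char MARK[3];
    FILE* f_;
    std::deque<int> pending_;   // decoded bytes ready to hand out (-1 = end-of-stream)
    size_t matched_ = 0;        // marker bytes matched so far

    void refill() {
        while (pending_.empty()) {
            int c = getc(f_);
            if (c == EOF) {                      // physical EOF: flush any partial match
                for (size_t i = 0; i < matched_; ++i) pending_.push_back(MARK[i]);
                matched_ = 0;
                pending_.push_back(-1);
            } else if (matched_ == 3) {          // full marker seen: c is the control byte
                if (c == 0x00)                   // escaped literal marker
                    pending_.insert(pending_.end(), MARK, MARK + 3);
                else                             // 0x01 (or anything else): end-of-stream
                    pending_.push_back(-1);
                matched_ = 0;
            } else if ((unsigned char)c == MARK[matched_]) {
                ++matched_;                      // keep matching silently
            } else {                             // mismatch: flush prefix, retry this byte
                for (size_t i = 0; i < matched_; ++i) pending_.push_back(MARK[i]);
                matched_ = ((unsigned char)c == MARK[0]) ? 1 : 0;
                if (matched_ == 0) pending_.push_back(c);
            }
        }
    }

public:
    explicit EscapeDecoder(FILE* f) : f_(f) {}
    int GetByte() { refill(); int c = pending_.front(); pending_.pop_front(); return c; }
};

const unsigned char EscapeDecoder::MARK[3] = { 0xEC, 0x4B, 0xA7 };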
How about using blocks of fixed size, e.g. 1KB?
Each block would contain a byte (or 4 bytes) indicating which stream it belongs to, followed by the data itself.
Benefits:
You don't have to pay attention to the data itself. The data cannot accidentally trigger any behaviour in your system (e.g. accidentally terminate the stream).
It does not require random file access when encoding. In particular, you don't store the length of a block, as it is fixed.
If data gets corrupted, the system can recover at the next block.
Drawbacks:
If you have to switch from stream to stream a lot, each with only a few bytes of data, the blocks may be underutilised: lots of bytes will be empty.
If the block size is too small (e.g. to mitigate the above problem), the headers can create a huge overhead.
From the standpoint of simplicity for sequential read and write, I would take solution 1 and just use a short buffer, limited to 256 bytes. Then you have a one-byte length followed by data. If a single stream has more than 256 consecutive bytes, you simply write out another length header and more data.
If you have any further requirements, though, you might have to do something more elaborate. For instance, random-access reads probably require a magic number that can't be duplicated in the real data (or that is always escaped when it appears in the real data).
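A minimal sketch of that chunked format (the zero-length end marker and the names are my own assumptions, not part of the answer above):

#include <algorithm>
#include <cstdio>

// Emits one logical stream as chunks: a one-byte length (1..255) followed by
// that many payload bytes. A zero-length chunk marks the end of the stream.
void WriteChunked(FILE* out, const unsigned char* data, size_t size) {
    while (size > 0) {
        size_t n = std::min<size_t>(size, 255);
        putc((int)n, out);
        fwrite(data, 1, n, out);
        data += n;
        size -= n;
    }
    putc(0, out);   // zero length = end of this stream
}

The decoder is the mirror image: read one length byte, then read exactly that many payload bytes, stopping when the length byte is zero.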
So I ended up using the ugly parser from OP in my actual project
( http://encode.ru/threads/1231-lzma-recompressor )
But now it seems that the actual answer is to let the data-type handlers terminate their streams however they want. That means escape coding in the case of uncompressed data, but when some kind of entropy coding is used in the stream, it is usually possible to incorporate a more efficient EOF code.

Writing binary files C++, way to force something to be at byte 18?

I'm currently trying to write a .bmp file in C++, and for the most part it works. There is, however, just one issue: when I start trying to save images with different widths and heights, everything goes askew, and I'm struggling to solve it. So is there any way to force something to be written at a specific byte (adding padding between it and the last thing written)?
There are several sort-of-obvious answers, such as keeping your data in memory in a buffer and then putting the desired value in as bufr[offset] = mydata;. I presume you want something a little fancier than that because you are, for example, doing this in a streaming sort of application where you can't have the whole object in memory at the same time.
In that case, what you're looking for is the magic offered by fseek(3) and ftell(3) (see the man pages). Seek positions the file at a specific offset; tell gets the file's current offset. If it's a constant offset of 18, you simply finish up with the file and do
fseek(fp, 18L, SEEK_SET)
where fp is the file pointer, SEEK_SET is a constant declared in stdio.h, and 18 is the number 18.
Update
By the way, this is based on the system call lseek(2). Something that confuses people (read: "me", I never remember this until I've been searching) is that there is no matching "ltell(2)" system call. Instead, to get the current file offset, you use
off_t offset;
offset = lseek(fd, 0L, SEEK_CUR);
because lseek returns the offset after its operation (here fd is a file descriptor, not a FILE*). The example above gives us the offset after moving 0 bytes from the current offset, which is of course the current offset.
Update
Aha, C++ (I answered in C). For C++, there are member functions for seek and tell; see the fstream documentation.
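For instance, with an ofstream the same move is a seekp() call; a minimal sketch (the file name and the header bytes are placeholders, 18 is the offset from the question):

#include <fstream>

int main() {
    std::ofstream out("image.bmp", std::ios::binary);
    out.write("BM", 2);    // ...whatever header bytes come before offset 18...
    out.seekp(18);         // jump to byte 18; on typical systems the skipped bytes
                           // read back as zeros once later data is written
    // ...continue writing from offset 18 here...
    return 0;
}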
Count how many bytes have been written. Write zeroes until the count hits 18. Then resume writing your real data.
If you are on Windows, everything comes down to writing the predefined structures: "Bitmap storage".
There is also an example that shows how they should be filled: "Storing an Image".
If you are writing not-just-for-Windows code, then you can mimic these structs and follow the guide.