Analyzing a read operation using Procmon - C++

I am trying to analyze a basic read operation using ifstream with Procmon.
Here is the part of the code used for the read operation, where I was trying to read 16 KB of data from a file:
char* buffer = new char[128000];
ifstream fileHandle("file.txt");
fileHandle.read(buffer, 16000);            // request 16 KB in a single call
buffer[fileHandle.gcount()] = '\0';        // terminate before printing the contents
cout << buffer << endl;
fileHandle.close();
In Procmon there were 4 ReadFile operations with the following details:
Offset: 0, Length: 4,096, Priority: Normal
Offset: 4,096, Length: 4,096
Offset: 8,192, Length: 4,096
Offset: 12,288, Length: 4,096
So does it mean that there were 4 operations of 4 KB each? And if so, why did that happen instead of a single ReadFile operation of 16 KB?

So does it mean that there were 4 operations of 4 KB each?
Yes.
And if so, why did that happen instead of a single ReadFile operation of 16 KB?
Probably because the standard library shipped with your compiler sets the default buffer size of file streams to 4 KB; since the read operation has to go through that buffer, it has to be filled (through OS calls) and emptied 4 times before your request can be satisfied. Notice that you can change the internal buffer of an fstream using fileHandle.rdbuf()->pubsetbuf().
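For illustration, here is a minimal sketch of supplying a larger buffer via pubsetbuf; the 16 KB size and the file name are placeholders, and with libstdc++ the call generally has to happen before the file is opened. Whether the OS then sees one large ReadFile or several smaller ones still depends on the library and the filesystem.

#include <fstream>
#include <iostream>

int main()
{
    char streambuf_storage[16384];   // 16 KB of storage for the stream's own buffer

    std::ifstream fileHandle;
    // With libstdc++, pubsetbuf only takes effect if called before open().
    fileHandle.rdbuf()->pubsetbuf(streambuf_storage, sizeof(streambuf_storage));
    fileHandle.open("file.txt", std::ios::binary);

    char buffer[16000];
    fileHandle.read(buffer, sizeof(buffer));
    std::cout << "read " << fileHandle.gcount() << " bytes" << std::endl;
}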

So does it mean that there were 4 operations of 4 KB each?
That is exactly what it is saying.
And if so, why did that happen instead of a single ReadFile operation of 16 KB?
Just because you asked for 16000 bytes does not mean ifstream can actually read 16000 bytes in a single operation; file systems usually cap the size of an individual read. Even if you increase the size of the buffer that ifstream uses internally, there is still no guarantee that the file system will honor a larger read size.
The contract of read() is that it returns the requested number of bytes unless an EOF/error is encountered. HOW it accomplishes that reading internally is an implementation detail. In this case, ifstream had to read four 4 KB blocks in order to return 16000 bytes.
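To make that contract concrete, here is a small sketch (not from the original question): the caller only observes the stream state and gcount(), no matter how many OS-level reads happened underneath.

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("file.txt", std::ios::binary);
    char buffer[16000];
    in.read(buffer, sizeof(buffer));      // may be satisfied by several 4 KB OS reads
    std::streamsize got = in.gcount();    // bytes actually delivered to the caller
    if (got < static_cast<std::streamsize>(sizeof(buffer)))
        std::cout << "short read: " << got << " bytes (EOF or error)" << std::endl;
}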

Related

strange file buffer size in chromium

While reading the Chromium sources, I found this interesting code for comparing the content of two files. The interesting part is the stack-allocated buffer size:
const int BUFFER_SIZE = 2056;
char buffer1[BUFFER_SIZE], buffer2[BUFFER_SIZE];
do {
  file1.read(buffer1, BUFFER_SIZE);
  file2.read(buffer2, BUFFER_SIZE);

  if ((file1.eof() != file2.eof()) ||
      (file1.gcount() != file2.gcount()) ||
      (memcmp(buffer1, buffer2, static_cast<size_t>(file1.gcount())))) {
    file1.close();
    file2.close();
    return false;
  }
} while (!file1.eof() || !file2.eof());
The first question is: why was such an odd buffer size chosen? git blame shows nothing interesting regarding this. The only guess I have is that this particular buffer size, 2056 = 2048 + 8, is supposed to induce read-ahead behavior from such a high abstraction level. In other words, the logic would be: on the first read we get one full internal buffer of 2048 bytes plus 8 more, and if the internal I/O buffer size is 2048, those extra 8 bytes force the next block to be fetched. By the time we issue the next read, that block has already been read by the implementation, and so on by induction.
The second question is: why is exactly 2048 so commonly chosen as a buffer size? Why not something like PAGE_SIZE or BUFSIZ?
I asked the Chromium development mailing list and here are some of the answers:
Scott Hess shess#chromium.org
I doubt there was any reason other than that the buffer needs to have some size. I'd have chosen 4096, myself, since most filesystem block sizes are that these days. But iostream already has internal buffering so it's not super important.
So it seems there was no particular reason for exactly this buffer size.

Is reading from an anonymous pipe atomic, in the sense of atomic content?

I am writing a process on Linux with two threads. They communicate using an anonymous pipe, created with the pipe() call.
One end is copying a C structure into the pipe:
struct EventStruct e;
[...]
ssize_t n = write(pipefd[1], &e, sizeof(e));
The other end reads it from the pipe:
struct EventStruct e;
ssize_t n = read(pipefd[0], &e, sizeof(e));
if (n != -1 && n != 0 && n < sizeof(e))
{
    // Is a partial read possible here??
}
Can partial reads occur with the anonymous pipe?
The man page (man 7 pipe) stipulates that any write under PIPE_BUF bytes is atomic. But what they mean is atomic with respect to other writer threads; I am not concerned with multiple-writer issues, since I have only one writer thread and only one reader thread.
As a side note, my structure is 56 bytes long, well below the PIPE_BUF size, which is at least 4096 bytes on Linux (and looks to be even higher on recent kernels).
Put differently: on the reading end, do I have to deal with partial reads and accumulate data until I have received a full structure instance?
As long as you are dealing with fixed size units, there isn't a problem. If you write a unit of N bytes on the pipe and the reader requests a unit of N bytes from the pipe, then there will be no issue. If you can't read all the data in one fell swoop (you don't know the size until after you've read its length, for example), then life gets trickier. However, as shown, you should be fine.
That said, you should still detect short reads. There's a catastrophe pending if you get a short read but assume it is full length. However, do not expect it to be easy to provoke a short read for testing; code coverage will be a problem. I'd simply test n < (ssize_t)sizeof(e) and treat anything detected as an error or EOF. Note the cast; otherwise, the signed value will be converted to unsigned and -1 won't be spotted properly.
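For illustration, a hedged sketch of that check; the EventStruct layout here is hypothetical, and a real program would have the write and the read running on separate threads.

#include <unistd.h>
#include <cstdio>

struct EventStruct { int type; char payload[52]; };   // hypothetical 56-byte event

int main()
{
    int pipefd[2];
    if (pipe(pipefd) == -1) { std::perror("pipe"); return 1; }

    EventStruct out = { 1, "hello" };
    if (write(pipefd[1], &out, sizeof(out)) != (ssize_t)sizeof(out)) {
        std::perror("write");                       // treat a short write as fatal
        return 1;
    }

    EventStruct in;
    ssize_t n = read(pipefd[0], &in, sizeof(in));
    if (n < (ssize_t)sizeof(in)) {                  // the cast lets -1 compare correctly
        std::fprintf(stderr, "error, EOF or short read (n = %zd)\n", n);
        return 1;
    }
    std::printf("received event type %d\n", in.type);
    return 0;
}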
For specification, you'll need to read the POSIX specifications for:
read()
write()
pipe()
and possibly trace links from those pages. For example, for write(), the specification says:
Write requests to a pipe or FIFO shall be handled in the same way as a regular file with the following exceptions:
There is no file offset associated with a pipe, hence each write request shall append to the end of the pipe.
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
Or from the specification of read():
Upon successful completion, where nbyte is greater than 0, read() shall mark for update the last data access timestamp of the file, and shall return the number of bytes read. This number shall never be greater than nbyte. The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading. For example, a read() from a file associated with a terminal may return one typed line of data.
So, the write() will write atomic units; the read() will only read atomic units because that's what was written. There won't be a problem, which is what I said at the start.

Buffer Size in C++

I am observing the following behavior with the C++ standard library method std::ostream::write().
For buffering the data I am making use of the following C++ API:
std::ofstream::rdbuf()->pubsetbuf(char* s, streamsize n)
This works fine (verified using the strace utility) as long as the size of the data we write to the file stream using
std::ofstream::write(const char* s, streamsize n)
is at most 1023 bytes (below this value, writes are accumulated until the buffer is full), but when the size of the data to write exceeds 1023 bytes, the buffer is no longer used and the data is flushed straight to the file.
For example, if I set the buffer size to 10 KB and write around 512 bytes at a time, strace shows that multiple writes have been combined into a single write:
writev(3, [{"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 9728}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 512}], 2) = 10240 ( 10 KB )
writev(3, [{"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 9728}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 512}], 2) = 10240
...
But when I write 1024 bytes at a time (keeping the buffer fixed at 10 KB), strace shows that the buffer is not being used and each ofstream::write call is translated into its own write system call:
writev(3, [{NULL, 0}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 1024}], 2) = 1024 ( 1KB )
writev(3, [{NULL, 0}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 1024}], 2) = 1024
...
Is there any C++ API call or Linux tuning parameter that I am missing?
This is an implementation detail of libstdc++, implemented around line 650 of bits/fstream.tcc. Basically, if the write is larger than 2^10, it will skip the buffer.
If you want the rationale behind this decision, I suggest you send a mail to the libstdc++ development list.
http://gcc.gnu.org/ml/libstdc++/
It looks like whoever wrote this part of the stdlib implementation made an "optimization" without giving it enough thought. So the only workaround for you would be to avoid the C++ API and use the standard C library (sketched below).
This is not the only suboptimality in the GNU/Linux implementation of the standard C++ library: on my machine, malloc() is 100 cycles faster than the standard void* operator new (size_t size)...
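If dropping down to the C stdio API as suggested above is acceptable, a sketch along these lines (the file name and sizes are placeholders) keeps 1 KB writes buffered; with glibc, strace should then show buffer-sized write calls instead of one syscall per fwrite.

#include <cstdio>
#include <vector>

int main()
{
    std::FILE* f = std::fopen("out.dat", "wb");
    if (!f) return 1;

    // setvbuf must be called before any other operation on the stream.
    // 1 KB fwrite calls are then accumulated in this 10 KB buffer; only
    // requests much larger than the buffer itself may still bypass it.
    std::vector<char> iobuf(10 * 1024);
    std::setvbuf(f, iobuf.data(), _IOFBF, iobuf.size());

    std::vector<char> record(1024, 'A');
    for (int i = 0; i < 100; ++i)
        std::fwrite(record.data(), 1, record.size(), f);

    std::fclose(f);   // flushes whatever is still sitting in the buffer
    return 0;
}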

How can we split one 100 GB file into a hundred 1 GB files?

This question came to mind when I was trying to solve this problem.
I have a hard drive with a capacity of 120 GB, of which 100 GB is occupied by a single huge file, so 20 GB is still free.
My question is, how can we split this huge file into smaller ones, say 1 GB each? With ~100 GB of free space this would probably be possible with a simple algorithm, but with only 20 GB free we can write at most twenty 1 GB files. I have no idea how to delete contents from the big file while reading from it.
Any solution?
It seems I have to truncate the file by 1 GB once I finish writing one piece, but that boils down to this question:
Is it possible to truncate a part of a file? How exactly?
I would like to see an algorithm (or an outline of an algorithm) that works in C or C++ (preferably Standard C and C++), so I may know the lower level details. I'm not looking for a magic function, script or command that can do this job.
According to this question (Partially truncating a stream), on a POSIX-compliant system you should be able to use a call to int ftruncate(int fildes, off_t length) to resize an existing file.
Modern implementations will probably resize the file "in place" (though this is unspecified in the documentation). The only gotcha is that you may have to do some extra work to ensure that off_t is a 64 bit type (provisions exist within the POSIX standard for 32 bit off_t types).
You should take steps to handle error conditions, just in case it fails for some reason, since obviously, any serious failure could result in the loss of your 100GB file.
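As a quick compile-time guard for that gotcha (the -D_FILE_OFFSET_BITS=64 flag is the usual glibc mechanism on 32-bit Linux; treat the exact flag as an assumption about your toolchain):

// Build with -D_FILE_OFFSET_BITS=64 on 32-bit Linux so that off_t is 64-bit.
#include <sys/types.h>
static_assert(sizeof(off_t) == 8, "off_t must be 64-bit to address a 100 GB file");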
Pseudocode (assume, and take steps to ensure, all data types are large enough to avoid overflows):
open (string filename)               // opens a file, returns a file descriptor
file_size (descriptor file)          // returns the absolute size of the specified file
seek (descriptor file, position p)   // moves the caret to the specified absolute point
copy_to_new_file (descriptor file, string newname)
                                     // creates the file specified by newname, copies data
                                     // from the file descriptor into it until EOF is reached

set descriptor = open ("MyHugeFile")
set gigabyte = 2^30                  // 1024 * 1024 * 1024 bytes
set filesize = file_size(descriptor)
set blocks = (filesize + gigabyte - 1) / gigabyte

loop (i = blocks; i > 0; --i)
    set truncpos = gigabyte * (i - 1)
    seek (descriptor, truncpos)
    copy_to_new_file (descriptor, "MyHugeFile" + i)
    ftruncate (descriptor, truncpos)
Obviously some of this pseudocode is analogous to functions found in the standard library. In other cases, you will have to write your own.
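As a concrete (and hedged) illustration of the pseudocode, here is a sketch using the POSIX calls it alludes to; error handling is abbreviated, the file names are placeholders, and the loop stops as soon as anything fails so that no data is truncated before its copy is safely written.

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <cstdio>
#include <string>
#include <vector>

// Copy `length` bytes starting at `offset` from fd into a new file called `name`.
static bool copy_tail(int fd, off_t offset, off_t length, const std::string& name)
{
    int out = open(name.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out == -1) return false;
    std::vector<char> buf(1 << 20);                 // 1 MB copy buffer
    off_t copied = 0;
    while (copied < length) {
        ssize_t n = pread(fd, buf.data(), buf.size(), offset + copied);
        if (n <= 0) { close(out); return false; }
        if (write(out, buf.data(), n) != n) { close(out); return false; }
        copied += n;
    }
    return close(out) == 0;
}

int main()
{
    const off_t gigabyte = off_t(1) << 30;
    int fd = open("MyHugeFile", O_RDWR);
    if (fd == -1) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { std::perror("fstat"); return 1; }

    off_t blocks = (st.st_size + gigabyte - 1) / gigabyte;
    for (off_t i = blocks; i > 0; --i) {
        off_t truncpos = gigabyte * (i - 1);
        off_t piece = st.st_size - truncpos;        // the last piece may be short
        if (!copy_tail(fd, truncpos, piece, "MyHugeFile." + std::to_string(i))) {
            std::perror("copy");                    // stop before losing any data
            return 1;
        }
        if (ftruncate(fd, truncpos) == -1) { std::perror("ftruncate"); return 1; }
        st.st_size = truncpos;                      // the source file just shrank
    }
    close(fd);
    return 0;
}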
There is no standard function for this job.
For Linux you can use the ftruncate method, while for Windows you can use _chsize or SetEndOfFile. A simple #ifdef will make it cross-platform.
Also read this Q&A.

Most efficient way to read a file into separate variables using fstream

I have tons of files which look a little like:
12-3-125-BINARYDATA
What would be the most efficient way to save the 12, 3 and 125 as separate integer variables, and the BINARYDATA as a char vector?
I'd really like to use fstream, but I don't exactly know how (I got it working with std::string, but the BINARYDATA part was all messed up).
The most efficient method for reading data is to read many "chunks", or records, into memory using the fewest I/O function calls, and then parse the data in memory.
For example, reading 5 records with one fread call is more efficient than 5 calls to fread, each reading one record. Accessing memory is always faster than accessing external data such as files.
Some platforms have the ability to memory-map a file. This may be more efficient than reading with the I/O functions. Profiling will determine the most efficient approach.
Fixed-length records are always more efficient than variable-length records. Variable-length records involve reading either up to a fixed maximum size or until a terminating (sentinel) value is found. For example, a text line is a variable-length record and must be read one byte at a time until the terminating end-of-line marker is found. Buffering may help in this case.
What do you mean by binary data? Is it the characters '0' and '1' written out one by one, or "real" binary data? If it is real binary data, just open the file in binary mode: read the characters of the first integer, then one byte for the '-' separator, then the next number, and so on, until you reach the start of the binary data; then get the remaining length and read all of it in one go.
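A hedged sketch of one way to do that with fstream, assuming the three numbers are plain ASCII separated by '-' and everything after the third separator is raw binary until end of file (the file name and that layout are assumptions, not something the question spells out):

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream in("data.bin", std::ios::binary);
    if (!in) return 1;

    // Parse the three ASCII integers; ignore(1, '-') skips each '-' separator.
    int a, b, c;
    in >> a;  in.ignore(1, '-');
    in >> b;  in.ignore(1, '-');
    in >> c;  in.ignore(1, '-');

    // Everything that remains is the binary payload; slurp it into a char vector.
    std::istreambuf_iterator<char> first(in), last;
    std::vector<char> payload(first, last);

    std::cout << a << ' ' << b << ' ' << c
              << " (payload: " << payload.size() << " bytes)" << std::endl;
    return 0;
}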