Directly use C++ stream buffer to e.g. decompress

My use case is reading a compressed file from disk, de-compressing it incrementally and using the resulting data.
Currently I read the file contents into a temporary buffer that I allocate, point the decompression API at that buffer, and so on. The question is about understanding whether the temporary buffer really is necessary or helpful in this case.
In an experiment, I opened a stream to a text file and called get() once. In the debugger I can see that the file buffer inside the stream already contains the characters that follow in the text file, as expected. (On MSVC I found it under std::ifstream::_Filebuffer::_IGfirst.)
I am looking for a portable way to access this buffer I see in the debugger, feed the decompression API with it, and then continue reading the file.
I don't understand why I should copy the file buffer into my own buffer (with e.g. read()) in this particular case, where I will promptly consume the buffer contents and move on. I'm not questioning the merits of buffered I/O in general.
EDIT:
I did further experiments, and it seems that the internal stream buffer doesn't get used if the target of read() is itself sufficiently large. Apparently the case I was worried about doesn't really come up.
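For reference, a minimal sketch of the pattern from the edit; decompress_chunk() is a hypothetical stand-in for the real decompression API, and the 64 KiB chunk size is just a guess to tune:

    #include <cstddef>
    #include <fstream>
    #include <vector>

    // Hypothetical placeholder for the real decompression call (zlib, LZ4, ...).
    void decompress_chunk(const char* /*data*/, std::size_t /*size*/) { /* ... */ }

    int main() {
        std::ifstream file("input.compressed", std::ios::binary);
        std::vector<char> chunk(64 * 1024);   // caller-owned buffer, larger than the filebuf

        while (file) {
            // With a target this large, read() is typically filled straight from
            // the file rather than staged through the stream's internal buffer.
            file.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            if (file.gcount() > 0)
                decompress_chunk(chunk.data(), static_cast<std::size_t>(file.gcount()));
        }
    }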

Related

Is fflush() as good as fclose() at saving my data to disk if the machine crashes immediately afterward?

We have a logger in our library that writes lines of text to disk for later diagnosis. We use it for many purposes including analyzing hard machine hangs caused by the application or our library.
We would write a line to the log file then call fclose(), then reopen it for append when we need to write another line. This is 35X slower than calling fflush() after writing a line.
So my question is, am I more likely to have the last line successfully stored in the file with the fclose() approach than the fflush() approach? If so, why? If not, what is that 35X larger amount of time busy doing if not writing the data more safely to disk?
We care the most about Windows, by the way.
From http://en.cppreference.com/w/c/io/fflush:
For output streams (and for update streams on which the last operation was output), writes any unwritten data from the stream's buffer to the associated output device.
I would say fflush should work for your needs.
Opening and closing the file for each write operation is not an optimal solution. File opens are a costly operation.
Also, fclose() does not guarantee that the file data is flushed to the disk immediately on closing the file. The OS/filesystem will flush the data according to its own logic.
fflush() will flush any data written to the file that is still sitting in the stream's buffer. However, note that it will try to flush all buffered blocks of the file, not just the latest blocks written. So if it's a big file and multiple applications are writing to it, all of their data is flushed. Also, flushing on every write can be inefficient.
So if you really want your data to be written immediately, use direct I/O, which writes data directly to the disk. And if you want to make that efficient, make the writes asynchronous.
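As a sketch of the middle ground between the two approaches, the cheap fflush() can be followed by an explicit request to push the data to the device when a line really must survive a crash; log_line() is just an illustrative wrapper, _commit() is the Windows CRT call and fsync() the POSIX one:

    #include <stdio.h>

    #ifdef _WIN32
    #include <io.h>        // _commit, _fileno
    #else
    #include <unistd.h>    // fsync, fileno
    #endif

    // Flush the stdio buffer to the OS, then ask the OS to push it to the disk.
    void log_line(FILE* fp, const char* line) {
        fprintf(fp, "%s\n", line);
        fflush(fp);                    // user-space buffer -> OS cache
    #ifdef _WIN32
        _commit(_fileno(fp));          // OS cache -> disk (Windows CRT)
    #else
        fsync(fileno(fp));             // OS cache -> disk (POSIX)
    #endif
    }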

Reducing number of refills in fgets()

My C++ program reads in a textual file stream that is delimited by newline characters. For performance reasons, I am using C I/O functions to process these data. I am using fgets() to read a line of this textual file stream into a char * buffer; the buffer gets processed with other functions not relevant to this question. Lines are read in until EOF.
Behind the scenes in fgets() — looking at a source code implementation for OpenBSD, for instance — it looks like this function will refill the FILE pointer's internal buffer once it runs out of characters to parse for newlines (assuming there are more characters to look through and ignoring other termination conditions, for a moment).
Problem: From profiling with gprof, it looks like a lot of time is spent on reading in and processing input, not so much elsewhere in the program, which is generally efficient. To improve performance, I'd like to explore reducing the total I/O overhead of this program, where I am working with very large (multi-GB) inputs.
Question: Perhaps minimizing refills is one way to keep file I/O to a minimum. Is there a (platform-independent) way to adjust the size of the internal buffer that the FILE pointer uses, or would I need to write a custom fgets()-like function with my own buffer? Are there other strategies for reducing overall I/O overhead (seeks, reads, etc.) when parsing text files?
Note: I apologize but I failed to indicate what kind of streams I am working with — I should state more clearly that my application reads from stdin (standard input) as well as regular files.
The setbuf(3) family of functions allows you to specify the buffering for a FILE*.
Specifically, setbuffer and setvbuf allow you to associate a buffer you allocate yourself with the file. Or, you can simply specify the size to be malloc'd.
See also the GNU libc documentation on Controlling Buffering.
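For instance, a sketch along those lines; the file name and the 1 MiB size are placeholders to tune for your workload:

    #include <cstdio>
    #include <vector>

    int main() {
        std::FILE* fp = std::fopen("huge_input.txt", "r");
        if (!fp) return 1;

        // Give the FILE a 1 MiB buffer instead of the default BUFSIZ-sized one.
        // setvbuf must be called before anything is read; the same call works
        // on stdin as long as nothing has been read from it yet.
        std::vector<char> iobuf(1 << 20);
        std::setvbuf(fp, iobuf.data(), _IOFBF, iobuf.size());

        char line[4096];
        while (std::fgets(line, sizeof line, fp)) {
            // process one line...
        }
        std::fclose(fp);
    }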

Read from a file or read the file into a buffer and then use the buffer(in C++)?

I am writing a parser wherein I need to read characters from a file. But I will be reading the file character by character, and may even stop reading in the middle if some conditions are not satisfied.
So is it advisable to create an ifstream for the file, seek to the position every time and start reading from there? Or should I read the entire file into a stream or buffer and then use that instead?
If you can, use a memory-mapped file.
Boost offers a cross-platform one: http://www.boost.org/doc/libs/1_35_0/libs/iostreams/doc/classes/mapped_file.html
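A minimal sketch of what using it can look like (the file name and the early-stop condition are placeholders; link against boost_iostreams):

    #include <cstddef>
    #include <boost/iostreams/device/mapped_file.hpp>

    int main() {
        // Map the whole file read-only; the OS pages it in on demand.
        boost::iostreams::mapped_file_source file("input.txt");

        const char* data = file.data();
        std::size_t size = file.size();

        // Parse directly out of the mapping, stopping early if a condition fails.
        for (std::size_t i = 0; i < size; ++i) {
            if (data[i] == '\0') break;   // placeholder stop condition
            // ... consume data[i] ...
        }
    }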
How big is the file? Do you make more than one pass? Whether you read it into an in-memory buffer or not, reading the file will consume (file size/BUFSIZ) reads to go through the whole thing. Reading character by character doesn't matter, because the underlying read still consumes BUFSIZ bytes at a time (unless you take steps to change that behavior) -- it just hands them out character-by-character.
If you're reading it and processing it in one pass anyway, then reading it into memory first means you always need (file size/BUFSIZ) reads, whereas -- assuming the reason for stopping is distributed equiprobably -- reading and processing it as you go will take on average (file size/BUFSIZ) * 0.5 reads, which on a big file could be a substantial gain.
An even more important question might be "what are you doing looking for this complicated a solution?" The amount of time it takes to figure out the cute solution probably dominates any gains you'll make from looking for something fancier than the standard "while not end of file, get character and process" solution.
Seeking to the position every time and then reading wouldn't be the better option here, as it degrades performance.
Creating a buffer and reading from that would be a better idea and more efficient.
Try to read the whole file contents into the buffer in one go, and then serve the subsequent input needs from the buffer instead of reading from the file every time.
On a full service OS (i.e. Windows, Mac OS, Linux, BSD...) the operating system will have a caching mechanism that handles this for you to some extent (and assuming your usage patterns meet some definition of "usual").
Unless you are experiencing unacceptable performance you might want to merrily ignore the whole issue (i.e. just use the naive file access primitives).

convert image buffer to filestream

Something similar to this may have been asked earlier; I could not find an exact answer to my problem, so I decided to ask here.
I am working with a 3rd-party framework that has its own classes defined to handle image files. It only accepts a file name, and the whole implementation is built around opening these file streams and performing reads/writes on them.
I'd like to take an image buffer (that I obtain through some pre-processing on an image opened earlier) and feed it to this framework. The problem is that I cannot feed it a buffer, only a filename string.
I am looking at the best way to convert my buffer to a filestream so it can be seekable and be ingested by the framework. Please help me figure out what I should be looking at.
I tried reading about streambuf (filebuf and stringbuf) and tried assigning the buffer to these types, but no success so far.
If the framework only takes a file name, then you have to pass it a file name. Which means the data must reside in the file system.
The portable answer is "write your data to a temporary file and pass the name of that".
On Unix, you might be able to use a named pipe and spawn another thread (or fork a process) to feed the data through the pipe...
But honestly, you are probably better off just using a temporary file. If you manage to open, read, and delete the file quickly enough, it most likely will never make it out to disk anyway, since the kernel will cache the data.
And if you are able to use a ramdisk (tmpfs), you can guarantee that everything happens in memory.
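A rough sketch of the temporary-file route, using C++17 std::filesystem to pick a location; staged_image.bin is a made-up name, and passing the returned path to the framework (and deleting the file afterwards) is left to the caller:

    #include <filesystem>
    #include <fstream>
    #include <string>
    #include <vector>

    // Write the in-memory image buffer to a temporary file and return its name,
    // which can then be handed to the framework's file-name-only API.
    std::string stage_buffer(const std::vector<char>& buffer) {
        auto path = std::filesystem::temp_directory_path() / "staged_image.bin";
        std::ofstream out(path, std::ios::binary);
        out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        out.close();                       // make sure the data is visible to readers
        return path.string();              // remember to std::filesystem::remove() it later
    }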
[edit]
One more thought. If you can modify your code base to operate on std::iostream instead of std::fstream, you can pass it a std::stringstream. They support all of the usual iostream operations on a memory buffer, including things like seeking.
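A small sketch of that idea; process_image() stands in for a framework entry point rewritten to accept a std::iostream& instead of opening a file itself:

    #include <iostream>
    #include <sstream>
    #include <string>

    // Stand-in for a framework routine changed to take any iostream.
    void process_image(std::iostream& img) {
        img.seekg(4);                          // seeking works on a stringstream too
        char byte = 0;
        img.get(byte);
        std::cout << static_cast<int>(byte) << '\n';
    }

    int main() {
        std::string buffer(64, '\x2a');        // pretend this is the pre-processed image
        std::stringstream stream(buffer);      // in-memory "file", no disk involved
        process_image(stream);
    }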

Reading from a socket 1 byte a time vs reading in large chunk

What's the difference - performance-wise - between reading from a socket 1 byte a time vs reading in large chunk?
I have a C++ application that needs to pull pages from a web server and parse the received page line by line. Currently, I'm reading 1 byte at a time until I encounter a CRLF or the max of 1024 bytes is reached.
If reading in large chunks (e.g. 1024 bytes at a time) is a lot better performance-wise, any idea how to achieve the same behavior I currently have (i.e. being able to store and process one HTML line at a time, up to the CRLF, without consuming the succeeding bytes yet)?
EDIT:
I can't afford very big buffers. I'm on a very tight code budget, as the application is used in an embedded device. I prefer keeping only one fixed-size buffer, preferably holding one HTML line at a time. This makes my parsing and other processing easy: any time I access the buffer for parsing, I can assume that I'm processing one complete HTML line.
Thanks.
I can't comment on C++, but from other platforms - yes, this can make a big difference; particularly in the number of switches the code needs to make, and the number of times it needs to worry about the async nature of streams etc.
But the real test is, of course, to profile it. Why not write a basic app that churns through an arbitrary file using both approaches, and test it for some typical files... the effect is usually startling, if the code is IO bound. If the files are small and most of your app runtime is spent processing the data once it is in memory, you aren't likely to notice any difference.
If you are reading directly from the socket, and not from an intermediate higher-level representation that can be buffered, then without any possible doubt it is simply better to read the full 1024 bytes, put them in a buffer in RAM, and then parse the data from RAM.
Why? Reading on a socket is a system call, and it causes a context switch on each read, which is expensive. Read more about it: IBM Tech Lib: Boost socket performances
First and simplest:
cin.getline(buffer,1024);
Second, usually all I/O is buffered, so you don't need to worry too much.
Third, CGI process startup usually costs much more than the input processing (unless it is a huge file)... So you may just not think about it.
G'day,
One of the big performance hits of doing it one byte at a time is that your context is switching from user mode to kernel mode over and over. And over. Not efficient at all.
Grabbing one big chunk, typically up to an MTU size, is measurably more efficient.
Why not scan the content into a vector and iterate over that looking out for \n's to separate your input into lines of web input?
HTH
cheers,
You are not reading one byte at a time from a socket, you are reading one byte at a time from the C/C++ I/O system, which, if you are using CGI, will have already buffered up all the input from the socket. The whole point of buffered I/O is to make the data available to the programmer in a way that is convenient for them to process, so if you want to process one byte at a time, go ahead.
Edit: On reflection, it is not clear from your question whether you are implementing CGI or just using it. You could clarify this by posting a code snippet which indicates how you currently read that single byte.
If you are reading the socket directly, then you should simply read the entire response to the GET into a buffer and then process it. This has numerous advantages, including performance and ease of coding.
If you are limited to a small buffer, then use a classic buffering algorithm like:
getbyte:
    if buffer is empty
        fill buffer
        set buffer pointer to start of buffer
    end
    get byte at buffer pointer
    increment pointer
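Rendered as C++ over a raw socket, that algorithm might look like the sketch below, assuming a POSIX recv(); the static state ties it to a single connection and is only there to keep the example short:

    #include <sys/socket.h>   // recv (POSIX)

    // One-byte-at-a-time interface backed by a 1024-byte refill buffer,
    // following the pseudocode above. Returns -1 on EOF or error.
    int getbyte(int sock) {
        static char buf[1024];
        static ssize_t len = 0;
        static ssize_t pos = 0;

        if (pos >= len) {                          // buffer is empty
            len = recv(sock, buf, sizeof buf, 0);  // fill buffer
            if (len <= 0) return -1;
            pos = 0;                               // reset buffer pointer
        }
        return static_cast<unsigned char>(buf[pos++]);
    }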
You can open the socket file descriptor with the fdopen() function. Then you have buffered I/O, so you can call fgets() or similar on the resulting stream.
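A sketch of that approach, assuming a POSIX fdopen() (on Windows the socket handle would need conversion first, so treat this as Unix-only):

    #include <stdio.h>

    // Wrap the connected socket descriptor in a FILE* so stdio does the
    // buffering, then pull one line (up to the newline) per fgets() call.
    void read_lines(int sock_fd) {
        FILE* fp = fdopen(sock_fd, "r");
        if (!fp) return;

        char line[1024];
        while (fgets(line, sizeof line, fp)) {
            // line now holds one HTML line (including the trailing "\r\n")
            // ... parse it ...
        }
        fclose(fp);   // also closes the underlying descriptor
    }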
There is no difference at the operating system level, data are buffered anyway. Your application, however, must execute more code to "read" bytes one at a time.