Reducing number of refills in fgets() - c++

My C++ program reads in a textual file stream that is delimited by newline characters. For performance reasons, I am using C I/O functions to process these data. I am using fgets() to read a line of this textual file stream into a char * buffer; the buffer gets processed with other functions not relevant to this question. Lines are read in until EOF.
Behind the scenes in fgets() — looking at a source code implementation for OpenBSD, for instance — it looks like this function will refill the FILE pointer's internal buffer once it runs out of characters to parse for newlines (assuming there are more characters to look through and ignoring other termination conditions, for a moment).
Problem: From profiling with gprof, it looks like a lot of time is spent on reading in and processing input, not so much elsewhere in the program, which is generally efficient. To improve performance, I'd like to explore reducing the total I/O overhead of this program, where I am working with very large (multi-GB) inputs.
Question: Perhaps minimizing refills is one way to keep file I/O to a minimum. Is there a (platform-independent) way to adjust the size of the internal buffer that the FILE pointer uses, or would I need to write a custom fgets()-like function with my own buffer? Are there other strategies for reducing overall I/O overhead (seeks, reads, etc.) when parsing text files?
Note: I apologize but I failed to indicate what kind of streams I am working with — I should state more clearly that my application reads from stdin (standard input) as well as regular files.

The setbuf(3) family of functions allow you to specify the buffering for a FILE*.
Specifically, setbuffer and setvbuf allow you to assign a buffer you allocate to be associated with the file. Or, you can simply specify the size to be malloced.
See also the GNU libc documentation on Controlling Buffering.

Related

How do I do stdin.byLine, but with a buffer?

I'm reading multi-gigabyte files and processing them from stdin. I'm reading from stdin like this.
string line;
foreach(line1; stdin.byLine){
line = to!string(line1);
...
}
Is there a faster way to do this? I tried a threading approach with
auto childTid = spawn(&fn, thisTid);
string line;
foreach(line1; stdin.byLine){
line = to!string(line1);
receiveOnly!(int);
send(childTid, line);
}
int x= 0;
send(childTid, x);
That allows it to load at least one more line from disk while my process is running at the cost of a copy operation, but this is still silly, what I need is fgets, or a way to combine stdio.byChunk(4096) with readline. I tried fgets.
char[] buf = new char[4096];
fgets(buf.ptr, 4096, stdio)
but it always fails with stdio is a file and not a stream. Not sure how to make it a stream. Any help would be appreciated with the approach you think best. I'm not very good at D, apologies for any noob mistakes.
There are actually already two layers of buffering under the hood (excluding the hardware itself): the C runtime library and the kernel both do a layer of buffering to minimize I/O costs.
First, the kernel keeps data from disk in its own buffer and will look ahead, loading beyond what you request in a single call if you are following a predictable pattern. This is to mitigate the low-level costs associated with seeking the device and will cache across processes - if you read a file with one program then again with a second, the second will probably get it from the kernel memory cache instead of the physical disk and may be noticeably much faster.
Second, the C library, on which D's std.stdio is built, also keeps a buffer. readln ultimately calls C file I/O functions which read a chunk from the kernel at a time. (Fun fact, writes are also buffered by the C library, default by line if user interactive and by chunk otherwise. Writing is quite slow and doing it by chunk makes a big difference, but sometimes the C lib thinks a pipe isn't interactive when it is and leads to a FAQ: Simple D program Output order is wrong )
These C lib buffers also mitigate the costs of many small reads and writes by batching them up before even sending to the kernel. In the case of readln, it will likely read several kilobytes at once, even if you ask for just one line or one byte, and the rest stays in the buffer for next time.
So your readln loop is already going to be automatically buffered and should get decent I/O performance.
You might be able to do it better yourself with a few techniques though. In that case, you may try using std.mmfile for a memory-mapped file and reading it as if i was an array, but your files are too big to fit in that on 32 bit. Might work on 64 bit though. (Note that a memory mapped file is NOT loaded all at once, it is just mapped to a memory address. When you actually touch part of it, the operating system will load/save on demand.)
Or, of course, you can use the lower level operating system functions like write from import core.sys.posix.unistd or WriteFile from import core.sys.windows.windows, which will bypass the C lib's layers (but, of course, keep the kernel layers, which you want, don't try to bypass them.)
You can look for any win32 or posix system call C tutorials if you want to know more about using those functions. It is the same in D as in C, with minor caveats like the import instead of #include.
Once you load the chunk, you will want to scan it for the newline and slice it in all probability to form the range to pass to the loop or other algorithms. The std.range and std.algorithm modules also have searching, splitting, and chunking functions that might help, but you need to be careful with lines that span the edges of your buffers to keep correctness and efficiency.
But if your performance is good enough as it is, I'd say just leave it - the C lib+kernel's buffering do a pretty good job in most cases.

C++: will this disk seek take a very large performance hit?

I am using the STL fstream utilities to read from a file. However, what I would like to do is is read a specified number of bytes and then seek back some bytes and read again from that position. So, it is sort of an overlapped read. In code, this would look as follows:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
size_t read_num = 0;
size_t windows_size = 200;
while (read_num < total_num)
{
char buffer[1024];
size_t num_bytes_read = fileStream.read(buffer, sizeof(buffer));
read_num += num_bytes_read - 200;
filestream.seekg(read_num);
}
This is the not the only way to solve my problem but will make multi-tasking a breeze (I have been looking at other data structures like circular buffers but that will make multitasking difficult). I was wondering if I can have your input on how much of a performance hit these seek operations might take when processing very large files. I will only ever use one thread to read the data from file.
The files contain large sequence of texts only characters from the set {A,D,C,G,F,T}. Would it also be advisable to open it as a binary file rather than in text mode as I am doing?
Because the file is large, I am also opening it in chucks with the chuck being set to a 32 MB block. Would this be too large to take advantage of any caching mechanism?
On POSIX systems (notably Linux, and probably MacOSX), the C++ streams are based on lower primitives (often, system calls) such as read(2) and write(2) and the implementation will buffer the data (in the standard C++ library, which would call read(2) on buffers of several kilobytes) and the kernel generally keeps recently accessed pages in its page cache. Hence, practically speaking, most not too big files (e.g. files of few hundred megabytes on a laptop with several gigabytes of RAM) are staying in RAM (once they have been read or written) for a while. See also sync(2).
As commented by Hans Passant, reading in the middle a textual file could be errorprone (in particular, because an UTF8 character may span on several bytes) if not done very carefully.
Notice that for a C (fopen) or C++ point of view, textual files and binary files differ notably on how they handle end of lines.
If performance matters a lot for you, you could use directly low level systems calls like read(2) and write(2) and lseek(2) but then be careful to use wide enough buffers (typically of several kilobytes, e.g. 4Kbytes to 512Kbytes, or even several megabytes). Don't forget to use the returned read or written byte count (some IO operations can be partial, or fail, etc...). Avoid if possible (for performance reasons) to repeatedly read(2) only a dozen of bytes. You could instead memory-map the file (or a segment of it) using mmap(2) (before mmap-ing, use stat(2) to get metadata information, notably file size). And you could give advices to the kernel using posix_fadvise(2) or (for file mapped into virtual memory) madvise(2). Performance details are heavily system dependent (file system, hardware -SSD and hard-disks are different!, system load).
At last, you should consider using some higher-level library on binary files such as indexed files à la GDBM or the sqlite library, or consider using real databases such as PostGreSQL, MonogDB etc.
Apparently, your files contain genomics information. Probably you don't care about end-of-line processing and could open them as binary streams (or directly as low-level Unix file descriptors). Perhaps there already exist free software libraries to parse them. Otherwise, you might consider a two-pass approach: a first pass is reading sequentially the entire file and remembering (in C++ containers like std::map) the interesting parts and their offsets. A second pass would use direct access. You might even have some preprocessor converting your genomics file into SQLITE or GDBM files, and have your application work on these. You probably should avoid opening these files as text (but just as binary file) because end-of-line processing is useless to you.
On a 64 bits system, if you handle only a few files (not thousands of them at once) of several dozens of gigabytes, memory mapping (with mmap) them should make sense, then use madvise (but on a 32 bits system, you won't be able to mmap the entire file).
Plasibly, yes. Whenever you seek, the cached file data for that file is (likely to be) discarded, causing extra overhead of, at least, a system call to fetch the data again.
Assuming the file isn't enormous, it MAY be a better choice to read the entire file into memory (or, if you don't need portability, use a memory mapped file, at which point caching of the file content is trivial - again assuming the entire file fits in (virtual) memory).
However, all this is implementation dependent, so measuring performance of each method would be essential - it's only possible to KNOW these things for a given system by measuring, it's not something you can read about and get precise information on the internet (not even here on SO), because there are a whole bunch of factors that affect the behaviour.

method to read multiple lines from a file at once without partial lines

I'm reading in from a CSV file, parsing it, and storing the data, pretty simple.
Right now were using the standard readLine() method to do that, and I'm trying to squeeze some extra efficency out of this processing loop. I don't know how much they hide behind the scenes, but I assume each call to getLine is a new OS call with all the pain that entails? I don't want to pay for OS calls on each line of input. I would provide a huge buffer and have it fill the buffer with many lines at once.
However, I only care about full lines. I don't want to have to handle maintaining partial lines from one buffer read to append to the second buffer read to make a full line, that's just ugly and annoying.
So, is there a method out there that does this for me? It seems like there almost has to be. Any method which I can instruct to read in x number of lines, or x bytes but don't output the last partial line, or even an easy way for me to manage the memory buffer so I minimize the amount of code for handling partial strings would be appreciated. I can use Boost, though if there is a method in standard C++ I would prefer that.
Thanks.
It's very unlikely that you'll be able to do better than the built-in C++ streams. They're quite fast. In general, the fastest way to completely read a file is to use a single thread to read the entire file from start to end, especially if the file is contiguous on disk. Furthermore, it's likely that the disk is much more of a bottleneck during reading than the OS. If you need to improve the performance of your app, I have a few recommendations.
Use a profiler. If your app is reading a line then parsing it or processing it in some way, it's possible that the parsing or processing is something that can be optimized. This can be determined in profiling. If parsing or processing takes up substantial CPU resources, then optimization may be worth the effort.
If you determine that parsing or processing is responsible for a slow application, and that it can't be easily optimized, consider multiprogramming. If the processing of individual lines does not depend on the results of previous lines being processed, then use multiple threads or CPUs to do the processing.
Use pipelining if you have to process multiple files. For example, suppose you have four stages in your app: reading, parsing, processing, saving. It may be more efficient to read one file at a time rather then all of them all at once. However, while reading the second file, you can still parse the first one. While reading the third file, you can parse the second file and process the first one, etc. One way to implement this is a staged mult-threaded application design.
Use RAID to improve disk reads. Certain raid modes can create faster reads and writes.
i am java programmer, but still i have a hint... read the data in a stream. that means for example 4 or 5 times 2048bytes (or much more)... you can iterate over the stream (and convert it) and search for your line-ends(or some other char)... but i think "readLine" is doing the same anyway...

Read from a file or read the file into a buffer and then use the buffer(in C++)?

I am writing a parser wherein, I need to read characters from a file. But I will be reading the file character by character, and may even stop reading in the middle if come conditions do not satisy.
So is it advisable to create an ifstream of the file, and seek to the position everytime and start reading from there, Or should I read the entire file into a stream or buffer, and then use that instead??
If you can, use a memory-mapped file.
Boost offers a cross-platform one: http://www.boost.org/doc/libs/1_35_0/libs/iostreams/doc/classes/mapped_file.html
How big is the file? Do you make more than one pass? Whether you read it into an in-memory buffer or not, reading the file will consume (file size/BUFSIZ) reads to go through the whole thing. Reading character by character doesn't matter, because the underlying read still consumes BUFSIZ bytes at a time (unless you take steps to change that behavior) -- it just hands them out character-by-character.
If you're reading it and processing it in one pass anyway, then reading it into memory will mean you always need (file size/BUFSIZ) reads, where -- assuming the reason for stopping is distributed equiprobably -- reading it and processing in line will take on average (file size/BUFSIZ) * 0.5 reads, which on a big file could be a substantial gain.
An even more important question might be "what are you doing looking for this complicated a solution?" The amount of time it takes to figure out the cute solution probably dominates any gains you'll make from looking for something fancier than the standard "while not end of file, get character and process" solution.
Seeking the position every time and reading wouldn't be a better option for this as it degrades the performance,
Try creating a Buffer and read from that that would be a better idea and more efficient
Try to read all the file contents at a stretch to the buffer and then process the subsequent input needs with the buffer and without reading from the file everytime,,
On a full service OS (i.e. Windows, Mac OS, Linux, BSD...) the operating system will have a caching mechanism that handles this for you to some extent (and assuming your usage patterns meet some definition of "usual").
Unless you are experiencing unacceptable performance you might want to merrily ignore the whole issue (i.e. just use the naive file access primitives).

Reading from a socket 1 byte a time vs reading in large chunk

What's the difference - performance-wise - between reading from a socket 1 byte a time vs reading in large chunk?
I have a C++ application that needs to pull pages from a web server and parse the received page line by line. Currently, I'm reading 1 byte at a time until I encounter a CRLF or the max of 1024 bytes is reached.
If reading in large chunk(e.g. 1024 bytes at a time) is a lot better performance-wise, any idea on how to achieve the same behavior I currently have (i.e. being able to store and process 1 html line at a time - until the CRLF without consuming the succeeding bytes yet)?
EDIT:
I can't afford too big buffers. I'm in a very tight code budget as the application is used in an embedded device. I prefer keeping only one fixed-size buffer, preferrably to hold one html line at a time. This makes my parsing and other processing easy as I am by anytime I try to access the buffer for parsing, I can assume that I'm processing one complete html line.
Thanks.
I can't comment on C++, but from other platforms - yes, this can make a big difference; particularly in the amount of switches the code needs to do, and the number of times it needs to worry about the async nature of streams etc.
But the real test is, of course, to profile it. Why not write a basic app that churns through an arbitrary file using both approaches, and test it for some typical files... the effect is usually startling, if the code is IO bound. If the files are small and most of your app runtime is spent processing the data once it is in memory, you aren't likely to notice any difference.
If you are reading directly from the socket, and not from an intermediate higher-level representation that can be buffered, then without any possible doubt, it is just better to read completely the 1024 bytes, put them in RAM in a buffer, and then parse the data from the RAM.
Why? Reading on a socket is a system call, and it causes a context switch on each read, which is expensive. Read more about it: IBM Tech Lib: Boost socket performances
First and simplest:
cin.getline(buffer,1024);
Second, usually all IO is buffered so you don't need to worry too much
Third, CGI process start usually costs much more then input processing (unless it is huge
file)... So you may just not think about it.
G'day,
One of the big performance hits by doing it one byte at a time is that your context is going from user time into system time over and over. And over. Not efficient at all.
Grabbing one big chunk, typically up to an MTU size, is measurably more efficient.
Why not scan the content into a vector and iterate over that looking out for \n's to separate your input into lines of web input?
HTH
cheers,
You are not reading one byte at a time from a socket, you are reading one byte at a atime from the C/C++ I/O system, which if you are using CGI will have alreadety buffered up all the input from the socket. The whole point of buffered I/O is to make the data available to the programmer in a way that is convenient for them to process, so if you want to process one byte at a time, go ahead.
Edit: On reflection, it is not clear from your question if you are implementing CGI or just using it. You could clarify this by posting a code snippet which indicates how you currently read read that single byte.
If you are reading the socket directly, then you should simply read the entire response to the GET into a buffer and then process it. This has numerous advantages, including performance and ease of coding.
If you are linitted to a small buffer, then use classic buffering algorithms like:
getbyte:
if buffer is empty
fill buffer
set buffer pointer to start of buffer
end
get byte at buffer pointer
increment pointer
You can open the socket file descritpor with the fdopen() function. Then you have buffered IO so you can call fgets() or similar on that descriptor.
There is no difference at the operating system level, data are buffered anyway. Your application, however, must execute more code to "read" bytes one at a time.