Absolute fastest way to store a 32-bit integer to disk? - c++

I have a very latency sensitive routine that generates integers sequentially, but needs to store the last generated one to disk in case of a crash or re-start.
Currently I'm doing a seek to beginning of file then writing out the integer then flush each time a new int is generated. The flush is required so the write at least hits the battery-backed controller cache.
The seek is quite costly so I was thinking about just appending 4 bytes and if recovery is needed then to seek to the end and read the last 4 bytes. This previous statement obviously assumes that there isn't too much other disk activity happening, so the write head should ideally stay at end of the file.
The number won't typically go higher than 10,000,000 so 40MB isn't so bad.
Any advice as to how to achieve minimum latency without sacrificing integrity?
C or C++ on Linux 2.6+

I would think the fastest/easiest way to do this would be with mmap/msync -- mmap 1 page of the file into memory and store the value on that page. Any time the value changes, call msync(2) to force the page back to disk. This way you need only one system call per store

If I read correctly, how about using a memory mapped file? Just write your number to the assigned address and it appears in the file. This makes assumptions that the OS writing the cache to disk robustly when needed, but you might find it worth a try.
int len = sizeof(unsigned);
int fildes = open(...)
void* address = mmap(0, len, PROT_READ, MAP_PRIVATE, fildes, 0)
unsigned* mappedNumber = (unsigned*)(address);
*mappedNumber can now contain your integer.

Measure.
How much control do you have over the hardware? If anything less than full, you'll get no guarantees.
On Linux I'd probably try making a kernel driver that would do its writes with the highest priority, possibly even without using a file system.
But, theoretically... If it is enough for you to hit the controller cache, data will hit it every time you flush anything to disk. This means regardless of whether there will be physical seek inside the drive or not, the data will already be there. And because you'll never know what will other applications do, or how fast does the disk rotate, your seeks will be random even if you keep the logical file handle at the beginning or end of file.
And you can always ask your user to use a flash drive.

The fastest way to write a file is to map that file into memory and treat it as a char array.
You don't need to sync the file if you don't care about OS crashes (Linux never crashed on me in production). All your writes go to that file mapping bypassing the kernel, in other words, real zero-copy (you can't do that with sockets on the standard hardware yet). You may need to keep a header in that file that contains a number of records written in case your application crash during writing a record into the memory. I.e. write a record and only after that increment the record counter.
Resizing this file requires ftruncate()/remap() sequence which may take a bit too long, so you may want to minimize resizing by growing the file by a factor, like std::vector<> grows by 1.5 its size on push_back() when it overflows. Depending on your throughput and latency requirements certain optimization can be applied.
The kernel is going to write the file mapping to disk asynchronously (as if there were another thread in your application dedicated to writing to disk). There is a way to force the writes to disk if necessary by using msync(). This is only necessary, however, if you'd like to survive an OS crash. But surviving an OS crash requires sophisticated application design anyway, so in practice surviving the application crash is good enough.

Why does your application have to wait for the write complete at all?
Write your data asynchronously, or perhaps from another thread.
You don't really have much low-level control over the harddrive. As long as you write so little data at a time, you're going to incur a lot of expensive seeks. But since you're only using it as "checkpoints" to recover from in case of a crash, there seems to be no reason why the write couldn't occur asynchronously.

Storing an int only takes one block on disc, regardless of block size. So you have to sync one block to disc, and it takes as long as it takes, and there is nothing you can do to make it faster.
Whatever else you do, fdatasync() will be the killer, time-wise. It will sync one block into your (battery-backed RAID) controller.
Unless you have some kind of non-volatile ram, all (sensible) methods are going to be exactly equivalent because they all require one block to be sync'd.
Doing a seek system call is not going to make any difference, as that has no effect on hardware. In any case, you can avoid it by using pwrite().

Consider what "appending 4 bytes" means. Disks don't store files, or even bytes. They store clusters, and a fixed number of them. The notion of a file is created by the OS. It allocates some clusters to file system tables, to keep track of where a file is precisely located. Now, appending 4 bytes means at least writing the 4 bytes to a cluster. But that also means determining which cluster. What's the existing file size? Do we need a new cluster? If not, we need to read the last cluster, patch the 4 bytes in the correct position, and write back the cluster, then update the file size in the file system. If we do append a new cluster, we can write the 4 bytes followed by zeroes (don't need old value) but we need to do a whole lot of bookkeeping to add a cluster to a file.
So, the absolute fastest way cannot ever be to append 4 bytes. You must overwrite 4 existing bytes. Preferably in a sector that you already have in memory. Others have already pointed out that you can achieve this with mmap/msync.
Obviously, given current SSD and developer prices, and your 40 MB limit, you'll be using an SSD. It pays for itself if you save an hour. Therefore seek times are irrelevant; SSDs don't have physical heads.

There are a lot of people here talking about mmap() as if that will fix something, but your syscall overhead is basically zero compared to the disk write overhead. Remember that appending or writing to a file requires you to update the inode (mtime, filesize) anyway, so that means a disk seek.
I suggest you consider storing the integer somewhere other than a disk. For example:
write it to some nvram that you control (eg. on an embedded system). (If your RAID controller has nvram for writing, it might do this for you. But if you're asking this question, it probably doesn't.)
write it to free bytes in the system CMOS memory (eg. on PC hardware).
write it to another machine on the network (if it's a fast network) and get them to acknowledge.
redesign your application so you can get away with syncing after every n transactions, instead of after every transaction. That will be about n times faster than doing it every time.
redesign your application so that if the integer is lost, the changes from your most recent transaction are also lost. Then the fact that you've technically lost an integer update doesn't matter; when you reboot, it'll be as if you never incremented it, so you can just resume from there.
You didn't explain why you need this behaviour; to be honest, if your app needs this, it sounds like your application is probably not designed very well. For example, some people suggested using a database because they do this sort of thing all the time; true, but databases do it by being slow (ie. syncing the disk every time), unless you create a transaction first, in which case the disk only needs to get synced when you do 'commit transaction'. But if you absolutely must have a sync after every integer, you'd be constantly committing transactions, and a database couldn't save you from that; there's no magical way a database could guarantee not to lose data unless it does at least fdatasync().

Related

Optimal way of writing to append-only files on an SSD

I want to know what's the optimal way to log to an SSD. Think of something like a database log, where you're writing append-only, but you also have to fsync() every transaction or few transactions to ensure application level data durability.
I'm going to give some background on how SSDs work, so if you already know all this, please skim it anyway in case I am wrong about something. Some good stuff for further reading is Emmanuel Goossaert 6-part guide to coding for SSDs and the paper Don't Stack your Log on my Log [pdf].
SSDs write and read in whole pages only. Where the page size differs from SSD to SSD but is typically a multiple of 4kb. My Samsung EVO 840 uses an 8kb page size (which incidentally, Linus calls "unusable shit" in his usual colorful manner.) SSDs cannot modify data in-place, they can only write to free pages. So combining those two restrictions, updating a single byte on my EVO requires reading the 8kb page, changing the byte, and writing it to a new 8kb page and updating the FTL page mapping (a ssd data structure) so the logical address of that page as understood by the OS now points to the new physical page. Because the file data is also no longer contiguous in the same erase block (the smallest group of pages that can be erased) we are also building up a form of fragmentation debt that will cost us in future garbage collection in the SSD. Horribly inefficient.
As an asside, looking at my PC filesystem: C:\WINDOWS\system32>fsutil
fsinfo ntfsinfo c: It has a 512 byte sector size and a 4kb allocation
(cluster) size. Neither of which map to the SSD page size - probably
not very efficient.
There's some issues with just writing with e.g. pwrite() to the kernel page cache and letting the OS handle writing things out. First off, you'll need to issue an additional sync_file_range() call after calling pwrite() to actually kick off the IO, otherwise it will all wait until you call fsync() and unleash an IO storm. Secondly fsync() seems to block future calls to write() on the same file. Lastly you have no control over how the kernel writes things to the SSD, which it may do well, or it may do poorly causing a lot of write amplification.
Because of the above reasons, and because I need AIO for reads of the log anyway, I'm opting for writing to the log with O_DIRECT and O_DSYNC and having full control.
As I understand it, O_DIRECT requires all writes to be aligned to sector size and in whole numbers of sectors. So every time I decide to issue an append to the log, I need to add some padding to the end to bring it up to a whole number of sectors (if all writes are always a whole number of sectors, they will also be correctly aligned, at least in my code.) Ok, that's not so bad. But my question is, wouldn't it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
That could burn a huge amount of space, especially if writing small amounts of data to the log at a time (e.g a couple hundred bytes.) It also may be unnecessary. SSDs like the Samsung EVO have a write cache, and they don't flush it on fsync(). Instead they rely on capacitors to write the cache out to the SSD in the event of a power loss. In that case, maybe the SSD does the right thing with an append only log being written sectors at a time - it may not write out the final partial page until the next append(s) arrives and completes it (or unless it is forced out of the cache due to large amounts of unrelated IOs.) Since the answer to that likely varies by device and maybe filesystem, is there a way I can code up the two possibilities and test my theory? Some way to measure write amplification or the number of updated/RMW pages on Linux?
I will try to answer your question, as I had the same task but in SD cards, which is still a flash memory.
Short Answer
You can only write a full page of 512 bytes in flash memory. Given the flash memory has a poor write count, the driver chip is buffering/randomizing to improve your drive lifetime.
To write a bit in flash memory, you must erase the entire page (512 bytes) where it sits first. So if you want to append or modify 1 byte somewhere, first it has to erase the entire page where it resides.
The process can be summarized as:
Read the whole page to a buffer
Modify the buffer with your added content
Erase the whole page
Rewrite the whole page with the modified buffer
Long Answer
The Sector (pages) is basically down to the very hardware of the flash implementation and flash physical driver, in which you have no control. That page has to be cleared and rewritten each time you change something.
As you probably already know, you cannot rewrite a single bit in a page without clearing and rewriting the entire 512 bytes. Now, Flash drives have a write cycle life of about 100'000 before a sector can be damaged. To improve lifetime, usually the physical driver, and sometimes the system will have a writing randomization algorithm to avoid always writing the same sector. (By the way, never do defragmentation on an SSD; it's useless and at best reduces the lifetime).
Concerning the cluster, this is handled at a higher level which is related to the file system and this you have control. Usually, when you format a new hard drive, you can select the cluster size, which on windows refers to the Allocation Unit Size of the format window.
Most file systems as I know work with an index which is located at the beginning of the disk. This index will keep track of each cluster and what is assigned to it. This means a file will occupy at least 1 sector, even if it's much smaller.
Now the trade-off is smaller is your sector size, bigger will be your index table and will occupy a lot of space. But if you have a lot of small files, then you will have a better occupation space.
On the other hand, if you only store big files and you want to select the biggest sector size, just slightly higher than your file size.
Since your task is to perform logging, I would recommend to log in single, huge file with big sector size. Having experimented with this type of log, having large amount of file within a single folder can cause issue, especially if you are in embedded devices.
Implementation
Now, if you have raw access to the drive and want to really optimize, you can directly write to the disk without using the file system.
On the upside
* Will save you quite some disk space
* Will render the disk tolerant in case of failure if your design is smart enough
* will require much fewer resources if you are on a limited system
On the downside
* Much more work and debug
* The drive won't be natively recognized by the system.
If you only log, you don't need to have a file system, you just need an entry point to a page where to write your data, which will continuously increase.
The implementation I've done on an SD card was to save 100 pages at the begging of the flash to store information about write and read location. This was held in a single page, but to avoid memory cycle issue, I would sequentially write in a circular method over the 100 pages and then have an algorithm to check which was the last to contain most recent information.
The position storage was written was done every 5 minutes or so which means in case of the power outage I would lose only 5 minutes of the log. It is also possible from the last write location to check further sector if they contain valid data before writing further.
This provided a very robust solution as they are very less likely to have table corruption.
I would also suggest to buffer 512 bytes and write page by page.
Others
You may also want to check some log specific file system, they might simply do the job for you: Log-structured file system

Writer/Reader buffer mechanism for large size - high freq data c++

I need a single writer and multiple reader (up to 5) mechanism that the writer pushes the data of size almost 1 MB each and 15 packages per second continuously which will be writtern in c++. What I’m trying to do is one thread keeps writing the data while 5 readers are going to make some search operations according to the timestamp of the data simultaneously. I have to keep each data package 60 min, and then they can be removed from the container.
Since the data can grow like 15 MB * 60 sec * 60 min = 54000MB/h I need almost 50 GB space to keep the data and make the operations fast enough for both the writer and the readers. Bu the thing is we cannot keep that size data on cache or RAM so it must be in a Hard drive like SSD (HDD would be too slow for that kind of operation)
Up to now what I’ve been thinking is, to make a circular buffer (since I can calculate the max size) directly implemented to an SSD, which I couldn’t find a suitable example up to now and I don’t know if it is possible or not either, or to implement some kind of mapping mechanism that one circular array will be available in the RAM that just keeps the timestamps of the data and physical address of the memory for searching the data which is available on the hard drive. So at least the search operations would be faster I guess.
Since any kind of lock, mutex or semaphore will slow down the operations (especially write is critical we cannot loose data because of any read operation) I don’t want to use them. I know there are some shared locks available but I think again they have some drawbacks. Is there any way/idea to implement such kind of system with lock free, wait free and thread safe as well? Any Data structure (container), pattern, example code/project or other kind of suggestions will be highly appreciated, thank you…
EDIT: Is there any other idea rather than bigger amount of RAM?
This can be done on a commodity PC (and can scale to a server without code changes).
Locks are not a problem. With a single writer and few consumers that do time-consuming tasks on big data, you will have rare locking and practically zero lock contention, so it's a non-issue.
Anything from a simple spinlock (if you're really desperate for low latency) or preferrably a pthread_mutex (which falls back to being a spinlock most of the time, anyway) will do fine. Nothing fancy.
Note that you do not acquire a lock, receive a megabyte of data from a socket, write it to disk, and then release the lock. That's not how it works.
You receive a megabyte of data and write it to a region that you own exclusively, then acquire a lock, change a pointer (and thus transfer ownership), and release the lock. The lock protects the metadata, not every single byte in a gigabyte-sized buffer. Long running tasks, short lock times, contention = zero.
As for the actual data, writing out 15MiB/s is absolutely no challenge, a normal harddisk will do 5-6 times as much, and a SSD will easily do 10 to 20 times that. It also isn't something you even need to do yourself. It's something you can leave to the operating system to manage.
I would create a 54.1GB1 file on disk and memory map that (assuming it's a 64bit system, a reasonable assumption when talking of multi-gigabyte-ram-servers, this is no problem). The operating system takes care of the rest. You just write your data to the mapped region which you use as circular buffer2.
What was most recently written will be more or less guaranteed3 to be resident in RAM, so the consumers can access it without faulting. Older data may or may not be in RAM, depending on whether your server has enough physical RAM available.
Data that is older can still be accessed, but likely at slightly slower speed (if there is not enough physical RAM to keep the whole set resident). It will however not affect the producer or the consumers reading the recently written data (unless the machine is so awfully low-spec that it can't even hold 2-3 of your 1MiB blocks in RAM, but then you have a different problem!).
You are not very concrete on how you intend to process data, other than there will be 5 consumers, so I will not go too deep into this part. You may have to implement a job scheduling system, or you can just divide each incoming block in 5 smaller chunks, or whatever -- depending on what exactly you want to do.
What you need to account for in any case is the region (either as pointer, or better as offset into the mapping) of data in your mapped ringbuffer that is "valid" and the region that is "unused".
The producer is the owner of the mapping, and it "allows" the consumers to access the data within the bounds given in the metadata (a begin/end pair of offsets). Only the producer may change this metadata.
Anyone (including the producer) accessing this metadata needs to acquire a lock.
It is probably even possible to do this with atomic operations, but seeing how you only lock rarely, I wouldn't even bother. It's a no-brainer using a lock, and there are no subtle mistakes that you can make.
Since the producer knows that the consumers will only look at data within well-defined bounds, it can write to areas outside the bounds (the area known being "emtpy") without locking. It only needs to lock to change the bounds afterwards.
As 54.1Gib > 54Gib, you have a hundred spare 1MiB blocks in the mapping that you can write to. That's probably much more than needed (2 or 3 should do), but it doesn't hurt to have a few extra. As you write to a new block (and increase the valid range by 1), also adjust the other end of the "valid range". That way, threads will no longer be allowed to access an old block, but a thread still working in that block can finish its work (the data still exists).
If one is strict about correctness, this may create a race condition if processing a block takes extremely long (over 1 1/2 minutes in this case). If you want to be absolutely sure, you'll need another lock which may in the worst case block the producer. That's something you absolutely didn't want, but blocking the producer in the worst case is the only thing that is 100% correct in every contrieved case unless a hypothetical computer has unlimited memory.
Given the situation, I think this theoretical race is an "allowable" thing. If processing a single block really takes that long with so much data steadily coming in, you have a much more serious problem at hand, so practically, it's a non-issue.
If your boss decides, at some point in the future, that you should keep more than 1 hour of backlog, you can enlarge the file and remap, and when the "empty" region is next at the end of the old buffer's size, simply extend the "known" file size, and adjust your max_size value in the producer. The consumer threads don't even need to know. You could of course create another file, copy the data, swap, and keep the consumers blocked in the mean time, but I deem that an inferior solution. It is probably not necessary for a size increase to be immediately visible, but on the other hand it is highly desirable that it is an "invisible" process.
If you put more RAM into the computer, your program will "magically" use it, without you needing to change anything. The operating system will simply keep more pages in RAM. If you add another few consumers, it will still work the same.
1 Intentionally bigger than what you need, let there be a few "extra" 1MiB blocks.
2 Preferrably, you can madvise the operating system (if you use a system that has a destructive DONT_NEED hint, such as Linux) that you are no longer interested in the contents before overwriting a region. But if you don't do that, it will work either way, only slightly less efficient because the OS will possibly do a read-modify-write operation where a write operation would have been enough.
3 There is of course never really a guarantee, but it's what will be the case anyway.
54GB/hour = 15MB/s. Good SSD these days can write 300+ MB/s. If you keep 1 hour in RAM and then occasionally flush older data to disk, you should be able to handle 10x more than 15MB/s (provided your search algorithm is fast enough to keep up).
Regarding fast locking mechanism between your threads, I would suggest looking into RCU - Read-Copy Update. Linux kernel is currently using it to achieve very efficient locking.
Do you have some minimum hardware requirements? 54GB in memory is perfectly possible these days (many motherboards can take 4x16GB these days, and that's not even server hardware). So if you want to require an SSD, you could maybe just as well require a lot of RAM and have an in-memory circular buffer as you suggest.
Also, if there's sufficient redudancy in the data, it may be viable to use some cheap compression algorithms (those which are easy on the CPU, i.e. some sort of 'level 0' compression). I.e. you don't store the raw data, but some compressed format (and possibly some index) which is decompressed by the readers.
Many good recommendations around. I'd like just to add that for circular buffer implementation you can have a look at Boost Circular Buffer

C++ Reading from several sections of a file is too slow

I need to read byte arrays from several locations of a big file.
I have already optimized the file so that as few sections as possible have to be read, and the sections are as closely together as possible.
I have 20 calls like this one:
m_content.resize(iByteCount);
fseek(iReadFile,iStartPos ,SEEK_SET);
size_t readElements = fread(&m_content[0], sizeof(unsigned char), iByteCount, iReadFile);
iByteCount is around 5000 on average.
Before using fread, I used a memory-mapped file, but the results were approximately the same.
My calls are still too slow (around 200 ms) when called for the first time. When I repeat the same call with the same sections of bytes to read, it is very fast (around 1 ms), but that does not really help me.
The file is big (around 200 mb).
After this call, I have to read double values from a different section of the file, but I can not avoid this.
I don't want to split it up in 2 files. I have seen the "huge file approach" used by other people, too, and they overcame this problem somehow.
If I use memory-mapping, the first call of reading is always slow. If I then repeat reading from this section, it is lightening fast. When I then read from a different section, it is slow for the first time, but then lightening fast the second time.
I have no idea why this is so.
Does anybody have any more ideas for me?
Thank you.
Disk drives have two (actually three) factors that limit their speed: access time, sequential bandwidth, and bus latency/bandwidth.
What you feel most is access time. Access time is typically in the millisecond ballpark. Having to do a seek takes upwards of 5 (often more than 10) milliseconds on a typical harddisk. Note that the number printed on a disk drive is the "average" time, not the worst time (and, in some cases it seems that it's much closer to "best" than "average").
Sequential read bandwidth is typically upwards of 60-80 MiB/s even for a slow disk, and 120-150 MiB/s for a faster disk (or >400MiB on solid state). Bus bandwidth and latency are something you usually don't care about as bus speed usually exceeds the drive speed (except if you use a modern solid state disk on SATA-2, or a 15k harddisk on SATA-1, or any disk over USB).
Also note that you cannot change the drive's bandwidth, nor the bus bandwidth. Nor can you change the seek time. However, you can change the number of seeks.
In practice, this means you must avoid seeks as much as you can. If that means reading in data that you do not need, do not be afraid of doing so. It is much faster to read 100 kiB than to read 5 kiB, seek ahead 90 kilobytes, and read another 5 kiB.
If you can, read the whole file in one go, and only use the parts you are interested in. 200 MiB should not be a big hindrance on a modern computer. Reading in 200 MiB with fread into an allocated buffer might however be forbidding (that depends on your target architecture, and what else your program is doing). But don't worry, you have already had the best solution to the problem: memory mapping.
While memory mapping is not a "magic accelerator", it is nevertheless as close to "magic" as you can get.
The big advantage of memory mapping is that you can directly read from the buffer cache. Which means that the OS will prefetch pages, and you can even ask it to more aggressively prefetch, so effectively all your reads will be "instantaneous". Also, what is stored in the buffer cache is in some sense "free".
Unluckily, memory mapping is not always easy to get right (especially since the documentation and the hint flags typically supplied by operating systems are deceptive or counter-productive).
While you have no guarantee that what has been read once stays in the buffers, in practice this is the case for anyting of "reasonable" size. Of course the operating system cannot and will not keep a terabyte of data in RAM, but something around 200 MiB will quite reliably stay in the buffers on a "normal" modern computer. Reading from buffers works more or less in zero time.
So, your goal is to get the operating system to read the file into its buffers, as sequentially as possible. Unless the machine runs out of physical memory so it is forced to discard buffer pages, this will be lightning fast (and if that happens, every other solution will be equally slow).
Linux has the readahead syscall which lets you prefetch data. Unluckily, it blocks until data has been fetched, which is not what you probably want (you would thus have to use an extra thread for this). madvise(MADV_WILLNEED) is a less reliable, but probably better alternative. posix_fadvise may work too, but note that Linux limits the readahead to twice the default readahead size (i.e. 256kiB).
Do not have yourself being fooled by the docs, as the docs are deceptive. It may seem that MADV_RANDOM is a better choice, as your access is "random". It makes sense to be honest to the OS about what you're doing, doesn't it? Usually yes, but not here. This, simply turns off prefetching, which is the exact opposite of what you really want. I don't know the rationale behind this, maybe some ill-advised attempt to converve memory -- in any case it is detrimental to your performance.
Windows (since Windows 8, for desktop only) has PrefetchVirtualMemory which does exactly what one would want here, but unluckily it's only available on the newest version. On older versions, there is just... nothing.
A very easy, efficient, and portable way of populating the pages in your mapping is to launch a worker thread that faults every page. This sounds horrendous, but it works very nicely, and is operating-system agnostic.
Something like volatile int x = 0; for(int i = 0; i < len; i += 4096) x += map[i]; is entirely sufficient. I am using such code to pre-fault pages prior to accessing them, it works at speeds unrivalled to any other method of populating buffers and uses very little CPU.
(moved to an answer as requested by the OP)
You cannot read from a file any quicker (there is no magic flag to say "read faster"). There is either an issue with your hardware or 200mS is how long it is supposed to take
1) The difference in access speed between your first read and subsequent ones is perfectly understandable : your first call actually read the file from the disk, and this takes time. However your kernel (not mentioning the disk controller) keep the accessed data buffered so when you access it a second time it is a pure memory access (1ms).
Even if you only need to access really tiny portions of the file, libc/kernel/controller optimizations access the disk in quite large chunk. You can read the libc/OS/controller doc to try and align your reads on these chunks.
2) You're using stream input, try using direct open/read/close functions : low-level I/O have less overhead (obviously). Nothing gets faster than this, so if you still find this too slow, you have an OS or hardware issue.
as it look you have a good benchmark, try to switch the size and the count in your fread call. reading 1 times 1000 bytes will be faster than 1000 x 1 byte.
Disk is slow, and as you pointed out, the delay comes from the first access - that's the disk spinning up and accessing the sectors necessary. You're always going to pay that cost one time.
You could improve your performance a little by using memory mapped IO. See either mmap (Linux) or CreateFileMapping+MapViewOfFile (Windows).
I have already optimized the file so that as few sections as possible have to be read
Correct me if I'm wrong, but in reference to the file being optimised, I'm assuming you mean you've ordered the sections to minimize the number of reads that take place and not what I'm going to suggest.
Being bound by IO here is likely due to the seek times, so other than getting a faster storage medium, your options are limited.
Two possible ideas I had are: -
1) Compress the data that is stored, which may give you slightly faster read times, but will still not help with seek time. You'd have to test if this benefits at all.
2) If relevant, as soon as you've retrieved one block of data, move it to a thread and start processing it while another read takes place. You may be doing this already, but if not, I thought it worth mentioning.

performance of fread/fwrite while doing operation with binary file

I am writing some binary data into a binary file through fwrite and once i am through with writing i am reading back the same data thorugh fread.While doing this i found that fwrite is taking less time to write whole data where as fread is taking more time to read all data.
So, i just want to know is it fwrite always takes less time than fread or there is some issue with my reading portion.
Although, as others have said, there are no guarantees, you'll typically find that a single write will be faster than a single read. The write will be likely to copy the data into a buffer and return straight away, while the read will be likely to wait for the data to be fetched from the storage device. Sometimes the write will be slow if the buffers fill up; sometimes the read will be fast if the data has already been fetched. And sometimes one of the many layers of abstraction between fread/fwrite and the storage hardware will decide to go off into its own little world for no apparent reason.
The C++ language makes no guarantees on the comparative performance of these (or any other) functions. It is all down to the combination of hardware and operating system, the load on the machine and the phase of the moon.
These functions interact with the operating system's file system cache. In many cases it is a simple memory-to-memory copy. Write could indeed be marginally faster if you run your program repeatedly. It just needs to find a hole in the cache to dump its data. Flushing that data to the disk happens at a time you can't see or measure.
More work is usually needed to read. At a minimum it needs to traverse the cache structure to discover if the disk data is already cached. If not, it is going to have to block on a disk driver request to retrieve the data from the disk, that takes many milliseconds.
The standard trap with profiling this behavior is taking measurements from repeated runs of your program. They are not at all representative for the way your program is going to behave in the wild. The odds that the disk data is already cached are very good on the second run of your program. They are very poor in real life, reads are likely to be very slow, especially the first one. An extra special trap exists for a write, at some point (depending on the behavior of other programs too), the cache is not going to be able to buffer the write request. Write performance is then going to fall of a cliff as your program gets blocked until enough data is flushed to the disk.
Long story short: don't ever assume disk read/write performance measurements are representative for how your program will behave in production. And perhaps more to the point: there isn't anything you can do to solve disk I/O perf problems in your code.
You are seeing some effect of the buffer/cache systems as other have said, however, if you use async API (as you said your suing fread/write you should look at aio_read/aio_write) you can experiment with some other methods for I/O which are likely more well optimized for what your doing.
One suggestion is that if you are read/update/write/reading a file a lot, you should, by way of an ioctl or DeviceIOControl, request to the OS to provide you the geometry of the disk your code is running on, then determine the size of a disk cylander so you may be able to determine if you can do your read/write operations buffered inside of a single cylinder. This way, the drive head will not move for your read/write and save you a fair amount of run time.

Are memory mapped files bad for constantly changing data?

I have a service that is responsible for collecting a constantly updating stream of data off the network. The intent is that the entire data set must be available for use (read only) at any time. This means that the newest data message that arrives to the oldest should be accessible to client code.
The current plan is to use a memory mapped file on Windows. Primarily because the data set is enormous, spanning tens of GiB. There is no way to know which part of the data will be needed, but when its needed, the client might need to jump around at will.
Memory mapped files fit the bill. However I have seen it said (written) that they are best for data sets that are already defined, and not constantly changing. Is this true? Can the scenario that I described above work reasonably well with memory mapped files?
Or am I better off keeping a memory mapped file for all the data up to some number of MB of recent data, so that the memory mapped file holds almost 99% of the history of the incoming data, but I store the most recent, say 100MB in a separate memory buffer. Every time this buffer becomes full, I move it to the memory mapped file and then clear it.
Any data set that is defined and doesn't change is best!
Memory mapped files generally win over anthing else - most OSs will cache the accesses in RAM anyway.
And the performance will be predictable, you don't fall off a cliff when you start to swap.
Sounds like a database fits your description. Paging is something most commercial ones do well out of the box.
From your problem statement, I see following requirements:
data must be always available
data is written once, I assume it is append only, never overwritten.
data read access pattern is random, i.e jumping around
there also appears to have an implicit latency requirement
Seems to me, memory mapped file is chosen to address 3) + 4). If your data size can be fit into memory, this may well be a reasonable solution. However, if your data size is too large to fit in memory, memory mapped file may result in performance issue due to frequent page fault.
You did not describe how "jumping around" is done. If it is possible to build an index, you may be able to save data into multiple files, keep index in memory, use index to load data and serve, and also cache most frequent used data. The basic idea is similar to disk based hash. This is probably a more scalable solution.
Since you tagged this Win32 I'm assuming you're working on a 32 bit machine, in which case you simply don't have enough address space to memory map all of your data set. This means you will have to create and destroy mappings into the file as you "jump around", which is going to make this less efficient than you might expect.
In practice, you typically have a bit more than 1 GB of contiguous address space to memory map the file into on a 32 bit windows box, and you can end up with less if you fragment your address space.
That being said, doing this with memory maps does have a benefit if you are memory (not address space) constrained, since when you memory map a file as read only (as opposed to explicitly reading it into memory) the OS will not have a second copy in the file system cache.
The file can be mapped as readonly in one thread that presents the data and have a background worker thread which has the file mapped as readwrite to do the appending.