I am reading block of data from volume snapshot using CreateFile/ReadFile and buffersize of 4096 bytes.
The problem I am facing is ReadFile is too slow, I am able to read 68439 blocks i.e. 267 Mb in 45 seconds, How can I increase the speed? Below is a part of my code that I am using,
block_handle = CreateFile(block_file,GENERIC_READ,FILE_SHARE_READ,0,OPEN_EXISTING,FILE_FLAG_SEQUENTIAL_SCAN,NULL);
if(block_handle != INVALID_HANDLE_VALUE)
{
DWORD pos = -1;
for(ULONG i = 0; i < 68439; i++)
{
sectorno = (i*8);
distance = sectorno * sectorsize;
phyoff.QuadPart = distance;
if(pos != phyoff.u.LowPart)
{
pos=SetFilePointer(block_handle, phyoff.u.LowPart,&phyoff.u.HighPart,FILE_BEGIN);
if (phyoff.u.LowPart == INVALID_SET_FILE_POINTER && GetLastError() != NO_ERROR)
{
printf("SetFilePointer Error: %d\n", GetLastError());
phyoff.QuadPart = -1;
return;
}
}
ret = ReadFile(block_handle, data, 4096, &dwRead, 0);
if(ret == FALSE)
{
printf("Error Read");
return;
}
pos += 4096;
}
}
Should I use OVERLAPPED structure? or what can be the possible solution?
Note: The code is not threaded.
Awaiting a positive response.
I'm not quite sure why you're using these extremely low level system functions for this.
Personally I have used C-style file operations (using fopen and fread) as well as C++-style operations (using fstream and read, see this link), to read raw binary files. From a local disk the read speed is on the order of 100MB/second.
In your case, if you don't want to use the standard C or C++ file operations, my guess is that the reason your code is slower is due to the fact that you're performing a seek after each block. Do you really need to call SetFilePointer for every block? If the blocks are sequential you shouldn't need to do this.
Also, experiment with different block sizes, don't be afraid to go up and beyond 1MB.
Your problem is the fragmented data reads. You cannot solve this by fiddling with ReadFile parameters. You need to defragment your reads. here are three approaches:
Defragment the data on the disk
Defragment the reads. That is, collect all the reads you need, but do not read anything yet. Sort the reads into order. Read everything in order, skipping the SetFilePointer wherever possible ( i.e. sequential blocks ). This will speed the total read greatly, but introduce a lag before the first read starts.
Memory map the data. Copy ALL the data into memory and do random access reads from memory. Whether or not this is possible depends on how much data there is in total.
Also, you might want to get fancy, and experiment with caching. When you read a block of data, it might be that although the next read is not sequential, it may well have a high probability of being close by. So when you read a block, sequentially read an enormous block of nearby data into memory. Before the next read, check if the new read is already in memory - thus saving a seek and a disk access. Testing, debugging and tuning this is a lot of work, so I do not really recommend it unless this is a mission critical optimization. Also note that your OS and/or your disk hardware may already be doing something along these lines, so be prepared to see no improvement whatsoever.
If possible, read sequentially (and tell CreateFile you intend to read sequentially with FILE_FLAG_SEQUENTIAL_SCAN).
Avoid unnecessary seeks. If you're reading sequentially, you shouldn't need any seeks.
Read larger chunks (like an integer multiple of the typical cluster size). I believe Windows's own file copy uses reads on the order of 8 MB rather than 4 KB. Consider using an integer multiple of the system's allocation granularity (available from GetSystemInfo).
Read from aligned offsets (you seem to be doing this).
Read to a page-aligned buffer. Consider using VirtualAlloc to allocate the buffer.
Be aware that fragmentation of the file can cause expensive seeking. There's not much you can do about this.
Be aware that volume compression can make seeks especially expensive because it may have to decompress the file from the beginning to find the starting point in the middle of the file.
Be aware that volume encryption might slow things down. Not much you can do but be aware.
Be aware that other software, like anti-malware, may be scanning the entire file every time you touch it. Fewer operations will minimize this hit.
Related
I am trying to make a simple file-based hash table. Here is my insert member function:
private: std::fstream f; // std::ios::in | std::ios::out | std::ios::binary
public: void insert(const char* this_key, long this_value) {
char* that_key;
long that_value;
long this_hash = std::hash<std::string>{}(this_key) % M;
long that_hash; // also block status
long block = this_hash;
long offset = block * BLOCK_SIZE;
while (true) {
this->f.seekg(offset);
this->f.read((char*) &that_hash, sizeof(long));
if (that_hash > -1) { // -1 (by default) indicates a never allocated block
this->f.read(that_key, BLOCK_SIZE);
if (strcmp(this_key, that_key) == 0) {
this->f.seekp(this->f.tellg());
this->f.write((char*) &this_value, sizeof(long));
break;
} else {
block = (block + 1) % M; // linear probing
offset = block * BLOCK_SIZE;
continue;
}
} else {
this->f.seekp(offset);
this->f.write((char*) &this_hash, sizeof(long)); // as block status
this->f.write(this_key, KEY_SIZE);
this->f.write((char*) &this_value, sizeof(long));
break;
}
}
}
Tests up to 10M key, value pairs with 50,000,017 blocks were fairly done. (Binary file size was about 3.8GB).
However, a test with 50M keys and 250,000,013 blocks extremely slows down... (Binary file size was more than 19GB in this case). 1,000 inserts usually takes 4~5ms but exceptionally take more than 2,000ms. It gets slower and slower then takes 40~150ms... (x10 ~ x30 slower...) I definitely have no idea...
What causes this exceptional binary file I/O slowing down?
Do seekg&seekp and other I/O operations are affected by file size? (I could not find any references about this question though...)
How key, value stores and databases avoid this I/O slow down?
How can I solve this problem?
Data size
Usually disk drive block size are a power of 2 so if your data block size is also a power of 2, then you can essentially eliminate the case where a data block cross a disk block boundary.
In your case, a value of 64 bytes (or 32 bytes if you don't really need to store the hash) might somewhat perform a bit better.
Insertion order
The other thing you could do to improve performance is to do your insertion is increasing hash order to reduce the number of a time data must be loaded from the disk.
Generally when data is read or written to the disk, the OS will read/write a large chuck at a time (maybe 4k) so if your algorithm is written is a way to write data locally in time, you will reduce the number of time data must actually be read or written to the disk.
Given that you make a lot of insertion, one possibility would be to process insertion in batch of say 1000 or even 10000 key/value pair at a time. Essentially, you would accumulate data in memory and sort it and once you have enough item (or when you are done inserting), you will then write the data in order.
That way, you should be able to reduce disk access which is very slow. This is probably even more important if you are using traditional hard drive as moving the head is slow (in which case, it might be useful to defragment it). Also, be sure that your hard drive have more than enough free space.
In some case, local caching (in your application) might also be helpful particularly if you are aware how your data is used.
File size VS collisions
When you use an hash, you want to find the sweet spot between file size and collisions. If you have too much collisions, then you will waste lot of time and at some point it might degenerate when it become hard to find a free place for almost every insertion.
On the other hand, if your file is really very large, you might end up in a case where you might fill your RAM with data that is mainly empty and still need to replace data with data from the disk on almost all insertion.
For example, if your data is 20GB and you are able to load say 2 GB in memory, then if insert are really random, 90% of the time you might need real access to hard drive.
Configuration
Well options will depends on the OS and it is beyond the scope of a programming forum. If you want to optimize your computer, then you should look elsewhere.
Reading
It might be helpful to read about operating systems (file system, cache layer…) and algorithms (external sorting algorithms, B-Tree and other structures) to get a better understanding.
Alternatives
Extra RAM
Fast SSD
Multithreading (for ex. input and output threads)
Rewriting of the algorithm (for ex. to read/write a whole disk page at once)
Faster CPU / 64 bit computer
Using algorithms designed for such scenarios.
Using a database.
Profiling code
Tuning parameters
For my neural-network-training project, I've got a very large file of input data. The file format is binary, and it consists of a very large number of fixed-size records. The file is currently ~13GB, but in the future it could become larger; for the purposes of this question let's assume it will be too large to just hold all of it in my computer's RAM at once.
Today's problem involves a little utility program I wrote (in C++, although I think choice of language doesn't matter too much here as one would likely encounter the same problem in any language) that is intended to read the big file and output a similar big file -- the output file is to contain the same data as the input file, except with the records shuffled into a random ordering.
To do this, I mmap() the input file into memory, then generate a list of integers from 1 to N (where N is the number of records in the input file), randomly shuffle the ordering of that list, then iterate over the list, writing out to the output file the n'th record from the mmap'd memory area.
This all works correctly, as far as it goes; the problem is that it's not scaling very well; that is, as the input file's size gets bigger, the time it takes to do this conversion is increasing faster than O(N). It's getting to the point where it's become a bottleneck for my workflow. I suspect the problem is that the I/O system (for MacOS/X 10.13.4, using the internal SSD of my Mac Pro trashcan, in case that's important) is optimized for sequential reads, and jumping around to completely random locations in the input file is pretty much the worst-case scenario as far as caching/read-ahead/other I/O optimizations are concerned. (I imagine that on a spinning disk it would perform even worse due to head-seek delays, but fortunately I'm at least using SSD here)
So my question is, is there any clever alternative strategy or optimization I could use to make this file-randomization-process more efficient -- one that would scale better as the size of my input files increases?
If problem is related to swapping and random disk access while reading random file locations, can you at least read input file sequentially?
When you're accessing some chunk in mmap-ed file, prefetcher will think that you'll need adjacent pages soon, so it will also load them. But you won't, so these pages will be discarded and loading time will be wasted.
create array of N toPositons, so toPosition[i]=i;
randomize destinations (are you using knuth's shuffle?);
then toPosition[i] = destination of input[i]. So, read input data sequentially from start and place them into corresponding place of destination file.
Perhaps, this will be more prefetcher-friendly. Of course, writing data randomly is slow too, but at least, you won't waste prefetched pages from input file.
Additional benefit is that when you've processed few millions of input data pages, these GBs will be unloaded from RAM and you'll never need them again, so you won't pollute actual disk cache. Remember that actual memory page size is at least 4K, so even when you're randomly accessing 1 byte of mmap-ed file, at least 4K of data should be read from disk into cache.
I'd recommend not using mmap() - there's no way all that memory pressure is any help at all, and unless you're re-reading the same data multiple times, mmap() is often the worst-performing way to read data.
First, generate your N random offsets, then, given those offsets, use pread() to read the data - and use low-level C-style IO.
This uses the fcntl() function to disable the page cache for your file. Since you're not re-reading the same data, the page cache likely does you little good, but it does use up RAM, slowing other things down. Try it both with and without the page cache disabled and see which is faster. Note also that I've left out all error checking:
(I'm also assuming C-style IO functions are in namespace std on a MAC, and I've used C-style strings and arrays to match the C-style IO functions while keeping the code simpler.)
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>
void sendRecords( const char *dataFile, off_t offsets, size_t numOffsets )
{
int fd = std::open( dataFile, O_RDONLY );
// try with and without this
std::fcntl( fd, F_NOCACHE, 1 );
// can also try using page-aligned memory here
char data[ RECORD_LENGTH ];
for ( size_t ii = 0; ii < numOffsets; ii++ )
{
ssize_t bytesRead = std::pread( fd, data, sizeof( data ), offsets[ ii ] );
// process this record
processRecord( data );
}
close( datafd );
}
Assuming you have a file containing precalculated random offsets:
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>
void sendRecords( const char *dataFile, const char *offsetFile )
{
int datafd = std::open( dataFile, O_RDONLY );
// try with and without this
std::fcntl( fd, F_NOCACHE, 1 );
int offsetfd = std::open( offsetFile, O_RDONLY );
// can also try using page-aligned memory here
char data[ RECORD_LENGTH ];
for ( ;; )
{
off_t offset;
ssize_t bytesRead = std::read( offsetfd, &offset, sizeof( offset ) );
if ( bytesRead != sizeof( offset ) )
{
break;
}
bytesRead = std::pread( fd, data, sizeof( data ), offset );
// process this record
processRecord( data );
}
std::close( datafd );
std::close( offsetfd );
}
You can go faster, too, since that code alternates reading and processing, and it'd probably be faster to use multiple threads to read and process simultaneously. It's not that hard to use one or more threads to read data into preallocated buffers that you then queue up and send to your processing thread.
Thanks to advice of various people in this thread (in particular Marc Glisse and Andrew Henle) I was able to reduce the execution time of my program on a 13GB input file, from ~16 minutes to ~2 minutes. I'll document how I did it in this answer, since the solution wasn't very much like either of the answers above (it was more based on Marc's comment, so I'll give Marc the checkbox if/when he restates his comment as an answer).
I tried replacing the mmap() strategy with pread(), but that didn't seem to make much difference; and I tried passing F_NOCACHE and various other flags to fcntl(), but they seemed to either have no effect or make things slower, so I decided to try a different approach.
The new approach is to do things in a 2-layer fashion: rather than reading in single records at a time, my program now loads in "blocks" of sequential records from the input file (each block containing around 4MB of data).
The blocks are loaded in random order, and I load in blocks until I have a certain amount of block-data held in RAM (currently ~4GB, as that is what my Mac's RAM can comfortably hold). Then I start grabbing random records out of random in-RAM blocks, and writing them to the output file. When a given block no longer has any records left in it to grab, I free that block and load in another block from the input file. I repeat this until all blocks from the input file have been loaded and all their records distributed to the output file.
This is faster because all of my output is strictly sequential, and my input is mostly sequential (i.e. 4MB of data is read after each seek rather than only ~2kB). The ordering of the output is slightly less random than it was, but I don't think that will be a problem for me.
I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app| ios::binary);
if(outFile.is_open())
cout << "Yes" << endl;
while(1)
{
rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
switch(metaData.error_code)
{
//Irrelevant error checking...
//Write data to a file
std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
}
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
So the main problem here is that you try to write in the same thread as you receive, which means that your recv() can only be called again after copy is complete. A few observations:
Move the writing to a different thread. This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply "write()" to a file descriptor might be better, but that's performance measurements that are up to you
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even if decoupling receiving from writing thread-wise, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
Let's do a bit of math.
Your samples are (apparently) of type std::complex<std::float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-100 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're probably going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some data to copy data to the cache, then somewhat later copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so it's no longer storing other data that's a lot more likely to benefit from caching.
I have noticed that reading a file byte-by-bye takes more time to read whole file than reading file using fread .
According to cplusplus :
size_t fread ( void * ptr, size_t size, size_t count, FILE * stream );
Reads an array of count elements, each one with a size of size bytes, from the stream and stores them in the block of memory specified by ptr.
Q1 ) So , again fread reads the file by 1 bytes , so isn't it the same way as to read by 1-byte method ?
Q2 ) Results have proved that still fread takes lesser time .
From here:
I ran this with a file of approximately 44 megabytes as input. When compiled with VC++2012, I got the following results:
using getc Count: 400000 Time: 2.034
using fread Count: 400000 Time: 0.257
Also few posts on SO talks about it that it depends on OS .
Q3) What is the role of OS ?
Why is it so and what exactly goes behind the scene ?
fread does not read a file one byte at a time. The interface, which lets you specify size and count separately, is purely for your convenience. Behind the scenes, fread will simply read size * count bytes.
The amount of bytes that fread will try to read at once is highly dependent on your C implementation and the underlying filesystem. Unless you're intimately familiar with both, it's often safe to assume that fread will be closer to optimal than anything you invent yourself.
EDIT: physical disks tend to have a relatively high seek time compared to their throughput. In other words, they take relatively long to start reading. But once started, they can read consecutive bytes relatively fast. So without any OS/filesystem support, any call to fread would result in a severe overhead to start each read. So to utilize your disk efficiently, you'll want to read as many bytes at once as possible. But disks are slow compared to CPU, RAM and physical caches. Reading too much at once means your program spends a lot of time waiting for the disk to finish reading, when it could have been doing something useful (like processing already read bytes).
This is where the OS/filesystem comes in. The smart people who work on those have spent a lot of time figuring out the right amount of bytes to request from a disk. So when you call fread and request X bytes, the OS/filesystem will translate that to N requests for Y bytes each. Where Y is some generally optimal value that depends on more variables than can be mentioned here.
Another role of the OS/filesystem is what's called 'readahead'. The basic idea is that most IO occurs inside loops. So if a program requests some bytes from disk, there's a very good chance it'll request the next bytes shortly afterwards. Because of this, the OS/filesystem will typically read slightly more than you actually requested at first. Again, the exact amount depends on too many variables to mention. But basically, this is the reason that reading a single byte at a time is still somewhat efficient (it would be another ~10x slower without readahead).
In the end, it's best to think of fread as giving some hints to the OS/filesystem about how many bytes you'll want to read. The more accurate those hints are (closer to the total amount of bytes you'll want to read), the better the OS/filesystem will optimize the disk IO.
Protip: Use your profiler to identify the most significant bottlenecks in an actual, real-life problem...
Q1 ) So , again fread reads the file by 1 bytes , so isn't it the same way as to read by 1-byte method ?
Is there anything from the manual to suggest that bytes can only be read one at a time? Flash memory, which is becoming more and more common, typically requires that your OS read chunks as large as 512KB at a time. Perhaps your OS performs buffering for your benefit, so you don't have to inspect the entire amount...
Q2 ) Results have proved that still fread takes lesser time .
Logically speaking, that's a fallacy. There is no requirement that fgetc be any slower at retrieving a block of bytes than fread. In fact, an optimal compiler may very well produce the same machine code following optimisation parses.
In reality, it also turns out to be invalid. Most proofs (for example, the ones you're citing) neglect to consider the influence that setvbuf (or stream.rdbuf()->pubsetbuf, in C++) has.
The empirical evidence below, however, integrates setvbuf and, at least on every implementation I've tested it on, has shown fgetc to be roughly as fast as fread at reading a large block of data, within some meaningless margin of error that swings either way... Please, run these tests multiple times and let me know if you find a system where one of these is significantly faster than the other. I suspect you won't. There are two programs to build from this code:
gcc -o fread_version -std=c99 file.c
gcc -o fgetc_version -std=c99 -DUSE_FGETC file.c
Once both programs are compiled, generate a test_file containing a large number of bytes and you can test like so:
time cat test_file | fread_version
time cat test_file | fgetc_version
Without further adieu, here's the code:
#include <assert.h>
#include <stdio.h>
int main(void) {
unsigned int criteria[2] = { 0 };
# ifdef USE_FGETC
int n = setvbuf(stdin, NULL, _IOFBF, 65536);
assert(n == 0);
for (;;) {
int c = fgetc(stdin);
if (c < 0) {
break;
}
criteria[c == 'a']++;
}
# else
char buffer[65536];
for (;;) {
size_t size = fread(buffer, 1, sizeof buffer, stdin);
if (size == 0) {
break;
}
for (size_t x = 0; x < size; x++) {
criteria[buffer[x] == 'a']++;
}
}
# endif
printf("%u %u\n", criteria[0], criteria[1]);
return 0;
}
P.S. You might have even noticed the fgetc version is simpler than the fread version; it doesn't require a nested loop to traverse the characters. That should be the lesson to take away, here: Write code with maintenance in mind, rather than performance. If necessary, you can usually provide hints (such as setvbuf) to optimise bottlenecks that you've used your profiler to identify.
P.P.S. You did use your profiler to identify this as a bottleneck in an actual, real-life problem, right?
It depends how you are reading byte-by-byte. But there is a significant overhead to each call to fread (it probably needs to make an OS/kernel call).
If you call fread 1000 times to read 1000 bytes one by one then you pay that cost 1000 times; if you call fread once to read 1000 bytes then you only pay that cost once.
Consider what's physically happening with the disk. Every time you ask it to perform a read, its head must seek to the correct position and then wait for the right part of the platter to spin under it. If you do 100 separate 1-byte reads, you have to do that 100 times (as a first approximation; in reality the OS probably has a caching policy that's smart enough to figure out what you're trying to do and read ahead). But if you read 100 bytes one operation, and those bytes are roughly contiguous on the disk, you only have to do all this once.
Hans Passant's comment about caching is right on the money too, but even in the absence of that effect, I'd expect 1 bulk read operation to be faster than many small ones.
Other contributors to the speed reduction are instruction pipeline reloads and databus contentions. Data cache misses are similar to the instruction pipeline reloads, so I am not presenting them here.
Function calls and Instruction Pipeline
Internally, the processor has an instruction pipeline in cache (fast memory physically near the processor). The processor will fill up the pipeline with instructions, then execute the instructions and fill up the pipeline again. (Note, some processors may fetch instructions as slots open up in the pipeline).
When a function call is executed, the processor encounters a branch statement. The processor can't fetch any new instructions into the pipeline until the branch is resolved. If the branch is executed, the pipeline may be reloading, wasting time. (Note: some processors can read in enough instructions into the cache so that no reading of instructions is necessary. An example is a small loop.)
Worst case, when you call the read function 1000 times, you are cause 1000 reloads of the instruction pipeline. If you call the read function once, the pipeline is only reloaded once.
Databus Collisions
Data flows through a databus from the hard drive to the processor, then from the processor to the memory. Some platforms allow for Direct Memory Access (DMA) from the hard drive to the memory. In either case, there is contention of multiple users with the data bus.
The most efficient use of the databus is send large blocks of data. When the user (component, such as the processor or DMA) wants to use the databus, the user must wait for it to become available. Worst case, another user is sending large blocks so there is a long delay. When sending 1000 bytes, one at a time, the User has to wait 1000 times for other Users to give up time with the databus.
Picture waiting in a queue (line) at a market or restaurant. You need to purchase many items, but you purchase one, then have to go back and wait in line again. Or you could be like other shoppers and purchase many items. Which consumes more time?
Summary
There are many reasons to use large blocks for I/O transfers. Some of the reasons are with the physical drive, others involve instruction pipelines, data caches, and databus contention. By reducing the quantity of data requests and increasing the data size, the accumulative time is also reduced. One request has a lot less overhead than 1000 requests. If the overhead is 1 millisecond, one request takes 1 millisecond, while 1000 requests take 1 second.
I need to write data into drive. I have two options:
write raw sectors.(_write(handle, pBuffer, size);)
write into a file (fwrite(pBuffer, size, count, pFile);)
Which way is faster?
I expected the raw sector writing function, _write, to be more efficient. However, my test result failed! fwrite is faster. _write costs longer time.
I've pasted my snippet; maybe my code is wrong. Can you help me out? Either way is okay by me, but I think raw write is better, because it seems the data in the drive is encrypted at least....
#define SSD_SECTOR_SIZE 512
int g_pSddDevHandle = _open("\\\\.\\G:",_O_RDWR | _O_BINARY, _S_IREAD | _S_IWRITE);
TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
_write(g_pSddDevHandle,szMemZero,SSD_SECTOR_SIZE);
ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();
FILE * file = fopen("f:\\test.tmp","a+");
TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
fwrite(szMemZero,SSD_SECTOR_SIZE,1,file);
ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();
Probably because a direct write isn't buffered. When you call fwrite, you are doing buffered writes which tend to be faster in most situations. Essentially, each FILE* handler has an internal buffer which is flushed to disk periodically when it becomes full, which means you end up making less system calls, as you only write to disk in larger chunks.
To put it another way, in your first loop, you are actually writing SSD_SECTOR_SIZE bytes to disk during each iteration. In your second loop you are not. You are only writing SSD_SECTOR_SIZE bytes to a memory buffer, which, depending on the size of the buffer, will only be flushed every Nth iteration.
In the _write() case, the value of SSD_SECTOR_SIZE matters. In the fwrite case, the size of each write will actually be BUFSIZ. To get a better comparison, make sure the underlying buffer sizes are the same.
However, this is probably only part of the difference.
In the fwrite case, you are measuring how fast you can get data into memory. You haven't flushed the stdio buffer to the operating system, and you haven't asked the operating system to flush its buffers to physical storage. To compare more accurately, you should call fflush() before stopping the timers.
If you actually care about getting data onto the disk rather than just getting the data into the operating systems buffers, you should ensure that you call fsync()/FlushFileBuffers() before stopping the timer.
Other obvious differences:
The drives are different. I don't know how different.
The semantics of a write to a device are different to the semantics of writes to a filesystem; the file system is allowed to delay writes to improve performance until explicitly told not to (eg. with a standard handle, a call to FlushFileBuffers()); writes directly to a device aren't necessarily optimised in that way. On the other hand, the file system must do extra I/O to manage metadata (block allocation, directory entries, etc.)
I suspect that you're seeing a different in policy about how fast things actually get on to the disk. Raw disk performance can be very fast, but you need big writes and preferably multiple concurrent outstanding operations. You can also avoid buffer copying by using the right options when you open the handle.