See the code below for an example. size is 1 MB, and it certainly runs faster than when size is 1. I think this is because the number of I/O system calls is reduced. Does this mean I will always benefit from a larger buffer size? I hoped so and ran some tests, but there seems to be a limit. A size of 2 runs much faster than a size of 1, but the improvement does not continue at that rate as the size grows.
Could someone explain this better? What is the optimal buffer size likely to be? And why don't I benefit much from making the buffer arbitrarily large?
By the way, in this example I write to stdout for simplicity, but I'm also thinking about writing to files on disk.
#include <stdio.h>
#include <stdlib.h>
enum
{
size = 1 << 20
};
void fill_buffer(char (*)[size]);
int main(void)
{
long n = 100000000;
for (;;)
{
char buf[size];
fill_buffer(&buf);
if (n <= size)
{
if (fwrite(buf, 1, n, stdout) != n)
{
goto error;
}
break;
}
if (fwrite(buf, 1, size, stdout) != size)
{
goto error;
}
n -= size;
}
return EXIT_SUCCESS;
error:
fprintf(stderr, "fwrite failed\n");
return EXIT_FAILURE;
}
You usually don't need the best buffer size. Finding it may require querying the OS for system parameters, doing complex estimation, or even benchmarking on the target environment, and the answer changes dynamically. Luckily, you just need a value that is good enough.
I would say a 4K~16K buffer suits most normal usage. 4K is the usual page size on common machines (x86, ARM) and is also a multiple of the usual physical disk sector size (512 B or 4K).
If you are dealing with a huge amount of data (gigabytes), you may find that the simple fwrite model is inadequate because of its blocking nature.
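If you do want to see where "good enough" lies on a particular machine, a rough sketch like the one below can time the same amount of output at several chunk sizes. The helper names write_in_chunks and time_write are made up for illustration, and the fill byte is arbitrary; treat it as a starting point under those assumptions, not a rigorous benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Writes `total` bytes to `out` using fwrite calls of at most `chunk` bytes.
// Returns the number of bytes actually written.
std::size_t write_in_chunks(std::FILE* out, std::size_t total, std::size_t chunk)
{
    assert(chunk > 0);
    std::vector<char> buf(chunk, 'x');
    std::size_t written = 0;
    while (written < total) {
        const std::size_t left = total - written;
        const std::size_t want = left < chunk ? left : chunk;
        const std::size_t got = std::fwrite(buf.data(), 1, want, out);
        written += got;
        if (got != want) break;  // short write: stop and report what was written
    }
    return written;
}

// Times one run against a file at `path` (hypothetical helper).
// Returns elapsed seconds, or a negative value on failure.
double time_write(const char* path, std::size_t total, std::size_t chunk)
{
    std::FILE* out = std::fopen(path, "wb");
    if (!out) return -1.0;
    const auto t0 = std::chrono::steady_clock::now();
    const std::size_t written = write_in_chunks(out, total, chunk);
    const bool closed_ok = std::fclose(out) == 0;  // fclose flushes stdio's buffer
    const auto t1 = std::chrono::steady_clock::now();
    if (written != total || !closed_ok) return -1.0;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling time_write with chunk sizes such as 1, 512, 4096, 16384 and 1 << 20 for the same total should show the curve flattening out. Keep in mind that the OS page cache absorbs writes, so the numbers say more about call overhead than about the disk itself.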
On a large partition, cluster size is often 32 KB. On a large read / write request, if the system sees that there are a series of contiguous clusters, it will combine them into a single I/O. Otherwise, it breaks up the request into multiple I/O's. I don't know what the maximum I/O size is. On some old SCSI controllers, it was 64 KB or 1 MB - 8 KB (17 or 255 descriptors, in controller). For IDE / Sata, I've been able to do IOCTL's for 2 MB, confirming it was a single I/O with an external bus monitor, but I never tested to determine the limit.
For external sorting with k way bottom up merge sort with k > 2, read / write size of 10 MB to 100 MB is used to reduce random access overhead. The request will be broken up into multiple I/O's but the read or write will be sequential (under ideal circumstances).
I need to calculate the length of the longest sequence of zero bytes in a binary file as fast as possible. I have a basic implementation in C++ below:
#include <iostream>
#include <fstream>
#include <algorithm>
#include <string>
int get_max_zero_streak(std::string fname)
{
std::ifstream myfile(fname, std::ios_base::binary);
int length = 0;
int streak = 0;
while(myfile)
{
unsigned char x = myfile.get(); // unsigned 8 bit integer
if(x == 0)
{
streak += 1;
}
else
{
length = std::max(length, streak);
streak = 0;
}
}
return length;
}
int main()
{
std::cout << get_max_zero_streak("000_c.aep") << std::endl;
std::cout << get_max_zero_streak("000_g1.aep") << std::endl;
std::cout << get_max_zero_streak("000_g2.aep") << std::endl;
std::cout << get_max_zero_streak("001_c.aep") << std::endl;
std::cout << get_max_zero_streak("001_g1.aep") << std::endl;
std::cout << get_max_zero_streak("001_g2.aep") << std::endl;
std::cout << get_max_zero_streak("002_c.aep") << std::endl;
std::cout << get_max_zero_streak("002_g1.aep") << std::endl;
std::cout << get_max_zero_streak("002_g2.aep") << std::endl;
return 0;
}
This is OK on smaller files, but is extraordinarily slow on larger files (such as 50 GB+). Is there a more efficient way to write this, or is parallelizing it my only hope? I'm reading the files from an NVMe SSD, so I don't think the drive read speed is the limitation.
The main bottleneck of this code is the byte-by-byte I/O, which prevents the loop from saturating most SSDs (by a large margin). Calling myfile.get() on every iteration is slow, and compilers do not optimize it away in practice. One fix is to read the data chunk by chunk and use an inner loop that operates on the current chunk. Chunks must be large enough to amortize the cost of I/O calls and small enough that the data stays in the CPU caches (a buffer of 32 KiB~256 KiB should be good enough).
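A minimal sketch of the chunk-by-chunk approach could look like this. The function name mirrors the question's, the 64 KiB default is just one value from the range suggested above, and size_t is used for the counts so that runs longer than an int can hold are not a problem.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Longest run of zero bytes in the file, read `chunk_size` bytes at a time.
// Returns 0 if the file cannot be opened.
std::size_t max_zero_streak(const std::string& fname,
                            std::size_t chunk_size = 64 * 1024)
{
    assert(chunk_size > 0);
    std::ifstream in(fname, std::ios_base::binary);
    std::vector<char> buf(chunk_size);
    std::size_t longest = 0;
    std::size_t streak = 0;  // carried across chunk boundaries
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        // The last read may be short; gcount() says how many bytes are valid.
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < got; ++i) {
            if (buf[i] == 0) {
                ++streak;
            } else {
                longest = std::max(longest, streak);
                streak = 0;
            }
        }
    }
    return std::max(longest, streak);  // a run may extend to the end of the file
}
```

The inner loop is still the simple scalar one, so this only removes the per-byte I/O call overhead.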
Another big bottleneck is the computing loop itself: conditional branches are slow on modern processors unless they can be predicted (e.g. if most values are zero, or most are non-zero, and there are few unpredictable switches between the two). Branchless code should make the loop significantly faster, but the loop-carried dependency then becomes the main issue. Clang is able to generate a branchless assembly loop, but GCC does not manage this optimization. Still, the loop stays serial and scalar, so the processor cannot process more than about one byte per clock cycle. In fact, it is much less than that because of instruction latency and dependencies. An efficient fix is to operate on multiple bytes at the same time. One solution is to use SIMD instructions (x86-specific and fast). Another is to load 8 bytes and operate on a large integer using bit tricks, but that is a bit tricky to do efficiently and I doubt it will be much faster in the end.
Regarding the SIMD solution, the idea is to load data in blocks of 16 bytes, compare all 16 bytes to 0 simultaneously, and set one bit of an integer to 1 or 0 for each byte according to the result of the comparison. You can then count the trailing 0 bits very quickly (with generic code or with compiler built-ins like __builtin_ctz, which generates a fast x86 instruction). Finally, you need to shift the mask by the number of trailing 0 bits found. This solution is not very fast if there are many 0-1 switches, but it is very fast when there are many large blocks of 0 (or 1). Note that with AVX2 it is even possible to process 32 bytes at once. The SIMD intrinsics used for this are _mm_load_si128, _mm_cmpeq_epi8 and _mm_movemask_epi8.
Note that you can use a lookup table (LUT) to speed things up when there are a lot of switches. This is not so simple, though, because the zero counts of two consecutive masks need to be added together. The LUT can take the number of pending zeros from the previous mask into account to be even faster. The idea is to find the maximum number of consecutive 0 bits in an 8-bit mask (representing 8 bytes) given the number of zero bytes pending from before.
Another possible optimization is to process the chunks in parallel. The idea is to reduce each chunk to 3 values: the number of trailing (pending) 0 bytes, the number of leading 0 bytes, and the maximum number of consecutive 0 bytes found between the two. Note that the special case where all bytes in a chunk are 0 must be handled. These values must be stored temporarily, and a final pass at the end merges the results of all chunks. This can be implemented quite simply with OpenMP tasks (one per chunk).
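The per-chunk reduction and the merge step can be sketched as follows. The ChunkSummary type and the summarize/merge names are invented for this illustration, and the chunk length is stored as a fourth field so the all-zero case can be detected. The parallel scheduling itself (OpenMP tasks or otherwise) is left out.

```cpp
#include <algorithm>
#include <cstddef>

struct ChunkSummary {
    std::size_t length;    // total bytes in the chunk
    std::size_t leading;   // zero bytes at the very start
    std::size_t trailing;  // zero bytes at the very end (the "pending" ones)
    std::size_t best;      // longest zero run anywhere in the chunk
};

// Reduces one chunk to its summary; independent of every other chunk.
ChunkSummary summarize(const unsigned char* data, std::size_t n)
{
    ChunkSummary s{n, 0, 0, 0};
    std::size_t streak = 0;
    bool seen_nonzero = false;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == 0) {
            ++streak;
        } else {
            if (!seen_nonzero) s.leading = streak;
            seen_nonzero = true;
            s.best = std::max(s.best, streak);
            streak = 0;
        }
    }
    s.trailing = streak;
    s.best = std::max(s.best, streak);
    if (!seen_nonzero) s.leading = n;  // special case: the chunk is all zeros
    return s;
}

// Combines the summary of a chunk with that of the chunk directly after it.
ChunkSummary merge(const ChunkSummary& a, const ChunkSummary& b)
{
    ChunkSummary r;
    r.length = a.length + b.length;
    // If a is all zeros, the leading run continues into b (and vice versa).
    r.leading = (a.leading == a.length) ? a.length + b.leading : a.leading;
    r.trailing = (b.trailing == b.length) ? b.length + a.trailing : b.trailing;
    // A run may also straddle the boundary between the two chunks.
    r.best = std::max({a.best, b.best, a.trailing + b.leading});
    return r;
}
```

Because merge is associative, the summaries can be folded left to right in the final pass, and the best field of the last result is the answer for the whole file.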
As a starting point, I'd read a large chunk of data at once (or, if you're on Linux, memory map the file; a note on Windows follows at the end of this answer).
Then note that we don't care about the length of every run of zero bytes. We only care about finding a run longer than the longest we've found yet. We can use that to improve speed considerably.
For example, let's assume the data starts with a run of 26 zero bytes (bytes 1 through 26), and the 27th byte is non-zero.
So, the next possible starting point is the 28th byte. But we don't start counting the length of a run from there. At this point, we care about whether we have a run of at least 27 zeros starting from byte 28. For that to be the case, all the bytes from 28 to 54 have to be zeros. So we can jump straight to byte 54 and see if it's a zero. Then if (and only if) it's zero, we step backward through the preceding bytes to find the beginning of this string of zeros, compute the location of the earliest end that will give a new longest run starting from that point, and repeat.
But in the highly likely case (a 255/256 chance, if the data is random) that byte 54 is non-zero, we can skip ahead another 27 bytes (to byte 81) and see if it's zero or not. Again, most likely it's not, in which case we skip ahead another 27 bytes.
As the current maximum length grows, the efficiency of this algorithm grows with it. Most of the time we're looking at only one out of every length bytes of the data. If length gets very large, there may easily be entire sectors we never need to read in from secondary storage at all.
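Here is one way the skip-ahead search might be written for data that is already in memory (for example a mapped file or one large chunk). The function name longest_zero_run is made up, and indices are zero-based here, unlike the walkthrough above.

```cpp
#include <cstddef>

// Longest run of zero bytes in data[0..n), probing only where a new record
// could end instead of looking at every byte.
std::size_t longest_zero_run(const unsigned char* data, std::size_t n)
{
    std::size_t best = 0;  // longest run found so far
    std::size_t pos = 0;   // earliest index where a longer run could start
    while (pos + best < n) {
        // A run of best+1 zeros starting at pos must cover this byte.
        const std::size_t probe = pos + best;
        if (data[probe] != 0) {
            pos = probe + 1;  // no record-breaking run can include probe
            continue;
        }
        // Step backward to find where this string of zeros begins.
        std::size_t lo = probe;
        while (lo > pos && data[lo - 1] == 0) --lo;
        if (lo > pos) {
            pos = lo;  // the run starts later than hoped; probe again from there
            continue;
        }
        // Every byte in [pos, probe] is zero: a new record. Extend it forward.
        std::size_t hi = probe + 1;
        while (hi < n && data[hi] == 0) ++hi;
        best = hi - pos;
        pos = hi + 1;  // data[hi] is non-zero (or the end), so restart after it
    }
    return best;
}
```

While best is still 0 this degenerates into a plain linear scan; the skipping only pays off once a reasonably long run has been found.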
In theory you can memory map the file on Windows as well, but at least in my experience, memory mapping tends to be a big win on Linux, but rarely gives much improvement on Windows. If you're using Windows, you're generally better off calling CreateFile directly, and passing FILE_FLAG_NO_BUFFERING. See Optimizing an I/O bound Win32 application for more details.
I want to read the first 16 bytes out of every X*16 bytes of a file. The code I wrote works, but is quite slow, because of many function calls.
std::vector<Vertex> readFile(int maxVertexCount) {
std::ifstream in = std::ifstream(fileName, std::ios::binary);
in.seekg(0, in.end);
int fileLength = in.tellg();
int vertexCount = fileLength / 16;
int stepSize = std::max(1, vertexCount / maxVertexCount);
std::vector<Vertex> vertices;
vertices.reserve(vertexCount / stepSize);
for (int i = 0; i < vertexCount; i += stepSize) {
in.seekg(i * 16, in.beg);
char bytes[16];
in.read(bytes, 16);
vertices.push_back(Vertex(bytes));
}
in.close();
}
Could someone give me some suggestions to increase the performance of this code?
Don't use seek. I would mmap the whole file and then simply read the bytes at the desired locations.
I'm not going to write the code for you, but it should be along the lines of:
Open the file, calculate the file size
mmap the whole file
Iterate through in your step sizes calculating the address of each block
Construct each Vertex based on the block and push into the vector
return the vector.
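For what it's worth, the steps above might look roughly like this on a POSIX system. Block is a stand-in for the question's Vertex (whose definition isn't shown), and sample_blocks is an invented name. Error handling is reduced to returning an empty vector.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

using Block = std::array<char, 16>;  // stand-in for the question's Vertex

// Returns the first 16 bytes of every step-th 16-byte block of the file.
std::vector<Block> sample_blocks(const char* path, std::size_t max_count)
{
    assert(max_count > 0);
    std::vector<Block> out;
    const int fd = open(path, O_RDONLY);
    if (fd == -1) return out;

    struct stat sb;
    if (fstat(fd, &sb) == -1 || sb.st_size < 16) {
        close(fd);
        return out;
    }
    const std::size_t size = static_cast<std::size_t>(sb.st_size);
    const std::size_t block_count = size / 16;
    const std::size_t step = std::max<std::size_t>(1, block_count / max_count);

    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return out;

    const char* base = static_cast<const char*>(map);
    out.reserve(block_count / step + 1);
    for (std::size_t i = 0; i < block_count; i += step) {
        Block b;
        std::memcpy(b.data(), base + i * 16, 16);  // address = base + block offset
        out.push_back(b);
    }
    munmap(map, size);
    return out;
}
```

Only the pages that are actually touched get faulted in, but with a small step that can still be most of the file.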
It's likely not the function calls themselves, but the non-sequential access pattern: picking small segments from a large file. Even though you are reading only 16 bytes, the storage subsystem likely reads (and caches) larger blocks. Your access pattern is deadly for typical I/O.
(Profiling should show whether disk access is the bottleneck. If it were "many function calls", the CPU would be.)
So, first and foremost, can you change this requirement?
This is in all scenarios the easiest way out.
Could you scatter less? E.g. instead of reading vertices 0, 20, 40, ..., 1000 , read vertices 0,1,2,3,4, 100, 101, 102, 103, 104, 200, 201, 202, 203, 204, ... - same number of vertices, from "all parts" of the file.
Second, OS specific optimizations.
There's no portable way to control OS level caching.
One solution is memory mapping the file (CreateFileMapping on Windows, mmap on Linuxy systems), as suggested by @Nim. This can omit one memory copy, but still the entire file will be read.
Can't help much with Linux, but on Windows you have as parameters to CreateFile:
FILE_FLAG_NO_BUFFERING which basically means you do the buffering, giving you finer control over the caching that happens, but you can't seek + read willy-nilly.
FILE_FLAG_SEQUENTIAL_SCAN which just tells the cache not to store old data
Neither of these will solve the problem with your access pattern, but the first may mitigate it somewhat - especially if your steps are larger than disk sectors, and the second can take pressure off the subsystem.
Third, Snapshot.
The best option may be to store an interleaved snapshot in an associated file.
The snapshot could simply be the result of your operation, for a particular maxVertexCount. Or multiple snapshots, like mipmapping. The idea is to replace the scattered read by a sequential read.
Alternatively, the snapshot can store the data in interleaved order. For 128 vertices, you could store vertices in that order (roughly, beware of off-by-one, zero-vs-one-based and aliasing effects, and my mistakes):
64, 32, 96, 16, 48, 80, 112, 8, 24, 40, 56, 72, 88, 104, 120 ...
Whether you read the first 3 or first 7 or first 15 or first 31... values, the samples are equally spread out across the file, like in your original code. Rearranging them in memory will be much faster - especially if it's just a small subset.
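A small sketch that generates the order listed above, assuming the vertex count is a power of two (the function name interleaved_order is invented). Like the listing, it never emits index 0, so a real snapshot would have to place that vertex somewhere, for instance at the front.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Produces 64, 32, 96, 16, 48, 80, 112, 8, 24, ... for n = 128:
// each pass halves the step and emits the odd multiples of it.
std::vector<std::size_t> interleaved_order(std::size_t n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);  // assumption: n is a power of two
    std::vector<std::size_t> order;
    order.reserve(n - 1);
    for (std::size_t step = n / 2; step >= 1; step /= 2) {
        for (std::size_t i = step; i < n; i += 2 * step) {
            order.push_back(i);
        }
    }
    return order;
}
```

Any prefix of length 1, 3, 7, 15, ... of the result is an evenly spaced sample of the original indices.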
Note: you need a robust algorithm to detect that your snapshot is out of date, independent of the many funny things that happen with "last write date" on different file systems. A "change counter" in the main file would be the safest one (though it would increase the cost of changes).
Fourth, Change the file format
(In case you can control that)
The interleaved storage suggested above could be used for the entire file. However, this has big implications for processing - especially if you need to restore the "original" order at some point.
An elegant option would be having such an interleaved subset as part of the file, and the full list of vertices in original order. There is a cutoff stepSize where this doesn't help much anymore, probably around 2*sector/block size of the disk. So file size would increase only by a few percent. However, writes would get a little more costly, changes to the amount of vertices significantly worse.
Aliasing warning
If this is intended to get a "statistical" or "visually sufficient" sampling, a fixed stepSize might be problematic, since it can create aliasing effects (think Moire patterns) with any patterns present in the data.
In this case, random sampling would be preferable. That may sound scary and makes some of the solutions above a bit harder, but it is often the easiest way to avoid many suboptimal cases.
... and if for some reason you can't map the file, read it into a buffer in "great big gulps": a buffer size that is some multiple of X bytes. Continue reading into that buffer (taking care to notice how many bytes were read) until you reach the end of the file.
What you're specifically trying to avoid is a whole bunch of physical I/O operations: movement of the disk's read/write mechanism. The operating system likes to buffer things for this reason but it can only guess at what your application is trying to do and it might guess wrong. Once the disk has positioned the read/write head to the proper track ("seek time"), it can retrieve a whole track's worth of data in one rotation. But "seek time" is comparatively slow.
Mapping the file, and then reading the data in the mapped file non-randomly, is clearly the most favorable strategy because now the operating system knows exactly what's going on.
First, I assume you're returning the vector by value, judging from the definition, even though your posted code lacks a return statement, so the vector may have to be copied (unless the compiler elides or moves it). Pass it by reference into your method so no copying is needed.
And you can use low-level pread() to read without needing to seek:
void readFile( size_t maxVertexCount, std::vector<Vertex> &vertices )
{
    struct stat sb;
    // open, fstat, pread and close are POSIX calls in the global namespace, not std::
    int fd = open( fileName, O_RDONLY );
    fstat( fd, &sb );
    size_t vertexCount = sb.st_size / 16;
    // std::max needs both arguments to have the same type
    size_t stepSize = std::max<size_t>( 1, vertexCount / maxVertexCount );
    vertices.reserve( vertexCount / stepSize );
    for ( size_t i = 0; i < vertexCount; i += stepSize )
    {
        char bytes[ 16 ];
        // pread reads at an absolute offset, so no seek is needed
        pread( fd, bytes, 16, static_cast<off_t>( 16 * i ) );
        vertices.push_back( Vertex( bytes ) );
    }
    close( fd );
}
You should be able to figure out the required error handling and header files.
This takes advantage of the kernel's page cache and likely read-ahead. Depending on your OS and configuration, other methods such as mmap() or reading the entire file may or may not be faster.
I'm developing a program with several threads that manages the streaming from several cameras. I have to write every raw image to an SSD. I'm using fwrite to put each image in a binary file. Something like:
FILE* output;
output = fopen(fileName, "wb");
fwrite(imageData, imageSize, 1, output);
fclose(output);
The procedure seems to run fast enough to save all images at the cameras' throughput. The problem is that the save procedure is CPU-intensive, and I start to have sync issues when the save process is enabled, due to the CPU usage of the save threads.
Is there any way to reduce the CPU load of fwrite operations? Like playing with buffering, better DMA settings, ...?
Thanks!
MIX
-- UPDATE 1
Forgetting the multithreading software, here is a simple file writer software:
#include <stdio.h>
#include <stdlib.h>
const unsigned int TOT_DATA = 1280*2*960;
int main(int argc, char* argv[])
{
if(argc != 2)
{
printf("Usage:\n");
printf(" %s totWrite\n\n", argv[0]);
return -1;
}
char* imageData;
FILE* output;
char fileName[256];
unsigned int totWrite;
totWrite = atoi(argv[1]);
imageData = new char[TOT_DATA];
printf("Write imageData[%u] on file %u times.\n", TOT_DATA, totWrite);
for(unsigned int i = 0; i < totWrite; i++)
{
sprintf(fileName, "image_%06u.raw", i);
output = fopen(fileName, "wb");
fwrite(imageData, TOT_DATA, 1, output);
fclose(output);
}
printf("DONE!\n");
delete [] imageData;
return 0;
}
A char buffer is created and written to a file totWrite times. There are no overwrites, since each cycle writes to a new file. (Of course, one has to remove the files written by a previous run...)
Running top (I'm on Linux) while the program is running, I see that ~50% of the CPU (that means 50% of one of the 4 cores) is used. I suppose fwrite is the bottleneck for CPU usage, since it is the slowest operation in the cycle and therefore the one most likely to be running when top updates its stats. It would be even more likely if TOT_DATA were increased, say, 100 times.
Any further consideration about what can reduce CPU usage in such program?
If you're considering playing with DMA settings, you're way outside the scope of the standard C library. The result will be nowhere near portable, and then you get no benefit from using portable functions.
The first step you should probably take (after you've confirmed that the CPU is the bottleneck) is to use lower-level functions such as open/write (or whatever your OS calls them).
Basically, what can happen with fwrite is that the program first copies the data to another place in memory (the FILE* buffer) before actually writing it to disk. This copy is certainly CPU bound, and if the CPU copies data more slowly than the SSD accepts it, CPU power is being consumed for no good reason.
Also note that using multiple threads has its drawbacks. First, if it were not an SSD, multiple threads writing to disk could cause redundant head movements, which is bad; but even an SSD may suffer somewhat, as you might fragment the layout of the data.
There's also a problem with holding the entire file in memory, as you seem to do in the example, especially if you do it in multiple threads. It will simply consume a lot of memory (which could force swapping). If possible, you should write the data to the file as it arrives.
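On Linux, the lower-level route could look something like the sketch below. The helper name write_image and the 0644 permission bits are assumptions for illustration. Whether it actually lowers CPU usage for this workload is something to measure rather than assume.

```cpp
#include <cerrno>
#include <cstddef>

#include <fcntl.h>
#include <unistd.h>

// Writes `size` bytes from `data` to a new file at `path` without going
// through a FILE* buffer. Returns true on success.
bool write_image(const char* path, const char* data, std::size_t size)
{
    const int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return false;

    std::size_t done = 0;
    while (done < size) {
        const ssize_t n = write(fd, data + done, size - done);
        if (n == -1) {
            if (errno == EINTR) continue;  // interrupted before writing; retry
            close(fd);
            return false;
        }
        done += static_cast<std::size_t>(n);  // write may accept fewer bytes than asked
    }
    return close(fd) == 0;
}
```

In the question's loop this would replace the fopen/fwrite/fclose triple, e.g. write_image(fileName, imageData, TOT_DATA).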
In my case I have various files; let's assume that I have a >4 GB file of data. I want to read that file line by line and process each line. One of my restrictions is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (min 4 GB). You can also assume that processing the lines isn't the bottleneck.
In my current solution I read the file with ifstream and copy each line into a string. Here is a snippet showing how it looks.
std::ifstream file(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
m_numLines++;
}
And OK, that works, but too slowly. Here is the timing for my 3.6 GB of data:
real 1m4.155s
user 0m0.000s
sys 0m0.030s
I'm looking for a method that will be much faster than that. For example, I found How to parse space-separated floats in C++ quickly? and I loved the solution presented there with boost::mapped_file, but I ran into another problem: what if my file is too big? In my case a 1 GB file was enough to bring down the entire process. I have to be careful about how much data is in memory, because the people who will be using this tool probably don't have more than 4 GB of RAM installed.
So I found mapped_file from Boost, but how do I use it in my case? Is it possible to read the file partially and still get these lines?
Maybe you have another, much better solution. I just have to process each line.
Thanks,
Bart
Nice to see you found my benchmark at How to parse space-separated floats in C++ quickly?
It seems you're really looking for the fastest way to count lines (or to do any linear single-pass analysis). I've done a similar analysis and benchmark of exactly that here:
Fast textfile reading in c++
Interestingly, you'll see that the most performant code does not need to rely on memory mapping at all there.
static uintmax_t wc(char const *fname)
{
static const auto BUFFER_SIZE = 16*1024;
int fd = open(fname, O_RDONLY);
if(fd == -1)
handle_error("open");
/* Advise the kernel of our access pattern. */
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); // use the named constant; on Linux the literal 1 is POSIX_FADV_RANDOM
char buf[BUFFER_SIZE + 1];
uintmax_t lines = 0;
while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
{
if(bytes_read == (size_t)-1)
handle_error("read failed");
if (!bytes_read)
break;
for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
++lines;
}
close(fd); // release the descriptor once the whole file has been read
return lines;
}
Mapping a large file on a 64-bit system with little memory should be fine, since it's all about address space, although it may well be slower than the "fastest" option in that case; it really depends on what else is in memory and how much memory is available for mapping the file. On a 32-bit system it won't work, since pointers into the file mapping can't go beyond about 3.5 GB at the very most, and typically around 2 GB is the maximum, again depending on what address ranges are available for the OS to map the file into.
However, the benefit of memory mapping a file is pretty small - the huge majority of the time spent is from actually reading the data. The saving from using memory mapping comes from not having to copy the data once it's loaded into RAM. (When using other file-reading mechanisms, the read function will copy the data into the buffer supplied, where memory mapping a file will stuff it straight into the correct location directly).
You might want to look at increasing the buffer for the ifstream - the default buffer is often rather small, this leads to lots of expensive reads.
You should be able to do this using something like:
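std::vector<char> buffer(1024*1024); // needs <vector>; on the heap, since 1 MB may not fit on the stack
std::ifstream file;
// Some implementations (e.g. libstdc++) only honor pubsetbuf if it is called before the file is opened.
file.rdbuf()->pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));
file.open(filename_xml.c_str());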
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
m_numLines++;
}
See this question for more info:
How to get IOStream to perform better?
Since this is Windows, you can use the native Windows file functions with the "Ex" suffix:
Windows file management functions
specifically functions like GetFileSizeEx(), SetFilePointerEx(), ... . The read and write functions are limited to 32-bit byte counts, and the read and write "Ex" functions are for asynchronous I/O rather than for handling large files.
Ok, so I'm reading a binary file into a char array I've allocated with malloc.
(btw the code here isn't the actual code, I just wrote it on the spot to demonstrate, so any mistakes here are probably not mistakes in the actual program.) This method reads at about 50million bytes per second.
main
char *buffer = (char*)malloc(file_length_in_bytes*sizeof(char));
memset(buffer,0,file_length_in_bytes*sizeof(char));
//start time here
read_whole_file(buffer);
//end time here
free(buffer);
read_whole_buffer
void read_whole_buffer(char* buffer)
{
//file already opened
fseek(_file_pointer, 0, SEEK_SET);
int a = sizeof(buffer[0]);
fread(buffer, a, file_length_in_bytes*a, _file_pointer);
}
I've written something similar with managed c++ that uses filestream I believe and the function ReadByte() to read the entire file, byte by byte, and it reads at around 50million bytes per second.
Also, I have a SATA and an IDE drive in my computer, and I've tried loading the file from both; it doesn't make any difference at all (which is weird, because I was under the assumption that SATA reads much faster than IDE).
Question
Maybe you can all understand why this doesn't make any sense to me. As far as I knew, it should be much faster to fread a whole file into an array, as opposed to reading it byte by byte. On top of that, through testing I've discovered that managed c++ is slower (only noticeable though if you are benchmarking your code and you require speed.)
SO
Why in the world am I reading at the same speed with both applications? Also, is 50 million bytes per second from a file into an array quick?
Maybe my motherboard is bottlenecking me? That just doesn't seem to make much sense either.
Is there maybe a faster way to read a file into an array?
thanks.
My 'script timer'
Records start and end time with millisecond resolution...Most importantly it's not a timer
#pragma once
#ifndef __Script_Timer__
#define __Script_Timer__
#include <sys/timeb.h>
extern "C"
{
struct Script_Timer
{
unsigned long milliseconds;
unsigned long seconds;
struct timeb start_t;
struct timeb end_t;
};
void End_ST(Script_Timer *This)
{
ftime(&This->end_t);
This->seconds = This->end_t.time - This->start_t.time;
This->milliseconds = (This->seconds * 1000) + (This->end_t.millitm - This->start_t.millitm);
}
void Start_ST(Script_Timer *This)
{
ftime(&This->start_t);
}
}
#endif
Read buffer thing
char face = 0;
char comp = 0;
char nutz = 0;
for(int i=0;i<(_length*sizeof(char));++i)
{
face = buffer[i];
if(face == comp)
nutz = (face + comp)/i;
comp++;
}
Transfers from or to main memory run at speeds of gigabytes per second. Inside the CPU data flows even faster. It is not surprising that, whatever you do at the software side, the hard drive itself remains the bottleneck.
Here are some numbers from my system, using PerformanceTest 7.0:
hard disk: Samsung HD103SI 5400 rpm: sequential read/write at 80 MB/s
memory: 3 * 2 GB at 400 MHz DDR3: read/write around 2.2 GB/s
So if your system is a bit older than mine, a hard drive speed of 50 MB/s is not surprising. The connection to the drive (IDE/SATA) is not all that relevant; it's mainly about the number of bits passing the drive heads per second, purely a hardware thing.
Another thing to keep in mind is your OS's filesystem cache. It could be that the second time round, the hard drive isn't accessed at all.
The 180 MB/s memory read speed that you mention in your comment does seem a bit on the low side, but that may well depend on the exact code. Your CPU's caches come into play here. Maybe you could post the code you used to measure this?
The FILE* API uses buffered streams, so even if you read byte by byte, the API internally reads buffer by buffer. So your comparison will not make a big difference.
The low level IO API (open, read, write, close) is unbuffered, so using this one will make a difference.
It may also be faster for you, if you do not need the automatic buffering of the FILE* API!
I've done some tests on this, and after a certain point, the effect of increased buffer size goes down the bigger the buffer. There is usually an optimum buffer size you can find with a bit of trial and error.
Note also that fread() (or more specifically the C or C++ I/O library) will probably be doing its own buffering. If your system supports it, a plain read() may (or may not) be a bit faster.