I am observing the following behavior with the C++ standard library method std::ostream::write().
To buffer the data I am using the following C++ API:
std::ofstream::rdbuf()->pubsetbuf(char* s, streamsize n)
This works fine (verified using the strace utility) as long as the amount of data we write to the file stream using
std::ofstream::write(const char* s, std::streamsize n)
is less than 1024 bytes (below this value the writes are accumulated until the buffer is full), but as soon as a single write reaches 1024 bytes, the buffer is no longer used and the data is flushed straight to the file.
For example, if I set the buffer size to 10 KB and write around 512 bytes at a time, strace shows that multiple writes have been combined into a single writev call:
writev(3, [{"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 9728}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 512}], 2) = 10240 ( 10 KB )
writev(3, [{"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 9728}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 512}], 2) = 10240
...
But when I write 1024 bytes at a time (keeping the buffer fixed at 10 KB), strace shows that the buffer is not used and each ofstream::write call is translated into its own system call:
writev(3, [{NULL, 0}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 1024}], 2) = 1024 ( 1KB )
writev(3, [{NULL, 0}, {"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 1024}], 2) = 1024
...
Is there any C++ API call or Linux tuning parameter that I am missing?
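For reference, here is a minimal sketch of the setup described above (the file name, sizes, and loop count are illustrative, not the asker's actual code):

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> iobuf(10 * 1024);                 // 10 KB stream buffer
    std::ofstream out;
    // pubsetbuf() generally has to be called before any I/O is performed on the stream.
    out.rdbuf()->pubsetbuf(iobuf.data(), iobuf.size());
    out.open("test.dat", std::ios::binary);

    std::vector<char> payload(512, 'A');                // change 512 to 1024 to see the buffer bypassed
    for (int i = 0; i < 100; ++i)
        out.write(payload.data(), payload.size());
}                                                       // remaining buffered data is flushed on destruction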
This is an implementation detail of libstdc++, implemented around line 650 of bits/fstream.tcc. Basically, if a single write is 2^10 bytes (1 KiB) or larger, it bypasses the buffer entirely.
If you want the rationale behind this decision, I suggest you send a mail to the libstdc++ development list.
http://gcc.gnu.org/ml/libstdc++/
It looks like whoever wrote that part of the standard library implementation made an "optimization" without giving it enough thought. So the only workaround for you would be to avoid the C++ API and use the standard C library instead.
This is not the only suboptimality in the GNU/Linux implementation of the standard C++ library: on my machine, malloc() is 100 cycles faster than the standard void* operator new (size_t size)...
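To make that workaround concrete, here is a hedged sketch using C stdio: setvbuf() installs a user buffer that (at least in glibc) is honoured for any write that fits inside it, so 1 KB chunks are still coalesced. The file name and sizes are illustrative.

#include <cstdio>
#include <cstdlib>

int main()
{
    static char iobuf[10 * 1024];                       // 10 KB stdio buffer
    static char payload[1024];                          // the write size that defeated ofstream

    std::FILE* fp = std::fopen("test.dat", "wb");
    if (!fp)
        return EXIT_FAILURE;
    std::setvbuf(fp, iobuf, _IOFBF, sizeof iobuf);      // must precede any I/O on the stream

    for (int i = 0; i < 100; ++i)
        std::fwrite(payload, 1, sizeof payload, fp);

    std::fclose(fp);                                    // flushes the remaining buffered data
    return EXIT_SUCCESS;
}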
Related
See the code below for an example. size is 1 MB, and it certainly runs faster than when it is 1. I think that is because the number of I/O system calls is reduced. Does this mean I will always benefit from a larger buffer size? I hoped so and ran some tests, but it seems there is some limit: size being 2 runs much faster than 1, but the improvement doesn't keep scaling like that.
Could someone explain this? What is the optimal buffer size likely to be? And why don't I benefit much from increasing the size indefinitely?
By the way, in this example I write to stdout for simplicity, but I'm also thinking about the case of writing to files on disk.
#include <stdio.h>
#include <stdlib.h>

enum
{
    size = 1 << 20
};

void fill_buffer(char (*)[size]);   /* defined elsewhere */

int main(void)
{
    long n = 100000000;
    for (;;)
    {
        char buf[size];
        fill_buffer(&buf);
        if (n <= size)
        {
            if (fwrite(buf, 1, n, stdout) != n)
            {
                goto error;
            }
            break;
        }
        if (fwrite(buf, 1, size, stdout) != size)
        {
            goto error;
        }
        n -= size;
    }
    return EXIT_SUCCESS;

error:
    fprintf(stderr, "fwrite failed\n");
    return EXIT_FAILURE;
}
You usually don't need the best buffer size; finding it may require querying the OS for system parameters, doing complex estimation, or even benchmarking on the target environment, and the answer is dynamic. Luckily, you just need a value that is good enough.
I would say a 4K~16K buffer suits most normal usage, where 4K is the magic number for the page size supported by normal machines (x86, ARM) and is also a multiple of the usual physical disk sector size (512 B or 4K).
If you are dealing with huge amounts of data (gigabytes), you may realise that the simple fwrite model is inadequate because of its blocking nature.
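If you do want to ask the OS rather than guess, here is a small, purely illustrative POSIX sketch: fstat() reports a preferred I/O block size for the target stream in st_blksize, which is typically the 4K figure mentioned above.

#include <cstdio>
#include <sys/stat.h>

int main()
{
    struct stat st;
    if (fstat(fileno(stdout), &st) == 0)                // fileno() is POSIX, not ISO C
        std::printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);
    return 0;
}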
On a large partition, the cluster size is often 32 KB. On a large read/write request, if the system sees a series of contiguous clusters, it will combine them into a single I/O. Otherwise, it breaks the request up into multiple I/Os. I don't know what the maximum I/O size is. On some old SCSI controllers it was 64 KB or 1 MB - 8 KB (17 or 255 descriptors in the controller). For IDE/SATA, I've been able to do IOCTLs for 2 MB, confirming with an external bus monitor that it was a single I/O, but I never tested to determine the limit.
For external sorting with a k-way bottom-up merge sort with k > 2, read/write sizes of 10 MB to 100 MB are used to reduce random-access overhead. The requests will be broken up into multiple I/Os, but the reads and writes will be sequential (under ideal circumstances).
In my case I have various files; let's assume that I have a > 4 GB file with data. I want to read that file line by line and process each line. One of my restrictions is that the software has to run on 32-bit MS Windows, or on 64-bit with a small amount of RAM (minimum 4 GB). You can also assume that processing these lines isn't the bottleneck.
In my current solution I read the file with ifstream and copy each line into a string. Here is a snippet of what it looks like.
std::ifstream file(filename_xml.c_str());
uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
m_numLines++;
}
And OK, that works, but it's too slow. Here is the time for my 3.6 GB of data:
real 1m4.155s
user 0m0.000s
sys 0m0.030s
I'm looking for a method that will be much faster than that. For example, I found How to parse space-separated floats in C++ quickly? and I loved the presented solution with boost::mapped_file, but I ran into another problem: what if my file is too big? In my case a 1 GB file was enough to bring the whole process down. I have to care about how much data is currently in memory; the people who will be using this tool probably don't have more than 4 GB of RAM installed.
So I found boost's mapped_file, but how do I use it in my case? Is it possible to read the file partially and still receive these lines?
Maybe you have another, much better solution. I just have to process each line.
Thanks,
Bart
Nice to see you found my benchmark at How to parse space-separated floats in C++ quickly?
It seems you're really looking for the fastest way to count lines (or perform any linear single-pass analysis). I've done a similar analysis and benchmark of exactly that here:
Fast textfile reading in c++
Interestingly, you'll see that the most performant code there does not need to rely on memory mapping at all.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Assumed helper: print the failing call and abort (not part of the original snippet).
#define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}
A 64-bit system with a small amount of memory should be fine for loading a large file into - it's all about address space - although it may well be slower than the "fastest" option in that case; it really depends on what else is in memory and how much of the address space is available for mapping the file. On a 32-bit system it won't work, since the pointers into the file mapping can't go beyond about 3.5 GB at the very most - and typically around 2 GB is the limit - again depending on what addresses are available to the OS to map the file into.
However, the benefit of memory mapping a file is pretty small - the vast majority of the time is spent actually reading the data. The saving from memory mapping comes from not having to copy the data once it's loaded into RAM. (With other file-reading mechanisms, the read function copies the data into the buffer supplied, whereas memory mapping a file puts it straight into the correct location directly.)
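To make that contrast concrete, here is a hedged sketch of a memory-mapped line counter using POSIX mmap() (Windows would use CreateFileMapping / MapViewOfFile instead); it is illustrative only and ignores the asker's 32-bit constraint:

#include <cstdint>
#include <cstring>      // memchr
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

static uintmax_t wc_mmap(const char* fname)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        return 0;                                   // error handling elided for brevity

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return 0;
    }

    // The whole file becomes directly addressable; no copy into a user buffer.
    void* data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                      // the mapping stays valid after close
    if (data == MAP_FAILED)
        return 0;

    uintmax_t lines = 0;
    const char* p = static_cast<const char*>(data);
    const char* end = p + st.st_size;
    while ((p = static_cast<const char*>(memchr(p, '\n', end - p))))
    {
        ++lines;
        ++p;
    }

    munmap(data, st.st_size);
    return lines;
}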
You might want to look at increasing the buffer for the ifstream - the default buffer is often rather small, which leads to lots of expensive read calls.
You should be able to do this using something like:
std::ifstream file(filename_xml.c_str());

// Install a 1 MB stream buffer; this must happen before any reads from the stream.
char buffer[1024*1024];
file.rdbuf()->pubsetbuf(buffer, 1024*1024);

uintmax_t m_numLines = 0;
std::string str;
while (std::getline(file, str))
{
    m_numLines++;
}
See this question for more info:
How to get IOStream to perform better?
Since this is Windows, you can use the native Windows file functions with the "Ex" suffix:
Windows file management functions
specifically functions like GetFileSizeEx() and SetFilePointerEx(). The plain ReadFile and WriteFile functions are limited to 32-bit byte counts per call, and ReadFileEx / WriteFileEx are for asynchronous I/O as opposed to handling large files.
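A hedged sketch of how those calls fit together for a file larger than 4 GB (the file name, offset, and buffer size are arbitrary illustrations, not part of the original answer):

#include <windows.h>

int main()
{
    HANDLE h = CreateFileA("huge.xml", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(h, &size);                        // size.QuadPart holds the 64-bit file size

    LARGE_INTEGER pos;
    pos.QuadPart = 5LL * 1024 * 1024 * 1024;        // seek past the 4 GB mark
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);     // 64-bit seek

    char buf[64 * 1024];
    DWORD got = 0;                                  // each ReadFile call is limited to a DWORD byte count
    ReadFile(h, buf, sizeof buf, &got, NULL);

    CloseHandle(h);
    return 0;
}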
This question came to mind when I was trying to solve this problem.
I have a hard drive with a capacity of 120 GB, of which 100 GB is occupied by a single huge file, so 20 GB is still free.
My question is, how can we split this huge file into smaller ones, say 1 GB each? I can see that if I had ~100 GB of free space, it would probably be possible with a simple algorithm. But given only 20 GB of free space, we can write at most 20 one-GB files. I have no idea how to delete content from the bigger file while reading from it.
Any solution?
It seems I have to truncate the file by 1 GB each time I finish writing one of the smaller files, but that boils down to this question:
Is it possible to truncate a part of a file? How exactly?
I would like to see an algorithm (or an outline of an algorithm) that works in C or C++ (preferably standard C and C++), so I can understand the lower-level details. I'm not looking for a magic function, script, or command that can do this job.
According to this question (Partially truncating a stream) you should be able to use, on a system that is POSIX compliant, a call to int ftruncate(int fildes, off_t length) to resize an existing file.
Modern implementations will probably resize the file "in place" (though this is unspecified in the documentation). The only gotcha is that you may have to do some extra work to ensure that off_t is a 64 bit type (provisions exist within the POSIX standard for 32 bit off_t types).
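On 32-bit glibc systems, for instance, that usually means requesting the 64-bit file interfaces before any header is included (a general glibc/POSIX mechanism, mentioned here only as a hint):

#define _FILE_OFFSET_BITS 64    // make off_t 64 bits wide on 32-bit glibc; must precede all #includes
#include <sys/types.h>
#include <unistd.h>             // ftruncate(int fd, off_t length)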
You should take steps to handle error conditions, just in case it fails for some reason, since obviously, any serious failure could result in the loss of your 100GB file.
Pseudocode (assume, and take steps to ensure, all data types are large enough to avoid overflows):
open (string filename)              // opens a file, returns a file descriptor
file_size (descriptor file)         // returns the absolute size of the specified file
seek (descriptor file, position p)  // moves the caret to the specified absolute point
copy_to_new_file (descriptor file, string newname)
    // creates the file specified by newname, copies data from the specified file
    // descriptor into the new file until EOF is reached

set descriptor = open ("MyHugeFile")
set gigabyte = 2^30                 // 1024 * 1024 * 1024 bytes
set filesize = file_size(descriptor)
set blocks = (filesize + gigabyte - 1) / gigabyte

loop (i = blocks; i > 0; --i)
    set truncpos = gigabyte * (i - 1)
    seek (descriptor, truncpos)
    copy_to_new_file (descriptor, "MyHugeFile" + i)
    ftruncate (descriptor, truncpos)
Obviously some of this pseudocode is analogous to functions found in the standard library. In other cases, you will have to write your own.
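For a concrete (hedged) rendering, here is a C/POSIX sketch of the same loop. It assumes a 64-bit off_t, reduces error handling to early returns, and uses illustrative file names; the helper copy_tail_to_new_file plays the role of the copy_to_new_file pseudocode above.

#define _FILE_OFFSET_BITS 64        // 64-bit off_t on 32-bit glibc systems
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

static int copy_tail_to_new_file(int fd, off_t from, off_t filesize, int part)
{
    char name[64];
    std::snprintf(name, sizeof name, "MyHugeFile.%03d", part);
    int out = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out == -1)
        return -1;

    static char buf[1 << 20];                         // 1 MB copy buffer
    off_t pos = from;
    while (pos < filesize)
    {
        ssize_t got = pread(fd, buf, sizeof buf, pos);
        if (got <= 0) { close(out); return -1; }
        if (write(out, buf, got) != got) { close(out); return -1; }
        pos += got;
    }
    return close(out);
}

int main()
{
    const off_t gigabyte = off_t(1) << 30;
    int fd = open("MyHugeFile", O_RDWR);
    if (fd == -1)
        return EXIT_FAILURE;

    off_t filesize = lseek(fd, 0, SEEK_END);          // absolute size of the file
    off_t blocks = (filesize + gigabyte - 1) / gigabyte;

    for (off_t i = blocks; i > 0; --i)
    {
        off_t truncpos = gigabyte * (i - 1);
        if (copy_tail_to_new_file(fd, truncpos, filesize, (int)i) == -1)
            return EXIT_FAILURE;                      // stop before losing any data
        if (ftruncate(fd, truncpos) == -1)            // give the copied tail back to the filesystem
            return EXIT_FAILURE;
        filesize = truncpos;
    }
    close(fd);
    return EXIT_SUCCESS;
}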
There is no standard function for this job.
For Linux you can use the ftruncate method, while for Windows you can use _chsize or SetEndOfFile. A simple #ifdef will make it cross-platform.
Also read this Q&A.
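A minimal sketch of such an #ifdef wrapper (the Windows branch uses _chsize_s from <io.h>; treat this as illustrative rather than a tested, production implementation):

#include <cstdint>

#ifdef _WIN32
  #include <io.h>        // _chsize_s
#else
  #include <unistd.h>    // ftruncate
#endif

// Truncate the file behind an open descriptor to 'length' bytes.
// Returns 0 on success, non-zero on failure.
static int truncate_fd(int fd, std::int64_t length)
{
#ifdef _WIN32
    return _chsize_s(fd, length);                    // 64-bit capable on Windows
#else
    return ftruncate(fd, static_cast<off_t>(length));
#endif
}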
I have written a C++ class for Windows and Linux that creates a memory-mapped view of a file of arbitrary size n. The code for the class constructor can be seen here. I am currently testing the code on 32-bit Windows XP.
I have found that for file sizes 0 < n <= 1.7 GB, the constructor returns a valid pointer to a memory-mapped view. However, for a file size >= 2 GB, MapViewOfFile returns NULL with error code 8, "Not enough storage is available to process this command". Evidently, Windows cannot find a contiguous 2 GB region of address space in the process.
Therefore, I may need to modify the class constructor to create a set of smaller memory-mapped views totaling >= 2 GB and < 2^32 - 1 bytes. The other requirement is to create a mapping between each of the smaller memory-mapped views and a randomly accessed address in the process's address space.
Previously, I used the following code for random access:
char* KeyArray;

try
{
    mmapFile = new cMemoryMappedFile(n);
}
catch (cException e)
{
    throw;
}

KeyArray = (char *)(mmapFile->GetPointer());
KeyArray[i] = ...
How should I modify the class to handle these requirements?
What you want can be achieved using repaging; see how it is done in boost.iostreams here.
You only have 2GB (or 3GB, with some tweaks) of user process space on a 32-bit OS. Period. That is a hard limitation, and no amount of creating many smaller mappings can get around that. You will need to shift your mapping around in order to access the different parts of the file. But it will still be faster than seeking, reading, and writing.
I can't see your pastebin link, but I can suggest a simple solution with a C++ class declaration. I think the implementation should be obvious from the comments:
#include <cstdint>

class ShiftingMemMap
{
public:
    // Constructs a dynamically shifting memory map of a file...
    ShiftingMemMap ( const char* fileName, size_t view_size = 4096 );

    // Retrieve/set a byte at the given file offset. If the offset is not currently in-view,
    // shift the view to encompass the offset. The reference should not be stored for later
    // access because the view may need to shift again...
    unsigned char& operator [] ( std::uint64_t offset );

private:
    std::uint64_t current_offset;
    size_t current_size;
};
All that being said, you could write a class that returns multiple views of a file to allow saving a reference for later and also editing different parts of the file simultaneously without having to shift the view back and forth repeatedly.
#include <memory>

class MemMapView;   // forward declaration

class MemMap
{
public:
    MemMap ( const char* filename );
    std::shared_ptr<MemMapView> View ( std::uint64_t offset, size_t size = 4096 );
};

class MemMapView
{
public:
    char& operator[] ( size_t offset );
};
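For the Windows side specifically, "shifting the view" boils down to remapping a window with MapViewOfFile at an offset aligned to the system allocation granularity. Below is a hedged sketch of that core step (read-only access, error handling reduced to a comment; it is not an implementation of the classes declared above):

#include <windows.h>

class ShiftingView
{
public:
    // 'view_size' should be at least the allocation granularity (typically 64 KB).
    ShiftingView(HANDLE mapping, unsigned __int64 file_size, SIZE_T view_size)
        : mapping_(mapping), file_size_(file_size), view_size_(view_size),
          view_base_(0), mapped_(0), view_(NULL)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        granularity_ = si.dwAllocationGranularity;      // view offsets must be multiples of this
    }

    ~ShiftingView() { if (view_) UnmapViewOfFile(view_); }

    // Byte access at an arbitrary 64-bit file offset; remaps the window if needed.
    char& operator[](unsigned __int64 offset)
    {
        if (!view_ || offset < view_base_ || offset >= view_base_ + mapped_)
            shift_to(offset);
        return static_cast<char*>(view_)[offset - view_base_];
    }

private:
    void shift_to(unsigned __int64 offset)
    {
        if (view_)
            UnmapViewOfFile(view_);
        view_base_ = (offset / granularity_) * granularity_;    // align down to the granularity
        unsigned __int64 remaining = file_size_ - view_base_;
        mapped_ = remaining < view_size_ ? (SIZE_T)remaining : view_size_;
        view_ = MapViewOfFile(mapping_, FILE_MAP_READ,
                              (DWORD)(view_base_ >> 32),
                              (DWORD)(view_base_ & 0xFFFFFFFF),
                              mapped_);
        // Real code must check for NULL here and report GetLastError().
    }

    HANDLE mapping_;
    unsigned __int64 file_size_;
    SIZE_T view_size_;
    unsigned __int64 view_base_;
    SIZE_T mapped_;
    DWORD granularity_;
    void* view_;
};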
This is not going to work. You simply cannot use all of the 4 GiB of address space on 32 bit Windows. Redesign your access to the array to map just small views of the large file.
Yes, it is possible to write a C++ class for Windows, Solaris, and Linux that creates a memory-mapped view of a file of arbitrary size n. We have just finished building two versions of such a class, one using the STL and the other, written by my boss, which does not use the STL. Both versions perform better than using the heap when the size of the memory allocation is 1 gigabyte or more on 32-bit Windows, Linux, and Solaris.
Both of these versions are also compatible with 64-bit Windows, Linux, and Solaris.
If you want more details about this, please email me at frankchang91#gmail.com (Massachusetts, USA). These two versions were independently designed and may become part of a data-fusion US patent filing.
OK, so I'm reading a binary file into a char array I've allocated with malloc.
(By the way, the code here isn't the actual code; I just wrote it on the spot to demonstrate, so any mistakes here are probably not mistakes in the actual program.) This method reads at about 50 million bytes per second.
main
char *buffer = (char*)malloc(file_length_in_bytes*sizeof(char));
memset(buffer,0,file_length_in_bytes*sizeof(char));
//start time here
read_whole_buffer(buffer);
//end time here
free(buffer);
read_whole_buffer
void read_whole_buffer(char* buffer)
{
    // file already opened
    fseek(_file_pointer, 0, SEEK_SET);
    int a = sizeof(buffer[0]);
    fread(buffer, a, file_length_in_bytes*a, _file_pointer);
}
I've written something similar in managed C++ that uses FileStream, I believe, and its ReadByte() function to read the entire file byte by byte, and it also reads at around 50 million bytes per second.
Also, I have a SATA and an IDE drive in my computer, and I've loaded the file off both; it doesn't make any difference at all (which is weird, because I was under the assumption that SATA reads much faster than IDE).
Question
Maybe you can all understand why this doesn't make any sense to me. As far as I knew, it should be much faster to fread a whole file into an array than to read it byte by byte. On top of that, through testing I've discovered that managed C++ is slower (though only noticeably so if you are benchmarking your code and you require speed).
SO
Why in the world am I reading at the same speed with both applications? Also, is 50 million bytes per second from a file into an array quick?
Maybe my motherboard is bottlenecking me? That just doesn't seem to make much sense either.
Is there maybe a faster way to read a file into an array?
thanks.
My 'script timer'
Records start and end time with millisecond resolution...Most importantly it's not a timer
#pragma once
#ifndef __Script_Timer__
#define __Script_Timer__

#include <sys/timeb.h>

extern "C"
{
    struct Script_Timer
    {
        unsigned long milliseconds;
        unsigned long seconds;
        struct timeb start_t;
        struct timeb end_t;
    };

    void End_ST(Script_Timer *This)
    {
        ftime(&This->end_t);
        This->seconds = This->end_t.time - This->start_t.time;
        This->milliseconds = (This->seconds * 1000) + (This->end_t.millitm - This->start_t.millitm);
    }

    void Start_ST(Script_Timer *This)
    {
        ftime(&This->start_t);
    }
}
#endif
Read buffer thing
char face = 0;
char comp = 0;
char nutz = 0;

for(int i = 0; i < (_length * sizeof(char)); ++i)
{
    face = buffer[i];
    if(face == comp)
        nutz = (face + comp) / i;
    comp++;
}
Transfers from or to main memory run at speeds of gigabytes per second. Inside the CPU data flows even faster. It is not surprising that, whatever you do at the software side, the hard drive itself remains the bottleneck.
Here are some numbers from my system, using PerformanceTest 7.0:
hard disk: Samsung HD103SI 5400 rpm: sequential read/write at 80 MB/s
memory: 3 * 2 GB at 400 MHz DDR3: read/write around 2.2 GB/s
So if your system is a bit older than mine, a hard drive speed of 50 MB/s is not surprising. The connection to the drive (IDE/SATA) is not all that relevant; it's mainly about the number of bits passing the drive heads per second, purely a hardware thing.
Another thing to keep in mind is your OS's filesystem cache. It could be that the second time round, the hard drive isn't accessed at all.
The 180 MB/s memory read speed that you mention in your comment does seem a bit on the low side, but that may well depend on the exact code. Your CPU's caches come into play here. Maybe you could post the code you used to measure this?
The FILE* API uses buffered streams, so even if you read byte by byte, the API internally reads buffer by buffer. So your comparison will not make a big difference.
The low level IO API (open, read, write, close) is unbuffered, so using this one will make a difference.
It may also be faster for you, if you do not need the automatic buffering of the FILE* API!
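For comparison, a hedged sketch of the unbuffered route using the POSIX-style calls named above (on Windows the CRT exposes them as _open/_read/_close, or you can use the native CreateFile/ReadFile); the function name and chunking are illustrative:

#include <fcntl.h>      // open
#include <unistd.h>     // read, close

// Read the whole file in large chunks through the unbuffered syscall interface.
// Returns the number of bytes read, or -1 on error.
static long long read_whole_file_raw(const char* path, char* buffer, long long capacity)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    long long total = 0;
    while (total < capacity)
    {
        ssize_t got = read(fd, buffer + total, capacity - total);
        if (got < 0) { close(fd); return -1; }   // read error
        if (got == 0) break;                     // end of file
        total += got;
    }
    close(fd);
    return total;
}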
I've done some tests on this, and after a certain point the benefit of a larger buffer tails off as the buffer gets bigger. There is usually an optimum buffer size you can find with a bit of trial and error.
Note also that fread() (or more specifically the C or C++ I/O library) will probably be doing its own buffering. If your system supports it, a plain read() may (or may not) be a bit faster.