Fastest way to create large file in c++?

Fastest way to create large file in c++? - c++

Create a flat text file in c++ around 50 - 100 MB
with the content 'Added first line' should be inserted in to the file for 4 million times

using old style file io
fopen the file for write.
fseek to the desired file size - 1.
fwrite a single byte
fclose the file

The fastest way to create a file of a certain size is to simply create a zero-length file using creat() or open() and then change the size using chsize(). This will simply allocate blocks on the disk for the file, the contents will be whatever happened to be in those blocks. It's very fast since no buffer writing needs to take place.

Not sure I understand the question. Do you want to ensure that every character in the file is a printable ASCII character? If so, what about this? Fills the file with "abcdefghabc...."
#include <stdio.h>
int main ()
{
const int FILE_SiZE = 50000; //size in KB
const int BUFFER_SIZE = 1024;
char buffer [BUFFER_SIZE + 1];
int i;
for(i = 0; i < BUFFER_SIZE; i++)
buffer[i] = (char)(i%8 + 'a');
buffer[BUFFER_SIZE] = '\0';
FILE *pFile = fopen ("somefile.txt", "w");
for (i = 0; i < FILE_SIZE; i++)
fprintf(pFile, buffer);
fclose(pFile);
return 0;
}

You haven't mentioned the OS but I'll assume creat/open/close/write are available.
For truly efficient writing and assuming, say, a 4k page and disk block size and a repeated string:
open the file.
allocate 4k * number of chars in your repeated string, ideally aligned to a page boundary.
print repeated string into the memory 4k times, filling the blocks precisely.
Use write() to write out the blocks to disk as many times as necessary. You may wish to write a partial piece for the last block to get the size to come out right.
close the file.
This bypasses the buffering of fopen() and friends, which is good and bad: their buffering means that they're nice and fast, but they are still not going to be as efficient as this, which has no overhead of working with the buffer.
This can easily be written in C++ or C, but does assume that you're going to use POSIX calls rather than iostream or stdio for efficiency's sake, so it's outside the core library specification.

I faced the same problem, creating a ~500MB file on Windows very fast.
The larger buffer you pass to fwrite() the fastest you'll be.
int i;
FILE *fp;
fp = fopen(fname,"wb");
if (fp != NULL) {
// create big block's data
uint8_t b[278528]; // some big chunk size
for( i = 0; i < sizeof(b); i++ ) // custom initialization if != 0x00
{
b[i] = 0xFF;
}
// write all blocks to file
for( i = 0; i < TOT_BLOCKS; i++ )
fwrite(&b, sizeof(b), 1, fp);
fclose (fp);
}
Now at least on my Win7, MinGW, creates file almost instantly.
Compared to fwrite() 1 byte at time, that will complete in 10 Secs.
Passing 4k buffer will complete in 2 Secs.

Fastest way to create large file in c++?
Ok. I assume fastest way means the one that takes the smallest run time.
Create a flat text file in c++ around 50 - 100 MB with the content 'Added first line' should be inserted in to the file for 4 million times.
preallocate the file using old style file io
fopen the file for write.
fseek to the desired file size - 1.
fwrite a single byte
fclose the file
create a string containing the "Added first line\n" a thousand times.
find it's length.
preallocate the file using old style file io
fopen the file for write.
fseek to the the string length * 4000
fwrite a single byte
fclose the file
open the file for read/write
loop 4000 times,
writing the string to the file.
close the file.
That's my best guess.
I'm sure there are a lot of ways to do it.

Related

How to optimize c++ binary file reading?

I have a complex interpreter reading in commands from (sometimes) multiples files (the exact details are out of scope) but it requires iterating over these multiple files (some could be GB is size, preventing nice buffering) multiple times.
I am looking to increase the speed of reading in each command from a file.
I have used the RDTSC (program counter) register to micro benchmark the code enough to know about >80% of the time is spent reading in from the files.
Here is the thing: the program that generates the input file is literally faster than to read in the file in my small interpreter. i.e. instead of outputting the file i could (in theory) just link the generator of the data to the interpreter and skip the file but that shouldn't be faster, right?
What am I doing wrong? Or is writing suppose to be 2x to 3x (at least) faster than reading from a file?
I have considered mmap but some of the results on http://lemire.me/blog/archives/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/ appear to indicate it is no faster than ifstream. or would mmap help in this case?
details:
I have (so far) tried adding a buffer, tweaking parameters, removing the ifstream buffer (that slowed it down by 6x in my test case), i am currently at a loss for ideas after searching around.
The important section of the code is below. It does the following:
if data is left in buffer, copy form buffer to memblock (where it is then used)
if data is not left in the buffer, check to see how much data is left in the file, if more than the size of the buffer, copy a buffer sized chunk
if less than the file
//if data in buffer
if(leftInBuffer[activefile] > 0)
{
//cout <<bufferloc[activefile] <<"\n";
memcpy(memblock,(buffer[activefile])+bufferloc[activefile],16);
bufferloc[activefile]+=16;
leftInBuffer[activefile]-=16;
}
else //buffers blank
{
//read in block
long blockleft = (cfilemax -cfileplace) / 16 ;
int read=0;
/* slow block starts here */
if(blockleft >= MAXBUFELEMENTS)
{
currentFile->read((char *)(&(buffer[activefile][0])),16*MAXBUFELEMENTS);
leftInBuffer[activefile] = 16*MAXBUFELEMENTS;
bufferloc[activefile]=0;
read =16*MAXBUFELEMENTS;
}
else //read in part of the block
{
currentFile->read((char *)(&(buffer[activefile][0])),16*(blockleft));
leftInBuffer[activefile] = 16*blockleft;
bufferloc[activefile]=0;
read =16*blockleft;
}
/* slow block ends here */
memcpy(memblock,(buffer[activefile])+bufferloc[activefile],16);
bufferloc[activefile]+=16;
leftInBuffer[activefile]-=16;
}
edit: this is on a mac, osx 10.9.5, with an i7 with a SSD
Solution:
as was suggested below, mmap was able to increase the speed by about 10x.
(for anyone else who searches for this)
specifically open with:
uint8_t * openMMap(string name, long & size)
{
int m_fd;
struct stat statbuf;
uint8_t * m_ptr_begin;
if ((m_fd = open(name.c_str(), O_RDONLY)) < 0)
{
perror("can't open file for reading");
}
if (fstat(m_fd, &statbuf) < 0)
{
perror("fstat in openMMap failed");
}
if ((m_ptr_begin = (uint8_t *)mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, m_fd, 0)) == MAP_FAILED)
{
perror("mmap in openMMap failed");
}
uint8_t * m_ptr = m_ptr_begin;
size = statbuf.st_size;
return m_ptr;
}
read by:
uint8_t * mmfile = openMMap("my_file", length);
uint32_t * memblockmm;
memblockmm = (uint32_t *)mmfile; //cast file to uint32 array
uint32_t data = memblockmm[0]; //take int
mmfile +=4; //increment by 4 as I read a 32 bit entry and each entry in mmfile is 8 bits.

This should be a comment, but I don't have 50 reputation to make a comment.
What is the value of MAXBUFELEMENTS? From my experience, many smaller reads is far slower than one read of larger size. I suggest to read the entire file in if possible, some files could be GBs, but even reading in 100MB at once would perform better than reading 1 MB 100 times.
If that's still not good enough, next thing you can try is to compress(zlib) input files(may have to break them into chunks due to size), and decompress them in memory. This method is usually faster than reading in uncompressed files.

As #Tony Jiang said, try experimenting with the buffer size to see if that helps.
Try mmap to see if that helps.
I assume that currentFile is a std::ifstream? There's going to be some overhead for using iostreams (for example, an istream will do its own buffering, adding an extra layer to what you're doing); although I wouldn't expect the overhead to be huge, you can test by using open(2) and read(2) directly.
You should be able to run your code through dtruss -e to verify how long the read system calls take. If those take the bulk of your time, then you're hitting OS and hardware limits, so you can address that by piping, mmap'ing, or adjusting your buffer size. If those take less time than you expect, then look for problems in your application logic (unnecessary work on each iteration, etc.).

C++ determining size of data before writing it to a file

I am serializing some data to a file like this:
vector<ByteFeature>::iterator it = nByteFeatures.Content().begin();
for (;it != nByteFeatures.Content().end(); ++it)
{
for ( int i = 0; i < 52; i++)
{
fwrite( &it->Features[i], sizeof(unsigned char), 1, outfile);
}
}
But I would like to know in advance how much bytes that will be in the file.
I would like to write this number in front of the actual data.
Because in some situations I will have to skip loading this data, and I need to know how many bytes I have to skip.
There is more data written to the disk, and it would be crucial to me that I can write the number of bytes directly before the actual data. I do not want to store this number in a separate file or so.
.Content.size() would only tell me how many "items" are in there, but not the actual size of the data.
Thank you.

I've had to do this before myself.
The approach I took was to write a 4-byte placeholder, then the data, then fseek() back to the placeholder to write the length.
Write a 4-byte placeholder to the file.
ftell() to get the current file position.
Write the data to the file.
ftell() to get the new position.
Compute the length: the difference between the two ftell() values.
Then fseek() back to the placeholder and write the length.

You are writing 52 unsigned chars to a file for every ByteFeature. So the total number of bytes you are writing is 52 * nByteFeatures.Contents().size(), assuming that one char equals one byte.

very fast text file processing (C++)

i wrote an application which processes data on the GPU. Code works well, but i have the problem that the reading part of the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the processing line by line is slow).
I read a line with getline() and copy line 1 to a vector, line2 to a vector and skip lines 3 and 4. And so on for the rest of the 11 mio lines.
I tried several approaches to get the file at the best time possible:
Fastest method I found is using boost::iostreams::stream
Others were:
Read the file as gzip, to minimize IO, but is slower than directly
reading it.
copy file to ram by read(filepointer, chararray, length)
and process it with a loop to distinguish the lines (also slower than boost)
Any suggestions how to make it run faster?
void readfastq(char *filename, int SRlength, uint32_t blocksize){
_filelength = 0; //total datasets (each 4 lines)
_SRlength = SRlength; //length of the 2. line
_blocksize = blocksize;
boost::iostreams::stream<boost::iostreams::file_source>ins(filename);
in = ins;
readNextBlock();
}
void readNextBlock() {
timeval start, end;
gettimeofday(&start, 0);
string name;
string seqtemp;
string garbage;
string phredtemp;
_seqs.empty();
_phred.empty();
_names.empty();
_filelength = 0;
//read only a part of the file i.e the first 4mio lines
while (std::getline(in, name) && _filelength<_blocksize) {
std::getline(in, seqtemp);
std::getline(in, garbage);
std::getline(in, phredtemp);
if (seqtemp.size() != _SRlength) {
if (seqtemp.size() != 0)
printf("Error on read in fastq: size is invalid\n");
} else {
_names.push_back(name);
for (int k = 0; k < _SRlength; k++) {
//handle special letters
if(seqtemp[k]== 'A') ...
else{
_seqs.push_back(5);
}
}
_filelength++;
}
}
EDIT:
The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2
I changed the function readfastq to read the file, because of some pointer problems. So if you call readfastq the blocksize (in lines) must be bigger than the number of lines to read.
SOLUTION:
I found a solution, which get the time for read in the file from 60sec to 16sec. I removed the inner-loop which handeles the special characters and do this in GPU. This decreases the read-in time and only minimal increases the GPU running time.
Thanks for your suggestions.
void readfastq(char *filename, int SRlength) {
_filelength = 0;
_SRlength = SRlength;
size_t bytes_read, bytes_expected;
FILE *fp;
fp = fopen(filename, "r");
fseek(fp, 0L, SEEK_END); //go to the end of file
bytes_expected = ftell(fp); //get filesize
fseek(fp, 0L, SEEK_SET); //go to the begining of the file
fclose(fp);
if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for file
err(EX_OSERR, "data malloc");
string name;
string seqtemp;
string garbage;
string phredtemp;
boost::iostreams::stream<boost::iostreams::file_source>file(filename);
while (std::getline(file, name)) {
std::getline(file, seqtemp);
std::getline(file, garbage);
std::getline(file, phredtemp);
if (seqtemp.size() != SRlength) {
if (seqtemp.size() != 0)
printf("Error on read in fastq: size is invalid\n");
} else {
_names.push_back(name);
strncpy( &(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do on GPU
_filelength++;
}
}
}

First instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit to fit 3GB of virtual address space (for 32-bit application only 2GB is accessible in the user mode). Or alternatively you may map & process your file by parts.
Next, it sounds to me that your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hits the performance very seriously). If this is the case - either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free memory.
Post your code (or a part of it) for more suggestions.
EDIT:
It seems that all your bottlenecks are related to string management.
std::getline(in, seqtemp); reading into an std::string deals with the dynamic memory allocation.
_names.push_back(name); This is even worse. First the std::string is placed into the vector by value. Means - the string is copied, hence another dynamic allocation/freeing happens. Moreover, when eventually the vector is internally reallocated - all the contained strings are copied again, with all the consequences.
I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).
Like in this code:
class MemoryMappedFileParser
{
const char* m_sz;
size_t m_Len;
public:
struct String {
const char* m_sz;
size_t m_Len;
};
bool getline(String& out)
{
out.m_sz = m_sz;
const char* sz = (char*) memchr(m_sz, '\n', m_Len);
if (sz)
{
size_t len = sz - m_sz;
m_sz = sz + 1;
m_Len -= (len + 1);
out.m_Len = len;
// for Windows-format text files remove the '\r' as well
if (len && '\r' == out.m_sz[len-1])
out.m_Len--;
} else
{
out.m_Len = m_Len;
if (!m_Len)
return false;
m_Len = 0;
}
return true;
}
};

if _seqs and _names are std::vectors and you can guess the final size of them before processing the whole 3GB of data, you can use reserve to avoid most of the memory re-allocation during pushing back the new elements in the loop.
You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.

You are apparently using <stdio.h> since using getline.
Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m tells (it is a GNU extension) to use mmap for reading.
Perhaps setting a big buffer (i.e. half a megabyte) with setbuffer could also help.
Probably, using the readahead system call (in a separate thread perhaps) could help.
But all this are guesses. You should really measure things.

General suggestions:
Code the simplest, most straight-forward, clean approach,
Measure,
Measure,
Measure,
Then if all else fails:
Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so kernel's read-ahead plays to your advantage.
Re-use the same buffer to minimize cache flushing.
Avoid copying data, parse in place, pass around pointers (and sizes).
mmap(2)-ing [parts of the] file is another approach. This also avoids kernel-userland copy.

Depending on your disk speed, using a very fast de compression algorithm might help, like fastlz (there are at least two other that might be more efficient, but under GPL, so licence can be a problem).
Also, using C++ data structures and functions car increase the speed as you can maybe achieve a better compiler-time optimization. Going the C way isn't always the fastes! In some bad conditions, using char* you need to parse the whole string to reach the \0 yielding desastrous performances.
For parsing your data, using boost::spirit::qi is also probably the most optimized approach http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html

Storing an image file into a buffer (gif,jpeg etc).

I'm trying to load an image file into a buffer in order to send it through a scket. The problem that I'm having is that the program creates a buffer with a valid size but it does not copy the whole file into the buffer. My code is as follow
//imgload.cpp
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
using namespace std;
int main(int argc,char *argv){
FILE *f = NULL;
char filename[80];
char *buffer = NULL;
long file_bytes = 0;
char c = '\0';
int i = 0;
printf("-Enter a file to open:");
gets(filename);
f = fopen(filename,"rb");
if (f == NULL){
printf("\nError opening file.\n");
}else{
fseek(f,0,SEEK_END);
file_bytes = ftell(f);
fseek(f,0,SEEK_SET);
buffer = new char[file_bytes+10];
}
if (buffer != NULL){
printf("-%d + 10 bytes allocated\n",file_bytes);
}else{
printf("-Could not allocate memory\n");
// Call exit?.
}
while (c != EOF){
c = fgetc(f);
buffer[i] = c;
i++;
}
c = '\0';
buffer[i-1] = '\0'; // helps remove randome characters in buffer when copying is finished..
i = 0;
printf("buffer size is now: %d\n",strlen(buffer));
//release buffer to os and cleanup....
return 0;
}
> output
c:\Users\Desktop>imgload
-Enter a file to open:img.gif
-3491 + 10 bytes allocated
buffer size is now: 9
c:\Users\Desktop>imgload
-Enter a file to open:img2.gif
-1261 + 10 bytes allocated
buffer size is now: 7
From the output I can see that it's allocating the correct size for each image 3491 and 1261 bytes (i doubled checked the file sizes through windows and the sizes being allocated are correct) but the buffer sizes after supposedly copying is 9 and 7 bytes long. Why is it not copying the entire data?.

You are wrong. Image is binary data, nor string data. So there are two errors:
1) You can't check end of file with EOF constant. Because EOF is often defined as 0xFF and it is valid byte in binary file. So use feof() function to check for end of file. Or also you may check current position in file with maximal possible (you got it before with ftell()).
2) As file is binary it may contain \0 in middle. So you can't use string function to work with such data.
Also I see that you use C++ language. Tell me please why you use classical C syntax for file working? I think that using C++ features such as file streams, containers and iterators will simplify your program.
P.S. And I want to say that you program will have problems with really big files. Who knows maybe you will try to work with them. If 'yes', rewrite ftell/fseek functions to their int64 (long long int) equivalents. Also you'll need to fix array counter. Another good idea is to read file by blocks. Reading byte by byte is dramatically slower.

All this is unneeded and actually makes no sense:
c = '\0';
buffer[i-1] = '\0';
i = 0;
printf("buffer size is now: %d\n",strlen(buffer));
Don't use strlen for binary data. strlen stops at the first NUL (\0) byte. A binary file may contain many such bytes, so NUL can't be used.
-3491 + 10 bytes allocated /* There are 3491 bytes in the file. */
buffer size is now: 9 /* The first byte with the value 0. */
In conclusion, drop that part. You already have the size of the file.

You are reading a binary file like a text file. You can't check for EOF as this could be anywhere in the binary file.

Using C++ to find out how many lines are in a text file

My C++ program needs to know how many lines are in a certain text file. I could do it with getline() and a while-loop, but is there a better way?

No.
Not unless your operating system's filesystem keeps track of the number of lines, which your system almost certainly doesn't as it's been a looong time since I've seen that.

By "another way", do you mean a faster way? No matter what, you'll need to read in the entire contents of the file. Reading in different-sized chunks shouldn't matter much since the OS or the underlying file libraries (or both) are buffering the file contents.
getline could be problematic if there are only a few lines in a very large file (high transient memory usage), so you might want to read in fixed-size 4KB chunks and process them one-by-one.

Iterate the file char-by-char with get(), and for each newline (\n) increment line number by one.

The fastest, but OS-dependent way would be to map the whole file to memory (if not possible to map the whole file at once - map it in chunks sequentially) and call std::count(mem_map_begin,mem_map_end,'\n')

Don't know if getline() is the best - buffer size is variable at the worst case (sequence of \n) it could read byte after byte in each iteration.
For me It would be better to read a file in a chunks of predetermined size. And than scan for number of new line encodings ( inside.
Although there's some risk I cannot / don't know how to resolve: other file encodings than ASCII. If getline() will handle than it's easiest but I don't think it's true.
Some url's:
Why does wide file-stream in C++ narrow written data by default?
http://en.wikipedia.org/wiki/Newline

possibly fastest way is to use low level read() and scan buffer for '\n':
int clines(const char* fname)
{
int nfd, nLen;
int count = 0;
char buf[BUFSIZ+1];
if((nfd = open(fname, O_RDONLY)) < 0) {
return -1;
}
while( (nLen = read(nfd, buf, BUFSIZ)) > 0 )
{
char *p = buf;
int n = nLen;
while( n && (p = memchr(p,'\n', n)) ) {
p++;
n = nLen - (p - buf);
count++;
}
}
close(nfd);
return count;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Fastest way to create large file in c++? - c++

Create a flat text file in c++ around 50 - 100 MB with the content 'Added first line' should be inserted in to the file for 4 million times

using old style file io fopen the file for write. fseek to the desired file size - 1. fwrite a single byte fclose the file

Related

How to optimize c++ binary file reading?

C++ determining size of data before writing it to a file

very fast text file processing (C++)

Storing an image file into a buffer (gif,jpeg etc).

Using C++ to find out how many lines are in a text file

Categories

Resources