Using buffers to read from a file of unknown size - C++

I'm trying to read blocks from a file and I have a problem.
char* inputBuffer = new char[blockSize];
while (inputFile.read(inputBuffer, blockSize)) {
    int i = inputFile.gcount();
    //Do stuff
}
Suppose our block size is 1024 bytes and the file is 24.3 KiB. After reading the 23rd block, there will be 0.3 KiB left to read. I also want to read that 0.3 KiB; in fact, I use gcount() later so I can know how much of the buffer read(...) actually modified (in case it is less than a full block).
But when it accesses the 24th block, read(...) returns a value such that the program does not enter the loop, obviously because the size of the remaining unread bytes in the file is less than the buffer size. What should I do?

I think that Konrad Rudolph, whom you mention in a comment on another answer, makes a good point about the problem with reading until eof: if you never reach eof because of some other error, you are in an infinite loop. So take his advice, but modify it to address the problem you have identified. One way of doing it is as follows:
bool okay = true;
while (okay) {
    // The cast is needed in C++11 and later, where the stream's
    // conversion to bool is explicit.
    okay = static_cast<bool>(inputFile.read(inputBuffer, blockSize));
    int i = inputFile.gcount();
    if (i) {
        //Do stuff
    }
}
Edit: Since my answer has been accepted, I am editing it to be as useful as possible. It turns out my bool okay is quite unnecessary (see ferosekhanj's answer). It is better to test the value of inputFile directly, which also has the advantage that you can elegantly avoid entering the loop if the file did not open okay. So I think this is the canonical solution to this problem:
inputFile.open("test.txt", std::ios::binary);
while (inputFile) {
    inputFile.read(inputBuffer, blockSize);
    int i = inputFile.gcount();
    if (i) {
        //Do stuff
    }
}
Now, the last time you //Do stuff, i will be less than blockSize, unless the file happens to be an exact multiple of blockSize bytes long.
Konrad Rudolph's answer here is also good: it has the advantage that .gcount() is only called once, outside the loop, but the disadvantage that it really needs the data processing to be put in a separate function to avoid duplication, as in the sketch below.
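For reference, a minimal sketch of that variant; the helper function process() is just a placeholder for whatever "Do stuff" means in your program:
#include <fstream>

// Stand-in for whatever "//Do stuff" does with the bytes.
void process(const char* data, std::streamsize n)
{
    // ... use the first n bytes of data ...
}

void readAll(std::ifstream& inputFile, char* inputBuffer, std::streamsize blockSize)
{
    // Full blocks: read() succeeded, so exactly blockSize bytes are in the buffer.
    while (inputFile.read(inputBuffer, blockSize))
        process(inputBuffer, blockSize);

    // Final, short block (possibly empty): gcount() is called once, after the loop.
    if (inputFile.gcount() > 0)
        process(inputBuffer, inputFile.gcount());
}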

The solution @Konrad Rudolph mentioned is to check the stream object itself, since that covers both the eof and the error conditions. inputFile.read() returns the stream, which is inputFile itself, so you can write
while (inputFile.read(inputBuffer, blockSize))
But this will not always work; the case where it fails is exactly your case. A proper solution is to write it as below:
char* inputBuffer = new char[blockSize];
while (inputFile)
{
    inputFile.read(inputBuffer, blockSize);
    int count = inputFile.gcount();
    //Access the buffer up to count bytes
    //Do stuff
}
I think this is the solution @Konrad Rudolph meant in his post. From my old C++ experience, I would also do something like the above.

But when it accesses the 24th block, read(...) returns a value such that the program does not enter the loop, obviously because the size of the remaining unread bytes in the file is less than the buffer size.
That's because your loop is wrong. You should be doing:
while (inputFile) {
    std::streamsize numBytes = inputFile.readsome(inputBuffer, blockSize);
    //Do stuff
}
Notice the use of readsome instead of read.

Related

Fastest way to read millions of integers from stdin C++?

I am working on a sorting project and I've come to the point where the main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read in from stdin using cin and std::ios::sync_with_stdio(false); but it turns out that 10 of those seconds are spent reading in the data to sort. We do know how many integers we will be reading in (the count is at the top of the file we need to sort).
How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).
The program is just fed the contents of a file with integers separated by newlines like $ ./program < numstosort.txt
Thanks
Here is the relevant code:
std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while (cin >> num) {
    a[n] = num;
    n++;
}
This will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware. Note that since you apparently aren't allowed to use compiler optimizations, C++ IO is not going to be the fastest way to read data. As others have noted, without optimizations the C++ code will not run anywhere near as fast as it can.
Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system call/C-style IO. That won't need to be optimized, as it will run just about as fast as possible:
#include <sys/stat.h>   // fstat()
#include <unistd.h>     // read(), STDIN_FILENO
#include <cstdlib>      // calloc()

struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );

// use C-style calloc() to get memory that's been
// set to zero as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = (char *)::calloc( 1, sb.st_size + 1 );

size_t totalRead = 0UL;
while ( totalRead < (size_t) sb.st_size )
{
    ssize_t bytesRead = ::read( STDIN_FILENO,
        data + totalRead, sb.st_size - totalRead );
    if ( bytesRead <= 0 )
    {
        break;
    }
    totalRead += bytesRead;
}

// data is now in memory - start processing it
That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit as it's all almost bare-metal system calls.
Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
You'll need to add some error checking, and a more robust version of the code would check to be sure the data returned from fstat() actually is a regular file with an actual size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case the fstat() call won't return a useful file size. You'll need to examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:
::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
    // regular file...
}
(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
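That split is independent of the sorting problem itself, but a minimal single-producer/single-consumer sketch might look like this; the chunk size, queue, and thread names are all illustrative, not part of the answer above:
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <unistd.h>   // ::read, STDIN_FILENO

// Illustrative hand-off between one reader and one processing thread.
std::queue<std::vector<char>> chunks;
std::mutex              mtx;
std::condition_variable cv;
bool                    doneReading = false;

void readerThread(size_t chunkSize)
{
    for (;;)
    {
        std::vector<char> buf(chunkSize);
        ssize_t n = ::read(STDIN_FILENO, buf.data(), buf.size());
        if (n <= 0) break;                    // EOF or error ends the producer
        buf.resize(static_cast<size_t>(n));
        {
            std::lock_guard<std::mutex> lk(mtx);
            chunks.push(std::move(buf));
        }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(mtx); doneReading = true; }
    cv.notify_one();
}

void consumerThread()
{
    for (;;)
    {
        std::vector<char> buf;
        {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [] { return !chunks.empty() || doneReading; });
            if (chunks.empty()) return;       // done and drained
            buf = std::move(chunks.front());
            chunks.pop();
        }
        // parse/process buf here while the reader keeps filling the queue
    }
}

int main()
{
    std::thread producer(readerThread, 1 << 20);  // 1 MiB chunks
    std::thread consumer(consumerThread);
    producer.join();
    consumer.join();
}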
Edit: parsing the data.
Again, with compiler optimizations disabled, the overhead of the C++ operations will probably make them slower than C-style processing. Based on that assumption, something simple will probably run faster.
This would probably work a lot faster in a non-optimized binary, assuming the data is in a C-style string read in as above:
#include <cstdlib>   // strtol()

char *next;
long count = ::strtol( data, &next, 0 );
long *values = new long[ count ];
for ( long ii = 0; ii < count; ii++ )
{
    values[ ii ] = ::strtol( next, &next, 0 );
}
That is also very fragile. It relies on strtol() skipping over leading whitespace, meaning if there's anything other than whitespace between the numeric values it will fail. It also relies on the initial count of values being correct. Again - that code will fail if that's not true. And because it can replace the value of next before checking for errors, if it ever goes off the rails because of bad data it'll be hopelessly lost.
But it should be about as fast as possible without allowing compiler optimizations.
That's what's crazy about not allowing compiler optimizations. You could write simple, robust C++ code to do all your processing, make use of a good optimizing compiler, and probably run almost as fast as the code I posted - which has no error checking and will fail spectacularly in unexpected and undefined ways if fed unexpected data.
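If you do want those checks without giving up the C-style parsing, here is a hedged sketch; parseChecked and its signature are my own for illustration, not part of the code above:
#include <cerrno>
#include <cstdlib>

// Same strtol() loop as above, plus the checks the text says are missing.
// 'data' is assumed to be the NUL-terminated buffer read in earlier.
long parseChecked(const char* data, long* values, long count)
{
    const char* p = data;
    long parsed = 0;
    while (parsed < count)
    {
        char* next = nullptr;
        errno = 0;
        long v = ::strtol(p, &next, 0);
        if (next == p)            // no digits found: trailing junk or short file
            break;
        if (errno == ERANGE)      // value out of range for long
            break;
        values[parsed++] = v;
        p = next;
    }
    return parsed;                // caller can compare against 'count'
}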
You can make it faster if you use a solid-state drive. If you want to ask about code performance, you need to post how you are doing things in the first place.
You may be able to speed up your program by reading the data into a buffer, then converting the text in the buffer to internal representation.
The thought behind this is that all stream devices like to keep streaming. Starting and stopping the stream wastes time. A block read transfers a lot of data with one transaction.
Although cin is buffered, by using cin.read and a buffer, you can make the buffer a lot bigger than cin uses.
If the data has fixed width fields, there are opportunities to speed up the input and conversion processes.
Edit 1: Example
const unsigned int BUFFER_SIZE = 65536;
char text_buffer[BUFFER_SIZE];
//...
cin.read(text_buffer, BUFFER_SIZE);
//...
int value1;
// snscanf is a non-standard, length-bounded variant of sscanf; plain sscanf
// works if the buffer is NUL-terminated.
int arguments_scanned = snscanf(text_buffer, REMAINING_BUFFER_SIZE,
                                "%d", &value1);
The tricky part is handling the cases where the text of a number is cut off at the end of the buffer.
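One way to handle that cut-off case is sketched below; this carry-over scheme is my own illustration, not part of the answer above, and it assumes individual numbers are far shorter than the buffer:
#include <cctype>
#include <cstring>
#include <iostream>

int main()
{
    const std::size_t BUFFER_SIZE = 65536;
    static char buffer[BUFFER_SIZE + 1];

    std::ios::sync_with_stdio(false);
    std::size_t carry = 0;  // bytes kept from the previous chunk
    while (std::cin.read(buffer + carry, BUFFER_SIZE - carry) || std::cin.gcount() > 0)
    {
        std::size_t len = carry + static_cast<std::size_t>(std::cin.gcount());

        // Everything after the last whitespace may be a truncated number.
        std::size_t end = len;
        while (end > 0 && !std::isspace(static_cast<unsigned char>(buffer[end - 1])))
            --end;

        std::size_t tail = len - end;
        char saved = buffer[end];   // this byte is temporarily replaced by '\0'
        buffer[end] = '\0';
        // parse the complete numbers in buffer[0 .. end) here, e.g. with strtol()
        buffer[end] = saved;

        std::memmove(buffer, buffer + end, tail);  // keep the partial tail for next time
        carry = tail;
    }
    // Any bytes left in 'carry' form one final number with no trailing whitespace.
    return 0;
}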
Can you run this little test and compare it to your test, with and without the commented line?
#include <iostream>
#include <cstdlib>

int main()
{
    std::ios::sync_with_stdio(false);
    char buffer[20] = {0};
    int t = 0;
    while (std::cin.get(buffer, 20))
    {
        // t = std::atoi(buffer);
        std::cin.ignore(1);
    }
    return 0;
}
Pure read test:
#include <iostream>
#include <cstdlib>

int main()
{
    std::ios::sync_with_stdio(false);
    char buffer[1024 * 1024];
    while (std::cin.read(buffer, 1024 * 1024))
    {
    }
    return 0;
}

Parse sequences of protobuf messages from contiguous chunks of a fixed-size byte buffer

I've been struggling with this for two days straight, with my poor knowledge of C++. What I need to do is parse sequences of messages using the protobuf C++ API from a big file, a file that may contain millions of such messages. Reading straight from the file is easy, as I can always do ReadVarint32 to get the size and then do ParseFromCodedStream with the limit pushed onto the CodedInputStream, as described in this post. However, the I/O-level API I am working with (libuv, actually) requires a fixed-size buffer to be allocated for every read callback.
This makes my life hard. Basically, every time I read from the file and fill in the fixed-size buffer (say 16K), that buffer will probably contain hundreds of complete protobuf messages, but the last chunk of the buffer will likely be an incomplete message. So I thought, okay, what I should do is attempt to read as many messages as I can, and at the end extract the last chunk and attach it to the beginning of the next 16K buffer I read out, and keep going until I reach the EOF of the file. I use ReadVarint32() to get the size and then compare that number with the remaining buffer size; if the message size is smaller, I keep reading.
There is an API called GetDirectBufferPointer, so I attempted to use it to record the pointer position before I even read out the next message's size. However, I suspect that due to some endianness weirdness, if I just extract the rest of the byte array from where the pointer starts and attach it to the next chunk, Parse won't succeed, and in fact the first several bytes (8, I think) get completely messed up.
Alternatively, if I do codedStream.ReadRaw() to write the residual stream into a buffer and then attach it to the head of the new chunk, the data doesn't get corrupted. But the problem is that this time I lose the "size" byte information, as it has already been "read" by ReadVarint32! And even if I just go ahead and remember the size information I read last time and directly call message.ParseFromCodedStream() in the next iteration, it ends up reading one byte less, and some parts even get corrupted so the object cannot be restored successfully.
std::vector<char> mCheckBuffer;
std::vector<char> mResidueBuffer;
char bResidueBuffer[READ_BUFFER_SIZE];
char temp[READ_BUFFER_SIZE];
google::protobuf::uint32 size;

//"in" is the file input stream
while (in.good()) {
    in.read(mReadBuffer.data(), READ_BUFFER_SIZE);
    mCheckBuffer.clear();
    //merge the last remaining chunk that contains incomplete message with
    //the new data chunk I got out from buffer. Excuse my terrible C++ foo
    std::merge(mResidueBuffer.begin(), mResidueBuffer.end(),
               mReadBuffer.begin(), mReadBuffer.end(), std::back_inserter(mCheckBuffer));
    //Treat the new merged buffer array as the new CIS
    google::protobuf::io::ArrayInputStream ais(&mCheckBuffer[0],
                                               mCheckBuffer.size());
    google::protobuf::io::CodedInputStream cis(&ais);
    //Record the pointer location on CIS in bResidueBuffer
    cis.GetDirectBufferPointer((const void**)&bResidueBuffer,
                               &bResidueBufSize);
    //No size information, probably first time or last iteration
    //coincidentally read a complete message out. Otherwise I simply
    //skip reading size again as I've already populated that from last
    //iteration when I got an incomplete message
    if (size == 0) {
        cis.ReadVarint32(&size);
    }
    //Have to read this again to get remaining buffer size
    cis.GetDirectBufferPointer((const void**)&temp, &mResidueBufSize);
    //Compare the next message size with how much left in the buffer, if
    //message size is smaller, I know I can read at least one more message
    //out, keep reading until I run out of buffer, or, it's the end of message
    //and my buffer just allocated larger so size should be 0
    while (size <= mResidueBufSize && size != 0) {
        //If this cis I constructed didn't have the size info at the beginning,
        //and I just read straight from it hoping to get the message out from
        //the "size" I got from last iteration, it simply doesn't work
        //(read one less byte in fact, and some part of the message corrupted)
        //push the size constraint to the input stream;
        int limit = cis.PushLimit(size);
        //parse message from the input stream
        message.ParseFromCodedStream(&cis);
        cis.PopLimit(limit);
        google::protobuf::TextFormat::PrintToString(message, &str);
        printf("%s", str.c_str());
        //do something with the parsed object
        //Now I have to record the new pointer location again
        cis.GetDirectBufferPointer((const void**)&bResidueBuffer,
                                   &bResidueBufSize);
        //Read another time the next message's size and go back to while loop check
        cis.ReadVarint32(&size);
    }
    //If I do the next line, bResidueBuffer will have the correct CIS information
    //copied over, but not having the "already read" size info
    cis.ReadRaw(bResidueBuffer, bResidueBufSize);
    mResidueBuffer.clear();
    //I am constructing a new vector that receives the residual chunk of the
    //current buffer that isn't enough to restore a message
    //If I don't do ReadRaw, this copy completely messes up at least the first 8
    //bytes of the copied buffer's value, due to I suspect endianness
    mResidueBuffer.insert(mResidueBuffer.end(), &bResidueBuffer[0],
                          &bResidueBuffer[bResidueBufSize]);
}
I'm really out of ideas now. Is it even possible to use protobuf gracefully with APIs that require a fixed-size intermediate buffer? Any input is very much appreciated, thanks!
I see two major problems with your code:
std::merge(mResidueBuffer.begin(), mResidueBuffer.end(),
mReadBuffer.begin(), mReadBuffer.end(), std::back_inserter(mCheckBuffer));
It looks like you are expecting std::merge to concatenate your buffers, but in fact this function performs a merge of two sorted arrays into a single sorted array in the sense of MergeSort. This doesn't make any sense in this context; mCheckBuffer will end up containing nonsense.
cis.GetDirectBufferPointer((const void**)&bResidueBuffer,
&bResidueBufSize);
Here you are casting &bResidueBuffer to an incompatible pointer type. bResidueBuffer is a char array, so &bResidueBuffer is a pointer to a char array, which is not a pointer to a pointer. This is admittedly confusing because arrays can be implicitly converted to pointers (where the pointer points to the first element of the array), but this is actually a conversion -- bResidueBuffer is itself not a pointer, it can just be converted to one.
I think you're also misunderstanding what GetDirectBufferPointer() does. It looks like you want it to copy the rest of the buffer into bResidueBuffer, but the method never copies any data. The method gives you back a pointer that points into the original buffer.
The correct way to call it is something like:
const void* ptr;
int size;
cis.GetDirectBufferPointer(&ptr, &size);
Now ptr will point into the original buffer. You could now compare this against a pointer to the beginning of the buffer to find out where you are in the stream, like:
size_t pos = (const char*)ptr - &mCheckBuffer[0];
But, you shouldn't do that, because CodedInputStream already has the method CurrentPosition() for exactly this purpose. That will return the current byte offset in the buffer. So, use that instead.
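A hedged sketch of that pattern follows; MyMessage and buffer are placeholders for the poster's own message type and merged chunk, and error handling is omitted:
// 'buffer' is assumed to be a std::vector<char> holding residue + newly read data.
google::protobuf::io::ArrayInputStream ais(buffer.data(), static_cast<int>(buffer.size()));
google::protobuf::io::CodedInputStream cis(&ais);

size_t consumed = 0;                    // offset of the first byte not yet fully parsed
google::protobuf::uint32 msgSize = 0;
while (cis.ReadVarint32(&msgSize))
{
    // Not enough bytes left for a whole message: stop. The varint just read
    // stays in the residue, because 'consumed' was recorded before reading it.
    if (msgSize > buffer.size() - static_cast<size_t>(cis.CurrentPosition()))
        break;

    int limit = cis.PushLimit(msgSize);
    MyMessage message;
    if (!message.ParseFromCodedStream(&cis))
        break;                          // malformed input
    cis.PopLimit(limit);

    consumed = static_cast<size_t>(cis.CurrentPosition());  // message fully parsed
}
// buffer[consumed .. buffer.size()) is the residue to prepend to the next chunk.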
Okay, thanks to Kenton's help in pointing out the major issues in my question, I have now revised the code and tested that it works. I will post my solution down here. That said, I am not happy about all the complexity and edge-case checking needed here; I think it's error prone. What I will probably do for real is write my direct "read from stream" blocking call in another thread outside of my libuv main thread, so I am not bound by the requirement of using the libuv API. But for the sake of completeness, here's my code:
std::vector<char> mCheckBuffer;
std::vector<char> mResidueBuffer;
std::vector<char> mReadBuffer(READ_BUFFER_SIZE);
google::protobuf::uint32 size;

//"in" is the file input stream
while (in.good()) {
    //This part is tricky as you're not guaranteed that what ends up in
    //mReadBuffer is everything you read out from the file. The same
    //happens with libuv's assigned buffer: after EOF, what's left in
    //the buffer could be anything
    in.read(mReadBuffer.data(), READ_BUFFER_SIZE);
    //merge the last remaining chunk that contains an incomplete message with
    //the new data chunk I got out of the buffer. I couldn't find a more
    //efficient way of doing that
    mCheckBuffer.clear();
    mCheckBuffer.reserve(mResidueBuffer.size() + mReadBuffer.size());
    mCheckBuffer.insert(mCheckBuffer.end(), mResidueBuffer.begin(),
                        mResidueBuffer.end());
    mCheckBuffer.insert(mCheckBuffer.end(), mReadBuffer.begin(),
                        mReadBuffer.end());
    //Treat the new merged buffer array as the new CIS
    google::protobuf::io::ArrayInputStream ais(&mCheckBuffer[0],
                                               mCheckBuffer.size());
    google::protobuf::io::CodedInputStream cis(&ais);
    //No size information, probably first time or last iteration
    //coincidentally read a complete message out. Otherwise I simply
    //skip reading size again as I've already populated that from last
    //iteration when I got an incomplete message
    if (size == 0) {
        cis.ReadVarint32(&size);
    }
    bResidueBufSize = mCheckBuffer.size() - cis.CurrentPosition();
    //Compare the next message size with how much is left in the buffer; if
    //the message size is smaller, I know I can read at least one more message
    //out, so keep reading until I run out of buffer. If it's the end of the message
    //and size (the next byte I read from the stream) happens to be 0, that
    //will trip me up, because when I push size 0 into PushLimit and then try
    //parsing, it will actually return true even though it reads nothing.
    //So I can get into an infinite loop if I don't do the check here
    while (size <= bResidueBufSize && size != 0) {
        //If this cis I constructed didn't have the size info at the
        //beginning, and I just read straight from it hoping to get the
        //message out from the "size" I got from last iteration:
        //push the size constraint onto the input stream
        int limit = cis.PushLimit(size);
        //parse the message from the input stream
        bool result = message.ParseFromCodedStream(&cis);
        //Parse failed; it could be because the last iteration already took care
        //of the last message and the size I read last time is just junk.
        //I choose to only check EOF here when result is not true (which
        //leads me to having to check the size == 0 case above), because it
        //would be too many checks if I checked it every time I finished
        //reading a message out
        if (!result) {
            if (in.eof()) {
                log.info("Reached EOF, stop processing!");
                break;
            }
            else {
                log.error("Read error or input mal-formatted! Log error!");
                exit(1);
            }
        }
        cis.PopLimit(limit);
        google::protobuf::TextFormat::PrintToString(message, &str);
        //Do something with the message
        //This is for when the last message read out exactly reaches the end of
        //the buffer and there is no size information available on the
        //stream any more, in which case size needs to be reset to zero
        //so that the beginning of the next iteration reads the size info first
        if (!cis.ReadVarint32(&size)) {
            size = 0;
        }
        bResidueBufSize = mCheckBuffer.size() - cis.CurrentPosition();
    }
    if (in.eof()) {
        break;
    }
    //Now I am copying the residual buffer into the intermediate
    //mResidueBuffer, which will be merged with newly read data in the next iteration
    mResidueBuffer.clear();
    mResidueBuffer.reserve(bResidueBufSize);
    mResidueBuffer.insert(mResidueBuffer.end(),
                          &mCheckBuffer[cis.CurrentPosition()], &mCheckBuffer[mCheckBuffer.size()]);
}
if (!in.eof()) {
    log.error("Something else other than EOF happened to the file, log error!");
    exit(1);
}

how many bytes actually written by ostream::write?

suppose I send a big buffer to ostream::write, but only the beginning part of it is actually successfully written, and the rest is not written
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(64 * 1000 * 1000, 'a'); // 64 MB of data
    std::ofstream file("out.txt");
    file.write(&buf[0], buf.size()); // try to write 64 MB
    if (file.bad()) {
        // but suppose only 10 MB were available on disk
        // how many bytes were actually written to the file???
    }
    return 0;
}
what ostream function can tell me how many bytes were actually written?
You can use .tellp() to get the output position in the stream and compute the number of bytes written:
size_t before = file.tellp(); //current pos
if (!file.write(&buf[0], buf.size())) //enter the if-block if the write fails
{
    file.clear(); //clear the error flags so tellp() reports a position again
    //compute the difference
    size_t numberOfBytesWritten = static_cast<size_t>(file.tellp()) - before;
}
Note that there is no guarantee that numberOfBytesWritten is really the number of bytes written to the file, but it should work for most cases, since we don't have any reliable way to get the actual number of bytes written to the file.
I don't see any equivalent to gcount(). Writing directly to the streambuf (with sputn()) would give you an indication, but there is a fundamental problem with your request: writes are buffered, failure detection can be delayed until the actual write (flush or close), and there is no way to get access to what the OS really wrote.
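For illustration, a hedged sketch of the sputn() route; the same caveat about buffering applies:
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(64 * 1000 * 1000, 'a');
    std::ofstream file("out.txt", std::ios::binary);

    // sputn() reports how many characters the stream buffer accepted; as noted
    // above, that is still not proof of how much the OS actually wrote.
    std::streamsize accepted =
        file.rdbuf()->sputn(buf.data(), static_cast<std::streamsize>(buf.size()));
    file.flush();   // a failure may only show up here (or at close)

    return accepted == static_cast<std::streamsize>(buf.size()) ? 0 : 1;
}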

C++ copying files. Short on data

I'm trying to copy a file, but whatever I try, the copy seems to be a few bytes short.
_file is an ifstream set to binary mode.
void FileProcessor::send()
{
    //If no file is opened return
    if (!_file.is_open()) return;

    //Reset position to beginning
    _file.seekg(0, ios::beg);

    //Result buffer
    char * buffer;
    char * partBytes = new char[_bufferSize];
    //Packet *p;

    //Read the file and send it over the network
    while (_file.read(partBytes, _bufferSize))
    {
        //buffer = Packet::create(Packet::FILE,std::string(partBytes));
        //p = Packet::create(buffer);
        //cout<< p->getLength() << "\n";
        //writeToFile(p->getData().c_str(),p->getLength());
        writeToFile(partBytes, _bufferSize);
        //delete[] buffer;
    }
    //cout<< *p << "\n";
    delete [] partBytes;
}
_writeFile is the file to be written to.
void FileProcessor::writeToFile(const char *buffer, unsigned int size)
{
    if (_writeFile.is_open())
    {
        _writeFile.write(buffer, size);
        _writeFile.flush();
    }
}
In this case I'm trying to copy a zip file.
But opening both the original and the copy in Notepad, I noticed that while they look identical, they differ at the end, where the copy is missing a few bytes.
Any suggestions?
You are assuming that the file's size is a multiple of _bufferSize. You have to check what's left on the buffer after the while:
while (_file.read(partBytes, _bufferSize)) {
    writeToFile(partBytes, _bufferSize);
}
if (_file.gcount())
    writeToFile(partBytes, _file.gcount());
Your while loop will terminate when it fails to read _bufferSize bytes because it hits an EOF.
The final call to read() might have read some data (just not a full buffer) but your code ignores it.
After your loop you need to check _file.gcount() and if it is not zero, write those remaining bytes out.
Are you copying from one type of media to another? Perhaps different sector sizes are causing the apparent weirdness.
What if _bufferSize doesn't divide evenly into the size of the file...that might cause extra bytes to be written.
You don't want to always do writeToFile(partBytes,_bufferSize); since it's possible (at the end) that less than _bufferSize bytes were read. Also, as pointed out in the comments on this answer, the ifstream is no longer "true" once the EOF is reached, so the last chunk isn't copied (this is your posted problem). Instead, use gcount() to get the number of bytes read:
do
{
    _file.read(partBytes, _bufferSize);
    writeToFile(partBytes, (unsigned int)_file.gcount());
} while (_file);
For comparisons of zip files, you might want to consider using a non-text editor to do the comparison; HxD is a great (free) hex editor with a file compare option.

very fast text file processing (C++)

I wrote an application which processes data on the GPU. The code works well, but I have the problem that reading the input file (~3 GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the line-by-line processing is slow.)
I read a line with getline() and copy line 1 to a vector, line 2 to another vector, and skip lines 3 and 4. And so on for the rest of the 11 million lines.
I tried several approaches to read the file in the best time possible. The fastest method I found was using boost::iostreams::stream. Others were: reading the file as gzip to minimize IO (but it is slower than reading it directly), and copying the file to RAM with read(filepointer, chararray, length) and processing it with a loop to split out the lines (also slower than boost).
Any suggestions on how to make it run faster?
void readfastq(char *filename, int SRlength, uint32_t blocksize) {
    _filelength = 0; //total datasets (each 4 lines)
    _SRlength = SRlength; //length of the 2nd line
    _blocksize = blocksize;

    boost::iostreams::stream<boost::iostreams::file_source> ins(filename);
    in = ins;
    readNextBlock();
}

void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    _seqs.empty();
    _phred.empty();
    _names.empty();
    _filelength = 0;

    //read only a part of the file i.e the first 4mio lines
    while (std::getline(in, name) && _filelength < _blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);

        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            for (int k = 0; k < _SRlength; k++) {
                //handle special letters
                if (seqtemp[k] == 'A') ...
                else {
                    _seqs.push_back(5);
                }
            }
            _filelength++;
        }
    }
}
EDIT:
The source-file is downloadable under https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2
I changed the function readfastq to read the whole file, because of some pointer problems. So if you call readfastq, the blocksize (in lines) must be bigger than the number of lines to read.
SOLUTION:
I found a solution which gets the time for reading in the file down from 60 seconds to 16 seconds. I removed the inner loop which handles the special characters and do this on the GPU instead. This decreases the read-in time and only minimally increases the GPU running time.
Thanks for your suggestions.
void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;

    size_t bytes_read, bytes_expected;
    FILE *fp;
    fp = fopen(filename, "r");

    fseek(fp, 0L, SEEK_END);    //go to the end of the file
    bytes_expected = ftell(fp); //get the file size
    fseek(fp, 0L, SEEK_SET);    //go to the beginning of the file
    fclose(fp);

    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) //allocate space for the file
        err(EX_OSERR, "data malloc");

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    boost::iostreams::stream<boost::iostreams::file_source> file(filename);

    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);

        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            strncpy(&(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); //do not handle special letters here, do it on the GPU
            _filelength++;
        }
    }
}
First, instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit so that 3 GB fits into the virtual address space (for a 32-bit application, only 2 GB is accessible in user mode). Or, alternatively, you may map and process your file in parts.
Next, it sounds to me like your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hurts performance very seriously. If this is the case, either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors they do not free their memory.
Post your code (or a part of it) for more suggestions.
EDIT:
It seems that all your bottlenecks are related to string management.
std::getline(in, seqtemp); reading into an std::string involves dynamic memory allocation.
_names.push_back(name); This is even worse. First, the std::string is placed into the vector by value, meaning the string is copied, hence another dynamic allocation/free happens. Moreover, when the vector is eventually reallocated internally, all the contained strings are copied again, with all the consequences.
I recommend using neither standard formatted file I/O functions (Stdio/STL) nor std::string. To achieve better performance you should work with pointers to strings (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines).
Like in this code:
#include <cstddef>   // size_t
#include <cstring>   // memchr

class MemoryMappedFileParser
{
    const char* m_sz;
    size_t m_Len;

public:
    // Constructor added so the parser can be initialized with a mapped region.
    MemoryMappedFileParser(const char* sz, size_t len) : m_sz(sz), m_Len(len) {}

    struct String {
        const char* m_sz;
        size_t m_Len;
    };

    bool getline(String& out)
    {
        out.m_sz = m_sz;

        const char* sz = static_cast<const char*>(memchr(m_sz, '\n', m_Len));
        if (sz)
        {
            size_t len = sz - m_sz;
            m_sz = sz + 1;
            m_Len -= (len + 1);
            out.m_Len = len;

            // for Windows-format text files remove the '\r' as well
            if (len && '\r' == out.m_sz[len-1])
                out.m_Len--;
        } else
        {
            out.m_Len = m_Len;
            if (!m_Len)
                return false;
            m_Len = 0;
        }

        return true;
    }
};
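For completeness, a hedged usage sketch, assuming a POSIX mmap() of the whole file and omitting error handling; the constructor used here is the one added above:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    int fd = ::open(argv[1], O_RDONLY);
    struct stat sb;
    ::fstat(fd, &sb);

    const char* data = static_cast<const char*>(
        ::mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    MemoryMappedFileParser parser(data, static_cast<size_t>(sb.st_size));
    MemoryMappedFileParser::String line;
    while (parser.getline(line))
    {
        // line.m_sz / line.m_Len point straight into the mapping: no copies
    }

    ::munmap(const_cast<char*>(data), static_cast<size_t>(sb.st_size));
    ::close(fd);
    return 0;
}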
If _seqs and _names are std::vectors and you can guess their final size before processing the whole 3 GB of data, you can use reserve to avoid most of the memory reallocations while pushing back the new elements in the loop (see the sketch below).
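A small, hedged sketch of that idea; the record size and read length used for the estimate are guesses, not values from the question:
#include <string>
#include <vector>

int main()
{
    const std::size_t fileSize       = 3UL * 1024 * 1024 * 1024; // ~3 GB input
    const std::size_t bytesPerRecord = 4 * 60;                   // 4 lines of ~60 bytes each (a guess)
    const std::size_t SRlength       = 100;                      // assumed read length

    std::vector<std::string> names;
    std::vector<char>        seqs;
    names.reserve(fileSize / bytesPerRecord);
    seqs.reserve(fileSize / bytesPerRecord * SRlength);

    // ... push_back in the parsing loop now rarely triggers a reallocation ...
    return 0;
}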
You should be aware of the fact that the vectors effectively produce another copy of parts of the file in main memory. So unless you have a main memory sufficiently large to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
You are apparently using <stdio.h>, since you are using getline.
Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m tells the library (it is a GNU extension) to use mmap for reading.
Perhaps setting a big buffer (e.g. half a megabyte) with setbuffer could also help.
Probably, using the readahead system call (in a separate thread, perhaps) could help.
But all of this is guesswork. You should really measure things.
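A hedged sketch of those stdio-side suggestions; "rm" is a glibc extension and is typically ignored elsewhere, and setvbuf() is the portable spelling of setbuffer():
#include <cstdio>

int main(int argc, char** argv)
{
    std::FILE* fp = std::fopen(argv[1], "rm");   // "m": ask glibc to mmap the file
    if (!fp) return 1;

    static char iobuf[512 * 1024];               // ~half a megabyte of stdio buffer
    std::setvbuf(fp, iobuf, _IOFBF, sizeof iobuf);

    char line[4096];
    while (std::fgets(line, sizeof line, fp))
    {
        // process the line
    }

    std::fclose(fp);
    return 0;
}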
General suggestions:
Code the simplest, most straight-forward, clean approach,
Measure,
Measure,
Measure,
Then if all else fails:
Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so the kernel's read-ahead plays to your advantage.
Re-use the same buffer to minimize cache flushing.
Avoid copying data, parse in place, pass around pointers (and sizes).
mmap(2)-ing [parts of] the file is another approach. It also avoids the kernel-userland copy.
Depending on your disk speed, using a very fast decompression algorithm might help, like fastlz (there are at least two others that might be more efficient, but under the GPL, so the licence can be a problem).
Also, using C++ data structures and functions can increase the speed, as you can maybe achieve better compile-time optimization. Going the C way isn't always the fastest! In some bad cases, using char* means you have to parse the whole string to reach the \0, yielding disastrous performance.
For parsing your data, using boost::spirit::qi is probably the most optimized approach: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html