Problems with pwrite() to a file in C/C++

I have a nasty problem. I'm trying to write to a file via a file descriptor and posix_memalign. I can write to it, but only something like a wrongly encoded char ends up in the file.
Here's my code:
fdOutputFile = open(outputFile, O_CREAT | O_WRONLY | O_APPEND | O_DIRECT, 0644)

void writeThis(char* text) {
    while (*text != '\0') {
        // if my internal buffer is full -> write to disk
        if (buffPositionOutput == outputbuf.st_blksize) {
            posix_memalign((void **)&bufferO, outputbuf.st_blksize, outputbuf.st_blksize);
            cout << "wrote " << pwrite(fdOutputFile, bufferO, outputbuf.st_blksize, outputOffset*outputbuf.st_blksize) << " Bytes to disk." << endl;
            buffPositionOutput = 0;
            ++outputOffset;
        }
        // buffer the incoming text...
        bufferO[buffPositionOutput] = *text;
        ++text;
        ++buffPositionOutput;
    }
}
I think it's the alignment - can someone help me?
It writes to the file but not the correct text, just a bunch of '[]'-chars.
Thanks in advance for your help!

Looking at your program, here is what happens:
You fill the memory initially pointed to by bufferO + buffPositionOutput (which is where, precisely? I can't tell from the code you posted) up to bufferO + outputbuf.st_blksize with data.
You pass the address of the bufferO pointer to posix_memalign, which ignores its current value and overwrites it with a pointer to outputbuf.st_blksize bytes of newly allocated memory.
You write data from the newly allocated block to disk; this might be anything, since you just allocated the memory and haven't written anything to it yet.
This won't work, obviously. You probably want to initialize your buffer via posix_memalign once at the top of your function, and then just overwrite a block's worth of data in that aligned buffer each time you write to the file. (Reset buffPositionOutput to zero after each write, but don't re-allocate.) Make sure you free your buffer when you are done.
Also, why are you using pwrite instead of write?
Here's how I would implement writeThis (keeping your variable names so you can match it up with your version):
void writeThis(char *text) {
    char *bufferO;
    size_t buffPositionOutput = 0;
    posix_memalign((void **)&bufferO, outputbuf.st_blksize, outputbuf.st_blksize);
    while (*text != 0) {
        bufferO[buffPositionOutput] = *text;
        ++text; ++buffPositionOutput;
        if (buffPositionOutput == outputbuf.st_blksize) {
            write(fdOutputFile, bufferO, outputbuf.st_blksize);
            buffPositionOutput = 0;
        }
    }
    if (buffPositionOutput != 0) {
        // what do you want to do with a partial block of data? Not sure.
    }
    free(bufferO);
}
(For speed, you might consider using memcpy calls instead of a loop. You would need to know the length of the data to write ahead of time though. Worry about that after you have a working solution that does not leak memory.)
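For illustration, a minimal sketch of that memcpy variant, assuming the caller knows the text length up front and reusing fdOutputFile and outputbuf from the question (an untested sketch; the partial final block is left as the same open question as above):

#include <cstdlib>  // posix_memalign, free
#include <cstring>  // memcpy
#include <unistd.h> // write

void writeThisFast(const char *text, size_t len) {
    const size_t blk = outputbuf.st_blksize;
    char *buf;
    if (posix_memalign((void **)&buf, blk, blk) != 0)
        return; // allocation failed
    while (len >= blk) {
        memcpy(buf, text, blk);        // fill one aligned block at a time
        write(fdOutputFile, buf, blk); // TODO: check the return value
        text += blk;
        len -= blk;
    }
    // len < blk bytes remain; with O_DIRECT the final block would have to
    // be padded to a full block size before writing.
    free(buf);
}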

You're re-allocating bufferO every time you try to output it, and never freeing it. That's really inefficient (and leaks memory). I'd suggest you refactor your code a bit, because it's quite hard to follow whether your bounds checking on that buffer is correct or not.
Allocate bufferO only once somewhere (from that snippet, storing it in outputbuf sounds like a good idea). Also store buffPositionOutput in that struct (or in another struct, but close to that buffer).
// in setup code
int rc = posix_memalign((void **)&(outputbuf.data), outputbuf.st_blksize,
                        outputbuf.st_blksize);
// check rc!
outputbuf.writePosition = 0;

// in cleanup code
free(outputbuf.data);
Then you can rewrite your function like this:
void writeThis(char *text) {
    while (*text != 0) {
        outputbuf.data[outputbuf.writePosition] = *text;
        outputbuf.writePosition++;
        text++;
        if (outputbuf.writePosition == outputbuf.block_size) {
            int rc = pwrite(...);
            // check rc!
            std::cout << ...;
            outputbuf.writePosition = 0;
        }
    }
}

I don't think C/C++ has encodings as such; strings are just bytes (effectively ASCII).
Unless you use wide characters: http://en.wikipedia.org/wiki/Wide_character

Related

Copying QFile contents to another QFile, what's the optimal way?

I need to copy a QFile to another QFile in chunks, so I can't use QFile::copy. Here's the most primitive implementation:
bool CFile::copyChunk(int64_t chunkSize, const QString &destFolder)
{
    if (!_thisFile.isOpen())
    {
        // Initializing - opening files
        _thisFile.setFileName(_absoluteFilePath);
        if (!_thisFile.open(QFile::ReadOnly))
            return false;
        _destFile.setFileName(destFolder + _thisFileName);
        if (!_destFile.open(QFile::WriteOnly))
            return false;
    }
    if (chunkSize < (_thisFile.size() - _thisFile.pos()))
    {
        QByteArray data(chunkSize, 0);
        _thisFile.read(data.data(), chunkSize);
        return _destFile.write(data) == chunkSize;
    }
}
It's not clear from this fragment, but I only intend to copy a binary file as a whole to another location, just in chunks so I can provide progress callbacks and a cancellation facility for large files.
Another idea is to use memory mapping. Should I? If so, should I map only the source file and still use _destFile.write, or should I map both and use memcpy?
I guess this question isn't really tied to Qt, I think the answer should be general to any file I/O API that supports memory mapping.
OK, OK, if it must be a memory-mapping solution, here is one:
QFile source("/tmp/bla1.bin");
source.open(QIODevice::ReadOnly);
QFile destination("/tmp/bla2.bin");
destination.open(QIODevice::ReadWrite);
destination.resize(source.size());
uchar *data = destination.map(0, destination.size());
if (!data) {
    qDebug() << "Cannot map";
    exit(-1);
}
uchar *writePos = data;
qint64 chunksize = 200;
qint64 bytesRead = 0;
do {
    bytesRead = source.read((char *)writePos, chunksize);
    writePos += bytesRead;
} while (bytesRead > 0);
destination.unmap(data); // unmap with the pointer map() returned, not the advanced one
destination.close();
This maps only the destination file into memory. I doubt it will make much of a difference to map the source file as well. But this is something for concrete measurements, not assumptions.
Another question is whether you can map your whole file into memory at once. Constantly unmapping and remapping will certainly cost performance. And even if you use Qt, functions like memory mapping tend to behave disturbingly differently on different platforms, e.g. the maximum file size you can map into memory might differ.
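For completeness, here is a sketch of the map-both-and-memcpy variant the question asks about (untested; it assumes both map() calls succeed and keeps the chunking so progress callbacks and cancellation stay possible):

#include <cstring> // memcpy

QFile source("/tmp/bla1.bin");
source.open(QIODevice::ReadOnly);
QFile destination("/tmp/bla2.bin");
destination.open(QIODevice::ReadWrite);
destination.resize(source.size());

uchar *src = source.map(0, source.size());
uchar *dst = destination.map(0, destination.size());
if (!src || !dst) {
    qDebug() << "Cannot map";
    exit(-1);
}
const qint64 chunksize = 1 << 20; // 1 MiB per chunk, arbitrary
for (qint64 off = 0; off < source.size(); off += chunksize) {
    qint64 n = qMin(chunksize, source.size() - off);
    memcpy(dst + off, src + off, n); // copy one chunk mapping-to-mapping
    // progress callback / cancellation check would go here
}
source.unmap(src);
destination.unmap(dst);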
What the optimal method is, lies always a bit in the eye of the beholder. Here is at least one working shorter method:
QFile source("/tmp/bla1.bin");
source.open(QIODevice::ReadOnly);
QFile destination("/tmp/bla2.bin");
destination.open(QIODevice::WriteOnly);
QByteArray buffer;
int chunksize = 200; // Whatever chunk size you like
while(!(buffer = source.read(chunksize)).isEmpty()){
destination.write(buffer);
}
destination.close();
source.close();
And memory mapping... I try to stay away from things like that. I am never too sure how platform independent they are.
Use this QFile::map() method:
QFile fs("Sourcefile.bin");
fs.open(QFile::ReadOnly);
QFile fd("Destinationfile.bin");
fd.open(QFile::WriteOnly);
// Note: map() can return nullptr on failure; check it in real code.
fd.write((char*) fs.map(0, fs.size()), fs.size()); // Copies all data
fd.close();
fs.close();

C++ copying files. Short on data

I'm trying to copy a file, but whatever I try, the copy seems to be a few bytes short.
_file is an ifstream set to binary mode.
void FileProcessor::send()
{
    // If no file is opened return
    if (!_file.is_open()) return;
    // Reset position to beginning
    _file.seekg(0, ios::beg);
    // Result buffer
    char * buffer;
    char * partBytes = new char[_bufferSize];
    //Packet *p;
    // Read the file and send it over the network
    while (_file.read(partBytes, _bufferSize))
    {
        //buffer = Packet::create(Packet::FILE,std::string(partBytes));
        //p = Packet::create(buffer);
        //cout<< p->getLength() << "\n";
        //writeToFile(p->getData().c_str(),p->getLength());
        writeToFile(partBytes, _bufferSize);
        //delete[] buffer;
    }
    //cout<< *p << "\n";
    delete [] partBytes;
}
_writeFile is the file to be written to.
void FileProcessor::writeToFile(const char *buffer, unsigned int size)
{
    if (_writeFile.is_open())
    {
        _writeFile.write(buffer, size);
        _writeFile.flush();
    }
}
In this case I'm trying to copy a zip file.
But opening both the original and the copy in Notepad, I noticed that while they look identical, the copy is missing a few bytes at the end.
Any suggestions?
You are assuming that the file's size is a multiple of _bufferSize. You have to check what the final partial read left in the buffer after the while:
while (_file.read(partBytes, _bufferSize)) {
    writeToFile(partBytes, _bufferSize);
}
if (_file.gcount())
    writeToFile(partBytes, _file.gcount());
Your while loop will terminate when it fails to read _bufferSize bytes because it hits an EOF.
The final call to read() might have read some data (just not a full buffer) but your code ignores it.
After your loop you need to check _file.gcount() and if it is not zero, write those remaining bytes out.
Are you copying from one type of media to another? Perhaps different sector sizes are causing the apparent weirdness.
What if _bufferSize doesn't divide evenly into the size of the file? In that case the final partial chunk never makes it through the loop, which would leave the copy short.
You don't want to always do writeToFile(partBytes,_bufferSize); since it's possible (at the end) that less than _bufferSize bytes were read. Also, as pointed out in the comments on this answer, the ifstream is no longer "true" once the EOF is reached, so the last chunk isn't copied (this is your posted problem). Instead, use gcount() to get the number of bytes read:
do
{
    _file.read(partBytes, _bufferSize);
    writeToFile(partBytes, (unsigned int)_file.gcount());
} while (_file);
For comparisons of zip files, you might want to consider using a non-text editor to do the comparison; HxD is a great (free) hex editor with a file compare option.
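If you'd rather check programmatically than eyeball a hex dump, here is a small sketch that reports the first byte offset at which two files differ (the file names are placeholders):

#include <fstream>
#include <iostream>

int main() {
    std::ifstream a("original.zip", std::ios::binary); // placeholder names
    std::ifstream b("copy.zip", std::ios::binary);
    long long offset = 0;
    char ca, cb;
    while (true) {
        bool gotA = static_cast<bool>(a.get(ca));
        bool gotB = static_cast<bool>(b.get(cb));
        if (!gotA || !gotB) {
            if (gotA != gotB)
                std::cout << "Files differ in length at offset " << offset << "\n";
            else
                std::cout << "Files are identical\n";
            return 0;
        }
        if (ca != cb) {
            std::cout << "First difference at offset " << offset << "\n";
            return 0;
        }
        ++offset;
    }
}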

Use stringstream to read from TCP socket

I am using a socket library (which I'd rather keep using) whose recv operation works with std::string, but it is just a wrapper for a single call of the recv socket function, so it is likely that I only get part of the message I wanted. My first instinct was to loop, appending the received string to another string until I get everything, but this seems inefficient. Another possibility was to do the same with a char array, but this seems messy (I'd have to check the string's size before adding it to the array, and if it overflowed I'd need to store the string somewhere until the array is empty again...).
So I was thinking about using a stringstream. I use a TLV protocol, so I need to first extract two bytes into an unsigned short, then get a certain amount of bytes from the stringstream and then loop again until I reach a delimiter field.
Is there any better way to do this? Am I completely on the wrong track? Are there any best practices? So far I've only ever seen the socket library used directly with char arrays, so I'm curious why using std::string with stringstreams could be a bad idea.
Edit: Replying to the comment below: the library is one we use internally; it's not public (it's nothing special though, mostly just a wrapper around the socket library to add exceptions, etc.).
I should mention that I have a working prototype using the socket library directly.
This works something like:
int lengthFieldSize = sizeof(unsigned short);
int endOfBuffer = 0; // Index of last valid position in buffer.
while (true) {
    char buffer[RCVBUFSIZE];
    while (true) {
        int offset = endOfBuffer;
        int rs = 0;
        rs = recv(sock, buffer + offset, sizeof(buffer) - offset, 0);
        endOfBuffer += rs;
        if (rs < 1) {
            // Received nothing or error.
            break;
        } else if (endOfBuffer == RCVBUFSIZE) {
            // Buffer full.
            break;
        } else if (rs > 0 && endOfBuffer > 1) {
            unsigned short msglength = 0;
            memcpy((char *) &msglength, buffer + endOfBuffer - lengthFieldSize, lengthFieldSize);
            if (msglength == 0) {
                break; // Received a full transmission.
            }
        }
    }
    unsigned int startOfData = 0;
    unsigned short protosize = 0;
    while (true) {
        // Copy first two bytes into protosize (length field)
        memcpy((char *) &protosize, buffer + startOfData, lengthFieldSize);
        // Is the last length field the delimiter?
        // Then reply and return. (We're done.)
        // Otherwise: Is the next message not completely in the buffer?
        // Then break. (Outer while will take us back to receiving)
        if (protosize == 0) {
            // Done receiving. Now send:
            SendReplyMsg(sock, lengthFieldSize);
            // Clean up.
            close(sock);
            return;
        } else if ((endOfBuffer - lengthFieldSize - startOfData) < protosize) {
            memmove(buffer, buffer + startOfData, RCVBUFSIZE - startOfData);
            // Adjust endOfBuffer:
            endOfBuffer -= startOfData;
            break;
        }
        startOfData += lengthFieldSize;
        gtControl::gtMsg gtMessage;
        if (!gtMessage.ParseFromArray(buffer + startOfData, protosize)) {
            cerr << "Failed to parse gtMessage." << endl;
            close(sock);
            return;
        }
        // Move position pointer forward by one message (length + pbuf)
        startOfData += protosize;
        PrintGtMessage(&gtMessage);
    }
}
So basically I have a big loop which contains a receiving loop and a parsing loop. There's a character array being passed back and forth as I can't be sure to have received everything until I actually parse it. I'm trying to replicate this behaviour using "proper" C++ (i.e. std::string)
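For reference, here is a rough sketch of the std::string version I have in mind (recvString is a hypothetical stand-in for the library's string-returning recv wrapper; the two-byte length fields are assumed to be in host byte order, as in the prototype above):

#include <cstring>
#include <string>

// Hypothetical wrapper from the socket library: returns whatever one
// recv() call produced; empty string on close or error.
std::string recvString(int sock);

void receiveAll(int sock) {
    std::string buffer; // accumulated, not-yet-parsed bytes
    const size_t lengthFieldSize = sizeof(unsigned short);
    while (true) {
        std::string chunk = recvString(sock);
        if (chunk.empty())
            break; // connection closed or error
        buffer += chunk;
        // Parse as many complete length-prefixed records as we have.
        while (buffer.size() >= lengthFieldSize) {
            unsigned short protosize;
            memcpy(&protosize, buffer.data(), lengthFieldSize);
            if (protosize == 0)
                return; // delimiter field: transmission complete
            if (buffer.size() < lengthFieldSize + protosize)
                break; // message incomplete, go back to receiving
            std::string payload = buffer.substr(lengthFieldSize, protosize);
            // ... hand payload to the parser (e.g. ParseFromArray) ...
            buffer.erase(0, lengthFieldSize + protosize);
        }
    }
}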
My first instinct was to go in a loop and append the received string to another string until I get everything, but this seems inefficient.
String concatenation is technically implementation dependent, but str1 + str2 will probably require one dynamic allocation and two copies (one from str1, one from str2). That's sorta slow, but it's far faster than network access! So my first piece of advice would be to go with your first instinct, and find out whether it's both correct and fast enough.
If it's not fast enough, and your profiler shows that the redundant string copies are to blame, consider maintaining a list of strings (std::vector<string*>, perhaps) and joining all the strings together once at the end. This requires some care, but should avoid a bunch of redundant string copying.
But definitely profile first!
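A sketch of that join-at-the-end idea (using a plain std::vector<std::string> rather than string pointers, plus a reserve so the final join does a single allocation):

#include <string>
#include <vector>

// Collect chunks as they arrive, then join once at the end.
std::string joinChunks(const std::vector<std::string> &chunks) {
    size_t total = 0;
    for (const std::string &c : chunks)
        total += c.size();
    std::string result;
    result.reserve(total); // one allocation instead of one per append
    for (const std::string &c : chunks)
        result += c; // appends into the preallocated storage
    return result;
}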

very fast text file processing (C++)

I wrote an application which processes data on the GPU. The code works well, but I have the problem that the reading part of the input file (~3GB, text) is the bottleneck of my application. (The read from the HDD is fast, but the line-by-line processing is slow.)
I read a line with getline(), copy line 1 to a vector and line 2 to a vector, and skip lines 3 and 4. And so on for the rest of the 11 million lines.
I tried several approaches to read the file as fast as possible:
The fastest method I found is using boost::iostreams::stream.
Others were:
Reading the file as gzip to minimize IO, but this was slower than reading it directly.
Copying the file to RAM with read(filepointer, chararray, length) and processing it with a loop to distinguish the lines (also slower than boost).
Any suggestions how to make it run faster?
void readfastq(char *filename, int SRlength, uint32_t blocksize) {
    _filelength = 0;      // total datasets (each 4 lines)
    _SRlength = SRlength; // length of the 2nd line
    _blocksize = blocksize;

    boost::iostreams::stream<boost::iostreams::file_source> ins(filename);
    in = ins;
    readNextBlock();
}

void readNextBlock() {
    timeval start, end;
    gettimeofday(&start, 0);

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    _seqs.clear(); // clear previous block's data
    _phred.clear();
    _names.clear();
    _filelength = 0;

    // read only a part of the file, i.e. the first 4 million lines
    while (std::getline(in, name) && _filelength < _blocksize) {
        std::getline(in, seqtemp);
        std::getline(in, garbage);
        std::getline(in, phredtemp);

        if (seqtemp.size() != _SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            for (int k = 0; k < _SRlength; k++) {
                // handle special letters
                if (seqtemp[k] == 'A') { /* ... */ }
                else {
                    _seqs.push_back(5);
                }
            }
            _filelength++;
        }
    }
}
EDIT:
The source file is downloadable at https://docs.google.com/open?id=0B5bvyb427McSMjM2YWQwM2YtZGU2Mi00OGVmLThkODAtYzJhODIzYjNhYTY2
I changed the function readfastq to read the file, because of some pointer problems. So if you call readfastq, the blocksize (in lines) must be bigger than the number of lines to read.
SOLUTION:
I found a solution which got the read-in time for the file down from 60 s to 16 s. I removed the inner loop which handles the special characters and do this on the GPU instead. This decreases the read-in time and only minimally increases the GPU running time.
Thanks for your suggestions.
void readfastq(char *filename, int SRlength) {
    _filelength = 0;
    _SRlength = SRlength;

    size_t bytes_read, bytes_expected;
    FILE *fp;
    fp = fopen(filename, "r");

    fseek(fp, 0L, SEEK_END);    // go to the end of file
    bytes_expected = ftell(fp); // get filesize
    fseek(fp, 0L, SEEK_SET);    // go to the beginning of the file
    fclose(fp);

    if ((_seqarray = (char *) malloc(bytes_expected/2)) == NULL) // allocate space for file
        err(EX_OSERR, "data malloc");

    string name;
    string seqtemp;
    string garbage;
    string phredtemp;

    boost::iostreams::stream<boost::iostreams::file_source> file(filename);
    while (std::getline(file, name)) {
        std::getline(file, seqtemp);
        std::getline(file, garbage);
        std::getline(file, phredtemp);

        if (seqtemp.size() != SRlength) {
            if (seqtemp.size() != 0)
                printf("Error on read in fastq: size is invalid\n");
        } else {
            _names.push_back(name);
            strncpy(&(_seqarray[SRlength*_filelength]), seqtemp.c_str(), seqtemp.length()); // do not handle special letters here, do it on the GPU
            _filelength++;
        }
    }
}
First, instead of reading the file into memory you may work with file mappings. You just have to build your program as 64-bit so that 3GB fits into the virtual address space (for a 32-bit application only 2GB is accessible in user mode). Or alternatively you may map and process your file in parts.
Next, it sounds to me like your bottleneck is "copying a line to a vector". Dealing with vectors involves dynamic memory allocation (heap operations), which in a critical loop hurts performance very seriously. If this is the case, either avoid using vectors, or make sure they're declared outside the loop. The latter helps because when you reallocate/clear vectors, they do not free their memory.
Post your code (or a part of it) for more suggestions.
EDIT:
It seems that all your bottlenecks are related to string management.
std::getline(in, seqtemp); reading into a std::string involves dynamic memory allocation.
_names.push_back(name); This is even worse. First the std::string is placed into the vector by value, meaning the string is copied, hence another dynamic allocation/freeing happens. Moreover, when the vector is eventually reallocated internally, all the contained strings are copied again, with all the consequences.
I recommend using neither the standard formatted file I/O functions (stdio/STL) nor std::string. To achieve better performance you should work with pointers into the file (rather than copied strings), which is possible if you map the entire file. Plus you'll have to implement the file parsing (division into lines) yourself.
Like in this code:
class MemoryMappedFileParser
{
    const char* m_sz;
    size_t m_Len;

public:
    struct String {
        const char* m_sz;
        size_t m_Len;
    };

    bool getline(String& out)
    {
        out.m_sz = m_sz;
        const char* sz = (const char*) memchr(m_sz, '\n', m_Len);
        if (sz)
        {
            size_t len = sz - m_sz;
            m_sz = sz + 1;
            m_Len -= (len + 1);
            out.m_Len = len;
            // for Windows-format text files remove the '\r' as well
            if (len && '\r' == out.m_sz[len-1])
                out.m_Len--;
        } else
        {
            out.m_Len = m_Len;
            if (!m_Len)
                return false;
            m_Len = 0;
        }
        return true;
    }
};
If _seqs and _names are std::vectors and you can estimate their final size before processing the whole 3GB of data, you can use reserve to avoid most of the memory reallocation while pushing back new elements in the loop.
You should be aware that the vectors effectively produce another copy of parts of the file in main memory. So unless you have enough main memory to store the text file plus the vector and its contents, you will probably end up with a number of page faults that also have a significant influence on the speed of your program.
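A minimal illustration of that reserve idea, using a made-up estimate (in practice you would derive it from the file size and the four-lines-per-record format):

// ~11 million records in the 3GB file (4 lines each), per the question.
const size_t expectedRecords = 11000000;    // assumed estimate
_names.reserve(expectedRecords);            // one big allocation up front
_seqs.reserve(expectedRecords * _SRlength); // one entry per sequence character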
You are apparently using <stdio.h>, since you are using getline.
Perhaps fopen-ing the file with fopen(path, "rm"); might help, because the m tells the library (it is a GNU extension) to use mmap for reading.
Perhaps setting a big buffer (i.e. half a megabyte) with setbuffer could also help.
Probably, using the readahead system call (in a separate thread, perhaps) could help.
But all of this is guesswork. You should really measure things.
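A sketch of those two stdio tweaks together (GNU/Linux-specific; the "m" mode flag is a glibc extension, and setvbuf is the portable cousin of setbuffer):

#include <cstdio>

int main() {
    // "m" asks glibc to use mmap for reading (GNU extension; ignored elsewhere).
    FILE *fp = fopen("input.fastq", "rm"); // assumed filename
    if (!fp) return 1;

    // Give stdio a half-megabyte buffer instead of the small default.
    static char buf[512 * 1024];
    setvbuf(fp, buf, _IOFBF, sizeof(buf));

    char line[4096];
    while (fgets(line, sizeof(line), fp)) {
        // ... process the line ...
    }
    fclose(fp);
    return 0;
}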
General suggestions:
Code the simplest, most straight-forward, clean approach,
Measure,
Measure,
Measure,
Then if all else fails:
Read raw bytes (read(2)) in page-aligned chunks. Do so sequentially, so the kernel's read-ahead plays to your advantage.
Re-use the same buffer to minimize cache flushing.
Avoid copying data, parse in place, pass around pointers (and sizes).
mmap(2)-ing [parts of the] file is another approach. This also avoids kernel-userland copy.
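For the mmap(2) route, a bare-bones sketch of mapping a whole file read-only (POSIX; error handling trimmed to the essentials):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("input.fastq", O_RDONLY); // assumed filename
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // Map the whole file; for very large files, map it window by window instead.
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char *data = static_cast<const char *>(p);

    // ... feed data / st.st_size to a parser such as the one above ...

    munmap(p, st.st_size);
    close(fd);
    return 0;
}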
Depending on your disk speed, using a very fast decompression algorithm might help, like fastlz (there are at least two others that might be more efficient, but they're under the GPL, so the licence can be a problem).
Also, using C++ data structures and functions can increase the speed, as you can maybe achieve better compile-time optimization. Going the C way isn't always the fastest! In some bad cases, using char* means you need to parse the whole string to reach the \0, yielding disastrous performance.
For parsing your data, using boost::spirit::qi is also probably the most optimized approach: http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html

Concatenating strings into own protocol

I'm writing networking programs using socket.h for my studies. I have written simple server and client programs that can transfer files between them using a buffer size given by the user.
Server
void transfer(string name)
{
    char *data_to_send;
    ifstream myFile;
    myFile.open(name.c_str(), ios::binary);

    if (myFile.is_open())
    {
        while (!myFile.eof())
        {
            data_to_send = new char[buffer_size];
            myFile.read(data_to_send, buffer_size);
            send(data_to_send, buffer_size);
            delete [] data_to_send;
        }
        myFile.close();
        send("03endtransmission", buffer_size);
    }
    else
    {
        send("03error", buffer_size);
    }
}
Client
void download(string name)
{
    char *received_data;
    int receivedB;
    fstream myFile;
    myFile.open(name.c_str(), ios::out | ios::binary);

    if (myFile.is_open())
    {
        while (1)
        {
            received_data = new char[buffer_size];
            if ((receivedB = recv(sockfd, received_data, buffer_size, 0)) == -1) {
                perror("recv");
                close(sockfd);
                exit(1);
            }
            if (strcmp(received_data, "03endtransmission") == 0)
            {
                cout << "End of transmission" << endl;
                break;
            }
            else if (strcmp(received_data, "03error") == 0)
            {
                cout << "Error" << endl;
                break;
            }
            myFile.write(received_data, buffer_size);
        }
        myFile.close();
    }
}
The problem occurs when I want to implement my own protocol: two chars (control), a 32-char hash, and the rest of the packet is data. I tried a few times to split it up and I ended up with this code:
Server
#define PAYLOAD 34

void transfer(string name)
{
    char hash[] = "12345678901234567890123456789012"; // 32 chars
    char *data_to_send;
    char *concatenation;
    ifstream myFile;
    myFile.open(name.c_str(), ios::binary);

    if (myFile.is_open())
    {
        while (!myFile.eof())
        {
            data_to_send = new char[buffer_size - PAYLOAD];
            myFile.read(data_to_send, buffer_size - PAYLOAD);

            concatenation = new char[buffer_size];
            strcpy(concatenation, "02");
            strcat(concatenation, hash);
            strcat(concatenation, data_to_send);

            send(concatenation, buffer_size);
            delete [] data_to_send;
            delete [] concatenation;
        }
        myFile.close();
        send("03endtransmission", buffer_size);
    }
    else
    {
        send("03error", buffer_size);
    }
}
Client
void download(string name)
{
    char *received_data;
    char *control;
    char *hash;
    char *data;
    int receivedB;
    fstream myFile;
    myFile.open(name.c_str(), ios::out | ios::binary);

    if (myFile.is_open())
    {
        while (1)
        {
            received_data = new char[buffer_size];
            if ((receivedB = recv(sockfd, received_data, buffer_size, 0)) == -1) {
                perror("recv");
                close(sockfd);
                exit(1);
            }
            if (strcmp(received_data, "03endtransmission") == 0)
            {
                cout << "End of transmission" << endl;
                break;
            }
            else if (strcmp(received_data, "03error") == 0)
            {
                cout << "Error" << endl;
                break;
            }
            control = new char[3];
            strcpy(control, "");
            strncpy(control, received_data, 2);
            control[2] = '\0';

            hash = new char[33];
            strcpy(hash, "");
            strncpy(hash, received_data + 2, 32);
            hash[32] = '\0';

            data = new char[buffer_size - PAYLOAD + 1];
            strcpy(data, "");
            strncpy(data, received_data + 34, buffer_size - PAYLOAD);
            myFile.write(data, buffer_size - PAYLOAD);
        }
        myFile.close();
    }
}
But this one puts some '^#' characters into the file instead of the real data. Displaying "data" on the console looks the same on server and client. If you know how I can split it up correctly, I would be very grateful.
You have some issues which may or may not be your problem.
(1) send/recv can return less than you requested. You may ask to receive 30 bytes but only get 10 on the recv call, so all of these calls have to be coded in loops, buffering somewhere until you actually get the number of bytes you wanted. Your first set of programs was lucky to work in this regard, and probably only because you tested on a limited amount of data. Once you start to push more data through, your assumptions about what you are reading (and comparing) will fail.
(2) There is no need to keep allocating char buffers in the loops; allocate them before the loop, or just use a local buffer rather than the heap. What you are doing is inefficient, and in the second program you have memory leaks because you don't delete them.
(3) You can get rid of the strcpy/strncpy calls and just use memmove().
Your specific problem is not jumping out at me, but maybe this will push you in the right direction. More information about what is transmitted properly, and exactly where in the data you see problems, would be helpful.
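To illustrate point (1), here is a minimal sketch of send/recv loops that keep going until the full count has been transferred (plain POSIX sockets as in the question; errors reduced to returning false):

#include <sys/socket.h>
#include <sys/types.h>

// Keep calling send() until all 'size' bytes have gone out.
bool sendAll(int sock, const char *buffer, size_t size) {
    size_t sent = 0;
    while (sent < size) {
        ssize_t n = send(sock, buffer + sent, size - sent, 0);
        if (n <= 0) return false; // error (or peer gone)
        sent += n;
    }
    return true;
}

// Keep calling recv() until exactly 'size' bytes have arrived.
bool recvAll(int sock, char *buffer, size_t size) {
    size_t got = 0;
    while (got < size) {
        ssize_t n = recv(sock, buffer + got, size - got, 0);
        if (n <= 0) return false; // error or connection closed
        got += n;
    }
    return true;
}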
But this one puts some '^#' characters into the file instead of the real data. Displaying "data" on the console looks the same on server and client. If you know how I can split it up correctly, I would be very grateful.
You say that the data (I presume the complete file, rather than the '^#') is the same on both client and server? If this is the case, then your issue likely lies in writing the data to the file, rather than in the actual transmission of the data itself.
If so, you'll probably want to check your assumptions about how the program writes to the file; for example, are you passing in text data to be written, or binary data? If you're writing binary data but treating it as a NUL-terminated string, chances are it will quit early, treating valid binary data as a NUL terminator.
If it's text mode, you might want to consider initialising all strings with memset to a default character (other than NUL) to see whether it's garbage data being output.
If both server and client display the '^#' (or whatever data), binary char data would be incompatible with the strcpy/strcat functions, as these rely on NUL termination (whereas binary data is delimited by size instead).
I can't track down the specific problem, but maybe this offers an insight or two that helps.