C++ read() problems

I'm having trouble reading a large file into my own buffer in C++ in Visual Studio 2010. Below is a snippet of my code where length is the size of the file I'm reading in, bytesRead is set to 0 before this is run, and file is a std::ifstream.
buffer = new char[length];
while (bytesRead < length) {
    file.read(buffer + bytesRead, length - bytesRead);
    bytesRead += file.gcount();
}
file.close();
I noticed that gcount() returns 0 at the second read and onwards, meaning that read() did not give me any new characters so this is an infinite loop. I would like to continue to read the rest of the file. I know that the eofbit is set after the first read even though there is more data in the file.
I do not know what I can do to read more. Please help.

Make sure you open your file in binary mode (std::ios::binary) to avoid any newline conversions. Using text mode could invalidate your assumption that the length of the file is the number of bytes that you can read from the file.
In any case, it is good practice to examine the state of the stream after the read and stop if there has been an error (rather than continuing indefinitely).

It sounds like your stream is in a failed state, at which point most operations will just immediately fail. You'll need to clear the stream to continue reading.

Related

How to correctly buffer lines using fread

I want to use fread() for I/O for some reason (speed and ...). I have a file with lines of different sizes. Using code like:
while (!feof(fp)) {
    fread(buffer, 500MB, 1, fp);
    // process buffer
}
the last line may be read incompletely, and we have to read it again on the next pass. How can I force fread() to continue from the beginning of the last line?
Or, if possible, how can I force fread() to read more than 500MB until reaching a \n or another specific character?
Thanks All
Ameer.
Assuming a buffer of bytes in which you have searched backwards and found a \n character at position pos, you want to roll back by the length of the buffer minus pos. Call this step.
You can use fseek to move the file pointer back by this much:
int fseek( FILE *stream, long offset, int origin );
In your case:
int ret = fseek(stream, -step, SEEK_CUR);
This will involve re-reading part of the file, and a fair bit of jumping around - the comments have suggested alternative ways that may be quicker.

Reading a potentially incomplete file in C++

I am writing a program to reformat a DNS log file for insertion to a database. There is a possibility that the line currently being written to in the log file is incomplete. If it is, I would like to discard it.
I started off believing that the eof function might be a good fit for my application, however I noticed a lot of programmers dissuading the use of the eof function. I have also noticed that the feof function seems to be quite similar.
Any suggestions/explanations that you guys could provide about the side effects of these functions would be most appreciated, as would any suggestions for more appropriate methods!
Edit: I currently am using the istream::peek function in order to skip over the last line, regardless of whether it is complete or not. While acceptable, a solution that determines whether the last line is complete would be preferred.
The specific comparison I'm using is: logFile.peek() != EOF
I would consider using
int fseek ( FILE * stream, long int offset, int origin );
with SEEK_END
and then
long int ftell ( FILE * stream );
to determine the number of bytes in the file, and therefore - where it ends. I have found this to be more reliable in detecting the end of the file (in bytes).
Could you detect an EOR (end of record/line) marker (CRLF perhaps) in the last two or three bytes of the file? (3 bytes might be used for CRLF^Z...depends on the file type.) This would verify that the last row is complete:
fseek(stream, -2, SEEK_END);   /* position two bytes before the end */
fread(marker, 1, 2, stream);   /* read the last two bytes into marker */
If you try to open the file with exclusive locks, you can detect (by the failure to open) that the file is in use, and try again in a second...(or whenever)
If you need to capture the file contents as the file is being written, it's much easier if you eliminate as many layers of indirection and buffering between your logic and the actual bytes of data in the file.
Do not use C++ IO streams of any type - you have no real control over them. Don't use FILE *-based functions such as fopen() and fread() - those are buffered, and even if you disable buffering there are layers of code between your code and the data that once again you can't control and don't know what's happening.
In a POSIX environment, you can use low-level C-style open() and read()/pread() calls. And use fstat() to know when the file contents have changed - you'll see the st_size member of the struct stat argument change after a call to fstat().
You'd open the file like this:
int logFileFD = open( "/some/file/name.log", O_RDONLY );
Inside a loop, you could do something like this (error checking and actual data processing omitted):
size_t lastSize = 0;
while ( !done )
{
    struct stat statBuf;
    fstat( logFileFD, &statBuf );
    if ( statBuf.st_size == lastSize )
    {
        sleep( 1 ); // or however long you want
        continue;   // go to next loop iteration
    }
    // process new data - might need to keep some of the old data
    // around to handle lines that cross boundaries
    processNewContents( logFileFD, lastSize, statBuf.st_size );
    lastSize = statBuf.st_size; // remember how much we've processed
}
processNewContents() could look something like this:
void processNewContents( int fd, size_t start, size_t end )
{
    static char oldData[ BUFSIZE ];
    static char newData[ BUFSIZE ];
    // assumes the amount of new data will fit in newData...
    // pread( fd, buf, count, offset ): read (end - start) bytes starting at offset start
    ssize_t bytesRead = pread( fd, newData, end - start, start );
    // process the data that was read here
    return;
}
You may also find that you need to add some code to close() then re-open() the file in case your application doesn't seem to be "seeing" data written to the file. I've seen that happen on some systems - the application somehow sees a cached copy of the file size somewhere while an ls run in another context gets the more accurate, updated size. If, for example, you know your log file is written to every 10-15 seconds, if you go 30 seconds without seeing any change to the file you know to try reopening the file.
You can also track the inode number in the struct stat results to catch log file rotation.
In a non-POSIX environment, you can replace the open(), fstat() and pread() calls with the low-level OS equivalents; Windows provides most of what you'd need, where _lseek() followed by _read() would replace pread().

C++ continuous read file

I have a producer/consumer set-up: our client gives us data that our server processes, and it delivers that data by constantly writing to a file. Our server uses inotify to look for any file modifications, and processes the new data.
Problem: The file reader in the server has a buffer of size 4096. I have a unit test that simulates the above situation. The test constantly writes to an open file, which the file reader constantly tries to read and process. But I noticed that after the first record is read, which is much smaller than 4096, an error flag is set in the ifstream object. This means that any new data arriving is not being processed. A simple workaround seems to be to call ifstream::clear after every read, and this does solve the issue. But what is going on? Is this the right solution?
First off, depending on your system it may or may not be possible to read a file another process is writing to: on Windows, the normal settings when opening a file make the access exclusive. I don't know enough about Windows to tell whether there are other settings. On POSIX systems a file with suitable permissions can be opened for reading and writing by different processes. From the sounds of it you are using Linux, i.e., something following the POSIX specification.
The approach of polling a file for changes isn't entirely ideal, though: as you noticed, you get an "error" every time you reach the end of the current file. Actually, reaching the end of a file isn't really an error, but trying to decode something beyond the end of the file is an error. Also, reading beyond the end of the file will still set std::ios_base::eofbit and, thus, the stream won't be good(). If you insist on using this approach there isn't much choice other than reading up to the end of the file and dealing with the incomplete read somehow.
If you have control over creating the file, however, you can use a simple trick: instead of having the file be a normal file, use mkfifo to create a named pipe under the file name the writing program will write to. When opening a file on a POSIX system, it doesn't create a new file if there is already one but uses the existing file - well, a file or whatever else is addressed by the file name (in addition to files and named pipes you may see directories, character or block special devices, and possibly others).
Named pipes are curious beasts intended to let two processes communicate with each other: what is written to one end by one process is readable at the other end by another process! The named pipe itself doesn't have any content, i.e., if you need both the content of the file and the communication with another process you might need to replicate the content somewhere. Opening a named pipe for reading will block whenever the reader has caught up with the current end of the data, i.e., initially the read would block until there is a writer. Similarly, writes to the named pipe will block until there is a reader. Once the two processes are communicating, the respective other end will receive an error when reading or writing the named pipe after the other process has exited.
If you are OK with opening and closing the file again and again, the right solution is to store the last read position and start from there once the file has been updated. The exact algorithm is:
1. Set start_pos = 0, end_pos = 0.
2. Seek to the end of the file and update end_pos = infile.tellg().
3. Move the get pointer to start_pos (use seekg()) and read a block of (end_pos - start_pos) bytes.
4. Update start_pos = end_pos and then close the file.
5. Sleep for some time and open the file again.
6. If the file stream is still not good, close the file and jump to step 5.
7. If the file stream is good, jump to step 2.
The istream reference is at http://www.cplusplus.com/reference/istream/istream/seekg/ and you can adapt the sample code given there.
The exact code would be:
#include <iostream>
#include <fstream>
#include <unistd.h> // for sleep()

int main(int argc, char *argv[]) {
    if (argc != 2)
    {
        std::cout << "Please pass a filename with its full path\n";
        return -1;
    }
    long end_pos = 0, start_pos = 0;
    long length;
    char *buffer;
    char *filePath = argv[1];
    std::ifstream is(filePath, std::ifstream::binary);
    while (1)
    {
        if (is) {
            is.seekg(0, is.end);
            end_pos = is.tellg();        // always update the end pointer to the end of the file
            is.seekg(start_pos, is.beg); // move the read pointer to the new start position
            // allocate memory:
            length = end_pos - start_pos;
            buffer = new char[length];
            // read the new data as a single block of (end_pos - start_pos) bytes:
            is.read(buffer, length);
            is.close();
            // print the content:
            std::cout.write(buffer, length);
            delete[] buffer;
            start_pos = end_pos; // update the start pointer
        }
        // wait and restart with new data
        sleep(1);
        is.clear(); // reset the stream state so the reopen can succeed (needed pre-C++11)
        is.open(filePath, std::ifstream::binary);
    }
    return 0;
}

EOF before EOF in Visual Studio

I had this snippet in a program (in Visual Studio 2005):
if (_eof(fp->_file))
{
    break;
}
It broke the enclosing loop when eof was reached. But the program was not able to parse the last few thousand chars in the file. So, in order to find out what was happening, I did this:
if (_eof(fp->_file))
{
    cout << ftell(fp) << endl;
    break;
}
Now the answer that I got from ftell was different from (and smaller than) the actual file size, which isn't expected. I thought that Windows might have some problem with the file, so I did this:
if (_eof(fp->_file))
{
    cout << ftell(fp) << endl;
    fseek(fp, 0, SEEK_END);
    cout << ftell(fp) << endl;
    break;
}
Well, the fseek() gave the right answer (equal to the file size), and the initial ftell() gave the smaller value, as described above.
Any idea about what could be wrong here?
EDIT: The file is open in "rb" mode.
You can't reliably use _eof() on a file descriptor obtained from a FILE*, because FILE* streams are buffered. It means that fp has sucked fp->_file dry and stores the remaining bytes in its internal buffer. Eventually fp->_file is at the eof position, while fp still has bytes for you to read. Use feof() after a read operation to determine whether you are at the end of the file, and be careful if you mix functions that operate on FILE* with those operating on integer file descriptors.
You should not be using _eof() directly on the descriptor if your file I/O operations are on the FILE stream that wraps it. There is buffering that takes place and the underlying descriptor will hit end-of-file before your application has read all the data from the FILE stream.
In this case, ftell(fp) is reporting the state of the stream and you should be using feof(fp) to keep them in the same I/O domain.

feof() returning true when EOF is not reached

I'm trying to read from a file at a specific offset (simplified version):
typedef unsigned char u8;
FILE *data_fp = fopen("C:\\some_file.dat", "r");
fseek(data_fp, 0x004d0a68, SEEK_SET); // move the file pointer to the offset
u8 *data = new u8[0x3F0];
fread(data, 0x3F0, 1, data_fp);
delete[] data;
fclose(data_fp);
The problem becomes, that data will not contain 1008 bytes, but 529 (seems random). When it reaches 529 bytes, calls to feof(data_fp) will start returning true.
I've also tried to read in smaller chunks (8 bytes at a time) but it just looks like it's hitting EOF when it's not there yet.
A simple look in a hex editor shows there are plenty of bytes left.
Opening a file in text mode, like you're doing, makes the library translate some of the file contents to other stuff, potentially triggering an unwarranted EOF or bad offset calculations.
Open the file in binary mode by passing the "b" option to the fopen call:
fopen(filename, "rb");
Is the file being written to in parallel by some other application? Perhaps there's a race condition, so that the file ends at wherever the read stops, when the read is running, but later when you inspect it the rest has been written. That would explain the randomness, too.
Maybe it's a difference between text and binary mode. If you're on Windows, newlines are CRLF, which is two characters in the file but converted to only one when read. Try using fopen(..., "rb").
I can't see your link from work, but if your computer claims no more bytes exist, I'd tend to believe it. Why don't you print the size of the file rather than doing things by hand in a hex editor?
Also, you'd be better off using level 2 I/O; the f-calls are ancient C ugliness, and you're using C++ since you have new.
int fh = open(filename, O_RDONLY);
struct stat s;
fstat(fh, &s);
cout << "size=" << hex << s.st_size << "\n";
Now do your seeking and reading using level 2 I/O calls, which are faster anyway, and let's see what the size of the file really is.