Continuously reading a file in C++

I have a producer/consumer set-up: our client gives our server data by constantly writing to a file, and our server processes that data. The server uses inotify to look for any file modifications and processes the new data.
Problem: The file reader in the server has a buffer of size 4096. I have a unit test that simulates the above situation. The test constantly writes to an open file, which the file reader constantly tries to read and process. But I noticed that after the first record is read, which is much smaller than 4096, an error flag is set in the ifstream object. This means that any new data arriving is not being processed. A simple workaround seems to be to call ifstream::clear after every read, and this does solve the issue. But what is going on? Is this the right solution?

First off, depending on your system it may or may not be possible to read a file another process is writing to: on Windows, the normal settings when opening a file make the access exclusive. I don't know enough about Windows to tell whether there are other settings. On POSIX systems, a file with suitable permissions can be opened for reading and writing by different processes. From the sound of it you are using Linux, i.e., something following the POSIX specification.
The approach of polling a file upon change isn't entirely ideal, though: as you noticed, you get an "error" every time you reach the end of the current file. Strictly speaking, reaching the end of a file isn't really an error, but trying to decode something beyond the end of the file is. Either way, reading beyond the end of the file sets std::ios_base::eofbit and, thus, the stream won't be good() until you clear() it, which is exactly the workaround you found. If you insist on using this approach, there isn't much choice other than reading up to the end of the file and dealing with the incomplete read somehow; a sketch of that pattern follows.
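For illustration, here is a minimal sketch of that pattern (the file name and the per-record handling are made up; whether a stream notices newly appended data after clear() without a seek can vary between library implementations, so the position is restored explicitly each pass):

#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h> // sleep()

int main() {
    std::ifstream in("data.log", std::ios::binary); // hypothetical file name
    std::string line;
    std::streampos last_complete = 0; // start of the first unprocessed record

    while (true) {
        in.clear();              // drop eofbit/failbit from the previous pass
        in.seekg(last_complete); // resume at the first unprocessed record
        while (std::getline(in, line)) {
            if (in.eof())   // no trailing '\n' yet: the writer is mid-record,
                break;      // so re-read this record on the next pass
            std::cout << "record: " << line << '\n'; // process a complete record
            last_complete = in.tellg();
        }
        sleep(1); // stand-in for the inotify notification in the real server
    }
}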
If you have control over creating the file, however, you can use a simple trick: instead of making it a normal file, use mkfifo to create a named pipe under the file name the writing program will write to. When opening a file on a POSIX system, it doesn't create a new file if there is already one, but uses the existing file; well, a file or whatever else is addressed by the file name (in addition to regular files and named pipes, you may see directories, character or block special devices, and possibly others).
Named pipes are curious beasts intended to let two processes communicate with each other: what is written to one end by one process is readable at the other end by another process! The named pipe itself doesn't have any content, i.e., if you need both the content of the file and the communication with another process, you may need to replicate the content somewhere. Opening a named pipe for reading blocks until there is a writer, and a read blocks whenever it has consumed everything written so far. Similarly, writes to the named pipe block until there is a reader. Once two processes are communicating, the respective other end sees an error (or, for the reader, end-of-file) when using the named pipe after its peer has exited.
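A minimal sketch of the named-pipe variant (POSIX only; the path and permissions are made up): create the FIFO before the writer starts, then read from it like an ordinary file.

#include <sys/stat.h> // mkfifo()
#include <cstdio>     // perror()
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Create the named pipe up front; a writer that would otherwise create a
    // regular file under this name now writes into the pipe instead.
    if (mkfifo("data.fifo", 0644) != 0)
        perror("mkfifo"); // EEXIST is harmless for this sketch

    std::ifstream in("data.fifo"); // open blocks until a writer opens the pipe
    std::string line;
    while (std::getline(in, line)) // blocks when no data is available yet
        std::cout << "record: " << line << '\n';
    // getline fails (end-of-file) once the last writer closes its end
    return 0;
}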

If you are fine with opening and closing the file again and again, the right solution to this problem is to store the last read position and resume from there once the file is updated.
The exact algorithm is:
1. Set start_pos = 0, end_pos = 0.
2. Update end_pos = infile.tellg() (after seeking to the end of the file).
3. Move the get pointer to start_pos (using seekg()) and read a block of (end_pos - start_pos) bytes.
4. Update start_pos = end_pos, then close the file.
5. Sleep for some time and open the file again.
6. If the file stream is still not good, close the file and jump to step 5.
7. If the file stream is good, jump to step 2.
The C++ reference for seekg() is at http://www.cplusplus.com/reference/istream/istream/seekg/ and you can adapt the sample code given there almost verbatim. The full code:
#include <iostream>
#include <fstream>
#include <unistd.h> // sleep()

int main(int argc, char *argv[])
{
    if (argc != 2) {
        std::cout << "Please pass filename with full path\n";
        return -1;
    }

    std::streampos start_pos = 0, end_pos = 0;
    long length;
    char *buffer;
    char *filePath = argv[1];
    std::ifstream is(filePath, std::ifstream::binary);

    while (1) {
        if (is) {
            is.seekg(0, is.end);
            end_pos = is.tellg();        // always update end pointer to end of the file
            is.seekg(start_pos, is.beg); // move read pointer to the new start position

            // allocate memory for the new block:
            length = end_pos - start_pos;
            if (length > 0) {
                buffer = new char[length];
                // read the new data as a single block:
                is.read(buffer, length);
                // print content:
                std::cout.write(buffer, length);
                delete[] buffer;
                start_pos = end_pos; // update start pointer
            }
            is.close();
        }
        // wait and restart with new data
        sleep(1);
        is.open(filePath, std::ifstream::binary); // since C++11, open() clears the stream state on success
    }
    return 0;
}

Related

Reopening a closed file stream

Consider the following code,
auto fin = ifstream("address", ios::binary);
if (fin.is_open())
    fin.close();
for (auto i = 0; i < N; ++i) {
    fin.open();
    // ....
    // read (next) b bytes...
    // ....
    fin.close();
    // Some delay
}
The code above can't be implemented in the C++ I know, but I'd like to know if it is possible to do so?
Here are my requirements:
When reopening the file, there would be no need to pass the parameters (path and mode) again.
When reopening the stream, it continues from the point in the stream that it was when got closed.
Clarification
The files I work with are big, and at some point in time other threads from third-party libraries may decide to (re)move them. An open stream would prevent such actions.
Continuously reading a big file will slow down the system.
The need
Indeed, a file can't be deleted by another process as long as a stream keeps it open.
I suppose you have already asked yourself these questions, but for the record I have to suggest thinking about them:
Can't the file be read into (virtual) memory and discarded when no longer needed ?
Can't the file processing be pipelined asynchronously, to read it at once and process it without unnecessary delays ?
What to do if the file can no longer be opened because it was deleted by the other process ? What to do if the location can't be found, because the file was modified (e.g. shortened) ?
Even if you had the perfect solution to your issue, what would be the effect if the other process tried to delete the file while it is open (only for a short time, but nevertheless open and blocking the deletion)?
The solution
Unfortunately, you can't achieve the desired behavior with standard streams. You could emulate it by keeping track of the filename and of the position (and more generally of the state):
auto mypos = ifs.tellg(); // save position
// Should flags be saved as well? And what about gcount?
ifs.close();
...
if (!ifs.is_open()) {
    ifs.open(myfilename, myflags); // open again!
    if (!ifs) {
        // ouch! file disappeared ==> process error
    }
    ifs.seekg(mypos); // restore position
    if (!ifs) {
        // ouch! position no longer reachable ==> process error
    }
}
Of course, you wouldn't want to repeat this code over and over. And it would not be so nice suddenly having a lot of global variables to keep track of the stream's state. But you could very easily encapsulate it in a wrapper class that takes care of saving and restoring the stream's state using the existing standard operations.
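As an illustration, a bare-bones sketch of such a wrapper (the class name and interface are invented for this example; real code would also want to carry over the exception mask and report open failures):

#include <fstream>
#include <string>

// Minimal sketch: an ifstream that can be closed and later reopened at the
// same position. Only the read position is preserved here; flags, gcount,
// etc. are left as an exercise.
class ReopenableIfstream {
public:
    ReopenableIfstream(std::string path, std::ios::openmode mode)
        : path_(std::move(path)), mode_(mode), pos_(0), stream_(path_, mode_) {}

    std::ifstream& get() { return stream_; }

    void suspend() {              // close, remembering where we were
        pos_ = stream_.tellg();   // (clear() first if eofbit may be set,
        stream_.close();          //  or tellg() returns -1)
    }

    bool resume() {               // reopen and seek back; false on failure
        stream_.open(path_, mode_);
        if (!stream_) return false;        // file disappeared?
        stream_.seekg(pos_);
        return static_cast<bool>(stream_); // position still reachable?
    }

private:
    std::string path_;
    std::ios::openmode mode_;
    std::streampos pos_;
    std::ifstream stream_;
};

Between suspend() and resume() no handle is held, so another process is free to move or delete the file in the meantime, which is exactly the behavior the question asks for.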

Reading a potentially incomplete file in C++

I am writing a program to reformat a DNS log file for insertion to a database. There is a possibility that the line currently being written to in the log file is incomplete. If it is, I would like to discard it.
I started off believing that the eof function might be a good fit for my application; however, I noticed a lot of programmers dissuading its use. I have also noticed that the feof function seems quite similar.
Any suggestions/explanations that you guys could provide about the side effects of these functions would be most appreciated, as would any suggestions for more appropriate methods!
Edit: I am currently using the istream::peek function to skip over the last line, regardless of whether it is complete. While acceptable, a solution that determines whether the last line is complete would be preferred.
The specific comparison I'm using is: logFile.peek() != EOF
I would consider using
int fseek ( FILE * stream, long int offset, int origin );
with SEEK_END
and then
long int ftell ( FILE * stream );
to determine the number of bytes in the file, and therefore where it ends. I have found this to be more reliable in detecting the end of the file (in bytes).
Could you detect an end-of-record (EOR) marker, CRLF perhaps, in the last two or three bytes of the file? (Three bytes might be needed for CRLF^Z, depending on the file type.) This would verify whether you have a complete last row:
fseek (stream, -2, SEEK_END);
char tail[2];
fread (tail, 1, 2, stream); /* then check tail for "\r\n" */
If you try to open the file with exclusive locks, you can detect (by the failure to open) that the file is in use, and try again in a second...(or whenever)
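Putting the fseek()/ftell() pieces together, a small self-contained sketch (the file name is made up; checking only for '\n' covers both LF and CRLF line endings):

#include <cstdio>

// Returns 1 if the file's last byte is '\n' (last record complete),
// 0 if it is not, and -1 on error or for an empty file.
int lastLineComplete(const char *path)
{
    FILE *stream = std::fopen(path, "rb");
    if (!stream)
        return -1;
    if (std::fseek(stream, -1, SEEK_END) != 0) { // empty file or seek error
        std::fclose(stream);
        return -1;
    }
    int c = std::fgetc(stream);
    std::fclose(stream);
    return c == '\n' ? 1 : 0;
}

int main()
{
    std::printf("complete: %d\n", lastLineComplete("dns.log"));
    return 0;
}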
If you need to capture the file contents as the file is being written, it's much easier if you eliminate as many layers of indirection and buffering between your logic and the actual bytes of data in the file.
Do not use C++ IO streams of any type - you have no real control over them. Don't use FILE *-based functions such as fopen() and fread() - those are buffered, and even if you disable buffering there are layers of code between your code and the data that once again you can't control and don't know what's happening.
In a POSIX environment, you can use low-level C-style open() and read()/pread() calls. And use fstat() to know when the file contents have changed - you'll see the st_size member of the struct stat argument change after a call to fstat().
You'd open the file like this:
int logFileFD = open( "/some/file/name.log", O_RDONLY );
Inside a loop, you could do something like this (error checking and actual data processing omitted):
size_t lastSize = 0;
while ( !done )
{
    struct stat statBuf;
    fstat( logFileFD, &statBuf );
    if ( statBuf.st_size == lastSize )
    {
        sleep( 1 ); // or however long you want
        continue;   // go to next loop iteration
    }
    // process new data - might need to keep some of the old data
    // around to handle lines that cross boundaries
    processNewContents( logFileFD, lastSize, statBuf.st_size );
    lastSize = statBuf.st_size; // remember how far we've read
}
processNewContents() could look something like this:
void processNewContents( int fd, size_t start, size_t end )
{
    static char oldData[ BUFSIZE ];
    static char newData[ BUFSIZE ];
    // assumes the amount of new data will fit in newData...
    ssize_t bytesRead = pread( fd, newData, end - start, start );
    // process the data that was read here
    return;
}
You may also find that you need to add some code to close() then re-open() the file in case your application doesn't seem to be "seeing" data written to the file. I've seen that happen on some systems - the application somehow sees a cached copy of the file size somewhere while an ls run in another context gets the more accurate, updated size. If, for example, you know your log file is written to every 10-15 seconds, if you go 30 seconds without seeing any change to the file you know to try reopening the file.
You can also track the inode number in the struct stat results to catch log file rotation.
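A sketch of that rotation check (POSIX calls only; the function name is my own): stat() the path by name and compare its inode against the one you have open.

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// If the file at 'path' now has a different inode than the one we track,
// the log was rotated: close the old descriptor and open the new file.
int reopenIfRotated(int fd, const char *path, ino_t *lastInode)
{
    struct stat sb;
    if (stat(path, &sb) == 0 && sb.st_ino != *lastInode) {
        close(fd);                 // descriptor still points at the old file
        fd = open(path, O_RDONLY); // the freshly created log
        if (fd >= 0)
            *lastInode = sb.st_ino;
    }
    return fd;
}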
In a non-POSIX environment, you can replace the open(), fstat() and pread() calls with the low-level OS equivalents; Windows provides most of what you'd need, though there you'd use lseek() followed by read() in place of pread().

Protocol Buffers; saving data to disk & loading back issue

I have an issue with storing Protobuf data to disk.
The application uses Protocol Buffers to transfer data over a socket (which works fine), but when I try to store the data to disk it fails.
Actually, saving the data reports no issues, but I cannot seem to load it again properly.
Any tips would be gladly appreciated.
void writeToDisk(DataList & dList)
{
    // open streams
    int fd = open("serializedMessage.pb", O_WRONLY | O_CREAT);
    google::protobuf::io::ZeroCopyOutputStream* fileOutput =
        new google::protobuf::io::FileOutputStream(fd);
    google::protobuf::io::CodedOutputStream* codedOutput =
        new google::protobuf::io::CodedOutputStream(fileOutput);

    // save data
    codedOutput->WriteLittleEndian32(PROTOBUF_MESSAGE_ID_NUMBER); // store with message id
    codedOutput->WriteLittleEndian32(dList.ByteSize()); // the size of the data I will serialize
    dList.SerializeToCodedStream(codedOutput); // serialize the data

    // close streams
    delete codedOutput;
    delete fileOutput;
    close(fd);
}
I've verified the data inside this function; dList contains the data I expect. The streams report that no errors occur, and that a reasonable number of bytes was written to disk (the file is of reasonable size, too).
But when I try to read the data back, it does not work. Moreover, what is really strange is that if I append more data to this file, I can read the first messages (but not the one at the end).
void readDataFromFile()
{
    // open streams
    int fd = open("serializedMessage.pb", O_RDONLY);
    google::protobuf::io::ZeroCopyInputStream* fileinput =
        new google::protobuf::io::FileInputStream(fd);
    google::protobuf::io::CodedInputStream* codedinput =
        new google::protobuf::io::CodedInputStream(fileinput);

    // read back
    uint32_t sizeToRead = 0, magicNumber = 0;
    string parsedStr = "";
    codedinput->ReadLittleEndian32(&magicNumber); // the message id-number I expect
    codedinput->ReadLittleEndian32(&sizeToRead);  // the reported data size, also what I expect
    codedinput->ReadString(&parsedStr, sizeToRead); // the size() of 'parsedStr' is much less than it should be (sizeToRead)

    DataList dl = DataList();
    if (dl.ParseFromString(parsedStr)) // fails
    {
        // work with data if all okay
    }

    // close streams
    delete codedinput;
    delete fileinput;
    close(fd);
}
Obviously I have omitted some of the code here to simplify everything.
As a side note, I have also tried to serialize the message to a string and save that string via CodedOutputStream. This does not work either. I have verified the contents of that string, though, so I guess the culprit must be the stream functions.
This is a Windows environment, C++ with Protocol Buffers and Qt.
Thank you for your time!
I solved this issue by switching from file descriptors to fstream, and from FileOutputStream to OstreamOutputStream.
Although I've seen examples using the former, it didn't work for me.
I found a nice code example hidden in the google coded_stream header. link #1
Also, since I needed to serialize multiple messages to the same file using protocol buffers, this link was enlightening. link #2
For some reason, the output file is not 'complete' until I actually destruct the stream objects.
The read failure was because the file was not opened for reading with O_BINARY - change file opening to this and it works:
int fd = open("serializedMessage.pb", O_RDONLY | O_BINARY);
The root cause is the same as here: "read() only reads a few bytes from file". You were very likely following an example in the protobuf documentation which opens the file in the same way, but it stops parsing on Windows when it hits a special character in the file.
Also, in more recent versions of the library, you can use protobuf::util::ParseDelimitedFromCodedStream to simplify reading size+payload pairs.
... the question may be ancient, but the issue still exists and this answer is almost certainly the fix to the original problem.
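For reference, a sketch of those delimited helpers (this assumes a protobuf version recent enough to ship delimited_message_util.h, and reuses the question's DataList type; note the framing is a varint length prefix, not the fixed-width prefix from the question's code):

#include <fstream>
#include <vector>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/util/delimited_message_util.h>

void writeDelimited(const DataList& dList)
{
    // An ofstream opened in binary mode sidesteps the O_BINARY pitfall entirely.
    std::ofstream out("serializedMessage.pb", std::ios::binary | std::ios::app);
    google::protobuf::util::SerializeDelimitedToOstream(dList, &out);
}

bool readAllDelimited(std::vector<DataList>& messages)
{
    std::ifstream in("serializedMessage.pb", std::ios::binary);
    google::protobuf::io::IstreamInputStream zin(&in);
    bool clean_eof = false;
    DataList msg;
    // Each call consumes one length-prefixed message from the stream.
    while (google::protobuf::util::ParseDelimitedFromZeroCopyStream(
               &msg, &zin, &clean_eof))
        messages.push_back(msg);
    return clean_eof; // true if parsing stopped exactly at end of file
}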
Try using
codedinput->ReadRaw instead of ReadString
and
dl.ParseFromArray instead of ParseFromString.
I'm not very familiar with protocol buffers, but ReadString might only read a field of type string.

Does constructing an iostream (c++) read data from the hard drive into memory?

When I construct an iostream, say by opening a file, will it always read the entire file from the hard disk into memory, or is the data streamed in and buffered by the OS on demand?
I ask because one way to check whether a file exists is to see if opening it fails, but I fear that if the files I am opening are very large, this could take a long time if iostream must read the entire file on open.
If you want to use Boost, checking whether a file exists can be done like this:
#include <boost/filesystem.hpp>
bool fileExists = boost::filesystem::exists("foo.txt");
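If your compiler is recent enough, the same check exists in the standard library (C++17's <filesystem>; no Boost needed):

#include <filesystem>
bool fileExists = std::filesystem::exists("foo.txt");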
No, it will not read the entire file into memory when you open it. It will read your file in chunks though, but I believe this process will not start until you read the first byte. Also these chunks are relatively small (on the order of 4-128 kibibytes in size), and the fact it does this will speed things up greatly if you are reading the file sequentially.
In a test on my Linux box (well, Linux VM) simply opening the file only results in the OS open system call, but no read system call. It doesn't start reading anything from the file until the first attempt to read from the stream. And then it reads 8191 (why 8191? that seems a very strange number) byte chunks as I read the file in.
Opening a file is a bad way of testing if the file exists - all it does is tell you if you can open it. Opening might fail for a number of reasons, typically because you don't have read permission, but the file will still exist. It is usually better to use an operating system specific function to test for existence. And no, opening an fstream will not cause the contents to be read.
What I think is: when you open a file, the corresponding data structures for the process opening the file are populated, including the file pointer, file descriptor, v-node, etc.
One can then read and write the file using buffered streams (fwrite, fread) or using system calls (read and write).
When we use buffered streams, the data is buffered before being written or read [this is done for efficiency purposes]. This itself means the whole file is not read into memory; rather, some bytes are read into a buffer and then made available.
With system calls such as read and write, kernel-level buffering is done (using fsync one can flush the kernel buffer too), but the data is actually read from and written to the device/file.
Checking existence of a file:

#include <sys/stat.h>
#include <iostream>
#include <string>

int main() {
    struct stat file_i;
    std::string f("myfile.txt");
    if (stat(f.c_str(), &file_i) != 0) {
        std::cout << "File not found" << std::endl;
    }
    return 0;
}
Hope this clarifies a bit.

how to create files named with current time?

I want to create a series of files under the "log" directory, each named based on execution time. In each of these files I want to store some log info for my program, like the prototype of the function that acted, etc.
Usually I take the hard-coded route of fopen("log/***", "a"), which doesn't work for this purpose. So far I have just written a timestamp function:
char* timeStamp(char* txt){
    char* rc = NULL; // returned unchanged if time() failed
    char timestamp[16];
    time_t rawtime = time(0);
    tm *now = localtime(&rawtime);
    if (rawtime != -1) {
        strftime(timestamp, 16, "%y%m%d_%H%M%S", now);
        rc = strcat(txt, timestamp);
    }
    return rc;
}
But I don't know what to do next. Please help me with this!
Declare a char array big enough to hold 16 + "log/" (so 20 characters total), initialize it to "log/", then use strcat() or something related to append the time string returned by your function to the end of the array. And there you go!
Note how the string addition works: Your char array is 16 characters, which means you can put in 15 characters plus a nul byte. It's important not to forget that. If you need a 16 character string, you need to declare it as char timestamp[17] instead. Note that "log/" is a 4 character string, so it takes up 5 characters (one for the nul byte at the end), but strcat() will overwrite starting at the nul byte at the end, so you'll end up with the right number. Don't count the nul terminator twice, but more importantly, don't forget about it. Debugging that is a much bigger problem.
EDIT: While we're at it, I misread your code. I thought it just returned a string with the time, but it appears that it adds the time to a string passed in. This is probably better than what I thought you were doing. However, if you wanted, you could just make the function do all the work - it puts "log/" in the string before it puts the timestamp. It's not that hard.
What about this:
#include <stdio.h>
#include <time.h>

#define LOGNAME_FORMAT "log/%Y%m%d_%H%M%S"
#define LOGNAME_SIZE 20

FILE *logfile(void)
{
    static char name[LOGNAME_SIZE];
    time_t now = time(0);
    strftime(name, sizeof(name), LOGNAME_FORMAT, localtime(&now));
    return fopen(name, "ab");
}
You'd use it like this:
FILE *file = logfile();
// do logging
fclose(file);
Keep in mind that localtime() is not thread-safe!
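On POSIX systems you can sidestep that with the re-entrant variant; a sketch (localtime_r is POSIX-specific, and Windows has localtime_s instead):

#include <stdio.h>
#include <time.h>

#define LOGNAME_FORMAT "log/%Y%m%d_%H%M%S"
#define LOGNAME_SIZE 20

// Same idea as logfile() above, but thread-safe: localtime_r fills a
// caller-owned buffer, and 'name' is no longer static shared state.
FILE *logfile_r(void)
{
    char name[LOGNAME_SIZE];
    time_t now = time(0);
    struct tm tmbuf;
    localtime_r(&now, &tmbuf);
    strftime(name, sizeof(name), LOGNAME_FORMAT, &tmbuf);
    return fopen(name, "ab");
}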
Steps to create (or write to) a sequential access file in C++:
1. Declare a stream variable name:
ofstream fout; // each file has its own stream buffer
ofstream is short for output file stream; fout is the stream variable name (and may be any legal C++ variable name). Naming the stream variable "fout" is helpful in remembering that the information is going "out" to the file.
2. Open the file:
fout.open("scores.dat", ios::out);
fout is the stream variable name previously declared; "scores.dat" is the name of the file; ios::out is the stream operation mode (your compiler may not require that you specify the stream operation mode).
3. Write data to the file:
fout << grade << endl;
fout << "Mr";
The data must be separated with space characters or end-of-line characters (carriage returns), or the data will run together in the file and be unreadable. Try to save the data to the file in the same manner that you would display it on the screen.
If the iomanip.h header file is used, you will be able to use familiar formatting commands with file output:
fout << setprecision(2);
fout << setw(10) << 3.14159;
4. Close the file:
fout.close();
Closing the file writes any data remaining in the buffer to the file, releases the file from the program, and updates the file directory to reflect the file's new size. As soon as your program is finished accessing the file, the file should be closed. Most systems close any open data files when a program terminates, but should data remain in the buffer when the program terminates, you may lose it. Don't take the chance --- close the file!
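Put together, the four steps look like this (a minimal sketch; the file name and values are the ones from the steps above):

#include <fstream>
#include <iomanip>

int main()
{
    std::ofstream fout;                     // step 1: declare the stream
    fout.open("scores.dat", std::ios::out); // step 2: open the file
    if (!fout)
        return 1;                           // opening can fail (path, permissions)

    fout << std::setprecision(2);           // step 3: write formatted data
    fout << std::setw(10) << 3.14159 << std::endl;
    fout << "Mr" << std::endl;

    fout.close();                           // step 4: flush and close
    return 0;
}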
Sounds like you have mostly solved it already; to create a file like you describe:

char filename[256] = "log/";
timeStamp( filename );
FILE *f = fopen( filename, "a" );

Or do you wish to do something more?