I am trying to understand how basic file I/O is handled in C++ or C. My aim is to read a file line by line and send the lines to a remote server. Once a line has been sent, I want to delete it from the file.
One approach I tried was to keep a count of the lines read and call system() to delete the first 'count' lines, using the shell command: sed -i -e 1,'count'd filename.
After that I continued reading the file and surprisingly it worked as planned.
I have two questions:
Is this way reliable?
And why does this work at all, given that I deleted part of the file while reading it? What if I did a seek to a previous position, what then?
Best,Digvijay
PS:
I would be glad if somebody could suggest a better way.
Also here is the code for the program I wrote:
#include<iostream>
#include<fstream>
#include<string>
#include<sstream>
#include<cstdlib>
int main(){
    std::ifstream f;
    std::string line;
    std::stringstream ss;
    int i=0;
    f.open("in.txt");
    if(f.is_open()){
        while(getline(f,line)){
            std::cout<<line<<std::endl;
            i++;
            if(i==2)break;
        }
        ss<<"sed -i -e 1,"<<i<<"d in.txt";
        system(ss.str().c_str());
        while(getline(f,line)){
            std::cout<<line<<std::endl;
        }
    }
    return 0;
}
Edit:
First, thanks for taking the time to write answers. Here is some extra information that I left out earlier. The files I am dealing with are log files, so they are constantly being appended to with information from devices. I want to avoid creating a copy because the log files themselves are very big (at times), and deleting the sent lines would also help keep the log file short, since the lines would be divided into parts and archived on the server.
Solution
I have found a way to deal with the problem. Apparently Thomas is right that sed creates a new file, so the old file remains as is. Using this, I can read n lines, call the system function, close the file pointer, and open the file again. I do this on small chunks of the log, repeatedly, until it becomes small and hence efficient to deal with. Meanwhile, the server archives the logs in 1 GB files.
However, I have a new question: due to memory constraints, I need to know whether it is possible to split a log file into two efficiently. (That would possibly be another question on SO.)
Most modern file systems don't support deleting lines at the beginning of the file, so doing so would be very inefficient.
The normal solution to your actual problem is to stop writing to your log file when it reaches some size, then start writing to a new file. The code that copies the files can delete a whole file once it has been written (this is an efficient operation).
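A minimal sketch of that rotation scheme, assuming the logging code can be changed. The class name RotatingLog, the numeric ".N" file suffix, and the byte limit are illustrative choices, not anything from the question:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>

// Hypothetical helper: appends lines to base.0, base.1, ... and starts a new
// file once the next line would push the current file past max_bytes.
class RotatingLog {
public:
    RotatingLog(std::string base, std::size_t max_bytes)
        : base_(std::move(base)), max_bytes_(max_bytes) {
        open_next();
    }

    void write_line(const std::string& line) {
        // Rotate before writing, but never leave a file empty.
        if (written_ > 0 && written_ + line.size() + 1 > max_bytes_) {
            open_next();
        }
        out_ << line << '\n';
        out_.flush();
        written_ += line.size() + 1;
    }

    // Files with a smaller index than this one are complete, so the uploader
    // can send them and then delete them whole.
    int current_index() const { return index_; }

private:
    void open_next() {
        out_.close();
        out_.clear();  // closing a stream that was never opened sets failbit
        ++index_;
        out_.open(base_ + "." + std::to_string(index_), std::ios::trunc);
        written_ = 0;
    }

    std::string base_;
    std::size_t max_bytes_;
    std::ofstream out_;
    std::size_t written_ = 0;
    int index_ = -1;
};
```

The uploader then only ever touches files whose index is below current_index(), so it never has to edit a file that is still being written.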
sed writes a new version of the file, while the program keeps reading the same version that it opened. This is the usual behavior of Unix and Linux when a program writes a file that another program has open.
You can see this for yourself with this small C program:
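#include <stdlib.h>
#include <stdio.h>
int main(void) {
    FILE *f = fopen("in.txt", "r");
    if (f == NULL) {
        /* Without this check, rewind() would be called on a null pointer. */
        perror("in.txt");
        return EXIT_FAILURE;
    }
    while (1) {
        rewind(f);
        int lines = 0;
        int c;
        while ((c = getc(f)) != EOF)
            if (c == '\n')
                ++lines;
        printf("Number of lines in file: %d\n", lines);
    }
    return 0;
}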
Run that program in one window, and then use sed in another window to edit the file. The number of lines printed by the program will stay the same, even if the file on disk has been edited, and this is because Unix keeps the old, open version, even if it is no longer accessible to other programs.
As to your first question, how reliable your solution is: as far as I can see it should be reliable, with the usual caveats about the system crashing or running out of memory in the middle of an update, someone else accessing the file, and of course all the problems with the system() call. It is not very efficient, though, and for large data sets you might want to do it differently.
sujin's comment about using a temporary file for the lines you want to keep seems reasonable. It would be both faster and safer. Keep the original file, so if the system crashes you'll still have your data, and wait until you have finished to rename the old file to "in.txt.bak", and then rename your temporary file to "in.txt".
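The temporary-file approach could look roughly like this. It is a sketch under assumptions: the function name drop_first_lines and the ".tmp" suffix are made up for illustration, and nothing else is assumed to be appending to the file while it runs (which is not true of a live log).

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Drops the first `count` lines of `path` by copying the remaining lines to
// path + ".tmp", keeping the original as path + ".bak", and renaming the
// temporary file into place. Returns false if any step fails.
bool drop_first_lines(const std::string& path, std::size_t count) {
    const std::string tmp = path + ".tmp";
    const std::string bak = path + ".bak";
    {
        std::ifstream in(path);
        std::ofstream out(tmp, std::ios::trunc);
        if (!in || !out) return false;
        std::string line;
        for (std::size_t i = 0; std::getline(in, line); ++i) {
            if (i >= count) out << line << '\n';
        }
        out.flush();
        if (!out) return false;
    }  // both streams are closed here, before any rename happens
    if (std::rename(path.c_str(), bak.c_str()) != 0) return false;
    if (std::rename(tmp.c_str(), path.c_str()) != 0) return false;
    return true;
}
```

On POSIX systems rename replaces an existing target; on some platforms (Windows, for example) it does not, so an old ".bak" file would have to be removed first.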
First off, avoid system() calls as much as you can (if possible, don't use them at all), as they create race conditions and other problems that can drastically, and often detrimentally, affect the outcome of your program. This is especially true if access to files is involved.
Given your problem, there are a number of ways to do this, each with its own caveats.
I'll cover three possible solutions:
1) If the file is small enough:
you can read in the entire thing in a data structure (vector, list, deque, etc.)
delete the original file
determine how many lines to read (and send off via server protocol)
then write the remaining lines as the name of the original file.
If you intend to parallelize your program later on, this may be a better solution, provided that the file is small. Note: small is a relative term, but is generally limited by how much memory you have available.
2) If the file is quite large or you're limited by memory constraints, you will have to get creative by using buffers. Once you've read a line and successfully sent it off via your program, you determine where the file pointer is and copy the remaining information until the end of the current file as a new file. Once done, close and delete the old file, then close and rename the new file the same name as the old file.
3) If your solution doesn't have to be in C++, you can use shell-scripting or (controversially) another language to get the job done.
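Option 1 above can be sketched as follows, assuming the file fits in memory. The function name send_and_trim and the send callback are placeholders for the real network code, and the file is rewritten in place by truncating it on open rather than by deleting and recreating it:

```cpp
#include <cstddef>
#include <deque>
#include <fstream>
#include <functional>
#include <string>

// Reads the whole file, hands up to max_lines lines to `send` (a stand-in for
// the real server protocol), and rewrites the file with whatever was not sent.
// Stops at the first line that `send` rejects. Returns the number of lines sent.
std::size_t send_and_trim(const std::string& path, std::size_t max_lines,
                          const std::function<bool(const std::string&)>& send) {
    std::deque<std::string> lines;
    {
        std::ifstream in(path);
        for (std::string line; std::getline(in, line);) lines.push_back(line);
    }
    std::size_t sent = 0;
    while (sent < max_lines && !lines.empty() && send(lines.front())) {
        lines.pop_front();
        ++sent;
    }
    // Opening with trunc discards the old contents before the rest is written.
    std::ofstream out(path, std::ios::trunc);
    for (const std::string& line : lines) out << line << '\n';
    return sent;
}
```

A crash between the truncation and the final write would lose the unsent lines, which is the caveat that makes option 2 (write a new file, then rename) safer for data that matters.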
1) No, it's not reliable.
2) The C++ runtime library reads your file in blocks (internally), which are then parceled out to your (higher level) input requests until the blocks are exhausted, forcing it to (internally) read more blocks from disk. Since one or more physical blocks are read in before you make any call to sed, they cannot be altered if sed happens to change that first part of the file.
To see your code fail, you would need to make the input file big enough that there are remaining blocks of the file that have not been read in (internally by the runtime library) before you call sed. By "fail" I mean your program would not see all the characters that were originally in the file before sed clobbered some lines.
As the others said, you have to create another file with the records you need after reading the original file, and then delete the original. For this application, though, a FIFO may be more useful than a file. If you are on a *NIX platform, look up the mkfifo command from the console.
It is like a file, with the peculiarity that a line is gone once it has been read.
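To illustrate the FIFO idea, here is a small POSIX-only sketch. The function name fifo_round_trip is hypothetical, and the forked child merely stands in for whatever device code produces the log lines; in a real setup the producer and the reader would be separate programs.

```cpp
#include <fstream>
#include <string>
#include <vector>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Creates a FIFO at `path`, forks a child that writes `lines` into it, and
// reads them back in the parent. Data read from a FIFO is consumed, so there
// is nothing to delete afterwards.
std::vector<std::string> fifo_round_trip(const std::string& path,
                                         const std::vector<std::string>& lines) {
    unlink(path.c_str());  // ignore failure: the path may simply not exist
    if (mkfifo(path.c_str(), 0600) != 0) return {};

    pid_t child = fork();
    if (child < 0) {
        unlink(path.c_str());
        return {};
    }
    if (child == 0) {
        std::ofstream out(path);  // blocks until the reader opens the FIFO
        for (const std::string& l : lines) out << l << '\n';
        out.close();
        _exit(0);
    }

    std::vector<std::string> received;
    {
        std::ifstream in(path);  // blocks until the writer opens the FIFO
        for (std::string line; std::getline(in, line);) received.push_back(line);
    }  // getline sees end of file once the writer has closed its end
    int status = 0;
    waitpid(child, &status, 0);
    unlink(path.c_str());
    return received;
}
```

Note that a FIFO holds only a small kernel buffer, so the writer blocks when the reader falls behind; it is not a replacement for durable storage.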
Related
What is the best way to cut the end off of a fstream file in C++ 11
I am writing a data persistence class to store audio for my audio editor. I have chosen to use fstream (possibly a bad idea) to create a random access binary read write file.
Each time I record a little sound into my file I simply tack it onto the end of this file. Another internal data structure / file, contains pointers into the audio file and keeps track of edits.
When I undo a recording action and then do something else, the last bit of the audio file becomes irrelevant. It is not referenced in the current state of the document, and you cannot redo yourself back to a state where you can ever see it again. So I want to chop this part of the file off and start recording at the new end. I don't need to cut out bits in the middle, just off the end.
When the user quits this file will remain and be reloaded when they open the project up again.
In my application I expect the user to do this all the time and being able to do this might save me as much as 30% of the file size. This file will be long, potentially very, very long, so rewriting it to another file every time this happens is not a viable option.
Rewriting it when the user saves could be an option but it is still not that attractive.
I could stick a value at the start that says how long the file is supposed to be and then overwrite the end to recycle the space. But if I wanted to continually update the data store file in case of a crash, this would mean rewriting the start over and over again. I worry that this might be bad for flash drives. I could also recompute the end of the useful part of the file on load by analyzing the pointer file, but in the meantime I would potentially be wasting all that space, and that is complicated.
Is there a simple call for this in the fstream API?
Am I using the wrong library? Note that I would prefer to stick to something generic, such as the STL, so I can keep the code as cross-platform as possible.
I can’t seem to find it in the documentation and have looked for many hours. It is not the end of the earth but would make this a little simpler and potentially more efficient. Maybe I am just missing it somehow.
Thanks for your help
Andre’
Is there a simple call for this in the fstream API?
If you have a C++17 compiler, use std::filesystem::resize_file. Earlier standards have no such facility in the standard library.
With older compilers: on Windows you can use SetFilePointer or SetFilePointerEx to set the current position to the size you want, then call SetEndOfFile. On Unixes you can use truncate or ftruncate. If you want portable code, you can use Boost.Filesystem. Migrating from it to std::filesystem later is simplest, because std::filesystem was mostly specified based on it.
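A short sketch of the C++17 route. The helper names append_chunk and chop_tail are made up for illustration; the only library call doing the real work is std::filesystem::resize_file, used here in its non-throwing overload:

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Appends `chunk` to the file and returns the size the file had before the
// append, which is the size to restore if the recording is undone.
std::uintmax_t append_chunk(const std::filesystem::path& path,
                            const std::string& chunk) {
    std::error_code ec;
    std::uintmax_t before = std::filesystem::file_size(path, ec);
    if (ec) before = 0;  // treat a missing file as empty
    std::ofstream out(path, std::ios::binary | std::ios::app);
    out.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
    return before;  // the stream is flushed and closed on return
}

// Cuts `path` down to `new_size` bytes. Returns false, without throwing, if
// the file system reports an error.
bool chop_tail(const std::filesystem::path& path, std::uintmax_t new_size) {
    std::error_code ec;
    std::filesystem::resize_file(path, new_size, ec);
    return !ec;
}
```

The file should not be open through a stream with unflushed data while it is resized, otherwise a later flush can write past the new end again.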
If you have a variable that contains your current position in the file, you could seek back by the length of your unneeded chunk and just continue writing from there.
// Somewhere at the beginning of your code
// (a stream object instead of new/delete, so it cannot leak):
std::ofstream file("/home/user/my-audio/my-file.dat", std::ios::binary);
// ...... long story of writing data .......
// Let's say we are at byte one million in the file now
long current_file_pos = 1000000;
// Your last chunk size:
int last_chunk_size = 12345;
// Your chunk, that you are saving
char *last_chunk = get_audio_chunk_to_save();
// Writing chunk
file.write(last_chunk, last_chunk_size);
// Moving pointer:
current_file_pos += last_chunk_size;
// Let's undo it now!
current_file_pos -= last_chunk_size;
file.seekp(current_file_pos);
// Now you can write new chunks from the place where you were before writing and undoing the last one!
// .....
// When you want to finally write the file to disk, you just close it
file.close();
// And then truncate it to the size of current_file_pos
truncate("/home/user/my-audio/my-file.dat", current_file_pos);
Unfortunately, you'll have to write a cross-platform truncate function that calls SetEndOfFile on Windows and truncate on Linux. That is easy enough with preprocessor macros.
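One possible shape for such a wrapper, as a sketch only. The names truncate_file and file_size_of are hypothetical, the Windows branch has not been exercised here, and only the POSIX branch is what the usage below relies on:

```cpp
#include <fstream>
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/types.h>
#include <unistd.h>
#endif

// Shrinks (or extends) the file at `path` to exactly `new_size` bytes.
bool truncate_file(const std::string& path, long long new_size) {
#ifdef _WIN32
    HANDLE h = CreateFileA(path.c_str(), GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;
    LARGE_INTEGER pos;
    pos.QuadPart = new_size;
    // Move the file pointer to the new end, then cut the file there.
    bool ok = SetFilePointerEx(h, pos, nullptr, FILE_BEGIN) && SetEndOfFile(h);
    CloseHandle(h);
    return ok;
#else
    return truncate(path.c_str(), static_cast<off_t>(new_size)) == 0;
#endif
}

// Returns the size of the file in bytes, or -1 if it cannot be opened.
long long file_size_of(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return in ? static_cast<long long>(in.tellg()) : -1;
}
```

Called after the stream has been closed, truncate_file(path, current_file_pos) would take the place of the bare truncate call in the snippet above.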
I have a C++ program that creates an output file "A" with ofstream. This file is then read by some legacy C code that opens the file with _iobuf. The legacy code then creates its own output file "B" using _iobuf, and this file is then read by the C++ program using ifstream. This sequence is iterated many times, with the same file names for A and B in each iteration.
Occasionally, the C++ program cannot open the output file A for writing, and I must try several times before it succeeds. This occurs nondeterministically, and maybe once in a thousand iterations. Note that the C program never has to wait to open its input or output file, nor does the C++ program ever have to wait to open its input file. This informal observation is based on hundreds of thousands of iterations.
I'm wondering if this has something to do with mixing ofstream and _iobuf in the same program? Both the C++ code and the C code are linked into the same program. And the legacy C code is technically C++ code, but was written in a very C-like style. Is there anything I can do to eliminate this waiting to open the ofstream file? I do not want to change the legacy code if I can possibly avoid it.
Pseudo code (not compiled):
void someObject::someMethod()
{
    for (int count = 0; count < someLimit; ++count)
    {
        newerObject::firstMethod();
        olderObject::secondMethod();
        newerObject::thirdMethod();
    }
}
void newerObject::firstMethod()
{
    // do some processing first
    // then write the results of the processing to a file
    ofstream A;
    A.open("A", ofstream::out); // this sometimes must be tried multiple times
    // write data to file A
    A.close();
}
void olderObject::secondMethod()
{
    FILE* f;
    f = fopen("A", "rt"); // this always works the first time
    // read data from file A
    fclose(f);
    // do some processing
    f = fopen("B", "w");
    // write data to file B
    fclose(f);
}
void newerObject::thirdMethod()
{
    ifstream B;
    B.open("B"); // this always works the first time
    // read data from file B
    B.close();
    // do some processing
}
Currently, as a work around, I put the ofstream::open in a do-while loop. I would love to get rid of this awkwardness. Thanks in advance for any advice you can give.
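For reference, the workaround described here might look like the following bounded retry. This is a guess at its shape, not the asker's actual code; the function name open_with_retry, the attempt limit, and the delay are all assumptions:

```cpp
#include <chrono>
#include <fstream>
#include <string>
#include <thread>

// Tries to open `path` for writing up to max_attempts times, sleeping for
// `delay` after each failure. Returns the number of attempts that were
// needed, or 0 if every attempt failed.
int open_with_retry(std::ofstream& out, const std::string& path,
                    int max_attempts, std::chrono::milliseconds delay) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        out.clear();  // reset failbit left behind by a failed open
        out.open(path, std::ofstream::out);
        if (out.is_open()) return attempt;
        std::this_thread::sleep_for(delay);
    }
    return 0;
}
```

Logging the returned attempt count makes it possible to see how often, and at which iterations, the first open fails, which helps when trying to reproduce the problem.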
First off, the problem is almost certainly not the use of different methods to access the files: under the hood, the C and C++ I/O functions use the same system I/O facilities. You seem to be using Windows (on other systems files typically can be open multiple times simultaneously) and I don't know much about the system but I would suspect that the file system hasn't been updated to reflect that the file is closed when you try to open it. This may have to do with the "t" open flag: I don't know what this is about.
On UNIXes you can force the I/O operations to wait until the actual change on disk completed. Something like this could help avoiding the problem but has the significant cost that operations become hideously slow. On UNIXes one approach would be to blow away the file system entry the moment the file was opened successfully (after all, at this point its name isn't used anymore):
if (FILE* fp = fopen("file", "r")) {
remove("file");
// do processing
}
However, if I recall correctly on Windows you can neither remove the file nor rename it. Personally, in solving the problem I would proceed as follows:
Determine under which situations the file can't be opened, e.g. by keeping the file open and trying to open it. This is mainly intended to create a setup where the problem is reproducible so you can verify later that you indeed found a solution.
Once I found a way to reproduce the problem, I would probably have a better idea of the actual root cause, and possibly googling would help. In any case, this is the point where researching the root cause comes in.
Once the cause is understood, it is hopefully easy to devise a solution. If not, trying to open the file repeatedly until it succeeds may very well be the right solution.
When I construct an iostream when say opening a file will this always read the entire file from the hard disk and then put it into memory, or is it streamed in and buffered by the OS on demand?
I ask because one way to check whether a file exists is to see if opening it fails, but I fear that if the files I am opening are very large, this will take a long time if iostream must read the entire file in on open.
If you are willing to use Boost, you can check whether a file exists like this:
#include <boost/filesystem.hpp>
bool fileExists = boost::filesystem::exists("foo.txt");
No, it will not read the entire file into memory when you open it. It will read your file in chunks though, but I believe this process will not start until you read the first byte. Also these chunks are relatively small (on the order of 4-128 kibibytes in size), and the fact it does this will speed things up greatly if you are reading the file sequentially.
In a test on my Linux box (well, Linux VM) simply opening the file only results in the OS open system call, but no read system call. It doesn't start reading anything from the file until the first attempt to read from the stream. And then it reads 8191 (why 8191? that seems a very strange number) byte chunks as I read the file in.
Opening a file is a bad way of testing if the file exists - all it does is tell you if you can open it. Opening might fail for a number of reasons, typically because you don't have read permission, but the file will still exist. It is usually better to use an operating system specific function to test for existence. And no, opening an fstream will not cause the contents to be read.
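As an illustration of that distinction, here is a sketch using the C++17 std::filesystem library rather than an operating-system-specific call. The enum FileState and the function probe are invented names for this example:

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>

enum class FileState { Missing, ExistsReadable, ExistsUnreadable, Unknown };

// Separates "the file is not there" from "the file is there but cannot be
// opened". No file contents are read in either check.
FileState probe(const std::filesystem::path& p) {
    std::error_code ec;
    const bool exists = std::filesystem::exists(p, ec);
    if (ec) return FileState::Unknown;  // e.g. a parent directory is not searchable
    if (!exists) return FileState::Missing;
    std::ifstream in(p);                // open only, nothing is read
    return in ? FileState::ExistsReadable : FileState::ExistsUnreadable;
}
```

Either answer can be stale by the time the caller acts on it, so code that goes on to use the file still has to handle a failing open.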
What I think is, when you open a file, the corresponding data structures for the process opening the file are populated which include file pointer, file descriptor, v node etc.
Now one can read and write to a file using buffered streams (fwrite , fread) or using system calls (read and write).
When we use buffered streams, we buffer the data and then write or read it (this is done for efficiency purposes). This by itself means that the whole file is not read into memory; only a certain number of bytes are read into the buffer and then made available.
In the case of system calls such as read and write, kernel-level buffering is done (using fsync one can flush out the kernel buffer too), but the data is actually read from and written to the device.
Checking the existence of a file:
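#include <sys/stat.h>
#include <iostream>
#include <string>
int main(){
    struct stat file_i;
    std::string f("myfile.txt");
    // stat() returns non-zero if it cannot get information about the file
    if (stat(f.c_str(),&file_i) != 0){
        std::cout << "File not found" << std::endl;
    }
    return 0;
}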
Hope this clarifies a bit.
I wish for a file to be deleted from disk only when it is closed. Up until that point, other processes should be able to see the file on disk and read its contents, but eventually after the close of the file, it should be deleted from disk and no longer visible on disk to other processes.
Open the file, then delete it while it's open. Other processes will be able to use the file, but as soon as all handles to file are closed, it will be deleted.
Edit: based on the comments WilliamKF added later, this won't accomplish what he wants -- it'll keep the file itself around until all handles to it are closed, but the directory entry for the file name will disappear as soon as you call unlink/remove.
Open files in Unix are reference-counted. Every open(2) increments the counter, every close(2) decrements it. The counter is shared by all processes on the system.
Then there's a link count for a disk file. Brand-new file gets a count of one. The count is incremented by the link(2) system call. The unlink(2) decrements it. File is removed from the file system when this count drops to zero.
The only way to accomplish what you ask is to open the file in one process, then unlink(2) it. Other processes will be able to open(2) or stat(2) it between open(2) and unlink(2). Assuming the file had only one link, it'll be removed when all processes that have it open close it.
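The open-then-unlink behavior can be observed with a small POSIX sketch like the one below. The function name read_after_unlink is hypothetical, and error handling is kept to the minimum needed to make the point:

```cpp
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Creates `path`, writes `payload`, unlinks the name while the descriptor is
// still open, and then reads the data back through that descriptor. Returns
// an empty string on failure.
std::string read_after_unlink(const std::string& path, const std::string& payload) {
    int fd = open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return {};
    if (write(fd, payload.data(), payload.size()) !=
        static_cast<ssize_t>(payload.size())) {
        close(fd);
        return {};
    }
    unlink(path.c_str());  // the directory entry disappears immediately

    struct stat st;
    std::string result;
    if (stat(path.c_str(), &st) != 0) {  // other processes can no longer find it
        lseek(fd, 0, SEEK_SET);
        char buf[256];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            result.append(buf, static_cast<std::size_t>(n));
        }
    }
    close(fd);  // last reference dropped: the disk space is released here
    return result;
}
```

This shows why the approach does not meet the original wish: after unlink the data lives on, but no other process can open the file by name.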
Use unlink
#include <unistd.h>
int unlink(const char *pathname);
unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open the file is deleted and the space it was using is made available for reuse.
If the name was the last link to a file but any processes still have the file open the file will remain in existence until the last file descriptor referring to it is closed.
If the name referred to a symbolic link the link is removed.
If the name referred to a socket, fifo or device the name for it is removed but processes which have the object open may continue to use it.
I'm not sure, but you could try remove, although it looks more C-style.
Maybe boost::filesystem::remove?
bool remove( const path & ph );
Precondition: !ph.empty()
Returns: The value of exists( ph ) prior to the establishment of the postcondition.
Postcondition: !exists( ph )
Throws: if ph.empty() || (exists(ph) && is_directory(ph) && !is_empty(ph)). See empty path rationale.
Note: Symbolic links are themselves deleted, rather than what they point to being deleted.
Rationale: Does not throw when !exists( ph ) because not throwing: Works correctly if ph is a dangling symbolic link. Is slightly easier-to-use for many common use cases. Is slightly higher-level because it implies use of postcondition semantics rather than effects semantics, which would be specified in the somewhat lower-level terms of interactions with the operating system. There is, however, a slight decrease in safety because some errors will slip by which otherwise would have been detected. For example, a misspelled path name could go undetected for a long time.
The initial version of the library threw an exception when the path did not exist; it was changed to reflect user complaints.
You could create a wrapper class that counts references, using one of the above methods to delete the file.
class MyFileClass{
    static unsigned _count;
public:
    MyFileClass(const std::string& path){
        //open file with path
        _count++;
    }
    //other methods
    ~MyFileClass(){
        if (! (--_count)){
            //delete file
        }
    }
};
unsigned MyFileClass::_count = 0; //elsewhere
I think you need to extend your notion of “closing the file” beyond fclose or std::fstream::close to whatever you intend to do. That might be as simple as
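#include <fstream>
#include <string>
#include <unistd.h> // unlink
class MyFile : public std::fstream {
    std::string filename;
public:
    MyFile(const std::string &fname) : std::fstream(fname), filename(fname) {}
    ~MyFile() { unlink(filename.c_str()); } // unlink takes a const char*
};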
or it may be something much more elaborate. For all I know, it may even be much simpler – if you close files only at one or two places in your code, the best thing to do may be to simply unlink the file there (or use boost::filesystem::remove, as Tom suggests).
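OTOH, if all you want to achieve is that processes started from your process can use the file, you may not need to keep it lying around on disk at all. Forked processes inherit open files. Be aware that an inherited descriptor shares its file offset with the parent's, and dup does not change that, so seeking in the child influences the position in the parent and vice versa; reopen the file in the child if it needs an independent position.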
I have some .gz compressed files which are around 5-7 GB uncompressed.
These are flat files.
I've written a program that takes an uncompressed file and reads it line by line, which works perfectly.
Now I want to be able to open the compressed files in memory and run my little program.
I've looked into zlib but I can't find a good solution.
Loading the entire file is impossible using gzread(gzFile,void *,unsigned), because of the 32-bit unsigned int limitation.
I've tried gzgets, but this almost doubles the execution time compared with reading in using gzread. (I tested on a 2 GB sample.)
I've also looked into "buffering", such as splitting the gzread process into multiple 2 GB chunks, finding the last newline using strrchr, and then setting the position with gzseek.
But gzseek will emulate a total file decompression, which is very slow.
I fail to see any sane solution to this problem.
I could always check whether or not the current line actually has a newline (a missing one should only occur in the last, partially read line), and then read more data from the point in the program where this occurs.
But this could get very ugly.
Does anyone have any suggestions?
thanks
edit:
I don't need to have the entire file at once, just one line at a time, but I have a fairly huge machine, so if that were the easiest way I would have no problems.
For all those that suggest piping the stdin, I've experienced extreme slowdowns compared to opening the file. Here is a small code snippet I made some months ago, that illustrates it.
time ./a.out 59846/59846.txt
# 59846/59846.txt
18255221
real 0m4.321s
user 0m2.884s
sys 0m1.424s
time ./a.out <59846/59846.txt
18255221
real 1m56.544s
user 1m55.043s
sys 0m1.512s
And the source code
#include <cstdio>
#include <fstream>
#include <iostream>
#define LENS 10000
int main(int argc, char **argv){
    std::istream *pFile;
    if(argc==2) // if an argument is supplied, read from that file
        pFile = new std::ifstream(argv[1],std::ios::in);
    else // otherwise read from stdin
        pFile = &std::cin;
    char line[LENS];
    if(argc==2) // if we are using a filename, print it
        printf("#\t%s\n",argv[1]);
    if(!*pFile){ // test the stream state; the pointer itself is never null
        printf("Do you have permission to open file?\n");
        return 0;
    }
    int numRow=0;
    while(!pFile->eof()) {
        numRow++;
        pFile->getline(line,LENS);
    }
    if(argc==2)
        delete pFile;
    printf("%d\n",numRow);
    return 0;
}
Thanks for your replies; I'm still waiting for the golden apple.
edit2:
Using C-style FILE pointers instead of C++ streams is much, much faster, so I think this is the way to go.
Thanks for all your input.
gzip -cd compressed.gz | yourprogram
just go ahead and read it line by line from stdin as it is uncompressed.
EDIT: Response to your remarks about performance. You're saying reading STDIN line by line is slow compared to reading an uncompressed file directly. The difference lies within terms of buffering. Normally pipe will yield to STDIN as soon as the output becomes available (no, or very small buffering there). You can do "buffered block reads" from STDIN and parse the read blocks yourself to gain performance.
You can achieve the same result with possibly better performance by using gzread() as well. (Read a big chunk, parse the chunk, read the next chunk, repeat)
gzread only reads chunks of the file, you loop on it as you would using a normal read() call.
Do you need to read the entire file into memory ?
If what you need is to read lines, you'd gzread() a sizable chunk (say 8192 bytes) into a buffer, loop through that buffer, find all '\n' characters, and process the pieces between them as individual lines. You'd have to save the last piece in case it is only part of a line, and prepend it to the data you read next time.
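The chunk-and-carry-over logic can be sketched independently of zlib. The class name LineSplitter and its callback interface are illustrative assumptions; each buffer returned by gzread() (or read(), or fread()) would be passed to feed():

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Splits a stream of arbitrary chunks into lines. A partial line at the end
// of one chunk is kept and prepended to the data from the next chunk.
class LineSplitter {
public:
    explicit LineSplitter(std::function<void(const std::string&)> on_line)
        : on_line_(std::move(on_line)) {}

    // Feed one chunk exactly as it came back from the read call.
    void feed(const char* data, std::size_t size) {
        std::size_t start = 0;
        for (std::size_t i = 0; i < size; ++i) {
            if (data[i] == '\n') {
                carry_.append(data + start, i - start);
                on_line_(carry_);  // complete line, without the newline
                carry_.clear();
                start = i + 1;
            }
        }
        carry_.append(data + start, size - start);  // partial line, if any
    }

    // Call once at end of input to emit a final line that has no newline.
    void finish() {
        if (!carry_.empty()) {
            on_line_(carry_);
            carry_.clear();
        }
    }

private:
    std::function<void(const std::string&)> on_line_;
    std::string carry_;
};
```

With zlib, the driving loop would call gzread() into a fixed buffer until it returns 0 or a negative value, pass each result to feed(), and call finish() at the end, so no gzseek() is ever needed.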
You could also read from stdin and invoke your app like
zcat bigfile.gz | ./yourprogram
in which case you can use fgets and similar on stdin. This is also beneficial in that you'd run decompression on one processor and processing the data on another processor :-)
I don't know if this will be an answer to your question, but I believe it's more than a comment:
Some months ago I discovered that the contents of Wikipedia can be downloaded in much the same way as the StackOverflow data dump. Both decompress to XML.
I came across a description of how the multi-gigabyte compressed dump file could be parsed. It was done by Perl scripts, actually, but the relevant part for you was that Bzip2 compression was used.
Bzip2 is a block compression scheme, and the compressed file could be split into manageable pieces, and each part uncompressed individually.
Unfortunately, I don't have a link to share with you, and I can't suggest how you would search for it, except to say that it was described on a Wikipedia 'data dump' or 'blog' page.
EDIT: Actually, I do have a link