atomic append on a file descriptor, but at what offset? - c++

in unistd.h
using open() with the O_APPEND flag gives atomic writes always to the end of the file...
this is great and all, but what if i need to know the offset at which it atomically appended to the file...?
i realize O_APPEND is often used for log files, but I'd actually like to know at what offset in the file it atomically appended.
I don't see any obvious way to do this..? Does anyone know?
Thanks

To get the current position in a file descriptor, use lseek() with offset 0 and whence SEEK_CUR.
int fd = open(...);
if (fd) {
off_t positionWhereAppendingBegins = lseek(fd, 0, SEEK_CUR);
write(...);
close(fd);
}
Note that this will not give you reliable results if the descriptor was opened some other way, i.e. via socket().

The file is written to at the file offset as obtained by the process when the file was opened. If another process writes to the file between the open and the write, then contents of the file are indeterminate.
The correct method of handling multiple process writing to a single file is for all processes to open the file with the O_APPEND flag, obtain an exclusive lock and once the lock is obtained, seek to the end of the file before writing to the file, and finally close the file to release the lock.
If you want to keep the file open between writes, initiate the process by opening the file with the O_APPEND flag. The writing loop in this case is obtain the exclusive lock, seek to the end of the file, write to the file and release the lock.
If you really need the file position, lseek will return the file offset of the callers file descriptor at the time of the call.

Related

Knowing current compressed file size using gzwrite (zlib)

I'm using zlib for c++.
Quote from
http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/zlib-gzwrite-1.html regarding gzwrite function:
The gzwrite() function shall write data to the compressed file referenced by file, which shall have been opened in a write mode (see gzopen() and gzdopen()). On entry, buf shall point to a buffer containing len bytes of uncompressed data. The gzwrite() function shall compress this data and write it to file. The gzwrite() function shall return the number of uncompressed bytes actually written.
I interpret this as the return value will NOT tell me how much larger the file became when writing. Only how much data was compressed into the file.
The only way to know how large the file is would then be to close it, and read the size from the file system. I have a requirement to only continue to write to the file until it reaches a certain size. Can this be achieved without closing the file?
A workaround would be to write until the uncompressed size reaches my limit and then close the file, read the size from file system and update my best guess of file size based on that, and then re-open the file and continue writing. This would make me close and open the file a few times towards the end (as I'm approaching the size limit).
Another workaround, which would give more of an estimate (which is not what I want really) would be to write until uncompressed size reaches the limit, close the file, read the file size from the file system and calculate the compression ratio so far. The I can use this compression ratio to calculate a new limit for uncompressed file size where the compression should get me down to the limit for the compressed file size. If I repeat this the estimate would improve, but again, not what I'm looking for.
Are there better options?
Preferred option would be if zlib could tell me the compressed file size while the file is still open. I don't see why this information would not be available inside zlib at this point, since compression happens when I call gzwrite and not when i close the file.
zlib provides the function gzoffset(), which does exactly what you're asking.
If for some reason you are stuck with a version of zlib that is more than about eight years old, when gzoffset() was added, then this is easy to do with gzdopen(). You open the output file with fopen() or open(), and provide the file descriptor (using fileno() and dup() if you used fopen()), and then provide that descriptor to gzdopen(). Then you can use ftell() or lseek() at any time to see how much as been written. Be careful to not try to double-close the descriptor. See the comments for gzdopen().
You can work around this issue by using a pipe. The idea is to write the compressed data into a pipe. After that, you read the data from the other end of the pipe, count it and write it to the actual file.
To set this up you need to first open the file to write to via a simple open. Then create a pipe via pipe2 and initialize zlib by passing one of the pipe descriptors to gzdopen:
int out = open("/path/to/file", O_WRONLY | O_CREAT | O_TRUNC);
int p[2];
pipe2(p, O_NONBLOCK);
gzFile zFile = gzdopen(p[0], "w");
You can now write the data first to the pipe and then splice it from the pipe to the out file:
gzwrite(zFile, buf, 1024); //or any other length
size_t bytesWritten = 0;
do {
bytesWritten = splice(p[1], NULL, out, NULL, 1024, SPLICE_F_NONBLOCK | SPLICE_F_MORE);
} while(bytesWritten == 1024);
As you can see, you now have the bytesWritten to tell you how much data was actually written. Simply sum it up in another variable and stop splicing as soon as you have written as much data as you need to (or just splice it in one go by writing everything to the zFile and the splice once with the amount of data you are allowed to store as the fifth parameter. If you want to not compress uneccessary data, simply do it in chunks as shown above).
A note on splice: Splice is linux specific, and is basically just a very efficient copy. You can always replace it with a simple "read and write" combo, i.e. read data from fd[1] into a buffer and then write the data from that buffer into out - splice is just faster and less code.

std::ofstream open file and replace in specific offset

I want to open a file (without re-creating it) and write to a specific offset.
This is the current code :
std::ofstream file(conf_file_path, std::ios::app);
file.seekp(offset, std::ios::beg);
const auto& output = file.write(conf_str, conf_str_len);
But it always writes to the end of file (probably due to the app flag)
If I don't use the app flag, it re-creates the file as I open it.
How can I open it without re-create it and be able to write to specific offset ?
it always writes to the end of file (probably due to the app flag)
Yes, this is due to the app flag. Here's what the documentation says:
app - seek to the end of stream before each write
If I don't use the app flag, it re-creates the file as I open it.
If you have out or trunc flags sets in the mode, then it destroys the contents of the file, if it already exists.
How can I open it without re-create it and be able to write to specific offset ?
You may use in|out. This will error out if the file doesn't exist; if it exists, the file will be opened and read from the beginning. If you want the stream to be read from the end, you may set the ate flag additionally.
All of this is clearly documented here; reading the manual really helps.

Is ftruncate() asynchronous?

I am attempting to write a class in C++ that provides a means of atomically appending to a file, even for the case of power failure mid write.
First, I write my current file position (a 64 offset from the beginning of the file, in bytes) to a separate journal file. Then, I write the requested data to the end of the date file. Finally, I call ftruncate() (setting the truncated size to 0) on the journal file.
The main idea is that if this class is ever asked to open a file that has a non empty journal file, then you know a write was interrupted and you can read the position of the last write from the journal file and fseek to that spot. You lose the last partial write, but the file should not be corrupted.
Unfortunately, it seems like ftruncate() is asynchronous. In practice, even if I call fflush() and fsync() after ftruncate I see the journal grow to up to hundreds of bytes while doing lots of writes. It always ultimately ends up at 0, but I expected to see it at either size 0 or size 8 at all times.
Is it possible to make ftruncate completely synchronous? Or is there a better way to use the journal?
ftruncate() does not change your file descriptor's write offset in the file. If you are leaving the file open and writing the next length after calling ftruncate(), then what's happening is the file's offset is still increasing. When you write, it resets the length of the file to be at the offset and then writes your bytes there.
Probably what you want to do is call lseek(fd, 0, SEEK_SET) after you call ftruncate() so that the next write to the file will take place at the beginning of the file.

EOF before EOF in Visual Studio

I had this snippet in a program (in Visual Studio 2005):
if(_eof(fp->_file))
{
break;
}
It broke the enclosing loop when eof was reached. But the program was not able to parse the last few thousand chars in file. So, in order to find out what was happening, I did this:
if(_eof(fp->_file))
{
cout<<ftell(fp)<<endl;
break;
}
Now the answer that I got from ftell was different (and smaller) than the actual file-size (which isn't expected). I thought that Windows might have some problem with the file, then I did this:
if(_eof(fp->_file))
{
cout<<ftell(fp)<<endl;
fseek(fp, 0 , SEEK_END);
cout<<ftell(fp)<<endl;
break;
}
Well, the fseek() gave the right answer (equal to the file-size) and the initial ftell() failed (as previously told).
Any idea about what could be wrong here?
EDIT: The file is open in "rb" mode.
You can't reliably use _eof() on a file descriptor obtained from a FILE*, because FILE* streams are buffered. It means that fp has sucked fp->_file dry and stores the remaining byte in its internal buffer. Eventually fp->_file is at eof position, while fp still has bytes for you to read. Use feof() after a read operation to determine if you are at the end of a file and be careful if you mix functions which operate on FILE* with those operating on integer file descriptors.
You should not be using _eof() directly on the descriptor if your file I/O operations are on the FILE stream that wraps it. There is buffering that takes place and the underlying descriptor will hit end-of-file before your application has read all the data from the FILE stream.
In this case, ftell(fp) is reporting the state of the stream and you should be using feof(fp) to keep them in the same I/O domain.

Does constructing an iostream (c++) read data from the hard drive into memory?

When I construct an iostream when say opening a file will this always read the entire file from the hard disk and then put it into memory, or is it streamed in and buffered by the OS on demand?
I ask because one way to check if a file exists is to see if opening it fails, but I fear if the files I am opening are very large then this take a long time if iostream must read the entire file in on open.
To check whether a file exists can be done like this if you want to use boost.
#include <boost/filesystem.hpp>
bool fileExists = boost::filesystem::exists("foo.txt");
No, it will not read the entire file into memory when you open it. It will read your file in chunks though, but I believe this process will not start until you read the first byte. Also these chunks are relatively small (on the order of 4-128 kibibytes in size), and the fact it does this will speed things up greatly if you are reading the file sequentially.
In a test on my Linux box (well, Linux VM) simply opening the file only results in the OS open system call, but no read system call. It doesn't start reading anything from the file until the first attempt to read from the stream. And then it reads 8191 (why 8191? that seems a very strange number) byte chunks as I read the file in.
Opening a file is a bad way of testing if the file exists - all it does is tell you if you can open it. Opening might fail for a number of reasons, typically because you don't have read permission, but the file will still exist. It is usually better to use an operating system specific function to test for existence. And no, opening an fstream will not cause the contents to be read.
What I think is, when you open a file, the corresponding data structures for the process opening the file are populated which include file pointer, file descriptor, v node etc.
Now one can read and write to a file using buffered streams (fwrite , fread) or using system calls (read and write).
When we use buffered streams, we buffer the data and then write or read it[This is done for efficiency puposes]. This statement itself means that the whole file is not read into memory but certain bytes are read into buffer and then made available.
In case of sys calls such as read and write , kernel level buffering is done (using fsync one can flush out kernel buffer too), but data is actually read and written to the device .file
checking existance of file
#include &lt sys/stat.h &gt
int main(){
struct stat file_i;
std::string f("myfile.txt");
if (stat(f.c_str(),&file_i) != 0){
cout &lt&lt "File not found" &lt&lt endl;
}
return 0;
}
Hope this clarifies a bit.