appending to a memory-mapped file - c++

I'm constantly appending to a file of stock quotes (ints, longs, doubles, etc.). I have this file mapped into memory with mmap.
What's the most efficient way to make newly appended data available as part of the memory mapping?
I understand that I can open the file again (new file descriptor) and then mmap it to get the new data, but that seems inefficient. Another approach that has been suggested to me is to pre-allocate the file in 1 MB chunks, write to a specific position until reaching the end, then ftruncate the file to +1 MB.
Are there other approaches?
Does Boost help with this?

Boost.IOStreams supports only fixed-size memory-mapped files, so it won't help with your specific problem. Linux has an mremap interface, which works as follows:
void *new_mapping = mremap(mapping, size, size + GROWTH, MREMAP_MAYMOVE);
if (new_mapping == MAP_FAILED) {
    // handle error
}
mapping = new_mapping;
This is non-portable, however (and poorly documented). Mac OS X seems not to have mremap.
In any case, you don't need to reopen the file, just munmap it and mmap it again:
void *append(int fd, char const *data, size_t nbytes, void *map, size_t &len)
{
    // TODO: check for errors here!
    ssize_t written = write(fd, data, nbytes); // fd assumed positioned at/opened for append
    munmap(map, len);
    len += written;
    return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
}
A pre-allocation scheme may be very useful here. Be sure to keep track of the file's actual length and truncate it once more before closing.
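For illustration, here is a minimal sketch of such a pre-allocation scheme, assuming a POSIX system; the chunk size and the grow_mapping name are my own choices, not from the question, and error checking is omitted for brevity:

#include <sys/mman.h>
#include <unistd.h>

static const size_t CHUNK = 1 << 20; // grow the file in 1 MB chunks

// Extend the file by one chunk and remap it. 'capacity' is the mapped
// size and is updated in place; returns the new mapping (MAP_FAILED on
// error).
void *grow_mapping(int fd, void *map, size_t &capacity)
{
    munmap(map, capacity);            // drop the old view
    ftruncate(fd, capacity + CHUNK);  // pre-allocate the next chunk on disk
    capacity += CHUNK;
    return mmap(NULL, capacity, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

As noted above, keep the logical data length separately from capacity, and ftruncate back down to the logical length before closing the file.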

I know the answer has already been accepted, but maybe it will help someone else if I provide mine. Allocate a large file ahead of time, say 10 GiB in size. Create three of these files ahead of time; I call them volumes. Keep track of your last known write position somewhere, such as in a header or another file, and then keep appending from that point. If you reach the maximum size of the file and run out of room, switch to the next volume. If there are no more volumes, create another one. Note that you would probably do this a few volumes ahead, to make sure your appends don't block waiting for a new volume to be created.

That's how we implement it where I work, for storing continuously incoming video/audio in a DVR surveillance system. We don't waste space storing file names for video clips, which is why we don't use a real file system; instead we go flat-file and just track offsets, frame information (fps, frame type, width/height, etc.), recording time, and camera channel.

For the kind of work you are doing, storage space is cheap whereas your time is invaluable, so grab as much as you want ahead of time. You're basically implementing your own file system, optimized for your needs; the needs that general-purpose file systems serve aren't the same needs we have in other fields.
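A rough sketch of the volume mechanics, purely illustrative; the names, the 10 GiB size, and the header handling are my own choices, and error checking is omitted:

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

static const off_t VOLUME_SIZE = 10LL << 30; // 10 GiB per volume

struct Volume {
    int   fd;
    off_t writePos; // last known append position, persisted in a header
};

// Pre-allocate a volume so appends never wait on block allocation.
int create_volume(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd >= 0)
        ftruncate(fd, VOLUME_SIZE); // or posix_fallocate() to really reserve blocks
    return fd;
}

// Append to the current volume; the caller switches to the next
// pre-created volume when this returns false.
bool append(Volume &v, const void *data, size_t n)
{
    if (v.writePos + (off_t)n > VOLUME_SIZE)
        return false; // volume full
    pwrite(v.fd, data, n, v.writePos);
    v.writePos += n;  // also persist this in your header/index
    return true;
}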

Looking at the man page for mremap, it should be possible.

My five cents, though this is more C-specific.
Create a normal file, but mmap a huge size. For example, the file may be 100 KB, but you mmap 1 GB or more. Then you can safely access everything up to the file's size; access beyond the file size will result in an error. The point is that when the file later grows, the newly covered pages become accessible through the same mapping, with no remapping needed (see the sketch below).
If you are on a 32-bit OS, just don't make the mmap too big, because it will eat your address space.
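A minimal sketch of the idea, assuming Linux; the file name and sizes are arbitrary, and error checks are omitted:

#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

int main() {
    int fd = open("quotes.dat", O_RDWR);  // file is ~100 KB on disk
    size_t reserved = 1ULL << 30;         // but reserve a 1 GB window
    char *base = (char *)mmap(NULL, reserved, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

    // base[0 .. file size) is accessible; touching pages past the end
    // of the file raises SIGBUS. To make more of the window usable,
    // just extend the file - no mremap/munmap needed:
    ftruncate(fd, 200 * 1024);            // now base[0 .. 200 KB) is valid

    munmap(base, reserved);
    close(fd);
    return 0;
}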

If you're using boost/iostreams/device/mapped_file.hpp on windows:
boost::filesystem::resize_file throws an exception if a mapped_file open for reading exists on the file, due to lack of sharing privileges.
Instead, use the Windows API to resize the file on disk; the reading mapped_files can then stay open:
bool resize_file_wapi(const string &path, __int64 new_file_size) //boost::uintmax_t size
{
    HANDLE handle = CreateFile(path.c_str(), GENERIC_WRITE,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, 0,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (handle == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER sz;
    sz.QuadPart = new_file_size;

    // Move the file pointer to the new size, mark end-of-file there,
    // and close the handle even if either call fails.
    bool ok = ::SetFilePointerEx(handle, sz, 0, FILE_BEGIN)
           && ::SetEndOfFile(handle);
    ::CloseHandle(handle);
    return ok;
}

Related

Knowing current compressed file size using gzwrite (zlib)

I'm using zlib for C++.
Quote from
http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/zlib-gzwrite-1.html regarding gzwrite function:
The gzwrite() function shall write data to the compressed file referenced by file, which shall have been opened in a write mode (see gzopen() and gzdopen()). On entry, buf shall point to a buffer containing len bytes of uncompressed data. The gzwrite() function shall compress this data and write it to file. The gzwrite() function shall return the number of uncompressed bytes actually written.
I interpret this as meaning that the return value will NOT tell me how much larger the file became when writing, only how much uncompressed data was compressed into it.
The only way to know how large the file is would then be to close it, and read the size from the file system. I have a requirement to only continue to write to the file until it reaches a certain size. Can this be achieved without closing the file?
A workaround would be to write until the uncompressed size reaches my limit and then close the file, read the size from file system and update my best guess of file size based on that, and then re-open the file and continue writing. This would make me close and open the file a few times towards the end (as I'm approaching the size limit).
Another workaround, which would give more of an estimate (not really what I want), would be to write until the uncompressed size reaches the limit, close the file, read the file size from the file system, and calculate the compression ratio so far. Then I can use this ratio to calculate a new uncompressed-size limit at which the compression should get me down to the limit for the compressed file size. Repeating this would improve the estimate, but again, it's not what I'm looking for.
Are there better options?
The preferred option would be if zlib could tell me the compressed file size while the file is still open. I don't see why this information would not be available inside zlib at this point, since compression happens when I call gzwrite and not when I close the file.
zlib provides the function gzoffset(), which does exactly what you're asking.
If for some reason you are stuck with a version of zlib more than about eight years old, from before gzoffset() was added, then this is easy to do with gzdopen(). Open the output file with fopen() or open(), get the file descriptor (using fileno(), and dup() if you used fopen()), and provide that descriptor to gzdopen(). Then you can use ftell() or lseek() at any time to see how much has been written. Be careful not to double-close the descriptor; see the comments for gzdopen().
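A small sketch of the gzoffset() approach; the file name, chunk size, and limit are my own, and note that gzoffset() may not account for output still buffered inside zlib, so treat it as a close lower bound:

#include <zlib.h>

// 'source'/'n' stand in for your real data feed; MAX_COMPRESSED is an
// arbitrary limit for the compressed output file.
void write_until_limit(const char *source, unsigned n)
{
    const z_off_t MAX_COMPRESSED = 10 << 20; // ~10 MB of compressed output
    gzFile out = gzopen("quotes.gz", "w");
    while (n > 0) {
        unsigned chunk = n < 1024 ? n : 1024;
        gzwrite(out, source, chunk);
        source += chunk;
        n -= chunk;
        if (gzoffset(out) >= MAX_COMPRESSED)  // compressed bytes written so far
            break;                            // file reached the size limit
    }
    gzclose(out);
}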
You can work around this issue by using a pipe. The idea is to write the compressed data into a pipe. After that, you read the data from the other end of the pipe, count it and write it to the actual file.
To set this up you need to first open the file to write to via a plain open(). Then create a pipe via pipe2() and initialize zlib by passing the pipe's write end to gzdopen():
int out = open("/path/to/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
int p[2];
pipe2(p, O_NONBLOCK);
gzFile zFile = gzdopen(p[1], "w"); // zlib writes into the pipe's write end
You can now write the data first to the pipe and then splice it from the pipe to the out file:
gzwrite(zFile, buf, 1024); // or any other length
size_t bytesWritten = 0;
do {
    bytesWritten = splice(p[0], NULL, out, NULL, 1024,
                          SPLICE_F_NONBLOCK | SPLICE_F_MORE); // read end -> file
} while (bytesWritten == 1024);
As you can see, you now have bytesWritten to tell you how much data was actually written. Simply sum it up in another variable and stop splicing as soon as you have written as much data as you need to. (Alternatively, splice in one go: write everything to the zFile, then splice once, passing the amount of data you are allowed to store as the fifth parameter. If you don't want to compress unnecessary data, do it in chunks as shown above.)
A note on splice: splice() is Linux-specific, and is basically just a very efficient copy. You can always replace it with a simple read-and-write combo, i.e. read data from p[0] into a buffer and then write that buffer to out; splice is just faster and less code.

Reading a Potentially incomplete File C++

I am writing a program to reformat a DNS log file for insertion to a database. There is a possibility that the line currently being written to in the log file is incomplete. If it is, I would like to discard it.
I started off believing that the eof function might be a good fit for my application, however I noticed a lot of programmers dissuading the use of the eof function. I have also noticed that the feof function seems to be quite similar.
Any suggestions/explanations that you guys could provide about the side effects of these functions would be most appreciated, as would any suggestions for more appropriate methods!
Edit: I currently am using the istream::peek function in order to skip over the last line, regardless of whether it is complete or not. While acceptable, a solution that determines whether the last line is complete would be preferred.
The specific comparison I'm using is: logFile.peek() != EOF
I would consider using
int fseek ( FILE * stream, long int offset, int origin );
with SEEK_END
and then
long int ftell ( FILE * stream );
to determine the number of bytes in the file, and therefore where it ends. I have found this to be more reliable for detecting the end of the file (in bytes).
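Something like this minimal sketch (error checking omitted):

#include <cstdio>

long file_size_bytes(FILE *stream)
{
    long cur = ftell(stream);      // remember the current position
    fseek(stream, 0, SEEK_END);    // jump to the end of the file
    long size = ftell(stream);     // offset of the end == size in bytes
    fseek(stream, cur, SEEK_SET);  // restore the original position
    return size;
}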
Could you detect an end-of-record/line (EOR) marker, CRLF perhaps, in the last two or three bytes of the file? (Three bytes might be needed for CRLF^Z, depending on the file type.) This would verify that you have a complete last row:
fseek(stream, -2, SEEK_END);
char tail[2];
fread(tail, 1, 2, stream); // then check tail for '\r', '\n'
If you try to open the file with exclusive locks, you can detect (by the failure to open) that the file is in use, and try again in a second...(or whenever)
If you need to capture the file contents as the file is being written, it's much easier if you eliminate as many layers of indirection and buffering between your logic and the actual bytes of data in the file.
Do not use C++ IO streams of any type - you have no real control over them. Don't use FILE *-based functions such as fopen() and fread() - those are buffered, and even if you disable buffering there are layers of code between your code and the data that once again you can't control and don't know what's happening.
In a POSIX environment, you can use low-level C-style open() and read()/pread() calls. And use fstat() to know when the file contents have changed - you'll see the st_size member of the struct stat argument change after a call to fstat().
You'd open the file like this:
int logFileFD = open( "/some/file/name.log", O_RDONLY );
Inside a loop, you could do something like this (error checking and actual data processing omitted):
size_t lastSize = 0;
while ( !done )
{
    struct stat statBuf;
    fstat( logFileFD, &statBuf );
    if ( statBuf.st_size == lastSize )
    {
        sleep( 1 );  // or however long you want
        continue;    // go to next loop iteration
    }

    // process new data - might need to keep some of the old data
    // around to handle lines that cross boundaries
    processNewContents( logFileFD, lastSize, statBuf.st_size );
    lastSize = statBuf.st_size;  // remember how far we have processed
}
processNewContents() could look something like this:
void processNewContents( int fd, size_t start, size_t end )
{
    static char oldData[ BUFSIZE ];
    static char newData[ BUFSIZE ];

    // assumes amount of data will fit in newData...
    // note pread() takes the byte count before the offset
    ssize_t bytesRead = pread( fd, newData, end - start, start );

    // process the data that was read here

    return;
}
You may also find that you need to add some code to close() then re-open() the file in case your application doesn't seem to be "seeing" data written to the file. I've seen that happen on some systems - the application somehow sees a cached copy of the file size somewhere while an ls run in another context gets the more accurate, updated size. If, for example, you know your log file is written to every 10-15 seconds, if you go 30 seconds without seeing any change to the file you know to try reopening the file.
You can also track the inode number in the struct stat results to catch log file rotation.
In a non-POSIX environment, you can replace open(), fstat() and pread() calls with the low-level OS equivalent, although Windows provides most of what you'd need. On Windows, lseek() followed by read() would replace pread().

How to cut a file without using another file?

Is it possible to delete part of a file (let's say from the beginning to its half), without having to use another file?
Thanks!
Yes, it is possible, but still you'll have to rewrite most of the file.
The rough idea is as follows:
open the file
beg = start of the fragment to be removed
len = length of the fragment to be removed
blocksize = 4096   -- example block size, may be anything
datamoved = 0
do {
    fseek(beg + len + datamoved)
    if( endoffile ) break   -- finished
    actualread = fread(buffer, blocksize)
    fseek(beg + datamoved)
    fwrite(buffer, actualread)
    datamoved += actualread
}
and the last step after the loop is to truncate the file to beg + datamoved bytes. If the underlying file system does not support a truncate operation you have to rewrite the whole file, but most file systems and runtime libraries do support it.
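A concrete sketch of the loop above, assuming POSIX ftruncate() (on Windows, _chsize() plays the same role); error checking is omitted:

#include <cstdio>
#include <unistd.h>  // ftruncate, fileno

// Removes 'len' bytes starting at offset 'beg' by sliding the tail of
// the file forward in blocks, then truncating off the duplicated tail.
void cut(const char *path, long beg, long len)
{
    FILE *f = fopen(path, "rb+");
    char buffer[4096];
    long datamoved = 0;

    for (;;) {
        fseek(f, beg + len + datamoved, SEEK_SET);
        size_t actualread = fread(buffer, 1, sizeof buffer, f);
        if (actualread == 0)
            break;                             // reached end of file
        fseek(f, beg + datamoved, SEEK_SET);
        fwrite(buffer, 1, actualread, f);
        datamoved += actualread;
    }
    fflush(f);
    ftruncate(fileno(f), beg + datamoved);     // drop the duplicated tail
    fclose(f);
}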
The short answer is that no, most file systems don't attempt to support operations like that.
That leaves you with two choices. The obvious one is to create a copy of the data, leaving out the parts you don't want. You can do this either in-place (i.e., moving the data around in the same file) or by using an auxiliary file, typically copying the data to the new file, then doing something like renaming the new file to the old name.
The other major choice is to simply restructure your file and data so you don't have to get rid of the old data at all. For example, if you want to keep the most recent N amount of data from a process, you might structure (most of) the file as a circular buffer, with a couple of "pointers" at the beginning that tell you the head and tail positions, so you know where to read data from and write data to. With a structure like this, you don't erase or remove the old data; you just overwrite it as needed.
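A compact sketch of such a circular-buffer file; the header layout, region size, and names are my own, purely illustrative, and error checks are omitted:

#include <cstdio>
#include <cstdint>

// The file is assumed pre-sized to sizeof(RingHeader) + REGION bytes:
// a header holding the head/tail offsets, followed by the data region.
struct RingHeader {
    uint64_t head;  // offset of the oldest valid byte
    uint64_t tail;  // offset where the next byte will be written
};

const uint64_t REGION = 1 << 20;  // 1 MB data region, for illustration

void ring_write(FILE *f, const char *data, uint64_t n, RingHeader &h)
{
    while (n > 0) {
        uint64_t pos   = h.tail % REGION;
        uint64_t chunk = (n < REGION - pos) ? n : REGION - pos; // stop at wrap
        fseek(f, (long)(sizeof(RingHeader) + pos), SEEK_SET);
        fwrite(data, 1, (size_t)chunk, f);
        data   += chunk;
        h.tail += chunk;
        n      -= chunk;
    }
    if (h.tail - h.head > REGION)
        h.head = h.tail - REGION;  // the oldest data was just overwritten
    fseek(f, 0, SEEK_SET);
    fwrite(&h, sizeof h, 1, f);    // persist the head/tail pointers
}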
If you have enough memory, read the part you want to keep fully into memory, write it back at the front of the file, and truncate the file. If you do not have enough memory, copy in blocks, and truncate only when you are done.

Unzip by using the "CompressedFolder" COM object

I unpack a zip archive using the Windows API. This API is based on COM interfaces; the functionality is accessible through the CompressedFolder COM object.
I encountered the following problem. When I unpack a small file (3.5 MB) it takes a long time. I figured out that IStream::Read() causes this problem. It works slowly. I use a small buffer (1KB) to read this file in many iterations; if I use a buffer that nearly equals the file size, then it works much faster.
How can I get it to unpack fast even if the buffer size is much smaller than file size? Is it possible? I think it is important because the files may be big, say 1 GB.
Here is a fragment of the code that reads a file:
...
CComPtr<IEnumSTATSTG> pEnum = NULL;
pStorage->EnumElements(0, NULL, 0, &pEnum);
STATSTG statStg;
while (S_OK == pEnum->Next(1, &statStg, NULL)) {
    if (statStg.type == STGTY_STREAM) {
        CComPtr<IStream> pStream = NULL;
        pStorage->OpenStream(statStg.pwcsName, NULL, STGM_READ, NULL, &pStream);
        ...
        while (hr == S_OK) {
            // reading
            hr = pStream->Read(btBuffer, 1024, &ulBytesRead); // this is the slow part
        }
    }
}
A side question I have:
Is there a method to detect a packed file's size through IStream without reading the file?
It is not possible to achieve fast reads with small buffers; the more I/O operations you do, the more time it takes.
Try to limit the number of I/O operations by using a relatively big buffer size, capped, of course, in accordance with the memory you want your program to allocate.
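For example, a sketch reusing pStream from the fragment in the question; the 1 MB figure is an arbitrary choice to tune against the memory you can spare:

#include <vector>

const ULONG CHUNK = 1 << 20;        // read 1 MB at a time, not 1 KB
std::vector<BYTE> buffer(CHUNK);
ULONG ulBytesRead = 0;
HRESULT hr = S_OK;
while (hr == S_OK) {
    hr = pStream->Read(&buffer[0], CHUNK, &ulBytesRead);
    // ... consume buffer[0 .. ulBytesRead) ...
}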
As an aside, you may also see a delay because the program has to load libraries. This doesn't happen with WinZip if the associated DLLs are already loaded.

GetDiskFreeSpaceEx with compressed disk

I want to get the free space on a compressed disk to show it to an end user. I'm using C++ and MFC on Windows 2000 and later. The Windows API offers the GetDiskFreeSpaceEx() function.
However, this function seems to return the "uncompressed" size of the data, which causes me a problem.
For example :
- Disk size is 100 GB
- Data size is 90 GB
- Compressed data size is 80 GB
The user will see that the disk is 90% full, but in reality, it is only 80% full.
EDIT
As Gleb pointed out, the function is returning the correct information.
So here is the new question : is there a way to get both the compressed size and the uncompressed one?
I think you would have to walk over all files, query with GetFileSize() and GetCompressedFileSize(), and sum them up. Use GetFileAttributes() to know whether a file is compressed, in case only parts of the volume are compressed, which may well be the case.
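A sketch of the per-file accounting; I use GetFileAttributesEx() instead of GetFileSize() to avoid opening a handle, you would call this from a FindFirstFile()/FindNextFile() walk over the volume, and error checks are omitted:

#include <windows.h>

// Adds one file's logical (uncompressed) and on-disk (compressed) sizes
// to the running totals.
void add_file(const char *path, ULONGLONG &logical, ULONGLONG &onDisk)
{
    DWORD hi = 0;
    DWORD lo = GetCompressedFileSize(path, &hi);  // size as stored on disk
    onDisk += ((ULONGLONG)hi << 32) | lo;

    WIN32_FILE_ATTRIBUTE_DATA fad;
    GetFileAttributesEx(path, GetFileExInfoStandard, &fad);
    logical += ((ULONGLONG)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;
}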
Hmm, so that's not a trivial operation. I suppose I must implement some mechanism to avoid querying all the file sizes all the time. I mean... if I have an 800 GB hard drive, it could take a very long time to get every file's size.
True.
Perhaps start off with a full scan (at application startup) and populate your custom data structure, e.g. a hash/map from file name to a file-data struct/class, then poll the drive with FindFirstChangeNotification() and update your internal structure accordingly.
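A sketch of that polling loop; the path and the notification filters are placeholders:

#include <windows.h>

HANDLE change = FindFirstChangeNotification(
    "C:\\data", TRUE,  // watch the directory tree recursively
    FILE_NOTIFY_CHANGE_SIZE | FILE_NOTIFY_CHANGE_FILE_NAME);

while (change != INVALID_HANDLE_VALUE) {
    if (WaitForSingleObject(change, INFINITE) == WAIT_OBJECT_0) {
        // ... re-query the changed files and update the hash/map ...
        FindNextChangeNotification(change);  // re-arm for the next change
    }
}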
You might also want to read about "Change Journals". I have never used them myself so don't know how they work, but might be worth checking out.
The function returns the amount of free space correctly. It can be demonstrated by using this simple program.
#include <stdio.h>
#include <windows.h>

int main() {
    ULARGE_INTEGER p1, p2, p3;
    GetDiskFreeSpaceEx(".", &p1, &p2, &p3);
    printf("%llu %llu %llu\n", p1.QuadPart, p2.QuadPart, p3.QuadPart);
    return 0;
}
After compressing a previously uncompressed directory, the free space grows.
So what are you talking about?