Truncating the file in c++

Truncating the file in c++ - c++

I was writing a program in C++ and wonder if anyone can help me with the situation explained here.
Suppose, I have a log file of about size 30MB, I have copied last 2MB of file to a buffer within the program.
I delete the file (or clear the contents) and then write back my 2MB to the file.
Everything works fine till here. But, the concern is I read the file (the last 2MB) and clear the file (the 30MB file) and then write back the last 2MB.
To much of time will be needed if in a scenario where I am copying last 300MB of file from a 1GB file.
Does anyone have an idea of making this process simpler?
When having a large log file the following reasons should and will be considered.
Disk Space: Log files are uncompressed plain text and consume large amounts of space.
Typical compression reduce the file size by 10:1. However a file cannot be compressed
when it is in use (locked). So a log file must be rotated out of use.
System resources: Opening and closing a file regularly will consume lots of system
resources and it would reduce the performance of the server.
File size: Small files are easier to backup and restore in case of a failure.
I just do not want to copy, clear and re-write the last specific lines to a file. Just a simpler process.... :-)
EDIT: Not making any inhouse process to support log rotation.
logrotate is the tool.

I would suggest an slightly different approach.
Create a new temporary file
Copy the required data from the original file to the temporary file
Close both files
Delete the original file
Rename the temp file to the same name as the original file
To improve the performance of the copy, you can copy the data in chunks, you can play around with the chunk size to find the optimal value.

If this is your file before:
-----------------++++
Where - is what you don't want and + is what you do want, the most portable way of getting:
++++
...is just as you said. Read in the section you want (+), delete/clear the file (as with fopen(... 'wb') or something similar and write out the bit you want (+).
Anything more complicated requires OS-specific help, and isn't portable. Unfortunately, I don't believe any major OS out there has support for what you want. There might be support for "truncate after position X" (a sort of head), but not the tail like operation you're requesting.
Such an operation would be difficult to implement, as varying blocksizes on filesystems (if the filesystem has a block size) would cause trouble. At best, you'd be limited to cutting on blocksize boundaries, but this would be harry. This is such a rare case, that this is probably why such a procudure is not directly supported.

A better approach might be not to let the file grow that big but rather use rotating log files with a set maximum size per log file and a maximum number of old files being kept.

If you can control the writing process, what you probably want to do here is to write to the file like a circular buffer. That way you can keep the last X bytes of data without having to do what you're suggesting at all.
Even if you can't control the writing process, if you can at least control what file it writes to, then maybe you could get it to write to a named pipe. You could attach your own program at the end of this named pipe that writes to a circular buffer as discussed.

Related

Why isn't lossless compression automatic on computers?

I was just wondering what could be the impact if, say, Microsoft decided to automaticly "lossless" compress every single file saved in a computer.
What are the pros? The cons? Is it feasible?

Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (An example: huffman coding). To access the data you have to uncompress it, and this translates to time and used memory, as to access a specific piece of the file you have to decompress it as a whole. While decompressing you ave to save the results somewhere and the most appropriate place is RAM.
Of course this wouldn't be a great problem (decompressing the whole file) if all of it needed to be read, and not even in the case of a stream reading it, but if a program wanted to write to the compressed file all the data would have to be compressed again, or at least a part of it.
As you can see, compressing files in the filesystem would reduce a lot the bandwidth available to applications - to read a single byte you have to read a chunk of the file and decompress it - and also require more RAM.

Parsing log file with multi-line entries

I'm working on parsing a reasonable sized log file (up to 50Mb, at which point it wraps) from a third-party application in order to detect KEY_STRINGs which happened within a specified time frame. A typical entry in this log file may look like this
DEBUG 2013-10-11#14:23:49 [PID] - Product.Version.Module
(Param 1=blahblah Param2=blahblah Param3 =blahblah
Method=funtionname)
String that we usually don't care about but may be KEY_STRING
Entries are separated by a blank line (\r\n at the end of the entry then \r\n before the next entry starts)
This is for a Windows specific implementation so doesn't need to be portable, and can be C/C++/Win32
Reading this line by line would be time consuming but has the benefit of being able to parse the timestamp and check if the entry is within the given timeframe before checking if the any of the KEY_STRINGs are present in the entry. If I read the file by chunks I may find a KEY_STRING but the chunk doesn't have the earlier timestamp, or the chunk border may even be in the middle of the KEY_STRING. Reading the whole file into memory and parsing it isn't an option as the application this is to be a part of currently has a relatively small footprint, so can't justify increasing this by ~10x just for parsing a file (even temporarily). Is there a way I can read the file by delimited chunks (specifically "\r\n\r\n")? Or is there another/better method I've not thought of?
Any help on this will be greatly appreciated!

One possible solution is to use memory-mapped files. I've personally never used them for anything but toy applications, but know some of the theory behind it.
Essentially they provide a way of accessing the contents of files as if they're memory, I believe acting in a similar way to virtual memory, so required parts will be paged-in as required, and paged-out at some point (you should read the documentation to work out the rules behind this).
In pseudocode (because we all like pseudocode), you would do something along these lines:
HANDLE file = CreateFile(...);
HANDLE file_map = CreateFileMapping(file, 0, PAGE_READONLY, 0, 0, ...);
LPVOID mem = MapViewOfFile(file_map, FILE_MAP_READ, 0, 0, 0);
// at this point you can use mem to access data in the mapped part of the file...
// for your code, you would perform parsing as if you'd read the file into RAM.
// when you're done, unmap and close the file:
UnmapViewOfFile(mem);
CloseHandle(file_map);
CloseHandle(file);
I apologise now for not giving advice most excellent, but instead encourage further reading - Windows provides a lot of functionality to handling your memory, and it's mostly worth a read.

Make sure you can't use the memory, perhaps you're being a little bit too "paranoid"? Premature optimization, and all that.
Read it line by line (since that makes it easier to separate entries) but wrap the line-reading with a buffered read, reading as much at a time as you're comfortable with, perhaps 1 MB. That minimizes disk I/O, which is often good for performance.

Assuming that (as would normally be the case) all the entries in the file are in order by time, you should be able to use a variant of a binary search to find the correct start and end points, then parse the data in between.
The basic idea would be to seek to the middle of the file, then read a few lines until you get to one starting with "DEBUG", then read the time-stamp. If it's earlier than the time you care about, seek forward to the 3/4ths mark. If later than the time you care about, seek back to the 1/4th. mark. Repeat the basic idea until you've found the beginning. Then do the same thing for the end time.
Once the amount by which you're seeking drops below a certain threshold (e.g., 64K) it's probably faster to seek to the beginning of the 64K-aligned block, and just keep reading forward from there than to do any more seeking.
Another possibility to consider would be whether you can do some work in the background to build an index of the file as its being modified, then use the index when you actually need a result. The index would (for example) read the time-stamp of each entry right after its written (e.g., using ReadDirectoryChangesW to be told when the log file is modified). It would translate the textual time stamp into, for example, a time_t, then store an entry in the index giving the time_t and the file offset for that entry. This should be small enough (probably under a megabyte for a 50-megabyte log file) that it would be easy to work with it entirely in memory.

What factors can lead to Win32 error 665 (file system limitation)?

I maintain an application that collects data from a datalogger and appends that data to the end of a binary file. The nature of this system is that the file can grow large (> 4 gigabytes) small steps at a time. On of the users of my application has seen cases on his NTFS partition where the attempts to append data fail. The error is being reported as a result of a call to fflush(). When this happens, the return value for GetLastError() is 665 (ERROR_FILE_SYSTEM_LIMITATION). MSDN gives the following description for this error
The requested operation could not be completed due to a file system limitation
A search for this error code on google gives results related to SQL server with VERY large files (tens of gigabytes) but, at present, our file is much smaller. This user has not been able to get the file to grow beyond 10 gigabytes. We can temporarily correct the situation when we do some operation (like copying the file) that forces some sort of rewrite in the file system. Unfortunately, I am not sure what is going on to put us in this condition in the first place. What specific conditions in an NTFS file system can lead to this particular error being reported on a call to fflush()?

This sounds like you've reached a limit in the fragmentation of the file. In other words, each flush is creating a new extent (fragment) of the file and the filesystem is having a hard time finding a place to keep track of the list of fragments. That would explain why copying the file helps -- it creates a new file with fewer fragments.
Another thing that would probably work is defragmenting the file (using Sysinternals's contig utility you may be able to do so while it's in use). You can also use contig to tell you how many fragments the file has. I'm guessing it's on the order of one million.
If you have to flush the file frequently and can't defrag it, something you can do is simply create the file fairly large in the first place (to allocate the space all at once) and then write to successive bytes of the file rather than append.
If you're brave (and your process has admin access), you can defragment the file yourself with a few API calls: http://msdn.microsoft.com/en-us/library/aa363911(v=VS.85).aspx

Fastest way to erase part of file in C++

I wonder which is the fastest way to erase part of a file in c++.
I know the way of write a second file and skip the part you want. But i think is slow when you work with big files.
What about database system, how they remove records so fast?

A database keeps an index, with metadata listing which parts of the file are valid and which aren't. To delete data, just the index is updated to mark that section invalid, and the main file content doesn't have to be changed at all.

Database systems typically just mark deleted records as deleted, without physically recovering the unused space. They may later reuse the space occupied by deleted records. That's why they can delete parts of a database quickly.
The ability to quickly delete a portion of a file depends on the portion of the file you wish to delete. If the portion of the file that you are deleting is at the end of the file, you can simply truncate the file, using OS calls.
Deleting a portion of a file from the middle is potentially time consuming. Your choice is to either move the remainder of the file forward, or to copy the entire file to a new location, skipping the deleted portion. Either way could be time consuming for a large file.

The fastest way I know is to open data file as a Persisted memory-mapped file and simple move over the part you don't need. Would be faster than moving to second file but still not too fast with big files.

Writing to the middle of the file (without overwriting data)

In windows is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible what approach/workaround is usually taken? Re-writing everything after the insertion point becomes prohibitive really quickly with big (ie, gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.

I'm unaware of any way to do this if the interim result you need is a flat file that can be used by other applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of file, since it's really just a sequential file.
But the italics are there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document, rather they appended a delta record to the end of the file. Then, when re-reading the file, it applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately to another application that doesn't understand the file format.
What I'm proposing there is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want but the time-expensive operation won't have as much of an impact.
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiences which would speed things up. You'd probably still have to rewrite to the end of the file but it would be happening at a lower level in the OS.
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.

The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a complete unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible -- in fact, it's now thoroughly insane to even contemplate, where it was once almost reasonable.

I'm not sure about the format of your file but you could make it 'record' based.
Write your data in chunks and give each chunk an id.
Id could be data offset in file.
At the start of the file you could
have a header with a list of ids so
that you can read records in
order.
At the end of 'list of ids' you could point to another location in the file (and id/offset) that stores another list of ids
Something similar to filesystem.
To add new data you append them at the end and update index (add id to the list).
You have to figure out how to handle delete record and update.
If records are of the same size then to delete you can just mark it empty and next time reuse it with appropriate updates to index table.

Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.

If using .NET 4 try a memory-mapped file if you have an editor-like application - might jsut be the ticket. Something like this (I didn't type it into VS so not sure if I got the syntax right):
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
new FileStream(#"C:\bigfile.dat", FileMode.Create),
"BigFileMemMapped",
1024 * 1024,
MemoryMappedFileAccess.ReadWrite);
MemoryMappedViewAccessor view = MemoryMapped.CreateViewAccessor();
int offset = 1000000000;
view.Write<ObjectType>(offset, ref MyObject);

I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total <16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternative data stream. Non-trivial of course.

I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to both insert or remove a lump of data to/from the middle of a file without either leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
Again, I know that this probably won't help the OP but I personally landed here searching for a Linix-specific answer. (There is no "Windows" word in the question, so web search engine saw no problem with sending me here.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Truncating the file in c++ - c++

A better approach might be not to let the file grow that big but rather use rotating log files with a set maximum size per log file and a maximum number of old files being kept.

Related

Why isn't lossless compression automatic on computers?

Parsing log file with multi-line entries

What factors can lead to Win32 error 665 (file system limitation)?

Fastest way to erase part of file in C++

Writing to the middle of the file (without overwriting data)

Categories

Resources