Remove Header From File - c++

I would like to be able to edit the contents of a binary file in C++, and remove all the contents up to a certain character position, which I already know, a bit like removing a header from a file.
One example file I have contains 1.3 million bytes, and I would like to remove the first 38,400, then save the file under the original name.
Currently, I do a buffered read to find the cut position for each file (the rules for where to cut are complex; it is not a simple search). Of course, I could then do another buffered read from that position into a new file and rename it over the original, or something along those lines.
But it feels quite heavy-handed to have to copy the entire file. Is there any way I can just get the OS (Windows Vista & upwards only - cross-platform not required) to relocate the start of the file and recycle those 38,400 bytes? Alas, I can find no way, hence I ask for your assistance :)
Thank you so much for any help you can offer.

No, there is no support for removing the head of a file in any OS I'm familiar with, including any version of Windows. You have to physically move the bytes you want to keep so they end up at the start. You could do it in place with some effort, but the easiest way is as you suggest: write a new file, then do the rename-and-delete dance.
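For illustration, a minimal sketch of that copy-then-rename dance with C++17 std::filesystem (the function name, temp-file naming, and chunk size here are arbitrary choices, not part of the answer above):

```cpp
#include <filesystem>
#include <fstream>
#include <vector>

namespace fs = std::filesystem;

// Copy everything from "offset" onward into a temporary file,
// then swap it in under the original name.
void RemoveHeader(const fs::path& path, std::streamoff offset)
{
    const fs::path tmp = path.string() + ".tmp";
    std::ifstream in(path, std::ios::binary);
    std::ofstream out(tmp, std::ios::binary | std::ios::trunc);

    in.seekg(offset);
    std::vector<char> buf(1 << 20);            // copy in 1 MiB chunks
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
        out.write(buf.data(), in.gcount());

    in.close();
    out.close();
    fs::rename(tmp, path);                     // replaces the original file
}
```

A real version should also check the streams for errors before committing the rename.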

What you're looking for is what I call a "lop" operation, which is kind of like a truncate, but at the front of the file. I posted a question about it some time back. Unfortunately, there is no such thing available.

Simply overwrite the file in place, in fixed-size blocks, copying from the cut position back to the beginning of the file:
const size_t cBufSize = 10000000; // Adjust as you wish
const UINT_PTR ofs = 38400;
UINT_PTR readPos = ofs;
UINT_PTR writePos = 0;
UINT_PTR readBytes = 0;
1. BYTE *buf = // Allocate a buffer of cBufSize bytes
2. Seek to readPos
3. Read cBufSize bytes into buf; store the number of bytes actually read in readBytes
4. Seek to writePos
5. Write readBytes bytes from buf
6. Advance readPos and writePos by readBytes
7. Repeat from step 2 until you reach end of file
8. Shrink the file by ofs bytes
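The steps above could be sketched in standard C++ like this (using std::filesystem::resize_file for the final shrink; a real version should check each I/O call, and on 32-bit long the fseek casts limit file size):

```cpp
#include <cstdio>
#include <filesystem>
#include <vector>

// Shift everything from "ofs" down to the start of the file in place,
// then shrink the file by "ofs" bytes.
void LopFront(const char* path, std::size_t ofs)
{
    const std::size_t cBufSize = 1 << 20;     // block size; adjust as you wish
    std::vector<unsigned char> buf(cBufSize);

    std::FILE* f = std::fopen(path, "r+b");
    if (!f) return;

    std::size_t readPos = ofs, writePos = 0;
    for (;;) {
        std::fseek(f, static_cast<long>(readPos), SEEK_SET);
        std::size_t n = std::fread(buf.data(), 1, buf.size(), f);
        if (n == 0) break;                    // end of file
        std::fseek(f, static_cast<long>(writePos), SEEK_SET);
        std::fwrite(buf.data(), 1, n, f);
        readPos += n;
        writePos += n;
    }
    std::fclose(f);
    std::filesystem::resize_file(path, writePos); // drop the trailing ofs bytes
}
```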

Related

How to reduce the size of a fstream file in C++

What is the best way to cut the end off of a fstream file in C++ 11
I am writing a data persistence class to store audio for my audio editor. I have chosen to use fstream (possibly a bad idea) to create a random-access binary read/write file.
Each time I record a little sound into my file I simply tack it onto the end of this file. Another internal data structure/file contains pointers into the audio file and keeps track of edits.
When I undo a recording action and then do something else, the last bit of the audio file becomes irrelevant. It is not referenced in the current state of the document and you cannot redo yourself back to a state where you can ever see it again. So I want to chop this part of the file off and start recording at the new end. I don't need to cut out bits in the middle, just off the end.
When the user quits this file will remain and be reloaded when they open the project up again.
In my application I expect the user to do this all the time and being able to do this might save me as much as 30% of the file size. This file will be long, potentially very, very long, so rewriting it to another file every time this happens is not a viable option.
Rewriting it when the user saves could be an option but it is still not that attractive.
I could stick a value at the start that says how long the file is supposed to be, and then overwrite the end to recycle the space. But if I wanted to continually update the data store file in case of a crash, this would mean rewriting the start over and over again, and I worry that this might be bad for flash drives. I could also recompute the end of the useful part of the file on load, by analyzing the pointer file, but in the meantime I would potentially be wasting all that space, and that is complicated.
Is there a simple call for this in the fstream API?
Am I using the wrong library? Note that I want to stick to something generic; the STL is preferred, so I can keep the code as cross-platform as possible.
I can’t seem to find it in the documentation and have looked for many hours. It is not the end of the earth but would make this a little simpler and potentially more efficient. Maybe I am just missing it somehow.
Thanks for your help
Andre’
Is there a simple call for this in the fstream API?
If you have a C++17 compiler, then use std::filesystem::resize_file. In previous standards there was no such thing in the standard library.
With older compilers ... on Windows you can use SetFilePointer or SetFilePointerEx to set the current position to the size you want, then call SetEndOfFile. On Unixes you can use truncate or ftruncate. If you want portable code, you can use Boost.Filesystem. Migrating from it to std::filesystem later is simplest, because std::filesystem was mostly specified based on it.
If you have a variable that contains your current position in the file, you could seek back by the length of your "unneeded chunk" and just continue to write from there.
// Somewhere at the beginning of your code:
std::ofstream file;
file.open("/home/user/my-audio/my-file.dat", std::ios::binary);
// ...... long story of writing data .......
// Let's say we are at one million bytes now (in the file)
int current_file_pos = 1000000;
// Your last chunk size:
int last_chunk_size = 12345;
// The chunk that you are saving
char *last_chunk = get_audio_chunk_to_save();
// Writing the chunk
file.write(last_chunk, last_chunk_size);
// Moving the pointer:
current_file_pos += last_chunk_size;
// Let's undo it now!
current_file_pos -= last_chunk_size;
file.seekp(current_file_pos);
// Now you can write new chunks from the place where you were
// before writing and undoing the last one!
// .....
// When you finally want to write the file to disk, you just close it
file.close();
// And then truncate it to the size of current_file_pos
truncate("/home/user/my-audio/my-file.dat", current_file_pos);
Unfortunately, you'll have to write a cross-platform truncate function that calls SetEndOfFile on Windows and truncate on Linux. It's easy enough using preprocessor macros.

Write at specific position at a file with open()

Hello, I am trying to simulate two programs that send and receive files over the network in C++, something like a client and a server. To begin with, I have to split a file into pages of 4096 bytes and send them to the other program, which recreates the file. I send and receive files over the network with write and read. So in the client program I must create a function that receives the packages and puts them into a file, and I cannot figure out a way to do that. For example, if a file has 2 pages, I must create another file using these 2 pages. Also, I cannot know whether they arrive in order, so I must create the file and put each page in the right position.
/*consider the connections are ok and the file's name is at char* name*/
int file=open(name,O_CREAT | O_WRONLY,0666);
char buffer[4096];
int pagenumber;
for(int i=0;i<page_number;i++){
read(socket,&pagenumber,sizeof(int));
read(socket,buffer,sizeof(int));
write(file(pagenumber*4096),buffer,4096);
}
This code works for pagenumber=0 but for pagenumber=1 nothing happens! Can you help me? Thanks in advance!
To write at a certain position in the file you must use lseek
off_t lseek(int fd, off_t offset, int whence);
It takes the descriptor, the offset and the final parameter is a constant in these:
SEEK_SET The offset is set to offset bytes.
SEEK_CUR The offset is set to its current location plus offset bytes.
SEEK_END The offset is set to the size of the file plus offset bytes.
If you know how big the file is going to be, you can use ftruncate for it.
int ftruncate(int fd, off_t length);
Anyway, even if you create a huge file, since most filesystems on Linux support sparse files, the actual usage on disk will be the sum of the blocks that have been written.
The first argument to write() is a file descriptor, which you obtained with open(). So it should be
int file = open(...);
...
write(file,buffer,4096);
not
write(file(pagenumber*4096),buffer,4096);
Regarding the question as to how to write at a specific position: you can prepare the file beforehand with write, and then use lseek() to position the file where you want to write. For a description of lseek you can look here.
Mario, first of all, let's not rely on garbage in 'pagenumber' to continue the loop (which is what happens when the loop boundary condition is checked here for the first time). Now, if you are writing page number '0' and then the page following it, pagenumber will be initialized to 0 and your loop will exit. Also, please check the number of bytes written and read by the write and read system calls, respectively.
try pwrite
int file=open(name,O_CREAT | O_WRONLY,0666);
char buffer[4096];
int pagenumber;
for(int i=0;i<page_number;i++){
    read(socket,&pagenumber,sizeof(int));
    read(socket,buffer,sizeof(buffer));
    pwrite(file,buffer,4096,4096*(off_t)pagenumber);
}

How to cut a file without using another file?

Is it possible to delete part of a file (let's say from the beginning to its half) without having to use another file?
Thanks!
Yes, it is possible, but still you'll have to rewrite most of the file.
The rough idea is as follows:
open the file
beg = find the start of the fragment to be removed
len = length of the fragment to be removed
blocksize = 4096 -- example block size, may be any
datamoved = 0
do {
    fseek(beg + len + datamoved);
    if( endoffile ) break; -- finished!
    actualread = fread(buffer, blocksize)
    fseek(beg + datamoved)
    fwrite(buffer, actualread)
    datamoved += actualread
}
and the last step after the loop is to 'truncate' the file to size beg + datamoved. If the underlying filesystem does not handle a 'truncate file' operation, then you have to rewrite, but most filesystems and libraries do support it.
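That loop could be sketched with std::fstream like so (std::filesystem::resize_file performs the final truncation; error handling elided):

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Remove "len" bytes starting at offset "beg" by sliding the tail down,
// then truncating the file.
void RemoveFragment(const std::string& path, std::streamoff beg, std::streamoff len)
{
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    std::vector<char> buffer(4096);            // example block size
    std::streamoff datamoved = 0;

    for (;;) {
        f.seekg(beg + len + datamoved);
        f.read(buffer.data(), buffer.size());
        std::streamsize actualread = f.gcount();
        if (actualread == 0) break;            // finished
        f.clear();                             // clear eof so we can seek and write again
        f.seekp(beg + datamoved);
        f.write(buffer.data(), actualread);
        datamoved += actualread;
    }
    f.close();
    std::filesystem::resize_file(path, static_cast<std::uintmax_t>(beg + datamoved));
}
```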
The short answer is that no, most file systems don't attempt to support operations like that.
That leaves you with two choices. The obvious one is to create a copy of the data, leaving out the parts you don't want. You can do this either in-place (i.e., moving the data around in the same file) or by using an auxiliary file, typically copying the data to the new file, then doing something like renaming the new file to the old name.
The other major choice is to simply restructure your file and data so you don't have to get rid of the old data at all. For example, if you want to keep the most recent N amount of data from a process, you might structure (most of) the file as a circular buffer, with a couple of "pointers" at the beginning that tell you the head and tail points, so you know where to read data from and write data to. With a structure like this, you don't erase or remove the old data; you just overwrite it as needed.
If you have enough memory, read its contents fully to the memory, copy it back to the front of the file, and truncate the file.
If you do not have enough memory, copy in blocks, and only when you are done truncate the file.

Using STL to search for raw bytes and replace it in a file, what is the best / correct way

I'd like to use the STL (C++0x in vs2010) to open a file and binary search through it and then replace when a match is found.
I can use fstream to open a file, but I'm a bit confused as to how to search through it.
Reading one byte at a time isn't good, so it should read a block into a buffer and then search the buffer; but is this functionality already in the STL?
I want the buffering part to be automatic: when the search reaches the end of the buffer, it should automatically read in the next block from the file as if it were reading it directly (using the same file offsets, etc.).
The other problem is when it finds a match exactly how should it update the file.
I know that this file functionality exists in Windows using CreateFileMapping and MapViewOfFile, but I want to use the STL to make my code portable, and by using the STL also more flexible. Using these Windows functions you can read the file without worrying about buffering, etc.; they will even update the file when you change the data. This is the functionality I'm after, but using the STL.
Update: I've changed 'binary search' to 'raw byte search' to clear up the confusion, sorry for that.
An example of the raw byte search function ( if you have a better please do tell )
// NOTE: This function doesn't support big files > 4 gb
// Returns the offset of the first match, or dwLowSize if there is none
// (returning 0 for "not found" would be ambiguous with a match at offset 0).
DWORD_PTR RawByteSearch( PBYTE pBaseAddress, DWORD dwLowSize, PBYTE pSignature, DWORD sigSize )
{
    PBYTE pBasePtr = pBaseAddress;
    PBYTE pEndPtr = pBaseAddress + dwLowSize;
    DWORD_PTR i;
    while ( pBasePtr + sigSize <= pEndPtr ) // don't compare past the end of the buffer
    {
        for (i = 0; i < sigSize; ++i)
        {
            if ( pSignature[i] != pBasePtr[i] )
                break;
        }
        if ( i == sigSize ) // Found a match
        {
            return pBasePtr - pBaseAddress;
        }
        ++pBasePtr; // Next byte
    }
    return dwLowSize; // No match
}
std::search already does what you need.
std::search(pBasePtr, pEndPtr, pSignature, pSignature+sigSize)
Out of completeness, you don't need to use memory mapping (although that will likely be the most efficient).
A 100% portable solution is to instead use the standard library stream iterators:
std::ifstream file("data.bin", std::ios::binary); // placeholder name; open in binary mode
auto match = std::search(std::istreambuf_iterator<char>(file.rdbuf()) // search from here
           , std::istreambuf_iterator<char>() // ... to here
           , pSignature // looking for a sequence starting with this
           , pSignature+sigSize); // and ending with this
(Strictly speaking, std::search requires forward iterators, and istreambuf_iterator is only an input iterator, so some implementations may reject this.)
You should avoid the term "binary search". This is a term used for searching in sorted data in logarithmic time, what you are doing is a linear search (in binary data). The C++ standard library has a function for that, it's std::search. It would work something like:
PBYTE pos = std::search(pBaseAddress, pBaseAddress+dwLowSize, pSignature, pSignature+sigSize);
P.S. Hungarian Notation sucks ;)
std::fstream has a function called read that can read a block into a buffer. Am I misunderstanding something?
Edit: I just did a quick search on memory-mapped file implementations. Maybe this can be helpful for you:
http://www.boost.org/doc/libs/1_40_0/libs/iostreams/doc/classes/mapped_file.html
Binary search actually means starting at the midway point and comparing to slice your search space in two, then going to the midway point of the new search space to divide it further. As your search space halves with every iteration, the operation is O(log N), which is therefore in general a lot faster.
You may wish to do that here by building up an index.
A lot of this depends how your file is formatted. Does it consist of fixed-length "records" with a key at the start of the record and with that key sorted?
If so, you can actually scan every Nth record to create an index in memory, then use binary search to locate where approximately in the file the record would be, then load just that section of the file into memory.
For example, let's say each record is 128 bytes, the key is 4 bytes and there are 32K records, so the file is about 4MB big. If you load every 64th record, you are reading 512 of them and storing 2K of data in memory. When you search your local index you then load an 8K portion into memory and search in that for the actual record you wish to change, and modify it appropriately.
If your records are not all the same length, you will have a problem if your modification tramples over the next record.

Fastest way to find the number of lines in a text (C++)

I need to read the number of lines in a file before doing some operations on that file. I tried reading the file and incrementing a line_count variable at each iteration until reaching EOF, but it was not that fast in my case. I used both ifstream and fgets; they were both slow. Is there a hacky way to do this, which is also used by, for instance, BSD, the Linux kernel, or Berkeley DB (maybe by using bitwise operations)?
The number of lines is in the millions in that file and it keeps getting larger; each line is about 40 or 50 characters. I'm using Linux.
Note:
I'm sure there will be people who might say "use a DB, idiot". But briefly, in my case, I can't use a db.
The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.
As your current file size appears to be about 60Mb, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks, say of size 1Mb. You also say that a database is out of the question, but it really does look to be the best long-term solution.
Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:
#include <iostream>
#include <fstream>
#include <vector>
#include <ctime>
using namespace std;

unsigned int FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

unsigned int CountLines( const vector <char> & buff, int sz ) {
    int newlines = 0;
    const char * p = &buff[0];
    for ( int i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1 ) {
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        std::vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( int cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    cout << time(0) - now << endl;
}
Don't use C++ STL strings and getline (or C's fgets); use just C-style raw pointers, and either block-read in page-size chunks or mmap the file.
Then scan each block at the native word size of your system (i.e. either uint32_t or uint64_t) using one of the magic 'SIMD Within A Register' (SWAR) algorithms for testing the bytes within a word. An example is here; the loop with 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte while matching a regex on each line of a file.)
If the file is only a few tens or a hundred or so megabytes, and it keeps growing (ie something keeps writing to it), then there's a good likelihood that linux has it cached in memory, so it won't be disk IO limited, but memory bandwidth limited.
If the file is only ever being appended to, you could also remember the number of lines
and previous length, and start from there.
It has been pointed out that you could use mmap with C++ stl algorithms, and create a functor to pass to std::foreach. I suggested that you shouldn't do it not because you can't do it that way, but there is no gain in writing the extra code to do so. Or you can use boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for this was much, much slower, and the question was about speed not style.
You wrote that it keeps getting larger.
This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach:
Parse to the end of file.
Remember the line count and the offset of EOF.
When the file grows, fseek to the offset, parse to EOF, and update the line count and the offset.
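That incremental approach might be sketched like this (assuming the file is only ever appended to):

```cpp
#include <fstream>
#include <string>

// Incremental line count: remember how far we scanned last time and
// only count the bytes appended since then.
struct LineCounter
{
    std::streamoff offset = 0;   // how far we have scanned
    long long      lines  = 0;   // '\n' characters seen so far

    void Update(const std::string& path)
    {
        std::ifstream f(path, std::ios::binary);
        f.seekg(offset);
        char buf[65536];
        // comma operator: read a chunk, then test how much we actually got
        while (f.read(buf, sizeof buf), f.gcount() > 0) {
            for (std::streamsize i = 0; i < f.gcount(); ++i)
                if (buf[i] == '\n') ++lines;
            offset += f.gcount();
        }
    }
};
```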
There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:
What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.
Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.
Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).
You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?
Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)
If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.
The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
Remember that all fstreams are buffered, so they in effect already read in chunks and you do not have to recreate this functionality. All you need to do is scan the buffer. Don't use getline() though, as this will force you to size a string. So I would just use the STL std::count_if and stream iterators.
#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <functional>

struct TestEOL
{
    bool operator()(char c)
    {
        last = c;
        return last == '\n';
    }
    char last = '\n'; // so an empty file yields a count of zero
};

int main()
{
    std::ifstream file("Plop.txt", std::ios::binary);
    TestEOL test;
    // Pass the functor by reference so test.last reflects the final character.
    std::size_t count = std::count_if(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>(),
                                      std::ref(test));
    if (test.last != '\n') // If the last character checked is not '\n'
    {                      // then the last line in the file has not been
        ++count;           // counted. So increment the count so we count
    }                      // the last line even if it is not '\n' terminated.
    std::cout << count << "\n";
}
It isn't slow because of your algorithm; it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.
However, while there is no faster algorithm, there is a faster mechanism called a "memory-mapped file". There are some drawbacks to mapped files and they might not be appropriate for your case, so you'll have to read about them and figure that out yourself.
Memory-mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.
You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.
However, there are a couple of possibilities which you may want to consider.
1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.
A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instruction since the C runtime library will be optimized anyway - a simple strchr() should do it.
2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.
For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log) and work out how many lines there are at that point (storing it in x_20090101_1022.count), then start a new x.log log file. Characteristics of log files mean that this dated section that was created will never change, so you will never have to recalculate the number of lines.
To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.
So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
Reading large amounts of data from the harddrive is slow. There's no way around that.
If your file only grows, then Ludwig Weinzierl's approach above is the best solution if you do not have control of the writers. Otherwise, you can make it even faster: increment the counter by one each time a line is written to the file. If multiple writers may try to write to the file simultaneously, then make sure to use a lock; locking your existing file is enough. The counter can be 4 or 8 bytes written in binary to a file under /run/<your-prog-name>/counter (which is RAM, so dead fast).
Ludwig Algorithm
Initialize offset to 0
Read file from offset to EOF counting '\n' (as mentioned by others, make sure to use buffered I/O and count the '\n' inside that buffer)
Update offset with position at EOF
Save counter & offset to a file or in a variable if you only need it in your software
Repeat from "Read file ..." on a change
This is actually how various pieces of software that process log files work (fail2ban comes to mind).
The first time, it has to process a huge file. Afterward, the unprocessed part is very small and thus goes very fast.
Proactive Algorithm
When creating the files, reset counter to 0.
Then each time you receive a new line to add to the file:
Lock file
Write one line
Load counter
Add one to counter
Save counter
Unlock file
This is very close to what database systems do, so a SELECT COUNT(*) FROM table on a table with millions of rows returns instantly. Databases also do that per index. So if you add a WHERE clause which matches a specific index, you also get the total instantly. Same principle as above.
Personal note: I see a huge number of Internet software systems which work backward. A watchdog makes sense for various things in a software environment. However, in most cases, when something of importance happens, you should send a message at the time it happens, not use the backward concept of checking logs to detect that something bad just happened.
For example, you detect that a user tried to access a website and entered the wrong password 5 times in a row. You want to send an instant message to the admin to make sure there wasn't a 6th time which was successful, and the hacker can now see all your users' data... If you use logs, the "instant message" is going to be late by seconds if not minutes.
Don't do processing backward.