Appending to binary files - C++

I have to write numerical data to binary files. Since some of the data vectors I deal with can be several gigabytes in size, I have learned not to use C++ iostreams. Instead I want to use C's FILE*. I'm running into a problem right off the bat: I need to write some metadata to the front of the binary file, and since some of the metadata is not known at first, I need to append it at the appropriate offsets in the file as I get it.
For example, let's say I have to enter a uint16_t representation for the year, month, and day, but first I need to skip the first entry (a uint32_t value for precision).
I don't know what I'm doing wrong, but I can't seem to append to the file with "ab".
Here's an example of what I wrote:
#include <cstdio>
#include <cstdint>
uint16_t year = 2001;
uint16_t month = 8;
uint16_t day = 23;
uint16_t dateArray[] = {year, month, day};
FILE *fileStream;
fileStream = fopen("/Users/mmmmmm/Desktop/test.bin", "wb");
if (fileStream) {
    // skip the first 4 bytes (reserved for the uint32_t precision value)
    fseek(fileStream, 4, SEEK_SET);
    fwrite(dateArray, sizeof(dateArray[0]), sizeof(dateArray) / sizeof(dateArray[0]), fileStream);
    fclose(fileStream);
}
// loops and other code to prepare and gather other parameters
// now append the front of the file with the precision.
uint32_t precision = 32;
FILE *fileStream2;
fileStream2 = fopen("/Users/mmmmmm/Desktop/test.bin", "ab");
if (fileStream2) {
    // Get back to the top of the file
    rewind(fileStream2);
    fwrite(&precision, sizeof(precision), 1, fileStream2);
    fclose(fileStream2);
}
The appended data does not get written. If I change the mode to "wb", the file is overwritten. I was able to get it to work with "r+b", but I don't understand why; I thought "ab" would be the proper mode. Also, should I be using buffers, or is this a sufficient approach?
Thanks for the advice.
BTW, this is on Mac OS X.

Due to the way that hard drives and filesystems work, inserting bytes into the middle of a file is really slow and should be avoided, especially when dealing with multi-gigabyte files. If your metadata is stored in a fixed-size header, just make sure that there's enough space for it before you start writing the other data. If the header is variably sized, then chunk up the header: put 1 KB of header space at the beginning, and reserve 8 bytes of it to contain the offset of the next header chunk, or 0 for EOF. Then, when that chunk gets filled up, just add another chunk to the end of the file and write its offset into the previous header.
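A minimal sketch of that chunked-header layout; the 1 KB chunk size and the placement of the next-chunk offset in the last 8 bytes are illustrative assumptions, not a fixed format:

#include <cstdio>
#include <cstdint>
#include <cstring>

// Write one 1 KB header chunk: metadata payload first, then an 8-byte offset
// to the next header chunk in the last 8 bytes (0 means no further chunks).
void writeHeaderChunk(FILE *f, const void *meta, size_t metaLen, uint64_t nextChunkOffset) {
    char chunk[1024] = {0};
    memcpy(chunk, meta, metaLen); // metaLen must be <= 1016
    memcpy(chunk + 1016, &nextChunkOffset, sizeof nextChunkOffset);
    fwrite(chunk, 1, sizeof chunk, f);
}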
As for the technical I/O, use the fopen() modes r+b, w+b, or a+b, depending on your need. r+b opens the file for reading and writing, positioned at the first byte; it fails if the file doesn't exist. w+b also opens for reading and writing, but creates the file if it doesn't exist and truncates it to zero length if it does. a+b opens for reading and appending, creating the file if necessary; all writes go to the end of the file regardless of where you seek.
You can navigate the file with fseek() and rewind(). rewind() moves the file pointer back to the beginning of the file; fseek() moves it to any specified position.

"r+b" means you can read and write to any position in the file. In your second code block the rewind() call set the current position to byte 0 and the write is done at this position.
If you use "a+b", this also means read and write access, but the writes are all at the end of the file so you cannot position at byte 0, unless a new empty file is created.
To re-access the file at a specific byte, just use fseek().
fseek ( fileStream , 0 , SEEK_SET ); - positions to the precision value (a 4-byte uint32_t)
fseek ( fileStream , 4 , SEEK_SET ); - positions to the year value (each date field is a 2-byte uint16_t)
fseek ( fileStream , 6 , SEEK_SET ); - positions to the month value
fseek ( fileStream , 8 , SEEK_SET ); - positions to the day value
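Putting that together, the second block from the question works once the mode is changed to "r+b" (path kept from the question):

#include <cstdio>
#include <cstdint>

uint32_t precision = 32;
FILE *fileStream2 = fopen("/Users/mmmmmm/Desktop/test.bin", "r+b");
if (fileStream2) {
    fseek(fileStream2, 0, SEEK_SET); // byte 0: the reserved precision slot
    fwrite(&precision, sizeof(precision), 1, fileStream2);
    fclose(fileStream2);
}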

With such large files, it would be highly inefficient to rewrite gigs just to prepend a few bytes.
It would be much better to create a small companion file of the metadata you need for each file, and only add the metadata fields to the beginning of the files if they are to be rewritten anyway as part of an edit.
This is because prepending to a file is so expensive on most file systems.
NTFS supports alternate data streams: a second, named data channel on a file that goes unseen by almost all programs except OS internals such as file managers and security scanning programs. You could easily cook up a program to add your metadata to such a stream without needing to rewrite gigabytes on disk every single time.
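For illustration only (this is Windows/NTFS-specific, so it does not apply to the asker's Mac OS X setup): an alternate data stream is addressed by appending ":streamname" to the path, and the stream name below is an arbitrary example:

#include <cstdio>
#include <cstdint>

FILE *meta = fopen("test.bin:metadata", "wb"); // ADS syntax, NTFS only
if (meta) {
    uint32_t precision = 32;
    fwrite(&precision, sizeof precision, 1, meta); // tiny write, no gigabyte rewrite
    fclose(meta);
}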

Related

Knowing current compressed file size using gzwrite (zlib)

I'm using zlib from C++.
A quote from
http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/zlib-gzwrite-1.html regarding the gzwrite() function:
The gzwrite() function shall write data to the compressed file referenced by file, which shall have been opened in a write mode (see gzopen() and gzdopen()). On entry, buf shall point to a buffer containing len bytes of uncompressed data. The gzwrite() function shall compress this data and write it to file. The gzwrite() function shall return the number of uncompressed bytes actually written.
I interpret this as: the return value will NOT tell me how much larger the file became when writing, only how much data was compressed into the file.
The only way to know how large the file is would then be to close it, and read the size from the file system. I have a requirement to only continue to write to the file until it reaches a certain size. Can this be achieved without closing the file?
A workaround would be to write until the uncompressed size reaches my limit and then close the file, read the size from file system and update my best guess of file size based on that, and then re-open the file and continue writing. This would make me close and open the file a few times towards the end (as I'm approaching the size limit).
Another workaround, which would give more of an estimate (which is not what I want, really), would be to write until the uncompressed size reaches the limit, close the file, read the file size from the file system, and calculate the compression ratio so far. Then I can use this compression ratio to calculate a new limit for the uncompressed file size at which the compression should get me down to the limit for the compressed file size. If I repeat this, the estimate would improve, but again, it's not what I'm looking for.
Are there better options?
The preferred option would be if zlib could tell me the compressed file size while the file is still open. I don't see why this information would not be available inside zlib at this point, since compression happens when I call gzwrite and not when I close the file.
zlib provides the function gzoffset(), which does exactly what you're asking.
If for some reason you are stuck with a version of zlib that is more than about eight years old, from before gzoffset() was added, then this is easy to do with gzdopen(). You open the output file with fopen() or open(), obtain the file descriptor (using fileno() and dup() if you used fopen()), and then provide that descriptor to gzdopen(). Then you can use ftell() or lseek() on the underlying file at any time to see how much has been written. Be careful not to double-close the descriptor; see the documentation for gzdopen().
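A minimal sketch of the gzoffset() route (the file name and buffer contents are placeholders):

#include <zlib.h>
#include <cstdio>

int main() {
    char buf[4096] = {0};
    gzFile zf = gzopen("out.gz", "wb");
    gzwrite(zf, buf, sizeof buf);
    // compressed bytes written so far; zlib's internal buffering means this
    // can lag slightly behind the final on-disk size until gzclose()
    printf("compressed offset: %ld\n", (long)gzoffset(zf));
    gzclose(zf);
    return 0;
}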
You can work around this issue by using a pipe. The idea is to write the compressed data into a pipe, then read it back from the other end of the pipe, count it, and write it to the actual file.
To set this up, first open the file to write to with a plain open(). Then create a pipe via pipe2() and initialize zlib by passing the pipe's write descriptor to gzdopen():
#define _GNU_SOURCE // for pipe2() and splice()
#include <fcntl.h>
#include <unistd.h>
#include <zlib.h>
int out = open("/path/to/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
int p[2];
pipe2(p, O_NONBLOCK);
gzFile zFile = gzdopen(p[1], "w"); // p[1] is the pipe's write end
You can now write the data first to the pipe and then splice it from the pipe's read end to the out file:
gzwrite(zFile, buf, 1024); // or any other length
ssize_t bytesWritten = 0;
do {
    bytesWritten = splice(p[0], NULL, out, NULL, 1024, SPLICE_F_NONBLOCK | SPLICE_F_MORE);
} while (bytesWritten == 1024);
As you can see, you now have bytesWritten to tell you how much data was actually written. Simply sum it up in another variable and stop splicing as soon as you have written as much data as you need to. (Alternatively, splice it in one go: write everything to zFile, then splice once, passing the amount of data you are allowed to store as the fifth parameter. If you want to avoid compressing unnecessary data, do it in chunks as shown above.)
A note on splice(): it is Linux-specific, and is basically just a very efficient copy. You can always replace it with a simple read-and-write combo, i.e. read data from p[0] into a buffer and then write the data from that buffer into out; splice is just faster and less code.
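The portable read-and-write fallback would look something like this, reusing p and out from the snippet above:

char tmp[1024];
ssize_t n;
while ((n = read(p[0], tmp, sizeof tmp)) > 0)
    write(out, tmp, n); // copy the compressed bytes to the real file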

Remove Header From File

I would like to be able to edit the contents of a binary file in C++, and remove all the contents up to a certain character position, which I already know, a bit like removing a header from a file.
One example file I have contains 1.3 million bytes, and I would like to remove the first 38,400, then save the file under the original name.
Currently, I've been doing a buffered read to find the position for each file (the rules for where to cut the file are complex, and it is not a simple search). Of course, I could do another buffered read from that position, outputting into a new file, then do a rename, or something along those lines.
But it feels quite heavy-handed to have to copy the entire file. Is there any way I can just get the OS (Windows Vista & upwards only - cross-platform not required) to relocate the start of the file, and recycle those 38,400 bytes? Alas, I can find no way, hence I ask for your assistance :)
Thank you so much for any help you can offer.
No, there is no support for removing the head of a file in any OS I'm familiar with, including any version of Windows. You have to physically move the bytes you want to keep so they end up at the start. You could do it in place with some effort, but the easiest way is as you suggest: write a new file, then do the rename-and-delete dance.
What you're looking for is what I call a "lop" operation, which is kind of like a truncate, but at the front of the file. I posted a question about it some time back. Unfortunately, there is no such thing available.
Simply overwrite the file (in fixed-size memory blocks) from the cut position back to the beginning of the file:
const size_t cBufSize = 10000000; // Adjust as you wish
const UINT_PTR ofs = 38400;
UINT_PTR readPos = ofs;
UINT_PTR writePos = 0;
UINT_PTR readBytes = 0;
1. BYTE *buf = // Allocate cBufSize memory buffer
2. Seek to readPos
3. Read cBufSize bytes into buf, remembering the actual number of bytes read in readBytes
4. Seek to writePos
5. Write readBytes bytes from buf
6. Increase readPos and writePos by readBytes
7. Repeat from step 2 until you reach the end of the file
8. Reduce the file size by ofs bytes
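A runnable sketch of those steps, assuming the Microsoft CRT (the question is Windows-only); _chsize_s() does the final truncation, and on POSIX you would use ftruncate() instead. The file name is illustrative:

#include <cstdio>
#include <cstdlib>
#include <io.h> // _chsize_s(), _fileno()

const size_t cBufSize = 10000000;
const long ofs = 38400;

FILE *f = fopen("data.bin", "r+b");
char *buf = (char *)malloc(cBufSize);
long readPos = ofs, writePos = 0;
size_t readBytes;
do {
    fseek(f, readPos, SEEK_SET);
    readBytes = fread(buf, 1, cBufSize, f); // short read at end of file
    fseek(f, writePos, SEEK_SET);
    fwrite(buf, 1, readBytes, f);
    readPos += (long)readBytes;
    writePos += (long)readBytes;
} while (readBytes == cBufSize);
fflush(f);
_chsize_s(_fileno(f), writePos); // step 8: cut the now-duplicated tail
fclose(f);
free(buf);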

Write at a specific position in a file with open()

Hello, I am trying to simulate two programs that send and receive files in C++ over the network, something like a client and a server. To begin with, I have to split a file into pages of 4096 bytes and send them to the other program in order to recreate the file there. I send and receive data through the network with write and read. In the client program I must create a function that receives the packages and puts them into a file, but I cannot figure out a way to put the packages into the file at the right place. For example, if a file has 2 pages, I must create another file using these 2 pages. Also, I cannot know whether they arrive in order, so I must create the file and put each page at its correct position.
/* assume the connections are OK and the file's name is in char *name */
int file = open(name, O_CREAT | O_WRONLY, 0666);
char buffer[4096];
int pagenumber;
for(int i=0;i<page_number;i++){
read(socket,&pagenumber,sizeof(int));
read(socket,buffer,sizeof(int));
write(file(pagenumber*4096),buffer,4096);
}
This code works for pagenumber=0 but for pagenumber=1 nothing happens! Can you help me? Thanks in advance!
To write at a certain position in the file you must use lseek():
off_t lseek(int fd, off_t offset, int whence);
It takes the descriptor, the offset, and a final parameter that is one of these constants:
SEEK_SET The offset is set to offset bytes.
SEEK_CUR The offset is set to its current location plus offset bytes.
SEEK_END The offset is set to the size of the file plus offset bytes.
If you know how big the file is going to be, you can use ftruncate() to pre-size it:
int ftruncate(int fd, off_t length);
Anyway, even if you create a huge file, since most filesystems on Linux support sparse files, the actual space used on disk will only be the sum of the blocks that have been written.
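A short sketch combining the two calls; out.bin and the values of page_count, pagenumber, and buffer are illustrative stand-ins for the surrounding protocol code:

#include <fcntl.h>
#include <unistd.h>

int page_count = 2, pagenumber = 1; // illustrative values
char buffer[4096] = {0};

int fd = open("out.bin", O_CREAT | O_WRONLY, 0666);
ftruncate(fd, (off_t)page_count * 4096);       // pre-size; sparse on Linux
lseek(fd, (off_t)pagenumber * 4096, SEEK_SET); // jump to this page's slot
write(fd, buffer, 4096);
close(fd);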
The first argument to write() is a file descriptor, which you obtained with open(). So it should be
int file = open(...);
...
write(file,buffer,4096);
not
write(file(pagenumber*4096),buffer,4096);
Regarding the question of how to write at a specific position: you can prepare the file beforehand with write(), and then use lseek() to position the file pointer where you want to write.
Mario, first of all, let's not rely on garbage in 'pagenumber' to continue the loop (which is what happens when the loop boundary condition is checked here for the first time). Now, if you are writing page number 0 and then the page following it, pagenumber will be initialized to 0 and your loop will exit. Also, please check the number of bytes written and read by the write and read system calls, respectively.
try pwrite
int file = open(name, O_CREAT | O_WRONLY, 0666);
char buffer[4096];
int pagenumber;
for (int i = 0; i < page_number; i++) {
    read(socket, &pagenumber, sizeof(int));
    read(socket, buffer, 4096); // read a full page, not sizeof(int) bytes
    // use the received page number, not i, since pages can arrive out of order
    pwrite(file, buffer, 4096, (off_t)pagenumber * 4096);
}

ERROR_NOT_ENOUGH_MEMORY Error when writing INI using WritePrivateProfileString, after 200k calls

I'm making a simple DLL packet sniffer in C++ that hooks into an app and writes the received packets into an INI file. Unfortunately, after 20-30 minutes it crashes the main app.
When a packet is received, receivedPacket() is called. After 20+ minutes, the WriteCount value is around 150,000-200,000, and I start getting a C++ runtime error/crash. The GetLastError() code I get is 0x8, which is ERROR_NOT_ENOUGH_MEMORY, and WritePrivateProfileStringA() returns 0.
void writeToINI(LPCSTR iSec,LPCSTR iKey,int iVal){
sprintf(inival, _T("%d"), iVal);
WritePrivateProfileStringA(iSec,iKey,inival,iniloc);
//sprintf(strc, _T("%d \n"), WriteCount);
//WriteConsole(GetStdHandle(STD_OUTPUT_HANDLE), strc, strlen(strc), 0, 0);
WriteCount++;
}
void receivedPacket(char *packet,WORD size){
switch ( packet[2] )
{
case 0x30:
// Size : 0x5F
ID = *(signed char*)&packet[0x10];
X = *(signed short*)&packet[0x20];
Y = *(signed short*)&packet[0x22];
Z = *(signed short*)&packet[0x24];
sprintf(inisec, _T("PACKET_%d"), (ID+1));
writeToINI(inisec,"id",ID);
writeToINI(inisec,"x",X);
writeToINI(inisec,"y",Y);
writeToINI(inisec,"z",Z);
}
[.....OTHER CASES.....]
}
Thanks :)
WritePrivateProfileString() and GetPrivateProfileString() are very slow (they re-parse the INI file on each call). Instead you can:
use one of the existing parsing libraries, though I am not sure about their memory efficiency or their support for sequential writes;
write your own sequential INI writer (see the sketch below):
read the file (or, if it is too big, only part by part)
find the section and key (if not found, create a new section at the end of the file, or find the insertion position if you want sorted sections), and save the file positions of the key and of the next key
change the value
save the file: the beginning of the original file up to the position of the key, then the changed key, then everything from the position of the next key to the end of the original file. (If a new section is created at the end, you can simply append it to the original file. If packets rewrite the same ID often, you can pad each value with trailing whitespace, wide enough to hold any value of the desired type - for example, change X=1---\n to X=100-\n, with - standing for a space - so every key has a constant size and you can update only that part of the file.)
use a database, for example MySQL;
write a binary file (the fastest solution) and make a program to read the values, or to convert them to text.
A little note: I used GetPrivateProfileString() a few years ago to read a settings file (about 1 KB in size): reading from HDD took 50 ms, and reading from a USB flash disk took 1000 ms! After changing it (1. read the file into memory, 2. run my own parser), it ran in 1 ms on both HDD and USB.
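To illustrate the fixed-width padding idea from the list above, here is a hypothetical sketch; the 8-character slot width and the cached valueOffset are assumptions about how the file was originally written, not part of any INI API:

#include <cstdio>

// Overwrite one padded value slot in place. Assumes each value was written
// left-justified in a fixed 8-character field, and that valueOffset (the
// offset of the first value character) was recorded when the key was created.
void updateValue(FILE *ini, long valueOffset, int newValue) {
    char slot[16];
    snprintf(slot, sizeof slot, "%-8d", newValue); // space-padded to 8 chars
    fseek(ini, valueOffset, SEEK_SET);
    fwrite(slot, 1, 8, ini); // touch only the 8-byte slot, not the whole file
}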
Thanks for the replies, guys, but it looks like the problem didn't come from WritePrivateProfileStringA() after all.
I just needed to allocate extra size in malloc() for the hook.
:)

Is there a way to fopen a file that allows me to edit just a few bytes?

I am writing a class that compresses binary data using a zlib stream. I have a buffer that I fill with the output stream, and once it becomes full I dump the buffer out to a file using fopen(filename, "ab"). What this means is that my program only opens the file whenever it has a buffer full of data to dump; it writes the data and immediately closes the file.
The issue is that my format uses an 8-byte header at the beginning of each file, which contains the original length and the compressed length, but I do not know these values until the end of the whole compression process.
What I wanted to do was write 8 bytes of zeros, then append all my compressed data, then come back at the end during cleanup to fill in those 8 bytes with the size data. But I can't seem to find a way to open the file for that without bringing it all back into memory. I just want to edit the first 8 bytes of the file. Do I need to use mmap?
Since you're using the file in append mode, you do need to close and re-open it:
open with fopen(filename, "r+b");
write the 8 bytes;
close the file using fclose().
The "r+" means
Open for reading and writing. The stream is positioned at the beginning of the file.
and the "b" is needed to open the file in binary mode.
You can use this method to change the data at any position in the file, not just at the beginning: simply use fseek() to seek to the required position before writing.
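A minimal sketch of that cleanup step (the function name is illustrative; it assumes the two lengths are stored as consecutive 4-byte values at offset 0):

#include <cstdio>
#include <cstdint>

void patchHeader(const char *filename, uint32_t originalLen, uint32_t compressedLen) {
    FILE *f = fopen(filename, "r+b"); // read/write, positioned at byte 0
    if (f) {
        uint32_t header[2] = { originalLen, compressedLen };
        fwrite(header, sizeof header[0], 2, f); // overwrite just the 8-byte header
        fclose(f);
    }
}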
Use rewind() to take the file pointer back to the start of the file after you write out the last few bytes of data; you can then output your 8 bytes of length info. (This requires the file to be open in a read/write mode such as "r+b" rather than append mode, since in append mode all writes go to the end.)
If you have flexibility in changing your format, I might suggest this: define your compressed stream as a sequence of an unknown number of blocks, each block preceded by a fixed-length integer specifying the number of bytes in the block. The stream is finished when the next block has a size of zero.
The drawback to this format is that there is no way for the reader of the stream to know how much data is coming until it has all been read. But the advantage is that it avoids the problem you are trying to solve.
More importantly, it allows you to send a compressed stream of data somewhere as you read the input and you don't have to save it all before sending it. For example, you could write a compression Unix filter that you could put in a pipe stream:
prog1 | yourprog -compress | rsh host yourprog -expand | prog2
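If you adopt that format, the writer is tiny. A sketch, assuming a 4-byte length prefix (any fixed-length integer works):

#include <cstdio>
#include <cstdint>

// Write one length-prefixed block; a zero-length block terminates the stream.
void writeBlock(FILE *out, const void *data, uint32_t len) {
    fwrite(&len, sizeof len, 1, out); // fixed-length size prefix
    if (len > 0)
        fwrite(data, 1, len, out);    // block payload
}

Calling writeBlock(out, nullptr, 0) after the last block writes the terminating zero-size block.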
Good luck.