I've written a program that generates a tarball, which is compressed with zlib.
At regular intervals, the same program is supposed to add a new file to the tarball.
By definition, the tarball needs empty records (512-byte blocks) at its end to work properly, which is already where my problem starts.
According to the documentation, gzopen is unable to open the file in r+ mode, meaning I can't simply jump to the beginning of the empty records, append my file information and seal it again with empty records.
Right now, I'm at my wits' end. Appending works fine with zlib, as long as the empty records are not involved, yet I need them to 'finalize' my compressed tarball.
Any ideas?
Ah yes, it would be nice if I could avoid decompressing the whole thing and/or parsing the entire tarball.
I'm also open to other (preferably simple) file formats I could implement instead of tar.
This is two separate problems, both of which are solvable.
The first is how to append to a tar file. All you need to do there is overwrite the final two zeroed 512-byte blocks with your file. You would write the 512-byte tar header, your file rounded up to an integer number of 512-byte blocks, and then two 512-byte blocks filled with zeros to mark the new end of the tar file.
The second is how to frequently append to a gzip file. The simplest approach is to write separate gzip streams and concatenate them. Write the last two 512-byte zeroed blocks in a separate gzip stream, and remember where that starts. Then overwrite that with a new gzip stream with the new tar entry, and then another gzip stream with the two end blocks. This can be done by seeking back in the file with lseek() and then using gzdopen() to start writing from there.
That will work well, with good compression, for added files that are large (at a minimum several 10's of K). If however you are adding very small files, simply concatenating small gzip streams will result in lousy compression, or worse, expansion. You can do something more complicated to actually add small amounts of data to a single gzip stream so that the compression algorithm can make use of the preceding data for correlation and string matching. For that, take a look at the approach in gzlog.h and gzlog.c in examples/ in the zlib distribution.
Here is an example of how to do the simple approach:
/* tapp.c -- Example of how to append to a tar.gz file with concatenated gzip
   streams. Placed in the public domain by Mark Adler, 16 Jan 2013. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <fcntl.h>
#include "zlib.h"

#define local static

/* Build an allocated string with the prefix string and the NULL-terminated
   sequence of words strings separated by spaces. The caller should free the
   returned string when done with it. */
local char *build_cmd(char *prefix, char **words)
{
    size_t len;
    char **scan;
    char *str, *next;

    len = strlen(prefix) + 1;
    for (scan = words; *scan != NULL; scan++)
        len += strlen(*scan) + 1;
    str = malloc(len);  assert(str != NULL);
    next = stpcpy(str, prefix);
    for (scan = words; *scan != NULL; scan++) {
        *next++ = ' ';
        next = stpcpy(next, *scan);
    }
    return str;
}

/* Usage:

       tapp archive.tar.gz addthis.file andthisfile.too

   tapp will create a new archive.tar.gz file if it doesn't exist, or it will
   append the files to the existing archive.tar.gz. tapp must have been used
   to create the archive in the first place. If it did not, then tapp will
   exit with an error and leave the file unchanged. Each use of tapp appends a
   new gzip stream whose compression cannot benefit from the files already in
   the archive. As a result, tapp should not be used to append a small amount
   of data at a time, else the compression will be particularly poor. Since
   this is just an instructive example, the error checking is done mostly with
   asserts.
 */
int main(int argc, char **argv)
{
    int tgz;
    off_t offset;
    char *cmd;
    FILE *pipe;
    gzFile gz;
    int page;
    size_t got;
    int ret;
    ssize_t raw;
    unsigned char buf[3][512];
    const unsigned char z1k[] =     /* gzip stream of 1024 zeros */
        {0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 2, 3, 0x63, 0x60, 0x18, 5, 0xa3, 0x60,
         0x14, 0x8c, 0x54, 0, 0, 0x2e, 0xaf, 0xb5, 0xef, 0, 4, 0, 0};

    if (argc < 2)
        return 0;
    tgz = open(argv[1], O_RDWR | O_CREAT, 0644);  assert(tgz != -1);
    offset = lseek(tgz, 0, SEEK_END);  assert(offset == 0 || offset >= (off_t)sizeof(z1k));
    if (offset) {
        if (argc == 2) {
            close(tgz);
            return 0;
        }
        offset = lseek(tgz, -sizeof(z1k), SEEK_END);  assert(offset != -1);
        raw = read(tgz, buf, sizeof(z1k));  assert(raw == sizeof(z1k));
        if (memcmp(buf, z1k, sizeof(z1k)) != 0) {
            close(tgz);
            fprintf(stderr, "tapp abort: %s was not created by tapp\n", argv[1]);
            return 1;
        }
        offset = lseek(tgz, -sizeof(z1k), SEEK_END);  assert(offset != -1);
    }
    if (argc > 2) {
        gz = gzdopen(tgz, "wb");  assert(gz != NULL);
        cmd = build_cmd("tar cf - -b 1", argv + 2);
        pipe = popen(cmd, "r");  assert(pipe != NULL);
        free(cmd);
        got = fread(buf, 1, 1024, pipe);  assert(got == 1024);
        page = 2;
        while ((got = fread(buf[page], 1, 512, pipe)) == 512) {
            if (++page == 3)
                page = 0;
            ret = gzwrite(gz, buf[page], 512);  assert(ret == 512);
        }  assert(got == 0);
        ret = pclose(pipe);  assert(ret != -1);
        ret = gzclose(gz);  assert(ret == Z_OK);
        tgz = open(argv[1], O_WRONLY | O_APPEND);  assert(tgz != -1);
    }
    raw = write(tgz, z1k, sizeof(z1k));  assert(raw == sizeof(z1k));
    close(tgz);
    return 0;
}
In my opinion this is not possible while strictly conforming to the TAR standard. I have read through the zlib[1] manual and the GNU tar[2] file specification, and I did not find any information on how appending to a TAR could be implemented. So I am assuming it has to be done by overwriting the empty blocks.
So I assume, again, that you could do it using gzseek(). However, you would need to know how large the uncompressed archive is (size) and set the offset to size - 2*512.
Note that this might be cumbersome, since "The whence parameter is defined as in lseek(2); the value SEEK_END is not supported"[1], and you can't open the file for reading and writing at the same time, i.e. to find out where the end blocks are.
However, it should be possible by abusing the TAR spec slightly. The GNU tar[2] docs mention something funny:
"
Each file archived is represented by a header block which describes the file, followed by zero or more blocks which give the contents of the file. At the end of the archive file there are two 512-byte blocks filled with binary zeros as an end-of-file marker. A reasonable system should write such end-of-file marker at the end of an archive, but must not assume that such a block exists when reading an archive. In particular GNU tar always issues a warning if it does not encounter it.
"
This means you can deliberately not write those blocks. This is easy if you are the one writing the tarball. You can then use zlib in its normal append mode, remembering that whatever reads the tarball back must be able to cope with the missing end-of-archive marker.
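For illustration, a minimal sketch of that idea (the archive name and member data are placeholders, the tar header is deliberately left unfilled, and the only point being demonstrated is gzopen's append mode, which simply appends a new gzip stream to whatever is already in the file):
/* Sketch only: append one tar member without ever writing end-of-archive
   blocks, so the next run can append again the same way. */
#include <string.h>
#include "zlib.h"

int main()
{
    unsigned char header[512] = {0};   /* a real tar header must be filled in
                                          (name, size, mtime, checksum, ...) */
    unsigned char block[512] = {0};
    memcpy(block, "hello\n", 6);       /* member data, padded to 512 bytes */

    /* "ab" appends a new gzip stream to the existing file without seeking
       or rewriting anything that is already there. */
    gzFile gz = gzopen("archive.tar.gz", "ab");
    if (gz == NULL)
        return 1;
    gzwrite(gz, header, sizeof(header));
    gzwrite(gz, block, sizeof(block));
    return gzclose(gz) == Z_OK ? 0 : 1;
}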
[1] http://www.zlib.net/manual.html#Gzip
[2] http://www.gnu.org/software/tar/manual/html_node/Standard.html#SEC182
I am writing a C++ library that also decompresses zlib files. For all of the files, the last call to gzread() (or at least one of the last calls) gives error -3 (Z_DATA_ERROR) with message "incorrect data check". As I have not created the files myself I am not entirely sure what is wrong.
I found this answer and if I do
gzip -dc < myfile.gz > myfile.decomp
gzip: invalid compressed data--crc error
on the command line, the contents of myfile.decomp seem to be correct. There is still the crc error printed in this case, however, which may or may not be the same problem. My code, pasted below, should be straightforward, but I am not sure how to get the same behavior in code as on the command line above.
How can I achieve the same behavior in code as on the command line?
std::vector<char> decompress(const std::string &path)
{
    gzFile inFileZ = gzopen(path.c_str(), "rb");
    if (inFileZ == NULL)
    {
        printf("Error: gzopen() failed for file %s.\n", path.c_str());
        return {};
    }

    constexpr size_t bufSize = 8192;
    char unzipBuffer[bufSize];
    int unzippedBytes = bufSize;
    std::vector<char> unzippedData;
    unzippedData.reserve(1048576); // 1 MiB is enough in most cases.

    while (unzippedBytes == bufSize)
    {
        unzippedBytes = gzread(inFileZ, unzipBuffer, bufSize);
        if (unzippedBytes == -1)
        {
            // Here the error is -3 / "incorrect data check" for (one of) the last block(s)
            // in the file. The bytes can be correctly decompressed, as demonstrated on the
            // command line, but how can this be achieved in code?
            int errnum;
            const char *err = gzerror(inFileZ, &errnum);
            printf("%s\n", err);
            break;
        }
        if (unzippedBytes > 0)
        {
            unzippedData.insert(unzippedData.end(), unzipBuffer, unzipBuffer + unzippedBytes);
        }
    }
    gzclose(inFileZ);
    return unzippedData;
}
First off, the whole point of the CRC is to detect corrupted data. If the CRC is bad, you should be going back to where this file came from and getting a copy that isn't corrupted; the normal response is to discard the input and report an error.
You are not clear on the "behavior" you are trying to reproduce, but if you're trying to recover as much data as possible from a corrupted gzip file, then you will need to use zlib's inflate functions to decompress the file. int ret = inflateInit2(&strm, 31); will initialize the zlib stream to process a gzip file.
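To make that concrete, here is a rough sketch of such a recovery loop (the file names and buffer sizes follow the question and are only illustrative); it keeps whatever inflate() produced even when the stream ends with a data/check error:
/* Sketch only: decompress a gzip file with inflate() (windowBits = 31) and
   keep whatever output was produced, even if the trailing CRC check fails. */
#include <stdio.h>
#include <string.h>
#include "zlib.h"

int main()
{
    FILE *in = fopen("myfile.gz", "rb");
    FILE *out = fopen("myfile.decomp", "wb");
    if (in == NULL || out == NULL)
        return 1;

    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    if (inflateInit2(&strm, 31) != Z_OK)        /* 31 = gzip wrapper */
        return 1;

    unsigned char inbuf[16384], outbuf[16384];
    int ret = Z_OK;
    while (ret == Z_OK) {
        strm.avail_in = (uInt)fread(inbuf, 1, sizeof(inbuf), in);
        if (strm.avail_in == 0)
            break;                              /* ran out of input */
        strm.next_in = inbuf;
        do {
            strm.next_out = outbuf;
            strm.avail_out = sizeof(outbuf);
            ret = inflate(&strm, Z_NO_FLUSH);
            /* write whatever was produced before looking at ret -- this is
               the "recovered" data */
            fwrite(outbuf, 1, sizeof(outbuf) - strm.avail_out, out);
            if (ret != Z_OK && ret != Z_STREAM_END)
                break;                          /* e.g. Z_DATA_ERROR on the bad check */
        } while (strm.avail_out == 0);
    }
    if (ret == Z_DATA_ERROR)
        fprintf(stderr, "warning: %s -- output may be incomplete\n",
                strm.msg ? strm.msg : "data error");

    inflateEnd(&strm);
    fclose(out);
    fclose(in);
    return ret == Z_STREAM_END ? 0 : 1;
}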
I'm trying to use zlib to inflate (decompress) .FLA files and extract all their contents. Since FLA files use the ZIP format, I am able to read the local file headers (https://en.wikipedia.org/wiki/Zip_(file_format)) and use the information inside to decompress the files.
It seems to work fine for regular text-based files, but when it comes to binary (I've only tried PNG and DAT files), it fails to decompress them, returning "Z_DATA_ERROR".
I'm unable to use the minizip library that ships with zlib, since the central directory file header inside FLA files differs slightly from that of normal zip files (which is why I'm reading the local file headers manually).
Here's the code I use to decompress a chunk of data:
void DecompressBuffer(char* compressedBuffer, unsigned int compressedSize, std::string& out_decompressedBuffer)
{
    // init the decompression stream
    z_stream stream;
    stream.zalloc = Z_NULL;
    stream.zfree = Z_NULL;
    stream.opaque = Z_NULL;
    stream.avail_in = 0;
    stream.next_in = Z_NULL;
    if (int err = inflateInit2(&stream, -MAX_WBITS) != Z_OK)
    {
        printf("Error: inflateInit %d\n", err);
        return;
    }

    // Set the starting point and total data size to be read
    stream.avail_in = compressedSize;
    stream.next_in = (Bytef*)&compressedBuffer[0];

    std::stringstream strStream;

    // Start decompressing
    while (stream.avail_in != 0)
    {
        unsigned char* readBuffer = (unsigned char*)malloc(MAX_READ_BUFFER_SIZE + 1);
        readBuffer[MAX_READ_BUFFER_SIZE] = '\0';
        stream.next_out = readBuffer;
        stream.avail_out = MAX_READ_BUFFER_SIZE;

        int ret = inflate(&stream, Z_NO_FLUSH);
        if (ret == Z_STREAM_END)
        {
            // only store the data we have left in the stream
            size_t length = MAX_READ_BUFFER_SIZE - stream.avail_out;
            std::string str((char*)readBuffer);
            str = str.substr(0, length);
            strStream << str;
            break;
        }
        else
        {
            if (ret != Z_OK)
            {
                printf("Error: inflate %d\n", ret); // This is what it reaches when trying to inflate a PNG or DAT file
                break;
            }
            // store the readbuffer in the stream
            strStream << readBuffer;
        }
        free(readBuffer);
    }

    out_decompressedBuffer = strStream.str();
    inflateEnd(&stream);
}
I have tried zipping a single PNG file and extracting that. This doesn't return any errors from inflate(), but it doesn't correctly inflate the PNG either, and only the first few bytes seem to match.
The original file (left) and the file decompressed via code (right): [hex editor screenshots of both PNGs]
You do things that rely on the data being text and strings, not binary data.
For example
std::string str((char*)readBuffer);
If the contents of readBuffer is raw binary data then it might contain one or more zero bytes in the middle of it. When you use it as a C-style string then the first zero will act as the string terminator character.
I suggest you try to generalize it and remove the dependency on strings; use e.g. std::vector<int8_t> instead.
Meanwhile, during your transition to a more general solution, you can do e.g.
std::string str(readBuffer, length);
This will create a string of the specified length, and the contents will not be checked for terminators.
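For instance, a rough sketch of the same loop with the string dependency removed, keeping the structure of the question's function but accumulating bytes with explicit lengths in a byte vector (the buffer size and the changed signature are only illustrative):
// Sketch only: decompress a raw deflate buffer into a byte vector.
#include <cstdio>
#include <vector>
#include "zlib.h"

void DecompressBuffer(char* compressedBuffer, unsigned int compressedSize,
                      std::vector<unsigned char>& out_decompressedBuffer)
{
    z_stream stream = {};
    if (inflateInit2(&stream, -MAX_WBITS) != Z_OK)   // raw deflate, as in zip entries
        return;

    stream.avail_in = compressedSize;
    stream.next_in = (Bytef*)compressedBuffer;

    unsigned char readBuffer[16384];
    int ret = Z_OK;
    do {
        stream.next_out = readBuffer;
        stream.avail_out = sizeof(readBuffer);
        ret = inflate(&stream, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            std::printf("Error: inflate %d\n", ret);
            break;
        }
        // Append exactly the number of bytes produced; no NUL terminators and
        // no strlen, so embedded zero bytes are preserved.
        size_t produced = sizeof(readBuffer) - stream.avail_out;
        out_decompressedBuffer.insert(out_decompressedBuffer.end(),
                                      readBuffer, readBuffer + produced);
    } while (ret != Z_STREAM_END);

    inflateEnd(&stream);
}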
I want to read and remove the first line from a txt file without copying it (it's a huge file).
I've searched the net, but everybody just copies the desired content to a new file. I can't do that.
Below is a first attempt. This code gets stuck in a loop because no lines are ever removed. If the code removed the first line of the file on each pass, it would eventually reach the end.
#include <iostream>
#include <string>
#include <fstream>
#include <boost/interprocess/sync/file_lock.hpp>

int main() {
    std::string line;
    std::fstream file;
    boost::interprocess::file_lock lock("test.lock");

    while (true) {
        std::cout << "locking\n";
        lock.lock();
        file.open("test.txt", std::fstream::in|std::fstream::out);
        if (!file.is_open()) {
            std::cout << "can't open file\n";
            file.close();
            lock.unlock();
            break;
        }
        else if (!std::getline(file,line)) {
            std::cout << "empty file\n"; //
            file.close();                // never
            lock.unlock();               // reached
            break;                       //
        }
        else {
            // remove first line
            file.close();
            lock.unlock();
            // do something with line
        }
    }
}
Here's a solution written in C for Windows.
It will execute and finish on a 700,000 line, 245MB file in no time. (0.14 seconds)
Basically, I memory-map the file so that I can access its contents with the functions used for raw memory access. Once the file has been mapped, I just use the strchr function to find the location of the first of the pair of symbols used to denote an EOL in Windows (\r and \n) - this tells us how long, in bytes, the first line is.
From here, I just memcpy from the first byte of the second line back to the start of the memory-mapped area (basically, the first byte in the file).
Once this is done, the file is unmapped, the handle to the mem-mapped file is closed and we then use the SetEndOfFile function to reduce the length of the file by the length of the first line. When we close the file, it has shrunk by this length and the first line is gone.
Having the file already in memory since I've just created and written it is obviously altering the execution time somewhat, but the windows caching mechanism is the 'culprit' here - the very same mechanism we're leveraging to make the operation complete very quickly.
The test data is the source of the program duplicated 100,000 times and saved as testInput2.txt (paste it 10 times, select all, copy, paste 10 times replacing the original 10 for a total of 100 copies, and repeat until the output is big enough; I stopped there because more seemed to make Notepad++ a 'bit' unhappy).
Error checking in this program is virtually non-existent, and the input is expected not to be Unicode, i.e. the input is 1 byte per character.
The EOL sequence is 0x0D, 0x0A (\r, \n)
Code:
#include <stdio.h>
#include <string.h>
#include <windows.h>

void testFunc(const char inputFilename[] )
{
    int lineLength;

    HANDLE fileHandle = CreateFile(
        inputFilename,
        GENERIC_READ | GENERIC_WRITE,
        0,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
        NULL
        );
    if (fileHandle != INVALID_HANDLE_VALUE)
    {
        printf("File opened okay\n");

        DWORD fileSizeHi, fileSizeLo = GetFileSize(fileHandle, &fileSizeHi);

        HANDLE memMappedHandle = CreateFileMapping(
            fileHandle,
            NULL,
            PAGE_READWRITE | SEC_COMMIT,
            0,
            0,
            NULL
            );
        if (memMappedHandle)
        {
            printf("File mapping success\n");

            LPVOID memPtr = MapViewOfFile(
                memMappedHandle,
                FILE_MAP_ALL_ACCESS,
                0,
                0,
                0
                );
            if (memPtr != NULL)
            {
                printf("view of file successfully created\n");
                printf("File size is: 0x%04X%04X\n", fileSizeHi, fileSizeLo);

                char *eolPos = strchr((char*)memPtr, '\r');  /* windows EOL sequence is \r\n */
                lineLength = (int)(eolPos - (char*)memPtr);
                printf("Length of first line is: %d\n", lineLength);

                memcpy(memPtr, eolPos + 2, fileSizeLo - lineLength);
                UnmapViewOfFile(memPtr);
            }
            CloseHandle(memMappedHandle);
        }
        SetFilePointer(fileHandle, -(lineLength+2), 0, FILE_END);
        SetEndOfFile(fileHandle);
        CloseHandle(fileHandle);
    }
}

int main()
{
    const char inputFilename[] = "testInput2.txt";
    testFunc(inputFilename);
    return 0;
}
What you want to do, indeed, is not easy.
If you open the same file for reading and writing without being careful, you will end up reading what you just wrote, and the result will not be what you want.
Modifying the file in place is doable: just open it, seek in it, modify and close. However, what you want is to move all the content of the file, except the first K bytes, back to the beginning. That means you will have to iteratively read and write the whole file in chunks of N bytes.
Now once done, K bytes will remain at the end that would need to be removed. I don't think there's a way to do it with streams. You can use ftruncate or truncate functions from unistd.h or use Boost.Interprocess truncate for this.
Here is an example (without any error checking, I let you add it):
#include <iostream>
#include <fstream>
#include <string>
#include <unistd.h>

int main()
{
    std::fstream file;
    file.open("test.txt", std::fstream::in | std::fstream::out);

    // First retrieve size of the file
    file.seekg(0, file.end);
    std::streampos endPos = file.tellg();
    file.seekg(0, file.beg);

    // Then retrieve size of the first line (a.k.a bufferSize)
    std::string firstLine;
    std::getline(file, firstLine);

    // We need two streampos: the read one and the write one
    std::streampos readPos = firstLine.size() + 1;
    std::streampos writePos = 0;

    // Read the whole file starting at readPos by chunks of size bufferSize
    char buffer[256];
    std::size_t bufferSize = sizeof(buffer);
    bool finished = false;
    while (!finished)
    {
        file.seekg(readPos);
        if (readPos + static_cast<std::streamoff>(bufferSize) >= endPos)
        {
            bufferSize = endPos - readPos;
            finished = true;
        }
        file.read(buffer, bufferSize);
        file.seekp(writePos);
        file.write(buffer, bufferSize);
        readPos += bufferSize;
        writePos += bufferSize;
    }
    file.close();

    // No clean way to truncate streams, use function from unistd.h
    truncate("test.txt", writePos);

    return 0;
}
I'd really like to be able to provide a cleaner solution for in-place modification of the file, but I'm not sure there's one.
I have a gzip file "dat.gz"; the original file contains only ASCII text, line by line. The .gz file was generated by 'pigz -i'.
I want to load "dat.gz" into several processes to do parallel data processing. The programming language must be C or C++, under Linux.
For example, if the original file contains "1\n2\n3" and I load the .gz file into 3 processes (p0, p1, p2), then p0 gets "1", p1 gets "2" and p2 gets "3".
I read the gz file format here: http://tools.ietf.org/pdf/rfc1952.pdf and found that each block of a .gz file starts with "\x1f\x8b". So I cut the .gz file into blocks at "\x1f\x8b". But when I use Boost's decompression library to process a block, something goes wrong.
Maybe my method is wrong at the root.
My test .gz file can be downloaded here: https://drive.google.com/file/d/0B9DaAjBTb3bbcEM1N1c4OEg0SWc/view?usp=sharing
My C++ test code follows. Run it with "g++ -std=c++11 test.cpp -lboost_iostreams && ./a.out". It throws an exception:
terminate called after throwing an instance of
'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::iostreams::gzip_error> >'
  what():  gzip error
Aborted
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/copy.hpp>
#include <sstream>

//define buffer size of fread: 128KB
#define BUFSIZE 128*1024

void get_first_block(char *fn) {
    FILE* fin = fopen(fn, "rb");
    char buf[BUFSIZE] = {0};
    int pos = 0;

    //skip first 2 byte
    fread(buf, sizeof(char), 2, fin);

    int i;
    while (1) {
        int sz = fread(buf, sizeof(char), BUFSIZE, fin);
        if (sz <= 1) {
            break;
        }
        for (i=0; i<sz-1; ++i) {
            if (buf[i] == (char)0x1f && buf[i+1] == (char)0x8b) {
                break;
            }
        }
        pos += sz;
    }

    //first block start: 0
    //first block end: pos + i - 1
    int len = pos + i;

    fseek(fin, 0, SEEK_SET);
    char *blk = (char*)malloc(len);
    fread(blk, 1, len, fin);

    using namespace boost::iostreams;
    filtering_streambuf<input> in;
    in.push( gzip_decompressor() );
    in.push( boost::iostreams::array_source(blk, len) );
    std::stringstream _sstream;
    boost::iostreams::copy(in, _sstream);

    std::cout << _sstream.rdbuf();
}

int main() {
    get_first_block("0000.gz");
    return 0;
}
It's unlikely that there is more than one of those blocks in a .gz file; see also the Wikipedia article about gzip:
"Although its file format also allows for multiple such streams to be concatenated (zipped files are simply decompressed concatenated as if they were originally one file), gzip is normally used to compress just single files."
This is especially true for your test file, because if you additionally look at the "compression method" flag, you can expand the search string to 0x1F, 0x8B, 0x08 which only appears once at the very beginning of your test file.
When trying to split a .gz file into blocks, you've got to do some more parsing instead of just looking for 0x1F, 0x8B, because this can also appear inside compressed data blocks or other parts of the member.
You have to parse the members and the compressed data. Unfortunately, the header only contains the uncompressed length of the data, not the compressed length, so you can't just skip the compressed data without parsing it.
The compressed data will be deflate data (there are other, but unused compression types), see RFC 1951. For non-compressed deflate blocks (chapter 3.2.4), there's a LEN field in the header so you can skip those easily. But unfortunately, there's no length field in the header of compressed blocks, so you'll have to completely parse those.
pigz -i compresses each block independently, which permits random access at each block boundary. Between each block is an empty stored block, which ends with the sequence of bytes 00 00 ff ff. You can search for that sequence, and attempt to decompress after that. There are 39 such markers in your example file.
There is nothing that prevents 00 00 ff ff from appearing in the middle of a compressed block, not marking a block boundary. So you should expect that occasionally you will get a false indication of such a boundary, indicated by a failure to decompress. In that case, simply move on to the next such marker.
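If it helps, here is a rough sketch of that search-and-try approach. The file name, buffer sizes and the success heuristic are only illustrative; a failed inflate simply means the marker was a false positive, so you move on to the next one:
// Sketch only: find 00 00 ff ff markers left by `pigz -i` and try to
// raw-inflate from just past each one.
#include <cstdio>
#include <cstring>
#include <vector>
#include "zlib.h"

int main()
{
    // Read the whole .gz file into memory (fine for a test; a real program
    // could mmap it or read ranges instead).
    std::FILE* f = std::fopen("0000.gz", "rb");
    if (!f) return 1;
    std::vector<unsigned char> data;
    unsigned char tmp[65536];
    size_t got;
    while ((got = std::fread(tmp, 1, sizeof(tmp), f)) > 0)
        data.insert(data.end(), tmp, tmp + got);
    std::fclose(f);

    static const unsigned char marker[4] = {0x00, 0x00, 0xff, 0xff};
    for (size_t i = 0; i + 4 <= data.size(); ++i) {
        if (std::memcmp(&data[i], marker, 4) != 0)
            continue;

        // Try to decompress a raw deflate stream starting right after the
        // marker; pigz -i resets the compressor there, so a real boundary
        // should decompress cleanly.
        z_stream strm;
        std::memset(&strm, 0, sizeof(strm));
        if (inflateInit2(&strm, -15) != Z_OK)        // -15 = raw deflate
            return 1;
        strm.next_in = &data[i + 4];
        strm.avail_in = (uInt)(data.size() - (i + 4));
        unsigned char out[65536];
        strm.next_out = out;
        strm.avail_out = sizeof(out);
        int ret = inflate(&strm, Z_NO_FLUSH);
        if (ret == Z_OK || ret == Z_STREAM_END)
            std::printf("possible block boundary at offset %zu (%u bytes decompressed)\n",
                        i + 4, (unsigned)(sizeof(out) - strm.avail_out));
        // Otherwise: false positive, keep scanning.
        inflateEnd(&strm);
    }
    return 0;
}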
I'm trying to load an image file into a buffer in order to send it through a socket. The problem I'm having is that the program creates a buffer with a valid size, but it does not copy the whole file into the buffer. My code is as follows:
//imgload.cpp
#include <iostream>
#include <stdlib.h>
#include <stdio.h>

using namespace std;

int main(int argc,char *argv){
    FILE *f = NULL;
    char filename[80];
    char *buffer = NULL;
    long file_bytes = 0;
    char c = '\0';
    int i = 0;

    printf("-Enter a file to open:");
    gets(filename);

    f = fopen(filename,"rb");
    if (f == NULL){
        printf("\nError opening file.\n");
    }else{
        fseek(f,0,SEEK_END);
        file_bytes = ftell(f);
        fseek(f,0,SEEK_SET);
        buffer = new char[file_bytes+10];
    }

    if (buffer != NULL){
        printf("-%d + 10 bytes allocated\n",file_bytes);
    }else{
        printf("-Could not allocate memory\n");
        // Call exit?.
    }

    while (c != EOF){
        c = fgetc(f);
        buffer[i] = c;
        i++;
    }
    c = '\0';
    buffer[i-1] = '\0'; // helps remove randome characters in buffer when copying is finished..
    i = 0;

    printf("buffer size is now: %d\n",strlen(buffer));

    //release buffer to os and cleanup....
    return 0;
}
Output:
c:\Users\Desktop>imgload
-Enter a file to open:img.gif
-3491 + 10 bytes allocated
buffer size is now: 9
c:\Users\Desktop>imgload
-Enter a file to open:img2.gif
-1261 + 10 bytes allocated
buffer size is now: 7
From the output I can see that it's allocating the correct size for each image, 3491 and 1261 bytes (I double-checked the file sizes through Windows and the sizes being allocated are correct), but the buffer sizes after the supposed copying are 9 and 7 bytes. Why is it not copying the entire data?
Your approach is wrong: an image is binary data, not string data. There are two errors:
1) You can't detect end of file with the EOF constant. EOF is an int (usually -1), and since you store fgetc()'s result in a char, a perfectly valid 0xFF byte in the file compares equal to it. Use the feof() function to check for end of file, or compare the current position against the size you already obtained with ftell().
2) Since the file is binary, it may contain \0 bytes in the middle, so you can't use string functions to work with such data.
Also, you are writing C++ but using classic C file handling. Using C++ features such as file streams, containers and iterators would simplify your program.
P.S. Your program will also have problems with really big files. If you ever need to handle those, switch ftell/fseek to their 64-bit (long long) equivalents and fix the array counter accordingly. Another good idea is to read the file in blocks; reading byte by byte is dramatically slower.
All this is unneeded and actually makes no sense:
c = '\0';
buffer[i-1] = '\0';
i = 0;
printf("buffer size is now: %d\n",strlen(buffer));
Don't use strlen for binary data. strlen stops at the first NUL (\0) byte. A binary file may contain many such bytes, so NUL can't be used.
-3491 + 10 bytes allocated /* There are 3491 bytes in the file. */
buffer size is now: 9 /* The first byte with the value 0. */
In conclusion, drop that part. You already have the size of the file.
You are reading a binary file like a text file. You can't check for EOF this way, because a byte with that value can legitimately occur anywhere in a binary file.
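For completeness, a minimal sketch of the fixed read loop (the file name is a placeholder): read the whole file in one fread() call and keep track of the byte count yourself, instead of scanning for EOF or calling strlen().
// Sketch only: read a binary file into a buffer using its size from ftell().
#include <cstdio>

int main()
{
    std::FILE* f = std::fopen("img.gif", "rb");
    if (f == NULL) {
        std::printf("\nError opening file.\n");
        return 1;
    }

    std::fseek(f, 0, SEEK_END);
    long file_bytes = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    char* buffer = new char[file_bytes];
    long got = (long)std::fread(buffer, 1, file_bytes, f);
    std::fclose(f);

    // 'got' (ideally equal to file_bytes) is the amount of data in the buffer;
    // strlen() would stop at the first 0 byte and is meaningless here.
    std::printf("read %ld of %ld bytes\n", got, file_bytes);

    delete[] buffer;
    return 0;
}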