How to read a .gz file line-by-line in C++?

How to read a .gz file line-by-line in C++? - c++

I have 3 terabyte .gz file and want to read its uncompressed content line-by-line in a C++ program. As the file is quite huge, I want to avoid loading it completely in memory.
Can anyone post a simple example of doing it?

You most probably will have to use ZLib's deflate, example is available from their site
Alternatively you may have a look at BOOST C++ wrapper
The example from BOOST page (decompresses data from a file and writes it to standard output)
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/zlib.hpp>
int main()
{
using namespace std;
ifstream file("hello.z", ios_base::in | ios_base::binary);
filtering_streambuf<input> in;
in.push(zlib_decompressor());
in.push(file);
boost::iostreams::copy(in, cout);
}

For something that is going to be used regularly, you probably want to use one of the previous suggestions. Alternatively, you can do
gzcat file.gz | yourprogram
and have yourprogram read from cin. This will decompress parts of the file in memory as it is needed, and send the uncompressed output to yourprogram.

Using zlib, I'm doing something along these lines:
// return a line in a std::vector< char >
std::vector< char > readline( gzFile f ) {
std::vector< char > v( 256 );
unsigned pos = 0;
for ( ;; ) {
if ( gzgets( f, &v[ pos ], v.size() - pos ) == 0 ) {
// end-of-file or error
int err;
const char *msg = gzerror( f, &err );
if ( err != Z_OK ) {
// handle error
}
break;
}
unsigned read = strlen( &v[ pos ] );
if ( v[ pos + read - 1 ] == '\n' ) {
if ( pos + read >= 2 && v[ pos + read - 2 ] == '\r' ) {
pos = pos + read - 2;
} else {
pos = pos + read - 1;
}
break;
}
if ( read == 0 || pos + read < v.size() - 1 ) {
pos = read + pos;
break;
}
pos = v.size() - 1;
v.resize( v.size() * 2 );
}
v.resize( pos );
return v;
}
EDIT: Removed two mis-copied * in the example above.
EDIT: Corrected out of bounds read on v[pos + read - 2]

The zlib library supports decompressing files in memory in blocks, so you don't have to decompress the entire file in order to process it.

Here is some code with which you can read normal and zipped files line by line:
char line[0x10000];
FILE *infile=open_file(file);
bool gzipped=endsWith(file, ".gz");
if(gzipped)
init_gzip_stream(infile,&line[0]);
while (readLine(infile,line,gzipped)) {
if(line[0]==0)continue;// skip gzip new_block
printf(line);
}
#include <zlib.h>
#define CHUNK 0x100
#define OUT_CHUNK CHUNK*100
unsigned char gzip_in[CHUNK];
unsigned char gzip_out[OUT_CHUNK];
///* These are parameters to inflateInit2. See http://zlib.net/manual.html for the exact meanings. */
#define windowBits 15
#define ENABLE_ZLIB_GZIP 32
z_stream strm = {0};
z_stream init_gzip_stream(FILE* file,char* out){// unsigned
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
strm.next_in = gzip_in;
strm.avail_in = 0;
strm.next_out = gzip_out;
inflateInit2 (& strm, windowBits | ENABLE_ZLIB_GZIP);
return strm;
}
bool inflate_gzip(FILE* file, z_stream strm,size_t bytes_read){
strm.avail_in = (int)bytes_read;
do {
strm.avail_out = OUT_CHUNK;
inflate (& strm, Z_NO_FLUSH);
// printf ("%s",gzip_out);
}while (strm.avail_out == 0);
if (feof (file)) {
inflateEnd (& strm);
return false;
}
return true;// all OK
}
char* first_line=(char*)&gzip_out[0];
char* current_line=first_line;
char* next_line=first_line;
char hangover[1000];
bool readLine(FILE* infile,char* line,bool gzipped){
if(!gzipped)
return fgets(line, sizeof(line), infile) != NULL;
else{
bool ok=true;
current_line=next_line;
if(!current_line || strlen(current_line)==0 || next_line-current_line>OUT_CHUNK){
current_line=first_line;
size_t bytes_read = fread (gzip_in, sizeof (char), CHUNK, infile);
ok=inflate_gzip(infile,strm,bytes_read);
strcpy(line,hangover);
}
if(ok){
next_line=strstr(current_line,"\n");
if(next_line){
next_line[0]=0;
next_line++;
strcpy(line+strlen(hangover),current_line);
hangover[0]=0;
}else{
strcpy(hangover,current_line);
line[0]=0;// skip that one!!
}
}
return ok;
}
}

You can't do that, because *.gz doesn't have "lines".
If compressed data has newlines, you'll have to decompress it. You don't have to decompress all data at once, you know, you can do it in chunks, and send strings back to main program when you encounter newline characters. *.gz can be decompressed using zlib.

Chilkat (http://www.chilkatsoft.com/) has libraries to read compressed files from a C++, .Net, VB, ... application.

Related

Size discrepancy in file deflated using zlib and Crypto++

I'm learning how to raw deflate (no header or trailer information) & inflate data in C++, so I decided to try the zlib and Crypto++ libraries.
I've found that, when deflating the same file, Crypto++ sometimes adds 4 extra bytes (depending on the method used).
For example, for a file containing the following sequence, whitespaces included: 1 2 3 4 5 6, deflating with zlib produces a file of size 14 bytes.
This holds true for Crypto++ deflate_method1, but for Crypto++ deflate_method2, the file size is 18 bytes.
Also, when trying to inflate a file that was deflated using Crypto++ deflate_method2 with Crypto++ inflate_method1, an exception is raised:
terminate called after throwing an instance of 'CryptoPP::Inflator::UnexpectedEndErr'
what(): Inflator: unexpected end of compressed block
Aborted (core dumped)
To compare, I did another test deflating/inflating with Python:
Deflating also yields a file of size 14 bytes.
I'm able to inflate all the deflated files correctly, regardless of the method used to deflate them.
At this point, I would like to understand two things:
Why is there a discrepancy in the size of the deflated files?
Why Python is able to inflate any of the files but Crypto++ is being picky?
Info & code:
OS: Ubuntu 16.04 Xenial
Zlib version: 1.0.1 from Ubuntu repos.
Crypto++ version: 8.0.2 from GitHub release.
Python version: 3.5.2
zlib version: 1.2.8 / runtime version: 1.2.8
Input & output files as base64:
Input: MSAyIDMgNCA1IDYK
Deflated:
Python: M1QwUjBWMFEwVTDjAgA=
Zlib: M1QwUjBWMFEwVTDjAgA=
Crypto++ method1: M1QwUjBWMFEwVTDjAgA=
Crypto++ method2: MlQwUjBWMFEwVTDjAgAAAP//
Zlib:
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <iterator>
#include <fstream>
#include <iostream>
#include <sstream>
#include <vector>
#include "zlib.h"
constexpr uint32_t BUFFER_READ_SIZE = 128;
constexpr uint32_t BUFFER_WRITE_SIZE = 128;
bool mydeflate(std::vector<unsigned char> & input)
{
const std::string inputStream{ input.begin(), input.end() };
uint64_t inputSize = input.size();
// Create a string stream where output will be created.
std::stringstream outputStringStream(std::ios::in | std::ios::out | std::ios::binary);
// Initialize zlib structures.
std::vector<char *> readBuffer(BUFFER_READ_SIZE);
std::vector<char *> writeBuffer(BUFFER_WRITE_SIZE);
z_stream zipStream;
zipStream.avail_in = 0;
zipStream.avail_out = BUFFER_WRITE_SIZE;
zipStream.next_out = reinterpret_cast<Bytef *>(writeBuffer.data());
zipStream.total_in = 0;
zipStream.total_out = 0;
zipStream.data_type = Z_BINARY;
zipStream.zalloc = nullptr;
zipStream.zfree = nullptr;
zipStream.opaque = nullptr;
// Window bits is passed < 0 to tell that there is no zlib header.
if (deflateInit2_(&zipStream, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -MAX_WBITS, 8, Z_DEFAULT_STRATEGY, ZLIB_VERSION, sizeof(zipStream)) != Z_OK)
{
return false;
}
// Deflate the input stream
uint32_t readSize = 0;
uint64_t dataPendingToCompress = inputSize;
uint64_t dataPendingToWrite = 0;
bool isEndOfInput = false;
while (dataPendingToCompress > 0)
{
if (dataPendingToCompress > BUFFER_READ_SIZE)
{
readSize = BUFFER_READ_SIZE;
}
else
{
readSize = dataPendingToCompress;
isEndOfInput = true;
}
// Copy the piece of input stream to the read buffer.
std::memcpy(readBuffer.data(), &inputStream[inputSize - dataPendingToCompress], readSize);
dataPendingToCompress -= readSize;
zipStream.next_in = reinterpret_cast<Bytef *>(readBuffer.data());
zipStream.avail_in = readSize;
// While there is input data to compress.
while (zipStream.avail_in > 0)
{
// Output buffer is full.
if (zipStream.avail_out == 0)
{
outputStringStream.write(reinterpret_cast<const char*>(writeBuffer.data()), dataPendingToWrite);
zipStream.total_in = 0;
zipStream.avail_out = BUFFER_WRITE_SIZE;
zipStream.next_out = reinterpret_cast<Bytef *>(writeBuffer.data());
dataPendingToWrite = 0;
}
uint64_t totalOutBefore = zipStream.total_out;
int zlibError = deflate(&zipStream, isEndOfInput ? Z_FINISH : Z_NO_FLUSH);
if ((zlibError != Z_OK) && (zlibError != Z_STREAM_END))
{
deflateEnd(&zipStream);
return false;
}
dataPendingToWrite += static_cast<uint64_t>(zipStream.total_out - totalOutBefore);
}
}
// Flush last compressed data.
while (dataPendingToWrite > 0)
{
if (dataPendingToWrite > BUFFER_WRITE_SIZE)
{
outputStringStream.write(reinterpret_cast<const char*>(writeBuffer.data()), BUFFER_WRITE_SIZE);
}
else
{
outputStringStream.write(reinterpret_cast<const char*>(writeBuffer.data()), dataPendingToWrite);
}
zipStream.total_in = 0;
zipStream.avail_out = BUFFER_WRITE_SIZE;
zipStream.next_out = reinterpret_cast<Bytef *>(writeBuffer.data());
uint64_t totalOutBefore = zipStream.total_out;
int zlibError = deflate(&zipStream, Z_FINISH);
if ((zlibError != Z_OK) && (zlibError != Z_STREAM_END))
{
deflateEnd(&zipStream);
return false;
}
dataPendingToWrite = static_cast<uint64_t>(zipStream.total_out - totalOutBefore);
}
deflateEnd(&zipStream);
const std::string & outputString = outputStringStream.str();
std::vector<unsigned char> deflated{outputString.begin(), outputString.end()};
std::cout << "Output String size: " << outputString.size() << std::endl;
input.swap(deflated);
return true;
}
int main(int argc, char * argv[])
{
std::ifstream input_file{"/tmp/test.txt"};
std::vector<unsigned char> data((std::istreambuf_iterator<char>(input_file)), std::istreambuf_iterator<char>());
std::cout << "Deflated: " << mydeflate(data) << '\n';
std::ofstream output_file{"/tmp/deflated.txt"};
output_file.write(reinterpret_cast<char *>(data.data()), data.size());
return 0;
}
Crypto++:
#include "cryptopp/files.h"
#include "cryptopp/zdeflate.h"
#include "cryptopp/zinflate.h"
void deflate_method1(const std::string & input_file_path, const std::string & output_file_path)
{
CryptoPP::Deflator deflator(new CryptoPP::FileSink(output_file_path.c_str(), true), CryptoPP::Deflator::DEFAULT_DEFLATE_LEVEL, CryptoPP::Deflator::MAX_LOG2_WINDOW_SIZE);
CryptoPP::FileSource fs(input_file_path.c_str(), true);
fs.TransferAllTo(deflator);
}
void inflate_method1(const std::string & input_file_path, const std::string & output_file_path)
{
CryptoPP::FileSource fs(input_file_path.c_str(), true);
CryptoPP::Inflator inflator(new CryptoPP::FileSink(output_file_path.c_str(), true));
fs.TransferAllTo(inflator);
}
void deflate_method2(const std::string& input_file_path, const std::string& output_file_path)
{
CryptoPP::Deflator deflator(new CryptoPP::FileSink(output_file_path.c_str(), true), CryptoPP::Deflator::DEFAULT_DEFLATE_LEVEL, 15);
std::ifstream file_in;
file_in.open(input_file_path, std::ios::binary);
std::string buffer;
size_t num_read = 0;
const size_t buffer_size(1024 * 1024);
buffer.resize(buffer_size);
file_in.read(const_cast<char*>(buffer.data()), buffer_size);
num_read = file_in.gcount();
while (num_read) {
deflator.ChannelPut(CryptoPP::DEFAULT_CHANNEL, reinterpret_cast<unsigned char*>(const_cast<char *>(buffer.data())), num_read);
file_in.read(const_cast<char*>(buffer.data()), buffer_size);
num_read = file_in.gcount();
}
file_in.close();
deflator.Flush(true);
}
void inflate_method2(const std::string& input_file_path, const std::string& output_file_path)
{
CryptoPP::Inflator inflator(new CryptoPP::FileSink(output_file_path.c_str(), true));
std::ifstream file_in;
file_in.open(input_file_path, std::ios::binary);
std::string buffer;
size_t num_read = 0;
const size_t buffer_size(1024 * 1024);
buffer.resize(buffer_size);
file_in.read(const_cast<char*>(buffer.data()), buffer_size);
num_read = file_in.gcount();
while (num_read) {
inflator.ChannelPut(CryptoPP::DEFAULT_CHANNEL, reinterpret_cast<unsigned char*>(const_cast<char *>(buffer.data())), num_read);
file_in.read(const_cast<char*>(buffer.data()), buffer_size);
num_read = file_in.gcount();
}
file_in.close();
inflator.Flush(true);
}
int main(int argc, char * argv[])
{
deflate_method1("/tmp/test.txt", "/tmp/deflated_method1.bin");
inflate_method1("/tmp/deflated_method1.bin", "/tmp/inflated_method1.txt");
deflate_method2("/tmp/test.txt", "/tmp/deflated_method2.bin");
inflate_method2("/tmp/deflated_method2.bin", "/tmp/inflated_method2.txt");
// This throws: Inflator: unexpected end of compressed block
inflate_method1("/tmp/deflated_method2.bin", "/tmp/inflated_with_method1_file_deflated_with_method2.txt");
return 0;
}
Python:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import zlib
def CHUNKSIZE():
return 128
def deflate(file_path, compression_level, method, wbits):
plain_data = None
deflated_data = bytearray()
deflator = zlib.compressobj(compression_level, method, wbits)
with open(file_path, 'rb') as input_file:
while True:
plain_data = input_file.read(CHUNKSIZE())
if not plain_data:
break
deflated_data += deflator.compress(plain_data)
deflated_data += deflator.flush()
return deflated_data
def inflate(file_path, wbits):
inflated_data = bytearray()
inflator = zlib.decompressobj(wbits)
with open(file_path, 'rb') as deflated_file:
buffer = deflated_file.read(CHUNKSIZE())
while buffer:
inflated_data += inflator.decompress(buffer)
buffer = deflated_file.read(CHUNKSIZE())
inflated_data += inflator.flush()
return inflated_data
def write_file(file_path, data):
with open(file_path, 'wb') as output_file:
output_file.write(data)
if __name__ == "__main__":
deflated_data = deflate("/tmp/test.txt", zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS)
write_file("/tmp/deflated_python.bin", deflated_data)

The first three are working correctly, generating a valid deflate compressed stream with a single, last deflate block.
You "Crypto++ method2" is generating two deflate blocks, where the second one is an empty stored block that is not marked as the last block. This is not a valid deflate stream since it does not terminate. You are not correctly finishing the compression.
Your deflator.Flush(true) is flushing the first block and emitting that empty stored block, without ending the deflate stream.
I'm not seeing much in the way of documentation, or really any at all, but looking at the source code, I would try deflator.EndBlock(true) instead.
Update:
Per the comment below, EndBlock is not public. Instead MessageEnd is what is needed to terminate the deflate stream.

c++ Zlib fwrite to file not working

I have this procedure I'm working on, to simply do the following:
Open a gzipped file
Uncompress it to a std::string
Compress it from a std::string
Save as gzipped file
However, I cannot seem to write the gzipped data at the end. The error I'm getting is :No Matching Function Call to 'fwrite'.
Am I feeding it the wrong parameters? Or an incorrect type?
Here's the full procedure:
#include <iostream>
#include "zlib-1.2.8/zlib.h"
#include <stdexcept>
#include <iomanip>
#include <sstream>
#include <stdio.h>
#include <assert.h>
#include <cstdio>
#include <string>
#include <cstring>
#include <cstdlib>
#include "zlib.h"
#include "zconf.h"
#if defined(MSDOS) || defined(OS2) || defined(WIN32) || defined(__CYGWIN__)
# include <fcntl.h>
# include <io.h>
# define SET_BINARY_MODE(file) setmode(fileno(file), O_BINARY)
#else
# define SET_BINARY_MODE(file)
#endif
#define CHUNK 16384
using namespace std;
bool gzipInflate( const std::string& compressedBytes, std::string& uncompressedBytes ) {
if ( compressedBytes.size() == 0 ) {
uncompressedBytes = compressedBytes ;
return true ;
}
uncompressedBytes.clear() ;
unsigned full_length = compressedBytes.size() ;
unsigned half_length = compressedBytes.size() / 2;
unsigned uncompLength = full_length ;
char* uncomp = (char*) calloc( sizeof(char), uncompLength );
z_stream strm;
strm.next_in = (Bytef *) compressedBytes.c_str();
strm.avail_in = compressedBytes.size() ;
strm.total_out = 0;
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
bool done = false ;
if (inflateInit2(&strm, (16+MAX_WBITS)) != Z_OK) {
free( uncomp );
return false;
}
while (!done) {
// If our output buffer is too small
if (strm.total_out >= uncompLength ) {
// Increase size of output buffer
char* uncomp2 = (char*) calloc( sizeof(char), uncompLength + half_length );
memcpy( uncomp2, uncomp, uncompLength );
uncompLength += half_length ;
free( uncomp );
uncomp = uncomp2 ;
}
strm.next_out = (Bytef *) (uncomp + strm.total_out);
strm.avail_out = uncompLength - strm.total_out;
// Inflate another chunk.
int err = inflate (&strm, Z_SYNC_FLUSH);
if (err == Z_STREAM_END) done = true;
else if (err != Z_OK) {
break;
}
}
if (inflateEnd (&strm) != Z_OK) {
free( uncomp );
return false;
}
for ( size_t i=0; i<strm.total_out; ++i ) {
uncompressedBytes += uncomp[ i ];
}
free( uncomp );
return true ;
}
/** Compress a STL string using zlib with given compression level and return
* the binary data. */
std::string compress_string(const std::string& str, int compressionlevel = Z_BEST_COMPRESSION)
{
z_stream zs; // z_stream is zlib's control structure
memset(&zs, 0, sizeof(zs));
if (deflateInit(&zs, compressionlevel) != Z_OK)
throw(std::runtime_error("deflateInit failed while compressing."));
zs.next_in = (Bytef*)str.data();
zs.avail_in = str.size(); // set the z_stream's input
int ret;
char outbuffer[32768];
std::string outstring;
// retrieve the compressed bytes blockwise
do {
zs.next_out = reinterpret_cast<Bytef*>(outbuffer);
zs.avail_out = sizeof(outbuffer);
ret = deflate(&zs, Z_FINISH);
if (outstring.size() < zs.total_out) {
// append the block to the output string
outstring.append(outbuffer,zs.total_out - outstring.size());
}
}
while (ret == Z_OK);
deflateEnd(&zs);
if (ret != Z_STREAM_END) { // an error occurred that was not EOF
std::ostringstream oss;
oss << "Exception during zlib compression: (" << ret << ") " << zs.msg;
throw(std::runtime_error(oss.str()));
}
return outstring;
}
/* Reads a file into memory. */
bool loadBinaryFile( const std::string& filename, std::string& contents ) {
// Open the gzip file in binary mode
FILE* f = fopen( filename.c_str(), "rb" );
if ( f == NULL )
return false ;
// Clear existing bytes in output vector
contents.clear();
// Read all the bytes in the file
int c = fgetc( f );
while ( c != EOF ) {
contents += (char) c ;
c = fgetc( f );
}
fclose (f);
return true ;
}
int main(int argc, char *argv[]) {
// Read the gzip file data into memory
std::string fileData ;
if ( !loadBinaryFile( "myfilein.gz", fileData ) ) {
printf( "Error loading input file." );
return 0 ;
}
// Uncompress the file data
std::string data ;
if ( !gzipInflate( fileData, data ) ) {
printf( "Error decompressing file." );
return 0 ;
}
// Print the data
printf( "Data: \"" );
for ( size_t i=0; i<data.size(); ++i ) {
printf( "%c", data[i] );
}
printf ( "\"\n" );
std::string outy;
// Compress the file data
outy = compress_string(data, 0);
//Write the gzipped data to a file.
FILE *handleWrite=fopen("myfileout.gz","wb");
fwrite(outy,sizeof(outy),1,handleWrite);
fclose(handleWrite);
return 0;
}

std::string cannot be cast to void* (what fwrite expects).
Try:
fwrite(outy.data(),outy.size(),1,handleWrite);
Although I'm not sure if storing binary data inside std::string is a good idea.

Tellg and seekg and file read not working

I am Trying to read 64000 bytes from file in binary mode in buffer at one time till end of the file. My problem is tellg() returns position in hexadecimal value, How do I make it return decimal value?
because my if conditions are not working, it is reading more than 64000 and when I am relocating my pos and size_stream(size_stream = size_stream - 63999;
pos = pos + 63999;), it is pointing to wrong positions each time.
How do I read 64000 bytes from file into buffer in binary mode at once till the end of file?
Any help would be appreciated
std::fstream fin(file, std::ios::in | std::ios::binary | std::ios::ate);
if (fin.good())
{
fin.seekg(0, fin.end);
int size_stream = (unsigned int)fin.tellg(); fin.seekg(0, fin.beg);
int pos = (unsigned int)fin.tellg();
//........................<sending the file in blocks
while (true)
{
if (size_stream > 64000)
{
fin.read(buf, 63999);
buf[64000] = '\0';
CString strText(buf);
SendFileContent(userKey,
(LPCTSTR)strText);
size_stream = size_stream - 63999;
pos = pos + 63999;
fin.seekg(pos, std::ios::beg);
}
else
{
fin.read(buf, size_stream);
buf[size_stream] = '\0';
CString strText(buf);
SendFileContent(userKey,
(LPCTSTR)strText); break;
}
}

My problem is tellg() returns position in hexadecimal value
No, it doesn't. It returns an integer value. You can display the value in hex, but it is not returned in hex.
when I am relocating my pos and size_stream(size_stream = size_stream - 63999; pos = pos + 63999;), it is pointing to wrong positions each time.
You shouldn't be seeking in the first place. After performing a read, leave the file position where it is. The next read will pick up where the previous read left off.
How do I read 64000 bytes from file into buffer in binary mode at once till the end of file?
Do something more like this instead:
std::ifstream fin(file, std::ios::binary);
if (fin)
{
unsigned char buf[64000];
std::streamsize numRead;
do
{
numRead = fin.readsome(buf, 64000);
if ((!fin) || (numRead < 1)) break;
// DO NOT send binary data using `LPTSTR` string conversions.
// Binary data needs to be sent *as-is* instead.
//
SendFileContent(userKey, buf, numRead);
}
while (true);
}
Or this:
std::ifstream fin(file, std::ios::binary);
if (fin)
{
unsigned char buf[64000];
std::streamsize numRead;
do
{
if (!fin.read(buf, 64000))
{
if (!fin.eof()) break;
}
numRead = fin.gcount();
if (numRead < 1) break;
// DO NOT send binary data using `LPTSTR` string conversions.
// Binary data needs to be sent *as-is* instead.
//
SendFileContent(userKey, buf, numRead);
}
while (true);
}

Is there a better way to search a file for a string?

I need to search a (non-text) file for the byte sequence "9µ}Æ" (or "\x39\xb5\x7d\xc6").
After 5 hours of searching online this is the best I could do. It works but I wanted to know if there is a better way:
char buffer;
int pos=in.tellg();
// search file for string
while(!in.eof()){
in.read(&buffer, 1);
pos=in.tellg();
if(buffer=='9'){
in.read(&buffer, 1);
pos=in.tellg();
if(buffer=='µ'){
in.read(&buffer, 1);
pos=in.tellg();
if(buffer=='}'){
in.read(&buffer, 1);
pos=in.tellg();
if(buffer=='Æ'){
cout << "found";
}
}
}
}
in.seekg((streampos) pos);
Note:
I can't use getline(). It's not a text file so there are probably not many line breaks.
Before I tried using a multi-character buffer and then copying the buffer to a C++ string, and then using string::find(). This didn't work because there are many '\0' characters throughout the file, so the sequence in the buffer would be cut very short when it was copied to the string.

Similar to what bames53 posted; I used a vector as a buffer:
std::ifstream ifs("file.bin");
ifs.seekg(0, std::ios::end);
std::streamsize f_size = ifs.tellg();
ifs.seekg(0, std::ios::beg);
std::vector<unsigned char> buffer(f_size);
ifs.read(buffer.data(), f_size);
std::vector<unsigned char> seq = {0x39, 0xb5, 0x7d, 0xc6};
bool found = std::search(buffer.begin(), buffer.end(), seq.begin(), seq.end()) != buffer.end();

If you don't mind loading the entire file into an in-memory array (or using mmap() to make it look like the file is in memory), you could then search for your character sequence in-memory, which is a bit easier to do:
// Works much like strstr(), except it looks for a binary sub-sequence rather than a string sub-sequence
const char * MemMem(const char * lookIn, int numLookInBytes, const char * lookFor, int numLookForBytes)
{
if (numLookForBytes == 0) return lookIn; // hmm, existential questions here
else if (numLookForBytes == numLookInBytes) return (memcmp(lookIn, lookFor, numLookInBytes) == 0) ? lookIn : NULL;
else if (numLookForBytes < numLookInBytes)
{
const char * startedAt = lookIn;
int matchCount = 0;
for (int i=0; i<numLookInBytes; i++)
{
if (lookIn[i] == lookFor[matchCount])
{
if (matchCount == 0) startedAt = &lookIn[i];
if (++matchCount == numLookForBytes) return startedAt;
}
else matchCount = 0;
}
}
return NULL;
}
.... then you can just call the above function on the in-memory data array:
char * ret = MemMem(theInMemoryArrayContainingFilesBytes, numBytesInFile, myShortSequence, 4);
if (ret != NULL) printf("Found it at offset %i\n", ret-theInMemoryArrayContainingFilesBytes);
else printf("It's not there.\n");

This program loads the entire file into memory and then uses std::search on it.
int main() {
std::string filedata;
{
std::ifstream fin("file.dat");
std::stringstream ss;
ss << fin.rdbuf();
filedata = ss.str();
}
std::string key = "\x39\xb5\x7d\xc6";
auto result = std::search(std::begin(filedata), std::end(filedata),
std::begin(key), std::end(key));
if (std::end(filedata) != result) {
std::cout << "found\n";
// result is an iterator pointing at '\x39'
}
}

const char delims[] = { 0x39, 0xb5, 0x7d, 0xc6 };
char buffer[4];
const size_t delim_size = 4;
const size_t last_index = delim_size - 1;
for ( size_t i = 0; i < last_index; ++i )
{
if ( ! ( is.get( buffer[i] ) ) )
return false; // stream to short
}
while ( is.get(buffer[last_index]) )
{
if ( memcmp( buffer, delims, delim_size ) == 0 )
break; // you are arrived
memmove( buffer, buffer + 1, last_index );
}
You are looking for 4 bytes:
unsigned int delim = 0xc67db539;
unsigned int uibuffer;
char * buffer = reinterpret_cast<char *>(&uibuffer);
for ( size_t i = 0; i < 3; ++i )
{
if ( ! ( is.get( buffer[i] ) ) )
return false; // stream to short
}
while ( is.get(buffer[3]) )
{
if ( uibuffer == delim )
break; // you are arrived
uibuffer >>= 8;
}

Because you said you cannot search the entire file because of null terminator characters in the string, here's an alternative for you, which reads the entire file in and uses recursion to find the first occurrence of a string inside of the whole file.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
string readFile (char *fileName) {
ifstream fi (fileName);
if (!fi)
cerr << "ERROR: Cannot open file" << endl;
else {
string str ((istreambuf_iterator<char>(fi)), istreambuf_iterator<char>());
return str;
}
return NULL;
}
bool findFirstOccurrenceOf_r (string haystack, char *needle, int haystack_pos, int needle_pos, int needle_len) {
if (needle_pos == needle_len)
return true;
if (haystack[haystack_pos] == needle[needle_pos])
return findFirstOccurrenceOf_r (haystack, needle, haystack_pos+1, needle_pos+1, needle_len);
return false;
}
int findFirstOccurrenceOf (string haystack, char *needle, int length) {
int pos = -1;
for (int i = 0; i < haystack.length() - length; i++) {
if (findFirstOccurrenceOf_r (haystack, needle, i, 0, length))
return i;
}
return pos;
}
int main () {
char str_to_find[4] = {0x39, 0xB5, 0x7D, 0xC6};
string contents = readFile ("input");
int pos = findFirstOccurrenceOf (contents, str_to_find, 4);
cout << pos << endl;
}
If the file is not too large, your best solution would be to load the whole file into memory, so you don't need to keep reading from the drive. If the file is too large to load in at once, you would want to load in chunks of the file at a time. But if you do load in chucks, make sure you check to edges of the chunks. It's possible that your chunk happens to split right in the middle of the string you're searching for.

How to compress a buffer with zlib?

There is a usage example at the zlib website: http://www.zlib.net/zlib_how.html
However in the example they are compressing a file. I would like to compress a binary data stored in a buffer in memory. I don't want to save the compressed buffer to disk either.
Basically here is my buffer:
fIplImageHeader->imageData = (char*)imageIn->getFrame();
How can I compress it with zlib?
I would appreciate some code example of how to do that.

zlib.h has all the functions you need: compress (or compress2) and uncompress. See the source code of zlib for an answer.
ZEXTERN int ZEXPORT compress OF((Bytef *dest, uLongf *destLen, const Bytef *source, uLong sourceLen));
/*
Compresses the source buffer into the destination buffer. sourceLen is
the byte length of the source buffer. Upon entry, destLen is the total size
of the destination buffer, which must be at least the value returned by
compressBound(sourceLen). Upon exit, destLen is the actual size of the
compressed buffer.
compress returns Z_OK if success, Z_MEM_ERROR if there was not
enough memory, Z_BUF_ERROR if there was not enough room in the output
buffer.
*/
ZEXTERN int ZEXPORT uncompress OF((Bytef *dest, uLongf *destLen, const Bytef *source, uLong sourceLen));
/*
Decompresses the source buffer into the destination buffer. sourceLen is
the byte length of the source buffer. Upon entry, destLen is the total size
of the destination buffer, which must be large enough to hold the entire
uncompressed data. (The size of the uncompressed data must have been saved
previously by the compressor and transmitted to the decompressor by some
mechanism outside the scope of this compression library.) Upon exit, destLen
is the actual size of the uncompressed buffer.
uncompress returns Z_OK if success, Z_MEM_ERROR if there was not
enough memory, Z_BUF_ERROR if there was not enough room in the output
buffer, or Z_DATA_ERROR if the input data was corrupted or incomplete. In
the case where there is not enough room, uncompress() will fill the output
buffer with the uncompressed data up to that point.
*/

This is an example to pack a buffer with zlib and save the compressed contents in a vector.
void compress_memory(void *in_data, size_t in_data_size, std::vector<uint8_t> &out_data)
{
std::vector<uint8_t> buffer;
const size_t BUFSIZE = 128 * 1024;
uint8_t temp_buffer[BUFSIZE];
z_stream strm;
strm.zalloc = 0;
strm.zfree = 0;
strm.next_in = reinterpret_cast<uint8_t *>(in_data);
strm.avail_in = in_data_size;
strm.next_out = temp_buffer;
strm.avail_out = BUFSIZE;
deflateInit(&strm, Z_BEST_COMPRESSION);
while (strm.avail_in != 0)
{
int res = deflate(&strm, Z_NO_FLUSH);
assert(res == Z_OK);
if (strm.avail_out == 0)
{
buffer.insert(buffer.end(), temp_buffer, temp_buffer + BUFSIZE);
strm.next_out = temp_buffer;
strm.avail_out = BUFSIZE;
}
}
int deflate_res = Z_OK;
while (deflate_res == Z_OK)
{
if (strm.avail_out == 0)
{
buffer.insert(buffer.end(), temp_buffer, temp_buffer + BUFSIZE);
strm.next_out = temp_buffer;
strm.avail_out = BUFSIZE;
}
deflate_res = deflate(&strm, Z_FINISH);
}
assert(deflate_res == Z_STREAM_END);
buffer.insert(buffer.end(), temp_buffer, temp_buffer + BUFSIZE - strm.avail_out);
deflateEnd(&strm);
out_data.swap(buffer);
}

You can easily adapt the example by replacing fread() and fwrite() calls with direct pointers to your data. For zlib compression (referred to as deflate as you "take out all the air of your data") you allocate z_stream structure, call deflateInit() and then:
fill next_in with the next chunk of data you want to compress
set avail_in to the number of bytes available in next_in
set next_out to where the compressed data should be written which should usually be a pointer inside your buffer that advances as you go along
set avail_out to the number of bytes available in next_out
call deflate
repeat steps 3-5 until avail_out is non-zero (i.e. there's more room in the output buffer than zlib needs - no more data to write)
repeat steps 1-6 while you have data to compress
Eventually you call deflateEnd() and you're done.
You're basically feeding it chunks of input and output until you're out of input and it is out of output.

The classic way more convenient with C++ features
Here's a full example which demonstrates compression and decompression using C++ std::vector objects:
#include <cstdio>
#include <iosfwd>
#include <iostream>
#include <vector>
#include <zconf.h>
#include <zlib.h>
#include <iomanip>
#include <cassert>
void add_buffer_to_vector(std::vector<char> &vector, const char *buffer, uLongf length) {
for (int character_index = 0; character_index < length; character_index++) {
char current_character = buffer[character_index];
vector.push_back(current_character);
}
}
int compress_vector(std::vector<char> source, std::vector<char> &destination) {
unsigned long source_length = source.size();
uLongf destination_length = compressBound(source_length);
char *destination_data = (char *) malloc(destination_length);
if (destination_data == nullptr) {
return Z_MEM_ERROR;
}
Bytef *source_data = (Bytef *) source.data();
int return_value = compress2((Bytef *) destination_data, &destination_length, source_data, source_length,
Z_BEST_COMPRESSION);
add_buffer_to_vector(destination, destination_data, destination_length);
free(destination_data);
return return_value;
}
int decompress_vector(std::vector<char> source, std::vector<char> &destination) {
unsigned long source_length = source.size();
uLongf destination_length = compressBound(source_length);
char *destination_data = (char *) malloc(destination_length);
if (destination_data == nullptr) {
return Z_MEM_ERROR;
}
Bytef *source_data = (Bytef *) source.data();
int return_value = uncompress((Bytef *) destination_data, &destination_length, source_data, source.size());
add_buffer_to_vector(destination, destination_data, destination_length);
free(destination_data);
return return_value;
}
void add_string_to_vector(std::vector<char> &uncompressed_data,
const char *my_string) {
int character_index = 0;
while (true) {
char current_character = my_string[character_index];
uncompressed_data.push_back(current_character);
if (current_character == '\00') {
break;
}
character_index++;
}
}
// https://stackoverflow.com/a/27173017/3764804
void print_bytes(std::ostream &stream, const unsigned char *data, size_t data_length, bool format = true) {
stream << std::setfill('0');
for (size_t data_index = 0; data_index < data_length; ++data_index) {
stream << std::hex << std::setw(2) << (int) data[data_index];
if (format) {
stream << (((data_index + 1) % 16 == 0) ? "\n" : " ");
}
}
stream << std::endl;
}
void test_compression() {
std::vector<char> uncompressed(0);
auto *my_string = (char *) "Hello, world!";
add_string_to_vector(uncompressed, my_string);
std::vector<char> compressed(0);
int compression_result = compress_vector(uncompressed, compressed);
assert(compression_result == F_OK);
std::vector<char> decompressed(0);
int decompression_result = decompress_vector(compressed, decompressed);
assert(decompression_result == F_OK);
printf("Uncompressed: %s\n", uncompressed.data());
printf("Compressed: ");
std::ostream &standard_output = std::cout;
print_bytes(standard_output, (const unsigned char *) compressed.data(), compressed.size(), false);
printf("Decompressed: %s\n", decompressed.data());
}
In your main.cpp simply call:
int main(int argc, char *argv[]) {
test_compression();
return EXIT_SUCCESS;
}
The output produced:
Uncompressed: Hello, world!
Compressed: 78daf348cdc9c9d75128cf2fca495164000024e8048a
Decompressed: Hello, world!
The Boost way
#include <iostream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/zlib.hpp>
std::string compress(const std::string &data) {
boost::iostreams::filtering_streambuf<boost::iostreams::output> output_stream;
output_stream.push(boost::iostreams::zlib_compressor());
std::stringstream string_stream;
output_stream.push(string_stream);
boost::iostreams::copy(boost::iostreams::basic_array_source<char>(data.c_str(),
data.size()), output_stream);
return string_stream.str();
}
std::string decompress(const std::string &cipher_text) {
std::stringstream string_stream;
string_stream << cipher_text;
boost::iostreams::filtering_streambuf<boost::iostreams::input> input_stream;
input_stream.push(boost::iostreams::zlib_decompressor());
input_stream.push(string_stream);
std::stringstream unpacked_text;
boost::iostreams::copy(input_stream, unpacked_text);
return unpacked_text.str();
}
TEST_CASE("zlib") {
std::string plain_text = "Hello, world!";
const auto cipher_text = compress(plain_text);
const auto decompressed_plain_text = decompress(cipher_text);
REQUIRE(plain_text == decompressed_plain_text);
}

This is not a direct answer on your question about the zlib API, but you may be interested in boost::iostreams library paired with zlib.
This allows to use zlib-driven packing algorithms using the basic "stream" operations notation and then your data could be easily compressed by opening some memory stream and doing the << data operation on it.
In case of boost::iostreams this would automatically invoke the corresponding packing filter for every data that passes through the stream.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to read a .gz file line-by-line in C++? - c++

I have 3 terabyte .gz file and want to read its uncompressed content line-by-line in a C++ program. As the file is quite huge, I want to avoid loading it completely in memory. Can anyone post a simple example of doing it?

The zlib library supports decompressing files in memory in blocks, so you don't have to decompress the entire file in order to process it.

Chilkat (http://www.chilkatsoft.com/) has libraries to read compressed files from a C++, .Net, VB, ... application.

Related

Size discrepancy in file deflated using zlib and Crypto++

c++ Zlib fwrite to file not working

Tellg and seekg and file read not working

Is there a better way to search a file for a string?

How to compress a buffer with zlib?

Categories

Resources