Boost 1.59 not decompressing all bzip2 streams - c++

I've been trying to decompress some .bz2 files on the fly, line by line so to speak, since the files I'm dealing with are massive when uncompressed (in the region of 100 GB), so I wanted a solution that saves disk space.
I have no problems decompressing files compressed with vanilla bzip2, but for files compressed with pbzip2 only the first bz2 stream found is decompressed. This bug-tracker ticket relates to the problem: https://svn.boost.org/trac/boost/ticket/3853, but I was led to believe it was fixed after version 1.41. I've checked the bzip2.hpp file and it contains the 'fixed' version, and I've also checked that the version of Boost used in the program is 1.59.
The code is here:
cout << "Warning bzip2 support is a little buggy!" << endl;
// Open the file here
trans_file.open(files[i].c_str(), std::ios_base::in | std::ios_base::binary);
// Set up boost bzip2 decompression
boost::iostreams::filtering_istream in;
in.push(boost::iostreams::bzip2_decompressor());
in.push(trans_file);
std::string str;
// Begin reading
while (std::getline(in, str))
{
    std::stringstream stream(str);
    stream >> id_f >> id_i >> aif;
    /* Do stuff with the values here */
}
Any suggestions would be great. Thanks!

You are right.
It seems that changeset #63057 only fixes part of the issue.
The corresponding unit-test does work, though. But it uses the copy algorithm (also on a composite<> instead of a filtering_istream, if that is relevant).
I'd open this as a defect or a regression. Include a file that exhibits the problem, of course. For me it's reproduced using just /etc/dictionaries-common/words compressed with pbzip2 (default options).
I have the test.bz2 here: http://7f0d2fd2-af79-415c-ab60-033d3b494dc9.s3.amazonaws.com/test.bz2
Here's my test program:
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/bzip2.hpp>
#include <boost/iostreams/stream.hpp>
#include <fstream>
#include <iostream>
namespace io = boost::iostreams;
void multiple_member_test(); // from the unit tests in changeset #63057
int main() {
    //multiple_member_test();
    //return 0;
    std::ifstream trans_file("test.bz2", std::ios::binary);
    // Set up boost bzip2 decompression
    io::filtering_istream in;
    in.push(io::bzip2_decompressor());
    in.push(trans_file);
    // Begin reading
    std::string str;
    while (std::getline(in, str))
    {
        std::cout << str << "\n";
    }
}
#include <boost/iostreams/compose.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/range/iterator_range.hpp>
#include <algorithm>
#include <cassert>
#include <sstream>
#include <vector>
void multiple_member_test() // from the unit tests in changeset #63057
{
    std::string data(20ul << 20, '*');
    std::vector<char> temp, dest;

    // Write compressed data to temp, twice in succession
    io::filtering_ostream out;
    out.push(io::bzip2_compressor());
    out.push(io::back_inserter(temp));
    io::copy(boost::make_iterator_range(data), out);
    out.push(io::back_inserter(temp));
    io::copy(boost::make_iterator_range(data), out);

    // Read compressed data from temp into dest
    io::filtering_istream in;
    in.push(io::bzip2_decompressor());
    in.push(io::array_source(&temp[0], temp.size()));
    io::copy(in, io::back_inserter(dest));

    // Check that dest consists of two copies of data
    assert(data.size() * 2 == dest.size());
    assert(std::equal(data.begin(), data.end(), dest.begin()));
    assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2));

    dest.clear();
    io::copy(
        io::array_source(&temp[0], temp.size()),
        io::compose(io::bzip2_decompressor(), io::back_inserter(dest)));

    // Check that dest consists of two copies of data
    assert(data.size() * 2 == dest.size());
    assert(std::equal(data.begin(), data.end(), dest.begin()));
    assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2));
}

Related

fwrite() writes garbage at the end

I'm trying to write a function that executes an SQL file with Postgres. Postgres raises an exception, but without specifying the error, so I tried to write back out what it read, and I discovered that the file has some garbage at the end:
stat("treebase.sql", &buf);
dbschema = new (std::nothrow) char[buf.st_size + 1];
if(!dbschema)
{
    wxMessageBox(_("Not Enough memory"));
    return;
}
if( !(fl = fopen("treebase.sql", "r")))
{
    wxMessageBox(_("Can not open treebase.sql"));
    delete [] dbschema;
    return;
};
fo = fopen("tbout.sql", "w");
fread(dbschema, sizeof(char), buf.st_size, fl);
fclose(fl);
dbschema[buf.st_size] = '\0';
fwrite(dbschema, sizeof(char), buf.st_size + 1, fo);
fflush(fo);
fclose(fo);
and the result is
![screen shot][1]
The input file is 150473 bytes long; the output is 156010. I really cannot understand where the extra 5537 bytes come from.
where is the bug?
[1]: https://i.stack.imgur.com/IXesz.png
You probably can't read buf.st_size bytes of data, because the fopen mode "r" defaults to text mode. In text mode, fread and fwrite may perform conversions on what you read or write to match the environment's special rules for text files, such as end-of-line translation. Use the "rb" and "wb" modes of fopen for reading and writing binary files as-is, respectively.
Also, I would rather use fseek and ftell to get the size of the file instead of stat.
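A sketch of that approach, combining binary mode with fseek/ftell (the helper name is illustrative, not from the original answer):

```cpp
#include <cstdio>

// Determine a file's size with fseek/ftell instead of stat.
// Binary mode ("rb") ensures the position counts raw bytes with
// no newline translation. Returns -1 on error.
long file_size(const char *path) {
    FILE *f = std::fopen(path, "rb");
    if (!f) return -1;
    if (std::fseek(f, 0, SEEK_END) != 0) { std::fclose(f); return -1; }
    long size = std::ftell(f);  // position at end == byte count
    std::fclose(f);
    return size;
}
```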
Here's an example of how you could read the content of the file into memory and then write an exact copy to another file. I added error checking too, to make it clear if anything goes wrong. There's also no need to use stat etc.; plain standard C++ will do.
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>
std::string get_file_as_string(const std::string& filename) {
    std::ifstream fl(filename, std::ios::binary); // binary mode
    if(!fl) throw std::runtime_error(std::strerror(errno));
    // return the content of the whole file as a std::string
    return {std::istreambuf_iterator<char>(fl),
            std::istreambuf_iterator<char>{}};
}

bool write_string_to_file(const std::string& str, const std::string& filename) {
    std::ofstream fo(filename, std::ios::binary);
    if(!fo) return false;
    // return true or false depending on if it succeeded writing the file:
    return static_cast<bool>(fo << str);
}

int main() {
    auto dbschema = get_file_as_string("treebase.sql");

    // use dbschema.c_str() if you need a `const char*`:
    postgres_c_function(dbschema.c_str());
    // use dbschema.data() if you need a `char*`:
    postgres_c_function(dbschema.data());

    if(write_string_to_file(dbschema, "tbout.sql")) {
        std::cout << "success\n";
    } else {
        std::cout << "failure: " << std::strerror(errno) << '\n';
    }
}

std::fstream - slow second read

I'm having trouble with fstream while reading large binary files (2 gigabytes). Using this test code I create an fstream and read until EOF, then clear the flags and reset the position to the beginning of the file, then read the file again. The problem is that the second read (the second while loop) is always significantly slower.
I need to get this solved for Embarcadero RAD Studio XE7. The same behaviour of a slower second read can be reproduced in MS Visual Studio 2010. The first read runs at the HDD's maximum speed (about 140 MB/s); the second always runs at a quarter of that (35 MB/s).
Oddly, when I use g++ 4.9.2 on my Debian box, both the first and second reads take the same time, as I would expect.
#include <stdio.h>
#include <cstdlib>
#include <fstream>
#include <string>
char buffer[400000];
int main()
{
    std::fstream stream;
    std::string filename = "largeFile.bin";
    stream.open(filename, std::ios_base::in | std::ios_base::binary);

    while(stream.good()){
        stream.read(buffer, 400000);
    }
    printf("first read complete");

    stream.clear();
    stream.seekg(0, std::ios::beg);

    while(stream.good()){
        stream.read(buffer, 400000);
    }
    printf("second read complete");

    stream.close();
    return 0;
}
I omitted the time measurement of both while loops since it is not significant for this problem.
Am I doing something wrong when reading multiple times from start to EOF on a file that was opened once?
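For reference, the omitted measurement could be reproduced with a small std::chrono harness along these lines (a sketch, not the original timing code; the function name is illustrative):

```cpp
#include <chrono>
#include <fstream>

// Time one full pass over the stream in fixed-size chunks;
// returns elapsed milliseconds. The stream is left at EOF.
double time_full_read(std::fstream& stream, char* buffer, std::size_t size) {
    auto start = std::chrono::steady_clock::now();
    while (stream.good()) {
        stream.read(buffer, size);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling this once per pass (with clear() and seekg() in between, as in the question) would show whether the slowdown lies in the stream layer or in the OS cache.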

read and write a binary file in c++ with fstream

I'm trying to write simple C++ code to read and write a file.
The problem is that my output file is smaller than the original file, and I'm stuck finding the cause.
I have an image of 6.6 kB and my output image is about 6.4 kB.
#include <iostream>
#include <fstream>
using namespace std;
ofstream myOutpue;
ifstream mySource;
int main()
{
    mySource.open("im1.jpg", ios_base::binary);
    myOutpue.open("im2.jpg", ios_base::out);

    char buffer;
    if (mySource.is_open())
    {
        while (!mySource.eof())
        {
            mySource >> buffer;
            myOutpue << buffer;
        }
    }
    mySource.close();
    myOutpue.close();
    return 1;
}
Why not just:
#include <fstream>
int main()
{
    std::ifstream mySource("im1.jpg", std::ios::binary);
    std::ofstream myOutpue("im2.jpg", std::ios::binary);
    myOutpue << mySource.rdbuf();
}
Or, less chattily:
int main()
{
    std::ofstream("im2.jpg", std::ios::binary)
        << std::ifstream("im1.jpg", std::ios::binary).rdbuf();
}
Two things: you forgot to open the output in binary mode, and you can't use the input/output operators >> and << for binary data, except if you use the output operator to write the input stream's basic_streambuf (which you can get using rdbuf).
For input use read and for output use write.
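A minimal sketch of that read/write approach, copying in fixed-size chunks (the function name and buffer size are illustrative, not from the original answer):

```cpp
#include <fstream>

// Copy src to dst using unformatted read/write, which never skips
// whitespace or translates bytes. Returns false if either file
// fails to open or the write fails.
bool copy_binary(const char* src, const char* dst) {
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary);
    if (!in || !out) return false;
    char buf[4096];
    // Keep looping while a full chunk was read, or a final partial
    // chunk remains (gcount() reports the bytes actually read).
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        out.write(buf, in.gcount());
    }
    return static_cast<bool>(out);
}
```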
There are 3 problems in your code:
1- You have not opened your output file in binary mode.
2- Your code returns 1; normally you should return 0, and return an error code only if something went wrong.
3- You should use manipulators to tell C++ not to skip whitespace, so in order to read from the file, instead of:
mySource >> buffer;
you should use:
mySource >> std::noskipws >> buffer;
Well, it's just because of padding at the end of the image. Reading to eof of a file does not include the padded bytes added at the end of the file.
Try this: take an img1.jpg that contains 20 space characters at the end, not visible here (uegfuyregwfyugwrerycgerfcg6ygerbucykgeugcrgfrgeyf), and run your program (do not include the parentheses in the file; they are only used to show the data content).
You will see that img2.jpg contains (uegfuyregwfyugwrerycgerfcg6ygerbucykgeugcrgfrgeyf).
So it's a better option to read the file byte by byte using the file size, which you can get using stat, and run a for loop up to the file size. Hope this resolves the problem you mentioned above.

Read a JPEG image from memory with boost::gil

I am trying to read an image from memory using boost::gil as present in Boost 1.53. I have taken the following lines from an example found on the internet:
#include <boost/gil/gil_all.hpp>
boost::gil::rgb8_image_t img;
boost::gil::image_read_settings<jpeg_tag> readSettings;
boost::gil::read_image(mystream, img, readSettings);
Except for the first line, the types and functions in the remaining lines cannot be found in the boost::gil namespace, so I cannot test whether the above lines do what I want. Do you have any idea where to get the required types and functions?
See the new version of gil here: gil stable version
It works well and it is stable.
using namespace boost::gil;
image_read_settings<jpeg_tag> readSettings;
rgb8_image_t newImage;
read_image(stream, newImage, readSettings);
Your code seems correct.
Boost 1.68, which is planned for release on the 8th of August, 2018, will finally deliver the new Boost.GIL IO (aka IOv2), reviewed and accepted a long time ago.
It is already available from the current master branch of the Boost super-project (check Boost.GIL's CONTRIBUTING.md for guidelines on how to work with the super-project).
Now you can use GIL from Boost 1.68 or later; here is an example that shows how to read an image from an input stream. It does not have to be a file-based stream; any std::istream-compatible stream should work.
#include <boost/gil.hpp>
#include <boost/gil/io/io.hpp>
#include <boost/gil/extension/io/jpeg.hpp>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        std::cerr << "input jpeg file missing\n";
        return EXIT_FAILURE;
    }

    try
    {
        std::ifstream stream(argv[1], std::ios::binary);

        namespace bg = boost::gil;
        bg::image_read_settings<bg::jpeg_tag> read_settings;
        bg::rgb8_image_t image;
        bg::read_image(stream, image, read_settings);

        return EXIT_SUCCESS;
    }
    catch (std::exception const& e)
    {
        std::cerr << e.what() << std::endl;
    }
    return EXIT_FAILURE;
}

file compression to handle intermediary output in c++

I want to compress an intermediate output of my program (in C++) and then decompress it.
You can use Boost IOStreams to compress your data, for example something along these lines to compress/decompress into/from a file (example adapted from the Boost docs):
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/gzip.hpp>
namespace bo = boost::iostreams;
int main()
{
    {
        std::ofstream ofile("hello.gz", std::ios_base::out | std::ios_base::binary);
        bo::filtering_ostream out;
        out.push(bo::gzip_compressor());
        out.push(ofile);
        out << "This is a gz file\n";
    }
    {
        std::ifstream ifile("hello.gz", std::ios_base::in | std::ios_base::binary);
        bo::filtering_streambuf<bo::input> in;
        in.push(bo::gzip_decompressor());
        in.push(ifile);
        boost::iostreams::copy(in, std::cout);
    }
}
You can also have a look at Boost Serialization, which can make saving your data much easier. It is possible to combine the two approaches (example). IOStreams supports bzip2 compression as well.
EDIT: To address your last comment: you can compress an existing file, but it would be better to write it as compressed to begin with. If you really want to, you could tweak the following code:
std::ifstream ifile("file", std::ios_base::in | std::ios_base::binary);
std::ofstream ofile("file.gz", std::ios_base::out | std::ios_base::binary);
bo::filtering_streambuf<bo::output> out;
out.push(bo::gzip_compressor());
out.push(ofile);
bo::copy(ifile, out);