Streaming decompression of data in a vector - C++

I need to do the following:
1. Compress an input text into an array.
2. Split the compressed output into multiple pieces of approximately the same length and store them in a vector.
3. Decompress the pieces using streaming decompression.
Is it possible to do that? Consider that in my case the size of each block is fixed and independent of the compression scheme.
In the example I found, the decompression function returns the size of the next block; I wonder whether that is tied to the compression scheme, i.e. whether you cannot just take an arbitrary sub-array of the full compressed array and decompress it on its own.
I need to use zstd, no other compression algorithms.
Here is what I tried so far.
//std::vector<std::string_view> _content_compressed passed as parameter
ZSTD_DStream* const dstream = ZSTD_createDStream();
ZSTD_initDStream(dstream);
std::vector<char*> vec;
for (auto el : _content_compressed)
{
    char* decompressed = new char[1000];
    ZSTD_inBuffer input = { el.data(), el.size(), 0 };
    ZSTD_outBuffer output = { decompressed, _decompressed_size, 0 };
    std::size_t toRead = ZSTD_decompressStream(dstream, &output, &input);
    vec.push_back(decompressed);
}
The problem is that decompressed doesn't contain the decompressed value at the end.
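For reference, a minimal sketch (my own, under stated assumptions) of how the streaming API is usually driven over pre-split chunks: the pieces are consecutive slices of a single compressed frame, they are fed in order to one DStream, and ZSTD_decompressStream is called in a loop until each chunk's input is fully consumed. The split points do not need to align with any internal block boundary.

#include <zstd.h>
#include <string>
#include <string_view>
#include <vector>

std::string decompress_chunks(const std::vector<std::string_view>& chunks)
{
    ZSTD_DStream* const dstream = ZSTD_createDStream();
    ZSTD_initDStream(dstream);

    std::string result;
    std::vector<char> out(ZSTD_DStreamOutSize()); // recommended output buffer size

    for (std::string_view chunk : chunks)
    {
        ZSTD_inBuffer input = { chunk.data(), chunk.size(), 0 };
        while (input.pos < input.size) // loop until this chunk is fully consumed
        {
            ZSTD_outBuffer output = { out.data(), out.size(), 0 };
            const std::size_t ret = ZSTD_decompressStream(dstream, &output, &input);
            if (ZSTD_isError(ret))
                return result; // real code should report ZSTD_getErrorName(ret)
            result.append(out.data(), output.pos);
        }
    }
    ZSTD_freeDStream(dstream);
    return result;
}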

Related

Why does my use of zlib decompress incorrectly?

Please explain whether this is a Zlib bug or I am misunderstanding how to use Zlib.
I am trying to do the following:
- I have two strings of data to compress, string_data_1 and string_data_2, which I compress with Zlib as raw deflate data.
- Next, I create a third string and copy the two compressed results into that single string.
- Now I decompress this combined compressed data, and there is a problem.
Zlib decompressed only the "first" part of the compressed data and did not decompress the second part. Is that how it should be?
For comparison, in the facebook/zstd Zstandard library exactly the same procedure decompresses all of the data, both the first and the second part.
Here is a simple code:
#include <iostream>
#include <string>
#include <zlib.h>
int my_Zlib__compress__RAW(std::string& string_data_to_be_compressed, std::string& string_compressed_result, int level_compressed)
{
    //-------------------------------------------------------------------------
    uLong zlib_uLong = compressBound(string_data_to_be_compressed.size());
    string_compressed_result.resize(zlib_uLong);
    //-------------------------------------------------------------------------
    //this is the standard Zlib compress2 function - with one exception: deflateInit2 is used instead of deflateInit, with windowBits set to -15 so that Zlib produces raw deflate data:
    int status = my_compress2((Bytef*)&string_compressed_result[0], &zlib_uLong, (const Bytef*)&string_data_to_be_compressed[0], string_data_to_be_compressed.size(), level_compressed);
    if (status == Z_OK)
    {
        string_compressed_result.resize(zlib_uLong);
        return 0;
    }
    else
    {
        return 1;
    }
}
int my_Zlib__uncompress__RAW(std::string& string_data_to_be_uncompressed, std::string& string_compressed_data, size_t size_uncompressed_data)
{
    //-------------------------------------------------------------------------
    string_data_to_be_uncompressed.resize(size_uncompressed_data);
    //-------------------------------------------------------------------------
    //this is the standard Zlib uncompress function - with one exception: inflateInit2 is used instead of inflateInit, with windowBits set to -15 so that Zlib consumes raw deflate data:
    int status = my_uncompress((Bytef*)&string_data_to_be_uncompressed[0], (uLongf*)&size_uncompressed_data, (const Bytef*)&string_compressed_data[0], string_compressed_data.size());
    if (status == Z_OK)
    {
        return 0;
    }
    else
    {
        return 1;
    }
}
int main()
{
    int level_compressed = 9;
    //------------------------------------------Compress_1-------------------------------------------
    std::string string_data_1 = "Hello12_Hello12_Hello125"; //The data to be compressed.
    std::string string_compressed_result_RAW_1; //Compressed data will be written here
    int status = my_Zlib__compress__RAW(string_data_1, string_compressed_result_RAW_1, level_compressed);
    //----------------------------------------------------------------------------------------------
    //--------------------------------------Compress_2----------------------------------------------
    std::string string_data_2 = "BUY22_BUY22_BUY223"; //The data to be compressed.
    std::string string_compressed_result_RAW_2; //Compressed data will be written here
    status = my_Zlib__compress__RAW(string_data_2, string_compressed_result_RAW_2, level_compressed);
    //----------------------------------------------------------------------------------------------
    std::string Total_compressed_data = string_compressed_result_RAW_1 + string_compressed_result_RAW_2; //Combine the two compressed results into one string
    //Now I want to uncompress the data in the string "Total_compressed_data"
    //--------------------------------------Uncompress--------------------------------
    std::string string_uncompressed_result_RAW; //Uncompressed data will be written here
    int size_that_should_be_when_unpacking = string_data_1.size() + string_data_2.size();
    status = my_Zlib__uncompress__RAW(string_uncompressed_result_RAW, Total_compressed_data, size_that_should_be_when_unpacking);
    //--------------------------------------------------------------------------------
    std::cout << string_uncompressed_result_RAW << std::endl; //Hello12_Hello12_Hello125
}
Zlib decompressed only the "first" part of the compressed data and did not decompress the "second" part. Is that how it should be?
As noted in the comments, a concatenation of zlib streams is not a zlib stream. You need to uncompress again for the second zlib stream. Or compress the whole thing as one zlib stream in the first place.
You would need to use a variant of uncompress2(), not uncompress(), since the former returns, in its last parameter, the number of compressed bytes consumed by the first stream, so that you know where to start decompressing the second one.
Better yet, you should use the inflate() functions for your application. Retaining the uncompressed size for use in decompression means you'd need it on the other end. How do you get it there? Are you transmitting it separately? You do not need it: use inflate() to decompress a chunk at a time, and then you don't need to know the uncompressed size ahead of time.
You should also use the deflate() functions for compression. Then you can keep the stream open, and keep compressing until you're done. Then you will have a single zlib stream.
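To illustrate the inflate() route, here is a hedged sketch (mine, not part of the original answer), assuming raw deflate streams as in the question (windowBits = -15): when inflate() reports Z_STREAM_END, inflateReset() prepares the same context for the next concatenated stream, picking up at the unconsumed input.

#include <zlib.h>
#include <string>

std::string inflate_concatenated_raw(const std::string& compressed)
{
    z_stream strm{};                      // zero-initialized: zalloc/zfree/opaque are Z_NULL
    if (inflateInit2(&strm, -15) != Z_OK) // -15: raw deflate, no zlib header/trailer
        return {};

    strm.next_in  = (Bytef*)compressed.data();
    strm.avail_in = (uInt)compressed.size();

    std::string out;
    char buf[16384];
    int status = Z_OK;
    while (strm.avail_in > 0 && (status == Z_OK || status == Z_STREAM_END))
    {
        strm.next_out  = (Bytef*)buf;
        strm.avail_out = sizeof(buf);
        status = inflate(&strm, Z_NO_FLUSH);
        out.append(buf, sizeof(buf) - strm.avail_out);
        if (status == Z_STREAM_END)
            inflateReset(&strm); // start on the next concatenated stream
    }
    inflateEnd(&strm);
    return out;
}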

What is the best solution for writing numbers into file and than read them?

I have 640*480 numbers that I need to write to a file and read back later. What is the best solution? The numbers are between 0 and 255.
For me the best solution seems to be writing them in binary (8 bits each). I wrote the numbers into a txt file and now it looks like 1011111010111110....., so there is no question of where each number starts and ends.
How am I supposed to read them back from the file?
Using C++.
It's not a good idea to write bit values like 1 and 0 to a text file: every symbol in a text file takes at least 1 byte, and 1 byte = 8 bits, so the file becomes 8 times bigger. Store bytes instead - each value in 0-255 is exactly one byte - so your file will be 640*480 bytes instead of 640*480*8. If you need the individual bits, use your language's bitwise operators after reading the bytes back. Reading bytes is much easier; use a binary file for saving your data.
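A minimal sketch of that advice (the file name and sample data are illustrative): write the 640*480 values as raw bytes, then read them back with the mirrored call.

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    constexpr std::size_t SIZE = 640 * 480;
    std::vector<std::uint8_t> pixels(SIZE, 128); // example data: one byte per number

    // Write: one byte per value, so the file is exactly 640*480 bytes.
    std::ofstream out("pixels.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(pixels.data()), pixels.size());
    out.close();

    // Read back into a buffer of the same size.
    std::vector<std::uint8_t> loaded(SIZE);
    std::ifstream in("pixels.bin", std::ios::binary);
    in.read(reinterpret_cast<char*>(loaded.data()), loaded.size());
}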
Presumably you have some sort of data structure representing your image, which somewhere inside holds the actual data:
class pixmap
{
public:
    // stuff...
private:
    std::unique_ptr<std::uint8_t[]> data;
};
So you can add a new constructor which takes a filename and reads bytes from that file:
pixmap(const std::string& filename)
{
    constexpr int SIZE = 640 * 480;
    // Open an input file stream in binary mode and set it to throw exceptions:
    std::ifstream file;
    file.exceptions(std::ios_base::badbit | std::ios_base::failbit);
    file.open(filename.c_str(), std::ios_base::binary);
    // Create a unique_ptr to hold the data: this will be cleaned up
    // automatically if file reading throws
    std::unique_ptr<std::uint8_t[]> temp(new std::uint8_t[SIZE]);
    // Read SIZE bytes from the file
    file.read(reinterpret_cast<char*>(temp.get()), SIZE);
    // If we get to here, the read worked, so we move the temp data we've just read
    // into where we'd like it
    data = std::move(temp); // or std::swap(data, temp) if you prefer
}
I realise I've assumed some implementation details here (you might not be using a std::unique_ptr to store the underlying image data, though you probably should be) but hopefully this is enough to get you started.
You can print each number between 0 and 255 as a char value in the file.
See the code below: in this example I am printing the integer 70 as a char, so it prints as 'F' on the console.
Similarly, you can read each value back as a char and then convert the char to an integer.
#include <stdio.h>
int main()
{
    int i = 70;
    char dig = (char)i;
    printf("%c", dig);
    return 0;
}
This way you keep the file size down to one byte per number.

Possible to convert std::vector of std::pairs into a byte array?

I am wondering if it is possible to convert a vector of pairs into a byte array.
Here's a small example of creating the vector of pairs:
int main(int argc, char *argv[])
{
    PBYTE FileData, FileData2, FileData3;
    DWORD FileSize, FileSize2, FileSize3;
    /* Here I read 3 files + their sizes and fill the above variables. */
    //Here I create the vector of std::pairs.
    std::vector<std::pair<PBYTE, DWORD>> DataVector
    {
        { FileData, FileSize }, //Each pair contains file data + file size.
        { FileData2, FileSize2 },
        { FileData3, FileSize3 }
    };
    std::cin.ignore(2);
    return 0;
}
Is it possible to convert this vector into a byte array (for compressing, and writing to disk, etc)?
Here is what I tried, but I didn't even get the size right:
PVOID DataVectorArr = NULL;
DWORD DataVectorArrSize = DataVector.size() * sizeof DataVector[0];
if ((DataVectorArr = malloc(DataVectorArrSize)) != NULL)
{
    memcpy(DataVectorArr, &DataVector[0], DataVectorArrSize);
}
std::cout << DataVectorArrSize;
//... Here I tried to write DataVectorArr to disk, which obviously fails because the size isn't correct. I am also not sure whether DataVectorArr contains the DataVector now.
if (DataVectorArr != NULL) free(DataVectorArr); // malloc'd memory must be freed with free(), not delete
Enough code. Is it even possible, or am I doing it wrong? If I am doing it wrong, what would be the solution?
Regards, Okkaaj
Edit: If it's unclear what I am trying to do, read the following (which I commented earlier):
Yes, I am trying to cast the vector of pairs to a PCHAR or PBYTE so I can store it to disk using WriteFile. After it is stored, I can read it from disk as a byte array and parse it back into a vector of pairs. Is this possible? I got the idea from converting / casting a struct to a byte array and back (read more here: Converting struct to byte and back to struct), but I am not sure if this is possible with std::vector instead of structures.
Get rid of the malloc and make use of RAII for this:
std::vector<BYTE> bytes;
for (auto const& x : DataVector)
    bytes.insert(bytes.end(), x.first, x.first + x.second);
// bytes now contains all images buttressed end-to-end.
std::cout << bytes.size() << '\n';
To avoid potential resize slow-lanes, you can enumerate the size calculation first, then .reserve() the space ahead of time:
std::size_t total_len = 0;
for (auto const& x : DataVector)
    total_len += x.second;
std::vector<BYTE> bytes;
bytes.reserve(total_len);
for (auto const& x : DataVector)
    bytes.insert(bytes.end(), x.first, x.first + x.second);
// bytes now contains all images buttressed end-to-end.
std::cout << bytes.size() << '\n';
But if all you want to do is dump these contiguously to disk, then why not simply:
std::ofstream outp("outfile.bin", std::ios::out | std::ios::binary);
for (auto const& x : DataVector)
    outp.write(reinterpret_cast<const char*>(x.first), x.second); // reinterpret_cast: PBYTE is unsigned char*
outp.close();
skipping the middle man entirely.
And honestly, unless there is a good reason to do otherwise, it is highly likely your DataVector would be better off as simply a std::vector< std::vector<BYTE> > in the first place.
Update
If recovery is needed, you can't just do it as above. The minimal artifact that is missing is a description of the data itself; in this case, that description is the actual length of each pair's segment. To accomplish recovery, the length must be stored along with the data. Doing that is trivial, unless you also need the result to be portable across platforms.
If that last sentence made you raise your brow, consider the problems with doing something as simple as this:
std::ofstream outp("outfile.bin", std::ios::out | std::ios::binary);
for (auto const& x : DataVector)
{
    uint64_t len = static_cast<uint64_t>(x.second); // the length, not the pointer
    outp.write(reinterpret_cast<const char*>(&len), sizeof(len));
    outp.write(reinterpret_cast<const char*>(x.first), x.second);
}
outp.close();
Well, now you can read each file back by doing this (sketched in code below):
Read a uint64_t to obtain the byte length of the data to follow
Read the data of that length
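A hedged sketch of that read loop (the container choice and names are mine, not the answer's): read a length, then that many bytes, until the file runs out.

#include <cstdint>
#include <fstream>
#include <vector>

std::vector<std::vector<unsigned char>> read_back(const char* path)
{
    std::vector<std::vector<unsigned char>> result;
    std::ifstream inp(path, std::ios::in | std::ios::binary);
    std::uint64_t len = 0;
    while (inp.read(reinterpret_cast<char*>(&len), sizeof(len)))
    {
        std::vector<unsigned char> buf(static_cast<std::size_t>(len));
        if (!inp.read(reinterpret_cast<char*>(buf.data()), len))
            break; // truncated file
        result.push_back(std::move(buf));
    }
    return result;
}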
But this has inherent problems. It isn't portable at all. The endian representation of the reader's platform had better match that of the writer, or this is utter fail. To accommodate this limitation, the length preamble must be written in a platform-independent manner, which is tedious, and a foundational reason why serialization libraries and their protocols exist in the first place.
If you haven't second-guessed what you're doing and how you're doing it by this point, you may want to read this again.

Reading in raw encoded nrrd data file into double

Does anyone know how to read in a file with raw encoding? So stumped.... I am trying to read in floats or doubles (I think). I have been stuck on this for a few weeks. Thank you!
File that I am trying to read from:
http://www.sci.utah.edu/~gk/DTI-data/gk2/gk2-rcc-mask.raw
Description of raw encoding:
http://teem.sourceforge.net/nrrd/format.html#encoding
- "raw" - The data appears on disk exactly the same as in memory, in terms of byte values and byte ordering. Produced by write() and fwrite(), suitable for read() or fread().
Info of file:
http://www.sci.utah.edu/~gk/DTI-data/gk2/gk2-rcc-mask.nhdr - I think the only things that matter here are the big endian (still trying to understand what that means from google) and raw encoding.
My current approach, uncertain if it's correct:
//Function ripped off from the example on the C++ ifstream::read reference page
void scantensor(string filename)
{
    ifstream tdata(filename, ifstream::binary); // not sure if I should put ifstream::binary here
    // other things I tried:
    // ifstream tdata(filename);
    // ifstream tdata(filename, ios::in);
    if (tdata)
    {
        tdata.seekg(0, tdata.end);
        int length = tdata.tellg();
        tdata.seekg(0, tdata.beg);
        char* buffer = new char[length];
        tdata.read(buffer, length);
        tdata.close();
        double* d;
        d = (double*) buffer;
    }
    else cerr << "failed" << endl;
}
/* P.S. I attempted to print the first 100 elements of the array.
Then I printed 100 other elements at some arbitrary array indices (e.g. 9,900 - 10,000). I kept increasing the number of 0's until I ran out of bounds at 100,000,000 (I don't think that's how it works, lol, but I was just playing around to see what happens).
Here's the part that makes me suspicious: ifstream has the different constructors I tried above, and
- the first 100 values are always the same;
- if I use ifstream::binary, then I get some values for the 100 arbitrary elements;
- if I use the other two options, then I get -6.27744e+066 for all 100 of them.
So for now I am going to assume that ifstream::binary is the correct one. The thing is, I am not sure whether the file I provided is what binary files actually look like. I am also unsure whether these are the actual numbers I am supposed to read in, or just casting gone wrong. I do realize that my cast from char* to double* can be unsafe; I learned that from one of the threads.
*/
I really appreciate it!
Edit 1: Right now the data being read in using the above method is apparently "incorrect" since in paraview the values are:
Dxx,Dxy,Dxz,Dyy,Dyz,Dzz
[0, 1], [-15.4006, 13.2248], [-5.32436, 5.39517], [-5.32915, 5.96026], [-17.87, 19.0954], [-6.02961, 5.24771], [-13.9861, 14.0524]
It's a 3 x 3 symmetric matrix, so 7 distinct values, 7 ranges of values.
The values I am currently parsing from the file have wildly implausible magnitudes (e.g. -4.68855e-229, -1.32351e+120).
Perhaps somebody knows how to extract the floats from Paraview?
Since you want to work with doubles, I recommend reading the data from the file as a buffer of doubles:
const long machineMemory = 0x40000000; // 1 GB
FILE* file = fopen("c:\\data.bin", "rb");
if (file)
{
    int size = machineMemory / sizeof(double);
    if (size > 0)
    {
        double* data = new double[size];
        int read(0);
        while (read = fread(data, sizeof(double), size, file))
        {
            // Process data here (read = number of doubles)
        }
        delete [] data;
    }
    fclose(file);
}
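One caveat the header calls out: the file is big-endian. On a little-endian machine, each value read this way must be byte-swapped before use. A hedged sketch, assuming 8-byte doubles as above:

#include <cstdint>
#include <cstring>

// Reverse the 8 bytes of a double read from a big-endian file so it is
// usable on a little-endian host. A real program should first detect
// (or be told) the host's byte order.
double from_big_endian(double v)
{
    std::uint64_t bits = 0;
    std::memcpy(&bits, &v, sizeof(bits));
    std::uint64_t swapped = 0;
    for (int i = 0; i < 8; ++i)
        swapped = (swapped << 8) | ((bits >> (8 * i)) & 0xFF);
    std::memcpy(&v, &swapped, sizeof(v));
    return v;
}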

Best way to store string of known maximum length in file for fast load into vector<string> in C++

I've got a big amount of text data which I need to save to a file for later reprocessing. The data is stored in a table-like vector<vector<string>>: every record (outer vector element) has the same number of attributes. Going through the vector, I can find the maximum length of every attribute in the table and the count of records. Now I have to write these data to a file (it can be binary) in such a way that I can load them back into vector<vector<string>> very fast. It doesn't matter how long writing takes, but reading back into the vector needs to be as fast as possible.
Because the data will be processed record by record, the whole file may not be loaded into memory at once. But for fast reading I want to use a buffer of 256 MB or 512 MB.
So far I have implemented this as follows:
The data is stored in two files - a description file and a data file. The description file contains the count of records, the count of attributes, and the maximum length of every attribute. The data file is a binary file of chars. There are no value or record separators, just values. Every value of a given attribute has the same length, so if some value is shorter than the maximum length, the remaining chars are null characters '\0'.
I then read a chunk of the file into a char array buffer (256 MB or 512 MB) with std::fread. When the application calls the function vector<string> getNext(), I read the next run of chars from the buffer (I know the length of every attribute) and append each char to the corresponding string to build the vector.
But this way does not seem fast enough for my purpose when I need to parse a big count of records from the buffer into vectors in a loop. Is there a better way to approach this whole problem?
This part of the code parses chars from the buffer into values:
string value;
vector<string> record;
int pos = bfrIndex(); // returns the current position in the buffer: the start of the next record's values
for (unsigned int i = 0; i < d.colSize.size(); i++) { // d.colSize holds the max length of every attribute
    value.clear();
    value.reserve(d.colSize[i]);
    for (unsigned int j = pos; j < pos + d.colSize[i]; j++) {
        if (buffer[j] == '\0') break;
        value += buffer[j];
    }
    record.push_back(value);
    pos += d.colSize[i]; // advance the buffer position to the next value
}
return record;
I'd consider a binary approach that uses the method employed in Doom's .wad files, i.e. a directory with the length and file offset of each resource, followed by the resources themselves. With a small amount of overhead for the directory, you get instant knowledge of both where to find each string and how long each one is. A sketch follows below.
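An illustrative sketch of that layout, with hypothetical names: the file starts with the entry count, then one (offset, length) record per string, then the raw string bytes. A reader can load the directory first and jump straight to any string. (Like any raw dump, this is not endian-portable.)

#include <cstdint>
#include <ostream>
#include <string>
#include <vector>

struct DirEntry
{
    std::uint64_t offset; // byte position of the string's data in the file
    std::uint64_t length; // number of bytes in the string
};

void save_with_directory(const std::vector<std::string>& strings, std::ostream& out)
{
    std::uint64_t count = strings.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof(count));

    // String data begins after the count and the directory itself.
    std::uint64_t offset = sizeof(count) + count * sizeof(DirEntry);
    for (const std::string& s : strings)
    {
        DirEntry e{ offset, s.size() };
        out.write(reinterpret_cast<const char*>(&e), sizeof(e));
        offset += s.size();
    }
    for (const std::string& s : strings)
        out.write(s.data(), s.size());
}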
vector<vector<string>> is a 3D character "cube" where every dimension varies in size independently of the others. Unless you are able to predict each size, you risk reading one element at a time and reallocating every time.
Fast reading happens when you can load up the data all at once and then define how to split it. The data structure would probably be a single string, plus a vector<vector<range>> where range is a std::pair<std::string::const_iterator, std::string::const_iterator>.
The problem here is that you cannot modify the strings, since they are packed tightly together.
A second option is to keep the dynamic nature of vector<vector<string>>, but store the data so that each size can be read before the data itself; that way you can resize the vectors and then read the content into their components.
In pseudocode:
template<class Stream, class Container>
void save(const Container& c, Stream& s)
{ s.write(c.size()); for (auto& e : c) save(e, s); }

template<class Stream, class Container>
void load(Container& c, Stream& s)
{
    size_t sz = 0; s.read(sz); c.resize(sz);
    for (auto& i : c) load(i, s);
}
Of course, both are specialized for string so that saving/loading a string actually writes/reads its own chars, as sketched below.
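To make that concrete, a hedged sketch of the string case, using a fixed-width length prefix (the stream types and the uint64_t width are my choices):

#include <cstdint>
#include <istream>
#include <ostream>
#include <string>

void save(const std::string& s, std::ostream& out)
{
    std::uint64_t sz = s.size();
    out.write(reinterpret_cast<const char*>(&sz), sizeof(sz)); // size first...
    out.write(s.data(), s.size());                             // ...then the chars
}

void load(std::string& s, std::istream& in)
{
    std::uint64_t sz = 0;
    in.read(reinterpret_cast<char*>(&sz), sizeof(sz));
    s.resize(static_cast<std::size_t>(sz));
    in.read(&s[0], sz);
}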