LZ77 compression algorithm in C++

I was reading about this algorithm, and I coded a class to compress; I have not coded the decompressing class yet.
What do you think about the code?
I think I've got a problem. My encoding is "position | length", but I believe this scheme will cause trouble when I decompress, because I won't know whether the position and length numbers are 2, 3, or 4 digits. :S
Any suggestions will be accepted. :D
Main file:
#include <iostream>
#include "Compressor.h"

int main() {
    Compressor c( "/home/facu/text.txt", 3);
    std::cout << c.get_TEXT_FILE() << std::endl;
    std::cout << c.get_TEXT_ENCODED() << std::endl;
    c.save_file_encoded();
    return 0;
}
Header file:
#ifndef _Compressor_H_
#define _Compressor_H_

#include <utility>
#include <string>

typedef unsigned int T_UI;

class Compressor
{
public:
    // Constructor
    Compressor( const std::string &PATH, const T_UI minbytes = 3 );
    /** GET BUFFERS **/
    std::string get_TEXT_FILE() const;
    std::string get_TEXT_ENCODED() const;
    /** END GET BUFFERS **/
    void save_file_encoded();
private:
    /** BUFFERS **/
    std::string TEXT_FILE;     // contains the text read from the file
    std::string TEXT_ENCODED;  // contains the encoded text
    std::string W_buffer;      // contains the string to analyze
    std::string W_inspection;  // contains the string in which matches are searched
    /** END BUFFERS **/
    T_UI size_of_minbytes;
    T_UI size_w_insp;          // the size of the inspection window
    T_UI actual_byte;
    std::pair< T_UI, T_UI > v_codes; // values used to encode the text
    // Utility functions
    void change_size_insp() { size_w_insp = TEXT_FILE.length(); }
    bool inspection_empty() const;
    std::string convert_pair() const;
    // Encoding algorithm
    void lz77_encode();
};
#endif
Implementation file:
#include <iostream>
#include <fstream>
using std::ifstream;
using std::ofstream;
#include <string>
#include <cstdlib>
#include <sstream>
#include "Compressor.h"

Compressor::Compressor(const std::string& PATH, const T_UI minbytes)
{
    std::string buffer = "";
    TEXT_FILE = "";
    ifstream input_text( PATH.c_str(), std::ios::in );
    if( !input_text )
    {
        std::cerr << "Can't open the text file";
        std::exit( 1 );
    }
    while( !input_text.eof() )
    {
        std::getline( input_text, buffer );
        TEXT_FILE += buffer;
        TEXT_FILE += "\n";
        buffer.clear();
    }
    input_text.close();
    change_size_insp();
    size_of_minbytes = minbytes;
    TEXT_ENCODED = "";
    W_buffer = "";
    W_inspection = "";
    v_codes.first = 0;
    v_codes.second = 0;
    actual_byte = 0;
    lz77_encode();
}
std::string Compressor::get_TEXT_FILE() const
{
    return TEXT_FILE;
}

std::string Compressor::get_TEXT_ENCODED() const
{
    return TEXT_ENCODED;
}

bool Compressor::inspection_empty() const
{
    return ( size_w_insp != 0 );
}

std::string Compressor::convert_pair() const
{
    std::stringstream out;
    out << v_codes.first;
    out << "|";
    out << v_codes.second;
    return out.str();
}

void Compressor::save_file_encoded()
{
    std::string path("/home/facu/encoded.txt");
    ofstream out_txt( path.c_str(), std::ios::out );
    out_txt << TEXT_ENCODED << "\n";
    out_txt.close();
}
void Compressor::lz77_encode()
{
    while( inspection_empty() )
    {
        W_buffer = TEXT_FILE.substr( actual_byte, 1 );
        if( W_inspection.find( W_buffer ) == W_inspection.npos )
        {
            // Couldn't find this byte in the inspection window
            TEXT_ENCODED += W_buffer;
            W_inspection += W_buffer;
            W_buffer.clear();
            ++actual_byte;
            --size_w_insp;
        }
        else
        {
            // Found the buffer byte in the inspection window
            v_codes.first = W_inspection.find( W_buffer );
            v_codes.second = 1;
            while( W_inspection.find( W_buffer ) != W_inspection.npos )
            {
                ++actual_byte;
                --size_w_insp;
                v_codes.second++;
                W_inspection += TEXT_FILE[actual_byte - 1];
                W_buffer += TEXT_FILE[actual_byte];
            }
            ++actual_byte;
            --size_w_insp;
            if( v_codes.second > size_of_minbytes )
                TEXT_ENCODED += convert_pair();
            else
                TEXT_ENCODED += W_buffer;
            W_buffer.clear();
        }
    }
}
Thank you!
I'm coding the decompressing class now. :)

I generally recommend writing the decompressor first, and then writing the compressor to match it.
I recommend getting a compressor and corresponding decompressor working with fixed-size copy items first, and only afterwards -- if necessary -- tweaking them to produce/consume variable-size copy items.
Many LZ77-like algorithms use a fixed size in the compressed file to represent both the position and length;
often one hex digit for length and 3 hex digits for position, a total of 2 bytes.
The "|" between the position and the copy-length is unnecessary.
If you are really trying to implement the original LZ77 algorithm,
your compression algorithm needs to always emit the fixed-length copy-length (even when it is zero), the fixed-length position (when the length is zero, you might as well stick zero here also), and a fixed-length literal value.
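As a concrete illustration of that fixed-width idea, here is a minimal sketch of the emitting side, assuming the "one hex digit for length, three hex digits for position" layout mentioned above; the helper name and the 5-byte item size are illustrative choices, not part of the original code:
#include <cstdio>
#include <string>

// Hypothetical helper: serialize one LZ77 item as a fixed-width record:
// 1 hex digit for the copy length, 3 hex digits for the position, then
// one literal byte -- always exactly 5 bytes, even when the length is zero.
std::string emit_item( unsigned position, unsigned length, char literal )
{
    char fields[5];                                   // "LPPP" + terminating '\0'
    std::snprintf( fields, sizeof fields, "%01X%03X",
                   length & 0xFu, position & 0xFFFu );
    return std::string( fields ) + literal;           // fixed 5-byte item
}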
Some LZ77-like file formats are divided into "items" that are either a fixed-length copy-length,position pair, or else one or more literal values.
If you go that route, the compressor must somehow first tell the decompressor whether the upcoming item represents literal value(s) or a copy-length, position pair.
One of many ways to do this is to reserve a special "0" position value that, rather than indicating some position in the decompressed output stream like all other position values, instead indicates that the next few values in the compressed input file are literals.
Nearly all LZ77-like algorithms store an "offset" backwards from the current location in the plaintext, rather than a "position" forwards from the beginning of the plaintext.
For example, "1" represents the most recently-decoded plaintext byte, not the first-decoded plaintext byte.
How is it possible for the decoder to tell where one integer ends and the next one begins, when the compressed file contains a series of integers?
There are three popular answers:
Use a fixed-length code, where you've set in stone at compile time how long each integer will be (simplest; a sketch of this option follows below).
Use a variable-length code, and reserve a special symbol like "|" to indicate end-of-code.
Use a variable-length prefix code.
There are also other approaches, such as range coding (most complicated).
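For the fixed-length option, here is a minimal sketch of the matching decoder side, assuming the same hypothetical 5-byte "length, position, literal" layout as the encoder sketch above; because every field has a known width, no "|" separator is needed:
#include <string>

struct Item { unsigned length; unsigned position; char literal; };

// Parse one fixed-width item starting at 'offset' in the encoded string.
// The decoder always knows where each integer ends: 1 hex digit for the
// length, 3 hex digits for the position, then the literal byte.
Item parse_item( const std::string &enc, std::size_t offset )
{
    Item it;
    it.length   = std::stoul( enc.substr( offset, 1 ), nullptr, 16 );
    it.position = std::stoul( enc.substr( offset + 1, 3 ), nullptr, 16 );
    it.literal  = enc[offset + 4];
    return it;
}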
https://en.wikibooks.org/wiki/Data_Compression
Jacob Ziv and Abraham Lempel; A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, 23(3), pp.337-343, May 1977.

For a simple LZ77 reference implementation, see this "Header-only c++ library of compression algorithms that does not have any external dependencies":
https://github.com/wkoroy/compressholib

base64 encoding removing carriage return from dos header

I have been trying to encode the binary data of an application as base64 (specifically Boost's base64), but I have run into an issue where the carriage return after the DOS header is not being encoded correctly.
It should look like this:
This program cannot be run in DOS mode.[CR]
[CR][LF]
but instead the output looks like this:
This program cannot be run in DOS mode.[CR][LF]
It seems the first carriage return is being skipped, which then causes the DOS header to be invalid when attempting to run the program.
The code for the base64 algorithm I am using can be found at: https://www.boost.org/doc/libs/1_66_0/boost/beast/core/detail/base64.hpp
Thanks so much!
bool load_file(const char* filename, char** file_out, size_t& size_out)
{
    FILE* file;
    fopen_s(&file, filename, "r");
    if (!file)
        return false;
    fseek(file, 0, SEEK_END);
    size_out = ftell(file);
    rewind(file);
    *file_out = new char[size_out];
    fread(*file_out, size_out, 1, file);
    fclose(file);
    return true;
}
void some_func()
{
    char* file_in;
    size_t file_in_size;
    load_file("filename.bin", &file_in, file_in_size);

    auto encoded_size = base64::encoded_size(file_in_size);
    auto file_encoded = new char[encoded_size];
    memset(file_encoded, 0, encoded_size);
    base64::encode(file_encoded, file_in, file_in_size);

    std::ofstream orig("orig.bin", std::ios_base::binary);
    for (int i = 0; i < file_in_size; i++)
    {
        auto c = file_in[i];
        orig << c; // DOS header contains a NUL as the 3rd char; don't let it terminate the output early. May cause trailing NULs, but that does not affect binary files.
    }
    orig.close();

    std::ofstream encoded("encoded.txt"); // pass this output through a base64-to-file website.
    encoded << file_encoded; // no loop required; contains no NULs (besides trailing encoded NULs).
    encoded.close();

    auto decoded_size = base64::decoded_size(encoded_size);
    auto file_decoded = new char[decoded_size];
    memset(file_decoded, 0, decoded_size); // again, trailing NULs, but it doesn't matter for binary file operation -- just wasted disk space.
    base64::decode(file_decoded, file_encoded, encoded_size);

    std::ofstream decoded("decoded.bin", std::ios_base::binary);
    for (int i = 0; i < decoded_size; i++)
    {
        auto c = file_decoded[i];
        decoded << c;
    }
    decoded.close();

    delete[] file_in;
    delete[] file_encoded;
    delete[] file_decoded;
}
The above code will show that the file reading does not remove the carriage return, while the encoding of the file into base64 does.
Okay thanks for adding the code!
I tried it, and indeed there was "strangeness", even after I simplified the code (mostly to make it C++, instead of C).
So what do you do? You look at the documentation for the functions. That seems complicated since, after all, detail::base64 is, by definition, not part of the public API and is "undocumented".
However, you can still read the comments at the functions involved, and they are pretty clear:
/** Encode a series of octets as a padded, base64 string.
    The resulting string will not be null terminated.
    @par Requires
    The memory pointed to by `out` points to valid memory
    of at least `encoded_size(len)` bytes.
    @return The number of characters written to `out`. This
    will exclude any null termination.
*/
std::size_t
encode(void* dest, void const* src, std::size_t len)
And
/** Decode a padded base64 string into a series of octets.
    @par Requires
    The memory pointed to by `out` points to valid memory
    of at least `decoded_size(len)` bytes.
    @return The number of octets written to `out`, and
    the number of characters read from the input string,
    expressed as a pair.
*/
std::pair<std::size_t, std::size_t>
decode(void* dest, char const* src, std::size_t len)
Conclusion: What Is Wrong?
Nothing about "DOS headers" or "carriage returns". Perhaps something about "rb" in fopen (see: what's the difference between r and rb in fopen), but why even use that:
template <typename Out> Out load_file(std::string const& filename, Out out) {
    std::ifstream ifs(filename, std::ios::binary); // or "rb" on your fopen
    ifs.exceptions(std::ios::failbit | std::ios::badbit); // we prefer exceptions
    return std::copy(std::istreambuf_iterator<char>(ifs), {}, out);
}
The real issue is: your code ignored all return values from encode/decode.
The encoded_size and decoded_size values are estimations that will give you enough space to store the result, but you have to correct it to the actual size after performing the encoding/decoding.
Here's my fixed and simplified example. Notice how the md5sums check out:
Live On Coliru
#include <boost/beast/core/detail/base64.hpp>
#include <fstream>
#include <iostream>
#include <vector>

namespace base64 = boost::beast::detail::base64;

template <typename Out> Out load_file(std::string const& filename, Out out) {
    std::ifstream ifs(filename, std::ios::binary); // or "rb" on your fopen
    ifs.exceptions(std::ios::failbit | std::ios::badbit); // we prefer exceptions
    return std::copy(std::istreambuf_iterator<char>(ifs), {}, out);
}

int main() {
    std::vector<char> input;
    load_file("filename.bin", back_inserter(input));

    // allocate "enough" space, using an upper-bound prediction:
    std::string encoded(base64::encoded_size(input.size()), '\0');

    // encode returns the **actual** encoded_size:
    auto encoded_size = base64::encode(encoded.data(), input.data(), input.size());
    encoded.resize(encoded_size); // so adjust the size

    std::ofstream("orig.bin", std::ios::binary)
        .write(input.data(), input.size());
    std::ofstream("encoded.txt") << encoded;

    // allocate "enough" space, using an upper-bound prediction:
    std::vector<char> decoded(base64::decoded_size(encoded_size), 0);

    auto [decoded_size, // decode returns the **actual** decoded_size
          processed]    // (as well as the number of encoded bytes processed)
        = base64::decode(decoded.data(), encoded.data(), encoded.size());
    decoded.resize(decoded_size); // so adjust the size

    std::ofstream("decoded.bin", std::ios::binary)
        .write(decoded.data(), decoded.size());
}
When run on "itself" using
g++ -std=c++20 -O2 -Wall -pedantic -pthread main.cpp -o filename.bin && ./filename.bin
md5sum filename.bin orig.bin decoded.bin
base64 -d < encoded.txt | md5sum
it prints:
d4c96726eb621374fa1b7f0fa92025bf filename.bin
d4c96726eb621374fa1b7f0fa92025bf orig.bin
d4c96726eb621374fa1b7f0fa92025bf decoded.bin
d4c96726eb621374fa1b7f0fa92025bf -

C++ storing 0 and 1 more efficiently, like in a binary file?

I want to store multiple arrays whose entries all consist of either 0 or 1.
The file would be quite large if I do it the way I currently do it.
I made a minimal version of what I currently do.
#include <iostream>
#include <fstream>
using namespace std;

int main(){
    ofstream File;
    File.open("test.csv");
    int array[4]={1,0,0,1};
    for(int i = 0; i < 4; ++i){
        File << array[i] << endl;
    }
    File.close();
    return 0;
}
So basically, is there a way of storing this in a binary file or something, since my data is 0 or 1 in the first place anyway?
If yes, how do I do this? Can I also still have line breaks and maybe even commas in that file? If either of the latter does not work, that's also fine. More importantly, how do I store this as a binary file which has only 0 and 1 so my file is smaller?
Thank you very much!
So basically is there a way of storing this in a binary file or something, since my data is 0 or 1 in the first place anyways? If yes, how to do this? Can i also still have line-breaks and maybe even commas in that file? If either of the latter does not work, that's also fine. Just more importantly, how to store this as a binary file which has only 0 and 1 so my file is smaller.
The obvious solution is to take 64 characters, say A-Z, a-z, 0-9, and + and /, and have each character code for six entries in your table. There is, in fact, a standard for this called Base64. In Base64, A encodes 0,0,0,0,0,0 while / encodes 1,1,1,1,1,1. Each combination of six zeroes or ones has a corresponding character.
This still leaves commas, spaces, and newlines free for your use as separators.
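As a rough illustration of that idea (not a full Base64 codec: padding, decoding, and the bit order within a group are arbitrary choices here), each group of six 0/1 entries can be mapped to one character of the standard alphabet:
#include <cstddef>
#include <string>
#include <vector>

// Map each group of six 0/1 entries to one Base64 character.
// 'A' encodes 0,0,0,0,0,0 and '/' encodes 1,1,1,1,1,1, as described above.
static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string pack_bits_base64(const std::vector<int>& bits)
{
    std::string out;
    for (std::size_t i = 0; i < bits.size(); i += 6) {
        unsigned v = 0;
        for (std::size_t j = 0; j < 6; ++j)   // MSB-first within the group; missing bits padded with 0
            v = (v << 1) | ((i + j < bits.size()) ? (bits[i + j] & 1) : 0);
        out += kAlphabet[v];
    }
    return out;
}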
If you want to store the data as compactly as possible, I'd recommend storing it as binary data, where each bit in the binary file represents one boolean value. This will allow you to store 8 boolean values for each byte of disk space you use up.
If you want to store arrays whose lengths are not multiples of 8, it gets a little bit more complicated since you can't store a partial byte, but you can solve that problem by storing an extra byte of meta-data at the end of the file that specifies how many bits of the final data-byte are valid and how many are just padding.
Something like this:
#include <iostream>
#include <fstream>
#include <cstdint>
#include <vector>
using namespace std;

// Given an array of ints that are either 1 or 0, returns a packed-array
// of uint8_t's containing those bits as compactly as possible.
vector<uint8_t> packBits(const int * array, size_t arraySize)
{
    const size_t vectorSize = ((arraySize+7)/8)+1; // round up, then +1 for the metadata byte
    vector<uint8_t> packedBits;
    packedBits.resize(vectorSize, 0);

    // Store 8 boolean-bits into each byte of (packedBits)
    for (size_t i=0; i<arraySize; i++)
    {
        if (array[i] != 0) packedBits[i/8] |= (1<<(i%8));
    }

    // The last byte in the array is special; it holds the number of
    // valid bits that we stored to the byte just before it.
    // That way if the number of bits we saved isn't an even multiple of 8,
    // we can use this value later on to calculate exactly how many bits we should restore
    packedBits[vectorSize-1] = arraySize%8;

    return packedBits;
}

// Given a packed-bits vector (i.e. as previously returned by packBits()),
// returns the vector-of-integers that was passed to the packBits() call.
vector<int> unpackBits(const vector<uint8_t> & packedBits)
{
    vector<int> ret;
    if (packedBits.size() < 2) return ret;

    const size_t validBitsInLastByte = packedBits[packedBits.size()-1]%8;
    const size_t numValidBits = 8*(packedBits.size()-((validBitsInLastByte>0)?2:1)) + validBitsInLastByte;
    ret.resize(numValidBits);

    for (size_t i=0; i<numValidBits; i++)
    {
        ret[i] = (packedBits[i/8] & (1<<(i%8))) ? 1 : 0;
    }
    return ret;
}

// Returns the size of the specified file in bytes, or -1 on failure
static ssize_t getFileSize(ifstream & inFile)
{
    if (inFile.is_open() == false) return -1;

    const streampos origPos = inFile.tellg(); // record current seek-position
    inFile.seekg(0, ios::end);                // seek to the end of the file
    const ssize_t fileSize = inFile.tellg();  // record current seek-position
    inFile.seekg(origPos);                    // so we won't change the file's read-position as a side effect
    return fileSize;
}

int main(){
    // Example of packing an array-of-ints into packed-bits form and saving it
    // to a binary file
    {
        const int array[]={0,0,1,1,1,1,1,0,1,0};

        // Pack the int-array into packed-bits format
        const vector<uint8_t> packedBits = packBits(array, sizeof(array)/sizeof(array[0]));

        // Write the packed-bits to a binary file
        ofstream outFile;
        outFile.open("test.bin", ios::binary);
        outFile.write(reinterpret_cast<const char *>(&packedBits[0]), packedBits.size());
        outFile.close();
    }

    // Now we'll read the binary file back in, unpack the bits to a vector<int>,
    // and print out the contents of the vector.
    {
        // open the file for reading
        ifstream inFile;
        inFile.open("test.bin", ios::binary);

        const ssize_t fileSizeBytes = getFileSize(inFile);
        if (fileSizeBytes < 0)
        {
            cerr << "Couldn't read test.bin, aborting" << endl;
            return 10;
        }

        // Read in the packed-binary data
        vector<uint8_t> packedBits;
        packedBits.resize(fileSizeBytes);
        inFile.read(reinterpret_cast<char *>(&packedBits[0]), fileSizeBytes);

        // Expand the packed-binary data back out to one-int-per-boolean
        vector<int> unpackedInts = unpackBits(packedBits);

        // Print out the int-array's contents
        cout << "Loaded-from-disk unpackedInts vector is " << unpackedInts.size() << " items long:" << endl;
        for (size_t i=0; i<unpackedInts.size(); i++) cout << unpackedInts[i] << " ";
        cout << endl;
    }
    return 0;
}
(You could probably make the file even more compact than that by running zip or gzip on the file after you write it out :) )
You can indeed write and read binary data. However, having line breaks and commas would be difficult. Imagine you save your data as boolean data, so only ones and zeros: having a comma would then mean you need a special character, but you only have ones and zeros! The next best thing would be to make an object of two booleans, one holding the usual data you need (C++ would then read the data in pairs of bits) and the other marking whether you have a comma or not, but I doubt this is what you need. If you want to do something like a CSV, it would be easiest to just fix the size of each column (an int would be 4 bytes, a string no more than 32 chars, for example) and then read and write accordingly. Suppose you have such a binary file of fixed-size records.
To initially save your array of objects, say of type Pet, you would use:
FILE *apFile;
apFile = fopen(FILENAME, "wb+");
fwrite(ARRAY_OF_PETS, sizeof(Pet), SIZE_OF_ARRAY, apFile);
fclose(apFile);
To access the idx-th pet, you would use:
Pet m;
ifstream input_file(FILENAME, ios::in | ios::binary | ios::ate);
input_file.seekg(sizeof(Pet) * idx, ios::beg);
input_file.read((char*) &m, sizeof(Pet));
input_file.close();
You can also add data at the end, change data in the middle, and so on.
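To make the fixed-size-record idea concrete, here is a small self-contained sketch; the Pet struct, the file name, and the field sizes are made up for illustration:
#include <cstdio>
#include <fstream>
#include <iostream>

// Hypothetical fixed-size record: every element occupies exactly
// sizeof(Pet) bytes on disk, so record i starts at offset i * sizeof(Pet).
struct Pet {
    char name[32]; // fixed-length string, as suggested above
    int  age;
};

int main()
{
    const char* kFileName = "pets.bin"; // assumed file name
    Pet pets[2] = { {"Rex", 3}, {"Mia", 5} };

    // Write the whole array in one call (binary mode matters on Windows).
    if (FILE* f = std::fopen(kFileName, "wb")) {
        std::fwrite(pets, sizeof(Pet), 2, f);
        std::fclose(f);
    }

    // Random-access read of record #1 by seeking to 1 * sizeof(Pet).
    Pet m{};
    std::ifstream in(kFileName, std::ios::in | std::ios::binary);
    in.seekg(sizeof(Pet) * 1, std::ios::beg);
    in.read(reinterpret_cast<char*>(&m), sizeof(Pet));
    std::cout << m.name << " is " << m.age << " years old\n"; // prints "Mia is 5 years old"
    return 0;
}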

C/C++ HDF5 Read string attribute

A colleague of mine used LabVIEW to write an ASCII string as an attribute in an HDF5 file. I can see that the attribute exists, and read it, but I can't print it.
The attribute is, as shown in HDF Viewer:
Date = 2015\07\09
So "Date" is its name.
I'm trying to read the attribute with this code
hsize_t sz = H5Aget_storage_size(dateAttribHandler);
std::cout<<sz<<std::endl; //prints 16
hid_t atype = H5Aget_type(dateAttribHandler);
std::cout<<atype<<std::endl; //prints 50331867
std::cout<<H5Aread(dateAttribHandler,atype,(void*)date)<<std::endl; //prints 0
std::cout<<date<<std::endl; //prints messy characters!
//even with an std::string
std::string s(date);
std::cout<<s<<std::endl; //also prints a mess
Why is this happening? How can I get this string as a const char* or std::string?
I also tried using the type atype = H5Tcopy(H5T_C_S1);, and that didn't work either...
EDIT:
Here I provide a full, self-contained program as it was requested:
#include <string>
#include <iostream>
#include <fstream>
#include <hdf5/serial/hdf5.h>
#include <hdf5/serial/hdf5_hl.h>

std::size_t GetFileSize(const std::string &filename)
{
    std::ifstream file(filename.c_str(), std::ios::binary | std::ios::ate);
    return file.tellg();
}

int ReadBinFileToString(const std::string &filename, std::string &data)
{
    std::fstream fileObject(filename.c_str(), std::ios::in | std::ios::binary);
    if(!fileObject.good())
    {
        return 1;
    }
    size_t filesize = GetFileSize(filename);
    data.resize(filesize);
    fileObject.read(&data.front(), filesize);
    fileObject.close();
    return 0;
}

int main(int argc, char *argv[])
{
    std::string filename("../Example.hdf5");
    std::string fileData;
    std::cout<<"Success read file into memory: "<<
        ReadBinFileToString(filename.c_str(), fileData)<<std::endl;

    hid_t handle;
    hid_t magFieldsDSHandle;
    hid_t dateAttribHandler;
    htri_t dateAtribExists;

    handle = H5LTopen_file_image((void*)fileData.c_str(), fileData.size(), H5LT_FILE_IMAGE_DONT_COPY | H5LT_FILE_IMAGE_DONT_RELEASE);
    magFieldsDSHandle = H5Dopen(handle, "MagneticFields", H5P_DEFAULT);
    dateAtribExists = H5Aexists(magFieldsDSHandle, "Date");
    if(dateAtribExists)
    {
        dateAttribHandler = H5Aopen(magFieldsDSHandle, "Date", H5P_DEFAULT);
    }

    std::cout<<"Reading file done."<<std::endl;
    std::cout<<"Open handler: "<<handle<<std::endl;
    std::cout<<"DS handler: "<<magFieldsDSHandle<<std::endl;
    std::cout<<"Attributes exists: "<<dateAtribExists<<std::endl;

    hsize_t sz = H5Aget_storage_size(dateAttribHandler);
    std::cout<<sz<<std::endl;
    char* date = new char[sz+1];
    std::cout<<"mem bef: "<<date<<std::endl;
    hid_t atype = H5Aget_type(dateAttribHandler);
    std::cout<<atype<<std::endl;
    std::cout<<H5Aread(dateAttribHandler, atype, (void*)date)<<std::endl;
    fprintf(stderr, "Attribute string read was '%s'\n", date);
    date[sz] = '\0';
    std::string s(date);
    std::cout<<"mem aft: "<<date<<std::endl;
    std::cout<<s<<std::endl;

    H5Dclose(magFieldsDSHandle);
    H5Fclose(handle);
    return 0;
}
Printed output of this program:
Success read file into memory: 0
Reading file done.
Open handler: 16777216
DS handler: 83886080
Attributes exists: 1
16
mem bef:
50331867
0
Attribute string read was '�P7'
mem aft: �P7
�P7
Press <RETURN> to close this window...
Thanks.
It turned out that H5Aread has to be called with the address of the char pointer, i.e. a pointer to a pointer:
H5Aread(dateAttribHandler, atype, &date);
Keep in mind that one doesn't have to reserve memory for that. The library will reserve the memory, and then you can free it with H5free_memory(date).
This worked fine.
EDIT:
I learned that this is the case only when the string to be read has variable length. If the string has a fixed length, then one has to manually reserve memory of size length+1 and even manually set the last char to null (to get a null-terminated string). There is a function in the HDF5 library that checks whether a string is fixed in length.
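Here is a combined sketch of the two cases, using H5Tis_variable_str() from the C API to distinguish them; "attr" is assumed to be an attribute handle already opened with H5Aopen, and error checking is omitted:
#include <hdf5.h>   // or <hdf5/serial/hdf5.h>, depending on the install
#include <string>

// Read a string attribute whether it is stored as variable- or fixed-length.
std::string read_string_attribute(hid_t attr)
{
    hid_t atype = H5Aget_type(attr);
    std::string result;

    if (H5Tis_variable_str(atype) > 0) {
        // Variable-length: pass the ADDRESS of the pointer; the library
        // allocates the buffer, and H5free_memory() releases it.
        char* value = nullptr;
        H5Aread(attr, atype, &value);
        if (value) { result = value; H5free_memory(value); }
    } else {
        // Fixed-length: allocate H5Tget_size()+1 bytes and null-terminate.
        std::string buf(H5Tget_size(atype) + 1, '\0');
        H5Aread(attr, atype, &buf[0]);
        result = buf.c_str(); // trim at the first NUL
    }

    H5Tclose(atype);
    return result;
}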
I discovered that if you do not allocate date and pass &date to H5Aread, then it works. (I use the C++ and Python APIs, so I do not know the C API very well.) Specifically, change:
char* date = 0;
// std::cout<<"mem bef: "<<date<<std::endl;
std::cout << H5Aread(dateAttribHandler, atype, &date) << std::endl;
And you should see 2015\07\09 printed.
You may want to consider using the C++ API. Using the C++ API, your example becomes:
std::string filename("c:/temp/Example.hdf5");
H5::H5File file(filename, H5F_ACC_RDONLY);
H5::DataSet ds_mag = file.openDataSet("MagneticFields");
if (ds_mag.attrExists("Date"))
{
    H5::Attribute attr_date = ds_mag.openAttribute("Date");
    H5::StrType stype = attr_date.getStrType();
    std::string date_str;
    attr_date.read(stype, date_str);
    std::cout << "date_str= <" << date_str << ">" << std::endl;
}
As a simpler alternative to existing APIs, your use-case could be solved as follows in C using HDFql:
// declare variable 'value'
char *value;
// register variable 'value' for subsequent use (by HDFql)
hdfql_variable_register(&value);
// read 'Date' (from 'MagneticFields') and populate variable 'value' with it
hdfql_execute("SELECT FROM Example.hdf5 MagneticFields/Date INTO MEMORY 0");
// display value stored in variable 'value'
printf("Date=%s\n", value);
FYI, besides C, the code above can be used in C++, Python, Java, C#, Fortran or R with minimal changes.

C++ reading large files part by part

I've been having a problem that I have not been able to solve yet. It is related to reading files; I've looked at threads, even on this website, and they do not seem to solve it. The problem is reading files that are larger than a computer's system memory. When I asked about this a while ago, I was referred to the following code.
string data("");
getline(cin,data);
std::ifstream is (data);//, std::ifstream::binary);
if (is)
{
// get length of file:
is.seekg (0, is.end);
int length = is.tellg();
is.seekg (0, is.beg);
// allocate memory:
char * buffer = new char [length];
// read data as a block:
is.read (buffer,length);
is.close();
// print content:
std::cout.write (buffer,length);
delete[] buffer;
}
system("pause");
This code works well, apart from the fact that it eats memory like a fat kid in a candy store.
So after a lot of ghetto and unrefined programming, I was able to figure out a way to sort of fix the problem. However, I more or less traded one problem for another in my quest.
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <stdio.h>
#include <stdlib.h>
#include <iomanip>
#include <windows.h>
#include <cstdlib>
#include <thread>
using namespace std;
/*======================================================*/
string *fileName = new string("tldr");
char data[36];
int filePos(0); // The pos of the file
int tmSize(0);  // The total size of the file
int split(32);
char buff;
int DNum(0);
/*======================================================*/
int getFileSize(std::string filename) // path to file
{
    FILE *p_file = NULL;
    p_file = fopen(filename.c_str(), "rb");
    fseek(p_file, 0, SEEK_END);
    int size = ftell(p_file);
    fclose(p_file);
    return size;
}
void fs()
{
    tmSize = getFileSize(*fileName);
    int AX(0);
    ifstream fileIn;
    fileIn.open(*fileName, ios::in | ios::binary);
    int n1, n2, n3;
    n1 = tmSize / 32;
    // Does the processing
    while(filePos != tmSize)
    {
        fileIn.seekg(filePos, ios_base::beg);
        buff = fileIn.get();
        // To take into account small files
        if(tmSize < 32)
        {
            int Count(0);
            char MT[40];
            if(Count != tmSize)
            {
                MT[Count] = buff;
                cout << MT[Count]; // << endl;
                Count++;
            }
        }
        // Anything larger than 32
        else
        {
            if(AX != split)
            {
                data[AX] = buff;
                AX++;
                if(AX == split)
                {
                    AX = 0;
                }
            }
        }
        filePos++;
    }
    int tz(0);
    filePos = filePos - 12;
    while(tz != 2)
    {
        fileIn.seekg(filePos, ios_base::beg);
        buff = fileIn.get();
        data[tz] = buff;
        tz++;
        filePos++;
    }
    fileIn.close();
}
void main()
{
    fs();
    cout << tmSize << endl;
    system("pause");
}
What I tried to do with this code is work around the memory issue. Rather than allocating memory for a whole large file, memory that simply does not exist on my system, I tried to use the memory I had instead, which is about 8 GB, but I only wanted to use maybe a few kilobytes of it, if at all possible.
To give you a layout of what I am talking about, I am going to write a line of text.
"Hello my name is cake please give me cake"
Basically what I did was read that piece of text letter by letter. Then I put those letters into a box that could store 32 of them; from there I could apply something like XOR and then write them into another file.
The idea sort of works, but it is horribly slow and leaves off parts of files.
So basically, how can I make something like this work without going slow or cutting off files? I would love to see how XOR works with very large files.
So if anyone has a better idea than what I have, then I would be very grateful for the help.
To read and process the file piece-by-piece, you can use the following snippet:
// Buffer size 1 Megabyte (or any number you like)
size_t buffer_size = 1<<20;
char *buffer = new char[buffer_size];

std::ifstream fin("input.dat");
while (fin)
{
    // Try to read next chunk of data
    fin.read(buffer, buffer_size);
    // Get the number of bytes actually read
    size_t count = fin.gcount();
    // If nothing has been read, break
    if (!count)
        break;
    // Do whatever you need with first count bytes in the buffer
    // ...
}
delete[] buffer;
The buffer size of 32 bytes, as you are using, is definitely too small. You make too many calls to library functions (and the library, in turn, makes calls (although probably not every time) to the OS, which are typically slow, since they cause context switching). There is also no need for tell/seek.
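Building on that snippet, here is a minimal sketch of the XOR use case from the question, processed chunk by chunk so memory use stays at a single buffer; the file names and the single-byte key are made-up examples:
#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    const std::size_t buffer_size = 1 << 20;   // 1 MiB working set
    std::vector<char> buffer(buffer_size);
    const char key = 0x5A;                     // assumed single-byte XOR key

    std::ifstream fin("input.dat", std::ios::binary);
    std::ofstream fout("output.dat", std::ios::binary);

    while (fin) {
        fin.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize count = fin.gcount();   // bytes actually read
        if (count == 0)
            break;
        for (std::streamsize i = 0; i < count; ++i)
            buffer[i] ^= key;                          // transform the chunk in place
        fout.write(buffer.data(), count);              // append it to the output file
    }
    return 0;
}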
If you don't need all of the file content simultaneously, reduce the working set first -- like your set of about 32 bytes -- but since XOR can be applied sequentially, you may as well use a constant-size working set, like 4 kilobytes.
Now you have the option to use the file reader is.read() in a loop and process a small set of data each iteration, or to use memory mapping (e.g. mmap() on POSIX) to map the file content as a memory pointer, on which you can perform both read and write operations.
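For the memory-mapping route, here is a POSIX-only sketch using mmap(); the file name is an assumption, and on Windows the rough equivalent would be CreateFileMapping/MapViewOfFile:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iostream>

int main()
{
    const int fd = open("input.dat", O_RDONLY);
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {   // mmap of an empty file would fail
        close(fd);
        return 1;
    }

    // Map the whole file read-only; the OS pages it in on demand, so even
    // files larger than RAM can be walked through a plain pointer.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }

    const char* data = static_cast<const char*>(p);
    unsigned char x = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        x ^= static_cast<unsigned char>(data[i]);   // e.g. a running XOR over the file

    std::cout << "xor of all bytes: " << static_cast<int>(x) << "\n";

    munmap(p, st.st_size);
    close(fd);
    return 0;
}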

how to read a particular string from a buffer

I have a buffer
char buffer[size];
which I am using to store the file contents of a stream (call it pStream here):
HRESULT hr = pStream->Read(buffer, size, &cbRead);
Now I have all the contents of this stream in buffer, which is of some size (call it size here). I know that two strings,
"<!doctortype html" and ".html>"
are present somewhere (we don't know their locations) inside the stored contents of this buffer, and I want to store just the contents of the buffer from the location of
"<!doctortype html" up to the other string ".html>"
into another buffer2[SizeWeDontKnow].
How do I do that? (The content between these two locations is the content of an HTML file, and I want to store only the HTML file present in this buffer.) Any ideas how to do that?
You can use the strnstr function to find the right positions in your buffer. After you've found the starting and ending tags, you can extract the text in between using strncpy, or use it in place if performance is an issue.
You can calculate the needed size from the positions of the tags and the length of the first tag: nLength = nPosEnd - nPosStart - nStartTagLength. A sketch of this follows below.
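Here is a sketch of that calculation; since strnstr is a BSD extension, this version uses std::string_view (C++17) instead, which also avoids requiring the raw buffer to be null-terminated (the tag strings are the ones from the question):
#include <cstddef>
#include <string_view>

// Return the content between the two tags, or an empty view if either tag
// is missing. Implements nLength = nPosEnd - nPosStart - nStartTagLength.
std::string_view extract_html(const char* buffer, std::size_t size)
{
    std::string_view text(buffer, size);
    constexpr std::string_view start_tag = "<!doctortype html";
    constexpr std::string_view end_tag   = ".html>";

    const std::size_t pos_start = text.find(start_tag);
    const std::size_t pos_end   = text.find(end_tag);
    if (pos_start == std::string_view::npos || pos_end == std::string_view::npos)
        return {};

    const std::size_t content_begin = pos_start + start_tag.size();
    return text.substr(content_begin, pos_end - content_begin);
}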
Look for HTML parsers for C/C++.
Another way is to take a char pointer to the start of the buffer and then check each char thereafter to see if it matches what you are looking for.
If that's the only operation which operates on HTML code in your app, then you could use the solution I provided below (you can also test it online - here). However, if you are going to do some more complicated parsing, then I suggest using some external library.
#include <iostream>
#include <cstdio>
#include <cstring>
using namespace std;

int main()
{
    const char* beforePrefix = "asdfasdfasdfasdf";
    const char* prefix = "<!doctortype html";
    const char* suffix = ".html>";
    const char* postSuffix = "asdasdasd";

    const unsigned size = 1024;
    char buf[size];
    sprintf(buf, "%s%sTHE STRING YOU WANT TO GET%s%s", beforePrefix, prefix, suffix, postSuffix);
    cout << "Before: " << buf << endl;

    const char* firstOccurenceOfPrefixPtr = strstr(buf, prefix);
    const char* firstOccurenceOfSuffixPtr = strstr(buf, suffix);
    if (firstOccurenceOfPrefixPtr && firstOccurenceOfSuffixPtr)
    {
        unsigned textLen = (unsigned)(firstOccurenceOfSuffixPtr - firstOccurenceOfPrefixPtr - strlen(prefix));
        char newBuf[size];
        strncpy(newBuf, firstOccurenceOfPrefixPtr + strlen(prefix), textLen);
        newBuf[textLen] = 0;
        cout << "After: " << newBuf << endl;
    }
    return 0;
}
EDIT
I get it now :). You should use strstr to find the first occurrence of the prefix then. I edited the code above and updated the link.
Are you limited to C, or can you use C++?
In the C library reference there are plenty of useful ways of tokenising strings and comparing for matches (string.h):
http://www.cplusplus.com/reference/cstring/
Using C++ I would do the following (using the buffer and size variables from your code):
// copy char array to std::string
std::string text(buffer, buffer + size);
// define what we're looking for
std::string begin_text("<!doctortype html");
std::string end_text(".html>");
// find the start and end of the text we need to extract
size_t begin_pos = text.find(begin_text) + begin_text.length();
size_t end_pos = text.find(end_text);
// create a substring from the positions (substr takes a start position and a length)
std::string extract = text.substr(begin_pos, end_pos - begin_pos);
// test that we got the extract
std::cout << extract << std::endl;
If you need C string compatibility you can use:
const char* tmp = extract.c_str();