I'm trying to write a Huffman tree to the compressed file after all the actual compressed file data has been inserted. But I just realized a bit of a problem: suppose I decide that once all my actual data has been written to the file, I will put in two linefeed characters and then write the tree.
That means that when I read the data back, those two linefeeds (or any characters, really) are my delimiter. The problem is that it's entirely possible that the actual data also has two linefeeds one after the other, and in that scenario my delimiter check would fail.
I've taken the example of two linefeeds here, but the same is true for any character string. I could work around the problem by taking a longer string as the delimiter, but that would have two undesirable effects:
1. There is still a remote chance that the long string is by some coincidence present in the compressed data.
2. Unnecessarily bloating a file which needs to be compressed.
Does anyone have any suggestions on how to separate the compressed data from the tree data?
First, write the size of the tree in bytes. Then write the tree itself, and then the contents.
When reading, first read the size, then the tree (now you know how many characters to read), and then the contents.
The size can be written as a string, ending with a line feed - this way, you know that the first number and line feed belong to the size of the tree.
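A minimal sketch of that layout (the file name and the serializedTree / compressedData strings are placeholders for however you encode the tree and the payload):

#include <fstream>
#include <string>

int main() {
    std::string serializedTree;    // the encoded Huffman tree
    std::string compressedData;    // the compressed payload

    // writing: the size as text, terminated by a line feed, then the tree, then the data
    std::ofstream out("archive.bin", std::ios::binary);
    out << serializedTree.size() << '\n';
    out.write(serializedTree.data(), serializedTree.size());
    out.write(compressedData.data(), compressedData.size());
    out.close();

    // reading: the size first, then exactly that many bytes of tree
    std::ifstream in("archive.bin", std::ios::binary);
    std::size_t treeSize;
    in >> treeSize;
    in.ignore(1);                          // skip the line feed after the size
    std::string tree(treeSize, '\0');
    in.read(&tree[0], treeSize);
    // everything remaining in the stream is the compressed payload
}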
Why not write the size and length in the first 8 bytes (4 each) and then the data?
Then something like:
uint32_t compressed_size;
uint32_t data_len;
char * data;

// read into the address of each integer, not into its (uninitialized) value
file.read((char*)&compressed_size, 4);
file.read((char*)&data_len, 4);
data = new char[data_len];
file.read(data, data_len);   // then read the payload itself
Should work.
You could deflate the data for better compression.
myfile << hashdugumu[key].numara;
I have this piece of code. For example, I would like to write to the eighth line. How do I do that in C++? Thanks in advance.
If the line you want to write is exactly the same length (in bytes, not in characters; remember that some encodings, like e.g. UTF-8, are variable length) then it's very easy: just skip over the first seven lines and then write the line.
There is a caveat with this though: input streams and output streams have different stream positions. If you read from a combined input/output file stream, only the read position will change, so if you then just try to write directly you will not write at the same position. To solve this you need to get the read position and set the write position to the same value.
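In code that could look roughly like this (a sketch, assuming a file "data.txt" opened for both reading and writing, and lines of identical length):

#include <fstream>
#include <string>

int main() {
    std::fstream file("data.txt", std::ios::in | std::ios::out);
    std::string line;
    for (int i = 0; i < 7; ++i)
        std::getline(file, line);   // skip the first seven lines
    file.seekp(file.tellg());       // move the write position to where reading stopped
    file << "the new eighth line";  // must be exactly as long as the original line
}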
As an alternative, or if the data you want to write is not the same size as the existing data, then you have to use a temporary "buffer", be it another file or an actual in-memory buffer.
If the file is not big you can use an in-memory buffer, for example using a std::vector for the lines. Read each line into the vector, and then modify the lines (elements in the vector) that you want to modify. Finally reopen the file for writing, truncating it, and then just write each "line" to the file.
There is a slight problem with the above though when it comes to the rewriting of the data: if the file is truncated and then there's an error while you write to it, you can lose data. This can be solved by using a temporary file.
Using a temporary file it's easier not to bother with the in-memory buffer, and instead read from the original file and write directly to the temporary file. Knowing when you should write something else is done by keeping track of the current line number, which is easy if you read one line at a time. In your example you read the first seven lines from the original file and write them to the temporary file; after the seventh line you write your special eighth line while skipping the original eighth line from the original file, and then just continue reading and writing the remaining lines. When done, close the files and then rename the temporary file as the original file.
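A minimal sketch of that temporary-file approach (the file names are made up, and the eighth line is index 7 when counting from zero):

#include <cstdio>
#include <fstream>
#include <string>

int main() {
    std::ifstream in("original.txt");
    std::ofstream out("original.tmp");
    std::string line;
    int lineNo = 0;
    while (std::getline(in, line)) {
        if (lineNo == 7)
            out << "the new eighth line" << '\n';  // replace the original eighth line
        else
            out << line << '\n';                   // copy every other line unchanged
        ++lineNo;
    }
    in.close();
    out.close();
    std::remove("original.txt");
    std::rename("original.tmp", "original.txt");
}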
I need to read a large file of either text, binary, or combination, such as a JPEG file, encrypt it, and write it to a file. At some later time I will need to read the encrypted data, and decrypt it.
The end goal is to verify that the decrypted data matches the original data.
My problem is that with large files greater than 1Meg, I don't want to read and write character by character. I am targeting this code for a phone and I/O will cause too long a delay for the user.
With a pure text file, using fread() and fwrite() converts the data to binary, and the result is different from the original. With a JPEG image, it appears that there is some textual content mixed in with the binary data.
Is there a way to efficiently read in an arbitrary type of file and write it back in the original format?
Or is character by character the only option?
Or am I still out of luck?
After debugging it turned out that the decrypt function had the plain text and cipher text buffers assigned backwards. After swapping the buffer assignments, the decrypted results matched the original data. I originally thought that maybe reading the text as binary and then rewriting as binary would not appear as text, but I was wrong.
Reading the entire file as binary works just fine.
I am trying to run a program to replace certain data within a file. The relevant parts of the file attempting to be replaced look like the following:
1 Information 15e+10
2 Information 2e+16
3 Information 6e+2
And so on.
The files in question can be very large, in the multiple-gigabyte range, and to my understanding that makes buffering the whole file and rewriting it impossible/unreasonable. That is all fine; I just want to replace the values (e.g. the 15e+10).
This all works fine with simple ios::in|ios::out and tellp() if I am replacing the value with a similar-sized value (15e+10 -> 12e+12), or even a smaller one, as I can simply add an extra space which can be ignored down the line (e.g. 15e+10 -> 4e+10). But I run into a problem if I need to replace the value with one whose length is longer than what is already in the file (e.g. 6e+2 -> 16e+10): it will write over the newline character or start writing over the information in the next line.
I have searched on the forums and everyone says you can either overwrite in the file, append to the end of the file, or buffer and recreate the whole file. Is there any way I can achieve my goal of overwriting the value correctly without having to recreate the file?
If not, then how can I have two files open (one input, one output) to do this when the files in question are too large for memory?
Note: I would also like to avoid using boost:: as I need to be able to run this on a system without the boost library.
Open a stream to read from the input (IN) file and a second stream (OUT) to write to a new output (tmp) file.
Read from IN and write to OUT. When you get a value from IN that you want to replace, write the replacement to OUT instead of the value you got from IN.
When parsing is complete replace the first file with the second (tmp) file.
Would this work for you?
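Roughly like this (a sketch only; the file names and the way the value is located are assumptions, reusing the 6e+2 -> 16e+10 example from the question):

#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main() {
    std::ifstream in("data.txt");
    std::ofstream out("data.tmp");
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string id, info, value;
        if (fields >> id >> info >> value && value == "6e+2")
            out << id << ' ' << info << ' ' << "16e+10" << '\n';  // write the replacement
        else
            out << line << '\n';                                  // copy the line unchanged
    }
    in.close();
    out.close();
    std::remove("data.txt");
    std::rename("data.tmp", "data.txt");
}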
Use lseek()/fseek() to "jump" to a given position in a file.
You can use seekp to go to the location and rewrite it with <<
Example:
example.txt ( |?| = 1 byte of data )
|A|B|C|\n|1|2|3|D|E|F|\n|4|5|6|
//Somewhere in the code
fstream file;
file.open("example.txt", ios::in | ios::out);
//Somehow find the character distance and store it into "distance"
file.seekp(distance); //If distance = 0, it will go to "A" like rewind() but easier for me
If the distance is 4, the next character to be overwritten is the 1:
file << "987";
And the file will be
|A|B|C|\n|9|8|7|D|E|F|\n|4|5|6|
BUT the only problem here is when you need to increase/decrease the size:
Increase:
You will overwrite the other characters, so you need to create a temp string to store the rest of the data (or split it into smaller chunks if the data is too large), like:
|A|B|C|\n|9|8|7|D|E|F|\n|4|5|6|
string tempstring;
file.seekg(distance);
getline(file, tempstring, '\0'); //read the rest of the file into tempstring
file.clear();                    //clear the eof flag so the stream can be used again
file.seekp(distance);
file << content << tempstring;   //content is the data
Decrease:
The easiest solution is to write the NUL character \0 into the excess space, like:
|A|B|C|\n|1|\0|\0|D|E|F|\n|4|5|6|
The only side effect is that the file size stays the same as before.
I'm new to C++ and probably have a silly question. I have an ifstream which I'd like to split approximately in half.
The file in question is a sorted csv and I wish to search on the first value of each line of the file.
Eventually the file will be very large so I am trying to avoid having to read every line of the file.
e.g.
If the file contains 7 lines I'd like to split the ifstream to give 1 stream containing the first 3 lines and 1 stream containing the last 4 lines.
First, use the answer to this question to determine the size of your file. Then divide that number by two. Read the input line by line, and write it to the first output stream; check file.tellg() after each call. Once you're past the half-way point, switch the output to the second file.
This wouldn't split the strings evenly between the files, but the total number of characters in these strings should be close enough, and it wouldn't split your file in the middle of a string.
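Something along those lines (a sketch; the three file names are placeholders):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("input.csv", std::ios::binary);
    in.seekg(0, std::ios::end);
    std::streamoff half = static_cast<std::streamoff>(in.tellg()) / 2;  // midpoint in bytes
    in.seekg(0);

    std::ofstream first("first_half.csv"), second("second_half.csv");
    std::ofstream* out = &first;
    std::string line;
    while (std::getline(in, line)) {
        *out << line << '\n';
        // once the read position passes the midpoint, switch to the second file
        if (in.tellg() != std::streampos(-1) &&
            static_cast<std::streamoff>(in.tellg()) >= half)
            out = &second;
    }
}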
Think of it as a relational database with one huge table. In order to find a certain piece of data, you can either do a sequential scan over the entire table, or use an index (which must be usable for the type of query you want to perform).
A typical index for a text file would be a list of offsets inside the file, sorted by the index expression. If the csv file is sorted by a specific column already, then the offsets in the index would be ascending, which is useful to know when building the index.
So basically you have to read the file once anyway, to find out where lines end; this is the index for the sort column. To find a particular element, use a binary search, using the index to find individual elements in the data set.
Depending on the data type, you can extend your index to allow for quick comparison without reading the actual data table. For example, in a word list you could keep the first four letters of the word next to the offset, which allows you to get into the right area quickly and only requires data reads for the last accesses (which you can then optimize to a sequential scan, as filesystems handle that a lot better).
The same technique can be applied to the other columns as well; the offsets stored in the index would no longer be ascending in file order, of course.
Since it is CSV data, a special case also applies: if the only index you have is in the same order as the file data itself and the end of a record can be determined easily (that is, either you have a fixed record length, or there is a clear record separator, such as an EOL character), then building the actual index can be omitted and the offsets computed or guessed (for fixed-length records, the offset is simply the record length times the record number; for separated records you can jump into the middle of a record and scan for the next terminator; be aware that there are nasty corner cases with binary search here). This does however mean that you will always be reading data pages here, which is less efficient than just reading the index.
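As an illustration of the index idea (a sketch; it assumes the sort column is an integer in the first CSV field, and the file name is made up):

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::ifstream in("sorted.csv", std::ios::binary);
    std::vector<std::pair<long, std::streamoff>> index;    // (key, offset of line start)
    std::streamoff pos = 0;
    std::string line;
    while (std::getline(in, line)) {
        index.emplace_back(std::atol(line.c_str()), pos);  // key = first column
        pos = in.tellg();
    }

    // later: binary-search the in-memory index, then seek straight to the record
    long wanted = 42;
    auto it = std::lower_bound(index.begin(), index.end(),
                               std::make_pair(wanted, std::streamoff(0)));
    if (it != index.end()) {
        in.clear();
        in.seekg(it->second);
        std::getline(in, line);    // the first record whose key is >= wanted
    }
}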
I have a big text file with more than 200,000 lines, and I need to read just a few of them. For instance: lines 10,000 to 20,000.
Important: I don't want to open and search the full file to extract these lines, because of performance issues.
Is this possible?
If the lines are fixed length, then it would be possible to seek to a specific byte position and load just the lines you want. If lines are variable length, the only way to find the lines you're looking for is to parse the file and count the number of end-of-line markers. If the file changes infrequently, you might be able to get sufficient performance by performing this parsing once and then keeping an index of the byte positions of each line to speed future accesses (perhaps writing that index to disk so it doesn't need to be done every time your program is run).
You will have to search through the file to count the newlines, unless you know that all lines are the same length (in which case you could seek to the offset = line_number * line_size_in_bytes, where line_number counts from zero and line_size_in_bytes includes all characters in the line).
If the lines are variable / unknown length then while reading through it once you could index the beginning offset of each line so that subsequent reads could seek to the start of a given line.
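For the fixed-length case, the seek described above might look like this (line_size_in_bytes is a made-up record length that must include the end-of-line character):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("big.txt", std::ios::binary);
    const std::streamoff line_size_in_bytes = 80;   // assumed fixed record length
    const long first = 10000, last = 20000;         // 0-based line numbers
    in.seekg(first * line_size_in_bytes);           // jump straight to line 10,000
    std::string line;
    for (long i = first; i <= last && std::getline(in, line); ++i) {
        // process lines 10,000 .. 20,000 only
    }
}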
If these lines are all the same length you could compute an offset for a given line and read just those bytes.
If the lines are varying length then you really have to read the entire file to count how many lines there are. Line terminating characters are just arbitrary bytes in the file.
If the line are fixed length then you just compute the offset, no problem.
If they're not (i.e. a regular CSV file) then you'll need to go through the file, either to build an index or to just read the lines you need. To make the file reading a little faster a good idea would be to use memory mapped files (see the implementation that's part of the Boost iostreams: http://www.boost.org/doc/libs/1_39_0/libs/iostreams/doc/classes/mapped_file.html).
As others noted, if you do not have lines of fixed width, it is impossible to do this without building an index. However, if you are in control of the format of the file, you can get roughly O(log(size)) instead of O(size) performance in finding the start line, if you manage to store the number of the line itself on each line, i.e. to have the file contents look something like this:
1: val1, val2, val3
2: val4
3: val5, val6
4: val7, val8, val9, val10
With this format of the file, you can quickly find the needed line by binary search: start by seeking into the middle of the file. Read till the next newline. Then read the line and parse the number. If the number is bigger than the target, you need to repeat the algorithm on the first half of the file; if it is smaller than the target line number, you need to repeat it on the second half of the file.
You'd need to be careful about the corner cases (e.g.: your "beginning" of the range and "end" of the range are on the same line, etc.), but for me this approach worked excellently in the past for parsing the logfiles which had the date in it (and I needed to find the lines that are between the certain timestamps).
Of course, this still does not beat the performance of the explicitly built index or the fixed-size records.
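A rough sketch of that binary search (the corner cases mentioned above are deliberately left out; the "<number>: ..." layout follows the example, and the function returns only an approximate byte offset to scan forward from):

#include <cstdlib>
#include <fstream>
#include <string>

// returns a byte offset close to the first line whose leading number is >= target
std::streamoff findNear(std::ifstream& in, std::streamoff fileSize, long target) {
    std::streamoff lo = 0, hi = fileSize;
    while (lo < hi) {
        std::streamoff mid = lo + (hi - lo) / 2;
        in.clear();
        in.seekg(mid);
        std::string partial, line;
        std::getline(in, partial);              // skip the partial line we landed in
        if (!std::getline(in, line)) { hi = mid; continue; }
        long number = std::atol(line.c_str());  // the leading "<number>:"
        if (number < target) lo = mid + 1;
        else hi = mid;
    }
    return lo;   // seek here, skip to the next newline, and read lines from there
}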