Differentiating Binary Header and Encoded Binary In Huffman - c++

I'm creating a basic Huffman encoding/decoding tool. I've found this question, which helped me implement a header that stores my generated Huffman tree in binary form. I can also use the tree to encode/decode text into a binary file. So the program actually works, but I still have a problem.
Currently the header and the encoded binary are in separate files, because I cannot figure out a way to put them into the same file while keeping it easy to read the header back at the start of the decoding procedure. Hard-coding some "end of header" character seems like a rather hacky way to do this, not to mention that some initial bits of the terminating character might be read in as part of the encoded tree in the header, corrupting the entire tree.
Although my program works with separate header and body files, I'd like to merge them. Any ideas on how I can do this?

You don't need to do anything special to merge your header (the tree) and the content (the Huffman-encoded text).
If you look at the answer in the question you posted, here, and examine the algorithm for decoding (the ReadNode(BitReader reader) pseudo-code function there), you can see that the algorithm stops reading the tree simply because it has read all of it - not because it reaches an EOF character or anything of the like.
It doesn't need to search for an EOF because it recursively calls itself only for nodes that have children (0-bits). Once the algorithm has reached all the leaves, there is no more recursive calling, so the reader is positioned exactly in the right place for you to start reading the content (just after the whole header, with no additional "end-of-header" indication needed).
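For illustration, here is a minimal C++ sketch of that decoding flow over a single merged file. The read_bit()/read_byte() members of BitReader are assumed helpers, not something from the linked answer; the bit convention follows it, though: 0 means an internal node whose two children follow, 1 means a leaf followed by the symbol byte.

#include <cstdint>
#include <memory>

struct Node {
    uint8_t symbol = 0;                    // meaningful only for leaves
    std::unique_ptr<Node> left, right;     // both null for leaves
};

// Reads the tree that sits at the very start of the file. When it
// returns, 'reader' is positioned on the first bit of the content.
template <typename BitReader>
std::unique_ptr<Node> read_tree(BitReader& reader)
{
    auto node = std::make_unique<Node>();
    if (reader.read_bit() == 1) {          // leaf: the next 8 bits are the symbol
        node->symbol = reader.read_byte();
    } else {                               // internal node: recurse for both children
        node->left  = read_tree(reader);
        node->right = read_tree(reader);
    }
    return node;
}

// Usage: auto root = read_tree(reader); then keep reading from 'reader'
// to decode the content; no end-of-header marker is involved.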

Related

What's the best way to store binary

I've recently implemented Huffman compression in C++; if I were to store the results as binary it would take up a lot more space, as each 1 and 0 is a character. Alternatively I was thinking maybe I could break the binary into sections of 8 and put characters in the text file, but that would be kinda annoying (so hopefully that can be avoided). My question here is what is the best way to store binary in a text file in terms of character efficiency?
[To recap the comments...]
My question here is what is the best way to store binary in a text file in terms of character efficiency?
If you can store the data as-is, then do so (in other words, do not use any encoding; simply save the raw bytes).
If you need to store the data within a text file (for instance as a paragraph or as a quoted string), then you have many ways of doing so. For instance, base64 is a very common one, but there are many others.
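If you go the raw-bytes route, the core of it is packing the '0'/'1' characters eight at a time. A small sketch (the zero-padding of the last byte is a simplification; a real format would also store the bit count so the padding can be stripped when reading):

#include <fstream>
#include <string>
#include <vector>

std::vector<unsigned char> pack_bits(const std::string& bits)
{
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            bytes[i / 8] |= static_cast<unsigned char>(1u << (7 - i % 8));
    return bytes;
}

int main()
{
    std::string bits = "0100100001101001";          // 16 bits -> the 2 bytes "Hi"
    auto packed = pack_bits(bits);
    std::ofstream out("packed.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(packed.data()),
              static_cast<std::streamsize>(packed.size()));
}

This stores 8 bits per byte instead of 1, an 8x saving over the ASCII '0'/'1' file.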

write and load decision tree to file C++ [duplicate]

This question already has answers here: How to Serialize Binary Tree
I have a decision tree node class defined in following way:
class dt_node
{
public:
    dt_node* child[2];  // two child nodes
    int feature;
    double value;       // feature and value this node splits on
    bool leaf;
    double pred;        // what this node predicts if a leaf node
};
Is there a way I can write this to a file and reconstruct the tree from the file if needed?
You can do it anyhow you want...
And the real answer: it really is up to you. If I were you and had to save this kind of object in a .txt file, I would just make up some way to save the structure, for example as 0*0*0.0*0*0.0, with the first 0 representing the number of child nodes, the second 0 representing the feature value and so on, and with the * character being a separator between values. Spaces could work just as well, but I simply don't like them as separators in my files... The text file would then have some other character (for example |) between each serialized object. An example would look like 3*22*31.11*1*1.0|2*2*1.0*0*33.3.
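A quick sketch of building one such record; the field layout (children*feature*value*leaf*pred) and the function name are just made up to match the example above:

#include <sstream>
#include <string>

std::string node_to_string(int children, int feature, double value,
                           bool leaf, double pred)
{
    std::ostringstream os;
    os << children << '*' << feature << '*' << value << '*'
       << leaf << '*' << pred;
    return os.str();
}

// node_to_string(3, 22, 31.11, true, 1.0) yields "3*22*31.11*1*1";
// serialized nodes would then be joined with the '|' separator.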
Obviously I could've misinterpreted your question. If you're asking whether there is a way of saving this particular code and executing it by opening the file in a program without the dt_node class, then I, unfortunately, feel my knowledge is not sufficient to answer.
Hope it helps anyhow.
If you would like to write the format yourself, I'd just write each node's parameters into the file (the two doubles, the bool and the int) along with its level, starting from the root node and then proceeding recursively through the tree. As far as I can see, the bool you have controls whether the node has any children; this will help during the reading process.
Reading the file will be a bit more complex than writing it. For each node you read, recursively read the following nodes until you reach one whose level is less than or equal to the current node's. It sounds complex, but it really isn't.
Of course you shouldn't write the dt_node* pointers to the file; they contain no useful information, since upon reading the file you will have to recreate the full tree anyway.
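Here is a minimal sketch of that hand-rolled idea, using the dt_node class from the question. It writes in pre-order and leans on the existing leaf flag to decide whether two children follow, which makes stored levels unnecessary. Note the raw-byte writes are not machine-portable (endianness, sizeof(bool)), the same caveat the Boost answer below raises for its binary archives:

#include <fstream>

void save(const dt_node* n, std::ofstream& out)
{
    out.write(reinterpret_cast<const char*>(&n->feature), sizeof n->feature);
    out.write(reinterpret_cast<const char*>(&n->value),   sizeof n->value);
    out.write(reinterpret_cast<const char*>(&n->leaf),    sizeof n->leaf);
    out.write(reinterpret_cast<const char*>(&n->pred),    sizeof n->pred);
    if (!n->leaf) {                     // internal node: both children follow
        save(n->child[0], out);
        save(n->child[1], out);
    }
}

dt_node* load(std::ifstream& in)
{
    dt_node* n = new dt_node;
    in.read(reinterpret_cast<char*>(&n->feature), sizeof n->feature);
    in.read(reinterpret_cast<char*>(&n->value),   sizeof n->value);
    in.read(reinterpret_cast<char*>(&n->leaf),    sizeof n->leaf);
    in.read(reinterpret_cast<char*>(&n->pred),    sizeof n->pred);
    n->child[0] = n->child[1] = nullptr;
    if (!n->leaf) {                     // recurse in the same order save() used
        n->child[0] = load(in);
        n->child[1] = load(in);
    }
    return n;
}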
Adding Boost to your project can be a little bit of a pain, but there are quite a few libraries in it, including maths and graphics, so it may well be worth the effort.
The Boost serialisation docs are here, with a tutorial here.
The serialisation library lets you add as little as one function to your class, which then defines how to save and load the state of that class. How that data is actually saved is handled by the Boost library; for example, you can have it save as binary, XML or text.
The only thing that you need to watch out for is that the binary serialisation is not machine transferable.
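A minimal sketch of what that looks like for the dt_node class from the question, with the single serialize() member being the "one function" mentioned above. Boost follows the child pointers itself (null pointers included) and rebuilds the tree on load:

#include <fstream>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/access.hpp>

class dt_node
{
public:
    dt_node* child[2] = { nullptr, nullptr };
    int feature = 0;
    double value = 0.0;
    bool leaf = true;
    double pred = 0.0;

private:
    friend class boost::serialization::access;

    // One function handles both saving and loading; the archive type
    // picks the direction, and the pointer array is traversed
    // automatically, so the whole tree is written in one shot.
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/)
    {
        ar & feature & value & leaf & pred & child;
    }
};

int main()
{
    dt_node root;
    root.leaf = false;
    root.child[0] = new dt_node;        // toy tree for the demo
    root.child[1] = new dt_node;

    {
        std::ofstream ofs("tree.bin", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        const dt_node& to_save = root;  // save via const ref to keep
        oa << to_save;                  // Boost's object tracking happy
    }

    dt_node restored;
    std::ifstream ifs("tree.bin", std::ios::binary);
    boost::archive::binary_iarchive ia(ifs);
    ia >> restored;
}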

DEFLATE method reasoning

Why does LZ77 DEFLATE use Huffman encoding for its second pass instead of LZW? Is there something about their combination that is optimal? If so, what is the nature of the output of LZ77 that makes it more suitable for Huffman compression than LZW or some other method entirely?
LZW tries to take advantage of repeated strings, just like the first "stage", as you call it, of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would make an almost completely ineffective entropy coder for that information.
Mark Adler could best answer this question.
The details of how LZ77 and Huffman work together need some closer examination. Once the raw data has been turned into a string of characters and special length-distance pairs, these elements must be represented with Huffman codes.
Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone." At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial-tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
Length codes or distance codes may actually be a code that represents a base value, followed by extra bits that form an integer to be added to the base value.
...
Read the whole deal here.
Long story short: LZ77 provides duplicate elimination; Huffman coding provides bit reduction. It's also on the wiki.
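To make the "base value plus extra bits" part concrete, here is a small sketch for the length codes. The tables are truncated to the first ten symbols (257-266); the full versions are in RFC 1951, section 3.2.5. read_bits is an assumed helper that pulls n raw bits from the stream after the Huffman symbol has been decoded:

#include <cstdint>

static const int length_base[]  = { 3, 4, 5, 6, 7, 8, 9, 10, 11, 13 };
static const int length_extra[] = { 0, 0, 0, 0, 0, 0, 0, 0,  1,  1 };

template <typename BitSource>
int decode_length(int symbol, BitSource& read_bits)
{
    int idx = symbol - 257;    // length symbols start at 257 in the merged alphabet
    return length_base[idx] + static_cast<int>(read_bits(length_extra[idx]));
}

// e.g. symbol 265 has base length 11 and one extra bit;
// an extra bit of 1 decodes to the match length 11 + 1 = 12.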

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function, which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream ouput("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine; I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well too. The big difference is that (as with a text file) you have to read through the data to find the next item. With a prefixed size, you can look at the size and skip directly to the next item if you don't need the current one. Of course, if you're using something to mark the end of a field, you also need to ensure it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful when read in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers, which might otherwise be translated from a newline character to (for example) a carriage-return/line-feed pair, won't be. Depending on the OS you're using, it might not even mean that much; for example, on Linux there's normally no difference between binary and text mode at all.
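If you do go with the prefixed-size layout discussed above, a small sketch of it: write the length first, so a reader can skip a field without scanning through it.

#include <cstdint>
#include <fstream>
#include <string>

void write_prefixed(std::ofstream& out, const std::string& s)
{
    std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(s.data(), static_cast<std::streamsize>(len));
}

std::string read_prefixed(std::ifstream& in)
{
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    std::string s(len, '\0');
    in.read(&s[0], static_cast<std::streamsize>(len));
    return s;
}

// Skipping a field: read the length, then in.seekg(len, std::ios::cur).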
Well, there are no rules broken and you'll get away with that just fine, except that you may miss some of the precision of reading binary from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can normally obtain afterwards with gcount()... but using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get that information from the size of the string you passed to std::getline, but then the stream no longer encapsulates the number of bytes you consumed in the last unformatted operation.
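A small sketch of that difference, reusing the "foo.txt" file from the question: the member istream::getline() updates gcount(), while the free std::getline() for std::string leaves gcount() untouched, so there you fall back on the string's size instead.

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream input("foo.txt", std::ios_base::binary);

    char buf[64];
    input.getline(buf, sizeof buf, '\0');   // member unformatted input
    std::cout << input.gcount()             // 12: 11 chars plus the '\0' delimiter
              << " characters extracted\n";

    input.clear();
    input.seekg(0);

    std::string bar;
    std::getline(input, bar, '\0');         // free function: gcount() unaffected
    std::cout << bar.size()                 // 11: delimiter consumed but not stored
              << " characters stored\n";
}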

Parse an XML in standard C/C++ without additional libraries

I have an XML (assuming it is valid) and I must parse it and store it in a tree.
What is the best approach to parse it, without using other libraries, just basic manipulation of strings?
Keep in mind that I don't have to validate it, just parse and memorize it into a tree.
The basic structure of XML is quite simple:
<tagname [attribute[="value"] ...]>content</tagname>
where the content may contain both normal text and more XML structures, or the special form
<tagname [attribute[="value"] ...]/>
which is equivalent to
<tagname [attribute[="value"] ...]></tagname>
that is, empty content.
So if you don't need to interpret a DTD or do other fancy things, you can do the following:
1. Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.
2. Now follows the tag name, up to the first whitespace, / or > character. Store that.
3. If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML and can exit with an error.
4. If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.
5. Otherwise what follows is an attribute. Parse that, store the result, and continue at step 3.
6. Read the content until you find a < character.
7. If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if yes, return the result. Otherwise, throw an error.
8. If you get here, you've found the beginning of a nested XML element. Parse that with this algorithm, and then continue at step 6.
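A minimal sketch of those steps in C++. It deliberately handles only nested elements and text content (no attributes, entities, comments or DTDs), and all the names (Node, parse_element, ...) are invented for illustration:

#include <cctype>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::string text;                          // concatenated character data
    std::vector<std::unique_ptr<Node>> children;
};

static void skip_ws(const std::string& s, std::size_t& i)
{
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
}

static std::unique_ptr<Node> parse_element(const std::string& s, std::size_t& i)
{
    if (s[i] != '<') throw std::runtime_error("expected '<'");   // step 1
    ++i;

    auto node = std::make_unique<Node>();
    while (i < s.size() && s[i] != '>' && s[i] != '/' &&         // step 2
           !std::isspace(static_cast<unsigned char>(s[i])))
        node->name += s[i++];

    skip_ws(s, i);                 // attributes (step 5) would be parsed here
    if (s[i] == '/') {             // step 3: <tag/> is an empty element
        i += 2;                    // skip "/>"
        return node;
    }
    ++i;                           // step 4: skip '>'

    for (;;) {                     // steps 6-8: content
        if (i >= s.size()) throw std::runtime_error("unexpected end of input");
        if (s[i] == '<') {
            if (s[i + 1] == '/') { // step 7: end tag, name must match
                i += 2;
                std::string closing;
                while (i < s.size() && s[i] != '>') closing += s[i++];
                ++i;               // skip '>'
                if (closing != node->name)
                    throw std::runtime_error("mismatched end tag");
                return node;
            }
            node->children.push_back(parse_element(s, i));   // step 8: recurse
        } else {
            node->text += s[i++];  // step 6: plain character data
        }
    }
}

// Usage: std::size_t pos = 0; skip_ws(xml, pos);
//        auto tree = parse_element(xml, pos);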
Reading XML looks simple but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it and an incomplete version of this is sitting somewhere on my disk. Even if you don't need to validate your XML structure:
- whether you validate or not, you need to deal with entity references like &lt; and the variety of character entity references like &#65; and &#x41;
- the plain body of an XML document is relatively simple, but the header is a major pain to deal with, in particular the DTD: there are two versions thereof which are slightly different, and you probably need to process the inline DTD
- even the body isn't entirely trivial because of those annoying character data (CDATA) segments
- even without validation you may need to support external entity references
- the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
- note that XML is defined in terms of Unicode, and proper handling of this isn't entirely trivial either: using char or wchar_t just doesn't cut it.
The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed for the nice feature of easily stopping and continuing the parsing at the choice of the iterator's user. Unfortunately, I didn't get it to fly when trying to cope with the various entity references. It would parse simple XML files nicely and fast, but there were some quirks in the specification I just didn't get right.
What worked best for me was creating a simple recursive descent parser combined with a suitable stack of buffers to deal somewhat transparently with entity references. However, to finish it completely I would still need to deal with some encoding issues, and in the end I just had higher-priority projects to work on (in my spare time, that is).
In summary: it can be done, obviously, as others did. It is probably a somewhat pointless exercise unless you have a really bright idea which makes your implementation uniquely better suited than the alternatives.
The best and only approach is to re-implement such a library from scratch without using any other libraries...
You're welcome to use an existing library like pugixml, for example. Its installation is as simple as adding the files to your project and starting to use it. It's lightweight compared to validating parsers such as Xerces.