Trying to generate a uniquely decodable code and decode it - compression

I'm trying to encode arbitrary symbols into a bit string, and I don't really understand how to generate the codes or how to decode a bit string that contains them.
I want to compress arbitrary symbols. I don't know whether a uniquely decodable code is what I'm looking for, or maybe an arithmetic code or a canonical Huffman code?
I just need a list of bit strings, ordered from the most frequent symbol to the least frequent, for any size of symbol table.

1. Encoding:
   1.1 Generate a frequency table per symbol
   1.2 Generate a probability tree based on this frequency table
   1.3 Generate a code table such that the most frequent symbol gets the shortest bit string
   1.4 Output: [Frequency Table + Sequence of Bit Strings (put together end to end)]
   The important thing to note is that the sequence of these bit strings can later be split apart again using nothing but the bit strings themselves, e.g. [10010001] => {100, 1000, 1} (only an example)
2. Decoding:
   2.1 Obtain the frequency table and the bit string sequence
   2.2 Generate the probability tree (same as 1.2)
   2.3 Generate the code table (same as 1.3)
   2.4 Recreate the data:
       2.4.1 Parse the bit string
       2.4.2 Match it against the code table
       2.4.3 Output the matched symbol
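The steps above can be sketched in Python roughly as follows. This is a minimal illustration using a Huffman tree as the "probability tree", not a definitive implementation; the function names (build_code_table, encode, decode) are made up for the example, and it assumes the symbols can be sorted so that encoder and decoder build exactly the same tree from the same frequency table.

```python
import heapq
from collections import Counter

def build_code_table(freqs):
    """Steps 1.2/1.3: build a prefix-free code table from {symbol: count}."""
    # Heap entries are (weight, tie_breaker, tree). A leaf is a 1-tuple (symbol,),
    # an internal node is a 2-tuple (left, right). The unique tie_breaker keeps
    # the heap from ever comparing trees and makes the result deterministic.
    heap = [(w, i, (sym,)) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # a lone symbol still needs one bit
        return {heap[0][2][0]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # two least frequent subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    table = {}
    def walk(tree, prefix):
        if len(tree) == 2:                   # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf
            table[tree[0]] = prefix
    walk(heap[0][2], "")
    return table

def encode(data):
    """Steps 1.1 to 1.4: return (frequency table, concatenated bit string)."""
    freqs = Counter(data)
    table = build_code_table(freqs)
    return freqs, "".join(table[s] for s in data)

def decode(freqs, bits):
    """Steps 2.1 to 2.4: rebuild the code table and split the bit string."""
    inverse = {code: sym for sym, code in build_code_table(freqs).items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:               # prefix-free: first match is the symbol
            out.append(inverse[current])
            current = ""
    return out
```

Because no code is a prefix of another, the decoder never needs separators: it simply extends the current bit group until it matches an entry in the table.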

Related

How to save a Huffman table in a file In a way that use the least storage?

This is my first question on Stack Overflow. It's long, but I have explained it in detail and I think it's understandable.
I'm writing a Huffman coder in C++ and have saved the characters and their codes in a table like this:
Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE
Table: (Made by huffman tree)
[image: Table]
Now I want to save this table to a file in the most compact way.
I can't save it like this: A1B001C010D001E000
When it is converted to bits: 01000001101000010001010000110100100010000101000101000
Because I can't decode this: there is no way to tell where a code ends and the next character begins.
If I save the table in the normal way, every character uses 8 bits to store its code,
while my characters have 1-bit or 3-bit codes (in this case).
That wastes a lot of storage.
My idea is to add a separator character and assign a code to it.
If we add a separator character, build the Huffman tree and write out the codes, we get a table like this:
[image: table2]
Now, we can write codes in this way.
A0SepB110SepC100SepD1111sepE1110sep.
binary= 0100000101010100001011010101000011100101010001001111101010001011110101
I decode it in this way:
sep = 101.
Read 8 bit : 01000001 -> it's A.
rest = 01010100001011010101000011100101010001001111101010001011110101.
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1), Read 1 bit : 0 (like sep2), Read 1 bit : 1 (like sep3(end))
Sep was found, so A = everything before sep = 0;
rest = 0100001011010101000011100101010001001111101010001011110101.
Read 8 bit : 01000010 -> it's B.
rest = 11010101000011100101010001001111101010001011110101.
Read 1 bit : 1 (like sep1)- Read 1 bit : 1 (unlike sep2)
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1) - Read 1 bit : 0 (like sep2) - Read 1 bit :1 (like sep3(end))
Sep was found, so B = everything before sep = 110;
And so on ...
This way still uses some storage for the separators (number of characters * separator size).
My question: Is there a way to save the first table in a file using less storage?
For example like this: A1B001C010D001E000.
Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
You can store the lengths of the codes (as Mark said) as a 256-byte header at the start of your compressed data. Each byte stores the length of one code. Because you're working with bytes, which have 256 possible values, the Huffman tree can be at most 255 levels deep (number of possible values - 1), so 8 bits are enough to store each code length.
The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
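As a rough sketch of how the lengths alone are enough, the following Python assigns canonical Huffman codes from a table of code lengths and packs the lengths into the 256-byte header described above. The function names are hypothetical, and it assumes symbols are byte values 0..255, with length 0 meaning "symbol not used".

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from {symbol: code_length}."""
    codes = {}
    code = 0
    prev_len = 0
    # Canonical order: shorter codes first, ties broken by symbol value.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if length == 0:
            continue                      # unused symbol, gets no code
        code <<= (length - prev_len)      # pad with zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

def lengths_to_header(lengths):
    """One byte per possible byte value; byte i holds the code length of value i."""
    return bytes(lengths.get(b, 0) for b in range(256))

def header_to_lengths(header):
    """Inverse of lengths_to_header, dropping unused symbols."""
    return {b: n for b, n in enumerate(header) if n}
```

For the text in the question (A with a 1-bit code, B to E with 3-bit codes), this gives A = 0, B = 100, C = 101, D = 110, E = 111. The exact bit patterns differ from the original tree's, but the lengths, and so the compressed size, are the same, and the decoder can rebuild identical codes from the header.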
However, if compressing English text, there is a better way to store the table.
Store the shape of the tree, using 0s for internal nodes and 1s for leaves, written in pre-order (node, then left subtree, then right subtree). Then, after the shape, store the values of the leaves in the same order.
The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:
          *
         / \
        *   A
       / \
      *   *
     / \ / \
    E  D C  B
So you would store the shape of the tree as such:
000110111EDCBA
Storing the Huffman codes this way is better when you are compressing English text because the tree shape costs 10n - 1 bits, where n is the number of unique characters in the data you are trying to compress: 2n - 1 bits for the shape (a tree with n leaves has 2n - 1 nodes) plus 8n bits for the leaf values. Storing the code lengths costs a flat 2048 bits (256 x 8). Therefore, for fewer than 205 unique characters, storing the shape of the tree is more efficient, and because the average English text doesn't use anywhere near all 256 possible byte values, you're usually better off storing the tree shape.
If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.
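A small Python sketch of the tree-shape format, under the assumption that the tree is held as nested 2-tuples with single-character strings as leaves (this representation and the function names are illustrative only):

```python
def serialize(tree):
    """Pre-order walk: '0' for an internal node, '1' for a leaf, then the leaf values."""
    shape, leaves = [], []
    def walk(node):
        if isinstance(node, tuple):       # internal node: (left, right)
            shape.append("0")
            walk(node[0])
            walk(node[1])
        else:                             # leaf: a single character
            shape.append("1")
            leaves.append(node)
    walk(tree)
    return "".join(shape), "".join(leaves)

def deserialize(shape, leaves):
    """Rebuild the nested-tuple tree from the shape bits and the leaf values."""
    shape_it, leaf_it = iter(shape), iter(leaves)
    def build():
        if next(shape_it) == "0":
            left = build()
            right = build()
            return (left, right)
        return next(leaf_it)
    return build()

def tree_header_bits(n):
    """Header cost for n unique byte symbols: (2n - 1) shape bits + 8n value bits."""
    return (2 * n - 1) + 8 * n
```

For the tree drawn above, serialize(((("E", "D"), ("C", "B")), "A")) returns ("000110111", "EDCBA"), matching the stored form 000110111EDCBA. tree_header_bits(204) is 2039 and tree_header_bits(205) is 2049, which is where the 205-character break-even against the 2048-bit length table comes from.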

Store 32 bit value as C string in most efficient form

I am trying to find the most efficient way to encode 32 bit hashed string values into text strings for transmission/logging in low bandwidth environments. Complex compression can't be used because the hash values need to be contained in human readable text strings when logged and sent between client and host.
Consider the following contrived examples:
given the key/value map
table[0xFE12ABCD] = "models/texture/red.bmp";
table[0x3EF088AD] = "textures/diagnostics/pink.jpg";
and the string formats:
"Loaded asset (0x%08x)"
"Replaced (0x%08x) with (0x%08x)"
they could be printed as:
"Loaded asset models/texture/red.bmp"
"Replaced models/texture/red.bmp with textures/diagnostics/pink.jpg"
Or if the key/value map is known by the client and server:
"Loaded asset (0xFE12ABCD)"
"Replaced (0xFE12ABCD) with (0x3EF088AD)"
The receiver can then scan for the (0xNNNNNNNN) pattern and expand it locally.
This is what I am doing right now but I would like to find a way to represent the 32 bit value more efficiently. A simple step would be to use a better identifying token:
"Loaded asset $FE12ABCD"
"Replaced $1000DEEE with $3EF088AD"
Which already reduces the length of each token - $ is not used anywhere else so it is reasonable.
However, what other options are there to make that 32 bit value even smaller? I can't use an index - it has to be a full 32 bit value because in some cases the generator of the string has the hash and sometimes it has a string it will hash immediately.
A common solution is to use Base-85 coding. You can code four bytes into five Base-85 digits, since 85^5 > 2^32. Pick 85 printable characters and assign them to the digit values 0..84. Then do base conversion to go either way. Since there are 94 printable characters in ASCII, it is usually easy to find 85 that are "safe" under whatever constrains your strings to be "readable".
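A minimal Python sketch of that base conversion for a single 32-bit value. The alphabet here is just one hypothetical choice of 85 printable characters that leaves out '$', quotes and whitespace so that '$' can stay the token marker; it is not a standard Base-85 alphabet, and the function names are made up for the example.

```python
import string

# Hypothetical alphabet: 10 digits + 52 letters + 23 punctuation characters = 85.
ALPHABET = (string.digits + string.ascii_uppercase + string.ascii_lowercase
            + "!#%&()*+-.:;<=>?@^_{|}~")
assert len(ALPHABET) == 85 and len(set(ALPHABET)) == 85

def encode32(value):
    """Encode a 32-bit unsigned value as exactly five base-85 digits."""
    if not 0 <= value < 2 ** 32:
        raise ValueError("value does not fit in 32 bits")
    digits = []
    for _ in range(5):                    # 85**5 > 2**32, so five digits suffice
        value, remainder = divmod(value, 85)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))      # most significant digit first

def decode32(text):
    """Decode five base-85 digits back to the integer value."""
    value = 0
    for ch in text:
        value = value * 85 + ALPHABET.index(ch)
    return value
```

With a fixed width of five characters, "Loaded asset $" + encode32(0xFE12ABCD) needs six characters per token instead of the nine used by $FE12ABCD, and the receiver can still scan for '$' followed by five characters.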

Clojure dictionary of words

I want a dictionary of English words available, so that I can pick random English words. I have a dictionary text file that I downloaded from the internet which has almost 1 million words. What's the best way to go about using this list in Clojure, given that most of the time I'll only need 1 randomly selected word?
Edit:
To answer the comments, this is for some tests which I may turn into load tests which is why I want a decent number of random words and I guess access speed is the most important thing. I do not want to use a database for this. I originally thought of a dictionary just because that's the first thing that popped into my mind but I think a random sequence of letters and numbers would be good enough, perhaps I will just use a UUID as a string.
Read all the words into a vector and then call rand-nth, e.g.
(rand-nth all-words)
rand-nth uses the nth function on the underlying data structure, and Clojure vectors have O(log32 N) performance (logarithm base 32) for index-based retrieval.
Edit: This is assuming that it is for a test environment as you described in your question. A more memory efficient method would be to use RandomAccessFile and seek to a random location in the file of words, read until you find the first word delimiter (e.g. comma, EOL) and then read the following bytes until the next delimiter which will give you a random word.
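The seek-based idea can be sketched as follows, in Python rather than Clojure or Java, so treat it as an outline of the algorithm only. It assumes a non-empty file with one word per line; the function name and the wrap-around at the end of the file are choices made for this example.

```python
import os
import random

def random_word(path, rng=random):
    """Pick a word by seeking to a random byte offset in a one-word-per-line file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))   # land somewhere, probably mid-word
        f.readline()                  # skip ahead to the next delimiter (EOL)
        line = f.readline()           # the following complete word
        if not line:                  # we were in the last line: wrap to the first word
            f.seek(0)
            line = f.readline()
        return line.decode("utf-8").strip()
```

Note that this does not pick words uniformly: a word that follows a long word is more likely to be chosen, because more byte offsets lead to it. For test data that is usually acceptable.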

Where can I get sample Reed-Solomon encoded data?

I want to write a Reed-Solomon decoder and experiment with performance improvements. Where could I find sample data with appended Reed-Solomon parity bytes?
I am aware that Reed-Solomon is used in all kinds of 1D and 2D bar codes, but I would like to have the raw data (an array of bytes) with clear separation of payload and parity bytes.
Any help is appreciated.
Basically, a Reed-Solomon code is composed of characters with a value between 0 and 2^m - 1, where m is the exponent of the Galois field GF(2^m) used to generate the RS code. For example, in GF(2^8) (2^8 = 256), you will get an RS code composed of characters between 0 and 255 (compatible with ASCII, UTF-8 and the usual binary encodings). In GF(2^16), you will get characters between 0 and 65535 (compatible with UTF-16, or with UTF-8 if you encode 2 characters as one).
Other than the range of values of each character of a RS code, all the rest can be basically considered random from an external POV (it's not if you have the generator polynomial and Galois field, but for your purpose of getting a sample, you can assume a random distribution of values in the valid range).
If you want to generate real samples of an RS code with the corresponding data block, you can use the Python library pyFileFixity (disclaimer: I am the author). By default, each ecc block is separated by an md5 digest, so that you can clearly separate them. The original data is not stored, but you can easily change that by modifying the script structural_adaptive_ecc.py or header_ecc.py (the latter will be easier to modify) to also store the original data (it's just a file.write() to edit). If Python is not your thing, you can probably find a Reed-Solomon library for your language of choice, and make a slight modification to print or save to a file the original data alongside the ecc blocks.
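If you only need a few sample codewords with a clear payload/parity split, a small self-contained encoder is another option. The sketch below is not taken from pyFileFixity; it is a plain systematic Reed-Solomon encoder over GF(2^8), and its parameters are assumptions chosen for the example: primitive polynomial 0x11d, generator roots alpha^0 .. alpha^(nsym-1), and codewords of at most 255 bytes. A decoder written for different parameters will not accept its output.

```python
# Log/antilog tables for GF(2^8) with primitive polynomial 0x11d (an assumption).
GF_EXP = [0] * 512
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]          # duplicate so gf_mul needs no modulo

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

def poly_mul(p, q):
    """Multiply two polynomials (highest-degree coefficient first)."""
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] ^= gf_mul(a, b)
    return r

def generator_poly(nsym):
    """g(x) = (x - alpha^0)(x - alpha^1)...(x - alpha^(nsym-1))."""
    g = [1]
    for i in range(nsym):
        g = poly_mul(g, [1, GF_EXP[i]])
    return g

def rs_encode(payload, nsym):
    """Return payload + nsym parity bytes (remainder of payload * x^nsym / g)."""
    gen = generator_poly(nsym)
    rem = list(payload) + [0] * nsym
    for i in range(len(payload)):
        coef = rem[i]
        if coef:
            for j in range(1, len(gen)):
                rem[i + j] ^= gf_mul(gen[j], coef)
    return bytes(payload) + bytes(rem[len(payload):])

def syndromes(codeword, nsym):
    """Evaluate the codeword at each generator root; all zero means no detected error."""
    result = []
    for i in range(nsym):
        y = 0
        for c in codeword:               # Horner's rule
            y = gf_mul(y, GF_EXP[i]) ^ c
        result.append(y)
    return result
```

rs_encode(b"hello world", 10) returns 21 bytes: the first 11 are the payload unchanged and the last 10 are the parity bytes. The syndromes function gives a quick sanity check for a decoder under test, since it returns all zeros for an intact codeword.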

How to identify compressed/uncompressed bit groups?

I'm using a static dictionary file with some words and values for these words. These values are not of fixed size: for example, "the" is 1, "love" is 01, "kill" is 101, etc. When I try to compress a group of words, I go through every word and look it up in the dictionary to see if a value exists for it. If one exists, I replace the word with the value; if it doesn't exist, I encode the word as bytes. After compression I have a chunk of bits, and because these dictionary values and uncompressed words are not of fixed size, I cannot group the bits and decode them.
I have thought about using 1 bit flag for every group of bits to determine it is compressed or uncompressed, but I can't detect the flag bit because of this unknown length of a codeword or regular word.
If I use a 1 byte delimiter, it still has problems. Let's say my delimiter is 00000000, and before the delimiter I have 100 and after delimiter I have 001, so we have 10000000000001, how am I supposed to know that which group of these bits are my delimiter?
Can I use some other method to group these compressed/uncompressed bits to decode them? Thank you.
First off, what language and system are you intending to deploy this on? Many languages provide their own libraries and tools for compression, and these may suit your needs without major low-level design effort.
The answer here is to establish some more rigorous bookkeeping and file formatting to be able to undo the compression. Most compression systems have some amount of overhead in their file format, which is why compressing something twice doesn't necessarily save anything and can actually increase the size of the file.
Files often use a header at the start to provide key information, which would be a good place to define any rules that are specific to the compressed file.
Create a fixed-size delimiter to use between code words only. This can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than using a fixed known value, include it as one of your header items.
Keep your header in a simple ASCII format so that you can easily extract it with standard tools like sscanf and fscanf.
If you want a header that can contain extra information, you may need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and is still easily identifiable.
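As an alternative to a delimiter (this is not the scheme described above, just one possible way to make the stream self-describing), each group can start with a 1-bit flag: 1 followed by a dictionary code, or 0 followed by an 8-bit length and that many raw bytes. This only works if the dictionary codes are prefix-free, which the codes in the question (1, 01, 101) are not, since 1 is a prefix of 101. The Python sketch below therefore uses a hypothetical prefix-free dictionary.

```python
# Hypothetical prefix-free dictionary: no code is the start of another code.
CODES = {"the": "1", "love": "01", "kill": "001"}

def compress(words):
    """Return a bit string: flag '1' + code, or flag '0' + length byte + raw bytes."""
    out = []
    for w in words:
        if w in CODES:
            out.append("1" + CODES[w])
        else:
            raw = w.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("raw word too long for an 8-bit length field")
            out.append("0" + format(len(raw), "08b")
                       + "".join(format(b, "08b") for b in raw))
    return "".join(out)

def decompress(bits):
    """Inverse of compress; assumes a well-formed bit string."""
    inverse = {code: word for word, code in CODES.items()}
    words, pos = [], 0
    while pos < len(bits):
        flag = bits[pos]
        pos += 1
        if flag == "1":                   # dictionary word: read bits until a code matches
            code = ""
            while code not in inverse:
                code += bits[pos]
                pos += 1
            words.append(inverse[code])
        else:                             # raw word: length byte, then the bytes
            n = int(bits[pos:pos + 8], 2)
            pos += 8
            raw = bytes(int(bits[pos + 8 * i:pos + 8 * i + 8], 2) for i in range(n))
            pos += 8 * n
            words.append(raw.decode("utf-8"))
    return words
```

The decoder always knows what comes next: after a flag it reads either a code that ends as soon as it matches, or a length that says exactly how many bytes follow, so no delimiter can be confused with data.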