LZ78 compression for data with both integers and words

Hi, I'm researching LZ78 compression and have a question. What if the data consists of both numbers and words? Decompression would be very confusing.
Example: a compressed string like this: "11334a"
Would it be (113,3) and (4,a), or (1,1) (3,3) (4,a)?
I wonder whether using an "escape" as in RLE would work.
If not, is there any way to compress a mixed string like this, or is LZ78 for text only?
Thank you for your time.
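The ambiguity in "11334a" comes from writing the dictionary indices as decimal text next to the data, not from LZ78 itself. A minimal sketch, assuming each token is an (index, character) pair that is serialized with a fixed width (a 2-byte index plus a 1-byte character here, which is an illustrative choice and not part of any standard), shows that digits in the input then cause no confusion. The function names are hypothetical.

```python
import struct

def lz78_compress(text):
    """Return a list of (dictionary_index, next_char) tokens."""
    dictionary = {"": 0}
    tokens = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the known phrase
        else:
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush a leftover phrase as (index of its prefix, its last character).
        tokens.append((dictionary[phrase[:-1]], phrase[-1]))
    return tokens

def lz78_decompress(tokens):
    phrases = [""]
    out = []
    for index, ch in tokens:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

def pack_tokens(tokens):
    # Assumption: indices fit in 16 bits and characters in one byte.
    return b"".join(struct.pack(">HB", index, ord(ch)) for index, ch in tokens)

def unpack_tokens(data):
    return [(index, chr(code)) for index, code in struct.iter_unpack(">HB", data)]
```

Because every token occupies exactly three bytes, the decoder never has to guess where an index ends and a character begins, whether the character is "3" or "a". Real implementations usually pack indices into a variable number of bits instead, but the boundary is still known to the decoder by construction.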

Related

String compression and decompression

I am working in a project with an industrial system that implements its own data storage. Some fields only allow 31 characters.
Is there any compression / decompression algorithm to convert a large string into a shorter string and still be able to undo the process?
I'd like the compressed result to be a string.
i.e.
original string: "this is a large string containing a lot of characters"
compressed: "thssalrgestrgcntniglotfchrctrs"

With a modified RLE, must there still be at least one "compressed" file larger than the original?

I have read this.
The mathematical proof shows that for every lossless algorithm, at least one "compressed" file will be larger than the original.
For RLE, the compressed text file will be larger if all characters are different.
e.g. ABC -> 1A1B1C
But what if I modify the RLE rule:
(1) For runs of length 1 or 2, no number is added
e.g. ABCCDDDEFFFF -> ABCC3DE4F
This seems fine, and no compressed file gets larger.
However, that contradicts the mathematical proof.
Your scheme cannot support decompression, because two different inputs can produce the same output. The problem is that the input may contain digits as well. In plain RLE, the input "1A" transforms to "111A" and decodes back to "1A". In your scheme, "3D" and "DDD" both compress to "3D", so the decoder cannot tell which one was the original.
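The collision is easy to reproduce. The following is a minimal sketch of the modified rule as described in the question (runs of length 1 or 2 are copied as-is, longer runs become count plus character); the function name is hypothetical.

```python
from itertools import groupby

def modified_rle(text):
    """Modified RLE: only runs of length 3 or more get a count prefix."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        out.append(ch * n if n <= 2 else f"{n}{ch}")
    return "".join(out)

print(modified_rle("ABCCDDDEFFFF"))  # ABCC3DE4F
print(modified_rle("DDD"))           # 3D
print(modified_rle("3D"))            # 3D
```

Since "DDD" and "3D" map to the same output, the mapping is not invertible, which is how the scheme avoids ever growing a file without contradicting the proof: it is no longer lossless.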

Matlab - how to extract specific data from a vector

I have some data from a GPS receiver, however, some of the data are corrupted by extra characters. I want to extract the timestamp (the first field) and the data for the $GPGGA and $GPVTG.
To be more clear, here is a sample of the data I have in a cell array:
'1458937887.70818 $GPGGA,200228.90,3555.3269,N,15552.9641,A*25'
'1458937887.709668 $GPVTG,56.740,T,56.740,M,0.069,N,0.127,K,D*2D'
'1458937887.712022 ªDe¾,…´apö$™°%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,C*2B'
'1458937887.714071 $GPVTG,286.847,T,286.847,M,0.028,N,0.051,K,D*28'
As you can see, the problem here is in the third line where some strange characters appear between the timestamp and the data.
Another problem is that sometimes this third line is split into two lines, something like this:
'1458937887.712022 ªDe¾,…´apö$™°'
'%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,D*24'
which makes using regexp very hard.
In summary, I want to format the third line (in both cases) as:
'1458937887.712022 $GPGGA,200229.00,3555.3269,N,15552.9641,D*2R'
Update:
Thanks to @excaza, this solves the first issue (removing the garbage):
regexprep(str, '(?<=\d\s)(.*)(?=\$GPGGA)', '')
As for the second issue, @Suever's question gave me an idea by looking at the format of the data. Is it possible to solve it while reading the data from a .txt file? Something like defining the delimiter to be * followed by two characters and a \n since all packets end with this pattern?
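The delimiter idea from the update can be sketched as follows. The example is in Python rather than MATLAB purely for illustration, and it assumes the whole file has been read into one string, that every packet ends with "*" plus two alphanumeric characters and a newline, and that only $GPGGA and $GPVTG sentences are wanted. The names RECORD_END, PACKET, and clean_records are hypothetical.

```python
import re

# A record ends only at a newline that follows "*" and two alphanumeric characters.
RECORD_END = re.compile(r"(?<=\*[0-9A-Za-z]{2})\r?\n")

# Leading timestamp, any garbage, then the first $GPGGA or $GPVTG sentence.
PACKET = re.compile(r"^(\d+\.\d+)\s.*?(\$GP(?:GGA|VTG),.*)$")

def clean_records(raw):
    cleaned = []
    for record in RECORD_END.split(raw):
        # Newlines left inside a record are stray line breaks in the garbage.
        record = record.replace("\r", "").replace("\n", "")
        match = PACKET.match(record)
        if match:
            cleaned.append(f"{match.group(1)} {match.group(2)}")
    return cleaned
```

Splitting on the packet terminator first means a record that was broken across two lines is rejoined before the garbage between the timestamp and the sentence is dropped. The sketch would misbehave if the garbage itself happened to contain "*", two alphanumeric characters, and a newline.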

How to read output of hexdump of a file?

I wrote a program in C++ that compresses a file.
Now I want to see the contents of the compressed file.
I used hexdump but I don't know what the hex numbers mean.
For example I have:
0000000 00f8
0000001
How can I convert that back to something that I can compare with the original file contents?
If you implemented a well-known compression algorithm you should be able to find a tool that performs the same kind of compression and compare its results with yours. Otherwise you need to implement an uncompressor for your format and check that the result of compressing and then uncompressing is identical to your original data.
That looks like a file containing the single byte 0xf8. I say that since it appears to have the same behaviour as od under UNIX-like operating systems, with the last line containing the length and the contents padded to a word boundary (you can use od -t x1 to get rid of the padding, assuming your od is advanced enough).
As to how to recreate it, you need to run it through a decompression process that matches the compression used.
Given that the compressed file is that short, you either started with a very small file, your compression process is broken, or it's incredibly efficient.

How to identify compressed/uncompressed bit groups?

I'm using a static dictionary file with some words and values for these words. These values are not fixed-size; for example, "the" is 1, "love" is 01, "kill" is 101, etc. When I try to compress a group of words, I traverse every word and look it up in the dictionary to see whether a value exists for it. If one exists, I replace the word with the value; if not, I encode the word as bytes. After compression I get a chunk of bits, and because the dictionary values and the uncompressed words are not fixed-size, I cannot group the bits and decode them.
I have thought about using a 1-bit flag for every group of bits to mark it as compressed or uncompressed, but I can't locate the flag bit because the length of a codeword or regular word is unknown.
If I use a 1-byte delimiter, it still has problems. Let's say my delimiter is 00000000, and before the delimiter I have 100 and after it I have 001, so we have 10000000000001. How am I supposed to know which group of these bits is my delimiter?
Can I use some other method to group these compressed/uncompressed bits so I can decode them? Thank you.
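One common way around both the flag and the delimiter problem is a different technique from a delimiter: make the codes prefix-free (no code is the beginning of another) and reserve one code as an escape that announces a length-prefixed literal. Note that the example codes 1 and 101 above are not prefix-free, so the table below is a hypothetical replacement. A minimal sketch, using a string of "0"/"1" characters instead of packed bits for readability:

```python
# Hypothetical prefix-free table: no code is a prefix of another code.
CODES = {"the": "0", "love": "10", "kill": "110"}
ESCAPE = "111"  # reserved code: an 8-bit length and that many raw bytes follow

def encode(words):
    bits = []
    for word in words:
        if word in CODES:
            bits.append(CODES[word])
        else:
            raw = word.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("literal too long for an 8-bit length field")
            bits.append(ESCAPE)
            bits.append(format(len(raw), "08b"))
            bits.extend(format(byte, "08b") for byte in raw)
    return "".join(bits)

def decode(bits):
    by_code = {code: word for word, code in CODES.items()}
    words, pos, current = [], 0, ""
    while pos < len(bits):
        current += bits[pos]
        pos += 1
        if current in by_code:
            words.append(by_code[current])
            current = ""
        elif current == ESCAPE:
            length = int(bits[pos:pos + 8], 2)
            pos += 8
            raw = bytes(int(bits[pos + 8 * i:pos + 8 * i + 8], 2) for i in range(length))
            pos += 8 * length
            words.append(raw.decode("utf-8"))
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return words
```

The decoder reads one bit at a time until the bits seen so far match a code; because the table is prefix-free, the first match is the only possible one, so no separator is needed. Huffman coding is the usual way to build such a table from word frequencies.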
First off, what language and system are you intending to deploy this on? Many languages provide their own libraries and tools for compression, which may suit your needs without major low-level design effort.
The answer here is to establish more rigorous bookkeeping and file formatting so that the compression can be undone. Most compression systems have some overhead in their file format, which is why compressing something twice doesn't necessarily save anything and can actually increase the size of the file.
Files often use a header at the start to provide key information, which would be a good place to define any rules that are specific to the compressed file.
Create a fixed-size delimiter to use between code words only. It can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than using a fixed known value, include it as one of your header items.
Keep your header in a simple ASCII format so that you can easily extract it with standard tools like sscanf and fscanf.
If you want a header that can contain extra information, you may need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and is still easily identifiable.
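The header suggestion can be sketched like this, assuming "key=value" ASCII lines followed by an ENDHEADER line and then the raw payload. The field names and helper functions are hypothetical and only illustrate the layout.

```python
HEADER_END = b"ENDHEADER\n"

def write_container(fields, payload):
    """Prefix the compressed payload with an ASCII key=value header."""
    header = "".join(f"{key}={value}\n" for key, value in fields.items())
    return header.encode("ascii") + HEADER_END + payload

def read_container(blob):
    """Split a container back into its header fields and payload bytes."""
    head, marker, payload = blob.partition(HEADER_END)
    if not marker:
        raise ValueError("ENDHEADER marker not found")
    lines = head.decode("ascii").splitlines()
    fields = dict(line.split("=", 1) for line in lines)
    return fields, payload

blob = write_container({"delimiter": "00000000", "word_count": "3"}, b"\x80\x01")
fields, payload = read_container(blob)
print(fields["delimiter"], payload)
```

Only the first ENDHEADER marker is treated as the end of the header, so the payload itself may contain any bytes. The sketch assumes that no header value contains the marker text followed by a newline.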