String compression and decompression

I am working on a project with an industrial system that implements its own data storage. Some fields only allow 31 characters.
Is there any compression / decompression algorithm to convert a large string into a shorter string and still be able to undo the process?
I'd like the compressed result to be a string.
e.g.
original string: "this is a large string containing a lot of characters"
compressed: "thssalrgestrgcntniglotfchrctrs"

Related

LZ78 compression for data with both integers and words

Hi, I'm researching LZ78 compression and I had a question: what if the data consists of both numbers and words? The decompression would be very confusing.
For example, take a compressed string like "11334a".
Would it be (113, 3) and (4, a), or (1, 1), (3, 3), (4, a)?
I'm wondering if using an "escape" the way RLE does would work.
If not, is there any way to compress mixed strings like this, or is LZ78 for text only?
Thank you for your time.
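The ambiguity comes from writing variable-width numbers right next to literal characters with nothing separating them. Here is a minimal sketch (my own, not from the thread) of a fixed-width text serialization for LZ78 tokens: the dictionary index always occupies exactly five digits, so a decoder always knows where the number ends and the literal character begins, and the "(113,3) vs (1,1)(3,3)" confusion cannot arise. The literal character can be a digit, a letter, anything.

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// One LZ78 token: a back-reference into the dictionary plus the next
// input character.
struct Token {
    uint16_t index;
    char     next;
};

// Each token becomes exactly six characters: five decimal digits for
// the index, then the literal character.
std::string serialize(const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& t : tokens) {
        char buf[8];
        std::snprintf(buf, sizeof buf, "%05u", static_cast<unsigned>(t.index));
        out += buf;
        out += t.next;
    }
    return out;
}

std::vector<Token> deserialize(const std::string& s) {
    std::vector<Token> tokens;
    for (std::size_t i = 0; i + 6 <= s.size(); i += 6) {
        Token t;
        t.index = static_cast<uint16_t>(std::stoi(s.substr(i, 5)));
        t.next  = s[i + 5];
        tokens.push_back(t);
    }
    return tokens;
}

Fixed-width fields cost a little space; a binary format (e.g. two raw bytes per index) would be tighter, but the principle is the same: the decoder must always be able to tell where a number stops.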

By using modified RLE, must there be at least one "compressed" file larger than the original?

I have read this.
The mathematical proof shows that for any lossless algorithm, there must be at least one "compressed" file that is larger than the original.
For RLE, the compressed text file will be larger if all characters are different,
e.g. ABC -> 1A1B1C
But suppose I modify the rules of RLE:
(1) For runs of length 1 or 2, no count is added,
e.g. ABCCDDDEFFFF -> ABCC3DE4F
It seems okay: no compressed file will ever get larger.
However, that contradicts the mathematical proof.
You fail to support decompression, because your compression is not injective. The problem is that your input may contain digits as well. Standard RLE copes with that: the input "1A" transforms to "111A" and back to "1A". In your scheme, however, "AAA" and the literal two-character input "3A" both compress to "3A" (runs of length 1 or 2 get no count), so the decompressor cannot tell them apart.
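For reference, here is a sketch of the unmodified, uniquely decodable scheme: every run, even a run of one, gets a single count digit, so the output is a strict sequence of (count, character) pairs and digits in the input cause no ambiguity. This is exactly the variant that can expand (ABC -> 1A1B1C), which is the price the counting argument says you must pay somewhere.

#include <string>

// Classic RLE: every run is written as one count digit followed by the
// character, and runs longer than 9 are split. "1A" encodes to "111A"
// and decodes back to "1A" unambiguously.
std::string rleEncode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 9)
            ++run;
        out += static_cast<char>('0' + run);
        out += in[i];
        i += run;
    }
    return out;
}

std::string rleDecode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.append(static_cast<std::size_t>(in[i] - '0'), in[i + 1]);
    return out;
}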

Store 32 bit value as C string in most efficient form

I am trying to find the most efficient way to encode 32-bit hashed string values into text strings for transmission/logging in low-bandwidth environments. Complex compression can't be used, because the hash values need to remain in human-readable text strings when logged and sent between client and host.
Consider the following contrived examples:
given the key/value map
table[0xFE12ABCD] = "models/texture/red.bmp";
table[0x3EF088AD] = "textures/diagnostics/pink.jpg";
and the string formats:
"Loaded asset (0x%08x)"
"Replaced (0x%08x) with (0x%08x)"
they could be printed as:
"Loaded asset models/texture/red.bmp"
"Replaced models/texture/red.bmp with textures/diagnostics/pink.jpg"
Or if the key/value map is known by the client and server:
"Loaded asset (0xFE12ABCD)"
"Replaced (0xFE12ABCD) with (0x3EF088AD)"
The receiver can then scan for the (0xNNNNNNNN) pattern and expand it locally.
This is what I am doing right now, but I would like to find a way to represent the 32-bit value more efficiently. A simple step would be to use a better identifying token:
"Loaded asset $FE12ABCD"
"Replaced $1000DEEE with $3EF088AD"
This already reduces the length of each token; $ is not used anywhere else, so it is reasonable.
However, what other options are there to make that 32-bit value even smaller? I can't use an index: it has to be the full 32-bit value, because sometimes the generator of the string has the hash and sometimes it has a string that it will hash immediately.
A common solution is to use Base-85 coding. You can code four bytes into five Base-85 digits, since 85^5 > 2^32. Pick 85 printable characters and assign them to the digit values 0..84. Then do base conversion to go either way. Since there are 94 printable characters in ASCII, it is usually easy to find 85 that are "safe" in whatever constrains your strings to be "readable".
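A minimal sketch of that scheme, with a hand-picked 85-character alphabet; '$' is deliberately left out so it can remain the identifying token from the question:

#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>

// 85 printable characters; '$' and '"' are among those excluded so they
// stay "safe" for the surrounding log format.
static const char ALPHABET[86] =
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#%&()*+,-./:;<=>?@^_{}";

// Encode a 32-bit value as exactly five Base-85 digits (85^5 > 2^32).
std::string encode85(uint32_t v) {
    std::string out(5, '0');
    for (int i = 4; i >= 0; --i) {   // least significant digit last
        out[i] = ALPHABET[v % 85];
        v /= 85;
    }
    return out;
}

uint32_t decode85(std::string_view s) {
    uint32_t v = 0;
    for (char c : s) {
        std::size_t d = std::string_view(ALPHABET, 85).find(c);
        if (d == std::string_view::npos)
            throw std::invalid_argument("not a Base-85 digit");
        v = v * 85 + static_cast<uint32_t>(d);
    }
    return v;
}

With this, the eight hex digits of "$FE12ABCD" shrink to five characters after the marker, and both directions are simple base conversion.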

How to read large files in C++ with mixed text and binary

I need to read a large file of either text, binary, or combination, such as a JPEG file, encrypt it, and write it to a file. At some later time I will need to read the encrypted data, and decrypt it.
The end goal is to verify that the decrypted data matches the original data.
My problem is that with large files, greater than 1 MB, I don't want to read and write character by character. I am targeting this code for a phone, and I/O would cause too long a delay for the user.
With a pure text file, fread() and fwrite() appeared to convert the data to binary, and the result was different from the original. With a JPEG image, it appears that there is some textual content mixed in with the binary data.
Is there a way to efficiently read in an arbitrary type of file and write it back in the original format?
Or is character by character the only option?
Or am I still out of luck?
After debugging, it turned out that the decrypt function had the plaintext and ciphertext buffers assigned backwards. After swapping the buffer assignments, the decrypted results matched the original data. I had originally thought that reading text as binary and then rewriting it as binary would not come back out as text, but I was wrong.
Reading the entire file as binary works just fine.
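A minimal sketch of that (the names are my own): open the streams with std::ios::binary so no newline translation happens and the bytes round-trip exactly, whether the file is text, binary, or a mix like a JPEG.

#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Slurp any file into memory byte-for-byte.
std::vector<char> readAll(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Write the bytes back out unchanged.
void writeAll(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary);
    if (!out)
        throw std::runtime_error("cannot create " + path);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
}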

How to identify compressed/uncompressed bit groups?

I'm using a static dictionary file with some words and values for those words. The values are not fixed-size; for example, the is 1, love is 01, kill is 101, etc. When I try to compress a group of words, I traverse every word and check whether a value exists for it in the dictionary. If one exists, I replace the word with the value; if it doesn't, I encode the word as bytes. After compression I get a chunk of bits, and because the dictionary values and the uncompressed words are not fixed-size, I cannot group the bits and decode them.
I have thought about using a 1-bit flag in front of every group of bits to mark it as compressed or uncompressed, but I can't locate the flag bits, because the length of a codeword or regular word is unknown.
A 1-byte delimiter still has problems. Let's say my delimiter is 00000000; if I have 100 before the delimiter and 001 after it, the stream is 10000000000001. How am I supposed to know which of these bits are my delimiter?
Can I use some other method to group the compressed/uncompressed bits so they can be decoded? Thank you.
First off, what language and system are you intending to deploy this on? Many languages provide their own libraries and tools for compression, which may suit your needs without major low-level design effort.
The answer here is to establish more rigorous bookkeeping and file formatting so that the compression can be undone. Most compression systems have some overhead in their file format, which is why compressing something twice doesn't necessarily save anything and can actually increase the size of the file.
Files often take advantage of a header at the start to provide key information; that is a good place to define any rules that are specific to the compressed file.
Create a fixed-size delimiter to use between code words only. It can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than using a fixed, known value, include it as one of your header items.
Keep your header in a simple ASCII format so that you can easily extract it with standard tools like sscanf and fscanf.
If you want a header that can contain extra information, you need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and still easily identifiable.
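Here is a minimal sketch of such a header; the DELIM and BITS field names are made up for illustration, and the only parts taken from the advice above are the plain-ASCII format and the "ENDHEADER" sentinel:

#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

struct Header {
    std::string delimiter;    // bit pattern used between code words
    std::size_t bitCount = 0; // payload bits, so trailing padding is unambiguous
};

// Write key/value lines in plain ASCII, then the sentinel.
void writeHeader(std::ostream& out, const Header& h) {
    out << "DELIM " << h.delimiter << "\n"
        << "BITS "  << h.bitCount  << "\n"
        << "ENDHEADER\n";
}

// Read lines until the sentinel; unknown keys are skipped, so the header
// can grow extra fields later without breaking old readers.
Header readHeader(std::istream& in) {
    Header h;
    std::string line;
    while (std::getline(in, line) && line != "ENDHEADER") {
        std::istringstream fields(line);
        std::string key;
        fields >> key;
        if (key == "DELIM")     fields >> h.delimiter;
        else if (key == "BITS") fields >> h.bitCount;
    }
    if (line != "ENDHEADER")
        throw std::runtime_error("missing ENDHEADER");
    return h;
}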