Best compression algo for random string

I have a string like the one below, which is around 2 Kbytes:
ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a
I tried to compress it but I am not getting a good compression ratio:
with gz I got only a 400 byte reduction and with deflate I got a 450 byte reduction.
Is there any better algorithm that gives more compression, at least more than a 50% reduction?

By definition, you cannot compress random data, because it will not contain any structure you can represent or describe in a more efficient way, using fewer bits.
If that were possible, the data would contain a structure and would no longer be random.
A common counter-argument is that, given enough tries, even an all-zero string can be generated by an RNG, but the devil is in the details: it is all about the odds!
Even in a tiny 2 KB space there are 2^(2048*8) possible strings if the data is generated by a true RNG or a robust PRNG seeded with a reasonable amount of noise, and the vast majority of those strings will not contain any reasonable amount of "order" you can compress.
The fact that you are getting a 400 B / 450 B reduction on 2 KB is a strong hint that the string you are looking at is not really random, just non-human-readable or "random-looking".
The gz format is based on the Deflate compression algorithm, so it is not clear why the two figures are presented separately - Deflate accepts various parameters for fine-tuning compression at the expense of speed, so different settings can explain the different results.
To get better compression on random-looking (but not really random!) data you can try LZMA2 (7-Zip) or, even better, ZPAQ (http://mattmahoney.net/dc/zpaq.html).

I know that this is much later than the OP... however, if you look at HOW the data is being represented, then yes, it is going to be hard to find repetition as a string... however...
with the example
"ff8870fd30db56efd72e8b499a454c4e27be6ab70e23dd59a864563628e998a" as given...
How else COULD this information be represented? These all look to be hex couplets, for example
"0xff 0x88 0x70" etc.... so, if this was stored as raw bytes, you would immediately halve the size, since every two hex characters collapse into a single byte (a quick sketch follows below)...
If we wanted to get very clever we could look at some math where, say, we could map this data to more easily compressible data... of course this would only be beneficial for very large data, as the encoding of small amounts of data would likely make it larger...
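As a rough illustration (assuming the input really is plain hex text and nothing else), packing each pair of hex digits into one byte could look something like this:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack a hex string ("ff8870...") into raw bytes: two characters become one
// byte, halving the size before any general-purpose compressor even runs.
// Assumes the input contains only valid hex digits.
std::vector<std::uint8_t> hexToBytes(const std::string& hex) {
    std::vector<std::uint8_t> out;
    out.reserve(hex.size() / 2);
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        out.push_back(static_cast<std::uint8_t>(std::stoi(hex.substr(i, 2), nullptr, 16)));
    return out;
}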

Parallel bzip2 decompression scan may fail?

Compressing a bzip2 byte stream in parallel can easily be done with a FIFO queue, where every chunk is processed as a parallel task and streamed into a file.
The other way round, parallel decompression is not so easy, because everything is bit-aligned and the exact bit length of a block is only known after it has been decompressed.
As far as I can see, parallel decompression implementations use magic numbers for block start and stream end and perform a bit-scan. Isn't there a small chance that one of the streams contains such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the Huffman trees are not allowed
max. x bytes of Huffman stream (range to search for the next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting through the stream until I find a magic. But then, when I read block N and it fails, I should (or maybe should not) take into account that it was a false positive. For a parallel implementation I could then stop all tasks for blocks N, N+1, N+2, ..., try to find the next signature, and go on. That makes everything very complicated, and I don't know if it's worth the effort. I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm wondering why a file format uses magic numbers as markers but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but anyway, why can't a block contain e.g. 16 bits telling how far to jump to the next block?
Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10^-14 (actually 2^-48, closer to 3.55x10^-15). That probability applies at every bit position, so on average a false positive will occur in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive, you then need to validate the remainder of the block. You would not stop the processing of the subsequent candidate blocks, since it is extremely likely that they are all valid blocks. Just validate all of them and keep the validated ones. When combining, verify that the valid blocks exactly cover the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
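As an illustration of the bit-scan the question describes, here is a minimal C++ sketch (a hypothetical helper, not taken from any particular implementation) that slides a 48-bit window over an in-memory stream and records every occurrence of the block-start magic 0x314159265359; each hit is only a candidate until the block actually validates:

#include <cstddef>
#include <cstdint>
#include <vector>

// bzip2 is bit-aligned and stores bits MSB first, so the scan has to walk the
// stream one bit at a time rather than one byte at a time.
std::vector<std::size_t> findBlockCandidates(const std::uint8_t* data, std::size_t len) {
    const std::uint64_t magic = 0x314159265359ULL;        // block-start magic
    const std::uint64_t mask  = (std::uint64_t(1) << 48) - 1;
    std::vector<std::size_t> hits;                         // bit offsets of candidates
    std::uint64_t window = 0;
    for (std::size_t bit = 0; bit < len * 8; ++bit) {
        int b = (data[bit / 8] >> (7 - bit % 8)) & 1;      // next bit, MSB first
        window = ((window << 1) | static_cast<std::uint64_t>(b)) & mask;
        if (bit >= 47 && window == magic)
            hits.push_back(bit - 47);                      // first bit of the magic
    }
    return hits;
}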
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian did consider that when revising bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump through some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.

Why is bzip2's maximum blocksize 900k?

bzip2's help (i.e. the program by Julian Seward) lists the available block sizes, between 100k and 900k:
$ bzip2 --help
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
usage: bzip2 [flags and input files in any order]
-1 .. -9 set block size to 100k .. 900k
This number corresponds to the hundred_k_blocksize value written into the header of a compressed file.
From the documentation, memory requirements are as follows:
Compression: 400k + ( 8 x block size )
Decompression: 100k + ( 4 x block size ), or
100k + ( 2.5 x block size )
At the time the original program was written (1996), I imagine 7.6M (400k + 8 * 900k) might have been a hefty amount of memory on a computer, but for today's machines it's nothing.
My question is two part:
1) Would better compression be achieved with larger block sizes? (Naively I'd assume yes.) Is there any reason not to use larger blocks? How does the CPU time for compression scale with the block size?
2) Practically, are there any forks of the bzip2 code (or alternate implementations) that allow for larger block sizes? Would this require significant revision to the source code?
The file format seems flexible enough to handle this. For example... since hundred_k_blocksize holds an 8-bit character that indicates the block size, one could extend further along the ASCII table to indicate larger block sizes (e.g. ':' = 0x3A => 1000k, ';' = 0x3B => 1100k, '<' = 0x3C => 1200k, ...), as sketched below.
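As a rough sketch of that idea (a hypothetical helper, not part of bzip2's source), the current header byte is simply '1'..'9', and continuing along the ASCII table would extend the same scheme unchanged:

// Hypothetical decoding of the proposed extended hundred_k_blocksize byte:
// '1' -> 100k, ..., '9' -> 900k, ':' -> 1000k, ';' -> 1100k, and so on.
int blockSizeInBytes(char hundred_k_blocksize) {
    return (hundred_k_blocksize - '0') * 100 * 1000;
}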
Your intuition that a larger block size should lead to a higher compression ratio is supported by Matt Mahoney's compilation of programs in his large text compression benchmark. For example, the open-source BWT program BBB (http://mattmahoney.net/dc/text.html#1640) gets a ~40% compression ratio improvement going from a block size of 10^6 to 10^9. Between these two values, the compression time doubles.
Now that the xz program, which uses an LZ variant (called LZMA2) originally described by the 7-Zip author, Igor Pavlov, is beginning to overtake bzip2 as the default strategy for compressing source code, it is worth studying the possibility of raising bzip2's block size to see if it might be a viable alternative. Also, bzip2 avoided arithmetic coding due to patent restrictions, which have since expired. Combined with the possibility of using the fast asymmetric numeral systems developed by Jarek Duda for entropy coding, a modernized bzip2 could very well be competitive with xz in both compression ratio and speed.

Looking for hash algorithm where small change in input will result in small change on hash

Current hash functions are designed to produce big changes in the hash even if only a very small portion of the input is changed. What I need is a hash algorithm whose output changes in proportion to changes in the input. For example, I need something similar to this:
Hash("STR1") => 1000
Hash("STR2") => 1001
Hash("STR3") => 1002
etc.
I'm not good at algorithms and have never heard of such an implementation, although I'm almost sure someone must already have come up with this algorithm.
My current requirement is to have a large bit length (512 bits maybe?) to avoid collisions.
Thanks
UPDATE
I think I should clarify my goal; I see that I did a very poor job explaining what I need. Sorry, I'm not a native English speaker or a great communicator.
So basically I need this hash algorithm for searching for similar binary files. You can think of it as an antivirus hashing algorithm: it calculates a file checksum, but unlike traditional hash functions it is still able to detect a malware binary even after some small modification. This is pretty much what I'm looking for.
Another aspect is to avoid collisions. Let me explain what I mean by that; it's not a conflicting goal. I want Hash("STR1") to produce 1000 and Hash("STR2") to produce 1001 or maybe 1010, it doesn't matter, as long as the value is close relative to the previous hash. But Hash("This is a very large string or maybe even binary data" + 100 random chars) should not produce a value close to 1000. I understand it will not always work and there will be some hash/hash-range collisions, but I think I can introduce another hashing algorithm and verify both to minimize collisions.
So what do you think? Maybe there is a better way to achieve my goal, maybe I'm asking too much, I don't know. I'm not well versed in cryptography, math or algorithms.
Thank you again for your time and effort
How about a simple summation? Your hash can then wrap at the desired size, and if you take this into account during hash comparisons, a small difference in inputs should yield a small difference in hashes.
However, I think "minimal collisions" and "proportional change in output" are conflicting goals.
This is called, in other domains, perceptual hashing.
One approach to this is as follows:
Get a training multiset of n-grams. (E.g. if n=2 and your training data was "This is a test" your training set would be "Th", "hi", "is", "s ", etc)
Sort and calculate the frequencies of said n-grams, descending.
Then the hash of a word is the first bits of "for each n-gram in the database, is this word's frequency of said n-gram higher than the average frequency?" (a rough sketch follows below).
Note that this can and will result in many collisions with similar words, unfortunately, unless the hash length is absurdly long.
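Here is a hedged sketch of that scheme, under the assumption that a reference list of n-grams and their average corpus frequencies has already been computed and sorted by descending frequency (the helper name and the choice of 64 output bits are mine, purely for illustration):

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One output bit per reference n-gram: the bit is set if the input uses that
// n-gram more often than the corpus does on average. Similar inputs share
// most n-gram statistics, so their hashes differ in only a few bit positions.
std::uint64_t ngramHash(const std::string& input,
                        const std::vector<std::pair<std::string, double>>& reference,
                        std::size_t n = 2) {
    std::unordered_map<std::string, double> freq;
    std::size_t total = (input.size() >= n) ? input.size() - n + 1 : 0;
    for (std::size_t i = 0; i + n <= input.size(); ++i)
        freq[input.substr(i, n)] += 1.0 / total;      // relative frequency

    std::uint64_t hash = 0;
    for (std::size_t bit = 0; bit < reference.size() && bit < 64; ++bit) {
        auto it = freq.find(reference[bit].first);
        double f = (it == freq.end()) ? 0.0 : it->second;
        if (f > reference[bit].second)                // above-average use?
            hash |= std::uint64_t(1) << bit;
    }
    return hash;
}

Compare two such hashes by Hamming distance rather than equality: the smaller the distance, the more similar the inputs, which matches the "small input change, small hash change" goal; and, as noted above, collisions among similar words are expected.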
MD5 or SHA-x is not what you want.
According to Wikipedia, the substitution cipher, for example, has no avalanche effect (that is the term you mean).
In terms of hashing you could use some kind of character sum.
For example:
const char* hashme = "hallo123";
int result = 0;
// sum the character codes: similar strings give nearby sums
for (int i = 0; i < 8; ++i) {
    result += hashme[i];
}
It may be geared towards kids, but the old NSA Kids' section has some really good ideas.
Of course, these algorithms are really insecure, so you cannot use them in place of REAL encryption. (But you can't use a real encryption algorithm when you just want to have fun, either.)
The number grid involves setting up a grid and then using the coordinates of each letter as its code.
Further ideas:
Mix up the letter arrangement
Convert numbers to binary to obfuscate
A winding way also uses a grid. Essentially, the letters are packed into the grid left to right, in rows going downwards. The output is produced by slicing vertically through the grid.
Typically hash and encryption algorithms oriented towards cryptography will behave in the exact opposite way of what you're looking for (i.e. small changes in the input will cause large changes in the output and vice versa), so this algorithm class is a dead end.
As a quick digression on why these algorithms behave like this: of necessity, they're designed to obscure statistical relationships between the input and output to make them more difficult to crack. For example, in the English language the letter "e" is by far the most commonly-used letter; in some very weak classical ciphers you could simply find the most common letter and figure that that corresponds to "e" (e.g. - if n is the most common letter, then odds are n = e). Actually, a statistical pattern like you describe would likely make the algorithm significantly more vulnerable to chosen-plaintext, known-plaintext, man in the middle, and replay attacks.
The man in the middle and replay attacks would be made significantly easier by the fact that it would be much easier to edit the ciphertext to achieve the desired plaintext without knowing the key (especially if you have access to a couple of chosen plaintexts).
If you know that
7/19/2016 1:35 transfer $10 from account x to account y
(where the datestamp is used to defend against a replay attack) encodes to
12345678910
whereas
7/19/2016 1:40 transfer $10 from account x to account y
encodes to
12445678910
it's a pretty safe guess that
12545678910
will mean something like
7/19/2016 1:45 transfer $10 from account x to account y
Without having access to the original key, you could replay this packet on a regular basis to continue to steal money from someone's account simply by making a trivial edit. Granted, this is a fairly contrived example, but it still illustrates the basic problem.
My understanding of what you're looking for is statistical similarity between files. This might help some: https://en.wikipedia.org/wiki/Semantic_similarity
This does indeed exist. The term is Locality-sensitive hashing. A concrete implementation can be found here.
Depending on the source document, you might want to look at digital forensics or VisualRank (from Google) for finding similar images and video. For textual data this is commonly used in anti-spam (read more here). For binary files you might want to first run a disassembler and then run the algorithm on the text version - but this is just my feeling; I don't have research to back this statement, but it would be an interesting hypothesis to test.

Write a program that takes text as input and produces a program that reproduces that text

Recently I came across a nice problem that turned out to be as simple to understand as it is hard to solve. The problem is:
Write a program, that reads a text from input and prints some other
program on output. If we compile and run the printed program, it must
output the original text.
The input text is supposed to be rather large (more than 10000 characters).
The only (and very strong) requirement is that the size of the archive (i.e. the program printed) must be strictly less than the size of the original text. This rules out obvious solutions like
std::string s;
/* read the text into s */
std::cout << "#include<iostream> int main () { std::cout<<\"" << s << "\"; }";
I believe some archiving techniques are to be used here.
Unfortunately, such a program does not exist.
To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2^n binary strings of length n.
Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is
2^0 + 2^1 + 2^2 + ... + 2^(n-1)
Using a bit of math, we can get that this number is equal to 2^n - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless - if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.
This is where we reach a problem. Since there are 2^n different bitstrings of length n and only 2^n - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string. This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.
So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2^10000 strings of length 10000 onto the 2^10000 - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.
That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that with probability 1 - 2^-10000 any string of length 10000 could be compressed. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never guess the One Bad String.
For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov Complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory," chapter 14.
Hope this helps!
If we are talking about ASCII text...
I think this actually could be done, and I think the restriction that the text will be larger than 10000 chars is there for a reason (to give you coding room).
People here are saying that the string cannot be compressed, yet it can.
Why?
Requirement: OUTPUT THE ORIGINAL TEXT
Text is not data. When you read input text you read ASCII chars (bytes), which include both printable and non-printable values.
Take this for example:
ASCII values characters
0x00 .. 0x08 NUL, (other control codes)
0x09 .. 0x0D (white-space control codes: '\t','\f','\v','\n','\r')
0x0E .. 0x1F (other control codes)
... rest of printable characters
Since you have to print text as output, you are not interested in the ranges 0x00-0x08 and 0x0E-0x1F.
You can compress the input bytes by using a different storing and retrieving mechanism (binary patterns), since you don't have to give back the original data, only the original text. You can recalculate what the stored values mean and readjust them to bytes to print. You would effectively lose only data that was not text data anyway, and is therefore not printable or inputtable. If WinZip did that it would be a big fail, but for your stated requirements it simply does not matter.
Since the requirement states that the text is 10000 chars and you can save 26 of 255 values, if your packing did not have any loss you would effectively be saving around 10% of the space, which means that if you can code the 'decompression' in 1000 (10% of 10000) characters you can achieve that. You would have to treat groups of 10 bytes as 11 chars, and from there extrapolate the 11th by some extrapolation method for your range of 229. If that can be done then the problem is solvable.
Nevertheless it requires clever thinking, and coding skills that can actually do that in 1 kilobyte.
Of course this is just a conceptual answer, not a functional one.
I don't know if I could ever achieve this.
But I had the urge to give my 2 cents on this, since everybody felt so sure that it cannot be done.
The real problem in your problem is understanding the problem and the requirements.
What you are describing is essentially a program for creating self-extracting zip archives, with the small difference that a regular self-extracting zip archive would write the original data to a file rather than to stdout. If you want to make such a program yourself, there are plenty of implementations of compression algorithms, or you could implement e.g. DEFLATE (the algorithm used by gzip) yourself. The "outer" program must compress the input data and output the code for the decompression, and embed the compressed data into that code.
Pseudocode:
string originalData;
cin >> originalData;
char * compressedData = compress(originalData);
cout << "#include<...> string decompress(char * compressedData) { ... }" << endl;
cout << "int main() { char compressedData[] = {";
(output the int values of the elements of the compressedData array)
cout << "}; cout << decompress(compressedData) << endl; return 0; }" << endl;
Assuming "character" means "byte" and assuming the input text may contains at least as many valid characters as the programming language, its impossible to do this for all inputs, since as templatetypedef explained, for any given length of input text all "strictly smaller" programs are themselves possible inputs with smaller length, which means there are more possible inputs than there can ever be outputs. (It's possible to arrange for the output to be at most one bit longer than the input by using an encoding scheme that starts with a "if this is 1, the following is just the unencoded input because it couldn't be compressed further" bit)
Assuming its sufficient to have this work for most inputs (eg. inputs that consist mainly of ASCII characters and not the full range of possible byte values), then the answer readily exists: use gzip. That's what its good at. Nothing is going to be much better. You can either create self-extracting archives, or treat the gzip format as the "language" output. In some circumstances you may be more efficient by having a complete programming language or executable as your output, but often, reducing the overhead by having a format designed for this problem, ie. gzip, will be more efficient.
It's called a file archiver producing self-extracting archives.

What's the best way to hash a string vector not very long (urls)?

I am now dealing with URL classification. I partition a URL on "/", "?", etc., generating a bunch of parts. In the process, I need to hash the first part through the kth part; say k=2, then for "http://stackoverflow.com/questions/ask" the key is the string vector "stackoverflow.com questions". Currently the hash is like Hash. But it consumes a lot of memory. I wonder whether MD5 can help, or whether there are other alternatives. In effect, I do not need to recover the key exactly, as long as different keys can be differentiated.
Thanks!
It consumes a lot of memory
If your code already works, you may want to consider leaving it as-is. If you don't have a target, you won't know when you're done. Are you sure "a lot" is synonymous with "too much" in your case?
If you decide you really need to change your working code, you should consider the large variety of the options you have available, rather than taking someone's word for a specific algorithm:
http://en.wikipedia.org/wiki/List_of_hash_functions
http://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions
http://www.strchr.com/hash_functions
etc
Not sure about memory implications, and it certainly would change your perf profile, but you could also look into using Tries:
http://en.wikipedia.org/wiki/Trie
MD5 is a nice hash code for stuff where security is not an issue. It's fast and reasonably long (128 bits is enough for most applications). Also the distribution is very good.
Adler32 would be a possible alternative. It's very easy to implement, just a few lines of code. It's even faster than MD5. And it's long enough/good enough for many applications (though for many it is not).
(I know Adler32 is strictly not a hash-code, but it will still do fine for many applications)
However, if storing the hash-code is consuming a lot of memory, you can always truncate the hash-code, or use XOR to "shrink" it. E.g.
uint8_t md5[16];
GetMD5(md5, ...);   // placeholder: fill md5[] with the 16-byte digest
// use XOR to fold the MD5 down to 32 bits
for (size_t i = 4; i < 16; i++)
    md5[i % 4] ^= md5[i];
// assemble the four bytes into one uint32_t (cast before shifting to avoid
// signed-integer overflow on the << 24 term)
uint32_t const hash = md5[0] + ((uint32_t)md5[1] << 8)
                    + ((uint32_t)md5[2] << 16) + ((uint32_t)md5[3] << 24);
Personally I think MD5 would be overkill though. Have a look at Adler32, I think it will do.
EDIT
I have to correct myself: Adler32 is a rather poor choice for short strings (less than a few thousand bytes). I had completely forgotten about that. But there is always the obvious: CRC32. Not as fast as Adler32 (about the same speed as MD5), but still acceptably easy to implement, and there are also a ton of existing implementations with all kinds of licenses out there.
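For example, a minimal sketch using zlib's crc32() (assuming zlib is available; the helper name and key format are just illustrative) to hash the joined URL-part key down to 32 bits:

#include <zlib.h>
#include <cstdint>
#include <string>

// Hash a URL-part key such as "stackoverflow.com questions" down to 32 bits.
// Different keys are distinguished well enough for bucketing, and the original
// string never needs to be recovered.
std::uint32_t hashKey(const std::string& key) {
    return static_cast<std::uint32_t>(
        crc32(0L, reinterpret_cast<const Bytef*>(key.data()),
              static_cast<uInt>(key.size())));
}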
If you're only trying to find out whether two URLs are the same, have you considered storing a binary version of the IP address of the server? If two server names resolve to the same address, is that incorrect or an advantage for your application?