Tests of hash collision on ASCII characters - c++

I'm currently in the process of building a caching system for some of our back end systems, which means that I'll need a hash table of some sort, to represent cached entities. In this context, I was wondering if anyone knows about any tests showing different algorithms and the minimum ASCII string length necessary to provoke a collision? Ie. what's a safe length (ASCII characters) to hash with a range of functions?
The reason is of course that I want the best trade off between size (the cache is going to be representing several million entities on relatively small servers), performance and collision safety.
Thanks in advance,
Nick

If you want a strong hash, I'd suggest something like the Jenkins Hash. This should be less likely to generate clashes. In terms of algorithms, what you're looking for is an avalanche test
Bob Jenkins' Site has a whole lot of handy information on this sort of thing.
As for the size of the hash table, I believe Knuth recommends having it large enough so that with a perfect hash, 2/3 of the table would be full, while Jenkins recommends the nearest greater power of two
Hope this helps!

Related

Creating a GUID using My Own Algorithm

I need to create a document GUID for an app that will insert Xmp data into a file. Written in C++, the app will be cross platform so I want to avoid using libraries that might create compatibility problems.
What I'm thinking of doing is to create a string by concatenating the name of my app plus the number of seconds since epoch plus the appropriate number of characters from a SHA256 hash calculated for the full file (I will be doing that anyway for another reason).
Would the result produced be enough to guarantee that collision would be "sufficiently improbable in practice"?
Unless you are expecting to generate insanely high numbers of documents, using SHA256 all by itself is overwhelmingly likely to avoid any collisions. If your app generates fewer than 10^29 documents over its lifetime then the chance of any collisions is less than 1 in 10^18 (assuming SHA256 is well-designed).
This means that roughly everyone in the world (7e9) could use your app to generate 1e9 documents per second for 1,000 years before there was a 0.0000000000000001% chance of a collision.
You can add more bits (like name and time) if you like but that will almost certainly be overkill for the purpose of avoiding collisions.

How to compute good preset dictionary for deflate compression

I have an opportunity to preset dictionary for deflate compression. It makes sense in my case, because data to be compressed is relatively small 1kb-3kb and I have a large sample of representative examples. Data to be compressed consists of arbitrary sequence of bytes, so tokenization etc. is not a good way to go. Also, data shows a lot of repetition (between data examples), so good dictionary could potentially give very good results.
The question is how calculate good dictionary? Is there an algorithm which calculates optimal dictionary (given sample data)?
I started looking at prefix trees, but it is not clear how to use them in this context.
Best regards,
Jarek
I am not aware of an algorithm to generate an optimal or even a good dictionary. This is generally done by hand. I think that a suffix tree would be a good approach to finding common strings for a dictionary, but I have never tried it.
The first thing to try is to simply concatenate 32K worth of your 1-3K examples and see how much gain that provides over no dictionary. Then you mess with it from there, changing the ordering of examples or pulling out repeated pieces in the examples to the end of the dictionary.
Note that the most common strings should be put at the end, since shorter distances take fewer bits.
I don't know how good this is, but it's a dictionary creator: https://github.com/vkrasnov/dictator

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1)Is replacing the words with their rank an effective way to compress the file? (Got answer to this question. This is a bad idea.)
Also, I have come up with a new compression algorithm. I read some existing compression models that are used widely and I found out they use some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use all these concepts and is a rather simple set of rules that need to be followed while compressing and decompressing. (2)My question is am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3)Furthermore, if I manage to successfully compress a string can I extend my algorithm to other content like videos, images etc.?
(I understand that the third question is difficult to answer without knowledge about the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent I feel ashamed about sharing it. Please feel free to ignore the third question if you have to)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean like having a ranking table of words sorted by frequency and assign smaller "symbols" to those words that are repeated the most, therefore reducing the amount of information that needs to be transmitted?
That's basically how Huffman Coding works, the problem with compression is that you always hit a limit somewhere along the road, of course, if the set of things that you try to compress follows a particular pattern/distribution then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no (and I believe that there can't be) "best" compression technique.
Huffman Coding uses frequency on letters. You can do the same with words or with letter frequency in more dimensions, i.e. combinations of letters and their frequency.

boost::uuid / unique across different databases

I want to generate a uuid which should be used as unique identifier across different systems/databases. I read the examples but i don't understand how i can be sure that the generated id's are unique over different systems and databases.
I hope you can help me out here.
Best regards
The idea behind a UUID is -- depending on how they are generated -- that there are so many values representable with 122-bits* that the chance of accidental collisions -- again, depending on how they are generated -- is very, very, very, very, very, very, very, very, small.
An excerpt from Wikipedia for the UUID version 4 (Leach-Salz Random):
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
... however, these probabilities only hold when the UUIDs are generated using sufficient entropy.
Of course, there are other UUID generation schemes and "well-known GUIDs", not all of which may be suitable for "globally-unique" usage. (Additionally, non-specialized UUIDs tend not to work well for primary/clustered keys due to fragmentation on insert: SQL Server has NEWSEQUENTIALID to help with that issue.)
Happy coding.
*There is a maximum of 128-bits in a UUID, however some UUID versions use some of the bits internally. I do not know what boost uses but I suspect it is also UUIDv4.

What is the best autocomplete/suggest algorithm,datastructure [C++/C]

We see Google, Firefox some AJAX pages show up a list of probable items while user types characters.
Can someone give a good algorithm, data structure for implementing autocomplete?
A trie is a data structure that can be used to quickly find words that match a prefix.
Edit: Here's an example showing how to use one to implement autocomplete http://rmandvikar.blogspot.com/2008/10/trie-examples.html
Here's a comparison of 3 different auto-complete implementations (though it's in Java not C++).
* In-Memory Trie
* In-Memory Relational Database
* Java Set
When looking up keys, the trie is marginally faster than the Set implementation. Both the trie and the set are a good bit faster than the relational database solution.
The setup cost of the Set is lower than the Trie or DB solution. You'd have to decide whether you'd be constructing new "wordsets" frequently or whether lookup speed is the higher priority.
These results are in Java, your mileage may vary with a C++ solution.
For large datasets, a good candidate for the backend would be Ternary search trees. They combine the best of two worlds: the low space overhead of binary search trees and the character-based time efficiency of digital search tries.
See in Dr. Dobbs Journal: http://www.ddj.com/windows/184410528
The goal is the fast retrieval of a finite resultset as the user types in. Lets first consider that to search "computer science" you can start typing from "computer" or "science" but not "omputer". So, given a phrase, generate the sub-phrases starting with a word. Now for each of the phrases, feed them into the TST (ternary search tree). Each node in the TST will represent a prefix of a phrase that has been typed so far. We will store the best 10 (say) results for that prefix in that node. If there are many more candidates than the finite amount of results (10 here) for a node, there should be a ranking function to resolve competition between two results.
The tree can be built once every few hours, depending on the dynamism of the data. If the data is in real time, then I guess some other algorithm will give a better balance. In this case, the absolute requirement is the lightning-fast retrieval of results for every keystroke typed which it does very well.
More complications will arise if the suggestion of spelling corrections is involved. In that case, the edit distance algorithms will have to be considered as well.
For small datasets like a list of countries, a simple implementation of Trie will do. If you are going to implement such an autocomplete drop-down in a web application, the autocomplete widget of YUI3 will do everything for you after you provide the data in a list. If you use YUI3 as just the frontend for an autocomplete backed by large data, make the TST based web services in C++, and then use script node data source of the autocomplete widget to fetch data from the web service instead of a simple list.
Segment trees can be used for efficiently implementing auto complete
If you want to suggest the most popular completions, a "Suggest Tree" may be a good choice:
Suggest Tree
For a simple solution : you generate a 'candidate' with a minimum edit (Levenshtein) distance (1 or 2) then you test the existence of the candidate with a hash container (set will suffice for a simple soltion, then use unordered_set from the tr1 or boost).
Example:
You wrote carr and you want car.
arr is generated by 1 deletion. Is arr in your unordered_set ? No. crr is generated by 1 deletion. Is crr in your unordered_set ? No. car is generated by 1 deletion. Is car in your unordered_set ? Yes, you win.
Of course there's insertion, deletion, transposition etc...
You see that your algorithm for generating candidates is really where you’re wasting time, especially if you have a very little unordered_set.