boost::uuid / unique across different databases - C++

I want to generate a UUID which should be used as a unique identifier across different systems/databases. I read the examples, but I don't understand how I can be sure that the generated IDs are unique across different systems and databases.
I hope you can help me out here.
Best regards

The idea behind a UUID is that, depending on how it is generated, there are so many values representable in 122 bits* that the chance of an accidental collision is very, very, very small.
An excerpt from Wikipedia on UUID version 4 (Leach-Salz, random):
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
... however, these probabilities only hold when the UUIDs are generated using sufficient entropy.
Of course, there are other UUID generation schemes and "well-known GUIDs", not all of which may be suitable for "globally-unique" usage. (Additionally, non-specialized UUIDs tend not to work well for primary/clustered keys due to fragmentation on insert: SQL Server has NEWSEQUENTIALID to help with that issue.)
Happy coding.
*There is a maximum of 128 bits in a UUID; however, some UUID versions use some of those bits internally (version and variant fields, timestamps, and so on). I do not know which scheme Boost uses by default, but I suspect it is also UUIDv4.
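For what it's worth, Boost's random_generator does produce random (version 4) UUIDs, so the probabilities above apply directly. A minimal example:

    #include <boost/uuid/uuid.hpp>
    #include <boost/uuid/uuid_generators.hpp>
    #include <boost/uuid/uuid_io.hpp>
    #include <iostream>

    int main() {
        // random_generator produces version-4 (random) UUIDs.
        boost::uuids::random_generator gen;
        boost::uuids::uuid id = gen();
        std::cout << boost::uuids::to_string(id) << '\n';  // canonical 36-char form
        return 0;
    }

Each system/database can generate IDs independently with no coordination; uniqueness rests purely on those odds.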

Related

Train doc2vec for company name similarity

I am trying to deduplicate a huge list of companies (40M+) using name similarity. I have 500K company name pairs labelled same/not-same (like I.B.M. = International Business Machines). A logistic-regression model built on the vector difference of name pairs has a great F-score (0.98), but inference (finding the most similar names) is too slow (almost 2 seconds per name).
Is it possible to train a doc2vec model using the name-similarity pairs (positive and negative), so that similar names get similar vectors and I can use fast vector-similarity algorithms like Annoy?
Searching for the top-N nearest-neighbors in high-dimensional spaces is hard. To get a perfectly accurate top-N typically requires an exhaustive search, which is probably the reason for your disappointing performance.
When some indexing can be applied, as with the ANNOY library, some extra indexing time and index-storage is required, and accuracy is sacrificed because some of the true top-N neighbors can be missed.
You haven't mentioned how your existing vectors are created. You don't need to adopt a new vector-creation method (like doc2vec) to use indexing; you can apply indexing libraries to your existing vectors.
If your existing vectors are sparse (as for example if they are big bag-of-character-n-grams representations, with many dimensions but most 0.0), you might want to look into Facebook's PySparNN library.
If they're dense, then in addition to ANNOY, which you mentioned, Facebook's FAISS is worth considering.
But also, even an exhaustive search for neighbors is highly parallelizable: split the data into M shards on M different systems, and finding the top-N on each shard takes close to 1/Mth the time of the same operation on the full index, while merging the M per-shard top-N lists is relatively quick. So if finding the most-similar names is your key bottleneck, and you need the top-N most-similar in, say, 100ms, throw 20 machines at 20 shards of the problem.
(Similarly, the top-N results for all names may be worth batch-calculating. If you're using cloud resources, rent 500 machines to do 40 million 2-second operations, and you'll be done in under two days.)
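To make the shard-and-merge idea concrete, here is a rough sketch of my own (assuming dense float vectors and cosine similarity; the function names are made up, not from any library): each shard does an exhaustive top-N search, and the small per-shard lists are merged afterwards.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    using Vec = std::vector<float>;
    using Hit = std::pair<float, std::size_t>;  // (similarity, global document id)

    float cosine(const Vec& a, const Vec& b) {
        float dot = 0.f, na = 0.f, nb = 0.f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
    }

    // Exhaustive top-N within one shard; id_offset maps local indices to global ids.
    std::vector<Hit> shard_top_n(const Vec& query, const std::vector<Vec>& shard,
                                 std::size_t n, std::size_t id_offset) {
        std::vector<Hit> scored;
        scored.reserve(shard.size());
        for (std::size_t i = 0; i < shard.size(); ++i)
            scored.emplace_back(cosine(query, shard[i]), id_offset + i);
        const std::size_t keep = std::min(n, scored.size());
        std::partial_sort(scored.begin(), scored.begin() + keep, scored.end(),
                          std::greater<Hit>());
        scored.resize(keep);
        return scored;
    }

    // Merge the per-shard lists (gathered from the M machines) into a global top-N.
    std::vector<Hit> merge_top_n(std::vector<std::vector<Hit>> per_shard, std::size_t n) {
        std::vector<Hit> all;
        for (auto& s : per_shard) all.insert(all.end(), s.begin(), s.end());
        const std::size_t keep = std::min(n, all.size());
        std::partial_sort(all.begin(), all.begin() + keep, all.end(), std::greater<Hit>());
        all.resize(keep);
        return all;
    }

Each machine runs shard_top_n over its own shard; only the tiny per-shard result lists travel over the network to be merged.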

Creating a GUID using My Own Algorithm

I need to create a document GUID for an app that will insert Xmp data into a file. Written in C++, the app will be cross platform so I want to avoid using libraries that might create compatibility problems.
What I'm thinking of doing is to create a string by concatenating the name of my app plus the number of seconds since epoch plus the appropriate number of characters from a SHA256 hash calculated for the full file (I will be doing that anyway for another reason).
Would the result produced be enough to guarantee that collision would be "sufficiently improbable in practice"?
Unless you are expecting to generate insanely high numbers of documents, using SHA256 all by itself is overwhelmingly likely to avoid any collisions. If your app generates fewer than 10^29 documents over its lifetime then the chance of any collisions is less than 1 in 10^18 (assuming SHA256 is well-designed).
This means that roughly everyone in the world (7e9) could use your app to generate 1e9 documents per second for 1,000 years before there was a 0.0000000000000001% chance of a collision.
You can add more bits (like name and time) if you like but that will almost certainly be overkill for the purpose of avoiding collisions.
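For illustration only, a sketch of the scheme you describe; sha256_hex here is a hypothetical stand-in for the file hash you said you will be computing anyway:

    #include <chrono>
    #include <string>

    // Assumed to exist elsewhere in your app: hex digest of the file's SHA-256.
    std::string sha256_hex(const std::string& file_path);

    std::string make_document_guid(const std::string& app_name,
                                   const std::string& file_path) {
        using namespace std::chrono;
        const auto secs = duration_cast<seconds>(
            system_clock::now().time_since_epoch()).count();
        // App name + seconds since epoch + 32 hex chars (128 bits) of the file hash.
        // As noted above, the hash portion alone already makes collisions
        // astronomically unlikely.
        return app_name + "-" + std::to_string(secs) + "-" +
               sha256_hex(file_path).substr(0, 32);
    }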

Which error correction code should I use for GF(32)?

I searched for comparisons between Reed-Solomon, Turbo and LDPC codes, but they all seem to focus on efficiency. I'm more interested in the commercial licensing of available libraries, ease of use, and GF(32), i.e. a code with only 32 symbols (the available Reed-Solomon implementations work for GF(256) and above).
Efficiency (speed) is not relevant. The messages consist of 24 symbols.
Can you provide a quick comparison on the most well-known Reed-Solomon, Turbo and LDPC codes for this case in which speed is not relevant?
Thanks.
Basically, Reed-Solomon is optimal, which means you can correct exactly up to (n-k)/2 symbol errors (k = length of your message, n = length of message + EC symbols), while Turbo codes and LDPC are near-optimal, meaning you can correct up to (n-k-e)/2 errors, where e is a small constant; in ideal cases you get very close to (n-k)/2 (that's why they're called near-optimal: they approach the Shannon limit). Turbo codes and LDPC have similar error-correction power, and there are lots of variants depending on your needs (you can find plenty of literature reviews and presentations).
What the different variants of LDPC or Turbo codes do is optimize the algorithm to fit certain characteristics of the erasure channel (i.e., the data) so as to reduce the constant e (and thus approach the Shannon limit). So the best variant in your case depends on the details of your erasure channel. Also, to my knowledge, they are all in the public domain now (maybe not yet the Turbo code patents, but if not, they will be soon).
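As a concrete example of the formula: in GF(32) a Reed-Solomon codeword can be at most 2^5 - 1 = 31 symbols long, so with your 24-symbol messages a (30, 24) code would add 6 parity symbols and correct exactly up to (30 - 24)/2 = 3 symbol errors (or up to 6 erasures at known positions).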

Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique to detect similar documents that consists of first computing their minhash signatures (from their shingles, or n-grams), and then using an LSH-based algorithm to cluster them efficiently (i.e. avoiding the quadratic complexity that a naive pairwise exhaustive search would entail).
What I'm trying to do is to bridge three different algorithms, which I think are all related to this minhash + LSH framework, but in non-obvious ways:
(1) The algorithm sketched in Section 3.4.3 of Chapter 3 of the book Mining of Massive Datasets (Rajaraman and Ullman), which seems to be the canonical description of minhashing
(2) Ryan Moulton's Simple Simhashing article
(3) Charikar's so-called SimHash algorithm, described in this article
I find this confusing because my assumption is that although (2) uses the term "simhashing", it's actually doing minhashing in a way similar to (1), but with the crucial difference that a cluster can only be represented by a single signature (even though multiple hash functions might be involved), whereas two documents have more chances of being found similar with (1), because the signatures can collide in multiple "bands". (3) seems like a different beast altogether, in that the signatures are compared in terms of their Hamming distance, and the LSH technique involves multiple sortings of the signatures instead of banding them. I also saw (somewhere else) that this last technique can incorporate a notion of weighting, which can be used to put more emphasis on certain document parts, and which seems to be lacking in (1) and (2).
So at last, my two questions:
(a) Is there a (satisfying) way in which to bridge those three algorithms?
(b) Is there a way to import this notion of weighting from (3) into (1)?
Article 2 is actually discussing minhash, but has erroneously called it simhash. That's probably why it is now deleted (it's archived here). Also, confusingly, it talks about "concatenating" multiple minhashes, which as you rightly observe reduces the chance of finding two similar documents. It seems to prescribe an approach that yields only a single "band", which will give very poor results.
Can the algorithms be bridged/combined?
Probably not. To see why, you should understand what the properties of the different hashes are, and how they are used to avoid n² comparisons between documents.
Properties of minhash and simhash:
Essentially, minhash generates multiple hashes per document, and when there are two similar documents it is likely that a subset of these hashes will be identical. Simhash generates a single hash per document, and where there are two similar documents it is likely that their simhashes will be similar (having a small hamming distance).
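To make the minhash side concrete, here's a minimal sketch of my own (not from any particular library) of building a k-slot signature from a document's shingles; per-slot seed mixing stands in for k independent hash functions:

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <limits>
    #include <string>
    #include <vector>

    std::vector<std::uint64_t> minhash_signature(
            const std::vector<std::string>& shingles, std::size_t k) {
        std::vector<std::uint64_t> sig(k, std::numeric_limits<std::uint64_t>::max());
        std::hash<std::string> base_hash;
        for (const std::string& s : shingles) {
            const std::uint64_t base = base_hash(s);
            for (std::size_t i = 0; i < k; ++i) {
                // Mix the base hash with a per-slot seed to simulate hash function i.
                std::uint64_t v = base ^ (0x9E3779B97F4A7C15ULL * (i + 1));
                v ^= v >> 33;
                v *= 0xFF51AFD7ED558CCDULL;
                v ^= v >> 33;
                if (v < sig[i]) sig[i] = v;  // keep the minimum seen for this slot
            }
        }
        return sig;
    }

Two documents' signatures then agree in roughly a fraction of slots equal to their Jaccard similarity, which is what the banding in (1) exploits.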
How they solve the n² problem:
With minhash you index all hashes to the documents that contain them; so if you are storing 100 hashes per document, then for each new incoming document you need to look up each of its 100 hashes in your index and find all documents that share at least (e.g.) 50% of them. This could mean building large temporary tallies, as hundreds of thousands of documents could share at least one hash.
With simhash there is a clever technique of storing each document's hash in multiple permutations in multiple sorted tables, such that any similar hash up to a certain hamming distance (3, 4, 5, possibly as high as 6 or 7 depending on hash size and table structure) is guaranteed to be found nearby in at least one of these tables, differing only in the low order bits. This makes searching for similar documents efficient, but restricts you to only finding very similar documents. See this simhash tutorial.
Because minhash and simhash are used so differently, I don't see a way to bridge/combine them. You could theoretically have a minhash that generates 1-bit hashes and concatenate them into something like a simhash, but I don't see any benefit in this.
Can weighting be used in minhash?
Yes. Think of the minhashes as slots: if you store 100 minhashes per document, that's 100 slots you can fill. These slots don't all have to be filled from the same selection of document features. You could devote a certain number of slots to hashes calculated just from specific features (care should be taken, though, to always select features in such a way that the same features will be selected in similar documents).
So for example if you were dealing with journal articles, you could assign some slots for minhashes of document title, some for minhashes of article abstract, and the remainder of the slots for minhashes of the document body. You can even set aside some individual slots for direct hashes of things such as author's surname, without bothering about the minhash algorithm.
It's not quite as elegant as how simhash does it, where in effect you're just creating a bitwise weighted average of all the individual feature hashes, then rounding each bit up to 1 or down to 0. But it's quite workable.
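For comparison, the simhash scheme just described fits in a few lines; this sketch of mine (the caller supplies whatever features and weights it likes) shows the bitwise weighted average and the rounding of each bit:

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    std::uint64_t simhash(const std::vector<std::pair<std::string, double>>& weighted_features) {
        double totals[64] = {0.0};
        std::hash<std::string> h;
        for (const auto& fw : weighted_features) {
            const std::uint64_t bits = h(fw.first);
            for (int b = 0; b < 64; ++b)
                totals[b] += ((bits >> b) & 1u) ? fw.second : -fw.second;  // weighted vote per bit
        }
        std::uint64_t result = 0;
        for (int b = 0; b < 64; ++b)
            if (totals[b] > 0) result |= (1ULL << b);  // round each bit up or down
        return result;
    }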

Tests of hash collision on ASCII characters

I'm currently in the process of building a caching system for some of our back-end systems, which means that I'll need a hash table of some sort to represent cached entities. In this context, I was wondering if anyone knows about any tests showing, for different algorithms, the minimum ASCII string length necessary to provoke a collision? I.e., what's a safe length (in ASCII characters) to hash with a range of functions?
The reason is, of course, that I want the best trade-off between size (the cache is going to represent several million entities on relatively small servers), performance and collision safety.
Thanks in advance,
Nick
If you want a strong hash, I'd suggest something like the Jenkins hash. This should be less likely to generate clashes. In terms of evaluating algorithms, what you're looking for is an avalanche test.
Bob Jenkins' Site has a whole lot of handy information on this sort of thing.
As for the size of the hash table, I believe Knuth recommends making it large enough that, with a perfect hash, the table would be about 2/3 full, while Jenkins recommends rounding up to the nearest greater power of two.
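A small illustration of combining those two sizing rules (my own sketch, not taken from Knuth or Jenkins directly): grow the expected entry count by 3/2 so the table ends up about 2/3 full, then round up to a power of two.

    #include <cstdint>
    #include <iostream>

    std::uint64_t next_power_of_two(std::uint64_t v) {
        std::uint64_t p = 1;
        while (p < v) p <<= 1;
        return p;
    }

    std::uint64_t suggested_table_size(std::uint64_t expected_entries) {
        const std::uint64_t target = expected_entries * 3 / 2;  // ~2/3 load factor
        return next_power_of_two(target);
    }

    int main() {
        // "Several million entities": 5,000,000 entries -> 8,388,608 buckets (2^23).
        std::cout << suggested_table_size(5000000) << '\n';
        return 0;
    }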
Hope this helps!