Creating a GUID using my own algorithm - C++

I need to create a document GUID for an app that will insert XMP data into a file. Written in C++, the app will be cross-platform, so I want to avoid libraries that might create compatibility problems.
What I'm thinking of doing is creating a string by concatenating the name of my app, the number of seconds since the epoch, and an appropriate number of characters from a SHA-256 hash calculated over the full file (which I will be computing anyway for another reason).
Would the result produced be enough to guarantee that collision would be "sufficiently improbable in practice"?

Unless you are expecting to generate insanely high numbers of documents, using SHA256 all by itself is overwhelmingly likely to avoid any collisions. If your app generates fewer than 10^29 documents over its lifetime then the chance of any collisions is less than 1 in 10^18 (assuming SHA256 is well-designed).
This means that roughly everyone in the world (7e9) could use your app to generate 1e9 documents per second for 1,000 years before there was a 0.0000000000000001% chance of a collision.
You can add more bits (like name and time) if you like but that will almost certainly be overkill for the purpose of avoiding collisions.
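As a sketch of the construction described in the question: the SHA-256 digest is assumed to come from whatever routine already hashes the full file, so sha256_hex_of_file below is a hypothetical stand-in for that existing code.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper assumed to exist elsewhere in the app:
// returns the hex-encoded SHA-256 of the file's contents.
std::string sha256_hex_of_file(const std::string& path);

// Build a document GUID as:
// <app name> + <seconds since epoch> + <first N hex chars of SHA-256>.
std::string make_document_guid(const std::string& app_name,
                               const std::string& file_path,
                               std::size_t hash_chars = 32)
{
    using namespace std::chrono;
    const auto secs = duration_cast<seconds>(
        system_clock::now().time_since_epoch()).count();

    const std::string digest = sha256_hex_of_file(file_path);
    return app_name + "-" + std::to_string(secs) + "-" + digest.substr(0, hash_chars);
}
```

Keeping 32 hex characters preserves 128 bits of the digest, which is already far more than enough for the collision bounds discussed above; the name and timestamp are extra insurance rather than a necessity.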

Related

Use Cityhash for file integrity

The problem is that I want to use a hash function to check file integrity, and cryptographic strength is unnecessary here, so I think the non-cryptographic hash CityHash may be a good choice, since what I want is just speed and few collisions.
However, the source only provides the CityHash function with a fixed-length string as input and a hash code as output. How can I use the function to hash a whole file?
Can I divide the file into several chunks, calculate each chunk's hash code, and XOR the hash codes together? Will that affect collision resistance or speed? Do you have any other good ideas?
This is not an appropriate application for CityHash, and it will exhibit poor collision resistance when used this way.
If you want a quick file integrity checksum, use a CRC family function, like CRC16. If you want something more extensive, the speed of cryptographic hashes such as SHA1 should be more than sufficient. (Almost any modern CPU can hash data basically as fast as it can read it from memory.)
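To make the "quick checksum" suggestion concrete, a file can be streamed through zlib's crc32() in small buffers without ever loading it all into memory. This is just a sketch assuming zlib is available and linked (-lz):

```cpp
#include <cstdint>
#include <fstream>
#include <vector>
#include <zlib.h>   // crc32(); link with -lz

// Stream a file through zlib's CRC-32 in fixed-size buffers.
uint32_t crc32_of_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> buf(1 << 16);   // 64 KB read buffer
    uLong crc = crc32(0L, Z_NULL, 0);          // initial CRC value

    while (in) {
        in.read(reinterpret_cast<char*>(buf.data()),
                static_cast<std::streamsize>(buf.size()));
        const std::streamsize n = in.gcount();
        if (n > 0)
            crc = crc32(crc, buf.data(), static_cast<uInt>(n));
    }
    return static_cast<uint32_t>(crc);
}
```

The same incremental pattern works for SHA-1 or any other streaming hash, which avoids the chunk-and-XOR scheme entirely.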

Fastest and LightWeight Hashing Algorithm for Large Files & 512 KB Chunks [C,Linux,MAC,Windows]

I'm working on a project which involves computing hashes for files. The project is like a file backup service, so when a file gets uploaded from client to server, I need to check whether that file is already available on the server. I generate a CRC-32 hash for the file and then send the hash to the server to check if it's already there.
If the file is not on the server, I send the file as 512 KB chunks (for dedupe) and have to calculate a hash for each 512 KB chunk. The files may sometimes be a few GB in size, and multiple clients will connect to the server. So I really need a fast and lightweight hashing algorithm for files. Any ideas?
P.S.: I have already seen some hashing-algorithm questions on Stack Overflow, but the answers aren't quite the comparison of hashing algorithms required for exactly this kind of task. I bet this will be really useful for a bunch of people.
Actually, CRC32 has neither the best speed nor the best distribution.
This is to be expected: CRC32 is pretty old by today's standards, created in an era when CPUs were neither 32/64 bits wide nor out-of-order, and when distribution properties mattered less than error detection. All of these requirements have changed since then.
To evaluate the speed and distribution properties of hash algorithms, Austin Appleby created the excellent SMHasher package.
A short summary of results is presented here.
I would advise selecting an algorithm with a Q.Score of 10 (perfect distribution).
You say you are using CRC-32 but want a faster hash.
CRC-32 is very basic and pretty fast.
I would think the I/O time would be much longer than the hash time.
You also want a hash that will not have collisions; that is, two different files or 512 KB chunks should not get the same hash value.
You could look at any of the cryptographic hashes such as MD5 (do not use it for secure applications) or SHA1.
If you are only using CRC-32 to check whether a file is a duplicate, you are going to get false duplicates, because different files can have the same CRC-32. You had better use SHA-1; CRC-32 and MD5 are both too weak.
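As a sketch of the per-chunk hashing described above, here is one way to hash a file in 512 KB chunks. SHA-1 via OpenSSL is used purely as an example (any of the fast hashes from SMHasher would slot into the same loop), and it assumes OpenSSL is available and linked with -lcrypto:

```cpp
#include <array>
#include <fstream>
#include <vector>
#include <openssl/sha.h>   // SHA1(); link with -lcrypto

constexpr std::size_t kChunkSize = 512 * 1024;   // 512 KB dedupe chunks

// Hash a file in 512 KB chunks; returns one SHA-1 digest per chunk.
std::vector<std::array<unsigned char, SHA_DIGEST_LENGTH>>
hash_file_chunks(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> buf(kChunkSize);
    std::vector<std::array<unsigned char, SHA_DIGEST_LENGTH>> digests;

    while (in) {
        in.read(reinterpret_cast<char*>(buf.data()),
                static_cast<std::streamsize>(buf.size()));
        const std::streamsize n = in.gcount();
        if (n <= 0) break;

        std::array<unsigned char, SHA_DIGEST_LENGTH> md{};
        SHA1(buf.data(), static_cast<size_t>(n), md.data());  // one digest per chunk
        digests.push_back(md);
    }
    return digests;
}
```

The per-chunk digests can then be sent to the server for the dedupe lookup; whichever hash you pick, the I/O cost of reading the chunks will usually dominate the hashing cost.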

boost::uuid / unique across different databases

I want to generate a UUID which should be usable as a unique identifier across different systems/databases. I read the examples, but I don't understand how I can be sure that the generated IDs are unique across different systems and databases.
I hope you can help me out here.
Best regards
The idea behind a UUID is -- depending on how they are generated -- that there are so many values representable with 122 bits* that the chance of accidental collisions -- again, depending on how they are generated -- is very, very, very small.
An excerpt from Wikipedia for the UUID version 4 (Leach-Salz Random):
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
... however, these probabilities only hold when the UUIDs are generated using sufficient entropy.
Of course, there are other UUID generation schemes and "well-known GUIDs", not all of which may be suitable for "globally-unique" usage. (Additionally, non-specialized UUIDs tend not to work well for primary/clustered keys due to fragmentation on insert: SQL Server has NEWSEQUENTIALID to help with that issue.)
Happy coding.
*There is a maximum of 128 bits in a UUID; however, some UUID versions use some of the bits internally. I do not know what Boost uses by default, but I suspect it is also UUIDv4.
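For reference, generating and printing a random (version 4) UUID with Boost looks roughly like this; it's a minimal sketch, and random_generator is the Boost.UUID generator that produces the 122 random bits discussed above:

```cpp
#include <iostream>
#include <boost/uuid/uuid.hpp>             // boost::uuids::uuid
#include <boost/uuid/uuid_generators.hpp>  // random_generator
#include <boost/uuid/uuid_io.hpp>          // operator<< for uuids

int main()
{
    boost::uuids::random_generator gen;    // version 4 (random) UUIDs
    const boost::uuids::uuid id = gen();

    std::cout << id << '\n';                                        // e.g. "9c0f...-...-..."
    std::cout << "version: " << static_cast<int>(id.version()) << '\n';
    return 0;
}
```

Because each ID is drawn independently from that huge space, nothing needs to be coordinated between your systems or databases; the uniqueness is probabilistic, not enforced.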

Tests of hash collision on ASCII characters

I'm currently in the process of building a caching system for some of our back-end systems, which means that I'll need a hash table of some sort to represent cached entities. In this context, I was wondering if anyone knows of any tests showing, for different algorithms, the minimum ASCII string length necessary to provoke a collision? I.e., what's a safe length (in ASCII characters) to hash with a range of functions?
The reason is of course that I want the best trade off between size (the cache is going to be representing several million entities on relatively small servers), performance and collision safety.
Thanks in advance,
Nick
If you want a strong hash, I'd suggest something like the Jenkins hash. This should be less likely to generate clashes. In terms of evaluating algorithms, what you're looking for is an avalanche test.
Bob Jenkins' Site has a whole lot of handy information on this sort of thing.
As for the size of the hash table, I believe Knuth recommends making it large enough that, with a perfect hash, 2/3 of the table would be full, while Jenkins recommends rounding up to the nearest greater power of two.
Hope this helps!
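Putting those two rules of thumb together, a table-size helper might look like the sketch below: it sizes for a load factor of roughly 2/3 and then rounds up to the next power of two (the function name and numbers are just illustrative).

```cpp
#include <cstddef>

// Pick a hash-table size for `expected_entries` cached items:
// aim for roughly 2/3 occupancy (Knuth), then round up to the
// next power of two (Jenkins) so the index can be masked cheaply.
std::size_t pick_table_size(std::size_t expected_entries)
{
    std::size_t target = expected_entries * 3 / 2;   // ~2/3 full at capacity
    std::size_t size = 1;
    while (size < target)
        size <<= 1;
    return size;
}

// Example: several million entities -> pick_table_size(5'000'000) == 8'388'608 slots.
```

With a power-of-two table, the bucket index is simply `hash & (size - 1)`, which is part of why Jenkins recommends it.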

Opinions on my data storage problem (database/homebrew solution)

I have very simply structured data which is currently stored in a home-brew file format, but I am wondering whether we should migrate to something more modern. The data is simply a table of doubles, indexed by a double column. The things I need to perform are:
Iterating through the table.
Insertion and deletion of arbitrary records.
Selecting a given number of rows before and after a given key value (where the key might not be in the database).
The requirements are:
The storage must be file-based without a server.
It should not be necessary to read the whole file into memory.
The resulting file should be portable between different architectures (wrt endian-ness...)
Must be a very stable project (the data is highly critical).
Must run on Solaris/SPARC and preferably also on Linux/x64.
Access times should be as fast as possible.
Must be available as a C++ library. Bonus points for Fortran and Python bindings :)
Optional higher precision number representation than double precision would be a bonus.
Relatively compact storage size would also be a bonus.
From my limited experience, SQLite would be an interesting choice, or perhaps MySQL in a non-server (embedded) mode if SQLite is not fast enough. But perhaps a full-fledged SQL database is overkill?
What do you suggest?
SQLite meets nearly all of your requirements, and it's not that hard to use. Give it a try!
It's file-based, and the entire database is a single file.
It does not need to read the entire file into memory. Database size might be limited; you should check here if the limits will be a problem in your situation.
The format is cross-platform:
SQLite databases are portable across 32-bit and 64-bit machines and between big-endian and little-endian architectures.
It's been around for a long time and is used in many places, and is generally considered mature and stable.
It's very portable and runs on Solaris/SPARC and Linux/x64.
It's faster than MySQL (take that link with a grain of salt, though) or other such database servers, because only one client needs to be taken into account.
There is a C++ API and a Python binding and a Fortran wrapper.
There is no arbitrary-precision column type, but NUMERIC will be silently converted to text if it cannot be exactly represented:
For conversions between TEXT and REAL storage classes, SQLite considers the conversion to be lossless and reversible if the first 15 significant decimal digits of the number are preserved. If the lossless conversion of TEXT to INTEGER or REAL is not possible then the value is stored using the TEXT storage class.
As for compact storage of the database, I'm not sure. But I've never heard any claims that SQLite is particularly wasteful.
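To give a feel for the fit, here is a rough sketch of the table of doubles and the "N rows at or after a given key" query through the SQLite C API. The database, table, and column names are made up for illustration, and error handling is mostly omitted:

```cpp
#include <cstdio>
#include <sqlite3.h>   // link with -lsqlite3

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open("series.db", &db);

    // One table of doubles, indexed by a double key.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS series("
        "  k REAL PRIMARY KEY,"
        "  v REAL NOT NULL)",
        nullptr, nullptr, nullptr);

    // Insert (or replace) a record.
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO series(k, v) VALUES(?, ?)",
                       -1, &ins, nullptr);
    sqlite3_bind_double(ins, 1, 1.25);
    sqlite3_bind_double(ins, 2, 42.0);
    sqlite3_step(ins);
    sqlite3_finalize(ins);

    // N rows at or after a given key (the key need not exist in the table);
    // the "before" case is the same query with k < ? ORDER BY k DESC.
    sqlite3_stmt* sel = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT k, v FROM series WHERE k >= ? ORDER BY k LIMIT ?",
        -1, &sel, nullptr);
    sqlite3_bind_double(sel, 1, 1.0);
    sqlite3_bind_int(sel, 2, 5);
    while (sqlite3_step(sel) == SQLITE_ROW)
        std::printf("%f -> %f\n",
                    sqlite3_column_double(sel, 0),
                    sqlite3_column_double(sel, 1));
    sqlite3_finalize(sel);

    sqlite3_close(db);
    return 0;
}
```

Because k is the (indexed) primary key, both the "before" and "after" range queries are satisfied by an index scan rather than a full table scan, so SQLite never needs to pull the whole file into memory.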