The problem is that I want to use a hash function to check file integrity, and cryptographic strength is unnecessary here, so I think a non-cryptographic hash such as CityHash may be a good choice, since all I want is speed and few collisions.
However, the source only provides a CityHash function that takes a fixed-length string as input and returns a hash code. How can I use that function to hash a whole file?
Can I divide the file into several chunks, calculate each chunk's hash code, and XOR the hash codes together? Will that affect the collision rate or the speed? Do you have any other good ideas?
This is not an appropriate application for CityHash, and it will exhibit poor collision resistance when used this way: XOR is commutative and self-cancelling, so reordering the chunks, or including any pair of chunks whose hashes cancel out, leaves the combined value unchanged.
If you want a quick file integrity checksum, use a CRC family function, like CRC16. If you want something more extensive, the speed of cryptographic hashes such as SHA1 should be more than sufficient. (Almost any modern CPU can hash data basically as fast as it can read it from memory.)
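To illustrate the checksum option, here is a rough sketch of streaming a CRC-32 over a file with zlib's crc32() function; the file name and 64 KB buffer size are just placeholder choices for the example:

```cpp
// Sketch only: streaming CRC-32 of a file using zlib's crc32().
// Assumes zlib is available; "myfile.bin" and the 64 KB buffer size
// are arbitrary choices for this example.
#include <cstdio>
#include <vector>
#include <zlib.h>

int main() {
    std::FILE* f = std::fopen("myfile.bin", "rb");
    if (!f) return 1;

    std::vector<unsigned char> buf(64 * 1024);
    uLong crc = crc32(0L, Z_NULL, 0);   // zlib's recommended initial value
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        crc = crc32(crc, buf.data(), static_cast<uInt>(n));  // fold in each block

    std::fclose(f);
    std::printf("CRC-32: %08lx\n", crc);
    return 0;
}
```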
The default hash function is std::hash. I wonder whether there are better hash functions, for both integer keys and string keys, that would save computation time?
I tried Google's CityHash for both integer and string keys, but its performance was a little worse than std::hash.
The std::hash functions already perform well. Even so, I think you should try some open-source hash functions.
Check this out https://github.com/Cyan4973/xxHash. I quote from its description: "xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical on all platforms (little / big endian)."
Also see this thread from another question on this site: Fast Cross-Platform C/C++ Hashing Library. FNV, Jenkins, and MurmurHash are known to be fast.
You need to explain 'better' in what sense. The fastest hash function would be to simply use the value itself, but that is useless. A more specific answer depends on your memory constraints and on what probability of collision you are willing to accept.
Also note that the built-in hash functions are implemented differently for different types, so I expect the hash functions for int and string to already be optimised, in the general sense, for time complexity and collision probability.
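As a sketch of trying one of the open-source hashes mentioned above (xxHash), you can plug it into an unordered container as a custom hasher; this assumes the xxHash library is installed, and the struct name is just illustrative:

```cpp
// Sketch: using xxHash (XXH64) in place of std::hash for string keys
// in an unordered_map. Assumes the xxHash headers/library are available.
#include <string>
#include <unordered_map>
#include <xxhash.h>

struct XXStringHash {
    std::size_t operator()(const std::string& s) const {
        return static_cast<std::size_t>(XXH64(s.data(), s.size(), 0 /* seed */));
    }
};

int main() {
    std::unordered_map<std::string, int, XXStringHash> counts;
    counts["example"] = 42;   // keys are hashed with XXH64 instead of std::hash
    return counts.count("example") == 1 ? 0 : 1;
}
```

Whether this actually beats std::hash is worth benchmarking on your own key distribution; as the comment about CityHash above shows, the answer is not automatic.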
I need to create a document GUID for an app that will insert Xmp data into a file. Written in C++, the app will be cross platform so I want to avoid using libraries that might create compatibility problems.
What I'm thinking of doing is creating a string by concatenating the name of my app, the number of seconds since the epoch, and an appropriate number of characters from a SHA-256 hash calculated over the full file (which I will be computing anyway for another reason).
Would the result produced be enough to guarantee that collision would be "sufficiently improbable in practice"?
Unless you are expecting to generate insanely high numbers of documents, using SHA256 all by itself is overwhelmingly likely to avoid any collisions. If your app generates fewer than 10^29 documents over its lifetime then the chance of any collisions is less than 1 in 10^18 (assuming SHA256 is well-designed).
This means that roughly everyone in the world (7e9) could use your app to generate 1e9 documents per second for 1,000 years before there was a 0.0000000000000001% chance of a collision.
You can add more bits (like name and time) if you like but that will almost certainly be overkill for the purpose of avoiding collisions.
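For what it's worth, here is a rough sketch of the scheme described in the question (app name + seconds since the epoch + a truncated SHA-256 of the file), assuming OpenSSL 1.1+'s EVP API; the app name, file name, and 16-character truncation are all just example choices, and real code would add error handling:

```cpp
// Illustrative sketch only: build a GUID-ish string from an app name,
// the seconds since the epoch, and a truncated SHA-256 of a file's
// contents, using OpenSSL's EVP API. "MyApp", "document.xmp", and the
// 16-hex-digit truncation are arbitrary placeholders.
#include <cstdio>
#include <ctime>
#include <string>
#include <vector>
#include <openssl/evp.h>

static std::string sha256_hex(const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return "";

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);

    std::vector<unsigned char> buf(64 * 1024);
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        EVP_DigestUpdate(ctx, buf.data(), n);
    std::fclose(f);

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, digest, &len);
    EVP_MD_CTX_free(ctx);

    std::string hex;
    char byte[3];
    for (unsigned int i = 0; i < len; ++i) {
        std::snprintf(byte, sizeof(byte), "%02x", digest[i]);
        hex += byte;
    }
    return hex;
}

int main() {
    std::string guid = std::string("MyApp-")
                     + std::to_string(std::time(nullptr)) + "-"
                     + sha256_hex("document.xmp").substr(0, 16);  // first 64 bits of the digest
    std::printf("%s\n", guid.c_str());
    return 0;
}
```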
I'm working on a project which involves computing hashes for files. The project is like a file backup service, so when a file is uploaded from a client to the server, I need to check whether that file is already available on the server. I generate a CRC-32 hash for the file and then send the hash to the server to check whether it is already there.
If the file is not on the server, I send it as 512 KB chunks (for deduplication), and I have to calculate a hash for each 512 KB chunk. Files may be a few GB in size, and multiple clients will connect to the server, so I really need a fast and lightweight hashing algorithm. Any ideas?
P.S.: I have already noticed some hashing-algorithm questions on Stack Overflow, but the answers are not quite the comparison of hashing algorithms required for exactly this kind of task. I bet this will be really useful for a bunch of people.
Actually, CRC32 has neither the best speed nor the best distribution.
This is to be expected: CRC32 is pretty old by today's standards, created in an era when CPUs were neither 32/64 bits wide nor out-of-order-execution capable, and when distribution properties mattered less than error detection. All of these requirements have changed since.
To evaluate the speed and distribution properties of hash algorithms, Austin Appleby created the excellent SMHasher package.
A short summary of results is presented here.
I would advise selecting an algorithm with a Q.Score of 10 (perfect distribution).
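As a sketch of what per-chunk hashing could look like with one of the fast SMHasher-tested functions, here is the 512 KB chunking from the question using xxHash's XXH64 (purely as an example; the file name is a placeholder):

```cpp
// Sketch: hashing a file in 512 KB chunks with xxHash (XXH64), one
// 64-bit hash per chunk, as in the deduplication scheme described above.
// Assumes the xxHash library is available; "upload.dat" is a placeholder.
#include <cstdio>
#include <vector>
#include <xxhash.h>

int main() {
    const std::size_t kChunkSize = 512 * 1024;
    std::FILE* f = std::fopen("upload.dat", "rb");
    if (!f) return 1;

    std::vector<unsigned char> buf(kChunkSize);
    std::vector<XXH64_hash_t> chunk_hashes;
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        chunk_hashes.push_back(XXH64(buf.data(), n, 0));  // hash each chunk separately

    std::fclose(f);
    std::printf("hashed %zu chunks\n", chunk_hashes.size());
    return 0;
}
```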
You say you are using CRC-32 but want a faster hash. CRC-32 is very basic and pretty fast; I would expect the I/O time to be much longer than the hashing time anyway.
You also want a hash that will not have collisions, that is, where two different files or 512 KB chunks get the same hash value. You could look at any of the cryptographic hashes, such as MD5 (do not use it for secure applications) or SHA-1.
If you are only using CRC-32 to check whether a file is a duplicate, you are going to get false duplicates, because different files can have the same CRC-32. You are better off using SHA-1; CRC-32 and MD5 are both too weak.
I have a large collection of unique strings (about 500k). Each string is associated with a vector of strings. I'm currently storing this data in a
map<string, vector<string> >
and it's working fine. However I'd like the look-up into the map to be faster than log(n). Under these constrained circumstances how can I create a hashtable that supports O(1) look-up? Seems like this should be possible since I know all the keys ahead of time... and all the keys are unique (so I don't have to account for collisions).
Cheers!
You can create a hashtable with boost::unordered_map, std::tr1::unordered_map or (on C++0x compilers) std::unordered_map. That takes almost zero effort. Google sparsehash may be faster still and tends to take less memory. (Deletion can be a pain, but it seems you won't need that.)
If the code is still not fast enough, you can exploit prior knowledge of the keys with a minimal perfect hash, as suggested by others, to obtain guaranteed O(1) performance. Whether the code-generation effort that takes is worth it is up to you; putting 500k keys through a tool like gperf may take a code generator generator.
You may also want to look at CMPH, which generates a perfect hash function at run-time, though through a C API.
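For the "almost zero effort" route mentioned first, a minimal sketch might look like this; reserving capacity for the ~500k known keys up front avoids rehashing while the table is filled (the key/value data below is made up):

```cpp
// Sketch: a plain std::unordered_map built once from the known keys.
// The keys and values below are placeholders for the real 500k entries.
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<std::string, std::vector<std::string>> table;
    table.reserve(500000);   // known key count, so no rehashing while filling

    table["key1"] = {"a", "b"};   // placeholder data
    table["key2"] = {"c"};

    // Average O(1) lookup instead of std::map's O(log n).
    auto it = table.find("key1");
    return (it != table.end()) ? 0 : 1;
}
```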
I would look into creating a perfect hash function for your table. That guarantees no collisions, which are expensive to resolve. Perfect hash function generators are also available.
What you're looking for is a Perfect Hash. gperf is often used to generate these, but I don't know how well it works with such a large collection of strings.
If you want no collisions for a known collection of keys you're looking for a perfect hash. The CMPH library (my apologies as it is for C rather than C++) is mature and can generate minimal perfect hashes for rather large data sets.
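To give a feel for the CMPH route, here is a rough sketch based on the library's documented C API; the keys and the choice of the CHD algorithm are just examples, and the exact calls may differ between CMPH versions:

```cpp
// Rough sketch: building and querying a minimal perfect hash with CMPH.
// Keys and the CHD algorithm are illustrative; check the CMPH docs for
// the exact API of the version you install.
#include <cstdio>
#include <cstring>
#include <cmph.h>

int main() {
    const char* keys[] = {"alpha", "beta", "gamma"};
    unsigned int nkeys = 3;

    cmph_io_adapter_t* source =
        cmph_io_vector_adapter(const_cast<char**>(keys), nkeys);
    cmph_config_t* config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_CHD);   // one of CMPH's algorithms
    cmph_t* mphf = cmph_new(config);
    cmph_config_destroy(config);

    // Each known key maps to a distinct index in [0, nkeys).
    unsigned int id = cmph_search(mphf, "beta", (cmph_uint32)std::strlen("beta"));
    std::printf("beta -> %u\n", id);

    cmph_destroy(mphf);
    cmph_io_vector_adapter_destroy(source);
    return 0;
}
```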
Currently I'm checking against an XOR checksum of the modification time (st_mtime from fstat) of every file in the tree. I'm coupling this with the number of files found and a checksum of the file sizes (allowing overflows) to be safe, but I'm quite paranoid that this can and will lead to false positives in the most extreme pathological cases.
One alternative (safe) option I'm considering is keeping a manifest of every file by name along with a CRC-32 of the file contents. This option, however, is pretty slow, or at least slower than I would like, for many files (let's say thousands).
So the question is, what are some tips or tricks you may have for determining whether any file has changed within a directory tree? I'd like to avoid a byte-by-byte comparison without trading away too much reliability.
Thanks very much for your suggestions.
You could use the "last modified on" property that files have (regardless of platform).
Simply store historical values and check historical values against current values, every so often.
boost::filesystem has a great cross platform API for reading this value.
EDIT: Specifically look at:
http://www.pdc.kth.se/training/Talks/C++/boost/libs/filesystem/doc/operations.htm#last_write_time
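A sketch of how that might be wired up (assuming Boost.Filesystem v3; the directory name is a placeholder):

```cpp
// Rough sketch: build a manifest of last-write times with
// boost::filesystem and compare it with a previously stored one to
// detect changes anywhere in the tree. "some_dir" is a placeholder.
#include <ctime>
#include <iostream>
#include <map>
#include <string>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

// Map of file path -> last modification time.
std::map<std::string, std::time_t> snapshot(const fs::path& root) {
    std::map<std::string, std::time_t> manifest;
    for (fs::recursive_directory_iterator it(root), end; it != end; ++it) {
        if (fs::is_regular_file(it->status()))
            manifest[it->path().string()] = fs::last_write_time(it->path());
    }
    return manifest;
}

int main() {
    std::map<std::string, std::time_t> before = snapshot("some_dir");
    // ... time passes, files may be modified ...
    std::map<std::string, std::time_t> after = snapshot("some_dir");

    if (before != after)
        std::cout << "something in the tree changed\n";
    return 0;
}
```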