Two-way "Hashing" of string - c++

I want to generate an int from a string and be able to get the original string back.
Something like a hash function, but two-way.
I want to use ints as IDs in my application, but I want to be able to convert them back to strings for logging or debugging.
Like:
int id = IDProvider::getHash("NameOfMyObject");
object * a = createObject(id);
...
if(error)
{
LOG(IDProvider::getOriginalString(a->getId()), "some message");
}
I have heard of a slightly modified CRC32 that is fast and 100% reversible, but I cannot find it and I am not able to write it myself.
Any hints what should I use?
Thank you!
edit
I have just found the source I got the whole CRC32 idea from:
Jason Gregory : Game Engine Architecture
quotation:
"As with any hashing system, collisions are a possibility (i.e., two different strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all reasonable input strings we might use in our game. After all, a 32-bit hash chode represents more than four billion possible values. So if our hash function does a good job of distributing strings evently throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune."

Reducing an arbitrary-length string to a fixed-size int is mathematically impossible to reverse. See the pigeonhole principle. There is an effectively unlimited number of strings, but only 2^32 32-bit integers.
32-bit hashes (assuming your int is 32 bits) can collide very easily, so a hash is not a good unique ID either.
There are hash functions that allow you to create a message with a predefined hash, but it most likely won't be the original message. Such a message is called a pre-image.
For your problem, the best idea looks like creating a dictionary that maps integer IDs to strings and back, along the lines of the sketch below.
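A minimal sketch of such a registry, reusing the IDProvider names from the question (everything else here is an assumption, not a known implementation):
#include <string>
#include <unordered_map>
#include <vector>

// Hands out small sequential ids and remembers the original strings.
class IDProvider {
public:
    static int getHash(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end())
            return it->second;                    // already registered
        int id = static_cast<int>(names_.size()); // next free id
        ids_.emplace(name, id);
        names_.push_back(name);
        return id;
    }
    static const std::string& getOriginalString(int id) {
        return names_.at(id);                     // throws if the id was never issued
    }
private:
    static std::unordered_map<std::string, int> ids_;
    static std::vector<std::string> names_;
};

std::unordered_map<std::string, int> IDProvider::ids_;
std::vector<std::string> IDProvider::names_;
The ids here are simply sequential, which sidesteps collisions entirely; if the ids must match an existing hash scheme, the same table can instead map hash values to strings.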
To get the likelihood of a collision when you hash n strings, check out the birthday paradox. The most important property in that context is that collisions become likely once the number of hashed messages approaches the square root of the number of available hash values. So with a 32-bit integer, collisions become likely if you hash around 65,000 strings. But if you're unlucky it can happen much earlier.

I have exactly what you need. It is called a "pointer". In this system, the "pointer" is always unique, and can always be used to recover the string. It can "point" to any string of any length. As a bonus, it is also the same size as your int. You can obtain a "pointer" to a string by using the & operator, as shown in my example code:
#include <iostream>
#include <string>
int main() {
    std::string s = "Hai!";
    std::string* ptr = &s;   // this is a pointer
    std::string copy = *ptr; // this retrieves the original string
    std::cout << copy;       // prints "Hai!"
}

What you need is encryption. Hashing is by definition one-way. You might try simple XOR encryption with some addition/subtraction of values.
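As a toy illustration of the XOR idea (the key and function name below are made up; this obscures the bytes but offers no real security, and applying the same function twice restores the input):
#include <string>

std::string xorCipher(std::string text, char key = 0x5A) {  // key value is arbitrary
    for (char& c : text)
        c ^= key;                 // XOR is its own inverse
    return text;
}
// std::string hidden = xorCipher("NameOfMyObject");
// std::string back   = xorCipher(hidden);   // back == "NameOfMyObject"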
Reversible hash function?
How come MD5 hash values are not reversible?
checksum/hash function with reversible property
http://groups.google.com/group/sci.crypt.research/browse_thread/thread/ffca2f5ac3093255
... and many more via google search...

You could look at perfect hashing
http://en.wikipedia.org/wiki/Perfect_hash_function
It only works when all the potential strings are known up front. In practice, what this enables is a limited-range 'hash' mapping that you can reverse-lookup.
In general, the [hash code + hash algorithm] are never enough to get the original value back. However, with a perfect hash, collisions are by definition ruled out, so if the source domain (list of values) is known, you can get the source value back.
gperf is a well-known, age-old program that generates perfect hashes as C/C++ code. Many more tools exist (see the Wikipedia page).

It is not possible. Hashing is a one-way function, by definition.

As everyone mentioned, it is not possible to have a "reversible hash". However, there are alternatives (like encryption).
Another one is to zip/unzip your string using any lossless algorithm.
That's a simple, fully reversible method, with no possible collision.
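A rough sketch of that approach using zlib (assuming zlib is available; note that uncompress needs to be told the original length, so you have to store it alongside the compressed bytes, and the result is a byte buffer rather than a single int):
#include <zlib.h>
#include <string>
#include <vector>

// Compress a string; the caller must remember text.size() to decompress later.
std::vector<unsigned char> deflateString(const std::string& text) {
    uLongf destLen = compressBound(text.size());
    std::vector<unsigned char> out(destLen);
    compress(out.data(), &destLen,
             reinterpret_cast<const Bytef*>(text.data()), text.size());
    out.resize(destLen);          // shrink to the actual compressed size
    return out;
}

// originalSize must be the length of the text passed to deflateString.
std::string inflateString(const std::vector<unsigned char>& packed, uLongf originalSize) {
    std::string out(originalSize, '\0');
    uncompress(reinterpret_cast<Bytef*>(&out[0]), &originalSize,
               packed.data(), packed.size());
    return out;
}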

Related

Looking for hash algorithm where small change in input will result in small change on hash

Current hash functions are designed to produce big changes in the hash even if only a very small portion of the input is changed. What I need is a hash algorithm whose output changes in proportion to changes in the input. For example, I need something similar to this:
Hash("STR1") => 1000
Hash("STR2") => 1001
Hash("STR3") => 1002
etc.
I'm not good at algorithms, but I have never heard of such an implementation, although I'm almost sure someone has already come up with this algorithm.
My current requirement is a large hash size (512 bits maybe?) to avoid collisions.
Thanks
UPDATE
I think I should clarify my goal; I see that I did a very poor job explaining what I need. Sorry, I'm not a native English speaker or a great communicator.
So basically I need this hash algorithm for searching for similar binary files. You can think of it as an antivirus hashing algorithm: it calculates a file checksum, but unlike traditional hash functions it is still able to detect the malware binary even after some small modification. This is pretty much what I'm looking for.
Another aspect is to avoid collision. Let me explain what I mean by that. It's not a conflicting goal. I want Hash("STR1") to produce 1000 and Hash("STR2") to produce 1001 or 1010 maybe, doesn't matter as long as the value is close relative to previous hash. But Hash("This is a very large string or maybe even binary data" + 100 random chars) should not produce a value close to 1000. I understand it will not work always and there would be some hash/hash-range collisions, but I think I can introduce another hashing algorithm and verify both to minimize collisions.
So what do you think? Maybe there is a better way to achieve my goal, maybe I'm asking too much, I don't know. I'm not well versed in cryptography, math or algorithms.
Thank you again for your time and effort
How about a simple summation? Your hash can then wrap at the desired size, and if you take this into account during hash comparisons, a small difference in inputs should yield a small difference in hashes.
However, I think "minimal collisions" and "proportional change in output" are conflicting goals.
This is called, in other domains, perceptual hashing.
One approach to this is as follows:
Get a training multiset of n-grams. (E.g. if n=2 and your training data was "This is a test" your training set would be "Th", "hi", "is", "s ", etc)
Sort said n-grams by frequency, descending, and record each one's frequency.
Then the hash of a word is formed from the first bits of the answers to: "for each n-gram in the database, is this word's frequency of said n-gram higher than the average frequency?"
Note that this can and will result in many collisions with similar words, unfortunately, unless the hash length is absurdly long.
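A rough sketch of that idea for bigrams (all names are invented; "average frequency" is read here as the mean count of the n-gram per training string, which is one possible interpretation):
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Count the bigrams (n = 2) of one string.
std::map<std::string, int> bigrams(const std::string& s) {
    std::map<std::string, int> counts;
    for (size_t i = 0; i + 1 < s.size(); ++i)
        ++counts[s.substr(i, 2)];
    return counts;
}

// One hash bit per popular training bigram: set when the word uses that
// bigram more often than the training average.
uint64_t ngramHash(const std::string& word,
                   const std::vector<std::string>& training) {
    std::map<std::string, int> total;            // bigram -> total count in training data
    for (const auto& w : training)
        for (const auto& kv : bigrams(w))
            total[kv.first] += kv.second;

    std::vector<std::pair<std::string, int>> ranked(total.begin(), total.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) { return a.second > b.second; });

    std::map<std::string, int> wordCounts = bigrams(word);
    uint64_t hash = 0;
    for (size_t bit = 0; bit < 64 && bit < ranked.size(); ++bit) {
        double avg = double(ranked[bit].second) / training.size();
        if (wordCounts[ranked[bit].first] > avg)
            hash |= uint64_t(1) << bit;
    }
    return hash;
}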
MD5 or SHA-x is not what you want.
According to Wikipedia, the substitution cipher, for example, has no avalanche effect (that is the term you mean).
In terms of hashing, you could use some kind of character sum.
For example:
const char* hashme = "hallo123";
int result = 0;
for (int i = 0; i < 8; ++i) {   // 8 == strlen("hallo123")
    result += hashme[i];        // add up the character codes
}
It may be geared towards kids, but the old NSA Kid's section has some really good ideas.
Of course, these algorithms are really insecure, so you cannot use this in place of REAL encryption. (But you can't use a real encryption algorithm when you just want to have fun, either.)
The number grid involves setting up a grid, then using the coordinates of each letter:
Further ideas:
Mix up the letter arrangement
Convert numbers to binary to obfuscate
A winding way also uses a grid. Essentially, the letters are packed in the grid left to right, in rows downwards. The output is produced by slicing vertically through the grid.
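A small sketch of the 'winding way' (the grid width here is arbitrary; padding rules and the exact read order in the original illustration are assumptions):
#include <string>

// Write the text into rows of `width` characters, then read it out column by column.
std::string windingEncode(const std::string& text, size_t width = 5) {
    std::string out;
    for (size_t col = 0; col < width; ++col)
        for (size_t pos = col; pos < text.size(); pos += width)
            out += text[pos];
    return out;
}
// Decoding walks the same grid in the opposite direction.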
Typically hash and encryption algorithms oriented towards cryptography will behave in the exact opposite way of what you're looking for (i.e. small changes in the input will cause large changes in the output and vice versa), so this algorithm class is a dead end.
As a quick digression on why these algorithms behave like this: of necessity, they're designed to obscure statistical relationships between the input and output to make them more difficult to crack. For example, in the English language the letter "e" is by far the most commonly-used letter; in some very weak classical ciphers you could simply find the most common letter and figure that that corresponds to "e" (e.g. - if n is the most common letter, then odds are n = e). Actually, a statistical pattern like you describe would likely make the algorithm significantly more vulnerable to chosen-plaintext, known-plaintext, man in the middle, and replay attacks.
The man in the middle and replay attacks would be made significantly easier by the fact that it would be much easier to edit the ciphertext to achieve the desired plaintext without knowing the key (especially if you have access to a couple of chosen plaintexts).
If you know that
7/19/2016 1:35 transfer $10 from account x to account y
(where the datestamp is used to defend against a replay attack) encodes to
12345678910
whereas
7/19/2016 1:40 transfer $10 from account x to account y
encodes to
12445678910
it's a pretty safe guess that
12545678910
will mean something like
7/19/2016 1:45 transfer $10 from account x to account y
Without having access to the original key, you could replay this packet on a regular basis to continue to steal money from someone's account simply by making a trivial edit. Granted, this is a fairly contrived example, but it still illustrates the basic problem.
My understanding of what you're looking for is statistical similarity between files. This might help some: https://en.wikipedia.org/wiki/Semantic_similarity
This does indeed exist. The term is Locality-sensitive hashing. A concrete implementation can be found here.
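One well-known locality-sensitive scheme is SimHash; here is a minimal sketch over whitespace-separated tokens (a simplification for illustration, not the implementation linked above, and it assumes a 64-bit std::size_t):
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>

// Similar inputs (sharing many tokens) get hashes with a small Hamming distance.
uint64_t simHash(const std::string& text) {
    int votes[64] = {0};
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        uint64_t h = std::hash<std::string>{}(token);
        for (int bit = 0; bit < 64; ++bit)
            votes[bit] += ((h >> bit) & 1) ? 1 : -1;   // each token votes on every bit
    }
    uint64_t result = 0;
    for (int bit = 0; bit < 64; ++bit)
        if (votes[bit] > 0)
            result |= uint64_t(1) << bit;
    return result;
}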
Depending on the source document you might want to look at digital forensics or VisualRank (from Google) for finding similar images and video. For textual data this is commonly used in anti-spam (read more here). For binary files you might want to first run a disassembler and then run the algorithm on the text version - but this is just my feeling; I don't have research to back this statement, but it would be an interesting hypothesis to test.

Small String Hash Function

I have a string that will be anywhere from length 1 to length 5 and I need to hash it with good performance and minimal collisions. Any suggestions? (I don't need security)
Depending on your scenario, you may have certain needs for what kind of hash you need. But if all you need is something to separate them, then std::hash does spring to mind...
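For instance, a minimal use of std::hash:
#include <functional>
#include <iostream>
#include <string>

int main() {
    std::size_t h = std::hash<std::string>{}("abcde");  // hashes the five-character key
    std::cout << h << '\n';
}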
Another option would be something like:
#include <algorithm>
#include <cstring>
#include <string>

long long hash(const std::string &val) {
    long long hash = 0;
    memcpy(reinterpret_cast<char*>(&hash), val.c_str(), std::min(sizeof(hash), val.length()));
    return hash;
}
Apologies for any mistypes in the above code, it has not been compiled or tested. This has minimal collisions (none), I would guess pretty good performance, and not-so-great quality. By quality I mean separation of values that are near each other and usage of entire key space.
There are also of course the whole range of regular cryptographic hash functions to choose from, but I take it from your question that these are not what you are aiming for.
Do you have Linux? Try gperf; it generates a perfect hash function from a key set.

Is there a library that would produce a string that would hash (SHA1) to a given input?

I'm wondering if it's possible to find a block of text that would hash to a known value. In particular, I'm looking for a function CreateDataFromHash() that could be called as follows:
unsigned char myHash[] = "da39a3ee5e6b4b0d3255bfef95601890afd80709";
unsigned int length = 10000;
CreateDataFromHash(myHash, length);
Here CreateDataFromHash would return a string of length 10000 containing arbitrary data that would hash to myHash using SHA1.
Thanks.
There's no known easy or even moderately difficult way to do this in general.
The entire point of hashes (or so-called one-way functions), is that it's easy to compute them, but next to impossible to reverse their computation (find input values based on output). That said, for some hash functions, there are known methods that may allow computing inputs for a given hash value in reasonable time.
For example, this MD5 sum technique will find collisions (but not input for a given output) in about 8 hours on a 1.6GHz computer.
For SHA-1 in particular you may be interested in reading this.
One of the purposes of SHA1 is that this should be very hard to do.
Hashing is a one-way function. You can't get the input from the output.
This would be a "preimage attack". No such thing is publicly known against SHA-1.
The only attack known against SHA-1 is a collision attack. That means I find two inputs that produce the same result, but neither of them is pre-ordained, so to speak. Even so, this attack isn't really feasible for most people -- based on the amount of computation involved, the closest I can figure is that you'd have to spend somewhere in the range of a few million dollars to build a machine that would give you about one colliding pair of keys per week (assuming it ran, doing nothing else 24/7).
You have to brute force it. See
PHP brute force password generator
Get a string, hash it, compare, repeat.
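A sketch of that loop using OpenSSL's SHA1 (only to show the shape of the search; the target here is the raw 20-byte digest, and with a 160-bit output the loop will never realistically find a match):
#include <openssl/sha.h>
#include <cstring>
#include <string>

// Try candidate strings until one hashes to the target digest.
std::string bruteForcePreimage(const unsigned char target[SHA_DIGEST_LENGTH]) {
    unsigned char digest[SHA_DIGEST_LENGTH];
    for (unsigned long long counter = 0; ; ++counter) {   // endless in practice
        std::string candidate = std::to_string(counter);
        SHA1(reinterpret_cast<const unsigned char*>(candidate.data()),
             candidate.size(), digest);
        if (std::memcmp(digest, target, SHA_DIGEST_LENGTH) == 0)
            return candidate;
    }
}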

Guessing a string with time comparison. Is it possible?

I was wondering about a strange idea: you are given an algorithm which takes a string as input and compares it to a string that you don't know. The algorithm is just a trivial comparison, one char at a time. When a pair that doesn't match is found, 0 is returned. Otherwise it returns 1.
Can you guess the secret string in a polynomial time by using the provided algorithm?
When a string doesn't match, the time taken to return 0 is less than the time taken to return 1, because fewer comparisons are needed. The times involved are very small, and for this reason you can run a single instance many times to get a more accurate estimate. By estimating the time taken we could gain information about the secret string. If this works properly, we can guess the string one char at a time, in polynomial time. So if this can happen we can try some kind of brute-force attack char by char.
Does this make sense? Or is there something I'm misunderstanding?
Thanks in advance.
You can guess the secret string if you can input your own strings to compare, or just observe enough strings (not chosen by you) being compared to the secret string, provided the string comparison has been written in a way such that its execution time reveals information about the secret string.
This is a known weakness cryptographic software can have, and all serious cryptographic software written nowadays avoids this weakness.
For instance, to avoid revealing information about its arguments, a function that tests whether two buffers are the same or different may be written:
int crypto_memcmp(const char *s1, const char *s2, size_t n)
{
    size_t i;
    int answer = 0;   /* must start at 0, otherwise the result is garbage */
    for (i = 0; i < n; i++)
        answer = answer | (s1[i] != s2[i]);
    return answer;
}
You can use several techniques to check that a piece of code does not leak secrets through timing attacks. I wrote how to do it with static analysis here but this is based on a previous idea that used Valgrind (dynamic analysis) here.
Note that it goes further than that. This article showed how you did not even need the execution path to depend on the secret to leak information. It was enough that the secret was used in the computation of some array indices that were subsequently accessed. On modern computers, this changes the execution time because the cache will make two successive accesses to similar indices faster than two successive accesses to indices that are far from each other, revealing information about the secret.
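To make the attack the question describes concrete, here is a rough sketch of a character-by-character timing probe (the secret, the vulnerable comparison and all names are invented for illustration; in reality a single measurement is far too noisy and you would need many repetitions and statistics):
#include <chrono>
#include <string>

// Toy stand-in for the vulnerable comparison: it bails out at the first mismatch.
static const std::string kSecret = "banana";
int checkSecret(const std::string& guess) {
    if (guess.size() != kSecret.size()) return 0;
    for (size_t i = 0; i < kSecret.size(); ++i)
        if (guess[i] != kSecret[i]) return 0;
    return 1;
}

// Time one call; a longer time hints that more leading characters matched.
long long probe(const std::string& guess) {
    auto start = std::chrono::steady_clock::now();
    checkSecret(guess);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// For each position, keep the character whose probe takes longest.
std::string recoverSecret(size_t length) {
    std::string guess(length, 'a');
    for (size_t pos = 0; pos < length; ++pos) {
        char best = 'a';
        long long bestTime = -1;
        for (char c = 'a'; c <= 'z'; ++c) {
            guess[pos] = c;
            long long t = probe(guess);   // a real attack averages many runs per guess
            if (t > bestTime) { bestTime = t; best = c; }
        }
        guess[pos] = best;
    }
    return guess;
}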
You can determine the string character by character; for each position, use binary search over the alphabet.
For example:
you already know the first a characters, call them (Sa).
Now you have to determine the (a+1)th character.
The upper bound is (Sa)zzzzzzz... and the lower bound is (Sa)azzzzzz....
First guess that the (a+1)th character is the midpoint of 'a' and 'z', say r; the trial string is then (Sa)rzzzzzz..., and from the result you update the upper and lower bounds.

Any better alternatives for getting the digits of a number? (C++)

I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past (pseudocode, so as to make students reading this do some work for their homework assignment):
int pointer getDigits(int number)
initialize int pointer to array of some size
initialize int i to zero
while number is greater than zero
store result of number mod 10 in array at index i
divide number by 10 and store result in number
increment i
return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task? If not, are there any alternative methods for this task that avoid the use of strings, C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.
The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
int number_of_digits;
char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and encode the terminator using an invalid value (so, after the last value, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).
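A sketch of the extraction loop that fills such a struct (the function name is invented; as noted above, how to report the sign is left to the caller, so only the magnitude is stored here):
// Uses the extracted_digits struct defined just above; no dynamic allocation.
extracted_digits getDigits(int number)
{
    extracted_digits result;
    // Work on the magnitude as unsigned so INT_MIN does not overflow.
    unsigned value = number < 0 ? 0u - static_cast<unsigned>(number)
                                : static_cast<unsigned>(number);
    int i = 0;
    do {                              // a lone 0 still produces one digit
        result.digits[i++] = static_cast<char>(value % 10);  // least significant first
        value /= 10;
    } while (value != 0);
    result.number_of_digits = i;
    return result;
}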
Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.
Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).
By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form; store them in both. The code below is in C - it will work in C++, but you may want to consider using C++ equivalents - the idea behind it doesn't change, however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
int ival;
char sval[sizeof("-2147483648")]; // enough for 32-bits
int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
    is->ival = ival;
    is->dirtyS = 1;
}
inline char *intstrGetS (tIntStr *is) {
    if (is->dirtyS) {
        sprintf (is->sval, "%d", is->ival);
        is->dirtyS = 0;
    }
    return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n" intstrGetS(&is));
fprintf (logFile, "%s\n" intstrGetS(&is));
This has the advantage of calculating the string representation only when needed (the printf above recalculates it only if the value is dirty, and the fprintf right after it simply reuses the cached string).
This is a similar trick I use in SQL with using precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column to hold the indexed lowercased last name along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).
It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration, you'd do a %100 to produce such a digit pair, thus halving the number of steps. The tradeoff is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only applies if you're converting a lot of numbers, but otherwise efficiency is moot anyway). Also, you might end up with a leading zero, as in "0128".
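A sketch of that digit-pair approach (writing into a caller-supplied buffer from the back; the 200-byte table is just the pairs "00".."99" laid out in order, and the final step special-cases a single remaining digit so no leading zero is produced):
// 200 characters: two per value 0..99.
static const char kPairs[] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";

// Convert an unsigned value to decimal text, two digits per division.
// buf must hold at least 11 bytes (enough for 32-bit values plus the terminator).
char* toDecimal(unsigned value, char* buf)
{
    char* p = buf + 10;
    *p = '\0';
    while (value >= 100) {
        unsigned pair = (value % 100) * 2;
        *--p = kPairs[pair + 1];
        *--p = kPairs[pair];
        value /= 100;
    }
    if (value >= 10) {            // two digits left
        unsigned pair = value * 2;
        *--p = kPairs[pair + 1];
        *--p = kPairs[pair];
    } else {                      // one digit left
        *--p = static_cast<char>('0' + value);
    }
    return p;                     // points at the first digit
}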
Yes, there is a more efficient way, but it is not portable. Intel's FPU has a special BCD number format. So all you have to do is call the corresponding assembler instruction that converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.
Mathematically speaking, the number of characters needed to print an integer a in base 10 is (a<0) + (a==0 ? 1 : 1+int(log10(abs(a)))); just watch out for floating-point rounding right at exact powers of ten.
You will not use strings but will go through floating point and the log functions. If your platform has any kind of FP acceleration (every PC or similar does), that will not be a big deal, and it will beat any "string based" algorithm (which is nothing more than an iterative divide-by-ten and count).
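A small sketch of that formula in code (the function name is made up; the edge cases for zero and the sign are handled explicitly):
#include <cmath>
#include <cstdlib>

// Characters needed to print 'a' in base 10, including a possible minus sign.
int decimalWidth(int a)
{
    int sign = a < 0 ? 1 : 0;
    long long magnitude = std::llabs(static_cast<long long>(a));   // safe even for INT_MIN
    int digits = magnitude == 0
                     ? 1
                     : 1 + static_cast<int>(std::log10(static_cast<double>(magnitude)));
    return sign + digits;
}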