Guessing a string with time comparison. Is it possible? - c++

I was wondering about a strange idea: you are given an algorithm which takes a string as input and compares it to a string that you don't know. The algorithm is just a trivial comparison, one char at a time. When a pair that doesn't match is found, 0 is returned. Otherwise it returns 1.
Can you guess the secret string in polynomial time by using the provided algorithm?
When a string doesn't match, the time used to give the answer 0 is less than the time taken to return 1, because fewer comparisons are needed. The times involved are very small, so you can try a single instance many times to get a more accurate estimate. By estimating the time taken we could gain information about the secret string. If this works properly, we can guess the string one char at a time, in polynomial time. So if this can happen we can try some kind of brute force attack char by char.
Does this make sense? Or is there something I'm misunderstanding?
Thanks in advance.

You can guess the secret string if you can input your own strings to compare, or even just observe enough strings (not chosen by you) being compared to the secret string, provided the string comparison has been written in such a way that its execution time reveals information about the secret string.
This is a known weakness cryptographic software can have, and all serious cryptographic software written nowadays avoids this weakness.
For instance, to avoid revealing information about its arguments, a function that tests whether two buffers are the same or different may be written:
int crypto_memcmp(const char *s1, const char *s2, size_t n)
{
size_t i;
int answer = 0; /* stays 0 only if every byte matches */
for (i=0; i<n; i++)
answer = answer | (s1[i] != s2[i]);
return answer;
}
You can use several techniques to check that a piece of code does not leak secrets through timing attacks. I wrote how to do it with static analysis here but this is based on a previous idea that used Valgrind (dynamic analysis) here.
Note that it goes further than that. This article showed how you did not even need the execution path to depend on the secret to leak information. It was enough that the secret was used in the computation of some array indices that were subsequently accessed. On modern computers, this changes the execution time because the cache will make two successive accesses to similar indices faster than two successive accesses to indices that are far from each other, revealing information about the secret.
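To make the attack from the question concrete, here is a minimal sketch. It assumes a deterministic stand-in for time: the oracle counts how many character comparisons it performed, where a real attacker would have to average many noisy timing measurements. It also assumes the attacker knows the secret's length and that the secret contains no NUL characters. The names LeakyOracle and recover_secret are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical early-exit comparison. `steps` records how many characters
// the last call compared; it stands in for the measured running time.
struct LeakyOracle {
    std::string secret;
    std::size_t steps;

    explicit LeakyOracle(std::string s) : secret(std::move(s)), steps(0) {}

    bool check(const std::string& guess) {
        steps = 0;
        for (std::size_t i = 0; i < secret.size(); ++i) {
            ++steps;
            if (i >= guess.size() || guess[i] != secret[i]) return false;
        }
        return guess.size() == secret.size();
    }
};

// Recovers the secret one character at a time: a candidate is correct at
// position `pos` as soon as the oracle compares more than pos + 1 characters.
std::string recover_secret(LeakyOracle& oracle, std::size_t length) {
    assert(length > 0);
    std::string guess(length, '\0');  // NUL filler never matches (assumption)
    for (std::size_t pos = 0; pos < length; ++pos) {
        for (int c = 1; c < 256; ++c) {
            guess[pos] = static_cast<char>(c);
            if (oracle.check(guess)) return guess;  // whole string matched
            if (oracle.steps > pos + 1) break;      // got past `pos`
        }
    }
    return guess;
}
```

With this stand-in the attacker needs at most 255 oracle calls per position, so the total work is linear in the length of the secret rather than exponential.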

You can determine the string character by character. For each character, use binary search.
For example:
You already know the first a characters; call them (Sa).
Now you have to determine the (a+1)th character.
There is an upper bound (Sa)zzzzzzz... and a lower bound (Sa)azzzzzz....
First you guess that the (a+1)th character is ('a'+'z')/2, say r, so the string is (Sa)rzzzzzz..., and with the result you update the upper and lower bounds.

Related

Is there a library that would produce a string that would hash (SHA1) to a given input?

I'm wondering if it's possible to find a block of text that would hash to a known value. In particular, I'm looking for a function CreateDataFromHash() that could be called as follows:
unsigned char myHash[] = "da39a3ee5e6b4b0d3255bfef95601890afd80709";
unsigned int length = 10000;
CreateDataFromHash(myHash, length);
Here CreateDataFromHash would return the string of the length 10000 containing arbitrary data, which would hash to myHash using SHA1.
Thanks.
There's no known easy or even moderately difficult way to do this in general.
The entire point of hashes (or so-called one-way functions), is that it's easy to compute them, but next to impossible to reverse their computation (find input values based on output). That said, for some hash functions, there are known methods that may allow computing inputs for a given hash value in reasonable time.
For example, this MD5 sum technique will find collisions (but not input for a given output) in about 8 hours on a 1.6GHz computer.
For SHA-1 in particular you may be interested in reading this.
One of the purposes of SHA1 is that this should be very hard to do.
Hashing is a one-way function. You can't get the input from the output.
This would be a "preimage attack". No such thing is publicly known against SHA-1.
The only attack known against SHA-1 is a collision attack. That means I find two inputs that produce the same result, but neither of them is pre-ordained, so to speak. Even so, this attack isn't really feasible for most people -- based on the amount of computation involved, the closest I can figure is that you'd have to spend somewhere in the range of a few million dollars to build a machine that would give you about one colliding pair of keys per week (assuming it ran, doing nothing else 24/7).
You have to brute force it. See
PHP brute force password generator
Get string, do hash, compare, repeat

Two-way "Hashing" of string

I want to generate int from a string and be able to generate it back.
Something like hash function but two-way function.
I want to use ints as ID in my application, but want to be able to convert it back in case of logging or debugging.
Like:
int id = IDProvider::getHash("NameOfMyObject");
object * a = createObject(id);
...
if(error)
{
LOG(IDProvider::getOriginalString(a->getId()), "some message");
}
I have heard of slightly modified CRC32 to be fast and 100% reversible, but I can not find it and I am not able to write it by myself.
Any hints what should I use?
Thank you!
edit
I have just found the source I got the whole CRC32 idea from:
Jason Gregory : Game Engine Architecture
Quotation:
"As with any hashing system, collisions are a possibility (i.e., two different strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all reasonable input strings we might use in our game. After all, a 32-bit hash code represents more than four billion possible values. So if our hash function does a good job of distributing strings evenly throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune."
Reducing an arbitrary length string to a fixed size int is mathematically impossible to reverse. See the pigeonhole principle. There is a practically unlimited number of strings, but only 2^32 32-bit integers.
32-bit hashes (assuming your int is 32 bits) can have collisions very easily. So it's not a good unique ID either.
There are hashfunctions which allow you to create a message with a predefined hash, but it most likely won't be the original message. This is called a pre-image.
For your problem it looks like the best idea is creating a dictionary that maps integer-ids to strings and back.
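A minimal sketch of such a dictionary, assuming IDs can simply be handed out in order of first use. The class name IdRegistry is made up here, and the method names only mirror the IDProvider calls in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical two-way registry: string -> int and int -> string.
// IDs are assigned sequentially, so they never collide.
class IdRegistry {
public:
    // Returns the existing ID for `name`, or assigns the next free one.
    int getId(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const int id = static_cast<int>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Recovers the original string, e.g. for logging or debugging.
    const std::string& getOriginalString(int id) const {
        assert(id >= 0 && static_cast<std::size_t>(id) < names_.size());
        return names_[static_cast<std::size_t>(id)];
    }

private:
    std::unordered_map<std::string, int> ids_;
    std::vector<std::string> names_;
};
```

The trade-off compared with a hash is that IDs depend on registration order, so they are not stable across runs unless the table itself is saved.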
To get the likelihood of a collision when you hash n strings, check out the birthday paradox. The most important property in that context is that collisions become likely once the number of hashed messages approaches the square root of the number of available hash values. So with a 32-bit integer, collisions become likely if you hash around 65000 strings. But if you're unlucky it can happen much earlier.
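To put rough numbers on that, here is a small sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n items and N possible hash values. It assumes an ideally uniform hash, which real hash functions only approximate. The function name collision_probability is made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability of at least one collision when hashing `n`
// items into `buckets` equally likely values (birthday approximation).
double collision_probability(double n, double buckets) {
    assert(buckets > 0.0);
    const double x = n * (n - 1.0) / (2.0 * buckets);
    return -std::expm1(-x);  // 1 - exp(-x), computed accurately for small x
}
```

Under this approximation, 65536 strings in a 32-bit hash give a collision probability of roughly 39%, and the 50% mark is reached at about 77000 strings.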
I have exactly what you need. It is called a "pointer". In this system, the "pointer" is always unique, and can always be used to recover the string. It can "point" to any string of any length. As a bonus, it also has the same size as your int. You can obtain a "pointer" to a string by using the & operator, as shown in my example code:
#include <iostream>
#include <string>
int main() {
std::string s = "Hai!";
std::string* ptr = &s; // this is a pointer
std::string copy = *ptr; // this retrieves the original string
std::cout << copy; // prints "Hai!"
}
What you need is encryption. Hashing is by definition one way. You might try simple XOR Encryption with some addition/subtraction of values.
Reversible hash function?
How come MD5 hash values are not reversible?
checksum/hash function with reversible property
http://groups.google.com/group/sci.crypt.research/browse_thread/thread/ffca2f5ac3093255
... and many more via google search...
You could look at perfect hashing
http://en.wikipedia.org/wiki/Perfect_hash_function
It only works when all the potential strings are known up front. In practice what you enable by this, is to create a limited-range 'hash' mapping that you can reverse-lookup.
In general, the [hash code + hash algorithm] are never enough to get the original value back. However, with a perfect hash, collisions are by definition ruled out, so if the source domain (list of values) is known, you can get the source value back.
gperf is a well-known, age old program to generate perfect hashes in c/c++ code. Many more do exist (see the Wikipedia page)
It is not possible. Hashing is a non-reversible function, by definition.
As everyone mentioned, it is not possible to have a "reversible hash". However, there are alternatives (like encryption).
Another one is to zip/unzip your string using any lossless algorithm.
That's a simple, fully reversible method, with no possible collision.

What's the best way to hash a string vector not very long (urls)?

I am now dealing with url classification. I partition url with "/?", etc, generating a bunch of parts. In the process, I need to hash the first part to the kth part, say, k=2, then for "http://stackoverflow.com/questions/ask", the key is a string vector "stackoverflow.com questions". Currently, the hash is like Hash. But it consumes a lot of memory. I wonder whether MD5 can help or are there other alternatives. In effect, I do not need to recover the key exactly, as long as differentiating different keys.
Thanks!
It consumes a lot of memory
If your code already works, you may want to consider leaving it as-is. If you don't have a target, you won't know when you're done. Are you sure "a lot" is synonymous with "too much" in your case?
If you decide you really need to change your working code, you should consider the large variety of the options you have available, rather than taking someone's word for a specific algorithm:
http://en.wikipedia.org/wiki/List_of_hash_functions
http://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions
http://www.strchr.com/hash_functions
etc
Not sure about memory implications, and it certainly would change your perf profile, but you could also look into using Tries:
http://en.wikipedia.org/wiki/Trie
MD5 is a nice hash code for stuff where security is not an issue. It's fast and reasonably long (128 bits is enough for most applications). Also the distribution is very good.
Adler32 would be a possible alternative. It's very easy to implement, just a few lines of code. It's even faster than MD5. And it's long enough/good enough for many applications (though for many it is not).
(I know Adler32 is strictly not a hash-code, but it will still do fine for many applications)
However, if storing the hash-code is consuming a lot of memory, you can always truncate the hash-code, or use XOR to "shrink" it. E.g.
uint8_t md5[16];
GetMD5(md5, ...);
// use XOR to shrink the MD5 to 32 bits
for (size_t i = 4; i < 16; i++)
md5[i % 4] ^= md5[i];
// assemble the parts into one uint32_t
uint32_t const hash = md5[0] | (uint32_t(md5[1]) << 8) | (uint32_t(md5[2]) << 16) | (uint32_t(md5[3]) << 24); // cast before shifting so the top byte cannot overflow a signed int
Personally I think MD5 would be overkill though. Have a look at Adler32, I think it will do.
EDIT
I have to correct myself: Adler32 is a rather poor choice for short strings (less than a few thousand bytes). I had completely forgotten about that. But there is always the obvious: CRC32. Not as fast as Adler32 (about the same speed as MD5), but still acceptably easy to implement, and there are also a ton of existing implementations with all kinds of licenses out there.
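For reference, a bit-at-a-time CRC-32 really is only a few lines. The sketch below uses the common reflected polynomial 0xEDB88320 (the CRC-32 used by zlib and PNG). It favours brevity over speed, since table-driven versions are much faster, and the function name crc32_bitwise is made up here.

```cpp
#include <cstdint>
#include <string>

// Plain bitwise CRC-32 (reflected polynomial 0xEDB88320).
// Slow but short; production code usually uses a lookup table.
std::uint32_t crc32_bitwise(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // If the low bit is set, shift and XOR in the polynomial.
            const std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this CRC is 0xCBF43926 for the input "123456789", which is a handy way to verify any implementation.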
If you're only trying to find out if two URL's are the same have you considered storing a binary version of the IP address of the server? If two server names resolve to the same address is that incorrect or an advantage for your application?

Any better alternatives for getting the digits of a number? (C++)

I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past: (Pseudocode so as to make students reading this do some work for their homework assignment):
int pointer getDigits(int number)
initialize int pointer to array of some size
initialize int i to zero
while number is greater than zero
store result of number mod 10 in array at index i
divide number by 10 and store result in number
increment i
return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task? If not, are there any alternative methods for this task, avoiding the use of strings? C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.
The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
int number_of_digits;
char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and encode the terminal by using an invalid value (so, after the last value, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).
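As a rough sketch of how that struct could be filled without any dynamic allocation, assuming non-negative 32-bit input and digits stored least significant first (the function name get_digits is only illustrative):

```cpp
#include <cassert>
#include <cstdint>

struct extracted_digits
{
    int number_of_digits;
    char digits[12];  // digit values 0..9, least significant first
};

// Extracts the decimal digits of `number` into a fixed-size struct,
// so no heap allocation is needed.
extracted_digits get_digits(std::uint32_t number) {
    extracted_digits result = {0, {0}};
    do {
        assert(result.number_of_digits < 12);
        result.digits[result.number_of_digits++] =
            static_cast<char>(number % 10);
        number /= 10;
    } while (number != 0);  // do-while so that 0 yields one digit
    return result;
}
```

Returning the struct by value keeps everything on the stack. Handling a sign would need an extra field or a convention of your choosing.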
Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.
Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).
By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form, store them in both. The code below is in C - it will work in C++ but you may want to consider using c++ equivalents - the idea behind it doesn't change however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
int ival;
char sval[sizeof("-2147483648")]; // enough for 32-bits
int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
is->ival = ival;
is->dirtyS = 1;
}
inline char *intstrGetS (tIntStr *is) {
if (is->dirtyS) {
sprintf (is->sval, "%d", is->ival);
is->dirtyS = 0;
}
return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n", intstrGetS(&is));
fprintf (logFile, "%s\n", intstrGetS(&is));
This has the advantage of calculating the string representation only when needed (the fprintf above would not have to recalculate the string representation and the printf only if it was dirty).
This is a similar trick I use in SQL with using precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column to hold the indexed lowercased last name along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).
It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration, you'd do a %100 to produce such a digit pair, thus halving the number of steps. The tradeoff is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only applies if you're converting a lot of numbers, but otherwise efficiency is moot anyway). Also, you might end up with a leading zero, as in "0128".
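A sketch of that base-100 idea, under a few assumptions: unsigned 32-bit input, a textual result, and the leading zero avoided by handling the final one or two digits separately. The names make_pair_table and to_decimal_base100 are made up for this example.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Builds the 200-byte table "00" "01" ... "99".
static std::array<char, 200> make_pair_table() {
    std::array<char, 200> table{};
    for (int i = 0; i < 100; ++i) {
        table[2 * i]     = static_cast<char>('0' + i / 10);
        table[2 * i + 1] = static_cast<char>('0' + i % 10);
    }
    return table;
}

// Converts two decimal digits per division instead of one.
std::string to_decimal_base100(std::uint32_t value) {
    static const std::array<char, 200> table = make_pair_table();
    char buf[10];  // a 32-bit value has at most 10 decimal digits
    int pos = 10;
    while (value >= 100) {
        const std::uint32_t pair = value % 100;
        value /= 100;
        buf[--pos] = table[2 * pair + 1];
        buf[--pos] = table[2 * pair];
    }
    if (value >= 10) {  // two digits left
        buf[--pos] = table[2 * value + 1];
        buf[--pos] = table[2 * value];
    } else {            // one digit left: no leading zero
        buf[--pos] = static_cast<char>('0' + value);
    }
    return std::string(buf + pos, buf + 10);
}
```

Whether this actually beats the plain %10 loop on a given compiler and CPU is something to measure rather than assume.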
Yes, there is a more efficient way, though it is not portable. Intel's FPU supports a special BCD number format. So all you have to do is call the corresponding assembler instruction, which converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.
Mathematically speaking, the number of decimal digits of a nonzero integer a is 1+int(log10(abs(a))); add (a<0) if you also want to count the minus sign, and treat a==0 (one digit) as a special case, since log10(0) is undefined.
You will not use strings but go through floating point and the log functions. If your platform has any kind of FP accelerator (every PC or similar has one), that will not be a big deal, and will beat any "string based" algorithm (which is nothing more than an iterative divide by ten and count).

C++ string comparison in one clock cycle

Is it possible to compare whole memory regions in a single processor cycle? More precisely is it possible to compare two strings in one processor cycle using some sort of MMX assembler instruction? Or is strcmp-implementation already based on that optimization?
EDIT:
Or is it possible to instruct C++ compiler to remove string duplicates, so that strings can be compared simply by their memory location? Instead of memcmp(a,b) compared by a==b (assuming that a and b are both native const char* strings).
Just use the standard C strcmp() or C++ std::string::operator==() for your string comparisons.
The implementations of them are reasonably good and are probably compiled to a very highly optimized assembly that even talented assembly programmers would find challenging to match.
So don't sweat the small stuff. I'd suggest looking at optimizing other parts of your code.
You can use the Boost Flyweight library to intern your immutable strings. String equality/inequality tests then become very fast since all it has to do at that point is compare pointers (pun not intended).
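If pulling in Boost is not an option, the same interning idea can be sketched with the standard library alone. This is only an illustration, not how Boost Flyweight works internally. The class name StringPool is made up, and the sketch is neither thread-safe nor does it ever free strings.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical intern pool: equal strings always yield the same pointer,
// so equality tests on interned strings become pointer comparisons.
class StringPool {
public:
    const std::string* intern(const std::string& s) {
        // insert() returns the existing element if `s` is already stored.
        // Pointers to unordered_set elements stay valid across rehashing.
        return &*pool_.insert(s).first;
    }

private:
    std::unordered_set<std::string> pool_;
};
```

The cost moves to interning time (one hash plus possibly one comparison per call), which pays off when each string is compared many times.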
Not really. Your typical 1-byte compare instruction takes 1 cycle.
Your best bet would be to use the MMX 64-bit compare instructions (see this page for an example). However, those operate on registers, which have to be loaded from memory. The memory loads will significantly damage your time, because you'll be going out to L1 cache at best, which adds some 10x time slowdown*. If you are doing some heavy string processing, you can probably get some nifty speedup there, but again, it's going to hurt.
Other people suggest pre-computing strings. Maybe that'll work for your particular app, maybe it won't. Do you have to compare strings? Can you compare numbers?
Your edit suggests comparing pointers. That's a dangerous situation unless you can specifically guarantee that you won't be doing substring compares (i.e., you are comparing some two-byte strings: [0x40, 0x50] with [0x40, 0x42]. Those are not "equal", but a pointer compare would say they are).
Have you looked at the gcc strcmp() source? I would suggest that doing that would be the ideal starting place.
* Loosely speaking, if a cycle takes 1 unit, a L1 hit takes 10 units, an L2 hit takes 100 units, and an actual RAM hit takes really long.
It's not possible to perform general-purpose string operations in one cycle, but there are many optimizations you can apply with extra information.
If your problem domain allows the use of an aligned, fixed-size buffer for strings that fits in a machine register, you can perform single-cycle comparisons (not counting the load instructions).
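As an illustration of the fixed-size case, the sketch below assumes strings of at most 8 bytes with no embedded NUL characters, packed into a 64-bit word with zero padding. The function name pack8 is made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Packs a string of at most 8 bytes into one 64-bit word, zero padded.
// Two packed strings are equal exactly when the words are equal.
std::uint64_t pack8(const std::string& s) {
    assert(s.size() <= 8);
    std::uint64_t word = 0;
    std::memcpy(&word, s.data(), s.size());
    return word;
}
```

After packing, equality is a single integer comparison such as pack8(a) == pack8(b). The resulting value depends on byte order, so it is only useful for equality, not for lexicographic ordering.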
If you always keep track of the lengths of your strings, you can compare lengths and use memcmp, which is faster than strcmp. If your application is multi-cultural, keep in mind that this only works for ordinal string comparison.
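A minimal sketch of the length-first comparison, assuming the caller already stores both lengths (the function name equal_with_lengths is illustrative only):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Ordinal (byte-wise) equality for strings whose lengths are already known.
// A length mismatch is rejected without touching the character data.
bool equal_with_lengths(const char* a, std::size_t a_len,
                        const char* b, std::size_t b_len) {
    assert(a != nullptr && b != nullptr);
    return a_len == b_len && std::memcmp(a, b, a_len) == 0;
}
```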
It appears you are using C++. If you only need equality comparisons with immutable strings, you can use a string interning solution (copy/paste link since I'm a new user) to guarantee that equal strings are stored at the same memory location, at which point you can simply compare pointers. See en.wikipedia.org/wiki/String_interning
Also, take a look at the Intel Optimization Reference Manual, Chapter 10 for details on the SSE 4.2's instructions for text processing. www.intel.com/products/processor/manuals/
Edit: If your problem domain allows the use of an enumeration, that is your single-cycle comparison solution. Don't fight it.
If you're optimizing for string comparisons, you may want to employ a string table (then you only need to compare the indexes of the two strings, which can be done in a single machine instruction).
If that's not feasible, you can also create a hashed string object that contains the string and a hash. Then most of the time you only have to compare the hashes if the strings aren't equal. If the hashes do match you'll have to do a full comparison though to make sure it wasn't a false positive.
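A sketch of such a hashed string object, assuming std::hash is an acceptable hash function for your data (the struct name HashedString is made up here):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical string wrapper that caches its hash at construction time.
struct HashedString {
    std::string text;
    std::size_t hash;

    explicit HashedString(std::string s)
        : text(std::move(s)), hash(std::hash<std::string>()(text)) {}
};

// Different hashes prove inequality immediately; equal hashes still
// require the full comparison to rule out a false positive.
bool operator==(const HashedString& a, const HashedString& b) {
    return a.hash == b.hash && a.text == b.text;
}
```

This mainly helps when most comparisons are between unequal strings, since only those can skip the character-by-character check.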
It depends on how much preprocessing you do. C# and Java both have a process called interning strings which makes every string map to the same address if they have the same contents. Assuming a process like that, you could do a string equality comparison with one compare instruction.
Ordering is a bit harder.
EDIT: Obviously this answer is sidestepping the actual issue of attempting to do a string comparison within a single cycle. But it's the only way to do it unless you happen to have a sequence of instructions that can look at an unbounded amount of memory in constant time to determine the equivalent of a strcmp. Which is improbable, because if you had such an architecture the person who sold it to you would say "Hey, here's this awesome instruction that can do a string compare in a single cycle! How awesome is that?" and you wouldn't need to post a question on stackoverflow.
But that's just my reasoned opinion.
Or is it possible to instruct c++ compiler to remove string duplicates, so that strings can be compared simply by their memory location?
No. The compiler may remove duplicates internally, but I know of no compiler that guarantees or provides facilities for accessing such an optimisation (except possibly to turn it off). Certainly the C++ standard has nothing to say in this area.
Assuming you mean x86 ... Here is the Intel documentation.
But off the top of my head, no, I don't think you can compare more than the size of a register at a time.
Out of curiosity, why do you ask? I'm the last to invoke Knuth prematurely, but ... strcmp usually does a pretty good job.
Edit: Link now points to the modern documentation.
You can certainly compare more than one byte in a cycle. If we take the example of x86-64, you can compare up to 64-bits (8 bytes) in a single instruction (cmps), this isn't necessarily one cycle but will normally be in the low single digits (the exact speed depends on the specific processor version).
However, this doesn't mean you'll be able to do all the work of comparing two arrays in memory much faster than strcmp:
There's more than just the compare - you need to compare the two values, check if they are the same, and if so move to next chunk.
Most strcmp implementations will already be highly optimised, including checking if a and b point to the same address, and any suitable instruction-level optimisations.
Unless you're seeing a lot of time spent in strcmp, I wouldn't worry about it. Have you got a specific problem / use case you are trying to improve?
Even if both strings were cached, it wouldn't be possible to compare (arbitrarily long) strings in a single processor cycle. The implementation of strcmp in a modern compiler environment should be pretty much optimized, so you shouldn't bother to optimize too much.
EDIT (in reply to your EDIT):
You can't instruct the compiler to unify ALL duplicate strings - most compilers can do something like this, but it's best-effort only (and I don't know any compiler where it works across compilation units).
You might get better performance by adding the strings to a map and comparing iterators after that... the comparison itself might be one cycle (or not much more) then
If the set of strings to use is fixed, use enumerations - that's what they're there for.
Here's one solution that uses enum-like values instead of strings. It supports enum-value-inheritance and thus supports comparison similar to substring comparison. It also uses special character "¤" for naming, to avoid name collisions. You can take any class, function, or variable name and make it into enum-value (SomeClassA will become ¤SomeClassA).
struct MultiEnum
{
vector<MultiEnum*> enumList;
MultiEnum()
{
enumList.push_back(this);
}
MultiEnum(MultiEnum& base)
{
enumList.assign(base.enumList.begin(),base.enumList.end());
enumList.push_back(this);
}
MultiEnum(const MultiEnum* base1,const MultiEnum* base2)
{
enumList.assign(base1->enumList.begin(),base1->enumList.end());
enumList.insert(enumList.end(),base2->enumList.begin(),base2->enumList.end()); // append; a second assign() would discard base1's entries
}
bool operator !=(const MultiEnum& other)
{
return find(enumList.begin(),enumList.end(),&other)==enumList.end();
}
bool operator ==(const MultiEnum& other)
{
return find(enumList.begin(),enumList.end(),&other)!=enumList.end();
}
bool operator &(const MultiEnum& other)
{
return find(enumList.begin(),enumList.end(),&other)!=enumList.end();
}
MultiEnum operator|(const MultiEnum& other)
{
return MultiEnum(this,&other);
}
MultiEnum operator+(const MultiEnum& other)
{
return MultiEnum(this,&other);
}
};
MultiEnum
¤someString,
¤someString1(¤someString), // link to "someString" because it is a substring of "someString1"
¤someString2(¤someString);
void Test()
{
MultiEnum a = ¤someString1|¤someString2;
MultiEnum b = ¤someString1;
if(a!=¤someString2){}
if(b==¤someString2){}
if(b&¤someString2){}
if(b&¤someString){} // will result in true, because someString is substring of someString1
}
PS. I had definitely too much free time on my hands this morning, but reinventing the wheel is just too much fun sometimes... :)