PBKDF2 not matching between Python and Javascript libraries - password-encryption

Using password "password", salt "1234567812345678" 100 repetitions, 128-bit result
http://bitwiseshiftleft.github.com/sjcl/demo/ is a javascript implementation, gives result A374FF6A12280F020162A62A9B3212AA
http://matt.ucc.asn.au/src/pbkdf2.py is a python implementation gives result 89FBE50AF230BD273076AA9BC9F1142A
Why are they different, if PBKDF2 is a standard that they both implement?

It appears SJCL uses SHA-256, whereas the Python implementation defaults to SHA-1.
These are different hashes that can be used inside PBKDF2, and as such they lead to different results.
PBKDF2 is a standard algorithm, but it is parameterized by a pseudorandom function (typically HMAC with some hash) and does not fix which hash that is, so both sides must be configured to use the same one.

Related

Handling case-insensitivity without taking locale into account

I am investigating the handling of case-insensitivity in my application. So far I realized that there are two different cases:
data is visualized to the user
data is internally handled
For case 1 you should use the user's locale, always. This means that e.g. when sorting items in a list and you want this to happen case-insensitively, then you should use locale-aware case-insensitive string compare functions.
For case 2 it seems logical that you don't want to use the user's locale, since this can have undesirable effects if you have users with different locales working on the same data set. E.g. if you are managing library software, you could use the book's name as the key for your book instance in the database, and want to handle this case-insensitively (this is a simplification, I know).
When using STL containers (like std::map) I noticed that it's much more efficient to put the key in uppercase and then perform the lookup on the uppercase'd search-value. This is more efficient than performing a case-insensitive compare while looping over the map. For std::unordered_map it's probably required to do such a trick.
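The uppercase-key trick described above can be sketched roughly as follows. This is only an illustration under a strong assumption: it folds ASCII letters only and leaves every other byte alone, so it is locale-independent but does not attempt Unicode case mapping (ß stays ß). The names ascii_upper and CaseInsensitiveMap are made up for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Uppercase ASCII letters only; every other byte is left untouched,
// so the result does not depend on the current locale.
std::string ascii_upper(std::string s) {
    for (char& c : s) {
        if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
    }
    return s;
}

// Hypothetical wrapper: the key is normalized once on insert and once on
// lookup, so the map itself only ever does ordinary exact comparisons.
class CaseInsensitiveMap {
public:
    void insert(const std::string& key, int value) {
        data_[ascii_upper(key)] = value;
    }
    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const {
        auto it = data_.find(ascii_upper(key));
        return it == data_.end() ? nullptr : &it->second;
    }
private:
    std::map<std::string, int> data_;
};

// Small usage example.
void demo() {
    CaseInsensitiveMap books;
    books.insert("Moby Dick", 1851);
    assert(books.find("MOBY dick") != nullptr);
    assert(books.find("Walden") == nullptr);
}
```

The same normalization would also work as the key for an std::unordered_map, since the hash is then computed on the already-folded string.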
However, I realized that this may have strange effects as well, and I am wondering how Windows (also using case-insensitive file names) handles these situations.
E.g. The German character ß (ringel-S) is written as SS when put in uppercase. This seems to imply that ß.txt and SS.txt should denote the same file, and so ßs.txt and sß.txt and sss.txt and SSS.txt should also denote the same file. But in my experiments this doesn't seem to be the case in Windows.
So my questions:
Which C++ and/or Windows functions should be used to perform locale-independent case-insensitive string compares?
Which C++ and/or Windows functions should be used to make a string case-less (e.g. put it in uppercase) so compares are then more efficient when performing a lookup in an std::map (or even making hashing possible when using an std::unordered_map)?
Any other experience (or links to documents) regarding case-insensitive string handling for internal (i.e. non-visualization-related) data?

How do I do a regexp to find all names with Ernest Hemingway, but with spelling mistakes?

I need to comb a large set of data and look for some specific names.
They might appear with syntax errors in the texts.
What solution should I adopt?
common syntax errors:
ernst hmngi
Hurnest Huminguee
Ersnet Henimgway
Those are spelling errors. Regular expressions are not suited to this kind of task; you should look into Soundex. There is a CPAN module for it:
http://metacpan.org/pod/Text::Soundex
It finds matching words broadly based on phonetics (how the words sound when spoken) in American English.
You could look into approximate regexp matching as implemented in e.g. the TRE library. With the TRE tool tre-agrep with different error tolerance values, I can match all variants:
$ cat > test.txt
ernst hmngi
Hurnest Huminguee
Ersnet Henimgway
$ tre-agrep -4 -i "ernest hemingway" test.txt
Ersnet Henimgway
$ tre-agrep -5 -i "ernest hemingway" test.txt
Hurnest Huminguee
Ersnet Henimgway
$ tre-agrep -6 -i "ernest hemingway" test.txt
ernst hmngi
Hurnest Huminguee
Ersnet Henimgway
Given that
you have a dictionary (i.e. a list) of specific names you are looking for
you are doing this for English, which can be tokenised in a relatively straight-forward way (e.g. by using white space and punctuation as token boundaries)
the following approach should work well:
Prepare a dictionary of the names in your list
Tokenise the text
Consider a) each token, b) each pair of consecutive tokens, c) each triple of consecutive tokens as candidate names and look them up in the dictionary using approximate string matching techniques.
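The tokenise-and-generate-candidates steps above could look something like this. It is a minimal sketch, assuming ASCII text, that anything other than a letter or digit is a token boundary, and that lowercasing the tokens is acceptable (the lowercasing is my addition, not part of the description). The function names tokenise and candidates are invented for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split text into lowercase tokens; any character that is not a letter or
// digit (in the default "C" locale) is treated as a boundary.
std::vector<std::string> tokenise(const std::string& text) {
    std::vector<std::string> tokens;
    std::string cur;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            cur += static_cast<char>(std::tolower(c));
        } else if (!cur.empty()) {
            tokens.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}

// For every start position, emit the single token, the pair and the triple
// (up to max_len consecutive tokens), joined with single spaces.
std::vector<std::string> candidates(const std::vector<std::string>& tokens,
                                    std::size_t max_len = 3) {
    assert(max_len > 0);
    std::vector<std::string> out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        std::string joined;
        for (std::size_t k = 0; k < max_len && i + k < tokens.size(); ++k) {
            if (k > 0) joined += ' ';
            joined += tokens[i + k];
            out.push_back(joined);
        }
    }
    return out;
}
```

Each string returned by candidates would then be looked up in the name dictionary with whichever approximate matching strategy you pick.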
There are several possible strategies for implementing approximate string matching (I'd recommend trying (E) below first):
A) Methods that reduce the string and all dictionary entries to a canonical form before lookup. Soundex is one such method. The main general problem with these methods is that they do not provide ranking by string similarity, so you might get many different candidates but have no idea which one matches best. Furthermore, the canonical form is based on pronunciation rules for specific languages (e.g. Soundex for English), which is not good for names, especially non-English ones. It is also problematic because the errors you are dealing with are probably caused by mistyping, not mispronouncing a name. E.g. using a 'q' instead of a 'w' might be a frequent problem for you, because 'q' and 'w' are located next to each other on the keyboard, while their pronunciation is totally different.
B) Methods that use a search trie to implement the dictionary. During look-up in the trie, you can allow for one or two mismatches and thus find slightly misspelled candidates. The main problem here is that the lookup typically becomes unacceptably slow as soon as you allow for more than 2 character mismatches, in particular when mismatches are allowed at the beginning of strings. There are certain ways to optimise the performance, though. See here for a few ideas.
C) Methods based on n-gram look-up. Here you can use a hash table for the dictionary implementation, but rather than put the names into the hash directly, you split each name into its character n-grams (for predefined n, typically 2 or 3) and put the n-grams into the dictionary. E.g. for
hemingway
you will put
hem
emi
min
ing
ngw
gwa
way
into the hash. During look-up, you do the same with the candidate string: look up all its n-grams and accept the name that has the highest number of n-grams in common with the input. For example, if the input is hemmgway, you'll find that it has three n-grams (hem, gwa, way) in common with the dictionary entry hemingway.
This method works relatively well if your strings are fairly long and have only a few errors here and there. Maybe also not optimal in your case, but you might want to give it a try.
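To make the n-gram comparison in (C) concrete, here is a small sketch that just counts shared trigrams between two strings. It skips the hash-table index over the whole dictionary and only shows the overlap score; char_ngrams and shared_ngrams are names I made up.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Collect the distinct character n-grams of a string (n = 3 gives trigrams).
// Strings shorter than n have no n-grams.
std::set<std::string> char_ngrams(const std::string& s, std::size_t n = 3) {
    assert(n > 0);
    std::set<std::string> grams;
    for (std::size_t i = 0; i + n <= s.size(); ++i) {
        grams.insert(s.substr(i, n));
    }
    return grams;
}

// Count how many distinct n-grams the two strings have in common.
std::size_t shared_ngrams(const std::string& a, const std::string& b,
                          std::size_t n = 3) {
    const std::set<std::string> ga = char_ngrams(a, n);
    const std::set<std::string> gb = char_ngrams(b, n);
    std::size_t count = 0;
    for (const std::string& g : ga) count += gb.count(g);
    return count;
}
```

With these definitions, shared_ngrams("hemmgway", "hemingway") gives 3, matching the hem/gwa/way example above. In a real lookup you would score the input against every dictionary name that shares at least one n-gram and keep the highest count.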
D) Methods that use Levenshtein automata to implement the dictionary. This is a relatively complicated method, and also has problems when you want to allow for a very large number of errors. A detailed description is found in this paper by Schulz and Mihov. I am unsure whether there is a ready-to-use, open-source implementation available.
E) Methods that combine an implementation of the Levenshtein edit distance function with a metric tree. Given your description I believe this would work best for you, and I have used this method myself in a similar situation. You'll find further references in the answers to this SO question, and a link to an implementation (which I haven't tried, though) in this SO question.
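For (E), the edit distance part is the easy bit, so here is a sketch of it. It is the textbook two-row dynamic programming version of Levenshtein distance. The best_match helper is a plain linear scan that I added as a stand-in; it is not a metric tree, it only shows what the tree would be accelerating.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Linear scan standing in for the metric tree: returns the dictionary entry
// with the smallest distance to the query (first one wins on ties).
const std::string& best_match(const std::vector<std::string>& dictionary,
                              const std::string& query) {
    assert(!dictionary.empty());
    std::size_t best = 0;
    std::size_t best_dist = levenshtein(dictionary[0], query);
    for (std::size_t i = 1; i < dictionary.size(); ++i) {
        const std::size_t d = levenshtein(dictionary[i], query);
        if (d < best_dist) { best = i; best_dist = d; }
    }
    return dictionary[best];
}
```

For a handful of names the linear scan is probably fine as is; the metric tree only starts to matter once the dictionary gets large.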

Truly compile-time string hashing in C++

Basically I need a truly compile-time string hashing in C++. I don't care about technique specifics, can be templates, macros, anything. All other hashing techniques I've seen so far can only generate hashtable (like 256 CRC32 hashes) in compile time, not a real hash.
In other words, I need to have this
printf("%d", SOMEHASH("string"));
to be compiled as (in pseudo-assembler)
push HASHVALUE
push "%d"
call printf
even in Debug builds, with no runtime operations on string. I am using GCC 4.2 and Visual Studio 2008 and I need the solution to be OK for those compilers (so no C++0x).
The trouble is that in C++03 the result of subscripting a string literal (i.e. access a single character) is not a compile-time constant suitable for use as a template parameter.
It is therefore not possible to do this. I would recommend you to write a script to compute the hashes and insert them directly into the source code, i.e.
printf("%d", SOMEHASH("string"));
gets converted to
printf("%d", 257359823 /*SOMEHASH("string")*/);
Write your own preprocessor that scans the source for SOMEHASH("") and replaces it with the computed hash. Then pass the output of that to the compiler.
(Similar techniques are used for I18N.)
With templates only the following syntax will work:
SOMEHASH<'s','t','r','i','n','g'>
See this, e.g.:
http://arcticinteractive.com/2009/04/18/compile-time-string-hashing-boost-mpl/
or
compile-time string hashing
You have to wait for user-defined literals in C++0x for this.
If you don't mind using the new C++0x standard in your code (some answers also include links to stuff that works in the older C++03 standard), these questions have been asked before on StackOverflow:
Compile-time (preprocessor) hashing of string
Compile time string hashing
Both of those contain answers that will help you figure out how to possibly implement this.
Here is a blog post that shows how to use Boost.MPL Compile Time String Hashing
That's not possible, it might be in C++0x but definitely not in C++03.
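For anyone who can use C++11 (which the question rules out for GCC 4.2 and VS2008), here is roughly what the C++0x route looks like with constexpr. This is only a sketch: FNV-1a is my arbitrary choice of hash, and fnv1a and kStringHash are names invented for the example. The function is written as a single recursive return statement because that is all C++11 constexpr allows.

```cpp
#include <cstdint>

// 32-bit FNV-1a over a NUL-terminated string. The default argument is the
// standard FNV offset basis; 16777619 is the 32-bit FNV prime.
constexpr std::uint32_t fnv1a(const char* s, std::uint32_t h = 2166136261u) {
    return *s == '\0'
        ? h
        : fnv1a(s + 1, (h ^ static_cast<unsigned char>(*s)) * 16777619u);
}

// Using the result where a constant expression is required forces the
// compiler to evaluate it at compile time, regardless of build settings.
static_assert(fnv1a("") == 2166136261u, "empty string hashes to the offset basis");
constexpr std::uint32_t kStringHash = fnv1a("string");
```

Note that a plain call like printf("%u", fnv1a("string")) is not guaranteed to be folded at compile time in a debug build; going through a constexpr variable, a template argument or an enum value is what forces it.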

Any built in hash method in C++?

I was looking for md5 for C++, and I realize md5 is not built in (even though there are a lot of very good libraries to support the md5 function). Then, I realized I don't actually need md5, any hashing method will do. Thus, I was wondering if C++ has such functions? I mean, built-in hashing functions?
While I was researching for C++, I saw Java, PHP, and some other programming languages support md5. For example, in PHP, you just need to call: md5("your string");.
A simple hash function will do. (If possible, please include some simple code on how to use it.)
This is simple. With C++11 you get a
hash<string>
functor which you can use like this (untested, but gives you the idea):
hash<string> h;
const size_t value = h("mystring");
If you don't have C++11, take a look at Boost, e.g. boost::hash from Boost.Functional/Hash, which provides a string-hashing function too.
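A slightly fuller version of the std::hash usage, as a sketch assuming a C++11 compiler. The wrapper name hash_string is made up. Keep in mind that the actual values are implementation-defined and are only guaranteed to be consistent within one run of the program, so don't store them on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Thin wrapper around the standard string hasher.
std::size_t hash_string(const std::string& s) {
    return std::hash<std::string>()(s);
}

// Small usage example: equal strings always produce equal hashes.
void demo() {
    const std::string a = "mystring";
    const std::string b = std::string("my") + "string";
    assert(hash_string(a) == hash_string(b));
}
```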
For very simple cases you can start with something along these lines:
size_t h = 0;
for (size_t i = 0; i < s.size(); ++i)
    h = h * 31 + static_cast<unsigned char>(s[i]); // unsigned wrap-around is fine here
return h;
To take up the comment below: to prevent short strings from clustering you may want to initialize h differently. Maybe you can use the length for that (but that is just my first idea, unproven):
size_t h = numeric_limits<size_t>::max() / (s.length()+1); // +1: no div-by-0
...
This should not be worse than before, but it is still far from perfect.
It depends which version of C++ you have... and what kind of hashing function you are looking for.
C++03 does not have any hashing container, and thus no need for hashing. A number of compilers provide custom headers, though. Otherwise Boost.Functional.Hash may help.
C++0x has the unordered_ family of containers, and thus a std::hash function object, which already works for C++ standard types (built-in types and std::string, at least).
However, this is a simple hash, good enough for hash maps, not for security.
If you are looking for a cryptographic hash, then the issue is completely different (and md5 is lousy), and you'll need a library for, say, a SHA-2 hash.
If you are looking for speed, check out CityHash and MurmurHash. Both have restrictions, but they are heavily optimized.
How about using Boost, specifically Boost.Functional/Hash?

Encrypting password in compiled C or C++ code

I know how to compile C and C++ source files using GCC and CC in the terminal; however, I would like to know if it's safe to include passwords in these files once compiled.
For example, I check user input against a certain password, e.g. 123, but it appears that compiled C/C++ programs can be decompiled.
Is there any way to compile a C/C++ source file while keeping the source completely hidden?
If not, could anyone provide a small example of hashing the input and then checking it against the password, e.g. with SHA1 or MD5?
No, you can't securely include a password in your source file. Strings in an executable file are in plain text; anyone with a text editor can easily look at your password.
A not so secure approach, but one that would trip up some people, is to store the hashed string instead. So, basically:
enc = "03ac674216f3e15c761ee1a5e255f067953623c8b388b4459e13f978d7c846f4"
bool check() {
    pass = getPassFromUser();
    encpass = myHashingFunction(pass);
    return encpass == enc; // compare the hash of the input with the stored hash
}
This will deter some people, but it isn't really much more secure: it is relatively trivial for an assembly hacker to replace the 'enc' string in your executable with another SHA-256 hash of a known cleartext value.
Even if you use a separate authentication server, it is not difficult to set up a bogus authentication server and fool your program into connecting to it.
Even if you use SHA1 to generate a hash, it is not really all that safe if you do it in the normal way (write a function to check a password): any determined or knowledgeable hacker given access to the executable will be able to get around it (replace your hash with a known hash, or just replace the checkPassword() call with a call that returns true).
The question is who are you trying to protect against? Your little brother, a hacker, international spies, industrial espionage?
Using SHA1 with the hash just contained in the code (or a config file) will only protect against your little brother (read: casual computer users who can't be bothered to try to hack your program instead of paying the shareware price). In this case, using a plain text password or a SHA1 hash makes little difference (maybe a couple of percent more people will not bother).
If you want to make your code safe against anything else then you will need to do a lot more. A book on security is a good starting point but the only real way to do this is to take a security class where protection techniques are taught. This is a very specialized field and rolling your own version is likely to be counter productive and give you no real protection (using a hash is only the first step).
It is not recommended to keep any sensitive static data inside code. You can use configuration files for that. There you can store whatever you like.
But if you really want to do that, first remember that the code can easily be inspected with a debugger and modified. Only programs that the user doesn't have access to can be considered safer (web sites, for example).
The majority of login passwords (of different sites) are not stored in the clear in the database but hashed or encrypted with algorithms such as MD5, SHA1, Blowfish, etc.
I'd suggest you use one of these algorithms from OpenSSL library.
What I would do is use some public-key cryptographic algorithm. This will probably take a little longer to crack, but in my opinion nothing is 100% sure when talking about software protection.
It's not safe if you store them as plain text, you can just dump the file or use a utility like strings to find text in the executable.
You will have to encode them in some manner.
Here is a code sample that might help you, using OpenSSL.
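#include <openssl/evp.h>
// Writes the SHA-256 digest of buf into res (needs at least 32 bytes).
// Returns true on success.
bool SHA256Hash(const char* buf, size_t buflen, char* res, size_t reslen)
{
    if (reslen < 32)
        return false;
    // OpenSSL 1.1.0+; older versions use EVP_MD_CTX_create()/EVP_MD_CTX_destroy().
    EVP_MD_CTX* mdctx = EVP_MD_CTX_new();
    if (mdctx == NULL)
        return false;
    unsigned int len = 0;
    bool ok = EVP_DigestInit_ex(mdctx, EVP_sha256(), NULL) == 1
        && EVP_DigestUpdate(mdctx, buf, buflen) == 1
        && EVP_DigestFinal_ex(mdctx, reinterpret_cast<unsigned char*>(res), &len) == 1;
    EVP_MD_CTX_free(mdctx);
    return ok && len == 32;
}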
I took this sample from the systools library and had to adapt it, so I'm not sure it compiles without modifications. However, it should help you.
Please note that, to determine if storing a hash value of some password in your binary is safe, we must know what you want it for.
If you expect it to forbid some functionalities of your program unless some special password is given, then it is useless: an attacker is likely to remove the whole password-check code instead of trying to guess or reverse the stored password.
Try reading up on hash functions and ciphers for securing your passwords and their storage.