Any built in hash method in C++? - c++

I was looking for md5 for C++, and I realize md5 is not built in (even though there are a lot of very good libraries to support the md5 function). Then, I realized I don't actually need md5, any hashing method will do. Thus, I was wondering if C++ has such functions? I mean, built-in hashing functions?
While I was researching for C++, I saw Java, PHP, and some other programming languages support md5. For example, in PHP, you just need to call: md5("your string");.
A simple hash function will do. (If possible, please include some simple code on how to use it.)

This is simple. With C++11 you get a
hash<string>
functor which you can use like this (untested, but gives you the idea):
hash<string> h;
const size_t value = h("mystring");
If you don't have C++11, take a look at boost, maybe boost::tr1::hash_map. They probably provide a string-hashing function, too.
For very simple cases you can start with something along these lines:
size_t h = 0
for(int i=0; i<s.size(); ++i)
h = h*31 + s[i];
return h;
To take up the comment below. To prevent short strings from clustering you may want to initialize h differently. Maybe you can use the length for that (but that is just my first idea, unproven):
size_t h = numeric_limits::max<size_t>() / (s.length()+1); // +1: no div-by-0
...
This should not be worse then before, but still far from perfect.

It depends which version of C++ you have... and what kind of hashing function you are looking for.
C++03 does not have any hashing container, and thus no need for hashing. A number of compilers have been proposing custom headers though. Otherwise Boost.Functional.Hash may help.
C++0x has the unordered_ family of containers, and thus a std::hash predicate, which already works for C++ standard types (built-in types and std::string, at least).
However, this is a simple hash, good enough for hash maps, not for security.
If you are looking for cryptographic hash, then the issue is completely different (and md5 is loosy), and you'll need a library for (for example) a SHA-2 hash.
If you are looking for speed, check out CityHash and MurmurHash. Both have restrictions, but they are heavily optimized.

How about using boost, Boost.Functional/Hash

Related

Truly compile-time string hashing in C++

Basically I need a truly compile-time string hashing in C++. I don't care about technique specifics, can be templates, macros, anything. All other hashing techniques I've seen so far can only generate hashtable (like 256 CRC32 hashes) in compile time, not a real hash.
In other words, I need to have this
printf("%d", SOMEHASH("string"));
to be compiled as (in pseudo-assembler)
push HASHVALUE
push "%d"
call printf
even in Debug builds, with no runtime operations on string. I am using GCC 4.2 and Visual Studio 2008 and I need the solution to be OK for those compilers (so no C++0x).
The trouble is that in C++03 the result of subscripting a string literal (i.e. access a single character) is not a compile-time constant suitable for use as a template parameter.
It is therefore not possible to do this. I would recommend you to write a script to compute the hashes and insert them directly into the source code, i.e.
printf("%d", SOMEHASH("string"));
gets converted to
printf("%d", 257359823 /*SOMEHASH("string")*/ ));
Write your own preprocessor that scans the source for SOMEHASH("") and replaces it with the computed hash. Then pass the output of that to the compiler.
(Similar techniques are used for I18N.)
With templates only the following syntax will work:
SOMEHASH<'s','t','r','i','n','g'>
see this eg:
http://arcticinteractive.com/2009/04/18/compile-time-string-hashing-boost-mpl/
or
compile-time string hashing
You have to wait for user-defined literals in C++0x for this.
If you don't mind using the new C++0x standard in your code (some answers also include links to stuff that works in the older C++03 standard), these questions have been asked before on StackOverflow:
Compile-time (preprocessor) hashing of string
Compile time string hashing
Both of those contain answers that will help you figure out how to possibly implement this.
Here is a blog post that shows how to use Boost.MPL Compile Time String Hashing
That's not possible, it might be in C++0x but definitely not in C++03.

C++, Multilanguage/Localisation support

what's the best way to add multilanguage support to a C++ program?
If possible, the language should be read in from a plain text file containing something like key-value pairs (§WelcomeMessage§ "Hello %s!").
I thought of something like adding a localizedString(key) function that returns the string of the loaded language file. Are there better or more efficient ways?
//half-pseudo code
//somewhere load the language key value pairs into langfile[]
string localizedString(key)
{
//do something else here with the string like parsing placeholders
return langfile[key];
}
cout << localizedString(§WelcomeMessage§);
Simplest way without external libraries:
// strings.h
enum
{
LANG_EN_EN,
LANG_EN_AU
};
enum
{
STRING_HELLO,
STRING_DO_SOMETHING,
STRING_GOODBYE
};
// strings.c
char* en_gb[] = {"Well, Hello","Please do something","Goodbye"};
char* en_au[] = {"Morning, Cobber","do somin'","See Ya"};
char** languages[MAX_LANGUAGES] = {en_gb,en_au};
This will give you what you want. Obviously you could read the strings from a file. I.e.
// en_au.lang
STRING_HELLO,"Morning, CObber"
STRING_DO_SOMETHING,"do somin'"
STRING_GOODBYE,"See Ya"
But you would need a list of string names to match to the string titles. i.e.
// parse_strings.c
struct PARSE_STRINGS
{
char* string_name;
int string_id;
}
PARSE_STRINGS[] = {{"STRING_HELLO",STRING_HELLO},
{"STRING_DO_SOMETHING",STRING_DO_SOMETHING},
{"STRING_GOODBYE",STRING_GOODBYE}};
The above should be slightly easier in C++ as you could use the enum classes toString() method (or what ever it as - can't be bothered to look it up).
All you then have to do is parse the language files.
I hope this helps.
PS: and to access the strings:
languages[current_language][STRING_HELLO]
PPS: apologies for the half c half C++ answer.
Space_C0wb0w's suggestion is a good one. We currently use successfully use ICU for that in our products.
Echoing your comment to his answer: It is indeed hard to say that ICU is "small, clean, uncomplicated". There is "accidental" complexity in ICU coming from its "Java-ish" interface, but a large part of the complexity and size simply comes from the complexity and size of the problem domain it is addressing.
If you don't need ICU's full power and are only interested in "message translation", you may want to look at GNU gettext which, depending on your platform and licencing requirements, may be a "smaller, cleaner and less-complicated" alternative.
The Boost.Locale project is also an interesting alternative. In fact, its "Messages Formatting" functionality is based on the gettext model.
Since you are asking for the best way (and didn't mention the platform) I would recommend GNU Gettext.
Arguably it is the most complete and mature internationalization library for C/C++ programming.

Performance of a trimming function

My old Trimming function:
string TailTrimString (const string & sSource, const char *chars) {
size_t End = sSource.find_last_not_of(chars);
if (End == string::npos) {
// only "*chars"
return "";
}
if (End == sSource.size() - 1) {
// noting to trim
return sSource;
}
return sSource.substr(0, End + 1);
}
Instead of it I've decided to use boost, and wrote the trivial:
string TailTrimString (const string & sSource, const char *chars) {
return boost::algorithm::trim_right_copy_if(sSource,boost::algorithm::is_any_of(chars));
}
And I was amazed to find out that the new function works much slower.
I've done some profiling, and I see that the function is_any_of is very slow.
Is it possible that boost's implementation works slower than my quite straightforward implementation? Is there anything I should use instead of is_any_of in order to improve the performance?
I also found a discussion on this matter in the boost's mailing list, but I am still not sure on how can I improve the performance of my code.
The boost version that I use is 1.38, which is quite old, but I guess this code didn't change too much since then.
Thank you.
it possible that boost's implementation works slower than my quite straightforward implementation?
Of course.
Is there anything I should use instead of is_any_of in order to improve the performance?
Yeah -- your original code. You said nothing about it having a defect, or the reason why you re-implemented it using boost. If there was no defect in the original code, then there was no valid reason to chuck the original implementation.
Introducing Boost to a codebase makes sense. It brings a lot of functionality that can be helpful. But gutting a function for the sole purpose of using a new technology is a big rookie mistake.
EDIT:
In response to your comment:
I still don't understand, why boost's performance is worse.
A hand-crafted function that is designed to do one specific job for one specific application will often be faster than a generic solution. Boost is a great library of generic tools that can save a lot of programming and a lot of defects. But its generic. You may only need to trim your string in a specific way, but Boost handles everything. That takes time.
In answer to your question about the relative performance, the std::string::find_last_not_of, will wrap C string routines (such as strcspan), and these are very fast, however boost::algorithm::is_any_of uses (probably used to use, I'd hazard that in the later versions this has changed!) a std::set for the set of characters to look for and does a check in this set for each character - which will not be anywhere near as fast!
EDIT: just to add an echo, your function works, it's not broken, it's not slow, so don't bother changing it...
In answer to your question about the relative performance.
You are using boost::algorithm::trim_right_copy_if which, according to the name, creates a copy of the input before trimming. Try using boost::algorithm::trim_right_if to see if that has better performance. This function will perform the operation in-place instead of on a new string.

Named parameter string formatting in C++

I'm wondering if there is a library like Boost Format, but which supports named parameters rather than positional ones. This is a common idiom in e.g. Python, where you have a context to format strings with that may or may not use all available arguments, e.g.
mouse_state = {}
mouse_state['button'] = 0
mouse_state['x'] = 50
mouse_state['y'] = 30
#...
"You clicked %(button)s at %(x)d,%(y)d." % mouse_state
"Targeting %(x)d, %(y)d." % mouse_state
Are there any libraries that offer the functionality of those last two lines? I would expect it to offer a API something like:
PrintFMap(string format, map<string, string> args);
In Googling I have found many libraries offering variations of positional parameters, but none that support named ones. Ideally the library has few dependencies so I can drop it easily into my code. C++ won't be quite as idiomatic for collecting named arguments, but probably someone out there has thought more about it than me.
Performance is important, in particular I'd like to keep memory allocations down (always tricky in C++), since this may be run on devices without virtual memory. But having even a slow one to start from will probably be faster than writing it from scratch myself.
The fmt library supports named arguments:
print("You clicked {button} at {x},{y}.",
arg("button", "b1"), arg("x", 50), arg("y", 30));
And as a syntactic sugar you can even (ab)use user-defined literals to pass arguments:
print("You clicked {button} at {x},{y}.",
"button"_a="b1", "x"_a=50, "y"_a=30);
For brevity the namespace fmt is omitted in the above examples.
Disclaimer: I'm the author of this library.
I've always been critic with C++ I/O (especially formatting) because in my opinion is a step backward in respect to C. Formats needs to be dynamic, and makes perfect sense for example to load them from an external resource as a file or a parameter.
I've never tried before however to actually implement an alternative and your question made me making an attempt investing some weekend hours on this idea.
Sure the problem was more complex than I thought (for example just the integer formatting routine is 200+ lines), but I think that this approach (dynamic format strings) is more usable.
You can download my experiment from this link (it's just a .h file) and a test program from this link (test is probably not the correct term, I used it just to see if I was able to compile).
The following is an example
#include "format.h"
#include <iostream>
using format::FormatString;
using format::FormatDict;
int main()
{
std::cout << FormatString("The answer is %{x}") % FormatDict()("x", 42);
return 0;
}
It is different from boost.format approach because uses named parameters and because
the format string and format dictionary are meant to be built separately (and for
example passed around). Also I think that formatting options should be part of the
string (like printf) and not in the code.
FormatDict uses a trick for keeping the syntax reasonable:
FormatDict fd;
fd("x", 12)
("y", 3.141592654)
("z", "A string");
FormatString is instead just parsed from a const std::string& (I decided to preparse format strings but a slower but probably acceptable approach would be just passing the string and reparsing it each time).
The formatting can be extended for user defined types by specializing a conversion function template; for example
struct P2d
{
int x, y;
P2d(int x, int y)
: x(x), y(y)
{
}
};
namespace format {
template<>
std::string toString<P2d>(const P2d& p, const std::string& parms)
{
return FormatString("P2d(%{x}; %{y})") % FormatDict()
("x", p.x)
("y", p.y);
}
}
after that a P2d instance can be simply placed in a formatting dictionary.
Also it's possible to pass parameters to a formatting function by placing them between % and {.
For now I only implemented an integer formatting specialization that supports
Fixed size with left/right/center alignment
Custom filling char
Generic base (2-36), lower or uppercase
Digit separator (with both custom char and count)
Overflow char
Sign display
I've also added some shortcuts for common cases, for example
"%08x{hexdata}"
is an hex number with 8 digits padded with '0's.
"%026/2,8:{bindata}"
is a 24-bit binary number (as required by "/2") with digit separator ":" every 8 bits (as required by ",8:").
Note that the code is just an idea, and for example for now I just prevented copies when probably it's reasonable to allow storing both format strings and dictionaries (for dictionaries it's however important to give the ability to avoid copying an object just because it needs to be added to a FormatDict, and while IMO this is possible it's also something that raises non-trivial problems about lifetimes).
UPDATE
I've made a few changes to the initial approach:
Format strings can now be copied
Formatting for custom types is done using template classes instead of functions (this allows partial specialization)
I've added a formatter for sequences (two iterators). Syntax is still crude.
I've created a github project for it, with boost licensing.
The answer appears to be, no, there is not a C++ library that does this, and C++ programmers apparently do not even see the need for one, based on the comments I have received. I will have to write my own yet again.
Well I'll add my own answer as well, not that I know (or have coded) such a library, but to answer to the "keep the memory allocation down" bit.
As always I can envision some kind of speed / memory trade-off.
On the one hand, you can parse "Just In Time":
class Formater:
def __init__(self, format): self._string = format
def compute(self):
for k,v in context:
while self.__contains(k):
left, variable, right = self.__extract(k)
self._string = left + self.__replace(variable, v) + right
This way you don't keep a "parsed" structure at hand, and hopefully most of the time you'll just insert the new data in place (unlike Python, C++ strings are not immutable).
However it's far from being efficient...
On the other hand, you can build a fully constructed tree representing the parsed format. You will have several classes like: Constant, String, Integer, Real, etc... and probably some subclasses / decorators as well for the formatting itself.
I think however than the most efficient approach would be to have some kind of a mix of the two.
explode the format string into a list of Constant, Variable
index the variables in another structure (a hash table with open-addressing would do nicely, or something akin to Loki::AssocVector).
There you are: you're done with only 2 dynamically allocated arrays (basically). If you want to allow a same key to be repeated multiple times, simply use a std::vector<size_t> as a value of the index: good implementations should not allocate any memory dynamically for small sized vectors (VC++ 2010 doesn't for less than 16 bytes worth of data).
When evaluating the context itself, look up the instances. You then parse the formatter "just in time", check it agaisnt the current type of the value with which to replace it, and process the format.
Pros and cons:
- Just In Time: you scan the string again and again
- One Parse: requires a lot of dedicated classes, possibly many allocations, but the format is validated on input. Like Boost it may be reused.
- Mix: more efficient, especially if you don't replace some values (allow some kind of "null" value), but delaying the parsing of the format delays the reporting of errors.
Personally I would go for the One Parse scheme, trying to keep the allocations down using boost::variant and the Strategy Pattern as much I could.
Given that Python it's self is written in C and that formatting is such a commonly used feature, you might be able (ignoring copy write issues) to rip the relevant code from the python interpreter and port it to use STL maps rather than Pythons native dicts.
I've writen a library for this puporse, check it out on GitHub.
Contributions are wellcome.

Parse URLs using C-Strings in C++

I'm learning C++ for one of my CS classes, and for our first project I need to parse some URLs using c-strings (i.e. I can't use the C++ String class).
The only way I can think of approaching this is just iterating through (since it's a char[]) and using some switch statements. From someone who is more experienced in C++ - is there a better approach? Could you maybe point me to a good online resource? I haven't found one yet.
Weird that you're not allowed to use C++ language features i.e. C++ strings!
There are some C string functions available in the standard C library.
e.g.
strdup - duplicate a string
strtok - breaking a string into tokens. Beware - this modifies the original string.
strcpy - copying string
strstr - find string in string
strncpy - copy up to n bytes of string
etc
There is a good online reference here with a full list of available c string functions
for searching and finding things.
http://www.cplusplus.com/reference/clibrary/cstring/
You can walk through strings by accessing them like an array if you need to.
e.g.
char* url="http://stackoverflow.com/questions/1370870/c-strings-in-c"
int len = strlen(url);
for (int i = 0; i < len; ++i){
std::cout << url[i];
}
std::cout << endl;
As for actually how to do the parsing, you'll have to work that out on your own. It is an assignment after all.
There are a number of C standard library functions that can help you.
First, look at the C standard library function strtok. This allows you to retrieve parts of a C string separated by certain delimiters. For example, you could tokenize with the delimiter / to get the protocol, domain, and then the file path. You could tokenize the domain with delimiter . to get the subdomain(s), second level domain, and top level domain. Etc.
It's not nearly as powerful as a regular expression parser, which is what you would really want for parsing URLs, but it works on C strings, is part of the C standard library and is probably OK to use in your assignment.
Other C standard library functions that may help:
strstr() Extracts substrings just like std::string::substr()
strspn(), strchr() and strpbrk() Find a character or characters in a string, similar to std::string::find_first_of(), etc.
Edit: A reminder that the proper way to use these functions in C++ is to include <cstring> and use them in the std:: namespace, e.g. std::strtok().
You might want to refer to an open source library that can parse URLs (as a reference for how others have done it -- obviously don't copy and paste it!), such as curl or wget (links are directly to their url parsing files).
I don't know what the requirements are for parsing the URLs,
but if this is CS level it would be appropriate to use (very
simple) BNF and a (very simple) recursive descent parser.
This would make for a more robust solution than direct
iteration, e.g. for malformed URLs.
Very few string functions from the standard C library would
be needed.
You can use C functions like strtok, strchr, strstr etc.
Many of the runtime library functions that have been mentioned work quite well, either in conjunction with or apart from the approach of iterating through the string that you mentioned (which I think is time honored).