Perfect hash function for strings known in advance


I have 4000 strings and I want to create a perfect hash table with these strings. The strings are known in advance, so my first idea was to use a series of if statements:
if (name=="aaa")
return 1;
else if (name=="bbb")
return 2;
.
.
.
// 4000th `if' statement
However, this would be very inefficient. Is there a better way?

gperf is a tool that does exactly that:
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
According to the documentation, gperf is used to generate the reserved keyword recogniser for lexers in GNU C, GNU C++, GNU Java, GNU Pascal, GNU Modula 3, and GNU indent.
The way it works is described in GPERF: A Perfect Hash Function Generator by Douglas C. Schmidt.
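To make the "single string comparison" claim concrete, here is a hand-written sketch of the kind of lookup such a generator produces. It is not gperf output: the three-key set, the names Entry, kTable, slot_of and lookup, and the first-character hash are all assumptions chosen for illustration, and the hash is perfect only for these particular keys.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical key set for illustration: "aaa", "bbb", "ccc".
struct Entry {
    const char* name;
    int value;
};

static const Entry kTable[] = {{"aaa", 1}, {"bbb", 2}, {"ccc", 3}};
static const std::size_t kTableSize = sizeof(kTable) / sizeof(kTable[0]);

// Perfect for this key set only: the first character alone separates the keys.
// Inputs that cannot map to a valid slot return kTableSize.
inline std::size_t slot_of(const char* s, std::size_t len) {
    if (len == 0) return kTableSize;
    const unsigned char c = static_cast<unsigned char>(s[0]);
    if (c < 'a') return kTableSize;
    return static_cast<std::size_t>(c - 'a');
}

// Returns the mapped value, or -1 if the string is not one of the keys.
int lookup(const char* s, std::size_t len) {
    const std::size_t slot = slot_of(s, len);
    if (slot >= kTableSize) return -1;
    const Entry& e = kTable[slot];
    // The one string comparison: rejects non-keys that land on a used slot.
    if (std::strlen(e.name) == len && std::memcmp(e.name, s, len) == 0) return e.value;
    return -1;
}
```

Calling lookup("bbb", 3) lands directly on the second slot and confirms the match with one comparison; a non-key such as "abc" lands on the first slot and is rejected by that same comparison. A real generator searches for a hash function with this property over the whole key set instead of relying on a hand-picked character.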

Better late than never: I believe this finally answers the OP's question.
Simply use https://github.com/serge-sans-paille/frozen, a compile-time (constexpr) library of immutable containers for C++ that uses a perfect hash under the hood.
In my tests, it performed on par with GNU gperf, the well-known perfect hash C code generator.
In terms of your pseudo-code:
#include <frozen/unordered_map.h>
#include <frozen/string.h>
constexpr frozen::unordered_map<frozen::string, int, 4000> olaf = { // third argument: number of entries
{"aaa", 1},
{"bbb", 2},
.
.
.
// 4000th element
};
return olaf.at(name);
This responds in O(1) time rather than the O(n) of the OP's if chain (O(n) assuming the compiler doesn't optimize the chain, which it might).

Since the question is still unanswered and I'm about to add the same functionality to my HFT platform, I'll share my inventory for Perfect Hash Algorithms in C++. It is harder than I thought to find an open, flexible and bug free implementation, so I'm sharing the ones I didn't drop yet:
The CMPH library, with a collection of papers and such algorithms -- https://git.code.sf.net/p/cmph/git
BBHash, one more implementation from a paper's author -- https://github.com/rizkg/BBHash
Ademakov's -- another implementation from the paper above -- https://github.com/ademakov/PHF
wahern/phf -- I'm currently inspecting this one and trying to solve some allocation bugs it has when dealing with C++ Strings on huge key sets -- https://github.com/wahern/phf.git
emphf -- seems unmaintained -- https://github.com/ot/emphf.git

I believe @NPE's answer is very reasonable, and I doubt it is too much for your application as you seem to imply.
Consider the following example: suppose you have your "engine" logic (that is: your application's functionality) contained in a file called engine.hpp:
// this is engine.hpp
#pragma once
#include <iostream>
void standalone() {
std::cout << "called standalone" << std::endl;
}
struct Foo {
static void first() {
std::cout << "called Foo::first()" << std::endl;
}
static void second() {
std::cout << "called Foo::second()" << std::endl;
}
};
// other functions...
and suppose you want to dispatch the different functions based on the map:
"standalone" dispatches void standalone()
"first" dispatches Foo::first()
"second" dispatches Foo::second()
# other dispatch rules...
You can do that using the following gperf input file (I called it "lookups.gperf"):
%{
#include "engine.hpp"
struct CommandMap {
const char *name;
void (*dispatch) (void);
};
%}
%ignore-case
%language=C++
%define class-name Commands
%define lookup-function-name Lookup
struct CommandMap
%%
standalone, standalone
first, Foo::first
second, Foo::second
Then you can use gperf to create a lookups.hpp file using a simple command:
gperf -tCG lookups.gperf > lookups.hpp
Once I have that in place, the following main subroutine will dispatch commands based on what I type:
#include <iostream>
#include <string>
#include "engine.hpp" // this is my application engine
#include "lookups.hpp" // this is gperf's output
int main() {
std::string command;
while(std::cin >> command) {
auto match = Commands::Lookup(command.c_str(), command.size());
if(match) {
match->dispatch();
} else {
std::cerr << "invalid command" << std::endl;
}
}
}
Compile it:
g++ main.cpp -std=c++11
and run it:
$ ./a.out
standalone
called standalone
first
called Foo::first()
Second
called Foo::second()
SECOND
called Foo::second()
first
called Foo::first()
frst
invalid command
Notice that once you have generated lookups.hpp, your application has no dependency whatsoever on gperf.
Disclaimer: I took inspiration for this example from this site.

Related

Array or object: how to use nlohmann::json for simple use cases?

I want to use the nlohmann JSON library in order to read options from a JSON file. Specifying options is optional, as reflected in the constructor in my code example. I'm assuming the JSON structure is an object in its root.
Unfortunately, I'm unable to use these options, because it is unclear to me how I can force the JSON structure to be an object. What is worse, merely initializing a member variable with a JSON object {} (magically?) turns it into an array [{}].
#include <cstdlib>
#include <iostream>
#include <nlohmann/json.hpp>
class Example {
public:
explicit Example(const nlohmann::json& options = nlohmann::json::object())
: m_options{options}
{
std::clog << options << '\n' << m_options << '\n';
}
private:
nlohmann::json m_options;
};
auto main() -> int
{
Example x;
Example y{nlohmann::json::object()};
return EXIT_SUCCESS;
}
This results in the following output. Notice that we have to perform some ceremony in order to use an empty object as the default value (= empty settings), with = nlohmann::json::object(). Also notice that the settings object changes its type as soon as we initialize the member value (!):
{}
[{}]
My use case is quite straightforward, but I'm unable to extract settings unless I explicitly check whether the settings are an array or an object.
Another thing that worries me is that incorrect code compiles without warning, e.g., code in which I use x.value("y") on a JSON array x containing an object with key "y". Only at run time do I discover that I should have done x.at(0).value("y") instead.
In brief, the whole situation is quite surprising to me. I must be missing something / I must be using this library in an unintended way?

nlohmann is a very "modern" library that uses a lot of recent C++ features, which can make the code harder to read and understand. But it is very flexible.
This short introduction might help:
Introduction to nlohmann json
Parsing text into a JSON object is done like this:
constexpr std::string_view stringJson = R"({"k1": "v1"})";
nlohmann::json j = nlohmann::json::parse( stringJson.begin(), stringJson.end() );

Getting a function name (__func__) from a class T and a pointer to member function void(T::*pmf)()

Is it possible to write some f() template function that takes a type T and a pointer to member function of signature void(T::*pmf)() as (template and/or function) arguments and returns a const char* that points to the member function's __func__ variable (or to the mangled function name)?
EDIT: I am asked to explain my use-case. I am trying to write a unit-test library (I know there is a Boost Test library for this purpose). And my aim is not to use any macros at all:
struct my_test_case : public unit_test::test {
void some_test()
{
assert_test(false, "test failed.");
}
};
My test suite runner will call my_test_case::some_test() and, if its assertion fails, I want it to log:
ASSERTION FAILED (&my_test_case::some_test()): test failed.
I can use <typeinfo> to get the name of the class but the pointer-to-member-function is just an offset, which gives no clue to the user about the test function being called.

It seems that what you are trying to achieve is to get the name of the calling function in assert_test(). With gcc you can use backtrace to do that. Here is a naive example:
#include <cstdlib>
#include <iostream>
#include <string>
#include <execinfo.h>
#include <cxxabi.h>
namespace unit_test
{
struct test {};
}
std::string get_my_caller()
{
std::string caller("???");
void *bt[3]; // backtrace
char **bts; // backtrace symbols
size_t size = sizeof(bt)/sizeof(*bt);
int ret = -4;
/* get backtrace symbols */
size = backtrace(bt, size);
bts = backtrace_symbols(bt, size);
if (size >= 3) {
caller = bts[2];
/* demangle function name*/
char *name;
size_t pos = caller.find('(') + 1;
size_t len = caller.find('+') - pos;
name = abi::__cxa_demangle(caller.substr(pos, len).c_str(), NULL, NULL, &ret);
if (ret == 0)
caller = name;
free(name);
}
free(bts);
return caller;
}
void assert_test(bool expression, const std::string& message)
{
if (!expression)
std::cout << "ASSERTION FAILED " << get_my_caller() << ": " << message << std::endl;
}
struct my_test_case : public unit_test::test
{
void some_test()
{
assert_test(false, "test failed.");
}
};
int main()
{
my_test_case tc;
tc.some_test();
return 0;
}
Compiled with:
g++ -std=c++11 -rdynamic main.cpp -o main
Output:
ASSERTION FAILED my_test_case::some_test(): test failed.
Note: This is a gcc (linux, ...) solution, which might be difficult to port to other platforms!

TL;DR: It is not possible to do this in a reasonably portable way other than by using macros. Using debug symbols is a hard solution and a bad one: it will introduce maintenance and architecture problems in the future.
The names of functions, in any form, are not guaranteed to be stored in the binary [or anywhere else for that matter]. Static free functions certainly won't have to expose their name to the rest of the world, and there is no real need for virtual member functions to have their names exposed either (except when the vtable is formed in A.c and the member function is in B.c).
It is also entirely permissible for the linker to remove ALL names of functions and variables. Names MAY be used by shared libraries to find functions not present in the binary, but the "ordinal" way can avoid that too, if the system is using that method.
I can't see any other solution than making assert_test a macro, and this is actually a GOOD use-case of macros. [Well, you could of course pass __func__ as an argument, but that's certainly NOT better than using macros in this limited case].
Something like:
#define assert_test(x, y) do_assert_test(x, y, __func__)
and then implement do_assert_test to do what your original assert_test would do [less the impossible bit of figuring out the name of the function].
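A minimal sketch of how that macro could fit together is shown here. The body of do_assert_test, its string return value (used instead of logging so the result can be inspected) and the passing_test member are assumptions for illustration, not part of the original question.

```cpp
#include <string>

// Hypothetical helper: returns an empty string when the assertion holds,
// otherwise the line that a test runner would log.
std::string do_assert_test(bool ok, const std::string& message, const char* func) {
    if (ok) return std::string();
    return std::string("ASSERTION FAILED (") + func + "): " + message;
}

// The macro captures __func__ at the call site, inside the test function.
#define assert_test(x, y) do_assert_test((x), (y), __func__)

struct my_test_case {
    std::string some_test() { return assert_test(false, "test failed."); }
    std::string passing_test() { return assert_test(true, "unused"); }
};
```

Note that __func__ is an implementation-defined string; on common compilers it holds only the unqualified function name (some_test here), so the class name would have to come from elsewhere, for example from typeid.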
If it's unit tests, and you can be sure that you will always do this with debug symbols, you could solve it in a very non-portable way by building with debug symbols and then using the debug interface to find the name of the function you are currently in. The reason I say it's non-portable is that the debug API for a given OS is not standard - Windows does it one way, Linux another, and I'm not sure how it works in MacOS - and to make matters worse, my quick search on the subject seems to indicate that reading debug symbols doesn't have an API as such - there is a debug API that allows you to inspect the current process and figure out where you are, what the registers contain, etc, but not to find out what the name of the function is. So that's definitely a harder solution than "convince whoever needs to be convinced that this is a valid use of a macro".

Dynamically creating a map at compile-time

I'm implementing Lua in a game engine. All of the functions being exported to Lua have headers that start with luavoid, luaint or luabool just for quick reference of the expected parameters, and so I can see at a glance that this function is being exported.
#define luavoid(...) void
luavoid(std::string s) TextMsg()
{
std::string s;
ExtractLuaParams(1, s);
::TextMsg(s.c_str());
}
To actually export a function to Lua, they're added to a dictionary. On startup, the map is used to call lua_register.
std::unordered_map<std::string, ScriptCall> _callMap = {
{ "TextMsg", TextMsg },
...
}
There will be a lot of functions exported. Rather than have to maintain this map manually, I'd like to automate its creation.
My first instinct was something with macros at compile-time. I gave up on it initially and started writing a program to parse the code (as a pre-build event), since all the functions can be text-matched with the luaX macros. It would create a header file with the map automatically generated.
Then I went back to doing it at compile-time after figuring out a way to do it. I came up with this solution as an example before I finally implement it in the game:
using MapType = std::unordered_map<std::string, int>;
template <MapType& m>
struct MapMaker
{
static int MakePair(std::string s, int n)
{
m[s] = n;
return n;
}
};
#define StartMap(map) MapType map
#define AddMapItem(map, s, n) int map##s = MapMaker<map>::MakePair(#s, n)
StartMap(myMap);
AddMapItem(myMap, abc, 1);
AddMapItem(myMap, def, 2);
AddMapItem(myMap, ghi, 3);
int main()
{
for (auto& x : myMap)
{
std::cout << x.first.c_str() << "->" << x.second << std::endl;
}
}
It works.
My question is, how horrible is this and can it be improved? All I want in the end is a list mapping a string to a function. Is there a better way to create a map or should I just go with the text-parsing method?
Be gentle(-ish). This is my first attempt at coding with templates like this. I assume this falls under template metaprogramming.

how horrible is this and can it be improved?
Somewhere between hideous and horrendous. (Some questions better left unasked.) And yes...
All I want in the end is a list mapping a string to a function. Is there a better way to create a map or should I just go with the text-parsing method?
The simplest thing to do is:
#define ADDFN(FN) { #FN, FN }
std::unordered_map<std::string, ScriptCall> _callMap = {
ADDFN(TextMsg),
...
};
This uses the macros to automate the repetition in the string literal function names and identifiers - there's nothing further substantive added by your implementation.
That said, you could experiment with automating things further than your implementation, perhaps something like this:
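// Two-level concatenation so that __LINE__ is expanded before pasting
#define CONCAT_IMPL(A, B) A ## B
#define CONCAT(A, B) CONCAT_IMPL(A, B)
#define LUAVOID(FN, ...) \
void FN(); \
static auto CONCAT(addFN, __LINE__) = myMap.emplace(#FN, FN); \
void FN()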
LUAVOID(TextMsg, string s)
{
...
}
See it running here.
The idea here is that the macro generates a function declaration so that it can register the function, then a definition afterwards. __LINE__ likely suffices for uniqueness of the identifiers, assuming you have one file doing this and at most one registration per line. The Standard specifies that __LINE__ expands to an integer literal, but note that it must pass through an extra level of macro indirection before being pasted with ##; pasting it directly yields the literal token addFN__LINE__. The emplace function has a non-void return type, so it can be used directly to insert into the map.
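The registration idea can be sketched in a self-contained form as follows. The names callMap, registered_, g_calls and Beep are hypothetical, and ScriptCall is assumed to be a plain void() function pointer. The map lives in a function-local static so that it is constructed on first use, which sidesteps initialization-order problems if registrations are spread over several translation units.

```cpp
#include <string>
#include <unordered_map>

using ScriptCall = void (*)();  // assumed signature for exported functions

// Hypothetical accessor: the map is created the first time it is needed.
std::unordered_map<std::string, ScriptCall>& callMap() {
    static std::unordered_map<std::string, ScriptCall> map;
    return map;
}

// Two-level concatenation so that __LINE__ is expanded before pasting.
#define CONCAT_IMPL(a, b) a##b
#define CONCAT(a, b) CONCAT_IMPL(a, b)

// Declares FN, registers it under its own name, then opens its definition.
// The variadic part only documents the Lua-side parameters.
#define LUAVOID(FN, ...)                                                  \
    void FN();                                                            \
    static const bool CONCAT(registered_, __LINE__) =                     \
        callMap().emplace(#FN, FN).second;                                \
    void FN()

int g_calls = 0;  // stand-in for real work, so the calls are observable

LUAVOID(TextMsg, std::string s) { g_calls += 1; }
LUAVOID(Beep, void) { g_calls += 10; }
```

By the time main starts, callMap() contains both "TextMsg" and "Beep", so the startup code can loop over it and call lua_register for each entry without a hand-maintained list.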
Be gentle(-ish). This is my first attempt at coding with templates like this.
Sorry.
I assume this falls under template metaprogramming.
It's arguable. Many C++ programmers (myself included) think of "metaprogramming" as involving more advanced template usage - such as variable-length lists of parameters, recursive instantiations, and specialisation - but many others consider all template usage to be "metaprogramming" since the templates provide instructions for how to create instantiations, which is technically sufficient to constitute metaprogramming.

boost::program_options : iterating over and printing all options

I have recently started to use boost::program_options and found it to be highly convenient. That said, there is one thing missing that I was unable to code myself in a good way:
I would like to iterate over all options that have been collected in a boost::program_options::variables_map to output them on the screen. This should become a convenience function, that I can simply call to list all options that were set without the need to update the function when I add new options or for each program.
I know that I can check and output individual options, but as said above, this should become a general solution that is oblivious to the actual options. I further know that I can iterate over the contents of variables_map since it is simply an extended std::map. I could then check for the type contained in the stored boost::any variable and use .as<> to convert it back to the appropriate type. But this would mean coding a long switch block with one case for each type. And this doesn't look like good coding style to me.
So the question is, is there a better way to iterate over these options and output them?

As @Rost previously mentioned, the Visitor pattern is a good choice here. To use it with PO you need to attach notifiers to your options, so that when an option is passed its notifier fills an entry in your set of boost::variant values. The set should be stored separately. After that you can iterate over your set and automatically perform actions (e.g. print) on the values using boost::apply_visitor.
For visitors, inherit from boost::static_visitor<>.
Actually, I took the Visitor and the generic approach further.
I created a class MyOption that holds the description, a boost::variant for the value, and other settings such as implicit and default values. I fill a vector of MyOption objects the same way PO does for its options (see boost::po::options_add()) via templates. Passing std::string() or double() to initialize the boost::variant sets the type of the value, along with other things like default and implicit values.
After that I used the Visitor pattern to fill the boost::po::options_description container, since boost::po needs its own structures to parse the command line. While filling it I set a notifier for each option: if the option is passed, boost::po automatically fills my original MyOption object.
Next you need to execute po::parse and po::notify. After that you can use the already filled std::vector<MyOption*> via the Visitor pattern, since it holds a boost::variant inside.
The good thing about all of this is that you have to write your option type only once in the code, when filling your std::vector<MyOption*>.
PS. If you use this approach you will face the problem of setting a notifier for an option with no value; refer to this topic for a solution: boost-program-options: notifier for options with no value
PS2. Example of code:
std::vector<MyOptionDef> options;
OptionsEasyAdd(options)
("opt1", double(), "description1")
("opt2", std::string(), "description2")
...
;
po::options_description boost_descriptions;
AddDescriptionAndNotifyerForBoostVisitor add_descr_visitor(boost_descriptions);
// here all notifiers will be set automatically for correct work with each options' value type
for_each(options.begin(), options.end(), boost::apply_visitor(add_descr_visitor));

It's a good case to use Visitor pattern. Unfortunately boost::any doesn't support Visitor pattern like boost::variant does. Nevertheless there are some 3rd party approaches.
Another possible idea is to use RTTI: create map of type_info of known types mapped to type handler functor.
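The RTTI idea can be sketched as a table from type to printing functor. To keep the example free of external dependencies it uses std::any (C++17) in place of boost::any; the lookup by type works the same way, since both expose the stored type through type(). The names Printer, make_printer, printers and to_text are hypothetical, as is the choice of supported types.

```cpp
#include <any>
#include <functional>
#include <sstream>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

using Printer = std::function<std::string(const std::any&)>;

// Builds a handler that knows how to format one concrete type T.
template <typename T>
Printer make_printer() {
    return [](const std::any& v) {
        std::ostringstream os;
        os << std::boolalpha << std::any_cast<const T&>(v);
        return os.str();
    };
}

// One entry per supported type; supporting a new type means adding one line.
const std::unordered_map<std::type_index, Printer>& printers() {
    static const std::unordered_map<std::type_index, Printer> table = {
        {std::type_index(typeid(int)), make_printer<int>()},
        {std::type_index(typeid(double)), make_printer<double>()},
        {std::type_index(typeid(bool)), make_printer<bool>()},
        {std::type_index(typeid(std::string)), make_printer<std::string>()},
    };
    return table;
}

// Looks up the handler for the stored type; unknown types get a placeholder.
std::string to_text(const std::any& v) {
    const auto it = printers().find(std::type_index(v.type()));
    return it == printers().end() ? std::string("<unprintable>") : it->second(v);
}
```

A loop over a variables_map would then call the equivalent of to_text on each stored value, with no switch block and no exceptions used for control flow.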

Since you are going to just print them out anyway you can grab original string representation when you parse. (likely there are compiler errors in the code, I ripped it out of my codebase and un-typedefed bunch of things)
std::vector<std::string> GetArgumentList(const std::vector<boost::program_options::option>& raw)
{
std::vector<std::string> args;
BOOST_FOREACH(const boost::program_options::option& option, raw)
{
if(option.unregistered) continue; // Skipping unknown options
if(option.value.empty())
args.push_back("--" + option.string_key);
else
{
// this loses order of positional options
BOOST_FOREACH(const std::string& value, option.value)
{
args.push_back("--" + option.string_key);
args.push_back(value);
}
}
}
return args;
}
Usage:
boost::program_options::parsed_options parsed = boost::program_options::command_line_parser( ...
std::vector<std::string> arguments = GetArgumentList(parsed.options);
// print

I was dealing with just this type of problem today. This is an old question, but perhaps this will help people who are looking for an answer.
The method I came up with is to try a bunch of as<...>() and then ignore the exception. It's not terribly pretty, but I got it to work.
In the below code block, vm is a variables_map from boost program_options. vit is an iterator over vm, making it a pair of std::string and boost::program_options::variable_value, the latter being a boost::any. I can print the name of the variable with vit->first, but vit->second isn't so easy to output because it is a boost::any, ie the original type has been lost. Some should be cast as a std::string, some as a double, and so on.
So, to cout the value of the variable, I can use this:
std::cout << vit->first << "=";
try { std::cout << vit->second.as<double>() << std::endl;
} catch(...) {/* do nothing */ }
try { std::cout << vit->second.as<int>() << std::endl;
} catch(...) {/* do nothing */ }
try { std::cout << vit->second.as<std::string>() << std::endl;
} catch(...) {/* do nothing */ }
try { std::cout << vit->second.as<bool>() << std::endl;
} catch(...) {/* do nothing */ }
I only use 4 types to get information from the command line or config file; if I added more types, I would have to add more lines. I'll admit that this is a bit ugly.

How to keep track of call statistics? C++

I'm working on a project that delivers statistics to the user. I created a class called Dog, and it has several functions: speak, woof, run, fetch, etc.
I want to have a function that spits out how many times each function has been called. I'm also interested in the constructor calls and destructor calls as well.
I have a header file which defines all the functions, then a separate .cc file that implements them. My question is, is there a way to keep track of how many times each function is called?
I have a function called print that will fetch the "statistics" and then output them to standard output. I was considering using static integers as part of the class itself, declaring several integers to keep track of those things. I know the compiler will create a copy of the integer and initialize it to a minimum value, and then I'll increment the integers in the .cc functions.
I also thought about having static integers as a global variable in the .cc. Which way is easier? Or is there a better way to do this?
Any help is greatly appreciated!

Using static member variables is the way to go. However, the compiler will not "create a copy of the integer and initialize it to a minimum value"; you'll have to provide a definition for each one in the .cc file and initialize it to 0 there. (Things are a bit different if you're using C++11, but the basic idea is the same.)
There's no reason to use static global variables instead of static members.
foo.h:
class Foo {
static int countCtor_;
static int countDtor_;
static int countprint_;
public:
Foo();
~Foo();
static void print();
};
foo.cc:
#include <iostream>
#include "foo.h"
int Foo::countCtor_ = 0;
int Foo::countDtor_ = 0;
int Foo::countprint_ = 0;
Foo::Foo() {
++countCtor_;
// Something here
}
Foo::~Foo() {
++countDtor_;
// Something here
}
void Foo::print() {
++countprint_;
std::cout << "Ctor: " << countCtor_ << "\n"
<< "Dtor: " << countDtor_ << "\n"
<< "print: " << countprint_ << "\n";
}
But if you've got a lot of functions, the repetition involved is a bit annoying—it's very easy to accidentally do ++countBar_ when you meant ++countBaz_ (especially if you copy and paste the boilerplate), so you may want something a bit fancier, such as a static map and a macro that increments counts[__FUNC__], so you can just use the exact same line in each function. Like this:
foo.h:
#include <map>
class Foo {
static std::map<const char*, int> counts_;
public:
Foo();
~Foo();
void print();
};
foo.cc:
#include <iostream>
#include "foo.h"
std::map<const char *, int> Foo::counts_;
#define INC_COUNT_() do { ++counts_[__FUNC__]; } while (0)
Foo::Foo() {
INC_COUNT_();
// Something here
}
Foo::~Foo() {
INC_COUNT_();
// Something here
}
void Foo::print() {
INC_COUNT_();
for (std::map<const char *, int>::const_iterator it = counts_.begin();
it != counts_.end(); ++it) {
std::cout << it->first << ": " << it->second << "\n";
}
}
In the example code above, __FUNC__ is a placeholder. Unfortunately, there is no standard-compliant value you can use in its place. Most compilers have some subset of __func__, __FUNC__, __FUNCTION__, __FUNCSIG__, and __PRETTY_FUNCTION__. However, none of those are standard in C++03. C++11 does standardize __func__, but only as an "implementation-defined string", which isn't guaranteed to be useful, or even unique. On top of that, the values will be different on different compilers. Also, some of them may be macros rather than identifiers, to make things more fun.
If you want truly portable code, in C++11, you can use something like string(__func__) + ":" + STRINGIZE(__LINE__)—this will be somewhat ugly, but at least each function will have a unique name. And in C++03, there is no equivalent. If you just need "portable enough", consult the documentation for every compiler you use, or rely on something like autoconf.
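As a rough sketch of that C++11 approach, the key can be built from __func__ plus the stringized line number. The names STRINGIZE_IMPL, counts, INC_COUNT, speak and fetch are hypothetical. Unlike the std::map<const char*, int> version, this one keys on std::string, so entries are compared by content and not by pointer value.

```cpp
#include <map>
#include <string>

// Two-level stringizing so that __LINE__ is expanded to its number first.
#define STRINGIZE_IMPL(x) #x
#define STRINGIZE(x) STRINGIZE_IMPL(x)

// Hypothetical accessor for the shared counter table.
std::map<std::string, int>& counts() {
    static std::map<std::string, int> table;
    return table;
}

// The same line can be dropped into every function; the key is "name:line".
#define INC_COUNT() (++counts()[std::string(__func__) + ":" + STRINGIZE(__LINE__)])

void speak() { INC_COUNT(); }
void fetch() { INC_COUNT(); }
```

Each call site gets its own entry, so two overloads with the same __func__ still count separately as long as they are on different lines. The cost is a string construction and map lookup per call, which may matter in hot paths.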

Is there any reason you can't use standard profiling tools that will count these calls for you? Something like gprof?
Otherwise static integers would be the way to go.

Assuming you want these statistics tracked all the time in your program, you could use an unordered_map of your function names:
std::unordered_map<const char *, unsigned> stats;
void foo () {
// use __FUNCDNAME__ for MSVC
++stats[__PRETTY_FUNCTION__];
//...
}
The use of compiler specific function name specifiers is purposefully there to get the decorated function names. This is so that overloaded function names get counted as separate functions.
This technique allows you to add new functions easily without thinking about anything else, but there is a small additional cost if there are hash collisions (which can be remedied somewhat by sizing the stats map to be larger). No hash is computed on the string contents: since the key is a pointer type, the hash is derived from the pointer value itself.
If this is just one-off code for profiling, then you should first try to use the code profiling tools available on your platform.

You can put static locals inside the methods themselves, that seems cleaner since these variables aren't logically connected to the class so there's no reason to make them members.
Additionally, you could have a macro to simplify the work. I normally don't recommend using macros, but this seems like an appropriate use:
#define DEFINE_COUNTER \
static int noCalls = 0; \
noCalls++;
void foo()
{
DEFINE_COUNTER
}

Use a library that implements the Observer Pattern or Method Call Interception. You can choose one from this list, or use something like Vitamin.
