Short introduction to my questions:
I'm trying to implement a "sort of" relational database using STL containers. This is just for fun/educational purposes, so no need for answers like "use this library" or "this is absolutely useless", and so on.
I know the title is a little confusing at this point, but we will get there (suggestions for improving the title are really welcome).
I proceeded in small steps:
1. I can build a table as a vector of maps from column names to their values => std::vector<std::map<std::string, some_variant>>. It's simple and it represents what I need.
2. Wait, I can just store the column names once and access values by their index => std::vector<std::vector<some_variant>>. As simple as point 1, but faster.
3. Wait, wait: in a database a table is literally a sequence of tuples => std::vector<std::tuple<args...>>. This is cool; it represents exactly what I'm doing, with the correct types (no variant), and it's even faster than the others.
Note: the "faster than" claims were measured for 1,000,000 records with a simple loop like this:
std::random_device dev;
std::mt19937 gen(dev());
std::uniform_int_distribution<long> rand1_1000(1, 1000);
std::uniform_real_distribution<double> rand1_10(1.0, 10.0);
void fill_1()
{
using my_variant = std::variant<long, long long, double, std::string>;
using values = std::map<std::string, my_variant>;
using table = std::vector<values>;
table t;
for (int i = 0; i < 1000000; ++i)
t.push_back({ {"col_1", rand1_1000(gen)}, {"col_2", rand1_1000(gen)}, {"col_3", rand1_10(gen)} });
std::cout << "size:" << t.size() << "\n";//just to prevent optimization
}
1. 2234101600 ns - avg: 2234
2. 446344100 ns - avg: 446
3. 132075400 ns - avg: 132
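For reference, the tuple-based variant (3) would be filled analogously; this sketch reuses the generators from the snippet above and is not part of the original measurement code:
void fill_3()
{
    using my_tuple = std::tuple<long, long, double>;
    using table = std::vector<my_tuple>;

    table t;
    for (int i = 0; i < 1000000; ++i)
        t.emplace_back(rand1_1000(gen), rand1_1000(gen), rand1_10(gen));
    std::cout << "size:" << t.size() << "\n"; // just to prevent optimization
}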
INSERT:
No problem with any of these solutions: inserts are as simple as pushing back elements, as in the example.
SELECT:
1 and 2 are simple, but 3 is tricky.
So, finally, questions:
Memory usage: there is a lot of overhead with solutions 1 and 2 in terms of memory used, so 3 again seems to be the right choice here.
For the example with 1 million records of 2 longs and a double I was expecting something near 4 MB * 2 for the longs and 8 MB for the doubles, plus some overhead for the vectors, maps and variants where used. Instead we have (measured with the Windows task manager - not extremely accurate, I know):
1. 340 MB
2. 120 MB
3. 31 MB
Am I missing something, other than reserving the right size in advance or calling shrink_to_fit after the insert loop?
Is there a way to retrieve some tuple field at runtime, as for a select statement?
using my_tuple = std::tuple<long, long, string, double>;
std::vector<my_tuple> table;
int to_select; // this could be a vector of columns to select, obviously
std::cin >> to_select;
auto result = select (table, to_select);
Do you see any chance to implement this last line in any way?
As far as I can see we have two problems: the result type has to take its type from the starting tuple, and then we must actually perform the selection of the desired fields.
I read a lot of answers about this; they all talk about contiguous indexes using make_index_sequence, or about a compile-time known index.
I also found this article, very interesting, but not really useful for this case.
This is doable but it is strange:
template<size_t candidate, typename ...T>
constexpr std::variant<T...> helperTupleValueAt(const std::tuple<T...>& t, size_t index)
{
    if constexpr (candidate >= sizeof...(T)) {
        throw std::logic_error("out of bounds");
    } else {
        if (candidate == index) {
            // Found the runtime index: wrap the element in a variant,
            // selecting the alternative that matches this tuple position.
            return std::variant<T...>{ std::in_place_index<candidate>, std::get<candidate>(t) };
        } else {
            // Not there yet: try the next compile-time candidate.
            return helperTupleValueAt<candidate + 1>(t, index);
        }
    }
}
template<typename ...T>
std::variant<T...> tupleValueAt(const std::tuple<T...>& t, size_t index)
{
return helperTupleValueAt<0>(t, index);
}
https://wandbox.org/permlink/FQJd4chAFVSg5eSy
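To sketch one possible answer to the question above: building on tupleValueAt, a select for a single runtime column can return one variant per row. Note that the column type is erased into a variant at the select boundary, so the "correct type without variant" advantage of approach 3 is lost there. A minimal sketch (the select signature is my assumption, not established code):
template<typename... T>
std::vector<std::variant<T...>> select(const std::vector<std::tuple<T...>>& table,
                                       std::size_t index)
{
    std::vector<std::variant<T...>> result;
    result.reserve(table.size());
    for (const auto& row : table)
        result.push_back(tupleValueAt(row, index)); // runtime index -> variant
    return result;
}
With this, auto result = select(table, to_select); compiles, and each element can then be inspected with std::visit.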
Related
I tried to implement an unordered map for a class called Pair that stores an integer and a bitset. Then I found out that there is no hash function for this class.
So I wanted to create my own hash function. But instead of using XOR or comparable functions, I wanted a hash function like the following approach:
The bitsets in my class have a fixed size, so I wanted to do the following:
Example: for an instance of Pair with the bitset<6> = 101101 and the integer 6:
create a string = "1011016"
and now use the default hash function on this string.
Because the bitsets have a fixed size, each key would be unique.
How could I implement this approach?
Thank you in advance!
To expand on a comment, as requested:
Converting to string and then hashing that string would be somewhat slow. At least slower than it needs to be. A faster approach would be to combine the bit patterns, e.g. like this:
struct Pair
{
std::bitset<6> bits;
int intval;
};
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
return rtrn;
}
};
This works on two assumptions:
The upper bits of the integer are generally not interesting
The size of the bitset is always small compared to size_t
I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.
Possible improvements
We can preserve the upper bits by rotating instead of shifting.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
std::size_t intdigits = std::numeric_limits<decltype(pair.intval)>::digits;
std::size_t bitdigits = pair.bits.size();
// can be simplified to std::rotl(rtrn, bitdigits) in C++20
rtrn = (rtrn << bitdigits) | (rtrn >> (intdigits - bitdigits));
rtrn ^= pair.bits.to_ulong();
return rtrn;
}
};
Nothing will change for small integers (except some bitflips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...
If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't simply XOR them, though, to avoid hash collisions on small integers.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::hash<decltype(pair.intval)> ihash;
std::hash<decltype(pair.bits)> bhash;
return ihash(pair.intval) * 31 + bhash(pair.bits);
}
};
I picked the simple polynomial hash approach that is common in Java. I believe GCC uses the same one internally for string hashing. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen because it is a prime number one off a power of two, so multiplication by 31 can be computed quickly as (x << 5) - x.
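For completeness, a usage sketch: std::unordered_map needs an operator== for Pair in addition to the hash, which the snippets above don't show. Assuming the first specialization (the shift-and-or version), a self-contained example might look like this:
#include <bitset>
#include <cstddef>
#include <iostream>
#include <unordered_map>

struct Pair
{
    std::bitset<6> bits;
    int intval;

    // Unordered containers require equality as well as a hash.
    bool operator==(const Pair& other) const noexcept
    {
        return bits == other.bits && intval == other.intval;
    }
};

template<>
struct std::hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::size_t rtrn = static_cast<std::size_t>(pair.intval);
        rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
        return rtrn;
    }
};

int main()
{
    std::unordered_map<Pair, int> counts;
    counts[Pair{std::bitset<6>{0b101101}, 6}] = 42;
    std::cout << counts[Pair{std::bitset<6>{0b101101}, 6}] << "\n"; // prints 42
}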
I am a mathematician by training and need to simulate a continuous-time Markov chain. I need to use a variant of the Gillespie algorithm, which relies on fast reading and writing to a 13-dimensional array. At the same time, I need to set the size of each dimension based on user input (each will be roughly of order 10). Once these sizes are set by the user, they will not change throughout the runtime; only the data contained in the array will. What is the most efficient way of doing this?
My first try was to use standard arrays, but their sizes must be known at compile time, which is not my case. Is std::vector a good structure for this? If so, how shall I go about initializing a creature such as:
vector<vector<vector<vector<vector<vector<vector<vector<vector<vector<vector<vector<vector<int>>>>>>>>>>>>> Array;
Will the initialization take more time than dealing with an array? Or is there a better data container to use, please?
Thank you for any help!
I would start by using a std::unordered_map to hold key-value pairs, with each key being a 13-dimensional std::array, and each value being an int (or whatever datatype is appropriate), like this:
#include <iostream>
#include <unordered_map>
#include <array>
typedef std::array<int, 13> MarkovAddress;
// Define a hasher that std::unordered_map can use
// to compute a hash value for a MarkovAddress
// borrowed from: https://codereview.stackexchange.com/a/172095/126857
template<class T, size_t N>
struct std::hash<std::array<T, N>> {
size_t operator() (const std::array<T, N>& key) const {
std::hash<T> hasher;
size_t result = 0;
for(size_t i = 0; i < N; ++i) {
result = result * 31 + hasher(key[i]); // base-31 polynomial combine of the element hashes
}
return result;
}
};
int main(int, char **)
{
std::unordered_map<MarkovAddress, int> map;
// Just for testing
const MarkovAddress a{{1,2,3,4,5,6,7,8,9,10,11,12,13}};
// Place a value into the map at the specified address
map[a] = 12345;
// Now let's see if the value is present in the map,
// and retrieve it if so
if (map.count(a) > 0)
{
std::cout << "Value in map is " << map[a] << std::endl;
}
else std::cout << "Value not found!?" << std::endl;
return 0;
}
That will give you fast (average O(1)) lookup and insert, which is likely your first priority. Note that a dense 13-dimensional array with sides of order 10 would need on the order of 10^13 entries, far more RAM than is reasonable, so a sparse container like this is the natural fit. If you later run into trouble with it (e.g. too much RAM used, or you need a well-defined iteration order, etc.) you could replace it with something more elaborate later.
While implementing a variation of a binary search problem, I needed to reorder slicing points (i.e start, mid, end) so that they are stored within the corresponding variable (e.g. (1,5,2) -> (1,2,5)). This is fairly simple to do with a few if statements and swaps. However as a thought experiment, I'm now interested in generalizing this to work with n many T type variables. I started experimenting with some intuitive solutions and as a starting place, I came up with this template function:
template<typename T>
void
sortInPlace(
std::function<bool (const T&, const T&)> compareFunc,
T& start,
T& mid,
T& end)
{
std::vector<T> packed {start, mid, end};
std::sort(packed.begin(), packed.end(), compareFunc);
auto packedAsTuple = std::make_tuple(packed[0], packed[1], packed[2]);
std::tie(start, mid, end) = packedAsTuple;
}
And when I ran the following, using typedef std::pair<int,int> Pivot:
//Comparison function to sort by pair.first, ascending:
std::function<bool(const Pivot&, const Pivot&)>
comp = [](const Pivot& a, const Pivot& b) {
    return std::get<0>(a) < std::get<0>(b);
};
int main(){
Pivot a(8,1);
Pivot b(2,3);
Pivot c(4,6);
sortInPlace(comp,a,b,c);
}
This turns out to work as intended:
a after sort: 2, 3
b after sort: 4, 6
c after sort: 8, 1
Ideally, the following step is to convert this template into a variadic template but I'm having trouble achieving this. I also have a few things that bother me regarding the current version:
The usage of std::vector was an arbitrary decision. I did this because it's unclear to me which structure/container is best for packing these values. It seems to me that the chosen structure needs to be easy to construct, sort, and unpack/convert to a tuple, and I'm not sure such a magical structure exists.
Just to get something going, I had to settle for manually packing/unpacking the arguments. It is also unclear to me how to use std::tie or any other unpacking/moving operation with a variable number of elements (i.e. unknown at compile time).
While there is no particular reason to exclusively use STL functions/structures, I'd be surprised to learn that there is no intuitive way to achieve this using the abstractions provided by the STL. As a result, I'm more interested in achieving my goal with minimal help from outside the STL.
I embarked on this thought experiment expecting to end up with a syntactically correct version of std::move(std::sort({x,y,z}, comp), {x,y,z}), and considering where my research has led me so far, I'm starting to think I'm over-complicating this problem. Any help, insight or suggestion would be much appreciated!
One possible C++17 solution with std::sort that generalizes your example:
#include <algorithm>
#include <iterator>
#include <type_traits>
#include <utility>

template<class Comp, class... Ts>
void my_sort(Comp comp, Ts&... values) {
    using T = std::common_type_t<Ts...>;
    // Move the arguments into a temporary array, sort that,
    // then move the sorted values back into the references.
    T vals[]{std::move(values)...};
    std::sort(std::begin(vals), std::end(vals), comp);
    auto it = std::begin(vals);
    ((values = std::move(*it++)), ...); // left-to-right fold over the pack
}
using Pivot = std::pair<int, int>;
const auto comp = [](Pivot a, Pivot b) {
return std::get<0>(a) < std::get<0>(b);
};
Pivot a(8, 1);
Pivot b(2, 3);
Pivot c(4, 6);
my_sort(comp, a, b, c);
If the number N of parameters in the pack is small, you don't need std::sort at all. Just a series of (hardcoded) comparisons (and for small N the minimum number of comparisons is known exactly) will do the job - see sec. 5.3 Optimum sorting of Knuth's TAOCP vol. 3.
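For example, for N = 3 the optimal three comparisons can be hardcoded as a small network of compare-and-swaps. A sketch (sort3 is a hypothetical name, not from the answer above):
#include <utility>

// Sorts three values with exactly three comparisons, which is
// optimal for N = 3: compare-exchange (a,b), then (b,c), then (a,b).
template<class Comp, class T>
void sort3(Comp comp, T& a, T& b, T& c)
{
    if (comp(b, a)) std::swap(a, b); // a <= b
    if (comp(c, b)) std::swap(b, c); // c is now the maximum
    if (comp(b, a)) std::swap(a, b); // restore a <= b
}
Called as sort3(comp, a, b, c), it produces the same result as my_sort above for three arguments.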
Let's say I would like to create an unordered set of unordered multisets of unsigned int. For this, I need to create a hash function to calculate a hash of the unordered multiset. In fact, it has to be good for CRC as well.
One obvious solution is to put the items in a vector, sort them and return a hash of the result. This seems to work, but it is expensive.
Another approach is to xor the values, but obviously if I have one item twice or none the result will be the same - which is not good.
Any ideas how I can implement this more cheaply? I have an application that will be doing this for thousands upon thousands of sets, and relatively big ones too.
Since it is a multiset, you would like the hash value to be the same for identical multisets, whose representations might have the same elements presented, added, or deleted in a different order. You would then like the hash value to be commutative, easy to update, and to change for each change in elements. You would also like two changes to not readily cancel their effect on the hash.
One operation that meets all but the last criterion is addition: just sum the elements. To keep the sum bounded, do the sum modulo the size of your hash value (e.g. modulo 2^64 for a 64-bit hash). To make sure that inserting or deleting zero values changes the hash, add one to each value first.
A drawback of the sum is that two changes can readily cancel, e.g. replacing 1 3 with 2 2. To address that, you can use the same approach and sum a polynomial of the entries, still retaining commutativity. E.g. instead of summing x + 1, you can sum x^2 + x + 1. Now it is more difficult to contrive sets of changes with the same sum.
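A sketch of that idea, relying on the natural mod-2^64 wraparound of std::size_t arithmetic (the function name is my own):
#include <cstddef>
#include <unordered_set>

// Commutative hash: sum x^2 + x + 1 over all elements; the sum is
// implicitly taken modulo 2^64 through unsigned wraparound.
std::size_t multiset_hash(const std::unordered_multiset<unsigned int>& s)
{
    std::size_t h = 0;
    for (unsigned int v : s) {
        std::size_t x = v;
        h += x * x + x + 1;
    }
    return h;
}
Insertion and deletion can update the hash incrementally by adding or subtracting the element's term.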
Here's a reasonable hash function for std::unordered_multiset<int>. It would be better if the computations were taken mod a large prime, but the idea stands.
#include <iostream>
#include <unordered_set>
namespace std {
template<>
struct hash<unordered_multiset<int>> {
typedef unordered_multiset<int> argument_type;
typedef std::size_t result_type;
const result_type BASE = static_cast<result_type>(0xA67);

// Computes BASE^ex by binary exponentiation; wraps mod 2^64.
result_type log_pow(result_type ex) const {
result_type res = 1;
result_type base = BASE;
while (ex > 0) {
if (ex % 2) {
res = res * base;
}
base *= base;
ex /= 2;
}
return res;
}
result_type operator()(argument_type const & val) const {
result_type h = 0;
for (const int& el : val) {
h += log_pow(el);
}
return h;
}
};
} // namespace std
int main() {
std::unordered_set<std::unordered_multiset<int>> mySet;
std::unordered_multiset<int> set1{1,2,3,4};
std::unordered_multiset<int> set2{1,1,2,2,3,3,4,4};
std::cout << "Hash 1: " << std::hash<std::unordered_multiset<int>>()(set1)
<< std::endl;
std::cout << "Hash 2: " << std::hash<std::unordered_multiset<int>>()(set2)
<< std::endl;
return 0;
}
Output:
Hash 1: 2290886192
Hash 2: 286805088
When the modulus is a prime p, the number of collisions is proportional to 1/p. I'm not sure what the analysis is for powers of two. You can make updates to the hash efficient by adding/subtracting BASE^x when you insert/remove the integer x.
Implement the inner multiset as a value -> count hash map.
This avoids the problem of an even number of equal elements cancelling out via XOR: instead of XOR-ing each element directly, you construct a new number from the count and the value (e.g. by multiplying them), and then build the full hash by XOR-ing those numbers.
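A sketch of that idea (the odd mixing constant is my own arbitrary choice, not part of the answer):
#include <cstddef>
#include <functional>
#include <unordered_map>

// Inner multiset stored as value -> count. Each distinct value
// contributes exactly one XOR term, so duplicates cannot cancel.
std::size_t counted_multiset_hash(const std::unordered_map<unsigned int, std::size_t>& counted)
{
    std::hash<std::size_t> hasher;
    std::size_t h = 0;
    for (const auto& [value, count] : counted)
        h ^= hasher(static_cast<std::size_t>(value) * 0x9E3779B97F4A7C15ull + count);
    return h;
}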
I have a function, f(a,b), that accepts two inputs. I do not know ahead of time which values of a and b will be used. I'm okay with being a little wasteful on memory (I care about speed). I want to be able to check if the output of f(a,b) has already been delivered, and if so, deliver that output again without re-running through the f(a,b) process.
Trivially easy to do in Python with decorators, but C++ is way over my head here.
I would use a std::map (or maybe an std::unordered_map) whose key is a std::pair, or perhaps use a map of maps.
C++11 improvements are probably helpful in that case. Or maybe some Boost thing.
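The map-of-maps variant might look like this (a sketch; f here is a stand-in for the real computation, and cached_f is a hypothetical name):
#include <iostream>
#include <map>

long f(long a, long b)
{
    return a * b; // stand-in for the expensive computation
}

long cached_f(long a, long b)
{
    static std::map<long, std::map<long, long>> cache;
    auto outer = cache.find(a);
    if (outer != cache.end()) {
        auto inner = outer->second.find(b);
        if (inner != outer->second.end())
            return inner->second; // cache hit
    }
    return cache[a][b] = f(a, b); // cache miss: compute and store
}

int main()
{
    std::cout << cached_f(6, 7) << "\n"; // computed
    std::cout << cached_f(6, 7) << "\n"; // served from the cache
}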
The poster asks:
I want to be able to check if the output of f(a,b) has already been delivered, and if so, deliver that output again without re-running through the f(a,b) process.
It's pretty easy in C++ using a std::map. The fact that the function has exactly two parameters means that we can use std::pair to describe them.
#include <cstdint>
#include <iostream>
#include <map>
uint64_t real_f(int a, int b) {
std::cout << "*";
// Do something tough:
return (uint64_t)a*b;
}
uint64_t memo_f(int a, int b) {
typedef std::pair<int, int> key;
typedef std::map<key, uint64_t> map;
static map m;
key k(a,b);
map::iterator it = m.find(k);
if(it == m.end()) {
return m[k] = real_f(a, b);
}
return it->second;
}
int main () {
std::cout << memo_f(1, 2) << "\n";
std::cout << memo_f(3, 4) << "\n";
std::cout << memo_f(1, 2) << "\n";
std::cout << memo_f(3, 4) << "\n";
std::cout << memo_f(5, 6) << "\n";
}
The output of the above program is:
*2
*12
2
12
*30
The lines without asterisks represent cached results.
With C++11, you could use tasks and futures. Let f be your function:
int f(int a, int b)
{
// Do hard work.
}
Then you would schedule the function execution, which returns you a handle to the return value. This handle is called a future:
template <typename F>
std::future<typename std::result_of<F()>::type>
schedule(F f)
{
typedef typename std::result_of<F()>::type result_type;
std::packaged_task<result_type()> task(std::move(f)); // note: function type, not just the result type
auto future = task.get_future();
tasks_.push_back(std::move(task)); // Queue the task, execute later.
return future;
}
Then, you could use this mechanism as follows:
auto future = schedule(std::bind(&f, 42, 43)); // Via std::bind.
auto future = schedule([&] { return f(42, 43); }); // Lambda alternative.
if (future.valid())
{
    auto x = future.get(); // Blocks if the result of f(a,b) is not yet available.
g(x);
}
Disclaimer: my compiler does not support tasks/futures, so the code may have some rough edges.
The main point of this question is the relative expense, in CPU and RAM, of calculating f(a,b) versus keeping some sort of lookup table to cache results.
Since an exhaustive table with a 128-bit index is not (yet) feasible, we need to reduce the lookup space to a manageable size - this can't be done without some consideration of your app:
How big is the actually used space of function inputs? Is there a pattern in it?
What about the temporal component? Do you expect repeated calculations to be close to one another or distributed along the timeline?
What about the distribution? Do you assume a tiny part of the index space consumes the majority of function calls?
I would simply start with a fixed-size array of (a,b, f(a,b)) tuples and a linear search. Depending on your pattern as asked above, you might want to
window-slide it (drop oldest on a cache miss): This is good for localized reocurrences
have (a,b,f(a,b),count) tuples with the tuple with the smallest count being expelled - this is good for non-localized occurrences
have some key function determine a position in the cache (this is good for tiny index space usage; see the sketch below)
whatever else Knuth or Google might have thought of
You might also want to benchmark repeated calculation against the lookup mechanism if the latter becomes more and more complex: std::map and friends don't come for free, even if they are high-quality implementations.
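As an illustration of the key-function option from the list above, a tiny direct-mapped cache could look like this (a sketch; the slot count and mixing formula are arbitrary assumptions, and f is a stand-in):
#include <array>
#include <cstddef>
#include <cstdint>

std::uint64_t f(int a, int b) { return std::uint64_t(a) * b; } // stand-in

// Each (a,b) maps to exactly one slot; a new entry simply
// overwrites whatever occupied that slot (no lists, no rehashing).
struct Entry {
    int a = 0, b = 0;
    std::uint64_t value = 0;
    bool used = false;
};

constexpr std::size_t kSlots = 1024; // arbitrary power of two
std::array<Entry, kSlots> cache;

std::uint64_t cached_f(int a, int b)
{
    const std::size_t slot =
        (static_cast<std::size_t>(a) * 31u + static_cast<std::size_t>(b)) % kSlots;
    Entry& e = cache[slot];
    if (e.used && e.a == a && e.b == b)
        return e.value;              // hit: reuse the stored result
    e = Entry{a, b, f(a, b), true};  // miss: compute and overwrite
    return e.value;
}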
The easiest way is to use std::map. std::unordered_map does not work out of the box here, because the standard library provides no std::hash specialization for std::pair (you would have to write your own). With std::map you can do the following:
std::map<std::pair<int, int>, int> mp;

int func(int a, int b)
{
    auto it = mp.find({a, b});
    if (it != mp.end()) return it->second; // cached result
    int value = /* compute f(a, b) */;
    mp[{a, b}] = value;
    return value;
}