Hash function for sequences of tuples - Duplicate elimination - c++

I want to eliminate duplicates of sequences of tuples. These sequences look like this:
1. (1,1)(2,5,9)(2,3,10)(2,1)
2. (1,2)(3,2,1)(2,5,9)(2,1)
3. (1,1)(2,5,9)(2,3,10)(2,1)
4. (2,1)(2,3,10)(2,5,9)(1,1)
5. (2,1)(2,3,10)(1,1)
6. (1,1)(2,5,9)(2,3,10)(2,2)
The number of entries per tuple varies, as does the number of tuples per sequence. Since I have lots of sequences which I ultimately want to deal with in parallel using CUDA, I thought that calculating a hash per sequence would be an efficient way to identify duplicate sequences.
How would such a hash function be implemented?
And: How big is the collision probability of two different sequences producing the same final hash value?
I have two requirements, and I am not sure whether they can be fulfilled:
a) Can such a hash be calculated on the fly?
I want to avoid the storage of the full sequences, therefore I'd like to do something like this:
h = 0; // init hash
...
h = h + hash(1,1);
...
h = h + hash(2,5,9);
...
h = h + hash(2,3,10);
...
h = h + hash(2,1);
where + is any operator which combines hashes of tuples.
b) Can such a hash be independent of the "direction" of the sequence?
In the above example, sequences 1. and 4. consist of the same tuples in reverse order, and I would like to identify them as duplicates.

For hashing you can use std::hash<std::size_t> or whatever (unsigned) integer type you use. Assuming the combined hash is well distributed, the probability that two particular different sequences collide is about 1.0/std::numeric_limits<std::size_t>::max() (roughly 1/2^64 for a 64-bit std::size_t), which is very small. To make it a bit more convenient to use, you can write your own tuple hasher:
#include <cmath>
#include <cstddef>
#include <functional>
#include <tuple>
namespace hash_tuple
{
// boost-style hash_combine; inline so the header can be included from several translation units
inline std::size_t hash_combine(std::size_t l, std::size_t r) noexcept
{
constexpr static const double phi = 1.6180339887498949025257388711906969547271728515625;
static const double val = std::pow(2ULL, 4ULL * sizeof(std::size_t));
static const std::size_t magic_number = val / phi;
l ^= r + magic_number + (l << 6) + (l >> 2);
return l;
}
// primary template: defer to std::hash for non-tuple types
template <typename TT>
struct hash
{
std::size_t operator()(TT const& tt) const noexcept
{
return std::hash<TT>()(tt);
}
};
namespace detail
{
// combines the hashes of elements 0..Index, first to last
template <class TupleT, std::size_t Index = std::tuple_size<TupleT>::value - 1ULL>
struct HashValueImpl
{
static std::size_t apply(std::size_t seed, TupleT const& tuple) noexcept
{
seed = HashValueImpl<TupleT, Index - 1ULL>::apply(seed, tuple);
seed = hash_combine(seed, hash<typename std::tuple_element<Index, TupleT>::type>()(std::get<Index>(tuple)));
return seed;
}
};
template <class TupleT>
struct HashValueImpl<TupleT, 0ULL>
{
static std::size_t apply(std::size_t seed, TupleT const& tuple) noexcept
{
seed = hash_combine(seed, hash<typename std::tuple_element<0, TupleT>::type>()(std::get<0>(tuple)));
return seed;
}
};
}
template <typename ... TT>
struct hash<std::tuple<TT...>>
{
std::size_t operator()(std::tuple<TT...> const& tt) const noexcept
{
std::size_t seed = 0;
seed = detail::HashValueImpl<std::tuple<TT...> >::apply(seed, tt);
return seed;
}
};
}
Thus you can write code like
using hash_tuple::hash;
auto mytuple = std::make_tuple(3, 2, 1, 0);
auto hasher = hash<decltype(mytuple)>();
std::size_t mytuple_hash = hasher(mytuple);
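The per-pair figure quoted above understates the risk once many sequences are compared at the same time. By the birthday bound, hashing N distinct sequences into b-bit values gives a probability of at least one collision of roughly 1 - exp(-N(N-1)/2^(b+1)), assuming the hash behaves like a uniformly random function. A small helper to evaluate that estimate (the name collision_probability is hypothetical, not part of the answer):

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least one pair among n distinct items
// collides when hashed uniformly into 2^bits values (birthday bound).
// Assumes the hash behaves like a random function.
double collision_probability(double n, int bits)
{
    const double buckets = std::ldexp(1.0, bits);  // 2^bits
    const double pairs = n * (n - 1.0) / 2.0;      // number of distinct pairs
    // -expm1(-x) == 1 - exp(-x), but stays accurate for tiny x
    return -std::expm1(-pairs / buckets);
}
```

For example, about 4.3 billion (2^32) sequences with a 64-bit hash give a probability near 0.39, whereas a million sequences stay around 2.7e-8, so the estimate depends strongly on how many sequences you deduplicate at once.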
To fulfill your constraint b, we need two hashes for each tuple: the normal hash and the hash of the reversed tuple.
So first we need a way to reverse a tuple (std::index_sequence requires C++14 and <utility>):
template<typename T, typename TT = typename std::remove_reference<T>::type, size_t... I>
auto reverse_impl(T&& t, std::index_sequence<I...>)
-> std::tuple<typename std::tuple_element<sizeof...(I) - 1 - I, TT>::type...>
{
return std::make_tuple(std::get<sizeof...(I) - 1 - I>(std::forward<T>(t))...);
}
template<typename T, typename TT = typename std::remove_reference<T>::type>
auto reverse(T&& t)
-> decltype(reverse_impl(std::forward<T>(t),
std::make_index_sequence<std::tuple_size<TT>::value>()))
{
return reverse_impl(std::forward<T>(t),
std::make_index_sequence<std::tuple_size<TT>::value>());
}
Then we can calculate our hashes
auto t0 = std::make_tuple(1, 2, 3, 4, 5, 6);
auto t1 = std::make_tuple(6, 5, 4, 3, 2, 1);
using hash_tuple::hash;
auto hasher = hash<decltype(t0)>();
std::size_t t0hash = hasher(t0);
std::size_t t1hash = hasher(t1);
std::size_t t0hsah = hasher(reverse(t0));
std::size_t t1hsah = hasher(reverse(t1));
And if hash_combine(t0hash, t1hash) == hash_combine(t1hsah, t0hsah), you have found what you want: the combined hash of the sequence (t0, t1) equals that of its fully reversed form (reverse(t1), reverse(t0)). You can apply this "inner-tuple-hashing-mechanic" to the hashes of many tuples in the same way.
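Note that the answer reverses the elements inside each tuple, whereas in the question's sequences 1. and 4. only the order of the tuples is reversed and each tuple keeps its element order. As a different technique (swapped in here, not the answer's method), a polynomial hash evaluated in both directions can be updated one tuple at a time, which covers both a) and b). This is a sketch under assumptions: tuples are passed as lists of unsigned 64-bit integers, and the class name SequenceHasher and its constants are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical sketch: an incrementally updated, direction-independent hash
// for a sequence of tuples. Arithmetic is modulo 2^64 (unsigned overflow wraps).
// With tuple hashes h1..hn and base B:
//   forward  = h1*B^(n-1) + h2*B^(n-2) + ... + hn
//   backward = h1 + h2*B + ... + hn*B^(n-1)
// Reversing the tuple order swaps forward and backward, so the smaller of the
// two is the same for a sequence and its reverse.
class SequenceHasher
{
public:
    // Feed the next tuple of the sequence (elements in their original order).
    void add(std::initializer_list<std::uint64_t> tuple)
    {
        const std::uint64_t h = hash_tuple(tuple);
        forward_ = forward_ * kBase + h;     // shift earlier terms up, append h
        backward_ = backward_ + h * power_;  // append h at the highest power
        power_ *= kBase;
    }

    // Direction-independent hash of all tuples added so far.
    std::uint64_t value() const { return forward_ < backward_ ? forward_ : backward_; }

private:
    static constexpr std::uint64_t kBase = 0x100000001B3ULL;  // odd constant (assumption)

    // splitmix64 finalizer, used to scramble intermediate values.
    static std::uint64_t mix(std::uint64_t x)
    {
        x ^= x >> 30;
        x *= 0xBF58476D1CE4E5B9ULL;
        x ^= x >> 27;
        x *= 0x94D049BB133111EBULL;
        x ^= x >> 31;
        return x;
    }

    // Order-sensitive hash of one tuple; the length is mixed in first so that
    // (1,2) and (1,2,0) hash differently.
    static std::uint64_t hash_tuple(std::initializer_list<std::uint64_t> tuple)
    {
        std::uint64_t h = mix(tuple.size());
        for (std::uint64_t v : tuple)
            h = mix(h ^ (v + 0x9E3779B97F4A7C15ULL + (h << 6) + (h >> 2)));
        return h;
    }

    std::uint64_t forward_ = 0;
    std::uint64_t backward_ = 0;
    std::uint64_t power_ = 1;  // B^(number of tuples added so far)
};
```

Only three 64-bit words of state are kept per sequence, so nothing has to be stored (requirement a), and the integer-only arithmetic should map well to a CUDA kernel. Taking the minimum of the two directions is one common way to canonicalize direction; whether its collision behaviour is good enough for your data is worth measuring.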

Related

create tuple of nth element in the cartesian product of input vectors

I have a Python function which returns the nth element in the cartesian product of a number of input arrays:
import numpy
def prod(n, arrs):
out = []
for i,arr in enumerate(arrs):
denom = numpy.prod([ len(p) for p in arrs[i+1:] ], dtype=int)
idx = n // denom % len(arr)
out.append( arr[idx] )
return out
This works great:
a = [ 1000, 1100, 1200, 1300, 1400 ]
b = [ 1.0, 1.5, 2.0, 2.5, 3.0, 3.5 ]
c = [ -2, -1, 0, 1, 2 ]
for n in range(20, 30):
i = prod(n, [a, b, c])
print(n, i)
20 [1000, 3.0, -2]
21 [1000, 3.0, -1]
22 [1000, 3.0, 0]
23 [1000, 3.0, 1]
24 [1000, 3.0, 2]
25 [1000, 3.5, -2]
26 [1000, 3.5, -1]
27 [1000, 3.5, 0]
28 [1000, 3.5, 1]
29 [1000, 3.5, 2]
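The formula in prod treats n as a mixed-radix number whose digits are the per-array indices, with the last array varying fastest (and n wrapping around modulo the product of all sizes). A plain C++ sketch of just the index computation, without any template machinery, may make the answers below easier to follow; the helper name nth_indices is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Decompose n into one index per array; sizes[i] is the length of array i.
// The last array varies fastest, matching the Python prod(n, arrs).
std::vector<std::size_t> nth_indices(std::size_t n, const std::vector<std::size_t>& sizes)
{
    std::vector<std::size_t> idx(sizes.size());
    for (std::size_t i = sizes.size(); i-- > 0;) {
        idx[i] = n % sizes[i];  // digit for array i
        n /= sizes[i];          // continue with the next, slower-varying array
    }
    return idx;
}
```

With sizes {5, 6, 5}, n = 20 yields {0, 4, 0}, i.e. [1000, 3.0, -2] in the example above.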
Now I would like to translate this to C++ (C++17 at most):
template<typename... Ts>
auto prod(std::size_t n, const std::vector<Ts>&... vs)
-> std::tuple<const std::decay_t<typename std::vector<Ts>::value_type>&...>
{
// template magic here
}
Can someone help me with the template magic required to construct the tuple using the above formula?
First, let's just auto-deduce the function's return type for simplicity.
Next, index-sequences are neat. With them, it can be done.
With C++20, we could receive the indices directly in a templated lambda. Before that, we need an extra function.
Finally, we have to start creating the indices from the end, either storing the indices and then using them in reverse order or reversing the resulting tuple.
template <class T, std::size_t... Ns>
static auto prod_impl(std::size_t n, T tuple, std::index_sequence<Ns...>) {
auto f = [&](auto N){ auto r = n % N; n /= N; return r; };
auto x = std::forward_as_tuple(std::get<(sizeof...(Ns)) - Ns - 1>(tuple)[f(std::get<(sizeof...(Ns)) - Ns - 1>(tuple).size())]...);
return std::forward_as_tuple(std::get<(sizeof...(Ns)) - Ns - 1>(x)...);
}
template<class... Ts>
auto prod(std::size_t n, const std::vector<Ts>&... vs) {
return prod_impl(n, std::forward_as_tuple(vs...), std::make_index_sequence<sizeof...(vs)>());
}
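Note that the order in which function arguments are evaluated is unspecified, so the version above relies on the calls to f inside std::forward_as_tuple(...) happening left to right, which is not guaranteed. A simpler alternative for the inner function stores the indices in an array instead; the elements of a braced initializer list are evaluated left to right, so the calls to f happen in a well-defined order: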
template <class T, std::size_t... Ns>
static auto prod_impl(std::size_t n, T tuple, std::index_sequence<Ns...>) {
auto f = [&](auto N){ auto r = n % N; n /= N; return r; };
std::size_t Is[] = { f(std::get<sizeof...(Ns) - Ns - 1>(tuple).size())... , 0};
// Is[j] holds the index for vector (size - 1 - j), so vector Ns uses Is[size - 1 - Ns]
return std::forward_as_tuple(std::get<Ns>(tuple)[Is[sizeof...(Ns) - Ns - 1]]...);
}
The other answers have already mentioned the indices trick. Here is my attempt at it, converted as directly as possible from your python code:
#include <array>
#include <cstddef>
#include <functional>
#include <iterator>
#include <numeric>
#include <tuple>
#include <utility>
#include <vector>
template <typename T, std::size_t... I>
auto prod_impl(std::size_t n, T tuple, std::index_sequence<I...>) {
std::array sizes{ std::size(std::get<I>(tuple))... };
auto enumerator = [&sizes,n](std::size_t i, auto&& arr) -> decltype(auto) {
auto denom = std::accumulate(std::begin(sizes) + i + 1, std::end(sizes), std::size_t{1}, std::multiplies<>{});
auto idx = (n / denom) % std::size(arr);
return arr[idx];
};
return std::forward_as_tuple(enumerator(I, std::get<I>(tuple))...);
}
template<typename... Ts, typename Is = std::index_sequence_for<Ts...>>
auto prod(std::size_t n, const std::vector<Ts>&... vs) {
return prod_impl(n, std::forward_as_tuple(vs...), Is{});
}
(Live example: http://coliru.stacked-crooked.com/a/a8b975c29d429054)
We can use the "indices trick" to get indices associated with the different vectors.
This gives us the following C++14 solution:
#include <array>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>
using std::array;
using std::size_t;
template <size_t NDim>
constexpr array<size_t, NDim> delinearize_coordinates(
size_t n, array<size_t, NDim> dimensions)
{
// This might be optimizable into something nicer, maybe even a one-liner
array<size_t, NDim> result{};
for(size_t i = 0; i < NDim; i++) {
result[NDim-1-i] = n % dimensions[NDim-1-i];
n = n / dimensions[NDim-1-i];
};
return result;
}
template<size_t... Is, typename... Ts>
auto prod_inner(
std::index_sequence<Is...>,
size_t n,
const std::vector<Ts>&... vs)
-> std::tuple<const std::decay_t<typename std::vector<Ts>::value_type>&...>
{
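// std::tie stores references; std::make_tuple would copy the vectors and the returned references would dangle
auto vs_as_tuple = std::tie( vs ... );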
auto coordinates = delinearize_coordinates<sizeof...(Ts)>(n, { vs.size()... });
return { std::get<Is>(vs_as_tuple)[coordinates[Is]] ... };
}
template<typename... Ts>
auto prod(size_t n, const std::vector<Ts>&... vs)
-> std::tuple<const std::decay_t<typename std::vector<Ts>::value_type>&...>
{
return prod_inner(std::make_index_sequence<sizeof...(Ts)>{}, n,
std::forward<const std::vector<Ts>&>(vs)...);
}
Notes:
Unlike in your code, I've factored out the function which translates a single number into a sequence of coordinates.
I think you can get this down to C++11 if you provide your own make_index_sequence.
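For comparison, here is a compact C++17 sketch (not taken from any of the answers) that computes the indices in an ordinary loop, so it does not depend on the evaluation order of a pack expansion, and uses the index sequence only to build the tuple of references. It assumes every vector is non-empty; as with the answers above, the returned references are valid only while the vectors are alive.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

template <class... Ts, std::size_t... Is>
std::tuple<const Ts&...> prod_impl(std::size_t n, std::index_sequence<Is...>,
                                   const std::vector<Ts>&... vs)
{
    constexpr std::size_t N = sizeof...(Ts);
    const std::array<std::size_t, N> sizes{vs.size()...};
    std::array<std::size_t, N> idx{};
    // Mixed-radix decomposition of n: the last vector varies fastest.
    for (std::size_t i = N; i-- > 0;) {
        idx[i] = n % sizes[i];
        n /= sizes[i];
    }
    // vs and Is are expanded in lockstep: element Is of the result is vs[idx[Is]].
    return std::tuple<const Ts&...>(vs[idx[Is]]...);
}

template <class... Ts>
std::tuple<const Ts&...> prod(std::size_t n, const std::vector<Ts>&... vs)
{
    return prod_impl(n, std::index_sequence_for<Ts...>{}, vs...);
}
```

With the vectors a, b and c from the question, prod(20, a, b, c) refers to (1000, 3.0, -2), matching the Python output.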

Get linear index for multidimensional access

I'm trying to implement a multidimensional std::array, which holds a contiguous block of memory of size Dim-n-1 * Dim-n-2 * ... * Dim-1. For that, I use private inheritance from std::array:
constexpr std::size_t factorise(std::size_t value)
{
return value;
}
template<typename... Ts>
constexpr std::size_t factorise(std::size_t value, Ts... values)
{
return value * factorise(values...);
}
template<typename T, std::size_t... Dims>
class multi_array : std::array<T, factorise(Dims...)>
{
// using directive and some stuff here ...
template<typename... Indexes>
reference operator() (Indexes... indexes)
{
return base_type::operator[] (linearise(std::make_integer_sequence<Dims...>(), indexes...)); // Not legal, just to explain the need.
}
}
For instance, multi_array<int, 5, 2, 8, 12> arr; arr(2, 1, 4, 3) = 12; should access the linear index idx = 2*(5*2*8) + 1*(2*8) + 4*(8) + 3.
I suppose that I have to use std::integer_sequence, passing an integer sequence and the list of indexes to the linearise function, but I don't know how to do it. What I want is something like:
template<template... Dims, std::size_t... Indexes>
auto linearise(std::integer_sequence<int, Dims...> dims, Indexes... indexes)
{
return (index * multiply_but_last(dims)) + ...;
}
With multiply_but_last multiplying all remaining dimensions but the last (I see how to implement it with a constexpr variadic template function, as for factorise, but I don't understand whether it is possible with std::integer_sequence).
I'm a novice in variadic template manipulation and std::integer_sequence, and I think that I'm missing something. Is it possible to get the linear index computation without overhead (i.e. as if the operation had been hand-written)?
Thank you very much for your help.
The following might help:
#include <array>
#include <cassert>
#include <iostream>
template <std::size_t, typename T> using alwaysT_t = T;
template<typename T, std::size_t ... Dims>
class MultiArray
{
public:
const T& operator() (alwaysT_t<Dims, std::size_t>... indexes) const
{
return values[computeIndex(indexes...)];
}
T& operator() (alwaysT_t<Dims, std::size_t>... indexes)
{
return values[computeIndex(indexes...)];
}
private:
size_t computeIndex(alwaysT_t<Dims, std::size_t>... indexes_args) const
{
constexpr std::size_t dimensions[] = {Dims...};
std::size_t indexes[] = {indexes_args...};
size_t index = 0;
size_t mul = 1;
for (size_t i = 0; i != sizeof...(Dims); ++i) {
assert(indexes[i] < dimensions[i]);
index += indexes[i] * mul;
mul *= dimensions[i];
}
assert(index < (Dims * ...));
return index;
}
private:
std::array<T, (Dims * ...)> values;
};
I replaced your factorise with a fold expression (C++17).
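Note that computeIndex above lets the first index vary fastest (column-major order). If the last index should vary fastest instead (row-major order, as with C's built-in arrays), a C++17 fold expression can apply Horner's scheme directly over the indices. This is a sketch; the struct name row_major is hypothetical. Because the dimensions are template arguments, the compiler can usually reduce the call to the multiply-adds you would write by hand, although that is up to the optimizer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t... Dims>
struct row_major
{
    static constexpr std::size_t rank = sizeof...(Dims);
    static constexpr std::array<std::size_t, rank> dims{Dims...};
    static constexpr std::size_t size = (Dims * ... * 1);

    // Horner's scheme: ((i0 * d1 + i1) * d2 + i2) * d3 + i3 ...
    template <class... Idx>
    static constexpr std::size_t linearise(Idx... idx)
    {
        static_assert(sizeof...(Idx) == rank, "one index per dimension");
        std::size_t result = 0;
        std::size_t k = 0;
        // Comma fold: evaluated left to right. At k == 0 the result is still 0,
        // so multiplying it by dims[0] is harmless.
        ((result = result * dims[k++] + static_cast<std::size_t>(idx)), ...);
        return result;
    }
};
```

For example, row_major<5, 2, 8, 12>::linearise(2, 1, 4, 3) evaluates to 2*(2*8*12) + 1*(8*12) + 4*12 + 3 = 531, and it can be used in constant expressions.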
I have a very simple function that converts a multi-dimensional index to a 1D index.
#include <initializer_list>
template<typename ...Args>
inline constexpr size_t IDX(const Args... params) {
constexpr size_t NDIMS = sizeof...(params) / 2 + 1;
std::initializer_list<int> args{params...};
auto ibegin = args.begin();
auto sbegin = ibegin + NDIMS;
size_t res = 0;
for (int dim = 0; dim < NDIMS; ++dim) {
size_t factor = dim > 0 ? sbegin[dim - 1] : 0;
res = res * factor + ibegin[dim];
}
return res;
}
You may need to add "-Wno-c++11-narrowing" flag to your compiler if you see a warning like non-constant-expression cannot be narrowed from type 'int'.
Example usage:
2D array
int array2D[rows*cols];
// Usually, you need to access the element at (i,j) like this:
int elem = array2D[i * cols + j]; // = array2D[i,j]
// Now, you can do it like this:
int elem = array2D[IDX(i,j,cols)]; // = array2D[i,j]
3D array
int array3D[rows*cols*depth];
// Usually, you need to access the element at (i,j,k) like this:
int elem = array3D[(i * cols + j) * depth + k]; // = array3D[i,j,k]
// Now, you can do it like this:
int elem = array3D[IDX(i,j,k,cols,depth)]; // = array3D[i,j,k]
ND array
// shapes = {s1,s2,...,sn}
T arrayND[s1*s2*...*sn]
// indices = {e1,e2,...,en}
T elem = arrayND[IDX(e1,e2,...,en,s2,...,sn)] // = arrayND[e1,e2,...,en]
Note that the shape parameters passed to IDX(...) begin at the second shape, which is s2 in this case.
BTW: This implementation requires C++14.
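Instead of silencing the narrowing warning mentioned above with a compiler flag, the arguments can be converted to std::size_t explicitly before they are stored. A sketch of that variant with the same argument convention as IDX (the name IDX_sizet is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Same convention as IDX: first the N indices, then the N-1 trailing shapes
// (s2..sn). Converting every argument to std::size_t up front avoids the
// narrowing conversion that std::initializer_list<int> runs into.
template <typename... Args>
constexpr std::size_t IDX_sizet(const Args... params)
{
    constexpr std::size_t ndims = sizeof...(params) / 2 + 1;
    const std::size_t args[] = {static_cast<std::size_t>(params)...};
    std::size_t res = 0;
    for (std::size_t dim = 0; dim < ndims; ++dim) {
        // Shape of this dimension; unused for dim == 0 because res is still 0.
        const std::size_t factor = dim > 0 ? args[ndims + dim - 1] : 0;
        res = res * factor + args[dim];
    }
    return res;
}
```

Like IDX, it requires C++14 (loops in constexpr functions) and can be evaluated at compile time when all arguments are constants.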

Recursive iteration over parameter pack

I'm currently trying to implement a function, which accepts some data and a parameter-pack ...args. Inside I call another function, which recursively iterates the given arguments.
Sadly, I'm having trouble compiling it. Apparently the compiler keeps trying to instantiate the recursive function, but never the overload that should stop the recursion.
Does anyone have an idea what the issue is?
class Sample
{
public:
template<class ...TArgs, std::size_t TotalSize = sizeof...(TArgs)>
static bool ParseCompositeFieldsXXX(const std::vector<std::string> &data, TArgs &&...args)
{
auto field = std::get<0>(std::forward_as_tuple(std::forward<TArgs>(args)...));
//bool ok = ParseField(field, 0, data);
auto x = data[0];
bool ok = true;
if (TotalSize > 1)
return ok && ParseCompositeFields<1>(data, std::forward<TArgs>(args)...);
return ok;
}
private:
template<std::size_t Index, class ...TArgs, std::size_t TotalSize = sizeof...(TArgs)>
static bool ParseCompositeFields(const std::vector<std::string> &data, TArgs &&...args)
{
auto field = std::get<Index>(std::forward_as_tuple(std::forward<TArgs>(args)...));
//bool ok = ParseField(field, Index, data);
auto x = data[Index];
bool ok = true;
if (Index < TotalSize)
return ok && ParseCompositeFields<Index + 1>(data, std::forward<TArgs>(args)...);
return ok;
}
template<std::size_t Index>
static bool ParseCompositeFields(const std::vector<std::string> &data)
{
volatile int a = 1 * 2 + 3;
}
};
int wmain(int, wchar_t**)
{
short x1 = 0;
std::string x2;
long long x3 = 0;
Sample::ParseCompositeFieldsXXX({ "1", "Sxx", "-5,32" }, x1, x2, x3);
return 0;
}
\utility(446): error C2338: tuple index out of bounds
...
\main.cpp(56): note: see reference to class template
instantiation 'std::tuple_element<3,std::tuple>' being compiled
Alternative approach
You seem to be using a rather old technique here. A simple pack expansion is what you're looking for:
#include <cstddef>
#include <utility>
#include <tuple>
#include <vector>
#include <string>
class Sample
{
template <std::size_t index, typename T>
static bool parse_field(T&& field, const std::vector<std::string>& data)
{
return true;
}
template <typename Tuple, std::size_t ... sequence>
static bool parse_impl(Tuple&& tup, const std::vector<std::string>& data, std::index_sequence<sequence...>)
{
using expander = bool[];
expander expansion{parse_field<sequence>(std::get<sequence>(tup), data)...};
bool result = true;
for (auto iter = std::begin(expansion); iter != std::end(expansion); ++iter)
{
result = result && *iter;
}
return result;
}
public:
template<class ...TArgs, std::size_t TotalSize = sizeof...(TArgs)>
static bool ParseCompositeFieldsXXX(const std::vector<std::string> &data, TArgs &&...args)
{
return parse_impl(std::forward_as_tuple(std::forward<TArgs>(args)...),
data, std::make_index_sequence<sizeof...(TArgs)>{});
}
};
int main()
{
short x1 = 0;
std::string x2;
long long x3 = 0;
Sample::ParseCompositeFieldsXXX({ "1", "Sxx", "-5,32" }, x1, x2, x3);
return 0;
}
If what you are processing looks like an array, treat it as an array. Don't use recursion unless it is required, as it usually makes things more complicated. Of course, there are exceptions.
Making it better
As you can see, one doesn't even need a class here. Just remove it.
Possible problems
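One problem might arise if the order of invocation matters. The elements of a braced initializer list, including those produced by a pack expansion, are guaranteed to be evaluated left to right since C++11, but some older compilers did not implement this correctly, so it might fail you there.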
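With C++17 the bool[] expander can be replaced by a fold expression over &&. The fold is evaluated left to right, stops at the first failing field, and yields true for an empty pack, a case the expander cannot handle because a zero-length array is ill-formed. A sketch without the class, in which parse_field remains a hypothetical stub that only checks that a token exists for the field:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical per-field parser: here it only checks that token I exists.
template <std::size_t I, typename T>
bool parse_field(T& /*field*/, const std::vector<std::string>& data)
{
    return I < data.size();
}

template <typename Tuple, std::size_t... I>
bool parse_impl(Tuple&& tup, const std::vector<std::string>& data, std::index_sequence<I...>)
{
    // Unary right fold over &&: left to right, short-circuiting, true if empty.
    return (parse_field<I>(std::get<I>(tup), data) && ...);
}

template <typename... TArgs>
bool parse_composite_fields(const std::vector<std::string>& data, TArgs&... args)
{
    return parse_impl(std::tie(args...), data, std::index_sequence_for<TArgs...>{});
}
```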
Does anyone have an idea what the issue is?
The crucial point are the lines:
if (Index < TotalSize)
return ok && ParseCompositeFields<Index + 1>(data, std::forward<TArgs>(args)...);
First of all, to be logically correct, the condition should read Index < TotalSize - 1, as tuple indices are zero-based.
Furthermore, even if Index == TotalSize - 1, the compiler is still forced to instantiate ParseCompositeFields<Index + 1> (as it has to compile the if-branch), which effectively is ParseCompositeFields<TotalSize>. This, however, leads to the error you got when trying to instantiate std::get<TotalSize>.
So in order to compile the if-branch only when the condition is fulfilled, you would have to use if constexpr (Index < TotalSize - 1) (C++17). For C++14, you have to fall back on class template specialization with static member functions:
class Sample
{
template<std::size_t Index, bool>
struct Helper {
template<class ...TArgs, std::size_t TotalSize = sizeof...(TArgs)>
static bool ParseCompositeFields(const std::vector<std::string> &data, TArgs &&...args)
{
auto field = std::get<Index>(std::forward_as_tuple(std::forward<TArgs>(args)...));
//bool ok = ParseField(field, Index, data);
auto x = data[Index];
bool ok = true;
return ok && Helper<Index + 1, (Index < TotalSize - 1)>::ParseCompositeFields(data, std::forward<TArgs>(args)...);
}
};
template<std::size_t Index>
struct Helper<Index, false> {
template<class ...TArgs, std::size_t TotalSize = sizeof...(TArgs)>
static bool ParseCompositeFields(const std::vector<std::string> &data, TArgs &&...args) {
volatile int a = 1 * 2 + 3;
return true;
}
};
public:
template<class ...TArgs, std::size_t TotalSize = sizeof...(TArgs)>
static bool ParseCompositeFieldsXXX(const std::vector<std::string> &data, TArgs &&...args)
{
auto field = std::get<0>(std::forward_as_tuple(std::forward<TArgs>(args)...));
//bool ok = ParseField(field, 0, data);
auto x = data[0];
bool ok = true;
return ok && Helper<1, (TotalSize > 1)>::ParseCompositeFields(data, std::forward<TArgs>(args)...);
}
};
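For reference, the C++17 if constexpr approach mentioned above might look roughly like this. It is a sketch: parsing is still stubbed out, and the check Index < data.size() is an assumed placeholder standing in for ParseField.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

class Sample
{
public:
    template <class... TArgs>
    static bool ParseCompositeFieldsXXX(const std::vector<std::string>& data, TArgs&&... args)
    {
        static_assert(sizeof...(TArgs) > 0, "at least one field expected");
        return ParseCompositeFields<0>(data, std::forward<TArgs>(args)...);
    }

private:
    template <std::size_t Index, class... TArgs>
    static bool ParseCompositeFields(const std::vector<std::string>& data, TArgs&&... args)
    {
        constexpr std::size_t TotalSize = sizeof...(TArgs);
        auto&& field = std::get<Index>(std::forward_as_tuple(args...));
        (void)field;                          // ParseField(field, Index, data) would go here
        const bool ok = Index < data.size();  // placeholder check (assumption)
        // Only the taken branch is instantiated, so ParseCompositeFields<TotalSize>
        // is never requested.
        if constexpr (Index + 1 < TotalSize)
            return ok && ParseCompositeFields<Index + 1>(data, std::forward<TArgs>(args)...);
        else
            return ok;
    }
};
```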

Initializer list weirdly depends on order of parameters?

I have the following snippet of code:
#include <type_traits>
#include <limits>
#include <initializer_list>
#include <cassert>
template <typename F, typename... FIn>
auto min_on(F f, const FIn&... v) -> typename std::common_type<FIn...>::type
{
using rettype = typename std::common_type<FIn...>::type;
rettype result = std::numeric_limits<rettype>::max();
(void)std::initializer_list<int>{((f(v) < result) ? (result = static_cast<rettype>(v), 0) : 0)...};
return result;
}
int main()
{
auto mod2 = [](int a)
{
return a % 2;
};
assert(min_on(mod2, 2) == 2); // PASSES as it should
assert(min_on(mod2, 3) == 3); // PASSES as it should
assert(min_on(mod2, 2, 3) == 3); // PASSES but shouldn't - should be 2
assert(min_on(mod2, 2, 3) == 2); // FAILS but shouldn't - should be 2
}
The idea behind the function template min_on is that it should return the argument x from the pack v for which f(x) is smallest.
The problem I have observed is that the order of the arguments inside the std::initializer_list somehow matters, so the code above fails, whereas this code:
assert(min_on(mod2, 3, 2) == 2);
will work. What might be wrong in here?
Your function sets result to v if f(v) < result. With mod2 as f, f(v) will only ever result in a 0, 1 or a -1. Which means that if all of your values are greater than 1, result will be set to the last v which was tested, because f(v) will always be less than result. Try putting a negative number in the middle of a bunch of positive numbers, and the negative number will always be the result, no matter where you place it.
assert(min_on(mod2, 2, 3, 4, -3, 7, 6, 5) == -3);
Perhaps you want this instead:
std::initializer_list<int>{((f(v) < f(result)) ? (result = static_cast<rettype>(v), 0) : 0)...};
The difference is that I am testing f(v) < f(result) instead of f(v) < result. However, the function is still not correct in general, because it assumes that f(std::numeric_limits<rettype>::max()) is the largest possible value of f. In the case of mod2 it works, but with something like this:
[](int a) { return -a; }
it would clearly be wrong. So perhaps you could instead require a first argument:
template <typename F, typename FirstT, typename... FIn>
auto min_on(F f, const FirstT& first, const FIn&... v)
-> typename std::common_type<FirstT, FIn...>::type
{
using rettype = typename std::common_type<FirstT, FIn...>::type;
rettype result = first;
(void)std::initializer_list<int>{((f(v) < f(result)) ? (result = static_cast<rettype>(v), 0) : 0)...};
return result;
}
Or, if you want to avoid unnecessary calls to f:
template <typename F, typename FirstT, typename... FIn>
auto min_on(F f, const FirstT& first, const FIn&... v)
-> typename std::common_type<FirstT, FIn...>::type
{
using rettype = typename std::common_type<FirstT, FIn...>::type;
rettype result = first;
auto result_trans = f(result);
auto v_trans = result_trans;
(void)std::initializer_list<int>{(
(v_trans = f(v), v_trans < result_trans)
? (result = static_cast<rettype>(v), result_trans = v_trans, 0) : 0)...};
return result;
}
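In C++17 the std::initializer_list trick can also be replaced by a fold expression over the comma operator, which likewise evaluates its operands left to right. A sketch of the last version rewritten that way, with the same behaviour (f is called once per argument, and on ties the earliest argument wins):

```cpp
#include <cassert>
#include <type_traits>

template <typename F, typename FirstT, typename... FIn>
std::common_type_t<FirstT, FIn...> min_on(F f, const FirstT& first, const FIn&... v)
{
    using rettype = std::common_type_t<FirstT, FIn...>;
    rettype result = first;
    auto result_trans = f(result);
    // One immediately invoked lambda per argument; the comma fold runs them
    // strictly left to right.
    ([&] {
        auto v_trans = f(v);
        if (v_trans < result_trans) {
            result = static_cast<rettype>(v);
            result_trans = v_trans;
        }
    }(), ...);
    return result;
}
```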

CUDA Thrust - Counting matching subarrays

I'm trying to figure out if it's possible to efficiently calculate the conditional entropy of a set of numbers using CUDA. You can calculate the conditional entropy by dividing an array into windows, then counting the number of matching subarrays/substrings for different lengths. For each subarray length, you calculate the entropy by adding together the matching subarray counts times the log of those counts. Then, whatever you get as the minimum entropy is the conditional entropy.
To give a more clear example of what I mean, here is full calculation:
The initial array is [1,2,3,5,1,2,5]. Assuming the window size is 3, this must be divided into five windows: [1,2,3], [2,3,5], [3,5,1], [5,1,2], and [1,2,5].
Next, looking at each window, we want to find the matching subarrays for each length.
The subarrays of length 1 are [1],[2],[3],[5],[1]. There are two 1s, and one of each other number. So the entropy is log(2)*2 + 3*(log(1)*1) ≈ 0.6 (using base-10 logarithms).
The subarrays of length 2 are [1,2], [2,3], [3,5], [5,1], and [1,2]. There are two [1,2]s, and three other subarrays that occur once each. The entropy is the same as for length 1: log(2)*2 + 3*(log(1)*1) ≈ 0.6.
The subarrays of length 3 are the full windows: [1,2,3], [2,3,5], [3,5,1], [5,1,2], and [1,2,5]. All five windows are unique, so the entropy is 5*(log(1)*1) = 0.
The minimum entropy is 0, meaning it is the conditional entropy for this array.
This can also be presented as a tree, where the counts at each node represent how many matches exist. The entropy for each subarray length is equivalent to the entropy for each level of the tree.
If possible, I'd like to perform this calculation on many arrays at once, and also perform the calculation itself in parallel. Does anyone have suggestions on how to accomplish this? Could thrust be useful? Please let me know if there is any additional information I should provide.
I tried solving your problem using thrust. It works, but it results in a lot of thrust calls.
Since your input size is rather small, you should process multiple arrays in parallel.
However, this requires a lot of bookkeeping, as you will see in the following code.
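Your input range is limited to [1,5], which is equivalent to [0,4]. The general idea is that (theoretically) any tuple out of this range (e.g. {1,2,3}) can be represented as a number in base 4 (e.g. 1+2*4+3*16 = 57). Strictly speaking, five distinct values need base 5 for this representation to be unique: with base 4, a digit of 4 overlaps with a carry into the next position (e.g. {4,0} and {0,1} both map to 4). The same applies to the code below, where random_filler<Integer,0,base> draws values from 0 to 4 inclusive.
In practice we are limited by the size of the integer type. For a 32-bit unsigned integer, this leads to a maximum tuple size of 16 (4^16 = 2^32), which is also the maximum window size the following code can handle; switching to a 64-bit unsigned integer raises the limit to 32.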
Let's assume the input data is structured like this:
We have 2 arrays we want to process in parallel, each array is of size 5 and window size is 3.
{{0,0,3,4,4},{0,2,1,1,3}}
We can now generate all windows:
{{0,0,3},{0,3,4},{3,4,4}},{{0,2,1},{2,1,1},{1,1,3}}
Using a per tuple prefix sum and applying the aforementioned representation of each tuple as a single base-4 number, we get:
{{0,0,48},{0,12,76},{3,19,83}},{{0,8,24},{2,6,22},{1,5,53}}
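The per-window prefix encoding can be checked on the host with a few lines of C++ (prefix_codes is a hypothetical helper, not part of the Thrust code). Digit k of a window is weighted by base^k, and each running sum encodes the subarray of that length; the encoding is only unambiguous when every digit is smaller than the base.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Codes of all prefixes of one window: prefix k (length k+1) is
// window[0] + window[1]*base + ... + window[k]*base^k.
std::vector<std::uint64_t> prefix_codes(const std::vector<std::uint64_t>& window,
                                        std::uint64_t base)
{
    std::vector<std::uint64_t> codes;
    std::uint64_t value = 0;
    std::uint64_t weight = 1;  // base^k
    for (std::uint64_t digit : window) {
        value += digit * weight;
        codes.push_back(value);
        weight *= base;
    }
    return codes;
}
```

For instance, prefix_codes({3, 4, 4}, 4) reproduces the {3,19,83} group of the first array above.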
Now we reorder the values so we have the numbers which represent a subarray of a specific length next to each other:
{{0,0,3},{0,12,19},{48,76,83}},{{0,2,1},{8,6,5},{24,22,53}}
We then sort within each group:
{{0,0,3},{0,12,19},{48,76,83}},{{0,1,2},{5,6,8},{22,24,53}}
Now we can count how often a number occurs in each group:
2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Applying the log-formula results in
0.60206,0,0,0,0,0
Now we fetch the minimum value per array:
0,0
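Before optimizing on the GPU, a serial CPU reference is handy for validating results. The following sketch follows the definition in the question directly: for each length L, it counts identical length-L prefixes across all windows and sums c*log10(c); the conditional entropy is the minimum over L. Base-10 logarithms are assumed, matching log10f in the Thrust code and the 0.6 in the question; the names entropy_per_length and conditional_entropy are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Serial reference: entropy for each subarray length 1..window.
// Requires a.size() >= window >= 1.
std::vector<double> entropy_per_length(const std::vector<int>& a, std::size_t window)
{
    const std::size_t window_count = a.size() - window + 1;
    std::vector<double> result;
    for (std::size_t len = 1; len <= window; ++len) {
        // Count how often each length-len prefix occurs across all windows.
        std::map<std::vector<int>, std::size_t> counts;
        for (std::size_t w = 0; w < window_count; ++w)
            ++counts[std::vector<int>(a.begin() + w, a.begin() + w + len)];
        double entropy = 0.0;
        for (const auto& entry : counts)
            entropy += entry.second * std::log10(static_cast<double>(entry.second));
        result.push_back(entropy);
    }
    return result;
}

// Minimum over all subarray lengths.
double conditional_entropy(const std::vector<int>& a, std::size_t window)
{
    const std::vector<double> e = entropy_per_length(a, window);
    return *std::min_element(e.begin(), e.end());
}
```

Running it on the two example arrays gives 0.60206 for length 1 of the first array and 0 everywhere else, matching the intermediate values listed above.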
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <iostream>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/gather.h>
#include <thrust/sort.h>
#include <math.h>
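#include <chrono>
#include <iterator>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/discard_iterator.h>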
#ifdef PRINT_ENABLED
#define PRINTER(name) print(#name, (name))
#else
#define PRINTER(name)
#endif
template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
std::cout << name << ":\t";
thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
std::cout << std::endl;
}
template <typename Integer, Integer Min, Integer Max>
struct random_filler
{
__device__
Integer operator()(std::size_t index) const
{
thrust::default_random_engine rng;
thrust::uniform_int_distribution<Integer> dist(Min, Max);
rng.discard(index);
return dist(rng);
}
};
template <std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize,
typename T,
std::size_t WindowCount = ArraySize - (WindowSize-1),
std::size_t PerArrayCount = WindowSize * WindowCount>
__device__ __inline__
thrust::tuple<T,T,T,T> calc_indices(const T& i0)
{
const T i1 = i0 / PerArrayCount;
const T i2 = i0 % PerArrayCount;
const T i3 = i2 / WindowSize;
const T i4 = i2 % WindowSize;
return thrust::make_tuple(i1,i2,i3,i4);
}
template <typename Iterator,
std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize,
std::size_t WindowCount = ArraySize - (WindowSize-1),
std::size_t PerArrayCount = WindowSize * WindowCount,
std::size_t TotalCount = PerArrayCount * ArrayCount
>
class sliding_window
{
public:
typedef typename thrust::iterator_difference<Iterator>::type difference_type;
struct window_functor : public thrust::unary_function<difference_type,difference_type>
{
__host__ __device__
difference_type operator()(const difference_type& i0) const
{
auto t = calc_indices<ArraySize, ArrayCount,WindowSize>(i0);
return thrust::get<0>(t) * ArraySize + thrust::get<2>(t) + thrust::get<3>(t);
}
};
typedef typename thrust::counting_iterator<difference_type> CountingIterator;
typedef typename thrust::transform_iterator<window_functor, CountingIterator> TransformIterator;
typedef typename thrust::permutation_iterator<Iterator,TransformIterator> PermutationIterator;
typedef PermutationIterator iterator;
sliding_window(Iterator first) : first(first){}
iterator begin(void) const
{
return PermutationIterator(first, TransformIterator(CountingIterator(0), window_functor()));
}
iterator end(void) const
{
return begin() + TotalCount;
}
protected:
Iterator first;
};
template <std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize,
typename Iterator>
sliding_window<Iterator, ArraySize, ArrayCount, WindowSize>
make_sliding_window(Iterator first)
{
return sliding_window<Iterator, ArraySize, ArrayCount, WindowSize>(first);
}
template <typename KeyType,
std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize>
struct key_generator : thrust::unary_function<KeyType, thrust::tuple<KeyType,KeyType> >
{
__device__
thrust::tuple<KeyType,KeyType> operator()(std::size_t i0) const
{
auto t = calc_indices<ArraySize, ArrayCount,WindowSize>(i0);
return thrust::make_tuple(thrust::get<0>(t),thrust::get<2>(t));
}
};
template <typename Integer,
std::size_t Base,
std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize>
struct base_n : thrust::unary_function<thrust::tuple<Integer, Integer>, Integer>
{
__host__ __device__
Integer operator()(const thrust::tuple<Integer, Integer> t) const
{
const auto i = calc_indices<ArraySize, ArrayCount, WindowSize>(thrust::get<0>(t));
// ipow could be optimized by precomputing a lookup table at compile time
const auto result = thrust::get<1>(t)*ipow(Base, thrust::get<3>(i));
return result;
}
// taken from http://stackoverflow.com/a/101613/678093
__host__ __device__ __inline__
Integer ipow(Integer base, Integer exp) const
{
Integer result = 1;
while (exp)
{
if (exp & 1)
result *= base;
exp >>= 1;
base *= base;
}
return result;
}
};
template <std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize,
typename T,
std::size_t WindowCount = ArraySize - (WindowSize-1),
std::size_t PerArrayCount = WindowSize * WindowCount>
__device__ __inline__
thrust::tuple<T,T,T,T> calc_sort_indices(const T& i0)
{
const T i1 = i0 % PerArrayCount;
const T i2 = i0 / PerArrayCount;
const T i3 = i1 % WindowCount;
const T i4 = i1 / WindowCount;
return thrust::make_tuple(i1,i2,i3,i4);
}
template <typename Integer,
std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize,
std::size_t WindowCount = ArraySize - (WindowSize-1),
std::size_t PerArrayCount = WindowSize * WindowCount>
struct pre_sort : thrust::unary_function<Integer, Integer>
{
__device__
Integer operator()(Integer i0) const
{
auto t = calc_sort_indices<ArraySize, ArrayCount,WindowSize>(i0);
const Integer i_result = ( thrust::get<2>(t) * WindowSize + thrust::get<3>(t) ) + thrust::get<1>(t) * PerArrayCount;
return i_result;
}
};
template <typename Integer,
std::size_t ArraySize,
std::size_t ArrayCount,
std::size_t WindowSize,
std::size_t WindowCount = ArraySize - (WindowSize-1),
std::size_t PerArrayCount = WindowSize * WindowCount>
struct generate_sort_keys : thrust::unary_function<Integer, Integer>
{
__device__
thrust::tuple<Integer,Integer> operator()(Integer i0) const
{
auto t = calc_sort_indices<ArraySize, ArrayCount,WindowSize>(i0);
return thrust::make_tuple( thrust::get<1>(t), thrust::get<3>(t));
}
};
template<typename... Iterators>
__host__ __device__
thrust::zip_iterator<thrust::tuple<Iterators...>> zip(Iterators... its)
{
return thrust::make_zip_iterator(thrust::make_tuple(its...));
}
struct calculate_log : thrust::unary_function<std::size_t, float>
{
__host__ __device__
float operator()(std::size_t i) const
{
return i*log10f(i);
}
};
int main()
{
typedef int Integer;
typedef float Real;
const std::size_t array_count = ARRAY_COUNT;
const std::size_t array_size = ARRAY_SIZE;
const std::size_t window_size = WINDOW_SIZE;
const std::size_t window_count = array_size - (window_size-1);
const std::size_t input_size = array_count * array_size;
const std::size_t base = 4;
thrust::device_vector<Integer> input_arrays(input_size);
thrust::counting_iterator<Integer> counting_it(0);
thrust::transform(counting_it,
counting_it + input_size,
input_arrays.begin(),
random_filler<Integer,0,base>());
PRINTER(input_arrays);
const int runs = 100;
auto start = std::chrono::high_resolution_clock::now();
for (int k = 0 ; k < runs; ++k)
{
auto sw = make_sliding_window<array_size, array_count, window_size>(input_arrays.begin());
const std::size_t total_count = window_size * window_count * array_count;
thrust::device_vector<Integer> result(total_count);
thrust::copy(sw.begin(), sw.end(), result.begin());
PRINTER(result);
auto ti_begin = thrust::make_transform_iterator(counting_it, key_generator<Integer, array_size, array_count, window_size>());
auto base_4_ti = thrust::make_transform_iterator(zip(counting_it, sw.begin()), base_n<Integer, base, array_size, array_count, window_size>());
thrust::inclusive_scan_by_key(ti_begin, ti_begin+total_count, base_4_ti, result.begin());
PRINTER(result);
thrust::device_vector<Integer> result_2(total_count);
auto ti_pre_sort = thrust::make_transform_iterator(counting_it, pre_sort<Integer, array_size, array_count, window_size>());
thrust::gather(ti_pre_sort,
ti_pre_sort+total_count,
result.begin(),
result_2.begin());
PRINTER(result_2);
thrust::device_vector<Integer> sort_keys_1(total_count);
thrust::device_vector<Integer> sort_keys_2(total_count);
auto zip_begin = zip(sort_keys_1.begin(),sort_keys_2.begin());
thrust::transform(counting_it,
counting_it+total_count,
zip_begin,
generate_sort_keys<Integer, array_size, array_count, window_size>());
thrust::stable_sort_by_key(result_2.begin(), result_2.end(), zip_begin);
thrust::stable_sort_by_key(zip_begin, zip_begin+total_count, result_2.begin());
PRINTER(result_2);
thrust::device_vector<Integer> key_counts(total_count);
thrust::device_vector<Integer> sort_keys_1_reduced(total_count);
thrust::device_vector<Integer> sort_keys_2_reduced(total_count);
// count how often each sub array occurs
auto zip_count_begin = zip(sort_keys_1.begin(), sort_keys_2.begin(), result_2.begin());
auto new_end = thrust::reduce_by_key(zip_count_begin,
zip_count_begin + total_count,
thrust::constant_iterator<Integer>(1),
zip(sort_keys_1_reduced.begin(), sort_keys_2_reduced.begin(), thrust::make_discard_iterator()),
key_counts.begin()
);
std::size_t new_size = new_end.second - key_counts.begin();
key_counts.resize(new_size);
sort_keys_1_reduced.resize(new_size);
sort_keys_2_reduced.resize(new_size);
PRINTER(key_counts);
PRINTER(sort_keys_1_reduced);
PRINTER(sort_keys_2_reduced);
auto log_ti = thrust::make_transform_iterator (key_counts.begin(), calculate_log());
thrust::device_vector<Real> log_result(new_size);
auto zip_keys_reduced_begin = zip(sort_keys_1_reduced.begin(), sort_keys_2_reduced.begin());
auto log_end = thrust::reduce_by_key(zip_keys_reduced_begin,
zip_keys_reduced_begin + new_size,
log_ti,
zip(sort_keys_1.begin(),thrust::make_discard_iterator()),
log_result.begin()
);
std::size_t final_size = log_end.second - log_result.begin();
log_result.resize(final_size);
sort_keys_1.resize(final_size);
PRINTER(log_result);
thrust::device_vector<Real> final_result(final_size);
auto final_end = thrust::reduce_by_key(sort_keys_1.begin(),
sort_keys_1.begin() + final_size,
log_result.begin(),
thrust::make_discard_iterator(),
final_result.begin(),
thrust::equal_to<Integer>(),
thrust::minimum<Real>()
);
final_result.resize(final_end.second-final_result.begin());
PRINTER(final_result);
}
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start);
std::cout << "took " << duration.count()/runs << " milliseconds" << std::endl;
return 0;
}
compile using
nvcc -std=c++11 conditional_entropy.cu -o benchmark -DARRAY_SIZE=1000 -DARRAY_COUNT=1000 -DWINDOW_SIZE=10 && ./benchmark
This configuration takes 133 milliseconds on my GPU (GTX 680), so around 0.1 milliseconds per array.
The implementation can definitely be optimized, e.g. using a precomputed lookup table for the base-4 conversion and maybe some of the thrust calls can be avoided.