Memory management with the CUDD package (binary decision diagrams)

I have a memory problem with the CUDD package. I've modified CUDD to support custom constant nodes, for example an expression of the form (op x y), by adding a new field to the union in DdNode's type.
I wrote a function that transforms a constant node of type bool into a 0/1-valued BDD node.
The transformation procedure works like Cudd_addApply, with one difference: Ite is used to connect the returned values instead of cuddUniqueInter, because the returned values might have a different index than the original children (sometimes even lower than the current level, which makes assertions fail).
The problem arose when I tried to reclaim the intermediate results. This is how I managed the reference counts (NULL checks omitted):
auto n = Cudd_addIthVar(dd, index);
Cudd_Ref(n);
// T, E are transformations of children
res = (T == E) ? T : Cudd_addIte(dd, n, T, E);
Cudd_Ref(res);
Cudd_RecursiveDeref(dd, n); // remove temp results
Cudd_RecursiveDeref(dd, T);
Cudd_RecursiveDeref(dd, E);
cuddCacheInsert1(dd, op, f, res);
Cudd_Deref(res);
return (res);
This approach passes Cudd_DebugCheck and Cudd_CheckKeys, but the reference counts of some internal nodes unexpectedly climb past a hundred, and the Ite operation hangs.
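For comparison, the reference-counting tail of CUDD's own recursive functions (modeled on cuddAddApplyRecur; a simplified sketch, error paths abbreviated, adapted to the unary cache insert used above) looks like this:

// T and E arrive referenced (cuddRef) from the recursive calls.
res = (T == E) ? T : cuddUniqueInter(dd, (int) index, T, E);
if (res == NULL) {
    Cudd_RecursiveDeref(dd, T);
    Cudd_RecursiveDeref(dd, E);
    return NULL;
}
cuddDeref(T); // non-recursive: res now owns the references to T and E
cuddDeref(E);
cuddCacheInsert1(dd, op, f, res);
return res;   // returned unreferenced; the caller is expected to Cudd_Ref it

Note the contrast: with cuddUniqueInter the children's references are handed over to the new node via the non-recursive cuddDeref, whereas an external function like Cudd_addIte returns an unreferenced result that holds its own references to its children, which is why the Cudd_Ref/Cudd_RecursiveDeref pairing in the question is used instead.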

How do I convert this accumulate into a reduce (or some other parallel STL algorithm)?

I am developing a Barnes–Hut simulation in C++ for a course project. Here you can find a detailed explanation of the algorithm.
I will briefly explain the datatypes used later.
A body is represented by the Body datatype, a struct containing a Vector2f representing the position and a float for the mass.
A node in the quadtree is represented by the Node datatype. The node struct contains some data, which can be: 1) its four children, corresponding to its four subquadrants (this holds only if the node is a fork in the tree); 2) its body (this holds only when the node is a leaf of the tree); 3) Empty (this holds when the node does not contain any body). Therefore, data is a std::variant.
I wrote the recursive function that calculates the net force acting on a body. It takes as input a Node (at the first call, the quadtree's root) and the Body we want to query, and returns a Vector2f representing the net force acting on the body.
Of course, the function needs to visit the variant and dispatch to the correct lambda.
Vector2f compute_approximate_net_force_on_body(const Node& node,
                                               const Body& body) {
  const auto visit_empty = [](const Empty&) -> Vector2f { return {0, 0}; };
  const auto visit_body = [body](const Body& visited) -> Vector2f {
    return compute_gravitational_force(visited, body);
  };
  const auto visit_region = [&](const Subquadrants& subquadrants) -> Vector2f {
    float distance = (body.m_position - node.center_of_mass()).norm();
    if (node.length() / distance < OMEGA) {
      // Approximation
      return bh::compute_gravitational_force(
          {node.center_of_mass(), node.total_mass()}, body);
    } else {
      return std::accumulate(
          subquadrants.begin(), subquadrants.end(), Vector2f{0, 0},
          [body](const Vector2f& total, const std::shared_ptr<Node>& curr) {
            return (total + compute_approximate_net_force_on_body(*curr, body))
                .eval();
          });
    }
  };
  return std::visit(overloaded{visit_empty, visit_body, visit_region},
                    node.data());
}
The interesting part is the one with accumulate. Essentially, it invokes the algorithm recursively on the node's four subquadrants with the same body, and accumulates the results into a Vector2f.
Since the four calls are completely independent, I thought I could make the computation parallel. Initially, I converted the accumulate into a reduce, but I later discovered that this can't work because:
the types in the signature of the binary operation must be identical (mine are not);
the binary operation must be associative and commutative (mine is not).
I am looking for suggestions on how to parallelize the recursive calls, possibly using the STL. If possible, the C++ standard must be C++17 or below. One approach that I have in mind is to use std::async and std::future, as sketched below, but it is less elegant than the accumulate-like one. Are there any others?
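For reference, a minimal sketch of that std::async approach (assuming the Node, Body, Subquadrants, and Vector2f types from above; parallel_region_force is a name made up for illustration):

#include <array>
#include <future>

Vector2f parallel_region_force(const Subquadrants& subquadrants,
                               const Body& body) {
  std::array<std::future<Vector2f>, 4> futures;
  for (int i = 0; i < 4; ++i) {
    // Each recursive call is independent, so launch it on its own thread.
    futures[i] = std::async(std::launch::async, [&subquadrants, &body, i] {
      return compute_approximate_net_force_on_body(*subquadrants[i], body);
    });
  }
  Vector2f total{0, 0};
  for (auto& f : futures)
    total += f.get();  // join and accumulate the four partial forces
  return total;
}

One caveat: spawning threads at every level of the recursion oversubscribes the machine quickly, so in practice one would only parallelize the top level or two of the tree.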
Thank you for your insights.

Is there an efficient way to use views::filter after transform? (range adaptors) [duplicate]

This question already has an answer here: Generator called twice in C++20 views pipeline.
A common example of strange behaviour with views::filter:
#include <iostream>
#include <ranges>
#include <vector>

int main()
{
    using namespace std;
    auto ml = [](char c) // ml = make lambda (always accepts / transforms to 1)
    {
        return [c](int) { cout << c; return 1; };
    };
    vector<int> vec = {1};
    auto view = vec
        | views::transform(ml('T'))
        | views::filter(ml('F'));
    // use view somehow:
    return *view.begin();
}
This outputs TFT (note the extra T).
We must know that:
auto view = vec
    | views::transform(ml('A'))
    | views::filter(ml('B'));
...is just syntax sugar for:
auto view = views::filter(views::transform(vec, ml('A')), ml('B'));
Problem explained:
Having implemented a few mock versions of views::filter, it seems the issue is:
Unlike other iterators, filter::iterator does its work during operator++ (searching for an accepted value).
operator* is designed to extract the value, but with filter::iterator the work has already been done and the result lost (we don't need to redo the search for an accepted value, but we do need to recalculate the value itself); a mock sketch of this follows below.
We can't store the result because of the constant-time copy constraint for views (the value could be an array).
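Here is a minimal mock illustrating the second point (my sketch, not the actual library implementation; the "loud" base iterator stands in for a transform):

#include <iostream>
#include <vector>

// A "loud" base iterator whose dereference prints, standing in for a transform.
struct loud_iter {
    std::vector<int>::const_iterator it;
    int operator*() const { std::cout << 'T'; return *it; }
    loud_iter& operator++() { ++it; return *this; }
    bool operator!=(const loud_iter& o) const { return it != o.it; }
};

// Minimal mock of a filter iterator: the search happens during ++/begin,
// and the accepted value is re-computed on dereference because storing it
// would violate the cheap-to-copy requirement.
template <class It, class Pred>
struct mock_filter_iter {
    It cur, last;
    Pred pred;
    void find_next() { while (cur != last && !pred(*cur)) ++cur; }
    int operator*() const { return *cur; }  // evaluates the base again
};

int main() {
    std::vector<int> v{1};
    auto accept = [](int) { std::cout << 'F'; return true; };
    mock_filter_iter<loud_iter, decltype(accept)> f{{v.begin()}, {v.end()}, accept};
    f.find_next();            // prints TF: the search evaluates the transform
    std::cout << *f << '\n';  // prints T1: dereference evaluates it again
}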
To explain this with a picture, we'll represent the process of iterating over:
view = container | Transform | Filter1 | Filter2 | Filter3
(Apologies for the ugly diagram, which is not reproduced here; circles represent the work being done, e.g. Pr(F3) for "print F3", the work done in the mock example.)
We can see that if we combine only filters or only transforms then we don't repeat the work - the issue is having filters above the transforms (above in the diagram).
In the worst case we get (where n(x) is the number of x):
n(iterations) = n(filters) * n(transforms)
when we'd expect:
n(iterations) = n(filters) + n(transforms)
We can see this quadratic vs. linear growth with the counting sketch below.
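The linked example is not reproduced here, but a counting version is easy to build; a sketch with always-accepting filters (the exact count depends on the implementation, but it is noticeably more than one call per transform):

#include <iostream>
#include <ranges>
#include <vector>

int main()
{
    long calls = 0;
    auto count = [&calls](int x) { ++calls; return x; };
    auto accept = [](int) { return true; };
    std::vector<int> vec = {1};
    auto view = vec
        | std::views::transform(count)
        | std::views::filter(accept)
        | std::views::transform(count)
        | std::views::filter(accept)
        | std::views::transform(count)
        | std::views::filter(accept);
    for (int v : view)
        (void) v;
    // With a single element and three transform/filter pairs we would hope
    // for 3 calls; the re-evaluation described above makes it noticeably more.
    std::cout << calls << '\n';
}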
Question
Sometimes the transforms need to be done before we can determine what to filter, so how do we avoid the extra workload? This seems important, as views are designed to iterate over containers, which is bottleneck territory.
Is there a way to use std::views for situations like this without the huge slow-down described above?
I think a good way to think of this is that a filter needs to get the value of its input in order to decide whether to filter it out. Then, when you evaluate the output of the filter, the input must be evaluated again, unless the filter has cached it (see below).
From alternating filters and transforms, it appears that each filter builds a list of all transforms below it, and then uses this list of transforms to evaluate the result. So, modifying your example:
#include <iostream>
#include <ranges>
#include <vector>

int main()
{
    auto ml = [](char c) // ml = make lambda (always accepts / transforms to 1)
    {
        return [c](int) { std::cout << c; return 1; };
    };
    std::vector<int> vec = {1, 1, 1};
    auto view = vec
        | std::views::transform(ml('T'))
        | std::views::filter(ml('F'))
        | std::views::transform(ml('U'))
        | std::views::filter(ml('G'));
    for (auto v : view)
        std::cout << v << ",";
    return 0;
}
gives output
TFTUGTU1,TFTUGTU1,TFTUGTU1,
We want to take our input, then do transform T, filter F, transform U, filter G, so we might have expected the output TFUG1. However, here is what seems to happen:
Filter G needs to know if the value got filtered already, and needs to know the element value to decide whether it should do any filtering itself. So it passes this back up the chain to the next filter, F.
Filter F needs to know if it needs to filter, so it evaluates transform T; it then does the filtering, finds it can pass the value, and adds T to its list of transforms.
G can now decide if it needs to filter. It evaluates the input using F's list of transforms (which only includes T) and any transforms between F and G. So we call transforms T and U.
G now does its filtering and finds it can pass the value. So it adds all the transforms below to its list of transforms. This list is now T and U.
Finally we want the value, so G evaluates everything by calling its list of transforms, T and U. We now have our result.
Note that this is a conceptual model from observations. I haven't trawled the standard or the code to check this, it's just how it appears from the example.
You can view this as a tree
G
|\
U U
| \
F \
|\ \
T T T
Every filter causes a branch with filters and transforms on the left and just transforms on the right. We evaluate the left of each filter node then the filter node itself, then the right branch of each filter node.
The complexity is as you say, O((n_filters+1)*n_transforms).
My suspicion is that if you had provided a constexpr transform, the compiler could optimise this into a simple linear evaluation of TFUG, because it would see that the return value is the same for each evaluation. However, adding the std::cout to the transform means it is not constexpr, so these optimisations cannot happen.
I therefore suspect that by trying to show the structure of the C++ code, you have prevented optimisations from occurring in the binary code.
You can of course test this by creating a constexpr transform and checking what happens to the runtime as you increase the layers of views with optimisation turned on.
Edit
I initially said that I thought the filters were caching a list of transforms, and that if the transforms were constexpr they might cache the results of those transforms instead. On reflection and with more reading, I think the filters are not caching a list of transforms at all. They are evaluating the left side of each branch on incrementing the iterator and the right side on dereferencing.
However, I would still imagine that with constexpr transforms, the compiler could effectively optimise this. And without checking the code, maybe filters can do caching; I'm sure someone here knows. So I think it would still be useful to test a constexpr optimised example.

Having an iterator as a sliding window of 3 elements that can overshoot bounds (possibly using Boost)

Having read this SO post and exploring Boost.Iterator, I want to see if I can make a sliding window of size 3 iterate through a single vector where the final iteration has an 'empty third element'.
Assuming that the vector size is >= 2, an example:
{a, b, c, d, e, f, g}
We will always start on index 1 because this algorithm I'm implementing requires a 'previous' element to be present and does not need to operate on the first element (so we would iterate from i = 1 while i < size()):
    V
[a, b, c]
{a, b, c, d, e, f, g}
when I move to the next iteration, it would look like:
       V
   [b, c, d]
{a, b, c, d, e, f, g}
and upon reaching the last element in the iteration, it would have this:
                   V
               [f, g, EMPTY]
{a, b, c, d, e, f, g}
What I want is to be able to grab the "prev" element, check "hasNext", and grab the next element if available. My goal is very clean, modern C++ code, and avoiding the bookkeeping of tracking pointers/references to three different elements makes the code a lot cleaner:
for (const auto& it : zippedIterator(dataVector)) {
    someFunc(it.first, it.second);
    if (someCondition(it.second) && hasThirdElement) {
        anotherFunc(it.second, it.third);
    }
}
I was trying to see if this is possible with boost's zip iterator, but I don't know if it allows me to overshoot the end and have some empty value.
I've thought of doing some hacky stuff like having a dummy final element, but then I have to document it and I'm trying to write clean code with zero hacky tricks.
I was also going to roll my own iterator but apparently std::iterator is deprecated.
I also don't want to create a copy of the underlying vector since this will be used in a tight loop that needs to be fast and copying everything would be very expensive for the underlying objects. It doesn't need to be extremely optimized, but copying the iterator values into a new array is out of the question.
If this were a matter of simply having a sized window into a range, then what you really want is to have a range that you can advance. In your case, that range is 3 elements long, but there's no reason that a general mechanism couldn't allow for a variable-sized range. It would just be a pair of iterators, such that you can ++ or -- both of them at the same time.
The problem you run into is that you want to manufacture an element if the subrange is off the end of the range. That complicates things; that would require proxy-iterators and so forth.
If you want a solution for your specific case (a 3-element sized range, where the last element can be manufactured if it's off the end of the main range), then you first need to decide if you want to have an actual type for this. That is, is it worth implementing a whole type, rather than a couple of one-off utility functions?
My way to handle this would be to redefine the problem. What you seem to have is a current element, just like any other iteration. But you want to be able to access the previous element. And you want to be able to peek ahead to the next element; if there is none, then you want to manufacture some default. So... perform iteration, but write a couple of utility functions that let you access what you need from the current element.
for (auto curr = std::next(dataVector.begin());
     curr != dataVector.end();
     ++curr)
{
    someFunc(prevElement(curr), *curr);
    auto nextIt = curr + 1;
    if (nextIt != dataVector.end() && someCondition(*curr))
        anotherFunc(*curr, *nextIt);
}
prevElement is a simple function that accesses the element before the given iterator.
template<typename It>
//requires BidirectionalIterator<It>
decltype(auto) prevElement(It curr) {return *(--curr);}
If you want to have a function to check the next element and manufacture a value for it, that can be done too. This one has to return a prvalue of the element, since we may have to manufacture it:
template<typename It>
//requires ForwardIterator<It>
auto checkNextElement(It curr, It endIt)
{
    ++curr;
    if (curr == endIt)
        return typename std::iterator_traits<It>::value_type{};
    return *curr;
}
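For completeness, the loop could then use it like this (a sketch; the value-initialized default plays the role of the "EMPTY" third element):

for (auto curr = std::next(dataVector.begin());
     curr != dataVector.end(); ++curr)
{
    someFunc(prevElement(curr), *curr);
    auto next = checkNextElement(curr, dataVector.end());
    if (someCondition(*curr))
        anotherFunc(*curr, next); // next may be the manufactured default
}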
Yes, this isn't as clever as special range types and the like. But the stuff you're doing is hardly common, particularly having to manufacture the next element as you do. By keeping things simple and obvious, you make it easy for someone to read your code without having to understand some specialized sub-range type.

Sorting a vector where tie-breaker elements are lazily computed

I want to sort a vector of structs by a primary field and use a secondary field as a tie-breaker. The normal way would be this:
struct element {
    int primary;
    int secondary;
};

bool comparator(const element& e1, const element& e2) {
    if (e1.primary != e2.primary) {
        return e1.primary < e2.primary;
    }
    return e1.secondary < e2.secondary;
}
But the secondary data is expensive to compute. As it is only needed when the primary values are equal, I want to compute it lazily.
It seems the only place I can do this lazy evaluation is within the comparator itself. Something like:
bool comparator(const element& e1, const element& e2) {
    if (e1.primary != e2.primary) {
        return e1.primary < e2.primary;
    }
    return e1.computeSecondary() < e2.computeSecondary();
}
While this avoids evaluating the secondary values when the primary values differ, it ends up recomputing the secondary value for the same element each time it is compared with another element. The data I want to sort is long-tailed, with something like 30% of values equal to 1, 20% equal to 2, 5% equal to 3, and lower percentages for higher values. So there will be a fair number of cases where the secondary value gets computed, and not storing the computed values could result in them being recomputed many times.
So, I would like the secondary values to be evaluated at most once per element. But the comparator takes const ref arguments, so it can't modify the secondary value of the element. How can this be achieved?
Possible options are, in a nutshell:
1) Declare secondary mutable.
2) Use const_cast in comparator.
3) Use const_cast in computeSecondary.
4) Create a simple Lazy template class that holds either a value or a thunk and, when asked for the value, internally forces the thunk if it hasn't been evaluated yet and reports the result (or reports it immediately, if it is already known). Writing one does not take long; then declare secondary as type Lazy<int>. A sketch follows below.
5) Or rather, do not reinvent the wheel and use std::future, which in one of its use cases is actually that very Lazy template.
Or anything else; one can come up with more approaches.
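For illustration, a minimal sketch of option 4 (the names and the use of std::optional as the value-or-not-yet storage are my choices):

#include <functional>
#include <optional>

template <class T>
class Lazy {
    std::function<T()> thunk_;
    mutable std::optional<T> value_; // mutable, so get() can stay const
public:
    explicit Lazy(std::function<T()> thunk) : thunk_(std::move(thunk)) {}
    const T& get() const {
        if (!value_) value_ = thunk_(); // forced at most once, then cached
        return *value_;
    }
};

struct element {
    int primary;
    Lazy<int> secondary;
};

bool comparator(const element& e1, const element& e2) {
    if (e1.primary != e2.primary) {
        return e1.primary < e2.primary;
    }
    // Each element computes its secondary at most once, however many
    // times the comparator is called on it.
    return e1.secondary.get() < e2.secondary.get();
}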

Tabulation hashing and N3980

I am having trouble adapting the pending C++1z proposal N3980 by Howard Hinnant to work with tabulation hashing.
Computing a tabulation hash ab initio works the same as for the hashing algorithms (Spooky, Murmur, etc.) described in N3980. It is not that complicated: just serialize an object of any user-defined type through hash_append() and let the hash function update a pointer into a table of random numbers as you go along.
The trouble starts when trying to implement one of the nice properties of tabulation hashing: very cheap to compute incremental updates to the hash if an object is mutated. For "hand-made" tabulation hashes, one just recomputes the hash of the object's affected bytes.
My question is: how to communicate incremental updates to a uhash<MyTabulationAlgorithm> function object while keeping true to the central theme of N3980 (Types don't know #)?
To illustrate the design difficulties: say I have a user-defined type X with N data members xi of various types Ti
struct X
{
T1 x1;
...
TN xN;
};
Now create an object and compute its hash
X x { ... }; // initialize
std::size_t h = uhash<MyTabulationAlgorithm>(x);
Update a single member, and recompute the hash
x.x2 = 42;
h ^= ...; // ?? I want to avoid calling uhash<>() again
I could compute the incremental update as something like
h ^= hash_update(x.x2, start, stop);
where [start, stop) represents the range of the table of random numbers that corresponds to the x2 data member. However, in order to incrementally (i.e. cheaply!) update the hash for arbitrary mutations, every data member needs to somehow know its own subrange in the serialized byte stream of its containing class. This doesn't feel like the spirit of N3980. E.g., adding new data members to the containing class would change the class layout and therefore the offsets in the serialized byte stream.
Application: tabulation hashing is very old, and it has recently been shown to have very nice mathematical properties (see the Wikipedia link). It is also very popular in the board-game programming community (computer chess and Go, e.g.), where it goes under the name of Zobrist hashing. There, a board position plays the role of X, and a move plays the role of a small update (moving a piece from its source to its destination, e.g.). It would be nice if N3980 could not only be adapted to such tabulation hashing, but could also accommodate the cheap incremental updates.
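For concreteness, here is the classic Zobrist form of such an incremental update (an illustration, not from N3980; the 64-square, 12-piece table is a chess-flavoured assumption):

#include <cstdint>
#include <random>

// One random word per (square, piece) pair.
std::uint64_t zobrist[64][12];

void init_zobrist() {
    std::mt19937_64 rng(12345); // fixed seed, for reproducibility
    for (auto& square : zobrist)
        for (auto& piece : square)
            piece = rng();
}

// Moving a piece costs two XORs: XOR it out of the source square
// and into the destination square.
std::uint64_t apply_move(std::uint64_t h, int piece, int from, int to) {
    return h ^ zobrist[from][piece] ^ zobrist[to][piece];
}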
It seems that you should be able to do this by telling MyTabulationAlgorithm to ignore the values of all class members except that which has changed:
x.x2 = 42;
IncrementalHashAdaptor<MyTabulationAlgorithm, T2> inc{x.x2};
hash_append(inc, x);
h ^= inc;
All IncrementalHashAdaptor has to do is check the memory range it is passed to see whether x2 is included in it:
template<class HashAlgorithm, class T>
struct IncrementalHashAdaptor
{
    T& t;
    HashAlgorithm h = {};
    bool found = false;

    void operator()(void const* key, std::size_t len) noexcept
    {
        char const* k = static_cast<char const*>(key);
        char const* p = reinterpret_cast<char const*>(std::addressof(t));
        if (k <= p && p + sizeof(T) <= k + len) { // t within [key, key + len)
            assert(!found);
            found = true;
            h.ignore(k, p - k);                   // bytes before t
            h(p, sizeof(T));                      // hash t itself
            h.ignore(p + sizeof(T), len - (p - k) - sizeof(T)); // bytes after t
        }
        else {
            h.ignore(k, len);
        }
    }

    operator std::size_t() const { assert(found); return h; }
};
Obviously, this will only work for members whose location can be determined externally and falls within the memory block passed to the hash algorithm; but that should cover the vast majority of cases.
You would probably want to wrap IncrementalHashAdaptor and the following hash_append into a uhash_incremental utility; this is left as an exercise for the reader.
There is a question mark over performance; assuming HashAlgorithm::ignore(...) is visible to the compiler and is uncomplicated, it should optimize well. If it does not, you should be able to calculate the byte-stream address of X::x2 at program startup using a similar strategy.