Container that allows fast search and order at the same time - C++

I keep getting into scenarios with this problem again and again, and I implement a different approach every time. Now I've decided to see if the Stack Overflow community can suggest something better.
Let's say that I have a reconcile API, where the current set of objects in a system needs to be reevaluated, and this might take some time. (Note that obtaining the list of IDs of the objects is fast; the evaluation is slow.) It is a public API, so reconcile could be called irresponsibly. I would like to guarantee that every object in the system is reevaluated after the last call, while at the same time I do not want to reevaluate any object more than once without need. So far so good: any set, ordered or unordered, will do.
This additional requirement is the key: I would like to rotate the items, so that if the reconcile API is misused we do not keep reevaluating the same objects that sit at the "top".
For example, if I have "A B C D E F" in the system at the first call, I will schedule "A B C D E F" for reevaluation, in this order.
Let's say that A, B, and C have already been evaluated and there are new objects G and H in the system. The new queue should look like "D E F A B C G H" ("D E F G H A B C" would be even better, but that is not critical). I do not want the queue to be "A B C D E F G H" or "D E F A B C D E F G H".
The question is: which STL or Boost container (or combination of containers) should I use to solve this?

IMO the best approach is: if you need anything more complicated than vector, map, unordered_map, or set, then just default to boost::multi_index_container. The multi-index container has the advantage that it is extremely powerful and flexible, can efficiently support a wide variety of lookup schemes, and is also quite easily extensible if your needs become greater later on. I would build the entire application that way first; then, if you time things and find that you need to optimize, try to replace the relevant multi-index containers with optimized data structures tailored to the particular operations you need to support. This saves you an incredible amount of development time fussing over data structure decisions.

A ring would be the proper data structure for this, but I don't know of any standard implementation. You can easily simulate one using a std::list, though, by maintaining iterators to iterate, insert, and detect the end:
std::list<Item> items;
auto it = items.begin();             // iteration cursor
items.insert(items.end(), item);     // append a new item at the back of the ring
if (++it == items.end())             // advance the cursor, wrapping around
    it = items.begin();
This gives O(1) insert and O(1) iteration. It adds an additional branch per iteration, but that could be eliminated with a proper ring.

Related

Mapping into multiple maps in parallel with Java 8 Streams

I'm iterating over a CloseableIterator (looping over elements) and currently adding to a HashMap (just putting the elements into a HashMap, dealing with conflicts as needed). My goal is to do this process in parallel: add to multiple HashMaps in chunks, using parallelism to speed up the process, then reduce to a single HashMap.
I'm not sure how to do the first step, using streams to map into multiple HashMaps in parallel. I'd appreciate any help.
Parallel streams collected into Collectors.toMap will already process the stream on multiple threads and then combine the per-thread maps as a final step. In the case of toConcurrentMap, multiple threads will process the stream and combine the data into a thread-safe map.
If you only have an Iterator (as opposed to an Iterable or a Spliterator), it's probably not worth parallelizing. In Effective Java, Josh Bloch states that:
Even under the best of circumstances, parallelizing a pipeline is unlikely to increase its performance if the source is from Stream.iterate, or the intermediate operation limit is used.
An Iterator has only a next method, which (typically) must be called sequentially. Thus, any attempt to parallelize would be doing essentially what Stream.iterate does: sequentially starting the stream and then sending the data to other threads. There's a lot of overhead that comes with this transfer, and the cache is not on your side at all. There's a good chance that it wouldn't be worth it, except maybe if you have few elements to iterate over and you have a lot of work to do on each one. In this case, you may as well put them all into an ArrayList and parallelize from there.
It's a different story if you can get a reasonably parallelizable Stream. You can get these if you have a good Iterable or Spliterator. If you have a good Spliterator, you can get a Stream using the StreamSupport.stream methods. Any Iterable has a spliterator method. If you have a Collection, use the parallelStream method.
A Map in Java has key-value pairs, so I'm not exactly sure what you mean by "putting into a HashMap." For this answer, I'll assume that you mean that you're making a call to the put method where the key is one of the elements and the value is Boolean.TRUE. If you update your question, I can give a more specific answer.
In this case, your code could look something like this:
public static <E> Map<E, Boolean> putInMap(Stream<E> elements) {
    return elements.parallel()
                   .collect(Collectors.toConcurrentMap(e -> e, e -> Boolean.TRUE, (a, b) -> Boolean.TRUE));
}
e -> e is the key mapper, making it so that the keys are the elements.
e -> Boolean.TRUE is the value mapper, making it so the set values are true.
(a, b) -> Boolean.TRUE is the merge function, deciding how to merge two elements into one.

Perfect hash function generator for functions

I have a set of C++ functions. I want to map these functions in a hash table, something like: unordered_map<function<ReturnType (Args...)>, SomethingElse>, where SomethingElse is not relevant for this question.
This set of functions is known in advance, small (let's say fewer than 50), and static (it is not going to change).
Since lookup performance is crucial (it should be performed in O(1)), I want to define a perfect hash function.
Does a perfect hash function generator exist for this scenario?
I know that perfect hash function generators exist (like GPERF or CMPH), but since I've never used them, I don't know whether they're suitable for my case.
REASON:
I'm trying to design a framework where, given a program written in C++, the user can select a subset F of the functions defined in this program.
For each f belonging to F, the framework implements a memoization strategy: when we call f with input i, we store (i, o) in some data structure. So, if we call f with i AGAIN, we will return o without performing the (time-expensive) computation again.
The "already computed results" will be shared among different users (maybe in the cloud), so if user u1 has already computed o, user u2 will save computing time when calling f with i (using the same notation as before).
Obviously, we need to store the set of pairs (f, inputs_sets) (where inputs_sets is the set of already computed results that I talked about before), which brings us back to the original question: how do I do it?
So, using the "enumeration trick" proposed in the comments could be a solution in this scenario, assuming that all the users use exactly the same enumeration, which could be a problem: supposing that our program has f1, f2, f3, what if u1 wants to memoize only f1 and f2 (so F = {f1, f2}), while u2 wants to memoize only f3 (so F = {f3})? An overkill solution would be to enumerate all the functions defined in the program, but this could waste a huge amount of memory.
OK, maybe not what you want to hear, but consider this: since you are talking about a few functions, fewer than 50, the hash lookup should be negligible, even with collisions. Have you actually profiled and seen that the lookup is critical?
So my advice is to focus your energy on something else; most likely a perfect hash function would not bring any performance improvement in your case.
I am going to go one step further and say that I think for fewer than 50 elements a flat map (good ol' vector) would have similar performance (or maybe even better, due to cache locality). But again, measurements are required.

normalize or not?

I have a DB in which there are 4 tables.
A -> B -> C -> D
Currently the way I have it is: the primary key of A is a foreign key in B, and B has its own primary key, which is a foreign key in C, and so on.
However, C can't be linked to A without B.
The problem is, a core function of my program involve pulling matching entries from A and D.
Should I include the primary key of A in D too?
Doing so would create unnecessary data duplication, because A -> B -> C -> D is a hierarchy.
see pic for what D would look like.
If you take all D-s in relation to a given A, I would keep it normalized.
But if you want a specific subset of such D-s, and it's easy to know which ones in advance but time-consuming to find later (e.g. if you want all D-s from the newest C from the newest B), I would prefer storing this shortcut somewhere.
It does not have to be in D itself (especially if you don't want all D-s connected with A).
If you want to do it to make your queries easier to read and write, then consider a view.
If you want to do it to increase performance, try everything and measure it. (I'm no expert in performance tuning of SQL, so I have no specific advice beyond that.)

Best sorting algorithm for case where many objects have "do-not-care" relationships to each other

I have an unusual sorting case that my googling has turned up little on. Here are the parameters:
1) Random access container. (C++ vector)
2) Generally small vector size (less than 32 objects)
3) Many objects have "do-not-care" relationships relative to each other, but they are not equal. (I.e., they don't care which of them appears first in the final sorted vector, but they may compare differently to other objects.) To put it a third way (if it's still unclear): the comparison function for two objects can return three results: "order is correct," "order needs to be flipped," or "do not care."
4) Equalities are possible, but will be very rare. (They would probably just be treated like any other "do-not-care.")
5) Comparison operator is far more expensive than object movement.
6) There is no comparison speed difference for determining that objects care or don't care about each other. (I.e., I don't know of a way to make a quicker comparison that simply says whether the two objects care about each other or not.)
7) Random starting order.
Whatever you're going to do, given your conditions I'd make sure you draw up a big pile of test cases (e.g. get a few datasets and shuffle them a few thousand times), as I suspect it'd be easy to choose a sort that fails to meet your requirements.
The "do not care" is tricky as most sort algorithms depend on a strict ordering of the sort value - if A is 'less than or equal to' B, and B is 'less than or equal to' C, then it assumes that A is less than or equal to C -- in your case if A 'doesn't care' about B but does care about C, but B is less than C, then what do you return for the A-B comparison to ensure A will be compared to C?
For this reason, and because these are small vectors, I'd recommend NOT using any of the built-in methods, as I think you'll get wrong answers; instead I'd build a custom insertion sort.
Start with an empty target vector, insert the first item, then for each subsequent item scan the array looking for the bounds of where it can be inserted (i.e. ignoring the "do not cares", find the last item it must go after and the first it must go before) and insert it in the middle of that gap, moving everything else along the target vector (i.e. it grows by one entry each time).
[If the comparison operation is particularly expensive, you might do better to start in the middle and scan in one direction until you hit one bound, then choose whether the other bound is found moving from that bound, or the mid point... this would probably reduce the number of comparisons, but from reading what you say about your requirements you couldn't, say, use a binary search to find the right place to insert each entry]
Yes, this is basically O(n^2), but for a small array this shouldn't matter, and you can prove that the answers are right. You can then see if any other sorts do better, but unless you can return a proper ordering for any given pair then you'll get weird results...
You can't do the sorting with "don't care"; it is likely to mess with the order of elements. Example:
list = {A, B, C};
where:
A doesn't care about B
B > C
A < C
So even with the "don't care" between A and B, B has to be greater than A, or one of these will be false: B > C or A < C. If that can never happen, then you need to treat them as equals instead of "don't care".
What you have there is a "partial order".
If you have an easy way to figure out the objects whose order is not "don't care" for a given object, you can tackle this with basic topological sorting.
If you have a lot of "don't care"s (i.e. if you only have a sub-quadratic number of edges in your partial ordering graph), this will be a lot faster than ordinary sorting; however, if you don't, the algorithm will be quadratic!
I believe a selection sort will work without modification, if you treat the "do-not-care" result as equal. Of course, the performance leaves something to be desired.

Binary tree with different node types

I'm working on a somewhat complex mathematical code, written in C++. I'm using (templated) tree structures for adaptive function representation. Due to some mathematical properties I end up in a situation where I need to change from one type of node to another. This needs to happen transparently and with minimal overhead, both in terms of storage and performance, since these structures are used in very heavy computations.
The detailed situation is as follows: I have a templated abstract base class defining general mathematical and structural properties of a general, doubly-linked node. Each node needs information both from its parent and from a top-level Tree class, in addition to keeping track of its children. Two classes inherit from this class, the FunctionNode and the GenNode. These classes are very different in terms of storage and functionality, and should not be (at least publicly) ancestors of each other. Thus, I would like to construct a tree like this:
      T
      N
     / \
    N   N
   / \
  G   N
     / \
    G   G
Where T is a Tree, N is a normal FunctionNode and G is a GenNode. The problem is the N - G transition: N needs to have children of type G, and G a parent of type N. Since N and G are only cousins and not siblings, I can't convert an N* to a G*. It's sufficient for G to know that N is a BaseNode, but N has to somehow store G polymorphically so that the correct virtuals get called automagically when the tree is traversed. Any ideas on how to solve this problem elegantly and efficiently would be much appreciated! :) Of course one could just hack this, but since this is a very fundamental piece of code I would like to have a good solution for it. It's likely that there will be many derivations of this code in the future.
Best regards,
Jonas Juselius
Centre for Theoretical and Computational Chemistry, University of Tromsø
Don't use inheritance when delegation will do. Look at the Strategy design pattern for guidance on this.
The "N - G" transition may be better handled by having a subclass of N (N_g) which is a unary operator (where other N's are binary) and will delegate work to the associated G object. The G subtree is then -- actually -- a disjoint family of classes based on G's, not N's.
       T
       N
      / \
     N   N
    / \
  N_g   N
   |
   G
  / \
 G   G
"One of the problems is that I do not know beforehand whether the next N will be N or N_g."
"beforehand?" Before what? If you are creating N's and then trying to decide if they should have been N_g's, you've omitted several things.
You've instantiated the N too early in the process.
You've forgotten to write an N_g constructor that works by copying an N.
You've forgotten to write a replace_N_with_Ng method that "clones" an N to create an N_g, and then replaces the original N in the tree with the N_g.
The point of polymorphism is that you don't need to know "beforehand" what anything is. You should wait as long as possible to create either an N or an N_g and bind the resulting N (or subclass of N) object into the tree.
"Furthermore, sometimes I need to prune all G:s, and generate more N:s, before perhaps generating some more G:s."
Fine. You walk the tree, replacing N_g instances with N instances to "prune". You walk the tree replacing N instances with N_g's to generate a new/different G subtree.
Look into using RTTI - Run-time Type Information.
Have you thought of using Boost.Any?
It seems like the textbook example in my opinion.
Having thought about the problem some more, I came up with the following idea:
Logically, but not functionally, GenNode is a kind of FunctionNode. If one splits FunctionNode into two classes, one containing the common denominators and one having the additional functionality only FunctionNode should have, then FunctionNode can inherit from the common class using private inheritance. Now GenNode can safely inherit from FunctionNode, and all problems can be solved as normal using virtuals. Any comments?