Store Tree Object Directly in Source to Avoid Growing at Run-time - c++

I have 50 (large) decision trees that are currently serialized (in pre-order) as individual, long strings. All of the strings are stored directly in a .cpp declaration file to avoid having to read them from a file at run-time. At run-time, a function deserializes each string and constructs its corresponding decision tree using a standard recursive process. Then a set of features (a vector of doubles) is dropped down each decision tree and a class prediction is output. À la Random Forest, a majority vote is taken over the 50 predictions and the final class is chosen.
I've tried optimizing the code and have discovered that the re-construction of these large trees takes up the majority (~98%) of my run-time. Thus, I wanted to ask if there is some way to hardcode the entire tree object into the .cpp declaration file, so that instead of having to be re-constructed at run-time, the tree objects are already available to be traversed.

If you have access to C++11, I think constexpr functions are your solution.
You could write functions that generate the data of the trees at compile time, storing that data in arrays at compile time.
See this thread for a working usage example.
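Alternatively, even without compile-time generation, the trees can be dumped once into flat static data so nothing is rebuilt at run-time. A minimal sketch of what this could look like; the Node layout and predict function are invented for illustration, not the asker's actual format:

```cpp
#include <vector>

// Hypothetical flattened tree: nodes stored in a static array.
// feature < 0 marks a leaf, whose 'left' field holds the class label.
struct Node {
    int feature;      // index of the feature to test, or -1 for a leaf
    double threshold; // split threshold
    int left, right;  // child indices into the array (or class for a leaf)
};

// The whole tree lives in the .cpp file as static data; no run-time build.
static const Node tree0[] = {
    { 0, 0.5, 1, 2},  // root: x[0] < 0.5 ? node 1 : node 2
    {-1, 0.0, 7, 0},  // leaf: class 7
    {-1, 0.0, 3, 0},  // leaf: class 3
};

int predict(const Node* tree, const std::vector<double>& x) {
    int i = 0;
    while (tree[i].feature >= 0)
        i = (x[tree[i].feature] < tree[i].threshold) ? tree[i].left
                                                     : tree[i].right;
    return tree[i].left;  // leaf: return the class label
}
```

The arrays could be emitted by a small offline tool that runs the existing deserializer once and prints each tree as an initializer list.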

Related

Save gmsh-generated nodes and elements to std::vector or other container

I am using the gmsh C++ API as a part of a larger optimization project. All I need gmsh for is to mesh some polygons at every iteration in order to conduct a finite element analysis.
According to the tutorials, once a mesh instance is generated, it can be saved to a .msh file with the write() method, e.g.:
gmsh::initialize();
// ...
// Geometry definition
// ...
gmsh::model::mesh::generate(2);
gmsh::write("myfile.msh");
And in the .msh file I can see all the nodes, elements, and other information.
Now, as I mentioned above, I only need gmsh's output to conduct some analysis with functions that I've already written, to which I only need to feed a std::vector or Eigen::Vector containing the nodes and the elements.
One (inefficient) way to do so is, of course, to export the .msh file and then parse it to create a std::vector out of the nodal information. I am looking for a way to just access the nodes and elements so that I can store them in a std::vector (or Eigen::Vector) directly.
Is there a way to avoid dumping everything to a local file? I know I could reverse engineer this operation by going through the write() method and looking at how the nodal information is saved to a file, but:
I feel like there must be some API function that serves exactly this purpose
I'd rather avoid going through the huge source files to figure out this information myself, since I need this software to be completed as soon as possible

Uniquely identify an arbitrary object in c++

I'm trying to create a general memoizator for multiple and arbitrary functions.
For each function std::function<ReturnType(Args...)> that we want to memoize, we keep an unordered_map<Args..., ReturnType> (I'm keeping things simple on purpose).
The big problem comes when our memoized function has some really big arguments Args...: for example, suppose that our function sorts a vector of 10 million numbers and then returns the sorted vector, i.e. something like std::function<vector<double>(vector<double>)>.
As you can imagine, after inserting fewer than 100 vectors, we have already filled 8 GB of memory. Note that this may result from the combination of huge vectors and the memory required by the sorting algorithm (I didn't investigate the causes).
So what if, instead of the structure described above, we define unordered_map<UUID(Args...), ReturnType> (where UUID = Universally Unique Identifier)? We would have to relax the deterministic guarantee (so we might occasionally return a wrong result), but with a very low probability.
The problem is that since I never used UUIDs, I don't know if there are suitable implementations for this application.
So my question is:
Is there a better solution than UUIDs for this problem?
Which UUID implementation is best suited for this problem?
Is boost uuid a possible candidate?
Unfortunately, this solves the problem for Args... but not for ReturnType; is there a solution for the memoized result too?
Notice that the UUIDs generated for the object x should be the same even in different runs and machines.
Notice that if we have the same UUID for two different objects (and so we return the wrong value) with a really low probability, then it could be acceptable...let's say that this could be a "probabilistic memoizator".
I know that this application doesn't make much sense in a memoization context (what are the odds that a user asks twice to sort the same 10-million-element vector?), but it's time- and memory-expensive (so good for benchmarking and for exposing the memory problem stated above), so please don't whip and crucify me because this is an absurd memoization application.
Identifying any object is easy. The address is "object identity" in C++. This is also the reason that even empty classes cannot have zero size.
Now, what you want is value equivalence. That's strictly not in the language domain. It's solidly in the application/library logic domain.
You should consider using something like boost::flyweight. It has precisely this facility, and makes it "easy" to customize the equivalence semantics for your types.
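As a concrete illustration of the "probabilistic memoizator" idea, here is a sketch that keys the cache on a hash of the vector's contents instead of the vector itself. The hash-combining scheme is an arbitrary choice, and note that std::hash is not guaranteed to be stable across implementations, so for identical IDs across runs and machines (as the question requires) you would need a fixed algorithm such as MD5 or SHA-1:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hash the *contents* of the vector into one cheap key. Distinct inputs
// may collide (yielding a wrong cached answer), which is the accepted
// trade-off of the "probabilistic" scheme.
std::size_t key_of(const std::vector<double>& v) {
    std::size_t h = v.size();
    for (double d : v)
        h ^= std::hash<double>()(d) + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
}

// Memoized sort keyed on the content hash rather than the full vector.
// The results themselves are still stored, so this only shrinks the
// key side of the map, not the ReturnType side the question also asks about.
std::vector<double> memo_sort(const std::vector<double>& v) {
    static std::unordered_map<std::size_t, std::vector<double>> cache;
    std::size_t k = key_of(v);
    auto it = cache.find(k);
    if (it != cache.end()) return it->second;  // possible false hit
    std::vector<double> r = v;
    std::sort(r.begin(), r.end());
    return cache.emplace(k, std::move(r)).first->second;
}
```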

Are recursive types really the only way to build noncontinuous arbitrary-size data structures?

I just noticed a question asking what recursive data types ("self-referential types") would be good for in C++ and I was tempted to boldly claim
It's the only way to construct data structures (more precisely, containers) that can accept arbitrarily large data collections without using contiguous memory areas.
That is, if you had no random-access arrays, you would require some means of referring (logically) to a type within that type (obviously, instead of having a MyClass* next member you could say void* next, but that would still point to a MyClass object or a derived type).
However, I am careful with absolute statements -- just because I couldn't think of something doesn't mean it's not possible, so am I overlooking something? Are there data structures that are neither organised using mechanisms similar to linked lists / trees nor using contiguous sequences exclusively?
Note: This is tagged both c++ and language-agnostic as I'd be interested specifically in the C++ language but also in theoretical aspects.
It's the only way to construct data structures (more precisely, containers) that can accept arbitrarily large data collections without using contiguous memory areas.
After contemplating for a while, this statement seems to be correct. It is self-evident, in fact.
Suppose I have a collection of elements in non-contiguous memory, and suppose that I'm currently at element e. Now the question is: how would I know the next element in the collection? Is there any way?
Given an element e from a collection, there are only two ways to compute the location of the next element:
If I assume that it is at offset sizeof(e) irrespective of what e is, then the next element starts where the current element ends. But this implies that the collection is in contiguous memory, which is forbidden in this discussion.
The element e itself tells us the location of the next element. It may store the address itself, or an offset. Either way, it is using the concept of self-reference, which is also forbidden in this discussion.
As I see it, the underlying idea of both approaches is exactly the same: both implement self-reference. The only difference is that in the former, the self-reference is implicit, using sizeof(e) as the offset; this implicit self-reference is supported by the language itself and implemented by the compiler. In the latter, it is explicit: everything is done by the programmer, as the offset (or pointer) is stored in the element itself.
Hence, I don't see any third approach to implementing self-reference. And if not self-reference, what terminology would one even use to describe computing the location of the element that follows an element e?
So my conclusion is, your statement is absolutely correct.
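The implicit/explicit distinction above can be shown in a few lines (a sketch; the function names next_in_array and next_in_list are made up here):

```cpp
// Given an element e, its successor is found either implicitly (the
// adjacent array slot, at offset sizeof(e)) or explicitly (a pointer
// stored inside e). Both are forms of self-reference.
struct ListNode {
    int value;
    ListNode* next;  // explicit self-reference
};

int next_in_array(const int* e) { return *(e + 1); }            // implicit
int next_in_list(const ListNode* e) { return e->next->value; }  // explicit
```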
The problem is that the dynamic allocator itself is managing contiguous storage. Think about the "tape" used for a Turing Machine, or the Von Neumann architecture. So to seriously consider the problem, you would likely need to develop a new computing model and new computer architecture.
If you think disregarding the contiguous memory of the underlying machine is okay, I am sure a number of solutions are possible. The first that comes to my mind is that each node of the container is marked with an identifier that has no relation to its position in memory. Then, to find the associated node, all of memory is scanned until the identifier is found. This isn't even particularly inefficient if given enough computing elements in a parallel machine.
Here's a sketch of a proof.
Given that a program must be of finite size, all types defined within the program must contain only finitely many members and reference only finitely many other types. The same holds for any program entrypoint and for any objects defined before program initialisation.
In the absence of contiguous arrays (which are the product of a type with a runtime natural number and are therefore unconstrained in size), all types must be arrived at through the composition of types as above; derivation of types (pointer-to-pointer-to-A) is still constrained by the size of the program. There are no facilities other than contiguous arrays to compose a runtime value with a type.
This is a little contentious; if e.g. mappings are considered primitive then one can approximate an array with a map whose keys are the natural numbers. Of course, any implementation of a map must use self-referential data structures (B-trees) or contiguous arrays (hash tables).
Next, if the types are non-recursive then any chain of types (A references B references C...) must terminate, and can be of no greater length than the number of types defined in the program. Thus the total size of data referenceable by the program is limited to the product of the sizes of each type multiplied by the number of names defined in the program (in its entrypoint and static data).
This holds even if functions are recursive (which strictly speaking breaks the prohibition on recursive types, since functions are types); the amount of data immediately visible at any one point in the program is still limited to the product of the sizes of each type multiplied by the number of names visible at that point.
An exception to this is if you store a "container" in a stack of recursive function calls; however such a program would not be able to traverse its data at random without unwinding the stack and having to reread data, which is something of a disqualification.
Finally, if it is possible to create types dynamically the above proof does not hold; we could for example create a Lisp-style list structure where each cell is of a distinct type: cons<4>('h', cons<3>('e', cons<2>('l', cons<1>('l', cons<0>('o', nil))))). This is not possible in most static-typed languages, although it is possible in some dynamic languages e.g. Python.
The statement is not correct. The simple counterexample is std::deque in C++. The basic data structure (for the language-agnostic part) is a contiguous array of pointers to arrays of data. The actual data is stored in separate fixed-size blocks (non-contiguous), which are chained together through a contiguous array of pointers.
This might be bordering on your requirements, depending on what "without using contiguous memory areas" means. I am using the interpretation that the stored data is not contiguous, but this data structure does depend on having arrays for the intermediate layer.
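A minimal two-level, deque-like layout makes this concrete: a contiguous array of pointers indexes non-contiguous fixed-size blocks, so the element storage is not contiguous yet indexing stays O(1) without any linked nodes. The class and its names are invented for illustration, and it omits copy control for brevity:

```cpp
#include <cstddef>
#include <vector>

// Two-level storage: blocks of BlockSize elements, located through a
// contiguous directory of pointers. Element data is non-contiguous.
template <typename T, std::size_t BlockSize = 64>
class BlockVector {
    std::vector<T*> blocks;  // contiguous index of non-contiguous blocks
    std::size_t count = 0;
public:
    ~BlockVector() { for (T* b : blocks) delete[] b; }

    void push_back(const T& v) {
        if (count % BlockSize == 0)           // current block full (or none)
            blocks.push_back(new T[BlockSize]);
        blocks[count / BlockSize][count % BlockSize] = v;
        ++count;
    }

    // O(1) random access: directory lookup + offset within the block.
    T& operator[](std::size_t i) {
        return blocks[i / BlockSize][i % BlockSize];
    }

    std::size_t size() const { return count; }
};
```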
I think a better phrasing would be:
It's the only way to construct data structures (more precisely, containers) that can accept
arbitrarily large data collections without using memory areas of determinable address.
What I mean is that normal arrays use addr(idx) = idx*size + initial_addr to get the memory address of an element. However, if you change that to something like addr(idx) = idx*idx*size + initial_addr, then the elements of the data structure are not stored in contiguous memory; rather, there are large gaps between where elements are stored. Thus, it is not contiguous memory.

Mapping vectors of arbitrary type

I need to store a list of vectors of different types, each to be referenced by a string identifier. For now, I'm using std::map with std::string as the key and boost::any as its value (example implementation posted here).
I've come unstuck when trying to run a method on all the stored vectors, e.g.:
std::map<std::string, boost::any>::iterator it;
for (it = map_.begin(); it != map_.end(); ++it) {
it->second.reserve(100); // FAIL: refers to boost::any not std::vector
}
My questions:
Is it possible to cast boost::any to an arbitrary vector type so I can execute its methods?
Is there a better way to map vectors of arbitrary types and retrieve them later with the correct type?
At present, I'm toying with an alternative implementation which replaces boost::any with a pointer to a base container class as suggested in this answer. This opens up a whole new can of worms with other issues I need to work out. I'm happy to go down this route if necessary but I'm still interested to know if I can make it work with boost::any, of if there are other better solutions.
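On the first question: a cast is possible, but only if it names the exact stored type; there is no way to cast to "some vector of unknown element type". A sketch using std::any (boost::any behaves the same way; the pointer form of any_cast used here returns nullptr on a type mismatch instead of throwing):

```cpp
#include <any>
#include <vector>

// Try to call a vector method through std::any. This only succeeds when
// we guess the exact stored type -- which is precisely the limitation
// the question runs into.
bool try_reserve(std::any& a) {
    if (auto* v = std::any_cast<std::vector<int>>(&a)) {
        v->reserve(100);  // we recovered the real type; methods work now
        return true;
    }
    return false;  // stored type was something other than vector<int>
}
```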
P.S. I'm a C++ n00b novice (and have been spoilt silly by Python's dynamic typing for far too long), so I may well be going about this the wrong way. Harsh criticism (ideally followed by suggestions) is very welcome.
The big picture:
As pointed out in comments, this may well be an XY problem so here's an overview of what I'm trying to achieve.
I'm writing a task scheduler for a simulation framework that manages the execution of tasks; each task is an elemental operation on a set of data vectors. For example, if task_A is defined in the model to be an operation on "x"(double), "y"(double), "scale"(int) then what we're effectively trying to emulate is the execution of task_A(double x[i], double y[i], int scale[i]) for all values of i.
Every task (function) operates on a different subset of data, so these functions share a common function signature and only have access to data via specific APIs, e.g. get_int("scale") and set_double("x", 0.2).
In a previous incarnation of the framework (written in C), tasks were scheduled statically and the framework generated code based on a given model to run the simulation. The ordering of tasks is based on a dependency graph extracted from the model definition.
We're now attempting to create a common runtime for all models with a run-time scheduler that executes tasks as their dependencies are met. The move from generating model-specific code to a generic runtime has brought about all sorts of pain. Essentially, I need to be able to generically handle heterogeneous vectors and access them by "name" (and perhaps type_info), hence the above question.
I'm open to suggestions. Any suggestion.
Looking through the added detail, my immediate reaction would be to separate the data out into a number of separate maps, with the type as a template parameter. For example, you'd replace get_int("scale") with get<int>("scale") and set_double("x", 0.2) with set<double>("x", 0.2);
Alternatively, using std::map, you could pretty easily change that (for one example) to something like doubles["x"] = 0.2; or int scale_factor = ints["scale"]; (though you may need to be a bit wary with the latter -- if you try to retrieve a nonexistent value, it'll create it with default initialization rather than signaling an error).
Either way, you end up with a number of separate collections, each of which is homogeneous, instead of trying to put a number of collections of different types together into one big collection.
If you really do need to put those together into a single overall collection, I'd think hard about just using a struct, so it would become something like vals.doubles["x"] = 0.2; or int scale_factor = vals.ints["scale"];
At least offhand, I don't see this losing much of anything, and by retaining static typing throughout, it certainly seems to fit better with how C++ is intended to work.
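A sketch of the per-type-map idea with the type as a template parameter (the DataStore name, its members, and the per-element get/set signatures are invented for illustration):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One homogeneous map per scalar type; the type is selected at compile
// time via the template parameter, so static typing is preserved.
class DataStore {
    std::map<std::string, std::vector<double>> doubles_;
    std::map<std::string, std::vector<int>> ints_;

    // Selects the right map for T; specialized below per supported type.
    template <typename T> std::map<std::string, std::vector<T>>& table();

public:
    template <typename T>
    void set(const std::string& name, std::size_t i, T value) {
        auto& v = table<T>()[name];
        if (v.size() <= i) v.resize(i + 1);
        v[i] = value;
    }

    template <typename T>
    T get(const std::string& name, std::size_t i) {
        return table<T>().at(name).at(i);  // throws if name/index unknown
    }
};

template <>
std::map<std::string, std::vector<double>>& DataStore::table<double>() {
    return doubles_;
}
template <>
std::map<std::string, std::vector<int>>& DataStore::table<int>() {
    return ints_;
}
```

Usage then looks like store.set<double>("x", 0, 0.2) and store.get<int>("scale", 0), matching the get<int>/set<double> shape suggested above.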

data structure for storing array of strings in a memory

I'm considering a data structure for storing a large array of strings in memory. Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as possible. Saving memory is not important. I'm inclined toward the standard hash_set structure, which allows searching for elements in roughly constant time, but there's no guarantee that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. For a comparison against a hash table, see this question.
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.
If the strings are immutable - so insertion/deletion is "infrequent", so to speak -, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
*Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).*
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (it will search a space of ~4 billion strings with at most 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated and slower than they appear (especially if you have long strings).
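The recipe above in code (a sketch; the StringSet name is made up here):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Build once at startup: sort the strings lexicographically, then every
// lookup is a binary search over the contiguous, cache-friendly array.
class StringSet {
    std::vector<std::string> sorted_;
public:
    explicit StringSet(std::vector<std::string> strings)
        : sorted_(std::move(strings)) {
        std::sort(sorted_.begin(), sorted_.end());
    }

    bool contains(const std::string& s) const {
        return std::binary_search(sorted_.begin(), sorted_.end(), s);
    }
};
```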
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy mentioned in Raymond Chen's blog would be efficient.