Skippable context: I have a simulation loop (fixed update with variable rendering), which instantiates entities of classes generated on the fly according to user input and/or file configuration, drawing from a database of millions of components (in the container-state-modifying pattern).
I have implemented a system that automagically deduces (through some kind of inference logic I don't know the name of) and applies the components that are needed whenever the user input/config ignores the fact that one of its options requires additional components.
How so? Many of the components are complex formulae, fuzzy logic (gates?), or other scientific reasoning, coded so that they can operate on my simulation's structure, its objects, and its environment. One component therefore sometimes relies on another, and I need the deduction algorithm/system to be able to pass that dependency on to the class constructor.
I've used maximum granularity in the way I store these 'knowledge pieces' because I really can't waste any memory, given the size and compute intensity of the simulations and the number of individual instances. But now I'm running into the issue of a single instance needing thousands, sometimes tens of thousands, of components, and I need the instance's 'creation map' both saved and still bound as a private member so I can: 1st - know where my deductions are leading the constructor of instances, and maybe use memoization to reduce build times; 2nd - implement injection of changes into a live instance during the simulation¹.
What I think I need: a possibly infinite, or at least very long, bit-mask, so that I can iterate the creation faster and have the final component tree of my dynamically constructed objects logged for future use.
Possible approaches, which I have no idea will work:
1st - Manually and sequentially store the values of the bit flags in individual RAM cells, using the RAM itself as my bit-mask.
2nd - Break the map down into smaller bit-masks of known size (hard, because the final number of components is unknown until creation is done, and decoupling the deduction might be impossible without refactoring the entire system).
3rd - Figure out a way to make an infinite bit-mask, or use some library that has implemented a really long integer (5.12e+11 or bigger).
My code is written in C++, my rendering and compute kernels are Vulkan.
My objective question:
How do I implement an arbitrarily long bit-mask that is memory and compute efficient?
If I am allowed a bonus question, what is the most efficient way to implement such a bit-mask, assuming I have no architecture (software and hardware) constraints?
¹ I cannot browse an object's tree during the simulation, and I also cannot afford to pause the simulation and wait for the browsing to finish before injecting the modification. I need to be able to make an arbitrary change on an arbitrary frame of the simulation, both in a pre-scheduled manner and in real time.
what is the most efficient way to implement such a bit-mask
Put std::vector<bool> there and profile. If and when you discover you're spending significant time working with that vector, read further.
There's no single most efficient way; it all depends on what exactly your code does with these bits.
One standard approach is to keep a contiguous vector of integers and treat it as a vector of bits. For example, std::vector<bool> on Windows uses 32-bit values, keeping 32 bools in each one.
However, the API of std::vector is too general. If your vector is mostly 0s, or mostly 1s, you'll be able to implement many operations (get value, find first/next/previous 0/1) much faster than std::vector<bool> does.
Also, depending on your access patterns, you may want either smaller or larger elements. If you decide to use large SIMD types such as __m128i or __m256i, don't forget about their alignment requirements.
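To make that concrete, here is a minimal sketch (my own naming, not from any particular library) of the "contiguous vector of integers treated as bits" idea, using 64-bit words and C++20's std::countr_zero so that find-next-set can skip whole zero words at a time:

#include <bit>      // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>
#include <vector>

// Growable bit vector: 64 flags per word, grows on demand when a bit is set.
class BitVector {
    std::vector<std::uint64_t> words_;
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    void set(std::size_t i) {
        if (i / 64 >= words_.size()) words_.resize(i / 64 + 1, 0);
        words_[i / 64] |= std::uint64_t{1} << (i % 64);
    }

    bool test(std::size_t i) const {
        return i / 64 < words_.size() &&
               ((words_[i / 64] >> (i % 64)) & 1u) != 0;
    }

    // Index of the first set bit at or after `from`, or npos if there is none.
    std::size_t find_next(std::size_t from) const {
        std::size_t w = from / 64;
        if (w >= words_.size()) return npos;
        std::uint64_t cur = words_[w] & (~std::uint64_t{0} << (from % 64));
        for (;;) {
            if (cur != 0) return w * 64 + std::countr_zero(cur);
            if (++w == words_.size()) return npos;
            cur = words_[w];
        }
    }
};

Skipping entire zero words in find_next is exactly the kind of operation such a hand-rolled container can do much faster than std::vector<bool>'s element-by-element interface.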
Related
I'm wrapping native code (mostly Fortran 77) using JNA. The output (i.e. the results) of the native function consists of a bunch of nested (custom) types/structs, which I map to corresponding Structures in JNA. These Structures mostly consist of an array of other Structures (so Structure A holds an array of Structure B, Structure B holds an array of Structure C, etc.).
Using some benchmarking (mainly by logging time differences) I've found that most of the time is not spent in the native code, but during JNA's mapping. The Fortran subroutine call takes about 50 ms, but the total time is 250 ms.
I've found that:
setAutoWrite(false) on our Structures reduces overhead by a factor of ~2 (total execution time almost halves)
Keeping (statically allocated) arrays as small as possible helps keep JNA overhead low
Changing DOUBLE PRECISION (double) to REAL (float) does not seem to make any difference
Are there any further tricks to optimize JNA performance in our case? I know I could flatten my structures down to a 1D array of primitives and use direct mapping, but I am trying to avoid that (because it would be a pain to encode/decode these structures).
As noted in the JNA FAQ, direct mapping would be your best performance increase, but you've excluded that as an option. It also notes that the calling overhead for each native call is another performance hit, which you've partially addressed by changing setAutoWrite().
You also did mention flattening your structures to an array of primitives, but rejected that due to encoding/decoding complexity. However, moving in this direction is probably the next best choice, and it's possible that the biggest performance issue you're currently facing is a combination of JNA's Structure access using reflection and native reads. Oracle notes:
Because reflection involves types that are dynamically resolved, certain Java virtual machine optimizations can not be performed. Consequently, reflective operations have slower performance than their non-reflective counterparts, and should be avoided in sections of code which are called frequently in performance-sensitive applications.
Since you are here asking a performance-related question and using JNA Structures, I can only assume you're writing a "performance-sensitive application". Internally, the Structure does this:
for (StructField structField : fields().values()) {
    readField(structField);
}
which does a single Native read for each field, followed by this, which ends up using reflection under the hood:
setFieldValue(structField.field, result, true);
The moral of the story is that with Structures, each field normally involves a native read plus a reflection write, or a reflection read plus a native write.
The first step you can take without making any other changes is to call setAutoSynch(false) on the structure. (You've already done half of this with the "write" version; this disables both read and write.) From the docs:
For extremely large or complex structures where you only need to access a small number of fields, you may see a significant performance benefit by avoiding automatic structure reads and writes. If auto-read and -write are disabled, it is up to you to ensure that the Java fields of interest are synched before and after native function calls via readField(String) and writeField(String,Object). This is typically most effective when a native call populates a large structure and you only need a few fields out of it. After the native call you can call readField(String) on only the fields of interest.
To really go all out, flattening will possibly help a little more by getting rid of the remaining reflection overhead. The trick is making the offset conversions easy.
Some directions to go, balancing complexity vs. performance:
To write to native memory, allocate and clear a buffer of bytes (mem = new Memory(size); mem.clear(); or just new byte[size]), and write specific fields at the byte offsets you determine using the value from Structure.fieldOffset(name). This does use reflection, but you could do it once per structure and store a map of name to offset for later use.
For reading from native memory, make all your native read calls using a flat buffer to reduce the native overhead to a single read/write. You can cast that buffer to a Structure when you read it (incurring reflection once per field) or read specific byte offsets per the above strategy.
My neuroevolution program (C++) is currently limited to small data sets, and I have projects for it that would (on my current workstation/cloud arrangement) take months to run. The biggest bottleneck is NOT the evaluation of the network or evolutionary processes; it is the size of the data sets. To obtain the fitness of a candidate network, it must be evaluated for EACH record in the set.
In a perfect world, I would have access to a cloud-based virtual machine instance with 1 core for each record in the 15,120-record Cover Type data set. However, the largest VMs I have found are 112-core. At present my program uses OpenMP to parallelize the for-loop implementing the evaluation of all records. The speedup is equal to the number of cores. The crossover/mutation is serial, but could easily be parallelized for the evaluation of each individual (100-10,000 of them).
The biggest problem is the way the network had to be implemented: it is addressed directly from this structure.
struct DNA {
    vector<int> sizes;
    vector<Function> types;
    vector<vector<double>> biases;
    vector<vector<vector<double>>> weights;
};
GPU acceleration appears to be impossible. The program's structures must be made of multi-dimensional data types whose sizes can differ (not every layer is the same size). I selected STL vectors... THEN realized that kernels cannot be passed these, or address them. Standard operations (vector/matrix) would require data conversion, transfer, run, and conversion back. It simply isn't viable.
MPI. I have considered this recently, and it would appear to be viable for the purpose of evaluating the fitness of each individual. If evaluating each one takes more than a couple of seconds (which is a near-certainty), I can imagine this approach being the best way forward. However, I am considering 3 possibilities for how to proceed:
Initialize a "master" cloud instance, and use it to launch 100-10,000 smaller instances. Each would have a copy of the data set in memory, and would need to be deleted once the program found a solution.
SBCs, with their low cost and increasing specifications, could permit the construction of a small home computing cluster, eliminating any security concerns with the cloud and giving me more control over the hardware.
I have no idea what I'm doing; it is practically impossible to breed larger neural networks without GPU acceleration; I failed to understand that the "thrust" library could allow vector-based code to run on a GPU; and I haven't done my homework.
Looking at what you described, I do not think GPU acceleration is impossible. My favorite approach is OpenCL, but even with CUDA you can't easily use the C++ STL for this purpose. However, if you go through the hurdle of converting your C++ code to C data structures (i.e., float, double, or int, and arrays of them, instead of vector<> types, and redefining your vector<Function> in terms of more primitive types), leveraging the GPU should be easy, especially if your program is mostly matrix operations. Be aware, though, that the GPU architecture is different from the CPU's: if your logic has a lot of branching (i.e., if-then-else structures), performance on the GPU will not be good.
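As a rough illustration of that conversion (the flattened layout and all the names below are my own assumptions, not part of the original program), the nested DNA vectors can be packed into contiguous buffers plus per-layer offset tables, which is a form a CUDA or OpenCL kernel can actually consume; the vector<Function> member would be turned into an array of integer codes the same way (not shown):

#include <cstddef>
#include <vector>

// Hypothetical flattened form of the DNA struct: one contiguous buffer per
// quantity, plus offsets so a kernel can locate each layer with plain indexing.
struct FlatDNA {
    std::vector<int>         sizes;       // neurons per layer
    std::vector<double>      biases;      // all biases, layer after layer
    std::vector<double>      weights;     // all weight matrices, row-major
    std::vector<std::size_t> bias_off;    // start of each layer's biases
    std::vector<std::size_t> weight_off;  // start of each layer's weight matrix
};

FlatDNA flatten(const std::vector<int>& sizes,
                const std::vector<std::vector<double>>& biases,
                const std::vector<std::vector<std::vector<double>>>& weights) {
    FlatDNA f;
    f.sizes = sizes;
    for (const auto& layer : biases) {
        f.bias_off.push_back(f.biases.size());
        f.biases.insert(f.biases.end(), layer.begin(), layer.end());
    }
    for (const auto& matrix : weights) {
        f.weight_off.push_back(f.weights.size());
        for (const auto& row : matrix)
            f.weights.insert(f.weights.end(), row.begin(), row.end());
    }
    return f;
}

Once the data is in this shape, f.weights.data(), f.biases.data() and the offset tables can each be copied to the device with a single transfer, and the kernel indexes them through the offsets instead of chasing nested vectors.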
The GPU is far more capable than you might think. All GPU memory is dynamically allocated, which means you can allocate as much memory as you want. If you want a different size for each thread, simply store the sizes in an array and use the thread ID to index it. Moreover, you can even store the network in shared memory and evaluate records across the threads to accelerate memory access. The most convenient way, as you mentioned, is to make use of the Thrust library. You don't need to understand how it is implemented if your aim is not to study GPUs, and you needn't worry much about performance, because it is optimized by professional GPU experts (many from Nvidia, who build the GPUs). Thrust is designed to be very similar to the STL, so it is easy to pick up if you are familiar with C++.
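For example, a minimal Thrust sketch (placeholder data; compiled with nvcc) that copies a flat weight buffer to the device once and reduces it there:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> h_weights = {0.5f, -1.25f, 2.0f, 0.75f};  // placeholder data
    // One host-to-device copy; the device_vector owns the GPU memory.
    thrust::device_vector<float> d_weights(h_weights.begin(), h_weights.end());
    // The reduction itself runs on the GPU.
    float sum = thrust::reduce(d_weights.begin(), d_weights.end(),
                               0.0f, thrust::plus<float>());
    std::printf("sum of weights = %f\n", sum);
    return 0;
}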
Suppose you have a very large graph with lots of processing on its nodes (tens of millions of operations per node). The core routine is the same for each node, but there are some additional operations based on internal conditions. There can be 2 such conditions, which produces 4 cases: (0,0), (1,0), (0,1), (1,1). E.g. (1,1) means that both conditions hold. The conditions are established once (one set for each node, independently) in a program and, from then on, never change. Unfortunately, they are determined at runtime and in a fully unpredictable way (based on data received via HTTP from an external server).
What is the fastest approach in such a scenario (taking into account modern compiler optimizations, which I have no idea about)?
simply using "IFs": if (condition X) perform additional operation X.
using inheritance to derive four classes from a base class (exposing method OPERATION) to have the proper operation and save tens of millions of "ifs" (but I am not sure whether this is a real saving; inheritance must have its overhead too).
using a pointer to function to assign the appropriate function once, based on the conditions.
It would take me a long time to get to a point where I can test this myself (I don't have such big data yet, and this will be integrated into a bigger project, so it would not be easy to test all the versions).
Reading the answers: I know that I probably have to experiment with it. But apart from everything else, this is a question of what is faster:
tens of millions of IF statements and normal, statically known function calls VS function pointer calls VS inheritance, which I think is not the best idea in this case and which I am thinking of eliminating from further inspection. Thanks for any constructive answers (not saying that I shouldn't care about such minor things ;)
There is no real answer except to measure the actual code on the real data. At times, in the past, I've had to deal with such problems, and in the cases I've actually measured, virtual functions were faster than ifs. But that doesn't mean much, since the cases I measured were in a different program (and thus a different context) than yours. For example, a virtual function call will generally prevent inlining, whereas an if is inline by nature, and inlining may open up additional optimization possibilities.
Also, the machines I measured on handled virtual functions pretty well; I've heard that some other machines (HP's PA, for example) are very ineffective in their implementation of indirect jumps (including not only virtual function calls, but also the return from the function---again, the lost opportunity to inline costs).
If you absolutely have to have the fastest way, and the processing order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then, in a container class, keep four vectors, one for each node type. Upon creation of a new node, get all the data you need to create it, including the conditions, and create a node of the correct type, pushing it into the correct vector. When you need to process all your nodes, process them type by type: first all the nodes in the first vector, then the second, and so on.
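A minimal sketch of that layout, with placeholder node types and process bodies:

#include <vector>

// One concrete type per (condition1, condition2) case; bodies are placeholders.
struct Node00 { double data; void process() { /* core routine only       */ } };
struct Node10 { double data; void process() { /* core + extra operation 1 */ } };
struct Node01 { double data; void process() { /* core + extra operation 2 */ } };
struct Node11 { double data; void process() { /* core + both extra ops    */ } };

struct NodeContainer {
    std::vector<Node00> n00;
    std::vector<Node10> n10;
    std::vector<Node01> n01;
    std::vector<Node11> n11;

    // Called once per node, when its conditions arrive from the server.
    void add(bool cond1, bool cond2, double data) {
        if      ( cond1 &&  cond2) n11.push_back({data});
        else if ( cond1 && !cond2) n10.push_back({data});
        else if (!cond1 &&  cond2) n01.push_back({data});
        else                       n00.push_back({data});
    }

    // Process type by type: no per-node branching, no indirection.
    void process_all() {
        for (auto& n : n00) n.process();
        for (auto& n : n10) n.process();
        for (auto& n : n01) n.process();
        for (auto& n : n11) n.process();
    }
};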
Why would you want to do this:
No ifs for the state switching
No vtables
No function indirection
But much more importantly:
No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
No branch prediction misses for state switching ifs (since there are none)
Even if you had inheritance with virtual functions and thus function indirection through vtables, simply sorting the nodes by their type in your vector may already make a world of difference in performance, as any possible instruction cache thrashing would essentially be gone, and depending on the method of branch prediction the branch misprediction rate could also be reduced.
Also, don't make a vector of pointers; make a vector of objects. With pointers you have an extra addressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are spread pretty much randomly throughout memory. If, on the other hand, the objects are placed directly in the vector, processing will basically go through memory linearly, the cache will hit nearly every time, and cache prefetching might actually be able to do a good job.
Note, though, that you would pay heavily in data structure creation if you don't do it correctly: if at all possible, reserve enough capacity in the vector for all your nodes up front, since reallocating and moving every time the vector runs out of space can become expensive.
Oh, and yes, as James mentioned, always, always measure! What you think may be the fastest way may not be, sometimes things are very counter intuitive, depending on all kinds of factors like optimizations, pipelining, branch prediction, cache hits/misses, data structure layout, etc. What I wrote above is a pretty general approach, but it is not guaranteed to be the fastest and there are definitely ways to do it wrong. Measure, Measure, Measure.
P.S. Inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented via a vtable at the head of the class, which is basically just a table of function pointers to the implementations of the virtuals for the actual type of the object. Whether ifs are faster than virtuals or the other way around is a very, very difficult question to answer and depends completely on the implementation, compiler, and platform used.
I'm actually quite impressed with how effective branch prediction can be, and only the if solution allows inlining, which can also be dramatic. Virtual functions and pointers to functions also involve loading from memory and could possibly cause cache misses.
But you have four conditions, so branch misses can be expensive.
Without the ability to test and verify, the question really can't be answered, especially since it's not even clear that this would be a performance bottleneck significant enough to warrant optimization efforts.
In cases like this, I would err on the side of readability and ease of debugging, and go with if.
Many programmers have taken classes and read books that go on about certain favorite subjects: pipelining, caching, branch prediction, virtual functions, compiler optimizations, big-O algorithms, etc., and the performance of those.
If I could make an analogy to boating, these are things like trimming weight, tuning power, adjusting balance and streamlining, assuming you are starting from some speedboat that's already close to optimal.
Never mind you may actually be starting from the Queen Mary, and you're assuming it's a speedboat.
It may very well be that there are ways to speed up the code by large factors, just by cutting away fat (masquerading as good design), if only you knew where it was.
Well, you don't know where it is, and guessing where is a waste of time, unless you value being wrong.
When people say "measure and use a profiler" they are pointing in the right direction, but not far enough.
Here's an example of how I do it, and I made a crude video of it, FWIW.
Unless there's a clear pattern to these attributes, no branch predictor exists that can effectively predict this data-dependent condition for you. Under such circumstances, you may be better off avoiding control speculation (and paying the penalty of a branch misprediction), and just waiting for the actual data to arrive and resolve the control flow (more likely to occur when using virtual functions). You'll have to benchmark, of course, to verify that, as it depends on the actual pattern (if, for example, you have even small groups of similarly "tagged" elements).
The sorting suggested above is nice and all, but note that it converts a problem that's just plain O(n) into an O(n log n) one, so for large sizes you'll lose unless you can sort once and traverse many times, or otherwise cheaply maintain the sorted state.
Note that some predictors may also attempt to predict the address of the function call, so you might be facing the same problem there.
However, I must agree with the comments regarding early optimization: do you know for sure that the control flow is your bottleneck? What if fetching the actual data from memory takes longer? In general, it would seem that your elements can be processed in parallel, so even if you run this on a single thread (and much more so if you use multiple cores) you should be bandwidth-bound and not latency-bound.
I have the following three-dimensional bit array (for a Bloom filter):
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS];
the P_ROWS dimension represents independent two-dimensional bit arrays (i.e., P_ROWS[0], P_ROWS[1], P_ROWS[2] are independent bit arrays); each could be as large as 100 MB and contains data that is populated independently. The data I am looking for could be in any of these P_ROWS, and right now I search through them independently, i.e. P_ROWS[0], then P_ROWS[1], and so on until I get a positive or reach the end (P_ROWS[n-1]). This implies that if n is 100 I have to do this search (bit comparison) 100 times (and this search is done very often). Somebody suggested that I can improve the search performance if I do bit grouping (use a column-major order on the row-major order array -- I DON'T KNOW HOW).
I really need to improve the performance of the search because the program does a lot of it.
I will be happy to give more details of my bit table implementation if required.
Sorry for the poor language.
Thanks for your help.
EDIT:
The bit grouping could be done in the following format:
Assume the array to be:
unsigned char P_bit_table_[P_ROWS][ROWS][COLUMNS] = {{(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)},
                                                     {(a1,a2,a3),(b1,b2,b3),(c1,c2,c3)}};
As you can see, all the rows (on the third dimension) hold similar data. What I want after the grouping is something like this: all the a1's in one group (as just one entity, so that I can compare them with another bit to check whether they are on or off), all the b1's in another group, and so on.
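For illustration, one concrete way to realize this grouping in C++ (the packing into 64-bit words and all names below are my own, assuming each unsigned char of the original table holds 8 filter bits) is to store, for every (row, column, bit) position, the corresponding bits of all P_ROWS planes next to each other:

#include <cstddef>
#include <cstdint>
#include <vector>

// Bit-grouped layout: for each (row, column, bit) position of the 2-D filters,
// the bits of all planes are packed into adjacent 64-bit words. Testing one
// position against every plane then costs ceil(planes/64) word loads instead
// of one scattered byte load per plane.
struct GroupedBitTable {
    std::size_t planes, rows, columns;   // planes corresponds to P_ROWS
    std::size_t words_per_pos;           // ceil(planes / 64)
    std::vector<std::uint64_t> bits;

    GroupedBitTable(std::size_t p, std::size_t r, std::size_t c)
        : planes(p), rows(r), columns(c),
          words_per_pos((p + 63) / 64),
          bits(r * c * 8 * words_per_pos, 0) {}

    std::uint64_t* pos(std::size_t row, std::size_t col, unsigned bit) {
        return &bits[((row * columns + col) * 8 + bit) * words_per_pos];
    }

    void set(std::size_t plane, std::size_t row, std::size_t col, unsigned bit) {
        pos(row, col, bit)[plane / 64] |= std::uint64_t{1} << (plane % 64);
    }

    // True if the given filter bit is set in *any* plane.
    bool any_plane(std::size_t row, std::size_t col, unsigned bit) {
        const std::uint64_t* w = pos(row, col, bit);
        for (std::size_t i = 0; i < words_per_pos; ++i)
            if (w[i] != 0) return true;
        return false;
    }
};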
Re-use Other People's Algorithms
There are a ton of bit-manipulation optimizations out there, including many non-obvious ones, like Hamming weights and specialized algorithms for finding the next true or false bit, that are rather independent of how you structure your data.
Reusing algorithms that other people have written can really speed up computation and lookups, not to mention development time. Some algorithms are so specialized and use such computational magic that they will have you scratching your head; in that case, you can take the author's word for it (after you confirm their correctness with unit tests).
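For example, the Hamming weight of a word is available directly through C++20's std::popcount, so a population count over a block of the table is a short loop rather than something to hand-roll:

#include <bit>       // std::popcount (C++20)
#include <cstddef>
#include <cstdint>

// Number of set bits in a buffer of 64-bit words.
std::size_t count_set_bits(const std::uint64_t* words, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += static_cast<std::size_t>(std::popcount(words[i]));
    return total;
}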
Take Advantage of CPU Caching and Multithreading
I personally reduce my multidimensional bit arrays to one dimension, optimized for expected traversal.
This way, there is a greater chance of hitting the CPU cache.
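A minimal sketch of that reduction for the table in the question (the traversal order baked into the index formula is an assumption; put the dimension your hot loop walks last):

#include <cstddef>
#include <vector>

// One flat allocation instead of a 3-D array: consecutive accesses along the
// innermost (COLUMNS) index land in the same cache lines.
struct FlatBitTable {
    std::size_t p_rows, rows, columns;
    std::vector<unsigned char> data;

    FlatBitTable(std::size_t p, std::size_t r, std::size_t c)
        : p_rows(p), rows(r), columns(c), data(p * r * c, 0) {}

    unsigned char& at(std::size_t p, std::size_t r, std::size_t c) {
        return data[(p * rows + r) * columns + c];
    }
};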
In your case, I would also think deeply about the mutability of the data and whether you want to put locks on blocks of bits. With 100MBs of data, you have the potential of running your algorithms in parallel using many threads, if you can structure your data and algorithms to avoid contention.
You may even have a lockless model if you divide up ownership of the blocks of data by thread so no two threads can read or write to the same block. It all depends on your requirements.
Now is a good time to think about these issues. But since no one knows your data and usage better than you do, you must consider design options in the context of your data and usage patterns.
I'm designing a custom VM and am curious about how many registers I should use. Initially I had 255, but I'm a little concerned about backing up 255 pointers (a whole KB) onto the stack or heap every time I call a function, when most of them won't even be used. How many registers should I use?
You might want to look into register windows, which are a way of reducing the number of "active" registers available at any one time, while still keeping a large number of registers in core.
Having said that, you may find that a stack-based architecture is more convenient. Some major virtual machines intended to be implemented in software (the JVM, CLR, Python, etc.) use a stack architecture. It's certainly easier to write a compiler for a stack than for an artificially restricted set of registers.
This generally depends on how many you think you'll need. I question the usefulness of 255 registers in practical applications.
The last register machine I built was aimed at supporting a small programming language, and when mapping things out, I looked at the types of applications, the design methodologies I wanted to guide people to use, balancing all that with performance concerns when designing the register file.
It's not something that can easily be answered without more details, but if you stop and think about what it is you're trying to do, and balance it all out with whatever aspects you find important, you'll come to a conclusion you can live with, and that probably makes sense.
Whatever number of registers you choose, you are probably going to have way too many for most subroutines and way too few for a few subroutines. (This is just a guess. However, considering how many things in programming follow a Power Law Distribution – incoming references to objects, modules, classes, outgoing references from objects, modules, classes, cyclomatic complexity of subroutines, NPath complexity of subroutines, SLOC length of subroutines, lifetime of objects, size of objects – it is only reasonable to assume that the same is true for the number of registers for a subroutine, especially if you consider that there is probably a correlation between complexity/length and number of registers.)
The Parrot VM has found quite a simple way out of this conundrum: it has an infinite number of registers. Obviously, those registers aren't stored in an infinite array; rather, just enough registers are lazily materialized for any single subroutine. That way, it never runs out of registers and never wastes any space.
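A minimal sketch of that materialize-on-demand idea (the class below is made up for illustration; Parrot's actual implementation differs):

#include <cstddef>
#include <cstdint>
#include <vector>

// Register file that materializes registers only when they are first touched,
// so a short subroutine never pays for registers it does not use.
class RegisterFile {
    std::vector<std::int64_t> regs_;
public:
    std::int64_t& operator[](std::size_t i) {
        if (i >= regs_.size()) regs_.resize(i + 1, 0);  // grow on demand
        return regs_[i];
    }
    std::size_t materialized() const { return regs_.size(); }
};

With this, writing to register 3 of a fresh RegisterFile allocates only four slots, and materialized() reports how many registers a subroutine actually ended up using.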
Sorry guys, I made a stupid mistake on this one. It turns out that I already had a vector of registers to optimize access to the stack, which I had totally forgotten about. Instead of duplicating them, I just set the registers in the state to be a reference to the stack's registers. Now all I need to do is specialize pushing to push straight to a register, and the problem is solved in a nice, efficient fashion. These registers will also never need backing up, since there's nothing function-dependent about them, and they'll grow in perfect accordance with my stack. It had just never occurred to me that I could push values into them without pushing an equivalent value onto the stack.
The absolutely hideous template mess this is turning into for simple design concepts though is making me extremely unhappy. Want to buy: static if and variadic templates.