I am trying to improve the speed of a computational (biological) model written in C++ (previous version is on my github: Prokaryotes). The most time-consuming function is where I calculate binding affinities between transcription factors and binding sites on a single genome.
Background: In my model, binding affinity is given by the Hamming distance between the binding domain of a transcription factor (a 20-bool array) and the sequence of a binding site (also a 20-bool array). For a single genome, I need to calculate the affinities between all active transcription factors (typically 5-10) and all binding sites (typically 10-50). I do this every timestep for more than 10,000 cells in the population to update their gene expression states. Having to calculate up to half a million comparisons of 20-bool arrays to simulate just one timestep of my model means that typical experiments take several months (2M--10M timesteps).
For the previous version of the model (link above) genomes remained fairly small, so I could calculate binding affinities once for every cell (at birth) and store and re-use these numbers during the cell's lifetime. However, in the latest version, genomes expand considerably and multiple genomes reside within the same cell. Thus, storing the affinities of all transcription factor--binding site pairs in a cell becomes impractical.
In the current implementation I defined an inline function belonging to the Bead class (which is a base class for transcription factor class "Regulator" and binding site class "Bsite"). It is written directly in the header file Bead.hh:
inline int Bead::BindingAffinity(const bool* sequenceA, const bool* sequenceB, int seqlen) const
{
int affinity = 0;
for (int i=0; i<seqlen; i++)
{
affinity += (int)(sequenceA[i]^sequenceB[i]);
}
return affinity;
}
The above function accepts two pointers to boolean arrays (sequenceA and sequenceB), and an integer specifying their length (seqlen). Using a simple for-loop I then check at how many positions the arrays differ (sequenceA[i]^sequenceB[i]), summing into the variable affinity.
Given a binding site (bsite) on the genome, we can then iterate through the genome and for every transcription factor (reg) calculate its affinity to this particular binding site like so:
affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence, 20); //third argument: sequence length (20 in the current model)
So, this is how streamlined I managed to make it; since I don't have a programming background, I wonder whether there are better ways to write the above function or to structure the code (i.e. should BindingAffinity be a function of the base Bead class)? Suggestions are greatly appreciated.
Thanks to @PaulMcKenzie and @eike for your suggestions. I tested both ideas against my previous implementation. Below are the results. In short, both answers work very well.
My previous implementation yielded an average runtime of 5m40 +/- 7 (n=3) for 1000 timesteps of the model. Profiling analysis with GPROF showed that the function BindingAffinity() took 24.3% of total runtime. [see Question for the code].
The bitset implementation yielded an average runtime of 5m11 +/- 6 (n=3), corresponding to a ~9% speed increase. Only 3.5% of total runtime is spent in BindingAffinity().
//Function definition in Bead.hh
inline int Bead::BindingAffinity(std::bitset<regulator_length> sequenceA, const std::bitset<regulator_length>& sequenceB) const
{
return (int)(sequenceA ^= sequenceB).count();
}
//Function call in Genome.cc
affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence);
The main downside of the bitset implementation is that, unlike with boolean arrays (my previous implementation), I have to specify the length of the bitset that goes into the function. I am occasionally comparing bitsets of different lengths, so for these I currently specify separate functions (I had understood from https://www.cplusplus.com/doc/oldtutorial/templates/ that templates are problematic in multi-file projects, but defining the template in the header, as sketched below, avoids that issue).
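A minimal sketch of such a header-defined template, shown as a free function for brevity (a member template of Bead would work the same way):

// In Bead.hh -- a function template defined in the header is instantiated in
// every translation unit that uses it, so it works fine across multiple .cc files.
#include <bitset>
#include <cstddef>

template <std::size_t N>
inline int BindingAffinity(std::bitset<N> sequenceA, const std::bitset<N>& sequenceB)
{
	return (int)((sequenceA ^= sequenceB).count());
}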
For the integer implementation I tried two alternatives to the std::popcount(seq1^seq2) function suggested by @eike, since I am working with an older C++ standard that doesn't include it.
Alternative #1:
inline int Bead::BindingAffinity(int sequenceA, int sequenceB) const
{
	std::bitset<32> bits (sequenceA ^ sequenceB);
	return (int)bits.count();
}
Alternative #2:
inline int Bead::BindingAffinity(int sequenceA, int sequenceB) const
{
int i = sequenceA^sequenceB;
//SWAR algorithm, copied from https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
i = i - ((i >> 1) & 0x55555555); // add pairs of bits
i = (i & 0x33333333) + ((i >> 2) & 0x33333333); // quads
i = (i + (i >> 4)) & 0x0F0F0F0F; // groups of 8
return (i * 0x01010101) >> 24; // horizontal sum of bytes
}
These yielded average runtimes of 5m06 +/- 6 (n=3) and 5m06 +/- 3 (n=3), respectively, corresponding to a ~10% speed increase compared to my previous implementation. I only profiled Alternative #2, which showed that only 2.2% of total runtime was spent in BindingAffinity(). The downside of using integers for bitstrings is that I have to be very careful whenever I change any of the code. Single-bit mutations are definitely possible as mentioned by #eike, but everything is just a little bit trickier.
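As an aside (not benchmarked here): on GCC and Clang, pre-C++20 code can also use the compiler builtin __builtin_popcount, which typically compiles down to a single popcount instruction. A sketch of the same function using it:

inline int Bead::BindingAffinity(int sequenceA, int sequenceB) const
{
	// GCC/Clang-specific builtin; counts the set bits of an unsigned int.
	return __builtin_popcount((unsigned int)(sequenceA ^ sequenceB));
}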
Conclusion:
Both the bitset and integer implementations for comparing bitstrings achieve impressive speed improvements. So much so, that BindingAffinity() is no longer the bottleneck in my code.
Related
I have a large array (> millions) of Items, where each Item has the form:
struct Item { void *a; size_t b; };
There are a handful of distinct a fields—meaning there are many items with the same a field.
I would like to "factor" this information out to save about 50% memory usage.
However, the trouble is that these Items have a significant ordering, and that may change over time. Therefore, I can't just go ahead and make a separate Item[] for each distinct a, because that would lose the relative ordering of the items with respect to each other.
On the other hand, if I store the orderings of all the items in a size_t index; field, then I lose any memory savings from the removal of the void *a; field.
So is there a way for me to actually save memory here, or no?
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way. That one will require me to either use unaligned memory or to split every Item[] into two, which isn't great for memory locality, so I'd prefer something else.)
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way.)
This thinking is on the right track, but it's not that simple, since you will run into some nasty alignment/padding issues that will negate your memory gains.
At that point, when you start trying to scratch the last few bytes of a structure like this, you will probably want to use bit fields.
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */

#define A_INDEX_BITS 3
struct Item {
	size_t a_index : A_INDEX_BITS;
	size_t b       : (sizeof(size_t) * CHAR_BIT) - A_INDEX_BITS;
};
Note that this will limit how many bits are available for b, but on modern platforms, where sizeof(size_t) is 8, stripping 3-4 bits from it is rarely an issue.
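For completeness, a sketch of how the original pointer could be recovered, assuming the handful of distinct a values are kept in a small side table (the table and accessor names are hypothetical):

/* One entry per distinct 'a' value; a_index selects into this table. */
static void *a_table[1u << A_INDEX_BITS];

static void *get_a(const struct Item *item) { return a_table[item->a_index]; }
static size_t get_b(const struct Item *item) { return item->b; }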
Use a combination of lightweight compression schemes (see this for examples and some references) to represent the a values. @Frank's answer employs DICT followed by NS, for example. If you have long runs of the same pointer, you could consider RLE (Run-Length Encoding) on top of that.
This is a bit of a hack, but I've used it in the past with some success. The extra overhead for object access was compensated for by the significant memory reduction.
A typical use case is an environment where (a) values are actually discriminated unions (that is, they include a type indicator) with a limited number of different types and (b) values are mostly kept in large contiguous vectors.
With that environment, it is quite likely that the payload part of (some kinds of) values uses up all the bits allocated for it. It is also possible that the datatype requires (or benefits from) being stored in aligned memory.
In practice, now that aligned access is not required by most mainstream CPUs, I would just use a packed struct instead of the following hack. If you don't pay for unaligned access, then storing a { one-byte type + eight-byte value } as nine contiguous bytes is probably optimal; the only cost is that you need to multiply by 9 instead of 8 for indexed access, and that is trivial since the 9 is a compile-time constant.
If you do have to pay for unaligned access, then the following is possible. Vectors of "augmented" values have the type:
// Assume that Payload has already been typedef'd. In my application,
// it would be a union of, eg., uint64_t, int64_t, double, pointer, etc.
// In your application, it would be b.
// Eight-byte payload version:
typedef struct Chunk8 { uint8_t kind[8]; Payload value[8]; } Chunk8;
// Four-byte payload version:
typedef struct Chunk4 { uint8_t kind[4]; Payload value[4]; } Chunk4;
// The code below uses the eight-byte version:
typedef Chunk8 Chunk;
Vectors are then vectors of Chunks. For the hack to work, they must be allocated on 8- (or 4-)byte aligned memory addresses, but we've already assumed that alignment is required for the Payload types.
The key to the hack is how we represent a pointer to an individual value, because the value is not contiguous in memory. We use a pointer to its kind member as a proxy:
typedef uint8_t* ValuePointer;
And then use the following low-but-not-zero-overhead functions:
#define P_SIZE 8U
#define P_MASK (P_SIZE - 1U)
// Internal function used to get the low-order bits of a ValuePointer.
static inline size_t vpMask(ValuePointer vp) {
return (uintptr_t)vp & P_MASK;
}
// Getters / setters. This version returns the address so it can be
// used both as a getter and a setter
static inline uint8_t* kindOf(ValuePointer vp) { return vp; }
static inline Payload* valueOf(ValuePointer vp) {
return (Payload*)(vp + 1 + (vpMask(vp) + 1) * (P_SIZE - 1));
}
// Increment / Decrement
static inline ValuePointer inc(ValuePointer vp) {
return vpMask(++vp) ? vp : vp + P_SIZE * P_SIZE;
}
static inline ValuePointer dec(ValuePointer vp) {
	return vpMask(vp--) ? vp : vp - P_SIZE * P_SIZE;
}
// Simple indexed access from a Chunk pointer
static inline ValuePointer eltk(Chunk* ch, size_t k) {
return &ch[k / P_SIZE].kind[k % P_SIZE];
}
// Increment a value pointer by an arbitrary (non-negative) amount
static inline ValuePointer inck(ValuePointer vp, size_t k) {
size_t off = vpMask(vp);
return eltk((Chunk*)(vp - off), k + off);
}
I left out a bunch of the other hacks but I'm sure you can figure them out.
One cool thing about interleaving the pieces of the value is that it has moderately good locality of reference. For the 8-byte version, almost half of the time a random access to a kind and a value will only hit one 64-byte cacheline; the rest of the time two consecutive cachelines are hit, with the result that walking forwards (or backwards) through a vector is just as cache-friendly as walking through an ordinary vector, except that it uses fewer cachelines because the objects are half the size. The four byte version is even cache-friendlier.
I think I figured out the information-theoretically-optimal way to do this myself... it's not quite worth the gains in my case, but I'll explain it here in case it helps someone else.
However, it requires unaligned memory (in some sense).
And perhaps more importantly, you lose the ability to easily add new values of a dynamically.
What really matters here is the number of distinct Items, i.e. the number of distinct (a,b) pairs. After all, it could be that for one a there are a billion different bs, but for the other ones there are only a handful, so you want to take advantage of that.
If we assume that there are N distinct items to choose from, then we need n = ceil(log2(N)) bits to represent each Item. So what we really want is an array of n-bit integers, with n computed at run time. Then, once you get the n-bit integer, you can do a binary search in log(n) time to figure out which a it corresponds to, based on your knowledge of the count of bs for each a. (This may be a bit of a performance hit, but it depends on the number of distinct as.)
You can't do this in a nice memory-aligned fashion, but that isn't too bad. What you would do is make a uint_vector data structure with the number of bits per element being a dynamically-specifiable quantity. Then, to randomly access into it, you'd do a few divisions or mod operations along with bit-shifts to extract the required integer.
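A minimal sketch of such a uint_vector, assuming 64-bit backing words and one extra word of padding so the last read never runs off the end (the names are made up for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

struct PackedUintVector {
	std::vector<uint64_t> words;   // packed storage, padded with one extra word
	unsigned bits;                 // n = bits per element, fixed at construction

	uint64_t get(size_t k) const {
		size_t bitpos = k * bits;
		size_t word   = bitpos / 64;       // which word the element starts in
		unsigned off  = bitpos % 64;       // bit offset within that word
		uint64_t v = words[word] >> off;
		if (off + bits > 64)               // element straddles a word boundary
			v |= words[word + 1] << (64 - off);
		uint64_t mask = (bits == 64) ? ~uint64_t(0) : ((uint64_t(1) << bits) - 1);
		return v & mask;
	}
};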
The caveat here is that the dividing by a variable will probably severely damage your random-access performance (although it'll still be O(1)). The way to mitigate that would probably be to write a few different procedures for common values of n (C++ templates help here!) and then branch into them with various if (n == 33) { handle_case<33>(i); } or switch (n) { case 33: handle_case<33>(i); }, etc. so that the compiler sees the divisor as a constant and generates shifts/adds/multiplies as needed, rather than division.
This is information-theoretically optimal as long as you require a constant number of bits per element, which is what you would want for random-accessing. However, you could do better if you relax that constraint: you could pack multiple integers into k * n bits, then extract them with more math. This will probably kill performance too.
(Or, long story short: C and C++ really need a high-performance uint_vector data structure...)
A Structure-of-Arrays approach may be helpful. That is, have three vectors...
vector<A> vec_a;
vector<B> vec_b;
SomeType b_to_a_map;
You access your data as...
Item Get(int index)
{
Item retval;
retval.a = vec_a[b_to_a_map[index]];
retval.b = vec_b[index];
return retval;
}
Now all you need to do is choose something sensible for SomeType. For example, if vec_a.size() were 2, you could use vector<bool> or boost::dynamic_bitset. For more complex cases you could try bit-packing; for example, to support 4 values of A, we simply change the lookup to...
int a_index = b_to_a_map[index*2]*2 + b_to_a_map[index*2+1];
retval.a = vec_a[a_index];
You can always beat bit-packing by using range-packing, using div/mod to store a fractional bit length per item, but the complexity grows quickly.
A good guide can be found here http://number-none.com/product/Packing%20Integers/index.html
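For illustration, a sketch of such range-packing for items that can take one of 3 values: five items fit into one byte (3^5 = 243 <= 256), roughly 1.585 bits per item instead of 2 (the helper names are made up):

#include <cstdint>

uint8_t pack5(const int v[5])
{
	uint8_t x = 0;
	for (int i = 4; i >= 0; --i)
		x = x * 3 + v[i];          // each v[i] must be 0, 1 or 2
	return x;
}

int unpack(uint8_t x, int i)       // extract item i with div/mod
{
	for (int k = 0; k < i; ++k)
		x /= 3;
	return x % 3;
}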
I have a matrix which wraps around.
m_matrixOffset points to the first cell (0, 0) of the wrapped-around matrix. To access a cell we have the function GetCellInMatrix below. The wrap-around logic (in the while loops) is executed each time someone accesses a cell, which happens thousands of times per second. Is there any way to optimize this using a lookup table or some other approach? MAX_ROWS and MAX_COLS may not be powers of 2.
struct Cell
{
	int rowId;
	int colId;
};

int matData[MAX_ROWS][MAX_COLS];

int GetCellInMatrix(const Cell& cellIndex)
{
	Cell newCellIndex = cellIndex + m_matrixOffset;   // assumes operator+ is defined for Cell
	while (newCellIndex.rowId >= MAX_ROWS)
	{
		newCellIndex.rowId -= MAX_ROWS;
	}
	while (newCellIndex.colId >= MAX_COLS)
	{
		newCellIndex.colId -= MAX_COLS;
	}
	return matData[newCellIndex.rowId][newCellIndex.colId];
}
You might be interested in the concept of division with remainder, usually implemented as a % b for the remainder.
Thus
return matData[newCellIndex.rowId % MAX_ROWS][newCellIndex.colId % MAX_COLS];
does not need the while loops before it.
As per the comment, the implied integer division in the remainder computation is too costly if done at each query. Assuming that m_matrixOffset is constant over a large number of queries, reduce its coordinates once using the remainder operations. Then the newCellIndex coordinates are less than twice the maximum, and thus need to be reduced at most once. It is therefore safe to replace each while with an if, sparing one comparison.
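A sketch of that reduced access function, assuming m_matrixOffset has already been brought into range once via the remainder operations:

int GetCellInMatrix(const Cell& cellIndex)
{
	int row = cellIndex.rowId + m_matrixOffset.rowId;   // each term is already < MAX_ROWS
	int col = cellIndex.colId + m_matrixOffset.colId;   // each term is already < MAX_COLS
	if (row >= MAX_ROWS) row -= MAX_ROWS;               // at most one reduction needed
	if (col >= MAX_COLS) col -= MAX_COLS;
	return matData[row][col];
}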
If you can sacrifice memory for speed, then double the matrix dimensions and fill the excess entries with the repeated matrix elements. You have to make sure this pattern holds when updating the matrix.
Then, again assuming that both m_matrixOffset and CellIndex are inside the maxima for rows and columns, you can access the cell of the extended matrix without any further reduction. This would be a variant on the "lookup table" idea.
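A sketch of that extended-matrix access (extData is a hypothetical doubled array that must be kept consistent with the original matrix on every update):

// Each logical cell (r, c) is stored four times:
// at (r, c), (r + MAX_ROWS, c), (r, c + MAX_COLS) and (r + MAX_ROWS, c + MAX_COLS).
int extData[2 * MAX_ROWS][2 * MAX_COLS];

int GetCellInMatrix(const Cell& cellIndex)
{
	// No wrap-around logic needed, since both summands are within the maxima.
	return extData[cellIndex.rowId + m_matrixOffset.rowId]
	              [cellIndex.colId + m_matrixOffset.colId];
}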
Or use real lookup tables, but you then execute 3 array cell lookups like in
return matData[repeatedRowIndex[newCellIndex.rowId]][repeatedColIndex[newCellIndex.colId]];
It depends if the wrap is small or large in relation to the matrix.
The most common case is that all you need is the nearest neighbour. So make the matrix N+2 by M+2 and duplicate the wrap. That makes reads fast but writes a bit fiddly (often a good trade-off).
If that's no good, specialise the functions. Work out which cells are edge cells and handle them specially (you must be able to do this more cheaply than simply hard-coding the logic into the access, of course; if only one or two cells change every pass that will hold, but not if you generate a random list every pass).
We have a given 3D mesh and we are trying to eliminate identical vertices. For this we are using a self-defined struct containing the coordinates of a vertex and the corresponding normal.
struct vertice
{
float p1,p2,p3,n1,n2,n3;
bool operator == (const vertice& vert) const
{
return (p1 == vert.p1 && p2 == vert.p2 && p3 == vert.p3);
}
};
After filling the vertex with data, it is added to an unordered_set to remove the duplicates.
struct hashVertice
{
size_t operator () (const vertice& vert) const
{
return(7*vert.p1 + 13*vert.p2 + 11*vert.p3);
}
};
std::unordered_set<vertice,hashVertice> verticesSet;
vertice vert;
unsigned int i = 0;
while(i<(scene->mMeshes[0]->mNumVertices)){
vert.p1 = (float)scene->mMeshes[0]->mVertices[i].x;
vert.p2 = (float)scene->mMeshes[0]->mVertices[i].y;
vert.p3 = (float)scene->mMeshes[0]->mVertices[i].z;
vert.n1 = (float)scene->mMeshes[0]->mNormals[i].x;
vert.n2 = (float)scene->mMeshes[0]->mNormals[i].y;
vert.n3 = (float)scene->mMeshes[0]->mNormals[i].z;
verticesSet.insert(vert);
i = i+1;
}
We discovered that it is too slow for data sets of around 3,000,000 vertices. Even after 15 minutes of running, the program wasn't finished. Is there a bottleneck we don't see, or is another data structure better suited for such a task?
What happens if you just remove verticesSet.insert(vert); from the loop?
If it speeds up dramatically (as I expect it would), your bottleneck is in the guts of the std::unordered_set, which is a hash table, and the main potential performance problem with hash tables is excessive hash collisions.
In your current implementation, if p1, p2 and p3 are small, the number of distinct hash codes will be small (since you "collapse" float to integer) and there will be lots of collisions.
If the above assumptions turn out to be true, I'd try to implement the hash function differently (e.g. multiply with much larger coefficients).
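For instance, a sketch that hashes the binary image of each coordinate via std::hash<float> and mixes the results (the mixing constant is arbitrary; requires <functional>):

struct hashVertice
{
	size_t operator () (const vertice& vert) const
	{
		std::hash<float> h;                    // standard hash over the float's value
		size_t seed = h(vert.p1);
		seed = seed * 1000003u ^ h(vert.p2);   // arbitrary odd multiplier to spread the bits
		seed = seed * 1000003u ^ h(vert.p3);
		return seed;
	}
};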
Other than that, profile your code, as others have already suggested.
Hashing floating point can be tricky. In particular, your hash routine calculates the hash as a floating point value, then converts it to an unsigned integral type. This has serious problems if the vertices can be small: if all of the vertices are in the range [0...1.0), for example, your hash function will never return anything greater than 13 as an unsigned integer, which means that there will be at most 13 different hash codes.
The usual way to hash floating point is to hash the binary image, checking for the special cases first. (0.0 and -0.0 have different binary images, but must hash the same. And it's an open question what you do with NaNs.) For float this is particularly simple, since it usually has the same size as int, and you can reinterpret_cast:
size_t
hash( float f )
{
	assert( f == f );   //  not a NaN
	return f == 0.0 ? 0 : reinterpret_cast<unsigned&>( f );
}
I know, formally, this is undefined behavior. But if float and int have the same size, and unsigned has no trapping representations (the case on most general purpose machines today), then a compiler which gets this wrong is being intentionally obtuse.
You then use any combining algorithm to merge the three results; the one you use is as good as any other in this case (though it's not a good generic algorithm).
I might add that while some of the comments insist on profiling (and this is generally good advice), if you're taking 15 minutes for 3 million values, the problem can really only be a poor hash function, which results in lots of collisions. Nothing else will cause performance that bad. And unless you're familiar with the internal implementation of std::unordered_set, the usual profiler output will probably not give you much information.
On the other hand, std::unordered_set does have functions like bucket_count and bucket_size, which allow analysing the quality of the hash function. In your case, if you cannot create an unordered_set with 3 million entries, your first step should be to create a much smaller one, and use these functions to evaluate the quality of your hash code.
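A sketch of that kind of diagnostic, run on a smaller sample inserted into verticesSet (requires <iostream>):

size_t usedBuckets = 0, largestBucket = 0;
for (size_t b = 0; b < verticesSet.bucket_count(); ++b)
{
	size_t n = verticesSet.bucket_size(b);
	if (n > 0) ++usedBuckets;
	if (n > largestBucket) largestBucket = n;
}
std::cout << usedBuckets << " of " << verticesSet.bucket_count()
          << " buckets used, largest bucket holds " << largestBucket << " entries\n";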
If there is a bottleneck, you are definitely not seeing it, because you don't include any kind of timing measures.
Measure the timing of your algorithm, either with a profiler or just manually. This will let you find the bottleneck - if there is one.
This is the correct way to proceed. Expecting yourself, or alternatively, StackOverflow users to spot bottlenecks by eye inspection instead of actually measuring time in your program is, from my experience, the most common cause of failed attempts at optimization.
I have a class containing a number of double values. This is stored in a vector where the indices for the classes are important (they are referenced from elsewhere). The class looks something like this:
Vector of classes
class A
{
public:
	double count;
	double val;
	double sumA;
	double sumB;
	vector<double> sumVectorC;
	vector<double> sumVectorD;
};

vector<A> classes(10000);
The code that needs to run as fast as possible is something like this:
vector<double> result(classes.size());
for(int i = 0; i < classes.size(); i++)
{
result[i] += classes[i].sumA;
vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
if(it != classes[i].sumVectorC.end())
result[i] += *it;
}
The alternative is, instead of one giant loop, to split the computation into two separate loops, such as:
for(int i = 0; i < classes.size(); i++)
{
result[i] += classes[i].sumA;
}
for(int i = 0; i < classes.size(); i++)
{
vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
if(it != classes[i].sumVectorC.end())
result[i] += *it;
}
or to store each member of the class in a vector like so:
Class of vectors
vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...
and then operate as:
for(int i = 0; i < classes.size(); i++)
{
result[i] += classCounts[i];
...
}
Which way would usually be faster (across x86/x64 platforms and compilers)? Are look-ahead and cache lines the most important things to think about here?
Update
The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is because the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.
As the implementation of both variants seems easy enough I would build both versions and profile them to find the fastest one.
Empirical data usually beats speculation.
As a side issue: Currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testVal appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).
As a general guideline: consider algorithmic improvements before low-level implementation optimisation.
As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.
Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on XBox 360 and PS3).
It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your linear search O(n) into O(log n).
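A sketch of that sorted-vector variant (assuming each sumVectorC can be sorted once after it is built; requires <algorithm>):

// Once, after the sumVectorC vectors are filled:
std::sort(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end());

// In the hot loop, the linear find becomes a binary search:
std::vector<double>::iterator it =
	std::lower_bound(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
if (it != classes[i].sumVectorC.end() && *it == testval)
	result[i] += *it;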
If you can guarantee that std::numeric_limits<double>::infinity() is not a possible value, you can ensure that the arrays are sorted with a dummy infinite entry at the end and then manually code the find so that the loop condition is a single test:
array[i]<test_val
and then an equality test.
Then you know that the average number of values looked at is (size()+1)/2 in the not-found case. Of course, if the search array changes very frequently, then keeping it sorted becomes an issue.
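A sketch of that sentinel-terminated scan (assuming sumVectorC is sorted and its last element is the infinity sentinel):

const double* p = &classes[i].sumVectorC[0];
while (*p < testval)     // the infinity sentinel guarantees the loop terminates
	++p;
if (*p == testval)       // single equality test after the loop
	result[i] += *p;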
Of course, you don't tell us much about sumVectorC, or the rest of A for that matter, so it is hard to ascertain the situation and give really good advice. For example, if sumVectorC is never updated, then it is probably possible to find an extremely cheap hash (e.g. cast to ULL and bit extraction) that is perfect on the sumVectorC values and that fits into a double[8]. Then the overhead is a bit extraction and 1 comparison versus 3 or 6.
Also, if you have a reasonable bound on sumVectorC.size() (you mentioned 4 or 5, so this assumption seems not bad), you could consider using an aggregated array or even just a boost::array<double> and add your own dynamic size, e.g.:
class AggregatedArray : public boost::array<double, 8>
{
	size_t _size;
public:
	size_t size() const {
		return _size;
	}
	// push_back(...), pop(), resize(...) etc. update _size and the fixed storage
};
this gets rid of the extra cache line access to the allocated array data for sumVectorC.
In the case where sumVectorC updates very infrequently, if finding a perfect hash (out of your class of hash algorithms) is relatively cheap, then you can incur that cost with profit whenever sumVectorC changes. These small lookups can be problematic, and algorithmic complexity is frequently irrelevant; it is the constants that dominate. It is an engineering problem and not a theoretical one.
Unless you can guarantee that the small maps are in cache, you can almost be guaranteed that using a std::map will yield approximately 130% worse performance, as pretty much each node in the tree will be in a separate cache line.
Thus instead of accessing (4*1 + 1*2)/5 = 1.2 cache lines per search (the first 4 values are in the first cacheline, the 5th in the second cacheline), you will access roughly (1 + 2*2 + 2*3)/5 + 1 for the tree object itself = 2.8 cachelines per search (1 node at the root, 2 nodes as children of the root, and the last 2 as grandchildren of the root, plus the tree object itself).
So I would predict using a std::map to take 2.8/1.2 = 233% as long for a sumVectorC having 5 entries
This what I meant when I said: "It is an engineering problem and not a theoretical one."
I am writing a simulation and need some hints on the design. The basic idea is that data for the given stochastic processes is being generated and later on consumed for various calculations. For example, for 1 iteration:
Process 1 -> generates data for source 1: x1
Process 2 -> generates data for source 2: x2
and so on
Later I want to apply some transformations, for example on the output of source 2, which results in x2a, x2b, x2c. So in the end I have the following vector: [x1, x2a, x2b, x2c].
I have a problem, as for N-multivariate stochastic processes (representing for example multiple correlated phenomena) I have to generate the N-dimensional sample at once:
Process 1 -> generates data for source 1...N: x1...xN
I am thinking about a simple architecture that would allow me to structure the simulation code and provide flexibility without hindering performance.
I was thinking of something along these lines (pseudocode):
class random_process
{
// concrete processes would generate and store last data
virtual data_ptr operator()() const = 0;
};
class source_proxy
{
container_type<process> processes;
container_type<data_ptr> data; // pointers to the process data storage
data operator[](size_type number) const { return *(data[number]);}
void next() const {/* update the processes */}
};
Somehow I am not convinced about this design. For example, if I'd like to work with vectors of samples instead of a single iteration, then the above design would have to change (I could for example have the processes fill the submatrices of a proxy-matrix passed to them with data, but again I'm not sure if this is a good idea; if it is, then it would also fit the single-iteration case nicely). Any comments, suggestions and criticism are welcome.
EDIT:
Short summary of the text above to summarize the key points and clarify the situation:
random_processes contain the logic to generate some data. For example, one can draw samples from a multivariate Gaussian with the given means and correlation matrix. I can use for example a Cholesky decomposition, and as a result I'll be getting a set of samples [x1 x2 ... xN]
I can have multiple random_processes, with different dimensionality and parameters
I want to do some transformations on individual elements generated by random_processes
Here is the dataflow diagram
random_processes                        output

     x1 --------------------------> x1
                              ----> x2a
p1   x2 ------------transform|----> x2b
                              ----> x2c
     x3 --------------------------> x3

p2   y1 ------------transform|----> y1a
                              ----> y1b
The output is being used to do some calculations.
When I read this "the answer" doesn't materialize in my mind, but instead a question:
(This problem is part of a class of problems that various tool vendors in the market have created configurable solutions for.)
Do you "have to" write this or can you invest in tried and proven technology to make your life easier?
In my job at Microsoft I work with high performance computing vendors - several of which have math libraries. Folks at these companies would come much closer to understanding the question than I do. :)
Cheers,
Greg Oliver [MSFT]
I'll take a stab at this; perhaps I'm missing something, but it sounds like we have a list of processes 1...N that don't take any arguments and return a data_ptr. So why not store them in a vector (or array) if the number is known at compile time, and then structure them in whatever way makes sense? You can get really far with the STL and the built-in containers (std::vector), function objects (std::tr1::function) and algorithms (std::transform). You didn't say much about the higher-level structure, so I'm assuming a really naive one, but clearly you would build the data flow appropriately. It gets even easier if you have a compiler with support for C++0x lambdas, because you can nest the transformations more easily.
//compiled in the SO textbox...
#include <vector>
#include <functional>   // std::tr1::function may live in <tr1/functional> on some compilers
#include <algorithm>

typedef int data_ptr;

class Generator{
public:
	data_ptr operator()(){
		//randomly generate input
		return 42 * 4;
	}
};

class StochasticTransformation{
public:
	data_ptr operator()(data_ptr in){
		//apply a randomly seeded function
		return in * 4;
	}
};

int main(){
	//array of processes, wrap this in a class if you like but it sounds
	//like there is a distinction between generators that create data
	//and transformations
	std::vector<std::tr1::function<data_ptr(void)> > generators;

	//TODO: fill up the process vector with functors...
	generators.push_back(Generator());

	//transformations look like this (right?)
	std::vector<std::tr1::function<data_ptr(data_ptr)> > transformations;

	//so let's add one
	transformations.push_back(StochasticTransformation());

	//and we have an array of results...
	std::vector<data_ptr> results;

	//and we need some inputs
	const int NUMBER = 10;   //however many samples are needed
	for (int i = 0; i < NUMBER; ++i)
		results.push_back(generators[0]());

	//and now start transforming them using transform...
	//pick a random one or do them all...
	std::transform(results.begin(), results.end(),
	               results.begin(), transformations[0]);
}
I think that the second option (the one mentioned in the last paragraph) makes more sense. In the one you had presented you are playing with pointers and indirect access to the random process data. The other one would store all the data (either a vector or a matrix) in one place: the source_proxy object. The random process objects are then called with a submatrix to populate as a parameter, and do not themselves store any data. The proxy manages everything, from providing the source data (for any distinct source) to requesting new data from the generators.
So, changing your snippet a bit, we could end up with something like this:
class random_process
{
// concrete processes would generate and store last data
virtual void operator()(submatrix &) = 0;
};
class source_proxy
{
container_type<random_process> processes;
matrix data;
data operator[](size_type source_number) const { /* return a column of data */ }
void next() {/* get new data from the random processes */}
};
But I agree with the other comment (Greg) that it is a difficult problem, and depending on the final application it may require heavy thinking. It's easy to run into a dead end and end up rewriting lots of code...