I'm trying to do a simple shift of an Eigen Matrix<int,200,200>, but I can't get Eigen::Translation to work. Since I'm rather new to C++, Eigen's official documentation isn't of much use to me; I can't extract any useful information from it. I've tried to declare my translation as:
Translation<int,2> t(1,0);
hoping for a simple one-row shift, but I can't get it to do anything with my matrix. Actually, I'm not even sure that's what this class is for. If not, could you please recommend some other, preferably fast, way of doing matrix translation on a torus? I'm looking for an equivalent of MATLAB's circshift.
The Translation class template is from the Geometry module and represents a translation transformation. It has nothing to do with shifting values in an array/matrix.
According to this discussion, the shifting feature wasn't implemented yet as of 2010 and was of low priority back then. I don't see any indication in the documentation that things are any different now, 4 years later.
So, you need to do it yourself. For example:
/// Shifts a matrix/vector row-wise.
/// A negative \a down value is taken to mean shifting up.
/// When passed zero for \a down, the input matrix is returned unchanged.
/// The type \a M can be either a fixed- or dynamically-sized matrix.
template <typename M> M shiftedByRows(const M & in, int down)
{
    if (!down) return in;
    M out(in.rows(), in.cols());
    if (down > 0) down = down % in.rows();
    else down = in.rows() - (-down % in.rows());
    // We avoid the implementation-defined sign of modulus with negative arg.
    int rest = in.rows() - down;
    out.topRows(down) = in.bottomRows(down);
    out.bottomRows(rest) = in.topRows(rest);
    return out;
}
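If the data lives in a plain row-major buffer rather than an Eigen matrix, the same circshift-style row shift can be sketched with std::rotate. This is only a sketch under that assumption: the name shiftRowsDown and the flat std::vector layout are made up for illustration and are not part of Eigen.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: circularly shifts the rows of a row-major
// rows x cols buffer down by `down` rows (a negative value shifts up).
std::vector<int> shiftRowsDown(std::vector<int> data,
                               std::size_t rows, std::size_t cols, long down)
{
    if (rows == 0 || cols == 0) return data;
    const long n = static_cast<long>(rows);
    // Normalise the shift into [0, rows) so the sign of % does not matter.
    const long k = ((down % n) + n) % n;
    // Shifting down by k rows equals rotating left by (rows - k) rows.
    const std::ptrdiff_t middle =
        static_cast<std::ptrdiff_t>((n - k) * static_cast<long>(cols));
    std::rotate(data.begin(), data.begin() + middle, data.end());
    return data;
}
```

The buffer is taken by value, so the caller's data is left untouched, mirroring the behaviour of shiftedByRows above.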
I am trying to improve the speed of a computational (biological) model written in C++ (previous version is on my github: Prokaryotes). The most time-consuming function is where I calculate binding affinities between transcription factors and binding sites on a single genome.
Background: In my model, binding affinity is given by the Hamming distance between the binding domain of a transcription factor (a 20-bool array) and the sequence of a binding site (also a 20-bool array). For a single genome, I need to calculate the affinities between all active transcription factors (typically 5-10) and all binding sites (typically 10-50). I do this every timestep for more than 10,000 cells in the population to update their gene expression states. Having to calculate up to half a million comparisons of 20-bool arrays to simulate just one timestep of my model means that typical experiments take several months (2M--10M timesteps).
For the previous version of the model (link above) genomes remained fairly small, so I could calculate binding affinities once for every cell (at birth) and store and re-use these numbers during the cell's lifetime. However, in the latest version, genomes expand considerably and multiple genomes reside within the same cell. Thus, storing affinities of all transcription factor--binding site pairs in a cell becomes impractical.
In the current implementation I defined an inline function belonging to the Bead class (which is a base class for transcription factor class "Regulator" and binding site class "Bsite"). It is written directly in the header file Bead.hh:
inline int Bead::BindingAffinity(const bool* sequenceA, const bool* sequenceB, int seqlen) const
{
    int affinity = 0;
    for (int i=0; i<seqlen; i++)
    {
        affinity += (int)(sequenceA[i]^sequenceB[i]);
    }
    return affinity;
}
The above function accepts two pointers to boolean arrays (sequenceA and sequenceB), and an integer specifying their length (seqlen). Using a simple for-loop I then check at how many positions the arrays differ (sequenceA[i]^sequenceB[i]), summing into the variable affinity.
Given a binding site (bsite) on the genome, we can then iterate through the genome and for every transcription factor (reg) calculate its affinity to this particular binding site like so:
affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence);
So, this is how streamlined I managed to make it; since I don't have a programming background, I wonder whether there are better ways to write the above function or to structure the code (i.e. should BindingAffinity be a function of the base Bead class)? Suggestions are greatly appreciated.
Thanks to @PaulMcKenzie and @eike for your suggestions. I tested both ideas against my previous implementation. Below are the results. In short, both answers work very well.
My previous implementation yielded an average runtime of 5m40 +/- 7 (n=3) for 1000 timesteps of the model. Profiling analysis with GPROF showed that the function BindingAffinity() took 24.3% of total runtime. [see Question for the code].
The bitset implementation yielded an average runtime of 5m11 +/- 6 (n=3), corresponding to a ~9% speed increase. Only 3.5% of total runtime is spent in BindingAffinity().
//Function definition in Bead.hh
inline int Bead::BindingAffinity(std::bitset<regulator_length> sequenceA, const std::bitset<regulator_length>& sequenceB) const
{
    return (int)(sequenceA ^= sequenceB).count();
}
//Function call in Genome.cc
affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence);
The main downside of the bitset implementation is that unlike with boolean arrays (my previous implementation), I have to specify the length of the bitset that goes into the function. I am occasionally comparing bitsets of different lengths, so for these I now have to specify separate functions (templates would not work for multi-file project according to https://www.cplusplus.com/doc/oldtutorial/templates/).
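A function template can be shared between translation units as long as its definition sits in the header, so a single template over the bitset length might replace the separate per-length functions. The following is a minimal sketch under that assumption; the free function bindingAffinity is a hypothetical name and not part of the Bead class shown here.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical header-only helper: Hamming distance between two bitsets
// of the same compile-time length N. Because the template is defined in
// the header, every source file that includes it can instantiate it.
template <std::size_t N>
inline int bindingAffinity(const std::bitset<N>& a, const std::bitset<N>& b)
{
    return static_cast<int>((a ^ b).count());
}
```

The compiler deduces N from the arguments, so a call such as bindingAffinity(bsite->sequence, reg->sequence) would work for any length, provided both arguments have the same length.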
For the integer implementation I tried two alternatives to the std::popcount(seq1^seq2) function suggested by @eike, since I am working with an older version of C++ that doesn't include this.
Alternative #1:
inline int Bead::BindingAffinity(int sequenceA, int sequenceB) const
{
    int i = sequenceA^sequenceB;
    return ((std::bitset<32>)i).count();
}
Alternative #2:
inline int Bead::BindingAffinity(int sequenceA, int sequenceB) const
{
    // Use an unsigned value: shifting or overflowing a signed int is not portable.
    unsigned int i = (unsigned int)(sequenceA^sequenceB);
    //SWAR algorithm, copied from https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
    i = i - ((i >> 1) & 0x55555555u); // add pairs of bits
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u); // quads
    i = (i + (i >> 4)) & 0x0F0F0F0Fu; // groups of 8
    return (int)((i * 0x01010101u) >> 24); // horizontal sum of bytes
}
These yielded average runtimes of 5m06 +/- 6 (n=3) and 5m06 +/- 3 (n=3), respectively, corresponding to a ~10% speed increase compared to my previous implementation. I only profiled Alternative #2, which showed that only 2.2% of total runtime was spent in BindingAffinity(). The downside of using integers for bitstrings is that I have to be very careful whenever I change any of the code. Single-bit mutations are definitely possible as mentioned by @eike, but everything is just a little bit trickier.
Conclusion:
Both the bitset and integer implementations for comparing bitstrings achieve impressive speed improvements. So much so, that BindingAffinity() is no longer the bottleneck in my code.
I have a matrix which wraps around.
m_matrixOffset points to the first cell (0, 0) of the wrapped-around matrix. To access a cell we use the function GetCellInMatrix below. The wrap-around logic (the while loops) is executed each time someone accesses a cell, which happens thousands of times per second. Is there any way to optimize this using a lookup table or some other way? MAX_ROWS and MAX_COLS may not be powers of 2.
struct Cell
{
    int rowId;
    int colId;
};
int data[MAX_ROWS][MAX_COLS];
int GetCellInMatrix(const Cell& cellIndex)
{
    Cell newCellIndex = cellIndex + m_matrixOffset;
    while (newCellIndex.rowId >= MAX_ROWS)
    {
        newCellIndex.rowId -= MAX_ROWS;
    }
    while (newCellIndex.colId >= MAX_COLS)
    {
        newCellIndex.colId -= MAX_COLS;
    }
    return data[newCellIndex.rowId][newCellIndex.colId];
}
You might be interested in the concept of division with remainder, usually implemented as a % b for the remainder.
Thus
return data[newCellIndex.rowId % MAX_ROWS][newCellIndex.colId % MAX_COLS];
does not need the while loops before it.
As per the comment, the implied integer division in the remainder computation is too costly if done at each query. Assuming that m_matrixOffset is constant over a large number of queries, reduce its coordinates once using the remainder operations. Then the coordinates of newCellIndex are less than twice the maximum and need to be reduced at most once, so it is safe to replace while with if, sparing one comparison.
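The reduce-once idea can be sketched as two small helpers. This is a minimal sketch, assuming the incoming index is already inside [0, max); reduceOffset and wrapOnce are hypothetical names, not part of the original code.

```cpp
// Reduce an offset into [0, max) once, whenever the offset changes.
inline int reduceOffset(int offset, int max)
{
    int r = offset % max;
    return r < 0 ? r + max : r;
}

// Wrap index + reducedOffset, assuming 0 <= index < max and
// 0 <= reducedOffset < max, so the sum is always below 2 * max
// and a single conditional subtraction is enough.
inline int wrapOnce(int index, int reducedOffset, int max)
{
    int s = index + reducedOffset;
    if (s >= max) s -= max;
    return s;
}
```

The division now happens only when the offset changes; each cell access costs one addition, one comparison, and at most one subtraction per coordinate.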
If you can sacrifice memory for speed, then double the matrix dimensions and fill the excess entries with the repeated matrix elements. You have to make sure this pattern holds when updating the matrix.
Then, again assuming that both m_matrixOffset and CellIndex are inside the maxima for rows and columns, you can access the cell of the extended matrix without any further reduction. This would be a variant on the "lookup table" idea.
Or use real lookup tables, but you then execute 3 array cell lookups like in
return data[repeatedRowIndex[newCellIndex.rowId]][repeatedColIndex[newCellIndex.colId]];
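One way such index tables might be built is shown below. It is a sketch only: makeWrapTable is a hypothetical helper, and it assumes every lookup index is the sum of an in-range cell index and an already reduced offset, so it never reaches 2 * max.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: table[i] == i % max for 0 <= i < 2 * max, built
// without any division. One table per dimension is enough.
inline std::vector<int> makeWrapTable(int max)
{
    std::vector<int> table(2 * static_cast<std::size_t>(max));
    for (int i = 0; i < 2 * max; ++i)
        table[static_cast<std::size_t>(i)] = (i >= max) ? i - max : i;
    return table;
}
```

Tables built with makeWrapTable(MAX_ROWS) and makeWrapTable(MAX_COLS) would play the roles of repeatedRowIndex and repeatedColIndex in the line above. Whether the extra memory reads beat a compare-and-subtract is something to measure.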
It depends if the wrap is small or large in relation to the matrix.
The most common case is that all you need is the nearest neighbour. So make the matrix N+2 by M+2 and duplicate the wrap. That makes reads fast but writes a bit fiddly (often a good trade-off).
If that's no good, specialise the functions. Work out which cells are edge cells and handle them specially. You must be able to do this more cheaply than simply hard-coding the logic into the access, of course: if only one or two cells change every pass that will hold, but not if you generate a random list every pass.
I am given
struct point
{
int x;
int y;
};
and the table of points:
point tab[MAX];
Program should return the minimal distance between the centers of gravity of any possible pair of subsets from tab. Subset can be any size (of course >=1 and < MAX).
I am obliged to write this program using recursion.
So my function will be int type because I have to return int.
I set a global variable min (because during recursion I have to compare some values with this min)
int min = 0;
My function should for sure, take number of elements I add, sum of Y coordinates and sum of X coordinates.
int return_min_distance(int sY, int sX, int number, bool iftaken[])
I will be glad for any help further.
I thought about another table of bools which I pass as a parameter to determine if I took value or not from table. Still my problem is how to implement this, I do not know how to even start.
I think you need a function that can iterate through all subsets of the table, starting with either nothing or an existing iterator. The code then gets easy:
int min_distance = MAXINT;
SubsetIterator si1(0, tab);
while (si1.hasNext())
{
    SubsetIterator si2(&si1, tab);
    while (si2.hasNext())
    {
        int d = subsetDistance(tab, si1.subset(), si2.subset());
        if (d < min_distance)
        {
            min_distance = d;
        }
    }
}
The SubsetIterators can be simple binary counters MAX bits wide, where a 1 bit indicates membership in the subset. Yes, the loop is quadratic in the number of subsets (and that number is itself exponential in MAX), but I think it has to be.
The trick is incorporating recursion. Sorry, I just don't see how it helps here. If I can think of a way to use it, I'll edit my answer.
Update: I thought about this some more, and while I still can't see a use for recursion, I found a way to make the subset processing easier. Rather than run through the entire table for every distance computation, the SubsetIterators could store precomputed sums of the x and y values for easy distance computation. Then, on every iteration, you subtract the values that are leaving the subset and add the values that are joining. A simple bit-and operation can reveal these. To be even more efficient, you could use Gray coding instead of plain binary counting to step through the membership bitmaps. This guarantees that at each iteration exactly one value enters or leaves the subset. Minimal work.
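The Gray-code idea can be sketched as follows. This is an illustration, not the SubsetIterator from the answer: graySubsetSums and SubsetSum are hypothetical names, and the table is assumed small enough for the bitmap to fit in an unsigned int.

```cpp
#include <cstddef>
#include <vector>

struct point { int x; int y; };

// Membership bitmap plus the running coordinate sums for that subset.
struct SubsetSum { unsigned mask; long sx; long sy; int count; };

// Visits every non-empty subset of `tab` in Gray-code order. Consecutive
// subsets differ by exactly one point, so the sums are updated with a
// single addition or subtraction instead of being recomputed.
std::vector<SubsetSum> graySubsetSums(const std::vector<point>& tab)
{
    std::vector<SubsetSum> out;
    const unsigned n = static_cast<unsigned>(tab.size());
    unsigned prev = 0;              // Gray code of the previous step (empty set)
    long sx = 0, sy = 0;
    int count = 0;
    for (unsigned i = 1; i < (1u << n); ++i) {
        const unsigned gray = i ^ (i >> 1);
        const unsigned changed = gray ^ prev;     // exactly one bit is set
        unsigned idx = 0;
        while (!((changed >> idx) & 1u)) ++idx;   // index of the flipped bit
        if (gray & changed) {                     // point idx joins the subset
            sx += tab[idx].x; sy += tab[idx].y; ++count;
        } else {                                  // point idx leaves the subset
            sx -= tab[idx].x; sy -= tab[idx].y; --count;
        }
        SubsetSum s = { gray, sx, sy, count };
        out.push_back(s);
        prev = gray;
    }
    return out;
}
```

The centre of gravity of each subset is then (sx / count, sy / count), and the pairwise comparison only has to look at these precomputed sums.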
Background
I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:
it is of length 12;
the order of the sequence elements does not matter;
no element appears twice in the same sequence;
all elements are smaller than about 300.
Note that the properties 2. and 3. imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.
I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.
Question
I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.
Any suggestions?
http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/
Here's a basic idea; feel free to modify it at will.
Hashing an integer is just the identity.
We use the formula from boost::hash_combine to combine hashes.
We sort the array to get a unique representative.
Code:
#include <algorithm>
std::size_t array_hash(int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);
    std::size_t result = 0;
    for (int * p = a; p != a + 12; ++p)
    {
        std::size_t const h = *p; // the "identity hash"
        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
    }
    return result;
}
Update: scratch that. You just edited the question to be something completely different.
If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you an extra 12!, which is about 29 bits, so it doesn't really make a difference.
You can either look for a 128-bit unsigned integral type and store the sorted, packed set of integers in that directly, or split that range up into two 64-bit integers and compute the hash as above:
uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
I would just use the sum function as the hash and see how far you come with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that they are all < 300. On the other hand, it’s blazingly fast.
#include <cstddef>
#include <numeric>

std::size_t hash(int (&arr)[12]) {
    return std::accumulate(arr, arr + 12, 0);
}
Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a number of if…else statements) to sort the 12 values in-place (but I have no idea what a sorting network for 12 values would look like, or even whether it’s practical).
EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:
#include <algorithm>
#include <cstddef>
#include <numeric>

struct pow2 {
    int operator ()(int n) const { return n * n; }
};
std::size_t hash(int (&arr)[12]) {
    int raised[12];
    std::transform(arr, arr + 12, raised, pow2());
    return std::accumulate(raised, raised + 12, 0);
}
You could toggle the bits corresponding to each of the 12 integers in a bitset of size 300. Then use the formula from boost::hash_combine to combine the ten 32-bit integers that implement this bitset.
This gives a commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
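A minimal sketch of this bitset idea, assuming every element lies in [0, 320) so that ten 32-bit words suffice. The name bitsetHash is hypothetical, and the use of <cstdint> is an assumption that the original poster's pre-C++11 toolchain may not meet.

```cpp
#include <cstddef>
#include <cstdint>

// Order-independent hash of 12 distinct values in [0, 320).
// Each value sets one bit in a bitmap stored as ten 32-bit words; the
// words are then mixed with the boost::hash_combine formula.
std::size_t bitsetHash(const int (&seq)[12])
{
    std::uint32_t words[10] = { 0 };
    for (int i = 0; i < 12; ++i) {
        const unsigned v = static_cast<unsigned>(seq[i]);
        words[v / 32] |= std::uint32_t(1) << (v % 32);
    }
    std::size_t result = 0;
    for (int w = 0; w < 10; ++w) {
        result ^= static_cast<std::size_t>(words[w]) + 0x9e3779b9
                  + (result << 6) + (result >> 2);
    }
    return result;
}
```

The bitmap is the same for every ordering of the input, so the order-dependent hash_combine step is applied to an already canonical representation.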
This approach may be generalized if we choose an arbitrary bitset size and set or toggle an arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or by a pre-computed lookup table). The result is a Bloom filter or a related structure.
We can choose a Bloom filter of size 32 or 64 bits. In this case, there is no need to combine pieces of a large bit vector into a single hash value. In the classical implementation of a Bloom filter with size 32, the optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.
If, instead of the "or" operation of the classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get the solution mentioned by Jim Balter.
If, instead of the "or" operation, we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution similar to the one suggested by Konrad Rudolph.
I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.
Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.
import random

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)

mask_size = 64
mask_size_over_2 = mask_size // 2
nmasks = 300
suffix = 'UL'

print('HashType mask[' + str(nmasks) + '] = {')
for i in range(nmasks):
    combo = random_combination(range(mask_size), mask_size_over_2)
    mask = 0
    for j in combo:
        mask |= (1 << j)
    if i < nmasks - 1:
        print('\t' + str(mask) + suffix + ',')
    else:
        print('\t' + str(mask) + suffix + ' };')
The C++ array generated by the script is used as follows:
#include <numeric>
#include <stdint.h>

typedef int_least64_t HashType;
const int maxTableSize = 300;
HashType mask[maxTableSize] = {
    // generated array goes here
};
inline HashType xorrer(HashType const &l, HashType const &r) {
    return l^mask[r];
}
HashType hashConfig(HashType *sequence, int n) {
    return std::accumulate(sequence, sequence+n, (HashType)0, xorrer);
}
This algorithm is by far the fastest of those that I have tried (this, this with cubes and this with a bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.
I used a profiler to look over some code which does not yet run fast enough. It found that the following function took most of the time, and half of the time in this function was spent in floor. Now, there are two possibilities: optimizing this function, or going one level up and reducing the number of calls to this function. I wonder whether the first one is possible.
int Sph::gridIndex (Vector3 position) const {
    int mx = ((int)floor(position.x / _gridIntervalSize) % _gridSize);
    int my = ((int)floor(position.y / _gridIntervalSize) % _gridSize);
    int mz = ((int)floor(position.z / _gridIntervalSize) % _gridSize);
    if (mx < 0) {
        mx += _gridSize;
    }
    if (my < 0) {
        my += _gridSize;
    }
    if (mz < 0) {
        mz += _gridSize;
    }
    int x = mx * _gridSize * _gridSize;
    int y = my * _gridSize;
    int z = mz * 1;
    return x + y + z;
}
Vector3 is just some simple class which stores three floats and provides some overloaded operators. _gridSize is of type int and _gridIntervalSize is a float. There are _gridSize ^ 3 buckets.
The purpose of the function is to provide hash table support. Every 3d-point is mapped to an index, and points which lie in the same voxel of size _gridIntervalSize ^ 3 should land in the same bucket.
First rule of optimization when there is math involved: Eliminate division, square roots, and trig functions.
inverse_size = 1 / _gridIntervalSize;
....that should be done only once, not once per call.
int mx = ((int)floor(position.x * inverse_size) % _gridSize);
int my = ((int)floor(position.y * inverse_size) % _gridSize);
int mz = ((int)floor(position.z * inverse_size) % _gridSize);
I would also recommend dropping the mod operation because that's another division - if your grid size is a power of 2 you can use & (gridsize-1) which will also allow you to delete the conditional code at the bottom which is another big savings.
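A sketch of the power-of-two variant, under stated assumptions: gridSize is a power of two, ints are two's complement, and the reciprocal of the interval size has been computed once by the caller. The free functions wrappedCell and gridIndexPow2 are hypothetical names, not members of the original Sph class.

```cpp
#include <cmath>

// Wraps floor(coord * inverseCellSize) into [0, gridSize) with a bit mask.
// With two's-complement ints the mask also maps negative cells to the
// correct non-negative bucket, so no sign fix-up is needed.
inline int wrappedCell(float coord, float inverseCellSize, int gridSize)
{
    const int cell = static_cast<int>(std::floor(coord * inverseCellSize));
    return cell & (gridSize - 1);
}

// Combines the three wrapped cell coordinates into one bucket index.
inline int gridIndexPow2(float x, float y, float z,
                         float inverseCellSize, int gridSize)
{
    const int mx = wrappedCell(x, inverseCellSize, gridSize);
    const int my = wrappedCell(y, inverseCellSize, gridSize);
    const int mz = wrappedCell(z, inverseCellSize, gridSize);
    return (mx * gridSize + my) * gridSize + mz;
}
```

This removes both the division and the three conditional corrections; the floor call remains and would have to be tackled separately.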
On another note, using overloaded operators may be hurting you. This is a touchy subject here so I'll let you experiment with it and decide for yourself.
I assume you use floor because negative values are possible, and because you don't want an anomaly due to the default truncation when you cast to int (values rounding toward zero from both sides, making some oversized voxels).
If you can specify a safe most-negative value for each value in the vector, you could subtract that (negative) value, or rather the nearest more-negative multiple of _gridIntervalSize, before the cast, and drop the floor.
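The offset trick might look like the sketch below, which works in cell units rather than position units. It assumes the caller can supply a bias that is a positive multiple of gridSize and at least as large as the most negative cell coordinate; cellWithoutFloor and bias are hypothetical names.

```cpp
// Assumes coord * inverseCellSize >= -bias and bias % gridSize == 0.
// Adding the bias makes the value non-negative, so the truncating cast
// gives the same result as floor, and because the bias is a multiple of
// gridSize it drops out again in the remainder.
inline int cellWithoutFloor(float coord, float inverseCellSize,
                            int gridSize, int bias)
{
    const int cell = static_cast<int>(coord * inverseCellSize
                                      + static_cast<float>(bias));
    return cell % gridSize;
}
```

A large bias costs float precision for coordinates near zero, so points very close to a cell boundary may land in the neighbouring cell.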
Using fmod may ensure you have a safe most-negative value, and replace the integer %, but it's probably an anti-optimisation. Still, as a quick change, it may be worth checking.
Also, check whether your platform supports vector instructions, and whether your compiler can easily be encouraged to use them. x86 chips certainly have integer vector instructions as well as float (the old Pentium 1 MMX instructions, for a start) and might be able to handle this much more efficiently than the "normal" CPU instruction set. This may even be a case for digging out the list of vector instruction intrinsics for your compiler and doing some hand-optimisation. Just check what the compiler can do for you first - I'm not sure how much of this kind of optimisation compilers will do for you already.
One probably trivial piece of micro-optimisation...
return (mx * _gridSize + my) * _gridSize + mz;
Saves one integer multiplication. Trivial, of course, and the compiler may catch it anyway, but this is an old habitual thing.
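Oh - watch the leading underscores. A leading underscore followed by an uppercase letter (or a double underscore anywhere) is reserved everywhere, and a leading underscore followed by a lowercase letter is reserved in the global namespace. Member names like _gridSize are technically fine, but the rules are easy to trip over.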
EDIT
Another way to avoid the floor is to handle positive and negative values separately, if you are willing to accept that items bang on the edge of a grid cell may end up in the wrong cell (possible anyway, since floats should be considered approximate). Just apply a -1 offset in the negative case, to pull it away from zero by almost exactly the right amount to compensate for the truncation. You might consider a bit-fiddling increment-the-mantissa afterwards (to get already-integer values in the cell you'd expect), but this is probably unnecessary.
If you can impose power-of-two limitations to your sizes, there may be a bit-fiddling way to efficiently extract the grid position from a float, avoiding some or all of the multiply, floor and % for each of x, y and z, assuming a standard floating point representation (ie this is non-portable). Again, handle positive and negative separately. Extract the exponent, bit-shift the mantissa accordingly, then mask out unwanted bits.
I think you need to look higher up the hierarchy to get real speed improvements. That is, is storing points in a hash-map really the most efficient solution? I assume you have an array of Vector3 arrays, i.e.:
Vector3 *points [size][size][size]
where each element in the 3D array is an array of Vector3.
The algorithm you're using doesn't guarantee uniform distribution of points in each Vector3 array, which may be a problem. A cluster of points within _gridIntervalSize will map to the same array.
An alternative method would be to use oct-trees, which are like binary trees but each node has eight child nodes. Each node requires the min/max x/y/z values to define the volume the node covers. To add values to the tree:
Recursive search tree to find smallest node that can contain point
Add point to node
If number of points in node > upper limit to number of points in a node
Create child nodes and move points to child nodes
You may want to use quad-trees if there is little variation in values along a particular axis. Another method is to use BSPs - divide the world into two halves and recurse to find the container to add your point to. Again, these can be dynamic.
Converting the floats to ints and having the division planes lie on integer values will speed up the process as well.
Googling the above terms will lead you to more in depth analysis of the algorithms.
Finally, using floats (or doubles) for co-ordinates in an infinite plane is a bad idea - the further you get from (0,0,0) the less precision you have (the gaps between floating point values increase as the value increases). You will need to 'reset' the floating point values to keep the precision. One method is to 'tile' the space and change the co-ordinates to use integer and floating point parts. The integer part defines the 'tile' and the floating point part defines the position in the tile. This method gets you a much simpler hashing method - just use the integer parts, no call to floor required and only integer calculations required. Another approach is to use fixed-point values rather than floating point values, but this would constrain your precision. This would make calculations across tile boundaries much easier.
If you could expand on what the top-level requirements of your coordinate system are, there are probably better algorithms available to you.