Hashing pairs of integers to array indices? - c++

I have a set of sequential, 0-based integers representing vertex indices in a mesh.
Every vertex is connected to at least 2 other vertices, to form edges of the mesh.
Edges are represented by pairs of vertices. So, for example, (0, 2) might be one edge between vertex 0 and 2.
Currently, in order to quickly lookup edges in my mesh, I store my Edge class in a std::unordered_map, and generate hashes as follows:
//sorted so (0, 2) and (2, 0) will return same hash
__int64 GetEdgeHash (int vertex1, int vertex2)
{
return (__int64)min(vertex1, vertex2) * INT_MAX + max(vertex1, vertex2);
}
However, an unordered_map has enough overhead during creation and lookup that it has a noticeable performance impact elsewhere in my code. I'm wondering if there's a way to hash pairs of integer such that each pair corresponds to some index an array whose size is <= numVertices * 2 (since the number of edges in a mesh could never exceed that value). If that were possible, I could just use a normal std::vector to store my edges and processing them would be much faster.
Obviously that's not currently possible since my hash function will return values anywhere from 0 to 4611686016279904256.
A naive approach like:
int GetEdgeHash (int vertex1, int vertex2)
{
    return vertex1 + vertex2;
}
would satisfy the array size limitation, but obviously results in many collisions.
Is there another way to achieve the same goal?

A very simple solution could be based on your initial approach, but using the number of existing vertices instead of INT_MAX:
uint64_t numberOfVertices;
uint64_t index(uint32_t vertex1, uint32_t vertex2)
{
return vertex1 * numberOfVertices + vertex2;
}
This algorithm is collision free, but it requires a vector whose size is the square of numberOfVertices; as is, it is only applicable if you have a fixed (or at least a maximum) number of vertices.
If the number of vertices might grow beyond the maximum, you could e.g. double the maximum each time it is exceeded. That, however, requires re-"hashing" all of the entries already stored in the vector, i.e. it is an expensive operation and should occur as rarely as possible (doubling the maximum might already make it rare enough...).
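A minimal sketch of what that could look like; the EdgeTable wrapper and the growth policy are illustrative additions of mine, not taken from the question (Edge is the question's class, only forward-declared here):
#include <cstdint>
#include <utility>
#include <vector>

struct Edge; // the question's Edge class; only a declaration is needed here

struct EdgeTable {
    uint64_t maxVertices;       // current assumed maximum number of vertices
    std::vector<Edge*> slots;   // one slot per (min, max) vertex pair

    explicit EdgeTable(uint64_t maxV)
        : maxVertices(maxV), slots(maxV * maxV, nullptr) {}

    uint64_t index(uint32_t v1, uint32_t v2) const {
        return uint64_t(v1) * maxVertices + v2;   // the collision-free index above
    }

    // Double the maximum and re-"hash" every stored edge into its new slot.
    void grow() {
        EdgeTable bigger(maxVertices * 2);
        for (uint64_t v1 = 0; v1 < maxVertices; ++v1)
            for (uint64_t v2 = 0; v2 < maxVertices; ++v2)
                if (Edge* e = slots[v1 * maxVertices + v2])
                    bigger.slots[bigger.index(uint32_t(v1), uint32_t(v2))] = e;
        *this = std::move(bigger);
    }
};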

Related

Speed up Iteration Over Neighbors in a Graph

I have a static graph (the topology does not change over time and is known at compile time) where each node in the graph can have one of three states. I then simulate a dynamic where a node has a probability of changing its state over time, and this probability depends on the state of its neighbors. As the graph grows larger the simulations start getting very slow, but after some profiling, I identified that most of the computation time was spent iterating over the list of neighbors.
I was able to improve the speed of the simulations by changing the data structure used to access neighbors in the graph but was wondering if there are better (faster) ways to do it.
My current implementation goes like this:
For a graph with N nodes labeled from 0 to N-1 and average number of neighbors of K, I store each state as an integer in an std::vector<int> states and the number of neighbors for each node in std::vector<int> number_of_neighbors.
To store neighbors information I created two more vectors: std::vector<int> neighbor_lists which stores, in order, the nodes that are neighbors of node 0, node 1, ..., node N-1, and an index vector std::vector<int> index which stores, for each node, the index of its first neighbor in neighbor_lists.
So I have four vectors in total:
printf( "%zu\n", states.size() );              // N
printf( "%zu\n", number_of_neighbors.size() ); // N
printf( "%zu\n", neighbor_lists.size() );      // N * K
printf( "%zu\n", index.size() );               // N
When updating node i I access its neighbors like so:
// access neighbors of node i:
for ( int s = 0; s < number_of_neighbors[i]; s++ ) {
    int neighbor_node = neighbor_lists[index[i] + s];
    int state_of_neighbor = states[neighbor_node];
    // use neighbor state for stuff...
}
To sum up my question then: is there a faster implementation for accessing neighboring nodes in a fixed graph structure?
Currently, I've gone up to N = 5000 in a decent amount of simulation time, but I was aiming for N ~ 15,000 if at all possible.
It's important to know the order of magnitude of N because, if it isn't too high, you can use the fact that the topology is known at compile time to put the data in std::arrays of known dimensions (instead of std::vectors), using the smallest possible types to save stack memory (if necessary), and to define some of them as constexpr (all but states).
So, if N isn't too big (stack limit!), you can define
states as a std::array<std::uint_fast8_t, N> (8 bits are enough for 3 states)
number_of_neighbors as a constexpr std::array<std::uint_fast8_t, N> (if the maximum number of neighbors is less than 256; a bigger type otherwise)
neighbor_lists as a constexpr std::array<std::uint_fast16_t, M> (where M is the known sum of the numbers of neighbors) if 16 bits are enough for N; a bigger type otherwise
index as a constexpr std::array<std::uint_fast16_t, N> if 16 bits are enough for M; a bigger type otherwise
I think (I hope) that with arrays of known dimensions that are constexpr (when possible), the compiler can generate faster code.
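For illustration only, here is what the declarations might look like for a tiny made-up 4-node ring topology (the values of N, M and the neighbour data are mine; the real ones would come from the known topology):
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 4; // number of nodes (illustrative)
constexpr std::size_t M = 8; // sum of the numbers of neighbours (illustrative)

// the only data that changes during the simulation
std::array<std::uint_fast8_t, N> states{};

// fixed topology, known at compile time
constexpr std::array<std::uint_fast8_t, N>  number_of_neighbors{ 2, 2, 2, 2 };
constexpr std::array<std::uint_fast16_t, M> neighbor_lists{ 1, 3, 0, 2, 1, 3, 2, 0 };
constexpr std::array<std::uint_fast16_t, N> index{ 0, 2, 4, 6 };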
Regarding the updating code... I'm an old C programmer, so I'm used to trying to hand-optimize code in ways that modern compilers already do better; I don't know if the following code is a good idea, but anyway, I would write it like this:
auto first = index[i];
auto top = first + number_of_neighbors[i];
for ( auto s = first; s < top; ++s ) {
    auto neighbor_node = neighbor_lists[s];
    auto state_of_neighbor = states[neighbor_node];
    // use neighbor state for stuff...
}
-- EDIT --
The OP specifies that
Currently, I've gone up to N = 5000 in a decent amount of simulation time, but I was aiming for N ~ 15,000 if at all possible.
So 16 bits should be enough -- for the element type of neighbor_lists and of index -- and
states and number_of_neighbors are about 15 kB each (30 kB each if a 16-bit type were used)
index is about 30 kB.
These seem to me to be reasonable sizes for stack variables.
The problem could be neighbor_lists: if the average number of neighbours is low, say 10 to fix a number, then M (the sum of the numbers of neighbours) is about 150,000, so neighbor_lists is about 300 kB; not small, but reasonable in some environments.
If the average number is high -- say 100, to fix another number -- neighbor_lists becomes about 3 MB, which may be too high in some environments.
Currently you are accessing sum(K) nodes for each iteration. That doesn't sound so bad... until you consider the cache.
For fewer than 2^16 nodes you only need a uint16_t to identify a node, but with K neighbours per node you will need a uint32_t to index the neighbour list.
The 3 states can, as already mentioned, be stored in 2 bits.
So having
// offsets into nbList, N elements, 16K*4 bytes = 64KB
// (element i is really where node i+1's neighbours start, since node 0 starts at zero)
std::vector<uint32_t> nbOffset;
// states of your nodes, N elements, 16K*1 byte = 16KB
std::vector<uint8_t> states;
// list of all neighbour relations,
// sum(K) > 2^16, sum(K) elements, sum(K)*2 bytes (e.g. for average K=16, 16K*16*2 bytes = 512KB)
std::vector<uint16_t> nbList;
Your code:
// access neighbors of node i:
for ( int s = 0; s < number_of_neighbors[i]; s++ ) {
    int neighbor_node = neighbor_lists[index[i] + s];
    int state_of_neighbor = states[neighbor_node];
    // use neighbor state for stuff...
}
rewriting your code to
uint32_t curNb = 0;
for (auto curOffset : nbOffset) {
    // the node index is implicitly the position of curOffset within nbOffset
    for (; curNb < curOffset; curNb++) {
        int neighbor_node = nbList[curNb]; // done away with one indirection.
        int state_of_neighbor = states[neighbor_node];
        // use neighbor state for stuff...
    }
}
So to update one node you need to read its current state from states, read the offset range from nbOffset, use it to index into the neighbour list nbList, and use the indices from nbList to look up the neighbours' states in states.
The first two will most likely already be in L1$ if you run linearly through the list. Reading the first value from nbList for each node might be in L1$ if you process the nodes linearly; otherwise it will most likely cause an L1$ miss and likely an L2$ miss, while the following reads would be hardware prefetched.
Reading linearly through the nodes has the added advantage that each neighbour list will only be read once per iteration over the node set, and therefore the likelihood that states stays in L1$ increases dramatically.
Decreasing the size of states could further improve the chance that it stays in L1$: with a little bit manipulation you can store 4 states of 2 bits each in every byte, reducing the size of states to 4KB. So depending on how much "stuff" you do, you could have a very low cache miss rate.
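A sketch of that 2-bits-per-state packing (the function names and layout are mine, not from the answer):
#include <cstdint>
#include <vector>

const uint32_t N = 16000;                       // number of nodes (16K, as in the answer)
std::vector<uint8_t> packedStates((N + 3) / 4); // four 2-bit states per byte -> about 4KB

uint8_t getState(uint32_t node) {
    return (packedStates[node >> 2] >> ((node & 3u) * 2)) & 3u;
}

void setState(uint32_t node, uint8_t state) {   // state must be 0..3
    const unsigned shift = (node & 3u) * 2;
    uint8_t& b = packedStates[node >> 2];
    b = uint8_t((b & ~(3u << shift)) | (uint32_t(state & 3u) << shift));
}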
But if you jump around in the nodes and do "stuff", the situation quickly gets worse, inducing a nearly guaranteed L2$ miss for nbList and potential L1$ misses for the current node and the K lookups into states. This could lead to slowdowns by a factor of 10 to 50.
If you're in the latter scenario with random access, you should consider storing an extra copy of each state in the neighbour list, saving the cost of accessing states K times. You have to measure whether this is faster.
Regarding in-lining the data in the program: you would gain a little from not having to go through the vectors, but I would estimate that gain at less than 1%.
If you in-line and constexpr aggressively enough, your compiler will boil your computer for years and reply "42" as the final result of the program. You have to find a middle ground.

create pairs of vertices using adjacency list in linear time

I have n vertices numbered 1...n and want to pair every vertex with every other vertex. That results in n*(n-1)/2 edges. Each vertex has some strength. The difference between the strengths of two vertices is the weight of the edge, and I need the total weight over all edges. Using two loops I can do this in O(n^2) time, but I want to reduce that. I could use an adjacency list to build a graph of n*(n-1)/2 edges, but how would I create the adjacency list without using two loops? The input consists only of the number of vertices and the strength of each vertex.
for (int i = 0; i < n; i++)
    for (int j = i + 1; j < n; j++)
    {
        int w = abs(strength[i] - strength[j]);
        sum += w;
    }
This is what I did earlier. I need a better way to do this.
If there are O(N*N) edges, then you can't list them all in linear time.
However, if indeed all you need is to compute the sum, here's a solution in O(N*log(N)). You could improve it further by instead using an O(N) sorting algorithm, such as radix sort.
#include <algorithm>
#include <cstdint>
// ...
std::sort(strength, strength + n);
uint64_t sum = 0;
int64_t runSum = strength[0];
for (int i = 1; i < n; i++) {
    sum += int64_t(i) * strength[i] - runSum;
    runSum += strength[i];
}
// Now "sum" contains the sum of weights over all edges
To explain the algorithm:
The idea is to avoid summing over all edges explicitly (which would require O(N*N) work), and instead to add the sums of several weights at once. Consider the last vertex n-1 and the average A[n-1] = (strength[0] + strength[1] + ... + strength[n-2])/(n-1): we could add (strength[n-1] - A[n-1]) * (n-1), i.e. n-1 weights at once, if strength[n-1] were larger than all the other strengths (and similarly with the sign flipped if it were smaller than all of them). In general, though, because of the abs operation we would have to add different amounts depending on whether the other vertex's strength is larger or smaller than the current vertex's. So the solution is to sort the strengths first, which ensures that each strength is greater than or equal to all the previous ones.
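As a small worked example: with strengths {5, 1, 3}, the brute-force sum is |5-1| + |5-3| + |1-3| = 4 + 2 + 2 = 8. After sorting to {1, 3, 5}, the loop adds 1*3 - 1 = 2 for the second element and 2*5 - (1+3) = 6 for the third, giving the same total of 8.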

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value greater than or equal to 0, and this value represents the likelihood of the cell being chosen. For example, a cell with a value of 3 has three times the probability of being chosen compared to a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So it is good when N is large (or even 'medium'), but it uses a lot more memory, and in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me whether you assume they are normalized or not.) That means: sum all the entries and divide each entry by that sum. (This part is potentially slow, so it's better if you assume or require that it has already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i][j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output is equal to the probability that [i][j] was selected (which is the same for every entry) times the probability that the number p was small enough, and the latter is proportional to the value matrix[i][j]; so the sampling chooses each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. Suppose we arbitrarily choose a number k and consider the distribution of the algorithm conditioned on stopping after exactly k rounds. No matter which k we choose, that conditional distribution has to be exactly right by the above argument: once we eliminate the case that p is too small, the remaining possibilities all keep their correct proportions. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect as well.
If you want to analyze the number of rounds typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 in any particular round. Since the rounds are independent, this probability is the same for every round, which means the number of rounds is geometrically distributed. That distribution is concentrated around its mean, and we can determine the mean from that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So, the expected number of rounds is about n^2 (that is, n^2 up to a constant factor) no matter what the entries in the matrix are. You can't hope to do a lot better than that I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
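Method A is only described in prose above, so here is a minimal C++ sketch of it (the function name, the RNG setup, and the assumption that the matrix is square and already normalized are mine):
#include <random>
#include <utility>
#include <vector>

// Rejection sampling: returns one cell [i, j] with probability matrix[i][j].
// Assumes matrix is n x n and its entries sum to 1.
std::pair<int, int> sampleOneCell(const std::vector<std::vector<double>>& matrix,
                                  std::mt19937& gen)
{
    const int n = static_cast<int>(matrix.size());
    std::uniform_int_distribution<int> pick(0, n - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    for (;;) {
        const int i = pick(gen);      // step 1: uniformly random cell
        const int j = pick(gen);
        const double p = unit(gen);   // step 2: uniformly random real in [0, 1]
        if (matrix[i][j] > p)         // step 3: accept with probability matrix[i][j]
            return { i, j };
        // otherwise reject and try again
    }
}
Calling it N times gives the N samples, as noted above.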
Method B:
Basically you just want to compute a cumulative histogram and sample from it by inverse lookup, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make histogram (a cumulative distribution keyed by the running sum)
// (requires <map>, <vector>, <utility>, <cstdlib>)
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;
histogram_type histogram;
double cumulative = 0.0;
for (uint i = 0; i < Matrix.size(); ++i) {
    for (uint j = 0; j < Matrix[i].size(); ++j) {
        cumulative += Matrix[i][j];
        histogram[cumulative] = std::make_pair(i, j);
    }
}
std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
    // Do a sample (this should never repeat... if it does not find a lower bound you could
    // also assert false quite reasonably, since it means something is wrong with the RNG)
    while (1) {
        // Sample a real in [0, cumulative]. For best results use std::mt19937 (or boost::mt19937)
        // with a real distribution over [0, 1] here instead of rand().
        double p = cumulative * (rand() / double(RAND_MAX));
        histogram_type::iterator it = histogram.lower_bound(p);
        if (it != histogram.end()) {
            result.push_back(it->second);
            break;
        }
    }
}
return result;
Here the time to make the histogram is something like number of cells * O(log number of cells) since inserting into the map takes time O(log n). You need an ordered data structure in order to get cheap lookup N * O(log number of cells) later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As @Bob__ points out in the comments, in method (B) as written there is potentially going to be some error due to floating-point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that if cumulative becomes much larger than Matrix[i][j], beyond what the floating-point precision can handle, then each time this statement is executed you may lose a little precision, and these errors accumulate into significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first, smallest first. You could even do this in the general implementation to be safe -- sorting them doesn't take more time asymptotically than you are already spending.
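A sketch of that fix, reusing the typedefs from the snippet above (the intermediate cells vector and the ascending-order accumulation are my additions; it also needs <algorithm> for std::sort):
// Collect (value, cell) pairs, sort them by value, then accumulate smallest-first,
// which keeps the floating-point round-off error down.
std::vector<std::pair<double, upair>> cells;
for (uint i = 0; i < Matrix.size(); ++i)
    for (uint j = 0; j < Matrix[i].size(); ++j)
        cells.push_back(std::make_pair(Matrix[i][j], std::make_pair(i, j)));
std::sort(cells.begin(), cells.end()); // pairs sort by .first (the value) first

histogram_type histogram;
double cumulative = 0.0;
for (std::size_t k = 0; k < cells.size(); ++k) {
    cumulative += cells[k].first;
    histogram[cumulative] = cells[k].second;
}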

Use map instead of array in C++ to protect searching outside of array bounds?

I have a gridded rectangular file that I have read into an array. This gridded file contains data values and NODATA values; the data values make up a continuous odd shape inside of the array, with NODATA values filling in the rest to keep the gridded file rectangular. I perform operations on the data values and skip the NODATA values.
The operations I perform on the data values consist of examining the 8 surrounding neighbors (the current cell is the center of a 3x3 grid). I can handle when any of the eight neighbors are NODATA values, but when actual data values fall in the first or last row/column, I trigger an error by trying to access an array value that doesn't exist.
To get around this I have considered three options:
Add a new first and last row/column filled with NODATA values, and adjust my code accordingly: I can cycle through the internal 'original' array and handle the added NODATA values just like the NODATA edges I'm already handling that don't fall in the first or last row/column.
I can create specific processes for handling the cells in the first and last row/column that have data: modified for loops (loops that step through a specific sequence/range) that only examine the surrounding cells that exist. Since I still need 8 neighboring values (NODATA/non-existent cells are given the same value as the central cell), I would have to copy blank/NODATA values into a secondary 3x3 grid, though there may be a way to avoid the secondary grid. This solution is annoying because I have to code up specialized routines for all the corner cells (4 different for loops) and for any other cell in the first or last row/column (another 4 different for loops), plus a single for loop for any non-edge cell.
Use a map, which, based on my reading, appears capable of storing the original array while letting me look up locations outside the array without triggering an error. In this case, I still have to give these non-existent cells a value (equal to the center of the array), and so I may or may not have to set up a secondary 3x3 grid as well; once again there may be a way to avoid the secondary grid.
Solution 1 seems the simplest, solution 3 the most clever, and 2 the most annoying. Are there any solutions I'm missing? Or does one of these solutions deserve to be the clear winner?
My advice is to replace all read accesses to the array by a function. For example, arr[i][j] by getarr(i,j). That way, all your algorithmic code stays more or less unchanged and you can easily return NODATA for indices outside bounds.
But I must admit that it is only my opinion.
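A minimal sketch of that wrapper; NROWS, NCOLS, arr and NODATA are placeholders for whatever the question's code actually defines:
// Placeholders standing in for the real grid definition (hypothetical values):
const int NROWS = 100, NCOLS = 100;
const double NODATA = -9999.0;
double arr[NROWS][NCOLS];

// Bounds-checked read access: indices outside the grid simply read as NODATA.
double getarr(int i, int j)
{
    if (i < 0 || i >= NROWS || j < 0 || j >= NCOLS)
        return NODATA;
    return arr[i][j];
}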
I've had to do this before and the fastest solution was to expand the region with NODATA values and iterate over the interior. This way the core loop is simple for the compiler to optimize.
If this is not a computational hot-spot in the code, I'd go with Serge's approach instead though.
To minimize rippling effects I used an array structure with explicit row/column strides, something like this:
#include <memory>
#include <vector>

class Grid {
private:
    std::shared_ptr<std::vector<double>> data;
    int origin;
    int xStride;
    int yStride;
public:
    Grid(int nx, int ny) :
        data( new std::vector<double>(nx*ny) ),
        origin(0),
        xStride(1),
        yStride(nx) {
    }
    Grid(int nx, int ny, int padx, int pady) :
        data( new std::vector<double>((nx+2*padx)*(ny+2*pady)) ),
        origin(padx + pady*(nx+2*padx)),   // offset of logical (0,0) inside the padded storage
        xStride(1),
        yStride(nx+2*padx) {
    }
    double& operator()(int x, int y) {
        return (*data)[origin + x*xStride + y*yStride];
    }
};
Now you can do
Grid g(5,5,1,1);
Grid g2(5,5);
//Initialise
for (int i = 0; i < 5; ++i) {
    for (int j = 0; j < 5; ++j) {
        g(i,j) = i + j;
    }
}
// Convolve (note we don't care about going outside the
// range, and our indices are unchanged between the two
// grids).
for (int i = 0; i < 5; ++i) {
    for (int j = 0; j < 5; ++j) {
        g2(i,j) = 0;
        g2(i,j) += g(i-1,j);
        g2(i,j) += g(i+1,j);
        g2(i,j) += g(i,j-1);
        g2(i,j) += g(i,j+1);
    }
}
Aside: This data structure is awesome for working with transposes, and sub-matrices. Each of those is just an adjustment of the offset and stride values.
Solution 1 is the standard solution. It takes maximum advantage of modern computer architectures, where a few bytes of memory are no big deal and branches are cheapest when they are predictable. Since you keep accessing memory in a predictable pattern (with fixed strides), the CPU prefetcher will successfully read ahead.
Solution 2 saves a small amount of memory, but the special handling of the edges incurs a real slowdown. Still, the large chunk in the middle benefits from the prefetcher.
Solution 3 is horrible. Map access is O(log N) instead of O(1), and in practice it can be 10-20 times slower. Maps have poor locality of reference; the CPU prefetcher will not kick in.
If simple means "easy to read" I'd recommend you declare a class with an overloaded [] operator. Use it like a regular array but it'll have bounds checking to handle NODATA.
If simple means "high performance" and you have sparse grid with isolated DATA consider implementing linked lists to the DATA values and implement optimal operators that go directly to tge DATA values.
1 wastes memory proportional to your overall rectangle size, 3/maps are clumsy here, 2 is actually very easy to do:
T d[X][Y] = ...;
for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y) // move over d[x][y] centres
    {
        // start with all nine cells holding the centre value...
        T r[3][3] = { { d[x][y], d[x][y], d[x][y] },
                      { d[x][y], d[x][y], d[x][y] },
                      { d[x][y], d[x][y], d[x][y] } };
        // ...then copy in whichever in-bounds neighbours actually hold data
        for (int i = std::max(0, x-1); i <= std::min(X-1, x+1); ++i)
            for (int j = std::max(0, y-1); j <= std::min(Y-1, y+1); ++j)
                if (d[i][j] != NoData)
                    r[i-x+1][j-y+1] = d[i][j];
        // use r for whatever...
    }
Note that I'm using signed ints very deliberately, so x-1 and y-1 don't become huge positive numbers (as they would with, say, size_t) and break the std::max/std::min clamping... but you could express it differently if you had some reason to prefer size_t (e.g. x == 0 ? 0 : x - 1).

iterating through TWO sparse matrices

I'm using boost sparse matrices holding bools and trying to write a comparison function for storing them in a map. It is a very simple comparison function. Basically, the idea is to look at the matrix as a binary number (after being flattened into a vector) and to sort based on the value of that number. This can be accomplished in this way:
for (unsigned int j = 0; j < maxJ; j++)
{
    for (unsigned int i = 0; i < maxI; i++)
    {
        if (matrix1(i,j) < matrix2(i,j)) return true;
        else if (matrix1(i,j) > matrix2(i,j)) return false;
    }
}
return false;
However, this is inefficient because of the sparseness of the matrices and I'd like to use iterators for the same result. The algorithm using iterators seems straightforward, i.e.
1) grab the first nonzero cell in each matrix, 2) compare j*maxJ+i for both, 3) if equal, grab the next nonzero cells in each matrix and repeat. Unfortunately, in code this is extremely tedious and I'm worried about errors.
What I'm wondering is (a) is there a better way to do this and (b) is there a simple way to get the "next nonzero cell" for both matrices? Obviously, I can't use nested for loops like one would to iterate through one sparse matrix.
Thanks for your help.
--
Since it seems that the algorithm I proposed above may be the best solution in my particular application, I figured I should post the code I developed for the tricky part, getting the next nonzero cells in the two sparse matrices. This code is not ideal and not very clear, but I'm not sure how to improve it. If anyone spots a bug or knows how to improve it, I would appreciate some comments. Otherwise, I hope this is useful to someone else.
typedef boost::numeric::ublas::mapped_matrix<bool>::const_iterator1 iter1;
typedef boost::numeric::ublas::mapped_matrix<bool>::const_iterator2 iter2;
// Grabs the next nonzero cell in a sparse matrix after the cell pointed to by i1, i2.
std::pair<iter1, iter2> next_cell(iter1 i1, iter2 i2, iter1 end) const
{
    if (i1 == end)                  // already past the last row: nothing left to return
        return std::pair<iter1, iter2>(i1, i2);
    if (i2 == i1.end())             // at the end of the current row: move to the next row
    {
        ++i1;
        if (i1 != end) i2 = i1.begin();
    }
    else                            // otherwise just step forward within the current row
    {
        ++i2;
    }
    // skip over empty rows until a nonzero cell is found (or we run out of rows)
    while (i1 != end)
    {
        if (i2 != i1.end())
            return std::pair<iter1, iter2>(i1, i2);
        ++i1;
        if (i1 != end) i2 = i1.begin();
    }
    return std::pair<iter1, iter2>(i1, i2);
}
I like this question, by the way.
Let me pseudocode out what I think you're asking
declare list of sparse matrices ListA
declare map MatMap mapping a sparse matrix type to a double, along with a
`StrictWeakMatrixOrderer` function which takes two sparse matrices.
Insert ListA into MatMap.
The Question: How do I write a StrictWeakMatrixOrderer efficiently?
This is an approach. I'm inventing this on the fly....
Define a function flatten() and precompute the flattened matrices, storing the flattened vectors in a vector (or another container with random access). flatten() could be as simple as concatenating each row (or column) with the previous one (which can be done in linear time if you have a constant-time way to grab a row/column).
This yields a set of vectors with size on the order of 10^6. This is a tradeoff: you store this information instead of computing it on the fly, which is useful if you're going to be doing a lot of compares as you go along.
Remember, zeros contain information -- dropping them could yield two vectors equal to each other even though their generating matrices are not equal.
Then, we have transformed the algorithm question from "order matrices" into "order vectors".
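flatten() is only sketched in words above; a minimal version for the uBLAS matrices in the question might look like this (flattening in the same column-major order as the question's comparison loop, and deliberately keeping the zeros, per the note above):
#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstddef>
#include <vector>

std::vector<bool> flatten(const boost::numeric::ublas::mapped_matrix<bool>& m)
{
    std::vector<bool> v;
    v.reserve(m.size1() * m.size2());
    for (std::size_t j = 0; j < m.size2(); ++j)     // column-major, like the question's loops
        for (std::size_t i = 0; i < m.size1(); ++i)
            v.push_back(m(i, j));                   // zeros are kept on purpose
    return v;
}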
I've never heard of a distance metric for matrices, but I've heard of distance metrics for vectors.
You could use a "sum of differences" metric, aka the Hamming distance (for each element that differs, add 1). Over two flattened vectors a and b of equal size, that is an O(n) algorithm:
int distance = 0;
for (std::size_t i = 0; i < a.size(); ++i)
    if (a[i] != b[i])
        ++distance;
return distance;
The Hamming distance satisfies these conditions
d(a,b) = d(b,a)
d(a,a) = 0
d(x, z) <= d(x, y) + d(y, z)
Now to do some off-the-cuff analysis....
10^6 elements in a matrix (or its corresponding vector).
O(n) distance metric.
But that's O(n) element compares. If each element access itself took O(n) time, you would end up with an O(n*n) = O(n^2) metric, so you need sub-linear (ideally constant-time) element access. It turns out that std::vector's [] operator provides constant-time access to arbitrary elements (per the SGI STL documentation).
Provided you have sufficient memory to store k*2*10^6 elements, where k is the number of matrices you are managing, this is a working solution that uses lots of memory in exchange for being linear.
(a) I don't fully understand what you're trying to accomplish, but if you want to check whether both matrices have the same value at the same index, it's sufficient to use elementwise matrix multiplication (which should be implemented for sparse matrices as well):
matrix3 = element_prod (matrix1, matrix2);
That way you'll get for each index:
0 (false) * 1 (true) = 0 (false)
0*0 = 0
1*1 = 1
So resulting matrix3 will have your solution in one line :)
It seems to me we're talking about implementing bitwise, elementwise operators on boost sparse matrices, since comparing whether one vector (or matrix) is smaller than another without using any standard vector norm demands special operators (or special mappings/norms).
To my knowledge Boost does not provide special operators for binary matrices (not to speak of sparse binary matrices). There are unlikely to be any straightforward solutions to this using BLAS-level matrix/vector algebra. Binary matrices have their own place in linear algebra, so there are tricks and theorems, but I doubt they are easier than your solution.
Your question could be reformulated as: how do I efficiently sort astronomically large numbers represented by 2D bitmaps (n = 100, i.e. 100x100 elements, gives a number on the order of 2^10000)?
Good question!