Memory Efficiency - Eigen::VectorXd in a loop

Memory Efficiency - Eigen::VectorXd in a loop - c++

I have a Measurement object that has two Eigen::VectorXd members -- one for position and the other velocity.
Measurements are arranged in a dataset by scans -- i.e., at each timestep, a new scan of measurements is added to the dataset. These types are defined as:
typedef std::shared_ptr<Measurement> MeasurementPtr;
typedef std::vector<MeasurementPtr> scan_t;
typedef std::vector<scan_t> dataset_t;
At the beginning of each iteration of my algorithm, I need to apply a new transformation to each measurement. Currently, I have:
for (auto scan = dataset_.begin(); scan != dataset_.end(); ++scan)
for (auto meas = scan->begin(); meas != scan->end(); ++meas) {
// Transform this measurement to bring it into the same
// coordinate frame as the current scan
if (scan != std::prev(dataset_.end())) {
core::utils::perspective_transform(T_, (*meas)->pos);
core::utils::perspective_transform(T_, (*meas)->vel);
}
}
Where perspective_transform is defined as
void perspective_transform(const Eigen::Projective2d& T, Eigen::VectorXd& pos) {
pos = (T*pos.homogeneous()).hnormalized();
}
Adding this code increases computation time by 40x when I run the algorithm with scans in the dataset with 50 measurements in each scan -- making it rather slow. I believe this is because I have 550 small objects, each with 2 Eigen memory writes. I removed the writing of the result to memory and my benchmark shows only a slight decrease -- suggesting that this is a memory-efficiency problem and not a computation bottleneck.
How can I speed up this computation? Is there a way to first loop through and create an Eigen::Matrix from Eigen::Map that I could then do the computation once and have it automatically update the two members of the all the Measurement objects?

You might want to rework your data-structures.
Currently you have an array-of-struct (AOS), with a number of indirections.
A structure-of-arrays (SOA) is generally more efficient in memory access.
What about ?:
struct Scant_t
{
Eigen::MatrixXd position;
Eigen::MatrixXd velocity;
}
the .rowwise() and .colwise() operators might be powerfull enough to do the homogeneous transform, which would save you writing the inner loop.

Related

Quick access to a cell in 2D matrix which wraps around

I have a matrix which wraps around.
m_matrixOffset points to first cell(0, 0) of the wrapped around matrix. So to access a cell we have below function GetCellInMatrix .Logic to wrap around(in while loop) is executed each time someone access a cell. This is executed thousands of time in a second. Is there any way to optimize this using some lookup or someother way. MAX_ROWS and MAX_COLS may not be power of 2.
struct Cell
{
Int rowId;
Int colId;
}
int matData[MAX_ROWS][MAX_COLS];
int GetCellInMatrix(const Cell& cellIndex)
{
Cell newCellIndex = cellIndex + m_matrixOffset ;
while (newCellIndex.rowId > MAX_ROWS)
{
newCellIndex.rowId -= MAX_ROWS;
}
while (newCellIndex.colId > MX_COLS)
{
newCellIndex.y -= MAX_COLS;
}
return data[newCellIndex.rowId][newCellIndex.colId];
}

You might be interested in the concept of division with remainder, usually implemented as a % b for the remainder.
Thus
return data[newCellIndex.rowId % MAX_ROWS][newCellIndex.colId % MAX_COLS];
does not need the while loops before it.
As per comment, the implied integer division in the remainder computation is too costly if done at each query. Assuming that m_matrixOffset is constant over a large number of queries, reduce its coordinates once using the remainder operations. Then the newCellIndex are less than twice the maximum, thus need only to be reduced at most once. Thus it is safe to replace while with if, sparing one comparison.
If you can sacrifice memory for space, then double the matrix dimensions and fill the excess entries with the repeated matrix elements. You have to make sure this pattern holds when updating the matrix.
Then, again assuming that both m_matrixOffset and CellIndex are inside the maxima for rows and columns, you can access the cell of the extended matrix without any further reduction. This would be a variant on the "lookup table" idea.
Or use real lookup tables, but you then execute 3 array cell lookups like in
return data[repeatedRowIndex[newCellIndex.rowId]][repeatedColIndex[newCellIndex.colId]];

It depends if the wrap is small or large in relation to the matrix.
The most common case is that all you need is the nearest neighbour. So make the matrix N+2 by M+2 and duplicate the wrap. That makes reads fast but writes a bit fiddly (often a good trade-off).
If that's no good, specialise the functions. Work out which cells are edge cells and handle the specially (you must be able to do this cheaper than simply hard-coding the logic into the access, of course, if only one or two cells change every pass that will hold, not if you generate a random list every pass).

Why would this search method not be scalable?

I want to parallel my search algorithm using openMP, vTree is a binary search tree, and I want to apply my search algorithm for each of the point set. below is a snippet of my code. the search procedure for two points is totally irrelevant and so can be parallel. though they do need to read a same tree, but once constructed, the tree wouldn't be modified any more. thus it is read-only.
However, the code below shows terrible scalability, on my 32-core platform, only 2x speed up is achieved. is it because that vTree is read by all threads? if so, how can I further optimize the code?
auto results = vector<vector<Point>>(particleNum);
auto t3 = high_resolution_clock::now();
double radius = 1.6;
#pragma omp parallel for
for (decltype(points.size()) i = 0; i < points.size(); i++)
{
vTree.search(points[i], radius, results[i]);
}
auto t4 = high_resolution_clock::now();
double searchTime = duration_cast<duration<double>>(t4 - t3).count();
the type signature for search is
void VPTree::search(const Point& p, double radius, vector<Point>& result) const
search result would be put into result.

My best guess would be that you are cache ping-pong'ing on the result vectors. I would assume that your "search" function uses the passed-in result vector as a place to put points and that you use it throughout the algorithm to insert neighbors as you encounter them in the search tree. Whenever you add a point to that result vector, the internal data of that vector object will be modified. And because all of your result vectors are packed together in contiguous memory, it is likely that different result vectors occupy the same cache lines. So, when the CPU maintains cache coherence, it will constantly lock the relevant cache lines.
The way to solve it is to use an internal, temporary vector that you only assign to the results vector once at the end (which can be done cheaply if you use move semantics). Something like this:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const {
vector<Point> tmp_result;
// ... add results to "tmp_result"
result = std::move(tmp_result);
return;
}
Or, you could also just return the vector by value (which is implicitly using a move):
vector<Point> VPTree::search(const Point& p, double radius) const {
vector<Point> result;
// ... add results to "result"
return result;
}
Welcome to the joyful world of move-semantics and how awesome it can be at solving these types of concurrency / cache-coherence issues.
It is also conceivable that you're experiencing problems related to accessing the same tree from all threads, but since it's all read-only operations, I'm pretty sure that even on a conservative architecture like x86 (and other Intel / AMD CPUs) that this should not pose a significant issue, but I might be wrong (maybe a kind of "over-subscription" problem might be at play, but it's dubious). And other problems might include the fact that OpenMP does incur quite a bit of overhead (spawning threads, synchronizing, etc.) which has to be weighted against the computational cost of the actual operations you are doing within those parallel loops (and it's not always a favorable trade-off). And also, if your VPTree (I imagine stands for "Vantage-point Tree") does not have good locality of references (e.g., you implemented it as a linked-tree), then the performance is going to be terrible whichever way you use it (as I explain here).

How not to mess up the cache when working on some long vectors in memory?

Premise
I want to do some kind of computation involving k long vectors of data (each of length n), which I receive in main memory, and write some kind of result back to main memory. For simplicity suppose also that the computation is merely
for(i = 0; i < n; i++)
v_out[i] = foo(v_1[i],v_2[i], ... ,v_k[i])
or perhaps
for(i = 0; i < n; i++)
v_out[i] = bar_k(...bar_2(bar_1(v_1[i]),v_2[i]), ... ),v_k[i])
(this is not code, it's pseudocode.) foo() and bar_i() functions have no side-effects. k is constant (known at compile-time), n is known only right before this computation occurs (and it is relatively big - at least a few times larger than the entire L2 cache size and maybe more).
Suppose I am on an single thread on a single core of an x86_64 processor (Intel, or AMD, or what-have-you; the choice does matter, probably). Finally, suppose foo() (resp. bar_i()) is not an intensive computation, i.e. the time to read data from memory and write it back is significant or even dominant relative to the n (resp. kxn) invocations of foo() (resp. bar_i()).
Question
How do I arrange this computation, so as to avoid:
The data from one input vector clearing out cached data for another vector.
Input vector data clearing out cached output vector data.
Intermediary results of bar_j(...bar_1(v_1[i])...) remaining in registers or L1 cache if there's enough capacity to hold them there until the data for v_{j+1}[i] ... v_k[i] arrives and allows us to complete the computation. Same for L2.
The L1 cache lines of the output vectors being cleared while we intend to continue working on elements in that cache line. Same for L2.
Under-utilization of memory bandwidth.
core idle time, to the extent possible.
Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).
Notes:
Any rearrangement of the input data counts towards the total computation time. The vectors will not be re-used in the alternate arrangement so it's basically a waste of time.
If it makes things easier for you to assume alignment, or lack of alignment, that's fine, just say so.
Computing with the bar_i functions allows more flexibility with the access patters, but creates additional challenges w.r.t. the caching of v_out values.

The data from one input vector clearing out cached data for another
vector.
If you are using v_1[i], v_2[i], ..., v_k[i] in the same function call, the input vectors will not clear out data cached for other vectors. For each element you read in a vector, the CPU will fetch just a cache line, not the entire vector. So if you read k elements, you will bring k cache lines from each vector.
Input vector data clearing out cached output vector data.
You have the same case as the above. That won't be the case.
Output vector data remaining in registers or L1 cache for as long as
we need it (in the bar case).
You could try to use _mm_prefetch intrinsics to fetch the data before writing it.
Under-utilization of memory bandwidth.
To do that you need to maximize the number of full width transactions. Basically you need that when the CPU fetches a cache line, all the elements are used right away. To do this you must rearrange your data. I would consider all the k vectors as a matrix of k x n elements, stored in a column major format.
type* pMat = (type*)aligned_alloc(CACHE_LINE_SIZE, n * k * sizeof(type));
v_0[i] = pMat[i * k + 0];
v_1[i] = pMat[i * k + 1];
// ...
v_k-1[i] = pMat[i * k + k-1];
That will put the v_0, ... v_k elements in SIMD registers and you might have the chance of better vectorization.
core idle time, to the extent possible.
Less cache misses, less transcedental instructions will lead to less idle time.
Write-read-write-read-write sequences on v_out (which might be very
expensive if those writes need to be updated into main memory; the
motivation here is that it might be tempting to read just one vector,
update the output and repeat).
You could decrease the price of the sequences using prefetching (_mm_prefetch).

To reduce the clearing out of cached data, you could rearrange your k vectors to 1 vector holding a structure with k members. That way, the loop will access those elments in sequence and not jump around in memory.
struct VectorData
{
Type1 Var1;
Type2 Var2;
// ...
TypeK VarK;
};
std::vector<VectorData> v_in;
for (i = 0; i < n; i++){
v_out[i] = foo(v_in[i].Var1, v_in[i].Var2, ... , v_in[i].VarK);
// Or just pass the whole element:
v_out[i] = foo(v_in[i]);
}

Calculate average using SSE with STL vectors

I'm trying to learn about vectorisation, and rather than reinvet the wheel I'm using Agner Fog's vector library
Here's my original C++/STL code
#include <vector>
#include <vectorclass.h>
template<typename T>
double mean_v1(T begin,T end) {
float mean = 0;
std::for_each(begin,end,[&mean](const double& d) { mean+=d; });
return mean / std::distance(begin,end);
}
double mean_v2(T begin,T end) {
float mean = 0;
const int distance = std::distance(begin,end); // This is expensive
const int loop = ( distance >> 2)+1; // divide by 4
const int partial = distance & 2; // remainder 4
Vec4d vec;
for(int i = 0; i < loop;++i) {
if(i == (loop-1)) {
vec.load_partial(partial,&*begin);
mean = horizontal_add(vec);
}
else {
vec.load(&*begin);
mean = horizontal_add(vec);
begin+=4; // This is expensive
}
}
return mean / distance;
}
int main(int argc,char**argv) {
using namespace boost::assign;
std::vector<float> numbers;
// Note 13 numbers, which won't fit into a sse register perfectly
numbers+=39.57,39.57,39.604,39.58,39.61,31.669,31.669,31.669,31.65,32.09,33.54,32.46,33.45;
const float mean1 = mean_v1(numbers.begin(),numbers.end());
const float mean2 = mean_v2(numbers.begin(),numbers.end());
return 0;
}
Both v1 and v2 work correctly and they both take about the same time. However profiling it shows the the std::distance() and moving the iterator along takes almost 45% of the total time. The vector adds is just 0.8% which is significantly faster than v1.
Searching the web, all the examples seem to deal with perfect number of values that fit precisely into the SSE registers. How do people deal with odd numbers of values eg for this example where setting up the loop is taking a lot longer than the calculation.
I'm thinking there must be best practices or ideas on how to deal with this scenario.
Assume I can't change the interface of mean() to take float[], but must use iterators

You're mixing float & double unnecessarily, especially as you don't let your accumulator be double your precision is totally destroyed and won't be close to satisfactory for larger series.
As the arithmetic is super light weight what's destroying your performance here is most likely memory access, read up on memory cache lines and how they work. Basically what you need to do here is probe ahead, some processors have explicit instructions for pulling stuff into your cache, otherwise you can perform a load at a memory location ahead of time. Create another level of nesting in your loop and at regular intervals prime the cache with data you know you will get to in a few iterations.
What people do to maximize performance is that they spend a lot of time actually designing their data layout. You shouldn't need to do an intermediate transformation on your data. So what people do is they allocate aligned memory ( most SIMD instruction sets either requires or imposes grave penalties for reading / writing to unaligned memory ), and then they try to aggregate data in such a way that it fits the instruction set. In fact it's often a win to pad your data up to whatever register size the instruction set supports. So if lets say you're going to process 3 dimensional vectors, padding with an extra element which is unused will almost always be a big win.

Help with algorithm for merging vectors

I need a very fast algorithm for the following task. I have already implemented several algorithms that complete it, but they're all too slow for the performance I need. It should be fast enough that the algorithm can be run at least 100,000 times a second on a modern CPU. It will be implemented in C++.
I am working with spans/ranges, a structure that has a start and an end coordinate on a line.
I have two vectors (dynamic arrays) of spans and I need to merge them. One vector is src and the other dst. The vectors are sorted by span start coordinates, and the spans do not overlap within one vector.
The spans in the src vector must be merged with the spans in the dst vector, such that the resulting vector is still sorted and has no overlaps. Ie. if overlaps are detected during the merging, the two spans are merged into one. (Merging two spans is just a matter of changing the coordinates in the structure.)
Now, there is one more catch, the spans in the src vector must be "widened" during the merge. This means that a constant will be added to the start and another (larger) constant to the end coordinate of every span in src. This means that after the src spans are widened they might overlap.
What I have arrived at so far is that it cannot be done fully in-place, some kind of temporary storage is needed. I think it should be doable in linear time over the number of elements of src and dst summed.
Any temporary storage can probably be shared between multiple runs of the algorithm.
The two primary approaches I have tried, which are too slow, are:
Append all elements of src to dst, widening each element before appending it. Then run an in-place sort. Finally iterate over the resulting vector using a "read" and "write" pointer, with the read pointer running ahead of the write pointer, merging spans as they go. When all elements have been merged (the read pointer reaches end) dst is truncated.
Create a temporary work-vector. Do a naive merge as described above by repeatedly picking the next element from either src or dst and merging into the work-vector. When done, copy the work-vector to dst, replacing it.
The first method has the problem that sorting is O((m+n)*log(m+n)) instead of O(m+n) and has somewhat overhead. It also means the dst vector has to grow much larger than it really needs.
The second has the primary problem of a lot of copying around and again allocation/deallocation of memory.
The data structures used for storing/managing the spans/vectors can be altered if you think that's needed.
Update: Forgot to say how large the datasets are. The most common cases are between 4 and 30 elements in either vector, and either dst is empty or there is a large amount of overlap between the spans in src and dst.

We know that the absolute best case runtime is O(m+n) this is due to the fact that you at least have to scan over all of the data in order to be able to merge the lists. Given this, your second method should give you that type of behavior.
Have you profiled your second method to find out what the bottlenecks are? It is quite possible that, depending on the amount of data you are talking about it is actually impossible to do what you want in the specified amount of time. One way to verify this is to do something simple like sum up all the start and end values of the spans in each vector in a loop, and time that. Basically here you are doing a minimal amount of work for each element in the vectors. This will provide you with a baseline for the best performance you can expect to get.
Beyond that you can avoid copying the vectors element by element by using the stl swap method, and you can preallocate the temp vector to a certain size in order to avoid triggering the expansion of the array when you are merging the elements.
You might consider using 2 vectors in your system and whenever you need to do a merge you merge into the unused vector, and then swap (this is similar to double buffering used in graphics). This way you don't have to reallocate the vectors every time you do the merge.
However, you are best off profiling first and finding out what your bottleneck is. If the allocations are minimal compared to the actual merging process than you need to figure out how to make that faster.
Some possible additional speedups could come from accessing the vectors raw data directly which avoids the bounds checks on each access the data.

The sort you mention in Approach 1 can be reduced to linear time (from log-linear as you describe it) because the two input lists are already sorted. Just perform the merge step of merge-sort. With an appropriate representation for the input span vectors (for example singly-linked lists) this can be done in-place.
http://en.wikipedia.org/wiki/Merge_sort

How about the second method without repeated allocation--in other words, allocate your temporary vector once, and never allocate it again? Or, if the input vectors are small enough (But not constant size), just use alloca instead of malloc.
Also, in terms of speed, you may want to make sure that your code is using CMOV for the sorting, since if the code is actually branching for every single iteration of the mergesort:
if(src1[x] < src2[x])
dst[x] = src1[x];
else
dst[x] = src2[x];
The branch prediction will fail 50% of the time, which will have an enormous hit on performance. A conditional move will likely do much much better, so make sure the compiler is doing that, and if not, try to coax it into doing so.

i don't think a strictly linear solution is possible, because widening the src vector spans may in the worst-case cause all of them to overlap (depending on the magnitude of the constant that you are adding)
the problem may be in the implementation, not in the algorithm; i would suggest profiling the code for your prior solutions to see where the time is spent
reasoning:
for a truly "modern" CPU like the Intel Core 2 Extreme QX9770 running at 3.2GHz, one can expect about 59,455 MIPS
for 100,000 vectors, you would have to process each vector in 594,550 instuctions. That's a LOT of instructions.
ref: wikipedia MIPS
in addition, note that adding a constant to the src vector spans does not de-sort them, so you can normalize the src vector spans independently, then merge them with the dst vector spans; this should reduce the workload of your original algorithm

1 is right out - a full sort is slower than merging two sorted lists.
So you're looking at tweaking 2 (or something entirely new).
If you change the data structures to doubly linked lists, then you can merge them in constant working space.
Use a fixed-size heap allocator for the list nodes, both to reduce memory use per node and to improve the chance that the nodes are close together in memory, reducing page misses.
You might be able to find code online or in your favourite algorithm book to optimise a linked list merge. You'll want to customise this in order to do span coalescing at the same time as the list merge.
To optimise the merge, first note that for each run of values coming off the same side without one coming from the other side, you can insert the whole run into the dst list in one go, instead of inserting each node in turn. And you can save one write per insertion over a normal list operation, by leaving the end "dangling", knowing that you'll patch it up later. And provided that you don't do deletions anywhere else in your app, the list can be singly-linked, which means one write per node.
As for 10 microsecond runtime - kind of depends on n and m...

I would always keep my vector of spans sorted. That makes implementing algorithms a LOT easier -- and possible to do in linear time.
OK, so I'd sort the spans based on:
span minimum in increasing order
then span maximum in decreasing order
You need to create a function to do that.
Then I'd use std::set_union to merge the vectors (you can merge more than one before continuing).
Then for each consecutive sets of spans with identical minimums, you keep the first and remove the rest (they're sub-spans of the first span).
Then you need to merge your spans. That should be pretty doable now and feasible in linear time.
OK, here's the trick now. Don't try to do this in-place. Use one or more temporary vectors (and reserve enough space ahead of time). Then at the end, call std::vector::swap to put the results in the input vector of your choice.
I hope that's enough to get you going.

What is your target system? Is it multi-core? If so you could consider multithreading this algorithm

I wrote a new container class just for this algorithm, tailored to the needs. This also gave me a chance to adjust other code around my program which got a little speed boost at the same time.
This is significantly faster than the old implementation using STL vectors, but which was otherwise basically the same thing. But while it's faster it's still not really fast enough... unfortunately.
Profiling doesn't reveal what is the real bottleneck any longer. The MSVC profiler seems to sometimes place the "blame" on the wrong calls (supposedly identical runs assign widely different running times) and most calls are getting coalesced into one big chink.
Looking at a disassembly of the generated code shows that there's a very large amount of jumps in the generated code, I think that might be the main reason behind the slowness now.
class SpanBuffer {
private:
int *data;
size_t allocated_size;
size_t count;
inline void EnsureSpace()
{
if (count == allocated_size)
Reserve(count*2);
}
public:
struct Span {
int start, end;
};
public:
SpanBuffer()
: data(0)
, allocated_size(24)
, count(0)
{
data = new int[allocated_size];
}
SpanBuffer(const SpanBuffer &src)
: data(0)
, allocated_size(src.allocated_size)
, count(src.count)
{
data = new int[allocated_size];
memcpy(data, src.data, sizeof(int)*count);
}
~SpanBuffer()
{
delete [] data;
}
inline void AddIntersection(int x)
{
EnsureSpace();
data[count++] = x;
}
inline void AddSpan(int s, int e)
{
assert((count & 1) == 0);
assert(s >= 0);
assert(e >= 0);
EnsureSpace();
data[count] = s;
data[count+1] = e;
count += 2;
}
inline void Clear()
{
count = 0;
}
inline size_t GetCount() const
{
return count;
}
inline int GetIntersection(size_t i) const
{
return data[i];
}
inline const Span * GetSpanIteratorBegin() const
{
assert((count & 1) == 0);
return reinterpret_cast<const Span *>(data);
}
inline Span * GetSpanIteratorBegin()
{
assert((count & 1) == 0);
return reinterpret_cast<Span *>(data);
}
inline const Span * GetSpanIteratorEnd() const
{
assert((count & 1) == 0);
return reinterpret_cast<const Span *>(data+count);
}
inline Span * GetSpanIteratorEnd()
{
assert((count & 1) == 0);
return reinterpret_cast<Span *>(data+count);
}
inline void MergeOrAddSpan(int s, int e)
{
assert((count & 1) == 0);
assert(s >= 0);
assert(e >= 0);
if (count == 0)
{
AddSpan(s, e);
return;
}
int *lastspan = data + count-2;
if (s > lastspan[1])
{
AddSpan(s, e);
}
else
{
if (s < lastspan[0])
lastspan[0] = s;
if (e > lastspan[1])
lastspan[1] = e;
}
}
inline void Reserve(size_t minsize)
{
if (minsize <= allocated_size)
return;
int *newdata = new int[minsize];
memcpy(newdata, data, sizeof(int)*count);
delete [] data;
data = newdata;
allocated_size = minsize;
}
inline void SortIntersections()
{
assert((count & 1) == 0);
std::sort(data, data+count, std::less<int>());
assert((count & 1) == 0);
}
inline void Swap(SpanBuffer &other)
{
std::swap(data, other.data);
std::swap(allocated_size, other.allocated_size);
std::swap(count, other.count);
}
};
struct ShapeWidener {
// How much to widen in the X direction
int widen_by;
// Half of width difference of src and dst (width of the border being produced)
int xofs;
// Temporary storage for OverlayScanline, so it doesn't need to reallocate for each call
SpanBuffer buffer;
inline void OverlayScanline(const SpanBuffer &src, SpanBuffer &dst);
ShapeWidener(int _xofs) : xofs(_xofs) { }
};
inline void ShapeWidener::OverlayScanline(const SpanBuffer &src, SpanBuffer &dst)
{
if (src.GetCount() == 0) return;
if (src.GetCount() + dst.GetCount() == 0) return;
assert((src.GetCount() & 1) == 0);
assert((dst.GetCount() & 1) == 0);
assert(buffer.GetCount() == 0);
dst.Swap(buffer);
const int widen_s = xofs - widen_by;
const int widen_e = xofs + widen_by;
size_t resta = src.GetCount()/2;
size_t restb = buffer.GetCount()/2;
const SpanBuffer::Span *spa = src.GetSpanIteratorBegin();
const SpanBuffer::Span *spb = buffer.GetSpanIteratorBegin();
while (resta > 0 || restb > 0)
{
if (restb == 0)
{
dst.MergeOrAddSpan(spa->start+widen_s, spa->end+widen_e);
--resta, ++spa;
}
else if (resta == 0)
{
dst.MergeOrAddSpan(spb->start, spb->end);
--restb, ++spb;
}
else if (spa->start < spb->start)
{
dst.MergeOrAddSpan(spa->start+widen_s, spa->end+widen_e);
--resta, ++spa;
}
else
{
dst.MergeOrAddSpan(spb->start, spb->end);
--restb, ++spb;
}
}
buffer.Clear();
}

If your most recent implementation still isn't fast enough, you might end up having to look at alternative approaches.
What are you using the outputs of this function for?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js