How to optimize interpolation of large set of scattered points? - c++

I am currently working with a set of coordinate points (longitude, latitude, about 60000 of them) and the temperature at that location. I need to do a interpolation on them to compute the values at some points with unknown temperature as to map certain regions.
As to respect the influence that the points have between them I have converted every (long, lat) point to a unit sphere point (x, y, z).
I have started applying the generalized multidimension Shepard interpolation from "Numerical recipes 3rd Edition":
Doub interp(VecDoub_I &pt)
{
Doub r, w, sum=0., sumw=0.;
if (pt.size() != dim)
throw("RBF_interp bad pt size");
for (Int i=0;i<n;i++)
{
if ((r=rad(&pt[0],&pts[i][0])) == 0.)
return vals[i];
sum += (w = pow(r,pneg));
sumw += w*vals[i];
}
return sumw/sum;
}
Doub rad(const Doub *p1, const Doub *p2)
{
Doub sum = 0.;
for (Int i=0;i<dim;i++)
sum += SQR(p1[i]-p2[i]);
return sqrt(sum);
}
As you can see, for the interpolation of one point, the algorithm computes the distance of that point to each of the other points and taking it as a weight in the final value.
Even though this algorithm works it is much too slow compared to what I need since I will be computing a lot of points to map a grid of a certain region.
One way of optimizing this is that I could leave out the points than are beyond a certain radius, but would pose a problem for areas with few or no points.
Another thing would be to reduce the computing of the distance between each 2 points by only computing once a Look-up Table and storing the distances. The problem with this is that it is impossible to store such a large matrix (60000 x 60000).
The grid of temperatures that is obtained, will be used to compute contours for different temperature values.
If anyone knows a way to optimize this algorithm or maybe help with a better one, I will appreciate it.

Radial basis functions with infinite support is probably not what you want to be using if you have a large number of data points and will be taking a large number of interpolation values.
There are variants that use N nearest neighbours and finite support to reduce the number of points that must be considered for each interpolation value. A variant of this can be found in the first solution mentioned here Inverse Distance Weighted (IDW) Interpolation with Python. (though I have a nagging suspicion that this implementation can be discontinuous under certain conditions - there are certainly variants that are fine)

Your look-up table doesn't have to store every point in the 60k square, only those once which are used repeatedly. You can map any coordinate x to int(x*resolution) to improve the hit rate by lowering the resolution.
A similar lookup table for the power function might also help.

Related

Given n points, how can I find the number of points with given distance

I have an input of n unique points (X,Y) that are between 0 and 2^32 inclusive. The coordinates are integers.
I need to create an algorithm that finds the number of pairs of points with a distance of exactly 2018.
I have thought of checking with every other point but it would be O(n^2) and I have to make it more efficient. I also thought of using a set or a vector and sort it using a comparator based on the distance with the origin point but it wouldn't help at all.
So how can I do it efficiently?
There is one Pythagorean triple with the hypotenuse of 2018: 11182+16802=20182.
Since all coordinates are integers, the only possible differences between the coordinates (both X an Y) of the two points are 0, 1118, 1680, and 2018.
Finding all pairs of points with a given difference between X (or Y) coordinates is a simple n log n operation.
Numbers other than 2018 might need a bit more work because they might be members of more than one Pythagorean triple (for example 2015 is a hypotenuse of 3 triples). If the number is not given as a constant, but provided at run time, you will have to generate all triples with this hypotenuse. This may require some sqrt(N) effort (N is the hypotenuse, not the number of points). One can find a recipe on the math stackexchange, e.g. here (there are many others).
You could try using a Quadtree. First you start sorting your points into the quadtree. You should specify a lower limit for the cell size of e.g. 2048 wich is a power of 2. Then iterate though the points and calculate distances to the points in the same cell and to the points in adjacent cells. That way you should be able to decrease the number of distance calculations drastically.
The main difficulty will probably be implementing the tree structure. You also have to find a way to find adjacent cells (you must include the possibility to traverse upwards in the tree)
The complexity of this is probably O(n*log(n)) in the best case but don't pin me down on that.
One additional word on the distance calculation: You are probably much faster if you don't do
dx = p1x - p2x;
dy = p1y - p2y;
if ( sqrt(dx*dx + dy*dy) == 2018 ) {
...
}
but
dx = p1x - p2x;
dy = p1y - p2y;
if ( dx*dx + dy*dy == 2018*2018 ) {
...
}
Squaring is faster than taking the sqare root. So just compare the square of the distance with the square of 2018.

Is there a way to generate the corners of a regular N-gon without division and without trig, given N as input

Edit: So I found a page related to rasterizing a trapezoid https://cse.taylor.edu/~btoll/s99/424/res/ucdavis/GraphicsNotes/Rasterizing-Polygons/Rasterizing-Polygons.html but am still trying to figure out if I can just do the edges
I am trying to generate points for the corners of an arbitrary N-gon. Where N is a non-zero positive integer. But I am trying to do so efficiently without the need of division and trig. I am thinking that there is probably some of sort of Bresenham's type algorithm for this but I cannot seem to find anything.
The only thing I can find on stackoverflow is this but the internal angle increments are found by using 2*π/N:
How to draw a n sided regular polygon in cartesian coordinates?
Here was the algorithm from that page in C
float angle_increment = 2*PI / n_sides;
for(int i = 0; i < n_sides; ++i)
{
float x = x_centre + radius * cos(i*angle_increment +start_angle);
float y = y_centre + radius * sin(i*angle_increment +start_angle);
}
So is it possible to do so without any division?
There's no solution which avoids calling sin and cos, aside from precomputing them for all useful values of N. However, you only need to do the computation once, regardless of N. (Provided you know where the centre of the polygon is. You might need a second computation to figure out the coordinates of the centre.) Every vertex can be computed from the previous vertex using the same rotation matrix.
The lookup table of precomputed values is not an unreasonable solution, since at some sufficiently large value of N the polygon becomes indistinguishable from a circle. So you probably only need a lookup table of a few hundred values.

C++ - Efficient way to compare vectors

At the moment i'm working with a camera to detect a marker. I use opencv and the Aruco Libary.
Only I'm stuck with a problem right now. I need to detect if the distance between 2 marker is less than a specific value. I have a function to calculate the distance, I can compare everything. But I'm looking for the most efficient way to keep track of all the markers (around 5/6) and how close they are together.
There is a list with markers but I cant find a efficient way to compare all of them.
I have a
Vector <Marker>
I also have a function called getDistance.
double getDistance(cv::Point2f punt1, cv::Point2f punt2)
{
float xd = punt2.x-punt1.x;
float yd = punt2.y-punt1.y;
double Distance = sqrtf(xd*xd + yd*yd);
return Distance;
}
The Markers contain a Point2f, so i can compare them easily.
One way to increase performance is to keep all the distances squared and avoid using the square root function. If you square the specific value you are checking against then this should work fine.
There isn't really a lot to recommend. If I understand the question and I'm counting the pairs correctly, you'll need to calculate 10 distances when you have 5 points, and 15 distances when you have 6 points. If you need to determine all of the distances, then you have no choice but to calculate all of the distances. I don't see any way around that. The only advice I can give is to make sure you calculate the distance between each pair only once (e.g., once you know the distance between points A and B, you don't need to calculate the distance between B and A).
It might be possible to sort the vector in such a way that you can short circuit your loop. For instance, if you sort it correctly and the distance between point A and point B is larger than your threshold, then the distances between A and C and A and D will also be larger than the threshold. But keep in mind that sorting isn't free, and it's likely that for small sets of points it would be faster to just calculate all distances ("Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy. ... For example, binary trees are always faster than splay trees for workaday problems.").
Newer versions of the C and C++ standard library have a hypot function for calculating distance between points:
#include <cmath>
double getDistance(cv::Point2f punt1, cv::Point2f punt2)
{
return std::hypot(punt2.x - punt1.x, punt2.y - punt1.y);
}
It's not necessarily faster, but it should be implemented in a way that avoids overflow when the points are far apart.
One minor optimization is to simply check if the change in X or change in Y exceeds the threshold. If it does, you can ignore the distance between those two points because the overall distance will also exceed the threshold:
const double threshold = ...;
std::vector<cv::Point2f> points;
// populate points
...
for (auto i = points.begin(); i != points.end(); ++i) {
for (auto j = i + 1; j != points.end(); ++j) {
double dx = std::abs(i->x - j->x), dy = std::abs(i->y - j->y);
if (dx > threshold || dy > threshold) {
continue;
}
double distance = std::hypot(dx, dy);
if (distance > threshold) {
continue;
}
...
}
}
If you're dealing with large amounts of data inside your vector you may want to consider some multithreading using future.
Vector <Marker> could be chunked into X chunks which are asynchronously computed together and stored inside std::future<>, putting to use #Sesame's suggestion will also increase your speed as well.

Very fast 3D distance check?

Is there a way to do a quick and dirty 3D distance check where the results are rough, but it is very very fast? I need to do depth sorting. I use STL sort like this:
bool sortfunc(CBox* a, CBox* b)
{
return a->Get3dDistance(Player.center,a->center) <
b->Get3dDistance(Player.center,b->center);
}
float CBox::Get3dDistance( Vec3 c1, Vec3 c2 )
{
//(Dx*Dx+Dy*Dy+Dz*Dz)^.5
float dx = c2.x - c1.x;
float dy = c2.y - c1.y;
float dz = c2.z - c1.z;
return sqrt((float)(dx * dx + dy * dy + dz * dz));
}
Is there possibly a way to do it without a square root or possibly without multiplication?
You can leave out the square root because for all positive (or really, non-negative) numbers x and y, if sqrt(x) < sqrt(y) then x < y. Since you're summing squares of real numbers, the square of every real number is non-negative, and the sum of any positive numbers is positive, the square root condition holds.
You cannot eliminate the multiplication, however, without changing the algorithm. Here's a counterexample: if x is (3, 1, 1) and y is (4, 0, 0), |x| < |y| because sqrt(1*1+1*1+3*3) < sqrt(4*4+0*0+0*0) and 1*1+1*1+3*3 < 4*4+0*0+0*0, but 1+1+3 > 4+0+0.
Since modern CPUs can compute a dot product faster than they can actually load the operands from memory, it's unlikely that you would have anything to gain by eliminating the multiply anyway (I think the newest CPUs have a special instruction that can compute a dot product every 3 cycles!).
I would not consider changing the algorithm without doing some profiling first. Your choice of algorithm will heavily depend on the size of your dataset (does it fit in cache?), how often you have to run it, and what you do with the results (collision detection? proximity? occlusion?).
What I usually do is first filter by Manhattan distance
float CBox::Within3DManhattanDistance( Vec3 c1, Vec3 c2, float distance )
{
float dx = abs(c2.x - c1.x);
float dy = abs(c2.y - c1.y);
float dz = abs(c2.z - c1.z);
if (dx > distance) return 0; // too far in x direction
if (dy > distance) return 0; // too far in y direction
if (dz > distance) return 0; // too far in z direction
return 1; // we're within the cube
}
Actually you can optimize this further if you know more about your environment. For example, in an environment where there is a ground like a flight simulator or a first person shooter game, the horizontal axis is very much larger than the vertical axis. In such an environment, if two objects are far apart they are very likely separated more by the x and y axis rather than the z axis (in a first person shooter most objects share the same z axis). So if you first compare x and y you can return early from the function and avoid doing extra calculations:
float CBox::Within3DManhattanDistance( Vec3 c1, Vec3 c2, float distance )
{
float dx = abs(c2.x - c1.x);
if (dx > distance) return 0; // too far in x direction
float dy = abs(c2.y - c1.y);
if (dy > distance) return 0; // too far in y direction
// since x and y distance are likely to be larger than
// z distance most of the time we don't need to execute
// the code below:
float dz = abs(c2.z - c1.z);
if (dz > distance) return 0; // too far in z direction
return 1; // we're within the cube
}
Sorry, I didn't realize the function is used for sorting. You can still use Manhattan distance to get a very rough first sort:
float CBox::ManhattanDistance( Vec3 c1, Vec3 c2 )
{
float dx = abs(c2.x - c1.x);
float dy = abs(c2.y - c1.y);
float dz = abs(c2.z - c1.z);
return dx+dy+dz;
}
After the rough first sort you can then take the topmost results, say the top 10 closest players, and re-sort using proper distance calculations.
Here's an equation that might help you get rid of both sqrt and multiply:
max(|dx|, |dy|, |dz|) <= distance(dx,dy,dz) <= |dx| + |dy| + |dz|
This gets you a range estimate for the distance which pins it down to within a factor of 3 (the upper and lower bounds can differ by at most 3x). You can then sort on, say, the lower number. You then need to process the array until you reach an object which is 3x farther away than the first obscuring object. You are then guaranteed to not find any object that is closer later in the array.
By the way, sorting is overkill here. A more efficient way would be to make a series of buckets with different distance estimates, say [1-3], [3-9], [9-27], .... Then put each element in a bucket. Process the buckets from smallest to largest until you reach an obscuring object. Process 1 additional bucket just to be sure.
By the way, floating point multiply is pretty fast nowadays. I'm not sure you gain much by converting it to absolute value.
I'm disappointed that the great old mathematical tricks seem to be getting lost. Here is the answer you're asking for. Source is Paul Hsieh's excellent web site: http://www.azillionmonkeys.com/qed/sqroot.html . Note that you don't care about distance; you will do fine for your sort with square of distance, which will be much faster.
In 2D, we can get a crude approximation of the distance metric without a square root with the formula:
distanceapprox (x, y) =
which will deviate from the true answer by at most about 8%. A similar derivation for 3 dimensions leads to:
distanceapprox (x, y, z) =
with a maximum error of about 16%.
However, something that should be pointed out, is that often the distance is only required for comparison purposes. For example, in the classical mandelbrot set (z←z2+c) calculation, the magnitude of a complex number is typically compared to a boundary radius length of 2. In these cases, one can simply drop the square root, by essentially squaring both sides of the comparison (since distances are always non-negative). That is to say:
√(Δx2+Δy2) < d is equivalent to Δx2+Δy2 < d2, if d ≥ 0
I should also mention that Chapter 13.2 of Richard G. Lyons's "Understanding Digital Signal Processing" has an incredible collection of 2D distance algorithms (a.k.a complex number magnitude approximations). As one example:
Max = x > y ? x : y;
Min = x < y ? x : y;
if ( Min < 0.04142135Max )
|V| = 0.99 * Max + 0.197 * Min;
else
|V| = 0.84 * Max + 0.561 * Min;
which has a maximum error of 1.0% from the actual distance. The penalty of course is that you're doing a couple branches; but even the "most accepted" answer to this question has at least three branches in it.
If you're serious about doing a super fast distance estimate to a specific precision, you could do so by writing your own simplified fsqrt() estimate using the same basic method as the compiler vendors do, but at a lower precision, by doing a fixed number of iterations. For example, you can eliminate the special case handling for extremely small or large numbers, and/or also reduce the number of Newton-Rapheson iterations. This was the key strategy underlying the so-called "Quake 3" fast inverse square root implementation -- it's the classic Newton algorithm with exactly one iteration.
Do not assume that your fsqrt() implementation is slow without benchmarking it and/or reading the sources. Most modern fsqrt() library implementations are branchless and really damned fast. Here for example is an old IBM floating point fsqrt implementation. Premature optimization is, and always will be, the root of all evil.
Note that for 2 (non-negative) distances A and B, if sqrt(A) < sqrt(B), then A < B. Create a specialized version of Get3DDistance() (GetSqrOf3DDistance()) that does not call sqrt() that would be used only for the sortfunc().
If you worry about performance, you should also take care of the way you send your arguments:
float Get3dDistance( Vec3 c1, Vec3 c2 );
implies two copies of Vec3 structure. Use references instead:
float Get3dDistance( Vec3 const & c1, Vec3 const & c2 );
You could compare squares of distances instead of the actual distances, since d2 = (x1-x2)2 + (y1-y2)2+ (z1-z2)2. It doesn't get rid of the multiplication, but it does eliminate the square root operation.
How often are the input vectors updated and how often are they sorted? Depending on your design, it might be quite efficient to extend the "Vec3" class with a pre-calculated distance and sort on that instead. Especially relevant if your implementation allows you to use vectorized operations.
Other than that, see the flipcode.com article on approximating distance functions for a discussion on yet another approach.
Depending slightly on the number of points that you are being used to compare with, what is below is pretty much guaranteed to be the get the list of points in approximate order assuming all points change at all iteration.
1) Rewrite the array into a single list of Manhattan distances with
out[ i ] = abs( posn[ i ].x - player.x ) + abs( posn[ i ].y - player.y ) + abs( posn[ i ].z - player.z );
2) Now you can use radix sort on floating point numbers to order them.
Note that in practice this is going to be a lot faster than sorting the list of 3d positions because it significantly reduces the memory bandwidth requirements in the sort operation which all of the time is going to be spend and in which unpredictable accesses and writes are going to occur. This will run on O(N) time.
If many of the points are stationary at each direction there are far faster algorithms like using KD-Trees, although implementation is quite a bit more complex and it is much harder to get good memory access patterns.
If this is simply a value for sorting, then you can swap the sqrt() for a abs(). If you need to compare distances against set values, get the square of that value.
E.g. instead of checking sqrt(...) against a, you can compare abs(...) against a*a.
You may want to consider caching the distance between the player and the object as you calculate it, and then use that in your sortfunc. This would depend upon how many times your sort function looks at each object, so you might have to profile to be sure.
What I'm getting at is that your sort function might do something like this:
compare(a,b);
compare(a,c);
compare(a,d);
and you would calculate the distance between the player and 'a' every time.
As others have mentioned, you can leave out the sqrt in this case.
If you could center your coordinates around the player, use spherical coordinates? Then you could sort by the radius.
That's a big if, though.
If your operation happens a lot, it might be worth to put it into some 3D data structure. You probably need the distance sorting to decide which object is visible, or some similar task. In order of complexity you can use:
Uniform (cubic) subdivision
Divide the used space into cells, and assign the objects to the cells. Fast access to element, neighbours are trivial, but empty cells take up a lot of space.
Quadtree
Given a threshold, divide used space recursively into four quads until less then threshold number of object is inside. Logarithmic access element if objects don't stack upon each other, neighbours are not hard to find, space efficient solution.
Octree
Same as Quadtree, but divides into 8, optimal even if objects are above each other.
Kd tree
Given some heuristic cost function, and a threshold, split space into two halves with a plane where the cost function is minimal. (Eg.: same amount of objects at each side.) Repeat recursively until threshold reached. Always logarithmic, neighbours are harder to get, space efficient (and works in all dimensions).
Using any of the above data structures, you can start from a position, and go from neighbour to neighbour to list the objects in increasing distance. You can stop at desired cut distance. You can also skip cells that cannot be seen from the camera.
For the distance check, you can do one of the above mentioned routines, but ultimately they wont scale well with increasing number of objects. These can be used to display data that takes hundreds of gigabytes of hard disc space.

Fastest way to calculate cubic bezier curves?

Right now I calculate it like this:
double dx1 = a.RightHandle.x - a.UserPoint.x;
double dy1 = a.RightHandle.y - a.UserPoint.y;
double dx2 = b.LeftHandle.x - a.RightHandle.x;
double dy2 = b.LeftHandle.y - a.RightHandle.y;
double dx3 = b.UserPoint.x - b.LeftHandle.x;
double dy3 = b.UserPoint.y - b.LeftHandle.y;
float len = sqrt(dx1 * dx1 + dy1 * dy1) +
sqrt(dx2 * dx2 + dy2 * dy2) +
sqrt(dx3 * dx3 + dy3 * dy3);
int NUM_STEPS = int(len * 0.05);
if(NUM_STEPS > 55)
{
NUM_STEPS = 55;
}
double subdiv_step = 1.0 / (NUM_STEPS + 1);
double subdiv_step2 = subdiv_step*subdiv_step;
double subdiv_step3 = subdiv_step*subdiv_step*subdiv_step;
double pre1 = 3.0 * subdiv_step;
double pre2 = 3.0 * subdiv_step2;
double pre4 = 6.0 * subdiv_step2;
double pre5 = 6.0 * subdiv_step3;
double tmp1x = a.UserPoint.x - a.RightHandle.x * 2.0 + b.LeftHandle.x;
double tmp1y = a.UserPoint.y - a.RightHandle.y * 2.0 + b.LeftHandle.y;
double tmp2x = (a.RightHandle.x - b.LeftHandle.x)*3.0 - a.UserPoint.x + b.UserPoint.x;
double tmp2y = (a.RightHandle.y - b.LeftHandle.y)*3.0 - a.UserPoint.y + b.UserPoint.y;
double fx = a.UserPoint.x;
double fy = a.UserPoint.y;
//a user
//a right
//b left
//b user
double dfx = (a.RightHandle.x - a.UserPoint.x)*pre1 + tmp1x*pre2 + tmp2x*subdiv_step3;
double dfy = (a.RightHandle.y - a.UserPoint.y)*pre1 + tmp1y*pre2 + tmp2y*subdiv_step3;
double ddfx = tmp1x*pre4 + tmp2x*pre5;
double ddfy = tmp1y*pre4 + tmp2y*pre5;
double dddfx = tmp2x*pre5;
double dddfy = tmp2y*pre5;
int step = NUM_STEPS;
while(step--)
{
fx += dfx;
fy += dfy;
dfx += ddfx;
dfy += ddfy;
ddfx += dddfx;
ddfy += dddfy;
temp[0] = fx;
temp[1] = fy;
Contour[currentcontour].DrawingPoints.push_back(temp);
}
temp[0] = (GLdouble)b.UserPoint.x;
temp[1] = (GLdouble)b.UserPoint.y;
Contour[currentcontour].DrawingPoints.push_back(temp);
I'm wondering if there is a faster way to interpolate cubic beziers?
Thanks
Look into forward differencing for a faster method. Care must be taken to deal with rounding errors.
The adaptive subdivision method, with some checks, can be fast and accurate.
There is another point that is also very important, which is that you are approximating your curve using a lot of fixed-length straight-line segments. This is inefficient in areas where your curve is nearly straight, and can lead to a nasty angular poly-line where the curve is very curvy. There is not a simple compromise that will work for high and low curvatures.
To get around this is you can dynamically subdivide the curve (e.g. split it into two pieces at the half-way point and then see if the two line segments are within a reasonable distance of the curve. If a segment is a good fit for the curve, stop there; if it is not, then subdivide it in the same way and repeat). You have to be careful to subdivide it enough that you don't miss any localised (small) features when sampling the curve in this way.
This will not always draw your curve "faster", but it will guarantee that it always looks good while using the minimum number of line segments necessary to achieve that quality.
Once you are drawing the curve "well", you can then look at how to make the necessary calculations "faster".
Actually you should continue splitting until the two lines joining points on curve (end nodes) and their farthest control points are "flat enough":
- either fully aligned or
- their intersection is at a position whose "square distance" from both end nodes is below one half "square pixel") - note that you don't need to compute the actual distance, as it would require computing a square root, which is slow)
When you reach this situation, ignore the control points and join the two end-points with a straight segment.
This is faster, because rapidly you'll get straight segments that can be drawn directly as if they were straight lines, using the classic Bresenham algorithm.
Note: you should take into account the fractional bits of the endpoints to properly set the initial value of the error variable accumulating differences and used by the incremental Bresenham algorithm, in order to get better results (notably when the final segment to draw is very near from the horizontal or vertical or from the two diagonals); otherwise you'll get visible artefacts.
The classic Bresenham algorithm to draw lines between points that are aligned on an integer grid initializes this error variable to zero for the position of the first end node. But a minor modification of the Bresenham algorithm scales up the two distances variables and the error value simply by a constant power of two, before using the 0/+1 increments for the x or y variable which remain unscaled.
The high order bits of the error variable also allows you compute an alpha value that can be used to draw two stacked pixels with the correct alpha-shading. In most cases, your images will be using 8-bit color planes at most, so you will not need more that 8 bits of extra precision for the error value, and the upscaling can be limited to the factor of 256: you can use it to draw "smooth" lines.
But you could limit yourself to the scaling factor of 16 (four bits): typical bitmap images you have to draw are not extremely wide and their resolution is far below +/- 2 billions (the limit of a signed 32-bit integer): when you scale up the coordinates by a factor of 16, it will remain 28 bits to work with, but you should already have "clipped" the geometry to the view area of your bitmap to render, and the error variable of the Bresenham algorithm will remain below 56 bits in all cases and will still fit in a 64-bit integer.
If your error variable is 32-bit, you must limit the scaled coordinates below 2^15 (not more than 15 bits) for the worst case (otherwise the test of the sign of the error varaible used by Bresenham will not work due to integer overflow in the worst case), and with the upscaling factor of 16 (4 bits) you'll be limited to draw images not larger than 11 bits in width or height, i.e. 2048x2048 images.
But if your draw area is effectively below 2048x2048 pixels, there's no problem to draw lined smoothed by 16 alpha-shaded values of the draw color (to draw alpha-shaded pixels, you need to read the orignal pixel value in the image before mixing the alpha-shaded color, unless the computed shade is 0% for the first staked pixel that you don't need to draw, and 100% for the second stacked pixel that you can overwrite directly with the plain draw color)
If your computed image also includes an alpha-channel, your draw color can also have its own alpha value that you'll need to shade and combine with the alpha value of the pixels to draw. But you don't need any intermediate buffer just for the line to draw because you can draw directly in the target buffer.
With the error variable used by the Bresenham algorithm, there's no problem at all caused by rounding errors because they are taken into account by this variable. So set its initial value properly (the alternative, by simply scaling up all coordinates by a factor of 16 before starting subdividing recursively the spline is 16 times slower in the Bresenham algorithm itself).
Note how "flat enough" can be calculated. "Flatness" is a mesure of the minimum absolute angle (between 0 and 180°) between two sucessive segment but you don't need to compute the actual angle because this flatness is also equivalent to setting a minimum value to the cosine of their relative angle.
That cosine value also does not need to be computed directly because all you need is in fact the vector product of the two vectors and compare it with the square of the maximum of their length.
Note also that the "square of the cosine" is also "one minus the square of the sine". A maximum square cosine value is also a minimum square sine value... Now you know which kind of "vector product" to use: the fastest and simplest to compute is the scalar product, whose square is proportional to the square sine of the two vectors and to the product of square lengths of both vectors.
So checking if the curve is "flat enough" is simple: compute a ratio between two scalar products and see if this ratio is below the "flatness" constant value (of the minimum square sine). There's no division by zero because you'll determine which of the two vectors is the longest, and if this one has a square length below 1/4, your curve is already flat enough for the rendering resolution; otherwise check this ratio between the longest and the shortest vector (formed by the crossing diagonals of the convex hull containing the end points and control points):
with quadratic beziers, the convex hull is a triangle and you choose two pairs
with cubic beziers, the convex hull is a 4-sides convex polygon and the diagonals may either join an end point with one of the two control points, or join together the two end-points and the two control points and you have six possibilities
Use the combination giving the maximum length for the first vector between the 1st end-point and one of the three other points, the second vector joining two other points):
Al you need is to determine the "minimum square length" of the segments starting with one end-point or control-point to the next control-point or end-point in the sequence. (in a quadratic Bezier you just compare two segments, with a quadratic Bezier, you check 3 segments)
If this "minimum square length" is below 1/4 you can stop there, the curve is "flat enough".
Then determine the "maximum square length" of the segments starting with one end-point to any one of the other end-point or control-point (with a quadratic Bezier you can safely use the same 2 segments as above, with a cubic Bezier you discard one of the 3 segments used above joining the 2 control-points, but you add the segment joining the two end-nodes).
Then check that the "minimum square length" is lower than the product of the constant "flatness" (minimum square sine) times the "maximum square length" (if so the curve is "flat enough".
In both cases, when your curve is "flat enough", you just need to draw the segment joining the two end-points. Otherwise you split the spline recursively.
You may include a limit to the recursion, but in practice it will never be reached unless the convex hull of the curve covers a very large area in a very large draw area; even with 32 levels of recusions, it will never explode in a rectangular draw area whose diagonal is shorter than 2^32 pixels (the limit would be reached only if you are splitting a "virtual Bezier" in a virtually infinite space with floating-point coordinates but you don't intend to draw it directly, because you won't have the 1/2 pixel limitation in such space, and only if you have set an extreme value for the "flatness" such that your "minimum square sine" constant parameter is 1/2^32 or lower).