Fast way to convert ARGB texture to next power of 2 texture - c++

We have some old devices that don't support non-pot textures and we have a function that converts ARGB textures to next power of 2 texture. The problem is that it's quite slow and we're wondering if there is a better approach to convert these textures.
void PotTexture()
{
    // Round the dimensions up to the next power of two.
    size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
    size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;

    std::vector<unsigned char> pottedImageData;
    pottedImageData.resize(u2 * v2 * 4);

    // Copy the source image into the top-left corner of the padded buffer,
    // one byte (channel) at a time.
    size_t y, x, c;
    for (y = 0; y < imageData.height; y++)
    {
        for (x = 0; x < imageData.width; x++)
        {
            for (c = 0; c < 4; c++)
            {
                pottedImageData[4 * u2 * y + 4 * x + c] =
                    imageData.convertedData[4 * imageData.width * y + 4 * x + c];
            }
        }
    }

    imageData.width = u2;
    imageData.height = v2;
    std::swap(imageData.convertedData, pottedImageData);
}
On some devices this can easily use 100% of the CPU so any optimizations would be amazing. Are there any existing functions that I could look at that perform this conversion?
Edit:
I've optimized the above loop slightly to:
for (y = 0; y < imageData.height; y++)
{
    memcpy(
        &(pottedImageData[y * u2 * 4]),
        &(imageData.convertedData[y * imageData.width * 4]),
        imageData.width * 4);
}

Even devices that don't support NPOT texture should support NPOT load.
Create the texture as an exact power of 2 and NO CONTENT using glTexImage2D, passing a null pointer for data.
data may be a null pointer. In this case, texture memory is allocated to accommodate a texture of width width and height height. You can then download subtextures to initialize this texture memory. The image is undefined if the user tries to apply an uninitialized portion of the texture image to a primitive.
Then use glTexSubImage2D to upload a NPOT image, which occupies only a portion of the total texture. This can be done without any CPU-side image rearrangement.
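A minimal sketch of that, assuming the power-of-two sizes u2/v2 computed as in the question, an RGBA byte layout, and an already-bound GL_TEXTURE_2D (adjust the format/type enums to your actual channel order):
// Allocate a power-of-two texture with no initial content (data == nullptr).
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, u2, v2, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

// Upload the NPOT image into the top-left corner; no CPU-side copy needed.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,
                imageData.width, imageData.height,
                GL_RGBA, GL_UNSIGNED_BYTE, imageData.convertedData.data());
Texture coordinates then need to be scaled by width/u2 and height/v2 (or the image stretched in the shader), which is exactly what the next answer describes.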

Having had a similar problem in a program I wrote, I took a very different approach. Rather than stretch the source texture, I just copied it into the top left corner of an otherwise empty power-of-two texture.
Then in the pixel shader you use a pair of floats to adjust s,t values so you fetch from just the top left corner.
float sAdjust = static_cast<float>(textureWidth) / static_cast<float>(containerWidth);
float tAdjust = static_cast<float>(textureHeight) / static_cast<float>(containerHeight);
is how you compute them. To use them, take the Vec2 holding the s,t coordinates and multiply s by sAdjust and t by tAdjust before using them to fetch. If you're using Direct3D, it'd be something akin to this:
D3DXVECTOR4 stAdjust;
stAdjust.x = sAdjust;
stAdjust.y = tAdjust;
// Transfer stAdjust into a float4 inside your pixel shader, call it stAdjust in there
Now in the pixel shader assume you have:
float2 texcoord;
float4 stAdjust;
you just say:
texcoord.x = texcoord.x * stAdjust.x;
texcoord.y = texcoord.y * stAdjust.y;
before using texcoord. Sorry I can't tell you how to do this in GLSL, but you get the general idea.

Okay, the very first optimization can be done here:
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
What you want to do is (for each dimension) find the floor of the base-2 logarithm, n = floor(log2(dim)), and compute 2**(n+1). The standard math library has a log2 function, but it operates on floating point. Still, we can use it: 2**n can be written as 1 << n. So this gives
size_t const dim_p2_… = 1 << (int)floor(log2(dim_…)+1);
Better, but not yet ideal, because of that float conversion. The Bit Twiddling Hacks document has a few routines for an integer log2: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
But we're still not optimal. Let me introduce you to compiler intrinsics, which translate into single machine instructions if the machine in question can do it on the metal. What we need is the index of the most significant set bit:
GNU GCC: int __builtin_clz (unsigned int x), which returns the number of leading zero bits in x (undefined for x == 0), so the index of the highest set bit is 31 - __builtin_clz(x) for a 32-bit unsigned. (There is also __builtin_ffs, but that reports the least significant set bit, which is not what we want here.)
MSVC++: _BitScanReverse(&index, x), which scans from the most significant bit down and stores the index of the highest set bit in index.
So we can do
#define ilog2(x) (31 - __builtin_clz(x))
or the equivalent wrapper around _BitScanReverse
and use that.
size_t const dim_p2_… = 1 << (ilog2(dim_…) + 1);
While we're at bit twiddling: we can save that whole ordeal if a texture is already in power-of-two format. A few years ago I (independently) rediscovered the wonderfully portable bit twiddling trick, exploiting the properties of two's-complement integers. You can also find it in the Bit Twiddling Hacks document. But the type-neutral, concise macro form is rarely seen, so here it is:
#define ISPOW2(x) ( (x) && !( (x) & ((x) - 1) ) )
You're using C++ so templates are in order:
template<typename T> bool ispow2(T const x) { return x && !( x & (x - 1) ); }
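Putting those pieces together, a minimal helper along these lines (a sketch assuming GCC/Clang and 32-bit dimensions; swap in _BitScanReverse for MSVC) could replace the two while loops from the question:
#include <cstddef>

// Round x up to the next power of two (returns x unchanged if it already is one).
// Assumes x > 0, since __builtin_clz(0) is undefined.
inline size_t next_pow2(unsigned int x)
{
    if (ispow2(x))
        return x;                                  // already a power of two, nothing to do
    unsigned int msb = 31u - __builtin_clz(x);     // index of the highest set bit
    return size_t(1) << (msb + 1);                 // one step above it
}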
Then Ben Voight already told you how to use glTexSubImage2D to load that into the texture. Also have a look at the GL_ARB_texture_rectangle extension, which allows loading NPOT textures, though without mipmapping and advanced filtering. It might still be a viable choice for you.
If you ever feel the need to scale the texture, it's always worth looking into dual spaces, in this case the spatial frequency domain. Upscaling a signal is essentially applying an impulse response, so it can be described as a convolution. Direct convolution is O(n²) in complexity, but by the Fourier convolution theorem the equivalent operation in Fourier space is simple multiplication, i.e. O(n). An FFT can be done in O(n log n), so the total complexity is about O(n + 2n log n), which is much better.


Optimizing compute shaders

I have been doing a lot of different computations in compute shaders in OpenGL for the last couple of months. Some work fine, others are slow, some I could optimize somewhat, others again I could not optimize whatsoever.
I have been playing around with the simple code below (gravitational forces between n particles), just to find some strategies on how to increase performance in general, but absolutely nothing works:
#version 450 core

uniform uint NumParticles;

layout (std430, binding = 0) buffer bla
{
    double rIn[];
};

layout (std430, binding = 1) writeonly buffer bla2
{
    double aOut[];
};

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

void main()
{
    int n;
    double dist3, dist2;
    dvec3 a, diff, r = dvec3(rIn[gl_GlobalInvocationID.x * 3 + 0],
                             rIn[gl_GlobalInvocationID.x * 3 + 1],
                             rIn[gl_GlobalInvocationID.x * 3 + 2]);

    a.x = a.y = a.z = 0;
    for (n = 0; n < NumParticles; n++)
    {
        if (n != gl_GlobalInvocationID.x)
        {
            diff = dvec3(rIn[n * 3 + 0], rIn[n * 3 + 1], rIn[n * 3 + 2]) - r;
            dist2 = dot(diff, diff);
            dist3 = 1.0 / (sqrt(dist2) * dist2);
            a += diff * dist3;
        }
    }
    aOut[gl_GlobalInvocationID.x * 3 + 0] = a.x;
    aOut[gl_GlobalInvocationID.x * 3 + 1] = a.y;
    aOut[gl_GlobalInvocationID.x * 3 + 2] = a.z;
}
I have the strong suspicion that it is the large amount of memory access that slows down this code. So one thing I tried was using a shared variable as a "buffer": the first thread (gl_LocalInvocationID.x == 0) reads the first (for example) 1024 particles, all threads do their calculations, then the next 1024, etc. This slowed the code down by a factor of 2-3. Another thing I tried was putting the particle coordinates in a uniform array (which only works for up to 1024 particles and I use a lot more - so this was just to see if it made a difference), which changed absolutely nothing.
I can provide some code for the above examples, but I don't think this would be helpful.
I know there are minor improvements one could make (like using inversesqrt instead of 1.0 / sqrt, not computing particle n <-> particle m when m <-> n is already computed...), but I would be interested in a general approach for compute shaders.
So can anybody give me any hints for how I could improve performance for this code? I couldn't really find anything online on how to improve performance of compute shaders, so any general advice (not necessarily just for this code) would be appreciated.
This operation as defined doesn't seem like a good one for GPU parallelism. It's very hungry in terms of memory accesses, as complete processing for one particle requires reading the data for every other particle in the system.
If you want to keep the algorithm as is, you can implement it more optimally. As it stands, each work item does all of the processing for a particular particle all at once. That's a huge number of memory operations happening all at once.
Instead, split your particles into blocks, sized for a work group. Each work group operates on a block of source particles and a block of test particles (which may be the same block). The test particles should be loaded into shared memory, so each work group can repeatedly read test data quickly. So a single work group only does a portion of the tests for each block of source particles.
The big difficulty now is writing the data. Since multiple work groups may be writing accumulated forces for the same source particles, you need some mechanism to either atomically increment the source particle data or write the data to a temporary memory buffer. A second compute shader pass can then run over the temporary buffers and combine the data in a reduction step.

How can I draw smooth pixels in a netpbm image?

I just wrote a small netpbm parser and I am having fun with it, drawing mostly parametric equations. They look OK for a first time thing, but how can I expand upon this and have something that looks legit? This picture is how my method recreated the Arctic Monkeys logo which was just
0.5[cos(19t) - cos(21t)]
(I was trying to plot both cosines first before superpositioning them)
It obviously looks very "crispy" and sharp. I used as small of a step size as I could without it taking forever to finish. (0.0005, takes < 5 sec)
The only idea I had was that when drawing a white pixel, I should also draw its immediate neighbors with some slightly lighter gray. And then draw the neighbors of THOSE pixels with even lighter gray. Almost like the white color is "dissolving" or "dissipating".
I didn't try to implement this because it felt like a really bad way to do it and I am not even sure it'd produce anything near the desirable effect so I thought I'd ask first.
EDIT: here's a sample code that draws just a small spiral
the draw loop:
for (double t = 0; t < 6 * M_PI; t += 0.0005) // t advances in fractional steps, so it must be floating point
{
    double r = t;
    double new_x = 10 * r * cosf(0.1 * M_PI * t);
    double new_y = -10 * r * sinf(0.1 * M_PI * t);
    img.SetPixel(new_x + img.img_width / 2, new_y + img.img_height / 2, 255);
}
//img is a netpbm image with magic number P5 (binary grayscale PGM)
SetPixel:
void PPMimage::SetPixel(const uint16_t x, const uint16_t y, const uint16_t pixelVal)
{
    assert(pixelVal >= 0 && pixelVal <= max_greys && "pixelVal larger than image's maximum max_grey\n%d");
    assert(x >= 0 && x < img_width && "X value larger than image width\n");
    assert(y >= 0 && y < img_height && "Y value larger than image height\n");
    img_raster[y * img_width + x] = pixelVal;
}
This is what this code produces
A very basic form of antialiasing for a scatter plot (made of points rather than lines) can be achieved by applying something like stochastic rounding: consider the brush to be a pixel-sized square (but note the severe limitations of this model), centered at the non-integer coordinates of the plotted point, and compute its overlap with the four pixels that share the corner closest to that point. Treat each overlap fraction as a grayscale value; for a large number of points approximating a line, set each pixel to the largest such value, or use alpha blending for a small number of discrete points.
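A minimal sketch of that idea, assuming pixel centers at integer coordinates and an 8-bit grayscale raster stored row by row (the names PlotPointAA, raster, width, height are illustrative, not from the question's PPMimage class):
#include <cmath>
#include <cstdint>
#include <vector>

// Splat one point at non-integer coordinates (px, py) onto the raster,
// distributing its brightness over the four pixels that its pixel-sized
// square footprint overlaps.
void PlotPointAA(std::vector<uint8_t>& raster, int width, int height,
                 double px, double py)
{
    int x0 = (int)std::floor(px);
    int y0 = (int)std::floor(py);
    double fx = px - x0;   // fractional position inside the cell
    double fy = py - y0;

    // Overlap of the square with the four neighboring pixels.
    double w[4] = {
        (1.0 - fx) * (1.0 - fy),  // (x0,     y0)
        fx * (1.0 - fy),          // (x0 + 1, y0)
        (1.0 - fx) * fy,          // (x0,     y0 + 1)
        fx * fy                   // (x0 + 1, y0 + 1)
    };
    const int dx[4] = { 0, 1, 0, 1 };
    const int dy[4] = { 0, 0, 1, 1 };

    for (int k = 0; k < 4; ++k) {
        int x = x0 + dx[k], y = y0 + dy[k];
        if (x < 0 || x >= width || y < 0 || y >= height) continue;
        uint8_t v = (uint8_t)std::lround(255.0 * w[k]);
        uint8_t& dst = raster[(size_t)y * width + x];
        if (v > dst) dst = v;  // keep the maximum, as suggested for dense curves
    }
}
Calling something like this from the spiral loop instead of SetPixel, with the raw unrounded coordinates, should already soften the staircase considerably.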

Faster Method for Multiple Bilinear Interpolation?

I am writing a program in C++ to reconstruct a 3D object from a set of projected 2D images, the most computation-intensive part of which involves magnifying and shifting each image via bilinear interpolation. I currently have a pair of functions for this task; "blnSetup" defines a handful of parameters outside the loop, then "bilinear" applies the interpolation point-by-point within the loop:
(NOTE: 'I' is a 1D array containing ordered rows of image data)
//Pre-definition structure (in header)
struct blnData {
    float* X;
    float* Y;
    int*   I;
    float  X0;
    float  Y0;
    float  delX;
    float  delY;
};

//Pre-definition function (outside the FOR loop)
extern inline blnData blnSetup(float* X, float* Y, int* I)
{
    blnData bln;
    //Create pointers to X, Y, and I vectors
    bln.X = X;
    bln.Y = Y;
    bln.I = I;
    //Store offset and step values for X and Y
    bln.X0 = X[0];
    bln.delX = X[1] - X[0];
    bln.Y0 = Y[0];
    bln.delY = Y[1] - Y[0];
    return bln;
}
//Main interpolation function (inside the FOR loop)
extern inline float bilinear(float x, float y, blnData bln)
{
    float Ixy;
    //Return 0 if the target point is outside the image matrix
    if (x < bln.X[0] || x > bln.X[-1] || y < bln.Y[0] || y > bln.Y[-1])
        Ixy = 0;
    //Otherwise, apply bilinear interpolation
    else
    {
        //Define known image width
        int W = 200;
        //Find nearest indices for interpolation
        int i = floor((x - bln.X0) / bln.delX);
        int j = floor((y - bln.Y0) / bln.delY);
        //Interpolate I at (xi, yj)
        Ixy = 1 / ((bln.X[i + 1] - bln.X[i]) * (bln.Y[j + 1] - bln.Y[j])) *
            (
                bln.I[W * j + i]           * (bln.X[i + 1] - x) * (bln.Y[j + 1] - y) +
                bln.I[W * j + i + 1]       * (x - bln.X[i])     * (bln.Y[j + 1] - y) +
                bln.I[W * (j + 1) + i]     * (bln.X[i + 1] - x) * (y - bln.Y[j]) +
                bln.I[W * (j + 1) + i + 1] * (x - bln.X[i])     * (y - bln.Y[j])
            );
    }
    return Ixy;
}
EDIT: The function calls are below. 'flat.imgdata' is a std::vector containing the input image data and 'proj.imgdata' is a std::vector containing the transformed image.
int Xs = flat.dim[0];
int Ys = flat.dim[1];
int* Iarr = flat.imgdata.data();
float II, x, y;
bln = blnSetup(X, Y, Iarr);
for (int j = 0; j < flat.imgdata.size(); j++)
{
    x = 1.2 * X[j % Xs];
    y = 1.2 * Y[j / Xs];
    II = bilinear(x, y, bln);
    proj.imgdata[j] = (int)II;
}
Since I started optimizing, I have been able to reduce computation time by ~50x (!) by switching from std::vectors to C arrays within the interpolation function, and another 2x or so by cleaning up redundant computations/typecasting/etc, but assuming O(n) with n being the total number of processed pixels, the full reconstruction (~7e10 pixels) should still take 40min or so--about an order of magnitude longer than my goal of <5min.
According to Visual Studio's performance profiler, the interpolation function call ("II = bilinear(x, y, bln);") is unsurprisingly still the majority of my computation load. I haven't been able to find any linear algebraic methods for fast multiple interpolation, so my question is: is this basically as fast as my code will get, short of applying more or faster CPUs to the task? Or is there a different approach that might speed things up?
P.S. I've also only been coding in C++ for about a month now, so feel free to point out any beginner mistakes I might be making.
I wrote up a long answer suggesting looking at OpenCV (opencv.org), or using Halide (http://halide-lang.org/), and getting into how image warping is optimized, but I think a shorter answer might serve better. If you are really just scaling and translating entire images, OpenCV has code to do that and we have an example for resizing in Halide as well (https://github.com/halide/Halide/blob/master/apps/resize/resize.cpp).
If you really have an algorithm that needs to index an image using floating-point coordinates which result from a computation that cannot be turned into a moderately simple function on integer coordinates, then you really want to be using filtered texture sampling on a GPU. Most techniques for optimizing on the CPU rely on exploiting some regular pattern of access in the algorithm and removing float to integer conversion from the addressing. (For resizing, one uses two integer variables: one indexes the pixel coordinate of the image, and the other holds the fractional part of the coordinate and indexes a kernel of weights.) If this is not possible, the speedups are somewhat limited on CPUs. OpenCV does provide fairly general remapping support, but it likely isn't all that fast.
Two optimizations that may be applicable here are moving the boundary condition out of the loop and using a two-pass approach in which the horizontal and vertical dimensions are processed separately. The latter may or may not win, and will require tiling the data to fit in cache if the images are very large. Tiling in general is pretty important for large images, but it isn't clear it is the first-order performance problem here, and depending on the values in the inputs the cache behavior may not be regular enough anyway.
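As a rough illustration of the fixed-point addressing idea mentioned above (a sketch only; resizeRow, FRAC_BITS and the pure-scaling assumption are illustrative, not from the question's code):
#include <cstdint>

// Resample one row of integer pixels by a constant scale factor using
// 16.16 fixed-point source coordinates, so the inner loop needs no
// float-to-int conversion.
void resizeRow(const int* src, int srcW, int* dst, int dstW, double scale)
{
    const int FRAC_BITS = 16;
    const int64_t ONE = 1 << FRAC_BITS;
    const int64_t step = (int64_t)(ONE / scale);   // source step per destination pixel

    int64_t sx = 0;                                // source position in 16.16 fixed point
    for (int x = 0; x < dstW; ++x, sx += step)
    {
        int i = (int)(sx >> FRAC_BITS);            // integer source index
        int64_t f = sx & (ONE - 1);                // fractional weight
        if (i >= srcW - 1) { dst[x] = src[srcW - 1]; continue; }
        // Linear blend of the two neighbors, entirely in integer arithmetic.
        dst[x] = (int)((src[i] * (ONE - f) + src[i + 1] * f) >> FRAC_BITS);
    }
}
Applied along both axes as two separate passes, this kind of addressing should remove the floor() calls and the per-pixel division from the original bilinear() function.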
"vector 50x slower than array". That's a dead giveaway you're in debug mode, where vector::operator[] is not inlined. You will probably get the necessary speedup, and a lot more, simply by switching to release mode.
As a bonus, vector has a .back() method, so you have a proper replacement for that [-1]. Pointers to the beginning of an array don't contain the array size, so you can't find the back of an array that way.

Non-maximum suppression in Canny's algorithm: optimize with SSE

I have the "honor" of improving the runtime of the following code from someone else (it's the non-maximum suppression step from the Canny algorithm). My first thought was to use SSE intrinsics; I'm very new to this area, so my question is:
Is there any chance to do this?
And if so, can someone give me a few hints?
void vNonMaximumSupression(
    float* fpDst,
    float const* const fpMagnitude,
    unsigned char const* const ucpGradient, ///< [in] 0 -> 0°, 1 -> 45°, 2 -> 90°, 3 -> 135°
    int iXCount,
    int iXOffset,
    int iYCount,
    int ignoreX,
    int ignoreY)
{
    memset(fpDst, 0, sizeof(fpDst[0]) * iXCount * iXOffset);
    for (int y = ignoreY; y < iYCount - ignoreY; ++y)
    {
        for (int x = ignoreX; x < iXCount - ignoreX; ++x)
        {
            int idx = iXOffset * y + x;
            unsigned char dir = ucpGradient[idx];
            float fMag = fpMagnitude[idx];

            if (dir == 0 && fpMagnitude[idx - 1]           < fMag && fMag > fpMagnitude[idx + 1]           ||
                dir == 1 && fpMagnitude[idx - iXCount + 1] < fMag && fMag > fpMagnitude[idx + iXCount - 1] ||
                dir == 2 && fpMagnitude[idx - iXCount]     < fMag && fMag > fpMagnitude[idx + iXCount]     ||
                dir == 3 && fpMagnitude[idx - iXCount - 1] < fMag && fMag > fpMagnitude[idx + iXCount + 1])
                fpDst[idx] = fMag;
            else
                fpDst[idx] = 0;
        }
    }
}
Discussion
As #harold noted, the main problem for vectorization here is that the algorithm uses a different offset for each pixel (specified by the direction matrix). I can think of several potential ways of vectorization:
#nikie: evaluate all branches at once, i.e. compare each pixel with all of its neighbors. The results of these comparisons are blended according to the direction values.
#PeterCordes: load a lot of pixels into SSE registers, then use _mm_shuffle_epi8 to choose only the neighbors in the given direction. Then perform two vectorized comparisons.
(me): use scalar instructions to load the proper two neighboring pixels along the direction. Then combine these values for four pixels into SSE registers. Finally, do two comparisons in SSE.
The second approach is very hard to implement efficiently, because for a pack of 4 pixels there are 18 neighboring pixels to choose from. That would require too many shuffles, I think.
The first approach looks nice, but it would perform four times more operations per pixel. I suspect the speedup from vector instructions would be swallowed by the extra computation.
I suggest using the third approach. Below you can see hints on improving performance.
Hybrid approach
First of all, we want to make the scalar code as fast as possible. The code you presented contains too many branches. Most of them are not predictable, for instance the switch on direction.
In order to remove branches, I suggest creating an array delta = {1, stride - 1, stride, stride + 1}, which gives the index offset for each direction. By using this array you can find the indices of the neighboring pixels to compare with (without branches). Then you do the two comparisons. Finally, you can write a ternary operator like res = (isMax ? curr : 0);, hoping that the compiler can generate branchless code for it.
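For instance, a sketch of that branchless inner loop, using the variables of the original function (with stride standing in for iXOffset/iXCount):
// Branchless scalar inner loop: the direction byte selects one of four
// precomputed offsets instead of four separate if-branches.
const int delta[4] = { 1, stride - 1, stride, stride + 1 };
for (int x = ignoreX; x < iXCount - ignoreX; ++x)
{
    int idx = stride * y + x;
    int off = delta[ucpGradient[idx]];   // offset to the neighbor along the gradient
    float fMag = fpMagnitude[idx];
    bool isMax = fpMagnitude[idx - off] < fMag && fMag > fpMagnitude[idx + off];
    fpDst[idx] = isMax ? fMag : 0.0f;    // hopefully compiled to a conditional move
}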
Unfortunately, the compiler (at least MSVC 2013) is not smart enough to avoid the branch on isMax. That's why we can benefit from rewriting the inner loop with scalar SSE intrinsics. Look up the intrinsics guide for reference. You mostly need intrinsics ending with _ss, since the code is completely scalar.
Finally, we can vectorize everything except loading the neighboring pixels. In order to load neighboring pixels, we can use the _mm_setr_ps intrinsic with scalar arguments, asking the compiler to generate some good code for us =)
__m128 forw = _mm_setr_ps(src[idx+0 + offset0], src[idx+1 + offset1], src[idx+2 + offset2], src[idx+3 + offset3]);
__m128 back = _mm_setr_ps(src[idx+0 - offset0], src[idx+1 - offset1], src[idx+2 - offset2], src[idx+3 - offset3]);
I have just implemented it myself. Tested in a single thread on an Ivy Bridge at 3.4 GHz. A random image of 1024 x 1024 resolution was used as the source. The results (in milliseconds) are:
original: 13.078 //your code
branchless: 8.556 //'branchless' code
scalarsse: 2.151 //after rewriting to sse intrinsics
hybrid: 1.159 //partially vectorized code
They confirm performance improvements on each step. The final code needs a bit more than one millisecond to process a one-megapixel image. The total speedup is about 11.3 times. Indeed, you can get better performance on GPU =)
I hope the presented information would be enough for you to reproduce the steps. If you are seeking terrible spoilers, look here for my implementations of all these stages.

Subsampling an array of numbers

I have a series of 100 integer values which I need to reduce/subsample to 77 values for the purpose of fitting into a predefined space on screen. This gives a fraction of 77/100 values-per-pixel - not very neat.
Assuming the 77 is fixed and cannot be changed, what are some typical techniques for subsampling 100 numbers down to 77. I get a sense that it will be a jagged mapping, by which I mean the first new value is the average of [0, 1] then the next value is [3], then average [4, 5] etc. But how do I approach getting the pattern for this mapping?
I am working in C++, although I'm more interested in the technique than implementation.
Thanks in advance.
Whether you downsample or oversample, you are trying to reconstruct a signal at points that were never sampled... so you have to make some assumptions.
The sampling theorem tells you that if you sample a signal knowing that it has no frequency components above half the sampling frequency, you can continuously and completely recover the signal over the whole timing period. There's a way to reconstruct the signal using sinc() functions (that is, sin(x)/x).
sinc() (strictly sin(M_PI*x/Sampling_period)/(M_PI*x/Sampling_period)) is a function that has the following properties:
Its value is 1 for x == 0.0 and 0 for x == k*Sampling_period with k == +-1, +-2, ...
It has no frequency component above half of the sampling frequency derived from Sampling_period.
So if you take, for each sample k, the function F_k(x) = Y[k]*sinc(x/Sampling_period - k), which equals the sample value at position k and 0 at every other sample position, and sum over all k in your sample set, you get the best continuous function that has no components at frequencies above half the sampling frequency and passes through all of your samples.
That said, you can resample this function at whatever positions you like, getting the best way to resample your data.
This is, by far, a complicated way of resampling data (it also has the problem of not being causal, so it cannot be implemented in real time), and several methods have been used in the past to simplify the interpolation. You have to construct all the sinc functions, one for each sample point, and add them together. Then you have to resample the resulting function at the new sampling points and give that as the result.
Next is an example of the interpolation method just described. It accepts some input data (in_sz samples) and outputs interpolated data with the method described before (I assumed the extremes coincide, so the first and last output samples equal the first and last input samples; this leads to the somewhat intricate (in_sz - 1)/(out_sz - 1) calculations in the code; change it to in_sz/out_sz if you want a plain N samples -> M samples conversion):
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* normalized sinc function */
double sinc(double x)
{
    x *= M_PI;
    if (x == 0.0) return 1.0;
    return sin(x) / x;
} /* sinc */

/* interpolate a function made of in samples at point x */
double sinc_approx(double in[], size_t in_sz, double x)
{
    int i;
    double res = 0.0;
    for (i = 0; i < in_sz; i++)
        res += in[i] * sinc(x - i);
    return res;
} /* sinc_approx */

/* do the actual resampling. Change (in_sz - 1)/(out_sz - 1) if you
 * don't want the initial and final samples coincide, as is done here.
 */
void resample_sinc(
    double in[],
    size_t in_sz,
    double out[],
    size_t out_sz)
{
    int i;
    double dx = (double) (in_sz - 1) / (out_sz - 1);
    for (i = 0; i < out_sz; i++)
        out[i] = sinc_approx(in, in_sz, i * dx);
}

/* test case */
int main()
{
    double in[] = {
        0.0, 1.0, 0.5, 0.2, 0.1, 0.0,
    };
    const size_t in_sz = sizeof in / sizeof in[0];
    const size_t out_sz = 5;
    double out[out_sz];
    int i;

    for (i = 0; i < in_sz; i++)
        printf("in[%d] = %.6f\n", i, in[i]);

    resample_sinc(in, in_sz, out, out_sz);

    for (i = 0; i < out_sz; i++)
        printf("out[%.6f] = %.6f\n", (double) i * (in_sz - 1) / (out_sz - 1), out[i]);

    return EXIT_SUCCESS;
} /* main */
There are different ways of interpolation (see wikipedia)
The linear one would be something like:
#include <array>

std::array<int, 77> sampling(const std::array<int, 100>& a)
{
    std::array<int, 77> res;
    for (int i = 0; i != 76; ++i) {
        int index = i * 99 / 76;
        int p = i * 99 % 76;
        res[i] = ((p * a[index + 1]) + ((76 - p) * a[index])) / 76;
    }
    res[76] = a[99]; // done outside of loop to avoid out of bound access (0 * a[100])
    return res;
}
Live example
Create 77 new pixels based on the weighted average of their positions.
As a toy example, think about the 3 pixel case which you want to subsample to 2.
Original (denote as multidimensional array original with RGB as [0, 1, 2]):
|----|----|----|
Subsample (denote as multidimensional array subsample with RGB as [0, 1, 2]):
|------|------|
Here, it is intuitive to see that the first subsample seems like 2/3 of the first original pixel and 1/3 of the next.
For the first subsample pixel, subsample[0], you make it the RGB average of the m original pixels that overlap, in this case original[0] and original[1]. But we do so in weighted fashion.
subsample[0][0] = original[0][0] * 2/3 + original[1][0] * 1/3 # for red
subsample[0][1] = original[0][1] * 2/3 + original[1][1] * 1/3 # for green
subsample[0][2] = original[0][2] * 2/3 + original[1][2] * 1/3 # for blue
In this example, original[1][2] is the blue component of the second original pixel.
Keep in mind that for a different subsampling ratio you'll have to determine the set of original cells that contribute to each subsample, and then normalize to find the relative weight of each.
There are much more complex graphics techniques, but this one is simple and works.
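A sketch of the same idea for a single channel of N plain values (the name boxDownsample and the use of doubles rather than RGB triples are illustrative assumptions):
#include <cstddef>
#include <vector>

// Downsample by weighted averaging: every output bin covers in.size()/out_sz
// input cells, and each input cell contributes in proportion to how much of
// it falls inside the bin.
std::vector<double> boxDownsample(const std::vector<double>& in, std::size_t out_sz)
{
    std::vector<double> out(out_sz, 0.0);
    const double ratio = (double)in.size() / out_sz;   // e.g. 100 / 77

    for (std::size_t o = 0; o < out_sz; ++o)
    {
        double start = o * ratio;
        double end = (o + 1) * ratio;
        double sum = 0.0;
        for (std::size_t i = (std::size_t)start; i < in.size() && i < end; ++i)
        {
            // Overlap of input cell [i, i+1) with output bin [start, end).
            double lo = (start > i) ? start : (double)i;
            double hi = (end < i + 1) ? end : (double)(i + 1);
            sum += in[i] * (hi - lo);
        }
        out[o] = sum / ratio;   // normalize by the bin width
    }
    return out;
}
For the 3-to-2 toy example above this reproduces the 2/3 and 1/3 weights, and for 100-to-77 it produces the "jagged" pattern of weights the question was asking about.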
Everything depends on what you wish to do with the data - how do you want to visualize it.
A very simple approach would be to render to a 100-wide image, and then smooth scale the image down to a narrower size. Whatever graphics/development framework you're using will surely support such an operation.
Say, though, that your goal is to retain certain qualities of the data, such as minima and maxima. In that case, for each bin, you could draw a line of a darker color up to the minimum value and then continue with a lighter color up to the maximum. Or, instead of just putting a pixel at the average value, you could draw a line from the minimum to the maximum.
Finally, you might wish to render as if you had 77 values only - then the goal is to somehow transform the 100 values down to 77. This will imply some kind of an interpolation. Linear or quadratic interpolation is easy, but adds distortions to the signal. Ideally, you'd probably want to throw a sinc interpolator at the problem. A good list of them can be found here. For theoretical background, look here.