Drawing large numbers of pixels in OpenGL - c++

I've been working on some sound processing code and now I'm doing some visualizations. I finished making a spectrogram, but the way I am drawing it is too slow.
I'm using OpenGL to do 2D drawing, which has made searching for help more difficult. Also I am very new to OpenGL, so I don't know the standard way things are done.
I am storing the r,g,b values for each pixel in a large matrix.
Each time I get a small sound segment, I process it and convert it to a column of pixels. Everything is shifted to the left by 1 pixel, and the new line is put at the end.
Each time I redraw, I am looping through setting the color and drawing each pixel individually, which seems like a horribly inefficient way to do this.
Is there a better way to do this? Is there some method for simply shifting a bunch of pixels over?

There are many ways to improve your drawing speed.
The simplest would be to allocate an RGB texture that you draw using a screen-aligned textured quad.
Each time you want to draw a new line, you can use glTexSubImage2D to load a new subset of the texture and then redraw the quad.

Are you perhaps passing a lot more data to the graphics card than you have pixels? This could happen if your FFT size is much larger than the height of the drawing area or the number of spectral lines is a lot more than its width. If so, it's possible that the bottleneck is passing too much data across the bus. Try reducing the number of spectral lines by either averaging them or peak picking (taking the maximum in each bin for a set of consecutive lines).
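The peak-picking reduction described above can be sketched on the CPU as follows. This is only an illustration: `reduce_by_max` is a hypothetical helper name, and it assumes the spectrum is a plain vector of magnitudes that should be shrunk to the number of pixel rows available.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: shrink `input` to `out_size` values by keeping the
// maximum of each group of consecutive entries (peak picking).
std::vector<float> reduce_by_max(const std::vector<float>& input, std::size_t out_size) {
    assert(out_size > 0 && out_size <= input.size());
    std::vector<float> out(out_size);
    for (std::size_t i = 0; i < out_size; ++i) {
        // Integer arithmetic spreads the input entries evenly over the output bins.
        const std::size_t begin = i * input.size() / out_size;
        const std::size_t end = (i + 1) * input.size() / out_size;
        out[i] = *std::max_element(input.begin() + begin, input.begin() + end);
    }
    return out;
}
```

Averaging instead of taking the maximum would only change the line that fills `out[i]`.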

GL_POINTS, VBO, GL_STREAM_DRAW.

I know this is an old question, but . . .
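Use a circular buffer to store the pixels, and then simply call glDrawPixels twice with the appropriate offsets. Something like this untested C:
#define SIZE_X 800
#define SIZE_Y 600
unsigned char pixels[SIZE_Y][SIZE_X][3];
int start = 0; /* column that holds the oldest line */
void add_line(const unsigned char line[SIZE_Y][3]) {
    int i, j;
    /* overwrite the oldest column, then advance */
    for (i = 0; i < SIZE_Y; ++i) for (j = 0; j < 3; ++j) pixels[i][start][j] = line[i][j];
    start = (start + 1) % SIZE_X;
}
void draw(void) {
    int w = SIZE_X - start; /* columns from start to the right edge of the buffer */
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
    glPixelStorei(GL_UNPACK_ROW_LENGTH, SIZE_X); /* rows in memory are SIZE_X pixels wide */
    /* older columns [start, SIZE_X) go on the left of the window */
    glPixelStorei(GL_UNPACK_SKIP_PIXELS, start);
    glWindowPos2i(0, 0); /* needs OpenGL 1.4 or later */
    glDrawPixels(w, SIZE_Y, GL_RGB, GL_UNSIGNED_BYTE, pixels);
    /* newer columns [0, start) go to the right of them */
    if (start != 0) {
        glPixelStorei(GL_UNPACK_SKIP_PIXELS, 0);
        glWindowPos2i(w, 0);
        glDrawPixels(start, SIZE_Y, GL_RGB, GL_UNSIGNED_BYTE, pixels);
    }
}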

Related

How to detect if an image contains only white color with C++

We are writing a piece of software which downloads tiles from the internet from WMS servers (these are map servers, and they provide images as map data for various locations on the globe) and then displays them inside a window, using Qt and some OpenGL bindings.
Some of these servers contain data only for specific regions of the planet, and if you request an area outside of what they support, they provide just a blank white image, which we do not want to use since it occupies extra space. So the question is:
How to identify whether an image contains only 1 color (white), or not.
What we have tried till now is the following:
Create a QImage, loop over every pixel of it, see if it differs from white. This is extremely slow, and since we want this to be a more or less realtime application, this idea sadly does not work.
Check if the image size is the same as an empty image size, but this also does not work, since it might happen that:
There is another image with the same size which actually contains data
It might be that tiles which are over an ocean have just one color, a light blue, and we need those tiles.
Do a "post processing" of the downloaded images and remove them from the scene later, but this looks ugly from the users' perspective that tiles are just appearing and disappearing ...
Request transparent images from the WMS servers, but due to some OpenGL mishaps when rendering, these images appear as black, though only on some (mostly low-end) video cards.
Any idea, library to use, direction or even code is welcome, and we need a C++ solution, since our app is C++.
Edit for those suggesting to sample pixels only from a few points in the map:
[image 1] and [image 2]
The two images above (yes, the left image contains a very tiny piece of Norway in the corner) would be eliminated if we assumed that the image is entirely white based only on sampling a few points, in case none of those points actually touches any color other than white. Link to the second image: https://wms.geonorge.no/skwms1/wms.sjokartraster2?LAYERS=all&SRS=EPSG:900913&FORMAT=image/png&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&BBOX=-313086.067812500,9079495.966562500,0.000000000,9392582.034375001&WIDTH=256&HEIGHT=256&TRANSPARENT=false
The correct and most reliable way would be to uncompress the PNG bytes and check each pixel in a tight loop.
The most common reason an image-processing routine is "slow" is that it makes a function call per pixel. So if you are calling QImage::pixel in a nested loop for each row/column, it will not have the performance you desire.
Instead, take advantage of the fact that QImage gives you raw image bytes via the scanLine method or the bits method.
Something like this might work:
// needs <cstring> and <vector>
// Format_RGB32 stores every pixel as 0xffRRGGBB, so a white pixel is four 0xff bytes.
const QImage img = qimage.convertToFormat(QImage::Format_RGB32);
const size_t row_bytes = static_cast<size_t>(img.width()) * 4;
const std::vector<unsigned char> white_row(row_bytes, 0xff);
bool allWhite = true;
for (int row = 0; allWhite && (row < img.height()); row++)
{
    const unsigned char* row_data = img.constScanLine(row);
    allWhite = !memcmp(row_data, white_row.data(), row_bytes);
}
The above loop terminates pretty fast the moment a non-white pixel is encountered.
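The same row-by-row comparison can be sketched without Qt, which makes it easy to try out in isolation. This is a minimal sketch under stated assumptions: `is_all_white` is a hypothetical helper, the pixels are assumed to be 4 bytes each with white stored as four 0xff bytes, and `stride` stands for the number of bytes per row (which may include padding).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns true if every pixel of a 4-bytes-per-pixel
// image consists of four 0xff bytes. `stride` is the number of bytes per row
// in memory and may be larger than width * 4 because of row padding.
bool is_all_white(const unsigned char* data, int width, int height, std::size_t stride) {
    assert(data != nullptr && width > 0 && height > 0);
    const std::size_t row_bytes = static_cast<std::size_t>(width) * 4;
    assert(stride >= row_bytes);
    const std::vector<unsigned char> white_row(row_bytes, 0xff);
    for (int row = 0; row < height; ++row) {
        // Compare only the pixel bytes, never the padding at the end of a row.
        if (std::memcmp(data + row * stride, white_row.data(), row_bytes) != 0) {
            return false;  // stop at the first row containing a non-white pixel
        }
    }
    return true;
}
```

Comparing only `width * 4` bytes per row keeps uninitialized padding from producing a false "not white" result.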

Speeding up drawing bitmap magnification within second bitmap with blend

The following code stretches a bitmap, blends it with an existing background, maintains transparent area of primary graphic and then displays the blend within a window (imgScreen). This works fine when the level of stretch is not large or when it is actually shrinking the initial bitmap. However when stretching the graphic it is very slow.
I have limited experience with C++ and this kind of graphics so perhaps there is another more efficient way to do this. The primary bitmap to be sized is always square. Any ideas are much appreciated..!
I was going to try not displaying clipping area but from tests it seems the initial stretch is causing the slowdown... Also having trouble seeing how to calculate non clipped area... Drawing to controls seems a waste but seems only way to use built in functions like stretchdraw and the alpha draw option.
std::auto_ptr<Graphics::TBitmap> bmap(new Graphics::TBitmap);
std::auto_ptr<Graphics::TBitmap> bmap1(new Graphics::TBitmap);
int s = newsize;
TRect sR = Rect(X,Y,X+s,Y+s);
TRect tR = Rect(0,0,s,s);
bmap->SetSize(s,s);
bmap->Canvas->StretchDraw(Rect(0, 0, s, s), Form1->Image4->Picture->Bitmap); // scale
bmap1->SetSize(s,s);
bmap1->Canvas->CopyRect(tR, Form1->imgScreen->Canvas, sR); //background
bmap1->Canvas->Draw(0,0,bmap.get()); // combine
Form1->imgTemp->Picture->Assign(bmap1.get());
Form1->imgScreen->Canvas->Draw(X,Y, Form1->imgTemp->Picture->Bitmap, alpha);
Displays correctly but as graphic gets larger draw rate slows down quickly...

Pygame: slow performance using pygame.Surface and convert_alpha()

I'm developing a simple tile game that displays a grid image and paints it with successive layers of images. So I have-
list_of_image_tiles = { GRASS: pygame.image.load('/grass.png').convert_alpha(), TREES: pygame.image.load('/trees.png').convert_alpha(), etc}
Then later on I blit these-
DISPLAYSURF.blit(list_of_images[lists_of_stuff][TREES], (col*TILESIZE,row*TILESIZE))
DISPLAYSURF.blit(list_of_images[lists_of_stuff][GRASS], (col*TILESIZE,row*TILESIZE))
Note that for brevity I've not included a lot of code but it does basically work- except performance is painfully slow. If I comment out the DISPLAYSURF stuff performance leaps forward, so I think I need a better way to do the DISPLAYSURF stuff, or possibly the pygame.image.load bits (is convert_alpha() the best way, bearing in mind I need the layered-image approach?)
I read that something called Psyco might help, but I'm not sure how to fit that in. Any ideas on how to improve the performance are most welcome.
There are a couple of things you can do.
1. Perform the "multi-layer" blit just once to a surface, then blit only that surface to DISPLAYSURF every frame.
2. Identify the parts of the screen that need to be updated and use pygame.display.update(rectangle_list) instead of pygame.display.flip().
Edit to add example of 1.
Note: you didn't give much of your code, so I just fit this with how I do it.
# build up the level surface once when you enter a level.
level = Surface((LEVEL_WIDTH * TILESIZE, LEVEL_HEIGHT * TILESIZE))
for row in range(LEVEL_HEIGHT):
    for col in range(LEVEL_WIDTH):
        level.blit(list_of_images[lists_of_stuff][TREES], (col * TILESIZE, row * TILESIZE))
        level.blit(list_of_images[lists_of_stuff][GRASS], (col * TILESIZE, row * TILESIZE))
then in main loop during draw part
# blit only the part of the level that should be on the screen
# view is a Rect describing what tiles should be viewable
disp = DISPLAYSURF.get_rect()
level_area = Rect((view.left * TILESIZE, view.top * TILESIZE), disp.size)
DISPLAYSURF.blit(level, disp, area = level_area)
You should use a colorkey whenever you don't need per-pixel alpha. I just changed every convert_alpha() in my code to a plain convert() and set a color key for the fully opaque parts of the image. Performance increased tenfold!

what is the most efficient way of moving multiple objects (stored in VBO) in space? should I use glTranslatef or a shader?

I'm trying to get the hang of moving objects (in general) and line strips (in particular) most efficiently in opengl and therefore I'm writing an application where multiple line segments are traveling with a constant speed from right to left. At every time point the left most point will be removed, the entire line will be shifted to the left, and a new point will be added at the very right of the line (this new data point is streamed / received / calculated on the fly, every 10ms or so). To illustrate what I mean, see this image:
Because I want to work with many objects, I decided to use vertex buffer objects in order to minimize the amount of gl* calls. My current code looks something like this:
A) setup initial vertices:
# calculate my_func(x) in range [0, n]
# (could also be random data)
data = my_func(0, n)
# create & bind buffer
vbo_id = GLuint()
glGenBuffers(1, vbo_id);
glBindBuffer(GL_ARRAY_BUFFER, vbo_id)
# allocate memory & transfer data to GPU
glBufferData(GL_ARRAY_BUFFER, sizeof(data), data, GL_DYNAMIC_DRAW)
B) update vertices:
draw():
    # get new data and update offset
    data = my_func(n+dx, n+2*dx)
    # update offset 'n' which is the current absolute value of x.
    n = n + 2*dx
    # upload data
    glBindBuffer(GL_ARRAY_BUFFER, vbo_id)
    glBufferSubData(GL_ARRAY_BUFFER, n, sizeof(data), data)
    # translate scene so it looks like line strip has moved to the left.
    glTranslatef(-local_shift, 0.0, 0.0)
    # draw all points from offset
    glVertexPointer(2, GL_FLOAT, 0, n)
    glDrawArrays(GL_LINE_STRIP, 0, points_per_vbo)
where my_func would do something like this:
from random import random
def my_func(start_x, end_x):
    # generate the correct x locations.
    x_values = range(start_x, end_x, STEP_SIZE)
    # generate the y values. We could be getting these values from a sensor.
    y_values = []
    for j in x_values:
        y_values.append(random())
    # interleave x and y: [x0, y0, x1, y1, ...]
    data = []
    for i, j in zip(x_values, y_values):
        data.extend([i, j])
    return data
This works just fine, however if I have let's say 20 of those line strips that span the entire screen, then things slow down considerably.
Therefore my questions:
1) should I use glMapBuffer to bind the buffer on the GPU and fill the data directly (instead of using glBufferSubData)? Or will this make no difference performance wise?
2) should I use a shader for moving objects (here line strip) instead of calling glTranslatef? If so, how would such a shader look like? (I suspect that a shader is the wrong way to go, since my line strip is NOT a period function but rather contains random data).
3) what happens if the window get's resized? how do I keep aspect ratio and scale vertices accordingly? glViewport() only helps scaling in y direction, not in x direction. If the window is rescaled in x-direction, then in my current implementation I would have to recalculate the position of the entire line strip (calling my_func to get the new x coordinates) and upload it to the GPU. I guess this could be done more elegantly? How would I do that?
4) I noticed that when I use glTranslatef with a non integral value, the screen starts to flicker if the line strip consists of thousands of points. This is most probably because the fine resolution that I use to calculate the line strip does not match the pixel resolution of the screen and therefore sometimes some points appear in front and sometimes behind other points (this is particularly annoying when you don't render a sine wave but some 'random' data). How can I prevent this from happening (besides the obvious solution of translating by a integer multiple of 1 pixel)? If a window get re-sized from let's say originally 800x800 pixels to 100x100 pixels and I still want to visualize a line strip of 20 seconds, then shifting in x direction must work flicker free somehow with sub pixel precision, right?
5) as you can see I always call glTranslatef(-local_shift, 0.0, 0.0) - without ever doing the opposite. Therefore I keep shifting the entire view to the right. And that's why I need to keep track of the absolute x position (in order to place new data at the correct location). This problem will eventually lead to an artifact, where the line is overlapping with the edges of the window. I guess there must be a better way for doing this, right? Like keeping the x values fixed and just moving & updating the y values?
EDIT I've removed the sine wave example and replaced it with a better example. My question is generally about how to move line strips in space most efficiently (while adding new values to them). Therefore any suggestions like "precompute the values for t -> infinity" don't help here (I could also just be drawing the current temperature measured in front of my house).
EDIT2
Consider this toy example where after each time step, the first point is removed and a new one is added to the end:
t = 0
*
* * *
* **** *
1234567890
t = 1
*
* * * *
**** *
2345678901
t = 2
* *
* * *
**** *
3456789012
I don't think I can use a shader here, can I?
EDIT 3: example with two line strips.
EDIT 4: based on Tim's answer I'm using now the following code, which works nicely, but breaks the line into two (since I have two calls of glDrawArrays), see also the following two screenshots.
# calculate the difference
diff_first = x[1] - x[0]
''' first part of the line '''
# push the matrix
glPushMatrix()
move_to = -(diff_first * c)
print 'going to %d ' % (move_to)
glTranslatef(move_to, 0, 0)
# format of glVertexPointer: nbr points per vertex, data type, stride, byte offset
# calculate the offset into the Vertex
offset_bytes = c * BYTES_PER_POINT
stride = 0
glVertexPointer(2, GL_FLOAT, stride, offset_bytes)
# format of glDrawArrays: mode, Specifies the starting index in the enabled arrays, nbr of points
nbr_points_to_render = (nbr_points - c)
starting_point_in_above_selected_Vertex = 0
glDrawArrays(GL_POINTS, starting_point_in_above_selected_Vertex, nbr_points_to_render)
# pop the matrix
glPopMatrix()
''' second part of the line '''
# push the matrix
glPushMatrix()
move_to = (nbr_points - c) * diff_first
print 'moving to %d ' %(move_to)
glTranslatef(move_to, 0, 0)
# select the vertex
offset_bytes = 0
stride = 0
glVertexPointer(2, GL_FLOAT, stride, offset_bytes)
# draw the line
nbr_points_to_render = c
starting_point_in_above_selected_Vertex = 0
glDrawArrays(GL_POINTS, starting_point_in_above_selected_Vertex, nbr_points_to_render)
# pop the matrix
glPopMatrix()
# update counter
c += 1
if c == nbr_points:
    c = 0
EDIT5 the resulting solution must obviously render one line across the screen - and no two lines that are missing a connection. The circular buffer solution by Tim provides a solution on how to move the plot, but I end up with two lines, instead of one.
Here's my thoughts to the revised question:
1) should I use glMapBuffer to bind the buffer on the GPU and fill the
data directly (instead of using glBufferSubData)? Or will this make no
difference performance wise?
I'm not aware of any significant performance difference between the two, though I would probably prefer glBufferSubData.
What I might suggest in your case is to create a VBO with N floats, and then use it similar to a circular buffer. Keep an index locally to where the 'end' of the buffer is, then every update replace the value under 'end' with the new value, and increment the pointer. This way you only have to update a single float each cycle.
Having done that, you can draw this buffer using 2x translates and 2x glDrawArrays/Elements:
Imagine that you've got an array of 10 elements, and the buffer end pointer is at element 4. Your array will contain the following 10 values, where x is a constant value, and f(n-d) is the random sample from d cycles ago:
0: (0, f(n-4) )
1: (1, f(n-3) )
2: (2, f(n-2) )
3: (3, f(n-1) )
4: (4, f(n) ) <-- end of buffer
5: (5, f(n-9) ) <-- start of buffer
6: (6, f(n-8) )
7: (7, f(n-7) )
8: (8, f(n-6) )
9: (9, f(n-5) )
To draw this (pseudo-guess code, might not be exactly correct):
glPushMatrix();
glTranslatef(-(end + 1), 0, 0);
glDrawArrays(GL_LINE_STRIP, end + 1, 10 - (end + 1)); // draw elems 5-9 shifted left by 5
glPopMatrix();
glPushMatrix();
glTranslatef(10 - (end + 1), 0, 0);
glDrawArrays(GL_LINE_STRIP, 0, end + 1); // draw elems 0-4 shifted right by 5
glPopMatrix();
Then in the next cycle, replace the oldest value with the new random value,and shift the circular buffer pointer forward.
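The index arithmetic for the two draw calls is easy to get wrong by one, so here is a minimal sketch of it as plain C++ with no OpenGL calls. The names `DrawRange`, `RingPlan` and `plan_ring_draw` are hypothetical, and the sketch assumes vertex i is stored at x == i, as in the 10-element example above.

```cpp
#include <cassert>

// One glDrawArrays-style range plus the x translation to apply before it.
struct DrawRange {
    int first;    // index of the first vertex to draw
    int count;    // number of vertices to draw
    int x_shift;  // translation in x, in units of one sample
};

struct RingPlan {
    DrawRange older;  // samples stored after the write position (oldest data)
    DrawRange newer;  // samples stored up to and including the write position
};

// Hypothetical helper. `size` is the number of vertices in the buffer and
// `end` is the index of the most recently written sample.
RingPlan plan_ring_draw(int size, int end) {
    assert(size > 0 && end >= 0 && end < size);
    const int split = end + 1;  // index of the oldest sample (if any follow `end`)
    return RingPlan{
        DrawRange{split, size - split, -split},  // moves the oldest sample to x == 0
        DrawRange{0, split, size - split}        // continues right after the older part
    };
}
```

For a 10-element buffer with `end == 4` this gives the older range (first 5, count 5, shift -5) and the newer range (first 0, count 5, shift +5). A range with a count of 0 should simply be skipped. Because the two ranges are drawn as separate strips, the segment joining them is not drawn, which is the gap described in the question's later edit.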
2) should I use a shader for moving objects (here line strip) instead
of calling glTranslatef? If so, how would such a shader look like? (I
suspect that a shader is the wrong way to go, since my line strip is
NOT a period function but rather contains random data).
Probably optional, if you use the method that I've described in #1. There's not a particular advantage to using one here.
3) what happens if the window get's resized? how do I keep aspect
ratio and scale vertices accordingly? glViewport() only helps scaling
in y direction, not in x direction. If the window is rescaled in
x-direction, then in my current implementation I would have to
recalculate the position of the entire line strip (calling my_func to
get the new x coordinates) and upload it to the GPU. I guess this
could be done more elegantly? How would I do that?
You shouldn't have to recalculate any data. Just define all your data in some fixed coordinate system that makes sense to you, and then use a projection matrix to map this range to the window. Without more specifics it's hard to answer.
4) I noticed that when I use glTranslatef with a non integral value,
the screen starts to flicker if the line strip consists of thousands
of points. This is most probably because the fine resolution that I
use to calculate the line strip does not match the pixel resolution of
the screen and therefore sometimes some points appear in front and
sometimes behind other points (this is particularly annoying when you
don't render a sine wave but some 'random' data). How can I prevent
this from happening (besides the obvious solution of translating by a
integer multiple of 1 pixel)? If a window get re-sized from let's say
originally 800x800 pixels to 100x100 pixels and I still want to
visualize a line strip of 20 seconds, then shifting in x direction
must work flicker free somehow with sub pixel precision, right?
Your assumption seems correct. I think the thing to do here would be either to enable some kind of antialiasing (you can read other posts for how to do that) or to make the lines wider.
There are a number of things that could be at work here.
glBindBuffer is one of the slowest OpenGL operations (along with similar call for shaders, textures, etc.)
glTranslate adjusts the modelview matrix, which the vertex unit multiplies all points by. So, it simply changes what matrix you multiply by. If you were to instead use a vertex shader, then you'd have to translate it for each vertex individually. In short: glTranslate is faster. In practice, this shouldn't matter too much, though.
If you're recalculating the sine function on a lot of points every time you draw, you're going to have performance issues (especially since, by looking at your source, it looks like you might be using Python).
You're updating your VBO every time you draw it, so it's not any faster than a vertex array. Vertex arrays are faster than immediate mode (glVertex, etc.) but nowhere near as fast as display lists or static VBOs.
There could be coding errors or redundant calls somewhere.
My verdict:
You're calculating a sine wave and an offset on the CPU. I strongly suspect that most of your overhead comes from calculating and uploading different data every time you draw it. This is coupled with unnecessary OpenGL calls and possibly unnecessary local calls.
My recommendation:
This is an opportunity for the GPU to shine. Calculating function values on parallel data is (literally) what the GPU does best.
I suggest you make a display list representing your function, but set all the y-coordinates to 0 (so it's a series of points all along the line y=0). Then, draw this exact same display list once for every sine wave you want to draw. Ordinarily, this would just produce a flat graph, but, you write a vertex shader that transforms the points vertically into your sine wave. The shader takes a uniform for the sine wave's offset ("sin(x-offset)"), and just changes each vertex's y.
I estimate this will make your code at least ten times faster. Furthermore, because the vertices' x coordinates are all at integral points (the shader does the "translation" in the function's space by computing "sin(x-offset)"), you won't experience jittering when offsetting with floating point values.
You've got a lot here, so I'll cover what I can. Hopefully this will give you some areas to research.
1) should I use glMapBuffer to bind the buffer on the GPU and fill the data directly (instead of using glBufferSubData)? Or will this make no difference performance wise?
I would expect glBufferSubData to have better performance. If the data is stored on the GPU then mapping it will either
Copy the data back into host memory so you can modify it, and then copy it back when you unmap it.
or, give you a pointer to the GPU's memory directly which the CPU will access over PCI-Express. This isn't anywhere near as slow as it used to be to access GPU memory when we were on AGP or PCI, but it's still slower and not as well cached, etc, as host memory.
glBufferSubData will send the update of the buffer to the GPU and it will modify the buffer. No copying back and forth. All data is transferred in one burst. It should be able to do it as an asynchronous update of the buffer as well.
Once you get into "is this faster than that?" type comparisons you need to start measuring how long things take. A simple frame timer is normally sufficient (but report time per frame, not frames per second - it makes numbers easier to compare). If you go finer-grained than that, just be aware that because of the asynchronous nature of OpenGL, you often see time being consumed away from the call that caused the work. This is because after you give the GPU a load of work, it's only when you have to wait for it to finish something that you notice how long it's taking. That normally only happens when you're waiting for front/back buffers to swap.
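A simple frame timer of the kind mentioned above might look like the following. This is a sketch only: `FrameTimer` and `average_ms` are hypothetical names, and it measures CPU-side wall-clock time between calls, so the asynchronous behaviour described above still applies.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: measures the wall-clock time between successive calls
// to tick(), reported in milliseconds per frame rather than frames per second.
class FrameTimer {
    using Clock = std::chrono::steady_clock;

public:
    FrameTimer() : last_(Clock::now()) {}

    // Returns the milliseconds elapsed since the previous tick (or construction).
    double tick() {
        const Clock::time_point now = Clock::now();
        const std::chrono::duration<double, std::milli> elapsed = now - last_;
        last_ = now;
        return elapsed.count();
    }

private:
    Clock::time_point last_;
};

// Average time per frame over a run of frames, in milliseconds.
double average_ms(double total_ms, int frames) {
    assert(frames > 0);
    return total_ms / frames;
}
```

One way to use it is to call `tick()` once per frame, right after the buffer swap, sum the results over a few hundred frames, and compare the averages of two variants.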
2) should I use a shader for moving objects (here line strip) instead of calling glTranslatef? If so, how would such a shader look like?
No difference. glTranslate modifies a matrix (normally the Model-View) which is then applied to all vertices. If you have a shader you'd apply a translation matrix to all your vertices. In fact the driver is probably building a small shader for you already.
Be aware that the older APIs like glTranslate() are deprecated from OpenGL 3.0 onwards, and in modern OpenGL everything is done with shaders.
3) what happens if the window get's resized? how do I keep aspect ratio and scale vertices accordingly? glViewport() only helps scaling in y direction, not in x direction.
glViewport() sets the size and shape of the screen area that is rendered to. Quite often it's called on window resizing to set the viewport to the size and shape of the window. Doing just this will cause any image rendered by OpenGL to change aspect ratio with the window. To keep things looking the same you also have to control the projection matrix to counteract the effect of changing the viewport.
Something along the lines of:
glViewport(0, 0, width, height);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glScalef(1.0f, (float)width / (float)height, 1.0f); // Keeps X scale the same, but scales Y to compensate for aspect ratio
That's written from memory, and I might not have the maths right, but hopefully you get the idea.
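The arithmetic behind that compensation can be checked without any OpenGL calls. The sketch below is an assumption-laden illustration: `OrthoBounds` and `bounds_for_window` are hypothetical names, and x is assumed to stay fixed at [-1, 1] while the visible y range is adjusted so that one unit covers the same number of pixels in both directions.

```cpp
#include <cassert>

// Visible coordinate range, e.g. for glOrtho(left, right, bottom, top, -1, 1).
struct OrthoBounds {
    double left, right, bottom, top;
};

// Hypothetical helper: keeps x fixed at [-1, 1] and picks the y range so that
// one unit in x and one unit in y cover the same number of pixels.
OrthoBounds bounds_for_window(int width, int height) {
    assert(width > 0 && height > 0);
    // x spans 2 units over `width` pixels, so y must span 2 * height / width
    // units over `height` pixels to match.
    const double half_height = static_cast<double>(height) / width;
    return OrthoBounds{-1.0, 1.0, -half_height, half_height};
}
```

For an 800x400 window this yields a y range of [-0.5, 0.5], which is the same mapping as scaling y by width / height on an identity projection.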
4) I noticed that when I use glTranslatef with a non integral value, the screen starts to flicker if the line strip consists of thousands of points.
I think you're seeing a form of aliasing which is due to the lines moving under the sampling grid of the pixels. There are various anti-aliasing techniques you can use to reduce the problem. OpenGL has anti-aliased lines (glEnable(GL_LINE_SMOOTH)), but a lot of consumer cards didn't support it, or only did it in software. You can try it, but you may see no effect or it may run very slowly.
Alternatively you can look into Multi-sample anti-aliasing (MSAA), or other types that your card may support through extensions.
Another option is rendering to a high resolution texture (via Frame Buffer Objects - FBOs) and then filtering it down when you render it to the screen as a textured quad. This would also allow you to do a trick where you move the rendered texture slightly to the left each time, and rendered the new strip on the right each frame.
1 1
1 1 1 Frame 1
11
1
1 1 1 Frame 1 is copied left, and a new line segment is added to make frame 2
11 2
1
1 1 3 Frame 2 is copied left, and a new line segment is added to make frame 3
11 2
It's not a simple change, but it might help you out with your problem (5).

openGL Creating texture atlas at run time?

So I've set up my framework in a neat little system to wrap SDL, openGL and box2D all together for a 2D game.
Now how it works is that I create an object of "GameObject" class, specify a "source PNG", and then it automatically creates an openGL texture and a box2d body of the same dimensions.
Now I am worried about if I start needing to render many different textures on screen.
Is it possible to load in all my sprite sheets at run time, and then group them all together into one texture? If so, how? And what would be a good way to implement it (so that I wouldn't have to manually be specifying any parameters or anything).
The reason I want to do it at run time and not pre-done is so that I can easily load together all (or most) of the tiles, enemies etc.. of a certain level into this one texture, because every level won't have the same enemies. It'd also make the whole creating art process easier.
There are likely some libraries that already exist for creating texture atlases (optimal packing is a nontrivial problem) and converting old texture coordinates to the new ones.
However, if you want to do it yourself, you probably would do something like this:
Load all textures from disk (your "source PNG") and retrieve the raw pixel data buffer,
If necessary, convert all source textures into the same pixel format,
Create a new texture big enough to hold all the existing textures, along with a corresponding buffer to hold the pixel data
"Blit" the pixel data from the source images into the new buffer at a given offset (see below)
Create a texture as normal using the new buffer's data.
While doing this, determine the mapping from "old" texture coordinates into the "new" texture coordinates (should be a simple matter of recording the offsets for each element of the texture atlas and doing a quick transform). It would probably also be pretty easy to do it inside a pixel shader, but some profiling would be required to see if the overhead of passing the extra parameters is worth it.
Obviously you also want to check to make sure you are not doing something silly like loading the same texture into the atlas twice, but that's a concern that's outside this procedure.
To "blit" (copy) from the source image to the target image you'd do something like this (assuming you're copying a 128x128 texture into a 512x512 atlas texture, starting at (128, 0) on the target):
unsigned char* source = new unsigned char[ 128 * 128 * 4 ]; // in reality, comes from your texture loader
unsigned char* target = new unsigned char[ 512 * 512 * 4 ];
int targetX = 128;
int targetY = 0;
for(int sourceY = 0; sourceY < 128; ++sourceY) {
    for(int sourceX = 0; sourceX < 128; ++sourceX) {
        int from = (sourceY * 128 * 4) + (sourceX * 4); // 4 bytes per pixel (assuming RGBA)
        int to = ((targetY + sourceY) * 512 * 4) + ((targetX + sourceX) * 4); // same format as source
        for(int channel = 0; channel < 4; ++channel) {
            target[to + channel] = source[from + channel];
        }
    }
}
This is a very simple brute force implementation: there are much faster, more succinct and more clever ways to copy an array, but the idea is that you are basically copying the contents of the source texture into the target texture at a given X and Y offset. In the end, you will have created a new texture which contains the old textures in it.
If the indexing math doesn't make sense to you, think about how a 2D array is actually indexed inside a 1D space (such as computer memory).
Please forgive any bugs. This isn't production code but instead something I wrote without checking if it compiles or runs.
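As one example of a more succinct copy, each source row is contiguous in memory, so it can be copied with a single memcpy instead of a per-channel loop. The sketch below assumes tightly packed RGBA data with no row padding; `blit_rgba` and its parameters are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies a tightly packed RGBA image into an RGBA atlas
// so that its top-left pixel lands at (dst_x, dst_y) in the atlas.
void blit_rgba(const std::vector<unsigned char>& src, int src_w, int src_h,
               std::vector<unsigned char>& atlas, int atlas_w, int atlas_h,
               int dst_x, int dst_y) {
    const std::size_t bpp = 4;  // bytes per pixel (RGBA)
    assert(src_w > 0 && src_h > 0 && atlas_w > 0 && atlas_h > 0);
    assert(src.size() == bpp * src_w * src_h);
    assert(atlas.size() == bpp * atlas_w * atlas_h);
    assert(dst_x >= 0 && dst_y >= 0);
    assert(dst_x + src_w <= atlas_w && dst_y + src_h <= atlas_h);

    for (int row = 0; row < src_h; ++row) {
        const unsigned char* from = src.data() + bpp * src_w * row;
        unsigned char* to =
            atlas.data() + bpp * (static_cast<std::size_t>(dst_y + row) * atlas_w + dst_x);
        std::memcpy(to, from, bpp * src_w);  // one contiguous row at a time
    }
}
```

The per-sprite offsets passed as `dst_x` and `dst_y` are the same values you would record to remap the old texture coordinates into the atlas.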
Since you're using SDL, I should mention that it has a nice function that might be able to help you: SDL_BlitSurface. You can create an SDL_Surface entirely within SDL and simply use SDL_BlitSurface to copy your source surfaces into it, then convert the atlas surface into a GL texture.
It will take care of all the math, and can also do a format conversion for you on the fly.