JOGL glTexSubImage2D eats up to 100% CPU and takes ages

I've run into an issue where a single gl.glTexSubImage2D() call takes 0.1-0.2 s on Linux while eating 100% of the CPU. On macOS everything is fine.
The call arguments are the following:
gl.glTexSubImage2D(GL.GL_TEXTURE_2D, 0, 0, 0, 1920, 1080, GL2.GL_RED, GL2.GL_UNSIGNED_SHORT, data);
The texture setup is the following:
void glCreateClearTex(GL gl, int target, int fmt, int format, int type, int filter, int w, int h, int val) {
    float fval = 0;
    int stride;
    if (w == 0)
        w = 1;
    if (h == 0)
        h = 1;
    stride = w * 2; // 2 bytes per texel for GL_RED / GL_UNSIGNED_SHORT
    ByteBuffer init = ByteBuffer.allocateDirect(stride * h);
    glAdjustAlignment(gl, stride);
    gl.glPixelStorei(GL2.GL_UNPACK_ROW_LENGTH, w);
    gl.glTexImage2D(target, 0, fmt, w, h, 0, format, type, init);
    gl.glTexParameterf(target, GL2.GL_TEXTURE_PRIORITY, 1.0f);
    gl.glTexParameteri(target, GL2.GL_TEXTURE_MIN_FILTER, GL2.GL_LINEAR);
    gl.glTexParameteri(target, GL2.GL_TEXTURE_MAG_FILTER, GL2.GL_LINEAR);
    gl.glTexParameteri(target, GL2.GL_TEXTURE_WRAP_S, GL2.GL_CLAMP_TO_EDGE);
    gl.glTexParameteri(target, GL2.GL_TEXTURE_WRAP_T, GL2.GL_CLAMP_TO_EDGE);
    gl.glTexParameterfv(target, GL2.GL_TEXTURE_BORDER_COLOR, FloatBuffer.wrap(new float[] { fval, fval, fval, fval }));
}
MPlayer, doing the same work natively, runs just fine. glxgears runs OK but also takes 100% CPU. This may be a sign of OpenGL setup issues, but glxinfo and other tools report hardware rendering. The graphics card is an ATI FirePro.

I found the issue. JOGL has two variants of gl.glTexSubImage2D(): one takes a data pointer and uploads the data to a PBO and then to the GPU; the other takes an offset into an already-filled PBO. My mistake was that I uploaded the data twice, which somehow caused a major slowdown on Linux.
So the fix is to upload the data to the PBO first, and then upload it to the GPU with gl.glTexSubImage2D() using an offset into the PBO.
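For reference, the fixed upload path looks roughly like this, shown as C-style GL calls that map one-to-one onto the JOGL methods (pboId, frameBytes, and data are illustrative names, not from the original code):

```cpp
// 1) Stage the frame into a pixel-unpack PBO.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboId);
glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, frameBytes, data);
// 2) Upload from the PBO to the texture. With a PBO bound, the last
//    argument of glTexSubImage2D is interpreted as a byte OFFSET into
//    the PBO, not a client-memory pointer -- passing the data pointer
//    again is what caused the double upload.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1920, 1080,
                GL_RED, GL_UNSIGNED_SHORT, (const GLvoid *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```

In JOGL this corresponds to the glTexSubImage2D overload that takes a long offset instead of a Buffer.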

Related

How to split a big image into multiple textures

Let's assume I have a big image whose size is 2560*800 and whose format is RGBA.
I'd like to split this big image into 2 textures, each 1280*800.
The simple, but stupid, way is:
#define BPP_RGBA 4
int *makeNTexturesFromBigRgbImage(uint8_t *srcImg,
                                  Size srcSize,
                                  uint32_t numTextures,
                                  uint32_t texWidth,
                                  uint32_t texHeight)
{
    uint32_t i, h;
    int srcStride = srcSize.w * BPP_RGBA;
    int *texIds = (int *)malloc(numTextures * sizeof(int));
    glGenTextures(numTextures, (GLuint *)texIds);
    for (i = 0; i < numTextures; i++) {
        uint8_t *subImageBuf = (uint8_t *)malloc(texWidth * texHeight * BPP_RGBA);
        uint8_t *pSrcPos = srcImg + (texWidth * BPP_RGBA) * i;
        uint8_t *pDstPos = subImageBuf;
        for (h = 0; h < texHeight; h++) {
            memcpy(pDstPos, pSrcPos, texWidth * BPP_RGBA);
            pSrcPos += srcStride;
            pDstPos += texWidth * BPP_RGBA;
        }
        glBindTexture(GL_TEXTURE_2D, texIds[i]);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, texWidth, texHeight, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, subImageBuf);
        free(subImageBuf);
    }
    return texIds;
}
But, as I mentioned above, this approach is very stupid.
So I'd like to know the best way, one that does NOT require a copy operation on the CPU like the above.
Is it possible with only OpenGL APIs?
For example, is this possible?
Step 1. Make one 2560*800 texture from the big image.
Step 2. Make two 1280*800 textures from the texture in step 1.
Thanks.
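One way to avoid the CPU-side row copies is to let the pixel-store unpack state do the sub-rectangle addressing during upload: tell GL how wide the rows of the big image really are, and where each texture's rows start. A hedged sketch, reusing the variable names from the code above (GL_UNPACK_ROW_LENGTH requires desktop GL or OpenGL ES 3.0+):

```cpp
// Rows in client memory are srcSize.w pixels apart; texture i's rows
// start texWidth * i pixels into each row of the big image.
glPixelStorei(GL_UNPACK_ROW_LENGTH, srcSize.w);
for (uint32_t i = 0; i < numTextures; i++) {
    glPixelStorei(GL_UNPACK_SKIP_PIXELS, texWidth * i);
    glBindTexture(GL_TEXTURE_2D, texIds[i]);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, texWidth, texHeight, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, srcImg);
}
// Restore unpack defaults.
glPixelStorei(GL_UNPACK_SKIP_PIXELS, 0);
glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
```

Texture-to-texture splitting (step 1 then step 2) is only directly possible with glCopyImageSubData, which needs GL 4.3 / ARB_copy_image; an alternative that avoids any extra copy is to upload the whole image once and render each half by adjusting texture coordinates.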

Halide with GPU (OpenGL) as Target - benchmarking and using HalideRuntimeOpenGL.h

I am new to Halide. I have been playing around with the tutorials to get a feel for the language. Now, I am writing a small demo app to run from command line on OSX.
My goal is to perform a pixel-by-pixel operation on an image, schedule it on the GPU and measure the performance. I have tried a couple things which I want to share here and have a few questions about the next steps.
First approach
I scheduled the algorithm on the GPU with OpenGL as the Target, but because I could not access GPU memory to write the result to a file, I copied the output back to the CPU inside the Halide routine by creating a Func cpu_out, similar to the glsl sample app in the Halide repo.
pixel_operation_cpu_out.cpp
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
const int _number_of_channels = 4;
int main(int argc, char** argv)
{
ImageParam input8(UInt(8), 3);
input8
.set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels
.set_stride(2, 1); // stride in dimension 2 (c) is one
Var x("x"), y("y"), c("c");
// algorithm
Func input;
input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
clamp(y, input8.top(), input8.bottom()),
clamp(c, 0, _number_of_channels))) / 255.0f;
Func pixel_operation;
// calculate the corresponding value for input(x, y, c) after doing a
// pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
// This operation is not location dependent, eg: brighten
Func out;
out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
out.output_buffer().set_bounds(2, 0, _number_of_channels);
// schedule
out.compute_root();
out.reorder(c, x, y)
.bound(c, 0, _number_of_channels)
.unroll(c);
// Schedule for GLSL
out.glsl(x, y, c);
Target target = get_target_from_environment();
target.set_feature(Target::OpenGL);
// create a cpu_out Func to copy over the data in Func out from GPU to CPU
std::vector<Argument> args = {input8};
Func cpu_out;
cpu_out(x, y, c) = out(x, y, c);
cpu_out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
cpu_out.output_buffer().set_bounds(2, 0, _number_of_channels);
cpu_out.compile_to_file("pixel_operation_cpu_out", args, target);
return 0;
}
Since I compile this AOT, I make a function call in my main() for it. main() resides in another file.
main_file.cpp
Note: the Image class used here is the same as the one in this Halide sample app
int main()
{
char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);
Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
input.buf.host = &pixelsRGBA[0];
unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
output.buf.host = &outputPixelsRGBA[0];
double best = benchmark(100, 10, [&]() {
pixel_operation_cpu_out(&input.buf, &output.buf);
});
char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
}
This works just fine and gives me the output I expect. From what I understand, cpu_out makes the values in out available on the CPU memory, which is why I am able to access these values by accessing output.buf.host in main_file.cpp
Second approach:
The second thing I tried was to not do the device-to-host copy in the Halide schedule (no Func cpu_out), and instead to call the halide_copy_to_host function in main_file.cpp.
pixel_operation_gpu_out.cpp
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
const int _number_of_channels = 4;
int main(int argc, char** argv)
{
ImageParam input8(UInt(8), 3);
input8
.set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels
.set_stride(2, 1); // stride in dimension 2 (c) is one
Var x("x"), y("y"), c("c");
// algorithm
Func input;
input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
clamp(y, input8.top(), input8.bottom()),
clamp(c, 0, _number_of_channels))) / 255.0f;
Func pixel_operation;
// calculate the corresponding value for input(x, y, c) after doing a
// pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
// This operation is not location dependent, eg: brighten
Func out;
out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
out.output_buffer().set_bounds(2, 0, _number_of_channels);
// schedule
out.compute_root();
out.reorder(c, x, y)
.bound(c, 0, _number_of_channels)
.unroll(c);
// Schedule for GLSL
out.glsl(x, y, c);
Target target = get_target_from_environment();
target.set_feature(Target::OpenGL);
std::vector<Argument> args = {input8};
out.compile_to_file("pixel_operation_gpu_out", args, target);
return 0;
}
main_file.cpp
#include "pixel_operation_gpu_out.h"
#include "runtime/HalideRuntime.h"
int main()
{
char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);
Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
input.buf.host = &pixelsRGBA[0];
unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
output.buf.host = &outputPixelsRGBA[0];
double best = benchmark(100, 10, [&]() {
pixel_operation_gpu_out(&input.buf, &output.buf);
});
int status = halide_copy_to_host(NULL, &output.buf);
char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
return 0;
}
So, now, what I think is happening is that pixel_operation_gpu_out is keeping output.buf on the GPU and when I do copy_to_host, that's when I get the memory copied over to the CPU. This program gives me the expected output as well.
Questions:
The second approach is much slower than the first. The slow part is not in the benchmarked portion, though. For example, for the first approach I get 17 ms as the benchmarked time for a 4K image. For the same image in the second approach, the benchmarked time is 22 µs, and the time taken for copy_to_host is 10 s. I'm not sure if this behavior is expected, since both approaches are essentially doing the same thing.
The next thing I tried was to use HalideRuntimeOpenGL.h and link textures to the input and output buffers, so I could draw directly to an OpenGL context from main_file.cpp instead of saving to a JPEG file. However, I could find no examples showing how to use the functions in HalideRuntimeOpenGL.h, and whatever I tried on my own always gave me runtime errors I could not figure out how to solve. If anyone has resources they can point me to, that would be great.
Also, any feedback on the code above is welcome. I know it works and does what I want, but it could be completely the wrong way of doing it and I wouldn't know any better.
Most likely the reason it takes 10 s to copy the memory back is that the GPU API has queued all the kernel invocations and only waits for them to finish when halide_copy_to_host is called. You can call halide_device_sync inside the benchmark timing, after running all the compute calls, to get the compute time inside the loop without the copy-back time.
I cannot tell from the code how many times the kernel is being run. (My guess is 100, but those arguments to benchmark may set up some sort of parameterization where it runs as many times as needed to reach significance. If so, that is a problem, because the queuing call is really fast while the compute is of course asynchronous. In that case, you can queue, say, ten calls, then call halide_device_sync, and play with the number "10" to get a real picture of how long the compute takes.)
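A minimal sketch of that suggestion, assuming the AOT pipeline and buffers from the question (halide_device_sync is declared in HalideRuntime.h; error handling omitted):

```cpp
double best = benchmark(100, 10, [&]() {
    // Queue the GPU work...
    pixel_operation_gpu_out(&input.buf, &output.buf);
    // ...and block until it has actually finished, so the timing
    // covers the compute, not just the asynchronous enqueue.
    halide_device_sync(NULL, &output.buf);
});
```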

Create a video from OpenGL rendered frames

I am trying to create a video file from an animation I have in OpenGL.
I have been reading on how to do that and to my understanding there are two options:
Save each rendered frame in OpenGL to an image file and then create a video file from those
Get the frame data using glReadPixels() and on the fly write those to a video file
The second approach is what I believe would work best for me, however, I cannot find info on how to achieve the second part (write to a video file).
Can anyone point me out to some web sites where I learn how to do that? What kind of libraries are out there that I can use to encode(?) a video from the frames I am rendering in OpenGL?
EDIT
After searching a bit more, I believe ffmpeg is the way to go. I found a blog post with code that apparently works on Windows.
I downloaded ffmpeg from the website so that I could execute the command just as in the example. Unfortunately, my application crashes and no video is created. I checked the file pointer, and it is not valid, so I believe the error comes from the call to popen.
I am passing exactly the same arguments as the command, but still get no valid file pointer. Any idea what could be happening?
The thing is, I don't want to spend much time coding the video encoding since I have other projects to work on.
Since I couldn't use ffmpeg directly from my C++ code, a possible solution is as follows. In Qt5 you have the function paintGL, where you update the frame to be rendered. After that, get the pixels with glReadPixels and save the frame as a PNG image using QImage:
void OpenGLViewer::paintGL()
{
    // Clear screen
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // Update array attached to OpenGL
    glDrawArrays(GL_POINTS, 0, _points);
    update();
    glReadPixels(0, 0, this->width(), this->height(), GL_RGBA, GL_UNSIGNED_BYTE, _buffer);
    std::stringstream name;
    name << "Frame" << _frame++ << ".png";
    QString filename(name.str().c_str());
    QImage imagen(_buffer, this->width(), this->height(), QImage::Format_ARGB32);
    imagen.save(filename, "PNG");
}
This will leave a bunch of images in your working directory that you can encode in a video using the following command from the console
ffmpeg -framerate 30 -start_number 0 -i Frame%d.png -vcodec mpeg4 -vf vflip test.avi
I still have to check why the colors are inverted, but for now this works fine, since the animation is the important thing, not the colors.
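Two gotchas are worth noting here. glReadPixels returns rows bottom-up, which is why the ffmpeg command above needs -vf vflip; and QImage::Format_ARGB32 stores pixels as 0xAARRGGBB (B,G,R,A byte order on little-endian machines) while GL_RGBA delivers R,G,B,A bytes, which is the likely cause of the "inverted" colors. A self-contained helper (with an assumed name, not from the Qt code above) that fixes both in place:

```cpp
#include <algorithm>
#include <cstdint>

// Convert an RGBA byte buffer (as returned by glReadPixels) in place so it
// matches QImage::Format_ARGB32 on a little-endian machine (B,G,R,A bytes),
// then flip the rows vertically so row 0 is the top of the image.
void rgbaToArgb32AndFlip(uint8_t *px, int w, int h) {
    for (int i = 0; i < w * h; ++i)
        std::swap(px[i * 4 + 0], px[i * 4 + 2]);  // swap R and B
    const int stride = w * 4;
    for (int top = 0, bot = h - 1; top < bot; ++top, --bot)
        std::swap_ranges(px + top * stride, px + top * stride + stride,
                         px + bot * stride);
}
```

With this applied to _buffer before constructing the QImage, the -vf vflip step becomes unnecessary.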
You can do it this way after installing libpng:
uint8_t *pixels = new uint8_t[w * h * 3];
glReadPixels(0, 0, w, h, GL_RGB, GL_UNSIGNED_BYTE, (GLvoid *)pixels);
// Flip the image vertically in place: glReadPixels returns rows bottom-up.
for (int j = 0; j * 2 < h; ++j) {
    int x = j * w * 3;
    int y = (h - 1 - j) * w * 3;
    for (int i = w * 3; i > 0; --i) {
        uint8_t tmp = pixels[x];
        pixels[x] = pixels[y];
        pixels[y] = tmp;
        ++x;
        ++y;
    }
}
png_structp png = png_create_write_struct(PNG_LIBPNG_VER_STRING, nullptr, nullptr, nullptr);
if (!png)
    return false;
png_infop info = png_create_info_struct(png);
if (!info) {
    png_destroy_write_struct(&png, &info);
    return false;
}
std::string s = "IMAGE/" + std::string(filename);
FILE *fp = fopen(s.c_str(), "wb");
if (!fp) {
    png_destroy_write_struct(&png, &info);
    return false;
}
png_init_io(png, fp);
png_set_IHDR(png, info, w, h, 8, PNG_COLOR_TYPE_RGB, PNG_INTERLACE_NONE,
             PNG_COMPRESSION_TYPE_BASE, PNG_FILTER_TYPE_BASE);
png_colorp palette = (png_colorp)png_malloc(png, PNG_MAX_PALETTE_LENGTH * sizeof(png_color));
if (!palette) {
    fclose(fp);
    png_destroy_write_struct(&png, &info);
    return false;
}
png_set_PLTE(png, info, palette, PNG_MAX_PALETTE_LENGTH);
png_write_info(png, info);
png_set_packing(png);
png_bytepp rows = (png_bytepp)png_malloc(png, h * sizeof(png_bytep));
// The buffer was already flipped top-to-bottom above, so map the rows in order.
for (int i = 0; i < h; ++i)
    rows[i] = (png_bytep)(pixels + i * w * 3);
png_write_image(png, rows);
png_write_end(png, info);
png_free(png, rows);
png_free(png, palette);
png_destroy_write_struct(&png, &info);
fclose(fp);
delete[] pixels;

OpenGL + PBO + FBO + some ATI cards - color and pixel shifting

We are developing slide-show creation software that uses OpenGL.
We use FBO + PBO for fast data readback from the video card to RAM, but on some ATI video cards we have faced the following problems:
swapping of the RGB components
pixel shifting
There are no problems if we do not use PBOs.
We have also noticed that a 4:3 aspect ratio for the PBO/FBO solves the pixel-shifting problem.
Any thoughts or suggestions?
Here are more details:
ATI Radeon HD 3650
PBO code:
public bool PBO_Initialize(
int bgl_size_w,
int bgl_size_h)
{
PBO_Release();
if (mCSGL12Control1 != null)
{
GL mGL = mCSGL12Control1.GetGL();
mCSGL12Control1.wgl_MakeCurrent();
//
// check PBO is supported by your video card
if (mGL.bglGenBuffersARB == true &&
mGL.bglBindBufferARB == true &&
mGL.bglBufferDataARB == true &&
mGL.bglBufferSubDataARB == true &&
mGL.bglMapBufferARB == true &&
mGL.bglUnmapBufferARB == true &&
mGL.bglDeleteBuffersARB == true &&
mGL.bglGetBufferParameterivARB == true)
{
mGL.glGenBuffersARB(2, _pbo_imageBuffers);
int clientHeight1 = bgl_size_h / 2;
int clientHeight2 = bgl_size_h - clientHeight1;
int clientSize1 = bgl_size_w * clientHeight1 * 4;
int clientSize2 = bgl_size_w * clientHeight2 * 4;
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[0]);
mGL.glBufferDataARB(GL.GL_PIXEL_PACK_BUFFER_ARB, clientSize1, IntPtr.Zero,
GL.GL_STREAM_READ_ARB);
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[1]);
mGL.glBufferDataARB(GL.GL_PIXEL_PACK_BUFFER_ARB, clientSize2, IntPtr.Zero,
GL.GL_STREAM_READ_ARB);
return true;
}
}
return false;
}
...
PBO read data back to memory
int clientHeight1 = _bgl_size_h / 2;
int clientHeight2 = _bgl_size_h - clientHeight1;
int clientSize1 = _bgl_size_w * clientHeight1 * 4;
int clientSize2 = _bgl_size_w * clientHeight2 * 4;
//mGL.glPushAttrib(GL.GL_VIEWPORT_BIT | GL.GL_COLOR_BUFFER_BIT);
// Bind two different buffer objects and start the glReadPixels
// asynchronously. Each call will return directly after
// starting the DMA transfer.
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[0]);
mGL.glReadPixels(0, 0, _bgl_size_w, clientHeight1, imageFormat,
pixelTransferMethod, IntPtr.Zero);
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[1]);
mGL.glReadPixels(0, clientHeight1, _bgl_size_w, clientHeight2, imageFormat,
pixelTransferMethod, IntPtr.Zero);
//mGL.glPopAttrib();
mGL.glBindFramebufferEXT(GL.GL_FRAMEBUFFER_EXT, 0);
// Process partial images. Mapping the buffer waits for
// outstanding DMA transfers into the buffer to finish.
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[0]);
IntPtr pboMemory1 = mGL.glMapBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB,
GL.GL_READ_ONLY_ARB);
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[1]);
IntPtr pboMemory2 = mGL.glMapBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB,
GL.GL_READ_ONLY_ARB);
System.Runtime.InteropServices.Marshal.Copy(pboMemory1, _bgl_rgbaData_out, 0, clientSize1);
System.Runtime.InteropServices.Marshal.Copy(pboMemory2, _bgl_rgbaData_out, clientSize1, clientSize2);
// Unmap the image buffers
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[0]);
mGL.glUnmapBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB);
mGL.glBindBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB, _pbo_imageBuffers[1]);
mGL.glUnmapBufferARB(GL.GL_PIXEL_PACK_BUFFER_ARB);
FBO initialization
private static void FBO_Initialize(GL mGL,
ref int[] bgl_texture,
ref int[] bgl_framebuffer,
ref int[] bgl_renderbuffer,
ref byte[] bgl_rgbaData,
int bgl_size_w,
int bgl_size_h)
{
// Texture
mGL.glGenTextures(1, bgl_texture);
mGL.glBindTexture(GL.GL_TEXTURE_2D, bgl_texture[0]);
mGL.glTexParameteri(GL.GL_TEXTURE_2D, GL.GL_TEXTURE_MAG_FILTER, GL.GL_NEAREST);
mGL.glTexParameteri(GL.GL_TEXTURE_2D, GL.GL_TEXTURE_MIN_FILTER, GL.GL_NEAREST);
mGL.glTexParameteri(GL.GL_TEXTURE_2D, GL.GL_TEXTURE_WRAP_S, GL.GL_CLAMP_TO_EDGE);
mGL.glTexParameteri(GL.GL_TEXTURE_2D, GL.GL_TEXTURE_WRAP_T, GL.GL_CLAMP_TO_EDGE);
IntPtr null_ptr = new IntPtr(0);
// <null> means reserve texture memory, but texels are undefined
mGL.glTexImage2D(GL.GL_TEXTURE_2D, 0, GL.GL_RGBA, bgl_size_w, bgl_size_h, 0, GL.GL_RGBA, GL.GL_UNSIGNED_BYTE, null_ptr);
//
mGL.glGenFramebuffersEXT(1, bgl_framebuffer);
mGL.glBindFramebufferEXT(GL.GL_FRAMEBUFFER_EXT, bgl_framebuffer[0]);
mGL.glGenRenderbuffersEXT(1, bgl_renderbuffer);
mGL.glBindRenderbufferEXT(GL.GL_RENDERBUFFER_EXT, bgl_renderbuffer[0]);
mGL.glRenderbufferStorageEXT(GL.GL_RENDERBUFFER_EXT, GL.GL_DEPTH_COMPONENT24, bgl_size_w, bgl_size_h);
mGL.glFramebufferTexture2DEXT(GL.GL_FRAMEBUFFER_EXT, GL.GL_COLOR_ATTACHMENT0_EXT,
GL.GL_TEXTURE_2D, bgl_texture[0], 0);
mGL.glFramebufferRenderbufferEXT(GL.GL_FRAMEBUFFER_EXT, GL.GL_DEPTH_ATTACHMENT_EXT,
GL.GL_RENDERBUFFER_EXT, bgl_renderbuffer[0]);
// Errors?
int status = mGL.glCheckFramebufferStatusEXT(GL.GL_FRAMEBUFFER_EXT);
if (status != GL.GL_FRAMEBUFFER_COMPLETE_EXT || mGL.glGetError() != GL.GL_NO_ERROR)
{
mGL.glFramebufferTexture2DEXT(GL.GL_FRAMEBUFFER_EXT, GL.GL_COLOR_ATTACHMENT0_EXT,
GL.GL_TEXTURE_2D, 0, 0);
mGL.glFramebufferRenderbufferEXT(GL.GL_FRAMEBUFFER_EXT, GL.GL_DEPTH_ATTACHMENT_EXT,
GL.GL_RENDERBUFFER_EXT, 0);
mGL.glBindTexture(GL.GL_TEXTURE_2D, 0);
mGL.glDeleteTextures(1, bgl_texture);
mGL.glBindRenderbufferEXT(GL.GL_RENDERBUFFER_EXT, 0);
mGL.glDeleteRenderbuffersEXT(1, bgl_renderbuffer);
mGL.glBindFramebufferEXT(GL.GL_FRAMEBUFFER_EXT, 0);
mGL.glDeleteFramebuffersEXT(1, bgl_framebuffer);
throw new Exception("Bad framebuffer.");
}
mGL.glDrawBuffer(GL.GL_COLOR_ATTACHMENT0_EXT);
mGL.glReadBuffer(GL.GL_COLOR_ATTACHMENT0_EXT); // For glReadPixels()
mGL.glBindFramebufferEXT(GL.GL_FRAMEBUFFER_EXT, 0);
mGL.glDrawBuffer(GL.GL_BACK);
mGL.glReadBuffer(GL.GL_BACK);
mGL.glBindTexture(GL.GL_TEXTURE_2D, 0);
bgl_rgbaData = new byte[bgl_size_w * bgl_size_h * 4];
}
It seems that re-installing/updating the video driver solves this problem.
Really strange behaviour. (It may be that the official notebook driver is old/buggy; updating to the latest driver from AMD for this VGA-chip series seems to solve the problem. I'm also not sure the previous driver was set up correctly, hence "re-installing/updating".)
Thank you all for the help.

OpenGL Issue Drawing a Large Image Texture causing Skewing

I'm trying to store a 1365x768 image in a 2048x1024 texture in OpenGL ES, but the resulting image appears skewed when drawn. If I run the same 1365x768 image through gluScaleImage() and fit it onto the 2048x1024 texture, it looks fine when drawn, but that call is slow and hurts performance.
I'm doing this on an Android device (Motorola Milestone) which has 256 MB of memory. I'm not sure memory is a factor, though, since it works fine when scaled with gluScaleImage() (it's just slower).
Mapping smaller textures (854x480 onto 1024x512, for example) works fine. Does anyone know why this happens, and what I can do about it?
Update
Some code snippets to help understand context...
// uiImage is loaded. The texture dimensions are determined from upsizing the image
// dimensions to a power of two size:
// uiImage->_width = 1365
// uiImage->_height = 768
// width = 2048
// height = 1024
// Once the image is loaded:
// INT retval = gluScaleImage(GL_RGBA, uiImage->_width, uiImage->_height, GL_UNSIGNED_BYTE, uiImage->_texels, width, height, GL_UNSIGNED_BYTE, data);
copyImage(uiImage->_width, uiImage->_height, (const unsigned int*)uiImage->_texels, width, height, (unsigned int*)data);
if (pixelFormat == RGB565 || pixelFormat == RGBA4444)
{
unsigned char* tempData = NULL;
unsigned int* inPixel32;
unsigned short* outPixel16;
tempData = new unsigned char[height*width*2];
inPixel32 = (unsigned int*)data;
outPixel16 = (unsigned short*)tempData;
if(pixelFormat == RGB565)
{
// "RRRRRRRRGGGGGGGGBBBBBBBBAAAAAAAA" --> "RRRRRGGGGGGBBBBB"
for(unsigned int i = 0; i < numTexels; ++i, ++inPixel32)
{
*outPixel16++ = ((((*inPixel32 >> 0) & 0xFF) >> 3) << 11) |
((((*inPixel32 >> 8) & 0xFF) >> 2) << 5) |
((((*inPixel32 >> 16) & 0xFF) >> 3) << 0);
}
}
if(tempData != NULL)
{
delete [] data;
data = tempData;
}
}
// [snip..]
// Copy function (mostly)
static void copyImage(GLint widthin, GLint heightin, const unsigned int* datain, GLint widthout, GLint heightout, unsigned int* dataout)
{
    const unsigned int* p1 = datain;
    unsigned int* p2 = dataout;
    int rowBytes = widthin * sizeof(unsigned int);
    for (int i = 0; i < heightin; i++)
    {
        memcpy(p2, p1, rowBytes);
        p1 += widthin;
        p2 += widthout;
    }
}
In the render code, without changing my texture coordinates, I see the correct image when using gluScaleImage(), and a smaller image (requiring correction factors later) with the copyImage() code. This is what happens when the image is small (854x480, for example, works fine with copyImage()), but with the 1365x768 image the skewing appears.
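As a side note on the RGB565 path above: the pack expression assumes the 32-bit pixel holds R in its lowest byte, i.e. R,G,B,A byte order in memory on a little-endian machine. The per-pixel packing can be isolated and checked on its own (a sketch with an assumed helper name, not from the original code):

```cpp
#include <cstdint>

// Pack 8-bit R, G, B channels into one RGB565 value:
// RRRRRGGGGGGBBBBB (5 red bits, 6 green, 5 blue).
uint16_t packRGB565(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

If the channels come out swapped on some platform, the byte-order assumption is the first thing to check.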
Finally solved the issue. The first thing to know is the maximum texture size allowed by the device:
GLint texSize;
glGetIntegerv(GL_MAX_TEXTURE_SIZE, &texSize);
When I ran this, the maximum texture size for the Motorola Milestone was 2048x2048, which was fine in my case.
After messing with the texture mapping to no end, I finally decided to try opening and re-saving the image... and voilà, it suddenly began working. I don't know what was wrong with the format the original image was stored in, but as advice to anyone experiencing a similar problem: it might be worth looking at the image itself.