Write code which does not cause many page faults - C++

I have code for rendering an OpenGL scene. This code causes many page faults when started without Visual Studio. The code shown in paintGL() is only a fraction of what happens there, but it takes the most time.
Example code:
void prepareData() {
    std::vector<int> m_indices;    // vector of point indices that should be connected
    std::vector<float> m_vertices; // vector of the 3D points
    /*
    fill the vectors
    */
}
void MyGLWidget::paintGL() {
    glBegin(GL_TRIANGLE_STRIP);
    for (unsigned int i = 0; i < m_indices.size(); i++)
    {
        // search for the end of the strip
        if (m_indices[i] < 0)
        {
            // store end of strip
            endStrip = i;
            // we need at least three vertices for a triangle strip
            if (startStrip + 2 < endStrip)
            {
                // draw strip
                for (unsigned int j = startStrip; j < endStrip; j++) {
                    idx = 3 * m_indices[j];
                    glVertex3fv(&m_vertices[idx]);
                }
            }
            // store start of next strip
            startStrip = i + 1;
        }
    }
    glEnd();
}
So here is the problem: when the data changes and gets recalculated, the next call of paintGL() is very slow, because accessing the new values causes a lot of page faults.
When the data does not change, paintGL() is as fast as it should be.
Both data vectors can get really big; we normally have sizes like 10 million indices and 15 million vertices.
My question is: how can I make paintGL() faster when the values to display are freshly calculated?
When the application is started from Visual Studio (both Release builds), there aren't that many page faults and it is faster than normal. How does Visual Studio achieve that, and can I do this too, without Visual Studio monitoring my application?
The problem was already described here, but now I have found the root cause: Release Build is faster, when started from Visual Studio than started "normally"
Additional information: C++/opengl application running smoother with debugger attached

The increased page fault load is just a secondary symptom of the really poor rendering loop. Modern GPUs operate on (largish) buffers of vertex and index data. When using glBegin…glEnd immediate mode, the driver is forced to create such buffers in situ. To speed things up there are a lot of heuristics, including the driver marking pages so that it gets notified if their contents change, so that buffers are recreated only when needed.
Rewrite it to use indexed triangles in a vertex array; this is the mode in which GPUs and OpenGL drivers work best.
Even a client-side vertex array will massively speed things up, since the driver can then coalesce the buffer copy. Of course the best thing would be to place m_vertices in a Vertex Buffer Object, as sketched after the rewritten paintGL() below.
#include <GL/gl.h>
#include <vector>
#include <utility>

// overloaded wrappers to deduce the GL index type
// from the element type of the index buffer vector
namespace gl_wrap {

void DrawElements(
    GLenum mode, GLsizei count,
    std::vector<GLubyte> const &idx_buffer,
    size_t offset )
{
    glDrawElements(mode, count, GL_UNSIGNED_BYTE, idx_buffer.data() + offset);
}

void DrawElements(
    GLenum mode, GLsizei count,
    std::vector<GLushort> const &idx_buffer,
    size_t offset )
{
    glDrawElements(mode, count, GL_UNSIGNED_SHORT, idx_buffer.data() + offset);
}

void DrawElements(
    GLenum mode, GLsizei count,
    std::vector<GLuint> const &idx_buffer,
    size_t offset )
{
    glDrawElements(mode, count, GL_UNSIGNED_INT, idx_buffer.data() + offset);
}

}
void MyGLWidget::paintGL() {
    // collect the (start, length) range of each strip,
    // excluding the negative end-of-strip markers themselves
    std::vector<std::pair<size_t,size_t>> strips;
    size_t i_prev = 0, i = 0;
    for( auto idx : m_indices ){
        ++i;
        if( idx < 0 ){
            strips.push_back(std::make_pair(i_prev, i - 1 - i_prev));
            i_prev = i;
        }
    }

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, m_vertices.data()); // GL_FLOAT: m_vertices holds floats
    for( auto const &s : strips ){
        if( s.second < 3 ) continue; // need at least three vertices per strip
        // (the gl_wrap overloads expect an unsigned index type such as GLuint)
        gl_wrap::DrawElements(GL_TRIANGLE_STRIP, GLsizei(s.second), m_indices, s.first);
    }
    glDisableClientState(GL_VERTEX_ARRAY);
}
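For completeness, here is a minimal sketch of the VBO variant, assuming an OpenGL 1.5+ context; the member name m_vbo and the helper uploadVertices() are made up for this sketch. The point is that the vertex data is uploaded once after each recalculation, instead of being copied out of freshly touched application memory on every frame:

// one-time (re)upload, to be called whenever prepareData() has produced new values
void MyGLWidget::uploadVertices() {
    if (m_vbo == 0)
        glGenBuffers(1, &m_vbo);
    glBindBuffer(GL_ARRAY_BUFFER, m_vbo);
    glBufferData(GL_ARRAY_BUFFER,
                 m_vertices.size() * sizeof(float),
                 m_vertices.data(), GL_STATIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

// in paintGL(): source the vertex array from the VBO instead of client memory
glBindBuffer(GL_ARRAY_BUFFER, m_vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const void*)0); // offset 0 into the bound VBO
// ... the gl_wrap::DrawElements loop as above ...
glDisableClientState(GL_VERTEX_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, 0);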

Related

Is there a limit of object in openGL?

I want to draw 2000 spheres using OpenGL in Visual C++.
The following code draws 1000 spheres and the result looks fine.
But when I increase the number of spheres to 2000 (see the partial code below, highlighted by **), it fails.
The following error message appears.
"freeglut : fgInitGL2 : fghGenBuffers is NULL"
Could you help me to solve this problem?
for (int j = 0; j < 10; j++) {
    for (int k = 0; k < 10; k++) {
        **for (int l = 0; l < 20; l++) { // was: for (int l = 0; l < 10; l++)**
            glPushMatrix();
            glTranslatef(j, k, l);
            gluSphere(myQuad, 0.5, 100, 100);
            glPopMatrix();
        }
    }
}
This is the full code for testing.
#include <GL/glew.h>
#include <GL/glut.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <fstream>
#include <string>
#include <cstring>
#include <cmath>
#include <iostream>
int selectedObject = 1;
bool drawThatAxis = 0;
bool lightEffect = 1;
float fovy = 60.0, aspect = 1.0, zNear = 1.0, zFar = 100.0;
float depth = 8;
float phi = 0, theta = 0;
float downX, downY;
bool leftButton = false, middleButton = false;
GLfloat white[3] = { 1.0, 1.0, 1.0 };
void displayCallback(void);
GLdouble width, height;
int wd;
int main(int argc, char* argv[])
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DEPTH);
    glutInitWindowSize(800, 600);
    wd = glutCreateWindow("3D Molecules");
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glEnable(GL_DEPTH_TEST);
    glCullFace(GL_BACK);
    glEnable(GL_CULL_FACE);
    glClearColor(0.0, 0.0, 0.0, 0.0);
    GLuint id;
    id = glGenLists(1);
    GLUquadric* myQuad;
    myQuad = gluNewQuadric();
    glNewList(id, GL_COMPILE);
    for (int j = 0; j < 10; j++) {
        for (int k = 0; k < 10; k++) {
            for (int l = 0; l < 10; l++) {
                glPushMatrix();
                glTranslatef(j, k, l);
                gluSphere(myQuad, 0.5, 100, 100);
                glPopMatrix();
            }
        }
    }
    glEndList();
    glutDisplayFunc(displayCallback);
    glutMainLoop();
    return 0;
}
void displayCallback(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluPerspective(fovy, aspect, zNear, zFar);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    gluLookAt(0, 0, 40, 0, 0, 0, 0, 1, 0);
    glTranslatef(0.0, 0.0, -depth);
    glRotatef(-theta, 1.0, 0.0, 0.0);
    glRotatef(phi, 0.0, 1.0, 0.0);
    if (lightEffect) {
        glEnable(GL_LIGHTING);
        glEnable(GL_LIGHT0);
    }
    else {
        glDisable(GL_LIGHTING);
        glDisable(GL_LIGHT0);
    }
    switch (selectedObject)
    {
    case (1):
        glCallList(1);
        break;
    default:
        break;
    }
    glFlush();
}
Is there a limit of object in openGL?
OpenGL doesn't define limits like a "maximum number of objects".
An application can draw as many objects as fit into CPU memory, but drawing usually becomes unreliably slow before the application hits memory limits. Even when the texture and vertex data no longer fit into GPU memory, OpenGL still doesn't fail; it keeps drawing by constantly uploading data from CPU to GPU memory on each frame.
So if we come back to the question of OpenGL limits - indeed, there are memory limits, as you may see from another similar question. Your code doesn't actually check for any OpenGL errors using glGetError(), hence your conclusion about fghGenBuffers() being the root cause is misleading. I would expect a GL_OUT_OF_MEMORY error to appear in your case. Modern OpenGL also defines a more sophisticated mechanism for reporting errors - ARB_debug_output.
Display Lists are a very archaic mechanism in the OpenGL world, intended to optimize drawing of large amounts of data by "remembering" a sequence of OpenGL commands in internal driver-managed caches. This mechanism was commonly used before the wide adoption of Vertex Buffer Objects, which were added in OpenGL 1.5 as a more straightforward and efficient way to manage vertex data memory, and before Vulkan and GL_NV_command_list re-invented Command Buffers as a more reliable interface for caching a sequence of GPU commands.
A big design issue of the Display List mechanism is its unpredictable memory management and the extremely varying quality of implementations across vendors (from very poor to extremely optimized). Modern graphics drivers try to upload vertex data to GPU memory implicitly while compiling Display Lists, but what they actually do remains hidden.
The archaic GLU library is another source of mystery in your code, as it is difficult to estimate the memory utilized by gluSphere(). A pessimistic calculation shows:
size_t aNbSpheres   = 10 * 10 * 20;
size_t aNbSlices    = 100, aNbStacks = 100;
size_t aNbTriangles = aNbSlices * aNbStacks * 2;
size_t aNbNodes     = aNbSpheres * aNbTriangles * 3; // non-indexed array
size_t aNodeSize    = (3 * sizeof(GLfloat)) * 2;     // position + normal
size_t aMemSize     = aNbNodes * aNodeSize;
size_t aMemSizeMiB  = aMemSize / 1024 / 1024;
that just the vertex data of 2000 spheres may occupy about 2746 MiB (roughly 2.7 GiB) of memory!
If your application is built in 32-bit mode, then it is no surprise that it hits the 32-bit address space limits. But even in the case of a 64-bit application, the OpenGL driver implementation might hit some internal limits, reported by the same GL_OUT_OF_MEMORY error.
Regardless of memory limits, your code is trying to draw around 40 million triangles. This is not impossible for fast modern GPU hardware, but it might be really slow on low-end embedded graphics.
So what could be done next?
Learn OpenGL debugging practices - use glGetError() and/or ARB_debug_output to localize the place and root cause of this and other issues (see the sketch after this list).
Reduce the gluSphere() tessellation parameters.
Generate a Display List for a single sphere and draw it many times. Reusing one list dramatically reduces memory consumption. It may, however, be slower than drawing all spheres at once (but 2000 draw calls is not that big a number for a modern CPU).
Replace the obsolete GLU library with direct generation of vertex data - sphere tessellation is not that difficult to implement, and there are a lot of samples around the web.
Learn Vertex Buffer Objects and use them instead of obsolete Display Lists.
Learn GLSL and modern OpenGL so that you can implement hardware instancing to draw the spheres most efficiently.
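For the first point, a minimal error-draining helper might look like this (just a sketch; the name checkGlErrors is made up):

#include <GL/gl.h>
#include <cstdio>

// drain and print all pending OpenGL errors; call it after suspect GL calls
static bool checkGlErrors(const char* where)
{
    bool ok = true;
    for (GLenum err = glGetError(); err != GL_NO_ERROR; err = glGetError())
    {
        ok = false;
        std::fprintf(stderr, "OpenGL error 0x%04X at %s\n", err, where);
    }
    return ok;
}

// usage, e.g. right after building the display list:
//   glEndList();
//   checkGlErrors("display list compilation"); // GL_OUT_OF_MEMORY would show up here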
From the other side, the fghGenBuffers error looks really weird, as glGenBuffers() should be present in every modern OpenGL implementation. Print the driver information via glGetString(GL_VENDOR)/glGetString(GL_RENDERER)/glGetString(GL_VERSION) to see whether your system has a proper GPU driver installed and isn't using the obsolete Microsoft software implementation of OpenGL 1.1.
Buffer allocations can fail, of course — everything in computers is finite — but your error message isn't related to your problem.
You received the error:
freeglut : fgInitGL2 : fghGenBuffers is NULL
That's an error from freeglut, which is not part of OpenGL. So look up the implementation of freeglut's fgInitGL2.
If fghGenBuffers failed, that means that the following line failed:
CHECK("fghGenBuffers", fghGenBuffers = (FGH_PFNGLGENBUFFERSPROC)glutGetProcAddress("glGenBuffers"));
i.e. GLUT was unable to obtain the address of the glGenBuffers function. It didn't ask for buffers and fail to get them; it asked for the address of the function it would use to ask for buffers, and didn't even get that.
That, however, is only an fgWarning, i.e. a warning, not an error. I would dare guess that you see that message on your terminal from the moment your program starts, irrespective of whether it subsequently fails. It's something GLUT wants you to know, but it isn't proximate to your failure.
As to your actual problem: it is almost certainly to do with attempting to overfill a display list. Your best solution in context is to put only a single sphere into a display list and issue 2000 calls to draw it, modifying the model-view matrix between each (see the sketch at the end of this answer).
As a quick aside: display lists proved to be a bad idea, not offering much scope for optimisation, and were mostly unused by the time of OpenGL 1.5. They were deprecated in OpenGL 3.0 in 2008, as was the entire fixed-functionality pipeline, including glPushMatrix, glTranslatef and glPopMatrix.
That's not to harangue, but be aware that the way your code is formed relies on lingering deprecated functionality. It may contain hard limits that nobody has bothered to update, or otherwise see very limited maintenance.
It's far and away the simplest way to get going, though, and you're probably in the company of a thousand CAD and other scientific programs, so the best advice right now really is just not to try to put all your spheres in one display list.
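A sketch of that approach (variable names are made up; error checking omitted):

// build ONE sphere into a display list
GLuint sphereList = glGenLists(1);
GLUquadric* quad = gluNewQuadric();
glNewList(sphereList, GL_COMPILE);
gluSphere(quad, 0.5, 100, 100);
glEndList();

// draw it 2000 times, moving the model-view matrix between calls
for (int j = 0; j < 10; j++)
    for (int k = 0; k < 10; k++)
        for (int l = 0; l < 20; l++) {
            glPushMatrix();
            glTranslatef((GLfloat)j, (GLfloat)k, (GLfloat)l);
            glCallList(sphereList);
            glPopMatrix();
        }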

OpenGL glMultiDrawElementsIndirect with Interleaved Buffers

I was originally using glDrawElementsInstancedBaseVertex to draw the scene meshes. All the meshes' vertex attributes are interleaved in a single buffer object. In total there are only 30 unique meshes, so I've been calling draw 30 times with instance counts, etc. Now I want to batch the draw calls into one using glMultiDrawElementsIndirect. Since I have no experience with this function, I've been reading articles here and there to understand the implementation, with little success. (For testing purposes all meshes are instanced only once.)
The command structure from the OpenGL reference page.
struct DrawElementsIndirectCommand
{
    GLuint vertexCount;
    GLuint instanceCount;
    GLuint firstVertex;
    GLuint baseVertex;
    GLuint baseInstance;
};

DrawElementsIndirectCommand commands[30];

// Populate commands.
for (size_t index { 0 }; index < 30; ++index)
{
    const Mesh* mesh{ m_meshes[index] };
    commands[index].vertexCount   = mesh->elementCount;
    commands[index].instanceCount = 1; // Just testing with 1 instance, ATM.
    commands[index].firstVertex   = mesh->elementOffset();
    commands[index].baseVertex    = mesh->verticeIndex();
    commands[index].baseInstance  = 0; // Shouldn't impact testing?
}
// Create and populate the GL_DRAW_INDIRECT_BUFFER buffer... bla bla
// Create and populate the GL_DRAW_INDIRECT_BUFFER buffer... bla bla
Then later down the line, after setup I do some drawing.
// Some prep before drawing like bind VAO, update buffers, etc.

// Draw?
if (RenderMode == MULTIDRAW)
{
    // Bind, Draw, Unbind
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, m_indirectBuffer);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, 30, 0);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, 0);
}
else
{
    for (size_t index { 0 }; index < 30; ++index)
    {
        const Mesh* mesh { m_meshes[index] };
        glDrawElementsInstancedBaseVertex(
            GL_TRIANGLES,
            mesh->elementCount,
            GL_UNSIGNED_INT,
            reinterpret_cast<GLvoid*>(mesh->elementOffset()),
            1,
            mesh->verticeIndex());
    }
}
Now, when I switch back, the glDrawElements... path still works fine like before. But glMultiDraw... gives indistinguishable meshes; when I set firstVertex to 0 for all commands, the meshes look almost correct (at least distinguishable), but still largely wrong in places. I feel I'm missing something important about indirect multi-drawing?
//Indirect data
commands[index].firstVertex = mesh->elementOffset();
//Direct draw call
reinterpret_cast<GLvoid*>(mesh->elementOffset()),
That's not how it works for indirect rendering. firstVertex is not a byte offset; it's the position of the first index within the index buffer, counted in indices. So you have to divide the byte offset by the size of the index type to compute firstVertex:
commands[index].firstVertex = mesh->elementOffset() / sizeof(GLuint);
The result of that should be a whole number. If it wasn't, then you were doing unaligned reads, which probably hurt your performance. So fix that ;)
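Put together, the population loop from the question would become (a sketch, keeping the Mesh accessors from the question and assuming elementOffset() returns a byte offset):

for (size_t index { 0 }; index < 30; ++index)
{
    const Mesh* mesh{ m_meshes[index] };
    commands[index].vertexCount   = mesh->elementCount;
    commands[index].instanceCount = 1;
    // offset into the index buffer, counted in indices rather than bytes
    commands[index].firstVertex   = mesh->elementOffset() / sizeof(GLuint);
    commands[index].baseVertex    = mesh->verticeIndex();
    commands[index].baseInstance  = 0;
}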

Is using vectors to store Vertices for DirectX9 slow?

Over the past few days I made my first "engine" thingy. A central object with a window object, graphics object, and an input object - all nice and encapsulated and happy.
In this setup I also included some objects in the graphics object that handle some 'utility' functions, like a camera and a 'vindex' manager.
The Vertex/Index Manager stores all vertices and indices in std::vectors, that are called upon and sent to graphics when it's time to create the buffers.
The only problem is that I get ~8 frames a second with only 8-10 rectangles.
I think the problem is in the 'Vindex' object (my shader is nothing spectacular, and the pipeline is pretty vanilla).
Is storing Vertices in this way a plum bad idea, or is there just some painfully obvious thing I'm missing?
I did a little evolution sim project a few years ago that was pretty messy code-wise, but it rendered 20,000 vertices at 100s of frames a second on this machine, so it's not my machine that's slow.
I've been kind of staring at this for several hours, any and all input is VERY much appreciated :)
Example from my object that stores my vertices:
for (int i = 0; i < 24; ++i)
{
    mVertList.push_back(Vertex(v[i], n[i], col));
}
For Clarity's sake
std::vector<Vertex> mVertList;
std::vector<int> mIndList;
and
std::vector<Vertex> VindexPile::getVerts()
{
    return mVertList;
}

std::vector<int> VindexPile::getInds()
{
    return mIndList;
}
In my graphics.cpp file:
md3dDevice->CreateVertexBuffer(mVinds.getVerts().size() * sizeof(Vertex),
                               D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &mVB, 0);
Vertex* v = 0;
mVB->Lock(0, 0, (void**)&v, 0);
std::vector<Vertex> vList = mVinds.getVerts();
for (int i = 0; i < mVinds.getVerts().size(); ++i)
{
    v[i] = vList[i];
}
mVB->Unlock();

md3dDevice->CreateIndexBuffer(mVinds.getInds().size() * sizeof(WORD),
                              D3DUSAGE_WRITEONLY, D3DFMT_INDEX16, D3DPOOL_MANAGED, &mIB, 0);
WORD* ind = 0;
mIB->Lock(0, 0, (void**)&ind, 0);
std::vector<int> iList = mVinds.getInds();
for (int i = 0; i < mVinds.getInds().size(); ++i)
{
    ind[i] = iList[i];
}
mIB->Unlock();
There is quite a bit of copying going on in here. I cannot tell without running a profiler and seeing some more code, but these look like the first culprits:
std::vector<Vertex> vList = mVinds.getVerts();
std::vector<int> iList = mVinds.getInds();
Those two calls create copies of your vertex/index buffers, which is most probably not what you want: declare the return types as const references instead. The copies also ruin cache coherency, which slows your program down further.
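For example, the getters from the question could return const references instead of copies (a sketch):

const std::vector<Vertex>& VindexPile::getVerts() const
{
    return mVertList; // no copy: the caller reads the stored data in place
}

const std::vector<int>& VindexPile::getInds() const
{
    return mIndList;
}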
mVertList.push_back(Vertex(v[i], n[i], col));
This moves and resizes the vector quite a lot as well. You should call reserve (or resize) before putting stuff in your vectors, to avoid repeated reallocation and moving your data throughout memory.
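For instance, for the 24-vertex loop from the question:

mVertList.reserve(24); // one allocation up front; push_back no longer reallocates
for (int i = 0; i < 24; ++i)
{
    mVertList.push_back(Vertex(v[i], n[i], col));
}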
If I have to give you one big piece of advice, however, it is: profile. I don't know what tools you have access to, but there are plenty of profilers available; pick one and learn it, and it will provide much more valuable insight into why your program is slow.

CUDA + OpenGL. Unknown code=4(cudaErrorLaunchFailure) error

I am doing a simple n-body simulation in CUDA, which I am then trying to visualize with OpenGL.
After I have initialized my particle data on the CPU, allocated the respective memory and transferred that data to the GPU, the program has to enter the following cycle:
1) Compute the forces on each particle (CUDA part)
2) Update the particle positions (CUDA part)
3) Display the particles for this time step (OpenGL part)
4) Go back to 1)
I implement the interface between CUDA and OpenGL with the following code:
GLuint dataBufferID;
particle_t* Particles_d;
particle_t* Particles_h;
cudaGraphicsResource *resources[1];
I allocate space in OpenGL's GL_ARRAY_BUFFER and register the latter as a cudaGraphicsResource using the following code:
void createVBO()
{
    // create buffer object
    glGenBuffers(1, &dataBufferID);
    glBindBuffer(GL_ARRAY_BUFFER, dataBufferID);
    glBufferData(GL_ARRAY_BUFFER, bufferStride*N*sizeof(float), 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    checkCudaErrors(cudaGraphicsGLRegisterBuffer(resources, dataBufferID, cudaGraphicsMapFlagsNone));
}
Lastly, the program cycle that I described (steps 1 to 4) is realized by the following function update(int)
void update(int value)
{
    // map OpenGL buffer object for writing from CUDA
    float* dataPtr;
    checkCudaErrors(cudaGraphicsMapResources(1, resources, 0));
    size_t num_bytes;
    // get a pointer to that buffer object for manipulation with CUDA!
    checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&dataPtr, &num_bytes, resources[0]));
    // fill the graphics resource with particle position data!
    launch_kernel<<<NUM_BLOCKS,NUM_THREADS>>>(Particles_d, dataPtr, 1);
    // unmap buffer object
    checkCudaErrors(cudaGraphicsUnmapResources(1, resources, 0));
    glutPostRedisplay();
    glutTimerFunc(milisec, update, 0);
}
When I compile and run, I get the following errors:
CUDA error at src/main.cu:390 code=4(cudaErrorLaunchFailure) "cudaGraphicsMapResources(1, resources, 0)"
CUDA error at src/main.cu:392 code=4(cudaErrorLaunchFailure) "cudaGraphicsResourceGetMappedPointer((void **)&dataPtr, &num_bytes,resources[0])"
CUDA error at src/main.cu:397 code=4(cudaErrorLaunchFailure) "cudaGraphicsUnmapResources(1, resources, 0)"
Does anyone know what might be the reason for this exception? Am I supposed to create the dataBuffer using createVBO() every time prior to the execution of update(int) ...?
p.s. Just for more clarity, my kernel function is the following:
__global__ void launch_kernel(particle_t* Particles, float* data, int KernelMode){
    int i = blockIdx.x*THREADS_PER_BLOCK + threadIdx.x;
    if(KernelMode == 1){
        // N_d is allocated in device memory
        if(i > N_d)
            return;
        // and update dataBuffer!
        updateX(Particles+i);
        for(int d=0; d<DIM_d; d++){
            data[i*bufferStride_d+d] = Particles[i].p[d]; // update the new coordinate positions in the data buffer!
        }
        // fill in also the RGB data and the radius. In general THIS IS NOT NECESSARY!! NEED TO PERFORM ONCE! REFACTOR!!!
        data[i*bufferStride_d+DIM_d]   = Particles[i].r;
        data[i*bufferStride_d+DIM_d+1] = Particles[i].g;
        data[i*bufferStride_d+DIM_d+2] = Particles[i].b;
        data[i*bufferStride_d+DIM_d+3] = Particles[i].radius;
    }else{
        // if KernelMode == 2 then update Y
        float* Fold = new float[DIM_d];
        for(int d=0; d<DIM_d; d++)
            Fold[d] = Particles[i].force[d];
        // of course in parallel :)
        computeForces(Particles, i);
        updateV(Particles+i, Fold);
        delete [] Fold;
    }
    // in either case wait for all threads to finish!
    __syncthreads();
}
As I mentioned in one of the comments above, it turned out that I had set the wrong compute capability compiler option. I ran cuda-memcheck and saw that the CUDA API launch was failing. After I found the right compiler options, everything worked like a charm.
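For reference: cudaErrorLaunchFailure is an asynchronous error. The kernel fails on the device, and the failure is reported by whichever API call happens to come next - here, the cudaGraphicsMapResources of the following cycle. Checking right after the launch narrows this down (a sketch, using the launch parameters from the question):

launch_kernel<<<NUM_BLOCKS, NUM_THREADS>>>(Particles_d, dataPtr, 1);

// catch invalid launch configurations immediately...
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// ...and errors raised during kernel execution by synchronizing
// before the next API call would pick them up
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));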

Using vertex buffers in jogl, crash when too many triangles

I have written a simple application in Java using JOGL which draws 3D geometry. The camera can be rotated by dragging the mouse. The application works fine, but drawing the geometry with glBegin(GL_TRIANGLES) ... glEnd() calls is too slow.
So I started to use vertex buffers. This also works fine until the number of triangles gets larger than 1000000. When that happens, the display driver suddenly crashes and my monitor goes dark. Is there a limit on how many triangles fit in the buffer? I had hoped to get 1000000 triangles rendered at a reasonable frame rate.
I have no idea how to debug this problem. The nasty thing is that I have to reboot Windows after each launch, since I have no other way to get my display working again. Could anyone give me some advice?
The vertices, triangles and normals are stored in arrays float[][] m_vertices, int[][] m_triangles, float[][] m_triangleNormals.
I initialized the buffer with:
// generate a VBO pointer / handle
if (m_vboHandle <= 0) {
    int[] vboHandle = new int[1];
    m_gl.glGenBuffers(1, vboHandle, 0);
    m_vboHandle = vboHandle[0];
}

// interleave vertex / normal data
FloatBuffer data = Buffers.newDirectFloatBuffer(m_triangles.length * 3*3*2);
for (int t = 0; t < m_triangles.length; t++)
    for (int j = 0; j < 3; j++) {
        int v = m_triangles[t][j];
        data.put(m_vertices[v]);
        data.put(m_triangleNormals[t]);
    }
data.rewind();

// transfer data to VBO
int numBytes = data.capacity() * 4;
m_gl.glBindBuffer(GL.GL_ARRAY_BUFFER, m_vboHandle);
m_gl.glBufferData(GL.GL_ARRAY_BUFFER, numBytes, data, GL.GL_STATIC_DRAW);
m_gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
Then, the scene gets rendered with:
gl.glBindBuffer(GL.GL_ARRAY_BUFFER, m_vboHandle);
gl.glEnableClientState(GL2.GL_VERTEX_ARRAY);
gl.glEnableClientState(GL2.GL_NORMAL_ARRAY);
gl.glVertexPointer(3, GL.GL_FLOAT, 6*4, 0);
gl.glNormalPointer(GL.GL_FLOAT, 6*4, 3*4);
gl.glDrawArrays(GL.GL_TRIANGLES, 0, 3*m_triangles.length);
gl.glDisableClientState(GL2.GL_VERTEX_ARRAY);
gl.glDisableClientState(GL2.GL_NORMAL_ARRAY);
gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
glBufferData itself returns void, but it reports failure through the OpenGL error state: check glGetError() after calling glBufferData. It will report GL_OUT_OF_MEMORY if the implementation cannot satisfy numBytes.
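A sketch of that check, written against the C OpenGL API (the JOGL calls mirror it one-to-one, e.g. gl.glGetError() and GL.GL_OUT_OF_MEMORY):

glBindBuffer(GL_ARRAY_BUFFER, vboHandle);
glBufferData(GL_ARRAY_BUFFER, numBytes, data, GL_STATIC_DRAW);
if (glGetError() == GL_OUT_OF_MEMORY) {
    // allocation failed: free the buffer and fall back,
    // e.g. split the mesh into smaller batches
}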