Is using vectors to store Vertices for DirectX9 slow? - c++

Over the past few days I made my first "engine" thingy. A central object with a window object, graphics object, and an input object - all nice and encapsulated and happy.
In this setup I also included some objects in the graphics object that handle some 'utility' functions, like a camera and a 'vindex' manager.
The Vertex/Index Manager stores all vertices and indices in std::vectors, that are called upon and sent to graphics when it's time to create the buffers.
The only problem is that I get ~8 frames a second with only 8-10 rectangles.
I think the problem is in the 'Vindex' object (my shader is nothing spectacular, and the pipeline is pretty vanilla).
Is storing Vertices in this way a plum bad idea, or is there just some painfully obvious thing I'm missing?
I did a little evolution sim project a few years ago that was pretty messy code-wise, but it rendered 20,000 vertices at 100s of frames a second on this machine, so it's not my machine that's slow.
I've been kind of staring at this for several hours; any and all input is VERY much appreciated :)
Example from my object that stores my vertices:
for (int i = 0; i < 24; ++i)
{
    mVertList.push_back(Vertex(v[i], n[i], col));
}
For clarity's sake:
std::vector<Vertex> mVertList;
std::vector<int> mIndList;
and
std::vector<Vertex> VindexPile::getVerts()
{
    return mVertList;
}

std::vector<int> VindexPile::getInds()
{
    return mIndList;
}
In my graphics.cpp file:
md3dDevice->CreateVertexBuffer(mVinds.getVerts().size() * sizeof(Vertex), D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &mVB, 0);
Vertex * v = 0;
mVB->Lock(0, 0, (void**)&v, 0);
std::vector<Vertex> vList = mVinds.getVerts();
for (int i = 0; i < mVinds.getVerts().size(); ++i)
{
    v[i] = vList[i];
}
mVB->Unlock();
md3dDevice->CreateIndexBuffer(mVinds.getInds().size() * sizeof(WORD), D3DUSAGE_WRITEONLY, D3DFMT_INDEX16, D3DPOOL_MANAGED, &mIB, 0);
WORD* ind = 0;
mIB->Lock(0, 0, (void**)&ind, 0);
std::vector<int> iList = mVinds.getInds();
for (int i = 0; i<mVinds.getInds().size(); ++i)
{
    ind[i] = iList[i];
}
mIB->Unlock();

There is quite a bit of copying going on in here. I can't tell for certain without running a profiler and seeing some more code, but this seems like the first culprit:
std::vector<Vertex> vList = mVinds.getVerts();
std::vector<int> iList = mVinds.getInds();
Those two calls create copies of your vertex/index vectors, which is most probably not what you want - you most likely want to return (and bind) const references instead. The copies also ruin cache coherency, which slows your program down further.
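For illustration, a minimal sketch of what the const-reference version could look like (assuming the VindexPile object outlives anyone holding the references):

// Sketch only: expose the stored vectors without copying them.
const std::vector<Vertex>& VindexPile::getVerts() const { return mVertList; }
const std::vector<int>& VindexPile::getInds() const { return mIndList; }

// At the call site, bind a const reference instead of making a copy:
const std::vector<Vertex>& vList = mVinds.getVerts();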
mVertList.push_back(Vertex(v[i], n[i], col));
This reallocates and moves the vector's contents quite a lot as well - you should call reserve (or resize) up front, before putting stuff in your vectors, to avoid repeated reallocation and moving of your data through memory.
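A small sketch of that idea, using the 24-vertex loop from above (the count is just the example's):

// Sketch only: reserve the final size once so push_back never reallocates.
mVertList.reserve(mVertList.size() + 24);
for (int i = 0; i < 24; ++i)
{
    mVertList.push_back(Vertex(v[i], n[i], col));
}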
If I have to give you one big piece of advice, however, it is this: profile. I don't know what tools you have access to, but there are plenty of profilers available; pick one and learn it, and it will give you far more valuable insight into why your program is slow.

Related

Does this allocation have any effects on speed?

This non-member function drawPoly() draws an n-sided polygon in 3D space from a list of vertices.
This function typically gets called thousands of times during normal execution and speed is critical.
Ignoring the effects of the functions called within drawPoly(), does the allocation of the 25-element vertex array have any negative effects on speed?
void drawPoly(const meshx::Face& face, gen::Vector position,
              ALLEGRO_COLOR color, bool filled)
{
    ALLEGRO_VERTEX vertList[25];
    std::size_t k = 0;
    // ...For every vertex in the polygon...
    for(; k < face.getNumVerts(); ++k) {
        vertList[k].x = position.x + face.alVerts[k].x;
        vertList[k].y = position.y + face.alVerts[k].y;
        vertList[k].z = position.z + face.alVerts[k].z;
        vertList[k].u = 0;
        vertList[k].v = 0;
        vertList[k].color = color;
    }
    // Draw with ALLEGRO_VERTEXs and no textures.
    if(filled) {
        al_draw_prim(vertList, nullptr, nullptr,
                     0, k, ALLEGRO_PRIM_TRIANGLE_LIST);
    } else {
        al_draw_prim(vertList, nullptr, nullptr,
                     0, k, ALLEGRO_PRIM_LINE_LOOP);
    }
}
The only way to tell for sure is to measure. But what else could you use instead, to compare against? Allocating on the heap would obviously be slower. Using a global variable to hold the vertices could be an option - if only for benchmarking.
Given that stack allocation of trivially constructible objects usually translates to a simple adjustment of the stack pointer, the allocation itself probably isn't a big deal. What could have an observable effect, though, is touching extra cache lines. The fewer cache lines the code writes, the better, from a performance perspective. Therefore, you can experiment with splitting vertList[25] into cache-line-sized arrays and calling al_draw_prim multiple times. A benchmark would show whether there's a difference.

SDL rendering too slow

I'm a beginner in both C++ and SDL, and I'm slowly writing a small game following some tutorials to learn some concepts.
I'm having a problem, though: My rendering seems to be really slow.
I've used performance counters to time my loop with and without my rendering function. Without it, I get 0-2ish milliseconds per frame; when I add the rendering, it goes up to 65ish ms per frame.
Could someone tell me what is wrong with my rendering function?
SDL_Texture *texture;
...
// gets called by the main loop
void render(int x_offset, int y_offset)
{
    if (texture)
    {
        SDL_DestroyTexture(texture);
    }
    texture = SDL_CreateTexture(renderer,
                                SDL_PIXELFORMAT_ARGB8888,
                                SDL_TEXTUREACCESS_STREAMING,
                                texture_w,
                                texture_h);
    if (SDL_LockTexture(texture, NULL, &pixel_memory, &pitch) < 0) {
        printf("Oops! %s\n", SDL_GetError());
    }
    Uint32 *pixel;
    Uint8 *row = (Uint8 *) pixel_memory;
    for (int j = 0; j < texture_h; ++j) {
        pixel = (Uint32 *)((Uint8 *) pixel_memory + j * pitch);
        for (int i = 0; i < texture_w; ++i) {
            Uint8 alpha = 255;
            Uint8 red = 172;
            Uint8 green = 0;
            Uint8 blue = 255;
            *pixel++ = ((alpha << 24) | (red << 16) | (green << 8) | (blue));
        }
    }
    SDL_UnlockTexture(texture);
}
It's likely slow because you're destroying and creating the texture every single frame. Locking textures and uploading pixel data isn't super fast either, but I doubt that's the bottleneck here. I strongly recommend allocating the texture once before entering your main loop, re-using it during rendering, and destroying it before your program exits.
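A rough sketch of that structure (the init/shutdown function names are made up here; renderer, texture_w and texture_h are the question's):

// Sketch only: keep one streaming texture for the whole run.
SDL_Texture *texture = NULL;

void init_texture()   // call once, after the renderer is created
{
    texture = SDL_CreateTexture(renderer,
                                SDL_PIXELFORMAT_ARGB8888,
                                SDL_TEXTUREACCESS_STREAMING,
                                texture_w, texture_h);
}

void render(int x_offset, int y_offset)   // called every frame
{
    void *pixel_memory;
    int pitch;
    if (SDL_LockTexture(texture, NULL, &pixel_memory, &pitch) == 0) {
        // ... fill the pixel buffer exactly as before ...
        SDL_UnlockTexture(texture);
    }
    SDL_RenderCopy(renderer, texture, NULL, NULL);
}

void shutdown_texture()   // call once, before quitting
{
    SDL_DestroyTexture(texture);
}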
SDL2 is based on hardware rendering. Accessing textures, even with the streaming flag, won't be fast, since you are playing ping-pong with the GPU.
Instead of creating and destroying a texture each frame, you should consider simply clearing it before redrawing.
Another option would be to use a surface: do your work on the surface and then draw it as a texture. I'm not sure the gain would be huge, but I think it would still be better than destroying, creating, locking and unlocking a texture each frame.
Looking at your code, I understand it is just a test, but you could also try rendering to a texture with SDL's drawing primitives.
Lastly, keep in mind during your tests that your driver might force vertical sync, which can make performance look worse than it really is.
Probably nothing. Locking textures for direct pixel access is slow. Chances are, you can do a lot of additional stuff in the render function and not see any further decrease in speed.
If you want faster rendering, you need higher-level functions.

OpenGL Merging Vertex Data to Minimize Draw Calls

Background
2D "Infinite" World separated into chunks
One VAO (& VBO/EBO) per chunk
Nested for loop in chunk render; one draw call per block.
Code
void Chunk::Render(/* ... */) {
    glBindVertexArray(vao);
    for (int x = 0; x < 64; x++) {
        for (int y = 0; y < 64; y++) {
            if (blocks[x][y] == 1) {
                /* ... Uniforms ... */
                glDrawElements(GL_TRIANGLE_STRIP, 6, GL_UNSIGNED_INT, (void*)0);
            }
        }
    }
    glBindVertexArray(0);
}
There is a generation algorithm in the constructor; it could be anything: noise, random, etc. The algorithm goes through and sets each element in the blocks array to 1 (meaning: render the block) or 0 (meaning: do not render it).
Problem
How would I go about combining these triangle strips in order to minimize draw calls? I can think of a few algorithms to find the triangles that should be merged into a single draw call, but I am confused as to how to merge them. Do I need to add them to the vertices array and call glBufferData again? Would it be bad to call glBufferData that many times per frame?
I'm not really rendering that many triangles, am I? I think I've heard of people who can easily draw ten thousand triangles with minimal CPU usage (or even millions). So what is wrong with how I am drawing currently?
EDIT
Andon M. Coleman has given me a lot of information in chat. I have now switched over to using instanced arrays; I cannot believe how much of a difference it makes in performance - for a minute I thought Linux's `top` command was malfunctioning. It's _very_ significant. Instead of only being able to render, say, 60 triangles, I can render over a million with barely any change in CPU usage.
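For reference, a minimal sketch of what the instanced-array version can look like (the attribute index, the instance buffer name, and the shader-side use of the offset are assumptions here, not the exact code from the chat):

// Sketch only: one offset per visible block, then a single instanced draw call.
std::vector<float> offsets;                    // x,y pairs, one per block to render
for (int x = 0; x < 64; x++)
    for (int y = 0; y < 64; y++)
        if (blocks[x][y] == 1) {
            offsets.push_back((float)x);
            offsets.push_back((float)y);
        }

glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);    // a second VBO holding per-instance data
glBufferData(GL_ARRAY_BUFFER, offsets.size() * sizeof(float), offsets.data(), GL_DYNAMIC_DRAW);
glEnableVertexAttribArray(2);                  // attribute 2 = per-instance offset (assumed)
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 2 * sizeof(float), (void*)0);
glVertexAttribDivisor(2, 1);                   // advance this attribute once per instance

// The vertex shader is assumed to add the per-instance offset to each block vertex.
glDrawElementsInstanced(GL_TRIANGLE_STRIP, 6, GL_UNSIGNED_INT, (void*)0,
                        (GLsizei)(offsets.size() / 2));
glBindVertexArray(0);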

C++ plotting Mandelbrot set, bad performance

I'm not sure if there is an actual performance increase to achieve, or if my computer is just old and slow, but I'll ask anyway.
So I've tried making a program to plot the Mandelbrot set using the cairo library.
The loop that draws the pixels looks as follows:
vector<point_t*>::iterator it;
for(unsigned int i = 0; i < iterations; i++){
    it = points->begin();
    //cout << points->size() << endl;
    double r,g,b;
    r = (double)i+1 / (double)iterations;
    g = 0;
    b = 0;
    while(it != points->end()){
        point_t *p = *it;
        p->Z = (p->Z * p->Z) + p->C;
        if(abs(p->Z) > 2.0){
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle (cr, p->x, p->y, 1, 1);
            cairo_fill (cr);
            it = points->erase(it);
        } else {
            it++;
        }
    }
}
The idea is to color all points that have just escaped the set, and then remove them from the list to avoid evaluating them again.
It does render the set correctly, but it seems that the rendering takes a lot longer than needed.
Can someone spot any performance issues with the loop? Or is it as good as it gets?
Thanks in advance :)
SOLUTION
Very nice answers, thanks :) - I ended up with a kind of hybrid of the answers. Thinking about what was suggested, I realized that calculating each point, putting them all in a vector and then extracting them was a huge waste of CPU time and memory. So instead, the program now just calculates the Z value of each point without even using point_t or the vector. It now runs A LOT faster!
Edit: I think the suggestion in the answer of kuroi neko is also a very good idea if you do not care about "incremental" computation, but have a fixed number of iterations.
You should use vector<point_t> instead of vector<point_t*>.
A vector<point_t*> is a list of pointers to point_t. Each point is stored at some random location in memory. If you iterate over the points, the pattern in which memory is accessed looks completely random, and you will get a lot of cache misses.
In contrast, vector<point_t> uses contiguous memory to store the points, so the next point is stored directly after the current one. This allows efficient caching.
You should not call erase(it); in your inner loop.
Each call to erase has to move all elements after the one you remove. This has O(n) runtime. For example, you could add a flag to point_t to indicate that it should not be processed any longer. It may be even faster to remove all the "inactive" points after each iteration.
It is probably not a good idea to draw individual pixels using cairo_rectangle. I would suggest you create an image and store the color for each pixel. Then draw the whole image with one draw call.
Your code could look like this:
for(unsigned int i = 0; i < iterations; i++){
    double r,g,b;
    r = (double)(i+1) / (double)iterations;
    g = 0;
    b = 0;
    for(vector<point_t>::iterator it=points->begin(); it!=points->end(); ++it) {
        point_t& p = *it;
        if(!p.active) {
            continue;
        }
        p.Z = (p.Z * p.Z) + p.C;
        if(abs(p.Z) > 2.0) {
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle (cr, p.x, p.y, 1, 1);
            cairo_fill (cr);
            p.active = false;
        }
    }
    // perhaps remove all points where p.active == false
}
If you can not change point_t, you can use an additional vector<char> to store if a point has become "inactive".
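For the "store the colours in an image and draw it in one call" suggestion above, here is a sketch using a cairo image surface (the width/height variables and the colour value are placeholders):

// Sketch only: write escaped points into an image buffer, then paint it once.
cairo_surface_t *img = cairo_image_surface_create(CAIRO_FORMAT_RGB24, width, height);
cairo_surface_flush(img);                          // required before touching the pixels directly
unsigned char *data = cairo_image_surface_get_data(img);
int stride = cairo_image_surface_get_stride(img);

// ... inside the iteration loop, instead of cairo_rectangle/cairo_fill:
// uint32_t *row = (uint32_t *)(data + p.y * stride);
// row[p.x] = 0x00FF0000;                          // example colour for this iteration

cairo_surface_mark_dirty(img);                     // tell cairo the pixel data changed
cairo_set_source_surface(cr, img, 0, 0);
cairo_paint(cr);
cairo_surface_destroy(img);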
The Zn divergence computation is what makes the algorithm slow (depending on the area you're working on, of course). In comparison, pixel drawing is mere background noise.
Your loop is flawed because it makes the Zn computation slow.
The way to go is to compute divergence for each point in a tight, optimized loop, and then take care of the display.
Besides, it's useless and wasteful to store Z permanently.
You just need C as an input and the number of iterations as an output.
Assuming your points array only holds C values (basically you don't need all this vector crap, but it won't hurt performance either), you could do something like this:
for(vector<point_t>::iterator it = points->begin(); it != points->end(); ++it)
{
    complex<double> Z = 0;
    complex<double> C = it->C;
    unsigned int i;
    for(i = 0; i < iterations; i++) // <-- this is the CPU burner
    {
        Z = Z * Z + C;
        if(abs(Z) > 2.0) break;
    }
    cairo_set_source_rgba(cr, (double)(i+1) / (double)iterations, 0, 0, 1);
    cairo_rectangle (cr, it->x, it->y, 1, 1);
    cairo_fill (cr);
}
Try to run this with and without the cairo thing and you should see no noticeable difference in execution time (unless you're looking at an empty spot of the set).
Now if you want to go faster, try to break down the Z = Z * Z + C computation in real and imaginary parts and optimize it. You could even use mmx or whatever to do parallel computations.
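A sketch of that real/imaginary decomposition (how a point maps to its C value is left as an assumption):

// Sketch only: iterate z = z*z + c with plain doubles, and compare |z|^2 with 4
// instead of |z| with 2 to avoid the square root hidden in abs().
double zr = 0.0, zi = 0.0;
const double cRe = /* real part of this point's C */ 0.0;
const double cIm = /* imaginary part of this point's C */ 0.0;
unsigned int i = 0;
for (; i < iterations; ++i) {
    const double zr2 = zr * zr;
    const double zi2 = zi * zi;
    if (zr2 + zi2 > 4.0) break;        // |z| > 2: the point has escaped
    zi = 2.0 * zr * zi + cIm;
    zr = zr2 - zi2 + cRe;
}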
And of course the way to gain another significant speed factor is to parallelize your algorithm over the available CPU cores (i.e. split your display area into subsets and have different worker threads compute these parts in parallel).
This is not as obvious as it might seem, though, since each sub-picture will have a different computation time (black areas are very slow to compute while white areas are computed almost instantly).
One way to do it is to split the area into a large number of rectangles, and have all worker threads pick a random rectangle from a common pool until all rectangles have been processed.
This is a simple load balancing scheme that makes sure no CPU core will be left idle while its buddies are busy on other parts of the display.
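A small sketch of such a scheme, with a shared atomic counter standing in for the common pool (the tile counts and the per-tile renderer are placeholders):

// Sketch only: worker threads grab the next unprocessed tile until none are left.
#include <atomic>
#include <thread>
#include <vector>

void renderAllTiles(int tilesX, int tilesY)
{
    std::atomic<int> nextTile(0);
    const int tileCount = tilesX * tilesY;

    auto worker = [&]() {
        for (;;) {
            const int t = nextTile.fetch_add(1);
            if (t >= tileCount) return;
            renderTile(t % tilesX, t / tilesX);   // hypothetical per-tile Mandelbrot renderer
        }
    };

    std::vector<std::thread> pool;
    for (unsigned n = 0; n < std::thread::hardware_concurrency(); ++n)
        pool.emplace_back(worker);
    for (std::thread& th : pool)
        th.join();
}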
The first step to optimizing performance is to find out what is slow. Your code mixes three tasks: iterating to calculate whether a point escapes, manipulating a vector of points to test, and plotting the point.
Separate these three operations and measure their contribution. You can optimise the escape calculation by parallelising it using SIMD operations. You can optimise the vector operations by not erasing points you want to remove, but instead adding the points you want to keep to another vector (since erase is O(N) and addition is O(1)), and you can improve locality by having a vector of points rather than pointers to points; see the sketch below. If the plotting is slow, use an off-screen bitmap and set points by manipulating the backing memory rather than using cairo functions.
(I was going to post this, but @Werner Henze already made the same point in a comment, hence community wiki.)
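A sketch of that "keep" vector pattern (assuming the vector of points by value that the other answers recommend):

// Sketch only: instead of erasing escaped points in place, copy the survivors
// into a second vector and swap - O(n) work per iteration instead of O(n^2).
std::vector<point_t> alive;
alive.reserve(points->size());
for (std::vector<point_t>::iterator it = points->begin(); it != points->end(); ++it) {
    if (abs(it->Z) <= 2.0)      // still inside: keep it for the next iteration
        alive.push_back(*it);
}
points->swap(alive);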

Is it possible to speed up MATLAB plotting by calling C / C++ code in MATLAB?

It is generally very easy to call MEX files (written in C/C++) from Matlab to speed up certain calculations. In my experience, however, the true bottleneck in Matlab is data plotting. Creating handles is extremely expensive, and even if you only update handle data (e.g., XData, YData, ZData), this can take ages. Even worse, since Matlab is a single-threaded program, it is impossible to update multiple plots at the same time.
Therefore my question: is it possible to write a Matlab GUI and call C++ (or some other parallelizable code) which would take care of the plotting / visualization? I'm looking for a cross-platform solution that will work on Windows, Mac and Linux, but any solution that gets me started on either OS is greatly appreciated!
I found a C++ library that seems to use Matlab's plot() syntax but I'm not sure whether this would speed things up, since I'm afraid that if I plot into Matlab's figure() window, things might get slowed down again.
I would appreciate any comments and feedback from people who have dealt with this kind of situation before!
EDIT: obviously, I've already profiled my code and the bottleneck is the plotting (dozens of panels with lots of data).
EDIT2: for you to get the bounty, I need a real life, minimal working example on how to do this - suggestive answers won't help me.
EDIT3: regarding the data to plot: in a most simplistic case, think about 20 line plots, that need to be updated each second with something like 1000000 data points.
EDIT4: I know that this is a huge number of points to plot, but I never said that the problem was easy. I cannot just leave out certain data points, because there's no way of assessing which points are important before actually plotting them (the data is sampled at sub-ms time resolution). As a matter of fact, my data is acquired using a commercial data acquisition system which comes with a data viewer (written in C++). This program has no problem visualizing up to 60 line plots with even more than 1000000 data points.
EDIT5: I don't like where the current discussion is going. I'm aware that sub-sampling my data might speed things up - however, this is not the question. The question here is how to get a C / C++ / Python / Java interface to work with Matlab in order to hopefully speed up plotting by talking directly to the hardware (or using any other trick / way).
Did you try the trivial solution of changing the render method to OpenGL?
opengl hardware;
set(gcf,'Renderer','OpenGL');
Warning!
There will be some things that disappear in this mode, and it will look a bit different, but generally plots will run much faster, especially if you have a hardware accelerator.
By the way, are you sure that you will actually gain a performance increase?
For example, in my experience, WPF graphics in C# are considerably slower than Matlab's, especially scatter plots and circles.
Edit: I thought about the fact that the number of points actually drawn to the screen can't be that large. Basically, it means that you need to interpolate at the places where there is a pixel on the screen. Check out this object:
classdef InterpolatedPlot < handle
    properties(Access=private)
        hPlot;
    end

    methods(Access=public)
        function this = InterpolatedPlot(x,y,varargin)
            this.hPlot = plot(0,0,varargin{:});
            this.setXY(x,y);
        end
    end

    methods
        function setXY(this,x,y)
            parent = get(this.hPlot,'Parent');
            set(parent,'Units','Pixels')
            sz = get(parent,'Position');
            width = sz(3); %Actual width in pixels
            subSampleX = linspace(min(x(:)),max(x(:)),width);
            subSampleY = interp1(x,y,subSampleX);
            set(this.hPlot,'XData',subSampleX,'YData',subSampleY);
        end
    end
end
And here is an example how to use it:
function TestALotOfPoints()
    x = rand(10000,1);
    y = rand(10000,1);
    ip = InterpolatedPlot(x,y,'color','r','LineWidth',2);
end
Another possible improvement:
Also, if your x data is sorted, you can use interp1q instead of interp1, which will be much faster.
classdef InterpolatedPlot < handle
    properties(Access=private)
        hPlot;
    end

%     properties(Access=public)
%         XData;
%         YData;
%     end

    methods(Access=public)
        function this = InterpolatedPlot(x,y,varargin)
            this.hPlot = plot(0,0,varargin{:});
            this.setXY(x,y);
%             this.XData = x;
%             this.YData = y;
        end
    end

    methods
        function setXY(this,x,y)
            parent = get(this.hPlot,'Parent');
            set(parent,'Units','Pixels')
            sz = get(parent,'Position');
            width = sz(3); %Actual width in pixels
            subSampleX = linspace(min(x(:)),max(x(:)),width);
            subSampleY = interp1q(x,y,transpose(subSampleX));
            set(this.hPlot,'XData',subSampleX,'YData',subSampleY);
        end
    end
end
And the use case:
function TestALotOfPoints()
    x = rand(10000,1);
    y = rand(10000,1);
    x = sort(x);
    ip = InterpolatedPlot(x,y,'color','r','LineWidth',2);
end
Since you want maximum performance you should consider writing a minimal OpenGL viewer. Dump all the points to a file and launch the viewer using the "system"-command in MATLAB. The viewer can be really simple. Here is one implemented using GLUT, compiled for Mac OS X. The code is cross platform so you should be able to compile it for all the platforms you mention. It should be easy to tweak this viewer for your needs.
If you are able to integrate this viewer more closely with MATLAB you might be able to get away with not having to write to and read from a file (= much faster updates). However, I'm not experienced in the matter. Perhaps you can put this code in a mex-file?
EDIT: I've updated the code to draw a line strip from a CPU memory pointer.
// On Mac OS X, compile using: g++ -O3 -framework GLUT -framework OpenGL glview.cpp
// The file "input" is assumed to contain a line for each point:
// 0.1 1.0
// 5.2 3.0

#include <vector>
#include <sstream>
#include <fstream>
#include <iostream>
#include <GLUT/glut.h>

using namespace std;

struct float2 { float2() {} float2(float x, float y) : x(x), y(y) {} float x, y; };

static vector<float2> points;
static float2 minPoint, maxPoint;
typedef vector<float2>::iterator point_iter;

static void render() {
    glClearColor(1.0f, 1.0f, 1.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(minPoint.x, maxPoint.x, minPoint.y, maxPoint.y, -1.0f, 1.0f);
    glColor3f(0.0f, 0.0f, 0.0f);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, sizeof(points[0]), &points[0].x);
    glDrawArrays(GL_LINE_STRIP, 0, points.size());
    glDisableClientState(GL_VERTEX_ARRAY);
    glutSwapBuffers();
}

int main(int argc, char* argv[]) {
    ifstream file("input");
    string line;
    while (getline(file, line)) {
        istringstream ss(line);
        float2 p;
        ss >> p.x;
        ss >> p.y;
        if (ss)
            points.push_back(p);
    }
    if (!points.size())
        return 1;

    minPoint = maxPoint = points[0];
    for (point_iter i = points.begin(); i != points.end(); ++i) {
        float2 p = *i;
        minPoint = float2(minPoint.x < p.x ? minPoint.x : p.x, minPoint.y < p.y ? minPoint.y : p.y);
        maxPoint = float2(maxPoint.x > p.x ? maxPoint.x : p.x, maxPoint.y > p.y ? maxPoint.y : p.y);
    }
    float dx = maxPoint.x - minPoint.x;
    float dy = maxPoint.y - minPoint.y;
    maxPoint.x += dx*0.1f; minPoint.x -= dx*0.1f;
    maxPoint.y += dy*0.1f; minPoint.y -= dy*0.1f;

    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(512, 512);
    glutCreateWindow("glview");
    glutDisplayFunc(render);
    glutMainLoop();
    return 0;
}
EDIT: Here is new code based on the discussion below. It renders a sine function consisting of 20 VBOs, each containing 100k points; 10k new points are added each rendered frame, making a total of 2M points. The performance is real-time on my laptop.
// On Mac OS X, compile using: g++ -O3 -framework GLUT -framework OpenGL glview.cpp

#include <vector>
#include <sstream>
#include <fstream>
#include <iostream>
#include <cmath>
#include <GLUT/glut.h>

using namespace std;

struct float2 { float2() {} float2(float x, float y) : x(x), y(y) {} float x, y; };

struct Vbo {
    GLuint i;
    Vbo(int size) { glGenBuffersARB(1, &i); glBindBufferARB(GL_ARRAY_BUFFER, i); glBufferDataARB(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW); } // could try GL_STATIC_DRAW
    void set(const void* data, size_t size, size_t offset) { glBindBufferARB(GL_ARRAY_BUFFER, i); glBufferSubData(GL_ARRAY_BUFFER, offset, size, data); }
    ~Vbo() { glDeleteBuffers(1, &i); }
};

static const int vboCount = 20;
static const int vboSize = 100000;
static const int pointCount = vboCount*vboSize;
static float endTime = 0.0f;
static const float deltaTime = 1e-3f;
static std::vector<Vbo*> vbos;
static int vboStart = 0;

static void addPoints(float2* points, int pointCount) {
    while (pointCount) {
        if (vboStart == vboSize || vbos.empty()) {
            if (vbos.size() >= vboCount+2) { // remove and reuse vbo
                Vbo* first = *vbos.begin();
                vbos.erase(vbos.begin());
                vbos.push_back(first);
            }
            else { // create new vbo
                vbos.push_back(new Vbo(sizeof(float2)*vboSize));
            }
            vboStart = 0;
        }
        int pointsAdded = pointCount;
        if (pointsAdded + vboStart > vboSize)
            pointsAdded = vboSize - vboStart;
        Vbo* vbo = *vbos.rbegin();
        vbo->set(points, pointsAdded*sizeof(float2), vboStart*sizeof(float2));
        pointCount -= pointsAdded;
        points += pointsAdded;
        vboStart += pointsAdded;
    }
}

static void render() {
    // generate and add 10000 points
    const int count = 10000;
    float2 points[count];
    for (int i = 0; i < count; ++i) {
        float2 p(endTime, std::sin(endTime*1e-2f));
        endTime += deltaTime;
        points[i] = p;
    }
    addPoints(points, count);

    // render
    glClearColor(1.0f, 1.0f, 1.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(endTime-deltaTime*pointCount, endTime, -1.0f, 1.0f, -1.0f, 1.0f);
    glColor3f(0.0f, 0.0f, 0.0f);
    glEnableClientState(GL_VERTEX_ARRAY);
    for (size_t i = 0; i < vbos.size(); ++i) {
        glBindBufferARB(GL_ARRAY_BUFFER, vbos[i]->i);
        glVertexPointer(2, GL_FLOAT, sizeof(float2), 0);
        if (i == vbos.size()-1)
            glDrawArrays(GL_LINE_STRIP, 0, vboStart);
        else
            glDrawArrays(GL_LINE_STRIP, 0, vboSize);
    }
    glDisableClientState(GL_VERTEX_ARRAY);
    glutSwapBuffers();
    glutPostRedisplay();
}

int main(int argc, char* argv[]) {
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowSize(512, 512);
    glutCreateWindow("glview");
    glutDisplayFunc(render);
    glutMainLoop();
    return 0;
}
As a number of people have mentioned in their answers, you do not need to plot that many points. I think it is important to repeat Andrey's comment:
that is a HUGE amount of points! There isn't enough pixels on the screen to plot that amount.
Rewriting plotting routines in different languages is a waste of your time. A huge number of hours have gone into writing MATLAB; what makes you think you can write a significantly faster plotting routine (in a reasonable amount of time)? Whilst your routine may be less general, and would therefore skip some of the checks that the MATLAB code performs, your "bottleneck" is that you are trying to plot so much data.
I strongly recommend one of two courses of action:
Sample your data: You do not need 20 x 1000000 points on a figure - the human eye won't be able to distinguish between all the points, so it is a waste of time. Try binning your data for example.
If you maintain that you need all those points on the screen, I would suggest using a different tool. VisIt or ParaView are two examples that come to mind. They are parallel visualisation programs designed to handle extremely large datasets (I have seen VisIt handle datasets that contained petabytes of data).
There is no way you can fit 1000000 data points on a small plot. How about you choose one in every 10000 points and plot those?
You can consider calling imresize on the large vector to shrink it, but manually building a vector by omitting 99% of the points may be faster.
@memyself The sampling operations are already occurring. Matlab is choosing what data to include in the graph. Why do you trust Matlab? It looks to me like the graph you showed significantly misrepresents the data. The dense regions should indicate that the signal is at a constant value, but in your graph they could mean that the signal is at that value half the time - or that it was at that value at least once during the interval corresponding to that pixel.
Would it be possible to use an alternate architecture? For example, use MATLAB to generate the data and use a fast library or application (GNUplot?) to handle the plotting?
It might even be possible to have MATLAB write the data to a stream as the plotter consumes the data. Then the plot would be updated as MATLAB generates the data.
This approach would avoid MATLAB's ridiculously slow plotting and divide the work up between two separate processes. The OS/CPU would probably assign the processes to different cores as a matter of course.
I think it's possible, but likely to require writing the plotting code (at least the parts you use) from scratch, since anything you could reuse is exactly what's slowing you down.
To test feasibility, I'd start with testing that any Win32 GUI works from MEX (call MessageBox), then proceed to creating your own window, test that window messages arrive to your WndProc. Once all that's going, you can bind an OpenGL context to it (or just use GDI), and start plotting.
However, the savings are likely to come from simpler plotting code and the use of newer OpenGL features such as VBOs, rather than from threading. Everything is already parallel on the GPU, and more threads don't help transfer commands/data to the GPU any faster.
I did a very similar thing many many years ago (2004?). I needed an oscilloscope-like display for kilohertz sampled biological signals displayed in real time. Not quite as many points as the original question has, but still too many for MATLAB to handle on its own. IIRC I ended up writing a Java component to display the graph.
As other people have suggested, I also ended up down-sampling the data. For each pixel on the x-axis, I calculated the minimum and maximum values taken by the data, then drew a short vertical line between those values. The entire graph consisted of a sequence of short vertical lines, each immediately adjacent to the next.
Actually, I think that the implementation ended up writing the graph to a bitmap that scrolled continuously using bitblt, with only new points being drawn ... or maybe the bitmap was static and the viewport scrolled along it ... anyway it was a long time ago and I might not be remembering it right.
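A sketch of that per-pixel min/max reduction in C++ (the sample buffer, its length and the drawing call are placeholders, not part of the original implementation):

// Sketch only: reduce many samples to one (min, max) pair per horizontal pixel,
// then draw a single vertical line per pixel instead of thousands of points.
void drawWave(const double* samples, size_t nSamples, int plotWidthPixels)
{
    const size_t samplesPerPixel = nSamples / plotWidthPixels;
    for (int px = 0; px < plotWidthPixels; ++px) {
        const double* chunk = samples + px * samplesPerPixel;
        double lo = chunk[0], hi = chunk[0];
        for (size_t s = 1; s < samplesPerPixel; ++s) {
            if (chunk[s] < lo) lo = chunk[s];
            if (chunk[s] > hi) hi = chunk[s];
        }
        drawVerticalLine(px, lo, hi);   // hypothetical drawing primitive
    }
}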
EDIT4: I know that this is a huge amount of points to plot but I never said that the problem was easy. I can not just leave out certain data points, because there's no way of assessing what points are important, before actually plotting them
This is incorrect. There is a way to know which points to leave out, and Matlab is already doing it. Something is going to have to do it at some point no matter how you solve this. I think you need to redirect your problem to be: "how do I determine which points I should plot?".
Based on the screenshot, the data looks like a waveform. You might want to look at the code of Audacity. It is an open-source audio editing program. It displays plots to represent the waveform in real time, and they look identical in style to the one in your lowest screenshot. You could borrow some sampling techniques from them.
What you are looking for is the creation of a MEX file.
Rather than me explaining it, you would probably benefit more from reading this: Creating C/C++ and Fortran Programs to be Callable from MATLAB (MEX-Files) (a documentation article from MathWorks).
Hope this helps.