Uniform affecting shader flow and performances

Uniform affecting shader flow and performances - c++

i was experimenting with OpenGL fragment shaders by doing a huge blur (300*300) done in two passes, one horizontal, one vertical.
I noticed that passing the direction as a uniform (vec2) is about 10 time slower than to directly write it in the code (140 to 12 fps).
ie:
vec2 dir = vec2(0, 1) / textureSize(tex, 0);
int size = 150;
for(int i = -size; i != size; ++i) {
float w = // compute weight here...
acc += w * texture(tex, + coord + vec2(i) * dir);
}
appear to be faster than:
uniform vec2 dir;
/*
...
*/
int size = 150;
for(int i = -size; i != size; ++i) {
float w = // compute weight here...
acc += w * texture(tex, + coord + vec2(i) * dir);
}
Creating two programs with different uniforms doesn't change anything.
Does anyone know why there is such a huge difference and why doesn't the driver see that "inlining" dir might be much faster ?
EDIT : Taking size as a uniform also have an impact, but not as much as dir.
If you are interested in seeing what it looks like (FRAPS provides the fps counter):
uniform blur.
"inline" blur.
no blur.
Quick notes : i am running on a nVidia 760M GTX using OpenGL 4.2 and glsl 420. Also puush's jpeg is responsible for the colors in the images.

A good guess would be that the UBOs are stored in shared memory, but might require an occasional round-trip to global memory (vram), while the non-uniform version stores that little piece of data in registers or constant memory.
However, since the OpenGL standard does not dictate where your data is stored, you would have to look at a profiler, and try to gain better understanding of how NVIDIA's GL implementation works.
I'd recommend, you start by profiling, using NVIDIA PerfKit or NVIDIA NSIGHT for VS. Even if you think, it's too much trouble for now. If you want to write high-performance code, you should start getting used to the process. You will see how easy it gets eventually.
EDIT:
So why is it so much slower? Because in this case, one failed optimization (data not in registers) can cause other (if not most other) optimizations to also fail. And, coincidentally, optimizations are absolutely necessary for GPU code to run fast.

Related

Stuck trying to optimize complex GLSL fragment shader

So first off, let me say that while the code works perfectly well from a visual point of view, it runs into very steep performance issues that get progressively worse as you add more lights. In its current form it's good as a proof of concept, or a tech demo, but is otherwise unusable.
Long story short, I'm writing a RimWorld-style game with real-time top-down 2D lighting. The way I implemented rendering is with a 3 layered technique as follows:
First I render occlusions to a single-channel R8 occlusion texture mapped to a framebuffer. This part is lightning fast and doesn't slow down with more lights, so it's not part of the problem:
Then I invoke my lighting shader by drawing a huge rectangle over my lightmap texture mapped to another framebuffer. The light data is stored in an array in an UBO and it uses the occlusion mapping in its calculations. This is where the slowdown happens:
And lastly, the lightmap texture is multiplied and added to the regular world renderer, this also isn't affected by the number of lights, so it's not part of the problem:
The problem is thus in the lightmap shader. The first iteration had many branches which froze my graphics driver right away when I first tried it, but after removing most of them I get a solid 144 fps at 1440p with 3 lights, and ~58 fps at 1440p with 20 lights. An improvement, but it scales very poorly. The shader code is as follows, with additional annotations:
#version 460 core
// per-light data
struct Light
{
vec4 location;
vec4 rangeAndstartColor;
};
const int MaxLightsCount = 16; // I've also tried 8 and 32, there was no real difference
layout(std140) uniform ubo_lights
{
Light lights[MaxLightsCount];
};
uniform sampler2D occlusionSampler; // the occlusion texture sampler
in vec2 fs_tex0; // the uv position in the large rectangle
in vec2 fs_window_size; // the window size to transform world coords to view coords and back
out vec4 color;
void main()
{
vec3 resultColor = vec3(0.0);
const vec2 size = fs_window_size;
const vec2 pos = (size - vec2(1.0)) * fs_tex0;
// process every light individually and add the resulting colors together
// this should be branchless, is there any way to check?
for(int idx = 0; idx < MaxLightsCount; ++idx)
{
const float range = lights[idx].rangeAndstartColor.x;
const vec2 lightPosition = lights[idx].location.xy;
const float dist = length(lightPosition - pos); // distance from current fragment to current light
// early abort, the next part is expensive
// this branch HAS to be important, right? otherwise it will check crazy long lines against occlusions
if(dist > range)
continue;
const vec3 startColor = lights[idx].rangeAndstartColor.yzw;
// walk between pos and lightPosition to find occlusions
// standard line DDA algorithm
vec2 tempPos = pos;
int lineSteps = int(ceil(abs(lightPosition.x - pos.x) > abs(lightPosition.y - pos.y) ? abs(lightPosition.x - pos.x) : abs(lightPosition.y - pos.y)));
const vec2 lineInc = (lightPosition - pos) / lineSteps;
// can I get rid of this loop somehow? I need to check each position between
// my fragment and the light position for occlusions, and this is the best I
// came up with
float lightStrength = 1.0;
while(lineSteps --> 0)
{
const vec2 nextPos = tempPos + lineInc;
const vec2 occlusionSamplerUV = tempPos / size;
lightStrength *= 1.0 - texture(occlusionSampler, vec2(occlusionSamplerUV.x, 1 - occlusionSamplerUV.y)).x;
tempPos = nextPos;
}
// the contribution of this light to the fragment color is based on
// its square distance from the light, and the occlusions between them
// implemented as multiplications
const float strength = max(0, range - dist) / range * lightStrength;
resultColor += startColor * strength * strength;
}
color = vec4(resultColor, 1.0);
}
I call this shader as many times as I need, since the results are additive. It works with large batches of lights or one by one. Performance-wise, I didn't notice any real change trying different batch numbers, which is perhaps a bit odd.
So my question is, is there a better way to look up for any (boolean) occlusions between my fragment position and light position in the occlusion texture, without iterating through every pixel by hand? Could render buffers perhaps help here (from what I've read they're for reading data back to system memory, I need it in another shader though)?
And perhaps, is there a better algorithm for what I'm doing here?

I can think of a couple routes for optimization:
Exact: apply a distance transform on the occlusion map: this will give you the distance to the nearest occluder at each pixel. After that you can safely step by that distance within the loop, instead of doing baby steps. This will drastically reduce the number of steps in open regions.
There is a very simple CPU-side algorithm to compute a DT, and it may suit you if your occluders are static. If your scene changes every frame, however, you'll need to search the literature for GPU side algorithms, which seem to be more complicated.
Inexact: resort to soft shadows -- it might be a compromise you are willing to make, and even seen as an artistic choice. If you are OK with that, you can create a mipmap from your occlusion map, and then progressively increase the step and sample lower levels as you go farther from the point you are shading.
You can go further and build an emitters map (into the same 4-channel map as the occlusion). Then your entire shading pass will be independent of the number of lights. This is an equivalent of voxel cone tracing GI applied to 2D.

Blur on Windows Phone 8 too slow

I'm implementing blur effect on windows phone using native C++ with DirectX, but it looks like even the simplest blur with small kernel causes visible FPS drop.
float4 main(PixelShaderInput input) : SV_TARGET
{
float4 source = screen.Sample(LinearSampler, input.texcoord);
float4 sum = float4(0,0,0,0);
float2 sizeFactor = float2(0.00117, 0.00208);
for (int x = -2; x <= 2; x ++)
{
float2 offset = float2(x, 0) *sizeFactor;
sum += screen.Sample(LinearSampler, input.texcoord + offset);
}
return ((sum / 5) + source);
}
I'm currently using this pixel shader for 1D blur and it's visibily slower than without blur. Is it really so that WP8 phone hardware is that slow or am I making some mistake? If so, could you point me where to look for error?
Thank you.

Phones often don't have the best fill-rate, and blur is one of the worst things you can do if you're fill-rate bound. Using some numbers from gfxbench.com's Fill test, a typical phone fill rate is around 600MTex/s. With some rough math:
(600M texels/s) / (1280*720 texels/op) / (60 frames/s) ~= 11 ops/frame
So in your loop, if your surface is the entire screen, and you're doing 5 reads and 1 write, that's 6 of your 11 ops used, just for the blur. So I would say a framerate drop is expected. One way around this is to dynamically lower your resolution, and do a single linear upscale - you'll get a different kind of natural blur from the linear interpolation, which might be passable depending on the visual effect you're going for.

How to extend vertex shader capabalities for GPGPU

I'm trying to implement Scrypt hasher (for LTC miner) on GLSL (don't ask me why).
And, actually, I'm stucked with HMAC SHA-256 algorithm. Despite I've implemented SHA-256 correctly (it retuns corrent hash for input), fragment shader stops to compile when I add the last step (hashing previous hash concated with oKey).
The shader can't do more than three rounds of SHA-256. It just stops to compile. What are the limits? It doesn't use much memory, 174 vec2 objects in total. It seems, it doesn't relate to memory, because any extra SHA256 round doesn't require new memory. And it seems, it doesn't relate to viewport size. It stops to work on both 1x1 and 1x128 viewports.
I've started to do miner on WebGL, but after limit appeared, I've tried to run the same shader in the Qt on the full featured OpenGL. In result, desktop OpenGL allows one SHA256 round lesser then OpenGL ES in WebGL (why?).
Forgot to mention. Shader fails on the linkage stage. The shader compiles well itself, but the program linkage fails.
I don't use any textures, any extensions, slow things etc. Just simple square (4 vec2 vertecies) and several uniforms for fragment shader.
Input data is just 80 bytes, the result of fragment shader is binary (black or white), so the task ideally fits the GLSL principes.
My videocard is Radeon HD7970 with plenty of VRAM, which is able to fit hundreds of scrypt threads (scrypt uses 128kB per hash, but I can't achieve just HMAC-SHA-256). My card supports OpenGL 4.4.
I'm newbie in OpenGL, and may understand something wrong. I understand that fragment shader runs for each pixel separately, but if I have 1x128 viewport, there are only 128x348 bytes used. Where is the limit of fragment shader.
Here is the common code I use to let you understand, how I'm trying to solve the problem.
uniform vec2 base_nonce[2];
uniform vec2 header[20]; /* Header of the block */
uniform vec2 H[8];
uniform vec2 K[64];
void sha256_round(inout vec2 w[64], inout vec2 t[8], inout vec2 hash[8]) {
for (int i = 0; i < 64; i++) {
if( i > 15 ) {
w[i] = blend(w[i-16], w[i-15], w[i-7], w[i-2]);
}
_s0 = e0(t[0]);
_maj = maj(t[0],t[1],t[2]);
_t2 = safe_add(_s0, _maj);
_s1 = e1(t[4]);
_ch = ch(t[4], t[5], t[6]);
_t1 = safe_add(safe_add(safe_add(safe_add(t[7], _s1), _ch), K[i]), w[i]);
t[7] = t[6]; t[6] = t[5]; t[5] = t[4];
t[4] = safe_add(t[3], _t1);
t[3] = t[2]; t[2] = t[1]; t[1] = t[0];
t[0] = safe_add(_t1, _t2);
}
for (int i = 0; i < 8; i++) {
hash[i] = safe_add(t[i], hash[i]);
t[i] = hash[i];
}
}
void main () {
vec2 key_hash[8]; /* Our SHA-256 hash */
vec2 i_key[16];
vec2 i_key_hash[8];
vec2 o_key[16];
vec2 nonced_header[20]; /* Header with nonce */
set_nonce_to_header(nonced_header);
vec2 P[32]; /* Padded SHA-256 message */
pad_the_header(P, nonced_header);
/* Hash HMAC secret key */
sha256(P, key_hash);
/* Make iKey and oKey */
for(int i = 0; i < 16; i++) {
if (i < 8) {
i_key[i] = xor(key_hash[i], vec2(Ox3636, Ox3636));
o_key[i] = xor(key_hash[i], vec2(Ox5c5c, Ox5c5c));
} else {
i_key[i] = vec2(Ox3636, Ox3636);
o_key[i] = vec2(Ox5c5c, Ox5c5c);
}
}
/* SHA256 hash of iKey */
for (int i = 0; i < 8; i++) {
i_key_hash[i] = H[i];
t[i] = i_key_hash[i];
}
for (int i = 0; i < 16; i++) { w[i] = i_key[i]; }
sha256_round(w, t, i_key_hash);
gl_FragColor = toRGBA(i_key_hash[0]);
}
What solutions can I use to improve the situation? Is there something cool in OpenGL 4.4, in OpenGL ES 3.1? Is it even possible to do such calculations and keep so much (128kB) in fragment shader? What are limits for the vertex shader? Can I do the same on the vertex shader instead the fragment?

I try to answer on the my own question.
Shader is a small processor with limited registers and cache memory. Also, there are limit for instruction execution. So, the whole architecture to fit all into one fragment shader is wrong.
On another way, you can change your shader programs during render tens or hundreds times. It is normal practice.
It is necessary to divide big computation into smaller parts and render them separately. Use render-to-texture to save your work.
Due to the webgl statistic, 96.5% of clients has MAX_TEXTURE_SIZE eq 4096. It gives you 32 megabytes of memory. It can contain the draft data for 256 threads of scrypt computation.

is it possible to speed-up matlab plotting by calling c / c++ code in matlab?

It is generally very easy to call mex files (written in c/c++) in Matlab to speed up certain calculations. In my experience however, the true bottleneck in Matlab is data plotting. Creating handles is extremely expensive and even if you only update handle data (e.g., XData, YData, ZData), this might take ages. Even worse, since Matlab is a single threaded program, it is impossible to update multiple plots at the same time.
Therefore my question: Is it possible to write a Matlab GUI and call C++ (or some other parallelizable code) which would take care of the plotting / visualization? I'm looking for a cross-platform solution that will work on Windows, Mac and Linux, but any solution that get's me started on either OS is greatly appreciated!
I found a C++ library that seems to use Matlab's plot() syntax but I'm not sure whether this would speed things up, since I'm afraid that if I plot into Matlab's figure() window, things might get slowed down again.
I would appreciate any comments and feedback from people who have dealt with this kind of situation before!
EDIT: obviously, I've already profiled my code and the bottleneck is the plotting (dozen of panels with lots of data).
EDIT2: for you to get the bounty, I need a real life, minimal working example on how to do this - suggestive answers won't help me.
EDIT3: regarding the data to plot: in a most simplistic case, think about 20 line plots, that need to be updated each second with something like 1000000 data points.
EDIT4: I know that this is a huge amount of points to plot but I never said that the problem was easy. I can not just leave out certain data points, because there's no way of assessing what points are important, before actually plotting them (data is sampled a sub-ms time resolution). As a matter of fact, my data is acquired using a commercial data acquisition system which comes with a data viewer (written in c++). This program has no problem visualizing up to 60 line plots with even more than 1000000 data points.
EDIT5: I don't like where the current discussion is going. I'm aware that sub-sampling my data might speeds up things - however, this is not the question. The question here is how to get a c / c++ / python / java interface to work with matlab in order hopefully speed up plotting by talking directly to the hardware (or using any other trick / way)

Did you try the trivial solution of changing the render method to OpenGL ?
opengl hardware;
set(gcf,'Renderer','OpenGL');
Warning!
There will be some things that disappear in this mode, and it will look a bit different, but generally plots will runs much faster, especially if you have a hardware accelerator.
By the way, are you sure that you will actually gain a performance increase?
For example, in my experience, WPF graphics in C# are considerably slower than Matlabs, especially scatter plot and circles.
Edit: I thought about the fact that the number of points that is actually drawn to the screen can't be that much. Basically it means that you need to interpolate at the places where there is a pixel in the screen. Check out this object:
classdef InterpolatedPlot < handle
properties(Access=private)
hPlot;
end
methods(Access=public)
function this = InterpolatedPlot(x,y,varargin)
this.hPlot = plot(0,0,varargin{:});
this.setXY(x,y);
end
end
methods
function setXY(this,x,y)
parent = get(this.hPlot,'Parent');
set(parent,'Units','Pixels')
sz = get(parent,'Position');
width = sz(3); %Actual width in pixels
subSampleX = linspace(min(x(:)),max(x(:)),width);
subSampleY = interp1(x,y,subSampleX);
set(this.hPlot,'XData',subSampleX,'YData',subSampleY);
end
end
end
And here is an example how to use it:
function TestALotOfPoints()
x = rand(10000,1);
y = rand(10000,1);
ip = InterpolatedPlot(x,y,'color','r','LineWidth',2);
end
Another possible improvement:
Also, if your x data is sorted, you can use interp1q instead of interp, which will be much faster.
classdef InterpolatedPlot < handle
properties(Access=private)
hPlot;
end
% properties(Access=public)
% XData;
% YData;
% end
methods(Access=public)
function this = InterpolatedPlot(x,y,varargin)
this.hPlot = plot(0,0,varargin{:});
this.setXY(x,y);
% this.XData = x;
% this.YData = y;
end
end
methods
function setXY(this,x,y)
parent = get(this.hPlot,'Parent');
set(parent,'Units','Pixels')
sz = get(parent,'Position');
width = sz(3); %Actual width in pixels
subSampleX = linspace(min(x(:)),max(x(:)),width);
subSampleY = interp1q(x,y,transpose(subSampleX));
set(this.hPlot,'XData',subSampleX,'YData',subSampleY);
end
end
end
And the use case:
function TestALotOfPoints()
x = rand(10000,1);
y = rand(10000,1);
x = sort(x);
ip = InterpolatedPlot(x,y,'color','r','LineWidth',2);
end

Since you want maximum performance you should consider writing a minimal OpenGL viewer. Dump all the points to a file and launch the viewer using the "system"-command in MATLAB. The viewer can be really simple. Here is one implemented using GLUT, compiled for Mac OS X. The code is cross platform so you should be able to compile it for all the platforms you mention. It should be easy to tweak this viewer for your needs.
If you are able to integrate this viewer more closely with MATLAB you might be able to get away with not having to write to and read from a file (= much faster updates). However, I'm not experienced in the matter. Perhaps you can put this code in a mex-file?
EDIT: I've updated the code to draw a line strip from a CPU memory pointer.
// On Mac OS X, compile using: g++ -O3 -framework GLUT -framework OpenGL glview.cpp
// The file "input" is assumed to contain a line for each point:
// 0.1 1.0
// 5.2 3.0
#include <vector>
#include <sstream>
#include <fstream>
#include <iostream>
#include <GLUT/glut.h>
using namespace std;
struct float2 { float2() {} float2(float x, float y) : x(x), y(y) {} float x, y; };
static vector<float2> points;
static float2 minPoint, maxPoint;
typedef vector<float2>::iterator point_iter;
static void render() {
glClearColor(1.0f, 1.0f, 1.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(minPoint.x, maxPoint.x, minPoint.y, maxPoint.y, -1.0f, 1.0f);
glColor3f(0.0f, 0.0f, 0.0f);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_FLOAT, sizeof(points[0]), &points[0].x);
glDrawArrays(GL_LINE_STRIP, 0, points.size());
glDisableClientState(GL_VERTEX_ARRAY);
glutSwapBuffers();
}
int main(int argc, char* argv[]) {
ifstream file("input");
string line;
while (getline(file, line)) {
istringstream ss(line);
float2 p;
ss >> p.x;
ss >> p.y;
if (ss)
points.push_back(p);
}
if (!points.size())
return 1;
minPoint = maxPoint = points[0];
for (point_iter i = points.begin(); i != points.end(); ++i) {
float2 p = *i;
minPoint = float2(minPoint.x < p.x ? minPoint.x : p.x, minPoint.y < p.y ? minPoint.y : p.y);
maxPoint = float2(maxPoint.x > p.x ? maxPoint.x : p.x, maxPoint.y > p.y ? maxPoint.y : p.y);
}
float dx = maxPoint.x - minPoint.x;
float dy = maxPoint.y - minPoint.y;
maxPoint.x += dx*0.1f; minPoint.x -= dx*0.1f;
maxPoint.y += dy*0.1f; minPoint.y -= dy*0.1f;
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(512, 512);
glutCreateWindow("glview");
glutDisplayFunc(render);
glutMainLoop();
return 0;
}
EDIT: Here is new code based on the discussion below. It renders a sin function consisting of 20 vbos, each containing 100k points. 10k new points are added each rendered frame. This makes a total of 2M points. The performance is real-time on my laptop.
// On Mac OS X, compile using: g++ -O3 -framework GLUT -framework OpenGL glview.cpp
#include <vector>
#include <sstream>
#include <fstream>
#include <iostream>
#include <cmath>
#include <iostream>
#include <GLUT/glut.h>
using namespace std;
struct float2 { float2() {} float2(float x, float y) : x(x), y(y) {} float x, y; };
struct Vbo {
GLuint i;
Vbo(int size) { glGenBuffersARB(1, &i); glBindBufferARB(GL_ARRAY_BUFFER, i); glBufferDataARB(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW); } // could try GL_STATIC_DRAW
void set(const void* data, size_t size, size_t offset) { glBindBufferARB(GL_ARRAY_BUFFER, i); glBufferSubData(GL_ARRAY_BUFFER, offset, size, data); }
~Vbo() { glDeleteBuffers(1, &i); }
};
static const int vboCount = 20;
static const int vboSize = 100000;
static const int pointCount = vboCount*vboSize;
static float endTime = 0.0f;
static const float deltaTime = 1e-3f;
static std::vector<Vbo*> vbos;
static int vboStart = 0;
static void addPoints(float2* points, int pointCount) {
while (pointCount) {
if (vboStart == vboSize || vbos.empty()) {
if (vbos.size() >= vboCount+2) { // remove and reuse vbo
Vbo* first = *vbos.begin();
vbos.erase(vbos.begin());
vbos.push_back(first);
}
else { // create new vbo
vbos.push_back(new Vbo(sizeof(float2)*vboSize));
}
vboStart = 0;
}
int pointsAdded = pointCount;
if (pointsAdded + vboStart > vboSize)
pointsAdded = vboSize - vboStart;
Vbo* vbo = *vbos.rbegin();
vbo->set(points, pointsAdded*sizeof(float2), vboStart*sizeof(float2));
pointCount -= pointsAdded;
points += pointsAdded;
vboStart += pointsAdded;
}
}
static void render() {
// generate and add 10000 points
const int count = 10000;
float2 points[count];
for (int i = 0; i < count; ++i) {
float2 p(endTime, std::sin(endTime*1e-2f));
endTime += deltaTime;
points[i] = p;
}
addPoints(points, count);
// render
glClearColor(1.0f, 1.0f, 1.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(endTime-deltaTime*pointCount, endTime, -1.0f, 1.0f, -1.0f, 1.0f);
glColor3f(0.0f, 0.0f, 0.0f);
glEnableClientState(GL_VERTEX_ARRAY);
for (size_t i = 0; i < vbos.size(); ++i) {
glBindBufferARB(GL_ARRAY_BUFFER, vbos[i]->i);
glVertexPointer(2, GL_FLOAT, sizeof(float2), 0);
if (i == vbos.size()-1)
glDrawArrays(GL_LINE_STRIP, 0, vboStart);
else
glDrawArrays(GL_LINE_STRIP, 0, vboSize);
}
glDisableClientState(GL_VERTEX_ARRAY);
glutSwapBuffers();
glutPostRedisplay();
}
int main(int argc, char* argv[]) {
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(512, 512);
glutCreateWindow("glview");
glutDisplayFunc(render);
glutMainLoop();
return 0;
}

As a number of people have mentioned in their answers, you do not need to plot that many points. I think it is important to rpeat Andrey's comment:
that is a HUGE amount of points! There isn't enough pixels on the screen to plot that amount.
Rewriting plotting routines in different languages is a waste of your time. A huge number of hours have gone into writing MATLAB, whay makes you think you can write a significantly faster plotting routine (in a reasonable amount of time)? Whilst your routine may be less general, and therefore would remove some of the checks that the MATLAB code will perform, your "bottleneck" is that you are trying to plot so much data.
I strongly recommend one of two courses of action:
Sample your data: You do not need 20 x 1000000 points on a figure - the human eye won't be able to distinguish between all the points, so it is a waste of time. Try binning your data for example.
If you maintain that you need all those points on the screen, I would suggest using a different tool. VisIt or ParaView are two examples that come to mind. They are parallel visualisation programs designed to handle extremenly large datasets (I have seen VisIt handle datasets that contained PetaBytes of data).

There is no way you can fit 1000000 data points on a small plot. How about you choose one in every 10000 points and plot those?
You can consider calling imresize on the large vector to shrink it, but manually building a vector by omitting 99% of the points may be faster.
#memyself The sampling operations are already occurring. Matlab is choosing what data to include in the graph. Why do you trust matlab? It looks to me that the graph you showed significantly misrepresents the data. The dense regions should indicate that the signal is at a constant value, but in your graph it could mean that the signal is at that value half the time - or was at that value at least once during the interval corresponding to that pixel?

Would it be possible to use an alternate architectue? For example, use MATLAB to generate the data and use a fast library or application (GNUplot?) to handle the plotting?
It might even be possible to have MATLAB write the data to a stream as the plotter consumes the data. Then the plot would be updated as MATLAB generates the data.
This approach would avoid MATLAB's ridiculously slow plotting and divide the work up between two separate processes. The OS/CPU would probably assign the process to different cores as a matter of course.

I think it's possible, but likely to require writing the plotting code (at least the parts you use) from scratch, since anything you could reuse is exactly what's slowing you down.
To test feasibility, I'd start with testing that any Win32 GUI works from MEX (call MessageBox), then proceed to creating your own window, test that window messages arrive to your WndProc. Once all that's going, you can bind an OpenGL context to it (or just use GDI), and start plotting.
However, the savings is likely to come from simpler plotting code and use of newer OpenGL features such as VBOs, rather than threading. Everything is already parallel on the GPU, and more threads don't help transfer of commands/data to the GPU any faster.

I did a very similar thing many many years ago (2004?). I needed an oscilloscope-like display for kilohertz sampled biological signals displayed in real time. Not quite as many points as the original question has, but still too many for MATLAB to handle on its own. IIRC I ended up writing a Java component to display the graph.
As other people have suggested, I also ended up down-sampling the data. For each pixel on the x-axis, I calculated the minimum and maximum values taken by the data, then drew a short vertical line between those values. The entire graph consisted of a sequence of short vertical lines, each immediately adjacent to the next.
Actually, I think that the implementation ended up writing the graph to a bitmap that scrolled continuously using bitblt, with only new points being drawn ... or maybe the bitmap was static and the viewport scrolled along it ... anyway it was a long time ago and I might not be remembering it right.

Blockquote
EDIT4: I know that this is a huge amount of points to plot but I never said that the problem was easy. I can not just leave out certain data points, because there's no way of assessing what points are important, before actually plotting them
Blockquote
This is incorrect. There is a way to to know which points to leave out. Matlab is already doing it. Something is going to have to do it at some point no matter how you solve this. I think you need to redirect your problem to be "how do I determine which points I should plot?".
Based on the screenshot, the data looks like a waveform. You might want to look at the code of audacity. It is an open source audio editing program. It displays plots to represent the waveform in real time, and they look identical in style to the one in your lowest screen shot. You could borrow some sampling techniques from them.

What you are looking for is the creation of a MEX file.
Rather than me explaining it, you would probably benefit more from reading this: Creating C/C++ and Fortran Programs to be Callable from MATLAB (MEX-Files) (a documentation article from MathWorks).
Hope this helps.

C++ shader question

I am using Nvidia CG and Direct3D9 and have the question about the following code.
It compiles, but doesn't "loads" (using cgLoadProgram wrapper) and the resulting failure is described simplyas D3D failure happened.
It's a part of the pixel shader compiled with shader model set to 3.0
What may be interesting is that this shader loads fine in the following cases:
1) Manually unrolling the while statement (to many if { } statements).
2) Removing the line with the tex2D function in the loop.
3) Switching to shader model 2_X and manually unrolling the loop.
Problem part of the shader code:
float2 tex = float2(1, 1);
float2 dtex = float2(0.01, 0.01);
float h = 1.0 - tex2D(height_texture1, tex);
float height = 1.00;
while ( h < height )
{
height -= 0.1;
tex += dtex;
// Remove the next line and it works (not as expected,
// of course)
h = tex2D( height_texture1, tex );
}
If someone knows why this can happen or could test the similiar code in non-CG environment or could help me in some other way, I'm waiting for you ;)
Thanks.

I think you need to determine the gradients before the loop using ddx/ddy on the texture coordinates and then use tex2D(sampler2D samp, float2 s, float2 dx, float2 dy)
The GPU always renders quads not pixels (even on pixel borders - superfluous pixels are discarded by the render backend). This is done because it allows it to always calculate the screen space texture derivates even when you use calculated texture coordinates. It just needs to take the difference between the values at the pixel centers.
But this doesn't work when using dynamic branching like in the code in the question, because the shader processors at the individual pixels could diverge in control flow. So you need to calculate the derivates manually via ddx/ddy before the program flow can diverge.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js