OpenGL Crashes With Heavy Calculation [closed] - c++

I am new to OpenGL. My first project consists of rendering a Mandelbrot set (which I find quite fascinating), and due to the nature of the calculations involved I thought it would be better to do them on the GPU: basically, I apply a complex function to each point of a part of the complex plane, many times over, and color that point based on the output. Lots of parallelizable calculations, which seems like a good fit for a GPU, right?
So everything works well when there aren't too many calculations for a single image, but as soon as pixels * iterations goes past about 9 billion, the program crashes (the image displayed shows that only part of it has been calculated; the cyan part is the initial background):
Dark Part of the Mandelbrot Set not Fully Calculated
In fact, if the total number of calculations is below this limit but close enough (say 8.5 billion), it still crashes, but it takes much more time. So I guess there is some kind of problem that doesn't appear at a sufficiently small number of calculations (it has always worked flawlessly until it got there). I have really no idea what it could be, since I am really new to this. When the program crashes it says: "Unhandled exception at 0x000000005DA6DD38 (nvoglv64.dll) in Mandelbrot Set.exe: Fatal program exit requested.". It is also always the same address (it only changes when I exit Visual Studio, my IDE).
Well, here is the whole code, plus the shader files (the vertex shader isn't doing anything; all calculations are in the fragment shader):
EDIT :
Here is a link to all the .cpp and .h files of the project; the code is too large to be placed here and is correct anyway (though far from perfect):
https://github.com/JeffEkaka/Mandelbrot/tree/master
Here are the shaders :
NoChanges.vert (vertex shader)
#version 400

// Inputs
in vec2 vertexPosition; // 2D vec.
in vec4 vertexColor;

out vec2 fragmentPosition;
out vec4 fragmentColor;

void main() {
    gl_Position.xy = vertexPosition;
    gl_Position.z = 0.0;
    gl_Position.w = 1.0; // Default.

    fragmentPosition = vertexPosition;
    fragmentColor = vertexColor;
}
CalculationAndColorShader.frag (fragment shader)
#version 400

uniform int WIDTH;
uniform int HEIGHT;
uniform int iter;

uniform double xmin;
uniform double xmax;
uniform double ymin;
uniform double ymax;

void main() {
    dvec2 z, c;

    c.x = xmin + (double(gl_FragCoord.x) * (xmax - xmin) / double(WIDTH));
    c.y = ymin + (double(gl_FragCoord.y) * (ymax - ymin) / double(HEIGHT));

    int i;
    z = c;
    for (i = 0; i < iter; i++) {
        double x = (z.x * z.x - z.y * z.y) + c.x;
        double y = (z.y * z.x + z.x * z.y) + c.y;

        if ((x * x + y * y) > 4.0) break;
        z.x = x;
        z.y = y;
    }

    float t = float(i) / float(iter);

    float r = 9 * (1 - t) * t * t * t;
    float g = 15 * (1 - t) * (1 - t) * t * t;
    float b = 8.5 * (1 - t) * (1 - t) * (1 - t) * t;

    gl_FragColor = vec4(r, g, b, 1.0);
}
I am using SDL 2.0.5 and GLEW 2.0.0, and the latest version of OpenGL, I believe. The code has been compiled with Visual Studio (the MSVC compiler, I believe) with some optimizations enabled. Also, I am using doubles even in my GPU calculations (I know they are ultra-slow, but I need their precision).

The first thing you need to understand is that "context switching" is different on GPUs (and, in general, most Heterogeneous architectures) than it is on CPU/Host architectures. When you submit a task to the GPU—in this case, "render my image"—the GPU will solely work on that task until completion.
There are a few details I'm abstracting, naturally: NVidia hardware will try to schedule smaller tasks on unused cores, and all three major vendors (AMD, Intel, NVidia) have some fine-tuned behaviors which complicate my above generalization, but as a matter of principle, you should assume that any task submitted to the GPU will consume the GPU's entire resources until completed.
On its own, that's not a big problem.
But on Windows (and most consumer operating systems), if the GPU spends too much time on a single task, the OS will assume that the GPU isn't responding, and will do one of several different things (or possibly a combination of them):
Crash: doesn't happen so much anymore, but on older systems I have bluescreened my computers with over-ambitious Mandelbrot renders
Reset the driver: which means you'll lose all OpenGL state, and is essentially unrecoverable from the program's perspective
Abort the operation: Some newer device drivers are clever enough to simply kill the task rather than killing the entire context state. But this can depend on the specific API you're using: my OpenGL/GLSL based Mandelbrot programs tend to crash the driver, whereas my OpenCL programs usually have more elegant failures.
Let it go to completion, without issue: This will only happen if the GPU in question is not used by the Operating System as its display driver. So this is only an option if you have more than one Graphics card in your system and you explicitly ensure that rendering is happening on the Graphics Card not used by the OS, or if the card being used is a Compute Card that probably doesn't have display drivers associated with it. In OpenGL, this is basically a non-starter, but if you were using OpenCL or Vulkan, this might be a potential work-around.
The exact timing varies, but you should generally assume that if a single task takes more than 2 seconds, it'll crash the program.
So how do you fix this problem? Well, if this were an OpenCL-based render, it would be pretty easy:
std::vector<cl_event> events;
for (int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for (int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        events.emplace_back();
        //I'm abstracting the clEnqueueNDRangeKernel call
        submit_task(queue, kernel, render_start, render_end, &events.back(), /*...*/);
    }
}
clWaitForEvents(events.size(), events.data());
In OpenGL, you can use the same basic principle, but things are made a bit more complicated by how abstract the OpenGL model is. Because drivers are wont to bundle multiple draw calls into a single command to the underlying hardware, you need to explicitly make them behave, or else the driver will bundle everything together and you'll get the exact same problem, even though you've written the code specifically to break up the task.
for (int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for (int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        render_portion_of_image(render_start, render_end);
        //The call to glFinish is the important part: otherwise, even breaking up
        //the task like this, the driver might still try to bundle everything together!
        glFinish();
    }
}
The exact appearance of render_portion_of_image is something you'll need to design yourself, but the basic idea is to specify to the program that only the pixels between render_start and render_end are to be rendered.
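For example, one way to restrict rendering to a tile is the scissor test. This is only a minimal sketch under the assumption that the full-screen quad and the Mandelbrot shader are already bound; the function name matches the placeholder above, everything else is illustrative:
void render_portion_of_image(const int32_t render_start[2], const int32_t render_end[2])
{
    // Only fragments inside the scissor rectangle are processed;
    // the rest of the full-screen quad is discarded early.
    glEnable(GL_SCISSOR_TEST);
    glScissor(render_start[0], render_start[1],
              render_end[0] - render_start[0],
              render_end[1] - render_start[1]);
    glDrawArrays(GL_TRIANGLES, 0, 6);   // the same full-screen quad every time
    glDisable(GL_SCISSOR_TEST);
}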
You might be wondering what the value of KERNEL_SIZE should be. That's something you'll have to experiment with on your own, as it depends entirely on how powerful your graphics card is. The value should be:
Small enough that no single task will ever take more than x quantity of time (I usually go for 50 milliseconds, but as long as you keep it below half a second, it's usually safe)
Large enough that you're not submitting hundreds of thousands of tiny tasks to the GPU. At a certain point, you'll spend more time synchronizing the Host←→GPU interface than actually doing work on the GPU, and since GPU architectures often have hundreds or even thousands of cores, if your tasks are too small, you'll lose speed simply by not saturating all the cores.
In my personal experience, the best way to determine this is to run a few "testing" renders before the program starts: render the central bulb of the Mandelbrot Set at 10,000 iterations of the escape algorithm on a 32x32 image (rendered all at once, with no breaking up of the algorithm), and see how long it takes. The algorithm I use essentially looks like this:
int32_t KERNEL_SIZE = 32;
std::chrono::nanoseconds duration{0};
while (KERNEL_SIZE < 2048 && duration < std::chrono::milliseconds(50)) {
    //duration_of is some code I've written to time the task. It's best to use GPU-based
    //profiling, as it'll be more accurate than host-profiling.
    duration = duration_of([&]{ render_whole_image(KERNEL_SIZE); });
    if (duration < std::chrono::milliseconds(50)) {
        if (is_power_of_2(KERNEL_SIZE)) KERNEL_SIZE += KERNEL_SIZE / 2;
        else KERNEL_SIZE += KERNEL_SIZE / 3;
    }
}
final_kernel_size = KERNEL_SIZE;
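For completeness, a minimal host-side version of duration_of could look like the sketch below (the real thing should prefer GPU timer queries, as the comment says); the glFinish is what makes host-side timing meaningful at all:
#include <chrono>

template <typename Task>
std::chrono::nanoseconds duration_of(Task&& task)
{
    auto start = std::chrono::steady_clock::now();
    task();
    glFinish();   // wait until the GPU has actually finished the submitted work
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - start);
}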
The last thing I'd recommend is to use OpenCL for the heavy lifting of rendering the Mandelbrot set itself, and use OpenGL (including the OpenGL←→OpenCL interop API!) to actually display the image on screen. On a technical level, OpenCL will be neither faster nor slower than OpenGL, but it gives you a lot of control over the operations you perform, and it's easier to reason about what the GPU is doing (and what you need to do to alter its behavior) when you're using a more explicit API than OpenGL. You could, if you want to stick to a single API, use Vulkan instead, but since Vulkan is extremely low-level and thus very complicated to use, I don't recommend that unless you're up to the challenge.
EDIT: A few other things:
I'd have multiple versions of the program: one that renders with floats, and another that renders with doubles. In my version of this program, I actually have a version that uses two float values to simulate a double, as described here. On most hardware this can be slower, but on certain architectures (particularly NVidia's Maxwell architecture), if float processing is fast enough, it can actually outperform double by sheer magnitude: on some GPU architectures, floats are 32x faster than doubles.
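As a rough illustration of the two-float idea (not the exact code from the linked article), the addition step is based on Knuth's TwoSum and looks roughly like this; it only works if the compiler is not allowed to reassociate float math (no fast-math), and multiplication additionally needs Dekker splitting or an FMA, which I omit here:
struct df { float hi, lo; };   // value represented as hi + lo

df df_add(df a, df b)
{
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);   // rounding error of hi + hi
    e += a.lo + b.lo;                          // fold in the low parts
    float hi = s + e;                          // renormalize so |lo| stays small
    return { hi, e - (hi - s) };
}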
You might be tempted to have an "adaptive" algorithm that dynamically adjusts kernel size on the fly. This is more trouble than it's worth, and the time spent on the host reevaluating the next kernel size will outweigh any slight performance gains you otherwise achieve.


which one is proper method of writing this gl code

I have been doing some experiments with OpenGL and handling textures.
In my experiment I have a 2D array of ints which are randomly generated:
int mapskeleton[300][300];
Then after that I have my own OBJ file loader for loading models with textures:
m2d wall, floor; // I initialize and load those files at start
For recording statistics of render times I used:
bool Once = true;
int secs = 0;
Now to the render code; here is my experiment:
// Code A: Benchmarked on radeon 8670D
// Takes 232 (average) millisecs for drawing 300*300 tiles
if (Once)
    secs = glutGet(GLUT_ELAPSED_TIME);
for (int i = 0; i < mapHeight; i++) {
    for (int j = 0; j < mapWidth; j++) {
        if (mapskeleton[j][i] == skel_Wall) {
            glBindTexture(GL_TEXTURE_2D, wall.texture);
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            wall.Draw(); // Draws 10 textured triangles
            glPopMatrix();
        }
        if (mapskeleton[j][i] == skel_floor) {
            glBindTexture(GL_TEXTURE_2D, floor.texture);
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            floor.Draw(); // Draws 2 textured triangles
            glPopMatrix();
        }
    }
}
if (Once) {
    secs = glutGet(GLUT_ELAPSED_TIME) - secs;
    printf("time taken for rendering %i msecs\n", secs);
    Once = false;
}
And the other code is:
// Code B: Benchmarked on radeon 8670D
// Takes 206 (average) millisecs for drawing 300*300 tiles
if (Once)
    secs = glutGet(GLUT_ELAPSED_TIME);
glBindTexture(GL_TEXTURE_2D, floor.texture);
for (int i = 0; i < mapHeight; i++) {
    for (int j = 0; j < mapWidth; j++) {
        if (mapskeleton[j][i] == skel_floor) {
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            floor.Draw();
            glPopMatrix();
        }
    }
}
glBindTexture(GL_TEXTURE_2D, wall.texture);
for (int i = 0; i < mapHeight; i++) {
    for (int j = 0; j < mapWidth; j++) {
        if (mapskeleton[j][i] == skel_Wall) {
            glPushMatrix();
            glTranslatef(j*10, i*10, 0);
            wall.Draw();
            glPopMatrix();
        }
    }
}
if (Once) {
    secs = glutGet(GLUT_ELAPSED_TIME) - secs;
    printf("time taken for rendering %i msecs\n", secs);
    Once = false;
}
To me, code A looks better from a beginner's point of view, but the benchmarks say otherwise.
My GPU seems to like code B. I don't understand why code B takes less time to render.
Changes to OpenGL state can generally be expensive: the driver's and/or GPU's data structures and caches can become invalidated. In your case, the change in question is binding a different texture. In code B, you're doing it twice. In code A, you're easily doing it thousands of times.
When programming OpenGL rendering, you'll generally want to set up the pipeline for settings A, render everything which needs settings A, re-set the pipeline for settings B, render everything which needs settings B, and so on.
@Angew covered why one option is more efficient than the other. But there is an important point that needs to be stated very clearly. Based on the text of your question, particularly here:
for recording statistics of render times
my gpu seems to like code B
you seem to attempt to measure rendering/GPU performance.
You are NOT AT ALL measuring GPU performance!
You measure the time for setting up the state and making the draw calls. OpenGL lets the GPU operate asynchronously from the code executed on the CPU. The picture you should keep in mind when you make (most) OpenGL calls is that you're submitting work to the GPU for later execution. There's no telling when the GPU completes that work. It most definitely (except for very few calls that you want to avoid in speed critical code) does not happen by the time the call returns.
What you're measuring in your code is purely the CPU overhead for making these calls. This includes what's happening in your own code, and what happens in the driver code for handling the calls and preparing the work for later submission to the GPU.
I'm not saying that the measurement is not useful. Minimizing CPU overhead is very important. You just need to be very aware of what you are in fact measuring, and make sure that you draw the right conclusions.
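If you do want to know how long the GPU itself takes, one option (assuming a desktop GL 3.3+ context or the ARB_timer_query extension, which the fixed-function/GLUT setup above may not give you) is a timer query; this is a minimal sketch:
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
// ... issue the draw calls you want to measure ...
glEndQuery(GL_TIME_ELAPSED);

GLuint64 gpu_ns = 0;
// Blocks until the result is available, so read it after the frame, not per draw.
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpu_ns);
glDeleteQueries(1, &query);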

Drawing big circles from scratch [closed]

I am new to C++, but the language seems alright to me. As a learning project I have decided to make a minor 2D graphics engine. It might seem like a hard project, but I have a good idea of how to move on.
I haven't really started yet; I am just forming things in my head at the moment, when I came across this problem:
At some point I will have to make a function to draw circles on the screen. My approach to that right now would be something like this:
in a square with sides from (x-r) to (x+r), loop through x and y;
if at each point the current distance sqrt(x^2 + y^2) is less than or equal to r,
then draw a pixel at that point.
This would work; if not, don't bother telling me, I'll figure it out. I would of course only draw this circle if x+r and y+r are on the screen.
The problem lies in that I will need to draw really big circles sometimes. If for example I need to draw a circle with radius 5000, the pixel loop would need to run a total of 10000^2 times. So with a processor at 2 GHz, this single circle could only be rendered about 2 GHz/(10000^2) ≈ 20 times/s while taking up the whole core (assuming it only takes one calculation per pixel, which is nowhere near the truth).
Which approach is the correct one here? I guess it has something to do with using the graphics hardware for these simple calculations. If so, can I use OpenGL with C++ for this? I'd like to learn that as well :)
My first C/C++ projects were in fact graphics libraries as well. I did not have OpenGL or DirectX and was using DOS at the time. I learned quite a lot from it, as I constantly found new and better (and faster) ways to draw to the screen.
The problem with modern operating systems is that they don't really allow you to do what I did back then. You cannot just start using the hardware directly. And frankly, these days you don't want to anymore.
You can still draw everything yourself. Have a look at SDL if you want to put your own pixels. This is a C library that you will be able to wrap into your own C++ objects. It works on different platforms (including Linux, Windows, Mac,...) and internally it will make use of things like DirectX or OpenGL.
For real-world graphics, one doesn't just go about drawing one's own pixels. That is not efficient. Or at least not on devices where you cannot use the hardware directly...
But for your purposes, I think SDL is definitely the way to go! Good luck with that.
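To give an idea of scale, plotting your own pixels with SDL2 boils down to something like the sketch below (window size and colors are arbitrary); the circle test from the question fits straight into the inner condition:
#include <SDL.h>

int main(int, char**)
{
    SDL_Init(SDL_INIT_VIDEO);
    SDL_Window*   win = SDL_CreateWindow("Circle", SDL_WINDOWPOS_CENTERED,
                                         SDL_WINDOWPOS_CENTERED, 800, 600, 0);
    SDL_Renderer* ren = SDL_CreateRenderer(win, -1, SDL_RENDERER_ACCELERATED);

    SDL_SetRenderDrawColor(ren, 0, 0, 0, 255);
    SDL_RenderClear(ren);

    SDL_SetRenderDrawColor(ren, 255, 255, 255, 255);
    const int cx = 400, cy = 300, r = 200;
    for (int y = -r; y <= r; ++y)
        for (int x = -r; x <= r; ++x)
            if (x * x + y * y <= r * r)                 // the "distance <= r" test
                SDL_RenderDrawPoint(ren, cx + x, cy + y);

    SDL_RenderPresent(ren);
    SDL_Delay(3000);

    SDL_DestroyRenderer(ren);
    SDL_DestroyWindow(win);
    SDL_Quit();
    return 0;
}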
You don't do graphics by manually drawing pixels to screen, that way madness lies.
What you want to use is either DirectX or OpenGL. I suggest you crack open google and go read, there's a lot to read out there.
Once you've downloaded the libs there's lots of sample projects to take a look at, they'll get you started.
There are two approaches at this point: there's the mathematical way of calculating the vectors that describe a shape with a very large number of sides (i.e. it'll look like a circle). Or there's the 'cheating' method of just drawing a texture (i.e. a picture) of a circle to the screen with an alpha channel to make the rest of the texture transparent. (The cheating method is easier to code, faster to execute, and produces a better result, although it is less flexible.)
If you want to do it mathematically then both of these libraries will allow you to draw lines to screen, so you need to begin your approach from the view of start point and end point of each line, not the individual pixels. i,e you want vector graphics.
I can't do the heavy maths right now, but the vector approach might look a little like this (pseudo-code):
in-param: num_of_sides, length_of_side;

float angle = 360 / num_of_sides;
float start_x = 0;
float start_y = 0;

float x = start_x;
float y = start_y;

for (int i(0); i < num_of_sides; ++i)
{
    float endX, endY;
    rotateOffsetByAngle(x, y, x + length_of_side, y, angle * i, endX, endY);
    drawline(x, y, endX, endY);
    x = endX;
    y = endY;
}
drawline(float startX, startY, endX, endY)
{
    //does code that draws line between the start and end coordinates;
}

rotateOffsetByAngle(float startX, startY, endX, endY, angle, &outX, &outY)
{
    //the in-parameters startX, startY and endX, endY describe a line
    //we treat this line as the offset from the starting point
    //do code that rotates this line around the point startX, startY, by the angle.
    //after this rotation is done endX and endY are now at the same
    //distance from startX and startY that they were, but rotated.
    outX = endX;
    outY = endY; //pass these new coordinates back out by reference;
}
In the above code we move around the outside of the circle, drawing each individual line one by one. For each line we have a start point and an offset; we then rotate the offset by an angle (this angle increases as we move around the circle). Then we draw the line from the start point to the offset point. Before we begin the next iteration, we move the start point to the offset point so the next line starts from the end of the last.
I hope that's understandable.
That is one way to draw a filled circle. It will perform appallingly slowly, as you can see.
Modern graphics is based on abstracting away the lower-level stuff so that it can be optimised; the developer writes drawCircle(x,y,r) and the graphics library + drivers can pass that all the way down to the chip, which can fill in the appropriate pixels.
Although you are writing in C++, you are not manipulating data closest to the core unless you use the graphics drivers. There are layers of subroutine calls between even your setPixelColour level methods and an actual binary value being passed over the wire; at almost every layer there are checks and additional calculations and routines run. The secret to faster graphics, then, is to reduce the number of these calls you make. If you can get the command drawCircle all the way to the graphics chip, do that. Don't waste a call on a single pixel, when it's as mundane as drawing a regular shape.
In a modern OS, there are layers of graphics processing taking the requests of individual applications like yours and combining them with the windowing, compositing and any other effects. So your command to 'draw to screen' is already mediated by several layers. What you want is to provide the minimum information necessary to offload the calculations to the graphics subsystem.
I would say if you want to learn to draw stuff on the screen, play with canvas and JS, as the development cycle is easy and comparatively painless. If you want to learn C++, try Project Euler, or draw stuff using existing graphics libraries. If you want to write a 2D graphics library, learn the underlying graphics technologies like DirectX and OpenGL, because they are the way that graphics is done in reality. But they seem so complex, you say? Then you need to learn more C++ first. They are the way they are for some very good reasons, however complex the result is.
As the first answer says, you shouldn't do this yourself for serious work. But if you just want to do this as an example, then you could do something like this: First define a function for drawing line segments on the screen:
void draw_line(int x1, int y1, int x2, int y2);
This should be relatively straightforward to implement: just find the direction that is changing fastest, and iterate over that direction while using integer logic to find out how much the other dimension should change. I.e., if x is changing faster, then y = y1 + (x - x1)*(y2 - y1)/(x2 - x1).
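A minimal sketch of such a draw_line, stepping along the faster-changing direction and interpolating the other with integer arithmetic (set_pixel() is assumed to exist and plot a single point):
#include <cstdlib>    // std::abs
#include <algorithm>  // std::max

void draw_line(int x1, int y1, int x2, int y2)
{
    int dx = x2 - x1, dy = y2 - y1;
    int steps = std::max(std::abs(dx), std::abs(dy));   // length along the major axis
    if (steps == 0) { set_pixel(x1, y1); return; }
    for (int i = 0; i <= steps; ++i)
        set_pixel(x1 + dx * i / steps,                  // integer interpolation
                  y1 + dy * i / steps);
}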
Then use this function to implement a circle as piecewise line elements:
void draw_circle(int x, int y, int r)
{
    double dtheta = 2*M_PI/8/r;           // roughly one segment per pixel of circumference
    int x1 = x+r, x2, y1 = y, y2;
    int n = 2*M_PI/dtheta;
    for (int i = 1; i <= n; i++)          // <= n so the last segment closes the circle
    {
        double theta = i*dtheta;
        x2 = int(x+r*cos(theta)); y2 = int(y+r*sin(theta));
        draw_line(x1, y1, x2, y2);
        x1 = x2; y1 = y2;
    }
}
This uses floating point logic and trigonometric functions to figure out which line elements best approximate a circle. It is a somewhat crude implementation, but I think any implementation that wants to be efficient for very large circles has to do something like this.
If you are only allowed to use integer logic, one approach could be to first draw a low-resolution integer circle, and then subdivide each selected pixel into smaller pixels, and choose the sub-pixels you want there, and so on. This would scale as N log N, so still slower than the approach above. But you would be able to avoid sin and cos.

Measure Render-To-Texture Performance in OpenGL ES 2.0

Basically, I'm doing some sort of image processing using a screen-sized rectangle made of two triangles and a fragment shader, which does the whole processing. The actual effect is something like an animation, as it depends on a uniform variable called current_frame.
I'm very much interested in measuring the performance in terms of "MPix/s". What I do is something like that:
/* Setup all necessary stuff, including: */
/* - getting the location of the `current_frame` uniform */
/* - creating an FBO, adding a color attachment */
/*   and setting it as the current one */

double current_frame = 0;
double step = 1.0 / NUMBER_OF_ITERATIONS;

tic(); /* Start counting the time */
for (i = 0; i < NUMBER_OF_ITERATIONS; i++)
{
    glUniform1f(current_frame_handle, current_frame);
    current_frame += step;
    glDrawArrays(GL_TRIANGLES, 0, NUMBER_OF_INDICES);
    glFinish();
}
double elapsed_time = tac(); /* Get elapsed time in seconds */

/* Calculate achieved pixels per second */
double pps = (OUT_WIDTH * OUT_HEIGHT * NUMBER_OF_ITERATIONS) / elapsed_time;

/* Sanity check by reading the output into a buffer */
/* using glReadPixels and saving this buffer into a file */
As far as theory goes, is there anything wrong with my concept?
Also, I've got the impression that glFinish() on mobile hardware doesn't necessarily wait for previous render calls and may do some optimizations.
Of course, I can always force it by doing glReadPixels() after each draw, but that would be quite slow so that this wouldn't really help.
Could you advise me as to whether my testing scenario is sensible, and whether there is something more that can be done?
Concerning speed, using glDrawArrays() still duplicates the shared vertices. glDrawElements() is the solution to reduce the number of vertices in the array, so it allows transferring less data to OpenGL.
http://www.songho.ca/opengl/gl_vertexarray.html
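A minimal sketch of what the indexed full-screen quad could look like in ES 2.0, using client-side arrays (position_loc stands for whatever attribute location your program uses):
static const GLfloat quad_verts[] = { -1.f, -1.f,   1.f, -1.f,
                                       1.f,  1.f,  -1.f,  1.f };   // 4 unique corners
static const GLushort quad_idx[]  = { 0, 1, 2,   0, 2, 3 };        // two triangles

glEnableVertexAttribArray(position_loc);
glVertexAttribPointer(position_loc, 2, GL_FLOAT, GL_FALSE, 0, quad_verts);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, quad_idx);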
Just throwing that in there to help speed up your results. As far as your timing concept, it looks fine to me. Are you getting results similar to what you had hoped?
I would precalculate all possible frames, and then use glEnableClientState() and glTexCoordPointer() to change which part of the existing texture is drawn in each frame.

OpenGL Pixel Shader: how to generate random matrix of 0s and 1s (on each pixel)?

So what I need is simple: each time we run our shader (meaning for each pixel) I need to calculate a random matrix of 1s and 0s with resolution == originalImageResolution. How to do such a thing?
As for now I have created one for Shadertoy. The random matrix resolution is set to 15 by 15 here, because the GPU makes Chrome crash often when I try something like 200 by 200, while what I really need is the full image resolution size:
#ifdef GL_ES
precision highp float;
#endif

uniform vec2 resolution;
uniform float time;
uniform sampler2D tex0;

float rand(vec2 co){
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * (43758.5453 + time));
}

vec3 getOne(){
    vec2 p = gl_FragCoord.xy / resolution.xy;
    vec3 one;
    for (int i = 0; i < 15; i++) {
        for (int j = 0; j < 15; j++) {
            if (rand(p) <= 0.5)
                one = (one.xyz + texture2D(tex0, vec2(j, i)).xyz) / 2.0;
        }
    }
    return one;
}

void main(void)
{
    gl_FragColor = vec4(getOne(), 1.0);
}
And one for Adobe pixel bender:
<languageVersion: 1.0;>
kernel random
<   namespace : "Random";
    vendor : "Kabumbus";
    version : 3;
    description : "not as random as needed, not as fast as needed"; >
{
    input image4 src;
    output float4 outputColor;

    float rand(float2 co, float2 co2){
        return fract(sin(dot(co.xy, float2(12.9898, 78.233))) * (43758.5453 + (co2.x + co2.y)));
    }

    float4 getOne(){
        float4 one;
        float2 r = outCoord();
        for (int i = 0; i < 200; i++) {
            for (int j = 0; j < 200; j++) {
                if (rand(r, float2(i, j)) >= 1.0)
                    one = (one + sampleLinear(src, float2(j, i))) / 2.0;
            }
        }
        return one;
    }

    void
    evaluatePixel()
    {
        float4 oc = getOne();
        outputColor = oc;
    }
}
So my real problem is: my shaders make my GPU driver crash. How can I use GLSL for the same purpose as I do now, but without crashing, and if possible faster?
Update:
What I want to create is called a Single-Pixel Camera (google Compressive Imaging or Compressive Sensing); I want to create a GPU-based software implementation.
The idea is simple:
we have an image - NxM.
for each pixel in the image we want the GPU to perform the following operations:
generate an NxM matrix of random values - 0s and 1s.
compute the arithmetic mean of all pixels of the original image whose coordinates correspond to the coordinates of 1s in our random NxM matrix.
output the result of the arithmetic mean as the pixel color.
What I tried to implement in my shaders was a simulation of that very process.
What is really stupid in trying to do this on the GPU:
Compressive Sensing does not require us to compute the NxM matrix of such arithmetic mean values; it needs just a piece of it (for example 1/3). So I am putting some pressure on the GPU that I do not need to. However, testing on more data is not always a bad idea.
Thanks for adding more detail to clarify your question. My comments are getting too long so I'm going to an answer. Moving comments into here to keep them together:
Sorry to be slow, but I am trying to understand the problem and the goal. In your GLSL sample, I don't see a matrix being generated. I see a single vec3 being generated by summing a random selection (varying over time) of cells from a 15 x 15 texture (matrix). And that vec3 is recomputed for each pixel. Then the vec3 is used as the pixel color.
So I'm not clear whether you really want to create a matrix, or just want to compute a value for every pixel. The latter is in some sense a 'matrix', but computing a simple random value for 200 x 200 pixels would not strain your graphics driver. Also you said you wanted to use the matrix. So I don't think that's what you mean.
I'm trying to understand why you want a matrix - to preserve a consistent random basis for all the pixels? If so, you can either precompute a random texture, or use a consistent pseudorandom function like you have in rand() except not use time. You clearly know about that so I guess I still don't understand the goal. Why are you summing a random selection of cells from the texture, for each pixel?
I believe the reason your shader is crashing is that your main() function is exceeding its time limit - either for a single pixel, or for the whole set of pixels. Calling rand() 40,000 times per pixel (in a 200 * 200 nested loop) could certainly explain that!
If you had 200 x 200 pixels, and are calling sin() 40k times for each one, that's 1,600,000,000 calls per frame. Poor GPU!
I'm hopeful that if we understand the goal better, we'll be able to recommend a more efficient way to get the effect you want.
Update.
(Deleted this part, since it was mistaken. Even though many cells in the source matrix may each contribute less than a visually detectable amount of color to the result, the total of the many cells can contribute a visually detectable amount of color.)
New update based on updated question.
OK, (thinking "out loud" here so you can check whether I'm understanding correctly...) Since you need each of the random NxM values only once, there is no actual requirement to store them in a matrix; the values can simply be computed on demand and then thrown away. That's why your example code above does not actually generate a matrix.
This means we cannot get away from generating (NxM)^2 random values per frame; that is, NxM random values per pixel, and there are NxM pixels. So for N = M = 200, that's 1.6 billion random values per frame.
However, we can still optimize some things.
First, since your random values only need to be one bit each (you only need a boolean answer to decide whether to include each cell from the source texture into the mix), you can probably use a cheaper pseudo random number generator. The one you're using outputs much more random data per call than one bit. For example, you could call the same PRNG function as you're using now, but store the value and extract 32 random bits out of it. Or at least several, depending on how many are random enough. In addition, instead of using a sin() function, if you have extension GL_EXT_gpu_shader4 (for bitwise operators), you could use something like this:
int LFSR_Rand_Gen(in int n)
{
    // <<, ^ and & require GL_EXT_gpu_shader4.
    n = (n << 13) ^ n;
    return (n * (n*n*15731+789221) + 1376312589) & 0x7fffffff;
}
Second, you are currently performing one divide operation per included cell (/2.0), which is probably relatively expensive, unless the compiler and GPU are able to optimize it into a bit shift (is that possible for floating point?). This also will not give the arithmetic mean of the input values, as discussed above... it will put much more weight on the later values and very little on the earlier ones. As a solution, keep a count of how many values are being included, and divide by that count once, after the loop is finished.
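As a conceptual model (plain C++ here, but the structure maps one-to-one onto the GLSL loop), the fix looks like this; include_this() and source_pixel() are placeholders for the random bit and the texture fetch:
float sum_r = 0.0f, sum_g = 0.0f, sum_b = 0.0f;
int   count = 0;

for (int i = 0; i < N; ++i) {
    for (int j = 0; j < M; ++j) {
        if (include_this(i, j)) {          // one random bit decides inclusion
            sum_r += source_pixel(i, j).r; // accumulate only, no divide here
            sum_g += source_pixel(i, j).g;
            sum_b += source_pixel(i, j).b;
            ++count;
        }
    }
}

// One divide after the loop gives the true arithmetic mean of the included cells.
if (count > 0) {
    sum_r /= count;
    sum_g /= count;
    sum_b /= count;
}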
Whether these optimizations will be enough for your GPU driver to handle 200x200 pixels with 200x200 samples each per frame, I don't know. They should definitely enable you to increase your resolution substantially.
Those are the ideas that occur to me off the top of my head. I am far from being a GPU expert though. It would be great if someone more qualified can chime in with suggestions.
P.S. In your comment, you jokingly (?) mentioned the option of precomputing N*M NxM random matrices. Maybe that's not a bad idea?? 40,000 x 40,000 is a big texture (around 200 MB even at one bit per cell), but if you store 32 bits of random data per cell, that comes down to 1250 x 40,000 cells. Too bad vanilla GLSL doesn't help you with bitwise operators to extract the data, but even if you don't have the GL_EXT_gpu_shader4 extension you can still fake it. (Maybe you would also need a special extension then for non-square textures?)

Plotting waveform of the .wav file

I want to plot the waveform of a .wav file for a specific plotting width.
Which method should I use to display a correct waveform plot?
Any suggestions, tutorials, or links are welcome.
Basic algorithm:
Find number of samples to fit into draw-window
Determine how many samples should be presented by each pixel
Calculate RMS (or peak) value for each pixel from a sample block. Averaging does not work for audio signals.
Draw the values.
Let's assume that n (number of samples) = 44100 and w (width) = 100 pixels:
then each pixel should represent 44100/100 == 441 samples (blocksize)
for (x = 0; x < w; x++)
    draw_pixel(x_offset + x,
               y_baseline - rms(&mono_samples[x * blocksize], blocksize));
Stuff to try for a different visual appearance:
RMS vs. max value from the block
overlapping blocks (blocksize x, but advance x/2 for each pixel, etc.)
Downsampling would probably not work, as you would lose peak information.
Either use RMS (BlockSize depends on how far you are zoomed in!):
float RMS = 0;
for (int a = 0; a < BlockSize; a++)
{
    RMS += Samples[a]*Samples[a];
}
RMS = sqrt(RMS/BlockSize);
or Min/Max (this is what Cool Edit/Audition uses):
float Max = -10000000;
float Min =  10000000;
for (int a = 0; a < BlockSize; a++)
{
    if (Samples[a] > Max) Max = Samples[a];
    if (Samples[a] < Min) Min = Samples[a];
}
Almost any kind of plotting is platform specific. That said, .wav files are most commonly used on Windows, so it's probably a fair guess that you're interested primarily (or exclusively) in code for Windows as well. In this case, it mostly depends on your speed requirements. If you want a fairly static display, you can just draw with MoveTo and (mostly) LineTo. If that's not fast enough, you can gain a little speed by using something like PolyLine.
If you want it substantially faster, chances are that your best bet is to use something like OpenGL or DirectX graphics. Either of these does the majority of real work on the graphics card. Given that you're talking about drawing a graph of sound waves, even a low-end graphics card with little or no work on optimizing the drawing will probably keep up quite easily with almost anything you're likely to throw at it.
Edit: As far as reading the .wav file itself goes, the format is pretty simple. Most .wav files are uncompressed PCM samples, so drawing them is a simple matter of reading the headers to figure out the sample size and number of channels, then scaling the data to fit in your window.
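For the common case (a canonical 44-byte header in front of uncompressed PCM data), reading it can be as simple as the sketch below; real-world files may carry extra chunks such as LIST, so a robust reader should walk the chunk list instead of assuming this fixed layout:
#include <cstdint>
#include <fstream>

#pragma pack(push, 1)
struct WavHeader {
    char     riff[4];         // "RIFF"
    uint32_t riff_size;
    char     wave[4];         // "WAVE"
    char     fmt[4];          // "fmt "
    uint32_t fmt_size;        // 16 for plain PCM
    uint16_t audio_format;    // 1 = uncompressed PCM
    uint16_t num_channels;    // 1 = mono, 2 = stereo
    uint32_t sample_rate;     // e.g. 44100
    uint32_t byte_rate;
    uint16_t block_align;
    uint16_t bits_per_sample; // usually 16
    char     data[4];         // "data"
    uint32_t data_size;       // size of the sample data in bytes
};
#pragma pack(pop)

WavHeader hdr;
std::ifstream in("example.wav", std::ios::binary);   // file name is just an example
in.read(reinterpret_cast<char*>(&hdr), sizeof(hdr));
// 16-bit PCM samples follow immediately, interleaved per channel (L, R, L, R, ...);
// scale them to your window height to plot the waveform.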
Edit2: You have a couple of choices for handling left and right channels. One is to draw them in two separate plots, typically one above the other. Another is to draw them superimposed, but in different colors. Which is more suitable depends on what you're trying to accomplish -- if it's mostly to look cool, a superimposed, multi-color plot will probably work nicely. If you want to allow the user to really examine what's there in detail, you'll probably want two separate plots.
What exactly do you mean by a waveform? Are you trying to plot the level of the frequency components in the signal, a.k.a. the spectrum, most commonly seen in music visualizers, car stereos, and boomboxes? If so, you should use the Fast Fourier Transform. The FFT is a standard technique to split a time-domain signal into its individual frequencies. There are tons of good FFT library routines available.
In C++, you can use the openFrameworks library to set up a music player for wav, extract the FFT and draw it.
You can also use Processing with the Minim library to do the same. I have tried it and it is pretty straightforward.
Processing even has support for OpenGL and it is a snap to use.