I want to calculate the per-row minimum of a matrix of floats in GLSL in the browser; the matrix has about 1000 rows and 4000 columns.
Building on previous answers (see this) I used a for loop. However, I would like to use a uniform for the upper bound, which is not possible in WebGL GLSL ES 1.0: loop bounds must be constant expressions, and the row length is only known at runtime, after the fragment shader has been compiled. I would also like to avoid messing with #DEFINEs.
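(For reference, the #define route would mean baking the bound into the source and recompiling whenever the matrix size changes; here is a minimal sketch of what I mean, with placeholder per-column work:)
precision mediump float;
// ROW_LEN must be a compile-time constant, so the shader source would have
// to be regenerated (e.g. "#define ROW_LEN " + cols in JavaScript) every
// time the matrix size changes - this is the part I'd like to avoid.
#define ROW_LEN 4000

void main(void) {
    float acc = 0.0;
    for (int i = 0; i < ROW_LEN; ++i) { // constant bound: always legal in ES 1.0
        acc += 1.0 / float(ROW_LEN);    // stand-in for the real per-column work
    }
    gl_FragColor = vec4(acc);
}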
So I found out that this workaround - a fixed loop length with an if/break controlled by a uniform - works OK:
#define MAX_INT 65536
uniform int r;  // row length, set from JavaScript

void main(void) {
    float m = 0.0;
    for (int i = 0; i < MAX_INT; ++i) {
        if (i >= r) { break; }  // the real bound, driven by the uniform
        float ndx = floor(gl_FragCoord.y) * float(r) + float(i);
        // getPoint, values and dimensions are helpers/uniforms elided here
        float a = getPoint(values, dimensions, ndx).x;
        m = (i == 0) ? a : min(m, a);  // running row minimum
    }
}
Now the question: does this have any big drawbacks? Is there something weird I'm doing, or am I missing something?
I believe, but am not entirely sure, that the only risk is that some driver/GPU will still execute the full loop.
As an example, imagine this loop:
precision mediump float;
uniform sampler2D tex;
uniform int limit;

void main() {
    float sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        sum += texture2D(tex, vec2(float(i) / 3.0, 0.0)).r;
        if (i >= limit) {
            break;
        }
    }
    gl_FragColor = vec4(sum);
}
That can be re-written by the driver like this:
precision mediump float;
uniform sampler2D tex;
uniform int limit;

void main() {
    float sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        float temp = texture2D(tex, vec2(float(i) / 3.0, 0.0)).r;
        sum += temp * step(float(i), float(limit));
    }
    gl_FragColor = vec4(sum);
}
No branches. I don't know if any drivers/GPUs still exist that have no conditionals, but the idea behind requiring a constant integer expression for a loop bound is that the branches can be removed and/or the loop unrolled at compile time, if the driver/GPU decides to do either. Fully unrolled, it would become this:
precision mediump float;
uniform sampler2D tex;
uniform int limit;

void main() {
    float sum = 0.0;
    sum += step(0.0, float(limit)) * texture2D(tex, vec2(0.0 / 3.0, 0.0)).r;
    sum += step(1.0, float(limit)) * texture2D(tex, vec2(1.0 / 3.0, 0.0)).r;
    sum += step(2.0, float(limit)) * texture2D(tex, vec2(2.0 / 3.0, 0.0)).r;
    gl_FragColor = vec4(sum);
}
Also, as an aside, the specific example you have above doesn't write any output, so most drivers would turn the entire shader into a no-op.
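For illustration, here's a sketch of the question's loop with an observable write added; this assumes the matrix lives in a values texture, one element per texel, with texSize holding its dimensions (those names and the layout are assumptions, not the question's actual setup):
precision mediump float;
uniform sampler2D values; // assumed: one matrix element per texel
uniform vec2 texSize;     // assumed: texture size in pixels
uniform int r;            // row length, set from JavaScript
#define MAX_INT 65536

void main(void) {
    float m = 0.0;
    for (int i = 0; i < MAX_INT; ++i) {  // constant bound keeps ES 1.0 happy
        if (i >= r) { break; }           // the real bound comes from the uniform
        vec2 uv = vec2((float(i) + 0.5) / texSize.x,
                       gl_FragCoord.y / texSize.y);
        float a = texture2D(values, uv).x;
        m = (i == 0) ? a : min(m, a);    // running row minimum
    }
    gl_FragColor = vec4(m); // an actual write, so the shader is not a no-op
}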
I am currently comparing implementations of an n-body simulation on the GPU using CUDA and OpenGL (compute shaders) for a project, but I have run into a problem using shared memory.
First I implemented the version with no shared memory as follows:
CUDA
#include "helper_math.h"
//...
__device__ float dist2(float3 A, float3 B)
{
float3 C = A - B;
return dot(C, C);
}
__global__ void n_body_vel_calc(float3* positions, float3* velocities,
unsigned numParticles, float mass, float deltaTime)
{
unsigned i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= numParticles)
return;
const float G = 6.6743e-11f;
float3 cur_position = positions[i];
float3 force = make_float3(0.0f, 0.0f, 0.0f);
for (unsigned j = 0; j < numParticles; ++j)
{
if (i == j)
continue;
float3 neighbor_position = positions[j];
float inv_distance2 = 1.0f / dist2(cur_position, neighbor_position);
float3 direction = normalize(neighbor_position - cur_position);
force += G * mass * mass * inv_distance2 * direction;
}
float3 acceleration = force / mass;
velocities[i] += acceleration * deltaTime;
}
OpenGL
// glBufferStorage(GL_SHADER_STORAGE_BUFFER, ..., ..., ...);
#version 460
layout(local_size_x=128) in;
layout(location = 0) uniform int numParticles;
layout(location = 1) uniform float mass;
layout(location = 2) uniform float dt;
layout(std430, binding=0) buffer pblock { vec3 positions[]; };
layout(std430, binding=1) buffer vblock { vec3 velocities[]; };
float dist2(vec3 A, vec3 B)
{
vec3 C = A - B;
return dot( C, C );
}
void main()
{
int i = int(gl_GlobalInvocationID.x);
if (i >= numParticles)
return;
const float G = 6.6743e-11f;
vec3 cur_position = positions[i];
vec3 force = vec3(0.0);
for (uint j = 0; j < numParticles; ++j)
{
if (i == j)
continue;
vec3 neighbor_position = positions[j];
float inv_distance2 = 1.0 / dist2(cur_position, neighbor_position);
vec3 direction = normalize(neighbor_position - cur_position);
force += G * mass * mass * inv_distance2 * direction;
}
vec3 acceleration = force / mass;
velocities[i] += acceleration * dt;
}
With the same number of threads per group, the same number of particles, and the same number of kernel executions, the CUDA version takes 82 ms and the OpenGL one takes 70 ms. It is odd that the speeds differ that much, but I can attribute it to GLSL somehow having its geometric operations optimized.
My problem comes next, with the versions that use shared memory, which should increase performance by not reading the same positions from global memory multiple times.
CUDA
__global__ void n_body_vel_calc(float3* positions, float3 * velocities, unsigned workgroupSize,
unsigned numParticles, float mass, float deltaTime)
{
// size of array == workgroupSize
extern __shared__ float3 temp_tile[];
unsigned i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= numParticles)
return;
const float G = 6.6743e-11f;
float3 cur_position = positions[i];
float3 force = make_float3(0.0f, 0.0f, 0.0f);
for (unsigned tile = 0; tile < numParticles; tile += workgroupSize)
{
if (tile + threadIdx.x < numParticles) // don't read past the end on the last tile
temp_tile[threadIdx.x] = positions[tile + threadIdx.x];
__syncthreads();
for (unsigned j = 0; j < workgroupSize; ++j)
{
// compare against the neighbor's global index, not the tile-local one
if (i == (tile + j) || ((tile + j) >= numParticles))
continue;
float3 neighbor_position = temp_tile[j];
float inv_distance2 = 1.0f / dist2(cur_position, neighbor_position);
float3 direction = normalize(neighbor_position - cur_position);
force += G * mass * mass * inv_distance2 * direction;
}
__syncthreads();
}
float3 acceleration = force / mass;
velocities[i] += acceleration * deltaTime;
}
OpenGL
#version 460
layout(local_size_x=128) in;
layout(location = 0) uniform int numParticles;
layout(location = 1) uniform float mass;
layout(location = 2) uniform float dt;
layout(std430, binding=0) buffer pblock { vec3 positions[]; };
layout(std430, binding=1) buffer vblock { vec3 velocities[]; };
// Shared variables
shared vec3 temp_tile[gl_WorkGroupSize.x];
void main()
{
int i = int(gl_GlobalInvocationID.x);
if (i >= numParticles)
return;
const float G = 6.6743e-11f;
vec3 cur_position = positions[i];
vec3 force = vec3(0.0);
for (uint tile = 0; tile < numParticles; tile += gl_WorkGroupSize.x)
{
if (tile + gl_LocalInvocationIndex < numParticles) // don't read past the end on the last tile
temp_tile[gl_LocalInvocationIndex] = positions[tile + gl_LocalInvocationIndex];
groupMemoryBarrier();
barrier();
for (uint j = 0; j < gl_WorkGroupSize.x; ++j)
{
// compare against the neighbor's global index, not the tile-local one
if (i == int(tile + j) || (tile + j) >= numParticles)
continue;
vec3 neighbor_position = temp_tile[j];
float inv_distance2 = 1.0 / dist2(cur_position, neighbor_position);
vec3 direction = normalize(neighbor_position - cur_position);
force += G * mass * mass * inv_distance2 * direction;
}
groupMemoryBarrier();
barrier();
}
vec3 acceleration = force / mass;
velocities[i] += acceleration * dt;
}
Now my main problem: with the same parameters as above, the CUDA version's execution time increases to 128 ms (greatly diminishing its performance), while the OpenGL one takes 68 ms (a small improvement over the previous version).
I have compiled the CUDA version with toolkit versions 11.7 and 10.0, using MSVC v143 and v142, and the results are more or less the same.
Why is the OpenGL implementation faster with shared memory while the CUDA one is not? Am I missing something?
I defined a GLSL function that calculates the average of an array of numbers, but this function is "overloaded" to work with arrays of different sizes. Here, the same function is defined twice with a different size parameter:
float average(float[4] array) {
float sum = 0.0;
for(int i = 0; i < array.length(); i++){
sum += array[i];
}
return sum/float(array.length());
}
//this is redundant: the same function is defined with a different size parameter
float average(float[5] array) {
float sum = 0.0;
for(int i = 0; i < array.length(); i++){
sum += array[i];
}
return sum/float(array.length());
}
void mainImage( out vec4 fragColor, in vec2 fragCoord )
{
float a1[] = float[](1.0, 1.0, 1.0, 1.2, 1.1);
float a2[] = float[](1.0, 1.0, 1.0, 1.2);
float av = average(a1)+average(a2);
fragColor = vec4(av,0.0,0.0,0.0);
}
Is it possible to define a function that accepts an array parameter of any size, instead of defining a different function for each size?
I added a Sprite as a background.
Now I want my Sprite to gradually become blurred.
I thought I might modify the Texture2D to do the job, but it seems a Texture2D cannot be modified.
So, what should I do?
You can use a shader for that. You can get a simple blur shader from the cocos2d-x test project, like this:
#ifdef GL_ES
precision mediump float;
#endif
varying vec4 v_fragmentColor;
varying vec2 v_texCoord;
uniform vec2 resolution;
uniform float blurRadius;
uniform float sampleNum;
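// CC_Texture0 (the sprite's texture) is declared automatically by cocos2d-x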
vec4 blur(vec2);
void main(void)
{
vec4 col = blur(v_texCoord);
gl_FragColor = vec4(col) * v_fragmentColor;
}
vec4 blur(vec2 p)
{
if (blurRadius > 0.0 && sampleNum > 1.0)
{
vec4 col = vec4(0);
vec2 unit = 1.0 / resolution.xy;
float r = blurRadius;
float sampleStep = r / sampleNum;
float count = 0.0;
for(float x = -r; x < r; x += sampleStep)
{
for(float y = -r; y < r; y += sampleStep)
{
float weight = (r - abs(x)) * (r - abs(y));
col += texture2D(CC_Texture0, p + vec2(x * unit.x, y * unit.y)) * weight;
count += weight;
}
}
return col / count;
}
return texture2D(CC_Texture0, p);
}
If you don't know how to add custom shader to your sprite - here is an example!
You extend the Sprite class:
class MySpriteBlur : public Sprite {
public:
~MySpriteBlur();
bool initWithTexture(Texture2D* texture, const Rect& rect);
void initGLProgram();
static MySpriteBlur *create(const char *pszFileName);
void setBlurRadius(float radius);
void setBlurSampleNum(float num);
protected:
float _blurRadius;
float _blurSampleNum;
};
And then implement it:
MySpriteBlur::~MySpriteBlur() {
}
MySpriteBlur* MySpriteBlur::create(const char *pszFileName) {
MySpriteBlur* pRet = new (std::nothrow) MySpriteBlur();
if (pRet && pRet->initWithFile(pszFileName)) {
pRet->autorelease();
} else {
CC_SAFE_DELETE(pRet);
}
return pRet;
}
bool MySpriteBlur::initWithTexture(Texture2D* texture, const Rect& rect) {
_blurRadius = 0.0f;
_blurSampleNum = 7.0f; // default sample count
if (Sprite::initWithTexture(texture, rect)) {
#if CC_ENABLE_CACHE_TEXTURE_DATA
auto listener = EventListenerCustom::create(EVENT_RENDERER_RECREATED, [this](EventCustom* event) {
initGLProgram();
});
_eventDispatcher->addEventListenerWithSceneGraphPriority(listener, this);
#endif
initGLProgram();
return true;
}
return false;
}
void MySpriteBlur::initGLProgram() {
std::string fragSource = FileUtils::getInstance()->getStringFromFile(
FileUtils::getInstance()->fullPathForFilename("shaders/example_blur.fsh"));
auto program = GLProgram::createWithByteArrays(ccPositionTextureColor_noMVP_vert, fragSource.data());
auto glProgramState = GLProgramState::getOrCreateWithGLProgram(program);
setGLProgramState(glProgramState);
auto size = getTexture()->getContentSizeInPixels();
getGLProgramState()->setUniformVec2("resolution", size);
getGLProgramState()->setUniformFloat("blurRadius", _blurRadius);
getGLProgramState()->setUniformFloat("sampleNum", 7.0f);
}
void MySpriteBlur::setBlurRadius(float radius) {
_blurRadius = radius;
getGLProgramState()->setUniformFloat("blurRadius", _blurRadius);
}
void MySpriteBlur::setBlurSampleNum(float num) {
_blurSampleNum = num;
getGLProgramState()->setUniformFloat("sampleNum", _blurSampleNum);
}
Hope that will help!
You have three options:
1) make a blurred background in Photoshop (quick and simple, but extra size),
2) use a shader (not that simple, and blur is a heavy operation),
3) redraw your background on the fly, making it a new texture.
Here's my post how to draw on texture:
http://discuss.cocos2d-x.org/t/is-it-possible-to-erase-some-pixels-from-a-sprite/34460/5?u=piotrros
Knowing this, here's a function from my project which blurs one image (a data array) into another one:
void Sample::blur(unsigned char* inputData, unsigned char* outputData, int r) {
int boxArea = (2 * r + 1) * (2 * r + 1); // the box covers (2r+1) x (2r+1) samples
for(int i = 0; i < canvasHeight; i++){
for(int j = 0; j < canvasWidth; j++) {
int val1 = 0;
int val2 = 0;
int val3 = 0;
int val4 = 0;
int index2 = (j + (canvasHeight - i - 1) * canvasWidth) * 4;
for(int iy = i - r; iy < i + r + 1; iy++){
for(int ix = j - r; ix < j + r + 1; ix++) {
int x = CLAMP(ix, 0, canvasWidth - 1);
int y = CLAMP(iy, 0, canvasHeight - 1);
int index = (x + (canvasHeight - y - 1) * canvasWidth) * 4;
val1 += inputData[index];
val2 += inputData[index + 1];
val3 += inputData[index + 2];
val4 += inputData[index + 3];
}
}
outputData[index2] = val1 / boxArea;
outputData[index2 + 1] = val2 / boxArea;
outputData[index2 + 2] = val3 / boxArea;
outputData[index2 + 3] = val4 / boxArea;
}
}
}
Just remember that blur is a heavy and long operation (roughly width × height × (2r+1)² reads), so if you have a big image it may take a while.
I have a uniform variable called control_count (the number of control points in a Bezier curve). In the marked part of my code, if I replace the constant 4 with this variable, it just stops working; with 4 it works fine. The variable must hold the value 4: I tested it before and after the loop as well, and I marked that in the code too. Could this be an unrolling problem? How do I force the compiler not to unroll the loop?
#version 150
layout(lines_adjacency) in;
layout(line_strip, max_vertices = 101) out;
out vec4 gs_out_col;
uniform mat4 MVP;
uniform int control_count;
uniform int tess_count;
int degree;
int binom( int n, int k );
void main()
{
degree = control_count - 1;
vec3 b[10];
float B[10];
////////////MARK//////////////////
//control_count must be 4; otherwise it would draw fewer points
for(int i = 0; i < control_count; ++i){
b[i] = gl_in[i].gl_Position.xyz;
}
////////////END MARK//////////////////
for(int i = 0; i <= tess_count; ++i){
float t = i / float(tess_count);
gl_Position = vec4(0);
////////////MARK//////////////////
//here, if I write control_count instead of 4, I don't get what I expect
for(int j = 0; j < 4; ++j){
////////////END MARK//////////////////
B[j] = binom(3, j) * pow(1.0 - t, float(3 - j)) * pow(t, float(j)); // the 3s also assume degree == 3
gl_Position += vec4(b[j] * B[j], B[j]);
}
gl_Position = MVP * gl_Position;
////////////MARK//////////////////
//control_count - 4 --> I get red color,
//control_count - 3 --> I get purple,
//so the variable must have the value 4
gs_out_col = vec4(1, 0, control_count - 4, 1);//gl_Position;
////////////END MARK//////////////////
EmitVertex();
}
}
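For reference, the fully generalized evaluation I'm ultimately aiming for would look like the sketch below (the hard-coded 3s would have to become degree too; binom is the same helper declared above). Using control_count as the loop bound like this is exactly what misbehaves:
vec4 bezierPoint(vec3 ctrl[10], int control_count, float t) {
    int degree = control_count - 1;
    vec4 p = vec4(0.0);
    for (int j = 0; j < control_count; ++j) { // uniform-driven bound
        float Bj = float(binom(degree, j))
                 * pow(1.0 - t, float(degree - j))
                 * pow(t, float(j));
        p += vec4(ctrl[j] * Bj, Bj);
    }
    return p;
}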
The "good" result using the constant 4:
The "wrong" result using the variable control_count:
I'm having a problem with edge detection using the Sobel operator: it produces too many false edges. The effect is shown in the pictures below.
I'm using a 3x3 Sobel operator, first extracting vertical edges, then horizontal ones; the final output is the magnitude of the two filter outputs.
Edges in synthetic images are extracted properly, but natural images produce too many false edges or "noise", even if the image is preprocessed with a blur or median filter.
What might be the cause of this? Is it an implementation problem (if so, why are synthetic images fine?), or do I need to do more preprocessing?
Original:
Output:
Code:
void imageOp::filter(image8* image, int maskSize, int16_t *mask)
{
if((image == NULL) || (maskSize % 2 == 0) || maskSize < 1)
{
if(image == NULL)
{
printf("filter: image pointer == NULL \n");
}
else if(maskSize < 1)
{
printf("filter: maskSize must be greater than 1\n");
}
else
{
printf("filter: maskSize must be odd number\n");
}
return;
}
image8* fImage = new image8(image->getHeight(), image->getWidth());
uint16_t sum = 0;
int d = maskSize/2;
int ty, tx;
for(int x = 0; x < image->getHeight(); x++) //
{ // loop over image
for(int y = 0; y < image->getWidth(); y++) //
{
for(int xm = -d; xm <= d; xm++)
{
for(int ym = -d; ym <= d; ym++)
{
ty = y + ym;
if(ty < 0) // edge conditions
{
ty = (-1)*ym - 1;
}
else if(ty >= image->getWidth())
{
ty = image->getWidth() - ym;
}
tx = x + xm;
if(tx < 0) // edge conditions
{
tx = (-1)*xm - 1;
}
else if(tx >= image->getHeight())
{
tx = image->getHeight() - xm;
}
sum += image->img[tx][ty] * mask[((xm+d)*maskSize) + ym + d];
}
}
if(sum > 255)
{
fImage->img[x][y] = 255;
}
else if(sum < 0)
{
fImage->img[x][y] = 0;
}
else
{
fImage->img[x][y] = (uint8_t)sum;
}
sum = 0;
}
}
for(int x = 0; x < image->getHeight(); x++)
{
for(int y = 0; y < image->getWidth(); y++)
{
image->img[x][y] = fImage->img[x][y];
}
}
delete fImage;
}
This appears to be due to a math error somewhere in your code. To follow up on my comment, this is what I get when I run your image through a Sobel operator here (edge strength is indicated by the brightness of the output image):
I used a GLSL fragment shader to produce this:
precision mediump float;
varying vec2 textureCoordinate;
varying vec2 leftTextureCoordinate;
varying vec2 rightTextureCoordinate;
varying vec2 topTextureCoordinate;
varying vec2 topLeftTextureCoordinate;
varying vec2 topRightTextureCoordinate;
varying vec2 bottomTextureCoordinate;
varying vec2 bottomLeftTextureCoordinate;
varying vec2 bottomRightTextureCoordinate;
uniform sampler2D inputImageTexture;
void main()
{
float bottomLeftIntensity = texture2D(inputImageTexture, bottomLeftTextureCoordinate).r;
float topRightIntensity = texture2D(inputImageTexture, topRightTextureCoordinate).r;
float topLeftIntensity = texture2D(inputImageTexture, topLeftTextureCoordinate).r;
float bottomRightIntensity = texture2D(inputImageTexture, bottomRightTextureCoordinate).r;
float leftIntensity = texture2D(inputImageTexture, leftTextureCoordinate).r;
float rightIntensity = texture2D(inputImageTexture, rightTextureCoordinate).r;
float bottomIntensity = texture2D(inputImageTexture, bottomTextureCoordinate).r;
float topIntensity = texture2D(inputImageTexture, topTextureCoordinate).r;
float h = -topLeftIntensity - 2.0 * topIntensity - topRightIntensity + bottomLeftIntensity + 2.0 * bottomIntensity + bottomRightIntensity;
float v = -bottomLeftIntensity - 2.0 * leftIntensity - topLeftIntensity + bottomRightIntensity + 2.0 * rightIntensity + topRightIntensity;
float mag = length(vec2(h, v));
gl_FragColor = vec4(vec3(mag), 1.0);
}
You don't show your mask values, which I assume contain the Sobel kernel. In the above code, I've hardcoded the calculations performed against the red channel of each pixel for a 3x3 Sobel kernel; this is purely for performance on my platform.
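For completeness, the nine texture coordinates used above come from a companion vertex shader, roughly like the sketch below; texelWidth and texelHeight are assumed uniforms holding 1/width and 1/height of the input image:
attribute vec4 position;
attribute vec4 inputTextureCoordinate;
uniform float texelWidth;  // assumed: 1.0 / image width
uniform float texelHeight; // assumed: 1.0 / image height
varying vec2 textureCoordinate;
varying vec2 leftTextureCoordinate;
varying vec2 rightTextureCoordinate;
varying vec2 topTextureCoordinate;
varying vec2 topLeftTextureCoordinate;
varying vec2 topRightTextureCoordinate;
varying vec2 bottomTextureCoordinate;
varying vec2 bottomLeftTextureCoordinate;
varying vec2 bottomRightTextureCoordinate;
void main()
{
    gl_Position = position;
    // one-texel offsets used to sample the 3x3 neighborhood
    vec2 widthStep = vec2(texelWidth, 0.0);
    vec2 heightStep = vec2(0.0, texelHeight);
    textureCoordinate = inputTextureCoordinate.xy;
    leftTextureCoordinate = textureCoordinate - widthStep;
    rightTextureCoordinate = textureCoordinate + widthStep;
    topTextureCoordinate = textureCoordinate - heightStep;
    bottomTextureCoordinate = textureCoordinate + heightStep;
    topLeftTextureCoordinate = textureCoordinate - widthStep - heightStep;
    topRightTextureCoordinate = textureCoordinate + widthStep - heightStep;
    bottomLeftTextureCoordinate = textureCoordinate - widthStep + heightStep;
    bottomRightTextureCoordinate = textureCoordinate + widthStep + heightStep;
}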
One thing I don't notice in your code (again, I may be missing it, like I did with sum being set back to 0) is the calculation of the magnitude of the vector formed by the two portions of the Sobel operator. I'd expect to see a square root operation in there somewhere.
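In shader terms, that combination is just the following (a trivial sketch; h and v are the horizontal and vertical filter responses):
// Combine the two directional responses into an edge magnitude;
// equivalent to the length(vec2(h, v)) call used in the shader above.
float sobelMagnitude(float h, float v)
{
    return sqrt(h * h + v * v);
}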