Metal Prevented Device Address Mode - c++

I am creating a graphics application that uses Metal to render everything. When I capture a frame and look at the pipeline statistics, all of my draw calls show a !! priority alert titled "Prevented Device Address Mode Load" with the details:
Indexing using unsigned int for offset prevents addressing calculation in device. To prevent this extra ALU operation use int for offset.
Here is what is going on in the simplest draw call that triggers this. There is a large amount of vertex data followed by an index buffer. The index buffer is created and filled at the start and is constant from then on; the vertex data changes constantly.
I have the following types:
struct Vertex {
float3 data;
};
typedef int32_t indexType;
Then the following draw call
[encoder drawIndexedPrimitives:MTLPrimitiveTypeTriangle indexCount:/*int here*/ indexType:MTLIndexTypeUInt32 indexBuffer:indexBuffer indexBufferOffset:0];
Which goes to the following vertex function
vertex VertexOutTC vertex_fun(constant Vertex * vertexBuffer [[ buffer(0) ]],
indexType vid [[ vertex_id ]],
constant matrix_float3x3* matrix [[buffer(1)]]) {
const float2 coords[] = {float2(-1, -1), float2(-1, 1), float2(1, -1), float2(1, 1)};
Vertex vert = vertexBuffer[vid];
VertexOutTC out;
out.position = float4((*matrix * float3(vert.data.x, vert.data.y, 1.0)).xy, ((float)((int)vid/4))/10000.0, 1.0);
out.color = HSVtoRGB(vert.data.z, 1.0, 1.0);
out.tc = coords[vid % 4];
return out;
}
I am very confused about what exactly I am doing wrong here. The error seems to suggest I shouldn't use an unsigned type for the offset, which I am guessing is the index buffer.
The thing is, ultimately, for the index buffer there are only MTLIndexTypeUInt32 and MTLIndexTypeUInt16, both of which are unsigned. Furthermore, if I try to use a raw int as the type, the shader won't compile. What is going on here?

In Table 5.1 of the Metal Shading Language Specification, they list the "Corresponding Data Type" for vertex_id as ushort or uint. (There are similar tables in that document for all the rest of the attributes; my examples will use thread_position_in_grid, which behaves the same.)
Meanwhile, the hardware prefers signed types for addressing. So if you do
kernel void test(uint position [[thread_position_in_grid]], device float *test) {
test[position] = position;
test[position + 1] = position;
test[position + 2] = position;
}
we are indexing test by an unsigned integer. Debugging this shader, we can see that it involves 23 instructions and has the "Prevented Device Address Mode Store" warning.
If we convert to int instead, this uses only 18 instructions:
kernel void test(uint position [[thread_position_in_grid]], device float *test) {
test[(int)position] = position;
test[(int)position + 1] = position;
test[(int)position + 2] = position;
}
However, not every uint value fits into an int, so this optimization only works for half the range of uint. Still, that covers many use cases.
What about ushort? Well,
kernel void test(ushort position [[thread_position_in_grid]], device float *test) {
test[position] = position;
test[position + 1] = position;
test[position + 2] = position;
}
This version is only 17 instructions. We are also "warned" about using unsigned indexing here, even though it is faster than the signed int version above. This suggests to me that the warning is not especially well designed and requires significant interpretation.
kernel void test(ushort position [[thread_position_in_grid]], device float *test) {
short p = position;
test[p] = position;
test[p + 1] = position;
test[p + 2] = position;
}
This is the signed short version: it fixes the warning, but it is also 17 instructions. So it makes Xcode happier, but I'm not sure it's actually better.
Finally, here's the case I was in. My position exceeds the range of signed short but fits in unsigned short. Does it make sense to promote the ushort to int for the indexing?
kernel void test(ushort position [[thread_position_in_grid]], device float *test) {
int p = position;
test[p] = position;
test[p + 1] = position;
test[p + 2] = position;
}
This is also 17 instructions, and generates the device store warning. I believe the compiler proves ushort fits into int, and ignores the conversion. This "unsigned" arithmetic then produces a warning telling me to use int, even though that's exactly what I did.
In summary, these warnings are a bit naive, and should really be confirmed or refuted through on-device testing.
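For completeness, here is how the cast could be applied to the vertex function from the question. This is only a sketch mirroring the question's signature; whether it actually helps should be verified on-device, as argued above.
vertex VertexOutTC vertex_fun(constant Vertex *vertexBuffer [[ buffer(0) ]],
                              uint vid [[ vertex_id ]],
                              constant matrix_float3x3 *matrix [[ buffer(1) ]]) {
    // [[vertex_id]] itself must stay ushort/uint, but all device-memory
    // addressing can go through a signed copy.
    const int svid = int(vid);
    Vertex vert = vertexBuffer[svid];
    // ... rest of the function unchanged, using svid wherever memory is indexed ...
}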

Related

GLSL uint_fast64_t type

How can I get an input to the vertex shader of type uint_fast64_t? There is no such type available in the language, so how can I pass it differently?
my code is this:
#version 330 core
#define CHUNK_SIZE 16
#define BLOCK_SIZE_X 0.1
#define BLOCK_SIZE_Y 0.1
#define BLOCK_SIZE_Z 0.1
// input vertex and UV coordinates, different for all executions of this shader
layout(location = 0) in uint_fast64_t vertexPosition_modelspace;
layout(location = 1) in vec2 vertexUV;
// Output data ; will be interpolated for each fragment.
out vec2 UV;
// model view projection matrix
uniform mat4 MVP;
int getAxis(uint_fast64_t p, int choice) { // axis: 0=x 1=y 2=z 3=index_x 4=index_z
switch (choice) {
case 0:
return (int)((p>>59 ) & 0xF); //extract the x axis int but i only want 4bits
case 1:
return (int)((p>>23 ) & 0xFF);//extract the y axis int but i only want 8bits
case 2:
return (int)((p>>55 ) & 0xF);//extract the z axis int but i only want 4bits
case 3:
return (int)(p & 0x807FFFFF);//extract the index_x 24bits
case 4:
return (int)((p>>32) & 0x807FFFFF);//extract the index_z 24bits
}
}
void main()
{
// assign vertex position
float x = (getAxis(vertexPosition_modelspace,0) + getAxis(vertexPosition_modelspace,3)*CHUNK_SIZE)*BLOCK_SIZE_X;
float y = getAxis(vertexPosition_modelspace,1)*BLOCK_SIZE_Y;
float z = (getAxis(vertexPosition_modelspace,2) + getAxis(vertexPosition_modelspace,3)*CHUNK_SIZE)*BLOCK_SIZE_Z;
gl_Position = MVP * vec4(x,y,z, 1.0);
// UV of the vertex. No special space for this one.
UV = vertexUV;
}
The compiler rejects the type. I also tried uint64_t, but I get the same problem.
Unextended GLSL for OpenGL does not have the ability to directly use 64-bit integer values. And even the fairly widely supported ARB extension that allows for the use of 64-bit integers within shaders doesn't actually allow you to use them as vertex shader attributes. That requires an NVIDIA extension supported only by... NVIDIA.
However, you can send 32-bit integers, and a 64-bit integer is just two 32-bit integers. You can put 64-bit integers into the buffer and pass them as 2 32-bit unsigned integers in your vertex attribute format:
glVertexAttribIFormat(0, 2, GL_UNSIGNED_INT, <byte_offset>);
Your shader will retrieve them as a uvec2 input:
layout(location = 0) in uvec2 vertexPosition_modelspace;
The x component of the vector will have the first 4 bytes and the y component will store the second 4 bytes. But since "first" and "second" are determined by your CPU's endianness, you'll need to know whether your CPU is little-endian or big-endian to be able to use them. Since most desktop GL implementations are paired with little-endian CPUs, we'll assume that is the case.
In this case, vertexPosition_modelspace.x contains the low 4 bytes of the 64-bit integer, and vertexPosition_modelspace.y contains the high 4 bytes.
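If you would rather not just take that on faith, a single compile-time check on the CPU side (C++20) documents the assumption; this is an illustrative addition, not something the shader needs:
#include <bit>

// The uvec2 word order relied on below assumes the 64-bit values were written
// by a little-endian host; fail the build otherwise.
static_assert(std::endian::native == std::endian::little,
              "packed 64-bit vertex attributes assume a little-endian host");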
So your code could be adjusted as follows (with some cleanup):
const vec3 BLOCK_SIZE = vec3(0.1, 0.1, 0.1);

//Get the three axes all at once.
uvec3 getAxes(in uvec2 p)
{
    return uvec3(
        (p.y >> 27) & 0xFu,
        (p.x >> 23) & 0xFFu,
        (p.y >> 23) & 0xFu
    );
}

//Get the indices
uvec2 getIndices(in uvec2 p)
{
    return p & 0x807FFFFFu; //Performs a component-wise bitwise &
}

void main()
{
    uvec3 iPos = getAxes(vertexPosition_modelspace);
    uvec2 indices = getIndices(vertexPosition_modelspace);
    vec3 pos = vec3(
        iPos.x + (indices.x * uint(CHUNK_SIZE)),
        iPos.y,
        iPos.z + (indices.x * uint(CHUNK_SIZE)) //You used index 3 in your code, so I used .x here, but I think you meant index 4.
    );
    pos *= BLOCK_SIZE;
    ...
}
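On the CPU side the setup is then an ordinary integer attribute. Below is a rough sketch of the upload-side code; the interleaved vertex struct and its names are invented here for illustration, and glVertexAttribIPointer is used because the shader targets #version 330 core (an OpenGL loader header such as GLEW or GLAD is assumed to be included already):
#include <cstddef> // offsetof
#include <cstdint>

// Hypothetical interleaved layout matching the two shader inputs.
struct PackedVertex {
    uint64_t position_packed; // read in the shader as uvec2: x = low word, y = high word (little-endian host)
    float    uv[2];
};

void setup_vertex_attributes(GLuint vbo)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // location 0: the packed 64-bit position, fetched as two unsigned ints.
    glVertexAttribIPointer(0, 2, GL_UNSIGNED_INT, sizeof(PackedVertex),
                           reinterpret_cast<const void *>(offsetof(PackedVertex, position_packed)));
    glEnableVertexAttribArray(0);

    // location 1: the UV coordinates.
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(PackedVertex),
                          reinterpret_cast<const void *>(offsetof(PackedVertex, uv)));
    glEnableVertexAttribArray(1);
}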

Why are my Uniforms and Storage Buffers showing the wrong data in Vulkan using GLSL?

EDIT2: I found the error: because of a copy-paste error, the code that creates the buffers was overwriting one of the storage buffers with one of the uniform buffers that I create afterwards.
So I'm currently trying to adapt the Ray Tracing Weekend project (https://raytracing.github.io/) from a CPU program into a compute shader using Vulkan. I'm writing the compute shader using GLSL which is compiled to SPIRV.
I send the scene in the form of a struct containing arrays of structs to the GPU as a storage buffer which looks like this on the CPU (world_gpu being the storage buffer):
struct sphere_gpu
{
point3 centre;
float radius;
};
struct material_gpu
{
vec3 albedo;
float refraction_index;
float fuzz;
uint32_t material_type;
};
struct world_gpu
{
sphere_gpu spheres[484];
material_gpu materials[484];
uint32_t size;
};
and this on the GPU:
// Struct definitions to mirror the CPU representation
struct sphere{
vec4 centre;
float radius;
};
struct material{
vec4 albedo;
float refraction_index;
float fuzz;
uint material_type;
};
// Input scene
layout(std430, binding = 0) buffer world{
sphere[MAX_SPHERES] spheres;
material[MAX_SPHERES] materials;
uint size;
} wrld;
I've already fixed the alignment problem for vec3 on the CPU side by using alignas(16) for my vec3 type (class alignas(16) vec3), and by changing the corresponding types in the GPU representation to vec4, as shown above, to match the alignment of the data I'm sending over.
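For what it's worth, assuming point3 and vec3 are the alignas(16) three-float wrappers described above, the match with the resulting 32-byte std430 array stride can be checked at compile time; this is an illustrative sketch, not part of the original code:
#include <cstddef> // offsetof

// Expected sizes/offsets for the std430 block declared in the shader,
// assuming sizeof(point3) == sizeof(vec3) == 16 due to alignas(16).
static_assert(sizeof(sphere_gpu) == 32, "sphere_gpu must match the 32-byte std430 array stride");
static_assert(offsetof(sphere_gpu, radius) == 16, "radius must follow the vec4 centre");
static_assert(sizeof(material_gpu) == 32, "material_gpu must match the 32-byte std430 array stride");
static_assert(offsetof(material_gpu, material_type) == 24, "material_type offset must match the GLSL layout");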
However, whilst testing this I only seem to be able to read 0s for the spheres when I inspect the data after the compute shader has finished running (I've hijacked my output pixel array in the shader which I write debug data to so that I can read it and debug certain things).
Is there anything obviously stupid that I'm doing here, aside from being a Vulkan noob in general?
EDIT:
Here's my buffer uploading code. set_manual_buffer_data is where the data is actually copied to the buffer, create_manual_buffer is where the buffer and memory itself are created.
template <typename T>
void set_manual_buffer_data(vk::Device device, vk::Buffer& buffer, vk::DeviceMemory& buffer_memory, T* elements, uint32_t num_elements,uint32_t element_size)
{
uint32_t size = element_size * num_elements;
// Get a pointer to the device memory
void* buffer_ptr = device.mapMemory(buffer_memory, 0, element_size * num_elements);
// Copy data to buffer
memcpy(buffer_ptr, elements, element_size * num_elements);
device.unmapMemory(buffer_memory);
}
// call with physical_device.getMemoryProperties() for second argument
void create_manual_buffer(vk::Device device, vk::PhysicalDeviceMemoryProperties memory_properties, uint32_t queue_family_index, const uint32_t buffer_size, vk::BufferUsageFlagBits buffer_usage, vk::Buffer& buffer, vk::DeviceMemory& buffer_memory)
{
vk::BufferCreateInfo buffer_create_info{};
buffer_create_info.flags = vk::BufferCreateFlags();
buffer_create_info.size = buffer_size;
buffer_create_info.usage = buffer_usage; // Play with this
buffer_create_info.sharingMode = vk::SharingMode::eExclusive; //concurrent or exclusive
buffer_create_info.pQueueFamilyIndices = &queue_family_index;
buffer_create_info.queueFamilyIndexCount = 1;
buffer = device.createBuffer(buffer_create_info);
vk::MemoryRequirements memory_requirements = device.getBufferMemoryRequirements(buffer);
uint32_t memory_type_index = static_cast<uint32_t>(~0);
vk::DeviceSize memory_heap_size = static_cast<uint32_t>(~0);
for (uint32_t current_memory_type_index = 0; current_memory_type_index < memory_properties.memoryTypeCount; ++current_memory_type_index)
{
// search for desired memory type from the device memory
vk::MemoryType MemoryType = memory_properties.memoryTypes[current_memory_type_index];
if ((vk::MemoryPropertyFlagBits::eHostVisible & MemoryType.propertyFlags) &&
(vk::MemoryPropertyFlagBits::eHostCoherent & MemoryType.propertyFlags))
{
memory_heap_size = memory_properties.memoryHeaps[MemoryType.heapIndex].size;
memory_type_index = current_memory_type_index;
break;
}
}
// Create device memory
vk::MemoryAllocateInfo buffer_allocate_info(memory_requirements.size, memory_type_index);
buffer_memory = device.allocateMemory(buffer_allocate_info);
device.bindBufferMemory(buffer, buffer_memory, 0);
}
This code is then called here (I haven't got to the refactoring stage yet so please forgive the spaghetti):
std::vector<vk::Buffer> uniform_buffers;
std::vector<vk::DeviceMemory> uniform_buffers_memory;
std::vector<vk::Buffer> storage_buffers;
std::vector<vk::DeviceMemory> storage_buffers_memory;
void run_compute(Vulkan_Wrapper &vulkan, Vulkan_Compute &compute, world_gpu *world, color* image, uint32_t image_size, image_info img_info, camera_gpu camera_gpu)
{
vulkan.init();
uniform_buffers.resize(2);
uniform_buffers_memory.resize(2);
storage_buffers.resize(2);
storage_buffers_memory.resize(2);
vulkan.create_manual_buffer(vulkan.m_device, vulkan.m_physical_device.getMemoryProperties(),
vulkan.m_queue_family_index, sizeof(world_gpu),
vk::BufferUsageFlagBits::eStorageBuffer, storage_buffers[0],
storage_buffers_memory[0]);
vulkan.create_manual_buffer(vulkan.m_device, vulkan.m_physical_device.getMemoryProperties(),
vulkan.m_queue_family_index, image_size * sizeof(color),
vk::BufferUsageFlagBits::eStorageBuffer, storage_buffers[1],
storage_buffers_memory[1]);
vulkan.set_manual_buffer_data(vulkan.m_device, storage_buffers[0], storage_buffers_memory[0], world, 1, sizeof(world_gpu));
vulkan.set_manual_buffer_data(vulkan.m_device, storage_buffers[1], storage_buffers_memory[1], image, image_size, sizeof(color));
vulkan.create_manual_buffer(vulkan.m_device, vulkan.m_physical_device.getMemoryProperties(),
vulkan.m_queue_family_index, sizeof(image_info),
vk::BufferUsageFlagBits::eUniformBuffer, storage_buffers[0], // <-- the copy-paste error from EDIT2: should be uniform_buffers[0]
uniform_buffers_memory[0]);
vulkan.create_manual_buffer(vulkan.m_device, vulkan.m_physical_device.getMemoryProperties(),
vulkan.m_queue_family_index, sizeof(camera_gpu),
vk::BufferUsageFlagBits::eUniformBuffer, uniform_buffers[1],
uniform_buffers_memory[1]);
vulkan.set_manual_buffer_data(vulkan.m_device, uniform_buffers[0], uniform_buffers_memory[0], &img_info, 1, sizeof(img_info));
vulkan.set_manual_buffer_data(vulkan.m_device, uniform_buffers[1], uniform_buffers_memory[1], &camera_gpu, 1, sizeof(camera_gpu));
// Run pipeline etc
I should note that it works perfectly fine when I check the values stored in the image storage buffer (storage_buffers_memory[1]); it's the other three that are giving me difficulties.
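Since the culprit (see EDIT2) turned out to be passing storage_buffers[0] where uniform_buffers[0] was meant, one way to make that class of copy-paste mistake harder is to keep each buffer and its memory together as a single value. The following is only a rough sketch built on the helpers from the question (adjust the call if create_manual_buffer is a member of your wrapper class):
#include <vulkan/vulkan.hpp>

// A buffer and the memory bound to it travel together, so a call site can no
// longer fill one array's element while overwriting another's, as in EDIT2.
struct GpuBuffer {
    vk::Buffer buffer;
    vk::DeviceMemory memory;
};

// Hypothetical wrapper around the create_manual_buffer helper shown above.
GpuBuffer create_gpu_buffer(vk::Device device,
                            vk::PhysicalDeviceMemoryProperties memory_properties,
                            uint32_t queue_family_index,
                            uint32_t size,
                            vk::BufferUsageFlagBits usage)
{
    GpuBuffer result;
    create_manual_buffer(device, memory_properties, queue_family_index,
                         size, usage, result.buffer, result.memory);
    return result;
}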

vkCreateComputePipelines takes too long

I encountered a strange problem with compiling a Vulkan compute shader.
I have this shader (which is not even all that complex)
#version 450
#extension GL_GOOGLE_include_directive : enable
//#extension GL_EXT_debug_printf : enable
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable
#define IS_AVAILABLE_BUFFER_ANN_ENTITIES
#define IS_AVAILABLE_BUFFER_GLOBAL_MUTABLES
#define IS_AVAILABLE_BUFFER_BONES
#define IS_AVAILABLE_BUFFER_WORLD
//#define IS_AVAILABLE_BUFFER_COLLISION_GRID
#include "descriptors_compute.comp"
layout (local_size_x_id = GROUP_SIZE_CONST_ID) in;
#include "utils.comp"
shared float[ANN_MAX_SIZE] tmp1;
shared float[ANN_MAX_SIZE] tmp2;
shared uint[ANN_TOUCHED_BLOCK_COUNT] touched_block_ids;
mat3 rotation_mat_from_yaw_and_pitch(vec2 yaw_and_pitch){
const vec2 Ss = sin(yaw_and_pitch); // let S denote sin(yaw) and s denote sin(pitch)
const vec2 Cc = cos(yaw_and_pitch); // let C denote cos(yaw) and c denote cos(pitch)
const vec4 Cs_cC_Sc_sS = vec4(Cc,Ss) * vec4(Ss.y,Cc,Ss.x);
return mat3(Cs_cC_Sc_sS.y,-Ss.y,-Cs_cC_Sc_sS.z,Cs_cC_Sc_sS.x,Cc.y,-Cs_cC_Sc_sS.w,Ss.x,0,Cc.x);
}
void main() {
const uint entity_id = gl_WorkGroupID.x;
const uint lID = gl_LocalInvocationID.x;
const uint entities_count = global_mutables.ann_entities;
if (entity_id < entities_count){
const AnnEntity entity = ann_entities[entity_id];
const Bone bone = bones[entity.bone_idx];
const mat3 rotation = rotation_mat_from_yaw_and_pitch(bone.yaw_and_pitch);
const uint BLOCK_TOUCH_SENSE_OFFSET = 0;
const uint LIDAR_LENGTH_SENSE_OFFSET = BLOCK_EXTENDED_SENSORY_FEATURES_LEN*ANN_TOUCHED_BLOCK_COUNT;
for(uint i=lID;i<ANN_LIDAR_COUNT;i+=GROUP_SIZE){
const vec3 rotated_lidar_direction = rotation * entity.lidars[i].direction;
const RayCastResult ray = ray_cast(bone.new_center, rotated_lidar_direction);
tmp1[LIDAR_LENGTH_SENSE_OFFSET+i] = ray.ratio_of_traversed_length;
}
for(uint i = lID;i<ANN_OUTPUT_SIZE;i+=GROUP_SIZE){
const AnnSparseOutputNeuron neuron = entity.ann_output[i];
float sum = neuron.bias;
for(uint j=0;j<neuron.incoming.length();j++){
sum += tmp1[neuron.incoming[j].src_neuron] * neuron.incoming[j].weight;
}
tmp2[i] = max(0,sum);//ReLU activation
}
vec2 rotation_change = vec2(0,0);
for(uint i = lID;i<ANN_OUTPUT_ROTATION_MUSCLES_SIZE;i+=GROUP_SIZE){
rotation_change += tmp2[ANN_OUTPUT_ROTATION_MUSCLES_OFFSET+i] * ANN_IMPULSES_OF_ROTATION_MUSCLES[i];
}
rotation_change = subgroupAdd(rotation_change);
if(lID==0){
bones[entity.bone_idx].yaw_and_pitch += rotation_change;
}
}
}
The function ray_cast is probably the most complex part of this shader, but I reuse this exact same function in many other shaders that compile instantly. I was wondering whether GL_KHR_shader_subgroup_arithmetic might be slowing down vkCreateComputePipelines, but removing it makes no difference. It takes Vulkan over a minute to finish vkCreateComputePipelines. I also have a bunch of utility functions included, but I only use a few constants from there and ray_cast, so 90% of that code is unused and should be removed by glslc. Could it be that Vulkan is quietly trying to perform some other kind of optimisation and that is causing the delay? I thought that all optimisations are done by glslc and there is not much postprocessing done on SPIR-V. I use Nvidia with their proprietary drivers, by the way.
It really puzzles me why this shader is so slow to create, even though I have other shaders that are ten times longer and more complex and yet they load instantly.
Is there any way to profile this?
Upon closer inspection I noticed that normally all the generated SPIR-V files for my shaders take about 10-30KB. However, this one shader takes 178KB.
With the help of spirv-dis I looked at the generated assembly and noticed that the vast majority of the opcodes were OpConstant. This was because I had structs that looked like
struct AnnSparseOutputNeuron{
AnnSparseConnection[ANN_LATENT_CONNECTIONS_PER_OUTPUT_NEURON] incoming;
float bias;
};
They contain large arrays. As a result both
const AnnEntity entity = ann_entities[entity_id];
and
const AnnSparseOutputNeuron neuron = entity.ann_output[i];
would be compiled into lots of opcodes that write those constant values for every single element of the array. So instead of writing code of the form
const A a = buffer_of_As[i];
f(a.some_field)
it's better to use
f(buffer_of_As[i].some_field)
This seems to have solved the problem. I thought that glslc would be smart enough to figure out such optimizations but apparently it's not.
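Applied to the posted shader, that means indexing the storage buffers directly rather than copying AnnEntity/AnnSparseOutputNeuron into locals. A sketch of the affected loop, with identifiers as in the shader above:
for (uint i = lID; i < ANN_OUTPUT_SIZE; i += GROUP_SIZE) {
    // Read fields straight from the buffer; no local copy of the neuron (and
    // its large 'incoming' array) is materialised, so no flood of per-element opcodes.
    float sum = ann_entities[entity_id].ann_output[i].bias;
    for (uint j = 0; j < ann_entities[entity_id].ann_output[i].incoming.length(); j++) {
        sum += tmp1[ann_entities[entity_id].ann_output[i].incoming[j].src_neuron]
             * ann_entities[entity_id].ann_output[i].incoming[j].weight;
    }
    tmp2[i] = max(0.0, sum); // ReLU activation
}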

How to pack normals into GL_INT_2_10_10_10_REV

In my pet project, video memory started to become an issue, so I had a look at various techniques to minimize the memory footprint. I tried using GL_INT_2_10_10_10_REV, but I get lighting artifacts with my packing method. These artifacts do not seem to be a result of inaccuracies, because using a normalized char[3] or short[3] works flawlessly. Because of the otherwise useless padding, I would prefer to use the more space-efficient GL_INT_2_10_10_10_REV.
This is the packing code:
union Vec3IntPacked {
int i32;
struct {
int a:2;
int z:10;
int y:10;
int x:10;
} i32f3;
};
int vec3_to_i32f3(const Vec3* v) {
Vec3IntPacked packed;
packed.i32f3.x = to_int(clamp(v->x, -1.0f, 1.0f) * 511);
packed.i32f3.y = to_int(clamp(v->y, -1.0f, 1.0f) * 511);
packed.i32f3.z = to_int(clamp(v->z, -1.0f, 1.0f) * 511);
return packed.i32;
} // NOTE: to_int is a static_cast
If I am reading the spec correctly (section 10.3.8, "Packed Vertex Data Formats", and the conversion rules in 2.1 and 2.2), this should work, but it doesn't.
I should also note that the above code was tested on multiple OSes (all 64-bit, though int should still be 32-bit nevertheless) and graphics card vendors to check whether it was a driver-related issue.
Furthermore the OpenGL 3.3 core profile is used.
The vertex structure is composed as following:
struct BasicVertex {
float position[3];
unsigned short uv[2];
int normal;
int tangent;
int bitangent;
} // resulting in a 4-byte aligned 28 byte structure
Hopefully I provided sufficient information and someone can shed some light on how to properly pack normals into GL_INT_2_10_10_10_REV.
The order in your bitfield declaration looks incorrect. Based on the spec document (section "2.8.2 Packed Vertex Data Formats" on page 32 of the 3.3 spec), the bit range for each component is:
x: bits 0-9
y: bits 10-19
z: bits 20-29
w: bits 30-31
After some searching, it looks like the order of bits in a bitfield is not defined by the C standard. See e.g. Which end of a bit field is the most significant bit?
The compilers I have seen typically use a lowest to highest bit order. For example, Microsoft defines this for their compiler:
Bit fields are allocated within an integer from least-significant to most-significant bit.
If you rely on using a compiler with this order, your declaration should look like this:
union Vec3IntPacked {
int i32;
struct {
int x:10;
int y:10;
int z:10;
int w:2;
} i32f3;
};
For guaranteed full portability, you would use shift operators to build the values, and not use a bitfield at all.
Depending on how you declare and use the attribute in your vertex shader, you may also want to make sure that you set the w component to 1. Of course if you don't use the w component in the vertex shader, that will not be necessary.
I'm just leaving this here, because I had a hard time getting this to work and there is no full-scale answer on StackOverflow. Reto Koradi is correct about the byte/bit ordering (the OpenGL wiki also shows the layout) and using shifts, but you still need to get there correctly...
The example code (and other questions here on StackOverflow) seem to rely on undefined behaviour and it didn't work for me. What is working for me (for OpenGL <= 4.1) is
inline uint32_t Pack_INT_2_10_10_10_REV(float x, float y, float z, float w)
{
const uint32_t xs = x < 0;
const uint32_t ys = y < 0;
const uint32_t zs = z < 0;
const uint32_t ws = w < 0;
uint32_t vi =
ws << 31 | ((uint32_t)(w + (ws << 1)) & 1) << 30 |
zs << 29 | ((uint32_t)(z * 511 + (zs << 9)) & 511) << 20 |
ys << 19 | ((uint32_t)(y * 511 + (ys << 9)) & 511) << 10 |
xs << 9 | ((uint32_t)(x * 511 + (xs << 9)) & 511);
return vi;
}
which I found here. For normals just leave out the "w" part. If you have a faster/easier method, I'd love to know it. For setting up the attribute pointer, make sure you use
glVertexAttribPointer(1, 4, GL_INT_2_10_10_10_REV, GL_TRUE, stride, dataPointer);
and your normal data arrives in your shader as a vec4, mapped to [-1,1]. You can also conveniently use a vec3 if you don't need the w component, and OpenGL will just give you xyz, so you probably don't need to change any shader code at all. Some answers state you must use "glVertexAttribIPointer", but this is wrong.
Note that as of OpenGL 4.2 the mapping from float to the packed format was changed, so the conversion is different, but easier. This vertex format is also supported in OpenGL ES 3.0 and above.
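For reference, under that newer (4.2+) rule each component is simply round(clamp(f, -1, 1) * 511), stored as a 10-bit two's-complement field. A hedged C++ sketch (the helper name is made up; w is left at 0, which is fine for normals):
#include <algorithm>
#include <cmath>
#include <cstdint>

// Packs three floats in [-1, 1] into GL_INT_2_10_10_10_REV using the
// OpenGL 4.2+ signed-normalized conversion (c = round(f * 511)).
inline uint32_t pack_snorm_2_10_10_10_rev(float x, float y, float z)
{
    auto to10 = [](float f) -> uint32_t {
        const float   clamped = std::clamp(f, -1.0f, 1.0f);
        const int32_t i       = static_cast<int32_t>(std::lround(clamped * 511.0f));
        return static_cast<uint32_t>(i) & 0x3FFu; // keep the low 10 bits (two's complement)
    };
    return to10(x) | (to10(y) << 10) | (to10(z) << 20); // bits 30-31 (w) stay 0
}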

Gaussian-distributed pseudo-random number generator in GLSL [duplicate]

As the GPU driver vendors don't usually bother to implement noiseX in GLSL, I'm looking for a "graphics randomization swiss army knife" utility function set, preferably optimised for use within GPU shaders. I prefer GLSL, but code in any language will do for me; I'm OK with translating it on my own to GLSL.
Specifically, I'd expect:
a) Pseudo-random functions - N-dimensional, uniform distribution over [-1,1] or over [0,1], calculated from M-dimensional seed (ideally being any value, but I'm OK with having the seed restrained to, say, 0..1 for uniform result distribution). Something like:
float random (T seed);
vec2 random2 (T seed);
vec3 random3 (T seed);
vec4 random4 (T seed);
// T being either float, vec2, vec3, vec4 - ideally.
b) Continuous noise like Perlin noise - again, N-dimensional, +- uniform distribution, with a constrained set of values and, well, looking good (some options to configure the appearance like Perlin levels could be useful too). I'd expect signatures like:
float noise (T coord, TT seed);
vec2 noise2 (T coord, TT seed);
// ...
I'm not very much into random number generation theory, so I'd most eagerly go for a pre-made solution, but I'd also appreciate answers like "here's a very good, efficient 1D rand(), and let me explain how to make a good N-dimensional rand() on top of it...".
For very simple pseudorandom-looking stuff, I use this oneliner that I found on the internet somewhere:
float rand(vec2 co){
return fract(sin(dot(co, vec2(12.9898, 78.233))) * 43758.5453);
}
You can also generate a noise texture using whatever PRNG you like, then upload this in the normal fashion and sample the values in your shader; I can dig up a code sample later if you'd like.
Also, check out this file for GLSL implementations of Perlin and Simplex noise, by Stefan Gustavson.
It occurs to me that you could use a simple integer hash function and insert the result into a float's mantissa. IIRC the GLSL spec guarantees 32-bit unsigned integers and IEEE binary32 float representation so it should be perfectly portable.
I gave this a try just now. The results are very good: it looks exactly like static with every input I tried, no visible patterns at all. In contrast the popular sin/fract snippet has fairly pronounced diagonal lines on my GPU given the same inputs.
One disadvantage is that it requires GLSL v3.30. And although it seems fast enough, I haven't empirically quantified its performance. AMD's Shader Analyzer claims 13.33 pixels per clock for the vec2 version on a HD5870. Contrast with 16 pixels per clock for the sin/fract snippet. So it is certainly a little slower.
Here's my implementation. I left it in various permutations of the idea to make it easier to derive your own functions from.
/*
static.frag
by Spatial
05 July 2013
*/
#version 330 core
uniform float time;
out vec4 fragment;
// A single iteration of Bob Jenkins' One-At-A-Time hashing algorithm.
uint hash( uint x ) {
x += ( x << 10u );
x ^= ( x >> 6u );
x += ( x << 3u );
x ^= ( x >> 11u );
x += ( x << 15u );
return x;
}
// Compound versions of the hashing algorithm I whipped together.
uint hash( uvec2 v ) { return hash( v.x ^ hash(v.y) ); }
uint hash( uvec3 v ) { return hash( v.x ^ hash(v.y) ^ hash(v.z) ); }
uint hash( uvec4 v ) { return hash( v.x ^ hash(v.y) ^ hash(v.z) ^ hash(v.w) ); }
// Construct a float with half-open range [0:1] using low 23 bits.
// All zeroes yields 0.0, all ones yields the next smallest representable value below 1.0.
float floatConstruct( uint m ) {
const uint ieeeMantissa = 0x007FFFFFu; // binary32 mantissa bitmask
const uint ieeeOne = 0x3F800000u; // 1.0 in IEEE binary32
m &= ieeeMantissa; // Keep only mantissa bits (fractional part)
m |= ieeeOne; // Add fractional part to 1.0
float f = uintBitsToFloat( m ); // Range [1:2]
return f - 1.0; // Range [0:1]
}
// Pseudo-random value in half-open range [0:1].
float random( float x ) { return floatConstruct(hash(floatBitsToUint(x))); }
float random( vec2 v ) { return floatConstruct(hash(floatBitsToUint(v))); }
float random( vec3 v ) { return floatConstruct(hash(floatBitsToUint(v))); }
float random( vec4 v ) { return floatConstruct(hash(floatBitsToUint(v))); }
void main()
{
vec3 inputs = vec3( gl_FragCoord.xy, time ); // Spatial and temporal inputs
float rand = random( inputs ); // Random per-pixel value
vec3 luma = vec3( rand ); // Expand to RGB
fragment = vec4( luma, 1.0 );
}
Screenshot:
I inspected the screenshot in an image editing program. There are 256 colours and the average value is 127, meaning the distribution is uniform and covers the expected range.
"Gustavson's implementation uses a 1D texture"
No it doesn't, not since 2005. It's just that people insist on downloading the old version. The version that is on the link you supplied uses only 8-bit 2D textures.
The new version by Ian McEwan of Ashima and myself does not use a texture, but runs at around half the speed on typical desktop platforms with lots of texture bandwidth. On mobile platforms, the textureless version might be faster because texturing is often a significant bottleneck.
Our actively maintained source repository is:
https://github.com/ashima/webgl-noise
A collection of both the textureless and texture-using versions of noise is here (using only 2D textures):
http://www.itn.liu.se/~stegu/simplexnoise/GLSL-noise-vs-noise.zip
If you have any specific questions, feel free to e-mail me directly (my email address can be found in the classicnoise*.glsl sources.)
Gold Noise
// Gold Noise ©2015 dcerisano#standard3d.com
// - based on the Golden Ratio
// - uniform normalized distribution
// - fastest static noise generator function (also runs at low precision)
// - use with indicated fractional seeding method.
float PHI = 1.61803398874989484820459; // Φ = Golden Ratio
float gold_noise(in vec2 xy, in float seed){
return fract(tan(distance(xy*PHI, xy)*seed)*xy.x);
}
See Gold Noise in your browser right now!
This function has improved random distribution over the current function in #appas' answer as of Sept 9, 2017.
The #appas function is also incomplete, given there is no seed supplied (uv is not a seed - same for every frame), and does not work with low precision chipsets. Gold Noise runs at low precision by default (much faster).
There is also a nice implementation described here by McEwan and #StefanGustavson that looks like Perlin noise, but "does not require any setup, i.e. not textures nor uniform arrays. Just add it to your shader source code and call it wherever you want".
That's very handy, especially given that Gustavson's earlier implementation, which #dep linked to, uses a 1D texture, which is not supported in GLSL ES (the shader language of WebGL).
After the initial posting of this question in 2010, a lot has changed in the realm of good random functions and hardware support for them.
Looking at the accepted answer from today's perspective, that algorithm is very bad in terms of the uniformity of the random numbers drawn from it. The uniformity suffers a lot depending on the magnitude of the input values, and visible artifacts/patterns become apparent when sampling from it for e.g. ray/path tracing applications.
There have been many different functions (most of them integer hashing) devised for this task, for different input and output dimensionality, most of which are evaluated in the 2020 JCGT paper Hash Functions for GPU Rendering. Depending on your needs, you could select a function from the list of proposed functions in that paper and try it in the accompanying Shadertoy.
One that isn't covered in this paper, but that has served me very well without any noticeable patterns at any input magnitude, is also one that I want to highlight.
Other classes of algorithms use low-discrepancy sequences to draw pseudo-random numbers from, such as the Sobol sequence with Owen-Nayar scrambling. Eric Heitz has done some amazing research in this area, as with his A Low-Discrepancy Sampler that Distributes Monte Carlo Errors as a Blue Noise in Screen Space paper.
Another example of this is the (so far latest) JCGT paper Practical Hash-based Owen Scrambling, which applies Owen scrambling to a different hash function (namely Laine-Karras).
Yet other classes use algorithms that produce noise patterns with desirable frequency spectra, such as blue noise, which is particularly "pleasing" to the eyes.
(I realize that good StackOverflow answers should provide the algorithms as source code and not as links because those can break, but there are way too many different algorithms nowadays and I intend for this answer to be a summary of known-good algorithms today)
Do use this:
highp float rand(vec2 co)
{
highp float a = 12.9898;
highp float b = 78.233;
highp float c = 43758.5453;
highp float dt= dot(co.xy ,vec2(a,b));
highp float sn= mod(dt,3.14);
return fract(sin(sn) * c);
}
Don't use this:
float rand(vec2 co){
return fract(sin(dot(co.xy ,vec2(12.9898,78.233))) * 43758.5453);
}
You can find the explanation in Improvements to the canonical one-liner GLSL rand() for OpenGL ES 2.0
hash:
Nowadays WebGL 2.0 is here, so integers are available in (w)GLSL.
-> For a quality portable hash (at a similar cost to the ugly float hashes), we can now use "serious" hashing techniques.
IQ implemented some in https://www.shadertoy.com/view/XlXcW4 (and more)
E.g.:
const uint k = 1103515245U; // GLIB C
//const uint k = 134775813U; // Delphi and Turbo Pascal
//const uint k = 20170906U; // Today's date (use three days ago's date if you want a prime)
//const uint k = 1664525U; // Numerical Recipes
vec3 hash( uvec3 x )
{
x = ((x>>8U)^x.yzx)*k;
x = ((x>>8U)^x.yzx)*k;
x = ((x>>8U)^x.yzx)*k;
return vec3(x)*(1.0/float(0xffffffffU));
}
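A typical per-fragment use of that hash might look like this one-liner (uFrame is an assumed frame-counter uniform, not part of the snippet above):
// Three uniformly distributed values in [0, 1] per fragment, per frame.
vec3 r = hash(uvec3(uvec2(gl_FragCoord.xy), uint(uFrame)));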
Just found this version of 3D noise for GPU; allegedly it is the fastest one available:
#ifndef __noise_hlsl_
#define __noise_hlsl_
// hash based 3d value noise
// function taken from https://www.shadertoy.com/view/XslGRr
// Created by inigo quilez - iq/2013
// License Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
// ported from GLSL to HLSL
float hash( float n )
{
return frac(sin(n)*43758.5453);
}
float noise( float3 x )
{
// The noise function returns a value in the range -1.0f -> 1.0f
float3 p = floor(x);
float3 f = frac(x);
f = f*f*(3.0-2.0*f);
float n = p.x + p.y*57.0 + 113.0*p.z;
return lerp(lerp(lerp( hash(n+0.0), hash(n+1.0),f.x),
lerp( hash(n+57.0), hash(n+58.0),f.x),f.y),
lerp(lerp( hash(n+113.0), hash(n+114.0),f.x),
lerp( hash(n+170.0), hash(n+171.0),f.x),f.y),f.z);
}
#endif
A straight, jagged version of 1d Perlin, essentially a random lfo zigzag.
half rn(float xx){
half x0=floor(xx);
half x1=x0+1;
half v0 = frac(sin (x0*.014686)*31718.927+x0);
half v1 = frac(sin (x1*.014686)*31718.927+x1);
return (v0*(1-frac(xx))+v1*(frac(xx)))*2-1*sin(xx);
}
I have also found 1/2/3/4D Perlin noise on Shadertoy owner Inigo Quilez's Perlin tutorial website, along with Voronoi and so forth; he has full, fast implementations and code for them.
I have translated one of Ken Perlin's Java implementations into GLSL and used it in a couple projects on ShaderToy.
Below is the GLSL interpretation I did:
int b(int N, int B) { return N>>B & 1; }
int T[] = int[](0x15,0x38,0x32,0x2c,0x0d,0x13,0x07,0x2a);
int A[] = int[](0,0,0);
int b(int i, int j, int k, int B) { return T[b(i,B)<<2 | b(j,B)<<1 | b(k,B)]; }
int shuffle(int i, int j, int k) {
return b(i,j,k,0) + b(j,k,i,1) + b(k,i,j,2) + b(i,j,k,3) +
b(j,k,i,4) + b(k,i,j,5) + b(i,j,k,6) + b(j,k,i,7) ;
}
float K(int a, vec3 uvw, vec3 ijk)
{
float s = float(A[0]+A[1]+A[2])/6.0;
float x = uvw.x - float(A[0]) + s,
y = uvw.y - float(A[1]) + s,
z = uvw.z - float(A[2]) + s,
t = 0.6 - x * x - y * y - z * z;
int h = shuffle(int(ijk.x) + A[0], int(ijk.y) + A[1], int(ijk.z) + A[2]);
A[a]++;
if (t < 0.0)
return 0.0;
int b5 = h>>5 & 1, b4 = h>>4 & 1, b3 = h>>3 & 1, b2= h>>2 & 1, b = h & 3;
float p = b==1?x:b==2?y:z, q = b==1?y:b==2?z:x, r = b==1?z:b==2?x:y;
p = (b5==b3 ? -p : p); q = (b5==b4 ? -q : q); r = (b5!=(b4^b3) ? -r : r);
t *= t;
return 8.0 * t * t * (p + (b==0 ? q+r : b2==0 ? q : r));
}
float noise(float x, float y, float z)
{
float s = (x + y + z) / 3.0;
vec3 ijk = vec3(int(floor(x+s)), int(floor(y+s)), int(floor(z+s)));
s = float(ijk.x + ijk.y + ijk.z) / 6.0;
vec3 uvw = vec3(x - float(ijk.x) + s, y - float(ijk.y) + s, z - float(ijk.z) + s);
A[0] = A[1] = A[2] = 0;
int hi = uvw.x >= uvw.z ? uvw.x >= uvw.y ? 0 : 1 : uvw.y >= uvw.z ? 1 : 2;
int lo = uvw.x < uvw.z ? uvw.x < uvw.y ? 0 : 1 : uvw.y < uvw.z ? 1 : 2;
return K(hi, uvw, ijk) + K(3 - hi - lo, uvw, ijk) + K(lo, uvw, ijk) + K(0, uvw, ijk);
}
I translated it from Appendix B from Chapter 2 of Ken Perlin's Noise Hardware at this source:
https://www.csee.umbc.edu/~olano/s2002c36/ch02.pdf
Here is a public shader I did on Shadertoy that uses the posted noise function:
https://www.shadertoy.com/view/3slXzM
Some other good sources I found on the subject of noise during my research include:
https://thebookofshaders.com/11/
https://mzucker.github.io/html/perlin-noise-math-faq.html
https://rmarcus.info/blog/2018/03/04/perlin-noise.html
http://flafla2.github.io/2014/08/09/perlinnoise.html
https://mrl.nyu.edu/~perlin/noise/
https://rmarcus.info/blog/assets/perlin/perlin_paper.pdf
https://developer.nvidia.com/gpugems/GPUGems/gpugems_ch05.html
I highly recommend the book of shaders as it not only provides a great interactive explanation of noise, but other shader concepts as well.
EDIT:
Might be able to optimize the translated code by using some of the hardware-accelerated functions available in GLSL. Will update this post if I end up doing this.
lygia, a multi-language shader library
If you don't want to copy / paste the functions into your shader, you can also use lygia, a multi-language shader library. It contains a few generative functions like cnoise, fbm, noised, pnoise, random, snoise in both GLSL and HLSL. And many other awesome functions as well. For this to work it:
Relies on #include "file", which is defined by the Khronos GLSL standard and supported by most engines and environments (like glslViewer, the glsl-canvas VS Code plugin, Unity, etc.).
Example: cnoise
Using cnoise.glsl with #include:
#ifdef GL_ES
precision mediump float;
#endif
uniform vec2 u_resolution;
uniform float u_time;
#include "lygia/generative/cnoise.glsl"
void main (void) {
vec2 st = gl_FragCoord.xy / u_resolution.xy;
vec3 color = vec3(cnoise(vec3(st * 5.0, u_time)));
gl_FragColor = vec4(color, 1.0);
}
To run this example I used glslViewer.
Please see below an example of how to add white noise to the rendered texture.
The solution is to use two textures: the original and pure white noise, like this one: wiki white noise
private static final String VERTEX_SHADER =
"uniform mat4 uMVPMatrix;\n" +
"uniform mat4 uMVMatrix;\n" +
"uniform mat4 uSTMatrix;\n" +
"attribute vec4 aPosition;\n" +
"attribute vec4 aTextureCoord;\n" +
"varying vec2 vTextureCoord;\n" +
"varying vec4 vInCamPosition;\n" +
"void main() {\n" +
" vTextureCoord = (uSTMatrix * aTextureCoord).xy;\n" +
" gl_Position = uMVPMatrix * aPosition;\n" +
"}\n";
private static final String FRAGMENT_SHADER =
"precision mediump float;\n" +
"uniform sampler2D sTextureUnit;\n" +
"uniform sampler2D sNoiseTextureUnit;\n" +
"uniform float uNoseFactor;\n" +
"varying vec2 vTextureCoord;\n" +
"varying vec4 vInCamPosition;\n" +
"void main() {\n" +
" gl_FragColor = texture2D(sTextureUnit, vTextureCoord);\n" +
" vec4 vRandChosenColor = texture2D(sNoiseTextureUnit, fract(vTextureCoord + uNoseFactor));\n" +
" gl_FragColor.r += (0.05 * vRandChosenColor.r);\n" +
" gl_FragColor.g += (0.05 * vRandChosenColor.g);\n" +
" gl_FragColor.b += (0.05 * vRandChosenColor.b);\n" +
"}\n";
The fragment shader contains the parameter uNoiseFactor, which is updated on every render by the main application:
float noiseValue = (float)(mRand.nextInt() % 1000) / 1000;
int noiseFactorUniformHandle = GLES20.glGetUniformLocation(mProgram, "uNoiseFactor");
GLES20.glUniform1f(noiseFactorUniformHandle, noiseValue);
FWIW I had the same questions and I needed it implemented in WebGL 1.0, so I couldn't use a few of the examples given in previous answers. I tried the Gold Noise mentioned before, but the use of PHI doesn't really click for me: distance(xy * PHI, xy) * seed just equals length(xy) * (PHI - 1.0) * seed, so I don't see how the magic of PHI is put to work when it gets directly multiplied by seed.
Anyway, I did something similar, just without PHI, and instead added some variation at another place: basically I take the tan of the distance between xy and some random point lying outside of the frame to the top right, and then multiply it by the distance between xy and another such random point lying in the bottom left (so there is no accidental match between the points). It looks pretty decent as far as I can see. Click to generate new frames.
(function main() {
const dim = [512, 512];
twgl.setDefaults({ attribPrefix: "a_" });
const gl = twgl.getContext(document.querySelector("canvas"));
gl.canvas.width = dim[0];
gl.canvas.height = dim[1];
const bfi = twgl.primitives.createXYQuadBufferInfo(gl);
const pgi = twgl.createProgramInfo(gl, ["vs", "fs"]);
gl.canvas.onclick = (() => {
twgl.bindFramebufferInfo(gl, null);
gl.useProgram(pgi.program);
twgl.setUniforms(pgi, {
u_resolution: dim,
u_seed: Array(4).fill().map(Math.random)
});
twgl.setBuffersAndAttributes(gl, pgi, bfi);
twgl.drawBufferInfo(gl, bfi);
});
})();
<script src="https://twgljs.org/dist/4.x/twgl-full.min.js"></script>
<script id="vs" type="x-shader/x-vertex">
attribute vec4 a_position;
attribute vec2 a_texcoord;
void main() {
gl_Position = a_position;
}
</script>
<script id="fs" type="x-shader/x-fragment">
precision highp float;
uniform vec2 u_resolution;
uniform vec2 u_seed[2];
void main() {
float uni = fract(
tan(distance(
gl_FragCoord.xy,
u_resolution * (u_seed[0] + 1.0)
)) * distance(
gl_FragCoord.xy,
u_resolution * (u_seed[1] - 2.0)
)
);
gl_FragColor = vec4(uni, uni, uni, 1.0);
}
</script>
<canvas></canvas>