After generating a set of data using a compute shader and storing it in a shader storage buffer, I am attempting to read back from that buffer and print out the data using the following code:
#define INDEX_AT(x,y,z,i) (xyzToId(Vec3i((x), (y), (z)),   \
                                   Vec3i(NUM_RAYS_X,       \
                                         NUM_RAYS_Y,       \
                                         POINTS_ON_RAY))   \
                           * 3 + (i))

PRINT_GL_ERRORS();
glBindBuffer(GL_SHADER_STORAGE_BUFFER, dPositionBuffer);
float* data_ptr = NULL;
for (int ray_i = 0; ray_i < POINTS_ON_RAY; ray_i++)
{
    for (int y = 0; y < NUM_RAYS_Y; y++)
    {
        int x = 0;
        data_ptr = NULL;
        data_ptr = (float*)glMapBufferRange(
            GL_SHADER_STORAGE_BUFFER,
            INDEX_AT(x, y, ray_i, 0) * sizeof(float),
            3 * (NUM_RAYS_X) * sizeof(float),
            GL_MAP_READ_BIT);
        if (data_ptr == NULL)
        {
            PRINT_GL_ERRORS();
            return false;
        }
        else
        {
            for (int x = 0; x < NUM_RAYS_X; x++)
            {
                std::cout << "("
                          << data_ptr[x * 3 + 0] << ","
                          << data_ptr[x * 3 + 1] << ","
                          << data_ptr[x * 3 + 2] << ") , ";
            }
        }
        glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
        PRINT_GL_ERRORS();
        std::cout << std::endl;
    }
    std::cout << "\n" << std::endl;
}
where the function xyzToId converts three-dimensional coordinates into a one-dimensional index.
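(For reference, a plausible shape for such a helper, assuming the x-fastest layout the macro implies; the actual xyzToId is not shown here:)

// Hypothetical index helper: flattens (x, y, z) for a dims.x by dims.y by dims.z grid,
// with x varying fastest. This is an assumption about the layout, not the code in question.
int xyzToId(Vec3i p, Vec3i dims)
{
    return (p.z * dims.y + p.y) * dims.x + p.x;
}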
When I attempt to run this, however, the program crashes at the call to glMapBufferRange, giving the error message:
The NVIDIA OpenGL driver lost connection with the display driver due to exceeding the Windows Time-Out limit and is unable to continue.
The application must close.
Error code: 7
Would you like to visit
http://nvidia.custhelp.com/cgi-bin/nvidia.cfg/php/enduser/std_adp.php?p_faqid=3007
for help?
The buffer that I am mapping is not very large at all, only 768 floats, and previous calls to glMapBuffer on a different shader storage buffer (of only two floats) completed with no problems. I can't seem to find any information relevant to this error online, and everything that I have read about the speed of glMapBufferRange indicates that a buffer of this size should only take on the order of tens of milliseconds to map, not the two-second timeout that the program is crashing on.
Am I missing something about how glMapBufferRange should be used?
It was an unrelated error. Today I learned that OpenGL sometimes buffers commands, and several actions (like mapping a buffer) force it to finish all the commands in its queue. In this case, the command that was actually timing out was the dispatch of the compute shader itself.
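A sketch of how the hang can be surfaced at the dispatch itself rather than at the later glMapBufferRange (the group counts below are placeholders, not the actual ones used):

glDispatchCompute(groups_x, groups_y, groups_z);   // placeholder work-group counts
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);     // make the SSBO writes visible to later maps/reads
glFinish();                                        // block until the dispatch has really finished
PRINT_GL_ERRORS();                                 // a driver error or timeout now points at the dispatch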
Today I also learned that indexing a shader storage buffer out of bounds will cause the OpenGL driver to freeze up just like it would if something were taking too long to complete.
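Whichever side the bad index comes from, it is also cheap to sanity-check the mapped range on the CPU before calling glMapBufferRange; a sketch using the same macro and variables as the code above:

// Sketch: check that offset + length actually fit inside the buffer before mapping.
GLint64 buf_size = 0;
glGetBufferParameteri64v(GL_SHADER_STORAGE_BUFFER, GL_BUFFER_SIZE, &buf_size);
GLintptr   offset = INDEX_AT(x, y, ray_i, 0) * sizeof(float);
GLsizeiptr length = 3 * NUM_RAYS_X * sizeof(float);
if (offset < 0 || offset + length > buf_size)
{
    std::cout << "Map range out of bounds: " << offset << " + " << length
              << " > " << buf_size << std::endl;
    return false;
}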
All in all, this was largely a case of errors masquerading as different errors and popping up in the wrong spot.
I'm working on an OpenFX plugin to process images in grading/post-production software.
All my processing is done in a series of Metal kernel functions. The image is sent to the GPU as buffers (float arrays), one for the input and one for the output.
The output is then used by the OpenFX framework for display inside the host application, so up until now I haven't had to take care of it myself.
I now need to be able to read the output values once the GPU has processed the commands. I have tried to use the "contents" method on the buffer, but my plugin keeps crashing (in the worst case) or gives me very weird values when it "works" (I'm not supposed to have anything over 1 or under 0, but I get very large numbers, 0 or negative 0, nan... so I assume I have a memory access issue of some sort).
At first I thought it was an issue with Private/Shared memory, so I tried to modify the buffer to be shared. But I'm still struggling!
Full disclosure: I have no specific training in MSL; I'm learning as I go with this project, so I might be doing and/or saying very stupid things. I have looked around for hours before deciding to ask for help. Thanks to all who will help out in any way!
Below is the code (without everything that doesn't concern my current issue). If it is lacking anything of interest please let me know.
id<MTLBuffer> srcDeviceBuf = reinterpret_cast<id<MTLBuffer> >(const_cast<float*>(p_Input));

//Below is the destination image buffer creation the way it used to be done before my edits
//id<MTLBuffer> dstDeviceBuf = reinterpret_cast<id<MTLBuffer> >(p_Output);

//My attempt at creating a Shared memory buffer
MTLResourceOptions bufferOptions = MTLResourceStorageModeShared;
int bufferLength = sizeof(float) * 1920 * 1080 * 4;
id<MTLBuffer> dstDeviceBuf = [device newBufferWithBytes:p_Output length:bufferLength options:bufferOptions];
id<MTLCommandBuffer> commandBuffer = [queue commandBuffer];
commandBuffer.label = [NSString stringWithFormat:@"RunMetalKernel"];
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
//First method to be computed
[computeEncoder setComputePipelineState:_initModule];
int exeWidth = [_initModule threadExecutionWidth];
MTLSize threadGroupCount = MTLSizeMake(exeWidth, 1, 1);
MTLSize threadGroups = MTLSizeMake((p_Width + exeWidth - 1) / exeWidth,
                                   p_Height, 1);
[computeEncoder setBuffer:srcDeviceBuf offset: 0 atIndex: 0];
[computeEncoder setBuffer:dstDeviceBuf offset: 0 atIndex: 8];
//encodes first module to be executed
[computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup: threadGroupCount];
//Modules encoding
if (p_lutexport_on) {
    //Fills the image with patch values for the LUT computation
    [computeEncoder setComputePipelineState:_LUTExportModule];
    [computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup: threadGroupCount];
}
[computeEncoder endEncoding];
[commandBuffer commit];
if (p_lutexport_on) {
    //Here is where I try to read the buffer values (and insert them into a custom object "p_lutexp_lut")
    float* result = static_cast<float*>([dstDeviceBuf contents]);

    //Retrieve the output values and populate the LUT with them
    int lutLine = 0;
    float3 out;
    for (int index(0); index < 35937 * 4; index += 4) {
        out.x = result[index];
        out.y = result[index + 1];
        out.z = result[index + 2];
        p_lutexp_lut->setValuesAtLine(lutLine, out);
        lutLine++;
    }
    p_lutexp_lut->toFile();
}
If a command buffer includes write or read operations on a given MTLBuffer, you must ensure that these operations complete before reading the buffer's contents. You can use the addCompletedHandler: method, the waitUntilCompleted method, or custom semaphores to signal that a command buffer has completed execution.
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    /* read or write buffer here */
}];
[commandBuffer commit];
My program reads a file, processes it and saves the results in a csv file.
The whole thing runs in a loop in which many different files are processed; a separate csv file is generated for each of these files.
I was able to implement the processing very efficiently in terms of time, so saving the respective results is now the longest step in the loop.
The results are available as a std::vector<float*> (one float array per dimension) and are currently saved as follows:
std::vector<float*> out = calculation(bla);
fstream data;
data.open(savepfad + name + ".csv", ios::out);
data << sizex << endl;
data << sizey << endl;
data << dim << endl;
for (int d = 0; d < dim; d++)
{
    for (int x = 0; x < sizex * sizey; x++)
    {
        data << out[d][x] << ",";
    }
    data << endl;
}
data.close();
My first thought was to simply offload the saving to a new thread (possibly with a fork) so I could continue with the main loop. But I am on Windows.
Can I somehow write the data to the hard drive faster?
Does anyone have a brilliant idea?
EDIT:
So I rebuilt the code according to the suggestions, but there is no real speed advantage. The code now looks like this:
std::vector<float*> out = calculation(bla);
string line = std::to_string(sizex) + "\n" + std::to_string(sizey) + "\n" + std::to_string(dim) + "\n";
for (int d = 0; d < dim; d++)
{
    for (int x = 0; x < sizex * sizey; x++)
    {
        line += std::to_string(out[d][x]);
        line += ",";
    }
    line += "\n";
}
fstream data;
data.open(savepfad + name + ".csv", ios::out);
data << line;
data.close();
I also noticed that when out[][] is 0, std::to_string(out[][]) writes it as 0.000000, whereas data << out[][] writes just 0 into the file. This grows the file from about 8000 KB to 36000 KB.
So if I can dump 100 MB onto the hard disk almost instantly in Python, I should be able to write 8000 KB relatively quickly; currently it takes between 1 and 2 minutes.
example size:
sizex = 638
sizey = 958
dim = 8
The time measurement shows that almost the entire time is spent going through the two loops. out is a vector of arrays; is the access to out too slow?
data << endl sends a newline AND flushes the result to disk.
You could do
data << "\n";
instead to send a newline without flushing.
The end result is that you flush fewer times, which means you spend less time waiting for the OS.
If that is still not fast enough, consider buffering everything into a std::ostringstream and dumping that into data in one go.
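For example, something along these lines (a sketch reusing the sizex, sizey, dim, out, savepfad and name variables from the question):

#include <fstream>
#include <sstream>

// Build the whole file in memory first, then hand it to the file stream in one go.
std::ostringstream buffer;
buffer << sizex << '\n' << sizey << '\n' << dim << '\n';
for (int d = 0; d < dim; d++)
{
    for (int x = 0; x < sizex * sizey; x++)
        buffer << out[d][x] << ',';
    buffer << '\n';
}
std::ofstream data(savepfad + name + ".csv");
data << buffer.str();   // a single large write; the stream flushes once when it is closed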
There are a couple of things you can do which may help; I would try implementing them one after another and measure the performance.
Don't flush after every line:
std::endl actually flushes the buffers and writes the file to the drive, which is probably killing the performance. So use << '\n'; instead.
You can minimize memory allocation and copying if you buffer every line (or multiple lines) before writing it out. I would try to reserve a big string (std::string line; line.reserve(<number big enough for the full output>);) and do line += std::to_string(out[d][x]); line += ',';
You can optimize this even further by trying std::to_chars.
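For instance, a rough sketch (this assumes a C++17 standard library with floating-point std::to_chars support; buffer size and error handling are kept minimal):

#include <charconv>
#include <string>

// Convert one float with std::to_chars (no locale, no allocation) and append it to the line.
void appendFloat(std::string& line, float value)
{
    char buf[32];                                    // comfortably enough for a float
    auto res = std::to_chars(buf, buf + sizeof(buf), value);
    line.append(buf, res.ptr);                       // append only the characters actually written
    line += ',';
}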
+1. If you are on Windows, you can try to use the latest MSVC; they reported a 5x speedup in float-to-string conversion (compared to the CRT functions) after implementing to_chars. https://www.youtube.com/watch?v=4P_kbF0EbZM
Consider a mesh whose bins are decomposed among processes. The numbers in the image are the ranks of processes.
At each time step, some of the points move, so they need to be sent to new destination processes. This point-sending is done by every process that has displaced points. In the image, only the points of the lower-left corner bin are shown as an example.
I don't know how long a process should keep listening for incoming messages. The problem is that a receiver does not even know whether a message will arrive at all, because no point might pass into its region.
Also note that the source and destination of a point might be the same, as for the blue point.
Edit: Below is some incomplete code to illustrate the problem.
void transfer_points()
{
    world.isend(dest, ...);
    while (true)
    {
        mpi::status msg = world.iprobe(any_source, any_tag);
        if (msg.count() != 0)
        {
            world.irecv(any_source, ...);
        }
        // but how long keep probing?
        if (???) { break; }
    }
}
Are you familiar with one-sided MPI or RMA (Remote Memory Access) via MPI_Win_* operations? The way I understand your problem, it should be solvable neatly with it:
Ranks that send some points just put them into the other rank's memory (window).
Barrier.
Receivers go straight to the barrier, and once it completes they are in possession of the data.
Here is an example of a ring send with RMA (using the C++ bindings!). In your situation it should only need some minor modification, i.e. only call Put if necessary, plus some math for the offsets to write into the buffer.
#include <iostream>
#include "mpi.h"

int main(int argc, char* argv[]) {
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int comm_size = MPI::COMM_WORLD.Get_size();

    int neighbor_left = rank - 1;
    int neighbor_right = rank + 1;
    // Left- and right-most ranks are neighbors (ring topology).
    if (neighbor_right >= comm_size) { neighbor_right = 0; }
    if (neighbor_left < 0) { neighbor_left = comm_size - 1; }

    // Window size is given in bytes: two ints per rank.
    int postbox[2];
    MPI::Win window = MPI::Win::Create(postbox, 2 * sizeof(int), sizeof(int), MPI_INFO_NULL, MPI::COMM_WORLD);

    window.Fence(0);
    // Put my rank in the second entry of my left neighbor (I'm his right neighbor)
    window.Put(&rank, 1, MPI_INT, neighbor_left, 1, 1, MPI_INT);
    window.Fence(0);
    // Put my rank in the first entry of my right neighbor (I'm his left neighbor)
    window.Put(&rank, 1, MPI_INT, neighbor_right, 0, 1, MPI_INT);
    window.Fence(0);

    std::cout << "I'm rank = " << rank << " my Neighbors (l-r) are " << postbox[0] << " " << postbox[1] << std::endl;

    MPI::Finalize();
    return 0;
}
I am trying to create an OpenCL raycaster. Therefore I am drawing to an OpenGL texture many times a second. However, queue.enqueueNDRangeKernel eventually returns -9999. If I remove write_imagef from my kernel code, it works, so I figured this is what causes the problem.
OpenCL kernel (stripped down):
__kernel void main(__write_only image2d_t screen)
{
    unsigned int x = get_global_id(0);
    unsigned int y = get_global_id(1);

    int2 coords = (int2)(x, y);
    write_imagef(screen, coords, (float4)(1, 0, 1, 1));
}
This is the C++ code that runs once:
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(), code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
    cout << "Could not build program \"" << filename << "\"! Error:" << endl;
    cout << "OpenCL: Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(OpenCL::default_device) << "\n";
    system("PAUSE");
    exit(1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");

//OpenGL texture
ImageGL b(OpenCL::context, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, argument, &error);
if (error != 0)
{
    cout << "CL Error: " << OpenCL::get_cl_error_string(error) << endl;
    system("PAUSE");
    exit(error);
}
kernel.setArg(0, b);
This code runs every frame:
glFinish();
queue.enqueueAcquireGLObjects(&this->buffersGL);

NDRange range;
if (lengthZ <= 0 && lengthY <= 0)
    range = NDRange(lengthX);
else if (lengthZ <= 0)
    range = NDRange(lengthX, lengthY);
else
    range = NDRange(lengthX, lengthY, lengthZ);

cl::Event wait;
cl_int run_err = queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait);
if (run_err != 0)
{
    cout << OpenCL::get_cl_error_string(run_err) << " (" << run_err << ")" << endl;
    system("PAUSE");
}
queue.enqueueReleaseGLObjects(&this->buffersGL);
What could be causing the -9999 error and how can I fix it? Also, there are often big chunks of "dead pixels" that have not been drawn to in the texture...
You enqueue the release of GL buffers, but do not wait for it to complete.
queue.enqueueReleaseGLObjects(&this->buffersGL);
Either get the finish event out of this (watch out for leaks!), or wait for the command queue to finish all tasks before proceeding to release the GL objects. When one thing in a queue depends on another, you are supposed to arrange their ordering yourself.
You also queue a bunch of tasks that depend on the GL objects. Either wait for them to complete (finish the queue), or take their events and feed them to the enqueueReleaseGLObjects call as prerequisites.
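For illustration, here is a minimal sketch of that ordering, reusing the queue, kernel, range and buffersGL names from the question (just one way to serialize things, not necessarily the exact fix you need):

glFinish();                                   // make sure GL is done with the shared texture first
queue.enqueueAcquireGLObjects(&buffersGL);    // CL takes ownership of the GL image
queue.enqueueNDRangeKernel(kernel, cl::NullRange, range, cl::NullRange);
queue.finish();                               // the kernel that uses the image has completed...
queue.enqueueReleaseGLObjects(&buffersGL);    // ...before ownership is handed back to GL
queue.finish();                               // and the release itself has completed as well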
As an aside:
Using fewer kernels might be a good idea, instead of one per pixel.
Thanks a lot Yakk! I tried that by first simply using a smaller screen size, and it suddenly worked again! As it turns out, though, the texture I was drawing to was the problem: it was not 600x600 pixels in size, and that's what caused the crash. Apparently OpenCL can write to pixels that "don't actually exist" a couple of times before crashing. It is still weird behaviour...
I am currently working on a P300 detection system (basically, there is a detectable increase in a brain wave when a user sees something they are interested in) in C++ using the Emotiv EPOC. The system works, but to improve accuracy I'm attempting to use Wekinator for machine learning, using a support vector machine (SVM).
So for my P300 system I have three stimuli (left, right and forward arrows). My program keeps track of the stimulus index, performs some filtering on the incoming "brain wave", and then calculates which index has the highest average area under the curve to determine which stimulus the user is looking at.
For my integration with Wekinator: I have set up Wekinator to receive a custom OSC message with 64 features (the length of the brain wave related to the P300) and set up three parameters with discrete values of 1 or 0. For training I have been sending the "brain wave" for each stimulus index in a trial and setting the relevant parameters to 0 or 1, then training it and running it. The issue is that when the OSC message is received by the program from Wekinator, it returns 4 messages rather than just the single most likely one.
Here is the code for the training (and input to Wekinator during run time):
for (int s = 0; s < stimCount; s++) {
    for (int i = 0; i < stimIndexes[s].size(); i++) {
        int eegIdx = stimIndexes[s][i];
        ofxOscMessage wek;
        wek.setAddress("/oscCustomFeatures");
        if (eegIdx + winStart + winLen < sig.size()) {
            int winIdx = 0;
            for (int e = eegIdx + winStart; e < eegIdx + winStart + winLen; e++) {
                wek.addFloatArg(sig[e]);
                //stimAvgWins[s][winIdx++] += sig[e];
            }
            validWindowCount[s]++;
        }
        std::cout << "Num args: " << wek.getNumArgs() << std::endl;
        wekinator.sendMessage(wek);
    }
}
Here is how the messages from Wekinator are received:
if (receiver.hasWaitingMessages()) {
    ofxOscMessage msg;
    while (receiver.getNextMessage(&msg)) {
        std::cout << "Wek Args: " << msg.getNumArgs() << std::endl;
        if (msg.getAddress() == "/OSCSynth/params") {
            resultReceived = true;
            if (msg.getArgAsFloat(0) == 1) {
                result = 0;
            } else if (msg.getArgAsFloat(1) == 1) {
                result = 1;
            } else if (msg.getArgAsFloat(2) == 1) {
                result = 2;
            }
            std::cout << "Wek Result: " << result << std::endl;
        }
    }
}
Full code for both is at the following Gist:
https://gist.github.com/cilliand/f716c92933a28b0bcfa4
My main query is basically whether something is wrong with the code: Should I send the full "brain wave" for a trial to Wekinator, or should I train Wekinator on different features? Does the code look right, or should it be amended? Is there a way to receive only one OSC message back from Wekinator, based on smaller feature sizes, i.e. 64 rather than 4 x 64 per stimulus or 9 x 64 per stimulus index?