MPI + CUDA-aware, concurrent kernels and MPI_Sendrecv - concurrency

During my work I've run into a small problem. I'm currently using MVAPICH-GDR-2.05 and Open MPI 1.7.4 with CUDA 6.0.
I'm working on the exchange of non-contiguous elements between GPUs (like the columns of a matrix), and I'm trying to run two kernels (one for scatter and one for gather) and a communication with MPI_Sendrecv between two GPUs concurrently.
I've used the CUDA profiler (nvprof) to see what my program is doing, and I've noticed something strange:
With Open MPI 1.7.4, I have three CUDA streams working concurrently.
With MVAPICH-GDR-2.05, I have two concurrent kernels, but the MPI_Sendrecv does not overlap with them.
Do you know why MPI_Sendrecv behaves this way in MVAPICH?
This is my pseudocode:
// creation and initialization of streams
cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags( &stream1, cudaStreamNonBlocking );
cudaStreamCreateWithFlags( &stream2, cudaStreamNonBlocking );
///////////////////////////////////////////////////////////////////
// 1) --> gather of the first chunk
gather_kernel <<< dimGrid, dimBlock, 0, stream1 >>> ( ... );
cudaStreamSynchronize(stream1);
// 2) --> gather of the second chunk
//    --> communication of the first chunk
gather_kernel <<< dimGrid, dimBlock, 0, stream1 >>> ( ... );
MPI_Sendrecv( ... );
cudaStreamSynchronize(stream1);
// 3) --> scatter of chunk (ii)
//    --> gather of chunk (ii+2)
//    --> communication of chunk (ii+1)
// K is the number of chunks
for ( int ii = 0; ii < K-2; ii++ ){
    scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... );
    gather_kernel  <<< dimGrid, dimBlock, 0, stream1 >>> ( ... );
    MPI_Sendrecv( ... );
    cudaStreamSynchronize(stream2);
    cudaStreamSynchronize(stream1);
}
// 4) --> scatter of the penultimate chunk
//    --> communication of the last chunk
scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... );
MPI_Sendrecv( ... );
cudaStreamSynchronize(stream2);
// 5) --> scatter of the last chunk
scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... );
cudaStreamSynchronize(stream2);
And here are the two profiler screenshots:
MVAPICH 2.05 (profiler screenshot)
Open MPI 1.7.4 (profiler screenshot)
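Edit: for comparison, here is the same step 3 sketched with non-blocking MPI calls instead of MPI_Sendrecv (same pseudocode style as above; d_sendbuf, d_recvbuf, chunk_size and peer are placeholder names, not my real variables). I am not claiming this is the right fix, just the variant that seems to give the library more freedom to progress the transfer while the kernels run:
// 3) rewritten with non-blocking point-to-point calls (sketch only)
for ( int ii = 0; ii < K-2; ii++ ){
    MPI_Request reqs[2];
    scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... );
    gather_kernel  <<< dimGrid, dimBlock, 0, stream1 >>> ( ... );
    // post the exchange of the chunk gathered in the previous iteration
    MPI_Irecv( d_recvbuf, chunk_size, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[0] );
    MPI_Isend( d_sendbuf, chunk_size, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &reqs[1] );
    MPI_Waitall( 2, reqs, MPI_STATUSES_IGNORE );
    cudaStreamSynchronize(stream2);
    cudaStreamSynchronize(stream1);
}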

Related

OpenCL on MacOS: SIGABRT in release build, EXC_BAD_INSTRUCTION in libdispatch in debug build when using AMD Radeon 555 as CL device

I'm encountering a hard-to-track-down bug on macOS in an OpenCL-based application. In a release build my code crashes with a SIGABRT at some point; in a debug build I get an EXC_BAD_INSTRUCTION on a thread that is obviously managing some libdispatch / GCD stuff (com.apple.libdispatch-manager). Note that I do not call anything GCD-related myself, so I assume this is done by the Apple OpenCL runtime in the background.
The context is a benchmarking application that measures the latency between enqueueing CL commands and receiving the CL_COMPLETE callback for various ways of accessing the CL buffers. You'll find the code below. The error only occurs for one of the three available CL devices in my MacBook Pro (the AMD Radeon Pro 555 Compute Engine).
Relevant part of the code:
nlohmann::json performTestUseHostPtr()
{
    nlohmann::json results;
    std::vector<cl::Event> inputBufferEvent (1);
    std::vector<cl::Event> outputBufferEvent (1);
    std::vector<cl::Event> kernelEvent (1);

    for (auto size : testSizes)
    {
        std::vector<float> inputBufferHost (size);
        std::vector<float> outputBufferHost (size);

        cl::Buffer inputBuffer (context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, size * sizeof (float), inputBufferHost.data());
        cl::Buffer outputBuffer (context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, size * sizeof (float), outputBufferHost.data());

        void* inputBufferMapped = queue.enqueueMapBuffer (inputBuffer, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0, size * sizeof (float));
        std::memcpy (inputBufferMapped, testData.data(), size * sizeof (float));

        kernel.setArg (0, inputBuffer);
        kernel.setArg (1, outputBuffer);

        for (int i = 0; i < numTests; ++i)
        {
            startTimes[i] = my::HighResolutionTimer::now();

            queue.enqueueUnmapMemObject (inputBuffer, inputBufferMapped, nullptr, &inputBufferEvent[0]);
            inputBufferEvent[0].setCallback (CL_COMPLETE, setTimestampCallback, &unmapCompletedTimes[i]);

            queue.enqueueNDRangeKernel (kernel, cl::NullRange, cl::NDRange (size), cl::NullRange, &inputBufferEvent, &kernelEvent[0]);
            kernelEvent[0].setCallback (CL_COMPLETE, setTimestampCallback, &kernelCompletedTimes[i]);

            void* outputBufferMapped = queue.enqueueMapBuffer (outputBuffer, CL_FALSE, CL_MAP_READ, 0, size * sizeof (float), &kernelEvent, &outputBufferEvent[0]);
            outputBufferEvent[0].setCallback (CL_COMPLETE, setTimestampCallback, &mapCompletedTimes[i]);

            inputBufferMapped = queue.enqueueMapBuffer (inputBuffer, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0, size * sizeof (float), &kernelEvent, nullptr);

            // --- Release build error seems to happen somewhere here ---
            queue.finish();

            std::memcpy (inputBufferMapped, outputBufferMapped, size * sizeof (float));
            queue.enqueueUnmapMemObject (outputBuffer, outputBufferMapped);
            queue.finish();
        }

        queue.enqueueUnmapMemObject (inputBuffer, inputBufferMapped);
        results["vecSize=" + std::to_string (size)] = calculateTimes();
        queue.finish();
    }

    return results;
}
Notes:
I checked the error codes of all CL calls; all return CL_SUCCESS. I removed the checks from the code above for a better overview (a sketch of the kind of check I used follows after these notes).
I marked the line where I roughly assume the error happens; this is based on inserting print statements in the release version and watching which points of the code were reached before the fault occurs. Inserting a print statement above the queue.finish(); statement furthermore makes the bug disappear, so this is likely something timing-related.
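For reference, this is roughly the kind of check I had in place (checkCL is my own helper, not part of the OpenCL API; the error constants come from the standard headers):
#include <cstdio>
#include <cstdlib>
#include <OpenCL/opencl.h> // macOS; <CL/cl.h> on other platforms

// Abort with a readable message if an OpenCL call did not return CL_SUCCESS.
static void checkCL (cl_int err, const char* where)
{
    if (err != CL_SUCCESS)
    {
        std::fprintf (stderr, "OpenCL error %d at %s\n", err, where);
        std::abort();
    }
}

// Usage, e.g. around the unmap call in the loop above:
//   checkCL (queue.enqueueUnmapMemObject (inputBuffer, inputBufferMapped, nullptr, &inputBufferEvent[0]),
//            "enqueueUnmapMemObject (inputBuffer)");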
Update:
When I insert a short sleep at the line where I assumed the error happens and run a debug build, it now also triggers a SIGABRT. Additionally, I find the following prints on the console:
OpenCLLatencyTests(17903,0x10012a5c0) malloc: tiny_free_list_remove_ptr: Internal invariant broken (next ptr of prev): ptr=0x1003052d0, prev_next=0x0
OpenCLLatencyTests(17903,0x10012a5c0) malloc: *** set a breakpoint in malloc_error_break to debug
Signal: SIGABRT (signal SIGABRT)
E0412 11:55:02.898913 233472000 ProtobufClient.cpp:63] No such process
Question:
Can anyone spot an obvious error in my code?
If not, are there any known bugs in the Apple OpenCL implementation that could cause errors like that?

Why does my data not fit into a CUDA Texture Object?

I'm trying to fill a CUDA Texture Object with some data, but the call to cudaCreateTextureObject fails with the following error (edit: on both a GTX 1080TI and an RTX 2080TI):
GPU ERROR! 'invalid argument' (err code 11)
It works if I put less data into my texture, so my guess is that my computation of how much data I can fit into a texture is off.
My thought process is as follows:
(executable code follows below)
My data comes in the form of (76,76) images where each pixel is a float. What I would like to do is to store a column of images in a Texture Object; as I understand it, cudaMallocPitch is the way to do this.
When computing the number of images I can store in one texture I'm using the following formula to determine how much space a single image needs:
GTX_1080TI_MEM_PITCH * img_dim_y * sizeof(float)
Where the first argument should be the memory pitch on a GTX 1080TI card (512 bytes). The number of bytes that can be stored in a 1D texture is given as 2^27 here. When I divide the latter by the former I get 862.3, which I assume is the number of images I can store in one Texture Object. However, when I try to store more than 855 images in my buffer, the program crashes with the error above.
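Spelled out, my arithmetic is:
GTX_1080TI_1DTEX_WIDTH / (GTX_1080TI_MEM_PITCH * img_dim_y * sizeof(float))
    = 134217728 / (512 * 76 * 4)
    = 134217728 / 155648
    ≈ 862.3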
Here's the code:
In the following the main function (a) sets up all the relevant parameters, (b) allocates the memory using cudaMallocPitch, and (c) configures and creates a CUDA Texture Object:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cassert>

#define GTX_1080TI_MEM_PITCH   512
#define GTX_1080TI_1DTEX_WIDTH 134217728 // 2^27

//=====================================================================[ util ]
// CUDA error checking for library functions
#define CUDA_ERR_CHK(func){ cuda_assert( (func), __FILE__, __LINE__ ); }
inline void cuda_assert( const cudaError_t cu_err, const char* file, int line ){
    if( cu_err != cudaSuccess ){
        fprintf( stderr, "\nGPU ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
        exit( EXIT_FAILURE );
    }
}

// CUDA generic error checking (used after kernel calls)
#define GPU_ERR_CHK(){ gpu_assert(__FILE__, __LINE__); }
inline void gpu_assert( const char* file, const int line ){
    cudaError cu_err = cudaGetLastError();
    if( cu_err != cudaSuccess ){
        fprintf( stderr, "\nGPU KERNEL ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
        exit(EXIT_FAILURE);
    }
}

//=====================================================================[ main ]
int main(){
    // setup
    unsigned int img_dim_x = 76;
    unsigned int img_dim_y = 76;
    unsigned int img_num   = 856; // <-- NOTE: set this to 855 and it should work - but we should be able to put 862 here?

    unsigned int pitched_img_size = GTX_1080TI_MEM_PITCH * img_dim_y * sizeof(float);
    unsigned int img_num_per_tex  = GTX_1080TI_1DTEX_WIDTH / pitched_img_size;

    fprintf( stderr, "We should be able to stuff %d images into one texture.\n", img_num_per_tex );
    fprintf( stderr, "We use %d (more than 855 leads to a crash).\n", img_num );

    // allocate pitched memory
    size_t img_tex_pitch;
    float* d_img_tex_data;

    CUDA_ERR_CHK( cudaMallocPitch( &d_img_tex_data, &img_tex_pitch, img_dim_x*sizeof(float), img_dim_y*img_num ) );

    assert( img_tex_pitch == GTX_1080TI_MEM_PITCH );
    fprintf( stderr, "Asking for %zd bytes allocates %zd bytes using pitch %zd. Available: %zd/%d\n",
        img_num*img_dim_x*img_dim_y*sizeof(float),
        img_num*img_tex_pitch*img_dim_y*sizeof(float),
        img_tex_pitch,
        GTX_1080TI_1DTEX_WIDTH - img_num*img_tex_pitch*img_dim_y*sizeof(float),
        GTX_1080TI_1DTEX_WIDTH );

    // generic resource descriptor
    cudaResourceDesc res_desc;
    memset(&res_desc, 0, sizeof(res_desc));
    res_desc.resType = cudaResourceTypePitch2D;
    res_desc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    res_desc.res.pitch2D.devPtr = d_img_tex_data;
    res_desc.res.pitch2D.width = img_dim_x;
    res_desc.res.pitch2D.height = img_dim_y*img_num;
    res_desc.res.pitch2D.pitchInBytes = img_tex_pitch;

    // texture descriptor
    cudaTextureDesc tex_desc;
    memset(&tex_desc, 0, sizeof(tex_desc));
    tex_desc.addressMode[0] = cudaAddressModeClamp;
    tex_desc.addressMode[1] = cudaAddressModeClamp;
    tex_desc.filterMode = cudaFilterModeLinear; // for linear interpolation (NOTE: this breaks normal integer indexing!)
    tex_desc.readMode = cudaReadModeElementType;
    tex_desc.normalizedCoords = false; // we want to index using [0;img_dim] rather than [0;1]

    // make sure there are no lingering errors
    GPU_ERR_CHK();
    fprintf(stderr, "No CUDA error until now..\n");

    // create texture object
    cudaTextureObject_t img_tex_obj;
    CUDA_ERR_CHK( cudaCreateTextureObject(&img_tex_obj, &res_desc, &tex_desc, NULL) );
    fprintf(stderr, "bluppi\n");
}
This should crash when cudaCreateTextureObject is called. If the img_num parameter (at the start of main) is changed from 856 to 855, however, the code should execute successfully. (edit: The expected behavior would be that the code runs through with a value of 862 but fails with a value of 863 since that actually requires more bytes than the documented buffer size offers.)
Any help would be appreciated!
Since you're working with a 2D texture, the number of bytes you can store in a 1D texture (the "width") is of no relevance here.
2D textures may have different characteristics depending on the type of memory that provides the backing for the texture. Two examples are linear memory and CUDA Array. You have chosen to use a linear memory backing (that which is provided by cudaMalloc* operations other than cudaMallocArray).
The primary problem you are running into is the maximum texture height. To discover what this is, we can refer to Table 14 in the programming guide, which lists:
Maximum width and height for a 2D texture reference bound to linear memory 65000 x 65000
You are exceeding this 65000 number when going from 855 to 856 images, for an image height of 76 rows: 856*76 = 65056, while 855*76 = 64980.
"But wait," you say, "that Table 14 entry says texture reference, and I am using a texture object."
You are correct, and Table 14 doesn't explicitly list the corresponding limit for texture objects. In that case, we have to refer to the device properties readable from the device at runtime, using cudaGetDeviceProperties(). If we review the data available there, we see this readable item:
maxTexture2DLinear[3] contains the maximum 2D texture dimensions for 2D textures bound to pitch linear memory.
(I suspect the 3 is a typo, but no matter, we only need the first two values.)
This is the limit we need to respect. If we modify your code to obey that limit, there are no problems:
$ cat t382.cu
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cassert>

#define GTX_1080TI_MEM_PITCH   512
#define GTX_1080TI_1DTEX_WIDTH 134217728 // 2^27

//=====================================================================[ util ]
// CUDA error checking for library functions
#define CUDA_ERR_CHK(func){ cuda_assert( (func), __FILE__, __LINE__ ); }
inline void cuda_assert( const cudaError_t cu_err, const char* file, int line ){
    if( cu_err != cudaSuccess ){
        fprintf( stderr, "\nGPU ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
        exit( EXIT_FAILURE );
    }
}

// CUDA generic error checking (used after kernel calls)
#define GPU_ERR_CHK(){ gpu_assert(__FILE__, __LINE__); }
inline void gpu_assert( const char* file, const int line ){
    cudaError cu_err = cudaGetLastError();
    if( cu_err != cudaSuccess ){
        fprintf( stderr, "\nGPU KERNEL ERROR! \'%s\' (err code %d) in file %s, line %d.\n\n", cudaGetErrorString(cu_err), cu_err, file, line );
        exit(EXIT_FAILURE);
    }
}

//=====================================================================[ main ]
int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    size_t max2Dtexturelinearwidth  = prop.maxTexture2DLinear[0]; // texture x dimension
    size_t max2Dtexturelinearheight = prop.maxTexture2DLinear[1]; // texture y dimension
    fprintf( stderr, "maximum 2D linear texture dimensions (width,height): %lu,%lu\n", max2Dtexturelinearwidth, max2Dtexturelinearheight);

    // setup
    unsigned int img_dim_x = 76;
    unsigned int img_dim_y = 76;
    //unsigned int img_num = 856; // <-- NOTE: set this to 855 and it should work - but we should be able to put 862 here?
    unsigned int img_num = max2Dtexturelinearheight/img_dim_y;
    fprintf( stderr, "maximum number of images per texture: %u\n", img_num);

    unsigned int pitched_img_size = GTX_1080TI_MEM_PITCH * img_dim_y * sizeof(float);
    unsigned int img_num_per_tex  = GTX_1080TI_1DTEX_WIDTH / pitched_img_size;

    fprintf( stderr, "We should be able to stuff %d images into one texture.\n", img_num_per_tex );
    fprintf( stderr, "We use %d (more than 855 leads to a crash).\n", img_num );

    // allocate pitched memory
    size_t img_tex_pitch;
    float* d_img_tex_data;

    CUDA_ERR_CHK( cudaMallocPitch( &d_img_tex_data, &img_tex_pitch, img_dim_x*sizeof(float), img_dim_y*img_num ) );

    assert( img_tex_pitch == GTX_1080TI_MEM_PITCH );
    fprintf( stderr, "Asking for %zd bytes allocates %zd bytes using pitch %zd. Available: %zd/%d\n",
        img_num*img_dim_x*img_dim_y*sizeof(float),
        img_num*img_tex_pitch*img_dim_y*sizeof(float),
        img_tex_pitch,
        GTX_1080TI_1DTEX_WIDTH - img_num*img_tex_pitch*img_dim_y*sizeof(float),
        GTX_1080TI_1DTEX_WIDTH );

    // generic resource descriptor
    cudaResourceDesc res_desc;
    memset(&res_desc, 0, sizeof(res_desc));
    res_desc.resType = cudaResourceTypePitch2D;
    res_desc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    res_desc.res.pitch2D.devPtr = d_img_tex_data;
    res_desc.res.pitch2D.width = img_dim_x;
    res_desc.res.pitch2D.height = img_dim_y*img_num;
    res_desc.res.pitch2D.pitchInBytes = img_tex_pitch;

    // texture descriptor
    cudaTextureDesc tex_desc;
    memset(&tex_desc, 0, sizeof(tex_desc));
    tex_desc.addressMode[0] = cudaAddressModeClamp;
    tex_desc.addressMode[1] = cudaAddressModeClamp;
    tex_desc.filterMode = cudaFilterModeLinear; // for linear interpolation (NOTE: this breaks normal integer indexing!)
    tex_desc.readMode = cudaReadModeElementType;
    tex_desc.normalizedCoords = false; // we want to index using [0;img_dim] rather than [0;1]

    // make sure there are no lingering errors
    GPU_ERR_CHK();
    fprintf(stderr, "No CUDA error until now..\n");

    // create texture object
    cudaTextureObject_t img_tex_obj;
    CUDA_ERR_CHK( cudaCreateTextureObject(&img_tex_obj, &res_desc, &tex_desc, NULL) );
    fprintf(stderr, "bluppi\n");
}
$ nvcc -o t382 t382.cu
$ cuda-memcheck ./t382
========= CUDA-MEMCHECK
maximum 2D linear texture dimensions (width,height): 131072,65000
maximum number of images per texture: 855
We should be able to stuff 862 images into one texture.
We use 855 (more than 855 leads to a crash).
Asking for 19753920 bytes allocates 133079040 bytes using pitch 512. Available: 1138688/134217728
No CUDA error until now..
bluppi
========= ERROR SUMMARY: 0 errors
$

cl::Event::waitForEvents returns -7 (CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST)

I am attempting to run the same kernel on two GPU devices concurrently within the same context.
I have hit a snag: when trying to profile the event object, I get a -7 (event object not available) for the second command queue.
When I wait for the events, it also errors out with -7. This only seems to happen for command queue 2.
Any idea why? Any help would be much appreciated.
Code attached.
void *bytes;
float *zeropad;
float *output_f;
void *outputbytes;
int ret;
ret = posix_memalign(&bytes, total_alignment_requirement, cshape[level][1]*(size+2)*(size+2)*sizeof(float));
zeropad = (float *)bytes;
//float *output_f = (float *)calloc(cshape[level][0]*size*size,sizeof(float));
//SR assigning aligned memory
ret = posix_memalign(&outputbytes, total_alignment_requirement, cshape[level][1]*(size+2)*(size+2)*sizeof(float));
output_f = (float *)outputbytes;
unsigned int total=0;
//prepare matrix for OpenCL
padding_input(matrix,zeropad,size,in_depth);
cl::Buffer zeropad_buf(openclObjects.context,CL_MEM_READ_ONLY| CL_MEM_COPY_HOST_PTR,(size+2)*(size+2)*cshape[level][1]*sizeof(float),zeropad);
cl::Buffer output_buf(openclObjects.context,CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR ,cshape[level][0]*size*size*sizeof(float),output_f);
cl::Buffer bs(openclObjects.context,CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,cshape[level][0]*sizeof(float),bc[level]);
// SR using sub buffers only zeropad_buf and output_bf and to chunk up the buffer and submit the kernels twice...once to each device
//Creating sub_buffers for zeropad_buf
size_t zeropad_buf_size = (size+2)*(size+2)*cshape[level][1]*sizeof(float);
size_t output_buf_size = cshape[level][0]*size*size*sizeof(float);
size_t zeropad_split_pos = zeropad_buf_size / 2;
zeropad_split_pos -= zeropad_split_pos % total_alignment_requirement;
cl_buffer_region zero_rgn_4core = {0, zeropad_split_pos};
cl_buffer_region zero_rgn_2core = {zeropad_split_pos, zeropad_buf_size - zeropad_split_pos};
/*
cl_buffer_region zero_rgn_4core = {0, zeropad_buf_size/2};
cl_buffer_region zero_rgn_2core = {zeropad_buf_size/2, zeropad_buf_size/2};
*/
cl_buffer_region output_rgn_4core = {0, output_buf_size/2};
cl_buffer_region output_rgn_2core = {output_buf_size/2, output_buf_size/2};
cl::Buffer zeropad_buf_4Core = zeropad_buf.createSubBuffer(CL_MEM_READ_ONLY,CL_BUFFER_CREATE_TYPE_REGION, &zero_rgn_4core);
std::cout<<"zero_pad sub-buffer region 1 created"<<std::endl;
cl::Buffer zeropad_buf_2Core = zeropad_buf.createSubBuffer(CL_MEM_READ_ONLY,CL_BUFFER_CREATE_TYPE_REGION, &zero_rgn_2core);
std::cout<<"zero_pad sub-buffer region 2 created"<<std::endl;
cl::Buffer output_buf_4Core = output_buf.createSubBuffer(CL_MEM_READ_WRITE,CL_BUFFER_CREATE_TYPE_REGION, &output_rgn_4core);
cl::Buffer output_buf_2Core = output_buf.createSubBuffer(CL_MEM_READ_WRITE,CL_BUFFER_CREATE_TYPE_REGION, &output_rgn_2core);
cl::NDRange global(global_x, global_y, global_y);
cl::NDRange local(1, group_size, group_size);
//cl::Event evt[2];//SR
//SR use a vector events
std::vector<cl::Event> events;
cl::Event evt1, evt2;
//SR Kernel after sub buffering - 4 core
openclObjects.conv_gpu.setArg<cl::Memory>(0, zeropad_buf_4Core);
openclObjects.conv_gpu.setArg<cl::Memory>(1, conv_weights[level]);
openclObjects.conv_gpu.setArg<cl::Memory>(2, output_buf_4Core);
openclObjects.conv_gpu.setArg<cl::Memory>(3, bs);
openclObjects.conv_gpu.setArg<int>(4, size+2);
openclObjects.conv_gpu.setArg<int>(5, cshape[level][1]);
openclObjects.conv_gpu.setArg<int>(6, size);
openclObjects.conv_gpu.setArg<int>(7, cshape[level][0]);
openclObjects.conv_gpu.setArg<int>(8, CONV_SIZE);
cl_int err=openclObjects.queue[0].enqueueNDRangeKernel( openclObjects.conv_gpu, cl::NullRange, global, local, NULL, &evt1); //SR
events.push_back(evt1);
// cl_int err=openclObjects.queue.enqueueNDRangeKernel( openclObjects.conv_gpu, cl::NullRange, global, local, NULL)
//SR Kernel after sub buffering - 2 core
openclObjects.conv_gpu.setArg<cl::Memory>(0, zeropad_buf_2Core);
openclObjects.conv_gpu.setArg<cl::Memory>(1, conv_weights[level]);
openclObjects.conv_gpu.setArg<cl::Memory>(2, output_buf_2Core);
openclObjects.conv_gpu.setArg<cl::Memory>(3, bs);
openclObjects.conv_gpu.setArg<int>(4, size+2);
openclObjects.conv_gpu.setArg<int>(5, cshape[level][1]);
openclObjects.conv_gpu.setArg<int>(6, size);
openclObjects.conv_gpu.setArg<int>(7, cshape[level][0]);
openclObjects.conv_gpu.setArg<int>(8, CONV_SIZE);
//SR Added for CQ2 (2 Core GPU)
err=openclObjects.queue[1].enqueueNDRangeKernel( openclObjects.conv_gpu, cl::NullRange, global, local, NULL, &evt2);
events.push_back(evt2);
std::cout<<"Enqueue CQ2"<<std::endl;
//get event info
cl::CommandQueue CQ;
cl::Device CQ_device;
evt2.getInfo(CL_EVENT_COMMAND_QUEUE,&CQ);
CQ.getInfo(CL_QUEUE_DEVICE, &CQ_device);
std::cout<<"New Code"<<std::endl;
std::cout<<"Event attached to COmmand Q2"<<std::endl;
std::cout<<"Device Name in Command Queue 1: "<<CQ_device.getInfo<CL_DEVICE_NAME>()<<std::endl;
std::cout<<"Device Vendor: "<<CQ_device.getInfo<CL_DEVICE_VENDOR>()<<std::endl;
std::cout<<"Device max CU: "<<CQ_device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()<<std::endl;
cl::Event::waitForEvents(events);
//openclObjects.queue[0].finish(); //SR
std::cout<<"Command Queue 1 complete"<<std::endl;
//openclObjects.queue[1].finish();//SR added for CQ2
std::cout<<"Command Queue 2 complete"<<std::endl;
// printf("global_x, global_y, global_y, error: %d %d %d %d\n",global_x, global_y, global_y, err);
// printf("%d\n",err);
cl_ulong elapsed=0;
cl_ulong elapsed1=0; //SR calculate elapse per command queue
cl_ulong elapsed0=0; //SR calculate elapse per command queue
elapsed0 =evt1.getProfilingInfo<CL_PROFILING_COMMAND_END>()-evt1.getProfilingInfo<CL_PROFILING_COMMAND_START>(); //SR
std::cout<<"Profile Info: Command Queue 1"<<std::endl;
elapsed1 =evt2.getProfilingInfo<CL_PROFILING_COMMAND_END>()-evt2.getProfilingInfo<CL_PROFILING_COMMAND_START>(); //SR
std::cout<<"Profile Info: Command Queue 2"<<std::endl;
//std::cout<<"elapsed CQ0"<<elapsed0<<std::endl; //SR
//std::cout<<"elapsed CQ1"<<elapsed1<<std::endl; //SR
elapsed = elapsed0+elapsed1;
Try uncommenting openclObjects.queue[0].finish(); and openclObjects.queue[1].finish();
You can also use flush instead of finish.
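For example, a minimal sketch of the ordering I mean, using the variable names from the question (untested against the full code):
// make sure both kernels have actually been submitted to their devices
openclObjects.queue[0].flush();
openclObjects.queue[1].flush();
// block until both devices have finished; after this the events are in CL_COMPLETE
openclObjects.queue[0].finish();
openclObjects.queue[1].finish();
// only now read the profiling counters
cl_ulong elapsed0 = evt1.getProfilingInfo<CL_PROFILING_COMMAND_END>() - evt1.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong elapsed1 = evt2.getProfilingInfo<CL_PROFILING_COMMAND_END>() - evt2.getProfilingInfo<CL_PROFILING_COMMAND_START>();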

OpenCL vs CUDA: Pinned memory

I have been porting my RabbitCT CUDA implementation to OpenCL and I'm running into issues with pinned memory.
For CUDA a host buffer is created that buffers the input images to be processed in pinned memory. This allows the host to catch the next batch of input images while the GPU processes the current batch. A simplified mockup of my CUDA implementation is as follows:
// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];

void initialize()
{
    // initiate streams
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        cudaStreamCreateWithFlags (&stream[s], cudaStreamNonBlocking);
        cudaMalloc( (void**)&devProjection[s], imgSize);
    }
    // initiate buffers
    for( uint b = 0; b < BUFFER_SIZE; b++ ){
        cudaMallocHost((void **)&hostProjBuffer[b], imgSize);
    }
}

// main function called for all input images
void backproject(imgdata* r)
{
    uint projNr = r->imgnr % BUFFER_SIZE;
    uint streamNr = r->imgnr % STREAMS_MAX;

    // When buffer is filled, wait until work in current stream has finished
    if(projNr == 0) {
        cudaStreamSynchronize(stream[streamNr]);
    }

    // copy received image data to buffer (maps double precision to float)
    std::copy(r->I_n, r->I_n+(imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // copy image and matrix to device
    cudaMemcpyAsync( devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr] );

    // call kernel
    backproject<<<numBlocks, threadsPerBlock, 0 , stream[streamNr]>>>(devProjection[streamNr]);
}
So, for CUDA, I create a pinned host pointer for each buffer item and copy the data to the device before executing kernel of each stream.
For OpenCL I initially did something similar, following the Nvidia OpenCL Best Practices Guide. There they recommend creating two buffers: one to copy the kernel data to, and one for the pinned memory. However, this led to my implementation using double the device memory, as both the kernel buffer and the pinned memory buffer are allocated on the device.
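For reference, this is roughly the two-buffer pattern as I understood it from the guide (sketch only; pinnedProj, devProj and hostProj are my placeholder names, and error checking is omitted):
// pinned staging buffer, allocated in page-locked host memory
cl_mem pinnedProj = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
// separate device buffer that the kernel actually reads
cl_mem devProj = clCreateBuffer(context, CL_MEM_READ_ONLY, imgSize, NULL, &status);
// map the pinned buffer once and keep the pointer as the host-side staging area
float* hostProj = (float*) clEnqueueMapBuffer(queue[0], pinnedProj, CL_TRUE, CL_MAP_WRITE, 0, imgSize, 0, NULL, NULL, &status);

// per image: fill the pinned staging area, then copy it into the device buffer
std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), hostProj);
clEnqueueWriteBuffer(queue[streamNr], devProj, CL_FALSE, 0, imgSize, hostProj, 0, NULL, NULL);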
To get around this memory issue, I created an implementation where a mapping to the device is made only when it is needed. This can be seen in the following implementation:
// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];

// initiate streams
void initialize()
{
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
        devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
    }
}

// main function called for all input images
void backproject(imgdata* r)
{
    const uint projNr = r->imgnr % BUFFER_SIZE;
    const uint streamNr = r->imgnr % STREAMS_MAX;

    // when buffer is filled, wait until work in current stream has finished
    if(projNr == 0) {
        status = clFinish(queue[streamNr]);
    }

    // map host memory region to device buffer
    hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);

    // copy received image data to hostbuffers
    std::copy(imgPtr, imgPtr + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // unmap the allocated pinned host memory
    clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);

    // set stream specific arguments
    clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);

    // launch kernel
    clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);

    clFlush(queue[streamNr]);
    clFinish(queue[streamNr]); //should be removed!
}
This implementation uses a similar amount of device memory as the CUDA implementation. However, I have been unable to get this last code example working without a clFinish after each loop iteration, which significantly hampers the performance of the application. This suggests data is being lost as the host runs ahead of the kernel. I tried increasing my buffer size to the number of input images, but this did not work either. So somehow, during execution, the hostBuffer data gets lost.
So, with the goal to write OpenCL code similar to CUDA, I have three questions:
What is the recommended implementation for OpenCL pinned memory?
Is my OpenCL implementation similar to how CUDA handles pinned memory?
What causes the wrong data to be used in the OpenCL example?
Thanks in advance!
Kind regards,
Remy
PS: Question initially asked at the Nvidia developer forums

WebRTC Audio Processing Module (APM) and calculating echo delay for a playback device

I am very new to audio processing. I created a program that records and streams audio one way and does not record on the other end; basically, it transmits whatever is recorded in one location to another location. However, there are many circumstances where this program will also output the audio in the same location as the recording source. This creates a noticeable "echo" because of the audio delay (which depends on many factors).
Because I'm not sure whether anything else is out there, I am trying to use WebRTC's Audio Processing Module for gain control and acoustic echo cancellation. The gain control seems to work well, but the AEC doesn't really work so well. I'm assuming that's because I am not setting the correct stream delay, or maybe this isn't really what AEC is for.
The current code I'm using reads something I've recorded from a file in an attempt to get rid of the echo, at least the first occurrence of it. If I set the stream delay to 0, as you might expect, the current audio gets cancelled out completely. I have tried different values without much success.
So my question is, and I hope this is specific enough, what am I doing wrong in this model here?
void start( char *inFilename, char *outFilename )
{
    FILE *infile = fopen( inFilename, "rb" );
    FILE *outfile = fopen( outFilename, "wb" );

    // Our frame manager
    AudioFrame frame;
    frame._audioChannel = CHANNELS;
    frame._frequencyInHz = SAMPLERATE;
    frame._payloadDataLengthInSamples = SAMPLERATE/100; // Math for 20ms frames

    // Get the size of our frames
    const size_t frameLength = frame._payloadDataLengthInSamples*CHANNELS;

    AudioProcessing* apm = AudioProcessing::Create(0);
    //
    apm->set_sample_rate_hz( SAMPLERATE ); // Super-wideband processing.
    //
    // // Mono capture and stereo render.
    apm->set_num_channels(1, 1);
    apm->set_num_reverse_channels(1);
    //
    apm->high_pass_filter()->Enable(true);
    //
    //apm->echo_cancellation()->set_suppression_level( EchoCancellation::SuppressionLevel::kHighSuppression );
    apm->echo_cancellation()->enable_drift_compensation( false );
    apm->echo_cancellation()->Enable( true );
    //
    apm->noise_suppression()->set_level( NoiseSuppression::Level::kHigh );
    apm->noise_suppression()->Enable( true );
    //
    apm->gain_control()->set_analog_level_limits( 0, 255 );
    apm->gain_control()->set_mode( GainControl::Mode::kAdaptiveDigital );
    apm->gain_control()->Enable( true );
    //
    // apm->voice_detection()->Enable(true);
    //
    // // Start a voice call...
    while( fread(frame._payloadData, sizeof( int16_t ), frameLength, infile )==frameLength )
    {
        //apm->set_stream_delay_ms( 0 );
        apm->AnalyzeReverseStream( &frame );
        //
        // // ... Render frame arrives bound for the audio HAL ...
        //
        // // ... Capture frame arrives from the audio HAL ...
        // // Call required set_stream_ functions.
        // apm->gain_control()->set_stream_analog_level(analog_level);
        //
        apm->set_stream_delay_ms( 300 );

        int err = apm->ProcessStream( &frame );
        fprintf( stdout, "Output %i\n", err );
        //
        // // Call required stream_ functions.
        // analog_level = apm->gain_control()->stream_analog_level();
        // has_voice = apm->stream_has_voice();

        fwrite( frame._payloadData, sizeof( int16_t ), frameLength, outfile );
    }
    //
    // // Repeate render and capture processing for the duration of the call...
    // // Start a new call...
    // apm->Initialize();
    //
    // // Close the application...
    AudioProcessing::Destroy( apm );
    apm = NULL;

    fclose( infile );
    fclose( outfile );
}
Using the includes and libraries from: http://www.freedesktop.org/software/pulseaudio/webrtc-audio-processing/
I have the same problem too. I tried to find an API in the OpenSL ES manual that tells me how much stream delay there is, but failed.
For now, I think I may need to calculate the stream delay myself.
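If it helps, the comments in WebRTC's audio_processing.h describe the delay that set_stream_delay_ms() expects roughly like this (the *TimeMs values below are placeholders you would have to measure from your own audio stack, e.g. the render and capture callbacks):
// delay = (t_render - t_analyze) + (t_process - t_capture)
//   t_analyze : when the far-end frame is passed to AnalyzeReverseStream()
//   t_render  : when the first sample of that frame is actually played out
//   t_capture : when the first sample of the near-end frame is captured
//   t_process : when that near-end frame is passed to ProcessStream()
int echoDelayMs = (renderTimeMs - analyzeTimeMs) + (processTimeMs - captureTimeMs);
apm->set_stream_delay_ms( echoDelayMs );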