OpenCL vs CUDA: Pinned memory - c++

I have been porting my RabbitCT CUDA implementation to OpenCL and I'm running into issues with pinned memory.
For CUDA, a host buffer is created in pinned memory that holds the input images to be processed. This allows the host to receive the next batch of input images while the GPU processes the current batch. A simplified mockup of my CUDA implementation is as follows:
// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];
void initialize()
{
    // initiate streams
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        cudaStreamCreateWithFlags(&stream[s], cudaStreamNonBlocking);
        cudaMalloc((void**)&devProjection[s], imgSize);
    }
    // initiate buffers
    for( uint b = 0; b < BUFFER_SIZE; b++ ){
        cudaMallocHost((void**)&hostProjBuffer[b], imgSize);
    }
}

// main function called for all input images
void backproject(imgdata* r)
{
    uint projNr = r->imgnr % BUFFER_SIZE;
    uint streamNr = r->imgnr % STREAMS_MAX;

    // when the buffer is filled, wait until work in the current stream has finished
    if(projNr == 0) {
        cudaStreamSynchronize(stream[streamNr]);
    }

    // copy received image data to buffer (maps double precision to float)
    std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // copy image and matrix to device
    cudaMemcpyAsync(devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr]);

    // call kernel
    backproject<<<numBlocks, threadsPerBlock, 0, stream[streamNr]>>>(devProjection[streamNr]);
}
So, for CUDA, I create a pinned host pointer for each buffer item and copy the data to the device before launching the kernel in each stream.
For OpenCL I initially did something similar, following the Nvidia OpenCL Best Practices Guide. There they recommend creating two buffers: one that the kernel reads from, and one that exists only to obtain pinned host memory. However, this doubles the device memory use, as both the kernel buffer and the pinned-memory buffer end up allocated on the device.
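For reference, my first attempt following that guide looked roughly like this (a sketch from memory rather than my exact code; queue, status, r and the size variables are the same ones used in the snippets in this question):

// sketch: NVIDIA-style pinned memory with a separate staging buffer
cl_mem pinnedBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
cl_mem deviceBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY, imgSize, NULL, &status);

// map the pinned buffer once to obtain a page-locked host pointer
float* hostPtr = (float*) clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE, CL_MAP_WRITE, 0, imgSize, 0, NULL, NULL, &status);

// per image: fill the pinned host pointer, then copy it into the device buffer the kernel uses
std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), hostPtr);
clEnqueueWriteBuffer(queue, deviceBuffer, CL_FALSE, 0, imgSize, hostPtr, 0, NULL, NULL);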
To get around this memory issue, I created an implementation with a single device buffer per stream that is mapped into host memory only when needed. This can be seen in the following implementation:
// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];
// initiate streams
void initialize()
{
    for( uint s = 0; s < STREAMS_MAX; s++ ){
        queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
        devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
    }
}

// main function called for all input images
void backproject(imgdata* r)
{
    const uint projNr = r->imgnr % BUFFER_SIZE;
    const uint streamNr = r->imgnr % STREAMS_MAX;

    // when the buffer is filled, wait until work in the current stream has finished
    if(projNr == 0) {
        status = clFinish(queue[streamNr]);
    }

    // map host memory region to device buffer
    hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);

    // copy received image data to host buffers (maps double precision to float)
    std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);

    // unmap the allocated pinned host memory
    clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);

    // set stream specific arguments
    clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);

    // launch kernel
    clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);
    clFlush(queue[streamNr]);
    clFinish(queue[streamNr]); // should be removed!
}
This implementation uses a similar amount of device memory as the CUDA implementation. However, I have been unable to get this last code example working without a clFinish after each loop iteration, which significantly hampers the performance of the application. This indicates that data is lost as the host moves ahead of the kernel. I tried increasing my buffer size to the number of input images, but this did not work either. So somehow, during execution, the hostProjBuffer data gets lost.
So, with the goal of writing OpenCL code that behaves like the CUDA version, I have three questions:
What is the recommended implementation for OpenCL pinned memory?
Is my OpenCL implementation similar to how CUDA handles pinned memory?
What causes the wrong data to be used in the OpenCL example?
Thanks in advance!
Kind regards,
Remy
PS: Question initially asked at the Nvidia developer forums

Related

WASAPI captured packets do not align

I'm trying to visualize a soundwave captured by WASAPI loopback but find that the packets I record do not form a smooth wave when put together.
My understanding of how the WASAPI capture client works is that when I call pCaptureClient->GetBuffer(&pData, &numFramesAvailable, &flags, NULL, NULL) the buffer pData is filled from the front with numFramesAvailable datapoints. Each datapoint is a float and they alternate by channel. Thus to get all available datapoints I should cast pData to a float pointer, and take the first channels * numFramesAvailable values. Once I release the buffer and call GetBuffer again it provides the next packet. I would assume that these packets would follow on from each other but it doesn't seem to be the case.
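To make that assumption concrete, extracting a single channel from one packet would look roughly like this (just a sketch of how I read the data, not my actual capture code; it needs <vector>, and UINT32 comes from the Windows headers):

// sketch: pull one channel out of an interleaved float packet
// assumes pData holds numFrames frames of `channels` interleaved float samples
std::vector<float> extractChannel(const float* samples, UINT32 numFrames, int channels, int channel)
{
    std::vector<float> out(numFrames);
    for (UINT32 f = 0; f < numFrames; ++f)
        out[f] = samples[f * channels + channel];
    return out;
}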
My guess is that either I'm making an incorrect assumption about the format of the audio data in pData, or the capture client is missing or overlapping frames. But I have no idea how to check either of these.
To make the code below as brief as possible I've removed things like error status checking and cleanup.
Initialization of capture client:
const CLSID CLSID_MMDeviceEnumerator = __uuidof(MMDeviceEnumerator);
const IID IID_IMMDeviceEnumerator = __uuidof(IMMDeviceEnumerator);
const IID IID_IAudioClient = __uuidof(IAudioClient);
const IID IID_IAudioCaptureClient = __uuidof(IAudioCaptureClient);
IMMDeviceEnumerator *pDeviceEnumerator = NULL;
IMMDevice *pDeviceEndpoint = NULL;
IAudioClient *pAudioClient = NULL;
IAudioCaptureClient *pCaptureClient = NULL;
int channels;
// Initialize audio device endpoint
CoInitialize(nullptr);
CoCreateInstance(CLSID_MMDeviceEnumerator, NULL, CLSCTX_ALL, IID_IMMDeviceEnumerator, (void**)&pDeviceEnumerator);
pDeviceEnumerator->GetDefaultAudioEndpoint(eRender, eConsole, &pDeviceEndpoint);
// init audio client
WAVEFORMATEX *pwfx = NULL;
REFERENCE_TIME hnsRequestedDuration = 10000000;
REFERENCE_TIME hnsActualDuration;
pDeviceEndpoint->Activate(IID_IAudioClient, CLSCTX_ALL, NULL, (void**)&pAudioClient);
pAudioClient->GetMixFormat(&pwfx);
pAudioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, AUDCLNT_STREAMFLAGS_LOOPBACK, hnsRequestedDuration, 0, pwfx, NULL);
channels = pwfx->nChannels;
pAudioClient->GetService(IID_IAudioCaptureClient, (void**)&pCaptureClient);
pAudioClient->Start(); // Start recording.
Capture of packets (note that std::mutex packet_buffer_mutex and vector<vector<float>> packet_buffer are already defined and used by another thread to safely display the data):
UINT32 packetLength = 0;
BYTE *pData = NULL;
UINT32 numFramesAvailable;
DWORD flags;
int max_packets = 8;
std::unique_lock<std::mutex>write_guard(packet_buffer_mutex, std::defer_lock);
while (true) {
    pCaptureClient->GetNextPacketSize(&packetLength);
    while (packetLength != 0)
    {
        // Get the available data in the shared buffer.
        pData = NULL;
        pCaptureClient->GetBuffer(&pData, &numFramesAvailable, &flags, NULL, NULL);
        if (flags & AUDCLNT_BUFFERFLAGS_SILENT)
        {
            pData = NULL; // Tell CopyData to write silence.
        }
        write_guard.lock();
        if (packet_buffer.size() == max_packets) {
            packet_buffer.pop_back();
        }
        if (pData) {
            float *pfData = (float*)pData;
            packet_buffer.emplace(packet_buffer.begin(), pfData, pfData + channels * numFramesAvailable);
        } else {
            packet_buffer.emplace(packet_buffer.begin());
        }
        write_guard.unlock();
        pCaptureClient->ReleaseBuffer(numFramesAvailable);
        pCaptureClient->GetNextPacketSize(&packetLength);
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
I store the packets in a vector<vector<float>> (where each vector<float> is a packet) removing the last one and inserting the newest at the start so I can iterate over them in order.
Below is the result of a captured sinewave, plotting alternating values so it only represents a single channel. It is clear where the packets are being stitched together.
Something is playing a sine wave to Windows; you're recording the sine wave back in the audio loopback; and the sine wave you're getting back isn't really a sine wave.
You're almost certainly running into glitches. The most likely causes of glitching are:
Whatever is playing the sine wave to Windows isn't getting data to Windows in time, so the buffer is running dry.
Whatever is reading the loopback data out of Windows isn't reading the data in time, so the buffer is filling up.
Something is going wrong in between playing the sine wave to Windows and reading it back.
It is possible that more than one of these are happening.
The IAudioCaptureClient::GetBuffer call will tell you if you read the data too late. In particular it will set *pdwFlags so that the AUDCLNT_BUFFERFLAGS_DATA_DISCONTINUITY bit is set.
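In your capture loop that check would look something like this (a sketch using your existing flags variable):

// sketch: log glitches the capture client reports
pCaptureClient->GetBuffer(&pData, &numFramesAvailable, &flags, NULL, NULL);
if (flags & AUDCLNT_BUFFERFLAGS_DATA_DISCONTINUITY)
{
    // the engine dropped data between the previous packet and this one
    fprintf(stderr, "glitch: discontinuity before this packet\n");
}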
Looking at your code, I see you're doing the following things between the GetBuffer and the ReleaseBuffer:
Waiting on a lock
Sometimes doing something called "pop_back"
Doing something called "emplace"
I quote from the above-linked documentation:
Clients should avoid excessive delays between the GetBuffer call that acquires a packet and the ReleaseBuffer call that releases the packet. The implementation of the audio engine assumes that the GetBuffer call and the corresponding ReleaseBuffer call occur within the same buffer-processing period. Clients that delay releasing a packet for more than one period risk losing sample data.
In particular you should NEVER DO ANY OF THE FOLLOWING between GetBuffer and ReleaseBuffer because eventually they will cause a glitch:
Wait on a lock
Wait on any other operation
Read from or write to a file
Allocate memory
Instead, pre-allocate a bunch of memory before calling IAudioClient::Start. As each buffer arrives, write to this memory. On the side, have a regularly scheduled work item that takes written memory and writes it to disk or whatever you're doing with it.
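A minimal sketch of that pattern, with made-up names (the capture thread only copies into preallocated memory between GetBuffer and ReleaseBuffer; a separate thread drains it):

// sketch: preallocate before IAudioClient::Start, copy only between GetBuffer/ReleaseBuffer
#include <atomic>
#include <vector>

std::vector<float> ring;                 // preallocated once, e.g. a few seconds of audio
std::atomic<size_t> writePos{0};         // capture thread writes, display thread reads

void preallocate(size_t sampleCapacity) { ring.resize(sampleCapacity); }

// called between GetBuffer and ReleaseBuffer: copy only, no locks, no allocation
void storePacket(const float* samples, size_t count)
{
    size_t pos = writePos.load(std::memory_order_relaxed);
    for (size_t i = 0; i < count; ++i)
        ring[(pos + i) % ring.size()] = samples[i];   // assumes preallocate() was called
    writePos.store(pos + count, std::memory_order_release);
}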

pjsip capture and play pcm data

I have some embedded devices that have no audio device by default. They communicate with each other via an FPGA. So my question is: how do I capture/play back audio from pjsip as PCM in order to send/receive it through the FPGA?
I know that there are pjmedia_mem_player_create() and pjmedia_mem_capture_create(), but I can't seem to find any good information on using these functions.
I tried the following piece of code, but an assertion failed because one of the function's parameters is "empty".
Error:
pjmedia_mem_capture_create: Assertion `pool && buffer && size && clock_rate && channel_count && samples_per_frame && bits_per_sample && p_port' failed.
Note: I'm mainly using pjsua2 for everything else like registrations, transports etc. Also the default audio is set to null with ep.audDevManager().setNullDev(); as without this, making/receiving a call would simply fail?!
void MyCall::onCallMediaState(OnCallMediaStateParam &prm)
{
    CallInfo ci = getInfo();

    pj_caching_pool_init(&cp, &pj_pool_factory_default_policy, 0);
    pj_pool_t *pool = pj_pool_create(&cp.factory, "POOLNAME", 2000, 2000, NULL);

    void *buffer;
    pjmedia_port *prt;

    #define CLOCK_RATE 8000
    #define CHANELS 1
    #define SAMPLES_PER_FRAME 480
    #define BITS_PER_SAMPLE 16

    pjmedia_mem_capture_create( pool,              // Pool
                                buffer,            // Buffer
                                2000,              // Buffer Size
                                CLOCK_RATE,
                                CHANELS,
                                SAMPLES_PER_FRAME,
                                BITS_PER_SAMPLE,
                                0,                 // Options
                                &prt);             // The return port
}
UPDATE
The assertion failed because the buffer variable doesn't have any memory allocated to it. Allocating twice the number of samples per frame in bytes (16-bit samples take two bytes each) gives it sufficient memory:
buffer = pj_pool_zalloc(pool, 960);
Also, a callback needs to be registered with pjmedia_mem_capture_set_eof_cb2() (the 2 at the end is necessary for PJSIP 2.10 or later). Apparently the buffer can be used from there; it's just that my implementation doesn't execute the callback at the moment.
Looks like I found the solution. I modified your code and wrote a simple example in C with the pjsua API that dumps every frame to a file. Sorry for the mess, I'm not proficient in C:
pjsua_call_info ci;
pjsua_call_get_info(call_id, &ci);
pjsua_conf_port_info cpi;
pjsua_conf_get_port_info(ci.conf_slot, &cpi);

pj_pool_t *pool = pjsua_pool_create("POOLNAME", 2000, 2000);
pjmedia_port *prt;
uint buf_size = cpi.bits_per_sample * cpi.samples_per_frame / 8;
void *buffer = pj_pool_zalloc(pool, buf_size);
pjsua_conf_port_id port_id;

pjmedia_mem_capture_create( pool,
                            buffer,
                            buf_size,
                            cpi.clock_rate,
                            cpi.channel_count,
                            cpi.samples_per_frame,
                            cpi.bits_per_sample,
                            0,
                            &prt);

pjmedia_mem_capture_set_eof_cb(prt, buffer, dump_incoming_frames);
pjsua_conf_add_port(pool, prt, &port_id);
pjsua_conf_connect(ci.conf_slot, port_id); // connect port with conference

/////// dumping frames ///////
static pj_status_t dump_incoming_frames(pjmedia_port *port, void *usr_data)
{
    pj_size_t buf_size = pjmedia_mem_capture_get_size(port);
    char *data = usr_data;
    ...
    fwrite(data, sizeof(data[0]), buf_size, fptr);
    ...
}
The documentation says pjmedia_mem_capture_set_eof_cb is deprecated, but I couldn't get pjmedia_mem_capture_set_eof_cb2 to work: buf_size is 0 for every call of dump_incoming_frames, so I stayed with the deprecated function. I also achieved the same result by creating a custom port.
I hope you can easily adapt it to your C++/pjsua2 code.
UPD:
I have modified PJSIP and wrapped the audio in/out streaming into proper PJSUA2/Media classes so it can be called from Python. Full code is here.

Multithreading for image processing at GPU using CUDA

Problem Statement:
I have to continuously process 8-megapixel images captured from a camera. Several image processing algorithms have to run on each image, such as color interpolation, color transformation, etc. These operations take a long time on the CPU, so I decided to do them on the GPU using CUDA kernels. I have already written a working CUDA kernel for color transformation, but I still need more of a boost in performance.
There are basically two computational times:
Copying the source image from CPU to GPU and vice-versa
Processing of the source image at GPU
While the image is being copied from CPU to GPU, nothing else happens. And similarly, while the GPU is processing the image, nothing else happens.
MY IDEA: I want to use multi-threading so that I can save some time. I want to capture the next image while the previous image is being processed on the GPU, so that when the GPU finishes processing the previous image, the next image is already there to be transferred from CPU to GPU.
What I need: I am completely new to the world of multi-threading. I am watching some tutorials and reading other material to learn more about it, so I am looking for suggestions about the proper steps and proper logic.
I'm not sure you really need threads for this. CUDA has the ability to allow for asynchronous concurrent execution between host and device (without the necessity to use multiple CPU threads.) What you're asking for is a pretty standard "pipelined" algorithm. It would look something like this:
$ cat t832.cu
#include <stdio.h>

#define IMGSZ 8000000
// for this example, NUM_FRAMES must be less than 255
#define NUM_FRAMES 128
#define nTPB 256
#define nBLK 64

unsigned char cur_frame = 0;
unsigned char validated_frame = 0;

bool validate_image(unsigned char *img) {
    validated_frame++;
    for (int i = 0; i < IMGSZ; i++)
        if (img[i] != validated_frame) {
            printf("image validation failed at %d, was: %d, should be: %d\n", i, img[i], validated_frame);
            return false;
        }
    return true;
}

void CUDART_CB my_callback(cudaStream_t stream, cudaError_t status, void* data) {
    validate_image((unsigned char *)data);
}

bool capture_image(unsigned char *img){
    for (int i = 0; i < IMGSZ; i++) img[i] = cur_frame;
    if (++cur_frame == NUM_FRAMES) { cur_frame--; return true; }
    return false;
}

__global__ void img_proc_kernel(unsigned char *img){
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    while (idx < IMGSZ){
        img[idx]++;
        idx += gridDim.x*blockDim.x;
    }
}

int main(){
    // setup
    bool done = false;
    unsigned char *h_imgA, *h_imgB, *d_imgA, *d_imgB;
    size_t dsize = IMGSZ*sizeof(unsigned char);
    cudaHostAlloc(&h_imgA, dsize, cudaHostAllocDefault);
    cudaHostAlloc(&h_imgB, dsize, cudaHostAllocDefault);
    cudaMalloc(&d_imgA, dsize);
    cudaMalloc(&d_imgB, dsize);

    cudaStream_t st1, st2;
    cudaStreamCreate(&st1); cudaStreamCreate(&st2);
    unsigned char *cur = h_imgA;
    unsigned char *d_cur = d_imgA;
    unsigned char *nxt = h_imgB;
    unsigned char *d_nxt = d_imgB;
    cudaStream_t *curst = &st1;
    cudaStream_t *nxtst = &st2;

    done = capture_image(cur); // grabs a frame and puts it in cur
    // enter main loop
    while (!done){
        cudaMemcpyAsync(d_cur, cur, dsize, cudaMemcpyHostToDevice, *curst); // send frame to device
        img_proc_kernel<<<nBLK, nTPB, 0, *curst>>>(d_cur);                  // process frame
        cudaMemcpyAsync(cur, d_cur, dsize, cudaMemcpyDeviceToHost, *curst);
        // insert a cuda stream callback here to copy the cur frame to output
        cudaStreamAddCallback(*curst, &my_callback, (void *)cur, 0);
        cudaStreamSynchronize(*nxtst); // prevent overrun
        done = capture_image(nxt);     // capture nxt image while GPU is processing cur
        unsigned char *tmp = cur;
        cur = nxt;
        nxt = tmp;   // ping - pong
        tmp = d_cur;
        d_cur = d_nxt;
        d_nxt = tmp;
        cudaStream_t *st_tmp = curst;
        curst = nxtst;
        nxtst = st_tmp;
    }
}
$ nvcc -o t832 t832.cu
$ cuda-memcheck ./t832
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
There are many cuda sample codes which may be helpful also, such as simpleStreams, asyncAPI, and simpleCallbacks
Since your question is very broad, I can only offer the following general advice:
1) Use CUDA streams
When using more than one CUDA stream, the memory transfer between CPU->GPU, the GPU processing and the memory transfer between GPU->CPU can overlap. This way the image processing of the next image can already begin while the result is transferred back.
You can also decompose each frame. Use n streams per frame and launch the image processing kernels n times with an offset.
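A rough sketch of that per-frame decomposition might look like this (reusing IMGSZ, nBLK and nTPB from the example above; d_img, h_img, the streams array and the kernel's extra size parameter are assumptions made for illustration, not code from the question):

// sketch: split one frame into n chunks and pipeline copy/kernel/copy per chunk
const int n = 4;                              // streams used for one frame
const size_t chunk = IMGSZ / n;               // assumes IMGSZ is divisible by n
for (int i = 0; i < n; ++i) {
    size_t off = i * chunk;
    cudaMemcpyAsync(d_img + off, h_img + off, chunk, cudaMemcpyHostToDevice, streams[i]);
    // hypothetical kernel variant that processes only `chunk` bytes starting at the given pointer
    img_proc_chunk_kernel<<<nBLK, nTPB, 0, streams[i]>>>(d_img + off, chunk);
    cudaMemcpyAsync(h_img + off, d_img + off, chunk, cudaMemcpyDeviceToHost, streams[i]);
}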
2) Apply the producer-consumer scheme
The producer thread captures the frames from the camera and stores them in a thread-safe container. The consumer thread(s) fetch(es) a frame from this source container, upload(s) it to the GPU using its/their own CUDA stream(s), launches the kernel and copies the result back to the host.
Each consumer thread would synchronize with its stream(s) before trying to get a new image from the source container.
A simple implementation could look like this:
#include <vector>
#include <thread>
#include <memory>

struct ThreadSafeContainer{ /*...*/ };

struct Producer
{
    Producer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {
    }

    void run()
    {
        while(true)
        {
            // grab image from camera
            // store image in container
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
};

struct Consumer
{
    Consumer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
    {
        cudaStreamCreate(&stream);
    }

    ~Consumer()
    {
        cudaStreamDestroy(stream);
    }

    void run()
    {
        while(true)
        {
            // read next image from container

            // upload to GPU
            cudaMemcpyAsync(...,...,...,stream);

            // run kernel
            kernel<<<..., ..., ..., stream>>>(...);

            // copy results back
            cudaMemcpyAsync(...,...,...,stream);

            // wait for results
            cudaStreamSynchronize(stream);

            // do something with the results
        }
    }

    std::shared_ptr<ThreadSafeContainer> container;
    cudaStream_t stream; // or multiple streams per consumer
};

int main()
{
    // create an instance of ThreadSafeContainer which will be shared between Producer and Consumer instances
    auto container = std::make_shared<ThreadSafeContainer>();

    // create one instance of Producer, pass the shared container as an argument to the constructor
    auto p = std::make_shared<Producer>(container);
    // create a separate thread which executes Producer::run
    std::thread producer_thread(&Producer::run, p);

    const int consumer_count = 2;
    std::vector<std::thread> consumer_threads;
    std::vector<std::shared_ptr<Consumer>> consumers;

    // create as many consumers as specified
    for (int i=0; i<consumer_count; ++i)
    {
        // create one instance of Consumer, pass the shared container as an argument to the constructor
        auto c = std::make_shared<Consumer>(container);
        // create a separate thread which executes Consumer::run
        consumer_threads.push_back(std::thread(&Consumer::run, c));
    }

    // wait for the threads to finish, otherwise the program will just exit here and the threads will be killed
    // in this example, the program will never exit since the infinite loops in the run() methods never end
    producer_thread.join();
    for (auto& t : consumer_threads)
    {
        t.join();
    }

    return 0;
}

Kernel AIO (libaio) write request fails when writing from memory reserved at boot time

I am pretty new to linux aio (libaio) and am hoping someone here with more experience can help me out.
I have a system that performs high-speed DMA from a PCI device to the system RAM. I then want to write the data to disk using libaio directly from the DMA address space. Currently the memory used for DMA is reserved on boot through the use of the "memmap" kernel boot command.
In my application, the memory is mapped with mmap to a virtual userspace address pointer which I believe I should be able to pass as my buffer to the io_prep_pwrite() call. However, when I do this, the write does not seem to succeed. When the request is reaped with io_getevents the "res" field has error code -22 which prints as "Bad Address".
However, if I do a memcpy from the previously mapped memory location to a new buffer allocated by my application and then use the pointer to this local buffer in the call to io_prep_pwrite the request works just fine and the data is written to disk. The problem is that performing the memcpy creates a bottle-neck for the speed at which I need to stream data to disk and I am unable to keep up with the DMA rate.
I do not understand why I have to copy the memory first for the write request to work. I created a minimal example to illustrate the issue below. The bufBaseLoc is the mapped address and the localBuffer is the address the data is copied to. I do not want to have to perform the following line:
memcpy(localBuffer, bufBaseLoc, WRITE_SIZE);
...or have the localBuffer at all. In the "Prepare IOCB" section I want to use:
io_prep_pwrite(iocb, fid, bufBaseLoc, WRITE_SIZE, 0);
...but it does not work. But the local buffer, which is just a copy, does work.
Does anyone have any insight into why? Any help would be greatly appreciated.
Thanks,
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cstdint>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <libaio.h>

#define WRITE_SIZE 0x80000           // 512 KiB buffer
#define DMA_BASE_ADDRESS 0x780000000 // Physical address of memory reserved at boot

// Reaping callback
void writeCallback(io_context_t ctx, struct iocb *iocb, long res, long res2)
{
    // res is the number of bytes written by the request. It should match the requested IO size. If negative it is an error code.
    if(res != (long)iocb->u.c.nbytes)
    {
        fprintf(stderr, "WRITE_ERROR: %s\n", strerror(-res));
    }
    else
    {
        fprintf(stderr, "Success\n");
    }
}

int main()
{
    // Initialize Kernel AIO
    io_context_t ctx = 0;
    io_queue_init(256, &ctx);

    // Open /dev/mem and map physical address to userspace
    int fdmem = open("/dev/mem", O_RDWR);
    void *bufBaseLoc = mmap(NULL, WRITE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdmem, DMA_BASE_ADDRESS);

    // Set memory here for test (normally done by DMA)
    memset(bufBaseLoc, 1, WRITE_SIZE);

    // Initialize local memory buffer (DON'T WANT TO HAVE TO DO THIS)
    uint8_t *localBuffer;
    posix_memalign((void**)&localBuffer, 4096, WRITE_SIZE);
    memset(localBuffer, 1, WRITE_SIZE);

    // Open/allocate file on disk
    std::string filename = "tmpData.dat";
    int fid = open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH);
    posix_fallocate(fid, 0, WRITE_SIZE);

    // Copy from DMA buffer to local process buffer (THIS IS THE COPY I WANT TO AVOID)
    memcpy(localBuffer, bufBaseLoc, WRITE_SIZE);

    // Prepare IOCB
    struct iocb *iocbs[1];
    struct iocb *iocb = (struct iocb*) malloc(sizeof(struct iocb));
    io_prep_pwrite(iocb, fid, localBuffer, WRITE_SIZE, 0);   //<-- THIS WORKS (but is not what I want to do)
    //io_prep_pwrite(iocb, fid, bufBaseLoc, WRITE_SIZE, 0);  //<-- THIS DOES NOT WORK (but is what I want to do)
    io_set_callback(iocb, writeCallback);
    iocbs[0] = iocb;

    // Send request
    int res = io_submit(ctx, 1, iocbs);
    if (res != 1)
    {
        fprintf(stderr, "IO_SUBMIT_ERROR: %s\n", strerror(-res));
    }

    // Reap request
    struct io_event events[1];
    size_t ret = io_getevents(ctx, 1, 1, events, 0);
    if (ret == 1)
    {
        io_callback_t cb = (io_callback_t)events[0].data;
        struct iocb *iocb_e = events[0].obj;
        cb(ctx, iocb_e, events[0].res, events[0].res2);
    }
}
It could be that your original buffer is not aligned.
Any buffer that libaio writes to disk needs to be aligned (to 4K, I think). When you allocate your temporary buffer you use posix_memalign, which allocates it correctly aligned, meaning the write succeeds. If your original reserved buffer is not aligned, that could be causing your issues.
If you can try to allocate that initial buffer with an aligned alloc it could solve the issue.
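A quick sanity check is to verify all three O_DIRECT constraints on the write you are submitting (a sketch; 4096 is an assumed block size, not something taken from your setup):

#include <cstdint>
#include <cstdio>

// sketch: with O_DIRECT, buffer address, transfer size and file offset
// should all be multiples of the device/filesystem block size
static void checkAlignment(const void *buf, size_t size, long long offset, size_t align = 4096)
{
    if ((uintptr_t)buf % align || size % align || (size_t)offset % align)
        fprintf(stderr, "not aligned: buf=%p size=%zu offset=%lld\n", buf, size, offset);
}

// usage in the example above: checkAlignment(bufBaseLoc, WRITE_SIZE, 0);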

Using multiple openmp host threads and opencl

My GPU is an ATI Mobility Radeon HD 5450; specifications for the 5470 (which are nearly identical) can be found here. I've encountered a problem using multiple host threads (using OpenMP) together with OpenCL. What I'm doing is the following:
I've got one instance of the following class containing the context, program, queue, etc.:
class OclMain
{
    ...
    ...
    cl::Device device;
    cl::Context context;
    cl::CommandQueue queue;
    cl::Program program;
    cl::Program::Sources sources;
    ...
    ...
};
For each OpenMP thread I have an instance of the following class, which each contains its own cl::Kernel because setArg(...) isn't thread safe. Each of these also handles its own buffer creation, execution of the kernel, etc. If, for example, I have a maximum of 16 threads using #pragma omp parallel for num_threads(16), I create 16 of these objects and each thread has its own Ocl object. When a thread is done, the Ocl object is reused for a next iteration of the aforementioned for loop and setEpi() is called again to upload the new data to the device. Each thread handles one cv::Mat epi.
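The driving loop looks roughly like this (a simplified sketch of what I do, not the exact code; oclMain and the epis vector are placeholders for my actual data):

// sketch: one Ocl object per OpenMP thread, reused across loop iterations
#include <omp.h>
#include <vector>

std::vector<Ocl> ocls;
for (int i = 0; i < 16; ++i)
    ocls.emplace_back(&oclMain);           // oclMain: the single OclMain instance

#pragma omp parallel for num_threads(16)
for (int i = 0; i < (int)epis.size(); ++i)
{
    Ocl &ocl = ocls[omp_get_thread_num()]; // each thread reuses its own Ocl object
    ocl.setEpi(epis[i]);                   // upload this thread's epi once
    for (int row = 0; row < epis[i].rows; ++row)
        ocl.runKernel(/*...*/);            // row by row, as the algorithm requires
}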
class Ocl
{
    Ocl(OclMain *oclm) : oclm(oclm) { kernel = cl::Kernel(oclm->program, "kernel"); }
    ...
    // this data doesn't change during execution of an openmp thread, so I only upload it once to the device after thread creation
    void setEpi(cv::Mat &epi)
    {
        ...
        img_epi = cl::Image2D(oclm->context, CL_MEM_READ_ONLY, cl::ImageFormat(CL_RGBA, CL_FLOAT), epi.cols, epi.rows, 0, 0);
        cl::size_t<3> origin;
        origin[0] = 0; origin[1] = 0; origin[2] = 0;
        cl::size_t<3> region;
        region[0] = epi.cols; region[1] = epi.rows; region[2] = 1;
        oclm->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, 0);
        ...
        // enqueue some more WriteBuffers here (only small buffers)
        ...
        ...
    }
    ...
    // gets called multiple times (maximum epi.rows times, on a row per row basis;
    // the algorithm I need to implement works this way)
    void runKernel(...)
    {
        // will contain result of kernel computation
        cl::Buffer buff(oclm->context, CL_MEM_WRITE_ONLY, sizeof(float)*dmax*epi.cols);
        // set kernel args here
        // ...
        // enqueue kernel
        oclm->queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(epi.cols/64*64+64), cl::NDRange(64));
        // get results from device
        oclm->queue.enqueueReadBuffer(buff, CL_TRUE, 0, sizeof(float)*dmax*epi.cols, result.data());
    }
    ...
    ...
    cl::Kernel kernel;
    OclMain *oclm;
    cl::Image2D img_epi;
};
If epi.rows and epi.cols are small (e.g. 240x100), this works without a problem.
If epi.rows and epi.cols are big(ger) (e.g. 863x100), it does not, unless I only use ONE OpenMP thread. If I use more threads, the program freezes after the first few threads have executed. As far as radeontop is concerned, at this point there is nothing going on on the GPU: 0% for all statistics.
The problem seems to be that the call to oclm->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, NULL) never finishes. If I change the call to non-blocking, the program continues running, until the next blocking call, which won't return either.
I tried flushing; didn't help much either. HOWEVER, I don't seem to have this problem if I upload the data using the CL_MEM_COPY_HOST_PTR method instead of enqueueing a WriteImage. What's going on?
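For reference, the upload variant that does not hang looks roughly like this (a sketch, using the cl.hpp Image2D constructor that takes a host pointer; it assumes epi.data is tightly packed RGBA float data, as in the enqueueWriteImage call above):

// sketch: let the runtime copy the host data when the image is created,
// instead of enqueueing a separate blocking WriteImage
img_epi = cl::Image2D(oclm->context,
                      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                      cl::ImageFormat(CL_RGBA, CL_FLOAT),
                      epi.cols, epi.rows,
                      0,          // row pitch (0 = tightly packed)
                      epi.data);  // host pointer copied by the runtime
// no enqueueWriteImage needed afterwards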