Using multiple OpenMP host threads and OpenCL - C++

My GPU is an ATI Mobility Radeon HD 5450; specifications for the 5470 (which are nearly identical) can be found here. I've encountered a problem using multiple host threads (using OpenMP) and OpenCL. What I'm doing is the following:
I've got one instance of the following class containing the context, program, queue, etc.:
class OclMain
{
...
...
cl::Device device;
cl::Context context;
cl::CommandQueue queue;
cl::Program program;
cl::Program::Sources sources;
...
...
};
For each OpenMP thread I have an instance of the following class, each of which contains its own cl::Kernel because setArg(...) isn't thread safe. Each of these also handles its own buffer creation, execution of the kernel, etc. If, for example, I have a maximum of 16 threads using #pragma omp parallel for num_threads(16), I create 16 of these objects and each thread has its own Ocl object. When a thread is done, the Ocl object is reused for the next iteration of the aforementioned for loop and setEpi() is called again to upload the new data to the device. Each thread handles one cv::Mat epi (a rough sketch of this driving loop follows the class definition below).
class Ocl
{
Ocl(OclMain *oclm) : oclm(oclm) { kernel = cl::Kernel(oclm->program, "kernel"); }
...
// this data doesn't change during execution of an openmp thread, so I only upload it once to the device after thread creation
void setEpi(cv::Mat &epi)
{
...
img_epi = cl::Image2D(oclm->context, CL_MEM_READ_ONLY, cl::ImageFormat(CL_RGBA, CL_FLOAT), epi.cols, epi.rows, 0, 0);
cl::size_t<3> origin;
origin[0] = 0; origin[1] = 0; origin[2] = 0;
cl::size_t<3> region;
region[0] = epi.cols; region[1] = epi.rows; region[2] = 1;
oclm->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, 0);
...
// enqueue some more WriteBuffers here (only small buffers)
...
...
}
...
// gets called multiple times (maximum epi.rows times, on a row-by-row basis;
// the algorithm I need to implement works this way)
void runKernel(...)
{
// will contain result of kernel computation
cl::Buffer buff(oclm->context, CL_MEM_WRITE_ONLY, sizeof(float)*dmax*epi.cols);
// set kernel args here
// ...
// enqueue kernel
oclm->queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(epi.cols/64*64+64), cl::NDRange(64));
// get results from device
oclm->queue.enqueueReadBuffer(buff, CL_TRUE, 0, sizeof(float)*dmax*epi.cols, result.data());
}
...
...
cl::Kernel kernel;
OclMain *oclm;
cl::Image2D img_epi;
};
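For completeness, the driving loop is assumed to look roughly like this (a sketch only; oclmain, epis and the per-thread bookkeeping are illustrative names, and <omp.h> is needed for omp_get_thread_num()):
// One Ocl object per OpenMP thread, reused across iterations of the parallel for.
std::vector<Ocl> ocls;
for (int t = 0; t < 16; ++t)
    ocls.emplace_back(&oclmain);
#pragma omp parallel for num_threads(16)
for (int i = 0; i < (int)epis.size(); ++i)
{
    Ocl &ocl = ocls[omp_get_thread_num()];
    ocl.setEpi(epis[i]);                  // upload this thread's epi once
    for (int row = 0; row < epis[i].rows; ++row)
        ocl.runKernel(/* row-specific arguments */);
}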
If epi.rows and epi.cols are small (e.g. 240x100), this works without a problem.
If epi.rows and epi.cols are big(ger) (e.g. 863x100), it does not, unless I use only ONE OpenMP thread. If I use more threads, the program freezes after executing the first few threads. As far as radeontop is concerned, at this point there is nothing going on on the GPU: 0% for all statistics.
The problem seems to be that the call to oclm->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, NULL) never finishes. If I change the call to non-blocking, the program continues running, until the next blocking call, which won't return either.
I tried flushing; that didn't help much either. HOWEVER, I don't seem to have this problem if I upload the data using the CL_MEM_COPY_HOST_PTR method instead of enqueueing a WriteImage. What's going on?
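For reference, the CL_MEM_COPY_HOST_PTR workaround mentioned above looks roughly like this (a minimal sketch; the only names taken from the code above are oclm, img_epi and epi, the rest is assumed):
// Sketch: let the runtime copy epi.data when the image is created, instead of
// issuing a separate (blocking) enqueueWriteImage call. epi is assumed to be a
// CV_32FC4 matrix so its layout matches CL_RGBA / CL_FLOAT.
img_epi = cl::Image2D(oclm->context,
                      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                      cl::ImageFormat(CL_RGBA, CL_FLOAT),
                      epi.cols, epi.rows,
                      0,          // row pitch: 0 = tightly packed
                      epi.data);  // host data is copied here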

Related

Getting External Headphones (8): EXC_BAD_ACCESS (code=1, address=0x0) error when I am trying to use maximilian

I am testing out using the maximilian library with JUCE. I am trying to use the maxiSample feature and I have implemented it exactly how the example code says to. Whenever I run the standalone app, I get the error "External Headphones (8): EXC_BAD_ACCESS (code=1, address=0x0)" and it gives me a breakpoint at line 747 of maximilian.cpp. It's not my headphones as it does the same thing with any playback device. Truly at a loss.
I've attached my MainComponent.cpp below. Any advice would be great, thank you!
#include "MainComponent.h"
#include "maximilian.h"
//==============================================================================
MainComponent::MainComponent()
{
// Make sure you set the size of the component after
// you add any child components.
setSize (800, 600);
// Some platforms require permissions to open input channels so request that here
if (juce::RuntimePermissions::isRequired (juce::RuntimePermissions::recordAudio)
&& ! juce::RuntimePermissions::isGranted (juce::RuntimePermissions::recordAudio))
{
juce::RuntimePermissions::request (juce::RuntimePermissions::recordAudio,
[&] (bool granted) { setAudioChannels (granted ? 2 : 0, 2); });
}
else
{
// Specify the number of input and output channels that we want to open
setAudioChannels (2, 2);
}
}
MainComponent::~MainComponent()
{
// This shuts down the audio device and clears the audio source.
shutdownAudio();
sample1.load("/Users/(username)/JuceTestPlugins/maxiSample/Source/kick.wav");
}
//==============================================================================
void MainComponent::prepareToPlay (int samplesPerBlockExpected, double sampleRate)
{
// This function will be called when the audio device is started, or when
// its settings (i.e. sample rate, block size, etc) are changed.
// You can use this function to initialise any resources you might need,
// but be careful - it will be called on the audio thread, not the GUI thread.
// For more details, see the help for AudioProcessor::prepareToPlay()
}
void MainComponent::getNextAudioBlock (const juce::AudioSourceChannelInfo& bufferToFill)
{
// Your audio-processing code goes here!
// For more details, see the help for AudioProcessor::getNextAudioBlock()
// Right now we are not producing any data, in which case we need to clear the buffer
// (to prevent the output of random noise)
//bufferToFill.clearActiveBufferRegion();
for(int sample = 0; sample < bufferToFill.buffer->getNumSamples(); ++sample){
//float sample2 = sample1.
//float wave = tesOsc.sinewave(200);
//double sample2 = sample1.play();
// leftSpeaker[sample] = (0.25 * wave);
// rightSpeaker[sample] = leftSpeaker[sample];
double *output;
output[0] = sample1.play();
output[1] = output[0];
}
}
void MainComponent::releaseResources()
{
// This will be called when the audio device stops, or when it is being
// restarted due to a setting change.
// For more details, see the help for AudioProcessor::releaseResources()
}
//==============================================================================
void MainComponent::paint (juce::Graphics& g)
{
// (Our component is opaque, so we must completely fill the background with a solid colour)
g.fillAll (getLookAndFeel().findColour (juce::ResizableWindow::backgroundColourId));
// You can add your drawing code here!
}
void MainComponent::resized()
{
// This is called when the MainContentComponent is resized.
// If you add any child components, this is where you should
// update their positions.
}
Can't say for sure, but a couple of things catch my attention.
In getNextAudioBlock() you are dereferencing invalid pointers:
double *output;
output[0] = sample1.play();
output[1] = output[0];
The pointer variable output is uninitialised and will probably contain garbage or zeros, which will make the program write to invalid memory. This is most likely what causes the EXC_BAD_ACCESS. This method is called from the realtime audio thread, so you get a crash on a non-main thread (in this case the thread of External Headphones (8)).
It is also not clear to me what exactly you're trying to do here, so it's hard to say what it should look like. What I can say is that assigning the result of sample1.play() to a double value looks suspicious.
Normally, when dealing with juce::AudioSourceChannelInfo you would get pointers to the audio buffers like so:
auto** bufferPointer = bufferToFill.buffer->getArrayOfWritePointers();
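A minimal sketch of what a per-sample loop writing into that buffer could look like (assuming sample1 is a maxiSample and that stereo output is wanted; this is only an illustration, not necessarily what the original code intends):
// Get one write pointer per channel, offset by the block's start sample.
auto* left  = bufferToFill.buffer->getWritePointer(0, bufferToFill.startSample);
auto* right = bufferToFill.buffer->getWritePointer(1, bufferToFill.startSample);
for (int sample = 0; sample < bufferToFill.numSamples; ++sample)
{
    // play() is assumed to return the next sample of the loaded file as a double
    auto value = static_cast<float>(sample1.play());
    left[sample]  = 0.25f * value;
    right[sample] = 0.25f * value;
}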
Further, you are loading a file inside the destructor of MainComponent. This is at least suspicious: why would you load a file during destruction?
MainComponent::~MainComponent()
{
// This shuts down the audio device and clears the audio source.
shutdownAudio();
sample1.load("/Users/(username)/JuceTestPlugins/maxiSample/Source/kick.wav");
}

Windows Desktop Duplication API taking a long time

I am using the Windows Desktop Duplication API to record my screen on Windows 10. However, I am having some issues with performance. When playing a video in Google Chrome and attempting to record, the time it takes to capture the screen fluctuates from 15 ms to 45 ms. I want to be able to record at at least 30 fps, and I know the Desktop Duplication API is capable of doing it. Anyway, here is the code I use to actually capture the screen:
processor->hr = processor->lDeskDupl->AcquireNextFrame(0, &processor->lFrameInfo, &processor->lDesktopResource);
if (processor->hr == DXGI_ERROR_WAIT_TIMEOUT) {
processor->lDeskDupl->ReleaseFrame();
return false;
}
if (FAILED(processor->hr)) {
processor->lDeskDupl->ReleaseFrame();
return false;
}
// QI for ID3D11Texture2D
processor->hr = processor->lDesktopResource->QueryInterface(IID_PPV_ARGS(&processor->lAcquiredDesktopImage));
if (FAILED(processor->hr)) {
processor->lDeskDupl->ReleaseFrame();
return false;
}
processor->lDesktopResource.Release();
if (processor->lAcquiredDesktopImage == nullptr) {
processor->lDeskDupl->ReleaseFrame();
return false;
}
processor->lImmediateContext->CopyResource(processor->lGDIImage, processor->lAcquiredDesktopImage);
processor->lAcquiredDesktopImage.Release();
processor->lDeskDupl->ReleaseFrame();
// Copy image into CPU access texture
processor->lImmediateContext->CopyResource(processor->lDestImage, processor->lGDIImage);
// Copy from CPU access texture to bitmap buffer
D3D11_MAPPED_SUBRESOURCE resource;
processor->subresource = D3D11CalcSubresource(0, 0, 0);
processor->lImmediateContext->Map(processor->lDestImage, processor->subresource, D3D11_MAP_READ_WRITE, 0, &resource);
BYTE* sptr = reinterpret_cast<BYTE*>(resource.pData);
BYTE* dptr = processor->pBuf;
UINT lRowPitch = min(processor->lBmpRowPitch, resource.RowPitch);
for (int i = 0; i < processor->lOutputDuplDesc.ModeDesc.Height; i++) {
memcpy_s(dptr, processor->lBmpRowPitch, sptr, lRowPitch);
sptr += resource.RowPitch;
dptr += processor->lBmpRowPitch;
}
It is important to note that this is the specific section that takes 15-45 ms to complete every cycle. The memcpy loop at the bottom usually accounts for about 2 ms of that time, so I know it is not responsible for the time being taken here. Also, AcquireNextFrame's timeout is set to zero, so it returns nearly immediately. Any help would be greatly appreciated! The code pasted here was adapted from this: https://gist.github.com/Xirexel/a69ade44df0f70afd4a01c1c9d9e02cd
You're not using the API in an optimal way. Read the remarks in the ReleaseFrame API documentation:
For performance reasons, we recommend that you release the frame just before you call the IDXGIOutputDuplication::AcquireNextFrame method to acquire the next frame. When the client does not own the frame, the operating system copies all desktop updates to the surface. This can result in wasted GPU cycles if the operating system updates the same region for each frame that occurs.
You are not doing what is written there: you release the frame as soon as you finish copying it.
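A sketch of that restructuring, using the same member names as the question's code (error handling trimmed, so treat it as an illustration rather than a drop-in replacement):
// Release the frame still owned from the PREVIOUS capture right before acquiring
// the next one, so the OS keeps accumulating desktop updates for us in between.
// (On the very first call, when no frame is owned yet, ReleaseFrame simply fails
// and the result is ignored.)
processor->lDeskDupl->ReleaseFrame();
processor->hr = processor->lDeskDupl->AcquireNextFrame(0, &processor->lFrameInfo,
                                                       &processor->lDesktopResource);
if (FAILED(processor->hr))   // includes DXGI_ERROR_WAIT_TIMEOUT
    return false;
processor->hr = processor->lDesktopResource->QueryInterface(
    IID_PPV_ARGS(&processor->lAcquiredDesktopImage));
processor->lDesktopResource.Release();
if (FAILED(processor->hr) || processor->lAcquiredDesktopImage == nullptr)
    return false;
// Copy out of the duplication surface while the frame is owned...
processor->lImmediateContext->CopyResource(processor->lGDIImage,
                                           processor->lAcquiredDesktopImage);
processor->lAcquiredDesktopImage.Release();
// ...but do NOT call ReleaseFrame here; it happens at the top of the next call.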

OpenCL vs CUDA: Pinned memory

I have been porting my RabbitCT CUDA implementation to OpenCL and I'm running into issues with pinned memory.
For CUDA a host buffer is created that buffers the input images to be processed in pinned memory. This allows the host to catch the next batch of input images while the GPU processes the current batch. A simplified mockup of my CUDA implementation is as follows:
// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];
void initialize()
{
// initiate streams
for( uint s = 0; s < STREAMS_MAX; s++ ){
cudaStreamCreateWithFlags (&stream[s], cudaStreamNonBlocking);
cudaMalloc( (void**)&devProjection[s], imgSize);
}
// initiate buffers
for( uint b = 0; b < BUFFER_SIZE; b++ ){
cudaMallocHost((void **)&hostProjBuffer[b], imgSize);
}
}
// main function called for all input images
void backproject(imgdata* r)
{
uint projNr = r->imgnr % BUFFER_SIZE;
uint streamNr = r->imgnr % STREAMS_MAX;
// When buffer is filled, wait until work in current stream has finished
if(projNr == 0) {
cudaStreamSynchronize(stream[streamNr]);
}
// copy received image data to buffer (maps double precision to float)
std::copy(r->I_n, r->I_n+(imgSizeX * imgSizeY), hostProjBuffer[projNr]);
// copy image and matrix to device
cudaMemcpyAsync( devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr] );
// call kernel
backproject<<<numBlocks, threadsPerBlock, 0 , stream[streamNr]>>>(devProjection[streamNr]);
}
So, for CUDA, I create a pinned host pointer for each buffer item and copy the data to the device before executing the kernel of each stream.
For OpenCL I initially did something similar when following the Nvidia OpenCL Best Practices Guide. Here they recommend creating two buffers, one for copying the kernel data to and one for the pinned memory. However, this leads to the implementation using double the device memory as both the kernel and pinned memory buffers are allocated on the device.
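For context, that two-buffer scheme looks roughly like this (a simplified sketch, not my actual code; all names here are illustrative):
// Pinned staging buffer, created with CL_MEM_ALLOC_HOST_PTR and mapped once.
cl_mem pinnedBuf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                  imgSize, NULL, &status);
// Separate device buffer that the kernel actually reads from.
cl_mem deviceBuf = clCreateBuffer(context, CL_MEM_READ_ONLY, imgSize, NULL, &status);
// Keep the mapped host pointer around for the lifetime of the program.
float* hostPtr = (float*) clEnqueueMapBuffer(queue, pinnedBuf, CL_TRUE, CL_MAP_WRITE,
                                             0, imgSize, 0, NULL, NULL, &status);
// Per image: fill the pinned region, then copy it into the device buffer.
std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), hostPtr);
clEnqueueWriteBuffer(queue, deviceBuf, CL_FALSE, 0, imgSize, hostPtr, 0, NULL, NULL);
On Nvidia's runtime both buffers end up counting against device memory, which is the doubling mentioned above.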
To get around this memory issue, I created an implementation where only a mapping is made to the device as it is needed. This can be seen in the following implementation:
// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];
// initiate streams
void initialize()
{
for( uint s = 0; s < STREAMS_MAX; s++ ){
queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
}
}
// main function called for all input images
void backproject(imgdata* r)
{
const uint projNr = r->imgnr % BUFFER_SIZE;
const uint streamNr = r->imgnr % STREAMS_MAX;
// when buffer is filled, wait until work in current stream has finished
if(projNr == 0) {
status = clFinish(queue[streamNr]);
}
// map host memory region to device buffer
hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);
// copy received image data to hostbuffers
std::copy(r->I_n, r->I_n + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);
// unmap the allocated pinned host memory
clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);
// set stream specific arguments
clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);
// launch kernel
clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);
clFlush(queue[streamNr]);
clFinish(queue[streamNr]); //should be removed!
}
This implementation does use a similar amount of device memory to the CUDA implementation. However, I have been unable to get this last code example working without a clFinish after each loop iteration, which significantly hampers the performance of the application. This indicates data is lost as the host moves ahead of the kernel. I tried increasing my buffer size to the number of input images, but this did not work either. So somehow during execution, the hostProjBuffer data gets lost.
So, with the goal to write OpenCL code similar to CUDA, I have three questions:
What is the recommended implementation for OpenCL pinned memory?
Is my OpenCL implementation similar to how CUDA handles pinned memory?
What causes the wrong data to be used in the OpenCL example?
Thanks in advance!
Kind regards,
Remy
PS: Question initially asked at the Nvidia developer forums

Multithreading for image processing at GPU using CUDA

Problem Statement:
I have to continuously process 8-megapixel images captured from a camera. Several image processing algorithms have to run on each image, such as color interpolation, color transformation, etc. These operations take a long time on the CPU, so I decided to do them on the GPU using CUDA kernels. I have already written a working CUDA kernel for color transformation, but I still need more of a performance boost.
There are basically two computational times:
Copying the source image from CPU to GPU and vice-versa
Processing of the source image at GPU
While the image is being copied from CPU to GPU, nothing else happens; and similarly, while the GPU is processing the image, nothing else happens.
MY IDEA: I want to use multi-threading to save some time: capture the next image while the previous image is being processed on the GPU, so that when the GPU finishes, the next image is already there to be transferred from CPU to GPU.
What I need: I am completely new to the world of multi-threading. I am watching some tutorials and other material to learn more about it, so I am looking for suggestions about the proper steps and proper logic.
I'm not sure you really need threads for this. CUDA has the ability to allow for asynchronous concurrent execution between host and device (without the necessity to use multiple CPU threads.) What you're asking for is a pretty standard "pipelined" algorithm. It would look something like this:
$ cat t832.cu
#include <stdio.h>
#define IMGSZ 8000000
// for this example, NUM_FRAMES must be less than 255
#define NUM_FRAMES 128
#define nTPB 256
#define nBLK 64
unsigned char cur_frame = 0;
unsigned char validated_frame = 0;
bool validate_image(unsigned char *img) {
validated_frame++;
for (int i = 0; i < IMGSZ; i++) if (img[i] != validated_frame) {printf("image validation failed at %d, was: %d, should be: %d\n",i, img[i], validated_frame); return false;}
return true;
}
void CUDART_CB my_callback(cudaStream_t stream, cudaError_t status, void* data) {
validate_image((unsigned char *)data);
}
bool capture_image(unsigned char *img){
for (int i = 0; i < IMGSZ; i++) img[i] = cur_frame;
if (++cur_frame == NUM_FRAMES) {cur_frame--; return true;}
return false;
}
__global__ void img_proc_kernel(unsigned char *img){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
while(idx < IMGSZ){
img[idx]++;
idx += gridDim.x*blockDim.x;}
}
int main(){
// setup
bool done = false;
unsigned char *h_imgA, *h_imgB, *d_imgA, *d_imgB;
size_t dsize = IMGSZ*sizeof(unsigned char);
cudaHostAlloc(&h_imgA, dsize, cudaHostAllocDefault);
cudaHostAlloc(&h_imgB, dsize, cudaHostAllocDefault);
cudaMalloc(&d_imgA, dsize);
cudaMalloc(&d_imgB, dsize);
cudaStream_t st1, st2;
cudaStreamCreate(&st1); cudaStreamCreate(&st2);
unsigned char *cur = h_imgA;
unsigned char *d_cur = d_imgA;
unsigned char *nxt = h_imgB;
unsigned char *d_nxt = d_imgB;
cudaStream_t *curst = &st1;
cudaStream_t *nxtst = &st2;
done = capture_image(cur); // grabs a frame and puts it in cur
// enter main loop
while (!done){
cudaMemcpyAsync(d_cur, cur, dsize, cudaMemcpyHostToDevice, *curst); // send frame to device
img_proc_kernel<<<nBLK, nTPB, 0, *curst>>>(d_cur); // process frame
cudaMemcpyAsync(cur, d_cur, dsize, cudaMemcpyDeviceToHost, *curst);
// insert a cuda stream callback here to copy the cur frame to output
cudaStreamAddCallback(*curst, &my_callback, (void *)cur, 0);
cudaStreamSynchronize(*nxtst); // prevent overrun
done = capture_image(nxt); // capture nxt image while GPU is processing cur
unsigned char *tmp = cur;
cur = nxt;
nxt = tmp; // ping - pong
tmp = d_cur;
d_cur = d_nxt;
d_nxt = tmp;
cudaStream_t *st_tmp = curst;
curst = nxtst;
nxtst = st_tmp;
}
}
$ nvcc -o t832 t832.cu
$ cuda-memcheck ./t832
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
There are many CUDA sample codes which may also be helpful, such as simpleStreams, asyncAPI, and simpleCallbacks.
Since your question is very broad, I can only think of the following advice:
1) Use CUDA streams
When using more than one CUDA stream, the memory transfer between CPU->GPU, the GPU processing and the memory transfer between GPU->CPU can overlap. This way the image processing of the next image can already begin while the result is transferred back.
You can also decompose each frame. Use n streams per frame and launch the image processing kernels n times with an offset.
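A rough sketch of that decomposition (process_chunk is a hypothetical in-place kernel; h_img is assumed to be pinned host memory and streams an array of already-created streams):
// Split one frame into nStreams chunks so the H2D copy, kernel and D2H copy of
// each chunk can overlap with work on the other chunks.
void process_frame_chunked(unsigned char *h_img, unsigned char *d_img,
                           size_t imgBytes, cudaStream_t *streams, int nStreams)
{
    size_t chunk = imgBytes / nStreams;              // assume it divides evenly
    for (int s = 0; s < nStreams; ++s) {
        size_t off = s * chunk;
        cudaMemcpyAsync(d_img + off, h_img + off, chunk,
                        cudaMemcpyHostToDevice, streams[s]);
        process_chunk<<<64, 256, 0, streams[s]>>>(d_img + off, chunk);
        cudaMemcpyAsync(h_img + off, d_img + off, chunk,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
}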
2) Apply the producer-consumer scheme
The producer thread captures the frames from the camera and stores them in a thread-safe container. The consumer thread(s) fetch(es) a frame from this source container, upload(s) it to the GPU using its/their own CUDA stream(s), launches the kernel and copies the result back to the host.
Each consumer thread would synchronize with its stream(s) before trying to get a new image from the source container.
A simple implementation could look like this:
#include <vector>
#include <thread>
#include <memory>
struct ThreadSafeContainer{ /*...*/ };
struct Producer
{
Producer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
{
}
void run()
{
while(true)
{
// grab image from camera
// store image in container
}
}
std::shared_ptr<ThreadSafeContainer> container;
};
struct Consumer
{
Consumer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
{
cudaStreamCreate(&stream);
}
~Consumer()
{
cudaStreamDestroy(stream);
}
void run()
{
while(true)
{
// read next image from container
// upload to GPU
cudaMemcpyAsync(...,...,...,stream);
// run kernel
kernel<<<..., ..., ..., stream>>>(...);
// copy results back
cudaMemcpyAsync(...,...,...,stream);
// wait for results
cudaStreamSynchronize(stream);
// do something with the results
}
}
std::shared_ptr<ThreadSafeContainer> container;
cudaStream_t stream; // or multiple streams per consumer
};
int main()
{
// create an instance of ThreadSafeContainer which will be shared between Producer and Consumer instances
auto container = std::make_shared<ThreadSafeContainer>();
// create one instance of Producer, pass the shared container as an argument to the constructor
auto p = std::make_shared<Producer>(container);
// create a separate thread which executes Producer::run
std::thread producer_thread(&Producer::run, p);
const int consumer_count = 2;
std::vector<std::thread> consumer_threads;
std::vector<std::shared_ptr<Consumer>> consumers;
// create as many consumers as specified
for (int i=0; i<consumer_count;++i)
{
// create one instance of Consumer, pass the shared container as an argument to the constructor
auto c = std::make_shared<Consumer>(container);
// create a separate thread which executes Consumer::run
consumer_threads.push_back(std::thread(&Consumer::run, c));
}
// wait for the threads to finish, otherwise the program will just exit here and the threads will be killed
// in this example, the program will never exit since the infinite loop in the run() methods never end
producer_thread.join();
for (auto& t : consumer_threads)
{
t.join();
}
return 0;
}

My OpenAL C++ audio streaming buffer glitching

This is my first time coding sound generation with OpenAL in C++.
What I want to do is generate an endless sine wave using double buffering.
The problem is that the sound glitches/lags. I think it happens between buffers, and I don't know why.
My code:
void _OpenALEngine::play()
{
if(!m_running && !m_threadRunning)
{
ALfloat sourcePos[] = {0,0,0};
ALfloat sourceVel[] = {0,0,0};
ALfloat sourceOri[] = {0,0,0,0,0,0};
alGenSources(1, &FSourceID);
alSourcefv (FSourceID, AL_POSITION, sourcePos);
alSourcefv (FSourceID, AL_VELOCITY, sourceVel);
alSourcefv (FSourceID, AL_DIRECTION, sourceOri);
GetALError();
ALuint FBufferID[2];
alGenBuffers( 2, &FBufferID[0] );
GetALError();
// Gain
ALfloat listenerPos[] = {0,0,0};
ALfloat listenerVel[] = {0,0,0};
ALfloat listenerOri[] = {0,0,0,0,0,0};
alListenerf( AL_GAIN, 1.0 );
alListenerfv(AL_POSITION, listenerPos);
alListenerfv(AL_VELOCITY, listenerVel);
alListenerfv(AL_ORIENTATION, listenerOri);
GetALError();
alSourceQueueBuffers( FSourceID, 2, &FBufferID[0] );
GetALError();
alSourcePlay(FSourceID);
GetALError();
m_running = true;
m_threadRunning = true;
Threading::Thread thread(Threading::ThreadStart(this, &_OpenALEngine::threadPlaying));
thread.Start();
}
}
Void _OpenALEngine::threadPlaying()
{
while(m_running)
{
// Check how much data is processed in OpenAL's internal queue.
ALint Processed;
alGetSourcei( FSourceID, AL_BUFFERS_PROCESSED, &Processed );
GetALError();
// Add more buffers while we need them.
while ( Processed-- )
{
alSourceUnqueueBuffers( FSourceID, 1, &BufID );
runBuffer(); // <--- Generate the sinus wave and submit the Array to the submitBuffer method.
alSourceQueueBuffers( FSourceID, 1, &BufID );
ALint val;
alGetSourcei(FSourceID, AL_SOURCE_STATE, &val);
if(val != AL_PLAYING)
{
alSourcePlay(FSourceID);
}
}
// Don't kill the CPU.
Thread::Sleep(1);
}
m_threadRunning = false;
return Void();
}
void _OpenALEngine::submitBuffer(byte* buffer, int length)
{
// Submit more data to OpenAL
alBufferData( BufID, AL_FORMAT_MONO8, buffer, length * sizeof(byte), 44100 );
}
I generate the sine wave in the runBuffer() method. The sine generator is correct, because when I increase the buffer array from 4096 to 40960 the glitching/lagging occurs at longer intervals. Thank you very much if someone knows the problem and will share it :)
Similar problems are all over the internet and I'm not 100% sure this is the solution to this one. But it probably is, and if not it might at least help others. Most other threads are on different forums and I'm not registering everywhere just to share my knowledge...
The code below is what I came up with after 2 days of experimenting. Most solutions I found did not work for me...
(it's not exactly my code, I stripped it of some parts special to my case, so I'm sorry if there are typos or similar that prevent it from being copied verbatim)
My experiments were on an iPhone. Some of the things I found out might be iOS-specific.
The problem is that there is no guarantee at what point a processed buffer is marked as such and becomes available for unqueueing. Trying to build a version that sleeps until a buffer becomes available again, I saw that this might happen much later than expected (I use very small buffers). So I realised that the common idea of waiting until a buffer is available (which works for most frameworks, but not OpenAL) is wrong. Instead you should wait until the time at which you should enqueue another buffer.
With that you have to give up the idea of double-buffering. When the time comes you should check if a buffer exists and unqueue it. But if none is available you need to create a third...
Waiting for the moment a buffer should be enqueued can be done by calculating times relative to the system clock, which worked fairly well for me, but I decided to go for a version where I rely on a time source that is definitively in sync with OpenAL. The best I came up with was waiting depending on what's left in the queue. Here, iOS seems not fully in accordance with the OpenAL spec, because AL_SAMPLE_OFFSET should be exact to one sample, but I never saw anything other than multiples of 2048. That's about 46,000 microseconds at 44100 Hz, which is where the 50000 in the code comes from (a little more than the smallest unit iOS handles).
Depending on the block size this can easily be bigger. But with this code, alSourcePlay() was needed again only 3 times in the last ~hour (compared to up to 10 times per minute with other implementations that claimed to be the solution).
uint64 enqueued(0); // keep track of samples in queue
while (bKeepRunning)
{
// check if enough in buffer and wait
ALint off;
alGetSourcei(m_Source, AL_SAMPLE_OFFSET, &off);
uint32 left((enqueued-off)*1000000/SAMPLE_RATE);
if (left > 50000) // at least 50000 mic-secs in buffer
usleep(left - 50000);
// check for available buffer
ALuint buffer;
ALint processed;
alGetSourcei(m_Source, AL_BUFFERS_PROCESSED, &processed);
switch (processed)
{
case 0: // no buffer to unqueue->create new
alGenBuffers(1, &buffer);
break;
case 1: // one buffer to unqueue -> use that
alSourceUnqueueBuffers(m_Source, 1, &buffer);
enqueued -= BLOCK_SIZE_SAMPLES;
break;
default: // multiple buffers to unqueue -> take one, delete one
{ // could also delete more if processed>2
// but doesn't happen often
// therefore simple implementation(will del. in next loop)
ALuint bufs[2];
alSourceUnqueueBuffers(m_Source, 2, bufs);
alDeleteBuffers(1, bufs);
buffer = bufs[1];
enqueued -= 2*BLOCK_SIZE_SAMPLES;
}
break;
}
// fill block
alBufferData(buffer, AL_FORMAT_STEREO16, pData,
BLOCK_SIZE_SAMPLES*4, SAMPLE_RATE);
alSourceQueueBuffers(m_Source, 1, &buffer);
//check state
ALint state;
alGetSourcei(m_Source, AL_SOURCE_STATE, &state);
if (state != AL_PLAYING)
{
enqueued = BLOCK_SIZE_SAMPLES;
alSourcePlay(m_Source);
}
else
enqueued += BLOCK_SIZE_SAMPLES;
}
I have written OpenAL streaming servers, so I know your pain. My instinct is to confirm you have spawned a separate thread for the I/O logic which supplies your streaming audio data, separate from the thread that holds your OpenAL code above. If not, this will cause your symptoms. Here is a simple launch of each logical chunk into its own thread:
#include <thread>
#include <chrono>
std::thread t1(launch_producer_streaming_io, chosen_file, another_input_parm);
std::this_thread::sleep_for (std::chrono::milliseconds( 100));
std::thread t2(launch_consumer_openal, its_input_parm1, parm2);
// -------------------------
t1.join();
t2.join();
where launch_producer_streaming_io is a function, called with its input parameters, which services the input/output to continuously supply the audio data, and launch_consumer_openal is a function launched in its own thread where you instantiate your OpenAL class.