Error -11 in OpenCL - C++

I'm getting error -11 on this line:
checkerror(clBuildProgram(program, deviceidcount, deviceids.data(), nullptr, nullptr, nullptr));
My kernel is:
__kernel void render(double playerx, double playery, double playerz,
                     double yaw, double pitch,
                     double x1, double y1, double z1,
                     double x2, double y2, double z2,
                     double x3, double y3, double z3,
                     __global int* texture) {
    //const int i = get_global_id(0);
    //x[i] = a*x[i];
    //x[i] = cos(a);
    x1 = x1 - playerx;
    y1 = y1 - playery;
    z1 = z1 - playerz;
    x2 = x2 - playerx;
    y2 = y2 - playery;
    z2 = z2 - playerz;
    x3 = x3 - playerx;
    y3 = y3 - playery;
    z3 = z3 - playerz;
    double smallyaw = yaw - M_PI_2;
    double bigpitch = pitch + M_PI_2;
    double screenx1 = cos(smallyaw)*cos(pitch)*x1 + sin(smallyaw)*cos(pitch)*y1 + sin(pitch)*z1;
    double screeny1 = cos(yaw)*cos(bigpitch)*x1 + sin(yaw)*cos(bigpitch)*y1 + sin(bigpitch)*z1;
    double screenz1 = cos(yaw)*cos(pitch)*x1 + sin(yaw)*cos(pitch)*y1 + sin(pitch)*z1;
    printf(screenx1);
    printf(screeny1);
    printf(screenz1);
}
I can't see anything wrong with it in terms of syntax, and I also tried replacing all the doubles with floats.
This is stupid: after looking at this for the longest time, I commented out the printf lines and it worked. How am I supposed to check what these variables are equal to? Can someone tell me how to properly print things?

printf("value = %g\n", 3.012);
prints "value = 3.012" to the console.
Printing to the console should be done in a thread-safe way, so your CL thread should be the same as the console-flushing thread.

Printing output from many cores can give unexpected results. Timings are off, prints don't come back in any particular order etc. Try to print from only one work item.
if (i == 0) {
    printf(...);
}
You can also put a barrier above that and loop through multiple values from work item 0 if you need to.

I can't see anything wrong with it in terms of syntax.
Well, try looking harder, because that's not how you use printf.
If you want to print doubles, use:
printf("%f", value);
If this doesn't make sense, I'd recommend reading the documentation for both the generic C printf and the OpenCL printf.
Also, since -11 is CL_BUILD_PROGRAM_FAILURE, you can use clGetProgramBuildInfo to retrieve the build log and check where the compilation went wrong.
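As a sketch of that (the helper name `printBuildLog` is illustrative; `program` and the device IDs are assumed to be the ones from the question, and error handling is omitted), the usual two-call pattern is:

```cpp
#include <CL/cl.h>
#include <iostream>
#include <vector>

void printBuildLog(cl_program program, cl_device_id device)
{
    size_t logSize = 0;
    // First call: ask how long the build log is.
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, nullptr, &logSize);
    // Second call: fetch the log itself.
    std::vector<char> log(logSize + 1, '\0');
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          logSize, log.data(), nullptr);
    std::cerr << log.data() << "\n";
}
```

Call it for each device whenever clBuildProgram returns CL_BUILD_PROGRAM_FAILURE; the compiler's log should point straight at the offending printf lines.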

Related

dprintf vs returning in C++

Disclaimer: I know nothing about C++, so bear with me... I am looking at some existing code which prints a continuous stream of strings describing the position of a VR controller.
void CMainApplication::printDevicePositionalData(const char * deviceName, vr::HmdMatrix34_t posMatrix, vr::HmdVector3_t position, vr::HmdQuaternion_t quaternion)
{
    LARGE_INTEGER qpc; // Query Performance Counter for acquiring high-resolution time stamps.
    // From MSDN: "QPC is typically the best method to use to time-stamp events and
    // measure small time intervals that occur on the same system or virtual machine."
    QueryPerformanceCounter(&qpc);

    // Print position and quaternion (rotation).
    dprintf("\n%lld, %s, x = %.5f, y = %.5f, z = %.5f, qw = %.5f, qx = %.5f, qy = %.5f, qz = %.5f",
            qpc.QuadPart, deviceName,
            position.v[0], position.v[1], position.v[2],
            quaternion.w, quaternion.x, quaternion.y, quaternion.z);
}
When I run the compiled exe in PowerShell it does not seem to print anything. Only if I run .\this_program.exe | tee output.txt do I see anything, as it simultaneously writes to a .txt file.
How can I change the above code to return these values, as I want to be able to read them in real time with Python using subprocess and stdout. Thanks!
If you want to print to the console output, you should not be using:
dprintf - "This function prints a formatted string to the command window for the debugger."
With C++, IO streams should be used (std::cout, std::clog, or std::cerr).
Or fall back to printf.

Looking for a way to speed up a function

I'm trying to speed up a big block of code across many files and found out that one function uses about 70% of the total time. This is because this function is called 477+ million times.
The pointer array par can only be one of two presets, either
par[0] = 0.057;
par[1] = 2.87;
par[2] = -3.;
par[3] = -0.03;
par[4] = -3.05;
par[5] = -3.5;
OR
par[0] = 0.043;
par[1] = 2.92;
par[2] = -3.21;
par[3]= -0.065;
par[4] = -3.00;
par[5] = -2.65;
So I've tried plugging in numbers depending on which preset it is, but have failed to find any significant time savings.
The pow and exp functions seem to be called nearly every time, and they take up about 40 and 20 percent of the total time respectively, so only 10% of the total time is used by the parts of this function that aren't pow or exp. Finding ways to speed those up would probably be best, but none of the exponents used in pow are integers except -4, and I don't know if 1/(x*x*x*x) is faster than pow(x, -4).
double Signal::Param_RE_Tterm_approx(double Tterm, double *par) {
    double value = 0.;
    // time after Che angle peak
    if (Tterm > 0.) {
        if ( fabs(Tterm / *par) >= 1.e-2 ) {
            value += -1./(*par)*exp(-1.*Tterm/(*par));
        }
        else {
            value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
        }
        if ( fabs(Tterm * *(par+1)) >= 1.e-2 ) {
            value += *(par+2)* *(par+1)*pow( 1.+*(par+1)*Tterm, *(par+2)-1. );
        }
        else {
            value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
        }
    }
    // time before Che angle peak
    else {
        if ( fabs(Tterm / *(par+3)) >= 1.e-2 ) {
            value += -1./ *(par+3) * exp(-1.*Tterm/ *(par+3));
        }
        else {
            value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
        }
        if ( fabs(Tterm * *(par+4)) >= 1.e-2 ) {
            value += *(par+5)* *(par+4) * pow( 1.+ *(par+4)*Tterm, *(par+5)-1. );
        }
        else {
            value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
        }
    }
    return value * 1.e9;
}
I first rewrote it to be a bit easier to follow:
#include <math.h>

double Param_RE_Tterm_approx(double Tterm, double const* par) {
    double value = 0.;
    if (Tterm > 0.) {
        // time after Che angle peak
        if ( fabs(Tterm / par[0]) >= 1.e-2 ) {
            value += -1./par[0]*exp(-1.*Tterm/par[0]);
        } else {
            value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
        }
        if ( fabs(Tterm * par[1]) >= 1.e-2 ) {
            value += par[2]*par[1]*pow( 1.+par[1]*Tterm, par[2]-1. );
        } else {
            value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
        }
    } else {
        // time before Che angle peak
        if ( fabs(Tterm / par[3]) >= 1.e-2 ) {
            value += -1./par[3]*exp(-1.*Tterm/par[3]);
        } else {
            value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
        }
        if ( fabs(Tterm * par[4]) >= 1.e-2 ) {
            value += par[5]*par[4]*pow( 1.+par[4]*Tterm, par[5]-1. );
        } else {
            value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
        }
    }
    return value * 1.e9;
}
We can then look at its structure.
There are two main branches -- Tterm negative (before) and positive (after). These correspond to using 0,1,2 or 3,4,5 in the par array.
Then in each branch we do two things to add to value: for small inputs we use a polynomial, and for large inputs we use an exponential/power expression.
As a guess, this is because the polynomial is a decent approximation for the exponential for small values -- the error is acceptable. What you should do is confirm that guess -- take a look at the Taylor series expansion of the "big" power/exponent based equation, and see if it agrees with the polynomials somehow. Or check numerically.
If it is the case, this means that this equation has a known amount of error that is acceptable. Quite often there are faster versions of exp or pow that have a known amount of max error; consider using those.
If this isn't the case, there still could be an acceptable amount of error, but the Taylor series approximation can give you "in code" information about what is an acceptable amount of error.
A next step I'd take is to tear the 8 pieces of this equation apart. There is positive/negative, the first and second value+= in each branch, and then the polynomial/exponential case.
I'm guessing the fact that exp is taking ~1/3 the time of pow is because you have 3 calls to pow to 1 call to exp in your function, but you might find out something interesting like "all of our time is actually in the Tterm > 0. case" or what have you.
Now examine call sites. Is there a pattern in the Tterm you are passing this function? Ie, do you tend to pass Tterms in roughly sorted order? If so, you can do the test for which function to call outside of calling this function, and do it in batches.
Simply doing it in batches and compiling with optimization and inlining the bodies of the functions might make a surprising amount of difference; compilers are getting better at vectorizing work.
If that doesn't work, you can start threading things off. On a modern computer you can have 4-60 threads solving this problem independently, and this problem looks like you'd get nearly linear speedup. A basic threading library, like TBB, would be good for this kind of task.
For the next step up, if you are getting large batches of data and you need to do a lot of processing, you can stuff it onto a GPU and solve it there. Sadly, GPU<->RAM bandwidth is limited, so simply doing the math in this function on the GPU and reading/writing back and forth with RAM won't give you much if any performance. But if more work than just this can go on the GPU, it might be worth it.
Only 10% of the total time is used by the parts of this function that aren't pow or exp.
If your function's performance bottleneck is exp() and pow() execution, consider using vector instructions in your calculations. All modern processors support at least the SSE2 instruction set, so this approach will give at least a ~2x speed-up, because your calculation can be easily vectorized.
I recommend this C++ vectorization library, which contains all standard mathematical functions (such as exp and pow) and allows you to write code in OOP style without using assembly language. I have used it several times and it should work perfectly for your problem.
If you have a GPU, you should also consider trying the CUDA framework because, again, your problem can be perfectly vectorized. Moreover, if this function is called 477+ million times, the GPU will practically eliminate your problem...
(Partial optimization:)
The longest expression has common subexpressions, and its polynomial is evaluated the costly way.
Pre-define these (perhaps add them to par[]):
a = par[5]*par[4];
b = (par[5]-1.);
c = b*(par[5]-2.)/2.;
d = c*(par[5]-3.)/3.;
Then, for example, the longest expression becomes:
e = par[4]*Tterm;
value += a*(((d*e + c)*e + b)*e + 1.);
And simplify the rest.
If the expressions are curve-fitting approximations, why not do the same with:
value += -1./(*par)*exp(-1.*Tterm/(*par));
You should also ask whether all 477M iterations are needed.
If you want to explore batching / more optimization opportunities for fusing in computations that depend on these values, try using Halide
I've rewritten your program in Halide here:
#include <Halide.h>

using namespace Halide;

class ParamReTtermApproxOpt : public Generator<ParamReTtermApproxOpt>
{
public:
    Input<Buffer<float>> tterm{"tterm", 1};
    Input<Buffer<float>> par{"par", 1};
    Input<int> ncpu{"ncpu"};
    Output<Buffer<float>> output{"output", 1};

    Var x;
    Func par_inv;

    void generate() {
        // precompute 1 / par[x]
        par_inv(x) = fast_inverse(par(x));

        // after che peak
        Expr after_che_peak = tterm(x) > 0;
        Expr first_term = -par_inv(0) * fast_exp(-tterm(x) * par_inv(0));
        Expr second_term = par(2) * par(1) * fast_pow(1 + par(1) * tterm(x), par(2) - 1);

        // before che peak
        Expr third_term = -par_inv(3) * fast_exp(-tterm(x) * par_inv(3));
        Expr fourth_term = par(5) * par(4) * fast_pow(1 + par(4) * tterm(x), par(5) - 1);

        // final value
        output(x) = 1.e9f * select(after_che_peak, first_term + second_term,
                                   third_term + fourth_term);
    }

    void schedule() {
        par_inv.bound(x, 0, 6);
        par_inv.compute_root();

        Var xo, xi;
        // break x into two loops, one for ncpu tasks
        output.split(x, xo, xi, output.extent() / ncpu)
              // mark the task loop parallel
              .parallel(xo)
              // vectorize each thread's computation for 8-wide vector lanes
              .vectorize(xi, 8);

        output.print_loop_nest();
    }
};

HALIDE_REGISTER_GENERATOR(ParamReTtermApproxOpt, param_re_tterm_approx_opt)
I can run 477,000,000 iterations in slightly over one second on my Surface Book (with ncpu=4). Batching is hugely important here since it enables vectorization.
Note that the equivalent program written using double arithmetic is much slower (20x) than float arithmetic, though Halide doesn't supply fast_ versions for doubles, so this might not be quite apples-to-apples. Regardless, I would check whether you need the extra precision.

Compilers processing maths differently?

I made a program to find the derivative of a function at a given point. The code reads:
#include "stdafx.h"
#include <iostream>
using namespace std;

double function(double x) {
    return (3 * x * x);
}

int main() {
    double x, y, dy, dx;
    cin >> x;
    y = function(x);
    dx = 0.00000001;
    dy = function(x + dx) - y;
    cout << "Derivative of function at x = " << x << " is " << (double)dy / dx;
    cin >> x;
}
Now my college uses Turbo C++ as its IDE and compiler, while at home I have Visual Studio (because TC++ looks very bad on a 900p screen, but jokes apart). When I tried a similar program on the college PCs, the result was quite messed up and much less accurate than what I am getting at home. For example:
x = 3
#College result = 18.something
#Home result = 18 (precise without a decimal point)
x = 1
#College result = 6.000.....something
#Home result = 6 (precise without a decimal point)
The very big question:
Why are different compilers giving different results?
I'm 90% sure the result is the same in both cases, and the only reason you see a difference is different output formatting.
For 64-bit IEEE double math, the precise results of those computations are probably 17.9999997129698385833762586116790771484375 and 6.0000000079440951594733633100986480712890625, respectively.
If you want to verify that hypothesis, you can print your double values this way:
#include <cinttypes>
#include <cstdint>
#include <cstdio>

void printDoubleAsHex( double val )
{
    const uint64_t* p = (const uint64_t*)( &val );
    printf( "%" PRIx64 "\n", *p );
}
And verify you get the same output from both compilers.
However, there's also a 10% chance that your two compilers indeed compiled your code in a way that produces different results. That ain't uncommon; it can even happen with the same compiler but different settings/flags/options.
The most likely reason is different instruction sets. By default, many modern compilers generate SSE instructions for code like yours, while older ones produce legacy x87 code (x87 operates on 80-bit floating-point values on the stack, SSE on 32- or 64-bit FP values in vector registers, hence the difference in precision). Another reason is different rounding modes. Another one is compiler-specific optimizations, such as /fp in Visual C++.

Bad optimization of std::fabs()?

Recently I was working with an application that had code similar to:
for (auto x = 0; x < width - 1 - left; ++x)
{
    // store / reset points
    temp = hPoint = 0;
    for (int channel = 0; channel < audioData.size(); channel++)
    {
        if (peakmode) /* fir rms of window size */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp += audioData[channel][x * sizeFactor + z + offset];
            }
            hPoint += temp / sizeFactor;
        }
        else /* highest sample in window */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp = audioData[channel][x * sizeFactor + z + offset];
                if (std::fabs(temp) > std::fabs(hPoint))
                    hPoint = temp;
            }
        }
        // .. some other code
    }
    // ... some more code
}
This is inside a graphical render loop, called some 50-100 times / sec with buffers up to 192kHz in multiple channels. So it's a lot of data running through the innermost loops, and profiling showed this was a hotspot.
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:
if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
hPoint = temp;
This gave a 12x reduction in render time while still producing the same, valid output. Note that everything in the audio data is sanitized beforehand to exclude NaNs/infs/denormals, and values only have a range of [-1, 1].
Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with handling of non-normal values?
edit: the layout of the floating-point model conforms to IEEE, and sizeof(float) == sizeof(int) == 4
Well, you set the floating-point mode to IEEE conforming. Typically, with switches like --fast-math the compiler can ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code.
BTW, if you're going to assume IEEE format, there's no need for the cast back to float prior to the comparison. The IEEE format is nifty: for all positive finite values, a < b if and only if reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b).
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.
No, you can't, because this violates the strict aliasing rule.
Are there any corner cases where this optimization will give wrong results
Technically, this code results in undefined behavior, so it always gives wrong "results". Not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.
or, why is the standard library function not implemented like this?
Your suspicion is justified: handling denormals and other exceptional values is tricky, and the stdlib function also needs to take those into account; the other reason is still the undefined behavior.
One (non-)solution if you care about performance:
Instead of casts and pointers, you can use a union. Unfortunately, that only works in C, not in C++. It won't result in UB, but it's still not portable (although it will likely work on most, if not all, platforms with IEEE-754).
union {
    float f;
    unsigned u;
} pun = { .f = -3.14 };

pun.u &= ~0x80000000;
printf("abs(-pi) = %f\n", pun.f);
But, granted, this may or may not be faster than calling fabs(). Only one thing is sure: it won't be always correct.
You would expect fabs() to be implemented in hardware; there was an 8087 instruction for it in 1980, after all. You're not going to beat the hardware.
How the standard library implements it is... implementation-dependent, so you may find different implementations of the standard library with different performance.
I imagine that you could have problems on platforms where int is not 32 bits. You'd better use int32_t (<cstdint>).
Out of curiosity, was std::fabs previously inlined? Or is the optimisation you observed mainly due to suppression of the function call?
Some observations on how refactoring may improve performance:
as mentioned, x * sizeFactor + offset can be factored out of the inner loops
peakmode is actually a switch changing the function's behaviour - make two functions rather than test the switch mid-loop. This has 2 benefits:
easier to maintain
fewer local variables and code paths to get in the way of optimisation.
The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.
fabs(hPoint) can be pre-computed whenever hPoint is updated
if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.
finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability so it's a last resort.
In VS2008, using the following to track the absolute value of hPoint, with hIsNeg remembering whether it is positive or negative, is about twice as fast as using fabs():
int hIsNeg = 0;
...
// Inside the loop, replacing
//     if (std::fabs(temp) > std::fabs(hPoint))
//         hPoint = temp;
if (temp < 0)
{
    if (-temp > hpoint)
    {
        hpoint = -temp;
        hIsNeg = 1;
    }
}
else
{
    if (temp > hpoint)
    {
        hpoint = temp;
        hIsNeg = 0;
    }
}
...
// After the loop
if (hIsNeg)
    hpoint = -hpoint;

Granular Synthesis in iOS 6 using AudioFileServices

I have a question regarding a sound synthesis app that I'm working on. I am trying to read in an audio file, create randomized 'grains' using granular synthesis techniques, place them into an output buffer and then be able to play that back to the user using OpenAL. For testing purposes, I am simply writing the output buffer to a file that I can then listen back to.
Judging by my results, I am on the right track but am getting some aliasing issues and playback sounds that just don't seem quite right. There is usually a rather loud pop in the middle of the output file and volume levels are VERY loud at times.
Here are the steps that I have taken to get the results I need, but I'm a little bit confused about a couple of things, namely formats that I am specifying for my AudioStreamBasicDescription.
Read in an audio file from my mainBundle, which is a mono file in .aiff format:
ExtAudioFileRef extAudioFile;
CheckError(ExtAudioFileOpenURL(loopFileURL,
                               &extAudioFile),
           "couldn't open extaudiofile for reading");

memset(&player->dataFormat, 0, sizeof(player->dataFormat));
player->dataFormat.mFormatID = kAudioFormatLinearPCM;
player->dataFormat.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked;
player->dataFormat.mSampleRate = S_RATE;
player->dataFormat.mChannelsPerFrame = 1;
player->dataFormat.mFramesPerPacket = 1;
player->dataFormat.mBitsPerChannel = 16;
player->dataFormat.mBytesPerFrame = 2;
player->dataFormat.mBytesPerPacket = 2;

// tell extaudiofile about our format
CheckError(ExtAudioFileSetProperty(extAudioFile,
                                   kExtAudioFileProperty_ClientDataFormat,
                                   sizeof(AudioStreamBasicDescription),
                                   &player->dataFormat),
           "couldnt set client format on extaudiofile");

SInt64 fileLengthFrames;
UInt32 propSize = sizeof(fileLengthFrames);
ExtAudioFileGetProperty(extAudioFile,
                        kExtAudioFileProperty_FileLengthFrames,
                        &propSize,
                        &fileLengthFrames);
player->bufferSizeBytes = fileLengthFrames * player->dataFormat.mBytesPerFrame;
Next I declare my AudioBufferList and set some more properties
AudioBufferList *buffers;
UInt32 ablSize = offsetof(AudioBufferList, mBuffers[0]) + (sizeof(AudioBuffer) * 1);
buffers = (AudioBufferList *)malloc(ablSize);
player->sampleBuffer = (SInt16 *)malloc(sizeof(SInt16) * player->bufferSizeBytes);
buffers->mNumberBuffers = 1;
buffers->mBuffers[0].mNumberChannels = 1;
buffers->mBuffers[0].mDataByteSize = player->bufferSizeBytes;
buffers->mBuffers[0].mData = player->sampleBuffer;
My understanding is that .mData will be whatever type was specified in the format flags (in this case, SInt16). Since it is of type (void *), I want to convert this to float data, which is the obvious choice for audio manipulation. Before, I set up a for loop which just iterated through the buffer and cast each sample to a float. This seemed unnecessary, so now I pass my .mData buffer to a function I created which then granularizes the audio:
float *theOutBuffer = [self granularizeWithData:(float *)buffers->mBuffers[0].mData with:framesRead];
In this function, I dynamically allocate some buffers, create random size grains, place them in my out buffer after windowing them using a hamming window and return that buffer (which is float data). Everything is cool up to this point.
Next I set up all my output file ASBD and such:
AudioStreamBasicDescription outputFileFormat;
bzero(&outputFileFormat, sizeof(AudioStreamBasicDescription));
outputFileFormat.mFormatID = kAudioFormatLinearPCM;
outputFileFormat.mSampleRate = 44100.0;
outputFileFormat.mChannelsPerFrame = numChannels;
outputFileFormat.mBytesPerPacket = 2 * numChannels;
outputFileFormat.mFramesPerPacket = 1;
outputFileFormat.mBytesPerFrame = 2 * numChannels;
outputFileFormat.mBitsPerChannel = 16;
outputFileFormat.mFormatFlags = kAudioFormatFlagIsFloat | kAudioFormatFlagIsPacked;

UInt32 flags = kAudioFileFlags_EraseFile;
ExtAudioFileRef outputAudioFileRef = NULL;
NSString *tmpDir = NSTemporaryDirectory();
NSString *outFilename = @"Decomp.caf";
NSString *outPath = [tmpDir stringByAppendingPathComponent:outFilename];
NSURL *outURL = [NSURL fileURLWithPath:outPath];

AudioBufferList *outBuff;
UInt32 abSize = offsetof(AudioBufferList, mBuffers[0]) + (sizeof(AudioBuffer) * 1);
outBuff = (AudioBufferList *)malloc(abSize);
outBuff->mNumberBuffers = 1;
outBuff->mBuffers[0].mNumberChannels = 1;
outBuff->mBuffers[0].mDataByteSize = abSize;
outBuff->mBuffers[0].mData = theOutBuffer;

CheckError(ExtAudioFileCreateWithURL((__bridge CFURLRef)outURL,
                                     kAudioFileCAFType,
                                     &outputFileFormat,
                                     NULL,
                                     flags,
                                     &outputAudioFileRef),
           "ErrorCreatingURL_For_EXTAUDIOFILE");
CheckError(ExtAudioFileSetProperty(outputAudioFileRef,
                                   kExtAudioFileProperty_ClientDataFormat,
                                   sizeof(outputFileFormat),
                                   &outputFileFormat),
           "ErrorSettingProperty_For_EXTAUDIOFILE");
CheckError(ExtAudioFileWrite(outputAudioFileRef,
                             framesRead,
                             outBuff),
           "ErrorWritingFile");
The file is written correctly, in CAF format. My question is this: am I handling the .mData buffer correctly in that I am casting the samples to float data, manipulating (granulating) various window sizes and then writing it to a file using ExtAudioFileWrite (in CAF format)? Is there a more elegant way to do this, such as declaring my ASBD formatFlags as kAudioFormatFlagIsFloat? My output CAF file has some clicks in it, and when I open it in Logic, it looks like there is a lot of aliasing. This makes sense if I am trying to send it float data but there is some kind of conversion happening which I am unaware of.
Thanks in advance for any advice on the matter! I have been an avid reader of pretty much all the source material online, including the Core Audio Book, various blogs, tutorials, etc. The ultimate goal of my app is to play the granularized audio in real time to a user with headphones so the writing to file thing is just being used for testing at the moment. Thanks!
What you say about step 3 suggests to me that you are interpreting an array of shorts as an array of floats. If that is so, we've found the reason for your trouble. Can you assign the short values one by one into an array of floats? That should fix it.
It looks like mData is a void * pointing to an array of shorts. Casting this pointer to a float * doesn't change the underlying data into float but your audio processing function will treat them as if they were. However, float and short values are stored in totally different ways, so the math you do in that function will operate on very different values that have nothing to do with your true input signal. To investigate this experimentally, try the following:
short data[4] = {-27158, 16825, 23024, 15};
void *pData = data;
The void pointer doesn't indicate what kind of data it points to, so one can falsely assume it points to float values. Note that a short is 2 bytes wide, but a float is 4 bytes wide. It is a coincidence that your code did not crash with an access violation. Interpreted as float, the array above is only long enough for two values. Let's just look at the first value:
float *pfData = (float *)pData;
printf("%d == %f\n", data[0], pfData[0]);
The output of this will be -27158 == 23.198200 illustrating how instead of the expected -27158.0f you obtain roughly 23.2f. Two problematic things happened. First, sizeof(float) is not sizeof(short). Second, the "ones and zeros" of a floating point number are stored very differently than an integer. See http://en.wikipedia.org/wiki/Single_precision_floating-point_format.
How to solve the problem? There are at least two simple solutions. First, you could convert each element of the array before you feed it into your audio processor:
int k;
float *pfBuf = (float *)malloc(n_data * sizeof(float));
short *psiBuf = (short *)buffers->mBuffers[0].mData;
for (k = 0; k < n_data; k++)
{
    pfBuf[k] = psiBuf[k];
}
[self granularizeWithData:pfBuf with:framesRead];
for (k = 0; k < n_data; k++)
{
    psiBuf[k] = pfBuf[k];
}
free(pfBuf);
You see that most likely you will have to convert everything back to short after your call to granularizeWithData: with:. So the second solution would be to do all processing in short although from what you write, I imagine you would not like that latter approach.