I wrote a small sound-playing library with PortAudio on Linux. It's for a small game, so there are lots of little sounds when various things happen. I open a stream for each WAV file to play by calling Pa_OpenStream(). On Linux this call takes around 10 ms on average. On Windows, however, it typically takes 40 to 70 ms. Worse, the first call takes 1.3 seconds, and after that it occasionally takes 1.3 seconds again. I haven't been able to find anything consistent about when it hangs, except that it always happens on the first call. The Windows build actually runs fine on Wine.
I assume this has to do with differences in the underlying sound API in use in different systems. But oddly enough I haven't found any information anywhere, despite extensive searching.
Here's my play function:
int play(const char * sN)
{
    float threshold = .01f;
    char * soundName = (char*)sN;
    float g = glfwGetTime();
    updatePlayer();
    float g2 = glfwGetTime();
    if (g2-g > threshold) printf("updatePlayer: %f/", g2 - g);
    if (!paused && (int)streams.size() < maxStreams && !mute)
    {
        streamStr * ss = new streamStr;
        g = glfwGetTime();
        if (g-g2 > threshold) printf("new stream: %f/", g - g2);
        PaError err;
        sfData * sdata = getData(soundName);
        ss->sfd = sdata;
        g2 = glfwGetTime();
        if (g2-g > threshold) printf("getData: %f/", g2 - g);
        err = Pa_OpenStream(&(ss->stream), 0, &sdata->outputParameters,
                            sdata->sfInfo.samplerate, paFramesPerBufferUnspecified,
                            paNoFlag, PaCallback, ss);
        if (err)
        {
            printf("PortAudio error opening output: %s\n", Pa_GetErrorText(err));
            delete ss;
            return 1;
        }
        g = glfwGetTime();
        if (g-g2 > threshold)
            printf("Pa_OpenStream: %f/", g - g2);
        Pa_StartStream(ss->stream);
        g2 = glfwGetTime();
        if (g2-g > threshold) printf("Pa_StartStream: %f/", g2 - g);
        addStreams(ss);
        g = glfwGetTime();
        if (g-g2 > threshold) printf("addStreams: %f", g - g2);
        //Pa_SetStreamFinishedCallback(ss, finishedCallback);
        printf("\n");
    }
    return 0;
}
I don't know why it's taking that long (because I don't know Windows), but I can say you are going about this the wrong way. Specifically, you shouldn't rely on any timing guarantees when opening a new stream. For example, I would expect similar issues (albeit to a much lesser degree) on OS X.
The correct implementation would be to always have a stream open, playing silence. Then, when you need to play a sound, you can play it right away. For best latency, you should pre-load the first few buffers from the file so you don't need to access the disk when playback starts. I don't know what the exact overhead is on Windows for opening a stream (I'm sure it depends on the API), but on some versions of OS X, it's huge (the entire kernel switches into preemptive mode if no audio was running before).
That said, 1.3 seconds is insane. I recommend asking on the mailing list. Be sure to say which host API you are using, because you didn't say that here and, for Windows, it matters. Also say which version of Windows.
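As a rough illustration of the "one stream, always open" idea, here is a minimal sketch in plain C++ with no PortAudio calls. The fill() function stands in for the body of the callback of the single long-lived stream; Voice, Mixer, play() and fill() are hypothetical names made up for this example, and the sounds are assumed to be mono float samples already loaded into memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; none of this is PortAudio API.
struct Voice {
    const std::vector<float>* samples; // preloaded mono sound data (must outlive the voice)
    std::size_t pos;                   // index of the next sample to play
};

struct Mixer {
    std::vector<Voice> voices;

    // Starting a sound only registers it; no stream is opened here.
    void play(const std::vector<float>& sound) {
        voices.push_back(Voice{&sound, 0});
    }

    // Intended to be called from the callback of the one long-lived stream.
    // Writes silence when nothing is playing, otherwise the clamped sum.
    void fill(float* out, std::size_t frames) {
        std::fill(out, out + frames, 0.0f);
        for (Voice& v : voices) {
            std::size_t n = std::min(frames, v.samples->size() - v.pos);
            for (std::size_t i = 0; i < n; ++i)
                out[i] += (*v.samples)[v.pos + i];
            v.pos += n;
        }
        // Clamp the mix to the valid float sample range.
        for (std::size_t i = 0; i < frames; ++i)
            out[i] = std::max(-1.0f, std::min(1.0f, out[i]));
        // Drop voices that have played to the end.
        voices.erase(std::remove_if(voices.begin(), voices.end(),
                         [](const Voice& v) { return v.pos >= v.samples->size(); }),
                     voices.end());
    }
};
```

In a real program the callback runs on a separate audio thread, so play() and fill() would need some form of synchronization; that is left out of the sketch.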
To minimise startup latency for this use-case (i.e. expecting StartStream() to give minimum startup latency) you should use the paPrimeOutputBuffersUsingStreamCallback stream flag. Otherwise the initial buffers will be zero and the time it takes for the sound to hit the DACs will include playing out the buffer length of zeros (which would be around 80ms on Windows WMME or DirectSound with the default PA settings).
Disclaimer that I know nothing about C++ so bear with me... I am looking at some existing code which prints a continuous stream of strings describing the position of a VR controller.
void CMainApplication::printDevicePositionalData(const char * deviceName, vr::HmdMatrix34_t posMatrix, vr::HmdVector3_t position, vr::HmdQuaternion_t quaternion)
{
    LARGE_INTEGER qpc; // Query Performance Counter for acquiring high-resolution time stamps.
                       // From MSDN: "QPC is typically the best method to use to time-stamp events and
                       // measure small time intervals that occur on the same system or virtual machine."
    QueryPerformanceCounter(&qpc);

    // Print position and quaternion (rotation).
    dprintf("\n%lld, %s, x = %.5f, y = %.5f, z = %.5f, qw = %.5f, qx = %.5f, qy = %.5f, qz = %.5f",
        qpc.QuadPart, deviceName,
        position.v[0], position.v[1], position.v[2],
        quaternion.w, quaternion.x, quaternion.y, quaternion.z);
}
When I run the compiled exe in PowerShell it does not seem to print anything. Only if I run .\this_program.exe | tee output.txt do I see anything, as it simultaneously writes to a .txt file.
How can I change the above code to output these values? I want to be able to read them in real time from Python using subprocess and stdout. Thanks
If you want to print to the console output, you should not be using dprintf. That function prints a formatted string to the command window for the debugger.
In C++, the IO streams should be used (std::cout, std::clog, or std::cerr), or you can fall back to printf.
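A minimal sketch of what that could look like, using printf on stdout. formatPose() and printPose() are hypothetical helpers, not part of the original program, and only the position fields are shown. The explicit fflush() is there because C stdout is normally fully buffered when it is connected to a pipe, so without it a reader such as a Python subprocess may see nothing until the buffer fills.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats one pose sample as a single text line.
std::string formatPose(long long timestamp, const char* deviceName,
                       float x, float y, float z) {
    char buf[256];
    std::snprintf(buf, sizeof buf, "%lld, %s, x = %.5f, y = %.5f, z = %.5f",
                  timestamp, deviceName, x, y, z);
    return buf;
}

// Writes the line to stdout and flushes, so a reader on a pipe sees it at once.
void printPose(long long timestamp, const char* deviceName,
               float x, float y, float z) {
    std::printf("%s\n", formatPose(timestamp, deviceName, x, y, z).c_str());
    std::fflush(stdout);
}
```

Writing one sample per line, terminated by a newline, keeps the reading side simple: it can consume the output line by line.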
Is there a way to signal (success/failure) to the host at the end of kernel execution?
I am looking at an iterative process where calculations are made in device and after each iteration, a boolean variable is passed to host that tells if the process has converged. Based on the variable, host decides to either stop iterating or go through another round of iteration.
Copying a single boolean variable at the end of every iteration nullifies the time gain obtained through parallelization. Hence, I would like to find a way to let the host know of the convergence status (success/failure) without having to call cudaMemcpy every time.
Note: The time issue persists even when using pinned memory to transfer data.
Alternatives that I have looked at:
1) asm("trap;"); and assert();
These trigger, respectively, an unknown error and cudaErrorAssert on the host. Unfortunately, they are "sticky" in that the error cannot be reset using cudaGetLastError(). The only way is to reset the device using cudaDeviceReset().
2) Using cudaHostAllocMapped to avoid cudaMemcpy. This is of no use, as it does not offer any timing advantage over standard pinned memory allocation plus cudaMemcpy (p. 460, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann, 2014).
Will appreciate other ways to overcome this issue.
I suspect the real issue here is that your iteration kernel run time is very short (on the order of 100us or less), meaning the work per iteration is very small. The best solution might be to try to increase the work per iteration (refactor your code/algorithm, tackle a larger problem, etc.)
However, here are some possibilities:
1) Use mapped/pinned memory. Your claim in item 2 of your question is unsupported, IMO, without a lot more context than a page reference to a book that many of us probably don't have available to look at.
2) Use dynamic parallelism. Move your kernel launch process to a CUDA parent kernel that is issuing child kernels. Whatever boolean is set by the child kernel will be immediately discoverable in the parent kernel, without any need for a cudaMemcpy operation or mapped/pinned memory.
3) Use a pipelined algorithm, and overlap a speculative kernel launch with the device->host copy of the boolean, for each pipeline stage.
I consider the first two items above fairly obvious, so I'll provide a worked example for item 3. The basic idea is that we will ping-pong between two streams, launching the kernel alternately into one stream then the other. We will have a 3rd stream so that we can overlap the device->host copy operations with the execution of the next launch. Due to the overlap of D->H copy with kernel execution, there is effectively no "cost" for the copy operation, it is hidden by kernel execution work.
Here's a fully worked example, plus a nvvp timeline:
$ cat t267.cu
#include <stdio.h>

const int stop_count = 5;
const long long tdelay = 1000000LL;

__global__ void test_kernel(int *icounter, bool *istop, int *ocounter, bool *ostop){

  if (*istop) return;
  long long start = clock64();
  while (clock64() < tdelay+start);
  int my_count = *icounter;
  my_count++;
  if (my_count >= stop_count) *ostop = true;
  *ocounter = my_count;
}

int main(){
  volatile bool *v_stop;
  volatile int *v_counter;
  bool *h_stop, *d_stop1, *d_stop2, *d_s1, *d_s2, *d_ss;
  int *h_counter, *d_counter1, *d_counter2, *d_c1, *d_c2, *d_cs;
  cudaStream_t s1, s2, s3, *sp1, *sp2, *sps;
  cudaEvent_t e1, e2, *ep1, *ep2, *eps;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaStreamCreate(&s3);
  cudaEventCreate(&e1);
  cudaEventCreate(&e2);
  cudaMalloc(&d_counter1, sizeof(int));
  cudaMalloc(&d_stop1, sizeof(bool));
  cudaMalloc(&d_counter2, sizeof(int));
  cudaMalloc(&d_stop2, sizeof(bool));
  cudaHostAlloc(&h_stop, sizeof(bool), cudaHostAllocDefault);
  cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocDefault);
  v_stop = h_stop;
  v_counter = h_counter;
  int n_counter = 1;
  h_stop[0] = false;
  h_counter[0] = 0;
  cudaMemcpy(d_stop1, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
  cudaMemcpy(d_stop2, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
  cudaMemcpy(d_counter1, h_counter, sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_counter2, h_counter, sizeof(int), cudaMemcpyHostToDevice);
  sp1 = &s1;
  sp2 = &s2;
  ep1 = &e1;
  ep2 = &e2;
  d_c1 = d_counter1;
  d_c2 = d_counter2;
  d_s1 = d_stop1;
  d_s2 = d_stop2;
  test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
  cudaEventRecord(*ep1, *sp1);
  cudaStreamWaitEvent(s3, *ep1, 0);
  cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
  cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
  while (v_stop[0] == false){
    cudaStreamWaitEvent(*sp2, *ep1, 0);
    sps = sp1; // ping-pong
    sp1 = sp2;
    sp2 = sps;
    eps = ep1;
    ep1 = ep2;
    ep2 = eps;
    d_cs = d_c1;
    d_c1 = d_c2;
    d_c2 = d_cs;
    d_ss = d_s1;
    d_s1 = d_s2;
    d_s2 = d_ss;
    test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
    cudaEventRecord(*ep1, *sp1);
    while (n_counter > v_counter[0]);
    n_counter++;
    if(v_stop[0] == false){
      cudaStreamWaitEvent(s3, *ep1, 0);
      cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
      cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
    }
  }
  cudaDeviceSynchronize(); // optional
  printf("terminated at counter = %d\n", v_counter[0]);
}
$ nvcc -arch=sm_52 -o t267 t267.cu
$ ./t267
terminated at counter = 5
$
In the above diagram, we see that 5 kernel launches are evident (actually 6) and they are bouncing back and forth between two streams. (The 6th kernel launch, which we would expect from the code organization and pipelining, is a very short line at the end of stream15 above. This kernel launches but immediately sees that stop is true, so it exits.) The device -> host copies are in a 3rd stream. If we zoom in closely at the handoff from one kernel iteration to the next:
we see that even these very short D->H memcpy operations are essentially overlapped with the next kernel execution. For reference, the gap between kernel executions above is about 5us.
Note that this was entirely done on Linux. If you attempt this on Windows WDDM, it may be difficult to achieve anything similar, due to WDDM command batching. Windows TCC should approximately duplicate Linux behavior, however.
The basic problem was as follows:
1) When I run the kernel below with N threads and don't include the 4 lines that instantiate and populate the ScaledLLA variable, everything works fine.
2) When I run the kernel below with N threads and do include the 4 lines that instantiate and populate the ScaledLLA variable, the GPU locks up and Windows throws a "display driver not responding" error.
3) If I reduce the number of threads running by reducing the grid size, everything works fine.
I'm new to CUDA and have been incrementally building out some GIS functionality.
My host code looks like this at the kernel call:
MapperKernel<<<g_CUDAControl->aGetGridSize(), g_CUDAControl->aGetBlockSize()>>>(
    g_Deltas.lat, g_Deltas.lon, 32.2,
    g_DataReader->aGetMapper().aGetRPCBoundingBox()[0], g_DataReader->aGetMapper().aGetRPCBoundingBox()[1],
    g_CUDAControl->aGetBlockSize().x,
    g_CUDAControl->aGetThreadPitch(),
    LLA_Offset,
    LLA_ScaleFactor,
    RPC_XN, RPC_XD, RPC_YN, RPC_YD,
    Pixel_Offset, Pixel_ScaleFactor,
    device_array);
cudaDeviceSynchronize(); //code crashes here
host_array = (point3D*)malloc(num_bytes);
cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);
The kernel that is being called looks like this:
__global__ void MapperKernel(double deltaLat, double deltaLon, double passedAlt,
                             double minLat, double minLon,
                             int threadsperblock,
                             int threadPitch,
                             point3D LLA_Offset,
                             point3D LLA_ScaleFactor,
                             double * RPC_XN, double * RPC_XD, double * RPC_YN, double * RPC_YD,
                             point2D pixelOffset, point2D pixelScaleFactor,
                             point3D * rValue)
{
    //calculate thread's LLA
    int latindex = threadIdx.x + blockIdx.x*threadsperblock;
    int lonindex = threadIdx.y + blockIdx.y*threadsperblock;
    point3D LLA;
    LLA.lat = ((double)(latindex))*deltaLat + minLat;
    LLA.lon = ((double)(lonindex))*deltaLon + minLon;
    LLA.alt = passedAlt;
    //scale thread's LLA - adding these four lines is what causes the problem
    point3D ScaledLLA;
    ScaledLLA.lat = (LLA.lat - LLA_Offset.lat) * LLA_ScaleFactor.lat;
    ScaledLLA.lon = (LLA.lon - LLA_Offset.lon) * LLA_ScaleFactor.lon;
    ScaledLLA.alt = (LLA.alt - LLA_Offset.alt) * LLA_ScaleFactor.alt;
    rValue[lonindex*threadPitch + latindex] = ScaledLLA; //if I assign LLA without calculating ScaledLLA everything works fine
}
If I assign LLA to rValue, everything executes quickly and I get the expected behavior. However, when I add those four lines for ScaledLLA and assign it to rValue instead, CUDA takes too long for Windows's liking at the cudaDeviceSynchronize() call and I get a "display driver not responding" error, after which the GPU is reset. From looking around, the error appears to be a Windows thing that occurs when Windows believes that the GPU isn't being responsive. I am certain that the kernel is running and performing the right calculations, because I have stepped through it with the Nsight debugger.
Does anybody have a good explanation for why adding those four lines to the kernel would cause the execution time to spike?
I'm running Windows 7 with VS 2013 and have Nsight 4.5 installed.
For those who get here later via a search engine: it turns out the problem was the card running out of memory.
That should probably have been one of the first things to consider, since the problem occurred only after the instantiation was added.
The card only had about 2 GB of memory, and my rValue buffer was taking up most of it (about 1.5 GB). With every thread trying to instantiate its own point3D variable, the card simply ran out of memory.
For those interested, Nsight's profiler reported it as an unknown error (cudaErrorUnknown).
The fix was to lower the number of threads running the kernel.
I have a program that reads the current time from the system clock and saves it to a text file. I previously used the GetSystemTime function, which worked, but the times weren't completely consistent, e.g. one of the times is 32567.789 and the next is 32567.780, which is backwards in time.
I am using this program to save the time up to 10 times a second. I read that the GetSystemTimeAsFileTime function is more accurate. My question is, how do I convert my current code to use the GetSystemTimeAsFileTime function? I tried to use the FileTimeToSystemTime function but that had the same problems.
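SYSTEMTIME st;
GetSystemTime(&st);
// Seconds since midnight can reach 86399, which overflows a 16-bit WORD, so use a DWORD.
DWORD sec = (st.wHour*3600) + (st.wMinute*60) + st.wSecond; //convert to seconds in a day
// Zero-pad the milliseconds so that 80 ms is written as .080 rather than .80.
lStr.Format( _T("%d %lu.%03d\n"), GetFrames(), sec, st.wMilliseconds);
std::wfstream myfile;
myfile.open("time.txt", std::ios::out | std::ios::in | std::ios::app );
if (myfile.is_open())
{
    myfile.write((LPCTSTR)lStr, lStr.GetLength());
    myfile.close();
}
else
{
    // Note: WSAGetLastError() reports Winsock errors, not file-open failures.
    lStr.Format( _T("open file failed: %d"), WSAGetLastError());
}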
EDIT: To add some more info, the code captures an image from a camera 10 times every second and saves the time the image was taken to a text file. When I subtract consecutive entries of the text file (entry 2 minus entry 1, 3 minus 2, 4 minus 3, etc.) I get this graph, where the x axis is the entry number and the y axis is the difference.
All of them should be around the 0.12 mark which most of them are. However you can see that a lot of them vary and some even go negative. This isn't due to the camera because the camera has its own internal clock and that has no variations. It has something to do with capturing the system time. What I want is the most accurate method to extract the system time with the highest resolution and as little noise as possible.
Edit 2 I have taken on board your suggestions and ran the program again. This is the result:
As you can see it is a lot better than before but it is still not right. I find it strange that it seems to do it very incrementally. I also just plotted the times and this is the result, where x is the entry and y is the time:
Does anyone have any idea on what could be causing the time to go out every 30 frames or so?
First of all, get the FILETIME as follows:
FILETIME fileTime;
GetSystemTimeAsFileTime(&fileTime);
// Or for higher precision, use
// GetSystemTimePreciseAsFileTime(&fileTime);
According to FILETIME's documentation,
It is not recommended that you add and subtract values from the FILETIME structure to obtain relative times. Instead, you should copy the low- and high-order parts of the file time to a ULARGE_INTEGER structure, perform 64-bit arithmetic on the QuadPart member, and copy the LowPart and HighPart members into the FILETIME structure.
So, what you should do next is:
ULARGE_INTEGER theTime;
theTime.LowPart = fileTime.dwLowDateTime;
theTime.HighPart = fileTime.dwHighDateTime;
__int64 fileTime64Bit = theTime.QuadPart;
And that's it. The fileTime64Bit variable now contains the time you're looking for.
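To get from that 64-bit value back to the "seconds in the day" form used in the question, a small sketch in portable C++ follows. secondsOfDay() is a hypothetical helper, not a Windows API. It relies only on the documented facts that a FILETIME counts 100-nanosecond intervals since midnight, January 1, 1601 (UTC), so day boundaries fall on exact multiples of one day's worth of ticks.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a FILETIME-style tick count (100 ns units) as "seconds.milliseconds"
// within the current UTC day. Hypothetical helper, portable C++ only.
std::string secondsOfDay(std::uint64_t ticks100ns) {
    const std::uint64_t ticksPerMs = 10000ULL;          // 100 ns ticks in 1 ms
    const std::uint64_t msPerDay = 86400ULL * 1000ULL;  // milliseconds in a day
    std::uint64_t msOfDay = (ticks100ns / ticksPerMs) % msPerDay;
    char buf[32];
    // %03u zero-pads the milliseconds, so 80 ms is written as ".080".
    std::snprintf(buf, sizeof buf, "%u.%03u",
                  static_cast<unsigned>(msOfDay / 1000),
                  static_cast<unsigned>(msOfDay % 1000));
    return buf;
}
```

On Windows this would be called as secondsOfDay(theTime.QuadPart). Doing the arithmetic on the 64-bit value avoids both the 16-bit seconds counter and the unpadded millisecond field.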
If you want to get a SYSTEMTIME object instead, you could just do the following:
SYSTEMTIME systemTime;
FileTimeToSystemTime(&fileTime, &systemTime);
Getting the system time out of Windows with decent accuracy is something that I've had fun with, too... I discovered that Javascript code running on Chrome seemed to produce more consistent timer results than I could with C++ code, so I went looking in the Chrome source. An interesting place to start is the comments at the top of time_win.cc in the Chrome source. The links given there to a Mozilla bug and a Dr. Dobb's article are also very interesting.
Based on the Mozilla and Chrome sources, and the above links, the code I generated for my own use is here. As you can see, it's a lot of code!
The basic idea is that getting the absolute current time is quite expensive. Windows does provide a high resolution timer that's cheap to access, but that only gives you a relative, not absolute time. What my code does is split the problem up into two parts:
1) Get the system time accurately. This is in CalibrateNow(). The basic technique is to call timeBeginPeriod(1) to get accurate times, then call GetSystemTimeAsFileTime() until the result changes, which means that the timeBeginPeriod() call has had an effect. This gives us an accurate system time, but is quite an expensive operation (and the timeBeginPeriod() call can affect other processes) so we don't want to do it each time we want a time. The code also calls QueryPerformanceCounter() to get the current high resolution timer value.
bool NeedCalibration = true;
LONGLONG CalibrationFreq = 0;
LONGLONG CalibrationCountBase = 0;
ULONGLONG CalibrationTimeBase = 0;

void CalibrateNow(void)
{
    // If the timer frequency is not known, try to get it
    if (CalibrationFreq == 0)
    {
        LARGE_INTEGER freq;
        if (::QueryPerformanceFrequency(&freq) == 0)
            CalibrationFreq = -1;
        else
            CalibrationFreq = freq.QuadPart;
    }
    if (CalibrationFreq > 0)
    {
        // Get the current system time, accurate to ~1ms
        FILETIME ft1, ft2;
        ::timeBeginPeriod(1);
        ::GetSystemTimeAsFileTime(&ft1);
        do
        {
            // Loop until the value changes, so that the timeBeginPeriod() call has had an effect
            ::GetSystemTimeAsFileTime(&ft2);
        }
        while (FileTimeToValue(ft1) == FileTimeToValue(ft2));
        ::timeEndPeriod(1);

        // Get the current timer value
        LARGE_INTEGER counter;
        ::QueryPerformanceCounter(&counter);

        // Save calibration values
        CalibrationCountBase = counter.QuadPart;
        CalibrationTimeBase = FileTimeToValue(ft2);
        NeedCalibration = false;
    }
}
2) When we want the current time, get the high resolution timer by calling QueryPerformanceCounter(), and use the change in that timer since the last CalibrateNow() call to work out an accurate "now". This is in GetNow() in my code. This also periodically calls CalibrateNow() to ensure that the system time doesn't go backwards, or drift out.
FILETIME GetNow(void)
{
    for (int i = 0; i < 4; i++)
    {
        // Calibrate if needed, and give up if this fails
        if (NeedCalibration)
            CalibrateNow();
        if (NeedCalibration)
            break;

        // Get the current timer value and use it to compute now
        FILETIME ft;
        ::GetSystemTimeAsFileTime(&ft);
        LARGE_INTEGER counter;
        ::QueryPerformanceCounter(&counter);
        LONGLONG elapsed = ((counter.QuadPart - CalibrationCountBase) * 10000000) / CalibrationFreq;
        ULONGLONG now = CalibrationTimeBase + elapsed;

        // Don't let time go back
        static ULONGLONG lastNow = 0;
        now = max(now,lastNow);
        lastNow = now;

        // Check for clock skew
        if (LONGABS(FileTimeToValue(ft) - now) > 2 * GetTimeIncrement())
        {
            NeedCalibration = true;
            lastNow = 0;
        }
        if (!NeedCalibration)
            return ValueToFileTime(now);
    }

    // Calibration has failed to stabilize, so just use the system time
    FILETIME ft;
    ::GetSystemTimeAsFileTime(&ft);
    return ft;
}
It's all a bit hairy but works better than I had hoped. This also seems to work well as far back on Windows as I have tested (which was Windows XP).
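For readers who want to try the calibrate-then-extrapolate idea without the Windows-specific calls, here is a rough portable sketch using std::chrono. It is an analogue of the approach, not a translation of the code above: std::chrono::system_clock stands in for GetSystemTimeAsFileTime(), std::chrono::steady_clock stands in for QueryPerformanceCounter(), CalibratedClock is a made-up name, and the timeBeginPeriod() step and the clock-skew check are left out.

```cpp
#include <chrono>
#include <cstdint>

// Portable analogue of the calibrate-then-extrapolate idea (hypothetical class).
class CalibratedClock {
public:
    CalibratedClock() { calibrate(); }

    // Record an absolute wall-clock reading together with a monotonic reading.
    void calibrate() {
        wallBase_ = std::chrono::system_clock::now();
        tickBase_ = std::chrono::steady_clock::now();
    }

    // Absolute time in microseconds since the system_clock epoch: the wall-clock
    // base plus the monotonic time elapsed since calibration, clamped so that it
    // is never less than the previously returned value.
    std::int64_t nowMicros() {
        auto elapsed = std::chrono::steady_clock::now() - tickBase_;
        auto t = wallBase_ +
                 std::chrono::duration_cast<std::chrono::system_clock::duration>(elapsed);
        std::int64_t us = std::chrono::duration_cast<std::chrono::microseconds>(
                              t.time_since_epoch()).count();
        if (us < last_) us = last_;  // don't let time go back
        last_ = us;
        return us;
    }

private:
    std::chrono::system_clock::time_point wallBase_;
    std::chrono::steady_clock::time_point tickBase_;
    std::int64_t last_ = 0;
};
```

A caller would create one CalibratedClock, call nowMicros() for each timestamp, and call calibrate() from time to time to pick up adjustments to the system clock.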
I believe you are looking for the GetSystemTimePreciseAsFileTime() function, or even QueryPerformanceCounter() if, in short, you need something that is guaranteed to produce monotonic values.
So I'm picking up C++ after a long hiatus and I had the idea to create a program which can generate music based upon strings of numbers at runtime (was inspired by the composition of Pi done by some people) with the eventual goal being some sort of procedural music generation software.
So far I have been able to make a really primitive version of this with the Beep() function and just feeding through the first so and so digits of Pi as a test. Works like a charm.
What I'm looking for now is how I could kick it up a notch and get some higher quality sound being made (because Beep() literally is the most primitive sound... ever) and I realized I have absolutely no idea how to do this. What I need is either a library or some sort of API that can:
1) Generate sound without pre-existing file. I want the result to be 100% generated by code and not rely on any samples, optimally.
2) If I could get something going that would be capable of playing multiple sounds at a time, like be able to play chords or a melody with a beat, that would be nice.
3) If I could in any way control the wave it plays (kind of like chiptune mixers can) via an equation or some other sort of data, that'd be super helpful.
I don't know if this is a weird request or I just researched it using the wrong terms, but I just wasn't able to find anything along these lines or at least nothing that was well documented at all. :/
If anyone can help, I'd really appreciate it.
EDIT: Also, apparently I'm just super not used to asking stuff on forums, my target platform is Windows (7, specifically, although I wouldn't think that matters).
I use portaudio (http://www.portaudio.com/). It will let you create PCM streams in a portable way. Then you just push the samples into the stream, and they will play.
#edit: using PortAudio is pretty easy. You initialize the library. I use floating point samples to make it super easy. I do it like this:
PaError err = Pa_Initialize();
if ( err != paNoError )
return false;
mPaParams.device = Pa_GetDefaultOutputDevice();
if ( mPaParams.device == paNoDevice )
return false;
mPaParams.channelCount = NUM_CHANNELS;
mPaParams.sampleFormat = paFloat32;
mPaParams.suggestedLatency =
Pa_GetDeviceInfo( mPaParams.device )->defaultLowOutputLatency;
mPaParams.hostApiSpecificStreamInfo = NULL;
Then later, when you want to play sounds, you create a stream with 2 channels for stereo at 44.1 kHz, which is good for MP3 audio:
PaError err = Pa_OpenStream( &mPaStream,
                             NULL,        // no input
                             &mPaParams,  // output parameters
                             44100,       // sample rate
                             NUM_FRAMES,  // frames per buffer
                             0,           // stream flags (same as paNoFlag)
                             sndCallback,
                             this
                           );
Then you implement the callback to fill the PCM audio stream. The callback is a C function, but I just call through to my C++ class to handle the audio. I ripped this from my code, and it may not be 100% correct now as I removed a ton of stuff you won't care about. But it works kind of like this:
static int sndCallback( const void* inputBuffer,
                        void* outputBuffer,
                        unsigned long framesPerBuffer,
                        const PaStreamCallbackTimeInfo* timeInfo,
                        PaStreamCallbackFlags statusFlags,
                        void* userData )
{
    Snd* snd = (Snd*)userData;
    return snd->callback( (float*)outputBuffer, framesPerBuffer );
}

u32 Snd::callback( float* outbuf, u32 nFrames )
{
    mPlayMutex.lock(); // use mutexes because this is async code!

    // clear the output buffer
    memset( outbuf, 0, nFrames * NUM_CHANNELS * sizeof( float ));

    // mix all the sounds.
    if ( mChannels.size() )
    {
        // I have multiple audio sources I'm mixing. That's what mChannels is.
        for ( s32 i = mChannels.size(); i > 0; i-- )
        {
            for ( u32 j = 0; j < nFrames * NUM_CHANNELS; j++ )
            {
                float f = outbuf[j] + getNextSample( i ); // <------------------- your code here!!!
                if ( f > 1.0 ) f = 1.0; // clamp it so you don't get clipping.
                if ( f < -1.0 ) f = -1.0;
                outbuf[j] = f;
            }
        }
    }
    mPlayMutex.unlock_p();
    return paContinue; // keep playing; return paComplete when you are done playing audio.
}
I answered a very similar question on this earlier this week: Note Synthesis, Harmonics (Violin, Piano, Guitar, Bass), Frequencies, MIDI . In your case if you don't want to rely on samples then the wavetable method is out. So your simplest option would be to dynamically vary the frequency and amplitude of sinusoids over time, which is easy but will sound pretty terrible (like a cheap Theremin). Your only real option would be a more sophisticated synthesis algorithm such as one of the Physical Modelling ones (eg Karplus-Strong). That would be an interesting project, but be warned that it does require something of a mathematical background.
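To give a feel for how little code the basic Karplus-Strong algorithm needs, here is a minimal sketch that generates a plucked-string tone into a buffer of float samples. pluck() and its parameters are hypothetical names chosen for this example, and the 0.996 damping factor and the simple linear congruential noise source are arbitrary illustrative choices rather than tuned values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Karplus-Strong plucked string: a delay line filled with noise that is
// repeatedly low-pass filtered (two-sample average) and slightly damped.
std::vector<float> pluck(double frequencyHz, double sampleRate, std::size_t numSamples) {
    // The delay line length sets the pitch: one period of the tone, in samples.
    std::size_t period = static_cast<std::size_t>(sampleRate / frequencyHz);
    if (period < 2) period = 2;

    std::vector<float> delay(period);
    std::uint32_t state = 12345u;                  // fixed seed, so the output is repeatable
    for (float& s : delay) {
        state = state * 1664525u + 1013904223u;    // simple linear congruential generator
        s = (state >> 8) / 8388608.0f - 1.0f;      // map the top 24 bits to [-1, 1)
    }

    std::vector<float> out(numSamples);
    std::size_t idx = 0;
    for (std::size_t n = 0; n < numSamples; ++n) {
        std::size_t next = (idx + 1) % period;
        out[n] = delay[idx];
        // Average with the neighbouring sample and damp a little.
        delay[idx] = 0.996f * 0.5f * (delay[idx] + delay[next]);
        idx = next;
    }
    return out;
}
```

For example, pluck(440.0, 44100.0, 44100) produces one second of a decaying tone near 440 Hz, and the resulting samples could be handed to an audio output callback one buffer at a time.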
You can indeed use something like Portaudio as Rafael has mentioned to physically get the sound out of the PC, in fact I think Portaudio is the best option for that. But generating the data so that it sounds musical is by far your biggest challenge.