I am working off a demo from the book "Learning Core Audio: A Hands-On Guide to Audio Programming for Mac and iOS." Chapter 8 shows how to set up a simple AudioUnit graph to play through from the AUHAL input unit to an output unit. This setup doesn't actually connect the audio units; instead, both units use a callback and pass audio data through an instance of CARingBuffer. I'm coding for macOS 10.15.6 and using the code directly from the publisher.
The code builds and runs, but I get no audio. Note that later, after introducing a speech synthesis unit, I do get playback, so I know the basics are working.
InputRenderProc asks the AUHAL unit for input and stores it in the ring buffer.
MyAUGraphPlayer *player = (MyAUGraphPlayer*) inRefCon;
// have we ever logged input timing? (for offset calculation)
if (player->firstInputSampleTime < 0.0) {
player->firstInputSampleTime = inTimeStamp->mSampleTime;
if ((player->firstOutputSampleTime > -1.0) &&
(player->inToOutSampleTimeOffset < 0.0)) {
player->inToOutSampleTimeOffset = player->firstInputSampleTime - player->firstOutputSampleTime;
}
}
// render into our buffer
OSStatus inputProcErr = noErr;
inputProcErr = AudioUnitRender(player->inputUnit,
ioActionFlags,
inTimeStamp,
inBusNumber,
inNumberFrames,
player->inputBuffer);
if (! inputProcErr) {
inputProcErr = player->ringBuffer->Store(player->inputBuffer,
inNumberFrames,
inTimeStamp->mSampleTime);
UInt32 sz = sizeof(player->inputBuffer);
printf ("stored %d frames at time %f (%d bytes)\n", inNumberFrames, inTimeStamp->mSampleTime, sz);
for (int i = 0; i < player->inputBuffer->mNumberBuffers; i++ ){
//printf("stored audio string[%d]: %s\n", i, player->inputBuffer->mBuffers[i].mData);
}
}
If I uncomment the printf statement, I see what looks like audio data being stored.
stored audio string[1]: #P'\274a\353\273\336^\274x\205 \2741\330B\2747'\274\371\361U\274\346\274\274}\212C\274\334\365%\274\261\367\273\340\307/\274E
stored 512 frames at time 134610.000000 (8 bytes)
However, when I fetch from the ring buffer in the GraphRenderCallback like this...
MyAUGraphPlayer *player = (MyAUGraphPlayer*) inRefCon;
// have we ever logged output timing? (for offset calculation)
if (player->firstOutputSampleTime < 0.0) {
player->firstOutputSampleTime = inTimeStamp->mSampleTime;
if ((player->firstInputSampleTime > -1.0) &&
(player->inToOutSampleTimeOffset < 0.0)) {
player->inToOutSampleTimeOffset = player->firstInputSampleTime - player->firstOutputSampleTime;
}
}
// copy samples out of ring buffer
OSStatus outputProcErr = noErr;
// new CARingBuffer doesn't take bool 4th arg
outputProcErr = player->ringBuffer->Fetch(ioData,
inNumberFrames,
inTimeStamp->mSampleTime + player->inToOutSampleTimeOffset);
I get nothing (I know I can't expect proper null-terminated string output, but I thought I'd see something).
fetched 512 frames at time 160776.000000
fetched audio string[0, size 2048]: xx
fetched audio string[1, size 2048]: xx
fetched 512 frames at time 161288.000000
fetched audio string[0, size 2048]: xx
fetched audio string[1, size 2048]: xx
This is not a permission problem; I have other non-AudioUnit code that can get mic input. In addition, I created a plist that makes this app prompt for mic access every time, so I know that is working. I cannot understand why data goes into this ring buffer, but never comes out.
These days you need to declare that you want to use the microphone, providing an explanation string. This wasn't the case in 2012 when Learning Core Audio was published.
In short, you now need to:
add an NSMicrophoneUsageDescription string to your Info.plist
add sandboxing capability and enable Audio Input
The sample code you're using is a command line tool, so adding an Info.plist to it in Xcode isn't as simple as with a .app package. Also, the code does not seem to work if you run it from Xcode; in my case it has to be run from Terminal.app. This may be because my Terminal has microphone permissions (viewable in System Preferences > Security & Privacy > Microphone). You can, and probably should, explicitly request microphone access from the user (yourself in this case!) by using requestAccessForMediaType on an AVCaptureDevice. That's right, AVFoundation code in a Core Audio tutorial. What's the world coming to?
There are more details on the above steps in this answer
p.s. I think the person who thought capturing zeroes instead of returning an error was a good idea is probably good friends with whoever invented returning HTTP 200 with an error code in the body.
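Printing sample data with %s is not a reliable check, because the first zero byte ends the string. A minimal sketch of a more direct test, assuming the buffers hold 32-bit float samples (which matches the 2048-byte buffers for 512 frames shown in the question). countNonZeroSamples is a hypothetical helper, not part of Core Audio:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a Core Audio API): counts samples that are not
// exactly 0.0f. A result of 0 on every callback suggests the input is being
// silenced, for example because microphone access was not granted.
std::size_t countNonZeroSamples(const float *samples, std::size_t count) {
    assert(samples != nullptr || count == 0);
    std::size_t nonZero = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (samples[i] != 0.0f) {
            ++nonZero;
        }
    }
    return nonZero;
}
```

In the render callbacks this could be called once per buffer, passing mBuffers[i].mData cast to const float * and mBuffers[i].mDataByteSize / sizeof(float) as the count.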
I'm trying to use Google Oboe for a 3D audio processing app due to its low latency. The app will have a C++ backend, which does the processing, and the frontend is done with Flutter. I'm running a couple of tests to see if it'll work, but I'm having issues loading assets from Flutter to Oboe. I checked the example RhythmGame in Oboe's repo, done with Java, but couldn't quite find a way of doing that straight from Dart to C++.
The connection between front and backend is through dart::ffi
Here's what I've tried so far. Based on the example published by Richard Heap here, I changed the noise variable from just a sine wave to a short fragment of a song in a wav file:
class _MyAppState extends State<MyApp> {
final stream = OboeStream();
var noise = Float32List(512);
Timer t;
@override
void initState() {
super.initState();
// for (var i = 0; i < noise.length; i++) {
// noise[i] = sin(8 * pi * i / noise.length);
// }
_loadSound();
}
void _loadSound() async {
final ByteData data = await rootBundle.load('assets/song_cut.wav');
noise = data.buffer.asFloat32List();
}
(...)
Then this function in Dart calls the Dart wrapper of the native library:
void start() {
stream.start();
var interval = (512000 / stream.getSampleRate()).floor() + 1;
t = Timer.periodic(Duration(milliseconds: interval), (_) {
stream.write(noise);
});
}
The wrapper in Dart is:
void write(Float32List original) {
var length = original.length;
var copy = allocate<Float>(count: length)
..asTypedList(length).setAll(0, original);
FfiGoogleOboe()._streamWrite(_nativeInstance, copy, length);
free(copy);
}
_streamWrite is the native function in C++:
EXTERNC void stream_write(void* ptr, void* data, int32_t size) {
auto stream = static_cast<OboeFfiStream*>(ptr);
auto dataToWrite = static_cast<float*>(data);
stream->write(dataToWrite, size);
}
void OboeFfiStream::write(float *data, int32_t size) {
managedStream->write(data, size, 1000000);
}
Now I can hear the song, but it comes out heavily distorted. When trying with the sine wave I could hear it too, but it also had some distortion. I'm not yet using the callback mode in Oboe, since I wanted to check whether this worked first.
1 - what format is your WAV file in? Is it 32 bit floats? Don't forget that WAV files have a header, so you should discard the first few tens of bytes (up to the data segment). Be sure that you start reading the audio data on a float boundary (which may not be a multiple of 4 if the header isn't). If necessary, just use a hex editor to ascertain the offset of the float data and start reading there. Or, truncate the header and rename your asset to song_cut.raw. Audacity should be able to produce a header-less raw audio file.
2 - What sample rate is your audio clip recorded at? Does that match the sample rate of the device? (Note that iOS devices are normally 44.1k, but Android devices are frequently 48k. When using an Android emulator on macOS, who knows what the reported sample rate will be! Expect pitch distortion if your rates don't match - or use a resampler. I think Oboe has one. Alternatively, the sample repo associated with the talk contains one you can use.)
3 - note that the timer interval is finely tuned (for demo purposes) to the approximate time taken to deliver 512 samples at the sound card rate. This might be ok for demos, but isn't for real life. Also, your wav file probably doesn't have exactly 512 samples in it. Either adjust your audio loop to 512 samples, or adjust the 512000 constant to match the number of samples in your loop.
4a - You aren't using the callback method yet, but you probably should as soon as possible. One method I've had success with is to use a lock-free circular buffer. The Oboe callback tries to empty the buffer, while the Dart timer routine tries to fill it. The bigger the buffer the less chance there is of an underflow, but the worse the latency.
4b - The ideal solution would be to have the Oboe callback call up into Dart, but I haven't found a way to do that, because C->Dart calls must be made on the main Dart thread, whereas the Oboe callbacks surely run on a high-priority I/O thread.
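The circular buffer mentioned in 4a could look roughly like the sketch below. This is a minimal single-producer/single-consumer design, assuming float samples and a power-of-two capacity; SpscRingBuffer and its methods are hypothetical names, not part of Oboe. It relies on std::atomic<std::size_t> being lock-free, which holds on mainstream platforms but is not guaranteed by the standard.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-producer/single-consumer ring buffer of float samples.
// Exactly one thread may call write() and one other thread may call read().
class SpscRingBuffer {
public:
    // capacity must be a power of two so indices can be wrapped with a mask.
    explicit SpscRingBuffer(std::size_t capacity)
        : buffer_(capacity), mask_(capacity - 1) {
        assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
    }

    // Producer side: copies up to count samples, returns how many were stored.
    std::size_t write(const float *src, std::size_t count) {
        const std::size_t w = writeIndex_.load(std::memory_order_relaxed);
        const std::size_t r = readIndex_.load(std::memory_order_acquire);
        const std::size_t freeSpace = buffer_.size() - (w - r);
        const std::size_t n = count < freeSpace ? count : freeSpace;
        for (std::size_t i = 0; i < n; ++i) {
            buffer_[(w + i) & mask_] = src[i];
        }
        writeIndex_.store(w + n, std::memory_order_release);
        return n;
    }

    // Consumer side: copies up to count samples and zero-fills the rest on
    // underflow. Returns how many real samples were copied.
    std::size_t read(float *dst, std::size_t count) {
        const std::size_t r = readIndex_.load(std::memory_order_relaxed);
        const std::size_t w = writeIndex_.load(std::memory_order_acquire);
        const std::size_t available = w - r;
        const std::size_t n = count < available ? count : available;
        for (std::size_t i = 0; i < n; ++i) {
            dst[i] = buffer_[(r + i) & mask_];
        }
        for (std::size_t i = n; i < count; ++i) {
            dst[i] = 0.0f;  // play silence rather than stale data
        }
        readIndex_.store(r + n, std::memory_order_release);
        return n;
    }

private:
    std::vector<float> buffer_;
    const std::size_t mask_;
    std::atomic<std::size_t> writeIndex_{0};
    std::atomic<std::size_t> readIndex_{0};
};
```

The idea would be for stream_write to call write() from the Dart side and for the Oboe data callback to call read() with numFrames times the channel count.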
I'm attempting to play a raw (int16 PCM) encoded audio file in my Android application. I've been following and reading through the Oboe documentation/samples to try to get one of my own audio files to play.
The audio file I need to play is roughly 6 kB, or 1592 frames (stereo).
Either no sound plays, or sound/jitter plays on startup (with varying output - see below).
Troubleshooting
update
I have switched to floats for buffer queuing, instead of keeping everything to int16_t (and converting back to int16_t when done), although now I'm back to no sound.
The audio seems to be either not playing, or playing on startup (which is wrong). The sound should play after I press 'start'.
When the app was implemented with int16_t only, the premature sound depended on the buffer size. If the buffer size was smaller than the audio file, the sound was very fast and clipped (more drone-like at lower buffer sizes). If it was bigger than the raw audio size, the sound seemed to play on a loop and got quieter at higher buffer sizes. The sound would also get "softer" when the start button was pressed. I'm not even entirely sure this means the raw audio was playing; it could just be random jitter from Android.
When filling the buffers with floats, and converting to int16_t afterwards, no audio is played.
(I have tried running systrace, but I honestly don't know what I'm looking for)
The stream opens fine.
The buffer size fails to be adjusted in createPlaybackStream() (although somehow it still sets it to twice the burst size)
The stream starts fine.
The Raw resources are being loaded fine.
Implementation
What I am currently trying in the builder:
Setting the callback to this, or onAudioReady()
Setting the performance mode to LowLatency
Setting the sharing mode to Exclusive
Setting the buffer capacity to (anything bigger than my audio file frame count)
Setting the burst size (frames per call back) to (anything equal to or lower than the buffer capacity / 2)
I am using the Player class and the AAssetManager class from the Rhythm Game sample here: https://github.com/google/oboe/blob/master/samples/RhythmGame. I am using these classes to load my resources and play the sound. Player.renderAudio writes the audio data to the output buffer.
Here are the relevant methods from my audio engine:
void AudioEngine::createPlaybackStream() {
// Load the RAW PCM data files into memory
std::shared_ptr<AAssetDataSource> soundSource(AAssetDataSource::newFromAssetManager(assetManager, "sound.raw", ChannelCount::Mono));
if (soundSource == nullptr) {
LOGE("Could not load source data for sound");
return;
}
sound = std::make_shared<Player>(soundSource);
AudioStreamBuilder builder;
builder.setCallback(this);
builder.setPerformanceMode(PerformanceMode::LowLatency);
builder.setSharingMode(SharingMode::Exclusive);
builder.setChannelCount(mChannelCount);
Result result = builder.openStream(&stream);
if (result == Result::OK && stream != nullptr) {
mSampleRate = stream->getSampleRate();
mFramesPerBurst = stream->getFramesPerBurst();
int channelCount = stream->getChannelCount();
if (channelCount != mChannelCount) {
LOGW("Requested %d channels but received %d", mChannelCount, channelCount);
}
// Set the buffer size to (burst size * 2) - this will give us the minimum possible latency while minimizing underruns
auto setBufferSizeResult = stream->setBufferSizeInFrames(mFramesPerBurst * 2);
if (setBufferSizeResult != Result::OK) {
LOGW("Failed to set buffer size. Error: %s", convertToText(setBufferSizeResult.error()));
}
// Start the stream - the dataCallback function will start being called
result = stream->requestStart();
if (result != Result::OK) {
LOGE("Error starting stream. %s", convertToText(result));
}
} else {
LOGE("Failed to create stream. Error: %s", convertToText(result));
}
}
DataCallbackResult AudioEngine::onAudioReady(AudioStream *audioStream, void *audioData, int32_t numFrames) {
int16_t *outputBuffer = static_cast<int16_t *>(audioData);
sound->renderAudio(outputBuffer, numFrames);
return DataCallbackResult::Continue;
}
// When the 'start' button is pressed, it calls this method with true
// There should be no sound on app start-up until this button is pressed
// Sound stops when 'stop' is pressed
void AudioEngine::setPlaying(bool isPlaying) {
sound->setPlaying(isPlaying);
}
Setting the buffer capacity to (anything bigger than my audio file frame count)
You don't need to set the buffer capacity. This will be set automatically at a reasonable level for you. Typically ~3000 frames. Note that buffer capacity is different from buffer size which defaults to 2*framesPerBurst.
Setting the burst size (frames per call back) to (anything equal to or lower than the buffer capacity / 2)
Again, don't do this. onAudioReady will be called every time the stream requires more audio data and numFrames indicates how many frames you should supply. If you override this value with a value which isn't an exact ratio of the audio device's native burst size (typical values are 128, 192 and 240 frames depending on underlying hardware) then you may get audio glitches.
I have switched to floats for buffer queuing
The format which you need to supply data in is determined by the audio stream and it is only known after the stream has been opened. You can get it by calling stream->getFormat().
In the RhythmGame sample (at least the version you're referring to) here's how the formats work:
Source file is converted from 16-bit to float inside AAssetDataSource::newFromAssetManager (floats are the preferred format for any kind of signal processing)
If the stream format is 16-bit then convert it back inside onAudioReady
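The two conversion steps above can be sketched as follows. This is a minimal illustration, not the code from the RhythmGame sample; int16ToFloat and floatToInt16 are hypothetical names, and the scaling constants are one common convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: 16-bit PCM to float in the range [-1.0, 1.0).
void int16ToFloat(const int16_t *src, float *dst, std::size_t numSamples) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < numSamples; ++i) {
        dst[i] = static_cast<float>(src[i]) / 32768.0f;
    }
}

// Hypothetical helper: float to 16-bit PCM, clamping out-of-range input
// so that loud samples saturate instead of wrapping around.
void floatToInt16(const float *src, int16_t *dst, std::size_t numSamples) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < numSamples; ++i) {
        float scaled = src[i] * 32767.0f;
        if (scaled > 32767.0f) scaled = 32767.0f;
        if (scaled < -32768.0f) scaled = -32768.0f;
        dst[i] = static_cast<int16_t>(scaled);
    }
}
```

Note that numSamples here is the frame count multiplied by the channel count, not the frame count alone.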
1592 frames (stereo).
You said that your source was stereo but you're specifying it as mono here:
std::shared_ptr<AAssetDataSource> soundSource(AAssetDataSource::newFromAssetManager(assetManager, "sound.raw", ChannelCount::Mono));
Without doubt that will cause audio problems because the AAssetDataSource will have a value for numFrames which is double the correct value. This will cause audio glitches because half the time you'll be playing random parts of system memory.
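The arithmetic behind that can be checked with a small sketch. 1592 stereo frames of 16-bit samples occupy 1592 * 2 * 2 = 6368 bytes, which agrees with the "roughly 6kb" in the question; read as mono, the same bytes look like 3184 frames. frameCount is a hypothetical helper, not an Oboe function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of frames in a block of interleaved PCM data.
// One frame holds one sample per channel.
std::size_t frameCount(std::size_t byteCount, std::size_t channelCount,
                       std::size_t bytesPerSample) {
    assert(channelCount > 0 && bytesPerSample > 0);
    return byteCount / (channelCount * bytesPerSample);
}
```

For the file in question, frameCount(6368, 2, 2) gives 1592, while frameCount(6368, 1, 2) gives 3184.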
I have DSP software which captures the audio being played, using WASAPI in shared loopback mode.
hr = _pAudioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, AUDCLNT_STREAMFLAGS_LOOPBACK, 0, 0, _pFormat, 0);
This part works fine, but now I want to be able to detect the number of channels actually playing. In other words how would I be able to detect if the audio playing is in stereo, 5.1, 7.1?
The problem is:
* Since loopback has to use shared mode, there could be multiple sources playing.
* This analysis has to be done in real-time. Can't wait until playback is done.
* Detect the difference between a channel not used at all by any playback source and a channel that is temporarily silent
The best solution in my mind would be if I could retrieve a list of all playback sources/sub-mixes and query each of them for its number of channels. That way I wouldn't have to analyse the audio data stream itself.
Loopback recording takes place in mix format defined on the endpoint, so regardless of what the original audio format was you get the data in the mix format, mixed from possibly multiple played sources and also converted to such shared format.
Device Formats
Loopback Recording
WASAPI loopback contains the mix of all audio being played...
The GetMixFormat method retrieves the stream format that the audio engine uses for its internal processing of shared-mode streams...
After an application has used GetMixFormat or IsFormatSupported to find an appropriate format for a shared-mode or exclusive-mode stream, the application can call the Initialize method to initialize a stream with that format. An application that attempts to initialize a shared-mode stream with a format that is not identical to the mix format obtained from the GetMixFormat method, but that has the same number of channels and the same sample rate as the mix format, is likely to succeed. Before calling Initialize, the application can call IsFormatSupported to verify that Initialize will accept the format.
That is, even though WASAPI offers some flexibility in audio format, channel configuration and sample rate are defined by shared format when it comes to loopback capture.
As you are getting the mix, you cannot really identify "non-active" channels: this information is lost during mixing to shared format.
Also, the actual shared format can be configured interactively via Control Panel.
OK, I now have a solution to my problem. As far as I know you cannot detect sub-mixes in the shared mix, so the only option was to analyze the audio stream/capture buffer.
First during my main capture loop I set the current timestamp for all channels playing.
const time_t now = Date::getCurrentTimeMillis();
//Iterate all capture frames
for (i = 0; i < numFramesAvailable; ++i) {
for (j = 0; j < _nChannelsIn; ++j) {
//Identify which channels are playing (the buffer is interleaved, so index by frame and channel).
if (pCaptureBuffer[i * _nChannelsIn + j] != 0) {
_pUsedChannels[j] = now;
}
}
}
Then every second I call this function which evaluates if a channel has played the last second. Based upon which channels are playing I can do conditional routing.
void checkUsedChannels() {
const time_t now = Date::getCurrentTimeMillis();
//Compare now against last used timestamp and determine active channels
for (size_t i = 0; i < _nChannelsIn; ++i) {
if (now - _pUsedChannels[i] > 1000) {
_pUsedChannels[i] = 0;
}
}
//Update conditional routing
for (const Input *pInput : _inputs) {
pInput->evalConditions();
}
}
Very simple solution but it appears to be working.
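For anyone who wants to try the idea outside my code base, here is a self-contained sketch of the same approach. It assumes interleaved 32-bit float samples (the mix format is often float, but check yours) and takes the timestamp as a parameter; markUsedChannels and countActiveChannels are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: records nowMs for every channel that has a non-zero
// sample anywhere in an interleaved block (frame-major layout).
void markUsedChannels(const float *interleaved, std::size_t numFrames,
                      std::size_t numChannels, std::int64_t nowMs,
                      std::vector<std::int64_t> &lastUsedMs) {
    assert(lastUsedMs.size() == numChannels);
    for (std::size_t frame = 0; frame < numFrames; ++frame) {
        for (std::size_t ch = 0; ch < numChannels; ++ch) {
            if (interleaved[frame * numChannels + ch] != 0.0f) {
                lastUsedMs[ch] = nowMs;
            }
        }
    }
}

// Counts channels that carried a non-zero sample within the last timeoutMs.
// A value of 0 in lastUsedMs means the channel has never been used.
std::size_t countActiveChannels(const std::vector<std::int64_t> &lastUsedMs,
                                std::int64_t nowMs, std::int64_t timeoutMs) {
    std::size_t active = 0;
    for (std::int64_t t : lastUsedMs) {
        if (t != 0 && nowMs - t <= timeoutMs) {
            ++active;
        }
    }
    return active;
}
```

A count of 2 would then suggest stereo content, 6 would suggest 5.1, and so on, with the caveat that a channel that stays silent for longer than the timeout is indistinguishable from an unused one.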
I am very new to the EDSDK, so sorry if parts of this question seem odd.
Is it possible to access the video stream and perform operations on it using the SDK? I need to capture a very thin region of interest (ROI) of a specified size (for example 3840x10 px) from each frame in the stream. This is not meant as compression of the frame, and the aspect ratio does not need to be preserved. In theory, cropping to such a thin region should increase the frame rate (should it?).
I found the code snippet below in the official documentation, but it seems only to send a signal to start and stop video recording, without giving access to the stream.
EdsUInt32 record_start = 4; // Begin movie shooting
err = EdsSetPropertyData(cameraRef, kEdsPropID_Record, 0, sizeof(record_start), &record_start);
EdsUInt32 record_stop = 0; // End movie shooting
err = EdsSetPropertyData(cameraRef, kEdsPropID_Record, 0, sizeof(record_stop), &record_stop);
I would be very thankful for any suggestions and help. Please feel free to ask for any additional information!
This SDK doesn't give you direct access to high-resolution streams the way industrial cameras would. Over USB you can access live view images of roughly 960x640 as sequential JPEGs. Movie recording can only be done to the internal card, with the result transferred after recording stops. Outside of this SDK, an external HDMI recorder gives access to a near-real-time feed at up to Full HD 1080p, depending on the model, and the feed is not always "clean".
Currently, I am parsing wav files and storing samples in std::vector<int16_t> sample. Now, I want to apply VAD (Voice Activity Detection) on this data to find out the "regions" of voice, and more specifically the start and end of words.
The parsed wav files are 16 kHz, 16-bit PCM, mono. My code is in C++.
I have searched a lot about it but could not find proper documentation regarding webRTC's VAD functions.
From what I have found, the function that I need to use is WebRtcVad_Process(). Its prototype is written below:
int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
size_t frame_length);
From what I found here : https://stackoverflow.com/a/36826564/6487831
Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long.
Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:
int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);
It makes sense :
1 sample = 2B = 16 bits
SampleRate = 16000 sample/sec = 16 samples/ms
For 10 ms, no of samples = 160
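The same arithmetic as a small sketch, in case other frame lengths or sample rates are needed. samplesPerFrame is a hypothetical helper, not part of the WebRTC VAD API; the byte count follows from 2 bytes per 16-bit sample.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: samples in one VAD frame of frameMs milliseconds.
// The VAD only accepts frames of 10, 20 or 30 ms.
std::size_t samplesPerFrame(std::size_t sampleRateHz, std::size_t frameMs) {
    assert(frameMs == 10 || frameMs == 20 || frameMs == 30);
    return sampleRateHz * frameMs / 1000;
}
```

At 16000 Hz a 10 ms frame is 160 samples, or 320 bytes, and a 30 ms frame is 480 samples.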
So, based on that I have implemented this :
const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
std::cout<<ms<<" ms : "<<isActive<<std::endl;
temp = temp + 160; // processed 160 samples
}
Now, I am not really sure if this is correct. I am also unsure whether this gives me correct output or not.
So,
Is it possible to use the samples parsed directly from the wav files, or does it need some processing?
Am I looking at the correct function to do the job?
How to use the function to properly perform VAD on the audio stream?
Is it possible to distinguish between the spoken words?
What is the best way to check if the output I am getting is correct?
If not, what is the best way to do this task?
I'll start by saying that no, I don't think you will be able to segment an utterance into individual words using VAD. From the article on speech segmentation in Wikipedia:
One might expect that the inter-word spaces used by many written
languages like English or Spanish would correspond to pauses in their
spoken version, but that is true only in very slow speech, when the
speaker deliberately inserts those pauses. In normal speech, one
typically finds many consecutive words being said with no pauses
between them, and often the final sounds of one word blend smoothly or
fuse with the initial sounds of the next word.
That said, I'll try to answer your other questions.
You need to decode the WAV file, which could be compressed, into raw PCM audio data before running VAD. See e.g. Reading and processing WAV file data in C/C++. Alternatively, you could use something like sox to convert the WAV file to raw audio before running your code. This command will convert a WAV file of any format to 16 kHz, 16-bit PCM in the format that WebRTC VAD expects:
sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -B my_file.raw
It looks like you are using the right function. To be more specific, you should be doing this:
#include "webrtc/common_audio/vad/include/webrtc_vad.h"
// ...
VadInst *vad;
WebRtcVad_Create(&vad);
WebRtcVad_Init(vad);
const int16_t * temp = sample.data();
for (size_t i = 0, ms = 0; i + 160 <= sample.size(); i += 160, ms += 10) // stop before a partial final frame
{
int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
std::cout << ms << " ms : " << isActive << std::endl;
temp = temp + 160; // processed 160 samples (320 bytes)
}
To see if it's working, you can run known files and see if you get the results you expect. For example, you could start by processing silence and confirm that you never (or rarely--this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process. Then you could try a file that is all silence except for one short utterance in the middle, etc. If you want to compare to an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function.
To do word-level segmentation, you will probably need to find a speech recognition library that does it or gives you access to the information that you need to do it. E.g. this thread on the Kaldi mailing list seems to talk about how to segment by words.
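Although VAD won't give you word boundaries, its per-frame results can be merged into voiced regions, which covers the "regions of voice" part of the question. A minimal sketch, assuming you have collected the return values of WebRtcVad_Process (1 for voiced, 0 for unvoiced, -1 on error) into a vector; voicedRegions is a hypothetical helper, not part of WebRTC:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: merges per-frame VAD decisions into [startMs, endMs)
// regions. Any value other than 1 is treated as "not voiced".
std::vector<std::pair<int, int>> voicedRegions(const std::vector<int> &decisions,
                                               int frameMs) {
    assert(frameMs > 0);
    std::vector<std::pair<int, int>> regions;
    int start = -1;  // -1 means no region is currently open
    for (std::size_t i = 0; i < decisions.size(); ++i) {
        const int timeMs = static_cast<int>(i) * frameMs;
        if (decisions[i] == 1 && start < 0) {
            start = timeMs;                      // region opens
        } else if (decisions[i] != 1 && start >= 0) {
            regions.emplace_back(start, timeMs); // region closes
            start = -1;
        }
    }
    if (start >= 0) {
        // Audio ended while still voiced: close the region at the end.
        regions.emplace_back(start, static_cast<int>(decisions.size()) * frameMs);
    }
    return regions;
}
```

In practice you would probably also want to ignore very short gaps between regions, since the per-frame decisions can flicker.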