Azure TTS generating garbled result when requesting Opus encoding - c++

The following sample code (C++, Linux, x64) uses the MS Speech SDK to request a text-to-speech of a single sentence in Opus format with no container. It then uses the Opus lib to decode to raw PCM. Everything seems to run with no errors but the result sounds garbled, as if some of the audio is missing, and the result Done, got 14880 bytes, decoded to 24000 bytes looks like this might be a decoding issue rather than an Azure issue as I'd expect a much higher compression ratio.
Note that this generates a raw PCM file, play back with: aplay out.raw -f S16_LE -r 24000 -c 1
#include <stdio.h>
#include <string>
#include <assert.h>
#include <vector>
#include <speechapi_cxx.h>
#include <opus.h>
using namespace Microsoft::CognitiveServices::Speech;
static const std::string subscription_key = "abcd1234"; // insert valid key here
static const std::string service_region = "westus";
static const std::string text = "Hi, this is Azure";
static const int sample_rate = 24000;
#define MAX_FRAME_SIZE 6*960 // from Opus trivial_example.c
int main(int argc, char **argv) {
// create Opus decoder
int err;
OpusDecoder* opus_decoder = opus_decoder_create(sample_rate, 1, &err);
assert(err == OPUS_OK);
// create Azure client
auto azure_speech_config = SpeechConfig::FromSubscription(subscription_key, service_region);
azure_speech_config->SetSpeechSynthesisVoiceName("en-US-JennyNeural");
azure_speech_config->SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat::Audio24Khz16Bit48KbpsMonoOpus);
auto azure_synth = SpeechSynthesizer::FromConfig(azure_speech_config, NULL);
FILE* fp = fopen("out.raw", "w");
int in_bytes=0, decoded_bytes=0;
// callback to capture incoming packets
azure_synth->Synthesizing += [&in_bytes, &decoded_bytes, fp, opus_decoder](const SpeechSynthesisEventArgs& e) {
printf("Synthesizing event received with audio chunk of %zu bytes\n", e.Result->GetAudioData()->size());
auto audio_data = e.Result->GetAudioData();
in_bytes += audio_data->size();
// confirm that this is exactly one valid Opus packet
assert(opus_packet_get_nb_frames((const unsigned char*)audio_data->data(), audio_data->size()) == 1);
// decode the packet
std::vector<uint8_t> decoded_data(MAX_FRAME_SIZE);
int decoded_frame_size = opus_decode(opus_decoder, (const unsigned char*)audio_data->data(), audio_data->size(),
(opus_int16*)decoded_data.data(), decoded_data.size()/sizeof(opus_int16), 0);
assert(decoded_frame_size > 0); // confirm no decode error
decoded_frame_size *= sizeof(opus_int16); // result size is in samples, convert to bytes
printf("Decoded to %d bytes\n", decoded_frame_size);
assert(decoded_frame_size <= (int)decoded_data.size());
fwrite(decoded_data.data(), 1, decoded_frame_size, fp);
decoded_bytes += decoded_frame_size;
};
// perform TTS
auto result = azure_synth->SpeakText(text);
printf("Done, got %d bytes, decoded to %d bytes\n", in_bytes, decoded_bytes);
// cleanup
fclose(fp);
opus_decoder_destroy(opus_decoder);
}

I've received no useful response to this question (I've also asked here and here and even tried Azure's paid support) so I gave up and switched from Audio24Khz16Bit48KbpsMonoOpus to Ogg48Khz16BitMonoOpus which means the Opus encoding is wrapped in an Ogg container, requiring the rather cumbersome libopusfile API to decode. It was kind of a pain to implement but it does the job.

Related

Capturing YUYV in c++ using v4l2

I have a webcam connected to beaglebone via usb. I am coding in c++ and my goal is to capture raw UNCOMPRESSED picture from the webcam.
Firstly i checked what formats are supported via command v4l2-ctl --list-formats and the result was:
Index : 0
Type : Video Capture
Pixel Format: 'MJPG' (compressed)
Name : Motion-JPEG
Index : 1
Type : Video Capture
Pixel Format: 'YUYV'
Name : YUYV 4:2:2
So from this I assume it has to be possible to get an uncompressed picture if i try to use YUYV format.
Knowing this I started writing a program in c++. I successfully written a program to capture a compressed picture, but when trying to capture using format YUYV it doesnt work and i really need some help to get this done.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <linux/videodev2.h>
#include <libv4l2.h>
template <typename typeXX>
void clear_memmory(typeXX* x) {
memset(x, 0, sizeof(*x));
}
void xioctl(int cd, int request, void *arg){
int response;
do{
//ensures we get the correct response.
response = v4l2_ioctl(cd, request, arg);
}
while (response == -1 && ((errno == EINTR) || (errno == EAGAIN)));
if (response == -1) {
fprintf(stderr, "error %d, %s\n", errno, strerror(errno));
exit(EXIT_FAILURE);
}
}
struct LMSBBB_buffer{
void* start;
size_t length;
};
int main(){
const char* dev_name = "/dev/video0";
int width=1920;
int height=1080;
int fd = v4l2_open(dev_name, O_RDWR | O_NONBLOCK, 0);
struct v4l2_format format = {0};
format.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
format.fmt.pix.width = width;
format.fmt.pix.height = height;
format.fmt.pix.pixelformat = V4L2_PIX_FMT_RGB24;//V4L2_PIX_FMT_YUYV //V4L2_PIX_FMT_RGB24
format.fmt.pix.field = V4L2_FIELD_NONE; //V4L2_FIELD_NONE
xioctl(fd, VIDIOC_S_FMT, &format);
printf("Device initialized.\n");
///request buffers
struct v4l2_requestbuffers req = {0};
req.count = 2;
req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
req.memory = V4L2_MEMORY_MMAP;
xioctl(fd, VIDIOC_REQBUFS, &req);
printf("Buffers requested.\n");
///mapping buffers
struct v4l2_buffer buf;
LMSBBB_buffer* buffers;
unsigned int i;
buffers = (LMSBBB_buffer*) calloc(req.count, sizeof(*buffers));
for (i = 0; i < req.count; i++) {
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
(buf).index = i;
xioctl(fd, VIDIOC_QUERYBUF, &buf);
buffers[i].length = (buf).length;
printf("A buff has a len of: %i\n",buffers[i].length);
buffers[i].start = v4l2_mmap(NULL, (buf).length, PROT_READ | PROT_WRITE, MAP_SHARED,fd, (buf).m.offset);
if (MAP_FAILED == buffers[i].start) {
perror("Can not map the buffers.");
exit(EXIT_FAILURE);
}
}
printf("Buffers mapped.\n");
for (i = 0; i < req.count; i++) {
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
(buf).index = i;
ioctl(fd,VIDIOC_QBUF, &(buf));
}
enum v4l2_buf_type type;
type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
ioctl(fd,VIDIOC_STREAMON, &type);
printf("buffers queued and streaming.\n");
int pic_count=0;
///CAPTURE
fd_set fds;
struct timeval tv;
int r;
char out_name[256];
FILE* fout;
do {
FD_ZERO(&fds);
FD_SET(fd, &fds);
// Timeout.
tv.tv_sec = 2;
tv.tv_usec = 0;
r = select(fd + 1, &fds, NULL, NULL, &tv);
} while ((r == -1 && (errno = EINTR)));
if (r == -1) {
perror("select");
exit(EXIT_FAILURE);
}
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
xioctl(fd,VIDIOC_DQBUF, &(buf));
printf("Buff index: %i\n",(buf).index);
sprintf(out_name, "image%03d.ppm",pic_count);
fout = fopen(out_name, "w");
if (!fout) {
perror("Cannot open image");
exit(EXIT_FAILURE);
}
fprintf(fout, "P6\n%d %d 255\n",width, height);
fwrite(buffers[(buf).index].start, (buf).bytesused, 1, fout);
fclose(fout);
pic_count++;
clear_memmory(&(buf));
(buf).type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
(buf).memory = V4L2_MEMORY_MMAP;
xioctl(fd,VIDIOC_DQBUF, &(buf));
printf("Buff index: %i\n",(buf).index);
sprintf(out_name, "image%03d.ppm",pic_count);
fout = fopen(out_name, "w");
if (!fout) {
perror("Cannot open image");
exit(EXIT_FAILURE);
}
fprintf(fout, "P6\n%d %d 255\n",width, height);
fwrite(buffers[(buf).index].start, (buf).bytesused, 1, fout);
fclose(fout);
pic_count++;
///xioctl(fd,VIDIOC_QBUF, &(buf));
return 0;
}
in line 50, i can choose the format between V4L2_PIX_FMT_YUYV and V4L2_PIX_FMT_RGB24.
for V4L2_PIX_FMT_RGB24 i get the picture, but when using V4L2_PIX_FMT_YUYV I get this error:
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
libv4l2: error dequeuing buf: Resource temporarily unavailable
the error lines goes for ever until i end the program manually.
Does anyone have an idea what to do? I spent over 2 weeks on this and i can't move anywhere from here. I would really appreciate any advice.
From what I see you are requesting a FullHD (1920x1080) buffer in YUYV format from a camera. You did not mention the camera type/model/specs, but if it is a generic USB-attached hardware most likely you will not get a raw FullHD YUYV buffer as an output, only the MJPEG one (which you can decode to YUV, if you hack around with libjpeg) or the decoded RGB buffer (which is pretty much the decoded MJPEG with YUV->RGB conversion) which is not mmapped.
The exact list of formats with framerates can be requested by this command, which would probably tell you it does not provide a 1920x1080 YUYV, only something smaller, like 640x480:
v4l2-ctl --list-formats
If you need video processing with "true" zero-copy access to raw YUYV camera frames, you need direct access to hardware and that specific hardware in the first place. Once you have the USB interface between your software and the camera, you get an extra indirection and that means the speed goes down. Think for a moment, the YUYV frame at 1920x1080 takes up approximately 4 Megabytes of memory. At 30 FPS this is 120 Megabytes (or 960 Megabits) per second of bus throughput. If you have a USB2.0 camera, there is just no bandwidth to support this (thus the need for MJPEG). Even at 15FPS this is 480 Megabits, not counting the USB latency and protocol overhead.
To provide some "actionable feedback" I would advice to first concentrate on the algorithms (probably, you just don't want to loose the processing speed at the very first step) which you want to apply to the image. Don't hesitate to use OpenCV for camera input and basic image processing, later you can switch to some hardware interface and hand-written algorithms.
The easier way of getting raw frames would be to use Android's camera interface and try to process the incoming frames with GLSL shaders using the GL_TEXTURE_EXTERNAL_OES extension, about which there information and code samples available. There you can connect GL textures to AHardwareBuffer instances and then use AHardwareBuffer_lock function to get raw pointers. The exact supported formats also may vary across the hardware, so do not expect this to be super-easy.
I've recently had a similar issue. In my case the camera driver needed the VIDIOC_S_PARM ioctl in order to set the frame rate and initialize the camera for the selected capture mode.
You can try to add this code after the VIDIOC_S_FMT and see if it works for you as well:
struct v4l2_streamparm streamparam;
memset(&streamparam, 0, sizeof(streamparam));
streamparam.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
xioctl(fd, VIDIOC_G_PARM, &streamparam);
streamparam.parm.capture.timeperframe.numerator = 1;
streamparam.parm.capture.timeperframe.denominator = 5;
xioctl(fd, VIDIOC_S_PARM, &streamparam);

avformat_open_input cannot open a file with raw opus audio data

I have a problem when trying to open a binary file containing raw audio data in opus format. When I try to open this file, the library returns an error: Unknown input format: opus.
How can I open this file ?
I need to open it and write all the raw audio data to an audio container. I understand that the opus format is intended only for encoding. I realized this using command:
$ ffmpeg -formats | grep Opus
ffmpeg version 3.4.4 Copyright (c) 2000-2018 the FFmpeg developers
E opus Ogg Opus # For only encoding
Then what format should I use to open this file ? With ogg ? I tried, but there are also problems with opening the outgoing file. I provide the code that shows only the necessary part to open the file:
int main(int argc, char *argv[])
{
// ...
av_register_all();
AVFormatContext *iFrmCtx = nullptr;
AVFormatContext *oFrmCtx = nullptr;
AVPacket packet;
const char *iFilename = "opus.bin"; // Raw audio data with `opus` format
const char *oFilename = "opus.mka"; // Audio file with `opus` audio format
AVDictionary* frmOpts = nullptr;
const qint32 smpRateErrorCode = av_dict_set_int(&frmOpts, "sample_rate", 8000, 0);
const qint32 bitRateErrorCode = av_dict_set_int(&frmOpts, "bit_rate", 64000, 0);
const qint32 channelErrorCode = av_dict_set_int(&frmOpts, "channels", 2, 0);
if (smpRateErrorCode < 0 ||
bitRateErrorCode < 0 ||
channelErrorCode < 0) {
return EXIT_FAILURE;
}
AVInputFormat *iFrm = av_find_input_format("opus"); // Error: Unknown input format
if (iFrm == nullptr) {
av_dict_free(&frmOpts);
return EXIT_FAILURE;
}
qint32 ret = 0;
if ((ret = avformat_open_input(&iFrmCtx, iFilename, iFrm, &frmOpts)) < 0) {
av_dict_free(&frmOpts);
return EXIT_FAILURE;
}
// We're doing something...
}
As said before Opus is not self-delimiting, it needs a container. and since you got the raw data from rtp payload, and the opus codecs is a dynamic codecs (with dynamic payload size) you can't use ffmpeg AVFormatContext to read raw data from the file.
But you can work around this issue and instead of using (av_read_frame) to fill the AVPacket to decode them you can manually fill the AVPacket data and size and then push it to the decoder.
note that you should also update the pts and dts for each AVPacket .

Processing audio file in memory with lib sox

I am trying to process audio file in memory with SOX C++ API and I stuck at the very beginning. The goal is to load an audio file from disk, apply few effects (tempo/gain adjustments) in memory. Here is the code I started with, but I receive a strange error when creating out stream:
formats: can't open output file `': No such file or directory
What could be an issue here? I am testing it on Mac. Here is the code:
#include <iostream>
#include <sox.h>
#include <assert.h>
int main() {
sox_format_t * in, *out;
sox_effect_t * e;
sox_init();
in = sox_open_read("/path/to/file.wav", NULL, NULL, NULL);
sox_format_t *out_format = (sox_format_t *)malloc(sizeof(sox_format_t));
memcpy(out_format, in, sizeof(sox_format_t));
char * buffer;
size_t buffer_size;
out = sox_open_memstream_write(&buffer, &buffer_size, &in->signal, NULL, "sox", NULL);
//chain = sox_create_effects_chain(&in->encoding, &out->encoding);
//e = sox_create_effect(sox_find_effect("input"));
return 0;
}
sox_open_memstream_write() uses either fmemopen() or open_memstream() depending on the parameters you pass.
Some (or all) versions of OSX do not have these functions.
The same is true for Windows.
You can find the relevant code in file src/formats.c, function open_write(), look for the #ifdef HAVE_FMEMOPEN conditionals.

Portaudio + Opus encoding / decoding audio input

I'm working on a VOIP client using Portaudio and opus.
I read from the microphone in a frame
-encode each frame with Opus and put it in a list
-pop the first element from the list and decode it
-read it with portaudio
If i do the same thing without encoding my sound it works great. But when I use Opus my sound is bad, I can't understand the voice (which is bad for a voip client)
HandlerOpus::HandlerOpus(int sample_rate, int num_channels)
{
this->num_channels = num_channels;
this->enc = opus_encoder_create(sample_rate, num_channels, OPUS_APPLICATION_VOIP, &this->error);
this->dec = opus_decoder_create(sample_rate, num_channels, &this->error);
opus_int32 rate;
opus_encoder_ctl(enc, OPUS_GET_BANDWIDTH(&rate));
this->encoded_data_size = rate;
}
HandlerOpus::~HandlerOpus(void)
{
opus_encoder_destroy(this->enc);
opus_decoder_destroy(this->dec);
}
unsigned char *HandlerOpus::encodeFrame(const float *frame, int frame_size)
{
unsigned char *compressed_buffer;
int ret;
compressed_buffer = new (unsigned char[this->encoded_data_size]);
ret = opus_encode_float(this->enc, frame, frame_size, compressed_buffer, this->encoded_data_size);
return (compressed_buffer);
}
float *HandlerOpus::decodeFrame(const unsigned char *data, int frame_size)
{
int ret;
float *frame = new (float[frame_size * this->num_channels]);
opus_packet_get_nb_channels(data);
ret = opus_decode_float(this->dec, data, this->encoded_data_size, frame, frame_size, 0);
return (frame);
}
I can't change the library I have to use Opus.
The sample rate is 48000 and the frames per buffer is 480 and I tried in mono and stereo.
What am I doing wrong?
I solved the problem myself I changed the config : The sample rate to 24000 and the frames per buffer is still 480.
It's 6 years later, but I'm gonna post an answer for future googlers like me:
I had very similiar problem and fixed it by changing PortAudio sample type to paInt32 and switched from opus_decode_float to just opus_decode

process video stream from memory buffer

I need to parse a video stream (mpeg ts) from proprietary network protocol (which I already know how to do) and then I would like to use OpenCV to process the video stream into frames. I know how to use cv::VideoCapture from a file or from a standard URL, but I would like to setup OpenCV to read from a buffer(s) in memory where I can store the video stream data until it is needed. Is there a way to setup a call back method (or any other interfrace) so that I can still use the cv::VideoCapture object? Is there a better way to accomplish processing the video with out writing it out to a file and then re-reading it. I would also entertain using FFMPEG directly if that is a better choice. I think I can convert AVFrames to Mat if needed.
I had a similar need recently. I was looking for a way in OpenCV to play a video that was already in memory, but without ever having to write the video file to disk. I found out that the FFMPEG interface already supports this through av_open_input_stream. There is just a little more prep work required compared to the av_open_input_file call used in OpenCV to open a file.
Between the following two websites I was able to piece together a working solution using the ffmpeg calls. Please refer to the information on these websites for more details:
http://ffmpeg.arrozcru.org/forum/viewtopic.php?f=8&t=1170
http://cdry.wordpress.com/2009/09/09/using-custom-io-callbacks-with-ffmpeg/
To get it working in OpenCV, I ended up adding a new function to the CvCapture_FFMPEG class:
virtual bool openBuffer( unsigned char* pBuffer, unsigned int bufLen );
I provided access to it through a new API call in the highgui DLL, similar to cvCreateFileCapture. The new openBuffer function is basically the same as the open( const char* _filename ) function with the following difference:
err = av_open_input_file(&ic, _filename, NULL, 0, NULL);
is replaced by:
ic = avformat_alloc_context();
ic->pb = avio_alloc_context(pBuffer, bufLen, 0, pBuffer, read_buffer, NULL, NULL);
if(!ic->pb) {
// handle error
}
// Need to probe buffer for input format unless you already know it
AVProbeData probe_data;
probe_data.buf_size = (bufLen < 4096) ? bufLen : 4096;
probe_data.filename = "stream";
probe_data.buf = (unsigned char *) malloc(probe_data.buf_size);
memcpy(probe_data.buf, pBuffer, probe_data.buf_size);
AVInputFormat *pAVInputFormat = av_probe_input_format(&probe_data, 1);
if(!pAVInputFormat)
pAVInputFormat = av_probe_input_format(&probe_data, 0);
// cleanup
free(probe_data.buf);
probe_data.buf = NULL;
if(!pAVInputFormat) {
// handle error
}
pAVInputFormat->flags |= AVFMT_NOFILE;
err = av_open_input_stream(&ic , ic->pb, "stream", pAVInputFormat, NULL);
Also, make sure to call av_close_input_stream in the CvCapture_FFMPEG::close() function instead of av_close_input_file in this situation.
Now the read_buffer callback function that is passed in to avio_alloc_context I defined as:
static int read_buffer(void *opaque, uint8_t *buf, int buf_size)
{
// This function must fill the buffer with data and return number of bytes copied.
// opaque is the pointer to private_data in the call to avio_alloc_context (4th param)
memcpy(buf, opaque, buf_size);
return buf_size;
}
This solution assumes the entire video is contained in a memory buffer and would probably have to be tweaked to work with streaming data.
So that's it! Btw, I'm using OpenCV version 2.1 so YMMV.
Code to do similar to the above, for opencv 4.2.0 is on:
https://github.com/jcdutton/opencv
Branch: 4.2.0-jcd1
Load the entire file into RAM pointed to by buffer, of size buffer_size.
Sample code:
VideoCapture d_reader1;
d_reader1.open_buffer(buffer, buffer_size);
d_reader1.read(input1);
The above code reads the first frame of video.