Decoding AAC audio with ffmpeg - c++

I'm trying to decode an AAC audio stream in an ADTS container, which is streamed from an external hardware H264 encoder.
I've parsed out the ADTS header and it tells me I have a 2-channel, 44100 Hz, AAC Main profile frame. I set up the extradata bytes for the ffmpeg decoder and decode the frame, apparently successfully, as follows:
(pseudo c++ code)
setup the decoder:
context->codec = avcodec_find_decoder(codec_id);
context->av_codec_context = avcodec_alloc_context3(context->codec);
avcodec_open2(context->av_codec_context, context->codec, nullptr);
av_init_packet(&context->av_raw_packet);
setup the extra data bytes:
// AOT_MAIN, 44.1kHz, Stereo
// 00001010 00010000
// extradata = 0x0A, 0X10
memcpy(context->av_codec_context->extradata, extradata, extradataLength);
avcodec_open2(context->av_codec_context, context->codec, nullptr);
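One thing the snippet above doesn't show: FFmpeg expects extradata to be allocated with the av_malloc family, padded with AV_INPUT_BUFFER_PADDING_SIZE, and set (together with extradata_size) before avcodec_open2 is called. A minimal sketch of that setup, reusing the names from the pseudocode above:
// Allocate padded extradata and set its size before opening the codec (sketch).
context->av_codec_context->extradata = (uint8_t*) av_mallocz(extradataLength + AV_INPUT_BUFFER_PADDING_SIZE);
memcpy(context->av_codec_context->extradata, extradata, extradataLength);
context->av_codec_context->extradata_size = extradataLength;
// Open the codec once, after the extradata is in place.
avcodec_open2(context->av_codec_context, context->codec, nullptr);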
then decode the frame:
// decode frame
const int len = avcodec_decode_audio4(context->av_codec_context, context->frame, &got_frame, &context->av_raw_packet);
*sampleRate = context->av_codec_context->sample_rate;
*sampleFormat = context->av_codec_context->sample_fmt;
*bitsPerSample = av_get_bytes_per_sample(context->av_codec_context->sample_fmt) * 8;
*channels = context->av_codec_context->channels;
*channelLayout = context->av_codec_context->channel_layout;
// get frame
*outDataSize = av_samples_get_buffer_size(nullptr, context->av_codec_context->channels, context->frame->nb_samples, context->av_codec_context->sample_fmt, 1);
The decoded frame:
// array of 8192 bytes, context info is as expected:
context->av_codec_context->channels = 2
context->av_codec_context->channel_layout = 3 (AV_CH_LAYOUT_STEREO)
context->frame->format = 8 (AV_SAMPLE_FMT_FLTP) // float, planar
context->frame->sample_rate = 44100
Now, as I understand it, each sample in the raw 32-bit format will be 4 bytes, and the channels will be interleaved (so the channel alternates every 4 bytes). That leaves me with 1024 samples for each channel (8192 bytes / 4 bytes per sample / 2 channels).
I've tried exporting multiple frames of this data to a file, and importing as a raw file (32-bit float, 2 channel 44100Hz, little endian) in Audacity to sanity check. Instead of music, all I get is noise and the detected length of the audio is way longer than I would have expected (5 seconds dumped to file, but Audacity says 22.5 seconds). I've tried a variety of import format settings. What am I likely doing wrong here?
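(AV_SAMPLE_FMT_FLTP is planar rather than interleaved: the left channel's samples live in frame->data[0] and the right channel's in frame->data[1], so a packed little-endian float stream for Audacity has to be built by interleaving the two planes. A minimal sketch, assuming a 2-channel FLTP frame and an already opened FILE* rawFile:)
// Interleave a planar float (FLTP) stereo frame into packed floats: L R L R ...
const float* left  = (const float*) frame->data[0];
const float* right = (const float*) frame->data[1];
std::vector<float> packed(frame->nb_samples * 2);
for (int i = 0; i < frame->nb_samples; i++) {
    packed[2 * i]     = left[i];
    packed[2 * i + 1] = right[i];
}
fwrite(packed.data(), sizeof(float), packed.size(), rawFile);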
I'm a little new to working with audio, so I may be misunderstanding something.
Edit: I tried panning the audio to the right channel, and it's reflected in the data. It also looks like there is a repeating pattern exactly 1024 samples apart, which suggests a programming error with a buffer not getting overwritten after the first frame.

This was nothing more than a difficult bug to find. Zooming in on the audio sample in Audacity revealed the repeating pattern of 1024 samples wide.
A buffer was in fact not being updated properly and I was processing the same audio frame over and over:
for (var offset = 0; offset < packet.Length; offset++) {
    var frame = ReadAdtsFrame();
    // offset += frame.Length;
    // ^ essentially this was missing, so the frame buffer was always the first frame
}
I will leave this here to display my shame to the world, and as a reminder that most often it's your own bugs that get you in the end.

Related

ESP32 - Store RGB from decoded JPEG MCUs in buffer

We are currently working on implementing motion detection for the ESP32-cam. To be able to work with the motion detection we want to access the raw pixels of the image, but also want to have a compressed JPEG to send over a network.
We use the JPEGDEC library, where they have an example that draws a decoded JPEG image to an LCD screen. The decode function is supplied a callback function that should handle the drawing of each MCU. However, instead of drawing each MCU immediately, we want to store the RGB565 data for the entire image in the variable rgb. This RGB image will then be used for the motion detection.
We have tried to implement the code below but can not get it to work and also have some questions:
We want to have the RGB data as uint8_t but the supplied pixels from JPEGDEC (pDraw->pPixels) are of type uint16_t. How can we handle this?
As of now we try to allocate enough memory to store the RGB image, but the malloc function returns NULL. Is it correct to allocate this amount below?
#include <JPEGDEC.h>

uint8_t *rgb = (uint8_t*) malloc(sizeof(uint8_t) * WIDTH * HEIGHT);
JPEGDEC jpeg;
int pixel_counter = 0;

void loop()
{
    camera_fb_t* frame = esp_camera_fb_get();
    jpeg.openRAM((uint8_t*) frame->buf, frame->len, drawMCU);
    jpeg.setPixelType(RGB565_LITTLE_ENDIAN);
    jpeg.decode(0, 0, JPEG_SCALE_HALF);
    jpeg.close();
    pixel_counter = 0;
}

int drawMCU(JPEGDRAW *pDraw) {
    for (int i = 0; i < pDraw->iWidth * pDraw->iHeight; i++) {
        rgb[pixel_counter] = pDraw->pPixels[i];
        pixel_counter++;
    }
}
Rather than calling malloc in a global initializer, move the allocation into your setup() function. Also, you need to allocate space for the r, g and b bytes, which is 3 times WIDTH * HEIGHT.
Then extract the RGB bytes from the RGB565 data and store them in the array.
uint8_t *rgb = NULL;

void setup() {
    rgb = (uint8_t*) malloc(WIDTH * HEIGHT * 3);
}

int drawMCU(JPEGDRAW *pDraw)
{
    // Copy this MCU block into the full-image buffer at its (x, y) position,
    // expanding each RGB565 pixel into separate R, G and B bytes.
    for (int i = 0; i < pDraw->iWidth * pDraw->iHeight; i++) {
        uint8_t b = (pDraw->pPixels[i] & 0x001F) << 3;
        uint8_t g = (pDraw->pPixels[i] & 0x07E0) >> 3;
        uint8_t r = (pDraw->pPixels[i] & 0xF800) >> 8;
        int x = pDraw->x + (i % pDraw->iWidth);
        int y = pDraw->y + (i / pDraw->iWidth);
        int idx = (y * WIDTH + x) * 3; // WIDTH = width of the decoded image
        rgb[idx]     = r;
        rgb[idx + 1] = g;
        rgb[idx + 2] = b;
    }
    return 1; // tell JPEGDEC to keep decoding
}
Or, don't use malloc at all since WIDTH and HEIGHT are known at compile time:
uint8_t rgb[WIDTH * HEIGHT * 3];
It's not clear to me why you need this. I've done the same thing, extracted pixels to RGB565 format and shown a live video stream on a 65k-color SSD1331 OLED, without using an external JPEG decoder.
You don't need a separate decoder at all; I used the conversion functions provided with the camera driver. Search for "img_converters.h".
In the Arduino IDE open examples/ESP32/Camera/CameraWebServer.ino, switch to the app_httpd.cpp tab and you can see this file being included.
This header contains all the functions needed to convert between image formats.
On my Linux system I found it in the ESP32 Arduino core at this path:
/home/massimo/Arduino/hardware/espressif/esp32/tools/sdk/include/esp32-camera/img_converters.h
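As an illustration (a sketch, not code from the answer above): the fmt2rgb888() helper declared in img_converters.h converts a camera frame buffer, including JPEG frames, into a packed 8-bit-per-channel RGB buffer, which is exactly the raw pixel access the question asks about. It assumes the converted frame fits in RAM, so keep the resolution low:
#include "esp_camera.h"
#include "img_converters.h"

// Convert one camera frame (JPEG or another pixformat) to RGB888.
// rgbOut must hold 3 * width * height bytes.
bool frameToRGB888(camera_fb_t *fb, uint8_t *rgbOut)
{
    return fmt2rgb888(fb->buf, fb->len, fb->format, rgbOut);
}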
The header even provides some functions to draw overlays on frames, such as lines (see the red and green rectangles used for face recognition) and text. I'm interested in this myself; if someone finds more information I'd like to try it.
How big a buffer you need depends on the capture size. Don't expect to fit a full 1600x1200 frame in ESP RAM: at 2 bytes per pixel (RGB565) that's far too much data to capture in a single pass, so you need to split it into, for example, 8/16/32/64 strips.
The OLED is 96x64, so I set the camera to a very low resolution (160x120); there isn't much sense in capturing a big image and then downscaling it. The captured frame is JPEG by default, so I converted the whole frame to RGB565 in a single array using a function from img_converters, then removed the odd rows and columns so the final image is 80x60 pixels. Finally I pass the entire frame to the OLED library I wrote for this display in one call that pushes the buffer in a single pass:
oled.pushPixels(pixelData, len);
Reading from SD and drawing on the OLED takes 6-7 milliseconds per frame at fullscreen 96x64.
An 80x60, 16-bit-color image (a frame extracted from the ESP32-CAM to draw on the OLED) needs 80 x 60 x 2 = 9600 bytes, which is no problem; I've even done it on an ESP8266 and managed to allocate up to 32 kB. But when I started converting my code to drive a 320x240 ILI9341 I couldn't handle a whole frame at once; it is too much data, and the only way was to split it into 8 or 16 strips per frame.
Because I read the images from the SD card, I read one strip, render it on the display, load the next strip, render it, and so on. This slows things down a lot.
My OLED library is very fast: it can show videos (read from microSD) at 96x64 at nearly 130 FPS. When I tried the same video (Matrix), but bigger, on the TFT it dropped to at most 12 FPS for a 320x180 video, with the ESP8266 clocked at 160 MHz and SPI set to 50 MHz. That's a lot of data, yet 12 FPS on a TFT is still remarkable: it is not a PC, a smartphone or a Raspberry Pi, it is a 4$ microcontroller + 6$ 320x240 2.4" ILI9341 touchscreen TFT + 1$ card reader = a touchscreen video player.
With the OLED I've managed to play video at 30 FPS with 44100 Hz 16-bit stereo audio, synced, out of an external PCM5102A DAC, and it is fantastic, even with two MAX98357A 3 W I2S amplifiers making a stereo pair attached directly to hi-fi speakers. Using an audio library I wrote, it can play films downloaded from YouTube or other sources (with high-quality audio out of the box) on a small color OLED the size of a coin, which makes it a good base for a small ESP IoT smartwatch (though not touchscreen). It works with the ESP8266 and the ESP32. I haven't tried the external DAC on the ESP32-CAM yet; it may work, but not together with the SD card, so I doubt it can play back from SD_MMC since there are no pins left.

ffmpeg Get Audio Samples in a specific AVSampleFormat from AVFrame

I am looking at the following example from the ffmpeg docs:
static int output_audio_frame(AVFrame *frame)
{
    size_t unpadded_linesize = frame->nb_samples * av_get_bytes_per_sample(frame->format);
    printf("audio_frame n:%d nb_samples:%d pts:%s\n",
           audio_frame_count++, frame->nb_samples,
           av_ts2timestr(frame->pts, &audio_dec_ctx->time_base));

    /* Write the raw audio data samples of the first plane. This works
     * fine for packed formats (e.g. AV_SAMPLE_FMT_S16). However,
     * most audio decoders output planar audio, which uses a separate
     * plane of audio samples for each channel (e.g. AV_SAMPLE_FMT_S16P).
     * In other words, this code will write only the first audio channel
     * in these cases.
     * You should use libswresample or libavfilter to convert the frame
     * to packed data. */
    fwrite(frame->extended_data[0], 1, unpadded_linesize, audio_dst_file);

    return 0;
}
The issue is that the decoder's output format can't be chosen, so it may give me audio samples in any of the following types:
enum AVSampleFormat {
AV_SAMPLE_FMT_NONE = -1, AV_SAMPLE_FMT_U8, AV_SAMPLE_FMT_S16, AV_SAMPLE_FMT_S32,
AV_SAMPLE_FMT_FLT, AV_SAMPLE_FMT_DBL, AV_SAMPLE_FMT_U8P, AV_SAMPLE_FMT_S16P,
AV_SAMPLE_FMT_S32P, AV_SAMPLE_FMT_FLTP, AV_SAMPLE_FMT_DBLP, AV_SAMPLE_FMT_S64,
AV_SAMPLE_FMT_S64P, AV_SAMPLE_FMT_NB
}
I am working with a sound engine, and the engine requires me to send it float PCM data in the range [-1, 1], so I would like to obtain the frame's audio data as floats for the two channels (stereo music). How can I do that? Do I need to use libswresample? If so, can anyone show me an example for my case?
Encoding audio Example
Resampling audio Example
Transcoding Example
If you don't get the desired format from the decoder, you have to resample it, i.e. convert the frames to AV_SAMPLE_FMT_FLT.
According to enum AVSampleFormat
The floating-point formats are based on full volume being in the range [-1.0, 1.0]. Any values outside this range are beyond full volume level.
All the examples are well documented and not that complicated. The function names alone are quite self-explanatory, so it shouldn't be hard to follow.
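If it helps, here is a minimal libswresample sketch of that conversion. It uses the older channel-layout API (FFmpeg 4.x era); dec_ctx is assumed to be the opened decoder context and frame one decoded AVFrame, so adapt the names to your code:
extern "C" {
#include <libswresample/swresample.h>
}
#include <vector>

// Set up once: convert whatever the decoder outputs to packed float stereo.
SwrContext *swr = swr_alloc_set_opts(nullptr,
        AV_CH_LAYOUT_STEREO, AV_SAMPLE_FMT_FLT, dec_ctx->sample_rate,        // output
        dec_ctx->channel_layout, dec_ctx->sample_fmt, dec_ctx->sample_rate,  // input
        0, nullptr);
swr_init(swr);

// Per decoded frame: interleaved floats in [-1.0, 1.0], ready for the engine.
std::vector<float> pcm(frame->nb_samples * 2);
uint8_t *out[1] = { (uint8_t*) pcm.data() };
int converted = swr_convert(swr, out, frame->nb_samples,
                            (const uint8_t**) frame->extended_data, frame->nb_samples);
// 'converted' samples per channel are now in pcm; hand pcm.data() to the engine.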

Can 8-bit (bits per sample) PCM WAV files contain more than one channel?

I realize it's bad for me to neglect this thought, because I haven't read anything about the number of channels and bits per sample in this light. My reason is that I'm not sure what the samples of a 2-channel, 8-bit PCM file will look like.
Is it 1 sample = 1 channel? Or 1 sample = 4 bits (left) + 4 bits (right)?
Context:
I am writing a program that reads WAV files, and it occurred to me that if I come across 8-bit PCM WAV files, and my code reads this way (see below), then my program is unable to properly read multi-channel 8-bit PCM WAV files.
// read actual audio data after obtaining the headers;
// audioData is a vector of vectors (1 vector per channel)
uint32_t temp;
while (!feof(wavFile)) {
    for (uint16_t i = 0; i < numChannels; i++) {
        temp = 0;
        fread(&temp, sizeof(uint8_t), 1, wavFile);
        audioData.at(i).push_back(temp);
    }
}
The structure that typically describes the format of WAV audio data is documented on MSDN: the WAVEFORMATEX structure:
"sample" for PCM audio is a block of data, which includes all channels
nBlockAlign value is size, in bytes, of such block corresponding to sample
samples go at specific fixed rate, defined by nSamplesPerSec value
each sample block consists of nChannels values, each wBitsPerSample bits wide
That is, a two-channel file with 8 bits per sample has nSamplesPerSec sample blocks for each second of audio data, and each block holds two 8-bit values, one for each of the two channels.
(here is an example of where this structure exists in the WAV file - though this is a more complicated case with 24-bits/sample, but you should get the idea).
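To make that concrete, here is a small sketch of reading one sample block per iteration for an 8-bit stereo file. Note that 8-bit PCM WAV data is unsigned (0-255, silence at 128), unlike the signed 16-bit case; wavFile and audioData follow the names used in the question:
// One sample block = one 8-bit value per channel; for stereo read two bytes: left, then right.
uint8_t block[2];
while (fread(block, sizeof(uint8_t), 2, wavFile) == 2) {
    audioData.at(0).push_back(block[0]); // left channel, unsigned 0..255
    audioData.at(1).push_back(block[1]); // right channel
}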

FFMPEG API: decode MPEG to YUV frames and change these frames

I need to save all frames from an MPEG-4 or H.264 video as YUV frames using a C++ library, for example in .yuv, .y4m or .y format. Then I need to read those frames as plain files and change some samples (Y values). How can I do this without converting to RGB?
Also, how are the values stored in AVFrame->data? Where are the Y, U and V values kept?
Thanks, and sorry for my English =)
If you use libav* to decode, you will receive the frames in their native colorspace (usually YUV 4:2:0), but it is whatever was chosen at encode time. Assuming you are in (or convert to) YUV420P: Y is in AVFrame->data[0], U in AVFrame->data[1], V in AVFrame->data[2].
For Y there is 1 byte per pixel: AVFrame->data[0][y * AVFrame->linesize[0] + x]
For U and V there is one byte per 2x2 block of pixels (quarter the resolution of the Y plane), so:
AVFrame->data[1][(y/2) * AVFrame->linesize[1] + (x/2)], AVFrame->data[2][(y/2) * AVFrame->linesize[2] + (x/2)]
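A minimal sketch of both steps, changing Y values and dumping one raw planar frame to a .yuv file, assuming the decoded frame really is AV_PIX_FMT_YUV420P (adjust the chroma loops for other subsamplings):
#include <cstdio>
extern "C" {
#include <libavutil/frame.h>
}

static void darkenAndDump(AVFrame *frame, FILE *out)
{
    // Modify the luma plane in place (here: reduce brightness by 20%).
    for (int row = 0; row < frame->height; row++) {
        uint8_t *y = frame->data[0] + row * frame->linesize[0];
        for (int col = 0; col < frame->width; col++)
            y[col] = (uint8_t)(y[col] * 4 / 5);
    }
    // Write Y, then U, then V, row by row (linesize may be wider than the image).
    for (int row = 0; row < frame->height; row++)
        fwrite(frame->data[0] + row * frame->linesize[0], 1, frame->width, out);
    for (int row = 0; row < frame->height / 2; row++)
        fwrite(frame->data[1] + row * frame->linesize[1], 1, frame->width / 2, out);
    for (int row = 0; row < frame->height / 2; row++)
        fwrite(frame->data[2] + row * frame->linesize[2], 1, frame->width / 2, out);
}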

Image data of PCX file

I have a big binary file with lots of files stored inside it. I'm trying to copy the data of a PCX image out of that file and write it to a new file which I can then open in an image editor.
After obtaining the specs for the PCX header I think I've located the image in the big binary file. My problem is that I cannot figure out how many bytes I'm supposed to read after the header. I read about decoding PCX files, but I don't want to decode anything; I want to read the encoded image data and write it to a separate file so an image editor can open it.
Here is the header. I've included the values of the image as I guess they can be used to determine the "end-of-file" for the image data.
struct PcxHeader
{
BYTE Identifier; // PCX Id Number (Always 0x0A) // 10
BYTE Version; // Version Number // 5
BYTE Encoding; // Encoding Format // 1
BYTE BitsPerPixel; // Bits per Pixel // 8
WORD XStart; // Left of image // 0
WORD YStart; // Top of Image // 0
WORD XEnd; // Right of Image // 319
WORD YEnd; // Bottom of image // 199
WORD HorzRes; // Horizontal Resolution // 320
WORD VertRes; // Vertical Resolution // 200
BYTE Palette[48]; // 16-Color EGA Palette
BYTE Reserved1; // Reserved (Always 0)
BYTE NumBitPlanes; // Number of Bit Planes // 1
WORD BytesPerLine; // Bytes per Scan-line // 320
WORD PaletteType; // Palette Type // 0
WORD HorzScreenSize; // Horizontal Screen Size // 0
WORD VertScreenSize; // Vertical Screen Size // 0
BYTE Reserved2[54]; // Reserved (Always 0)
};
There are three components to the PCX file format:
128-byte header (though less are actually used, it is 128 bytes long)
variable-length image data
optional 256 color palette (though improper PCX files exist with palette sizes other than 256 colors).
From the Wikipedia article:
Due to the PCX compression scheme the only way to find the actual length of the image data is to read and process it. This effort is made difficult because the format allows for the compressed data to run beyond the image dimensions, often padding it to the next 8 or 16 line boundary.
In general, then, it sounds like you'll have to do a "deep process" of the image data to find the complete PCX file embedded within your larger binary file.
Without knowing much about the PCX file format, I can take a best guess at this:
bytesAfterHeader = header.BytesPerLine * header.VertRes;
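Since Encoding = 1 means the image data is RLE-compressed, that product is the decoded size, not the number of encoded bytes in the file; as the Wikipedia quote says, the stream has to be walked to find where it really ends. A sketch of that scan using the PcxHeader above (pcxData points just past the 128-byte header; a trailing 769-byte VGA palette, if present, follows the point this returns):
#include <cstdint>
#include <cstddef>

// Walk the RLE-encoded image data to find its encoded length. A byte with the
// top two bits set is a run marker (low 6 bits = run count, next byte = value);
// anything else is a single literal byte.
size_t pcxImageDataLength(const uint8_t *pcxData, size_t available, const PcxHeader &h)
{
    const size_t rows = (size_t)(h.YEnd - h.YStart + 1);
    const size_t decodedTarget = (size_t)h.BytesPerLine * h.NumBitPlanes * rows;
    size_t decoded = 0, offset = 0;
    while (decoded < decodedTarget && offset < available) {
        uint8_t b = pcxData[offset++];
        if ((b & 0xC0) == 0xC0) {   // run of (b & 0x3F) copies of the next byte
            decoded += b & 0x3F;
            ++offset;
        } else {
            ++decoded;              // literal byte
        }
    }
    return offset; // encoded bytes consumed after the 128-byte header
}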