We are currently working on implementing motion detection for the ESP32-cam. To be able to work with the motion detection we want to access the raw pixels of the image, but also want to have a compressed JPEG to send over a network.
We use the JPEGDEC library, which has an example that draws a decoded JPEG image to an LCD screen. The decode function is supplied a callback function that should handle the drawing of each MCU. However, instead of instantly drawing each MCU we want to store the RGB565 data for the entire image in the variable rgb. This RGB image will then be used for the motion detection.
We have tried to implement the code below but cannot get it to work, and we also have some questions:
We want to have the RGB data as uint8_t, but the pixels supplied by JPEGDEC (pDraw->pPixels) are of type uint16_t. How can we handle this?
Right now we try to allocate enough memory to store the RGB image, but malloc returns NULL. Is the amount allocated below correct?
#include <JPEGDEC.h>

uint8_t *rgb = (uint8_t*) malloc(sizeof(uint8_t) * WIDTH * HEIGHT);
JPEGDEC jpeg;

void loop()
{
  camera_fb_t* frame = esp_camera_fb_get();
  jpeg.openRAM((uint8_t*) frame->buf, frame->len, drawMCU);
  jpeg.setPixelType(RGB565_LITTLE_ENDIAN);
  jpeg.decode(0, 0, JPEG_SCALE_HALF);
  jpeg.close();
  pixel_counter = 0;
}

int drawMCU(JPEGDRAW *pDraw) {
  for (int i = 0; i < pDraw->iWidth * pDraw->iHeight; i++) {
    rgb[pixel_counter] = pDraw->pPixels[i];
    pixel_counter++;
  }
}
Don't call malloc in a global initializer; if you have a setup() function, move the allocation there. Also, you need to allocate space for the r, g, and b bytes, which is 3 times WIDTH * HEIGHT.
Then extract the RGB bytes from the RGB565 data and store them in the array.
uint8_t *rgb = NULL;

void setup() {
  // 3 bytes per pixel: r, g, b
  rgb = (uint8_t *) malloc(WIDTH * HEIGHT * 3);
}
int drawMCU(JPEGDRAW *pDraw)
{
  // Place each MCU at its (x, y) offset inside the full image so the blocks
  // are reassembled correctly. (If you decode with JPEG_SCALE_HALF, use the
  // half-size image width here instead of WIDTH.)
  for (int y = 0; y < pDraw->iHeight; y++) {
    for (int x = 0; x < pDraw->iWidth; x++) {
      uint16_t pixel = pDraw->pPixels[y * pDraw->iWidth + x];
      uint8_t r = (pixel & 0xF800) >> 8;
      uint8_t g = (pixel & 0x07E0) >> 3;
      uint8_t b = (pixel & 0x001F) << 3;
      int dst = ((pDraw->y + y) * WIDTH + (pDraw->x + x)) * 3;
      rgb[dst]     = r;
      rgb[dst + 1] = g;
      rgb[dst + 2] = b;
    }
  }
  return 1; // tell JPEGDEC to keep decoding
}
Or, don't use malloc at all since WIDTH and HEIGHT are known at compile time:
uint8_t rgb[WIDTH * HEIGHT * 3];
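If even that much still will not fit in internal RAM (a full RGB888 frame is large), the ESP32-CAM module has external PSRAM you can allocate from instead. A minimal sketch, assuming the ESP32 Arduino core's ps_malloc() helper and that PSRAM is enabled for your board:

// allocate the frame buffer from external PSRAM instead of internal heap
rgb = (uint8_t *) ps_malloc(WIDTH * HEIGHT * 3);
if (rgb == NULL) {
  Serial.println("PSRAM allocation failed");
}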
It's not clear to me why you need this approach; I've done the same thing (extracted pixels to RGB565 format and shown a live video stream on a 65k-color SSD1331 OLED) without using an external JPEG decoder.
You do not need a decoder at all; I've used the functions provided with the camera driver. Search for "img_converters.h".
In the Arduino IDE open examples/ESP32/Camera/CameraWebServer.ino, move to the app_httpd.cpp tab and you can see this file included.
This file contains all the functions to convert image formats.
On my Linux system I found it in the ESP32 Arduino core at this location:
/home/massimo/Arduino/hardware/espressif/esp32/tools/sdk/include/esp32-camera/img_converters.h
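As an illustration of what those helpers look like (untested here, and I'm going from memory of the header), converting a captured JPEG frame straight to an RGB888 byte buffer is roughly:

#include "img_converters.h"

camera_fb_t *fb = esp_camera_fb_get();
// rgb888 must already point to width * height * 3 bytes for the capture size
bool ok = fmt2rgb888(fb->buf, fb->len, fb->format, rgb888);
esp_camera_fb_return(fb);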
It even provides some functions to draw overlays on frames, like lines (see the red and green rectangles used for face recognition) and text. I'm interested in this myself, so if someone finds more info I would like to try it.
How big a byte buffer you need depends on the capture size. Don't expect to capture a full 1600x1200 frame into ESP RAM: at 2 bytes per pixel (RGB565) that is far too much data to hold in a single pass, so you need to divide the frame into, for example, 8/16/32/64 strips.
The OLED is 96x64, so I set the camera to a very low resolution; it does not make much sense to capture a big image and then downscale it. I set 160x120. When I capture the frame it is JPEG by default, so I convert the whole frame into a single RGB565 array using a function provided in img_converters, then remove the odd pixel rows and columns so the final image is 80x60 pixels. Finally I pass the entire frame to the OLED library I wrote, which draws it very fast with a single command that writes the buffer in one pass:
oled.pushPixels(pixelData, len);
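(Just to illustrate the row/column decimation; pixelData and small are made-up names here, not my real code:)

// drop every odd row and column of a 160x120 RGB565 frame to get 80x60
for (int y = 0; y < 60; y++) {
  for (int x = 0; x < 80; x++) {
    small[y * 80 + x] = pixelData[(y * 2) * 160 + (x * 2)];
  }
}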
Reading from SD and drawing on the OLED takes 6-7 milliseconds per frame at fullscreen 96x64.
An 80x60 pixel, 16-bit color image (the frame extracted from the ESP32-CAM to draw on the OLED) needs 80 x 60 x 2 = 9600 bytes and there is no problem; I've even done it on an ESP8266, where I tried allocating up to 32 KB. But when I started converting my code to display on a 320x240 ILI9341 I tried hard to handle a single frame, without success: too much data. The only way was to split it into 8 or 16 chunks per frame.
Since I read the images from the SD card, I read one strip, render it on the display, load the next strip, render it, and so on. This slows things down a lot.
My OLED library is very fast: it can show videos (read from microSD) at 96x64 at nearly 130 FPS. When I tried the same video (Matrix), but bigger, on the TFT it dropped to a maximum of 12 FPS for a 320x180 pixel video, with the ESP8266 clocked at 160 MHz and SPI set to 50 MHz. That is a lot of data, yet 12 FPS on a TFT is still incredible: this is not a PC, a smartphone, or a Raspberry Pi, it is a $4 microcontroller + $6 (320x240 2.4" ILI9341 touchscreen TFT) + $1 card reader = a touchscreen video player.
With the OLED I've managed to play videos at 30 FPS with 44100 Hz 16-bit stereo audio synced out of an external PCM5102A DAC, and it is fantastic, even with two MAX98357A 3 W I2S amplifiers making a stereo pair attached directly to hi-fi speakers. Using an audio library I wrote, it can play films downloaded from YouTube or other sources (with high-quality audio out of the box) on a small color OLED about the size of a coin, which would be good for a small ESP IoT smartwatch (but not touchscreen). It works with the ESP8266 and the ESP32. I have not yet tried the external DAC on the ESP32-CAM; it might work, but not together with the SD card, so I doubt it can play back from SD_MMC: there are no more pins.
I'm trying to decode an AAC audio stream in an ADTS container, which is streamed from an external hardware H264 encoder.
I've parsed out the ADTS and it tells me I've got a 2-channel, 44100 Hz AAC Main profile frame. I set up the extra data bytes for the ffmpeg decoder and decode the frame (successfully?) as follows:
(pseudo c++ code)
setup the decoder:
avcodec_find_decoder(codec_id);
avcodec_alloc_context3(context->codec);
avcodec_open2(context->av_codec_context, context->codec, nullptr);
av_init_packet(&context->av_raw_packet);
setup the extra data bytes:
// AOT_MAIN, 44.1kHz, Stereo
// 00001010 00010000
// extradata = 0x0A, 0X10
memcpy(context->av_codec_context->extradata, extradata, extradataLength);
avcodec_open2(context->av_codec_context, context->codec, nullptr);
then decode the frame:
// decode frame
const int len = avcodec_decode_audio4(context->av_codec_context, context->frame, &got_frame, &context->av_raw_packet);
*sampleRate = context->av_codec_context->sample_rate;
*sampleFormat = context->av_codec_context->sample_fmt;
*bitsPerSample = av_get_bytes_per_sample(context->av_codec_context->sample_fmt) * 8;
*channels = context->av_codec_context->channels;
*channelLayout = context->av_codec_context->channel_layout;
// get frame
*outDataSize = av_samples_get_buffer_size(nullptr, context->av_codec_context->channels, context->frame->nb_samples, context->av_codec_context->sample_fmt, 1);
The decoded frame:
// array of 8192 bytes, context info is as expected:
context->av_codec_context->channels = 2
context->av_codec_context->channel_layout = 3 (AV_CH_LAYOUT_STEREO)
context->frame->sample_fmt = 8 (AV_SAMPLE_FMT_FLTP) // float, planar
context->frame->sample_rate = 44100
Now as I understand it, each sample in the raw 32-bit float format is 4 bytes, and each channel will be interleaved (so every 4th byte is the alternating channel). That leaves me with 1024 samples for each channel (8192 bytes / 4 bytes per sample / 2 channels).
I've tried exporting multiple frames of this data to a file and importing it as a raw file (32-bit float, 2 channel, 44100 Hz, little endian) in Audacity as a sanity check. Instead of music, all I get is noise, and the detected length of the audio is much longer than I would have expected (5 seconds dumped to file, but Audacity says 22.5 seconds). I've tried a variety of import format settings. What am I likely doing wrong here?
I'm a little new to working with audio, so I may be misunderstanding something.
Edit: I tried panning the audio to the right channel, and it's reflected in the data. It also looks like a repeating pattern exactly 1024 samples apart, which indicates to me a programming error with a buffer not getting overwritten after the first sample.
This was nothing more than a difficult bug to find. Zooming in on the audio samples in Audacity revealed a repeating pattern 1024 samples wide.
A buffer was in fact not being updated properly and I was processing the same audio frame over and over:
for (var offset = 0; offset < packet.Length; offset++) {
    var frame = ReadAdtsFrame();
    // offset += frame.Length;
    // ^ essentially this was missing, so the frame buffer was always the first frame
}
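For anyone hitting the same thing, here is roughly what the corrected loop does (a sketch with made-up names, not my actual code): read the frame length from the ADTS header and advance by that many bytes.

// aac_frame_length is 13 bits spanning bytes 3-5 of the ADTS header and
// includes the header itself, so it is exactly how far to advance.
size_t offset = 0;
while (offset + 7 <= packet.size()) {
    const uint8_t *hdr = packet.data() + offset;
    size_t frameLength = ((hdr[3] & 0x03) << 11) | (hdr[4] << 3) | ((hdr[5] & 0xE0) >> 5);
    decodeAdtsFrame(hdr, frameLength);   // hand one whole ADTS frame to the decoder
    offset += frameLength;               // the step that was missing
}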
I will leave this here to display my shame to the world, and as a reminder that most often it's your own bugs that get you in the end.
I am looking at the example from ffmpeg docs:
Here
static int output_audio_frame(AVFrame *frame)
{
    size_t unpadded_linesize = frame->nb_samples * av_get_bytes_per_sample(frame->format);
    printf("audio_frame n:%d nb_samples:%d pts:%s\n",
           audio_frame_count++, frame->nb_samples,
           av_ts2timestr(frame->pts, &audio_dec_ctx->time_base));

    /* Write the raw audio data samples of the first plane. This works
     * fine for packed formats (e.g. AV_SAMPLE_FMT_S16). However,
     * most audio decoders output planar audio, which uses a separate
     * plane of audio samples for each channel (e.g. AV_SAMPLE_FMT_S16P).
     * In other words, this code will write only the first audio channel
     * in these cases.
     * You should use libswresample or libavfilter to convert the frame
     * to packed data. */
    fwrite(frame->extended_data[0], 1, unpadded_linesize, audio_dst_file);

    return 0;
}
The issue is that the decoder's output format can't be set, so it will give me audio samples in any of the following types:
enum AVSampleFormat {
AV_SAMPLE_FMT_NONE = -1, AV_SAMPLE_FMT_U8, AV_SAMPLE_FMT_S16, AV_SAMPLE_FMT_S32,
AV_SAMPLE_FMT_FLT, AV_SAMPLE_FMT_DBL, AV_SAMPLE_FMT_U8P, AV_SAMPLE_FMT_S16P,
AV_SAMPLE_FMT_S32P, AV_SAMPLE_FMT_FLTP, AV_SAMPLE_FMT_DBLP, AV_SAMPLE_FMT_S64,
AV_SAMPLE_FMT_S64P, AV_SAMPLE_FMT_NB
}
I am working with a sound engine, and the engine requires me to send float [-1 to 1] PCM data, so I would like to obtain the frame's audio data as float for the two channels (stereo music). How can I do that? Do I need to use libswresample? If so, can anyone send me an example for my case?
Encoding audio Example
Resampling audio Example
Transcoding Example
If you don't get the desired format from the decoder, you have to resample it and convert to AV_SAMPLE_FMT_FLT.
According to enum AVSampleFormat
The floating-point formats are based on full volume being in the range [-1.0, 1.0]. Any values outside this range are beyond full volume level.
All the Examples are well documented and not that complicated. The function names alone are very explanatory, so it shouldn't be that hard to understand.
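Not one of the official examples, but a minimal sketch of what the conversion amounts to in this case, using libswresample's older channel-layout API to match the avcodec_decode_audio4-era code above (assumes a decoded stereo frame):

#include <libswresample/swresample.h>
#include <libavutil/channel_layout.h>
#include <libavutil/samplefmt.h>

// Convert one decoded AVFrame (e.g. AV_SAMPLE_FMT_FLTP) to packed float stereo.
SwrContext *swr = swr_alloc_set_opts(NULL,
        AV_CH_LAYOUT_STEREO, AV_SAMPLE_FMT_FLT, frame->sample_rate,                    // output
        frame->channel_layout, (enum AVSampleFormat)frame->format, frame->sample_rate, // input
        0, NULL);
swr_init(swr);

uint8_t *out = NULL;
av_samples_alloc(&out, NULL, 2, frame->nb_samples, AV_SAMPLE_FMT_FLT, 0);

// out now holds interleaved floats in [-1, 1]: L R L R ...
int converted = swr_convert(swr, &out, frame->nb_samples,
                            (const uint8_t **)frame->extended_data, frame->nb_samples);

// hand (float *)out, converted * 2 floats, to the sound engine, then clean up
av_freep(&out);
swr_free(&swr);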
I'm trying to get the width and height of the sensor area used when recording 1080p video, for an image processing application using a Raspberry Pi camera. I have noticed that the field of view changes between 1080p video and a 1080p still image, even though the resolution is the same. I believe this is due to a bit rate issue with H.264 video.
All of these observations make me unsure how to calculate the correct width and height in mm when using 1080p video. The Raspberry Pi camera spec says:
sensor resolution - 2592 x 1944 pixels
sensor dimensions - 3.76 x 2.74 mm
Will a straightforward linear scaling be accurate? e.g. 3.76 * 1920 / 2592. But then it seems the image can be scaled as well, which happens in either the video or the still image format.
Note: I have calibrated the camera and have all intrinsic values in pixel units. My effort here is to convert all of these into mm.
Just calibrate the camera for the mode you want to use.
The width and height of your sensor are given in the specs.
It also gives you a pixel size of 1.4 µm x 1.4 µm. If it weren't given, you could calculate it by dividing the sensor width (in mm) by the image width (in pixels). Same for height.
And it says that there is cropping in 1080p mode. This means that only a region of your sensor is used. Simply multiply your image width and height with the pixel size and you'll get the size of the sensor area that is used for 1080p.
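For example, assuming the 1.4 µm pixel size from the spec: 1920 x 1.4 µm ≈ 2.69 mm and 1080 x 1.4 µm ≈ 1.51 mm, so the 1080p crop uses roughly a 2.69 x 1.51 mm region of the 3.76 x 2.74 mm sensor.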
To get the position of that area take a picture of the same scene in 1080p and in full resolution and compare them.
Not sure about the scaling. You did not provide sufficient information here.
You can either calibrate your camera in 1080p mode, or you calibrate it in full resolution and correct the pixel positions by some translation offset. Pixel size and physical position do not change through cropping...
EDIT: honest recommendation
If you want to stream from a PMD in realtime, use C#. Any UI is simple to create, and there is quite a mighty library, MetriCam by Metrilus AG, which supports streaming for a variety of 3D cameras. I am able to get a stable 45 fps with that.
ORIGINAL:
I've been trying to get depth information from a PMD CamBoard nano and visualize it in a GUI. The information is delivered as a 165x120 float array.
As I also want to use the data for analysis purposes (image quality, white noise, etc.), I need to grab the frames at a specific framerate. The problem is that the SDK which PMD delivers with its camera (for MATLAB & C) only provides the possibility to grab single frames by calling
pmdUpdate(hnd);
so the framerate is dependent on how often you poll the image data.
I initially tried to do the analysis in MATLAB, but I couldn't get more than 30 fps out of the camera and adding some further code to the loop made it impossible to work with (I need at least reliable 25 fps).
I then switched to C, where I got rates of up to 70 fps, but could not visualize the data.
Then I tried it with Qt, which is based on C/C++; it should therefore be fast at polling the image data, and I could easily include the libraries of the PMD SDK. As I am new to Qt, though, I do not know much about the UI elements.
So my question:
Is there any performant way to visualize a 2D-float-array on a Qt-GUI? If not, how about anything useful in Visual Studio with C++?
(I know that drawing every pixel one by one on a QGraphicsView is dumb, but I tried it, and I get a whopping framerate of .4 fps...)
Thanks for any helpful suggestions!
Jannik
The QImage class actually has a constructor that accepts a uchar pointer/array. You only need to map your float values to RGB values in uchar format.
// Read the distance values from the camera (165 x 120 floats).
pmdGetDistances(hnd, dist, dd.img.numColumns*dd.img.numRows*sizeof(float));

uchar *imagemap = new uchar[dd.img.numColumns*dd.img.numRows*3];
int i, j;
for (i = 0; i < 165; i++) {
    for (j = 0; j < 120; j++) {
        // Scale the distance to 0..255 and clamp before the narrowing cast,
        // otherwise the range check happens too late and never triggers.
        float scaled = std::floor(40*dist[j*165+i]);
        uchar value = (scaled < 0 || scaled > 255) ? 0 : (uchar)scaled;
        //colorscaling integrated
        imagemap[3*(j*165+i)]   = floor((255-value)*(255-value)/255.0);
        imagemap[3*(j*165+i)+1] = abs(floor((value-127)/1.5));
        imagemap[3*(j*165+i)+2] = floor(value*value/255.0);
    }
}
The QImage can then be converted to a QPixmap and displayed in the QGraphicsView. This worked for me, but the framerate does not seem really stable.
QImage image(imagemap, 165, 120, 165*3, QImage::Format_RGB888);
QPixmap pmap(QPixmap::fromImage(image));
scene->addPixmap(pmap.scaled(165,120));
ui->viewCamera->update();
delete[] imagemap; // QPixmap::fromImage copies the data, so the buffer can be freed here
It could be worth a try to send the thread to sleep until the desired time has elapsed: QThread::msleep(msec);
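If sleeping in a loop turns out to be too jittery, a fixed-rate QTimer is another option. Just a sketch: grabFrame() here is a hypothetical slot standing in for whatever calls pmdUpdate()/pmdGetDistances() and repaints.

// In the window constructor: poll the camera every 40 ms (about 25 fps).
QTimer *timer = new QTimer(this);
connect(timer, &QTimer::timeout, this, &MainWindow::grabFrame);
timer->start(40);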
I'm writing a sound editor for my graduation project. I'm using BASS to extract samples from MP3, WAV, OGG, etc. files and to add DSP effects like echo, flanger, etc. Simply speaking, I made a framework that applies an effect from position1 to position2, with cut/paste management.
Now my problem is that I want to create a control, similar to the one from Cool Edit Pro, that draws a waveform representation of the song and has the ability to zoom in/out, select portions of the waveform, etc. After a selection I can do something like:
TInterval EditZone = WaveForm->GetSelection();
where TInterval has this form:
struct TInterval
{
    long Start;
    long End;
};
I'm a beginner when it comes to sophisticated drawing, so any hint on how to create a waveform representation of a song, using the sample data returned by BASS, with the ability to zoom in/out, would be appreciated.
I'm writing my project in C++, but I can understand C# and Delphi code, so feel free to post snippets in those two languages as well :)
Thanx DrOptix
By zoom, I presume you mean horizontal zoom rather than vertical. The way audio editors do this is to scan the waveform, breaking it up into time windows where each pixel in X represents some number of samples. It can be a fractional number, but you can get away with disallowing fractional zoom ratios without annoying the user too much. Once you zoom out a bit, the max value is always a positive integer and the min value is always a negative integer.
For each pixel on the screen, you need to know the minimum sample value for that pixel and the maximum sample value. So you need a function that scans the waveform data in chunks and keeps track of the accumulated max and min for each chunk.
This is a slow process, so professional audio editors keep a pre-calculated table of min and max values at some fixed zoom ratio. It might be at 512/1 or 1024/1. When you are drawing with a zoom ratio of more than 1024 samples/pixel, you use the pre-calculated table; if you are below that ratio you get the data directly from the file. If you don't do this you will find that your drawing code gets too slow when you zoom out.
It's worthwhile to write code that handles all of the channels of the file in a single pass when doing this scanning; slowness here will make your whole program feel sluggish. It's the disk I/O that matters here, the CPU has no trouble keeping up, so straightforward C++ code is fine for building the min/max tables, but you don't want to go through the file more than once and you want to read it sequentially.
Once you have the min/max tables, keep them around. You want to go back to the disk as little as possible, and many of the reasons for repainting your window will not require you to rescan the min/max tables. The memory cost of holding on to them is not that high compared to the disk I/O cost of building them in the first place.
Then you draw the waveform by drawing a series of 1-pixel-wide vertical lines between the max value and the min value for the time represented by that pixel. This should be quite fast if you are drawing from pre-built min/max tables.
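A minimal sketch of the scanning step described above (my own illustration, assuming 16-bit mono samples already loaded in memory; a real editor would stream the file in chunks and handle every channel):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Peak { int16_t min; int16_t max; };

// Build one min/max pair per block of samplesPerPeak samples (e.g. 1024/1).
std::vector<Peak> buildPeakTable(const std::vector<int16_t>& samples, size_t samplesPerPeak = 1024)
{
    std::vector<Peak> peaks;
    for (size_t i = 0; i < samples.size(); i += samplesPerPeak) {
        size_t end = std::min(i + samplesPerPeak, samples.size());
        Peak p{samples[i], samples[i]};
        for (size_t j = i; j < end; ++j) {
            p.min = std::min(p.min, samples[j]);
            p.max = std::max(p.max, samples[j]);
        }
        peaks.push_back(p);
    }
    return peaks;
}

// Drawing then becomes: for each pixel column, look up (or merge) the peaks it
// covers and draw a vertical line from max to min.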
I've recently done this myself. As Marius suggests, you need to work out how many samples are at each column of pixels. You then work out the minimum and maximum and plot a vertical line from the maximum to the minimum.
As a first pass this seemingly works fine. The problem you'll get is that as you zoom out it will start to take too long to retrieve the samples from disk. As a solution to this I built a "peak" file alongside the audio file. The peak file stores the minimum/maximum pairs for groups of n samples. Playing with n until you get the right amount is up to you; personally I found 128 samples to be a good tradeoff between size and speed. It's also worth remembering that, unless you are drawing a control larger than 65536 pixels in size, you needn't store this peak information as anything more than 16-bit values, which saves a bit of space.
Wouldn't you just plot the sample points on a 2D canvas? You should know how many samples there are per second for a file (read it from the header), and then plot the values on the y axis. Since you want to be able to zoom in and out, you need to control the number of samples per pixel (the zoom level). Next you take the average of those sample points per pixel (for example, take the average of every 5 points if you have 5 samples per pixel). Then you can use a 2D drawing API to draw lines between the points.
Using the open source NAudio Package -
using System;
using System.Collections.Generic;
using System.Linq;
using NAudio.Wave;

public class WavReader2
{
    private readonly WaveFileReader _objStream;

    public WavReader2(String sPath)
    {
        _objStream = new WaveFileReader(sPath);
    }

    // Returns one min/max pair per pixel column. Assumes 16-bit PCM input.
    public List<SampleRangeValue> GetPixelGraph(int iSamplesPerPixel)
    {
        List<SampleRangeValue> colOutputValues = new List<SampleRangeValue>();
        if (_objStream != null)
        {
            _objStream.Position = 0;
            int iBytesPerSample = (_objStream.WaveFormat.BitsPerSample / 8) * _objStream.WaveFormat.Channels;
            int iNumPixels = (int)Math.Ceiling(_objStream.SampleCount / (double)iSamplesPerPixel);
            byte[] aryWaveData = new byte[iSamplesPerPixel * iBytesPerSample];

            for (float iPixelNum = 0; iPixelNum < iNumPixels; iPixelNum += 1)
            {
                int iBytesRead = _objStream.Read(aryWaveData, 0, iSamplesPerPixel * iBytesPerSample);
                if (iBytesRead == 0)
                    break;

                // Collect every 16-bit sample in this chunk (all channels mixed
                // together, which is fine for a min/max display).
                List<short> colValues = new List<short>();
                for (int n = 0; n < iBytesRead; n += 2)
                {
                    short iSampleValue = BitConverter.ToInt16(aryWaveData, n);
                    colValues.Add(iSampleValue);
                }

                // Note: dividing by ushort.MaxValue gives roughly -0.5..0.5;
                // divide by short.MaxValue instead for a full -1..1 range.
                float fLowPercent = (float)colValues.Min() / ushort.MaxValue;
                float fHighPercent = (float)colValues.Max() / ushort.MaxValue;
                colOutputValues.Add(new SampleRangeValue(fHighPercent, fLowPercent));
            }
        }
        return colOutputValues;
    }
}

public struct SampleRangeValue
{
    public float HighPercent;
    public float LowPercent;

    public SampleRangeValue(float fHigh, float fLow)
    {
        HighPercent = fHigh;
        LowPercent = fLow;
    }
}