FMOD using excessive codec memory with non-streamed samples - c++

I'm currently using FMOD Ex 4.40.10. We load OGG samples like this:
uint uiFlags(FMOD_SOFTWARE | FMOD_LOWMEM | FMOD_CREATESAMPLE);
FMOD::Sound* pSound(NULL);
m_pFMODSystem->createSound(sPath, uiFlags, NULL, &pSound);
When looking at the memory usage of this sound via:
FMOD_MEMORY_USAGE_DETAILS usage;
pSound->getMemoryInfo(FMOD_MEMBITS_ALL, 0, NULL, &usage);
usage.codec reports greater than 0. This doesn't make sense to me, as the FMOD documentation states that FMOD_MEMORY_USAGE_DETAILS::codec is:
codec
[out] Codecs allocated for streaming
As you can see from how sounds are loaded, there should be no streaming.
With multiple OGG files loaded, when I query the system's memory usage, it shows codec being a large number - all of the individual file codec usages added together. The memory numbers being reported by FMOD match the memory usage that I see from my own memory profiling.
When I load raw PCM data, usage.codec reports as 0.
Why is "codec" greater than 0 when I'm loading non-streaming OGG files? Is there a way to disable this memory usage?
Edit: As a test, after loading the OGG, I extract the PCM data and have FMOD create a new sound. I then free the sound made from the OGG and replace it with the new sound loaded from the PCM data. This works flawlessly. This is further evidence that the codec memory it has allocated is unnecessary.

http://www.fmod.org/forum/viewtopic.php?f=7&t=15762
FMOD confirmed my findings and said the workaround of loading the compressed sound, extracting the PCM data, unloading the sound, and then creating a new one from the extracted PCM data is fine.
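For reference, a minimal sketch of that workaround against the FMOD Ex API might look like the following (error checking omitted; the helper name and the exact flag combination are assumptions based on the loading code in the question):

#include <cstring>
#include <vector>
#include <fmod.hpp>

// Decode-once workaround: copy the PCM out of the loaded OGG sample, release it,
// and recreate the sound from raw PCM so no codec stays allocated.
FMOD::Sound* ReplaceWithPcmCopy(FMOD::System* pSystem, FMOD::Sound* pOggSound)
{
    // Query the decoded size, format and default frequency of the loaded sample.
    unsigned int uiPcmBytes = 0;
    pOggSound->getLength(&uiPcmBytes, FMOD_TIMEUNIT_PCMBYTES);
    FMOD_SOUND_FORMAT format;
    int iChannels = 0;
    pOggSound->getFormat(NULL, &format, &iChannels, NULL);
    float fFrequency = 0.0f;
    pOggSound->getDefaults(&fFrequency, NULL, NULL, NULL);

    // Copy the decoded PCM out of the sound, then release the OGG-backed sound.
    std::vector<char> pcm(uiPcmBytes);
    void* pData1 = NULL; void* pData2 = NULL;
    unsigned int uiLen1 = 0, uiLen2 = 0;
    pOggSound->lock(0, uiPcmBytes, &pData1, &pData2, &uiLen1, &uiLen2);
    memcpy(pcm.data(), pData1, uiLen1);
    if (pData2) memcpy(pcm.data() + uiLen1, pData2, uiLen2);
    pOggSound->unlock(pData1, pData2, uiLen1, uiLen2);
    pOggSound->release();

    // Recreate the sample from the raw PCM; no codec is involved this time.
    FMOD_CREATESOUNDEXINFO exInfo = { 0 };
    exInfo.cbsize = sizeof(exInfo);
    exInfo.length = uiPcmBytes;
    exInfo.numchannels = iChannels;
    exInfo.defaultfrequency = static_cast<int>(fFrequency);
    exInfo.format = format;

    FMOD::Sound* pPcmSound = NULL;
    pSystem->createSound(pcm.data(),
                         FMOD_SOFTWARE | FMOD_LOWMEM | FMOD_CREATESAMPLE |
                         FMOD_OPENMEMORY | FMOD_OPENRAW,
                         &exInfo, &pPcmSound);
    return pPcmSound;
}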

Related

How to properly validate ffmpeg pts/dts after demuxing/decoding?

How should I validate pts/dts after demuxing and then after decoding?
It is important for me to have valid pts at all times, for days and
possibly weeks of continuous streaming.
After demuxing I check:
dts <= pts
prev_packet_dts < next_packet_pts
I also discard packets with AV_NOPTS_VALUE and wait for packets with
proper pts, because I don't know the video duration in this case.
Packet pts may not be monotonically increasing because of I-P-B frames.
Is it all right?
What about decoded AVFrames?
Should 'pts' be increasing all the time?
Why could 'pts' lag behind 'dts' at some point?
Why is pict_type a field of AVFrame? Shouldn't it be on AVPacket, since
an AVPacket is the compressed frame, not the other way around?
Ideally, yes. Unless your format allows discontinuities, or wraps timestamps around due to overflow, like MPEG-TS.
That would be a writing error.
It is an informational field, indicating the provenance of the frame. It can be used by filters or encoders, e.g. keyframe alignment during a re-encode.
On the libav support channel I was advised not to rely on the decoder output. It is more robust to produce pts/dts for encoding/muxing manually, and I should look at the sources of the ffmpeg tools for a proper implementation. I will pursue this approach.
For now I discard AVFrames only with AV_NOPTS_VALUE, and the rest of encoding/muxing works fine.
Validation of AVPackets after demuxing remains the same as described above.
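For what it's worth, a minimal sketch of those packet-level checks (libav* C API used from C++; the helper name and the drop-the-packet policy are just placeholders):

extern "C" {
#include <libavcodec/avcodec.h>
}
#include <cstdint>

// Basic per-stream timestamp sanity checks from the question: both timestamps present,
// dts <= pts, and the previous packet's dts strictly below the next packet's pts.
bool IsPacketUsable(const AVPacket* pkt, int64_t& prevDts)
{
    if (pkt->pts == AV_NOPTS_VALUE || pkt->dts == AV_NOPTS_VALUE)
        return false;                 // wait for a packet with proper timestamps

    if (pkt->dts > pkt->pts)
        return false;                 // a frame cannot be presented before it is decoded

    if (prevDts != AV_NOPTS_VALUE && prevDts >= pkt->pts)
        return false;                 // prev_packet_dts < next_packet_pts must hold

    prevDts = pkt->dts;               // remember dts; pts alone may jump because of B-frames
    return true;
}
// Initialize prevDts to AV_NOPTS_VALUE before the first packet of each stream.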

Media Foundation video re-encoding producing audio stream sync offset

I'm attempting to write a simple Windows Media Foundation command line tool that uses IMFSourceReader and IMFSinkWriter to load a video, read the video and audio as uncompressed streams, and re-encode them to H.264/AAC with some specific hard-coded settings.
The simple program Gist is here
sample video 1
sample video 2
sample video 3
(Note: the videos I've been testing with are all stereo, 48000 Hz sample rate)
The program works; however, in some cases, when comparing the newly output video to the original in an editing program, I see that the copied video streams match, but the audio stream of the copy is prefixed with some amount of silence and the audio is offset, which is unacceptable in my situation.
audio samples:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[silence] [silence] [silence] [audio1] [audio2] [audio3] ... etc
In cases like this, the first video frames coming in have a non-zero timestamp, but the first audio frames do have a 0 timestamp.
I would like to produce a copy whose first video and audio frames start at timestamp 0, so I first attempted to subtract that initial timestamp (videoOffset) from all subsequent video frames. That produced the video I wanted, but it resulted in this situation with the audio:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio4] [audio5] [audio6] [audio7] [audio8] ... etc
The audio track is now shifted in the other direction by a small amount and still doesn't align. This can also sometimes happen when a video stream does have a starting timestamp of 0, yet WMF still cuts off some audio samples at the beginning anyway (see sample video 3)!
I've been able to fix this sync alignment and offset the video stream to start at 0 with the following code inserted at the point of passing the audio sample data to the IMFSinkWriter:
//inside read sample while loop
...
// LONGLONG llDuration has the currently read sample duration
// LONGLONG audioOffset has the global audio offset, starts as 0
// LONGLONG audioFrameTimeStamp has the currently read sample timestamp
//add some arbitrary amount of silence in intervals of 1024 samples
static bool runOnce{ false };
if (!runOnce)
{
    size_t numberOfSilenceBlocks = 1; //how to derive how many I need!? It's arbitrary
    size_t samples = 1024 * numberOfSilenceBlocks;
    audioOffset = samples * 10000000 / audioSamplesPerSecond;
    std::vector<uint8_t> silence(samples * audioChannels * bytesPerSample, 0);
    WriteAudioBuffer(silence.data(), silence.size(), audioFrameTimeStamp, audioOffset);
    runOnce = true;
}
LONGLONG audioTime = audioFrameTimeStamp + audioOffset;
WriteAudioBuffer(dataPtr, dataSize, audioTime, llDuration);
Oddly, this creates an output video file that matches the original.
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
The solution was to insert extra silence, in blocks of 1024 samples, at the beginning of the audio stream. It doesn't matter what audio chunk sizes IMFSourceReader provides; the padding is in multiples of 1024.
My problem is that there seems to be no detectable reason for the silence offset. Why do I need it? How do I know how much I need? I stumbled across the 1024-sample silence block solution after days of fighting this problem.
Some videos seem to only need 1 padding block, some need 2 or more, and some need no extra padding at all!
My questions here are:
Does anyone know why this is happening?
Am I using Media Foundation incorrectly in this situation to cause this?
If I am correct, how can I use the video metadata to determine whether I need to pad an audio stream and how many 1024-sample blocks of silence the pad needs?
EDIT:
For the sample videos above:
sample video 1: the video stream starts at 0 and needs no extra blocks; passing through the original data works fine.
sample video 2: the video stream starts at 834166 (hns) and needs one 1024-sample block of silence to sync
sample video 3: the video stream starts at 0 and needs two 1024-sample blocks of silence to sync.
UPDATE:
Other things I have tried:
Increasing the duration of the first video frame to account for the offset: Produces no effect.
I wrote another version of your program to handle the NV12 format correctly (yours was not working):
EncodeWithSourceReaderSinkWriter
I use Blender as my video editing tool. Here are my results with Tuning_against_a_window.mov:
from the bottom to the top:
Original file
Encoded file
I changed the original file by setting the "elst" atoms' number of entries to 0 (I used the Visual Studio hex editor)
Like Roman R. said, the Media Foundation mp4 source doesn't use the "edts/elst" atoms. But Blender and your video editing tools do. Also, the "tmcd" track is ignored by the mp4 source.
"edts/elst":
Edits Atom ( 'edts' )
Edit lists can be used for hint tracks...
MPEG-4 File Source
The MPEG-4 file source silently ignores hint tracks.
So in fact, the encoding is good. I think there is no audio stream sync offset compared to the real audio/video data. For example, you can add "edts/elst" to the encoded file to get the same result.
PS: on the encoded file, I added "edts/elst" for both the audio and video tracks. I also increased the size of the trak atoms and the moov atom. I confirm that Blender shows the same waveform for both the original and the encoded file.
EDIT
I tried to understand the relation between the mvhd/tkhd/mdhd/elst atoms in the 3 video samples. (Yes, I know, I should read the spec, but I'm lazy...)
You can use an mp4 explorer tool to get the atoms' values, or use the mp4 parser from my H264Dxva2Decoder project:
H264Dxva2Decoder
Tuning_against_a_window.mov
elst (media time) from tkhd video : 20689
elst (media time) from tkhd audio : 1483
GREEN_SCREEN_ANIMALS__ALPACA.mp4
elst (media time) from tkhd video : 2002
elst (media time) from tkhd audio : 1024
GOPR6239_1.mov
elst (media time) from tkhd video : 0
elst (media time) from tkhd audio : 0
As you can see, with GOPR6239_1.mov the media time from elst is 0. That's why there is no video/audio sync problem with this file.
For Tuning_against_a_window.mov and GREEN_SCREEN_ANIMALS__ALPACA.mp4, I tried to calculate the video/audio offset.
I modified my project to take this into account:
EncodeWithSourceReaderSinkWriter
So far, I haven't found a generic calculation that works for all files.
I just found the video/audio offset needed to encode both files correctly.
For Tuning_against_a_window.mov, I begin encoding after (movie time - video/audio mdhd time).
For GREEN_SCREEN_ANIMALS__ALPACA.mp4, I begin encoding after the video/audio elst media time.
It's OK, but I still need to find the single right calculation for all files.
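As a rough illustration of that last idea (not the generic formula, which is still open), an elst entry's media_time is expressed in the track's mdhd timescale, so converting it to the 100-ns units the IMFSinkWriter works with would look something like this; the helper name is hypothetical:

#include <windows.h>

// Convert an elst media_time (in the track's mdhd timescale units) to 100-ns units.
LONGLONG ElstMediaTimeToHns(LONGLONG mediaTime, DWORD mdhdTimescale)
{
    if (mediaTime <= 0 || mdhdTimescale == 0)
        return 0;                                       // empty edit (-1) or missing atom: no offset
    return (mediaTime * 10000000LL) / mdhdTimescale;    // media units -> 100-ns units
}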
So you have 2 options:
encode the file and add the elst atom
encode the file using the right offset calculation
It depends on your needs:
The first option lets you keep the original file, but you have to add the elst atom.
With the second option, you have to read the atoms from the file before encoding, and the encoded file will lose a few original frames.
If you choose the first option, I will explain how I add the elst atom.
PS: I'm interested in this question, because in my H264Dxva2Decoder project the edts/elst atom is on my todo list.
I parse it, but I don't use it...
PS2: this link looks interesting:
Audio Priming - Handling Encoder Delay in AAC

Compressing PCM data

I'm using the WinAPI Wave functions to create a recording program that records from the microphone for X seconds. I've searched a bit around the net and found out that PCM data is too large, and it'll be a problem to send it through sockets...
How can I compress it to something smaller? Any simple / "cheap" way ?
I've also noticed that when I declare the format using the Wave API functions, I use this code:
WAVEFORMATEX pFormat;
pFormat.wFormatTag = WAVE_FORMAT_PCM;      // simple, uncompressed format
pFormat.nChannels = 1;                     // 1 = mono, 2 = stereo
pFormat.nSamplesPerSec = sampleRate;       // 44100
pFormat.nAvgBytesPerSec = sampleRate * 2;  // = nSamplesPerSec * nChannels * wBitsPerSample/8
pFormat.nBlockAlign = 2;                   // = nChannels * wBitsPerSample/8
pFormat.wBitsPerSample = 16;               // 16 for high quality, 8 for telephone-grade
pFormat.cbSize = 0;
As you can see, pFormat.wFormatTag = WAVE_FORMAT_PCM;
maybe I can put something other than WAVE_FORMAT_PCM there, so the data is compressed right away?
I've checked MSDN for other values, though none of them works for me in my Visual Studio...
So what can I do?
Thanks!
The simplest way is to reduce your sample rate from 44100 to something more manageable like 22050, 16000, 11025, or even 8000. Most voice codecs don't go higher than 16000 Hz anyway, and the older ones are optimized for 8 kHz.
The next step is to find a codec. There's a handful of codecs to use with the Windows Audio Compression Manager, but almost all of them date back to Windows 95 and sound terrible by modern standards after being decompressed.
You can always convert to WMA in real time using the Format SDK or with Media Foundation APIs. Or just go get an open source MP3 library like LAME.
For telephone quality speech you can change to 8 bits per sample and a sample rate of 8000. This will greatly reduce the amount of data.
GSM has good compression. You can convert a block of PCM data to GSM (or any other codec you have installed) using acmStreamConvert(). Refer to MSDN for more details:
Converting Data from One Format to Another
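For illustration, here is a minimal sketch of that ACM conversion in C++ (mono 16-bit PCM at 8000 Hz to GSM 6.10; error handling is omitted and the function name is made up):

#include <windows.h>
#include <mmreg.h>
#include <msacm.h>
#include <vector>
#pragma comment(lib, "msacm32.lib")

// Convert a mono 16-bit 8000 Hz PCM buffer to GSM 6.10 in one shot.
std::vector<BYTE> PcmToGsm610(const BYTE* pcmData, DWORD pcmBytes)
{
    WAVEFORMATEX src = {};
    src.wFormatTag = WAVE_FORMAT_PCM;
    src.nChannels = 1;
    src.nSamplesPerSec = 8000;
    src.wBitsPerSample = 16;
    src.nBlockAlign = src.nChannels * src.wBitsPerSample / 8;
    src.nAvgBytesPerSec = src.nSamplesPerSec * src.nBlockAlign;

    GSM610WAVEFORMAT dst = {};
    dst.wfx.wFormatTag = WAVE_FORMAT_GSM610;
    dst.wfx.nChannels = 1;
    dst.wfx.nSamplesPerSec = 8000;
    dst.wfx.nBlockAlign = 65;                // one GSM 6.10 block = 65 bytes
    dst.wfx.nAvgBytesPerSec = 1625;          // 8000 samples/s / 320 samples per block * 65 bytes
    dst.wfx.cbSize = sizeof(GSM610WAVEFORMAT) - sizeof(WAVEFORMATEX);
    dst.wSamplesPerBlock = 320;

    HACMSTREAM hStream = NULL;
    acmStreamOpen(&hStream, NULL, &src, &dst.wfx, NULL, 0, 0, 0);

    // Ask the ACM for a worst-case output size, then convert the whole buffer.
    DWORD dstBytes = 0;
    acmStreamSize(hStream, pcmBytes, &dstBytes, ACM_STREAMSIZEF_SOURCE);
    std::vector<BYTE> out(dstBytes);

    ACMSTREAMHEADER hdr = {};
    hdr.cbStruct = sizeof(hdr);
    hdr.pbSrc = const_cast<BYTE*>(pcmData);
    hdr.cbSrcLength = pcmBytes;
    hdr.pbDst = out.data();
    hdr.cbDstLength = dstBytes;
    acmStreamPrepareHeader(hStream, &hdr, 0);
    acmStreamConvert(hStream, &hdr, ACM_STREAMCONVERTF_BLOCKALIGN);  // convert whole blocks only
    acmStreamUnprepareHeader(hStream, &hdr, 0);
    acmStreamClose(hStream, 0);

    out.resize(hdr.cbDstLengthUsed);         // shrink to the bytes actually produced
    return out;
}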

Rendering some sound data into one new sound data?

I'm creating an application that will read a unique format containing a sound "bank" and offsets at which each sound must be played.
Imagine something like..
Sound bank: (ID on the left-hand and file name on the right-hand side)
0 kick.wav
1 hit.wav
2 flute.wav
And the offsets: (Time in ms on the left-hand and sound ID on the right-hand side)
1000 0
2000 1
3000 2
And the application will generate a new sound file (i.e. wav, for later conversion to other formats) that plays the kick at the first second, the hit at the second second, and the flute at the third second.
I have absolutely no idea where to begin.
I usually use FMOD for audio playbacks, but never did something like this before.
I'm using C++ and wxWidgets on a MSVC++ Express Edition environment, and LGPL libraries would be fine.
If I understand correctly, you want to generate a new wave file by mixing wavs from a soundbank. You may not need a sound API at all for this, especially if all your input wavs are in the same format.
Simply load each wav file into a buffer. For SampleRate*secondsUntilStartTime samples, for each buffer in the ActiveList, add buffer[bufferIdx++] into the output buffer. If bufferIdx == bufferLen, remove this buffer from the ActiveList. At StartTime, add the next buffer to the ActiveList, and repeat.
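To make that concrete, here is a rough sketch of that mixing loop in C++ (it assumes every wav has already been decoded into 16-bit mono samples at a common sample rate; wav loading/saving and event-list parsing are left out, and all names are made up):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Event  { size_t startMs; const std::vector<int16_t>* samples; };  // offset + the sound's PCM
struct Active { const std::vector<int16_t>* samples; size_t idx; };

std::vector<int16_t> MixEvents(std::vector<Event> events, size_t sampleRate)
{
    // Sort by start time and size the output to cover the end of the last sound.
    std::sort(events.begin(), events.end(),
              [](const Event& a, const Event& b) { return a.startMs < b.startMs; });
    size_t outLen = 0;
    for (const Event& e : events)
        outLen = std::max(outLen, e.startMs * sampleRate / 1000 + e.samples->size());

    std::vector<int16_t> out(outLen, 0);
    std::vector<Active> active;
    size_t nextEvent = 0;

    for (size_t i = 0; i < outLen; ++i)
    {
        // Activate every sound whose start time has been reached.
        while (nextEvent < events.size() &&
               events[nextEvent].startMs * sampleRate / 1000 <= i)
            active.push_back({ events[nextEvent++].samples, 0 });

        // Sum one sample from each active buffer, clamping to the 16-bit range.
        int32_t mixed = 0;
        for (Active& a : active)
            mixed += (*a.samples)[a.idx++];
        out[i] = static_cast<int16_t>(std::clamp<int32_t>(mixed, -32768, 32767));

        // Drop buffers that have been fully played out.
        active.erase(std::remove_if(active.begin(), active.end(),
                         [](const Active& a) { return a.idx >= a.samples->size(); }),
                     active.end());
    }
    return out;
}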
If FMOD supports output to a file instead of the sound hardware, you can do this same thing with the streaming API. Just keep track of elapsed samples in the StreamCallback, and start mixing in new files whenever you reach their start offsets.

Compression File

I have an X-byte file, and I want to compress it in blocks of 32 KB, for example.
Is there any lib with which I can do this?
I used Zlib for Delphi, but I can only compress a full file into a new compressed file.
Thanks a lot,
Pedro
Why don't you use a simple header to determine block boundaries? Consider this:
Read a fixed amount of data from the input into a buffer (say 32 KiB).
Compress that buffer with a "freshly created" deflate stream (the underlying compression algorithm of zlib).
Write the compressed size to the output stream.
Write the compressed data to the output stream.
Go to step 1 until you reach end-of-file.
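In code, one round of that loop might look like this: a sketch using zlib's C API from C++ (the question mentions Delphi, but the scheme is the same in any binding, and the helper name is made up):

#include <cstdint>
#include <cstdio>
#include <vector>
#include <zlib.h>

// Compress a file in independent 32 KiB blocks, each prefixed by its compressed size.
bool CompressInBlocks(std::FILE* in, std::FILE* out, size_t blockSize = 32 * 1024)
{
    std::vector<unsigned char> raw(blockSize);
    std::vector<unsigned char> packed(compressBound(static_cast<uLong>(blockSize)));

    for (;;)
    {
        size_t read = std::fread(raw.data(), 1, raw.size(), in);
        if (read == 0)
            break;                                          // end of file

        uLongf packedSize = static_cast<uLongf>(packed.size());
        if (compress2(packed.data(), &packedSize,
                      raw.data(), static_cast<uLong>(read),
                      Z_BEST_COMPRESSION) != Z_OK)
            return false;

        // Header: the compressed size of this block, so a reader can find block boundaries.
        uint32_t header = static_cast<uint32_t>(packedSize);
        std::fwrite(&header, sizeof(header), 1, out);
        std::fwrite(packed.data(), 1, packedSize, out);
    }
    return true;
}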
Pros:
You can decompress any block independently, even in a multi-threaded fashion.
Data corruption is limited to the corrupted block; the rest of the data can be recovered.
Cons:
You lose most of the contextual information (similarities between blocks), so you will get a lower compression ratio.
It requires slightly more work.