GStreamer: seek request problem in a pipeline with a mixer with pad offset

I want to mix multiple wav files. The files can have different start times.
To do that, I set an offset on the corresponding sink pad of the mixer.
I am using gstreamer-java.
This is an example of the timeline with two files; there is a 10 second offset for file 2.
It works fine: file 2 starts as expected.
But if I do a seek request, I don't hear file 2 for the duration of its offset (here: 10 seconds).
When I hear file 2 again, it is in sync with the expected timeline.
Is it possible to do a seek request when the mixer has a pad with an offset?
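For reference, here is a minimal sketch of the setup described above, written with the Python GStreamer bindings (the Java bindings expose the same calls): two WAV files feed an audiomixer, the second mixer sink pad gets a 10 second offset, and a flushing seek back to 0 reproduces the silent gap the question describes. File names are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Two WAV branches into an audiomixer; file names are placeholders
pipeline = Gst.parse_launch(
    "audiomixer name=mix ! audioconvert ! autoaudiosink "
    "filesrc location=file1.wav ! wavparse ! audioconvert ! mix. "
    "filesrc location=file2.wav ! wavparse ! audioconvert ! mix.")

# Shift the second file by 10 seconds via its mixer sink pad
mixer = pipeline.get_by_name("mix")
pad = mixer.get_static_pad("sink_1")
pad.set_offset(10 * Gst.SECOND)

pipeline.set_state(Gst.State.PLAYING)

# Later (after preroll), a flushing seek back to the start; with the pad
# offset in place, file 2 is silent again for its first 10 seconds,
# which is the behaviour described above
pipeline.seek_simple(Gst.Format.TIME, Gst.SeekFlags.FLUSH, 0)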

Related

Adding 10 second wav file to gstreamer pipeline that is already playing

I have a GStreamer pipeline created from the Python gst bindings, which is set up to play a headset's microphone back to the headset's speaker. This works fine and plays in a pipeline like this:
GstJackAudioSrc -> GstAudioMixer -> Queue -> GstJackAudioSink
Then many seconds later I want to play a short 10 second .wav file into the pipeline, so the wav file is mixed with the microphone and heard on the headset. To do this, a GstFileSrc is dynamically added to the GstAudioMixer to mix the short 10 second wav file into the headset's speaker, which gives a pipeline like this:
GstJackAudioSrc -> GstAudioMixer -> Queue -> GstJackAudioSink
                        /
GstFileSrc -> GstWavParse ->/
When the GstFileSrc and GstWavParse are dynamically added to a sink pad of the mixer at a time of, say, 6 seconds after the start of the pipeline, only the last 4 seconds of the wav are heard.
The problem seems to be that the wav file seeks to the time relative to when the pipeline started PLAYING.
I have tried changing "do-timestamp" on a multifilesrc, setting "sync"=True on a GstIdentity, and I can't find a way to set "live" on a filesrc; I've tried many other things, but to no avail.
However, the whole 10 second wav file plays nicely if the pipeline is set to Gst.State.NULL and then back to Gst.State.PLAYING when the filesrc is added at 6 seconds. This works because the pipeline time gets reset to zero, but it produces a click on the headset, which is unacceptable.
How can I ensure that the wav file starts playing from the start of the wav file, so that the whole 10 seconds is heard on the headset, if added to the pipeline at any random time?
An Update:
I can now get the timing of the wav file correct by adding a clocksync before the wavparse and setting its timestamp offset:
nanosecs = pipeline.query_position(Gst.Format.TIME)[1]
clocksync.set_property("ts-offset", nanosecs)
Although the start/stop times are now correct, the wav audio is corrupted and heard as nothing but clicks and blips; at least it starts and finishes at the correct times. Note that without the clocksync the wav file's audio is perfectly clear, it just starts and stops at the wrong time. So the ts-offset is somehow corrupting the audio.
Why is the audio being corrupted?
So I got this working, and the answer is not to use the clocksync, but instead to request a mixer sink pad and call set_offset(nanosecs) on that sink pad before linking the wavparse to the mixer:
sink_pad = audio_mixer.get_request_pad("sink_%u")
nanosecs = pipeline.query_position(Gst.Format.TIME)[1]
sink_pad.set_offset(nanosecs)
sink_pad.add_probe(Gst.PadProbeType.IDLE, wav_callback)

def wav_callback(pad, pad_probe_info, userdata):
    wavparse.link(audio_mixer)
    wav_bin.set_state(Gst.State.PLAYING)
    return Gst.PadProbeReturn.REMOVE
Then if the wav file needs to be rewound/replayed:
def replay_wav():
    global wav_bin
    global sink_pad
    wav_bin.seek_simple(Gst.Format.TIME, Gst.SeekFlags.FLUSH, 0)
    nanosecs = pipeline.query_position(Gst.Format.TIME)[1]
    sink_pad.set_offset(nanosecs)

Media Foundation video re-encoding producing audio stream sync offset

I'm attempting to write a simple Windows Media Foundation command line tool that uses IMFSourceReader and IMFSinkWriter to load a video, read the video and audio as uncompressed streams, and re-encode them to H.264/AAC with some specific hard-coded settings.
The simple program Gist is here
sample video 1
sample video 2
sample video 3
(Note: the videos I've been testing with are all stereo, 48000 Hz sample rate)
The program works; however, in some cases when comparing the newly output video to the original in an editing program, I see that the copied video streams match, but the audio stream of the copy is prefixed with some amount of silence and the audio is offset, which is unacceptable in my situation.
audio samples:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[silence] [silence] [silence] [audio1] [audio2] [audio3] ... etc
In cases like this the first video frames coming in have a non-zero timestamp, but the first audio frames do have a 0 timestamp.
I would like to be able to produce a copied video whose first video and audio frames start at 0, so I first attempted to subtract that initial timestamp (videoOffset) from all subsequent video frames, which produced the video I wanted but resulted in this situation with the audio:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio4] [audio5] [audio6] [audio7] [audio8] ... etc
The audio track is now shifted in the other direction by a small amount and still doesn't align. This can also happen sometimes when a video stream does have a starting timestamp of 0, yet WMF still cuts off some audio samples at the beginning anyway (see sample video 3)!
I've been able to fix this sync alignment and offset the video stream to start at 0 with the following code, inserted at the point of passing the audio sample data to the IMFSinkWriter:
//inside read sample while loop
...
// LONGLONG llDuration has the currently read sample duration
// DWORD audioOffset has the global audio offset, starts as 0
// LONGLONG audioFrameTimestamp has the currently read sample timestamp

//add some random amount of silence in intervals of 1024 samples
static bool runOnce{ false };
if (!runOnce)
{
    size_t numberOfSilenceBlocks = 1; //how to derive how many I need!? It's arbitrary
    size_t samples = 1024 * numberOfSilenceBlocks;
    audioOffset = samples * 10000000 / audioSamplesPerSecond;
    std::vector<uint8_t> silence(samples * audioChannels * bytesPerSample, 0);
    WriteAudioBuffer(silence.data(), silence.size(), audioFrameTimestamp, audioOffset);
    runOnce = true;
}

LONGLONG audioTime = audioFrameTimestamp + audioOffset;
WriteAudioBuffer(dataPtr, dataSize, audioTime, llDuration);
Oddly, this creates an output video file that matches the original.
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
The solution was to insert extra silence in block sizes of 1024 at the beginning of the audio stream. It doesn't matter what the audio chunk sizes provided by IMFSourceReader are, the padding is in multiples of 1024.
My problem is that there seems to be no detectable reason for the silence offset. Why do I need it? How do I know how much of it I need? I stumbled across the 1024-sample silence block solution after days of fighting this problem.
Some videos seem to only need 1 padding block, some need 2 or more, and some need no extra padding at all!
My questions here are:
Does anyone know why this is happening?
Am I using Media Foundation incorrectly in this situation to cause this?
If I am correct, how can I use the video metadata to determine whether I need to pad an audio stream and how many 1024-sample blocks of silence need to be in the pad?
EDIT:
For the sample videos above:
sample video 1: the video stream starts at 0 and needs no extra blocks; passthrough of the original data works fine.
sample video 2: the video stream starts at 834166 (hns) and needs one 1024-sample block of silence to sync.
sample video 3: the video stream starts at 0 and needs two 1024-sample blocks of silence to sync.
UPDATE:
Other things I have tried:
Increasing the duration of the first video frame to account for the offset: Produces no effect.
I wrote another version of your program to handle the NV12 format correctly (yours was not working):
EncodeWithSourceReaderSinkWriter
I use Blender as my video editing tool. Here are my results with Tuning_against_a_window.mov:
From the bottom to the top:
Original file
Encoded file
The original file modified by setting the "elst" atoms' number of entries to 0 (I used the Visual Studio hex editor)
Like Roman R. said, the Media Foundation mp4 source doesn't use the "edts/elst" atoms. But Blender and your video editing tools do. The "tmcd" track is also ignored by the mp4 source.
"edts/elst" :
Edits Atom ( 'edts' )
Edit lists can be used for hint tracks...
MPEG-4 File Source
The MPEG-4 file source silently ignores hint tracks.
So in fact, the encoding is good. I think there is no audio stream sync offset compared to the real audio/video data. For example, you can add "edts/elst" to the encoded file to get the same result.
PS: on the encoded file, I added "edts/elst" for both the audio and video tracks. I also increased the size of the trak atoms and the moov atom. I confirm that Blender shows the same waveform for both the original and the encoded file.
EDIT
I tried to understand the relation between the mvhd/tkhd/mdhd/elst atoms in the 3 video samples. (Yes I know, I should read the spec. But I'm lazy...)
You can use an mp4 explorer tool to get the atoms' values, or use the mp4 parser from my H264Dxva2Decoder project:
H264Dxva2Decoder
Tuning_against_a_window.mov
elst (media time) from tkhd video : 20689
elst (media time) from tkhd audio : 1483
GREEN_SCREEN_ANIMALS__ALPACA.mp4
elst (media time) from tkhd video : 2002
elst (media time) from tkhd audio : 1024
GOPR6239_1.mov
elst (media time) from tkhd video : 0
elst (media time) from tkhd audio : 0
As you can see, with GOPR6239_1.mov, media time from elst is 0. That's why there is no video/audio sync problem with this file.
For Tuning_against_a_window.mov and GREEN_SCREEN_ANIMALS__ALPACA.mp4, I tried to calculate the video/audio offset.
I modified my project to take this into account:
EncodeWithSourceReaderSinkWriter
For now, I haven't found a generic calculation that works for all files.
I just found the video/audio offset needed to encode each of the two files correctly.
For Tuning_against_a_window.mov, I begin encoding after (movie time - video/audio mdhd time).
For GREEN_SCREEN_ANIMALS__ALPACA.mp4, I begin encoding after the video/audio elst media time.
It works, but I still need to find the single correct calculation for all files.
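To make the units concrete, here is a small sketch (in Python, for brevity) of the conversion involved: an elst media time is expressed in the track's mdhd timescale, so turning it into the 100 ns units used by the Sink Writer timestamps looks like the snippet below. The timescale values here are assumptions for illustration, not values read from the sample files.
def elst_media_time_to_hns(media_time, mdhd_timescale):
    # media_time / timescale gives seconds; 1 second = 10,000,000 units of 100 ns
    return media_time * 10_000_000 // mdhd_timescale

# Hypothetical timescales; read the real ones from each track's mdhd atom
video_skip_hns = elst_media_time_to_hns(2002, 24000)   # e.g. the GREEN_SCREEN video elst value
audio_skip_hns = elst_media_time_to_hns(1024, 48000)   # e.g. the GREEN_SCREEN audio elst value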
So you have 2 options:
encode the file and add the elst atom
encode the file using the right offset calculation
It depends on your needs:
The first option lets you keep the original file, but you have to add the elst atom.
With the second option you have to read the atoms from the file before encoding, and the encoded file will lose a few original frames.
If you choose the first option, I will explain how I add the elst atom.
PS: I'm interested in this question because in my H264Dxva2Decoder project the edts/elst atom is on my todo list.
I parse it, but I don't use it...
PS2: this link sounds interesting:
Audio Priming - Handling Encoder Delay in AAC
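That link is relevant because AAC works in frames of 1024 PCM samples, which would explain why the padding above only comes in 1024-sample blocks. A quick sanity check of the arithmetic (in Python, with the 48000 Hz sample rate from the question assumed):
samples_per_block = 1024
sample_rate = 48000
block_hns = samples_per_block * 10_000_000 // sample_rate
print(block_hns)   # 213333 hns, i.e. about 21.3 ms of silence per 1024-sample block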

Default WAV description when all specs are "0"

I'm learning how to read WAV files in C++ and extract data according to the header. I have a few WAV files lying around. Looking at the header of all of them, I see that they all follow the rules of wave files. However, recordings produced by TeamSpeak are weird, yet they're still playable in media players.
Looking at the standard format of WAV files, the header contains the fields from "AudioFormat" up to "BitsPerSample".
In all the files that look normal, I get legitimate values for all of those fields. However, in the TeamSpeak files, ALL of these values are exactly zero.
Only the first three values are not zero: there's "RIFF" and "WAVE" in the first and third strings, and the ChunkSize seems legit.
So my question is: how does the player know anything about such a file and recognize whether it is mono or stereo? The sample rate? Anything about it? Is there something standard to assume when all these values are zero?
Update
I examined the file with MediaInfo and got this:
General
Complete name : ts3_recording_16_10_02_17_53_54.wav
Format : Wave
File size : 2.45 MiB
Duration : 13 s 380 ms
Overall bit rate mode : Constant
Overall bit rate : 1 536 kb/s
Audio
Format : PCM
Format settings, Endianness : Little
Format settings, Sign : Signed
Codec ID : 1
Duration : 13 s 380 ms
Bit rate mode : Constant
Bit rate : 1 536 kb/s
Channel(s) : 2 channels
Sampling rate : 48.0 kHz
Bit depth : 16 bits
Stream size : 2.45 MiB (100%)
Still, I don't understand how it arrived at these conclusions.
After examining your file with a hex editor that has WAV binary templates, it is obvious that there is an additional "JUNK" chunk before the "fmt " one. The JUNK chunk is probably there for padding/alignment reasons, but all of its values are 0s. You need to scan the wav file in your code (with fseek, for example) for the first occurrence of the "fmt " bytes and parse the WAVEFORMATEX info from there.
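A minimal sketch of that chunk-scanning approach (shown in Python for brevity; the same walk over RIFF chunk headers translates directly to C++ with fread/fseek). It assumes a plain little-endian RIFF/WAVE file and skips any chunk, such as JUNK, that precedes "fmt ":
import struct

def read_fmt(path):
    with open(path, "rb") as f:
        riff, _size, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no fmt chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                fmt = f.read(chunk_size)
                (audio_format, channels, sample_rate,
                 byte_rate, block_align, bits_per_sample) = struct.unpack("<HHIIHH", fmt[:16])
                return channels, sample_rate, bits_per_sample
            # skip this chunk (chunks are padded to an even byte count)
            f.seek(chunk_size + (chunk_size & 1), 1)

print(read_fmt("ts3_recording_16_10_02_17_53_54.wav"))   # e.g. (2, 48000, 16)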

Calculate TS File Duration

I am working on a media player application which plays ISDB-T audio and video.
I am using GStreamer for decoding and rendering.
For AV sync to work perfectly, I have to regulate the file reads so that data is pushed to GStreamer neither too fast nor too slow.
If I know the duration of the TS file beforehand, then I can regulate my reads. But how do I calculate the TS file duration?
I need to verify the application with multiple TS files, so I cannot calculate the duration with some external utility and keep adjusting the file reads by hand. How can this be achieved in the program?
Thanks,
Kranti
If you have sufficient knowledge of the encoding and the PES layer inside the transport stream, you can read the timestamps within the TS and calculate the duration yourself.
It requires seeking to the end of the file, searching for the last timestamp, and subtracting the first timestamp of the same program at the beginning of the file.
EDIT: In addition to the above method, you need to include the last frame's duration.
((last_pts - first_pts) + frame_duration) / pts_resolution
Let's say you have a 30 fps segment with a duration of 6.006 s:
((1081080 - 543543) + 3003) / 90000 = 6.006
In most cases, each PES header contains a PTS and/or DTS, which is measured against a 90 kHz clock. The steps, sketched in the code below, are:
Find the program you need to demux from the MPEG TS.
Find the PID of the stream.
Find the first TS packet with that PID and payload_unit_start_indicator set to 1; that is the start of a PES frame, which will contain a PES header.
Parse the PES header to find the starting PTS of the stream.
Parse the file backwards from the end to find a packet with the same PID and payload_unit_start_indicator set, which will contain the last PTS.
Take their difference and divide it by 90000 to get the duration in seconds.
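Here is a rough sketch of those steps in Python (illustrative only: it reads the whole file and ignores PTS wrap-around, so for very large files you would scan from both ends instead; the PID and the last-frame duration are inputs you have to determine yourself):
TS_PACKET_SIZE = 188

def pts_from_pes(payload):
    # PES header: 00 00 01 <stream_id> <length:2> <flags:2> <header_len> <PTS:5> ...
    if len(payload) < 14 or payload[:3] != b"\x00\x00\x01":
        return None
    if not (payload[7] & 0x80):        # PTS_DTS_flags: is a PTS present?
        return None
    p = payload[9:14]                  # 5-byte encoded PTS, 90 kHz units
    return (((p[0] >> 1) & 0x07) << 30) | (p[1] << 22) | ((p[2] >> 1) << 15) | (p[3] << 7) | (p[4] >> 1)

def packet_pts(pkt, wanted_pid):
    if pkt[0] != 0x47 or not (pkt[1] & 0x40):          # sync byte and payload_unit_start_indicator
        return None
    pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
    if pid != wanted_pid:
        return None
    afc = (pkt[3] >> 4) & 0x3
    offset = 4 + (1 + pkt[4] if afc & 0x2 else 0)      # skip the adaptation field if present
    return pts_from_pes(pkt[offset:]) if afc & 0x1 else None

def ts_duration_seconds(path, pid, last_frame_duration_90k):
    with open(path, "rb") as f:
        data = f.read()
    pts_values = []
    for i in range(0, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pts = packet_pts(data[i:i + TS_PACKET_SIZE], pid)
        if pts is not None:
            pts_values.append(pts)
    return ((pts_values[-1] - pts_values[0]) + last_frame_duration_90k) / 90000.0

# Example: hypothetical video PID 0x100 and 3003 ticks per frame
# print(ts_duration_seconds("sample.ts", 0x100, 3003))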

Rendering some sound data into one new sound data?

I'm creating an application that will read a unique format containing a sound "bank" and offsets indicating when each sound must be played.
Imagine something like...
Sound bank: (ID on the left-hand side and file name on the right-hand side)
0 kick.wav
1 hit.wav
2 flute.wav
And the offsets: (time in ms on the left-hand side and sound ID on the right-hand side)
1000 0
2000 1
3000 2
And the application will generate a new sound file (i.e. wav, for later conversion to other formats) that plays a kick at the first second, a hit at the second second, and a flute at the third second.
I have no idea at all where to begin.
I usually use FMOD for audio playback, but I've never done anything like this before.
I'm using C++ and wxWidgets in an MSVC++ Express Edition environment, and LGPL libraries would be fine.
If I understand correctly, you want to generate a new wave file by mixing wavs from a soundbank. You may not need a sound API at all for this, especially if all your input wavs are in the same format.
Simply load each wav file into a buffer. Then, for the next SampleRate * secondsUntilStartTime output samples, for each buffer in the ActiveList add buffer[bufferIdx++] into the output buffer; if bufferIdx == bufferLen, remove that buffer from the ActiveList. When the next StartTime is reached, add the next buffer to the ActiveList and repeat. A small sketch of this idea follows.
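Here is a minimal sketch of that mixing loop (in Python for brevity, since the idea is language-independent), assuming every input wav shares the same format, in this case 16-bit mono PCM at one common sample rate; the bank and schedule are the example values from the question:
import wave, struct

SAMPLE_RATE = 44100   # assumed common rate of all inputs

def load_samples(path):
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))

bank = {0: "kick.wav", 1: "hit.wav", 2: "flute.wav"}
schedule = [(1000, 0), (2000, 1), (3000, 2)]    # (offset in ms, sound ID)

out = []
for offset_ms, sound_id in schedule:
    start = offset_ms * SAMPLE_RATE // 1000
    samples = load_samples(bank[sound_id])
    if len(out) < start + len(samples):
        out.extend([0] * (start + len(samples) - len(out)))
    for i, s in enumerate(samples):
        # sum overlapping sounds and clamp to the 16-bit range
        out[start + i] = max(-32768, min(32767, out[start + i] + s))

with wave.open("mix.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(SAMPLE_RATE)
    w.writeframes(struct.pack("<%dh" % len(out), *out))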
If FMOD supports output to a file instead of the sound hardware, you can do this same thing with the streaming API. Just keep track of elapsed samples in the StreamCallback, and start mixing in new files whenever you reach their start offsets.