How to find the frame end when an MPEG2 stream is coming in an MPEG-TS container over RTP?

I am receiving an MPEG2-TS stream over RTP, but I am unable to find the end of a particular frame.
When a raw MPEG2 elementary stream is carried over RTP, the marker bit in the RTP header is set to 1 on the packet that ends a frame, but in this case the marker bit is always 0.
Can anyone tell me how I can find the frame end in the case of MPEG2-TS?

According to RFC 2250 (section 3.3, RTP Fixed Header for MPEG ES encapsulation), the M bit should indicate the end of frame for MPEG-TS as well, but many senders may not be setting it in the header.
The only other way to find the start of a frame is to decode the header of each 188-byte MPEG-TS packet; MPEG-TS carries a "payload unit start indicator" flag.
So your algorithm will be like this (a minimal parsing sketch follows after the steps):
1. RTP data contains an integer number of MPEG-TS packets.
2. Each packet starts with the sync byte 0x47.
3. Check the "payload unit start indicator" field of each packet.
4. If "payload unit start indicator == 1", check whether the payload is PES or PSI.
5. Ignore the packet if it is PSI and continue with step 1, else go to the next step.
6. For a PES packet check the "stream id"; if it is a video stream id, you have hit a new frame.
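If it helps, here is a minimal Python sketch of those steps. Field layouts follow ISO/IEC 13818-1; the function name and the assumption that every video PES packet starts a new frame (so the previous frame ended just before it) are mine:

TS_PACKET_SIZE = 188

def video_frame_starts(rtp_payload: bytes):
    """Return offsets of TS packets inside one RTP payload that begin a new video PES packet."""
    starts = []
    for off in range(0, len(rtp_payload), TS_PACKET_SIZE):
        pkt = rtp_payload[off:off + TS_PACKET_SIZE]
        if len(pkt) < TS_PACKET_SIZE or pkt[0] != 0x47:
            continue                                 # not a valid TS packet
        if not ((pkt[1] >> 6) & 0x01):               # payload_unit_start_indicator
            continue
        payload = pkt[4:]
        if (pkt[3] >> 4) & 0x02:                     # adaptation field present, skip it
            payload = payload[1 + payload[0]:]
        # PES start code prefix 0x000001 followed by a video stream_id (0xE0-0xEF);
        # PSI sections do not start this way, so they are ignored automatically
        if len(payload) >= 4 and payload[:3] == b'\x00\x00\x01' and 0xE0 <= payload[3] <= 0xEF:
            starts.append(off)                       # a new video frame begins here
    return starts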

Related

Decoding an unknown CRC or checksum?

I've been trying to decode the CRC or checksum algorithm that is being used for the serial communication between a drone and its camera for about a week without a lot of luck, and I was wondering if anybody here sees something I am missing or has any suggestions.
A typical packet looks like this:
FE1A390100020001AE0BE0FF090046250B00040000004E0D32080008540D8808F4016B54
They always start with 0xFE. The 2nd byte is the total size of the packet minus 10 bytes. The packet sizes vary, but I think I am specifically interested in the 0x1A size. Byte 3 seems to be a packet counter because it usually increases by 1, but sometimes I have seen it jump to a completely different number for a few packets (usually when changing to a 0x22 size packet) before resuming the increment-by-1 sequence. The last 2 bytes always change and I believe they are the checksum or CRC. All the rest of the bytes seem to stay the same from one 0x1A packet to the next unless I manipulate the drone's radio controls.
Right after powering up there is a series of packets that I assume is for initializing the communication. They are the shortest packets and have the least amount of change between them, so it seems like they might be the easiest to look at. Here are the first 7 packets sent after powering it on.
From Drone to camera
Time:
8.3982205 FE030001000000010200018F68
8.39934725 FE03010100000001020001A844
8.400473958 FE03020100000001020001C130
8.401600708 FE050301000000000000000001AAE8
8.402900792 FE1A040100020001000000000000000000000C000300000853060008AB028808F4014629
8.406020958 FE22050100030002000000000000000000000000000000000000B3FFFFFFDE22006300FF615110050000C956
8.4098345 FE1A060100020001000000000000000000000C000300000853060008AB028808F40180A9
If I put the first 3 packets into reveng with -w 16 -s then it comes back with:
reveng: warning: you have only given 3 samples
reveng: warning: to reduce false positives, give 4 or more samples
width=16 poly=0x1487 init=0x0334 refin=false refout=false xorout=0x0000 check=0xa5b9 residue=0x0000 name=(none)
If I add the 4th packet it finds the same poly, but the rest of it looks different:
width=16 poly=0x1487 init=0x417d refin=false refout=false xorout=0x5582 check=0xbfa2 residue=0xb059 name=(none)
If I add the 5th packet, reveng comes back with no model found.
However, if I remove packet 4 and then run it with packets 1, 2, 3 and 5, it finds the same poly again, but different values for the rest:
width=16 poly=0x1487 init=0x804b refin=false refout=false xorout=0x0138 check=0x7dcc residue=0xc8ca name=(none)
Most combinations of packets containing a 0x1A size packet and the first 3 initialization packets that I run through reveng come back with 'no model found'. So far every run of reveng with only 0x1A-sized packets has failed to find a model.
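A small brute-force check like the following makes it easy to test a candidate model against the rest of the capture without re-running reveng for every combination. The routine is a plain MSB-first CRC-16 with refin=false/refout=false, the shape of model reveng reports; the parameters below are just the candidates from the reveng output above, not a confirmed model, and whether the check covers the leading 0xFE is also an assumption:

def crc16(data: bytes, poly: int, init: int, xorout: int = 0) -> int:
    # plain MSB-first CRC-16, refin=false, refout=false
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc ^ xorout

pkt = bytes.fromhex("FE030001000000010200018F68")
# compare the candidate model's output with the trailing 2 bytes of the packet
print(hex(crc16(pkt[:-2], poly=0x1487, init=0x0334)), pkt[-2:].hex())

Looping that over every captured packet quickly shows which packets a candidate model does and does not reproduce.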
I think it is possible that after the initialization packets it somehow incorporates info it receives from the camera into the CRC calculation for the data going from the drone to the camera, but there isn't a lot of data in those packets. Here are the first 9 packets that are sent from the camera to the drone. Prior to the first 0x1A packet being sent from the drone, the only data sent from the camera seems to be 0x7D0001.
From camera to drone:
Time
3.474456792 FE0500020000000000007D00013D40
4.475220208 FE0501020000000000007D000168C5
5.476483875 FE0502020000000000007D00018642
6.477295958 FE0503020000000000007D0001D3C7
7.4783405 FE0504020000000000007D00014B45
8.479420458 FE06050200010003FA078538B838B3
8.480811667 FE0506020000000000007D0001F047
9.48057875 FE0507020000000000007D0001A5C2
9.481883 FE06080200010003F9078638B8386037
I have tried incorporating 0x7D0001 into the packets and running them through reveng, but that didn't seem to help.
I have also tried reveng -w 8 -s on various combinations of packets without finding a model. And I have tried various checksum algos manually (possibly incorrectly) without success.
I have a bunch more data that I have captured here:
https://drive.google.com/open?id=1v8MCaXOvP_2Wv_hcaqhUZnXvqNI1_2Ur
Any ideas? Suggestions? This has been driving me nuts for a week

Media Foundation video re-encoding producing audio stream sync offset

I'm attempting to write a simple Windows Media Foundation command-line tool that uses IMFSourceReader and IMFSinkWriter to load a video, read the video and audio as uncompressed streams, and re-encode them to H.264/AAC with some specific hard-coded settings.
The simple program Gist is here
sample video 1
sample video 2
sample video 3
(Note: the videos I've been testing with are all stereo, 48000 Hz sample rate)
The program works; however, in some cases, when comparing the newly output video to the original in an editing program, I see that the copied video streams match, but the audio stream of the copy is prefixed with some amount of silence and the audio is offset, which is unacceptable in my situation.
audio samples:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[silence] [silence] [silence] [audio1] [audio2] [audio3] ... etc
In cases like this the first video frames coming in have a non-zero timestamp but the first audio frames do have a 0 timestamp.
I would like to be able to produce a copied video whose first video and audio frames start at 0, so I first attempted to subtract that initial timestamp (videoOffset) from all subsequent video frames, which produced the video I wanted but resulted in this situation with the audio:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio4] [audio5] [audio6] [audio7] [audio8] ... etc
The audio track is now shifted in the other direction by a small amount and still doesn't align. This can also happen when a video stream does have a starting timestamp of 0, yet WMF still cuts off some audio samples at the beginning anyway (see sample video 3)!
I've been able to fix this sync alignment and offset the video stream to start at 0 with the following code inserted at the point of passing the audio sample data to the IMFSinkWriter:
//inside read sample while loop
...
// LONGLONG llDuration has the currently read sample duration
// DWORD audioOffset has the global audio offset, starts as 0
// LONGLONG audioFrameTimeStamp has the currently read sample timestamp
//add some random amount of silence in intervals of 1024 samples
static bool runOnce{ false };
if (!runOnce)
{
size_t numberOfSilenceBlocks = 1; //how to derive how many I need!? It's arbitrary
size_t samples = 1024 * numberOfSilenceBlocks;
audioOffset = samples * 10000000 / audioSamplesPerSecond;
std::vector<uint8_t> silence(samples * audioChannels * bytesPerSample, 0);
WriteAudioBuffer(silence.data(), silence.size(), audioFrameTimeStamp, audioOffset);
runOnce = true;
}
LONGLONG audioTime = audioFrameTimeStamp + audioOffset;
WriteAudioBuffer(dataPtr, dataSize, audioTime, llDuration);
Oddly, this creates an output video file that matches the original.
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
The solution was to insert extra silence in block sizes of 1024 at the beginning of the audio stream. It doesn't matter what the audio chunk sizes provided by IMFSourceReader are, the padding is in multiples of 1024.
My problem is that there seems to be no detectable reason for the silence offset. Why do I need it? How do I know how much I need? I stumbled across the 1024-sample silence block solution after days of fighting this problem.
Some videos seem to only need 1 padding block, some need 2 or more, and some need no extra padding at all!
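For reference, this is the conversion the snippet above performs for one block; at 48000 Hz each 1024-sample silence block corresponds to roughly 21.3 ms of padding:

samples_per_block = 1024
sample_rate = 48000                                   # Hz, per the test videos
hns_per_block = samples_per_block * 10_000_000 // sample_rate
print(hns_per_block)                                  # 213333 hns, i.e. about 21.3 ms per block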
My questions here are:
Does anyone know why this is happening?
Am I using Media Foundation incorrectly in this situation to cause this?
If I am correct, how can I use the video metadata to determine whether I need to pad an audio stream and how many 1024-sample blocks of silence need to be in the pad?
EDIT:
For the sample videos above:
sample video 1 : the video stream starts at 0 and needs no extra blocks, passthrough of original data works fine.
sample video 2 : video stream starts at 834166 (hns) and needs one 1024-sample block of silence to sync
sample video 3 : video stream starts at 0 and needs two 1024-sample blocks of silence to sync.
UPDATE:
Other things I have tried:
Increasing the duration of the first video frame to account for the offset: Produces no effect.
I wrote another version of your program to handle the NV12 format correctly (yours was not working):
EncodeWithSourceReaderSinkWriter
I use Blender as my video editing tool. Here are my results with Tuning_against_a_window.mov:
from the bottom to the top:
Original file
Encoded file
The original file, changed by setting the "elst" atoms' number of entries to 0 (I used the Visual Studio hex editor)
Like Roman R. said, the Media Foundation mp4 source doesn't use the "edts/elst" atoms. But Blender and your video editing tools do. Also, the "tmcd" track is ignored by the mp4 source.
"edts/elst" :
Edits Atom ( 'edts' )
Edit lists can be used for hint tracks...
MPEG-4 File Source
The MPEG-4 file source silently ignores hint tracks.
So in fact, the encoding is good. I think there is no audio stream sync offset compared to the real audio/video data. For example, you can add "edts/elst" to the encoded file to get the same result.
PS: on the encoded file, I added "edts/elst" for both audio/video tracks. I also increased the size of the trak atoms and the moov atom. I confirm that Blender shows the same waveform for both the original and the encoded file.
EDIT
I tried to understand the relation between the mvhd/tkhd/mdhd/elst atoms in the 3 video samples. (Yes I know, I should read the spec. But I'm lazy...)
You can use an mp4 explorer tool to get the atoms' values, or use the mp4 parser from my H264Dxva2Decoder project:
H264Dxva2Decoder
Tuning_against_a_window.mov
elst (media time) from tkhd video : 20689
elst (media time) from tkhd audio : 1483
GREEN_SCREEN_ANIMALS__ALPACA.mp4
elst (media time) from tkhd video : 2002
elst (media time) from tkhd audio : 1024
GOPR6239_1.mov
elst (media time) from tkhd video : 0
elst (media time) from tkhd audio : 0
As you can see, with GOPR6239_1.mov the media time from elst is 0. That's why there is no video/audio sync problem with this file.
For Tuning_against_a_window.mov and GREEN_SCREEN_ANIMALS__ALPACA.mp4, I tried to calculate the video/audio offset.
I modified my project to take this into account:
EncodeWithSourceReaderSinkWriter
For now, I haven't found a generic calculation for all files.
I just found the video/audio offset needed to encode both files correctly.
For Tuning_against_a_window.mov, I begin encoding after (movie time - video/audio mdhd time).
For GREEN_SCREEN_ANIMALS__ALPACA.mp4, I begin encoding after the video/audio elst media time.
It's OK, but I still need to find the right single calculation for all files.
So you have 2 options:
encode the file and add the elst atom
encode the file using the right offset calculation
It depends on your needs:
The first option lets you keep the original file, but you have to add the elst atom.
With the second option you have to read the atoms from the file before encoding, and the encoded file will lose a few original frames.
If you choose the first option, I will explain how I add the elst atom.
PS: I'm interested in this question, because in my H264Dxva2Decoder project the edts/elst atom is on my todo list.
I parse it, but I don't use it...
PS2: this link looks interesting:
Audio Priming - Handling Encoder Delay in AAC

RTP timestamp not linear?

I was trying to reconstruct an audio conversation (an A-B call using G.711 audio) using the RTP timestamps. I used to fill silence using the difference of two RTP timestamps and the sampling rate. The conversation went out of sync, and then I saw that the RTP timestamp is not linear. I was not able to get the exact clock time using the RTP timestamp, which resulted in sync issues. How do I calculate the exact time?
I have the same problem with a stream provided by GStreamer, which doesn't provide monotonic timestamps.
For example: the difference between the stamps should be exactly 1920, but it is between ~120 and ~3500, though on average 1920.
The problem here is that there is no way to find missing samples, because you never know if a large difference comes from encoder delay or from a missing sample.
If you have only audio to decode, I would try to put "valid" PTS values on each sample (in my case basetime+1920, basetime+3840 and so on).
The big problem comes when video AND audio are combined. Here this trick doesn't work well when samples are missing, and there is no way to find out when that is the case :(
When you want to send RTP you should pay attention to two things:
The timestamp is incremented according to the amount of data sent.
For example, for PT=10 you may have this pattern:
1160 bytes, timestamp increment: 1154, and wait 26 ms
Let's see how this calculation happens:
number of packets to be sent in one second: 1/(26 ms) ≈ 38
timestamp increment: clock rate / (packets per second) = 1154
According to RFC 3550 (https://www.ietf.org/rfc/rfc3550.txt):
The sampling instant MUST be derived from a clock that increments
monotonically
It's not a choice or an option. By the way, please read the full description of the timestamp field of the RTP packet; there I also found this:
As an example, for fixed-rate audio
the timestamp clock would likely increment by one for each
sampling period. If an audio application reads blocks covering
160 sampling periods from the input device, the timestamp would be
increased by 160 for each such block, regardless of whether the
block is transmitted in a packet or dropped as silent.
If you want to check linearity, use the RTP and NTP timestamp fields of the RTCP SR. In an SR report the RTP timestamp corresponds to the NTP timestamp.
So take the differences of consecutive RTP timestamps (call them dRTP_1, dRTP_2, ...) and the differences of consecutive NTP timestamps (call them dNTP_1, dNTP_2, ...), then divide each dRTP_i by the clock rate and check whether you get dNTP_i. A minimal sketch follows below.
But first please read the RFC.
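Here is that check as a small Python sketch; the (NTP, RTP) pairs and the clock rate in the example call are made-up illustration values, the real ones come from consecutive Sender Reports:

def check_linearity(sr_pairs, clock_rate):
    # sr_pairs: list of (ntp_seconds, rtp_timestamp) pairs taken from consecutive RTCP SRs
    # (32-bit RTP timestamp wraparound is ignored for brevity)
    for (ntp_a, rtp_a), (ntp_b, rtp_b) in zip(sr_pairs, sr_pairs[1:]):
        d_ntp = ntp_b - ntp_a                     # seconds of wall-clock time
        d_rtp = (rtp_b - rtp_a) / clock_rate      # seconds of media clock
        print(f"dNTP={d_ntp:.3f}s  dRTP={d_rtp:.3f}s  drift={d_rtp - d_ntp:+.3f}s")

# G.711 uses an 8000 Hz RTP clock
check_linearity([(0.0, 0), (5.0, 40000), (10.0, 80160)], clock_rate=8000)

If the stream is linear, the drift stays near zero from one report to the next.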

Can a false sync word be found in the payload of an MPEG-1/MPEG-2 frame?

I know I can find other answers about this on SO, but I want clarifications from somebody who really knows MPEG-1/MPEG-2 (or MP3, obviously).
The start of an MPEG-1/2 frame is 12 set bits starting at a byte boundary, so bytes ff f*, where * is any nibble. Those 12 bits are called a sync word. This is a useful characteristic to find the start of a frame in any MPEG-1/2 stream.
My first question is: formally, can a false sync word be found or not in the payload of an MPEG-1/2 frame, outside its header?
If so, here's my second question: why does the sync word mechanism even exist then? If we cannot make sure that we found a new frame when reading fff, what is the purpose of this sync word?
Please do not even consider ID3 in your answer; I already know about sync words that can be found in ID3v2 payloads, but that's well documented.
I worked on MPEG-2 streams, more precisely Transport Streams (TS): I guess we can find similarities.
A TS is composed of Transport Packets, which have a header starting with the sync byte 0x47.
We can also find 0x47 within the payload of a TP, but we know that it is not a sync byte because it is not aligned (TPs have a fixed size of 188 bytes).
The sync word gives an entry point to someone looking at the stream, and allows a program to synchronize its processing with the stream, hence the name.
It also allows fast browsing and parsing of the stream: in a TS you can jump from one packet to another (inspect the header, check the sync byte, skip 188 bytes and so on).
Finally, it is a safety measure that helps you spot errors (in the stream during transmission, for example, or in your code if a bug causes a bad alignment).
These arguments are about TS, but I think the same goes for your case: finding a sync word within a payload should not be an issue, because you should always be able to distinguish payload and header, most of the time with length information (either because the size is fixed, as in TP, or because you have a TLV format).
can a false sync word be found or not in the payload of an MPEG-1/2
frame, outside its header?
According to this, "frame sync can be easily (and very frequently) found in any binary file." See the section titled "MPEG Audio Frame Header"
I confirmed this with an .mp3 song that I chose at random (stripped of ID3 tags). It had 5193 sync words, of which only 4898 were found to be valid (using code too long to be included here).
>>> f = open('notag.mp3', 'rb')
>>> r=f.read()
>>> r.count(b'\xff\xfb')
5193
why does the sync word mechanism even exist then? If we cannot make
sure that we found a new frame when reading fff, what is the purpose
of this sync word?
We can be (relatively) sure if we are checking the rest of the frame header, and not just the sync word. There are bits following the sync which can be used to:
identify a false positive or
give you useful info
With .mp3, you have to use those useful bits to calculate the size of the frame. By skipping ahead <frame-size> bytes before looking for the next sync word, you avoid any false syncs that may be present in the payload. See the section titled "How to calculate frame length" in that same link.
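As an illustration of that skip-by-frame-length approach, here is a hedged sketch for MPEG-1 Layer III only; the bitrate/sample-rate tables are the standard ones for that case, the function names are mine, and anything that does not parse as a valid header is treated as a false sync and skipped one byte at a time:

BITRATES = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]   # kbit/s
SAMPLE_RATES = [44100, 48000, 32000]                                             # Hz

def frame_length(header: bytes):
    # sync bits plus the MPEG-1 / Layer III version and layer bits (fff.. pattern overall)
    if header[0] != 0xFF or (header[1] & 0xE0) != 0xE0 or (header[1] & 0x1E) != 0x1A:
        return None
    bitrate_index = header[2] >> 4
    samplerate_index = (header[2] >> 2) & 0x03
    padding = (header[2] >> 1) & 0x01
    if bitrate_index in (0, 15) or samplerate_index == 3:
        return None                                # "free"/invalid values -> treat as false sync
    bitrate = BITRATES[bitrate_index] * 1000
    samplerate = SAMPLE_RATES[samplerate_index]
    return 144 * bitrate // samplerate + padding   # MPEG-1 Layer III frame size in bytes

def frame_offsets(data: bytes):
    pos, offsets = 0, []
    while pos + 4 <= len(data):
        length = frame_length(data[pos:pos + 4])
        if length:
            offsets.append(pos)
            pos += length                          # jump over the payload, avoiding false syncs
        else:
            pos += 1                               # resync byte by byte
    return offsets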

Calculate TS File Duration

I am working on a media player application which plays ISDB-T audio and video.
I am using GStreamer for decoding & rendering.
For AV sync to work perfectly, I should regulate the file reads so that data is pushed to GStreamer neither too fast nor too slow.
If I know the duration of the TS file beforehand, then I can regulate my reads. But how do I calculate the TS file duration?
Because I need to verify the application with multiple TS files, I cannot calculate the duration using some utility and keep changing the file reads - how can this be achieved in the program?
Thanks,
Kranti
If you have sufficient knowledge of the encoding and the PES layer inside the transport stream, then you can read the timestamps within the TS and calculate it yourself.
It requires seeking to the end of the file, searching for the last timestamp, and subtracting the first timestamp of the same program found at the beginning of the file.
EDIT: In addition to the above method you need to include the last frame's duration.
((last_pts - first_pts) + frame_duration) / pts_resolution
Let's say you have a 30 fps segment with a duration of 6.006 s:
((1081080 - 543543) + 3003) / 90000 = 6.006
In most cases, each PES header contains a PTS and/or DTS, measured on a 90 kHz clock. So the steps may include (a sketch follows after the list):
find the program you need to demux from the MPEG TS
find the PID of the stream
find the first TS packet with that PID and payload_unit_start_indicator set to 1; that will be the start of a PES frame, which will contain a PES header
parse the PES header to find the starting PTS of the stream
parse the file backwards from the end to find a packet with the same PID and payload_unit_start_indicator set, which will contain the last PTS
find their difference; dividing it by 90000 gives the duration in seconds
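A minimal Python sketch of those steps: for simplicity it scans the whole file forward instead of parsing backwards from the end, ignores 33-bit PTS wraparound, and assumes the capture is aligned to 188-byte packets and that you already know the PID of the stream (e.g. from the PAT/PMT):

TS_PACKET = 188

def read_pts(pes: bytes):
    # PES header: 00 00 01 <stream_id> <length> <flags>; a PTS is present when the
    # top bit of the PTS_DTS_flags byte is set, encoded over bytes 9..13
    if len(pes) < 14 or pes[:3] != b'\x00\x00\x01' or not (pes[7] & 0x80):
        return None
    p = pes[9:14]
    return (((p[0] >> 1) & 0x07) << 30) | (p[1] << 22) | ((p[2] >> 1) << 15) | (p[3] << 7) | (p[4] >> 1)

def ts_duration_seconds(path: str, pid: int) -> float:
    first = last = None
    with open(path, 'rb') as f:
        data = f.read()
    for off in range(0, len(data) - TS_PACKET + 1, TS_PACKET):
        pkt = data[off:off + TS_PACKET]
        if pkt[0] != 0x47 or not ((pkt[1] >> 6) & 0x01):        # need sync byte + PUSI
            continue
        if (((pkt[1] & 0x1F) << 8) | pkt[2]) != pid:            # 13-bit PID
            continue
        payload = pkt[4:]
        if (pkt[3] >> 4) & 0x02:                                # skip adaptation field
            payload = payload[1 + payload[0]:]
        pts = read_pts(payload)
        if pts is not None:
            first = pts if first is None else first
            last = pts
    if first is None or last is None:
        return 0.0
    # as the other answer notes, add one frame duration if the last frame should be included
    return (last - first) / 90000.0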