Split mp4 by size and keep (rebase) chapter marks - gstreamer

I need to split MP4 files by size while maintaining chapter information. Is there a way to recalculate and split chapter marks at the same time?
Example: I have a 3 GB video file with chapters distributed over the entire file. I need to split into 2 GB + 1 GB chunks. Chapters with start times within the first 2 GB need to end up in the first split while chapters with start times after the 2 GB split shall be rebased (chapter start times recalculated) and written to the next file.
I have tried gstreamer, but this drops chapters entirely:
gst-launch-1.0 -e splitmuxsrc name=demux location=video.mp4 splitmuxsink name=mux location=video_%03d.mp4 max-size-bytes=1000000 demux.video_0 ! mux.video demux.audio_0 ! mux.audio_0 > nul
With mp4box, the first split retains all chapters, even those whose start time exceeds the length of that split, while all other splits have no chapters at all:
mp4box video.mp4 -splits 1024 -out video_$num%03d$.mp4
I could probably do it manually by extracting the chapter info with mp4box, splitting by size, then somehow reading the video length of each split, manually splitting and rebasing the chapter marks, and finally writing them to the individual files.
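For reference, the rebasing arithmetic of that manual route might look roughly like the Python sketch below (purely illustrative: the chapter times, titles and split durations are made-up placeholders, and extracting/writing the actual chapter files with mp4box is left out):

def rebase_chapters(chapters, split_durations):
    # chapters: list of (start_seconds, title); split_durations: length of each split in seconds.
    # Returns one chapter list per split, with start times rebased to that split.
    boundaries, offset = [], 0.0
    for d in split_durations:
        boundaries.append((offset, offset + d))
        offset += d
    result = [[] for _ in split_durations]
    for start, title in chapters:
        for i, (lo, hi) in enumerate(boundaries):
            if lo <= start < hi:
                result[i].append((start - lo, title))
                break
    return result

# made-up example: a 3600 s + 1800 s split
chapters = [(0.0, "Intro"), (1500.0, "Part 2"), (4200.0, "Part 3")]
print(rebase_chapters(chapters, [3600.0, 1800.0]))
# [[(0.0, 'Intro'), (1500.0, 'Part 2')], [(600.0, 'Part 3')]]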
(I am using Windows 10)
Any suggestions? Thanks

Related

How do I read data from a file with description and blank lines with Fortran 77?

I am new to Fortran 77. I need to read the data from a given text file into two arrays, but there are some lines that either are blank or contain descriptive information on the data set before the lines containing the data I need to read. How do I skip those lines?
Also, is there a way my code can count the number of lines in that file containing the data I'm interested in? Or do I have to count them by hand to build my do-loops for reading the data?
I have tried to find examples online and in Schaum's Programming with Fortran 77, but couldn't find anything too specific on that.
Part of the file I need to read data from follows below. I need to build an array with the entries under each column.
Data from fig. 3 in Klapdor et al., MPLA_17(2002)2409
E(keV) counts_in_bin
2031.5 5.4
2032.5 0
2033.5 0
I am assuming this question is very basic, but I've been fighting with this for a while now, so I thought I would ask.
If you know where the lines are that you don't need/want to read, you can advance the IO with a call to read with no input items.
You can use:
read(input-unit,*)
to read a line from your input file, discard its contents and advance IO to the next line.
It has been a long time since I have looked at F77 code, but in general, if your READ statement in a DO loop can detect an empty line, or a record that contains only blanks, then you could write logic to trap that condition and jump to a CONTINUE statement. I just don't recall whether READ can deal with the situation intelligently.
Alternatively, if you are using a UNIX shell and the standard text tools, you can use sed to remove empty lines (/^$/ or /^ *$/) to preprocess the file before you feed it to your F77 program.
Something like
$ sed -e '/^$/d;/^ *$/d' infile > outfile
It should look something like this:-
C     Initialise
      integer i
      character*80 t1,t2,t3
      real*8 x,y
      open(unit=1,file='qdata.txt')
C     Read headers
      read(1,100) t1
  100 format(A80)
      write(6,*) t1
      read(1,100) t2
      write(6,*) t2
      read(1,100) t3
      write(6,*) t3
      write(6,*)
C     Read data
      do 10 i=1,10
         read(1,*,end=99) x,y
         write(6,*) x,y
   10 continue
   99 continue
      end
So I've used a classic formatted read to read in the header lines, then free-format to read the numbers. The free-format read with the asterisk skips white space including blank lines so it does what you want, and when there is no more data it will go to statement 99 and finish.
The output looks like this:-
Data from fig. 3 in Klapdor et al., MPLA_17(2002)2409
E(keV) counts_in_bin
2031.5000000000000 5.4000000000000004
2032.5000000000000 0.0000000000000000
2033.5000000000000 0.0000000000000000

Grep pattern match between very large files is way too slow

I've spent way too much time on this and am looking for suggestions. I have two very large files (FASTQ files from an Illumina sequencing run, for those interested). What I need to do is match a pattern common between both files and print that line plus the 3 lines below it into two separate files, without duplications (which exist in the original files). Grep does this just fine, but the files are ~18 GB and matching between them is ridiculously slow. An example of what I need to do is below.
FileA:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
TGTTCAAAGCAGGCGTATTGCTCGAATATATTAGCATGGAATAATAGAAT
+DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
__\^c^ac]ZeaWdPb_e`KbagdefbZb[cebSZIY^cRaacea^[a`c
You can see 3 unique headers starting with #, each followed by 3 additional lines.
FileB:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
_[_ceeeefffgfdYdffed]e`gdghfhiiihdgcghigffgfdceffh
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
There are 4 headers here, but only 2 are unique, as one of them is repeated 3 times.
I need the common headers between the two files, without duplicates, plus the 3 lines below them, in the same order in each file.
Here's what I have so far:
grep -E '#DLZ38V1.*/' --only-matching FileA | sort -u -o FileA.sorted
grep -E '#DLZ38V1.*/' --only-matching FileB | sort -u -o FileB.sorted
comm -12 FileA.sorted FileB.sorted > combined
combined:
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/
This is only the common headers between the two files without duplicates. This is what I want.
Now I need to match these headers to the original files and grab the 3 lines below them but only once.
If I use grep I can get what I want for each file
while read -r line; do
    grep -A3 -m1 -F "$line" FileA
done < combined > FileA.Final
FileA.Final:
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
The while loop is repeated to generate FileB.Final:
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
This works, but FileA and FileB are ~18 GB and my combined file is around ~2 GB. Does anyone have any suggestions on how I can dramatically speed up the last step?
Depending on how often you need to run this:
you could dump your data into a Postgres (or SQLite?) database (you'll probably want bulk inserts, with the index built afterwards), build an index on it, and enjoy the fruits of 40 years of research into efficient implementations of relational databases with practically no investment from you.
you could mimic having a relational database by using the unix utility 'join', but there wouldn't be much joy, since that doesn't give you an index; it is still likely to be faster than 'grep', though you might hit physical limitations... I never tried to join two 18 GB files.
you could write a bit of C code (put your favourite compiled-to-machine-code language here) which converts your strings (four letters only, right?) into binary and builds one or more indexes based on it. This could be made lightning fast with a small memory footprint, since your fifty-character string would take up only two 64-bit words, as sketched just below.
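A rough illustration of that 2-bits-per-base packing, shown in Python for brevity (the answer suggests a compiled language; the CODE table and function name here are just placeholders):

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    # Pack an ACGT string into an integer, 2 bits per base;
    # a 50-base read fits in 100 bits, i.e. two 64-bit words.
    value = 0
    for base in seq:
        value = (value << 2) | CODE.get(base, 0)  # unknown bases (e.g. N) map to 0
    return value

print(hex(pack("ACGTACGT")))  # 0x1b1b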
Thought I should post the fix I came up with for this. Once I obtained the combined file (above), I used a Perl hash reference to read the headers into memory and scan file A. Matches in file A were hashed and used to scan file B. This still takes a lot of memory but works very fast: from 20+ days with grep to ~20 minutes.
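The same idea, sketched in Python rather than Perl (purely illustrative; it assumes the 4-line record layout and the trailing-slash header prefixes from the question, and the file and function names are made up):

def load_headers(path):
    # Read the 'combined' file of common header prefixes into a set (a hash, in Perl terms).
    with open(path) as fh:
        return set(line.rstrip("\n") for line in fh)

def extract_matches(fastq_path, wanted, out_path):
    # Write each wanted 4-line record once; return the header prefixes actually found.
    seen = set()
    with open(fastq_path) as fh, open(out_path, "w") as out:
        while True:
            header = fh.readline()
            if not header:
                break
            body = [fh.readline() for _ in range(3)]
            key = header.split("/")[0] + "/"          # strip the /1 or /2 mate suffix
            if key in wanted and key not in seen:
                seen.add(key)
                out.write(header)
                out.writelines(body)
    return seen

wanted = load_headers("combined")
found_in_a = extract_matches("FileA", wanted, "FileA.Final")
extract_matches("FileB", found_in_a, "FileB.Final")   # scan B only for what A matched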

Regex searching of a CSV file

I have a huge task to do: separating voltage data from recorded .csv files of the following format.
13/03/2014 18:48,71.556671,71.651062,71.639755,72.130692,71.961441,72.646423,72.262756,72.334511,7.812012
I am new to regular expressions; how do I get the data from column 10, repeatedly?
I have over 10,000,000 files to reduce and average to 32,000 for Excel to graph. Any advice greatly welcome; I'm trying to use PowerGREP to get up to speed.
Not that I would say that regex is the tool for it, but here goes:
(?:[^,]*,){9}([^,]*)
I.e. nine "columns" of non-commas, separated by commas, then capture the tenth in group 1.
E.g. use it with a Perl one-liner:
perl -ne 'chomp; /(?:[^,]*,){9}([^,]*)/ and print "$1\n"'
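If Perl or PowerGREP isn't convenient, the same pattern can be applied from a short script; a minimal Python sketch (the file name is a placeholder):

import re

PAT = re.compile(r'(?:[^,]*,){9}([^,]*)')   # the pattern from the answer above

with open("recording.csv") as fh:           # placeholder file name
    for line in fh:
        m = PAT.match(line.rstrip("\n"))
        if m:
            print(m.group(1))               # column 10

From there the captured values can be binned and averaged down to something Excel will graph.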

Efficiently read the last row of a csv file

Is there an efficient C or C++ way to read the last row of a CSV file? The naive approach involves reading in the entire file and then going to the end. Is there a quicker way this can be done (particularly if the CSV files are large)?
What you can do is guess the line length, then jump 2-3 lines before the end of the file and read the remaining lines. The last line you read is the last one, as long as you read at least one complete line before it (otherwise, you start again with a bigger offset).
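The question asks for C or C++, but the seek/read logic is the same in any language; a rough Python sketch of the idea (the 4 KB tail guess is arbitrary):

def last_line(path, tail_guess=4096):
    # Jump near the end, read the tail, and keep the final line;
    # if we did not capture at least one full line, retry with a bigger guess.
    with open(path, "rb") as fh:
        fh.seek(0, 2)                      # seek to end of file
        size = fh.tell()
        while True:
            fh.seek(max(0, size - tail_guess))
            lines = fh.read().splitlines()
            if len(lines) > 1 or tail_guess >= size:
                return lines[-1].decode() if lines else ""
            tail_guess *= 2                # not enough data: start again with a bigger offset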
I posted some sample code for doing a similar thing (reading last N lines) in this answer (in PHP, but serves as an illustration)
For implementations in a variety of languages, see
C++ : c++ fastest way to read only last line of text file?
Python : Efficiently finding the last line in a text file
Perl : How can I read lines from the end of file in Perl?
C# : Get last 10 lines of very large text file > 10GB c#
PHP : how to read only 5 last line of the txt file
Java: Read last n lines of a HUGE file
Ruby: Reading the last n lines of a file in Ruby?
Objective-C : How to read data from NSFileHandle line by line?
You can try working backwards. Read some size block of bytes from the end of the file, and look for the newline. If there is no newline in that block, then read the previous block, and so on.
Note that if the size of a row is large relative to the size of the file, this may result in worse performance, because most file caching schemes assume someone reads forward in the file.
You can use Perl module File::ReadBackwards.
Your problem falls into the same domain as searching for a string within a file. As you rightly point out, it's not always a great idea to read the entire file into memory and then search for your string. But you can always do the next best thing: memory-map your file. Then use your string-searching functions to search backwards from the end of the mapping for your newline.
It's an extremely efficient mechanism with a minimal memory footprint and optimum disk I/O.
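A compact sketch of that memory-map approach, again in Python for brevity (mmap.rfind stands in for whatever backward string search you'd use in C/C++):

import mmap

def last_line_mmap(path):
    # Map the file and search backwards for the final newline; only the
    # pages actually touched near the end are faulted in from disk.
    # Note: a zero-byte file cannot be mapped.
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            if end and mm[end - 1:end] == b"\n":   # ignore a trailing newline
                end -= 1
            start = mm.rfind(b"\n", 0, end) + 1    # rfind returns -1 if none, so start becomes 0
            return mm[start:end].decode()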
Read with what and on what? On a Unix system, if you want the last line, it is as simple as
tail -n1 file.csv
If you want this approach from within your C++ app, you can do something like
system("tail -n1 file.csv")
if you want a quick and dirty way to accomplish this task.

How to read an MP3 file, separating metadata from audio?

I understand that the MP3 file format essentially consists of two segments: ID3 metadata plus audio frames. How can I read, in binary form, all of the ID3 segment and all of the audio frames as two binary blobs? I'm looking to simply perform a hash calculation on the metadata and the audio as two separate units in a file. How can I determine where the "split point" is in the file?
From the ID3 tag specification:
+-----------------------------+
|      Header (10 bytes)      |
+-----------------------------+
|       Extended Header       |
| (variable length, OPTIONAL) |
+-----------------------------+
|  Frames (variable length)   |
+-----------------------------+
|           Padding           |
| (variable length, OPTIONAL) |
+-----------------------------+
| Footer (10 bytes, OPTIONAL) |
+-----------------------------+
Note that there are several ID3 tag versions out there.
Specification: http://www.id3.org/id3v2.4.0-structure
There are usually zero, one, or two metadata chunks.
At the beginning of the file there may be an optional ID3 version 2 metadata chunk, which comes in three subversions. This ID3v2 always has a variable length which is encoded in the header, though it's encoded slightly differently depending on the subversion.
Then you have the audio frames. There is a variable number of them. There is no header telling how many there will be or where in the file they end.
Then at the end of the file there may be an optional ID3 version 1 metadata chunk, which has a fixed length of 128 bytes and begins with a 3-byte magic word.
Rarely, an ID3v2 tag might be at the end of the file or even in the middle.
Also there are rare extensions which may add extra stuff to the ID3v1 tag making it longer.
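A minimal Python sketch of the common case described above: an optional ID3v2 tag at the front (whose size is a 4-byte syncsafe integer in the 10-byte header) and an optional 128-byte ID3v1 block starting with "TAG" at the back. It deliberately ignores the rare layouts just mentioned, and the function and file names are placeholders:

import hashlib

def split_points(data):
    # Return (audio_start, audio_end) byte offsets delimiting the audio frames.
    audio_start = 0
    if data[:3] == b"ID3" and len(data) >= 10:
        flags = data[5]
        # ID3v2 size is a "syncsafe" integer: 7 significant bits per byte
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        audio_start = 10 + size + (10 if flags & 0x10 else 0)  # extra 10 if the footer flag is set

    audio_end = len(data)
    if len(data) >= 128 and data[-128:-125] == b"TAG":          # ID3v1 at the very end
        audio_end -= 128

    return audio_start, audio_end

with open("song.mp3", "rb") as fh:                              # placeholder file name
    data = fh.read()
start, end = split_points(data)
print(hashlib.sha256(data[start:end]).hexdigest())              # hash of the audio frames
print(hashlib.sha256(data[:start] + data[end:]).hexdigest())    # hash of the metadata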
You can iterate through all the "frames" in an MP3 file. Each frame begins with three bytes that can be used to tell whether the frame is an ID3v2 "tag", an MP3 audio frame, or an ID3v1 tag.
Note that errors or corruption are not rare in the audio frames. These frames start with a run of set bits known as the "sync" pattern (0xFF followed by a byte whose top bits are also set), and you have to use the other bytes and bits in the frame header to both do a sanity check and calculate the length of the frame.
When a frame doesn't begin with the sync pattern or an ID3 tag magic word, or fails the sanity check, you should skip bytes until you find the next sync pattern.
So you can take some shortcuts which will work most of the time, or iterate through the whole file, which can be slow. Also, I'm not really an expert, so there are likely things I've left out due to ignorance. In particular, although there are mechanisms meant to ensure no false sync patterns are embedded in the metadata, I believe they still occur sometimes.
Hope this helps for any new people coming here via the Googles (-: