I have a bunch of audio files of telephone conversations. I want to split an audio file into two files, each containing only one speaker's speech. Maybe I need to use speaker diarization, but how can I do that? Can anybody give me some clues? Thank you. PS: Linux OS, C/C++.
While separating the individual speakers is quite a difficult problem, you can automatically split the audio where there are pauses. This would produce a series of files that would likely be easier to manage, since speakers often alternate between pauses.
This approach requires the open source Julius speech recognition decoder package. This is available in many Linux package repositories. I use the Ubuntu multiverse repository.
Here is the site: http://julius.sourceforge.jp/en_index.php
Step 0: Install Julius
sudo apt-get install julius
Step 1: Segment Audio
adintool -in file -out file -filename myRecording.wav -startid 0 -freq 44100 -lv 2048 -zc 30 -headmargin 600 -tailmargin 600
-startid is the starting segment number that will be appended to the filename
-freq is the sample rate of the source audio file
-lv is the level of the audio above which voice detection will be active
-zc is the zero crossings above which voice detection will be active
-headmargin and -tailmargin are the amounts of silence kept before and after each audio segment
Note that -lv and -zc will have to be adjusted for your particular recording's characteristics, while -headmargin and -tailmargin will have to be adjusted for your particular speakers' styles. But the values given above have worked well for my voice recordings in the past.
Here is the documentation: http://julius.sourceforge.jp/juliusbook/en/adintool.html
In my experience preprocessing the audio using compression and normalization gives better results and requires less adjustment of the Julius arguments. These initial steps are recommended but not required.
This approach requires the open source SoX audio toolkit package. This is also available in many Linux package repositories. I use the Ubuntu universe repository.
Here is the site: http://sox.sourceforge.net
Step -2: Install SoX
sudo apt-get install sox
Step -1: Preprocess Audio
sox myOriginalRecording.wav myRecording.wav gain -b -n -8 compand 0.2,0.6 4:-48,-32,-24 0 -64 0.2 gain -b -n -2
gain -b -n balances and normalizes the audio to a given level
compand compresses (in this case) the audio based on the parameters
Note that compand may require some time to completely understand the parameters. But the values given above have worked well for my voice recordings in the past.
Here is the documentation: http://sox.sourceforge.net/sox.html
While this will not identify each speaker, it will greatly simplify the task of doing it by ear, which may end up being the only option for a while. But I do hope you find a practical solution if one is already available.
Yes, diarization is what you want.
There are a couple of tools you could look at; both are GPL. One is LIUM SpkDiarization (Java), the other is the SHoUT toolkit (C++). LIUM is well documented and there's a script to go with it; SHoUT is a bit more cryptic, so you should follow the instructions the author posted here.
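If you go the LIUM route, a minimal invocation sketch looks roughly like the line below; the jar name, the flags, and the requirement for 16 kHz 16-bit mono WAV input are from my recollection of the LIUM wiki, so double-check them there:
java -Xmx2048m -jar LIUM_SpkDiarization-8.4.1.jar --fInputMask=./myRecording.wav --sOutputMask=./myRecording.seg --doCEClustering myRecording
The resulting .seg file lists which speaker is talking in which time range, which you could then feed to something like sox trim to cut the original file into per-speaker pieces.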
Though I may be a bit too late. ;)
The livestream hasn't ended (for now; and if it ends, I think it will be erased from YouTube, which is my reason for downloading it). The video is actually still being transmitted, you know, like NASA streams or news channel streams. The detail is that only about 10-11 hours of the transmission can be watched back, while the transmission has been running for about 3 days. So it was a matter of time before the first concerts were no longer available to watch on the broadcast.
This is the video: https://www.youtube.com/watch?v=rE6QI0ywr0c
I want to download some concerts, but the parts I want are disappearing as time passes. Right now I'm only interested in the Disclosure concert. Their set starts at approximately -3:38:12. I mention it in case someone wants to help me.
I was trying the command below, but it only prints text that I don't understand (I'll post all the images with its output in the comments). The command is this → yt-dlp.exe -f (bestvideo+bestaudio/best) "link" --postprocessor-args "ffmpeg:-ss 00:00:00 -to 00:00:00" -o "%(title)s_method1.%(ext)s"
The idea for that command came from these pages:
https://www.reddit.com/r/youtubedl/wiki/howdoidownloadpartsofavideo/
https://github.com/yt-dlp/yt-dlp/issues/686
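From what I understand of those pages, the newer --download-sections flag may be what they describe. Something like this might be the idea, but I haven't gotten it to work, the times are still placeholders, and I'm not sure it applies while the stream is still live:
yt-dlp.exe --download-sections "*00:00:00-00:00:00" -f "bestvideo+bestaudio/best" "link" -o "%(title)s_method2.%(ext)s"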
Also, I was trying to follow this: How do you use youtube-dl to download live streams (that are live)?, but I can't get the HLS m3u8 URL in Chrome or Chrome Dev (yes, I press F12 to open Chrome Developer Tools, go to the Network tab, and type m3u8 in the filter, but I don't find anything).
I should mention that I don't have extensive knowledge of command lines or yt-dlp. I've only learned enough to download videos, you know, yt-dlp.exe -F (link) and then yt-dlp.exe -f (format numbers for video and audio) (link).
So if you recommend any programs or commands, please let me know as precisely as possible.
I'll post any new info in the comments.
PS: Sorry for my English.
I tried the following command in order to get the best video and audio quality (I could also omit --format best, because from the documentation I read that this is the default setting):
youtube-dl.exe --format best https://www.youtube.com/watch?v=7wfUUZvybPY
and I got a video.mp4 with the following characteristics:
I downloaded the same video by using 4k Video Downloader and I got:
How can I get the same result also by using youtube-dl?
You can parse all formats available with:
youtube-dl.exe -F https://www.youtube.com/watch?v=7wfUUZvybPY
Look at the first column, "format code". For this video, the best options are:
youtube-dl --format 315 https://www.youtube.com/watch?v=7wfUUZvybPY for 3440x1440 video, and
youtube-dl --format 140 https://www.youtube.com/watch?v=7wfUUZvybPY for 129 kbit/s audio.
Then, with ffmpeg, you can merge those two streams into your preferred container (you can find many answers here on Stack Overflow).
For very high bitrates there isn't an already-merged file available on YouTube, so ffmpeg is a crucial tool for this type of conversion!
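For example, assuming the two downloads ended up as video.webm and audio.m4a (your actual filenames will differ), a merge without re-encoding could look like this:
ffmpeg -i video.webm -i audio.m4a -c copy merged.mkv
-c copy just copies both streams into the new container, so the merge is fast and lossless; MKV is used here because it accepts virtually any codec combination.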
I'm trying to download a 4K video from YouTube. For this, I used the command
youtube-dl -f best https://youtu.be/VcR5RCzWfeY
However, using this command only downloads the video in 720p. Manually specifying the resolution, however, seems to work:
youtube-dl https://youtu.be/VcR5RCzWfeY -f 313+bestaudio
The documentation states that using nothing should download the best quality possible, but I always get the default quality of 720p. This tends to be an issue when I am downloading playlists with multiple file qualities. So what gives? Is there some other command I should be using?
youtube-dl downloads the best quality by default. (This may not be the highest resolution for all of the supported sites, but it tends to be that one for YouTube.)
-f best is not the default. It advises youtube-dl to download the best single file format. For many supported sites, the best single format will be the best overall, but that does not apply to YouTube.
To get the highest quality, simply run youtube-dl without any -f:
youtube-dl https://youtu.be/VcR5RCzWfeY
For your example video, this will produce a 7680x4320 video file weighing 957MB.
Note that this requires ffmpeg to be installed on your machine and available in your PATH (or specified with --ffmpeg-location). To find out which version of ffmpeg you have, type ffmpeg.
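If you want to spell the default out explicitly, the youtube-dl README describes it as bestvideo+bestaudio/best when merging is possible, so the following should be equivalent to running without any -f (worth verifying against your version's documentation):
youtube-dl -f "bestvideo+bestaudio/best" https://youtu.be/VcR5RCzWfeY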
I want Festival TTS to read a bit slower; can anyone help me with that?
I use Python 2.7 and I run the code in gnome-terminal.
What does your ~/.festivalrc look like? To use festival with ALSA, I have:
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "aplay -Dplug:default -f S16_LE -r 15000 $FILE")
Using aplay, the rate of playback is determined by the value after the -r flag, which you can increase to make it speak faster, or decrease to make it slower.
If you're not using ALSA, then adding (Parameter.set 'Duration_Stretch 1.5) or similar may help.
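A quick way to try the stretch before editing ~/.festivalrc is the interactive festival prompt; the 1.5 value is just a starting point, and larger values speak more slowly:
festival
festival> (Parameter.set 'Duration_Stretch 1.5)
festival> (SayText "This sentence should come out noticeably slower.")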
If you are okay with writing a wrapper around it, you can use SABLE markup and the RATE tag. For reference, here is an example project I made:
http://www.cs.cmu.edu/~srallaba/Audio_Rendering_of_STEM/
in which technique 2 has rate variations.
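As a rough sketch of what such a wrapper could produce, the snippet below writes a small SABLE file and speaks it; the DTD header and the -20% value are from my memory of the Festival manual, so treat them as assumptions to verify:
cat > slower.sable <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN" "Sable.v0_2.dtd" []>
<SABLE>
<RATE SPEED="-20%"> This sentence should be read about twenty percent slower. </RATE>
</SABLE>
EOF
festival --tts slower.sable
If festival does not pick the SABLE mode from the .sable extension, text2wave -mode sable slower.sable -o slower.wav writes the audio to a file instead.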
Alternatively, you can use flite (festival lite). While Festival was designed to enable research in speech synthesis, flite is aimed at real-time implementations. The README has an example of stretching duration using flite:
./bin/flite --setf duration_stretch=1.5 doc/alice
Hope it helps.
I had exactly the same problem and, AFAIK, it is not possible to do that (I also hope to be wrong, so please correct me). It is also not possible to, for example, shift the frequency range of the voice. That is, not without tinkering with the voice files (I did not check this, as it seems like more than I'd like to do).
Personally, I solved this by using the old MBROLA voices and eSpeak. I used a Python wrapper to invoke eSpeak from the command line, but there is also a somewhat old library. Despite the voice quality being lower than the CMU voices, the overall experience is IMHO sometimes better.
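For example, with the MBROLA en1 voice installed (the mb-en1 voice name assumes the mbrola-en1 package), something like this slows the speech down; -s sets words per minute and the default is around 175:
espeak -v mb-en1 -s 120 "This sentence is spoken more slowly."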
Consider using the Festival utility text2wave to write the audio to a file, then play the file using SoX with the speed and pitch effects. To slow the audio down you will need a speed value less than one, and you can compensate for the effect on pitch with a positive value for pitch.
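As a rough sketch, with the 0.8 speed factor and a pitch correction of about 386 cents (1200 * log2(1/0.8)) as starting points to tune by ear:
echo "This should sound slower but at roughly the original pitch." | text2wave -o speech.wav
play speech.wav speed 0.8 pitch 386
speed 0.8 slows playback to 80% but also lowers the pitch; the pitch effect then shifts it back up without changing the new tempo.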
I want to capture a frame from a video, so I used a command like this:
ffmpeg -i MyVideo.mp4 -ss 1:20:12 -vframes 1 test-pic.jpg
but ffmpeg processes frames from the beginning of the video, so this command is too slow. I did some research and found some articles about keyframes, so I tried to extract keyframes with a command like this:
ffmpeg -vf select="eq(pict_type\,PICT_TYPE_I)" -i MyVideo.mp4 -vsync 2 -s 160x90 -f image2 thumbnails-%02d.jpeg
but this command is also too slow and captures too many frames.
I need a Linux command, or C++ or Python code, to capture a single frame without taking a long time.
The ffmpeg wiki states regarding fast seeking:
The -ss parameter needs to be specified before -i:
ffmpeg -ss 00:03:00 -i Underworld.Awakening.avi -frames:v 1 out1.jpg
This example will produce one image frame (out1.jpg) somewhere around the third minute from the beginning of the movie. The input will be parsed using keyframes, which is very fast. The drawback is that it will also finish the seeking at some keyframe, not necessarily located at specified time (00:03:00), so the seeking will not be as accurate as expected.
You could also use hybrid mode, combining fast seeking and slow (decode) seeking, which is kind of the middle ground.
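For your example timestamp, a rough sketch of the hybrid form could be the following, where the split point is arbitrary: the first -ss fast-seeks on keyframes to 1:19:00, and the second decodes the remaining 1 minute 12 seconds accurately:
ffmpeg -ss 1:19:00 -i MyVideo.mp4 -ss 0:01:12 -frames:v 1 test-pic.jpg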
If you want to implement this in C/C++, see the doc/examples directory of FFmpeg to get started, and look at av_seek_frame.
I recently hacked together some C code to do thumbnails myself, which uses the hybrid mode effectively. May be helpful to you, or not.
Hello, Mr. Anderson.
I'm not familiar with using C++ or Python to do such a thing. I'm sure it's possible (I could probably get a good idea of how to do it if I researched for an hour), but the time it would take to implement a full solution may outweigh the time cost of finding a better frame-capturing program. After a bit of Googling, I came up with:
VirtualDub
Camtasia
Frame-shots