How can I set the format of audio in libvlc?
There is a function in libvlc for it, but I don't know how to use it (from here):
LIBVLC_API void libvlc_audio_set_format( libvlc_media_player_t *mp,
                                         const char *format,
                                         unsigned rate,
                                         unsigned channels )
Set decoded audio format.
This only works in combination with libvlc_audio_set_callbacks(), and
is mutually exclusive with libvlc_audio_set_format_callbacks().
Parameters:
mp: the media player
format: a four-character string identifying the sample format (e.g. "S16N" or "FL32")
rate: sample rate (expressed in Hz)
channels: channels count
Version: LibVLC 2.0.0 or later
How can I set the format of an audio file, for example a WAV file?
This API is for raw, decoded audio, which is typically forwarded to speakers or re-encoded for storage.
This API is NOT for exporting audio as files (unless you implement that yourself in your app, that is). To convert files, see the stream output MRL command-line syntax, as there is currently no designated libvlc API for that use case.
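For illustration, here is a minimal sketch of how the two calls fit together; the callback body, setup_audio, and the opaque context pointer are placeholders of mine, not part of the documented API:
#include <vlc/vlc.h>

// Sketch: ask libvlc to decode to signed 16-bit native-endian stereo at
// 44100 Hz and hand the raw PCM to our callback.
static void play_cb(void *opaque, const void *samples,
                    unsigned count, int64_t pts)
{
    // 'samples' holds 'count' frames of S16N stereo PCM; forward them to
    // your own sink (ring buffer, file writer, ...). 'opaque' is the
    // context pointer passed below.
}

void setup_audio(libvlc_media_player_t *mp, void *ctx)
{
    // set_format only takes effect together with set_callbacks, and is
    // mutually exclusive with libvlc_audio_set_format_callbacks().
    libvlc_audio_set_callbacks(mp, play_cb,
                               NULL /*pause*/, NULL /*resume*/,
                               NULL /*flush*/, NULL /*drain*/, ctx);
    libvlc_audio_set_format(mp, "S16N", 44100, 2);
}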
Related
I'm attempting to write a simple Windows Media Foundation command-line tool that uses IMFSourceReader and IMFSinkWriter to load a video, read the video and audio as uncompressed streams, and re-encode them to H.264/AAC with some specific hard-coded settings.
The simple program Gist is here
sample video 1
sample video 2
sample video 3
(Note: the videos I've been testing with are all stereo, with a 48000 Hz sample rate)
The program works; however, in some cases when comparing the newly output video to the original in an editing program, I see that the copied video streams match, but the audio stream of the copy is prefixed with some amount of silence and the audio is offset, which is unacceptable in my situation.
audio samples:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[silence] [silence] [silence] [audio1] [audio2] [audio3] ... etc
In cases like this the first video frames coming in have a non-zero timestamp but the first audio frames do have a 0 timestamp.
I would like to be able to produce a copied video whose first frame from both the video and audio streams is 0, so I first attempted to subtract that initial timestamp (videoOffset) from all subsequent video frames, which produced the video I wanted but resulted in this situation with the audio:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio4] [audio5] [audio6] [audio7] [audio8] ... etc
The audio track is now shifted in the other direction by a small amount and still doesn't align. This can also happen sometimes when a video stream does have a starting timestamp of 0, yet WMF still cuts off some audio samples at the beginning anyway (see sample video 3)!
I've been able to fix this sync alignment and offset the video stream to start at 0 with the following code inserted at the point of passing the audio sample data to the IMFSinkWriter:
//inside read sample while loop
...

// LONGLONG llDuration has the currently read sample duration
// DWORD audioOffset has the global audio offset, starts as 0
// LONGLONG audioFrameTimeStamp has the currently read sample timestamp

//add some random amount of silence in intervals of 1024 samples
static bool runOnce{ false };
if (!runOnce)
{
    size_t numberOfSilenceBlocks = 1; //how to derive how many I need!? It's arbitrary
    size_t samples = 1024 * numberOfSilenceBlocks;
    audioOffset = samples * 10000000 / audioSamplesPerSecond; // samples -> 100ns units (hns)
    std::vector<uint8_t> silence(samples * audioChannels * bytesPerSample, 0);
    WriteAudioBuffer(silence.data(), silence.size(), audioFrameTimeStamp, audioOffset);
    runOnce = true;
}

LONGLONG audioTime = audioFrameTimeStamp + audioOffset;
WriteAudioBuffer(dataPtr, dataSize, audioTime, llDuration);
Oddly, this creates an output video file that matches the original.
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
The solution was to insert extra silence in blocks of 1024 samples at the beginning of the audio stream. It doesn't matter what audio chunk sizes IMFSourceReader provides; the padding is in multiples of 1024.
My problem is that there seems to be no detectable reason for the silence offset. Why do I need it? How do I know how much I need? I stumbled across the 1024-sample silence block solution after days of fighting this problem.
Some videos seem to only need 1 padding block, some need 2 or more, and some need no extra padding at all!
My questions here are:
Does anyone know why this is happening?
Am I using Media Foundation incorrectly in this situation to cause this?
If I am correct, how can I use the video metadata to determine whether I need to pad an audio stream, and how many 1024-sample blocks of silence the pad needs?
EDIT:
For the sample videos above:
sample video 1: the video stream starts at 0 and needs no extra blocks; passthrough of the original data works fine.
sample video 2: the video stream starts at 834166 (hns) and needs one 1024-sample block of silence to sync.
sample video 3: the video stream starts at 0 and needs two 1024-sample blocks of silence to sync.
UPDATE:
Other things I have tried:
Increasing the duration of the first video frame to account for the offset: produces no effect.
I wrote another version of your program to handle the NV12 format correctly (yours was not working):
EncodeWithSourceReaderSinkWriter
I use Blender as a video editing tool. Here are my results with Tuning_against_a_window.mov:
From the bottom to the top:
Original file
Encoded file
I changed the original file by setting the "elst" atoms to a value of 0 for the number of entries (I used the Visual Studio hex editor)
Like Roman R. said, the Media Foundation mp4 source doesn't use the "edts/elst" atoms. But Blender and your video editing tools do. Also, the "tmcd" track is ignored by the mp4 source.
"edts/elst" :
Edits Atom ( 'edts' )
Edit lists can be used for hint tracks...
MPEG-4 File Source
The MPEG-4 file source silently ignores hint tracks.
So in fact, the encoding is good. I think there is no audio stream sync offset compared to the real audio/video data. For example, you can add "edts/elst" to the encoded file to get the same result.
PS: on the encoded file, I added "edts/elst" for both audio/video tracks. I also increased the size of the trak atoms and the moov atom. I confirm that Blender shows the same waveform for both the original and the encoded file.
EDIT
I tried to understand the relation between the mvhd/tkhd/mdhd/elst atoms in the 3 video samples. (Yes I know, I should read the spec. But I'm lazy...)
You can use an mp4 explorer tool to get the atoms' values, or use the mp4 parser from my H264Dxva2Decoder project (a rough sketch of dumping these values also follows the list below):
H264Dxva2Decoder
Tuning_against_a_window.mov
elst (media time) from tkhd video : 20689
elst (media time) from tkhd audio : 1483
GREEN_SCREEN_ANIMALS__ALPACA.mp4
elst (media time) from tkhd video : 2002
elst (media time) from tkhd audio : 1024
GOPR6239_1.mov
elst (media time) from tkhd video : 0
elst (media time) from tkhd audio : 0
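As a rough illustration (not my actual parser), the following sketch brute-force scans a file for 'elst' boxes and prints each entry's media time. It only handles version-0 entries and doesn't walk the box hierarchy, so false positives inside mdat are possible; treat it as a starting point rather than a proper parser:
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static uint32_t be32(const uint8_t *p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
}

int main(int argc, char **argv) {
    if (argc < 2) { printf("usage: %s file.mp4\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector<uint8_t> buf(size);
    if (fread(buf.data(), 1, size, f) != size_t(size)) { fclose(f); return 1; }
    fclose(f);

    // After the 'elst' fourcc: version(1) flags(3) entry_count(4), then per
    // version-0 entry: segment_duration(4) media_time(4, signed) media_rate(4).
    for (long i = 0; i + 16 <= size; ++i) {
        if (memcmp(&buf[i], "elst", 4) != 0) continue;
        const uint8_t *p = &buf[i + 4];
        uint8_t version = p[0];
        uint32_t count = be32(p + 4);
        printf("elst at offset %ld, version %u, %u entries\n", i - 4, version, count);
        if (version != 0) continue; // 64-bit (version 1) entries not handled here
        const uint8_t *e = p + 8;
        for (uint32_t n = 0; n < count && e + 12 <= buf.data() + size; ++n, e += 12) {
            int32_t mediaTime = int32_t(be32(e + 4));
            printf("  entry %u: media_time %d\n", n, mediaTime);
        }
    }
    return 0;
}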
As you can see, with GOPR6239_1.mov, the media time from elst is 0. That's why there is no video/audio sync problem with this file.
For Tuning_against_a_window.mov and GREEN_SCREEN_ANIMALS__ALPACA.mp4, I tried to calculate the video/audio offset.
I modified my project to take this into account:
EncodeWithSourceReaderSinkWriter
For now, I haven't found a generic calculation that works for all files.
I just found the video/audio offset needed to encode both files correctly.
For Tuning_against_a_window.mov, I begin encoding after (movie time - video/audio mdhd time).
For GREEN_SCREEN_ANIMALS__ALPACA.mp4, I begin encoding after the video/audio elst media time.
It's OK, but I still need to find the right single calculation for all files.
So you have 2 options:
encode the file and add the elst atom
encode the file using the right offset calculation
It depends on your needs:
The first option permits you to keep the original file, but you have to add the elst atom.
With the second option you have to read the atoms from the file before encoding, and the encoded file will lose a few original frames.
If you choose the first option, I will explain how I add the elst atom.
PS: I'm interested in this question, because in my H264Dxva2Decoder project, the edts/elst atom is on my todo list.
I parse it, but I don't use it...
PS2: this link sounds interesting:
Audio Priming - Handling Encoder Delay in AAC
I have a DSP application which captures the audio that is playing, using the WASAPI API in shared loopback mode.
hr = _pAudioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, AUDCLNT_STREAMFLAGS_LOOPBACK, 0, 0, _pFormat, 0);
This part works fine, but now I want to be able to detect the number of channels actually playing. In other words, how would I be able to detect whether the audio playing is stereo, 5.1, or 7.1?
The problem is:
* Since loopback has to use shared mode, there could be multiple sources playing.
* This analysis has to be done in real time. I can't wait until playback is done.
* I need to detect the difference between a channel not used at all by any playback source and a channel that is temporarily silent.
The best solution in my mind would be if I could retrieve a list of all playback sources/sub-mixes and query each of them for its number of channels. That way I don't have to analyse the audio data stream itself.
Loopback recording takes place in the mix format defined on the endpoint, so regardless of what the original audio format was, you get the data in the mix format, mixed from possibly multiple playing sources and also converted to that shared format.
Device Formats
Loopback Recording
WASAPI loopback contains the mix of all audio being played...
The GetMixFormat method retrieves the stream format that the audio engine uses for its internal processing of shared-mode streams...
After an application has used GetMixFormat or IsFormatSupported to find an appropriate format for a shared-mode or exclusive-mode stream, the application can call the Initialize method to initialize a stream with that format. An application that attempts to initialize a shared-mode stream with a format that is not identical to the mix format obtained from the GetMixFormat method, but that has the same number of channels and the same sample rate as the mix format, is likely to succeed. Before calling Initialize, the application can call IsFormatSupported to verify that Initialize will accept the format.
That is, even though WASAPI offers some flexibility in the audio format, channel configuration and sample rate are defined by the shared format when it comes to loopback capture.
As you are getting the mix, you cannot really identify "non-active" channels: this information is lost during mixing into the shared format.
Also, the actual shared format can be configured interactively via the Control Panel.
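To see what the loopback capture will deliver, you can query the render endpoint's mix format before initializing the stream. A minimal sketch (error handling omitted for brevity):
#include <mmdeviceapi.h>
#include <audioclient.h>
#include <cstdio>

int main() {
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    IMMDeviceEnumerator *pEnum = nullptr;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), nullptr, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**)&pEnum);
    IMMDevice *pDevice = nullptr;
    // Loopback captures from a render endpoint, so ask the default one.
    pEnum->GetDefaultAudioEndpoint(eRender, eConsole, &pDevice);
    IAudioClient *pClient = nullptr;
    pDevice->Activate(__uuidof(IAudioClient), CLSCTX_ALL, nullptr, (void**)&pClient);
    WAVEFORMATEX *pMix = nullptr;
    pClient->GetMixFormat(&pMix);
    printf("Mix format: %u channels, %lu Hz, %u bits\n",
           pMix->nChannels, pMix->nSamplesPerSec, pMix->wBitsPerSample);
    CoTaskMemFree(pMix);
    pClient->Release(); pDevice->Release(); pEnum->Release();
    CoUninitialize();
    return 0;
}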
OK, I now have a solution to my problem. As far as I know you cannot detect sub-mixes in the shared mix, so the only option was to analyze the audio stream/capture buffer.
First, during my main capture loop, I set the current timestamp for all channels playing.
const time_t now = Date::getCurrentTimeMillis();
//Iterate all captured frames; samples are interleaved per channel
for (i = 0; i < numFramesAvailable; ++i) {
    for (j = 0; j < _nChannelsIn; ++j) {
        //Identify which channels are playing.
        if (pCaptureBuffer[i * _nChannelsIn + j] != 0) {
            _pUsedChannels[j] = now;
        }
    }
}
Then, every second, I call this function, which evaluates whether a channel has played during the last second. Based upon which channels are playing, I can do conditional routing.
void checkUsedChannels() {
    const time_t now = Date::getCurrentTimeMillis();
    //Compare now against the last used timestamp and determine active channels
    for (size_t i = 0; i < _nChannelsIn; ++i) {
        if (now - _pUsedChannels[i] > 1000) {
            _pUsedChannels[i] = 0;
        }
    }
    //Update conditional routing
    for (const Input *pInput : _inputs) {
        pInput->evalConditions();
    }
}
Very simple solution but it appears to be working.
I am using the Google Cloud Platform Speech-to-Text API with a trial account. I am not able to get text from an audio file. I do not know what exact encoding and sampleRateHertz I should use for an MP3 file with a bit rate of 128 kbps. I tried various options but I am not getting the transcription.
const speech = require('@google-cloud/speech');
const config = {
    encoding: 'LINEAR16', // AMR, AMR_WB, LINEAR16 (for wav)
    sampleRateHertz: 16000, // 16000 giving blank result.
    languageCode: 'en-US'
};
MP3 is now supported in beta:
MP3 Only available as beta. See RecognitionConfig reference for details.
https://cloud.google.com/speech-to-text/docs/encoding
MP3 MP3 audio. Support all standard MP3 bitrates (which range from 32-320 kbps). When using this encoding, sampleRateHertz can be optionally unset if not known.
https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig#AudioEncoding
You can find out the sample rate using a variety of tools such as iTunes. CD-quality audio uses a sample rate of 44100 Hertz. Read more here:
https://en.wikipedia.org/wiki/44,100_Hz
To use this in a Google SDK, you may need to use one of the beta SDKs that defines this. Here is the constant from the Go Beta SDK:
RecognitionConfig_MP3 RecognitionConfig_AudioEncoding = 8
https://godoc.org/google.golang.org/genproto/googleapis/cloud/speech/v1p1beta1
According to the official documentation (https://cloud.google.com/speech-to-text/docs/encoding), only the following formats are supported:
FLAC
LINEAR16
MULAW
AMR
AMR_WB
OGG_OPUS
SPEEX_WITH_HEADER_BYTE
Anything else will be rejected.
Your best bet is to convert the MP3 file to either:
FLAC (see: .NET: How can I convert an mp3 or a wav file to .flac)
WAV, and use LINEAR16 in that case. You can use NAudio (see: Converting mp3 data to wav data C#)
Honestly, it is annoying that Google does not support MP3 out of the box, unlike Amazon, IBM, and Microsoft, as it forces us to jump through hoops and also increases bandwidth usage, since FLAC and LINEAR16 are lossless and therefore much larger to transmit.
I had the same issue and resolved it by converting it to FLAC.
Try converting your audio to FLAC and use
encoding: 'FLAC',
For conversion, you can use sox
ref: https://www.npmjs.com/package/sox
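For example, assuming sox is installed (the file names here are placeholders), a plain conversion is just:
sox input.mp3 output.flac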
Now, the MP3 type for Speech-to-Text is only available in the speech_v1p1beta1 module, so you must post your request using that module, and you will get what you want.
Use encoding: 'MP3'.
A Python example looks like this:
from google.cloud import speech_v1p1beta1 as speech
import io

client = speech.SpeechClient()

speech_file = "your mp3 file path"
with io.open(speech_file, "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MP3,
    sample_rate_hertz=44100,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)

# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
print(response)
for result in response.results:
    # The first alternative is the most likely one for this portion.
    print(u"Transcript: {}".format(result.alternatives[0].transcript))
I'm writing because I could not find the answer in previous topics. I am using live555 to stream live video (H.264) and audio (G.723), which are being recorded by a web camera. The video part is already done and works perfectly, but I have no clue about the audio task.
As far as I have read, I have to create a ServerMediaSession to which I should add two subsessions: one for the video and one for the audio. For the video part I created a subclass of OnDemandServerMediaSubsession, a subclass of FramedSource, and the encoder class, but for the audio aspect I do not know which classes I should base the implementation on.
The web camera records and delivers audio frames in G.723 format separately from the video. I would say the audio is raw, as when I try to play it in VLC it says that it could not find any start code; so I suppose what the web cam records is the raw audio stream.
I was wondering if someone could give me a hint.
For an audio stream, your override of OnDemandServerMediaSubsession::createNewRTPSink should create a SimpleRTPSink.
Something like:
RTPSink* YourAudioMediaSubsession::createNewRTPSink(Groupsock* rtpGroupsock,
                                                    unsigned char rtpPayloadTypeIfDynamic,
                                                    FramedSource* inputSource)
{
    return SimpleRTPSink::createNew(envir(), rtpGroupsock,
                                    4,         // static RTP payload type for G.723
                                    frequency,
                                    "audio",
                                    "G723",
                                    channels);
}
The frequency and the number of channels should come from the inputSource.
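To round out the picture, a hedged skeleton of the audio subsession might look as follows; YourAudioMediaSubsession and YourG723Source are placeholder names, and the bitrate hint is an assumption based on G.723.1's 6.3 kbit/s mode:
// Sketch only: subclass OnDemandServerMediaSubsession for the audio track.
// YourG723Source is your own FramedSource subclass that delivers the raw
// G.723 frames received from the camera.
class YourAudioMediaSubsession : public OnDemandServerMediaSubsession {
protected:
    virtual FramedSource* createNewStreamSource(unsigned clientSessionId,
                                                unsigned& estBitrate) {
        estBitrate = 6; // kbps; assumption for G.723.1 (6.3 kbit/s mode)
        return YourG723Source::createNew(envir() /*, camera handle... */);
    }

    virtual RTPSink* createNewRTPSink(Groupsock* rtpGroupsock,
                                      unsigned char rtpPayloadTypeIfDynamic,
                                      FramedSource* inputSource); // as shown above
};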
I'm using the WinAPI Wave functions to create a recording program that records the microphone for X seconds. I've searched a bit over the net and found out that PCM data is too large, and it'll be a problem to send it through sockets...
How can I compress it to something smaller? Any simple / "cheap" way?
When declaring the format using the Wave API functions, I'm using this code:
WAVEFORMATEX pFormat;
pFormat.wFormatTag = WAVE_FORMAT_PCM;     // simple, uncompressed format
pFormat.nChannels = 1;                    // 1=mono, 2=stereo
pFormat.nSamplesPerSec = sampleRate;      // 44100
pFormat.nAvgBytesPerSec = sampleRate * 2; // = nSamplesPerSec * nChannels * wBitsPerSample/8
pFormat.nBlockAlign = 2;                  // = nChannels * wBitsPerSample/8
pFormat.wBitsPerSample = 16;              // 16 for high quality, 8 for telephone-grade
pFormat.cbSize = 0;
As you can see, I use pFormat.wFormatTag = WAVE_FORMAT_PCM;
maybe I can insert something else instead of WAVE_FORMAT_PCM, so the audio will be compressed right away?
I've checked MSDN for other values, though none of them works for me in my Visual Studio...
So what can I do?
Thanks!
The simplest way is to simply reduce your sample rate from 44100 to something more manageable like 22050, 16000, 11025, or even 8000. Most voice codecs don't go higher than 16000 Hz anyway, and the older ones are optimized for 8 kHz.
The next step is to find a codec. There's a handful of codecs to use with the Windows Audio Compression Manager, but almost all of them date back to Windows 95 and sound terrible by modern standards after being decompressed.
You can always convert to WMA in real time using the Windows Media Format SDK or the Media Foundation APIs. Or just go get an open-source MP3 library like LAME.
For telephone quality speech you can change to 8 bits per sample and a sample rate of 8000. This will greatly reduce the amount of data.
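Applied to the WAVEFORMATEX from the question, that would look roughly like this (still plain PCM, just less of it):
pFormat.wFormatTag = WAVE_FORMAT_PCM;
pFormat.nChannels = 1;
pFormat.nSamplesPerSec = 8000;
pFormat.wBitsPerSample = 8;
pFormat.nBlockAlign = 1;        // = nChannels * wBitsPerSample/8
pFormat.nAvgBytesPerSec = 8000; // = nSamplesPerSec * nBlockAlign
pFormat.cbSize = 0;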
GSM has good compression. You can convert a block of PCM data to GSM (or to any other codec you have installed) using acmStreamConvert(). Refer to MSDN for more details:
Converting Data from One Format to Another
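As a rough, hedged sketch of that approach (the 8 kHz mono source format and the GSM 6.10 target are assumptions; real code needs error checking on every acm* call):
#include <windows.h>
#include <mmreg.h>
#include <msacm.h>
#include <vector>
#pragma comment(lib, "msacm32.lib")

// Sketch: convert a buffer of 16-bit mono 8 kHz PCM to GSM 6.10 via ACM.
std::vector<BYTE> PcmToGsm(const BYTE* pcm, DWORD pcmBytes)
{
    WAVEFORMATEX src = {};
    src.wFormatTag = WAVE_FORMAT_PCM;
    src.nChannels = 1;
    src.nSamplesPerSec = 8000;
    src.wBitsPerSample = 16;
    src.nBlockAlign = 2;
    src.nAvgBytesPerSec = 16000;

    // Let ACM fill in a matching GSM 6.10 format (it carries extra cb data).
    BYTE dstBuf[64] = {};
    WAVEFORMATEX* dst = reinterpret_cast<WAVEFORMATEX*>(dstBuf);
    dst->wFormatTag = WAVE_FORMAT_GSM610;
    acmFormatSuggest(NULL, &src, dst, sizeof(dstBuf),
                     ACM_FORMATSUGGESTF_WFORMATTAG);

    HACMSTREAM has = NULL;
    acmStreamOpen(&has, NULL, &src, dst, NULL, 0, 0,
                  ACM_STREAMOPENF_NONREALTIME);

    // Ask how large the destination buffer must be for this much input.
    DWORD outBytes = 0;
    acmStreamSize(has, pcmBytes, &outBytes, ACM_STREAMSIZEF_SOURCE);
    std::vector<BYTE> out(outBytes);

    ACMSTREAMHEADER ash = {};
    ash.cbStruct = sizeof(ash);
    ash.pbSrc = const_cast<BYTE*>(pcm);
    ash.cbSrcLength = pcmBytes;
    ash.pbDst = out.data();
    ash.cbDstLength = outBytes;
    acmStreamPrepareHeader(has, &ash, 0);
    acmStreamConvert(has, &ash, ACM_STREAMCONVERTF_END);
    acmStreamUnprepareHeader(has, &ash, 0);
    acmStreamClose(has, 0);

    out.resize(ash.cbDstLengthUsed); // actual converted size
    return out;
}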