Google Cloud Speech-to-Text (MP3 to text) - mp3

I am using the Google Cloud Platform Speech-to-Text API on a trial account. I cannot get any text from an audio file. I do not know which encoding and sampleRateHertz to use for an MP3 file with a bit rate of 128 kbps. I tried various options, but I get no transcription.
const speech = require('@google-cloud/speech');
const config = {
  encoding: 'LINEAR16', // also tried AMR and AMR_WB; LINEAR16 is for WAV
  sampleRateHertz: 16000, // 16000 gives a blank result
  languageCode: 'en-US'
};

MP3 is now supported in beta:
MP3: Only available as beta. See RecognitionConfig reference for details.
https://cloud.google.com/speech-to-text/docs/encoding
MP3: MP3 audio. Support all standard MP3 bitrates (which range from 32-320 kbps). When using this encoding, sampleRateHertz can be optionally unset if not known.
https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig#AudioEncoding
You can find out the sample rate using a variety of tools such as iTunes. CD-quality audio uses a sample rate of 44100 Hertz. Read more here:
https://en.wikipedia.org/wiki/44,100_Hz
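If no such tool is at hand, the sample rate can also be read from the MP3 data itself. The following is a minimal Python sketch, not a complete parser: it decodes the version and sample-rate bits of a single 4-byte MPEG audio frame header. The function name mp3_frame_sample_rate is made up for this example, and the sketch assumes you have already located the first frame header (real files often begin with an ID3 tag that has to be skipped first).

```python
# Sample rates in Hz, indexed by the 2-bit MPEG version field of the header.
MP3_SAMPLE_RATES = {
    0b11: (44100, 48000, 32000),  # MPEG-1
    0b10: (22050, 24000, 16000),  # MPEG-2
    0b00: (11025, 12000, 8000),   # MPEG-2.5
}


def mp3_frame_sample_rate(header: bytes) -> int:
    """Return the sample rate encoded in a 4-byte MPEG audio frame header."""
    # A frame header starts with 11 set bits (the frame sync).
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xE0) != 0xE0:
        raise ValueError("not an MPEG audio frame header")
    version = (header[1] >> 3) & 0b11      # bits 4-3 of the second byte
    rate_index = (header[2] >> 2) & 0b11   # bits 3-2 of the third byte
    if version not in MP3_SAMPLE_RATES or rate_index == 0b11:
        raise ValueError("reserved version or sample-rate index")
    return MP3_SAMPLE_RATES[version][rate_index]


# FF FB 90 00 is a typical MPEG-1 Layer III header.
print(mp3_frame_sample_rate(b"\xff\xfb\x90\x00"))
```

The value returned here is what you would pass as sampleRateHertz if you choose to set it at all.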
To use this in a Google SDK, you may need to use one of the beta SDKs that defines this. Here is the constant from the Go Beta SDK:
RecognitionConfig_MP3 RecognitionConfig_AudioEncoding = 8
https://godoc.org/google.golang.org/genproto/googleapis/cloud/speech/v1p1beta1

According to the official documentation (https://cloud.google.com/speech-to-text/docs/encoding),
Only the following formats are supported:
FLAC
LINEAR16
MULAW
AMR
AMR_WB
OGG_OPUS
SPEEX_WITH_HEADER_BYTE
Anything else will be rejected.
Your best bet is to convert the MP3 file to either:
FLAC. See ".NET: How can I convert an mp3 or a wav file to .flac".
WAV, with LINEAR16 as the encoding. You can use NAudio. See "Converting mp3 data to wav data C#".
Honestly, it is annoying that Google does not support MP3 from the get-go, unlike Amazon, IBM and Microsoft. It forces us to jump through hoops, and it increases bandwidth usage, since FLAC and LINEAR16 are lossless and therefore much larger to transmit.
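To put a rough number on the bandwidth point: uncompressed LINEAR16 has a fixed bit rate of sample rate x bits per sample x channels. The Python sketch below works that out (the helper name pcm_bitrate_kbps is hypothetical). FLAC normally comes out smaller than raw PCM, but by how much depends on the audio, so it is not estimated here.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample=16, channels=1):
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


mono_16k = pcm_bitrate_kbps(16000)               # 16 kHz mono, 16-bit
cd_stereo = pcm_bitrate_kbps(44100, channels=2)  # CD-quality stereo

print(mono_16k, cd_stereo)
```

By this arithmetic, even 16 kHz mono LINEAR16 needs 256 kbps, twice the 128 kbps of the MP3 in the question.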

I had the same issue and resolved it by converting it to FLAC.
Try converting your audio to FLAC and use
encoding: 'FLAC',
For conversion, you can use sox
ref: https://www.npmjs.com/package/sox

The MP3 encoding for Speech-to-Text is currently only available in the speech_v1p1beta1 module. Send your request through that module and set the encoding to 'MP3'.
A Python example:
from google.cloud import speech_v1p1beta1 as speech
import io

client = speech.SpeechClient()
speech_file = "your mp3 file path"
with io.open(speech_file, "rb") as audio_file:
    content = audio_file.read()
audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MP3,
    sample_rate_hertz=44100,
    language_code="en-US",
)
response = client.recognize(config=config, audio=audio)
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
print(response)
for result in response.results:
    # The first alternative is the most likely one for this portion.
    print(u"Transcript: {}".format(result.alternatives[0].transcript))

Compress pyaudio stream with zlib

I am trying to send audio from a microphone input between a server and a client using pyaudio. I only need voice quality, sampled at 8000 Hz. Without compression it works fine, and I am trying to add zlib compression to reduce the bandwidth.
In my server the stream_callback function is
def callback(in_data, frame_count, time_info, status):
    for s in read_list[1:]:
        s.send(zlib.compress(in_data))
    return (None, pyaudio.paContinue)
In my client I am trying to decompress like this
try:
    while True:
        data = s.recv(CHUNK)
        stream.write(zlib.decompress(data, zlib.MAX_WBITS | 16))
except KeyboardInterrupt:
    pass
I have tried various parameters with zlib.MAX_WBITS but all return this error:
zlib.error: Error -3 while decompressing data: incorrect header check
Edit: I have also tried calling zlib.decompress with no second parameter.
Can someone suggest what I am doing wrong, please? TIA
You don't need the second parameter of zlib.decompress at all. What you have in the question would look for a gzip stream instead of a zlib stream.
For compressing audio, you should use an audio compressor. Take a look at this answer.
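To see the point about the second parameter concretely, here is a small self-contained Python check. The byte string stands in for one block of microphone data and is invented for the example; no sockets or pyaudio are involved.

```python
import zlib

# Stand-in for one block of raw microphone samples.
pcm_chunk = bytes(range(256)) * 8

packed = zlib.compress(pcm_chunk)

# With the default wbits, decompress() expects the zlib header that
# zlib.compress() wrote, so the round trip succeeds.
restored = zlib.decompress(packed)

# MAX_WBITS | 16 tells zlib to expect a gzip header instead, which this
# data does not have, so the call raises zlib.error.
try:
    zlib.decompress(packed, zlib.MAX_WBITS | 16)
    gzip_mode_failed = False
except zlib.error as exc:
    gzip_mode_failed = True
    print(exc)
```

If the default call still fails on the client, one possible cause (an assumption, since the question does not show the socket setup) is that s.recv(CHUNK) on a stream socket is not guaranteed to return exactly one compressed message, so a single call to zlib.decompress may be handed a partial or merged message.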

Google Cloud Speech to text returning empty result or error

I have been working for 4 days now to get the Google Cloud Speech-to-Text API to work, but I still see no light at the end of the tunnel. I have searched the net and read the documentation a lot, with no result.
Our site is bbsradio.com; we are trying to automatically extract transcripts from our MP3 files using the Google Speech-to-Text API. The code is written in PHP and is an almost exact copy of this: https://github.com/GoogleCloudPlatform/php-docs-samples/blob/master/speech/src/transcribe_async.php
I can see that the process completes and gets past "$operation->pollUntilComplete();", but it does not report success at "if ($operation->operationSucceeded()) {", and $operation->getError() does not return any error either.
I am converting the MP3 to a raw file like this: ffmpeg -y -loglevel panic -i /public_html/sites/default/files/show-archives/audio-clips-9-23-2020/911freefall2020-05-24.mp3 -f s16le -acodec pcm_s16le -vn -ac 1 -ar 16000 -map_metadata -1 /home/mp3_to_raw/911freefall2020-05-24.raw
I also tried the FLAC format, which did not work either. I tested the converted FLAC file in Windows Media Player and can hear the conversation clearly. I checked the file: it is 16000 Hz, 1 channel, and 16-bit. I can see that the file is uploaded to Cloud Storage. I checked these:
https://cloud.google.com/speech-to-text/docs/troubleshooting and
https://cloud.google.com/speech-to-text/docs/best-practices
There is a lot of discussion and documentation, but nothing seems helpful at the moment. If someone can help me find the issue, it would be really great!
TL;DR: convert from MP3 to a 1-channel FLAC file with the same sample rate as your MP3 file.
Long explanation:
Since you're using MP3 files as your input, MP3 compression artifacts are probably hurting you when you resample to 16 kHz (you cannot hear this, but the algorithm will).
To confirm this theory:
Execute ffprobe -hide_banner filename.mp3. It will output something like this:
Metadata:
...
Duration: 00:02:12.21, start: 0.025057, bitrate: 320 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16p, 320 kb/s
Metadata:
encoder : LAME3.99r
In this case, the sample rate is OK for the Google Speech API. Just transcode the file without changing the sample rate (remove -ar 16000 from your ffmpeg command).
You might get into trouble if the original MP3 bitrate is low. 320 kb/s seems safe (unless the recording has a lot of noise).
Keep in mind that voice recorded below 64 kb/s (ISDN line quality) can be understood only by humans if there is some noise.
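As a sketch of that advice, the Python helper below only builds the ffmpeg argument list for an MP3 to mono FLAC transcode that keeps the source sample rate; it does not run ffmpeg. The function name and the file names are placeholders. The flags are the ones from the command in the question, with -ar removed and the codec switched to flac.

```python
import shlex


def flac_transcode_command(src, dst):
    """Build an ffmpeg command that converts src to mono FLAC at its original sample rate."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                    # ignore any video or cover-art stream
        "-ac", "1",               # downmix to a single channel
        "-acodec", "flac",        # lossless FLAC instead of raw PCM
        "-map_metadata", "-1",    # drop tags, as in the original command
        dst,                      # note: no -ar, so the sample rate is kept
    ]


cmd = flac_transcode_command("input.mp3", "output.flac")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.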
At last I found the solution and the reason for the issue. The empty results are actually caused by a bug in the PHP API code. What you need to do:
Replace this:
$operation->pollUntilComplete();
with this:
while (!$operation->isDone()) {
    $operation->pollUntilComplete();
}

Google Cloud Platform: Speech to Text Conversion of Large Media Files

I'm trying to extract text from an mp4 media file downloaded from YouTube. Since I'm using Google Cloud Platform, I thought I would give Google Cloud Speech a try.
After all the installation and configuration, I copied the following code snippet to get started:
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
audio = types.RecognitionAudio(content=content)
config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code='en-US')
response = client.long_running_recognize(config, audio)
But I got the following error regarding file size:
InvalidArgument: 400 Inline audio exceeds duration limit. Please use a GCS URI.
Then I read that I should use streams for large media files. So, I tried the following code snippet:
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
# In practice, stream should be a generator yielding chunks of audio data.
stream = [content]
requests = (types.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream)
config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code='en-US')
streaming_config = types.StreamingRecognitionConfig(config=config)
responses = client.streaming_recognize(streaming_config, requests)
But still I got the following error:
InvalidArgument: 400 Invalid audio content: too long.
So, can anyone please suggest an approach to transcribe an mp4 file and extract the text? I don't have any complex requirement for very large media files. A media file can be 10-15 minutes long at most. Thanks.
The error message means that the file is too big and you need to first copy the media file to Google Cloud Storage and then specify a Cloud Storage URI such as gs://bucket/path/mediafile.
The key to using a Cloud Storage URI is:
RecognitionAudio audio =
    RecognitionAudio.newBuilder().setUri(gcsUri).build();
The following code will show you how to specify a GCS URI for input. Google has a complete example on github.
public static void syncRecognizeGcs(String gcsUri) throws Exception {
  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create()) {
    // Builds the request for remote FLAC file
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();
    // Use blocking call for getting audio transcript
    RecognizeResponse response = speech.recognize(config, audio);
    List<SpeechRecognitionResult> results = response.getResultsList();
    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s%n", alternative.getTranscript());
    }
  }
}
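Since the question uses Python and audio of 10-15 minutes, here is a hedged Python sketch of the same idea using the asynchronous call, because synchronous recognition is, as far as I know, intended for short audio of about a minute. It assumes the audio track has already been extracted from the mp4, converted to FLAC, and uploaded to a bucket. The function names, the timeout value, and the URI check are illustrative choices, not part of Google's sample.

```python
def is_gcs_uri(uri):
    """Return True for URIs of the form gs://bucket/object."""
    if not uri.startswith("gs://"):
        return False
    bucket, _, obj = uri[len("gs://"):].partition("/")
    return bool(bucket) and bool(obj)


def transcribe_long_gcs(gcs_uri, timeout_s=1800):
    """Transcribe a FLAC file in Cloud Storage with a long-running operation."""
    if not is_gcs_uri(gcs_uri):
        raise ValueError("expected a gs://bucket/object URI")
    # Imported here so the URI check above works without the client library.
    from google.cloud import speech

    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code="en-US",
    )
    operation = client.long_running_recognize(config=config, audio=audio)
    # Blocks until the operation finishes or the timeout (an assumed value) expires.
    response = operation.result(timeout=timeout_s)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```

A call would look like transcribe_long_gcs("gs://bucket/path/mediafile"), and it needs valid credentials and the google-cloud-speech package installed.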

Possible sample rates in Google Speech-to-Text?

I'm using the function provided in the GCS docs that lets me transcribe audio stored in Cloud Storage:
def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=48000,
        language_code='en-US')
    operation = client.long_running_recognize(config, audio)
    print('Waiting for operation to complete...')
    response = operation.result(timeout=2000)
    # Print the first alternative of all the consecutive results.
    for result in response.results:
        print('Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))
    return ' '.join(result.alternatives[0].transcript for result in response.results)
By default, sample_rate_hertz is set at 16000. I changed it to 48000, but I've been having trouble setting it any higher, such as 64k or 96k. Is 48k the upper limit of the sample rate?
As specified in the documentation for Cloud Speech API, 48000 Hz is indeed the upper bound supported by this API.
Sample rates between 8000 Hz and 48000 Hz are supported within the Speech API.
Therefore, in order to work with higher sample rates you will have to resample your audio files.
Let me also refer you to this other page where the basic information of features supported by Cloud Speech API can be found.
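As a small illustration of that rule, the Python sketch below picks the rate to resample to before uploading. The 8000 and 48000 bounds come from the documentation quoted above; the function name and the choice to reject rates below 8000 Hz rather than upsample are assumptions made for this example.

```python
MIN_RATE_HZ = 8000
MAX_RATE_HZ = 48000


def target_sample_rate(source_rate_hz):
    """Return the sample rate to send: unchanged if supported, capped at 48000 Hz otherwise."""
    if source_rate_hz < MIN_RATE_HZ:
        # Assumption: treat very low rates as an error instead of upsampling.
        raise ValueError("sample rate below the supported minimum")
    return min(source_rate_hz, MAX_RATE_HZ)


for rate in (44100, 64000, 96000):
    print(rate, "->", target_sample_rate(rate))
```

A 64k or 96k recording would therefore be resampled to 48000 Hz, and sample_rate_hertz set to match.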

Set audio format in libvlc

How can I set the audio format in libvlc?
There is a function in libvlc for it, but I don't know how to use it (from here):
LIBVLC_API void libvlc_audio_set_format(libvlc_media_player_t *mp,
                                        const char *format,
                                        unsigned rate,
                                        unsigned channels)
Set decoded audio format.
This only works in combination with libvlc_audio_set_callbacks(), and is mutually exclusive with libvlc_audio_set_format_callbacks().
Parameters
mp: the media player
format: a four-characters string identifying the sample format (e.g. "S16N" or "FL32")
rate: sample rate (expressed in Hz)
channels: channels count
Version: LibVLC 2.0.0 or later
How can I set the format of audio file, for example a wav file?
This API is for raw, decoded audio, which is typically forwarded to speakers or re-encoded to store it.
This API is NOT for exporting audio as files (unless you implement that yourself in your app, that is). To convert files, see the stream output MRL command-line syntax, as there is currently no dedicated libvlc API available for this use case.