I'm using the function provided in the GCS docs that allows me to transcribe text in Cloud Storage:
def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types

    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=48000,
        language_code='en-US')

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=2000)

    # Print the first alternative of all the consecutive results.
    for result in response.results:
        print('Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))

    return ' '.join(result.alternatives[0].transcript for result in response.results)
By default, sample_rate_hertz is set to 16000. I changed it to 48000, but I've had trouble setting it any higher, such as 64k or 96k. Is 48k the upper limit of the sample rate?
As specified in the documentation for Cloud Speech API, 48000 Hz is indeed the upper bound supported by this API.
Sample rates between 8000 Hz and 48000 Hz are supported within the
Speech API.
Therefore, in order to work with higher sample rates you will have to resample your audio files.
Let me also refer you to this other page, where basic information about the features supported by the Cloud Speech API can be found.
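If your source audio is recorded at a higher rate (for example 96 kHz), one straightforward option is to resample it to 48 kHz before uploading. A minimal sketch that shells out to ffmpeg from Python (assuming ffmpeg is installed; the file names are placeholders):

import subprocess

# Resample a 96 kHz FLAC file down to 48 kHz so the Speech API will accept it.
subprocess.run(
    ["ffmpeg", "-i", "input_96k.flac", "-ar", "48000", "output_48k.flac"],
    check=True,
)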
Related
"failureReason": "Job validation failed: Request field config is
invalid, expected an estimated total output size of at most 400 GB
(current value is 1194622697155 bytes).",
The actual input file was only 8 seconds long. It was created using the Safari MediaRecorder API on macOS.
"failureReason": "Job validation failed: Request field
config.editList[0].startTimeOffset is 0s, expected start time less
than the minimum duration of all inputs for this atom (0s).",
The actual input file was 8 seconds long. It was created using the desktop Chrome MediaRecorder API, with mimeType "webm; codecs=vp9", on macOS.
Note that Stack Overflow wouldn't allow me to include the tag google-cloud-transcoder suggested by "Getting Support" (https://cloud.google.com/transcoder/docs/getting-support?hl=sr).
Like Faniel mentioned, your first issue is that your video is less than 10 seconds long, which is below the API's minimum duration of 10 seconds.
Your second issue is that the "Duration" information is likely missing from the EBML headers of your .webm file. When you record with MediaRecorder, the duration of your video is set to N/A in the file headers, as it is not known in advance. This means the Transcoder API will treat the length of your video as Infinity / 0. Some consider this a bug in Chromium.
To confirm this is your issue you can use ts-ebml or ffprobe to inspect the headers of your video. You can also use these tools to repair the headers. Read more about this here and here
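For instance, a minimal sketch that checks the duration reported in the container headers with ffprobe (assuming ffprobe is on your PATH; the file name is a placeholder):

import subprocess

# Print the duration stored in the container's format headers, if any.
result = subprocess.run(
    ["ffprobe", "-v", "error", "-show_entries", "format=duration",
     "-of", "default=noprint_wrappers=1:nokey=1", "recording.webm"],
    capture_output=True, text=True,
)
print(result.stdout.strip() or "no duration found in headers")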
Also, try running the Transcoder API on this demo .webm, which has its duration information set correctly.
This Google documentation states that the input file must be at least 5 seconds in duration and should be stored in Cloud Storage (for example, gs://bucket/inputs/file.mp4). A job validation error can occur when the inputs are not properly packaged and don't contain duration metadata, or contain incorrect duration metadata. When the inputs are not properly packaged, we can explicitly specify startTimeOffset and endTimeOffset in the job config to set the correct duration. If the estimated total output size, computed from the output bitrate and the duration reported by ffprobe (in seconds), exceeds 400 GB, it results in a job validation error. We can use the following formula to estimate the output size.
estimatedTotalOutputSizeInBytes = bitrateBps * outputDurationInSec / 8;
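As a rough illustration with made-up numbers, a 10-minute output encoded at 8 Mbps comes out far below the 400 GB limit:

# Hypothetical example values, just to illustrate the formula above.
bitrate_bps = 8_000_000        # 8 Mbps output bitrate
output_duration_sec = 600      # 10-minute output

estimated_bytes = bitrate_bps * output_duration_sec / 8
print(estimated_bytes / 1e9, "GB")  # 0.6 GB, well under 400 GB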
Thanks for the question and feedback. The Transcoder API currently has a minimum duration of 10 seconds, which may be why the job wasn't successful.
I'm trying to extract text from an mp4 media file downloaded from YouTube. Since I'm using Google Cloud Platform, I thought I'd give Google Cloud Speech a try.
After all the installations and configurations, I copied the following code snippet to get started:
import io
from google.cloud import speech
from google.cloud.speech import enums, types

client = speech.SpeechClient()
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
audio = types.RecognitionAudio(content=content)
config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code='en-US')
response = client.long_running_recognize(config, audio)
But I got the following error regarding file size:
InvalidArgument: 400 Inline audio exceeds duration limit. Please use a
GCS URI.
Then I read that I should use streams for large media files. So, I tried the following code snippet:
with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()

# In practice, stream should be a generator yielding chunks of audio data.
stream = [content]
requests = (types.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream)
config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code='en-US')
streaming_config = types.StreamingRecognitionConfig(config=config)
responses = client.streaming_recognize(streaming_config, requests)
But still I got the following error:
InvalidArgument: 400 Invalid audio content: too long.
So, can anyone please suggest an approach to transcribe an mp4 file and extract the text? I don't have any requirement for very large media files; a file will be 10-15 minutes long at most. Thanks.
The error message means that the file is too big and you need to first copy the media file to Google Cloud Storage and then specify a Cloud Storage URI such as gs://bucket/path/mediafile.
The key to using a Cloud Storage URI is:
RecognitionAudio audio =
RecognitionAudio.newBuilder().setUri(gcsUri).build();
The following code shows how to specify a GCS URI for input. Google has a complete example on GitHub.
public static void syncRecognizeGcs(String gcsUri) throws Exception {
  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create()) {
    // Builds the request for remote FLAC file
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use blocking call for getting audio transcript
    RecognizeResponse response = speech.recognize(config, audio);
    List<SpeechRecognitionResult> results = response.getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s%n", alternative.getTranscript());
    }
  }
}
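Since the rest of this question uses Python, here is a minimal sketch of the same approach with the Python clients (assuming the google-cloud-storage library and a recent google-cloud-speech release; the bucket and file names are placeholders):

from google.cloud import speech, storage

# Upload the local audio file to Cloud Storage first.
storage_client = storage.Client()
bucket = storage_client.bucket("my-bucket")            # hypothetical bucket name
blob = bucket.blob("audio/myfile.flac")                # destination object path
blob.upload_from_filename("myfile.flac")               # local FLAC/WAV file

# Then reference the object by URI instead of sending inline content.
client = speech.SpeechClient()
audio = speech.RecognitionAudio(uri="gs://my-bucket/audio/myfile.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    language_code="en-US",
)
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)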
There are four different ways to send data across USB: Control, Interrupt, Bulk, and Isochronous. book ref 1
From book ref 1, page 330:
... Bulk endpoints transfer large amounts of data. These endpoints are usually much larger (they can hold more characters at once) than interrupt endpoints. ...
When I get my input endpoint, I use the following code.
import usb.core
import usb.util

dev = usb.core.find(idVendor=0x0683, idProduct=0x4108)

if dev is None:
    raise ValueError('Device not found')

dev.reset()
dev.set_configuration()
cfg = dev.get_active_configuration()
intf = cfg[(0, 0)]

epi = usb.util.find_descriptor(
    intf,
    # match the first IN endpoint
    custom_match = \
    lambda e: \
        usb.util.endpoint_direction(e.bEndpointAddress) == \
        usb.util.ENDPOINT_IN)
I tried to add the following, but it gives me a syntax error that I don't fully understand:
usb.util.endpoint_type()== \
usb.util.ENDPOINT_TYPE_BULK
Here is another very good source on how to work with USB link 1
It seems that USB endpoints have parameters that can be specified in Python:
where bEndpointAddress indicates what endpoint this descriptor is describing.
bmAttributes specifies the transfer type. This can either be Control, Interrupt, Isochronous or Bulk Transfers. If an Isochronous endpoint is specified, additional attributes can be selected such as the Synchronisation and usage types.
wMaxPacketSize indicates the maximum payload size for this endpoint.
bInterval is used to specify the polling interval of certain transfers. The units are expressed in frames, thus this equates to either 1ms for low/full speed devices and 125us for high speed devices.
I have tried:
epi.wMaxPacketSize = 72000000 #to make the buffer large
epi.bmAttributes = 3 # 3 = 10 in binary. to change the mode to bulk
My questions are:
Where do I specify what kind of endpoint I am using, on Windows and/or Linux, and how do I do that? And how can I change the buffer size of each endpoint?
Try this:
epi = usb.util.find_descriptor(intf,
custom_match = \
lambda e: \
usb.util.endpoint_direction(e.bEndpointAddress) == \
usb.util.ENDPOINT_IN \
and \
usb.util.endpoint_type(e.bmAttributes) == \
usb.util.ENDPOINT_TYPE_BULK )
But you misunderstood the part about the parameters. bmAttributes and wMaxPacketSize are reported by the USB hardware and are not meant to be changed from Python.
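Once the matching bulk IN endpoint has been found, you normally just read from it rather than editing the descriptor fields. A minimal sketch (the timeout value is only an example):

# Read one transfer's worth of data from the bulk IN endpoint found above.
data = epi.read(epi.wMaxPacketSize, timeout=1000)
print(len(data), "bytes received")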
I am using Google Cloud Platform Speech-to-Text API trial account service. I am not able to get text from an audio file. I do not know what exact encoding and sample Rate Hertz I should use for MP3 file of bit rate 128kbps. I tried various options but I am not getting the transcription.
const speech = require('@google-cloud/speech');
const config = {
encoding: 'LINEAR16', //AMR, AMR_WB, LINEAR16(for wav)
sampleRateHertz: 16000, //16000 giving blank result.
languageCode: 'en-US'
};
MP3 is now supported in beta:
MP3 Only available as beta. See RecognitionConfig reference for details.
https://cloud.google.com/speech-to-text/docs/encoding
MP3: MP3 audio. Support all standard MP3 bitrates (which range from 32-320 kbps). When using this encoding, sampleRateHertz can be optionally unset if not known.
https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig#AudioEncoding
You can find out the sample rate using a variety of tools such as iTunes. CD-quality audio uses a sample rate of 44100 Hertz. Read more here:
https://en.wikipedia.org/wiki/44,100_Hz
To use this in a Google SDK, you may need to use one of the beta SDKs that defines this. Here is the constant from the Go Beta SDK:
RecognitionConfig_MP3 RecognitionConfig_AudioEncoding = 8
https://godoc.org/google.golang.org/genproto/googleapis/cloud/speech/v1p1beta1
According to the official documentation (https://cloud.google.com/speech-to-text/docs/encoding),
Only the following formats are supported:
FLAC
LINEAR16
MULAW
AMR
AMR_WB
OGG_OPUS
SPEEX_WITH_HEADER_BYTE
Anything else will be rejected.
Your best bet is to convert the MP3 file to either:
FLAC. .NET: How can I convert an mp3 or a wav file to .flac
Wav and use LINEAR16 in that case. You can use NAudio. Converting mp3 data to wav data C#
Honestly, it is annoying that Google does not support MP3 out of the box the way Amazon, IBM, and Microsoft do; it forces us to jump through hoops and also increases bandwidth usage, since FLAC and LINEAR16 are lossless and therefore much larger to transmit.
I had the same issue and resolved it by converting it to FLAC.
Try converting your audio to FLAC and use
encoding: 'FLAC',
For conversion, you can use sox
ref: https://www.npmjs.com/package/sox
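If you prefer to call the sox command-line tool directly, a minimal sketch in Python (assuming sox is installed with MP3 support; the file names are placeholders):

import subprocess

# Convert the MP3 to FLAC before sending it to the Speech-to-Text API.
subprocess.run(["sox", "input.mp3", "output.flac"], check=True)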
Now, the MP3 type for Speech-to-Text is only available in the module speech_v1p1beta1, so you must send your request through this module, and you will get what you want.
Set the encoding to 'MP3'.
A Python example looks like this:
from google.cloud import speech_v1p1beta1 as speech
import io
import base64

client = speech.SpeechClient()
speech_file = "your mp3 file path"

with io.open(speech_file, "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MP3,
    sample_rate_hertz=44100,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)

# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
print(response)
for result in response.results:
    # The first alternative is the most likely one for this portion.
    print(u"Transcript: {}".format(result.alternatives[0].transcript))
I'm using the Python Speech Recognition library to recognize speech input from the microphone.
This works fine with my default microphone.
This is the code I'm using. According to what I understood of the documentation
Creates a new Microphone instance, which represents a physical
microphone on the computer. Subclass of AudioSource.
If device_index is unspecified or None, the default microphone is used
as the audio source. Otherwise, device_index should be the index of
the device to use for audio input. https://pypi.python.org/pypi/SpeechRecognition/
The problem is that when I try to get the device index with pyaudio.get_device_count() - 1, I get this error.
AttributeError: 'module' object has no attribute 'get_device_count'
So I'm not sure how to configure the Microphone instance to use a USB microphone.
import pyaudio
import speech_recognition as sr

index = pyaudio.get_device_count() - 1
print index

r = sr.Recognizer()
with sr.Microphone(index) as source:
    audio = r.listen(source)
try:
    print("You said " + r.recognize(audio))
except LookupError:
    print("Could not understand audio")
myPyAudio = pyaudio.PyAudio()
print "Seeing pyaudio devices:", myPyAudio.get_device_count()
That's a bug in the library. I just pushed out a fix in 1.3.1, so this should now be fixed!
Version 1.3.1 retains full backwards compatibility with previous versions.
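For reference, once the fix is in place you can list the available input devices with the SpeechRecognition library itself and pass the USB microphone's index to Microphone. A minimal sketch (the index used is only an example; pick the one printed for your device):

import speech_recognition as sr

# Print every input device PyAudio knows about, with its index.
for i, name in enumerate(sr.Microphone.list_microphone_names()):
    print(i, name)

r = sr.Recognizer()
# device_index=1 is a placeholder; use the index of your USB microphone.
with sr.Microphone(device_index=1) as source:
    audio = r.listen(source)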