Google Cloud Platform: Speech to Text Conversion of Large Media Files

Google Cloud Platform: Speech to Text Conversion of Large Media Files - google-cloud-platform

I'm trying to extract text from mp4 media file downloaded from youtube. As I'm using google cloud platform so thought to give a try to google cloud speech.
After all the installations and configurations, I copied the following code snippet to get start with:
with io.open(file_name, 'rb') as audio_file:
content = audio_file.read()
audio = types.RecognitionAudio(content=content)
config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code='en-US')
response = client.long_running_recognize(config, audio)
But I got the following error regarding file size:
InvalidArgument: 400 Inline audio exceeds duration limit. Please use a
GCS URI.
Then I read that I should use streams for large media files. So, I tried the following code snippet:
with io.open(file_name, 'rb') as audio_file:
content = audio_file.read()
#In practice, stream should be a generator yielding chunks of audio data.
stream = [content]
requests = (types.StreamingRecognizeRequest(audio_content=chunk)for chunk in stream)
config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,sample_rate_hertz=16000,language_code='en-US')
streaming_config = types.StreamingRecognitionConfig(config=config)
responses = client.streaming_recognize(streaming_config, requests)
But still I got the following error:
InvalidArgument: 400 Invalid audio content: too long.
So, can anyone please suggest an approach to transcribe an mp4 file and extract text. I don't have any complex requirement of very large media file. Media file can be 10-15 mins long maximum. Thanks

The error message means that the file is too big and you need to first copy the media file to Google Cloud Storage and then specify a Cloud Storage URI such as gs://bucket/path/mediafile.
The key to using a Cloud Storage URI is:
RecognitionAudio audio =
RecognitionAudio.newBuilder().setUri(gcsUri).build();
The following code will show you how to specify a GCS URI for input. Google has a complete example on github.
public static void syncRecognizeGcs(String gcsUri) throws Exception {
// Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
try (SpeechClient speech = SpeechClient.create()) {
// Builds the request for remote FLAC file
RecognitionConfig config =
RecognitionConfig.newBuilder()
.setEncoding(AudioEncoding.FLAC)
.setLanguageCode("en-US")
.setSampleRateHertz(16000)
.build();
RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();
// Use blocking call for getting audio transcript
RecognizeResponse response = speech.recognize(config, audio);
List<SpeechRecognitionResult> results = response.getResultsList();
for (SpeechRecognitionResult result : results) {
// There can be several alternative transcripts for a given chunk of speech. Just use the
// first (most likely) one here.
SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
System.out.printf("Transcription: %s%n", alternative.getTranscript());
}
}
}

Related

Possible sample rates in Google Speech-to-Text?

I'm using the function provided in the GCS docs that allows me to transcribe text in Cloud Storage:
def transcribe_gcs(gcs_uri):
"""Asynchronously transcribes the audio file specified by the gcs_uri."""
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types
client = speech.SpeechClient()
audio = types.RecognitionAudio(uri=gcs_uri)
config = types.RecognitionConfig(
encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
sample_rate_hertz=48000,
language_code='en-US')
operation = client.long_running_recognize(config, audio)
print('Waiting for operation to complete...')
response = operation.result(timeout=2000)
# Print the first alternative of all the consecutive results.
for result in response.results:
print('Transcript: {}'.format(result.alternatives[0].transcript))
print('Confidence: {}'.format(result.alternatives[0].confidence))
return ' '.join(result.alternatives[0].transcript for result in response.results)
By default, sample_rate_hertz is set at 16000. I changed it to 48000, but I've been having trouble setting it any higher, such as at 64k or 96k. Is 48k is the upper range of the sample rate?

As specified in the documentation for Cloud Speech API, 48000 Hz is indeed the upper bound supported by this API.
Sample rates between 8000 Hz and 48000 Hz are supported within the
Speech API.
Therefore, in order to work with higher sample rates you will have to resample your audio files.
Let me also refer you to this other page where the basic information of features supported by Cloud Speech API can be found.

Google Cloud Speech-to-Text (MP3 to text)

I am using Google Cloud Platform Speech-to-Text API trial account service. I am not able to get text from an audio file. I do not know what exact encoding and sample Rate Hertz I should use for MP3 file of bit rate 128kbps. I tried various options but I am not getting the transcription.
const speech = require('#google-cloud/speech');
const config = {
encoding: 'LINEAR16', //AMR, AMR_WB, LINEAR16(for wav)
sampleRateHertz: 16000, //16000 giving blank result.
languageCode: 'en-US'
};

MP3 is now supported in beta:
MP3 Only available as beta. See RecognitionConfig reference for details.
https://cloud.google.com/speech-to-text/docs/encoding
MP3 MP3 audio. Support all standard MP3 bitrates (which range from 32-320 kbps). When using this encoding, sampleRateHertz can be optionally unset if not known.
https://cloud.google.com/speech-to-text/docs/reference/rest/v1p1beta1/RecognitionConfig#AudioEncoding
You can find out the sample rate using a variety of tools such as iTunes. CD-quality audio uses a sample rate of 44100 Hertz. Read more here:
https://en.wikipedia.org/wiki/44,100_Hz
To use this in a Google SDK, you may need to use one of the beta SDKs that defines this. Here is the constant from the Go Beta SDK:
RecognitionConfig_MP3 RecognitionConfig_AudioEncoding = 8
https://godoc.org/google.golang.org/genproto/googleapis/cloud/speech/v1p1beta1

According to the official documentation (https://cloud.google.com/speech-to-text/docs/encoding),
Only the following formats are supported:
FLAC
LINEAR16
MULAW
AMR
AMR_WB
OGG_OPUS
SPEEX_WITH_HEADER_BYTE
Anything else will be rejected.
Your best bet is to convert the MP3 file to either:
FLAC. .NET: How can I convert an mp3 or a wav file to .flac
Wav and use LINEAR16 in that case. You can use NAudio. Converting mp3 data to wav data C#
Honestly it is annoying that Google does not support MP3 from the get-go compared to Amazon, IBM and Microsoft who do as it forces us to jump through hoops and also increase the bandwidth usage since FLAC and LINEAR16 are lossless and therefore much bigger to transmit.

I had the same issue and resolved it by converting it to FLAC.
Try converting your audio to FLAC and use
encoding: 'FLAC',
For conversion, you can use sox
ref: https://www.npmjs.com/package/sox

now, the mp3 type for spedch-to-text,only available in module speech_v1p1beta1 ,you must post your request for this module,and you will get what you want.
the encoding: 'MP3'
python example like this:
from google.cloud import speech_v1p1beta1 as speech
import io
import base64
client = speech.SpeechClient()
speech_file = "your mp3 file path"
with io.open(speech_file, "rb") as audio_file:
content = (audio_file.read())
audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
encoding=speech.RecognitionConfig.AudioEncoding.MP3,
sample_rate_hertz=44100,
language_code="en-US",
)
response = client.recognize(config=config, audio=audio)
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
print(response)
for result in response.results:
# The first alternative is the most likely one for this portion.
print(u"Transcript: {}".format(result.alternatives[0].transcript))
result

Simple libtorrent Python client

I tried creating a simple libtorrent python client (for magnet uri), and I failed, the program never continues past the "downloading metadata".
If you may help me write a simple client it would be amazing.
P.S. When I choose a save path, is the save path the folder which I want my data to be saved in? or the path for the data itself.
(I used a code someone posted here)
import libtorrent as lt
import time
ses = lt.session()
ses.listen_on(6881, 6891)
params = {
'save_path': '/home/downloads/',
'storage_mode': lt.storage_mode_t(2),
'paused': False,
'auto_managed': True,
'duplicate_is_error': True}
link = "magnet:?xt=urn:btih:4MR6HU7SIHXAXQQFXFJTNLTYSREDR5EI&tr=http://tracker.vodo.net:6970/announce"
handle = lt.add_magnet_uri(ses, link, params)
ses.start_dht()
print 'downloading metadata...'
while (not handle.has_metadata()):
time.sleep(1)
print 'got metadata, starting torrent download...'
while (handle.status().state != lt.torrent_status.seeding):
s = handle.status()
state_str = ['queued', 'checking', 'downloading metadata', \
'downloading', 'finished', 'seeding', 'allocating']
print '%.2f%% complete (down: %.1f kb/s up: %.1f kB/s peers: %d) %s %.3' % \
(s.progress * 100, s.download_rate / 1000, s.upload_rate / 1000, \
s.num_peers, state_str[s.state], s.total_download/1000000)
time.sleep(5)

What happens it is that the first while loop becomes infinite because the state does not change.
You have to add a s = handle.status (); for having the metadata the status changes and the loop stops. Alternatively add the first while inside the other while so that the same will happen.

Yes, the save path you specify is the one that the torrents will be downloaded to.
As for the metadata downloading part, I would add the following extensions first:
ses.add_extension(lt.create_metadata_plugin)
ses.add_extension(lt.create_ut_metadata_plugin)
Second, I would add a DHT bootstrap node:
ses.add_dht_router("router.bittorrent.com", 6881)
Finally, I would begin debugging the application by seeing if my network interface is binding or if any other errors come up (my experience with BitTorrent download problems, in general, is that they are network related). To get an idea of what's happening I would use libtorrent-rasterbar's alert system:
ses.set_alert_mask(lt.alert.category_t.all_categories)
And make a thread (with the following code) to collect the alerts and display them:
while True:
ses.wait_for_alert(500)
alert = lt_session.pop_alert()
if not alert:
continue
print "[%s] %s" % (type(alert), alert.__str__())
Even with all this working correctly, make sure that torrent you are trying to download actually has peers. Even if there are a few peers, none may be configured correctly or support metadata exchange (exchanging metadata is not a standard BitTorrent feature). Try to load a torrent file (which doesn't require downloading metadata) and see if you can download successfully (to rule out some network issues).

wowza apple hls event tracking

hi i am using wowza streaming engine 4 , to stream smil file
i am able to trace events when file play on flash,
and gether informantions like , which file play , time etc,
in onConnect() event,
precisely i want to get which file is played from my smil file.
but in case of apple hls streaming when i try to get file name in onHTTPSessionDestroy() method , eg.
public void onHTTPSessionDestroy(IHTTPStreamerSession httpSession) {
String streamName = httpSession.getStreamName();
}
i only get the name of smil file , not the actual file played .
is it possible to get the played file info in wowza hls steaming

New Api is introduce in wowza 4.1.1
public void onHTTPStreamerRequest(IHTTPStreamerSession httpSession, IHTTPStreamerRequestContext reqContext)
we can get check which bit-rate is played like media_w1577403587_b4000000_1.ts
_b is the bitrate

Play 2 WS, how to detect interrupted or corrupt download?

My service occasionally fetches some content (mostly image files) in batches (20 at a time), in a concurrent manner. Sometimes, some of these image files end up corrupted (browser doesn't render them), not sure why, but it only happens when downloading in larger batches. How to check programmatically if the download was corrupted so I can restart it?
I use Play2 WS on Scala. Iteratees not used.

stackoverflow.com/questions/304268/getting-a-files-md5-checksum-in-java
try retrieving the md5 checksum. would that work for you?
Use DigestUtils from Apache Commons Codec library:
FileInputStream fis = new FileInputStream(new File("foo"));
String md5 = org.apache.commons.codec.digest.DigestUtils.md5Hex(fis);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Google Cloud Platform: Speech to Text Conversion of Large Media Files - google-cloud-platform

Related

Possible sample rates in Google Speech-to-Text?

Google Cloud Speech-to-Text (MP3 to text)

Simple libtorrent Python client

wowza apple hls event tracking

Play 2 WS, how to detect interrupted or corrupt download?

Categories

Resources