My goal is to process several videos using a speech-to-text model.
Google confusingly has two products that seem to do the same thing.
What are the major differences between these offerings?
Google Cloud Speech-to-Text: https://cloud.google.com/speech-to-text/docs/basics
Speech-to-Text has an "enhanced video" model for interpreting the audio.
Google Video Intelligence: https://cloud.google.com/video-intelligence/docs/feature-speech-transcription
Video Intelligence has the option to request a SPEECH_TRANSCRIPTION feature.
The main difference between the two is the input they accept: the Speech-to-Text API only accepts audio input, while Video Intelligence accepts video input.
As mentioned in your question, Speech-to-Text has an enhanced "video" model, meaning a model designed to transcribe audio that originated from video files: the original file was a video, which was then converted to audio. As seen in this tutorial, the video is converted to audio prior to transcribing it.
I suggest using the Video Intelligence API if you would like to transcribe the audio content of a video into text directly. You can follow this tutorial on transcribing speech with the Video Intelligence API; a minimal sketch of the request is also shown below.
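For illustration, here is a rough sketch of that call using the generated gRPC C++ stubs. The bucket path gs://my-bucket/my-video.mp4 and the 10-second poll interval are assumptions, not part of the API; the official client libraries wrap the same request.

    #include <chrono>
    #include <iostream>
    #include <thread>

    #include <grpcpp/grpcpp.h>
    #include "google/cloud/videointelligence/v1/video_intelligence.grpc.pb.h"
    #include "google/longrunning/operations.grpc.pb.h"

    namespace vi = google::cloud::videointelligence::v1;

    int main() {
      auto channel = grpc::CreateChannel("videointelligence.googleapis.com",
                                         grpc::GoogleDefaultCredentials());
      auto stub = vi::VideoIntelligenceService::NewStub(channel);

      vi::AnnotateVideoRequest request;
      request.set_input_uri("gs://my-bucket/my-video.mp4");  // hypothetical GCS path
      request.add_features(vi::SPEECH_TRANSCRIPTION);
      auto* config =
          request.mutable_video_context()->mutable_speech_transcription_config();
      config->set_language_code("en-US");

      grpc::ClientContext ctx;
      google::longrunning::Operation op;
      grpc::Status status = stub->AnnotateVideo(&ctx, request, &op);
      if (!status.ok()) {
        std::cerr << "AnnotateVideo failed: " << status.error_message() << "\n";
        return 1;
      }

      // AnnotateVideo returns a long-running operation; poll it until done.
      auto ops = google::longrunning::Operations::NewStub(channel);
      while (!op.done()) {
        std::this_thread::sleep_for(std::chrono::seconds(10));
        google::longrunning::GetOperationRequest poll;
        poll.set_name(op.name());
        grpc::ClientContext poll_ctx;
        ops->GetOperation(&poll_ctx, poll, &op);
      }

      // The finished operation wraps an AnnotateVideoResponse in an Any.
      vi::AnnotateVideoResponse response;
      op.response().UnpackTo(&response);
      for (auto const& result : response.annotation_results()) {
        for (auto const& transcription : result.speech_transcriptions()) {
          for (auto const& alternative : transcription.alternatives()) {
            std::cout << alternative.transcript() << "\n";
          }
        }
      }
    }

With Speech-to-Text you would instead extract the audio yourself and set the enhanced video model on the RecognitionConfig (use_enhanced plus model "video"); Video Intelligence spares you that conversion step.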
Related
I've used the Video Intelligence API to do object tracking on video.
According to the documentation [1], it recognizes more than 20,000 objects, places, and actions in stored and streaming video.
I have a question: is there any document that lists which kinds of objects can or cannot be recognized?
This is my first question. Thank you.
[1] https://cloud.google.com/video-intelligence
This GCP documentation enumerates the categories that the Cloud Video Intelligence API can detect, analyze, track, transcribe, and recognize: https://cloud.google.com/video-intelligence/docs/how-to
Among the things listed there that the Cloud Video Intelligence API can detect, track, and recognize are faces, people, shot changes, explicit content, objects, logos, and text. The Cloud Video Intelligence API models are pre-trained; if there are objects it cannot recognize, you can train your own custom models using AutoML Video Intelligence. To get started with AutoML Video Intelligence, refer to this GCP documentation: https://cloud.google.com/video-intelligence/automl/docs/beginners-guide
As for the limits on which objects can be recognized, there is no document that states which objects are not recognizable. The only limits in the Cloud Video Intelligence API documentation concern video size per request and video length. GCP documentation: https://cloud.google.com/video-intelligence/quotas
Since there is no published list, one practical way to see what the API recognizes is to run label detection on a sample video and inspect the entity labels it returns, as in the sketch below.
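A minimal sketch with the generated gRPC C++ stubs, assuming application default credentials and a hypothetical video at gs://my-bucket/sample.mp4:

    #include <chrono>
    #include <iostream>
    #include <thread>

    #include <grpcpp/grpcpp.h>
    #include "google/cloud/videointelligence/v1/video_intelligence.grpc.pb.h"
    #include "google/longrunning/operations.grpc.pb.h"

    namespace vi = google::cloud::videointelligence::v1;

    int main() {
      auto channel = grpc::CreateChannel("videointelligence.googleapis.com",
                                         grpc::GoogleDefaultCredentials());
      auto stub = vi::VideoIntelligenceService::NewStub(channel);

      vi::AnnotateVideoRequest request;
      request.set_input_uri("gs://my-bucket/sample.mp4");  // hypothetical GCS path
      request.add_features(vi::LABEL_DETECTION);

      grpc::ClientContext ctx;
      google::longrunning::Operation op;
      grpc::Status status = stub->AnnotateVideo(&ctx, request, &op);
      if (!status.ok()) {
        std::cerr << "AnnotateVideo failed: " << status.error_message() << "\n";
        return 1;
      }

      // Poll the long-running operation on the same endpoint until it finishes.
      auto ops = google::longrunning::Operations::NewStub(channel);
      while (!op.done()) {
        std::this_thread::sleep_for(std::chrono::seconds(10));
        google::longrunning::GetOperationRequest poll;
        poll.set_name(op.name());
        grpc::ClientContext poll_ctx;
        ops->GetOperation(&poll_ctx, poll, &op);
      }

      // Print every entity label the API attached to the whole video.
      vi::AnnotateVideoResponse response;
      op.response().UnpackTo(&response);
      for (auto const& result : response.annotation_results()) {
        for (auto const& label : result.segment_label_annotations()) {
          std::cout << label.entity().description() << "\n";
        }
      }
    }

Running this over a few representative videos gives an empirical picture of the label vocabulary; anything that never shows up is a candidate for a custom AutoML model.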
I'm trying to create a small app that will let me transcribe audio to text via the Google Speech-to-Text service. I'd like to avoid heavy processing of my own and leverage as many cloud tools as possible to stream audio to the Speech-to-Text service. I've been able to get the streaming process to work; however, I have to relay the data through my server first, and this creates an expense I'd like to cut out. A few questions would help me solve this problem in a cost-effective way:
Can I create a signed URL for a Google Speech-to-Text streaming session?
Can I leverage the cloud and Cloud Functions to trigger processing by the Speech-to-Text service and then retrieve real-time updates?
Can I get a signed URL that links to a copy of the audio streamed to the Google Speech-to-Text service?
I want to do a speech-to-text analysis project where I would like 1) speaker recognition, 2) speaker diarization, and 3) speech-to-text. Right now I am testing the APIs provided by various companies like Microsoft, Google, AWS, and IBM.
I found that Microsoft has options for user enrollment and speaker recognition (https://cognitivewuppe.portal.azure-api.net/docs/services/563309b6778daf02acc0a508/operations/5645c3271984551c84ec6797).
However, all the other platforms offer speaker diarization but not speaker recognition. If I understand correctly, speaker diarization can "distinguish" between speakers, but how can it recognize them unless I enroll them first? I could only find an enrollment option in Azure.
I want to be sure, so I'm checking here: am I looking at the right documents, or is there some other way to achieve this in Google Cloud, Watson, or AWS Transcribe? If so, could you please point me to it?
Speaker Recognition is divided into two categories: speaker verification and speaker identification.
https://learn.microsoft.com/en-us/azure/cognitive-services/speaker-recognition/home
Diarization is the process of separating speakers in a piece of audio. Our Batch pipeline supports diarization and is capable of recognizing two speakers on mono channel recordings.
When you use the batch transcription API with diarization enabled, it will return speaker IDs 1 and 2.
All transcription output contains a SpeakerId. If diarization is not used, it will show "SpeakerId": null in the JSON output. For diarization we support two voices, so the speakers will be identified as "1" or "2".
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cognitive-services/Speech-Service/batch-transcription.md
For example: in a call-center scenario, the customer does not need to identify who is speaking and cannot train the model beforehand with speaker voices, since a new user calls in every time. They only need to distinguish different voices when converting speech to text.
Alternatively, you can use Video Indexer, which supports transcription, speaker diarization (enumeration), and emotion recognition from both the text and the tone of voice. Additional insights are available as well, e.g. topic inference, language identification, brand detection, and translation. You can consume it via the video or audio-only APIs for COGS optimization.
You can use VI for speaker diarization. When you get the insights JSON, you can find speaker IDs both under Insights.transcript[0].speakerId and under Insights.Speakers. When dealing with audio files where each speaker is recorded on a different channel, VI identifies that and applies the transcription and diarization accordingly. A short parsing sketch is shown below.
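For reference, a minimal sketch of pulling those speaker IDs out of a saved insights JSON using nlohmann/json. The field paths follow the answer above; the "text" key, the local file name, and the exact key casing are assumptions that may vary by API version.

    #include <fstream>
    #include <iostream>
    #include <nlohmann/json.hpp>

    int main() {
      // Assumption: the Video Indexer insights JSON was downloaded to a local file.
      std::ifstream in("insights.json");
      nlohmann::json insights = nlohmann::json::parse(in);

      // Each transcript line carries the ID of the speaker it was attributed to.
      // "speakerId" comes from the answer above; "text" is an assumed field name.
      for (auto const& line : insights["transcript"]) {
        std::cout << "speaker " << line["speakerId"] << ": " << line["text"] << "\n";
      }
    }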
I need to use C++ to transcribe real-time audio from my mobile phone, but there is no real-time transcription demo among the Google examples.
The following is an official example, but it transcribes a file rather than a live stream. Can anyone point me to an example of real-time streaming recognition?
https://github.com/GoogleCloudPlatform/cpp-docs-samples/tree/master/speech/api
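For what it's worth, a rough sketch of what streaming recognition looks like against the Speech gRPC C++ stub generated from the API protos. It reads raw audio from stdin in chunks to stand in for a live source (a real app would feed microphone buffers instead); the 16 kHz LINEAR16 encoding and the 32 KB chunk size are assumptions.

    #include <iostream>
    #include <thread>

    #include <grpcpp/grpcpp.h>
    #include "google/cloud/speech/v1/cloud_speech.grpc.pb.h"

    namespace speech = google::cloud::speech::v1;

    int main() {
      auto channel = grpc::CreateChannel("speech.googleapis.com",
                                         grpc::GoogleDefaultCredentials());
      auto stub = speech::Speech::NewStub(channel);

      grpc::ClientContext context;
      auto stream = stub->StreamingRecognize(&context);

      // The first request on the stream must carry only the configuration.
      speech::StreamingRecognizeRequest request;
      auto* streaming_config = request.mutable_streaming_config();
      auto* config = streaming_config->mutable_config();
      config->set_encoding(speech::RecognitionConfig::LINEAR16);
      config->set_sample_rate_hertz(16000);  // assumption: 16 kHz PCM audio
      config->set_language_code("en-US");
      streaming_config->set_interim_results(true);
      stream->Write(request);

      // Writer thread: every subsequent request carries only raw audio bytes.
      std::thread writer([&stream] {
        char chunk[32 * 1024];
        while (std::cin.read(chunk, sizeof(chunk)) || std::cin.gcount() > 0) {
          speech::StreamingRecognizeRequest audio;
          audio.set_audio_content(chunk, std::cin.gcount());
          if (!stream->Write(audio)) break;  // stream closed by the server
        }
        stream->WritesDone();
      });

      // Reader loop: print interim and final transcripts as they arrive.
      speech::StreamingRecognizeResponse response;
      while (stream->Read(&response)) {
        for (auto const& result : response.results()) {
          for (auto const& alternative : result.alternatives()) {
            std::cout << (result.is_final() ? "final:   " : "interim: ")
                      << alternative.transcript() << "\n";
          }
        }
      }
      writer.join();
      grpc::Status status = stream->Finish();
      if (!status.ok()) std::cerr << "RPC failed: " << status.error_message() << "\n";
    }

On a phone you would replace the stdin loop with your platform's audio capture callback and write each captured buffer to the stream as it arrives.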
Can you please help with a list of the video formats and codecs supported by the Google Cloud Video Intelligence API?
Per https://cloud.google.com/video-intelligence/docs/, the API supports common video formats, including .MOV, .MPEG4, .MP4, and .AVI.
Video Intelligence uses FFmpeg to process videos, so the formats FFmpeg supports are also supported by it.