Google Speech Recognition Timestamps for Known Transcript - google-cloud-platform

I have an audio file and I have an exact transcript of that audio file. I would like to be able to get the timestamps of each word in that specific transcript.
I don't want timestamps for the inaccurate recognized speech. I can already do that, and it is useful, but it's not quite good enough due to the mistakes in the speech recognition.
Does anyone know if this is possible with Google speech recognition?

It is not possible with Google speech recognition; you have to use other services or tools. There are even open-source tools for exactly this task of aligning a known transcript to audio (usually called forced alignment).
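For reference, the part that Google Speech-to-Text does support is word timestamps for whatever it recognizes (not for your own transcript), requested via enable_word_time_offsets. A minimal sketch with the google-cloud-speech Python client; the GCS URI, encoding, and sample rate are placeholders:

```python
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_word_time_offsets=True,  # ask for per-word start/end times
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/audio.wav")  # placeholder URI
response = client.recognize(config=config, audio=audio)

for result in response.results:
    for word in result.alternatives[0].words:
        # start_time/end_time are durations (timedelta objects in recent client versions)
        print(word.word, word.start_time.total_seconds(), word.end_time.total_seconds())
```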

Related

How are Speech-to-Text and Video Intelligence SPEECH_TRANSCRIPTION related?

My goal is to process several videos using a speech-to-text model.
Google confusingly has two products that seem to do the same thing.
What are the major differences between these offerings?
Google Cloud Speech-to-Text: https://cloud.google.com/speech-to-text/docs/basics
Speech-to-Text has an "enhanced video" model for interpreting the audio.
Google Video Intelligence: https://cloud.google.com/video-intelligence/docs/feature-speech-transcription
VI has the option to request a SPEECH_TRANSCRIPTION feature
The main difference between the two is the input they accept: the Speech-to-Text API only accepts audio inputs, while Video Intelligence accepts video inputs.
As mentioned in your question, Speech-to-Text has an "enhanced video" model, which means it has a model designed to transcribe audio that originated from video files, i.e. the original file was a video that was then converted to audio. As seen in this tutorial, the video was converted to audio prior to transcribing it.
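For illustration, selecting that model is just a configuration choice on the request; a minimal sketch with the google-cloud-speech Python client, where the encoding and sample rate are assumptions about the extracted audio:

```python
from google.cloud import speech

# Recognition config that uses the enhanced "video" model,
# i.e. the model tuned for audio tracks extracted from video files.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",      # model intended for audio that originated from video
    use_enhanced=True,  # opt in to the enhanced variant of that model
)
```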
I suggest using the Video Intelligence API if you would like to transcribe the content of a video directly into text. You can follow this tutorial on how to transcribe speech using the Video Intelligence API.
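A minimal sketch of requesting the SPEECH_TRANSCRIPTION feature with the google-cloud-videointelligence Python client; the GCS URI and language code are placeholders:

```python
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
speech_config = videointelligence.SpeechTranscriptionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,
)
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.SPEECH_TRANSCRIPTION],
        "input_uri": "gs://my-bucket/my-video.mp4",  # placeholder GCS URI
        "video_context": videointelligence.VideoContext(
            speech_transcription_config=speech_config
        ),
    }
)
result = operation.result(timeout=600)

for transcription in result.annotation_results[0].speech_transcriptions:
    # Each alternative also carries word-level timestamps in its `words` list.
    print(transcription.alternatives[0].transcript)
```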

Google Speech to Text domain adaptation

I am trying to do domain adaptation with text data to improve the speech to text results of Google Cloud Speech-to-Text.
I have already done this with the Azure and AWS speech-to-text systems. There you just throw a huge text corpus with domain-specific language at the system, and you usually get better results after that.
For the Google Speech-to-Text system I have not found anything like that. What I found is this guide: https://cloud.google.com/speech-to-text/docs/speech-adaptation
This sadly only allows very specific adaptations (manually adding words that should be recognized better).
I have tried doing keyword extraction on my text corpus and putting the extracted words in the speech_contexts[{"phrases": []}] parameter, but this didn't change my results.
Is there any way to train the Google speech to text service (language model) with a large text corpus for domain adaptation?
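For reference, what I tried looks roughly like this; the phrases are placeholders for my extracted keywords, and boost is the optional bias strength described on the speech adaptation page:

```python
from google.cloud import speech

client = speech.SpeechClient()
# Speech adaptation: bias recognition toward domain-specific phrases.
context = speech.SpeechContext(
    phrases=["myocardial infarction", "stent placement"],  # placeholder domain terms
    boost=15.0,  # optional bias strength; a positive float you have to tune
)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[context],
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/recording.wav")  # placeholder URI
response = client.recognize(config=config, audio=audio)
```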

Getting word timestamps for TTS

I have a text in Japanese that I'm turning into an MP3 with the Google Cloud Text-to-Speech functionality.
I also want to have word timestamps for the MP3 that gets returned by Google.
Google Speech-to-Text offers this functionality, but when I submit the files I get from TTS to STT, the result is not always good.
What is the best way to also get word timestamps for the TTS MP3?
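Roughly, the TTS half of my pipeline looks like this (a sketch; the text and output path are placeholders):

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
synthesis_input = texttospeech.SynthesisInput(text="こんにちは、世界")  # placeholder text
voice = texttospeech.VoiceSelectionParams(
    language_code="ja-JP",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)
with open("output.mp3", "wb") as out:  # placeholder output path
    out.write(response.audio_content)
```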
Google Cloud Speech-to-Text is an ML-based service, so it's expected that the results are not always as "good" as you may want; it has its limitations.
What I can suggest is to take a look at the relevant documentation on this topic, such as the best practices, the guide, and the basics pages that cover it. Additionally, you could look at the issues in their issue tracker, for example this issue, for more information; and if you find a reproducible issue with the service, you can report it there so their team can be aware of it.

Speaker diarization vs speaker recognition: Google Cloud vs Microsoft Azure vs IBM Watson vs AWS Transcribe

I want to do a speech-to-text analysis project where I would like to do 1) speaker recognition, 2) speaker diarization, and 3) speech-to-text. Right now I am testing the APIs provided by various companies like Microsoft, Google, AWS, IBM, etc.
I could find that in Microsoft you have the option for user enrollment and speaker recognition (https://cognitivewuppe.portal.azure-api.net/docs/services/563309b6778daf02acc0a508/operations/5645c3271984551c84ec6797).
However, all the other platforms do have speaker diarization but not speaker recognition. If I understand correctly, speaker diarization will be able to "distinguish" between speakers, but how will it recognize them unless I enroll them? I could only find the enrollment option in Azure.
But I want to be sure, so I just want to check here: maybe I am looking at the correct documents, or maybe there is some other way to achieve this in Google Cloud, Watson, and AWS Transcribe. If that is the case, can you folks please assist me with that?
Speaker Recognition is divided into two categories: speaker verification and speaker identification.
https://learn.microsoft.com/en-us/azure/cognitive-services/speaker-recognition/home
Diarization is the process of separating speakers in a piece of audio. Our Batch pipeline supports diarization and is capable of recognizing two speakers on mono channel recordings.
When you use the batch transcription API and enable diarization, it will return speaker IDs 1 and 2:
All transcription output contains a SpeakerId. If diarization is not used, it will show "SpeakerId": null in the JSON output. For diarization we support two voices, so the speakers will be identified as "1" or "2".
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cognitive-services/Speech-Service/batch-transcription.md
For example: in a call center scenario, the customer does not need to identify who is speaking, and cannot train the model beforehand with speaker voices, since a new user calls in every time. Rather, they only need to identify different voices when converting voice to text.
or
You can use Video Indexer, which supports transcription, speaker diarization (enumeration), and emotion recognition from both the text and the tone of the voice. Additional insights are available as well, e.g. topic inference, language identification, brand detection, translation, etc. You can consume it via the video or audio-only APIs for COGS optimization.
You can use VI for speaker diarization. When you get the insights JSON, you can find the speaker IDs both under Insights.transcript[0].speakerId and under Insights.Speakers. When dealing with audio files where each speaker is recorded on a different channel, VI identifies that and applies the transcription and diarization accordingly.
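For the Google Cloud side of the question: Speech-to-Text offers diarization only, i.e. anonymous per-word speaker labels rather than enrollment-based recognition. A minimal sketch with the google-cloud-speech Python client, assuming a two-speaker mono recording (the URI, encoding, and sample rate are placeholders):

```python
from google.cloud import speech

client = speech.SpeechClient()
diarization = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=2,
)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    diarization_config=diarization,
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/call.wav")  # placeholder URI
response = client.recognize(config=config, audio=audio)

# With diarization enabled, the words in the last result cover the whole audio
# and carry a speaker_tag (1, 2, ...) instead of an enrolled identity.
for word in response.results[-1].alternatives[0].words:
    print(f"speaker {word.speaker_tag}: {word.word}")
```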

How to disable sentence-level auto correction in Google Cloud Speech-to-Text API

I am working on a speech recognition task that involves detecting children's speaking ability and its improvement over time...
I'd like to use the Google Cloud Speech to Text API for the ASR part of the detection. Then I would use the transcripts of different measurements to estimate the advancement.
But! The sentence-level autocorrection of the Google Speech API consistently rewrites the earlier clause of the spoken sentence...
Is there a way to disable the autocorrect of this ASR?
I can't bypass this problem with the "speechContext", "single_utterance" or "maxAlternatives" options.
"single_utterance" may work with words, but it corrects the misspells..
Any advice in this field?
If you use streaming instead of batch recognition, you should receive an answer as soon as that part of the audio is transcribed; it does not wait for the rest of the sentence. You should then just store the first answer provided by the stream, not the later corrections.
This means that you don't have to wait until isFinal=True.
For a quick and dirty example of what I mean, go to the Speech API page and run the streaming test with the developer tools open. There you'll see the streaming data received as the words are being spoken.
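A rough sketch of that approach with the google-cloud-speech Python client: enable interim_results and log every partial hypothesis as it arrives instead of waiting for the corrected final result. The audio_chunks source is a placeholder for however you feed raw LINEAR16 audio:

```python
from google.cloud import speech

client = speech.SpeechClient()
streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    ),
    interim_results=True,  # emit partial hypotheses as the words are spoken
)

audio_chunks = []  # placeholder: an iterable of raw LINEAR16 byte chunks


def build_requests(chunks):
    for chunk in chunks:
        yield speech.StreamingRecognizeRequest(audio_content=chunk)


responses = client.streaming_recognize(streaming_config, build_requests(audio_chunks))
for response in responses:
    for result in response.results:
        # Interim results (is_final == False) are the raw, as-heard hypotheses;
        # later responses may rewrite earlier words, so keep the first version you see.
        print(result.is_final, result.stability, result.alternatives[0].transcript)
```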