How to improve the transcription quality in AWS Transcribe - amazon-web-services

I have a few audio files of conversations between a customer and an agent stored successfully in S3.
I tried converting the audio files to text using AWS Transcribe, and the conversion succeeds.
But the weird part is that the result is not even 60% accurate. This is my configuration for AWS Transcribe:
1) Language code - English(Indian)
2) Audio Frequency - 8000 Hz
3) Format - WAV
As per these guidelines (https://docs.aws.amazon.com/transcribe/latest/dg/limits-guidelines.html),
I set the audio frequency to 8 kHz and the format to WAV.
Do I need to change any other parameters to improve the transcription quality?
Any help is appreciated.
Thanks,
Harry

Many things can affect transcript quality, such as background noise in the audio, speaker overlap, and the speakers' accents. Higher-quality audio usually gives better results.

You can try using custom vocabularies. You can create them as described here: https://docs.aws.amazon.com/transcribe/latest/dg/how-vocabulary.html
The custom vocabulary should contain keywords that are likely to be spoken and are specific to your domain. However, in my experience these custom vocabularies sometimes overfit (the transcript incorrectly contains words taken from the custom vocabulary).
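As a rough sketch (not a definitive recipe), here is how creating and using a custom vocabulary can look with boto3; the vocabulary name, phrases, bucket, region, and job name below are made-up placeholders:

```python
import boto3

transcribe = boto3.client("transcribe", region_name="ap-south-1")

# Create a custom vocabulary with domain-specific terms (placeholder values)
transcribe.create_vocabulary(
    VocabularyName="call-center-terms",
    LanguageCode="en-IN",
    Phrases=["EMI", "KYC", "Aadhaar", "net-banking"],
)

# Start a transcription job that uses the vocabulary (placeholder S3 URI)
transcribe.start_transcription_job(
    TranscriptionJobName="agent-call-001",
    LanguageCode="en-IN",
    MediaFormat="wav",
    MediaSampleRateHertz=8000,
    Media={"MediaFileUri": "s3://my-bucket/calls/call-001.wav"},
    Settings={
        "VocabularyName": "call-center-terms",
        # If agent and customer are on separate stereo channels, channel
        # identification may also help with speaker overlap:
        # "ChannelIdentification": True,
    },
)
```

Wait for the vocabulary to reach the READY state before starting jobs that reference it.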

Related

Convert video into different qualities AWS MediaConvert

I have a test.mp4 file (for example). I need to convert it so that there is an option to select the quality in the player on the client side.
For example, if the video is in 4k resolution, then the client should be able to select the quality of auto, 4k, 1080p, 720p, and 480p.
If the video is 1080p, the choice should be auto, 1080p, 720p and 480p.
And so on.
I know I should choose to convert to Apple HLS and get an m3u8 file in the output.
I tried using automated ABR, but that's not what I need.
I use AWS MediaConvert for the conversion.
What you are describing sounds like an HLS bitrate stack. I'll answer based on that assumption.
It will be the responsibility of the playback software to present a menu of the available resolutions. If you want the player to disable its adaptive rendition selection logic and permit the viewer to stay on a specified rendition regardless of segment download times, that workflow needs to be configured within the video player object. In either case you will need an asset file group consisting of manifests and segments.
FYI, MediaConvert has both an automatic ABR mode (which determines the number of renditions & bitrate settings automatically) and a 'manual mode' where you provide the parameters of each child rendition. In this mode, each child rendition is added as a separate Output under the main Apple HLS Output Group. More information can be found here: https://docs.aws.amazon.com/mediaconvert/latest/ug/outputs-file-ABR.html.
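As an illustration of the manual mode only, here is a heavily trimmed boto3 sketch with one Apple HLS output group and one Output per rendition. The bucket, role ARN, resolutions, and bitrates are placeholders, audio settings are omitted for brevity, and the full job JSON exported from the MediaConvert console remains the authoritative reference:

```python
import boto3

# MediaConvert uses an account-specific endpoint
mc = boto3.client("mediaconvert", region_name="us-east-1")
endpoint = mc.describe_endpoints()["Endpoints"][0]["Url"]
mc = boto3.client("mediaconvert", region_name="us-east-1", endpoint_url=endpoint)

def rendition(width, height, bitrate):
    # One child rendition = one Output inside the Apple HLS output group
    return {
        "NameModifier": f"_{height}p",
        "ContainerSettings": {"Container": "M3U8", "M3u8Settings": {}},
        "VideoDescription": {
            "Width": width,
            "Height": height,
            "CodecSettings": {
                "Codec": "H_264",
                "H264Settings": {"RateControlMode": "CBR", "Bitrate": bitrate},
            },
        },
        # AudioDescriptions omitted here for brevity
    }

job_settings = {
    "Inputs": [{"FileInput": "s3://my-bucket/test.mp4"}],
    "OutputGroups": [{
        "Name": "Apple HLS",
        "OutputGroupSettings": {
            "Type": "HLS_GROUP_SETTINGS",
            "HlsGroupSettings": {
                "Destination": "s3://my-bucket/hls/test",
                "SegmentLength": 6,
                "MinSegmentLength": 0,
            },
        },
        "Outputs": [
            rendition(3840, 2160, 12000000),  # 4k
            rendition(1920, 1080, 5000000),   # 1080p
            rendition(1280, 720, 3000000),    # 720p
            rendition(854, 480, 1200000),     # 480p
        ],
    }],
}

mc.create_job(
    Role="arn:aws:iam::123456789012:role/MediaConvertRole",  # placeholder role
    Settings=job_settings,
)
```

The resulting .m3u8 master manifest lists each rendition; whether the viewer sees a quality menu or gets automatic switching is then a matter of player configuration, as noted above.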

YouTube's auto captioning produces better results than the Google Speech to Text API (Model: video, UseEnhanced: true). How is this possible?

Here are my settings for Google Speech to Text AI
Here is the output file of Speech to Text AI : https://justpaste.it/speechtotext2
Here is the output file of YouTube's auto caption: https://justpaste.it/ytautotranslate
This is the video link : https://www.youtube.com/watch?v=IOMO-kcqxJ8&ab_channel=SoftwareEngineeringCourses-SECourses
This is the audio file of the video provided to Google Speech AI : https://storage.googleapis.com/text_speech_furkan/machine_learning_lecture_1.flac
Here I am providing time assigned SRT files
YouTube's SRT : https://drive.google.com/file/d/1yPA1m0hPr9VF7oD7jv5KF7n1QnV3Z82d/view?usp=sharing
Google Speech to Text API's SRT (timing assigned by YouTube) : https://drive.google.com/file/d/1AGzkrxMEQJspYenCbohUM4iuXN7H89wH/view?usp=sharing
I compared some sentences, and YouTube's auto captioning is definitely better.
For example
Google Speech to Text : Represent the **doctor** representation is one of the hardest part of computer AI you will learn about more about that in the future lessons.
What does this mean? Do you think this means that we are not just focused on behavior and **into doubt**. It is more about the reasoning when a human takes an action. There is a reasoning behind it.
YouTube's auto captioning : represent the **data** representation is one of the hardest part of computer ai you will we will learn more about that in the future lessons
what does this mean do you think this means that we are not just focused on behavior and **input** it is more about the reasoning when a human takes an action there is a reasoning behind it
I checked many cases, and YouTube guesses the correct words much more often. How is this even possible?
This is the command I used to extract the audio from the video: ffmpeg -i "input.mkv" -af aformat=s16:48000 output.flac
Both the automatic captions of the YouTube Auto Caption feature and the transcription of Speech to Text Recognition are generated by machine learning algorithms, so the quality of the transcription can vary depending on several factors.
It is important to note that the Speech to Text API uses machine learning models that are improved over time, and the results can vary according to the input file and the request configuration. One way of helping Google's transcription models is by enabling data logging; this allows Google to collect data from your audio transcription requests, which helps improve the machine learning models used for recognizing speech audio, including enhanced models.
Additionally, on the request configuration of the Speech to Text API, you can specify the RecognitionConfig settings. This parameter contains the encoding, sampleRateHertz, languageCode, maxAlternatives, profanityFilter and the speechContext, every parameter plays an important role on the accuracy of the transcription of the file.
Specifically for FLAC audio files, lossless compression helps preserve the quality of the audio provided, since there is no degradation of the original digital samples. FLAC uses a compression level parameter from 0 (fastest) to 8 (smallest file size).
Also, the Speech to Text API offers different ways to improve the accuracy of the transcription, such as:
Speech adaptation: This feature allows you to specify words and/or phrases that STT should recognize more frequently in your audio data.
Speech adaptation boost: This feature allows you to add numerical weights to words and/or phrases according to how frequently they should be recognized in your audio data.
Phrase hints: Send a list of words and phrases that provide hints to the speech recognition task.
These features might help the Speech to Text API recognize your audio files more accurately; a sketch combining them with the RecognitionConfig settings is shown below.
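A minimal, hedged sketch using the google-cloud-speech Python client, assuming a FLAC file in Cloud Storage; the phrases, boost value, language code, and URI are placeholders to adapt to your own audio:

```python
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=48000,
    language_code="en-US",
    max_alternatives=1,
    profanity_filter=False,
    enable_automatic_punctuation=True,
    use_enhanced=True,
    model="video",
    # Speech adaptation: bias recognition toward domain phrases, with a boost
    speech_contexts=[
        speech.SpeechContext(
            phrases=["data representation", "machine learning", "reasoning"],
            boost=10.0,
        )
    ],
)

audio = speech.RecognitionAudio(
    uri="gs://text_speech_furkan/machine_learning_lecture_1.flac"  # assumed gs:// path
)

# Long audio has to go through the asynchronous long-running method
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

for result in response.results:
    print(result.alternatives[0].transcript)
```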
Finally, please refer to the Speech to Text best practices to improve the transcription of your audio files; these recommendations are designed for greater efficiency and accuracy, as well as reasonable response times from the API.

How to convert food-101 dataset into usable format for AWS SageMaker

I'm still very new to the world of machine learning and am looking for some guidance on how to continue a project that I've been working on. Right now I'm trying to feed the Food-101 dataset into the Image Classification algorithm in SageMaker and later deploy the trained model onto an AWS DeepLens to have food detection capabilities. Unfortunately, the dataset comes with only the raw image files organized in subfolders, as well as an .h5 file (I'm not sure if I can feed this file type directly into SageMaker?). From what I've gathered, neither of these is a suitable way to feed the dataset into SageMaker, and I was wondering if anyone could point me in the right direction for preparing the dataset properly, i.e. converting it to .rec or something else. Apologies if the scope of this question is very broad; I'm still a beginner and simply stuck on how to proceed, so any help you can provide would be fantastic. Thanks!
If you want to use the built-in algorithm for image classification, you can use either Image format or RecordIO format; see https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-inputoutput
Image format is straightforward: just build a manifest (.lst) file with the list of images. This could be an easy solution for you, since you already have the images organized in folders; see the sketch below.
RecordIO requires that you build the files with the 'im2rec' tool; see https://mxnet.incubator.apache.org/versions/master/faq/recordio.html.
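As an illustration only, assuming the usual Food-101 layout of images/<class_name>/<image>.jpg, a small Python script like this can write the tab-separated .lst listing (index, label, relative path) used by the Image format and consumed by im2rec when building .rec files; the paths and file names here are placeholders:

```python
import os

def write_lst(image_root, lst_path):
    # One integer label per class folder, in sorted order
    classes = sorted(d for d in os.listdir(image_root)
                     if os.path.isdir(os.path.join(image_root, d)))
    label_of = {name: i for i, name in enumerate(classes)}

    index = 0
    with open(lst_path, "w") as f:
        for name in classes:
            class_dir = os.path.join(image_root, name)
            for fname in sorted(os.listdir(class_dir)):
                if not fname.lower().endswith((".jpg", ".jpeg", ".png")):
                    continue
                # .lst format: <index>\t<label>\t<relative path>
                f.write(f"{index}\t{label_of[name]}\t{name}/{fname}\n")
                index += 1

write_lst("food-101/images", "food101_train.lst")  # placeholder paths
```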
Once your data set is ready, you should be able to adapt the sample notebooks available at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms

Creating custom voice commands (GNU/Linux)

I'm looking for advice for a personal project.
I'm attempting to create software for building customized voice commands. The goal is to allow the user (me) to record a short audio clip (2-3 seconds) to define a command/macro. Then, when the user speaks the same audio again, the command/macro will be executed.
The software must be able to detect a command in less than 1 second of processing time on a low-cost computer (a Raspberry Pi, for example).
I have already searched in two directions:
- Speech recognition (CMU Sphinx, Julius, Simon): There are good open-source solutions, but they often need large database files, and speech recognition is not really what I'm trying to do; it could also consume too much power for such a small feature.
- Audio fingerprinting (Chromaprint -> http://acoustid.org/chromaprint): This seems to be almost what I'm looking for. The principle is to create a fingerprint from raw audio data, then compare fingerprints to determine whether they match. However, this kind of software/library seems to be designed for song identification (like the well-known smartphone apps): I'm trying to configure a good "comparator", but I think I'm going down the wrong path.
Do you know of any dedicated software or piece of code that does something similar?
Any suggestion would be appreciated.
I had a more or less similar project in which I intended to send voice commands to a robot. Speech recognition software is too complicated for such a task. I used an FFT implementation in C++ to extract the Fourier components of the sampled voice, and then I created a histogram of major frequencies (the frequencies at which the target voice command has the highest amplitudes). I tried two approaches:
Comparing the similarities between histogram of the given voice command with those saved in the memory to identify the most probable command.
Using a Support Vector Machine (SVM) to train a classifier to distinguish voice commands. I used LibSVM, and the results were considerably better than with the first approach. However, one problem with the SVM method is that you need a rather large data set for training. Another problem is that, when an unknown voice is given, the classifier will output a command anyway (which is obviously a wrong detection). This can be avoided with the first approach, where I had a threshold for the similarity measure.
I hope this helps you to implement your own voice activated software.
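The original code was in C++, but here is a rough Python/NumPy sketch of the first approach; the bin count and threshold below are arbitrary placeholders, not values from the original project:

```python
import numpy as np

N_BINS = 64  # number of frequency bands in the histogram (arbitrary choice)

def frequency_histogram(samples):
    """Histogram of spectral energy over N_BINS frequency bands."""
    spectrum = np.abs(np.fft.rfft(samples))
    bands = np.array_split(spectrum, N_BINS)
    hist = np.array([band.sum() for band in bands])
    return hist / (hist.sum() + 1e-12)  # normalize so recordings are comparable

def similarity(h1, h2):
    """Cosine similarity between two normalized histograms."""
    return float(np.dot(h1, h2) /
                 (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))

def match_command(samples, templates, threshold=0.85):
    """Return the best-matching command name, or None if below the threshold."""
    hist = frequency_histogram(samples)
    scores = {name: similarity(hist, tmpl) for name, tmpl in templates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

The threshold is what lets unknown audio be rejected instead of always mapping to some command, which is the weakness of the plain SVM approach mentioned above.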
Song fingerprinting is not a good idea for this task, because command timing can vary and fingerprinting expects an exact time match. However, it is very easy to implement matching with the DTW (dynamic time warping) algorithm on time series of features extracted with the CMUSphinx library Sphinxbase; a sketch follows below the links. See the Wikipedia entry on DTW for details.
http://en.wikipedia.org/wiki/Dynamic_time_warping
http://cmusphinx.sourceforge.net/wiki/download
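A small illustration of the DTW matching idea (not Sphinxbase code): here the python_speech_features package stands in for Sphinxbase's feature extraction, and the command WAV files are placeholders:

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc  # stand-in for Sphinxbase feature extraction

def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping between two feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # length-normalized distance

def features(path):
    rate, signal = wav.read(path)
    return mfcc(signal, samplerate=rate)

# Compare an incoming recording against recorded command templates (placeholder files)
incoming = features("incoming.wav")
templates = {"lights_on": features("lights_on.wav"),
             "lights_off": features("lights_off.wav")}
best = min(templates, key=lambda name: dtw_distance(incoming, templates[name]))
print("closest command:", best)
```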

How to use speech recognition with/on video file?

How can I code a speech recognition engine (using the Microsoft Speech SDK) to "listen" to a video file and save the recognized text to a file?
This is very similar to this question and has a very similar answer. You need to separate out the audio portion, convert it to WAV format, and send it to an inproc recognizer.
However, it has the same problems that I described before (requires training, assumes a single voice, and assumes the microphone is close to the speaker). If that's the case, then you can likely get reasonably good results. If that's not the case (i.e., you're trying to transcribe a TV show, or worse, some sort of camcorder audio), then the results will likely be unsatisfactory.
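For the audio extraction step only, a minimal sketch using ffmpeg from Python is below; the 16 kHz mono PCM settings are an assumption and should be matched to your recognizer profile, and the file names are placeholders. The recognizer setup itself is covered in the linked question.

```python
import subprocess

def extract_wav(video_path, wav_path):
    """Pull the audio track out of a video and convert it to PCM WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                   # drop the video stream
         "-acodec", "pcm_s16le",  # 16-bit PCM
         "-ar", "16000",          # 16 kHz sample rate (assumed)
         "-ac", "1",              # mono
         wav_path],
        check=True,
    )

extract_wav("lecture.mp4", "lecture.wav")  # placeholder file names
```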