How to stream microphone audio from the browser to S3

I want to stream microphone audio from the web browser to AWS S3.
I got it working with
this.recorder = new window.MediaRecorder(...);
this.recorder.addEventListener('dataavailable', (e) => {
  this.chunks.push(e.data);
});
and then, when the user clicks stop, uploading the chunks (new Blob(this.chunks, { type: 'audio/wav' })) to AWS S3 as a multipart upload.
The problem is that if the recording is 2-3 hours long, the upload might take exceptionally long, and the user might close the browser before it finishes.
Is there a way we can stream the web audio directly to S3 while it's going on?
Things I tried but couldn't get a working example with:
Kinesis Video Streams: it looks like it's only for real-time streaming between multiple clients, and I would have to write my own client that then saves the stream to S3.
Kinesis Data Firehose: I thought about using it, but couldn't find any client-side data producer for the browser.
AWS Lex and AWS IVS: I even tried to find resources on these, but I think they are over-engineering for my use case.
Any help will be appreciated.

You can set the timeslice parameter when calling start() on the MediaRecorder. The MediaRecorder will then emit chunks which roughly match the length of the timeslice parameter.
You could upload those chunks using S3's multipart upload feature as you already mentioned.
Please note that you need a library like extendable-media-recorder if you want to record a WAV file since no browser supports that out of the box.
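
For illustration, here is a minimal sketch of that combination using the AWS SDK for JavaScript v3 in the browser. The region, bucket, and key are placeholders, and credentials are assumed to be available client-side (for example via a Cognito identity pool). Since every part of a multipart upload except the last must be at least 5 MiB, the recorder's chunks are buffered until they reach that threshold:

import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
} from '@aws-sdk/client-s3';

// Placeholders: swap in your own region, bucket, and key.
const s3 = new S3Client({ region: 'us-east-1' });
const Bucket = 'my-recordings-bucket';
const Key = 'recordings/session-1.webm';

async function streamToS3(recorder) {
  const { UploadId } = await s3.send(new CreateMultipartUploadCommand({ Bucket, Key }));
  const parts = [];
  const pending = [];
  let buffer = [];
  let buffered = 0;
  let nextPartNumber = 1;

  const flush = async (blobParts) => {
    const PartNumber = nextPartNumber++;
    const { ETag } = await s3.send(new UploadPartCommand({
      Bucket, Key, UploadId, PartNumber,
      Body: new Blob(blobParts),
    }));
    parts.push({ ETag, PartNumber });
  };

  recorder.addEventListener('dataavailable', (e) => {
    buffer.push(e.data);
    buffered += e.data.size;
    // Every part except the last must be at least 5 MiB.
    if (buffered >= 5 * 1024 * 1024) {
      const toSend = buffer;
      buffer = [];
      buffered = 0;
      pending.push(flush(toSend));
    }
  });

  recorder.addEventListener('stop', async () => {
    if (buffer.length > 0) pending.push(flush(buffer)); // the final part may be smaller
    await Promise.all(pending); // wait for all in-flight parts
    parts.sort((a, b) => a.PartNumber - b.PartNumber);
    await s3.send(new CompleteMultipartUploadCommand({
      Bucket, Key, UploadId,
      MultipartUpload: { Parts: parts },
    }));
  });

  recorder.start(1000); // emit a chunk roughly every second
}

An upside of this approach is that parts already uploaded survive a closed browser: the multipart upload can later be completed (or aborted) server-side, so at most the final unflushed chunk is lost.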

Related

AWS Lambda function, API Gateway and ffmpeg timeout issue

I have created a Lambda function that extracts the audio stream from a video file using ffmpeg. I have also configured API Gateway as a trigger, where I am passing the file to the Lambda function in the request body.
The Lambda function works perfectly well with small files, but bigger files need a bit more time and I then run into the API Gateway timeout, which to my understanding is capped at 29 seconds.
So when I trigger audio extraction from a bigger file, I hit this timeout and my API request fails to return any result, even though the transcoding keeps running in the background and the file is extracted. What is the best approach to handle cases where the Lambda function takes longer to execute?
I was thinking of starting the transcoding in the background and simply returning a JSON message saying that the transcoding might take a couple of minutes, depending on the input file duration. But if I try to push ffmpeg to the background, I get an error that the destination file doesn't exist.
os.system(f"{ffmpeg} -loglevel panic -nostdin -i {in_video} -vn -c:a aac -ar 48000 -b:a 192K {out_audio} 2> /dev/null &")
This is the ffmpeg command extracting the audio and transcoding it to AAC.
If I remove the 2> /dev/null & part of the command, it runs just fine, but if I keep it, I get an error:
"errorMessage": "[Errno 2] No such file or directory: 'output_audio.aac'"
"errorType": "FileNotFoundError"
So I was wondering: what is the preferred way to run processes in the background?
There are many options that can be considered.
But first, since you already have the whole flow working with Lambda behind API Gateway, you could use a Lambda function URL.
Function URLs are a good way to trigger a Lambda function via HTTPS. They support multiple authorization mechanisms, such as IAM.
The interesting point is the timeout. With a function URL, the maximum timeout you can have is 15 minutes, which is definitely better than the 29 seconds you get with API Gateway.
Function URLs are free of charge and can be enabled on an existing Lambda function.
Increasing the timeout might just push the problem back until you have a very big file to convert. In the long run it may be worth exploring other solutions, like uploading the file to S3 and using AWS Batch, or spinning up an EC2 instance to process the file. That would require more architecture design and implementation, though.
For longer processing, it is recommended to use asynchronous invocations, where the Lambda function is triggered and runs until completion and does not block the caller. One option to solve it would be to upload the file to S3, configure the Lambda function to react to the S3 event, download the file from S3, process it, and upload it to another S3 bucket after processing completes.
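
As an illustration of that S3-triggered variant, here is a minimal sketch of a Node.js Lambda handler. The output bucket name is a placeholder, and ffmpeg is assumed to be provided by a layer at /opt/bin/ffmpeg. Because ffmpeg runs synchronously in the foreground, the function only returns once the output file actually exists:

const { S3Client, GetObjectCommand, PutObjectCommand } = require('@aws-sdk/client-s3');
const { execFileSync } = require('child_process');
const { createWriteStream, readFileSync } = require('fs');
const { pipeline } = require('stream/promises');

const s3 = new S3Client({});

exports.handler = async (event) => {
  // S3 event notification: one record per uploaded object
  const record = event.Records[0].s3;
  const srcBucket = record.bucket.name;
  const srcKey = decodeURIComponent(record.object.key.replace(/\+/g, ' '));
  const inPath = '/tmp/input-video';
  const outPath = '/tmp/output_audio.aac';

  // Download the uploaded video into Lambda's /tmp storage
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: srcBucket, Key: srcKey }));
  await pipeline(Body, createWriteStream(inPath));

  // Extract and transcode the audio; unlike the backgrounded os.system call,
  // this blocks until ffmpeg has written the output file
  execFileSync('/opt/bin/ffmpeg', [
    '-loglevel', 'panic', '-nostdin', '-i', inPath,
    '-vn', '-c:a', 'aac', '-ar', '48000', '-b:a', '192K', outPath,
  ]);

  // Write the result to a second bucket (placeholder name)
  await s3.send(new PutObjectCommand({
    Bucket: 'my-extracted-audio-bucket',
    Key: srcKey.replace(/\.[^.]+$/, '.aac'),
    Body: readFileSync(outPath),
  }));
};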

Record real-time audio from the browser and stream it to Amazon S3 for storage

I want to record audio from my browser and live-stream it to Amazon S3 for storage. I cannot wait until the recording is finished, as the client can close the browser, so I would like to store what has been spoken so far (to within the nearest 5-10 seconds).
The issue is that multipart upload does not support chunks smaller than 5 MiB, and the audio files will mostly be less than 5 MiB.
Ideally I would like to send the chunks every 5 seconds, so that what was said in the last 5 seconds gets uploaded.
Can S3 support this? Or should I use another AWS service to hold the recording parts first? I have heard about Kinesis streams, but I'm not sure they can serve this purpose.

AWS Lambda audio feature extraction (not enough storage - layers)

We have IoT sensors that upload WAV files into an S3 bucket.
We want to be able to extract sound features from each file that gets uploaded (object-created event) with AWS Lambda.
For that we need:
the Python librosa or pyAudioAnalysis package + numpy and scipy (~240 MB unzipped)
ffmpeg (~70 MB unzipped)
As you can see, there is no way to put them all together in the same Lambda package (250 MB uncompressed max). And I'm getting an error when ffmpeg is not included in the layers while fetching the WAV file:
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe': 'ffprobe'
which is related to ffmpeg.
We are looking for implementation recommendations. We thought about:
Putting the ffmpeg binary in S3 and fetching it on every single invocation, without having to put it in the layers (if that is even possible).
Chaining two Lambdas: the first processes the input file through ffmpeg and puts the output file in another bucket; the second is then invoked and extracts features from the processed data (using SNS or a chaining mechanism) (if that is even possible).
Moving to EC2, where we will have a problem with concurrent invocations occurring when two files are uploaded at the same time.
There has to be an easier way; I'll be glad to hear other opinions before diving into the implementation.
Thank you all!
The scenario appears to be:
Files come in at random times
The files need to be processed, but not in real-time
The required libraries are too big for an AWS Lambda function
Suggested architecture:
Configure an Amazon S3 Event to send a message to an Amazon SQS queue when a file arrives
Configure an Amazon CloudWatch Event to trigger an AWS Lambda function at regular intervals (eg 1 hour)
The Lambda function checks whether there are messages in the queue
If there are messages, it launches an Amazon EC2 instance with a User Data script that installs and starts the processing system
The processing system will:
Grab a message from the queue
Process the message (without the limitations of Lambda)
Delete the message
If there are no messages left in the queue, it will terminate the EC2 instance
This can be very cost-effective because Amazon EC2 Linux instances are charged per-second. You can run several workers in parallel to process the messages (but be careful when writing the termination code, to ensure that all workers have finished processing messages). Or, if things are not time-critical, just choose the smallest usable Instance Type and single-thread it since larger instances cost more anyway (so they are no better from a cost-efficient standpoint).
Make sure you put monitoring in place to ensure that messages are being processed. Implement a Dead Letter Queue in Amazon SQS to catch messages that are failing to process and put a CloudWatch Alarm on the DLQ to notify you if things seem to be going wrong.
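
For illustration, a minimal sketch of the scheduled "queue watcher" Lambda in Node.js with the AWS SDK v3. The queue URL, AMI ID, and instance type are placeholders, and the User Data script is only stubbed out; it would install the libraries and run the worker loop described above:

const { SQSClient, GetQueueAttributesCommand } = require('@aws-sdk/client-sqs');
const { EC2Client, RunInstancesCommand } = require('@aws-sdk/client-ec2');

const sqs = new SQSClient({});
const ec2 = new EC2Client({});
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/audio-files'; // placeholder

exports.handler = async () => {
  // Check how many files are waiting to be processed
  const { Attributes } = await sqs.send(new GetQueueAttributesCommand({
    QueueUrl: QUEUE_URL,
    AttributeNames: ['ApproximateNumberOfMessages'],
  }));
  if (Number(Attributes.ApproximateNumberOfMessages) === 0) {
    return; // nothing to do until the next scheduled run
  }

  // User Data placeholder: install dependencies, poll the queue, process each
  // message, delete it, and shut down when the queue is empty
  const userData = Buffer.from([
    '#!/bin/bash',
    '# install ffmpeg, librosa, etc., then start the worker loop',
    '# the worker shuts the instance down once the queue is empty',
  ].join('\n')).toString('base64');

  await ec2.send(new RunInstancesCommand({
    ImageId: 'ami-0123456789abcdef0', // placeholder AMI
    InstanceType: 't3.micro', // smallest usable type for single-threaded work
    MinCount: 1,
    MaxCount: 1,
    UserData: userData,
    // a shutdown issued inside the instance terminates it rather than stopping it
    InstanceInitiatedShutdownBehavior: 'terminate',
  }));
};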

AWS Lex storage of audio

I’ve created a Lex bot that is integrated with an Amazon Connect workflow. The bot is invoked when the user calls the phone number specified in the Connect instance, and the bot itself invokes a Lambda function for initialisation & validation and fulfilment. The bot asks several questions that require the caller to provide simple responses. It all works OK, so far so good. I would like to add a final question that asks the caller for their comments. This could be any spoken text, including non-English words. I would like to be able to capture this Comment slot value as an audio stream or file, perhaps for storage in S3, with the goal of emailing a call centre administrator and providing the audio file as an MP3 or WAV attachment. Is there any way of doing this in Lex?
I’ve seen mention of ‘User utterance storage’ here: https://aws.amazon.com/blogs/contact-center/amazon-connect-with-amazon-lex-press-or-say-input/, but there’s no such setting visible in my Lex console.
I’m aware that Connect can be configured to store a recording in S3, but I need to be able to access the recording for the current phone call from within the Lambda function in order to attach it to an email. Any advice on how to achieve this, or suggestions for a workaround, would be much appreciated.
Thanks
Amazon Connect call recording can only record conversations once an agent accepts the call. Currently, Connect cannot record voice within contact flows, so getting the raw audio from Connect is not possible.
However, it looks like you can get it from Lex if you develop an external application (which could be a Lambda function) that fetches the utterances: https://docs.aws.amazon.com/lex/latest/dg/API_GetUtterancesView.html
I also do not see an option to enable or disable user utterance storage in Lex, but this makes me think that, by default, all utterances are recorded: https://docs.aws.amazon.com/lex/latest/dg/API_DeleteUtterances.html
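
For reference, a minimal sketch of calling that API with the AWS SDK for JavaScript v3 (the bot name and version are placeholders). Note that it returns aggregated utterance statistics rather than a per-call recording, so it may only partially cover the original use case:

const {
  LexModelBuildingServiceClient,
  GetUtterancesViewCommand,
} = require('@aws-sdk/client-lex-model-building-service');

const client = new LexModelBuildingServiceClient({});

async function listUtterances() {
  const response = await client.send(new GetUtterancesViewCommand({
    botName: 'MyConnectBot', // placeholder
    botVersions: ['$LATEST'],
    statusType: 'Detected', // utterances the bot recognized; 'Missed' for the rest
  }));
  console.log(JSON.stringify(response.utterances, null, 2));
}

listUtterances().catch(console.error);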

How to speed up Google Cloud Speech

I am using a microphone which records sound through a browser, converts it into a file, and sends the file to a Java server. Then my Java server sends the file to the Cloud Speech API and gives me the transcription. The problem is that the transcription takes very long (around 3.7 s for 2 s of dialog).
So I would like to speed up the transcription. The first thing to do is to stream the data (so the transcription starts at the beginning of the recording). The problem is that I don't really understand the API. For instance, to transcribe my audio stream from the source (browser/microphone) I would need some kind of JS API, but I can't find anything I can use in a browser (we can't use Node like this, can we?).
Otherwise I would need to stream my data from my JS to my Java (not sure how to do that without corrupting the data...) and then push it through streamingRecognizeFile from here: https://github.com/GoogleCloudPlatform/java-docs-samples/blob/master/speech/cloud-client/src/main/java/com/example/speech/Recognize.java
But it takes a file as the input, so how am I supposed to use it? I cannot really tell the system whether or not I have finished the recording... How will it know it has reached the end of the transcription?
I would like to create something in my web browser just like the Google demo here:
https://cloud.google.com/speech/
I think there is some fundamental stuff I do not understand about how to use the streaming API. If someone could explain a bit how I should proceed with this, it would be awesome.
Thank you.
Google "Speech-to-Text typically processes audio faster than real-time, processing 30 seconds of audio in 15 seconds on average" [1]. You can use Google APIs Explorer to test exactly how long your each request would take [2].
To speed up the transcribing you may try to add recognition metadata to your request [3]. You can provide phrase hints if you are aware of the context of the speech [4]. Or use enhanced models to use special set of machine learning models [5]. All these suggestions would improve the accuracy and might have effects on transcribing speed.
When using the streaming recognition, in config you can set singleUtterance option to True. This will detect if user pause speaking and cease the recognition. If not streaming request will continue until to the content limit, which is 1 minute of audio length for streaming request [6].
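
To illustrate, a minimal sketch of a streaming request with singleUtterance enabled, using the @google-cloud/speech Node.js client. The encoding, sample rate, and phrase hints are assumptions; match them to the audio your browser actually sends:

const speech = require('@google-cloud/speech');

const client = new speech.SpeechClient();

const recognizeStream = client
  .streamingRecognize({
    config: {
      encoding: 'LINEAR16', // assumption: raw 16-bit PCM
      sampleRateHertz: 16000, // assumption: 16 kHz input
      languageCode: 'en-US',
      speechContexts: [{ phrases: ['example phrase hint'] }], // optional hints [4]
    },
    interimResults: true, // receive partial transcripts while audio streams in
    singleUtterance: true, // stop recognizing once the speaker pauses
  })
  .on('data', (data) => {
    if (data.results[0] && data.results[0].alternatives[0]) {
      console.log(data.results[0].alternatives[0].transcript);
    }
  })
  .on('error', console.error);

// Pipe raw audio chunks into the stream (e.g. forwarded from the browser over
// a WebSocket), then call recognizeStream.end() when the microphone stops.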