Extract all AWS Transcribe results using boto3 - amazon-web-services

I have a couple hundred transcription results in AWS Transcribe, and I would like to get all of the transcribed text and store it in one file.
Is there any way to do this without clicking on each transcription result and copying and pasting the text?

You can do this via the AWS APIs.
For example, if you are using Python, you can use the boto3 SDK:
list_transcription_jobs() returns a list of transcription job summaries, each with its job name. The results are paginated, so keep passing the returned NextToken until there is none.
For each job, call get_transcription_job(), which provides the TranscriptFileUri, the location where the transcript is stored.
If the job wrote its output to your own bucket, use get_object() to download the file from Amazon S3. If it used the service-managed bucket, TranscriptFileUri is a temporary pre-signed URL that you can fetch with a plain HTTP GET.
Your program then needs to combine the content of each file into one file.
See how you go with that. If you run into any specific difficulties, post a new Question with the code and an explanation of the problem.
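The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function names (iter_completed_job_names, fetch_json, collect_transcripts, main) and the output file name are made up for illustration, and it assumes the jobs wrote to the service-managed bucket, so that TranscriptFileUri can be fetched over HTTPS. For jobs that wrote to your own bucket, pass a different fetch function that calls get_object() instead.

```python
import json
import urllib.request


def iter_completed_job_names(transcribe_client):
    """Yield the name of every COMPLETED job, following NextToken pagination."""
    kwargs = {"Status": "COMPLETED"}
    while True:
        page = transcribe_client.list_transcription_jobs(**kwargs)
        for summary in page.get("TranscriptionJobSummaries", []):
            yield summary["TranscriptionJobName"]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def fetch_json(url):
    """Download and parse a transcript file.

    Assumption: the URL is a pre-signed one (service-managed output bucket).
    """
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def collect_transcripts(transcribe_client, fetch=fetch_json):
    """Return one string containing the transcript text of every completed job."""
    parts = []
    for name in iter_completed_job_names(transcribe_client):
        job = transcribe_client.get_transcription_job(TranscriptionJobName=name)
        uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
        data = fetch(uri)
        text = data["results"]["transcripts"][0]["transcript"]
        # A simple header line keeps the jobs apart in the combined file.
        parts.append(f"=== {name} ===\n{text}\n")
    return "\n".join(parts)


def main(output_path="all_transcripts.txt"):
    """Hypothetical entry point; the output file name is an assumption."""
    import boto3  # imported here so the helpers above work without boto3

    client = boto3.client("transcribe")
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(collect_transcripts(client))
```

Calling main() with credentials and a default region configured would write every completed job's text to all_transcripts.txt. Note that the pre-signed URLs expire quickly, so each transcript is fetched right after its get_transcription_job() call.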

I put an example on GitHub that shows how to:
run an AWS Transcribe job,
use the Requests package to get the output,
write output to the console.
You ought to be able to refit it pretty easily for your purposes. Here's some of the code, but it'll make more sense if you check out the full example:
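import time

import requests

# start_job, get_job, TranscribeCompleteWaiter, transcribe_client, bucket_name,
# and media_object_key are assumed to come from the full GitHub example.
job_name_simple = f'Jabber-{time.time_ns()}'
print(f"Starting transcription job {job_name_simple}.")
start_job(
    job_name_simple, f's3://{bucket_name}/{media_object_key}', 'mp3', 'en-US',
    transcribe_client)
# Wait for the job to complete before asking for its output.
transcribe_waiter = TranscribeCompleteWaiter(transcribe_client)
transcribe_waiter.wait(job_name_simple)
job_simple = get_job(job_name_simple, transcribe_client)
# Download the transcript JSON from the URI that Transcribe returns.
transcript_simple = requests.get(
    job_simple['Transcript']['TranscriptFileUri']).json()
print(f"Transcript for job {transcript_simple['jobName']}:")
print(transcript_simple['results']['transcripts'][0]['transcript'])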

Related

Custom vocabulary with AWS Transcribe - Japanese language in AWS

While using AWS Transcribe, I want to create a custom vocabulary, but I am not able to create one with Japanese words, nor can I find any sample custom vocabulary phrases file.
I tried the character codes from the table and passing the Japanese words directly as an array of strings. Neither worked.
I got the error "The vocabulary that you’re trying to create contains invalid characters or incorrectly formatted terms. See the developer guide for more information."
Here is my code:
response = transcribe.create_vocabulary(
    VocabularyName='vocab2',
    LanguageCode='ja-JP',
    Phrases=["0x3005 0x3005"]
)
Any leads would be appreciated!
Upload the file to S3 first, and forget the upload file button.
AWS provides two ways to create a custom vocabulary in the console: upload a file or fetch it from S3. For the same file, creation failed when I uploaded it directly but succeeded when I uploaded it to S3 first. I guess it's a bug in AWS, but we have to live with it.
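The same S3-first route should also be available from boto3, since create_vocabulary() accepts a VocabularyFileUri in place of Phrases. The following is only a sketch under that assumption: the helper name create_vocabulary_from_s3 and all of its argument values are hypothetical, and it has not been checked against a Japanese vocabulary file.

```python
def create_vocabulary_from_s3(s3_client, transcribe_client, local_path,
                              bucket, key, vocabulary_name,
                              language_code="ja-JP"):
    """Upload a vocabulary file to S3, then create the vocabulary from it.

    Assumption: local_path points to a vocabulary file that is already in the
    format the Transcribe developer guide describes.
    """
    # Step 1: put the file in S3 rather than sending the phrases inline.
    s3_client.upload_file(local_path, bucket, key)
    # Step 2: point Transcribe at the uploaded file.
    return transcribe_client.create_vocabulary(
        VocabularyName=vocabulary_name,
        LanguageCode=language_code,
        VocabularyFileUri=f"s3://{bucket}/{key}",
    )
```

With real clients this would be called as create_vocabulary_from_s3(boto3.client("s3"), boto3.client("transcribe"), "vocab.txt", "my-bucket", "vocab/vocab.txt", "vocab2"), where the file, bucket, and key names are placeholders. The bucket has to be in the same Region as the Transcribe call.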

How to set up multiple automated workflows on AWS Glue

We're trying to use AWS Glue for ETL operations in our Node.js project. The workflow looks like this:
user uploads a CSV file
data is transformed from XYZ format to ABC format (mapping and renaming fields)
the transformed CSV file is downloaded to the local system
Note that this flow should happen programmatically (crawlers and job triggers should be created programmatically, not through the console). I don't know why the documentation and other articles always show how to create crawlers and jobs from the Glue console.
I believe we have to create Lambda functions and triggers, but I'm not quite sure how to achieve this end-to-end flow. Can anyone please help me? Thanks.

SageMaker, get Spark DataFrame from data image URL on S3

I am trying to obtain a Spark DataFrame that contains the path and image data for every image in my dataset. The data is stored as follows:
folder/image_category/image_n.jpg
I worked in a local Jupyter notebook and had no problem using the following code:
dataframe = spark.read.format("image").load(path)
I need to do the same exercise using AWS SageMaker and S3. I created a bucket following the same pattern:
s3://my_bucket/folder/image_category/image_n.jpg
I've tried a lot of possible solutions I found online, based on boto3, s3fs, and other tools, but unfortunately I am still unable to make it work (and I am starting to lose faith...).
Would anyone have something reliable I could base my work on?

Send S3 document to Textract using Go

I'm trying to use Go to send objects in an S3 bucket to Textract and collect the response.
I'm using the AWS Go SDK package and am able to connect to my S3 bucket and list all the objects contained within. So far so good. I now need to be able to send one of those objects (a .pdf file) to Textract and collect the response(s).
The AWS Go SDK content for interacting with Textract seems to be quite extensive, but I cannot find a good example of how to do this.
I would be very grateful for a sample or advice on how to do this.
To start a job, call StartDocumentTextDetection, using a DocumentLocation to specify the file and an SNS topic where Textract will publish a notification when it has finished processing your job.
You now have two possibilities:
Subscribe to the SNS topic, and when you receive a message, retrieve the result.
Create a Lambda function triggered by the SNS topic, which retrieves the result.
The second option is better IMO because it uses less computation time (nothing runs until the job has finished).
To retrieve the result, use GetDocumentTextDetection.
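The question asks about Go, but the two operations have the same names and request shapes in every AWS SDK, so the call sequence is sketched here in Python (boto3), the language used for the other code on this page. Treat it as an illustration of the flow rather than a drop-in answer: the helper names start_text_detection and get_detected_lines are hypothetical, and the SNS topic ARN and IAM role ARN are values you would have to supply.

```python
def start_text_detection(textract_client, bucket, key, sns_topic_arn, role_arn):
    """Start an asynchronous text detection job and return its JobId."""
    response = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract publishes a completion message to this topic using the role.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def get_detected_lines(textract_client, job_id):
    """Return the text of every LINE block, following NextToken pagination.

    Meant to be called after the SNS notification says the job has finished.
    """
    lines = []
    kwargs = {"JobId": job_id}
    while True:
        page = textract_client.get_document_text_detection(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"Job {job_id} is in state {page['JobStatus']}")
        for block in page.get("Blocks", []):
            if block["BlockType"] == "LINE":
                lines.append(block["Text"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return lines
```

In the Lambda variant, the function triggered by the SNS topic would read the JobId from the notification message and pass it to get_detected_lines().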
If anyone else reaches this site searching for an answer:
I understood the documentation as saying that I could just call the StartDocumentAnalysis function through the Textract SDK, but what was missing is that you need to create a new Session first and make the calls based on that session:
https://docs.aws.amazon.com/sdk-for-go/api/service/textract/#New

CSV Export using Api Gateway and Lambda

What I would like to do:
I would like to have a URL that returns a CSV file to the caller, which is essentially an export of data. I would like this to remain a serverless solution.
What I have done:
I have created an AWS API Gateway with the URL I want. I have created a Lambda that queries the database and creates a CSV string of that data. That data is placed in a JSON object and returned. API Gateway then gets the CSV data from the JSON object and returns CSV to the caller with appropriate headers to indicate that it is a CSV and an attachment. Testing from the browser, I get the download automatically, just like I intended.
The problem I see:
This works well until there is a sizable amount of data, at which point I start getting "body size is too long".
My attempts to resolve:
I did some googling around, and I see others have had similar issues. In one solution, I saw that they return a link to the file that they created. This solution seems viable for them because they had a server. For my serverless architecture it seems to be a little trickier. I could store the file in S3, but then I would have to return a link to S3. That seems like it could work, but it doesn't feel right, like I'm missing a configuration option. It also feels like I'm exposing the implementation by returning the S3 URLs.
I have looked around for tutorials and examples of people doing similar things, and I haven't found any.
My Questions:
Is there a way to do this?
Is there another solution that I don't know of?
How do I return a larger file, in this case a CSV, from API Gateway?
There is a limit of 6 MB for AWS Lambda response payloads. If the files you need to serve are larger than that, you won't be able to serve them directly from Lambda.
Using S3 to store and serve the files is the standard way of doing something like this. I would leave the S3 bucket private and generate S3 Pre-signed URLs in the Lambda function. That will limit the time that the CSV file is available for download, and it will prevent people from being able to guess the URLs of files you are serving. You would use an S3 Lifecycle Policy to archive or delete the files after a period of time.
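A minimal sketch of that approach in Python (boto3) might look like the following. It assumes the API uses Lambda proxy integration, so the function can return a redirect itself. The helper name export_csv_via_s3, the export.csv download name, and the 5-minute expiry are illustrative choices, not anything prescribed by AWS.

```python
def export_csv_via_s3(s3_client, bucket, key, csv_text, expires_in=300):
    """Store the CSV in a private bucket and redirect the caller to a
    short-lived pre-signed download URL.

    Assumption: the return value is an API Gateway Lambda proxy response.
    """
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=csv_text.encode("utf-8"),
        ContentType="text/csv",
    )
    url = s3_client.generate_presigned_url(
        "get_object",
        Params={
            "Bucket": bucket,
            "Key": key,
            # Makes the browser download the file instead of displaying it.
            "ResponseContentDisposition": 'attachment; filename="export.csv"',
        },
        ExpiresIn=expires_in,  # seconds the URL stays valid
    )
    # The body never passes through Lambda, so the 6 MB limit does not apply.
    return {"statusCode": 303, "headers": {"Location": url}, "body": ""}
```

Inside the Lambda handler you would call it with boto3.client("s3"), your bucket name, and a unique key per export. The caller still hits the same API Gateway URL and the browser follows the redirect, so the S3 location only appears as a temporary signed link.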