I am using Amazon Connect and storing the call recording in one region.
I have Amazon Transcribe in another region, and I followed "How to create an audio transcript with Amazon Transcribe | AWS" to convert the audio file to a transcript. The steps seem very simple.
However, when I click Create in Amazon Transcribe (to convert the audio recording generated by Connect to a transcript), it throws this error:
The S3 URI that you provided points to the incorrect region. Make sure that the bucket is in the XXX-XXX region and try your request again.
where XXX-XXX is the region of Amazon Transcribe. In other words, Transcribe expects the recording (audio file) to be in its own region, which in my case it isn't, because Connect stores it in the other region.
But:
Is there a way to expose the S3 bucket with the audio file so that it can be accessed from other regions too?
If not, what is the other way to solve this?
"Is there a way to expose the S3 bucket...?"
As it turns out, exposing the bucket isn't the problem. Buckets are always physically located in exactly one region, but are accessible from all regions as well as from outside AWS if the requester is in possession of appropriate and authorized credentials and no policy explicitly denies the access.
But nothing about the bucket can be changed in S3 to fix the error you're getting, because the problem is somewhere else -- not in S3.
From the API data types in the Amazon Transcribe Developer Guide:
MediaFileUri
The S3 location of the input media file. The URI must be in the same region as the [Amazon Transcribe] API endpoint that you are calling.
https://docs.aws.amazon.com/transcribe/latest/dg/API_Media.html
Transcribe was designed not to reach across regional boundaries to access media in a bucket, and stops you if you try, with the message you're getting.
Why does it work that way? Possibly performance/efficiency. Possibly security. Possibly to help unwitting users avoid unexpected billing charges for cross-region data transport. Possibly other reasons, maybe in combination with the above.
Possible solutions:
Use Connect, an S3 bucket, and Transcribe, all in the same region; or
Use two buckets and S3 Cross-Region Replication to replicate files from the Connect region to the Transcribe region. Be aware that this can have significant costs at scale, since S3 is moving data across regional boundaries. Be further aware that replication is fast but not instantaneous, so calls to Transcribe might fail to find media that has arrived in the first bucket but not yet in the second; or
Use two buckets, and make a call in your code to S3's PUT+Copy API to copy the file to the second bucket in the Transcribe region, before calling Transcribe.
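For the third option, here is a minimal boto3 sketch. The bucket names, regions, and key are placeholders, and it assumes the recording is a WAV file small enough for a single CopyObject call (under 5 GB):

    import boto3

    # Placeholders: replace with your own buckets, regions, and recording key.
    SOURCE_BUCKET = "connect-recordings-us-east-1"
    DEST_BUCKET = "transcribe-input-eu-west-1"
    TRANSCRIBE_REGION = "eu-west-1"
    KEY = "recordings/example-call.wav"

    # The copy request is sent to the destination bucket's region; S3 pulls the
    # object from the source bucket server-side, so nothing passes through
    # your machine.
    s3 = boto3.client("s3", region_name=TRANSCRIBE_REGION)
    s3.copy_object(
        Bucket=DEST_BUCKET,
        Key=KEY,
        CopySource={"Bucket": SOURCE_BUCKET, "Key": KEY},
    )

    # The media URI is now in the same region as the Transcribe endpoint.
    transcribe = boto3.client("transcribe", region_name=TRANSCRIBE_REGION)
    transcribe.start_transcription_job(
        TranscriptionJobName="example-call-transcript",
        LanguageCode="en-US",
        MediaFormat="wav",
        Media={"MediaFileUri": f"s3://{DEST_BUCKET}/{KEY}"},
    )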
Related
I use the Illumina BaseSpace service to do high-throughput sequencing secondary analyses. This service uses AWS servers, and therefore all files are stored on S3.
I would like to transfer the files (the analysis results) from BaseSpace to my own AWS S3 account. I would like to know the best strategy to make this go quickly, knowing that in the end it boils down to copying files from an S3 bucket belonging to Illumina to an S3 bucket belonging to me.
The solutions I'm thinking of:
use the BaseSpace CLI tool to copy the files to our on-premises servers, then transfer them back to AWS
use the same tool from an EC2 instance.
use the Illumina API to get a pre-signed download URL (but then how can I use this URL to download the file directly into my S3 bucket?).
If I use an EC2 instance, what kind of instance do you recommend so that it has enough resources without being oversized (and therefore paying for capacity I don't need)?
Thanks in advance,
Quentin
I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying to sync it all at once.
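If you would rather script it yourself than use the CLI, here is a rough boto3 sketch of the same idea: list the keys under a prefix and issue the GetObject calls from a thread pool. The bucket, prefix, destination directory, and worker count are placeholders:

    import os
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    # Placeholders: adjust bucket, prefix, destination, and worker count.
    BUCKET = "my-huge-bucket"
    PREFIX = "2023/"
    DEST_DIR = "downloads"

    s3 = boto3.client("s3")  # boto3 clients are safe to share across threads

    def download(key):
        if key.endswith("/"):
            return  # skip zero-byte "folder" placeholder objects
        local_path = os.path.join(DEST_DIR, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)

    # List keys page by page and hand them to a thread pool so that many
    # GetObject requests are in flight at the same time.
    paginator = s3.get_paginator("list_objects_v2")
    with ThreadPoolExecutor(max_workers=32) as pool:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                pool.submit(download, obj["Key"])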
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have it shipped to your location. You then connect it to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3).
I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all of the options are workarounds that depend on your use case.
You have a few options:
Tag each S3 object with its last-accessed date (e.g. 2018-10-24). First turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on a CloudWatch Event fired by each Get event (a rough sketch follows this list). Then create a function that runs on a scheduled CloudWatch Event to delete all objects whose date tag is older than your chosen cutoff.
Query CloudTrail logs: write a custom function to query the last access times from object-level CloudTrail logs. This could be done with Athena, or with a direct query against the logs in S3.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
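As a rough illustration of the first option, a Lambda function along these lines could stamp a last-accessed tag on each Get event. The tag name and the exact event shape are assumptions; verify them against a real CloudTrail data event delivered through CloudWatch Events / EventBridge:

    import datetime

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Assumed event shape: a CloudTrail S3 data event delivered through
        # CloudWatch Events / EventBridge. Verify against a real event.
        detail = event.get("detail", {})
        if detail.get("eventName") != "GetObject":
            return

        bucket = detail["requestParameters"]["bucketName"]
        key = detail["requestParameters"]["key"]
        today = datetime.date.today().isoformat()  # e.g. "2018-10-24"

        # put_object_tagging replaces the whole tag set, so keep any other tags.
        existing = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
        tags = [t for t in existing if t["Key"] != "last-accessed"]
        tags.append({"Key": "last-accessed", "Value": today})
        s3.put_object_tagging(Bucket=bucket, Key=key, Tagging={"TagSet": tags})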
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using the files, you can probably delete them. But you are the only person who would know whether they are necessary.
There is a recent AWS blog post that describes an interesting, cost-optimized approach to this problem.
Here is the description from the AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket that should be expired, using the following logic (a sketch of the Batch Operations call follows the blog link below):
Capture the number of days (x) configuration from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
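To make step 4 above a bit more concrete, here is a hedged sketch of the S3 Batch Operations call the Lambda function might make, assuming the Athena step has already written a CSV manifest. Every name and ARN below is a placeholder:

    import boto3

    # Every name and ARN below is a placeholder.
    ACCOUNT_ID = "111122223333"
    MANIFEST_BUCKET = "my-inventory-bucket"
    MANIFEST_KEY = "manifests/objects-to-expire.csv"
    BATCH_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/s3-batch-tagging-role"

    s3 = boto3.client("s3")
    s3control = boto3.client("s3control")

    # The manifest's ETag is required when creating the job.
    etag = s3.head_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY)["ETag"].strip('"')

    # Tag every object listed in the manifest with delete=True so that the
    # lifecycle rule filtering on that tag can expire them.
    s3control.create_job(
        AccountId=ACCOUNT_ID,
        ConfirmationRequired=False,
        Priority=10,
        RoleArn=BATCH_ROLE_ARN,
        Operation={
            "S3PutObjectTagging": {
                "TagSet": [{"Key": "delete", "Value": "True"}],
            },
        },
        Manifest={
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {
                "ObjectArn": f"arn:aws:s3:::{MANIFEST_BUCKET}/{MANIFEST_KEY}",
                "ETag": etag,
            },
        },
        Report={
            "Enabled": True,
            "Bucket": f"arn:aws:s3:::{MANIFEST_BUCKET}",
            "Format": "Report_CSV_20180820",
            "Prefix": "batch-reports",
            "ReportScope": "FailedTasksOnly",
        },
    )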
So I have an S3 bucket full of over 200GB of different videos. It would be very time consuming to manually set up jobs to transcode all of these.
How can I use either the web UI or aws cli to transcode all videos in this bucket at 1080p, replicating the same output path in a different bucket?
I also want any new videos added to the original bucket to be transcoded automatically immediately after upload.
I've seen some posts about Lambda functions, but I don't know anything about this.
A Lambda function is essentially a piece of code that runs on demand in a temporary, managed compute environment.
The sample code in your link is what you are looking for as a solution. You can call your lambda function once for each item in the S3 bucket and kick off concurrent processing of the entire bucket.
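The linked sample isn't reproduced here, but the general shape is a Lambda function triggered by the bucket's ObjectCreated event that hands each new key to a transcoding service. Below is a rough sketch using Elastic Transcoder, where the pipeline (which defines the output bucket) and the 1080p preset ID are placeholders you would need to create and verify yourself:

    import urllib.parse

    import boto3

    # Placeholders: create the pipeline (which defines the output bucket) and
    # pick the 1080p preset in Elastic Transcoder, then put their IDs here.
    PIPELINE_ID = "1111111111111-aaaaaa"
    PRESET_1080P = "1351620000001-000001"  # verify the preset ID in your account

    transcoder = boto3.client("elastictranscoder")

    def handler(event, context):
        # Triggered by the S3 ObjectCreated event on the source bucket.
        for record in event["Records"]:
            # S3 event keys are URL-encoded (spaces arrive as '+').
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Reuse the same key in the output bucket configured on the
            # pipeline, which replicates the original path structure.
            transcoder.create_job(
                PipelineId=PIPELINE_ID,
                Input={"Key": key},
                Output={"Key": key, "PresetId": PRESET_1080P},
            )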
Using S3 cross-region replication, if a user downloads http://mybucket.s3.amazonaws.com/myobject , will it automatically be downloaded from the closest region, like CloudFront? So there is no need to specify the region in the URL, like http://mybucket.s3-[region].amazonaws.com/myobject ?
http://aws.amazon.com/about-aws/whats-new/2015/03/amazon-s3-introduces-cross-region-replication/
Bucket names are global, and cross-region replication involves copying objects to a different bucket.
In other words, having the bucket 'example' in both us-west-1 and us-east-1 is not valid, as there can only be one bucket named 'example' globally.
That's implied in the announcement post: Mr. Barr uses buckets named jbarr and jbarr-replication.
Using S3 cross-Region replication will put your object into two (or more) buckets in two different Regions.
If you want a single access point that will choose the closest available bucket, then you want to use Multi-Region Access Points (MRAP).
MRAP makes use of Global Accelerator and puts bucket requests onto the AWS backbone at the closest edge location, which provides a faster, more reliable connection to the actual bucket. Global Accelerator also chooses the closest available bucket. If a bucket is not available, the request is served from the other bucket, providing automatic failover.
You can also configure it in an active/passive configuration, always serving from one bucket until you initiate a failover.
The MRAP page in the AWS console even shows you a graphical representation of your replication rules.
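As a rough sketch, once the MRAP exists you use its ARN (built from the alias shown in the console) wherever a bucket name would normally go. The ARN below is a placeholder, and the client needs SigV4A support, which boto3 gets from the AWS CRT extra:

    import boto3

    # Placeholder MRAP ARN: your account ID plus the alias ("xxxx.mrap") shown
    # on the MRAP page in the console. Requests to an MRAP are signed with
    # SigV4A, so install boto3 with CRT support first:  pip install "boto3[crt]"
    MRAP_ARN = "arn:aws:s3::111122223333:accesspoint/mfzwi23gnjvgw.mrap"

    s3 = boto3.client("s3")

    # The MRAP ARN goes wherever a bucket name would normally go; the request
    # is routed to the closest available replicated bucket.
    response = s3.get_object(Bucket=MRAP_ARN, Key="myobject")
    print(response["ContentLength"])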
The S3 namespace is global, so you don't need to specify the region in the URL; bucket names have to be globally unique.
When you create a bucket you do have to choose a region, but that doesn't mean you have to put the region name in the URL when you access it. To speed up access from other regions, there are several options, such as:
-- Amazon S3 Transfer Acceleration with the same bucket name (see the sketch below).
-- Or set up another bucket with a different name in a different region, enable cross-region replication, and create a CloudFront origin group with the two buckets as origins.
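As a rough boto3 sketch of the Transfer Acceleration option: enable it once on the bucket, then have clients opt in to the accelerate endpoint. The bucket name and object key are placeholders:

    import boto3
    from botocore.config import Config

    BUCKET = "mybucket"  # placeholder

    # One-time setup: enable Transfer Acceleration on the bucket.
    boto3.client("s3").put_bucket_accelerate_configuration(
        Bucket=BUCKET,
        AccelerateConfiguration={"Status": "Enabled"},
    )

    # Clients then opt in to the accelerate endpoint
    # (mybucket.s3-accelerate.amazonaws.com) instead of the regional endpoint.
    s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
    s3.download_file(BUCKET, "myobject", "myobject")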