I use the Illumina BaseSpace service to run high-throughput sequencing secondary analyses. This service runs on AWS servers, so all files are stored on S3.
I would like to transfer the files (analysis results) from BaseSpace to my own AWS S3 account. What would be the best strategy to make this go quickly, given that in the end it boils down to copying files from an S3 bucket belonging to Illumina to an S3 bucket belonging to me?
The solutions I'm thinking of:
use the BaseSpace CLI tool to copy the files to our on-premises servers, then transfer them back to AWS;
use this tool from an EC2 instance;
use the Illumina API to get a pre-signed download URL (but then how can I use this URL to download the file directly into my S3 bucket? See my rough attempt below).
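Something like the following is what I imagine for the third option (untested; the URL and bucket/key are placeholders), piping the pre-signed URL straight into aws s3 cp, which can read from stdin:
curl -L "https://basespace-presigned-url.example/results/file.fastq.gz" | aws s3 cp - s3://my-bucket/results/file.fastq.gz
(For very large files I understand the CLI may also need --expected-size so it can size the multipart upload.)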
If I use an EC2 instance, what kind of instance do you recommend, so that I have enough resources without over-provisioning (and therefore spending money for nothing)?
Thanks in advance,
Quentin
Can you assign a public IP or URL to an S3 Glacier Vault? I want to use it for automatic backups.
I realize that I can upload to an S3 bucket and then use lifecycle rules to move it over to Glacier, but I'm asking if I can skip that step entirely and upload directly to a Glacier Vault.
Thanks for any tips!
When originally released, Amazon Glacier was only accessible directly (rather than via Amazon S3). It offers low-cost storage, but it is only accessible via API (not much can be done in the Management Console) and it is very slow because all requests are processed as jobs. This even makes it slow to list the contents of a Vault.
You can certainly access Amazon Glacier directly, but it would be via API calls to the Glacier Endpoint. I would recommend that you use tools such as Cloudberry Backup that know how to talk directly to Glacier.
However, a much simpler way to use Glacier is to store files in Amazon S3 and then select the Glacier or Glacier Deep Archive storage class. This allows use of the S3 interface and the Deep Archive storage class is actually cheaper than Glacier itself! You can also use the AWS CLI to upload backups, which is much easier than working with a Glacier Vault.
By the way, if you are purely wanting to use S3/Glacier for "backups", I would highly recommend using traditional backup tools that know how to use S3. They are much more reliable, and offer more capabilities, than doing it yourself. For example, they can keep multiple versions of files and can retain deleted files for a period of time to allow recovery.
Specify --storage-class GLACIER if you are using the aws s3 cp command of the CLI. Use upload-archive if you are using the aws glacier command of the CLI.
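For example, something along these lines (bucket, vault, and file names are placeholders):
aws s3 cp backup.tar.gz s3://my-backup-bucket/backup.tar.gz --storage-class GLACIER
aws glacier upload-archive --vault-name my-backup-vault --account-id - --body backup.tar.gz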
I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
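For example, you can raise the CLI's parallelism and then sync prefix by prefix (the value, bucket name, and prefix below are only illustrative):
aws configure set default.s3.max_concurrent_requests 50
aws s3 sync s3://my-bucket/2021/ ./2021/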
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have it shipped to your location. You then connect it to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3).
I have the following currently created in the AWS us-east-1 region and, per the request of our AWS architect, I need to move it all to us-east-2 completely and continue developing in us-east-2 only. What are the easiest options, requiring the least work and coding (as this is a one-time deal), to move:
S3 bucket with a ton of folders and files.
Lambda function.
AWS Glue database with a ton of crawlers.
AWS Athena with a ton of tables.
Thank you so much for taking a look at my little challenge :)
There is no easy answer for your situation. There are no simple ways to migrate resources between regions.
Amazon S3 bucket
You can certainly create another bucket and then copy the content across, either using the AWS Command-Line Interface (CLI) aws s3 sync command or, for a huge number of files, using S3DistCp running under Amazon EMR.
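A minimal sketch of the sync approach (bucket names are placeholders):
aws s3 sync s3://my-old-bucket s3://my-new-bucket --source-region us-east-1 --region us-east-2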
If there are previous Versions of objects in the bucket, it's not easy to replicate them. Hopefully you have Versioning turned off.
Also, it isn't easy to get the same bucket name in the other region. Hopefully you will be allowed to use a different bucket name. Otherwise, you'd need to move the data elsewhere, delete the bucket, wait a day, create the same-named bucket in another region, then copy the data across.
AWS Lambda function
If it's just a small number of functions, you could simply recreate them in the other region. If the code is stored in an Amazon S3 bucket, you'll need to move the code to a bucket in the new region.
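A rough CLI sketch of recreating one function (the function name, runtime, role ARN, and file names are placeholders; adjust them to your actual configuration):
aws lambda get-function --function-name my-function --region us-east-1
# download the deployment package from the Code.Location URL in the output, then:
aws lambda create-function --function-name my-function --region us-east-2 \
    --runtime python3.9 --role arn:aws:iam::123456789012:role/my-lambda-role \
    --handler lambda_function.lambda_handler --zip-file fileb://my-function.zip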
AWS Glue
Not sure about this one. If you're moving the data files, you'll need to recreate the database anyway. You'll probably need to create new jobs in the new region (but I'm not that familiar with Glue).
Amazon Athena
If your data is moving, you'll need to recreate the tables anyway. You can use the Athena interface to show the DDL commands required to recreate a table. Then, run those commands in the new region, pointing to the new S3 bucket.
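You can also retrieve the DDL from the CLI rather than the console if that is easier; a sketch (the database, table, and results bucket names are placeholders):
aws athena start-query-execution \
    --query-string "SHOW CREATE TABLE my_table" \
    --query-execution-context Database=my_database \
    --result-configuration OutputLocation=s3://my-athena-query-results/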
AWS Support
If this is an important system for your company, it would be prudent to subscribe to AWS Support. They can provide advice and guidance for these types of situations, and might even have some tools that can assist with a migration. The cost of support would be minor compared to the savings in your time and effort.
Is it possible for you to create CloudFormation stacks (from existing resources) using the console, then copy the contents of those stacks and run them in the other region (replacing values where they need to be)?
See this link: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/resource-import-new-stack.html
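If you go down the resource-import route, the CLI equivalent looks roughly like this (stack, change-set, and file names are placeholders; the template and the resources-to-import file still have to be written by hand):
aws cloudformation create-change-set \
    --stack-name my-imported-stack \
    --change-set-name my-import-changeset \
    --change-set-type IMPORT \
    --resources-to-import file://resources-to-import.json \
    --template-body file://template.yaml
aws cloudformation execute-change-set \
    --change-set-name my-import-changeset \
    --stack-name my-imported-stack
The same template can then be deployed in the other region, for example with aws cloudformation deploy --region us-east-2.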
Can we use the Amazon Signature Version 4 API (https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-post-example.html)? I don't think it's useful for my purpose, though.
What I want is this: there are some files on websites, and I want those files to be uploaded to an Amazon S3 bucket without downloading them to my local computer first. The current scenario is like this:
The third-party website provides a download link for the file -> download the file to my computer -> upload to Amazon S3
Can we eliminate the middle step so it becomes:
The third-party website provides a download link for the file -> upload to Amazon S3
You can't avoid the "download" part unless that "other website" is willing to do the upload for you.
But you can take your local network connection out of the equation and do the download/upload from an EC2 instance in the same region as your bucket:
$ wget https://example.com/example.txt
$ aws s3 cp example.txt s3://mybucket
Your EC2 instance should have a role attached that allows it to interact with S3.
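For example, a policy can be attached to the instance's role along these lines (the role name is a placeholder, and AmazonS3FullAccess is broader than strictly necessary):
$ aws iam attach-role-policy --role-name my-ec2-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess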
You can do the same thing with Lambda, but you'll be limited by the size of the filesystem of the Lambda runtime.
The third-party website provides a download link for the file -> upload to Amazon S3
If the 3rd party doesn't push the content "itself", you will need an actor/service/logic which downloads and uploads the data.
The "logic" means some compute resources - c2, ecs, lambda, batch.. it's the same download/upload process, just the traffic doesn't need to go through your computer. Every option has its pros and cons (e. g. lambda may be the cheapest for occasional tasks, but it has its limits)
You did not specify what initiates the upload (a regular scan? an event? on demand?); that may affect your options too.
I have a requirement to transfer data (one time) from on-premises to AWS S3. The data size is around 1 TB. I was looking into AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
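Since the question mentions security: CLI transfers already go over HTTPS, and you can additionally request server-side encryption at rest with a flag, for example:
aws s3 sync c:/MyDir s3://my-bucket/ --sse AES256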
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that it needs to be encrypted and the data size is 1 TB), then I would suggest you stick to something plain and simple. S3 supports an object size of up to 5 TB, so you wouldn't run into trouble. I don't know if your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted, you should be fine; if you're worried, you can encrypt your files beforehand and they will also be encrypted while stored (if it's a backup or similar). To get to the point: you can use API tools for the transfer, or file-explorer-type tools that also have S3 connectivity (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage/transfer depends on how frequently you need the data; if it's just a backup or a just-in-case copy, archiving to Glacier is much cheaper.
1 TB is large, but it's not so large that it'll take you weeks to get your data onto S3. However, if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS, and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways.
Using the AWS CLI, we can copy files from local storage to S3.
AWS Transfer using FTP or SFTP (AWS Transfer for SFTP); please refer to the AWS documentation for setup details.
There are tools like the CloudBerry client, which have a UI.
You can use the AWS DataSync tool.