How to Transfer Data from my PC to S3 on AWS

Can anyone suggest documentation for transferring data from my personal computer to S3 on AWS? I have about 50GB of data to transfer, and I later want to use Spark to analyze the data.

There are many free ways to upload files to S3, including:
use the AWS Management Console: go into S3, navigate to the S3 bucket, then use Actions | Upload
use s3cmd
use the AWS CLI
use Cloudberry Explorer

To upload from your local machine to S3, you can use tools like Cyberduck. Sometimes large uploads get interrupted, and tools like Cyberduck can resume an aborted upload.
If you already have data on an Amazon EC2 instance, then s3cmd works pretty well.
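A minimal sketch of the AWS CLI route, assuming the CLI is installed and configured (aws configure) and that C:\data and my-bucket are placeholders for your own folder and bucket:
# copy a local folder of roughly 50GB up to S3; re-running the same command
# only copies files that are not yet present in the bucket
aws s3 sync C:\data s3://my-bucket/data/
The CLI automatically splits large files into multipart uploads and transfers several files in parallel, so if the transfer is interrupted you can simply run the sync again.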

Related

How to automatically sync an S3 bucket to a local folder using Windows Server

I'm trying to keep a replica of my S3 bucket in a local folder. It should be updated when a change occurs in the bucket.
You can use the AWS CLI aws s3 sync command to copy ('synchronize') files from an Amazon S3 bucket to a local drive.
To have it update frequently, you could schedule it as a Windows Scheduled Task. Please note that it will make frequent calls to AWS, which will incur API charges ($0.005 per 1,000 requests).
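A minimal sketch, assuming the AWS CLI is installed on the Windows server and that my-bucket and C:\s3-replica are placeholders for your own bucket and folder:
rem one-way copy from the bucket to the local folder; --delete also removes
rem local files that were deleted from the bucket, keeping a true replica
aws s3 sync s3://my-bucket C:\s3-replica --delete
rem run the same command every 15 minutes via Task Scheduler
schtasks /Create /TN "S3Sync" /TR "aws s3 sync s3://my-bucket C:\s3-replica --delete" /SC MINUTE /MO 15
Each run lists the bucket to work out what changed, so shorter intervals mean more of those $0.005-per-1,000 requests.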
Alternatively, you could use utilities that 'mount' an Amazon S3 bucket as a drive (Tntdrive, Cloudberry, Mountain Duck, etc). I'm not sure how they detect changes -- they possibly create a 'virtual drive' where the data is not actually downloaded until it is accessed.
You can use rclone and WinFsp to mount S3 as a drive, though this might not be a 'mount' in the traditional sense.
You will need to set up a scheduled task for a continuous sync.
Example: https://blog.spikeseed.cloud/mount-s3-as-a-disk/
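A rough sketch, assuming you have already run rclone config to create a remote named s3remote with your AWS credentials (the remote, bucket, drive letter, and folder are placeholders):
# expose the bucket as drive X: through WinFsp; writes are cached locally first
rclone mount s3remote:my-bucket X: --vfs-cache-mode writes
# or, for the scheduled-task approach, mirror the bucket into a local folder
rclone sync s3remote:my-bucket C:\s3-replica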

File transfer from BaseSpace to AWS S3

I use the Illumina BaseSpace service for high-throughput sequencing secondary analyses. This service uses AWS servers, and therefore all files are stored on S3.
I would like to transfer the files (results of analyses) from BaseSpace to my own AWS S3 account. I would like to know the best strategy to make things go quickly, knowing that in the end it comes down to copying files from an S3 bucket belonging to Illumina to an S3 bucket belonging to me.
The solutions I'm thinking of:
use the BaseSpace CLI tool to copy the files to our on-premises servers, then transfer them back to AWS
use the same tool from an EC2 instance
use the Illumina API to get a pre-signed download URL (but then how can I use this URL to download the file directly into my S3 bucket? One possible approach is sketched below)
If I use an EC2 instance, what kind of instance do you recommend so that it has enough resources without being oversized (and therefore spending money for nothing)?
Thanks in advance,
Quentin
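A minimal sketch of the pre-signed URL option, assuming it runs on an EC2 instance in the same region as the destination bucket; the URL, bucket, and key below are placeholders:
# stream the pre-signed download straight into S3 without landing it on disk
curl -s "https://example-presigned-url" | aws s3 cp - s3://my-destination-bucket/results/sample1.bam
# for very large files, add --expected-size <bytes> so the CLI picks suitable multipart part sizes
Because the data is streamed rather than stored locally, the instance mainly needs network bandwidth rather than large disks or much CPU.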

Download files from an AWS S3 bucket in parallel

I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
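A minimal sketch of both ideas, assuming the bucket is laid out by prefix; my-bucket and the prefix names are placeholders:
# let the CLI run more transfers at once (see the AWS CLI S3 Configuration page above)
aws configure set default.s3.max_concurrent_requests 50
aws configure set default.s3.max_queue_size 10000
# sync one prefix per background job, then wait for all of them to finish
aws s3 sync s3://my-bucket/2022/ ./2022/ &
aws s3 sync s3://my-bucket/2023/ ./2023/ &
wait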
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that can be pre-loaded with content from S3 and shipped to your location. You then connect the device to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3.)

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-premises to AWS S3. The data size is around 1 TB. I was looking at AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that it needs to be encrypted and the total size is 1 TB), then I would suggest you stick to something plain and simple. S3 supports an object size of up to 5TB, so you won't run into trouble. I don't know if your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted, you should be fine; if you're worried, you can encrypt your files beforehand and they will then be stored encrypted (useful if it's a backup of something). To get to the point: you can use API tools for the transfer, or file-explorer-type tools that also have S3 connectivity (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage/transfer depends on how frequently you need the data; if it is just a backup or kept just in case, archiving to Glacier is much cheaper.
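A short sketch of the encryption side, with my-bucket and the file names as placeholders: S3 can encrypt the object at rest for you, or you can encrypt the file yourself before upload.
# ask S3 to apply server-side encryption (AES-256) when storing the object
aws s3 cp backup.zip s3://my-bucket/backup.zip --sse AES256
# or encrypt client-side first, so the object is already encrypted before it leaves your network
gpg --symmetric --cipher-algo AES256 backup.zip
aws s3 cp backup.zip.gpg s3://my-bucket/backup.zip.gpg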
1 TB is large, but it's not so large that it will take you weeks to get your data onto S3. However, if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways:
Using the AWS CLI, which can copy files from local storage to S3
AWS Transfer Family using FTP or SFTP (AWS SFTP), sketched below
Tools like the CloudBerry clients, which have a UI
AWS DataSync
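A hedged sketch of the SFTP route, assuming an AWS Transfer Family SFTP server is already set up; the endpoint, user name, and key file below are placeholders for whatever your server is configured with:
# open an SFTP session to the Transfer Family endpoint backed by your bucket
sftp -i transfer-key.pem sftp-user@s-1234567890abcdef0.server.transfer.us-east-1.amazonaws.com
# then, at the sftp prompt, upload the directory recursively:
#   put -r MyDir
Note that Transfer Family charges an hourly fee for the endpoint plus a per-GB fee, so for a one-time 1 TB copy the plain aws s3 sync shown above is usually the cheaper option.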

How to unzip a 10GB file on AWS S3 Browser

Please suggest the best process using the AWS CLI, or any alternative such as downloading to local storage using S3 Browser, extracting, and re-uploading. (After extracting locally, it is a 60GB file.)
Amazon S3 is purely a storage service. There is no in-built capability to process data (eg to unzip a file).
You would need to download the file, unzip it, and upload the result back to S3. This would best be done via an Amazon EC2 instance in the same region. (AWS Lambda only has 500MB temporary storage space, so this is not an option for a 60GB file.)
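A minimal sketch of that round trip, run on an EC2 instance in the same region as the bucket (bucket and key names are placeholders):
# pull the archive down, extract it, and push the results back to S3
aws s3 cp s3://my-bucket/archive.zip .
unzip archive.zip -d extracted/
aws s3 cp extracted/ s3://my-bucket/extracted/ --recursive
The instance's disk needs to hold both the 10GB zip and the 60GB extracted output, so attach an EBS volume of at least roughly 80GB.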
You could look at setting a combination of mime-type and content-disposition metadata on files stored as .gz, and then let a web browser deal with the decompression once it receives the compressed content.
But I'm not totally sure what you are after.
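A hedged sketch of that idea (note that browsers key off the Content-Encoding header, together with Content-Type, rather than Content-Disposition for transparent decompression; bucket and file names are placeholders):
# store the object gzip-compressed but label it so a browser decompresses it on download
gzip -k data.csv
aws s3 cp data.csv.gz s3://my-bucket/data.csv --content-encoding gzip --content-type text/csv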