Can we use Amazon's Signature Version 4 POST upload (https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-post-example.html)? I don't think it's useful for my purpose, though.
What I want is this: there are some files on third-party websites, and I want those files uploaded to an Amazon S3 bucket without downloading them to my local computer first. The current scenario is:
The third-party website provides a download link for the file -> download the file to my computer -> upload to Amazon S3
Can we eliminate the middle step so it becomes:
The third-party website provides a download link for the file -> upload to Amazon S3
You can't avoid the "download" part unless that other website is willing to do the upload for you.
But you can eliminate your local network connection from the equation and do the download/upload using an EC2 instance in the same region as your bucket.
$ wget https://example.com/example.txt
$ aws s3 cp example.txt s3://mybucket
Your EC2 instance should have an IAM role attached that allows it to interact with S3.
You can do the same thing with Lambda, but you'll be limited by the size of the Lambda runtime's filesystem, unless you stream the content instead of saving it to disk.
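If the downloaded file is streamed straight through to S3, nothing needs to be written to the Lambda (or EC2) filesystem at all. A minimal sketch in Python, assuming the standard boto3 credential chain; the URL, bucket and key are placeholders:

import urllib.request

import boto3

def stream_url_to_s3(url, bucket, key):
    """Stream a remote file into S3 without writing it to local disk."""
    s3 = boto3.client("s3")
    # urlopen returns a file-like object; upload_fileobj reads it in chunks
    # and switches to multipart upload automatically for large bodies.
    with urllib.request.urlopen(url) as body:
        s3.upload_fileobj(body, bucket, key)

# Placeholder values; replace with your own URL, bucket and key.
stream_url_to_s3("https://example.com/example.txt", "mybucket", "example.txt")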
The third-party website provides a download link for the file -> upload to Amazon S3
If the third party doesn't push the content itself, you will need an actor/service/logic which downloads and uploads the data.
The "logic" means some compute resource: EC2, ECS, Lambda, Batch, etc. It's the same download/upload process; the traffic just doesn't need to go through your computer. Every option has its pros and cons (e.g. Lambda may be the cheapest for occasional tasks, but it has its limits).
You did not specify what initiates the upload (a regular scan? an event? on-demand?); that may affect your options too.
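For the event/on-demand case, a rough Lambda sketch might look like the following. The event shape (a list of URLs) and the TARGET_BUCKET environment variable are assumptions for illustration; the function's execution role needs s3:PutObject on the target bucket:

import os
import urllib.request

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("TARGET_BUCKET", "mybucket")  # assumed configuration

def handler(event, context):
    # Hypothetical payload: {"urls": ["https://example.com/a.pdf", ...]}
    urls = event.get("urls", [])
    for url in urls:
        key = url.rsplit("/", 1)[-1] or "unnamed"
        # Stream each file straight from the source site into S3;
        # nothing is written to the Lambda filesystem.
        with urllib.request.urlopen(url) as body:
            s3.upload_fileobj(body, BUCKET, key)
    return {"uploaded": len(urls)}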
Related
I'm trying to have a replica of my S3 bucket in a local folder. It should be updated when a change occurs in the bucket.
You can use the AWS CLI aws s3 sync command to copy ('synchronize') files from an Amazon S3 bucket to a local drive.
To have it update frequently, you could schedule it as a Windows Scheduled Task. Please note that it will be making frequent calls to AWS, which will incur API request charges ($0.005 per 1,000 requests).
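If you'd rather not shell out to the CLI, the same one-way 'pull' sync can be sketched with boto3. The bucket name and local folder below are placeholders; an object is re-downloaded only when the local copy is missing or older than the S3 version:

from pathlib import Path

import boto3

def pull_bucket(bucket, local_root):
    """Download objects that are missing locally or newer in S3."""
    s3 = boto3.client("s3")
    root = Path(local_root)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):  # skip folder placeholder objects
                continue
            target = root / obj["Key"]
            # Skip files we already have that are not older than the S3 copy.
            if target.exists() and target.stat().st_mtime >= obj["LastModified"].timestamp():
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))

pull_bucket("mybucket", r"C:\s3-replica")  # placeholder bucket and folder

Each run still lists the whole bucket, so the API charges mentioned above apply to this approach as well.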
Alternatively, you could use utilities that 'mount' an Amazon S3 bucket as a drive (Tntdrive, Cloudberry, Mountain Duck, etc). I'm not sure how they detect changes -- they possibly create a 'virtual drive' where the data is not actually downloaded until it is accessed.
You can use rclone and WinFsp to mount S3 as a drive.
Though this might not be a 'mount' in the traditional sense.
You will need to set up a task scheduler for continuous sync.
Example: https://blog.spikeseed.cloud/mount-s3-as-a-disk/
I use the Illumina BaseSpace service to do high-throughput sequencing secondary analyses. This service uses AWS servers, so all files are stored on S3.
I would like to transfer the files (the results of the analyses) from BaseSpace to my own AWS S3 account. I would like to know the best strategy to make things go quickly, knowing that in the end it comes down to copying files from an S3 bucket belonging to Illumina to an S3 bucket belonging to me.
The solutions I'm thinking of:
use the BaseSpace CLI tool to copy the files to our on-premises servers, then transfer them to AWS
use this tool from an EC2 instance.
use the Illumina API to get a pre-signed download URL (but then how can I use this URL to download the file directly into my S3 bucket?).
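For the third option, what I have in mind (a very rough sketch with placeholder names, run from an EC2 instance in the same region as my bucket) is something like this:

import urllib.request

import boto3

def presigned_url_to_my_bucket(presigned_url, bucket, key):
    """Stream a pre-signed download URL straight into my own bucket."""
    s3 = boto3.client("s3")
    # The pre-signed URL already carries the authorisation to read the file,
    # so a plain GET is enough; upload_fileobj streams the body without
    # staging it on local disk.
    with urllib.request.urlopen(presigned_url) as body:
        s3.upload_fileobj(body, bucket, key)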
If I use an EC2 instance, what kind of instance do you recommend so that I have enough resources without having too much (and therefore spending money for nothing)?
Thanks in advance,
Quentin
I have a link in a request which points to some PDF/image content. My requirement is to upload the content at the link to the S3 server.
Do I have to download it and then upload the file? I have too many calls and limited file storage on the machine. Or is there any other way to achieve this?
You must upload the file to Amazon S3.
It is not possible to tell Amazon S3 to retrieve a file from a URL.
My requirement is to upload the content at the link to the S3 server.
You need some compute resource; S3 itself won't do that.
Do I have to download it and then upload the file
Or is there any other way to achieve this.
The compute resource (the "logic") doesn't need to reside on your computer. You may use some AWS compute resource near S3, such as Lambda, EC2, or ECS. You may decide based on the predicted load or other requirements.
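As a rough illustration (placeholder names, assuming the link is a plain HTTPS URL and the compute resource has S3 write permissions), the content can be streamed through without being stored on the machine, keeping the PDF/image content type along the way:

import urllib.request

import boto3

def copy_link_to_s3(url, bucket, key):
    """Stream the linked content into S3, preserving its Content-Type."""
    s3 = boto3.client("s3")
    with urllib.request.urlopen(url) as resp:
        content_type = resp.headers.get("Content-Type", "application/octet-stream")
        # The response body is read in chunks; no local file is created,
        # which sidesteps the limited storage on the machine.
        s3.upload_fileobj(resp, bucket, key, ExtraArgs={"ContentType": content_type})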
What is the best option to get data from a directory on an SFTP server and copy it to an S3 bucket on AWS? I only have read permission on the SFTP server, so rsync isn't an option.
My idea is to create a Glue job in Python that downloads this data and copies it to the S3 bucket. The files differ in size: one is about 600 MB, others are 4 GB.
Assuming you are talking about an SFTP server that is not on AWS, you have a few different options that may be easier than what you have proposed (although your solution could work):
Download the AWS CLI onto the SFTP server and copy the files via the aws s3 cp command.
Write a script using the AWS SDK that takes the files and copies them. You may need to use multipart upload given the size of your files (see the sketch below).
You can create an AWS managed SFTP server (AWS Transfer Family) that links directly to your S3 bucket as the backend storage for that server, then use SFTP commands to copy the files over.
Be mindful that you will need the appropriate permissions in your AWS account to complete any of these 3 (or 4) solutions.
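For the second option, a rough Python sketch (assuming the paramiko library for SFTP; host, credentials, paths and bucket are placeholders) that streams each file from the SFTP server into S3 and lets boto3 switch to multipart upload for the 4 GB files:

import boto3
import paramiko
from boto3.s3.transfer import TransferConfig

def sftp_to_s3(host, user, password, remote_path, bucket, key):
    """Stream one file from an SFTP server into S3 (multipart for large files)."""
    s3 = boto3.client("s3")
    # Files above ~64 MB are uploaded as parallel multipart chunks.
    config = TransferConfig(multipart_threshold=64 * 1024 * 1024)

    transport = paramiko.Transport((host, 22))
    transport.connect(username=user, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    remote_file = sftp.open(remote_path, "rb")
    try:
        # sftp.open() returns a file-like object, so nothing is staged locally.
        s3.upload_fileobj(remote_file, bucket, key, Config=config)
    finally:
        remote_file.close()
        sftp.close()
        transport.close()

# Placeholder connection details and object names.
sftp_to_s3("sftp.example.com", "user", "password",
           "/data/results/big_file.bin", "mybucket", "results/big_file.bin")

Read-only access on the SFTP side is enough for this, since the script only ever reads the remote files.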
I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way/command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
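If you'd rather control the parallelism yourself, here is a minimal Python sketch (bucket, prefix and destination folder are placeholders) that lists one prefix and issues the GetObject calls from a thread pool:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

def download_prefix(bucket, prefix, dest, workers=32):
    """Download every object under a prefix using parallel GetObject requests."""
    s3 = boto3.client("s3")  # boto3 clients are safe to share across threads
    root = Path(dest)
    paginator = s3.get_paginator("list_objects_v2")

    def fetch(key):
        target = root / key
        target.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(bucket, key, str(target))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                if not obj["Key"].endswith("/"):
                    pool.submit(fetch, obj["Key"])

download_prefix("mybucket", "some/prefix/", "./downloads")  # placeholder values

Running one invocation per prefix keeps each listing short, in line with the advice above.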
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have it shipped to your location. You then connect it to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3).