Get files from SFTP to S3 (AWS)

What is the best way to get data from a directory on an SFTP server and copy it into an S3 bucket on AWS? I only have read permission on the SFTP server, so rsync isn't an option.
My idea is to create an AWS Glue job in Python that downloads this data and copies it into the S3 bucket. The files vary in size: one is about 600 MB, others are 4 GB.

Assuming you are talking about an SFTP server that is not on AWS, you have a few different options that may be easier than what you have proposed (although your solution could work):
1. Install the AWS CLI on the SFTP server and copy the files with the aws s3 cp command.
2. Write a script using an AWS SDK that reads the files and copies them to S3; given the size of your files, you may need multipart upload (see the sketch below).
3. Create an AWS-managed SFTP server (AWS Transfer Family) that uses your S3 bucket as its backend storage, then use sftp commands to copy the files over.
Be mindful that you will need the appropriate permissions in your AWS account to complete any of these three options (or your original Glue-based approach).
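If you do go the Glue/Python route, here is a minimal sketch of streaming each file from SFTP into S3 without writing it to local disk. It assumes the paramiko and boto3 packages are available in the job environment; the host, credentials, directory, and bucket name are placeholders.

```python
# Sketch only: stream files from a read-only SFTP directory into S3.
# Host, credentials, paths, and bucket name are placeholders.
import boto3
import paramiko

SFTP_HOST = "sftp.example.com"
SFTP_USER = "user"
SFTP_PASSWORD = "password"
REMOTE_DIR = "/data"
BUCKET = "my-bucket"

s3 = boto3.client("s3")

transport = paramiko.Transport((SFTP_HOST, 22))
transport.connect(username=SFTP_USER, password=SFTP_PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

try:
    for filename in sftp.listdir(REMOTE_DIR):
        # Open the remote file and stream it straight to S3;
        # upload_fileobj performs a multipart upload under the hood.
        with sftp.open(f"{REMOTE_DIR}/{filename}", "rb") as remote_file:
            s3.upload_fileobj(remote_file, BUCKET, f"sftp-import/{filename}")
finally:
    sftp.close()
    transport.close()
```

Because upload_fileobj uploads in chunks, the 600 MB and 4 GB files are never held fully in memory or written to disk.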

Related

How to move files from on-prem NAS to AWS S3 when file arrives on NAS

I have a requirement to move files from on-prem NAS storage to AWS S3.
Files keep arriving on the NAS storage; when a file arrives, we have a notification set up in AWS, and then we need to pull the file from the NAS into S3.
Can I access the NAS storage and pull files from it into S3?
Does it require any additional configuration, or can a simple EC2 instance or Lambda function work, depending on the size of the file?
How about NAS --> SFTP --> S3 using the AWS Transfer Family?
Is there any better way to move files from NAS to S3?
We want to avoid writing code as much as we can.
You should take a look at AWS DataSync.
It is an AWS data transfer service that lets you copy data to and from AWS storage services over the internet or over AWS Direct Connect (NFS and SMB protocols).
You don't need EC2 or AWS Lambda. You have to install an agent that reads from a source location and syncs your data to S3. The agent is deployed on-premises. Please find the supported hypervisors here: https://docs.aws.amazon.com/datasync/latest/userguide/agent-requirements.html and the deployment guide here: https://docs.aws.amazon.com/datasync/latest/userguide/deploy-agents.html

How to unzip (or rather untar) in-place on Amazon S3?

I have lots of .tar.gz files (millions) stored in a bucket on Amazon S3.
I'd like to untar them and create the corresponding folders on Amazon S3 (in the same bucket or another).
Is it possible to do this without me having to download/process them locally?
It's not possible with only S3. You'll need something like EC2, ECS, or Lambda, preferably running in the same region as the S3 bucket, to download the .tar.gz files, extract them, and upload every extracted file back to S3.
With AWS Lambda you can do this! You won't have extra data transfer costs, because the traffic stays within the AWS network.
You can follow this blog post, but extract instead of compressing (a rough sketch follows the link):
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
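Here is a rough Python sketch of that approach for a single archive; the bucket, key, and destination prefix are placeholders, and very large archives may need more Lambda memory or ephemeral storage:

```python
# Sketch only: stream a .tar.gz object from S3, extract it, and upload each
# member back to S3 under a prefix. Bucket, key, and prefix are placeholders.
import tarfile

import boto3

s3 = boto3.client("s3")

def untar_object(bucket, key, dest_prefix):
    # Read the archive as a gzip-compressed stream ("r|gz") directly from S3
    obj = s3.get_object(Bucket=bucket, Key=key)
    with tarfile.open(fileobj=obj["Body"], mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            extracted = tar.extractfile(member)
            # Upload each extracted file without writing anything to local disk
            s3.upload_fileobj(extracted, bucket, f"{dest_prefix}/{member.name}")

# Example with placeholder names:
# untar_object("my-bucket", "archives/data.tar.gz", "extracted/data")
```

For millions of archives you would trigger this per object (for example from an S3 event notification) rather than looping over the whole bucket in one invocation.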

Upload data to s3 with curl

I am recording video data with a HELO device. With a curl command, I can upload data to my local computer:
curl -v0 --output example-filename-downloaded.mov http://192.168.0.2/media0/example-filename-source.mov
Where 192.168.0.2 is replaced by the IP address the device is connected with. Now, I want to download this data not to my own PC, but to a cloud environment (AWS S3). Normally when I upload data to S3, I use the aws s3 cp filename s3://bucketname/directory command. However, I want to set something up so that the file does not have to be stored on the PC, but is uploaded to S3 immediately, as if the curl command had the S3 destination in it.
Any thoughts on how to do this?
There are two ways to upload data to Amazon S3:
Use an AWS SDK that communicates directly with the S3 API, or
Uploading a File to Amazon S3 Using HTTP POST
You can apparently use curl to upload POST data with files (see "Use cURL to upload POST data with files" on Stack Overflow). I haven't tried it with S3, but it might work!
You would need to include authentication information in the headers, since you don't want to let just anyone upload to the S3 bucket (otherwise bots will upload random movies and music, costing you money). This will make the curl command more complex, since you need to provide the required information to make S3 happy.
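For completeness, a rough sketch of the HTTP POST route: generate a presigned POST with boto3 on a machine that already has AWS credentials, then let the device upload with plain curl. The bucket name, key, and expiry below are placeholders.

```python
# Sketch only: create a presigned POST so a device that can run curl (but not
# the AWS CLI) can upload directly to S3. Bucket, key, and expiry are placeholders.
import boto3

s3 = boto3.client("s3")

post = s3.generate_presigned_post(
    Bucket="my-bucket",
    Key="uploads/example-filename-source.mov",
    ExpiresIn=3600,  # the signed POST is valid for one hour
)

# Print a curl command that sends the signed fields followed by the file itself
fields = " ".join(f"-F '{k}={v}'" for k, v in post["fields"].items())
print(f"curl {fields} -F 'file=@example-filename-source.mov' {post['url']}")
```

The signed fields carry the authentication, so the device itself never needs AWS credentials, and the upload permission expires after the chosen time.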
A better method
Frankly, I suggest you use the AWS Command-Line Interface (CLI) since it will be much simpler. This, of course, requires the ability to install the AWS CLI on the system. If this is not the case for your HELO device, then you'll need to use the above method (good luck!).
If your concern is that "the file does not have to be stored on the PC", then please note that your curl method will still involve downloading the file to your computer and then uploading it to S3. Yes, the file is not saved to disk, but it still needs to pass through your computer.
If you can install the AWS CLI, then the equivalent method using the AWS CLI would be:
curl http://example.com/some.file.on.the.internet.txt | aws s3 cp - s3://my-bucket/filename.txt
This will retrieve a file from a given URL, then pass it via standard input (using - as the source) to the Amazon S3 bucket. The file will 'pass through' your computer, but will not be saved to disk. This is equivalent to the curl method you were wanting to do, but makes the copy to S3 much easier.
This can be useful if you want to retrieve a file from somewhere on the Internet and then upload it to S3. If your HELO device makes it possible to 'pull' a file, this could be done from your own PC.

download files from AWS S3 bucket in parallel

I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
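If you would rather script the parallelism yourself than rely on the CLI, here is a minimal Python sketch using boto3 and a thread pool; the bucket name, prefix, destination directory, and worker count are placeholders to tune for your bandwidth.

```python
# Sketch only: list objects under a prefix and download them in parallel.
# Bucket, prefix, destination directory, and max_workers are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-bucket"
PREFIX = "some/prefix/"
DEST = "downloads"

s3 = boto3.client("s3")

def download(key):
    if key.endswith("/"):  # skip folder placeholder objects
        return
    local_path = os.path.join(DEST, key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)

# Paginate through the listing so millions of keys can be handled
paginator = s3.get_paginator("list_objects_v2")
keys = (
    obj["Key"]
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
)

with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(download, keys)
```

As noted above, listing millions of objects is itself slow, so running one of these per prefix is usually more practical than one pass over the whole bucket.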
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have shipped to your location. You then connect the device to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3.)

How to speed up copying files between two accounts

I need to copy some buckets from one account to another. I have all the permissions, so I started transferring the data via the CLI (the cp command). I am operating on a c4.large. The problem is that there is quite a lot of data (9 TB) and it goes really slowly: in 20 minutes I transferred about 20 GB...
I checked the internet speed: the download is 3000 Mbit/s and the upload is 500 Mbit/s. How can I speed it up?
The AWS Command-Line Interface (CLI) aws s3 cp command simply sends the copy request to Amazon S3. The data is transferred between the Amazon S3 buckets without downloading to your computer. Therefore, the size and bandwidth of the computer issuing the command is not related to the speed of data transfer.
It is likely that the aws s3 cp command is only copying a small number of files simultaneously. You could increase the speed by setting the max_concurrent_requests parameter to a higher value:
aws configure set default.s3.max_concurrent_requests 20
See:
AWS CLI S3 Configuration — AWS CLI Command Reference
Getting the Most Out of the Amazon S3 CLI | AWS Partner Network (APN) Blog
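If you script the copy with boto3 instead of the CLI, the closest equivalent of that setting is max_concurrency on a TransferConfig. A minimal sketch for a single object (bucket and key names are placeholders); the copy stays server-side, so nothing is downloaded to your machine:

```python
# Sketch only: server-side copy of one object between buckets with boto3,
# using higher concurrency for the multipart copy. Names are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(max_concurrency=20)  # parallel parts per object copy

copy_source = {"Bucket": "source-bucket", "Key": "path/to/object"}
s3.copy(copy_source, "destination-bucket", "path/to/object", Config=config)
```

Note that max_concurrency parallelises the parts of a single large object; to copy many objects at once you would still run several copies in parallel, much like the CLI setting above.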