Recursively move files from SFTP to S3 preserving structure

I'm trying to recursively move files from an SFTP server to S3, possibly using boto3. I want to preserve the folder/file structure as well. I was looking to do it this way:
import pysftp
private_key = "/mnt/results/sftpkey"
srv = pysftp.Connection(host="server.com", username="user1", private_key=private_key)
srv.get_r("/mnt/folder", "./output_folder")
Then take those files and upload them to S3 using boto3. However, the server has many deeply nested folders and large files, so my machine ends up running out of memory and disk space. I was thinking of a script that downloads a single file, uploads it, deletes it, and repeats.
I know this would take a long time to finish, but I could run it as a job without running out of space and without keeping my machine open the entire time. Has anyone done something similar? Any help is appreciated!

If you can't (or don't want to) download all of the files at once before sending them to S3, then you need to download them one at a time.
That means you'll need to build a list of files to download, then work through it, transferring each file to your local machine and then sending it to S3.
A very simple version of this would look something like this:
import pysftp
import stat
import boto3
import os
import json

# S3 bucket and prefix to upload to
target_bucket = "example-bucket"
target_prefix = ""
# Root FTP folder to sync
base_path = "./"
# Both base_path and target_prefix should end in a "/"
# Or, for the prefix, be empty for the root of the bucket

srv = pysftp.Connection(
    host="server.com",
    username="user1",
    private_key="/mnt/results/sftpkey",
)

if os.path.isfile("all_files.json"):
    # No need to cache files more than once. This lets us restart
    # on a failure, though really we should be caching files in
    # something more robust than just a json file
    with open("all_files.json") as f:
        all_files = json.load(f)
else:
    # No local cache, go ahead and get the files
    print("Need to get list of files...")
    todo = [(base_path, target_prefix)]
    all_files = []
    while len(todo):
        cur_dir, cur_prefix = todo.pop(0)
        print("Listing " + cur_dir)
        for cur in srv.listdir_attr(cur_dir):
            if stat.S_ISDIR(cur.st_mode):
                # A directory, so walk into it
                todo.append((cur_dir + cur.filename + "/", cur_prefix + cur.filename + "/"))
            else:
                # A file, just add it to our cache
                all_files.append([cur_dir + cur.filename, cur_prefix + cur.filename])
    # Save the cache out to disk
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

# And now, for every file in the cache, download it
# and turn around and upload it to S3
s3 = boto3.client('s3')
while len(all_files):
    ftp_file, s3_name = all_files.pop(0)
    print("Downloading " + ftp_file)
    srv.get(ftp_file, "_temp_")
    print("Uploading " + s3_name)
    s3.upload_file("_temp_", target_bucket, s3_name)
    # Clean up, and update the cache with one less file
    os.unlink("_temp_")
    with open("all_files.json", "w") as f:
        json.dump(all_files, f)

srv.close()
Error checking and speed improvements are obviously possible.

You have to do it file-by-file.
Start with the recursive download code here:
Python pysftp get_r from Linux works fine on Linux but not on Windows
After each sftp.get, upload the file to S3 and remove the local copy.
Actually, you can even copy a file from SFTP to S3 without storing it locally:
Transfer file from SFTP to S3 using Paramiko
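For a single file, a minimal sketch of that streaming idea using pysftp (as in the question) and boto3; the host, key path, remote path, bucket, and key below are placeholders, and this is untested:
import boto3
import pysftp

s3 = boto3.client("s3")
with pysftp.Connection(host="server.com", username="user1",
                       private_key="/mnt/results/sftpkey") as srv:
    # srv.open() returns a paramiko SFTPFile, which is file-like enough
    # for upload_fileobj to stream from without writing a local copy.
    with srv.open("/mnt/folder/report.csv", "rb") as remote_file:
        s3.upload_fileobj(remote_file, "example-bucket", "folder/report.csv")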

Related

How can I retrieve a folder from S3 into an AWS SageMaker notebook

I have a folder with several files corresponding to checkpoints of an RL model trained using RLLIB. I want to analyze the checkpoints, and the analysis requires passing a certain folder as an argument, e.g., analysis_function(folder_path). I have to run this on a SageMaker notebook. I have seen that there are some questions on SO about how to retrieve files from S3, such as this one. However, how can I retrieve a whole folder?
To read the whole folder, you will just have to list all files in the folder and loop through them. You could either do something like:
import boto3
s3_res = boto3.resource("s3")
my_bucket = s3_res.Bucket("<your-bucket-name>")
for obj in my_bucket.objects.filter(Prefix="<your-prefix>"):
    # your code goes here
Or simply download the files to your local storage and loop over them as you see fit (copy reference):
!aws s3 cp s3://bucket/prefix/ . --recursive
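If you do need the checkpoints on the notebook's local disk so you can pass a folder path to analysis_function, a rough sketch along those lines (the bucket name, prefix, local directory, and analysis_function are placeholders for your own values and code):
import os
import boto3

s3_res = boto3.resource("s3")
bucket = s3_res.Bucket("<your-bucket-name>")
prefix = "<your-prefix>/"
local_dir = "/tmp/checkpoints"

for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith("/"):  # skip zero-byte "folder" placeholder keys
        continue
    target = os.path.join(local_dir, os.path.relpath(obj.key, prefix))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    bucket.download_file(obj.key, target)

analysis_function(local_dir)  # your existing analysis code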

How to download large csv files from S3 without running into 'out of memory' issue?

I need to process large files stored in an S3 bucket. I need to divide the csv file into smaller chunks for processing. However, this seems to be a task better done on file-system storage rather than on object storage.
Hence, I am planning to download the large file locally, divide it into smaller chunks, and then upload the resulting files together to a different folder.
I am aware of the method download_fileobj but could not determine whether it would result in out-of-memory errors while downloading large files of around 10 GB.
I would recommend using download_file():
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
It will not run out of memory while downloading. Boto3 will take care of the transfer process.
You can use the AWS CLI for this. Stream the output as follows:
aws s3 cp s3://<bucket>/file.txt -
The above command will stream the file contents in the terminal. Then you can use split and/or tee commands to create file chunks.
Example: aws s3 cp s3://<bucket>/file.txt - | split -d -b 100000 -
More details in this answer: https://stackoverflow.com/a/7291791/2732674
You can increase the bandwidth usage by making concurrent S3 API transfer calls:
import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')
config = TransferConfig(max_concurrency=150)
s3_client.download_file(
    Bucket=s3_bucket,
    Filename='path',
    Key="key",
    Config=config
)
You can try the boto3 s3.Object API:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object('bucket_name', 'key')
body = obj.get()['Body']  # body is a StreamingBody, a file-like object you can read incrementally
for line in body:
    print(line)
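If the goal is specifically to split a big CSV into smaller files without holding it in memory, a rough sketch using the streaming body (the bucket name, key, and lines-per-chunk value are placeholders; iter_lines() is available on reasonably recent botocore versions):
import boto3

s3 = boto3.resource('s3')
body = s3.Object('mybucket', 'large_file.csv').get()['Body']

lines_per_chunk = 1000000  # tune to taste
chunk_idx, count = 0, 0
out = open('chunk_0.csv', 'wb')
for line in body.iter_lines():
    out.write(line + b'\n')
    count += 1
    if count >= lines_per_chunk:
        out.close()
        chunk_idx += 1
        count = 0
        out = open('chunk_{}.csv'.format(chunk_idx), 'wb')
out.close()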

How to find the origin of data?

So far the files are just being downloaded individually like the following rather than all being in one zipped file:
s3client = boto3.client('s3')
t.download_file('firstbucket', obj['Key'], filename)
Let me save you some trouble by using AWS CLI:
aws s3 cp s3://mybucket/mydir/ . --recursive ; zip myzip.zip *.csv
You can change the wildcard to suit your needs, but this will generally run faster than doing it in Python, since the AWS CLI's transfer handling is optimized well beyond what plain boto gives you.
If you want to use boto, you'll have to do it in a loop like you have and add each item to a zip file.
With the CLI you can use s3 sync and then zip that up:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
aws s3 sync s3://bucket-name ./local-location && zip bucket.zip ./local-location
It looks like you're really close, but you need to pass a file name to ZipFile.write() and download_file does not return a file name. The following should work alright, but I haven't tested it exhaustively.
from tempfile import NamedTemporaryFile
from zipfile import ZipFile
import boto3

def archive_bucket(bucket_name, zip_name):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    with ZipFile(zip_name, 'w') as zf:
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page['Contents']:
                # This might have issues on some systems since the file will
                # be open for writes in two places. You can use other
                # methods of creating a temporary file to work around that.
                with NamedTemporaryFile() as f:
                    s3.download_file(bucket_name, obj['Key'], f.name)
                    # Copies over the temporary file using the key as the
                    # file name in the zip.
                    zf.write(f.name, obj['Key'])
This uses less disk space than the solutions using the CLI, but it still isn't ideal. You will still have two copies of a given file at some point in time: one in the temp file and one that has been zipped up. So you need to make sure that you have enough space on disk for the size of all the files you're downloading plus the size of the largest of those files. If there were a way to open a file-like object that wrote directly into the zip archive, you could get around that.
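In fact, since Python 3.6, ZipFile.open() accepts mode 'w', which returns a writable file-like object for a new archive member, so you can stream each object straight into the zip and skip the temporary file entirely. A rough, untested sketch along those lines (same bucket/zip parameters as the function above):
import shutil
from zipfile import ZipFile, ZIP_DEFLATED
import boto3

def archive_bucket_streaming(bucket_name, zip_name):
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    with ZipFile(zip_name, 'w', ZIP_DEFLATED) as zf:
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get('Contents', []):
                if obj['Key'].endswith('/'):  # skip zero-byte "folder" keys
                    continue
                # Open a writable entry in the zip and copy the S3 object's
                # streaming body into it chunk by chunk.
                with zf.open(obj['Key'], 'w') as entry:
                    body = s3.get_object(Bucket=bucket_name, Key=obj['Key'])['Body']
                    shutil.copyfileobj(body, entry)
This keeps disk usage down to roughly the size of the zip itself.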

Copy an image in Amazon S3 from Image URL using Django

I have an image URL (for example: http://www.myexample.come/testImage.jpg) and I would like to upload this image to Amazon S3 using Django.
I haven't found a way to copy the resource directly to Amazon S3 by passing the file URL.
So I think I have to implement these steps in my project:
Download the file locally from the URL http://www.myexample.come/testImage.jpg. I will have a local file testImage.jpg.
Upload the local file to Amazon S3. I will have an S3 URL.
Delete the local file testImage.jpg.
Is this a good way to build this feature?
Is it possible to improve these steps?
I have to use this feature when I receive a REST request, and the response must include the uploaded S3 file URL. Are these steps reasonable in terms of performance?
The easiest way off the top of my head would be to use requests together with io from the Python standard library. This is a bit of code I used a while back; I just tested it with Python 2.7.9 and it works:
>>> requests_image('http://docs.python-requests.org/en/latest/_static/requests-sidebar.png')
It also works with the latest version of requests (2.6.0), but I should point out that it's just a snippet: I was in full control of the image URLs being handed to the function, so there's no error checking (you could use Pillow to open the image and confirm it's really a JPEG, etc.).
import requests
from io import open as iopen
from urlparse import urlsplit

def requests_image(file_url):
    suffix_list = ['jpg', 'gif', 'png', 'tif', 'svg']
    file_name = urlsplit(file_url)[2].split('/')[-1]
    file_suffix = file_name.split('.')[1]
    i = requests.get(file_url)
    if file_suffix in suffix_list and i.status_code == requests.codes.ok:
        with iopen(file_name, 'wb') as file:
            file.write(i.content)
    else:
        return False
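If the end goal is S3 rather than a local file, you can also stream the response body straight to S3 with boto3 and skip the intermediate file. A rough sketch, not part of the original snippet, assuming requests and boto3 with placeholder bucket and key names:
import boto3
import requests

def url_to_s3(file_url, bucket, key):
    s3 = boto3.client('s3')
    resp = requests.get(file_url, stream=True)
    resp.raise_for_status()
    resp.raw.decode_content = True  # transparently undo gzip/deflate encoding
    # upload_fileobj reads from the (non-seekable) response stream in chunks,
    # so the image never has to be written to local disk.
    s3.upload_fileobj(resp.raw, bucket, key)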

how to write 4K size file on remote server in python

I am trying to write 50 4 KB files on an EC2 instance with S3 mounted on it.
How can I do this in Python?
I am not sure how to proceed with this.
If you have the S3 bucket mounted via FUSE or some other method that presents S3 object space as a pseudo file system, then you write files just like anything else in Python:
with open('/path/to/s3/mount', 'wb') as dafile:
    dafile.write(b'contents')
If you are trying to put objects into S3 from an EC2 instance, then you will want to follow the boto documentation on how to do this.
To start you off:
Create an /etc/boto.cfg or ~/.boto file as the boto howto describes, then:
from boto.s3.connection import S3Connection

conn = S3Connection()
# if you want, you can conn = S3Connection('key_id_here', 'secret_here')
bucket = conn.get_bucket('your_bucket_to_store_files')
for file_name in fifty_file_names:
    bucket.new_key(file_name).set_contents_from_filename('/local/path/to/{}'.format(file_name))
This assumes you are writing fairly small files, like the 50 4 KB files you mentioned. Larger files may need to be split/chunked.
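For reference, the same loop with boto3 (the current AWS SDK for Python): a minimal sketch, assuming the bucket name, local paths, and fifty_file_names are the same placeholders as in the boto example above:
import boto3

s3 = boto3.client('s3')
bucket_name = 'your_bucket_to_store_files'  # placeholder

for file_name in fifty_file_names:  # placeholder list of local file names
    # put_object is fine for small (4 KB) files; s3.upload_file would
    # handle multipart transfers automatically for larger ones.
    with open('/local/path/to/{}'.format(file_name), 'rb') as f:
        s3.put_object(Bucket=bucket_name, Key=file_name, Body=f)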